How to extract sub-strings for a collection of text?


I extracted text from a PDF document and want to extract some particular fields from it using Java.
A portion of the text:
US00RE44697E (i9) United States (12) Reissued Patent (10)
Patent Number: RE44,697 E Jones et al. (45) Date of
ReissuedPatent: Jan. 7, 2014 (54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT (75) Inventors: David E.Jones, Ottawa
(CA); Cormac M.O'Connell, Carp (CA) (73) Assignee: Mosaid
Technologies Incorporated, Ottawa, Ontario (CA) (21)
Appl.No.: 13/603,137 (22) Filed: Sep. 4, 2012 Related U.S.
Patent Documents Reissue of: (64) Patent No.: Issued:
Appl. No.: Filed: 6,088,800 Jul. 11, 2000
09/032,029 Feb. 27, 1998 (51) Int.CI. G06F 21/00
(2013.01) (52) U.S. CI. USPC .............713/189; 713/190;
713/193; 380/28; 380/33; 380/52 (58) Field of Classification
Search None
Now my goal is to extract fields from it and assign them to strings. That is:
the text (10) Patent Number: RE44,697 E should be extracted as String pat_no = " RE44,697 E";
the text (54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT should be extracted as String title = "ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT";
the extremely irregular text block
(64) Patent No.: Issued: Appl. No.: Filed:
6,088,800 Jul. 11, 2000 09/032,029 Feb. 27, 1998
has to be extracted as
String pat_no_org = "6,088,800";
String issued = "jul.11,2000";
String filed = "feb 27 ,1998";
and so on.
What I have tried
First I used String.split, String.substring, String.indexOf and even Apache StringUtils, but none of them helped: because the fields are scattered through the text, these methods did not work. I also tried regular expressions, but I am very weak at them and could not get a working pattern.
Please tell me how to achieve this using Java.

With regex, I would split it into 3 parts:
1.) (10) Patent Number. The regex could look like this:
\(10\)\s*Patent Number:\s*([\w,]+)
As a Java string:
"\\(10\\)\\s*Patent Number:\\s*([\\w,]+)"
The match for the first parenthesized group is available as group 1, e.g. via Matcher.group(1).
\s is shorthand for [ \t\n\x0B\f\r], i.e. any kind of white-space.
\w is shorthand for [A-Za-z0-9_] (word characters); here it is combined with , in a character class.
Some characters, such as ( and ), have special meanings in regex. To match them literally they have to be escaped with a backslash.
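As a minimal sketch of how the first pattern could be applied, the following uses java.util.regex with the pattern exactly as given above. The class and method names (PatentNumberDemo, patentNumber) are made up for illustration, and the sample text is a shortened piece of the question's input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatentNumberDemo {
    // Literal "(10)", optional white-space, the label, then the captured number.
    private static final Pattern PAT_NO =
            Pattern.compile("\\(10\\)\\s*Patent Number:\\s*([\\w,]+)");

    /** Returns the captured patent number, or null when the label is not found. */
    static String patentNumber(String text) {
        Matcher m = PAT_NO.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "(12) Reissued Patent (10)\nPatent Number: RE44,697 E Jones et al.";
        System.out.println(patentNumber(text));
    }
}
```

Because the character class [\w,] does not contain white-space, the capture stops before the trailing " E"; if that suffix is needed, the group would have to be extended.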
2.) (54) ENCRYPT...
A pattern could look like:
(?s)\(54\)\s*(.*?)\s*(?=\(\d|$)
As a Java string:
"(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$)"
(?s) The s modifier equals Pattern.DOTALL, where the dot matches newlines too.
(?=\(\d|$) A lookahead is used so that the lazy (.*?) matches any amount of any characters until either another ( followed by a digit, or the end of the string ($, the end anchor), is seen. Note that the $ must not be followed by an escaped \), otherwise the end-of-string alternative can never match.
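A small sketch of the title extraction, assuming a lookahead of the form (?=\(\d|$), i.e. "stop before the next parenthesized field number or at the end of the input". The names TitleDemo and title are illustrative, and collapsing runs of white-space into single spaces is an extra step added here, not part of the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleDemo {
    // (?s) lets the dot cross line breaks; the lookahead ends the lazy capture
    // before "(digit" or at the end of the input.
    private static final Pattern TITLE =
            Pattern.compile("(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$)");

    /** Returns the title with line breaks collapsed to spaces, or null if absent. */
    static String title(String text) {
        Matcher m = TITLE.matcher(text);
        return m.find() ? m.group(1).replaceAll("\\s+", " ") : null;
    }

    public static void main(String[] args) {
        String text = "(54) ENCRYPTIONPROCESSORWITH SHARED\nMEMORY INTERCONNECT (75) Inventors: David";
        System.out.println(title(text));
    }
}
```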
3.) For the other 3 desired parts I would try to mirror the formatting of the input in the pattern. This requires that all documents are laid out in the same way, including where the line breaks fall. A pattern could look like this:
(?s)\(64\).*?Filed:\s*([\d,]+)\s*(\w+\.\s*\d+,\s*\d+)\s*\n[\d+][^\n]+\n\s*(\w+\.\s*\d+,\s*\d+)
As a Java string:
"(?s)\\(64\\).*?Filed:\\s*([\\d,]+)\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)\\s*\\n[\\d+][^\\n]+\\n\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)"
\n matches a newline.
The matches will be in group 1 (e.g. 6,088,800), group 2 (e.g. Jul. 11, 2000) and group 3 (e.g. Feb. 27, 1998).
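The pattern above depends on where the line breaks fall. As a hedged alternative, here is a sketch that only relies on white-space between the four values, so it should also work when the text is wrapped as in the question. This variant pattern, the class name ReissueDemo and the method reissueFields are assumptions for illustration; it expects abbreviated month names that end with a dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReissueDemo {
    // After "(64)" skip to the last label "Filed:", then read four values:
    // patent number, issue date, application number, filing date.
    private static final Pattern REISSUE = Pattern.compile(
            "(?s)\\(64\\).*?Filed:\\s*([\\d,]+)\\s+(\\w+\\.\\s*\\d+,\\s*\\d+)"
            + "\\s+([\\d/,]+)\\s+(\\w+\\.\\s*\\d+,\\s*\\d+)");

    /** Returns {patentNo, issued, applNo, filed}, or null when the block is not found. */
    static String[] reissueFields(String text) {
        Matcher m = REISSUE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String text = "Reissue of: (64) Patent No.: Issued:\n"
                + "Appl. No.: Filed: 6,088,800 Jul. 11, 2000\n"
                + "09/032,029 Feb. 27, 1998 (51) Int.CI.";
        System.out.println(String.join(" | ", reissueFields(text)));
    }
}
```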
If you are just getting started with regex, this is a lot of information at once :)

Related

Regex based on address in any order

I have a regex based on an address format:
([0-9-]*) ?([\p{L}*,\. '-]*) ?([0-9 ]*) ?([\p{L}*,\. '-]*) ([0-9]{5}) ?([\p{L}*,\. '-]*)
It matches this:
16 Rue du Pont Louis-Philippe 75000 Paris
But I'd like the regex to match this format too:
75000 Paris 16 Rue du Pont Louis-Philippe
Can someone help me, please?

The pattern has a lot of optional parts. You can make the last 2 groups optional, but then you have to change the quantifiers to + (1 or more times) to prevent partial matches (or add ^ to the pattern to assert the start of the string):
([0-9-]+) ([\p{L}*,. '-]+) ([0-9 ]+) ([\p{L}*,. '-]+)(?: ([0-9]{5}) ([\p{L}*,. '-]+))?
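A quick way to check the suggested pattern against both orderings is a full-string match. This is only a sketch; the class name AddressOrderDemo and the method isAddress are made up for illustration.

```java
import java.util.regex.Pattern;

final class AddressOrderDemo {
    // The pattern from the answer, written as a Java string.
    private static final Pattern ADDRESS = Pattern.compile(
            "([0-9-]+) ([\\p{L}*,. '-]+) ([0-9 ]+) ([\\p{L}*,. '-]+)"
            + "(?: ([0-9]{5}) ([\\p{L}*,. '-]+))?");

    /** True when the whole input matches the address pattern. */
    static boolean isAddress(String s) {
        return ADDRESS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAddress("16 Rue du Pont Louis-Philippe 75000 Paris"));
        System.out.println(isAddress("75000 Paris 16 Rue du Pont Louis-Philippe"));
    }
}
```

Note that the pattern does not tell you which numeric group is the street number and which is the postal code; that has to be decided afterwards, for example by checking which group has exactly five digits.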

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms, "name#domain.com" and "Joe Sixpack <name#domain.com>", which can appear in a comma / semicolon delimited string, ignoring white-space padding. The problem is that the names can contain delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
    .replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
    .replace(/<\|>$/, "") // remove trailing delimiter
    .split(/\s*<\|>\s*/) // split on delimiter including surrounding space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
    return emailAddressList
            .replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
            .replaceAll("<\\|>$", "")
            .split("\\s*<\\|>\\s*");
}

Since you are not validating, I assume that the email addresses are valid.
Based on this assumption, I look for an email address followed by ; or , which marks the end of an entry.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)

This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:^|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+)     # email defined by a # with connected chars except ',' ';' and white-space
|                         # OR
(?:^|\s*[,;])(?:\s*)      # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?)                     # name
<([^#,;\s]+#[^#,;\s]+)>   # email enclosed by lt-gt

Using Java's replaceAll and split functions (mimicked in JavaScript below), I would lock onto what you know ends an item (the ".com"), replace the separator characters that follow it with a unique temporary token (a UUID or something like <|>), and then split on that token.
Here is a JavaScript example; Java's replaceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)

Parse a log file using java.regex.Matcher

I am learning Java programming. I have a Cisco log:
String logLine="Jul 15 21:12:41 router_provider_pe2 57: *Jul 15 21:12:26.223: %LDP-5-NBRCHG: LDP Neighbor 10.1.1.34:0 (3) is UP";
I am trying this regular expression:
String logPattern = "([\\w]+\\s[\\d]+\\s[\\d:]+) (\\d+:) ([*\\w]+\\s[\\d]+\\s[\\d:]+:) (\\w.+)";
But it does not match. Could you help me?

Your string:
"Jul 15 21:12:41 router_provider_pe2 57: *Jul 15 21:12:26.223: %LDP-5-NBRCHG: LDP Neighbor 10.1.1.34:0 (3) is UP"
Your pattern:
"([\w]+\s[\d]+\s[\d:]+) (\d+:) ([*\w]+\s[\d]+\s[\d:]+:) (\w.+)"
The part of the pattern in the first set of parentheses matches Jul 15 21:12:41. The pattern expects this to be followed by a space, and then by at least one digit. But the string at this point contains a space and the letter r, which is not a digit. Therefore, there is no match.
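Building on that diagnosis, one possible corrected pattern is sketched below. It is an assumption about what should be captured, not the asker's final solution: it adds a group for the host name, allows the dot in the second time stamp, and lets the message start with any character such as %. The names CiscoLogDemo and parse are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CiscoLogDemo {
    // Groups: 1 = syslog time stamp, 2 = host name, 3 = sequence number,
    //         4 = device time stamp (without the leading '*'), 5 = message.
    private static final Pattern LOG = Pattern.compile(
            "^(\\w+\\s+\\d+\\s+[\\d:]+)\\s+(\\S+)\\s+(\\d+):\\s+"
            + "\\*?(\\w+\\s+\\d+\\s+[\\d:.]+):\\s+(.+)$");

    /** Returns the five captured fields, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = LOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String logLine = "Jul 15 21:12:41 router_provider_pe2 57: *Jul 15 21:12:26.223: "
                + "%LDP-5-NBRCHG: LDP Neighbor 10.1.1.34:0 (3) is UP";
        for (String field : parse(logLine)) {
            System.out.println(field);
        }
    }
}
```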

java pattern matcher syntax, selectively incapable of string recognition

I am using the Java pattern matcher to tease out strings of the form 'XXX'('XXX','XXX'). I want only the text, i.e. XXX.
This is what I'm currently using:
Pattern p = Pattern.compile("'(.*?)'\\('(.*?)','(.*?)'\\)\\.");
It is able to match these:
'prevents'('scurvy','vitamin C').
'contains'('vitamin C','orange').
'contains'('vitamin C','sauerkraut').
'isa'('fruit','orange').
'improves'('health','fruit').
But it fails to recognize these, although they are formatted in the same way:
'take place in'('the grand hall of the hong kong convention', 'the ceremony').
'attend by'('some # guests', 'the grand hall of the hong kong convention').
'seat on'('the central dais', 'principal representatives of both countries').
'be'('mr jiang', 'representing china').
'be'('hrh', 'britain').
'be more than'('# distinguished guests', 'the principal representatives').
'end with'('the playing of the british national anthem', 'hong kong').
'follow at'('the stroke of midnight', 'this').
'take part in'('the ceremony', 'both countries').
'start at about'('# pm', 'the ceremony').
'end about'('# am', 'the ceremony').
'lower'('the british hong kong flag', '# royal hong kong police officers').
'raise'('the sar flag', 'another #').
'leave for'('the royal yacht britannia', 'the #').
'hold by'('the chinese and british governments', 'the handover of hong kong').
'rise over'('this land', 'the regional flag of the hong kong special administrative region of the people \'s republic of china').
'cast eye on'('hong kong', 'the world').
'hold on'('schedule', 'the # governments').
'be festival for'('the chinese nation', 'this').
'go in'('the annals of history', 'july # , #').
'become master of'('this chinese land', 'the hong kong compatriots').
'enter era of'('development', 'hong kong').
'remember'('mr deng xiaoping', 'history').
'be along'('the course', 'it').
'resolve'('the hong kong question', 'we').
What is the cause of this?
Is there a website where I can demo my regex specifically as it is applied to the java pattern matcher? like regexr.com
Or some simple comprehensible documentation would also be good, the results of my google search were highly fragmentary and incoherent.

Because every line in the second set has a space after the comma between the two quoted arguments.
So I suggest you use \s* (matches zero or more white-space characters) or \s? (matches an optional white-space character):
Pattern p = Pattern.compile("'(.*?)'\\('(.*?)',\\s*'(.*?)'\\)\\.");
Example (no space after the comma):
'prevents'('scurvy','vitamin C').
But (a space after the comma):
'take place in'('the grand hall of the hong kong convention', 'the ceremony').
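A short sketch that runs the adjusted pattern against one line of each kind; the class name FactPatternDemo and the method parts are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FactPatternDemo {
    // \\s* after the comma accepts both "','" and "', '".
    private static final Pattern FACT =
            Pattern.compile("'(.*?)'\\('(.*?)',\\s*'(.*?)'\\)\\.");

    /** Returns the three quoted texts, or null when the line does not match. */
    static String[] parts(String line) {
        Matcher m = FACT.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2), m.group(3) } : null;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", parts("'prevents'('scurvy','vitamin C').")));
        System.out.println(String.join(" | ", parts("'be'('hrh', 'britain').")));
    }
}
```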

Extracting packed data using regular expressions

I have data in a database in the format below:
a:19:{s:9:"raceclass";a:5:{i:0;a:1:{i:0;s:7:"250cc B";}i:1;a:1:{i:1;s:6:"OPEN B";}i:2;a:1:{i:2;s:9:"Plus 25 B";}i:3;a:1:{i:3;s:8:"Vet 30 B";}i:4;a:1:{i:4;s:7:"Vintage";}}s:9:"firstname";a:1:{i:0;a:1:{i:0;s:5:"James";}}s:12:"middle_FIELD";a:1:{i:0;a:1:{i:0;s:1:"R";}}s:8:"lastname";a:1:{i:0;a:1:{i:0;s:9:"Slaughter";}}s:5:"email";a:1:{i:0;a:1:{i:0;s:29:"jslaughter#xtrememxseries.com";}}s:8:"address1";a:1:{i:0;a:1:{i:0;s:18:"21 DiMartino Court";}}s:4:"city";a:1:{i:0;a:1:{i:0;s:6:"Walden";}}s:5:"state";a:1:{i:0;a:1:{i:0;s:8:"New York";}}s:3:"zip";a:1:{i:0;a:1:{i:0;s:5:"12586";}}s:7:"country";a:1:{i:0;a:1:{i:0;s:13:"United States";}}s:6:"gender";a:1:{i:0;a:1:{i:0;s:4:"Male";}}s:3:"dob";a:1:{i:0;a:1:{i:0;s:10:"06/04/1974";}}s:5:"phone";a:1:{i:0;a:1:{i:0;s:12:"845-713-4421";}}s:5:"skill";a:1:{i:0;a:1:{i:0;s:12:" AMATEUR (B)";}}s:11:"ridernumber";a:1:{i:0;a:1:{i:0;s:2:"69";}}s:8:"bikemake";a:1:{i:0;a:1:{i:0;s:3:"HON";}}s:8:"enginecc";a:1:{i:0;a:1:{i:0;s:3:"450";}}s:9:"amanumber";a:1:{i:0;a:1:{i:0;s:7:"1094649";}}s:10:"amaexpdate";a:1:{i:0;a:1:{i:0;s:5:"03/12";}}}
How can I write a regular expression to transform the above string into data in the following format?
raceclass - 250cc B, OPEN B, Plus 25 B, Vet30, Vintage
firstname - James
middle_FIELD - R
address1 = 21 DiMartino Court
city - walden
state - New york
zip - 12586
country - United States
gender - Male
dob - 06/04/1974
phone - 845-713-4421
skill - AMATEUR (B)
ridernumber - 69
bikemake - HON
enginecc - 450
amanumber - 1094649
amaexpdate - 03/12

This data isn't well suited to a regular expression, because it is nested. You should use a proper parser with a proper grammar to handle this string. There are several good options for that in Java, such as ANTLR.
Alternatively, if that is not an option, it looks like you only want the text between double quotes. Take a look at the Java class Scanner; you should be able to get something working with that. Walk through the string looking for a ". When you find one, start collecting text into a buffer until you reach the closing ". Then ignore everything until the next opening " or the end of the input.
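The quote-scanning idea can be sketched with a plain character loop instead of Scanner. This is a simplification under a stated assumption: no value contains a double quote itself. The format looks like the output of PHP's serialize(), where each string carries a length prefix (s:5:"James"), so a more robust reader would use that length rather than searching for the closing quote. The names QuotedTokens and quoted are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedTokens {
    /** Collects every run of text found between a pair of double quotes. */
    static List<String> quoted(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = null; // non-null while inside a quoted run
        for (char c : input.toCharArray()) {
            if (c == '"') {
                if (buffer == null) {
                    buffer = new StringBuilder();   // opening quote
                } else {
                    tokens.add(buffer.toString());  // closing quote
                    buffer = null;
                }
            } else if (buffer != null) {
                buffer.append(c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String data = "a:2:{s:9:\"firstname\";a:1:{i:0;a:1:{i:0;s:5:\"James\";}}"
                + "s:4:\"city\";a:1:{i:0;a:1:{i:0;s:6:\"Walden\";}}}";
        System.out.println(quoted(data));
    }
}
```

The result is a flat list of keys and values in document order; grouping values under their keys (for example the five raceclass entries) still needs knowledge of which tokens are field names.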
