Extracting packed data using regular expressions

Extracting packed data using regular expressions - java

I have data in a database in the format below:
a:19:{s:9:"raceclass";a:5:{i:0;a:1:{i:0;s:7:"250cc B";}i:1;a:1:{i:1;s:6:"OPEN B";}i:2;a:1:{i:2;s:9:"Plus 25 B";}i:3;a:1:{i:3;s:8:"Vet 30 B";}i:4;a:1:{i:4;s:7:"Vintage";}}s:9:"firstname";a:1:{i:0;a:1:{i:0;s:5:"James";}}s:12:"middle_FIELD";a:1:{i:0;a:1:{i:0;s:1:"R";}}s:8:"lastname";a:1:{i:0;a:1:{i:0;s:9:"Slaughter";}}s:5:"email";a:1:{i:0;a:1:{i:0;s:29:"jslaughter#xtrememxseries.com";}}s:8:"address1";a:1:{i:0;a:1:{i:0;s:18:"21 DiMartino Court";}}s:4:"city";a:1:{i:0;a:1:{i:0;s:6:"Walden";}}s:5:"state";a:1:{i:0;a:1:{i:0;s:8:"New York";}}s:3:"zip";a:1:{i:0;a:1:{i:0;s:5:"12586";}}s:7:"country";a:1:{i:0;a:1:{i:0;s:13:"United States";}}s:6:"gender";a:1:{i:0;a:1:{i:0;s:4:"Male";}}s:3:"dob";a:1:{i:0;a:1:{i:0;s:10:"06/04/1974";}}s:5:"phone";a:1:{i:0;a:1:{i:0;s:12:"845-713-4421";}}s:5:"skill";a:1:{i:0;a:1:{i:0;s:12:" AMATEUR (B)";}}s:11:"ridernumber";a:1:{i:0;a:1:{i:0;s:2:"69";}}s:8:"bikemake";a:1:{i:0;a:1:{i:0;s:3:"HON";}}s:8:"enginecc";a:1:{i:0;a:1:{i:0;s:3:"450";}}s:9:"amanumber";a:1:{i:0;a:1:{i:0;s:7:"1094649";}}s:10:"amaexpdate";a:1:{i:0;a:1:{i:0;s:5:"03/12";}}}
How can I write a regular expression to manipulate the above string to get data in the following format?:
raceclass - 250cc B, OPEN B, Plus 25 B, Vet30, Vintage
firstname - James
middle_FIELD - R
address1 = 21 DiMartino Court
city - walden
state - New york
zip - 12586
country - United States
gender - Male
dob - 06/04/1974
phone - 845-713-4421
skill - AMATEUR (B)
ridernumber - 69
bikemake - HON
enginecc - 450
amanumber - 1094649
amaexpdate - 03/12

This data isn't suitable for a regular expression. You should use a proper parser with a proper grammar for handling this string. There are several good options for that in Java, such as ANTLR.
Alternatively, if that is not an option it looks like you only want to handle things between "". Take a look at the java class Scanner. You should be able to get something working with that. Just look through the string and look for a ". If found start to gather text into a buffer. Once you have found another " ignore tokens until you have found the next " or the end of the input text.

Related

JSON parser exception is observed for on fly JSON

I created a JSON file on the fly by using some runtime data and stored as string as like below:
JSON:
{
"ticketDetails": "kindle tracking ticket: TICKET0900060
Iimpact statement: impacted due to year 2020 format handling issue,
depending on the Gateway,
user can be asked to
try with another instrument.
Timeline: 00: 00 SAP Internal Declines spiked to 300 +
05: 22 AM flintron reported DECLINED errors since 0: 00 PST.
As per TDO,flitron is not seeing clear metrics impact " }
Note: I just copied the exact json which i'm getting at runtime. It containse \n space and exactly like above.
I can see few JSONObjects like ticketDetails is having huge description and when I tried to parse the above string is leading to parse error.
I tried the below way to eliminate the parse error by using
String removeSpecialCharacterFromJson= jsonString.replaceAll("[^A-Za-z0-9]","")
System.out.println(removeSpecialCharacterFromJson);
Sample Output:
kindletrackingticketTICKET0900060Iimpactstatementimpactedduetoyear2020formathandlingissue....[space between characters are removed and It's hard to read the string]
The above code removed all the special characters from the string and It will be successfully parsed. But the description is not having the space and It very hard to read the content after the above changes done.
I tried to escape the \s in the regular expression which is giving the original String value which is leading to parse exception.
String removeSpecialCharacterFromJson= jsonString.replaceAll("[^A-Za-z0-9\\s]","")
Is there anyohter way to handle this ? I just want to the ticketDetails to be readable format and It should not have any special characters and \n lines.
Can someone help me on this?

s in regular expression is not for space but for the whitespace
I guess that you may have some additional non allowed whitespace in your JSON string
Take a look at The JSON spec (RFC 7159):
Insignificant whitespace is allowed before or after any of the six structural characters.
ws = *(
%x20 / ; Space
%x09 / ; Horizontal tab
%x0A / ; Line feed or New line
%x0D ) ; Carriage return
Verify your values and look for improper whitespaces

How to create a valid OBX segment in java using HAPI?

I want to generate just the OBX segment in a HL7 message in this format.
OBX|6|CE|59783-1^Status in immunization series^LN|**5**|||||||F
where no. 5 is the Series no.
problem is when ever I try to decode this line with a HL7 decoder. it results in something like this.
Vaccine funding program eligibility category
V07 - VFC Eligibility-Local-specific Eligibility
Vaccine purchased with
null -
vaccine type
107 - DTaP
Date vaccine information statement published
-
Date vaccine information statement presented
-
Status in immunization series
Here is my source code:
obx.getSetIDOBX().setValue(String.valueOf(obxSetId));
obx.getValueType().setValue("CE");
obx.getObservationIdentifier().getIdentifier().setValue("59783-1");
obx.getObservationIdentifier().getText().setValue("Status in immunization series");
obx.getObservationIdentifier().getNameOfCodingSystem().setValue("LN");
obx.getObservationSubID().setValue(String.valueOf(immunizationData.getSeries().toString()));
obx.getObservationResultStatus().setValue("F");
where obx is the reference to the OBX jar in hapi structures.

I suspect that the receiver of the message is complaining about the missing observation value (OBX.5). You specified a data type for the observation value (value type, OBX.2), but OBX.5 in your message does not contain a value of data type CE - it is empty.

Extracting required substring from a result retrieved from Wolfram Alpha with Java

I'm working on a Java Program which takes a question from a user, sends it to the Wolfram Alpha API and then cleans up the result and prints it.
If the user asks the question "Who is the President of the USA?" the result is as follows
Response: <section><title>Input interpretation</title> <sectioncontents>United States | President</sectioncontents></section><section><title>Result</title><sectioncontents>Barack Obama (from 20/01/2009 to present)</sectioncontents></section><section><title>Basic information</title><sectioncontents>official position | President (44th)..........etc
I would like to Extract "Barack Obama (from 20/01/2009 to present)"
I have been able to trim up to Barack using the below code:
String clean =response.substring(response.indexOf("Result") + 31 , response.length());
System.out.println("Response: " + clean);
How would I trim the rest of the result?

Well, in case it helps, I came up with this regex:
Result.+?>([^<]+?)<
After finding "Result" it captures the first instance of > and < with at least one character between them.
UPDATE
Below is some sample code that might be helpful:
String response = "Response: <section><title>..."
Pattern pattern = Pattern.compile("Result.+?>([^<]+?)<");
Matcher match = pattern.matcher(response);
String clean = "";
if (match.find())
clean = match.group(1);
System.out.println(clean);

The response is essentially XML.
As has been discussed endlessly in many programming fora, regular expressions are not suitable for parsing XML - you should use an XML parser.

How to extract sub-strings for a collection of text?

I extracted text from pdf document. .. I want to extract some particular fields in it using java..
The portion of text ..
US00RE44697E (i9) United States (12) Reissued Patent (10)
Patent Number: RE44,697 E Jones et al. (45) Date of
ReissuedPatent: Jan. 7, 2014 (54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT (75) Inventors: David E.Jones, Ottawa
(CA); Cormac M.O'Connell, Carp (CA) (73) Assignee: Mosaid
Technologies Incorporated, Ottawa, Ontario (CA) (21)
Appl.No.: 13/603,137 (22) Filed: Sep. 4, 2012 Related U.S.
Patent Documents Reissue of: (64) Patent No.: Issued:
Appl. No.: Filed: 6,088,800 Jul. 11, 2000
09/032,029 Feb. 27, 1998 (51) Int.CI. G06F 21/00
(2013.01) (52) U.S. CI. USPC .............713/189; 713/190;
713/193; 380/28; 380/33; 380/52 (58) Field of Classification
Search None
Now my mission is to extract fields form it and give to strings.. that is
the text (10) Patent Number: RE44,697 E will be extracted as String pat_no= " RE44,697 E"
the text (54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT will be extracted as String title= "ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT"
the extremely irregular text block
(64) Patent No.: Issued: Appl. No.: Filed:
6,088,800 Jul. 11, 2000 09/032,029 Feb. 27, 1998
have to be extracted as
String pat_no_org = "6,088,800";
String issued = "jul.11,2000"
String filed = "feb 27 ,1998"
......
like this..
My Works
First i used the string.split , string.substring , string,indexof and even apache string utils , but none helped.. Because the text are scattered , above methods doesn't helped.. I also tried regular expressions ,but since I very weak in it I can't program .
Please tell me how to achieve my objective using java ?

With regex, I would split it in 3 parts:
1.) (10) Patent Number the regex could look like this:
\(10\)\s*Patent Number:\s*([\w,]+)
as a java string:
"\\(10\\)\\s*Patent Number:\\s*([\\w,]+)"
The matches for the first parenthesized group will be in [1].
\s is a shorthand for [ \t\r\n\f] any kind of white-space.
\w is a shorthand for [A-Za-z0-9_] word-characters, together with , in a character class.
Some characters have special meanings in regex. They have to be escaped with a backslash.
2.) (54) ENCRYPT...
A pattern could look like:
(?s)\(54\)\s*(.*?)\s*(?=\(\d|$\))
as a java string:
"(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$\\))"
(?s) The s modifier equals Pattern.DOTALL where the dot matches new-lines too.
(?=\(\d|$\)) a lookahead is used, to match (.*?) lazy any amount of any characters until another ( followed by a digit | or string-end $ (anchor for end) is seen.
3.) For the other desired 3 parts I would try to reflect formatting of the input with the pattern. This requires, that all data is constructed compatible. A pattern could look like this:
(?s)\(64\).*?Filed:\s*([\d,]+)\s*(\w+\.\s*\d+,\s*\d+)\s*\n[\d+][^\n]+\n\s*(\w+\.\s*\d+,\s*\d+)
as a java string:
"(?s)\\(64\\).*?Filed:\\s*([\\d,]+)\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)\\s*\\n[\\d+][^\\n]+\\n\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)"
\n matches a newline.
Matches will be in [1] e.g. 6,088,800, [2] e.g. Jul. 11, 2000 and [3] e.g. Feb. 27, 1998.
For getting started with regex, this is too much information at once :)

Java String formatting solution

I have a string description of a company, which is nasty written by different users (hand-typed). Here is a example (focus on the dots, spaces, first letters etc..):
XXXX is a Global menagement consulting,Technology services and
outsourcing company, with 257000people serving clients in more than
120 countries.. combining unparalleled experience, comprehensive
capabilities across all industries and business functions,and
extensive research on the worlds most successfull companies, XXXX
collaborates with clients to help them become high-performance
businesses and governments., the company generated net revenues of
US$27.9 Billion for the fiscal year ended 31.07.2012..
Now what i want is to format the string to a bit nicer version like this:
XXXX is a global management consulting, technology services and
outsourcing company, with 257,000 people serving clients in more than
120 countries. Combining unparalleled experience, comprehensive
capabilities across all industries and business functions, and
extensive research on the world’s most successful companies, XXXX
collaborates with clients to help them become high-performance
businesses and governments. The company generated net revenues of
US$27.9 billion for the fiscal year ended Aug. 31, 2012.
My question is: Is there any library with already defined methods which could do all the spelling corrections, unneeded space removal, etc .. ?
So far, I do it be replacing stuff like " ," with ", " and toUpperCase() if the is a "///." in front etc..
desc = desc.replace(" ", " ");
desc = desc.replace("..", ".");
desc = desc.replace(" .", ".");
desc = desc.replace(" ,", ", ");
desc = desc.replace(".,", ".");
desc = desc.replace(",.", ".");
desc = desc.replace(", .", ".");
desc = desc.replace("*", "");
I'm sure there is a cleaner and better version to do this. Using regex maybe??
Any solution would be appreciated.

If I were trying to solve your problem, I would probably read the text 1 char at a time, and format it as you go. For example, in psuedocode...
while (has more chars){
char letter = readChar();
if (letter == ','){
// checking for the ',.' combination
letter = readChar();
if (readChar == '.'){
// write out a '.' only
out.print('.');
}
else {
// it wasn't the ',.' combination, so you need to output both characters, whatever they are
out.print(',');
out.print(letter);
}
}
else if (another letter you want to filter){
// etc.
}
else {
// doesn't match any of the filters, so just output the letter
out.print(letter);
}
}
Basically if you read the text 1 char at a time, you can detect any of your chosen formatting problems as you go, and correct them immediately. This provides a performance improvement, as you're only reading over the text string once (not 8 times, like you are currently doing), and allows you to add as many different/complex formatting changes as you want. The downside, however, is that you need to write the logic yourself rather than relying on in-built functions.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.