Regex parsing issues of multi-line entries containing formatted text

I have an internal system log file which I have been parsing via an external Hive table and a regex statement. Recently, however, issues have emerged in the output data: parts of the XML data are incomplete in the result set. After further investigation, the issue is down to formatted text in the XML data being passed into the logs.
Normally each log message is contained on a single line and the regex statement works fine, but when a message contains formatted text, parts of the message are pushed onto one or more additional lines, so the XML data is chopped when parsed.
I need to somehow piece the complete message back together into one line so that the regex can parse the whole message without chopping any data.
Below is a sample of the data I am dealing with.
Typical log message
0 20130323212857832 20130323212857832 0000 006 00/0000/000 BPAGPRDAGA01 Lexkin 000 000000 00 Reply: 0ms 865b3926-9002-4506-9825-c72bf19e694c: <GetProgramIdResponse xmlns="http://tempuri.org/"><GetProgramIdResult xmlns:b="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Accounts.Responses" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><ContactUs xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">false</ContactUs><ErrorDescription xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorDescription><ErrorMessage xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorMessage><ErrorNo xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorNo><ErrorSource xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">None</ErrorSource><ErrorTitle xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorTitle><ErrorType xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorType><b:ProgramId>-1</b:ProgramId></GetProgramIdResult></GetProgramIdResponse>
Message with formatted text 1
0 20130323212857832 20130323212857832 0000 006 00/0000/000 BPAGPRDAGA01 Lexkin 000 000000 00 Reply: 0ms 865b3926-9002-4506-9825-c72bf19e694c: <GetProgramIdResponse xmlns="http://tempuri.org/"><GetProgramIdResult xmlns:b="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Accounts.Responses" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><ContactUs xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">false</ContactUs><ErrorDescription xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">Issues with Core system,
Engineer support required</ErrorDescription><ErrorMessage xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorMessage><ErrorNo xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorNo><ErrorSource xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">None</ErrorSource><ErrorTitle xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorTitle><ErrorType xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorType><b:ProgramId>-1</b:ProgramId></GetProgramIdResult></GetProgramIdResponse>
Message with formatted text 2
0 20130323212857832 20130323212857832 0000 006 00/0000/000 BPAGPRDAGA01 Lexkin 000 000000 00 Reply: 0ms 865b3926-9002-4506-9825-c72bf19e694c: <GetProgramIdResponse xmlns="http://tempuri.org/"><GetProgramIdResult xmlns:b="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Accounts.Responses" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><ContactUs xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">false</ContactUs><ErrorDescription xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">Isuses with Core system,
Engineer support required</ErrorDescription><ErrorMessage xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorMessage><ErrorNo xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorNo><ErrorSource xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">Lexkin webservice,
Exception detected</ErrorSource><ErrorTitle xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorTitle><ErrorType xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorType><b:ProgramId>-1</b:ProgramId></GetProgramIdResult></GetProgramIdResponse>
Regex Statement
Because the log file contains various log messages in completely different formats, the complex regex statement below caters for all message types except multi-line data:
(^[0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+\/[0-9]+\/[0-9]+) (\S*)\s*(\S*)\s*([0-9]+)\s([0-9]+)\s([0-9]+)\s+([^:]*): ([0-9]*\.?[0-9]*)[ms]*(?<=[s ]) ?([0-9a-f-]*):? ?(.*)
I am not sure what the best approach is to resolve this issue. Some people have suggested writing a custom XML input format, but my experience with input formats is amateur at best. Hopefully this community can suggest other alternatives.

The issue seems to be the line breaks inside the tags. They are not matched by the . in the very last part of the regex. To have . match newline characters, you can enable single-line mode (the s modifier).
The catch is that .* then greedily grabs everything, which you want to prevent. I would do that with a negative lookahead (i.e. grab every character that isn't followed by the beginning of the next record):
(^[0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+\/[0-9]+\/[0-9]+) (\S*)\s*(\S*)\s*([0-9]+)\s([0-9]+)\s([0-9]+)\s+([^:]*): ([0-9]*\.?[0-9]*)[ms]*(?!<=[s ]) ?([0-9a-f-]*):? ?((?:.(?![0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+\/[0-9]+\/[0-9]+))*)
Regex101 demo
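Another option, if a preprocessing pass over the file is possible, is to join continuation lines before the existing single-line regex ever sees them. The sketch below is only an illustration of that idea: the class name, the method name, and the rule "a record starts with six numeric fields, the last shaped like 00/0000/000" are assumptions read off the sample data, and the line break is replaced with a single space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LogLineJoiner {
    // Assumed record prefix, derived from the sample lines in the question.
    private static final Pattern RECORD_START = Pattern.compile(
            "^[0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+/[0-9]+/[0-9]+ ");

    /** Appends every line that does not start a new record to the previous record. */
    public static List<String> joinContinuations(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (RECORD_START.matcher(line).find()) {
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                // Continuation line: the original line break becomes one space.
                current.append(' ').append(line);
            } else {
                // Continuation with no record before it: keep it as its own entry.
                records.add(line);
            }
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "0 20130323212857832 20130323212857832 0000 006 00/0000/000 HOST Lexkin 000 000000 00 Reply: 0ms id: <a>Issues with Core system,",
                "Engineer support required</a>",
                "0 20130323212857832 20130323212857832 0000 006 00/0000/000 HOST Lexkin 000 000000 00 Reply: 0ms id: <b/>");
        for (String record : joinContinuations(lines)) {
            System.out.println(record);
        }
    }
}
```

Each joined record can then be fed to the original regex unchanged.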

Related

JSON parser exception is observed for on fly JSON

I created a JSON document on the fly from some runtime data and stored it as a string, like below:
JSON:
{
"ticketDetails": "kindle tracking ticket: TICKET0900060
Iimpact statement: impacted due to year 2020 format handling issue,
depending on the Gateway,
user can be asked to
try with another instrument.
Timeline: 00: 00 SAP Internal Declines spiked to 300 +
05: 22 AM flintron reported DECLINED errors since 0: 00 PST.
As per TDO,flitron is not seeing clear metrics impact " }
Note: I copied the exact JSON that I get at runtime. It contains \n characters and spaces exactly as shown above.
Some fields, like ticketDetails, hold a long description, and when I try to parse the above string I get a parse error.
I tried to eliminate the parse error this way:
String removeSpecialCharacterFromJson = jsonString.replaceAll("[^A-Za-z0-9]", "");
System.out.println(removeSpecialCharacterFromJson);
Sample Output:
kindletrackingticketTICKET0900060Iimpactstatementimpactedduetoyear2020formathandlingissue....[spaces between words are removed and the string is hard to read]
The above code removes all special characters from the string, and the result parses successfully. But the description no longer has any spaces, so the content is very hard to read.
I also tried keeping \s in the regular expression, but that gives back the original string value, which leads to the parse exception again.
String removeSpecialCharacterFromJson = jsonString.replaceAll("[^A-Za-z0-9\\s]", "");
Is there any other way to handle this? I just want ticketDetails to be in a readable format, without special characters or \n line breaks.
Can someone help me with this?

\s in a regular expression does not stand for the space character only; it matches any whitespace.
I guess that you have some additional whitespace in your JSON string in a place where it is not allowed.
Take a look at the JSON spec (RFC 7159):
Insignificant whitespace is allowed before or after any of the six structural characters.
ws = *(
%x20 / ; Space
%x09 / ; Horizontal tab
%x0A / ; Line feed or New line
%x0D ) ; Carriage return
Verify your values and look for improper whitespace.
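One way to keep the text readable while getting rid of the line breaks is to collapse every run of whitespace into a single space, instead of deleting whitespace altogether. The following is a minimal sketch, assuming the only problem in the input is raw line breaks inside string values; the class and method names are made up for illustration.

```java
public class JsonWhitespaceCleaner {
    /** Replaces each run of spaces, tabs and line breaks with one space. */
    public static String collapseWhitespace(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "{\n\"ticketDetails\": \"kindle tracking ticket: TICKET0900060\n"
                + "Iimpact statement: impacted due to year 2020 format handling issue,\n"
                + "depending on the Gateway \" }";
        // The cleaned text no longer contains raw line breaks inside the value.
        System.out.println(collapseWhitespace(raw));
    }
}
```

Note that this also collapses whitespace that was intentional inside a value, so it only fits fields like ticketDetails where the exact layout does not matter.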

URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "^*"

I want to send this text in an email:
Destination : 6W - ATLANTA WEST!##$%^*!gemini!##$%^*!jfds!##$%^*!,Trailer Number : 000564,,Drop empty trailer at Plant Numbe :546,Pick up trailer at Plant Number :45, Bill Date : 25-Jan-2013,Bill Time - Eastern Time : 1,Trip Number :456,MBOL :546,Carrier :Covenant!##$%^*!test#shaw.com!##$%^*!transport#shaw.com!##$%^*!test#transport.com!##$%^*!antoalphi#gmail.com,Destination : 6W - ATLANTA WEST!##$%^*!gemini!##$%^*!jfds!##$%^*!,Customer Name : 567,Cusomer Delivery Address : 657567657,General Comments :657,Warehouse Comments : 65,Carrier Comments : ,Appointment Date :25-Jan-2013,Appointment Time : 1am,Rail Only :Standard,Total Weight : 45645
and I used mailContent = URLDecoder.decode(Body, "UTF-8"); to decode it,
but it gives me this exception: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "^*"
Could anyone help me solve this? I get it while sending mail.
Best Regards

You are trying to URL decode something that wasn't URL encoded in the first place. What's wrong with the body as it is? In other words, what happens if you just use:
mailContent = Body
(In URL encoding, the % character is used with two hexadecimal digits to encode characters that might cause problems, for example / would be encoded as %2F, as its ASCII code is 47 (decimal) or 2F (hex). In your body, % is followed by two characters that are not hexadecimal digits - that's how I can tell it hasn't been URL encoded, and why the decoder is erroring.)
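The behaviour described above can be reproduced in a few lines. This is a sketch only: the class and the decodeIfEncoded helper are hypothetical names, and falling back to the raw text is a heuristic (plain text that happens to contain + or a valid %XX sequence would still be altered), so not decoding at all remains the simpler fix.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlDecodeCheck {
    /** Returns the decoded text, or the input unchanged if it is not valid URL encoding. */
    public static String decodeIfEncoded(String text) {
        try {
            return URLDecoder.decode(text, "UTF-8");
        } catch (IllegalArgumentException notUrlEncoded) {
            // Thrown when % is not followed by two hexadecimal digits, as with "%^*".
            return text;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        // Not URL encoded: % is followed by ^*, so the text comes back unchanged.
        System.out.println(decodeIfEncoded("ATLANTA WEST!##$%^*!gemini"));
        // URL encoded: %20 is a space and %2F is a slash.
        System.out.println(decodeIfEncoded("Bill%20Date%3A%2025%2FJan"));
    }
}
```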

Simply stop calling URLDecoder.decode() and you will stop getting the error! The string value you are passing to it is not URL encoded.
There are various forms of MIME encoding that you might want to consider, if you are sending an email with content that would not normally be allowed in an email message without encoding. These references might be handy:
What is allowed in SMTP: http://www.apps.ietf.org/rfc/rfc788.html
Basic MIME encoding: http://www.apps.ietf.org/rfc/rfc1341.html
Java MIME support: http://docs.oracle.com/javaee/1.4/api/javax/mail/internet/MimeUtility.html
For example, you might try (the third argument is the transfer encoding, "B" for Base64 or "Q" for quoted-printable):
String sendable = MimeUtility.encodeText(body, "UTF-8", "B");

Extracting data from a text file - repeated values

79 0009!017009!0479%0009!0479 0009!0469%0009!0469
0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449
0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419
0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009
0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339
0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032
In this data, I'm supposed to extract the numbers 47, 46, 45, 44 and so on, and ignore the rest. The numbers always follow this pattern: 9!0, then the number, then 9%
for example: 9!0 42 9%
Which language should I use to solve this, and which function might help me?
Is there a function that can locate a special character and copy the next two or three characters?
Ex: 9!0 42 9% and ' 009
Look for ! and copy 42 from there, then look for ' which marks another value (009). It seems like two different regexes are needed.

You can use whatever language you want, or even a Unix command-line utility like sed, awk, or grep. The regex should be something like this: you want to match 9!0 followed by digits followed by 9%. Use this regex: 9!0(\d+)9% (or if the numbers are all two digits, 9!0(\d{2})9%).

The other answers are fine; my regex solution is simply "9!.(\d\d)"
And here's a full solution in PowerShell, which translates easily to other .NET languages:
$t="79 0009!017009!0479%0009!0479 0009!0469%0009!0469 0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449 0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419 0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009 0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339 0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032"
$p="9!.(\d\d)"
$ms=[regex]::match($t,$p)
while ($ms.Success) {write-host $ms.groups[1].value;$ms=$ms.NextMatch()}

This is Perl:
@result = $subject =~ m/(?<=9!0)\d+(?=9%)/g;
It will give you an array of all your numbers. You didn't provide a language so I don't know if this is suitable for you or not.
Pattern regex = Pattern.compile("(?<=9!0)\\d+(?=9%)");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
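For completeness, here is the same lookaround pattern wrapped in a small runnable class. The class and method names are illustrative, and the input is the first part of the sample data from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberExtractor {
    // Digits that come right after "9!0" and right before "9%".
    private static final Pattern BETWEEN_MARKERS = Pattern.compile("(?<=9!0)\\d+(?=9%)");

    public static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher matcher = BETWEEN_MARKERS.matcher(text);
        while (matcher.find()) {
            numbers.add(matcher.group());
        }
        return numbers;
    }

    public static void main(String[] args) {
        String sample = "79 0009!017009!0479%0009!0479 0009!0469%0009!0469";
        System.out.println(extract(sample)); // prints [47, 46]
    }
}
```

Occurrences such as 9!0479 followed by a space are skipped, because the lookahead insists on the closing 9%.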

Trying to get some useful data from a log file with regex in Java

I'm having trouble with regex because I can only match some of my targets.
I have a log file, and I must match some of the items and write them to another text file. I wrote Java code that works on a short example, but when I feed it the whole file, everything gets messed up.
*052511 074217 0065 02242806000 UNKNOWN U G
*052511 074217 0065 4874 02242806000 UNKNOWN U A
*052511 074218 0065 4874 02242806000 UNKNOWN U R
-------- 05/25/11 07:42:17 LINE = 0065 STN = 4874
CALLING NUMBER 02242806000
NAME UNKNOWN
UNKNOWN
BC = SPEECH
00:00:00 INCOMING CALL RINGING 0:02
00:00:11 CALL RELEASED
I have to produce these results from the file:
incoming,05/25/11,07:42:17,0065,4874,02242806000,00:00:09,2
In this expression, 00:00:09 means [00:00:11 - 00:00:00] - 0:02.
I must make the conversion above for every incoming and outgoing call.
Here is my code
Here is the log file

You could use a regex like:
(?xm:
^-------- \s+ (\S+) \s+ (\S+) \s+ LINE\s*=\s*(\d+) \s+ STN\s*=\s*(\d+)
\s+ CALLING\ NUMBER \s+ (\d+) \s*
(?:^(?:[ \t]+.*)?[\n\r]+)* # eat unwanted part
^(\d\d:\d\d:\d\d) \s+ INCOMING\ CALL \s+ RINGING\ ([\d:]+) \s*
(?:^\d.*[\r\n]+)* # possible stuff
^(\d\d:\d\d:\d\d) \s+ CALL\ RELEASED
)
Use the values of the capturing groups to get your results. You may need to remove the /x related things like comments and spaces.
Perl example at http://ideone.com/qTBFe
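A rough Java counterpart, including the duration arithmetic from the question, might look like the sketch below. Several things here are assumptions rather than facts from the log format: the pattern is a simplified variant of the one above (it lazily skips the lines between the fields instead of eating them one by one), the final CSV field is taken to be the ringing time in seconds, only incoming calls are handled, and all names are made up.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CallRecordParser {
    // Simplified pattern; (?s:.*?) lazily skips the lines we do not need.
    private static final Pattern INCOMING_CALL = Pattern.compile(
            "^-{8}\\s+(\\S+)\\s+(\\S+)\\s+LINE\\s*=\\s*(\\d+)\\s+STN\\s*=\\s*(\\d+)"
            + "\\s+CALLING NUMBER\\s+(\\d+)"
            + "(?s:.*?)^(\\d\\d:\\d\\d:\\d\\d)\\s+INCOMING CALL\\s+RINGING\\s+([\\d:]+)"
            + "(?s:.*?)^(\\d\\d:\\d\\d:\\d\\d)\\s+CALL RELEASED",
            Pattern.MULTILINE);

    /** Converts "h:mm:ss" or "m:ss" into a number of seconds. */
    static int toSeconds(String clock) {
        int total = 0;
        for (String part : clock.split(":")) {
            total = total * 60 + Integer.parseInt(part);
        }
        return total;
    }

    static String toClock(int seconds) {
        return String.format(Locale.ROOT, "%02d:%02d:%02d",
                seconds / 3600, (seconds / 60) % 60, seconds % 60);
    }

    /** Returns one CSV line for the first incoming call in the block, or null if none matches. */
    public static String toCsv(String block) {
        Matcher m = INCOMING_CALL.matcher(block);
        if (!m.find()) {
            return null;
        }
        int ringing = toSeconds(m.group(7));
        // Talk time = release time - start time - ringing time.
        int talk = toSeconds(m.group(8)) - toSeconds(m.group(6)) - ringing;
        return String.join(",", "incoming", m.group(1), m.group(2), m.group(3),
                m.group(4), m.group(5), toClock(talk), String.valueOf(ringing));
    }

    public static void main(String[] args) {
        String block = "-------- 05/25/11 07:42:17 LINE = 0065 STN = 4874\n"
                + "CALLING NUMBER 02242806000\n"
                + "NAME UNKNOWN\n"
                + "BC = SPEECH\n"
                + "00:00:00 INCOMING CALL RINGING 0:02\n"
                + "00:00:11 CALL RELEASED\n";
        System.out.println(toCsv(block));
        // prints incoming,05/25/11,07:42:17,0065,4874,02242806000,00:00:09,2
    }
}
```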

org.json.JSONException: Unterminated string at 737 [character 738 line 1]

I'm using org.json.JSONObject to parse some JSON being sent to my servlet by an iPhone. I was stuck for a while on why I would be getting an error message at all. The error message was:
org.json.JSONException: Unterminated string at 737 [character 738 line 1]
After printing out what I received, I see that the string sent was indeed cut short and stopped mid-JSON. I can't understand why it would be cut short. There's no limit on String size, is there (or at least only a memory limit, surely)?
Has anyone else had this error?
Cheers
Joe

JSON works well with \n, but if you have any other special characters in your message, like
\ , # , & , # etc., first convert them into their respective hex values and then send your message.

If you're using the HTTP GET method to send data using query parameters, realize that there's a practical limit on the amount of data you can send that way. It's about 2000 characters (varies by server and client). You can easily exceed that when URL encoding a shorter string.
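To see how quickly URL encoding inflates a JSON payload, the lengths before and after encoding can be compared. This is a small sketch with an invented payload and class name; the actual limit depends on the server and client involved.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryLengthCheck {
    /** URL-encodes a value as UTF-8 without exposing the checked exception. */
    public static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        String json = "{\"name\":\"Joe\",\"note\":\"hello world\"}";
        String encoded = encode(json);
        // Every quote, brace, colon and comma becomes a three-character %XX sequence.
        System.out.println(json.length() + " characters before encoding");
        System.out.println(encoded.length() + " characters after encoding");
    }
}
```

Sending the JSON in the body of a POST request avoids the query-string limit altogether.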

JSON won't work if the received string contains a raw newline character (a literal line break rather than the two-character escape sequence \n). Check for it and escape the character.
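Escaping can be sketched as follows. This is a hand-rolled illustration with made-up names, covering quotes, backslashes and control characters; in real code a JSON library would normally do this when it serializes the string.

```java
public class JsonStringEscaper {
    /** Escapes a value so it can be placed between double quotes in a JSON document. */
    public static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String message = "line one\nline two";
        // The raw line break becomes the two characters backslash and n.
        System.out.println("{\"message\":\"" + escape(message) + "\"}");
    }
}
```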
