I created a JSON document on the fly from some runtime data and stored it as a string, like below:
JSON:
{
"ticketDetails": "kindle tracking ticket: TICKET0900060
Iimpact statement: impacted due to year 2020 format handling issue,
depending on the Gateway,
user can be asked to
try with another instrument.
Timeline: 00: 00 SAP Internal Declines spiked to 300 +
05: 22 AM flintron reported DECLINED errors since 0: 00 PST.
As per TDO,flitron is not seeing clear metrics impact " }
Note: I copied the exact JSON that I get at runtime. It contains \n characters and spaces exactly as shown above.
Some values, such as ticketDetails, hold a long description, and parsing the above string leads to a parse error.
I tried to eliminate the parse error with the following:
String removeSpecialCharacterFromJson = jsonString.replaceAll("[^A-Za-z0-9]", "");
System.out.println(removeSpecialCharacterFromJson);
Sample Output:
kindletrackingticketTICKET0900060Iimpactstatementimpactedduetoyear2020formathandlingissue....[space between characters are removed and It's hard to read the string]
The above code removes all the special characters from the string, and the result parses successfully. But the description no longer has any spaces, so the content is very hard to read.
I tried to exclude \s from the removal in the regular expression, but that gives back the original String value, which leads to the parse exception.
String removeSpecialCharacterFromJson = jsonString.replaceAll("[^A-Za-z0-9\\s]", "");
Is there any other way to handle this? I just want ticketDetails to be in a readable format, without any special characters or \n line breaks.
Can someone help me on this?
\s in a regular expression does not match only the space character; it matches any whitespace, including line breaks.
I guess that you have some whitespace in your JSON string in a place where it is not allowed.
Take a look at the JSON spec (RFC 7159):
Insignificant whitespace is allowed before or after any of the six structural characters.
ws = *(
%x20 / ; Space
%x09 / ; Horizontal tab
%x0A / ; Line feed or New line
%x0D ) ; Carriage return
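Verify your values and look for improper whitespace. The rule above covers whitespace between tokens only; inside a string value, control characters such as a raw line feed must be escaped (for example as \n).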
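For the ticketDetails case in the question, one possible approach is to collapse every run of whitespace (spaces, tabs, line breaks) into a single space before parsing. That keeps the words separated while removing the raw line breaks. This is only a sketch: the class and method names are made up, and it assumes that rewriting whitespace inside string values is acceptable for your data.

```java
public class JsonWhitespace {
    // Collapse each run of whitespace (space, tab, CR, LF) into one space.
    // Assumption: losing the original line breaks inside values is acceptable.
    static String collapseWhitespace(String json) {
        return json.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Shortened, hypothetical version of the runtime string from the question.
        String raw = "{\n\"ticketDetails\": \"tracking ticket: TICKET0900060\n"
                + "Timeline: 00: 00 declines spiked\" }";
        System.out.println(collapseWhitespace(raw));
    }
}
```

Unlike the [^A-Za-z0-9] replacement, this leaves quotes, braces and colons intact, so the result is still JSON.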
I am reading a JSON file in Java using this code:
String data = Files.readFile(jsonFile)
.trim()
.replaceAll("[^\\x00-\\x7F]", "")
.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "")
.replaceAll("\\p{C}", "");
In my JSON file, there is a special char, 'あ' (12354), that is turned into "" (nothing) when reading the file.
How can I make this char show up in my variable "data"?
From the answers I've got, I understand that replaceAll("[^\\x00-\\x7F]", "") strips all non-ASCII characters from the data. But what can I do if I want all non-ASCII characters removed except this one, 'あ'?
The character you want is the Unicode character HIRAGANA LETTER A, with code point U+3042.
You can simply add it to the list of valid characters:
...
.replaceAll("[^\\x00-\\x7F\\u3042]", "")
...
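A small self-contained check of that character class might look like the following. The class and method names are only for illustration. The later \\p{Cntrl} and \\p{C} replacements from the question should not touch this character either, since it is a letter rather than a control character.

```java
public class KeepHiragana {
    // Remove every non-ASCII character except U+3042 (HIRAGANA LETTER A).
    static String clean(String data) {
        return data.replaceAll("[^\\x00-\\x7F\\u3042]", "");
    }

    public static void main(String[] args) {
        // Unicode escapes avoid depending on the encoding of the source file.
        String input = "a\u3042b\u00E9c";
        // The hiragana letter is kept, the accented e is removed.
        System.out.println(clean(input).equals("a\u3042bc"));
    }
}
```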
I have to find a user-defined String in a Document (using Java), which is stored in a database in a BLOB. When I search for a String with special characters ("Umlaute", äöü etc.), it fails, meaning it does not return any positions at all. And I am not allowed to convert the document's content into UTF-8 (which would have fixed this problem but raised a new, even bigger one).
Some additional information:
The document's content is returned as String in "ISO-8859-1" (Latin1).
Here is an example, what a String could look like:
Die Erkenntnis, daÃ der KÃ¼nstler Schutz braucht, ...
This is how it should look like:
Die Erkenntnis, daß der Künstler Schutz braucht, ...
If I am searching for Künstler it would fail to find it, because it looks for ü but only finds Ã¼.
Is it possible to convert Künstler into KÃ¼nstler so I can search for the wrongly encoded version instead?
Note:
We are using the Hibernate framework for database access. The original getter for the document's content returns a byte[]. The String is then returned by calling
new String(getContent(), "ISO-8859-1")
The problem here is, that I cannot change this to UTF-8, because it would then mess up the rest of our application which is based on a third party application that delivers data this way.
Okay, looks like I've found a way to mess up the encoding on purpose.
new String("Künstler".getBytes("UTF-8"), "ISO-8859-1")
By getting the bytes of the String Künstler in UTF-8 and then creating a new String, telling Java that these bytes are Latin1, it converts to KÃ¼nstler. It's a hell of a hack but seems to work well.
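As a runnable sketch of this hack (the class and method names are invented for the example), the search key is put through the same UTF-8 to ISO-8859-1 mis-decoding as the stored content, so both sides match again:

```java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    // Encode as UTF-8, then decode those bytes as ISO-8859-1,
    // reproducing the way the document content is read.
    static String toMisdecoded(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Simulated document content, as it would come back from the getter.
        String content = toMisdecoded("Der K\u00FCnstler braucht Schutz");
        // The user's search key, mangled the same way before searching.
        String key = toMisdecoded("K\u00FCnstler");
        System.out.println(content.indexOf(key));
    }
}
```

Using StandardCharsets instead of charset names avoids the checked UnsupportedEncodingException.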
Already answered by yourself.
An altogether different approach:
If you can search the blob, you could search using
"SELECT .. FROM ... WHERE"
+ " ... LIKE '%" + key.replaceAll("\\P{ASCII}+", "%") + "%'"
This replaces non-ASCII sequences by the % wildcard: UTF-8 multibyte sequences are non-ASCII by design.
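The pattern-building step on its own could be sketched like this. The class and method names are hypothetical, and the sketch assumes the key itself contains no % or _ characters, which LIKE would also treat as wildcards. The resulting pattern would preferably be bound as a statement parameter rather than concatenated into the SQL text.

```java
public class LikePattern {
    // Replace each run of non-ASCII characters with the SQL % wildcard
    // and wrap the result in % so it can match anywhere in the column.
    static String toLikePattern(String key) {
        return "%" + key.replaceAll("\\P{ASCII}+", "%") + "%";
    }

    public static void main(String[] args) {
        // Prints the pattern built for a key containing an umlaut.
        System.out.println(toLikePattern("K\u00FCnstler"));
    }
}
```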
I have an internal system log file which I have been parsing via an external Hive table and a regex statement. Recently, issues have emerged in the output data, with parts of the XML data incomplete in the result set. After further investigation, the issue is down to formatted text in the XML data being passed into the logs.
Traditionally, each log message is contained on a single line and the regex statement works fine, but when a message contains formatted text, parts of the message are pushed onto one or more additional lines, and so the XML data is chopped when parsed.
The issue I have is that I need to somehow piece the complete message back together into one line so that the regex can successfully parse the whole message and not chop any data.
Below is sample of the data I am dealing with.
Typical log message
0 20130323212857832 20130323212857832 0000 006 00/0000/000 BPAGPRDAGA01 Lexkin 000 000000 00 Reply: 0ms 865b3926-9002-4506-9825-c72bf19e694c: <GetProgramIdResponse xmlns="http://tempuri.org/"><GetProgramIdResult xmlns:b="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Accounts.Responses" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><ContactUs xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">false</ContactUs><ErrorDescription xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorDescription><ErrorMessage xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorMessage><ErrorNo xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorNo><ErrorSource xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">None</ErrorSource><ErrorTitle xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorTitle><ErrorType xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorType><b:ProgramId>-1</b:ProgramId></GetProgramIdResult></GetProgramIdResponse>
Message with formatted text 1
0 20130323212857832 20130323212857832 0000 006 00/0000/000 BPAGPRDAGA01 Lexkin 000 000000 00 Reply: 0ms 865b3926-9002-4506-9825-c72bf19e694c: <GetProgramIdResponse xmlns="http://tempuri.org/"><GetProgramIdResult xmlns:b="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Accounts.Responses" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><ContactUs xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">false</ContactUs><ErrorDescription xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">Issues with Core system,
Engineer support required</ErrorDescription><ErrorMessage xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorMessage><ErrorNo xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorNo><ErrorSource xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">None</ErrorSource><ErrorTitle xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorTitle><ErrorType xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorType><b:ProgramId>-1</b:ProgramId></GetProgramIdResult></GetProgramIdResponse>
Message with formatted text 2
0 20130323212857832 20130323212857832 0000 006 00/0000/000 BPAGPRDAGA01 Lexkin 000 000000 00 Reply: 0ms 865b3926-9002-4506-9825-c72bf19e694c: <GetProgramIdResponse xmlns="http://tempuri.org/"><GetProgramIdResult xmlns:b="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Accounts.Responses" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"><ContactUs xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">false</ContactUs><ErrorDescription xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">Isuses with Core system,
Engineer support required</ErrorDescription><ErrorMessage xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorMessage><ErrorNo xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorNo><ErrorSource xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">Lexkin webservice,
Exception detected</ErrorSource><ErrorTitle xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common"></ErrorTitle><ErrorType xmlns="http://schemas.datacontract.org/2004/07/ApplicationServices.DataContracts.Common">0</ErrorType><b:ProgramId>-1</b:ProgramId></GetProgramIdResult></GetProgramIdResponse>
Regex Statement
Because the log file contains various log messages in completely different formats, the complex regex statement below caters for all the message types except for multi-line data.
(^[0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+\/[0-9]+\/[0-9]+) (\S*)\s*(\S*)\s*([0-9]+)\s([0-9]+)\s([0-9]+)\s+([^:]*): ([0-9]*\.?[0-9]*)[ms]*(?<=[s ]) ?([0-9a-f-]*):? ?(.*)
I am not sure what the best approach is to resolve this issue. Some people have suggested writing a custom XML input format, but my experience with input formats is amateur at best; hopefully this community can suggest other alternatives.
The issue seems to be the line breaks inside the tags. They are not matched by the . in the very last part of the regex. To make . match newline characters as well, you can set the single-line (dotall) mode modifier.
The catch is that .* then greedily grabs everything, which you want to prevent. I would do it with a negative lookahead (i.e. grab every character that isn't followed by the beginning of this regex):
(^[0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+\/[0-9]+\/[0-9]+) (\S*)\s*(\S*)\s*([0-9]+)\s([0-9]+)\s([0-9]+)\s+([^:]*): ([0-9]*\.?[0-9]*)[ms]*(?!<=[s ]) ?([0-9a-f-]*):? ?((?:.(?![0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+\/[0-9]+\/[0-9]+))*)
Regex101 demo
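A different technique from the lookahead above is to pre-process the file and glue continuation lines back onto their record before the regex ever sees them. The following is a minimal sketch under stated assumptions: every real record starts with the numeric prefix from the beginning of the question's regex, no continuation line happens to start with that prefix, and replacing the line break with a single space is acceptable. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LogLineJoiner {
    // Start of a record: five numeric fields followed by the n/n/n field,
    // taken from the beginning of the regex in the question.
    private static final Pattern RECORD_START = Pattern.compile(
            "^[0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+ [0-9]+/[0-9]+/[0-9]+ ");

    // Append every line that does not look like a record start
    // to the record that precedes it.
    static List<String> joinContinuations(List<String> lines) {
        List<String> records = new ArrayList<>();
        for (String line : lines) {
            if (records.isEmpty() || RECORD_START.matcher(line).find()) {
                records.add(line);
            } else {
                int last = records.size() - 1;
                records.set(last, records.get(last) + " " + line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened, made-up lines in the shape of the sample data.
        List<String> lines = Arrays.asList(
                "0 1 2 0000 006 00/0000/000 HOST Reply: <a>Issues with Core system,",
                "Engineer support required</a>",
                "0 1 2 0000 006 00/0000/000 HOST Reply: <b/>");
        joinContinuations(lines).forEach(System.out::println);
    }
}
```

The joined output can then be loaded with the original single-line regex unchanged.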
I want to send this text in an email:
Destination : 6W - ATLANTA WEST!##$%^*!gemini!##$%^*!jfds!##$%^*!,Trailer Number : 000564,,Drop empty trailer at Plant Numbe :546,Pick up trailer at Plant Number :45, Bill Date : 25-Jan-2013,Bill Time - Eastern Time : 1,Trip Number :456,MBOL :546,Carrier :Covenant!##$%^*!test#shaw.com!##$%^*!transport#shaw.com!##$%^*!test#transport.com!##$%^*!antoalphi#gmail.com,Destination : 6W - ATLANTA WEST!##$%^*!gemini!##$%^*!jfds!##$%^*!,Customer Name : 567,Cusomer Delivery Address : 657567657,General Comments :657,Warehouse Comments : 65,Carrier Comments : ,Appointment Date :25-Jan-2013,Appointment Time : 1am,Rail Only :Standard,Total Weight : 45645
and I used mailContent = URLDecoder.decode(Body, "UTF-8"); to decode it,
but it gives me this exception: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "^*"
Could anyone help me solve this? I get it while sending the mail.
Best Regards
You are trying to URL decode something that wasn't URL encoded in the first place. What's wrong with the body as it is? In other words, what happens if you just use:
mailContent = Body
(In URL encoding, the % character is used with two hexadecimal digits to encode characters that might cause problems, for example / would be encoded as %2F, as its ASCII code is 47 (decimal) or 2F (hex). In your body, % is followed by two characters that are not hexadecimal digits - that's how I can tell it hasn't been URL encoded, and why the decoder is erroring.)
Simply stop calling URLDecoder.decode() and you will stop getting the error! The string value you are passing to it is not URL encoded.
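To see the difference between text that was URL encoded and text that was not, a small check like the following can help. It is a sketch with invented class and method names, and it assumes Java 10 or later for the Charset overloads of URLEncoder.encode and URLDecoder.decode.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlDecodeCheck {
    // Returns true if the text can be URL-decoded without an error.
    static boolean canDecode(String text) {
        try {
            URLDecoder.decode(text, StandardCharsets.UTF_8);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown when % is not followed by two hexadecimal digits.
            return false;
        }
    }

    // Encoding first and decoding afterwards gives back the original text.
    static String roundTrip(String text) {
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8);
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String body = "ATLANTA WEST!##$%^*!gemini";
        System.out.println(canDecode(body));
        System.out.println(roundTrip(body).equals(body));
    }
}
```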
There are various forms of MIME encoding that you might want to consider, if you are sending an email with content that would not normally be allowed in an email message without encoding. These references might be handy:
What is allowed in SMTP: http://www.apps.ietf.org/rfc/rfc788.html
Basic MIME encoding: http://www.apps.ietf.org/rfc/rfc1341.html
Java MIME support: http://docs.oracle.com/javaee/1.4/api/javax/mail/internet/MimeUtility.html
For example, you might try:
String sendable = MimeUtility.encodeText(body, "UTF-8", "B"); // "B" selects Base64; "Q" is the other supported value
I'm using org.json.JSONObject to parse some JSON sent to my servlet by an iPhone. I was stuck for a while on why I was getting an error message at all. The error message was:
org.json.JSONException: Unterminated string at 737 [character 738 line 1]
After printing out what I received, I see that the string sent was indeed cut short and stopped mid-JSON. I can't understand why it would be cut short. There's no limit on String size, is there (or at least only a memory limit, surely)?
Has anyone else had this error?
Cheers
Joe
JSON works well with \n, but if you have any other special characters in your message, like
\, #, & etc., first convert them into their respective hex values and then send your message.
If you're using the HTTP GET method to send data using query parameters, realize that there's a practical limit on the amount of data you can send that way. It's about 2000 characters (varies by server and client). You can easily exceed that when URL encoding a shorter string.
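How quickly URL encoding inflates a JSON payload can be checked with a few lines. This is only an illustration with a made-up class name and sample payload, again assuming Java 10 or later for the Charset overload of URLEncoder.encode.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodedLength {
    // URL-encode a JSON string the way it would travel in a query parameter.
    static String encode(String json) {
        return URLEncoder.encode(json, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = "{\"a\":\"b c\"}";
        String encoded = encode(json);
        // Quotes, braces and colons each become three characters.
        System.out.println(json.length() + " -> " + encoded.length());
    }
}
```

Sending the JSON in the body of a POST request avoids this query-string limit.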
JSON won't parse if the received string contains a raw newline character inside a string value. Check for it and escape the character (as \n).
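Escaping could be sketched as below for a single string value (not a whole JSON document). The names are hypothetical, and the sketch deliberately handles only control characters; a real JSON library would also escape quotes and backslashes and is usually the safer choice.

```java
public class JsonEscape {
    // Escape control characters (below U+0020) in one JSON string value.
    // Assumption: quotes and backslashes are handled elsewhere.
    static String escapeControlChars(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Other control characters get the six-character hex form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The raw line break becomes the two characters backslash and n.
        System.out.println(escapeControlChars("line one\nline two"));
    }
}
```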