XML.toJSONObject Throws Exception Though XML is Perfectly Valid (W3C Validated) - java

I am using a class to crunch XML feeds (RSS feeds like: http://www.reddit.com/r/carporn/.rss) into JSONObjects for easy processing. Normally this class works perfectly for every feed I give it. Strangely, with Reddit's feeds, which are perfectly valid XML per W3C validators, I get the error:
E/JSON exception﹕ Missing ';' in XML entity: &quot at character 21607
I threw the feed into Notepad++ and went to character 21607 and found:
"
This appears to be a perfectly valid encoding for XML purposes of the double quote character: ". W3C took the same input and passed 0 warnings or errors, the XML is definitely completely valid.
So, why is XML.toJSONObject failing on valid XML? I've noted it also fails when confronted with:
&apos;
I can't believe some rookie like me is finding a bug, so what's really going on here?
Thank you!
Ultimately, I fixed this problem by doing the following HACK; I'd still like to know why it's necessary:
/*
Replaces the double-quote and single-quote values below with the actual characters
*/
feedsRssResult = feedsRssResult.replaceAll("&quot;", "\"");
feedsRssResult = feedsRssResult.replaceAll("&apos;", "'");
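For context, here is the workaround as a self-contained sketch (the class and method names are placeholders, not my real code), assuming the org.json library's XML class:

import org.json.JSONException;
import org.json.JSONObject;
import org.json.XML;

public class FeedCruncher {
    // Pre-translate the two entities the tokener chokes on,
    // then hand the cleaned XML to XML.toJSONObject as usual.
    static JSONObject crunch(String feedsRssResult) throws JSONException {
        feedsRssResult = feedsRssResult.replaceAll("&quot;", "\"");
        feedsRssResult = feedsRssResult.replaceAll("&apos;", "'");
        return XML.toJSONObject(feedsRssResult);
    }
}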

Related

java SAXParser ignore exception and continue parsing

I have a Java class that parses an XML file and writes its content to MySQL. Everything works fine, but when the XML file contains invalid Unicode characters, an exception is thrown and the program stops parsing the file.
My provider sends this XML file on a daily basis with a list of products and their prices, quantities, etc., and I have no control over this, so invalid characters will always be there.
All I'm trying to do is to catch these errors, ignore them and continue parsing the rest of the XML file.
I've added try-catch statements to the startElement, endElement and characters methods of the SAXHandler class; however, they don't catch any exception and the execution stops whenever the parser finds an invalid character.
It seems that I can only catch these exceptions from the function that calls the parser:
try {
    myIS = new FileInputStream(xmlFilePath);
    parser.parse(myIS, handler);
    retValue = true;
} catch (SAXParseException err) {
    System.out.println("SAXParseException " + err);
}
However, that's useless in my case: even if the exception tells me where the invalid character is, execution stops, so the list of products is far from complete. The list has about 8,000 products and only a couple of invalid characters; if an invalid character appears in the first 100 products, the remaining 7,900 products are never updated in the database. I've also noticed that the endDocument method is not called if an exception occurs.
Somebody asked the same question here some years ago, but didn't get any solution.
I'd really appreciate any ideas or workarounds for this.
Data Sample (as requested):
<Producto>
    <Brand>
        <Description>Epson</Description>
        <ManufacturerId>eps</ManufacturerId>
        <BrandId>eps</BrandId>
    </Brand>
    <New>false</New>
    <OnSale>null</OnSale>
    <Type>Physical</Type>
    <Description>Epson TM T88V - Impresora de recibos - línea térmica - rollo 8 cm - hasta 300 mm/segundo - paralelo, USB</Description>
    <Category>
        <CategoryId>pos</CategoryId>
        <Description>Puntos de Venta</Description>
        <Subcategories>
            <CategoryId>pos.printer</CategoryId>
            <Description>Impresoras para Recibos</Description>
        </Subcategories>
    </Category>
    <InStock>0</InStock>
    <Price>
        <UnitPrice>4865.6042</UnitPrice>
        <CurrencyId>MXN</CurrencyId>
    </Price>
    <Manufacturer>
        <Description>Epson</Description>
        <ManufacturerId>eps</ManufacturerId>
    </Manufacturer>
    <Mpn>C31CA85814</Mpn>
    <Sku>PT910EPS27</Sku>
    <CompilationDate>2020-02-25T12:30:14.6607135Z</CompilationDate>
</Producto>
The XML philosophy is that you don't process bad data. If it's not well-formed XML, the parser is supposed to give up, and user applications are supposed to give up. Culturally, this is a reaction against the HTML culture, where it was found that if it's generally expected that data users will tolerate bad data, the consequence is that suppliers will produce bad data.
Standards deliver cost reduction because you can use readily available off-the-shelf tools both for creating valid data and for reading it at the other end. The benefits are totally neutralised if you decide you're going to interchange things that are almost XML but not quite. If you were downloading software you wouldn't put up with it if it didn't compile. So why are you prepared to put up with bad data? Send it back and demand a refund.
Having said that, if the problem is "invalid Unicode characters" then it's possible that it started out as good XML and got corrupted in transit. Find out what went wrong and get it fixed as close to the source of the problem as you can.
I solved it by removing the invalid characters from the XML file before processing it.
I couldn't do what I was originally trying to do (catch the error and continue), but this workaround worked.
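A minimal sketch of that kind of pre-filter (my reconstruction of the idea, not the poster's actual code): drop every character that is not legal in XML 1.0 before handing the file to the SAX parser.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XmlCleaner {
    // Legal XML 1.0 characters: #x9 | #xA | #xD | [#x20-#xD7FF]
    //                           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Returns a copy of the input with every illegal code point removed.
    static String stripIllegalChars(String xml) {
        StringBuilder sb = new StringBuilder(xml.length());
        xml.codePoints().filter(XmlCleaner::isLegalXmlChar).forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String raw = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[0] + ".clean"),
                stripIllegalChars(raw).getBytes(StandardCharsets.UTF_8));
    }
}

This mainly targets control characters such as 0x1A, which XML 1.0 forbids; malformed byte sequences are already replaced with U+FFFD by the String constructor.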

How to read a JSON string from a file in Mule 4

I'm trying to create an MUnit test that mocks an HTTP request by setting the payload to a JSON object that I have saved in a file. In Mule 3 I would have just done getResource('fileName.json').asString() and that worked just fine. In Mule 4 though, I can't statically call getResource.
I found a forum post on the Mulesoft forums that suggested I use MunitTools::getResourceAsString. When I run my test, I do see the JSON object but with all the \n and \r characters as well as a \ escaping all of the quotation marks. Obviously this means my JSON is no longer well formed.
Ideally I would like to find a reference for MunitTools so that I can see a list of functions that I can call and maybe find one that does not add the escape characters, but I haven't had any luck. If anybody knows of a some reference document that I can refer to, please let me know.
Not being able to find a way to return the data without the extra characters, I tried replacing them via DataWeave. That works fine for \n and \r, but there is also a \ in front of each double quote, and I can't seem to make those go away.
If I do this...
replace (/\/) with ("")
...I get an error. A co-worker suggested targeting each \" and replacing them with ", but that's a problem because that gives me """. To get around this, I've tried
replace(/\"/) with "\""
...which does not cause any errors, but for some reason it reads the \ as a literal so it replaces the original string with itself. I've also tried...
replace(/\"/) with '"'
...but that also results in an error
I'm open to any other solutions as well.
Thanks
--Drew
I had the same concern so I started using the readUrl() method. This is a DataWeave method so you should be able to use it in any MUnit processor. Here is an example of how I used it in the set event processor. It reads the JSON file and then converts it into Java for my own needs but you can just replace java with JSON for your needs.
<munit:set-event doc:name="Set Event" doc:id="e7b1da19-f746-4964-a7ae-c23aedce5e6f" >
<munit:payload mediaType="application/java" value="#[output application/java --- readUrl('classpath://singleItemRequest.json','application/json')]"/>
</munit:set-event>
Here is the documentation for readUrl https://docs.mulesoft.com/mule-runtime/4.2/dw-core-functions-readurl
Hope that helps!
Follow this snippet (more specifically the munit-tools:then-return tag):
<munit-tools:mock-when doc:name="Mock GET /users" doc:id="89c8b7fb-1e94-446f-b9a0-ef7840333328" processor="http:request" >
    <munit-tools:with-attributes >
        <munit-tools:with-attribute attributeName="doc:name" whereValue="GET /users" />
    </munit-tools:with-attributes>
    <munit-tools:then-return>
        <munit-tools:payload value="#[read(MunitTools::getResourceAsString('examples/responses/anypoint-get-users-response.json'), 'application/json')]" />
    </munit-tools:then-return>
</munit-tools:mock-when>
It mocks an HTTP request and returns a JSON object using the read() function.

Combination of Specific special character causes Error

When I send TextEdit data as JSON and the data contains the combination ";, the app fails every time.
In detail: if I enter my username as anything but the password as ";, the resulting JSON looks like:
{"UserName":"qa#1.com","Password":"\";"}
I have searched a lot; what I could understand is that the resultant JSON data violates the syntax, which results in a default exception being thrown. I tried to get rid of the special symbols by using the URLEncoder.encode() method. But now the problem is in decoding.
Any help at any step will be greatly appreciated.
Logcat:
I/SW_HttpClient(448): sending post: {"UserName":"qa#1.com","Password":"\";"}
I/SW_HttpClient(448): HTTPResponse received in [2326ms]
I/SW_HttpClient(448): stream returned: <!DOCTYPE html PUBLIC ---- AN HTML PAGE.... A DEFAULT HANDLER>
Hi, try the following code:
String EMPLOYEE_SERVICE_URI = Utils.authenticate+"?UserName="+uid+"&Email="+eid+"&Password="+URLEncoder.encode(pwd,"UTF-8");
The JSON you provided in the Question is valid.
The JSON spec requires double quotes in strings to be escaped with a backslash. Read the syntax graphs here - http://www.json.org/.
If something is throwing an exception while parsing that JSON, then either the parser is buggy or the exception means something else.
I have searched a lot; what I could understand is that the resultant JSON data violates the syntax
Your understanding is incorrect.
I tried to get rid of special symbol by using URLEncoder.encode() method.
That is a mistake, and is only going to make matters worse:
The backslash SHOULD be there.
The server or whatever that processes the JSON will NOT be expecting random escaping from a completely different standard.
But now the problem is in decoding.
Exactly.
The provided JSON can be parsed with the Gson library using the code below:
private String sampledata = "{\"UserName\":\"qa#1.com\",\"Password\":\"\\\";\"}";
Gson g = new Gson();
g.fromJson(sampledata, sample.class);

public class sample {
    public String UserName;
    public String Password;
}
For decoding the text, I found the solution:
URLDecoder.decode(String, String);

Digester: The element type "user" must be terminated by the matching end-tag "</user>"

I'm using Digester to parse an XML file and I get the following error:
May 3, 2011 6:41:25 PM org.apache.commons.digester.Digester fatalError
SEVERE: Parse Fatal Error at line 2336608 column 3: The element type "user" must be terminated by the matching end-tag "</user>".
org.xml.sax.SAXParseException: The element type "user" must be terminated by the matching end-tag "</user>".
However, 2336608 is the last line of my file, so I guess I'm opening a tag and never closing it. Do you know how I can find and fix it in big text files?
thanks
Write another script that scans each line of the file and, whenever it finds an opening <user> tag, increments a counter and prints
line number 1234 <user> opened (1 open total)
and whenever it finds a closing </user> tag, decrements the counter and prints
line number 4546 </user> closed (0 open total)
Since you have one more opening tag than closing tags, the final output of this script will tell you that 1 tag was left open. Assuming that your XML model does not allow nested <user> tags, the problematic declaration is wherever you see the output line number ... <user> opened (2 open total).
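A minimal sketch of such a scanner (assuming at most one tag per line and no nested <user> elements):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TagBalanceChecker {
    public static void main(String[] args) throws IOException {
        int open = 0;
        int lineNo = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (line.contains("<user>")) {
                    open++;
                    System.out.printf("line number %d <user> opened (%d open total)%n", lineNo, open);
                }
                if (line.contains("</user>")) {
                    open--;
                    System.out.printf("line number %d </user> closed (%d open total)%n", lineNo, open);
                }
            }
        }
    }
}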
$ grep -Hin "</\?user>" Text.xml will print out every line containing either <user> or </user>. If they're not nested, then you should be able to inspect that output and find the missing close tag (where a <user> immediately follows another <user>). A script to do the same:
https://gist.github.com/953837
This assumes that the open and close tags are on different lines.
Use tidy -xml -e <your-xml-file>. http://tidy.sourceforge.net/
Tidy is a great little tool for validating HTML, and in XML mode (-xml above) it will validate XML as well.
It prints out line and column numbers for parse errors.
Most of the major package managers (apt, port, etc.) will have pre-built packages for it.
I think there is no need to start scripting to detect XML errors.
You can use the W3 XML validator, for instance:
http://www.w3schools.com/xml/xml_validator.asp
I just pasted a 15 MB XML file in there and managed to fix it quite easily. You can also input the XML as a URL if you have the possibility to upload it somewhere. Java reported the error in a place which seemed fine, but this tool localized the actual error, and after correcting that, Java didn't error anymore.
There are many types of XML errors, and not all are related to the nested structure, so it is best to just use a well-known tool for this. For instance, my error was an argument error (I was missing a "), but Java detected a nesting problem.

"Content is not allowed in prolog" when parsing perfectly valid XML on GAE

I've been beating my head against this absolutely infuriating bug for the last 48 hours, so I thought I'd finally throw in the towel and try asking here before I throw my laptop out the window.
I'm trying to parse the response XML from a call I made to AWS SimpleDB. The response is coming back on the wire just fine; for example, it may look like:
<?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/">
    <ListDomainsResult>
        <DomainName>Audio</DomainName>
        <DomainName>Course</DomainName>
        <DomainName>DocumentContents</DomainName>
        <DomainName>LectureSet</DomainName>
        <DomainName>MetaData</DomainName>
        <DomainName>Professors</DomainName>
        <DomainName>Tag</DomainName>
    </ListDomainsResult>
    <ResponseMetadata>
        <RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId>
        <BoxUsage>0.0000071759</BoxUsage>
    </ResponseMetadata>
</ListDomainsResponse>
I pass this XML to a parser with
XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(response.getContent());
and call eventReader.nextEvent(); a bunch of times to get the data I want.
Here's the bizarre part -- it works great inside the local server. The response comes in, I parse it, everyone's happy. The problem is that when I deploy the code to Google App Engine, the outgoing request still works, and the response XML seems 100% identical and correct to me, but the response fails to parse with the following exception:
com.amazonaws.http.HttpClient handleResponse: Unable to unmarshall response (ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.): <?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/"><ListDomainsResult><DomainName>Audio</DomainName><DomainName>Course</DomainName><DomainName>DocumentContents</DomainName><DomainName>LectureSet</DomainName><DomainName>MetaData</DomainName><DomainName>Professors</DomainName><DomainName>Tag</DomainName></ListDomainsResult><ResponseMetadata><RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId><BoxUsage>0.0000071759</BoxUsage></ResponseMetadata></ListDomainsResponse>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
at com.amazonaws.transform.StaxUnmarshallerContext.nextEvent(StaxUnmarshallerContext.java:153)
... (rest of lines omitted)
I have double, triple, quadruple checked this XML for 'invisible characters' or non-UTF8 encoded characters, etc. I looked at it byte-by-byte in an array for byte-order-marks or something of that nature. Nothing; it passes every validation test I could throw at it. Even stranger, it happens if I use a Saxon-based parser as well -- but ONLY on GAE, it always works fine in my local environment.
It makes it very hard to trace the code for problems when I can only run the debugger on an environment that works perfectly (I haven't found any good way to remotely debug on GAE). Nevertheless, using the primitive means I have, I've tried a million approaches including:
XML with and without the prolog
With and without newlines
With and without the "encoding=" attribute in the prolog
Both newline styles
With and without the chunking information present in the HTTP stream
And I've tried most of these in multiple combinations where it made sense they would interact -- nothing! I'm at my wit's end. Has anyone seen an issue like this before that can hopefully shed some light on it?
Thanks!
The encodings in your XML and XSD (or DTD) are different.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-16'?>
Another possible scenario that causes this is when anything comes before the XML declaration, i.e. you might have something like this in the buffer:
helloworld<?xml version="1.0" encoding="utf-8"?>
or even a space or special character.
There are some special characters called byte order marks that could be in the buffer.
Before passing the buffer to the Parser do this...
String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
I had this issue while inspecting the XML file in Notepad++ and saving the file, even though I had the UTF-8 XML declaration <?xml version="1.0" encoding="utf-8"?> at the top.
It got fixed by saving the file in Notepad++ with Encoding > Encode in UTF-8 selected (it was Encode in UTF-8-BOM).
This error message is always caused by invalid XML content before the first element, for example an extra small dot "." at the beginning of the document.
Any characters before the "<?xml…" will cause the "org.xml.sax.SAXParseException: Content is not allowed in prolog" error message.
To fix it, just delete all those weird characters before the "<?xml".
Ref: http://www.mkyong.com/java/sax-error-content-is-not-allowed-in-prolog/
I caught the same error message today.
The solution was to change the document from UTF-8 with BOM to UTF-8 without BOM.
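If re-saving by hand isn't an option, here is a minimal sketch of doing the same programmatically (my illustration, not the poster's code): strip the three UTF-8 BOM bytes (0xEF 0xBB 0xBF) from the front of the file before parsing.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomStripper {
    public static void main(String[] args) throws IOException {
        Path p = Paths.get(args[0]);
        byte[] data = Files.readAllBytes(p);
        // The UTF-8 byte order mark is EF BB BF; drop it if present.
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            Files.write(p, Arrays.copyOfRange(data, 3, data.length));
        }
    }
}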
I was facing the same issue. In my case the XML files were generated from a C# program and fed into AS400 for further processing. After some analysis I identified that I was using UTF-8 encoding while generating the XML files, whereas javac (on AS400) uses "UTF-8 without BOM".
So I had to write extra code similar to that below:
//create encoding with no BOM
Encoding outputEnc = new UTF8Encoding(false);
//open file with encoding
TextWriter file = new StreamWriter(filePath, false, outputEnc);
file.Write(doc.InnerXml);
file.Flush();
file.Close(); // save and close it
In my XML file, the header looked like this:
<?xml version="1.0" encoding="utf-16"?>
In a test file, I was reading the file bytes and decoding the data as UTF-8 (not realizing the header in this file was utf-16) to create a string.
byte[] data = Files.readAllBytes(Paths.get(path));
String dataString = new String(data, "UTF-8");
When I tried to deserialize this string into an object, I was seeing the same error:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
When I updated the second line to
String dataString = new String(data, "UTF-16");
I was able to deserialize the object just fine. So as Romain had noted above, the encodings need to match.
Removing the XML declaration solved it:
<?xml version='1.0' encoding='utf-8'?>
Unexpected reason: # character in file path
Due to some internal bug, the error Content is not allowed in prolog also appears if the file content itself is 100% correct but you are supplying the file name like C:\Data\#22\file.xml.
This may possibly apply to other special characters, too.
How to check: If you move your file into a path without special characters and the error disappears, then it was this issue.
I was facing the same "Content is not allowed in prolog" problem with my XML file.
Solution
Initially my root folder was '#Filename'.
When I removed the first character '#', the error got resolved.
There's no need to remove the # from the filename...
Try it this way:
Instead of passing a File or URL object to the unmarshaller method, use a FileInputStream.
File myFile = new File("........");
Object obj = unmarshaller.unmarshal(new FileInputStream(myFile));
In the spirit of "just delete all those weird characters before the <?xml", here's my Java code, which works well with input via a BufferedReader:
BufferedReader test = new BufferedReader(new InputStreamReader(fisTest));
test.mark(4);
while (true) {
    int earlyChar = test.read();
    System.out.println(earlyChar);
    if (earlyChar == 60) { // 60 == '<', the start of the XML declaration
        test.reset();
        break;
    } else {
        test.mark(4);
    }
}
FWIW, the bytes I was seeing are (in decimal) 239, 187, 191: that's the UTF-8 byte order mark (0xEF 0xBB 0xBF).
I had a tab character instead of spaces.
Replacing the tab '\t' fixed the problem.
Cut and paste the whole doc into an editor like Notepad++ and display all characters.
In my instance of the problem, the solution was to replace German umlauts (äöü) with their HTML equivalents...
The following are causes of the "org.xml.sax.SAXParseException: Content is not allowed in prolog" exception:
First, check the file paths of schema.xsd and file.xml.
The encodings in your XML and XSD (or DTD) should be the same.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-8'?>
Anything coming before the XML declaration, e.g.: hello<?xml version='1.0' encoding='utf-16'?>
I zipped the XML on macOS and sent it to a Windows machine; the default compression changed the files' encoding, which produced this message.
Happened to me with @JmsListener in Spring Boot when listening to IBM MQ. My method received a String parameter and got this exception when I tried to deserialize it using JAXB.
It seemed that the String I got was the result of byte[].toString(): a list of comma-separated numbers.
I solved it by changing the parameter type to byte[] and then creating a String from it:
@JmsListener(destination = "Q1")
public void receiveQ1Message(byte[] msgBytes) {
    var msg = new String(msgBytes);
    // ... deserialize msg with JAXB as before
}
