java sax parser mangles attributes for xml 1.1

I'm using java's sax classes to parse an xml file. If the xml file says version 1.0, everything works fine, but if it says version 1.1, then some of the attributes get mangled, giving me the wrong results but not throwing any kind of exception.
My xml file basically looks like this:
<?xml version="1.1" encoding="UTF-8" ?>
<gpx>
<trk>
<name>Name of the track</name>
<trkseg>
<trkpt lat="12.3456789" lon="1.2345678">
<ele>1234</ele>
<time>2013-03-26T12:34:56Z</time>
<speed>0</speed>
</trkpt>
... and then 419 further identical copies of this trkpt
</trkseg>
</trk>
</gpx>
So what I expect, when I use sax to parse this file, is to find 420 trkpt tags, and for each of them to have lat and lon attributes. In particular, I expect to find 420 "lat" attributes which are all "12.3456789".
For the parsing I construct a handler object and give it the stream to this local file:
SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
inStream = new FileInputStream(file);
saxParser.parse(inStream, handler);
System.out.println("done");
The handler class extends org.xml.sax.helpers.DefaultHandler and just has one method, startElement to react to the opening of the trkpt tag:
public void startElement(String uri, String localName, String qName, Attributes attributes)
{
if (qName.equals("trkpt") && attributes != null
&& attributes.getLength() == 2
&& attributes.getValue(0).charAt(0) != '1')
{
// The trkpt tag has two attributes
// but the value of the first one doesn't begin with '1'
System.out.println(attributes.getQName(0) + " = " + attributes.getValue(0));
}
super.startElement(uri, localName, qName, attributes);
}
So what is the result?
If the xml file has version 1.0, then all I see is "done". 420 trkpt tags were found, all of them had two attributes, the first one was always called "lat" and the value of this attribute always started with '1', as I expect. Great!
If the xml file is changed to specify version="1.1" on the first line, then I get the following output:
lat = :34.56Z</t
lat = :56Z</time
done
So even though all my 420 points should be identical, two of them gave me a completely wrong attribute value. No exceptions are thrown. Still 420 trkpts were found, and all had two attributes called "lat" and "lon". Oddly the lon values are always ok.
I created this xml file in a text editor by direct copy/pasting the first trkpt, so I'm sure that all the values are identical, I'm sure there are no points in the xml file with funny attribute values, and I'm sure that there are no non-ascii character values or entity codes or anything else odd about the file.
I've tried it using Sun's JRE6, OpenJDK6 and OpenJDK7, on three different machines with two different OSs. So either I'm doing something wrong, or this particular xml file is incompatible with xml1.1 somehow, or there's a widespread sax bug (which seems unlikely as I would expect it to affect lots of people). Again, please note, with xml1.0 it all works fine. Also note, there's nothing special about the number 420, it's just that if the file only has 100 entries then they all get parsed properly. If you have several thousand entries then a certain number of them get their first attribute value mangled in this way. The length of the attribute value always seems to be correct but it's pulling characters out from the wrong point in the file. Index overflow perhaps?
I tried removing all the speed tags, but the problem still persists if you have enough trkpts. It's also sensitive to additional whitespace, so the problem occurs with different points or gives back different attribute values if I add line breaks between the trkpts.

This bug has been present in the JDK XML parser for years, and neither Sun nor Oracle has shown any interest in fixing it. I strongly advise using the Apache Xerces XML parser in preference.
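To check whether a particular JDK shows the behaviour described in the question, a small in-memory reproducer can help. The following is only a sketch and is not from the original post: the class name Xml11AttributeCheck and its count method are made up for illustration. It builds the same GPX document as a string, parses it with the default JAXP SAX parser, and counts how many trkpt elements report a first attribute value other than 12.3456789. Whether the XML 1.1 run reports any mismatches presumably depends on the JDK version and on where the parser's internal buffers happen to end.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original question.
final class Xml11AttributeCheck {

    /**
     * Builds a GPX document with the given XML version and number of identical
     * trkpt elements, parses it, and returns
     * { trkpt elements seen, trkpt elements with an unexpected first attribute }.
     */
    static int[] count(String xmlVersion, int copies) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"").append(xmlVersion)
           .append("\" encoding=\"UTF-8\" ?>\n<gpx>\n<trk>\n")
           .append("<name>Name of the track</name>\n<trkseg>\n");
        for (int i = 0; i < copies; i++) {
            xml.append("<trkpt lat=\"12.3456789\" lon=\"1.2345678\">\n")
               .append("<ele>1234</ele>\n")
               .append("<time>2013-03-26T12:34:56Z</time>\n")
               .append("<speed>0</speed>\n</trkpt>\n");
        }
        xml.append("</trkseg>\n</trk>\n</gpx>\n");
        byte[] bytes = xml.toString().getBytes(StandardCharsets.UTF_8);

        int[] counts = new int[2];
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(bytes),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        if (!"trkpt".equals(qName)) {
                            return;
                        }
                        counts[0]++;
                        // Every copy is identical, so any other value is a mismatch.
                        if (attributes.getLength() != 2
                                || !"12.3456789".equals(attributes.getValue(0))) {
                            counts[1]++;
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return counts;
    }
}
```

Comparing count("1.0", 420) with count("1.1", 420) on the same JDK shows whether that parser is affected. If Apache Xerces is on the classpath, its factory class org.apache.xerces.jaxp.SAXParserFactoryImpl can be requested explicitly through SAXParserFactory.newInstance(String, ClassLoader) instead of relying on the JDK default.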

Related

Latest Open JDK 8 JAXB library fails to unmarshal objects with properties that contain new line characters

I am using Java on Ubuntu 16.04. Recently I upgraded to Open JDK java version "1.8.0_161" installed using the oracle-java8-installer package (package version 8u161-1~webupd8~0). Since this upgrade, I have been getting new exceptions when doing JAXB marshalling of Java objects.
Specifically, when attempting to use JAXB to marshal a Java object to XML I get the following exception if the Java object has a String property that contains any newline ("\n") characters and that String property is being serialized as element content in the XML. (As an aside, if the String property is serialized as attribute content, any newline character in the value of the String is converted to a space character and the exception is not triggered.)
What appears to be happening is that
com.sun.xml.internal.bind.v2.runtime.output.XMLStreamWriterOutput$NewLineEscapeHandler.escape
converts the newline character in the String property of the Java object to the entity reference
&#xa;. This entity reference is then written out to the XML output stream, but when the entity reference name is verified, the exception is thrown because #xa is not recognised as a valid entity reference name.
Is this the expected behaviour? If so, what should I do to preserve the newline characters in the serialization of the Java object? If not, what should I do to work around this problem?
The relevant part of the stack trace is:
... Caused by: javax.xml.stream.XMLStreamException: Invalid name start character '#' (code 35) (name "#xa")
at com.fasterxml.aalto.out.XmlWriter.throwOutputError(XmlWriter.java:472)
at com.fasterxml.aalto.out.XmlWriter.reportNwfName(XmlWriter.java:383)
at com.fasterxml.aalto.out.ByteXmlWriter.verifyNameComponent(ByteXmlWriter.java:235)
at com.fasterxml.aalto.out.ByteXmlWriter.constructName(ByteXmlWriter.java:181)
at com.fasterxml.aalto.out.WNameTable.findSymbol(WNameTable.java:324)
at com.fasterxml.aalto.out.StreamWriterBase.writeEntityRef(StreamWriterBase.java:615)
at net.galexy.fieldguide.jaxb.CustomXMLStreamWriter.writeEntityRef(CustomXMLStreamWriter.java:198)
at com.sun.xml.internal.bind.v2.runtime.output.XMLStreamWriterOutput$XmlStreamOutWriterAdapter.writeEntityRef(XMLStreamWriterOutput.java:277)
at com.sun.xml.internal.bind.v2.runtime.output.XMLStreamWriterOutput$NewLineEscapeHandler.escape(XMLStreamWriterOutput.java:242)
... 60 more
For example, if I unmarshall the following XML:
<?xml version='1.0' encoding='UTF-8'?>
<description>
<note>The text of the note</note>
</description>
and then attempt to marshall it back to XML then no exception is thrown.
If, however, there is a new line in the middle of the note content:
<?xml version='1.0' encoding='UTF-8'?>
<description>
<note>The text of
the note</note>
</description>
Then the exception is thrown.
The JAXB context that is being used is com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.
The JAXB marshaller that is being used is com.sun.xml.internal.bind.v2.runtime.MarshallerImpl
Looking for more information on the changes, I came across the following bug report that suggests others have encountered the same change with this release of JAXB:
JDK-8196491 Newlines in JAXB string values of SOAP-requests are escaped to "&#xa;"
The answer to this stack overflow question suggests that I can resume control over character escaping by getting my marshaller to use a custom implementation of com.sun.xml.bind.marshaller.CharacterEscapeHandler.
That puzzles me because javax.xml.bind.Marshaller does not appear to declare a static property name for com.sun.xml.bind.marshaller.CharacterEscapeHandler, while it does declare other property names like Marshaller.JAXB_FORMATTED_OUTPUT, which equals "jaxb.formatted.output".
Even if I could instruct the marshaller to use my custom character escape handler, I am not totally sure what I should be doing within that escape handler. Is there an appropriate base escape handler that I can override to inherit all of the standard escape handling while ensuring that I intervene to stop the escaping of newline characters?
I have also tried Oracle Java 9 (package version 9.0.4-1~webupd8~0) and that version of Java has the same issues.
I have also tried the next release of Oracle Java 8 (1.8.0_162) and that version has the same issues.
Downloading an older version of Java from the Oracle website (1.8.0_152) sorts out the problem but is not a satisfactory way of resolving the problem.

In my case, I'm using JAXB to convert a few objects into XML and serialise them to a file, via StAX/Woodstox. I've managed to fix the problem by filtering the XML that is being serialised. In detail, the approach is as follows:
Define a custom StreamWriter2Delegate, override writeEntityRef(), so that, when this method receives the wrong entity code (#xd or #xa), it invokes its delegate to actually write back the original character (i.e., \n or \r), which doesn't actually need to be escaped:
@Override
public void writeEntityRef ( String eref ) throws XMLStreamException
{
if ( eref == null || !eref.startsWith ( "#x" ) ) {
super.writeEntityRef ( eref );
return;
}
String hex = eref.substring ( 2 );
for ( char c: new char[] { '\r', '\n' } )
if ( Integer.toHexString ( c ).equals ( hex ) ) {
this.writeCharacters ( Character.toString ( c ) );
return;
}
super.writeEntityRef ( eref );
}
This is equivalent (apart from some overhead) to the fix they've already filed for this problem, which should be available with JDK8u192 (and should already be in JDK 9/10).
Wrap your XMLStreamWriter2 with the above filter, for instance:
FileOutputStream fout = new FileOutputStream ( "test.xml" );
WstxOutputFactory wsof = (WstxOutputFactory) WstxOutputFactory.newInstance();
XMLStreamWriter2 xmlOut = (XMLStreamWriter2) wsof.createXMLStreamWriter ( fout, CharsetNames.CS_UTF8 );
xmlOut = new NewLineFixWriterFilter ( xmlOut );
// Now write into xmlOut, directly or via JAXB
The complete/production code is here. It shouldn't be difficult to adapt the same approach to similar pipelines (in general, the problem occurs because com.sun.xml.internal.bind.v2.runtime.output.XMLStreamWriterOutput escapes \n and \r the wrong way, so the trick is to hijack this wrong encoding from the upper levels).
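For pipelines that use the plain javax.xml.stream.XMLStreamWriter rather than Woodstox's XMLStreamWriter2, the same interception can be sketched without writing a full delegate class. The example below is an assumption-laden sketch, not the code linked above: it swaps StreamWriter2Delegate for a java.lang.reflect.Proxy, and the names NewlineEntityFilter, wrap, literalFor and demo are hypothetical. It turns writeEntityRef("#xa") and writeEntityRef("#xd") back into literal characters and passes every other call through unchanged.

```java
import java.io.StringWriter;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Locale;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper illustrating the filtering idea with a dynamic proxy.
final class NewlineEntityFilter {

    /** Wraps a writer so that "#xa" / "#xd" entity references become literal characters. */
    static XMLStreamWriter wrap(XMLStreamWriter delegate) {
        return (XMLStreamWriter) Proxy.newProxyInstance(
            NewlineEntityFilter.class.getClassLoader(),
            new Class<?>[] { XMLStreamWriter.class },
            (proxy, method, args) -> {
                if (method.getName().equals("writeEntityRef")
                        && args != null && args.length == 1) {
                    String literal = literalFor((String) args[0]);
                    if (literal != null) {
                        delegate.writeCharacters(literal);
                        return null;
                    }
                }
                try {
                    // Everything else goes straight to the real writer.
                    return method.invoke(delegate, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause();
                }
            });
    }

    /** Returns "\n" or "\r" for the mis-encoded names, or null for any other name. */
    static String literalFor(String entityName) {
        if (entityName == null) {
            return null;
        }
        String name = entityName.toLowerCase(Locale.ROOT);
        if (name.equals("#xa")) {
            return "\n";
        }
        if (name.equals("#xd")) {
            return "\r";
        }
        return null;
    }

    /** Writes a small element through the filter using the JDK's default StAX writer. */
    static String demo() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer =
                wrap(XMLOutputFactory.newInstance().createXMLStreamWriter(out));
            writer.writeStartElement("note");
            writer.writeCharacters("The text of");
            writer.writeEntityRef("#xa"); // what the affected JAXB runtime emits
            writer.writeCharacters("the note");
            writer.writeEndElement();
            writer.flush();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

A proxy keeps the sketch short because XMLStreamWriter has several dozen methods; a hand-written delegate class, as in the answer, avoids the reflection overhead on every call.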

Geoff S,
I tried to comment on the existing post, but I quickly found out that you need 50 reputation, which I do not have.
It appears that I am experiencing a similar issue: when we moved to JDK 1.8.0_161 and 1.8.0_162, some of our SOAP services started throwing the exceptions below.
Feb 28, 2018 8:34:12 AM com.sun.xml.internal.messaging.saaj.soap.SOAPDocumentImpl createEntityReference
SEVERE: SAAJ0543: Entity References are not allowed in SOAP documents
SEVERE: java.lang.UnsupportedOperationException: Entity References are not allowed in SOAP documents
javax.xml.ws.WebServiceException: java.lang.UnsupportedOperationException: Entity References are not allowed in SOAP documents
at com.sun.xml.internal.ws.handler.ClientSOAPHandlerTube.callHandlersOnRequest(ClientSOAPHandlerTube.java:135)
at com.sun.xml.internal.ws.handler.HandlerTube.processRequest(HandlerTube.java:112)
at com.sun.xml.internal.ws.api.pipe.Fiber.__doRun(Fiber.java:1121)
at com.sun.xml.internal.ws.api.pipe.Fiber._doRun(Fiber.java:1035)
at com.sun.xml.internal.ws.api.pipe.Fiber.doRun(Fiber.java:1004)
at com.sun.xml.internal.ws.api.pipe.Fiber.runSync(Fiber.java:862)
at com.sun.xml.internal.ws.client.Stub.process(Stub.java:448)
at com.sun.xml.internal.ws.client.sei.SEIStub.doProcess(SEIStub.java:178)
at com.sun.xml.internal.ws.client.sei.SyncMethodHandler.invoke(SyncMethodHandler.java:93)
at com.sun.xml.internal.ws.client.sei.SyncMethodHandler.invoke(SyncMethodHandler.java:77)
at com.sun.xml.internal.ws.client.sei.SEIStub.invoke(SEIStub.java:147)
at com.sun.proxy.$Proxy38.getUserProfile(Unknown Source)
As indicated by the above question and other threads:
https://bugs.openjdk.java.net/browse/JDK-8196491
https://bugs.java.com/view_bug.do?bug_id=8196491
It has something to do with newlines in the payload. For example, some of our payloads include XML strings that have newlines, which cause the issue. However, if the newlines are removed prior to calling the service, then it works. See immediately below:
Fail
<?xml version="1.0" encoding="UTF-8"?>
<user>
<userId>XXXX</userId>
<name>XXXXXX, XXXXXX</name>
<phone>(xxx)xxx-xxxx</phone>
<title><![CDATA[MY TITLE]]></title>
<mail>xxx@xxxx.com</mail>
</user>
Works
<?xml version="1.0" encoding="UTF-8"?><user><userId>XXXX</userId><name>XXXXXX, XXXXXX</name><phone>(xxx)xxx-xxxx</phone><title><![CDATA[MY TITLE]]></title><mail>xxx@xxxx.com</mail></user>
Do you or anyone else know if there is a workaround other than stripping the newlines from the payload? Is this considered a bug in the latest Oracle JDK, and are there any plans to rectify the behavior?
Thanks
max

Invalid byte 1 of 1-byte UTF-8 sequence: RestTemplate [duplicate]

I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?

How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.
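The mismatch described above can be reproduced in a few lines. The sketch below is illustrative only: the class EncodingMismatchDemo and its methods are hypothetical, and the sample element borrows the pound sign from the question's data. It encodes a small document as ISO-8859-1 without an encoding declaration, so the parser assumes UTF-8 and fails, and then parses the same bytes again after telling the parser the real encoding through InputSource.setEncoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper demonstrating an encoding mismatch and its fix.
final class EncodingMismatchDemo {

    /** A document containing a pound sign, encoded as ISO-8859-1 with no declaration. */
    static byte[] latin1Sample() {
        return "<offer>MyMobile Blue \u00A344.99</offer>"
            .getBytes(StandardCharsets.ISO_8859_1);
    }

    /**
     * Parses the bytes and returns the root element's text, or null if parsing fails.
     * When encodingOrNull is null the parser falls back to its own detection,
     * which assumes UTF-8 for a document without a declaration or byte order mark.
     */
    static String rootText(byte[] xmlBytes, String encodingOrNull) {
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            if (encodingOrNull != null) {
                source.setEncoding(encodingOrNull);
            }
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException | IOException e) {
            return null; // e.g. "Invalid byte 1 of 1-byte UTF-8 sequence"
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In real code the encoding name would come from whatever produced the data (a database column definition, an HTTP header, a file format specification) rather than being hard-coded.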

Open the XML file in Notepad.
Make sure you don't have extra space at the beginning or end of the document.
Select File -> Save As.
Select Save as type -> All files.
Enter the file name as abcd.xml.
Select Encoding -> UTF-8 and click Save.

Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If the data is in an encoding other than UTF-8, substitute the correct encoding name.

I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
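The reason this helps is that getBytes() without an argument uses the platform default charset, which may not be UTF-8, while a parser reading an undeclared byte stream assumes UTF-8. A minimal sketch of the difference, using a hypothetical GetBytesCharsetDemo class that prints the bytes of a string in hex for a given charset:

```java
import java.nio.charset.Charset;

// Hypothetical helper showing how the same text maps to different bytes.
final class GetBytesCharsetDemo {

    /** Returns the bytes of the text in the given charset as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

For the pound sign, hex("\u00A3", StandardCharsets.UTF_8) gives C2 A3, while hex("\u00A3", StandardCharsets.ISO_8859_1) gives the single byte A3, which is not a valid UTF-8 sequence on its own.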

I had the same problem in my JSF application, where an XHTML page had a comment line containing some special characters. When I compared it with the previous version in Eclipse, it had a comment like this:
//Some �  special characters found
After I removed those characters, the page loaded fine. The error is mostly related to XML files, so compare the file with a working version.

I had this problem, but the file was in UTF-8; it was just that somehow one character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
If something is not encoded in UTF-8, iconv reports where the offending input is so that you can find it.
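Where iconv is not available, a similar check can be done from Java with a strict CharsetDecoder. This is a sketch under the assumption that the whole file fits in memory; the class name Utf8Check and the method firstInvalidOffset are made up for illustration. The decoder is configured to report malformed input instead of replacing it, and the byte offset of the first bad sequence is returned.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation of a byte array.
final class Utf8Check {

    /** Returns the offset of the first byte that is not valid UTF-8, or -1 if all bytes are valid. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never decodes to more chars than it has bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(Math.max(1, data.length));
        CoderResult result = decoder.decode(in, out, true);
        // On error the input buffer is positioned at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }
}
```

A file can be checked with Utf8Check.firstInvalidOffset(Files.readAllBytes(Paths.get("your_file"))); for very large files a streaming loop over the decoder would be needed instead.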

I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".

This error occurs when you try to load a Jasper report file with the extension .jasper
For example
"c://reports//EmployeeReport.jasper"
Instead, you should load the Jasper report file with the extension .jrxml
For example
"c://reports//EmployeeReport.jrxml"
See problem screenshot: https://i.stack.imgur.com/D5SzR.png
See solution screenshot: https://i.stack.imgur.com/VeQb9.png

I had a similar problem.
I had saved some XML in a file, and when reading it into a DOM document, it failed due to a special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.

I met the same problem, and after a long investigation of my XML file I found the cause: there were a few unescaped characters like « ».

For those who, like me, understand character encoding principles, have also read Joel's article (which is funny, as it contains wrong characters anyway) and still can't figure out what the heck is going on (spoiler alert: I'm a Mac user), the solution can be as simple as removing your local repo and cloning it again.
My code base had not changed since the last time it was running OK, so it made no sense to have UTF errors, given that our build system never complained about it... until I remembered that I had accidentally unplugged my computer a few days earlier with IntelliJ IDEA and the whole stack running (Java/Tomcat/Hibernate).
My Mac did a brilliant job of pretending nothing had happened and I carried on business as usual, but the underlying file system was left corrupted somehow. I wasted the whole day trying to figure this one out. I hope it helps somebody.

I had the same issue. My problem was that the "-Dfile.encoding=UTF8" argument was missing from JAVA_OPTIONS in the startWebLogic.cmd file of the WebLogic server.

You have a library that needs to be erased
Like the following library
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'

This error surprised me in production...
The error occurs because the character encoding is wrong, so the best solution is to implement a way to auto-detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...
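One caveat with the snippet above: InputSource.getEncoding() only returns an encoding that was previously set on the InputSource, and is null otherwise. An alternative that needs no extra code is to hand the raw byte stream to the parser and let it detect the encoding from the byte order mark and the XML declaration. The sketch below illustrates that with a hypothetical DeclaredEncodingDemo class and a UTF-16 document modelled on the input sample.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper: let the XML parser work out the encoding itself.
final class DeclaredEncodingDemo {

    /** A small RSS-like document encoded as UTF-16 (Java adds a byte order mark). */
    static byte[] utf16Sample() {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
            + "<rss version=\"2.0\"><channel/></rss>";
        return xml.getBytes(StandardCharsets.UTF_16);
    }

    /** Parses raw bytes without naming an encoding and returns the root element name. */
    static String rootName(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTagName();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This only works when the bytes reach the parser untouched; once the data has been turned into a String or wrapped in a Reader, the decoding has already happened and the declaration is no longer consulted.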

How to match JAXB elements in CIM/RDF?

While trying to load a model from a CIM/XML file according to IEC 61970 (Common Information Model, for power system models), I found a problem.
In JAXB, graphs between elements are expressed with @XmlIDREF and @XmlID, and the two values must be equal to match. But in CIM/RDF, a reference to a resource by ID, e.g. rdf:resource="#_37C0E103000D40CD812C47572C31C0AD", contains the "#" character; consequently JAXB is unable to match "GeographicalRegion" with "SubGeographicalRegion.Region" when the "#" character is present in the rdf:resource attribute.
Here is an example:
<cim:GeographicalRegion rdf:ID="_37C0E103000D40CD812C47572C31C0AD">
<cim:IdentifiedObject.name>GeoRegion</cim:IdentifiedObject.name>
<cim:IdentifiedObject.localName>OpenCIM3bus</cim:IdentifiedObject.localName>
</cim:GeographicalRegion>
<cim:SubGeographicalRegion rdf:ID="_ID_SubGeographicalRegion">
<cim:IdentifiedObject.name>SubRegion</cim:IdentifiedObject.name>
<cim:IdentifiedObject.localName>SubRegion</cim:IdentifiedObject.localName>
<cim:SubGeographicalRegion.Region rdf:resource="#_37C0E103000D40CD812C47572C31C0AD"/>
</cim:SubGeographicalRegion>

I realize you're asking for a solution using JAXB, but I would urge you to consider an RDF-based solution as it is more flexible and robust. You're basically trying to reinvent what RDF parsers already have built in. RDF/XML is a difficult format to parse, it doesn't make much sense to try and hack your own parsing together - especially since files that have very different XML structures can express exactly the same information: this only becomes apparent when looking at the level of the RDF. You may find that your JAXB parser workaround works on one CIM/RDF file but completely fails on another.
So, here's an example of how to process your file using the Sesame RDF API. No inferencing is involved, this just parses the file and puts it in an in-memory RDF model, which you can then manipulate and query from any angle.
Assuming the root element of your CIM file looks something like this:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:cim="http://example.org/cim/">
(only a guess of course, but I need prefixes for a proper example)
Then you can do the following, using Sesame's Rio RDF/XML parser:
String baseURI = "http://example.org/my/file";
FileInputStream in = new FileInputStream("/path/to/my/cim.rdf");
Model model = Rio.parse(in, baseURI, RDFFormat.RDFXML);
This creates an in-memory RDF model of your document. You can then simply filter-query over that. For example, to print out the properties of all resources that have _37C0E103000D40CD812C47572C31C0AD as their SubGeographicalRegion.Region:
String CIM_NS = "http://example.org/cim/";
ValueFactory vf = ValueFactoryImpl.getInstance();
URI subRegion = vf.createURI(CIM_NS, "SubGeographicalRegion.Region");
URI res = vf.createURI("http://example.org/my/file#_37C0E103000D40CD812C47572C31C0AD");
Set<Resource> subs = model.filter(null, subRegion, res).subjects();
for (Resource sub: subs) {
System.out.println("resource: " + sub + " has the following properties: ");
for (URI prop: model.filter(sub, null, null).predicates()) {
System.out.println(prop + ": " + model.filter(sub, prop, null).objectValue());
}
}
Of course at this point you can also choose to convert the model to some other syntax format for further handling by your application - as you see fit. The point is that the difference between the identifiers with the leading # and without has been resolved for you by the RDF/XML parser.
This is of course personal opinion only, since I don't know the details of your use case, but I think you'll find that this is quite quick and flexible. I should also point out that although the above solution keeps the entire model in memory, you can easily adapt this to a more streaming (and therefore less memory-intensive) approach if you find your files are too big.
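To make the point about the leading # concrete: in RDF/XML, rdf:ID="x" names the resource at the base URI plus "#x", and rdf:resource="#x" is a relative reference resolved against the same base, so both end up as the same absolute URI. The following sketch shows that equivalence with java.net.URI only; the class RdfIdResolution and its methods are hypothetical and do not replace a real RDF parser.

```java
import java.net.URI;

// Hypothetical helper illustrating how rdf:ID and rdf:resource values line up.
final class RdfIdResolution {

    /** rdf:ID="x" identifies the resource base + "#" + x. */
    static URI fromRdfId(String baseUri, String rdfId) {
        return URI.create(baseUri + "#" + rdfId);
    }

    /** rdf:resource="#x" is a relative reference resolved against the base URI. */
    static URI fromRdfResource(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference);
    }
}
```

With the base URI from the example above, fromRdfId("http://example.org/my/file", "_37C0E103000D40CD812C47572C31C0AD") and fromRdfResource("http://example.org/my/file", "#_37C0E103000D40CD812C47572C31C0AD") produce equal URIs, which is the normalisation the RDF/XML parser performs for you.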

How can I parse a namespace using the SAX parser?

Using a Twitter search URL, e.g. http://search.twitter.com/search.rss?q=android, returns RSS that has an item that looks like:
<item>
<title>#UberTwiter still waiting for #ubertwitter android app!!!</title>
<link>http://twitter.com/meals69/statuses/21158076391</link>
<description>still waiting for an app!!!</description>
<pubDate>Sat, 14 Aug 2010 15:33:44 +0000</pubDate>
<guid>http://twitter.com/meals69/statuses/21158076391</guid>
<author>Some Twitter User</author>
<media:content type="image/jpg" height="48" width="48" url="http://a1.twimg.com/profile_images/756343289/me2_normal.jpg"/>
<google:image_link>http://a1.twimg.com/profile_images/756343289/me2_normal.jpg</google:image_link>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
</item>
Pretty simple. My code parses out everything (title, link, description, pubDate, etc.) without any problems. However, I'm getting null on:
<google:image_link>
I'm using Java to parse the RSS feed. Do I have to handle compound localnames differently than I would a more simple localname?
This is the bit of code that parses out Link, Description, pubDate, etc:
@Override
public void endElement(String uri, String localName, String name)
throws SAXException {
super.endElement(uri, localName, name);
if (this.currentMessage != null){
if (localName.equalsIgnoreCase(TITLE)){
currentMessage.setTitle(builder.toString());
} else if (localName.equalsIgnoreCase(LINK)){
currentMessage.setLink(builder.toString());
} else if (localName.equalsIgnoreCase(DESCRIPTION)){
currentMessage.setDescription(builder.toString());
} else if (localName.equalsIgnoreCase(PUB_DATE)){
currentMessage.setDate(builder.toString());
} else if (localName.equalsIgnoreCase(GUID)){
currentMessage.setGuid(builder.toString());
} else if (uri.equalsIgnoreCase(AVATAR)){
currentMessage.setAvatar(builder.toString());
} else if (localName.equalsIgnoreCase(ITEM)){
messages.add(currentMessage);
}
builder.setLength(0);
}
}
startDocument looks like:
@Override
public void startDocument() throws SAXException {
super.startDocument();
messages = new ArrayList<Message>();
builder = new StringBuilder();
}
startElement looks like:
@Override
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
super.startElement(uri, localName, name, attributes);
if (localName.equalsIgnoreCase(ITEM)){
this.currentMessage = new Message();
}
}
Tony

An element like <google:image_link> has the local name image_link belonging to the google namespace. You need to ensure that the XML parsing framework is aware of namespaces, and you'd then need to find this element using the appropriate namespace.
For example, a few SAX1 interfaces in package org.xml.sax have been deprecated and replaced by SAX2 counterparts that include namespace support (e.g. the SAX1 Parser is deprecated and replaced by the SAX2 XMLReader). Consult the documentation on how to specify the namespace URI or the qualified (prefixed) qName.
See also
Wikipedia/XML namespace
package org.xml.sax
saxproject.org - Namespaces

From the sample it is not actually clear what namespace the 'google' prefix binds to. The previous answer is slightly incorrect in that the element is NOT in a "google" namespace; rather, it is in whatever namespace the prefix "google" binds to. As such you have to match the namespace (identified by URI), not the prefix. SAX does have a confusing way of reporting local name / namespace-prefix combinations, and it depends on whether namespace processing is even enabled.
You could also consider alternative XML processing libraries / APIs; while SAX implementations are performant, there are alternatives that are as fast and more convenient. StAX (javax.xml.stream.*) implementations like Woodstox (and even the default one that JDK 1.6 comes with) are fast and a bit more convenient. The StaxMate library, which builds on top of StAX, is much simpler to use for both reading and writing, and speed-wise as fast as SAX implementations like Xerces. Plus, the StAX API has less baggage with respect to namespace handling, so it is easier to see the actual namespace of elements.

Like user polygenelubricants said: generally the parser needs to be namespace aware to handle elements which belong to some prefixed namespace. (Like that <google:image_link> element.)
This needs to be set as a "parser feature", which AFAIK can be done in a few different ways: the XMLReader interface itself has a setFeature() method that can be used to set features for a particular parser, but you can also use the same method on the SAXParserFactory class so that the factory generates parsers with those features already on or off. The SAX2 standard feature flags should be on the SAX project's website, but at least some of them are also listed in the Java API documentation of package org.xml.sax.
For simple documents you can try to take a shortcut. If you don't actually care about namespaces and element names as a URI + local-name combination, and you can trust that the elements you are looking for (and only these) always have a certain prefix and that there aren't elements from other namespaces with the same local name, then you might solve your problem simply by using the qName parameter of the startElement() method instead of localName (or vice versa), or by adding/dropping the prefix from the tag name string you compare to.
The contents of the parameters namespaceUri, qName and localName are, according to the Java specs, actually optional, and AFAIK they might be null for this reason. Which of them are null depends on the aforementioned "parser features" that affect namespaces. I don't know whether the parameter that is null can vary between elements in a namespace and elements without a namespace; I haven't investigated that behaviour.
PS. XML is case sensitive, so ideally you don't need to ignore case in tag name string comparisons. First post, yay!
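Putting the advice from these answers together, a namespace-aware handler could look roughly like the sketch below. It is illustrative only: the class NamespaceAwareSax, the method textOf and the namespace URI http://example.org/google-ns are assumptions, since the feed's real xmlns:google declaration is not shown in the question. The factory is switched to namespace-aware mode and the element is matched by namespace URI plus local name rather than by prefix.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: match an element by namespace URI and local name.
final class NamespaceAwareSax {

    /** Returns the text of elements with the given namespace URI and local name. */
    static String textOf(String xml, String namespaceUri, String wantedLocalName) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    private boolean inside;

                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        inside = namespaceUri.equals(uri)
                            && wantedLocalName.equals(localName);
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inside) {
                            text.append(ch, start, length);
                        }
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        inside = false;
                    }
                });
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }
}
```

Matching on the URI means the code keeps working if the feed later uses a different prefix for the same namespace.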

Might help someone using the Android SAX util. I was trying geo:lat to get the lat element from the geo namespace.
Sample XML:
<item>
<title>My Item title</title>
<geo:lat>40.720741</geo:lat>
</item>
First attempt returned null:
item.getChild("geo:lat");
As suggested above, I found passing the namespace URI to the getChild method worked.
item.getChild("http://www.w3.org/2003/01/geo/wgs84_pos#", "lat");

Using startPrefixMapping method of my xml handler I was able to parse out text of a namespace.
I placed several calls to this method beneath my handler instantiation.
GoogleReader xmlhandler = new GoogleReader();
xmlhandler.startPrefixMapping("dc", "http://purl.org/dc/elements/1.1/");
where dc is the namespace prefix, as in <dc:author>some text</dc:author>
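For comparison, startPrefixMapping is normally a callback that a namespace-aware parser invokes on the handler whenever a prefix declaration comes into scope. The sketch below, with the hypothetical class PrefixMappingDemo, overrides it to record which prefixes the document binds to which URIs; it is a different use of the method from the manual calls above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collect the prefix-to-URI bindings reported by the parser.
final class PrefixMappingDemo {

    /** Returns the prefix-to-URI bindings declared in the document, in document order. */
    static Map<String, String> prefixes(String xml) {
        Map<String, String> seen = new LinkedHashMap<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // prefix mappings are only reported in this mode
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startPrefixMapping(String prefix, String uri) {
                        seen.put(prefix, uri);
                    }
                });
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

The recorded map can then be used to translate a prefixed name such as dc:author into the namespace URI and local name that the handler should compare against.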

"Content is not allowed in prolog" when parsing perfectly valid XML on GAE

I've been beating my head against this absolutely infuriating bug for the last 48 hours, so I thought I'd finally throw in the towel and try asking here before I throw my laptop out the window.
I'm trying to parse the response XML from a call I made to AWS SimpleDB. The response is coming back on the wire just fine; for example, it may look like:
<?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/">
<ListDomainsResult>
<DomainName>Audio</DomainName>
<DomainName>Course</DomainName>
<DomainName>DocumentContents</DomainName>
<DomainName>LectureSet</DomainName>
<DomainName>MetaData</DomainName>
<DomainName>Professors</DomainName>
<DomainName>Tag</DomainName>
</ListDomainsResult>
<ResponseMetadata>
<RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId>
<BoxUsage>0.0000071759</BoxUsage>
</ResponseMetadata>
</ListDomainsResponse>
I pass in this XML to a parser with
XMLEventReader eventReader = xmlInputFactory.createXMLEventReader(response.getContent());
and call eventReader.nextEvent(); a bunch of times to get the data I want.
Here's the bizarre part -- it works great inside the local server. The response comes in, I parse it, everyone's happy. The problem is that when I deploy the code to Google App Engine, the outgoing request still works, and the response XML seems 100% identical and correct to me, but the response fails to parse with the following exception:
com.amazonaws.http.HttpClient handleResponse: Unable to unmarshall response (ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.): <?xml version="1.0" encoding="utf-8"?>
<ListDomainsResponse xmlns="http://sdb.amazonaws.com/doc/2009-04-15/"><ListDomainsResult><DomainName>Audio</DomainName><DomainName>Course</DomainName><DomainName>DocumentContents</DomainName><DomainName>LectureSet</DomainName><DomainName>MetaData</DomainName><DomainName>Professors</DomainName><DomainName>Tag</DomainName></ListDomainsResult><ResponseMetadata><RequestId>42330b4a-e134-6aec-e62a-5869ac2b4575</RequestId><BoxUsage>0.0000071759</BoxUsage></ResponseMetadata></ListDomainsResponse>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(Unknown Source)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
at com.amazonaws.transform.StaxUnmarshallerContext.nextEvent(StaxUnmarshallerContext.java:153)
... (rest of lines omitted)
I have double, triple, quadruple checked this XML for 'invisible characters' or non-UTF8 encoded characters, etc. I looked at it byte-by-byte in an array for byte-order-marks or something of that nature. Nothing; it passes every validation test I could throw at it. Even stranger, it happens if I use a Saxon-based parser as well -- but ONLY on GAE, it always works fine in my local environment.
It makes it very hard to trace the code for problems when I can only run the debugger on an environment that works perfectly (I haven't found any good way to remotely debug on GAE). Nevertheless, using the primitive means I have, I've tried a million approaches including:
XML with and without the prolog
With and without newlines
With and without the "encoding=" attribute in the prolog
Both newline styles
With and without the chunking information present in the HTTP stream
And I've tried most of these in multiple combinations where it made sense they would interact -- nothing! I'm at my wit's end. Has anyone seen an issue like this before that can hopefully shed some light on it?
Thanks!

One possible cause is that the encodings of your XML and XSD (or DTD) are different.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-16'?>
Another possible scenario that causes this is when anything comes before the XML declaration, i.e. you might have something like this in the buffer:
helloworld<?xml version="1.0" encoding="utf-8"?>
or even a space or special character.
In particular, a special character sequence called a byte order mark (BOM) could be at the start of the buffer.
Before passing the buffer to the parser, do this:
String xml = "<?xml ...";
xml = xml.trim().replaceFirst("^([\\W]+)<","<");
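A narrower variant of the same clean-up, shown here only as a sketch, removes nothing but a leading byte order mark. Once a UTF-8 BOM has been decoded into a String it shows up as the single character U+FEFF, so checking the first character is enough; any other junk before the declaration is left in place so that the parser still reports it. The class name BomDemo and its methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration.
final class BomDemo {
    /** Drops a single leading U+FEFF (a decoded byte order mark), if present. */
    static String stripLeadingBom(String xml) {
        if (!xml.isEmpty() && xml.charAt(0) == '\uFEFF') {
            return xml.substring(1);
        }
        return xml;
    }

    /** Strips the BOM, parses the text and returns the root element name. */
    static String rootName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(stripLeadingBom(xml))));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Simulates a document whose BOM survived decoding into a String.
        String withBom = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><root/>";
        System.out.println(rootName(withBom)); // prints root
    }
}
```

Unlike the regular expression above, this leaves leading text such as "helloworld" untouched, which is usually what you want when that text indicates a real problem upstream.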

I ran into this issue after inspecting the XML file in Notepad++ and saving it, even though the file started with the UTF-8 XML declaration <?xml version="1.0" encoding="utf-8"?>
It was fixed by saving the file in Notepad++ with Encoding (tab) > Encode in UTF-8 selected (it had been Encode in UTF-8-BOM).

This error message is typically caused by invalid content before the first XML element, for example a stray small dot "." at the very beginning of the file.
Any characters before the "<?xml ..." declaration will cause the above "org.xml.sax.SAXParseException: Content is not allowed in prolog" error message.
To fix it, just delete all those stray characters before the "<?xml".
Ref: http://www.mkyong.com/java/sax-error-content-is-not-allowed-in-prolog/

I hit the same error message today.
The solution was to change the document from UTF-8 with BOM to UTF-8 without BOM.

I was facing the same issue. In my case the XML files were generated by a C# program and fed into an AS400 for further processing. After some analysis I found that I was using the default UTF8 encoding (which writes a BOM) while generating the XML files, whereas Java (on the AS400) expected "UTF8 without BOM".
So I had to write extra code similar to the following:
//create encoding with no BOM
Encoding outputEnc = new UTF8Encoding(false);
//open file with encoding
TextWriter file = new StreamWriter(filePath, false, outputEnc);
file.Write(doc.InnerXml);
file.Flush();
file.Close(); // save and close it

In my xml file, the header looked like this:
<?xml version="1.0" encoding="utf-16"? />
In a test file, I was reading the file bytes and decoding the data as UTF-8 (not realizing the header in this file was utf-16) to create a string.
byte[] data = Files.readAllBytes(Paths.get(path));
String dataString = new String(data, "UTF-8");
When I tried to deserialize this string into an object, I was seeing the same error:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
When I updated the second line to
String dataString = new String(data, "UTF-16");
I was able to deserialize the object just fine. So as Romain had noted above, the encodings need to match.
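One way to avoid this kind of mismatch, sketched here under the assumption that the raw bytes of the document are available, is to skip the manual decoding step and hand the bytes to the parser, which can then work out the encoding from the byte order mark and the XML declaration. The class name DeclaredEncodingDemo and the method rootText are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class for illustration.
final class DeclaredEncodingDemo {
    /** Parses raw bytes and returns the text content of the root element. */
    static String rootText(byte[] rawXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // No new String(data, ...) here: the parser decodes the bytes itself.
            Document doc = builder.parse(new ByteArrayInputStream(rawXml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><root>ok</root>";
        // StandardCharsets.UTF_16 writes a byte order mark followed by big-endian code units.
        byte[] raw = xml.getBytes(StandardCharsets.UTF_16);
        System.out.println(rootText(raw));
    }
}
```

If a String is unavoidable, the charset passed to new String(...) has to be the one the bytes were actually written in, as described above.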

Removing the xml declaration solved it
<?xml version='1.0' encoding='utf-8'?>

Unexpected reason: # character in file path
Due to some internal bug, the error Content is not allowed in prolog also appears if the file content itself is 100% correct but you are supplying the file name like C:\Data\#22\file.xml.
This may possibly apply to other special characters, too.
How to check: If you move your file into a path without special characters and the error disappears, then it was this issue.

I was facing the same "Content is not allowed in prolog" problem with my XML file.
Solution
Initially my root folder was '#Filename'.
When I removed the first character, '#', the error was resolved.

There is no need to remove the '#' from the name. Try it this way:
Instead of passing a File or URL object to the unmarshal method, use a FileInputStream.
File myFile = new File("........");
Object obj = unmarshaller.unmarshal(new FileInputStream(myFile));

In the spirit of "just delete all those weird characters before the <?xml", here's my Java code, which works well with input via a BufferedReader:
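BufferedReader test = new BufferedReader(new InputStreamReader(fisTest));
test.mark(4);
while (true) {
    int earlyChar = test.read();
    System.out.println(earlyChar); // debug output: shows what precedes the '<'
    if (earlyChar == -1) {
        break; // end of stream reached without finding '<'
    }
    if (earlyChar == '<') {
        test.reset(); // rewind so the next read returns the '<' again
        break;
    }
    test.mark(4); // discard this character and remember the new position
}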
FWIW, the bytes I was seeing are (in decimal): 239, 187, 191.

I had a tab character instead of spaces.
Replacing the tab '\t' fixed the problem.
Cut and paste the whole doc into an editor like Notepad++ and display all characters.

In my instance of the problem, the solution was to replace German umlauts (äöü) with their HTML equivalents...

Below are possible causes of the "org.xml.sax.SAXParseException: Content is not allowed in prolog" exception.
First, check the file paths of schema.xsd and file.xml.
The encoding in your XML and XSD (or DTD) should be the same.
XML file header: <?xml version='1.0' encoding='utf-8'?>
XSD file header: <?xml version='1.0' encoding='utf-8'?>
Also check whether anything comes before the XML declaration, e.g.: hello<?xml version='1.0' encoding='utf-16'?>

I zipped the XML file on Mac OS and sent it to a Windows machine; the default compression changed the files, and the resulting encoding problem produced this message.

This happened to me with @JmsListener in Spring Boot when listening to IBM MQ. My method received a String parameter and I got this exception when I tried to deserialize it using JAXB.
It seemed that the string I got was the result of byte[].toString(). It was a list of comma-separated numbers.
I solved it by changing the parameter type to byte[] and then creating a String from it:
@JmsListener(destination = "Q1")
public void receiveQ1Message(byte[] msgBytes) {
    var msg = new String(msgBytes);
    // ... deserialize msg with JAXB
}

java sax parser mangles attributes for xml 1.1 - java

This bug has been present in the JDK XML parser for years, and neither Sun nor Oracle has shown any interest in fixing it. I strongly advise using the Apache Xerces XML parser in preference.
