How to parse badly formed XML in Java?

How to parse badly formed XML in Java? - java

I have XML that I need to parse but have no control over the creation of. Unfortunately it's not very strict XML and contains things like:
<mytag>This won't parse & contains an ampersand.</mytag>
The javax.xml.stream classes don't like this at all, and rightly error with:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[149,50]
Message: The entity name must immediately follow the '&' in the entity reference.
How can I work around this? I can't change the XML, so I guess I need an error-tolerant parser.
My preference would be for a fix that doesn't require too much disruption to the existing parser code.

Use libraries such as tidy or tagsoup.
TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

If it's not valid XML (like the above) then no XML parser will handle it (as you've identified). If you know the scope of the errors (such as the above entity issue), then the simplest solution may be to run a correcting process over it (fixing entities such as inserting entities) and then feed it to an existing parser.
Otherwise you'll have to code one yourself with built-in support for such anomalies. And I can't believe that's anything other than a tedious and error-prone task.

I believe JSoup can handle badly formed XML

Related

Parse Ampersand in XML with Java's DOM XML API

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters the ampersand (&) when parsing a text node, it errors out. I am guessing that this is solvable with 1)escaping, 2) encoding or 3) Use a different parser.
I am reading an XML document that I dont have any control over, so I cannot precisely identify where the ampersand appears in the document every time I read it.
The answers I have seen to similar questions have advised replacing the entity type when parsing the XML, but I am not sure how I will be able to do that since, it doesnt even parse when it encounters the XML ampersand.
Any help will be appreciated.

As noted, the XML is malformed (oops!): all occurrences of & in XML (other than the token introducing a character entity [?]) must be encoded as &.
Some solutions (which are basically just as described in the post!):
Fix the XML (at source, or in hack-it-up phase), or;
Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)
For the "hack-it-up" approach, consider a separate input stream -- see Working with Filter Streams -- that executes as a filter prior to the actual DOM parser: whenever a & is encountered (that is not part of a character entity) it "fixes it" by inserting & into the stream. Of course, if the XML source didn't get basic encoding correct...
Happy coding.

"I am reading an XML document that I dont have any control over".
No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.

parsing/scanning/tokenizing "raw XML"

I have an application where I need to parse or tokenize XML and preserve the raw text (e.g. don't parse entities, don't convert whitespace in attributes, keep attribute order, etc.) in a Java program.
I've spent several hours today trying to use StAX, SAX, XSLT, TagSoup, etc. before realizing that none of them do this. I can't afford to spend much more time attacking this problem, and parsing the text manually seems highly nontrivial. Is there any Java library that can help me tokenize the XML?
edit: why am I doing this? -- I have a large XML file that I want to make a small number of localized changes programmatically, that need to be reviewed. It is highly valuable to be able to use a diff tool. If the parser/filter normalizes the XML, then all I see is "red ink" in the diff tool. The application that produces the XML in the first place isn't something that I can easily have changed to produce "canonical XML", if there is such a thing.

I think you might have to generate your own grammar.
Some links:
Parsing XML with ANTLR Tutorial
ANTXR
XPA
http://www.google.com/search?q=antlr+xml

I don't think any XML parser will do what you want. Why ? For instance, the XML spec doesn't enforce attribute ordering. I think you're going to have to parse it yourself, and that is non-trivial.
Why do you have to do this ? I'm guessing you have some client 'XML' that enforces or relies on non-standard construction. In that case I'd push back and get that fixed, rather than jump through numerous fixes to try and accommodate this.

I'm not entirely sure that I understand what it is you are trying to do. Have you tried using CDATA regions for the parts of the document you don't want the parser to touch?
Also relying on attribute order is not a good idea - if I remember the XML standard correctly then order is never to be expected.
It sounds like you are dealing with some malformed XML and that it would be easier to first turn it into proper XML.

What does the org.apache.xmlbeans.XmlException with a message of "Unexpected element: CDATA" mean?

I'm trying to parse and load an XML document, however I'm getting this exception when I call the parse method on the class that extends XmlObject. Unfortunately, it gives me no ideas of what element is unexpected, which is my problem.
I am not able to share the code for this, but I can try to provide more information if necessary.

Not being able to share code or input data, you may consider the following approach. That's a very common dichotomic approach to diagnostic, I'm afraid, and indeed you may readily have started it...
Try and reduce the size of the input XML by removing parts of it, ensuring that the underlying XML document remains well formed and possibly valid (if validity is required in your parser's setup). If you maintain validity, this may require to alter [a copy of] the Schema (DTD or other), as manditory elements might be removed during the cut-and-try approach... BTW, the error message seems to hint more at a validation issue that a basic well-formedness assertion issue.
Unless one has a particular hunch as to the area that triggers the parser's complaint, we typically remove (or re-add, when things start working) about half of what was previously cut or re-added.
You may also start with trying a mostly empty file, to assert that the parser does work at all... There again is the idea to "divide to prevail": is the issue in the XML input or in the parser ? (remembering that there could be two issues, one in the input and one in the parser, and thtat such issues could even be unrelated...)
Sorry to belabor basic diagnostics techniques which you may well be fluent with...

You should check the arguments you are passing to the method parse();
If you are directly passing a string to parse or file or inputstream accordingly (File/InputStream/String) etc.

The exception is caused by the length of the XML file. If you add or remove one character from the file, the parser will succeed.
The problem occurs within the 3rd party PiccoloLexer library that XMLBeans relies on. It has been fixed in revision 959082 but has not been applied to xbean 2.5 jar.
XMLBeans - Problem with XML files if length is exactly 8193bytes
Issue reported on XMLBean Jira

Is there a validating HTML parser implemented in Java?

I need to parse HTML 4 in Java.
Ideally I'd like an implementation that is SAX compatible.
I'm aware that there are numerous HTML parsers in for Java, however, they all seem to perform 'tidying'. In other words, they will correct badly formed HTML. I don't want this.
My requirements are:
No tidying.
If the input document is invalid HTML parsing should fail.
The document should be validatable against the HTML DTDs.
The parser can produce SAX2 events.
Is there a library that meets these requirements?

You can find a collection of HTML parsers here HTML Parsers. I don't remeber exactly but I think TagSoup parses the file without applying corrections...

I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid HTML parsing should fail.') in that it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.
Try typing invalid html into this Jericho formatting demo, and note the 'Parser Log' at the bottom of the page:
http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp
So yes, this is doing tag tidying, but it is at least telling you about it - you can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your source, and then proceeding depending on what errors are logged out. This is a small example:
Source source=new Source("<a>I forgot to close my link!");
source.setLogger(myListeningLogger);
source.getSourceFormatter().writeTo(new NullWriter());
// myListeningLogger has now had all the HTML flaws written to it
In the example above, your logger's info() method is called with the string: 'StartTag at (r1,c1,p0) missing required end tag', which is relatively parseable, and you can always decide to just reject any HTML that logs any message worse than debug - in fact Jericho logs almost all errors as 'info' level, with a couple at 'warn' level (you might be tempted to create a small fork with the severities adjusted to correspond to what you care about).
Jericho is available on Maven Central, which is always a good sign:
http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html
Good luck!

You may wish to check http://lobobrowser.org/cobra.jsp. They have a pure Java web browser (Lobo) implemented. They have the parser component (Cobra) pulled out separately for use. I honestly am not sure if it will do what you require with the "no tidying" requirement, but it may be worth a look. I ran across it when exploring the wild for a pure Java web browser.

You can try to subclass javax.swing.text.html.parser.Parser and implement the handleXXX() methods. It seems it doesn't try to fix the XML. See more at the API

Castor and sockets

I'm new to Castor and data binding in general. I'm working on an application that, in part, needs to take data off of a socket and unmarshall the data to make POJOs. Now, I've got the socket stuff down, and I've even generated and compiled java files thanks to Ant and Castor.
Here's the problem: the data stream that I'll receive could be one of about 9 different objects. That is, I receive a stream of text (XML) that represents an object with stuff that I'll operate on; again, depending on the object type. If it were just one object, it'd be easy: call the unmarshall commands on it and go on my merry way. But, since it could be one of many kinds of objects, who do I know what to unmarshall? I read up on mapping, but either I didn't get it, or it seems like a static mapping, not a dynamic mapping.
Any help out there?

You are right, Castor expects a static mapping. But you can work with that. You can write some code that will modify the incoming xml so that, on your side, Castor can use one schema, and on your clients' side they don't have to change their schemas.
Change the schema that Castor expects to get to something with a common root-element, with under that your nine different alternatives for your different objects (I think you can restrict it so the schema will allow only one of the nine, if that doesn't work out you could just make all the sub-elements optional).
Then you can write code that modifies the incoming xml to wrap your incoming xml with that common root-element, then feeds the wrapped xml into a stream that gets read by the Castor unmarshaller.
There are at least 3 different ways to implement the xml-wrapping part: SAX, XSLT, and XML libraries (like JDOM, DOM4J, and XOM--I prefer XOM but any of them will work).
The SAX way is probably best if you're already familiar with SAX or if one of the other ways has worked but come up short on performance. If I had to implement that then I would create an XMLFilter that takes in xml and writes xml out, stacking that on top of another piece that writes xml to an OutputStream, and writing a wrapper method around the unmarshalling stuff to feed the incoming stream to the xmlreader, copy the OutputStream to another InputStream (an easy way is to use commons-io), and feed the new InputStream to the Castor unmarshaller.
With XSLT there is no fooling with SAX, although XSLT has a reputation for pain sometimes, it seems to me like this might be a relatively straightforward transformation, but I haven't taken a stab at it either. It is a long time since I used XSLT for anything. I am not sure about performance either, though I wouldn't write it off out of hand.
Using XOM or JDOM or DOM4J to wrap the XML is also possible, and the learning curve is a lot lower than for SAX or XSLT. The downside is the whole XML document tends to get sucked into memory at once so if you deal with big enough documents you could run out of memory.

I have a similar thing in Jibx where all of the incoming message objects implement a base interface which has a field denoting the message type.
The text/xml is serialized into the base interface and I then used the command pattern to call the respective business logic depending upon the message type which is defined in the base interface.
Not sure if this is possible using castor but take a look at Jibx as the performance is fantastic.
http://jibx.sourceforge.net/

I appreciate your insights. You both have given me some good information to go on and new knowledge that I didn't have. In the end, I got the process to work via a hack. I grab the text stream, parse out the root tag of the message, and then switch on it to determine the right object to create. I'm unmarshalling all of my objects independently and everyone is happy on our end.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.