Parse Ampersand in XML with Java's DOM XML API - java

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters an ampersand (&) while parsing a text node, it errors out. I am guessing that this is solvable with 1) escaping, 2) encoding, or 3) using a different parser.
I am reading an XML document that I don't have any control over, so I cannot precisely identify where the ampersand appears in the document every time I read it.
The answers I have seen to similar questions have advised replacing the entity type when parsing the XML, but I am not sure how I will be able to do that, since it doesn't even parse when it encounters the ampersand.
Any help will be appreciated.

As noted, the XML is malformed (oops!): every occurrence of & in XML content, other than where it introduces an entity or character reference, must be encoded as &amp;.
Some solutions (which are basically just as described in the post!):
Fix the XML (at the source, or in a hack-it-up pre-processing phase), or
Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)
For the "hack-it-up" approach, consider a separate input stream -- see Working with Filter Streams -- that executes as a filter prior to the actual DOM parser: whenever a & is encountered (that is not part of a character entity) it "fixes it" by inserting & into the stream. Of course, if the XML source didn't get basic encoding correct...
Happy coding.

"I am reading an XML document that I dont have any control over".
No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.

Related

While reading and rewriting XML in Java, is there a systematic way of preserving the processing instruction?

I want to update the XML but preserve the original processing instruction, most of the time it's just:
<?xml version="1.0" encoding="UTF-8"?>
However I can't find a way to extract that line from com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.JAXPSAXParser (and other XML readers), or how to automatically carry it over to the writer. Is there any other way than manually reading the line, keeping it, then writing it first before flushing the new XML?
Its proper name is an XML declaration; it looks like a processing instruction, but technically it isn't one.
Parsing invariably involves decoding the file (that is, converting the octets into characters); once that has been done, the theory goes, the application doesn't need to know how they were originally encoded. Similarly, when serializing the file, the application has to tell the serializer what encoding to use, and the serializer then takes responsibility for writing an XML declaration that reflects that encoding.
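For illustration, with the standard JAXP Transformer that looks roughly like the sketch below (the tiny inline document is just a stand-in): the application names an encoding, and the serializer writes the matching declaration itself.

import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DeclarationExample {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader("<root><child/></root>")));

        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Tell the serializer which encoding to use; it emits the matching
        // <?xml version="1.0" encoding="UTF-8"?> declaration on its own.
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out);
    }
}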
Allowing the application control over the XML declaration would break proper architectural layering, and would create the possibility of writing an XML declaration that is wrong. This bit of the content belongs to the parser layer, not to the application layer.
Of course in practice it's possible to get an XML declaration that doesn't match the actual encoding anyway, because there's nothing to stop you writing an XML declaration using software that knows nothing about XML. People do that, and they create broken content, and then they ask us on StackOverflow how to fix it. I'm not going to encourage you down that route.

Stax does not read characters like "“" [duplicate]

I am trying to parse an XML document using a SAX parser but keep getting "XML document structures must start and end within the same entity.", which is expected, as the XML doc I get from the other source won't be a proper one. But I don't want this exception to be raised, as I would like to parse the document until I find <myTag>, and I don't care whether the doc has proper starting and closing entities.
Example:
<employeeDetails>
<firstName>xyz</firstName>
<lastName>orp</lastName>
<departmentDetails>
<departName>SALES</departName>
<departCode>982</departCode>...
Here I don't care whether the document is a valid one or not, as that part is not in my hands. So I would like to parse this document until I see <departName>; after that I don't want to parse the document. Please suggest how to do this. Thanks.
You cannot use an XML parser to parse a file that does not contain well-formed XML. (It does not have to be valid, just well-formed. For the difference, read Well-formed vs Valid XML.)
By definition, XML must be well-formed, otherwise it is not XML. Parsers in general have to have some fundamental constraints met in order to operate, and for XML parsers, it is well-formedness.
Either repair the file manually first to be well-formed XML, or open it programmatically and parse it as a text file using traditional parsing techniques. An XML parser cannot help you unless you have well-formed XML.
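If the element you need always appears before the point where the file breaks off, the plain-text route can be as simple as the sketch below (the file name is a placeholder, and it assumes <departName> and its closing tag sit on a single line):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextScanExample {
    public static void main(String[] args) throws Exception {
        // Treat the file as plain text: look for <departName>...</departName>
        // and stop as soon as it has been seen; no well-formedness required.
        Pattern target = Pattern.compile("<departName>(.*?)</departName>");
        try (BufferedReader in = new BufferedReader(new FileReader("employees.xml"))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = target.matcher(line);
                if (m.find()) {
                    System.out.println("departName = " + m.group(1));
                    break; // found what we need; ignore the rest of the file
                }
            }
        }
    }
}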
BeautifulSoup in Python can handle incomplete XML really well.
I use it to parse the prefix of large XML files for preview.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<a><b>foo</b><b>bar<','xml')
<?xml version="1.0" encoding="unicode-escape"?>\n<a><b>foo</b><b>bar</b></a>

Is it possible to use Apache Digester to filter dynamic xml leaf tags?

I've used Apache Digester before and loved the branch-based searching of XML tags.
Specifying a tag as
h/a/b
is very intuitive.
Now I want to do an XML filtering project, but Apache Digester doesn't seem like it will work, simply because there is no way to get at the underlying XML tags. As the FAQ says:
How do I get some xml nested within a tag as a literal string?
It is frequently asked how some XML (esp. XHTML) nested within a document can be extracted as a string, eg to extract the contents of the "body" tag below as a string:
...some xml code...
If you can modify the above to wrap the desired text as a CDATA section then things are easy; Digester will simply treat that CDATA block as a single string:
...some xml code...
If this can't be done then you need to use a NodeCreateRule to create a DOM node representing the body tag and its children, then serialise that DOM node back to text.
Remember that Digester is just a layer on top of a standard XML parser, and standard XML parsers have no option to just stop parsing input at a specific element - unless it knows that the contents of that element is a block of characters (CDATA).
Is there something that uses the same pattern system that I can use to filter XML? My idea is to take the patterns given by the user as a blacklist, and copy everything else.
Or maybe there is a way to find the location of a match in Apache Digester (the location in the XML, not just the displayed text). That would be enough for me to copy the other text by keeping a copy of it around, and skipping the matches.
Edit: I've since found out that XPath looks almost OK for doing this, but all applications I found were for selecting something, not removing it. Do you have an example of this?
Never mind, managed to do it with XPath.
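Roughly, that can look like the sketch below (input.xml and the /h/a/b expression stand in for the user-supplied blacklist patterns): select the unwanted nodes with XPath, detach each from its parent, and serialize what remains.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XPathFilterExample {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("input.xml"));

        // Select every node matching the blacklist pattern.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList matches = (NodeList) xpath.evaluate("/h/a/b", doc, XPathConstants.NODESET);

        // Detach each match; everything else is left untouched.
        for (int i = 0; i < matches.getLength(); i++) {
            Node n = matches.item(i);
            n.getParentNode().removeChild(n);
        }

        // Write out the remaining document.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(new File("filtered.xml")));
    }
}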

parsing/scanning/tokenizing "raw XML"

I have an application where I need to parse or tokenize XML and preserve the raw text (e.g. don't parse entities, don't convert whitespace in attributes, keep attribute order, etc.) in a Java program.
I've spent several hours today trying to use StAX, SAX, XSLT, TagSoup, etc. before realizing that none of them do this. I can't afford to spend much more time attacking this problem, and parsing the text manually seems highly nontrivial. Is there any Java library that can help me tokenize the XML?
edit: why am I doing this? -- I have a large XML file to which I want to make a small number of localized changes programmatically, and those changes need to be reviewed. It is highly valuable to be able to use a diff tool. If the parser/filter normalizes the XML, then all I see is "red ink" in the diff tool. The application that produces the XML in the first place isn't something that I can easily have changed to produce "canonical XML", if there is such a thing.
I think you might have to generate your own grammar.
Some links:
Parsing XML with ANTLR Tutorial
ANTXR
XPA
http://www.google.com/search?q=antlr+xml
I don't think any XML parser will do what you want. Why? For instance, the XML spec doesn't enforce attribute ordering. I think you're going to have to parse it yourself, and that is non-trivial.
Why do you have to do this? I'm guessing you have some client 'XML' that enforces or relies on non-standard construction. In that case I'd push back and get that fixed, rather than jump through numerous hoops to try and accommodate this.
I'm not entirely sure that I understand what it is you are trying to do. Have you tried using CDATA regions for the parts of the document you don't want the parser to touch?
Also relying on attribute order is not a good idea - if I remember the XML standard correctly then order is never to be expected.
It sounds like you are dealing with some malformed XML and that it would be easier to first turn it into proper XML.

How to parse badly formed XML in Java?

I have XML that I need to parse but have no control over the creation of. Unfortunately it's not very strict XML and contains things like:
<mytag>This won't parse & contains an ampersand.</mytag>
The javax.xml.stream classes don't like this at all, and rightly error with:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[149,50]
Message: The entity name must immediately follow the '&' in the entity reference.
How can I work around this? I can't change the XML, so I guess I need an error-tolerant parser.
My preference would be for a fix that doesn't require too much disruption to the existing parser code.
Use libraries such as Tidy or TagSoup.
TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.
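TagSoup exposes itself as a SAX XMLReader, so it can be bridged into a DOM with an identity transform, roughly as sketched below (bad.xml is a placeholder; note that TagSoup applies HTML rules, so it may lowercase element names and wrap the content in html/body elements):

import java.io.FileReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class TagSoupExample {
    public static void main(String[] args) throws Exception {
        // TagSoup's Parser is a SAX XMLReader that never reports
        // well-formedness errors; it repairs the input as it goes.
        XMLReader reader = new org.ccil.cowan.tagsoup.Parser();

        // Bridge the SAX events into a DOM so existing DOM code can be reused.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        t.transform(new SAXSource(reader, new InputSource(new FileReader("bad.xml"))), result);

        Document doc = (Document) result.getNode();
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}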
If it's not valid XML (like the above) then no XML parser will handle it (as you've identified). If you know the scope of the errors (such as the above entity issue), then the simplest solution may be to run a correcting process over it (for example, replacing bare & characters with &amp;) and then feed it to an existing parser.
Otherwise you'll have to code one yourself with built-in support for such anomalies. And I can't believe that's anything other than a tedious and error-prone task.
I believe jsoup can handle badly formed XML.
