Java XMLReader getting SAXParseException on special characters in XML - java

I have a problem parsing an XML file which contains special characters like ", <, > or & in the attributes of an element. At the moment I use XMLReader with my own ContentHandler. Unfortunately, changing the XML is not an option since I receive a huge number of files. Any idea what I could do?
Best!

You have to change the XML in order to make it well-formed. The five magic characters must be encoded properly OR wrapped in a CDATA section to tell the parser to allow them to pass.
If the five magic characters are not encoded properly, you aren't receiving well-formed XML. That ought to be the foundation of your contract with users.
Do a one-shot change.

It's not XML. Don't call it XML, because you are misleading yourself. You're dealing with a proprietary data syntax, and you are missing out on all the benefits of using XML for data interchange. You can't use any of the wonderful tools that exist for processing XML, because your data is not XML. You're in the dark ages of data interchange that existed before XML was invented, where everyone had to write their own parsers and port them to multiple platforms, at vast cost. It may be expensive to switch from this mess to the modern world of open standards, but the investment will pay off quickly. Just don't let any of the stakeholders delude themselves into thinking that because your syntax is "almost XML", you are almost there in terms of reaping the benefits. XML is all or nothing.

It's not best practice, but you could use a regex to transform your almost-XML into proper XML before you open it with XMLReader. Something along these lines (just using JavaScript for a quick proof of concept):
var xml = '<root><node attr="bad attr chars...<"&>..."/></root>';
xml = xml.replace(/("[^"]*)&([^"]*")/, '$1&amp;$2')
xml = xml.replace(/("[^"]*)<([^"]*")/, '$1&lt;$2')
xml = xml.replace(/("[^"]*)>([^"]*")/, '$1&gt;$2')
xml = xml.replace(/("[^"]*)"([^"]*")/, '$1&quot;$2')
alert(xml);
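Since the question is about Java, here is a rough Java sketch of the same pre-processing idea, limited to the one case that can be repaired fairly safely: an & that does not start a reference. The class and method names (AmpersandRepair, escapeBareAmpersands, readAttr) are made up for illustration, and the sketch assumes the input uses only the five predefined entities and numeric references; a custom entity such as &title; would be escaped too. Stray < or " characters inside attribute values are not handled.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class AmpersandRepair {
    // Matches an '&' that is NOT the start of a predefined or numeric reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: escapes bare ampersands, leaves existing references alone.
    static String escapeBareAmpersands(String almostXml) {
        return BARE_AMP.matcher(almostXml).replaceAll("&amp;");
    }

    // Parses the repaired text and returns one attribute of the root element.
    static String readAttr(String xml, String attrName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(attrName);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is still not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String broken = "<node attr=\"Tom & Jerry &lt; &#38;\"/>";
        String fixed = escapeBareAmpersands(broken);
        System.out.println(fixed);
        System.out.println(readAttr(fixed, "attr"));
    }
}
```

The negative lookahead is what keeps already-escaped text from being escaped twice, so the repair can be run over files that are partly correct.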

Related

Hexadecimal (&#xA7;) to Char (§) in Java

I have a huge XML file which contains a lot of hexadecimal values, e.g. &#xA7;. How can I convert these hexadecimal values throughout the file to characters, e.g. &#xA7; to §, in Java?
<md.first.line.cite>UT ST &#xA7;&#x2002;10-2-409</md.first.line.cite>
You mentioned that you want to convert to §, so this is one way to convert hex to your desired character:
System.out.println((char)0xA7);
Or
int hex=0xA7;
System.out.println((char)hex);
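The cast to char only works for code points up to U+FFFF. A small sketch that also copes with larger code points, assuming you have already cut the hex digits out of the reference yourself (HexToChar and fromHex are illustrative names, not an existing API):

```java
final class HexToChar {
    // Converts hex digits such as "A7" into the corresponding character(s).
    // Character.toChars returns a surrogate pair for code points above U+FFFF.
    static String fromHex(String hexDigits) {
        int codePoint = Integer.parseInt(hexDigits, 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(fromHex("A7"));
    }
}
```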
Let's start with what &#xA7; and &#x2002; mean. These are XML character references. They represent the Unicode code points U+00A7 and U+2002. (This is real XML syntax, not just some random nuisance escape sequence that needs to be dealt with.)
So, if you parse that XML with a conformant XML parser, the parser will automatically take care of translating the references to the corresponding Unicode code-points. Your application should not need to do any translating.
This implies that you are NOT using a proper XML parser in your application. Bad idea! Doing your own XML parsing by string bashing or using regexes tends to give inflexible code and/or unreliable results when faced with "variant" XML. So my main recommendation would be:
Use a standard off-the-shelf XML parser.
If your XML is non-compliant, consider using Jsoup or similar to extract information from the XML.
If you are already deep down the rabbit hole of string bashing, etc., the best thing to do would be to extract the entire encoded XML text segment and convert it to a String using existing library code. The standard Java SE class libraries don't provide this functionality, but you could use StringEscapeUtils.unescapeXml() from org.apache.commons.text. (The version from org.apache.commons.lang3 has been deprecated.)
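To illustrate the main recommendation, here is a minimal sketch using the DOM parser that ships with the JDK. It assumes the element from the question is the whole document; CharRefDemo and textOfRoot are names invented for this example. The parser hands back the text with the references already replaced by the real characters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CharRefDemo {
    // Parses the XML and returns the text content of the root element.
    // Character references such as &#xA7; are resolved by the parser.
    static String textOfRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<md.first.line.cite>UT ST &#xA7;&#x2002;10-2-409</md.first.line.cite>";
        System.out.println(textOfRoot(xml));
    }
}
```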

In an XML document, is it possible to tell the difference between an entity-encoded character and one that is not?

I am being fed an XML document with metadata about online resources that I need to parse. Among the different metadata items is a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching&#44; evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able -- as far as I'm aware -- to differentiate the encoded comma (from the ones that aren't encoded) in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
When you use dom4j or DOM all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a more low-level interface and, via its LexicalHandler interface, can notify you when the parser encounters entity references, but it does not report character references. So it seems that you would really need to write your own parser, or patch an existing one.
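A small sketch of that limitation, using the SAX parser bundled with the JDK. The sample document and its sep entity are made up for the demonstration, as are the names EntityEvents and reportedEntities. The handler collects every name passed to LexicalHandler.startEntity; the declared entity should show up, while the character reference &#44; should not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class EntityEvents {
    // Returns the entity names the parser reported through LexicalHandler.startEntity.
    static List<String> reportedEntities(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                names.add(name);
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Standard SAX2 property for registering a LexicalHandler.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical input: one declared entity and one character reference.
        String xml = "<!DOCTYPE tags [<!ENTITY sep \",\">]>"
                + "<tags>a&sep; b&#44; c</tags>";
        System.out.println(reportedEntities(xml));
    }
}
```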
But in the end it would be best if you can change the schema of your document:
<tags>
<tag>Research skills</tag>
<tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
Using LexEv from http://andrewjwelch.com/lexev/ and putting xercesImpl.jar from Apache Xerces on the class path, I am able to compile and run a short sample using dom4j:
LexEv lexEv = new LexEv();
SAXReader reader = new SAXReader(lexEv);
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If the input1.xml has your sample XML snippet, then the output is
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.
As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
With vtd-xml you can distinguish a character from its entity-encoded counterpart by using VTDNav's toRawString() method.

What is the canonical way to test generated HTML code?

The two approaches I usually follow are:
Convert the HTML to a string, and then test it against a target string. The problem with this approach is that it is too brittle, and there'll be very frequent false negatives due to say, things like extra whitespace somewhere.
Convert the HTML to a string and parse it back as an XML, and then use XPath queries to assert on specific nodes. This approach works well but not all HTML comes with closing tags and parsing it as XML fails in such cases.
Both these approaches have serious flaws. I imagine there must be a well-established approach (or approaches) for this sort of tests. What is it?
You could use jsoup or JTidy instead of XML parsing and use your second strategy.

How to make SAXParser ignore escape codes

I am writing a Java program to read an XML file, actually an iTunes library, which is in XML plist format.
I have managed to get round most obstacles that this format throws up, except when it encounters text containing an ampersand. The XML file represents this ampersand as &#38; and I can only manage to read the text following the &#38; in any particular section of text.
Is there a way to disable detection of escape codes? I am using SAXParser.
There is something fishy about what you are trying to do.
If the file format you are trying to parse contains bare ampersand (&) characters then it is not well-formed XML. Ampersands are represented as character entities (e.g. &amp;) in well-formed XML.
If it is really supposed to be real XML, then there is a bug in whatever wrote / generated the file.
If it is not supposed to be real XML (i.e. those ampersands are not a mistake), then you probably shouldn't be trying to parse it using an XML parser.
Ah, I see. The XML is actually correctly encoded, but you didn't get the SO markup right.
It would appear that your real problem is that your characters(...) callback is being called separately for the text before the ampersand reference, for the (decoded) &, and finally for the text after it. You simply have to deal with this by joining the text chunks back together.
The javadoc for ContentHandler.characters() says this:
"The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks ...".
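Joining the chunks usually means appending to a buffer in characters() and only reading it in endElement(). A minimal sketch, assuming the values of interest sit in string elements as in a plist file (TextJoiningHandler and stringValues are illustrative names):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TextJoiningHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so append instead of assigning.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("string".equals(qName)) {
            values.add(buffer.toString()); // the complete text is known only here
        }
    }

    // Parses the XML and returns the full text of every <string> element.
    static List<String> stringValues(String xml) {
        TextJoiningHandler handler = new TextJoiningHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.values;
    }

    public static void main(String[] args) {
        String xml = "<dict><key>Name</key><string>Simon &#38; Garfunkel</string></dict>";
        System.out.println(stringValues(xml));
    }
}
```

This works no matter how the parser decides to split the text, which is why it is preferable to inspecting individual chunks.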
It's probably not the best general solution for escape characters, but I only had to take into account new lines so it was easy to just check for \n.
You could check for the backslash \ only to check for all escape characters or in your case &, although I think others will come with more elegant solutions.
@Override
public void characters(char[] ch, int start, int length)
{
    String elementData = new String(ch, start, length);
    boolean elementDataContainsNewLine = (elementData.indexOf("\n") != -1);
    if (!elementDataContainsNewLine)
    {
        // do what you want if it is not a new line
    }
}
Do you have an excerpt for us? Is the file itunes-generated? If so, it sounds like a bug in iTunes to me, that forgot to encode the ampersand correctly. I would not be surprised: they clearly didn't get XML in the first place, their schema of <name>[key]</name><string>[value]</string> must make the XML inventors puke.
You might want to use a different, more robust parser. SAX is great as long as the file is well-formed. I do not know, however, how robust dom4j and jdom are. Just give them a try. For Python, I would recommend ElementTree or BeautifulSoup, which are very robust.
Also have a look at http://code.google.com/p/xmlwise/ which I found mentioned here in stackoverflow (did you use search?).
Update (as per the updated question): You need to understand the role of entities in XML and thus SAX. By default they are separate nodes, just like text nodes, so you will likely need to join them with adjacent text nodes to get the full value. Do you use a DTD in your parser? Using a proper DTD with entity definitions can help parsing a lot, as it can contain mappings from entities such as &amp; to the characters they represent (&), and the parser may be able to do the merging for you. (At least the Python XML pull parser I like to use for large files does when materializing subtrees.)
I am parsing the below string using SAXParser
<xml>
<FirstTag>&amp;&lt;</FirstTag>
<SecondTag>test</SecondTag>
</xml>
I want the same string to be retained but it is getting converted to below
<xml>
<FirstTag>&<</FirstTag>
<SecondTag>test</SecondTag>
</xml>
Here is my code. How can I avoid this being converted?
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
MyHandler handler = new MyHandler();
values = handler.getValues();
saxParser.parse(x, handler);

How to generate an *exact* copy of an XML document with resolved entities

Given an XML document like this:
<!DOCTYPE doc SYSTEM 'http://www.blabla.com/mydoc.dtd'>
<author>john</author>
<doc>
<title>&title;</title>
</doc>
I want to parse the above XML document and generate a copy of it with all of its entities already resolved. So given the above XML document, the parser should output:
<!DOCTYPE doc SYSTEM 'http://www.blabla.com/mydoc.dtd'>
<author>john</author>
<doc>
<title>Stack Overflow Madness</title>
</doc>
I know that you could implement an org.xml.sax.EntityResolver to resolve entities, but what I don't know is how to properly generate a copy of the XML document with everything still intact (except its entities). By everything, I mean the whitespaces, the dtd at the top of the document, the comments, and any other things except the entities that should have been resolved previously. If this is not possible, please suggest a way that at least can preserve most of the things (e.g. all but no comments).
Note also that I am restricted to the pure Java API provided by Sun, so no third party libraries can be used here.
Thanks very much!
EDIT: The above XML document is a much simplified version of the original document. The original involves very complex entity resolution using an EntityResolver, whose significance I have greatly reduced in this question. What I am really interested in is how to produce an exact copy of the XML document with an XML parser that uses an EntityResolver to resolve the entities.
You almost certainly cannot do this using any XML parser I've heard of, and certainly the Sun XML parsers cannot do it. They will happily discard details that have no significance as far as the meaning of the XML is concerned. For example,
<title>Stack Overflow Madness</title>
and
<title >Stack Overflow Madness</title >
are indistinguishable from the perspective of the XML syntax, and the Sun parsers (rightly) treat them as identical.
I think your choices are to do the replacement treating the XML as text (as @Wololo suggests) or relax your requirements.
By the way, you can probably use an XmlEntityResolver independently of the XML parser. Or create a class that does the same thing. This may mean that String.replace... is not the answer, but you should be able to implement an ad-hoc expander that iterates over the characters in a character buffer, expanding them into a second one.
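A rough sketch of such an ad-hoc expander. It treats the document as plain text, so whitespace, comments and the DOCTYPE line pass through untouched. It assumes the entity values are already available in a Map (in practice they would come from your own resolver logic); EntityExpander and expand are hypothetical names. References that are not in the map, including predefined ones like &amp;, are left as they are. Note that it does not skip CDATA sections or comments, which a complete solution would have to do.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityExpander {
    // A named entity reference such as &title; (numeric references do not match).
    private static final Pattern ENTITY_REF =
            Pattern.compile("&([A-Za-z_][A-Za-z0-9._-]*);");

    // Replaces known entity references in the raw text; everything else is copied verbatim.
    static String expand(String xmlText, Map<String, String> entities) {
        Matcher m = ENTITY_REF.matcher(xmlText);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = entities.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> entities =
                Collections.singletonMap("title", "Stack Overflow Madness");
        System.out.println(expand("<title >&title;</title >", entities));
    }
}
```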
Is it possible for you to read in the xml template as a string?
And with the string do something like
String s = "<title>&title;</title>";
s = s.replace("&title;", "Stack Overflow Madness");
SaveXml(s);
