XML parsing issue with '&' in element text - java

I have the following code:
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(inputXml)));
And the parse step is throwing:
SAXParseException: The entity name must immediately follow
the '&' in the entity reference
due to the following '&' in my inputXml:
<Line1>Day & Night</Line1>
I'm not in control of the inbound XML. How can I safely/correctly parse this?

Quite simply, the input "XML" is not valid XML. The ampersand should be encoded as an entity, i.e.:
<Line1>Day &amp; Night</Line1>
Basically, there's no "proper" way to fix this other than telling the XML supplier that they're giving you garbage and getting them to fix it. If you're in some horrible situation where you've just got to deal with it, then the approach you take will likely depend on what range of values you're expected to receive.
If there are no entities in the document at all, a regex replace of & with &amp; before processing would do the trick. But if they're sending some entities correctly, you'd need to exclude these from the matching. And on the rare chance that they actually wanted to send the entity code (i.e. sent &amp; but meant &amp;amp;) you're going to be completely out of luck.
But hey - it's the supplier's fault anyway, and if your attempt to fix up invalid input isn't exactly what they wanted, there's a simple thing they can do to address that. :-)
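As a rough sketch of the "exclude the correctly sent entities" idea above: escape only those ampersands that do not already start an entity or character reference. The class and method names here are made up for illustration, and the regex is a heuristic rather than a real XML repair. It will also rewrite ampersands inside CDATA sections, and it cannot tell whether a sender who wrote &amp; actually meant a literal "&amp;".

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class AmpersandFixer {
    // Matches '&' that is NOT followed by a named (&name;), decimal (&#38;)
    // or hexadecimal (&#x26;) reference.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) throws Exception {
        String inputXml = "<Line1>Day & Night</Line1>";
        String repaired = escapeBareAmpersands(inputXml);

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(repaired)));

        // The parser turns &amp; back into a plain '&' in the text content.
        System.out.println(document.getDocumentElement().getTextContent());
    }
}
```

Anything that already looks like a reference (for example &lt; or &#38;) is left alone, so a named reference that is not declared anywhere would still make the parser fail.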

Your input XML isn't valid XML; unfortunately you can't realistically use an XML parser to parse this.
You'll need to pre-process the text before passing it to an XML parser. Although you can do a string replace, replacing '& ' with '&amp; ', this isn't going to catch every occurrence of & in the input, but you may be able to come up with something that does.

I ran the input through the Tidy framework before XML parsing:
final StringWriter errorMessages = new StringWriter();
final String res = new TidyChecker().doCheck(html, errorMessages);
...
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(addRoot(html))));
...
And everything works.

Is inputXML a string? Then use this:
// Escape every '&' that is followed by whitespace; the lookahead keeps the whitespace.
inputXML = inputXML.replaceAll("&(?=\\s)", "&amp;");

Related

Special characters creates problem while writing xml

First of all, please excuse my shallow understanding of coding, as I am a business analyst. Now my question: I am writing Java code to convert a CSV into XML. I am able to read the CSV successfully into objects. However, while writing the XML, an error is thrown when a special character such as a space or "=" is encountered.
Here is the problematic piece of code. I have improvised the value passed to createElement just to highlight the problem; in the actual code I get this value from an object:
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
Document xmlDocument= documentBuilder.newDocument();
Element root = xmlDocument.createElement("Media NationalGroupId=\"8\" AllFTA=\"1002\" AllSTV=\"1001\"");
xmlDocument.appendChild(root);
My xml should look something like this
<Media DateCreated="20200224 145251" NationalGroupId="8" AllFTA="1002" AllSTV="1001" AllTV="1000" NextId="1000000">
createElement should only receive Media as the argument; an XML element name cannot contain spaces or "=", which is why the call fails.
To add the other attributes (DateCreated, NationalGroupId, etc), you need to call setAttribute on root, one by one.
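A minimal sketch of that advice, using the attribute names and values from the desired output in the question. The class and method names are made up for illustration; in the real code the values would come from the CSV objects.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class.
final class MediaElementSketch {
    static Document buildMediaDocument() throws ParserConfigurationException {
        Document xmlDocument =
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        // Only the element name goes into createElement.
        Element root = xmlDocument.createElement("Media");

        // Each attribute is set separately; values may contain spaces.
        root.setAttribute("DateCreated", "20200224 145251");
        root.setAttribute("NationalGroupId", "8");
        root.setAttribute("AllFTA", "1002");
        root.setAttribute("AllSTV", "1001");
        root.setAttribute("AllTV", "1000");
        root.setAttribute("NextId", "1000000");

        xmlDocument.appendChild(root);
        return xmlDocument;
    }

    public static void main(String[] args) throws Exception {
        Element root = buildMediaDocument().getDocumentElement();
        System.out.println(root.getTagName() + " has "
                + root.getAttributes().getLength() + " attributes");
    }
}
```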

jsoup to w3c-document: INVALID_CHARACTER_ERR

My use case: fetch HTML pages with jsoup and return a W3C DOM for further processing by XML transformations:
...
org.jsoup.nodes.Document document = connection.get();
org.w3c.dom.Document dom = new W3CDom().fromJsoup(document);
...
Works well for most documents but for some it throws INVALID_CHARACTER_ERR without telling where.
It seems extremely difficult to find the error. I changed the code to first import the url to a String and then checking for bad characters by regexp. But that does not help for bad attributes (eg. without value) etc.
My current solution is to minimize the risk by removing elements by tag in the jsoup-document (head, img, script ...).
Is there a more elegant solution?
Try setting the outputSettings to 'XML' for your document:
document
.outputSettings()
.syntax(OutputSettings.Syntax.xml);
document
.outputSettings()
.charset("UTF-8");
This should ensure that the resulting XML is valid.
Solution found by OP in reply to nyname00:
Thank you very much; this solved the problem:
Whitelist whiteList = Whitelist.relaxed();
Cleaner cleaner = new Cleaner(whiteList);
jsoupDom = cleaner.clean(jsoupDom);
"relaxed" indeed means relaxed developer...

Exception while evaluating simple XPath expression

Hi, I have a problem with an XPath expression while trying to write a test.
I have the following fragment of code.
final String resultCode = xPath.compile(
"//*:Envelope/*:Body/ResultCode/text()")
.evaluate(responseEntity.getBody());
The responseEntity is returned by my mock. It consists of HttpStatus and proper response body in xml format. While executing the test I get this exception
Caused by: javax.xml.xpath.XPathExpressionException: Cannot locate an object model implementation for nodes of class java.lang.String
at net.sf.saxon.xpath.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:321)
at net.sf.saxon.xpath.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:396)
...
I am using saxon for this task, but to be honest I am not very familiar with it. Any suggestions what to check are welcome
OK, I've figured it out. Unfortunately such things happen when you have to fix somebody else's code in a field you have no knowledge of at all.
The issue was that a String was passed to the evaluate method, while it expects one of NodeInfo, DOM, Document etc.
Also thanks to @paul trmbrth for fixing the XPath expression, which was malformed too.
I changed the code to something like this:
InputSource source = new InputSource(new StringReader(responseEntity.getBody()));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
final Document document = db.parse(source);
final String resultCode = xPath.compile(
"//*[local-name()='Envelope']/*[local-name()='Body']/*[local-name()='ResultCode']/text()")
.evaluate(document);
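For reference, here is a self-contained sketch of the same fix. It uses the JDK's built-in XPath implementation rather than Saxon, the SOAP payload is invented for the demo, and the class and method names are hypothetical. Turning on namespace awareness is my own assumption: with the JDK implementation it is what makes local-name() return the unprefixed name for prefixed elements.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class.
final class ResultCodeXPathSketch {
    static String extractResultCode(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // assumption: needed so local-name() ignores the prefix

        // Parse the String into a DOM Document first; evaluate() needs a node, not a String.
        Document document = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.compile(
                "//*[local-name()='Envelope']/*[local-name()='Body']"
                        + "/*[local-name()='ResultCode']/text()")
                .evaluate(document);
    }

    public static void main(String[] args) throws Exception {
        // Made-up response body standing in for responseEntity.getBody().
        String body = "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
                + "<soap:Body><ResultCode>OK</ResultCode></soap:Body></soap:Envelope>";
        System.out.println(extractResultCode(body));
    }
}
```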

Java XML Parsing: Avoid entity reference resolution

I am currently parsing XHTML documents with a DOM parser, like:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
final DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(MY_ENTITY_RESOLVER);
db.setErrorHandler(MY_ERROR_HANDLER);
...
final Document doc = db.parse(inputSource);
And my problem is that when my document contains an entity reference like, for example:
<p>&pound;</p>
My parser creates a Text node for that content containing "£" instead of "&pound;". That is, it is resolving the entity the way it is supposed to (the XHTML 1.0 Strict DTD links to the ENTITIES Latin1 DTD, which in turn establishes the equivalence of "&pound;" with "£").
The problem is, I don't want the parser to do such a thing. I would like to keep the "&pound;" text unmodified.
I've already tried with:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setExpandEntityReferences(false);
But:
I don't like this because I fear this might make some parser implementations not navigate from the XHTML 1.0 Strict DTD to the ENTITIES Latin1 DTD and therefore not consider "€" as a declared entity.
When I do this, it weirdly creates two nodes: a "pound" Entity node, and a Text node with the "£" symbol after it.
Any ideas? Is it possible to configure this in a DOM Parser without resorting to preprocessing the XHTML and substituting all "&" symbols for something other?...
Solutions could be for a DOM parser or also a SAX one, I wouldn't mind using SAX parsing and then creating my DOM using a transformation...
Also, I cannot switch to a non-standard XML parsing library. No jdom, no jsoup, no HtmlCleaner, etc.
Thanks a lot.
The approach I took was to replace any entities with a unique marker that is treated as plain text by Xerces. Once converted into a Document object, the markers are replaced with Entity Reference objects.
See the convertStringToDocument() function in http://sourceforge.net/p/commonclasses/code/14/tree/trunk/src/com/redhat/ecs/commonutils/XMLUtilities.java
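The marker idea can be sketched roughly as follows. This is not the code from the linked XMLUtilities class; the class name, the method names and the choice of private-use characters U+E000/U+E001 as markers are all assumptions made for this example. It only restores references found in element text (not in attribute values), and it assumes the input never contains the marker characters itself.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical sketch of the "unique marker" approach.
final class EntityMarkerSketch {
    // Named references other than the five predefined XML entities.
    private static final Pattern ENTITY =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos);)([A-Za-z][A-Za-z0-9]*);");
    // Marker form: U+E000 name U+E001 (private-use characters, legal in XML text).
    private static final Pattern MARKER =
            Pattern.compile("\uE000([A-Za-z][A-Za-z0-9]*)\uE001");

    static Document parsePreservingEntities(String xml) throws Exception {
        // 1. Hide the references from the parser behind plain-text markers.
        String marked = ENTITY.matcher(xml).replaceAll("\uE000$1\uE001");
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(marked)));
        // 2. Swap the markers for EntityReference nodes.
        restore(doc, doc.getDocumentElement());
        return doc;
    }

    private static void restore(Document doc, Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before the tree is modified
            if (child.getNodeType() == Node.TEXT_NODE) {
                splitText(doc, (Text) child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                restore(doc, child);
            }
            child = next;
        }
    }

    private static void splitText(Document doc, Text text) {
        String data = text.getData();
        Matcher m = MARKER.matcher(data);
        if (!m.find()) {
            return;
        }
        Node parent = text.getParentNode();
        int last = 0;
        do {
            if (m.start() > last) {
                parent.insertBefore(doc.createTextNode(data.substring(last, m.start())), text);
            }
            parent.insertBefore(doc.createEntityReference(m.group(1)), text);
            last = m.end();
        } while (m.find());
        if (last < data.length()) {
            parent.insertBefore(doc.createTextNode(data.substring(last)), text);
        }
        parent.removeChild(text);
    }
}
```

Because the parser only ever sees marker text, it never consults the DTD for these names, so no entity expansion takes place during parsing.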

Invalid character '&#x0' encountered

I am getting following exception while parsing the xml.
Fatal error at line -1 Invalid character '&#x0' encountered. No stack trace
I have Xml data in string format and I am parsing it using DOM parser.
I am parsing data which is a response from Java server to a Blackberry client.
I also tried parsing with SAX parser,but problem is not resolved.
Please help.
You have a null character in your character stream, i.e. char(0) which is not valid in an XML-document. If this is not present in the original string, then it is most likely a character decoding issue.
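If the stray characters cannot be fixed at the source, one workaround is to drop every character that XML 1.0 does not allow before handing the string to the parser. The sketch below is written against a desktop JDK (it uses Java 8 streams, which is an assumption and will not be available on the BlackBerry runtime), and the class and method names are made up. Stripping hides the symptom only; if the nulls come from a decoding problem, that problem is still there.

```java
// Hypothetical helper, not part of any library.
final class XmlCharFilter {
    // The Char production of XML 1.0: tab, LF, CR and the ranges below.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input without characters such as U+0000.
    static String stripInvalidXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
                .filter(XmlCharFilter::isValidXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<a>x\0y</a>";
        System.out.println(stripInvalidXmlChars(dirty).length()); // two shorter than... see below
        System.out.println(dirty.length() - stripInvalidXmlChars(dirty).length() + " char(s) removed");
    }
}
```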
I found the solution: I just trimmed the string with trim(), and it worked perfectly fine for me.
Your code currently calls getBytes() using the platform default encoding - that's very rarely a good idea. Find out what the encoding of the data really is, and use that. (It's likely to be UTF-8.)
If the Blackberry includes DocumentBuilder.parse(InputSource), that would be preferable:
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
StringReader reader = new StringReader(xmlData);
try {
Document doc = docBuilder.parse(new InputSource(reader));
doc.getDocumentElement().normalize();
} finally {
reader.close();
}
If that doesn't work, have a very close look at your string, e.g. like this:
for (int i=0; i < xmlData.length(); i++) {
// Use whatever logging you have on the Blackberry
System.out.println((int) xmlData.charAt(i));
}
It's possible that the problem is reading the response from the server - if you're reading it badly, you could have Unicode nulls (\u0000) in your string, which may not appear obviously in log/debug output, but would cause the error you've shown.
EDIT: I've just seen that you're getting the base64 data in the first place - so why convert it to a string and then back to bytes? Just decode the base64 to a byte array and then use that as the basis of your ByteArrayInputStream. Then you never have to deal with a text encoding in the first place.
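A sketch of that last suggestion, written against a desktop JDK: java.util.Base64 is an assumption here (it needs Java 8 or later and is not part of the BlackBerry runtime, where a different decoder would have to be used), and the class and method names are made up. The point is that the decoded bytes go straight to the parser, which then works out the encoding from the XML declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class.
final class Base64XmlSketch {
    static Document parseBase64Xml(String base64) throws Exception {
        // Decode to bytes and never build an intermediate String.
        byte[] raw = Base64.getDecoder().decode(base64);

        DocumentBuilder docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(raw)) {
            Document doc = docBuilder.parse(in);
            doc.getDocumentElement().normalize();
            return doc;
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up payload standing in for the server response.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><r>ok</r>";
        String base64 = Base64.getEncoder().encodeToString(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(parseBase64Xml(base64).getDocumentElement().getTextContent());
    }
}
```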
InputStream xml = new ByteArrayInputStream(xmlData.getBytes());
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(xml);
doc.getDocumentElement().normalize();
xml.close();
Above is the code I am using for parsing.
