Java XML Parsing: Avoid entity reference resolution - java

I am currently parsing XHTML documents with a DOM parser, like:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
final DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(MY_ENTITY_RESOLVER);
db.setErrorHandler(MY_ERROR_HANDLER);
...
final Document doc = db.parse(inputSource);
And my problem is that when my document contains an entity reference like, for example:
<p>€</p>
My parser creates a Text node for that content containing "€" instead of "€". This is, it is resolving the entity in the way it is supposed to do it (the XHTML 1.0 Strict DTD links to the ENTITIES Latin1 DTD, which in turn establishes the equivalence of "€" with "€").
The problem is, I don't want the parser to do such thing. I would like to keep the "€" text unmodified.
I've already tried with:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setExpandEntityReferences(false);
But:
I don't like this because I fear this might make some parser implementations not navigate from the XHTML 1.0 Strict DTD to the ENTITIES Latin1 DTD and therefore not consider "€" as a declared entity.
When I do this, it weirdly creates two nodes: a "pound" Entity node, and a Text node with the "€" symbol after it.
Any ideas? Is it possible to configure this in a DOM Parser without resorting to preprocessing the XHTML and substituting all "&" symbols for something other?...
Solutions could be for a DOM parser or also a SAX one, I wouldn't mind using SAX parsing and then creating my DOM using a transformation...
Also, I cannot switch to a non standard XML parsing libray. No jdom, no jsoup, no HtmlCleaner, etc.
Thanks a lot.

The approach I took was to replace any entities with a unique marker that is treated as plain text by Xerces. Once converted into a Document object, the markers are replaced with Entity Reference objects.
See the convertStringToDocument() function in http://sourceforge.net/p/commonclasses/code/14/tree/trunk/src/com/redhat/ecs/commonutils/XMLUtilities.java

Related

How to append the Elements to existing Nodelist during the parse of XSD file in Java DocumentBuilder

Application Background:
Basically, I am building an application in which I am parsing the XML document using SAX PARSER for every incoming tag I would like to know its datatype and other information so I am using the XSD associated with that XML file to get the datatype and other information related to those tags. Hence, I am parsing the XSD file and storing all the information in Hashmap so that whenever the tag comes I can pass that XML TAG as key to my Hashmap and obtain the value (information associated with it which is obtained during XSD parsing) associated with it.
Problem I am facing:
As of now, I am able to parse my XSD using the DocumentBuilderFactory. But during the collection of elements, I am able to get only one type of element and store it in my NODELIST such as elements with tag name "xs:element". My XSD also has some other element type such as "xs:complexType", xs:any etc. I would like to read all of them and store them into a single NODELIST which I can later loop and push to HASHMAP. However I am unable to add any additional elements to my NODELIST after adding one type to it:
Below code will add tags with the xs:element
NodeList list = doc.getElementsByTagName("xs:element");
How can I add the tags with xs:complexType and xs:any to the same NODELIST?
Is this a good way to find the datatype and other attributes of the XSD or any other better approach available. As I may need to hit the HASHMAP many times for every TAG in XML will there be a performance issue?
Is DocumentBuilderFactory is a good approach to parse XML or are there any better libaraies for XSD parsing? I looked into Xerces2 but could not find any good example and I got struck and posted the question here.
Following is my code for parsing the XSD using DocumentBuilderFactory:
public class DOMParser {
private static Map<String, Element> xmlTags = new HashMap<String, Element>();
public static void main(String[] args) throws URISyntaxException, SAXException, IOException, ParserConfigurationException {
String xsdPath1 = Paths.get(Xerces2Parser.class.getClassLoader().getResource("test.xsd").toURI()).toFile().getAbsolutePath();
String filePath1 = Path.of(xsdPath1).toString();
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new File(filePath1));
NodeList list = doc.getElementsByTagName("xs:element");
System.out.println(list.getLength());
// How to add the xs:complexType to same list as above
// list.add(doc.getElementsByTagName("xs:complexType"));
// list = doc.getElementsByTagName("xs:complexType");
// Loop and add data to Map for future lookups
for (int i = 0; i < list.getLength(); i++) {
Element element = (Element) list.item(i);
if (element.hasAttributes()) {
xmlTags.put(element.getAttribute("name"), element);
}
}
}
}
I don't know what you are trying to achieve (you have described the code you are writing, not the problem it is designed to solve) but what you are doing seems misguided. Trying to get useful information out of an XSD schema by parsing it at the XML level is really hard work, and it's clear from the questions you are asking that you haven't appreciated the complexities of what you are attempting.
It's hard to advise you on the low-level detail of maintaining hash maps and node lists when we don't understand what you are trying to achieve. What information are you trying to extract from the schema, and why?
There are a number of ways of getting information out of a schema at a higher level. Xerces has a Java API for accessing a compiled schema. Saxon has an XML representation of compiled schemas called SCM (the difference from raw XSD is that all the work of expanding xs:include and xs:import, expanding attribute groups, model groups, and substitution groups etc has been done for you). Saxon also has an XPath API (a set of extension functions) for accessing compiled schema information.

Special characters creates problem while writing xml

first of all please excuse my shallow understanding into coding as I am a business analyst. Now my question. I am writing java code to convert a csv into xml. I am able to read csv successfully into objects. However, while writing the xml, when special a space or "=" is encounteredan error is thrown.
Piece of the problematic code, I have imporovised the value in create element just to highlight the problem. In actual I am getting this value from an object:-
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentFactory.newDocumentBuilder();
Document xmlDocument= documentBuilder.newDocument();
Element root = xmlDocument.createElement("Media NationalGroupId="8" AllFTA="1002" AllSTV="1001");
xmlDocument.appendChild(root);
My xml should look something like this
<Media DateCreated="20200224 145251" NationalGroupId="8" AllFTA="1002" AllSTV="1001" AllTV="1000" NextId="1000000">
createElement should only receive Media as the argument.
To add the other attributes (DateCreated, NationalGroupId, etc), you need to call setAttribute on root, one by one.

Download and parse an XML file in Java

I have never had to download an XML file in Java before and parse it after. I'm looking to download and parse this file http://api.irishrail.ie/realtime/realtime.asmx/getStationDataByNameXML?StationDesc=Bayside
All I want to do is read the train times. I've been reading about parsing XML but I'm not really getting anywhere with it. I just keep reading about parsers like stax, after that I'm a bit lost.
Can anyone give me some basic advice of what I need to do?
You can use JAXB for this and any other XML processing needs. Start here.
You can use DOM parser and new Java Architecture for XML Binding JAXB it will help you in marshalling (for converting an xml to Object) and unmarshalling(for converting an Object to xml)
link for the example
http://www.vogella.com/tutorials/JAXB/article.html
You can use the DOM Parser to create a Document object from the XML file.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new File(filename));
The DOM Parser creates a traversable tree from your XML data.
You can then pick out the data you need from the DOM tree using XPath.
try this library
http://x-stream.github.io/download.html
it is simple and fast to use to write and read xml from java.

Specifying DTD to be used by DocumentBuilders for XML parsing?

I am currently writing a tool, using Java 1.6, that brings together a number of XML files. All of the files validate to the DocBook 4.5 DTD (I have checked this using xmllint and specifying the DocBook 4.5 DTD as the --dtdvalid parameter), but not all of them include the DOCTYPE declaration.
I load each XML file into the DOM to perform the required manipulation like so:
private Document fileToDocument( File input ) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setIgnoringElementContentWhitespace(false);
factory.setIgnoringComments(false);
factory.setValidating(false);
factory.setExpandEntityReferences(false);
DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse( input );
}
For the most part this has worked quite well, I can use he returned object to navigate the tree and perform the required manipulations and then write the document back out. Where I am encountering problems is with files which:
Do not include the DOCTYPE declaration, and
Do include entities defined in the DTD (for example — / —).
Where this is the case an exception is thrown from the builder.parse(...) call with the message:
[Fatal Error] :5:15: The entity "mdash" was referenced, but not declared.
Fair enough, it isn't declared. What I would ideally do in this instance is set the DocumentBuilderFactory to always use the DocBook 4.5 DTD regardless of whether one is specified in the file.
I did try validation using the DocBook 4.5 schema but found that this produced a number of unrelated errors with the XML. It seems like the schema might not be functionally equivalent to the DTD, at least for this version of the DocBook specification.
The other option I can think of is to read the file in, try and detect whether a doctype was set or not, and then set one if none was found prior to actually parsing the XML into the DOM.
So, my question is, is there a smarter way that I have not seen to tell the parser to use a specific DTD or ensure that parsing proceeds despite the entities not resolving (not just the &emdash; example but any entities in the XML - there are a large number of potentials)?
Could using an EntityResolver2 and implementing EntityResolver2.getExternalSubset() help?
... This method can also be used with documents that have no DOCTYPE declaration. When the root element is encountered, but no DOCTYPE declaration has been seen, this method is invoked. If it returns a value for the external subset, that root element is declared to be the root element, giving the effect of splicing a DOCTYPE declaration at the end the prolog of a document that could not otherwise be valid. ...

XML parsing issue with '&' in element text

I have the following code:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(inputXml)));
And the parse step is throwning:
SAXParseException: The entity name must immediately follow
the '&' in the entity reference
due to the following '&' in my inputXml:
<Line1>Day & Night</Line1>
I'm not in control of in the inbound XML. How can I safely/correctly parse this?
Quite simply, the input "XML" is not valid XML. The entity should be encoded, i.e.:
<Line1>Day & Night</Line1>
Basically, there's no "proper" way to fix this other than telling the XML supplier that they're giving you garbage and getting them to fix it. If you're in some horrible situation where you've just got to deal with it, then the approach you take will likely depend on what range of values you're expected to receive.
If there's no entities in the document at all, a regex replace of & with & before processing would do the trick. But if they're sending some entities correctly, you'd need to exclude these from the matching. And on the rare chance that they actually wanted to send the entity code (i.e. sent & but meant &amp;) you're going to be completely out of luck.
But hey - it's the supplier's fault anyway, and if your attempt to fix up invalid input isn't exactly what they wanted, there's a simple thing they can do to address that. :-)
Your input XML isn't valid XML; unfortunately you can't realistically use an XML parser to parse this.
You'll need to pre-process the text before passing it to an XML parser. Although you can do a string replace, replacing '& ' with '& ', this isn't going to catch every occurrence of & in the input, but you may be able to come up with something that does.
I used Tidy framework before xml parsing
final StringWriter errorMessages = new StringWriter();
final String res = new TidyChecker().doCheck(html, errorMessages);
...
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(addRoot(html))));
...
And all Ok
is inputXML a string? Then use this:
inputXML = inputXML.replaceAll("&\\s+", "&");

Categories