Xml document to DOM object using DocumentBuilderFactory - java

I am currently modifying a piece of code, and I am wondering whether the way the XML is formatted (tabs and spacing) affects how it is parsed by a parser obtained from the DocumentBuilderFactory class.
In essence, the question is: can I pass one long string with no spacing to the parser, or does it need to be formatted in some way?
Thanks in advance. Included below is the class definition from Oracle's website.
Class DocumentBuilderFactory
"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. "

The documents will be different. Tabs and new lines will be converted into text nodes. You can eliminate these using the following method on DocumentBuilderFactory:
http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setIgnoringElementContentWhitespace(boolean)
For it to work, however, you must set up your DOM parser to validate the content against a DTD or XML schema.
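A minimal sketch of the validating setup described above, assuming a made-up sample document that carries its own internal DTD (the class name, the DTD, and the element names are illustrative, not from the question). With the content model known and validation switched on, the whitespace between the child elements should no longer show up as text nodes:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class IgnoreWhitespaceSketch {
    // Hypothetical sample: the internal DTD tells the parser that <root>
    // may contain only child elements, never character data.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE root [\n"
        + "  <!ELEMENT root (A, B)>\n"
        + "  <!ELEMENT A EMPTY>\n"
        + "  <!ELEMENT B (#PCDATA)>\n"
        + "]>\n"
        + "<root>\n  <A/>\n  <B>tagB</B>\n</root>";

    // Parses with validation enabled and returns the number of children of the root.
    static int countRootChildren(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);                       // required for the next setting
        factory.setIgnoringElementContentWhitespace(true); // drop ignorable whitespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Children of root: " + countRootChildren(XML));
    }
}
```

If your documents have no DTD or schema at all, this route is not available and the manual clean-up is the simpler option.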
Alternatively you could programmatically remove the extra whitespace yourself using something like the following:
public static void removeEmptyTextNodes(Node node) {
NodeList nodeList = node.getChildNodes();
Node childNode;
for (int x = nodeList.getLength() - 1; x >= 0; x--) {
childNode = nodeList.item(x);
if (childNode.getNodeType() == Node.TEXT_NODE) {
if (childNode.getNodeValue().trim().equals("")) {
node.removeChild(childNode);
}
} else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
removeEmptyTextNodes(childNode);
}
}
}
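To see the effect of the method above, here is a small self-contained sketch (the class name and the helper childCountsBeforeAndAfter are made up for illustration). It parses a string, counts the root's children, strips the whitespace-only text nodes, and counts again:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StripWhitespaceSketch {
    // Same idea as the method above: remove text nodes that contain only whitespace.
    static void removeEmptyTextNodes(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards so removals do not shift the indexes still to be visited.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getNodeValue().trim().isEmpty()) {
                    node.removeChild(child);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                removeEmptyTextNodes(child);
            }
        }
    }

    // Returns { children of root before stripping, children of root after stripping }.
    static int[] childCountsBeforeAndAfter(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node root = doc.getDocumentElement();
        int before = root.getChildNodes().getLength();
        removeEmptyTextNodes(root);
        int after = root.getChildNodes().getLength();
        return new int[] { before, after };
    }

    public static void main(String[] args) throws Exception {
        int[] counts = childCountsBeforeAndAfter("<root>\n<A></A>\n<B>tagB</B></root>");
        System.out.println(counts[0] + " -> " + counts[1]);
    }
}
```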

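It should not affect the parser's ability to parse the string, as long as the string is well-formed XML. Tabs and newlines between elements are there for the human reader; a DOM parser does not need them, although by default it keeps them as whitespace-only text nodes.
Note that you will have to pass an input stream (a ByteArrayInputStream, for example; StringBufferInputStream is deprecated) or an InputSource wrapping a StringReader to the DocumentBuilder, because the String version of parse treats its argument as a URI pointing to the XML.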
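A minimal sketch of parsing XML held in a String, assuming the XML is already in memory (the class and method names are made up for illustration). It wraps the string in an InputSource over a StringReader, which avoids both the URI interpretation of parse(String) and any byte-encoding step:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseFromStringSketch {
    // Parses XML text directly; parse(String) would treat the argument as a URI instead.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Convenience helper: the tag name of the document element.
    static String rootName(String xml) throws Exception {
        return parse(xml).getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootName("<root><A/><B>tagB</B></root>"));
    }
}
```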

DocumentBuilder builds different DOM trees for an XML string with line feeds and for the same XML string without them. Here is the code I tested:
String newlineChar = "\n"; // assumed: a single line feed
StringBuilder sb = new StringBuilder();
sb.append("<root>").append(newlineChar).append("<A>").append("</A>").append(newlineChar).append("<B>tagB").append("</B>").append("</root>");
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputStream xmlInput = new ByteArrayInputStream(sb.toString().getBytes());
Element documentRoot = builder.parse(xmlInput).getDocumentElement();
NodeList nodes = documentRoot.getChildNodes();
System.out.println("How many children does the root have? => " + nodes.getLength());
for(int index = 0; index < nodes.getLength(); index++){
System.out.println(nodes.item(index).getLocalName());
}
Output:
How many children does the root have? => 4
null
A
null
B
But if newlineChar is removed from the StringBuilder,
the output is:
How many children does the root have? => 2
A
B
This demonstrates that the DOM objects generated by DocumentBuilder are different.

The format of the XML string shouldn't have any effect, but I remember a strange problem when I passed a long string to an XML parser: the parser was unable to parse an XML file that was written as one long line.
It may be better to insert line breaks so that no line is longer than, let's say, 1000 bytes.
Sadly, I remember neither why that error occurred nor which parser I used.

Related

How to extract XML data using Java in Android?

I have the following XML document, from which I'm trying to get the inner text. I have tried numerous ways (XPath, DOM, SAX) without success.
This is my XML. I'm not sure whether the problem is caused by the XML structure or by my code.
<?xml version="1.0"?>
<ArrayOfPurchaseEntitites xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema">
<PurchaseEntitites>
<rInstalmentAmt>634.0</rInstalmentAmt>
<rAnnualRate>12.0</rAnnualRate>
<rInterestAmt>2670.0</rInterestAmt>
<dFirstInstalment>3/31/2016 12:00:00 AM</dFirstInstalment>
<dLastInstalment>8/31/2018 12:00:00 AM</dLastInstalment>
<rInsurancePremium>1350.0</rInsurancePremium>
<sResponseCode>00</sResponseCode>
</PurchaseEntitites>
</ArrayOfPurchaseEntitites>
InputStream stream = connect.getInputStream();
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document doc = documentBuilder.parse(stream);
doc.normalize();
System.out.println("===============================================================");
String g = doc.getDocumentElement().getTextContent();
System.out.println(g);
NodeList rootNodes = doc.getElementsByTagName("ArrayOfPurchaseEntitites");
Node rootnode =rootNodes.item(0);
Element rootElement = (Element) rootnode;
NodeList noteslist = rootElement.getElementsByTagName("PurchaseEntitites");
for(int i = 0; i < noteslist.getLength(); i++)
{
Node theNote = noteslist.item(i);
Element noteElement =(Element) theNote;
Node theExpiryDate = noteElement.getElementsByTagName("dLastInstalment").item(0);
Element dateElement = (Element) theExpiryDate;
System.out.println(dateElement.getTextContent());
}
stream.close();
I had a similar problem where I wanted to call getElementsByTagName on the first item in a NodeList. The trick, which you already use, is to cast the Node to Element. However, just to be sure, I suggest you add a check such as if (rootnode instanceof Element).
Assuming you use the packages javax.xml.parsers and org.w3c.dom (not a wild guess), your code works nicely when the XML is read from a file.
So if there is still a problem with the code (it's been a while since this question was asked), I suggest you update the question with more information about connect.getInputStream().
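A sketch of the guarded cast suggested above, assuming the XML arrives as a String rather than from connect.getInputStream() (the class and method names are made up; the element names are the ones from the question):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InstalmentSketch {
    // Returns the text of the first dLastInstalment element, or null if there is none.
    static String lastInstalment(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList entries = doc.getElementsByTagName("PurchaseEntitites");
        for (int i = 0; i < entries.getLength(); i++) {
            Node entry = entries.item(i);
            if (!(entry instanceof Element)) {
                continue; // guard before casting, as suggested above
            }
            Node date = ((Element) entry).getElementsByTagName("dLastInstalment").item(0);
            if (date != null) {
                return date.getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ArrayOfPurchaseEntitites><PurchaseEntitites>"
                + "<dLastInstalment>8/31/2018 12:00:00 AM</dLastInstalment>"
                + "</PurchaseEntitites></ArrayOfPurchaseEntitites>";
        System.out.println(lastInstalment(xml));
    }
}
```

If this works on a String but not on the live stream, the stream contents (encoding, truncated response, error page instead of XML) are the next thing to check.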

Extract XML blocks as string in Java

I have an XML as below
<accountProducts>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
</accountProducts>
Now I want to extract each accountProduct block as a string. Is there an XML parsing technique to do that, or do I need to do string manipulation?
Any help please.
Using the DOM as suggested above, you will need to parse your XML with a DocumentBuilder.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
//if your document has namespaces, you can specify that in your builder.
DocumentBuilder db = dbf.newDocumentBuilder();
Using this object, you can call the parse() method.
Your XML input can be provided to a DOM parser as a file or as a stream.
As a file...
File f = new File("MyXmlFile.xml");
Document d = db.parse(f);
As a string...
String myXmlString = "...";
InputSource ss = new InputSource(new StringReader(myXmlString));
Document d = db.parse(ss);
Once you have a Document object, you can traverse the document with DOM functions or with XPATH. This example illustrates the DOM methods.
In your example, assuming that accountProduct nodes contain only text, the following should work.
NodeList nl = d.getElementsByTagName("accountProduct");
for(int i=0; i<nl.getLength(); i++) {
Element elem = (Element)nl.item(i);
System.out.println(elem.getTextContent());
}
If accountProduct contains mixed content (text and elements), you would need more code to extract what you need.
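For the mixed-content case, one option is to serialize each matching element back to a string with an identity Transformer. This is a sketch under my own assumptions (the class name, the method extractBlocks, and the sample child element id are made up), not necessarily what the answer above had in mind:

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BlockExtractorSketch {
    // Returns every element named tagName, serialized as its own XML string.
    static List<String> extractBlocks(String xml, String tagName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // An identity transformer copies the DOM subtree to the output unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        NodeList nodes = doc.getElementsByTagName(tagName);
        List<String> blocks = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(nodes.item(i)), new StreamResult(out));
            blocks.add(out.toString());
        }
        return blocks;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<accountProducts>"
                + "<accountProduct><id>1</id></accountProduct>"
                + "<accountProduct><id>2</id></accountProduct>"
                + "</accountProducts>";
        for (String block : extractBlocks(xml, "accountProduct")) {
            System.out.println(block);
        }
    }
}
```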
Use JAXP for this.
The Java API for XML Processing (JAXP) is for processing XML data using applications written in the Java programming language.

I want to pretty print an org.w3c.dom.Document without a schema

I feel I'm going mad. I want to pretty print an org.w3c.dom.Document without a schema (in Java). Indentation is not all that I need; I also want useless empty lines and whitespace ignored. Somehow this doesn't happen: every time I parse an XML from a file or write it back to a file, there are text nodes containing whitespace in the DOM document (\n, spaces, etc.). Isn't there a simple way to get rid of these, without a schema and without transforming the XML myself by iterating over all the nodes and removing the empty text nodes?
Example: my input file looks like this (but with a lot more empty lines :)
<mytag>
<anothertag>content</anothertag>
</mytag>
I would like my output file to look like this:
<mytag>
<anothertag>content</anothertag>
</mytag>
Note: I don't have a schema for the XML (so I'm forced to call builder.setValidating(false)) and I don't have the luxury of an internet connection when this code is run.
Thanks!
UPDATE: I found something very close to what I need, and maybe it helps other soldiers fighting against XML documents without schemas:
org.apache.axis.utils.XMLUtils.normalize(document);
Source code here. Calling this after the Document is created and before it's written with a Transformer will produce the pretty output with absolutely no schema validation. JB Nizet also gave me a working answer, but I have the feeling some validation is going on behind the scenes of that code, which would make it different from my use case. I'll leave the question open for a few days, though, in case someone has an even better solution.
Here's a working example:
public class Xml {
private static final String XML =
"<mytag>\n" +
" <anothertag>content</anothertag>\n" +
"\n" +
"\n" +
"\n" +
"</mytag>";
public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException, InstantiationException, IllegalAccessException, ClassNotFoundException {
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setValidating(false);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document document = documentBuilder.parse(new InputSource(new StringReader(XML)));
NodeList childNodes = document.getDocumentElement().getChildNodes();
for (int i = 0; i < childNodes.getLength(); i++) {
System.out.println(childNodes.item(i));
}
final DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
final DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
final LSSerializer writer = impl.createLSSerializer();
writer.getDomConfig().setParameter("xml-declaration", false);
writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
System.out.println(writer.writeToString(document));
}
}
Output:
[#text:
]
[anothertag: null]
[#text:
]
<mytag>
<anothertag>content</anothertag>
</mytag>
So, the parser doesn't validate, it preserves the text nodes, and the output produced by the serializer is as you expect it.
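If the serializer you end up with keeps the whitespace-only text nodes (so the blank lines survive in the output), a fallback is to drop those nodes with a single XPath query before serializing. This is a different technique from the LSSerializer example above, and the class name and the indent-amount property are my own assumptions; the property is implementation-specific and may simply be ignored:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrettyPrintSketch {
    static String prettyPrint(String xml) throws Exception {
        // No schema and no validation involved: plain non-validating parse.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Select every text node that is empty after whitespace normalization.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList blanks = (NodeList) xpath.evaluate(
                "//text()[normalize-space(.) = '']", doc, XPathConstants.NODESET);
        for (int i = 0; i < blanks.getLength(); i++) {
            Node blank = blanks.item(i);
            blank.getParentNode().removeChild(blank);
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        // Assumption: implementation-specific property, ignored where unsupported.
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(prettyPrint(
                "<mytag>\n  <anothertag>content</anothertag>\n\n\n\n</mytag>"));
    }
}
```

The XPath query does the iteration for you, so the only manual work left is the removal loop.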

Xpath approach in case of large files

The class below shows the classic approach to parsing an XML document and querying it via XPath in Java:
public class Main {
private Document createXMLDocument(String fileName) throws Exception {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(fileName);
return doc;
}
private NodeList readXMLNodes(Document doc, String xpathExpression) throws Exception {
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpathExpression);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
}
public static void main(String[] args) throws Exception {
Main m = new Main();
Document doc = m.createXMLDocument("tv.xml");
NodeList nodes = m.readXMLNodes(doc, "//serie/eason/@id");
int n = nodes.getLength();
Map<Integer, List<String>> series = new HashMap<Integer, List<String>>();
for (int i = 1; i <= n; i++) {
nodes = m.readXMLNodes(doc, "//serie/eason[@id='" + i + "']/episode/text()");
List<String> episodes = new ArrayList<String>();
for (int j = 0; j < nodes.getLength(); j++) {
episodes.add(nodes.item(j).getNodeValue());
}
series.put(i, episodes);
}
for (Map.Entry<Integer, List<String>> entry : series.entrySet()) {
System.out.println("Season: " + entry.getKey());
for (String ep : entry.getValue()) {
System.out.println("Episodio: " + ep);
}
System.out.println("+------------------------------------+");
}
}
}
Some of the calls in there worry me when the XML file is huge, such as
Document doc = builder.parse(fileName);
return doc;
or
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
I'm worried because the xml document I need to handle is created by the customer and inside you can basically have an indefinite number of records describing emails and their contents (every user has its own personal email, so lots of html in there). I know it's not the smartest approach but it's one of the possibilities and it was already up and running before I arrived here.
My question is: how can I parse and evaluate huge xml files using xpath?
You could use the StAX parser. It will take less memory than the DOM options. A good introduction to StAX is at http://tutorials.jenkov.com/java-xml/stax.html
First of all, XPath doesn't parse XML. Your createXMLDocument() method does that, producing as output a tree representation of the parsed XML. The XPath is then used to search the tree representation.
What you are really looking for is something that searches the XML on the fly, while it is being parsed.
One way to do this is with an XQuery system that implements "document projection" (for example, Saxon-EE). This will analyze your query to see what parts of the document are needed, and when you parse your document, it will build a tree containing only those parts of the document that are actually needed.
If the query is as simple as the one in your example, however, then it isn't too hard to code it as a SAX application, where events such as startElement and endElement are notified by the XML parser to the application, without building a tree in memory.
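A streaming sketch along those lines, using StAX rather than SAX, so that no tree is built in memory. The element names season and episode and the id attribute are assumptions about the layout of tv.xml (the file itself is not shown in the question), and the class name is made up:

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StreamingEpisodesSketch {
    // Reads the document once, front to back, keeping only the episode titles.
    static Map<String, List<String>> episodesBySeason(Reader input) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(input);
        Map<String, List<String>> series = new LinkedHashMap<>();
        List<String> current = null; // episodes of the season currently being read
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("season")) {            // assumed element name
                    current = new ArrayList<>();
                    series.put(reader.getAttributeValue(null, "id"), current);
                } else if (name.equals("episode") && current != null) {
                    // getElementText() consumes the text up to the matching end tag.
                    current.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return series;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<serie><season id=\"1\"><episode>Pilot</episode></season></serie>";
        System.out.println(episodesBySeason(new StringReader(xml)));
    }
}
```

For a real file you would pass a FileReader or, better, create the stream reader from an InputStream so the parser can detect the encoding itself. Memory use then depends on the size of the result map, not on the size of the document.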

Convert a String to w3c.dom.Element: XMLParseException:Start of root element expected

I found the following piece of code from a blog, when running it I get an exception
XMLParseException:Start of root element expected. at 9th line.
Can anyone explain why I get the exception and suggest another way of converting a String to an Element?
String s = "Hello DOM Parser";
java.io.InputStream sbis = new java.io.StringBufferInputStream(s);
javax.xml.parsers.DocumentBuilderFactory b = javax.xml.parsers.DocumentBuilderFactory.newInstance();
b.setNamespaceAware(false);
org.w3c.dom.Document doc = null;
javax.xml.parsers.DocumentBuilder db = null;
db = b.newDocumentBuilder();
doc = db.parse(sbis);
org.w3c.dom.Element e = doc.getDocumentElement();
To create a DOM Element with a custom tag (which I assume is what you want, but can't be sure), you can use the following approach:
String customTag = "HelloDOMParser";
Document doc = documentBuilder.newDocument();
String fullName = nameSpacePrefix + ":" + customTag;
Element customElement = doc.createElementNS(namespaceUri, fullName);
doc.appendChild(customElement);
I am assuming you know the namespace URI and your prefix (if any). If you don't use namespaces, just use the createElement() method instead.
As said in the comment, "Hello DOM Parser" is not an XML element, so the parser doesn't know what to do with it. I don't know what kind of document you are building, but if you want HTML you can embed the text in an HTML tag, for example:
<div>Hello DOM Parser</div>
<span>Hello DOM Parser</span>
If you are building XML, you can embed the text in any tag of your choosing:
<mytag>Hello DOM Parser</mytag>
Some explanation of the DOM:
http://www.w3.org/DOM
To answer your question: to convert a String to a w3c Element, you can use createElement:
Element hello = document.createElement("hello");
hello.appendChild(document.createTextNode("Hello DOM Parser"));
This results in:
<hello>Hello DOM Parser</hello>
The parse method of DocumentBuilder accepts an input stream that contains the XML content. The following change will work for you:
String s = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root><elem1>Java</elem1></root>";
Try to avoid deprecated classes such as StringBufferInputStream. You can refer to the document below to learn more about Java XML parsing.
http://www.java-samples.com/showtutorial.php?tutorialid=152
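To make the cause of the exception concrete, here is a small sketch (class and method names are made up) that tries to parse a string and reports whether it is well-formed XML. Plain text without a root element fails; the same text wrapped in an element parses. It uses a StringReader instead of the deprecated StringBufferInputStream:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheckSketch {
    // Returns true if the text parses as an XML document, false if the parser rejects it.
    static boolean parses(String text) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors as exceptions, replacing the
        // parser's own default error reporting.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. no root element
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parses("Hello DOM Parser"));
        System.out.println(parses("<root>Hello DOM Parser</root>"));
    }
}
```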
