How can I automate the XML Parsing using JDOM - java

I have to parse an XML file using JDOM and extract some information from all of its elements.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<element1>something</element1>
<element2>
<subelement21>moo</subelement21>
<subelement22>
<subelement221>toto</subelement221>
<subelement222>tata</subelement222>
</subelement22>
</element2>
</root>
For element1 it's easy. But for element2 I have to go through its children, and if a child has children of its own, go through those too, and so on.
public static void getInfos(Vector<String> files) {
Document document = null;
Element root = null;
SAXBuilder sxb = new SAXBuilder();
for (int i =0 ; i< files.size() ; i++)
{
System.out.println("n°" + i + " : " + files.elementAt(i));
try
{
document = sxb.build(files.elementAt(i));
root = document.getRootElement();
List<?> listElements = root.getChildren();
Iterator<?> it = listElements.iterator();
while(it.hasNext())
{
Element courant = (Element)it.next();
System.out.println(courant.getName());
if(courant.getChildren().size() > 0)
{
// here is the problem: this element has children of its own
}
}
}
catch (Exception e) {
e.printStackTrace();
}
}
}
What do you suggest in this case? A recursive call, or something else, so that I can reuse the same function?
Thanks.

I would use SAX. I'd keep a stack in the ContentHandler that tracks my current path in the document, and a buffer that the characters method appends to. In endElement I'd take the content from the buffer, clear it, and then use the current path to decide what to do with it.
(This assumes the document has no mixed content.)
Here's a link to an article on using SAX to process complex XML documents, it expands on what I briefly described into an approach that handles recursive data structures. (It also has a predecessor article that is an introduction to SAX.)
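The stack-and-buffer idea described above can be sketched roughly as follows, using only the SAX classes that ship with the JDK. The class name PathCollector and the "path=text" output format are made up for illustration, and the sketch assumes there is no mixed content.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records "/path/to/element=text" for every element that has text.
class PathCollector extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();   // current position in the document
    private final StringBuilder text = new StringBuilder();  // buffer filled by characters()
    final List<String> leaves = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0); // drop whitespace seen before this start tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times for one text run
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // the path decides what the value means; here it is simply recorded
            leaves.add("/" + String.join("/", path) + "=" + value);
        }
        text.setLength(0);
        path.removeLast();
    }

    static List<String> collect(String xml) {
        try {
            PathCollector handler = new PathCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.leaves;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><element1>something</element1><element2>"
                + "<subelement21>moo</subelement21><subelement22>"
                + "<subelement221>toto</subelement221><subelement222>tata</subelement222>"
                + "</subelement22></element2></root>";
        collect(xml).forEach(System.out::println);
    }
}
```

Because the handler only looks at the path, it copes with any nesting depth without an explicit recursive call.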

You could consider using XPath to get the exact elements you want. The example here uses namespaces but the basic idea holds.

Related

JDOM getChildren() returns empty list

this is my xml:
Example:
<?xml version="1.0" encoding="UTF_8" standalone="yes"?>
<StoreMessage xmlns="http://www.xxx.com/feed">
<billingDetail>
<billingDetailId>987</billingDetailId>
<contextId>0</contextId>
<userId>
<pan>F0F8DJH348DJ</pan>
<contractSerialNumber>46446</contractSerialNumber>
</userId>
<declaredVehicleClass>A</declaredVehicleClass>
</billingDetail>
<billingDetail>
<billingDetailId>543</billingDetailId>
<contextId>0</contextId>
<userId>
<pan>F0F854534534348DJ</pan>
<contractSerialNumber>4666546446</contractSerialNumber>
</userId>
<declaredVehicleClass>C</declaredVehicleClass>
</billingDetail>
</StoreMessage>
With the JDOM parser I want to get all <billingDetail> XML nodes from it.
My code:
SAXBuilder builder = new SAXBuilder();
try {
Reader in = new StringReader(xmlAsString);
Document document = (Document)builder.build(in);
Element rootNode = document.getRootElement();
List<?> list = rootNode.getChildren("billingDetail");
XMLOutputter outp = new XMLOutputter();
outp.setFormat(Format.getCompactFormat());
for (int i = 0; i < list.size(); i++) {
Element node = (Element)list.get(i);
StringWriter sw = new StringWriter();
outp.output(node.getContent(), sw);
StringBuffer sb = sw.getBuffer();
String text = sb.toString();
xmlRecords.add(sb.toString());
}
} catch (IOException io) {
io.printStackTrace();
} catch (JDOMException jdomex) {
jdomex.printStackTrace();
}
But I never get each XML node as a string, like this:
<billingDetail>
<billingDetailId>987</billingDetailId>
<contextId>0</contextId>
<userId>
<pan>F0F8DJH348DJ</pan>
<contractSerialNumber>46446</contractSerialNumber>
</userId>
<declaredVehicleClass>A</declaredVehicleClass>
</billingDetail>
What am I doing wrong? How can I get this output with the JDOM parser?
EDIT
And why does it work if the XML starts with
<StoreMessage> instead of <StoreMessage xmlns="http://www.xxx.com/MediationFeed">?
How is this possible?
The problem is that there are two versions of the getChildren method:
java.util.List getChildren(java.lang.String name)
This returns a List of all the child elements nested directly (one level deep) within this element with the given local name and belonging to no namespace, returned as Element objects.
and
java.util.List getChildren(java.lang.String name, Namespace ns)
This returns a List of all the child elements nested directly (one level deep) within this element with the given local name and belonging to the given Namespace, returned as Element objects.
The first one doesn't find your node if it belongs to a namespace, so you should use the second one.
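In JDOM terms that would presumably be something like rootNode.getChildren("billingDetail", rootNode.getNamespace()). The same distinction exists in the JDK's own DOM API, so here is a small sketch of the effect using only standard classes rather than JDOM. The class and method names are invented for the example, and the XML is a cut-down version of the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; this uses the JDK DOM API, not JDOM.
class NamespaceLookup {
    // Counts elements called localName in the given namespace (null means "no namespace").
    static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace support is off by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<StoreMessage xmlns=\"http://www.xxx.com/feed\">"
                + "<billingDetail><billingDetailId>987</billingDetailId></billingDetail>"
                + "<billingDetail><billingDetailId>543</billingDetailId></billingDetail>"
                + "</StoreMessage>";
        // lookup without a namespace: the elements live in a namespace, so nothing matches
        System.out.println(count(xml, null, "billingDetail"));
        // lookup with the namespace declared on the root element
        System.out.println(count(xml, "http://www.xxx.com/feed", "billingDetail"));
    }
}
```

This also explains the EDIT in the question: once the xmlns attribute is removed, the elements belong to no namespace and the one-argument lookup matches them again.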

NullPointerException while trying to parse an xml file [duplicate]

I have the following:
public static void main(String args[]) {
// load the program's config data; the parameters are the file path and the XML root node to read from
confLoader conf = new confLoader("conf.xml", "config");
System.out.println(conf.getDbElement("dataSource") );
System.out.println(conf.getDbElement("dataSource") );
System.out.println(conf.getDbElement("dataSource") ); // Fails
...
The code that's responsible to build the DOM and parse from ('getDbElement()'):
public class confLoader{
DocumentBuilderFactory docBuilderFactory;
DocumentBuilder docBuilder;
Document doc;
NodeList nList;
public confLoader(String path, String XmlRoot){
try {
docBuilderFactory = DocumentBuilderFactory.newInstance();
docBuilder = docBuilderFactory.newDocumentBuilder();
doc = docBuilder.parse(new File(path));
// normalize text representation
doc.getDocumentElement().normalize();
nList = doc.getElementsByTagName(XmlRoot);
} catch (Exception e) {
e.printStackTrace();
}
}
public String getDbElement(String element) {
Node nNode = nList.item(0); // 1st item/node - sql
try {
if (nNode.getNodeType() == Node.ELEMENT_NODE) { ///// Line 36 - Problematic
Element eElement = (Element) nNode;
return (((Node) eElement.getElementsByTagName(element).item(0).getChildNodes().item(0)).getNodeValue());
}
} catch (Exception ex) {
System.out.println("Error retrieving " + element + " :" + ex.getMessage());//Thread.dumpStack();
ex.printStackTrace();
}
return "not available";
}
}
stacktrace for given code:
jdbc:mysql://localhost:...
java.lang.NullPointerException
jdbc:mysql://localhost:...
Error retrieving dataSource :null
not available
at exercise.confLoader.getDbElement(confLoader.java:36)
at exercise.Exercise.main(Exercise.java:22)
Line 36 : if (nNode.getNodeType() == Node.ELEMENT_NODE)
Reading from the XML works twice, and on the third attempt I get the NullPointerException.
Too much code! Also, reading configuration pieces on demand is not that useful. And relying on instance variables makes your code more difficult to test and to understand, and even potentially unsafe in a concurrent scenario. You don't need all those classes, methods and things. It's just a matter of
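import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class Exercise {
    public static void main(String[] args) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // the InputSource only holds the file location; each evaluate() call parses the file again
        InputSource in = new InputSource("res/config.xml");
        String user = xpath.evaluate("//sql/user/text()", in);
        String password = xpath.evaluate("//sql/password/text()", in);
        String path = xpath.evaluate("//sql/dataSource/text()", in);
        Sql sql = new Sql(path, user, password);
    }
}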
You could optionally make your code a bit more complex, by storing all of your configuration in a Map<String, String>, but really you'd better use a common API like Properties, which is able to load from XML.
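For the Properties route, a minimal sketch could look like this. It assumes the configuration is rewritten in the XML layout that java.util.Properties expects (a properties root with entry elements and the standard DOCTYPE); the class name XmlConfig and the sample values are invented, while the key names are borrowed from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical loader; in a real program the stream would come from a file.
class XmlConfig {
    static Properties load(String xml) {
        Properties props = new Properties();
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            props.loadFromXML(in); // parses the <properties>/<entry> format
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // sample values are assumptions, not taken from the question
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "  <entry key=\"dataSource\">jdbc:mysql://localhost/test</entry>\n"
                + "  <entry key=\"user\">scott</entry>\n"
                + "  <entry key=\"password\">secret</entry>\n"
                + "</properties>";
        Properties config = load(xml);
        System.out.println(config.getProperty("dataSource"));
        // a missing key falls back to the default instead of throwing
        System.out.println(config.getProperty("timeout", "not available"));
    }
}
```

Everything is read once up front, so later lookups cannot fail halfway through the way the on-demand DOM lookups in the question do.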
Problem solved by removing gnujaxp.jar from the build path.
First of all, I would recommend you don't chain too many methods on one line. Breaking the call structure up into multiple lines will increase readability and ease debugging.
For example, rewrite:
return ((Node) eElement.getElementsByTagName("password").item(0).getChildNodes().item(0)).getNodeValue();
to:
NodeList rootEls = eElement.getElementsByTagName("password");
Node rootEl = rootEls.item(0);
NodeList children = rootEl.getChildNodes();
Node passEl = children.item(0);
return passEl.getNodeValue();
When you get the NullPointerException, with this code, you can extract a lot more information from the line number in the exception.
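Once the chain is split up, it is also easy to check each step for null instead of waiting for the exception. A rough sketch of that, with invented names (ConfigReader, textOf, parseRoot) and a made-up sample document:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not the asker's confLoader.
class ConfigReader {
    // Returns the text of the first <tag> below parent, or fallback if any step is missing.
    static String textOf(Element parent, String tag, String fallback) {
        NodeList matches = parent.getElementsByTagName(tag);
        Node first = matches.item(0);          // null when no such element exists
        if (first == null) {
            return fallback;
        }
        Node textNode = first.getFirstChild(); // null when the element is empty
        if (textNode == null) {
            return fallback;
        }
        String value = textNode.getNodeValue(); // null when the first child is not text
        return value != null ? value : fallback;
    }

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<config><sql><dataSource>jdbc:mysql://localhost/test</dataSource>"
                + "<user/></sql></config>");
        System.out.println(textOf(root, "dataSource", "not available"));
        System.out.println(textOf(root, "password", "not available"));
    }
}
```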
Secondly, in this case it may prove useful to take a look at the various XML processing libraries for Java and to find one which allows the use of XPath, see also: this tutorial on Xpath.
I hope this helps. Feel free to pose any questions.

Xpath approach in case of large files

The class you're about to see is the classic approach to parsing an XML document via XPath in Java:
public class Main {
private Document createXMLDocument(String fileName) throws Exception {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(fileName);
return doc;
}
private NodeList readXMLNodes(Document doc, String xpathExpression) throws Exception {
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpathExpression);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
}
public static void main(String[] args) throws Exception {
Main m = new Main();
Document doc = m.createXMLDocument("tv.xml");
NodeList nodes = m.readXMLNodes(doc, "//serie/eason/@id");
int n = nodes.getLength();
Map<Integer, List<String>> series = new HashMap<Integer, List<String>>();
for (int i = 1; i <= n; i++) {
nodes = m.readXMLNodes(doc, "//serie/eason[@id='" + i + "']/episode/text()");
List<String> episodes = new ArrayList<String>();
for (int j = 0; j < nodes.getLength(); j++) {
episodes.add(nodes.item(j).getNodeValue());
}
series.put(i, episodes);
}
for (Map.Entry<Integer, List<String>> entry : series.entrySet()) {
System.out.println("Season: " + entry.getKey());
for (String ep : entry.getValue()) {
System.out.println("Episodio: " + ep);
}
System.out.println("+------------------------------------+");
}
}
}
In there I find some methods to be worrying in case of a huge xml file. Like the use of
Document doc = builder.parse(fileName);
return doc;
or
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
I'm worried because the xml document I need to handle is created by the customer and inside you can basically have an indefinite number of records describing emails and their contents (every user has its own personal email, so lots of html in there). I know it's not the smartest approach but it's one of the possibilities and it was already up and running before I arrived here.
My question is: how can I parse and evaluate huge xml files using xpath?
You could use the StAX parser. It will take less memory than the DOM options. A good introduction to StAX is at http://tutorials.jenkov.com/java-xml/stax.html
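As a rough idea of what the StAX version might look like: the sketch below pulls events one at a time and never builds a tree. It assumes a layout of season elements carrying an id attribute with episode children, which is only guessed from the XPath expressions in the question; the class name and sample data are invented.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical streaming reader; element and attribute names are assumptions.
class EpisodeStreamer {
    static Map<String, List<String>> episodesBySeason(Reader xml) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            String season = null; // id of the season element we are currently inside
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // only start tags matter for this query
                }
                String name = reader.getLocalName();
                if ("season".equals(name)) {
                    season = reader.getAttributeValue(null, "id");
                    result.put(season, new ArrayList<>());
                } else if ("episode".equals(name) && season != null) {
                    // getElementText() reads up to the matching end tag
                    result.get(season).add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not read XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<serie><season id=\"1\"><episode>Pilot</episode><episode>Second</episode></season>"
                + "<season id=\"2\"><episode>Return</episode></season></serie>";
        System.out.println(episodesBySeason(new StringReader(xml)));
    }
}
```

For a really large file you would pass a FileReader or an InputStream instead of a StringReader, and ideally handle each record as it arrives rather than collecting everything in a map.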
First of all, XPath doesn't parse XML. Your createXMLDocument() method does that, producing as output a tree representation of the parsed XML. The XPath is then used to search the tree representation.
What you are really looking for is something that searches the XML on the fly, while it is being parsed.
One way to do this is with an XQuery system that implements "document projection" (for example, Saxon-EE). This will analyze your query to see what parts of the document are needed, and when you parse your document, it will build a tree containing only those parts of the document that are actually needed.
If the query is as simple as the one in your example, however, then it isn't too hard to code it as a SAX application, where events such as startElement and endElement are notified by the XML parser to the application, without building a tree in memory.

Xml document to DOM object using DocumentBuilderFactory

I am currently modifying a piece of code and I am wondering if the way the XML is formatted (tabs and spacing) will affect the way in which it is parsed into the DocumentBuilderFactory class.
In essence the question is...can I pass a big long string with no spacing into the DocumentBuilderFactory or does it need to be formatted in some way?
Thanks in advance. Included below is the class definition from Oracle's website.
Class DocumentBuilderFactory
"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. "
The documents will be different. Tabs and new lines will be converted into text nodes. You can eliminate these using the following method on DocumentBuilderFactory:
http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setIgnoringElementContentWhitespace(boolean)
But in order for it to work you must set up your DOM parser to validate the content against a DTD or xml schema.
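To make the validation requirement concrete, here is a small sketch that embeds a DTD in the document itself, so the parser knows which elements may only contain other elements. The DTD, the class name and the sample document are all made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo; the inline DTD is an assumption about the document's structure.
class WhitespaceDemo {
    static final String XML =
            "<!DOCTYPE root [\n"
          + "  <!ELEMENT root (A, B)>\n"
          + "  <!ELEMENT A EMPTY>\n"
          + "  <!ELEMENT B (#PCDATA)>\n"
          + "]>\n"
          + "<root>\n  <A/>\n  <B>tagB</B>\n</root>";

    // Parses XML and returns how many child nodes the root element ends up with.
    static int rootChildCount(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(ignoreWhitespace); // the content model comes from the DTD
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)))
                    .getDocumentElement();
            return root.getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default settings:    " + rootChildCount(false));
        System.out.println("ignoring whitespace: " + rootChildCount(true));
    }
}
```

With the flag on, only the A and B elements should remain under root; with the default settings the line breaks and indentation show up as extra text nodes.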
Alternatively you could programmatically remove the extra whitespace yourself using something like the following:
public static void removeEmptyTextNodes(Node node) {
NodeList nodeList = node.getChildNodes();
Node childNode;
for (int x = nodeList.getLength() - 1; x >= 0; x--) {
childNode = nodeList.item(x);
if (childNode.getNodeType() == Node.TEXT_NODE) {
if (childNode.getNodeValue().trim().equals("")) {
node.removeChild(childNode);
}
} else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
removeEmptyTextNodes(childNode);
}
}
}
It should not affect the ability of the parser as long as the string is well-formed XML. Tabs and newlines between elements are really there for the human reader, although by default a DOM parser keeps them as whitespace-only text nodes rather than dropping them.
Note that you will have to pass an input stream or an InputSource to the DocumentBuilder (a ByteArrayInputStream, or an InputSource wrapping a StringReader; StringBufferInputStream is deprecated), because the String version of parse assumes the string is a URI pointing to the XML.
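A short sketch of parsing XML that is held in a String, on one long line with no formatting at all. The class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for parsing XML text that is already in memory.
class StringParsing {
    static Document parse(String xml) {
        try {
            // parse(String) would treat the argument as a URI, so wrap the text in an InputSource
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root><A></A><B>tagB</B></root>");
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```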
The DocumentBuilder builds different DOM objects for xml string with line feeds and xml string without line feeds. Here is the code I tested:
StringBuilder sb = new StringBuilder();
sb.append("<root>").append(newlineChar).append("<A>").append("</A>").append(newlineChar).append("<B>tagB").append("</B>").append("</root>");
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputStream xmlInput = new ByteArrayInputStream(sb.toString().getBytes());
Element documentRoot = builder.parse(xmlInput).getDocumentElement();
NodeList nodes = documentRoot.getChildNodes();
System.out.println("How many children does the root have? => " + nodes.getLength());
for(int index = 0; index < nodes.getLength(); index++){
System.out.println(nodes.item(index).getLocalName());
}
Output:
How many children does the root have? => 4
null
A
null
B
But if the newlineChar is removed from the StringBuilder,
the output is:
How many children does the root have? => 2
A
B
This demonstrates that the DOM objects generated by DocumentBuilder are different.
The format of the XML string shouldn't have any effect, but I do remember a strange problem when I passed a long String to an XML parser. The parser was unable to parse an XML file that was written all on one long line.
It may be better to insert line breaks so that no line is longer than, let's say, 1000 bytes.
Sadly I remember neither why that error occurred nor which parser I used.

Parsing xml with dom4j or jdom or anyhow

I want to read feed entries and I'm stuck. Take this for example: https://stackoverflow.com/feeds/question/2084883. Let's say I want to read the value of every summary node inside each entry node in the document. How do I do that? I've tried many variations of the code; I think this one is closest to what I want to achieve:
Element entryPoint = document.getRootElement();
Element elem;
for(Iterator iter = entryPoint.elements().iterator(); iter.hasNext();){
elem = (Element)iter.next();
System.out.println(elem.getName());
}
It goes through all the nodes in the XML file and writes their names. Now what I wanted to do next is
if(elem.getName().equals("entry"))
to get only the entry nodes. How do I get the child elements of the entry nodes, and how do I get, say, summary and its value? Thanks.
Question: how to get values of summary nodes from this link
Have you tried jdom? I find it simpler and convenient.
http://www.jdom.org/
To get all children of an xml element, you can just do
SAXBuilder sb = new SAXBuilder();
StringReader sr = new StringReader(xmlDocAsString);
Document doc = sb.build(sr);
Element root = doc.getRootElement();
List l = root.getChildren("entry");
for (Iterator iter = l.iterator(); iter.hasNext();) {
...//do whatever...
}
Here's how you'd do it using vanilla Java:
//read the XML into a DOM
StreamSource source = new StreamSource(new StringReader("<theXml></theXml>"));
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(source, result);
Node root = result.getNode();
//make XPath object aware of namespaces
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContext(){
@Override
public String getNamespaceURI(String prefix) {
if ("atom".equals(prefix)){
return "http://www.w3.org/2005/Atom";
}
return null;
}
@Override
public String getPrefix(String namespaceURI) {
return null;
}
@Override
public Iterator getPrefixes(String namespaceURI) {
return null;
}
});
//get all summaries
NodeList summaries = (NodeList) xpath.evaluate("/atom:feed/atom:entry/atom:summary", root, XPathConstants.NODESET);
for (int i = 0; i < summaries.getLength(); ++i) {
Node summary = summaries.item(i);
//print out all the attributes
for (int j = 0; j < summary.getAttributes().getLength(); ++j) {
Node attr = summary.getAttributes().item(j);
System.out.println(attr.getNodeName() + "=" + attr.getNodeValue());
}
//print text content
System.out.println(summaries.item(i).getTextContent());
}
if(elem.getName() == "entry")
I have no idea whether this is your problem (you don't really state what your problem is), but never test string equality with ==. Instead, use equals():
if(elem.getName().equals("entry"))
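A tiny sketch of why this matters: == compares object references, while equals() compares the characters. The class and method names are just for illustration.

```java
// Hypothetical demo class.
class StringEquality {
    static boolean sameReference(String a, String b) {
        return a == b;      // true only if both variables point to the same object
    }

    static boolean sameContent(String a, String b) {
        return a.equals(b); // true if the character sequences match
    }

    public static void main(String[] args) {
        // a name built at runtime, as a parser would produce it, is a separate object
        String name = new StringBuilder("en").append("try").toString();
        System.out.println(sameReference(name, "entry"));
        System.out.println(sameContent(name, "entry"));
    }
}
```

The == test can appear to work when both sides happen to be the same interned literal, which is what makes the bug hard to spot.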
A bit late but it might be useful for people googling...
There is a specialized API for dealing with RSS and Atom feeds in Java. It's called Rome, can be found here :
http://java.net/projects/rome/
It is really quite useful: it makes it easy to read a feed whatever the RSS or Atom version. You can also build feeds and generate the XML with it, though I have no experience with this feature.
Here is a simple example that reads a feed and prints out the description nodes of all the entries in the feed :
URL feedSource = new URL("http://....");
SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedSource));
List<SyndEntryImpl> entries = (List<SyndEntryImpl>)feed.getEntries();
for(SyndEntryImpl entry : entries){
System.out.println(entry.getDescription().getValue());
}
Simple enough.
