Xpath approach in case of large files - java

The class below is the classic approach to parsing an XML document and querying it via XPath in Java:
public class Main {
private Document createXMLDocument(String fileName) throws Exception {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse(fileName);
return doc;
}
private NodeList readXMLNodes(Document doc, String xpathExpression) throws Exception {
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpathExpression);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
}
public static void main(String[] args) throws Exception {
Main m = new Main();
Document doc = m.createXMLDocument("tv.xml");
NodeList nodes = m.readXMLNodes(doc, "//serie/eason/@id");
int n = nodes.getLength();
Map<Integer, List<String>> series = new HashMap<Integer, List<String>>();
for (int i = 1; i <= n; i++) {
nodes = m.readXMLNodes(doc, "//serie/eason[@id='" + i + "']/episode/text()");
List<String> episodes = new ArrayList<String>();
for (int j = 0; j < nodes.getLength(); j++) {
episodes.add(nodes.item(j).getNodeValue());
}
series.put(i, episodes);
}
for (Map.Entry<Integer, List<String>> entry : series.entrySet()) {
System.out.println("Season: " + entry.getKey());
for (String ep : entry.getValue()) {
System.out.println("Episodio: " + ep);
}
System.out.println("+------------------------------------+");
}
}
}
Some of the calls in there worry me when the XML file is huge, such as
Document doc = builder.parse(fileName);
return doc;
or
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
I'm worried because the XML document I need to handle is created by the customer, and it can contain an unbounded number of records describing emails and their contents (every user has their own personal email, so there is lots of HTML in there). I know it's not the smartest approach, but it's one of the possibilities and it was already up and running before I arrived here.
My question is: how can I parse and evaluate huge XML files using XPath?

You could use the StAX parser. It will take less memory than the DOM options. A good introduction to StAX is at http://tutorials.jenkov.com/java-xml/stax.html
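As a rough illustration of the StAX suggestion, the sketch below streams through the document once and collects episode text per season without building a tree. The element names season and episode and the id attribute are assumptions about the question's tv.xml (the file itself is not shown), and StaxEpisodes is a hypothetical class name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEpisodes {

    // Reads the stream once; only the collected strings are kept in memory.
    static Map<String, List<String>> readEpisodes(InputStream in) throws XMLStreamException {
        Map<String, List<String>> series = new LinkedHashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> current = null; // episodes of the season currently being read
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("season".equals(name)) {          // assumed element name
                        current = new ArrayList<>();
                        series.put(reader.getAttributeValue(null, "id"), current);
                    } else if ("episode".equals(name) && current != null) {
                        // getElementText() assumes <episode> contains text only
                        current.add(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "season".equals(reader.getLocalName())) {
                    current = null;
                }
            }
        } finally {
            reader.close();
        }
        return series;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<serie><season id=\"1\"><episode>Pilot</episode></season>"
                + "<season id=\"2\"><episode>Return</episode></season></serie>";
        Map<String, List<String>> series =
                readEpisodes(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        series.forEach((id, episodes) -> System.out.println("Season " + id + ": " + episodes));
    }
}
```

For a real file you would pass a FileInputStream instead of the in-memory sample used in main.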

First of all, XPath doesn't parse XML. Your createXMLDocument() method does that, producing as output a tree representation of the parsed XML. The XPath is then used to search the tree representation.
What you are really looking for is something that searches the XML on the fly, while it is being parsed.
One way to do this is with an XQuery system that implements "document projection" (for example, Saxon-EE). This will analyze your query to see what parts of the document are needed, and when you parse your document, it will build a tree containing only those parts of the document that are actually needed.
If the query is as simple as the one in your example, however, then it isn't too hard to code it as a SAX application, where events such as startElement and endElement are notified by the XML parser to the application, without building a tree in memory.
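A minimal sketch of the SAX approach described above might look like the following. It assumes the document uses season elements with an id attribute and text-only episode children (the question does not show the actual XML), and SaxEpisodes is a hypothetical class name. The handler reacts to startElement, characters and endElement events and keeps only the extracted strings.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxEpisodes extends DefaultHandler {
    final Map<String, List<String>> series = new LinkedHashMap<>();
    private List<String> current;   // episodes of the season being read
    private StringBuilder text;     // non-null only while inside <episode>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so qName holds the element name.
        if ("season".equals(qName)) {                       // assumed element name
            current = new ArrayList<>();
            series.put(atts.getValue("id"), current);
        } else if ("episode".equals(qName) && current != null) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append rather than assign.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("episode".equals(qName) && text != null) {
            current.add(text.toString());
            text = null;
        } else if ("season".equals(qName)) {
            current = null;
        }
    }

    static Map<String, List<String>> parse(InputStream in) throws Exception {
        SaxEpisodes handler = new SaxEpisodes();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.series;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<serie><season id=\"1\"><episode>Pilot</episode></season></serie>";
        System.out.println(parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```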

Related

Read few xml elements only in an efficient way

I want to read only a few XML tag values, and I have written the code below. The real XML is big and a bit complex, but I have simplified it for this example. Is there a more efficient way to solve this? I am using Java 8.
DocumentBuilderFactory dbfaFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = dbfaFactory.newDocumentBuilder();
Document doc = documentBuilder.parse("xml_val.xml");
System.out.println(doc.getElementsByTagName("date_added").item(0).getTextContent());
<item_list id="item_list01">
<numitems_intial>5</numitems_intial>
<item>
<date_added>1/1/2014</date_added>
<added_by person="person01" />
</item>
<item>
<date_added>1/6/2014</date_added>
<added_by person="person05" />
</item>
<numitems_current>7</numitems_current>
<manager person="person48" />
</item_list>
Use XPath and pass a specific expression to get the desired element:
public class MainJaxbXpath {
public static void main(String[] args) {
try {
FileInputStream fileIS;
fileIS = new FileInputStream("/home/luis/tmp/test.xml");
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
builder = builderFactory.newDocumentBuilder();
Document xmlDocument;
xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//item_list[@id=\"item_list01\"]//date_added[1]";
String nodeList =(String) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.STRING);
System.out.println(nodeList);
} catch (SAXException | IOException | ParserConfigurationException | XPathExpressionException e3) {
e3.printStackTrace();
}
}
}
Result:
1/1/2014
To look for more than one element on the same operation
String expression01 = "//item_list[@id=\"item_list01\"]//date_added[1]";
String expression02 = "//item_list[@id=\"item_list02\"]//date_added[2]";
String expression = String.format("%s | %s", expression01, expression02);
NodeList nodeList =(NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);
for (int i = 0; i < nodeList.getLength(); i++) {
Node currentNode = nodeList.item(i);
if (currentNode.getNodeType() == Node.ELEMENT_NODE) {
System.out.println(currentNode.getTextContent());
}
}
Some suggestions.
Firstly, don't use DOM. There's a wide range of dom-like XML tree representations available in Java; DOM is the first and the worst. Later third-party models like JDOM2 and XOM are much better designed.
Secondly, consider doing the whole thing in an XML-oriented language like XSLT or XQuery rather than in Java. In XQuery, using Saxon's XQuery API, this would be:
Processor proc = new Processor(false);
XQueryCompiler comp = proc.newXQueryCompiler();
XQueryExecutable exec = comp.compile("//date_added");
XQueryEvaluator eval = exec.load();
eval.setSource(new StreamSource(new File("/home/luis/tmp/test.xml")));
for (XdmItem item : eval.evaluate()) {
System.out.println(item.getStringValue());
}
But since the query is so simple, Saxon also has a direct map/reduce style API to access the tree. This would be:
Processor proc = new Processor(false);
XdmNode doc = proc.newDocumentBuilder().build(
new StreamSource(new File("/home/luis/tmp/test.xml")));
for (XdmItem item : doc.select(descendant("date_added")).asList()) {
System.out.println(item.getStringValue());
}
A suggestion that has nothing to do with efficiency: please use international standard dates. 1/6/2014 could be 1st June or 6th January. Writing it as 2014-06-01 (or 2014-01-06 if that's what you intended) not only avoids the kind of dangerous bugs that arise if you use an ambiguous format, it also means you can use standard date-and-time processing libraries, such as the XPath 2.0+ function library.
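To make the ambiguity concrete, here is a small sketch using java.time, which is available in the Java 8 the question mentions. The class name DateFormats and its parse helper are made up for this illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DateFormats {

    // Parses text with an explicit pattern; the caller has to know which convention applies.
    static LocalDate parse(String text, String pattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        String ambiguous = "1/6/2014";
        System.out.println(parse(ambiguous, "d/M/yyyy")); // day first:   2014-06-01
        System.out.println(parse(ambiguous, "M/d/yyyy")); // month first: 2014-01-06

        // An ISO 8601 date needs no pattern and has only one reading.
        LocalDate iso = LocalDate.parse("2014-06-01");
        System.out.println(iso.getMonth());               // JUNE
    }
}
```

The same input string yields two different dates depending on the pattern, whereas the ISO form parses without any extra information.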

How to get html from a org.w3c.dom.Node in java?

I've built a method which extracts data from an HTML document using the XPath components of Saxon-HE. I'm using the W3C DOM object model for this.
I already created a method which returns the text value, similar to the text method from jsoup (jsoupElement.text()):
protected String getNodeValue(Node node) {
NodeList childNodes = node.getChildNodes();
for (int x = 0; x < childNodes.getLength(); x++) {
Node data = childNodes.item(x);
if (data.getNodeType() == Node.TEXT_NODE)
return data.getNodeValue();
}
return "";
}
This works fine, but now I need the underlying HTML of a selected node (with jsoup it would be jsoupElement.html()). Using the W3C DOM object model I have an org.w3c.dom.Node. How can I get the HTML from an org.w3c.dom.Node as a String? I couldn't find anything regarding this in the documentation.
Just for clarification: I need the inner HTML (with or without the node element/tag) as a String, similar to http://api.jquery.com/html/ or http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#html--
To serialize a W3C DOM Node's child nodes to HTML with Saxon you can use a default Transformer where you set the output method to html:
public static String getInnerHTML(Node node) throws TransformerConfigurationException, TransformerException
{
StringWriter sw = new StringWriter();
Result result = new StreamResult(sw);
TransformerFactory factory = new net.sf.saxon.TransformerFactoryImpl();
Transformer proc = factory.newTransformer();
proc.setOutputProperty(OutputKeys.METHOD, "html");
for (int i = 0; i < node.getChildNodes().getLength(); i++)
{
proc.transform(new DOMSource(node.getChildNodes().item(i)), result);
}
return sw.toString();
}
But as said, this is a serialization of the tree. The original XML or HTML text is not stored in a DOM tree or in Saxon's tree model, so there is no way to access it.

Create XML document using nodeList

I need to create an XML Document object from a NodeList. Can someone please help me do this? This is my Java code:
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;
public class ReadFile {
public static void main(String[] args) {
String exp = "/configs/markets";
String path = "testConfig.xml";
try {
Document xmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(path);
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xPathExpression = xPath.compile(exp);
NodeList nodes = (NodeList)
xPathExpression.evaluate(xmlDocument,
XPathConstants.NODESET);
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
I want to have an XML file like this:
<configs>
<markets>
<market>
<name>Real</name>
</market>
<market>
<name>play</name>
</market>
</markets>
</configs>
Thanks in advance.
You should do it like this:
Create a new org.w3c.dom.Document, newXmlDoc, to hold the nodes from your NodeList.
Create a new root element and append it to newXmlDoc.
Then, for each node n in your NodeList, import n into newXmlDoc and append the imported copy as a child of root.
Here is the code:
public static void main(String[] args) {
String exp = "/configs/markets/market";
String path = "src/a/testConfig.xml";
try {
Document xmlDocument = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().parse(path);
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xPathExpression = xPath.compile(exp);
NodeList nodes = (NodeList) xPathExpression.
evaluate(xmlDocument, XPathConstants.NODESET);
Document newXmlDocument = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().newDocument();
Element root = newXmlDocument.createElement("root");
newXmlDocument.appendChild(root);
for (int i = 0; i < nodes.getLength(); i++) {
Node node = nodes.item(i);
Node copyNode = newXmlDocument.importNode(node, true);
root.appendChild(copyNode);
}
printXmlDocument(newXmlDocument);
} catch (Exception ex) {
ex.printStackTrace();
}
}
public static void printXmlDocument(Document document) {
DOMImplementationLS domImplementationLS =
(DOMImplementationLS) document.getImplementation();
LSSerializer lsSerializer =
domImplementationLS.createLSSerializer();
String string = lsSerializer.writeToString(document);
System.out.println(string);
}
The output is:
<?xml version="1.0" encoding="UTF-16"?>
<root><market>
<name>Real</name>
</market><market>
<name>play</name>
</market></root>
Some notes:
I've changed exp to /configs/markets/market, because I suspect you want to copy the market elements, rather than the single markets element
for the printXmlDocument, I've used the interesting code in this answer
I hope this helps.
If you don't want to create a new root element, then you may use your original XPath expression, which returns a NodeList consisting of a single node (keep in mind that your XML must have a single root element) that you can directly add to your new XML document.
See the following code, where I commented out lines from the code above:
public static void main(String[] args) {
//String exp = "/configs/markets/market/";
String exp = "/configs/markets";
String path = "src/a/testConfig.xml";
try {
Document xmlDocument = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().parse(path);
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xPathExpression = xPath.compile(exp);
NodeList nodes = (NodeList) xPathExpression.
evaluate(xmlDocument,XPathConstants.NODESET);
Document newXmlDocument = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().newDocument();
//Element root = newXmlDocument.createElement("root");
//newXmlDocument.appendChild(root);
for (int i = 0; i < nodes.getLength(); i++) {
Node node = nodes.item(i);
Node copyNode = newXmlDocument.importNode(node, true);
newXmlDocument.appendChild(copyNode);
//root.appendChild(copyNode);
}
printXmlDocument(newXmlDocument);
} catch (Exception ex) {
ex.printStackTrace();
}
}
This will give you the following output:
<?xml version="1.0" encoding="UTF-16"?>
<markets>
<market>
<name>Real</name>
</market>
<market>
<name>play</name>
</market>
</markets>
You can try the adoptNode() method of Document. You will probably need to iterate over your NodeList; you can access the individual nodes with nodeList.item(i). If you want to wrap your search results in an Element, you can use createElement() on the Document and appendChild() on the newly created Element.
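A sketch of the adoptNode() idea, under a few assumptions: AdoptNodes and wrap are hypothetical names, and the wrapper element name is chosen by the caller. Unlike importNode(), adoptNode() moves the node out of its source document, so the node references are copied into an array first in case the NodeList is live. The DOM specification allows adoptNode() to return null when adoption is not supported, so the sketch falls back to importNode() in that case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class AdoptNodes {

    // Moves every node of the list under a new wrapper element in a fresh document.
    static Document wrap(NodeList nodes, String wrapperName) throws Exception {
        Document target = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element wrapper = target.createElement(wrapperName);
        target.appendChild(wrapper);

        // Snapshot first: a live NodeList shrinks as nodes leave the source document.
        Node[] snapshot = new Node[nodes.getLength()];
        for (int i = 0; i < snapshot.length; i++) {
            snapshot[i] = nodes.item(i);
        }
        for (Node node : snapshot) {
            Node adopted = target.adoptNode(node);
            if (adopted == null) {
                // Adoption not supported by this implementation: copy instead.
                adopted = target.importNode(node, true);
            }
            wrapper.appendChild(adopted);
        }
        return target;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configs><markets><market><name>Real</name></market>"
                + "<market><name>play</name></market></markets></configs>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Document result = wrap(source.getElementsByTagName("market"), "root");
        Element root = result.getDocumentElement();
        System.out.println(root.getNodeName() + " has "
                + root.getChildNodes().getLength() + " children");
    }
}
```

If the source document must stay intact, importNode() as used in the other answer is the safer choice.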

A question on how to solve the string problem in Java

I've created a simple xml file here:
http://roberthan.host56.com/productsNew.xml
which is quite simple, the root node is [products] while all other element nodes are [product]. Under each [product] node, there are two child nodes, [code] and [name], so it basically looks like:
[product]
[code]ddd[/code]
[name]ssss[/name]
[/product]
I've also written up the following Java code to parse this XML file and take out the text content of the [product] node, and add it to a JComboBox.
docBuilder = docFactory.newDocumentBuilder();
doc = docBuilder.parse("http://roberthan.host56.com/productsNew.xml");
NodeList productNodes = doc.getElementsByTagName("product");
productlist.clear();
for (i = 0; i < productNodes.getLength(); i++)
{
Node childNode = productNodes.item(i);
if (childNode.hasChildNodes()) {
NodeList nl = childNode.getChildNodes();
Node nameNode = nl.item(2);
productlist.add(nameNode.getTextContent());
}
}
final JComboBox productComboB = new JComboBox();
Iterator iterator = productlist.iterator();
while(iterator.hasNext())
{
productComboB.addItem(iterator.next().toString());
}
The code is quite straightforward. I first parse the XML, get all the product nodes and put them into a NodeList; productlist is an ArrayList. I loop through all the [product] nodes and, for each of them, if it has child nodes, I take the second child node (which is the [name] node) and put its text content in the array list. Finally, I loop through the ArrayList and add each item to the combo box.
The problem I got is, if I select the [code] child node, which means "Node nameNode = nl.item(1)", it will work perfectly; however, if I change that item(1) to item(2) to extract all the [name] nodes, the combo box will have a drop down list, but all the items are blank, like I have inserted 10 empty strings.
Also, if I try to add a "Hello World" string into the combo box after the above code, the "Hello World" item will appear after the 10 empty items.
I have spent the whole afternoon debugging this but still no breakthrough, the XML is actually quite simple and the Java is straightforward too. Could anyone share some thoughts with me on this please. Thanks a lot!
This happens because the node list also contains text nodes.
You can see this by adding the following snippet to your code:
for(int j = 0;j<nl.getLength();j++){
System.out.println(nl.item(j).getNodeName());
}
It prints the following for each product:
#text
code
#text
name
#text
This means the name node is at index 3 (the fourth item in the list), because the whitespace between elements counts as text nodes:
Node nameNode = nl.item(3);
But I'd suggest using XPath to solve this problem (XPathAPI here is the helper class from Apache Xalan):
NodeList nodelist = XPathAPI.selectNodeList(doc, "//products/product/name");
for (int i = 0; i < nodelist.getLength(); i++) {
productlist.add(nodelist.item(i).getTextContent());
}
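If you would rather stay with plain DOM, the whitespace text nodes described above can be skipped by checking the node type instead of relying on a fixed index. This is only a sketch: ProductNames is a hypothetical class name, and the sample XML in main mimics the structure described in the question rather than the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ProductNames {

    // Collects the text of every <name> element that is a direct child of a <product>.
    static List<String> names(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList products = doc.getElementsByTagName("product");
        for (int i = 0; i < products.getLength(); i++) {
            NodeList children = products.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Ignore #text nodes (indentation and line breaks) between the elements.
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && "name".equals(child.getNodeName())) {
                    result.add(child.getTextContent());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<products>\n"
                + "  <product>\n    <code>ddd</code>\n    <name>ssss</name>\n  </product>\n"
                + "</products>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(names(doc));
    }
}
```

This keeps working if the file is later reformatted, because it no longer depends on how many text nodes sit between the elements.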
XPath using this expression will easily solve your problem:
String XPATH_EXPRESSION1 = "//name/text()";
e.g.,
public static final String PRODUCTS_NEW = "http://roberthan.host56.com/productsNew.xml";
public static final String XPATH_EXPRESSION1 = "//name/text()";
public XmlFun() {
URL productsUrl;
try {
productsUrl = new URL(PRODUCTS_NEW);
List<String> nameList = xPathExtract(productsUrl.openStream());
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (ParserConfigurationException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (XPathExpressionException e) {
e.printStackTrace();
}
}
private List<String> xPathExtract(InputStream inStream) throws ParserConfigurationException, SAXException, IOException, XPathExpressionException {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document domDoc = builder.parse(inStream);
XPathFactory xFactory = XPathFactory.newInstance();
XPath xpath = xFactory.newXPath();
XPathExpression xExpr = xpath.compile(XPATH_EXPRESSION1);
NodeList nodes = (NodeList)xExpr.evaluate(domDoc, XPathConstants.NODESET);
List<String> resultList = new ArrayList<String>();
for (int i = 0; i < nodes.getLength(); i++) {
String node = nodes.item(i).getNodeValue();
resultList.add(node);
}
return resultList;
}

Parsing xml with dom4j or jdom or anyhow

I want to read feed entries and I'm stuck. Take this feed for example: https://stackoverflow.com/feeds/question/2084883. Let's say I want to read the value of every summary node inside each entry node in the document. How do I do that? I've tried many variations of the code; this one is the closest to what I want to achieve, I think:
Element entryPoint = document.getRootElement();
Element elem;
for(Iterator iter = entryPoint.elements().iterator(); iter.hasNext();){
elem = (Element)iter.next();
System.out.println(elem.getName());
}
It goes through all the nodes in the XML file and writes their names. Now what I wanted to do next is
if(elem.getName().equals("entry"))
to get only the entry nodes. How do I get the child elements of the entry nodes, and how do I get, say, summary and its value? Thanks.
Question: how to get values of summary nodes from this link
Have you tried JDOM? I find it simpler and more convenient.
http://www.jdom.org/
To get all children of an xml element, you can just do
SAXBuilder sb = new SAXBuilder();
StringReader sr = new StringReader(xmlDocAsString);
Document doc = sb.build(sr);
Element root = doc.getRootElement();
List l = root.getChildren("entry");
for (Iterator iter = l.iterator(); iter.hasNext();) {
...//do whatever...
}
Here's how you'd do it using vanilla Java:
//read the XML into a DOM
StreamSource source = new StreamSource(new StringReader("<theXml></theXml>"));
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(source, result);
Node root = result.getNode();
//make XPath object aware of namespaces
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContext(){
@Override
public String getNamespaceURI(String prefix) {
if ("atom".equals(prefix)){
return "http://www.w3.org/2005/Atom";
}
return null;
}
@Override
public String getPrefix(String namespaceURI) {
return null;
}
@Override
public Iterator getPrefixes(String namespaceURI) {
return null;
}
});
//get all summaries
NodeList summaries = (NodeList) xpath.evaluate("/atom:feed/atom:entry/atom:summary", root, XPathConstants.NODESET);
for (int i = 0; i < summaries.getLength(); ++i) {
Node summary = summaries.item(i);
//print out all the attributes
for (int j = 0; j < summary.getAttributes().getLength(); ++j) {
Node attr = summary.getAttributes().item(j);
System.out.println(attr.getNodeName() + "=" + attr.getNodeValue());
}
//print text content
System.out.println(summaries.item(i).getTextContent());
}
if(elem.getName() == "entry")
I have no idea whether this is your problem (you don't really state what your problem is), but never test string equality with ==. Instead, use equals():
if(elem.getName().equals("entry"))
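To show why this matters, here is a tiny self-contained sketch; StringEquality and its two helper methods are made-up names for this illustration. The == operator compares object references, while equals() compares the characters.

```java
class StringEquality {

    // True only when both variables point at the very same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // True when both strings contain the same characters.
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "entry";
        String built = new String("entry"); // a distinct object with the same characters

        System.out.println(sameReference(literal, built)); // false
        System.out.println(sameContent(literal, built));   // true
    }
}
```

A name returned by a parser is typically a separately created String object, so a == comparison against a literal can fail even when the text matches.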
A bit late but it might be useful for people googling...
There is a specialized API for dealing with RSS and Atom feeds in Java. It's called Rome, can be found here :
http://java.net/projects/rome/
It is really quite useful: it makes it easy to read a feed whatever the RSS or Atom version. You can also build feeds and generate the XML with it, though I have no experience with this feature.
Here is a simple example that reads a feed and prints out the description nodes of all the entries in the feed :
URL feedSource = new URL("http://....");
SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedSource));
List<SyndEntryImpl> entries = (List<SyndEntryImpl>)feed.getEntries();
for(SyndEntryImpl entry : entries){
System.out.println(entry.getDescription().getValue());
}
Simple enough.
