Get XPath of XML Tag

If I have an XML document like below:
<foo>
<foo1>Foo Test 1</foo1>
<foo2>
<another1>
<test10>This is a duplicate</test10>
</another1>
</foo2>
<foo2>
<another1>
<test1>Foo Test 2</test1>
</another1>
</foo2>
<foo3>Foo Test 3</foo3>
<foo4>Foo Test 4</foo4>
</foo>
How do I get the XPath of <test1> for example? So the output should be something like: foo/foo2[2]/another1/test1
I'm guessing the code would look something like this:
public String getXPath(Document document, String xmlTag) {
String xpath = "";
...
//Get the node from xmlTag
//Get the xpath using the node
return xpath;
}
Let's say String XPathVar = getXPath(document, "<test1>");. I need to get back an absolute xpath that will work in the following code:
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression xpr = xpath.compile(XPathVar);
xpr.evaluate(Document, XPathConstants.STRING);
But it can't be a shortcut like //test1 because it will also be used for meta data purposes.
When printing the result out via:
System.out.println(xpr.evaluate(Document, XPathConstants.STRING));
I should get the node's value. So if XPathVar = foo/foo2[2]/another1/test1 then I should get back:
Foo Test 2 and not This is a duplicate

You don't 'get' an xpath in the same way you don't 'get' sql.
An xpath is a query you write based on your understanding of an xml document or schema, just as sql is a query you write based on your understanding of a database schema - you don't 'get' either of them.
It would be possible to generate XPath statements from the DOM simply by walking back up the nodes from a given node, though doing this generically enough, taking into account attribute values on each node, would make the resulting code next to useless. For example (with the warning that this finds only the first node that has a given name; XPath is much more than this, and you may as well just use the XPath //test1):
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class XPathExample
{
private static String getXPath(Node root, String elementName)
{
for (int i = 0; i < root.getChildNodes().getLength(); i++)
{
Node node = root.getChildNodes().item(i);
if (node instanceof Element)
{
if (node.getNodeName().equals(elementName))
{
return "/" + node.getNodeName();
}
else if (node.getChildNodes().getLength() > 0)
{
String xpath = getXPath(node, elementName);
if (xpath != null)
{
return "/" + node.getNodeName() + xpath;
}
}
}
}
return null;
}
private static String getXPath(Document document, String elementName)
{
return document.getDocumentElement().getNodeName() + getXPath(document.getDocumentElement(), elementName);
}
public static void main(String[] args)
{
try
{
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
new ByteArrayInputStream(
("<foo><foo1>Foo Test 1</foo1><foo2><another1><test1>Foo Test 2</test1></another1></foo2><foo3>Foo Test 3</foo3><foo4>Foo Test 4</foo4></foo>").getBytes()
)
);
String xpath = "/" + getXPath(document, "test1");
System.out.println(xpath);
Node node1 = (Node)XPathFactory.newInstance().newXPath().compile(xpath).evaluate(document, XPathConstants.NODE);
Node node2 = (Node)XPathFactory.newInstance().newXPath().compile("//test1").evaluate(document, XPathConstants.NODE);
//This evaluates to true, hence you may as well just use the xpath //test1.
System.out.println(node1.equals(node2));
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
Likewise, you could write an XML transformation that turns an XML document into a series of XPath statements, but this transformation would be more complicated than writing the XPath in the first place and so largely pointless.

How's this:
private static String getXPath(Document root, String elementName)
{
try{
XPathExpression expr = XPathFactory.newInstance().newXPath().compile("//" + elementName);
Node node = (Node)expr.evaluate(root, XPathConstants.NODE);
if(node != null) {
return getXPath(node);
}
}
catch(XPathExpressionException e) { }
return null;
}
private static String getXPath(Node node) {
if(node == null || node.getNodeType() != Node.ELEMENT_NODE) {
return "";
}
return getXPath(node.getParentNode()) + "/" + node.getNodeName();
}
Note that this is first locating the node (using XPath) and then using the located node to get its XPath. Quite the roundabout approach to get a value you already have.
Working ideone example: http://ideone.com/EL4783
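Neither snippet above emits position predicates, so for the question's document they would not distinguish the two foo2 elements. A minimal sketch of one way to add them follows. It assumes a non-namespace-aware DOM, and the class name IndexedXPath and its methods are hypothetical, not taken from either answer. An index is added only when a sibling element shares the same name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper (not from the answers above): absolute path with position predicates.
final class IndexedXPath {

    /** Returns a path such as /foo/foo2[2]/another1/test1 for the given element. */
    static String of(Node node) {
        if (node == null || node.getNodeType() != Node.ELEMENT_NODE) {
            return ""; // reached the Document node: nothing more to prepend
        }
        String name = node.getNodeName();
        int position = 1;        // XPath positions are 1-based
        boolean hasTwin = false; // true if a sibling element shares this name
        for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (isElementNamed(s, name)) {
                position++;
                hasTwin = true;
            }
        }
        for (Node s = node.getNextSibling(); !hasTwin && s != null; s = s.getNextSibling()) {
            hasTwin = isElementNamed(s, name);
        }
        String step = "/" + name + (hasTwin ? "[" + position + "]" : "");
        return of(node.getParentNode()) + step;
    }

    private static boolean isElementNamed(Node node, String name) {
        return node.getNodeType() == Node.ELEMENT_NODE && node.getNodeName().equals(name);
    }

    /** Parses an XML string; checked exceptions are wrapped to keep the sketch short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<foo><foo1>Foo Test 1</foo1>"
                + "<foo2><another1><test10>This is a duplicate</test10></another1></foo2>"
                + "<foo2><another1><test1>Foo Test 2</test1></another1></foo2>"
                + "<foo3>Foo Test 3</foo3><foo4>Foo Test 4</foo4></foo>");
        Node test1 = doc.getElementsByTagName("test1").item(0);
        System.out.println(of(test1)); // prints /foo/foo2[2]/another1/test1
    }
}
```

The node still has to be located first (here with getElementsByTagName), so the caveat about the roundabout approach applies to this sketch as well.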

Related

XMLParsing, Dynamic Structure, Content

Want to achieve:
Get an unknown XML file's Elements (Element Name, How many elements are there in the xml file).
Then get all the attributes and their name and values to use it later (eg Comparison to other xml file)
element_vs_attribute
Does anyone have any idea for this?
I don't want to predefine more than 500 arrays like in the code snippet below; somehow I should be able to get the number of elements and the element names themselves dynamically.
EDIT!
Example1
<Root Attri1="" Attri2="">
<element1 EAttri1="" EAttri2=""/>
<Element2 EAttri1="" EAttri2="">
<nestedelement3 NEAttri1="" NEAttri2=""/>
</Element2>
</Root>
Example2
<Root Attri1="" Attri2="" Attr="" At="">
<element1 EAttri1="" EAttri2="">
<nestedElement2 EAttri1="" EAttri2="">
<nestedelement3 NEAttri1="" NEAttri2=""/>
</nestedElement2>
</element1>
</Root>
Program snippet:
String Example1[] = {"element1","Element2","nestedelement3"};
String Example2[] = {"element1","nestedElement2","nestedelement3"};
for(int i=0;i<Example1.length;i++){
NodeList Elements = oldDOC.getElementsByTagName(Example1[i]);
for(int j=0;j<Elements.getLength();j++) {
Node nodeinfo=Elements.item(j);
for(int l=0;l<nodeinfo.getAttributes().getLength();l++) {
.....
}
}
}
Output:
The expected result is to get all the Element and all the Attributes out from the XML file without pre defining anything.
eg:
Elements: element1 Element2 nestedelement3
Attributes: Attri1 Attri2 EAttri1 EAttri2 EAttri1 EAttri2 NEAttri1 NEAttri2

The right tool for this job is XPath.
It allows you to collect all or some elements and attributes based on various criteria. It is the closest you will get to a "universal" XML parser.
Here is the solution that I came up with. It first finds all element names in the given XML document, then counts the occurrences of each element and collects the results into a map. It then does the same for attributes.
I added inline comments, and the method and variable names should be self-explanatory.
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.*;
import java.util.stream.*;
import org.w3c.dom.*;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
public class TestXpath
{
public static void main(String[] args) {
XPath xPath = XPathFactory.newInstance().newXPath();
try (InputStream is = Files.newInputStream(Paths.get("C://temp/test.xml"))) {
// parse file into xml doc
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document xmlDocument = builder.parse(is);
// find all element names in xml doc
Set<String> allElementNames = findNames(xmlDocument, xPath.compile("//*[name()]"));
// for each name, count occurrences, and collect to map
Map<String, Integer> elementsAndOccurrences = allElementNames.stream()
.collect(Collectors.toMap(Function.identity(), name -> countElementOccurrences(xmlDocument, name)));
System.out.println(elementsAndOccurrences);
// find all attribute names in xml doc
Set<String> allAttributeNames = findNames(xmlDocument, xPath.compile("//@*"));
// for each name, count occurrences, and collect to map
Map<String, Integer> attributesAndOccurrences = allAttributeNames.stream()
.collect(Collectors.toMap(Function.identity(), name -> countAttributeOccurrences(xmlDocument, name)));
System.out.println(attributesAndOccurrences);
} catch (Exception e) {
e.printStackTrace();
}
}
public static Set<String> findNames(Document xmlDoc, XPathExpression xpathExpr) {
try {
NodeList nodeList = (NodeList)xpathExpr.evaluate(xmlDoc, XPathConstants.NODESET);
// convert nodeList to set of node names
return IntStream.range(0, nodeList.getLength())
.mapToObj(i -> nodeList.item(i).getNodeName())
.collect(Collectors.toSet());
} catch (XPathExpressionException e) {
e.printStackTrace();
}
return new HashSet<>();
}
public static int countElementOccurrences(Document xmlDoc, String elementName) {
return countOccurrences(xmlDoc, elementName, "count(//*[name()='" + elementName + "'])");
}
public static int countAttributeOccurrences(Document xmlDoc, String attributeName) {
return countOccurrences(xmlDoc, attributeName, "count(//@*[name()='" + attributeName + "'])");
}
public static int countOccurrences(Document xmlDoc, String name, String xpathExpr) {
XPath xPath = XPathFactory.newInstance().newXPath();
try {
Number count = (Number)xPath.compile(xpathExpr).evaluate(xmlDoc, XPathConstants.NUMBER);
return count.intValue();
} catch (XPathExpressionException e) {
e.printStackTrace();
}
return 0;
}
}
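The program above issues one XPath count query per distinct name. As an alternative, here is a minimal sketch that gathers the same information in a single walk over the DOM tree. The class name NameCounter and its members are hypothetical, and the sketch assumes the document fits in memory and that names are compared without namespace handling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper: counts element and attribute names in one tree walk.
final class NameCounter {
    final Map<String, Integer> elements = new TreeMap<>();
    final Map<String, Integer> attributes = new TreeMap<>();

    /** Visits the node and all of its descendants, updating both maps. */
    void walk(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            elements.merge(node.getNodeName(), 1, Integer::sum);
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                attributes.merge(attrs.item(i).getNodeName(), 1, Integer::sum);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c);
        }
    }

    /** Parses the XML string and counts its names; checked exceptions are wrapped. */
    static NameCounter count(String xml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NameCounter counter = new NameCounter();
            counter.walk(doc);
            return counter;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        NameCounter c = count("<Root Attri1='' Attri2=''>"
                + "<element1 EAttri1='' EAttri2=''/>"
                + "<Element2 EAttri1='' EAttri2=''><nestedelement3 NEAttri1='' NEAttri2=''/></Element2>"
                + "</Root>");
        System.out.println(c.elements);   // element name -> number of occurrences
        System.out.println(c.attributes); // attribute name -> number of occurrences
    }
}
```

Nothing is predefined here: the names are whatever the walk encounters, so the same code can be run on two files and the resulting maps compared.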

Getting Element (not Node) below root in Java DOM (XML parser)

I need to get the tag of an element right below the root, but DOM seems to offer only methods for getting child nodes (not elements), and you can't cast from one to the other.
http://ideone.com/SUjRmn
@Override
public void loadXml(String filepath) throws Exception {
File f = new File(filepath);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = null;
Document doc = null;
try {
db = dbf.newDocumentBuilder();
} catch (ParserConfigurationException e) {
e.printStackTrace();
}
try {
doc = db.parse(f);
} catch (SAXException | IOException | NullPointerException e) {
e.printStackTrace();
}
Element root = doc.getDocumentElement();
Node firstChild = root.getFirstChild();
String tag = firstChild.getNodeName();
//here is the problem. I can't cast from Node to Element and Node
//stores only an int value, not the name of the object I want to restore
ShapeDrawer drawable = null;
switch (tag) {
case "scribble":
drawable = new ScribbleDrawer();
...
From the class to restore:
@Override
public void setValues(Element root) {
NodeList nodelist = null;
nodelist = root.getElementsByTagName("color");
colorManager.setColor((nodelist.item(0).getTextContent()));
this.color = colorManager.getCurrentColor();
System.out.println(color.toString());
nodelist = root.getElementsByTagName("pressx");
pressx = Integer.parseInt(nodelist.item(0).getTextContent());
System.out.println(pressx);
nodelist = root.getElementsByTagName("pressy");
pressy = Integer.parseInt(nodelist.item(0).getTextContent());
System.out.println(pressy);
nodelist = root.getElementsByTagName("lastx");
lastx = Integer.parseInt(nodelist.item(0).getTextContent());
nodelist = root.getElementsByTagName("lasty");
lasty = Integer.parseInt(nodelist.item(0).getTextContent());
}
public void toDOM(Document doc, Element root) {
System.out.println("ScribbleDrawer being saved");
Element shapeBranch = doc.createElement("scribble");
Attr attr1 = doc.createAttribute("hashcode");
attr1.setValue(((Integer) this.hashCode()).toString());
shapeBranch.setAttributeNode(attr1);
root.appendChild(shapeBranch);
Element eColor = doc.createElement("color");
eColor.setTextContent(colorManager.namedColorToString(color));
shapeBranch.appendChild(eColor);
// creating tree branch
Element press = doc.createElement("press");
Attr attr2 = doc.createAttribute("pressx");
attr2.setValue(((Integer) pressy).toString());
press.setAttributeNode(attr2);
Attr attr3 = doc.createAttribute("pressy");
attr3.setValue(((Integer) pressy).toString());
press.setAttributeNode(attr3);
shapeBranch.appendChild(press);
Element last = doc.createElement("last");
Attr attr4 = doc.createAttribute("lastx");
attr4.setValue(((Integer) lastx).toString());
last.setAttributeNode(attr4);
Attr attr5 = doc.createAttribute("lasty");
attr5.setValue(((Integer) lasty).toString());
last.setAttributeNode(attr5);
shapeBranch.appendChild(last);
}
I know other parsers are easier, but I am almost finished, and when it comes to polymorphism JAXB seems to be just as complicated, with Option-marshalling etc.
EDIT: this is what the xml looks like; instead of "scribble" other tags/polymorphic children are possible which are deserialized from different instance variables (and thus different DOM-trees except for the root)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Drawables>
<scribble hashcode="189680059">
<color>Black</color>
<press pressx="221" pressy="221"/>
<last lastx="368" lasty="219"/>
</scribble>
<scribble hashcode="1215837841">
<color>Black</color>
<press pressx="246" pressy="246"/>
<last lastx="368" lasty="221"/>
</scribble>

If your node is an Element, you can cast it from node to element. But your first child might also be a text node, which can't be cast, of course. You have to test the nodes for their NodeType before casting.
If your XML is not using namespaces, you can use a method like this one to extract your child elements. It walks the child nodes of an element, tests each one, and returns a list containing only the elements:
public static List<Element> getChildren(Element element) {
List<Element> elements = new ArrayList<>();
NodeList nodeList = element.getChildNodes();
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
elements.add((Element) node);
}
}
return elements;
}
An alternative is to use an API which already includes such utility methods, like DOM4J, or JDOM.
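For the specific case in the question (the first element below the root), a minimal sketch along the same lines could look as follows. The class FirstElement and its methods are hypothetical names, and the sketch assumes a plain JDK DOM parser without namespace handling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: finds the first child that is an Element, skipping text nodes.
final class FirstElement {

    /** Returns the first child element of parent, or null if there is none. */
    static Element firstChildElement(Node parent) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) c; // safe cast: the node type was checked
            }
        }
        return null;
    }

    /** Parses an XML string and returns its root element; checked exceptions are wrapped. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The line break after <Drawables> becomes a text node, which is why
        // root.getFirstChild() alone does not return the scribble element.
        Element root = parseRoot("<Drawables>\n  <scribble hashcode='1'/>\n</Drawables>");
        Element first = firstChildElement(root);
        System.out.println(first.getTagName()); // prints scribble
    }
}
```

The returned tag name can then be used in the switch statement from the question.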

XML child node attribute value

I'm trying to read xml file, ex :
<entry>
<title>FEED TITLE</title>
<id>5467sdad98787ad3149878sasda</id>
<tempi type="application/xml">
<conento xmlns="http://mydomainname.com/xsd/radiofeed.xsd" madeIn="USA" />
</tempi>
</entry>
Here is my attempt at coding this; it was not successful, which is why I started a bounty: http://pastebin.com/huKP4KED
Bounty update:
I have really tried to do this for days now and didn't expect it to be so hard. I'll accept useful links, books, or tutorials, but I prefer code because I need this done yesterday.
Here is what I need:
Concerning xml above :
I need to get value of title, id
attribute value of tempi as well as madeIn attribute value of contento
What is the best way to do this ?
EDIT:
@Pascal Thivent
Maybe creating a method would be a good idea, like public String getValue(String xml, Element elementname), where you specify the tag name and the method returns the tag value, or a tag attribute (maybe with its name given as an additional method argument) if the value is not available.
What I really want is to get a certain tag value, or an attribute if the tag value is not available, so I'm still thinking about the best way to do so, since I've never done it before.

The best solution for this is to use XPath. Your pastebin is expired, but here's what I gathered. Let's say we have the following feed.xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<entries>
<entry>
<title>FEED TITLE 1</title>
<id>id1</id>
<tempi type="type1">
<conento xmlns="dontcare?" madeIn="MadeIn1" />
</tempi>
</entry>
<entry>
<title>FEED TITLE 2</title>
<id>id2</id>
<tempi type="type2">
<conento xmlns="dontcare?" madeIn="MadeIn2" />
</tempi>
</entry>
<entry>
<id>id3</id>
</entry>
</entries>
Here's a short but compile-and-runnable proof-of-concept (with feed.xml file in the same directory).
import javax.xml.xpath.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import java.io.*;
import java.util.*;
public class XPathTest {
static class Entry {
final String title, id, origin, type;
Entry(String title, String id, String origin, String type) {
this.title = title;
this.id = id;
this.origin = origin;
this.type = type;
}
@Override public String toString() {
return String.format("%s:%s(%s)[%s]", id, title, origin, type);
}
}
final static XPath xpath = XPathFactory.newInstance().newXPath();
static String evalString(Node context, String path) throws XPathExpressionException {
return (String) xpath.evaluate(path, context, XPathConstants.STRING);
}
public static void main(String[] args) throws Exception {
File file = new File("feed.xml");
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
NodeList entriesNodeList = (NodeList) xpath.evaluate("//entry", document, XPathConstants.NODESET);
List<Entry> entries = new ArrayList<Entry>();
for (int i = 0; i < entriesNodeList.getLength(); i++) {
Node entryNode = entriesNodeList.item(i);
entries.add(new Entry(
evalString(entryNode, "title"),
evalString(entryNode, "id"),
evalString(entryNode, "tempi/conento/@madeIn"),
evalString(entryNode, "tempi/@type")
));
}
for (Entry entry : entries) {
System.out.println(entry);
}
}
}
This produces the following output:
id1:FEED TITLE 1(MadeIn1)[type1]
id2:FEED TITLE 2(MadeIn2)[type2]
id3:()[]
Note how using XPath makes the value retrieval very simple, intuitive, readable, and straightforward, and "missing" values are also gracefully handled.
API links
package javax.xml.xpath
http://www.w3.org/TR/xpath
Wikipedia/XPath

Use Element.getAttribute and Element.setAttribute
In your example, ((Node) content.item(0)).getFirstChild().getAttributes(). Assuming that content is a typo, and you mean contento, getFirstChild is correctly returning NULL as contento has no children. Try: ((Node) contento.item(0)).getAttributes() instead.
Another issue is that by using getFirstChild and getChildNodes()[0] without checking the return value, you are running the risk of picking up child text nodes, instead of the element you want.
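One way to sidestep the text-node problem is to look elements up by tag name and read attributes through Element.getAttribute. The following is a minimal sketch under the assumption of a non-namespace-aware JDK parser; the class FeedReader and its methods are hypothetical names. Missing tags or attributes yield an empty string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads tag text and attribute values without touching text nodes.
final class FeedReader {

    /** Text of the first descendant named tagName, or "" if no such element exists. */
    static String textOf(Element scope, String tagName) {
        NodeList hits = scope.getElementsByTagName(tagName);
        return hits.getLength() == 0 ? "" : hits.item(0).getTextContent();
    }

    /** Attribute of the first descendant named tagName, or "" if tag or attribute is missing. */
    static String attributeOf(Element scope, String tagName, String attrName) {
        NodeList hits = scope.getElementsByTagName(tagName);
        return hits.getLength() == 0 ? "" : ((Element) hits.item(0)).getAttribute(attrName);
    }

    /** Parses an XML string and returns its root element; checked exceptions are wrapped. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element entry = parseRoot("<entry>\n<title>FEED TITLE</title>\n"
                + "<tempi type='application/xml'>\n"
                + "<conento xmlns='http://mydomainname.com/xsd/radiofeed.xsd' madeIn='USA' />\n"
                + "</tempi>\n</entry>");
        System.out.println(textOf(entry, "title"));                // prints FEED TITLE
        System.out.println(attributeOf(entry, "tempi", "type"));     // prints application/xml
        System.out.println(attributeOf(entry, "conento", "madeIn")); // prints USA
    }
}
```

Because getElementsByTagName returns only elements, the whitespace between tags never gets in the way.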

As pointed out, <contento> doesn't have any child so instead of:
(contento.item(0)).getFirstChild().getAttributes()
You should treat the Node as Element and use getAttribute(String), something like this:
((Element)contento.item(0)).getAttribute("madeIn")
Here is a modified version of your code (it's not the most robust code I've written):
InputStream inputStream = new ByteArrayInputStream(xml.getBytes());
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(inputStream);
doc.getDocumentElement().normalize();
System.out.println("Root element " + doc.getDocumentElement().getNodeName());
NodeList nodeLst = doc.getElementsByTagName("entry");
System.out.println("Information of all entries");
for (int s = 0; s < nodeLst.getLength(); s++) {
Node fstNode = nodeLst.item(s);
if (fstNode.getNodeType() == Node.ELEMENT_NODE) {
Element fstElmnt = (Element) fstNode;
NodeList title = fstElmnt.getElementsByTagName("title").item(0).getChildNodes();
System.out.println("Title : " + (title.item(0)).getNodeValue());
NodeList id = fstElmnt.getElementsByTagName("id").item(0).getChildNodes();
System.out.println("Id: " + (id.item(0)).getNodeValue());
Node tempiNode = fstElmnt.getElementsByTagName("tempi").item(0);
System.out.println("Type : " + ((Element) tempiNode).getAttribute("type"));
Node contento = tempiNode.getChildNodes().item(0);
System.out.println("Made in : " + ((Element) contento).getAttribute("madeIn"));
}
}
Running it on your XML snippet produces the following output:
Root element entry
Information of all entries
Title : FEED TITLE
Id: 5467sdad98787ad3149878sasda
Type : application/xml
Made in : USA
By the way, did you consider using something like Rome instead?

Speeding up xpath

I have a 1000 entry document whose format is something like:
<Example>
<Entry>
<n1></n1>
<n2></n2>
</Entry>
<Entry>
<n1></n1>
<n2></n2>
</Entry>
<!--and so on-->
There are more than 1000 Entry nodes here. I am writing a Java program that retrieves the nodes one by one and analyzes each of them. The problem is that the retrieval time grows with the node's index. For example, it takes 78 milliseconds to retrieve the first node, 100 ms to retrieve the second, and it keeps increasing; retrieving node 999 takes more than 5 seconds. This is extremely slow. We will be running this code against XML files that have far more than 1000 entries, some with millions. The total time to parse the whole document is more than 5 minutes.
I am using this simple code to traverse it. Here nxp is my own class, which has all the methods to get nodes from an XPath.
nxp.fromXpathToNode("/Example/Entry" + "[" + i + "]", doc);
Here doc is the document for the file and i is the index of the node to retrieve.
Also, when I try something like this:
List<Node> nl = nxp.fromXpathToNodes("/Example/Entry",doc);
content = nl.get(i);
I face the same problem.
Does anyone have a solution for speeding up the retrieval of the nodes, so that it takes the same amount of time to get the 1st node as the 1000th node from the XML file?
Here is the code for xpathtonode.
public Node fromXpathToNode(String expression, Node context)
{
try
{
return (Node)this.getCachedExpression(expression).evaluate(context, XPathConstants.NODE);
}
catch (Exception cause)
{
throw new RuntimeException(cause);
}
}
and here is the code for fromxpathtonodes.
public List<Node> fromXpathToNodes(String expression, Node context)
{
List<Node> nodes = new ArrayList<Node>();
NodeList results = null;
try
{
results = (NodeList)this.getCachedExpression(expression).evaluate(context, XPathConstants.NODESET);
for (int index = 0; index < results.getLength(); index++)
{
nodes.add(results.item(index));
}
}
catch (Exception cause)
{
throw new RuntimeException(cause);
}
return nodes;
}
and here is the starting
public class NativeXpathEngine implements XpathEngine
{
private final XPathFactory factory;
private final XPath engine;
/**
* Cache for previously compiled XPath expressions. {@link XPathExpression#hashCode()}
* is not reliable or consistent so use the textual representation instead.
*/
private final Map<String, XPathExpression> cachedExpressions;
public NativeXpathEngine()
{
super();
this.factory = XPathFactory.newInstance();
this.engine = factory.newXPath();
this.cachedExpressions = new HashMap<String, XPathExpression>();
}

Try VTD-XML. It uses less memory than DOM. It is easier to use than SAX and supports XPath. Here is some sample code to help you get started. It applies an XPath to get the Entry elements and then prints out the n1 and n2 child elements.
final VTDGen vg = new VTDGen();
vg.parseFile("/path/to/file.xml", false);
final VTDNav vn = vg.getNav();
final AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/Example/Entry");
int count = 1;
while (ap.evalXPath() != -1) {
System.out.println("Inside Entry: " + count);
//move to n1 child
vn.toElement(VTDNav.FIRST_CHILD, "n1");
System.out.println("\tn1: " + vn.toNormalizedString(vn.getText()));
//move to n2 child
vn.toElement(VTDNav.NEXT_SIBLING, "n2");
System.out.println("\tn2: " + vn.toNormalizedString(vn.getText()));
//move back to parent
vn.toElement(VTDNav.PARENT);
count++;
}

The correct solution is to detach the node right after you call item(i), like so:
Node node = results.item(index);
node.getParentNode().removeChild(node);
nodes.add(node);
See XPath.evaluate performance slows down (absurdly) over multiple calls
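To show how the detach step might fit into a complete loop, here is a minimal sketch using only the JDK XPath API. The class DetachedEntries and the method readN1 are hypothetical names, and the sketch assumes the document can be modified, since detaching removes each Entry from its parent. The nodes are copied into a list before anything is removed, so the loop does not depend on how the result NodeList reacts to changes in the tree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: selects all Entry nodes once, detaches each, then queries it.
final class DetachedEntries {

    /** Returns the text of the n1 child of every /Example/Entry element. */
    static List<String> readN1(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // One query for all entries instead of one indexed query per entry.
            NodeList result = (NodeList) xpath.evaluate("/Example/Entry", doc, XPathConstants.NODESET);
            List<Node> entries = new ArrayList<>();
            for (int i = 0; i < result.getLength(); i++) {
                entries.add(result.item(i));
            }
            XPathExpression n1 = xpath.compile("n1"); // relative to each Entry
            List<String> values = new ArrayList<>();
            for (Node entry : entries) {
                // Detach so the relative query only has the small subtree to consider.
                entry.getParentNode().removeChild(entry);
                values.add(n1.evaluate(entry));
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readN1("<Example><Entry><n1>a</n1><n2>b</n2></Entry>"
                + "<Entry><n1>c</n1><n2>d</n2></Entry></Example>")); // prints [a, c]
    }
}
```

Whether this removes the slowdown in a given setup depends on the XPath implementation in use, so it is worth timing before and after.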

I had a similar issue with XPath evaluation. I tried CachedXPathAPI, which was about 100 times faster than the XPathAPI I had used earlier.
More information about this API is provided here:
http://xml.apache.org/xalan-j/apidocs/org/apache/xpath/CachedXPathAPI.html
Hope it helps.
Cheers,
Madhusudhan

If you need to parse huge but flat documents, SAX is a good alternative. It allows you to handle the XML as a stream instead of building a huge DOM. Your example could be parsed using a ContentHandler like this:
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;
public class ExampleHandler extends DefaultHandler2 {
private StringBuffer chars = new StringBuffer(1000);
private MyEntry currentEntry;
private MyEntryHandler myEntryHandler;
ExampleHandler(MyEntryHandler myEntryHandler) {
this.myEntryHandler = myEntryHandler;
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
// append only the reported slice, not the parser's whole buffer
chars.append(ch, start, length);
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
if ("Entry".equals(localName)) {
myEntryHandler.handle(currentEntry);
currentEntry = null;
}
else if ("n1".equals(localName)) {
currentEntry.setN1(chars.toString());
}
else if ("n2".equals(localName)) {
currentEntry.setN2(chars.toString());
}
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
chars.setLength(0);
if ("Entry".equals(localName)) {
currentEntry = new MyEntry();
}
}
}
If the document has a deeper and more complex structure, you're going to need to use Stacks to keep track of the current path in the document. Then you should consider writing a general purpose ContentHandler to do the dirty work and use with your document type dependent handlers.
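For completeness, here is a self-contained sketch of how a handler of this kind might be driven with the JDK's SAXParser. It is not the handler above: the class EntryCollector is a hypothetical name, entries are stored as plain String pairs instead of MyEntry objects, and elements are matched on qName so that the sketch works with the default, non-namespace-aware parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects {n1, n2} pairs for every Entry element.
final class EntryCollector extends DefaultHandler {
    final List<String[]> entries = new ArrayList<>();
    private final StringBuilder chars = new StringBuilder();
    private String[] current; // {n1, n2} of the Entry being read, or null outside an Entry

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        chars.setLength(0); // start collecting text for the new element
        if ("Entry".equals(qName)) {
            current = new String[2];
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        chars.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Entry".equals(qName)) {
            entries.add(current);
            current = null;
        } else if (current != null && "n1".equals(qName)) {
            current[0] = chars.toString();
        } else if (current != null && "n2".equals(qName)) {
            current[1] = chars.toString();
        }
    }

    /** Streams the XML through the handler; checked exceptions are wrapped. */
    static List<String[]> parse(String xml) {
        try {
            EntryCollector handler = new EntryCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.entries;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String[]> result = parse("<Example><Entry><n1>a</n1><n2>b</n2></Entry>"
                + "<Entry><n1>c</n1><n2>d</n2></Entry></Example>");
        for (String[] e : result) {
            System.out.println(e[0] + " / " + e[1]);
        }
    }
}
```

For a document with millions of entries, each entry would be processed inside endElement instead of being kept in a list, so that memory use stays flat.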

What kind of parser are you using?
DOM pulls the whole document in memory - once you pull the whole document in memory then your operations can be fast but doing so in a web app or a for loop can have an impact.
A SAX parser, by contrast, streams through the document and reports elements to your handler as it reads them, so it never holds the whole tree in memory.
So try to use a parser implementation that suits your need.

Use the JAXEN library for xpaths:
http://jaxen.codehaus.org/

Best way to compare 2 XML documents in Java

I'm trying to write an automated test of an application that basically translates a custom message format into an XML message and sends it out the other end. I've got a good set of input/output message pairs so all I need to do is send the input messages in and listen for the XML message to come out the other end.
When it comes time to compare the actual output to the expected output, I'm running into some problems. My first thought was just to do string comparisons on the expected and actual messages. This doesn't work very well because the example data we have isn't always formatted consistently, and different aliases are often used for the XML namespace (and sometimes namespaces aren't used at all).
I know I can parse both strings and then walk through each element and compare them myself and this wouldn't be too difficult to do, but I get the feeling there's a better way or a library I could leverage.
So, boiled down, the question is:
Given two Java Strings which both contain valid XML how would you go about determining if they are semantically equivalent? Bonus points if you have a way to determine what the differences are.

Sounds like a job for XMLUnit
http://www.xmlunit.org/
https://github.com/xmlunit
Example:
public class SomeTest extends XMLTestCase {
@Test
public void test() throws Exception {
String xml1 = ...
String xml2 = ...
XMLUnit.setIgnoreWhitespace(true); // ignore whitespace differences
// can also compare xml Documents, InputSources, Readers, Diffs
assertXMLEqual(xml1, xml2); // assertXMLEqual comes from XMLTestCase
}
}

The following will check if the documents are equal using standard JDK libraries.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc1 = db.parse(new File("file1.xml"));
doc1.normalizeDocument();
Document doc2 = db.parse(new File("file2.xml"));
doc2.normalizeDocument();
Assert.assertTrue(doc1.isEqualNode(doc2));
normalizeDocument() is there to make sure there are no cycles (there technically wouldn't be any).
The above code will require the white space within the elements to be the same, though, because the parser preserves and evaluates it. The standard XML parser that comes with Java does not let you set a feature to produce a canonical version or to understand xml:space. If that is going to be a problem, you may need a replacement XML parser such as Xerces, or use JDOM.
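If whitespace-only text between elements is the main obstacle, one workaround that stays within the JDK is to remove those text nodes before calling isEqualNode. The following is a minimal sketch of that idea, not a full semantic comparison: the class XmlEquality and its methods are hypothetical names, whitespace inside non-blank text still counts, and differing namespace prefixes are still reported as a difference.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: compares two XML strings after dropping blank text nodes.
final class XmlEquality {

    /** True if both documents have equal root elements once blank text nodes are removed. */
    static boolean sameXml(String first, String second) {
        Document d1 = parse(first);
        Document d2 = parse(second);
        stripBlankText(d1);
        stripBlankText(d2);
        return d1.getDocumentElement().isEqualNode(d2.getDocumentElement());
    }

    /** Recursively removes text nodes that contain only whitespace. */
    private static void stripBlankText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
            child = next;
        }
    }

    /** Parses an XML string; checked exceptions are wrapped to keep the sketch short. */
    private static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.setCoalescing(true);       // merge CDATA sections into text nodes
            dbf.setIgnoringComments(true); // comments do not take part in the comparison
            return dbf.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String compact = "<a x='1' y='2'><b>t</b></a>";
        String pretty = "<a y='2' x='1'>\n  <b>t</b>\n</a>";
        System.out.println(sameXml(compact, pretty)); // prints true
    }
}
```

This only answers yes or no; to find out what the differences are, a library such as XMLUnit is still the more practical route.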

Xom has a Canonicalizer utility which turns your DOMs into a regular form, which you can then stringify and compare. So regardless of whitespace irregularities or attribute ordering, you can get regular, predictable comparisons of your documents.
This works especially well in IDEs that have dedicated visual String comparators, like Eclipse. You get a visual representation of the semantic differences between the documents.

The latest version of XMLUnit can help with asserting that two XML documents are equal. XMLUnit.setIgnoreWhitespace() and XMLUnit.setIgnoreAttributeOrder() may also be necessary for the case in question.
See working code of a simple example of XML Unit use below.
import java.util.List;
import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.XMLUnit;
import org.junit.Assert;
public class TestXml {
public static void main(String[] args) throws Exception {
String result = "<abc attr=\"value1\" title=\"something\"> </abc>";
// will be ok
assertXMLEquals("<abc attr=\"value1\" title=\"something\"></abc>", result);
}
public static void assertXMLEquals(String expectedXML, String actualXML) throws Exception {
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreAttributeOrder(true);
DetailedDiff diff = new DetailedDiff(XMLUnit.compareXML(expectedXML, actualXML));
List<?> allDifferences = diff.getAllDifferences();
Assert.assertEquals("Differences found: "+ diff.toString(), 0, allDifferences.size());
}
}
If using Maven, add this to your pom.xml:
<dependency>
<groupId>xmlunit</groupId>
<artifactId>xmlunit</artifactId>
<version>1.4</version>
</dependency>

Building on Tom's answer, here's an example using XMLUnit v2.
It uses these maven dependencies
<dependency>
<groupId>org.xmlunit</groupId>
<artifactId>xmlunit-core</artifactId>
<version>2.0.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.xmlunit</groupId>
<artifactId>xmlunit-matchers</artifactId>
<version>2.0.0</version>
<scope>test</scope>
</dependency>
..and here's the test code
import static org.junit.Assert.assertThat;
import static org.xmlunit.matchers.CompareMatcher.isIdenticalTo;
import org.junit.Test;
import org.xmlunit.builder.Input;
import org.xmlunit.input.WhitespaceStrippedSource;
public class SomeTest {
@Test
public void test() {
String result = "<root></root>";
String expected = "<root> </root>";
// ignore whitespace differences
// https://github.com/xmlunit/user-guide/wiki/Providing-Input-to-XMLUnit#whitespacestrippedsource
assertThat(result, isIdenticalTo(new WhitespaceStrippedSource(Input.from(expected).build())));
assertThat(result, isIdenticalTo(Input.from(expected).build())); // will fail due to whitespace differences
}
}
The documentation that outlines this is https://github.com/xmlunit/xmlunit#comparing-two-documents

Thanks, I extended this, try this ...
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
public class XmlDiff
{
private boolean nodeTypeDiff = true;
private boolean nodeValueDiff = true;
public boolean diff( String xml1, String xml2, List<String> diffs ) throws Exception
{
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc1 = db.parse(new ByteArrayInputStream(xml1.getBytes()));
Document doc2 = db.parse(new ByteArrayInputStream(xml2.getBytes()));
doc1.normalizeDocument();
doc2.normalizeDocument();
return diff( doc1, doc2, diffs );
}
/**
* Diff 2 nodes and put the diffs in the list
*/
public boolean diff( Node node1, Node node2, List<String> diffs ) throws Exception
{
if( diffNodeExists( node1, node2, diffs ) )
{
return true;
}
if( nodeTypeDiff )
{
diffNodeType(node1, node2, diffs );
}
if( nodeValueDiff )
{
diffNodeValue(node1, node2, diffs );
}
System.out.println(node1.getNodeName() + "/" + node2.getNodeName());
diffAttributes( node1, node2, diffs );
diffNodes( node1, node2, diffs );
return diffs.size() > 0;
}
/**
* Diff the nodes
*/
public boolean diffNodes( Node node1, Node node2, List<String> diffs ) throws Exception
{
//Index children by name, keeping document order (same-named siblings overwrite each other)
Map<String,Node> children1 = new LinkedHashMap<String,Node>();
for( Node child1 = node1.getFirstChild(); child1 != null; child1 = child1.getNextSibling() )
{
children1.put( child1.getNodeName(), child1 );
}
//Index children by name, keeping document order (same-named siblings overwrite each other)
Map<String,Node> children2 = new LinkedHashMap<String,Node>();
for( Node child2 = node2.getFirstChild(); child2!= null; child2 = child2.getNextSibling() )
{
children2.put( child2.getNodeName(), child2 );
}
//Diff all the children1
for( Node child1 : children1.values() )
{
Node child2 = children2.remove( child1.getNodeName() );
diff( child1, child2, diffs );
}
//Diff all the children2 left over
for( Node child2 : children2.values() )
{
Node child1 = children1.get( child2.getNodeName() );
diff( child1, child2, diffs );
}
return diffs.size() > 0;
}
/**
* Diff the attributes
*/
public boolean diffAttributes( Node node1, Node node2, List<String> diffs ) throws Exception
{
//Index attributes by name
NamedNodeMap nodeMap1 = node1.getAttributes();
Map<String,Node> attributes1 = new LinkedHashMap<String,Node>();
for( int index = 0; nodeMap1 != null && index < nodeMap1.getLength(); index++ )
{
attributes1.put( nodeMap1.item(index).getNodeName(), nodeMap1.item(index) );
}
//Index attributes by name
NamedNodeMap nodeMap2 = node2.getAttributes();
Map<String,Node> attributes2 = new LinkedHashMap<String,Node>();
for( int index = 0; nodeMap2 != null && index < nodeMap2.getLength(); index++ )
{
attributes2.put( nodeMap2.item(index).getNodeName(), nodeMap2.item(index) );
}
//Diff all the attributes1
for( Node attribute1 : attributes1.values() )
{
Node attribute2 = attributes2.remove( attribute1.getNodeName() );
diff( attribute1, attribute2, diffs );
}
//Diff all the attributes2 left over
for( Node attribute2 : attributes2.values() )
{
Node attribute1 = attributes1.get( attribute2.getNodeName() );
diff( attribute1, attribute2, diffs );
}
return diffs.size() > 0;
}
/**
* Check that the nodes exist
*/
public boolean diffNodeExists( Node node1, Node node2, List<String> diffs ) throws Exception
{
if( node1 == null && node2 == null )
{
// Both nodes are missing: there is nothing to compare, and getPath(null) would throw a NullPointerException
return true;
}
if( node1 == null && node2 != null )
{
diffs.add( getPath(node2) + ":node " + node1 + "!=" + node2.getNodeName() );
return true;
}
if( node1 != null && node2 == null )
{
diffs.add( getPath(node1) + ":node " + node1.getNodeName() + "!=" + node2 );
return true;
}
return false;
}
/**
* Diff the Node Type
*/
public boolean diffNodeType( Node node1, Node node2, List<String> diffs ) throws Exception
{
if( node1.getNodeType() != node2.getNodeType() )
{
diffs.add( getPath(node1) + ":type " + node1.getNodeType() + "!=" + node2.getNodeType() );
return true;
}
return false;
}
/**
* Diff the Node Value
*/
public boolean diffNodeValue( Node node1, Node node2, List<String> diffs ) throws Exception
{
if( node1.getNodeValue() == null && node2.getNodeValue() == null )
{
return false;
}
if( node1.getNodeValue() == null && node2.getNodeValue() != null )
{
diffs.add( getPath(node1) + ":value " + node1.getNodeValue() + "!=" + node2.getNodeValue() );
return true;
}
if( node1.getNodeValue() != null && node2.getNodeValue() == null )
{
diffs.add( getPath(node1) + ":value " + node1.getNodeValue() + "!=" + node2.getNodeValue() );
return true;
}
if( !node1.getNodeValue().equals( node2.getNodeValue() ) )
{
diffs.add( getPath(node1) + ":value " + node1.getNodeValue() + "!=" + node2.getNodeValue() );
return true;
}
return false;
}
/**
* Get the node path
*/
public String getPath( Node node )
{
StringBuilder path = new StringBuilder();
do
{
path.insert(0, node.getNodeName() );
path.insert( 0, "/" );
}
while( ( node = node.getParentNode() ) != null );
return path.toString();
}
}
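The getPath helper above only joins node names, so two same-named siblings (such as the two foo2 elements in the question) produce the same path. A minimal sketch of a path builder that adds positional predicates could look like the following. IndexedXPath and its methods are hypothetical names, not part of XmlDiff, and the sketch assumes a document without namespaces (prefixed names would need a NamespaceContext on the XPath side).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: builds an absolute, index-qualified XPath for an element. */
class IndexedXPath {

    /** An index such as [2] is added only when the element has same-named siblings. */
    static String of(Element element) {
        StringBuilder path = new StringBuilder();
        // Walk up until the Document node (which is not an element) is reached
        for (Node n = element; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            String name = n.getNodeName();
            int position = 1; // XPath positions are 1-based
            boolean hasTwin = false;
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(name)) {
                    position++;
                    hasTwin = true;
                }
            }
            for (Node s = n.getNextSibling(); s != null && !hasTwin; s = s.getNextSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(name)) {
                    hasTwin = true;
                }
            }
            path.insert(0, hasTwin ? "/" + name + "[" + position + "]" : "/" + name);
        }
        return path.toString();
    }

    /** Parses an XML string; checked parser exceptions are wrapped for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Evaluates an XPath against the document and returns its string value. */
    static String evaluate(Document doc, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<foo><foo2><another1><test10>This is a duplicate</test10></another1></foo2>"
                + "<foo2><another1><test1>Foo Test 2</test1></another1></foo2></foo>");
        Element test1 = (Element) doc.getElementsByTagName("test1").item(0);
        String xpath = of(test1);
        System.out.println(xpath);                // /foo/foo2[2]/another1/test1
        System.out.println(evaluate(doc, xpath)); // Foo Test 2
    }
}
```

Swapping something like this in for getPath would make the reported diff locations unambiguous, at the cost of scanning the siblings of every ancestor.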

AssertJ 1.4+ has specific assertions to compare XML content:
String expectedXml = "<foo />";
String actualXml = "<bar />";
assertThat(actualXml).isXmlEqualTo(expectedXml);
Here is the Documentation

The code below works for me:
String xml1 = ...
String xml2 = ...
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreAttributeOrder(true);
XMLAssert.assertXMLEqual(xml1, xml2);

skaffman seems to be giving a good answer.
Another way is probably to format both XML strings with a command-line utility like xmlstarlet (http://xmlstar.sourceforge.net/) and then use any diff utility (or library) to diff the resulting output files. I don't know if this is a good solution when the issues are with namespaces.

I'm using Altova DiffDog which has options to compare XML files structurally (ignoring string data).
This means that (if checking the 'ignore text' option):
<foo a="xxx" b="xxx">xxx</foo>
and
<foo b="yyy" a="yyy">yyy</foo>
are equal in the sense that they have structural equality. This is handy if you have example files that differ in data, but not structure!
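For readers without DiffDog, the idea of structural equality can be approximated with the JDK alone. The sketch below is not DiffDog's algorithm; StructuralXml is a hypothetical helper that compares element names, attribute names and child order while ignoring text and attribute values.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: compares element and attribute names only, ignoring all values. */
class StructuralXml {

    static boolean sameStructure(String xml1, String xml2) {
        return sameStructure(root(xml1), root(xml2));
    }

    static boolean sameStructure(Element a, Element b) {
        if (!a.getNodeName().equals(b.getNodeName())) {
            return false;
        }
        if (!attributeNames(a).equals(attributeNames(b))) {
            return false; // sets are compared, so attribute order does not matter
        }
        List<Element> childrenA = childElements(a);
        List<Element> childrenB = childElements(b);
        if (childrenA.size() != childrenB.size()) {
            return false;
        }
        // Assumption: child elements must appear in the same order in both documents
        for (int i = 0; i < childrenA.size(); i++) {
            if (!sameStructure(childrenA.get(i), childrenB.get(i))) {
                return false;
            }
        }
        return true;
    }

    private static Set<String> attributeNames(Element e) {
        Set<String> names = new TreeSet<>();
        NamedNodeMap attributes = e.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            names.add(attributes.item(i).getNodeName());
        }
        return names;
    }

    /** Collects child elements only; text nodes are skipped on purpose. */
    private static List<Element> childElements(Element parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) n);
            }
        }
        return result;
    }

    private static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameStructure(
                "<foo a=\"xxx\" b=\"xxx\">xxx</foo>", "<foo b=\"yyy\" a=\"yyy\">yyy</foo>")); // true
        System.out.println(sameStructure("<foo a=\"xxx\"/>", "<foo c=\"xxx\"/>"));           // false
    }
}
```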

I needed the same functionality as requested in the main question. As I was not allowed to use any third-party libraries, I created my own solution based on @Archimedes Trajano's solution.
My solution follows.
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.junit.Assert;
import org.w3c.dom.Document;
/**
* Asserts for asserting XML strings.
*/
public final class AssertXml {
private AssertXml() {
}
private static Pattern NAMESPACE_PATTERN = Pattern.compile("xmlns:(ns\\d+)=\"(.*?)\"");
/**
* Asserts that two XML are of identical content (namespace aliases are ignored).
*
* @param expectedXml expected XML
* @param actualXml actual XML
* @throws Exception thrown if XML parsing fails
*/
public static void assertEqualXmls(String expectedXml, String actualXml) throws Exception {
// Find all namespace mappings
Map<String, String> fullnamespace2newAlias = new HashMap<String, String>();
generateNewAliasesForNamespacesFromXml(expectedXml, fullnamespace2newAlias);
generateNewAliasesForNamespacesFromXml(actualXml, fullnamespace2newAlias);
for (Entry<String, String> entry : fullnamespace2newAlias.entrySet()) {
String newAlias = entry.getValue();
String namespace = entry.getKey();
Pattern nsReplacePattern = Pattern.compile("xmlns:(ns\\d+)=\"" + namespace + "\"");
expectedXml = transletaNamespaceAliasesToNewAlias(expectedXml, newAlias, nsReplacePattern);
actualXml = transletaNamespaceAliasesToNewAlias(actualXml, newAlias, nsReplacePattern);
}
// normalize namespaces according to the given mapping
DocumentBuilder db = initDocumentParserFactory();
Document expectedDocument = db.parse(new ByteArrayInputStream(expectedXml.getBytes(Charset.forName("UTF-8"))));
expectedDocument.normalizeDocument();
Document actualDocument = db.parse(new ByteArrayInputStream(actualXml.getBytes(Charset.forName("UTF-8"))));
actualDocument.normalizeDocument();
if (!expectedDocument.isEqualNode(actualDocument)) {
Assert.assertEquals(expectedXml, actualXml); // just to better visualize the differences, e.g. in Eclipse
}
}
private static DocumentBuilder initDocumentParserFactory() throws ParserConfigurationException {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(false);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();
return db;
}
private static String transletaNamespaceAliasesToNewAlias(String xml, String newAlias, Pattern namespacePattern) {
Matcher nsMatcherExp = namespacePattern.matcher(xml);
if (nsMatcherExp.find()) {
xml = xml.replaceAll(nsMatcherExp.group(1) + "[:]", newAlias + ":");
xml = xml.replaceAll(nsMatcherExp.group(1) + "=", newAlias + "=");
}
return xml;
}
private static void generateNewAliasesForNamespacesFromXml(String xml, Map<String, String> fullnamespace2newAlias) {
Matcher nsMatcher = NAMESPACE_PATTERN.matcher(xml);
while (nsMatcher.find()) {
if (!fullnamespace2newAlias.containsKey(nsMatcher.group(2))) {
fullnamespace2newAlias.put(nsMatcher.group(2), "nsTr" + (fullnamespace2newAlias.size() + 1));
}
}
}
}
It compares two XML strings and takes care of any mismatching namespace mappings by translating them to unique values in both input strings.
It can be fine-tuned, e.g. in how namespaces are translated, but for my requirements it does the job.
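An alternative to rewriting prefixes with regular expressions is to parse namespace-aware and compare namespace URIs and local names, which do not depend on the prefix. The following is a rough sketch of that idea, not the approach used in AssertXml above. NamespaceAwareCompare is a hypothetical name, and the sketch assumes prefixes never occur inside attribute values or text (as they do with xsi:type, for example).

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: compares XML by namespace URI and local name, ignoring prefixes. */
class NamespaceAwareCompare {

    static boolean equalIgnoringPrefixes(String xml1, String xml2) {
        return equal(root(xml1), root(xml2));
    }

    private static boolean equal(Node a, Node b) {
        if (a.getNodeType() != b.getNodeType()) {
            return false;
        }
        if (a.getNodeType() == Node.ELEMENT_NODE) {
            if (!Objects.equals(a.getNamespaceURI(), b.getNamespaceURI())
                    || !Objects.equals(a.getLocalName(), b.getLocalName())
                    || !attributes(a).equals(attributes(b))) {
                return false;
            }
        } else if (!Objects.equals(a.getNodeValue(), b.getNodeValue())) {
            return false; // text and other leaf nodes are compared literally
        }
        NodeList childrenA = a.getChildNodes();
        NodeList childrenB = b.getChildNodes();
        if (childrenA.getLength() != childrenB.getLength()) {
            return false;
        }
        for (int i = 0; i < childrenA.getLength(); i++) {
            if (!equal(childrenA.item(i), childrenB.item(i))) {
                return false;
            }
        }
        return true;
    }

    /** Maps "{namespaceURI}localName" to value, skipping xmlns declarations. */
    private static Map<String, String> attributes(Node element) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            String name = attr.getNodeName();
            if (name.equals("xmlns") || name.startsWith("xmlns:")) {
                continue; // declarations only bind prefixes, they carry no content
            }
            result.put("{" + attr.getNamespaceURI() + "}" + attr.getLocalName(), attr.getNodeValue());
        }
        return result;
    }

    private static Node root(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for getNamespaceURI() and getLocalName()
            dbf.setCoalescing(true);
            dbf.setIgnoringComments(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String a = "<ns1:a xmlns:ns1=\"urn:x\"><ns1:b k=\"v\">t</ns1:b></ns1:a>";
        String b = "<p:a xmlns:p=\"urn:x\"><p:b k=\"v\">t</p:b></p:a>";
        String c = "<p:a xmlns:p=\"urn:y\"><p:b k=\"v\">t</p:b></p:a>";
        System.out.println(equalIgnoringPrefixes(a, b)); // true: same URI, different prefix
        System.out.println(equalIgnoringPrefixes(a, c)); // false: different URI
    }
}
```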

This will compare full XML strings (reformatting them on the way). It makes it easy to work with your IDE (IntelliJ, Eclipse), because you can just click and visually see the difference between the XML files.
import org.apache.xml.security.c14n.CanonicalizationException;
import org.apache.xml.security.c14n.Canonicalizer;
import org.apache.xml.security.c14n.InvalidCanonicalizerException;
import org.w3c.dom.Element;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import java.io.IOException;
import java.io.StringReader;
import static org.apache.xml.security.Init.init;
import static org.junit.Assert.assertEquals;
public class XmlUtils {
static {
init();
}
public static String toCanonicalXml(String xml) throws InvalidCanonicalizerException, ParserConfigurationException, SAXException, CanonicalizationException, IOException {
Canonicalizer canon = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
byte canonXmlBytes[] = canon.canonicalize(xml.getBytes());
return new String(canonXmlBytes);
}
public static String prettyFormat(String input) throws TransformerException, ParserConfigurationException, IOException, SAXException, InstantiationException, IllegalAccessException, ClassNotFoundException {
InputSource src = new InputSource(new StringReader(input));
Element document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(src).getDocumentElement();
Boolean keepDeclaration = input.startsWith("<?xml");
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSSerializer writer = impl.createLSSerializer();
writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
writer.getDomConfig().setParameter("xml-declaration", keepDeclaration);
return writer.writeToString(document);
}
public static void assertXMLEqual(String expected, String actual) throws ParserConfigurationException, IOException, SAXException, CanonicalizationException, InvalidCanonicalizerException, TransformerException, IllegalAccessException, ClassNotFoundException, InstantiationException {
String canonicalExpected = prettyFormat(toCanonicalXml(expected));
String canonicalActual = prettyFormat(toCanonicalXml(actual));
assertEquals(canonicalExpected, canonicalActual);
}
}
I prefer this to XmlUnit because the client code (test code) is cleaner.

Using XMLUnit 2.x
In the pom.xml
<dependency>
<groupId>org.xmlunit</groupId>
<artifactId>xmlunit-assertj3</artifactId>
<version>2.9.0</version>
</dependency>
Test implementation (using junit 5) :
import org.junit.jupiter.api.Test;
import org.xmlunit.assertj3.XmlAssert;
public class FooTest {
@Test
public void compareXml() {
//
String xmlContentA = "<foo></foo>";
String xmlContentB = "<foo></foo>";
//
XmlAssert.assertThat(xmlContentA).and(xmlContentB).areSimilar();
}
}
Other methods : areIdentical(), areNotIdentical(), areNotSimilar()
More details (configuration of assertThat(~).and(~) and examples) in this documentation page.
XMLUnit also has (among other features) a DifferenceEvaluator to do more precise comparisons.
XMLUnit website

Using JExamXML with a Java application
import com.a7soft.examxml.ExamXML;
import com.a7soft.examxml.Options;
.................
// Reads two XML files into two strings
String s1 = readFile("orders1.xml");
String s2 = readFile("orders.xml");
// Loads options saved in a property file
Options.loadOptions("options");
// Compares two Strings representing XML entities
System.out.println( ExamXML.compareXMLString( s1, s2 ) );

Since you say "semantically equivalent", I assume you want to do more than literally verify that the XML outputs are equal as strings, and that you'd want something like
<foo> some stuff here</foo>
and
<foo>some stuff here</foo>
to read as equivalent. Ultimately it's going to matter how you're defining "semantically equivalent" on whatever object you're reconstituting the message from. Simply build that object from the messages and use a custom equals() to define what you're looking for.
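As a minimal sketch of that idea: FooMessage below is a hypothetical domain object, and the rule that surrounding and repeated whitespace is insignificant is an assumption made for illustration. Your own definition of equivalence would go into the constructor and equals().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical domain object rebuilt from a <foo> message. */
final class FooMessage {
    private final String text;

    FooMessage(String text) {
        // Assumption: leading/trailing whitespace and runs of whitespace carry no meaning
        this.text = text.trim().replaceAll("\\s+", " ");
    }

    /** Builds the object from the text content of the root element. */
    static FooMessage fromXml(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return new FooMessage(root.getTextContent());
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof FooMessage && text.equals(((FooMessage) other).text);
    }

    @Override
    public int hashCode() {
        return text.hashCode(); // kept consistent with equals()
    }

    public static void main(String[] args) {
        FooMessage a = fromXml("<foo> some stuff here</foo>");
        FooMessage b = fromXml("<foo>some stuff here</foo>");
        System.out.println(a.equals(b)); // true
    }
}
```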
