Java XPath scan file looking for a word - java

I'm building an application that takes a word from the user and then scans a file using XPath, returning true or false depending on whether the word was found in that file.
I have built the following class that uses XPath, but either I am misunderstanding how it should work or there is something wrong with my code. Can anyone explain how to use XPath to search the whole file?
public XPath() throws IOException, SAXException, ParserConfigurationException, XPathExpressionException {
FileInputStream fileIS = new FileInputStream("text.xml");
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPathFactory xPathfactory = XPathFactory.newInstance();
javax.xml.xpath.XPath xPath = xPathfactory.newXPath();
XPathExpression expr = xPath.compile("//text()[contains(.,'java')]");
System.out.println(expr.evaluate(xmlDocument, XPathConstants.NODESET));
}
And this is the XML file I am currently testing on:
<?xml version="1.0"?>
<Tutorials>
<Tutorial tutId="01" type="java">
<title>Guava</title>
<description>Introduction to Guava</description>
<date>04/04/2016</date>
<author>GuavaAuthor</author>
</Tutorial>
<Tutorial tutId="02" type="java">
<title>XML</title>
<description>Introduction to XPath</description>
<date>04/05/2016</date>
<author>XMLAuthor</author>
</Tutorial>
</Tutorials>
Found the solution. I was not displaying the found entries correctly and, as someone pointed out in a comment, 'java' appears only in attributes while I was scanning only text nodes, so it would never be found. After adding the following code and changing the word my app looks for, the application works:
Object result = expr.evaluate(xmlDocument, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(nodes.item(i).getNodeValue());
}

Your XPath is searching the text() nodes, but the word java appears in the @type attribute (which is not a text() node).
If you want to search for the word in both text() and @* then you could use the union | operator and check for either/both containing that word:
//text()[contains(., 'java')] | //@*[contains(., 'java')]
But you might also want to scan comment() and processing-instruction(), so you could generically match on node() and then test in the predicate:
//node()[contains(., 'java')] | //@*[contains(., 'java')]
With XPath 2.0 or greater, you could use:
//node()[(.|@*)[contains(., 'java')]]
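For the original goal (a true/false answer for a user-supplied word), the union expression above can be evaluated as a boolean. The following is a minimal sketch, not the asker's final code: the class name WordScanner, the method containsWord, and the use of an XPath variable ($word) are assumptions made for illustration. Note that contains() is case-sensitive.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
class WordScanner {

    // Returns true if any text node or attribute value contains the word.
    static boolean containsWord(String xml, String word) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind $word through a resolver instead of concatenating it into
            // the expression, so quotes in user input cannot break the XPath.
            xpath.setXPathVariableResolver(name ->
                    "word".equals(name.getLocalPart()) ? word : null);

            String expr = "boolean(//text()[contains(., $word)]"
                    + " | //@*[contains(., $word)])";
            return (Boolean) xpath.evaluate(expr, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("Could not scan the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Tutorials><Tutorial tutId=\"01\" type=\"java\">"
                + "<title>Guava</title></Tutorial></Tutorials>";
        System.out.println(containsWord(xml, "java"));   // word is in @type
        System.out.println(containsWord(xml, "Guava"));  // word is in a text node
        System.out.println(containsWord(xml, "python")); // word is absent
    }
}
```

Dropping the `//@*[...]` branch from the expression would restrict the search to text nodes only, which is what the question's original expression did.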

Related

How to write XPath to get node attribute value from a "Name Space XML" in Java

INPUT_XML:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:ns1="http://path1/schema1" xmlns:ns2="http://path2/schema2">
<ns1:abc>1234</ns1:abc>
<ns2:def>5678</ns2:def>
</root>
In Java, I am trying to write an XPath expression that will get the value of the "xmlns:ns1" attribute from the above INPUT_XML string content.
I've tried the following:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
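// INPUT_XML holds the XML text itself, so wrap it in an InputSource (parse(String) expects a URI)
Document document = builder.parse(new InputSource(new StringReader(INPUT_XML)));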
String xpathExpression = "/root/xmlns:ns1";
// Create XPathFactory object
XPathFactory xpathFactory = XPathFactory.newInstance();
// Create XPath object
XPath xpath = xpathFactory.newXPath();
// Create XPathExpression object
XPathExpression expr = xpath.compile(xpathExpression);
// Evaluate expression result on XML document
NodeList nodes = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(nodes.item(i).getNodeValue());
}
But the above code does not give the expected value of the specified attribute, i.e. xmlns:ns1. I strongly suspect the XPath expression is wrong. Please suggest the right XPath expression or the right approach to tackle this issue.
If you're using an XPath 1.0 processor, or an XPath 2.0 processor with XPath 1.0 compatibility mode turned on, you can use the namespace axis to select the namespace value.
You will need to make the following change in your code:
String xpathExpression = "/root/namespace::ns1";
The xmlns:ns1="http://path1/schema1" and xmlns:ns2="http://path2/schema2" are not attributes, but namespace declarations. You cannot retrieve them with an XPath expression so easily (there is the XPath function namespace-uri() for this purpose, but the root element is not in any namespace; it only declares them for its descendants).
When using the DOM API you could use the lookupNamespaceURI() method:
System.out.println("ns1 = " + doc.getDocumentElement().lookupNamespaceURI("ns1"));
System.out.println("ns2 = " + doc.getDocumentElement().lookupNamespaceURI("ns2"));
When using XPath you could try the following expressions:
namespace-uri(/*[local-name()='root']/*[local-name()='abc'])
namespace-uri(/*[local-name()='root']/*[local-name()='def'])
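Both routes can be tried side by side. This is only a sketch under a few assumptions: the class and method names (NamespaceLookup, parse, uriForPrefix, evaluate) are made up for the example, and the XML is parsed from a String through an InputSource with a namespace-aware factory.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; only the JAXP/DOM calls are standard API.
class NamespaceLookup {

    static final String INPUT_XML =
            "<root xmlns:ns1=\"http://path1/schema1\""
            + " xmlns:ns2=\"http://path2/schema2\">"
            + "<ns1:abc>1234</ns1:abc><ns2:def>5678</ns2:def></root>";

    // Parses XML held in a String. The factory is namespace-aware so that
    // both lookups below can see the declarations.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    // DOM route: ask the root element which URI a prefix is bound to.
    static String uriForPrefix(Document doc, String prefix) {
        return doc.getDocumentElement().lookupNamespaceURI(prefix);
    }

    // XPath route: evaluate an expression and return its string value.
    static String evaluate(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(INPUT_XML);
        System.out.println("ns1 = " + uriForPrefix(doc, "ns1"));
        System.out.println("ns2 = " + uriForPrefix(doc, "ns2"));
        System.out.println("abc is in " + evaluate(doc,
                "namespace-uri(/*[local-name()='root']/*[local-name()='abc'])"));
    }
}
```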

Using XPath count function

I am using an Oracle SQL database to carry out SQL queries with XPath expressions. I have created an XML file which contains data relating to a film.
The XPath expression you're looking for (not the SQL expression) is:
count(/film/directors/director)
which should return 1 with your example XML file.
If you want to check whether it's 2, use
count(/film/directors/director) = 2
which should return FALSE with your XML file.
First, you obviously know you need to use XPath to query the XML file, but you seem to have misunderstood what XPath is or how it should be used.
My first suggestion would be to go and read up on XPath, and on XPath in Java, because it has nothing to do with SQL.
I then did a quick search for "java xpath count" and came across a number of excellent examples. Based on the XPath count() function, I went about testing your document with...
try {
DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
DocumentBuilder b = f.newDocumentBuilder();
// This is your document in a file
Document d = b.parse(new File("Test.xml"));
d.getDocumentElement().normalize();
String expression = "//film[count(directors)=1]";
XPath xPath = XPathFactory.newInstance().newXPath();
Object result = xPath.compile(expression).evaluate(d, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println(nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
Node node = nodes.item(i);
System.out.println("Found " + node.getTextContent());
}
} catch (ParserConfigurationException | SAXException | IOException | XPathExpressionException | DOMException exp) {
exp.printStackTrace();
}
This basically listed the film node (it found one match). Why did it produce a result? Look at the query, //film[count(directors)=1]: it lists every film with exactly one directors child, because I wanted to test the query. Change it to //film[count(directors)=2] and it will return zero matches for your example.
I would highly recommend that you pause for a moment and become more familiar with what XPath is and how it works before you continue.

How can I keep from returning white spaces and line returns in-between nodes with xpath?

I am trying to learn to use XPath in Java, but have run into an issue. When I use getNodeName and getTextContent, I end up grabbing the whitespace and line returns that occur in-between nodes. For example, if my XML looks like:
<node-i-am-looking-for-in-my-xml>
<parent-node-01>
<child-node-01>
some text
</child-node-01>
<child-node-02>
some more text
</child-node-02>
<child-node-03>
even more text
</child-node-03>
</parent-node-01>
<parent-node-02>
<child-node-01>
some text
</child-node-01>
<child-node-02>
some more text
</child-node-02>
<child-node-03>
even more text
</child-node-03>
</parent-node-02>
<parent-node-03>
<child-node-01>
some text
</child-node-01>
<child-node-02>
some more text
</child-node-02>
<child-node-03>
even more text
</child-node-03>
</parent-node-03>
</node-i-am-looking-for-in-my-xml>
What I get when I use getNodeName looks like:
child-node-01
#text
child-node-02
#text
child-node-03
#text
And when I use getTextContent, it looks like:
some text
some more text
even more text
This is the code I am using:
public static void main(String[] args) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
DocumentBuilder db = dbf.newDocumentBuilder();
String filename = "C:\\Users\\Me\\file.xml";
Document doc = db.parse(new FileInputStream(new File(filename)));
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
String expression;
Node node;
NodeList nodeList;
expression = "//node-i-am-looking-for-in-my-xml/*";
nodeList = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
System.out.println("nodeList.getLength(): " + nodeList.getLength());
for (int i = 0; i < nodeList.getLength(); i++) {
for(int j=1; j<(nodeList.item(i).getChildNodes().getLength()); j++){
Node nowNode = nodeList.item(i).getChildNodes().item(j);
System.out.println(nowNode.getNodeName() + ":" + nowNode.getTextContent());
}
}
}
In looking around Google, it appears I need to use "normalize-space", but I cannot figure out how to implement that.
As you have seen, whitespace is significant in XML text nodes. The text contents of child-node-01 (or more accurately, the contents of the text node whose parent is child-node-01) is actually '\n some text\n '.
You would only use normalize-space if you needed to deal with this whitespace inside an XPath expression, as normalize-space is an XPath function. For example, if you wanted to select all nodes whose text content (with leading/trailing whitespace stripped) was 'some text', you could use an XPath like:
//*[normalize-space(.) = 'some text']
But by the time you've retrieved the text content, you're outside of the XPath world, and back in Java, so you might be better off with:
nowNode.getTextContent().trim()
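To avoid the #text entries altogether, one option is to let XPath select only element children (`*` never matches text nodes) and clean the text per element. The sketch below is an assumption-laden illustration: the class name ChildText and its method are hypothetical, and the sample XML is a shortened version of the one in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; sample data is shortened from the question.
class ChildText {

    static final String SAMPLE_XML =
            "<node-i-am-looking-for-in-my-xml>\n"
            + "  <parent-node-01>\n"
            + "    <child-node-01>\n      some text\n    </child-node-01>\n"
            + "    <child-node-02>\n      some more text\n    </child-node-02>\n"
            + "  </parent-node-01>\n"
            + "</node-i-am-looking-for-in-my-xml>";

    // Returns "name:text" for every grandchild element of the wrapper node.
    static List<String> childTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // "*" matches elements only, so the whitespace-only text nodes
            // between the child elements are never selected.
            NodeList children = (NodeList) xpath.evaluate(
                    "//node-i-am-looking-for-in-my-xml/*/*",
                    doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // normalize-space trims the ends and collapses inner runs
                // of whitespace; getTextContent().trim() only trims the ends.
                String text = xpath.evaluate("normalize-space(.)", child);
                result.add(child.getNodeName() + ":" + text);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the XML", e);
        }
    }

    public static void main(String[] args) {
        childTexts(SAMPLE_XML).forEach(System.out::println);
    }
}
```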

How to access values when reading XML using XPath in Java

I want to read XML data using XPath in Java.
I have the following XML file named MyXML.xml:
<?xml version="1.0" encoding="iso-8859-1" ?>
<REPOSITORY xmlns:LIBRARY="http://www.openarchives.org/LIBRARY/2.0/"
xmlns:xsi="http://www.w3.prg/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/LIBRARY/2.0/ http://www.openarchives.org/LIBRARY/2.0/LIBRARY-PHM.xsd">
<repository>Test</repository>
<records>
<record>
<ejemplar>
<library_book:book
xmlns:library_book="http://www.w3c.es/LIBRARY/book/"
xmlns:book="http://www.w3c.es/LIBRARY/book/"
xmlns:bookAssets="http://www.w3c.es/LIBRARY/book/"
xmlns:bookAsset="http://www.w3c.es/LIBRARY/book/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3c.es/LIBRARY/book/ http://www.w3c.es/LIBRARY/replacement/book.xsd">
<book:bookAssets count="1">
<book:bookAsset nasset="1">
<book:bookAsset.id>value1</book:bookAsset.id>
<book:bookAsset.event>
<book:bookAsset.event.id>value2</book:bookAsset.event.id>
</book:bookAsset.event>
</book:bookAsset>
</book:bookAssets>
</library_book:book>
</ejemplar>
</record>
</records>
</REPOSITORY>
I want to access the values value1 and value2. For this, I tried the following:
// Standard way of reading an XML file
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder;
Document doc = null;
XPathExpression expr = null;
builder = factory.newDocumentBuilder();
doc = builder.parse("MyXML.xml");
// Create an XPathFactory
XPathFactory xFactory = XPathFactory.newInstance();
// Create an XPath object
XPath xpath = xFactory.newXPath();
expr = xpath.compile("//REPOSITORY/records/record/ejemplar/library_book:book//book:bookAsset.event.id/text()");
Object result = expr.evaluate(doc, XPathConstants.STRING);
System.out.println("RESULT=" + (String)result);
But I don't get any results. It only prints RESULT=.
How can I access the values value1 and value2? What XPath expression should I use?
Thanks in advance.
I'm using JDK 6.
You are having problems with namespaces. What you can do is either:
1. take them into account, or
2. ignore them using the XPath local-name() function.
Solution 1 means implementing a NamespaceContext that maps namespace prefixes to URIs and setting it on the XPath object before querying.
Solution 2 is easy: you just need to change your XPath (though depending on your XML you may need to fine-tune it to be sure you select the correct element):
XPath xpath = xFactory.newXPath();
expr = xpath.compile("//*[local-name()='bookAsset.event.id']/text()");
Object result = expr.evaluate(doc, XPathConstants.STRING);
System.out.println("RESULT=" + result);
Runnable example on ideone.
You can take a look at the following blog article to better understand the uses of namespaces and XPath in Java (even if old)
Try
Object result = expr.evaluate(doc, XPathConstants.NODESET);
// Cast the result to a DOM NodeList
NodeList nodes = (NodeList) result;
for (int i=0; i<nodes.getLength();i++){
System.out.println(nodes.item(i).getNodeValue());
}
One approach is to implement a NamespaceContext like this:
public static class UniversalNamespaceResolver implements NamespaceContext {
private Document sourceDocument;
public UniversalNamespaceResolver(Document document) {
sourceDocument = document;
}
public String getNamespaceURI(String prefix) {
if (prefix.equals(XMLConstants.DEFAULT_NS_PREFIX)) {
return sourceDocument.lookupNamespaceURI(null);
} else {
return sourceDocument.lookupNamespaceURI(prefix);
}
}
public String getPrefix(String namespaceURI) {
return sourceDocument.lookupPrefix(namespaceURI);
}
public Iterator getPrefixes(String namespaceURI) {
return null;
}
}
And then use it like
xpath.setNamespaceContext(new UniversalNamespaceResolver(doc));
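You also need to move all the namespace declarations up to the root node (REPOSITORY). Document.lookupNamespaceURI() resolves prefixes starting from the document element, so a prefix declared only on a nested element (as book is here) would not be found, and declarations on two different levels can cause problems.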
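If moving the declarations is not an option, an alternative is a NamespaceContext with fixed mappings that does not consult the document at all. The sketch below assumes a single hard-coded prefix (book) bound to the URI from the question; the class and method names are hypothetical, and the XML is a condensed version of MyXML.xml.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class with a fixed prefix-to-URI mapping.
class BookAssetReader {

    static final String BOOK_NS = "http://www.w3c.es/LIBRARY/book/";

    // Only the URI has to match the document; the prefix used in the XPath
    // is local to this context and need not match the document's prefixes.
    static final NamespaceContext BOOK_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "book".equals(prefix) ? BOOK_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return BOOK_NS.equals(namespaceURI) ? "book" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    };

    static final String SAMPLE_XML =
            "<REPOSITORY><records><record><ejemplar>"
            + "<library_book:book xmlns:library_book=\"" + BOOK_NS + "\""
            + " xmlns:book=\"" + BOOK_NS + "\">"
            + "<book:bookAssets count=\"1\"><book:bookAsset nasset=\"1\">"
            + "<book:bookAsset.id>value1</book:bookAsset.id>"
            + "<book:bookAsset.event>"
            + "<book:bookAsset.event.id>value2</book:bookAsset.event.id>"
            + "</book:bookAsset.event>"
            + "</book:bookAsset></book:bookAssets>"
            + "</library_book:book></ejemplar></record></records></REPOSITORY>";

    // Evaluates the expression with the fixed context and returns a string.
    static String read(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(BOOK_CONTEXT);
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(SAMPLE_XML, "//book:bookAsset.id"));
        System.out.println(read(SAMPLE_XML, "//book:bookAsset.event.id"));
    }
}
```

The trade-off is that every prefix used in an expression has to be registered by hand, whereas the document-backed resolver picks up whatever the root element declares.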

Parsing XML in Java from Wordpress feed

private void parseXml(String urlPath) throws Exception {
URL url = new URL(urlPath);
URLConnection connection = url.openConnection();
DocumentBuilder db = DOCUMENT_BUILDER_FACTORY.newDocumentBuilder();
final Document document = db.parse(connection.getInputStream());
XPath xPathEvaluator = XPATH_FACTORY.newXPath();
XPathExpression nameExpr = xPathEvaluator.compile("rss/channel/item/title");
NodeList trackNameNodes = (NodeList) nameExpr.evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < trackNameNodes.getLength(); i++) {
Node trackNameNode = trackNameNodes.item(i);
System.out.println(String.format("Blog Entry Title: %s" , trackNameNode.getTextContent()));
XPathExpression artistNameExpr = xPathEvaluator.compile("rss/channel/item/content:encoded");
NodeList artistNameNodes = (NodeList) artistNameExpr.evaluate(trackNameNode, XPathConstants.NODESET);
for (int j=0; j < artistNameNodes.getLength(); j++) {
System.out.println(String.format(" - Artist Name: %s", artistNameNodes.item(j).getTextContent()));
}
}
}
I have this code for parsing the title and content from the default WordPress XML. The only problem is that when I try to get the content of the blog entry, the XML tag is <content:encoded>, and I do not understand how to retrieve this data.
The tag <content:encoded> means an element with the name encoded in the XML namespace bound to the prefix content. The XPath evaluator is probably unable to resolve the content prefix to its namespace, which I think is http://purl.org/rss/1.0/modules/content/ from a quick Google.
To get it to resolve, you'll need to do the following:
1. Ensure your DocumentBuilderFactory has setNamespaceAware(true) called on it after construction, otherwise all namespaces are discarded during parsing.
2. Write an implementation of javax.xml.namespace.NamespaceContext to resolve the prefix to its namespace.
3. Call XPath#setNamespaceContext() with your implementation.
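The three steps above could be wired together roughly as follows. This is a sketch, not a drop-in replacement for the question's parseXml: the class name FeedReader, the entries method, and the sample feed are assumptions, and the feed is read from a String rather than a URL. Each item is used as the context node so that title and content:encoded are looked up relative to it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class for reading <content:encoded> from an RSS feed.
class FeedReader {

    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    // Step 2: resolve the "content" prefix used in the XPath below.
    static final NamespaceContext RSS_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "content".equals(prefix) ? CONTENT_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return CONTENT_NS.equals(namespaceURI) ? "content" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    };

    // Returns one "title => body" string per <item> in the feed.
    static List<String> entries(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // step 1
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(RSS_CONTEXT); // step 3

            NodeList items = (NodeList) xpath.evaluate(
                    "/rss/channel/item", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                // Relative paths are evaluated against the current item.
                String title = xpath.evaluate("title", item);
                String body = xpath.evaluate("content:encoded", item);
                result.add(title + " => " + body);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the feed", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rss version=\"2.0\" xmlns:content=\"" + CONTENT_NS + "\">"
                + "<channel><item><title>First post</title>"
                + "<content:encoded>Hello</content:encoded></item></channel></rss>";
        entries(feed).forEach(System.out::println);
    }
}
```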
You could also try XStream, which is a good and easy-to-use XML parser. It leaves you almost no work when parsing known XML structures.
PS: Their site is currently offline; use Google Cache to see it =P
