org.xml.sax.SAXParseException while parsing XML using XPath - java

I am trying to get values from an XML using XPATH. I received the following exception:
[Fatal Error] books.xml:4:16: The prefix "abc" for element "abc:priority" is not bound.
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:///D:/XSL%20TEST%20APP%20BACK%20UP/XMLTestApp/books.xml; lineNumber: 4; columnNumber: 16; The prefix "abc" for element "abc:priority" is not bound.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at xpath.XPathExample.main(XPathExample.java:18)
I am getting this error because my XML is a little different from a normal one (please see below):
<?xml version="1.0" encoding="UTF-8"?>
<inventory>
<Sample>
<abc:priority>1</abc:priority>
<abc:value>2</abc:value>
</Sample>
</inventory>
Here is my code (Java) to get values from the above XML:
import java.io.IOException;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
public class XPathExample {
public static void main(String[] args)
throws ParserConfigurationException, SAXException,
IOException, XPathExpressionException {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true); // never forget this!
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("books.xml");
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
XPathExpression expr
= xpath.compile("//Sample/*/text()");////book/Sample[author='Neal Stephenson']/title/text()
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(nodes.item(i).getNodeValue());
}
}
}
If I remove the colon (that is, the abc: prefix), I do not get this error.
Is it possible to get content from an XML like mentioned above using XPATH?

"Is it possible to get content from an XML like mentioned above using Xpath ?" - I don't think so. This XML isn't well-formed.
From the spec (http://www.w3.org/TR/REC-xml-names/#ns-qualnames):
The Prefix provides the namespace prefix part of the qualified name,
and MUST be associated with a namespace URI reference in a namespace
declaration. [Definition: The LocalPart provides the local part of the
qualified name.]
In order to do anything with it, I think you'll have to add a namespace declaration.
Example
<inventory xmlns:abc="x">
<Sample>
<abc:priority>1</abc:priority>
<abc:value>2</abc:value>
</Sample>
</inventory>
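To illustrate, here is a minimal sketch of parsing the corrected document with a namespace-aware parser and running the question's XPath expression. The class name BoundPrefixExample and the inline XML string are assumptions made for this example; "x" is just the placeholder URI from the example above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class BoundPrefixExample {
    // Same document as above, with the abc prefix bound to the placeholder URI "x".
    static final String XML =
        "<inventory xmlns:abc=\"x\"><Sample>"
        + "<abc:priority>1</abc:priority><abc:value>2</abc:value>"
        + "</Sample></inventory>";

    // Returns the text of every child element of <Sample>.
    static List<String> sampleValues(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // "*" matches elements in any namespace, so no NamespaceContext is needed here.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//Sample/*/text()", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            // Sketch-level error handling: wrap checked parser/XPath exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleValues(XML)); // prints [1, 2]
    }
}
```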

Try without this line:
domFactory.setNamespaceAware(true); // never forget this!
Although it normally is a bad idea to run without namespace awareness, in this specific case it makes sense, since the input file is the way it is.
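A minimal sketch of this approach follows; the class name and inline XML are assumptions for illustration. It reads the values with plain DOM calls rather than XPath, because with namespace awareness off the element name is simply the literal string abc:priority.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class UnboundPrefixExample {
    // The question's document: the abc prefix is never declared.
    static final String XML =
        "<inventory><Sample>"
        + "<abc:priority>1</abc:priority><abc:value>2</abc:value>"
        + "</Sample></inventory>";

    // Returns the text of the first element with the given literal tag name.
    // Assumes at least one such element exists.
    static String textOf(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(false); // the colon is treated as part of the name
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (Exception e) {
            // Sketch-level error handling: wrap checked parser exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(XML, "abc:priority")); // prints 1
        System.out.println(textOf(XML, "abc:value"));    // prints 2
    }
}
```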

Java XPath scan file looking for a word

I'm building an application that will take a word from the user and then scan a file using XPath, returning true or false depending on whether the word was found in that file.
I have built the following class that uses XPath, but either I am misunderstanding how it should work or there is something wrong with my code. Can anyone explain how to use XPath to search an entire file?
public XPath() throws IOException, SAXException, ParserConfigurationException, XPathExpressionException {
FileInputStream fileIS = new FileInputStream("text.xml");
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(fileIS);
XPathFactory xPathfactory = XPathFactory.newInstance();
javax.xml.xpath.XPath xPath = xPathfactory.newXPath();
XPathExpression expr = xPath.compile("//text()[contains(.,'java')]");
System.out.println(expr.evaluate(xmlDocument, XPathConstants.NODESET));
}
And the XML file I am currently testing on:
<?xml version="1.0"?>
<Tutorials>
<Tutorial tutId="01" type="java">
<title>Guava</title>
<description>Introduction to Guava</description>
<date>04/04/2016</date>
<author>GuavaAuthor</author>
</Tutorial>
<Tutorial tutId="02" type="java">
<title>XML</title>
<description>Introduction to XPath</description>
<date>04/05/2016</date>
<author>XMLAuthor</author>
</Tutorial>
</Tutorials>
Found the solution: I was not displaying the found entries correctly and, as someone pointed out in a comment, 'java' appears only in attributes while I was scanning only text nodes, so it would never be found. After adding the following code and changing the word my app looks for, the application works:
Object result = expr.evaluate(xmlDocument, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(nodes.item(i).getNodeValue());
}
Your XPath is searching the text() nodes, but the word java appears in the @type attribute (which is not a text() node).
If you want to search for the word in both text() and @* then you could use the union operator | and check for either/both containing that word:
//text()[contains(., 'java')] | //@*[contains(., 'java')]
But you might also want to scan comment() and processing-instruction(), so you could generically match on node() and then test in the predicate:
//node()[contains(., 'java')] | //@*[contains(., 'java')]
With XPath 2.0 or greater, you could use:
//node()[(.|@*)[contains(., 'java')]]
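As a rough sketch, the union of text nodes and attributes can be turned into the true/false check the question asks for. The class and method names are hypothetical, the XML is a shortened copy of the question's file, and the search word is spliced straight into the expression, so this assumes the word contains no single quote. Note that contains() is case-sensitive.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class WordSearchExample {
    // Shortened version of the question's file (one Tutorial only).
    static final String XML =
        "<Tutorials><Tutorial tutId=\"01\" type=\"java\">"
        + "<title>Guava</title><description>Introduction to Guava</description>"
        + "</Tutorial></Tutorials>";

    // Counts text nodes and attributes whose value contains the word.
    // Assumption: word contains no single quote.
    static int countMatches(String xml, String word) {
        String expr = "//text()[contains(., '" + word + "')]"
            + " | //@*[contains(., '" + word + "')]";
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Sketch-level error handling: wrap checked parser/XPath exceptions.
            throw new IllegalStateException(e);
        }
    }

    static boolean containsWord(String xml, String word) {
        return countMatches(xml, word) > 0;
    }

    public static void main(String[] args) {
        System.out.println(containsWord(XML, "java"));   // true: the type attribute
        System.out.println(containsWord(XML, "python")); // false
    }
}
```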

Java - Handle XML With Less Than/Greater Than Symbols in Text

I am trying to parse an XML file with the "less than" and "greater than" symbols in the text.
Here is a sample XML file:
<document>
<summary>
The equation for t is: 567<T<600.
</summary>
</document>
Is there any way to handle this in a Java XML parser? I know about escaping and changing to
&lt;
and
&gt;
but I only want to escape the characters in the text.
Currently, I am trying to use the DocumentBuilder, but it is erroring out.
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
domFactory.setExpandEntityReferences(false);
try {
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(sectionXML.toString())));
} catch (ParserConfigurationException e) {
e.printStackTrace();
}
The error I am getting is:
[Fatal Error] :1:70: Element type "T" must be followed by either attribute specifications, ">" or "/>".
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 70; Element type "T" must be followed by either attribute specifications, ">" or "/>".
Any thoughts? Thanks in advance for any help.
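As a point of reference, and only as a sketch with hypothetical class and method names: a bare < in character data is not well-formed XML, so a conforming parser such as DocumentBuilder rejects it, whereas the escaped form parses and the DOM hands back the original characters. Escaping only the text content therefore has to happen when the XML is produced, before it reaches the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this sketch.
class EscapedTextExample {
    static final String RAW =
        "<document><summary>The equation for t is: 567<T<600.</summary></document>";
    static final String ESCAPED =
        "<document><summary>The equation for t is: 567&lt;T&lt;600.</summary></document>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // True if the parser accepts the input, false if it reports a SAX error.
    static boolean isWellFormed(String xml) {
        try {
            parse(xml);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the text of the first <summary> element; entities come back unescaped.
    static String summaryText(String xml) {
        try {
            return parse(xml).getElementsByTagName("summary").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(RAW));    // false
        System.out.println(summaryText(ESCAPED)); // The equation for t is: 567<T<600.
    }
}
```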

How to XPath return empty string if node or child node is not present in XML in java

I have an XML file, "sample.xml", which has 4 records.
<?xml version='1.0' encoding='UTF-8'?>
<hello xmlns:show="http://www.example.com" xmlns:css="http://www.example.com" xml_version="2.0">
<entry id="2008-0001">
<show:id>2008-0001</show:id>
<show:published-datetime>2008-01-15T15:00:00.000-05:00</show:published-datetime>
<show:last-modified-datetime>2012-03-19T00:00:00.000-04:00</show:last-modified-datetime>
<show:css>
<css:metrics>
<css:score>3.6</css:score>
<css:access-vector>LOCAL</css:access-vector>
<css:authentication>NONE</css:authentication>
<css:generated-on-datetime>2008-01-15T15:22:00.000-05:00</css:generated-on-datetime>
</css:metrics>
</show:css>
<show:summary>This is first entry.</show:summary>
</entry>
<entry id="2008-0002">
<show:id>2008-0002</show:id>
<show:published-datetime>2008-02-11T20:00:00.000-05:00</show:published-datetime>
<show:last-modified-datetime>2014-03-15T23:22:37.303-04:00</show:last-modified-datetime>
<show:css>
<css:metrics>
<css:score>5.8</css:score>
<css:access-vector>NETWORK</css:access-vector>
<css:authentication>NONE</css:authentication>
<css:generated-on-datetime>2008-02-12T10:12:00.000-05:00</css:generated-on-datetime>
</css:metrics>
</show:css>
<show:summary>This is second entry.</show:summary>
</entry>
<entry id="2008-0003">
<show:id>2008-0003</show:id>
<show:published-datetime>2009-03-26T06:12:08.780-04:00</show:published-datetime>
<show:last-modified-datetime>2009-03-26T06:12:09.313-04:00</show:last-modified-datetime>
<show:summary>This is 3rd entry with missing "css" tag and their metrics.</show:summary>
</entry>
<entry id="2008-0004">
<show:id>CVE-2008-0004</show:id>
<show:published-datetime>2008-01-11T19:46:00.000-05:00</show:published-datetime>
<show:last-modified-datetime>2011-09-06T22:41:45.753-04:00</show:last-modified-datetime>
<show:css>
<css:metrics>
<css:score>4.3</css:score>
<css:access-vector>NETWORK</css:access-vector>
<css:authentication>NONE</css:authentication>
<css:generated-on-datetime>2008-01-14T09:37:00.000-05:00</css:generated-on-datetime>
</css:metrics>
</show:css>
<show:summary>This is 4th entry.</show:summary>
</entry>
</hello>
and one Java file, "Test.java":
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class Test {
public static void main(String[] args) {
List<String> list = new ArrayList<String>();
File fXmlFile = new File("/home/ankit/sample.xml");
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try
{
DocumentBuilder dBuilder = factory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
NodeList nList = doc.getElementsByTagName("entry");
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
for (int i = 0; i < nList.getLength(); i++)
{
XPathExpression expr1 = xpath.compile("//hello/entry/css/metrics/score");
NodeList nodeList1 = (NodeList) expr1.evaluate(doc, XPathConstants.NODESET);
if(nodeList1.item(i)!=null)
{
Node currentItem = nodeList1.item(i);
if(!currentItem.getTextContent().isEmpty())
{
list.add(currentItem.getTextContent());
}
}
}
}
catch(Exception e)
{
e.printStackTrace();
}
System.out.println("size----"+list.size());
for(int i=0;i<list.size();i++)
{
System.out.println("list----"+list.get(i));
}
}
}
I need to read the <entry> tags from the XML, and I am using XPath for that. The XML file has 4 <entry> tags, each containing a <show:css> tag, except the 3rd <entry>, where <show:css> is missing. I am putting the css score values into the list. When I run this Java code, the first 2 values are stored in the list, and the 3rd position holds the 4th entry's css score value.
I want a list whose first, second and fourth elements are "3.6", "5.8" and "4.3", and whose 3rd element is an empty string or nil. But I am getting only 3 elements in the list, with the values of entries 1, 2 and 4.
I need to put an empty string "" in 3rd place and the original value in 4th. That is, if the tag is not present, put a blank value in the list.
Current output - ["3.6", "5.8", "4.3"]
I expect - ["3.6", "5.8", "", "4.3"]
Could anyone please help me with how to do this?
There are probably a few ways this might be achieved...
My basic take is to find all the entry nodes, both those that have a css/metrics/score descendant and those that don't (you could just select ALL the entry nodes, but this demonstrates the power of the query language).
Something like...
XPathExpression expr1 = xPath.compile("//hello/entry[css/metrics/score or not(css/metrics/score)]");
I know the conditional expression is meaningless; I wanted the OP to see that they can use additional conditions to expand on their requirements. Thank you all for pointing it out despite the fact that I had already mentioned it... I hope we can all move on from it.
Then, loop through the resulting NodeList and query each entry Node for the css/metrics/score node. If it's null, then add a null value into the list (or whatever else you want), for example...
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
Document doc = dbf.newDocumentBuilder().parse(JavaApplication908.class.getResourceAsStream("/Hello.xml"));
XPathFactory xf = XPathFactory.newInstance();
XPath xPath = xf.newXPath();
XPathExpression expr1 = xPath.compile("//hello/entry[css/metrics/score or not(css/metrics/score)]");
XPathExpression expr2 = xPath.compile("css/metrics/score");
List<String> values = new ArrayList<>(25);
NodeList nodeList1 = (NodeList) expr1.evaluate(doc, XPathConstants.NODESET);
for (int index = 0; index < nodeList1.getLength(); index++) {
Node node = nodeList1.item(index);
System.out.println(node.getAttributes().getNamedItem("id"));
Node css = (Node) expr2.evaluate(node, XPathConstants.NODE);
if (css != null) {
values.add(css.getTextContent());
} else {
values.add(null);
}
}
for (String value : values) {
System.out.println(value);
}
This outputs...
id="2008-0001"
id="2008-0002"
id="2008-0003"
id="2008-0004"
3.6
5.8
null
4.3
(The first four lines are the entry node ids, the last four are the resulting css/metrics/score values)
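A variation on the loop above, offered as a sketch rather than as the answer's method: evaluating the per-entry expression as a string yields "" when the node is missing, which is the output the question asks for. The class name and the shortened XML are assumptions, and the sketch uses a namespace-aware parser with local-name() so that it does not depend on how prefixes are handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class ScorePerEntryExample {
    private static final String NS =
        " xmlns:show=\"http://www.example.com\" xmlns:css=\"http://www.example.com\"";

    private static String entry(String id, String score) {
        String css = score == null ? "" :
            "<show:css><css:metrics><css:score>" + score
            + "</css:score></css:metrics></show:css>";
        return "<entry id=\"" + id + "\">" + css + "</entry>";
    }

    // Shortened stand-in for sample.xml: the third entry has no css block.
    static final String XML = "<hello" + NS + ">"
        + entry("2008-0001", "3.6") + entry("2008-0002", "5.8")
        + entry("2008-0003", null) + entry("2008-0004", "4.3") + "</hello>";

    // One list element per <entry>; "" where no score element exists.
    static List<String> scores(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList entries = (NodeList) xpath.evaluate(
                "/hello/entry", doc, XPathConstants.NODESET);
            // string() of an empty node-set is the empty string.
            XPathExpression score = xpath.compile("string(.//*[local-name()='score'])");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                values.add(score.evaluate(entries.item(i)));
            }
            return values;
        } catch (Exception e) {
            // Sketch-level error handling: wrap checked parser/XPath exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(scores(XML)); // prints [3.6, 5.8, , 4.3]
    }
}
```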
I am not an expert in XPath but from looking at your code, I think you are just missing a couple of lines of code,
if(nodeList1.item(i)!=null)
{
Node currentItem = nodeList1.item(i);
if(!currentItem.getTextContent().isEmpty())
{
list.add(currentItem.getTextContent());
}
else
list.add("");
}
else
list.add("");
@MathiasMüller could you please let me know how it can be done in one expression in XPath 2.0? – ankit
The equivalent XPath 2.0 expression would be
for $x in //entry return (if ($x//*:score) then $x//*:score else '')
which makes heavy use of new constructs introduced in XPath 2.0. The output would then be
3.6
5.8
[Empty string]
4.3
But be aware that currently, most XPath implementations only support 1.0. Try this XPath 2.0 expression within an XSLT stylesheet online here, a site that uses Saxon 9.5 EE.

Getting the value of an attribute in Java from a XML using XPath

I'm currently using Java and XPath to get some information from a podcast feed. I'm trying to read the attribute of a node:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:atom="http://www.w3.org/2005/Atom/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
<channel>
[....]
<itunes:image href="http://icebox.5by5.tv/images/broadcasts/14/cover.jpg" />
[...]
I want to get the value of the href attribute in <itunes:image>. Currently, I'm using the following code:
private static String IMAGE_XPATH = "//channel/itunes:image/@href";
String imageUrl = xpath.compile(IMAGE_XPATH).evaluate(doc, XPathConstants.STRING).toString();
The result of imageUrl is null. What happens in the code? Do I have an error in the XPath code, or in the Java code?
Thanks! :)
Disable namespace awareness:
DocumentBuilderFactory xmlFact = DocumentBuilderFactory.newInstance();
xmlFact.setNamespaceAware(false);
Your XPath expression should now look like this:
"//channel/image/@href"
If you need it to be namespace aware, implement your own NamespaceContext and use it like this:
NamespaceContext ctx = new ItunesNamespaceContext();
XPathFactory xpathFact = XPathFactory.newInstance();
XPath xpath = xpathFact.newXPath();
xpath.setNamespaceContext(ctx);
String IMAGE_XPATH = "//channel/itunes:image/@href";
String imageUrl = xpath.compile(IMAGE_XPATH).evaluate(doc, XPathConstants.STRING).toString();
EDIT: Here is a test code that proves my point:
String a ="<?xml version=\"1.0\" encoding=\"UTF-8\"?><rss xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:sy=\"http://purl.org/rss/1.0/modules/syndication/\" xmlns:admin=\"http://webns.net/mvcb/\" xmlns:atom=\"http://www.w3.org/2005/Atom/\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\" xmlns:itunes=\"http://www.itunes.com/dtds/podcast-1.0.dtd\" version=\"2.0\"><channel><itunes:image href=\"http://icebox.5by5.tv/images/broadcasts/14/cover.jpg\" /></channel></rss>";
DocumentBuilderFactory xmlFact = DocumentBuilderFactory.newInstance();
xmlFact.setNamespaceAware(false);
DocumentBuilder builder = xmlFact.newDocumentBuilder();
XPathFactory xpathFactory = XPathFactory.newInstance();
String expr = "//channel/image/@href";
XPath xpath = xpathFactory.newXPath();
Document doc = builder.parse(new InputSource(new StringReader(a)));
String imageUrl = (String) xpath.compile(expr).evaluate(doc ,XPathConstants.STRING);
System.out.println(imageUrl);
The output is:
http://icebox.5by5.tv/images/broadcasts/14/cover.jpg
The XPath should include the root element, so rss/channel/itunes:image/@href.
Alternatively, you could start the XPath with // so that all levels are searched (//channel/itunes:image/@href), but if the root will always be the same it is more efficient to use the first option.
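The ItunesNamespaceContext class mentioned above is not shown in the thread. A minimal sketch of what such a class might look like follows; its body, the ItunesImageExample wrapper and the shortened feed are all assumptions, and only the itunes prefix is mapped.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical wrapper class, used only for this sketch.
class ItunesImageExample {
    static final String ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd";

    // Shortened feed: only the itunes namespace declaration is kept.
    static final String XML =
        "<rss xmlns:itunes=\"" + ITUNES_NS + "\" version=\"2.0\"><channel>"
        + "<itunes:image href=\"http://icebox.5by5.tv/images/broadcasts/14/cover.jpg\" />"
        + "</channel></rss>";

    // Assumed implementation: maps the single prefix "itunes" to its URI.
    static class ItunesNamespaceContext implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "itunes".equals(prefix) ? ITUNES_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return ITUNES_NS.equals(namespaceURI) ? "itunes" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
        }
    }

    static String imageUrl(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed name tests
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new ItunesNamespaceContext());
            return xpath.evaluate("/rss/channel/itunes:image/@href", doc);
        } catch (Exception e) {
            // Sketch-level error handling: wrap checked parser/XPath exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(imageUrl(XML));
    }
}
```

The prefix used in the XPath expression only has to match the NamespaceContext, not the prefix used in the feed; what the two must share is the namespace URI.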

Faster api than javax.xml.xpath to parse the xml for a value?

I am using javax.xml.xpath to search for specific strings in xml files, however due to the huge number of xml files which needs to be searched this is turning out to be much slower than expected.
Is there any api that java supports that is faster than javax.xml.xpath or which is the fastest that is available?
As pointed out by skaffman, you will want to be sure you are using the javax.xml.xpath libraries as efficiently as possible. If you are executing an XPath statement more than once, you will want to make sure to compile it into an XPathExpression.
XPathExpression xPathExpression = xPath.compile("/root/device/modelname");
nl = (NodeList) xPathExpression.evaluate(dDoc, XPathConstants.NODESET);
Demo
In the example option #2 will be faster than option #1.
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class Demo {
public static void main(String[] args) {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder builder = domFactory.newDocumentBuilder();
File xml = new File("input.xml");
Document dDoc = builder.parse(xml);
NodeList nl;
// OPTION #1
XPath xPath = XPathFactory.newInstance().newXPath();
nl = (NodeList) xPath.evaluate("root/device/modelname", dDoc, XPathConstants.NODESET);
printResults(nl);
nl = (NodeList) xPath.evaluate("/root/device/modelname", dDoc, XPathConstants.NODESET);
printResults(nl);
// OPTION #2
XPathExpression xPathExpression = xPath.compile("/root/device/modelname");
nl = (NodeList) xPathExpression.evaluate(dDoc, XPathConstants.NODESET);
printResults(nl);
nl = (NodeList) xPathExpression.evaluate(dDoc, XPathConstants.NODESET);
printResults(nl);
} catch (Exception e) {
e.printStackTrace();
}
}
private static void printResults(NodeList nl) {
for(int x=0; x<nl.getLength(); x++) {
System.out.println("the value is: " + nl.item(x).getTextContent());
}
}
}
input.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
<blah>foo</blah>
<device>
<modelname>xbox</modelname>
</device>
<blah>bar</blah>
<device>
<modelname>wii</modelname>
</device>
<blah/>
</root>
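Since the question is about searching many files, the same idea can be sketched across several documents: compile the expression once and evaluate it against each parsed document. The class name is hypothetical and the documents are passed as strings to keep the sketch self-contained; with real files one would parse File objects instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this sketch.
class CompiledXPathReuse {
    // Collects every /root/device/modelname value from all given documents.
    static List<String> modelNames(List<String> xmlDocuments) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Compiled once, outside the loop, then reused for every document.
            XPathExpression expr = XPathFactory.newInstance().newXPath()
                .compile("/root/device/modelname");
            List<String> names = new ArrayList<>();
            for (String xml : xmlDocuments) {
                Document doc = builder.parse(new InputSource(new StringReader(xml)));
                NodeList nl = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
                for (int i = 0; i < nl.getLength(); i++) {
                    names.add(nl.item(i).getTextContent());
                }
            }
            return names;
        } catch (Exception e) {
            // Sketch-level error handling: wrap checked parser/XPath exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "<root><device><modelname>xbox</modelname></device></root>",
            "<root><device><modelname>wii</modelname></device></root>");
        System.out.println(modelNames(docs)); // prints [xbox, wii]
    }
}
```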
I wonder if the XPath searching is really your bottleneck, or whether it's actually the XML parsing? I would suspect the latter. I don't know how persistent your XML documents are, but I would think the solution is to store them in an XML database so you only incur the parsing cost once, and so that they can be indexed to make XPath/XQuery searching more efficient.
You can look at my previous answer for something related.
Basically, I used JXPath and Xerces as well as Dom4J and javax.
I can say with confidence from my experience that VTD-XML is hands down the fastest of these options.
There are plenty of other questions on using VTD-XML on SO if you care to search.
EDIT:
ok, so based on your comment the code snippet would be something like this:
VTDGen vg = new VTDGen();
AutoPilot ap = new AutoPilot();
int i;
ap.selectXPath("/root/device/modelname");
if (vg.parseFile(PATH_TO_FILE,true)){
VTDNav vn = vg.getNav();
ap.bind(vn); // apply XPath to the VTDNav instance
// AutoPilot moves the cursor for you
while((i=ap.evalXPath())!=-1){
System.out.println("the value is: " + vn.toNormalizedString(vn.getText()));
}
}
For the following XML:
<root>
<blah>foo</blah>
<device>
<modelname>xbox</modelname>
</device>
<blah>bar</blah>
<device>
<modelname>wii</modelname>
</device>
<blah/>
</root>
The output will be:
the value is: xbox
the value is: wii
You can take it from here...
You should elaborate on what kinds of things you are searching for. If it's plain content strings, I would consider using the StAX API (javax.xml.stream.XMLStreamReader), for example.
XPath is good if you need to restrict your search to a specific subset.
One problem with XPath, however, is that depending on the expression it may end up building a DOM tree in memory, and this is rather costly (relative to parsing XML), both in terms of speed and memory use. So if this can be avoided, that alone can speed up processing by a factor of 3.
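A rough sketch of the streaming approach, using the input.xml from the earlier answer: the reader walks the document event by event and collects the text of every modelname element without building a DOM. The class name is hypothetical, and unlike the XPath /root/device/modelname this matches the element by name at any depth.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this sketch.
class StaxModelNames {
    static final String XML =
        "<root><blah>foo</blah><device><modelname>xbox</modelname></device>"
        + "<blah>bar</blah><device><modelname>wii</modelname></device><blah/></root>";

    // Collects the text of every <modelname> element, at any depth.
    static List<String> modelNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "modelname".equals(reader.getLocalName())) {
                        // Reads the text-only content and stops at the end tag.
                        names.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Sketch-level error handling: wrap the checked exception.
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(modelNames(XML)); // prints [xbox, wii]
    }
}
```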
