XML text extraction

XML text extraction - java

Scenario:
Given the following XML file:
<a:root
xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="http://www.w3schools.com/furniture">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table>
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
aaaaaaaaaaaaaa
</a:root>
How do I extract the text inside the main element <a:root>:
"\naaaaaaaaaaaaaa\n"
The code I have right now is:
import java.io.File;
import java.util.Stack;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class Proof {
public static void main(String[] args) {
Document doc = null;
DocumentBuilderFactory dbf = null;
DocumentBuilder docBuild = null;
try {
dbf = DocumentBuilderFactory.newInstance();
docBuild = dbf.newDocumentBuilder();
doc = docBuild.parse(new File("test2.xml"));
System.out.println(doc.getFirstChild().getTextContent());
} catch(Exception e) {
e.printStackTrace();
}
}
}
But it returns the text I want ("aaaaaaaaaaaaaa") plus the inner text of all the other elements. Output:
Apples
Bananas
African Coffee Table
80
120
aaaaaaaaaaaaaa
The requirement is to not use any additional XML library for Java.

The answer by @Kirill Polishchuk is not correct.
The proposed expression:
a:root/text()
is relative: unless it is evaluated with the root (/) node as the context node, it selects nothing in the provided XML document.
Even the XPath expression: /a:root/text() is incorrect, because it selects three text nodes -- all text node children of the top element -- including two whitespace-only text nodes.
Here is a correct XPath solution:
/a:root/text()[string-length(normalize-space()) > 0]
When this XPath expression is applied to the provided XML document (corrected to be well-formed):
<a:root
xmlns:a="UNDEFINED !!!!"
xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="http://www.w3schools.com/furniture">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table>
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
aaaaaaaaaaaaaa
</a:root>
It selects the last (and only non-whitespace-only) text node child of the top element, as required:
aaaaaaaaaaaaaa
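Since the question is about Java, here is a minimal sketch of evaluating that expression with only the JDK's javax.xml.xpath API. The namespace URI http://example.com/a bound to the a prefix is an assumption made for illustration (the original document leaves the prefix undeclared), the sample document is shortened, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class RootTextXPath {
    // Assumption: a placeholder URI for the "a" prefix, not taken from the original document.
    static final String A_NS = "http://example.com/a";

    // Shortened version of the sample document, with the "a" prefix declared.
    static final String SAMPLE =
          "<a:root xmlns:a=\"" + A_NS + "\" xmlns:f=\"http://www.w3schools.com/furniture\">\n"
        + "<f:table><f:name>African Coffee Table</f:name></f:table>\n"
        + "aaaaaaaaaaaaaa\n"
        + "</a:root>";

    static String rootText(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed so prefixed names in the XPath can be resolved
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Minimal context: only the "a" prefix is mapped; reverse lookups are not supported.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "a".equals(prefix) ? A_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String namespaceURI) { return null; }
            public Iterator<String> getPrefixes(String namespaceURI) { return null; }
        });

        // Returns the string value of the first selected node, here the only
        // non-whitespace text child of the root element.
        return xpath.evaluate("/a:root/text()[string-length(normalize-space()) > 0]", doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("\"" + rootText(SAMPLE) + "\"");
    }
}
```

The returned string keeps the surrounding line breaks of the text node; call trim() on it if only the visible characters are wanted.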
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:a="UNDEFINED !!!!"
>
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:text>"</xsl:text>
<xsl:copy-of select=
"/a:root/text()
[string-length(normalize-space()) > 0]"/>"
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document (above), the wanted, correctly selected text node is output:
"
aaaaaaaaaaaaaa
"

You can use XPath: a:root/text()

Use this:
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class Proof {
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuild = dbf.newDocumentBuilder();
            Document doc = docBuild.parse(new File("test2.xml"));
            // Walk only the direct children of the root element.
            Element root = doc.getDocumentElement();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Print text nodes only; child elements and their text are skipped.
                if (child.getNodeType() == Node.TEXT_NODE) {
                    System.out.println(child.getNodeValue());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
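Note that the loop above prints every direct text node, including the whitespace-only ones between the child elements. A small variation, sketched here under the assumption that whitespace-only nodes should be skipped, collects only the direct text that has visible content. The sample XML is a simplified stand-in for the document in the question, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DirectText {
    // Simplified stand-in for the question's document (no namespaces).
    static final String SAMPLE = "<root>\n<item>Apples</item>\naaaaaaaaaaaaaa\n</root>";

    // Concatenates the direct text children of the root element,
    // ignoring text nodes that contain only whitespace.
    static String directText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder sb = new StringBuilder();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && !child.getNodeValue().trim().isEmpty()) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(directText(SAMPLE).trim());
    }
}
```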

Related

How to parse the full content of a XML Tag in java

I have some kind of complex XML data structure. The structure contains different fragments like in the following example:
<data>
<content-part-1>
<h1>Hello <strong>World</strong>. This is some text.</h1>
<h2>.....</h2>
</content-part-1>
....
</data>
The h1 tag within the tag 'content-part-1' is of interest. I want to get the full content of the xml tag 'h1'.
In java I used the javax.xml.parsers.DocumentBuilder and tried something like this:
String my_content="<h1>Hello <strong>World</strong>. This is some text.</h1>";
// parse h1 tag..
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = documentBuilder.parse(new InputSource(new StringReader(my_content)));
Node node = doc.importNode(doc.getDocumentElement(), true);
if (node != null && node.getNodeName().equals("h1")) {
return node.getTextContent();
}
But the method 'getTextContent()' will return:
Hello World. This is some text.
The tag "strong" is removed by the xml parser (as it is the documented behavior).
My question is: how can I extract the full content of a single XML node within an org.w3c.dom.Document without any further parsing of the node content?

Although the Java DOM parser can handle mixed content, in this particular case it may be more convenient to use the Jsoup library. With it, the code to extract the content of the h1 element looks as follows:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
String text = "<data>\n"
+ " <content-part1>\n"
+ " <h1>Hello <strong>World</strong>. This is some text.</h1>\n"
+ " <h2></h2>\n"
+ " </content-part1>\n"
+ "</data>";
Document doc = Jsoup.parse(text);
Elements h1Elements = doc.select("h1");
for (Element h1 : h1Elements) {
System.out.println(h1.html());
}
Output in this case will be "Hello <strong>World</strong>. This is some text."

What you probably want is to generate XML from a subnode of your document.
So, with a slightly modified nodeToString from an earlier answer to a similar question, I can propose to
generate the text <h1>Hello <strong>World</strong>. This is some text.</h1>. Some extra effort might be needed to get rid of <h1> and </h1>.
package com.github.vtitov.test;
import org.junit.Test;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.StringReader;
import java.io.StringWriter;
import static org.hamcrest.MatcherAssert.*;
import static org.hamcrest.Matchers.*;
public class XmlTest {
@Test
public void buildXml() throws Exception {
String my_content="<h1>Hello <strong>World</strong>. This is some text.</h1>";
// parse h1 tag..
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = documentBuilder.parse(new InputSource(new StringReader(my_content)));
Node node = doc.importNode(doc.getDocumentElement(), true);
String h1Content = null;
if (node != null && node.getNodeName().equals("h1")) {
h1Content = nodeToString(node);
}
assertThat("h1", h1Content, equalTo("<h1>Hello <strong>World</strong>. This is some text.</h1>"));
}
private static String nodeToString(Node node) throws TransformerException {
StringWriter sw = new StringWriter();
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
t.setOutputProperty(OutputKeys.INDENT, "no");
t.transform(new DOMSource(node), new StreamResult(sw));
return sw.toString();
}
}
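As for getting rid of the outer <h1> and </h1>, one possible approach is to serialize each child of the node instead of the node itself. The following is only a sketch of that idea using the same JDK Transformer API as above; the class and method names (InnerXml, innerXml, innerXmlOf) are hypothetical and not part of the answer's code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnerXml {
    // Serializes the children of the given node one by one, so the
    // node's own start and end tags are not part of the result.
    static String innerXml(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter sw = new StringWriter();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            t.transform(new DOMSource(children.item(i)), new StreamResult(sw));
        }
        return sw.toString();
    }

    // Convenience wrapper: parses the string and returns the inner XML of its root element.
    static String innerXmlOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return innerXml(doc.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(innerXmlOf("<h1>Hello <strong>World</strong>. This is some text.</h1>"));
    }
}
```

Because the text is re-serialized, characters such as & and < in text nodes come out escaped, which may differ byte-for-byte from the original input.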

XPathExpression.evaluate using Node [duplicate]

I want to manipulate an XML document that has a default namespace but no prefix. Is there a way to use XPath without the namespace URI, as if there were no namespace?
I believe it should be possible if we set the namespaceAware property of DocumentBuilderFactory to false, but in my case it is not working.
Is my understanding incorrect, or am I making a mistake in the code?
Here is my code:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(false);
try {
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document dDoc = builder.parse("E:/test.xml");
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nl = (NodeList) xPath.evaluate("//author", dDoc, XPathConstants.NODESET);
System.out.println(nl.getLength());
} catch (Exception e) {
e.printStackTrace();
}
Here is my xml:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://www.mydomain.com/schema">
<author>
<book title="t1"/>
<book title="t2"/>
</author>
</root>

The XPath processing for a document that uses the default namespace (no prefix) is the same as the XPath processing for a document that uses prefixes:
For namespace qualified documents you can use a NamespaceContext when you execute the XPath. You will need to prefix the fragments in the XPath to match the NamespaceContext. The prefixes you use do not need to match the prefixes used in the document.
http://download.oracle.com/javase/6/docs/api/javax/xml/namespace/NamespaceContext.html
Here is how it looks with your code:
import java.util.Iterator;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class Demo {
public static void main(String[] args) {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
try {
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document dDoc = builder.parse("E:/test.xml");
XPath xPath = XPathFactory.newInstance().newXPath();
xPath.setNamespaceContext(new MyNamespaceContext());
NodeList nl = (NodeList) xPath.evaluate("/ns:root/ns:author", dDoc, XPathConstants.NODESET);
System.out.println(nl.getLength());
} catch (Exception e) {
e.printStackTrace();
}
}
private static class MyNamespaceContext implements NamespaceContext {
public String getNamespaceURI(String prefix) {
if("ns".equals(prefix)) {
return "http://www.mydomain.com/schema";
}
return null;
}
public String getPrefix(String namespaceURI) {
return null;
}
public Iterator getPrefixes(String namespaceURI) {
return null;
}
}
}
Note:
I also used the corrected XPath suggested by Dennis.
The following also appears to work, and is closer to your original question:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class Demo {
public static void main(String[] args) {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document dDoc = builder.parse("E:/test.xml");
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nl = (NodeList) xPath.evaluate("/root/author", dDoc, XPathConstants.NODESET);
System.out.println(nl.getLength());
} catch (Exception e) {
e.printStackTrace();
}
}
}

Blaise Doughan is right; the attached code is correct.
The problem was somewhere else. I was running all my tests through the application launcher in the Eclipse IDE and nothing was working. Then I discovered that the Eclipse project was the cause of all the grief. I ran my class from the command prompt and it worked. I created a new Eclipse project and pasted the same code there, and it worked there too.
Thank you all for your time and effort.

I've written a simple NamespaceContext implementation that might be of help. It takes a Map<String, String> as input, where the key is a prefix and the value is a namespace.
It follows the NamespaceContext specification, and you can see how it works in the unit tests.
Map<String, String> mappings = new HashMap<>();
mappings.put("foo", "http://foo");
mappings.put("foo2", "http://foo");
mappings.put("bar", "http://bar");
context = new SimpleNamespaceContext(mappings);
context.getNamespaceURI("foo"); // "http://foo"
context.getPrefix("http://foo"); // "foo" or "foo2"
context.getPrefixes("http://foo"); // ["foo", "foo2"]
Note that it has a dependency on Google Guava
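If the Guava dependency is not wanted, the same idea can be sketched with the JDK alone. The class below is a hypothetical stand-in written for illustration, not the implementation referred to above. It follows the parts of the NamespaceContext contract I am sure of: null arguments are rejected, unbound prefixes map to the empty string, and the xml and xmlns prefixes are pre-bound.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

public class MapNamespaceContext implements NamespaceContext {
    // Insertion-ordered copy, so lookups follow the order of the supplied map.
    private final Map<String, String> prefixToUri = new LinkedHashMap<>();

    public MapNamespaceContext(Map<String, String> mappings) {
        prefixToUri.putAll(mappings);
        // These two bindings are fixed by the XML namespaces specification.
        prefixToUri.put(XMLConstants.XML_NS_PREFIX, XMLConstants.XML_NS_URI);
        prefixToUri.put(XMLConstants.XMLNS_ATTRIBUTE, XMLConstants.XMLNS_ATTRIBUTE_NS_URI);
    }

    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        String uri = prefixToUri.get(prefix);
        return uri != null ? uri : XMLConstants.NULL_NS_URI;
    }

    @Override
    public String getPrefix(String namespaceURI) {
        Iterator<String> it = getPrefixes(namespaceURI);
        return it.hasNext() ? it.next() : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        if (namespaceURI == null) {
            throw new IllegalArgumentException("namespaceURI must not be null");
        }
        List<String> prefixes = new ArrayList<>();
        for (Map.Entry<String, String> e : prefixToUri.entrySet()) {
            if (e.getValue().equals(namespaceURI)) {
                prefixes.add(e.getKey());
            }
        }
        return prefixes.iterator();
    }
}
```

An instance can be passed to XPath.setNamespaceContext in place of a hand-written class such as MyNamespaceContext.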

What is the proper way to call getAttributeNS using Java DOM?

I'm having a problem correctly calling getAttributeNS() (and other NS methods) from Java DOM. First, here is my sample XML doc:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book xmlns:c="http://www.w3schools.com/children/" xmlns:foo="http://foo.org/foo" category="CHILDREN">
<title foo:lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
And here is my little Java class that uses DOM and calls getAttributeNS:
package com.mycompany.proj;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Element;
import java.io.File;
public class AttributeNSProblem
{
public static void main(String[] args)
{
try
{
File fXmlFile = new File("bookstore_ns.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("title");
Element elem = (Element)nList.item(0);
String lang = elem.getAttributeNS("http://foo.org/foo", "lang");
System.out.println("title lang: " + lang);
lang = elem.getAttribute("foo:lang");
System.out.println("title lang: " + lang);
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
When I call getAttributeNS("http://foo.org/foo", "lang"), it returns an empty String. I've also tried getAttributeNS("foo", "lang"), same result.
What's the proper way to retrieve the value of an attribute qualified by a namespace?
Thanks.

Immediately after DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();, add dbFactory.setNamespaceAware(true);
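A minimal, self-contained sketch of that fix follows. It parses a shortened version of the bookstore document from a string rather than from bookstore_ns.xml, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeNsDemo {
    // Shortened version of the sample document from the question.
    static final String SAMPLE =
          "<bookstore>"
        + "<book xmlns:foo=\"http://foo.org/foo\" category=\"CHILDREN\">"
        + "<title foo:lang=\"en\">Harry Potter</title>"
        + "</book>"
        + "</bookstore>";

    // Reads the foo:lang attribute of the first title element by namespace URI and local name.
    static String titleLang(String xml, boolean namespaceAware) throws Exception {
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        dbFactory.setNamespaceAware(namespaceAware); // false is the factory default
        Document doc = dbFactory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element title = (Element) doc.getElementsByTagName("title").item(0);
        return title.getAttributeNS("http://foo.org/foo", "lang");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("namespace-aware: " + titleLang(SAMPLE, true));
    }
}
```

The first argument to getAttributeNS is always the namespace URI, never the prefix, which is why getAttributeNS("foo", "lang") cannot match.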

Getting too many child nodes and cant get attributes

I have a simple XML file and I want to read its attributes. There are a few examples on the web, but I still don't understand why I get 17 child nodes when I see only 4. I even tried counting the places where I think text could be, but I still don't get that number, unless it is the length of the output. As a result, I don't know how to get the Name attribute of every tag3.
<?xml version="1.0" encoding="UTF-8"?>
<tag1 xmlns="something">
<xxxxxx-Set>
<tag3 Name="a"/>
<tag3 Name="b"/>
<tag3 Name="c"/>
<tag3 Name="d"/>
</xxxxxx-Set>
<tagB>
<tag3 Name="a"/>
<tag3 Name="b"/>
<tag3 Name="c"/>
<tag3 Name="d"/>
</tagB>
</tag1>
This is my java code:
import java.io.File;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class ParseXML {
public static void main(String[] args) {
try {
File test= new File("test.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(test);
NodeList tagAs= doc.getElementsByTagName("xxxxxx-Set").item(0).getChildNodes(); //should be all the tag3 elements?
for(int i = 0; i < tagAs.getLength(); i++) {
System.out.println(tagAs);
System.out.println(i);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
Note: adding .getAttributes().getNamedItem("Name").getNodeValue() to the print statement throws a NullPointerException.
And the output is:
[xxxxxx-Set: null]
0
[xxxxxx-Set: null]
1
...
[xxxxxx-Set: null]
16

If you want to get all the Name attributes (it is better to use lowercase attribute names), use the following approach:
Element xSet = (Element) doc.getElementsByTagName("xxxxxx-Set").item(0);
NodeList xSetTags = xSet.getElementsByTagName("tag3");
for(int i = 0; i < xSetTags.getLength(); i++) {
Element tag3 = (Element) xSetTags.item(i);
System.out.println(tag3.getAttribute("Name"));
}
I used the org.w3c.dom.Element interface here. Working with org.w3c.dom.Node directly is not ideal, because Node represents not only XML elements but also attributes, comments, text and other node types. See the documentation for the difference between Node and Element.
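To make the Node versus Element distinction concrete: getChildNodes() also returns the whitespace between tags as text nodes, and those have no attributes, which is where the null comes from. The sketch below, with hypothetical class and method names and a shortened version of the question's XML, keeps only the element children before reading the attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ElementChildren {
    // Shortened version of the question's document, keeping the line breaks and indentation.
    static final String SAMPLE =
          "<tag1>\n"
        + "  <xxxxxx-Set>\n"
        + "    <tag3 Name=\"a\"/>\n"
        + "    <tag3 Name=\"b\"/>\n"
        + "  </xxxxxx-Set>\n"
        + "</tag1>";

    // Returns the Name attribute of every element child of the first xxxxxx-Set element.
    static List<String> childNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList children = doc.getElementsByTagName("xxxxxx-Set").item(0).getChildNodes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip whitespace text nodes, comments and anything else that is not an element.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(((Element) child).getAttribute("Name"));
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childNames(SAMPLE));
    }
}
```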

XPath error in Android

My goal is executing an XQuery using XPath.
My XML file is:
<?xml version="1.0" encoding="UTF-8"?>
<postes>
<poste>
<gouvernourat>Kairouan</gouvernourat>
<ville>Kairouan sud</ville>
<cp>3100</cp>
</poste>
<poste>
<gouvernourat>Tunis</gouvernourat>
<ville>Ghazela</ville>
<cp>1002</cp>
</poste>
</postes>
My Java code is:
package xmlparse;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class QueryXML {
public void query() throws ParserConfigurationException, SAXException,
IOException, XPathExpressionException {
// Standard of reading a XML file
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder;
Document doc = null;
XPathExpression expr = null;
builder = factory.newDocumentBuilder();
doc = builder.parse("a.xml"); //C:\\Users\\aymen\\Desktop\\
// Create a XPathFactory
XPathFactory xFactory = XPathFactory.newInstance();
// Create a XPath object
XPath xpath = xFactory.newXPath();
// Compile the XPath expression
expr = xpath.compile("/postes/poste[gouvernourat='Tunis']/ville/text()");
// Run the query and get a nodeset
Object result = expr.evaluate(doc, XPathConstants.NODESET);
// Cast the result to a DOM NodeList
NodeList nodes = (NodeList) result;
for (int i=0; i<nodes.getLength();i++){
System.out.println(nodes.item(i).getNodeValue());
}
}
public static void main(String[] args) throws XPathExpressionException, ParserConfigurationException, SAXException, IOException {
QueryXML process = new QueryXML();
process.query();
}
}
When I run this Java code, the result is printed to the console correctly (System.out.println).
But when I copy this code into my Android application and change System.out.println(nodes.item(i).getNodeValue()); to Text2.setText(nodes.item(i).getNodeValue()); (I have a TextView named Text2), the TextView stays empty when I execute the code and click the button. There is no error and no force close.
Thank you in advance.

Attribute names need to start with '@' when using XPath on Android.
So change
[gouvernourat='Tunis']
to
[@gouvernourat='Tunis']
Refer to http://developer.android.com/reference/javax/xml/xpath/package-summary.html for details.
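For comparison, here is a small sketch of the two predicate forms in standard XPath 1.0, run with javax.xml.xpath on a desktop JVM. A bare name inside a predicate tests a child element, while a name prefixed with '@' tests an attribute. Note that in the XML from the question, gouvernourat is a child element. The region attribute below is hypothetical, added only to have something for the attribute form to match, and the behavior on Android is not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PredicateForms {
    // The question's data, plus a hypothetical "region" attribute for illustration.
    static final String SAMPLE =
          "<postes>"
        + "<poste region=\"center\"><gouvernourat>Kairouan</gouvernourat>"
        + "<ville>Kairouan sud</ville><cp>3100</cp></poste>"
        + "<poste region=\"north\"><gouvernourat>Tunis</gouvernourat>"
        + "<ville>Ghazela</ville><cp>1002</cp></poste>"
        + "</postes>";

    // Evaluates the expression against SAMPLE and returns the string value of the first match.
    static String eval(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Child-element predicate, as in the question.
        System.out.println(eval("/postes/poste[gouvernourat='Tunis']/ville/text()"));
        // Attribute predicate, using the hypothetical region attribute.
        System.out.println(eval("/postes/poste[@region='center']/ville/text()"));
    }
}
```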
