I have the following xml file:
<?xml version="1.0" encoding="UTF-8"?>
<users>
<user id="0" firstname="John"/>
</users>
Then I'm trying to parse it with Java, but getChildNodes() reports the wrong number of child nodes.
Java code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(this.file);
document.getDocumentElement().normalize();
Element root = document.getDocumentElement();
NodeList nodes = root.getChildNodes();
System.out.println(nodes.getLength());
Result: 3
Also I'm getting NPEs when accessing the nodes' attributes, so I'm guessing something's going horribly wrong.
The child nodes consist of elements and text nodes for whitespace. You will want to check the node type before processing the attributes. You may also want to consider using the javax.xml.xpath APIs available in the JDK/JRE starting with Java SE 5.
Example 1
This example demonstrates how to issue an XPath statement against a DOM.
package forum11649396;
import java.io.StringReader;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;
public class Demo {
public static void main(String[] args) throws Exception {
String xml = "<?xml version='1.0' encoding='UTF-8'?><users><user id='0' firstname='John'/></users>";
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.parse(new InputSource(new StringReader(xml)));
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
Element userElement = (Element) xpath.evaluate("/users/user", document, XPathConstants.NODE);
System.out.println(userElement.getAttribute("id"));
System.out.println(userElement.getAttribute("firstname"));
}
}
Example 2
The following example demonstrates how to issue an XPath statement against an InputSource to get a DOM node. This saves you from having to parse the XML into a DOM yourself.
package forum11649396;
import java.io.StringReader;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;
public class Demo {
public static void main(String[] args) throws Exception {
String xml = "<?xml version='1.0' encoding='UTF-8'?><users><user id='0' firstname='John'/></users>";
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
InputSource inputSource = new InputSource(new StringReader(xml));
Element userElement = (Element) xpath.evaluate("/users/user", inputSource, XPathConstants.NODE);
System.out.println(userElement.getAttribute("id"));
System.out.println(userElement.getAttribute("firstname"));
}
}
There are three child nodes:
a text node containing a line break
an element node (tagged user)
a text node containing a line break
So when processing the child nodes, check for element nodes.
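A minimal sketch of that check, assuming the same users document inlined as a string. The class name ChildElementsDemo and the helpers childElements and parse are made up for illustration; they are not part of the DOM API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildElementsDemo {

    // Hypothetical helper: returns only the element children of a parent node.
    public static List<Element> childElements(Node parent) {
        List<Element> elements = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Whitespace between tags shows up as text nodes; keep elements only.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                elements.add((Element) child);
            }
        }
        return elements;
    }

    // Hypothetical helper: parses an XML string and wraps checked exceptions.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<users>\n<user id=\"0\" firstname=\"John\"/>\n</users>";
        Element root = parse(xml).getDocumentElement();
        // Counts every child node, including the two whitespace text nodes.
        System.out.println(root.getChildNodes().getLength());
        for (Element user : childElements(root)) {
            System.out.println(user.getAttribute("id") + " " + user.getAttribute("firstname"));
        }
    }
}
```

Casting to Element only after the node type check is what avoids the NPE: text nodes have no attribute map at all.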
You have to account for the '\n' between the elements; each one counts as a text node. You can test for that using if (root.getNodeType() == Node.ELEMENT_NODE)
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(this.file);
document.getDocumentElement().normalize();
for(Node root = document.getFirstChild(); root != null; root = root.getNextSibling()) {
if(root.getNodeType() == Node.ELEMENT_NODE) {
NodeList nodes = root.getChildNodes();
System.out.println(root.getNodeName() + " has "+nodes.getLength()+" children");
for(int i=0; i<nodes.getLength(); i++) {
Node n = nodes.item(i);
System.out.println("\t"+n.getNodeName());
}
}
}
I didn't notice any of the answers addressing your last note about NPEs when trying to access attributes.
Also I'm getting NPEs when accessing the nodes' attributes, so I'm guessing something's going horribly wrong.
Since I've seen the following suggestion on a few sites, I assume it's a common way to access attributes:
String myPropValue = node.getAttributes().getNamedItem("myProp").getNodeValue();
which works fine if the node is always an element that has a myProp attribute, but if the node is not an element (for example, a whitespace text node), getAttributes will return null. Also, if the element has attributes, but no myProp attribute, getNamedItem will return null.
I'm currently using
public static String getStrAttr(Node node, String key) {
if (node.hasAttributes()) {
Node item = node.getAttributes().getNamedItem(key);
if (item != null) {
return item.getNodeValue();
}
}
return null;
}
public static int getIntAttr(Node node, String key) {
if (node.hasAttributes()) {
Node item = node.getAttributes().getNamedItem(key);
if (item != null) {
return Integer.parseInt(item.getNodeValue());
}
}
return -1;
}
in a utility class, but your mileage may vary.
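An alternative sketch of the same idea, assuming you would rather work with the Element interface than with the attribute map. The names AttrDemo, attrOrDefault and parseRoot are hypothetical; hasAttribute and getAttribute are the standard org.w3c.dom.Element methods.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AttrDemo {

    // Hypothetical helper: returns the attribute value, or the fallback when the
    // node is not an element or does not carry the attribute.
    public static String attrOrDefault(Node node, String key, String fallback) {
        if (node instanceof Element) {
            Element element = (Element) node;
            if (element.hasAttribute(key)) {
                return element.getAttribute(key);
            }
        }
        return fallback;
    }

    // Hypothetical helper: parses an XML string and returns its root element.
    public static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<users>\n<user id=\"0\" firstname=\"John\"/>\n</users>");
        Node whitespace = root.getFirstChild();   // text node: getAttributes() is null here
        Node user = whitespace.getNextSibling();  // the <user> element
        System.out.println(attrOrDefault(user, "firstname", "?"));
        System.out.println(attrOrDefault(user, "lastname", "?"));
        System.out.println(attrOrDefault(whitespace, "id", "?"));
    }
}
```

Passing an explicit fallback lets the caller distinguish a missing attribute from a real value, which a fixed -1 or null return cannot always do.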
Related
I have an application that uses SAML authentication, acts as an SP, and therefore parses SAMLResponses. I received notification that an IdP that communicates with my application will now start signing their SAMLResponses with http://www.w3.org/2001/10/xml-exc-c14n#WithComments, which means comments matter when calculating the validity of the SAML signature.
Here lies the problem - the library I use for XML parsing strips these comment nodes by default. See this example program:
import org.apache.commons.io.IOUtils;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
public class Main {
public static void main(String[] args) {
try {
String xml = "<NameID>test#email<!---->.com</NameID>";
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document doc = documentBuilder.parse(IOUtils.toInputStream(xml));
NodeList nodes = doc.getElementsByTagName("NameID");
if (nodes == null || nodes.getLength() == 0)
{
throw new RuntimeException("No NameID in document");
}
System.out.println(nodes.item(0).getTextContent());
} catch(Exception e) {
System.err.println(e.getMessage());
}
}
}
So, this program will print test#email.com (which means that's what my SAML code will get, too). This is a problem, as I'm pretty sure it will cause signature validation to fail without the comment included, since the XML document was signed with the #WithComments canonicalization method.
Is there any way to get DocumentBuilder/getTextContent() to leave in comment nodes so my signature is not invalidated by the missing comment?
Documentation for getTextContent() is here: https://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html#getTextContent()
Your code actually retains the comment.
Here, slightly modified:
public static void main(String[] args) throws Exception {
String xml = "<NameID>test#email<!--foobar-->.com</NameID>";
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document doc = documentBuilder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
NodeList childNodes = doc.getDocumentElement().getChildNodes();
Node[] nodes = new Node[childNodes.getLength()];
for (int index = 0; index < childNodes.getLength(); index++) {
nodes[index] = childNodes.item(index);
}
System.out.println(nodes[1].getTextContent());
}
Prints foobar. (Run it on Ideone.)
There are 3 child nodes of the root element, and one of them is the comment node. So it is actually retained.
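To make the difference visible, here is a small sketch that compares getTextContent() with a count of comment children, with and without DocumentBuilderFactory.setIgnoringComments. The names CommentDemo, parseRoot and countComments are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CommentDemo {

    // Hypothetical helper: parses the XML, optionally telling the parser to drop comments.
    public static Element parseRoot(String xml, boolean ignoreComments) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setIgnoringComments(ignoreComments); // false is the default
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Hypothetical helper: counts the direct children that are comment nodes.
    public static int countComments(Node parent) {
        int count = 0;
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.COMMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<NameID>test#email<!--foobar-->.com</NameID>";
        Element kept = parseRoot(xml, false);
        // getTextContent() skips comment nodes, but the comment is still in the tree.
        System.out.println(kept.getTextContent());
        System.out.println(countComments(kept));
        // Only when comments are explicitly ignored does the node disappear.
        System.out.println(countComments(parseRoot(xml, true)));
    }
}
```

So the comment only looks stripped because getTextContent() leaves comment nodes out of the string it builds; the DOM tree that a canonicalizer would walk still contains it.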
I receive a response as XML and I'm trying to parse it to get the inner details. I'm using DocumentBuilderFactory for this. The parent object is not null, but when I try to get the nested node list elements, it returns null. Am I missing anything?
Here is my response XML
ResponseXML
<DATAPACKET REQUEST-ID = "1">
<HEADER>
</HEADER>
<BODY>
<CONSUMER_PROFILE2>
<CONSUMER_DETAILS2>
<NAME>David</NAME>
<DATE_OF_BIRTH>1949-01-01T00:00:00+03:00</DATE_OF_BIRTH>
<GENDER>001</GENDER>
</CONSUMER_DETAILS2>
</CONSUMER_PROFILE2></BODY></DATAPACKET>
and I'm parsing it in the following way
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(responseXML));
// Consumer details.
if(doc.getDocumentElement().getElementsByTagName("CONSUMER_DETAILS2") != null) {
Node consumerDetailsNode = doc.getDocumentElement().getElementsByTagName("CONSUMER_DETAILS2").item(0); // <-- this is coming as null
dateOfBirth = getNamedItem(consumerDetailsNode, "DATE_OF_BIRTH");
System.out.println("DOB:"+dateOfBirth);
}
getNamedItem
private static String getNamedItem(Node searchResultNode, String param) {
return searchResultNode.getAttributes().getNamedItem(param) != null ? searchResultNode.getAttributes().getNamedItem(param).getNodeValue() : "";
}
Any ideas would be greatly appreciated.
The easiest way to search for individual elements within an XML document is with XPath. It provides a search syntax similar to file system notation.
Here is a solution to the specific problem of your document:
EDIT: solution adapted to support multiple CONSUMER_PROFILE2 elements. You just need to get and parse a NodeList instead of one Node.
import java.io.*;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.*;
public class XpathDemo
{
public static void main(String[] args)
{
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document xmlDoc = builder.parse(new InputSource(new FileReader("C://Temp/xx.xml")));
// Selects all CONSUMER_PROFILE2 elements no matter where they are in the document
String cp2_nodes = "//CONSUMER_PROFILE2";
// Selects every DATE_OF_BIRTH element below the current element that is the first DATE_OF_BIRTH child of its parent
String dob_nodes = "//DATE_OF_BIRTH[1]";
// Selects text child node of current element
String text_node = "/child::text()";
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList dob_list = (NodeList)xPath.compile(cp2_nodes + dob_nodes + text_node)
.evaluate(xmlDoc, XPathConstants.NODESET);
for (int i = 0; i < dob_list.getLength() ; i++) {
Node dob_node = dob_list.item(i);
String dob_text = dob_node.getNodeValue();
System.out.println(dob_text);
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
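For comparison, a plain DOM sketch of the same lookup. The getNamedItem helper in the question searches the attribute map of CONSUMER_DETAILS2, but DATE_OF_BIRTH is a child element, so its value has to be read as element text instead. The names ChildTextDemo, childText and parseRoot are hypothetical, and the response XML is inlined in shortened form.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildTextDemo {

    // Hypothetical helper: text of the first descendant element with the given
    // tag name, or an empty string when there is no such element.
    public static String childText(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() > 0 ? matches.item(0).getTextContent() : "";
    }

    // Hypothetical helper: parses an XML string and returns its root element.
    public static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<DATAPACKET REQUEST-ID=\"1\"><BODY><CONSUMER_PROFILE2><CONSUMER_DETAILS2>"
                + "<NAME>David</NAME><DATE_OF_BIRTH>1949-01-01T00:00:00+03:00</DATE_OF_BIRTH>"
                + "</CONSUMER_DETAILS2></CONSUMER_PROFILE2></BODY></DATAPACKET>";
        Element root = parseRoot(xml);
        Element details = (Element) root.getElementsByTagName("CONSUMER_DETAILS2").item(0);
        System.out.println("DOB:" + childText(details, "DATE_OF_BIRTH"));
    }
}
```

Note that getElementsByTagName never returns null; it returns an empty NodeList, so the length check is the one that matters.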
I have an xml file having data which looks like given below:
....
<ems:MessageInformation>
<ecs:MessageID>2147321820</ecs:MessageID>
<ecs:MessageTimeStamp>2016-01-01T04:38:33</ecs:MessageTimeStamp>
<ecs:SendingSystem>LD</ecs:SendingSystem>
<ecs:ReceivingSystem>CH</ecs:ReceivingSystem>
<ecs:ServicingFipsCountyCode>037</ecs:ServicingFipsCountyCode>
<ecs:Environment>UGS-D8UACS02</ecs:Environment>
</ems:MessageInformation>
....
There are many other nodes as well. All nodes have a namespace prefix such as ecs, tns, ems, etc. I am using the following code to extract all node names without the namespace prefix.
public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException, TransformerException {
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document document = docBuilder.parse(new File("C:\\Users\\DadMadhR\\Desktop\\temp\\EDR_D3A0327.XML"));
NodeList nodeList = document.getElementsByTagName("*");
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
//System.out.println(node.getNodeName());
System.out.println(node.getLocalName());
}
}
But when I execute this code, it prints null for every node. Can someone tell me what I am doing wrong here?
I read that node.getLocalName() returns the node name without the namespace prefix. What is wrong in my case, then?
You need to set the document builder factory to be namespace aware first. Then getLocalName() will start returning non-null values.
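DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
docBuilderFactory.setNamespaceAware(true); // <=== here, on the factory, before the builder is created
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document document = docBuilder.parse(new File("C:\\Users\\DadMadhR\\Desktop\\temp\\EDR_D3A0327.XML"));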
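A small self-contained sketch of the effect, assuming a one-element document inlined as a string. The class name LocalNameDemo, the helper rootLocalName and the namespace URI urn:example:ems are invented for this example; the real file will declare its own URIs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LocalNameDemo {

    // Hypothetical helper: parses the XML with the given namespace setting and
    // returns getLocalName() of the root element.
    public static String rootLocalName(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Must be set on the factory before newDocumentBuilder() is called.
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The namespace URI is a placeholder chosen for this sketch.
        String xml = "<ems:MessageInformation xmlns:ems=\"urn:example:ems\"/>";
        System.out.println(rootLocalName(xml, false)); // parser not namespace aware
        System.out.println(rootLocalName(xml, true));  // parser namespace aware
    }
}
```

The first call shows the behavior from the question, the second the local name without its prefix.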
Given the following XML (example):
<?xml version="1.0" encoding="UTF-8"?>
<rsb:VersionInfo xmlns:atom="http://www.w3.org/2005/Atom" xmlns:rsb="http://ws.rsb.de/v2">
<rsb:Variant>Windows</rsb:Variant>
<rsb:Version>10</rsb:Version>
</rsb:VersionInfo>
I need to get the values of Variant and Version. My current approach is using XPath, as I cannot rely on the given structure. All I know is that there is an element rsb:Version somewhere in the document.
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "//Variant";
InputSource inputSource = new InputSource("test.xml");
String result = (String) xpath.evaluate(expression, inputSource, XPathConstants.STRING);
System.out.println(result);
This however does not output anything. I have tried the following XPath expressions:
//Variant
//Variant/text()
//rsb:Variant
//rsb:Variant/text()
What is the correct XPath expression? Or is there an even simpler way of getting to this element?
I would suggest just looping through the document to find the given tag
public static void main(String[] args) throws SAXException, IOException,ParserConfigurationException, TransformerException {
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory
.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document document = docBuilder.parse(new File("test.xml"));
NodeList nodeList = document.getElementsByTagName("rsb:VersionInfo");
for (int i = 0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
// do something with the current element
System.out.println(node.getNodeName());
}
}
}
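Edit: Yassin pointed out that it won't get child nodes. This should point you in the right direction for getting the children.
// Needs java.util.ArrayList and java.util.List.
private static List<Node> getChildren(Node n)
{
    // NodeList is not a java.util.List, so copy the element children by hand.
    NodeList childNodes = n.getChildNodes();
    List<Node> children = new ArrayList<>();
    for (int i = 0; i < childNodes.getLength(); i++) {
        Node child = childNodes.item(i);
        if (child.getNodeType() == Node.ELEMENT_NODE)
            children.add(child);
    }
    return children;
}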
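If you would rather stay with XPath, as in the question, the sketch below shows two ways that are known to work with the JDK's javax.xml.xpath API: binding the rsb prefix through a NamespaceContext, or matching on local-name(). The class name NamespacedXPathDemo and the helper evaluate are made up, and the XML is inlined as a string instead of being read from test.xml.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespacedXPathDemo {

    static final String RSB_NS = "http://ws.rsb.de/v2";

    // Hypothetical helper: evaluates an XPath expression against the XML and
    // returns the string value of the result.
    public static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed names to resolve
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Tell XPath which URI the "rsb" prefix in the expression stands for.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "rsb".equals(prefix) ? RSB_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return RSB_NS.equals(uri) ? "rsb" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return RSB_NS.equals(uri)
                            ? Collections.singletonList("rsb").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rsb:VersionInfo xmlns:rsb=\"http://ws.rsb.de/v2\">"
                + "<rsb:Variant>Windows</rsb:Variant><rsb:Version>10</rsb:Version>"
                + "</rsb:VersionInfo>";
        System.out.println(evaluate(xml, "//rsb:Variant"));                // prefix bound via NamespaceContext
        System.out.println(evaluate(xml, "//*[local-name()='Version']")); // no prefix binding needed
    }
}
```

The prefix used in the expression only has to match the NamespaceContext, not the prefix in the document; what must agree is the namespace URI.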
I want to parse an HTML page such as http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top
and only want to extract the text of the elements that have <a class="title"
The options I have looked at so far all look like overkill (SAX, DOM traversal).
Use Jsoup. It supports jQuery-like CSS selectors. Here's a kickoff example:
String url = "http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top";
Document document = Jsoup.connect(url).get();
for (Element link : document.select("a.title")) {
System.out.println(link.absUrl("href"));
}
Result:
http://news.cnet.com/8301-13579_3-10288022-37.html
http://dl.getdropbox.com/u/18264/mspoland.jpg
http://www.reddit.com/r/reddit.com/comments/ar5z1/verizon_stealthily_installed_a_bing_search_app_on/
http://www.grabup.com/uploads/240ccede5360b093dbf298f8946025a5.png
http://www.youtube.com/watch?v=7Ym0tZSWGMc&fmt=34
http://i42.tinypic.com/wv5qar.jpg
http://www.reddit.com/r/technology/comments/8hnya/apple_no_i_dont_want_to_make_quicktime_my_default/
http://cssferret.imgur.com/microsoft_wtf
http://imgur.com/8pct5.png
http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-google-search.html
http://news.cnet.com/8301-27076_3-20011994-248.html?part=rss&subj=news&tag=2547-1_3-0-20
http://gizmodo.com/5383413/shady-microsoft-plugin-pokes-critical-hole-in-firefox-security
http://i.stack.imgur.com/sl1LY.png
http://imgur.com/T6BMs
http://www.nytimes.com/2010/09/14/world/europe/14raid.html
http://twitter.com/phil_nash/status/21159419598
http://online.wsj.com/article/SB10001424052748704415104576065641376054226.html?mod=WSJASIA_hpp_MIDDLESecondNews
http://www.reddit.com/r/reddit.com/comments/bqqxv/inside_the_chinese_factory_that_makes_microsofts/
http://i.min.us/iX0PA.png
http://imgur.com/m4nuz.gif
http://www.gamesforwindows.com/en-CA/Games/AgeofEmpiresIII/
http://foredecker.wordpress.com/2011/02/27/working-at-microsoft-day-to-day-coding/
http://homepage.mac.com/aleksivic/.Pictures/humor/spotTheBusey.jpg
http://www.bloomberg.com/apps/news?pid=20601087&sid=a7uOT0ro100U&refer=home
http://www.microsoft.com/windowsxp/eula/pro.mspx
Pretty concise, huh?
See also:
Pros and cons of leading HTML parsers in Java
Just an observation: Reddit generates XHTML, which means it's XML compliant. So you can just use an XPath library. e.g. (shamelessly copied from http://www.ibm.com/developerworks/library/x-javaxpathapi.html with minor modifications),
import java.io.IOException;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
public class XPathExample {
public static void main(String[] args)
throws ParserConfigurationException, SAXException,
IOException, XPathExpressionException {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true); // never forget this!
DocumentBuilder builder = domFactory.newDocumentBuilder();
// replace the following line with code to retrieve and parse the URL of your choice
Document doc = builder.parse("books.xml");
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
XPathExpression expr
= xpath.compile("//a[@class='title']/text()"); // @class selects the attribute; a bare "class" would look for a child element
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(nodes.item(i).getNodeValue());
}
}
}
Obviously won't work on all websites, but will work for any that serve XHTML.
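A self-contained sketch of the attribute predicate, run against a tiny hand-written snippet rather than a live page. The class name TitleLinksDemo, the helper titleTexts and the markup are all invented for illustration. The snippet deliberately has no xmlns declaration; a real XHTML page that declares the XHTML namespace and is parsed namespace aware would also need a NamespaceContext and prefixed element names in the expression.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TitleLinksDemo {

    // Hypothetical helper: returns the text of every <a class="title"> element.
    public static List<String> titleTexts(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // @class tests the attribute; without the @ it would test a child element.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//a[@class='title']/text()", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getNodeValue());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Made-up, well-formed markup standing in for a downloaded page.
        String xhtml = "<html><body>"
                + "<a class=\"title\" href=\"/one\">First</a>"
                + "<a class=\"other\" href=\"/two\">Second</a>"
                + "<a class=\"title\" href=\"/three\">Third</a>"
                + "</body></html>";
        System.out.println(titleTexts(xhtml));
    }
}
```

The predicate compares the whole attribute value, so an element with class="title loggedin" would not match; contains(@class, 'title') is the usual looser test.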