XML Parsing using DocumentBuilder

I am trying to parse an XML document into a map of key-value pairs as follows.
Sample XML document:
<Students>
<StudentA>
<Id>123</Id>
<Address>123 W </Address>
<Courses>
<Course1>CS203</Course1>
<Course2>CS206</Course2>
</Courses>
</StudentA>
<StudentB>
<Id>124</Id>
<Address>124 W </Address>
<Courses>
<Course1>CS202</Course1>
<Course2>CS204</Course2>
</Courses>
</StudentB>
</Students>
The XML parser code:
/**
* Parse the given XML data.
* @param xmlString The XML string to be parsed.
* @return Non-null map of values keyed by element path, may be empty.
*/
Map<String, String> parseXML(final String xmlString)
{
final String xmlDataToParse = xmlString;
// Declared before the try block so that it is in scope for the return statement.
Map<String, String> data = new LinkedHashMap<String, String>();
parentNode = "";
try
{
final InputStream inputStream = new ByteArrayInputStream(xmlDataToParse.getBytes());
final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
final DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
final Document document = documentBuilder.parse(inputStream);
data = createMapOfAttributeValuesKeyedByName(document.getDocumentElement());
}
catch (final Exception exception)
{
System.out.println("Exception:" + exception);
}
return data;
}
/**
* A recursive method which will loop through all the xml nodes.
* @param node The node.
* @return Non-null map of test values keyed by test name, may be empty.
*/
Map<String, String> createMapOfAttributeValuesKeyedByName(final Node node)
{
final Map<String, String> attributeValuesKeyedByName = new LinkedHashMap<String, String>();
final NodeList nodeList = node.getChildNodes();
for (int index = 0; index < nodeList.getLength(); index++)
{
final Node currentNode = nodeList.item(index);
if (node.getFirstChild() != null && node.getFirstChild().getNodeType() == Node.ELEMENT_NODE)
{
parentNode = getAncestralOrigin(currentNode);
attributeValuesKeyedByName.putAll(createMapOfAttributeValuesKeyedByName(currentNode));
}
else if (node.getFirstChild() != null && node.getFirstChild().getNodeType() == Node.TEXT_NODE)
{
final String attributeName = parentNode.length() > 0 ? parentNode + "." + node.getNodeName().trim() : node.getNodeName().trim();
final String attributeValue = node.getTextContent().trim();
attributeValuesKeyedByName.put(attributeName, attributeValue);
parentNode = "";
}
}
return attributeValuesKeyedByName;
}
/**
* Parses a given node and finds all its ancestors.
* @param node The node whose ancestors have to be found.
* @return A non-null but possibly empty string built using the ancestors of the node.
*/
final String getAncestralOrigin(final Node node)
{
String ancestralOrigin = "";
final Node currentParentNode = node.getParentNode();
if (currentParentNode != null && currentParentNode.getNodeType() != Node.DOCUMENT_NODE)
{
ancestralOrigin = currentParentNode.getNodeName();
final String ancestor = getAncestralOrigin(currentParentNode);
if (ancestor.length() > 0)
{
ancestralOrigin = ancestor + "." + ancestralOrigin;
}
}
return ancestralOrigin;
}
The resulting map contains:
Key:[Students.StudentA.Id], Value:[123]
Key:[Students.StudentA.Address], Value:[123 W]
Key:[Students.StudentA.Courses.Course1], Value:[CS203]
Key:[Students.StudentA.Courses.Course2], Value:[CS206]
Key:[Students.StudentB.Id], Value:[124]
Key:[Students.StudentB.Address], Value:[124 W]
Key:[Students.StudentB.Courses.Course1], Value:[CS202]
Key:[Students.StudentB.Courses.Course2], Value:[CS204]
This is the output I get when the file is read with
final BufferedReader bufferedReader = new BufferedReader(new FileReader(new File(url.getFile().replaceAll("%20", " "))));
If the same file is read with
DataInputStream is = new DataInputStream(new FileInputStream(new File(url.getFile().replaceAll("%20", " "))));
the output is different: all the CR and LF characters within the XML document end up in the values.
Key:[Students], Value:[123
123 W
CS203
CS206
124
124 W
CS202
CS204]
I am using a dependency jar to read the XML file, and it uses DataInputStream.
I was always under the impression that XML parsers would take care of CR/LF/newline characters, but it looks like they do not.
As a workaround, I am replacing all CR, LF and newline characters with an empty string before parsing.
But I would like to know whether there are other XML parsers that handle this by themselves. Also, why does BufferedReader skip CR/LF and newline characters
whereas DataInputStream does not?
Also, is there a better way to find the ancestors of a child tag? I need them to make the key unique.
The XML has to be parsed as it is and cannot be changed. It will also not be the same as the sample shown here; it will be generic XML with varying tags,
so I am trying to write a generic XML parser that reads the child tags and puts them into a map.
The child tags can be duplicated, so I am using the path to the child to make the key unique.
Also, is there a way to parse the XML recursively starting from just these tags (StudentA/StudentB), leaving out the parent tag Students?
Note: The format may differ for every XML file that I parse,
so I really can't do something like "get the children of StudentA".

After going through the long description, I understand that you want to know whether there is a better way to parse XML.
Yes, there are other ways. Use StAX or SAX; these are streaming parsers that are fast and more memory-efficient than building a DOM. To learn more, read the JAXP section of the Java Tutorial.
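As a rough illustration of the StAX suggestion, here is a minimal sketch that builds the same kind of path-keyed map from the question. The class name StaxFlattener and the method name flatten are made up for this example, and it assumes that only leaf elements (elements without child elements) should end up in the map.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class StaxFlattener {

    /** Returns leaf element values keyed by their dotted path, in document order. */
    static Map<String, String> flatten(String xml) {
        Map<String, String> values = new LinkedHashMap<>();
        Deque<String> path = new ArrayDeque<>();   // names of the currently open elements
        StringBuilder text = new StringBuilder();  // text seen since the last start tag
        boolean leaf = false;                      // true until a child element is seen
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        path.addLast(reader.getLocalName());
                        text.setLength(0);
                        leaf = true;
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (leaf) {
                            // No child element was opened since the start tag: record the value.
                            values.put(String.join(".", path), text.toString().trim());
                        }
                        path.removeLast();
                        leaf = false;
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return values;
    }
}
```

Because the element path is tracked on a stack while reading, there is no need to walk back up through the ancestors, and line breaks between tags are simply discarded when a non-leaf element closes.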

DataInputStream is intended only for reading data that was written with a DataOutputStream, i.e. Java primitives and strings in a binary encoding. It is not intended for reading text input.
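If you stay with DOM, one option is to hand the raw InputStream to the parser and to look only at element children, so that whitespace-only text nodes between tags no longer matter. The following is a sketch under that assumption; DomFlattener, flatten and collect are hypothetical names, not an existing API.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of any library.
final class DomFlattener {

    /** Parses the stream and returns leaf element values keyed by their dotted path. */
    static Map<String, String> flatten(InputStream in) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in)
                    .getDocumentElement();
            Map<String, String> values = new LinkedHashMap<>();
            collect(root, root.getNodeName(), values);
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Recurses into element children only; text nodes that merely hold
    // line breaks or indentation are never inspected.
    private static void collect(Element element, String path, Map<String, String> values) {
        boolean hasChildElement = false;
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, path + "." + child.getNodeName(), values);
            }
        }
        if (!hasChildElement) {
            values.put(path, element.getTextContent().trim());
        }
    }
}
```

The path is passed down as a parameter, so no shared parentNode field and no ancestor lookup are needed. Passing the stream directly also lets the parser detect the encoding from the XML declaration.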

Related

How to store values in an array (Java)

I have a table (REQUESTS) that contains one column (XML_DATA) for XML documents.
So if ID=123 has a row in this table, I should get the corresponding XML.
Once the XML has been retrieved, I need to get all the values of tags like <Mobile>0918xxxx</Mobile>.
Here is what I have so far:
for (int i = 0; i < RequestDBViewData.length; i++) //GETS ROWS FROM TABLE REQUESTS
{
xmlDetails = test.getDetailsFromXML(mCustUtils, RequestDBViewData[i]); //GETS XML FROM XML_DATA
String strXmlDetails;
String strMob;
if (!AppUtils.isEmpty(xmlDetails)) //IF IT HAS ROW, THEN GET RECORD FROM <MOBILE></MOBILE> TAG
{
strXmlDetails = xmlDetails.toString(); //ENTIRE XML
strMob = StringUtils.substringBetween(strXmlDetails, "<Mobile>", "</Mobile>"); //GETS MOBILE VALUE
}
}
Now, if there is more than one <Mobile></Mobile>,
I need to store the values in an array using a for loop.
How do I store multiple values of strMob in an array?
After storing all possible strMob values, I'm planning to assign them somewhere else, like: personalInfo[j].setMobile(array/list[j]);
Any suggestions on how to do this?

Use a proper XML tool to read your XML.
Variant 1
Use Jackson to parse your XML to a prepared Java object. This will also work for arrays.
Variant 2
Read XML to a DOM object
public static Element getDocument(String xmlString) {
return getDocument(new ByteArrayInputStream(xmlString.getBytes()));
}
public static Element getDocument(InputStream inputStream) {
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.parse(new BufferedInputStream(inputStream));
Element root = doc.getDocumentElement();
root.normalize();
return root;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
With the help of XPath you can then extract your <Mobile> elements. See this answer: Extract content between XML tags
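A minimal sketch of the XPath step could look like the following. MobileExtractor and mobiles are names invented for this example, and the expression //Mobile assumes that every <Mobile> element in the document is wanted, wherever it sits.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class MobileExtractor {

    /** Returns the text of every <Mobile> element in the given XML, in document order. */
    static List<String> mobiles(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//Mobile", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the given XML", e);
        }
    }
}
```

The returned list can then be used in the loop from the question, for example personalInfo[j].setMobile(list.get(j)), as long as the two collections have matching sizes.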

Java XPath API Stripping HTML Tags from Text

I am currently using the Java XPath API to extract some text from a String.
This String, however, often has HTML formatting (<b>, <em>, <sub>, etc). When I run my code, the HTML tags are stripped off. Is there any way to avoid this?
Here is a sample input:
<document>
<summary>
The <b>dog</b> jumped over the fence.
</summary>
</document>
Here is a snippet of my code:
XPathFactory factory = XPathFactory.newInstance();
XPath xPath = factory.newXPath();
InputSource source = new InputSource(new StringReader(xml));
String output = xPath.evaluate("/document/summary", source);
Here is the current output:
The dog jumped over the fence.
Here is the output I want:
The <b>dog</b> jumped over the fence.
Thanks in advance for all your help.

A simple, straightforward (but maybe not very efficient) solution:
/**
* Serializes an XML node to a string representation without the XML declaration
*
* @param node The XML node
* @return The string representation
* @throws TransformerFactoryConfigurationError
* @throws TransformerException
*/
private static String node2String(Node node) throws TransformerFactoryConfigurationError, TransformerException {
final Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
final StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(node), new StreamResult(writer));
return writer.toString();
}
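/**
* Serializes the inner (child) nodes of an XML element.
* @param el The element whose child nodes are serialized
* @return The concatenated string representation of the child nodes
* @throws TransformerFactoryConfigurationError
* @throws TransformerException
*/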
private static String elementInner2String(Element el) throws TransformerFactoryConfigurationError, TransformerException {
final NodeList children = el.getChildNodes();
final StringBuilder sb = new StringBuilder();
for(int i = 0; i < children.getLength(); i++) {
final Node child = children.item(i);
sb.append(node2String(child));
}
return sb.toString();
}
Then the XPath evaluation should return the node instead of the string:
Element summaryElement = (Element) xpath.evaluate("/document/summary", doc, XPathConstants.NODE);
String output = elementInner2String(summaryElement);

The <b>dog</b> jumped over the fence
Get children from this string. You will have 2 Text Nodes and one Element Node. Treat them accordingly.

When the parser reads the text as XML, it classifies the contents of the summary node as text, element, text. When you evaluate /document/summary as a string, the resolver returns the concatenated text of all the descendants of the selected node, which gives you text + element text + text. This is why you lose the bold tag. The input string inside summary should be either:
HTML encoded -or-
Wrapped in a CDATA tag.
Wrapping the contents in a CDATA section makes the parser treat them as text:
<document>
<summary>
<![CDATA[The <b>dog</b> jumped over the fence.]]>
</summary>
</document>
The problem with your solution is that the parser expects the contents to be well-formed XML. If you had an unbalanced tag inside summary, you would get an exception.
One solution to your question would be to loop over the child nodes to collect the text data while preserving the element names. This may work for your example; however, it will break if you have an unbalanced tag:
The <b>dog</b> jumped over <br> the fence
Don't use this solution to parse data between the summary tags. Instead, either use CDATA or use some sort of regex to get the content between the start and end points.
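To illustrate the CDATA variant, here is a small sketch that reuses the XPath call from the question. It assumes you control how the document is produced, so that the summary can be wrapped in CDATA; CdataSummary and summary are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class CdataSummary {

    /** Evaluates /document/summary as a string and trims the surrounding whitespace. */
    static String summary(String xml) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate("/document/summary", new InputSource(new StringReader(xml)))
                    .trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on the given XML", e);
        }
    }
}
```

With the CDATA-wrapped sample document, the markup inside summary is plain character data, so it should come back as part of the returned string instead of being parsed as a child element.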

Xml document to DOM object using DocumentBuilderFactory

I am currently modifying a piece of code and I am wondering if the way the XML is formatted (tabs and spacing) will affect the way in which it is parsed into the DocumentBuilderFactory class.
In essence the question is...can I pass a big long string with no spacing into the DocumentBuilderFactory or does it need to be formatted in some way?
Thanks in advance. Included below is the class definition from Oracle's website.
Class DocumentBuilderFactory
"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. "

The documents will be different. Tabs and new lines will be converted into text nodes. You can eliminate these using the following method on DocumentBuilderFactory:
http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setIgnoringElementContentWhitespace(boolean)
But in order for it to work you must set up your DOM parser to validate the content against a DTD or xml schema.
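As a sketch of that setup, the example below validates against a small internal DTD so the parser knows which whitespace is ignorable. The class name IgnorableWhitespaceDemo, the sample document and its DTD are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
final class IgnorableWhitespaceDemo {

    // Sample document with an internal DTD; root may only contain elements,
    // so the line breaks and indentation inside it are ignorable whitespace.
    static final String XML_WITH_DTD =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE root [\n"
            + "  <!ELEMENT root (A, B)>\n"
            + "  <!ELEMENT A EMPTY>\n"
            + "  <!ELEMENT B (#PCDATA)>\n"
            + "]>\n"
            + "<root>\n"
            + "  <A/>\n"
            + "  <B>tagB</B>\n"
            + "</root>";

    /** Parses the sample with validation on and returns the number of child nodes of root. */
    static int rootChildCount() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);                        // validate against the DTD
            factory.setIgnoringElementContentWhitespace(true);  // drop ignorable whitespace
            DocumentBuilder builder = factory.newDocumentBuilder();
            Element root = builder.parse(new InputSource(new StringReader(XML_WITH_DTD)))
                    .getDocumentElement();
            return root.getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }
}
```

With this configuration the root element should report only its two element children, A and B, with no whitespace-only text nodes between them.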
Alternatively you could programmatically remove the extra whitespace yourself using something like the following:
public static void removeEmptyTextNodes(Node node) {
NodeList nodeList = node.getChildNodes();
Node childNode;
for (int x = nodeList.getLength() - 1; x >= 0; x--) {
childNode = nodeList.item(x);
if (childNode.getNodeType() == Node.TEXT_NODE) {
if (childNode.getNodeValue().trim().equals("")) {
node.removeChild(childNode);
}
} else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
removeEmptyTextNodes(childNode);
}
}
}

It should not affect the ability of the parser to read the document as long as the string is well-formed XML. Tabs and newlines between elements are there mainly for the human reader, although a DOM parser does keep them as whitespace-only text nodes unless it is told to ignore them.
Note that you will have to pass an input stream (a ByteArrayInputStream, for example, since StringBufferInputStream is deprecated) or an InputSource to the DocumentBuilder, as the String version of parse assumes the argument is a URI pointing to the XML.
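For example, a string can be wrapped in an InputSource backed by a StringReader, which avoids both the URI interpretation and any byte-encoding step. This is only a sketch; StringParsing is a hypothetical class name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
final class StringParsing {

    /** Parses XML held in a String; parse(String) would treat the argument as a URI instead. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML string", e);
        }
    }
}
```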

The DocumentBuilder builds different DOM objects for xml string with line feeds and xml string without line feeds. Here is the code I tested:
StringBuilder sb = new StringBuilder();
sb.append("<root>").append(newlineChar).append("<A>").append("</A>").append(newlineChar).append("<B>tagB").append("</B>").append("</root>");
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputStream xmlInput = new ByteArrayInputStream(sb.toString().getBytes());
Element documentRoot = builder.parse(xmlInput).getDocumentElement();
NodeList nodes = documentRoot.getChildNodes();
System.out.println("How many children does the root have? => " + nodes.getLength());
for(int index = 0; index < nodes.getLength(); index++){
System.out.println(nodes.item(index).getLocalName());
}
Output:
How many children does the root have? => 4
null
A
null
B
But if newlineChar is removed from the StringBuilder,
the output is:
How many children does the root have? => 2
A
B
This demonstrates that the DOM objects generated by DocumentBuilder are different.

The format of the XML string shouldn't have any effect, but I remember a strange problem I once had when passing a long string to an XML parser. The parser was unable to parse an XML file that was written as one long line.
It may be better to insert line breaks so that lines are no longer than, let's say, 1000 bytes.
Sadly, I remember neither why that error occurred nor which parser I used.

Parsing xml with dom4j or jdom or anyhow

I want to read feed entries and I'm stuck. Take this for example: https://stackoverflow.com/feeds/question/2084883. Let's say I want to read the value of the summary node inside each entry node in the document. How do I do that? I've tried many variations of the code; this one is the closest to what I want to achieve, I think:
Element entryPoint = document.getRootElement();
Element elem;
for(Iterator iter = entryPoint.elements().iterator(); iter.hasNext();){
elem = (Element)iter.next();
System.out.println(elem.getName());
}
It goes through all the nodes in the XML file and writes out their names. What I wanted to do next is
if(elem.getName().equals("entry"))
to get only the entry nodes. How do I get the child elements of the entry nodes, and how do I get, let's say, summary and its value? Thanks.
Question: how to get values of summary nodes from this link

Have you tried jdom? I find it simpler and convenient.
http://www.jdom.org/
To get all children of an xml element, you can just do
SAXBuilder sb = new SAXBuilder();
StringReader sr = new StringReader(xmlDocAsString);
Document doc = sb.build(sr);
Element root = doc.getRootElement();
List l = root.getChildren("entry");
for (Iterator iter = l.iterator(); iter.hasNext();) {
...//do whatever...
}

Here's how you'd do it using vanilla Java:
//read the XML into a DOM
StreamSource source = new StreamSource(new StringReader("<theXml></theXml>"));
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(source, result);
Node root = result.getNode();
//make XPath object aware of namespaces
XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContext(){
@Override
public String getNamespaceURI(String prefix) {
if ("atom".equals(prefix)){
return "http://www.w3.org/2005/Atom";
}
return null;
}
@Override
public String getPrefix(String namespaceURI) {
return null;
}
@Override
public Iterator getPrefixes(String namespaceURI) {
return null;
}
});
//get all summaries
NodeList summaries = (NodeList) xpath.evaluate("/atom:feed/atom:entry/atom:summary", root, XPathConstants.NODESET);
for (int i = 0; i < summaries.getLength(); ++i) {
Node summary = summaries.item(i);
//print out all the attributes
for (int j = 0; j < summary.getAttributes().getLength(); ++j) {
Node attr = summary.getAttributes().item(j);
System.out.println(attr.getNodeName() + "=" + attr.getNodeValue());
}
//print text content
System.out.println(summaries.item(i).getTextContent());
}

if(elem.getName() == "entry")
I have no idea whether this is your problem (you don't really state what your problem is), but never test string equality with ==. Instead, use equals():
if(elem.getName().equals("entry"))

A bit late but it might be useful for people googling...
There is a specialized API for dealing with RSS and Atom feeds in Java. It's called Rome, can be found here :
http://java.net/projects/rome/
It is really quite useful: it makes it easy to read a feed, whatever the RSS or Atom version. You can also build feeds and generate the XML with it, though I have no experience with this feature.
Here is a simple example that reads a feed and prints out the description nodes of all the entries in the feed :
URL feedSource = new URL("http://....");
SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedSource));
List<SyndEntryImpl> entries = (List<SyndEntryImpl>)feed.getEntries();
for(SyndEntryImpl entry : entries){
System.out.println(entry.getDescription().getValue());
}
Simple enough.

How do you traverse and store XML in Blackberry Java app?

I'm having a problem accessing the contents of an XML document.
My goal is this:
Take an XML source and parse it into a fair equivalent of an associative array, then store it as a persistable object.
the xml is pretty simple:
<root>
<element>
<category_id>1</category_id>
<name>Cars</name>
</element>
<element>
<category_id>2</category_id>
<name>Boats</name>
</element>
</root>
Basic java class below. I'm pretty much just calling save(xml) after http response above. Yes, the xml is properly formatted.
import java.io.IOException;
import java.util.Hashtable;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.util.Vector;
import net.rim.device.api.system.PersistentObject;
import net.rim.device.api.system.PersistentStore;
import net.rim.device.api.xml.parsers.DocumentBuilder;
import net.rim.device.api.xml.parsers.DocumentBuilderFactory;
public class database{
private static PersistentObject storeVenue;
static final long key = 0x2ba5f8081f7ef332L;
public Hashtable hashtable;
public Vector venue_list;
String _node,_element;
public database()
{
storeVenue = PersistentStore.getPersistentObject(key);
}
public void save(Document xml)
{
venue_list = new Vector();
storeVenue.setContents(venue_list);
Hashtable categories = new Hashtable();
try{
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory. newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
docBuilder.isValidating();
xml.getDocumentElement ().normalize ();
NodeList list=xml.getElementsByTagName("*");
_node=new String();
_element = new String();
for (int i=0;i<list.getLength();i++){
Node value=list.item(i).getChildNodes().item(0);
_node=list.item(i).getNodeName();
_element=value.getNodeValue();
categories.put(_element, _node);
}
}
catch (Exception e){
System.out.println(e.toString());
}
venue_list.addElement(categories);
storeVenue.commit();
}
}
The code above is the work in progress, and is most likely heavily flawed. However, I have been at this for days now. I can never seem to get all child nodes, or the name / value pair.
When I print out the vector as a string, I usually end up with results like this:
[{ = root, = element}]
and that's it. No "category_id", no "name"
Ideally, I would end up with something like
[{1 = cars, 2 = boats}]
Any help is appreciated.
Thanks

Here's a fixed version of your program. The changes I made are as follows:
I removed the DocumentBuilder code from the save() method. Those calls are needed to construct a new Document. Once you have such an object (and you do, since it is passed in as an argument), you don't need the DocumentBuilder anymore. A proper use of DocumentBuilder is illustrated in the main method below.
_node and _element need not be fields. They get new values with each pass through the loop inside save, so I made them local variables. In addition, I changed their names to category and name to reflect their association with the elements in the XML document.
There's never a need to create a new String object by using new String(). A simple "" is enough (see the initialization of the category and name variables).
Instead of looping over everything (via "*"), the loop now iterates over the element elements. There is also an inner loop that iterates over the children of each element, namely its category_id and name elements.
In each pass of the inner loop, we set either the category or the name variable, depending on the name of the node at hand.
The actual value assigned to these variables is obtained via node.getTextContent(), which returns the text between the node's enclosing tags.
class database:
public class database {
private static PersistentObject storeVenue;
static final long key = 0x2ba5f8081f7ef332L;
public Hashtable hashtable;
public Vector venue_list;
public database() {
storeVenue = PersistentStore.getPersistentObject(key);
}
public void save(Document xml) {
venue_list = new Vector();
storeVenue.setContents(venue_list);
Hashtable categories = new Hashtable();
try {
xml.getDocumentElement().normalize();
NodeList list = xml.getElementsByTagName("element");
for (int i = 0; i < list.getLength(); i++) {
String category = "";
String name = "";
NodeList children = list.item(i).getChildNodes();
for(int j = 0; j < children.getLength(); ++j)
{
Node n = children.item(j);
if("category_id".equals(n.getNodeName()))
category = n.getTextContent();
else if("name".equals(n.getNodeName()))
name = n.getTextContent();
}
categories.put(category, name);
System.out.println("category=" + category + "; name=" + name);
}
} catch (Exception e) {
System.out.println(e.toString());
}
venue_list.addElement(categories);
storeVenue.commit();
}
}
Here's a main method:
public static void main(String[] args) throws Exception {
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory
.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
docBuilder.isValidating();
Document xml = docBuilder.parse(new File("input.xml"));
database db = new database();
db.save(xml);
}

Thank you so much. With only slight modification I was able to do exactly what I was looking for.
Here are the modifications I had to do:
Even though I am building in 1.5, getTextContent was not available. I had to use category = n.getFirstChild().getNodeValue(); to obtain the value of each node. Though there may have been a simple solution like updating my build settings, I am not familiar enough with BB requirements to know when it is safe to stray from the default recommended build settings.
In the main, I had to alter this line:
Document xml = docBuilder.parse(new File("input.xml"));
so that it was reading from an InputStream delivered from my web server, and not necessarily a local file - even though I wonder if storing the xml local would be more efficient than storing a vector full of hash tables.
...
InputStream responseData = connection.openInputStream();
Document xmlParsed = docBuilder.parse(responseData);
Obviously I skipped over the HTTP connection portion for the sake of keeping this readable.
Your help has saved me a full weekend of blind debugging. Thank you very much! Hopefully this post will help someone else as well.
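For platforms where getTextContent is missing, a small null-safe helper can stand in for it. The sketch below is written against the standard org.w3c.dom interfaces and has not been tried on a BlackBerry device; NodeText and textOf are hypothetical names. It uses StringBuffer on the assumption that StringBuilder may not exist on older runtimes.

```java
import org.w3c.dom.Node;

// Hypothetical helper, not part of any library.
final class NodeText {

    /** Concatenates the direct text and CDATA children of a node; returns "" if there are none. */
    static String textOf(Node node) {
        // StringBuffer rather than StringBuilder, in case the target runtime lacks the latter.
        StringBuffer buffer = new StringBuffer();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                String value = child.getNodeValue();
                if (value != null) {
                    buffer.append(value);
                }
            }
        }
        return buffer.toString();
    }
}
```

Unlike n.getFirstChild().getNodeValue(), this does not throw a NullPointerException for an empty element such as <name></name>, which has no first child.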

//res/xml/input.xml
private static String _xmlFileName = "/xml/input.xml";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputStream inputStream = getClass().getResourceAsStream( _xmlFileName );
Document document = builder.parse( inputStream );
