Handling Empty Nodes Using Java DOM - java

I have a question concerning XML, Java's use of DOM, and empty nodes. I am currently working on a project wherein I take an XML descriptor file of abstract machines (for text parsing) and parse a series of input strings with them. The actual building and interpretation of these abstract machines is all done and working fine, but I have come across a rather interesting XML requirement. Specifically, I need to be able to turn an empty InputString node into an empty string ("") and still execute my parsing routines. The problem, however, occurs when I attempt to extract this blank node from my XML tree. This causes a null pointer exception and then generally bad things start happening. Here is the offending snippet of XML (Note the first element is empty):
<InputStringList>
<InputString></InputString>
<InputString>000</InputString>
<InputString>111</InputString>
<InputString>01001</InputString>
<InputString>1011011</InputString>
<InputString>1011000</InputString>
<InputString>01010</InputString>
<InputString>1010101110</InputString>
</InputStringList>
I extract my strings from the list using:
//Get input strings to be validated
xmlElement = (Element)xmlMachine.getElementsByTagName(XML_INPUT_STRING_LIST).item(0);
xmlNodeList = xmlElement.getElementsByTagName(XML_INPUT_STRING);
for (int j = 0; j < xmlNodeList.getLength(); j++) {
//Add input string to list
if (xmlNodeList.item(j).getFirstChild().getNodeValue() != null) {
arrInputStrings.add(xmlNodeList.item(j).getFirstChild().getNodeValue());
} else {
arrInputStrings.add("");
}
}
How should I handle this empty case? I have found a lot of information on removing blank text nodes, but I still actually have to parse the blank nodes as empty strings. Ideally, I would like to avoid using a special character to denote a blank string.
Thank you in advance for your time.

if (xmlNodeList.item(j).getFirstChild().getNodeValue() != null) {
nodeValue shouldn't be null; it would be firstChild itself that might be null and should be checked for:
Node firstChild= xmlNodeList.item(j).getFirstChild();
arrInputStrings.add(firstChild==null? "" : firstChild.getNodeValue());
However note that this is still sensitive to the content being only one text node. If you had an element with another element in, or some text and a CDATA section, just getting the value of the first child isn't enough to read the whole text.
What you really want is the textContent property from DOM Level 3 Core, which will give you all the text inside the element, however contained.
arrInputStrings.add(xmlNodeList.item(j).getTextContent());
This is available in Java 1.5 onwards.
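To make the getTextContent approach concrete, here is a minimal, self-contained sketch. The class name EmptyNodeDemo and the method readInputStrings are made up for illustration, and the XML is passed in as a string rather than read from the asker's descriptor file. It relies on getTextContent returning an empty string for an element with no children, so no null check is needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original project.
class EmptyNodeDemo {

    // Collects the text of every <InputString> element.
    // An empty element contributes "" instead of causing a NullPointerException.
    static List<String> readInputStrings(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("InputString");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<InputStringList><InputString></InputString>"
                + "<InputString>000</InputString></InputStringList>";
        System.out.println(readInputStrings(xml));
    }
}
```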

You could use a library like jOOX to generally simplify standard DOM manipulation. With jOOX, you'd get the list of strings as such:
List<String> strings = $(xmlMachine).find(XML_INPUT_STRING_LIST)
.find(XML_INPUT_STRING)
.texts();

Related

Conditional Java Parsing - Getting Child Node Contents

I am having some problems with what should be simple DOM parsing. I have checked over numerous questions and so far nothing has helped my situation. The problem is that I have some conditional nodes that may appear in an XML or may not appear. The tool that I have created must save the contents of these values into ArrayLists to be used later. Here is the XML in question:
<Dbtr>
<PstlAdr>
<AdrLine>111 Arlington Ave</AdrLine>
<AdrLine>Apartment A</AdrLine>
<AdrLine>Augusta, AZ 11100</AdrLine>
</PstlAdr>
</Dbtr>
Specifically, the Dbtr tag may appear any number of times in an XML. For each Dbtr tag there may be between 1-4 AdrLine children. I need to be able to save the value of each AdrLine and if there is no value then save a blank "" value into the array list for each.
To do this I wrote the following code:
NodeList Dbtr = doc.getElementsByTagName("Dbtr");
for(int i = 0; i < Dbtr.getLength(); i++){
NodeList DbtrChildren = Dbtr.item(i).getChildNodes();
if(DbtrChildren.getLength()==1){
//Add the first child.
}else if(DbtrChildren.getLength()==2){
//Add the first & second child.
}else if(DbtrChildren.getLength()==3){
System.out.println("Test Flag");
System.out.println(DbtrChildren.item(0).getNodeValue()+"Node Value");
System.out.println(DbtrChildren.item(0).getAttributes()+"Text Attributes");
System.out.println(DbtrChildren.item(0).getTextContent()+"Text Content");
}else if(DbtrChildren.getLength()==4){
//Add all 4 children.
}
}
So depending on the number of children AdrLine nodes the values will either be saved into an array list or else a blank value will be saved.
The problem is that no matter what I do I get blank values for the children. I can clearly see during testing that the Dbtr tag does in fact have 3 children. As you can see I tried to do some debugging to figure out some way to get the values. See results below:
Test Flag
Node Value
nullText Attributes
Text Content
So I'm getting a large amount of whitespace but no value. Of course I considered that perhaps it was actually picking up "PstlAdr" but then why would it successfully detect 3 child nodes?
Any help is GREATLY appreciated.
I figured out the solution to my problem.
In the above example I was getting back "3" for the length of DbtrChildren, which led me to believe that the three child elements were the AdrLine tags.
In reality the three children were #text, PstlAdr and #text (the #text nodes apparently represent the whitespace, i.e. the "\n" line breaks, around the element).
So I was trying to get values from a tag without values. Once I tried
for(int i = 0; i < Dbtr.getLength(); i++){
NodeList Dbtr2 = Dbtr.item(i).getChildNodes();
Node Dbtr3 = Dbtr2.item(1);
System.out.println(Dbtr3.getNodeName());
}
I got back "PstlAdr" as the result, so now I know I only need to go one level further, into PstlAdr, to fix my problem.
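For anyone hitting the same thing, here is a rough sketch of one way to avoid counting whitespace text nodes altogether: ask each Dbtr element for its AdrLine descendants by tag name instead of walking getChildNodes(). The class name AdrLineDemo, the method readAddressLines, and the MAX_LINES constant are made up for this example; the padding to four entries follows the "1-4 AdrLine children" rule from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
class AdrLineDemo {

    // Assumed upper bound on AdrLine elements per Dbtr (from the question).
    static final int MAX_LINES = 4;

    // Returns one list per <Dbtr>, padded with "" up to MAX_LINES entries.
    static List<List<String>> readAddressLines(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList debtors = doc.getElementsByTagName("Dbtr");
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < debtors.getLength(); i++) {
            Element dbtr = (Element) debtors.item(i);
            // Only element nodes named AdrLine are returned here,
            // so whitespace #text nodes never enter the picture.
            NodeList lines = dbtr.getElementsByTagName("AdrLine");
            List<String> values = new ArrayList<>();
            for (int j = 0; j < MAX_LINES; j++) {
                values.add(j < lines.getLength() ? lines.item(j).getTextContent() : "");
            }
            result.add(values);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Dbtr>\n<PstlAdr>\n<AdrLine>111 Arlington Ave</AdrLine>\n"
                + "<AdrLine>Apartment A</AdrLine>\n</PstlAdr>\n</Dbtr>";
        System.out.println(readAddressLines(xml));
    }
}
```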

Why is the output of my XML file appending a '/>'

Here is my dilemma:
I have an XML document into which I want to insert an animation_sequence element. However, the output instead contains animation_sequence/> preceded by an opening angle bracket. I can add all other elements, but not that one. Why is that? I tried adding the XML here but it wouldn't render. Here is my code:
Element state = testDoc.createElement("state");
state.setTextContent(element);
Element animationState = testDoc.createElement("animation_state");
Element sequence = testDoc.createElement("animation_sequence");
testDoc.getElementsByTagName("animations_list").item(0).appendChild(animationState).appendChild(state);
testDoc.getElementsByTagName("animation_state").item(testDoc.getElementsByTagName("state").getLength() - 1).appendChild(sequence);
The code you have shown us creates nodes in a tree. It doesn't append any angle brackets to anything. Angle brackets only appear when you serialize the tree (convert it to lexical XML). Generally the system takes care of how to serialize the XML, and you don't need to worry when it chooses between different ways of serializing it because when the XML is parsed the differences won't matter.
Now it could be that the "/>" is a symptom that the tree you have built isn't the tree that you intended to build, but that's a different matter.
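To illustrate the point about serialization, here is a small sketch (the class name EmptyElementSerialization and its methods are invented for this example). It builds a tree containing an element with no children and serializes it with the JDK's identity Transformer. A serializer is free to write such an element in the short self-closing form, and a parser treats that form and an explicit start/end tag pair identically.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; names are illustrative only.
class EmptyElementSerialization {

    // Builds <animation_state> containing a childless <animation_sequence>.
    static Document buildDoc() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element state = doc.createElement("animation_state");
        state.appendChild(doc.createElement("animation_sequence")); // no children, no text
        doc.appendChild(state);
        return doc;
    }

    // Converts the tree to lexical XML; this is the step where angle brackets appear.
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(buildDoc()));
    }
}
```

If the element is supposed to have content, the thing to check is whether anything was ever appended to it, not how the serializer chose to write it.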

How to getChildText() of a node with a namespace, when there are multiple namespaces in the XML?

I want to use getChildText() to get text from a node that is a few levels deep. There are two namespaces in the file. The syntax below does not work and sets textToGet to null.
String textToGet = root.getChildText("ns1:Customer/ns1:Address/ns1:Street/ns2:Streetname");
I know there is an alternative of first getting the Child Element, and then its Text, but I want to use a one-liner.
Also, would rather not chain getChild(), because some of the elements are not guaranteed to be in the file.
You are not going to be able to make that a one-liner....
Consider using XPaths.... JDOM 2.x should help with that:
XPathExpression<String> xpe = XPathFactory.instance().compile(
Filters.fstring(), "ns1:Customer/ns1:Address/ns1:Street/ns2:Streetname",
null, namespace_ns1, namespace_ns2);
String textToGet = xpe.evaluateFirst(root);
(textToGet may be null)
Edit: the XPath expression above actually returns an element... you should add "/text()" to the end of the XPath, or change textToGet to be an Element (and the Filters too).
Rolf
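For comparison only, here is a sketch of the same idea using the JDK's built-in javax.xml.xpath API on a W3C DOM instead of JDOM, so it is a swapped-in technique rather than the code from the answer above. The namespace URIs, the class name NamespacedXPathDemo, and the method streetName are assumptions made up for the example; the real URIs must match the ones declared in your file. Note that XPath.evaluate returning a String gives "" rather than null when nothing matches, which sidesteps the concern about missing intermediate elements.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the namespace URIs below are placeholders.
class NamespacedXPathDemo {

    static final String NS1 = "urn:example:ns1"; // assumed URI for prefix ns1
    static final String NS2 = "urn:example:ns2"; // assumed URI for prefix ns2

    // Maps the prefixes used in the XPath expression to namespace URIs.
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            if ("ns1".equals(prefix)) return NS1;
            if ("ns2".equals(prefix)) return NS2;
            return XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) { return null; }
        @Override public Iterator<String> getPrefixes(String uri) {
            return Collections.emptyIterator();
        }
    };

    // Returns the street name, or "" if any element on the path is missing.
    static String streetName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for prefixed XPath steps to match
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(CONTEXT);
        return xpath.evaluate(
                "ns1:Customer/ns1:Address/ns1:Street/ns2:Streetname",
                doc.getDocumentElement());
    }
}
```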

Serialize a Document object in Java, while preserving the formatting of arbitrary elements

I am using the function below to convert a DOM Document object into a String in Java.
public static String convertDocumentToString(final Document doc) {
final DOMImplementationLS domImplementation = (DOMImplementationLS) doc.getImplementation();
final LSSerializer lsSerializer = domImplementation.createLSSerializer();
lsSerializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
final String xml = lsSerializer.writeToString(doc);
return xml;
}
This works well most of the time, but there are some specific elements that I don't want formatted (e.g. the screen DocBook element). So I have two questions:
Is there a way to skip certain elements when formatting XML in Java like in the code above?
If not, is there another way to convert a Document to a String while preserving the layout of arbitrary elements?
Note that I have also used the Transformer in the past (see Getting xml string from Document in Java), but that didn't preserve CDATA sections.
Update:
Just so I am clear, I am deserializing and serializing XML in order to create a Document object that can be edited programmatically via a DOM, with the serialization process preferably "pretty printing" the resulting XML (with the exception of some arbitrary elements).
Update 2:
In the end I created a custom function to convert a Node to a String with optional formatting. See the convertNodeToString function at https://sourceforge.net/p/commonclasses/code/110/tree/trunk/src/com/redhat/ecs/commonutils/XMLUtilities.java called like so:
final String exampleXml = FileUtilities.readFileContents(new File("test.xml"));
final ArrayList<String> contentsInlineElements = new ArrayList<String>();
contentsInlineElements.add("title");
contentsInlineElements.add("term");
final ArrayList<String> inlineElements = new ArrayList<String>();
inlineElements.add("prompt");
inlineElements.add("command");
inlineElements.add("firstterm");
inlineElements.add("ulink");
inlineElements.add("guilabel");
inlineElements.add("filename");
inlineElements.add("replaceable");
inlineElements.add("parameter");
inlineElements.add("literal");
inlineElements.add("classname");
inlineElements.add("sgmltag");
inlineElements.add("guibutton");
inlineElements.add("guimenuitem");
inlineElements.add("guimenu");
inlineElements.add("menuchoice");
inlineElements.add("citetitle");
final ArrayList<String> verbatimElements = new ArrayList<String>();
verbatimElements.add("screen");
verbatimElements.add("programlisting");
final Document doc = XMLUtilities.convertStringToDocument(exampleXml);
final String formattedXml = XMLUtilities.convertNodeToString(doc.getDocumentElement(), true, false, false, verbatimElements, inlineElements, contentsInlineElements, true, 1, 0);
Serialization is designed to get data across a transport medium, but not necessarily (or even usually) in a way that is true to the form of the input data, if that form is by definition not carrying any extra information (as is the case with XML documents).
If you need to carry over the design, too, you will have to encode this "meta" information (i. e. the formatting) into the data itself, for example by escaping whitespace etc. Maybe the easiest solution, but one that will keep you from simply "reading" (as in with your eyes) the transport stream, is to encode your formatted data in something like Base64. This will perfectly transport inside an XML wrapper, while at the same time conserving the fidelity of the original input data you fed into the encoder.
On the other side, of course, you will have to decode the data again, before you can go on processing it further.
The short answer: you can't. When you tell the serializer to pretty-print, you're making a statement about the use of inter-element whitespace (ie, it's ignorable).
The longer answer: you can't without modifying the DOM (or a copy of it). IMO the simplest way is the following:
Identify the node that you want to preserve. I'll assume that you have an ID or some other way to select it using XPath.
Call Document.adoptNode() to move that node into a new DOM. I recall having some issues with this method, but that was many years ago. If it doesn't work, use Document.importNode() and explicitly remove the node from the source document. I believe that you can adopt a node as the root of a document, but can't guarantee that.
Insert a text node into the original document, containing unique content. An easy way to generate unique content is UUID.randomUUID().toString().
Convert both documents to strings, pretty-printing one and not pretty-printing the other.
Use String.replace() to insert the not-pretty-printed document into the pretty-printed document.
And, as always, if you're planning to write those strings to a file or other byte-oriented format, you must explicitly encode as UTF-8.
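The placeholder steps above can be sketched roughly as follows. This is a simplified variant, not a tested production routine: instead of adopting the node into a second document, it serializes the element in place without pretty-printing, temporarily swaps it for a text node holding a UUID, pretty-prints the document, restores the element, and then splices the raw markup back in with String.replace(). The class name SelectivePrettyPrint and its methods are invented for the example, and it assumes the DOM implementation supports DOMImplementationLS (as in the question's code).

```java
import java.io.StringReader;
import java.util.UUID;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper class; names are illustrative only.
class SelectivePrettyPrint {

    // Serializes a node with or without pretty-printing, omitting the XML declaration.
    static String serialize(Node node, boolean pretty) {
        Document owner = (node instanceof Document) ? (Document) node : node.getOwnerDocument();
        DOMImplementationLS ls = (DOMImplementationLS) owner.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        serializer.getDomConfig().setParameter("format-pretty-print", pretty);
        serializer.getDomConfig().setParameter("xml-declaration", false);
        return serializer.writeToString(node);
    }

    // Pretty-prints doc but keeps the markup of one element exactly as serialized raw.
    static String prettyPrintExcept(Document doc, Element verbatim) {
        String raw = serialize(verbatim, false);
        String token = UUID.randomUUID().toString();
        Node parent = verbatim.getParentNode();
        Text placeholder = doc.createTextNode(token);
        parent.replaceChild(placeholder, verbatim);   // swap the element out
        String pretty = serialize(doc, true);
        parent.replaceChild(verbatim, placeholder);   // restore the original tree
        return pretty.replace(token, raw);
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                        "<section><para>text</para><screen>a   b</screen></section>")));
        Element screen = (Element) doc.getElementsByTagName("screen").item(0);
        System.out.println(prettyPrintExcept(doc, screen));
    }
}
```

The exact indentation around the spliced element depends on how the serializer treats the placeholder text node, so the surrounding whitespace may vary between JDK versions.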
Whitespace is not significant in XML documents other than in CDATA sections, and none of the standard tools is designed to preserve it. Any requirement to the contrary is ill-formed.

How to get some xml that comes before and a little from after a DOM Node

I am using java and I am pretty open to using w3c DOM or DOM4J at this point.
So let's say I have a Node, such as a text node, in which I have found something interesting, for example an occurrence of a substring in the node's text. If I want to get a string with a number of characters preceding that node and a few characters after that node, how may I do that? Basically I need to be able to display a snippet of the original XML around the occurrence of that string.
The problem I have with getting the parent node, for example, and then calling asXML is that I no longer know the exact location of the substring in the text node. If I search again for that string value in the parent's XML then I may find 2 occurrences, or many more if the parent has other children that contain an occurrence of that string.
Much appreciation if any one can answer this question.
I haven't done anything with the DOM from Java in ages, so take this as pseudocode, not Java.
Basically, it boils down to something like this:
Node parent = node.getParentNode();
NodeList children = parent.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
    Node child = children.item(i);
    if (child == node) {
        // Do something different with the matched node
    } else {
        // Do something with child
    }
}
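Fleshing that loop out a little, here is one possible sketch using the W3C DOM. It rebuilds the parent's content child by child, records where the matched text node starts in the rebuilt string, and then cuts a window around the match. The class name SnippetAroundMatch, the method names, and the parameters are all invented for this example. As a simplification, text nodes are appended unescaped, so characters such as & will appear in decoded form rather than as in the original file.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class; names are illustrative only.
class SnippetAroundMatch {

    // Text nodes are returned as-is (unescaped); other nodes are serialized as XML.
    static String toXml(Node node) throws Exception {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue();
        }
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Returns up to `context` characters before and after the match.
    // offsetInNode is the index of the match inside textNode's own value.
    static String snippet(Node textNode, int offsetInNode, int matchLength, int context)
            throws Exception {
        StringBuilder sb = new StringBuilder();
        int matchStart = -1;
        NodeList children = textNode.getParentNode().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.isSameNode(textNode)) {
                // Remember where the match lands in the combined string.
                matchStart = sb.length() + offsetInNode;
            }
            sb.append(toXml(child));
        }
        int from = Math.max(0, matchStart - context);
        int to = Math.min(sb.length(), matchStart + matchLength + context);
        return sb.substring(from, to);
    }
}
```

Because the match position is tracked through the specific node rather than found by searching the parent's XML again, other occurrences of the same substring in sibling nodes do not cause ambiguity.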
