What does the Java Node normalize() method do?

I'm doing some tests, but I see no difference when I use or not the normalize() method.
But the examples at ExampleDepot website use it.
So, what is it for? (The documentation wasn't clear for me either)

You can programmatically build a DOM tree that has extraneous structure not corresponding to actual XML structures - specifically things like multiple nodes of type text next to each other, or empty nodes of type text. The normalize() method removes these, i.e. it combines adjacent text nodes and removes empty ones.
This can be useful when you have other code that expects DOM trees to always look like something built from an actual XML document.
This basically means that the following XML element
<foo>Hello world</foo>
could be represented like this in a denormalized node:
Element foo
Text node: ""
Text node: "Hello "
Text node: "wor"
Text node: "ld"
When normalized, the node will look like this
Element foo
Text node: "Hello world"
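To see this for yourself, here is a minimal sketch that builds the denormalized tree above by hand and then calls normalize(). The class and method names (NormalizeDemo, buildDenormalized) are made up for the illustration; only the org.w3c.dom calls are standard API.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class NormalizeDemo {
    // Builds <foo> with one empty text node followed by three adjacent text nodes.
    // (Hypothetical helper, not part of any library.)
    static Element buildDenormalized() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element foo = doc.createElement("foo");
        doc.appendChild(foo);
        foo.appendChild(doc.createTextNode(""));
        foo.appendChild(doc.createTextNode("Hello "));
        foo.appendChild(doc.createTextNode("wor"));
        foo.appendChild(doc.createTextNode("ld"));
        return foo;
    }

    public static void main(String[] args) throws Exception {
        Element foo = buildDenormalized();
        System.out.println("child nodes before: " + foo.getChildNodes().getLength());

        // Merges adjacent text nodes and drops empty ones, for this element and its descendants.
        foo.normalize();

        System.out.println("child nodes after:  " + foo.getChildNodes().getLength());
        System.out.println("text: " + foo.getFirstChild().getNodeValue());
    }
}
```

A tree that comes straight out of a parser usually has no such fragments, which is probably why the tests in the question showed no difference.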

It cleans up the tree by merging adjacent Text nodes and removing empty Text nodes.

There are a lot of possible DOM trees that correspond to the same XML structure, and each XML structure has at least one corresponding DOM tree. So the conversion from DOM to XML is surjective.
So it may happen that:
dom_tree_1 != dom_tree_2
# but:
dom_tree_1.save_DOM_as_XML() == dom_tree_2.save_DOM_as_XML()
And there is no way to ensure:
dom_tree == dom_tree.save_DOM_as_XML().load_DOM_from_XML()
But we would like to have it bijective. That means each XML structure corresponds to one particular DOM tree.
So you can define a subset of all possible DOM trees that is bijective to the set of all possible XML structures.
# still:
dom_tree.save_DOM_as_XML() == dom_tree.normalized().save_DOM_as_XML()
# but with:
dom_tree_n = dom_tree.normalized()
# we now even have:
dom_tree_n == dom_tree_n.save_DOM_as_XML().load_DOM_from_XML().normalized()
So normalized DOM trees can be perfectly reconstructed from their XML representation. There is no information loss.
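The pseudocode above can be tried out in Java roughly as follows. This is only a sketch under some assumptions: save and load are hypothetical helpers built on the standard Transformer and DocumentBuilder, and "same tree" is approximated by counting the child nodes of the root element rather than by a full structural comparison.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class RoundTripDemo {
    static DocumentBuilder builder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // A document whose root element holds three adjacent text nodes.
    static Document denormalized() throws Exception {
        Document doc = builder().newDocument();
        Element foo = doc.createElement("foo");
        doc.appendChild(foo);
        for (String part : new String[] {"Hello ", "wor", "ld"}) {
            foo.appendChild(doc.createTextNode(part));
        }
        return doc;
    }

    // Hypothetical stand-in for save_DOM_as_XML().
    static String save(Document doc) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Hypothetical stand-in for load_DOM_from_XML().
    static Document load(String xml) throws Exception {
        return builder().parse(new InputSource(new StringReader(xml)));
    }

    static int rootChildren(Document doc) {
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        Document tree = denormalized();
        String xmlBefore = save(tree);
        System.out.println("original children: " + rootChildren(tree));
        System.out.println("reloaded children: " + rootChildren(load(xmlBefore)));

        tree.getDocumentElement().normalize();
        System.out.println("same XML after normalize: " + save(tree).equals(xmlBefore));
        System.out.println("normalized children: " + rootChildren(tree));
    }
}
```

The serialized XML is identical before and after normalize(), but only the normalized tree has the same shape as the one the parser hands back.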

Normalize the root element of the XML document. This ensures that all Text nodes under the root node are put into a "normal" form, which means that there are neither adjacent Text nodes nor empty Text nodes in the document.

Related

Why am I getting no spaces between text node values?

I am using an XPath expression to get text nodes from an XML document like the one below:
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
<proj>
<under>E01</under>
<under>E02</under>
</proj>
<name>John Doe</name>
<gender>male</gender>
</emp>
</company>
I have written the following XPath expression to get the text values:
normalize-space(string(//emp))
It is extracting the correct values and the output is like below:
Acct1000E01E02John Doemale
Notice that there are no spaces between the text node values from different nodes.
I actually want the output value to be in this way:
`Acct 1000 E01 E02 John Doe`
I have used javax.xml.xpath to parse and build the tree as follows:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(new File("/employees.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "normalize-space(string(//emp))";
String output = (String) xpath.compile(expression).evaluate(document, XPathConstants.STRING);
I am using Java SE 10 here, so the XPath version is 1.0.
Is there a better way to extract the text values?
I am pretty new to XPath so any suggestions would be helpful.
You are almost right here.
Picking the not operator is the right way to go.
It should be something like this:
/html/body/company/emp/*[not(self::gender)]
That is, all child nodes of emp except the gender node.
Here is a full example in JavaScript:
let xpathExpression = '/html/body/company/emp/*[not(self::gender)]';
let contextNode = window.document;
let xpathResult = document.evaluate(xpathExpression, contextNode,
null, XPathResult.ANY_TYPE, null);
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
Oh dear, this one is complicated...
First of all, you haven't tagged your question with an XPath version. Usually people who aren't aware of XPath versions are using the ancient version 1.0, so I'll make that assumption: sorry if it's wrong.
In XPath 1.0, a function that is given a node-set and that expects a string uses the string value of the first node in the node-set, taken in document order.
In your query
normalize-space(string(//emp))
//emp selects a node-set, which happens to contain a single node, so string() takes the string value of that node. The string value of an element node is the concatenation of all its descendant text nodes. The normalize-space function removes leading and trailing whitespace, and collapses each internal run of whitespace to a single space character.
You have shown your XML in indented form as
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
etc, so it's reasonable to expect that the whitespace between elements forms part of the string value of the <emp> element. But you haven't told us how the document was parsed and turned into a node tree. Parsers often provide multiple options on how to do this, in particular, on how to handle the whitespace between element nodes. Most retain the whitespace by default, unless perhaps there is a schema or DTD that tells the parser that the whitespace is insignificant. Microsoft's MSXML parser, notoriously, drops the whitespace by default, which causes considerable problems when you are using XML to represent narrative documents, but actually makes life easier for people using XML for this kind of non-document data.
Your parser, for one reason or another (we can't tell) seems to have deleted the whitespace between element nodes. No XPath query is going to bring it back again. You may have options when building the document to retain the whitespace; that depends on the tools you are using.
Your second question asks about dropping one of the elements in the input. That's beyond the scope of XPath. XPath can only select nodes from the input, it cannot modify them in any way. To modify the tree, you need XSLT or XQuery.
Your attempt to solve the problem with //emp[not(descendant::gender)] is hopelessly doomed because this will only select employees that have no descendant element named gender. You appear to be guessing the semantics rather than using a specification or tutorial.
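Since XPath 1.0 can only select nodes, one possible workaround (not something from the answer above, just a sketch) is to select the individual text nodes with XPath and do the joining in Java. The class name EmpText, the helper joinText, and the choice of a single space as separator are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EmpText {
    static final String XML =
            "<company>\n<emp>\n<dept>Acct</dept>\n<salary>1000</salary>\n"
          + "<proj>\n<under>E01</under>\n<under>E02</under>\n</proj>\n"
          + "<name>John Doe</name>\n<gender>male</gender>\n</emp>\n</company>";

    // Selects every non-blank text node under emp that is not inside gender,
    // then joins the values with a single space. (Hypothetical helper.)
    static String joinText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList texts = (NodeList) xpath.evaluate(
                "//emp//text()[normalize-space() != '' and not(ancestor::gender)]",
                doc, XPathConstants.NODESET);

        StringJoiner joined = new StringJoiner(" ");
        for (int i = 0; i < texts.getLength(); i++) {
            joined.add(texts.item(i).getNodeValue().trim());
        }
        return joined.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(joinText(XML)); // prints: Acct 1000 E01 E02 John Doe
    }
}
```

Because the separator is added in Java, this gives the same result whether or not the parser kept the whitespace between elements.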

Can the Java streaming XML parser distinguish empty element from self-closing empty element?

Can the Java streaming XML parser, i.e. javax.xml.stream.XMLEventReader distinguish an empty element
<document>
<empty></empty>
</document>
from a self-closing empty element?
<document>
<empty/>
</document>
Let's suppose we parse both of the above xml fragments and print the eventType and the event itself, just like this:
System.out.println("eventType:" + event.getEventType() + "; element:"+event.toString());
Both of the above fragments will produce the exact same result:
eventType:7; element:<?xml version="null" encoding='null' standalone='no'?>
eventType:1; element:<document>
eventType:4; element:
eventType:1; element:<empty>
eventType:2; element:</empty>
eventType:2; element:</document>
eventType:8; element:ENDDOCUMENT
Just to give some context, what we want to achieve is, we want to rewrite some parts of the xml based on some rules, but want to preserve other parts exactly as they are, that is, we want to keep empty elements in their original form, even though the two forms are semantically the same. If we have a normal empty element (1st example), we want to keep it that way, if we have a self-closing empty element, we want to write a self-closing element in the result as well. Can we achieve this goal with javax.xml.stream.XMLEventReader?
The answer is no. Similarly, you can't preserve whitespace within a tag (e.g. newlines between attribute values, or spaces around the "=" sign). These are considered to be of no interest to applications, and are therefore not reported.
You could test whether the start-element and end-element events report the same location:
event.getLocation().getCharacterOffset();
From the javadoc
Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. Returns -1 if there is no offset available.
The offset is not guaranteed to be available; whether it is depends on your setup, so it is worth checking whether it works in yours. (Also, it can only represent offsets up to Integer.MAX_VALUE.)
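If you want to check what your StAX implementation reports, a small probe like the following might help. It is only a sketch: OffsetProbe and elementOffsets are made-up names, and the code deliberately prints the offsets instead of assuming what they will be, because the values are implementation-dependent.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

final class OffsetProbe {
    // Records the reported character offset of every start and end element event.
    static List<String> elementOffsets(String xml) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        List<String> lines = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            int offset = event.getLocation().getCharacterOffset(); // -1 if unavailable
            if (event.isStartElement()) {
                lines.add("start " + event.asStartElement().getName().getLocalPart() + " @ " + offset);
            } else if (event.isEndElement()) {
                lines.add("end   " + event.asEndElement().getName().getLocalPart() + " @ " + offset);
            }
        }
        reader.close();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("--- <empty></empty>");
        elementOffsets("<document><empty></empty></document>").forEach(System.out::println);
        System.out.println("--- <empty/>");
        elementOffsets("<document><empty/></document>").forEach(System.out::println);
    }
}
```

If the two runs print different offset patterns for the empty element, the trick may be usable with your parser; if they print the same pattern (or -1), it is not.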

Keep elements in text with XPath

I have some XML files, and each file contains some information, including a description of itself enclosed in the element <namespace:descr></namespace:descr>. This description will be inserted into an HTML web page and uploaded to the web.
The problem is that the description element contains other HTML elements, and I want to keep them there so that the text can be formatted, but XPath strips all those elements and returns only their text.
<namespace:descr>Some <i>nice</i> description</namespace:descr>
I tried variations on this XPath query: //*[local-name()='descr']
(I'm not really skilled with XPath)
Also tried something like //*[local-name()='descr']//*[not(descendant::*[self::p or self::i])] found in this answer, but it doesn't work for me.
So my question: is there some way to keep XML/HTML elements in text after using XPath query?
The return value of an XPath expression can either be a string, number, boolean or a node-set. Each of these types can be converted to one of the primitive types.
The expression //*[local-name()='descr'] returns a node-set but you then obviously convert it to a string which returns the concatenated text content of the first node in the node-set, stripping off all markup.
To print the content of the result node as markup you would need to do the following:
Retrieve the expression result as a node-set. The implementation type of the node-set depends on the XPath engine; it could, for instance, be a DOM NodeList.
Serialize the nodes as an XML fragment. How to do this depends on the node-set API and the XPath engine. XSLT could be used for that, but it may also be as simple as calling toString() on the node implementation.
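With the JDK's built-in XPath and DOM, the two steps might look like the sketch below. Some assumptions: the sample document, the namespace URI urn:example, and the names MarkupKeeper, serialize and descriptions are invented for the example, and an identity Transformer is used as the serializer (one option among several).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MarkupKeeper {
    // Sample input; the namespace URI is an assumption for the example.
    static final String XML =
            "<root xmlns:namespace=\"urn:example\">"
          + "<namespace:descr>Some <i>nice</i> description</namespace:descr>"
          + "</root>";

    // Step 2: serialize one node as markup, without an XML declaration.
    static String serialize(Node node) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Step 1: ask for a NODESET instead of a STRING, so the markup is not lost.
    static List<String> descriptions(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//*[local-name()='descr']", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(serialize(nodes.item(i)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        descriptions(XML).forEach(System.out::println);
    }
}
```

This serializes the descr element itself; to get only its content, the same serialize helper could be applied to each of its child nodes instead.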

Efficient way to search for an element name in DOM4j document

What is the most efficient way to search for an element?
Would it require traversing the complete DOM4j document? Should I use XPath here?
I am actually comparing two XML documents. I will iterate through the first XML document element by element and search for each one in the second XML document.
It is not a straightforward comparison. I would be comparing name attribute value with second xml's elements. And if first xml has any name such as name="xx.yy" then I need to look for <xx>
<yy></yy>
</xx> in second xml.
Maybe you could use Jsoup for this? I don't know what kind of comparison you are after, but with Jsoup you could simply select all nodes from both XML documents and iterate over both collections in one loop.
Jsoup is very effective and easy to use if you need to select an arbitrary node just by its attribute (any attribute), tag name, or content.

DOM parse xml file without converting entity references

I am writing a parser for an XML file which will contain special characters, for example
<name>You &amp; me &reg;</name>
The DOM parser will, by default, parse this value to "You & me ®".
However, the string I want is
You &amp; me &reg;
Is there any way I can do this?
Thanks
If you are using DOM for parsing, see the DocumentBuilderFactory.setExpandEntityReferences() method.
By default, this setting is true meaning that entities are expanded out automatically. If you turn this off, you will be able to read the entities from the DOM - in this case you won't just get one big text node from a parent element, but you will get text nodes interleaved with entity nodes.
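A quick way to see what your parser does with this setting is to list the child nodes of the element in both modes. The sketch below makes a few assumptions: reg is declared in an internal DTD subset (it is not predefined in XML, unlike amp), and EntityProbe and childKinds are made-up names. As far as I know, predefined entities such as &amp; are expanded regardless of the setting, and the exact node layout can differ between parsers, so the code prints what it finds rather than assuming it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EntityProbe {
    // The DTD declaration of "reg" is an assumption made for the example.
    static final String XML =
            "<!DOCTYPE name [<!ENTITY reg \"&#174;\">]>\n"
          + "<name>You &amp; me &reg;</name>";

    // Describes each child of <name> as either a text node or an entity reference.
    static List<String> childKinds(boolean expand) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setExpandEntityReferences(expand);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        NodeList kids = doc.getDocumentElement().getChildNodes();
        List<String> kinds = new ArrayList<>();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                kinds.add("entity-ref &" + kid.getNodeName() + ";");
            } else {
                kinds.add("text \"" + kid.getNodeValue() + "\"");
            }
        }
        return kinds;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("expand=true:  " + childKinds(true));
        System.out.println("expand=false: " + childKinds(false));
    }
}
```

If entity-ref entries show up with expand=false, the original reference text can be rebuilt from the node name while walking the children.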
