Keep elements in text with XPath - java

I have a set of XML files. Each file contains some information, including a description of itself enclosed in the element <namespace:description></namespace:description>. This description will be inserted into an HTML page and uploaded to the web.
The problem is that the description element contains other HTML elements, and I want to keep them so that the text stays formatted, but XPath strips all those elements and returns only their text.
<namespace:descr>Some <i>nice</i> description</namespace:descr>
I tried variations on this XPath query: //*[local-name()='descr']
(I'm not really skilled with XPath)
Also tried something like //*[local-name()='descr']//*[not(descendant::*[self::p or self::i])] found in this answer, but it doesn't work for me.
So my question: is there some way to keep the XML/HTML elements in the text when using an XPath query?

The return value of an XPath expression can either be a string, number, boolean or a node-set. Each of these types can be converted to one of the primitive types.
The expression //*[local-name()='descr'] returns a node-set but you then obviously convert it to a string which returns the concatenated text content of the first node in the node-set, stripping off all markup.
To print the content of the result node as markup you would need to do the following:
Retrieve the expression result as a node-set. The implementation type of the node-set depends on the XPath engine; for instance, it could be a DOM NodeList.
Serialize the nodes as an XML fragment. How to do this depends on the node-set API and the XPath engine. XSLT could be used for that, but it may also be as simple as calling toString() on the node implementation.
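The two steps above could look roughly like this with the JDK's own javax.xml.xpath and javax.xml.transform APIs. This is only a sketch: the class name DescrMarkup, the helper selectAsMarkup and the namespace URI urn:example are made up for illustration, and an identity Transformer is just one of several possible serializers.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DescrMarkup {

    // Hypothetical sample input; the namespace URI is an assumption.
    static final String SAMPLE =
            "<ns:descr xmlns:ns=\"urn:example\">Some <i>nice</i> description</ns:descr>";

    // Returns the markup of every node matched by the XPath expression.
    static String selectAsMarkup(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // lets local-name() see "descr"
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 1: ask for a node-set instead of a string.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, document, XPathConstants.NODESET);

            // Step 2: serialize each node with an identity transformer.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            for (int i = 0; i < nodes.getLength(); i++) {
                serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not select or serialize nodes", e);
        }
    }

    public static void main(String[] args) {
        // Prints the descr element including the nested <i> markup.
        System.out.println(selectAsMarkup(SAMPLE, "//*[local-name()='descr']"));
    }
}
```

The result still includes the wrapping descr element. Selecting the children instead (for example by appending /node() to the expression) should leave only the inner content, but I have not verified that on every JAXP implementation.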

Related

Why am I getting no spaces between text node values?

I am using an XPath expression to get text nodes from an XML document like the one below:
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
<proj>
<under>E01</under>
<under>E02</under>
</proj>
<name>John Doe</name>
<gender>male</gender>
</emp>
</company>
I have written the following XPATH expression to get the text values :
normalize-space(string(//emp))
It is extracting the correct values and the output is like below:
Acct1000E01E02John Doemale
Notice that there are no spaces between the text node values from different nodes.
I actually want the output value to be in this way:
`Acct 1000 E01 E02 John Doe`
I have used javax.xml.xpath to parse and build the tree as follows:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(new File("/employees.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "normalize-space(string(//emp))";
String output = (String) xpath.compile(expression).evaluate(document, XPathConstants.STRING);
I am using Java SE 10 here, so the XPath version is 1.0.
Is there a better way to extract the text values?
I am pretty new to XPath so any suggestions would be helpful.
You are almost right here.
Picking the not operator is the right way to go.
It should be something like this:
/html/body/company/emp/*[not(self::gender)]
That is, all child nodes of emp except the gender node.
Here is a full example in JavaScript:
let xpathExpression = '/html/body/company/emp/*[not(self::gender)]';
let contextNode = window.document;
let xpathResult = document.evaluate(xpathExpression, contextNode,
null, XPathResult.ANY_TYPE, null);
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
Oh dear, this one is complicated...
First of all, you haven't tagged your question with an XPath version. Usually people who aren't aware of XPath versions are using the ancient version 1.0, so I'll make that assumption: sorry if it's wrong.
In XPath 1.0, a function that is given a node-set and that expects a string uses the string value of the first node in the node-set, taken in document order.
In your query
normalize-space(string(//emp))
//emp selects a node-set, which happens to contain a single node, so string() takes the string value of that node. The string value of an element node is the concatenation of all its descendant text nodes. The normalize-space function removes leading and trailing whitespace, and collapses each internal run of whitespace to a single space character.
You have shown your XML in indented form as
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
etc, so it's reasonable to expect that the whitespace between elements forms part of the string value of the <emp> element. But you haven't told us how the document was parsed and turned into a node tree. Parsers often provide multiple options on how to do this, in particular, on how to handle the whitespace between element nodes. Most retain the whitespace by default, unless perhaps there is a schema or DTD that tells the parser that the whitespace is insignificant. Microsoft's MSXML parser, notoriously, drops the whitespace by default, which causes considerable problems when you are using XML to represent narrative documents, but actually makes life easier for people using XML for this kind of non-document data.
Your parser, for one reason or another (we can't tell) seems to have deleted the whitespace between element nodes. No XPath query is going to bring it back again. You may have options when building the document to retain the whitespace; that depends on the tools you are using.
Your second question asks about dropping one of the elements in the input. That's beyond the scope of XPath. XPath can only select nodes from the input, it cannot modify them in any way. To modify the tree, you need XSLT or XQuery.
Your attempt to solve the problem with //emp[not(descendant::gender)] is hopelessly doomed because this will only select employees that have no descendant element named gender. You appear to be guessing the semantics rather than using a specification or tutorial.
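For what it's worth, one workaround that stays within XPath 1.0 is to select the individual text nodes and leave the joining to Java, so the result no longer depends on whether the parser kept the whitespace between elements. This is a sketch rather than anything proposed above; the class name EmpText and the helper joinedText are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EmpText {

    // The document from the question, built inline for the example.
    static final String SAMPLE =
            "<company>\n<emp>\n<dept>Acct</dept>\n<salary>1000</salary>\n"
          + "<proj>\n<under>E01</under>\n<under>E02</under>\n</proj>\n"
          + "<name>John Doe</name>\n<gender>male</gender>\n</emp>\n</company>";

    // Selects the non-blank text nodes under emp, skipping anything inside
    // gender, and joins their trimmed values with single spaces.
    static String joinedText(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String expression =
                    "//emp//text()[normalize-space() and not(ancestor::gender)]";
            NodeList texts = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, document, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < texts.getLength(); i++) {
                values.add(texts.item(i).getNodeValue().trim());
            }
            return String.join(" ", values);
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract text", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(joinedText(SAMPLE)); // Acct 1000 E01 E02 John Doe
    }
}
```

Note that XPath is still only selecting here: the gender text is left out of the selection, and the tree itself is not modified.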

String representation of HTML element in HTTPUnit

Been trying to find a way of getting the String representation of an HTMLElement in HTTPUnit. I'm using HTTPUnit in some tests to get response HTML, and can get the text content of an element, however this does not include a text representation of its surrounding HTML, which I want to compare with a test value.
Any help appreciated.
There is more than one way to represent an HTML element. For instance, the attributes of an HTML element can be in any order, so you could produce a string, but it is not guaranteed to be identical to the original element.

How to extract a list of string from xml file?

I have an xml file and an attribute "name" in some of the tags.
If I give the correct xpath - is there a way to extract a list of strings, each element being one of the values of this attribute?
(I do not need to get the entire list of DOM nodes...)
With XPath 2.0 or with XQuery you can write //@name/string() to get a sequence of string values of all name attributes in the document. With XPath 1.0 you can select the attribute nodes with //@name but then you need to use your host language (e.g. Java) to build a list of all the attribute values.
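For the XPath 1.0 case, building the list in Java could look something like the sketch below. It uses the JDK's javax.xml.xpath API; the class name NameAttributes, the helper attributeValues and the sample document are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NameAttributes {

    // Hypothetical sample document with two name attributes.
    static final String SAMPLE =
            "<root><a name=\"one\"/><b><c name=\"two\"/></b><d/></root>";

    // Evaluates an expression that selects attribute nodes and
    // copies their values into a plain list of strings.
    static List<String> attributeValues(String xml, String expression) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList attributes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, document, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < attributes.getLength(); i++) {
                values.add(attributes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read attribute values", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributeValues(SAMPLE, "//@name")); // [one, two]
    }
}
```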

Efficient way to search for an element name in DOM4j document

What is the most efficient way to search for an element?
Would I have to traverse the complete DOM4j document? Should I use XPath here?
I am actually comparing two XML documents. I will iterate through the elements of the first XML one by one and search for each of them in the second XML document.
It is not a straightforward comparison. I would be comparing the name attribute value with the second XML's elements. If the first XML has a name such as name="xx.yy", then I need to look for <xx>
<yy></yy>
</xx> in second xml.
Maybe you could use Jsoup for this? I don't know what kind of comparison you are after, but with Jsoup you could simply select all nodes from both XMLs and iterate over both collections in one loop.
Jsoup is very effective and easy to use if you need to select an arbitrary node just by its attribute (any attribute), tag name, or content.
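If you would rather stay with XPath, the dotted name can be turned into a path and checked against the second document. The sketch below uses the JDK's javax.xml.xpath instead of DOM4j or Jsoup, purely to keep it self-contained; the class DottedNameLookup and its helpers are hypothetical, and it assumes the dotted names contain nothing but valid element names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DottedNameLookup {

    // Hypothetical second document to search in.
    static final String SAMPLE = "<root><xx><yy></yy></xx><zz/></root>";

    // Parse once and reuse the document for every lookup.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Turns "xx.yy" into "//xx/yy" (assumes plain element names).
    static String toXPath(String dottedName) {
        return "//" + dottedName.replace('.', '/');
    }

    // True if the nested elements described by the dotted name exist.
    static boolean exists(Document document, String dottedName) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Boolean) xpath.evaluate(
                    toXPath(dottedName), document, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + dottedName, e);
        }
    }

    public static void main(String[] args) {
        Document second = parse(SAMPLE);
        System.out.println(exists(second, "xx.yy")); // true
        System.out.println(exists(second, "xx.zz")); // false
    }
}
```

Whether this is faster than walking the tree yourself depends on the document size and the engine, so it is worth measuring before committing to it.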

DOM parse xml file without converting entity references

I am writing a parser for an XML file which will contain special characters, for example
<name>You &amp; me &reg;</name>
The DOM parser will, by default, parse this value to "You & me ®".
However, the string I want is
You &amp; me &reg;
Is there any way I can do this?
Thanks
If you are using DOM for parsing, see the DocumentBuilderFactory.setExpandEntityReferences() method.
By default, this setting is true meaning that entities are expanded out automatically. If you turn this off, you will be able to read the entities from the DOM - in this case you won't just get one big text node from a parent element, but you will get text nodes interleaved with entity nodes.
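A small sketch of what that looks like with the JDK's default parser follows. The class EntityReferences, the helper childNodeSummary and the inline DTD declaring reg are assumptions for the example (an undeclared entity would be a parse error). As far as I know, the five predefined entities such as &amp; are always resolved by the parser, so only declared entities like &reg; show up as entity reference nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EntityReferences {

    // Sample input with an internal DTD that declares the reg entity.
    static final String SAMPLE =
            "<!DOCTYPE name [<!ENTITY reg \"&#174;\">]>"
          + "<name>You &amp; me &reg;</name>";

    // Lists the direct children of the root element: text nodes as their
    // text, entity reference nodes as "&name;".
    static List<String> childNodeSummary(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setExpandEntityReferences(false); // keep entity reference nodes
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            List<String> parts = new ArrayList<>();
            NodeList children = document.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                    parts.add("&" + child.getNodeName() + ";");
                } else if (child.getNodeType() == Node.TEXT_NODE) {
                    parts.add(child.getNodeValue());
                }
            }
            return parts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The reg reference is reported as its own child node.
        System.out.println(childNodeSummary(SAMPLE));
    }
}
```

Getting &amp; back in the output is then a matter of escaping the text nodes again when you write them out, since the parser does not preserve that reference.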
