How to extract a list of strings from an XML file?

I have an XML file in which some of the tags have a "name" attribute.
If I supply the correct XPath, is there a way to extract a list of strings, each element being one of the values of this attribute?
(I do not need to get the entire list of DOM nodes...)

With XPath 2.0 or with XQuery you can write //@name/string() to get a sequence of the string values of all name attributes in the document. With XPath 1.0 you can select the attribute nodes with //@name, but then you need to use your host language (e.g. Java) to build a list of all the attribute values.
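
For the XPath 1.0 case, a minimal sketch in Java using only the JDK's javax.xml.xpath API might look like the following. The class name AttributeValues, the method nameAttributes, and the sample XML are made up for illustration; only the //@name expression comes from the answer above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class AttributeValues {

    // Returns the value of every "name" attribute in the given XML text.
    static List<String> nameAttributes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // XPath 1.0 can only hand back the attribute nodes themselves...
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//@name", doc, XPathConstants.NODESET);
            // ...so the list of strings is built in the host language.
            List<String> values = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                values.add(attrs.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate //@name", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document.
        String xml = "<items><item name='a'/><group name='b'><item name='c'/></group></items>";
        System.out.println(nameAttributes(xml));
    }
}
```

The NodeList still exists internally, but the caller only ever sees the list of strings.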

Related

Get element from multiple div class with colon in css html

There are 2 divs with the same class names:
<div class="website text:middle"> A</div>
<div class="website text:middle"> 1</div>
How do I get A and 1? I tried using getElementById with :eq(0) and it returns null.

Method getElementById queries for elements with a specified id, not class; I'm not sure what you were trying to query with :eq(0) either.
Try:
// String html = ...
Document doc = Jsoup.parse(html);
List<String> result = doc.getElementsByClass("text:middle").eachText();
// result = ["A", "1"]
EDIT
You can query for elements that match multiple classes! See Jsoup select div having multiple classes.
However, a colon (:) is a special character in CSS and must be escaped when it appears as part of a class name in a selector query. I don't think that jsoup currently supports this; it simply treats everything after a colon as a pseudo-class.

To add to Janez's correct answer - while jsoup's CSS selector (currently) doesn't support escaping a : character in the class name, there are other ways to get it to work if you want to use the select() method instead of getElementsByXXX -- e.g. if you want to combine selectors in one call:
Elements divs = doc.select("div[class=website text:middle]");
That will find div elements with the literal attribute class="website text:middle".
Or:
Elements divs = doc.select("div[class~=text:middle]");
That finds div elements whose class attribute matches the regex /text:middle/.
For the presented data, though, I think the getElementsByClass() DOM method is the way to go and the most general. I just wanted to show a couple of alternatives for other cases.

document.querySelectorAll(".website")[0] // 0 is the index of the match
You should use querySelector or querySelectorAll; they are supported by every modern browser.

Why am I getting no spaces between text node values?

I am using an XPath expression to get the text nodes from an XML document like the one below:
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
<proj>
<under>E01</under>
<under>E02</under>
</proj>
<name>John Doe</name>
<gender>male</gender>
</emp>
</company>
I have written the following XPath expression to get the text values:
normalize-space(string(//emp))
It is extracting the correct values and the output is like below:
Acct1000E01E02John Doemale
Notice that there are no spaces between the text node values from different nodes.
I actually want the output value to be in this way:
`Acct 1000 E01 E02 John Doe`
I have used javax.xml.xpath to parse and build the tree as follows:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(new File("/employees.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "normalize-space(string(//emp))";
String output = (String) xpath.compile(expression).evaluate(document, XPathConstants.STRING);
I am using Java SE 10 here, so the XPath version is 1.0.
Is there a better way to extract the text values?
I am pretty new to XPath so any suggestions would be helpful.

You are almost right here.
Picking the not operator is the right way to go.
It should be something like this:
/html/body/company/emp/*[not(self::gender)]
That is, all child nodes of emp except the gender node.
Here is a full example in JavaScript:
let xpathExpression = '/html/body/company/emp/*[not(self::gender)]';
let contextNode = window.document;
let xpathResult = document.evaluate(xpathExpression, contextNode,
null, XPathResult.ANY_TYPE, null);
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());

Oh dear, this one is complicated...
First of all, you haven't tagged your question with an XPath version. Usually people who aren't aware of XPath versions are using the ancient version 1.0, so I'll make that assumption: sorry if it's wrong.
In XPath 1.0, a function that is given a node-set and that expects a string uses the string value of the first node in the node-set, taken in document order.
In your query
normalize-space(string(//emp))
//emp selects a node-set, which happens to contain a single node, so string() takes the string value of that node. The string value of an element node is the concatenation of all its descendant text nodes. The normalize-space function removes leading and trailing whitespace, and normalizes internal whitespace to a single space character.
You have shown your XML in indented form as
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
etc, so it's reasonable to expect that the whitespace between elements forms part of the string value of the <emp> element. But you haven't told us how the document was parsed and turned into a node tree. Parsers often provide multiple options on how to do this, in particular, on how to handle the whitespace between element nodes. Most retain the whitespace by default, unless perhaps there is a schema or DTD that tells the parser that the whitespace is insignificant. Microsoft's MSXML parser, notoriously, drops the whitespace by default, which causes considerable problems when you are using XML to represent narrative documents, but actually makes life easier for people using XML for this kind of non-document data.
Your parser, for one reason or another (we can't tell) seems to have deleted the whitespace between element nodes. No XPath query is going to bring it back again. You may have options when building the document to retain the whitespace; that depends on the tools you are using.
Your second question asks about dropping one of the elements in the input. That's beyond the scope of XPath. XPath can only select nodes from the input, it cannot modify them in any way. To modify the tree, you need XSLT or XQuery.
Your attempt to solve the problem with //emp[not(descendant::gender)] is hopelessly doomed because this will only select employees that have no descendant element named gender. You appear to be guessing the semantics rather than using a specification or tutorial.
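
Since XPath 1.0 cannot join strings itself, one workaround is to select the text nodes and join them in Java, which works whether or not the parser kept the whitespace between elements. The following is only a sketch under that assumption: the class TextValues, the method joinText, and the exact filter expression are illustrative choices, not something given in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class TextValues {

    // Selects text nodes with the given XPath 1.0 expression and joins them with single spaces.
    static String joinText(String xml, String textNodeExpression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(textNodeExpression, doc, XPathConstants.NODESET);
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                parts.add(nodes.item(i).getNodeValue().trim());
            }
            return String.join(" ", parts);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + textNodeExpression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<company>\n<emp>\n<dept>Acct</dept>\n<salary>1000</salary>\n"
                + "<proj>\n<under>E01</under>\n<under>E02</under>\n</proj>\n"
                + "<name>John Doe</name>\n<gender>male</gender>\n</emp>\n</company>";
        // Keep non-blank text nodes under emp, skipping anything inside gender.
        String expr = "//emp//text()[normalize-space() and not(ancestor::gender)]";
        System.out.println(joinText(xml, expr)); // prints "Acct 1000 E01 E02 John Doe"
    }
}
```

The selection stays in XPath; only the joining moves to the host language, so the input tree is never modified.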

Keep elements in text with XPath

I have some XML files. Each file contains some information, including a description of itself enclosed in the element <namespace:description></namespace:description>. This description will be inserted into an HTML web page and uploaded to the web.
The problem is that the description element contains other HTML elements, and I want to keep them so that the text can be formatted, but XPath strips all those elements and returns only their text.
<namespace:descr>Some <i>nice</i> description</namespace:descr>
I tried variations on this XPath query: //*[local-name()='descr']
(I'm not really skilled with XPath)
Also tried something like //*[local-name()='descr']//*[not(descendant::*[self::p or self::i])] found in this answer, but it doesn't work for me.
So my question: is there some way to keep XML/HTML elements in text after using XPath query?

The return value of an XPath expression can either be a string, number, boolean or a node-set. Each of these types can be converted to one of the primitive types.
The expression //*[local-name()='descr'] returns a node-set but you then obviously convert it to a string which returns the concatenated text content of the first node in the node-set, stripping off all markup.
To print the content of the result node as markup you would need to do the following:
Retrieve the expression result as a node-set. The implementation type of the node-set depends on the XPath engine; for instance, it could be a DOM NodeList.
Serialize the nodes as an XML fragment. How to do this depends on the node-set API and the XPath engine. XSLT could be used for that, but it may also be as simple as calling toString() on the node implementation.
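
The two steps above can be sketched in Java with the JDK's DOM, XPath, and identity Transformer. This is one possible approach, assuming a DOM-based XPath engine; the class MarkupSerializer, the method firstMatchAsXml, and the urn:example namespace URI are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class MarkupSerializer {

    // Evaluates the expression, takes the first matching node and serializes it with its markup.
    static String firstMatchAsXml(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so that local-name() sees "descr"
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Step 1: retrieve the result as a node rather than as a string.
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
            if (node == null) {
                throw new IllegalArgumentException("No node matches " + expression);
            }
            // Step 2: serialize the node with an identity transformation.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize result of " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The namespace URI is an assumption; the question does not give one.
        String xml = "<namespace:descr xmlns:namespace=\"urn:example\">"
                + "Some <i>nice</i> description</namespace:descr>";
        System.out.println(firstMatchAsXml(xml, "//*[local-name()='descr']"));
    }
}
```

The serialized string includes the descr element itself; to get only its content, the child nodes could be serialized one by one in the same way.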

Efficient way to search for an element name in DOM4j document

What is the most efficient way to search for an element?
Would I need to traverse the complete DOM4j document? Should I use XPath here?
I am actually comparing two XML documents. I will iterate through the first XML document element by element and search for each one in the second XML document.
It is not a straightforward comparison. I would be comparing the name attribute values with the second XML document's elements. If the first XML document has a name such as name="xx.yy", then I need to look for <xx>
<yy></yy>
</xx> in the second XML document.

Maybe you could use Jsoup for this? I don't know what kind of comparison you are after, but with Jsoup you could simply select all nodes from both XML documents and iterate over both collections in one loop.
Jsoup is very effective and easy to use if you need to select an arbitrary node by its attribute (any attribute), tag name, or content.
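
On the XPath part of the question, a dotted name such as xx.yy can be turned into a path expression and checked against the second document. The sketch below uses the JDK's javax.xml.xpath API rather than DOM4j or Jsoup, and assumes that every segment of the name is a plain XML element name; the class DottedNameLookup and its methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class DottedNameLookup {

    // Parses XML text once so that the same Document can be queried many times.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns true if the document has a yy element directly inside an xx element for "xx.yy".
    static boolean containsPath(Document doc, String dottedName) {
        // Assumes each segment is a plain element name: "xx.yy" becomes "//xx/yy".
        String expression = "//" + dottedName.replace('.', '/');
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // A node-set converted to a boolean is true when it is non-empty.
            return (Boolean) xpath.evaluate(expression, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document second = parse("<root><xx><yy></yy></xx></root>");
        System.out.println(containsPath(second, "xx.yy"));
        System.out.println(containsPath(second, "xx.zz"));
    }
}
```

Each lookup still walks the tree, so for many names it may be cheaper to traverse the second document once and collect its element paths into a set.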

How to get Attribute using selector syntax in jsoup

I need to get value of attribute href of a tag.
I know using a.attr("href") I can get href attribute value.
But I want to know whether there is another way to get the href attribute in Jsoup, as in jTidy
(using syntax like //a/@href).
That is, can I use some selector syntax to get the attribute directly?
Thanks.

No, you can't retrieve the attribute value with a selector alone. A selector's purpose is to select elements by various criteria.
But you can select only those elements that have the attribute and then ask for its value.
Element withAttr = doc.select("a[href]").first();
String attrValue = withAttr.attr("href");
