Evaluate many elements with XPathExpression and NODESET - java

I parse a very large xml file (from jpylyzer, a jp2 properties extractor). This xml contains properties of many JP2 images, each one with the same elements, like :
//results/jpylyzer/fileInfo/fileName
//results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox/height
//results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox/width
//results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox/bPCDepth
In order to reduce processing time, I'm using this method :
for (XPathExpression xPathExpression : listXPathExpression) {
nodeList = (NodeList) xPathExpression.evaluate(document, XPathConstants.NODESET);
//we use our list
}
It's very convenient and fast, but the number of elements must be as we expected for each property.
As some properties are unique to some images, some xpath values won't be found for some images.
nodeList is filled ONLY with the values that were found, which is a problem: there's no way to match those values to the other ones, because the lists don't have the same size; it depends on how many properties have been found.
Is there a way to fill "blank" when no value is found ?

What you want is not possible with a single XPath expression, not even with version 2.0. In such a case, you have to reach for the higher-level language you embed XPath in.
As I'm not familiar with Java very much, I cannot give you specific code, but I can explain what you have to do.
I assume an XML document similar to
<results>
<jpylyzer>
<fileInfo>
<fileName>Name of file</fileName>
</fileInfo>
<properties>
<jp2HeaderBox>
<imageHeaderBox>
<height>45</height>
<width>66</width>
<bPCDepth>386</bPCDepth>
</imageHeaderBox>
<imageHeaderBox>
<width>32</width>
</imageHeaderBox>
</jp2HeaderBox>
</properties>
</jpylyzer>
</results>
As a starting point, find an element that really is present in all XML documents, in all situations. For the sake of an example, let us assume imageHeaderBox is present everywhere, but its children height, width and bPCDepth are not necessarily there.
Find an XPath expression for the imageHeaderBox element:
/results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox
evaluate the expression and save the result to a nodeList. Next, process this list further. This only works if XPath expressions can be applied to the individual items in a nodeList, but it seems you are optimistic about that:
I can iterate over nodelist. I guess i can evaluate too
Iterate over the nodeList (the result of the imageHeaderBox expression) and apply another path expression to each item.
XPath 2.0
In XPath 2.0, you can use an if/then/else expression that checks for the presence of a node. Assuming the imageHeaderBox element node as the context item:
if(height) then height else 'e.g. text saying there is no height'
XPath 1.0
With XPath 1.0, it's slightly more complicated:
concat(height, substring('e.g. text saying there is no height', 1 div not(height)))
See Dimitre Novatchev's answer here for an explanation. The technique is known as the Becker method, probably introduced here.
Finally, the result list should look similar to
45
e.g. text saying there is no height
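The steps above can be sketched in Java roughly as follows. This is only a minimal sketch, not tested against real jpylyzer output: the class name HeightPerBox and the placeholder text 'no height' are made up for illustration, and it uses the full path to imageHeaderBox as it appears in the sample document. It evaluates the XPath 1.0 expression once per imageHeaderBox, so the resulting list always has one entry per box.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; "no height" is a placeholder string.
class HeightPerBox {

    // The sample document from this answer, written without indentation.
    static final String XML =
        "<results><jpylyzer>"
      + "<fileInfo><fileName>Name of file</fileName></fileInfo>"
      + "<properties><jp2HeaderBox>"
      + "<imageHeaderBox><height>45</height><width>66</width>"
      + "<bPCDepth>386</bPCDepth></imageHeaderBox>"
      + "<imageHeaderBox><width>32</width></imageHeaderBox>"
      + "</jp2HeaderBox></properties>"
      + "</jpylyzer></results>";

    // Returns one string per imageHeaderBox: its height, or the placeholder.
    static List<String> heights(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: the element assumed to be present for every image.
            XPathExpression boxes = xpath.compile(
                    "/results/jpylyzer/properties/jp2HeaderBox/imageHeaderBox");
            // Step 2: relative expression, evaluated with each box as context.
            XPathExpression height = xpath.compile(
                    "concat(height, substring('no height', 1 div not(height)))");

            NodeList boxList = (NodeList) boxes.evaluate(doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < boxList.getLength(); i++) {
                result.add(height.evaluate(boxList.item(i)));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(heights(XML)); // prints [45, no height]
    }
}
```

The same loop can evaluate width and bPCDepth with analogous expressions, which keeps all the per-image lists the same length.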


Jsoup eq selector returns no value

Trying to fetch data using Jsoup 1.10.3, seems like eq selector is not working correctly.
I tried nth-child, but it seems like it's not getting the second table (table:nth-child(2)).
Is my selector correct?
html > body > table:nth-child(2) > tbody > tr:nth-child(2) > td:nth-child(2)
in the example below, trying to extract the value 232323
Here is the try it sample
There are several issues that you may be struggling with. First, I don't think that you want to use the :nth-child(an+b) selector. Here is the explanation of that selector from the jsoup docs:
:nth-child(an+b) elements that have an+b-1 siblings before it in the document tree, for any positive integer or zero value of n, and has a parent element. For values of a and b greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder), and selecting the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The a and b values must be integers (positive, negative, or zero). The index of the first child of an element is 1.
I guess you want to use the table:nth-of-type(n) selector.
Second, you only select elements with your selector, but you want to get the visible content 232323, which is only one inner node of the element you select. So what is missing is the part where you get to the content. There are several ways of doing this. I again recommend that you read the docs. Especially the cookbook is very helpful for beginners. I guess you could use something like this:
String content = element.text();
Third, with CSS selectors you really do not need to go through every hierarchy level of the DOM. Since tables always contain a tbody and tr and td elements, you may do something like this:
String content = document.select("table:nth-of-type(2) tr:nth-of-type(2) td:last-of-type").text();
Note, I do not have a java compiler at hand. Please use my code with care.

getNodeName matches an XML node, but XPath can't find it

This feels like such a noob question.
I'm looking at a pile of Java code that manipulates an XML DOM. (The classes are the stock org.w3c.dom.Document and javax.xml.xpath.XPath and such that ship with JDK 7.) It has a ton of places that look like this:
String expr = "/fixed/path/through/the/hierarchy";
// actual code reuses factory instances, etc
XPath xpath = XPathFactory.newInstance().newXPath();
Node topNode = someDocumentInstance.getFirstChild();
Node node = (Node) xpath.evaluate (expr, topNode, XPathConstants.NODE);
NodeList children = node.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
Node child = children.item(i);
if (child.getNodeName().equalsIgnoreCase("somePrefix:someTag")) {
// "return child;" or otherwise break out of the loop
}
}
And it all works. But that loop seems a tedious effort; if we're already using XPath to get a node, why then iterate over that node's children looking for a known tag?
So I tried to rewrite a section to fetch the child node directly. But querying using
String expr = "/fixed/path/through/the/hierarchy/somePrefix:someTag";
never matches anything. I've tried variations like requesting XPathConstants.NODESET or .STRING, but still no results. (There should only ever be one of these nodes anyhow.)
I feel like I'm missing something supremely obvious here, but I can't figure out why the full query fails, when the query-for-parent plus a manual loop through the children works. Is XPath testing some quality of a node beyond getNodeName() when I use a query like that?
The only theory I've come up with is that it has something to do with XML namespaces, which aren't used in this project. (There's actually a call to .setNamespaceAware(false) on the DocumentBuilderFactory instance with a comment saying "leave this off or everything everywhere breaks".)
If you're parsing without namespaces, then you should leave somePrefix out of your expression:
String expr = "/fixed/path/through/the/hierarchy/someTag";
The reason for this is that XPath performs matches on namespace and local name, not qualified name (which is what getNodeName() returns). If you put a prefix in your XPath expression, the XPath interpreter will use that to retrieve the namespace from its namespace mapping. Since you haven't given it any mappings, that will fail.
Also, you probably want to use NODESET if you're going to iterate over the child nodes.
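To illustrate how a prefix in an XPath expression is resolved through a namespace mapping, here is a minimal sketch. Note the assumptions: unlike the project in the question, it parses with namespace awareness turned on, and the document, the prefix p, the URI urn:example and the class name PrefixMappingDemo are all made up for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: namespace-aware parsing plus an explicit prefix mapping.
class PrefixMappingDemo {

    static final String XML =
        "<root xmlns:p=\"urn:example\"><p:tag>value</p:tag></root>";

    // Evaluates expr against XML and returns the string value of the result.
    static String eval(String expr) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // assumption: differs from the question
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // The mapping XPath consults when it sees a prefix in an expression.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "p".equals(prefix) ? "urn:example" : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return "urn:example".equals(uri) ? "p" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return (prefix == null
                            ? Collections.<String>emptyList()
                            : Collections.singletonList(prefix)).iterator();
                }
            });
            return xpath.evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval("/root/p:tag")); // prints value
        System.out.println(eval("/root/tag").isEmpty()); // prints true
    }
}
```

In this namespace-aware setup the unprefixed name test does not match, because the element lives in the urn:example namespace; the match is made on namespace plus local name, not on the qualified name.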

Searching for the first matching element after a specific node (XPath and ITunes XML)

It's not necessary to post my full code because I have just a short question. I'm searching an XML document with XPath for a text value. I have XML like
<key>Name</key>
<string>Dat Ass</string>
<key>Artist</key>
<string>Earl Sweatshirt</string>
<key>Album</key>
<string>Kitchen Cutlery</string>
<key>Kind</key>
<string>MPEG-Audiodatei</string>
I have an Expression like this:
//string[preceding-sibling::key[text()[contains(., 'Name')]]]/text()
but this gives me ALL following string-tags, I just want the first one with the Song-Title.
greets Alex
Use:
(//string[preceding-sibling::key[1] = 'Name'])[1]/text()
Alternatively, one can use a forward-only expression:
(//key[. = 'Name'])[1]/following-sibling::string[1]/text()
Do note:
This is a common error. Any expression of the kind:
//someExpr[1]
Doesn't select "the first node in the document from all nodes selected by //someExpr". In fact it can select many nodes.
The above expression selects any node that is selected by //someExpr and that is the first such child of its parent.
This is why, without brackets, the other answer to this question is generally incorrect.
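The difference can be checked with a small Java sketch. It assumes a document with two dict elements; the wrapper elements plist and dict, the second song entry and the class name FirstMatchDemo are invented for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo of //expr[1] versus (//expr)[1].
class FirstMatchDemo {

    // Two tracks, each in its own parent element (second track is made up).
    static final String XML =
        "<plist>"
      + "<dict><key>Name</key><string>Dat Ass</string>"
      + "<key>Artist</key><string>Earl Sweatshirt</string></dict>"
      + "<dict><key>Name</key><string>Another Song</string>"
      + "<key>Artist</key><string>Someone Else</string></dict>"
      + "</plist>";

    // Evaluates expr against XML and returns the selected nodes.
    static NodeList select(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Without brackets: the first match under EACH parent is selected.
        NodeList perParent = select("//string[preceding-sibling::key[1] = 'Name'][1]");
        System.out.println(perParent.getLength()); // prints 2

        // With brackets: the first match in the whole document.
        NodeList first = select("(//string[preceding-sibling::key[1] = 'Name'])[1]");
        System.out.println(first.getLength()); // prints 1
        System.out.println(first.item(0).getTextContent()); // prints Dat Ass
    }
}
```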
You can just add another predicate [1] to select the first matching node. The nested predicate using text() should be unnecessary:
//string[preceding-sibling::key[contains(., 'Name')]][1]/text()
Another, perhaps more efficient, way to select this node would be
//key[contains(., 'Name')]/following-sibling::*[1][self::string]
This selects the first node (with any name) following the wanted key node and tests if its name is string.

Get the computed style of a DOM element

After a layout is completed, I want to parse through the DOM tree and get the computed styles of each element. Is this possible?
The closest I could get is the below snippet, but it gives only uncomputed styles.
Element elm = (Element) _doc.getElementsByTagName("table").item(0);
Map props = _sharedContext.getCss().getCascadedPropertiesMap(elm);
Is it also possible to get which "Box" the element lies in?
You can access the computed style in the document with ITextRenderer.getRootBox().
This method returns the root of a tree of org.xhtmlrenderer.render.Box objects, which you can scan to find your element.
You can get the box computed style with Box.getStyle() and you can get the element the box refers to with Box.getElement().

Jdoms annoying textnodes and addContent(index, Element) - schema solutions?

i have some already generated xmls and the application causing problems now needs to add elements to them, which need to be at a specific position to be valid with respect to the applied schemata...
now there are two problems the first one is that i have to hardcode the positions which is not that nice but "ok".
But the much bigger one is jdom... I printed the content list and it looks like:
element1
text
element2
element4
text
element5
while the textnodes are just whitespaces and every element i add makes it even more unpredictable how many textnodes there are (because sometimes there are added some sometimes not) which are just counted as it were elements but i want to ignore them because when i add element3 at index 2 its not between element2 and element4 it comes after this annoying textnode.
Any suggestions? The best solution imho would be something that automatically puts it where it has to be according to the schema but i think thats not possible?
Thanks for advice :)
The JDOM Model of the XML is very literal... it has to be. On the other hand, JDOM offers ways to filter and process the XML in a way that should make your task easier.
In your case, you want to add Element content to the document, and all the text content is whitespace..... so, just ignore all the text content, and worry about the Element content only.
For example, if you want to insert a new element nemt before the 3rd Element (the list is zero-based, so that is index 2), you can:
rootemt.getChildren().add(2, new Element("nemt"));
The elements are now sorted out.... what about the text...
A really simple solution is to just pretty-print the output:
XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
xout.output(mydoc, System.out);
That way all the whitespace will be reformatted to make the XML 'pretty'.
EDIT - and no, there is no way with JDOM to automatically insert the element in the right place according to the schema....
Rolf
