What is the most efficient way of addressing an element in XPath?

I have a Java program which cares about efficiency. There I use XPaths.
In XPath I can select elements starting from root
/root/a/b/c/d/e
or use the descendant-or-self axis:
//e
Which of these two methods will be the most efficient?

A direct path will tend to perform better than one using the more general descendant-or-self (//) axis, however:
Implementations could vary (but as a general rule, direct paths perform better).
The difference can be minor enough not to matter, especially for small documents.
As with all performance concerns, measure before optimizing to avoid expending effort in areas other than true bottlenecks.

I would imagine that /root/a/b/c/d/e would be more efficient, because in the first case, the XPath processor can eliminate a lot of branches, whereas in the second case (//e) the XPath processor has to search the entire document tree.
You should write a small Java program that exercises the two different ways, and then see how long it takes to run 1000 loops.
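For instance, a minimal benchmark sketch along those lines using the standard javax.xml.xpath API (the file name test.xml and the element names are placeholders):

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathBenchmark {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("test.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Compile both expressions once, outside the timing loop.
        XPathExpression direct = xpath.compile("/root/a/b/c/d/e");
        XPathExpression anywhere = xpath.compile("//e");
        time("/root/a/b/c/d/e", direct, doc);
        time("//e", anywhere, doc);
    }

    static void time(String label, XPathExpression expr, Document doc) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            expr.evaluate(doc, XPathConstants.NODESET);
        }
        System.out.printf("%s: %.2f ms for 1000 evaluations%n",
                label, (System.nanoTime() - start) / 1e6);
    }
}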

Understanding the leading / and // constructs is very important.
A leading / starts a path that is always evaluated relative to the document root, not the context node. Therefore, even when you evaluate the expression against a sub-node, the XPath:
/root/a/b/c
... will still return every matching c node in your XML document, even the ones that are not descendants of your context node. Likewise, the XPath:
//e
... will still return every e node in your XML document, not just the descendants of your context node.
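You can check this behaviour with the standard javax.xml.xpath API; here is a small sketch (the two-e document is a made-up example):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LeadingSlashDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                        "<root><a><e/></a><b><e/></b></root>")));
        Node context = (Node) XPathFactory.newInstance().newXPath()
                .evaluate("/root/a", doc, XPathConstants.NODE);
        // Even with a sub-node as context, the leading // searches the whole document:
        NodeList all = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//e", context, XPathConstants.NODESET);
        System.out.println(all.getLength()); // prints 2, not 1
    }
}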

What would be the best way to build a Big-O runtime complexity analyzer for pseudocode in a text file?

I am trying to create a class that takes in a string input containing pseudocode and computes its worst-case runtime complexity. I will be using regex to split each line, analyze the worst case, and add up the complexities (based on the big-O rules) for each line to give a final worst-case runtime. The pseudocode written will follow a few rules for declaration, initialization, and operations on data structures. This is something I can control. How should I go about designing a class considering the rules of iterative and recursive analysis?
Any help in C++ or Java is appreciated. Thanks in advance.
#include <string>
using std::string;

class PseudocodeAnalyzer
{
public:
    string inputCode;
    string performIterativeAnalysis(string line);
    string performRecursiveAnalysis(string line);
    string analyzeTotalComplexity(string inputCode);
};
An example of an iterative algorithm: check if every number in a grid is odd:
1. Array A = Array[N][N]
2. for i in 1 to N
3. for j in 1 to N
4. if A[i][j] % 2 == 0
5. return false
6. endif
7. endloop
8. endloop
Worst-case Time-Complexity: O(n*n)
The concept: "I wish to write a program that analyses pseudocode in order to print out the algorithmic complexity of the algorithm it describes" is mathematically impossible!
Let me try to explain why that is, and how you can get around the fact that you cannot write this in full generality.
Your pseudocode has certain capabilities. You call it pseudocode, but given that you are now trying to parse it, it's still a 'real' language where terms have real meaning. This language is capable of expressing algorithms.
So, which algorithms can it express? Presumably, 'all of them'. There is a concept called a 'Turing machine': you can prove that anything a computer can do, a Turing machine can also do, and Turing machines are very simple things. Therefore, if you have some simplistic computer and you can use it to emulate a Turing machine, you can use it to emulate a complete computer. This is how, in theoretical computer science, you can prove that a certain CPU or system is capable of computing all the stuff some other CPU or system is capable of computing: use it to emulate a Turing machine, thus proving it can run it all. Any system that can be used to emulate a Turing machine is called 'Turing complete'.
Then we get to something very interesting: If your pseudocode can be used to express anything a real computer can do, then your pseudocode can be used to 'write'... your very pseudocode checker!
So let's say we do just that and stick the pseudocode that describes your pseudocode checker in a function we shall call pseudocodechecker. It takes as argument a string containing some pseudocode, and returns a string such as O(n^2).
You can then write this program in pseudocode:
1. if pseudocodechecker(this-very-program) == O(n^2)
2. If True runSomeAlgorithmThatIsO(1)
3. If False runSomeAlgorithmThatIsO(n^2)
And this is self-defeating: We have 'programmed' a paradox. It's like "This statement is a lie", or "the set of all sets that do not contain themselves". If it's false it is true, and if it is true it is false. [Insert GIF of exploding computer here].
Thus, we have mathematically proved that what you want is impossible, unless one of the following is true:
A. Your pseudocode-based checker is incorrect. As in, it will flat out give a wrong answer sometimes, thus solving the paradox: If you feed your program a paradox, it gives a wrong answer. But how useful is such an app? An app where you know the answer it gives may be incorrect?
B. Your pseudocode-based checker is incomplete: The official definition of your pseudocode language is so incapable, you cannot even write a turing machine in it.
That last one seems like a nice solution; but it is quite drastic. It pretty much means that your algorithm can only loop over constant ranges. It cannot loop until a condition is true, for example. Another nice solution appears to be: The program is capable of realizing that an answer cannot be given, and will then report 'no answer available', but unfortunately, with some more work, you can show that you can still use such a system to develop a paradox.
The answer by @rzwitserloot and the ones given in the link are correct. Let me just add that it is possible to compute an approximation both to the halting problem and to the time complexity of a piece of code (written in a Turing-complete language!). (Compare that to the existence of automated theorem provers for arithmetic and other second-order logics, which are undecidable!) A tool that under-approximates the complexity problem would output the correct time complexity for some inputs, and "don't know" for other inputs.
Indeed, the whole wide field of code analyzers, often built into the IDEs that we use every day, more often than not under-approximate decision problems that are uncomputable, e.g. reachability, nullability or value analyses.
If you really want to write such a tool: the basic idea is to identify heuristics, i.e., common patterns for which a solution is known, such as various patterns of nested for-loops with only very basic arithmetic operations manipulating the indices, or simple recursive functions where the recurrence relation can be spotted straight-away. It would actually be not too hard (though definitely not easy!) to write a tool that could solve most of the toy problems (such as the one you posted) that are given as homework to students, and that are often posted as questions here on SO, since they follow a rather small number of patterns.
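For illustration, here is a minimal sketch of such a heuristic in Java. It only recognizes the numbered "for i in 1 to N" / "endloop" convention from the question and reports O(n^depth) for the deepest loop nest; everything beyond this tiny pattern set is an assumption you would have to extend:

import java.util.regex.Pattern;

public class SimpleComplexityHeuristic {
    // Matches lines like "2. for i in 1 to N".
    private static final Pattern FOR_LOOP =
            Pattern.compile("^\\s*\\d*\\.?\\s*for\\s+\\w+\\s+in\\s+1\\s+to\\s+N\\b");
    private static final Pattern END_LOOP =
            Pattern.compile("^\\s*\\d*\\.?\\s*endloop\\b");

    public static String analyze(String pseudocode) {
        int depth = 0, maxDepth = 0;
        for (String line : pseudocode.split("\\R")) {
            if (FOR_LOOP.matcher(line).find()) {
                depth++;
                maxDepth = Math.max(maxDepth, depth);
            } else if (END_LOOP.matcher(line).find()) {
                depth--;
            }
        }
        return maxDepth == 0 ? "O(1)"
             : maxDepth == 1 ? "O(n)"
             : "O(n^" + maxDepth + ")";
    }

    public static void main(String[] args) {
        String grid = "1. for i in 1 to N\n2. for j in 1 to N\n"
                + "3. if A[i][j] % 2 == 0\n4. return false\n"
                + "5. endif\n6. endloop\n7. endloop";
        System.out.println(analyze(grid)); // prints O(n^2)
    }
}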
If you wish to go beyond simple heuristics, the main theoretical concept underlying more powerful code analyzers is abstract interpretation. Applied to your use case, this would mean developing a mapping between code constructs in your language and code constructs in a different language (or simpler code constructs in the same language) for which it is easier to compute the time complexity. This mapping would have to conform to some constraints; in particular, the mapped constructs must have the same or worse time complexity as the original code. Actually, mapping a piece of code to a recurrence relation would be an example of abstract interpretation. So is replacing a line of code with something like "O(1)". So, the task is just to formalize some of the things that we do in our heads anyway when analyzing the time complexity of code.

Jsoup select - why does it include current element?

I am trying to understand if I'm missing something, because it seems very bizarre to me why Jsoup includes the current element in the search performed by select.
For example (scala code):
val el = doc.select("div").first
el.select("div").contains(el) // => true
What is the point of this? I see very limited cases where you'd actually want this. Do I need to always use el.children.select instead? Is there a nicer method?
Side question: Is there a nicer way to do el.children.select(s).first? In Ruby Nokogiri it would be el.at_css(s) which is much shorter, is there a similar option in Jsoup?
As to why the select method was implemented the way it was, my only guess would be that it's the most straightforward way to do it if we take into consideration the structure that holds the data returned by your query.
If we think about el, we will see that it is a "tree" representation of the elements that you asked for, having as root the first parent div node. Then you call select on that tree. Now it all depends on how you decide to see this tree. Should we treat this "tree" as a whole (include root) or not (discard root)? It's a matter of taste I guess.
If I judge from myself, a lot of people using Jsoup probably have had some experience with DOM parsing in jQuery. The equivalent would be something like $("div").first().find("div"), where find is documented as
Get the descendants of each element in the current set of matched
elements, filtered by a selector, jQuery object, or element.
This is in agreement with what you stated. It's just a matter of how the two libraries "see" the resulting tree. Jsoup treats the root as one of the nodes; jQuery differentiates the root (as far as find is concerned).
About the second part of your question.
val el = doc.select("div").first
el.children.select(s).first
No, there isn't. The only way is to change the CSS selector.
val result = doc.select("div:eq(0) " + s).first;
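To make the behaviour and the workaround concrete, here is a small sketch in plain Java against Jsoup's API (the HTML snippet is a made-up example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><div id=inner>hello</div></div>");
        Element el = doc.select("div").first();
        // select() on an element considers the element itself:
        System.out.println(el.select("div").contains(el));            // true
        // Restricting the search to children excludes the element itself:
        System.out.println(el.children().select("div").contains(el)); // false
        // The combined-selector shortcut, evaluated from the document:
        Element result = doc.select("div:eq(0) div").first();
        System.out.println(result.id());                              // inner
    }
}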

When is JDOM XPATH faster than element hunting with getChildren?

Using: Java 1.5 / JDom 1.1.3 / Jaxen 1.1.1
The test I've written was to confirm the belief that using precompiled XPath in JDOM was faster than iterating through child elements. Instead, what I've found is that XPath is between 4 and 5 times slower than iterating through lists of children, performing string comparisons, and hunting for what I want.
For context, my XPath is something like:
/root/quote/vehicle[@id='some vehicle']/coverage[@id='some coverage']/code
And the actual evaluation being timed (in a try/catch block):
String element = path.valueOf(doc);
And the alternative search is:
String element = null;
List<Element> vehicleList = doc.getRootElement()
        .getChild("quote")
        .getChildren("vehicle");
for (Element vehElement : vehicleList) {
    if (vehElement.getAttributeValue("id").equals("some vehicle")) {
        List<Element> coverageList = vehElement.getChildren("coverage");
        for (Element covElement : coverageList) {
            if (covElement.getAttributeValue("id").equals("some coverage")) {
                element = covElement.getChild("CoverageType").getText();
                break;
            }
        }
    }
}
Curiously, while the runtime of the method using XPath is much slower, it is the most consistent over 1000 iterations.
The first example completes in around 0.29 ms ± 0.01 ms.
The second example completes anywhere between 0.002 ms and 0.013 ms.
Both approach very short running times given a long enough test.
The XPath is, for me, easier to write; the getChild route seems more flexible but a little verbose. Still, that's a trade I don't mind making for speed. It is also true that even 100 iterations is incredibly fast, so this may be academic...
Ultimately I'd like to know:
Is there a scenario where JDOM XPath is faster than the alternative style shown?
What benefits does JDOM XPath (in any version of Java/JDOM) bring?
There are a few things to note here. I have done extensive work (I'm a JDOM maintainer) on JDOM 2.0.1, especially with regard to the performance of XPath evaluation. Here are some numbers:
http://hunterhacker.github.com/jdom/jdom2/performance.html
Read it from the bottom up.
Here are some other interesting numbers (compares different JDOM versions with different Java VM's)
http://hunterhacker.github.com/jdom/jdom2/performanceJDK.html
The bottom line:
JDOM 2.x introduces faster iterators. Jaxen is very iterator-intensive, and the performance improvements in JDOM 2.x are significant in this regard.
Java 7 is much faster than previous versions in regard to iterator performance too.
There is no benefit to 'compiling' Jaxen XPaths.
Even in the best of times, though, the 'native' method of searching will be faster than the XPath version.
Your biggest performance boost will come from running with Java 7, then upgrading to JDOM 2.x. Still, the 'custom' search, if written efficiently, will always be faster than XPath.
Edit: Also, JDOM 2.x introduces a new API for running XPath queries that you may find easier to work with (although the old API still works too): https://github.com/hunterhacker/jdom/wiki/JDOM2-Feature-XPath-Upgrade
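For reference, a minimal sketch of that JDOM 2.x API applied to the XPath from the question (assuming doc is an already-parsed org.jdom2.Document):

import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.filter.Filters;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;

public class Jdom2XPathExample {
    static String coverageCode(Document doc) {
        XPathExpression<Element> expr = XPathFactory.instance().compile(
                "/root/quote/vehicle[@id='some vehicle']/coverage[@id='some coverage']/code",
                Filters.element());
        Element code = expr.evaluateFirst(doc); // null if nothing matches
        return code == null ? null : code.getText();
    }
}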

Java Algorithm for finding the largest set of independent nodes in a binary tree

By independent nodes, I mean that the returned set cannot contain nodes that are in an immediate relationship: a parent and its child cannot both be included. I tried to use Google, with no success. I don't think I have the right search words.
A link, any help would be very much appreciated. Just started on this now.
I need to return the actual set of independent nodes, not just the amount.
You can compute this recursive function with dynamic programming (memoization):
MaxSet(node) = 1 if "node" is a leaf
MaxSet(node) = Max(1 + Sum{ i=0..3: MaxSet(node.Grandchildren[i]) },
Sum{ i=0..1: MaxSet(node.Children[i]) })
The idea is, you can pick a node or choose not to pick it. If you pick it, you can't pick its direct children but you can pick the maximum set from its grandchildren. If you don't pick it, you can pick maximum set from the direct children.
If you need the set itself, you just have to store how you selected "Max" for each node. It's similar to the LCS algorithm.
This algorithm is O(n). It works on trees in general, not just binary trees.
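Here is a minimal Java sketch of that recurrence with memoization, returning the set itself rather than just its size (the Node class is a made-up stand-in for your tree type):

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MaxIndependentSet {
    static class Node {
        Node left, right;
    }

    private final Map<Node, Set<Node>> memo = new HashMap<>();

    // Largest independent set of the subtree rooted at node.
    public Set<Node> maxSet(Node node) {
        if (node == null) return Collections.emptySet();
        Set<Node> cached = memo.get(node);
        if (cached != null) return cached;

        // Option 1: skip this node and take the best sets of its children.
        Set<Node> without = new HashSet<>(maxSet(node.left));
        without.addAll(maxSet(node.right));

        // Option 2: take this node plus the best sets of its grandchildren.
        Set<Node> with = new HashSet<>();
        with.add(node);
        for (Node child : new Node[] { node.left, node.right }) {
            if (child != null) {
                with.addAll(maxSet(child.left));
                with.addAll(maxSet(child.right));
            }
        }

        Set<Node> best = with.size() >= without.size() ? with : without;
        memo.put(node, best);
        return best;
    }
}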
I would take-and-remove all leaves first while marking their parents as not-to-take, then remove all leaves that are marked until no such leaves are left, then recurse until the tree is empty. I don't have a proof that this always produces the largest possible set, but I believe it should.
I've provided an answer to a question for the same problem; although the solution is in Python, the explanation, algorithm, and test cases could be applicable.

Is there any XPath processor for SAX model?

I'm looking for an XPath evaluator that doesn't rebuild the whole DOM document to look for the nodes of a document: the objective is to manage a large amount of XML data (ideally over 2 GB) with the SAX model, which is very good for memory management, and to give the possibility to search for nodes.
Thank you all for the support!
For all those who say it's not possible: recently, after asking the question, I found a project named "SAXPath" (http://www.saxpath.org/), but I can't find any project that implements it.
My current list (compiled from web search results and the other answers) is:
http://code.google.com/p/xpath4sax/
http://spex.sourceforge.net/
https://github.com/santhosh-tekuri/jlibs/wiki/XMLDog (also contains a performance chart)
http://www.cs.umd.edu/projects/xsq/ (university project, dead for 10 years, GPL)
MIT-Licensed approach http://softwareengineeringcorner.blogspot.com/2012/01/conveniently-processing-large-xml-files.html
Other parsers/memory models supporting fast XPath:
http://vtd-xml.sourceforge.net/ ("The world's fastest XPath 1.0 implementation.")
http://jaxen.codehaus.org/ (contains http://www.saxpath.org/)
http://www.saxonica.com/documentation/sourcedocs/streaming/streamable-xpath.html
The next step is to use the examples of XMLDog and compare the performance of all these approaches. Then, the test cases should be extended to the supported XPath expressions.
We regularly parse 1GB+ complex XML files by using a SAX parser which extracts partial DOM trees that can be conveniently queried using XPath. I blogged about it here: http://softwareengineeringcorner.blogspot.com/2012/01/conveniently-processing-large-xml-files.html - Sources are available on github - MIT License.
XPath DOES work with SAX, and most XSLT processors (especially Saxon and Apache Xalan) do support executing XPath expressions inside XSLTs on a SAX stream without building the entire DOM.
They manage to do this, very roughly, as follows:
Examining the XPath expressions they need to match.
Receiving SAX events and testing whether that node is needed or will be needed by one of the XPath expressions.
Ignoring the SAX event if it is of no use for the XPath expressions.
Buffering it if it is needed.
How they buffer it is also very interesting, because while some simply create DOM fragments here and there, others use very optimized tables for quick lookup and reduced memory consumption.
How much they manage to optimize largely depends on the kind of XPath queries they find. As the already posted Saxon documentation clearly explains, queries that move "up" and then traverse the document "horizontally" (sibling by sibling) obviously require the entire document to be present, but most queries require just a few nodes to be kept in RAM at any moment.
I'm pretty sure of this because, back when I was building webapps every day using Cocoon, we had the XSLT memory-footprint problem each time we used a "//something" expression inside an XSLT, and quite often we had to rework XPath expressions to allow a better SAX optimization.
SAX is forward-only, while XPath queries can navigate the document in any direction (consider the parent::, ancestor::, preceding:: and preceding-sibling:: axes). I don't see how this would be possible in general. The best approximation would be some sort of lazy-loading DOM, but depending on your queries this may or may not give you any benefit - there is always a worst-case query such as //*[. != preceding::*].
Sorry, a slightly late answer here - it seems that this is possible for a subset of XPath - in general it's very difficult due to the fact that XPath can match both forwards and backwards from the "current" point. I'm aware of two projects that solve it to some degree using state machines: http://spex.sourceforge.net & http://www.cs.umd.edu/projects/xsq. I haven't looked at them in detail but they seem to use a similar approach.
I'll toss in a plug for a new project of mine, called AXS. It's at https://code.google.com/p/annotation-xpath-sax/ and the idea is that you annotate methods with (forward-axis-only) XPath
statements and they get called when the SAX parser is at a node that matches it. So with a document
<doc>
  <nodes>
    <node name="a">text of node 1</node>
    <node name="b">text of node 2</node>
    <node otherattr="I have attributes!">text of node 3</node>
  </nodes>
</doc>
you can do things like
#XPath("/nodes/node")
void onNode(String nodeText)
{
// will be called with "text of node [123]"
}
or
@XPathStart("//node[@name='']")
void onNode3(Attrs node3Attrs) { ... }
or
@XPathEnd("/nodes/node[2]")
void iDontCareAboutNode3() throws SAXException
{
    throw new StopParsingException();
}
Of course, the library is so new that I haven't even made a release of it yet, but it's MIT licensed, so feel free to give it a try and see if it matches your need. (I wrote it to
do HTML screen-scraping with low enough memory requirements that I can run it on
old Android devices...) If you find bugs, please let me know by filing them on the
googlecode site!
There are SAX/StAX-based XPath implementations, but they only support a small subset of XPath expressions/axes, largely due to SAX/StAX's forward-only nature. The best alternative I am aware of is extended VTD-XML: it supports full XPath 1.0 and partial document loading via memory mapping, with a maximum document size of 256 GB, but you will need a 64-bit JVM to use it to its full potential.
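Typical VTD-XML usage looks roughly like this sketch (standard, non-huge mode; the file name and path are placeholders):

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdXPathExample {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        if (vg.parseFile("big.xml", true)) { // true = namespace-aware
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/root/a/b/c");
            int i;
            while ((i = ap.evalXPath()) != -1) { // index of the next match
                System.out.println(vn.toString(i));
            }
        }
    }
}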
What you could do is hook an XSL transformer to a SAX input source. Your processing will be sequential, and the XSL processor will attempt to catch the input as it comes and fiddle it into whatever result you specified. You can use this to pull a path's value out of the stream. This would come in especially handy if you wanted to produce a bunch of different XPath results in one pass.
You'll get (typically) an XML document as a result, but you could pull your expected output out of, say, a StreamResult with not too much hassle.
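A minimal sketch of that wiring with the standard javax.xml.transform API (extract.xsl and big.xml are placeholders; whether the transformation truly streams depends on the processor):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;

public class XsltOverSax {
    public static void main(String[] args) throws Exception {
        // The stylesheet holds the XPath expressions whose values you want.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("extract.xsl"));
        // Feed the transformer from a SAX input source.
        t.transform(new SAXSource(new InputSource("big.xml")),
                new StreamResult(System.out));
    }
}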
Have a look at the streaming mode of the Saxon-SA XSLT-processor.
http://www.saxonica.com/documentation/sourcedocs/serial.html
"The rules that determine whether a path expression can be streamed are:
The expression to be streamed starts with a call on the document() or doc() function.
The path expression introduced by the call on doc() or document() must conform to a subset of XPath defined as follows:
any XPath expression is acceptable if it conforms to the rules for path expressions appearing in identity constraints in XML Schema. These rules allow no predicates; the first step (but only the first) can be introduced with "//"; the last step can optionally use the attribute axis; all other steps must be simple Axis Steps using the child axis.
In addition, Saxon allows the expression to contain a union, for example doc()/(*/ABC | /XYZ). Unions can also be expressed in abbreviated form, for example the above can be written as doc()//(ABC|XYZ).
The expression must either select elements only, or attributes only, or a mixture of elements and attributes.
Simple filters (one or more) are also supported. Each filter may apply to the last step or to the expression as a whole, and it must only use downward selection from the context node (the self, child, attribute, descendant, descendant-or-self, or namespace axes). It must not be positional (that is, it must not reference position() or last(), and must not be numeric: in fact, it must be such that Saxon can determine at compile time that it will not be numeric). Filters cannot be applied to unions or to branches of unions. Any violation of these conditions causes the expression to be evaluated without the streaming optimization.
These rules apply after other optimization rewrites have been applied to the expression. For example, some FLWOR expressions may be rewritten to a path expression that satisfies these rules.
The optimization is enabled only if explicitly requested, either by using the saxon:stream() extension function, or the saxon:read-once attribute on an XSLT xsl:copy-of instruction, or the XQuery pragma saxon:stream. It is available only if the stylesheet or query is processed using Saxon-SA."
Note: this facility is most likely available only in the commercial version. I've used Saxon extensively in the past, and it is a nice piece of work.
Mmh, I don't know if I really understand you. As far as I know, the SAX model is event-oriented. That means you do something when a certain node is encountered during parsing. Yeah, it is better for memory, but I don't see how you would get XPath into it. As SAX does not build a model, I don't think this is possible.
I don't think XPath works with SAX, but you might take a look at StAX, which is an extended streaming XML API for Java.
http://en.wikipedia.org/wiki/StAX
The standard javax xpath API technically already works with streams; javax.xml.xpath.XPathExpression can be evaluated against an InputSource, which in turn can be constructed with a Reader. I don't think it constructs a DOM under the covers.
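A tiny sketch of that overload (the XML literal is a made-up example):

import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathOnInputSource {
    public static void main(String[] args) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(
                "<root><a><b>hello</b></a></root>"));
        // evaluate(String, InputSource) is part of the standard API:
        System.out.println(xpath.evaluate("/root/a/b", source)); // hello
    }
}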
Have you also tried QuiXPath (https://code.google.com/p/quixpath/)?
