Why is XPath last() function not working as I expect? - java

I am using Java and Selenium to write a test. I need to get the last element inside another element, so I used the last() function, but it doesn't always give me only the last one. When I apply:
//a//b[last()]
to
<a>
<l>
<b>asas</b>
</l>
<b>as</b>
</a>
to get <b>as</b>, it brings me:
<b>asas</b>
<b>as</b>
but when I apply it to:
<a>
<b>asas</b>
<b>as</b>
</a>
it brings me:
<b>as</b>

This is a common source of XPath confusion. First the straightforward parts:
//a selects all a elements in the document.
//a//b selects all b elements in the document that are descendants of a elements.
Normal stuff so far. Next is the tricky part:
To select the last b elements among siblings (beneath a elements):
//a//b[last()]
Here, the filtering is a part of the b selection criteria because [] has a higher precedence than //. As a result, last() is evaluated separately for each parent, among that parent's own b children.
To select the last b element in the document (beneath a elements):
(//a//b)[last()]
Here, the last() is an index on the list of all selected b elements because () is used to override the default precedence.
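To see the difference from Java, here is a minimal sketch using the JDK's built-in javax.xml.xpath API rather than Selenium. The class name XPathLastDemo and the helper select are made up for this illustration; the XML is the first sample from the question with the whitespace removed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathLastDemo {

    // Hypothetical helper: evaluates an XPath 1.0 expression against an XML
    // string and returns the text content of every selected node.
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><l><b>asas</b></l><b>as</b></a>";
        // Last b child of each parent: one hit under <l>, one under <a>.
        System.out.println(select(xml, "//a//b[last()]"));   // [asas, as]
        // Last node of the whole selected list.
        System.out.println(select(xml, "(//a//b)[last()]")); // [as]
    }
}
```

With Selenium the same expressions can be passed to By.xpath; the parenthesized form is the one that returns a single element here.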

I think it's easiest to understand the behaviour if you remember that "//" is an abbreviation for "/descendant-or-self::node()/", and that the step "b" is an abbreviation for "child::b". So
//b[last()]
is an abbreviation for
/descendant-or-self::node()/child::b[position()=last()]
Which means "Select every node in the document (except attributes and namespaces). For each of these nodes, form a list of the child elements named "b", and select the last element in this list".
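The "for each node, form a list of its b children and take the last one" reading can be spelled out by hand. The following is only a sketch of that description using the JDK DOM API, not how an XPath engine is actually implemented; the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LastChildWalk {

    // Mimics /descendant-or-self::node()/child::b[position()=last()]
    // for a given element name, returning the text of each selected element.
    static List<String> lastChildPerParent(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> result = new ArrayList<>();
            visit(doc, name, result);
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void visit(Node parent, String name, List<String> result) {
        // Form the list of child elements called `name`; remember only the last.
        Node last = null;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE && c.getNodeName().equals(name)) {
                last = c;
            }
        }
        // Walk the children in document order, emitting the remembered
        // element when it is reached and repeating the process for every child.
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.isSameNode(last)) {
                result.add(c.getTextContent());
            }
            visit(c, name, result);
        }
    }

    public static void main(String[] args) {
        System.out.println(lastChildPerParent("<a><l><b>asas</b></l><b>as</b></a>", "b")); // [asas, as]
        System.out.println(lastChildPerParent("<a><b>asas</b><b>as</b></a>", "b"));        // [as]
    }
}
```

Each parent contributes at most one element, which is why the nested sample yields two results and the flat sample yields one.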
You ask for sources of information. @kjhughes recommends reading the XPath 1.0 recommendation, and indeed, it is a lot more readable than many specs. But it can be a bit terse at times; it occasionally feels like solving a crossword puzzle. My "XSLT 2.0 Programmer's Reference" (which also includes a lot of material on XPath) was written for people who want a deep understanding of how the language works, but explained in plainer English. This particular topic is on page 627, and it's easy enough to find a pirated copy on the web if you want to see how it's covered. But I'd recommend buying a legal copy, because scrolling through 1300 pages of scanned PDF is not much fun.

Related

Jsoup eq selector returns no value

I am trying to fetch data using Jsoup 1.10.3, and it seems like the eq selector is not working correctly.
I tried nth-child, but it seems like it's not getting the second table (table:nth-child(2)).
Is my selector correct?
html > body > table:nth-child(2) > tbody > tr:nth-child(2) > td:nth-child(2)
In the example below, I am trying to extract the value 232323.
Here is the try it sample
There are several issues that you may be struggling with. First, I don't think that you want to use the :nth-child(an+b) selector. Here is the explanation of that selector from the jsoup docs:
:nth-child(an+b) elements that have an+b-1 siblings before it in the document tree, for any positive integer or zero value of n, and has a parent element. For values of a and b greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder), and selecting the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The a and b values must be integers (positive, negative, or zero). The index of the first child of an element is 1.
I guess you want to use the table:nth-of-type(n) selector.
Second, your selector only selects elements, but you want to get the visible content 232323, which is a text node inside the element you select. So what is missing is the part where you get to the content. There are several ways of doing this. I again recommend that you read the docs; the cookbook in particular is very helpful for beginners. I guess you could use something like this:
String content = element.text();
Third, with CSS selectors you really do not need to go through every hierarchy level of the DOM. Since tables always contain tbody, tr and td elements, you may do something like this:
String content = document.select("table:nth-of-type(2) tr:nth-of-type(2) td:last-of-type").text();
Note, I do not have a Java compiler at hand. Please use my code with care.

Data retrieval / search in text

I am working on a personal project on data retrieval, for my own interest. I have one text file with the following format.
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...
".I 1" indicates the beginning of the chunk of text corresponding to doc ID 1, and ".I 2" indicates the beginning of the chunk of text corresponding to doc ID 2.
I did:
split the docs and put them in separate files
delete stopwords (and, or, while, is, are, ...)
stem the words to get the root of each (achievement, achieve, achievable, ... all converted to achiv and so on)
and finally create a TreeMultimap which looks like this:
{key: word} {Values are arraylist of docID and frequency of that word in that docID}
aerodynam [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
book [[Doc_00562,6],[Doc_01111,1]]
....
....
result [[Doc_00010,5]]
....
....
zzzz [[Doc_01235,1]]
Now my questions:
Suppose that the user is interested to know:
which documents contain achieving and book? (idea)
documents which have achieving and skills but neither book nor video
documents that include Aerodynamic
and some other simple queries like this
(input) so suppose she enters
achieving AND book
(achieving AND skills) AND (NOT (book AND video))
Aerodynamic
.....and some other simple queries
(Output)
[Doc_00562,6],[Doc_01121,5],[Doc_01151,3],[Doc_00012,2],[Doc_00001,1]
....
As you can see, there might be:
some precedence modifiers (parentheses, nested to a depth we don't know in advance)
precedence of AND, OR, NOT
and some other interesting challenges and issues
So, I would like to run the queries against the TreeMultimap, search in the words (keys) and return the values (lists of docs) to the user.
How should I think about this problem and how should I design my solution? What articles or algorithms should I read? Any idea would be appreciated. (Thanks for reading this long post.)
The collection that you have used is the Cranfield test collection, which has about 1,400 documents. For collections of this size, it is fine to keep the inverted list (the data structure that you have constructed) in memory with a hash-based or trie-based organization. For realistic collections of much larger sizes, often comprising millions of documents, you would find it difficult to store the inverted list entirely in memory.
Instead of reinventing the wheel, the practical solution is thus to make use of a standard text indexing (and retrieval) framework such as Lucene. This tutorial should help you to get started.
The questions that you seek to address can be answered by Boolean queries where you can specify set of Boolean operators AND, OR and NOT between its constituent terms. Lucene supports this. Have a look at the API doc here and a related StackOverflow question here.
The Boolean query retrieval algorithm is very simple. The list elements (i.e. the document IDs) corresponding to each term are stored in sorted order, so that at run time the union and intersection of two lists can be computed in time linear in their sizes, i.e. O(n1 + n2). This is very similar to the merge step of mergesort.
You can find more information in this book chapter.
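As a rough illustration of that merge-style processing, here is a minimal Java sketch. It assumes each postings list holds plain integer document IDs in strictly increasing order (the frequencies from the question's TreeMultimap are left out), and the class name, method names and sample postings are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PostingsMerge {

    // AND: IDs present in both sorted lists, found in one linear pass.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) { i++; }
            else { j++; }
        }
        return out;
    }

    // OR: IDs present in either list, without duplicates.
    static List<Integer> union(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j == b.size()) { out.add(a.get(i++)); }
            else if (i == a.size()) { out.add(b.get(j++)); }
            else {
                int cmp = Integer.compare(a.get(i), b.get(j));
                if (cmp < 0) { out.add(a.get(i++)); }
                else if (cmp > 0) { out.add(b.get(j++)); }
                else { out.add(a.get(i)); i++; j++; }
            }
        }
        return out;
    }

    // AND NOT: IDs in a that do not occur in b.
    static List<Integer> difference(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size()) {
            if (j == b.size()) { out.add(a.get(i++)); continue; }
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp < 0) { out.add(a.get(i++)); }
            else if (cmp > 0) { j++; }
            else { i++; j++; }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up postings for four stemmed terms.
        List<Integer> achiev = Arrays.asList(1, 4, 7, 9);
        List<Integer> skill = Arrays.asList(1, 7, 9, 12);
        List<Integer> book = Arrays.asList(7, 8, 9);
        List<Integer> video = Arrays.asList(2, 9);
        // (achieving AND skills) AND (NOT (book AND video))
        System.out.println(difference(intersect(achiev, skill), intersect(book, video))); // [1, 7]
    }
}
```

A NOT that stands alone would need the full list of document IDs as its left operand; combined with AND, as in the query above, it reduces to the difference shown here. Parsing the parentheses into such calls is a separate step, which Lucene's query parser already handles.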

Searching for the first matching element after a specific node (XPath and ITunes XML)

It's not necessary to post my full code because I have just a short question. I'm searching an XML document with XPath for a text value. I have XML like this:
<key>Name</key>
<string>Dat Ass</string>
<key>Artist</key>
<string>Earl Sweatshirt</string>
<key>Album</key>
<string>Kitchen Cutlery</string>
<key>Kind</key>
<string>MPEG-Audiodatei</string>
I have an Expression like this:
//string[preceding-sibling::key[text()[contains(., 'Name')]]]/text()
but this gives me ALL following string tags; I just want the first one, with the song title.
greets Alex
Use:
(//string[preceding-sibling::key[1] = 'Name'])[1]/text()
Alternatively, one can use a forward-only expression:
(//key[. = 'Name'])[1]/following-sibling::string[1]/text()
Do note:
This is a common error. Any expression of the kind:
//someExpr[1]
Doesn't select "the first node in the document from all nodes selected by //someExpr". In fact it can select many nodes.
The above expression selects any node that is selected by //someExpr and that is the first such child of its parent.
This is why, without brackets, the other answer to this question is generally incorrect.
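A small Java check makes the "first per parent" versus "first in the document" distinction visible. This is a sketch with the JDK's javax.xml.xpath API; the tracks and dict wrapper elements, the second track and all Java names are assumptions added so that there is more than one parent to select from.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FirstMatchDemo {

    // Two tracks; only the first one comes from the question.
    static final String XML =
            "<tracks>"
            + "<dict><key>Name</key><string>Dat Ass</string>"
            + "<key>Artist</key><string>Earl Sweatshirt</string></dict>"
            + "<dict><key>Name</key><string>Second Song</string>"
            + "<key>Artist</key><string>Someone Else</string></dict>"
            + "</tracks>";

    // Hypothetical helper: text content of every node the expression selects.
    static List<String> select(String xml, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Without parentheses, [1] is applied per parent: one hit per dict.
        System.out.println(select(XML,
                "//string[preceding-sibling::key[1] = 'Name'][1]/text()"));
        // [Dat Ass, Second Song]

        // With parentheses, [1] indexes the whole selected list.
        System.out.println(select(XML,
                "(//string[preceding-sibling::key[1] = 'Name'])[1]/text()"));
        // [Dat Ass]

        // Forward-only variant.
        System.out.println(select(XML,
                "(//key[. = 'Name'])[1]/following-sibling::string[1]/text()"));
        // [Dat Ass]
    }
}
```

On a real iTunes library with many tracks, the unparenthesized form would return one title per track dictionary.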
You can just add another predicate [1] to select the first matching node. The nested predicate using text() should be unnecessary:
//string[preceding-sibling::key[contains(., 'Name')]][1]/text()
Another, perhaps more efficient, way to select this node would be
//key[contains(., 'Name')]/following-sibling::*[1][self::string]
This selects the first node (with any name) following the wanted key node and tests if its name is string.

Jdoms annoying textnodes and addContent(index, Element) - schema solutions?

I have some already generated XML files, and the application causing problems now needs to add elements to them. These elements need to be at a specific position to be valid against the applied schemas.
There are two problems. The first one is that I have to hardcode the positions, which is not that nice but "ok".
But the much bigger one is JDOM. I printed the content list and it looks like:
element1
text
element2
element4
text
element5
The text nodes are just whitespace, and every element I add makes it even more unpredictable how many text nodes there are (sometimes some are added, sometimes not). They are counted as if they were elements, but I want to ignore them: when I add element3 at index 2, it does not end up between element2 and element4 but after this annoying text node.
Any suggestions? The best solution, in my opinion, would be something that automatically puts the element where it has to be according to the schema, but I think that's not possible?
Thanks for advice :)
The JDOM Model of the XML is very literal... it has to be. On the other hand, JDOM offers ways to filter and process the XML in a way that should make your task easier.
In your case, you want to add Element content to the document, and all the text content is whitespace. So just ignore all the text content, and worry about the Element content only.
For example, if you want to insert a new element nemt at index 3 of the child elements (that is, before the fourth element, since the index is zero-based), you can:
rootemt.getChildren().add(3, new Element("nemt"));
The elements are now sorted out. What about the text?
A really simple solution is to just pretty-print the output:
XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
xout.output(System.out, mydoc);
That way all the whitespace will be reformatted to make the XML 'pretty'.
EDIT - and no, there is no way with JDOM to automatically insert the element in the right place according to the schema....
Rolf

Is this possible to develop some criteria based search on the Strings in C# or JAVA?

I have one List in C#. This string list contains the paragraphs that are read from an MS Word file. For example:
list 0-> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1->The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there rae the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search string against the list elements.
My criterion is: "If a list element is an 85% match or an exact match of the search string, then we want to retrieve that list element."
In our case,
list 0 -> satisfies my search string better.
list 1 -> also matches some text, but I think it falls below my criterion.
How do I do this kind of criteria-based search on strings?
I am also somewhat confused about my problem.
I welcome your ideas and thoughts.
The keyword is DISTANCE or "string distance", and also "paragraph similarity".
You seek to implement a function which would express as a scalar, say a percentage as suggested in the question, how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at character level rather than at word level, but generally these algorithms convey the idea of what is needed.
Working at word level you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of' etc.) and maybe allowing for some forms of stemming. The order of the words, or at the least their proximity, may also be of import.
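To make the word-level idea concrete, here is a minimal sketch of a Levenshtein distance computed over word tokens instead of characters, turned into a 0..1 similarity. The tokenizer (lowercase, split on anything that is not an ASCII letter or digit), the normalization by the longer token count and all names are assumptions made for this example; stop words and stemming are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordEditDistance {

    // Very simple tokenizer: lowercase, split on non-alphanumeric characters.
    static String[] tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out.toArray(new String[0]);
    }

    // Classic dynamic-programming edit distance, but each "symbol" is a word.
    // Runs in O(a.length * b.length) time and O(b.length) space.
    static int distance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int substitution = prev[j - 1] + (a[i - 1].equals(b[j - 1]) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // 1.0 means identical word sequences, 0.0 means nothing lines up.
    static double similarity(String x, String y) {
        String[] a = tokens(x);
        String[] b = tokens(y);
        int longest = Math.max(a.length, b.length);
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(similarity("the basic text elements", "the basic table elements")); // 0.75
    }
}
```

Because this variant is sensitive to word order, a paragraph that shares only its opening sentences with the search string will score low, which may or may not be what an "85% match" should mean.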
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm you'll need to get an idea of the general parameters of the problem:
how many strings would have to be compared? (on average, maximum)
how many words/tokens do the strings contain? (on average, max)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared ?
how fancy do we need to get with linguistic features ?
is it possible to pre-process the strings ?
Are all the records in a single language ?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design-time and run-time effort one can apply to this relatively open problem varies greatly, and is typically a compromise between the level of precision desired and the run-time resources and overall complexity of the solution which may be acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
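One possible reading of that TF-IDF sum is sketched below, under several assumptions: the IDF is computed from the list of paragraphs itself as ln(N / df), the term frequency is the raw count in the candidate paragraph, and the score is left unnormalized (so it ranks candidates rather than giving a percentage). The class name, method names and sample sentences are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TfIdfOverlap {

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // idf(w) = ln(N / df(w)), where df counts the paragraphs containing w.
    // A word that occurs in every paragraph gets weight 0.
    static Map<String, Double> inverseDocumentFrequency(List<String> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            for (String w : new HashSet<>(tokenize(doc))) {
                df.merge(w, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) corpus.size() / e.getValue()));
        }
        return idf;
    }

    // Sum of tf * idf over the distinct query words that occur in the paragraph.
    static double score(String query, String doc, Map<String, Double> idf) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : tokenize(doc)) {
            tf.merge(w, 1, Integer::sum);
        }
        double sum = 0.0;
        for (String w : new HashSet<>(tokenize(query))) {
            Integer count = tf.get(w);
            Double weight = idf.get(w);
            if (count != null && weight != null) {
                sum += count * weight;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "the report shows the table and the image",
                "the report covers styles and custom tags",
                "the cat sat on the mat");
        Map<String, Double> idf = inverseDocumentFrequency(corpus);
        for (String doc : corpus) {
            System.out.println(score("table image report", doc, idf) + "  " + doc);
        }
    }
}
```

Common words such as "the" contribute nothing here because they appear in every paragraph, which is the effect the stop-word remark above is after.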
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example part-of-speech tagging (say, for the purpose of avoiding false positives such as "SAW" as a noun (a tool to cut wood) versus "SAW" as the past tense of the verb "to see", or more likely to filter out some of the words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction or latent semantic analysis.
You may want to look into Lucene for Java or Lucene.NET for C#. I don't think it'll do the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You could perhaps run a separate query for each word, and then work out for yourself the percentage of words that matched.
Here's an idea (and not a solution by any means but something to get started with)
private IEnumerable<string> SearchList = GetAllItems(); // load your list
void Search(string searchPara)
{
    char[] delimiters = new char[]{' ','.',','};
    var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a=>a.ToLower()).OrderBy(a => a);
    foreach (var item in SearchList)
    {
        var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Select(a => a.ToLower()).OrderBy(a => a);
        var common = wordsInItem.Intersect(wordsInSearchPara);
        // now that you know the common items, you can get the differential
    }
}
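For the Java side of the question, the same idea can be carried one step further to an actual percentage. This is only a sketch: it assumes the score is the share of a list element's distinct words that also occur in the search string (other denominators are equally defensible), and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class OverlapSearch {

    // Distinct lowercase words of a text.
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Fraction (0.0 to 1.0) of the candidate's distinct words found in the search text.
    static double overlap(String candidate, String search) {
        Set<String> candidateWords = words(candidate);
        if (candidateWords.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(candidateWords);
        common.retainAll(words(search));
        return (double) common.size() / candidateWords.size();
    }

    // Keeps the items whose overlap reaches the threshold, e.g. 0.85 for "85%".
    static List<String> search(List<String> items, String searchPara, double threshold) {
        List<String> hits = new ArrayList<>();
        for (String item : items) {
            if (overlap(item, searchPara) >= threshold) {
                hits.add(item);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList(
                "the main report shows a table",
                "the main report covers custom styles");
        String searchPara = "The main report shows a table and an image.";
        System.out.println(search(items, searchPara, 0.85)); // [the main report shows a table]
    }
}
```

Word order and repeated words are ignored, so this is best seen as a cheap first filter before a more expensive comparison.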
