Hi, I need help extracting information from XML with XPath.
I want to use XPath to extract the value of an attribute from a tag whose name contains a generic keyword:
<st:Testprova id='abcd'>
....
</st>
or
<st:Test1prova id='defg'>
....
</st>
I used this XPath expression:
"//*[contains(.,'prova')]/@ID"
but it does not work. Can you help me?
You are using @ID instead of @id; attribute names are case sensitive. Besides, you should use name() to test the node name, because . tests the element's string value rather than its name.
This XPath expression
//*[contains(name(),'prova')]/@id
returns abcd and defg
Your XML is not well-formed, though. Each closing tag must match its opening tag, so it should be:
<st:Testprova id='abcd'>
....
</st:Testprova>
<st:Test1prova id='defg'>
....
</st:Test1prova>
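As a rough illustration, the expression above can be tried with the XPath 1.0 engine that ships with the JDK. This is only a sketch: the class name ProvaIdExtractor, the wrapping root element, the urn:example:st namespace URI, and the extra st:Other element are assumptions added so that the sample parses and has something to filter out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ProvaIdExtractor {
    // Hypothetical sample: the corrected elements inside a root that declares the st prefix.
    static final String SAMPLE =
        "<root xmlns:st='urn:example:st'>"
        + "<st:Testprova id='abcd'>x</st:Testprova>"
        + "<st:Test1prova id='defg'>y</st:Test1prova>"
        + "<st:Other id='zzzz'>z</st:Other>"
        + "</root>";

    // Returns the id values of all elements whose name contains "prova".
    static List<String> idsOfProvaElements(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//*[contains(name(), 'prova')]/@id", doc, XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            ids.add(attrs.item(i).getNodeValue());
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(idsOfProvaElements(SAMPLE));
    }
}
```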
The correct function to be used in this case is matches().
contains() can return true even for node names like
Testprova
Prodprova
UATprova
provaTest
and any other name that contains the word prova.
But if you know the pattern the node names follow, the matches() function selects exactly the desired nodes.
So if I assume a single digit might appear between the two words, the XPath can be written as below. It tests local-name() rather than name(), because name() includes the st: prefix and would never satisfy the ^ anchor:
//*[matches(local-name(), '^Test[0-9]?prova$')]/@id
Note: matches() is part of XPath 2.0 and will not work in XPath 1.0.
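Where only an XPath 1.0 engine is available (the one bundled with the JDK, for instance), one possible workaround is to select candidate elements with XPath and apply the regular expression in Java. The sketch below assumes a namespace-aware parse; the class name, the sample document, and the urn:example:st URI are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ProvaRegexFilter {
    // Same pattern as in the matches() expression, applied to the local name.
    private static final Pattern NAME = Pattern.compile("^Test[0-9]?prova$");

    // Hypothetical sample with two names that contains() would also accept.
    static final String SAMPLE =
        "<root xmlns:st='urn:example:st'>"
        + "<st:Testprova id='abcd'/>"
        + "<st:Test1prova id='defg'/>"
        + "<st:Prodprova id='nope1'/>"
        + "<st:provaTest id='nope2'/>"
        + "</root>";

    static List<String> matchingIds(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so getLocalName() drops the prefix
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // XPath 1.0 narrows to elements that carry an id; the regex runs in Java.
        NodeList candidates = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//*[@id]", doc, XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < candidates.getLength(); i++) {
            Element e = (Element) candidates.item(i);
            if (NAME.matcher(e.getLocalName()).matches()) {
                ids.add(e.getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(matchingIds(SAMPLE));
    }
}
```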
There are two elements with the same class:
<div class="website text:middle"> A</div>
<div class="website text:middle"> 1</div>
How do I get A and 1? I tried using getElementById with :eq(0), and it returns null.
Method getElementById queries for elements with a specified id, not class; I'm not sure what you were trying to query with :eq(0) either.
Try:
// String html = ...
Document doc = Jsoup.parse(html);
List<String> result = doc.getElementsByClass("text:middle").eachText();
// result = ["A", "1"]
EDIT
You can query for elements that match multiple classes! See Jsoup select div having multiple classes.
However, a colon (:) is a special character in css and needs to be escaped when it appears as part of a class name in a selector query. I don't think that jsoup currently supports this and simply treats everything after a colon as a pseudo-class.
To add to Janez's correct answer - while jsoup's CSS selector (currently) doesn't support escaping a : character in the class name, there are other ways to get it to work if you want to use the select() method instead of getElementsByXXX -- e.g. if you want to combine selectors in one call:
Elements divs = doc.select("div[class=website text:middle]");
That will find div elements with the literal attribute class="website text:middle".
Or:
Elements divs = doc.select("div[class~=text:middle]");
That finds elements whose class attribute matches the regex /text:middle/.
For the presented data, though, I think the getElementsByClass() DOM method is the way to go and the most general. I just wanted to show a couple of alternatives for other cases.
document.querySelectorAll(".website")[0] // 0 is child index
In a browser, you can use querySelectorAll; it is supported by all current browsers.
What is the XPath expression to select <link> elements with type="application/rss+xml" OR type="application/atom+xml" (RSS and Atom feeds)?
link[@rel='alternate'][@type='application/rss+xml'] selects RSS feeds
link[@rel='alternate'][@type='application/atom+xml'] selects Atom feeds
But what is the single XPath expression for selecting them both?
Thank you.
Use:
link[@rel='alternate'][@type='application/rss+xml' or @type='application/atom+xml']
See http://www.w3.org/TR/xpath/#NT-OrExpr
You could also use a union to accomplish this:
link[@rel='alternate'][@type='application/rss+xml']|link[@rel='alternate'][@type='application/atom+xml']
but or will do.
If you want to get fancy and use XPath 2.0, it is more elegant (but potentially confusing, depending on who might be reading the code) to write it like this:
link[@rel='alternate'][@type = ('application/rss+xml', 'application/atom+xml')]
The reason is that XPath 2.0 redefines '=' to apply to sequences: the comparison returns true if any item from the left-hand sequence equals any item from the right-hand sequence. This can be very useful if the list of things you want to compare with is dynamic.
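For comparison, the two XPath 1.0 forms (or and union) can be checked against each other with the JDK's built-in XPath engine. The following is a sketch only; the class name, the sample head element, and the href values are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FeedLinkFinder {
    static final String OR_EXPR =
        "link[@rel='alternate'][@type='application/rss+xml' or @type='application/atom+xml']";
    static final String UNION_EXPR =
        "link[@rel='alternate'][@type='application/rss+xml']"
        + "|link[@rel='alternate'][@type='application/atom+xml']";

    // Hypothetical sample document; the href values are placeholders.
    static final String SAMPLE =
        "<head>"
        + "<link rel='alternate' type='application/rss+xml' href='/rss'/>"
        + "<link rel='alternate' type='application/atom+xml' href='/atom'/>"
        + "<link rel='stylesheet' type='text/css' href='/style.css'/>"
        + "</head>";

    // Evaluates a relative link expression with the root element as context node.
    static List<String> feedHrefs(String xml, String linkExpr) throws Exception {
        Element head = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
        NodeList links = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(linkExpr, head, XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < links.getLength(); i++) {
            hrefs.add(((Element) links.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(feedHrefs(SAMPLE, OR_EXPR));
        System.out.println(feedHrefs(SAMPLE, UNION_EXPR));
    }
}
```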
Say I have an XPath string like /Results/Bill[Item[id]]/id. I need to add namespace information to the path, so that the path is transformed to this: /*:Results/*:Bill[*:Item[*:id]]/*:id.
I was thinking of using a regex to do this, something like "prepend "*:" to any alphanumeric character that is not preceded by another alphanumeric character". However, I don't have much regex knowledge and don't know what regex this would correspond to (I'm planning to use Java's replaceAll() function once I have the regex). Also, can anyone think of a counterexample where my idea wouldn't work? I'll just be performing the replacement operation on XPath strings with simple predicates (i.e. no and, or, etc. between the square brackets).
You might get a regex solution to work with some kind of subset of XPath expressions, but you will never get it to work with all XPath expressions. The XPath grammar is just too complicated.
(The most obvious bugs in your initial proposal are that it fails on variable names like $var, function names like count(..) and axis names like parent::* or @code. You could solve that by checking for the relevant punctuation before or after the symbol. Checking for text inside comments or string literals is a bit trickier. But distinguishing "div" as an element name from "div" as an operator is way beyond what a regex approach can do: it needs a full context-sensitive parser.)
Better suggestion: use a tool that gives you a parse tree for the XPath expression, modify that parse tree, and then re-serialize the modified tree into XPath syntax.
See for example what can be done with Gunther Rademacher's Rex tool, or with the W3C XQuery parser applets (both easily found with google).
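To make the limits of the regex idea concrete, here is a small sketch of the rule as the question states it, written with a lookbehind and a lookahead. The class and method names are hypothetical. It handles the example path, but it also prefixes function names and numeric predicates, which are the kinds of counterexamples mentioned above.

```java
final class NaiveXPathPrefixer {
    // Zero-width position where the next character is alphanumeric
    // and the previous character (if any) is not.
    private static final String NAME_START = "(?<![A-Za-z0-9])(?=[A-Za-z0-9])";

    // Inserts "*:" at every such position; no XPath parsing is attempted.
    static String addWildcardPrefix(String xpath) {
        return xpath.replaceAll(NAME_START, "*:");
    }

    public static void main(String[] args) {
        // The case from the question, which the rule handles as intended.
        System.out.println(addWildcardPrefix("/Results/Bill[Item[id]]/id"));
        // Cases where the rule rewrites things that are not element names.
        System.out.println(addWildcardPrefix("count(//Bill)"));
        System.out.println(addWildcardPrefix("/Results/Bill[1]"));
    }
}
```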
In Java, how do you properly determine if XPath selector targets attribute or element?
To explain the issue: I need to get text from WebDriver's WebElement, either the innerText of the element or its attribute, depending on the XPath. Unfortunately, each extraction is done differently (see below), so I first have to determine what the intended target is, element or attribute:
String getStringValue(String selector, WebElement context) {
    if (targetsAttribute(selector)) {
        // Locate the owning element, then read the attribute from it.
        WebElement node = context.findElement(By.xpath(elemPart(selector)));
        return node.getAttribute(attrName(selector));
    } else {
        return context.findElement(By.xpath(selector)).getText();
    }
}
I'm looking for implementations of the targetsAttribute, elemPart and attrName methods. Currently I use regexes:
Pattern ATTR_PAT = Pattern.compile("^.*/@([^/]+)$");
Pattern ELEM_PAT = Pattern.compile("^(.*)/@[^/]+$");
But I find this approach ugly and non-systematic. It doesn't match attribute::, for example. Is there some way to do this using a standard library?
NOTE: I'm actually trying to solve similar problem as in following question, only going a bit higher:
How to get the value of an attribute using XPath
You might be able to use the XPath expression parser that's part of Saxon XSLT/XQuery processor.
ExpressionParser's parseExpression() method should be able to give you the information you need.
If you do figure it out, please post your code (as an answer) because I don't know that anybody else has posted a solution.
Edit:
Actually, it's impossible to construct an algorithm that will correctly answer, for every XPath expression, whether it will select an element or an attribute. This is because the type of result returned by an XPath expression can depend on the input. E.g. the XPath expression
//foo | //bar/@baz
could return elements, attributes, both, or neither, depending on what elements and attributes exist in the document.
However, using the parsing tools mentioned above would probably give you your best chance at figuring out, for a subset of XPath expressions, whether they can return an attribute or not.
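The input dependence described above can be observed with plain JAXP and DOM (not WebDriver). The sketch below is illustrative only: the class name and the sample documents are assumptions. It also shows that, in the DOM API at least, Node.getTextContent() reads the text of elements and attributes alike.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class UnionResultKinds {
    // Describes each selected node as "kind:name=text".
    static List<String> describe(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            String kind = n.getNodeType() == Node.ATTRIBUTE_NODE ? "attribute" : "element";
            // getTextContent() works for both element and attribute nodes.
            out.add(kind + ":" + n.getNodeName() + "=" + n.getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String expr = "//foo | //bar/@baz";
        // Same expression, three hypothetical documents, three different result shapes.
        System.out.println(describe("<r><foo>hello</foo><bar baz='x'/></r>", expr));
        System.out.println(describe("<r><foo>hello</foo></r>", expr));
        System.out.println(describe("<r/>", expr));
    }
}
```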
It seems to me that the inability to get the string value of an XPath expression, regardless of whether it selects an element or an attribute, is a serious shortcoming in the WebDriver API. Unless it provides that ability in some other way that I'm not aware of.
The lack of a node-agnostic means to address text content is a problem in many (if not all!) XPath APIs. And, as already indicated, there is no completely general way to determine in advance whether an XPath expression selects attributes or elements, as it could select both, with a disjunctive combination.
If you can rule out disjunctions (or treat each piece separately) then, heuristically, it all depends on what follows the final slash in the expression: if the remainder starts with '@' (or 'attribute::'), you're selecting an attribute; otherwise, an element. This is not bullet-proof, but from experience I've found that this is good enough in practice. Your heuristic approach is as good as any.
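One way the final-step heuristic might be written, covering both the abbreviated and the attribute:: form, is sketched below. It is not a parser: it assumes no union operator and a plain attribute name as the last step, and edge cases such as //@href are not handled properly. The three method names come from the question; the class name and the pattern are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeSelectorHeuristic {
    // Group 1: path to the owning element; group 2: attribute name.
    // Brackets are excluded from the name so a "/@x" inside a predicate is not taken
    // for the final step.
    private static final Pattern ATTR_STEP =
        Pattern.compile("^(.*)/(?:@|attribute::)([^/\\[\\]]+)$");

    static boolean targetsAttribute(String selector) {
        return ATTR_STEP.matcher(selector.trim()).matches();
    }

    static String elemPart(String selector) {
        return match(selector).group(1);
    }

    static String attrName(String selector) {
        return match(selector).group(2);
    }

    private static Matcher match(String selector) {
        Matcher m = ATTR_STEP.matcher(selector.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an attribute selector: " + selector);
        }
        return m;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"//a/@href", "//a/attribute::href", "//a[b/@c]/d"}) {
            System.out.println(s + " -> " + targetsAttribute(s));
        }
    }
}
```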
I need a regular expression to detect a span element where the order of id and class doesn't matter. The name of the class is always the same, and the id is always a fixed number of digits, for example:
<span class="className" id="123">
or
<span id="321" class="className" >
My approach for a regular expression in java was:
String pattern = "<span class=\"className\" id=\"\\d*\">";
but this way I can match only one of the two versions. Can somebody help?
Thanks, hansa
Don't parse HTML with regular expressions. HTML isn't regular.
This should do it:
String r = "<span (?=[^<>]*\\bclass=\"className\")[^<>]*\\bid=\"(\\d+)\"[^<>]*>";
The lookahead confirms that the span is of the desired class without consuming any characters. Then the rest of the regex, starting from the same position, searches for the id attribute and captures its value. The [^<>]* takes care of any other attributes that might be present, while ensuring that all matching occurs within the tag. (Technically, angle brackets can appear in attribute values, but you probably don't have to worry about that.)
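A short sketch of how that pattern could be exercised from Java follows. The class name SpanIdFinder and the sample input in main are made up for illustration; the pattern itself is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpanIdFinder {
    // The lookahead checks the class without consuming input; group 1 captures the id digits.
    private static final Pattern SPAN = Pattern.compile(
        "<span (?=[^<>]*\\bclass=\"className\")[^<>]*\\bid=\"(\\d+)\"[^<>]*>");

    // Collects the id of every matching span tag, in order of appearance.
    static List<String> findIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String html = "<span class=\"className\" id=\"123\">"
            + "<span id=\"321\" class=\"className\" >"
            + "<span class=\"other\" id=\"9\">";
        System.out.println(findIds(html));
    }
}
```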
I would do a two-step version, first finding the span tag with:
<span[^>]*class=\"className\"[^>]*>
And then dig out the id from the tags that match the first pattern with
id=\"(\d+)\"
As others have pointed out, it's not a good idea to parse HTML with regular expressions. But for quick-and-dirty data processing, this is how I would do it.
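The two steps could be combined in Java roughly as follows. This is a sketch under the same assumptions as the question (double-quoted attributes, class value exactly className); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStepSpanIdFinder {
    // Step 1: any span start tag that carries class="className".
    private static final Pattern SPAN_TAG =
        Pattern.compile("<span[^>]*class=\"className\"[^>]*>");
    // Step 2: the numeric id inside a tag found by step 1.
    private static final Pattern ID_ATTR = Pattern.compile("id=\"(\\d+)\"");

    static List<String> findIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher tag = SPAN_TAG.matcher(html);
        while (tag.find()) {
            Matcher id = ID_ATTR.matcher(tag.group());
            if (id.find()) {
                ids.add(id.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String html = "<span class=\"className\" id=\"123\">"
            + "<span id=\"321\" class=\"className\" >"
            + "<span class=\"other\" id=\"9\">";
        System.out.println(findIds(html));
    }
}
```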