Jsoup: Performance of top-level select() vs. inner-level select() - java

My understanding is that once a document is loaded into Jsoup using Jsoup.parse(), no parsing is required again, as a neat hierarchical tree is ready for the programmer's use.
What I am not sure about is whether a top-level select() is more costly than an inner-level select().
For example, if we have a <p> buried inside many nested <div>s, and that <p>'s parent is already available in the program, will there be any performance difference between:
document.select("p.pclass")
and
pImediateParent.select("p.pclass")
?
How does that work in Jsoup?
UPDATE: Based on the answer below, I understand that both document.select() and pImediateParent.select() use the same exact static method, just with a different root as the second parameter:
public Elements select(String query) {
    return Selector.select(query, this);
}
Which translates into:
/**
 * Find elements matching selector.
 *
 * @param query CSS selector
 * @param root  root element to descend into
 * @return matching elements, empty if not
 */
public static Elements select(String query, Element root) {
    return new Selector(query, root).select();
}
I am not surprised, but the question now is: how does that query work? Does it iterate to find the queried element, or is it random access (as in a hash table)?

Yes, it will be faster if you use the intermediate parent. If you check the Jsoup source code, you'll see that Element#select() actually delegates to the Selector#select() method with the Element itself as the second argument. Now, the javadoc of that method says:
select
public static Elements select(String query, Element root)
Find elements matching selector.
Parameters:
query - CSS selector
root - root element to descend into
Returns:
matching elements, empty if not
Note the description of the root parameter: the selection only descends into the element it is called on, so a narrower root means fewer nodes to visit. So yes, it definitely makes a difference. Not shocking, but there is some difference.
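To answer the follow-up in the update: select() is a traversal, not a hash lookup. The CSS query is compiled into an Evaluator, and the collection step visits the root and every descendant of the root, testing each node against that evaluator (as the Collector.collect() javadoc quoted further down this page describes). Below is a minimal sketch of the scoping effect; the HTML string and the div.inner selector are made up for illustration:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScopedSelectDemo {
    public static void main(String[] args) {
        // Made-up markup: the target <p> is buried inside nested <div>s.
        String html = "<div class='outer'><div class='inner'><p class='pclass'>hello</p></div></div>";
        Document document = Jsoup.parse(html);

        // Selecting from the document root traverses every node in the tree.
        Elements fromRoot = document.select("p.pclass");

        // Selecting from the immediate parent traverses only that subtree.
        Element pImediateParent = document.select("div.inner").first();
        Elements fromParent = pImediateParent.select("p.pclass");

        System.out.println(fromRoot.size() + " / " + fromParent.size()); // both find the same single <p>
    }
}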


What's the difference between peekOption and headOption in Vavr Collection

Here is the doc for Vavr List's peekOption:
https://www.javadoc.io/doc/io.vavr/vavr/0.10.1/io/vavr/collection/List.html#peekOption--
Here is the doc for Vavr Traversable's headOption:
https://www.javadoc.io/doc/io.vavr/vavr/0.10.1/io/vavr/collection/Traversable.html#headOption--
The implementation seems exactly the same, so with the kind of usage below I can use either, but which is best?
MyObject myObject = myJavaCollection.stream()
        .filter(SomePredicate::isTrue)
        .collect(io.vavr.collection.List.collector()) // collect to a Vavr list to have Vavr methods available
        .peek(unused -> LOGGER.info("some log"))
        .map(MyObject::new)
        .peekOption() // or .headOption()
        .getOrNull();
So I was wondering, what is the difference between those methods?
From the source code of Vavr's List (see https://github.com/vavr-io/vavr/blob/master/src/main/java/io/vavr/collection/List.java) we have:
/**
 * Returns the head element without modifying the List.
 *
 * @return {@code None} if this List is empty, otherwise a {@code Some} containing the head element
 * @deprecated use headOption() instead
 */
@Deprecated
public final Option<T> peekOption() {
    return headOption();
}
So they do exactly the same thing, as you say, and since peekOption() is deprecated, headOption() is the one to use.
As for the reason to use one over the other:
It looks like the Vavr List interface defines some stack-related methods (like push, pop, peek, etc.) to make it convenient to use lists as stacks, if you should want that. (For example, you would use peekOption() if you consider the list to be a stack and headOption() otherwise.)
These stack methods are, however, all deprecated, probably because there is always a non-stack method that can be used instead. So the maintainers seem to have backed away from the idea that "a list is also a stack", maybe because they thought it mixes concepts a bit and makes the interface too big (just a guess). That must be why headOption() is preferred: all the stack methods are deprecated.
(Normal Java lists also have stack methods, but that is through an interface, so all lists are also stacks while you can have a stack which is not a list.)
According to their documentation (List and Traversable)
List's peekOption
default Option peekOption()
Returns the head element without modifying the List.
Returns:
None if this List is empty, otherwise a Some containing the head element
Traversable's headOption
default Option headOption()
Returns the first element of a non-empty Traversable as Option.
Returns:
Some(element) or None if this is empty.
They act exactly the same way. They either return the head element or Option.none(). Only their variants head and peek throw an exception if no elements are present. List simply happens to have two methods that behave the same way only because it extends the Traversable interface.
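For completeness, here is a minimal sketch of how both behave on empty and non-empty lists (the class and variable names are made up):
import io.vavr.collection.List;
import io.vavr.control.Option;

public class HeadOptionDemo {
    public static void main(String[] args) {
        List<String> names = List.of("alice", "bob");
        List<String> empty = List.empty();

        Option<String> some = names.headOption(); // Some("alice")
        Option<String> none = empty.headOption(); // None, no exception thrown

        System.out.println(some.getOrNull()); // alice
        System.out.println(none.getOrNull()); // null

        // names.head() returns "alice", but empty.head() would throw NoSuchElementException.
    }
}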

Is jsoup Document thread safe?

Is it safe to use jsoup someDocument.select(..) from multiple threads or is there some internal state for read operations?
You can safely call Document.select(String cssSelector) from multiple threads even though the Document class is not thread-safe. The underlying implementation of the .select(String cssSelector) method passes a reference to the element that called it (the Document object in this case), but it does not call any method that changes the state of the caller.
When you call .select(String cssSelector), you eventually call the Collector.collect(Evaluator eval, Element root) method, where the root instance is a reference to the Document object.
/**
 * Build a list of elements, by visiting root and every descendant of root, and testing it against the evaluator.
 *
 * @param eval Evaluator to test elements against
 * @param root root of tree to descend
 * @return list of matches; empty if none
 */
public static Elements collect(Evaluator eval, Element root) {
    Elements elements = new Elements();
    new NodeTraversor(new Accumulator(root, elements, eval)).traverse(root);
    return elements;
}
In this method only the local elements object gets updated.
Why is the Document class not thread-safe?
There are a few methods in the Document class that change the state of the object without any synchronization mechanism, e.g. Document.outputSettings(Document.OutputSettings outputSettings). Ideally the Document class would be final and immutable, so sharing its instance between multiple threads wouldn't be a problem.
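A minimal sketch of the safe pattern this answer describes (the selector and thread count are arbitrary): parse once, then only read from the worker threads, since each select() call collects matches into its own Elements result:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentSelectDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><a href='/a'>a</a><a href='/b'>b</a></div>");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                // Read-only query: traverses the shared tree but collects matches
                // into a new Elements list local to this call.
                Elements links = doc.select("a[href]");
                System.out.println(Thread.currentThread().getName() + ": " + links.size());
            });
        }
        pool.shutdown();
        // Avoid calling mutating methods such as doc.outputSettings(...) or
        // doc.append(...) while other threads may still be selecting.
    }
}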

FindBys and List<WebElement> always return null list

As far as my understanding goes, the FindBys annotation in PageFactory returns elements which satisfy all the conditions mentioned inside. The code below always returns 0 elements.
Similarly, if I use the FindAll annotation with the same id and XPath attributes, it returns me 2 web elements. Can anyone help me understand these results?
@FindBys(
    {
        @FindBy(xpath = "//*[@id='ctl00_ctl00_divWelcome']"),
        @FindBy(id = "ctl00_ctl00_divWelcome")
    }
)
public List<WebElement> allElementsInList;
Your understanding is wrong.
The documentation for @FindBys says:
Used to mark a field on a Page Object to indicate that lookup should use a series of @FindBy tags in a chain as described in org.openqa.selenium.support.pagefactory.ByChained
Further, the documentation for ByChained says:
Mechanism used to locate elements within a document using a series of other lookups. This class will find all DOM elements that matches each of the locators in sequence, e.g. driver.findElements(new ByChained(by1, by2)) will find all elements that match by2 and appear under an element that matches by1.
So in your example, you are looking for an element by XPath with a specific ID, and then its child element by the same ID ... which, of course, is not going to return anything.
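To make the contrast concrete, here is a sketch of a page object using both annotations with the locators from the question (the page itself and the field names are assumptions):
import java.util.List;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.FindAll;
import org.openqa.selenium.support.FindBy;
import org.openqa.selenium.support.FindBys;

public class WelcomePage {

    // Chained lookup: elements matching the second @FindBy are searched for
    // *inside* an element matched by the first. The div cannot be its own
    // descendant, so this list stays empty.
    @FindBys({
        @FindBy(xpath = "//*[@id='ctl00_ctl00_divWelcome']"),
        @FindBy(id = "ctl00_ctl00_divWelcome")
    })
    public List<WebElement> chainedLookup;

    // Union lookup: every element matching either locator is included, so the
    // same div is reported twice, once per locator.
    @FindAll({
        @FindBy(xpath = "//*[@id='ctl00_ctl00_divWelcome']"),
        @FindBy(id = "ctl00_ctl00_divWelcome")
    })
    public List<WebElement> unionLookup;
}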

Why does Node have no getElementByTagName()?

I would like to understand why I have to do the following when I want to access a specific element value during XML parsing:
NodeList controlList = poDoc.getElementsByTagName("control");
Node controlNode = controlList.item(0);
Element controlElem = (Element) controlNode;
usageType = controlElem.getElementsByTagName("usage_type").item(0).getFirstChild().getNodeValue();
Here I have to cast controlNode to (Element), only because I want to access another element deeper down the DOM tree. This is all working as expected; I just would like to understand why it is the way it is. Why can't there be a getElementsByTagName or similar call on the Node object? Or is there one and I just don't know it? Since I'm quite new to Java, that might be the case. There surely is a better reason for this than "because that's the way the interface was implemented".
Only documents and elements can contain elements.
So the designers of the DOM API simply decided to define the method getElementsByTagName only in the Node classes Document and Element.
An alternative design would have been to define getElementsByTagName in the Node class and return an empty node list if the node cannot contain elements. (This is roughly the design decision made by the XPath spec).
By XML standards, every entity in an XML document is a Node, and not everything in an XML document can have child elements. The parser can't know if a referenced Node is a header, an element, or even a comment, so it would be unwise to have such a method without first checking its type.
Even if you expect your XML to be formatted a certain way, it's typical to check if a Node is actually an Element, for example:
if (node instanceof Element) {
    NodeList usagetypes = ((Element) node).getElementsByTagName("usage_types");
    // ...
}
According to the javadoc, a Node is any piece of data that can exist in an XML document, including comments, headers, and text (the text value of an XML element), so not all kinds of nodes can have a "name" or have child elements.
An Element defines the kind of node that may have child elements, which may be retrieved by name.
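A minimal sketch of that type check while walking a node's children (assuming doc is an already-parsed org.w3c.dom.Document); text and comment nodes are skipped because only elements can carry a tag name and child elements:
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NodeTypeDemo {
    static void printChildElementNames(Document doc) {
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            // Text, comments and processing instructions are Nodes too,
            // so cast only after checking the node type.
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                Element element = (Element) node;
                System.out.println(element.getTagName());
            }
        }
    }
}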

Can't extract sub divs from site in java

I want to extract data from a site.
For example, when I try to get the price with the code below, I can't:
deal.getDetail().setPriceElement(content.select("div#main-new div.buy-now-aligner div.buy-now-price").first());
But I can extract data with deal.getDetail().setPriceElement(content.select("div#main-new").first());
I can't reach the sub divs; how can that be?
You are using the method first() in the wrong way.
Look at the Jsoup API:
public Element first()
Get the first matched element.
Returns:
The first matched element, or null if contents is empty.
This means the Element object that is returned is the first match of your selection, in your case the first div with the buy-now-price class.
If you want the child elements of that element (there is only one in your example URL), you can use either the child() method or the children() method.
The first takes the index of the child you want as a parameter, and the second returns a collection of Element objects as an Elements.
Use whichever is suitable for you.
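A minimal sketch of both options (the selector is copied from the question; whether the price div has any children, and how many, depends on the actual page):
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class PriceExtractor {
    // `content` is the already-selected container element from the question.
    static void printPriceChildren(Element content) {
        Element priceDiv = content.select("div#main-new div.buy-now-aligner div.buy-now-price").first();
        if (priceDiv != null && priceDiv.children().size() > 0) {
            Element firstChild = priceDiv.child(0);      // one child element, by index
            Elements allChildren = priceDiv.children();  // all child elements as a collection
            System.out.println(firstChild.text());
            System.out.println(allChildren.size());
        }
    }
}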
