JSoup Element selection by CSS rule: is there any documented ordering? - java

I am currently using JSoup CSS selection to get a list of some elements in an HTML document.
To verify the robustness of my algorithm, though, I need to know in which order the elements are visited and, therefore, returned.
My concern is strictly about nested elements. Suppose I search for all the elements in a document like the following:
<div> Something <span style='color:red;'>special</span> for me </div>
and I run in JSoup:
Document doc = Jsoup.parse(myCode);
Elements els = doc.select("*");
in which order will those two elements be traversed, and therefore returned? I am currently looking at the documentation page for the select method, but no information is provided on the traversal order. Is there any more precise reference I can look at?
Clearly I can proceed by trial and error to infer the ordering, but I would like to know whether this is already known or someone has already dug into it, since I do not know beforehand the HTML structure of the documents I have to parse.
Thanks!
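One way to run the trial-and-error check mentioned above is sketched below. It is only an illustration under stated assumptions: it uses the JDK's built-in DOM parser rather than JSoup (the snippet happens to be well-formed XML), and the class and method names are made up for this example. It lists the elements in document order, i.e. depth-first with a parent before its children. Whether doc.select("*") in JSoup follows the same order should be confirmed by printing the tag names of the returned Elements in the same way. Note that Jsoup.parse also wraps a fragment in html, head and body elements, so the JSoup result will contain more than these two elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of JSoup.
public class DocumentOrderDemo {

    // Returns the tag names of all elements in document order
    // (depth-first, parent before child), using the JDK DOM parser.
    static List<String> tagsInDocumentOrder(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The DOM API specifies that "*" matches every element, in document order.
            NodeList all = doc.getElementsByTagName("*");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                names.add(all.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the snippet", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div> Something <span style='color:red;'>special</span> for me </div>";
        System.out.println(tagsInDocumentOrder(xml)); // prints [div, span]
    }
}
```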

Related

Why Elements is empty?

I am using JSoup for the first time, so I'm not familiar with it. I have already read the Cookbook, but I still don't know why the Elements are still empty. Am I missing something?
Document doc = Jsoup.connect("http://sports.news.naver.com/sports/" +
"index.nhn?category=baseball.html").get();
Elements teams= doc.select("td.t_name");
Elements wins= doc.select("td.win");
System.out.println(teams.isEmpty());
System.out.println(wins.isEmpty());
Maybe because there is no "td.t_name" and "td.win" in the document.
You should make sure that the document you get from http://sports.news.naver.com/sports/index.nhn?category=baseball.html
contains the data you want to select.
When I debugged your code, I didn't see any "td.win" or "td.t_name" in the document.
Note that data loaded via AJAX will not be downloaded by JSoup.

Is there a way to call nextElementSibling in Selenium Webdriver with Java?

I am working with a really messy page structure and the fragment that I am stuck on looks something like this:
<div>
<h3>...</h3>
<ul>...</ul>
<h3>...</h3>
<ul>...</ul>
...
</div>
I want to get one of the <ul> elements, so I could dig deeper into it and retrieve the actual value that I really need from there (it has a table inside it). Currently I am able to get the <h3> element that precedes the <ul> I am looking for. Since ul-elements don't have any unique identifiers that I could use to get them directly, I am hoping to achieve it by getting the element that comes after the h3-tag (on the same level). Is there a way to get what seems to be nextElementSibling?
Thank you!
NB! h3 and ul elements don't have strict sequence number - there may or may not be a few elements before them, so getting an n-th child does not seem to be an option there.
You can achieve this with either xpath or by executing some javascript.
Xpath:
driver.findElement(By.xpath("//div/h3/following-sibling::ul[1]"));
JavaScript:
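JavascriptExecutor jsExec = (JavascriptExecutor) driver;
// nextSibling may return a whitespace text node; nextElementSibling skips to the next element
WebElement ulElement = (WebElement) jsExec.executeScript("return arguments[0].nextElementSibling;", driver.findElement(By.cssSelector("div h3")));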
Hope that helps!
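The following-sibling axis used above can be tried without a browser. The sketch below is only an illustration: it runs the same kind of expression with the JDK's XPath engine against a small well-formed fragment, and the class name, method name and sample markup are assumptions made for this example. It also assumes the heading text contains no apostrophe, since the text is pasted into the expression.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; Selenium is not involved here.
public class FollowingSiblingDemo {

    // Returns the text content of the first <ul> that follows, on the same level,
    // the <h3> whose text equals the given heading.
    static String listAfterHeading(String xml, String heading) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            String expr = "//h3[normalize-space(.)='" + heading + "']"
                    + "/following-sibling::ul[1]";
            // evaluate(String, InputSource) returns the string value of the first match
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div><p>intro</p>"
                + "<h3>A</h3><ul><li>one</li></ul>"
                + "<h3>B</h3><ul><li>two</li></ul></div>";
        System.out.println(listAfterHeading(xml, "B")); // prints two
    }
}
```

With Selenium, the same idea is usually written relative to the h3 element that was already located, for example h3Element.findElement(By.xpath("following-sibling::ul[1]")), where h3Element is a placeholder name.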

selenium - extracting value from cell (td)

I'm writing a Java test using Selenium. To validate the correctness of the data, I'm trying to extract the values of the table cells. Although all cells have different values and meanings, the <td> elements all look the same and have the same attributes, like so:
<td onclick="show_data('2','2','rowDetails.php','myID','434b2410aef9e61d6237dbbe562689a9b84','644');">2</td>
The naive solution would be to extract all <td> tags and then go by index.
Is there a better way?
If your elements have no unique identifier, then the common way to solve your problem is to fetch all the TD elements and loop over them. This is something you already seem to be aware of, as you describe it as: extract all tags and then go by index.
However, the example you provided does contain a unique identifier, namely the value of the onclick attribute. By using CSS selectors (rather than XPath), you can select based on the value of the onclick attribute. This should help you narrow down the elements you are looking for.
For a list of CSS selectors, see: http://www.w3schools.com/cssref/css_selectors.asp
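As a rough sketch of that suggestion, the helper below builds a CSS attribute selector that matches a td whose onclick value contains a given fragment. The class and method names are made up for this example, and it is an assumption that the last argument of show_data ('644' in the sample) is what distinguishes the cells; use whichever part of the attribute is actually unique in your table.

```java
// Hypothetical helper; it only builds the selector string.
public class OnclickSelector {

    // Builds td[onclick*="fragment"], the CSS "attribute contains substring" selector.
    static String tdWithOnclickContaining(String fragment) {
        // Escape backslashes and double quotes so the fragment stays inside the quoted value
        String escaped = fragment.replace("\\", "\\\\").replace("\"", "\\\"");
        return "td[onclick*=\"" + escaped + "\"]";
    }

    public static void main(String[] args) {
        String selector = tdWithOnclickContaining("'644'");
        System.out.println(selector); // prints td[onclick*="'644'"]
        // With Selenium on the classpath, the selector would be used as:
        // WebElement cell = driver.findElement(By.cssSelector(selector));
    }
}
```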

How do I get the CSS selector of an element with Jsoup?

I have an element extracted from the DOM using JSOUP. I want to get the CSS selector of that element, so I can quickly find the equivalent elements on other pages with the same structure. Is this possible?
Thanks
I doubt that it's possible, because multiple selectors could be valid for your element -- eg, the trivial Selector.select("*",rootElement) would match it.
It sounds like you don't want to use the same element-extraction code (that you initially used) for all subsequent documents? If you're intent on using selectors, then perhaps try different ones until you find one that you're happy with (from either a readability or a performance perspective).
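To make the "many selectors are valid" point concrete, here is a minimal sketch of one possible choice: a path-style selector built from the nearest ancestor with an id, or from the root, using :nth-child steps. It uses the JDK DOM parser instead of JSoup so that it runs on its own; the class name, method names and sample markup are assumptions for this example, and ids are assumed to be plain identifiers. As far as I know, newer JSoup releases also offer Element.cssSelector(), which is worth checking in the version you use.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of JSoup.
public class CssPathDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the snippet", e);
        }
    }

    // Walks up from the target and emits one step per ancestor,
    // stopping early at the first element that has an id.
    static String cssPath(Element target) {
        Deque<String> parts = new ArrayDeque<>();
        Node node = target;
        while (node instanceof Element) {
            Element e = (Element) node;
            String id = e.getAttribute("id"); // empty string when absent
            if (!id.isEmpty()) {
                parts.addFirst("#" + id);
                break;
            }
            if (!(e.getParentNode() instanceof Element)) {
                parts.addFirst(e.getTagName()); // root element: tag name only
                break;
            }
            int index = 1; // :nth-child counts element siblings only, starting at 1
            for (Node s = e.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s instanceof Element) {
                    index++;
                }
            }
            parts.addFirst(e.getTagName() + ":nth-child(" + index + ")");
            node = e.getParentNode();
        }
        return String.join(" > ", parts);
    }

    public static void main(String[] args) {
        Document doc = parse("<div><h3>a</h3><ul id='x'><li>1</li><li>2</li></ul></div>");
        Element secondItem = (Element) doc.getElementsByTagName("li").item(1);
        System.out.println(cssPath(secondItem)); // prints #x > li:nth-child(2)
    }
}
```

A selector built this way is tied to element positions, so it only carries over to other pages if their structure really is the same.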

Page scrape for a particular div

I am wondering if there is a way to read the html output of a given webpage using Java?
I know in php you can do something like:
$handle = @fopen("http://www.google.com", "r");
$source_code = fread($handle,9000);
I am looking for the Java equivalent.
Additionally, once I have the rendered html are there any Java utilities that would allow me to strip out a single div by its id?
Thanks for any help with this.
Use jsoup.
You have the choice between a tree model and a powerful query syntax similar to CSS or jQuery selectors, plus utility methods to quickly get the source of a webpage.
To quote from their website:
Fetch the Wikipedia homepage, parse it to a DOM, and select the
headlines from the In the news section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Once you have found the Element representing the div you want to remove, just call remove() on it.
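For comparison with the PHP fopen/fread snippet in the question, here is a minimal JDK-only sketch of the fetching step, assuming Java 11 or later. The class and method names are made up for this example. It returns the HTML as sent by the server, so anything a browser would add by running scripts is not included; picking out a div by id is then left to a parser such as jsoup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class for fetching a page as a string.
public class PageFetch {

    // Builds a plain GET request for the given URL.
    static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Downloads the response body as text, following ordinary redirects.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        return client.send(request(url), HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String html = fetch("http://www.google.com");
        System.out.println(html.length() + " characters received");
        // With jsoup on the classpath, the string could then be parsed with
        // Jsoup.parse(html) and the div looked up with getElementById("someId"),
        // where "someId" is a placeholder.
    }
}
```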
