I am able to iterate through all div elements in a document, using getElementsByTag("div").
Now I want to build a list of only div elements that have the attribute "id" (i.e. div elements with attribute "class" shouldn't be in the list).
Intuitively, I was thinking of checking something like this:
if (divElement.attr("id") != "")
add_to_list(divElement);
Is my approach correct at all?
Is there a more optimal way of testing for having the "id" attribute? (the above uses string comparison for every element in the DOM document)
You can do it like this:
Elements divsWithId = doc.select("div[id]");
for(Element element : divsWithId){
// do something
}
Reference:
JSoup > Selector Syntax
Try this:
var all_divs = document.getElementsByTagName("div");
var divs_with_id = [];
for (var i = 0; i < all_divs.length; i++)
if (all_divs[i].hasAttribute("id"))
divs_with_id.push(all_divs[i]);
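The same presence test can be sketched in Java. This is a minimal sketch that uses only the JDK's `org.w3c.dom` API instead of jsoup, so it assumes well-formed XML/XHTML input. The class name `DivsWithId` and the helper `idsOfDivsWithId` are made up for illustration. Note that `hasAttribute` checks for presence, so a div with an empty `id=""` still counts.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DivsWithId {
    // Hypothetical helper: returns the id values of all <div> elements
    // that carry an id attribute, in document order.
    static List<String> idsOfDivsWithId(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList allDivs = doc.getElementsByTagName("div");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < allDivs.getLength(); i++) {
                Element div = (Element) allDivs.item(i);
                // Presence test: no string comparison against "" is needed.
                if (div.hasAttribute("id")) {
                    ids.add(div.getAttribute("id"));
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><div id=\"a\"/><div class=\"x\"/>"
                + "<div id=\"b\" class=\"y\"/></body>";
        System.out.println(idsOfDivsWithId(xml)); // prints [a, b]
    }
}
```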
Related
I've got multiple div elements on a webpage that share the same class attribute.
I'm looking for a way to find all the divs on my page with a specific class name and then get the XPath of each one.
So far, I can count how many divs on the page have a given class name:
List<WebElement> test = driver.findElements(By.className("fitter"));
int countDiv = test.size();
Do you have any idea how to get the XPath of each of these elements, one by one? Or do you have a different solution?
This is a Selenium project in which I'm testing a web page. My tests are written in Java and I'm using WebDriver.
Thanks for your help.
Simply iterate through the list of webelements?
List<WebElement> elements = driver.findElements(By.className("fitter"));
for (int i = 0; i < elements.size(); i++) {
WebElement element = elements.get(i);
// ... do whatever you need here
}
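The XPath of each element in the loop would be as follows (XPath positions are 1-based, so use i + 1 for the loop index i; the parentheses make the position apply to the whole result set):
(//div[contains(@class, 'fitter')])[i + 1]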
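A minimal sketch of how such positional XPath strings could be built and checked, using the JDK's own XPath engine rather than Selenium. The class `NthMatchXPath` and its helpers are hypothetical names, and the sample markup is invented. It illustrates two details: XPath positions start at 1, and the parentheses matter, because without them the position is counted among the children of each parent rather than across the whole document.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class NthMatchXPath {
    // Builds the XPath of the n-th div (zero-based index, as in the Java loop)
    // whose class attribute contains the given class name.
    static String xpathForNth(String className, int zeroBasedIndex) {
        return "(//div[contains(@class, '" + className + "')])["
                + (zeroBasedIndex + 1) + "]";
    }

    // Hypothetical check: returns the id attribute of the element selected
    // by elementXPath, or "" if nothing matches.
    static String idAt(String xml, String elementXPath) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(elementXPath + "/@id",
                            new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath: " + elementXPath, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body>"
                + "<section><div class=\"fitter\" id=\"a\"/></section>"
                + "<section><div class=\"fitter\" id=\"b\"/></section>"
                + "</body>";
        for (int i = 0; i < 2; i++) {
            String xpath = xpathForNth("fitter", i);
            System.out.println(xpath + " -> " + idAt(xml, xpath));
        }
        // Without parentheses, [2] means "second matching div of its parent",
        // which matches nothing here because each section holds one div.
        System.out.println(idAt(xml, "//div[contains(@class, 'fitter')][2]").isEmpty());
    }
}
```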
I am looking to get data from the table on http://www.sportinglife.com/greyhounds/abc-guide using jSoup. I would like to put this data into some kind of table within my java program that I can then use in my code.
I'm not too sure how to do this. I have been playing around with jSoup and currently am able to get each cell from the table to print out using a while loop - but obviously can't use this always as the number of cells in the table will change.
Document doc = Jsoup.connect("http://www.sportinglife.com/greyhounds/abc-guide").get();
int n = 0;
while (n < 100){
Element tableHeader = doc.select("td").get(n);
for( Element element : tableHeader.children() )
{
// Here you can do something with each element
System.out.println(element.text());
}
n++;
}
Any idea of how I could do this?
There are just a few things you have to implement to achieve your goal. Take a look at this Groovy script: https://gist.github.com/wololock/568b9cc402ea661de546 Now let's walk through what we have here.
List<Element> rows = document.select('table[id=ABC Guide] > tbody > tr')
Here we specify that we are interested in every tr row that is an immediate child of a tbody, which in turn is an immediate child of the table with id ABC Guide. In return you receive a list of Element objects that describe those tr rows.
Map<String, String> data = new HashMap<>()
We will store the result in a simple hash map for further processing, e.g. putting the scraped data into a database.
for (Element row : rows) {
String dog = row.select('td:eq(0)').text()
String race = row.select('td:eq(1)').text()
data.put(dog, race)
}
Now we iterate over every Element and select the content of the first cell as text: String dog = row.select('td:eq(0)').text(). We repeat this step to retrieve the content of the second cell as text: String race = row.select('td:eq(1)').text(). Then we simply put that data into the hash map. That's all.
I hope this example and its description help you develop your application.
EDIT:
Java code sample - https://gist.github.com/wololock/8ccbc6bbec56ef57fc9e
I am trying to parse XML from a URL using Jsoup.
The XML contains nodes with namespaces,
for example: <wsdl:types>
Now I want to get all nodes named "types", whatever their namespace.
I am able to get these nodes using the expression "wsdl|types".
But how can I get all nodes named "types" in any namespace?
I tried the expression "*|types" but it didn't work.
Please help.
There is no such selector (yet). But you can use a workaround; it is not as easy to read as a selector, but it works.
/*
* Connect to the URL and parse the document; an XML parser is used
* instead of the default one (html)
*/
final String url = "http://www.consultacpf.com/webservices/producao/cdc.asmx?wsdl";
Document doc = Jsoup.connect(url).parser(Parser.xmlParser()).get();
// Elements of any tag, but with 'types' are stored here
Elements withTypes = new Elements();
// Select all elements
for( Element element : doc.select("*") )
{
// Split the tag by ':'
final String s[] = element.tagName().split(":");
/*
* If there's a namespace (otherwise s.length == 1) use the 2nd
* part and check if the element has 'types'
*/
if( s.length > 1 && s[1].equals("types") == true )
{
// Add this element to the found elements list
withTypes.add(element);
}
}
You can put the essential parts of this code into a method, so you get something like this:
Elements nsSelect(Document doc, String value)
{
// Code from above
}
...
Elements withTypes = nsSelect(doc, "types");
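The tag-name check at the heart of nsSelect can be isolated as a plain string helper. This is a minimal sketch: the class `LocalNameFilter` and its methods are hypothetical names, and plain tag-name strings stand in for the values that element.tagName() would return. Unlike the loop above, this variant also accepts an unprefixed `types` tag; this is a deliberate difference, so drop that case if only namespaced tags should match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LocalNameFilter {
    // Strips an optional "prefix:" from a tag name, e.g. "wsdl:types" -> "types".
    static String localName(String tagName) {
        int colon = tagName.indexOf(':');
        return colon < 0 ? tagName : tagName.substring(colon + 1);
    }

    // Keeps the tag names whose local part equals the given value exactly.
    static List<String> withLocalName(List<String> tagNames, String value) {
        List<String> matches = new ArrayList<>();
        for (String tagName : tagNames) {
            if (localName(tagName).equals(value)) {
                matches.add(tagName);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList(
                "wsdl:definitions", "wsdl:types", "s:schema", "types", "xs:subtypes");
        System.out.println(withLocalName(tags, "types")); // prints [wsdl:types, types]
    }
}
```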
I'm using JSoup to retrieve reviews from a particular webpage on Amazon, and what I have now is this:
Document doc = Jsoup.connect("http://www.amazon.com/Presto-06006-Kitchen-Electric-Multi-Cooker/product-reviews/B002JM202I/ref=sr_1_2_cm_cr_acr_txt?ie=UTF8&showViewpoints=1").get();
String title = doc.title();
Element reviews = doc.getElementById("productReviews");
System.out.println(reviews);
This gives me the block of HTML that contains the reviews, but I want only the text, without all the tags (div etc.). I then want to write all this information into a file. How can I do this? Thanks!
Use the text() method:
System.out.println(reviews.text());
While text() will get you a bunch of text, you'll want to first use jsoup's select(...) methods to subdivide the problem into individual review elements. I'll give you the first big division, but it will be up to you to subdivide it further:
public static List<Element> getReviewList(Element reviews) {
List<Element> revList = new ArrayList<Element>();
Elements eles = reviews.select("div[style=margin-left:0.5em;]");
for (Element element : eles) {
revList.add(element);
}
return revList;
}
If you analyze each element, you should see how Amazon further subdivides the information it holds, including the title of the review, the date of the review and the body text.
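For the remaining part of the question, writing the extracted text to a file, here is a minimal sketch using only java.nio. The class `ReviewWriter` and the method `writeReviews` are hypothetical names, and the hard-coded strings stand in for the values you would collect by calling text() on each review element.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class ReviewWriter {
    // Writes one review text per line, creating or overwriting the target file.
    static void writeReviews(List<String> reviewTexts, Path target) {
        try {
            Files.write(target, reviewTexts, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder texts; in the real program these would come from element.text().
        List<String> texts = Arrays.asList("Great cooker", "Stopped working after a month");
        Path target = Files.createTempFile("reviews", ".txt");
        writeReviews(texts, target);
        System.out.println("Wrote " + texts.size() + " reviews to " + target);
    }
}
```

Because text() returns plain strings, collecting them into a list first keeps the scraping and the file handling separate.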
I am trying to get all the info contained in the div with class bg_block_info, but instead I get info from another div, <div class="bg_block_info pad_20">. Why am I getting it wrong?
Document doc = Jsoup.connect("http://www.maib.md").get();
Elements myin = doc.getElementsByClass("bg_block_info");
You can combine and chain selectors to refine your query, e.g.:
Document doc = Jsoup.connect("http://www.maib.md/").get();
Elements els = doc.getElementsByClass("bg_block_info").not(".pad_10").not(".pad_20");
That element has two classes (notice the space between bg_block_info and pad_20):
<div class="bg_block_info pad_20">
So it does have the class bg_block_info and your code is working as expected.
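To make the point concrete: a class attribute is a whitespace-separated list of class names, and a class lookup succeeds if any one of them matches. The following minimal sketch models that rule with plain strings. The class `ClassTokens` and its methods are hypothetical names and are not part of jsoup.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ClassTokens {
    // Splits a class attribute value into its individual class names.
    static List<String> tokens(String classAttr) {
        String trimmed = classAttr.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // True if the attribute lists the class, possibly among others.
    static boolean hasClass(String classAttr, String name) {
        return tokens(classAttr).contains(name);
    }

    // True only if the class is the single class on the element.
    static boolean hasOnlyClass(String classAttr, String name) {
        List<String> all = tokens(classAttr);
        return all.size() == 1 && all.get(0).equals(name);
    }

    public static void main(String[] args) {
        String attr = "bg_block_info pad_20";
        System.out.println(hasClass(attr, "bg_block_info"));     // prints true
        System.out.println(hasOnlyClass(attr, "bg_block_info")); // prints false
    }
}
```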
Elements downloadLinks = dContent.select("a[href]");
Elements pdfLinks = downloadLinks.select("a[data-format$=pdf]");
Full reference jsoup selector syntax
In your case you could probably use Element content = doc.getElementById("pollsstart"); instead of Elements myin = doc.getElementsByClass("bg_block_info");.
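To select only the elements that carry both classes, join the class names with a dot (not a comma) in a CSS selector and pass it to select(), since getElementsByClass() expects a single class name rather than a selector:
Elements myin = doc.select("div.bg_block_info.pad_20");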