Retrieving Reviews from Amazon using JSoup - java

I'm using JSoup to retrieve reviews from a particular webpage on Amazon, and what I have now is this:
Document doc = Jsoup.connect("http://www.amazon.com/Presto-06006-Kitchen-Electric-Multi-Cooker/product-reviews/B002JM202I/ref=sr_1_2_cm_cr_acr_txt?ie=UTF8&showViewpoints=1").get();
String title = doc.title();
Element reviews = doc.getElementById("productReviews");
System.out.println(reviews);
This gives me the block of HTML that contains the reviews, but I want only the text, without all the div tags and so on. I then want to write all this information to a file. How can I do this? Thanks!

Use the text() method:
System.out.println(reviews.text());
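The question also asks about writing the result to a file. A minimal sketch using java.nio (the file name reviews.txt is just an example):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Write the plain review text to a file; Files.write throws IOException,
// so the enclosing method must declare or handle it.
Files.write(Paths.get("reviews.txt"),
        reviews.text().getBytes(StandardCharsets.UTF_8));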

While text() will get you a bunch of text, you'll first want to use jsoup's select(...) methods to subdivide the problem into individual review elements. I'll give you the first big division, but it will be up to you to subdivide it further:
public static List<Element> getReviewList(Element reviews) {
    List<Element> revList = new ArrayList<Element>();
    Elements eles = reviews.select("div[style=margin-left:0.5em;]");
    for (Element element : eles) {
        revList.add(element);
    }
    return revList;
}
If you analyze each element, you should see how Amazon further subdivides the information it holds, including the title of the review, the date of the review, and the body text.
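For example, here is a rough sketch of pulling fields out of each review element. The b and nobr selectors are assumptions about Amazon's review markup at the time and need to be verified against the actual page source:

for (Element review : getReviewList(reviews)) {
    // These selectors are guesses; inspect the page source to confirm them.
    String title = review.select("b").text();    // review title
    String date = review.select("nobr").text();  // review date
    String body = review.ownText();              // review body text
    System.out.println(title + " (" + date + "): " + body);
}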

Related

Remove Duplicate URLs from Elements list in jSoup?

When using jSoup to scrape a page, all of the links on the page can be gathered using:
Elements allLinksOnPage = doc.select("a");
Which is great. Now how would one go about removing duplicate URLs from this list? I.e. imagine a /contact-us.html that is linked in both the main navigation and elsewhere on the page.
Once all the duplicate URLs have been removed, then the next step is to crawl those unique URLs and continue the loop.
A question on the practicalities of this, though. For the code:
for (Element e : allLinksOnPage) {
    String absUrl = e.absUrl("href");
    // Absolute URL starts with http or https to make sure it isn't a mailto: link
    if (absUrl.startsWith("http") || absUrl.startsWith("https")) {
        // Check that the URL starts with the original domain name
        if (absUrl.startsWith(getURL)) {
            // Remove duplicate URLs
            // Not sure how to do this bit yet?
            // Add new URLs found on page to 'allLinksOnPage' to allow this
            // for loop to continue until the entire website has been scraped
        }
    }
}
So the question being, the final part of the loop, imagine when page-2.html is crawled, more URLs are identified on here and added to the allLinksOnPage variable.
Will the for loop continue for the length of the full list, i.e. 10 links found on page-1.html and 10 links on page-2.html, so 20 pages will be crawled in total? Or will the loop only run for the first 10 links identified, i.e. the links found before the 'for (Element e : allLinksOnPage)' loop was triggered?
This will all inevitably end up in a database once the logic is finalised, but I'm looking to keep the logic purely Java-based initially, to avoid lots of reads/writes to the DB, which would slow everything down.
allLinksOnPage is iterated only once. You never retrieve any information about the pages you found links to.
You can use a Set and a List for this, however. Furthermore, you can use the URL class to extract the protocol for you.
URL startUrl = ...;
Set<String> addedPages = new HashSet<>();
List<URL> urls = new ArrayList<>();

addedPages.add(startUrl.toExternalForm());
urls.add(startUrl);

while (!urls.isEmpty()) {
    // retrieve a url not yet crawled
    URL url = urls.remove(urls.size() - 1);
    Document doc = Jsoup.parse(url, TIMEOUT);
    Elements allLinksOnPage = doc.select("a");
    for (Element e : allLinksOnPage) {
        // add hrefs
        URL absUrl = new URL(e.absUrl("href"));
        switch (absUrl.getProtocol()) {
            case "https":
            case "http":
                if (absUrl.toExternalForm().startsWith(getURL) && addedPages.add(absUrl.toExternalForm())) {
                    // add the url, if it has not been added already
                    urls.add(absUrl);
                }
        }
    }
}
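Note that urls.remove(urls.size() - 1) takes from the end of the list, so this crawls depth-first; swapping the List for a FIFO queue such as ArrayDeque would give a breadth-first crawl instead.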

Get data from table using jSoup

I am looking to get data from the table on http://www.sportinglife.com/greyhounds/abc-guide using jSoup. I would like to put this data into some kind of table within my Java program that I can then use in my code.
I'm not too sure how to do this. I have been playing around with jSoup and am currently able to get each cell from the table to print out using a while loop, but obviously I can't always use this, as the number of cells in the table will change.
Document doc = Jsoup.connect("http://www.sportinglife.com/greyhounds/abc-guide").get();
int n = 0;
while (n < 100){
Element tableHeader = doc.select("td").get(n);
for( Element element : tableHeader.children() )
{
// Here you can do something with each element
System.out.println(element.text());
}
n++;
}
Any idea of how I could do this?
There are just a few things you have to implement to achieve your goal. Take a look at this Groovy script: https://gist.github.com/wololock/568b9cc402ea661de546. Now let's explain what we have here:
List<Element> rows = document.select('table[id=ABC Guide] > tbody > tr')
Here we're specifying that we are interested in every tr row that is an immediate child of tbody, which in turn is an immediate child of the table with id ABC Guide. In return you receive a list of Element objects describing those tr rows.
Map<String, String> data = new HashMap<>()
We will store our result in a simple hash map for further processing, e.g. putting the scraped data into a database.
for (Element row : rows) {
    String dog = row.select('td:eq(0)').text()
    String race = row.select('td:eq(1)').text()
    data.put(dog, race)
}
Now we iterate over every Element and select the text content of the first cell, String dog = row.select('td:eq(0)').text(), then repeat this step to retrieve the text content of the second cell, String race = row.select('td:eq(1)').text(). Then we simply put that data into the hash map. That's all.
I hope this example and its description will help you with developing your application.
EDIT:
Java code sample - https://gist.github.com/wololock/8ccbc6bbec56ef57fc9e
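For reference, a minimal Java sketch of the same approach (the table id and cell layout are taken from the Groovy script above and may have changed on the live page):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.HashMap;
import java.util.Map;

// Jsoup.connect(...).get() throws IOException, so the enclosing
// method must declare or handle it.
Document document = Jsoup.connect("http://www.sportinglife.com/greyhounds/abc-guide").get();
Elements rows = document.select("table[id=ABC Guide] > tbody > tr");
Map<String, String> data = new HashMap<>();
for (Element row : rows) {
    String dog = row.select("td:eq(0)").text();
    String race = row.select("td:eq(1)").text();
    data.put(dog, race);
}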

WebDriver filtering list of elements

How can I efficiently find and filter a list of elements?
Here is the HTML:
<span class="tab-strip-text" unselectable="on">Admin</span>
<span class="tab-strip-text" unselectable="on">User</span>
<span class="tab-strip-text" unselectable="on">Reports</span>
<span class="tab-strip-text" unselectable="on">Logs</span>
Currently I am using the following method to find, filter, and click on the element I want based on its text:
public static void clickTab(String tabText) {
    List<WebElement> tabs = driver.findElements(By.className("tab-strip-text"));
    for (WebElement tab : tabs) {
        if (tab.getText().equals(tabText)) {
            tab.click();
            break;
        }
    }
}
Is there a better way to find and iterate over the list (to click based on the text() of elements)?
Thanks!
Use XPath with the text you are after in your locators.
//*[@class='tab-strip-text' and text()='Reports']
Then you have:
WebElement reportTab = driver.findElement(By.xpath("//*[@class='tab-strip-text' and text()='Reports']"));
reportTab.click();
Note that I don't encourage you to use text in your locators if your site supports multiple languages. In that case, the best approach is to add meaningful class names or IDs to each element in your source.
Try this XPath:
//span[contains(text(),'Reports')]
String value = "text you are looking for";

public void method(String value) {
    // Use findElement (not findElements): it returns a single WebElement that can be clicked
    driver.findElement(By.xpath("//span[contains(text(),'" + value + "')]")).click();
}
I know this post is from a long time ago, but I guess someone will be looking for a solution :)
tabs.forEach(tab -> {
    if (tab.getText().equals(tabText)) {
        tab.click();
    }
});
I am using Stream's filter(). This will help you click on an element based on the given text.
filter returns a stream consisting of the elements of this stream that match the given predicate.
List<WebElement> categories = driver.findElements(By.className("tab-strip-text"));
categories.stream().filter(ele->ele.getText().equalsIgnoreCase("Admin")).forEach(ele -> ele.click());
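One caveat: forEach will attempt to click every matching element, and the first click often changes the page, making the remaining elements stale. If you only want the first match, findFirst() is safer:

categories.stream()
        .filter(ele -> ele.getText().equalsIgnoreCase("Admin"))
        .findFirst()
        .ifPresent(WebElement::click);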

Get element by class in JSoup

I'm trying to get all the info contained in the div with class bg_block_info, but instead I get info for another div, <div class="bg_block_info pad_20">. Why am I getting it wrong?
Document doc = Jsoup.connect("http://www.maib.md").get();
Elements myin = doc.getElementsByClass("bg_block_info");
You can combine and chain selectors to refine your query, e.g.:
Document doc = Jsoup.connect("http://www.maib.md/").get();
Elements els = doc.getElementsByClass("bg_block_info").not(".pad_10").not(".pad_20");
That element has two classes (notice the space between bg_block_info and pad_20):
<div class="bg_block_info pad_20">
So it does have the class bg_block_info and your code is working as expected.
Elements downloadLinks = dContent.select("a[href]");
Elements pdfLinks = downloadLinks.select("a[data-format$=pdf]");
Full reference: jsoup selector syntax.
In your case you might use Element content = doc.getElementById("pollsstart"); instead of Elements myin = doc.getElementsByClass("bg_block_info");.
Just chain the two class names with dots in a selector (note that getElementsByClass expects a bare class name, not a selector, so use select instead). It should be like this:
Elements myin = doc.select("div.bg_block_info.pad_20");

Jsoup: Optimal way of checking whether a <div> has an ID

I am able to iterate through all div elements in a document, using getElementsByTag("div").
Now I want to build a list of only div elements that have the attribute "id" (i.e. div elements with attribute "class" shouldn't be in the list).
Intuitively, I was thinking of checking something like this:
if (!divElement.attr("id").isEmpty())
    add_to_list(divElement);
Is my approach correct at all?
Is there a more optimal way of testing for having the "id" attribute? (the above uses string comparison for every element in the DOM document)
You can do it like this:
Elements divsWithId = doc.select("div[id]");
for (Element element : divsWithId) {
    // do something
}
Reference:
JSoup > Selector Syntax
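If you are already iterating over Element objects, jsoup's Element also offers hasAttr (inherited from Node), which avoids the manual string comparison from the question:

List<Element> divsWithId = new ArrayList<>();
for (Element div : doc.getElementsByTag("div")) {
    if (div.hasAttr("id")) {
        divsWithId.add(div);
    }
}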
Try this (note: this is plain JavaScript for the browser DOM, not jsoup):
var all_divs = document.getElementsByTagName("div");
var divs_with_id = [];
for (var i = 0; i < all_divs.length; i++)
    if (all_divs[i].hasAttribute("id"))
        divs_with_id.push(all_divs[i]);
