Jsoup scraping text from children of div - java

I am trying to extract a review of the product at the link - Moto X - using Jsoup, but it is throwing a NullPointerException. Also, I want to extract the text which is shown after clicking the "Read More" link of the review.
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class JSoupEx
{
    public static void main(String[] args) throws IOException
    {
        Document doc = Jsoup.connect("https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA").get();
        Element ele = doc.select("div[class=qwjRop] > div").first();
        System.out.println(ele.text());
    }
}
Any solutions?

As gherkin suggested, using the network tab in the developer tools, we see a request that receives the reviews (in JSON format) as a response:
https://www.flipkart.com/api/3/product/reviews?productId=MOBEFM5HAFRNSJJA&count=15&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL&start=0
Using a JSON parser like JSON.simple we can extract information like review author, usefulness and text.
Example Code
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36";
String reviewApiCall = "https://www.flipkart.com/api/3/product/reviews?productId=MOBEFM5HAFRNSJJA&count=15&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL&start=";
String xUserAgent = userAgent + " FKUA/website/41/website/Desktop";
String referer = "https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA";
String host = "www.flipkart.com";
int numberOfPages = 2; // first two pages of results will be fetched

try {
    // loop over multiple review pages (15 reviews per page)
    for (int i = 0; i < numberOfPages; i++) {
        // query reviews
        Response response = Jsoup.connect(reviewApiCall + (i * 15)).userAgent(userAgent).referrer(referer).timeout(5000)
                .header("x-user-agent", xUserAgent).header("host", host).ignoreContentType(true).execute();
        System.out.println("Response in JSON format:\n\t" + response.body() + "\n");

        // parse JSON response
        JSONObject jsonObject = (JSONObject) new JSONParser().parse(response.body());
        jsonObject = (JSONObject) jsonObject.get("RESPONSE");
        JSONArray jsonArray = (JSONArray) jsonObject.get("data");
        for (Object object : jsonArray) {
            jsonObject = (JSONObject) object;
            jsonObject = (JSONObject) jsonObject.get("value");
            System.out.println("Author: " + jsonObject.get("author") + "\thelpful: "
                    + jsonObject.get("helpfulCount") + "\n\t"
                    + jsonObject.get("text").toString().replace("\n", "\n\t") + "\n");
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
Output
Response in JSON format:
{"CACHE_INVALIDATION_TTL":"132568825671","REQUEST":null,"REQUEST-ID": [...] }
Author: Flipkart Customer helpful: 140
A great phone at an affordable price with
-an outstanding camera
-great battery life
-an excellent display
-premium looks
the flipkart delivery was also fast and perfect.
Author: Vaibhav Yadav helpful: 518
I m writing this review after using 2 months..
First of all ..I must say this is one of the best product ..camera quality is best in natural lights or daytime..but in low light and in the night..camera quality is not so good but it's ok..
It has good battery backup ..last one day on 3g usage ..while using 4g ..it lasts for about 10-12 hour..
Turbo charges is good..although ..my charger is not working..
Only problem in this phone is ..while charging..this phone heats a lot..this may b becoz of turbo charger..if u r using other charger than it does not heat..
Author: KAPIL CHOPRA helpful: 9
[...]
Note: output truncated ([...])

Jsoup can only parse HTML; it does not run JavaScript, and the content you are looking for is added to the page by JavaScript, so Jsoup never sees it.
You would need something like Selenium to render the page. However, for this specific site a quick look at its network activity shows that all the content you are after is fetched from the backend via API calls, which you can use directly and which make the content much more accessible than scraping the rendered HTML.
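If you do want to render the JavaScript instead of calling the API, a minimal Selenium sketch could look like the following. It assumes chromedriver is available on the PATH, and it reuses the selector from the question, which may well have changed on the live site.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumReviewScrape {
    public static void main(String[] args) {
        // Assumes chromedriver is on the PATH.
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/"
                    + "product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA");
            // Hand the rendered HTML to Jsoup for the usual CSS selection.
            Document doc = Jsoup.parse(driver.getPageSource());
            // Selector taken from the question; verify it against the current page.
            doc.select("div.qwjRop > div").forEach(e -> System.out.println(e.text()));
        } finally {
            driver.quit();
        }
    }
}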

Related

jsoup- Search Form's URL not changing in post()

I'm developing a web crawler.
I need to insert a value into the input field of a form (for a search) and get the result programmatically. The form uses the POST method and its action value is "/SetReviewFilter#REVIEWS".
The problem is that when I do the search manually on the website, the URL doesn't change, so I think the page is posting to itself.
Here is the link to the web page.
I have no idea how to implement this, but I tried the following:
private Document getReviewSearchDocument(Document search, String search_url)
{
    // search_url is the URL of the search document fetched previously
    // search is the current document of the web page
    Element input = search.getElementsByClass("ratings_and_types").first();
    Element link = input.select("div:nth-child(1) > form").first();
    Document rdocument = null;
    if (link != null) {
        System.out.println("form found! action: " + link.attr("action"));
    } else {
        System.out.println("Form not found");
    }
    Connection connection = Jsoup.connect(search_url + "/SetReviewFilter#REVIEWS").timeout(30 * 1000).ignoreContentType(true).ignoreHttpErrors(true);
    try {
        Connection.Response resp = connection.execute();
        if (resp.statusCode() == 200) {
            rdocument = connection.data("q", this.keywords).userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36").execute().parse();
            System.out.println("Success: " + resp.statusCode());
            System.out.println("document: " + rdocument.text());
        } else {
            System.out.println("no search match");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return rdocument;
}
If anybody has an idea on this, please share it.
Thank you.
I tried a few alternatives and modified my code to issue a Jsoup POST request to get the job done, but I failed several times because of problems with cookies. I found that this single POST request requires almost 50 cookies (thanks to the Chrome console), and some of them I couldn't set myself because they were tied to other websites (e.g. Facebook). The worst part is that I have to make this request once per hotel per city, which can sometimes be up to 85,000 requests, so it would be a costly process. (-5 for me for not seeing that coming.)
Therefore I rebuilt the project as web automation using Selenium in Java, and searching in forms became easy. Thank you!
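For reference, a minimal sketch of the Selenium form-filling approach; the URL, field selector, and search term below are placeholders, not the real ones from the review filter form.
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class ReviewFilterSearch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            // Placeholder URL and selector: adapt them to the actual review page.
            driver.get("https://example.com/hotel-reviews");
            WebElement searchBox = driver.findElement(By.cssSelector(".ratings_and_types form input[type=text]"));
            searchBox.sendKeys("pool");
            // Submits the surrounding form; the browser handles all cookies itself.
            searchBox.submit();
            System.out.println(driver.getPageSource().length() + " characters of filtered results");
        } finally {
            driver.quit();
        }
    }
}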

Getting information whether a google search results exists or not (JAVA)

I'm trying to parse Google search results. What I need is not the search results themselves; I only need the information whether a search result exists or not!
My problem is that I want to search for quoted strings, e.g. "Max Testperson".
Google is really nice and tells me:
We could not find search results for "Max Testperson", but instead found results for Max Testperson. But I do not need Max Testperson, I need "Max Testperson".
So basically I am not interested in the search results themselves, but in the part shown before the search results (whether the quoted search string can be found or not!).
I used the following tutorial in Java:
http://mph-web.de/web-scraping-with-java-top-10-google-search-results/
With this I can parse the search results, but as I said, I don't need those. I just want to know whether my quoted search string exists or not. Since Google removes the ->" "<-, I get search results anyway.
Can anyone help me out with this?
Try adding the GET parameter nfpr=1 to your search to disable the auto-correction feature:
final Document doc = Jsoup.connect("https://google.com/search?q=test"+"&nfpr=1").userAgent(USER_AGENT).get();
Update:
You could parse for the message regarding no result:
public class App {
    public static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";

    public static void main(String[] args) throws Exception {
        String searchTerm = "\"daniel+nasseh\"+\"26.02.1987\"";
        boolean hasExactResults = true;
        final Document doc = Jsoup.connect("https://google.com/search?q=" + searchTerm + "&nfpr=1")
                .userAgent(USER_AGENT).get();
        // container Google uses for the "no results" message
        Elements noResultMessage = doc.select("div.e.obp div.med:first-child");
        if (!noResultMessage.isEmpty()) {
            hasExactResults = false;
            for (Element result : noResultMessage) {
                System.out.println(result.text());
            }
        }
        if (hasExactResults) {
            // Traverse the results
            for (Element result : doc.select("h3.r a")) {
                final String title = result.text();
                final String url = result.attr("href");
                System.out.println(title + " -> " + url);
            }
        }
    }
}
Update 2: the best solution, as presented by Donselm himself in the comments, is to add &tbs=li:1 to force the search for the exact search term:
String searchTerm = "\"daniel+nasseh\"+\"26.02.1987\"";
final Document doc = Jsoup.connect("https://google.com/search?q=" + searchTerm + "&tbs=li:1").userAgent(USER_AGENT).get();

Java - Getting HTTP error 503 when querying google

I'm trying to write a little program in Java, with a small UI, that lets you use some of Google search's keywords to refine your search.
I have two text fields (one for the site and one for the keywords) and two date pickers that let the user select the date range for the search results.
When I press the search button it connects to the following URL:
"https://www.google.it/search?q=" + site + keywords + daterange
where
site = "site:SITE_MAIN_URL"
keywords are the keywords I am looking for
daterange = "daterange:JULIAN_DATE_1-JULIAN_DATE_2"
After all this I fetch the first 10 results, but here's the problem...
If I select no dates I can easily fetch the links.
If I set the daterange I get HTTP error 503, i.e. service unavailable (if I paste the generated URL into my web browser everything works fine).
(The user agent is set to Mozilla/5.0.)
EDIT: didn't post any code :P
// here I generate the site part
site = "site:" + website_field.getText();

// here I convert the dates using a class found on the net
d1 = (int) DateLabelFormatter.dateToJulian(date1);
d2 = (int) DateLabelFormatter.dateToJulian(date2);
daterange += "+daterange:" + d1 + "-" + d2;

// here I generate the keywords
keywords = keyword_field.getText();
String[] keyword = keywords.split(" ");
for (int i = 0; i < keyword.length; i++) {
    tempKeyword += "+" + keyword[i];
}

// the query
query = "https://www.google.it/search?q=" + site + tempKeyword + daterange;

// the connection (wrapped in a try-catch)
Document jSoupDoc = Jsoup.connect(query).userAgent("Mozilla/5.0").timeout(5000).get();

// fetching the links
Elements links = jSoupDoc.select("a[href]");
Element link;
for (int i = 0; i < links.size(); i++) {
    link = links.get(i);
    String temp = link.attr("href");
    // filtering the first 10 google links
    if (temp.contains("url"))            // do nothing here; the nested if does the filtering
        if (temp.contains("webcache")) { // do nothing
        } else {
            String[] splitTemp = temp.split("=");
            String[] splitTemp2 = splitTemp[1].split("&sa");
            System.out.println(splitTemp2[0]);
        }
}
After executing all this (not-so-well-written) code, if I select no date and use just the "site" and the "keywords", I can see on the console the first 10 results found on the Google search page.
If I select a daterange from the date pickers I get the 503 error.
If you want to try a working query, here is one that searches facebook.com for the keyword "dog" from the 1st to the 15th of November, generated with this "tool":
https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342
I have no problems using the following code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main
{
    public static void main(String[] args) throws IOException
    {
        // the connection (wrapped in a try-catch)
        Document jSoupDoc = Jsoup.connect("https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342").userAgent("Mozilla/5.0").timeout(5000).get();

        // fetching the links
        Elements links = jSoupDoc.select("a[href]");
        Element link;
        for (int i = 0; i < links.size(); i++)
        {
            link = links.get(i);
            String temp = link.attr("href");
            // filtering the first 10 google links
            if (temp.contains("url") && !temp.contains("webcache"))
            {
                String[] splitTemp = temp.split("=");
                String[] splitTemp2 = splitTemp[1].split("&sa");
                System.out.println(splitTemp2[0]);
            }
        }
    }
}
The code gives this as output on my computer:
https://www.facebook.com/uniladmag/videos/1912071728815877/
https://it-it.facebook.com/DogEvolutionAsd
https://it-it.facebook.com/DylanDogSergioBonelliEditore
https://www.facebook.com/DelawareCountyDogShelter/
https://www.facebook.com/LostDogAlert/
https://it-it.facebook.com/pages/Toelettatura-Vanity-DOG/270854126382923
https://it-it.facebook.com/washdogsgm
https://www.facebook.com/thedailystar/videos/1193933410623520/
https://www.facebook.com/OakhurstDogPark/
https://www.facebook.com/bigdogdinerco/
A 503 error usually means that the web server is having temporary issues. Specifically:
503: The Web server (running the Web site) is currently unable to handle the HTTP request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay.
If this code works but your original code still does not, then your code is not generating the URL you posted and you should investigate further.
Besides the coding style, I don't see any functional problems with the provided code, and it produces the results correctly (I tested it locally). The problem might lie in dateToJulian, since I don't know what it returns or whether information is lost when the result is cast to int.
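If you want to rule out dateToJulian, java.time (Java 8+) can produce the Julian Day Number directly; as a quick cross-check, the dates from the working URL above come out as:
import java.time.LocalDate;
import java.time.temporal.JulianFields;

public class JulianCheck {
    public static void main(String[] args) {
        // Google's daterange: operator expects Julian Day Numbers.
        long from = LocalDate.of(2015, 11, 1).getLong(JulianFields.JULIAN_DAY);  // 2457328
        long to   = LocalDate.of(2015, 11, 15).getLong(JulianFields.JULIAN_DAY); // 2457342
        System.out.println("daterange:" + from + "-" + to);
    }
}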
Also, consider the case in which the keywords contain dangerous characters and they are unescaped. They should be sanitized beforehand.
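One simple way to do that sanitizing is to URL-encode the raw user input before building the query string; a small sketch (the sample keywords are made up):
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeKeywords {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String keywords = "cani & gatti \"razza\"";
        // Encode the raw input so special characters cannot break the query string.
        String encoded = URLEncoder.encode(keywords, "UTF-8");
        String query = "https://www.google.it/search?q=" + encoded;
        System.out.println(query);
    }
}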
Another possibility is that Google is rejecting your queries if you are sending too many too fast. If this was done using a visual browser, you'd get a "We want to make sure you're not a robot." and a CAPTCHA page. That is why I'd recommend leveraging the Google API for your searches. See this SO for more info: How can you search Google Programmatically Java API
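If you keep scraping the HTML result pages rather than using the API, a small delay between consecutive requests reduces the chance of being flagged. This is only a courtesy-delay sketch (the queries and 5-10 second pause are arbitrary), not a substitute for the API or for CAPTCHA handling:
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PoliteSearch {
    public static void main(String[] args) throws Exception {
        List<String> queries = Arrays.asList(
                "https://www.google.it/search?q=site:facebook.com+dog",
                "https://www.google.it/search?q=site:facebook.com+cat");
        Random random = new Random();
        for (String query : queries) {
            Document doc = Jsoup.connect(query).userAgent("Mozilla/5.0").timeout(5000).get();
            System.out.println(doc.title());
            // Wait 5-10 seconds before the next request to avoid tripping rate limits.
            Thread.sleep(5000 + random.nextInt(5000));
        }
    }
}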

how to fix HTTP error fetching URL. Status=500 in java while crawling?

I am trying to crawl users' ratings of cinema movies from IMDb review pages (the number of movies in my database is about 600,000). I used Jsoup to parse the pages as below (sorry, I didn't paste the whole code here since it is too long):
try {
    // connecting to MySQL db
    ResultSet res = st
            .executeQuery("SELECT id, title, production_year " +
                    "FROM title " +
                    "WHERE kind_id = 1 " +
                    "LIMIT 0, 100000");
    while (res.next()) {
        .......
        .......
        String baseUrl = "http://www.imdb.com/search/title?release_date=" +
                "" + year + "," + year + "&title=" + movieName + "" +
                "&title_type=feature,short,documentary,unknown";
        Document doc = Jsoup.connect(baseUrl)
                .userAgent("Mozilla")
                .timeout(0).get();
        .....
        .....
        // insert ratings into database
        ...
I tested it for the first 100, then first 500 and also for the first 2000 movies in my db and it worked well. But the problem is that when I tested for 100,000 movies I got this error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://www.imdb.com/search/title?release_date=1899,1899&title='Columbia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at imdb.main(imdb.java:47)
I searched a lot for this error and found that it is a server-side error (5xx status code).
Then I decided to add a condition: when the connection fails, retry up to two more times, and if it still can't connect, skip to the next URL instead of stopping. Since I am new to Java, I searched for similar questions and read these answers on Stack Overflow:
Exceptions while I am extracting data from a Web site
Jsoup error handling when couldn't connect to website
Handling connection errors and JSoup
But when I try to use "Connection.Response" as they suggest, it tells me that "Connection.Response cannot be resolved to a type".
I would appreciate it if someone could help me; I am just a newbie and I know it might be simple, but I don't know how to fix it.
Well, I could fix the HTTP error status 500 by just adding ignoreHttpErrors(true), as below:
org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
con.timeout(180000).ignoreHttpErrors(true).followRedirects(true);
Response resp = con.execute();
Document doc = null;
if (resp.statusCode() == 200) {
    doc = con.get();
    ......
Hope it can help those who have the same error.
However, after crawling the review pages of 22,907 movies (about 12 hours), I got another error:
"Read timed out".
I would appreciate any suggestion to fix this error.
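One way to handle the read timeouts (and the "retry two more times, then skip" idea from the question) is a small retry wrapper around the connect call. A sketch, with the retry count, timeout, and back-off chosen arbitrarily:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RetryFetch {
    /** Tries to fetch a URL up to maxAttempts times; returns null if all attempts fail. */
    static Document fetchWithRetry(String url, int maxAttempts) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Jsoup.connect(url)
                        .userAgent("Mozilla/5.0")
                        .timeout(30000)          // generous read timeout
                        .ignoreHttpErrors(true)  // keep 5xx pages from throwing
                        .get();
            } catch (IOException e) {            // covers SocketTimeoutException ("Read timed out")
                System.err.println("Attempt " + attempt + " failed for " + url + ": " + e.getMessage());
                Thread.sleep(2000L * attempt);   // simple back-off before retrying
            }
        }
        return null; // caller skips this URL and moves on to the next one
    }

    public static void main(String[] args) throws InterruptedException {
        Document doc = fetchWithRetry("http://www.imdb.com/search/title?release_date=1899,1899", 3);
        System.out.println(doc == null ? "gave up" : doc.title());
    }
}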
Upgrading my comments to an answer:
Connection.Response is org.jsoup.Connection.Response.
To create the Document only when there is a valid HTTP status code (200), break your call into three parts: Connection, Response, Document.
Hence, your part of the code above gets modified to:
while (res.next()) {
    .......
    .......
    String baseUrl = "http://www.imdb.com/search/title?release_date=" + ""
            + year + "," + year + "&title=" + movieName + ""
            + "&title_type=feature,short,documentary,unknown";

    Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21").timeout(10000);
    Connection.Response resp = con.execute();
    Document doc = null;
    if (resp.statusCode() == 200) {
        doc = con.get();
        ....
    }

I want to pull Facebook posts from a public page to a Java application

I am creating an app in Java that will take all the information from a public website and load it into the app for people to read, using Jsoup. I tried the same kind of function with Facebook, but it wasn't working the same way. Does anyone have a good idea about how I should go about this?
Thanks,
Calland
public String scrapeEvents(String... args) throws Exception {
    Document doc = Jsoup.connect("http://www.facebook.com/cedarstreettimes?fref=ts").get();
    Elements elements = doc.select("div._wk");
    String s = elements.toString();
    return s;
}
Edit: I found this link with information, but I'm a little confused about how to use it to get only the content that the specific user posts on their wall: http://developers.facebook.com/docs/getting-started/graphapi/
I had a look at the source of that page -- the thing that is tripping up the parse is that all the real content is wrapped in comments, like this:
<code class="hidden_elem" id="u_0_42"><!-- <div class="fbTimelineSection ...> --></code>
There is JS on the page that lifts that data into the real DOM, but as jsoup doesn't execute JS it stays as comments. So before extracting the content, we need to emulate that JS and "un-hide" those elements. Here's an example to get you started:
String url = "https://www.facebook.com/cedarstreettimes?fref=ts";
String ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1438.7 Safari/537.33";
Document doc = Jsoup.connect(url).userAgent(ua).timeout(10 * 1000).get();

// move the hidden, commented-out html into the DOM proper:
Elements hiddenElements = doc.select("code.hidden_elem");
for (Element hidden : hiddenElements) {
    for (Node child : hidden.childNodesCopy()) {
        if (child instanceof Comment) {
            hidden.append(((Comment) child).getData()); // comment data parsed as html
        }
    }
}

Elements articles = doc.select("div[role=article]");
for (Element article : articles) {
    if (article.select("span.userContent").size() > 0) {
        String text = article.select("span.userContent").text();
        String imgUrl = article.select("div.photo img").attr("abs:src");
        System.out.println(String.format("%s\n%s\n\n", text, imgUrl));
    }
}
That example pulls out the article text and any photo that is associated with it.
(It's possibly better to use the FB API than this method; I wanted to show how you can emulate little bits of JS to make a scrape work properly.)
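For comparison, pulling the same posts through the Graph API is mostly a single authenticated HTTP call that returns JSON. A sketch only; the page id, API version, and access token below are placeholders you would have to supply from your own Facebook app:
import org.jsoup.Jsoup;

public class GraphApiPosts {
    public static void main(String[] args) throws Exception {
        String pageId = "cedarstreettimes";          // placeholder page id/name
        String accessToken = "YOUR_ACCESS_TOKEN";    // placeholder; obtain via a Facebook app
        String url = "https://graph.facebook.com/v2.8/" + pageId
                + "/posts?fields=message,created_time&access_token=" + accessToken;
        // The response is JSON, so tell Jsoup not to insist on HTML and read it as a string;
        // feed the string to any JSON parser (e.g. JSON.simple, as in the Flipkart answer above).
        String json = Jsoup.connect(url).ignoreContentType(true).execute().body();
        System.out.println(json);
    }
}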
