I'm developing a web crawler.
I need to insert a value into the input field of a form (for a search) and get the result programmatically. The form uses the POST method and its action value is "/SetReviewFilter#REVIEWS".
The problem is that when I do the search manually on the website, the URL doesn't change, so I think the page is posting back to itself.
Here is the link of the webpage.
I have no idea how to implement this, but I tried the following:
private Document getReviewSearchDocument(Document search, String search_url)
{
    // search_url is the URL of the search document I fetched previously
    // search is the current document of the webpage
    Element input = search.getElementsByClass("ratings_and_types").first();
    Element link = input.select("div:nth-child(1) > form").first();
    Document rdocument = null;
    if (link != null) {
        System.out.println("form found! action: " + link.attr("action"));
    } else {
        System.out.println("Form not found");
    }
    Connection connection = Jsoup.connect(search_url + "/SetReviewFilter#REVIEWS")
            .timeout(30 * 1000)
            .ignoreContentType(true)
            .ignoreHttpErrors(true);
    try {
        Connection.Response resp = connection.execute();
        if (resp.statusCode() == 200) {
            // second request actually posts the search keywords
            rdocument = connection.data("q", this.keywords)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36")
                    .execute()
                    .parse();
            System.out.println("Success: " + resp.statusCode());
            System.out.println("document: " + rdocument.text());
        } else {
            System.out.println("no search match");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return rdocument;
}
If anybody has an idea about this, please share it.
Thank you.
I tried a few alternatives and modified my code to issue a Jsoup POST request to get the job done, but I failed several times because of problems with cookies. I found that this single POST request requires almost 50 cookies (thanks to the Chrome console), and some of them I couldn't fill in myself because they were tied to other websites (e.g. Facebook). The worst part is that I have to make this request once per hotel in each city, which can be up to 85,000 requests, so it would be a costly process. (-5 for me for not seeing that coming.)
Therefore I rebuilt the project as web automation using Selenium in Java, and searching through forms became easy. Thank you!
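For anyone hitting the same wall, here is a minimal sketch of the Selenium route; the URL and the CSS selector are placeholders for illustration, not the site's real markup, and an explicit wait may be needed if the filtered reviews load asynchronously:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class ReviewSearchExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com/hotel-page"); // placeholder URL
            // type the keywords into the review-filter form and submit it;
            // the selector below is hypothetical and must be adapted to the real page
            WebElement searchBox = driver.findElement(By.cssSelector(".ratings_and_types form input[type='text']"));
            searchBox.sendKeys("pool");
            searchBox.submit(); // submits the enclosing self-posting form
            // the filtered reviews are now part of the rendered page
            System.out.println(driver.getPageSource().length());
        } finally {
            driver.quit();
        }
    }
}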
I am trying to extract a review of the product at the link (Moto X) using Jsoup, but it is throwing a NullPointerException. Also, I want to extract the text which is shown after clicking the "Read More" link of the review.
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class JSoupEx
{
    public static void main(String[] args) throws IOException
    {
        Document doc = Jsoup.connect("https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA").get();
        Element ele = doc.select("div[class=qwjRop] > div").first();
        System.out.println(ele.text());
    }
}
Any solutions?
As gherkin suggested, using the network tab in the developer tools, we see a request that receives the reviews (in JSON format) as a response:
https://www.flipkart.com/api/3/product/reviews?productId=MOBEFM5HAFRNSJJA&count=15&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL&start=0
Using a JSON parser like JSON.simple we can extract information like review author, usefulness and text.
Example Code
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36";
String reviewApiCall = "https://www.flipkart.com/api/3/product/reviews?productId=MOBEFM5HAFRNSJJA&count=15&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL&start=";
String xUserAgent = userAgent + " FKUA/website/41/website/Desktop";
String referer = "https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA";
String host = "www.flipkart.com";
int numberOfPages = 2; // first two pages of results will be fetched
try {
    // loop for multiple review pages
    for (int i = 0; i < numberOfPages; i++) {
        // query reviews
        Response response = Jsoup.connect(reviewApiCall + (i * 15)).userAgent(userAgent).referrer(referer).timeout(5000)
                .header("x-user-agent", xUserAgent).header("host", host).ignoreContentType(true).execute();
        System.out.println("Response in JSON format:\n\t" + response.body() + "\n");

        // parse json response
        JSONObject jsonObject = (JSONObject) new JSONParser().parse(response.body().toString());
        jsonObject = (JSONObject) jsonObject.get("RESPONSE");
        JSONArray jsonArray = (JSONArray) jsonObject.get("data");
        for (Object object : jsonArray) {
            jsonObject = (JSONObject) object;
            jsonObject = (JSONObject) jsonObject.get("value");
            System.out.println("Author: " + jsonObject.get("author") + "\thelpful: "
                    + jsonObject.get("helpfulCount") + "\n\t"
                    + jsonObject.get("text").toString().replace("\n", "\n\t") + "\n");
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
Output
Response in JSON format:
{"CACHE_INVALIDATION_TTL":"132568825671","REQUEST":null,"REQUEST-ID": [...] }
Author: Flipkart Customer helpful: 140
A great phone at an affordable price with
-an outstanding camera
-great battery life
-an excellent display
-premium looks
the flipkart delivery was also fast and perfect.
Author: Vaibhav Yadav helpful: 518
I m writing this review after using 2 months..
First of all ..I must say this is one of the best product ..camera quality is best in natural lights or daytime..but in low light and in the night..camera quality is not so good but it's ok..
It has good battery backup ..last one day on 3g usage ..while using 4g ..it lasts for about 10-12 hour..
Turbo charges is good..although ..my charger is not working..
Only problem in this phone is ..while charging..this phone heats a lot..this may b becoz of turbo charger..if u r using other charger than it does not heat..
Author: KAPIL CHOPRA helpful: 9
[...]
Note: output truncated ([...])
Jsoup can only parse HTML; it does not run JavaScript. The content you are looking for is added to the page by JavaScript, so Jsoup never sees it.
You need something like Selenium to get what you are looking for. However, for this specific site, a quick analysis of its network activity shows that all the content you are after is fetched from the backend by API calls. You can use those calls directly, which makes the content much more accessible without using Jsoup.
I'm making a little script in Java to check iPhone IMEI numbers.
There is this site from Apple:
https://appleonlinefra.mpxltd.co.uk/search.aspx
You have to enter an IMEI number. If the number is OK, it takes you to this page:
https://appleonlinefra.mpxltd.co.uk/Inspection.aspx
Otherwise, you stay on the /search.aspx page.
I want to open the search page, enter an IMEI, submit, and check whether the URL has changed. In my code there is a working IMEI number.
Here is my Java code:
HtmlPage page = webClient.getPage("https://appleonlinefra.mpxltd.co.uk/search.aspx");
HtmlTextInput imei_input = (HtmlTextInput)page.getElementById("ctl00_ContentPlaceHolder1_txtIMEIVal");
imei_input.setValueAttribute("012534008614194");
//HtmlAnchor check_imei = page.getAnchorByText("Rechercher");
//Tried with both ways of getting the anchor, none works
HtmlAnchor anchor1 = (HtmlAnchor)page.getElementById("ctl00_ContentPlaceHolder1_imeiValidate");
page = anchor1.click();
System.out.println(page.getUrl());
I can't figure out where the problem comes from, since I often use HtmlUnit for this and I never had this issue before. Maybe it's because of the short loading time after submitting?
Thank you in advance.
You can do this by using a connection wrapper that HtmlUnit provides.
Here is an example:
new WebConnectionWrapper(webClient) {
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("Inspection.aspx")) {
            String content = response.getContentAsString("UTF-8");
            WebResponseData data = new WebResponseData(content.getBytes("UTF-8"), response.getStatusCode(),
                    response.getStatusMessage(), response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
};
With the connection wrapper above, you can inspect any request and response that passes through HtmlUnit.
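For example, here is a rough, self-contained sketch of how the wrapper could be combined with the question's code to tell whether the IMEI check ended up serving Inspection.aspx. The boolean flag and the class around it are my own illustration, not something the library provides, and the package names assume HtmlUnit 2.x:
import java.io.IOException;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

public class ImeiCheck {
    public static void main(String[] args) throws IOException {
        try (WebClient webClient = new WebClient()) {
            final boolean[] inspectionSeen = { false };

            // installs itself as the client's connection wrapper
            new WebConnectionWrapper(webClient) {
                @Override
                public WebResponse getResponse(WebRequest request) throws IOException {
                    WebResponse response = super.getResponse(request);
                    // remember whether Inspection.aspx was ever served during this session
                    if (request.getUrl().toExternalForm().contains("Inspection.aspx")) {
                        inspectionSeen[0] = true;
                    }
                    return response;
                }
            };

            HtmlPage page = webClient.getPage("https://appleonlinefra.mpxltd.co.uk/search.aspx");
            HtmlTextInput imeiInput = (HtmlTextInput) page.getElementById("ctl00_ContentPlaceHolder1_txtIMEIVal");
            imeiInput.setValueAttribute("012534008614194");
            HtmlAnchor anchor = (HtmlAnchor) page.getElementById("ctl00_ContentPlaceHolder1_imeiValidate");
            page = anchor.click();

            System.out.println("Landed on: " + page.getUrl() + ", Inspection.aspx seen: " + inspectionSeen[0]);
        }
    }
}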
I am trying to crawl users' ratings of cinema movies from the IMDb review pages (the number of movies in my database is about 600,000). I used Jsoup to parse the pages as below (sorry, I didn't include the whole code here since it is too long):
try {
    // connecting to mysql db
    ResultSet res = st
            .executeQuery("SELECT id, title, production_year " +
                    "FROM title " +
                    "WHERE kind_id = 1 " +
                    "LIMIT 0, 100000");
    while (res.next()) {
        .......
        .......
        String baseUrl = "http://www.imdb.com/search/title?release_date=" +
                "" + year + "," + year + "&title=" + movieName + "" +
                "&title_type=feature,short,documentary,unknown";
        Document doc = Jsoup.connect(baseUrl)
                .userAgent("Mozilla")
                .timeout(0).get();
        .....
        .....
        // insert ratings into database
        ...
I tested it for the first 100, then the first 500, and also for the first 2,000 movies in my database, and it worked well. But when I tested it for 100,000 movies I got this error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://www.imdb.com/search/title?release_date=1899,1899&title='Columbia'%20Close%20to%20the%20Wind&title_type=feature,short,documentary,unknown
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at imdb.main(imdb.java:47)
I searched a lot for this error and found that it is a server-side error (a 5xx status code).
I then decided to add a condition: when the connection fails, retry two more times, and if it still can't connect, don't stop but move on to the next URL. Since I am new to Java, I searched for similar questions and read these answers on Stack Overflow:
Exceptions while I am extracting data from a Web site
Jsoup error handling when couldn't connect to website
Handling connection errors and JSoup
But when I try "Connection.Response" as they suggest, it tells me that "Connection.Response cannot be resolved to a type".
I would appreciate it if someone could help me, since I am just a newbie; I know it might be simple, but I don't know how to fix it.
Well, I could fix the HTTP 500 error status by just adding ignoreHttpErrors(true) as below:
org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21");
con.timeout(180000).ignoreHttpErrors(true).followRedirects(true);
Response resp = con.execute();
Document doc = null;
if (resp.statusCode() == 200) {
doc = con.get();
......
I hope this can help those who have the same error.
However, after crawling the review pages of 22,907 movies (about 12 hours), I got another error:
"Read timed out".
I would appreciate any suggestion for fixing this error.
Upgrading my comments to an answer:
Connection.Response is org.jsoup.Connection.Response
To get the Document instance only when there is a valid HTTP status code (200), break your call into three parts: Connection, Response, Document.
Hence, your part of the code above gets modified to:
while (res.next()){
.......
.......
String baseUrl = "http://www.imdb.com/search/title?release_date=" + ""
+ year + "," + year + "&title=" + movieName + ""
+ "&title_type=feature,short,documentary,unknown";
Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21").timeout(10000);
Connection.Response resp = con.execute();
Document doc = null;
if (resp.statusCode() == 200) {
    // note: con.get() issues the request a second time; resp.parse() would reuse the response already fetched by execute()
    doc = con.get();
    ....
}
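As a side note on the "retry twice, then skip" idea from the question, here is a minimal sketch of how it could be wrapped around this call. It reuses baseUrl from the snippet above, assumes it runs inside the while (res.next()) loop (so continue moves on to the next movie), and needs org.jsoup.Connection, org.jsoup.Jsoup, org.jsoup.nodes.Document and java.io.IOException on the classpath; the attempt count is an arbitrary choice, not something mandated by Jsoup:
Document doc = null;
int maxAttempts = 3; // the original attempt plus two retries
for (int attempt = 1; attempt <= maxAttempts && doc == null; attempt++) {
    try {
        Connection.Response resp = Jsoup.connect(baseUrl)
                .userAgent("Mozilla")
                .timeout(10000)
                .ignoreHttpErrors(true)
                .execute();
        if (resp.statusCode() == 200) {
            doc = resp.parse(); // parse the body already fetched by execute()
        } else {
            System.out.println("Attempt " + attempt + " returned status " + resp.statusCode());
        }
    } catch (IOException e) {
        System.out.println("Attempt " + attempt + " failed: " + e.getMessage());
    }
}
if (doc == null) {
    continue; // give up on this URL and move on to the next movie
}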
I am creating an app in Java that will take all the information from a public website and load it into the app for people to read, using Jsoup. I was trying the same kind of function with Facebook, but it wasn't working the same way. Does anyone have a good idea about how I should go about this?
Thanks,
Calland
public String[] scrapeEvents(String... args) throws Exception {
    Document doc = Jsoup.connect("http://www.facebook.com/cedarstreettimes?fref=ts").get();
    Elements elements = doc.select("div._wk");
    String s = elements.toString();
    return new String[] { s }; // wrap in an array to match the declared return type
}
Edit: I found this link of information, but I'm a little confused about how to use it to get only the content that the specific user posts on their wall. http://developers.facebook.com/docs/getting-started/graphapi/
I had a look at the source of that page -- the thing that is tripping up the parse is that all the real content is wrapped in comments, like this:
<code class="hidden_elem" id="u_0_42"><!-- <div class="fbTimelineSection ...> --></code>
There is JS on the page that lifts that data into the real DOM, but as jsoup doesn't execute JS it stays as comments. So before extracting the content, we need to emulate that JS and "un-hide" those elements. Here's an example to get you started:
String url = "https://www.facebook.com/cedarstreettimes?fref=ts";
String ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1438.7 Safari/537.33";
Document doc = Jsoup.connect(url).userAgent(ua).timeout(10*1000).get();

// move the hidden commented out html into the DOM proper:
Elements hiddenElements = doc.select("code.hidden_elem");
for (Element hidden: hiddenElements) {
    for (Node child: hidden.childNodesCopy()) {
        if (child instanceof Comment) {
            hidden.append(((Comment) child).getData()); // comment data parsed as html
        }
    }
}

Elements articles = doc.select("div[role=article]");
for (Element article: articles) {
    if (article.select("span.userContent").size() > 0) {
        String text = article.select("span.userContent").text();
        String imgUrl = article.select("div.photo img").attr("abs:src");
        System.out.println(String.format("%s\n%s\n\n", text, imgUrl));
    }
}
That example pulls out the article text and any photo that is associated with it.
(It's possibly better to use the FB API than this method; I wanted to show how you can emulate little bits of JS to make a scrape work properly.)
I have a task to fetch HTML from a website, but before I can reach that page I need to log in.
I use the low-level URL Fetch service API (App Engine). Here is my test code:
private String postPage(String loginPageHtml) throws IOException{
String charset = "UTF-8";
Document doc = Jsoup.parse(loginPageHtml);
Iterator<Element> inputHiddensIter = doc.select("form").first().select("input[type=hidden]").iterator();
String paramStr = "";
paramStr += "Username" + "=" + URLEncoder.encode("username", charset) + "&";
paramStr += "Password" + "=" + URLEncoder.encode("password", charset) + "&";
paramStr += "ImageButton1.x" + "=" + URLEncoder.encode("50", charset) + "&";
paramStr += "ImageButton1.y" + "=" + URLEncoder.encode("10", charset) + "&";
while (inputHiddensIter.hasNext()) {
Element ele = inputHiddensIter.next();
String name = ele.attr("name");
String val = ele.attr("value");
paramStr += name + "=" + URLEncoder.encode(val, charset) + "&";
}
URL urlObj = new URL(LOG_IN_PAGE);
URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
HTTPRequest request = new HTTPRequest(urlObj, HTTPMethod.POST);
HTTPHeader header = new HTTPHeader("Content-Type", "application/x-www-form-urlencoded");
HTTPHeader header3 = new HTTPHeader("Content-Language", "en-US");
HTTPHeader header4 = new HTTPHeader("User-Agent", DEFAULT_USER_AGENT);
if (!cookie.isEmpty()) {
    // cookies are sent back to the server in the "Cookie" request header
    // ("Set-Cookie" is the response header the server uses to hand them out)
    request.addHeader(new HTTPHeader("Cookie", cookie));
}
request.addHeader(header);
request.addHeader(header3);
request.addHeader(header4);
request.setPayload(paramStr.getBytes());
request.getFetchOptions().setDeadline(30d);
HTTPResponse response = null;
try{
response = fetcher.fetch(request);
byte[] content = response.getContent();
int responseCode = response.getResponseCode();
log.severe("Response Code : " + responseCode);
List<HTTPHeader>headers = response.getHeaders();
for(HTTPHeader h : headers) {
String headerName = h.getName();
if(headerName.equals("Set-Cookie")){
cookie = h.getValue();
}
}
String s = new String(content, "UTF-8");
return s;
}catch (IOException e){
/* ... */
}
return "";
}
Here is my default user agent:
private static final String DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.83 Safari/537.1";
It works fine on my dev machine, but when I deploy on app engine and test it, I get response code 500 and the following error:
Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster.
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.
Exception Details: System.Web.HttpException: Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster.
It seems some error occurs on the ASP.NET side.
Is there something wrong with my code, or is there some limitation on App Engine?
It looks like you are doing a POST to an aspx page.
When an aspx page receives a POST request it expects some hidden inputs which have an encoded ViewState present - if you browse to the page in question and "View Source" you should see some fields just inside the <form /> tag that look something like this:
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="xxxxxxxxx" />
Because you are submitting a POST request without these values present, it's having trouble decoding and validating them (which is what that error means - it can also crop up for other reasons in other scenarios).
There are a couple of possible solutions to this:
1 - If you have access to the code for the site, and the login page doesn't require ViewState, you could try switching it off at the page level within the @ Page directive:
<%@ Page ViewStateMode="Disabled" .... %>
2 - You could do a double-request
- do a GET request on the login page to retrieve the values for any missing hidden fields
- use those values and include them in your POST
EDIT
Ah yes, from your comment I can see that you're including the hidden form fields already - apologies!
In which case, another possibility is that the login page is on a load balanced environment. Each server in that environment will have a different MachineKey value (which is used to encode/decode the ViewState). You may be reading from one and posting to the other. Some LBs inject ArrowPoint cookies into the response to ensure that you "stick" to the same server between requests.
I can see you're already including a cookie in your POST, but I can't see where it's defined. Is it from the first GET request, or is it a custom cookie? If you haven't tried it already, maybe try using the cookie from the original GET where you're retrieving the login page HTML? Other than that, I'm out of ideas - sorry!
Commonly, when you're trying to emulate a postback on ASP.NET, you need to POST (a rough sketch follows this list):
- cookies preserved from the first request, so that you act within the same session
- the data fields (login, password)
- the hidden fields from the first page: __VIEWSTATE, __VIEWSTATEENCRYPTED (even if it's empty!), __EVENTVALIDATION
- if you are sending some action items, you may also need to include the hidden fields __EVENTTARGET and __EVENTARGUMENT
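Here is a rough, minimal sketch of that double request using Jsoup (the library used elsewhere on this page). The login URL and the Username/Password field names are placeholders borrowed from the question's code; the real page may require additional fields or headers:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AspNetLoginSketch {
    public static void main(String[] args) throws IOException {
        String loginUrl = "https://example.com/login.aspx"; // placeholder URL

        // first request: GET the login page to collect the session cookies and the hidden ASP.NET fields
        Connection.Response loginPage = Jsoup.connect(loginUrl)
                .method(Connection.Method.GET)
                .execute();

        Map<String, String> formData = new HashMap<>();
        for (Element hidden : loginPage.parse().select("form input[type=hidden]")) {
            // picks up __VIEWSTATE, __EVENTVALIDATION, __VIEWSTATEENCRYPTED, etc.
            formData.put(hidden.attr("name"), hidden.attr("value"));
        }
        formData.put("Username", "username"); // placeholder field names from the question
        formData.put("Password", "password");

        // second request: POST the credentials plus the hidden fields, reusing the cookies from the GET
        Document result = Jsoup.connect(loginUrl)
                .cookies(loginPage.cookies())
                .data(formData)
                .method(Connection.Method.POST)
                .execute()
                .parse();

        System.out.println(result.title());
    }
}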