How to scrape the price from dynamically updated webpages? - java

I have a problem when trying to scrape a price from dynamically updated web pages: most of the HTML is not returned when I fetch the page with URLConnection, Jsoup, or HtmlUnit.
I don't know much about web scraping, but I guess the problem is that online shops like these:
Auchan,
Silpo
use JavaScript and AJAX to load the main product information. In my opinion, the problem is a redirect or a delay that prevents me from getting the fully loaded HTML with all the data I need.
So, the question is: how do I scrape the price from the links above?
I have already tried several approaches:
UrlConnection
URL url;
try {
url = new URL("https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/");
URLConnection con = url.openConnection();
InputStream is = con.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
try(FileWriter fileWriter = new FileWriter("output.html")){
while ((line = br.readLine()) != null) {
fileWriter.write(line+"\n");
}
}
} catch (IOException e) {
e.printStackTrace();
}
This runs fine, but returns HTML without the price data.
Jsoup
Document document = null;
String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
try {
document = Jsoup.connect(link).get();
} catch (IOException e) {
e.printStackTrace();
}
if (document != null) {
try (FileWriter fileWriter = new FileWriter("output.html")) {
fileWriter.write(document.toString());
} catch (IOException e) {
e.printStackTrace();
}
}
Returns the same.
HtmlUnit
String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScriptStartingBefore(5000);
HtmlPage htmlPage = null;
try {
htmlPage = webClient.getPage(link);
webClient.waitForBackgroundJavaScript(5000);
} catch (IOException e) {
e.printStackTrace();
}
if (htmlPage!=null){
try (FileWriter fileWriter = new FileWriter("output.html")) {
fileWriter.write(Jsoup.parse(htmlPage.asXml()).toString());
} catch (IOException e) {
e.printStackTrace();
}
}
Returns a little more, including some script tags, but still nothing useful. This code also throws so many exceptions that they don't even fit in the console.
I also tried setting the user agent like this:
java.net.URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
and this:
System.setProperty("http.agent", "");

You need to use Chrome's dev tools to view the HTTP requests and responses.
The page loads a lot of JavaScript, which in turn sends a whole series of HTTP requests and waits for the responses. The first one that looks interesting is:
https://auchan.ua/graphql, which is a POST request with an important HTTP header (referer: https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/). The response body for the request is: {"data":{"urlResolver":{"type":"PRODUCT","id":297668}}}
Taking the product ID value and searching for it in the subsequent responses, I found several that contain it. The responses consist of escaped Unicode characters, but if you open the URLs in a browser the content is rendered.
The URL that starts with auchan.ua/graphql/?query=query%20getProductDetail... looked promising, and sure enough its special_price matches what's displayed on the page. So you need to find a way of generating or extracting these URLs from the initial page source.
link to product details
You may also find this response I gave useful for processing JSON data.
The second shop you linked to requires a username and password, but the process for getting the data will likely be very similar: use dev tools to view the HTTP requests, work out where the price info comes from (find the value in one of the responses), then try to recreate the same request from the initial URL and the response returned.
Good luck!
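The two steps described in this answer (send the POST with the referer header, then pull the ID out of the response) could be sketched roughly like this. It is only a sketch under assumptions: the class and method names are made up for illustration, the Content-Type is a guess, and the actual GraphQL request body is not shown in the answer, so it would have to be copied from dev tools. The regex is a shortcut for this one response shape; a JSON library would be the sturdier choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; none of these names come from the site or from the answer.
class UrlResolverSketch {

    // Finds the numeric "id" field in a response shaped like the one quoted in the answer.
    private static final Pattern ID_FIELD = Pattern.compile("\"id\"\\s*:\\s*(\\d+)");

    // Prepares a POST to the endpoint with the Referer header set.
    // Nothing is sent until the caller writes the body and reads the response.
    static HttpURLConnection preparePost(String endpoint, String productPageUrl) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Referer", productPageUrl);
            conn.setRequestProperty("Content-Type", "application/json"); // assumption
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Extracts the product ID from the urlResolver response body.
    static long parseProductId(String responseBody) {
        Matcher m = ID_FIELD.matcher(responseBody);
        if (!m.find()) {
            throw new IllegalArgumentException("no id field in response");
        }
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        String page = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
        HttpURLConnection conn = preparePost("https://auchan.ua/graphql", page);
        System.out.println(conn.getRequestMethod() + " with Referer: " + conn.getRequestProperty("Referer"));

        // Response body exactly as quoted in the answer; no network access is needed to parse it.
        String quoted = "{\"data\":{\"urlResolver\":{\"type\":\"PRODUCT\",\"id\":297668}}}";
        System.out.println("product id: " + parseProductId(quoted));
    }
}
```

The extracted ID is what you would then look for in, or plug into, the getProductDetail request.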

Related

An issue with a URLConnection using Java

I'm trying to read the source code of a website.
But there is an issue when I want to fetch the source of this page, for example: "https://www.amazon.de/gp/bestsellers/pet-supplies/#2"
I tried a lot, but I still only receive the source of "https://www.amazon.de/gp/bestsellers/pet-supplies". So something is not working the way I want: I want to receive places 21-40, not 1-20.
I'm using a URLConnection and a BufferedReader:
public String fetchPage(String urlS){
    // Accumulate into a StringBuilder; appending to a null String would prefix the result with "null".
    StringBuilder qc = new StringBuilder();
    String s;
    try{
        URL url = new URL(urlS);
        URLConnection uc = url.openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0");
        BufferedReader reader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
        while((s = reader.readLine()) != null){
            qc.append(s);
        }
        reader.close();
    } catch(IOException e) {
        e.printStackTrace();
        return "receiving qc failed";
    }
    return qc.toString();
}
Thank you in advance for your effort :)
The URL you're fetching contains an anchor (the #2 at the end). An anchor, also called a fragment, is a client-side concept originally used to jump to a certain part of the page. Some web apps (mostly single-page apps) use the anchor to keep track of some sort of state (e.g. which page of products you're viewing).
Since the anchor is a client-side concept, your browser or HTTP client library strips it before sending the request, so the web server responds as if you had requested https://www.amazon.de/gp/bestsellers/pet-supplies.
The bottom line is that you'll never get the second page this way... Good luck scraping Amazon though ;)
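To see the split between what stays on the client and what goes to the server, here is a minimal sketch using java.net.URI. The class and method names are made up for illustration; only the URL comes from the question.

```java
import java.net.URI;

// Hypothetical demo class, not part of any library.
class FragmentDemo {

    // Returns the URI as an HTTP client would request it: everything before the '#'.
    static URI withoutFragment(URI uri) {
        String s = uri.toString();
        int hash = s.indexOf('#');
        return hash < 0 ? uri : URI.create(s.substring(0, hash));
    }

    public static void main(String[] args) {
        URI uri = URI.create("https://www.amazon.de/gp/bestsellers/pet-supplies/#2");
        System.out.println("fragment (stays on the client): " + uri.getFragment());
        System.out.println("requested from the server:      " + withoutFragment(uri));
    }
}
```

Whatever the page does with the fragment happens in JavaScript after the first response arrives, so the second page has to be found among the requests that script makes.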

Extracting HTML from javascript protected website

The website www.kissanime.to has an "Is JavaScript enabled in your browser" protection, so to read the HTML content of the site you need a browser with JavaScript enabled. That means this code won't work:
URL kissanime = new URL("http://www.kissanime.to/");
URLConnection ks = kissanime.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(ks.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
After a while of researching I found Selenium, a browser automation library:
HtmlUnitDriver html = new HtmlUnitDriver();
String URL="https://www.kissanime.to/";
html.get(URL);
String pageSource=html.getPageSource();
System.out.println(pageSource);
That works, but isn't there a better way to do this, such as with the Jsoup and Rhino libraries, where you make the initial connection with Jsoup and then add Rhino so that it looks like you have JavaScript? Or, better yet, using only Jsoup and adding some cookies to bypass the protection?
Selenium is a pretty heavyweight solution for such a simple use case. If you need a basic engine emulating a real browser with JavaScript enabled, then HtmlUnit (http://htmlunit.sourceforge.net/) is what you are looking for.
Here is a code fragment which scrapes data from google.com:
WebClient webClient = new WebClient();
HtmlPage googlePage = webClient.getPage("http://www.google.com/");
// possibly wait for the JavaScript to execute
String source = googlePage.asXml();
I do the same as you, and not only for that page. You can get the HTML from its URL this way, while also emulating a browser:
private String conexion(String link) {
String content = null;
URLConnection connection = null;
try {
connection = new URL(link).openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
Scanner scanner = new Scanner(connection.getInputStream());
scanner.useDelimiter("\\Z");
content = scanner.next();
} catch (IOException ex) {
JOptionPane.showMessageDialog(null, "That chapter does not exist!", "Error", JOptionPane.ERROR_MESSAGE);
}
return content;
}

HTTP URL connection response

I am trying to hit a URL and get the response from my Java code.
I am using URLConnection to get the response and writing it to an HTML file.
When I open this HTML file in a browser after running the Java class, I get only the Google home page, without the results.
What's wrong with my code? Here it is:
FileWriter fWriter = null;
BufferedWriter writer = null;
URL url = new URL("https://www.google.co.in/?gfe_rd=cr&ei=aS-BVpPGDOiK8Qea4aKIAw&gws_rd=ssl#q=google+post+request+from+java");
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("User-Agent", "Mozilla/5.0");
connection.setRequestProperty("Accept-Charset", "UTF-8");
connection.setDoInput(true);
connection.setRequestProperty("Authorization", "Basic " + encoding);
connection.connect();
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
String line;
try {
fWriter = new FileWriter(new File("f:\\fileName.html"));
writer = new BufferedWriter(fWriter);
while ((line = in.readLine()) != null) {
String s = line.toString();
writer.write(s);
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
}
The same code worked a couple of days ago, but not now.
The reason is that this URL does not return search results itself. You have to understand how Google works to see why. Open this URL in your browser and view its source. You will see only lots of JavaScript there.
In short, Google uses Ajax requests to process search queries.
To perform the required task you either have to use a headless browser that can execute JavaScript/Ajax (the hard way) or, better, use the Google search API as directed by anand.
This method of searching is not advised and is bound to fail; you should use the Google search APIs for this kind of work.
Note: Google uses some redirection and tokens, so even if you find a clever way to handle it, it is bound to fail in the long run.
Edit:
This is a sample of how you can get your work done reliably using the Google search APIs; please refer to the source for more information.
public static void main(String[] args) throws Exception {
String google = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=";
String search = "stackoverflow";
String charset = "UTF-8";
URL url = new URL(google + URLEncoder.encode(search, charset));
Reader reader = new InputStreamReader(url.openStream(), charset);
GoogleResults results = new Gson().fromJson(reader, GoogleResults.class);
// Show title and URL of 1st result.
System.out.println(results.getResponseData().getResults().get(0).getTitle());
System.out.println(results.getResponseData().getResults().get(0).getUrl());
}

How can I Retrieve the HTML of a Search Engine Query Result?

I am trying to retrieve the HTML of a Google search query result using Java. That is, if I do a search on Google.com for a particular phrase, I would like to retrieve the HTML of the resulting web page (the page containing the links to possible matches along with their descriptions, URLs, etc.).
I tried doing this using the following code that I found in a related post:
import java.io.*;
import java.net.*;
import java.util.*;
public class Main {
public static void main (String args[]) {
URL url;
InputStream is = null;
DataInputStream dis;
String line;
try {
url = new URL("https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
is = url.openStream(); // throws an IOException
dis = new DataInputStream(new BufferedInputStream(is));
while ((line = dis.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe ) {
// nothing to see here
}
}
}
}
From: How do you Programmatically Download a Webpage in Java
The URL used in this code was obtained by doing a Google search query from the Google homepage. For some reason I do not understand, if I write the phrase that I want to search for in the URL bar of my web browser and then use the URL of the resulting search result page in the code I get a 403 error.
This code, however, did not return the html of the search query result page. Instead, it returned the source code of the Google homepage.
After doing further research I noticed that if you view the source code of a Google search query result (by right clicking on the background of the search result page and selecting “View page source”) and compare it with the source code of the Google homepage, they are both identical.
If instead of viewing the source code of the search result page I save the html of the search result page (by pressing ctrl+s), I can get the html that I am looking for.
Is there a way to retrieve the html of the search result page using Java?
Thank you!
Rather than parsing the resulting HTML page from a standard google search, perhaps you would be better off looking at the official Custom Search api to return results from Google in a more usable format. The API is definitely the way to go; otherwise your code could simply break if Google were to change some features of the google.com front-end's html. The API is designed to be used by developers and your code would be far less fragile.
To answer your question, though: We can't really help you just from the information you've provided. Your code seems to retrieve the html of stackoverflow; an exact copy-and-paste of the code from the question you linked to. Did you attempt to change the code at all? What URL are you actually trying to use to retrieve google search results?
I tried to run your code using url = new URL("http://www.google.com/search?q=test"); and I personally get an HTTP error 403 forbidden. A quick search of the problem says that this happens if I don't provide the User-Agent header in the web request, though that doesn't exactly help you if you're actually getting HTML returned. You will have to provide more information if you wish to receive specific help - though switching to the Custom Search API will likely solve your problem.
edit: new information provided in original question; can directly answer question now!
I figured out your problem after packet-capturing the web request that java was sending and applying some basic debugging... Let's take a look!
Here's the web request that Java was sending with your provided example URL:
GET / HTTP/1.1
User-Agent: Java/1.6.0_30
Host: www.google.com
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Notice that the request seemed to ignore most of the URL... leaving just the "GET /". That is strange. I had to look this one up.
As per the documentation of the Java URL class (and this is standard for all web pages): "A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters ... This fragment is not technically part of the URL."
Let's take a look at your example URL...
https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951
Notice that the "#" comes right after the "/" path? Java simply drops everything after the "#" when building the request, because the fragment is only used by the client / web browser. This leaves you with the URL https://www.google.com/. Hey, at least it was working as intended!
I can't tell you exactly what Google is doing, but the sharp-symbol url definitely means that Google is returning results of the query through some client-side (ajax / javascript) scripting. I'd be willing to bet that any query you send directly to the server (i.e- no "#" symbol) without the proper headers will return a 403 forbidden error - looks like they're encouraging you to use the API :)
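If the search terms only exist in the fragment, one option is to read them out on the client side and build a real query string from them. The following is a rough sketch of that idea (Java 10+ for the Charset overloads); the class and method names are invented for illustration, and whether the server accepts the resulting /search?q=... request still depends on the headers you send.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any library.
class FragmentToQuery {

    // Splits "a=1&b=2" into a map, URL-decoding keys and values.
    static Map<String, String> parseParams(String raw) {
        Map<String, String> params = new LinkedHashMap<>();
        if (raw == null || raw.isEmpty()) {
            return params;
        }
        for (String pair : raw.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    // Moves the "q" parameter from the fragment into a query string the server will see.
    static String toSearchUrl(String browserUrl) {
        String q = parseParams(URI.create(browserUrl).getRawFragment()).get("q");
        if (q == null) {
            throw new IllegalArgumentException("no q parameter in the fragment");
        }
        return "https://www.google.com/search?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Shortened version of the URL from the question.
        System.out.println(toSearchUrl("https://www.google.com/#hl=en&output=search&q=UCF&oq=UCF"));
    }
}
```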
edit2: As per Tengji Zhang's answer to the question, here is working code that returns the result of the Google query for "test":
URL url;
InputStream is = null;
DataInputStream dis;
String line;
URLConnection c;
try {
url = new URL("https://www.google.com/search?q=test");
c = url.openConnection();
c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
c.connect();
is = c.getInputStream();
dis = new DataInputStream(new BufferedInputStream(is));
while ((line = dis.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe ) {
// nothing to see here
}
}
I suggest you try http://seleniumhq.org/
There is a good tutorial of searching in google
http://code.google.com/p/selenium/wiki/GettingStarted
You don't set the User-Agent in your code:
URLConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
Or you can read "http://www.google.com/robots.txt". This file tells you which URLs Google allows crawlers to access.
The code below works.
package org.test.stackoverflow;
import java.io.*;
import java.net.*;
import java.util.*;
public class SearcherRetriver {
public static void main (String args[]) {
URL url;
InputStream is = null;
DataInputStream dis;
String line;
URLConnection c;
try {
url = new URL("https://www.google.com.hk/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
c = url.openConnection();
c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
c.connect();
is = c.getInputStream();
dis = new DataInputStream(new BufferedInputStream(is));
while ((line = dis.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe ) {
// nothing to see here
}
}
}
}

Problem sending a POST Request with Parameters From a Java app

There's a web page with a search engine:
http://www.nukat.edu.pl/cgi-bin/gw_48_1_12/chameleon?sessionid=2010010122520520752&skin=default&lng=pl&inst=consortium&search=KEYWORD&function=SEARCHSCR&SourceScreen=NOFUNC&elementcount=1&pos=1&submit=TabData
I want to use its search engine from a java application.
Currently I'm trying to send a very simple request - only one field filled and no logical operators.
This is my code:
try {
URL url = new URL( nukatSearchUrl );
URLConnection urlConn = url.openConnection();
urlConn.setDoInput( true );
urlConn.setDoOutput( true );
urlConn.setUseCaches( false );
urlConn.setRequestProperty( "Content-Type", "application/x-www-form-urlencoded" );
BufferedWriter out = new BufferedWriter( new OutputStreamWriter( urlConn.getOutputStream() ) );
String content = "t1=" + URLEncoder.encode( "Duma Key", "UTF-8" );
out.write( content );
out.flush();
out.close();
BufferedReader in = new BufferedReader( new InputStreamReader( urlConn.getInputStream() ) );
String rcv = null;
while ( ( rcv = in.readLine() ) != null ) {
System.out.println( rcv );
}
fd.close();
in.close();
} catch ( Exception ex ) {
throw new SearchEngineException( "NukatSearchEngine.search() : " + ex.getMessage() );
}
Unfortunately, what I keep getting is the main site; it looks like this:
<cant post the link to the main site :/>
Not the search results I'm expecting.
What could be wrong here?
I wouldn't go any further with this after reading BalusC's answer. Here are, however, a few pointers, if you aren't worried about being blacklisted:
Set the User-Agent header to pretend to be a browser, for example:
urlConn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB6");
You can simulate a human user in Firefox using Selenium WebDriver.
The URL may be wrong or your request is likely incomplete. You need to check the HTML source (right-click page > View Source), use the same URL as defined in the <form action>, and gather all request parameters (including those from hidden input fields and the button which you intend to "press"!) for use in your query string.
That said, doing so is in most cases a policy violation and may result in your IP being blacklisted. Please check their robots.txt and the "Terms of use", if any (I don't understand Polish). Their robots.txt at least says that everyone is disallowed from accessing the entire website programmatically. Use it at your own risk. You've been warned. Better to contact them and ask if they have any public web service, and then use that instead.
You can always spoof the user-agent request header with a real-looking string extracted from a real web browser to minimize the risk of being recognized as a bot, as pointed out by Bozho here, but you can still get caught based on visitor patterns/statistics.
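The parameter-gathering step described above could be sketched as follows. This is a crude, regex-based illustration that only handles double-quoted attributes and no ">" inside values; a real HTML parser such as Jsoup would be more robust. The class name, method name, and sample form are all invented for the example (the field names merely echo ones that appear in the question).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class FormFieldsSketch {

    private static final Pattern INPUT_TAG =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z-]+)\\s*=\\s*\"([^\"]*)\"");

    // Collects name -> value for every <input> that has a name attribute,
    // which includes hidden fields and the submit button.
    static Map<String, String> inputFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher tag = INPUT_TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attrs = new LinkedHashMap<>();
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), attr.group(2));
            }
            if (attrs.containsKey("name")) {
                fields.put(attrs.get("name"), attrs.getOrDefault("value", ""));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up form markup standing in for the real page source.
        String html = "<form action=\"/search\" method=\"post\">"
                + "<input type=\"hidden\" name=\"sessionid\" value=\"123\">"
                + "<input type=\"text\" name=\"t1\" value=\"\">"
                + "<input type=\"submit\" name=\"submit\" value=\"Search\">"
                + "</form>";
        System.out.println(inputFields(html));
    }
}
```

Every entry in the returned map, with the visible field filled in, would then go into the POST body.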
An easy way to see all activity that you need to replicate is the Live HTTP Headers Firefox Extension. To see all form elements on the page, Firebug is useful. Finally, I often use a fake server that I control to see what the browser is sending, and compare to my application. I rolled my own, just a small Java server that prints out everything sent to it - inverse telnet, if you will.
Another note is that some sites deny access based on the User-Agent, i.e. you might need to get your application to pretend it's Firefox. This is very bad practice, and a little dishonest. As BalusC mentioned, check their usage policy and robots.txt! I would also recommend asking permission if you intend to spread your application around.
Finally, I happen to be working on something similar and you might find the following code useful (it writes a mapping of key -> lists of values to the correct POST format):
StringBuilder builder = new StringBuilder();
try {
boolean first = true; // must start as true so no "&" is written before the first key=value pair
for(Entry<String,List<String>> entry : data.entrySet()) {
for(String value : entry.getValue()) {
if(first) {
first = false;
}
else {
builder.append("&");
}
builder.append(URLEncoder.encode(entry.getKey(), "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8"));
}
}
} catch (UnsupportedEncodingException e1) {
return false;
}
conn.setDoOutput(true);
try {
OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
wr.write(builder.toString());
wr.flush();
conn.connect();
} catch (IOException e) {
return(false);
}
As well as the user-agent it could also be using cookies to check that the search is being sent from the search page.
HttpClient is good for automating form submission including handling any cookies and pretending to be a browser.
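The answer above most likely means Apache HttpClient. As a dependency-free alternative, the sketch below swaps in the JDK's own java.net.CookieManager, which gives URLConnection-based code similar cookie handling. The class and method names, the example.com address, and the cookie value are all made up; the Set-Cookie header is simulated so that the example runs without network access.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
class CookieJarSketch {

    // Installs a JVM-wide cookie jar. URLConnection-based HTTP requests made afterwards
    // store Set-Cookie headers and send matching cookies back automatically.
    static CookieManager installCookieJar() {
        CookieManager jar = new CookieManager();
        CookieHandler.setDefault(jar);
        return jar;
    }

    // Feeds one Set-Cookie header value into the jar, as a real response would.
    static void storeSetCookie(CookieManager jar, URI uri, String setCookieValue) {
        try {
            jar.put(uri, Map.of("Set-Cookie", List.of(setCookieValue)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the Cookie header value the jar would attach to a request for the given URI.
    static String cookieHeaderFor(CookieManager jar, URI uri) {
        try {
            List<String> cookies = jar.get(uri, Map.of()).getOrDefault("Cookie", List.of());
            return String.join("; ", cookies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieManager jar = installCookieJar();
        // Simulates the cookie a search page might set on the first GET.
        storeSetCookie(jar, URI.create("http://www.example.com/"), "sessionid=abc123; Path=/");
        // Shows what would be sent along with the follow-up POST.
        System.out.println(cookieHeaderFor(jar, URI.create("http://www.example.com/search")));
    }
}
```

In real use you would call installCookieJar() once, GET the search page so the site can set its cookies, and then send the POST from the same JVM.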
