How can I Retrieve the HTML of a Search Engine Query Result?

I am trying to retrieve the HTML of a Google search query result using Java. That is, if I search Google.com for a particular phrase, I would like to retrieve the HTML of the resulting web page (the page containing the links to possible matches along with their descriptions, URLs, etc.).
I tried doing this using the following code that I found in a related post:
import java.io.*;
import java.net.*;
import java.util.*;
public class Main {
public static void main (String args[]) {
URL url;
InputStream is = null;
DataInputStream dis;
String line;
try {
url = new URL("https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
is = url.openStream(); // throws an IOException
dis = new DataInputStream(new BufferedInputStream(is));
while ((line = dis.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe ) {
// nothing to see here
}
}
}
}
From: How do you Programmatically Download a Webpage in Java
The URL used in this code was obtained by doing a Google search query from the Google homepage. For some reason I do not understand, if I write the phrase that I want to search for in the URL bar of my web browser and then use the URL of the resulting search result page in the code I get a 403 error.
This code, however, did not return the html of the search query result page. Instead, it returned the source code of the Google homepage.
After doing further research I noticed that if you view the source code of a Google search query result (by right clicking on the background of the search result page and selecting “View page source”) and compare it with the source code of the Google homepage, they are both identical.
If instead of viewing the source code of the search result page I save the html of the search result page (by pressing ctrl+s), I can get the html that I am looking for.
Is there a way to retrieve the html of the search result page using Java?
Thank you!

Rather than parsing the HTML page returned by a standard Google search, you would probably be better off using the official Custom Search API, which returns results from Google in a more usable format. The API is definitely the way to go; otherwise your code could break whenever Google changes the HTML of the google.com front end. The API is designed to be used by developers, and your code would be far less fragile.
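As a rough illustration of the API route, here is a minimal sketch that only builds the request URL for the Custom Search JSON API. The class name, the method name, and the placeholder values for the API key and search engine ID are assumptions made for this example; you would need to create your own key and engine ID before sending a real request, and the JSON response would still have to be fetched and parsed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds a Custom Search JSON API request URL. */
class CustomSearchUrl {
    // Endpoint of the Custom Search JSON API.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /**
     * apiKey and engineId are placeholders (assumptions for this sketch);
     * query is the phrase to search for.
     */
    static String build(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query);
    }

    // Form-encodes one parameter value (spaces become '+').
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID", "UCF football"));
    }
}
```

The resulting string can be passed to the same URL and URLConnection code used in the question; the response is JSON rather than HTML.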
To answer your question, though: We can't really help you just from the information you've provided. Your code seems to retrieve the html of stackoverflow; an exact copy-and-paste of the code from the question you linked to. Did you attempt to change the code at all? What URL are you actually trying to use to retrieve google search results?
I tried to run your code using url = new URL("http://www.google.com/search?q=test"); and I personally get an HTTP error 403 forbidden. A quick search of the problem says that this happens if I don't provide the User-Agent header in the web request, though that doesn't exactly help you if you're actually getting HTML returned. You will have to provide more information if you wish to receive specific help - though switching to the Custom Search API will likely solve your problem.
edit: new information provided in original question; can directly answer question now!
I figured out your problem after packet-capturing the web request that java was sending and applying some basic debugging... Let's take a look!
Here's the web request that Java was sending with your provided example URL:
GET / HTTP/1.1
User-Agent: Java/1.6.0_30
Host: www.google.com
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive
Notice that the request seemed to ignore most of the URL... leaving just the "GET /". That is strange. I had to look this one up.
The documentation of the Java URL class explains it (and this is standard for all web pages): "A URL may have appended to it a 'fragment', also known as a 'ref' or a 'reference'. The fragment is indicated by the sharp sign character '#' followed by more characters ... This fragment is not technically part of the URL."
Let's take a look at your example URL...
https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951
Notice that "#" is the first character in the file path? Java simply ignores everything after the "#" because fragments are used only by the client (the web browser) and are never sent to the server. This leaves you with the URL https://www.google.com/. Hey, at least it was working as intended!
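To see the split without sending any request, here is a small sketch that uses java.net.URI to separate what goes on the HTTP request line from the fragment that stays on the client. The class and method names are made up for this example, and the URL is a shortened version of the one in the question.

```java
import java.net.URI;

/** Hypothetical demo: which part of a URL is actually sent to the server. */
class FragmentDemo {
    /** Returns the path plus query, i.e. what follows "GET " on the request line. */
    static String requestTarget(String address) {
        URI uri = URI.create(address);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // an empty path is requested as "/"
        }
        String query = uri.getRawQuery();
        return query == null ? path : path + "?" + query;
    }

    /** Returns the part after '#', or null when there is none. */
    static String fragment(String address) {
        return URI.create(address).getRawFragment();
    }

    public static void main(String[] args) {
        String ajaxStyle = "https://www.google.com/#hl=en&q=UCF";
        String plain = "https://www.google.com/search?q=UCF";
        System.out.println(requestTarget(ajaxStyle)); // path only, no parameters
        System.out.println(fragment(ajaxStyle));      // the parameters live here
        System.out.println(requestTarget(plain));     // path and query are both sent
    }
}
```

For the first URL the request target is just "/", which matches the "GET /" seen in the packet capture; only the second form carries the search parameters to the server.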
I can't tell you exactly what Google is doing, but the sharp-symbol url definitely means that Google is returning results of the query through some client-side (ajax / javascript) scripting. I'd be willing to bet that any query you send directly to the server (i.e- no "#" symbol) without the proper headers will return a 403 forbidden error - looks like they're encouraging you to use the API :)
edit2: As per Tengji Zhang's answer to this question, here is working code that returns the result of the Google query for "test":
URL url;
InputStream is = null;
BufferedReader reader;
String line;
URLConnection c;
try {
    url = new URL("https://www.google.com/search?q=test");
    c = url.openConnection();
    // Without a browser-like User-Agent the request came back as 403 Forbidden
    c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
    c.connect();
    is = c.getInputStream();
    // BufferedReader replaces the deprecated DataInputStream.readLine();
    // UTF-8 is an assumption about the response encoding
    reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
} catch (MalformedURLException mue) {
    mue.printStackTrace();
} catch (IOException ioe) {
    ioe.printStackTrace();
} finally {
    try {
        // is stays null if the connection failed, so guard against a NullPointerException
        if (is != null) {
            is.close();
        }
    } catch (IOException ioe) {
        // nothing to see here
    }
}

I suggest you try http://seleniumhq.org/
There is a good tutorial of searching in google
http://code.google.com/p/selenium/wiki/GettingStarted

You don't set the User-Agent header in your code:
URLConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
Alternatively, read http://www.google.com/robots.txt. This file tells you which URLs Google's servers allow crawlers to fetch.
The code below runs successfully.
package org.test.stackoverflow;
import java.io.*;
import java.net.*;
import java.util.*;
public class SearcherRetriver {
public static void main (String args[]) {
URL url;
InputStream is = null;
DataInputStream dis;
String line;
URLConnection c;
try {
url = new URL("https://www.google.com.hk/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
c = url.openConnection();
c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
c.connect();
is = c.getInputStream();
dis = new DataInputStream(new BufferedInputStream(is));
while ((line = dis.readLine()) != null) {
System.out.println(line);
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
is.close();
} catch (IOException ioe ) {
// nothing to see here
}
}
}
}

Related

How to scrape the price from dynamically updated webpages?

I have a problem when trying to scrape a price from dynamically updated web pages. The lion's share of the HTML is not returned when I fetch the page with tools like URLConnection, Jsoup, or HtmlUnit.
I don't know much about web scraping, but I guess the problem is that online shops like these:
Auchan,
Silpo
use JavaScript and AJAX to load the main product information. In my opinion, the problem is a redirect or delay that prevents me from getting the fully loaded HTML file with all the needed data.
So, the question is: how do I scrape the price from the links above?
I have already tried several approaches:
1. URLConnection
URL url;
try {
url = new URL("https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/");
URLConnection con = url.openConnection();
InputStream is = con.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
try(FileWriter fileWriter = new FileWriter("output.html")){
while ((line = br.readLine()) != null) {
fileWriter.write(line+"\n");
}
}
} catch (IOException e) {
e.printStackTrace();
}
Runs fine, but returns HTML without the price data.
2. Jsoup
Document document = null;
String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
try {
document = Jsoup.connect(link).get();
} catch (IOException e) {
e.printStackTrace();
}
if (document != null) {
try (FileWriter fileWriter = new FileWriter("output.html")) {
fileWriter.write(document.toString());
} catch (IOException e) {
e.printStackTrace();
}
}
Returns the same.
3. HtmlUnit
String link = "https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/";
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScriptStartingBefore(5000);
HtmlPage htmlPage = null;
try {
htmlPage = webClient.getPage(link);
webClient.waitForBackgroundJavaScript(5000);
} catch (IOException e) {
e.printStackTrace();
}
if (htmlPage!=null){
try (FileWriter fileWriter = new FileWriter("output.html")) {
fileWriter.write(Jsoup.parse(htmlPage.asXml()).toString());
} catch (IOException e) {
e.printStackTrace();
}
}
Returns a little more, including some JavaScript tags, but still nothing useful. Also, the code above throws so many exceptions that they don't even fit in the console.
I also tried to set the user agent like this:
java.net.URLConnection conn = url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
and this:
System.setProperty("http.agent", "")

You need to use Chrome's Dev tools to view the HTTP requests/responses
The page loads up tons of javascript. This in turn churns out a whole load of HTTP requests and waits for the responses: the first that looks interesting is:
https://auchan.ua/graphql which is a POST request with an important http header referer: https://auchan.ua/govjadina-v-kartofel-nom-pjure-so-svekloj-hipp-6440-220-g-297668/ - The response body for the request is: {"data":{"urlResolver":{"type":"PRODUCT","id":297668}}}
Searching the subsequent responses for the product ID value, I found responses that contain it. The responses are all escaped Unicode characters, but if you open the URLs in a browser the content is rendered.
This particular URL, which starts with auchan.ua/graphql/?query=query%20getProductDetail..., looked promising, and sure enough the special_price matches what's displayed on the page. So you'd need to find a way of generating or extracting these URLs from the initial page source.
link to product details
You may also find this response I gave useful for processing JSON data.
The second shop you linked to requires a username/password but the process for getting the data will likely be very similar; use dev tools to view the http requests, work out where the price info is coming from (find the value in one of the responses) then try to recreate the same request from the initial URL and the response returned.
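As a small illustration of the first step, here is a sketch that pulls the numeric product ID out of the urlResolver response body quoted above. It uses a regular expression only to stay within the standard library; that is a shortcut chosen for this example, and a real JSON parser would be more robust. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the product ID from the urlResolver response. */
class ProductIdExtractor {
    // Matches "id": followed by digits; a shortcut, not a full JSON parser.
    private static final Pattern ID_FIELD = Pattern.compile("\"id\"\\s*:\\s*(\\d+)");

    /** Returns the first numeric "id" value in the text, or -1 if none is found. */
    static long extractId(String json) {
        Matcher matcher = ID_FIELD.matcher(json);
        return matcher.find() ? Long.parseLong(matcher.group(1)) : -1L;
    }

    public static void main(String[] args) {
        String body = "{\"data\":{\"urlResolver\":{\"type\":\"PRODUCT\",\"id\":297668}}}";
        System.out.println(extractId(body));
    }
}
```

The extracted ID can then be used when recreating the follow-up product-detail request observed in the dev tools.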
Good luck!

An issue with an URLConnection using java

I'm trying to read the source code of a website.
But there is an issue when I want to fetch the source of this page, for example: "https://www.amazon.de/gp/bestsellers/pet-supplies/#2"
I tried a lot, but I'm still only receiving the source of "https://www.amazon.de/gp/bestsellers/pet-supplies". So something is not working as I want: I want to receive places 21-40, not 1-20.
I'm using a URLConnection and a BufferedReader:
public String fetchPage(String urlS){
String s = null;
String qc = null;
try{
URL url = new URL(urlS);
URLConnection uc = url.openConnection();
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0");
BufferedReader reader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
while((s = reader.readLine()) != null){
qc += s;
}
reader.close();
} catch(IOException e) {
e.printStackTrace();
qc = "receiving qc failed";
}
return qc;
}
Thank you in advance for your effort :)

The URL you're fetching contains an anchor (the #2 at the end). An anchor is a client-side concept and was originally used to jump to a certain part of the page. Some web apps (mostly single-page apps) use the anchor to keep track of some sort of state (e.g. which page of products you're viewing).
Since the anchor is a client-side concept, your browser or HTTP client library drops it before sending the request, so the web server responds as if you had requested https://www.amazon.de/gp/bestsellers/pet-supplies.
Bottom line is that you'll never get the second page this way... Good luck scraping Amazon though ;)

How to Pass a File through an HttpURLConnection

I'm trying to make an image hosted on our server available for display on a client. As per the specs of the project:
"When a Client receives such a URL, it must download the contents (i.e., bytes) of the file referenced by the URL. Before the Client can display the image to the user, it must first retrieve (i.e., download) the bytes of the image file from the Server. Similarly, if the Client receives the URL of a known data file or a field help file from the Server, it must download the content of those files before it can use them."
I'm pretty sure we have the server side stuff down, because if I put the url into a browser it retrieves and displays just fine. So it must be something with the ClientCommunicator class; can you take a look at my code and tell me what the problem is? I've spent hours on this.
Here is the code:
Where I actually call the function to get and display the file: (This part is working properly insofar as it is passing the right information to the server)
JFrame f = new JFrame();
JButton b = (JButton)e.getSource();
ImageIcon image = new ImageIcon(ClientCommunicator.DownloadFile(HOST, PORT, b.getLabel()));
JLabel l = new JLabel(image);
f.add(l);
f.pack();
f.setVisible(true);
From the ClientCommunicator class:
public static byte[] DownloadFile(String hostname, String port, String url){
String image = HttpClientHelper.doGetRequest("http://"+hostname+":"+port+"/"+url, null);
return image.getBytes();
}
The pertinent httpHelper:
public static String doGetRequest(String urlString,Map<String,String> headers){
URL url;
HttpURLConnection connection = null;
try {
//Create connection
url = new URL(urlString);
connection = (HttpURLConnection)url.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Content-Language", "en-US");
connection.setUseCaches (false);
connection.setDoInput(true);
connection.setDoOutput(true);
if(connection.getResponseCode() == 500){
return "failed";
}
//Get Response
InputStream is = connection.getInputStream();
BufferedReader rd = new BufferedReader(new InputStreamReader(is));
String line;
StringBuffer response = new StringBuffer();
while((line = rd.readLine()) != null) {
response.append(line);
}
rd.close();
return response.toString();
} catch (Exception e) {
e.printStackTrace();
return null;
} finally {
if(connection != null) {
connection.disconnect();
}
}
}
After that, it jumps into the server stuff, which as I stated I believe is working correctly because clients such as Chrome can retrieve the file and display it properly. The problem has to be somewhere in here.
I believe that it has to do with the way the bytes are converted into a string and then back, but I do not know how to solve this problem. I've looked at similar problems on StackOverflow and have been unable to apply them to my situation. Any pointers in the right direction would be greatly appreciated.

If your server is sending binary data, you do not want to use an InputStreamReader, or in fact a Reader of any sort. As the Java API indicates, Readers are for reading streams of characters (not bytes) http://docs.oracle.com/javase/7/docs/api/java/io/Reader.html, which means you will run into all sorts of encoding issues.
See this other stack overflow answer for how to read bytes from a stream:
Convert InputStream to byte array in Java
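A minimal sketch of reading raw bytes follows. The class and method names are made up for this example; in the question's code the stream would be connection.getInputStream() and the helper would return byte[] instead of String. The in-memory stream in main only stands in for the network so the sketch runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: copies a stream into a byte array without any Reader. */
class ByteDownloader {
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int count;
        // read() returns -1 at end of stream; bytes are copied untouched
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The first eight bytes of a PNG file, used here as sample binary data.
        byte[] pngHeader = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
        byte[] copy = readFully(new ByteArrayInputStream(pngHeader));
        System.out.println(Arrays.equals(pngHeader, copy));
    }
}
```

The resulting array can be handed straight to new ImageIcon(byte[]), with no String conversion in between.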

Do your homework.
Isolate the issue. Modify the server side to send only the 256 possible byte values. Then do a binary search to reduce the problem to a small set of bytes.
Use HTTP proxy tools to monitor the bytes as they are transmitted: Fiddler in the Windows world; find equivalents for *nix environments.
Then see where the problem is happening and google/bing your suspicions or share the result.
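The isolation step can be sketched as below. This is only an illustration under stated assumptions: the class and method names are invented, and the String round trip in main stands in for the real client path; in an actual test the received array would be whatever the client downloaded from the modified server.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical probe: finds where a byte payload first gets corrupted. */
class ByteCorruptionProbe {
    /** Builds the test payload: every possible byte value, in order. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    /** Returns the index of the first difference, or -1 if the arrays are identical. */
    static int firstMismatch(byte[] expected, byte[] actual) {
        int shared = Math.min(expected.length, actual.length);
        for (int i = 0; i < shared; i++) {
            if (expected[i] != actual[i]) {
                return i;
            }
        }
        return expected.length == actual.length ? -1 : shared;
    }

    public static void main(String[] args) {
        byte[] sent = allByteValues();
        // Stand-in for a lossy client: bytes -> String -> bytes.
        byte[] received = new String(sent, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("first corrupted index: " + firstMismatch(sent, received));
    }
}
```

A mismatch index at or above 128 points at character decoding, since those byte values are not valid single-byte UTF-8; an earlier mismatch suggests something else, such as line terminators being dropped by readLine().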

HttpsURLConnection and POST

Some time ago I wrote this program in Python, which logged in to a website over HTTPS, fetched some info, and logged out.
The program was quite simple:
class Richiesta(object):
    def __init__(self,url,data):
        self.url = url
        self.data = ""
        self.content = ""
        for k, v in data.iteritems():
            self.data += str(k)+"="+str(v)+"&"
        if(self.data == ""):
            self.req = urllib2.Request(self.url)
        else:
            self.req = urllib2.Request(self.url,self.data)
        self.req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6')
        self.req.add_header('Referer', baseurl+'/www/')
        self.req.add_header('Cookie', cookie )
    def leggi(self):
        while(self.content == ""):
            try:
                r = urllib2.urlopen(self.req)
            except urllib2.HTTPError, e:
                print("Server error, retrying in 15 seconds")
                time.sleep(15)
            except urllib2.URLError, e:
                print("Network problem, I will try to reconnect in 20 seconds")
                time.sleep(20)
            else:
                self.content = r.read().decode('utf-8')
def login(username,password):
    global cookie
    print("Starting the login procedure")
    url = "https://example.com/auth/Authenticate"
    data = {"login":"1","username":username,"password":password}
    f = Richiesta(url,data)
    f.leggi()
Now, for some reason, I have to translate it to Java. So far, this is what I've written:
import java.net.*;
import java.security.Security.*;
import java.io.*;
import javax.net.ssl.*;
public class SafeReq {
String baseurl = "http://www.example.com";
String useragent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6";
String content = "";
public SafeReq(String s, String sid, String data) throws MalformedURLException {
try{
URL url = new URL(s);
HttpsURLConnection request = ( HttpsURLConnection ) url.openConnection();
request.setUseCaches(false);
request.setDoOutput(true);
request.setDoInput(true);
request.setFollowRedirects(true);
request.setInstanceFollowRedirects(true);
request.setRequestProperty("User-Agent",useragent);
request.setRequestProperty("Referer","http://www.example.com/www/");
request.setRequestProperty("Cookie","sid="+sid);
request.setRequestProperty("Origin","http://www.example.com");
request.setRequestProperty("Content-Type","application/x-www-form-urlencoded");
request.setRequestProperty("Content-length",String.valueOf(data.length()));
request.setRequestMethod("POST");
OutputStreamWriter post = new OutputStreamWriter(request.getOutputStream());
post.write(data);
post.flush();
BufferedReader in = new BufferedReader(new InputStreamReader(request.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
content += inputLine;
}
post.close();
in.close();
} catch (IOException e){
e.printStackTrace();
}
}
public String leggi(){
return content;
}
}
The problem is that the login doesn't work, and when I try to get a page that requires me to be logged in, I get the "Login Again" message.
The two classes seem to be much the same, and I can't understand why I can't make the second one work... any ideas?

Where do you get your sid from? From the symptoms, I would guess that your session cookie is not passed correctly to the server.
See this question for possible solution: Cookies turned off with Java URLConnection.
In general, I recommend you to use HttpClient for implementing HTTP conversations in Java (anything more complicated than a simple one-time GET or POST). See code examples (I guess "Form based logon" example is appropriate in your case).
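If you stay with the built-in URLConnection classes, one standard-library option is to install a cookie manager before opening any connection. This is a minimal sketch, not necessarily what the linked question proposes; the class and method names are hypothetical, and ACCEPT_ALL is chosen here only to keep the example simple.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

/** Hypothetical helper: makes URLConnection remember session cookies. */
class SessionCookies {
    /** Installs a JVM-wide, in-memory cookie store and returns it. */
    static CookieManager install() {
        // null store = default in-memory store; ACCEPT_ALL is an assumption for this sketch
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    public static void main(String[] args) {
        CookieManager manager = install();
        // HTTP(S) connections opened after this point store cookies from
        // Set-Cookie response headers and send them back on later requests.
        System.out.println("cookies stored so far: "
                + manager.getCookieStore().getCookies().size());
    }
}
```

With this in place, the session cookie set by the login response should be sent automatically on the next request, instead of being passed by hand through the Cookie header.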

Anyone looking for this in the future, take a look at HtmlUnit.
This answer has a nice example.

Problem sending a POST Request with Parameters From a Java app

There's a web page with a search engine:
http://www.nukat.edu.pl/cgi-bin/gw_48_1_12/chameleon?sessionid=2010010122520520752&skin=default&lng=pl&inst=consortium&search=KEYWORD&function=SEARCHSCR&SourceScreen=NOFUNC&elementcount=1&pos=1&submit=TabData
I want to use its search engine from a java application.
Currently I'm trying to send a very simple request - only one field filled and no logical operators.
This is my code:
try {
URL url = new URL( nukatSearchUrl );
URLConnection urlConn = url.openConnection();
urlConn.setDoInput( true );
urlConn.setDoOutput( true );
urlConn.setUseCaches( false );
urlConn.setRequestProperty( "Content-Type", "application/x-www-form-urlencoded" );
BufferedWriter out = new BufferedWriter( new OutputStreamWriter( urlConn.getOutputStream() ) );
String content = "t1=" + URLEncoder.encode( "Duma Key", "UTF-8" );
out.write( content );
out.flush();
out.close();
BufferedReader in = new BufferedReader( new InputStreamReader( urlConn.getInputStream() ) );
String rcv = null;
while ( ( rcv = in.readLine() ) != null ) {
System.out.println( rcv );
}
fd.close();
in.close();
} catch ( Exception ex ) {
throw new SearchEngineException( "NukatSearchEngine.search() : " + ex.getMessage() );
}
Unfortunately, what I keep getting is the main site, which looks like this:
<cant post the link to the main site :/>
Not the search results I'm expecting.
What could be wrong here?

I wouldn't go any further with this after reading BalusC's answer. Here are, however, a few pointers, if you aren't worried about being blacklisted:
Set the User-Agent header to pretend to be a browser. For example:
urlConn.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB6");
You can simulate a human user in Firefox using Selenium WebDriver.

The URL may be wrong, or your request is likely incomplete. You need to check the HTML source (right-click the page > View Source), use the same URL as defined in the <form action>, and gather all request parameters (including those from hidden input fields and the button which you intend to "press"!) for use in your query string.
That said, doing so is in most cases a policy violation and may result in your IP being blacklisted. Please check their robots.txt and the "Terms of use", if any; I don't understand Polish. Their robots.txt at least says that everyone is disallowed from accessing the entire website programmatically. Use it at your own risk. You've been warned. Better to contact them, ask whether they have a public web service, and use that instead.
You can always spoof the User-Agent request header with a realistic string taken from a real web browser to minimize the risk of being recognized as a bot, as pointed out by Bozho here, but you can still get caught based on visitor patterns/statistics.
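The robots.txt check can be approximated in code. The sketch below is deliberately simplified and rests on stated assumptions: it only looks at Disallow rules in groups addressed to "User-agent: *", treats each rule as a plain path prefix, and ignores Allow lines and wildcards. The class and method names are hypothetical, and the robots.txt text is passed in as a string rather than downloaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified robots.txt check for the "*" user agent. */
class RobotsCheck {
    /** Collects the Disallow prefixes from groups that apply to every agent. */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    inStarGroup = false; // a new group starts here
                }
                if (value.equals("*")) {
                    inStarGroup = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** True if the path starts with any collected Disallow prefix. */
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /\n";
        System.out.println(isDisallowed(robots, "/cgi-bin/chameleon"));
    }
}
```

A result of true means the site asks crawlers to stay away from that path; it says nothing about the terms of use, which still need to be read separately.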

An easy way to see all activity that you need to replicate is the Live HTTP Headers Firefox Extension. To see all form elements on the page, Firebug is useful. Finally, I often use a fake server that I control to see what the browser is sending, and compare to my application. I rolled my own, just a small Java server that prints out everything sent to it - inverse telnet, if you will.
Another note is that some sites deny access based on the User-Agent, i.e. you might need to get your application to pretend it's Firefox. This is very bad practice, and a little dishonest. As BalusC mentioned, check their usage policy and robots.txt! I would also recommend asking permission if you intend to spread your application around.
Finally, I happen to be working on something similar and you might find the following code useful (it writes a mapping of key -> lists of values to the correct POST format):
StringBuilder builder = new StringBuilder();
try {
boolean first = true; // start as true so no "&" is written before the first pair
for(Entry<String,List<String>> entry : data.entrySet()) {
for(String value : entry.getValue()) {
if(first) {
first = false;
}
else {
builder.append("&");
}
builder.append(URLEncoder.encode(entry.getKey(), "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8"));
}
}
} catch (UnsupportedEncodingException e1) {
return false;
}
conn.setDoOutput(true);
try {
OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
wr.write(builder.toString());
wr.flush();
conn.connect();
} catch (IOException e) {
return(false);
}

As well as the user-agent it could also be using cookies to check that the search is being sent from the search page.
HttpClient is good for automating form submission including handling any cookies and pretending to be a browser.
