How to crawl the mobile version of a website in Java? - java

I want to read the mobile version of a website, but my program reads the normal (desktop) version.
I am setting this request property:
connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
What am I supposed to do?

The server decides which page to serve based on the "User-Agent" header of the request.
To get the mobile version of the page, take a look at this Chrome developer article detailing Chrome for Android user agent strings, and set the "User-Agent" header to that of a mobile client. The User-Agent string in your question (MSIE 5.0 on Windows NT) is that of a desktop browser, not a mobile client.
For example,
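import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;

HttpClient httpclient = new DefaultHttpClient();
// Use HttpGet instead if you only need to read the page
HttpPost httppost = new HttpPost(url);
String userAgent = "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19";
try {
    httppost.setHeader("User-Agent", userAgent);
    // Add your data
    // Execute HTTP Post Request
    HttpResponse response = httpclient.execute(httppost);
    // ....
} catch (IOException e) {
    // Connection or protocol failure
    e.printStackTrace();
}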
This should give you the mobile version of a page, as would be seen by a Galaxy Nexus device.

Here is a list of a ton of mobile browser user agent strings: http://www.useragentstring.com/pages/Mobile%20Browserlist/
Maybe try a different user agent string like:
Opera/9.80 (J2ME/MIDP; Opera Mini/9 (Compatible; MSIE:9.0; iPhone; BlackBerry9700; AppleWebKit/24.746; U; en) Presto/2.5.25 Version/10.54
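Since the question uses connection.setRequestProperty, the same idea can be sketched with the JDK's HttpURLConnection alone. This is a minimal sketch, not the answerers' code: the class name MobileFetch and the helper openAsMobile are made up for illustration, and the user agent is the Galaxy Nexus string quoted above. Whether a given site actually serves a mobile page for it depends on that site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class MobileFetch {
    // Chrome-on-Android user agent quoted in the answer above.
    static final String MOBILE_UA =
        "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) "
        + "AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19";

    // Prepares a request carrying the mobile User-Agent; nothing is sent yet.
    static HttpURLConnection openAsMobile(String pageUrl) {
        try {
            HttpURLConnection connection =
                (HttpURLConnection) new URL(pageUrl).openConnection();
            connection.setRequestProperty("User-Agent", MOBILE_UA);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java MobileFetch <url>");
            return;
        }
        HttpURLConnection connection = openAsMobile(args[0]);
        // Reading the body is what actually sends the request.
        try (InputStream in = connection.getInputStream()) {
            String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            System.out.println(html.length() + " characters received");
        } finally {
            connection.disconnect();
        }
    }
}
```

The header has to be set before the connection is opened; calling setRequestProperty after getInputStream() throws an IllegalStateException.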

Related

How to solve the "Moved Temporarily" error for the Yahoo Finance API

I am working on Yahoo stock data. Yesterday I got the stock data by using the finance web service API. But today, when I try to get the data from the API, I get the error below:
{
   "p": {
      "a": {
         "href": "https://finance.yahoo.com/webservice/v1/symbols/msft,goog,appl,orcl,yhoo,tcs,amzn,INFY.NS/quote?bypass=true&format=json&view=detail",
         "content": "https://finance.yahoo.com/webservice/v1/symbols/msft,goog,appl,orcl,yhoo,tcs,amzn,INFY.NS/quote?bypass=true&format=json&view=detail"
      },
      "content": "Moved Temporarily. Redirecting to"
   }
}
The response says that the resource was moved temporarily.
Why am I getting this error? Did I reach the API limit for today?
NOTE:
Yesterday I kept it running to test the API request limit. But when I try to run it today, it shows the above error.
If the API limit for my IP has been reached, when do I get access to the data again?
This is the API which I am using:
http://finance.yahoo.com/webservice/v1/symbols/msft,goog,appl,orcl,yhoo,tcs,amzn,INFY.NS/quote?format=json&view=detail
As noted in this answer: https://stackoverflow.com/a/38390559/6586718, you have to change the User-Agent to that of a mobile device.
In Java, I do the following, and it works (this is for XML, but the same can be applied to JSON):
URL url = new URL ("https://finance.yahoo.com/webservice/v1/symbols/" + stocks + "/quote");
HttpURLConnection urlc = (HttpURLConnection) url.openConnection ();
urlc.setRequestProperty ("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; MotoE2(4G-LTE) Build/MPI24.65-39) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.81 Mobile Safari/537.36");
Document xml = DocumentBuilderFactory.newInstance ().newDocumentBuilder ().parse (urlc.getInputStream ());
Try this newer endpoint instead:
https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20csv%20where%20url%3D'http%3A%2F%2Fdownload.finance.yahoo.com%2Fd%2Fquotes.csv%3Fs%3DAAPL%26f%3Dsl1d1t1c1ohgv%26e%3D.csv'%20and%20columns%3D'symbol%2Cprice%2Cdate%2Ctime%2Cchange%2Ccol1%2Chigh%2Clow%2Ccol2'&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys
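"Moved Temporarily" is the traditional reason phrase of HTTP status 302, so the response in the question looks like a redirect rather than an explicit rate-limit message. One way to check what the server really answers is to switch off automatic redirect following and print the status code and the Location header. The sketch below assumes nothing about Yahoo's service; RedirectProbe, isRedirect and describe are illustrative names.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical diagnostic class, named for illustration only.
class RedirectProbe {
    // True for the 3xx status codes that normally carry a Location header.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    // Formats one line describing what the server answered.
    static String describe(int status, String location) {
        return isRedirect(status)
            ? status + " redirect to " + location
            : status + " (no redirect)";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java RedirectProbe <url>");
            return;
        }
        HttpURLConnection connection =
            (HttpURLConnection) new URL(args[0]).openConnection();
        // Show the redirect instead of silently following it.
        connection.setInstanceFollowRedirects(false);
        int status = connection.getResponseCode();
        System.out.println(describe(status, connection.getHeaderField("Location")));
        connection.disconnect();
    }
}
```

Running it once with the default user agent and once with a mobile one (set through setRequestProperty) would show whether the redirect depends on the User-Agent header, as the first answer suggests.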

Apache HttpComponents in java HttpClient.execute(get) not doing anything

I'm using Apache HttpComponents to create an HTTP connection with a website. I have written some methods to get website content using POST/GET and to send cookies, receive them, and store them in a class I created called CookieManager. Everything works fine, but when I try to get a page's content using the GET method, the program keeps running but doesn't do anything.
public HttpResponse sendRequestGet(String url, List<NameValuePair> headers) throws IOException{
HttpGet get = new HttpGet(url);
for (NameValuePair header : headers){
get.setHeader(header.getName(), header.getValue());
}
HttpResponse response = client.execute(get);
System.out.println("----------- STATUS CODE -------------");
System.out.println(response.getStatusLine().getStatusCode() + ": " + url);
System.out.println("-------------------------------------");
return response;
}
The code is the one shown above. The string url contains the URL which I want to access, let's say http://mylink.com/market, and the headers parameter is built like this:
List<NameValuePair> headerList = new ArrayList<NameValuePair>();
headerList.add((NameValuePair) new BasicNameValuePair("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"));
headerList.add((NameValuePair) new BasicNameValuePair("Accept-Language", "en-US;q=1,en;q=0.8"));
headerList.add((NameValuePair) new BasicNameValuePair("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"));
headerList.add((NameValuePair) new BasicNameValuePair("Cookie", getCookies()));
headerList.add((NameValuePair) new BasicNameValuePair("Referer", "http://mylink.com/profile/"));
headerList.add((NameValuePair) new BasicNameValuePair("Upgrade-Insecure-Requests", "1"));
If I call the function with getCookies() returning an empty string, it works, but the problem is that I have to send my session ID, which is in the cookies. I tried debugging and found that execution never gets past the line HttpResponse response = client.execute(get); the program keeps running but is stuck on that line. Also, I should mention that I can get other pages while sending the needed cookies, but http://mylink.com/market/ gives me this problem.
I have already used Chrome's 'Network' tab to compare the interaction between the browser and the server; the only difference is that I don't include some headers (like Host).
Does someone know what I'm doing wrong?
Thanks
I was able to fix it by adding this line inside the function, before the client.execute(...) call: client = HttpClientBuilder.create().build();
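The thread does not establish why the original client blocked. When a request appears to hang like this, explicit timeouts at least turn the stall into an exception that can be inspected. The sketch below shows that idea with the JDK's HttpURLConnection rather than Apache HttpComponents, so that it has no external dependencies; TimedGet and prepare are illustrative names, and the timeout values are arbitrary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

// Hypothetical helper class, named for illustration only.
class TimedGet {
    // Prepares a GET that gives up instead of waiting forever.
    static HttpURLConnection prepare(String url, int connectMillis, int readMillis) {
        try {
            HttpURLConnection connection =
                (HttpURLConnection) new URL(url).openConnection();
            connection.setConnectTimeout(connectMillis); // limit on establishing the connection
            connection.setReadTimeout(readMillis);       // limit on waiting for response data
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java TimedGet <url>");
            return;
        }
        HttpURLConnection connection = prepare(args[0], 5_000, 10_000);
        try {
            System.out.println("status: " + connection.getResponseCode());
        } catch (SocketTimeoutException e) {
            // The call would otherwise have blocked here.
            System.out.println("timed out: " + e.getMessage());
        } finally {
            connection.disconnect();
        }
    }
}
```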

How can Google's server distinguish between a browser and HtmlUnit?

If I request the following URL
http://www.google.com/recaptcha/api/noscript?k=MYPUBLICKEY
I get the old no-script version of the captcha, containing an image of a Google street number.
But if I do the same with HtmlUnit, I get what looks like a faked version of the image.
It happens every time: a real-world street number in the browser and blackish distorted text in HtmlUnit. The public key is the same.
How can the Google server distinguish between a browser and HtmlUnit?
The HtmlUnit code is as follows:
final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
final HtmlPage page = webClient.getPage("http://www.google.com/recaptcha/api/noscript?k=" + getPublicKey());
HtmlImage image = page.<HtmlImage>getFirstByXPath("//img");
ImageReader imageReader = image.getImageReader();
The process can be observed with Fiddler.
What about setting the correct headers for your request? User-Agent is the key here.
Headers are how the backend gets client information (Firefox, Chrome, etc.). What are they in your case? Set the correct headers, e.g. for Firefox:
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0.1) Gecko/20100101 Firefox/8.0.1");
conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
This snippet is from my own code, so you need to adapt it to your needs (setRequestProperty is the java.net.URLConnection call; with Apache HttpClient the equivalent is setHeader on the request object).
I know this is an old post, but a good way is to use
WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER);
How did you solve your problem?

Android: Always getting GET on the server even though POST was sent

I am trying to send data using the POST method from my Android app. However, on the server it is always recognized as GET. I am using a Rails app as the web service. Here is the snippet of my Android code:
URI uri = new URI(hostName);
HttpPost httpRequest = new HttpPost(uri);
httpRequest.addHeader("Accept", "application/json");
httpRequest.addHeader("Content-Type", "application/json");
List<NameValuePair> pairs = new ArrayList<NameValuePair>();
pairs.add(new BasicNameValuePair("key1", "value1"));
httpRequest.setEntity(new UrlEncodedFormEntity(pairs));
HttpClient httpClient = new DefaultHttpClient();
HttpResponse httpResponse = httpClient.execute(httpRequest);
Have I done anything wrong? Thanks for your help.
Your Android code looks fine. Make sure your server log doesn't show a 301 redirect code for the POST followed by a 200 code for a GET. Strangely, this can be the case depending on your host configuration.
E.g., you might see something like this:
123.156.189.123 - - [21/Oct/2011:09:03:34 -0700] "POST /server_script.php HTTP/1.1" 301 532 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.202 Safari/535.1"
123.156.189.123 - - [21/Oct/2011:09:03:34 -0700] "GET /server_script.php HTTP/1.1" 200 250 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.202 Safari/535.1"
Here the GET was not redirected (code 200), but the POST was (code 301). If this is happening, then you need to override your redirect settings using .htaccess or other configuration options.
Were you being redirected? I suspect that if you POST to domain A and it redirects you to domain B, your request will be turned into a GET request. I had the same problem until I decided to use the server's IP address in the POST request directly, instead of a hostname that redirects to the IP.
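If the server configuration cannot be changed, the client can handle the redirect itself and repeat the request as a POST. The sketch below uses the JDK's HttpURLConnection instead of the Android HttpClient from the question and follows at most one 301/302 hop; PostFollowingRedirect, resolveLocation and post are illustrative names. Automatically re-sending a POST is not always safe, so treat this as a diagnostic aid rather than a general fix.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
class PostFollowingRedirect {
    // Resolves a Location header (absolute or relative) against the requested URL.
    static String resolveLocation(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }

    // Sends one form-encoded POST without letting the JDK follow redirects itself.
    static HttpURLConnection post(String url, byte[] body) throws IOException {
        HttpURLConnection connection =
            (HttpURLConnection) new URL(url).openConnection();
        connection.setInstanceFollowRedirects(false);
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        connection.setDoOutput(true);
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body);
        }
        return connection;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PostFollowingRedirect <url>");
            return;
        }
        byte[] body = "key1=value1".getBytes(StandardCharsets.UTF_8);
        HttpURLConnection connection = post(args[0], body);
        int status = connection.getResponseCode();
        String location = connection.getHeaderField("Location");
        if ((status == 301 || status == 302) && location != null) {
            String target = resolveLocation(args[0], location);
            connection.disconnect();
            connection = post(target, body); // repeat as POST rather than GET
            status = connection.getResponseCode();
        }
        System.out.println("final status: " + status);
        connection.disconnect();
    }
}
```

Logging the resolved target also shows which URL should be used in the first place, which is the same effect as the IP-address workaround described above.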

Setting user agent in Java httpclient and allow redirects to true

I am trying to set my user agent string in the HttpClient apache object in Java but I cannot find out how to do it.
Please help!
Also, I am trying to enable redirects, but I cannot find this option in the HttpClient object either.
Thanks
Andy
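With HttpClient 4.0, the following worked for me:
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.HttpProtocolParams;
// In 4.x HttpClient is an interface, so instantiate DefaultHttpClient
HttpClient httpclient = new DefaultHttpClient();
HttpProtocolParams.setUserAgent(httpclient.getParams(), "My fancy UA");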
HttpProtocolParams resides in the httpcore JAR file: http://hc.apache.org/httpcomponents-core/download.html
// Commons HttpClient 3.x API (org.apache.commons.httpclient)
HttpClient httpclient = new HttpClient();
httpclient.getParams().setParameter(
HttpMethodParams.USER_AGENT,
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2"
);
Use AndroidHttpClient, and pass the user agent as the argument to newInstance:
// userAgent is a String holding your User-Agent value
AndroidHttpClient client = AndroidHttpClient.newInstance(userAgent);
There are other good reasons to use AndroidHttpClient instead of the raw HttpClient as well.
