Jsoup not downloading entire page - java

The webpage is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm
I want to extract all the <tr class="tr_normal"> elements using Jsoup.
The code I am using is:
Document doc = Jsoup.connect(url).get();
Elements es = doc.getElementsByClass("tr_normal");
System.out.println(es.size());
But the size (1350) is smaller than the actual number of such elements on the page (1452).
I copied the page onto my computer and deleted some <tr> elements, then ran the same code against the local copy and the count was correct. It looks as if there are too many elements for Jsoup to read them all?
So what is happening? Thanks!

Originally I suspected Jsoup's internal HTTP connection handling; nothing is wrong with the selector engine. I didn't dig deep, but there are often problems with a library's proprietary way of handling HTTP connections, so I recommended replacing it with Apache HttpClient (http://hc.apache.org/), or, if you can't add HttpClient as a dependency, checking how the Jsoup source handles the connection.
Update: the real issue is the default maxBodySize of Jsoup's Connection, which truncates the response at 1 MB. See the updated code below; I still keep the HttpClient code as a sample.
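As a minimal sketch of just that fix (url is the page URL from the question; 0 disables the size limit, or you can pass a byte count):
// Raise or disable Jsoup's default 1 MB body-size limit before fetching
Document doc = Jsoup.connect(url)
        .maxBodySize(0) // 0 = unlimited; or a byte count such as 2048000 for 2 MB
        .get();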
Output of the program
load from file= 1452
load from http client= 1452
load from jsoup connect= 1350
load from jsoup connect using maxBodySize= 1452
package test;

import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestJsoup {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF-8", "");
        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("load from file= " + es.size());

        doc = Jsoup.parse(loadContentByHttpClient(), "UTF-8", "");
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from http client= " + es.size());

        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        doc = Jsoup.connect(url).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect= " + es.size());

        int maxBodySize = 2048000; // 2 MB (default is 1 MB); 0 for unlimited size
        doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect using maxBodySize= " + es.size());
    }

    public static InputStream loadContentByHttpClient()
            throws ClientProtocolException, IOException {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);
        return response.getEntity().getContent();
    }

    public static InputStream loadContentFromClasspath()
            throws ClientProtocolException, IOException {
        return TestJsoup.class.getClassLoader().getResourceAsStream(
                "eisdeqty_pf.htm");
    }
}

Related

Jsoup reddit scraper 429 error

So I'm trying to use jsoup to scrape Reddit for images, but when I scrape certain subreddits such as /r/wallpaper, I get a 429 error and am wondering how to fix this. Totally understand that this code is horrible and this is a pretty noob question, but I'm completely new to this. Anyways:
import java.io.*;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class javascraper {

    public static void main(String[] args) throws MalformedURLException {
        Scanner scan = new Scanner(System.in);
        System.out.println("Where do you want to store the files?");
        String folderpath = scan.next();
        System.out.println("What subreddit do you want to scrape?");
        String subreddit = scan.next();
        subreddit = ("http://reddit.com/r/" + subreddit);
        new File(folderpath + "/" + subreddit).mkdir();

        try {
            // connect over http and fetch the page
            Document doc = Jsoup.connect(subreddit).timeout(0).get();
            // get page title
            String title = doc.title();
            System.out.println("title : " + title);
            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get value from href attribute
                String checkLink = link.attr("href");
                Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
                if (imgCheck(checkLink)) { // checks to see if the link points at an image
                    System.out.println("link : " + link.attr("href"));
                    downloadImages(checkLink, folderpath);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static boolean imgCheck(String http) {
        String png = ".png";
        String jpg = ".jpg";
        String jpeg = "jpeg"; // no period so the checker will only check the last four characters
        String gif = ".gif";
        int length = http.length();
        if (http.contains(png) || http.contains("gfycat") || http.contains(jpg) || http.contains(jpeg) || http.contains(gif)) {
            return true;
        } else {
            return false;
        }
    }

    private static void downloadImages(String src, String folderpath) throws IOException {
        String folder = null;
        // Extract the name of the image from the src attribute
        int indexname = src.lastIndexOf("/");
        if (indexname == src.length()) {
            src = src.substring(1, indexname);
        }
        indexname = src.lastIndexOf("/");
        String name = src.substring(indexname, src.length());
        System.out.println(name);
        // Open a URL stream and copy it byte by byte to the target file
        URL url = new URL(src);
        InputStream in = url.openStream();
        OutputStream out = new BufferedOutputStream(new FileOutputStream(folderpath + name));
        for (int b; (b = in.read()) != -1;) {
            out.write(b);
        }
        out.close();
        in.close();
    }
}
Your issue is caused by the fact that your scraper is violating reddit's API rules. Error 429 means "Too many requests" – you're requesting too many pages too fast.
You can make one request every 2 seconds, and you also need to set a proper user agent (the format they recommend is <platform>:<app ID>:<version string> (by /u/<reddit username>)). As it stands, your code is running too fast and doesn't specify a user agent, so it's going to be severely rate-limited.
To fix it, first off, add this to the start of your class, before the main method:
public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";
(Make sure to specify an actual user agent).
Then, change this (in downloadImages)
URL url = new URL(src);
InputStream in = url.openStream();
to this:
URLConnection connection = (new URL(src)).openConnection();
Thread.sleep(2000); //Delay to comply with rate limiting
connection.setRequestProperty("User-Agent", USER_AGENT);
InputStream in = connection.getInputStream();
You'll also want to change this (in main)
Document doc = Jsoup.connect(subreddit).timeout(0).get();
to this:
Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();
Then your code should stop running into that error.
Note that using reddit's API (i.e., /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required and your current code will work.
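A rough sketch of that alternative, assuming the same USER_AGENT constant and the subreddit variable from the code above (parsing the returned JSON is left out):
// Fetch the JSON listing for the subreddit instead of scraping the HTML page;
// ignoreContentType lets Jsoup return a non-HTML body as a String
String json = Jsoup.connect(subreddit + ".json")   // subreddit already holds "http://reddit.com/r/..."
        .userAgent(USER_AGENT)
        .ignoreContentType(true)
        .execute()
        .body();
// json now holds the listing; parse it with any JSON library to extract the image URLs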
As you can look up on Wikipedia, the 429 status code tells you that you have sent too many requests:
The user has sent too many requests in a given amount of time. Intended for use with rate limiting schemes.
A solution would be to slow down your scraper. There are several ways to do this; one of them is to sleep between requests.
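A minimal sketch of that idea (politePause is a hypothetical helper, not part of the original code), assuming roughly two seconds between requests is enough, as the other answer suggests:
// Hypothetical helper: pause between consecutive requests to stay under the rate limit
private static void politePause() {
    try {
        Thread.sleep(2000); // ~2 seconds; adjust to whatever rate the site allows
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}
Calling it before each page or image request spreads the requests out enough to avoid the 429.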

406 Not Acceptable while uploading image to a subdomain

I am trying to upload a file to a URL; for that I am using this code:
import java.io.File;
import java.io.IOException;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.multipart.FilePart;
import org.apache.commons.httpclient.methods.multipart.MultipartRequestEntity;
import org.apache.commons.httpclient.methods.multipart.Part;
public class UploadIt {

    public static void main(String[] args) throws IOException {
        String s = uploadFile(new File("C://paid.png"), "http://abc.xyz.com");
        System.out.println("val is " + s);
    }

    public static String uploadFile(File resourceUrl, String url) throws HttpException, IOException {
        File f = resourceUrl;
        PostMethod filePost = new PostMethod(url);
        Part[] parts = {new FilePart(f.getName(), f)};
        filePost.setRequestEntity(new MultipartRequestEntity(parts, filePost.getParams()));
        HttpClient client = new HttpClient();
        int status = client.executeMethod(filePost);
        String resultUUid = null;
        resultUUid = filePost.getResponseBodyAsString();
        filePost.releaseConnection();
        System.out.println(" status " + status);
        return resultUUid;
    }
}
From source.
It gives this error:
status 406
val is <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>406 Not Acceptable</title>
</head><body>
<h1>Not Acceptable</h1>
<p>An appropriate representation of the requested resource / could not be found on this server.</p>
<p>Additionally, a 404 Not Found
error was encountered while trying to use an ErrorDocument to handle the request.</p>
</body></html>
How can I resolve this problem? The directory has permission 755.
It's not the directory permissions. It may be a restriction on the MIME types accepted by the server - have a look here: http://www.checkupdown.com/status/E406.html
Accept: The MIME types accepted by the client. For example, a browser may only accept back types of data (HTML files, GIF files etc.) it knows how to process.
Maybe you could also print the response headers for further debugging.
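A small sketch of that debugging step, assuming the same commons-httpclient 3.x objects as in the question (the Accept value is just an illustrative guess):
// Before executing: advertise what the client will accept (illustrative value)
filePost.setRequestHeader("Accept", "*/*");

// After client.executeMethod(filePost): dump the response headers for debugging
for (org.apache.commons.httpclient.Header header : filePost.getResponseHeaders()) {
    System.out.println(header.getName() + ": " + header.getValue());
}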

How to use Jsoup to login my university website?

I am trying to build an Android app that needs some information from the university's internal website. I have been trying to use Jsoup to log into the website programmatically. Here is the code I have now:
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
//import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
//import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.Map;
public class Test {

    public static void main(String[] args) {
        Document doc;
        try {
            Connection.Response res = Jsoup
                    .connect("https://sso.bris.ac.uk/sso/login?service=https%3A%2F%2Fwww.cs.bris.ac.uk%2FTeaching%2Fsecure%2Funit-list.jsp%3Flist%3Dmine")
                    .execute();
            Map<String, String> cookies = res.cookies();
            System.out.println(cookies.keySet());

            Document fakepage = res.parse();
            Element fakelt = fakepage.select("input[name=lt]").get(0);
            Element fakeexecution = fakepage.select("input[name=execution]").get(0);
            Element fake_eventID = fakepage.select("input[name=_eventId]").get(0);
            System.out.println("Hello World!");
            System.out.println(fakelt.attr("value"));
            System.out.println(fakeexecution.toString());
            System.out.println(fake_eventID.toString());
            // System.out.println(cookies.get("JSESSIONID"));

            String url = "https://sso.bris.ac.uk/sso/login?service=https%3A%2F%2Fwww.cs.bris.ac.uk%2FTeaching%2Fsecure%2Funit-list.jsp%3Flist%3Dmine";
            System.out.println(url);
            Connection newreq = Jsoup
                    .connect(url)
                    .cookies(cookies)
                    .data("lt", fakelt.attr("value"))
                    .followRedirects(true)
                    .header("Connection", "keep-alive")
                    .header("Refer", " https://sso.bris.ac.uk/sso/login?service=https%3A%2F%2Fwww.cs.bris.ac.uk%2FTeaching%2Fsecure%2Funit-list.jsp%3Flist%3Dmine")
                    .header("Content-Type", "application/x-www-form-urlencoded;charset=UTF-8")
                    .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0")
                    .data("lt", fakelt.attr("value"))
                    .data("execution", fakeexecution.attr("value"))
                    .data("_eventID", fake_eventID.attr("value"))
                    .data("username", "aabbcc")
                    .data("password", "ddeeff")
                    .data("submit", "")
                    .method(Method.POST);
            Connection.Response newres = newreq.execute();
            doc = newres.parse();
            System.out.println(doc.toString());
            System.out.println(newres.statusCode());

            Map<String, String> newcookies = newres.cookies();
            doc = Jsoup.connect("https://www.cs.bris.ac.uk/Teaching/secure/unit-list.jsp?list=mine")
                    .cookies(newcookies)
                    .get();
            System.out.println(doc.toString());
        } catch (IOException e) {
            System.out.println("Exception:");
            System.out.println(e.getMessage());
        }
    }
}
I completely faked the form submission using Jsoup, and to get around the security cookies I first request the website once and then use the cookies it sent me to request it again. The form has some hidden fields, so I use the values I got on my first request to fake them when I submit. However, this does not work. Is it possible to do this, or does the server have some advanced protection against it?
Do not use Jsoup to do this; it requires you to handle all the cookies yourself. Instead, use HttpClient: anything from version 4.0 onward handles the cookies automatically, which is much easier to work with.
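As a rough sketch of that approach, assuming HttpClient 4.x (the form-field names come from the question; the helper class, method, and parameters are illustrative, and the hidden-field values would still come from parsing the first response, e.g. with Jsoup):
import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class SsoLoginSketch {

    // Hypothetical helper: lt/execution/_eventId come from parsing the first response
    public static HttpResponse login(String loginUrl, String lt, String execution,
                                     String eventId, String user, String pass) throws Exception {
        DefaultHttpClient client = new DefaultHttpClient(); // cookie store is managed automatically

        // First GET picks up the session cookies; the client keeps them for later requests
        EntityUtils.consume(client.execute(new HttpGet(loginUrl)).getEntity());

        // POST the login form; the cookies from the GET are attached automatically
        HttpPost post = new HttpPost(loginUrl);
        List<NameValuePair> form = new ArrayList<NameValuePair>();
        form.add(new BasicNameValuePair("username", user));
        form.add(new BasicNameValuePair("password", pass));
        form.add(new BasicNameValuePair("lt", lt));
        form.add(new BasicNameValuePair("execution", execution));
        form.add(new BasicNameValuePair("_eventId", eventId));
        post.setEntity(new UrlEncodedFormEntity(form, "UTF-8"));
        return client.execute(post);
    }
}
After the POST succeeds, the same client instance can fetch the protected page, since it still carries the session cookies.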

cannot download csv files from morningstar.com using authentication

I have a premium account with Morningstar and I tried to download a few CSV files from the premium content area. For some reason I cannot get that premium content. For example, with the premium account I can get 10-year financial statement data, but I've tried all the sample authentication Java code from Apache HttpComponents Client and all of it only gets me content that does not need authentication. How can I tell what authentication protocol Morningstar is using and authenticate successfully? I tried the example code from org.apache.http.examples.client, including ClientAuthentication.java, ClientKerberosAuthentication.java and ClientInteractiveAuthentication.java. If I log into my Morningstar account in Chrome and paste this URL, I get 10 years of data as CSV, but if I access it through Java I only get 5 years of data. Below is one of the sample codes I tried. I didn't get exceptions or errors, but I only got 5 years of data instead of 10.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
public class ClientAuthentication {

    public static void main(String[] args) throws Exception {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        try {
            httpclient.getCredentialsProvider().setCredentials(
                    new AuthScope("morningstar.com", 443),
                    new UsernamePasswordCredentials("xxx@gmail.com", "xxxx")); // anonymized before posting to Stack Overflow
            HttpGet httpget = new HttpGet("http://financials.morningstar.com/ajax/ReportProcess4CSV.html?t=aapl&region=usa&culture=en_US&reportType=is&period=12&dataType=A&order=asc&columnYear=10&rounding=3&view=raw&productCode=USA&r=199209&denominatorView=raw&number=3");
            System.out.println("executing request " + httpget.getRequestLine());

            HttpResponse response = httpclient.execute(httpget);
            HttpEntity entity = response.getEntity();
            BufferedReader in = new BufferedReader(new InputStreamReader(entity.getContent()));
            System.out.println("----------------------------------------");
            System.out.println(response.getStatusLine());

            if (entity != null) {
                System.out.println("Response content length: " + entity.getContentLength());
                int linenum = 0;
                while (true) {
                    String line = in.readLine();
                    if (line == null) break;
                    linenum++;
                    if (linenum > 1)
                        System.out.println(line);
                }
            }
            EntityUtils.consume(entity);
        } finally {
            // When the HttpClient instance is no longer needed,
            // shut down the connection manager to ensure
            // immediate deallocation of all system resources
            httpclient.getConnectionManager().shutdown();
        }
    }
}

Automate HTML form submission using Java to find grocery hours

I'm trying to automate form submission using Java to get the hours of a grocery store here:
www.giantfood.com
I've POSTed the query along with the hidden miles and storeType fields of the form, but my output.html is just the original page's header and footer with an error message in the body. What am I doing wrong?
import java.io.*;
import java.net.*;
public class PostHTML {

    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.giantfood.com/our_stores/locator/store_search.htm");
            HttpURLConnection hConnection = (HttpURLConnection) url.openConnection();
            HttpURLConnection.setFollowRedirects(true);
            hConnection.setDoOutput(true);
            hConnection.setRequestMethod("POST");

            PrintStream ps = new PrintStream(hConnection.getOutputStream());
            ps.print("groceryStoreAddress=20814&groceryStoreMiles=10&storeType=GROCERY");
            ps.close();

            hConnection.connect();
            if (HttpURLConnection.HTTP_OK == hConnection.getResponseCode()) {
                InputStream is = hConnection.getInputStream();
                OutputStream os = new FileOutputStream("output.html");
                int data;
                while ((data = is.read()) != -1) {
                    os.write(data);
                }
                is.close();
                os.close();
                hConnection.disconnect();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
UPDATE
Thanks! Using &'s worked. I'm trying to use HttpClient but I'm getting another error now:
package clientwithresponsehandler;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.BasicResponseHandler;
import org.apache.http.impl.client.DefaultHttpClient;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.protocol.HTTP;
/**
 * This example demonstrates the use of the {@link ResponseHandler} to simplify
 * the process of processing the HTTP response and releasing the associated resources.
 */
public class ClientWithResponseHandler {

    public static void main(String[] args) throws Exception {
        HttpClient httpclient = new DefaultHttpClient();
        try {
            HttpPost httpost = new HttpPost("http://www.giantfood.com/our_stores/locator/store_search.htm");
            System.out.println("executing request " + httpost.getURI());

            List<NameValuePair> nvps = new ArrayList<NameValuePair>();
            nvps.add(new BasicNameValuePair("groceryStoreAddress", "20878"));
            nvps.add(new BasicNameValuePair("groceryStoreMiles", "10"));
            nvps.add(new BasicNameValuePair("storeType", "GROCERY"));
            httpost.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));

            // Create a response handler
            ResponseHandler<String> responseHandler = new BasicResponseHandler();
            String responseBody = httpclient.execute(httpost, responseHandler);
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
        } finally {
            // When the HttpClient instance is no longer needed,
            // shut down the connection manager to ensure
            // immediate deallocation of all system resources
            httpclient.getConnectionManager().shutdown();
        }
    }
}
Output:
run:
executing request http://www.giantfood.com/our_stores/locator/store_search.htm
Exception in thread "main" org.apache.http.client.HttpResponseException: Moved Temporarily
at org.apache.http.impl.client.BasicResponseHandler.handleResponse(BasicResponseHandler.java:67)
at org.apache.http.impl.client.BasicResponseHandler.handleResponse(BasicResponseHandler.java:55)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:945)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:919)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:910)
at clientwithresponsehandler.ClientWithResponseHandler.main(ClientWithResponseHandler.java:39)
Java Result: 1
BUILD SUCCESSFUL (total time: 1 second)
I don't understand the Moved Temporarily error.
Try using
ps.print("groceryStoreAddress=20814&groceryStoreMiles=10&storeType=GROCERY");
instead (with & as the parameter separator).
By the way, it's easier to use an HTTP library such as Apache HttpClient.
Solved the Moved Temporarily error by learning about HTTP redirects:
Httpclient 4, error 302. How to redirect?
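For reference, a possible sketch of that fix, assuming HttpClient 4.1+ (by default only GET/HEAD requests are redirected automatically, so a 302 on a POST surfaces as this exception):
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.LaxRedirectStrategy;

// ...
DefaultHttpClient httpclient = new DefaultHttpClient();
// Follow redirects for POST requests as well, so the handler sees the final page instead of the 302
httpclient.setRedirectStrategy(new LaxRedirectStrategy());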
