Hello, this is my first question on here and I was wondering if anybody has a solution to my problem. I am trying to get the full content of a web page after everything has loaded. For example, I have a website that pulls information in after the page has loaded, like a search page that uses AJAX to request data from the server. When I run the code, all I get is the basic shell of the page and nothing from the search results.
URL url = new URL("a_url");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}
in.close();
I am searching The Pirate Bay for torrents, as I am testing the use of magnet downloads in Java, and when I try to collect the magnet links and the names of the torrents, inputLine never contains anything I have searched for, only what the website consists of before the search results have been added. Any help would be much appreciated, thanks.
With your code, you are requesting the page from the server and printing it to standard output.
Any content pulled in after the page has loaded is requested by JavaScript, and that JavaScript is interpreted by the web browser. If you want the same result, you have to execute the JavaScript the way the browser does. Jsoup by itself does not run JavaScript; tools such as HtmlUnit or Selenium can.
Other solution: the JavaScript is accessing the server via an HTTP API. Try to call that API directly from your Java code, without requesting the main page; a minimal sketch follows.
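For illustration, here is a rough sketch of calling such an API directly with plain HttpURLConnection. The endpoint and query parameters are hypothetical; you would find the real ones by watching the Network tab of the browser's developer tools while the search runs.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ApiCallExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical search endpoint -- replace with the URL the page's
        // JavaScript actually calls.
        URL api = new URL("https://example.com/search?q=ubuntu&format=json");
        HttpURLConnection con = (HttpURLConnection) api.openConnection();
        con.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw response returned by the API
            }
        }
    }
}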
When I want to get the source code of a specific web page, I use the following code:
URL url = new URL("https://google.de");
URLConnection urlConnect = url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnect.getInputStream())); // Here is the error with the amazon url
StringBuffer sb = new StringBuffer();
String line, htmlData;
while ((line = br.readLine()) != null) {
    sb.append(line + "\n");
}
htmlData = sb.toString();
The code above works without problems, but when the URL is...
URL url = new URL("https://amazon.de");
...then you might sometimes get an IOException with server error code 503. In my opinion this doesn't make any sense, because I can open the Amazon web page in a browser without any errors.
When accessing https://amazon.de with curl -v, you either get a 503 or a 301 status code in the response (when following the redirect, you get a 503 from the referenced location https://www.amazon.de/). The body contains the following comment:
To discuss automated access to Amazon data please contact api-services-support#amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.de/ref=rm_5_sv, or our Product Advertising API at https://partnernet.amazon.de/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
I assume Amazon returns this response when your request is detected as coming from a non-browser context (e.g. by inspecting the User-Agent header), to nudge you towards using the APIs instead of crawling the site directly.
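If you want to see that message from Java, a minimal sketch like the following prints the status code and, for error responses, reads the error stream, which is where the body containing that comment should appear:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class StatusCodeExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://amazon.de");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();

        int status = con.getResponseCode(); // e.g. 301 or 503
        System.out.println("HTTP status: " + status);

        // For error codes the body is on the error stream, not the input stream.
        InputStream body = status >= 400 ? con.getErrorStream() : con.getInputStream();
        if (body != null) {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(body))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // should include the "automated access" comment
                }
            }
        }
    }
}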
I want to log in to an https website using Jsoup and then make subsequent calls to 3-4 services to check whether a job is done or not.
public class JSOUPTester {
    public static void main(String[] args) {
        System.out.println("Inside the JSOUP testing method");
        String url = "https://someloginpage.com";
        try {
            Document doc = Jsoup.connect(url).get();
            String S = doc.getElementById("username").text();  // LINE 1
            String S1 = doc.getElementById("password").text(); // LINE 2
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Exception:
java.lang.NullPointerException
JSOUPTester.main(JSOUPTester.java:7)
I have checked in Chrome that the page contains elements with the ids "username" and "password".
The lines marked above are throwing a NullPointerException. What am I doing wrong here?
A number of things can be the cause of this. Without the URL I can't be certain, but here are some clues:
Some pages load their content via AJAX. Jsoup can't deal with this, since it does not interpret any JavaScript. You can check for this by downloading the page with curl, or by viewing it in a browser with JavaScript turned off. To deal with pages that use JavaScript to render themselves, you can use tools like Selenium WebDriver or HtmlUnit.
The web server of the page that you try to load might require a cookie to be present. You need to look at the network traffic that happens during loading of that page. In Chrome or Firefox you can see this in the developer tools on the network tab.
The web server might respond differently for different clients. That is why you may have to set the user agent string to a known browser in your Jsoup HTTP request:
Jsoup.connect("url").userAgent("Mozilla/5.0")
Jsoup has a size limit of 1MB for the downloaded HTML source. You can turn this off or set it to a larger value if needed:
Jsoup.connect("url").maxBodySize(0)
Jsoup might time out on the request. To change the timeout behavior, use:
Jsoup.connect("url").timeout(milliseconds)
There might be other reasons I have not thought of. A combined sketch with a null check is shown below.
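Putting several of these settings together, here is a minimal Jsoup sketch (URL and element id taken from the question) that sets a user agent, body size and timeout, and null-checks the element before calling text(), which avoids the NullPointerException:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDebugExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://someloginpage.com")
                .userAgent("Mozilla/5.0")   // pretend to be a browser
                .maxBodySize(0)             // no limit on the downloaded size
                .timeout(10_000)            // 10 second timeout
                .get();

        // getElementById returns null if the element is not in the fetched HTML
        // (for example when the login form is rendered by JavaScript).
        Element username = doc.getElementById("username");
        if (username == null) {
            System.out.println("No #username element -- the form is probably built by JavaScript");
        } else {
            System.out.println("Found: " + username.text());
        }
    }
}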
Okay, so what I want to do is download the HTML of a Facebook page from Java code.
I know how to do that; the problem is that I want to download the HTML as I would see it in View page source in my browser while logged in, instead of getting the Facebook login page.
I know that I can use the API, but I just want to check one thing in the HTML, and pulling in a whole API for that seems like overkill.
So I was wondering if there is a simple way of doing that (maybe I should hit some URL first with my credentials, although I don't think that is the right way to do it).
I want to download HTML from facebook from Java code
You can do that by reading from a URLConnection:
import java.net.*;
import java.io.*;

public class URLConnectionReader {
    public static void main(String[] args) throws Exception {
        URL facebook = new URL("http://www.facebook.com/or any dir");
        URLConnection yc = facebook.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();
    }
}
You can pass in any URL and get the source code of that page.
To view or save the source code, redirect the output to a file:
java URLConnectionReader > facebook.html (or any other file name)
The problem comes when I want to download the HTML as it would be if I were logged in (but of course I'm not, it just downloads the login page). And I don't know how to programmatically log in, so that I can download the HTML as it would be after I've logged in.
First a word of caution: if you don't have direct permission to do this, beware, the site in question may preclude it in their terms of service.
To answer the question: there are many, many reasons a site will reject a login. To do this successfully you need to get as close as possible to how a browser would handle the transaction, and to do that you need to see what a real browser is actually doing.
HTTPS is more tricky, as many HTTP sniffers can't deal with it, but HttpWatch claims it can. Check out the HTTP transactions and then try to replicate them.
Your url.openConnection() call will actually return an instance of HttpURLConnection. Cast to that and you will be able to easily set various HTTP headers, such as the User-Agent.
A final note: a cookie may well be required, and your current code does not deal with cookies. For that you will need a cookie manager, e.g. java.net.CookieManager; a minimal sketch follows.
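As a rough sketch of those last two points (casting to HttpURLConnection and installing a cookie manager), something along these lines could serve as a starting point. The URL is just the Facebook front page and no actual login is performed; a real login would also have to POST the form fields the browser sends, which you capture with a sniffer or the dev tools.

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

public class LoginHeadersExample {
    public static void main(String[] args) throws Exception {
        // Install a default cookie manager so session cookies survive between requests.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

        URL url = new URL("https://www.facebook.com/");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestProperty("User-Agent", "Mozilla/5.0"); // look like a browser

        System.out.println("Status: " + con.getResponseCode());
        // Subsequent requests made through the default CookieHandler will
        // automatically send back any cookies the server has set.
    }
}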
I'm attempting to write a web scraper, and the website is returning a 403 Forbidden to my code even though the page is accessible through a browser. My main question is: is this something they set up on the website to discourage web scraping, or am I doing something wrong?
import java.net.*;
import java.io.*;

public class Main {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.pcgs.com/prices/");
        BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();
    }
}
If I change the URL to a website like http://www.google.com then it returns HTML. If the site is blocking me, is there a way around that? Thanks for the help.
Don't know much Java, but this simple Python code worked without an error when I tried it, saving the content as it appeared in my browser:
import requests

r = requests.get('http://www.pcgs.com/prices/')
with open('out.html', 'wb') as f:  # binary mode, since r.content is bytes
    f.write(r.content)
This sends a slightly unusual, non-browser user agent and still succeeds, so their site probably isn't blocking you on the basis of the user agent. Maybe you've hit the site too quickly and they've blocked your IP address or rate limited you? If you intend to scrape sites, you should be polite and limit the number of requests you make.
Another thing you can do before scraping is check the site's robots.txt (like this one for Stack Overflow), which explicitly declares the site's policies towards automated scrapers. (In this case, the PCGS site doesn't appear to have one.)
The web server may be configured to block unrecognized user agents.
You can verify this by making your program send a standard User-Agent value (i.e. one corresponding to an existing web browser) and seeing whether it makes any difference, as in the sketch below.
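A minimal sketch of that check, using the URL from the question and a browser-like User-Agent string (whether this actually gets past the block is not guaranteed):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentTest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.pcgs.com/prices/");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        // Identify as a common browser instead of the default "Java/1.x" agent.
        con.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

        System.out.println("Status: " + con.getResponseCode()); // compare with the 403 you saw
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}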
I have written a servlet that should act as a web proxy, but some of the JavaScript GET calls only return part of the original content when I load a page like localhost:8080/Proxy?requestURL=example.com.
When I print the content of the JavaScript files to the console, they are complete, but the response at the browser is truncated.
I am writing the response like this:
ServletOutputStream sos = resp.getOutputStream();
OutputStreamWriter writer = new OutputStreamWriter(sos);
..
String str = content_of_get_request
..
writer.write(str);
writer.flush();
writer.close();
The strange thing is that when I request the JavaScript file that was loaded during the page request directly, like this:
localhost:8080/Proxy?requestURL=anotherexample.com/needed.js
the whole content is returned to the browser.
It would be great if someone had an idea.
Regards
UPDATE:
The problem was the way I created the response string:
while ((line = rd.readLine()) != null) {
    response.append(line);
}
I read the stream line by line and appended each line to a StringBuffer, which drops the newline characters, and it appears that Firefox and Chrome had a problem with the resulting single, very long line.
It seems that some browsers implement a maximum line length for JavaScript, even though the HTTP/1.1 standard (RFC 2616) does not mention any maximum line length.
Fix:
Just appending a "\n" to each line fixes the issue:
response.append(line+"\n");
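An alternative that sidesteps the problem entirely is to forward the upstream response body byte-for-byte instead of rebuilding it line by line, so newlines and the charset are preserved exactly. This is only a sketch; StreamCopy and the parameter names are mine, and it assumes you have the upstream connection's InputStream and the servlet's OutputStream at hand.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public final class StreamCopy {
    // Copy the upstream body unchanged to the client.
    public static void copy(InputStream upstream, OutputStream toClient) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = upstream.read(buffer)) != -1) {
            toClient.write(buffer, 0, n);
        }
        toClient.flush();
    }
}

In the servlet you would then call something like StreamCopy.copy(upstreamConnection.getInputStream(), resp.getOutputStream()); and skip the Reader/Writer layer altogether.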
That happens because you are just reading the HTML response, but you are not actually fetching the other resources that are referenced in the HTML, such as images, JS files, etc.
You can observe this when you monitor how the browser renders the HTML, for example through Firebug for Firefox:
1) The browser receives the HTML response.
2) Then it parses it for referenced resources and makes a separate GET call for each of them.
So in order for the proxy to work you need to mimic this browser behavior.
My advice is to use an already available open-source library such as HtmlUnit, for example as sketched below.
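For reference, a minimal HtmlUnit sketch along these lines; the package names are from the 2.x releases (newer 3.x versions use org.htmlunit instead of com.gargoylesoftware.htmlunit), and the URL is just a placeholder:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage("http://example.com");
            // Give AJAX calls a chance to finish before reading the DOM.
            webClient.waitForBackgroundJavaScript(5_000);

            System.out.println(page.asXml()); // the DOM after scripts have run
        }
    }
}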