Jsoup will not scrape entire HTML from web page

I currently have an instance of a Document object, created with Jsoup's connect method, which makes an HTTP request. When I call the .html() method on the instance doc and print the result, some tags appear to be missing. Comparing my output to the page source in my browser (Firefox), elements are missing; for instance, a YouTube video page has a div tag with a class attribute of "html5-video-container" that does not appear in my output.
For reference, the source code I am using is as follows:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class JsoupTester {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://m.youtube.com/watch?v=ycPr5-27vSI").userAgent("Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) FxiOS/1.0 Mobile/12F69 Safari/600.1.4").get();
            System.out.println(doc.html());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Related

Read Fortnite Json API

I'm trying to read JSON from a website (fortniteapi.com), but every time I try to download it to my local computer, the download fails. I've been at this for about a week and I can't figure out why it won't work.
I'm also using Gson.
Here is my code so far:
package sample;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import javafx.application.Application;
import javafx.fxml.FXMLLoader;
import javafx.scene.Parent;
import javafx.scene.Scene;
import javafx.stage.Stage;
import java.io.*;
import java.net.URL;
import java.net.URLConnection;
public class Main extends Application {
@Override
public void start(Stage primaryStage) throws Exception{
Parent root = FXMLLoader.load(getClass().getResource("sample.fxml"));
primaryStage.setTitle("Fortnite");
primaryStage.setScene(new Scene(root, 300, 275));
primaryStage.show();
ReadJson();
}
public static void main(String[] args) {
launch(args);
}
public void ReadJson()
{
try {
// read url
String sURL = "https://fortnite-public-api.theapinetwork.com/prod09/users/id?username=Ninja"; //just a string
// Connect to the URL using java's native library
URL url = new URL(sURL);
URLConnection request = url.openConnection();
request.connect();
// Convert to a JSON object
JsonParser jp = new JsonParser(); //from gson
JsonElement root = jp.parse(new InputStreamReader((InputStream) request.getContent())); //Convert the input stream to a json element
JsonObject rootobj = root.getAsJsonObject();
String output = rootobj.get("username").getAsString(); //just grab the username value
// print out the result/output
System.out.println(output);
} catch (IOException e) {
System.out.println("Unexpected Error.");
// JOptionPane.showMessageDialog(null, "Oh no something went wrong.", "Unexpected Error", JOptionPane.ERROR_MESSAGE);
System.exit(1);
}
}
}

The error
Reading the error stream of the request (getErrorStream(), available after casting the request to HttpURLConnection) prints HTML that states:
Access denied | fortnite-public-api.theapinetwork.com used Cloudflare
to restrict access
and
The owner of this website (fortnite-public-api.theapinetwork.com) has
banned your access based on your browser's signature
(mybrowsersignature).
What does this mean
Cloudflare states that this error means that:
the domain owner is blocking this request based on the client's web
browser signature.
and that the feature is called "Browser Integrity Check". From there we can find What does the Browser Integrity Check do?:
Cloudflare's Browser Integrity Check (BIC) is similar to Bad Behavior
and looks for common HTTP headers abused most commonly by spammers and
denies access to your page. It will also challenge visitors that do
not have a user agent or a non standard user agent (also commonly used
by abuse bots, crawlers or visitors).
Solution
We can change the User-Agent of request to something that should be valid before request.connect(); like so (user agent copied from User-Agent | MDN):
request.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0");
The expected output is printed:
Ninja
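To show where the header has to be set, here is a minimal sketch that uses only the JDK. The class and method names (UserAgentRequest, open, readBody) are made up for illustration; the URL and user-agent string are the ones from the question and answer above, and the endpoint may no longer be reachable, so the request is wrapped in a try/catch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class UserAgentRequest {
    // Browser-like user agent, copied from the answer above.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0";

    // Opens a connection object (no network traffic yet) with the User-Agent header set.
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            // Request properties must be set before the connection is established.
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the body; for HTTP 4xx/5xx the content is on the error stream instead.
    static String readBody(HttpURLConnection conn) throws IOException {
        int status = conn.getResponseCode(); // connects implicitly
        InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (in == null) {
            return "";
        }
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) {
        String url = "https://fortnite-public-api.theapinetwork.com/prod09/users/id?username=Ninja";
        HttpURLConnection conn = open(url, USER_AGENT);
        try {
            System.out.println("HTTP status: " + conn.getResponseCode());
            System.out.println(readBody(conn));
        } catch (IOException e) {
            System.err.println("Request failed: " + e);
        }
    }
}
```

Reading the error stream on a 4xx/5xx status is what reveals a block page like the Cloudflare one quoted above, instead of just an IOException.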

Why "http://www.stackoverflow.com" is not getting parsed but "http://www.javatpoint.com/java-tutorial" is getting parsed

I am trying to learn the basic methods of jsoup. I tried to get all the hyperlinks of a particular web page. When I use the Stack Overflow link, I am unable to get all the hyperlinks on that page, but when I change it to javatpoint, it works.
Can someone explain why?
Here is the code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.*;
import org.jsoup.nodes.*;
import java.io.*;
import org.jsoup.nodes.Document;
class Repo {
// String html;
public static void main(String s[]) throws IOException {
try {
Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
// Document doc=Jsoup.connect("http://www.stackoverflow.com").get();
System.out.println("doc");
// Elements link=(Elements)doc.select("span[class]");
// Elements link = doc.select("span").first();
// Elements link = (Elements)doc.select("span");
Elements link = (Elements) doc.select("a[href]");
for (Element el : link) {
// System.out.print("-");
// System.out.println(el.attr("class"));
String str = el.attr("href");
System.out.println(str);
}
} catch (Exception e) {
}
}
}

Many websites require HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
    .connect("http://www.stackoverflow.com")
    .userAgent("Mozilla/5.0")
    .get();
Side note:
You should never catch exceptions and then silently ignore the failure case. At least do some logging there; otherwise your programs will be very hard to debug.
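As a rough illustration of the side note, here is one way to log a failure instead of swallowing it, using java.util.logging from the JDK. The names LoggedFetch, IoAction and runOrLog are hypothetical helpers, not part of jsoup; in the question's code the action passed in would be the Jsoup.connect(...).get() call.

```java
import java.io.IOException;
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

class LoggedFetch {
    private static final Logger LOG = Logger.getLogger(LoggedFetch.class.getName());

    // Hypothetical helper type: an action that may fail with an IOException.
    interface IoAction<T> {
        T run() throws IOException;
    }

    // Runs the action; on failure the exception is logged (with stack trace)
    // and an empty Optional is returned, so the caller can still react to it.
    static <T> Optional<T> runOrLog(String description, IoAction<T> action) {
        try {
            return Optional.ofNullable(action.run());
        } catch (IOException e) {
            LOG.log(Level.WARNING, "Failed: " + description, e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> ok = runOrLog("constant page", () -> "<html></html>");
        Optional<String> failed = runOrLog("simulated fetch", () -> {
            throw new IOException("connection refused (simulated)");
        });
        System.out.println("first call produced a value: " + ok.isPresent());
        System.out.println("second call produced a value: " + failed.isPresent());
    }
}
```

With this in place, a blocked or failed request shows up in the log with its cause rather than producing an empty result with no explanation.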

How to get all the source code from a page with Jsoup - Java [duplicate]

One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup?
I can't paste the page code here since it is too long: http://pastebin.com/qw4Rfqgw
Here's the element whose content I need: <div id='tags_list'></div>
I need to get this information in Java, preferably using Jsoup. The element is filled with the help of JavaScript:
<div id="tags_list">
разведчик
Sr
стратегический
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test
{
public static void main( String[] args )
{
try
{
Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
Elements Tags = Doc.select( "#tags_list a" );
for ( Element Tag : Tags )
{
System.out.println( Tag.text() );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
}

Jsoup is an HTML parser, not an embedded browser engine. This means it is completely unaware of any content that is added to the DOM by JavaScript after the initial page load.
To get access to that type of content you will need an embedded browser component. There are a number of discussions on SO regarding that kind of component, e.g. Is there a way to embed a browser in Java?

Solved in my case with com.codeborne.phantomjsdriver
NOTE: it is groovy code.
pom.xml
<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version> <here goes last version> </version>
</dependency>
PhantomJsUtils.groovy
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver
class PhantomJsUtils {
private static String filePath = 'data/temp/';
public static Document renderPage(String filePath) {
System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
WebDriver ghostDriver = new PhantomJSDriver();
try {
ghostDriver.get(filePath);
return Jsoup.parse(ghostDriver.getPageSource());
} finally {
ghostDriver.quit();
}
}
public static Document renderPage(Document doc) {
String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
FileUtils.writeToFile(tmpFileName, doc.toString());
return renderPage(tmpFileName);
}
}
ClassInProject.groovy
Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))

You need to understand what is happening:
When you request a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include JavaScript in that HTML, or linked from that HTML, which populates the page with content. Your browser is able to execute the JavaScript, and thus populate the page. Jsoup is not.
The way to understand this is the following: parsing HTML code is easy. Executing JavaScript code and updating the corresponding HTML is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problem:
If you can find the Ajax calls that the JavaScript code makes to load the content, you might be able to use the URLs of these calls with Jsoup. To find them, use your browser's Developer Tools. But this is not guaranteed to work:
it might be that the URL is dynamic, and depends on what is on the page at that time
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It is scripted in JavaScript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
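The first option, calling the Ajax endpoint directly, could be sketched like this with the JDK's java.net.http client (Java 11+). Everything specific here is an assumption for illustration: the endpoint URL, the cookie value and the X-Requested-With header are placeholders for whatever the Network tab of your browser's Developer Tools shows for the real call.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class AjaxEndpointRequest {
    // Builds a GET request that imitates the browser's background (XHR) call.
    // cookieHeader may be null when the content is public.
    static HttpRequest build(String endpoint, String cookieHeader) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(endpoint))
            .timeout(Duration.ofSeconds(10))
            .header("User-Agent", "Mozilla/5.0")
            .header("X-Requested-With", "XMLHttpRequest")
            .GET();
        if (cookieHeader != null && !cookieHeader.isEmpty()) {
            // Needed when the resource is only served to a logged-in session.
            builder.header("Cookie", cookieHeader);
        }
        return builder.build();
    }

    // Sends the request and returns the raw body (often JSON or an HTML fragment).
    static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint; replace it with the URL observed in Developer Tools.
        HttpRequest request = build("https://example.com/api/tags?id=32558", "session=abc");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
        // fetch(request) would perform the call; it is not invoked here because
        // the endpoint above is only a placeholder.
    }
}
```

If the body is an HTML fragment, it can then be handed to Jsoup for parsing; if it is JSON, a JSON library is the better fit.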

You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>3.35</version>
</dependency>
Simple Example From file https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit
// load page using HtmlUnit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// iterate row and col
for (Element row : doc.select("table#data > tbody > tr")) {
    for (Element col : row.select("td")) {
        // print results
        System.out.println(col.ownText());
    }
}
// clean up resources
webClient2.close();
A Complex Example: Load login, get Session and CSRF, then post and wait for home page to finish loading (15 seconds)
import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();
String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");
HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");
WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
new URL("https://url"),
"https://referrer");
// Add other cookies/ Session ...
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);
// Wait
synchronized (page) {
try {
page.wait(15000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());

In fact there is a "way"! Maybe it is more a "workaround" than a "way". The code below checks both for a meta "REFRESH" attribute and for JavaScript redirects. If either of them exists, the RedirectedUrl variable is set, so you know your target. Then you can retrieve the target page and go on.
String RedirectedUrl=null;
Elements meta = page.select("html head meta");
if (meta.attr("http-equiv").contains("REFRESH")) {
RedirectedUrl = meta.attr("content").split("=")[1];
} else {
if (page.toString().contains("window.location.href")) {
meta = page.select("script");
for (Element script:meta) {
String s = script.data();
if (!s.isEmpty() && s.startsWith("window.location.href")) {
int start = s.indexOf("=");
int end = s.indexOf(";");
if (start>0 && end >start) {
s = s.substring(start+1,end);
s =s.replace("'", "").replace("\"", "");
RedirectedUrl = s.trim();
break;
}
}
}
}
}
... now retrieve the redirected page again...
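One caveat with the snippet above: the check for "REFRESH" is case-sensitive, and split("=")[1] cuts off a target URL that itself contains "=" (for example a query string). A possible alternative for extracting the target from the meta tag's content attribute is sketched below with a plain regular expression; the class name MetaRefresh and its method are hypothetical, and the pattern only covers the common "delay; url=target" form.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRefresh {
    // Matches content values such as "5; url=http://example.com/?a=b":
    // optional delay, a separator, "url=" in any case, optional quotes around the target.
    private static final Pattern CONTENT = Pattern.compile(
        "^\\s*\\d*\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$",
        Pattern.CASE_INSENSITIVE);

    // Returns the redirect target, or empty if the content has no url part.
    static Optional<String> target(String content) {
        if (content == null) {
            return Optional.empty();
        }
        Matcher m = CONTENT.matcher(content);
        return m.matches() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(target("0; URL=http://example.com/page?id=1").orElse("no redirect"));
        System.out.println(target("30").orElse("no redirect"));
    }
}
```

The value passed in would be the content attribute of the meta element whose http-equiv equals "refresh" when compared without regard to case.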

It is possible by combining Jsoup with another framework that interprets the web page. In my example here I'm using HtmlUnit.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);
Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");

After specifying user agent, my problem is solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155

Try:
Document doc = Jsoup.connect(url)
    .header("Accept-Encoding", "gzip, deflate")
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
    .maxBodySize(0)   // 0 removes jsoup's limit on the response body size
    .timeout(600000)  // timeout in milliseconds (10 minutes)
    .get();

How to use Jsoup to login my university website?

I am trying to build an Android app that needs some information from the university's internal website. I have been trying to use Jsoup to log in to the website programmatically. Here is the code I have now:
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
//import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
//import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.Map;
public class Test {
public static void main(String[] args) {
Document doc;
try {
Connection.Response res = Jsoup
.connect(
"https://sso.bris.ac.uk/sso/login?service=https%3A%2F%2Fwww.cs.bris.ac.uk%2FTeaching%2Fsecure%2Funit-list.jsp%3Flist%3Dmine")
.execute();
Map<String, String> cookies = res.cookies();
System.out.println(cookies.keySet());
Document fakepage = res.parse();
Element fakelt = fakepage.select("input[name=lt]").get(0);
Element fakeexecution = fakepage.select("input[name=execution]")
.get(0);
Element fake_eventID = fakepage.select("input[name=_eventId]").get(
0);
System.out.println("Hello World!");
System.out.println(fakelt.attr("value"));
System.out.println(fakeexecution.toString());
System.out.println(fake_eventID.toString());
// System.out.println(cookies.get("JSESSIONID"));
String url="https://sso.bris.ac.uk/sso/login?service=https%3A%2F%2Fwww.cs.bris.ac.uk%2FTeaching%2Fsecure%2Funit-list.jsp%3Flist%3Dmine";
System.out.println(url);
Connection newreq = Jsoup
.connect(url)
.cookies(cookies).data("lt", fakelt.attr("value")).followRedirects(true).header("Connection", "keep-alive")
.header("Refer", " https://sso.bris.ac.uk/sso/login?service=https%3A%2F%2Fwww.cs.bris.ac.uk%2FTeaching%2Fsecure%2Funit-list.jsp%3Flist%3Dmine")
.header("Content-Type","application/x-www-form-urlencoded;charset=UTF-8")
.userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0")
.data("lt",fakelt.attr("value"))
.data("execution", fakeexecution.attr("value"))
.data("_eventID", fake_eventID.attr("value"))
.data("username", "aabbcc").data("password", "ddeeff")
.data("submit", "").method(Method.POST);
Connection.Response newres = newreq.execute();
doc = newres.parse();
System.out.println(doc.toString());
System.out.println(newres.statusCode());
Map<String,String> newcookies = newres.cookies();
doc = Jsoup.connect("https://www.cs.bris.ac.uk/Teaching/secure/unit-list.jsp?list=mine").cookies(newcookies).get();
System.out.println(doc.toString());
// System.out.println(doc.toString());
} catch (IOException e) {
System.out.println("Excepiton:");
System.out.println(e.getMessage());
}
}
}
I completely faked a form submission using Jsoup. To get around the security cookies, I first request the website once and then use the cookies it sent me to request the website again. The form has some hidden fields, so I use the values I got on my first request to fake it when I request it again. However, this does not work. Is it possible to do this, or does the server have some advanced protection against my doing so?

Do not use Jsoup for this; it requires you to handle all the cookies yourself. Instead, use HttpClient: from version 4.0 onward it handles the cookies automatically. Much easier to work with.
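The answer refers to Apache HttpClient 4.x. The same idea, a client that keeps a cookie jar for you, can be sketched with the JDK's own java.net.http.HttpClient plus java.net.CookieManager (Java 11+), which is what the example below uses instead. The class name CookieSession, its helper methods and the sso.example.org URLs are hypothetical; to stay runnable offline, main only simulates the Set-Cookie step that the client performs internally on a real response.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.util.List;
import java.util.Map;

class CookieSession {
    // A client that stores cookies from responses and replays them on later requests.
    static HttpClient newClient(CookieManager cookies) {
        return HttpClient.newBuilder()
            .cookieHandler(cookies)
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
    }

    // What happens when a response with a Set-Cookie header arrives.
    static void storeFromResponse(CookieManager cookies, URI uri, String setCookieValue) {
        try {
            cookies.put(uri, Map.of("Set-Cookie", List.of(setCookieValue)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What would be sent as the Cookie header on the next request to this URI.
    static List<String> cookiesFor(CookieManager cookies, URI uri) {
        try {
            return cookies.get(uri, Map.of()).getOrDefault("Cookie", List.of());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieManager cookies = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        HttpClient client = newClient(cookies); // use client.send(...) for real requests

        // Simulated login response from a hypothetical SSO host.
        storeFromResponse(cookies, URI.create("https://sso.example.org/login"),
            "JSESSIONID=abc123; Path=/");

        // The session cookie is now attached automatically to requests for the same host.
        System.out.println(cookiesFor(cookies, URI.create("https://sso.example.org/home")));
    }
}
```

With a real login flow, the GET of the login page, the POST of the form and the request for the protected page would all go through the same client, so no cookie map has to be passed around by hand.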

Use jsoup to read table content

Can anyone help me figure out why I can't use jsoup to read the table at the link below:
http://data.fpt.vn/InfoDNS.aspx?domain=google.com
I use it to get the DNS information of a host.
Here is the code that I used:
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
public class dnsjava {
public static void main(String... args) throws Exception {
String fpt = "http://data.fpt.vn/InfoDNS.aspx?domain=google.com";
String espn = "http://espn.go.com/mens-college-basketball/conferences/standings/_/id/2/year/2012/acc-conference";
org.jsoup.nodes.Document doc = Jsoup.connect(fpt).get();
Elements table = doc.select("table.tabular");
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
System.out.println(tds.text());
System.out.println(tds.text());
}
}
}
It works with the ESPN URL and doc.select("table.tablehead");, but with the fpt URL nothing happens!
Thank you for your help!

It looks like the content you are seeking is not present in the response. When I did a "view source" (in the browser) on the link, "tabular" was not there, so this selector matches nothing:
doc.select("table.tabular");
