I wanted to write a crawler in Java for a school exercise. The crawler itself, implemented with the jsoup library, worked in the sense that my request returned some HTML code. But when I searched that HTML for a word that was clearly visible on the website, it was not found, because some of the divs in the crawled result were empty.
Then I noticed that I was getting the same code you see when you navigate to the website and right-click -> 'View page source'.
When I compared that to right-click -> 'Inspect', the code was not the same as in 'View page source'.
Is there anything I could do to get the HTML code containing the full content?
requested URL: https://app.libertex.com/?lang=deu&_ga=2.222573595.1459393376.1568209606-1642141519.1566978579&_gac=1.53153498.1566978579.CjwKCAjwzJjrBRBvEiwA867byuxkXf35eSWyL2LJhLel3PRiGsSfvU6iLb00E21dQOkogLcx_z5G6hoCQgwQAvD_BwE
You can't get the right code with jsoup as this website loads content dynamically.
This webpage loads its content dynamically, i.e. it loads the initial HTML and then executes further code to fetch the rest of the content. jsoup is merely an HTML parser, meaning it can only parse whatever content it is given. It cannot execute JavaScript or wait for external resources to load.
To scrape a website like this, you probably need an automated browser of sorts. I personally use Selenium in Python for crawling websites that load content dynamically, but the same approach works from Java.
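Since the question is about Java, here is a minimal sketch of that idea, assuming Selenium 4 on the classpath and a matching ChromeDriver on the PATH; the search term is a placeholder, and the URL from the question is shortened to its lang parameter:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class RenderedWordSearch {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // fires up a real Chrome
        try {
            driver.get("https://app.libertex.com/?lang=deu");
            // getPageSource() reflects the DOM after JavaScript has run,
            // i.e. what "Inspect" shows rather than "View page source";
            // an explicit wait may still be needed if content arrives late
            boolean found = driver.getPageSource().contains("someWord"); // placeholder search term
            System.out.println("word found: " + found);
        } finally {
            driver.quit();
        }
    }
}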
Related
I'm trying to use Jsoup to gather wave height information from Surfline.com. I can see the element I want in the dev tools, but when I scrape the site with Jsoup, the returned string includes everything seen in the dev tools except the "1-2ft", which is what I actually need. The site is JavaScript heavy, and I'm assuming that Jsoup is grabbing the HTML before the JavaScript actually runs (I have no clue, really). Do I need to specifically tell Jsoup to wait for the page load, or am I missing some other critical component?
This is the code I'm using.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/").get();
Elements content = doc.select("div[id=current-surf-range]");
System.out.println(content);
and this is the output I'm seeing in my IDE
<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>
It seems really odd that the contents of the div aren't returned with it. This is my first time using Jsoup, and I tried to read through the docs as best I could, but nothing seemed to touch on this particular issue. Any insight would be awesome and greatly appreciated.
What you see in the browser is not necessarily what you would get when downloading the page by URL with the HTTP library of your choice. In fact, you should never expect them to be the same. On the modern web, pages are quite dynamic and load asynchronously, involving multiple API calls to different resource providers and JavaScript executed in the browser (which has a JavaScript engine).
What you get with JSoup in this case is the initial HTML that the browser starts to build the page from. Then a set of XHR calls to the Surfline API brings the data into the browser, which dynamically fills in different parts of the page, including the current surf range.
The simplest way to approach the problem is to switch to a browser automation tool called Selenium, which fires up a real browser. You can then wait for the current surf range element to have a value and, if you wish to continue with JSoup, get the page source and feed it to JSoup for further parsing.
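A rough sketch of that approach with Selenium's Java bindings (assuming Selenium 4; the element id comes from the output shown above, the rest is illustrative):

import java.time.Duration;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

public class SurfRangeScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/");
            // wait until the XHR-driven div actually contains text
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(d -> !d.findElement(By.id("current-surf-range")).getText().isEmpty());
            // hand the fully rendered page to JSoup for the usual parsing
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println(doc.select("#current-surf-range").text()); // e.g. "1-2ft"
        } finally {
            driver.quit();
        }
    }
}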
Another approach would be to look into the requests the page makes in the browser developer tools and then simulate those requests in your code, parsing the JSON responses and extracting the surf forecast data.
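A sketch of that second approach with Java 11's built-in HTTP client; the endpoint below is a made-up placeholder, and the real URL and parameters have to be copied from the network tab of the developer tools:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SurfApiClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // hypothetical endpoint -- copy the real one from the dev tools' network tab
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://api.example.com/forecast?spotId=5294"))
                .header("Accept", "application/json")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // the body is JSON; parse it with a library such as Jackson or Gson
        System.out.println(response.body());
    }
}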
I use Java. I want to get a web page's source code, but JavaScript runs on the page, and I want the code generated by that JavaScript (the code you see in Firebug in Firefox).
Does anyone know what I should do?
To inspect the page after modification by JavaScript, you need a client-side JavaScript engine that can run the scripts and then let you inspect the DOM.
HtmlUnit can do this - it is a "GUI-Less browser for Java programs".
However, this won't give you the exact original page source, because that has already been parsed into a DOM by this point.
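A minimal HtmlUnit sketch of the idea (assuming the classic com.gargoylesoftware package names; HtmlUnit 3.x moved them to org.htmlunit, and the URL is a placeholder):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedSource {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false); // real pages are noisy
            HtmlPage page = webClient.getPage("https://example.com/"); // placeholder URL
            webClient.waitForBackgroundJavaScript(5_000); // give AJAX calls time to finish
            // asXml() serializes the DOM *after* the scripts have modified it
            System.out.println(page.asXml());
        }
    }
}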
I think you want to see the source code of DOM elements created after page load via AJAX.
If that's what you want, the only way to see it is through a DOM inspector, like Firebug in Firefox or the Developer Tools in Chrome.
"View source code" only shows the source at load time.
If I understand your question: yes, your JavaScript objects can be passed back to your Java backend, either by creating an HTML <form> element with input elements, filling them with your values and submitting the form, or asynchronously via AJAX/JSON (which doesn't require reloading your web page). For both methods you need to configure an endpoint on the Java side to receive the submitted data and return some kind of confirmation to the client, i.e. your JavaScript. I would recommend googling "jQuery.post" for the JavaScript side and finding some examples for your Java backend; a sketch of the receiving end follows below.
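To make the Java side concrete, here is a minimal sketch of such an endpoint as a servlet (the path and parameter name are illustrative; newer containers use jakarta.servlet instead of javax.servlet):

import java.io.IOException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/submit")
public class SubmitServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // value posted from the form or from jQuery.post(...)
        String value = req.getParameter("myValue"); // illustrative parameter name
        // ... store or process the value ...
        resp.setContentType("text/plain");
        resp.getWriter().write("OK"); // confirmation the client-side script can react to
    }
}

On the JavaScript side, something like jQuery.post("/submit", {myValue: someValue}) would then hit this endpoint without a page reload.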
I fetch the website using Jsoup. Here is the link to the web:
http://www.yelp.com/search?find_desc=restaurants&find_loc=westmont%2C+il&ns=1&ls=43131f934bb3adf3#find_loc=Hinsdale,+IL&l=p:IL:Hinsdale::&sortby=rating&unfold=1
Now I'm trying to extract the number of sub-pages on the site, for example the numbers next to "Go to Page" in the pagination controls.
Unfortunately, neither 'view source' in the browser nor Jsoup is able to see these elements. I guess this content is embedded dynamically into the page. If so, what is the best way to access dynamically generated content? Thanks.
For websites that use AJAX/JS library techniques to generate content, you may want to use HtmlUnit instead (HtmlUnit can simulate JavaScript events). JSoup is only for static HTML, i.e. things you could see via view source.
The data I want is visible in the browser, but I can't find it in the HTML source code. I suspect the data is generated by scripts, and I'd like to grab that kind of data. Is that possible using Jsoup? I'm aware Jsoup simply does not execute JavaScript.
Take this page for example: I'd like to grab all the colleges and schools under Academics -> COLLEGES & SCHOOLS.
If the DOM content is generated via scripts or plugins, then you should really consider a scriptable browser like PhantomJS. Then you can just write some JavaScript to extract the data.
I didn't check your link, and I assume you're looking for a general answer, not one specific to any particular page.
I need a solution for getting HTML content from a browser. When a page renders in a browser its JavaScript runs; when it doesn't, the JavaScript won't run. So HTML libraries like lxml, BeautifulSoup and others are not going to work here.
I've looked at a project named pywebkitgtk, but its purpose is to create a browser with a front end.
Is there any way to feed a URL into a "fake browser", render it, run all of its JavaScript, and save the result into an HTML file? I don't need any front end; a back end alone is fine.
I need to use Python or Java to do that.
Selenium RC lets you drive an actual browser for your purpose, under the control of any of several languages of your choice, including both Python and Java. Check it out!
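For the Java option, a back-end-only sketch with Selenium WebDriver and headless Chrome (Selenium 4 assumed; URL and file name are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class SavePage {
    public static void main(String[] args) throws Exception {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // no front end, just the rendering engine
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/"); // placeholder URL
            // the page source here reflects the DOM after JavaScript has run
            Files.writeString(Path.of("page.html"), driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}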