I can connect to most sites and get the HTML just fine, but when I try to connect to a website where most of the content is generated after the initial page load with JavaScript, Jsoup does not get any of that data. Is there any way to do this with Jsoup, or does it not support it?
Jsoup has some basic connection handling built in, but it is not a web browser. It excels at parsing static HTML content, but it does not run any JavaScript, so on its own you are out of luck. However, there are a couple of options you might follow:
You can analyze the page you want to retrieve and find out how the content you are interested in gets loaded. Often it is not hard to tap the original source of the loaded content (for example, the JSON endpoint behind an XHR call) and work with that directly. This approach has the benefit that you get what you want with no extra libraries, and retrieval will be fast.
You can use a (full) browser and automate the loading of the page. A very good tool for this is Selenium WebDriver, traditionally combined with the headless WebKit browser PhantomJS (PhantomJS is no longer maintained, so current setups drive headless Chrome or Firefox instead). This requires extra software and extra libraries in your project, and it will run much, much slower than the first solution. A sketch of this option follows.
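A minimal sketch of the browser-automation option, using Selenium's Java bindings together with Jsoup (it assumes chromedriver is available on the PATH; the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

WebDriver driver = new ChromeDriver();
try {
    driver.get("https://example.com/js-heavy-page");
    // getPageSource() returns the DOM *after* the page's JavaScript has run,
    // so Jsoup can parse the fully rendered markup as usual
    Document doc = Jsoup.parse(driver.getPageSource());
} finally {
    driver.quit();
}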
I realize this looks like a duplicate question, but it's not (as far as I know, and I've searched a lot...). For the last few days I've been trying to get the HTML content of my WhatsApp Web application, but reading the URL with Java's InputStreamReader does not give me the full HTML code. The URL I'm using is just https://web.whatsapp.com/, which I suppose could be a problem, but there aren't any personal URLs as far as I'm aware. However, in developer tools, using the element inspector, I can easily access and read the DOM elements I'm interested in. I'm wondering if there's a way I can get this source directly using Java/Perl/Python.
I'm also doing this as a learning project, so I would preferably like to stay away from tools such as Jsoup. Thanks!
You can use selenium.webdriver in Python. Something like:
from selenium import webdriver

# Launches a real Firefox instance (requires geckodriver on the PATH)
browser = webdriver.Firefox()
browser.get("https://web.whatsapp.com/")

# page_source returns the DOM as currently rendered, including
# content generated by JavaScript after the initial load
html = browser.page_source
If you want to get your own WhatsApp page, you will need to use Selenium to log into the site before reading page_source.
I'm trying to use Jsoup to gather wave-height information from Surfline.com. The element I want shows up in the dev tools. When I scrape the site with Jsoup, the returned string includes everything seen in the dev tools except the "1-2ft", which is exactly what I need. The site is JavaScript-heavy, and I'm assuming that Jsoup is snagging the HTML before the JavaScript actually runs (I have no clue, really). Do I need to specifically tell Jsoup to wait for the page load, or am I missing some other critical component?
This is the code I'm using:
Document doc = Jsoup.connect("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/").get();
Elements content = doc.select("div[id=current-surf-range]");
System.out.println(content);
and this is the output I'm seeing in my IDE:
<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>
It seems really odd that the contents of the div wouldn't be returned with it. This is my first time using Jsoup, and I tried to read through the docs as best I could, but nothing seemed to touch on this particular issue. Any insight would be awesome and greatly appreciated.
What you see in the browser is not necessarily what you would get when downloading the page by URL with the HTTP library of your choice. In fact, you should never expect them to be the same. On the modern web, pages are quite dynamic and load asynchronously, involving multiple API calls to different resource providers and JavaScript executed in the browser (which has a JavaScript engine).
What you get with Jsoup in this case is the initial HTML that the browser starts to form the page with. Then there is a set of XHR calls to the Surfline API that bring the data into the browser, which then dynamically fills up different parts of the page, including the current surf range.
The simplest way to approach the problem is to switch to the browser automation tool Selenium, which fires up a real browser. You can then wait for the current-surf-range element to have a value and, if you wish to continue with Jsoup, get the page source and feed it to Jsoup for further parsing.
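A minimal sketch of that approach, assuming the Selenium 4 Java bindings (the fifteen-second timeout is an arbitrary choice):

import java.time.Duration;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

WebDriver driver = new FirefoxDriver();
try {
    driver.get("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/");
    // Block until the div contains non-whitespace text, i.e. the XHR data has arrived
    new WebDriverWait(driver, Duration.ofSeconds(15))
            .until(ExpectedConditions.textMatches(By.id("current-surf-range"), Pattern.compile("\\S")));
    // Hand the fully rendered page to Jsoup for the same parsing as before
    Document doc = Jsoup.parse(driver.getPageSource());
    System.out.println(doc.select("div#current-surf-range").text());
} finally {
    driver.quit();
}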
Another approach would involve looking into the requests the page makes in the browser developer tools and then simulating those requests in your code, parsing the JSON responses and extracting the surf forecast data.
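For instance, if the Network tab shows the forecast arriving from a JSON endpoint, Jsoup itself can fetch the raw body (the URL below is a hypothetical placeholder, not Surfline's actual API):

// Hypothetical endpoint discovered in the Network tab -- substitute the real one
String json = Jsoup.connect("http://api.example.com/forecast?spotId=5294")
        .ignoreContentType(true) // Jsoup rejects non-HTML content types by default
        .execute()
        .body();
// Parse the returned string with a JSON library of your choice (e.g. Jackson or org.json)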
I use Java. I want to get a web page's source code, but the page runs JavaScript, and I want the code generated by that JavaScript (the code we see in Firebug in Firefox).
Does anyone know what I should do?
To inspect the page after modification by JavaScript, you need a client-side JavaScript engine that can run the scripts and then let you inspect the DOM.
HtmlUnit can do this - it is a "GUI-Less browser for Java programs".
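A minimal sketch, assuming a recent HtmlUnit release (package org.htmlunit; older releases use com.gargoylesoftware.htmlunit) and a placeholder URL:

import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

try (WebClient client = new WebClient()) {
    client.getOptions().setThrowExceptionOnScriptError(false); // tolerate sloppy page scripts
    HtmlPage page = client.getPage("https://example.com/");
    client.waitForBackgroundJavaScript(5_000); // give async scripts up to 5s to finish
    String renderedHtml = page.asXml(); // the DOM after JavaScript has run
}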
See also this question
However, this won't give you the exact original page source, because that has already been parsed into a DOM by this point.
I think you want to see the source code of DOM elements created after page load via AJAX.
If that's what you want, the only way to see it is through a DOM inspector, like Firebug in Firefox or Developer Tools in Chrome.
Going to "View source" only shows the source as it was at load time.
If I understand your question: yes, your JavaScript objects can be passed back to your Java backend, either by creating an HTML <form> element with input elements, filling them with your values, and then submitting the form, or asynchronously via AJAX/JSON (which doesn't require reloading your web page). For both methods you need to configure an endpoint on the Java side that receives the submitted data and returns some kind of confirmation to the client, i.e. your JavaScript. I would recommend googling "jQuery.post" for the JavaScript side and finding some examples for your Java backend; a sketch of the receiving endpoint follows.
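Purely as a hypothetical sketch of the Java side (assuming the javax Servlet API; the path and parameter name are made up for illustration):

import java.io.IOException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Receives the form fields (or jQuery.post data) sent from the browser
@WebServlet("/submit")
public class SubmitServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String value = req.getParameter("myValue"); // must match the input name / posted key
        // ... process or persist the value here ...
        resp.setContentType("text/plain");
        resp.getWriter().write("ok"); // confirmation back to the JavaScript caller
    }
}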
I fetch the website using Jsoup. Here is the link to the page:
http://www.yelp.com/search?find_desc=restaurants&find_loc=westmont%2C+il&ns=1&ls=43131f934bb3adf3#find_loc=Hinsdale,+IL&l=p:IL:Hinsdale::&sortby=rating&unfold=1
Now I'm trying to extract the number of sub-pages, for example the numbers next to "Go to Page".
Unfortunately, neither "view source" in the browser nor Jsoup is able to see these elements. I guess this content is embedded dynamically into the page. If so, what is the best way to access dynamically generated content? Thanks.
For websites that use AJAX/JS-library techniques to generate content, you may want to use HtmlUnit instead (HtmlUnit can simulate JavaScript events). Jsoup is only for static HTML, i.e. the things you can see via view-source.
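Along the lines of the HtmlUnit sketch earlier on this page, a hypothetical way to reach the pagination links once background JavaScript has run (the XPath is a guess for illustration, not Yelp's actual markup):

import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlAnchor;
import org.htmlunit.html.HtmlPage;

try (WebClient client = new WebClient()) {
    client.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage page = client.getPage("http://www.yelp.com/search?find_desc=restaurants&find_loc=westmont%2C+il");
    client.waitForBackgroundJavaScript(5_000); // let the AJAX calls finish
    // XPath is illustrative only -- inspect the live DOM to find the real structure
    for (Object node : page.getByXPath("//div[contains(@class,'pagination')]//a")) {
        System.out.println(((HtmlAnchor) node).asNormalizedText());
    }
}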
I need a solution for getting HTML content as a browser would produce it. When a page is rendered in a browser its JavaScript runs; when it is merely downloaded, the JavaScript does not run. So plain HTML libraries like lxml, BeautifulSoup and others are not going to work.
I've found a project named pywebkitgtk, but its purpose is to create a browser with a front end.
Is there any way to load a URL into a "fake browser", render it, run all of its JavaScript, and save the result to an HTML file? I don't need any front end; back end only is fine.
I need to use Python or Java for this.
selenium-rc lets you drive an actual browser for this purpose, under the control of any of several languages of your choice, including both Python and Java. Check it out!
For a detailed example of use with Python, see here.
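(Selenium RC has since been superseded by Selenium WebDriver. A minimal sketch with the current Java bindings, assuming geckodriver is installed and using a placeholder URL and output path:)

import java.nio.file.Files;
import java.nio.file.Path;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;

FirefoxOptions options = new FirefoxOptions();
options.addArguments("-headless"); // no visible window -- "back-end only"
WebDriver driver = new FirefoxDriver(options);
try {
    driver.get("https://example.com/");
    // getPageSource() returns the DOM after the page's JavaScript has run
    Files.writeString(Path.of("page.html"), driver.getPageSource());
} finally {
    driver.quit();
}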