Jsoup Issue scraping non-hardcoded data

I'm trying to use Jsoup to gather wave height information from Surfline.com. The element I want is shown in the screenshot, and it shows up in the dev tools. When I scrape the site with Jsoup, the returned string includes everything seen in the dev tools except the "1-2ft", which is what I need. The site is JavaScript-heavy, and I'm assuming that Jsoup is grabbing the HTML before the JavaScript actually runs (I have no clue, really). Do I need to specifically tell Jsoup to wait for the page load, or am I missing some other critical component?
This is the code I'm using:
Document doc = Jsoup.connect("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/").get();
Elements content = doc.select("div[id=current-surf-range]");
System.out.println(content);
And this is the output I'm seeing in my IDE:
<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>
It seems really odd that the contents of the div wouldn't be returned with it. This is my first time using Jsoup, and I tried to read through the docs as best I could, but nothing seemed to touch on this particular issue. Any insight would be greatly appreciated.

What you see in the browser is not necessarily what you get when you download the page by URL with your HTTP library of choice. In fact, you should never expect them to be the same. On the modern web, pages are quite dynamic and are loaded asynchronously, involving multiple API calls to different resource providers and JavaScript being executed in the browser (which has a JavaScript engine).
What you get with Jsoup in this case is the initial HTML that the browser starts building the page from. Then a set of XHR calls to the Surfline API brings the data into the browser, which dynamically fills in different parts of the page, including the current surf range.
The simplest way to approach the problem is to switch to a browser automation tool such as Selenium, which fires up a real browser. You can then wait for the current surf range element to have a value and, if you wish to continue with Jsoup, get the page source and feed it to Jsoup for further parsing.
Another approach is to look at the requests the page makes in the browser developer tools and then simulate these requests in your code, parsing the JSON responses and extracting the surf forecast data.
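The second approach could look roughly like the sketch below, which uses only the JDK (Java 11+). Everything specific in it is an assumption: the endpoint URL has to be copied from the Network tab of the developer tools, and the "surfRange" key is a hypothetical field name, not something taken from the real Surfline response. The regular expression keeps the sketch dependency-free; a proper JSON library such as Jackson or Gson would be the more robust choice.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SurfRangeSketch {

    // Hypothetical field name: read the real key from the JSON response
    // shown in the browser's Network tab.
    private static final Pattern SURF_RANGE =
            Pattern.compile("\"surfRange\"\\s*:\\s*\"([^\"]*)\"");

    /** Pulls the (assumed) "surfRange" string value out of a JSON body. */
    static Optional<String> extractSurfRange(String json) {
        Matcher m = SURF_RANGE.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Fetches the JSON that the page itself would request via XHR. */
    static String fetchJson(String apiUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(apiUrl))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass the endpoint copied from the developer tools as the first argument.
        if (args.length == 0) {
            System.out.println("usage: java SurfRangeSketch <api-url>");
            return;
        }
        System.out.println(extractSurfRange(fetchJson(args[0])).orElse("not found"));
    }
}
```

Keeping the extraction in its own method means it can be checked against a saved copy of the response without hitting the network.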

Related

Getting web page source code in Java

I use Java. I want to get a web page's source code, but JavaScript runs on the page and I want the code generated by JavaScript (the code we see in Firebug in Firefox).
Does anyone know what I should do?

To inspect the page after modification by JavaScript, you need a client-side JavaScript engine that can run the scripts and then let you inspect the DOM.
HtmlUnit can do this - it is a "GUI-Less browser for Java programs".
See also this question
However, this won't give you the exact original page source, because that has already been parsed into a DOM by this point.

I think you want to see the source code of DOM elements created after the page load via AJAX.
If that's what you want, the only way to see it is through a DOM inspector, like Firebug in Firefox or the Developer Tools in Chrome.
Going to "View source code" only shows the source at load time.

If I understand your question: yes, your JavaScript objects can be passed back to your Java backend, either by creating an HTML <form> element with input elements, filling them with your values and then submitting the form, or asynchronously via AJAX/JSON (which doesn't require reloading your web page). For both methods you need to configure an endpoint on your Java side to receive the submitted data and return some kind of confirmation to the client, i.e. your JavaScript. I would recommend googling "jQuery.post" for the JavaScript side and finding some examples for your Java backend.

Jsoup get dynamically generated HTML

I can connect to most sites and get the HTML just fine but when trying to connect to a website where most of the content is generated after the initial page load with JavaScript, it does not get any of that data. Is there any way to do this with Jsoup or does it not support it?

Jsoup has some basic connection handling included, but it is not a web browser. It excels at parsing static HTML content. It does not run any JavaScript, so you are out of luck. However, there are different options that you might follow:
You can analyze the page that you want to retrieve and find out how the content you are interested in gets loaded. Often it is not very hard to tap the original source of the loaded content and work with that. This approach has the benefit that you get what you want with no need for extra libraries, and the retrieval will be fast.
You can use a (full) browser and automate the loading of the page. A very good tool for this is Selenium WebDriver in combination with the headless WebKit browser PhantomJS. This, however, requires extra software and extra libraries in your project and will run much slower than the first solution.

Convert Web Page to PDF or Image

I need to convert a web page [which has no public access] to a PDF or an image [preferably PNG].
The web page contains a set of charts and an image. Most of the charts are populated through Ajax calls, so there is a delay between page load and chart load.
I am looking for an answer to either of these questions:
1- I found a set of snapshot APIs, but none of them support accessing my internal page. Since the web page I am trying to export is not public, I need to be authenticated. The biggest problem is that I cannot send request headers [such as a session id, cookie or other variables] along with these APIs. It seems they don't support this kind of functionality.
2- I am not sure if I can do the following: log in to my web page with an HTTP client, add HTTP headers, send a GET call and get the HTML string, then use one of the converters to convert it to PDF. What I am not sure about is whether it's possible to get a proper PDF from the HTML string I got from the HTTP client, since resources [CSS, JS, etc.] will be missing. I want my PDF/image to look exactly as it does on the web site.
I would really appreciate it if you could help.
Thanks in advance,
ED

You're probably best off using wkhtmltopdf, which is a server-side tool and is easily installed.
There are two parameters you can use to wait for your Ajax to finish:
--javascript-delay sets how long (in milliseconds) the program waits for the JavaScript to finish
--window-status makes the program wait until window.status equals a given string
See the extensive manual for this program.
wkhtmltopdf generates a PDF and wkhtmltoimage generates an image, which can be a PNG (as you requested).
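As a rough illustration, the tool could be launched from Java as sketched below. The --javascript-delay option is the one described above, and --cookie is another wkhtmltopdf option that passes a session cookie along with the request. The page URL, the cookie name JSESSIONID and the five-second delay are assumptions made for the sketch, and wkhtmltopdf is assumed to be on the PATH.

```java
import java.util.ArrayList;
import java.util.List;

final class WkhtmltopdfSketch {

    /**
     * Builds a wkhtmltopdf command line that sends a session cookie and
     * waits a fixed number of milliseconds for JavaScript to finish.
     */
    static List<String> buildCommand(String url, String outputFile,
                                     String cookieName, String cookieValue,
                                     int delayMillis) {
        List<String> cmd = new ArrayList<>();
        cmd.add("wkhtmltopdf");
        cmd.add("--cookie");
        cmd.add(cookieName);
        cmd.add(cookieValue);
        cmd.add("--javascript-delay");
        cmd.add(Integer.toString(delayMillis));
        cmd.add(url);
        cmd.add(outputFile);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand(
                "http://intranet.example/charts",   // hypothetical internal page
                "charts.pdf",
                "JSESSIONID", "replace-with-real-session-id", // assumed cookie
                5000);
        // inheritIO lets wkhtmltopdf print its progress to this console.
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.out.println("exit code: " + p.waitFor());
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids shell quoting problems with cookie values and URLs.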

Authentication is difficult because it involves security. Because the operation you are describing is unusual it is likely to result in all kinds of alarm bells going off. It is entirely possible to do but it is fraught, easy to get wrong and fragile in the face of security updates and code changes.
As such I'm going to suggest an alternate method which is one we often recommend for ABCpdf (on which I work). Yes we support standard authentication methods but the beauty of this approach is that it is robust and is applicable to other solutions (eg Java based) and novel authentication methods.
Typically you just want a PDF of the current page. The easiest way to do this is snaffle the HTML. The way you do this rather depends on your environment. For example under ASP.NET you can obtain the HTML of the current page using the HttpResponse.Filter property or by overriding the Render method of the page. The way you do it will depend on what you're coding in.
Then you need to save this HTML to a file and present it to your solution via a 'file://' protocol URL. Now obviously at this point any relative links will be broken but this is easily fixed by dropping in a BASE tag that references the place they are located.
Generally the types of resources referenced by a server-side page are static. So if you can create a BASE tag that references the actual files rather than a web site, you will bypass any authentication for access to these resources.
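Since the question mentions a Java-based setup, the save-and-rebase step might be sketched as follows. This is only one possible way to do it: the class and method names are made up for the example, the directory file:///var/www/app/ is a hypothetical location of the static resources, and the regular expression is a simplification that assumes the HTML has an ordinary opening HEAD tag.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BaseTagSketch {

    // Matches <head> or <head ...attributes>, but not <header>.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Inserts a BASE element right after the opening HEAD tag, if there is one. */
    static String insertBase(String html, String baseHref) {
        Matcher m = HEAD_OPEN.matcher(html);
        if (!m.find()) {
            return html; // no <head>: leave the document untouched
        }
        String baseTag = "<base href=\"" + baseHref + "\">";
        return html.substring(0, m.end()) + baseTag + html.substring(m.end());
    }

    /** Saves the HTML to a temporary file and returns its file:// URL. */
    static String saveAsFileUrl(String html) throws IOException {
        Path file = Files.createTempFile("snapshot", ".html");
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        return file.toUri().toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Report</title></head>"
                + "<body><img src=\"img/chart.png\"></body></html>";
        // Hypothetical directory that holds the static resources on disk.
        String withBase = insertBase(html, "file:///var/www/app/");
        // The printed URL is what would be handed to the HTML-to-PDF converter.
        System.out.println(saveAsFileUrl(withBase));
    }
}
```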
That still leaves the AJAX based problems which are another can of worms. The render delay method is something we have supported for many years (from before AJAX was around) however it is not terribly reliable because you just don't know how long to wait.
Much better is a tighter link into the JavaScript via a callback you can use to determine if the page is loaded. I don't think ABCpdf is going to be appropriate for you since it is .NET but I would certainly encourage you to look for a Java based solution that uses this type of more sophisticated approach.

Get the contents of the link created by JavaScript

I am trying to build a very rudimentary crawler which could move through certain specific links and extract the contents from them. I am using JSoup for traversing through the links on a page and reading the required content.
However I have hit a roadblock on one of the sites. It is a kind of news portal on which users are allowed to post their own comments. I need to extract these comments. However if there are more than 5 comments, they are spread over several pages and the links to the subsequent pages are created by a JavaScript code in href (instead of a real link). It is something like this:
<a id="pager1_lnkPage2" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("pager1$lnkPage2", "", true, "", "", false, true))">2</a>
Now I have no idea how to traverse through the links generated by this JavaScript. Is there any way to get the data on the pages referred to by these links (on the face of it this does not seem to create any new link since the URL does not change while we navigate through other pages)?
For your reference here is a link to one such page. The links to navigate through multiple pages are at the lower right corner of the page.
This is embedded on the page with the main story in an iframe.
I have also come across an interface called ScriptEngine in javax but I could not understand it well enough to use it here.
Thanks

I've never used Jsoup, but judging by its description (it is an HTML parser) and the fact that you are trying to somehow incorporate JavaScript into it, I'd say you chose the wrong tool for the job.
In your case I would rather go with Zombie.js (Node.js based) or Selenium. The latter may be the better choice if you want to stick with Java (Selenium has Java-based plugins).

Display external webpage into a webpage in my application

I want to display an external webpage (exactly as it's rendered on that site) inside a webpage in my application, in a way that's fast and friendly to SEO crawlers, and I was wondering if there's a way to do that with Java EE.
If not, which is better for performance and SEO: the XMLHttpRequest way or the iframe way?
Please advise with sample code or a link if possible. Thanks.
Update: example website is: http://www.akhbarak.net/

If you need to display content from different pages inline, use iframe (iframe stands for inline frame - it has nothing to do with Apple).
If you'd like to use AJAX to display pages, I would recommend colorbox.
Note that accessing pages in a different domain via AJAX is next to impossible - this is a very, very big security hole. I would not recommend doing it. You would have to use a proxy on your own server to fetch the page and return its HTML.
That said, using the iframe in your source code, so it is loaded with the rest of the page, seems like your best bet. Sites like facebook and twitter use this in embeddable "like" and "tweet" widgets so that those widgets can make requests on their own domain - that is, twitter or facebook. While managing lots of iframes isn't very fun, it is a very accepted way of doing what you want to do.

In theory, you could
load the whole page into a PHP variable,
replace the body tags with ,
take out the html tags,
pull out the entire section and put it in the encompassing pages ,
and replace all links with absolute ones (ie '/images' changes to 'http://example.com/images')
Would it be easy to do? Probably not. It's the only way I can think of to accomplish it so that the site appears as part of yours though.
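The last step, turning relative links into absolute ones, can be sketched in Java (the question asks about Java EE) with the JDK's URI class. The class name and the sample URLs are made up for illustration, and finding the links in the fetched HTML is left out.

```java
import java.net.URI;

final class AbsoluteLinkSketch {

    /** Resolves a link found on a page against that page's own URL. */
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/news/story.html";
        // Root-relative link: prints http://example.com/images
        System.out.println(toAbsolute(page, "/images"));
        // Document-relative link: prints http://example.com/news/pics/a.png
        System.out.println(toAbsolute(page, "pics/a.png"));
        // Already absolute: printed unchanged
        System.out.println(toAbsolute(page, "http://other.example/x.css"));
    }
}
```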
