Can't get all content from webpage with HTMLParser - java

I am using Jsoup to parse a web page, this one: https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true
I can see the content of that page in the browser, but when I try to parse it with Jsoup
Document doc = Jsoup.parse("https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true");
System.out.println(doc);
It returns
<html>
<head></head>
<body>
https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true
</body>
</html>
Which is not the full HTML of the page.
Any suggestions on how I can solve this, or why it is happening?

That looks like they're detecting a crawler, usually via your user agent, and sending different content. Try setting your user agent string to a standard browser's string, and see if that resolves the issue you're having.
One other potential problem, though I don't think it's the issue here, is that data loaded via AJAX will not be downloaded by Jsoup. It parses the HTML that gets served up, but it doesn't execute the JavaScript, so it can't get any extra content that comes in later. You might be able to resolve that issue using something like PhantomJS, which can process and render HTML, CSS, and JavaScript, and would (in theory) give you the actual HTML you end up seeing in your browser.
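As a rough illustration of the user-agent suggestion, here is a minimal sketch that downloads a page while sending a browser-style User-Agent header. It uses the JDK's own `java.net.http` client (Java 11+) so that it runs without extra libraries; the class name, method names, and the exact User-Agent string are assumptions made for the example. With Jsoup itself, `Jsoup.connect(url).userAgent(...).get()` does the same job. It is also worth noting that `Jsoup.parse(String)` treats its argument as HTML markup rather than as a URL to download, which would produce exactly the output shown in the question; `Jsoup.connect(url).get()` is the call that actually fetches the page.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BrowserLikeFetch {
    // Example browser-style User-Agent (an assumption; any current browser string would do).
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds a GET request that identifies itself with the browser-style User-Agent.
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    // Downloads the page and returns the raw HTML the server sent (no JavaScript is run).
    public static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass the page URL as the first command-line argument.
        String html = fetch(args[0]);
        // With Jsoup on the classpath, Jsoup.parse(html) would turn this markup into a Document.
        System.out.println(html);
    }
}
```

If the server still returns different content with this header set, the user agent was probably not the cause.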

Related

Jsoup Issue scraping non-hardcoded data

I'm trying to use Jsoup to gather wave height information from Surfline.com. The element I want is in the screenshot, and it shows up in the dev tools. When I scrape the site with Jsoup, the returned string includes everything seen in the dev tools except the "1-2ft", which is what I need. The site is JavaScript heavy, and I'm assuming that Jsoup is grabbing the HTML before the JavaScript actually runs (I have no clue really). Do I need to specifically tell Jsoup to wait for the page load, or am I missing some other critical component?
This is the code I'm using.
Document doc = Jsoup.connect("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/").get();
Elements content = doc.select("div[id=current-surf-range]");
System.out.println(content);
and this is the output I'm seeing in my IDE
<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>
It seems really odd that the contents of the div wouldn't be returned with it. This is my first time using Jsoup, and I tried to read through the docs as best I could, but nothing seemed to touch on this particular issue. Any insight would be awesome and greatly appreciated.
What you see in the browser is not necessarily what you get when you download the page by URL with the HTTP library of your choice. In fact, you should never expect them to be the same. On the modern web, pages are quite dynamic and are loaded asynchronously, involving multiple API calls to different resource providers and JavaScript executed in the browser (which has a JavaScript engine).
What you get with Jsoup in this case is the initial HTML that the browser starts building the page from. Then a set of XHR calls to the Surfline API brings the data into the browser, which dynamically fills in different parts of the page, including the current surf range.
The simplest way to approach the problem is to switch to a browser automation tool such as Selenium, which fires up a real browser. You can then wait for the current surf range element to have a value and, if you wish to continue with Jsoup, get the page source and feed it to Jsoup for further parsing.
Another approach is to look at the requests the page makes in the browser developer tools and then simulate those requests in your code, parsing the JSON responses and extracting the surf forecast data.
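The second approach can be sketched roughly as follows, assuming the XHR response has already been downloaded as a string. The JSON shape (`"min"` and `"max"` fields), the class name, and the hard-coded "ft" unit are all hypothetical; the real payload has to be read off the network tab in the developer tools. A regular expression is used only to keep the sketch free of dependencies, and a proper JSON library would be the more robust choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurfRangeSketch {
    // Match "min": <number> and "max": <number> in the (hypothetical) JSON payload.
    private static final Pattern MIN = Pattern.compile("\"min\"\\s*:\\s*(\\d+(?:\\.\\d+)?)");
    private static final Pattern MAX = Pattern.compile("\"max\"\\s*:\\s*(\\d+(?:\\.\\d+)?)");

    // Returns a range such as "1-2ft", or empty if either field is missing.
    public static Optional<String> extractRange(String json) {
        Matcher min = MIN.matcher(json);
        Matcher max = MAX.matcher(json);
        if (min.find() && max.find()) {
            // The "ft" suffix is an assumption made to mirror the value shown in the question.
            return Optional.of(min.group(1) + "-" + max.group(1) + "ft");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample payload; the real API response will look different.
        String sample = "{\"surf\":{\"min\":1,\"max\":2}}";
        System.out.println(extractRange(sample).orElse("no range found")); // prints "1-2ft"
    }
}
```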

How to read from file on page load using jsp?

I have one web page and a file on a server. How can I read from the file on every page load? I'm using JSP. Is there any function available to detect a page load?
Every page load means a new request to the server (caching is another story).
So the JSP is served by the server every time, and here is a simple directive to include a file in a JSP:
<%@ include file="foo.html" %>
Keep in mind that the server only tracks changes to the JSP, not to foo.html. So if you change only foo.html, the server doesn't know about it. That's why this approach is not common. It is mostly used for shared templates and parts common to all pages (like a common footer), even though there are better modern techniques for that (like CSS).
However, if you still want to use an external file that changes constantly, remember that JSP is Java too, and you can do whatever you would do in Java (although it is not recommended: a JSP should be a simple view in MVC).
So, something like this will work (Java 11+; a relative path is resolved against the server's working directory):
<% out.write(java.nio.file.Files.readString(java.nio.file.Paths.get("foo.html"))); %>
You can use any techniques to read file and write it to the response output.
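In keeping with the advice that the JSP itself should stay simple, the file reading can live in a small helper class that the page calls on each request. The following is only a sketch under that assumption; the class and method names are made up for illustration. It also escapes the text, which matters if the file content ends up inside HTML or a quoted attribute value.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileContent {
    // Reads the whole file as UTF-8 text on every call, so edits to the file
    // show up on the next request.
    public static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Escapes characters that would break HTML text or a double-quoted attribute value.
    public static String escapeHtml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary file standing in for foo.txt.
        Path file = Files.createTempFile("foo", ".txt");
        Files.write(file, "a < b & \"c\"".getBytes(StandardCharsets.UTF_8));
        System.out.println(escapeHtml(read(file))); // prints a &lt; b &amp; &quot;c&quot;
        Files.delete(file);
    }
}
```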
Addition for your comment:
A text field is regular HTML. Input to it would look like this:
<input name="abc" value="<% out.write(java.nio.file.Files.readString(java.nio.file.Paths.get("foo.txt"))); %>">
but, again, please consider more modern techniques like DHTML, AJAX, CSS or simple JavaScript.

How to view the source of a JSP file client side

I have been writing C#, HTML, JS and jQuery for a while and have been using F12 in Chrome and IE to look at the HTML code to help me with some DOM manipulation. I am curious whether there is a way to do something similar to look at a JSP page and manipulate it with JavaScript.
For example: viewsource on a jsp page and fill out a textbox on the jsp page with javascript.
I hope the question is clear and the example makes sense. I am not working on anything with this so I don't have a real example. This is more of curiosity. I know javascript but I am not familiar with java at all.
The JSP file gets compiled to a servlet on the application server, which is then executed. This servlet produces HTML on the server side, so you cannot view the JSP source code client side (F12 in Chrome). You can only view the generated output (HTML, CSS, JS, ...).
Source: http://oreilly.com/catalog/jserverpages2/chapter/ch03.html
JSP is compiled and run on the server side--the same place you edit it (more or less).
Only HTML is sent to the client.
In addition,
You write C#, so you could create a simple ASP.NET page and see how a dynamic page is generated.
It's not JSP/Java, but it uses the same concept of dynamic pages.
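To make the "compiled to a servlet" point from the answers above concrete, here is a hand-written sketch of what that translation amounts to for a tiny page. It is not the code a real container generates and does not use the servlet API; the class and method names are invented for the illustration. Static markup becomes plain write calls, and each expression is evaluated on the server, so the client only ever receives the resulting HTML.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class GeneratedPageSketch {
    // Roughly what a page such as
    //   <html><body><p>Hello, <%= name %>!</p></body></html>
    // turns into once translated: writes for the static parts, Java for the rest.
    public static void service(String name, PrintWriter out) {
        out.write("<html><body><p>Hello, ");
        out.print(name); // the <%= name %> expression, evaluated on the server
        out.write("!</p></body></html>");
    }

    // Captures the output as a string; this is all the browser would ever see.
    public static String render(String name) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        service(name, out);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("World"));
        // prints <html><body><p>Hello, World!</p></body></html>
    }
}
```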

Getting web page source code in Java

I use Java. I want to get the source code of a web page, but the page runs JavaScript and I want to get the code generated by that JavaScript (the code we see in Firebug in Firefox).
Does anyone know what I should do?
To inspect the page after modification by JavaScript, you need a client-side JavaScript engine that can run the scripts and then let you inspect the DOM.
HtmlUnit can do this - it is a "GUI-Less browser for Java programs".
However, this won't give you the exact original page source, because that has already been parsed into a DOM by this point.
I think you want to see the source code of DOM elements created after the page load via AJAX.
If that's what you want, the only way to see it is through a DOM inspector, like Firebug in Firefox or Developer Tools in Chrome.
Going to "View source code" only shows the source at load time.
If I understand your question, yes, your JavaScript objects can be passed back to your Java backend, either by creating an HTML <form> element with input elements, filling them with your values and submitting the form, or asynchronously via AJAX/JSON (which doesn't require reloading your web page). For both methods you need to configure an endpoint on your Java side to receive the submitted data and return some kind of confirmation to the client, i.e. your JavaScript. I would recommend googling "jQuery.post" for the JavaScript side and finding some examples for your Java backend.
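For the Java side, an endpoint that receives a submitted form and returns a confirmation could look roughly like the sketch below. It uses the HTTP server bundled with the JDK (`com.sun.net.httpserver`) only to stay dependency-free; in a real application this would more likely be a servlet or a framework controller. The `/submit` path, the port, the class name, and the method names are all assumptions made for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class SubmitEndpointSketch {
    // Parses an application/x-www-form-urlencoded body such as "name=Ann+Lee&age=30".
    public static Map<String, String> parseForm(String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return fields;
        }
        for (String pair : body.split("&")) {
            String[] kv = pair.split("=", 2);
            String key = URLDecoder.decode(kv[0], StandardCharsets.UTF_8);
            String value = kv.length > 1 ? URLDecoder.decode(kv[1], StandardCharsets.UTF_8) : "";
            fields.put(key, value);
        }
        return fields;
    }

    // Starts a server whose /submit endpoint reads the posted form and replies with a confirmation.
    public static HttpServer startServer(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/submit", exchange -> {
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            Map<String, String> fields = parseForm(body);
            byte[] reply = ("received " + fields.size() + " field(s)").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(reply);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) {
        // Only demonstrates the parsing step; call startServer(8080) to actually listen.
        System.out.println(parseForm("name=Ann+Lee&age=30")); // prints {name=Ann Lee, age=30}
    }
}
```

A form posted to `/submit`, or a `jQuery.post` call to the same URL, would then receive the short confirmation text as the response.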

Best way to embed exact output from an URL into a JSP

I am working with Liferay and I need to show a preview of the HTML output of a URL as an embedded window in a JSP view. I am assessing different possibilities:
Somehow store the interface to preview as a screenshot image and show it as an embedded image. The good thing is that the formatting would be exactly the same.
Parse the URL output stream with a BufferedReader and clean out all html, scripting and body tags with indexOf. Embed images as cid:
Some kind of include, jsp:include or liferay-util:include, from the direct URL of the downloaded temporary HTML output
Any jQuery AJAX $().html() kind of solution
Any HTML-level solution with an iframe, applet, frame, appletwindow or whatever, if it exists
Which do you think is the best or recommended way: simplest, most reliable and exact looking? Any code or reference?
And what if I had to send it as JavaMail message content to an email address?
Thank you!!
This should probably be done client-side, in JavaScript, or even via iframes. Either put the iframes in the page directly, or have JavaScript code that generates the iframes, and point the iframes at the URL to be previewed. Keep it simple.
