Convert Web Page to PDF or Image - java

I need to convert a web page [which has not public access] to PDF or Image [preferably to PNG].
Web page contains set of charts and image. Most of the charts are populated through Ajax calls so there is a delay between page load and chart load.
I am looking answer for any of these questions:
1- I found set of snapshot api's but none of them support accessing my internal page. Since the web page I am trying to export is not public I need to be authenticated. Biggest problem is I cannot send request headers [such as session-id, cookie or other variables] along with these API's. It seems they don't support this kind of functionality.
2- I am not sure if I can do following: Login to my web page with HTTP client, add http headers, send get call and get HTML string. Then use one of the converters to convert it to PDF. What I am not sure is if it's possible to get proper PDF from the HTML string I got from http client since resources [css, js and etc] will be missing. I want my pdf/image looks exactly as it on the web site.
I really appreciate if you can help.
Thanks in advance,
ED

You're probably best of using wkhtmltopdf, which is a server-side tool and is easily installed.
There are two parameters you can use to wait for your Ajax to finish, try:
javascript-delay to influence the time the program waits for the JavaScript to finish
window-status to wait for a certain return code for the window
See the extensive manual for this program here
wkhtmltopdf generates a PDF and wkhtmltoimg generates an image, which is PNG (as you requested) by default.

Authentication is difficult because it involves security. Because the operation you are describing is unusual it is likely to result in all kinds of alarm bells going off. It is entirely possible to do but it is fraught, easy to get wrong and fragile in the face of security updates and code changes.
As such I'm going to suggest an alternate method which is one we often recommend for ABCpdf (on which I work). Yes we support standard authentication methods but the beauty of this approach is that it is robust and is applicable to other solutions (eg Java based) and novel authentication methods.
Typically you just want a PDF of the current page. The easiest way to do this is snaffle the HTML. The way you do this rather depends on your environment. For example under ASP.NET you can obtain the HTML of the current page using the HttpResponse.Filter property or by overriding the Render method of the page. The way you do it will depend on what you're coding in.
Then you need to save this HTML to a file and present it to your solution via a 'file://' protocol URL. Now obviously at this point any relative links will be broken but this is easily fixed by dropping in a BASE tag that references the place they are located.
Generally the types of resources referenced by an server-side page are static. So if you can create a tag that references the actual files rather than a web site, you will bypass any authentication for access to these resources.
That still leaves the AJAX based problems which are another can of worms. The render delay method is something we have supported for many years (from before AJAX was around) however it is not terribly reliable because you just don't know how long to wait.
Much better is a tighter link into the JavaScript via a callback you can use to determine if the page is loaded. I don't think ABCpdf is going to be appropriate for you since it is .NET but I would certainly encourage you to look for a Java based solution that uses this type of more sophisticated approach.

Related

Getting Twirl to stream a scala template render output?

Using Play Framework with Java (controller) and Twirl, I am trying to send a page that contains about 40MB of data to the caller, through a single HTTP GET request (using scala templates, nothing special here). The issue is that when I call the render method on the template, my application crashes with an OutOfMemoryError. I am 100% positive it is relative to page weigth VS available heap memory, I have confirmed it through testing. I think it is related to the fact that "To be able to compute the Content-Length header properly, Play must consume the whole response data and load its content into memory." (https://www.playframework.com/documentation/2.5.x/JavaStream). The Twirl templates render method seems to load the entire page into memory just the same anyway.
I have confirmed, as mentionned in the docs, that if I use an InputStream to output my own data, things will work smoothly and I can send as much data as I want (up to gigabytes of data if I want, it just works) back to the client simply by returning ok(myInputStream) in my controller method.
Now, what I would like to do is to get Twirl to render its template using streaming (providing me an InputStream or something similar). I have searched for solutions, read the docs to no avail. Am I looking at the wrong places, or it is just not available and possible by design, and I have no other choices than to rethink my template and requests/responses ?
Java 8, Play Framework 2.5.

Jsoup Issue scraping non-hardcoded data

I'm trying to use Jsoup to gather wave height information from Surfline.com. I have the element I desire in the screenshot and the it's showing in the dev tools. When I scrape the site with Jsoup, the returned string includes everything seen in the dev tool but the "1-2ft" which is what I need. The site is Javascript heavy and I'm assuming that jsoup is snagging the html before the javascript actually runs (I have no clue really). Do I need to specifically tell jsoup to wait for the pageload or am I missing some other critical component?
This is the code I'm using.
Document doc = Jsoup.connect("http://www.surfline.com/surf-report/folly-beach-pier-southside-southeast_5294/").get();
Elements content = doc.select("div[id=current-surf-range]");
System.out.println(content);
and this is the output I'm seeing in my IDE
<div id="current-surf-range" style="font-size:21px;font-weight:bold;padding-top:7px; padding-bottom: 7px;"></div>
it seems really odd that the contents of the div wouldn't be returned with it. This is my first time using Jsoup and I tried to read through the docs as best I could but nothing seemed to touch on this particular issue. Any insight would be awesome and greatly appreciated.
What you see in the browser is not what necessarily you would get when download the page by URL with your HTTP library of a choice. In fact, you should never expect them to be the same. In the modern web, webpages are quite dynamic and are loaded asynchronously involving multiple API calls to different resource providers and javascript being executed in the browser (which has the javascript engine).
What you get with JSoup in this case is the initial HTML that browser starts to form the page with. Then, there is a set of XHR calls to the surfline API that brings the data into the browser which then dynamically fills up different parts of the page, including the current surf range.
The simplest way to approach the problem is to switch to browser automation tool called selenium which would fire up a real browser. You can then wait for the current surf range element to have a value and, if you wish to continue with JSoup, get the page source and feed it to JSoup for further parsing.
Another approach would involve looking into the requests that the page makes in the browser developer tools and then try to simulate these requests in your code, parsing the JSON responses and extracting the surf forecast data.

Getting web page source code in Java

I use Java. I want to get web page source code but on the page works JavaScript and I want get code generated by JavaScript (code which we see in firebug in firefox)
Anyone knows what I should do?
To inspect the page after modification by JavaScript, you need a client-side JavaScript engine that can run the scripts and then let you inspect the DOM.
HtmlUnit can do this - it is a "GUI-Less browser for Java programs".
See also this question
However, this won't give you the exact original page source, because that has already been parsed into a DOM by this point.
I think you want to see the source code of DOM Elements created after the page load via AJAX.
If that´s what you want, the only way to see it is through a DOM inspector, like firebug in firefox or Developers Tools in Chrome.
Going to "View source code" only shows the source at load-time.
If I understand your question, yes your javascript objects can be passed back to your java backend either by a creating a html <form> element with inputelements, fill them with your values and then submit the form, or asynchronously via ajax/json (which doesn't require re-loading your web page). For both methods you need to configure an endpoint on your java side to receive the submitted data and return some kind of confirmation to the client, i.e. your javascript. I would recommend googling "jQuery.post" for the javascript side and finding some examples for your java backend.

Display external webpage into a webpage in my application

i want to display an external webpage (exactly as it's rendered in that site) into a webpage in my application in a way that's fast and better for SEO crawlers, and i was wondering if there's a way to do that with javaee ?
if not then what is better in performance and for SEO the XMLHTTPRequest way or the iframes way.
please advise with sample code or link if possible, thanks
Update: example website is: http://www.akhbarak.net/
If you need to display content from different pages inline, use iframe (iframe stands for inline frame - it has nothing to do with Apple).
If you'd like to use AJAX to display pages, I would recommend colorbox.
Note that accessing pages in a different domain via AJAX is next to impossible - this is a very, very big security hole. I would not recommend doing it. You would have to use a proxy on your own server to fetch the page and return its HTML.
That said, using the iframe in your source code, so it is loaded with the rest of the page, seems like your best bet. Sites like facebook and twitter use this in embeddable "like" and "tweet" widgets so that those widgets can make requests on their own domain - that is, twitter or facebook. While managing lots of iframes isn't very fun, it is a very accepted way of doing what you want to do.
In theory, you could
load the whole page into a PHP variable,
replace the body tags with ,
take out the html tags,
pull out the entire section and put it in the encompassing pages ,
and replace all links with absolute ones (ie '/images' changes to 'http://example.com/images')
Would it be easy to do? Probably not. It's the only way I can think of to accomplish it so that the site appears as part of yours though.

html5: Write content to a page without refreshing

I want to create a page where the top half contains start stop buttons and in the bottom half i want to write content from the server. the buttons invoke functions on the server and the server does some computing and generates timely messages which need to be written to the bottom half of the page.
Possible ways of doing
1. AJAX
2. DWR
3. HTML5
Let me know which method is better and how can i do it.
AJAX
Means "Making an HTTP request without leaving the page". This will let you get content from the server. To write it you need to do DOM manipulation. There are no shortage of Ajax and DOM tutorials on the web
DWR
Is a library to help with Ajax (although not one I've heard of before).
HTML5
Is a now meaningless buzzword. Of all the meanings it has, only the ones which include "Ajax and DOM manipulation" will help, and that's already covered above.
Look up a tutorial on xmlhttprequest. I know there is a good explanation of it here Quirksmode. I am not sure where your current skill set is using HTML, server side programming, XML, and Javascript are, but you will need to have a minimum understanding of all of those in order to successfully implement AJAX.
There are lots of resources on the web, Google

Categories