Is it possible to convert an HTML page with charts generated by JavaScript to an image or PDF in Java?
I am familiar with the iText framework and it seems suitable, but I am not sure how it handles JavaScript-generated content.
A quick search turned up this as a possible answer.
Using a library to convert to XSL-FO, then another one to convert that to PDF.
Edit: This might interest you as well. There's a bit on some JBrowser class that seems to let you print web pages.
It depends on how they were generated. I see three possibilities:
Canvas tag.
You need to add a bit of JS code to get the image using the canvas toDataURL method (see the sketch after this list).
SVG.
You can add some code to get the full markup of the generated SVG document via the innerHTML property.
Flash.
The worst case. I think it's hardly possible to achieve what you want.
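For the canvas case, once you have the data URL string back on the Java side (how you obtain it depends on the browser bridge you use, so treat that part as an assumption), decoding it into a PNG file is a few lines of standard-library code. A minimal sketch (class name and output path are made up):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class CanvasImageDump {
    // dataUrl looks like "data:image/png;base64,iVBORw0KGgo..."
    public static void writePng(String dataUrl) throws Exception {
        // Strip the "data:image/png;base64," prefix and decode the rest
        String base64 = dataUrl.substring(dataUrl.indexOf(',') + 1);
        byte[] pngBytes = Base64.getDecoder().decode(base64);
        Files.write(Paths.get("chart.png"), pngBytes);
    }
}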
Solution 1
If you have access to the plain HTML (taken after the JavaScript has executed and built the page), you can easily pass it to iText and convert it to PDF. I would recommend using Flying-Saucer (which in turn uses iText), which has a very good and convenient API for this (see http://code.google.com/p/flying-saucer/ ).
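A minimal sketch of that route, assuming the final HTML is already available as a string (the class name and output file are made up; ITextRenderer is Flying-Saucer's renderer and expects well-formed XHTML):

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.xhtmlrenderer.pdf.ITextRenderer;

public class HtmlToPdf {
    public static void render(String xhtml) throws Exception {
        try (OutputStream out = new FileOutputStream("page.pdf")) {
            ITextRenderer renderer = new ITextRenderer();
            renderer.setDocumentFromString(xhtml); // input must be well-formed XHTML
            renderer.layout();
            renderer.createPDF(out);
        }
    }
}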
Solution 2
On the other hand, if you do not have access to the final HTML output, you could use the Swing libraries to render the page and then take a screenshot of it. This even works with Flash content, but I'm not sure whether this approach will suit your problem.
If it does, you can load the web page into a Swing application (you will need to rely on a third-party browser component for JS support, but there are quite a lot out there) and then use the Robot class to take a screenshot of it.
Take a look at http://download.oracle.com/javase/6/docs/api/java/awt/Robot.html
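The screenshot step itself is only a few lines; a minimal sketch (it captures the whole screen, so the embedded browser window is assumed to be visible and fully rendered; the class name and file name are made up):

import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class PageScreenshot {
    public static void capture() throws Exception {
        Robot robot = new Robot();
        // Grab everything currently on screen and save it as a PNG
        Rectangle screen = new Rectangle(Toolkit.getDefaultToolkit().getScreenSize());
        BufferedImage image = robot.createScreenCapture(screen);
        ImageIO.write(image, "png", new File("page.png"));
    }
}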
Related
I am developing a JSF web application and would like a lot of documentation to be visible directly in the web application. Technically I would like to use the Markdown language, and I have already made my first experiments with it.
I am currently playing around with the flexmark Java library to render e.g. HTML strings from a Markdown document. This also seems to work fine. But what should I do with links to other md files?
If I have this Markdown fragment: See also [here](Background.md)
Then this is rendered correctly to HTML with a link like: See also here.
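With flexmark the rendering step looks roughly like this (a minimal sketch; the exact package names depend on the flexmark version you use, and the class name is made up):

import com.vladsch.flexmark.html.HtmlRenderer;
import com.vladsch.flexmark.parser.Parser;
import com.vladsch.flexmark.util.ast.Node;

public class MarkdownPanel {
    public static String toHtml(String markdown) {
        Parser parser = Parser.builder().build();
        HtmlRenderer renderer = HtmlRenderer.builder().build();
        Node document = parser.parse(markdown);
        // "See also [here](Background.md)" becomes <p>See also <a href="Background.md">here</a></p>
        return renderer.render(document);
    }
}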
But how should I tell my web server to react to this link and update the document part of the page with the rendered md file?
I would need to manually find such links in the generated HTML and change them to a kind of JavaScript call, telling my server to render the panel using the other md file.
Or should I create an IFrame so that, within this frame, I could follow the link to e.g. a web servlet that renders the md files to new HTML?
But this all feels a bit clumsy to me. Am I missing an easier solution?
OK, since there are no other answers, I'll answer my own question.
The comment about the PrimeFaces extension with localization is interesting, but it is too far from my focus and some of its features did not really match my requirements.
Therefore I stayed with a pure Markdown library and did the rest on my own.
With the links it was much easier than expected! Within JavaScript you can very easily detect all links of the page (document.links), iterate over them, and just set an onclick function (see here).
I am writing a new Convert-HTML-to-PDF service, but now I am confused about which approach I should prefer.
The ways I could implement it:
Use a headless browser to capture the HTML page and convert it to PDF
Use a Java/Node library to convert, which builds the corresponding components in the PDF file and then renders them
Now, please help me understand which would be the best way to implement the service, and why!
[update]
And what are the advantages and disadvantages of each approach?
In my view, the best way forward always depends on what you already have experience with and which approach suits your situation. There is no right or wrong here; everyone has to decide for themselves based on their preferences.
Each approach has advantages and disadvantages. Some of them are:
Headless Browser:
Advantage:
No large libraries necessary, therefore very light on memory
Disadvantage:
the desired browser must be installed on the computer/server
rendering may differ for different browsers
Library:
Advantage:
different libraries available
the popular libraries have good documentation and code examples
Disadvantage:
When upgrading to a newer version, code usually needs to be adapted.
When upgrading to a newer version, the result may look different.
In my projects I use a headless Chrome browser. For this I found an easy-to-use API on GitHub, which uses Chrome's DevTools.
It also includes a simple example of how to print a page to a PDF.
For my purposes I customized this example: I write the HTML into a temporary file and then navigate to that file.
// Navigate to HTML-File
page.navigate(htmlTempFile.getAbsolutePath());
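The temporary-file part needs only the standard library; a minimal sketch (the page object comes from the DevTools wrapper and is assumed to be set up already, as in the snippet above, and html is assumed to hold the generated markup):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Write the generated HTML to a temporary file and point the headless browser at it
File htmlTempFile = File.createTempFile("report-", ".html");
htmlTempFile.deleteOnExit();
Files.write(htmlTempFile.toPath(), html.getBytes(StandardCharsets.UTF_8));
page.navigate(htmlTempFile.getAbsolutePath());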
I can't say whether this is the best way, but for me it was the easiest and most understandable one.
I'd like to take a web page and add some tags to its head. Specifically, a CSS link and a JavaScript link. I need to do this programmatically for a wide variety of web pages. Now, I could hack this out with a regex or two, but I'd like to use something more robust.
What's a good way to inject or transform HTML? I'm using Scala, but anything Java or JVM will work.
You can use jsoup.
An example of modifying content in HTML is here.
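A minimal sketch of adding a stylesheet and a script to the head with jsoup (the class name and the href/src values are made up):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HeadInjector {
    public static String inject(String html) {
        // Parse the page and append the new tags to its <head>
        Document doc = Jsoup.parse(html);
        doc.head().appendElement("link")
                  .attr("rel", "stylesheet")
                  .attr("href", "/css/extra.css");
        doc.head().appendElement("script")
                  .attr("src", "/js/extra.js");
        return doc.outerHtml();
    }
}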
I don't want just the source code. I want the rendered page. This is an important distinction that I apparently cannot make by simply searching Google.
Does anyone know how I can get the rendered page from a URL?
This needs to be done in Java, hopefully without an extra library.
Another solution would be to use HtmlUnit, which is a "GUI-less browser for Java". Google recommends it for generating snapshots of Ajax-based web pages to make them crawlable.
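A minimal sketch with HtmlUnit (the URL, wait time, and class name are made up; the package prefix changed in newer releases, so adjust the imports to the version you use):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedPageFetcher {
    public static String fetch(String url) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(5000); // let async scripts finish
            return page.asXml(); // markup after the JavaScript has run
        }
    }
}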
You can try using a library that wraps a web browser, for example Berkelium. If you need it in Java, a Google search produced this Java wrapper API for Berkelium (I haven't tried it personally).
The project's sites.google.com page has an example of its use.
I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work for malformed XHTML, and regex is just a pain.
Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings with the text of the links, or ask for all the bold text, etc.
Run the XHTML through something like JTidy, which should give you back valid XML.
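A minimal sketch of that approach (the class name and file name are made up; treat the exact calls as an assumption about JTidy's API):

import java.io.FileInputStream;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class LinkExtractor {
    public static NodeList extractLinks(String file) throws Exception {
        // Clean up the malformed XHTML so standard XML tooling can handle it
        Tidy tidy = new Tidy();
        tidy.setXmlOut(true); // emit well-formed XML
        Document doc = tidy.parseDOM(new FileInputStream(file), null);

        // Then ordinary XPath works, e.g. collect all link targets
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
    }
}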
You may want to look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a webpage and request all URLs of the page in exactly the manner you describe.
It was very easy to work with - it literally fires up a web browser and gives you back information in nice forms. IE support seemed best, but at least with Watir, Firefox was also supported.
I had some problems with JTidy back in the day. I think it was related to tags that weren't closed making JTidy fail. I don't know if that's fixed now. I ended up using something that was a wrapper around TagSoup, although I don't remember the exact project name. There's also HTMLCleaner.
I've used http://htmlparser.sourceforge.net/. It can parse poorly formed HTML and makes data extraction quite easy.