Safe API to transform HTML (Java) - java

I'd like to take a web page and add some tags to its head. Specifically, a CSS link and a JavaScript link. I need to do this programmatically for a wide variety of web pages. Now, I could hack this out with a regex or two, but I'd like to use something more robust.
What's a good way to inject or transform HTML? I'm using Scala, but anything Java or JVM will work.

You can use jsoup. It parses HTML into a document you can modify and then write back out as HTML.
Its documentation includes examples of modifying HTML content.

Related

How to get only HTML content of the page in Java?

Without the use of any external library, what is the simplest way to fetch a website's HTML content into a string? I tried, but I get the complete page source when I only want the HTML content.
I think this is a bit difficult to achieve without an external library, my friend.
What you actually want is to execute the JavaScript parts of the HTML and act like a GUI-less web browser programmatically.
If you can use an external library, I would go for HtmlUnit (http://htmlunit.sourceforge.net/), which is pretty easy to use.
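For the library-free part of the question, here is a minimal sketch using the JDK's own java.net.http.HttpClient. It assumes Java 11 or later; the class name PageFetcher and the 10-second timeout are illustrative choices, not something from the question. It returns only the markup the server sends, so anything built later by JavaScript will be missing, which is exactly the limitation described above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class; the name is not from the original question.
final class PageFetcher {

    // Builds a plain GET request. The 10-second timeout is an arbitrary choice.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Returns the response body as a String. No JavaScript is executed,
    // so this is the raw markup as served.
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling PageFetcher.fetchHtml("https://example.com/") would give you the page as a string. If you need the page after its scripts have run, a headless browser library such as HtmlUnit is still the way to go.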

HTML with JS to Image/PDF

Is it possible to convert an html page with charts generated by javascript to an Image or PDF in Java?
I am familiar with the iText framework and it seems suitable, but I am not sure how it handles JS-generated content.
A quick search turned up this as a possible answer.
It uses one library to convert to XSL-FO, then another one to convert that to PDF.
Edit: This might interest you as well. There's a bit on some JBrowser class that seems to let you print web pages.
It depends on how the charts were generated. I see three possibilities:
Canvas tag.
You need to add a bit of JS code to get the image using the canvas toDataURL method.
SVG.
You can add some code to get the full markup of the generated SVG document via the innerHTML property.
Flash.
The worst case. I think it's hardly possible to achieve what you want.
Solution 1
If you have access to plain HTML (taken after the JavaScript has executed and built the page), you can easily pass it to iText and convert it to PDF. I would recommend using Flying Saucer (which in turn uses iText), which has a very good and convenient API for this (see http://code.google.com/p/flying-saucer/).
Solution 2
On the other hand, if you do not have access to the final HTML output, you could use the Swing libraries to render the page and then take a screenshot of it. This will even allow you to capture Flash, but I'm not sure whether this approach will be suitable for your problem.
However, if it is the case, you can load the web page into a Swing application (you will need to rely on a third-party browser component for JS support, but there are quite a lot out there), and then you can use the Robot class to get a screenshot of it.
Take a look at http://download.oracle.com/javase/6/docs/api/java/awt/Robot.html
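The screenshot step of Solution 2 could look roughly like the sketch below. It assumes the page is already rendered and visible on screen (the browser component is not shown), and the class and method names are hypothetical. Note that it grabs the whole primary screen rather than just the browser window, and that it needs a real display: on a headless server the capture call will throw.

```java
import java.awt.AWTException;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper class; the names are illustrative only.
final class ScreenGrabber {

    // Captures the entire primary screen. Throws HeadlessException or
    // AWTException when no display is available.
    static BufferedImage captureScreen() throws AWTException {
        Rectangle area = new Rectangle(Toolkit.getDefaultToolkit().getScreenSize());
        return new Robot().createScreenCapture(area);
    }

    // Encodes an image as PNG bytes, ready to be written to a file or
    // embedded in a PDF by another library.
    static byte[] toPng(BufferedImage image) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To capture only the browser component, you would pass its on-screen bounds to createScreenCapture instead of the full screen rectangle.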

Parsing HTML from a web page

I have to extract some information from a web page, and reformat it for the user.
Since the web page is somewhat regular, I currently use HttpClient to retrieve the HTML as a string, and I extract substrings at given locations with the relevant data.
Anyhow I'm wondering if there is a better way, maybe an HTML-aware way. How would you do it?
Cheers
Ideally, you should use a real HTML-parser. I've used Jsoup successfully in the past on Android:
http://jsoup.org/
I personally like to use Jericho parser: http://jericho.htmlparser.net/docs/index.html
It is easy to use, has many examples on the project's page, and deals well with poor HTML (unclosed tags, etc.).
We've used HttpUnit to do this in the past.
jsoup.org is better, but Cobra also has some additional features (it is CSS-aware and JavaScript-aware).

Vanilla JSF application with javascript?

Hey guys, I have been trying different JavaScript/AJAX JSF frameworks. I find most of them extremely heavy. ICEfaces tends to add so much JavaScript just for simple things, and it also uses a notoriously slow JavaScript framework. PrimeFaces is a little better since it uses jQuery, but I still find it kind of heavy. What if I just want to use straight vanilla JSF and add JavaScript on top of that?
What is the best way to go about this? I would need to be able to output JavaScript to the page from the backing bean. Would a servlet or RESTful service be a good way to output JavaScript/HTML to a page?
I basically want to use basic jquery animations. Maybe do a datatable filter. Thanks for any help.
What you want is definitely supported in JSF. You can use the <h:outputScript> tags for this and additionally the <h:outputStylesheet> tag if you also need CSS.
With these tags you can include scripts per view (page), although you can also opt to include them for all pages by creating a master Facelets template and including them there.
You can also create very simple components of your own by using the JSF/Facelets composite component feature. Those components just consist of a simple .xhtml template file, which can include the JavaScript libraries you need and contain your own lightweight, tailor-made JavaScript.
See this for some examples of using the <h:outputScript> tag in JSF: http://www.mkyong.com/jsf2/resources-library-in-jsf-2-0/

What is the best way to screen scrape poorly formed XHTML pages for a java app

I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work for malformed XHTML, and regex is just a pain.
Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings with the text of the links, or ask for all the bold text, etc.
Run the XHTML through something like JTidy, which should give you back valid XML.
You may want to look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a webpage and request all URLs of the page in exactly the manner you describe.
It was very easy to work with - it literally fires up a webbrowser and gives you back information in nice forms. IE support seemed best, but at least with Watir Firefox was also supported.
I had some problems with JTidy back in the day. I think it was related to unclosed tags that made JTidy fail. I don't know if that's fixed now. I ended up using something that was a wrapper around TagSoup, although I don't remember the exact project's name. There's also HtmlCleaner.
I've used http://htmlparser.sourceforge.net/. It can parse poorly formed html and allows data extraction quite easily.
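As a dependency-free illustration of the "ask for all the links" idea, here is a minimal sketch that uses none of the libraries named above. Instead it relies on the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). The class name LinkExtractor is hypothetical, and this parser targets an old HTML dialect, so treat it as a starting point rather than a replacement for the dedicated libraries.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is illustrative only.
final class LinkExtractor {

    // Returns the href value of every <a> start tag, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The final argument tells the parser to ignore charset declarations,
            // since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

The same callback can collect link text or bold text by overriding handleText and tracking which tag is currently open.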
