How to get only HTML content of the page in Java? - java

Without the use of any external library, what is the simplest way to fetch a website's HTML content into a string? I had tried, but I'm getting the complete page source, but I only want HTML content.

This is difficult to achieve without an external library.
What you actually want is to execute the JavaScript parts of the HTML and act like a GUI-less web browser, programmatically.
If you can use an external library, I would go for http://htmlunit.sourceforge.net/, which is pretty easy to use.
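For the plain fetch the question starts from, the standard library alone is enough. A minimal sketch, with the caveat that it returns the markup exactly as the server sends it and executes no JavaScript. The class name PageFetcher is made up for illustration, and UTF-8 is an assumption (a real page may declare a different charset):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class PageFetcher {

    // Drains the stream into a String using the given charset, then closes it.
    static String readAll(InputStream in, Charset charset) {
        try (InputStream input = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = input.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Downloads the raw markup served at the address. No JavaScript is run.
    // Assumption: the page is encoded as UTF-8.
    static String fetch(String address) {
        try {
            InputStream in = URI.create(address).toURL().openStream();
            return readAll(in, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

If the content you are after is built by scripts in the browser, this will not contain it, which is where a headless browser library such as HtmlUnit comes in.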

Related

Safe API to transform HTML (Java)

I'd like to take a web page and add some tags to its head. Specifically, a CSS link and a JavaScript link. I need to do this programmatically for a wide variety of web pages. Now, I could hack this out with a regex or two, but I'd like to use something more robust.
What's a good way to inject or transform HTML? I'm using Scala, but anything Java or JVM will work.
You can use jsoup.
An example for modifying content in html is here

GAE how to serve an html file dynamically from the server

I'm working with GAE (Java) in my GWT application.
When my users enter a certain URL I'd like to dynamically create an html page on the server side and serve it to the client.
How can I do this? Using HttpServlet?
I'm quite lost here, do I need to have an html template file on the server side that I dynamically complete and serve to the client?
You should start with the tutorial to learn the basics. You can generate the whole HTML dynamically, but that tends to get awkward. It's better to separate the HTML to a template and fill in the details with the logic implemented in the GAE application.
https://developers.google.com/appengine/docs/java/gettingstarted/
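To illustrate the template idea, here is a minimal sketch using only the standard library. The `{{name}}` placeholder syntax and the class name PageTemplate are assumptions made up for this example, not something GAE provides. In a real application the rendered string would be written to the response from a servlet's doGet, and the template would be loaded from a file rather than a literal:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the GAE SDK.
class PageTemplate {

    // Matches placeholders of the (assumed) form {{name}}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    // Escapes the characters that are significant in HTML text and attributes.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Replaces each known placeholder with its escaped value in a single pass.
    // Placeholders without a value are left in the output unchanged.
    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value == null) ? m.group() : escapeHtml(value);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "<html><body><h1>Hello, {{user}}</h1></body></html>";
        Map<String, String> values = new LinkedHashMap<>();
        values.put("user", "Ada <admin>");
        System.out.println(render(template, values));
    }
}
```

Escaping the values matters as soon as any of them come from the request. For anything beyond a handful of placeholders, JSP or a template engine is probably the better fit.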
You can use a library like this one https://github.com/alexmuntean/java-url-rewrite . Read the readme to understand more.
You can just take the request and serve anything you want (JSP, JSF, static HTML). You can also write GWT code to act on the existing HTML (effects, Ajax and so on): just add ids to the elements, write another entry point for that page, and include the generated JS in your page.
I am planning to do a tutorial and POC on how to make a GWT website indexable by Google.

HTML with JS to Image/PDF

Is it possible to convert an html page with charts generated by javascript to an Image or PDF in Java?
I am familiar with the iText framework and it seems suitable, but I am not sure how it handles JS-generated content.
A quick search turned up this as a possible answer.
It uses one library to convert the page to XSL-FO, then another one to convert that to PDF.
Edit: This might interest you as well. There's a bit on some JBrowser class that seems to let you print web pages.
It depends on how they were generated. I see three possibilities:
Canvas tag. You need to add a bit of JS code to get the image using the canvas toDataURL method.
SVG. You can add some code to get the full markup of the generated SVG document via the innerHTML property.
Flash. The worst case. I think it's hardly possible to achieve what you want.
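For the canvas case, toDataURL returns a string of the form `data:image/png;base64,...` by default. Assuming the page posts that string back to your Java code (how it gets there is up to you), turning it into image bytes needs only the standard library. A rough sketch, where the class name DataUrlDecoder is made up for illustration:

```java
import java.util.Base64;

// Hypothetical helper class for illustration only.
class DataUrlDecoder {

    // Decodes a base64 data URL such as "data:image/png;base64,iVBOR..."
    // into the raw bytes of the image.
    static byte[] decode(String dataUrl) {
        int comma = dataUrl.indexOf(',');
        if (!dataUrl.startsWith("data:") || comma < 0) {
            throw new IllegalArgumentException("Not a data URL");
        }
        // The part between "data:" and the comma, e.g. "image/png;base64".
        String header = dataUrl.substring(5, comma);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("Only base64 data URLs are handled");
        }
        return Base64.getDecoder().decode(dataUrl.substring(comma + 1));
    }

    public static void main(String[] args) {
        // Sample payload: just the 8-byte PNG file signature, not a full image.
        byte[] bytes = decode("data:image/png;base64,iVBORw0KGgo=");
        System.out.println("decoded " + bytes.length + " bytes");
    }
}
```

The resulting bytes can be written to a .png file with java.nio.file.Files.write, or handed to a PDF library as an image.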
Solution 1
If you have access to plain HTML (taken after the JavaScript has executed and built the page), you can easily pass it to iText and convert it to PDF. I would recommend using Flying Saucer (which in turn uses iText), which has a very good and convenient API for this (see http://code.google.com/p/flying-saucer/ ).
Solution 2
On the other hand, if you do not have access to the final HTML output, you could use the Swing libraries to render the page and then take a screenshot of it. This will allow you to even use Flash, but I'm not sure whether this approach will be suitable to your problem.
However, if it is the case, you can load the web page into a Swing application (you will need to rely on a third-party browser component for JS support, but there are quite a lot out there), and then you can use the Robot class to get a screenshot of it.
Take a look at http://download.oracle.com/javase/6/docs/api/java/awt/Robot.html

Reading HTML+JavaScript using Java

I can read the HTML contents via http (for example, http://www.foo.com) using Java (with URL and BufferedReader classes). However, a couple of them contain JavaScript. My current app cannot process JavaScript.
What's the best way to read HTML content with JavaScript using Java?
I am open to using other languages if it is easier.
Thanks in advance for your help.
UPDATE - Clarification:
Some of the HTML content is generated dynamically using JavaScript. I can see the result (in pure HTML after the JavaScript processing) when viewing the pages in a browser.
On the other hand, when my Java app retrieves the HTML content, the page reports that JavaScript is not available in my app.
Ideally, I want to be able to get the same result as on the browser using my Java app.
Thanks for everyone's response.
HtmlUnit has good JavaScript support and it should process the HTML (almost) like a web browser would.
http://htmlunit.sourceforge.net/
http://htmlunit.sourceforge.net/javascript.html
Cobra (http://lobobrowser.org/cobra/getting-started.jsp) will fit your needs
For just HTML parsing you can use HTMLParser (org.htmlparser). However from the way you described your problem, it seems you need a browser, because executing is totally different than just parsing. Cheers.
Without a doubt, you need to use a Java HTML parser:
Java Open Source HTML Parsers
Which Html Parser is best?
HTML/XML Parser for Java
HTML PARSER in java [closed]

download a complete web page including resources (like images) in java

Is there a way to download an (HTML) web page and all its resources (e.g. images, CSS)?
I know how to do this using an HTML parser, by going through all the relevant tags, but isn't there an easier way?
That is the easy way.
The hard way is to write your own network libraries, html parser etc...
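To make the "go through the relevant tags" step concrete, here is a rough sketch that collects the resource URLs of a page and resolves them against the page address. Note that it uses a regular expression as a stand-in for a real HTML parser, only to stay within the standard library. It handles quoted src and href values on img, script and link tags and will miss or mismatch anything more unusual. The class name ResourceLinks is made up for illustration:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; a real HTML parser is the more robust choice.
class ResourceLinks {

    // Group 1: quoted src on <img>/<script>. Group 2: quoted href on <link>.
    private static final Pattern RESOURCE = Pattern.compile(
        "<(?:img|script)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']"
        + "|<link\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the absolute URIs of the resources referenced by the page,
    // in document order. References that are not valid URIs are skipped.
    static List<URI> extract(String html, URI base) {
        List<URI> found = new ArrayList<>();
        Matcher m = RESOURCE.matcher(html);
        while (m.find()) {
            String ref = (m.group(1) != null) ? m.group(1) : m.group(2);
            try {
                found.add(base.resolve(ref.trim()));
            } catch (IllegalArgumentException e) {
                // Malformed reference (for example, one containing spaces).
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"/css/site.css\">"
            + "<img src=\"img/logo.png\">";
        URI base = URI.create("http://www.foo.com/docs/index.html");
        for (URI uri : extract(html, base)) {
            System.out.println(uri);
        }
    }
}
```

Each URI in the returned list can then be downloaded the same way as the page itself and saved next to it.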
