Is there a way to download an HTML web page and all its resources (e.g. images, CSS)?
I know how to do this using an HTML parser, by going through all the relevant tags, but isn't there an easier way?
That is the easy way.
The hard way is to write your own network libraries, HTML parser, etc.
Related
Without using any external library, what is the simplest way to fetch a website's HTML content into a string? I tried, but I get the complete page source, whereas I only want the HTML content.
I think this is difficult to achieve without an external library.
You actually want to execute the JavaScript parts of the HTML and act like a GUI-less web browser programmatically.
If you can use an external library, I would go for http://htmlunit.sourceforge.net/ which is pretty easy to use.
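For the plain-fetch part of the question (no JavaScript execution), here is a minimal sketch that uses only JDK classes. The class name `PageFetcher`, the helper names, the UTF-8 assumption, and the `http://www.example.com/` default address are my own placeholders, not something from the question. It returns whatever the server sends, so content generated by scripts will not be in the string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PageFetcher {

    // Reads an entire stream into a String.
    // Assumption: the page is UTF-8 encoded; a real client would honor the
    // charset declared in the Content-Type header.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Fetches the raw response body of the given address. No JavaScript is run.
    static String fetch(String address) throws IOException {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass a real one as the first argument.
        String address = args.length > 0 ? args[0] : "http://www.example.com/";
        System.out.println(fetch(address));
    }
}
```

If the page only looks right after its scripts have run, this approach is not enough and a headless browser library such as HtmlUnit is the more realistic route.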
I'm working with GAE (Java) in my GWT application.
When my users enter a certain URL, I'd like to dynamically create an HTML page on the server side and serve it to the client.
How can I do this? Using HttpServlet?
I'm quite lost here. Do I need an HTML template file on the server side that I complete dynamically and serve to the client?
You should start with the tutorial to learn the basics. You can generate the whole HTML dynamically, but that tends to get awkward. It's better to separate the HTML into a template and fill in the details with the logic implemented in the GAE application.
https://developers.google.com/appengine/docs/java/gettingstarted/
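To illustrate the "template plus fill-in" idea, here is a small sketch in plain Java. The `{{name}}` placeholder syntax, the class `HtmlTemplate`, and its method names are all made up for this example; in practice you would more likely use JSP or a template engine. Values are HTML-escaped before insertion so that user input cannot inject markup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTemplate {

    // Hypothetical placeholder syntax: {{key}}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    // Escapes the characters that are significant in HTML text and attributes.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Replaces each {{key}} with the escaped value for that key in one pass.
    // Keys missing from the map are replaced with an empty string.
    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(escape(value)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "<html><body><h1>Hello, {{name}}!</h1></body></html>";
        Map<String, String> values = new LinkedHashMap<>();
        values.put("name", "Ann & Bob");
        System.out.println(render(template, values));
    }
}
```

In an HttpServlet you would call something like `render(...)` from `doGet`, set the content type to `text/html`, and write the result to `response.getWriter()`.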
You can use a library like this one: https://github.com/alexmuntean/java-url-rewrite . Read the readme to understand more.
You can just take the request and serve anything you want (JSP, JSF, static HTML). You can also write GWT code that acts on the existing HTML (effects, Ajax, and so on): just add ids to the elements, write another entry point for that page, and include the generated JS in your page.
I am planning to write a tutorial and a proof of concept on how to make a GWT website indexable by Google.
I can read the HTML contents via HTTP (for example, http://www.foo.com) using Java (with the URL and BufferedReader classes). However, a couple of the pages contain JavaScript, and my current app cannot process JavaScript.
What's the best way to read HTML content with JavaScript using Java?
I am open to using other languages if it is easier.
Thanks in advance for your help.
UPDATE - Clarification:
A couple of the pages are generated dynamically using JavaScript. I can see the result (in pure HTML after the JavaScript processing) when viewing them in a browser.
On the other hand, when my Java app retrieves the HTML contents, the page reports that JavaScript is not available in my app.
Ideally, I want to be able to get the same result as on the browser using my Java app.
Thanks for everyone's response.
HtmlUnit has good JavaScript support and should process the HTML (almost) the way a web browser does.
http://htmlunit.sourceforge.net/
http://htmlunit.sourceforge.net/javascript.html
Cobra (http://lobobrowser.org/cobra/getting-started.jsp) will fit your needs
For just HTML parsing you can use HTMLParser (org.htmlparser). However, from the way you described your problem, it seems you need a browser, because executing JavaScript is totally different from just parsing. Cheers.
Without a doubt, you need to use a Java HTML parser:
Java Open Source HTML Parsers
Which Html Parser is best?
HTML/XML Parser for Java
HTML PARSER in java [closed]
I'd like to fetch a web page including images, flash animations and other embedded objects. What's a straightforward way of achieving this?
Writing a Web Crawler in the Java Programming Language:
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
Use an open source HTML Parser such as HTMLCleaner - http://java-source.net/open-source/html-parsers/htmlcleaner or CyberNekoHtml - http://java-source.net/open-source/html-parsers/nekohtml.
Once you have used a parser to create a representation of the DOM of the web page, you can then load/download images and other embedded objects that exist in the DOM by performing queries on the DOM and extracting relevant src attributes from the HTML elements.
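The "extract the src attributes and load them" step can be sketched roughly as below. To keep the example free of external libraries, it finds `src` attributes with a regular expression, which is only a stand-in for the DOM query you would run with HTMLCleaner or NekoHTML and will miss edge cases such as unquoted attributes. The class name `ResourceLinks` and the sample HTML are invented for illustration. The part that carries over unchanged is resolving each (possibly relative) `src` value against the page URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResourceLinks {

    // Simplification: matches quoted src="..." or src='...' attributes.
    // With a real parser you would query the DOM for elements that have
    // a src attribute instead of using a regex.
    private static final Pattern SRC = Pattern.compile(
            "(?<![\\w-])src\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the absolute URI of every src attribute found in the HTML.
    static List<URI> resourceUris(String html, URI base) {
        List<URI> result = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        while (m.find()) {
            try {
                // Relative references are resolved against the page address.
                result.add(base.resolve(m.group(2).trim()));
            } catch (IllegalArgumentException e) {
                // Skip values that are not valid URI references.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.foo.com/dir/page.html");
        String html = "<img src=\"a.png\"><script src='/js/app.js'></script>";
        for (URI uri : resourceUris(html, base)) {
            System.out.println(uri);
        }
    }
}
```

Each resulting URI can then be downloaded, for example with `Files.copy(uri.toURL().openStream(), targetPath)`. Stylesheets referenced through `<link href="...">` need the same treatment for the `href` attribute.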
Try Web-Harvest.
The web app I'm working on generates HTML using Velocity templates. The problem is that whitespace and other formatting in the Velocity templates result in butt-ugly HTML (excessive whitespace, misalignment, etc.).
I'm looking for a nice Java-based HTML prettifier (single-jar packaging would be nice) to run over the generated HTML right before we write it to the servlet response, to make the source nicer to look at.
Third-party integrators would like to be able to glance at the HTML and know which templates are causing problems. The first step toward this is having the HTML formatted nicely.
Thanks in advance for any guidance you can provide!
JTidy has a JTidyFilter. Just define it in web.xml and the response HTML will be prettified.
JTidy could be what you're searching for.
I know it's not helping right now, but I think the ideal solution would be for Velocity itself to support better whitespace generation and control :).
If many users requested and voted for such a feature, maybe the Velocity team would include it. Running JTidy or other parsers over the output all the time (e.g. for live requests) consumes quite a few resources, so I'm not sure it's the best approach, especially for dynamic content where caching the cleaned output doesn't bring much.
There are many HTML parsers here: Open Source HTML Parsers in Java