Need a Java-based HTML prettifier to clean up Velocity-generated HTML

The web app I'm working on generates HTML using Velocity templates. The problem is that the whitespace and formatting that keep the templates readable produce ugly HTML (excessive whitespace, misaligned tags, etc.).
I'm looking for a Java-based HTML prettifier (single-jar packaging would be a plus) to run over the generated HTML right before we write it to the servlet response, so the source is nicer to look at.
Third party integrators would like to be able to glance at the HTML and know which templates are causing problems. The first step to this is having the HTML formatted nicely.
Thanks in advance for any guidance you can provide!

JTidy has a JTidyFilter. Just define it in web.xml and the response HTML will be prettified.

JTidy could be what you're searching for.
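For what it's worth, here is a rough sketch of running JTidy programmatically over the generated markup instead of (or in addition to) the filter. The setter names are from the org.w3c.tidy.Tidy class as I remember them, so double-check them against the JTidy version you pull in:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.w3c.tidy.Tidy;

public class HtmlPrettifier {

    // Pretty-prints an HTML string with JTidy before it is written to the response.
    public static String prettify(String uglyHtml) {
        Tidy tidy = new Tidy();
        tidy.setIndentContent(true);    // re-indent nested elements
        tidy.setQuiet(true);            // no parse summary output
        tidy.setShowWarnings(false);    // don't warn on every sloppy template

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(new ByteArrayInputStream(uglyHtml.getBytes(StandardCharsets.UTF_8)), out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}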

I know it's not helping right now, but I think the ideal solution would be for Velocity itself to support better whitespace generation and control :).
If enough users requested and voted for such a feature, maybe the Velocity team would add it. Running JTidy or another parser over the output on every live request consumes quite a few resources, so I'm not sure it's the best approach, especially for dynamic content where caching the cleaned output doesn't buy you much.

There are many HTML parsers here: Open Source HTML Parsers in Java

Related

Parsing HTML from a web page

I have to extract some information from a web page, and reformat it for the user.
Since the web page is somewhat regular, I currently use HttpClient to retrieve the HTML as a string, and I extract substrings at given locations to get the relevant data.
Anyhow I'm wondering if there is a better way, maybe an HTML-aware way. How would you do it?
Cheers
Ideally, you should use a real HTML-parser. I've used Jsoup successfully in the past on Android:
http://jsoup.org/
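If you do switch to a real parser, a minimal jsoup sketch (the URL and selector here are just placeholders) looks something like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupScrape {
    public static void main(String[] args) throws Exception {
        // jsoup fetches the page and tolerates the malformed HTML you'll meet in the wild.
        Document doc = Jsoup.connect("http://example.com/").get();

        // CSS-style selectors replace the substring hunting.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}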
I personally like the Jericho parser: http://jericho.htmlparser.net/docs/index.html
It is easy to use, has plenty of examples on the project's page, and copes well with imperfect HTML (unclosed tags, etc.).
We've used HttpUnit to do this in the past.
jsoup.org is better, but Cobra also has some additional features (it is CSS-aware and JavaScript-aware).

Generate HTML files using XML configuration

My goal is to assemble a static web site that has a lot of repeating code. I could use JSP includes for that purpose, but the site will be modified infrequently, will be under very heavy load, and will use features like gzip, so I don't need the complications.
My idea is to set up a build process with a tool like Ant. That build process will concatenate all the HTML pieces, preprocess the HTML, JS, and CSS with a minifier, and finally apply gzip.
I want an XML configuration that will define the parts that need to go in every html page and their order.
I need advice on ant or any similar tool; how to approach the configuration, any external tools that will help? Any suggestions are much appreciated.
XSLT is perfectly suited to transform XML into another format like HTML.
You can download Apache Xalan to give it a try. Ant has support for XSLT processing.
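If you ever need to drive the transform from Java rather than from Ant's xslt task, the JAXP API that ships with the JDK is enough. A minimal sketch, where the file names are placeholders for your XML configuration, stylesheet, and output page:

import java.io.File;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SiteBuilder {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet once, then apply it to the page definition.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("site.xsl")));
        t.transform(new StreamSource(new File("pages.xml")),
                    new StreamResult(new File("index.html")));
    }
}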
In the Java world, you can take a look at Apache Forrest, which does precisely that kind of thing.
Outside the Java world, there is also webgen, a competent Ruby site builder.
I also vaguely remember there being other alternatives, but I can't find their names again.

Best practices for processing input HTML content at server side in java

I would like to implement a content management system with an RDBMS in Java/J2EE, and would like to know the best practices for handling input HTML content.
Below are a few of the questions I have; I'm sure there are lots of other things to take care of.
Do we need to escape HTML tags and special characters before we save HTML content to the database?
How do we validate/remove invalid special symbols in large input HTML content?
What are the best practices for displaying HTML content back to the browser from the database?
Are there any security risks involved in handling HTML content?
Looking forward to see some great ideas from gurus!
Use a tool like Neko to clean up the HTML into XHTML, then use any XML parser to parse it.
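A rough sketch of the Neko step, assuming the nekohtml jar is on the classpath (verify the class names against the version you use):

import java.io.StringReader;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NekoCleanup {
    public static void main(String[] args) throws Exception {
        String messyHtml = "<p>Unclosed paragraph<ul><li>item one<li>item two";

        // NekoHTML balances the tags and hands back a standard W3C DOM.
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(messyHtml)));
        Document doc = parser.getDocument();

        // Element names come back upper-cased by default.
        System.out.println("list items: " + doc.getElementsByTagName("LI").getLength());
    }
}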
I recently tried out some HTML clean-up libraries, and the best I came across was the Cobra HTML Renderer and Parser, which seems to be faster than the others and also manages to convert dirtier HTML to XHTML. I first went for HTML Tidy, but it ended up complaining about "Unparseable HTML" far too often.
What I'd strongly discourage you from doing is to use a REGEX ;-)
I am not a guru at this, but I think you will have to figure out how to deal with special characters and escape sequences, such as quotes (both double and single), etc.
Maybe you can try replacing those special characters and escape sequences with other characters.
Maybe someone else who is currently dealing with a CMS can help you out. Anyway, cheers!
I would recommend looking at the architecture and design of an open source CMS like Alfresco or Apache Jackrabbit.
These are actual content repositories and will not contain end-to-end integration most likely, but can show you an underlying data model that is a good place to start.
I would also recommend you check out OWASP for information on web application security and vulnerabilities, and in particular security issues relevant to Java developers.

What is the best way to screen scrape poorly formed XHTML pages for a java app

I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath but they don't seem to work for malformed XHTML and REGEX is just a pain.
Is there a better solution. Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings with the text of the links, or ask for all the bold text etc.
Run the XHTML through something like JTidy, which should give you back valid XML.
You may want to look at Watij. I have only used its Ruby cousin, Watir, but with it I was able to load a webpage and request all URLs of the page in exactly the manner you describe.
It was very easy to work with - it literally fires up a web browser and gives you back the information in a convenient form. IE support seemed best, but at least with Watir, Firefox was also supported.
I had some problems with JTidy back in the day. I think it was related to unclosed tags making JTidy fail; I don't know if that's fixed now. I ended up using something that was a wrapper around TagSoup, although I don't remember the exact project name. There's also HtmlCleaner.
I've used http://htmlparser.sourceforge.net/. It can parse poorly formed html and allows data extraction quite easily.
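For the "give me all the links" case, it's something along these lines (class and method names from memory, so check them against the project's javadoc):

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

public class LinkScraper {
    public static void main(String[] args) throws Exception {
        // The parser reads straight from a URL and copes with poorly formed markup.
        Parser parser = new Parser("http://example.com/");

        // Collect every anchor tag on the page.
        NodeList links = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < links.size(); i++) {
            LinkTag link = (LinkTag) links.elementAt(i);
            System.out.println(link.getLink() + " : " + link.getLinkText());
        }
    }
}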

How do you grab text from a webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.
What technique/library would you advise?
Updates/Remarks
Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
It should be really simple.
You can use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#getInputStream() to parse the content of HTML pages hosted on the Internet.
You could look at how HttpUnit does it. They use a couple of decent HTML parsers; one is NekoHTML.
As far as fetching the data, you can use what's built into the JDK (HttpURLConnection), or use Apache's
http://hc.apache.org/httpclient-3.x/
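The JDK-only fetch is just a few lines. A sketch (the User-Agent is made up, and in real code you'd read the charset from the Content-Type header instead of assuming UTF-8):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageFetcher {
    // Fetches a page body using nothing but the JDK.
    public static String fetch(String address) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", "my-aggregator/0.1");

        StringBuilder body = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        return body.toString();
    }
}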
If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):
<table>
{
for $d in //td[contains(a/small/text(), "New York, NY")]
for $row in $d/parent::tr/parent::table/tr
where contains($d/a/small/text()[1], "New York")
return <tr><td>{data($row/td[1])}</td>
<td>{data($row/td[2])}</td>
<td>{$row/td[3]//img}</td> </tr>
}
</table>
In short, you can either parse the whole page and pick out the things you need (for speed I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the HTML... you could also convert it all into a DOM, but that's going to be expensive, especially if you're aiming for decent throughput.
You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (since each site's format will differ), so you can parse the HTML source and extract the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
If you want to do it the old-fashioned way, you need to connect with a socket to the web server's port and then send the following data:
GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>
Then use Socket#getInputStream, read the data using a BufferedReader, and parse the data using whatever you like.
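Spelled out in Java, that hand-rolled request looks roughly like this (the host and path are the ones from the example above; it ignores redirects, chunked encoding, and HTTPS):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawHttpGet {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("site.com", 80);
        try {
            // Minimal HTTP/1.0 request, exactly as described above.
            PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
            out.print("GET /file.html HTTP/1.0\r\n");
            out.print("Host: site.com\r\n");
            out.print("\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // headers first, then the HTML body
            }
        } finally {
            socket.close();
        }
    }
}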
You can use NekoHTML to parse your HTML document. You will get a DOM document, and you can use XPath to retrieve the data you need.
If your "web sources" are regular websites using HTML (as opposed to structured XML format like RSS) I would suggest to take a look at HTMLUnit.
This library, while targeted for testing, is a really general purpose "Java browser". It is built on a Apache httpclient, Nekohtml parser and Rhino for Javascript support. It provides a really nice API to the web page and allows to traverse website easily.
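A small HtmlUnit sketch of that traversal (method names from memory, so verify them against the HtmlUnit release you use):

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitScrape {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();

        // getPage returns a typed, traversable view of the document.
        HtmlPage page = client.getPage("http://stackoverflow.com/");
        System.out.println(page.getTitleText());

        // Pull out every link on the page.
        List<HtmlAnchor> anchors = page.getAnchors();
        for (HtmlAnchor anchor : anchors) {
            System.out.println(anchor.getHrefAttribute() + " : " + anchor.asText());
        }
    }
}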
Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
If you absolutely MUST scrape content, look for microformats in the markup, most blogs (especially WordPress based blogs) have this by default. There are also libraries and parsers available for locating and extracting microformats from webpages.
Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.
Check this out http://www.alchemyapi.com/api/demo.html
They return pretty good results and have an SDK for most platforms. It's not only text extraction; they also do keyword analysis and the like.
