How to validate HTML from Java? - java

What is a fast and simple way to validate HTML from Java? I’m looking for an open-source/PD class (or set of classes) that describes the various properties of the 100-odd HTML tags, such as:
Is the tag optional? Empty? Is it legal to omit its closing tag?
Which other tags can this tag contain (if any)?
Which attributes are legal for this tag, and what are their types? (not required, but nice to have)
Thanks!
EDIT
I'm looking to do a tag-by-tag analysis of an HTML document, so I'm less interested in whether the document as a whole is valid than in what the specific requirements are for each type of tag.
I could encode the rules based on the W3C spec, but wanted to see which ready-made solutions are available first.

If you want to verify certain tags follow certain specifications, there seems to be no end of Java based HTML parsers:
Open Source HTML Parsers in Java
In other words, you could parse your HTML, and then inspect the resulting document for the tags you were looking for and determine if they meet the specifications you require. If they don't, you could then just throw an error.
I don't think you'll find a HTML analysis tool which was written with exactly your requirements in mind, mostly because those requirements haven't been voiced and are probably a bit nebulous.
If the parser doesn't do what you want out of the box, at least this list is open source, so you can hack the parser as long as you publish your changes.

Check JTidy (http://jtidy.sourceforge.net/) and VietSpider HTMLParser (http://sourceforge.net/projects/binhgiang/). Both are Java HTML parsers with some syntax-checking capabilities. Some Eclipse-based HTML editor plugins use JTidy (or a port of Tidy) for syntax checking. Or, as David said, submit the page to w3c.org.
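For the per-tag properties the question lists (empty, optional end tag, allowed children, legal attributes), one ready-made source that needs no extra dependency is the DTD bundled with the JDK's Swing HTML parser. The sketch below is only an illustration under some assumptions: it relies on the java.desktop module being present, the bundled DTD describes HTML 3.2 rather than HTML 4, and the class name TagProperties is made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;
import javax.swing.text.html.parser.AttributeList;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Element;
import javax.swing.text.html.parser.ParserDelegator;

public class TagProperties {

    // Loads the HTML 3.2 DTD that ships with the JDK's Swing HTML parser.
    static DTD loadDtd() {
        try {
            new ParserDelegator();           // constructing it populates the default "html32" DTD
            return DTD.getDTD("html32");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Describes one tag, or returns null if the DTD does not define it.
    static String describe(String tagName) {
        Element e = loadDtd().elementHash.get(tagName.toLowerCase());
        if (e == null) {
            return null;
        }
        // Collect the names of the attributes declared for this tag.
        List<String> attrs = new ArrayList<>();
        for (AttributeList a = e.getAttributes(); a != null; a = a.getNext()) {
            attrs.add(a.getName());
        }
        // Collect the names of the elements mentioned in the content model.
        List<String> children = new ArrayList<>();
        if (e.getContent() != null) {
            Vector<Element> allowed = new Vector<>();
            e.getContent().getElements(allowed);
            for (Element child : allowed) {
                children.add(child.getName());
            }
        }
        return tagName + ": empty=" + e.isEmpty()
                + ", omitStart=" + e.omitStart()
                + ", omitEnd=" + e.omitEnd()
                + ", children=" + children
                + ", attributes=" + attrs;
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"br", "p", "font", "table"}) {
            System.out.println(describe(tag));
        }
    }
}
```

If HTML 4 rules are required, the same lookups would have to be backed by a different DTD or by hand-encoded rules, as the question already anticipates.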

Related

Extracting webpage information based on a template in Java

Right now I use Jsoup to periodically extract certain information (not all the text) from some third-party webpages. This works fine until the HTML of a webpage changes. Each change means changing the existing Java code, which is tedious because these webpages change very frequently, and it requires a programmer to fix the Java code. Here is an example of the HTML I am interested in on a webpage:
<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>
Now here is what I want to do, I want to save this webpage (an HTML file) locally and create a template out of it, like:
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>
Along with the actual URLs of the webpages these HTML templates will be the input to the Java program which will find out the location of these predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.
This way I wouldn't have to modify the Java program every time a webpage changes, I will just save the webpage's HTML and replace the data with these keywords and rest will be taken care by the program. For example in future the actual HTML code may look like this:
<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>
and the corresponding template will look like this:
<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>
Also, this kind of template can be created by a non-programmer: anyone who can edit a file.
Now the question is, how can I achieve this in Java and is there any existing and better approach to this problem?
Note: While googling I found some research papers, but most of them require some prior learning data and accuracy is also a matter of concern.
The approach you gave is pretty much the same as Gilbert's, except for the regex part. I don't want to step into the ugly regex world; I am planning to use the template approach for many other areas apart from movie info, e.g. prices and product spec extraction.
1. The template you describe is not actually a "template" in the normal sense of the word: a set of static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template: a parsing pattern that is slurped up and discarded, leaving the desired parameters to be found.
2. Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its essential features, making the minimum of assumptions, i.e. you want to commit to literally matching key text such as "Rating:" and treat interleaving markup such as "<b/>" in a much more flexible manner, ignoring it and allowing it to change without breaking.
3. When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions, i.e. the template approach IS the parsing approach using a regular expression; they are one and the same. The question is: what form should the regular expression take?
3A. If you use Java hand-coding to do the parsing, then the obvious answer is that the regular expression format should just be the java.util.regex format. Anything else is a development burden, is "non-standard" and will be hard to maintain.
3B. If you want to use an HTML-aware parser, then jsoup is a good solution. The problem is that you need more text/regular-expression handling and flexibility than jsoup seems to provide. It seems too locked into specific HTML tags and structures, and so breaks when pages change.
3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR: a form of Backus-Naur-inspired grammar is used to control the parsing, and generator code is inserted to process parsed data. Here, the parsing grammar expressions can be very powerful indeed, with complex rules for how text is ordered on the page and how text fields and values relate to each other. The power is beyond your requirements because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip, such as markup tags. And wrestling with ANTLR for the first time involves an educational investment before you get a productivity payback.
3D. Is there a java tool that just uses a simple template type approach to give a simple answer? Well a google search doesn't give too much hope https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a. I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view - it just reflects the problem space.
My vote is for (3A) as the simplest, most powerful and flexible solution to your needs.
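To make the "template IS a regular expression" point concrete, here is a rough sketch of turning the asker's {PLACEHOLDER} template into a java.util.regex pattern. It is an illustration only: the class name TemplateExtractor is invented, placeholders are assumed to be upper-case names in braces, each value is assumed to contain no tags, and the markup around the placeholders is still matched literally (only whitespace is relaxed), which is stricter than point 2 recommends.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateExtractor {
    // Assumed placeholder syntax: {UPPER_CASE_NAME}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([A-Z_]+)\\}");

    private final List<String> names = new ArrayList<>();
    private final Pattern pattern;

    public TemplateExtractor(String template) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            regex.append(literal(template.substring(last, m.start())));
            // Numbered groups are used because Java group names cannot contain '_'.
            regex.append("([^<]*)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(literal(template.substring(last)));
        pattern = Pattern.compile(regex.toString());
    }

    // Quotes literal template text, letting any whitespace run match any (or no) whitespace.
    private static String literal(String text) {
        StringBuilder sb = new StringBuilder();
        for (String part : text.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append("\\s*");
            }
            sb.append(Pattern.quote(part));
        }
        return sb.toString();
    }

    // Returns placeholder name -> extracted value, or an empty map if the template does not match.
    public Map<String, String> extract(String page) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = pattern.matcher(page);
        if (m.find()) {
            for (int i = 0; i < names.size(); i++) {
                values.put(names.get(i), m.group(i + 1).trim());
            }
        }
        return values;
    }
}
```

Given the template and page from the question, extract would be expected to return MOVIE_RATING and MOVIE_DIRECTOR entries; an empty map signals that the page no longer matches and the template needs updating.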
Not really a template-based approach here, but jsoup can still be a workable solution if you just externalize your Selector queries to a configuration file.
Your non-programmer doesn't even have to see HTML, just update the selectors in the configuration file. Something like SelectorGadget will make it easier to pick out what selector to actually use.
How can I achieve this in Java and is there any existing and better approach to this problem?
The template approach is a good approach. You gave all of the reasons why in your question.
Your templates would consist of just the HTML you want to process, and nothing else. Here's my example based on your example.
<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
Basically, you would use Jsoup to process your templates. Then, as you use Jsoup to process the web pages, you check all of your processed templates to see if there's a match.
On a template match, you find the keywords in the processed template, then you find the corresponding values in the processed web page.
Yes, this would be a lot of coding, and more difficult than my description indicates. Your Java programmer will have to break this description down into simpler and simpler tasks until she or he can code the tasks.
If the web page changes frequently, then you'll probably want to confine your search for the fields like MOVIE_RATING to the smallest possible part of the page, and ignore everything else. There are two possibilities: you could either use a regular expression for each field, or you could use some kind of CSS selector. I think either would work and either "template" can consist of a simple list of search expressions, regex or css, that you would apply. Just roll through the list and extract what you can, and fail if some particular field isn't found because the page changed.
For example, the regex could look like this:
Score:.*?([0-9](?:\.[0-9])?/[0-9])
(I haven't tested this.)
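The "simple list of search expressions" idea can be sketched with java.util.regex alone. Everything named here (FieldRules, afterLabel, the field names) is made up for illustration, and the pattern assumes each value is plain text that follows its label, possibly after a few tags.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldRules {

    // Builds a rule: the literal label, then any tags/whitespace, then text up to the next tag.
    static Pattern afterLabel(String label) {
        return Pattern.compile(Pattern.quote(label) + "(?:\\s*<[^>]+>)*\\s*([^<]+)");
    }

    // Applies every rule in order; fails loudly if a field is missing (the page probably changed).
    static Map<String, String> extract(Map<String, Pattern> rules, String page) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(page);
            if (!m.find()) {
                throw new IllegalStateException("Field not found: " + rule.getKey());
            }
            out.put(rule.getKey(), m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("MOVIE_RATING", afterLabel("Score:"));
        rules.put("MOVIE_DIRECTOR", afterLabel("Director:"));
        String page = "<div><p><strong>Score:</strong>2.5/5</p>"
                + "<p><strong>Director:</strong> Bryan Singer</p></div>";
        System.out.println(extract(rules, page));
    }
}
```

Because each rule only depends on its label, a change from <p><strong> to <div><b> would not break it, but a renamed label (for example "Rating:") would still need a new rule.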
Or you can try a different approach, using what I would call 'rules' instead of templates: for each piece of information that you need from the page, you define jQuery expression(s) that extract the text. Often, when the change to the page is small, the same well-written jQuery expressions will still give the same results.
Then you can use Jerry (jQuery in Java), with almost the same expressions, to fetch the text you are looking for. So it's not only about selectors; you also have other jQuery methods for walking/filtering the DOM tree.
For example, a rule for some Director text would be (in sort of pseudo-Java-Jerry code):
$.find("div#movie").find("div:nth-child(2)")....text();
There could be more (and more complex) expressions in the rule, spread across several lines, that for example iterate some nodes etc.
If you are an OO person, each rule may be defined in its own implementation. If you are a Groovy person, you can even rewrite rules when needed, without recompiling your project, while still being in Java. Etc.
As you see, the core idea here is to define rules for how to find your text, and not to match against patterns, as that may be fragile to minor changes (imagine if just a space has been added between two divs :). In this example of mine, I've used jQuery-like syntax (actually, it's Jerry-like syntax, since we are in Java) to define rules. This is only because jQuery is popular and simple, and known by your web developer too; in the end you can define your own syntax (depending on the parsing tool you are using): for example, you may parse the HTML into a DOM tree and then write rules using your own helper methods to traverse it to the place of interest. Jerry also gives you access to the underlying DOM tree.
Hope this helps.
I used the following approach to do something similar in a personal project of mine that generates an RSS feed out of the leading real estate website in Spain.
Using this tool I found the rented place I'm currently living in ;-)
Get the HTML code from the page
Transform the HTML into XHTML. I used a library for this step; there may be better options available today
Use XPath to navigate the XHTML to the information you're interested in
Of course every time they change the original page you will have to change the XPath expression. The other approach I can think of -semantic analysis of the original HTML source- is far, far beyond my humble skills ;-)
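The last step above (XPath over XHTML) can be sketched with the JDK's own JAXP classes. This is only an illustration: it assumes the HTML has already been converted to well-formed XHTML without a namespace or DOCTYPE, and the class name XPathExtract and the sample expression are made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtract {

    // Parses well-formed XHTML and returns the string value of the first node the expression selects.
    static String evaluate(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div><p><strong>Score:</strong>2.5/5</p>"
                + "<p><strong>Director:</strong> Bryan Singer</p></div>";
        // Select the text of the <p> whose <strong> child is the label we care about.
        System.out.println(evaluate(xhtml, "//p[strong='Score:']/text()"));
        System.out.println(evaluate(xhtml, "//p[strong='Director:']/text()"));
    }
}
```

Keeping the XPath expressions in a configuration file rather than in code would limit the work needed when the page changes, though the expressions themselves still have to be rewritten.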

Java - Extract html information from string

All of the guides out there tell me on how to remove the HTML tags from the text to extract the text between them. What I am after is the extraction of the data that is within the HTML tags.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware trying to parse it as "standard" XML as XML-parsing is strict by nature and will fail if the page does not conform to XML markup specs (which few HTML pages do).
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like jerichoHTML, which enables you to search for HTML tags as well as their attributes, or you can build a DOM on your own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
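For the exact string in the question, the JAXP route can be sketched as below. It works here only because that fragment happens to be well-formed XML; as noted above, real-world HTML usually is not, in which case a lenient parser such as jsoup or TagSoup would be needed instead. The class name FontSizeReader is invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class FontSizeReader {

    // Parses a single well-formed element and returns the value of the named attribute.
    static String attributeOf(String fragment, String attributeName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fragment)));
            Element root = doc.getDocumentElement();
            // XML attribute names are case-sensitive, so "SIZE" and "size" differ here.
            return root.getAttribute(attributeName);
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<FONT SIZE=\"5\">Hello World</FONT>";
        System.out.println(attributeOf(html, "SIZE"));
    }
}
```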

Does JSoup support getComputedStyle or the equivalent?

Does the document tree returned by JSoup when it parses an HTML document support getComputedStyle on the individual document elements?
What I would like to do is inline the CSS in an HTML fragment so that I can insert the fragment into a larger HTML document, with all of its formatting preserved but without messing with any other formatting in the document.
The research I've done would seem to suggest that I can accomplish this by iterating through all of the elements in the document, calling getComputedStyle on each one, and assigning the result to be the style for the element.
Yes, I realize that this may very well bloat the resulting HTML by putting a bunch of redundant / unnecessary style information on the individual elements, but I'm willing to pay the price of larger HTML, and as far as I can tell, embedding the style inline like this is the only way to preserve the formatting exactly while also making the HTML fragments fully portable. (If you've got another suggestion for accomplishing that purpose, I'm all ears. :-)
Getting back on topic... If I can't use getComputedStyle (or the equivalent) with JSoup, is there another Java HTML+CSS parser that supports getComputedStyle or the equivalent?
Thanks.
That's not possible. Jsoup is just a HTML parser with CSS selector support, it is not a HTML renderer.
You may want to take a look at Lobobrowser, which is a Java-based HTML renderer supporting JavaScript and the like. I do not know, nor can I guarantee, that getComputedStyle() is supported by Lobo.
No other tools come to mind. HtmlUnit comes close, as it can also access/invoke JavaScript, but some Google results suggest that getComputedStyle() doesn't work on HtmlUnit either. After all, it is not a real HTML renderer either.

Clean up user input from unwanted HTML in a Spring web application

I need to tidy user input in a web application so that I remove certain HTML tags and encode < to &lt; etc.
I've made a couple of simple util methods that strips the HTML, but I find myself adding these EVERYWHERE in my application.
Is there a smarter way to tidy the user input? E.g. in the binding process, or as a filter somehow?
I've seen JTidy that can act as a servlet filter, but I'm not sure that this is what I want because I need to clean user input, not output of my JSP's.
From JTidy's homepage:
It can be used as a tool for cleaning up malformed and faulty HTML generated by your dynamic web application.
It can Validate HTML without changing the output and generate warnings for each page so you could identify JSP or Servlet that need to be fixed.
It can save you hours of time. The more HTML you write in JSP or Servlets, the more time you will save. Don't waste time manually looking for problems, figuring out why your HTML doesn't display like it should.
In addition to JTidy validation you could submit dynamically generated pages to online HTML validators for example W3C Markup Validation Service, WAVE Accessibility Tool or WDG HTML Validator even if you are behind the firewall.
I find myself adding these EVERYWHERE in my application.
Really? It's unusual to have many user inputs that accept HTML. Most inputs should be plain text, so that when the user types < they literally get a less-than sign, not a (potentially-tidied/filtered-out) tag. This requires HTML-encoding at the output stage. Typically you'd get that from the <c:out> tag.
(Old-school JSP before JSTL, lamentably, provided no HTML-encoder, so if for some reason that's what you're working with you would have to provide your own HTML-encoding method built out of string replacements, or use one of the many third-party tools that contain one.)
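A hand-rolled HTML-encoding method of the kind just mentioned might look roughly like this. It is a minimal sketch for text placed in element content or quoted attribute values, the class name HtmlEscaper is made up, and where JSTL is available <c:out> remains the simpler choice.

```java
public class HtmlEscaper {

    // Replaces the characters that are significant in HTML text and quoted attribute values.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<b>Tom & \"Jerry\"</b>"));
    }
}
```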
For the usually-few-if-any ‘rich text’ fields that are deliberately meant to accept user-supplied HTML, you should be filtering them strongly to prevent JavaScript injection from the markup. This is a difficult job! A “couple of simple util methods that strips the HTML” are highly unlikely to do it correctly and securely.
The proper way to do this is to parse the input HTML into a DOM; walk over it checking that only known-safe element and attribute names are used; then serialise it back to well-formed [X]HTML. There are a number of tools that can do this and yes, jTidy is one. You would use the method Tidy.parseDOM on the input field value, remove unwanted items from the resulting DOM with removeChild and removeAttribute, then reserialise using pprint.
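The whitelist walk described above can be sketched against the standard org.w3c.dom API. Note the assumptions: instead of jTidy's Tidy.parseDOM, this stand-in uses the JDK's own XML parser, so it only accepts well-formed input; the class name WhitelistSanitizer and the two whitelists are hypothetical; and disallowed elements are dropped together with their content.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WhitelistSanitizer {
    // Hypothetical whitelists; a real application would choose its own.
    private static final Set<String> ALLOWED_TAGS =
            new HashSet<>(Arrays.asList("div", "p", "b", "i", "em", "strong"));
    private static final Set<String> ALLOWED_ATTRS =
            new HashSet<>(Arrays.asList("title"));

    // Wraps the fragment in a <div>, strips everything not whitelisted, and reserialises it.
    static String sanitize(String wellFormedFragment) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // User input should never be allowed to declare a DOCTYPE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(
                    new InputSource(new StringReader("<div>" + wellFormedFragment + "</div>")));
            clean(doc.getDocumentElement());
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed", e);
        }
    }

    // Removes non-whitelisted attributes and child nodes, recursing into allowed elements.
    static void clean(Element element) {
        NamedNodeMap attrs = element.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            if (!ALLOWED_ATTRS.contains(name.toLowerCase())) {
                element.removeAttribute(name);
            }
        }
        // Copy the child list first, because removing nodes changes the live NodeList.
        List<Node> children = new ArrayList<>();
        NodeList kids = element.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            children.add(kids.item(i));
        }
        for (Node child : children) {
            if (child instanceof Element) {
                if (ALLOWED_TAGS.contains(child.getNodeName().toLowerCase())) {
                    clean((Element) child);
                } else {
                    element.removeChild(child);
                }
            } else if (child.getNodeType() != Node.TEXT_NODE) {
                element.removeChild(child);   // comments, processing instructions, etc.
            }
        }
    }
}
```

With jTidy in front, the same clean step would run on the Document returned by Tidy.parseDOM, which is what lets it cope with the malformed markup users actually type.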
A good alternative to HTML-based rich text is to give the user a simpler form of textual markup that you can then convert to known-safe HTML tags. Like this SO text box I'm typing into now.
There's the HandlerInterceptor interface in Spring MVC, which may be used to do some common work on every request. Regardless of the tool you are using for tidying, you can use it to do what you need in one place. See the Spring manual for how to use it. Just put the tidying routine into the preHandle method and walk through the data in HttpServletRequest to update it.

Is there a validating HTML parser implemented in Java?

I need to parse HTML 4 in Java.
Ideally I'd like an implementation that is SAX compatible.
I'm aware that there are numerous HTML parsers for Java; however, they all seem to perform 'tidying'. In other words, they will correct badly formed HTML. I don't want this.
My requirements are:
No tidying.
If the input document is invalid HTML parsing should fail.
The document should be validatable against the HTML DTDs.
The parser can produce SAX2 events.
Is there a library that meets these requirements?
You can find a collection of HTML parsers here: HTML Parsers. I don't remember exactly, but I think TagSoup parses the file without applying corrections...
I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid HTML parsing should fail.') in that it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.
Try typing invalid html into this Jericho formatting demo, and note the 'Parser Log' at the bottom of the page:
http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp
So yes, this is doing tag tidying, but it is at least telling you about it - you can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your source, and then proceeding depending on what errors are logged out. This is a small example:
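// myListeningLogger is your own net.htmlparser.jericho.Logger implementation
Source source = new Source("<a>I forgot to close my link!");
source.setLogger(myListeningLogger);
// NullWriter stands for any Writer that simply discards what is written to it
source.getSourceFormatter().writeTo(new NullWriter());
// myListeningLogger has now had all the HTML flaws written to it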
In the example above, your logger's info() method is called with the string: 'StartTag at (r1,c1,p0) missing required end tag', which is relatively parseable, and you can always decide to just reject any HTML that logs any message worse than debug - in fact Jericho logs almost all errors as 'info' level, with a couple at 'warn' level (you might be tempted to create a small fork with the severities adjusted to correspond to what you care about).
Jericho is available on Maven Central, which is always a good sign:
http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html
Good luck!
You may wish to check http://lobobrowser.org/cobra.jsp. They have a pure Java web browser (Lobo) implemented. They have the parser component (Cobra) pulled out separately for use. I honestly am not sure if it will do what you require with the "no tidying" requirement, but it may be worth a look. I ran across it when exploring the wild for a pure Java web browser.
You can try to subclass javax.swing.text.html.parser.Parser and implement the handleXXX() methods. It doesn't seem to try to fix the HTML. See the API documentation for more.
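A closely related route, shown here instead of subclassing Parser directly, is to drive the same JDK parser through ParserDelegator and record what it reports via HTMLEditorKit.ParserCallback. This is only a sketch: the class name SwingParseLog is made up, the parser targets HTML 3.2, it still inserts implied tags such as html and body, and whether its error reports are strict enough to count as validation is something to check against your own documents.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingParseLog {
    final List<String> tags = new ArrayList<>();     // start and simple tags, in document order
    final List<String> errors = new ArrayList<>();   // whatever the parser reports via handleError

    // Parses the HTML with the JDK's Swing parser and records tags and reported errors.
    static SwingParseLog parse(String html) {
        final SwingParseLog log = new SwingParseLog();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                log.tags.add(tag.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                log.tags.add(tag.toString());
            }

            @Override
            public void handleError(String errorMsg, int pos) {
                log.errors.add(pos + ": " + errorMsg);
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        SwingParseLog log = parse("<p>Hello <b>world</b><a>I forgot to close my link!");
        System.out.println("tags:   " + log.tags);
        System.out.println("errors: " + log.errors);
        // A caller wanting "fail on invalid input" could reject the document if errors is non-empty.
    }
}
```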
