Any good Java HTML parsers?

I was using Cobra until now because of how easy it was to use, but unfortunately it had problems with a few test cases. Does anyone suggest a tried-and-tested library?
I've tried Cobra's built-in parser and HTMLCleaner without any luck.

TagSoup is really great when dealing with crappy HTML/XHTML.
Jericho (and NekoHTML) are also good at parsing invalid HTML.
TagSoup and Jericho: tried-and-tested. NekoHTML: feedback from a trustworthy source.
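
For what it's worth, TagSoup plugs straight into the standard SAX API as an XMLReader, so existing SAX code barely changes. A minimal sketch (the handler and input are illustrative; org.ccil.cowan.tagsoup.Parser is TagSoup's entry point):

    import java.io.StringReader;
    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class TagSoupDemo {
        public static void main(String[] args) throws Exception {
            // TagSoup's Parser is a SAX2 XMLReader that repairs broken markup
            XMLReader reader = new Parser();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    System.out.println("open: " + local);
                }
            });
            // Deliberately crappy input: unclosed tags get fixed on the fly
            reader.parse(new InputSource(new StringReader("<p>hello<br><b>world")));
        }
    }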

Mozilla HTML Parser looks rather interesting. By design, it's supposed to be as good as the Gecko engine itself, which is likely to cover your needs.

Take a look at Saxon (no, I'm not involved in any way with the product, just a satisfied user).

[Answering the title; the overall question and comments are not consistent]
JTidy (http://jtidy.sourceforge.net/) is a port of Dave Raggett's HTMLTidy. It's very useful, though I think development may have slowed or ceased.
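
If it helps, JTidy's core API is small: a Tidy instance that cleans markup and can hand back a standard DOM. A minimal sketch (hedged: the configuration calls shown are the commonly documented ones):

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import org.w3c.dom.Document;
    import org.w3c.tidy.Tidy;

    public class JTidyDemo {
        public static void main(String[] args) {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);            // suppress the summary output
            tidy.setShowWarnings(false);
            // parseDOM repairs the markup and returns an org.w3c.dom.Document
            Document doc = tidy.parseDOM(
                    new ByteArrayInputStream("<p>unclosed".getBytes(StandardCharsets.UTF_8)),
                    null);  // null: don't write the cleaned HTML anywhere
            System.out.println(doc.getDocumentElement().getTagName());
        }
    }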

I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)

Related

Efficient Parser for large XMLs

I have very large XML files to process. I want to convert them to readable PDFs with colors, borders, images, tables, and fonts. I don't have a lot of resources on my machine, so I need my application to be very efficient in its use of memory and processor.
I did some modest research to decide on a technology, but I could not determine the best programming language and API for my requirements. I believe DOM is not an option because it consumes a lot of memory, but would Java with a SAX parser fulfill my requirements?
Some people also recommended Python for XML parsing. Is it that good?
I would appreciate your kind advice.
SAX is a very good parser, but it is dated. Oracle has more recently offered a newer parser, StAX, for parsing XML files efficiently:
http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
The linked page also shows a comparison of the parsers, along with their memory utilization and features.
Thanks,
Pavan
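
To illustrate the streaming style: StAX is a pull parser, so your code asks for the next event and only a cursor is kept in memory. A minimal sketch using the standard javax.xml.stream API (the file name is a placeholder):

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxDemo {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            try (FileInputStream in = new FileInputStream("large.xml")) {  // placeholder file
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                while (reader.hasNext()) {
                    // Pull the next event; the document is never fully loaded
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        System.out.println(reader.getLocalName());
                    }
                }
                reader.close();
            }
        }
    }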
Yes, I think SAX will work for you. DOM is not good for large XML files, as it keeps the whole XML file in memory. You can see a comparison I wrote on my blog here.
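For comparison, SAX is push-based: the parser walks the document and calls your handler, so memory use stays flat. A minimal skeleton with the standard JAXP API (the file and element names are placeholders):

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxDemo {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // SAX streams events to the handler; the tree is never built
            parser.parse(new File("large.xml"), new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    if ("record".equals(qName)) {       // placeholder element name
                        System.out.println("found a record");
                    }
                }
            });
        }
    }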
Not sure if you're interested in using Perl, but if you're open to it, the following are all good options: LibXML, LibXSLT, and XML-Twig, which is good for files too large to fit in memory (so is LibXML::Reader). Of course there is also SAX, but it can be slow. Most people recommend the first two options. Finally, CPAN is an amazing resource with a very active community.
If you want the best of DOM without its memory overhead, vtd-xml is the best bet; here is the proof:
http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf

Is there a well-designed, maintained RSS-parsing library for Java?

I know this question has been asked before, but that was several years ago, and of the two answers, Rome and Abdera, the first no longer seems to be maintained (there aren't even any download links on the website, nor can I find documentation). The latter also appears rather complicated, and neither appears up to contemporary standards of Java library design.
Are there any new alternatives out there that are well designed, and well maintained?
Sorry, I do not know of any library, but, that said, seeing as RSS is an XML format, you should be able to roll your own using SAX/JAXB/DOM. Which one to use depends on whether you want ease of integration with Java (JAXB) or speed (SAX). There is a middle ground in DOM.
RSS is not a complicated format, so I think you could just develop the features you need as you come across them (see the sketch below); that will be faster (and the skills you learn more transferable) than exhaustively searching for a library if one cannot be found easily.
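For instance, pulling the item titles out of an RSS 2.0 feed needs nothing beyond the JDK. A minimal DOM sketch (the feed URL is a placeholder):

    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class RssTitles {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; RSS 2.0 wraps each entry in an <item> element
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new URL("http://example.com/feed.xml").openStream());
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // RSS 2.0 items normally carry a <title> child
                System.out.println(item.getElementsByTagName("title").item(0).getTextContent());
            }
        }
    }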
Hope this helps.
I did find this class RSSDigester. It might help; I don't really have the time to investigate it right now, sorry.
RSS reading hasn't really needed changing for some time. ROME really is quite nice, and as for fetching it, you can get it from http://download.java.net/maven/2/rome/.
I eventually found HorroRSS, which is exactly what I was hoping for. It's simple, easy to use, and appears robust.

Use jsoup or gquery for plain XML

I was recently wondering about a good library for XML manipulation in Java: A nice Java XML DOM utility
Before re-inventing the wheel, porting jQuery to Java in jOOX, I checked out these libraries:
http://jsoup.org
http://code.google.com/p/gwtquery
But at closer inspection, I can see:
jsoup does not operate on a standard org.w3c.dom document structure. They rolled their own implementation. I checked out the code and I doubt that it is as efficient and tuned as Xerces, for instance. For my use cases, performance is important.
jsoup seems tightly coupled with HTML. I only want to operate on XML: no HTML structure, no CSS.
gwtquery is coupled with GWT. I'm not sure how tightly.
Has anyone made any experience with these libraries when using it only for server-side XML, not for HTML?
I'm interested in
Performance benchmarks (maybe comparing it with standard DOM / XPath)
Compatibility experience (easy to import/export to standard DOM?)
Without an answer after one month, I think that my own library will resolve my problems best:
http://www.jooq.org/products/jOOX
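
For anyone who wants to run the comparison the question asks about, this is the stock JAXP DOM/XPath baseline to measure against (the file name and XPath expression are placeholders):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class DomXPathBaseline {
        public static void main(String[] args) throws Exception {
            // Parse into a standard org.w3c.dom tree (placeholder file)
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File("books.xml"));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Placeholder expression; NODESET returns all matching nodes
            NodeList titles = (NodeList) xpath.evaluate("//book/title", doc, XPathConstants.NODESET);
            for (int i = 0; i < titles.getLength(); i++) {
                System.out.println(titles.item(i).getTextContent());
            }
        }
    }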

Are there any tools to isolate the content of a webpage?

I'm working on a school project in which we would like to analyze the content of webpages. We don't, however, want to deal with things like nav bars and comments. If we were looking at a specific website, we could make a parser to filter that sort of extraneous stuff out specifically for that site, but we are hoping to work with arbitrary sites that we may never have encountered before.
I feel like it's a bit much to hope for, so I won't be surprised if nothing like this exists already, but does anyone know of a tool that can do that sort of content isolation on arbitrary websites? I've had a bit of luck diffing pages with others from the same site, but it's imperfect and leaves comments and such.
I am working in Java, but would welcome anything open source in any language that I can use for ideas.
I'm a little late to this one (especially for a school project), but if anyone finds this at some future point, the following may be helpful.
I stumbled across a Java library to do exactly this. Performance, in my simple tests, is similar to Readability.
http://code.google.com/p/boilerpipe/
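Usage is essentially a one-liner. A minimal sketch (hedged: ArticleExtractor.INSTANCE.getText is the entry point commonly shown in boilerpipe's examples; the URL is a placeholder):

    import java.net.URL;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    public class BoilerpipeDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; the extractor drops nav bars, comments, footers, etc.
            URL url = new URL("http://example.com/some-article.html");
            String mainText = ArticleExtractor.INSTANCE.getText(url);
            System.out.println(mainText);
        }
    }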
You could try an unofficial API of arc90's Readability.
Basically what Readability does is extract content on a webpage and presents it to you as a nicely formatted article. Nav bars, comments, and all the other stuff that surrounds content on a webpage is gone.
I'm also a bit late to this conversation, but...
The Java Boilerpipe extractors are probably what you want (ArticleSentencesExtractor, probably), although there is at least one port of the arc90 Readability to Java on GitHub.
If you want to build a poor man's boilerpipe, you might try diffing two pages from the same site (assuming they use the same template, you will likely get an interesting result).
The main difference between boilerpipe, Readability, and a diff-based hack is that boilerpipe will strip out all HTML but preserve some structure.
I doubt that anything exists that would do what you want. Without some sort of semantic markup it is next to impossible to distinguish "real" content from the other stuff. This is a task that requires real intelligence.
There are of course good tools for parsing HTML of varying degrees of correctness, and it is often possible to cobble together some pattern-based solution for dealing with pages on a particular site ... assuming that there are common structures / patterns to be elicited.

Parsing RTF Documents with Java/JavaCC

Is anybody familiar with the RTF document format and parsing it using any Java libraries? The standard way people have done this is by using the RTFEditorKit in the JDK Swing API:
Swing RTFEditorKit API
but it isn't that accurate when it comes to parsing RTF documents. In fact, there's a comment in the API:
"The RTF support was not written by the Swing team. In the future we hope to improve the support provided."
I don't think I'm going to wait for this to happen :)
The other approach taken is to define a grammar using JavaCC and generate a parser. This works better, but I'm having trouble finding a complete grammar. I've tried:
PMD Applied JavaCC Grammar
which is OK, and the following (which is the best so far):
Koders RTFParserDelegate and ETranslate Grammar
There are various implementations of the ETranslate grammar about (I know the Nutch API may use this). Does anybody know which is the most accurate grammar or whether there is a better approach to this?
I could start ploughing through the JavaCC docs to understand the .jj files and test them against the RTF files... this is my current approach, but it's taking a while. Any help would be appreciated.
Does anybody know which is the most accurate grammar or whether there is a better approach to this?
Many years ago I spent some time reading RTF (Wikipedia) with C#. I say reading because, if you understand RTF in detail and use it the way it was designed, you will realize that RTF is not meant to be read and parsed as a whole over and over again while editing. In the documentation you will find the syntax for RTF, but don't be misled into believing that you should use a lexer/parser. In the documentation they give a sample reader for RTF.
Remember that RTF was created many ages ago, when memory was measured in KB rather than MB, and editing long documents of several hundred pages in a conventional way would tax system resources. So RTF has the ability to be edited in smaller subsections without loading or modifying the entire document. This is what lets it handle such large documents with limited memory. It is also why the syntax may seem odd at first.
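To make the reader approach concrete: RTF boils down to three token kinds, group delimiters ({ and }), control words introduced by a backslash, and plain text. A deliberately simplified sketch of that reading loop (it ignores control symbols, binary data, and escapes):

    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.StringReader;

    public class RtfTokenizer {
        public static void main(String[] args) throws IOException {
            String rtf = "{\\rtf1\\ansi Hello {\\b World}}";
            PushbackReader in = new PushbackReader(new StringReader(rtf));
            int c;
            while ((c = in.read()) != -1) {
                if (c == '{') {
                    System.out.println("[group start]");
                } else if (c == '}') {
                    System.out.println("[group end]");
                } else if (c == '\\') {
                    // Control word: letters, then an optional numeric parameter
                    StringBuilder word = new StringBuilder();
                    while ((c = in.read()) != -1 && (Character.isLetter(c) || Character.isDigit(c))) {
                        word.append((char) c);
                    }
                    // A single space terminates the control word and is consumed;
                    // anything else belongs to the next token, so push it back
                    if (c != -1 && c != ' ') in.unread(c);
                    System.out.println("[control \\" + word + "]");
                } else {
                    System.out.print((char) c);  // plain document text
                }
            }
            System.out.println();
        }
    }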
Presumably, the source of OpenOffice contains what you're looking for.
