Is there a parser/library that can read an HTML document into a DOM tree using Java? I'd like to use the standard DOM/XPath API that Java provides.
Most libraries seem to have custom APIs for this task. Furthermore, conversion from HTML to an XML DOM seems to be unsupported by most of the available parsers.
Any ideas or experience with a good HTML DOM parser?
JTidy, either by processing the stream to XHTML and then using your favourite DOM implementation to re-parse it, or by using parseDOM if the limited DOM implementation that gives you is enough.
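A minimal sketch of the parseDOM route, assuming the JTidy jar is on the classpath (the file name is a placeholder):
import java.io.FileInputStream;
import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

// ...
InputStream in = new FileInputStream("page.html"); // placeholder input
Tidy tidy = new Tidy();
tidy.setXHTML(true); // clean up towards XHTML
tidy.setQuiet(true); // suppress progress messages
Document dom = tidy.parseDOM(in, null); // null: don't write the cleaned markup anywhere
Note the returned Document is JTidy's own limited DOM implementation; if it isn't enough, write the tidied XHTML to an OutputStream instead and re-parse it with a standard DOM parser.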
Alternatively Neko.
Since HTML files are generally problematic, you'll need to first clean them up using a parser/scanner. I've used JTidy but never happily. NekoHTML works okay, but any of these tools are always just making a best guess of what is intended. You're effectively asking to let a program alter a document's markup until it conforms to a schema. That will likely cause structural (markup), style or content loss. It's unavoidable, and you won't really know what's missing unless you manually scan via a browser (and then you have to trust the browser too).
It really depends on your purpose — if you have thousands of ugly documents with tons of extraneous (non-HTML) markup, then a manual process is probably unreasonable. If your goal is accuracy on a few important documents, then manually fixing them is a reasonable proposition.
One approach is the manual process of repeatedly passing the source through a well-formed and/or validating parser, in an edit cycle using the error messages to eventually fix the broken markup. This does require some understanding of XML, but that's not a bad education to undertake.
As of Java 5 the necessary XML features, known as the JAXP API, are built into Java itself; you don't need any external libraries. (Note that JAXP parses XML, so for HTML input the document must already be well-formed XHTML.)
You first obtain an instance of a DocumentBuilderFactory, set its features, create a DocumentBuilder (the parser), then call its parse() method with an InputSource. InputSource has a number of possible constructors; the following example uses a StringReader:
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// ...
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
dbf.setNamespaceAware(true);
dbf.setIgnoringComments(false);
dbf.setIgnoringElementContentWhitespace(false);
dbf.setExpandEntityReferences(false);
DocumentBuilder db = dbf.newDocumentBuilder();
// source is the XML/XHTML string to parse
return db.parse(new InputSource(new StringReader(source)));
This returns a DOM Document. If you don't mind using external libraries, there are also the JDOM and XOM APIs; while these have some advantages over the SAX and DOM APIs in JAXP, they do require non-core libraries to be added. The DOM can be somewhat cumbersome, but after so many years of using it I don't really mind any longer.
Here is a link that might be useful: a list of Open Source HTML Parsers in Java.
TagSoup can do what you want.
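For example, since TagSoup presents itself as a SAX2 XMLReader, you can run it through the identity transform to get a standard W3C DOM; a sketch, assuming the TagSoup jar is on the classpath (the input string is a placeholder):
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// ...
Parser tagSoup = new Parser(); // TagSoup's SAX2-compatible parser
DOMResult result = new DOMResult();
Transformer identity = TransformerFactory.newInstance().newTransformer();
identity.transform(new SAXSource(tagSoup,
        new InputSource(new StringReader("<p>messy <b>html"))), result);
Document doc = (Document) result.getNode(); // now usable with the standard DOM/XPath APIs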
Use https://jsoup.org. It is very simple and powerful, and can read and modify HTML.
Sample:
Document doc = Jsoup.parse(page); // page is a String; for a file, use Jsoup.parse(file, "UTF-8")
Element main = doc.getElementById("MainView");
Elements links = doc.select(".link");
To create elements you can use j2html: https://j2html.com
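A minimal j2html sketch, assuming the j2html dependency is available (the content here is made up):
import static j2html.TagCreator.*;

// ...
String html = body(
    h1("Hello"),
    a("a link").withHref("https://example.com")
).render(); // renders the tag tree to an HTML string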
Apache's Xerces2 parser should do what you want.
Related
I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate the two tasks.
I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
Speed
Ease of locating any HtmlElement by its "id" or "name" or "tag type".
It would be OK for me if it doesn't clean the dirty HTML code. I don't need to clean any HTML source; I just need the easiest way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes, and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default it follows rules similar to those most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
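A short sketch of that, assuming the HtmlCleaner jar is on the classpath (the markup and XPath are placeholders):
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

// ...
HtmlCleaner cleaner = new HtmlCleaner();
TagNode root = cleaner.clean("<html><body><div id='main'>text"); // tolerates broken markup
Object[] hits = root.evaluateXPath("//div[@id='main']"); // XPath over the cleaned tree
TagNode main = (TagNode) hits[0];
System.out.println(main.getText());
If you need a standard W3C DOM afterwards, HtmlCleaner also provides a DomSerializer for converting the cleaned TagNode tree.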
For other HTML parsers see this SO question.
I suggest the Validator.nu parser, based on the HTML5 parsing algorithm. It is the parser used in Mozilla as of 2010-05-03.
Reading the docs, this is the method used in all the examples I've seen:
(Version of org.jdom.input.SAXBuilder is jdom-1.1.jar)
Document doc = new SAXBuilder().build(is);
Element root = doc.getRootElement();
Element child = root.getChild("someChildElement");
...
where "is" is an InputStream variable.
I'm wondering, since this is a SAX builder (as opposed to a DOM builder), does the entire inputstream get read into the document object with the build method? Or is it working off a lazy load and as long as I request elements with Element.getChildren() or similar functions (stemming from the root node) that are forward-only through the document, then the builder automatically takes care of loading chunks of the stream for me?
I need to be sure I'm not loading the whole file into memory.
Thanks,
Mike
The DOM parser, like the JDOM parser, loads the whole XML resource into memory to provide you with a Document instance that lets you navigate the elements of the XML.
Some references here:
the DOM standard is a codified standard for an in-memory document model.
And here:
JDOM works on the logical in-memory XML tree,
Both DOM and JDOM use a SAX parser internally to read the XML resource, but they use it only to build the whole content into the Document instance that they return. Indeed, with DOM and JDOM, the client never needs to provide a handler to intercept the events triggered by the SAX parser.
Note that neither DOM nor JDOM is under any obligation to use SAX internally.
They do so mainly because the SAX standard is already there, so it makes sense to use it, for example for reporting errors.
I need to be sure I'm not loading the whole file into memory.
You have two programming models to work with XML: streaming and the document object model (DOM).
You are looking for the first one.
So use the SAX parser, providing your own handler to handle the events it generates (startDocument(), startElement(), and so on), or as an alternative look at a more user-friendly API: StAX (Streaming API for XML):
As an API in the JAXP family, StAX can be compared, among other APIs, to SAX, TrAX, and JDOM. Of the latter two, StAX is not as powerful or flexible as TrAX or JDOM, but neither does it require as much memory or processor load to be useful, and StAX can, in many cases, outperform the DOM-based APIs. The same arguments outlined above, weighing the cost/benefits of the DOM model versus the streaming model, apply here.
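For reference, a minimal sketch of the SAX handler approach mentioned above (the element name "item", the "id" attribute, and the file name are all made up):
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// ...
DefaultHandler handler = new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("item".equals(qName)) { // hypothetical element of interest
            System.out.println("id = " + attrs.getValue("id"));
        }
    }
};
SAXParserFactory.newInstance().newSAXParser().parse(new File("data.xml"), handler);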
It eagerly parses the whole file to build the in-memory representation (i.e. the Document) of the XML file.
If you want to be absolutely certain of that, you can go through the source on GitHub, in particular the following classes: SAXBuilder, SAXHandler, and Document.
I am getting an XML file from another of our applications.
I want to read that XML file node by node and store the node values in a database for further use.
So, what is the best way/API to read an XML file and retrieve node values using Java?
There are various tools for that. Today, I prefer these:
Simple XML
JAXB
StAX
Here is a good comparison between the Simple and JAXB: http://blog.bdoughan.com/2010/10/how-does-jaxb-compare-to-simple.html
Personally, I like Simple a bit better because support by Niall is excellent but JAXB (as explained in the blog post above) can produce better output with less code.
StAX is a more basic API that allows you to read XML documents that simply don't fit into RAM (neither Simple nor JAXB lets you read an XML document "object by object"; they will always try to load everything into RAM at once).
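For illustration, a tiny JAXB sketch; the Item class, its name field, and the item.xml file name are all hypothetical:
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Item {
    public String name; // bound to the <name> child element by the default rules

    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Item.class);
        Item item = (Item) ctx.createUnmarshaller().unmarshal(new File("item.xml"));
        System.out.println(item.name);
    }
}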
I would advise using a simple XML tool if you can manage with that.
For example, my colleagues and I have introduced complex XML frameworks that worked like a charm at first.
Then you forget about the framework, you have special build files just for mapping XML to beans, you have annotated beans, and you put up a new barrier for new developers on your project. You lose much of your freedom to refactor.
In the end you will be sorry that you used the complex framework to save some time at the beginning. More than once I have seen such frameworks thrown out during refactoring because everybody had a negative feeling about them, even though they look great on paper.
So think twice about introducing complex XML frameworks if you seldom use them. If you and your team use them rather frequently, then they are the way to go.
I suggest using XPath. Xalan is already included in the JDK (no external jars needed) and it fits your requirement, i.e. iterating through element nodes (I presume) and storing their text values. For example:
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// ...
String xml = "<root> <item>One</item> <item>Two</item> <item>Three</item> </root>";
XPathFactory xpf = XPathFactory.newInstance();
InputSource is = new InputSource(new StringReader(xml));
NodeList nodes = (NodeList) xpf.newXPath().evaluate("/*/*", is, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
    Element e = (Element) nodes.item(i);
    System.out.println(e.getNodeName() + " -> " + e.getTextContent());
}
This example returns a list of all non-root elements and prints the corresponding element name and text content. Adapt the XPath expression to fit your needs.
dom4j and jdom are pretty easy to use (ignoring the requirement "best" for a moment ;) )
Try Apache Xerces. It is mature and robust. Any of the available alternatives will do as well; just be sure not to roll your own implementation.
Bypassing altogether the question of parsing the XML and storing the values in a database, I'd like to question the need to do the above. Most databases can handle XML nowadays, so it can be stored in some way in a table without parsing the content; and the content of such an XML column in a table can typically be queried using 'xmlselect()' and similar functions.
Think about this for a second: if in the near or distant future the content of the XML that you get from the other application changes, you'll have plenty of changes to make. If it changes often, it'll become a nightmare.
Cheers,
Wim
Try XStream, this one's really simple.
Well, I used StAX to parse quite a huge set of XML nodes; it consumes less memory than DOM and, unlike SAX, follows a pull style of reading XML data. StAX might be a good choice for large XML documents.
I've got an XML document that is in either a pre or post FO transformed state that I need to extract some information from. In the pre-case, I need to pull out two tags that represent the pageWidth and pageHeight and in the post case I need to extract the page-height and page-width parameters from a specific tag (I forget which one it is off the top of my head).
What I'm looking for is an efficient/easily maintainable way to grab these two elements. I'd like to only read the document a single time fetching the two things I need.
I initially started writing something that would use BufferedReader + FileReader, but then I'm doing string searching and it gets messy when the tags span multiple lines. I then looked at the DOMParser, which seems like it would be ideal, but I don't want to have to read the entire file into memory if I could help it as the files could potentially be large and the tags I'm looking for will nearly always be close to the top of the file. I then looked into SAXParser, but that seems like a big pile of complicated overkill for what I'm trying to accomplish.
Anybody have any advice? Or simple implementations that would accomplish my goal? Thanks.
Edit: I forgot to mention that due to various limitations I have, whatever I use has to be "builtin" to core Java, in which I can't use and/or download any 3rd party XML tools.
While XPath is very good for querying XML data, I am not aware of a good and fast XPath implementation for Java (they all use the DOM model, at least).
I would recommend you stick with StAX. It is extremely fast even for huge files, and its cursor API is rather trivial:
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// ...
XMLInputFactory f = XMLInputFactory.newInstance();
XMLStreamReader r = f.createXMLStreamReader(new FileInputStream("my.xml"));
try {
    while (r.hasNext()) {
        r.next();
        . . .
    }
} finally {
    r.close();
}
Consult StAX tutorial and XMLStreamReader javadocs for more information.
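Since you only need two values near the top of the file, you can also stop reading as soon as you have them. A sketch under the assumption that the post-transform values live on a simple-page-master-like element (the element and attribute names here are guesses; adjust to your documents):
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// ...
XMLStreamReader r = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream("my.xml"));
String width = null, height = null;
while (r.hasNext() && (width == null || height == null)) {
    if (r.next() == XMLStreamConstants.START_ELEMENT
            && "simple-page-master".equals(r.getLocalName())) { // guessed element name
        width = r.getAttributeValue(null, "page-width");   // guessed attribute names
        height = r.getAttributeValue(null, "page-height");
    }
}
r.close(); // we never read past the part of the file we needed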
You can use XPath to search for your tags. Here is a tutorial on forming XPath expressions. And here is an article on using XPath with Java.
An easy-to-use parser (DOM, SAX) is dom4j. It would be quite a bit easier to use than the built-in SAXParser.
try "XMLDog"
This uses sax to evaluate xpaths
I need to parse HTML 4 in Java.
Ideally I'd like an implementation that is SAX compatible.
I'm aware that there are numerous HTML parsers for Java; however, they all seem to perform 'tidying'. In other words, they will correct badly formed HTML. I don't want this.
My requirements are:
No tidying.
If the input document is invalid HTML, parsing should fail.
The document should be validatable against the HTML DTDs.
The parser can produce SAX2 events.
Is there a library that meets these requirements?
You can find a collection of HTML parsers here: HTML Parsers. I don't remember exactly, but I think TagSoup parses the file without applying corrections...
I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid HTML, parsing should fail.') in that it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.
Try typing invalid html into this Jericho formatting demo, and note the 'Parser Log' at the bottom of the page:
http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp
So yes, this is doing tag tidying, but it is at least telling you about it - you can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your source, and then proceeding depending on what errors are logged out. This is a small example:
import net.htmlparser.jericho.Source;

// ...
Source source = new Source("<a>I forgot to close my link!");
source.setLogger(myListeningLogger); // your net.htmlparser.jericho.Logger implementation
source.getSourceFormatter().writeTo(new NullWriter()); // NullWriter: e.g. org.apache.commons.io.output.NullWriter
// myListeningLogger has now had all the HTML flaws written to it
In the example above, your logger's info() method is called with the string: 'StartTag at (r1,c1,p0) missing required end tag', which is relatively parseable, and you can always decide to just reject any HTML that logs any message worse than debug - in fact Jericho logs almost all errors as 'info' level, with a couple at 'warn' level (you might be tempted to create a small fork with the severities adjusted to correspond to what you care about).
Jericho is available on Maven Central, which is always a good sign:
http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html
Good luck!
You may wish to check http://lobobrowser.org/cobra.jsp. They have a pure Java web browser (Lobo) implemented. They have the parser component (Cobra) pulled out separately for use. I honestly am not sure if it will do what you require with the "no tidying" requirement, but it may be worth a look. I ran across it when exploring the wild for a pure Java web browser.
You can try subclassing javax.swing.text.html.parser.Parser and implementing the handleXXX() methods. It seems it doesn't try to fix the HTML. See the API docs for more.
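A small sketch of that approach, using the ParserDelegator wrapper around that parser rather than subclassing it directly (the input string and the printed output are just for illustration):
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// ...
HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        System.out.println("start tag: " + tag);
    }
    @Override
    public void handleText(char[] text, int pos) {
        System.out.println("text: " + new String(text));
    }
};
new ParserDelegator().parse(new StringReader("<html><body><p>hello"), callback, true);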