I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now I want to separate the two tasks.
I want to use a lightweight HTML parser, because HtmlUnit takes a long time to load a page, get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
Speed
An easy way to locate any HtmlElement by its "id", "name", or "tag type".
It is fine if it doesn't clean dirty HTML. I don't need to clean any HTML source; I just need the easiest way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();
See the Selector javadoc for more info.
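For the three lookups the question asks about (by id, by name, by tag), a minimal sketch might look like the following. The sample markup and the class name SelectorSketch are made up for illustration; the jsoup calls used are getElementById, select with an attribute selector, and getElementsByTag.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorSketch {
    // Hypothetical sample page, parsed from a string rather than fetched.
    static Document sample() {
        String html = "<html><body>"
                + "<form><input id=\"user\" name=\"username\" value=\"alice\"></form>"
                + "<p>First</p><p>Second</p>"
                + "</body></html>";
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = sample();

        // Lookup by id: returns a single element, or null if nothing matches.
        Element byId = doc.getElementById("user");
        System.out.println(byId.attr("value"));      // prints "alice"

        // Lookup by name attribute, using a CSS attribute selector.
        Elements byName = doc.select("[name=username]");
        System.out.println(byName.size());           // prints 1

        // Lookup by tag type.
        Elements byTag = doc.getElementsByTag("p");
        System.out.println(byTag.get(1).text());     // prints "Second"
    }
}
```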
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
For other html parsers see this SO question.
I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm. It has been the parser used in Mozilla since 2010-05-03.
Related
I want to know whether it is possible to retrieve HTML tags and plain text such as
<p>This is text </p> or <div> or This is text
by using XmlPullParser? I read here that it is not recommended. Is there an alternative, or some simple code, that lets me retrieve HTML and plain text as shown above? I'm still a beginner in Android. Thank you for your help.
I think your best option (which I have also used) is JSOUP.
JSOUP provides a very convenient API for extracting and manipulating data, using DOM, CSS, and jQuery-like methods. It lets you scrape and parse HTML from a URL, a file, or a string.
jSoup: https://jsoup.org/
You have here a nice tutorial (not mine)
http://www.androidbegin.com/tutorial/android-basic-jsoup-tutorial/
JSOUP is a great parser and is one of the most commonly used ones.
Another thing that might be helpful is an HTML organizer. Errors caused by malformed HTML files are common when writing parsers, and they happen more often than you would expect, so an HTML organizer can reduce the number of errors.
A good organizer I used is: Tidy
All of the guides out there tell me on how to remove the HTML tags from the text to extract the text between them. What I am after is the extraction of the data that is within the HTML tags.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware trying to parse it as "standard" XML as XML-parsing is strict by nature and will fail if the page does not conform to XML markup specs (which few HTML pages do).
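As a rough sketch of how that could look for the FONT example from the question: parse the string, select the font element, and read its size attribute. The class and method names (FontSizeSketch, fontSize) are hypothetical; jsoup's HTML parser lower-cases tag and attribute names by default, so the lookup uses "font" and "size".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FontSizeSketch {
    // Returns the value of the size attribute on the first <font> element,
    // or null if the markup contains no <font> element at all.
    static String fontSize(String html) {
        Document doc = Jsoup.parse(html);
        Element font = doc.select("font").first();
        return font == null ? null : font.attr("size");
    }

    public static void main(String[] args) {
        String size = fontSize("<FONT SIZE=\"5\">Hello World</FONT>");
        // The attribute value is a string; convert it before using it as a number.
        int asNumber = Integer.parseInt(size);
        System.out.println(asNumber); // prints 5
    }
}
```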
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like jerichoHTML, which lets you search for HTML tags as well as their attributes, or you can build a DOM on your own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
I want to implement a Java method which takes a URL as input and stores the entire web page, including CSS, images, and JS (all related resources), on my disk. I have used the Jsoup HTML parser to fetch the HTML page. The only option I can think of is to get the page using Jsoup, parse the HTML content, convert relative paths to absolute paths, and then make further GET requests for the JavaScript, images, etc. and save them on disk.
I also read about the HtmlCleaner and HtmlUnit parsers, but I think in all these cases I have to parse the HTML content to fetch the images, CSS, and JavaScript files.
Is my thinking right?
Or is there an easier way to accomplish this task?
Basically, you can do it with Jsoup:
Document doc = Jsoup.connect("http://rabotalux.com.ua/vacancy/4f4f800c8bc1597dc6fc7aff").get();
Elements links = doc.select("link");
Elements scripts = doc.select("script");
for (Element element : links) {
    System.out.println(element.absUrl("href"));
}
for (Element element : scripts) {
    System.out.println(element.absUrl("src"));
}
And so on with images and all related resources.
BUT if your site creates some elements with JavaScript, Jsoup will miss them, as it can't execute JavaScript.
I encountered a similar problem a couple of years ago, and we used exactly the mechanism you are planning: parse the HTML content and convert relative paths to absolute paths. We also used multiple threads running simultaneously to retrieve images, JavaScript, etc. for better performance. I don't know whether it should be done the way we did it, but in the end it worked for us. :-)
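The mechanism described above (resolve relative paths against the page URL, then fetch the resources in parallel) could be sketched with the standard library alone. This is only an illustration under assumptions: the names LinkResolver, toAbsolute, fileNameFor and downloadAll are invented here, the file naming is simplistic (last path segment, no collision handling), and the example.com URLs are placeholders.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LinkResolver {
    // Resolve each (possibly relative) reference against the URL of the page.
    static List<URI> toAbsolute(URI pageUrl, List<String> hrefs) {
        List<URI> absolute = new ArrayList<>();
        for (String href : hrefs) {
            absolute.add(pageUrl.resolve(href));
        }
        return absolute;
    }

    // Assumption: the last path segment is a good enough local file name.
    static String fileNameFor(URI resource) {
        String path = resource.getPath();
        String name = (path == null) ? "" : path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "index.html" : name;
    }

    // Download every resource into targetDir using a fixed-size thread pool.
    static void downloadAll(List<URI> resources, Path targetDir, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (URI resource : resources) {
            pool.submit(() -> {
                Path target = targetDir.resolve(fileNameFor(resource));
                try (InputStream in = resource.toURL().openStream()) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException e) {
                    System.err.println("Failed: " + resource + " (" + e.getMessage() + ")");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/a/b/page.html");
        List<String> refs = Arrays.asList("../img/logo.png", "/css/site.css", "js/app.js");
        // Only the resolution step runs here; downloadAll needs network access.
        for (URI u : toAbsolute(page, refs)) {
            System.out.println(u + " -> " + fileNameFor(u));
        }
    }
}
```

The references themselves would still come from an HTML parser (for example the href and src attributes collected with Jsoup); this sketch only covers what happens after they have been extracted.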
This GitHub project does this, using jSoup. No need to write it again if it already exists!
EDIT: I made an improved version of this class and added new features.
It can:
Extract URLs from linked or inline CSS, e.g. for background images, and download and save those too.
Download all the files (images, scripts, etc.) in multiple threads.
Give details about progress and errors.
Get HTML frames embedded in the HTML document, including nested frames.
Some caveats:
Uses JSoup and OkHttp, so you need to have those libraries.
GPL licenced, for now anyway.
Does the document tree returned by JSoup when it parses an HTML document support getComputedStyle on the individual document elements?
What I would like to do is inline the CSS in an HTML fragment so that I can insert the fragment into a larger HTML document, with all of its formatting preserved but without messing with any other formatting in the document.
The research I've done would seem to suggest that I can accomplish this by iterating through all of the elements in the document, calling getComputedStyle on each one, and assigning the result to be the style for the element.
Yes, I realize that this may very well bloat the resulting HTML by putting a bunch of redundant / unnecessary style information on the individual elements, but I'm willing to pay the price of larger HTML, and as far as I can tell, embedding the style inline like this is the only way to preserve the formatting exactly while also making the HTML fragments fully portable. (If you've got another suggestion for accomplishing that purpose, I'm all ears. :-)
Getting back on topic... If I can't use getComputedStyle (or the equivalent) with JSoup, is there another Java HTML+CSS parser that supports getComputedStyle or the equivalent?
Thanks.
That's not possible. Jsoup is just an HTML parser with CSS selector support; it is not an HTML renderer.
You may want to take a look at Lobobrowser, which is a Java-based HTML renderer supporting JavaScript and the like. I do not know, nor can I guarantee, that getComputedStyle() is supported by Lobo.
No other tools come to mind. HtmlUnit comes close, as it can also access and invoke JavaScript, but some Google results suggest that getComputedStyle() doesn't work in HtmlUnit either. After all, it is not a real HTML renderer either.
Could anybody please recommend libraries that do the opposite of what these libraries do?
HtmlCleaner, TagSoup, HtmlParser, HtmlUnit, jSoup, jTidy, nekoHtml, WebHarvest or Jericho.
I need to build HTML pages, i.e. build the DOM model from String content.
EDIT: I need it for testing purposes. I have various types of input strings that might appear in various places in the HTML page, so I need to build it up dynamically. I then process the HTML page based on various criteria that must or must not be fulfilled.
I will show you why I asked this question, consider htmlCleaner for this job :
List<String> paragraphs = getParagraphs(entity.getFile());
List<TagNode> pNodes = new ArrayList<TagNode>();
TagNode html = cleaner.clean("<html/>");
for (String paragraph : paragraphs) {
    TagNode p = new TagNode("p");
    pNodes.add(p);
    // CANNOT setText() ?
}
html.addChildren(pNodes);
The problem is that TagNode has getText() method, but no setText() method ....
Please add more comments about how vague this question is ... The best thing you can do
Jsoup, Jsoup, Jsoup! I've used all of those, and it's my favorite by a long shot. You can use it to build documents, plus it brings a lot of the magic of Jquery-style traversing alongside the best HTML document parsing I've seen to date in a Java library. I'm so happy with it that I don't mind shamelessly promoting it. ;)
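For the paragraph-building case from the question, a minimal jsoup sketch could look like this. The class and method names (JsoupBuildSketch, build) and the sample strings are illustrative only; the calls used are Document.createShell, Element.appendElement and Element.text.

```java
import java.util.Arrays;
import java.util.List;
import org.jsoup.nodes.Document;

class JsoupBuildSketch {
    // Build an HTML document with one <p> per input string.
    static Document build(List<String> paragraphs) {
        // createShell gives an empty html/head/body skeleton; the base URI is left empty.
        Document doc = Document.createShell("");
        for (String paragraph : paragraphs) {
            // text(...) sets the content as text, so markup characters are escaped on output.
            doc.body().appendElement("p").text(paragraph);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build(Arrays.asList("First paragraph", "Fish & chips"));
        System.out.println(doc.body().html());
    }
}
```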
There are a lot of template libraries for Java, from JSP to FreeMarker, from specific implementations in various frameworks (Spring?) to generic libraries like StringTemplate.
The most difficult task is... to make a choice.
In general, these libraries offer to make a skeleton of Web page, with "holes" to fill with variables. It is the simplest approach, often working well with tools.
If you really want to build from a DOM, you can just use an XML library and generate XHTML.
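A small sketch of that XML-library route, using only the JDK's built-in DOM and Transformer APIs, might look like the following. The class name XhtmlBuilder and its build method are hypothetical, and the output method is forced to "xml" so the serialization does not depend on the transformer's handling of an html root element.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class XhtmlBuilder {
    // Build html > body > p* from plain strings and serialize it as XML.
    static String build(List<String> paragraphs) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            doc.appendChild(html);
            Element body = doc.createElement("body");
            html.appendChild(body);
            for (String text : paragraphs) {
                Element p = doc.createElement("p");
                p.setTextContent(text); // special characters are escaped on output
                body.appendChild(p);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Configuration or serialization problems are unexpected in this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("a < b", "two")));
    }
}
```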
If you are interested in HtmlCleaner particularly, it is actually a very convenient choice for building html documents.
But you must know that if you want to set content to a TagNode, you append a child ContentNode element :-)
List<String> paragraphs = getParagraphs(entity.getFile());
List<TagNode> pNodes = new ArrayList<TagNode>();
TagNode html = new TagNode("html");
for (String paragraph : paragraphs) {
    TagNode p = new TagNode("p");
    p.addChild(new ContentNode(paragraph));
    pNodes.add(p);
}
html.addChildren(pNodes);
jwebutils -- A library for creating HTML 5 markup using Java. It also contains support for creating JSON and CSS 3 markup.
Jakarta Element Construction Set (ECS) - A Java API for generating elements for various markup languages; it directly supports HTML 4.0 and XML. Now retired, but some folks really like it.