All the guides I can find explain how to strip the HTML tags from text in order to get at the content between them. What I am after is the opposite: extracting the data that is held inside the HTML tags themselves.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware of trying to parse the page as "standard" XML: XML parsing is strict by nature and will fail if the page does not conform to the XML markup specs (which few HTML pages do).
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like jerichoHTML, which enables you to search for HTML tags as well as their attributes, or you can build a DOM on your own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
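A minimal sketch of that idea, using the JAXP DocumentBuilder that ships with the JDK. It assumes the snippet is well-formed XML, which the FONT example from the question happens to be; the class and method names (FontSizeExample, fontSize) are made up for illustration. Real-world HTML is often not well-formed, in which case one of the lenient parsers mentioned above is probably the safer choice.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: the names here are illustrative, not from any library.
final class FontSizeExample {

    // Returns the SIZE attribute of the first FONT element, if there is one.
    // XML is case-sensitive, so the tag and attribute must be spelled
    // exactly "FONT" and "SIZE" for this sketch to find them.
    static Optional<String> fontSize(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // parse(String) expects a URI, so the markup goes through an InputSource.
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            NodeList fonts = doc.getElementsByTagName("FONT");
            if (fonts.getLength() == 0) {
                return Optional.empty();
            }
            Element font = (Element) fonts.item(0);
            return font.hasAttribute("SIZE")
                    ? Optional.of(font.getAttribute("SIZE"))
                    : Optional.empty();
        } catch (Exception e) {
            // Strict XML parsing rejects anything that is not well-formed.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String html = "<FONT SIZE=\"5\">Hello World</FONT>";
        System.out.println(fontSize(html).orElse("no size found")); // prints 5
    }
}
```

The returned value is a string, so convert it with Integer.parseInt before using it to update numeric variables.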
Related
I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both the tasks.
I want to use a lightweight HTML parser because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need
Speed
Ease of locating any HtmlElement by its "id", "name", or "tag type".
It is fine if the parser doesn't clean up dirty HTML; I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
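To give a feel for what XPath lookup looks like, here is a sketch that runs against markup that is already well-formed, the kind of output HtmlCleaner is described as producing. It deliberately uses only the JDK's javax.xml.xpath classes rather than HtmlCleaner's own API, which is not shown here; the class name, method name, and sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: the names here are illustrative, not from any library.
final class XPathLookupExample {

    // Evaluates the expression against the markup and returns the string
    // value of the first match (an empty string if nothing matches).
    static String textAt(String wellFormedMarkup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"price\">42</div>"
                + "<input name=\"q\" value=\"jsoup\"/>"
                + "</body></html>";
        System.out.println(textAt(page, "//*[@id='price']"));      // lookup by id
        System.out.println(textAt(page, "//*[@name='q']/@value")); // lookup by name, then read an attribute
        System.out.println(textAt(page, "//div"));                 // lookup by tag type
    }
}
```

The three expressions cover the "id", "name", and "tag type" lookups asked for in the question.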
For other html parsers see this SO question.
I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. It has been the parser used in Mozilla since 2010-05-03.
Is it possible to retrieve HTML tags and plain text, such as
<p>This is text </p> or <div> or This is text
using XmlPullParser? I read here that it is not recommended. Is there an alternative, or some simple code, that lets me retrieve HTML and plain text as described above? I'm still a beginner in Android. Thank you for your help.
I think your best option (which I have also used) is JSOUP.
JSOUP provides a very convenient API for extracting and manipulating data, using DOM, CSS, and jQuery-like methods. It lets you scrape and parse HTML from a URL, a file, or a string, among other things.
jSoup: https://jsoup.org/
Here is a nice tutorial (not mine):
http://www.androidbegin.com/tutorial/android-basic-jsoup-tutorial/
JSOUP is a great parser and is one of the most commonly used ones.
Another thing that might be helpful is an HTML organizer. Errors caused by malformed HTML files are common when writing parsers, and they happen more often than you might expect, so an HTML organizer can reduce the number of errors.
A good organizer I used is: Tidy
I have text that may contain HTML islands.
Example:
qwwdeadaskdfdaskjfhbsdfkfSet attribute valuesgfkjgfkjrgjgjgjgjgroggjrog <b>jsoup</b>sdflkjsdfsfklsfklfjsfkljsfljsfJsoup.parse(String html)skgjdfgkjdfgkldfjgdfkgljdfg
How can I extract those HTML fragments?
Java supports both DOM and SAX parsing for XML; however, both require the document to be well-formed, so your example would not be parsed. There is a project called NekoHTML (http://nekohtml.sourceforge.net/) that supports scanning HTML that is not well-formed.
I do exactly what you are asking -- find HTML fragments in a chunk of text -- by wrapping an enclosing tag around the text and then using a javax.xml.parsers.DocumentBuilder to create a DOM tree.
The basic idea (and omitting much) is just
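String fragment = "<wrap_node>" + orig_text + "</wrap_node>";
// parse(String) would treat its argument as a URI, so wrap the markup in an
// InputSource (needs org.xml.sax.InputSource and java.io.StringReader).
Document d = builder.parse(new InputSource(new StringReader(fragment)));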
If the tags aren't well-formed (missing end tags, improper nesting, etc.) this won't work, but that suits me because I want to reject anything malformed.
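Filling in some of what was omitted, the idea might look roughly like the following. This is a sketch rather than the answerer's actual code: the class and method names are invented, and it reports only the top-level elements found inside the wrapper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: the names here are illustrative, not from any library.
final class FragmentFinder {

    // Returns "tagName: text" for each top-level element embedded in the text.
    // Throws IllegalArgumentException when the markup is not well-formed.
    static List<String> findFragments(String origText) {
        String wrapped = "<wrap_node>" + origText + "</wrap_node>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // parse(String) expects a URI, so the markup goes through an InputSource.
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));
            List<String> found = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip the plain text between fragments; keep only elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    found.add(child.getNodeName() + ": " + child.getTextContent());
                }
            }
            return found;
        } catch (SAXException e) {
            throw new IllegalArgumentException("Malformed markup rejected", e);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findFragments("abc <b>jsoup</b> def")); // prints [b: jsoup]
    }
}
```

One caveat with this approach: a stray "&" or "<" in the surrounding plain text also counts as malformed XML, so such text is rejected along with genuinely broken tags.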
Does the document tree returned by JSoup when it parses an HTML document support getComputedStyle on the individual document elements?
What I would like to do is inline the CSS in an HTML fragment so that I can insert the fragment into a larger HTML document, with all of its formatting preserved but without messing with any other formatting in the document.
The research I've done would seem to suggest that I can accomplish this by iterating through all of the elements in the document, calling getComputedStyle on each one, and assigning the result to be the style for the element.
Yes, I realize that this may very well bloat the resulting HTML by putting a bunch of redundant / unnecessary style information on the individual elements, but I'm willing to pay the price of larger HTML, and as far as I can tell, embedding the style inline like this is the only way to preserve the formatting exactly while also making the HTML fragments fully portable. (If you've got another suggestion for accomplishing that purpose, I'm all ears. :-)
Getting back on topic... If I can't use getComputedStyle (or the equivalent) with JSoup, is there another Java HTML+CSS parser that supports getComputedStyle or the equivalent?
Thanks.
That's not possible. Jsoup is just a HTML parser with CSS selector support, it is not a HTML renderer.
You may want to take a look at Lobobrowser, which is a Java-based HTML renderer supporting JavaScript and the like. I do not know, and cannot guarantee, that getComputedStyle() is supported by Lobo.
No other tools come to mind. HtmlUnit comes close, as it can also access and invoke JavaScript, but some Google results suggest that getComputedStyle() doesn't work in HtmlUnit either. After all, it is not a real HTML renderer either.
What is a fast and simple way to validate HTML from Java? I’m looking for an open-source/PD class (or set of classes) that describes the various properties of the 100-odd HTML tags, such as:
Is the tag optional? Empty? Is it legal to omit its closing tag?
Which other tags can this tag contain (if any)?
Which attributes are legal for this tag, and what are their types? (not required, but nice to have)
Thanks!
EDIT
I'm looking to do to a tag-by-tag analysis of an HTML document, so I'm less interested in whether the document as a whole is valid, but rather what the specific requirements are for each type of tag.
I could encode the rules based on the W3C spec, but wanted to see which ready-made solutions are available first.
If you want to verify certain tags follow certain specifications, there seems to be no end of Java based HTML parsers:
Open Source HTML Parsers in Java
In other words, you could parse your HTML and then inspect the resulting document for the tags you are looking for to determine whether they meet the specifications you require. If they don't, you could then just throw an error.
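As a rough sketch of that parse-then-inspect approach: the rule table below is a made-up, deliberately tiny subset (it is not taken from the W3C spec), the class and method names are hypothetical, and the JDK's strict XML parser stands in for whichever HTML parser you end up picking, so the input has to be well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical checker: the names and the rules are illustrative only.
final class TagRuleChecker {

    // Made-up rule table: tag name -> attributes allowed on that tag.
    static final Map<String, Set<String>> ALLOWED_ATTRIBUTES = Map.of(
            "p", Set.of("class", "id"),
            "a", Set.of("href", "class", "id"),
            "b", Set.of("class"));

    // Returns one message per rule violation, in document order.
    static List<String> violations(String wellFormedMarkup) {
        List<String> problems = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            NodeList all = doc.getElementsByTagName("*"); // every element
            for (int i = 0; i < all.getLength(); i++) {
                Element el = (Element) all.item(i);
                Set<String> allowed = ALLOWED_ATTRIBUTES.get(el.getTagName());
                if (allowed == null) {
                    problems.add("unknown tag: " + el.getTagName());
                    continue;
                }
                NamedNodeMap attrs = el.getAttributes();
                for (int j = 0; j < attrs.getLength(); j++) {
                    String name = attrs.item(j).getNodeName();
                    if (!allowed.contains(name)) {
                        problems.add("<" + el.getTagName()
                                + "> may not have attribute " + name);
                    }
                }
            }
        } catch (Exception e) {
            problems.add("not parseable: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        String html = "<p class=\"x\">Hi <a onclick=\"go()\">there</a> <blink>!</blink></p>";
        violations(html).forEach(System.out::println);
    }
}
```

Collecting messages instead of throwing on the first problem suits a tag-by-tag analysis; throwing an error, as suggested above, is a one-line change at the call site.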
I don't think you'll find a HTML analysis tool which was written with exactly your requirements in mind, mostly because those requirements haven't been voiced and are probably a bit nebulous.
If the parser doesn't do what you want out of the box, at least this list is open source, so you can hack the parser as long as you publish your changes.
Check JTidy (http://jtidy.sourceforge.net/) and VietSpider HTMLParser (http://sourceforge.net/projects/binhgiang/). Both are Java HTML parsers with some syntax-checking capabilities. Some Eclipse-based HTML editor plugins use JTidy (or a port of Tidy) for syntax checking. Or, as David said, submit the page to w3c.org.