Closed 11 years ago.
Possible Duplicate:
What are the pros and cons of the leading Java HTML parsers?
What HTML parser would you recommend for parsing HTML?
I need the parser to have one feature:
It should return only the useful text, with no menu, footer, or header information, just the text that makes up the normal content.
I have tried Jericho HTML Parser and HtmlCleaner, but they do not seem to work the way I need.
Thanks in advance.
I'm not really sure what you're asking; an HTML parser parses HTML--what you extract out of it is up to you. I like jsoup and tagsoup.
If you want something that pulls "normal" content out of HTML, you could look at how Apache Tika handles HTML. All HTML is written differently--you have to be able to define what "normal" content is, and where it is.
I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both the tasks.
I want to use a lightweight HTML parser because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
Speed
Ease of locating any HtmlElement by its "id", "name", or "tag type".
It is fine if the parser doesn't clean dirty HTML; I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
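import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);                // lenient parse into a DOM-like tree
Elements links = doc.select("a");                // all <a> elements (none in this sample)
Element head = doc.select("head").first();       // first element matching the selector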
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
For other html parsers see this SO question.
I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm. It is the parser used in Mozilla since 2010-05-03.
I want to know whether it is possible to retrieve HTML tags and plain text, such as
<p>This is text </p> or <div> or This is text
using XmlPullParser. I read here that it is not recommended. Is there an alternative, or some simple code, that lets me retrieve the HTML and plain text as described above? I'm still a beginner in Android. Thank you for your help.
I think your best option (which I have also used) is JSOUP.
JSOUP provides a very convenient API for extracting and manipulating data, using DOM, CSS, and jQuery-like methods. It lets you scrape and parse HTML from a URL, file, or string.
jSoup: https://jsoup.org/
Here is a nice tutorial (not mine):
http://www.androidbegin.com/tutorial/android-basic-jsoup-tutorial/
JSOUP is a great parser and is one of the most commonly used ones.
Another thing that might help is an HTML organizer. Errors caused by malformed HTML files are common when writing parsers, and they happen more often than you might expect, so running the HTML through an organizer first can reduce the number of errors.
A good organizer I have used is Tidy.
All of the guides out there tell me how to remove the HTML tags from text in order to extract the text between them. What I am after is extracting the data that is inside the HTML tags themselves.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware of trying to parse the page as "standard" XML: XML parsing is strict by nature and will fail if the page does not conform to XML markup specs (which few HTML pages do).
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like Jericho HTML Parser, which lets you search for HTML tags as well as their attributes, or you can build a DOM on your own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
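As a rough sketch of the JAXP route, assuming the snippet is well-formed XML (as the FONT example from the question happens to be), the attribute can be read straight off the DOM element. The class name FontSizeReader and the method readSize are made up for illustration. As noted in another answer, a strict XML parser will reject most real-world HTML, so this is only suitable for small, clean fragments.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper; the class and method names are illustrative only.
class FontSizeReader {

    // Returns the SIZE attribute of the root element, or "" if it is absent.
    // Assumption: the markup is well-formed XML; anything else is rejected.
    static String readSize(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            Element root = doc.getDocumentElement();
            // XML attribute names are case-sensitive, so "SIZE" must match the source.
            return root.getAttribute("SIZE");
        } catch (Exception e) {
            // Strict XML parsing fails on unquoted attributes, unclosed tags, etc.
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readSize("<FONT SIZE=\"5\">Hello World</FONT>")); // prints 5
    }
}
```

Input such as <FONT SIZE=5>Hello World (unquoted attribute, no closing tag) makes readSize throw, which is the situation where a lenient parser like jsoup or TagSoup is the better choice.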
I have learnt how to make an HTTP GET request to retrieve data from a URL, but I would like to filter the response so that it gives me only a list of the links on the web page.
For example, if the HTML contained the following text:
<link href="http://www.thompsons.co.uk">
then it should print out:
http://www.thompsons.co.uk
I would strongly recommend that you DO NOT use regexes to "parse" HTML. Unless you have control over the formatting of the web pages you are processing, a solution based on regexes is liable to be fragile and buggy.
Instead, use a permissive HTML parser. This Question gives a number of alternatives: HTML/XML Parser for Java
You can use jsoup:
http://jsoup.org/cookbook/extracting-data/attributes-text-html
You read the whole response in, then apply a regular expression to it to extract the links. Read more here: http://www.mkyong.com/regular-expressions/how-to-extract-html-links-with-regular-expression/
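For illustration only, here is a minimal sketch of that regex approach using java.util.regex. The class name HrefRegexDemo and the pattern are assumptions of mine, not taken from the linked article. It only handles quoted href values on a and link tags, and it carries the fragility the other answer warns about, so treat it as a quick hack for pages whose formatting you control.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the pattern are illustrative only.
class HrefRegexDemo {

    // Matches href="..." or href='...' inside an <a ...> or <link ...> tag.
    // Limitations: unquoted values are missed, and unusual markup can fool it.
    private static final Pattern HREF = Pattern.compile(
            "<(?:a|link)\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href values in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<link href=\"http://www.thompsons.co.uk\">";
        System.out.println(extractLinks(html)); // prints [http://www.thompsons.co.uk]
    }
}
```

With a lenient HTML parser the same task becomes a selector query rather than a pattern, which is why the other answers steer toward one.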
How can I use a regular expression to extract the links in a web page (suppose I get the HTML page as a text file) using Java?
This previously posted question should help you:
How to use regular expressions to parse HTML in Java?
Essentially, you should really look at using an HTML parser.
Agreed, an HTML parser will make your life easier if you can include it with your build. I've used Jericho HTML Parser for something similar in the past.