Could anybody please recommend libraries that can do the opposite of what these libraries do?
HtmlCleaner, TagSoup, HtmlParser, HtmlUnit, jSoup, jTidy, nekoHtml, WebHarvest or Jericho.
I need to build HTML pages, i.e. build up the DOM model from String content.
EDIT: I need it for testing purposes. I have various types of input/strings that might appear in the HTML page in various places... so I need to build it up dynamically... I then process the HTML page based on various criteria that must or must not be fulfilled.
I will show you why I asked this question; consider HtmlCleaner for this job:
List<String> paragraphs = getParagraphs(entity.getFile());
List<TagNode> pNodes = new ArrayList<TagNode>();
TagNode html = cleaner.clean("<html/>");
for (String paragraph : paragraphs) {
    TagNode p = new TagNode("p");
    pNodes.add(p);
    // CANNOT setText() ?
}
html.addChildren(pNodes);
The problem is that TagNode has a getText() method, but no setText() method...
Please add more comments about how vague this question is... it's the best thing you can do.
Jsoup, Jsoup, Jsoup! I've used all of those, and it's my favorite by a long shot. You can use it to build documents, plus it brings a lot of the magic of jQuery-style traversing alongside the best HTML document parsing I've seen to date in a Java library. I'm so happy with it that I don't mind shamelessly promoting it. ;)
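For the paragraph-building case in the question, a minimal sketch with jsoup might look like this (hedged: getParagraphs() and entity are the questioner's own code, assumed to return a List<String>, and Document here is org.jsoup.nodes.Document):
Document doc = Document.createShell(""); // empty <html><head></head><body></body></html> shell
for (String paragraph : getParagraphs(entity.getFile())) {
    doc.body().appendElement("p").text(paragraph); // text() sets (and escapes) the element's text
}
String html = doc.outerHtml();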
There are lots of template libraries for Java, from JSP to FreeMarker, and from specific implementations in various frameworks (Spring?) to generic libraries like StringTemplate.
The most difficult task is... to make a choice.
In general, these libraries let you make a skeleton of a Web page, with "holes" to fill in with variables. It is the simplest approach, and it often works well with tooling.
If you really want to build from the DOM, you can just use an XML library and generate XHTML.
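For instance, a rough sketch with the JDK's built-in DOM and Transformer APIs (no extra dependency; the classes come from javax.xml.parsers, javax.xml.transform and java.io, the element names are only illustrative, and exception handling is omitted):
org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
org.w3c.dom.Element html = doc.createElement("html");
org.w3c.dom.Element body = doc.createElement("body");
org.w3c.dom.Element p = doc.createElement("p");
p.setTextContent("Some paragraph text");
body.appendChild(p);
html.appendChild(body);
doc.appendChild(html);
// serialize the DOM back to a String
StringWriter out = new StringWriter();
TransformerFactory.newInstance().newTransformer()
        .transform(new DOMSource(doc), new StreamResult(out));
String xhtml = out.toString();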
If you are interested in HtmlCleaner in particular, it is actually a very convenient choice for building HTML documents.
But you should know that to set content on a TagNode, you append a child ContentNode element :-)
List<String> paragraphs = getParagraphs(entity.getFile());
List<TagNode> pNodes = new ArrayList<TagNode>();
TagNode html = new TagNode("html");
for (String paragraph : paragraphs) {
    TagNode p = new TagNode("p");
    p.addChild(new ContentNode(paragraph));
    pNodes.add(p);
}
html.addChildren(pNodes);
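If you then need the result back as a String, HtmlCleaner also ships serializers; something along these lines should work (a sketch, with exception handling omitted):
CleanerProperties props = new CleanerProperties();
String output = new SimpleHtmlSerializer(props).getAsString(html);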
jwebutils -- A library for creating HTML 5 markup using Java. It also contains support for creating JSON and CSS 3 markup.
Jakarta Element Construction Set (ECS) - A Java API for generating elements for various markup languages; it directly supports HTML 4.0 and XML. Now retired, but some folks really like it.
Related
I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both the tasks.
I want to use a lightweight HTML parser, because HtmlUnit takes a lot of time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need
Speed
Ease of locating any HtmlElement by its "id", "name" or "tag type".
It would be OK for me if it doesn't clean up dirty HTML code. I don't need to clean any HTML source; I just need the easiest way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();
See the Selector javadoc for more info.
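Since the question asks about locating elements by "id", "name" or "tag type", those map directly as well (a quick sketch; the id and name values below are just placeholders):
Element byId = doc.getElementById("content");          // by id
Elements byTag = doc.getElementsByTag("div");           // by tag type
Elements byName = doc.select("input[name=username]");   // by name attribute, via a CSS selector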
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
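For example (a sketch: htmlString and the //div[@id='content'] expression are illustrative, evaluateXPath supports only a subset of XPath 1.0, and the XPatherException is left unhandled):
TagNode root = new HtmlCleaner().clean(htmlString);
Object[] found = root.evaluateXPath("//div[@id='content']");
if (found.length > 0) {
    TagNode content = (TagNode) found[0];
    System.out.println(content.getText());
}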
For other html parsers see this SO question.
I suggest the Validator.nu parser, based on the HTML5 parsing algorithm. It is the parser Mozilla has used since 2010-05-03.
Do you know of any good lightweight library in Java for making a good and safe HTML representation of user input? That's a very generic task, I think. Consider: a user leaves a comment on the blog; my task is to convert the user comment into safe and nice HTML content.
Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.
String unsafe =
"<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
Excerpt from the Jsoup Cookbook.
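If basic() is too strict or too loose for your comments, the whitelist can be adjusted; a hedged example (the tag, attribute and protocol choices below are only illustrative):
Whitelist custom = Whitelist.basic()
        .addTags("img")
        .addAttributes("img", "src", "alt")
        .addProtocols("img", "src", "http", "https");
String safeHtml = Jsoup.clean(unsafe, custom);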
HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy-to-use JavaBeans.
HTML Parser
Open Source HTML Parsers in Java
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API can get the different textual parts/blocks, splitting them up in some manner rather than returning everything as one single text (which is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the main content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: all pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes from the HTML, and consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if possible).
I tried a way of parsing the HTML content described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the content.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that really doesn't have any article in it, but has some content in respect to links, but may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use some libs like goose. It works best on articles/news.
You can also check the JavaScript code of the readability bookmarklet, which does extraction similar to goose.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the value of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. do not retrieve div elements)?
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with java.
You use a provided application to design XML files that are read by the RoboServer API to parse web pages. The XML files are built by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using such software and would rather do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate the tags and then build per-site rules.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below for filtering the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could separate the navigation texts, preview texts, etc. from the main textual content.
Is there a jQuery-like Java/Android library that uses CSS selectors to parse XML?
Like:
String desc = myXML.find("bloc[type=pro]").get(0).attr("description");
Chainability is also what I'm looking for, in the same way as jQuery...
I hope this exists !
While initially designed as an HTML parser with CSS selector support, Jsoup works fine for XML documents as well if your sole intent is to extract data, not to manipulate it.
Document document = Jsoup.parse(xmlString);
String desc = document.select("bloc[type=pro]").get(0).attr("description");
// ...
You see, the syntax is almost identical to what you had in the question.
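If you need the input treated as plain XML rather than normalized into an html/head/body structure, recent jsoup versions also offer an XML parser mode (a sketch; Parser is org.jsoup.parser.Parser):
Document document = Jsoup.parse(xmlString, "", Parser.xmlParser());
String desc = document.select("bloc[type=pro]").first().attr("description");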
Jericho (the Jericho HTML Parser) is what you are looking for.
Your example would look like:
String desc = source.getFirstElement( "type", "pro" ).getAttributeValue( "description" );
It's a charm to parse HTML with Jericho, so I guess it's even easier for well-structured XML.
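For completeness, the source object above would be built roughly like this (a sketch: xmlString stands for your XML as a String, the URL is a placeholder, and the IOException from the URL constructor is not handled):
Source source = new Source(xmlString);                                 // from a String/CharSequence
// Source source = new Source(new URL("http://example.com/data.xml")); // or from a URL
source.fullSequentialParse();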
Since there are some bugs in other libraries like Jsoup, and Jericho is different from what I was expecting, I wrote a class extending org.xml.sax.helpers.DefaultHandler which parses the XML. I then wrote two other classes, modelled on Element and Elements from Jsoup, containing two functions: find, which handles the CSS3 selector, and attr, which returns the attribute value.
I'm now cleaning up and commenting that code... I'll post the library later for whoever is interested.
xmlDoc.find("bloc[type=Pro]>act").attr("label");
is now possible, just like in jQuery!
Edit!
Here is the link to the code for whoever is interested: Google Code Project
Moving to GitHub: https://github.com/ChristopheCVB/JavaXMLQuery
I use XPath to solve that issue. XML parsers like JDOM are fine for doing the XPath part. Maybe jQuery should look at how XPath works :p
//bloc[@type="pro"][1]/@description
XPath indexes start from 1, not 0.
https://www.w3schools.com/xml/xpath_syntax.asp
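With the JDK's built-in javax.xml.xpath API that is only a couple of lines (a sketch: xmlString stands for your XML as a String, InputSource is org.xml.sax.InputSource, bloc/description follow the example above, and the XPathExpressionException is left unhandled):
XPath xpath = XPathFactory.newInstance().newXPath();
String desc = xpath.evaluate("//bloc[@type='pro'][1]/@description",
        new InputSource(new StringReader(xmlString)));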
The droidQuery library can do many of the things you are looking for. Although the syntax is a little different, you can:
Get view attributes using chained, jQuery-style commands, such as:
CharSequence text = $.with(this, R.id.myTextView).attr("text");
Parse XML:
Document dom = $.parseXML(myXMLString);
If you are a fan of jQuery, you will be pleased to see that nearly all of the features it provides are included in droidQuery, and although the syntax may differ at times, its major goal is to be as syntactically close to jQuery as possible.
In Groovy, how do I grab a web page and remove HTML tags, etc., leaving only the document's text? I'd like the results dumped into a collection so I can build a word frequency counter.
Finally, let me mention again that I'd like to do this in Groovy.
Assuming you want to do this with Groovy (guessing based on the groovy tag), your approaches are likely to be either heavily shell-script-oriented or using Java libraries. In the case of shell scripting, I would agree with moogs that using Lynx or Elinks is probably the easiest way to go about it. Otherwise have a look at HTMLParser and see Processing Every Word in a File (scroll down to find the relevant code snippet).
You're probably stuck with finding Java libs for use with Groovy for the HTML parsing, as it doesn't appear there are any Groovy libs for it. If you're not using Groovy, then please post the desired language, since there are a multitude of HTML to text tools out there, depending on what language you're working in.
If you want a collection of tokenized words from HTML then can't you just parse it like XML (needs to be valid XML) and grab all of the text between tags? How about something like this:
def records = new XmlSlurper().parseText(YOURHTMLSTRING)
def allNodes = records.depthFirst().collect{ it }
def list = []
allNodes.each {
    it.text().tokenize().each {
        list << it
    }
}
You can use the Lynx Web Browser to spit out the document text and save it.
Do you want to do this automatically? Do you want a separate application that does this? Or do you want help coding it into your application? What platforms (Windows desktop, web server, etc.) will it run on?