I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.
What technique/library would you advise?
Updates/Remarks
Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
It should be really simple.
You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#openStream() to parse the content of HTML pages hosted on the Internet.
You could look at how HttpUnit does it. It uses a couple of decent HTML parsers; one is NekoHTML.
As far as getting the data goes, you can use what's built into the JDK (HttpURLConnection), or use Apache's HttpClient:
http://hc.apache.org/httpclient-3.x/
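For the JDK-only route, a minimal sketch might look like the following. The class name `PageFetcher` and the 10-second timeouts are my own choices, and falling back to UTF-8 when the server sends no charset is an assumption, not a rule.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PageFetcher {

    /** Picks the charset out of a Content-Type header value; falls back to UTF-8 (an assumption). */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the default below
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Downloads the page at the given address and decodes it to a String. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), charsetOf(conn.getContentType()));
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling `PageFetcher.fetch("http://example.com/")` would return the page source as one string, ready to hand to whichever parser you pick.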
If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):
<table>
{
for $d in //td[contains(a/small/text(), "New York, NY")]
for $row in $d/parent::tr/parent::table/tr
where contains($d/a/small/text()[1], "New York")
return <tr><td>{data($row/td[1])}</td>
<td>{data($row/td[2])}</td>
<td>{$row/td[3]//img}</td> </tr>
}
</table>
In short, you can either parse the whole page and pick out the things you need (for speed I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the markup. You can also convert it all into a DOM, but that's going to be expensive, especially if you're aiming for decent throughput.
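To give an idea of the SAX approach, here is a rough sketch that collects the text of every element with a given name. It assumes the input is already well-formed XML (real-world HTML usually needs a cleanup pass first), and the class name `SaxTitleExtractor` is made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTitleExtractor {

    /** Returns the trimmed text of every element named tagName, in document order. */
    static List<String> extract(String xhtml, String tagName) {
        List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside a matching element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals(tagName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.append(ch, start, length); // also picks up text of nested child elements
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && qName.equals(tagName)) {
                    texts.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return texts;
    }
}
```

Nothing but the matching text is kept in memory, which is why this tends to be cheaper than building a full DOM.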
You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (as each site's format will differ) that parses the HTML source and extracts the text. You would probably use Java's IO API to connect to the URL and stream the data via InputStreams.
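The adapter-per-site idea could be sketched roughly like this. The names `SiteAdapter` and `Aggregator` are hypothetical, and keying adapters by host name is just one possible design.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One implementation per source site; it knows how to pull items out of that site's markup. */
interface SiteAdapter {
    List<String> extractItems(String html);
}

final class Aggregator {
    private final Map<String, SiteAdapter> adaptersByHost = new LinkedHashMap<>();

    /** Associates a host name with the adapter that understands its page format. */
    void register(String host, SiteAdapter adapter) {
        adaptersByHost.put(host, adapter);
    }

    /** Delegates extraction to the adapter registered for the host. */
    List<String> extract(String host, String html) {
        SiteAdapter adapter = adaptersByHost.get(host);
        if (adapter == null) {
            throw new IllegalArgumentException("No adapter registered for " + host);
        }
        return adapter.extractItems(html);
    }
}
```

When a site changes its layout, only that site's adapter has to be touched; the fetching and aggregation code stays the same.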
If you want to do it the old-fashioned way, you need to connect with a socket to the web server's port, and then send the following data:
GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>
Then use Socket#getInputStream, read the data using a BufferedReader, and parse it using whatever you like.
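In Java that might look something like the sketch below. The class name `RawHttpGet` is made up, and it only handles plain HTTP (no TLS, no redirects, no chunked encoding), so treat it as an illustration rather than a usable client.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class RawHttpGet {

    /** Builds the request text: request line, Host header, then an empty line. */
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n";
    }

    /** Sends the request over a plain socket and returns status line, headers, and body as one string. */
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // With HTTP/1.0 the server closes the connection after the response,
            // so reading until end of stream yields the whole reply.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }
}
```

Note that the line endings on the wire are CRLF, and the empty line after the headers is what tells the server the request is complete.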
You can use NekoHTML to parse your HTML document. You will get a DOM document, and you can use XPath to retrieve the data you need.
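Once you have a DOM, the XPath part needs nothing beyond the JDK. In this sketch the JDK's own XML parser stands in for NekoHTML, so it assumes well-formed input; the class name `XPathTextExtractor` is invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XPathTextExtractor {

    /** Returns the trimmed text content of every node selected by the XPath expression. */
    static List<String> select(String xhtml, String expression) {
        try {
            // Stand-in for the HTML parser: this only accepts well-formed XML.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException
                 | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }
}
```

An expression such as `//div[@id='questions']//h3/a` would then pick the question links and skip the navbar, assuming the page is structured that way.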
If your "web sources" are regular websites using HTML (as opposed to a structured XML format like RSS), I would suggest taking a look at HtmlUnit.
This library, while targeted at testing, is really a general-purpose "Java browser". It is built on Apache HttpClient, the NekoHTML parser, and Rhino for JavaScript support. It provides a really nice API to the web page and makes it easy to traverse a website.
Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
If you absolutely MUST scrape content, look for microformats in the markup; most blogs (especially WordPress-based blogs) have them by default. There are also libraries and parsers available for locating and extracting microformats from web pages.
Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.
Check this out http://www.alchemyapi.com/api/demo.html
They return pretty good results and have an SDK for most platforms. They do not only text extraction but also keyword analysis, etc.
Related
How can I open a website and return some information from it in Java? For example, I want to go to http://xyz.com, enter my family name, and get back my national code.
You can use java.net.HttpURLConnection to connect to a website. For scraping information from the loaded website you can use a Java HTML Parser library (for example JSoup) to be able to traverse through the DOM and/or retrieve relevant pieces of information from the DOM.
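For the "enter my family name" part, submitting a form with HttpURLConnection could look roughly like this. The class name `FormPoster` is made up, and the real field names depend entirely on the target site's form, so any name you pass in is an assumption until you have inspected the page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

final class FormPoster {

    /** Encodes fields as application/x-www-form-urlencoded, e.g. a=1&b=2. */
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** POSTs the fields to the address and returns the response body (assumed to be UTF-8). */
    static String post(String address, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.US_ASCII);
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

The returned HTML can then be handed to an HTML parser to pick out the value you are after.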
With Selenium, which is a tool for testing web applications, you can do everything you describe. Check its documentation.
Here is an example case in Java.
If that site returns information in XML format, then it's possible to do XML parsing to get the result you desire.
SAX is really handy in XML parsing in these cases.
What are some good open source java libraries to search and scrape data out of a web page and stick it into a database. For example, suppose I had a page such as:
<tr><td><b>Address:</b></td>
<td colspan=3>123 My Street </td></tr>
"Address:" is the key, but I'm actually trying to get "123 My Street" which has a bunch of html tags and spaces in between. Ideally I want to get the value between the td that follows the string "Address:". It seems like JSoup can do the find, but I didn't see a good example on how to do the offset (I may have missed it). Is there a library that handles key/value?
I'd also be interested in learning about any open source (MIT/Apache) initiatives for UI scripting similar to the Kapow Extraction Browser.
Thanks.
Try Web-Harvest.
It's an open-source crawler written in Java.
It can be used as a Java library, as a command-line application, or with its standalone IDE.
You can use the <xpath> element to extract any value from the XHTML document.
This is a good list of open source parsers: http://java-source.net/open-source/html-parsers
I've used TagSoup with great success for parsing tens of thousands of web pages in the wild. As for the "key-value" relationship, that's something you'll have to deal with yourself.
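One way to handle the key/value part yourself is an XPath expression that finds the cell containing the label and then takes the next cell in the same row. This is only a sketch: it assumes the page has already been cleaned into well-formed XML (for example `colspan="3"` with quotes), and the class name `LabelValueExtractor` is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class LabelValueExtractor {

    /** Returns the text of the first td following the td whose text equals label, or "" if none. */
    static String valueFor(String xhtml, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Pass the label in as an XPath variable so quotes in it cannot break the expression.
            xpath.setXPathVariableResolver(
                    name -> "label".equals(name.getLocalPart()) ? label : null);
            return xpath.evaluate(
                    "normalize-space(//td[normalize-space(.) = $label]/following-sibling::td[1])",
                    doc);
        } catch (ParserConfigurationException | SAXException | IOException
                 | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }
}
```

`normalize-space` takes care of the stray spaces and line breaks around the value, and the nested `<b>` tag does not matter because the comparison is against the cell's text content.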
I have to extract some information from a web page, and reformat it for the user.
Since the web page is somewhat regular, I currently use HttpClient to retrieve the HTML as a string, and I extract substrings at given locations that hold the relevant data.
Anyhow I'm wondering if there is a better way, maybe an HTML-aware way. How would you do it?
Cheers
Ideally, you should use a real HTML-parser. I've used Jsoup successfully in the past on Android:
http://jsoup.org/
I personally like to use Jericho parser: http://jericho.htmlparser.net/docs/index.html
It is easy to use, has many examples on the project's page, and deals well with real-world HTML (unclosed tags etc.).
We've used HttpUnit to do this in the past.
jsoup.org is better, but Cobra also has some additional features (it is CSS-aware and JavaScript-aware).
I would like to implement a content management system with an RDBMS in Java/J2EE, and would like to know the best practices for handling input HTML content.
Below are a few of the doubts I have; I am sure there are lots of other things to take care of.
Do we need to escape HTML tags and special characters before we save HTML content to the database?
How do we validate/remove invalid special symbols in large input HTML content?
What are the best practices for displaying HTML content from the database back in the browser?
Are there any security risks involved in handling HTML content?
Looking forward to seeing some great ideas from the gurus!
Use a tool like Neko to clean up the HTML into XHTML, then use any XML parser to parse it.
I recently tried out some HTML clean-up libraries, and the best I came across was the Cobra HTML Renderer and Parser, which seems to be faster than the others and also manages to convert dirtier HTML to XHTML. I first went for HTML Tidy, but it ended up complaining about "Unparseable HTML" way too often.
What I'd strongly discourage you from doing is to use a REGEX ;-)
I am not a guru in this, but I think you will have to figure out how to deal with some special characters and escape sequences, such as quotes (both double and single), etc.
Maybe you can try replacing those special characters and escape sequences with some other characters.
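As a rough illustration of that replacement idea, the usual approach is to swap the characters that have special meaning in HTML for character references when the text is written out as plain text. The class name `HtmlEscaper` is made up, and this covers text and quoted attribute values only, not content placed inside script or style blocks.

```java
final class HtmlEscaper {

    /** Replaces the five characters that are significant in HTML with character references. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

If users are meant to submit real HTML that should render as HTML, escaping alone is not enough and the markup needs to be cleaned against a whitelist of allowed tags instead.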
Maybe someone else who is currently dealing with a CMS can help you out. Anyway, cheers!
I would recommend looking at the architecture and design of an open source CMS like Alfresco or Apache Jackrabbit.
These are actual content repositories and will not contain end-to-end integration most likely, but can show you an underlying data model that is a good place to start.
I would also recommend you check out OWASP for information on web application security and vulnerabilities, and in particular security issues relevant to Java developers.
I am creating a tool that will check dynamically generated XHTML and validate it against expected contents.
I need to confirm the structure is correct and that specific attributes exist/match. There may be other attributes which I'm not interested in, so a direct string comparison is not suitable.
One way of validating this is with XPath, and I have implemented this already, but I would also like something less verbose - I want to be able to use CSS Selectors, like I can with jQuery, but on the server - within CFML code - as opposed to on the client.
Is there a CFML or Java library that allows me to use CSS Selectors against an XHTML string?
I've just released an open source project which is a W3C CSS Selectors Level 3 implementation in Java. Please give it a try. I was looking for the same thing and decided to implement my own engine. It's inspired by the code in WebKit etc.
http://github.com/chrsan/css-selectors/tree
I don't know of a Java library itself, but there is a Ruby library called Hpricot that does exactly what you're looking for. In conjunction with JRuby, the Ruby implementation on the Java platform, it should be relatively straightforward to call Ruby methods from your Java code (using BSF, the JSR-223 Scripting APIs, or an internal API).
Are you using ColdFusion 8? ColdFusion 8, being based on Java 6, supports the JSR-223 Scripting APIs ("javax.script").
Take a look at this blog entry on embedding PHP within CFML. You should be able to do the same with Ruby. There is a ZIP file of example code linked from that blog posting, and if you crack open the CFML, you'll see a good example of embedding Ruby within CFML.
It might take a bit of work to make all the pieces fit together, but with a bit of investment, it should give you the robust parsing and CSS selector querying that you're looking for.
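On the Java side, looking up a script engine through `javax.script` is short. The sketch below only discovers engines; the class name `ScriptEngines` is made up, and whether a Ruby engine shows up depends on having the JRuby jar on the classpath, which I am assuming rather than demonstrating.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

final class ScriptEngines {

    /** Short names of all script engines discoverable on the classpath. */
    static List<String> available() {
        List<String> names = new ArrayList<>();
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            names.addAll(factory.getNames());
        }
        return names;
    }

    /** Looks up an engine by short name; empty if no such engine is installed. */
    static Optional<ScriptEngine> find(String shortName) {
        return Optional.ofNullable(new ScriptEngineManager().getEngineByName(shortName));
    }
}
```

If `find` returns an engine, calling `eval` on it with a string of script source runs that script, which is where the Hpricot calls would go.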
Hpricot is definitely a fantastic solution if the JRuby route is open to you.
Wrt. XPath being the "correct" way to access XML documents... sorry but this is rubbish. There are numerous ways to access elements of an XML document: DOM traversal, XPath, XQuery, CSS selectors to name a few. XPath is certainly popular but CSS selectors are very very powerful, assuming your XML document has HTML semantics.
If you can use PHP within your CFML (as mentioned above), you could take advantage of this excellent "jQuery for PHP" library, phpQuery
Full CSS selector support, manipulation functions, traversing, etc. It should work great for what you need.
Hope it helps.
There is a theoretical difference between the server and client. To a web browser, the document is a living DOM hierarchy. To your server code it's merely an XML document of whatever type. XPath is the "correct" way to access elements of an XML document.
So unless you have a serious performance problem with your current XPath solution, or it doesn't actually work correctly, I suggest you stick with it. Trying something too clever brings the risk of breaking something that's working.
If you find the XPath to be too verbose and ugly to leave sitting around, or want more power to re-use the tool in different cases, or just can't resist trying to do something clever, then you could try writing a utility that compiles a given CSS selector into an XPath. You could then call this in one line whenever you needed.
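A toy version of such a compiler, just to show the shape of it, might look like the following. It is an assumption-laden sketch: the class name `CssToXPath` is invented, and it only understands tag names, `*`, `#id`, `.class`, and the descendant and child (`>`) combinators, nowhere near the full selector syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssToXPath {
    private static final Pattern STEP =
            Pattern.compile("([A-Za-z][A-Za-z0-9]*|\\*)?((?:[#.][A-Za-z0-9_-]+)*)");
    private static final Pattern PART = Pattern.compile("([#.])([A-Za-z0-9_-]+)");

    /** Translates a small subset of CSS selectors into an equivalent XPath expression. */
    static String compile(String selector) {
        StringBuilder xpath = new StringBuilder();
        String axis = "//"; // the first step may match anywhere in the document
        for (String token : selector.trim().replace(">", " > ").split("\\s+")) {
            if (token.equals(">")) {
                axis = "/"; // child combinator: next step must be a direct child
                continue;
            }
            xpath.append(axis).append(compileStep(token));
            axis = "//"; // whitespace between steps means "any descendant"
        }
        return xpath.toString();
    }

    /** Turns one compound selector such as div#main.note into one XPath step. */
    private static String compileStep(String simple) {
        Matcher m = STEP.matcher(simple);
        if (simple.isEmpty() || !m.matches()) {
            throw new IllegalArgumentException("Unsupported selector: " + simple);
        }
        StringBuilder step = new StringBuilder(m.group(1) == null ? "*" : m.group(1));
        Matcher part = PART.matcher(m.group(2));
        while (part.find()) {
            if (part.group(1).equals("#")) {
                step.append("[@id='").append(part.group(2)).append("']");
            } else {
                // Match one class name inside a space-separated class attribute.
                step.append("[contains(concat(' ', normalize-space(@class), ' '), ' ")
                    .append(part.group(2)).append(" ')]");
            }
        }
        return step.toString();
    }
}
```

With this, `CssToXPath.compile("ul > li")` gives `//ul/li`, and the result can be fed straight into the XPath code you already have.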
It may be easier to use cQuery.com. cQuery.com is an API-based "Content Query Engine" that extracts content from live websites by using CSS selectors.
You can use it programmatically in your application.