Can you scrape a CSV using Jsoup? - java

I am looking for a Java tool to scrape a CSV from a website and then parse the data. Jsoup seems like a viable option. Is there a way to scrape a CSV file and then save the information to a database using Jsoup?
Or is it strictly for scraping HTML code? Thanks.

No, that won't work. Look at the Jsoup description:
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
What you are asking about is how to parse a CSV file in Java. This question might be helpful for you:
Fast CSV parsing
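For completeness, here is a minimal sketch of the fetch-and-parse half in plain Java, assuming a hypothetical URL; note that the naive split(",") only works for fields without quotes or embedded commas, which is where a dedicated CSV parser (e.g. OpenCSV) earns its keep:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CsvFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; point this at the real CSV file
        URL url = new URL("http://example.com/data.csv");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive split; a real CSV parser handles quoted fields
                String[] fields = line.split(",");
                // ... insert the fields into your database here ...
                System.out.println(String.join(" | ", fields));
            }
        }
    }
}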

Related

Return some information from a website in java

How can I open a website and return some information from it in Java? For example, I want to go to http://xyz.com, enter my family name, and return my national code.
You can use java.net.HttpURLConnection to connect to a website. For scraping information from the loaded website you can use a Java HTML Parser library (for example JSoup) to be able to traverse through the DOM and/or retrieve relevant pieces of information from the DOM.
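As a rough sketch of that combination (the URL, form field name, and CSS selector below are hypothetical placeholders, not taken from a real site), Jsoup can both submit the form and query the resulting page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NationalCodeLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical form URL and field name
        Document doc = Jsoup.connect("http://xyz.com/search")
                .data("familyName", "Smith") // submit the family name
                .post();
        // Hypothetical CSS selector for the element holding the result
        String code = doc.select("#national-code").text();
        System.out.println(code);
    }
}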
Selenium, a tool for testing web applications, can do everything you describe. Try checking its documentation.
Here is an example use case in Java.
If that site returns information in XML format, then it's possible to do XML parsing to get the result you desire.
SAX is really handy for XML parsing in these cases.
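A minimal SAX sketch, assuming a hypothetical XML endpoint: the handler receives callbacks while the parser streams through the document, so nothing large is held in memory:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Hypothetical endpoint returning XML
        parser.parse("http://example.com/data.xml", new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Called once per opening tag as the parser streams along
                System.out.println("element: " + qName);
            }
        });
    }
}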

Reading HTML+JavaScript using Java

I can read HTML contents via HTTP (for example, from http://www.foo.com) using Java (with the URL and BufferedReader classes). However, a couple of the pages contain JavaScript. My current app cannot process JavaScript.
What's the best way to read HTML content that uses JavaScript from Java?
I am open to using other languages if that is easier.
Thanks in advance for your help.
UPDATE - Clarification:
Some of the HTML content is generated dynamically using JavaScript. I can see the result (as pure HTML, after the JavaScript processing) when viewing the pages in a browser.
On the other hand, when my Java app retrieves the HTML contents, the JavaScript-generated parts are missing, because my app does not execute the JavaScript.
Ideally, I want to be able to get the same result as on the browser using my Java app.
Thanks for everyone's response.
HtmlUnit has good JavaScript support and should parse the HTML (almost) like a web browser does.
http://htmlunit.sourceforge.net/
http://htmlunit.sourceforge.net/javascript.html
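A minimal HtmlUnit sketch (the com.gargoylesoftware package name and the try-with-resources usage match the later 2.x releases; HtmlUnit 3.x moved to the org.htmlunit package): getPage executes the page's JavaScript before returning, so asXml() reflects the DOM a browser would show:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedPageDump {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // getPage runs the page's scripts before returning
            HtmlPage page = webClient.getPage("http://www.foo.com");
            // The serialized DOM after JavaScript execution
            System.out.println(page.asXml());
        }
    }
}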
Cobra (http://lobobrowser.org/cobra/getting-started.jsp) will fit your needs
For just HTML parsing you can use HTMLParser (org.htmlparser). However, from the way you describe your problem, it seems you need a browser, because executing JavaScript is totally different from just parsing it. Cheers.
Without a doubt, you need to use a Java HTML parser:
Java Open Source HTML Parsers
Which Html Parser is best?
HTML/XML Parser for Java
HTML PARSER in java [closed]

Getting the DOM for an xhtml document in Java

I'm building a mini web browser that parses XHTML and Javascript using Javacc and Java, and I need to build the DOM. Is there any tool that can help me get the DOM and manipulate its nodes without having to build it manually as my browser parses the document?
Try using JDOM or Dom4J, or read this question about XML parsers for Java.
If you want to handle HTML as found in the wild, try using JTidy, which will attempt to recover badly formatted HTML for you before rendering it to a DOM.
I'm not sure why you think you need JavaCC to parse an XHTML document. If it's truly valid XHTML, then it's valid XML, and that means that any XML DOM parser will be able to deliver a DOM that you can manipulate. Why not just use the DOM parser that's built into Java or Xerces from Apache or JDOM or DOM4J? Writing your own using JavaCC can be a valuable learning exercise, but I doubt that it'll be better than what you already have available to you.
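To illustrate that point, here is a sketch using only the JDK's built-in DOM parser on a local XHTML file (the filename is a placeholder; strict XHTML that references a DTD may additionally need an entity resolver to avoid fetching the DTD over the network):

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XhtmlDom {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Valid XHTML is valid XML, so the stock parser handles it
        Document doc = builder.parse(new File("page.xhtml"));
        NodeList links = doc.getElementsByTagName("a");
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getTextContent());
        }
    }
}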

How do you grab a text from webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.
What technique/library would you advice?
Updates/Remarks
Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
It should be really simple.
You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#getInputStream() to parse the content of HTML pages hosted on the Internet.
You could look at how HttpUnit does it. It uses a couple of decent HTML parsers; one is NekoHTML.
As far as getting the data, you can use what's built into the JDK (HttpURLConnection), or use Apache's HttpClient:
http://hc.apache.org/httpclient-3.x/
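For the Apache route, a sketch against the (now end-of-life) HttpClient 3.x API linked above; the URL is a placeholder:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class HttpClientFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod("http://example.com/"); // placeholder
        try {
            int status = client.executeMethod(get);
            System.out.println("HTTP status: " + status);
            System.out.println(get.getResponseBodyAsString());
        } finally {
            get.releaseConnection(); // return the connection to the pool
        }
    }
}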
If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):
<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return
    <tr>
      <td>{data($row/td[1])}</td>
      <td>{data($row/td[2])}</td>
      <td>{$row/td[3]//img}</td>
    </tr>
}
</table>
In short, you may either parse the whole page and pick out the things you need (for speed, I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the HTML... You can also convert it all into a DOM, but that's going to be expensive, especially if you're shooting for decent throughput.
You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (as each site's format will differ) that parses the HTML source and extracts the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
If you want to do it the old-fashioned way, you need to connect with a socket to the web server's port and then send the following data:
GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>
Then use Socket#getInputStream, read the data using a BufferedReader, and parse the data using whatever you like.
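Here is what that looks like as a self-contained sketch (site.com is the placeholder host from the request above); note that HTTP requires CRLF line endings and a blank line to terminate the headers:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawHttpGet {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("site.com", 80); // placeholder host
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            // HTTP headers end with CRLF; a blank line terminates them
            out.print("GET /file.html HTTP/1.0\r\n");
            out.print("Host: site.com\r\n");
            out.print("\r\n");
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}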
You can use NekoHTML to parse your HTML document. You will get a DOM Document, and you can use XPath to retrieve the data you need.
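A short sketch of the XPath half using the JDK's javax.xml.xpath API; for brevity it parses a tiny well-formed snippet with the built-in XML parser, whereas in practice the Document would come from an HTML-tolerant parser such as NekoHTML:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    public static void main(String[] args) throws Exception {
        // Tiny well-formed stand-in for a parsed page
        String html = "<html><body><a href='/a.html'>A</a>"
                + "<a href='/b.html'>B</a></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every href attribute in the document
        NodeList hrefs = (NodeList) xpath.evaluate(
                "//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < hrefs.getLength(); i++) {
            System.out.println(hrefs.item(i).getNodeValue());
        }
    }
}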
If your "web sources" are regular websites using HTML (as opposed to structured XML format like RSS) I would suggest to take a look at HTMLUnit.
This library, while targeted for testing, is a really general purpose "Java browser". It is built on a Apache httpclient, Nekohtml parser and Rhino for Javascript support. It provides a really nice API to the web page and allows to traverse website easily.
Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
If you absolutely MUST scrape content, look for microformats in the markup; most blogs (especially WordPress-based blogs) have them by default. There are also libraries and parsers available for locating and extracting microformats from webpages.
Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.
Check this out: http://www.alchemyapi.com/api/demo.html
They return pretty good results and have an SDK for most platforms. They do not only text extraction but also keyword analysis, etc.

How to parse javascript for links with java?

I'm writing a program (in Java) that needs to extract links from webpages. I'm using HTMLParser (http://htmlparser.sourceforge.net/), but I'm only able to extract HTML links (defined with <a href="...">) and I don't know how to handle JavaScript code to extract links from it. Can you help me?
You can use Rhino together with a DOM environment written in JavaScript (env.js).
By the way, that environment was written by John Resig.
HTML Parser from SourceForge is useful; I have used it to parse a whole bunch of HTML already. However, parsing JavaScript is a different matter. Cheers.
Rhino is probably the most comprehensive tool out there. Everything you want to do can be done with Rhino.
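As a made-up illustration of the idea (the script and variable name below are hypothetical): Rhino can evaluate JavaScript extracted from a page, after which you can read link values out of its scope instead of regexing the source:

import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

public class RhinoSketch {
    public static void main(String[] args) {
        Context cx = Context.enter();
        try {
            Scriptable scope = cx.initStandardObjects();
            // Hypothetical snippet as it might appear in a scraped page
            String js = "var nextPage = '/page2.html';";
            cx.evaluateString(scope, js, "inline-script", 1, null);
            // Read the link straight out of the JavaScript scope
            Object link = scope.get("nextPage", scope);
            System.out.println(Context.toString(link));
        } finally {
            Context.exit();
        }
    }
}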
