How to use HTML parsing and curl in Java for this task? - java

I'm trying to write a program that takes company names from a text file and searches for them on a search engine website (SEC's EDGAR search). Each search usually comes up with 1-10 unique search result links, so I want to use curl to follow the link with the relevant company name. The linked page has a brief summary containing the term "state of incorporation:" followed by the state name. I'm hoping to parse out the state name. I am having trouble understanding how to use HTML parsing and curl and their classes. I would appreciate any help possible, such as a brief outline of steps or just any advice at all. Thanks.

Assuming that the HTML is fairly basic, use something like the Mozilla Java HTML Parser. The getting started guide will give you more details on creating the DOM. Java has built-in APIs for downloading content from the web, and these will likely be sufficient for you (rather than using "curl").
Once you have a DOM, you can use the standard DOM APIs to navigate for the links and items that you want.
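A minimal sketch of the download step using the built-in java.net APIs mentioned above; the EDGAR URL, query parameters, and label text are placeholders to illustrate the idea, not the exact values the site uses:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class EdgarFetch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; the real EDGAR search URL and parameters will differ
        URL url = new URL("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&company=Example+Corp");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // some sites reject requests without one
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Naive line scan; with a DOM you would locate the element holding the label instead
                if (line.toLowerCase().contains("state of incorp")) {
                    System.out.println(line);
                }
            }
        }
    }
}

Once a real HTML parser has built a DOM for you, you would look up the element containing "state of incorporation:" and read the text that follows it, rather than scanning raw lines as above.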

Related

Fetch Data from URL Using Java

This question was asked during my interview, and I was asked to implement it. The question is as follows:
Your application
will take the username and password for a LinkedIn profile,
use them to log in on the page www.linkedin.com,
simulate clicking the Profile -> Edit Profile menu,
and scrape the resulting page for that user in the format below, dumping it into a text file (hint: you can use the Beautiful Soup library).
On fetching this URL, you need to read the following information and put it in a CSV/Excel file.
Can somebody give me an idea of how to do it? It should be done using Java only.
I'd use web browser automation software like Selenium http://www.seleniumhq.org/ which seems like it will solve this problem. You can choose any of its bindings (Java, C#, Ruby, Python, JavaScript) to implement the solution.
Take a look at the tutorials https://www.airpair.com/selenium/posts/selenium-tutorial-with-java
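A rough sketch of what the flow could look like with the Selenium Java binding; the element locators and URLs are assumptions and need to be checked against the live page, and ChromeDriver must be available on the machine:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class LinkedInLogin {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        driver.get("https://www.linkedin.com");
        // Locators below are placeholders; inspect the live page for the real ids/names
        driver.findElement(By.id("session_key")).sendKeys("user@example.com");
        driver.findElement(By.id("session_password")).sendKeys("secret");
        driver.findElement(By.cssSelector("button[type=submit]")).click();
        // Simulate clicking through to the profile edit page, then dump its visible text
        driver.get("https://www.linkedin.com/profile/edit");
        System.out.println(driver.findElement(By.tagName("body")).getText());
        driver.quit();
    }
}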
This seems related to web crawling, and we can do it very well using the JSOUP library.
You have to read up on implementation with the JSOUP library, and we can filter out the link, which looks something like
https://www.linkedin.com/profile/edit?trk=nav_responsive_sub_nav_edit_profile
Here, if you look, we have the keyword edit_profile, which can be used to filter out the results we require (see the sketch below).
A link you should follow to explore more about JSOUP:
Webcrawler using JSOUP
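A small sketch of that filtering idea with Jsoup; it assumes you can already fetch the page (Jsoup alone does not handle the LinkedIn login), and the URL is illustrative:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EditProfileLinkFilter {
    public static void main(String[] args) throws Exception {
        // Assumes you are already authenticated or the page is otherwise reachable
        Document doc = Jsoup.connect("https://www.linkedin.com").get();
        for (Element link : doc.select("a[href]")) {
            String href = link.attr("abs:href");
            // Keep only links containing the edit_profile keyword mentioned above
            if (href.contains("edit_profile")) {
                System.out.println(href);
            }
        }
    }
}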

Return some information from a website in Java

How can I open a website and return some information from it in Java? For example, I want to go to http://xyz.com, enter my family name, and return my national code.
You can use java.net.HttpURLConnection to connect to a website. For scraping information from the loaded website, you can use a Java HTML parser library (for example, JSoup) to traverse the DOM and retrieve the relevant pieces of information from it.
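As a hedged illustration, a form submission and lookup with Jsoup might look like this; the endpoint, form field name, and result selector are hypothetical and must be taken from the real page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NationalCodeLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical form endpoint, field name, and selector; inspect the real page to find them
        Document result = Jsoup.connect("http://xyz.com/search")
                .data("familyName", "Smith")
                .post();
        String nationalCode = result.select("#nationalCode").text();
        System.out.println(nationalCode);
    }
}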
With Selenium, a tool for testing web applications, you can do everything you describe. Try checking its documentation.
Here is an example case in Java.
If that site returns information in XML format, then it's possible to do XML parsing to get the result you desire.
SAX is really handy for XML parsing in these cases.
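For instance, a SAX handler along these lines could pull one value out of an XML response; the element name nationalCode and the URL are assumptions made for illustration:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class NationalCodeSaxHandler extends DefaultHandler {
    private boolean inNationalCode;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Hypothetical element name; depends entirely on the site's XML schema
        if ("nationalCode".equals(qName)) {
            inNationalCode = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inNationalCode) {
            System.out.println(new String(ch, start, length));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("nationalCode".equals(qName)) {
            inNationalCode = false;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Placeholder URL; point this at the site's real XML endpoint
        parser.parse("https://xyz.com/lookup?familyName=Smith", new NationalCodeSaxHandler());
    }
}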

Getting Data from Internet in Java

I thought of making the following application for my college project in Java. I know core Java. I want to know what I should read "specifically" for this project, as there is little time:
It will have an interface where you enter your query. This string would go as a query to internet search engines, and with the help of the search engine it would find the data (the first web page that we see; that is the data for my application this time :) ).
I do not want to display the data. I just want the HTML file or the source code of the generated web page. Does this sound like the Common Gateway Interface? I do not know about this.
But I think it serves the same purpose. If it does, please guide me on how to implement it.
Whatever it is, please specify:
Problem 1: What should I read? Any direct help at this point is not my intention. I want to implement it myself.
Problem 2: Does connecting to the internet require some JNLP knowledge too?
For example, when we search something on Google, it shows us links to websites. I can see the source code of this generated web page. I just want this page for my application to work on.
EDIT:
I do not want to rely only on Google or any particular web server. I want my application to decide that.
Please also refer to my Problem 2.
As I discovered that websites have Terms and Conditions, should I try to make my own crawler? Would my application then not be breaking the rules? Well, it's important to me.
Ashish,
Here is what I would recommend.
Learn the basics of JSON from these links (Introduction, lib download).
Then look at the Google Web Search JSON API here.
Learn how to GET data from servers using the HttpClient library here.
Now what you have to do is fire a GET request for the search, read the JSON response, parse the response using the JSON lib from #1, and you have the search results.
Most of the search engines (Bing etc.) offer JSON/REST APIs, so you can do the same for other search engines.
Note: JSON APIs are normally used from JavaScript on the UI side, but since it's very easy and quick to learn, I suggested it. You can also explore (if time permits) the XML-based APIs.
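A sketch of that flow with Apache HttpClient and the org.json library; the endpoint and the responseData/results layout follow the old Google Web Search API and are shown only as an illustration, so check the current docs of whichever search API you use:

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.json.JSONArray;
import org.json.JSONObject;

public class SearchApiExample {
    public static void main(String[] args) throws Exception {
        String query = "java html parsing";
        // Endpoint and response layout are illustrative; check the search API's current docs
        String url = "https://ajax.googleapis.com/ajax/services/search/web?v=1.0&q="
                + java.net.URLEncoder.encode(query, "UTF-8");
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            String body = EntityUtils.toString(client.execute(new HttpGet(url)).getEntity());
            // Assumed layout: a responseData object wrapping a results array
            JSONArray results = new JSONObject(body)
                    .getJSONObject("responseData")
                    .getJSONArray("results");
            for (int i = 0; i < results.length(); i++) {
                System.out.println(results.getJSONObject(i).getString("url"));
            }
        }
    }
}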
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://fooooo.com");
// Read the raw HTML of the page line by line and print it
try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
}
Should be enough to get you started.
And yes, do check that you are not violating the usage terms of a website. Search engines don't really like you trying to access them via a program.
Many, including Google, have APIs specifically designed for this purpose.
You can do everything you want using HtmlUnit. It's like a web browser, but for Java. Check some examples on their website.
Read "Working with URL's" in the Java tutorial to get an idea what is behind the available libs like HTMLUnit, HttpClient, etc
I do not want to display the data. I just want the HTML file or the source code of the generated web page.
You probably don't need the HTML either. Google provides its search results as a web service using this API. Similarly for other search engines, GIYF. You get the search results as XML, which is far easier for you to parse. Plus, the XML won't have any unwanted data like ads.

What are some good Java libraries to search and scrape data out of a web page?

What are some good open-source Java libraries to search and scrape data out of a web page and put it into a database? For example, suppose I had a page such as:
<tr><td><b>Address:</b></td>
<td colspan=3>123 My Street </td></tr>
"Address:" is the key, but I'm actually trying to get "123 My Street" which has a bunch of html tags and spaces in between. Ideally I want to get the value between the td that follows the string "Address:". It seems like JSoup can do the find, but I didn't see a good example on how to do the offset (I may have missed it). Is there a library that handles key/value?
I'd also be interested in learning about any open source (MIT/Apache) initiatives for UI scripting similar to the Kapow Extraction Browser.
Thanks.
Try Web-Harvest.
It's an open-source crawler written in Java.
It can be used as a Java library, as a command-line application, or with its standalone IDE.
You can use the <xpath> element to extract any value from the XHTML document.
This is a good list of open source parsers: http://java-source.net/open-source/html-parsers
I've used TagSoup with great success for parsing tens of thousands of web pages in the wild. As for the "key-value" relationship, that's something you'll have to deal with yourself.
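One way to handle that key/value step with JSoup (which the question above mentions) is to select the cell containing the key and then take its next sibling cell; a minimal sketch:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AddressScrape {
    public static void main(String[] args) {
        String html = "<table><tr><td><b>Address:</b></td>"
                + "<td colspan=3>123 My Street </td></tr></table>";
        Document doc = Jsoup.parse(html);
        // Find the cell whose text contains the key, then take its next sibling cell
        Element keyCell = doc.select("td:contains(Address:)").first();
        if (keyCell != null) {
            Element valueCell = keyCell.nextElementSibling();
            System.out.println(valueCell.text().trim()); // prints "123 My Street"
        }
    }
}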

How do you grab text from a webpage (Java)?

I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.
What technique/library would you advise?
Updates/Remarks
Speed doesn't matter, as long as it can parse about 5 MB of HTML in less than 10 minutes.
It should be really simple.
You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#getInputStream() to parse the content of HTML pages hosted on the Internet.
You could look at how HttpUnit does it. They use a couple of decent HTML parsers; one is NekoHTML.
As far as getting the data, you can use what's built into the JDK (HttpURLConnection), or use Apache's HttpClient:
http://hc.apache.org/httpclient-3.x/
If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):
<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td>
             <td>{data($row/td[2])}</td>
             <td>{$row/td[3]//img}</td></tr>
}
</table>
In short, you may either parse the whole page and pick the things you need (for speed, I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the HTML. You can also convert it all into a DOM, but that's going to be expensive, especially if you're shooting for decent throughput.
You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (as each site's format will differ), so you could parse the HTML source and extract the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
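A minimal sketch of what such a per-site adapter contract could look like (the interface and method names are illustrative, not from any existing library):

import java.util.List;

// Hypothetical plugin interface for the per-site adapter idea described above
public interface SiteAdapter {
    // True if this adapter knows how to handle the given source URL
    boolean supports(String url);

    // Parse the raw HTML of a page from that site and return the extracted text snippets
    List<String> extract(String html);
}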
If you want to do it the old-fashioned way, you need to connect with a socket to the webserver's port and then send the following data:
GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>
Then use Socket#getInputStream, read the data using a BufferedReader, and parse the data using whatever you like.
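A bare-bones sketch of that socket approach (host and path are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class RawHttpGet {
    public static void main(String[] args) throws Exception {
        // Placeholder host and path; plain HTTP on port 80
        try (Socket socket = new Socket("site.com", 80);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.print("GET /file.html HTTP/1.0\r\n");
            out.print("Host: site.com\r\n");
            out.print("\r\n");
            out.flush();
            // Print the raw response, headers and body alike
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}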
You can use NekoHTML to parse your HTML document. You will get a DOM document. You may then use XPath to retrieve the data you need.
If your "web sources" are regular websites using HTML (as opposed to structured XML format like RSS) I would suggest to take a look at HTMLUnit.
This library, while targeted for testing, is a really general purpose "Java browser". It is built on a Apache httpclient, Nekohtml parser and Rhino for Javascript support. It provides a really nice API to the web page and allows to traverse website easily.
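As a rough illustration, fetching a page and walking its links with HtmlUnit could look like this (this assumes the 2.x package names; the target URL is just an example):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(false); // plain HTML is enough here
            // Example target page; any URL works
            HtmlPage page = webClient.getPage("https://stackoverflow.com/questions");
            System.out.println(page.getTitleText());
            // Walk every anchor on the page and print its target
            for (HtmlAnchor anchor : page.getAnchors()) {
                System.out.println(anchor.getHrefAttribute());
            }
        }
    }
}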
Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
If you absolutely MUST scrape content, look for microformats in the markup, most blogs (especially WordPress based blogs) have this by default. There are also libraries and parsers available for locating and extracting microformats from webpages.
Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.
Check this out: http://www.alchemyapi.com/api/demo.html
They return pretty good results and have an SDK for most platforms. It's not only text extraction; they also do keyword analysis, etc.
