Fetch Data from URL Using Java - java

This question was asked during my interview, and I was asked to implement it. The question is as follows:
Your application should:
Take the username and password for a LinkedIn profile,
Use them to log in on the page www.linkedin.com,
Simulate clicking the Profile -> Edit profile menu,
Scrape that user's page in the format below and dump it in a text file (hint: you can use the Beautiful Soup library),
On fetching this URL, read the following information and put it in a CSV/Excel file.
Can somebody give me an idea of how to do it? It must be done using Java only.

I'd use web browser automation software like Selenium (http://www.seleniumhq.org/), which seems like it will solve this problem. You can choose any of its bindings (Java, C#, Ruby, Python, JavaScript) to implement the solution.
Take a look at the tutorials: https://www.airpair.com/selenium/posts/selenium-tutorial-with-java
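For a rough idea of what the Selenium approach looks like, here is a minimal Java sketch. The login field names (session_key, session_password) and the edit-profile URL are assumptions about LinkedIn's markup, which changes often, so treat this as a starting point rather than working code:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class LinkedInScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(); // requires chromedriver on the PATH
        driver.get("https://www.linkedin.com");
        // The field names below are assumptions; inspect the live page to confirm them.
        driver.findElement(By.name("session_key")).sendKeys("user@example.com");
        driver.findElement(By.name("session_password")).sendKeys("password");
        driver.findElement(By.cssSelector("button[type='submit']")).click();
        // After login, navigate to the profile edit page and grab its HTML.
        driver.get("https://www.linkedin.com/profile/edit");
        String pageSource = driver.getPageSource();
        System.out.println(pageSource);
        driver.quit();
    }
}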

This looks like a web-crawling task, and we can do it very well with the JSOUP library.
Read up on JSOUP's usage; you can then filter for the link that looks like
https://www.linkedin.com/profile/edit?trk=nav_responsive_sub_nav_edit_profile
Notice the keyword edit_profile in the URL, which can be used to filter out the result we require.
A link you should follow to explore more about JSOUP:
Webcrawler using JSOUP
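As a minimal sketch of that filtering idea (assuming you already have an authenticated page to fetch; jsoup alone will not handle the login step):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class EditProfileLinkFinder {
    public static void main(String[] args) throws Exception {
        // Fetch the page; a real run would need authenticated session cookies.
        Document doc = Jsoup.connect("https://www.linkedin.com").get();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String href = link.attr("abs:href");
            // Keep only links that contain the edit_profile keyword.
            if (href.contains("edit_profile")) {
                System.out.println(href);
            }
        }
    }
}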

Related

Getting information from a third party Wiki page

In the project I am working on, I need to access information from the website explainxkcd.com, which gives explanations of specific xkcd comics. The information I am looking for is the explanation of a comic, as a string. Explainxkcd runs on MediaWiki, the software that forms the template for the "wiki" feel. MediaWiki has an API that allows you to extract information from sites that use it, and I have gone to http://www.mediawiki.org/wiki/API:Main_page trying to figure out how to use the API for this particular wiki, but to no avail. It seems that you can replace "index.php" in a URL with api.php to use the API, but when I try this for http://explainxkcd.com/9/api.php it doesn't seem to work. I guess my URL is wrong, but I don't see any information on how to find the specific URL to use for explainxkcd.com.
QUESTION:
How can I access information from a third-party wiki page in a Java program? This can be through the MediaWiki API or some other solution. If you know a good way to find the URL that can be used with MediaWiki, that would be preferred. Just looking for a nudge in the right direction here.
Thanks
Using the same method, s/index.php/api.php/, I get a different result: http://www.explainxkcd.com/wiki/api.php which seems to work. If a wiki is using pretty URLs (e.g. example.com/wiki/Main_Page), just click on edit, view source or history.
Yes, please use the API instead of screen-scraping. You can see a few existing Java libraries for that here.
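To make that concrete, here is a small hedged sketch that calls the MediaWiki api.php endpoint from plain Java; the page title and parameter choices are assumptions, so consult the API documentation for the exact query you need:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class ExplainXkcdFetcher {
    public static void main(String[] args) throws Exception {
        // action=parse returns the parsed content of a page; format=json keeps it machine-readable.
        String page = URLEncoder.encode("1", "UTF-8"); // assumed page title for comic #1
        URL api = new URL("http://www.explainxkcd.com/wiki/api.php?action=parse&page=" + page + "&format=json");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(api.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON; parse it with a JSON library
            }
        }
    }
}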

How do I get the rendered web page from a URL?

I don't want just the source code. I want the rendered page. This is an important distinction that I apparently cannot make by simply searching Google.
Does anyone know how I can get the rendered page from a URL?
This needs to be done in Java, hopefully without an extra library.
Another solution would be to use HTMLUnit, which is a "GUI-less browser for Java". It is recommended by Google for generating snapshots of AJAX-based web pages to make them crawlable.
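A minimal HtmlUnit sketch, assuming a recent 2.x version (where WebClient is AutoCloseable and lives in the com.gargoylesoftware.htmlunit package); asXml() serializes the DOM after JavaScript has run, which is the rendered page rather than the raw source:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedPageFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true); // execute page scripts
            HtmlPage page = webClient.getPage("http://example.com");
            // asXml() returns the DOM after JavaScript has modified it.
            System.out.println(page.asXml());
        }
    }
}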
You can try using a library that wraps a web browser, for example Berkelium. If you need it in Java, a Google search turned up this Java wrapper API for Berkelium (I haven't tried it personally).
The wrapper's sites.google.com page has an example of its use.

Accessing website search box in Java

What I need to write is a code snippet that would go to a website, e.g. www.google.com, find the search box, put in a phrase, and retrieve the HTML code of the results page(s). Is it possible to achieve this in Java?
For Google, use the JSON/Atom Custom Search API. It is the only (legal) way to access Google search.
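A hedged sketch of what a Custom Search request looks like; API_KEY and SEARCH_ENGINE_ID are placeholders you create in the Google developer console, and the response comes back as JSON:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class CustomSearchExample {
    public static void main(String[] args) throws Exception {
        String query = URLEncoder.encode("java http client", "UTF-8");
        // key and cx are placeholders; obtain real values from the Google developer console.
        URL url = new URL("https://www.googleapis.com/customsearch/v1?key=API_KEY&cx=SEARCH_ENGINE_ID&q=" + query);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON-formatted search results
            }
        }
    }
}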
Yes, use something like HttpClient, although there are other similar options.
Most probably you will be able to pass a parameter in the URL (look at the Google URL after issuing a search; there are plenty of parameters) or use a POST request (if the site supports it; check for an API description).
If you read the URL directly from Java (e.g. using the URL class) you'll get the returned HTML as is.
The first tool I thought of was Selenium. It is primarily a web testing framework, but can be used to automate a browser for the kind of operation you're suggesting.
http://seleniumhq.org/docs/03_webdriver.html#getting-started-with-selenium-webdriver
HttpUnit can also be used. It's a well-documented, open-source, easy-to-use unit-testing framework.
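For illustration, a minimal HttpUnit sketch that fills in a search form; the assumptions that the first form on the page is the search box and that its field is named q must be checked per site, and the site may still block automated requests:

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebResponse;

public class SearchBoxDriver {
    public static void main(String[] args) throws Exception {
        WebConversation wc = new WebConversation();
        WebResponse page = wc.getResponse("https://www.google.com");
        // Assumes the first form on the page is the search form and its field is named "q".
        WebForm form = page.getForms()[0];
        form.setParameter("q", "httpunit example");
        WebResponse results = form.submit();
        System.out.println(results.getText()); // HTML of the results page
    }
}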

Getting Data from Internet in Java

I thought of making the following application for my college project in Java. I know core Java. As there is little time, I want to know what I should read, specifically, for this project:
It will have an interface to enter your query. This string would go as a query to internet search engines, and with their help the application finds the data (the first web page that we see; that is the data for my application this time :)).
I do not want to display the data. I just want the HTML file or the source code of the generated web page. Does this sound like the Common Gateway Interface? I do not know about it.
But I think it serves the same purpose. If so, please guide me on how to implement it.
Whatever it is, please specify.
Problem 1: What should I read? Direct help at this point is not my intention; I want to implement it myself.
Problem 2: Does connecting to the internet require some JNLP knowledge too?
For example, when we search something on Google, it shows us links to websites. I can see the source code of this generated web page; I just want this page for my application to work on.
EDIT:
I do not want to rely only on Google or on any particular web server. I want my application to decide that.
Please also refer to my problem 2.
Since I discovered that websites have terms and conditions, should I make my own crawler? Would my application then be breaking the rules? This is important to me.
Ashish,
Here is what I would recommend.
Learn the basics of JSON from these links (Introduction, lib download).
Then look at the Google Web Search JSON API here.
Learn how to GET data from servers using the HttpClient library here.
Now what you have to do is fire a GET request for the search, read the JSON response, parse it using the JSON lib from #1, and you have the search results.
Most search engines (Bing etc.) offer JSON/REST APIs, so you can do the same for them.
Note: JSON APIs are normally consumed from JavaScript on the UI side, but since JSON is very easy and quick to learn, I suggest it here. You can also explore the XML-based APIs if time permits.
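To make step #3 concrete, here is a hedged sketch using the Apache HttpClient 4.x API together with the org.json library; the search URL is a placeholder, since Google's old Web Search JSON API has since been deprecated:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.json.JSONObject;

public class SearchClient {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; substitute the real search API endpoint and parameters.
        String url = "https://example.com/search?q=java&format=json";
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            String body = EntityUtils.toString(response.getEntity());
            JSONObject json = new JSONObject(body); // parse with the JSON lib from #1
            System.out.println(json.toString(2));
        }
    }
}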
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Open a stream to the URL and print the raw response line by line.
URL url = new URL("http://fooooo.com");
try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
}
That should be enough to get you started.
And yes, do check that you are not violating a website's terms of use. Search engines don't really like you accessing them via a program.
Many, including Google, have APIs specifically designed for this purpose.
You can do everything you want using HTMLUnit. It's like a web browser, but for Java. Check out some examples on their website.
Read "Working with URL's" in the Java tutorial to get an idea what is behind the available libs like HTMLUnit, HttpClient, etc
I do not want to display the data. I just want the HTML file or the source code of the generated web page.
You probably don't need the HTML either. Google provides its search results as a web service using this API. Similarly for other search engines, GIYF. You get the search results as XML, which is far easier for you to parse, and the XML won't contain unwanted data like ads.

How to use HTML parsing and curl in Java for this task?

I'm trying to write a program that takes company names from a text file and searches them on a search engine website (SEC's Edgar search). Each search usually comes up with 1-10 unique search result links, and I want to use curl to follow the link with the relevant company name. The linked page has a brief summary containing the term "state of incorporation:" followed by the state name. I'm hoping to parse out the state name. I am having trouble understanding how to use HTML parsing and curl and their classes. I would appreciate any help, such as a brief outline of the steps, or any advice at all. Thanks.
Assuming the HTML is fairly basic, use something like the Mozilla Java HTML Parser. The getting-started guide will give you more details on creating the DOM. Java has built-in APIs for downloading content from the web, and these will likely be sufficient for you (rather than using curl).
Once you have a DOM, you can use the standard DOM APIs to navigate to the links and items that you want.
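As a hedged sketch of the overall flow, here is a version using jsoup rather than the Mozilla parser; the Edgar query URL, the link-matching strategy, and the "State of Incorp" label are assumptions drawn from the question, so verify them against the live site:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EdgarStateLookup {
    public static void main(String[] args) throws Exception {
        // 1. Fetch the search results page; the query parameter name is an assumption.
        String company = "Example Corp";
        Document results = Jsoup.connect(
                "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&company="
                + java.net.URLEncoder.encode(company, "UTF-8"))
                .userAgent("example-agent").get();
        // 2. Follow the first result link whose own text contains the company name.
        Element link = results.selectFirst("a:containsOwn(" + company + ")");
        if (link != null) {
            Document detail = Jsoup.connect(link.absUrl("href")).get();
            // 3. Scan the page text for the "State of Incorp" label and print what follows.
            String text = detail.text();
            int i = text.indexOf("State of Incorp");
            if (i >= 0) {
                System.out.println(text.substring(i, Math.min(text.length(), i + 40)));
            }
        }
    }
}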
