Getting Data from Internet in Java - java

I am planning the following application for my college project in Java. I know core Java. Since time is short, I want to know what I should read "specifically" for this project:
It will have an interface for entering a query. The query string would be sent to internet search engines, and the application would use the search engine to find the data (for now, the data is simply the first page of results that we see :) ).
I do not want to display the data. I just want the HTML file, that is, the source code of the generated web page. Does this sound like the Common Gateway Interface? I do not know much about it,
but I think it serves the same purpose. If it does, please guide me on how to implement this.
Either way, please be specific.
Problem 1: What should I read? I am not looking for a ready-made solution at this point; I want to implement it myself.
Problem 2: Does connecting to the internet require some JNLP knowledge too?
For example, when we search for something on Google, it shows us links to websites. I can see the source code of this generated web page. I just want this page for my application to work on.
EDIT:
I do not want to rely only on Google or on any particular web server. I want my application to decide that.
Please also refer to my problem 2.
I have discovered that websites have Terms and Conditions. Should I try to write my own crawler? Would my application then not be breaking the rules? This is important to me.

Ashish,
Here is what I would recommend.
1. Learn the basics of JSON from these links (Introduction, lib download).
2. Then look at the Google Web Search JSON API here.
3. Learn how to GET data from servers using the HttpClient library here.
4. Now what you have to do is fire a GET request for the search, read the JSON response, and parse the response using the JSON lib from #1, and you have the search results.
Most search engines (Bing etc.) offer JSON/REST APIs, so you can do the same for other search engines.
Note: JSON APIs are normally used from JavaScript on the UI side, but since JSON is very easy and quick to learn, I suggested it to you. You can also explore the XML-based APIs if time permits.
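The request step above can be sketched with the JDK's built-in java.net.http client (Java 11+), used here in place of the Apache HttpClient library the answer links to. The endpoint https://api.example.com/search and its q parameter are placeholders, not a real search API; substitute the URL and parameters documented by whichever search service you sign up for. Parsing the returned JSON is left to the JSON library from step 1.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class SearchClient {
    // Hypothetical endpoint: replace with the search API you actually use.
    static final String ENDPOINT = "https://api.example.com/search";

    // Builds a GET request whose query string carries the URL-encoded search terms.
    static HttpRequest buildRequest(String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(ENDPOINT + "?q=" + encoded))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    // Sends the request and returns the raw JSON body as a string.
    static String fetchJson(String query) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(query), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

Calling SearchClient.fetchJson("java tutorial") would then return the response body, which you hand to your JSON parser.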

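import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Inside a method that declares "throws IOException":
URL url = new URL("http://fooooo.com");
// try-with-resources closes the reader (and the underlying stream) when done
try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine); // print the page source line by line
    }
}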
That should be enough to get you started.
And yes, do check that you are not violating the usage terms of a website. Search engines don't really like you trying to access them via a program.
Many, including Google, have APIs specifically designed for this purpose.

You can do everything you want using HtmlUnit. It's like a web browser, but for Java. Check some examples on their website.

Read "Working with URLs" in the Java tutorial to get an idea of what is behind the available libs like HtmlUnit, HttpClient, etc.
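As a small taste of what that tutorial covers, here is a sketch that splits an address into its parts with java.net.URL. The class and method names are made up for illustration, and the address passed in is up to you.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlParts {
    // Returns "protocol|host|path|query" for the given absolute address.
    static String describe(String address) {
        try {
            URL url = URI.create(address).toURL();
            return url.getProtocol() + "|" + url.getHost() + "|"
                    + url.getPath() + "|" + url.getQuery();
        } catch (MalformedURLException e) {
            // Reported as unchecked so callers are not forced to handle it.
            throw new IllegalArgumentException("Not a valid URL: " + address, e);
        }
    }
}
```

Libraries such as HttpClient and HtmlUnit build on the same idea: a URL is parsed into its parts, a connection is opened to the host, and the path and query are sent in the request.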

I do not want to display the data. I just want the HTML file or the source code of the generated web page.
You probably don't need the HTML either. Google provides its search results as a web service using this API. The same goes for other search engines; GIYF. You get the search results as XML, which is far easier for you to parse. Plus, the XML won't have any unwanted data like ads.
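To show why XML results are easy to work with, here is a minimal sketch using the DOM parser that ships with the JDK. The element names result and url are an assumed layout for illustration only; the real names depend on the search API you use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SearchResultXml {
    // Extracts the text of the first <url> element inside every <result> element.
    // The element names are hypothetical; adapt them to the API's actual schema.
    static List<String> extractUrls(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, since the XML comes from the network.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList results = doc.getElementsByTagName("result");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < results.getLength(); i++) {
                Element result = (Element) results.item(i);
                NodeList urlNodes = result.getElementsByTagName("url");
                if (urlNodes.getLength() > 0) {
                    urls.add(urlNodes.item(0).getTextContent().trim());
                }
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse search results", e);
        }
    }
}
```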

Related

Fetch Data from Url Using java

This question was put to me during an interview, and I was asked to implement it. The question is as follows:
Your application
Will take the username and password for the LinkedIn profile,
On the page www.linkedin.com, use them to log in.
Simulate clicking the Profile -> Edit profile menu
Scrape the page of that user that comes up, in the format below, and dump it in a text file. (Hint: you can use the Beautiful Soup library.)
On fetching this URL, you need to read the following information and put it in a CSV/Excel file.
Can somebody give me an idea of how to do it? It should be done using Java only.
I'd use web browser automation software like Selenium http://www.seleniumhq.org/, which looks like it will solve this problem. You can choose any of its bindings (Java, C#, Ruby, Python, JavaScript) to implement the solution.
Take a look at the tutorial https://www.airpair.com/selenium/posts/selenium-tutorial-with-java
This seems related to web crawling, and we can do it very well using the jsoup library.
You have to read up on how to implement this with jsoup; we can filter out the link that looks something like
https://www.linkedin.com/profile/edit?trk=nav_responsive_sub_nav_edit_profile
Here, as you can see, the URL contains the keyword edit_profile, which can be used to filter out the results we require.
A link you should follow to explore more about jsoup:
Webcrawler using JSOUP

Getting information from a third party Wiki page

In the project I am working on, I need to access information from the website explainxkcd.com, which gives explanations of specific xkcd comics. The information I am looking for is the explanation of a comic as a string. Explainxkcd runs on MediaWiki, the software that provides the "wiki" look and feel. MediaWiki has an API that allows you to extract information from sites that run it, and I have gone to http://www.mediawiki.org/wiki/API:Main_page to figure out how to use the API for this particular wiki, but to no avail. It seems that you can replace "index.php" in a URL with "api.php" to use the API, but when I try this for http://explainxkcd.com/9/api.php it doesn't seem to work. I guess my URL is wrong, but I can't find any information on how to determine the specific URL to use for explainxkcd.com.
QUESTION:
How can I access information from a third-party wiki page in a Java program? This can be through the MediaWiki API or some other solution. A good way to find the URL to use with the MediaWiki API would be preferred. I'm just looking for a nudge in the right direction here.
Thanks
Using the same method, s/index.php/api.php/, I get a different result: http://www.explainxkcd.com/wiki/api.php, which seems to work. If a wiki is using pretty URLs (e.g. example.com/wiki/Main_Page), just click on edit, view source or history to get a URL that contains index.php.
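Once you have the api.php address, a request can be sketched like this. The parameters action=parse, page, prop=wikitext and format=json are standard MediaWiki API parameters, but whether this particular wiki's version supports all of them is an assumption you should check against its api.php help page. The class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class WikiApi {
    // Builds an api.php request asking for the wikitext of one page as JSON.
    static String parseUrl(String apiBase, String pageTitle) {
        String title = URLEncoder.encode(pageTitle, StandardCharsets.UTF_8);
        return apiBase + "?action=parse&page=" + title + "&prop=wikitext&format=json";
    }

    // Downloads the response body as a single string (UTF-8 assumed).
    static String fetch(String url) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                URI.create(url).toURL().openStream(), StandardCharsets.UTF_8))) {
            return in.lines().collect(Collectors.joining("\n"));
        }
    }
}
```

For example, WikiApi.fetch(WikiApi.parseUrl("http://www.explainxkcd.com/wiki/api.php", title)) would return JSON for the page with that title; the explanation text then has to be extracted from the wikitext field with a JSON library.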
Yes, please use the API instead of screen-scraping. You can see a few existing Java libraries for that here.

How to modify search result page given by Solr?

I intend to make a niche search engine. I am using apache-nutch-1.6 as the crawler and apache-solr-3.6.2 as the searcher. I must say there is very little up-to-date information on the web about these technologies.
I followed this tutorial http://wiki.apache.org/nutch/NutchTutorial and have successfully installed Apache Nutch and Solr on my Ubuntu system. I was also able to inject the seed URLs into the webdb and perform the crawl.
Using the Solr interface at http://localhost:8983/solr/admin, I can also query the crawled results. But this is the output I receive.
Am I missing something here? The earlier apache-nutch-0.7 had a WAR which generated clear HTML output like this. How do I achieve this? Or, if anyone could point me to a recent tutorial or guidebook, that would be highly appreciated.
A couple of things:
If you are just starting, do not use Solr 3.6; go straight to the latest 4.1+. A bunch of things have changed and a lot of new features have been added.
You seem to be saying that you will expose Solr + UI directly to the general web. That's a really bad idea, as Solr is completely unsecured and allows web-based delete queries. You really want a business layer in the middle.
With Solr 4.1, there is a pretty Admin UI and, also, there is a /browse page that shows how to use Velocity to build pages backed by Solr. Or have a look at something like Project Blacklight for an example of how to put a UI over Solr.
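To make the "business layer in the middle" idea concrete, here is a rough sketch. Your web application, not the browser, talks to Solr, and it only ever builds read-only select requests from the user's text. The handler path /solr/select and the class name are assumptions for illustration; q, start, rows and wt are standard Solr query parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SolrSearchGateway {
    private static final int ROWS_PER_PAGE = 10;
    private final String selectUrl; // e.g. "http://localhost:8983/solr/select" (assumed path)

    SolrSearchGateway(String selectUrl) {
        this.selectUrl = selectUrl;
    }

    // Turns raw user input into a select URL. The text is URL-encoded, so it
    // cannot add extra parameters, and the update handler is never addressed.
    String buildSelectUrl(String userText, int page) {
        int safePage = Math.max(page, 0); // negative page numbers fall back to the first page
        String q = URLEncoder.encode(userText.trim(), StandardCharsets.UTF_8);
        return selectUrl + "?q=" + q
                + "&start=" + (safePage * ROWS_PER_PAGE)
                + "&rows=" + ROWS_PER_PAGE
                + "&wt=json";
    }
}
```

The web application would fetch this URL server-side and render the results itself, while Solr stays reachable only from the application server.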
I found the link below,
http://cmusphinx.sourceforge.net/2012/06/building-a-java-application-with-apache-nutch-and-solr/
which answered my query.
After reading the content available at the above link, I felt very angry with myself.
The Solr package provides all the objects required to query Solr.
In fact, the essential jars are just solr-solrj-3.4.0.jar, commons-httpclient-3.1.jar and slf4j-api-1.6.4.jar.
Anyone can build a Java search engine using these objects to query the database and put a fancy UI on top.
Thanks again.

Getting data from a website that needs you to log in (Java)

I don't even know if what I'm asking is possible and I don't know what to search for on Google.
Basically, there are multiple projects that would require me to fetch some data from websites. The example I'm thinking of right now is to grab my account info from a banking site, http://www.americanexpress.ca. I'd like to know how to have my login info entered in the fields on the left and grab the data from the resulting page. I'd then write methods to parse that data.
Obviously, this would need to be secure as I don't want my banking info stolen.
Sorry if the solution is obvious as I've never tried grabbing data from websites.
As mentioned, Apache HttpClient is one option, though personally I've always found HtmlUnit to be a bit more convenient to work with (from an API standpoint) for doing things like this. HtmlUnit is built on top of HttpClient, and exposes a higher-level API for interacting with and manipulating page content.
You have to use the Apache HttpClient library (or something similar). It has all the classes you need.
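For a site with a plain HTML login form, the general shape is: POST the form fields, keep the session cookie, then request the page you want with the same client. Here is a rough sketch using the JDK's own java.net.http client (Java 11+) rather than Apache HttpClient or HtmlUnit. The field names username and password are hypothetical; look at the form's HTML to find the real ones. Always use an https:// login URL so the credentials are encrypted in transit.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class LoginSession {
    // Encodes form fields as application/x-www-form-urlencoded.
    static String formBody(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // A client with a cookie store keeps the session cookie set by the login response.
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Builds the login POST; the field names are assumptions, not taken from any real site.
    static HttpRequest loginRequest(String loginUrl, String user, String password) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", user);
        fields.put("password", password);
        return HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(fields)))
                .build();
    }
}
```

You would send loginRequest(...) with the client from newSessionClient(), then use the same client for the account page. Sites that build the login form with JavaScript or add hidden tokens will not work with this alone; that is where a browser-like tool such as HtmlUnit is more convenient.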

Android - Obtaining data from a website

I'm finding my way around Android and so far so good. My next big challenge is getting to grips with web services. I would like to build an app that reads data from a website or a database on a web server and stores the data in my app.
Basically, it will be an app that I build in conjunction with a news website that pulls their latest articles into the app. What I'm finding difficult is how to bridge the gap between my application and the data in the SQL Server database.
I'm familiar with building ASP websites that read data from a database, but how would I do something similar with an app?
Do I ask the website to store the articles in an XML format? Or is there another way that I can request a specific article and be provided with the content?
I hope I'm phrasing the question correctly and that someone can just guide me to the right way to approach this.
Thanks in advance.
You can approach this problem from different perspectives.
The common solution is to build a web service that bridges the gap between your mobile application and the data that remains on your server. I personally prefer to set up a Rails backend and thus have a RESTful API that helps me access my data. For instance, to retrieve the list of articles I could just request the following URL: http://my_server_host/articles. So for the web service part you can use whatever you want: Rails, J2EE, .NET etc. And you can choose the model that fits your needs (REST, SOAP, XML-RPC etc.).
Then you will have to write a class that contains all the necessary calls to the web service you have built. Basically, if your web service returns the results in XML format, you will have to:
Send the request to the appropriate URL. (See: HttpGet or HttpPost if you want to modify a resource).
Parse the XML returned. (In short, you can use SAX or DOM to parse your XML response and transform it into business entities (an Article, a User etc.).)
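The parsing step can be sketched with the SAX parser, which is available both in the JDK and on Android. The element names article, title and body are an assumed response format for illustration; your web service defines the real one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ArticleFeedParser {
    // Simple business entity built from one <article> element.
    static final class Article {
        final String title;
        final String body;

        Article(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    // Parses XML of the assumed form <articles><article><title/><body/></article>...</articles>.
    static List<Article> parse(String xml) {
        final List<Article> articles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String title;
            private String body;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                text.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                switch (qName) {
                    case "title": title = text.toString().trim(); break;
                    case "body": body = text.toString().trim(); break;
                    case "article":
                        articles.add(new Article(title, body));
                        title = null;
                        body = null;
                        break;
                    default: break;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse article XML", e);
        }
        return articles;
    }
}
```

SAX streams through the document without building a tree, which keeps memory use low on a phone; DOM is simpler to write if the responses are small.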
This hopefully gives you a hint about a possible solution. By the way Google is your friend, but I will probably come back to add external links/resources to help you more.
Edit
Another possible solution that could work for you, since all you need is to retrieve some articles: just set up a simple WordPress blog, for instance. WordPress gives you a URL for the blog's RSS feed; all you have to do is parse that RSS feed (XML). There is a great article on the IBM website about parsing an RSS feed that you can find here. By the way, this solution is only possible if you want to save your articles on a WordPress blog. But hopefully you get the point.
Reading your data directly from the database on the server would be bad practice. You'd have to open up some ports, and that's definitely not what you want (if you don't have root access, you also can't).
For non-interactive content (what you want) you would use XML or JSON.
