I am writing a crawler in Java that examines an IMDb movie page and extracts some info like name, year, etc. The user writes (or copy/pastes) the link of the title and my program should do the rest.
After examining the HTML source of several IMDb pages and reading up on how crawlers work, I managed to write some code.
The info I get (for example, the title) is in my mother tongue. If there is no info in my mother tongue, I get the original title. What I want is to get the title in a specific language of my choosing.
I'm fairly new to this, so correct me if I'm wrong, but I get the results in my mother tongue because IMDb "sees" that I'm from Serbia and then customizes the results for me. So basically I need to tell it somehow that I prefer results in English? Is that possible (I imagine it is), and how do I do it?
edit:
The program crawls like this: it gets the URL path as a String, converts it to a URL, reads the entire source with a BufferedReader, and inspects what it gets. I'm not sure if that is the right way to do it, but it's working (minus the language problem).
code:
public static Info crawlUrl(String urlPath) throws IOException {
    Info info = new Info();

    // Open a connection and read the page source line by line as UTF-8
    URL url = new URL(urlPath);
    URLConnection uc = url.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(
            uc.getInputStream(), "UTF-8"));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        // Print the line that contains the page title
        if (inputLine.contains("<title>")) System.out.println(inputLine);
    }
    in.close();

    return info;
}
This code goes through a page and prints the main title to the console.
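For what it's worth, pulling just the title text out of that line can be done with a quick regex sketch like the one below (a real HTML parser would be more robust; the sample line is made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtract {
    public static void main(String[] args) {
        // Sample line as it might appear in the page source (made-up example)
        String inputLine = "<title>Some Movie (2013) - IMDb</title>";
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(inputLine);
        if (m.find()) {
            System.out.println(m.group(1)); // Some Movie (2013) - IMDb
        }
    }
}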
You don't need to crawl IMDb; you can use the data dumps they provide: http://www.imdb.com/interfaces
There's also a parser for the data they provide: https://code.google.com/p/imdbdumpimport/ It's not perfect, but maybe it will help you (expect to spend some effort to make it work).
An alternative parser: https://github.com/dedeler/imdb-data-parser
EDIT: You're saying you want to crawl IMDb anyway for learning purposes, so you'll probably have to go with http://en.wikipedia.org/wiki/Content_negotiation as suggested in the other answer:
uc.setRequestProperty("Accept-Language", "de; q=1.0, en; q=0.5");
Take a look at the request headers used by your crawler; mine contains Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4, so I get the title in French.
EDIT: I checked with the ModifyHeaders add-on in Google Chrome, and the value en-US gets me the English title for the movie =)
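Putting that together with the crawlUrl method from the question, a minimal sketch could look like this (the method name fetchTitleLine and the exact quality values are my own choices; the header has to be set before reading from the connection):

public static String fetchTitleLine(String urlPath) throws IOException {
    URL url = new URL(urlPath);
    URLConnection uc = url.openConnection();
    // Ask the server for English content via content negotiation;
    // the quality values here are just one reasonable choice
    uc.setRequestProperty("Accept-Language", "en-US,en;q=0.8");
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
            uc.getInputStream(), "UTF-8"))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            if (inputLine.contains("<title>")) return inputLine;
        }
    }
    return null;
}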
Related
Hey, I'm having a little trouble here. I am doing file writing at school and we got the challenge of reading a webpage. How is it possible to do it? I had a go with Jsoup and an Apache plugin, but neither worked, and I have to use the java.net package.
I am a bit of a noob at coding, so there will probably be a couple of errors!
Here is my code:
URL oracle = new URL("http://www.oracle.com/");
BufferedReader br = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = br.readLine()) != null) {
    System.out.println(inputLine);
}
br.close();
There is no output from the program. Earlier I managed to get output, but it was in the form of HTML; ironically, I deleted that code while looking for a fix for that issue.
Any help or solutions would be greatly appreciated! Thank you all very much!
The code example is from Reading Directly from a URL, but the tutorial is old. The URL http://www.oracle.com now redirects to https://www.oracle.com/, but your code doesn't follow the redirect.
If you use a URL that does not redirect, like http://www.google.com you will see that the code works.
If you want a more robust program that handles redirects, you'll probably want to use an HttpURLConnection instead of the basic URL, as it has more features for you to use.
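For illustration, here's a minimal sketch that follows one redirect by hand; HttpURLConnection follows redirects within the same protocol automatically, but not from http to https:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReadUrl {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://www.oracle.com/").openConnection();
        int status = conn.getResponseCode();
        // Handle a 301/302 by reopening the connection at the new location
        if (status == HttpURLConnection.HTTP_MOVED_PERM
                || status == HttpURLConnection.HTTP_MOVED_TEMP) {
            String location = conn.getHeaderField("Location");
            conn = (HttpURLConnection) new URL(location).openConnection();
        }
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String inputLine;
            while ((inputLine = br.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }
}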
I'm currently working on a project that involves analyzing weather information from various airports and weather stations. What I need to do is display the basic weather info from a URL (example: http://w1.weather.gov/xml/current_obs/KFHB.xml), where I need to display temperature, visibility, etc. in some basic format. I was thinking of an array list.
Can anyone tell me how to do this? How do I display the elements from a webpage in a Java application? What classes (methods, or other procedures) should I look into? Are there any good templates out there for this? Any help would be appreciated. Thanks.
You can read through the contents of the URL like reading a file, by doing the following:
URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
You can also capture the RSS feed from that weather station. Read this article if you want an example of how to parse an RSS feed:
http://www.vogella.com/articles/RSSFeed/article.html
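If you'd rather stick with the standard library, the observation feed is plain XML, so a DOM parse is enough. A minimal sketch (the element names temp_f and visibility_mi are what the NOAA current_obs format appears to use; double-check them against the actual feed):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class WeatherReader {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse("http://w1.weather.gov/xml/current_obs/KFHB.xml");
        // Element names assumed from the current_obs XML format
        System.out.println("Temperature: " + text(doc, "temp_f") + " F");
        System.out.println("Visibility:  " + text(doc, "visibility_mi") + " mi");
    }

    private static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }
}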
I am doing an assignment for one of my classes.
I am supposed to write a web crawler that downloads files and images from a website, given a specified crawl depth.
I am allowed to use a third-party parsing API, so I am using Jsoup. I've also tried htmlparser. Both are nice tools, but neither is perfect.
I used the default Java URLConnection to check the content type before processing the URL, but it becomes really slow as the number of links grows.
Question: Does anyone know of a specialized parser API for images and links?
I could start writing mine using Jsoup, but I am being lazy. Besides, why reinvent the wheel if there could be a working solution out there? Any help would be appreciated.
I need to check the contentType while looping through the links, to check whether the link is to a file, in an effective way, but Jsoup does not have what I need. Here's what I have:
Connection mimeConn = null;
Connection.Response mimeResponse = null;
for (Element link : links) {
    String linkurl = link.absUrl("href");
    if (!linkurl.contains("#")) {
        // Skip links we've already crawled
        if (DownloadRepository.curlExists(linkurl)) {
            continue;
        }
        // Issue a request just to inspect the content type
        mimeConn = Jsoup.connect(linkurl);
        mimeConn.ignoreContentType(true);
        mimeConn.ignoreHttpErrors(true);
        mimeResponse = mimeConn.execute();
        WebUrl webUrl = new WebUrl(linkurl, currentDepth + 1);
        String contentType = mimeResponse.contentType();
        if (contentType.contains("html")) {
            page.addToCrawledPages(new WebPage(webUrl));
        } else if (contentType.contains("image")) {
            page.addToImages(new WebImage(webUrl));
        } else {
            page.addToFiles(new WebFile(webUrl));
        }
        DownloadRepository.addCrawledURL(linkurl);
    }
}
UPDATE
Based on Yoshi's answer, I was able to get my code to work right. Here's the link:
https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java
Use Jsoup; I think this API is good enough for your purpose. You can also find a good cookbook on the site.
Several steps:
Jsoup: how to get an image's absolute url?
how to download image from any web page in java
You can write your own recursive method which walks through the links on a page that contain the necessary domain name or relative links. Use this way to grab all links and find all images on them. Writing it yourself is not bad practice.
You don't need to use the URLConnection class; Jsoup has a wrapper for it. For example, you can get the DOM object with just one line of code:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Instead of this code:
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
        yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);
in.close();
Update 1
Try adding the following lines to your code:
Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
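Combining the recursive walk with the content-type check, a minimal sketch of the crawl loop could look like this (class and method names are mine, and the download bookkeeping is left out):

import java.util.HashSet;
import java.util.Set;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MiniCrawler {
    private final Set<String> visited = new HashSet<String>();

    public void crawl(String url, int depth) throws Exception {
        // Stop at the depth limit and never visit the same URL twice
        if (depth < 0 || !visited.add(url)) return;
        Connection.Response res = Jsoup.connect(url)
                .ignoreContentType(true)
                .ignoreHttpErrors(true)
                .execute();
        String contentType = res.contentType();
        if (contentType == null || !contentType.contains("html")) {
            // Not an HTML page: record or download the file/image here
            return;
        }
        Document doc = res.parse();
        for (Element link : doc.select("a[href]")) {
            crawl(link.absUrl("href"), depth - 1);
        }
    }
}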
I've been trying to get information from a webpage, specifically this site: http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D (among other similar ones). I'm using the URL and URLConnection classes to do so. I'm trying to get a certain number from the webpage; on this page, I want the total number of articles (16428).
It says this near the top of the page: "Results: 1 to 20 of 16428", and when I look at the page source manually I can find this. However, when I use the Java connection to obtain this number from the page source, for some reason the number it gets is "863399" instead of "16428".
Code:
URL connection = new URL("http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D");
URLConnection yc = connection.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String html = "";
String inputLine;
while ((inputLine = in.readLine()) != null) html += inputLine;
in.close();
int startMarker = html.indexOf("ncbi_resultcount");
int endMarker = html.indexOf("ncbi_op");
System.out.println(html.substring(startMarker, endMarker));
When I run this code, I get:
ncbi_resultcount" content="863399" />
rather than:
ncbi_resultcount" content="16428" />
Does anyone know why this is / how I can fix it?
Thanks!
I can't reproduce your problem and I have no idea why this is happening. Perhaps it's sniffing specific Java user agent versions. You'd then need to set the User-Agent header to something else to pretend to be a "real" web browser:
yc.setRequestProperty("User-Agent", "Mozilla");
Unrelated to the concrete problem, I'd suggest using a real HTML parser for this job, such as Jsoup. It's then as easy as:
Document document = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D").get();
Element nbci_resultcount = document.select("meta[name=ncbi_resultcount]").first();
System.out.println(nbci_resultcount.attr("content")); // 16433
I have a problem once again where I can't find the source code because it's hidden or something... When my Java program indexes the page it finds everything but the info I need... I assume it's hidden for a reason, but is there any way around this?
It's just a bunch of tr/td tags that show up in Firebug but don't show up when viewing the page source or when I do the below:
URL url = new URL("my url");
URLConnection yc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}
in.close();
I really have no idea how to get the info that I need...
The reason for this behavior is probably that those tags are dynamically injected into the DOM using JavaScript and are not part of the initial HTML, which is what you can fetch with a URLConnection. They might even be created using AJAX. You will need a JavaScript interpreter on your server if you want to fetch those.
If they don't show up in the page source, they're likely being added dynamically by JavaScript code. There's no way to get them from your server-side script short of including a JavaScript interpreter, which is rather high-overhead.
The information in the tags is presumably coming from somewhere, though. Why not track that down and grab it straight from there?
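If you do end up needing the JavaScript-generated markup, a headless browser such as HtmlUnit can execute the scripts for you. A minimal sketch (the URL is a placeholder):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsPageFetcher {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(true);
        // getPage runs the page's JavaScript before returning
        HtmlPage page = client.getPage("http://example.com/");
        // asXml gives the DOM after scripts have run,
        // including any dynamically injected tr/td rows
        System.out.println(page.asXml());
    }
}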
Try using Jsoup:
Document doc = Jsoup.parse(new URL("http://..."), 10000); // your URL here
System.out.print(doc.toString());
Assuming that the issue is that the "missing" content is being injected using JavaScript, the following SO question is pertinent:
What's a good tool to screen-scrape with Javascript support?