Java program to retrieve page source from a Google search automatically [duplicate]

I have 30,000 dictionary words. I want to search for each word on Google and find the number of hits for each word using a Java program. Is this possible?

Look up <estimatedTotalResultsCount> using Google's SOAP search API.
You'll be limited to 1000 queries per day though. This limit is removed if you use their AJAX API.

Since your duplicate post is closed, I'll post my answer here as well:
Whether this is possible or not doesn't really matter: Google doesn't want you to do that. They have a public AJAX-search API developers can use: http://code.google.com/apis/ajaxsearch/web.html

Here is a Sun tutorial on Reading from and Writing to a URLConnection.
The simplest URL I can see for making a Google search looks like this:
http://www.google.com/#q=wombat

Reading from a URL with Java is pretty straightforward. A basic working example is as follows:
public Set<String> readUrl(String url) {
    // Requires java.net.URL, java.net.URLConnection, java.io.* and java.util.* imports.
    String line;
    Set<String> lines = new HashSet<String>();
    try {
        URL pageUrl = new URL(url);   // renamed so it doesn't shadow the String parameter
        URLConnection page = pageUrl.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(page.getInputStream()));
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        in.close();
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return lines;
}
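A quick usage sketch (note that because the lines go into a HashSet, duplicate lines are dropped and order is not preserved):

Set<String> pageLines = readUrl("http://www.google.com/#q=wombat");
System.out.println("Fetched " + pageLines.size() + " distinct lines");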

I'd recommend separating your problem into pieces. Get each one working, then marry them together for the solution you want.
You have a few things going on here:
Downloading text from a URL
Scanning a stream of characters and breaking it up into words
Iterating through a list of words and tallying up the hits from your dictionary
Computer science is all about taking large problems and decomposing them into smaller ones. I'd recommend that you start learning how to do that now.
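As a purely illustrative skeleton (the class and method names are assumptions, not part of any answer above), the decomposition might look like this:

import java.util.*;

// Illustrative skeleton only: it shows how the three sub-problems can be
// developed and tested separately before being married together.
public class WordHitCounter {

    // Piece 1: download the raw text behind a URL (see readUrl above for one way to do it).
    static String downloadText(String url) {
        return ""; // stub
    }

    // Piece 2: break a block of text into lowercase words.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }

    // Piece 3: tally how many of those words appear in the dictionary.
    static Map<String, Integer> tally(List<String> words, Set<String> dictionary) {
        Map<String, Integer> hits = new HashMap<String, Integer>();
        for (String w : words) {
            if (dictionary.contains(w)) {
                Integer n = hits.get(w);
                hits.put(w, n == null ? 1 : n + 1);
            }
        }
        return hits;
    }
}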

Related

Checkmarx :: HRA_JAVA_CGI_REFLECTED_XSS_ALL_CLIENTS issue

I am struggling with one of the Checkmarx vulnerabilities and need some guidance on how to resolve it.
Below is my code:
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    String content = null;
    while ((content = in.readLine()) != null) {
        // Logic to parse JSON data and use it.
    }
}
Here con is (HttpURLConnection) new URL("some url").openConnection().
Checkmarx is highlighting the issue at in.readLine().
Workarounds I tried:
1: StringEscapeUtils.unescapeJson(in.readLine()); it's not helping.
2: Used in.lines().collect(Collectors.joining()) in place of in.readLine(), based on a suggestion I found online.
This fixed the original finding but introduced a new one at con.getInputStream() (the same vulnerability).
Please help me fix this issue.
Thanks in advance.
Technically it should be StringEscapeUtils.escapeJson(in.readLine()), not StringEscapeUtils.unescapeJson(in.readLine()). The intent is to output-encode to prevent XSS, not the other way around.
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    String content = null;
    while ((content = StringEscapeUtils.escapeJson(in.readLine())) != null) {
        // Logic to parse JSON data and use it.
    }
}
Still, I don't think Checkmarx will recognize this as a sanitizer; as far as I can see, it only looks for the escapeXml, escapeHtml, escapeHtml3, and escapeHtml4 methods under StringEscapeUtils.
Work with your security team to update the Checkmarx query to include escapeJson, or use an alternative that Checkmarx does recognize, such as a replace-based approach that strips malicious tags (<, >, </, />). Bear in mind this is not a foolproof solution and should not be considered robust secure code; a sketch is shown below.
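A minimal sketch of that replace-based approach (purely illustrative; the helper name is made up, and this is no substitute for proper output encoding):

// Hypothetical helper: strips angle-bracket sequences so injected markup cannot render as tags.
// A blunt instrument, not a complete XSS defense.
static String stripTags(String input) {
    if (input == null) {
        return null;
    }
    return input.replace("</", "").replace("/>", "").replace("<", "").replace(">", "");
}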

Getting imdb movie titles in a specific language

I am writing a crawler in Java that examines an IMDB movie page and extracts some info like name, year, etc. The user writes (or copy/pastes) the link of the title, and my program should do the rest.
After examining the HTML sources of several (IMDB) pages and reading up on how crawlers work, I managed to write some code.
The info I get (for example the title) is in my mother tongue. If there is no info in my mother tongue, I get the original title. What I want is to get the title in a specific language of my choosing.
I'm fairly new to this, so correct me if I'm wrong, but I get the results in my mother tongue because IMDB "sees" that I'm from Serbia and then customizes the results for me. So basically I need to tell it somehow that I prefer results in English? Is that possible (I imagine it is), and how do I do it?
edit:
The program crawls like this: it gets the URL path as a String, converts it to a URL, reads all of the source with a BufferedReader, and inspects what it gets. I'm not sure if that is the right way to do it, but it's working (minus the language problem).
code:
public static Info crawlUrl(String urlPath) throws IOException {
    Info info = new Info();
    //
    URL url = new URL(urlPath);
    URLConnection uc = url.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(
            uc.getInputStream(), "UTF-8"));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        if (inputLine.contains("<title>")) System.out.println(inputLine);
    }
    in.close();
    //
    return info;
}
This code goes through a page and prints the main title to the console.
You don't need to crawl IMDB; you can use the dumps they provide: http://www.imdb.com/interfaces
There's also a parser for the data they provide: https://code.google.com/p/imdbdumpimport/ It's not perfect, but maybe it will help you (expect to spend some effort getting it to work).
An alternative parser: https://github.com/dedeler/imdb-data-parser
EDIT: You're saying you want to crawl IMDB anyway for learning purposes. So you'll probably have to go with http://en.wikipedia.org/wiki/Content_negotiation as suggested in the other answer:
uc.setRequestProperty("Accept-Language", "de; q=1.0, en; q=0.5");
Try looking at the request headers used by your crawler; mine contains Accept-Language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4, so I get the title in French.
EDIT:
I checked with the ModifyHeaders add-on in Google Chrome, and the value en-US gets me the English title for the movie =)
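Putting that together with the crawlUrl method from the question, a minimal sketch (only the Accept-Language line is new; the rest is the question's own code):

// Ask IMDB for English content via content negotiation before reading the stream.
URL url = new URL(urlPath);
URLConnection uc = url.openConnection();
uc.setRequestProperty("Accept-Language", "en-US,en;q=0.8");
BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "UTF-8"));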

Web Crawler specifically for downloading images and files

I am doing an assignment for one of my classes.
I am supposed to write a web crawler that downloads files and images from a website, given a specified crawl depth.
I am allowed to use a third-party parsing API, so I am using Jsoup. I've also tried htmlparser. Both are nice libraries, but they are not perfect.
I used the default Java URLConnection to check the content type before processing the URL, but it becomes really slow as the number of links grows.
Question: Does anyone know any specialized parser API for images and links?
I could start writing mine using Jsoup, but I'm being lazy. Besides, why reinvent the wheel if there could be a working solution out there? Any help would be appreciated.
I need to check the contentType while looping through the links, to determine efficiently whether a link points to a file, but Jsoup does not have what I need. Here's what I have:
HttpConnection mimeConn = null;
Response mimeResponse = null;
for (Element link : links) {
    String linkurl = link.absUrl("href");
    if (!linkurl.contains("#")) {
        if (DownloadRepository.curlExists(link.absUrl("href"))) {
            continue;
        }
        mimeConn = (HttpConnection) Jsoup.connect(linkurl);
        mimeConn.ignoreContentType(true);
        mimeConn.ignoreHttpErrors(true);
        mimeResponse = (Response) mimeConn.execute();
        WebUrl webUrl = new WebUrl(linkurl, currentDepth + 1);
        String contentType = mimeResponse.contentType();
        if (contentType.contains("html")) {
            page.addToCrawledPages(new WebPage(webUrl));
        } else if (contentType.contains("image")) {
            page.addToImages(new WebImage(webUrl));
        } else {
            page.addToFiles(new WebFile(webUrl));
        }
        DownloadRepository.addCrawledURL(linkurl);
    }
}
UPDATE
Based on Yoshi's answer, I was able to get my code to work right. Here's the link:
https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java
Use Jsoup; I think this API is good enough for your purpose. You can also find a good cookbook on this site.
Several steps:
Jsoup: how to get an image's absolute url?
how to download image from any web page in java
You can write your own recursive method which walks through the links on a page that contain the necessary domain name or relative links. Use this approach to grab all the links and find all the images on them. Writing it yourself is not bad practice. A rough sketch is shown below.
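A hedged sketch of such a recursion with Jsoup (the depth limit, domain check, and collections are all assumptions made for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class ImageCrawler {
    private final Set<String> visited = new HashSet<String>();
    private final Set<String> imageUrls = new HashSet<String>();

    public void crawl(String url, String domain, int depth) {
        if (depth < 0 || !url.contains(domain) || !visited.add(url)) {
            return; // stop on depth limit, foreign domains, or already-visited pages
        }
        try {
            Document doc = Jsoup.connect(url).get();
            for (Element img : doc.select("img[src]")) {
                imageUrls.add(img.absUrl("src")); // collect absolute image URLs
            }
            for (Element link : doc.select("a[href]")) {
                crawl(link.absUrl("href"), domain, depth - 1); // recurse into links
            }
        } catch (IOException e) {
            // skip pages that cannot be fetched
        }
    }
}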
You don't need to use the URLConnection class; Jsoup has a wrapper for it.
e.g.
You can use just one line of code to get the DOM object:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Instead of this code:
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
        yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);
in.close();
Update 1
Try adding the following lines to your code:
Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
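To tie that into the content-type routing from the question, a hedged sketch (WebPage, WebImage, WebFile, page, and webUrl are the asker's own types and variables; reusing them here is an assumption):

// Fetch once with Jsoup, then route by content type instead of opening a second URLConnection.
Connection.Response res = Jsoup.connect(linkurl)
        .ignoreContentType(true)
        .ignoreHttpErrors(true)
        .execute();
String contentType = res.contentType();
if (contentType != null && contentType.contains("html")) {
    page.addToCrawledPages(new WebPage(webUrl));
} else if (contentType != null && contentType.contains("image")) {
    page.addToImages(new WebImage(webUrl));
} else {
    page.addToFiles(new WebFile(webUrl));
}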

how to get image using html parsing with jsoup

I want to get all images using HTML parsing with Jsoup.
I use the code below:
Elements images = doc.select("img[src~=(?i)\\.(jpe?g)]");
for (Element image : images) {
    //System.out.println("\nsrc : " + image.attr("src"));
    arrImageItem.add(image.attr("src"));
}
This parses all of the images, but I only want the ones whose URL looks like this one:
http://tvrehberi.hurriyet.com.tr/images/742/403742.jpg
That is, I want to match only URLs beginning like this:
http://tvrehberi.hurriyet.com.tr/images .... .jpg
How can I parse only these?
This will probably give you what you're asking for, though your question is a bit unclear, so I can't be sure.
public static void main(String args[]) {
    Document doc = null;
    String url = "http://tvrehberi.hurriyet.com.tr";
    try {
        doc = Jsoup.connect(url).get();
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    for (Element e : doc.select("img[src~=(?i)\\.(jpe?g)]")) {
        if (e.attr("src").startsWith("http://tvrehberi.hurriyet.com.tr/images")) {
            System.out.println(e.attr("src"));
        }
    }
}
So, this might not be a very "clean" solution, but the if-statement makes sure it only prints out the image URLs from the /images/ directory on the server.
If I understood correctly, you want to retrieve the URL path up to a certain point and cut off the rest. Do you even have to do that every time?
If you are only using URLs from the one site in your example, you could store "http://tvrehberi.hurriyet.com.tr/images" as a constant, since it never changes. If, on the other hand, you fetch URLs from many different sites, you can parse your URL as described here; a small sketch follows.
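For instance, a minimal sketch using java.net.URL to rebuild the prefix (the /images path segment comes from your example, and the helper name is made up):

import java.net.MalformedURLException;
import java.net.URL;

// Rebuild "http://host/images" from any absolute image URL on that host.
static String imagePrefix(String imageUrl) throws MalformedURLException {
    URL u = new URL(imageUrl);
    return u.getProtocol() + "://" + u.getHost() + "/images";
}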
Anyway, if you shared the purpose of parsing the URLs, we certainly could help you more.

Java - Best way to download a webpage's source html

I'm writing a little crawler. What is the best way to download a web page's source HTML? I'm currently using the little piece of code below, but sometimes the result is only half of the page's source. I don't know what the problem is. Some people suggested that I should use Jsoup, but the .get().html() function from Jsoup also returns half of the page's source if it's too long. Since I'm writing a crawler, it's very important that the method supports Unicode (UTF-8), and efficiency also matters a lot. I wanted to know the best modern way to do it, so I'm asking you since I'm new to Java. Thanks.
Code:
public static String downloadPage(String url)
{
    try
    {
        URL pageURL = new URL(url);
        StringBuilder text = new StringBuilder();
        Scanner scanner = new Scanner(pageURL.openStream(), "utf-8");
        try {
            while (scanner.hasNextLine()) {
                text.append(scanner.nextLine() + NL);  // NL is a newline constant defined elsewhere
            }
        }
        finally {
            scanner.close();
        }
        return text.toString();
    }
    catch (Exception ex)
    {
        return null;
    }
}
I use commons-io: String html = IOUtils.toString(url.openStream(), "utf-8");
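Spelled out with the stream closed properly, a minimal sketch of the same approach (closing the stream yourself is my own addition, not part of the one-liner above):

import java.io.InputStream;
import java.net.URL;
import org.apache.commons.io.IOUtils;

// Download a page as a UTF-8 string using commons-io.
static String fetch(String address) throws Exception {
    InputStream in = new URL(address).openStream();
    try {
        return IOUtils.toString(in, "utf-8");
    } finally {
        in.close();
    }
}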
Personally, I'm very pleased with the Apache HTTP library http://hc.apache.org/httpcomponents-client-ga/. If you're writing a web crawler, which I am also, you may greatly appreciate the control it gives over things like cookies and client sharing and the like.
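A minimal sketch of downloading a page with Apache HttpClient 4.x (the exact version and setup are assumptions; check the library docs for details):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// Fetch a page body as a UTF-8 string with Apache HttpClient 4.x.
public class PageDownloader {
    public static String download(String url) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        try {
            CloseableHttpResponse response = client.execute(new HttpGet(url));
            try {
                return EntityUtils.toString(response.getEntity(), "UTF-8");
            } finally {
                response.close();
            }
        } finally {
            client.close();
        }
    }
}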
