How to parse result by Custom Search Engine - java

I'm using Jsoup to parse data from a website, but I don't know how to parse the search results produced by a CSE (Custom Search Engine).
From each search result, I want to get the image, title, link and description.
If you know how, please suggest a solution.
Search link:
http://www.truyenngan.com.vn/tim-kiem.html?q=love&cx=000993172113723111222%3Auprumhk-rde&cof=FORID%3A11&ie=UTF-8&siteurl=www.truyenngan.com.vn%2F&ref=&ss=419j62441j4

When you call page.asXml() you get the source of the whole page, including the search results.
You then need some parsing logic: the links sit inside elements with a particular tag, class or id, so you can select those elements and loop over them.
Document doc = Jsoup.parse(page.asXml());
// Replace the placeholder with a real selector, e.g. a tag name, ".some-class" or "#some-id"
Elements elements = doc.select("<id/div/class>");
Iterate over elements to get the link and description of each result.

Use the Custom Search Engine API instead; it returns the results already parsed, as JSON.

Related

Java - use searchbar on given website

Let me just start by saying that this is a soft question.
I am rather new to application development, which is why I'm asking without any actual code to show. I know the basics of Java, and I was wondering if anyone could enlighten me on the following topic:
Say I have an external website, Craigslist or some other site, that lets me search through products/services/results manually by typing a query into a search box somewhere on the page. The trouble is that the site has no API for me to use.
However I do know that http://sfbay.craigslist.org/search/sss?query=QUERYHERE&sort=rel points me to a list of results, where QUERYHERE is replaced by what I'm looking for.
What I'm wondering here is: is it possible to store these results in an Array (or List or some form of Collection) in Java?
Is there perhaps some library or external tool that can allow me to specify a query to search for, have it paste it in to a search-link, perform the search, and fill an Array with the results?
Or is what I am describing impossible without an API?
It depends. If the website can return the results as XML or JSON (often by adding .xml or .json to the end of the URL), you can parse them easily: use a DOM parser for XML, or download a JSON library for JSON.
Otherwise you will receive the HTML page that a user would see in a browser. You can try to parse it as XML, but mapping all the fields in the HTML to the list you want will take a lot of work.
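For the first step the question describes, pasting a query into the search link, here is a minimal sketch that uses only the JDK. The class and method names (SearchUrlBuilder, buildSearchUrl) are made up for illustration; the URL pattern is the one quoted in the question. Downloading and parsing the page it points to is a separate step.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the search link from the question for a given query.
class SearchUrlBuilder {

    // URL pattern taken from the question; the query is appended where QUERYHERE was.
    private static final String BASE = "http://sfbay.craigslist.org/search/sss";

    static String buildSearchUrl(String query) {
        // URLEncoder applies form encoding: spaces become '+', '&' becomes %26.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return BASE + "?query=" + encoded + "&sort=rel";
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("used bike"));
    }
}
```

Encoding the query matters because a raw space or & in the user's input would otherwise break the URL or be read as a second parameter.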

Why Elements is empty?

This is my first time using JSoup, so I'm not familiar with it. I have already read the cookbook, but I still don't know why the Elements are empty. Am I missing something?
Document doc = Jsoup.connect("http://sports.news.naver.com/sports/" +
        "index.nhn?category=baseball.html").get();
Elements teams = doc.select("td.t_name");
Elements wins = doc.select("td.win");
System.out.println(teams.isEmpty());
System.out.println(wins.isEmpty());
Probably because there is no "td.t_name" or "td.win" in the document.
You should make sure that the document you get from http://sports.news.naver.com/sports/index.nhn?category=baseball.html
contains the data you want to select.
When I debugged your code, I didn't see any "td.win" or "td.t_name" in the document.
Note that data loaded via AJAX is not downloaded by JSoup, because it does not execute JavaScript.

Screen Scraping Using Jsoup to Extract Sentences

I want to do some screen scraping, and after a little research it appears that JSoup is the best tool for the task. I want to be able to extract all the sentences on a web page; for example, given this Wikipedia page, http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping, I want to get all the sentences on that page and print them to the console. I'm still not familiar with how JSoup works, so if somebody could help me out that would be greatly appreciated. Thanks!
First download Jsoup and include it in your project. Then the best place to start is the Jsoup cookbook (http://jsoup.org/cookbook/) as it provides examples for the most common methods you will use with Jsoup. I recommend that you spend some time working through those examples to familiarize yourself with the API. Another good resource is the javadocs.
Here is a quick example to pull some text from the Wikipedia link you provided:
String url = "http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping";
// Download the HTML and store in a Document
Document doc = Jsoup.connect(url).get();
// Select the <p> Elements from the document
Elements paragraphs = doc.select("p");
// For each selected <p> element, print out its text
for (Element e : paragraphs) {
    System.out.println(e.text());
}
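The loop above prints whole paragraphs. To go one step further and get individual sentences, one option is the JDK's java.text.BreakIterator. The sketch below is only an illustration: SentenceSplitter and splitSentences are hypothetical names, the sample text is invented, and the sentence detection is rule-based, so abbreviations such as "e.g." may be split incorrectly.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits a block of plain text into sentences.
class SentenceSplitter {

    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        // Walk the sentence boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // In the loop above, this string would be e.text() for each <p> element.
        String paragraph = "Data scraping is a technique. It is often done ad hoc.";
        for (String s : splitSentences(paragraph)) {
            System.out.println(s);
        }
    }
}
```

Calling splitSentences(e.text()) inside the for loop would then print one sentence per line instead of one paragraph per line.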

get number of google search results

I searched a lot for a way to retrieve the number of Google search results using Java, but nothing worked.
I have tried the Google Custom Search API as well.
I don't want the titles or URLs of the results, just the total number of results found.
Can someone please guide me?
By using the Custom Search API, you're on the right track.
There's a totalResults key in the response JSON that you get from your query. Just grab its value and you're done.
If you want your JSON to contain only that value, add the fields parameter to your query like this:
https://www.googleapis.com/customsearch/v1?key={YOUR_API_KEY}&cx={YOUR_SEARCH_ENGINE_ID}
&q={YOUR_SEARCH_STRING}&alt=json&fields=queries(request(totalResults))
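Once the response body is in a String, pulling out the value could look like the sketch below. It uses a regular expression only to stay dependency-free; a real JSON parser would be more robust. The class name, the method name and the sample response are all assumptions: the sample is hand-written to match the shape implied by fields=queries(request(totalResults)), and it assumes the count may arrive as a quoted string.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the first totalResults value from a response body.
class TotalResultsExtractor {

    // Accepts the number with or without surrounding quotes (assumption about the format).
    private static final Pattern TOTAL =
            Pattern.compile("\"totalResults\"\\s*:\\s*\"?(\\d+)\"?");

    static OptionalLong extractTotalResults(String json) {
        Matcher m = TOTAL.matcher(json);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Hand-written sample, not a captured API response.
        String sample = "{\"queries\":{\"request\":[{\"totalResults\":\"1230000\"}]}}";
        System.out.println(extractTotalResults(sample));
    }
}
```

Returning an OptionalLong makes the "key not found" case explicit instead of hiding it behind a magic value such as -1.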

Page scrape for a particular div

I am wondering if there is a way to read the html output of a given webpage using Java?
I know in PHP you can do something like:
$handle = @fopen("http://www.google.com", "r");
$source_code = fread($handle,9000);
I am looking for the Java equivalent.
Additionally, once I have the rendered HTML, are there any Java utilities that would allow me to strip out a single div by its id?
Thanks for any help with this.
Use jsoup.
You have the choice between a tree model and a powerful query syntax similar to CSS or jQuery selectors, plus utility methods to quickly get the source of a webpage.
To quote from their website:
Fetch the Wikipedia homepage, parse it to a DOM, and select the
headlines from the In the news section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Once you have found the Element representing the div you want to remove, just call remove() on it.
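If you only need the raw source, as in the PHP snippet from the question, the JDK alone is enough. The following is a minimal sketch, not a robust client: PageSource, readAll and fetch are hypothetical names, UTF-8 is assumed as the page encoding, and the result is the HTML as the server sends it, without any JavaScript having run.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads the HTML source of a page into a String.
class PageSource {

    // Reads the stream to the end and decodes it as UTF-8 (assumed charset).
    static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the URL and returns the raw HTML the server sends.
    static String fetch(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return readAll(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Offline demonstration; call fetch("http://www.google.com") to hit the network.
        byte[] bytes = "<div id=\"main\">hi</div>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)));
    }
}
```

The returned String can be handed to Jsoup.parse(...) if you then want to select or remove elements as described above.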
