Getting the number of Google hits for a search term in Java? - java

I'm trying to get the number of results from Google for a search term in Java.
E.g. for the term "computer": About 3,070,000,000 results.
Is it possible?

You should be able to do this with the Java sample library for the custom Google search API.
http://code.google.com/p/google-api-java-client/wiki/APIs#CustomSearch_API
You can also reduce the data transferred during service requests by specifying only the members that contain the data elements your application needs.
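If you go the Custom Search route, a minimal stdlib-only sketch might look like this. The `YOUR_API_KEY` and `YOUR_CX` values are placeholders for your own credentials, and the `fields` parameter is what limits the returned members as described above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HitCount {
    // Builds the Custom Search JSON API request URL.
    // The term should be URL-encoded for real use.
    static String buildUrl(String apiKey, String cx, String term) {
        return "https://www.googleapis.com/customsearch/v1?key=" + apiKey
                + "&cx=" + cx + "&q=" + term
                + "&fields=searchInformation/totalResults"; // limit returned members
    }

    // Pulls the totalResults value out of the JSON response body.
    static long extractTotalResults(String json) {
        Matcher m = Pattern.compile("\"totalResults\"\\s*:\\s*\"?(\\d+)\"?").matcher(json);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // A sample response body, not a live request.
        String sample = "{\"searchInformation\": {\"totalResults\": \"3070000000\"}}";
        System.out.println(extractTotalResults(sample)); // 3070000000
    }
}
```

Fetching the URL with `java.net.http.HttpClient` (or the Google client library) and passing the response body to `extractTotalResults` gives you the hit count; in production you would use a real JSON parser rather than a regex.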

It is not as straightforward as one might think... You can write a Java program that reads the results of a Google search on a term like "computer", but the number of results will not be statically present in the output.
You will have to use the Google Custom Search Engine, or the JSON/Atom Custom Search API.

Related

Google Places API: place category and place information

With the Google Places API, I'm able to select a location and get several properties including its name, website, phone number, etc.
However, I need to get the place category (food, gas station, etc.), but the method getPlaceTypes() gives me a List of integers, not a string.
Also, how can I get paragraphs of information about that specific place? Is that possible? Or do I have to scrape websites myself?
Thanks.
Here you can find the place types supported by the Google Places API. The list of integers that you are receiving corresponds to the well-known types you can find there, so you will need to transform the integers into your own text.
With Google Places it is not currently possible to get detailed information about a place, not even using the more detailed place details that you receive when you query the Places API Web Service.
No API gives you the categories; instead you have to explicitly provide the category name as a parameter, for example when getting nearby places. You can find the list of categories here: https://developers.google.com/places/supported_types
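If you end up with the integer type codes, the transformation suggested above can be as simple as a lookup table. A sketch; the integer values below are hypothetical and must be replaced with the codes your Places client actually returns:

```java
import java.util.HashMap;
import java.util.Map;

public class PlaceTypeLabels {
    // Hypothetical integer codes mapped to human-readable category names.
    // Replace the keys with the actual values from your client's type list.
    static final Map<Integer, String> TYPE_NAMES = new HashMap<>();
    static {
        TYPE_NAMES.put(38, "food");
        TYPE_NAMES.put(45, "gas_station");
        TYPE_NAMES.put(69, "restaurant");
    }

    // Returns the category label for a type code, or "unknown".
    static String label(int typeCode) {
        return TYPE_NAMES.getOrDefault(typeCode, "unknown");
    }
}
```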

ElasticSearch ranking - scoring

We are developing an app in Java using the Elasticsearch Java API. We have indexed metadata and want to use ranking/scoring at indexing time or search time.
Also, I don't know whether it is possible to boost a result that is chosen/approved by users when they click it, i.e. mark that result as popular and increase its popularity.
How can I implement this? Thanks for your suggestions.
Elasticsearch allows us to change/modify its score via _score.
I assume your requirement is to maintain a custom ranking in the documents rather than rely on Elasticsearch's own scoring.
If so, you need to design the documents for it. Add a field named userRank to all documents and increment its value whenever a user clicks the document in the results. Using a function_score query you can then add the userRank field value to the calculated _score.
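A sketch of the request body such a function_score query would send, built as a plain string so it runs without the Elasticsearch client on the classpath. The userRank field name follows the answer above; "missing": 0 keeps documents without the field scoring normally:

```java
public class RankQuery {
    // Builds a function_score request body that adds the stored userRank
    // value to the computed _score (boost_mode "sum").
    static String functionScoreBody(String field, String terms) {
        return "{\n"
             + "  \"query\": {\n"
             + "    \"function_score\": {\n"
             + "      \"query\": { \"match\": { \"" + field + "\": \"" + terms + "\" } },\n"
             + "      \"field_value_factor\": { \"field\": \"userRank\", \"missing\": 0 },\n"
             + "      \"boost_mode\": \"sum\"\n"
             + "    }\n"
             + "  }\n"
             + "}";
    }
}
```

With the Elasticsearch Java API you would express the same query via its query builders; the JSON above is what ends up on the wire either way.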
There's a large and complex field called learning to rank that studies how to turn quality information about documents/queries into relevance ranking rules.
For Elasticsearch specifically, there is this plugin that could help (disclaimer: I'm the creator).

Getting All Tweets From a Country Within a Time Period in Java

I am working on a project in which I will get all tweets from a country posted within a certain time period. I will then do data mining on it (examining how many positive opinions are expressed about a certain pupil, etc.). I want to use Java as the programming language. However, I don't know how to start this project. I did a search and I know that there is:
Twitter's Search API
Twitter's Streaming API
Twitter4J a twitter API for Java
Something interesting outside of Java: http://dev.datasift.com/discussions/category/csdl-language
Where can I start to get all tweets from a country (or, if possible, from a given state) within a time period? The examples I found work like this: you give a username and it returns that user's tweets if the profile is public. I don't have a list of all public profiles. Do I need to handle that problem, and how?
Any ideas?
If you are going to use Java, Twitter4J is your best bet.
But you will have to choose a strategy for retrieving the tweets that you want.
You can either get the data from Twitter itself or from a data provider with full Firehose access. DataSift and Gnip are the providers with full Firehose access. If you want to use a data provider, DataSift is the way to go because of its own query language, which is pretty cool.
In case you retrieve the data yourself:
Firstly, if you want to get the tweets in real time you need to use the Twitter Streaming API, and Twitter4J makes it really easy to use. Unfortunately, the Streaming API doesn't support country or language filtering; you can only listen to the Streaming API for the search queries you have registered.
Your second option is the Search API. Twitter4J also makes using the Search API pretty easy, and it supports many more filtering options. There still isn't any way to filter tweets by country, but filtering tweets by language is a more useful approach instead, e.g. filtering tweets that are en, fr, and so on.
Hope this helps.
You want to use the search API. However, the API doesn't allow searching by country, only by geocode.
In Twitter4J you can get the location like this:
tweet.getUser().getLocation()
But note that this returns the user's free-text location profile field, not the tweet's geolocation.

Twitter stream - location AND keywords

I am using the Twitter4J library and I am just wondering whether it is at all possible to return tweets that are within a location AND contain a certain keyword. I notice that the official Twitter documentation mentions this:
Bounding boxes are logical ORs. A locations parameter may be combined with track parameters, but note that all terms are logically ORd, so the query string track=twitter&locations=-122.75,36.8,-121.75,37.8 would match any tweets containing the term Twitter (even non-geo tweets) OR coming from the San Francisco area.
Which is unfortunate, as it is not what I need; it's returning way too many tweets. Any idea how I could get around this, or is there something I'm missing in the library that could allow me to do it?
Library javadoc: http://twitter4j.org/en/javadoc/twitter4j/FilterQuery.html#locations
At the moment my filter code looks like this:
twitter.filter(new FilterQuery().locations(sydney).track(keywords));
and have also tried each on its own line:
twitter.filter(new FilterQuery().locations(sydney).track(keywords));
twitter.filter(new FilterQuery().track(keywords));
Unfortunately, you are reading the documents correctly. I don't know enough about twitter4j to say if there's a method contained somewhere that will handle this for you more easily, but you can always just use a simple string comparison to see if your search terms are included in the tweet.
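A minimal sketch of such a string comparison, applied to each tweet received from the location-only stream (case-insensitive substring matching, so it will also match keywords inside longer words):

```java
import java.util.Arrays;
import java.util.List;

public class TweetFilter {
    // Returns true if the tweet text contains any of the tracked keywords.
    static boolean matchesAnyKeyword(String tweetText, List<String> keywords) {
        String lower = tweetText.toLowerCase();
        return keywords.stream().anyMatch(k -> lower.contains(k.toLowerCase()));
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("twitter", "java");
        System.out.println(matchesAnyKeyword("Loving the Twitter API", keywords)); // true
        System.out.println(matchesAnyKeyword("Nothing to see here", keywords));    // false
    }
}
```

In practice you would call this from your StatusListener's onStatus callback and drop tweets that return false, so that the filter query only does the location part.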
Not an answer but a simple workaround:
I know most people don't have GPS enabled when they tweet, and others don't like to share their location!
But they are still sharing their location!! Guess how? On their profiles! Their hometown or country is usually visible, which can give you an approximate location of where the tweet came from. You can query the user's profile, and thus his/her location, using the REST API:
twitter.showUser(userScreenName).getLocation();
Search for all keywords, and if the location doesn't match the one you want, simply discard the tweet. This way you can at least get more tweets.

Storing data in Lucene or database

I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.
Various search options are needed: some will have keywords entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So I could have Lucene use query strings to return the hit counts for manually entered keywords, and TermEnums in the automated cases?
The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.
Now, I could do much or all of this using hashmaps in Java to work out the counts, but if I use Lucene, my question concerns the best way to store the words for counting. Taking a single RSS feed, is it wise to have Lucene create a temporary index in memory and pass the words and hit counts out so other programs can write them to a database?
Or is it better to create a Lucene document per feed and add new feed data to it at polling time, so that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries, which I'm not sure about yet.
Hope this makes sense.
Mr Morgan.
From the description you have given in the question, I think Lucene alone will be sufficient (no need for MySQL or Solr). The Lucene API is also easy to use, and you won't need to change your frontend code.
From every RSS feed you can create a Document with three fields, namely title, description and date. The date should preferably be a NumericField. You can then append every document to the Lucene index as the feeds arrive.
How frequently do you want the system to automatically generate the popular terms? For example, do you want to show the users "most popular terms last week", etc.? If so, you can use a NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can find the document frequency of each term in the retrieved documents to determine the most popular terms. (Do not forget to remove the stopwords from your documents, say by using the StopAnalyzer, or else the most popular terms will be the stopwords.)
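If you first prototype the counting with plain hashmaps, as the question itself suggests, the core loop is short. A sketch; the stopword list here is a tiny illustrative sample, not a complete one:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TermCounter {
    // Tiny illustrative stopword list; use a full list (e.g. Lucene's) in practice.
    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "and", "of", "to", "in", "is"));

    // Counts non-stopword terms in a feed's title/description text.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

Merging one such map per feed into a global map gives you the popularity counts; what Lucene adds on top is the date-range filtering and persistence discussed above.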
I can recommend you check out Apache Solr. In a nutshell, Solr is a web enabled front end to Lucene that simplifies integration and also provides value added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.
Further, for the word counting feature you are asking about, Solr has a concept of "faceting" which will exactly fit the problem you're are describing.
If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/
Solr is definitely the way to go although I would caution against using it with Apache Tomcat on Windows as the install process is a bloody nightmare. More than happy to guide you through it if you like as I have it working perfectly now.
You might also consider the full-text indexing capabilities of MySQL, which are far easier than Lucene.
Regards
