Building network graph from twitter users by subject - java

I'm trying to construct a social network graph of twitter users who have mentioned a particular topic. My strategy to do this goes roughly like this:
1. Query twitter for a topic. Collect the first 100 tweets that come up and add those users to the graph.
2. For each user:
   a. Retrieve friends and followers.
   b. Query each friend/follower for the topic. If they turn up a result (meaning they've discussed the topic), add them to the graph.
3. For each user that was added to the graph, return to step 2 until the desired search depth is reached.
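A minimal sketch of this strategy as a depth-limited breadth-first crawl. The `Source` interface and its two methods are hypothetical stand-ins for the real API calls (friends/followers lookup and topic search); they are not part of twitter4j. Each candidate is tested at most once, which matters when every test costs a rate-limited request.

```java
import java.util.*;

final class TopicCrawl {
    /** Hypothetical lookups; in practice these would wrap Twitter API calls. */
    interface Source {
        List<String> neighbors(String user);   // friends and followers of the user
        boolean mentionsTopic(String user);    // has the user discussed the topic?
    }

    /** Expands outward from the seed users, one level per iteration, up to maxDepth levels. */
    static Map<String, Set<String>> crawl(Collection<String> seeds, Source src, int maxDepth) {
        Map<String, Set<String>> graph = new LinkedHashMap<>();
        Set<String> checked = new HashSet<>(seeds);   // users already tested for the topic
        List<String> frontier = new ArrayList<>(seeds);
        for (String s : seeds) graph.put(s, new LinkedHashSet<>());

        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String user : frontier) {
                for (String n : src.neighbors(user)) {
                    if (graph.containsKey(n)) {
                        graph.get(user).add(n);           // known node: only record the edge
                    } else if (checked.add(n) && src.mentionsTopic(n)) {
                        graph.put(n, new LinkedHashSet<>());
                        graph.get(user).add(n);
                        next.add(n);                      // expand this user at the next depth
                    }
                }
            }
            frontier = next;
        }
        return graph;
    }
}
```

The returned map holds, for each user in the graph, the set of neighbors that also discussed the topic.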
My problem is two-fold. First of all, this approach quickly exceeds my search API rate limit. Even with a search depth of 2, it's quite likely that I'll find people with 100+ friends/followers and I am unable to query them all before hitting the rate limit.
Secondly, this all takes quite a while. The Twitter API is not fast. If I were not rate limited, I could submit the requests asynchronously, but I can't help wondering whether there is a more efficient way.
I've tried aggregating the requests into one query per search depth:
topic AND from:name1 OR from:name2 .... OR from:namei
This basically explodes: I get a connection reset error from the Twitter API. If I copy the query into the Twitter web page, it just sits for a while and then says "loading tweets seems to be taking awhile."
I also emailed api@twitter.com to ask for suggestions or an access increase, but have had no response so far.
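One thing that may be worth trying is splitting the names into several shorter queries instead of one very long one, and grouping the `from:` terms so the topic applies to all of them. The sketch below only builds the query strings. The parenthesized grouping and the length cap passed as `maxLen` are assumptions to check against the current search documentation, and `FromQueryBatcher` is a hypothetical helper, not a twitter4j class.

```java
import java.util.ArrayList;
import java.util.List;

final class FromQueryBatcher {
    /** Builds queries like "topic (from:a OR from:b)", each at most maxLen characters where possible. */
    static List<String> build(String topic, List<String> names, int maxLen) {
        List<String> queries = new ArrayList<>();
        StringBuilder clause = new StringBuilder();
        for (String name : names) {
            String term = "from:" + name;
            String candidate = clause.length() == 0 ? term : clause + " OR " + term;
            // Start a new query when adding this name would exceed the cap.
            if (clause.length() > 0 && wrap(topic, candidate).length() > maxLen) {
                queries.add(wrap(topic, clause.toString()));
                clause = new StringBuilder(term);
            } else {
                clause = new StringBuilder(candidate);
            }
        }
        if (clause.length() > 0) queries.add(wrap(topic, clause.toString()));
        return queries;
    }

    private static String wrap(String topic, String clause) {
        return topic + " (" + clause + ")";
    }
}
```

Each batch still costs one search request, so this reduces the request count per depth level but does not remove the rate limit.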
If anyone has any suggestions on how to go about gathering this type of information through the twitter API, I would very much appreciate it. I am currently using twitter4j and java.

Have you tried just using a filtered stream for a topic, and building the graph using mentions and retweets? This is quite indirect, and will still be slow, but won't hit any rate limits.
See http://truthy.indiana.edu/ and http://cnets.indiana.edu/groups/nan/truthy
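A rough sketch of the graph-building half of this idea: every streamed tweet contributes an edge from its author to each user it mentions (a retweet written as "RT @name" counts as a mention here). Stream payloads normally carry mention entities already, so the regular expression below is only a stand-in for illustration, and `MentionGraph` is a hypothetical class name.

```java
import java.util.*;
import java.util.regex.*;

final class MentionGraph {
    // Screen names: letters, digits, underscore (length cap of 15 is an assumption).
    // The lookbehind skips "@" inside e-mail-like text such as a@b.com.
    private static final Pattern MENTION =
            Pattern.compile("(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})");

    private final Map<String, Set<String>> edges = new TreeMap<>();

    /** Adds an edge author -> mentioned user for every @mention in the text. */
    void addTweet(String author, String text) {
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            String target = m.group(1).toLowerCase(Locale.ROOT);
            edges.computeIfAbsent(author.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                 .add(target);
        }
    }

    Map<String, Set<String>> edges() {
        return edges;
    }
}
```

Because the graph grows only from tweets that arrive on the stream, no friend/follower or search requests are needed.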

Reduction in georeferenced tweets

While streaming twitter data, I found that there has been an obvious reduction in geo-referenced tweets (tweets with lat and lon). Is it because of the Foursquare information integration? Or are there any other issues?
Many thanks!
I worked on a Social Analytics by Location application last year. We sampled tweets from Twitter with the intention of using the geolocation attributes to determine sentiment by area. Unfortunately, we found that only 10-15% of tweets (based on our own findings) were actually geo-tagged, which was not enough to provide an accurate depiction of sentiment. Instead, we opted to use location-indicative hashtags.
That said, it depends on the sample size. We were trying to determine sentiment in areas such as buildings, which had a small number of active Twitter users. If your aim is to find tweets within much larger areas such as towns, cities, or countries, then 10-15% is probably enough for your needs.
To answer your original question: users generally keep their location private unless they explicitly intend to check in somewhere, so my guess is that the 10-15% of tweets that are geo-located result from users forgetting to disable geolocation or using a new or infrequently used device where it is not disabled. It can also be attributed to the Foursquare integration, as I'm sure users overlook the fact that Foursquare provides Twitter with the geolocation information.
This article is an interesting read. It outlines an application developed by the University of SoCal that can help users identify if they are giving away sensitive/private location information with their tweets.

Best way to gather large volume of tweets?

So I am currently trying to gather tweets on a specific location and then analyse what is going on in that location from the tweets gathered. My task basically involves a lot of data mining.
The main problem I have come across however is gathering enough tweets that will allow me to make a judgement.
I have been using the Twitter Streaming API; however, this only gives 1% of all tweets, which is far from enough. I mined 100,000 tweets and very few were in English, let alone related to the location I was looking for.
I have also noticed that twitter rate limits how often you can call a method via their API. How are sites like trendsmap.com working? Are they somehow accessing a larger data set?
Edit: OK, so I have tried to use the geolocation feature in the twitter4j API. It turns out the rate limits can be avoided if you are careful with your implementation. However, the number of people who actually have the geolocation feature turned on when tweeting is very low, so the results do not represent the people in that area, and I seem to get the same tweets every time. Twitter does offer a search operator "near", which works great on their website, but as far as I can tell they have not included this functionality in their API.
If you are searching using the Twitter API you can restrict your searches to a specific geolocation using the geocode option.
You can use result_type=recent to ensure you're only getting the most recent tweets.
The maximum count - that is, number of tweets per request - is 100.
The current limit on number of search requests per hour is 450.
So, that's a maximum of 45,000 tweets per hour - is that enough for you?
tl;dr - use the most restrictive set of search parameters to limit the results to those you actually need.
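As an illustration, the parameters above could be combined into a request URL along these lines. This is a sketch that only builds the URL for the v1.1 search endpoint; the request itself would still need to be OAuth-signed, and `GeoSearchUrl` with its method names is a hypothetical helper. The `geocode` value uses the "latitude,longitude,radius" form with a `km` unit.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class GeoSearchUrl {
    static final String ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json";

    /** Builds a search URL restricted to a circle of radiusKm around (lat, lon). */
    static String build(String query, double lat, double lon, int radiusKm) {
        String geocode = String.format(Locale.ROOT, "%.4f,%.4f,%dkm", lat, lon, radiusKm);
        return ENDPOINT
                + "?q=" + enc(query)
                + "&geocode=" + enc(geocode)
                + "&result_type=recent"   // most recent tweets only
                + "&count=100";           // maximum tweets per request
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /** Upper bound on tweets per rate-limit window: requests allowed times tweets per request. */
    static int maxTweets(int requests, int perRequest) {
        return requests * perRequest;
    }
}
```

With the figures quoted above, `maxTweets(450, 100)` gives the 45,000 upper bound; actual yields will be lower whenever a page returns fewer than 100 tweets.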

Unlimited tweet search using java

I have a requirement to retrieve all (I mean "all") tweets up to a given date or between two dates.
The code I wrote does give me tweets, but only for today. I implemented paging, but it does not help: I do get multiple pages and the data is not redundant, but it is still limited to the current day. I only get around 600-700 tweets. I also used hasNext(), and it returns false after 6-7 pages.
I'm fairly new to this API and I don't have much idea about the framework, so forgive me if I sound really naive.
Here's the code:
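// Assumes tweety is a twitter4j.Twitter instance and writer is a java.io.Writer, both set up elsewhere.
// Requires: import twitter4j.Query; import twitter4j.QueryResult; import twitter4j.Status;
Query search = new Query(searchKeyWord);
search.setCount(100);              // tweets requested per page
// search.setMaxId(-1);
search.setSince("2013-01-01");
search.lang("en");
// search.setUntil("2013-05-01");
int page = 0;
QueryResult results;
// TwitterFactory.getSingleton().search(search);
do {
    page++;
    System.out.println("Page " + page);
    results = tweety.search(search);
    for (Status status : results.getTweets()) {
        // Flatten newlines so each tweet stays on one output line.
        String text = status.getText().replace("\n", " ");
        writer.append(status.getUser().getScreenName() + ";" + text + ";" + status.getCreatedAt() + ";" + "\n");
    }
    search = results.nextQuery();  // null when there are no more pages
} while (search != null);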
The requirement is for text mining on a large amount of data, so the more tweets retrieved the better. Of course, I will be restricting the since and until dates. But if I set the dates to an older time interval, tweets are still retrieved only for the last day of that interval.
Am I wrong here somewhere? Is there something I need to add or change to get all the tweets? I'm aware of rate limits. Are they the reason why I receive only limited data?
Thanks in advance.
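For reference, the paging in the loop above can also be driven explicitly by tweet ID: after each page, the next request asks only for tweets older than the lowest ID seen so far. The sketch below shows that idea with a hypothetical `fetchPage` function standing in for the search call, so it runs without twitter4j. Note that paging of any kind can only reach as far back as the search index goes, and the standard search API is documented as covering only about the last week of tweets.

```java
import java.util.*;
import java.util.function.LongFunction;

final class MaxIdPager {
    /**
     * Walks backwards through results. fetchPage is a hypothetical stand-in for a
     * search call: given maxId (Long.MAX_VALUE meaning "no bound"), it returns the
     * IDs of one page of tweets with ID <= maxId, newest first.
     */
    static List<Long> collect(LongFunction<List<Long>> fetchPage, int maxPages) {
        List<Long> all = new ArrayList<>();
        long maxId = Long.MAX_VALUE;
        for (int page = 0; page < maxPages; page++) {
            List<Long> ids = fetchPage.apply(maxId);
            if (ids.isEmpty()) break;          // nothing older is available
            all.addAll(ids);
            maxId = Collections.min(ids) - 1;  // next page: strictly older tweets
        }
        return all;
    }
}
```

The `maxPages` argument is a simple guard so a loop like this cannot run past a request budget.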
You should use both the Search API and the Streaming API. I am also working on data mining with Twitter data, and I simply implemented two different apps to collect tweets. You can do the same. The Streaming API needs only one Twitter account for the token and authentication. However, you should have more accounts for the Search API. If you have more questions, let me know.

How to get around the Twitter 3200 status limit? [duplicate]

With https://dev.twitter.com/docs/api/1/get/statuses/user_timeline I can get the 3,200 most recent tweets. However, certain sites like http://www.mytweet16.com/ seem to bypass the limit, and I could not find anything in the API documentation.
How do they do it, or is there another API that doesn't have the limit?
You can use the Twitter search page to bypass the 3,200 limit. However, you have to scroll down many times in the search results page. For example, I searched for tweets from @beyinsiz_adam. This is the link to the search results:
https://twitter.com/search?q=from%3Abeyinsiz_adam&src=typd&f=realtime
Now, in order to scroll down many times, you can use the following JavaScript code.
// Scroll to the bottom of the page once per second so more results keep loading.
var myVar = setInterval(myTimer, 1000);
function myTimer() {
    window.scrollTo(0, document.body.scrollHeight);
}
Just run it in the Firebug console and wait some time for all the tweets to load.
The only way to see more is to start saving them before the user's tweet count hits 3200. Services which show more than 3200 tweets have saved them in their own dbs. There's currently no way to get more than that through any Twitter API.
http://www.quora.com/Is-there-a-way-to-get-more-than-3200-tweets-from-a-twitter-user-using-Twitters-API-or-scraping
https://dev.twitter.com/discussions/276
Note from that second link: "…the 3,200 limit is for browsing the timeline only. Tweets can always be requested by their ID using the GET statuses/show/:id method."
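A minimal sketch of the "save them yourself" approach described above: each periodic timeline fetch is merged into a local store keyed by tweet ID, so overlapping fetches do not create duplicates. Persistence (a file or database) is left out, `TweetArchive` is a hypothetical class, and using the newest stored ID as `since_id` for the next poll is an assumption about how the polling would be wired up.

```java
import java.util.*;

final class TweetArchive {
    // Tweet ID -> text. IDs are unique, so re-fetching the same tweet is harmless.
    private final NavigableMap<Long, String> byId = new TreeMap<>();

    /** Merges one fetched timeline page; returns how many tweets were new. */
    int merge(Map<Long, String> fetched) {
        int added = 0;
        for (Map.Entry<Long, String> e : fetched.entrySet()) {
            if (byId.putIfAbsent(e.getKey(), e.getValue()) == null) added++;
        }
        return added;
    }

    int size() {
        return byId.size();
    }

    /** Highest ID stored so far, or -1 when the archive is empty. */
    long newestId() {
        return byId.isEmpty() ? -1L : byId.lastKey();
    }
}
```

As long as the archive is updated before a user posts another 3,200 tweets, the stored history keeps growing past what the timeline endpoint returns.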
I've been in this (Twitter) industry for a long time and have witnessed lots of changes in the Twitter API and documentation. I would like to clarify one thing: there is no way to surpass the 3200-tweet limit. Twitter doesn't provide this data even in its new premium API.
The only way someone can surpass this limit is by saving the tweets of an individual Twitter user.
There are tools available which claim to have a wide database and provide more than 3200 tweets. A few that I know of are followersanalysis.com and keyhole.co.
You can use a tool I wrote that bypasses the limit.
It saves the Tweets in a JSON format.
https://github.com/pauldotknopf/twitter-dump
You can use the Python library snscrape to do it. Or you can use the ExportData tool to get all tweets for the user, which returns already preprocessed CSV and spreadsheet files. The first option is free, but has less information and requires more manual work.

Show users a list of unique items on Java Google App Engine

I've been going round in circles with what must be a very simple challenge but I want to do it the most efficient way from the start. So, I've watched Brett Slatkin's Google IO videos (2008 & 2009) about building scalable apps including http://www.youtube.com/watch?v=AgaL6NGpkB8 and read the docs but as a n00b, I'm still not sure.
I'm trying to build an app on GAEJ similar to the original 'hotornot' where a user is presented with an item which they rate. Once they rate it, they are presented with another one which they haven't seen before.
My question is this: is it most efficient to do a query up front to grab x items (say 100) and put them in a list (stored in memcache?), or is it better to simply make a query for a new item after each rating?
To keep track of the items a user has seen, I'm planning to keep those items' keys in a list property of the user's entity. Does that sound sensible?
I've really got myself confused about this so any help would be much appreciated.
I would personally do something like:
When a user logs in, create a list of 100 random IDs that they have not seen. Then as they click to the next item, do a query to the datastore and pull back the one at the front of the list.
If this ends up too slow you can try to cache, but it is really hard to memcache your entire database. Even loading the 100 items each user needs will be hard as the number of users scales out. Pulling back 1 entry for 1 page load is not slow: each click will post 1 comment and pull 1 item back. Simple, and only a few ms from the datastore. Building the list of 100 random IDs they haven't seen can be slow, so it makes sense to do that ahead of time and keep it around (in their request or session, depending on how you are doing that...).
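A small in-memory sketch of that approach: build a queue of random unseen IDs once, then take one ID from the front on each click and record it as seen. Datastore and session handling are not shown, scanning every ID as done here would need a different strategy for a large dataset, and `UnseenQueue` is a hypothetical helper rather than an App Engine API.

```java
import java.util.*;

final class UnseenQueue {
    /** Picks up to n IDs the user has not seen, in random order. */
    static Deque<Long> pick(Collection<Long> allIds, Set<Long> seen, int n, Random rnd) {
        List<Long> unseen = new ArrayList<>();
        for (Long id : allIds) {
            if (!seen.contains(id)) unseen.add(id);
        }
        Collections.shuffle(unseen, rnd);
        return new ArrayDeque<>(unseen.subList(0, Math.min(n, unseen.size())));
    }

    /** Returns the next item ID to show and records it as seen, or null when the queue is empty. */
    static Long next(Deque<Long> queue, Set<Long> seen) {
        Long id = queue.pollFirst();
        if (id != null) seen.add(id);
        return id;
    }
}
```

On each click the app would call `next`, fetch that single item by key, and rebuild the queue with `pick` once it runs out.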
