I am trying to build connectors to Twitter on top of Twitter4J using Java. One of the problems Twitter4J doesn't handle for you, and expects you to deal with yourself, is the rate limit.
My approach to getting the most out of the Twitter API through Twitter4J is to run multiple threads on top of it. I have a dump of tweets (nothing but tweet IDs) and users (user IDs) in my database, and I need my Twitter threads to query Twitter and update these tables whenever new information flows into them. So I built two different threads: one that updates the user table and one that updates the tweets table. The user-update thread is fairly easy, because Twitter supports querying up to 100 users in one go (users/lookup). The tweet endpoint, however, supports only one tweet at a time (statuses/show). So my 'tweet update' thread starts 5 more threads, each of which queries Twitter and updates a single post at a time. This is where the rate limit comes into the picture. At any moment, I have 6 threads running and querying TwitterService (my service class). Before querying, each thread checks whether the rate limit has been hit; if so, it goes to sleep. The service method the threads invoke looks like this:
private synchronized void checkRateLimitStatus() {
    if (rateLimitHit) {
        try {
            logger.warn("RateLimit has been reached");
            wait(secondsUntilReset * 1000);
            rateLimitHit = false;
            secondsUntilReset = 0;
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }
        notifyAll();
    }
}
The boolean rateLimitHit is set by a Twitter4J listener, which checks the number of requests left. Once the count reaches zero, the flag is set to true. The code looks like this:
public synchronized void onRateLimitStatus(RateLimitStatusEvent evt) {
    RateLimitStatus status = evt.getRateLimitStatus();
    if (status.getRemainingHits() == 0) {
        rateLimitHit = true;
        secondsUntilReset = status.getSecondsUntilReset();
    }
}
The problem is this: say I have 3 queries left to Twitter. checkRateLimitStatus() lets all 6 threads through, because the flag has not been set yet. But by the time the first 3 threads are done with Twitter, the count has reached zero, and the remaining 3 threads fail.
How do I solve this problem? How do I make these threads more reliable?
Assuming that fetching the rate-limit status goes over the same wire as every other Twitter call, there is always a lag that defeats any attempt at reliability based on checking that status: it can be out of date at the moment you act on it, unless you operate synchronously. I'd suggest computing the rate-limit status locally and making every thread self-recovering in case of error. Using the wait/notify mechanism is also a good idea for any repeated polling, from the perspective of not wasting CPU time.
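A minimal sketch of that local accounting, assuming a hypothetical LocalRateLimiter class and an assumed per-window quota of 15 calls (adjust to the endpoint's real limits). Workers call acquire() before each request, so permits are reserved before hitting Twitter rather than discovered exhausted afterwards; the Twitter4J listener calls update() to resync with the server's numbers:

```java
import java.util.concurrent.TimeUnit;

class LocalRateLimiter {
    private int remaining;       // calls left in the current window (tracked locally)
    private long resetAtMillis;  // when the current window resets

    LocalRateLimiter(int remaining, long windowMillis) {
        this.remaining = remaining;
        this.resetAtMillis = System.currentTimeMillis() + windowMillis;
    }

    // Called by a worker before each request; blocks while the window is exhausted.
    synchronized void acquire() {
        while (remaining == 0) {
            long sleepMillis = resetAtMillis - System.currentTimeMillis();
            if (sleepMillis <= 0) {             // window elapsed: start a new one
                remaining = 15;                 // assumed per-window quota
                resetAtMillis = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(15);
                break;
            }
            try {
                wait(sleepMillis);              // releases the lock while waiting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;                         // treat interruption as shutdown
            }
        }
        remaining--;                            // reserve this call *before* making it
    }

    // Called from the Twitter4J listener to resync with the server's numbers.
    synchronized void update(int remainingHits, int secondsUntilReset) {
        this.remaining = remainingHits;
        this.resetAtMillis = System.currentTimeMillis()
                + TimeUnit.SECONDS.toMillis(secondsUntilReset);
        notifyAll();                            // wake sleepers if the window reset early
    }

    synchronized int remaining() {
        return remaining;
    }
}
```

Because the counter is decremented under the same lock the workers use, the "3 permits left, 6 threads pass the check" race cannot happen: the 4th acquire() blocks instead of failing.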
Related
I am working on data visualization. I have a MySQL database, java backend, js front end. I access the DB via Hibernate.
I need to retrieve a couple of million rows from the DB at a time in order to visualize them properly.
Initial problem: it takes about 20-30 seconds for the whole process
This is not very user-friendly so I need to reduce that time!
I looked into where the long waiting time comes from, and it is definitely the data retrieval from the DB: when I retrieve the same data in MySQL Workbench with pure SQL instead of Hibernate, it takes almost the same time.
So data processing in the backend is not the problem; it is getting the data in the first place.
I tried to tackle this problem with multithreading.
I read about the common problems using multithreading with hibernate and tried to avoid them, but still I have the feeling I am overlooking something important.
Here is what I have tried so far:
In the method which should return all the data for further processing I have
some code...
for (int i = 0; i <= numberOfThreads; i++) {
    Runnable r = new Runnable() {
        @Override
        public void run() {
            try {
                someDataStructure = HibernateUtil.executeHqlCustom(hql,
                        lowerLimit, stepSize);
            } catch (someException e) {
                ...
            }
        }
    };
    r.run();
    r = null;
}
return someDataStructure;
The "executeHqlCustom" method creates a Session, creates a Query, uses setFirstResult to set the lower bound and setMaxResults to cap the number of rows each thread is responsible for (so the data is received in chunks).
The Session then gets closed.
I tried a lot of different combinations (from a couple of threads up to 1 thread for every single row in the DB, which is ridiculous, I know!), but in the end using no multithreading was always faster than all the options I just described.
I also tried using my own class which implements Runnable / extends Thread and had a boolean flag as an instance variable, in order to set the thread to null when it is no longer needed.
None of that was any better.
Like I said, I am 99% sure I am overlooking something very important, or that I have misunderstood some of the concepts of multithreading (I am very new to it).
Any help is appreciated!
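For what it's worth: calling r.run() directly executes the Runnable on the current thread, so the loop above is effectively sequential. A hedged sketch of what a genuinely parallel chunked fetch might look like; the database call is stubbed out by a hypothetical fetchChunk that stands in for HibernateUtil.executeHqlCustom (each real call would open its own Session and page with setFirstResult/setMaxResults):

```java
import java.util.*;
import java.util.concurrent.*;

class ChunkedFetch {
    // Stand-in for HibernateUtil.executeHqlCustom(hql, lowerLimit, stepSize):
    // here it just fabricates row ids so the sketch is runnable.
    static List<Long> fetchChunk(int lowerLimit, int stepSize) {
        List<Long> rows = new ArrayList<>();
        for (int i = lowerLimit; i < lowerLimit + stepSize; i++) rows.add((long) i);
        return rows;
    }

    static List<Long> fetchAll(int totalRows, int stepSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (int lower = 0; lower < totalRows; lower += stepSize) {
                final int lowerLimit = lower;
                // submit() runs the task on a pool thread; run() would not
                futures.add(pool.submit(() -> fetchChunk(lowerLimit, stepSize)));
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> f : futures) {
                all.addAll(f.get());   // Future.get() preserves chunk order
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this actually beats a single query still depends on the database: if one connection already saturates the disk or network, more threads only add overhead.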
I am trying to roll out my own SMS verification system for my app. I don’t want to start paying for a service and then have them jack up the price on me (Urban Airship did that to me for push notification: lesson learned). During development and beta testing I have been using Twilio with a very basic setup: 1 phone number. It worked well for over a year, but right now for whatever reason the messages aren’t always delivered. In any case I need to create a better system for production. So I have the following specs in mind:
600 delivered SMS per minute
zero misses
save money
Right now my Twilio phone number can send one SMS per second; which means the best I can handle is 60 happy users per minute. So how do I get 600 happy users per minute?
So the obvious solution is to use 10 phone numbers. But how would I implement the system? My server is App Engine, DataStore, Java. So say I purchase 10 phone numbers from Twilio (fewer would of course be better). How do I implement the array so that it can handle concurrent calls from users? Will the following be sufficient?
public static final String[] phoneBank = {"1234567890", "2345678901", "3456789012", "4567890123", ...};
public static volatile int nextIndex;

public void sendSMSUsingTwilio(String message, String userPhone) {
    nextIndex = (nextIndex + 1) % phoneBank.length;
    String toPhone = phoneBank[nextIndex];
    // boilerplate for sending sms with twilio goes here
    // ...
}
Now imagine 1000 users calling this function at the very same time. Would nextIndex run from 0,1,2…9,0,1…9,0,… successively until all requests are sent?
So really this is a concurrency problem. How will this work on Java App Engine? Will there be interleaving? Bottlenecking? I want this to be fast on a low budget: at least 600 per minute. So I definitely don't want synchronization in the code itself to waste precious time. How do I best synchronize increments of nextIndex so that the phone numbers are each used equally and in a round-robin fashion? Again, this is for Google App Engine.
You need to use the Task Queue API. Every message becomes a new task, and you can assign phone numbers round-robin or at random. As tasks complete, App Engine automatically pulls and executes the next ones. You can configure the desired throughput rate (for example, 10 per second), and App Engine will manage the required capacity for you.
You can try to implement something similar on your own, but it's much harder than it looks: you have to handle concurrency, retries, instance shutdowns, memory limits, and so on. The Task Queue API does all of that for you.
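One caveat about the question's snippet: volatile makes reads and writes of nextIndex visible across threads, but `nextIndex = (nextIndex + 1) % phoneBank.length` is a read-modify-write and is not atomic, so two concurrent requests can pick the same index. An AtomicInteger sketch (phone numbers hypothetical) keeps the rotation strict without explicit locking:

```java
import java.util.concurrent.atomic.AtomicInteger;

class PhoneBank {
    static final String[] PHONES = {"1234567890", "2345678901", "3456789012"};
    private static final AtomicInteger next = new AtomicInteger();

    // getAndIncrement() is a single atomic step, so concurrent callers can
    // never observe the same counter value; Math.floorMod keeps the index
    // non-negative even after the int counter eventually overflows.
    static String nextPhone() {
        return PHONES[Math.floorMod(next.getAndIncrement(), PHONES.length)];
    }
}
```

Note that on App Engine a static counter lives per instance, so with multiple instances the rotation is only approximately even across the fleet; that limitation is one reason the Task Queue approach in the answer above is attractive.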
I'm new to threading, so I want to understand what is happening behind the scenes when you create a bunch of Threads in a loop and the implications/better ways of doing it.
Here's an example:
for (Page page : book) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            // http request to get page and put into concurrent data structure
        }
    });
    t.start();
    threads.add(t);
}
//wait for threads
As you can probably see, in my specific use case I am paging through objects that I request via HTTP. I know there don't necessarily need to be threads here, and I could instead make async requests, but I'd like to understand (with explanations) how this could be improved.
In your example you are creating and starting a new thread for each Page object in your book. This is not useful if you have many more pages than cores on your system.
It's also rather low-level these days to create, start, and track threads directly.
A better solution is to use an ExecutorService with a number of threads close to the number of cores on the system (for I/O-bound tasks you may want more threads than that; see the comments below this answer).
For example:
final ExecutorService e =
    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (Page page : book) {
    e.submit(new Runnable() {
        public void run() {
            // http request to get page and put into concurrent data structure
        }
    });
}
You'd then wait for your ExecutorService to finish its work, via shutdown() followed by awaitTermination().
Note that depending on the server you're fetching your information from, you may need to add, on purpose, delays as to not "hammer" the server too much.
Certain websites will tell you how often you may query them (for example, the Bitstamp bitcoin exchange allows one query per second) and will ban your IP if you don't respect the delay. Others won't tell you anything and will simply ban your IP if they detect that you're leeching too fast.
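Putting the answer's pieces together, a self-contained sketch (the HTTP request is stubbed out with a string, and the page count is arbitrary) that submits one task per page into a fixed-size pool, then shuts the pool down and waits:

```java
import java.util.Collection;
import java.util.concurrent.*;

class PageFetcher {
    static Collection<String> fetchAll(int pages) {
        // concurrent data structure shared by all worker threads
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int page = 1; page <= pages; page++) {
            final int p = page;
            // stand-in for the HTTP request that fetches one page
            pool.submit(() -> results.add("page-" + p));
        }
        pool.shutdown();                                 // stop accepting new tasks
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);  // the "wait for threads" step
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}
```

Adding a per-task delay (to avoid hammering the server) would go inside the submitted Runnable, before the request.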
I have a Spring MVC, Hibernate (Postgres 9 DB) web app. An admin user can send in a request to process nearly 200,000 records (each record collected from various tables via joins). Such an operation is requested on a weekly or monthly basis, or whenever the data reaches a limit of around 100,000-200,000 records. On the database end, I am correctly implementing batching.
PROBLEM: Such a long-running request holds up the server thread, and that causes normal users to suffer.
REQUIREMENT: The high response time of this request is not an issue. What's required is that other users don't suffer because of this time-consuming process.
MY SOLUTION:
Implement a thread pool using Spring's TaskExecutor abstraction. I can initialize my thread pool with, say, 5 or 6 threads, break the 200,000 records into smaller chunks of, say, 1,000 each, and queue those chunks. To give normal users faster DB access, maybe I can make every runnable thread sleep for 2 or 3 seconds between chunks.
The advantage I see in this approach: instead of executing one huge DB-interacting request in one go, we have an asynchronous design spanning a longer time, behaving like multiple normal user requests.
Can some experienced people please give their opinion on this?
I have also read about implementing the same behaviour with message-oriented middleware like JMS/AMQP, or with Quartz scheduling. But frankly, I think internally they are going to do the same thing, i.e. make a thread pool and queue the jobs. So why not go with Spring's TaskExecutor instead of adding completely new infrastructure to my web app just for this feature?
Please share your views, and let me know if there are better ways to do this.
Once again: the time to completely process all the records is not a concern. What's required is that normal users accessing the web app during that time do not suffer in any way.
You can parallelize the tasks and wait for all of them to finish before returning the call. For this you want ExecutorCompletionService, which has been in the Java standard library since Java 5.
In short, you use your container's service locator to create an instance of ExecutorCompletionService:
ExecutorCompletionService<List<MyResult>> queue =
    new ExecutorCompletionService<List<MyResult>>(executor);
// do this in a loop
queue.submit(aCallable);
// after looping
queue.take().get(); // take() blocks until the *next* task finishes; call it once per submitted task
If you do not want to wait, you can process the jobs in the background without blocking the current thread, but then you will need some mechanism to inform the client when the job has finished. That can be done through JMS, or, if you have an Ajax client, it can poll for updates.
Quartz also has a job-scheduling mechanism, but Java provides a standard way.
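A self-contained sketch of that pattern, with the per-chunk processing stubbed out as a trivial Callable (the chunk values and the pool size are placeholders). The key point is the pairing: one take() for each submit():

```java
import java.util.*;
import java.util.concurrent.*;

class CompletionDemo {
    static List<Integer> processAll(List<Integer> chunks) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        ExecutorCompletionService<Integer> queue =
                new ExecutorCompletionService<>(executor);
        for (final int chunk : chunks) {
            queue.submit(() -> chunk * 2);   // stand-in for processing one chunk
        }
        List<Integer> results = new ArrayList<>();
        try {
            // one take() per submitted task: each take() blocks only until
            // the next task completes, so results arrive in completion order,
            // not submission order
            for (int i = 0; i < chunks.size(); i++) {
                results.add(queue.take().get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
        return results;
    }
}
```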
EDIT:
I might have misunderstood the question. If you do not want a faster response but rather want to throttle the CPU, use this approach.
You can make an inner class like this PollingThread, where the batches (containing a java.util.UUID for each job) and the number of PollingThreads are defined in the outer class. It keeps running forever and can be tuned to keep your CPUs free to handle other requests.
class PollingThread implements Runnable {
    @SuppressWarnings("unchecked")
    public void run() {
        Thread.currentThread().setName("MyPollingThread");
        while (!Thread.interrupted()) {
            try {
                synchronized (incomingList) {
                    if (incomingList.size() == 0) {
                        // incoming is empty, wait for some time
                    } else {
                        // clear the original
                        list = (LinkedHashSet<UUID>) incomingList.clone();
                        incomingList.clear();
                    }
                }
                if (list != null && list.size() > 0) {
                    processJobs(list);
                }
                // Sleep for some time
                try {
                    Thread.sleep(seconds * 1000);
                } catch (InterruptedException e) {
                    // ignore
                }
            } catch (Throwable e) {
                // ignore
            }
        }
    }
}
Huge DB operations are usually triggered in the wee hours, when user traffic is low (say, 1 AM to 2 AM). Once you find that window, you can simply schedule a job to run at that time. Quartz can come in handy here, with time-based triggers. (Note: manually triggering a job is also possible.)
The processed result could then be stored in different tables (I'll call them result tables). Later, when a user wants the result, the DB operations run against these result tables, which have minimal records and hardly any joins.
instead of adding a completely new infrastructure in my web app just for this feature?
Quartz.jar is ~350 KB, and adding the dependency shouldn't be a problem. Also note that there's no reason this needs to live in the web app: the few classes that do the ETL could be placed in a standalone module, and the web app's requests would only fetch from the result tables.
All that apart, if you already have a master-slave DB model (discuss that with your DBA), you could run the huge DB operations against the slave DB rather than the master, which normal users would be pointed to.
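If pulling in Quartz feels heavy for a single time-based trigger, the JDK's ScheduledExecutorService can cover the simple case. A sketch, where the 1 AM run time and the job body are placeholders:

```java
import java.time.*;
import java.util.concurrent.*;

class NightlyJob {
    // Millis from 'now' until the next occurrence of runAt (today or tomorrow).
    static long delayUntilNextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) next = next.plusDays(1);
        return Duration.between(now, next).toMillis();
    }

    static void schedule(Runnable etlJob) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        long initialDelay = delayUntilNextRun(LocalDateTime.now(), LocalTime.of(1, 0));
        // run once a day at roughly 1 AM; small drift is fine for a batch ETL
        scheduler.scheduleAtFixedRate(etlJob, initialDelay,
                TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
```

Quartz is still the better choice once you need cron expressions, misfire handling, or persistence of schedules across restarts.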
I'm playing around with the GPars library while working to improve the scalability of a matching system. I'd like to be able to query the database and then immediately query it again while the previous results are being processed concurrently. The bottleneck is reading from the database, so I would like to keep the database busy full time while processing the results asynchronously as they become available. I realise I may have some fundamental misunderstandings of how the actor framework works, and I'd be happy to be corrected!
In pseudo code I'm trying to do the following:
Define two actors: one for running selects against the database and another for processing the records.
queryActor queries the database and sends results to processorActor
queryActor immediately queries the database again, without waiting for processorActor to finish
I could probably achieve the simple use case without using actors but my end goal is to have an actor pool that is always working on new queries with potentially different datasources in order to increase the throughput of the system in general.
The processing Actor will always be much faster than the database query so I would like to query multiple replicas concurrently in future.
def processor = actor {
    loop {
        react { querySet ->
            println "processing recordset"
            if (querySet instanceof Object[]) {
                MatcherDataRowProcessor matcher =
                    new MatcherDataRowProcessor(matchedRecords, matchedRecordSet)
                matchedRecords = matcher.processRecordset(querySet)
                reply matchedRecords
            } else {
                println 'processor fed nothing, halting processor actor'
                stop()
            }
        }
    }
}
def dbqueryer = actor {
    println "dbqueryer has started"
    while (batchNum.longValue() <= loopLimiter) {
        println "hitting db"
        Object[] querySet
        def thisRuleBatch = new MatchRuleBatch(targetuidFrom, targetuidTo)
        thisRuleBatch.targetuidFrom = batchNum * perBatch - perBatch
        thisRuleBatch.targetuidTo = thisRuleBatch.targetuidFrom + perBatch
        thisRuleBatch.targetName = targetName
        thisRuleBatch.whereClause = whereClause
        querySet = dao.getRecordSet(thisRuleBatch)
        processor.send querySet
        batchNum++
    }
    react { processedRecords ->
        processor.send false
    }
}
I would suggest taking a look at Dataflow Queues in the Dataflow Concurrency section of the user guide for GPars. You may find that Dataflows provide a better/cleaner abstraction for your problem at hand. Dataflows can also be used in conjunction with actors.
I think either actors or dataflows would work in this situation and feel that the decision comes down to which one provides the abstraction that more closely matches what you are trying to accomplish. For me, the concept of tasks, queues, dataflows seems to be a closer fit terminology-wise.
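For comparison, the same query/process split can be expressed in plain Java with a BlockingQueue between a producer and a consumer, which is roughly what a dataflow queue gives you with less ceremony. A sketch with the DAO call stubbed out (the batch payloads and the poison-pill shutdown are illustrative choices, not GPars API):

```java
import java.util.concurrent.*;

class QueryPipeline {
    private static final Object[] POISON = new Object[0]; // end-of-stream marker

    static int run(int batches) {
        BlockingQueue<Object[]> queue = new ArrayBlockingQueue<>(4);
        // producer: keeps the "database" busy; never waits on the processor
        // beyond the queue's small buffer
        Thread querier = new Thread(() -> {
            try {
                for (int i = 0; i < batches; i++) {
                    // stand-in for dao.getRecordSet(thisRuleBatch)
                    queue.put(new Object[] { "recordset-" + i });
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        querier.start();
        int processed = 0;
        try {
            // consumer: processes each record set as it becomes available
            for (Object[] set = queue.take(); set != POISON; set = queue.take()) {
                processed++;   // matcher.processRecordset(set) would run here
            }
            querier.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return processed;
    }
}
```

Scaling this to several database replicas would mean several producer threads feeding the same queue, with one poison pill per producer (or a countdown) to signal completion.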
After some more research I have found that the dataflow concurrency support in GPars is actually built on top of the actor support. The DataflowOperatorTest in the GPars Java demo distribution (I need a Java implementation) seems to be a good match for what I need to do: the main thread waits for multiple stream inputs to be populated, which in my case are the parallel database queries.