I'm playing around with the GPars library while working to improve the scalability of a matching system. I'd like to be able to query the database and then immediately issue the next query while the previous results are being processed concurrently. The bottleneck is reading from the database, so I would like to keep the database busy full time and process the results asynchronously as they become available. I realise I may have some fundamental misunderstandings about how the actor framework works, and I'd be happy to be corrected!
In pseudo code I'm trying to do the following:
Define two actors, one for running selects against the database and another for processing the records.
queryActor queries the database and sends the results to processorActor
queryActor immediately queries the database again without waiting for processorActor to finish
I could probably achieve the simple use case without using actors but my end goal is to have an actor pool that is always working on new queries with potentially different datasources in order to increase the throughput of the system in general.
The processing actor will always be much faster than the database query, so I would like to query multiple replicas concurrently in the future.
def processor = actor {
loop {
react {querySet ->
println "processing recordset"
if (querySet instanceof Object[]) {
MatcherDataRowProcessor matcher = new MatcherDataRowProcessor(matchedRecords, matchedRecordSet);
matchedRecords = matcher.processRecordset(querySet);
reply matchedRecords
}
else {
println 'processor fed nothing, halting processor actor'
stop()
}
}
}
}
def dbqueryer = actor {
println "dbqueryer has started"
while (batchNum.longValue() <= loopLimiter) {
println "hitting db"
Object[] querySet
def thisRuleBatch = new MatchRuleBatch(targetuidFrom, targetuidTo)
thisRuleBatch.targetuidFrom = batchNum * perBatch - perBatch
thisRuleBatch.targetuidTo = thisRuleBatch.targetuidFrom + perBatch
thisRuleBatch.targetName = targetName
thisRuleBatch.whereClause = whereClause
querySet = dao.getRecordSet(thisRuleBatch)
processor.send querySet
batchNum++
}
react { processedRecords ->
processor.send false
}
}
I would suggest taking a look at Dataflow Queues in the Dataflow Concurrency section of the user guide for GPars. You may find that Dataflows provide a better/cleaner abstraction for your problem at hand. Dataflows can also be used in conjunction with actors.
I think either actors or dataflows would work in this situation and feel that the decision comes down to which one provides the abstraction that more closely matches what you are trying to accomplish. For me, the concept of tasks, queues, dataflows seems to be a closer fit terminology-wise.
After some more research I have found that the Dataflow concurrency support in GPars is actually built on top of the Actor support. The DataflowOperatorTest in the GPars Java demo distribution (I need to do a Java implementation) seems to be a good match for what I need to do. The main thread waits for multiple stream inputs to be populated, which in my case are the parallel database queries.
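The query/process split described in this thread can also be sketched with plain java.util.concurrent, without GPars. This is not the GPars Dataflow API; it swaps in a bounded BlockingQueue to play the role of the dataflow queue. The names QueryPipeline, fetchBatch, and processBatch are hypothetical stand-ins for dao.getRecordSet and the matcher, and error handling is deliberately minimal (a failing producer would leave the consumer waiting).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.IntFunction;

class QueryPipeline {
    // Sentinel (compared by identity) telling the processor that no more batches will arrive.
    private static final Object[] DONE = new Object[0];

    /**
     * Fetches `batches` result sets on one thread while a second thread
     * processes each result set as soon as it has been queued.
     */
    static List<Integer> run(int batches,
                             IntFunction<Object[]> fetchBatch,
                             Function<Object[], Integer> processBatch) {
        // Bounded queue: the producer blocks if processing falls far behind.
        BlockingQueue<Object[]> queue = new ArrayBlockingQueue<>(4);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Producer: issues the next query without waiting for processing to finish.
            pool.submit(() -> {
                for (int i = 1; i <= batches; i++) {
                    queue.put(fetchBatch.apply(i));
                }
                queue.put(DONE);
                return null;
            });
            // Consumer: processes result sets in arrival order until the sentinel shows up.
            Future<List<Integer>> processed = pool.submit(() -> {
                List<Integer> results = new ArrayList<>();
                for (Object[] rows = queue.take(); rows != DONE; rows = queue.take()) {
                    results.add(processBatch.apply(rows));
                }
                return results;
            });
            return processed.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Fake "queries" returning i rows each; "processing" just counts the rows.
        System.out.println(run(3, i -> new Object[i], rows -> rows.length));
    }
}
```

To query several replicas concurrently, one option would be to start one producer per data source feeding the same queue, with the consumer stopping after it has seen one sentinel per producer.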
Related
I am working on data visualization. I have a MySQL database, java backend, js front end. I access the DB via Hibernate.
I need to retrieve a couple of million rows from the DB at a time in order to visualize them properly.
Initial problem: it takes about 20-30 seconds for the whole process
This is not very user-friendly so I need to reduce that time!
I looked into where the long waiting time is coming from, and it is definitely retrieving the data from the DB: when I retrieve the same data in MySQL Workbench with pure SQL instead of Hibernate, it takes almost the same time.
So data processing in the backend is not the problem, it is getting the data in the first place.
I tried to tackle this problem with multithreading.
I read about the common problems using multithreading with hibernate and tried to avoid them, but still I have the feeling I am overlooking something important.
Here is what I tried so far:
In the method which should return all the data for further processing I have
some code...
for(int i = 0; i <= numberOfThreads; i++) {
Runnable r = new Runnable() {
@Override
public void run() {
try{
someDataStructure = HibernateUtil.executeHqlCustom(hql,
lowerLimit, stepSize);
}catch(someException){
...
}
}
};
r.run();
r = null;
}
return someDataStructure;
The "executeHqlCustom" method creates a session, creates a Query, uses setFirstResult to set the lower bound, and uses setMaxResults to set the amount of data each thread should be responsible for (this way the data should be received in chunks).
The session then gets closed.
I tried a lot of different combinations (from a couple of threads up to one thread for every single row in the DB, which is ridiculous, I know!), but in the end using no multithreading was always faster than all the other options I just described.
I also tried using my own class which implements Runnable / extends Thread and had a boolean flag as an instance variable in order to set the thread to null when it is not needed anymore.
None of that was any better.
Like I said I am 99% sure I am overlooking something very important or that I misunderstood some of the concepts of multithreading (I am very new to that).
Any help is appreciated!
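One detail of the loop above is worth checking: calling run() on a Runnable (or on a Thread) is an ordinary method call that executes on the calling thread, so no work is done in parallel. The following minimal sketch, with the hypothetical class name RunVersusStart, records which thread actually executes a task when it is invoked through run() versus start(). It does not touch Hibernate and only illustrates the threading behaviour.

```java
class RunVersusStart {
    /** Returns the name of the thread that actually executed the task. */
    static String executingThread(boolean useStart) {
        String[] seen = new String[1];
        Thread worker = new Thread(
                () -> seen[0] = Thread.currentThread().getName(), "worker");
        if (useStart) {
            worker.start();               // schedules the task on the new thread
            try {
                worker.join();            // wait, so the result is visible to the caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        } else {
            worker.run();                 // plain method call: runs on the calling thread
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("run():   " + executingThread(false));
        System.out.println("start(): " + executingThread(true));
    }
}
```

Whether real threads would help here is a separate question: if the database itself is the bottleneck, as the MySQL Workbench timing suggests, splitting one large read into many paged reads may not make it faster.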
I am developing a service that calls multiple external services that are independent of each other. I collate the responses of all these services and return them as a consolidated response. Since they are not interdependent, I am using Spring's @Async capability to perform all these activities in parallel. I am following the example provided at this link:
https://spring.io/guides/gs/async-method/
Here, a while loop is used to wait until all the responses are obtained:
while (!(page1.isDone() && page2.isDone() && page3.isDone())) {
Thread.sleep(10); //10-millisecond pause between each check
}
I know this is sample code aimed at explaining the concept, which it does effectively. However, in an enterprise application, can a while loop like the one shown above be used, or should a different approach be adopted? If so, what is the advantage of that approach over using a while loop?
Couldn't you just use Future.get()? It's a blocking call, so it waits until the result is ready. You can do something like:
// get() returns the computed value, not the Future itself
List<Object> results = new ArrayList<>();
results.add(page1.get());
results.add(page2.get());
results.add(page3.get());
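The Future.get() suggestion can be sketched end to end with a plain ExecutorService standing in for Spring's @Async machinery (that substitution is an assumption made to keep the example self-contained). The class name CollectResults and the helper callAll are hypothetical. All calls are submitted first, so they run in parallel, and the total wait is roughly that of the slowest call rather than the sum.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CollectResults {
    /** Submits every call, then blocks on each Future in turn; no polling loop is needed. */
    static <T> List<T> callAll(List<Callable<T>> calls) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, calls.size()));
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> call : calls) {
                futures.add(pool.submit(call));   // all calls start before we wait on any
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());        // blocks until this result is ready
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the three external service calls.
        List<Callable<String>> calls = List.of(() -> "page1", () -> "page2", () -> "page3");
        System.out.println(callAll(calls));
    }
}
```

Compared with the sleep-and-check loop, blocking on get() wakes up as soon as each result is available and surfaces a failed call as an exception instead of silently spinning.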
I'm using Java to create EC2 instances from within Eclipse. Now I would like to push parts of the application to these instances so that these can process whatever needs processing and then send the results back to my machine.
What I'm trying to do is something along the lines of:
assignWork(){
workPerformed = workQueue;
workPerInstance = workQueue/numberOfInstances;
while(workQueue > 0){
nextInstance.doWork(workPerformed, workPerInstance);
workPerformed -= workPerInstance;
}
}
doWork(start, end){
while(start>end){
//process stuff
start--;
}
}
This way I could control exactly how many AMIs to instantiate depending on the volume of work at hand. I could instantiate them, send them specific code to process, and then terminate them as soon as I receive the results.
Is this possible just using the AWS SDK?
It is, but consider that...
If you have SLAs, and they fall within SQS limits (message retention defaults to 4 days and can be raised to 14), you could consider publishing your task queues into SNS/SQS, and use CloudWatch to track the number of needed instances.
If you have a clear division of roles (more like a workflow), the long-running tasks are not much of a concern, and you can retry, also consider using AWS SWF instead. It goes a bit beyond an SQS/SNS combo, and I think it could fit nicely with CloudWatch (that's just a theory; I haven't looked further). The downside is the AWS Flow Framework, which is extremely painful to use for writing the workflow processes.
If your workload is predictable (say, around 5K jobs to process today), meaning you have no need for real-time and you can batch those requests, then consider using Elastic MapReduce for this. Being Hadoop-based, it offers some niceties, such as being able to resize your cluster on demand, and the obvious benefit of not having any vendor lock-in at all.
Actually, if you want that managed for you and without many surprises, consider looking at options such as PiCloud and IronWorker. They were really made for situations just like the one you've described.
If you have only a queue and EC2, you can surely automate that. It only depends on how tightly you want to coordinate these tasks, but I'm sure it's possible.
I have a Spring MVC, Hibernate (Postgres 9 DB) web app. An admin user can send in a request to process nearly 200,000 records (each record collected from various tables via joins). Such an operation is requested on a weekly or monthly basis (or whenever the data reaches a limit of around 100,000 to 200,000 records). On the database end, I am correctly implementing batching.
PROBLEM: Such a long-running request holds up the server thread, and that causes the normal users to suffer.
REQUIREMENT: The high response time of this request is not an issue. What's required is to not make other users suffer because of this time-consuming process.
MY SOLUTION:
Implement a thread pool using Spring's TaskExecutor abstraction. I can initialize the thread pool with, say, 5 or 6 threads and break the 200,000 records into smaller chunks of, say, 1,000 each, then queue these chunks. To further allow normal users faster DB access, maybe I can make every runnable thread sleep for 2 or 3 seconds.
The advantage I see in this approach is that instead of executing a huge DB-interacting request in one go, we have an asynchronous design spanning a longer time, thus behaving like multiple normal user requests.
Can some experienced people please give their opinion on this?
I have also read about implementing the same behaviour with message-oriented middleware like JMS/AMQP, or with Quartz scheduling. But frankly speaking, I think internally they are going to do the same thing, i.e. create a thread pool and queue the jobs. So why not go with Spring's TaskExecutor instead of adding a completely new infrastructure in my web app just for this feature?
Please share your views on this and let me know if there are other, better ways to do this.
Once again: the time to completely process all the records is not a concern; what's required is that normal users accessing the web app during that time should not suffer in any way.
You can parallelize the tasks and wait for all of them to finish before returning the call. For this, you want to use ExecutorCompletionService, which has been part of the Java standard library since 5.0.
In short, you use your container's service locator to obtain an executor and create an instance of ExecutorCompletionService around it:
ExecutorCompletionService<List<MyResult>> queue = new ExecutorCompletionService<List<MyResult>>(executor);
// do this in a loop
queue.submit(aCallable);
// after looping, call take() once per submitted task;
// each take() blocks only until the next task completes
queue.take().get();
If you do not want to wait, you can process the jobs in the background without blocking the current thread, but then you will need some mechanism to inform the client when the job has finished. That can be through JMS, or, if you have an Ajax client, it can poll for updates.
Quartz also has a job scheduling mechanism, but Java provides a standard way.
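To make the ExecutorCompletionService snippet above concrete, here is a minimal sketch that splits a record range into chunks, submits one task per chunk, and then collects each result as it completes. ChunkedProcessing and processInChunks are hypothetical names, the per-chunk "work" simply returns the chunk size where a real task would load and process rows, and chunkSize is assumed to be positive.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ChunkedProcessing {
    /** Splits [0, totalRecords) into chunks and returns how many records were processed. */
    static int processInChunks(int totalRecords, int chunkSize, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        CompletionService<Integer> completion = new ExecutorCompletionService<>(executor);
        try {
            int submitted = 0;
            for (int from = 0; from < totalRecords; from += chunkSize) {
                final int start = from;
                final int end = Math.min(from + chunkSize, totalRecords);
                // Placeholder work: a real task would process rows start..end-1.
                completion.submit(() -> end - start);
                submitted++;
            }
            int processed = 0;
            // take() hands back ONE finished task per call, in completion order,
            // so it has to be called once per submitted task.
            for (int i = 0; i < submitted; i++) {
                processed += completion.take().get();
            }
            return processed;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // The question's numbers: 200,000 records, chunks of 1,000, five threads.
        System.out.println(processInChunks(200_000, 1_000, 5));
    }
}
```

The pool size caps how many chunks hit the database at once, which is the knob that matters for keeping normal users responsive.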
EDIT:
I might have misunderstood the question. If you do not want a faster response but rather want to throttle the CPU, use this approach.
You can write an inner class like the PollingThread below, where the batches (containing a java.util.UUID for each job) and the number of PollingThreads are defined in the outer class. It will keep polling indefinitely and can be tuned to keep your CPUs free to handle other requests.
class PollingThread implements Runnable {
@SuppressWarnings("unchecked")
public void run(){
Thread.currentThread().setName("MyPollingThread");
while (!Thread.interrupted()) {
try {
synchronized (incomingList) {
if (incomingList.size() == 0) {
// incoming is empty, wait for some time
} else {
//clear the original
list = (LinkedHashSet<UUID>)
incomingList.clone();
incomingList.clear();
}
}
if (list != null && list.size() > 0) {
processJobs(list);
}
// Sleep for some time
try {
Thread.sleep(seconds * 1000);
} catch (InterruptedException e) {
// restore the interrupt flag so the while condition can end the loop
Thread.currentThread().interrupt();
}
} catch (Throwable e) {
//ignore
}
}
}
}
Huge DB operations are usually run in the small hours, when user traffic is low (say, between 1 AM and 2 AM). Once you have identified that window, you can simply schedule a job to run at that time. Quartz can come in handy here, with time-based triggers. (Note: manually triggering a job is also possible.)
The processed result could then be stored in different table(s), which I'll refer to as result tables. Later, when a user wants this result, the DB operations would run against these result tables, which hold far fewer records and involve hardly any joins.
instead of adding a completely new infrastructure in my web app just for this feature?
Quartz.jar is ~350 KB, and adding this dependency shouldn't be a problem. Also note that there's no reason this needs to be part of the web app. The few classes that do the ETL could be placed in a standalone module. The request from the web app then only needs to fetch from the result tables.
All this apart, if you already have a master-slave DB setup (discuss that with your DBA), then you could run the huge DB operations against the slave DB rather than the master, which normal users would be pointed to.
I'm thinking of using Spring's TaskExecutor to fire off asynchronous database writes. Understandably, threads don't come for free, but assuming I'm using a fixed thread pool size of, say, 5-10, how is this a bad idea?
Our application reads from a very large file using a buffer and flushes this information to a database after performing some data manipulation. Using asynchronous writes seems ideal here so that we can continue working on the file. What am I missing? Why doesn't every application use asynchronous writes?
Why doesn't every application use asynchronous writes?
It's often necessary, useful, or simply easier to deal with a write failure in a synchronous manner.
I'm not sure a threadpool is even necessary. I would consider using a dedicated databaseWriter thread which does all writing and error handling for you. Something like:
public class AsyncDatabaseWriter implements Runnable {
private LinkedBlockingQueue<Data> queue = ....
private volatile boolean terminate = false;
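public void run() {
try {
while (!terminate) {
Data data = queue.take(); // blocks until an item is available
// write to database
}
} catch (InterruptedException e) {
// restore the interrupt flag and let the writer thread exit
Thread.currentThread().interrupt();
}
}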
public void ScheduleWrite(Data data) {
queue.add(data);
}
}
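A self-contained variant of the dedicated-writer idea might look like the sketch below. It is an illustration under stated assumptions, not the class above: QueuedWriter, the sink callback (standing in for the real database write), and the "poison pill" shutdown marker are all hypothetical additions. The poison pill replaces the volatile terminate flag so that a writer blocked in take() can be stopped cleanly after the queue has drained.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class QueuedWriter<T> implements Runnable {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final Consumer<T> sink;   // stand-in for the real database write
    private final T poisonPill;       // marker item (compared by identity) that stops the writer

    QueuedWriter(Consumer<T> sink, T poisonPill) {
        this.sink = sink;
        this.poisonPill = poisonPill;
    }

    @Override
    public void run() {
        try {
            for (T item = queue.take(); item != poisonPill; item = queue.take()) {
                try {
                    sink.accept(item);
                } catch (RuntimeException e) {
                    // One place to handle write failures: log, retry, or dead-letter.
                    System.err.println("write failed: " + e);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt and exit
        }
    }

    void scheduleWrite(T item) { queue.add(item); }

    void shutdown() { queue.add(poisonPill); }

    /** Demo helper: writes the items on a background thread and returns what the sink received. */
    static List<String> writeAll(List<String> items) {
        List<String> written = new CopyOnWriteArrayList<>();
        // new String(...) gives the pill a unique identity, distinct from any real item.
        QueuedWriter<String> writer = new QueuedWriter<>(written::add, new String("STOP"));
        Thread thread = new Thread(writer, "db-writer");
        thread.start();
        items.forEach(writer::scheduleWrite);
        writer.shutdown();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(writeAll(List.of("row-1", "row-2", "row-3")));
    }
}
```

Because a single thread drains the queue, writes happen in submission order and failures are handled in one place, which addresses part of the concern about asynchronous write failures.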
I personally fancy the style of using a Proxy for threading out operations which might take a long time. I'm not saying this approach is better than using executors in any way, just adding it as an alternative.
The idea is not bad at all. Actually, I just tried it yesterday because I needed to create a copy of an online database which has 5 different categories with around 60,000 items each.
By moving the parse/save operation for each category into parallel tasks and partitioning each category import into smaller batches run in parallel, I reduced the total import time from several hours (estimated) to 26 minutes. Along the way I found a good piece of code for splitting the collection: http://www.vogella.de/articles/JavaAlgorithmsPartitionCollection/article.html
I used ThreadPoolTaskExecutor to run the tasks. The tasks are just simple implementations of the Callable interface.
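The partition-then-run-in-parallel approach can be sketched roughly as follows. This uses a plain ExecutorService and invokeAll rather than Spring's ThreadPoolTaskExecutor, and it is not the code from the linked article; BatchImport, partition, and importAll are hypothetical names, batchSize is assumed to be positive, and the per-batch "save" only counts items.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchImport {
    /** Splits a list into consecutive views of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }

    /** Runs one Callable per batch on a fixed pool and returns the number of items handled. */
    static int importAll(List<String> items, int batchSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (List<String> batch : partition(items, batchSize)) {
                // Placeholder work: a real task would parse and save the batch.
                tasks.add(() -> batch.size());
            }
            int saved = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Integer> done : pool.invokeAll(tasks)) {
                saved += done.get();
            }
            return saved;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(partition(List.of(1, 2, 3, 4, 5), 2));
        System.out.println(importAll(List.of("a", "b", "c"), 2, 2));
    }
}
```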
why doesn't every application use asynchronous writes? - erm because every application does a different thing.
can you believe some applications don't even use a database OMG!!!!!!!!!
seriously though, given that you don't say what your failure strategies are, it sounds like it could be reasonable. What happens if the write fails, or the db goes away somehow?
some databases - like sybase - have (or at least had) a thing where they really don't like multiple writers to a single table - all the writers ended up blocking each other - so maybe it won't actually make much difference...