I'm thinking of using Java's TaskExecutor to fire off asynchronous database writes. Understandably threads don't come for free, but assuming I'm using a fixed threadpool size of say 5-10, how is this a bad idea?
Our application reads from a very large file using a buffer and flushes this information to a database after performing some data manipulation. Using asynchronous writes seems ideal here so that we can continue working on the file. What am I missing? Why doesn't every application use asynchronous writes?
Why doesn't every application use asynchronous writes?
It's often necessary/usefull/easier to deal with a write failure in a synchronous manner.
I'm not sure a threadpool is even necessary. I would consider using a dedicated databaseWriter thread which does all writing and error handling for you. Something like:
public class AsyncDatabaseWriter implements Runnable {
private LinkedBlockingQueue<Data> queue = ....
private volatile boolean terminate = false;
public void run() {
while(!terminate) {
Data data = queue.take();
// write to database
}
}
public void ScheduleWrite(Data data) {
queue.add(data);
}
}
I personally fancy the style of using a Proxy for threading out operations which might take a long time. I'm not saying this approach is better than using executors in any way, just adding it as an alternative.
Idea is not bad at all. Actually I just tried it yesterday because I needed to create a copy of online database which has 5 different categories with like 60000 items each.
By moving parse/save operation of each category into the parallel tasks and partitioning each category import into smaller batches run in parallel I reduced the total import time from several hours (estimated) to 26 minutes. Along the way I found good piece of code for splitting the collection: http://www.vogella.de/articles/JavaAlgorithmsPartitionCollection/article.html
I used ThreadPoolTaskExecutor to run tasks. Your tasks are just simple implementation of Callable interface.
why doesn't every application use asynchronous writes? - erm because every application does a different thing.
can you believe some applications don't even use a database OMG!!!!!!!!!
seriously though, given as you don't say what your failure strategies are - sounds like it could be reasonable. What happens if the write fails? or the db does away somehow
some databases - like sybase - have (or at least had) a thing where they really don't like multiple writers to a single table - all the writers ended up blocking each other - so maybe it wont actually make much difference...
Related
I'm new to SO/Java software development, and I've been searching for this without much avail.
My question is --in Java-- is it possible to run one main statement across many threads at once? I am writing a native Java application in order to load test a server. The process for this is to have a bunch of threads running at once to simulate users. These threads read from a certain file, get various UIDs, manipulate some standard data, and send this to a queue on the server. After the thread sends the data, we start pulling data from the response queue, and each of the threads that have already sent their data start checking against the UID of the newly returned data, and if it matches, the process outputs the round trip time and terminates.
Algorithmic-ally, that is what I plan to implement, however I don't have much experience with concurrency and using multiple threads, so I'm not sure how I would be able to make the threads run this process. I've seen other work where an array of WorkerThreads is used, and I've read the API for Threads and read various tutorials on concurrency. Any guidance would be helpful.
Thank you!
The recommended way to implement concurrent workers is to use an Executor service. The pattern is something like this:
ExecutorService pool = Executors.newFixedThreadPool(poolSize);
...
while (...) {
final int someParameter = ...
pool.submit(new Runnable() {
public void run() {
// do something using 'someParameter'
}
});
}
This approach takes care of the complicated process of creating and managing a thread pool by hand.
There are numerous variations; see the javadocs for Executors and ExecutorService.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Java: TaskExecutor for Asynchronous Database Writes?
I have a Map of data objects in memory that I need to be able to read and write to very quickly. I would like these objects to be persistent across process restarts so I'd like to have them stored in a DB.
Since I'd rather not have any DB inserts or updates slow down my running time, I'd like to have them done immediately in memory and later, asynchronously to the DB. (It's even acceptable to me if the process crashes and a little bit of data is lost.)
Is there a Java tool (preferably open source) that has this ability "out of the box"? Can this easily be done with Hibernate?
As I have also stated in my comment, if you need async writes and do not want to use hibernate or ehcache:
Runnable: The only way to achieve async processing in Java will be via simple class which extends Runnable:
public class AsyncDatabaseWriter implements Runnable {
private LinkedBlockingQueue<Data> queue = ....
private volatile boolean terminate = false;
public void run() {
while(!terminate) {
Data data = queue.take();
// write to database
}
}
public void ScheduleWrite(Data data) {
queue.add(data);
}
}
Also stated here: Java: TaskExecutor for Asynchronous Database Writes?
Distributed workers: If you want to introduce more moving parts in your system, then you can try java alternative of distributed task queue like celery: hazelcast or octobot. This will need a messaging tier in between which will act as a queue and the workers will do the task of writing to DB for you. This looks likes an overkill, but again depends on your use case and the scale at which you want to use your app.
I did something very similar where I had a use case to write to DB in an async manner, so I went with celery (python). Sample implementation can be found here: artemis
Consider using write behind caching, e.g. in EhCache:
[...] writing data into the cache, instead of writing the data into database at the same time, the write-behind cache saves the changed data into a queue and lets a backend thread to do the writing later. Therefore, the cache-write process can proceed without waiting for the database-write and, thus, be finished much faster. Any data that has been changed can be persisted into database eventually. In the mean time, any read from cache will still get the latest data.
Unfortunately I don't know how and if write behind integrates with Hibernate (don't think so).
In hibernate you choose when you want to persist data in the database, you work mainly with proxy objects (that you can keep and manipulate in memory) and you call save method whenever you want to insert or update in the database.
I have a Spring-MVC, Hibernate, (Postgres 9 db) Web app. An admin user can send in a request to process nearly 200,000 records (each record collected from various tables via joins). Such operation is requested on a weekly or monthly basis (OR whenever the data reaches to a limit of around 200,000/100,000 records). On the database end, i am correctly implementing batching.
PROBLEM: Such a long running request holds up the server thread and that causes the the normal users to suffer.
REQUIREMENT: The high response time of this request is not an issue. Whats required is not make other users suffer because of this time consuming process.
MY SOLUTION:
Implementing threadpool using Spring taskExecutor abstraction. So i can initialize my threadpool with say 5 or 6 threads and break the 200,000 records into smaller chunks say of size 1000 each. I can queue in these chunks. To further allow the normal users to have a faster db access, maybe I can make every runnable thread sleep for 2 or 3 secs.
Advantages of this approach i see is: Instead of executing a huge db interacting request in one go, we have a asynchronous design spanning over a larger time. Thus behaving like multiple normal user requests.
Can some experienced people please give their opinion on this?
I have also read about implementing the same beahviour with a Message Oriented Middleware like JMS/AMQP OR Quartz Scheduling. But frankly speaking, i think internally they are also gonna do the same thing i.e making a thread pool and queueing in the jobs. So why not go with the Spring taskexecutors instead of adding a completely new infrastructure in my web app just for this feature?
Please share your views on this and let me know if there is other better ways to do this?
Once again: the time to completely process all the records in not a concern, whats required is that normal users accessing the web app during that time should not suffer in any way.
You can parallelize the tasks and wait for all of them to finish before returning the call. For this, you want to use ExecutorCompletionService which is available in Java standard since 5.0
In short, you use your container's service locator to create an instance of ExecutorCompletionService
ExecutorCompletionService<List<MyResult>> queue = new ExecutorCompletionService<List<MyResult>>(executor);
// do this in a loop
queue.submit(aCallable);
//after looping
queue.take().get(); //take will block till all threads finish
If you do not want to wait then, you can process the jobs in the background without blocking the current thread but then you will need some mechanism to inform the client when the job has finished. That can be through JMS or if you have an ajax client then, it can poll for updates.
Quartz also has a job scheduling mechanism but, Java provides a standard way.
EDIT:
I might have misunderstood the question. If you do not want a faster response but rather you want to throttle the CPU, use this approach
You can make an inner class like this PollingThread where batches containing java.util.UUID for each job and the number of PollingThreads are defined in the outer class. This will keep going forever and can be tuned to keep your CPUs free to handle other requests
class PollingThread implements Runnable {
#SuppressWarnings("unchecked")
public void run(){
Thread.currentThread().setName("MyPollingThread");
while (!Thread.interrupted()) {
try {
synchronized (incomingList) {
if (incomingList.size() == 0) {
// incoming is empty, wait for some time
} else {
//clear the original
list = (LinkedHashSet<UUID>)
incomingList.clone();
incomingList.clear();
}
}
if (list != null && list.size() > 0) {
processJobs(list);
}
// Sleep for some time
try {
Thread.sleep(seconds * 1000);
} catch (InterruptedException e) {
//ignore
}
} catch (Throwable e) {
//ignore
}
}
}
}
Huge-db-operations are usually triggered at wee hours, where user traffic is pretty less. (Say something like 1 Am to 2 Am.. ) Once you find that out, you can simply schedule a job to run at that time. Quartz can come in handy here, with time based triggers. (Note: Manually triggering a job is also possible.)
The processed result could now be stored in different table(s). (I'll refer to it as result tables) Later when a user wants this result, the db operations would be against these result tables which have minimal records and hardly any joins would be involved.
instead of adding a completely new infrastructure in my web app just for this feature?
Quartz.jar is ~ 350 kb and adding this dependency shouldn't be a problem. Also note that there's no reason this need to be as a web-app. These few classes that do ETL could be placed in a standalone module.The request from the web-app needs to only fetch from the result tables
All these apart, if you already had a master-slave db model(discuss on that with your dba) then you could do the huge-db operations with the slave-db rather than the master, which normal users would be pointed to.
Our company has a Batch Application which runs every day, It does some database related jobs mostly, import data into database table from file for example.
There are 20+ tasks defined in that application, each one may depends on other ones or not.
The application execute tasks one by one, the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think it's too long, so I think maybe I can improve performance by multi-threading.
I think as there is dependency between tasks, it not good (or it's not easy) to make tasks run in parallel, but maybe I can use multi-threading to improve performance inside a task.
for example : we have a task defined as "ImportBizData", which copy data into a database table from a data file(usually contains 100,0000+ rows). I wonder is that worth to use multi-threading?
As I know a little about multi-threading, I hope some one provide some tutorial links on this topic.
Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate the last point: Currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contains a small, but not too small "unit of work". Push those into a queue
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the threads in step #1 wait for the slow hard disk to return the new lines of data. The result of this conversion step goes into the next queue
One or more threads to upload the data via JDBC.
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). Plus inserting data in a database is a very complex task (allocating space, updating indexes, checking foreign keys)
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you
You can quickly change the number of threads for each step to tweak performance
Multi threading may be of help, if the lines are uncorrelated, you may start off two processes one reading even lines, another uneven lines, and get your db connection from a connection pool (dbcp) and analyze performance. But first I would investigate whether jdbc is the best approach normally databases have optimized solutions for imports like this. These solutions may also temporarily switch of constraint checking of your table, and turn that back on later, which is also great for performance. As always depending on your requirements.
Also you may want to checkout springbatch which is designed for batch processing.
As far as I know,the JDBC Bridge uses synchronized methods to serialize all calls to ODBC so using mutliple threads won't give you any performance boost unless it boosts your application itself.
I am not all that familiar with JDBC but regarding the multithreading bit of your question, what you should keep in mind is that parallel processing relies on effectively dividing your problem into bits that are independent of one another and in some way putting them back together (their output that is). If you dont know the underlying dependencies between tasks you might end up having really odd errors/exceptions in your code. Even worse, it might all execute without any problems, but the results might be off from true values. Multi-threading is tricky business, in a way fun to learn (at least I think so) but pain in the neck when things go south.
Here are a couple of links that might provide useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting effort to getting into multi-threading I can recommend GOETZ, BRIAN: JAVA CONCURRENCY, amazing book really..
Good luck
I had a similar task. But in my case, all the tables were unrelated to each other.
STEP1:
Using SQL Loader(Oracle) for uploading data into database(very fast) OR any similar bulk update tools for your database.
STEP2:
Running each uploading process in a different thread(for unrelated tasks) and in a single thread for related tasks.
P.S. You could identify different inter-related jobs in your application and categorize them in groups; and running each group in different threads.
Links to run you up:
JAVA Threading
follow the last example in the above link(Example: Partitioning a large task with multiple threads)
SQL Loader can dramatically improve performance
The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc
If Multi threading would complicate your work, you could go with Async messaging. I'm not fully aware of what your needs are, so, the following is from what I am seeing currently.
Create a file reader java whose purpose is to read the biz file and put messages into the JMS queue on the server. This could be plain Java with static void main()
Consume the JMS messages in the Message driven beans(You can set the limit on the number of beans to be created in the pool, 50 or 100 depending on the need) if you have mutliple servers, well and good, your job is now split into multiple servers.
Each row of data is asynchronously split between 2 servers and 50 beans on each server.
You do not have to deal with threads in the whole process, JMS is ideal because your data is within a transaction, if something fails before you send an ack to the server, the message will be resent to the consumer, the load will be split between the servers without you doing anything special like multi threading.
Also, spring is providing spring-batch which can help you. http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios
I have a very complex system (100+ threads) which need to send email without blocking. My solution to the problem was to implement a class called EmailQueueSender which is started at the beginning of execution and has a ScheduledExecutorService which looks at an internal queue every 500ms and if size()>0 it empties it.
While this is going on there's a synchronized static method called addEmailToQueue(String[]) which accepts an email containing body,subject..etc as an array. The system does work, and my other threads can move on after adding their email to queue without blocking or even worrying if the email was successfully sent...it just seems to be a little messy...or hackish...Every programmer gets this feeling in their stomach when they know they're doing something wrong or there's a better way. That said, can someone slap me on the wrist and suggest a more efficient way to accomplish this?
Thanks!
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
this class alone will probably handle most of the stuff you need.
just put the sending code in a runnable and add it with the execute method.
the getQueue method will allow you to retrieve the current list of waiting items so you can save it when restarting the sender service without losing emails
If you are using Java 6, then you can make heavy use of the primitives in the java.util.concurrent package.
Having a separate thread that handles the real sending is completely normal. Instead of polling a queue, I would rather use a BlockingQueue as you can use a blocking take() instead of busy-waiting.
If you are interested in whether the e-mail was successfully sent, your append method could return a Future so that you can pass the return value on once you have sent the message.
Instead of having an array of Strings, I would recommend creating a (almost trivial) Java class to hold the values. Object creation is cheap these days.
Im not sure if this would work for your application, but sounds like it would. A ThreadPoolExecutor (an ExecutorService-implementation) can take a BlockingQueue as argument, and you can simply add new threads to the queue. When you are done you simply terminate the ThreadPoolExecutor.
private BlockingQueue<Runnable> queue;
...
ThreadPoolExecutor executor = new ThreadPoolExecutor(10, 10, new Long(1000),
TimeUnit.MILLISECONDS, this.queue);
You can keep a count of all the threads added to the queue. When you think you are done (the queue is empty, perhaps?) simply compare this to
if (issuedThreads == pool.getCompletedTaskCount()) {
pool.shutdown();
}
If the two match, you are done. Another way to terminate the pool is to wait a second in a loop:
try {
while (!this.pool.awaitTermination(1000, TimeUnit.MILLISECONDS));
} catch (InterruptedException e) {//log exception...}
There might be a full blown mail package out there already, but I would probably start with Spring's support for email and job scheduling. Fire a new job for each email to be sent, and let the timing of the executor send the jobs and worry about how many need to be done. No queuing involved.
Underneath the framework, Spring is using Java Mail for the email part, and lets you choose between ThreadPoolExecutor (as mention by #Lorenzo) or Quartz. Quartz is better in my opinion, because you can even set it up so that it fires your jobs at fixed points in time like cron jobs (eg. at midnight). The advantage of using Spring is that it greatly simplifies working with these packages, so that your job is even easier.
There are many packages and tools that will help with this, but the generic name for cases like this, extensively studied in computer science, is producer-consumer problem. There are various well-known solutions for it, which could be considered 'design patterns'.