Possible Duplicate:
Java: TaskExecutor for Asynchronous Database Writes?
I have a Map of data objects in memory that I need to be able to read and write to very quickly. I would like these objects to be persistent across process restarts so I'd like to have them stored in a DB.
Since I'd rather not have DB inserts or updates slow down my running time, I'd like writes to be applied immediately in memory and then flushed to the DB later, asynchronously. (It's even acceptable to me if the process crashes and a little bit of data is lost.)
Is there a Java tool (preferably open source) that has this ability "out of the box"? Can this easily be done with Hibernate?
As I have also stated in my comment, if you need async writes and do not want to use Hibernate or Ehcache:
Runnable: A straightforward way to achieve async processing in Java is with a simple class that implements Runnable:
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncDatabaseWriter implements Runnable {
    // Pending writes are buffered here so callers never wait on the database.
    private final LinkedBlockingQueue<Data> queue = new LinkedBlockingQueue<>();
    private volatile boolean terminate = false;

    public void run() {
        try {
            while (!terminate) {
                Data data = queue.take();   // blocks until a write has been scheduled
                // write data to the database here (JDBC, Hibernate, ...)
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // interrupt the writer thread to stop it
        }
    }

    public void scheduleWrite(Data data) {
        queue.add(data);   // returns immediately
    }
}
Also stated here: Java: TaskExecutor for Asynchronous Database Writes?
Distributed workers: If you want to introduce more moving parts into your system, you can try a Java alternative to a distributed task queue like Celery: Hazelcast or Octobot. This will need a messaging tier in between which acts as a queue, and the workers will do the job of writing to the DB for you. This may look like overkill, but again it depends on your use case and the scale at which you want to run your app.
I did something very similar where I had a use case to write to a DB in an async manner, so I went with Celery (Python). A sample implementation can be found here: artemis
Consider using write behind caching, e.g. in EhCache:
[...] writing data into the cache, instead of writing the data into database at the same time, the write-behind cache saves the changed data into a queue and lets a backend thread to do the writing later. Therefore, the cache-write process can proceed without waiting for the database-write and, thus, be finished much faster. Any data that has been changed can be persisted into database eventually. In the mean time, any read from cache will still get the latest data.
Unfortunately I don't know how and if write behind integrates with Hibernate (don't think so).
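For what it's worth, the write-behind idea itself is simple enough to sketch in plain Java. This is not the Ehcache API, just the concept: reads and writes hit the in-memory map immediately, and a background thread flushes changes to the DB later. Record and the actual database call are placeholders.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal write-behind sketch: put() updates the cache right away and queues the
// change; a daemon thread drains the queue and does the slow database I/O.
public class WriteBehindMap {
    private final Map<String, Record> cache = new ConcurrentHashMap<>();
    private final LinkedBlockingQueue<Record> dirty = new LinkedBlockingQueue<>();

    public WriteBehindMap() {
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    Record r = dirty.take();   // blocks until something changed
                    writeToDatabase(r);        // slow I/O happens off the hot path
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        flusher.setDaemon(true);
        flusher.start();
    }

    public Record get(String key)             { return cache.get(key); }
    public void put(String key, Record value) { cache.put(key, value); dirty.add(value); }

    private void writeToDatabase(Record r)    { /* JDBC / ORM call goes here */ }
}

class Record { /* your data object */ }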
In Hibernate you choose when you want to persist data to the database: you work mainly with proxy objects (that you can keep and manipulate in memory), and you call a save method whenever you want to insert or update in the database.
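A minimal sketch of that with the plain Hibernate Session API, assuming a SessionFactory configured elsewhere and a mapped entity class (the entity is passed as Object here just to keep the sketch self-contained):

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class EntityDao {
    private final SessionFactory sessionFactory;

    public EntityDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // You keep working with the object in memory and call this only when you decide to persist it.
    public void save(Object entity) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            session.saveOrUpdate(entity);   // insert or update, entirely on your schedule
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}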
Related
I have a flat file, say a CSV file of 50 MB, that contains structured data, and I need to read it and push it into a DB, say MySQL. One way to do this is to split the file into multiple parts and process them in parallel using executors. That part is fine. Now the second use case: if any single record is incorrect, I need to stop processing in all the threads, which means that if any record in the CSV is found to be incorrect we should not process the transaction. I need ideas for this second part.
Thanks,
RK
For 50 MB, you'd be over-complicating this design by adding multiple threads. A flat file or structured data like JSON can be ripped through with a single thread in seconds, if not faster. Spinning up multiple threads for 50 MB of data is overkill. On a number of occasions, I've handled the same use case with 400+ MB of JSON or CSV data with a single thread.
You also have to consider that you are writing to a single DB, in which case multiple threads are going to complicate things as you have multiple transactions. Taking your CSV example, it sounds like you intend for each thread to be responsible for reading one or more lines and writing them to the DB? If so, each thread is operating in its own JDBC transaction. Thus, if you stop all threads, you're going to end up with partially written data in the DB as some threads may have completed work already and resulted in a completed transaction. Since each thread is operating independently, you don't have the opportunity to roll back the already committed transactions of the completed threads.
If you're still committed to parallelization for 50MB of data, consider making 2 passes:
To read and validate the data and generate the appropriate SQL insert statements
If all threads are successful, execute the generated SQL file
This would do what you want: you are guaranteed to fail completely, before any data is written to the DB, if there's a validation error. Second, it ensures that the data can be written to the DB atomically. To coordinate the threads, you'd want to use something like a CyclicBarrier or some other type of synchronizer from the java.util.concurrent package.
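A rough sketch of the two-pass idea, assuming the file has already been split into chunks; validateChunk() and writeAllInOneTransaction() are placeholders for your own parsing and JDBC code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TwoPassCsvImport {

    public void importFile(List<List<String>> chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Pass 1: validate chunks concurrently, collecting the rows to insert.
            List<Future<List<String>>> results = new ArrayList<>();
            for (List<String> chunk : chunks) {
                results.add(pool.submit(() -> validateChunk(chunk)));
            }
            List<String> allRows = new ArrayList<>();
            for (Future<List<String>> f : results) {
                allRows.addAll(f.get());   // get() rethrows any validation failure
            }
            // Pass 2: only reached if every chunk validated; write atomically.
            writeAllInOneTransaction(allRows);
        } finally {
            pool.shutdownNow();
        }
    }

    private List<String> validateChunk(List<String> chunk) {
        return chunk;   // replace with real per-line validation
    }

    private void writeAllInOneTransaction(List<String> rows) {
        // replace with a single JDBC transaction / batch insert
    }
}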
There are also plenty of frameworks out there that make this stuff easier and handle error cases and reusability of jobs. Spring Batch is one such tool and there are several more.
Do use a ThreadGroup: when one worker detects bad data it can interrupt the whole group so the other threads stop.
public static void main(String... args) {
    // All workers share one ThreadGroup so that any of them can stop the rest.
    final ThreadGroup group = new ThreadGroup("Thread Group");
    new Thread(group, () -> {
        // process one part of the file; on finding a bad record:
        group.interrupt();   // asks every thread in the group to stop
    }).start();
    new Thread(group, () -> {
        // process another part; workers should check Thread.interrupted() regularly
        group.interrupt();
    }).start();
}
After I create a new 'Order' object, I would like to get its generated ID and put it on an AMQP queue, so that a worker can do other stuff with it. The worker takes the generated ID (message) and looks up the order, but complains that no record exists, even though I just created one. I am trying to figure out either how long to wait after I call .persist() before I put the message (generated ID) on the queue (which I don't think is a good idea); have the worker loop over and over until MySQL DOES return a record (which I don't like either); or find a point where I can put the message on the queue once I know the data is safely in MySQL (this sounds best). I'm thinking that it needs to be done outside of any @Transactional method.
The worker that is going to read the data back out of MySQL is part of a different system on a different server. So when can I tell the worker that the data is in MySQL so that it can get started with its task?
Is it true that after the @Transactional method finishes the data is done being written to MySQL? I am having trouble understanding this.
Thanks a million in advance.
So first, as Kayamann and Ralf wrote in the comments, it is guaranteed that the data is stored and available for other processes when the transaction commits (ends).
@Transactional methods are easy to understand. When you have a @Transactional method, it means that the container (the application that is actually going to invoke that method) will begin the transaction before the method is invoked, and automatically commit or roll back the transaction in case of success or error.
So if we have
@Transactional
public void modify() {
    doSomething();
}
And when you call it somewhere in the code (or it is invoked via the container, e.g. due to some bindings), the actual flow will be roughly as follows:
EntityTransaction tx = entityManager.getTransaction();
tx.begin();
object.modify();
tx.commit();
It is quite simple. Such an approach means that the transactions are container-managed.
As for your situation: to let your external system know that the transaction is complete, you have to either use a message queue (which you are using already) with a message saying that the transaction for some ID is complete and it can start processing, or use a different technology, REST for example.
Remote systems can signal each other about various events via queues and REST services, so there is no real difference.
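If you are using Spring, a minimal sketch of that first option: register an after-commit hook so the message carrying the generated ID is only published once the transaction has actually committed and the row is visible to the other system. TransactionSynchronizationAdapter is used here for brevity (newer Spring versions let you implement TransactionSynchronization directly); persistAndGetId() and sendToQueue() are placeholders for your repository and AMQP code.

import org.springframework.transaction.annotation.Transactional;
import org.springframework.transaction.support.TransactionSynchronizationAdapter;
import org.springframework.transaction.support.TransactionSynchronizationManager;

public class OrderService {

    @Transactional
    public void createOrder(String payload) {
        final long generatedId = persistAndGetId(payload);   // e.g. JPA persist + flush

        TransactionSynchronizationManager.registerSynchronization(
                new TransactionSynchronizationAdapter() {
                    @Override
                    public void afterCommit() {
                        // Runs strictly after the surrounding transaction commits,
                        // so the worker on the other server will find the record.
                        sendToQueue(generatedId);
                    }
                });
    }

    private long persistAndGetId(String payload) { return 0L; /* entityManager.persist(...) */ }
    private void sendToQueue(long id)            { /* publish the ID to AMQP here */ }
}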
Can you help me with a few problems:
A. We have a table on which read and write operations happen simultaneously. Writes happen very heavily, so reads become very slow; sometimes my web application does not come up due to the heavy write load on this table. How could I handle such a scenario? Writes happen through a different Java application while reads happen through our web application, so the web application becomes very slow. Any idea?
B. Writes to this table happen through 200 threads; these threads take connections from a connection pool and write into the table, and this application runs 24 by 7. Is thread priority an issue here, stopping the read operations from the web application?
C. Can we have master-master replication for that table only, so writes happen in one table and reads happen from the other, with data migrating from one table to the other every two minutes?
Please suggest.
Thanks in advance.
Check the connection pool size; maybe it's too small and your threads waste time waiting for a connection from the pool.
Check your database settings; if you are just running it with out-of-the-box parameters, there may be good room for improvement.
You probably need some kind of event-driven system: when a vehicle sends data, the DB is not updated directly, but a message is added to some queue (e.g. JMS). Your app then caches data on startup, and updates both the cache and the database upon receiving this message. The key thing is that the only component that interacts with the DB is your app, and data changes only when you receive an event, so you don't need to query the DB to read the data; plus you may do updates in the background using only a few threads, etc. There are quite good open-source messaging systems (e.g. Apache ActiveMQ) and caching libraries (e.g. Ehcache), so you can build a reasonably performant and fault-tolerant system without too much effort.
I guess introducing messaging would be a serious re-engineering effort, so to solve your immediate problem replication might be the best solution: merge data from the updateable table into another one every 2 minutes, and have the web app read that other table; obviously this works well if you only read the data in the web app and do not update it, otherwise you need to put a lot of effort into keeping the 2 tables in sync. A variation of that is batching: data from the vehicles are inserted into an intermediate table and then transferred into the main table every 2 minutes, from which the reader queries them; the intermediate table is cleaned after the transfer.
The one true way to solve this is to use a queue of write events and to stop the writing periodically so that the reader has a chance.
Create a queue for incoming write updates
Create an AtomicBoolean or similar (see java.util.concurrent) to use as a lock
Create a thread pool to read from the queue and execute the updates when the lock is unset
Use javax.swing.Timer to periodically set the lock and read the table data (a rough sketch of this approach follows).
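A rough sketch of the above, substituting a ScheduledExecutorService for javax.swing.Timer (the Swing timer is really meant for GUI code); WriteEvent, executeUpdate() and readTable() are placeholders, and the pause interval is arbitrary:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ThrottledWriter {
    private final LinkedBlockingQueue<WriteEvent> queue = new LinkedBlockingQueue<>();
    private final AtomicBoolean paused = new AtomicBoolean(false);
    private final ExecutorService writers = Executors.newFixedThreadPool(4);

    public void start() {
        for (int i = 0; i < 4; i++) {
            writers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        WriteEvent e = queue.take();
                        while (paused.get()) { Thread.sleep(50); }   // back off while readers run
                        executeUpdate(e);
                    } catch (InterruptedException ex) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        // Periodically pause new writes so the web application's reads get a chance.
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            paused.set(true);
            readTable();
            paused.set(false);
        }, 0, 30, TimeUnit.SECONDS);
    }

    public void submit(WriteEvent e) { queue.add(e); }

    private void executeUpdate(WriteEvent e) { /* JDBC write */ }
    private void readTable()                 { /* web-app read / cache refresh */ }
}

class WriteEvent { }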
Before trying anything too complicated try this perhaps:
1) Don't use Thread priorities, they are rarely what you want.
2) Set up your own priority scheme, perhaps simply by having a (priority) queue for both reads and writes where reads are prioritized. That is: add read and write requests to a single queue and have them block or be notified of the result.
3) Check your database's features for optimizing write-heavy tables.
I have a task thread running in two separate instances of Tomcat.
The task threads concurrently read (using SELECT) the TASKS table with a certain WHERE condition and then do some processing.
The issue is that sometimes both threads pick the same task, because of which the task is executed twice.
My question is: how do I prevent both threads from reading the same set of data from the TASKS table?
It is just because your DAO function (the code which is accessing the database) is not synchronized. Make it synchronized; I think your problem will be solved.
If the TASKS table you mention is a database table then I would use Transaction isolation.
As a suggestion: within a transaction, set an attribute of the TASKS table to some uniquely identifiable value if it is not already set, then commit the transaction. If all is OK, then the task has been claimed by that thread.
I haven't come across this use case before, so treat my suggestion with caution.
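A rough JDBC sketch of that claim-by-update idea; the column names (claimed_by, id) and the executorId value are assumptions, adapt them to your schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TaskClaimer {

    // Returns true only for the one thread/instance whose UPDATE actually claimed the row.
    public boolean tryClaim(Connection conn, long taskId, String executorId) throws SQLException {
        String sql = "UPDATE TASKS SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, executorId);
            ps.setLong(2, taskId);
            int updated = ps.executeUpdate();   // 0 if another instance got there first
            conn.commit();
            return updated == 1;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}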
I think you should look at how an enterprise job scheduler handles this, for example Quartz.
For your use case there is a better tool for the job - and that's messaging. You are persisting items that need to be worked on, and then attempting to synchronise access between workers. There are a number of issues that you would need to resolve in making this work - in general updating a table and selecting from it should not be mixed (it locks), so storing state there doesn't work; neither would synchronization in your Java code, as that wouldn't survive a server restart.
Using the JMS API with a message broker like ActiveMQ, you would publish a message to a queue. This message would contain the details of the task to be executed. The message broker would persist this somewhere (either in its own message store, or a database). Worker threads would then subscribe to the queue on the message broker, and each message would only be handed off to one of them. This is quite a powerful model, as you can have hundreds of message consumers all acting on tasks so it scales nicely. You can also make this as resilient as it needs to be, so tasks can survive both Tomcat and broker restarts.
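A minimal JMS sketch of that model, assuming an ActiveMQ broker at the given URL and a queue named "tasks"; in real code the producer and the competing consumers would live in different processes:

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class TaskQueueExample {

    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("tasks");

        // Worker side: subscribe; the broker hands each message to exactly one consumer.
        MessageConsumer consumer = session.createConsumer(queue);
        consumer.setMessageListener(message -> {
            // process the task described by the message here
        });

        // Producer side: publish a task instead of inserting a "to do" row.
        MessageProducer producer = session.createProducer(queue);
        producer.send(session.createTextMessage("task-42"));
    }
}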
Whether the database can provide graceful management of this will depend largely on whether it is using strict two-phase locking (S2PL) or multi-version concurrency control (MVCC) techniques to manage concurrency. Under MVCC reads don't block writes, and vice versa, so it is very possible to manage this with relatively simple logic. Under S2PL you would spend too much time blocking for the database to be a good mechanism for managing this, so you would probably want to look at external mechanisms. Of course, an external mechanism can work regardless of the database, it's just not really necessary with MVCC.
Databases using MVCC are PostgreSQL, Oracle, MS SQL Server (in certain configurations), InnoDB (except at the SERIALIZABLE isolation level), and probably many others. (These are the ones I know of off-hand.)
I didn't pick up any clues in the question as to which database product you are using, but if it is PostgreSQL you might want to consider using advisory locks. http://www.postgresql.org/docs/current/interactive/explicit-locking.html#ADVISORY-LOCKS I suspect many of the other products have some similar mechanism.
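For illustration, a small JDBC sketch of PostgreSQL advisory locks; using the task id as the lock key is an assumption, any keying scheme works:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AdvisoryLockExample {

    public boolean tryProcessTask(Connection conn, long taskId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("SELECT pg_try_advisory_lock(?)")) {
            ps.setLong(1, taskId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                if (!rs.getBoolean(1)) {
                    return false;   // another instance already holds this task
                }
            }
        }
        try {
            // ... do the work for this task ...
            return true;
        } finally {
            try (PreparedStatement ps = conn.prepareStatement("SELECT pg_advisory_unlock(?)")) {
                ps.setLong(1, taskId);
                ps.execute();
            }
        }
    }
}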
I think you need to have some variable (column) where you keep the last modified date of the rows. Your threads can then read the same set of data using the same modified-date restriction.
Edit:
I did not see "not to read"
In this case you need to have another table, TaskExecutor (taskId, executorId); when some thread runs a task you put a row into TaskExecutor, and when another thread starts it just checks whether the task is already being executed or not (SELECT ... FROM TaskExecutor WHERE taskId = ...).
You also need to take care of the isolation level for transactions.
I'm thinking of using Java's TaskExecutor to fire off asynchronous database writes. Understandably threads don't come for free, but assuming I'm using a fixed threadpool size of say 5-10, how is this a bad idea?
Our application reads from a very large file using a buffer and flushes this information to a database after performing some data manipulation. Using asynchronous writes seems ideal here so that we can continue working on the file. What am I missing? Why doesn't every application use asynchronous writes?
Why doesn't every application use asynchronous writes?
It's often necessary/useful/easier to deal with a write failure in a synchronous manner.
I'm not sure a threadpool is even necessary. I would consider using a dedicated databaseWriter thread which does all writing and error handling for you. Something like:
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncDatabaseWriter implements Runnable {
    // Pending writes are buffered here so callers never wait on the database.
    private final LinkedBlockingQueue<Data> queue = new LinkedBlockingQueue<>();
    private volatile boolean terminate = false;

    public void run() {
        try {
            while (!terminate) {
                Data data = queue.take();   // blocks until a write has been scheduled
                // write data to the database here (JDBC, Hibernate, ...)
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // interrupt the writer thread to stop it
        }
    }

    public void scheduleWrite(Data data) {
        queue.add(data);   // returns immediately
    }
}
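A brief usage sketch for the writer above (what Data looks like and how the actual database write is done is up to your application):

AsyncDatabaseWriter writer = new AsyncDatabaseWriter();
Thread writerThread = new Thread(writer, "db-writer");
writerThread.setDaemon(true);     // don't keep the JVM alive just for pending writes
writerThread.start();

writer.scheduleWrite(data);       // returns immediately; the write happens later
// ...
writerThread.interrupt();         // stops the loop once you are done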
I personally fancy the style of using a Proxy for threading out operations which might take a long time. I'm not saying this approach is better than using executors in any way, just adding it as an alternative.
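For illustration, a hedged sketch of that proxy style using java.lang.reflect.Proxy; the DataWriter interface is made up, and this only makes sense for void, fire-and-forget methods:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncProxy {

    public interface DataWriter {
        void write(String data);
    }

    @SuppressWarnings("unchecked")
    public static <T> T async(final T target, Class<T> iface, final ExecutorService pool) {
        InvocationHandler handler = (proxy, method, args) -> {
            pool.submit(() -> {
                try {
                    method.invoke(target, args);   // the real call happens on a pool thread
                } catch (Exception e) {
                    e.printStackTrace();           // real code needs a proper failure strategy
                }
            });
            return null;   // fine for void methods; do not use this for methods with results
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        DataWriter realWriter = data -> System.out.println("writing " + data);
        DataWriter asyncWriter = async(realWriter, DataWriter.class, pool);
        asyncWriter.write("row 1");   // returns immediately, the write runs on the pool
        pool.shutdown();
    }
}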
The idea is not bad at all. Actually I just tried it yesterday because I needed to create a copy of an online database which has 5 different categories with around 60000 items each.
By moving the parse/save operation of each category into parallel tasks, and partitioning each category's import into smaller batches also run in parallel, I reduced the total import time from several hours (estimated) to 26 minutes. Along the way I found a good piece of code for splitting the collection: http://www.vogella.de/articles/JavaAlgorithmsPartitionCollection/article.html
I used ThreadPoolTaskExecutor to run the tasks. Your tasks are just simple implementations of the Callable interface.
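A minimal sketch of that setup, assuming Spring's ThreadPoolTaskExecutor; Item and importBatch() are placeholders for the real category data and parse/save logic:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class ParallelImport {

    public void importAll(List<List<Item>> batches) throws Exception {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        executor.setMaxPoolSize(4);
        executor.initialize();

        List<Future<Integer>> results = new ArrayList<>();
        for (List<Item> batch : batches) {
            results.add(executor.submit(() -> importBatch(batch)));   // one Callable per batch
        }
        int total = 0;
        for (Future<Integer> f : results) {
            total += f.get();   // wait for every batch; rethrows failures
        }
        executor.shutdown();
        System.out.println("Imported " + total + " items");
    }

    private int importBatch(List<Item> batch) {
        // parse/save one batch here and return how many items were written
        return batch.size();
    }
}

class Item { }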
Why doesn't every application use asynchronous writes? Erm, because every application does a different thing.
can you believe some applications don't even use a database OMG!!!!!!!!!
Seriously though, given that you don't say what your failure strategies are, it sounds like it could be reasonable. What happens if the write fails, or the DB goes away somehow?
Some databases, like Sybase, have (or at least had) a thing where they really don't like multiple writers to a single table: all the writers end up blocking each other, so maybe it won't actually make much difference...