Best way to distribute database read jobs among Java threads - java

I have a MySQL database with a large number of rows.
I want to initialize multiple threads (each with its own database connection) in Java and read/print the data simultaneously.
How can I partition the data between the threads so that no two threads read the same record? What strategies can be used?

It depends on what kind of work your threads are going to do. For example, I usually execute a single SELECT for the large dataset, add tasks to a thread-safe task queue, and submit workers that pick tasks off the queue to process. I usually write to the DB without synchronisation, but that depends on the size of the unit of work and on DB constraints (unique keys, etc.). Works like a charm.
Another method would be to simply run multiple threads and let them work on their own. I strongly advise against fancy LIMIT/OFFSET tricks, however: they still require the DB to fetch more rows than the query will actually return.
EDIT:
As you added in a comment that the threads share the same data set, then yes, my solution is what you are looking for:
Get the dataset with a single query.
Add the data to a queue.
Launch your threads (via executors or plain new threads).
Have each thread pick data from the queue and process it.
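A minimal sketch of that queue-based approach. The MySQL URL, the big_table table with id and payload columns, and the habit of flattening each row to a String are all placeholder assumptions for brevity, not part of the original answer:

import java.sql.*;
import java.util.concurrent.*;

public class QueueFanOut {
    // A dedicated sentinel object tells the workers that the producer is done.
    private static final String POISON = new String("<no more rows>");

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService workers = Executors.newFixedThreadPool(4);

        // Workers: each one takes the next item off the shared queue,
        // so no two of them can ever process the same record.
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                try {
                    for (String row = queue.take(); row != POISON; row = queue.take()) {
                        System.out.println(Thread.currentThread().getName() + " -> " + row);
                    }
                    queue.put(POISON);          // pass the sentinel on to the next worker
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: a single SELECT fills the queue.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, payload FROM big_table")) {
            while (rs.next()) {
                queue.put(rs.getLong("id") + ":" + rs.getString("payload"));
            }
        }
        queue.put(POISON);
        workers.shutdown();
    }
}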

If the large dataset has an integer primary key, then one approach would be as follows:
Get the count of rows using the same select query.
Divide the entire dataset into an equal number of partitions.
Assign one partition to each thread. Each thread has its own select query with a primary key range as the constraint.
Note the following issues with this approach:
You fire (number of threads + 1) queries at the database, so performance might be a problem.
The partitions may not be equal in size (since some ids will have been deleted).
This approach is simple and makes sure that a row is processed by exactly one thread.
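A hedged sketch of that range-splitting idea. The connection URL, table, and column names are invented, and the boundaries come from MIN(id)/MAX(id) rather than a plain row count, which is just one way to derive the ranges:

import java.sql.*;

public class RangePartitionReader {
    public static void main(String[] args) throws Exception {
        int threadCount = 4;
        long minId, maxId;

        // One extra query to find the id boundaries of the dataset.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT MIN(id), MAX(id) FROM big_table")) {
            rs.next();
            minId = rs.getLong(1);
            maxId = rs.getLong(2);
        }

        long span = (maxId - minId + 1 + threadCount - 1) / threadCount;   // ids per partition, rounded up

        for (int t = 0; t < threadCount; t++) {
            final long from = minId + t * span;
            final long to   = Math.min(from + span - 1, maxId);
            new Thread(() -> {
                // Each thread has its own connection and a disjoint id range, so no record is read twice.
                try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass");
                     PreparedStatement ps = con.prepareStatement(
                             "SELECT id, payload FROM big_table WHERE id BETWEEN ? AND ?")) {
                    ps.setLong(1, from);
                    ps.setLong(2, to);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(Thread.currentThread().getName() + " -> " + rs.getLong("id"));
                        }
                    }
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }, "reader-" + t).start();
        }
    }
}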

You can use a singleton class to keep track of the rows that have already been read, so every thread can get the next row number from that singleton.
Alternatively, you can use a static AtomicInteger in a common class. Each thread calls its getAndIncrement() method to claim the next value, which lets you partition the data between the threads.
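For example, a shared counter along these lines (the class name and the idea of handing out row or batch numbers are illustrative assumptions):

import java.util.concurrent.atomic.AtomicInteger;

public class RowDispenser {
    // Shared by all threads; getAndIncrement() is atomic, so every caller gets a unique value.
    private static final AtomicInteger NEXT_ROW = new AtomicInteger(0);

    public static int claimNextRow() {
        return NEXT_ROW.getAndIncrement();
    }
}

// In each worker thread:
//   int rowNumber = RowDispenser.claimNextRow();
//   ... fetch and process the row (or batch) identified by rowNumber ...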

Related

How to tell other threads that the task corresponding to this record (in DB) was taken by me?

Right now, I am thinking of implementing multi-threading to take tasks corresponding to records in the DB tables. The tasks will be ordered by created date. I am stuck on handling the case where, once one task (record) has been taken, the other threads should skip it and move on to the next one.
Is there any way to do this? Many thanks in advance.
One solution is to make a synchronized pickATask() method so that free threads can only pick a task through this method.
This forces the other free threads to wait their turn.
// taskQueue is assumed to hold the pending tasks; poll() returns null when nothing is left
public synchronized NeedTask pickATask() {
    return taskQueue.poll();
}
Depending on how big your data insertion is, you can either use globally shared (synchronized) variables or use a table in the database itself to record values like (string TASK, boolean Taken, boolean finished, int Owner_PID).
Using the database to check the status tends to give faster code at large scale, but if you do not have too many threads, or this code will run just once, the synchronized global variable approach may be the better solution.
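One possible shape for the table-based variant: a single atomic UPDATE claims a task, and the update count tells the caller whether it won the race. The tasks table and its columns follow the answer's suggestion, but the exact schema is an assumption:

import java.sql.*;

public class TaskClaimer {
    /**
     * Tries to claim one task row for the given worker. The single UPDATE is atomic on the
     * database side, so only one thread/process can ever flip taken from 0 to 1 for a row.
     * (Assumed schema: tasks(id, task, taken, finished, owner_pid).)
     */
    public static boolean claim(Connection con, long taskId, int ownerPid) throws SQLException {
        String sql = "UPDATE tasks SET taken = 1, owner_pid = ? WHERE id = ? AND taken = 0";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, ownerPid);
            ps.setLong(2, taskId);
            return ps.executeUpdate() == 1;   // 1 row updated = we own it; 0 = someone else got there first
        }
    }
}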
In my opinion, if you create multiple threads to read from the DB, every thread is involved in I/O and in some kind of serialization while reading rows from the same table. To my mind this is not scalable and also has a performance impact.
My solution would be to have one producer thread that reads rows in batches, creates tasks, and submits them for execution to a thread pool of workers that do the actual work. Now we have two modules that can scale independently. On the producer side, if required, we can create multiple threads, each reading some partition of the data; for example, thread 1 reads ids 0-100 and thread 2 reads ids 101-200.
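A rough sketch of that producer/worker split, with a single producer and a fixed worker pool. The connection URL, table, columns, and the batch size of 100 are all placeholders:

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

public class BatchingProducer {
    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        int batchSize = 100;

        // Single producer: reads rows in batches and hands each batch to the worker pool.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, payload FROM big_table")) {

            List<String> batch = new ArrayList<>(batchSize);
            while (rs.next()) {
                batch.add(rs.getLong("id") + ":" + rs.getString("payload"));
                if (batch.size() == batchSize) {
                    submit(workers, batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                submit(workers, batch);          // last, partially filled batch
            }
        }

        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void submit(ExecutorService workers, List<String> batch) {
        workers.submit(() -> {
            for (String row : batch) {
                // the actual per-row work goes here
                System.out.println(Thread.currentThread().getName() + " -> " + row);
            }
        });
    }
}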
It depends on how you manage the communication between Java and the DB. Are you using direct JDBC calls, Hibernate, Spring Data, or another ORM framework? If you use plain JDBC, you can manage this whole issue at the DB level: you need to configure your DB to lock the record upon writing, i.e. once a record has been selected for update, no one can read it until the update is finished.
If you use an ORM framework (such as Hibernate, for example), the framework lets you manage concurrency issues. See Optimistic and Pessimistic locking.
Pessimistic locking does approximately what is described above: once the record is being updated, no one can read it until the update is finished.
Optimistic locking uses a versioning mechanism, so multiple threads can try to update the record, but only the first one succeeds; the rest get an exception saying that they are now working with stale data and should read the record again. The versioning mechanism adds a version column, usually a number or sometimes a timestamp. Each thread reads the record, and upon update it checks whether the version in the DB is still the same. If so, nobody else has updated the record, and on update the version is changed (incremented, or the current timestamp is set). If the version has changed, someone else already updated the record since it was read, so this thread holds a stale record and should not be allowed to update it. Optimistic locking shows better performance in environments where reads heavily outnumber writes.
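For the optimistic case, the versioning mechanism looks roughly like this in JPA (javax.persistence namespace; the entity, table, and column names are invented for the sketch):

import javax.persistence.*;

@Entity
@Table(name = "TASK_RECORD")
public class TaskRecord {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String status;

    // The provider bumps this on every successful update. An update that started from an
    // older version fails with an OptimisticLockException instead of silently overwriting.
    @Version
    private Long version;

    // getters and setters omitted for brevity
}

// Typical usage (em is an EntityManager):
//   try {
//       em.getTransaction().begin();
//       record.setStatus("PROCESSED");
//       em.flush();                       // version check happens here
//       em.getTransaction().commit();
//   } catch (OptimisticLockException stale) {
//       // someone else updated the row first: re-read it and decide whether to retry
//   }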

Best Java Data Structure for Fast, Concurrent Insertions

My use case is as follows: I have 10 threads simultaneously writing to one data structure. The order of the elements in the data structure does not matter. All the elements are unique. I will only be doing a read from this data structure only once at the very end.
What would be the fastest native Java data structure to suit this purpose? From my reading, it seems Collections.synchronizedList might be the go-to option?
I have 10 threads simultaneously writing to one data structure.
I think it would be best to use a separate data structure per thread. That way no synchronisation is needed between the threads, and it would be much more CPU cache friendly too.
At the end they could be joined.
As for the underlying structure: if the elements are fixed size, an array/vector would be best. Joining them would only take a copy of the block of memory they occupy, depending on the implementation, but lists would always be slower.
There is no need for you to synchronize on a list, as each thread can work on its own local copy, and at the end the results from all the threads can be joined into one final list.
If I were on JDK 7 or above, I would use fork/join: create a simple List in each forked task and merge them into the main list at the end, in the join phase.
If I were on JDK 6, I could use a CountDownLatch with a count of 10. After writing to its individual list (passed to it from the main controller thread), each thread counts down the latch; in the main controller, once all threads are done, I would combine all the results into one, as sketched below.
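A minimal sketch of that JDK 6-style variant under the stated assumptions: 10 writer threads, each filling its own ArrayList with unique values, and a CountDownLatch telling the controller when it is safe to combine them:

import java.util.*;
import java.util.concurrent.CountDownLatch;

public class PerThreadLists {
    public static void main(String[] args) throws InterruptedException {
        int threadCount = 10;
        CountDownLatch done = new CountDownLatch(threadCount);
        List<List<Integer>> perThread = new ArrayList<>();

        for (int t = 0; t < threadCount; t++) {
            List<Integer> local = new ArrayList<>();    // each thread writes only to its own list
            perThread.add(local);
            final int id = t;
            new Thread(() -> {
                for (int i = 0; i < 1_000; i++) {
                    local.add(id * 1_000 + i);          // unique elements, no synchronisation needed
                }
                done.countDown();
            }).start();
        }

        done.await();                                   // wait until every writer has finished

        List<Integer> combined = new ArrayList<>();
        for (List<Integer> part : perThread) {
            combined.addAll(part);                      // single-threaded join at the very end
        }
        System.out.println("total elements: " + combined.size());
    }
}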

Oracle 11g - System performace impacted as the data grows

Our system has quite a high level of concurrency, with multiple Java threads picking up one record at a time from a given Oracle 11g table which normally holds about two million records.
There are always many records ready to be picked up for processing. The records ready to be processed are selected based on a relatively complex SQL statement but once selected the processing order is based on a FIFO algorithm (ID order).
It is crucial that the same record is not picked up by two distinct threads. Because of this we have a locking mechanism in place.
From a high-level view, the way it works at the moment is that a Java thread invokes a stored procedure, which in turn opens a RECORD_READY_KEYS cursor, iterates through that cursor, and tries to acquire a lock on a record in a locking table with that key. The locking attempt is done with SELECT FOR UPDATE SKIP LOCKED. If the lock succeeds, the record to process is returned to the Java thread for processing.
Everything works fine as long as there are not too many records ready to process. However, when this number grows over a limit (from our observations, when going over 15K), the SQL statement used to open the RECORD_READY_KEYS cursor starts degrading in performance. Despite being fully optimised, it starts taking close to 0.2 seconds to run, which means each Java thread can process at most five records per second. In reality, adding the time taken to acquire the lock, travel over the network, do the actual processing, commit the transaction, etc. results in even slower processing.
Increasing the number of java threads is an option, however we cannot go over a certain limit as they will start putting pressure on the database/application server/system resources, etc.
The real problem is that we run an SQL statement to build the RECORD_READY_KEYS cursor containing fifteen thousand keys out of a total of two million, then pick up the first available record from the top and discard the rest by closing the cursor.
My idea would be to have a KEYS_CACHE nested table defined at package level and store the result of the RECORD_READY_KEYS selection in that nested table. Once a key is locked, it would be deleted from KEYS_CACHE and returned to the Java thread. The process could continue this way until the whole KEYS_CACHE is consumed, at which point it would be populated again.
Now my questions will be:
Q1. Can you see any weak points with this approach?
I can see multiple threads trying to lock the same record at the same time and thus wasting a bit of time. On the Java side we can make the stored procedure invocation synchronized only to a certain extent, as the invocation will happen from multiple JVMs. However, I cannot see this being a major issue.
Another issue would be when an unlikely rollback happens, as there will be no easy way to put the deleted key back. The next RECORD_READY_KEYS selection will pick it up again, and a delay of a few minutes will not really matter.
Q2. As the nested table holds fewer and fewer records it will become very sparse. Can you see this becoming a problem? If so, should I limit the initial size to, say, 5000 keys, or does it not really matter?
Q3. Can you see a problem with that package-level KEYS_CACHE nested table being accessed concurrently by so many threads (we have between 25 and 100 of them)?
Q4. Can you see an alternative approach that would not require a whole system redesign?
Thank you in advance
I think I was not very clear when explaining my situation. We do not lock the records to process in the two-million-record table; instead we lock the keys, which are also saved in a separate locking table.
Say I have this 2 million records table called messages:
If there are only messages with Key-A, Key-B, and Key-C ready to be processed, a possible content of the key locking table may be:
Note that Key-X is in there even though no messages are ready to be processed for that key, because messages with that key have just finished processing and the clean-up thread has not kicked in yet. That is OK and even desirable: if more new messages with Key-X enter the system in a short while, it will save a new insert.
So our select (fully optimised) will obtain a list with Key-A, Key-C, and Key-B in this order (Key-C comes before Key-B because it has a message with Id = 2, which is smaller than the first Key-B message with Id = 6).
Very much simplified, what we do here in fact is:
SELECT key FROM messages WHERE ready = 'Y' GROUP BY key ORDER BY MIN(id)
Once we have that select in a cursor, we fetch the keys one by one and try to lock each one in the key_lockings table. Once a lock succeeds, the key gets assigned to a thread (there is a threads table for this) and stays with that thread, which processes all messages that are ready for that key. As I mentioned in my first post, it is crucial that messages with the same key be processed by the same thread, as the key is how we link related messages that must be processed in sequence.
The SELECT above is instant when the number of keys selected is up to a few thousand. It still performs OK at around 10,000 keys. Once the number of retrieved keys gets over 15,000, the performance starts degrading. The time to run the SELECT is still OK (about 0.2 seconds), and we do have indexes on all fields involved in this selection. It is just applying the WHERE, GROUP BY, and ORDER BY to select 15,000 keys out of two million records that takes the time.
So the problem for us is that every single thread runs the same SELECT and gets 15,000 records just to pick up one of them. The thing I was considering was that, rather than closing the cursor and throwing the hard work away as we do at the moment, we could store those keys in a package-level nested table and delete the keys from there as we allocate them to threads. My first three questions just wanted to capture other people's opinions about this approach, while the last one was about finding alternative ideas (e.g. someone might say to use Advanced Queues, etc.).
I have an example to hand that is (I think) very similar.
You have a table, with lots of rows, and you have multiple consumer processes (the Java threads), that all want to work on the contents of that table.
First off, I recommend avoiding SKIP LOCKED. If you think you need it, consider if you've set your INITRANS high enough on the table. Consider that SKIP LOCKED means that Oracle will skip locked resources, not just locked rows. If a block's ITL is full, SKIP LOCKED will skip it, even if there are unlocked rows in the block!
For a more detailed discussion of this, see here:
https://markjbobak.wordpress.com/2010/04/06/unintended-consequences/
So, on to my suggestion. For each concurrent Java thread, define a thread number. So, suppose you have 10 concurrent threads: assign each of them a thread number or thread id, 0-9. Now, if you have the flexibility to modify the table, you could add a column, THREAD_ID, and then use that in the select statement when selecting from the table. Each concurrent Java thread will select only those rows that match its thread id. In this way, you can guarantee that you'll avoid collisions. If you don't have the ability to add a column to the table, then you hopefully have a nice, numeric, sequence-driven primary key? If so, you can get the same effect by querying MOD(PRIMARY_KEY_COLUMN, 10) = :client_thread_id.
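A rough JDBC illustration of that MOD() idea, to be read as a sketch only: the connection URL, work_table, and the STATUS/PRIORITY columns are placeholders standing in for whatever the real eligibility and ordering columns are:

import java.sql.*;

public class ModPartitionWorker implements Runnable {
    private final int threadId;      // e.g. 0-9 for ten concurrent workers
    private final int threadCount;

    public ModPartitionWorker(int threadId, int threadCount) {
        this.threadId = threadId;
        this.threadCount = threadCount;
    }

    @Override
    public void run() {
        String sql = "SELECT id, payload FROM work_table "
                   + "WHERE MOD(id, ?) = ? AND status = 'READY' "
                   + "ORDER BY priority, id";
        try (Connection con = DriverManager.getConnection("jdbc:oracle:thin:@//localhost:1521/XE", "user", "pass");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, threadCount);
            ps.setInt(2, threadId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // only this thread ever sees rows whose id maps to its threadId
                    process(rs.getLong("id"), rs.getString("payload"));
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    private void process(long id, String payload) {
        // placeholder for the real work
    }
}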
Additionally, do you have columns that specify a status of some sort, or something like that, which you'll use to determine which rows from the table are eligible to be processed by the Java thread? If so, and particularly if that criteria significantly improves selectivity, creating a virtual column that is only populated for the values which you're interested in, could be quite useful, if that column is then added to the index. (THREAD_ID, STATUS), for example.
Finally, you mentioned processing in a specific order. If THREAD_ID, STATUS is your selection criteria, then perhaps a PRIORITY or STATUS_DATE column may be your ordering requirement. In that case, it may be useful to continue to build out the index, to add in the column(s) specifying the required order, and top it off with the primary key of the table.
With a carefully constructed index, and using the THREAD_ID idea, it should be possible to construct an index that will allow you to:
avoid collisions (Use THREAD_ID or MOD() on primary key)
minimize the size of the index (virtual columns)
avoid any ORDER BY operation (add order by columns to index)
avoid any TABLE ACCESS BY ROWID operation (add primary key column to end of index)
I made a few assumptions that may or may not apply.

Using multi-threading in JDBC batch job

We have a JDBC batch job. There are two tables:
BUSINESS_CONTRACT
CLASSIFY_RECORD
The table BUSINESS_CONTRACT stores information of business contracts, we classify business contracts every month and store classify result in the table CLASSIFY_RECORD.
The batch job runs once per month, queries BUSINESS_CONTRACT for the business contracts that need to be classified, classifies them, and then inserts the classification results into CLASSIFY_RECORD.
The batch job runs in a single thread right now, and I want to make it run with multiple threads.
How should I write the basic code structure using the dispatcher-worker pattern?
I have been learning Java multi-threading, but have found mostly theoretical resources. Now I want to use multi-threading to solve a real problem, but I don't know how to write the first line of code.
First, do you need the added complexity of multi-threading? How long does your current process take to run? Do you have multiple CPUs or multiple CPU cores available on the server you would be running this on, that would make the multi-threading beneficial?
I'm not going to write your code for you, but can give you a few pointers...
How would you do this work manually? Assume you had these as paper records, and had to split the task with a co-worker. How would you divide up the work? Between 2 people or 20 people? (That's how many threads you could potentially split this into.)
Once you have these details figured out, you can create multiple threads (your workers, using parent "dispatcher" code) - each configured to select only a portion of the results from your query. You should keep references to each of your threads, and call .join() on each of them once they are all started in order to wait for the entire batch to complete. If there is a large amount of data that will be difficult to split into equal units of work (1,000 records divided into 500 and 500 may require 75% and 25% of the resources for whatever reason), you may want to consider splitting the work into much smaller units (more units than threads), then have the dispatcher continue to feed the units of work to the workers until all work has been assigned.
Also consider, would these split functions of work be truly distinct? If one unit of work fails for some reason and needs to be rolled-back in the database, does this mean that all of the other units of work need to be stopped and any existing inserts rolled-back as well?
Are you using batch updates? It will probably make more of a difference than multiple threads doing single updates.
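If not, the classic addBatch()/executeBatch() pattern looks roughly like this; the column names of CLASSIFY_RECORD and the batch size are assumptions made for the sketch:

import java.sql.*;
import java.util.List;

public class ClassifyRecordWriter {
    private static final int BATCH_SIZE = 500;

    /** Inserts classification results in batches; each long[] is {contract id, classification code}. */
    public static void insert(Connection con, List<long[]> results) throws SQLException {
        String sql = "INSERT INTO CLASSIFY_RECORD (contract_id, classify_code) VALUES (?, ?)";
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int inBatch = 0;
            for (long[] r : results) {
                ps.setLong(1, r[0]);
                ps.setLong(2, r[1]);
                ps.addBatch();
                if (++inBatch == BATCH_SIZE) {     // send a full batch in one round trip
                    ps.executeBatch();
                    inBatch = 0;
                }
            }
            ps.executeBatch();                     // flush the remainder
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}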

JPA persistence using multiple threads

I have a problem when I try to persist objects using multiple threads.
Details :
Suppose I have an object PaymentOrder which has a list of PaymentGroup (One to Many relationship) and PaymentGroup contains a list of CreditTransfer(One to Many Relationship again).
Since the number of CreditTransfers is huge (in the hundreds of thousands), I have grouped them by PaymentGroup (based on some business logic)
and am creating worker threads (one thread for each PaymentGroup) to build the PaymentOrder objects and commit them to the database.
The problem is that each worker thread creates its own PaymentOrder (which contains a unique set of PaymentGroups).
The primary keys for all the entities are auto-generated.
So there are three tables, 1. PAYMENT_ORDER_MASTER, 2. PAYMENT_GROUPS, 3. CREDIT_TRANSFERS, all mapped by one-to-many relationships.
Because of that, when the second thread tries to persist its group in the database, the framework tries to persist the same PaymentOrder that the previous thread committed, and the transaction fails due to another unique field constraint (the checksum of the PaymentOrder).
Ideally it must be 1..n..m (PaymentOrder -> PaymentGroup -> CreditTransfer).
What I need to achieve is: if there is no entry for the PaymentOrder in the database, make one; if it is there, don't make an entry in PAYMENT_ORDER_MASTER, only in PAYMENT_GROUPS and CREDIT_TRANSFERS.
How can I overcome this problem while maintaining the split-master-payment-order-using-groups logic and multiple threads?
You've got options.
1) Primitive but simple: catch the key violation error at the end and retry your insert without the parents (see the sketch after this list). Assuming your parents are truly unique, you know that another thread just did the parents...proceed with the children. This may perform poorly compared to other options, but maybe you get the pop you need. If you had a high % of parents with one child, it would work nicely.
2) Change your read consistency level. It's vendor specific, but you can sometimes read uncommitted transactions. This would help you see the other threads' work prior to commit. It isn't foolproof, you still have to do #1 as well, since another thread can sneak in after the read. But it might improve your throughput, at a cost of more complexity. Could be impossible, based on RDBMS (or maybe it can happen but only at DB level, messing up other apps!)
3) Implement a work queue with single threaded consumer. If the main expensive work of the program is before the persistence level, you can have your threads "insert" their data into a work queue, where the keys aren't enforced. Then have a single thread pull from the work queue and persist. The work queue can be in memory, in another table, or in a vendor specific place (Weblogic Queue, Oracle AQ, etc). If the main work of the program is before the persistence, you parallelize THAT and go back to a single thread on the inserts. You can even have your consumer work in "batch insert" mode. Sweeeeeeeet.
4) Relax your constraints. Who cares really if there are two parents for the same child holding identical information? I'm just asking. If you don't later need super fast updates on the parent info, and you can change your reading programs to understand it, it can work nicely. It won't get you an "A" in DB design class, but if it works.....
5) Implement a goofy lock table. I hate this solution, but it does work---have your thread write down, as its first transaction (and commit), that it is working on parent "x" so nobody else can. Typically leads to the same problems (and others--cleaning the records later, etc.), but it can work when child inserts are slow and a single row insert is fast. You'll still have collisions, but fewer.
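To make option 1 concrete, here is a hedged sketch of the catch-and-fall-back flow. It assumes the driver surfaces the unique-key violation as SQLIntegrityConstraintViolationException and that PAYMENT_ORDER_MASTER has an id and a checksum column; the exact DDL is not in the question:

import java.sql.*;

public class ParentFirstInserter {
    /**
     * Try to insert the parent; if another thread beat us to it, catch the key
     * violation, look the existing parent up by its unique checksum, and carry on
     * inserting children against that id.
     */
    public static long ensureParent(Connection con, String checksum) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO PAYMENT_ORDER_MASTER (checksum) VALUES (?)",
                Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, checksum);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);
            }
        } catch (SQLIntegrityConstraintViolationException alreadyThere) {
            // Another thread inserted the same parent first: read its id and reuse it.
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id FROM PAYMENT_ORDER_MASTER WHERE checksum = ?")) {
                ps.setString(1, checksum);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    return rs.getLong(1);
                }
            }
        }
    }
}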
Hibernate sessions are not thread-safe, and neither are the JDBC connections that underlie Hibernate. Consider multithreading your business logic instead, so that each thread uses its own Hibernate session and JDBC connection. By using a thread pool you can further improve your code by adding the ability to throttle the number of simultaneous threads.
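Concretely, something along these lines, assuming a shared SessionFactory and treating the payment groups as opaque entities; the pool size and the persist call are illustrative:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class GroupPersister {
    private final SessionFactory sessionFactory;                          // thread-safe, share one instance
    private final ExecutorService pool = Executors.newFixedThreadPool(8); // throttles concurrent workers

    public GroupPersister(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void persistAll(List<?> paymentGroups) throws InterruptedException {
        for (Object group : paymentGroups) {
            pool.submit(() -> {
                // Each task opens its own Session and therefore gets its own JDBC connection.
                Session session = sessionFactory.openSession();
                Transaction tx = null;
                try {
                    tx = session.beginTransaction();
                    session.persist(group);
                    tx.commit();
                } catch (RuntimeException e) {
                    if (tx != null) {
                        tx.rollback();
                    }
                    throw e;
                } finally {
                    session.close();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}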
