Spring Batch Parallel reading from DB - java

How can I implement parallel reading from DB in Spring Batch?
According to https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html I can implement a multi-threaded step; however, I have to wrap my reader in a SynchronizedItemStreamReader. Therefore all my queries to the DB end up running sequentially because of it.

JdbcCursorItemReader is not thread-safe because it wraps a single non-thread-safe ResultSet. This is why, in a multi-threaded environment, you need to synchronize access to it.
The JdbcPagingItemReader, on the other hand, is thread-safe. When using multiple threads, each chunk is executed in its own thread. If you've configured the page size to match the commit interval, that means each page is processed in the same thread.
Now, most of the time we need to scale at processing and writing time rather than during reads. Reads are generally fast enough to cover our scalability needs.
But as I said, if you really need this, you should go with the out-of-the-box paging reader or write your own thread-safe reader.
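For reference, a multi-threaded step built on the thread-safe paging reader might be configured roughly like this. This is a sketch against the Spring Batch 5 builder APIs; the Person class, the people table, and the bean names are assumptions for illustration, not part of the question:

```java
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.jdbc.core.BeanPropertyRowMapper;
import org.springframework.transaction.PlatformTransactionManager;

public class ParallelReadConfig {

    @Bean
    public JdbcPagingItemReader<Person> pagingReader(DataSource dataSource) {
        return new JdbcPagingItemReaderBuilder<Person>()
                .name("personReader")
                .dataSource(dataSource)
                .selectClause("SELECT id, name")
                .fromClause("FROM people")
                .sortKeys(Map.of("id", Order.ASCENDING)) // a sort key is mandatory for paging
                .pageSize(100)                           // match the commit interval below
                .rowMapper(new BeanPropertyRowMapper<>(Person.class))
                .build();
    }

    @Bean
    public Step parallelReadStep(JobRepository jobRepository,
                                 PlatformTransactionManager txManager,
                                 JdbcPagingItemReader<Person> reader,
                                 ItemWriter<Person> writer) {
        return new StepBuilder("parallelReadStep", jobRepository)
                .<Person, Person>chunk(100, txManager)   // commit interval == page size
                .reader(reader)
                .writer(writer)
                // each chunk runs on its own thread from this executor
                .taskExecutor(new SimpleAsyncTaskExecutor("reader-"))
                .build();
    }
}
```

With pageSize equal to the commit interval, each chunk maps to exactly one page, so every page is read and processed on a single thread, as described above.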

Related

How to stop all threads if anyone fails in java?

I have a flat file, say a CSV file of 50MB, that contains structured data, and I need to read it and push it into a DB, say MySQL. One way to do this is splitting the file into multiple parts and processing them in parallel using executors. This is OK. Now the second use case: if any one record is incorrect, I need to stop processing on all the threads, which means that if any record in the CSV is found to be incorrect, we should not process the transaction. I need ideas for this second part.
Thanks,
RK
For 50MB, you'd be over-complicating this design by adding multiple threads. A flat file or structured data like JSON can be ripped through with a single thread in seconds, if not faster. Spinning up multiple threads for 50MB of data is overkill. On a number of occasions, I've handled the same use case with 400+ MB of JSON or CSV data with a single thread.
You also have to consider that you are writing to a single DB, in which case multiple threads are going to complicate things, as you have multiple transactions. Taking your CSV example, it sounds like you intend for each thread to be responsible for reading one or more lines and writing them to the DB? In that case, each thread is operating in its own JDBC transaction. Thus, if you stop all threads, you're going to end up with partially written data in the DB, as some threads may have already completed their work and committed their transactions. Since each thread operates independently, you don't have the opportunity to roll back the transactions that have already committed.
If you're still committed to parallelization for 50MB of data, consider making 2 passes:
1. Read and validate the data and generate the appropriate SQL insert statements
2. If all threads are successful, execute the generated SQL file
This does what you want: first, it guarantees that you fail fast on a validation error, before any data is written to the DB. Second, it ensures that the data can be written to the DB atomically. To coordinate the threads, you'd want to use something like a CyclicBarrier or some other type of synchronizer in the java.util.concurrent package.
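The two-pass idea can be sketched as follows. This is a minimal illustration, not a full implementation: the "validation rule" (rejecting empty rows), the generated statements, and the class name are all stand-ins, and executing the SQL is modelled as collecting the statements into a list only once every worker has passed validation at the CyclicBarrier:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TwoPassLoader {
    private final Queue<String> generatedSql = new ConcurrentLinkedQueue<>();
    private final List<String> written = new CopyOnWriteArrayList<>();
    private volatile boolean anyInvalid = false;

    public List<String> load(List<List<String>> slices) {
        // Pass 2 runs as the barrier action, exactly once, after every
        // worker has finished pass 1 -- and only if nothing was invalid.
        CyclicBarrier barrier = new CyclicBarrier(slices.size(), () -> {
            if (!anyInvalid) {
                written.addAll(generatedSql); // stand-in for "execute the SQL"
            }
        });
        ExecutorService pool = Executors.newFixedThreadPool(slices.size());
        for (List<String> slice : slices) {
            pool.execute(() -> {
                for (String row : slice) {          // pass 1: validate + generate
                    if (row.isEmpty()) {            // stand-in validation rule
                        anyInvalid = true;
                    } else {
                        generatedSql.add("INSERT INTO t VALUES ('" + row + "')");
                    }
                }
                try {
                    barrier.await();                // wait for all validators
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }
}
```

A single invalid row anywhere causes every slice's output to be discarded, which is exactly the all-or-nothing behaviour asked for.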
There are also plenty of frameworks out there that make this stuff easier and handle error cases and reusability of jobs. Spring Batch is one such tool and there are several more.
Do use ThreadGroup.
public static void main(String... args) {
    final ThreadGroup group = new ThreadGroup("Thread Group");
    new Thread(group, () -> {
        try {
            // payload: process this thread's share of the work
        } catch (RuntimeException e) {
            // one failure interrupts every thread in the group; each
            // payload must check Thread.interrupted() to actually stop
            group.interrupt();
        }
    }).start();
    new Thread(group, () -> {
        try {
            // payload
        } catch (RuntimeException e) {
            group.interrupt();
        }
    }).start();
}

Fastest way to write file with multiple threads:FileChannel vs multiple RandomAccessFiles [duplicate]

I am trying to write a single huge file in Java using multiple threads.
I have tried both the FileWriter and BufferedWriter classes in Java.
The content being written is actually an entire table (Postgres) being read using CopyManager and written. Each line in the file is a single tuple from the table and I am writing 100s of lines at a time.
Approach to write:
The single to-be-written file is opened by multiple threads in append mode.
Each thread thereafter tries writing to the file.
Following are the issues I face:
Once in a while, the contents of the file get overwritten, i.e. one line remains incomplete and the next line starts right there. My assumption is that the writers' buffers are getting full, which forces a writer to immediately flush its data to the file. The data written may not be a complete line, and before it can write the remainder, the next thread writes its content to the file.
While using FileWriter, once in a while I see a single blank line in the file.
Any suggestions, how to avoid this data integrity issue?
Shared Resource == Contention
Writing to a normal file is by definition a serialized operation. You gain no performance by trying to write to it from multiple threads; I/O is a finite, bounded resource with orders of magnitude less bandwidth than even the slowest or most overloaded CPU.
Concurrent access to a shared resource can be complicated ( and slow )
If you have multiple threads that are doing expensive calculations, then you have options; if you are just using multiple threads because you think you are going to speed something up, you are going to do the opposite. Contention for I/O always slows down access to the resource; it never speeds it up, because of lock waits and other overhead.
You have to have a critical section that is protected and allows only a single writer at a time. Just look up the source code for any logging writer that supports concurrency and you will see that there is only a single thread that writes to the file.
If your application is primarily:
CPU bound: You can use some locking mechanism/data structure to let only one thread out of many write to the file at a time. As a naive solution this is useless from a concurrency standpoint, but if the threads are CPU bound with little I/O it might work.
I/O bound: This is the most common case. You should use a message-passing system with a queue of some sort: have all the threads post to the queue/buffer and a single thread pull from it and write to the file. This will be the most scalable and easiest-to-implement solution.
Journaling - Async Writes
If you need to create a single very large file where the order of writes is unimportant and the program is CPU bound, you can use a journaling technique.
Have each process write to a separate file and then concat the multiple files into a single large file at the end. This is a very old school low tech solution that works well and has for decades.
Obviously the more storage I/O you have the better this will perform on the end concat.
I am trying to write a single huge file in Java using multiple threads.
I would recommend that you have X threads reading from the database and a single thread writing to your output file. This is going to be much easier to implement as opposed to doing file locking and the like.
You could use a shared BlockingQueue (maybe ArrayBlockingQueue) so the database readers would add(...) to the queue and your writer would be in a take() loop on the queue. When the readers finish, they could add some special IM_DONE string constant and as soon as the writing thread sees X of these constants (i.e. one for each reader), it would close the output file and exit.
So then you can use a single BufferedWriter without any locks and the like. Chances are that you will be blocked by the database calls instead of the local IO. Certainly the extra thread isn't going to slow you down at all.
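The pattern described above can be sketched like this. The sentinel constant, the fake "database rows", and the class name are assumptions for illustration; in a real job the reader loops would be pulling from the database instead:

```java
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleWriterDemo {
    // Sentinel ("poison pill") that cannot collide with real data.
    private static final String IM_DONE = "\u0000IM_DONE\u0000";

    public static void run(Path out, int readers) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        for (int r = 0; r < readers; r++) {
            final int id = r;
            pool.execute(() -> {
                try {
                    for (int i = 0; i < 100; i++) {  // stand-in for DB reads
                        queue.put("row-" + id + "-" + i);
                    }
                    queue.put(IM_DONE);              // one sentinel per reader
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // The only thread that ever touches the writer: no locks needed.
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            int done = 0;
            while (done < readers) {
                String line = queue.take();
                if (IM_DONE.equals(line)) {
                    done++;                          // a reader finished
                } else {
                    w.write(line);
                    w.newLine();
                }
            }
        }
        pool.shutdown();
    }
}
```

The writer exits after seeing one sentinel per reader, so the file is closed exactly once, after all data has been drained.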
The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.
If you are adamant to have your reading threads also do the writing then you should add a synchronized block around the access to a single shared BufferedWriter -- you could synchronize on the BufferedWriter object itself. Knowing when to close the writer is a bit of an issue since each thread would have to know if the other one has exited. Each thread could increment a shared AtomicInteger when they run and decrement when they are done. Then the thread that looks at the run-count and sees 0 would be the one that would close the writer.
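The shared-writer variant described above might look roughly like this. The thread count, line contents, and class name are assumptions for illustration; the key points are the synchronized block around each write and the AtomicInteger that lets the last finisher close the writer:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedWriterDemo {
    public static void run(Path out, int threads) throws Exception {
        BufferedWriter writer = Files.newBufferedWriter(out);
        AtomicInteger running = new AtomicInteger(threads);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    for (int i = 0; i < 50; i++) {
                        synchronized (writer) {       // only one writer at a time
                            writer.write("row-" + id + "-" + i);
                            writer.newLine();
                        }
                    }
                    if (running.decrementAndGet() == 0) {
                        writer.close();               // last one out closes it
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

The counter only reaches zero after every thread has finished its writes, so the close can never race with an in-flight write.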
Instead of synchronized methods, a better solution would be a thread pool with a single thread, backed by a blocking queue. The messages the application writes would be pushed onto the blocking queue. The log-writer thread would continue reading from the blocking queue (blocking when the queue is empty) and writing to the single file.


Java: create a middleman class between multiple threads accessing a non synchronized class?

So taking cues from this question Multithreaded access to file
My scenario is that I have a spreadsheet component which multiple threads will access, each writing to its respective workbook. The component itself is not thread-safe, so am I correct in thinking that while one thread is writing to it, the other threads need to be blocked until the first one finishes writing? How would I go about achieving this when dealing with a non-thread-safe class? Put the writing method in a synchronized block?
Another concern this raises: what if one thread is busy writing long rows of data to its respective workbook? The other thread would have to stop dead in its tracks until the first one is finished, and this is not desirable.
Instead, I imagine a scenario where each thread runs without blocking the others, and the data is written to the spreadsheet by another middleman class which buffers and flushes the data onto the spreadsheet component without making the threads wait for their writes to complete.
Basically each thread does two things on its own: 1) performs some long-running processing of data from its respective source, and 2) writes the processed data to the spreadsheet. I am seeking a concurrent solution where 1) is never blocked waiting on 2).
The best solution really depends on the types of operations that you're performing on the spreadsheet. For example, if one thread needs to read the value written by another thread, then it's probably necessary to lock either the whole spreadsheet or at least specific rows at a time. Since the spreadsheet itself isn't thread safe, you're correct that you'll need to do your own synchronization.
If it's important to serialize all access (which hurts performance, as it gets rid of parallelism), consider using a thread-safe queue, where each thread adds an object to the queue that represents the operation it wants to perform. Then you can have a worker thread pull items off of the queue (again, in a thread-safe manner, since the queue is thread-safe) and perform the operation.
There may be room here to parallelize the queue workers, since they can communicate with each other and do some row-based locking amongst themselves. For example, if the first operation is to read rows 1-4 and write to row 5, and the second operation is to read from rows 6-10 and write to row 11, then these should be able to execute in parallel. But be careful here, since it may depend on the underlying structure of the spreadsheet, which you say isn't thread-safe. Reads are probably fine to perform in parallel nonetheless.
While non-trivial, synchronizing access to a queue is the basic readers-writers problem, and while you have to make sure to avoid starvation as well as deadlocks, it's a lot easier to think about than random-access to a spreadsheet.
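The operation-queue idea above can be sketched as a small gateway class. The "spreadsheet" here is a plain map standing in for the non-thread-safe component, and the class and method names are assumptions for illustration; the point is that producers only ever enqueue operation objects, and one worker thread is the only code that touches the component:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SpreadsheetGateway {
    private final Map<String, String> sheet = new HashMap<>(); // not thread-safe
    private final BlockingQueue<Runnable> ops = new LinkedBlockingQueue<>();

    public SpreadsheetGateway() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    ops.take().run();   // only this thread mutates the sheet
                }
            } catch (InterruptedException e) {
                // shutdown
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Producers block only long enough to enqueue, never for the write itself.
    public void write(String cell, String value) {
        ops.add(() -> sheet.put(cell, value));
    }

    // Reads go through the same queue, so they see all earlier writes.
    public String read(String cell) throws Exception {
        CompletableFuture<String> f = new CompletableFuture<>();
        ops.add(() -> f.complete(sheet.get(cell)));
        return f.get(5, TimeUnit.SECONDS);
    }
}
```

Because the queue is FIFO and a single worker drains it, operations are applied in submission order without any locking on the component itself.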
That said, the best solution would be to use a thread-safe spreadsheet, or only use one thread to ever access it. Why not use a database-backed spreadsheet and then have multiple threads reading/writing the database at once?

How to integrate LMAX within a real financial application

I am also thinking of integrating the disruptor pattern in our application. I am a bit unsure about a few things before I start using the disruptor
I have 3 producers: mainly a FIX thread which de-serialises the requests, another thread which continuously modifies order prices as the market moves, and one more thread which is responsible for de-serialising the requests sent from a GUI application. All three threads currently write to a BlockingQueue (hence we see a lot of contention on the queue).
The disruptor talks about a Single writer principle and from what I have read that approach scales the best. Is there any way we could make the above three threads obey the single writer principle?
Also in a typical request/response application, specially in our case we have contention on an in memory cache, as we need to lock the cache when we update the cache with the response, whilst a request might be happening for the same order. How do we handle this through the disruptor, i.e. how do I tie up a response to a particular request? Can I eliminate the lock on the cache if yes how?
Any suggestions/pointers would be highly appreciated. We are currently using Java 1.6
I'm new to the Disruptor and am trying to understand as many use cases as possible. I have tried to answer your questions.
Yes, the Disruptor can be used to sequence calls from multiple producers. I understand that all 3 threads try to update the state of a shared object, and a single consumer takes the necessary action on the shared object. Internally, you can have the single consumer delegate calls to the appropriate single-threaded handler based on responsibility.
The Disruptor does exactly this: it sequences the calls such that the state is accessed by only one thread at a time. If there's a specific order in which the event handlers are to be invoked, set up a memory barrier. The latest version of the Disruptor has a DSL that lets you set up the order easily.
The cache can be abstracted and accessed through the Disruptor. At any time, only a reader or a writer gets access to the cache, since all calls to the cache are sequential.
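As a rough wiring sketch (against a recent Disruptor version with lambdas, so it assumes a newer Java than the 1.6 mentioned; OrderEvent and its fields are illustrative stand-ins for the FIX/market-data/GUI messages, and cache is your existing in-memory cache):

```java
// Requires the com.lmax:disruptor dependency.
public class OrderEvent {
    long orderId;
    double price;
}

Disruptor<OrderEvent> disruptor = new Disruptor<>(
        OrderEvent::new,               // preallocated events in the ring buffer
        1024,                          // ring size, must be a power of two
        DaemonThreadFactory.INSTANCE,
        ProducerType.MULTI,            // FIX, market-data and GUI threads all publish
        new BlockingWaitStrategy());

// The single event handler is the only thread that ever touches the cache,
// so the cache itself needs no lock.
disruptor.handleEventsWith((event, sequence, endOfBatch) ->
        cache.put(event.orderId, event.price));

RingBuffer<OrderEvent> ring = disruptor.start();

// From any of the three producer threads:
ring.publishEvent((event, seq) -> { event.orderId = 42; event.price = 101.5; });
```

The three producers still contend briefly when claiming ring-buffer slots, but the cache update itself runs on a single thread, which is how the single-writer principle is preserved here.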
