I have a flat file, say a 50 MB CSV, that contains structured data, and I need to read it and push it into a DB, say MySQL. One way to do this is to split the file into multiple parts and process them in parallel using executors. That part is fine. The second use case is that if any one record is incorrect, I need to stop processing on all the threads; in other words, if any one record in the CSV is found to be invalid, we should not commit the transaction at all. I need ideas for this second part.
Thanks,
RK
For 50 MB, you'd be overcomplicating the design by adding multiple threads. A flat file or structured data like JSON can be ripped through with a single thread in seconds, if not faster; spinning up multiple threads for 50 MB of data is overkill. On a number of occasions I've handled the same use case with 400+ MB of JSON or CSV data using a single thread.
You also have to consider that you are writing to a single DB, in which case multiple threads are going to complicate things because you have multiple transactions. Taking your CSV example, it sounds like you intend for each thread to be responsible for reading one or more lines and writing them to the DB? In that case, each thread operates in its own JDBC transaction. Thus, if you stop all threads, you're going to end up with partially written data in the DB, as some threads may have already completed their work and committed their transactions. Since each thread operates independently, you have no opportunity to roll back the transactions that the completed threads have already committed.
If you're still committed to parallelizing 50 MB of data, consider making two passes:
1. Read and validate the data, and generate the appropriate SQL insert statements.
2. If all threads are successful, execute the generated SQL file.
This does what you want: first, it guarantees that you fail completely on a validation error before any data is written to the DB; second, it ensures that the data can be written to the DB atomically. To coordinate the threads, you'd want to use something like a CyclicBarrier or some other synchronizer from the java.util.concurrent package.
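To make this concrete, here is a minimal sketch of the two-pass idea. For brevity it coordinates the validation threads with ExecutorService.invokeAll and Futures rather than a CyclicBarrier, and the table name, column split, JDBC URL and the isValid rule are all placeholder assumptions:

import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TwoPassCsvLoader {

    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Path.of("data.csv"));   // 50 MB fits in memory

        // Pass 1: validate in parallel; each task reports whether its slice is clean.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        int chunk = (lines.size() + 3) / 4;
        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunk) {
            List<String> slice = lines.subList(i, Math.min(i + chunk, lines.size()));
            tasks.add(() -> slice.stream().allMatch(TwoPassCsvLoader::isValid));
        }
        boolean allValid = true;
        for (Future<Boolean> f : pool.invokeAll(tasks)) {
            allValid &= f.get();
        }
        pool.shutdown();

        if (!allValid) {
            System.err.println("Validation failed - nothing was written to the DB");
            return;
        }

        // Pass 2: single thread, single JDBC transaction, so the load is all-or-nothing.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "pw")) {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement("INSERT INTO records(col1, col2) VALUES (?, ?)")) {
                for (String line : lines) {
                    String[] cols = line.split(",");
                    ps.setString(1, cols[0]);
                    ps.setString(2, cols[1]);
                    ps.addBatch();
                }
                ps.executeBatch();
                con.commit();
            } catch (Exception e) {
                con.rollback();
                throw e;
            }
        }
    }

    // Placeholder validation rule: exactly two non-empty columns per line.
    private static boolean isValid(String line) {
        String[] cols = line.split(",");
        return cols.length == 2 && !cols[0].isBlank() && !cols[1].isBlank();
    }
}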
There are also plenty of frameworks out there that make this stuff easier and handle error cases and reusability of jobs. Spring Batch is one such tool and there are several more.
Do use ThreadGroup.
public static void main(String... args) {
    final ThreadGroup group = new ThreadGroup("Thread Group");
    new Thread(group, () -> {
        // payload: process this thread's share of the file
        group.interrupt();   // on a bad record, interrupt every thread in the group
    }).start();
    new Thread(group, () -> {
        // payload
        group.interrupt();
    }).start();
}
Related
How can I implement parallel reading from DB in Spring Batch?
According to https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html I can implement a multithreaded step; however, I have to use a SynchronizedItemStreamReader for my reader, so all my queries to the DB end up being sequential because of it.
JdbcCursorItemReader is not thread-safe because it wraps a single, non-thread-safe ResultSet. This is why, in a multithreaded environment, you need to synchronize access to it.
On the other hand, JdbcPagingItemReader is thread-safe. When using multiple threads, each chunk is executed in its own thread; if you've configured the page size to match the commit-interval, each page is processed in the same thread.
Now, most of the time we need to scale the processing and write phases rather than the read phase; reads are generally fast enough to cover our scalability needs.
But as I said, if you really need this, you should go with the out-of-the-box paging reader or write your own thread-safe reader.
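For reference, a minimal sketch of such a thread-safe paging reader; the person table, its columns and the Person class are illustrative assumptions, not something from the question:

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

public class ReaderConfig {

    // Illustrative row type; BeanPropertyRowMapper populates it through its setters.
    public static class Person {
        private long id;
        private String firstName;
        public void setId(long id) { this.id = id; }
        public void setFirstName(String firstName) { this.firstName = firstName; }
    }

    public JdbcPagingItemReader<Person> pagingReader(DataSource dataSource) {
        return new JdbcPagingItemReaderBuilder<Person>()
                .name("personReader")
                .dataSource(dataSource)
                .selectClause("SELECT id, first_name")
                .fromClause("FROM person")
                .sortKeys(Map.of("id", Order.ASCENDING))
                .rowMapper(new BeanPropertyRowMapper<>(Person.class))
                .pageSize(100)   // align with the step's commit-interval so each page stays in one thread
                .build();
    }
}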
I'm trying to read a large (in GBs) file with JSON lines, do some 'processing' and write the result to another file.
I'll be using GSON streaming API for the purpose.
To speed up the processing, I'd like to multithread the 'processing' part.
I'm reading the file line by line since I can't load the whole file in memory. My 'processing' depends on two different lines (possibly thousands of lines apart) that meet certain conditions. Is it possible to multithread this 'processing' without loading the whole thing into memory?
Any suggestions on how to go about this ?
Well, a high-level design would be to have a reader thread, a writer thread, and an ExecutorService instance to do the processing.
The reader thread reads the JSON file using a streaming API [1]. When it has identified a unit of work to be performed, it creates a task, submits it to the executor service, and repeats.
The executor service processes the tasks it is given. You should use a service with a bounded thread pool, and possibly a bounded/blocking work queue.
The writer thread scans the Future objects created by task submission, and uses them to get the task results (in order), generate the output from the results and write the output to the file.
If the output file doesn't need to be in order, you could dispense with the writer thread [2] and have the tasks write to the file themselves. They will need to use a shared lock or mutex so that only one task is writing to the file at a time.
[1] - If you don't, then: 1) you need to be able to parse and hold the entire input file in memory, and 2) the reader thread won't be able to start submitting tasks until it has finished parsing the input.
[2] - Do this if it simplifies things, not for performance reasons. The need for mutual exclusion while writing kills any hypothetical performance benefits.
As @Thilo notes, there is little to be gained by trying to have multiple reader threads. (And a whole lot of complexity if you try!)
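A rough sketch of that design, under a few assumptions: a hypothetical process(String) stand-in for the real work, plain newline-delimited JSON read with a BufferedReader instead of the GSON streaming API, and placeholder pool/queue sizes and file names:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class JsonLinesPipeline {

    public static void main(String[] args) throws Exception {
        // Bounded pool and work queue keep the reader from racing too far ahead of the workers.
        ExecutorService pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // Futures are queued in submission order, so the writer preserves input order.
        LinkedBlockingQueue<Future<String>> results = new LinkedBlockingQueue<>(1000);

        Thread writer = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(Path.of("out.jsonl"))) {
                while (true) {
                    Future<String> f = results.take();
                    String line = f.get();
                    if (line == null) break;          // poison pill from the reader
                    out.write(line);
                    out.newLine();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // The reader thread's role is played by main here: read a unit of work, submit, repeat.
        try (BufferedReader in = Files.newBufferedReader(Path.of("in.jsonl"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String unit = line;
                results.put(pool.submit(() -> process(unit)));
            }
        }
        results.put(pool.submit(() -> (String) null));  // signal end of input to the writer
        writer.join();
        pool.shutdown();
    }

    // Placeholder for the real per-unit processing.
    private static String process(String jsonLine) {
        return jsonLine;
    }
}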
I think you'll have a single process reading from the file which adds workers (Runnable/Callable) to a queue. You then have a pool of threads which consumes from the queue and executes the workers in parallel.
See the Executors static factory methods, which can help create an ExecutorService.
I am trying to write a single huge file in Java using multiple threads.
I have tried both the FileWriter and BufferedWriter classes in Java.
The content being written is actually an entire table (Postgres) being read using CopyManager and written. Each line in the file is a single tuple from the table and I am writing 100s of lines at a time.
Approach to write:
The single to-be-written file is opened by multiple threads in append mode.
Each thread thereafter tries writing to the file.
Following are the issues I face:
Once in a while, the contents of the file get overwritten, i.e., one line remains incomplete and the next line starts right there. My assumption is that the writer's buffer fills up, which forces it to write the data to the file immediately; the data written may not be a complete line, and before the remainder can be written, another thread writes its content to the file.
While using FileWriter, once in a while I see a single blank line in the file.
Any suggestions, how to avoid this data integrity issue?
Shared Resource == Contention
Writing to a normal file is, by definition, a serialized operation. You gain no performance by trying to write to it from multiple threads; I/O is a finite, bounded resource with orders of magnitude less bandwidth than even the slowest or most overloaded CPU.
Concurrent access to a shared resource can be complicated (and slow)
If you have multiple threads that are doing expensive calculations, then you have options. If you are just using multiple threads because you think you are going to speed something up, you are going to do the opposite: contention for I/O always slows down access to the resource; it never speeds it up, because of the lock waits and other overhead.
You have to have a critical section that is protected and allows only a single writer at a time. Just look up the source code for any logging writer that supports concurrency and you will see that there is only a single thread that writes to the file.
If your application is primarily:
CPU Bound: You can use some locking mechanism or data structure to let only one thread out of many write to the file at a time. As a naive solution this is useless from a concurrency standpoint, but if these threads are CPU-bound and do little I/O it might work.
I/O Bound: This is the most common case. You should use a message-passing system with a queue of some sort: have all the threads post to the queue/buffer and have a single thread pull from it and write to the file. This will be the most scalable and easiest-to-implement solution.
Journaling - Async Writes
If you need to create a single super-large file where the order of writes is unimportant and the program is CPU bound, you can use a journaling technique.
Have each process write to a separate file and then concatenate the multiple files into a single large file at the end. This is a very old-school, low-tech solution that has worked well for decades.
Obviously, the more storage I/O bandwidth you have, the better the final concatenation will perform.
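A minimal sketch of the final concatenation step, assuming the worker threads have already written their journals to part-0.txt through part-3.txt (the file names and count are placeholders):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class JournalConcat {

    public static void main(String[] args) throws IOException {
        Path target = Path.of("huge-output.txt");

        // Append each per-thread journal file to the final file, in order.
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < 4; i++) {
                Path part = Path.of("part-" + i + ".txt");
                Files.copy(part, out);   // streams the whole part file into the output
            }
        }
    }
}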
I am trying to write a single huge file in Java using multiple threads.
I would recommend that you have X threads reading from the database and a single thread writing to your output file. This is going to be much easier to implement as opposed to doing file locking and the like.
You could use a shared BlockingQueue (maybe ArrayBlockingQueue) so the database readers would add(...) to the queue and your writer would be in a take() loop on the queue. When the readers finish, they could add some special IM_DONE string constant and as soon as the writing thread sees X of these constants (i.e. one for each reader), it would close the output file and exit.
So then you can use a single BufferedWriter without any locks and the like. Chances are that you will be blocked by the database calls instead of the local IO. Certainly the extra thread isn't going to slow you down at all.
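Something along these lines, where readFromDatabase(...) is a stand-in for the real CopyManager/JDBC read, and the queue size, reader count and output file name are placeholders:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueBackedWriter {

    private static final String IM_DONE = "IM_DONE";   // poison pill, one per reader
    private static final int READERS = 3;

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // Reader threads: pull rows from the DB and put them on the queue.
        for (int i = 0; i < READERS; i++) {
            final int readerId = i;
            new Thread(() -> {
                try {
                    for (String row : readFromDatabase(readerId)) {
                        queue.put(row);
                    }
                    queue.put(IM_DONE);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        // A single writer thread (here: main) owns the BufferedWriter, so no locking is needed.
        int finished = 0;
        try (BufferedWriter out = Files.newBufferedWriter(Path.of("output.txt"))) {
            while (finished < READERS) {
                String line = queue.take();
                if (IM_DONE.equals(line)) {
                    finished++;
                } else {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }

    // Placeholder for the real database read.
    private static List<String> readFromDatabase(int readerId) {
        return List.of("row-" + readerId + "-1", "row-" + readerId + "-2");
    }
}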
The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.
If you are adamant that your reading threads should also do the writing, then you should add a synchronized block around access to a single shared BufferedWriter; you could synchronize on the BufferedWriter object itself. Knowing when to close the writer is a bit of an issue, since each thread would have to know whether the others have exited. Each thread could increment a shared AtomicInteger when it starts and decrement it when it is done; the thread that sees the run-count reach 0 would be the one to close the writer.
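A compact sketch of that variant; the thread count and the fetchRows(...) helper are made-up placeholders:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedWriterExample {

    public static void main(String[] args) throws IOException {
        BufferedWriter writer = Files.newBufferedWriter(Path.of("output.txt"));
        AtomicInteger running = new AtomicInteger(3);

        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    for (String row : fetchRows(id)) {
                        synchronized (writer) {          // one writer at a time, so lines never interleave
                            writer.write(row);
                            writer.newLine();
                        }
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                } finally {
                    if (running.decrementAndGet() == 0) {   // the last thread out closes the writer
                        try {
                            writer.close();
                        } catch (IOException ignored) {
                        }
                    }
                }
            }).start();
        }
    }

    // Placeholder for the real per-thread database read.
    private static List<String> fetchRows(int id) {
        return List.of("row-" + id + "-a", "row-" + id + "-b");
    }
}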
Instead of synchronized methods, a better solution would be a thread pool with a single thread backed by a blocking queue. The messages the application wants to write are pushed onto the blocking queue, and the single writer thread keeps reading from the queue (blocking when it is empty) and writing to the single file.
So taking cues from this question Multithreaded access to file
My scenario: I have a spreadsheet component where multiple threads will access and write to each workbook. The component itself is not thread-safe, so am I correct in thinking that while one thread is writing to it, the other threads need to be blocked until the first one has finished writing? How would I go about achieving this when I am dealing with a non-thread-safe class? Put the writing calls in a synchronized block?
Another concern this raises: if one thread is busy writing long rows of data to its respective workbook, the other thread would have to stop dead in its tracks until the first one is finished, and this is not desirable.
Instead, I imagine a scenario where the threads run without blocking each other, and the writing to the spreadsheet is done by a middleman class that buffers and flushes the data onto the spreadsheet component without making the threads "wait" for their writes to complete.
Basically, each thread does two things on its own: 1) some long-running processing of data from its respective source, and 2) writing the processed data to the spreadsheet. I am looking for a concurrent solution where 1) never has to "wait" because of 2).
The best solution really depends on the types of operations that you're performing on the spreadsheet. For example, if one thread needs to read the value written by another thread, then it's probably necessary to lock either the whole spreadsheet or at least specific rows at a time. Since the spreadsheet itself isn't thread safe, you're correct that you'll need to do your own synchronization.
If it's important to serialize all access (which hurts performance, as it gets rid of parallelism), consider using a thread-safe queue, where each thread adds an object to the queue that represents the operation it wants to perform. Then you can have a worker thread pull items off of the queue (again, in a thread-safe manner, since the queue is thread-safe) and perform the operation.
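One simple way to realize that is a single-threaded executor, which is essentially a thread-safe queue plus one worker thread; the Workbook class and setCell method below are placeholders for whatever the real spreadsheet component exposes:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SpreadsheetGateway {

    // Placeholder for the real, non-thread-safe spreadsheet component.
    static class Workbook {
        void setCell(int row, int col, String value) { /* ... */ }
    }

    private final Workbook workbook = new Workbook();

    // One worker thread pulling operations off an internal queue,
    // so the non-thread-safe workbook is only ever touched from that thread.
    private final ExecutorService writerQueue = Executors.newSingleThreadExecutor();

    // Called by the processing threads: enqueue the operation and return immediately.
    public void writeCell(int row, int col, String value) {
        writerQueue.submit(() -> workbook.setCell(row, col, value));
    }

    public void shutdown() {
        writerQueue.shutdown();
    }
}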
There may be room here to parallelize the queue workers, since they can communicate with each other and do some row-based locking amongst themselves. For example, if the first operation is to read rows 1-4 and write to row 5, and the second operation is to read from rows 6-10 and write to row 11, then these should be able to execute in parallel. But be careful here, since it may depend on the underlying structure of the spreadsheet, which you say isn't thread-safe. Reads are probably fine to perform in parallel nonetheless.
While non-trivial, synchronizing access to a queue is the basic readers-writers problem, and while you have to make sure to avoid starvation as well as deadlocks, it's a lot easier to think about than random-access to a spreadsheet.
That said, the best solution would be to use a thread-safe spreadsheet, or only use one thread to ever access it. Why not use a database-backed spreadsheet and then have multiple threads reading/writing the database at once?
I'm trying to build a semi file-sharing program, where each computer acts both as a server and as a client.
I give multiple threads the option to download the file from my system.
Also, I've got a user interface that can receive a delete message.
My problem is that the minute a delete message is received, I want to wait for all the threads that are downloading the file to finish downloading, and ONLY then execute file.delete().
what is the best way to do it?
I thought about some database that holds the downloading threads, and iterating over it to check whether each thread is still active, but it seems clumsy. Is there a better way?
thanks
I think you can do this more simply than with a database. I would put a thin wrapper class around File, say a TrackedFile: it holds the file plus a count of how many readers currently have it. When you want to delete, just stop allowing new readers to grab the file and wait for the count to reach 0.
Since you are dealing with many threads accessing shared state, make sure you properly use java.util.concurrent.
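A minimal sketch of such a TrackedFile wrapper; it uses plain intrinsic locking for brevity (a java.util.concurrent Semaphore or Phaser would work just as well), and all the names are made up:

import java.io.File;

public class TrackedFile {

    private final File file;
    private int readers = 0;
    private boolean deleting = false;

    public TrackedFile(File file) {
        this.file = file;
    }

    // A downloader calls this before reading; returns false if a delete is pending.
    public synchronized boolean tryStartRead() {
        if (deleting) {
            return false;
        }
        readers++;
        return true;
    }

    // A downloader calls this when it has finished reading.
    public synchronized void finishRead() {
        readers--;
        if (readers == 0) {
            notifyAll();          // wake up a pending delete()
        }
    }

    // Called on a delete message: refuse new readers, wait for current ones, then delete.
    public synchronized void delete() throws InterruptedException {
        deleting = true;
        while (readers > 0) {
            wait();
        }
        file.delete();
    }
}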
I am not sure this addresses all your problems, but this is what I have in mind:
Assuming that all read/write/delete operations occur only from within the same application, a thread synchronization mechanism using locks can be useful.
For every new file that arrives, a new read/write lock can be created (see Java's ReentrantReadWriteLock). The read lock should be acquired for all read operations, while the write lock should be acquired for write/delete operations. Of course, once the lock is acquired you should check whether the operation is still meaningful (i.e. whether the file still exists).
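A sketch of how that could look for a single file; the download/delete method bodies are placeholders:

import java.io.File;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockedFile {

    private final File file;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public LockedFile(File file) {
        this.file = file;
    }

    // Many downloads can hold the read lock at the same time.
    public void download() {
        lock.readLock().lock();
        try {
            if (!file.exists()) {
                return;           // the file was already deleted
            }
            // ... stream the file to the client ...
        } finally {
            lock.readLock().unlock();
        }
    }

    // The delete waits until every in-flight download has released its read lock.
    public void delete() {
        lock.writeLock().lock();
        try {
            if (file.exists()) {
                file.delete();
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}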
Your delete-event-handling thread (probably your UI) will become unresponsive if it has to wait for all readers to finish. Instead, queue the delete and periodically poll for deletions that can be processed. You could use something like:
private class DeleteRunnable implements Runnable {
    public void run() {
        while (!done) {
            // Copy the list under the lock so we can wait on entries without holding it.
            ArrayList<DeletedObject> tmpList;
            synchronized (masterList) {
                tmpList = new ArrayList<DeletedObject>(masterList);
            }
            for (DeletedObject o : tmpList) {
                // Wait up to 500 ms for the readers of this entry to finish.
                if (o.waitForReaders(500, TimeUnit.MILLISECONDS)) {
                    // All readers are done: drop the entry from the pending list.
                    synchronized (masterList) {
                        masterList.remove(o);
                    }
                }
            }
        }
    }
}
If you restructure your design just slightly so that loading the file from disk and uploading the file to the client are not done in the same thread, you can wait for the file to stop being accessed simply by locking out new threads from reading the file, then iterating over all the threads currently reading it and doing a join() on each one, one at a time. As long as the file-reading threads terminate directly after loading the file, the iteration will finish the moment the last thread is no longer reading the file, and you are good to go.
The following paragraph is based on the assumption that you keep re-reading the file data multiple times, even if the reading threads are reading during the same general time frame, since that's what it sounds like you're doing.
Doing it this way, separating file-reading into separate threads, would also allow you to have a single thread loading a specific file and to have multiple client-uploads getting the data from the single reading pass over the file. There are several optimizations you could implement with this, depending on what type of project this is for. If you do, make sure you don't keep too much file data in memory, or the obvious will happen. But if you are guaranteed by the nature of your project that there will be few and/or small files that will not take up too much memory, this is a great side effect of separating file-loading into a separate thread.
If you go the join() route, you could use the join(milliseconds) variant if you want the deletion thread to wait only a certain period before demanding that the other threads stop (useful for huge files, or when many files are being accessed and the disk is slow), in case they haven't already finished. Take a timestamp of (now + theDurationYouWantToWait), call join(impatientTimestamp - currentTimestamp), and once currentTimestamp >= impatientTimestamp send an interrupt to all file-loading threads; have those threads check for the interrupt in the loop where they read file data, then re-join() the thread whose timed join() ran out and continue the join() iteration you were doing.
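A rough sketch of that impatient join, assuming the file-loading threads are collected in a list and check for interruption inside their read loop; the method and parameter names are made up:

import java.util.List;

public class ImpatientDelete {

    // Wait up to deadlineMs in total for all reader threads, then interrupt the stragglers.
    static void waitForReaders(List<Thread> readerThreads, long deadlineMs) throws InterruptedException {
        long impatientTimestamp = System.currentTimeMillis() + deadlineMs;

        for (Thread reader : readerThreads) {
            long remaining = impatientTimestamp - System.currentTimeMillis();
            if (remaining > 0) {
                reader.join(remaining);          // wait, but only for the time we have left
            }
            if (reader.isAlive()) {
                // Out of patience: tell every still-running reader to stop, then wait for this one.
                readerThreads.forEach(Thread::interrupt);
                reader.join();
            }
        }
        // All readers are gone; it is now safe to delete the file.
    }
}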