Writing a file using multiple threads - Java

I am trying to write a single huge file in Java using multiple threads.
I have tried both the FileWriter and BufferedWriter classes in Java.
The content being written is an entire table (Postgres) being read using CopyManager. Each line in the file is a single tuple from the table, and I am writing hundreds of lines at a time.
Approach to write:
The single to-be-written file is opened by multiple threads in append mode.
Each thread thereafter tries writing to the file.
Following are the issues I face:
Once in a while, the contents of the file get overwritten, i.e. one line remains incomplete and the next line starts right there. My assumption is that the writer's buffer fills up, forcing it to flush the data to the file immediately. The data written may not be a complete line, and before the writer can write the remainder, another thread writes its content to the file.
While using FileWriter, once in a while I see a single blank line in the file.
Any suggestions on how to avoid this data-integrity issue?

Shared Resource == Contention
Writing to a normal file is, by definition, a serialized operation. You gain no performance by writing to it from multiple threads; I/O is a finite, bounded resource with orders of magnitude less bandwidth than even the slowest or most overloaded CPU.
Concurrent access to a shared resource is complicated (and slow)
If you have multiple threads doing expensive calculations, then you have options; if you are using multiple threads just because you think they will speed something up, you will achieve the opposite. Contention for I/O always slows down access to the resource; it never speeds it up, because of lock waits and other overhead.
You need a protected critical section that allows only a single writer at a time. Look up the source code of any logging framework that supports concurrency and you will see that only a single thread writes to the file.
If your application is primarily:
CPU bound: You can use a locking mechanism/data structure to let only one of many threads write to the file at a time. As a naive solution this is useless from a concurrency standpoint, but if the threads are CPU bound with little I/O it can work.
I/O bound: This is the most common case. Use a message-passing system with a queue of some sort: have all the threads post to the queue/buffer and a single thread pull from it and write to the file. This is the most scalable and easiest-to-implement solution (see the sketch below).
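A minimal sketch of that queue-plus-single-writer design; the queue capacity, sentinel value, and file name are illustrative assumptions, not anything from the original post:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SingleWriter {
    private static final String POISON_PILL = "%%EOF%%";  // assumed sentinel value
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    // Any producer thread calls this; put() blocks when the queue is full,
    // which throttles producers to the writer's speed.
    public void post(String line) throws InterruptedException {
        queue.put(line);
    }

    public void shutdown() throws InterruptedException {
        queue.put(POISON_PILL);
    }

    // The one and only thread that touches the file.
    public void writeLoop(String fileName) throws IOException, InterruptedException {
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(fileName))) {
            for (String line = queue.take(); !line.equals(POISON_PILL); line = queue.take()) {
                out.write(line);
                out.newLine();
            }
        }
    }
}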
Journaling - Async Writes
If you need to create a single very large file where the order of writes is unimportant and the program is CPU bound, you can use a journaling technique.
Have each thread write to a separate file and then concatenate the files into a single large file at the end. This is a very old-school, low-tech solution that works well and has for decades.
Obviously, the more storage I/O bandwidth you have, the faster the final concatenation will be.
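A sketch of the final concatenation step using FileChannel.transferTo, which on many platforms avoids copying through user space; the file names are placeholders:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

public class Concat {
    // Appends each part file to the target, in order.
    static void concat(List<Path> parts, Path target) throws IOException {
        try (FileChannel out = FileChannel.open(target, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path part : parts) {
                try (FileChannel in = FileChannel.open(part, StandardOpenOption.READ)) {
                    long pos = 0, size = in.size();
                    while (pos < size) {  // transferTo may move fewer bytes than requested
                        pos += in.transferTo(pos, size - pos, out);
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        concat(Arrays.asList(Paths.get("part-0.txt"), Paths.get("part-1.txt")),
               Paths.get("huge.txt"));
    }
}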

I am trying to write a single huge file in Java using multiple threads.
I would recommend having X threads reading from the database and a single thread writing to your output file. This is going to be much easier to implement than file locking and the like.
You could use a shared BlockingQueue (maybe ArrayBlockingQueue) so the database readers would add(...) to the queue and your writer would sit in a take() loop on it. When the readers finish, each could add a special IM_DONE string constant, and as soon as the writing thread has seen X of these constants (i.e. one for each reader), it would close the output file and exit.
Then you can use a single BufferedWriter without any locks and the like. Chances are that you will be blocked by the database calls rather than the local I/O. Certainly the extra thread isn't going to slow you down at all.
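A hedged sketch of that shutdown protocol; the IM_DONE sentinel value, the reader count, and the file name are illustrative:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;

public class DbDumpWriter implements Runnable {
    static final String IM_DONE = "%%IM_DONE%%";  // assumed sentinel, added once per reader

    private final BlockingQueue<String> queue;
    private final int readerCount;                // X, the number of database readers
    private final String fileName;

    DbDumpWriter(BlockingQueue<String> queue, int readerCount, String fileName) {
        this.queue = queue;
        this.readerCount = readerCount;
        this.fileName = fileName;
    }

    @Override
    public void run() {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(fileName))) {
            int doneSeen = 0;
            while (doneSeen < readerCount) {      // exit once every reader has signalled
                String line = queue.take();
                if (IM_DONE.equals(line)) {
                    doneSeen++;
                } else {
                    out.write(line);
                    out.newLine();
                }
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}

Each reader would finish by putting DbDumpWriter.IM_DONE on the queue so the writer knows when to stop.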
The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.
If you are adamant that your reading threads also do the writing, then you should add a synchronized block around all access to a single shared BufferedWriter -- you could synchronize on the BufferedWriter object itself. Knowing when to close the writer is a bit of an issue, since each thread would have to know whether the others have exited. Each thread could increment a shared AtomicInteger when it starts and decrement it when it finishes; the thread that decrements the count to 0 would be the one to close the writer.
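If you do go that route, here is a minimal sketch of the synchronized-writer and last-one-out-closes logic; the class name and the database-read placeholder are illustrative:

import java.io.BufferedWriter;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedWriterTask implements Runnable {
    private final BufferedWriter writer;   // one instance shared by every thread
    private final AtomicInteger runCount;  // incremented once per task, before any task starts

    SharedWriterTask(BufferedWriter writer, AtomicInteger runCount) {
        this.writer = writer;
        this.runCount = runCount;
    }

    @Override
    public void run() {
        try {
            for (String line : readLinesFromDatabase()) {
                synchronized (writer) {            // only one thread writes at a time
                    writer.write(line);
                    writer.newLine();
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        } finally {
            if (runCount.decrementAndGet() == 0) { // last thread out closes the writer
                try { writer.close(); } catch (IOException ignored) { }
            }
        }
    }

    private Iterable<String> readLinesFromDatabase() {
        return java.util.Collections.emptyList(); // placeholder for the real DB read
    }
}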

Instead of synchronized methods, a better solution would be a thread pool with a single thread backed by a blocking queue. Messages the application wants to write are pushed onto the blocking queue. The log-writer thread continuously reads from the blocking queue (blocking when the queue is empty) and writes each message to the single file.
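A sketch of that design using Executors.newSingleThreadExecutor, which is backed internally by an unbounded blocking queue; the class and file names are illustrative:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LogWriter implements AutoCloseable {
    private final ExecutorService writerThread = Executors.newSingleThreadExecutor();
    private final BufferedWriter out;

    public LogWriter(String fileName) throws IOException {
        this.out = Files.newBufferedWriter(Paths.get(fileName));
    }

    // Callable from any thread; the write itself always runs on the single executor thread.
    public void write(String message) {
        writerThread.submit(() -> {
            try {
                out.write(message);
                out.newLine();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }

    @Override
    public void close() throws IOException {
        writerThread.shutdown();   // already-submitted writes still run
        try {
            writerThread.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        out.close();
    }
}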

Related

Read large file and process by multithreading

I'm trying to read a large (in GBs) file with JSON lines, do some 'processing' and write the result to another file.
I'll be using GSON streaming API for the purpose.
To speed up the processing, I'd like to multithread the 'processing' part.
I'm reading the file line by line since I can't load the whole file into memory. My 'processing' depends on two different lines (possibly thousands of lines apart) that meet certain conditions. Is it possible to multithread this 'processing' without loading the whole thing into memory?
Any suggestions on how to go about this ?
Well, a high-level design would be to have a reader thread, a writer thread, and an ExecutorService instance to do the processing.
The reader thread reads the JSON file using a streaming API [1]. When it has identified a unit of work to be performed, it creates a task, submits it to the executor service, and repeats.
The executor service processes the tasks it is given. You should use a service with a bounded thread pool, and possibly a bounded/blocking work queue.
The writer thread scans the Future objects created by task submission and uses them to get the task results (in order), generate the output from the results, and write the output to the file.
If the output file doesn't need to be in order, you could dispense with the writer thread [2] and have the tasks write to the file themselves. They will need to use a shared lock or mutex so that only one task is writing to the file at a time.
[1] - If you don't, then: 1) you need to be able to parse and hold the entire input file in memory, and 2) the reader thread won't be able to start submitting tasks until it has finished parsing the input.
[2] - Do this if it simplifies things, not for performance reasons. The need for mutual exclusion while writing kills any hypothetical performance benefit.
As @Thilo notes, there is little to be gained by trying to have multiple reader threads. (And a whole lot of complexity if you try!)
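A rough sketch of that reader/executor/writer design, assuming each unit of work has already been extracted as a string; the pool size, queue capacity, and process() body are placeholders:

import java.io.Writer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Pipeline {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);  // bounded pool
    private final BlockingQueue<Future<String>> results = new ArrayBlockingQueue<>(100);

    // Reader thread: parse with the streaming API and submit each unit of work.
    // put() blocks when 100 results are pending, which bounds memory use.
    void onUnitOfWork(String unit) throws InterruptedException {
        results.put(pool.submit(() -> process(unit)));
    }

    // Writer thread: consume results in submission order and write them out.
    // (Loop termination is omitted for brevity.)
    void writeLoop(Writer out) throws Exception {
        while (true) {
            Future<String> f = results.take();
            out.write(f.get());  // get() blocks until this particular task completes
        }
    }

    private String process(String unit) {
        return unit.toUpperCase();  // placeholder transformation
    }
}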
I think you'll have a single thread reading from the file, which adds workers (Runnable/Callable) to a queue. You then have a pool of threads that consumes from the queue and executes the workers in parallel.
See the Executors static factory methods, which can help create an ExecutorService.

Concurrent Reads with a Java MappedByteBuffer

I'm trying to use a MappedByteBuffer to allow concurrent reads on a file by multiple threads with the following constraints:
File is too large to load into memory
Threads must be able to read asynchronously (it's a web app)
The file is never written to by any thread
Every thread will always know the exact offset and length of bytes it needs to read (i.e. no "seeking" by the app itself).
According to the docs (https://docs.oracle.com/javase/8/docs/api/java/nio/Buffer.html) Buffers are not thread-safe since they keep internal state (position, etc). Is there a way to have concurrent random access to the file without loading it all into memory?
Although FileChannel is technically thread-safe, from the docs:
Where the file channel is obtained from an existing stream or random access file then the state of the file channel is intimately connected to that of the object whose getChannel method returned the channel. Changing the channel's position, whether explicitly or by reading or writing bytes, will change the file position of the originating object, and vice versa
So it would seem that it's simply synchronized. If I were to call new RandomAccessFile().getChannel().map() in each thread [edit: on every read], doesn't that incur the I/O overhead with each read that MappedByteBuffers are supposed to avoid?
Rather than using multiple threads for concurrent reads, I'd go with this approach (based on an example with a huge CSV file whose lines have to be sent concurrently via HTTP):
Reading a single file at multiple positions concurrently wouldn't let you go any faster (but it could slow you down considerably).
Instead of reading the file from multiple threads, read the file from a single thread, and parallelize the processing of these lines. A single thread should read your CSV line-by-line, and put each line in a queue. Multiple working threads should then take the next line from the queue, parse it, convert to a request, and process the request concurrently as needed. The splitting of the work would then be done by a single thread, ensuring that there are no missing lines or overlaps.
If you can read the file line by line, LineIterator from Commons IO is a memory-efficient possibility. If you have to work with chunks, your MappedByteBuffer seems to be a reasonable approach. For the queue, I'd use a blocking queue with a fixed capacity—such as ArrayBlockingQueue—to better control the memory usage (lines/chunks in queue + lines/chunks among workers = lines/chunks in memory).
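A small sketch of that single-reader, bounded-queue pattern using LineIterator; the queue capacity, worker count, and file name are assumptions:

import java.io.File;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class CsvFeeder {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);  // bounds memory use

        // Worker threads take lines and process them (send the HTTP requests, etc.).
        for (int i = 0; i < 4; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        // process(line);  placeholder for the real work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        // A single reader thread (here: main) feeds the queue line by line.
        LineIterator it = FileUtils.lineIterator(new File("huge.csv"), "UTF-8");
        try {
            while (it.hasNext()) {
                queue.put(it.nextLine());  // blocks when the workers fall behind
            }
        } finally {
            it.close();
        }
    }
}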
FileChannel supports a read operation at an explicit position without synchronization. On Linux it natively uses pread:
public abstract int read(ByteBuffer dst, long position) throws IOException
Here is what the FileChannel documentation says:
...Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.
It is fairly primitive in that it only returns the number of bytes that were read (see the method's Javadoc for details). But I think you can still make use of it, given your assumption that "every thread will always know the exact offset and length of bytes it needs to read".
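A sketch of a helper built on that positional read; it loops because read(ByteBuffer, long) may return fewer bytes than requested, and the path is a placeholder:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PositionalReader {
    private final FileChannel channel;  // shareable: positional reads do not touch the channel's position

    public PositionalReader(String path) throws IOException {
        this.channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ);
    }

    // Callable concurrently from many threads, each with its own offset and length.
    public byte[] read(long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        long pos = offset;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos);  // pread-style: no shared state mutated
            if (n < 0) {
                throw new IOException("EOF before " + length + " bytes could be read");
            }
            pos += n;
        }
        return buf.array();
    }
}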


Deleting a file in Java while uploading it in another thread

I'm trying to build a semi file-sharing program, where each computer acts both as a server and as a client.
I give multiple threads the option to download a file from my system.
Also, I've got a user interface that can receive a delete message.
My problem is that the minute a delete message is received, I want to wait for all the threads that are downloading the file to finish, and only then execute file.delete().
What is the best way to do it?
I thought about some database that holds the mapping from each file to the threads downloading it, and iterating to check whether each thread is active, but that seems clumsy. Is there a better way?
Thanks
I think you can do this more simply than using a database. I would put a thin wrapper class around File: a TrackedFile. It has the File inside and a count of how many people are reading it. When you want to delete, stop allowing new people to grab the file, and wait for the count to get to 0.
Since you are dealing with many threads accessing shared state, make sure you use java.util.concurrent properly.
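A minimal sketch of such a TrackedFile, assuming downloaders call acquire() before reading and release() afterwards; all names here are made up for illustration:

import java.io.File;

public class TrackedFile {
    private final File file;
    private int readers = 0;
    private boolean deletePending = false;

    public TrackedFile(File file) {
        this.file = file;
    }

    // Called by a downloader before reading; refuses once a delete is pending.
    public synchronized boolean acquire() {
        if (deletePending) {
            return false;
        }
        readers++;
        return true;
    }

    // Called by a downloader when its read has finished.
    public synchronized void release() {
        readers--;
        if (readers == 0) {
            notifyAll();  // wake a waiting deleter
        }
    }

    // Blocks new readers, waits for the current ones to drain, then deletes.
    public synchronized void delete() throws InterruptedException {
        deletePending = true;
        while (readers > 0) {
            wait();
        }
        file.delete();
    }
}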
I am not sure this addresses all your problems, but this is what I have in mind:
Assuming that all read/write/delete operations occur only from within the same application, a thread-synchronization mechanism using locks can be useful.
For every new file that arrives, a new read/write lock can be created (see Java's ReentrantReadWriteLock). The read lock should be acquired for all read operations, while the write lock should be acquired for write/delete operations. Of course, when the lock is acquired you should check whether the operation is still meaningful (i.e. whether the file still exists).
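A sketch of that per-file locking scheme; the map of locks and the existence re-check are illustrative:

import java.io.File;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FileLocks {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String name) {
        return locks.computeIfAbsent(name, k -> new ReentrantReadWriteLock());
    }

    // Downloads hold the read lock, so many can run at once.
    public void download(String name) {
        ReentrantReadWriteLock lock = lockFor(name);
        lock.readLock().lock();
        try {
            File f = new File(name);
            if (!f.exists()) {
                return;  // re-check: the file may have been deleted in the meantime
            }
            // ... stream the file to the client ...
        } finally {
            lock.readLock().unlock();
        }
    }

    // Delete takes the write lock, which waits until all readers have released.
    public void delete(String name) {
        ReentrantReadWriteLock lock = lockFor(name);
        lock.writeLock().lock();
        try {
            new File(name).delete();
        } finally {
            lock.writeLock().unlock();
        }
    }
}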
Your delete-event-handling thread (probably your UI) will become unresponsive if it has to wait for all readers to finish. Instead, queue the delete and periodically poll for deletions that can be processed. You can use:
// assumes imports of java.util.ArrayList and java.util.concurrent.TimeUnit
private class DeleteRunnable implements Runnable {
    public void run() {
        while (!done) {
            // Snapshot the pending deletions so the lock is not held while waiting.
            ArrayList<DeletedObject> tmpList;
            synchronized (masterList) {
                tmpList = new ArrayList<>(masterList);
            }
            for (DeletedObject o : tmpList) {
                // Wait up to 500 ms for this object's readers to finish; if they
                // did, the deletion has been processed and can be unregistered.
                if (o.waitForReaders(500, TimeUnit.MILLISECONDS)) {
                    synchronized (masterList) {
                        masterList.remove(o);
                    }
                }
            }
        }
    }
}
If you restructure your design slightly so that loading the file from disk and uploading it to the client are not done in the same thread, you can wait for the file to stop being accessed simply by locking out new threads from reading the file, then iterating over all of the threads reading from it and calling join() on each one in turn. As long as the file-reading threads terminate directly after loading the file, the iteration finishes the moment the last thread stops reading the file, and you are good to go.
The following paragraph is based on the assumption that you keep re-reading the file data multiple times, even if the reading threads are reading during the same general time frame, since that's what it sounds like you're doing.
Separating file reading into its own threads this way would also allow a single thread to load a specific file and multiple client uploads to get the data from that single reading pass over the file. There are several optimizations you could implement with this, depending on what type of project this is. If you do, make sure you don't keep too much file data in memory, or the obvious will happen. But if the nature of your project guarantees few and/or small files that will not take up too much memory, this is a great side effect of separating file loading into a separate thread.
If you go the join() route, you can use the join(milliseconds) variant if you want the deletion thread to wait only a certain period before demanding that the other threads stop (for huge files, or times when many files are being accessed and the disk is slow). Take a timestamp of (now + theDurationYouWantToWait), call join(impatientTimestamp - currentTimestamp), and send an interrupt to all file-loading threads once currentTimestamp >= impatientTimestamp; have the file-loading threads check for the interrupt in the loop where they read file data, then re-join() the thread whose join(milliseconds) timed out and continue the joining iteration.
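A fragment sketching that impatient join-then-interrupt loop; readers, maxWaitMillis, and file are assumed to exist in the surrounding code, and InterruptedException handling is omitted:

// assumes: List<Thread> readers; long maxWaitMillis; File file
long deadline = System.currentTimeMillis() + maxWaitMillis;
for (Thread reader : readers) {
    long remaining = deadline - System.currentTimeMillis();
    if (remaining > 0) {
        reader.join(remaining);  // wait, but only until the shared deadline
    }
    if (reader.isAlive()) {
        reader.interrupt();      // past the deadline: ask the reader to stop
    }
}
// every reader has now been joined or interrupted; join again so all are really gone
for (Thread reader : readers) {
    reader.join();
}
file.delete();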

BufferedReader in a multi-threaded environment

How can multiple threads read from a BufferedReader simultaneously?
Well, you won't be able to have them actually perform reads simultaneously. However, you could:
Synchronize all the reads on one lock, so that only one thread tries to read at a time, but they can all read eventually
Have one thread just reading, and make it populate a thread-safe queue of some sort (see java.util.concurrent for various options) which the other threads fetch items from.
Are you wanting to read lines at a time, or arbitrary blocks of characters?
If all threads are to read all lines from the file, then you should create a separate BufferedReader for each thread. If each thread processes one line at a time (and the order of lines doesn't matter), then you should probably use the producer/consumer model, where only one thread actually reads from the file and places the work in a BlockingQueue, while the other threads periodically remove work items and process them. Note that you can reduce locking overhead by reading N lines into a list and placing the list in the blocking queue, instead of placing each individual line directly in the queue, since that allows multiple lines to be read/extracted with a single synchronization operation. Placing and removing every line individually into/out of the queue will be very inefficient, especially if processing each one is fairly quick.
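A fragment sketching that batching idea; the batch size of 100, queue capacity of 50, and file name are arbitrary assumptions, and exception handling is omitted:

// assumes imports from java.io, java.util and java.util.concurrent
BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(50);

// Producer: a single thread reads the file and enqueues batches of 100 lines.
try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
    List<String> batch = new ArrayList<>(100);
    String line;
    while ((line = in.readLine()) != null) {
        batch.add(line);
        if (batch.size() == 100) {  // one queue operation per 100 lines
            queue.put(batch);
            batch = new ArrayList<>(100);
        }
    }
    if (!batch.isEmpty()) {
        queue.put(batch);           // flush the final partial batch
    }
}

// Each consumer thread: take a whole batch, then process it without further locking.
List<String> work = queue.take();
for (String l : work) {
    // process(l);  placeholder
}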
