Java - Multithreading and Files Question

I have one text file that needs to be read by two threads, but I need the reading to happen sequentially. Example: Thread 1 gets the lock and reads the first line, then the lock is released. Thread 2 gets the lock and reads line 2, and so on.
I was thinking of sharing the same BufferedReader or something like that, but I'm not so sure about it.
Thanks in advance!
EDITED
There will be 2 classes, each with its own thread. Both classes will read the same file.

You can lock the BufferedReader as you say.
I would warn you that the performance is likely to be worse than using just one thread. However, you can do it as an exercise.
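A minimal sketch of that approach, assuming both threads share a single BufferedReader and use the reader itself as the lock (the file name and the process method are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SharedReaderTask implements Runnable {
    private final BufferedReader reader; // the one reader both threads share

    public SharedReaderTask(BufferedReader reader) {
        this.reader = reader;
    }

    public void run() {
        try {
            while (true) {
                String line;
                // Only one thread at a time may call readLine()
                synchronized (reader) {
                    line = reader.readLine();
                }
                if (line == null) {
                    break; // end of file
                }
                process(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void process(String line) {
        System.out.println(Thread.currentThread().getName() + ": " + line);
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("input.txt")); // placeholder path
        SharedReaderTask task = new SharedReaderTask(reader);
        new Thread(task).start();
        new Thread(task).start();
    }
}

Note that this only guarantees each line is read by exactly one thread; it does not force a strict thread-1/thread-2 alternation, since whichever thread wins the lock reads the next line.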

It would probably be more performant to read the file line by line in one thread and pass the resulting lines to a thread pool via a queue such as ConcurrentLinkedQueue, if you want to guarantee at least the order in which processing of the file's lines starts. That is much simpler to implement, and there is no contention on whatever class you use to read the file.
Unless there's some cast-iron reason why you need the reading to happen local to each thread, I'd avoid sharing the file like this.
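A sketch of that single-reader design, using a bounded LinkedBlockingQueue rather than ConcurrentLinkedQueue so the workers can block instead of spin-polling; the file name, worker count, and process method are assumptions:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class LineDispatcher {
    // A distinct object compared by reference, so a real line containing "EOF" can't be mistaken for it
    private static final String POISON = new String("EOF");

    public static void main(String[] args) throws IOException, InterruptedException {
        int workerCount = 2;
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024); // bounded: the reader can't outrun the workers
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);

        for (int i = 0; i < workerCount; i++) {
            workers.submit(() -> {
                try {
                    for (String line = queue.take(); line != POISON; line = queue.take()) {
                        process(line);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single reader (here, the main thread): lines enter the queue in file order.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line);
            }
        }
        for (int i = 0; i < workerCount; i++) {
            queue.put(POISON); // one sentinel per worker so every worker exits
        }
        workers.shutdown();
    }

    private static void process(String line) {
        System.out.println(Thread.currentThread().getName() + ": " + line);
    }
}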

Related

Fastest way to write file with multiple threads: FileChannel vs multiple RandomAccessFiles [duplicate]

I am trying to write a single huge file in Java using multiple threads.
I have tried both the FileWriter and BufferedWriter classes in Java.
The content being written is actually an entire table (Postgres) being read using CopyManager and written. Each line in the file is a single tuple from the table and I am writing 100s of lines at a time.
Approach to write:
The single to-be-written file is opened by multiple threads in append mode.
Each thread thereafter tries writing to the file.
Following are the issues I face:
Once in a while, the contents of the file get overwritten, i.e. one line remains incomplete and the next line starts right there. My assumption here is that the buffers for the writer are getting full. This forces the writer to immediately write the data to the file. The data written may not be a complete line, and before the writer can write the remainder, the next thread writes its content to the file.
While using FileWriter, once in a while I see a single blank line in the file.
Any suggestions on how to avoid this data integrity issue?
Shared Resource == Contention
Writing to a normal file is by definition a serialized operation. You gain no performance by trying to write to it from multiple threads; I/O is a finite, bounded resource with orders of magnitude less bandwidth than even the slowest or most overloaded CPU.
Concurrent access to a shared resource can be complicated (and slow)
If you have multiple threads doing expensive calculations, then you have options. If you are just using multiple threads because you think you are going to speed something up, you are going to do the opposite. Contention for I/O always slows down access to the resource; it never speeds it up, because of the lock waits and other overhead.
You have to have a critical section that is protected and allows only a single writer at a time. Just look up the source code for any logging writer that supports concurrency and you will see that there is only a single thread that writes to the file.
If your application is primarily:
CPU Bound: You can use some locking mechanism/data structure to let only one thread out of many write to the file at a time. As a naive solution this is useless from a concurrency standpoint, but if the threads are CPU bound with little I/O it might work.
I/O Bound: This is the most common case. You should use a message-passing system with a queue of some sort, have all the threads post to the queue/buffer, and have a single thread pull from it and write to the file. This will be the most scalable and easiest-to-implement solution.
Journaling - Async Writes
If you need to create a single very large file where the order of writes is unimportant and the program is CPU bound, you can use a journaling technique.
Have each thread write to a separate file, then concatenate the files into a single large file at the end. This is a very old-school, low-tech solution that works well and has for decades.
Obviously, the more storage I/O bandwidth you have, the better the final concatenation will perform.
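A sketch of the final concatenation step, assuming each worker thread has already written its own part file (all paths are placeholders):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ConcatParts {
    public static void main(String[] args) throws IOException {
        // Part files produced by the individual worker threads, in the desired order
        List<Path> parts = List.of(Paths.get("part-0.txt"), Paths.get("part-1.txt"), Paths.get("part-2.txt"));
        try (OutputStream out = Files.newOutputStream(Paths.get("merged.txt"))) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the merged file
            }
        }
    }
}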
I am trying to write a single huge file in Java using multiple threads.
I would recommend that you have X threads reading from the database and a single thread writing to your output file. This is going to be much easier to implement as opposed to doing file locking and the like.
You could use a shared BlockingQueue (maybe ArrayBlockingQueue) so the database readers would add(...) to the queue and your writer would be in a take() loop on the queue. When the readers finish, they could add some special IM_DONE string constant and as soon as the writing thread sees X of these constants (i.e. one for each reader), it would close the output file and exit.
So then you can use a single BufferedWriter without any locks and the like. Chances are that you will be blocked by the database calls instead of the local IO. Certainly the extra thread isn't going to slow you down at all.
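A minimal sketch of that arrangement; the reader count, queue capacity, output path, and the database-reading step are assumptions:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SingleWriterFile {
    private static final String IM_DONE = "IM_DONE"; // sentinel posted by each reader when it finishes

    public static void main(String[] args) throws Exception {
        int readerCount = 4; // the X database-reading threads
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10000);

        for (int i = 0; i < readerCount; i++) {
            new Thread(() -> {
                try {
                    // ... read tuples from the database and queue.put(...) each line ...
                    queue.put(IM_DONE); // signal that this reader is finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        // The only thread that ever touches the file: no locks needed.
        int done = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("output.txt"))) {
            while (done < readerCount) {
                String line = queue.take();
                if (line.equals(IM_DONE)) {
                    done++; // one sentinel per reader; stop once all have reported in
                } else {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}

In a real application you might prefer a distinct sentinel object compared by reference, so a genuine data line can never collide with the marker string.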
The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.
If you are adamant that your reading threads should also do the writing, then you should add a synchronized block around all access to a single shared BufferedWriter -- you could synchronize on the BufferedWriter object itself. Knowing when to close the writer is a bit of an issue, since each thread would have to know whether the others have exited. Each thread could increment a shared AtomicInteger when it starts and decrement it when it is done. Then the thread that decrements the run-count to 0 would be the one to close the writer.
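A sketch of that bookkeeping; the shared writer, the counter, and fetchLines are assumptions. The launcher would increment the counter once per worker before starting it, so the count can never hit 0 prematurely:

import java.io.BufferedWriter;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class WritingWorker implements Runnable {
    private final BufferedWriter writer;  // shared by all workers
    private final AtomicInteger running;  // incremented by the launcher before each worker starts

    public WritingWorker(BufferedWriter writer, AtomicInteger running) {
        this.writer = writer;
        this.running = running;
    }

    public void run() {
        try {
            for (String line : fetchLines()) {
                synchronized (writer) { // one writer at a time keeps each line intact
                    writer.write(line);
                    writer.newLine();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (running.decrementAndGet() == 0) {
                try {
                    writer.close(); // the last worker out closes the writer
                } catch (IOException ignored) {
                }
            }
        }
    }

    private Iterable<String> fetchLines() {
        return java.util.List.of(); // placeholder for this worker's share of the database read
    }
}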
Instead of using synchronized methods, a better solution would be a thread pool with a single thread, backed by a blocking queue. The messages the application wants to write are pushed onto the blocking queue; the writer thread keeps taking from the queue (blocking when it is empty) and writing them to the single file.
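In Java that is essentially Executors.newSingleThreadExecutor(), which is internally backed by a LinkedBlockingQueue; a sketch, with the output path assumed:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class QueuedLogWriter {
    private final ExecutorService writerThread = Executors.newSingleThreadExecutor();
    private final BufferedWriter writer;

    public QueuedLogWriter(String path) throws IOException {
        this.writer = Files.newBufferedWriter(Paths.get(path));
    }

    // Callable from any thread; the actual write always happens on the single writer thread
    public void log(String message) {
        writerThread.submit(() -> {
            try {
                writer.write(message);
                writer.newLine();
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
    }

    public void close() throws IOException, InterruptedException {
        writerThread.shutdown();                            // stop accepting new messages
        writerThread.awaitTermination(1, TimeUnit.MINUTES); // drain what's already queued
        writer.close();
    }
}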

Multi-threaded access to the same text file

I have a huge line-separated text file and I want to make some calculations on each line. I need to make a multithreaded program to process it, because it is the processing of each line that takes the most time rather than reading it (the bottleneck lies in the CPU processing, rather than the I/O).
There are two options I came up with:
1) Open the file from the main thread, create a lock on the file handle, pass the file handle around to the worker threads, and then let each worker read-access the file directly
2) Create a producer/consumer setup where only the main thread has direct read-access to the file and feeds lines to each worker thread using a shared queue
Things to know:
I am really interested in speed performance for this task
Each line is independent
I am working in C++, but I guess the issue here is somewhat language-independent
Which option would you choose and why?
I would suggest the second option, since it is cleaner design-wise and less complicated than the first. The first option is less scalable and requires additional communication among the threads in order to synchronize their progress through the file's lines. In the second option you have one dispatcher which deals with the I/O and hands the worker threads their computations, and each computational thread is completely independent of the others, which lets you scale. Moreover, the second option separates your logic more cleanly.
If we are talking about a massively large file which needs to be processed on a large cluster, MapReduce is probably the best solution.
The framework allows you great scalability, and already handles all the dirty work of managing the workers and tolerating failures for you.
The framework is specifically designed to receive files read from a file system [originally GFS] as input.
Note that there is an open source implementation of map-reduce: Apache Hadoop
If each line is really independent and processing is much slower than reading the file, what you can do is read all the data at once and store it in an array, such that each element of the array is one line.
Then all your threads can do the processing in parallel. For example, if you have 200 lines and 4 threads, each thread performs the calculation on 50 lines. Moreover, since this method is embarrassingly parallel, you could easily use OpenMP for it.
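The asker is working in C++, but in Java terms (the language of the rest of this page) the same chunking idea looks roughly like this sketch; the file name and the per-line work are placeholders:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedProcessor {
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get("input.txt")); // whole file in memory
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        int chunk = (lines.size() + threads - 1) / threads; // ceiling division
        for (int t = 0; t < threads; t++) {
            int from = t * chunk;
            int to = Math.min(from + chunk, lines.size());
            if (from >= to) {
                break;
            }
            List<String> slice = lines.subList(from, to);
            pool.submit(() -> slice.forEach(ChunkedProcessor::process));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void process(String line) {
        // placeholder for the expensive per-line calculation
    }
}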
I would suggest the second option, because it is definitely better design-wise and would allow you to have better control over the work the worker threads are doing.
Moreover, that should also increase performance, since of the two options you described it requires the minimum of inter-thread communication.
Another option is to memory-map the file and maintain a shared structure that properly handles mutual exclusion between the threads.

deleting a file in java while uploading it in other thread

I'm trying to build a semi file-sharing program, where each computer acts both as a server and as a client.
I give multiple threads the option to download a file from my system.
Also, I've got a user interface that can receive a delete message.
My problem is that the minute a delete message is received, I want to wait for all the threads that are downloading the file to finish downloading, and only then execute file.delete().
What is the best way to do it?
I thought about some database that holds <file, downloading threads> records, iterating over it to check whether each thread is still active, but it seems clumsy. Is there a better way?
Thanks
I think you can do this more simply than using a database. I would put a thin wrapper class around File: a TrackedFile. It holds the File inside, plus a count of how many people are reading it. When you want to delete, just stop allowing new people to grab the file, and wait for the count to get to 0.
Since you are dealing with many threads accessing shared state, make sure you properly use java.util.concurrent.
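A sketch of such a wrapper; the class name, polling interval, and deletion policy are assumptions:

import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

public class TrackedFile {
    private final File file;
    private final AtomicInteger readers = new AtomicInteger();
    private volatile boolean deleting = false; // once set, no new readers may start

    public TrackedFile(File file) {
        this.file = file;
    }

    // Returns false if a delete is pending; otherwise registers a reader
    public boolean tryStartRead() {
        if (deleting) {
            return false;
        }
        readers.incrementAndGet();
        if (deleting) { // re-check: a delete may have started in between
            readers.decrementAndGet();
            return false;
        }
        return true;
    }

    public void finishRead() {
        readers.decrementAndGet();
    }

    // Block new readers, wait for the current ones to drain, then delete
    public void delete() throws InterruptedException {
        deleting = true;
        while (readers.get() > 0) {
            Thread.sleep(50); // simple poll; a Condition would avoid the busy-wait
        }
        file.delete();
    }
}

Readers call tryStartRead() before serving the file and finishRead() in a finally block afterwards.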
I am not sure this addresses all your problems, but this is what I have in mind:
Assuming that all read/write/delete operations occur only from within the same application, a thread synchronization mechanism using locks can be useful.
For every new file that arrives, a new read/write lock can be created (See Java's ReentrantReadWriteLock). The read lock should be acquired for all read operations, while the write lock should be acquired for write/delete operations. Of course, when the lock is acquired you should check whether the operation is still meaningful (i.e. whether the file still exists).
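A sketch of that per-file locking; the class and method names are placeholders:

import java.io.File;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockedFileAccess {
    private final File file;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public LockedFileAccess(File file) {
        this.file = file;
    }

    public void upload() {
        lock.readLock().lock(); // many concurrent uploads may hold read locks at once
        try {
            if (!file.exists()) {
                return; // the operation is no longer meaningful
            }
            // ... stream the file to the downloading peer ...
        } finally {
            lock.readLock().unlock();
        }
    }

    public void delete() {
        lock.writeLock().lock(); // blocks until every read lock has been released
        try {
            if (file.exists()) {
                file.delete();
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}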
Your delete-event handling thread (probably your UI) will become unresponsive if you have to wait for all readers to finish. Instead, queue the delete and periodically poll for deletions that can now be processed. You can use:
private class DeleteRunnable implements Runnable {
    public void run() {
        while (!done) {
            // Snapshot the master list so we don't hold its lock while waiting
            ArrayList<DeletedObject> tmpList;
            synchronized (masterList) {
                tmpList = new ArrayList<DeletedObject>(masterList);
            }
            for (DeletedObject o : tmpList) {
                // waitForReaders returns true once no reader still holds the file
                if (o.waitForReaders(500, TimeUnit.MILLISECONDS)) {
                    synchronized (masterList) {
                        masterList.remove(o);
                    }
                }
            }
        }
    }
}
If you were to restructure your design just slightly, so that loading the file from disk and uploading it to the client were not done in the same thread, you could wait for the file to stop being accessed simply by locking new threads out of reading the file, then iterating over all of the threads currently reading it and doing a join() on each one, one at a time. As long as the file-reading threads terminate directly after loading the file, the iteration finishes the moment the last thread is no longer reading the file, and you are good to go.
The following paragraph is based on the assumption that you keep re-reading the file data multiple times, even when the reading threads are reading during the same general time frame, since that's what it sounds like you're doing.
Separating file-reading into its own threads this way would also allow you to have a single thread load a specific file and multiple client-uploads take the data from that single reading pass over the file. There are several optimizations you could implement with this, depending on what type of project it is. If you do, make sure you don't keep too much file data in memory, or the obvious will happen. But if the nature of your project guarantees few and/or small files that will not take up too much memory, this is a great side effect of separating file-loading into a separate thread.
If you go the join() route, you can use the join(milliseconds) variant if you want the deletion thread to be able to wait a certain period and then demand that the other threads stop (for huge files, or times when so many files are being accessed that the disk is slow), if they haven't finished already. Take a timestamp of (now + theDurationYouWantToWait) and call join(impatientTimestamp - currentTimestamp); if currentTimestamp >= impatientTimestamp in the middle of the loop, send an interrupt to all file-loading threads. Have the file-loading threads check for it in the loop where they read file data, then re-join() the thread whose join(milliseconds) was cut short and continue the join() iteration you were doing.
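A condensed sketch of that deadline-then-interrupt loop; fileReaders and maxWaitMillis are assumed to come from the surrounding code:

import java.util.List;

public class DeadlineJoin {
    // Wait up to maxWaitMillis for all reader threads, then interrupt any stragglers
    static void waitForReaders(List<Thread> fileReaders, long maxWaitMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        for (Thread t : fileReaders) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining > 0) {
                t.join(remaining); // bounded wait for this reader
            }
            if (t.isAlive()) {
                t.interrupt(); // past the deadline: ask the reader to stop
            }
        }
        for (Thread t : fileReaders) {
            t.join(); // wait for the interrupted readers to wind down
        }
    }
}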

BufferedReader in a multi-threaded environment

How can I read from a BufferedReader simultaneously from multiple threads?
Well, you won't be able to have them actually simultaneously performing a read. However, you could:
Synchronize all the reads on one lock, so that only one thread tries to read at a time, but they can all read eventually
Have one thread just reading, and make it populate a thread-safe queue of some sort (see java.util.concurrent for various options) which the other threads fetch items from.
Are you wanting to read lines at a time, or arbitrary blocks of characters?
If all threads are to read all lines from the file, then you should create a separate BufferedReader for each thread. If each thread is processing one line at a time (and the order of lines doesn't matter), then you should probably use the producer/consumer model, where only one thread actually reads from the file and places the work in a BlockingQueue, while the other threads periodically remove work items and process them. Note that you can reduce locking overhead if you read N lines into a list and then place the list in the blocking queue, instead of placing each individual line directly in the queue, since that allows multiple lines to be read/extracted with a single synchronization operation. Placing and removing every single line directly into/out of the queue would be very inefficient, especially if processing each line is fairly quick.
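A sketch of that batching producer, assuming a shared BlockingQueue<List<String>> and a batch size chosen by the caller:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class BatchingProducer {
    static void produce(BlockingQueue<List<String>> queue, String path, int batchSize)
            throws IOException, InterruptedException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    queue.put(batch); // one queue operation covers N lines
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                queue.put(batch); // flush the final partial batch
            }
        }
    }
}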
