Can't seem to find anything about input stream "blocking" that describes both what it is and when it occurs. Is this some type of multi-thread prevention of concurrent threads accessing the same stream?
On that note, when two concurrent threads access the same stream at the same time, can this cause problems, or do both threads get their own stream pointers? Obviously, one would need to wait, but hopefully it wouldn't lead to an unchecked exception.
"Blocking" is when a read or write hangs, while waiting for either more information (for reads) or for more space in some internal buffer (for writes) before returning control to the calling thread.
And I'm pretty sure the stream object takes care of its own read/write locations, so the pointer just points to the stream object, which reads out of its own buffer. So, if you're reading with synchronized methods, then each read will wait its turn and get coherent, non-overlapping data. If the methods aren't synchronized, then I'm pretty sure all hell will break loose.
In the context of input streams, "blocking" typically refers to the stream waiting for more data to become available. The term would probably make more sense if you think about sockets rather than files.
If you have multiple threads concurrently reading from the same stream, you have to do your own synchronization. There are no thread-specific "stream pointers". Again, think about multiple threads reading from the same socket (rather than from a file).
Each stream has a stream pointer. It doesn't make much sense to have two threads reading the same stream.
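To make the "do your own synchronization" point concrete, here is a minimal sketch. It assumes fixed-size records and locks on the stream object itself; the class name, RECORD_SIZE and the helper methods are made up for illustration. Each thread takes one whole record per lock acquisition, so records never interleave, although the order in which threads receive them is not defined.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class SharedStreamReader {
    // Hypothetical fixed record length, chosen only for this example.
    static final int RECORD_SIZE = 4;

    // Reads one whole record while holding the stream's monitor.
    // Returns null at end of stream (a partial trailing record is dropped).
    static byte[] readRecord(InputStream in) throws IOException {
        synchronized (in) {
            byte[] record = new byte[RECORD_SIZE];
            int filled = 0;
            while (filled < RECORD_SIZE) {
                int n = in.read(record, filled, RECORD_SIZE - filled);
                if (n < 0) {
                    return null;
                }
                filled += n;
            }
            return record;
        }
    }

    // Several threads drain the same stream; there is one shared position, not one per thread.
    static List<String> readAll(byte[] data, int threadCount) throws InterruptedException {
        InputStream in = new ByteArrayInputStream(data);
        List<String> records = new CopyOnWriteArrayList<>();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                try {
                    byte[] r;
                    while ((r = readRecord(in)) != null) {
                        records.add(new String(r, StandardCharsets.US_ASCII));
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "AAAABBBBCCCCDDDD".getBytes(StandardCharsets.US_ASCII);
        // Every record appears exactly once, but the order depends on thread scheduling.
        System.out.println(readAll(data, 3));
    }
}
```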
I'm trying to use a MappedByteBuffer to allow concurrent reads on a file by multiple threads with the following constraints:
File is too large to load into memory
Threads must be able to read asynchronously (it's a web app)
The file is never written to by any thread
Every thread will always know the exact offset and length of bytes it needs to read (ie - no "seeking" by the app itself).
According to the docs (https://docs.oracle.com/javase/8/docs/api/java/nio/Buffer.html) Buffers are not thread-safe since they keep internal state (position, etc). Is there a way to have concurrent random access to the file without loading it all into memory?
Although FileChannel is technically thread-safe, from the docs:
Where the file channel is obtained from an existing stream or random access file then the state of the file channel is intimately connected to that of the object whose getChannel method returned the channel. Changing the channel's position, whether explicitly or by reading or writing bytes, will change the file position of the originating object, and vice versa
So it would seem that it's simply synchronized. If I were to new RandomAccessFile().getChannel().map() in each thread [edit: on every read] then doesn't that incur the I/O overhead with each read that MappedByteBuffers are supposed to avoid?
Rather than using multiple threads for concurrent reads, I'd go with this approach (based on an example with a huge CSV file whose lines have to be sent concurrently via HTTP):
Reading a single file at multiple positions concurrently wouldn't let you go any faster (but it could slow you down considerably).
Instead of reading the file from multiple threads, read the file from a single thread, and parallelize the processing of these lines. A single thread should read your CSV line-by-line, and put each line in a queue. Multiple working threads should then take the next line from the queue, parse it, convert to a request, and process the request concurrently as needed. The splitting of the work would then be done by a single thread, ensuring that there are no missing lines or overlaps.
If you can read the file line by line, LineIterator from Commons IO is a memory-efficient possibility. If you have to work with chunks, your MappedByteBuffer seems to be a reasonable approach. For the queue, I'd use a blocking queue with a fixed capacity—such as ArrayBlockingQueue—to better control the memory usage (lines/chunks in queue + lines/chunks among workers = lines/chunks in memory).
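A rough sketch of that single-reader, many-workers layout, using only the JDK. The calling thread does the reading; the queue capacity of 16, the POISON end marker and the counter standing in for "parse and send the request" are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class SingleReaderPipeline {
    // Hypothetical end-of-input marker; compared by identity so no real line can match it.
    private static final String POISON = new String("<<EOF>>");

    // The calling thread reads lines; workerCount threads process them. Returns lines processed.
    static int process(BufferedReader source, int workerCount)
            throws IOException, InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16); // bounded: caps memory use
        AtomicInteger processed = new AtomicInteger();

        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    String item;
                    while ((item = queue.take()) != POISON) {
                        // Placeholder for "parse, convert to a request, process the request".
                        processed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        try {
            String line;
            while ((line = source.readLine()) != null) {
                queue.put(line); // blocks while the queue is full
            }
        } finally {
            // One marker per worker so every worker terminates, even if reading failed.
            for (int i = 0; i < workerCount; i++) {
                queue.put(POISON);
            }
        }
        for (Thread w : workers) {
            w.join();
        }
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader csv = new BufferedReader(new StringReader("a,1\nb,2\nc,3\nd,4\ne,5\n"));
        System.out.println(process(csv, 3)); // prints 5
    }
}
```

Because only one thread ever touches the reader, there are no missing or duplicated lines; the workers only see complete lines handed over through the queue.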
FileChannel supports a read operation without synchronization. It natively uses pread on Linux:
public abstract int read(ByteBuffer dst, long position) throws IOException
Here is what the FileChannel documentation says:
...Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.
It is pretty primitive in that it only returns how many bytes were read (see details here), so a caller may have to loop until the buffer is full. But I think you can still make use of it given your assumption that "Every thread will always know the exact offset and length of bytes it needs to read".
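Here is a small sketch of how that positional read could be used from several threads sharing one channel. The helper names (readAt, readString, demo) and the temp file contents are made up for the example; the loop is there because a single read(dst, position) call may return fewer bytes than requested.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PositionalRead {
    // Reads exactly `length` bytes starting at `offset` without touching the channel's own position.
    static byte[] readAt(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer dst = ByteBuffer.allocate(length);
        while (dst.hasRemaining()) {
            int n = channel.read(dst, offset + dst.position());
            if (n < 0) {
                throw new EOFException("file ended before " + length + " bytes were read");
            }
        }
        return dst.array();
    }

    private static String readString(FileChannel channel, long offset, int length) {
        try {
            return new String(readAt(channel, offset, length), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Two threads read different regions of the same file through one shared channel.
    static String[] demo() throws IOException, InterruptedException {
        Path file = Files.createTempFile("positional", ".txt");
        try {
            Files.write(file, "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII));
            String[] results = new String[2];
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                Thread a = new Thread(() -> results[0] = readString(channel, 0, 10));
                Thread b = new Thread(() -> results[1] = readString(channel, 10, 10));
                a.start();
                b.start();
                a.join();
                b.join();
            }
            return results;
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws Exception {
        String[] r = demo();
        System.out.println(r[0] + " " + r[1]); // prints 0123456789 abcdefghij
    }
}
```

No buffer state is shared here: each call allocates its own ByteBuffer, and the explicit offset means the threads never fight over a common position.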
I know that java.util.BitSet operations are not thread-safe. Can reading from and writing to a BitSet in parallel threads cause a permanent loss of information (permanent for the current run of the application)? Or does the write operation execute correctly, so that only the read running in parallel may return wrong information while later reads return correct information? In other words, if I synchronize only the write operations and allow reads to run in parallel with them, will some information still be lost permanently?
The only thread-safe combination is read vs. read: nothing is written to memory, so it can be read from any thread without any problem.
BUT with read vs. write you can get surprises. For example, reading while writing may give you half the previous result and half the new result, since a BitSet is not updated atomically.
In your question, you accept that the concurrent read/write returns incorrect results in read. In that case, how do you know if the data returned by a read is correct? read many times and make an average?
So you have to synchronize your read operations with your write operations too.
EDIT: if you really want to go down the "I don't care if data is corrupt when reading" road, I suggest you add CRC to the emitted data, and you can reject the data if incorrect.
When you're talking about a standard Java library class, then you should go by what the javadoc says.
There are users out there running different JRE versions, from different vendors, on different operating system versions. You can't rely on the behavior of BitSet or any other library class to be exactly the same in every environment, but you can rely on it to do whatever the Javadoc says it will do.
if I only synchronize the write operations and allow write operations to run in parallel with read operations, will some information still be lost permanently?
It's highly unlikely that overlapped read operations or reads overlapped with a single write operation could leave a BitSet (or any other object) in some invalid state. If you think that your application can cope with incorrect results that might be returned by a read, then that might be a reasonable risk to take,
BUT
Are you certain that synchronizing reads causes a performance problem? If you haven't actually measured the performance, and you haven't found that the difference between unsynchronized and synchronized makes the difference between acceptable performance and unacceptable performance, then why not just synchronize all access?
To do otherwise is called "premature optimization", and more often than not, it's a waste of your own time.
A BitSet is not safe for multithreaded use without external synchronization.
Please read this article for basic understanding of readers writer problem.
https://dzone.com/articles/java-concurrency-read-write-lo
Reading alone will not cause any concurrency issues.
READ/WRITE or WRITE/WRITE can cause inconsistencies when you access the information concurrently.
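For what it's worth, the readers-writer idea from the answers above could look roughly like this: a thin wrapper that takes a read lock for reads and a write lock for writes, so reads may overlap each other but never a write. The wrapper class and its method names are hypothetical, not something from the JDK.

```java
import java.util.BitSet;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class GuardedBitSet {
    private final BitSet bits = new BitSet();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Writers are exclusive: no reader or other writer runs at the same time.
    void set(int index) {
        lock.writeLock().lock();
        try {
            bits.set(index);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers may run concurrently with each other, but not with a writer.
    boolean get(int index) {
        lock.readLock().lock();
        try {
            return bits.get(index);
        } finally {
            lock.readLock().unlock();
        }
    }

    int cardinality() {
        lock.readLock().lock();
        try {
            return bits.cardinality();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Several threads set disjoint ranges of bits and read them back.
    static int fillConcurrently(int threadCount, int bitsPerThread) throws InterruptedException {
        GuardedBitSet shared = new GuardedBitSet();
        Thread[] workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int base = t * bitsPerThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < bitsPerThread; i++) {
                    shared.set(base + i);
                    shared.get(base + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return shared.cardinality();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fillConcurrently(4, 1000)); // prints 4000
    }
}
```

Whether this beats plain synchronized methods depends on the read/write mix, so it is worth measuring before committing to either.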
I am trying to write a single huge file in Java using multiple threads.
I have tried both the FileWriter and BufferedWriter classes in Java.
The content being written is actually an entire table (Postgres) being read using CopyManager and written. Each line in the file is a single tuple from the table and I am writing 100s of lines at a time.
Approach to write:
The single to-be-written file is opened by multiple threads in append mode.
Each thread thereafter tries writing to the file.
Following are the issues I face:
Once in a while, the contents of the file get overwritten, i.e. one line remains incomplete and the next line starts right where it stopped. My assumption here is that the buffers for the writer are getting full. This forces the writer to immediately write the data onto the file. The data written may not be a complete line, and before it can write the remainder, the next thread writes its content onto the file.
While using FileWriter, once in a while I see a single blank line in the file.
Any suggestions, how to avoid this data integrity issue?
Shared Resource == Contention
Writing to a normal file is by definition a serialized operation. You gain no performance by trying to write to it from multiple threads: I/O is a finite, bounded resource with orders of magnitude less bandwidth than even the slowest or most overloaded CPU.
Concurrent access to a shared resource can be complicated ( and slow )
If you have multiple threads that are doing expensive calculations, then you have options. If you are just using multiple threads because you think you are going to speed something up, you are going to do the opposite. Contention for I/O slows down access to the resource; it never speeds it up, because of the lock waits and other overhead.
You have to have a critical section that is protected and allows only a single writer at a time. Just look up the source code for any logging writer that supports concurrency and you will see that there is only a single thread that writes to the file.
If your application is primarily:
CPU Bound: You can use some locking mechanism or data construct to let only one thread out of many write to the file at a time. As a naive solution this is useless from a concurrency standpoint, but if these threads are CPU bound with little I/O, it might work.
I/O Bound: This is the most common case. You must use a message-passing system with a queue of some sort: have all the threads post to a queue/buffer and have a single thread pull from it and write to the file. This will be the most scalable and easiest solution to implement.
Journaling - Async Writes
If you need to create a single super large file where the order of writes is unimportant and the program is CPU bound, you can use a journaling technique.
Have each process write to a separate file and then concat the multiple files into a single large file at the end. This is a very old school low tech solution that works well and has for decades.
Obviously the more storage I/O you have the better this will perform on the end concat.
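A minimal sketch of that part-file approach, using threads rather than separate processes. The temp-file names and the "writer N line M" content are invented for the example. Each writer owns its own file, so nothing is shared while writing, and the concat step at the end runs on a single thread.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PartFileJournal {
    // Each writer thread fills its own part file; the parts are then appended to `target` in order.
    static Path writeParts(Path target, int writers, int linesPerWriter)
            throws IOException, InterruptedException {
        Path[] parts = new Path[writers];
        Thread[] threads = new Thread[writers];
        for (int i = 0; i < writers; i++) {
            final int id = i;
            parts[i] = Files.createTempFile("part-" + id + "-", ".tmp");
            threads[i] = new Thread(() -> {
                try (BufferedWriter out = Files.newBufferedWriter(parts[id], StandardCharsets.UTF_8)) {
                    for (int n = 0; n < linesPerWriter; n++) {
                        out.write("writer " + id + " line " + n);
                        out.newLine();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }

        // Single-threaded concat: copy every part into the target, then remove the part.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out);
                Files.delete(part);
            }
        }
        return target;
    }

    public static void main(String[] args) throws Exception {
        Path target = Files.createTempFile("combined-", ".txt");
        writeParts(target, 3, 100);
        System.out.println(Files.readAllLines(target).size() + " lines in " + target);
    }
}
```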
I am trying to write a single huge file in Java using multiple threads.
I would recommend that you have X threads reading from the database and a single thread writing to your output file. This is going to be much easier to implement as opposed to doing file locking and the like.
You could use a shared BlockingQueue (maybe ArrayBlockingQueue) so the database readers would put(...) to the queue and your writer would be in a take() loop on the queue. (With a bounded queue, put(...) blocks while the queue is full, whereas add(...) throws an IllegalStateException.) When the readers finish, they could add some special IM_DONE string constant and as soon as the writing thread sees X of these constants (i.e. one for each reader), it would close the output file and exit.
So then you can use a single BufferedWriter without any locks and the like. Chances are that you will be blocked by the database calls instead of the local IO. Certainly the extra thread isn't going to slow you down at all.
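Something along these lines, as a sketch only. A StringWriter stands in for the output file and the "reader N row M" strings stand in for the tuples coming from the database. I use put(...) rather than add(...) here because the queue is bounded and put blocks when it is full.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueWriterDemo {
    // Sentinel compared by identity, so a real row can never be mistaken for it.
    private static final String IM_DONE = new String("IM_DONE");

    static String run(int readerCount, int rowsPerReader) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        StringWriter sink = new StringWriter(); // stand-in for the real output file

        // The only thread that ever touches the BufferedWriter, so no locking is needed.
        Thread writer = new Thread(() -> {
            int done = 0;
            try (BufferedWriter out = new BufferedWriter(sink)) {
                while (done < readerCount) {
                    String row = queue.take();
                    if (row == IM_DONE) {
                        done++;
                    } else {
                        out.write(row);
                        out.newLine();
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        Thread[] readers = new Thread[readerCount];
        for (int i = 0; i < readerCount; i++) {
            final int id = i;
            readers[i] = new Thread(() -> {
                try {
                    for (int n = 0; n < rowsPerReader; n++) {
                        queue.put("reader " + id + " row " + n); // placeholder for a database row
                    }
                    queue.put(IM_DONE);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            readers[i].start();
        }
        for (Thread r : readers) {
            r.join();
        }
        writer.join();
        return sink.toString();
    }

    public static void main(String[] args) throws Exception {
        String out = run(4, 250);
        System.out.println(out.split(System.lineSeparator()).length); // prints 1000
    }
}
```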
The single to-be-written file is opened by multiple threads in append mode. Each thread thereafter tries writing to the file.
If you are adamant about having your reading threads also do the writing, then you should add a synchronized block around the access to a single shared BufferedWriter -- you could synchronize on the BufferedWriter object itself. Knowing when to close the writer is a bit of an issue since each thread would have to know if the other one has exited. Each thread could increment a shared AtomicInteger when they run and decrement when they are done. Then the thread that looks at the run-count and sees 0 would be the one that would close the writer.
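A sketch of the shared-writer variant, with one deliberate deviation: the counter is initialized to the number of threads before any of them starts, instead of being incremented by each thread as it begins. That is my own assumption to avoid the case where an early thread finishes, sees 0 and closes the writer before a late thread has started. A StringWriter stands in for the file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

class SharedWriterDemo {
    static String run(int threadCount, int rowsPerThread) throws InterruptedException {
        StringWriter sink = new StringWriter(); // stand-in for the real output file
        BufferedWriter out = new BufferedWriter(sink);
        // Assumption: counted up front, so a fast thread cannot observe 0 too early.
        AtomicInteger running = new AtomicInteger(threadCount);

        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    for (int n = 0; n < rowsPerThread; n++) {
                        // The line and its terminator are written under one lock,
                        // so another thread cannot slip in between them.
                        synchronized (out) {
                            out.write("thread " + id + " row " + n);
                            out.newLine();
                        }
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                } finally {
                    // The last thread to finish closes (and thereby flushes) the writer.
                    if (running.decrementAndGet() == 0) {
                        try {
                            out.close();
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return sink.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(4, 250).split(System.lineSeparator()).length); // prints 1000
    }
}
```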
Instead of having synchronized methods, the better solution would be to have a thread pool with a single thread backed by a blocking queue. The messages the application wants to write are pushed to the blocking queue. The log writer thread keeps reading from the blocking queue (blocking while the queue is empty) and writes each message to the single file.
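With the JDK's executors this can be quite short, since Executors.newSingleThreadExecutor() already is a single worker thread fed by an internal queue. A rough sketch, where a StringBuilder stands in for the file and the 10-second timeout is an arbitrary choice for the example:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SingleThreadLogWriter {
    static String run(int producers, int messagesEach) throws InterruptedException {
        StringBuilder sink = new StringBuilder(); // only ever modified by the writer thread
        ExecutorService writer = Executors.newSingleThreadExecutor();

        Thread[] threads = new Thread[producers];
        for (int i = 0; i < producers; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                for (int n = 0; n < messagesEach; n++) {
                    String msg = "producer " + id + " message " + n;
                    // Hand the message to the writer thread; tasks run one at a time, in queue order.
                    writer.execute(() -> sink.append(msg).append('\n'));
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }

        writer.shutdown(); // already queued tasks still run
        if (!writer.awaitTermination(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("writer did not finish in time");
        }
        return sink.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(3, 100).split("\n").length); // prints 300
    }
}
```

Note that this executor's internal queue is unbounded, so it does not give you the back-pressure a fixed-capacity queue would.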
I am trying to understand piped streams.
Instead of piped stream why can't we use other streams to pipe each other? like below:
final ByteArrayOutputStream pos = new ByteArrayOutputStream();
final ByteArrayInputStream pis = new ByteArrayInputStream(pos.toByteArray());
And when will we have a deadlock in a piped stream? I tried to read and write using a single main thread, but it executes smoothly.
The difficulty here is that the process must be implemented in several threads because writing to one end of the pipe must be matched with a read at the other end.
It is certainly not difficult to create a thread to monitor arrivals at the end of one pipe and push them back through another pipe but it cannot be done with a single thread.
Have you looked at this question?
Piped streams allow for efficient byte-by-byte processing with minimal effort.
I could very well be wrong, but I believe toByteArray() might not do what you think it does. It just copies the current contents, not any contents in future.
So the only real issue here is management of this, which would be a bit more difficult. You'd have to constantly poll the output stream. Not to mention the memory allocation of an array for each call to toByteArray (which "Creates a newly allocated byte array" for each call).
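A quick way to check the snapshot behaviour yourself (a small sketch reusing the variable names from the question; the class and method names are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class SnapshotDemo {
    // Returns { bytes visible to the input stream, bytes held by the output stream }.
    static int[] run() throws IOException {
        ByteArrayOutputStream pos = new ByteArrayOutputStream();
        pos.write(new byte[] {1, 2, 3});

        // toByteArray() copies the three bytes written so far into a new array.
        ByteArrayInputStream pis = new ByteArrayInputStream(pos.toByteArray());

        // These bytes arrive after the copy was taken, so pis never sees them.
        pos.write(new byte[] {4, 5});

        return new int[] {pis.available(), pos.size()};
    }

    public static void main(String[] args) throws IOException {
        int[] r = run();
        System.out.println(r[0] + " " + r[1]); // prints 3 5
    }
}
```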
How I suspect deadlocks may happen in a single thread:
If you try a (blocking) read from an input stream that doesn't have data yet, it will never be able to get any: the data can only come from the output stream, which would have to be written to in the same thread, and that can't happen while you're sitting waiting for data.
So, in a single thread, it will happen if you're not very careful. It should be possible to use them in the same thread without deadlocks, but why would you want to? I think another data structure may be better suited, like a linked list or a simple circular array.
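For comparison, here is a sketch of the usual two-thread arrangement, where one thread writes and closes the output end while the calling thread reads until end of stream. The class and method names are made up. Because the reader and writer run in different threads, the read can block safely while it waits for the writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PipeDemo {
    static String run(String message) throws IOException, InterruptedException {
        PipedOutputStream pos = new PipedOutputStream();
        PipedInputStream pis = new PipedInputStream(pos);

        // Producer thread: writes the message, then closes its end so the reader sees end of stream.
        Thread producer = new Thread(() -> {
            try (PipedOutputStream out = pos) {
                out.write(message.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        producer.start();

        // Consumer (this thread): read() blocks until data arrives or the writer closes the pipe.
        ByteArrayOutputStream received = new ByteArrayOutputStream();
        int b;
        while ((b = pis.read()) != -1) {
            received.write(b);
        }
        producer.join();
        pis.close();
        return new String(received.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("hello pipe")); // prints hello pipe
    }
}
```

If the read loop were moved in front of any write in the same thread, the first read() would wait forever, which is the single-thread deadlock described above.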