I read that these are used to reduce disk/network call overhead, which seems fine for write operations. But what is the benefit of doing buffered reads?
If you read a file byte by byte, you make a system call for each byte, and that is an expensive operation. With buffered reads you make a system call once per buffer. This code reads 100K bytes from a file on my PC in 130 ms:
InputStream is = new FileInputStream("d:/1");
long start = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
    is.read(); // one system call per byte read
}
System.out.println((System.currentTimeMillis() - start));
and if I change the first line to
InputStream is = new BufferedInputStream(new FileInputStream("d:/1"));
it reads 100K in 12 ms.
Reading from an input stream can sometimes be a long operation.
Reading a single byte at a time is not a good choice if the source of the stream stores its data in bigger chunks. For example, a file is stored on disk in blocks of several kilobytes. If you don't use a buffer, you reload the same block from disk many times just to read the individual bytes it contains. With a buffer, the read operation keeps the block (or part of it) in memory, reducing the number of I/O operations on the disk.
Because reading from memory is much faster than reading from disk, the same work takes far less time.
A note on writing: the one thing you have to remember is that, for write operations, the data is not actually written to the file unless you flush the buffer at the end of the operation.
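A minimal sketch of that write-side caveat (the file name is just a placeholder):
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        OutputStream out = new BufferedOutputStream(new FileOutputStream("out.bin"));
        out.write(42);   // the byte sits in the in-memory buffer; the file is still empty
        out.flush();     // now it is pushed down to the FileOutputStream
        out.close();     // close() also flushes before releasing the file handle
    }
}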
When you read bytes from an InputStream, at some level a system call needs to be made to read the physical file from the disk. System calls are costly: they need to pass your parameters from user space to kernel space and then switch to kernel mode before executing. After the call completes, the results have to be moved back from kernel space to user space.
By using BufferedInputStream, the number of read system calls can be reduced. For example, if you read 1000 bytes in unbuffered mode, you make 1000 system calls. In buffered mode, the BufferedInputStream reads a whole block of data (8192 bytes by default) using a single system call, and each of your read calls on the stream is served from its own buffer.
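For illustration, a sketch of the same idea with an explicit buffer size (the path is a placeholder; 8192 also happens to be the default):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadDemo {
    public static void main(String[] args) throws IOException {
        // One read system call fills the 8 KB buffer; the following read() calls
        // are answered from memory until the buffer is exhausted.
        try (InputStream in = new BufferedInputStream(new FileInputStream("some-file.bin"), 8192)) {
            int b;
            while ((b = in.read()) != -1) {
                // process b
            }
        }
    }
}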
I am trying to use threading for reading a BMP image in Java. More exactly, I want to read it in 4 chunks of data. This is for educational purposes only; I know it's not something you would technically need.
However, I don't know how to read the file byte by byte or chunk by chunk. The only thing I found was readAllBytes, which is not what I need, or readByte, which requires me to already have the array of bytes, but then it is not a threaded read anymore.
Is there any way I could read byte by byte or block by block from a given path?
Thank you in advance!
.read(), with no arguments, reads exactly one byte, but there are two important notes on this:
This is not thread safe. Threading + disk generally doesn't work: the bottleneck is the disk, not the CPU, and you need to add a guard so that only one thread is reading at any one time. Given that the disk is so slow, you end up in a situation where one thread reads from disk and processes the data it received, and while that is happening, one of the other X threads that were all waiting on the disk gets to 'go' (the rest still have to wait). But each thread is done reading and processing its data before any other thread even gets unpaused: you gain nothing.
read() on a FileInputStream is usually incredibly slow. These are low-level operations, but disks tend to read entire blocks at a time and are incapable of reading one byte at a time. Thus, read() is implemented as: read the smallest chunk that can be read (usually still 4096 or more bytes), take the one byte from that chunk that is needed, and toss the remainder in the garbage can. In other words, if you read a file by calling .read() 4096 times, that reads the same chunk from disk 4096 times. Whereas calling:
byte[] b = new byte[4096];
int read = fileIn.read(b);
would fill the entire byte array with the chunk, and is thus 4096x faster.
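If you want the chunk-by-chunk version the question asks about, a simple read loop looks roughly like this (a sketch only; the file name and the process() helper are made-up placeholders, and note that read() may return fewer bytes than the array can hold):
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkRead {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[4096];
        try (InputStream fileIn = new FileInputStream("image.bmp")) {
            int read;
            while ((read = fileIn.read(buffer)) != -1) {
                process(buffer, read); // only the first 'read' bytes are valid this iteration
            }
        }
    }

    private static void process(byte[] chunk, int length) {
        // placeholder for whatever you do with each chunk
    }
}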
If your aim is to learn about multithreading, 'reading a file' is not how to learn about it; you can't observe anything meaningful in action this way.
If your aim is to speed up BMP processing, 'multithread the reading process' is not the way either. I'm at a loss to explain why multithreading is involved here at all. It is suitable neither for learning about threading, nor for speeding anything up.
Probably a stupid question, but what is the ideal length for a byte array to send over an OutputStream? I couldn't find anything about this online.
I have found a lot of examples that set their array size to 2^x or something similar. But what is the purpose of this?
There is no optimal size. OutputStream is an abstract concept; there are a million implementations of it. (Not just 'FileOutputStream is an implementation', but 'FileOutputStream, on OpenJDK 11, on Windows 10, with this service pack, with this CPU and this much system memory, under these circumstances'.)
The reason you see that is buffer efficiency. Sending 1 byte is usually harmless, but sometimes sending 1 (or very few) bytes at a time results in this nasty scenario:
You send one byte.
The underlying OutputStream isn't designed to buffer that byte; it doesn't have the storage for it, so the only thing it can do is send it onwards to the actual underlying resource. Let's say the OutputStream represents a file on the filesystem.
The kernel driver for this works similarly. (Most OSes do buffer internally, but you can ask the OS not to when you open the file.)
Therefore, that one byte now needs to be written to disk. However, it is an SSD, and you can't do that with an SSD; you can only write an entire cell at once*. That's just how SSDs work: you can only write an entire block's worth. They aren't bits in sequence on a big platter.
So, the kernel reads the entire cell out, updates the one byte you are writing, and writes the entire cell back to the SSD.
Your actual loop writes, say, about 50,000 bytes, so something that should have taken a handful of SSD reads and writes now takes 50,000 reads and 50,000 writes, burning through your SSD's cell longevity and taking vastly longer than needed.
Similar issues occur for networking: you end up sending a single byte, wrapped in HTTP headers, wrapped in TCP/IP packets, resulting in ~1000 bytes going over the network for each byte you .write(singleValue). Many other such systems behave the same way.
So why don't these streams buffer?
Because there are cases where you don't actually want them to do this; there are plenty of reasons to write I/O with specific efficiency in mind.
Is there no way to just let something do this for me?
Ah, you're in luck, there is! BufferedWriter and friends (BufferedOutputStream exists as well) wrap around an underlying stream and buffer for you:
var file = new FileOutputStream("/some/path");
var wrapped = new BufferedOutputStream(file);
file.write(1); // this is a bad idea
wrapped.write(1); // this is fine
Here, the wrapped write doesn't result in anything happening except some memory being shoved around. No bytes are written to disk (with the downside that if someone trips over a power cable, the data is just lost). Only after you close wrapped, call flush() on wrapped, or write a sufficient number of bytes to wrapped will wrapped actually send a whole bunch of bytes to the underlying stream. This is what you should use if managing a byte array yourself is unwieldy. Why reinvent the wheel?
But I want to write to the underlying raw stream
Well, you're using too few bytes if the number of bytes is less than what a single TCP/IP packet can hold, or an unfortunate size otherwise (imagine the TCP/IP packet can hold exactly 1000 bytes and you send 1001 bytes. That's one full packet, and then a second packet with just 1 byte, giving you only about 50% efficiency. 50% is still better than the 0.1% efficiency which byte-at-a-time would get you in this hypothetical). But if you send, say, 5001 bytes, that's 5 full packets and one regrettable 1-byte packet, for about 83% efficiency. Unfortunately it's not near 100%, but it's not nearly as bad. The same applies to disk (if an SSD cell holds 2048 bytes and you write 65537 bytes, it's still about 97% efficient).
You're using too many bytes if the impact on your own Java process becomes problematic: it causes excessive garbage collection or, worse, out-of-memory errors.
So where is the 'sweet spot'? Depends a little bit, but 65536 is common and is unlikely to be 'too low'. Unless you run many thousands of simultaneous threads, it's not too high either.
It's usually a power of 2 mostly because of superstition, but there is some sense in it: Those underlying buffer things are often a power of two (computers are binary things, after all). So if the cell size happens to be, say, 2048, well, then you are 100% efficient if you send 65536 bytes (that's exactly 32 cells worth of data).
But, the only thing you're really trying to avoid is that 0.1% efficiency rate which occurs if you write one byte at a time to a packetizing (SSD, network, etc) underlying stream. Hence, it doesn't matter, as long as it's more than 2048 or so, you should already have avoided the doom scenario.
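To make that concrete, here is a sketch of a plain copy loop using a 64 KiB buffer, which already avoids the doom scenario (the stream arguments are placeholders):
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyUtil {
    // 64 KiB: big enough that per-write overhead is negligible,
    // small enough that the allocation never matters.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[65536];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }
}
(On Java 9 and later, in.transferTo(out) does essentially this for you.)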
*) I'm oversimplifying; the point is that a single byte read or write can take as long as a whole chunk of them, and to give some hint as to why that is, not to do a complete deep-dive on SSD technology.
I would like to read a huge binary file (~100 GB) efficiently in Java. I have to process each line of it. The line processing will be in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What will be the optimum buffer size? Any formula for that?
If this is a binary file, then reading in "lines" does not make a lot of sense.
If the file is really binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you get to the byte that marks your end of "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process.
And repeat.
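A rough sketch of that producer side, assuming '\n' marks the end of a "line" and a bounded BlockingQueue feeds the workers (the file name and queue size are illustrative; for simplicity it copies each line into a fresh array, but see the tips below about recycling to reduce garbage):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class LineProducer {
    public static void main(String[] args) throws IOException, InterruptedException {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024); // bounded: the reader blocks if the workers lag
        try (InputStream in = new BufferedInputStream(new FileInputStream("huge.bin"))) {
            byte[] line = new byte[8192];
            int length = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {                       // end-of-"line" marker
                    queue.put(Arrays.copyOf(line, length));
                    length = 0;
                } else {
                    if (length == line.length) {       // grow if a line is longer than expected
                        line = Arrays.copyOf(line, line.length * 2);
                    }
                    line[length++] = (byte) b;
                }
            }
            if (length > 0) {
                queue.put(Arrays.copyOf(line, length));
            }
        }
        // Worker threads (not shown) would take() from 'queue' and process each line.
    }
}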
Tips:
Use a bounded buffer in case you can read lines faster than you can process them.
Recycle the byte[] objects to reduce garbage generation.
If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.
If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated but potentially faster than read() or readLine().
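If you do try the NIO route, the core read loop looks roughly like this (a sketch only; it hands you fixed-size chunks, so splitting into "lines" is still up to you, and the file name is a placeholder):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioChunkRead {
    public static void main(String[] args) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
        try (FileChannel channel = FileChannel.open(Paths.get("huge.bin"), StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();
                // consume buffer.remaining() bytes here, e.g. hand them to a worker
                buffer.clear();
            }
        }
    }
}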
Does reading in chunks work?
BufferedReader or BufferedInputStream both read in chunks, under the covers.
What will be the optimum buffer size?
It probably doesn't matter much what the buffer size is. I'd make it a few KB or tens of KB.
Any formula for that?
No, there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.
Java 8, streaming
try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) {
    lines.forEach(l -> {
        // Do anything line by line
    });
}
What is the exact use of flush()? What is the difference between stream and buffer? Why do we need buffer?
The advantage of buffering is efficiency. It is generally faster to write a block of 4096 bytes one time to a file than to write, say, one byte 4096 times.
The disadvantage of buffering is that you miss out on the feedback. Output to a handle can remain in memory until enough bytes are written to make it worthwhile to write to the file handle. One part of your program may write some data to a file, but a different part of the program or a different program can't access that data until the first part of your program copies the data from memory to disk. Depending on how quickly data is being written to that file, this can take an arbitrarily long time.
When you call flush(), you are asking the OS to immediately write out whatever data is in the buffer to the file handle, even if the buffer is not full.
The data sometimes gets cached in a buffer before it's actually written to disk; flush causes what's in the buffer to be written out.
flush tells an output stream to send all the data to the underlying stream. It's necessary because of internal buffering. The essential purpose of a buffer is to minimize calls to the underlying stream's APIs. If I'm storing a long byte array to a FileOutputStream, I don't want Java to call the operating system file API once per byte. Thus, buffers are used at various stages, both inside and outside Java. Even if you did call fputc once per byte, the OS wouldn't really write to disk each time, because it has its own buffering.
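A small sketch of that feedback point (the file name is a placeholder): until flush() is called, another program reading the file won't see the bytes yet.
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class FlushExample {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("log.txt"))) {
            out.write("first event\n".getBytes());
            // Still only in BufferedOutputStream's memory: a reader of log.txt sees nothing yet.
            out.flush();
            // Now the bytes have been handed down to the FileOutputStream (and the OS),
            // so other readers of the file can see them.
        } // close() flushes any remaining buffered bytes
    }
}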
I am running my application under a profiler. The 'class' with the highest memory consumption is char[], at about 10 kB in my application.
I then created an InputStream (a PipedInputStream, to be exact) which holds 300 MB of byte array data.
Then I took a look at my profiler, and I don't see any significant change (I don't see anything eating up 300 MB).
The question is, if that 300 MB of byte array is not in memory, where is Java keeping it?
[Update]
Additional info on how I got the 300 MB to my PipedInputStream:
I am developing a web app that has a file upload mechanism.
And in one of the processes in my file upload, I create an input stream (PipedInputStream). Basically,
I read the multipartfile's input stream (a few KB of byte[] at a time),
Created a PipedOutputStream
Created a PipedInputStream (passing the recently created output stream to the constructor)
Wrote the multipart's input stream to my PipedOutputStream (running on a separate thread, which flushes and closes that output stream before exiting). At this point, I now have a copy of the multipart's bytes in my own input stream
Then (accidentally) stored that input stream in my http session (discussion/debate on whether that is a good idea would be on a different question)
So the question then again is, where is Java keeping my InputStream's content (I don't see it anywhere in my profiler)?
[Update#2]
I have a FileOutputStream which reads from the PipedInputStream and writes to a file.
A PipedInputStream just makes data available when it's written by the output stream that it's connected to. So long as you keep reading from your input stream as fast as it receives data from the output stream, there won't be much data to buffer.
If that doesn't help, you'll need to give more information about what you're doing with the piped input stream - what output stream is it connected to, and what's reading from it?
EDIT: You still haven't said what's reading from your PipedInputStream. Something must be, as otherwise the PipedOutputStream would block; a PipedInputStream only has a fairly small buffer (1024 bytes by default).
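For reference, a minimal sketch of how a pipe is normally used: a writer thread pushes into the PipedOutputStream while another thread drains the PipedInputStream, so the pipe itself only ever holds its small internal buffer (names and sizes here are illustrative):
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipeDemo {
    public static void main(String[] args) throws IOException {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out); // connect the pair

        // Producer: writes 300 MB in total, but blocks whenever the pipe's small buffer is full.
        Thread producer = new Thread(() -> {
            byte[] chunk = new byte[8192];
            try {
                for (int i = 0; i < (300 * 1024 * 1024) / chunk.length; i++) {
                    out.write(chunk);
                }
                out.close(); // signals end-of-stream to the reader
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // Consumer: as long as this keeps reading, only a few KB ever sit in memory at once.
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
        }
        System.out.println("Read " + total + " bytes through the pipe");
    }
}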
A PipedInputStream does not store the whole stream's contents; it only holds a small internal buffer. Also, where do you get that 300 MB byte array from?