I am trying to use threading to read a BMP image in Java. More exactly, I want to read it in 4 chunks of data. This is for educational purposes only; I know it's not something you would technically need.
However, I don't know how to read the file byte by byte or chunk by chunk. The only thing I found was readAllBytes, which is not what I need, or readByte, which requires me to already have the array of bytes, but then it is not a threaded read anymore.
Is there any way I could read byte by byte or block by block for a given path?
Thank you in advance!
.read(), with no arguments, reads exactly one byte, but there are two important notes on this:
This is not thread safe. Threading + disk generally doesn't work: the bottleneck is the disk, not the CPU, and you need to add a guard so that only one thread is reading at any one time. Given that the disk is so slow, you end up in a situation where one thread reads from disk and processes the data it received, and while that is happening, one of the X threads that were all waiting on the disk can now 'go' (the others still have to wait). But each thread is done reading and processing its data before any other thread even gets unpaused: you gain nothing.
read() on a FileInputStream is usually incredibly slow. These are low-level operations, but disks tend to read entire blocks at a time and are incapable of reading one byte at a time. Thus, read() is implemented as: read the smallest chunk one can read (usually still 4096 or more bytes), take the one byte from that chunk that is needed, and toss the remainder in the garbage can. In other words, if you read a file by calling .read() 4096 times, that reads the same chunk from disk 4096 times. Whereas calling:
byte[] b = new byte[4096];  // one block-sized buffer
int read = fileIn.read(b);  // fills as much of the buffer as it can; returns how many bytes were read
would fill the entire byte array with the chunk, and is thus 4096x faster.
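For reference, here is a minimal sketch of that chunk-by-chunk read loop; the file name "image.bmp" and the 4096-byte buffer are made-up placeholders, not anything from the question:

import java.io.FileInputStream;
import java.io.IOException;

public class ChunkedRead {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[4096]; // reuse one block-sized buffer for every chunk
        try (FileInputStream in = new FileInputStream("image.bmp")) { // hypothetical path
            int read;
            while ((read = in.read(buffer)) != -1) {
                // only buffer[0..read) is valid; the last chunk is usually shorter than 4096 bytes
            }
        }
    }
}

A single loop like this already saturates the disk on its own, which is exactly why splitting the reading across threads buys you nothing.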
If your aim is to learn about multithreading, 'reading a file' is not how to learn about it; you can't observe anything meaningful in action this way.
If your aim is to speed up BMP processing, 'multithread the reading process' is not the way to do it either. I'm at a loss to explain why multithreading is involved here at all: it is suitable neither for learning about it nor for speeding anything up.
Probably a stupid question, but what is the ideal length for a byte array to send over an OutputStream? I couldn't find anything about this online.
I have found a lot of examples that set their array size to 2^x or something similar. What is the purpose of this?
There is no optimal size. OutputStream is an abstract concept; there are a million implementations of it. (Not just 'FileOutputStream is an implementation', but 'FileOutputStream, on OpenJDK 11, on Windows 10, with this service pack, with this CPU and this much system memory, under these circumstances'.)
The reason you see that is buffer efficiency. Sending 1 byte is usually not a problem at all, but sometimes, sending 1 (or very few) bytes results in this nasty scenario:
You send one byte.
The underlying OutputStream isn't designed to buffer that byte; it doesn't have the storage for it, so the only thing it can do is send it onwards to the actual underlying resource. Let's say the OutputStream represents a file on the filesystem.
The kernel driver for this works similarly. (Most OSes do buffer internally, but you can ask the OS not to do this when you open the file.)
Therefore, that one byte now needs to be written to disk. However, it is an SSD, and you can't do that with an SSD; you can only write an entire cell at once*. That's just how SSDs work: you can only write an entire block's worth. They aren't bits in sequence on a big platter.
So, the kernel reads the entire cell out, updates the one byte you are writing, and writes the entire cell back to the SSD.
Your actual loop writes, say, about 50,000 bytes, so something that should have taken a single SSD read and write now takes 50,000 reads and 50,000 writes, burning through your SSD's cell longevity and taking 50,000 times longer than needed.
Similar issues occur for networking (you end up sending a single byte, wrapped in HTTP headers, wrapped in 2 TCP/IP packets, resulting in ~1000 bytes going over the network for each byte you .write(singleValue)), and many other such systems.
So why don't these streams buffer?
Because there are cases where you don't actually want them to do this; there are plenty of reasons to write I/O with specific efficiency in mind.
Is there no way to just let something do this for me?
Ah, you're in luck, there is! BufferedWriter and friends (BufferedOutputStream exists as well) wrap around an underlying stream and buffer for you:
var file = new FileOutputStream("/some/path");
var wrapped = new BufferedOutputStream(file);
file.write(1); // this is a bad idea
wrapped.write(1); // this is fine
Here, the wrapped write doesn't result in anything happening except some memory being shoved around. No bytes are written to disk (with the downside that if someone trips over a power cable, it's just lost). Only after you close wrapped, call flush() on wrapped, or write a sufficient number of bytes to wrapped will wrapped actually send a whole bunch of bytes to the underlying stream. This is what you should use if making a byte array is unwieldy. Why reinvent the wheel?
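As a small sketch of how that usually looks in practice (the path is the same placeholder as above), try-with-resources closes the wrapper, which flushes whatever is still sitting in its buffer:

try (var wrapped = new BufferedOutputStream(new FileOutputStream("/some/path"))) {
    wrapped.write(1);  // lands in the in-memory buffer; nothing hits the disk yet
    wrapped.flush();   // if you need it on disk right now, force the buffer down to the file
}                      // close() flushes any remainder and closes the underlying stream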
But I want to write to the underlying raw stream
Well, you're using too few bytes if the amount is less than what a single TCP/IP packet can hold, or an unfortunate size otherwise (imagine the TCP/IP packet can hold exactly 1000 bytes and you send 1001 bytes: that's one full packet, and then a second packet with just 1 byte, giving you only 50% efficiency. 50% is still better than the 0.1% efficiency that byte-at-a-time would get you in this hypothetical). But if you send, say, 5001 bytes, that's 5 full packets and one regrettable 1-byte packet, for 83.35% efficiency. Unfortunate that it's not near 100, but not nearly as bad. The same applies to disk (if an SSD cell holds 2048 bytes and you send 65537 bytes, it's still ~97% efficient).
You're using too many bytes if the impact on your own Java process is such that it becomes problematic: it's causing excessive garbage collection or, worse, out-of-memory errors.
So where is the 'sweet spot'? Depends a little bit, but 65536 is common and is unlikely to be 'too low'. Unless you run many thousands of simultaneous threads, it's not too high either.
It's usually a power of 2 mostly because of superstition, but there is some sense in it: those underlying buffers are often a power of two (computers are binary things, after all). So if the cell size happens to be, say, 2048, then you are 100% efficient if you send 65536 bytes (that's exactly 32 cells' worth of data).
But the only thing you're really trying to avoid is that 0.1% efficiency rate, which occurs if you write one byte at a time to a packetizing (SSD, network, etc.) underlying stream. Hence it doesn't matter much; as long as it's more than 2048 or so, you have already avoided the doom scenario.
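Put concretely, here is a hedged sketch of the usual copy loop with a 65536-byte buffer; in and out stand for whatever InputStream and OutputStream you already have:

byte[] buffer = new byte[65536];  // the common 'sweet spot' size
int read;
while ((read = in.read(buffer)) != -1) {
    out.write(buffer, 0, read);   // write only the bytes actually read this round
}
out.flush();                      // push out whatever partial chunk is left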
*) I'm oversimplifying; the point is that a single byte read or write can take as long as a whole chunk of them, and to give some hint as to why that is, not to do a complete deep-dive on SSD technology.
Background: I'm currently creating an application in which two Java programs communicate over a network using a DataInputStream and DataOutputStream.
Before every communication, I'd like to send an indication of what type of data is being sent, so the program knows how to handle it. I was thinking of sending an integer for this, but a byte has enough possible combinations.
So my question is, is Java's DataInputStream's readByte() faster than readInt()?
Also, on the other side, is Java's DataOutputStream's writeByte() faster than writeInt()?
If one byte is enough for your data, then readByte and writeByte will indeed be faster (because they read/write less data). It won't be a noticeable difference, though, because the amount of data is very small in both cases: 1 byte vs 4 bytes.
If you have lots of data coming from the stream, then using readByte or readInt will not make a speed difference - for example, calling readByte 4 times instead of readInt once. Just use the one that matches the kind of data you expect and makes your code easier to understand. You will have to read the whole stuff anyway :)
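As an illustration only, here is a sketch of the tag-then-payload idea with DataOutputStream/DataInputStream; the socket variable and the tag value 1 are made-up assumptions:

// sender: one tag byte, then the payload for that type
DataOutputStream out = new DataOutputStream(socket.getOutputStream());
out.writeByte(1);       // hypothetical tag: 1 = text message
out.writeUTF("hello");
out.flush();

// receiver: read the tag first, then decide how to handle the rest
DataInputStream in = new DataInputStream(socket.getInputStream());
byte type = in.readByte();
if (type == 1) {
    String message = in.readUTF();
}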
I'm reading the O'Reilly book "Java I/O" and there is a recommendation:
Files are often best written in small multiples of the block size of the disk, typically 512, 1024, or 2048 bytes.
I tried to find an explanation, but I couldn't. I have just some ideas, but it would be great if someone explained it or shared a link explaining why this is good practice. Thanks.
To prevent unnecessary read-modify-write cycles on the hard disk.
Ideally you output the data in chunks as close as possible to the actual sector size.
Every time data is written, the entire sector is read, then modified with your data, then written back to disk.
If you flush out 5 bytes at a time, that's a lot of I/O operations that have to happen. Let's say you have a hard drive with a sector size of 4096: that's roughly 820 (4096 / 5) read-modify-write operations to fill a single sector.
Whereas if you buffer it up to the max, you only have one read-modify-write operation to complete, which means less waiting for the hard drive to finish the task.
A lot of operating-system machinery (and probably hard drive firmware) exists to wait a bit and see whether more data will be added to the sector, but it's best to help out and make sure the hard drive's write cache can be flushed to disk sooner.
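A minimal sketch of that advice (the file name, the records collection, and the 8192-byte buffer are assumptions, not anything from the question): give BufferedOutputStream a buffer that is a multiple of the 4096-byte sector, so small writes reach the OS grouped into chunks of up to two sectors rather than a few bytes at a time:

try (OutputStream out = new BufferedOutputStream(
        new FileOutputStream("data.bin"), 8192)) {  // 8192 = 2 x 4096-byte sectors
    for (byte[] record : records) {                 // 'records' stands in for your small writes
        out.write(record);                          // buffered; flushed in large chunks
    }
}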
Interesting reading:
What goes on behind the curtains during disk I/O?
http://www.seagate.com/tech-insights/advanced-format-4k-sector-hard-drives-master-ti/
I read that these are used to reduce the overhead of disk/network calls, which seems fine for write operations. But what is the benefit of a buffered read?
If you read from a file byte by byte, you make a system call each time, and that is an expensive operation. With buffered reads you make a system call once per buffer. This code reads 100K from a file on my PC in 130 ms:
InputStream is = new FileInputStream("d:/1");
long start = System.currentTimeMillis();
for (int i = 0; i < 100000; i++) {
    is.read(); // one byte, and roughly one system call, per iteration
}
System.out.println((System.currentTimeMillis() - start));
and if I change the first line to
InputStream is = new BufferedInputStream(new FileInputStream("d:/1"));
it reads 100K in 12 ms.
Reading from an input stream can sometimes be a long operation.
Reading a single byte at a time is not a very good choice if the source of the stream stores its information in bigger chunks. For example, a file is saved on the disk in pieces of many kilobytes. If you don't use a buffer, you reload the same chunk from the disk many times just to read the bytes that compose it. With a buffer, the read operation keeps the chunk (or part of it) in memory, reducing the I/O operations on the disk.
Because reading from memory is faster than reading from a disk, you do the same work while gaining a lot of time.
A note for writing: the only thing you need to remember is that for write operations the data is not written to the file unless you flush the buffer at the end of the operation.
When you read bytes from an InputStream, at some level a system call needs to be made to read the physical file from the disk. Now, system calls are costly: they need to pass your parameters from user space to kernel space and then make a switch to kernel mode before executing. After the call is made, the results have to be moved back from kernel space to user space again.
By using BufferedInputStream the number of read system calls can be reduced. For example, if you read 1000 bytes in unbuffered mode, you need to make 1000 system calls. In buffered mode, the BufferedInputStream reads a block of data (8192 bytes by default) using a single system call, and each of your read calls on the input stream is then served from its own buffer.
I have a piece of code that reads a hell of a lot (hundreds of thousands) of relatively small files (a couple of KB each) from the local file system in a loop. For each file a java.io.FileInputStream is created to read the content. The process is very slow and takes ages.
Do you think that wrapping the FileInputStream in a java.io.BufferedInputStream would make a significant difference?
If you aren't already using a byte[] buffer of a decent size in the read/write loop (the latest implementation of BufferedInputStream uses 8 KB), then it will certainly make a difference. Give it a try yourself. Don't forget to make any OutputStream a BufferedOutputStream as well.
But if you have already buffered it using a byte[] and/or it still makes only a little difference, then you've hit the hard disk and I/O controller speed as the bottleneck.
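For what it's worth, a sketch of the wrapping being suggested; files is an assumed list of the small files, and the per-byte processing is left as a comment:

for (File f : files) {
    try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
        int b;
        while ((b = in.read()) != -1) {
            // process one byte; the internal 8 KB buffer keeps this from being one syscall per byte
        }
    }
}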
I very much doubt whether that will make any difference.
Your fundamental problem is the hundreds of thousands of tiny files. Reading those is going to make the disk thrash and take forever no matter how you do it; you'll spend 99.9% of the time waiting on mechanical movement inside the hard disk.
There are two ways to fix this:
Save your data on an SSD - they have much lower (as in five orders of magnitude less) latency.
Rearrange your data into a few large files and read those sequentially.
That depends on how you're reading the data. If you're reading from the FileInputStream in a very inefficient way (for example, calling read() byte-by-byte), then using a BufferedInputStream could improve things dramatically. But if you're already using a reasonable-sized buffer with FileInputStream, switching to a BufferedInputStream won't matter.
Since you're talking about a large number of very small files, there's a strong possibility that a lot of the delay is due to directory operations (open, close) rather than the actual reading of bytes from the files.