Java ByteBuffer can only read sequentially?

I am mapping a file into memory and reading it back with Java's ByteBuffer. This proves to be a really fast way of reading large files. However, I can only read the values sequentially: once I read one with buffer.getInt(), the buffer's position moves on to the next bytes. So if I want to use a value more than once, I have to store it in another variable:
int a = buffer.getInt();
I am noticing that this approach of copying a piece of memory to another is taking a long time (especially with a very large file) compared to just reading bytes. Is there a way I can re-read those bytes instead of copying them?

Just use position(int) to seek in ByteBuffer. Then you can read from anywhere.
ByteBuffer buffer=ByteBuffer.allocate(1000);
byte[] data=new byte[10];
buffer.position(100);
// read 10 bytes starting at position 100
buffer.get(data);
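Besides seeking with position(int), ByteBuffer also offers absolute getters such as getInt(int index), which read at a given byte offset without moving the position at all. The sketch below shows both options; the class name AbsoluteReadSketch and the sample values are made up for illustration, and a small heap buffer stands in for the mapped file.

```java
import java.nio.ByteBuffer;

class AbsoluteReadSketch {
    // Fills a small buffer with sample ints; stands in for a mapped file (assumption).
    static ByteBuffer sampleBuffer() {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.putInt(10).putInt(20).putInt(30).putInt(40);
        buffer.flip(); // limit = 16, position = 0: ready for reading
        return buffer;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = sampleBuffer();

        // Absolute get: reads the int at byte offset 4 without moving the position.
        int second = buffer.getInt(4);
        int secondAgain = buffer.getInt(4);

        // Relative get: reads at the current position and advances it by 4 bytes.
        int first = buffer.getInt();

        // Seeking back with position(int) lets a relative get re-read the same bytes.
        buffer.position(0);
        int firstAgain = buffer.getInt();

        // prints "20 20 10 10"
        System.out.println(second + " " + secondAgain + " " + first + " " + firstAgain);
    }
}
```

The absolute form is usually the simpler choice when the offsets are known, since it leaves the buffer's position untouched.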

Related

FileInputStream read byte by byte or block?

The reason given in "Why is using BufferedInputStream to read a file byte by byte faster than using FileInputStream?" for BufferedInputStream (BIS) being faster than FileInputStream (FIS) is that:
With a BufferedInputStream, the method delegates to an overloaded
read() method that reads 8192 amount of bytes and buffers them until
they are needed while FIS read the single byte
As I understand it, a disk is a 'block device': it always reads and writes entire blocks, even if the read request is for a smaller amount of data.
Isn't that so? If it is, then both FIS and BIS end up reading a complete block, not a single byte (as stated for FIS). So how is BIS faster than FIS?
The Java API of InputStream is what it is. Specifically, it has this method:
int read() throws IOException
which reads a single byte (it returns an int, so that it can return -1 to indicate EOF).
So, if you try to read a SINGLE BYTE from a file, it'll try to do exactly that. On a block device like a hard disk, that likely means reading the entire block and then chucking everything except that one byte. So if you call that read() method 8192 times, the same block is requested over and over, 8192 times, each time throwing away 8191 bytes and giving you just the one you want: about 67 million bytes (8192 × 8192) handled in total. Ouch. Not very efficient. (In practice the operating system usually caches the block, so the disk itself is not re-read each time, but every call still pays for a round trip into the OS.)
Assuming the kernel, disk, etc. all read in blocks of 8192 bytes, there is essentially no performance difference between a BufferedInputStream(new FileInputStream) and a plain new FileInputStream, IF you use something like:
byte[] buffer = new byte[8192];
in.read(buffer);
Now even plain jane unbuffered new FileInputStream just ends up reading that block off of disk just once.
BufferedInputStream does that 'under the hood' even if you use the single-byte form of read(), and will then feed you data from that byte array for the next 8191 calls to read(). That's all BufferedInputStream does.
If you are using the read() (one byte at a time) variant (or the byte-array variant of read, but with really small byte arrays), then BufferedInputStream makes sense. Otherwise, that does nothing and there is no need to put that in there.
NB: As far as I know, Java makes no guesses about what the disk block size is and just uses a reasonable buffer size (8192 bytes by default for BufferedInputStream). The effect is the same: if you read a single byte at a time, wrapping your file stream in a buffered stream can improve performance by a factor of 1000 or more; if you use the byte-array variant, it makes no difference whatsoever.
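The three read patterns discussed above can be put side by side in a rough sketch. The class and method names here are hypothetical, the temporary file exists only to have something to read, and no timings are claimed: the code just shows which calls are involved in each pattern.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadStrategiesSketch {
    // Counts bytes using the single-byte read(); on a raw FileInputStream
    // every call is a separate request to the operating system.
    static long countByteByByte(InputStream in) throws IOException {
        long count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Counts bytes using the array form of read(); one call fetches up to 8192 bytes.
    static long countInBlocks(InputStream in) throws IOException {
        byte[] block = new byte[8192];
        long count = 0;
        int n;
        while ((n = in.read(block)) != -1) {
            count += n;
        }
        return count;
    }

    // Creates a temporary file of the given size and reads it back three ways.
    static long[] readAllWays(int size) {
        try {
            Path file = Files.createTempFile("read-sketch", ".bin");
            try {
                Files.write(file, new byte[size]);
                long[] counts = new long[3];
                // 1. Unbuffered stream, block reads: already efficient.
                try (InputStream in = new FileInputStream(file.toFile())) {
                    counts[0] = countInBlocks(in);
                }
                // 2. Buffered stream, single-byte reads: served from the internal buffer.
                try (InputStream in = new BufferedInputStream(new FileInputStream(file.toFile()))) {
                    counts[1] = countByteByByte(in);
                }
                // 3. Unbuffered stream, single-byte reads: the slow combination.
                try (InputStream in = new FileInputStream(file.toFile())) {
                    counts[2] = countByteByByte(in);
                }
                return counts;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] counts = readAllWays(100_000);
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```

All three variants read the same bytes; only the number of calls that reach the operating system differs.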

How does RandomAccessFile.seek() work?

As per the API, these are the facts:
The seek(long bytePosition) method simply put, moves the pointer to
the position specified with the bytePosition parameter.
When the bytePosition is greater than the file length, the file
length does not change unless a byte is written at the (new) end.
If data is present in the length skipped over, such data is left
untouched.
However, the situation I'm curious about is: When there is a file with no data (0 bytes) and I execute the following code:
file.seek(100000-1);
file.write(0);
All the 100,000 bytes are filled with 0 almost instantly. I can clock over 200GB in say, 10 ms.
But when I try to write 100000 bytes using other methods, such as a BufferedOutputStream, the same process takes vastly longer.
What is the reason for this difference in time? Is there a more efficient way to create a file of n bytes and fill it with 0s?
EDIT:
If the data is not actually written, how is the file filled with data?
Sample this code:
RandomAccessFile out=new RandomAccessFile("D:/out","rw");
out.seek(100000-1);
out.write(0);
out.close();
This is the output:
Plus, if the file is huge enough, I can no longer write to the disk due to lack of space.
When you write 100,000 bytes to a BufferedOutputStream, your program is explicitly accessing each byte of the file and writing a zero.
When you call RandomAccessFile.seek() on a local file, you are indirectly asking the operating system to move the file offset (the equivalent of C's fseek(), or of the lseek() system call on POSIX systems). How that gets handled depends on the operating system.
In most modern operating systems, sparse files are supported. This means that if you ask for an empty 100,000 byte file, 100,000 bytes of disk space are not actually used. When you write to byte 100,001, the OS still doesn't use 100,001 bytes of disk. It allocates a small amount of space for the block containing "real" data, and keeps track of the empty space separately.
When you read a sparse file, for example, by fseek()ing to byte 50,000, then reading, the OS can say "OK, I have not allocated disk space for byte 50,000 because I have noted that bytes 0 to 100,000 are empty. Therefore I can return 0 for this byte.". This is invisible to the caller.
This has the dual purpose of saving disk space, and improving speed. You have noticed the speed improvement.
More generally, fseek() goes directly to a position in a file, so it's O(1) rather than O(n). If you compare a file to an array, it's like doing x = arr[n] instead of for(i = 0; i<=n; i++) { x = arr[i]; }
This description, together with the one on Wikipedia, is probably sufficient to understand why seeking to byte 100,000 and then writing is faster than writing 100,000 zeros. If you want more, you can read the Linux kernel source code to see how sparse files are implemented, and the RandomAccessFile source code in the JDK to see how the two interact, but this is probably more detail than you need.
Your operating system and filesystem support sparse files, and when that is the case, seek is implemented to make use of this feature.
This is not really related to Java; it is a feature of the underlying seek and write calls (fseek and fwrite in the C library, or their OS-level equivalents), which are most likely the backend behind the file implementation in the JRE you are using.
more info: https://en.wikipedia.org/wiki/Sparse_file
Is there a more efficient way to create a file of n bytes and fill it with 0s?
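On operating systems that support it, you could set the file to the desired size directly instead of issuing a write call. In Java, RandomAccessFile.setLength(long) does this, although its documentation states that the contents of the extended portion of the file are not defined.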
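One candidate in the standard library is RandomAccessFile.setLength(long), which extends (or truncates) a file to a given length. A minimal sketch follows; the class and method names are hypothetical. Note that the documentation leaves the contents of the extended portion undefined, so whether the new bytes read back as zeros, and whether the file ends up sparse, is an assumption that depends on the operating system and filesystem.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

class PreallocateSketch {
    // Resizes `target` to exactly `size` bytes without writing each byte from Java code.
    static long resizeTo(File target, long size) {
        try (RandomAccessFile file = new RandomAccessFile(target, "rw")) {
            file.setLength(size);
            return file.length();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper for the demo: works on a temporary file and cleans it up.
    static long createTempFileOfLength(long size) {
        try {
            File target = File.createTempFile("prealloc-sketch", ".bin");
            try {
                return resizeTo(target, size);
            } finally {
                target.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints 100000
        System.out.println(createTempFileOfLength(100_000L));
    }
}
```

If the file must be guaranteed to contain zeros on every platform, explicitly writing them is the only approach the API documentation actually promises.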

What is the initial "mode" of ByteBuffer?

While studying the ByteBuffer class I got to thinking about an array wrapped ByteBuffer that might be constructed as follows:
byte data[] = new byte[10];
// Populate data array
ByteBuffer myBuffer = ByteBuffer.wrap(data);
int i = myBuffer.getInt();
Which, I thought, might retrieve the first 4 bytes of my byte array as an int value. However, as I studied further, I seemed to find that the ByteBuffer has two modes, read and write, and that we can flip between them using the flip() method. However, since flip is basically a toggle, it presupposes that one knows the initial state in order to meaningfully flip between the read and write states.
What is the definition of the initial state of a ByteBuffer?
write?
read?
A function of how it was created (eg. allocate vs wrap)?
Strictly speaking the ByteBuffer itself doesn't track if it is "read" or "write"; that's merely a function of how it is used. A ByteBuffer can read and write at any time. The reason why we say flip switches the "mode" is because it is part of the common task of writing to the buffer, flipping it, then reading from the buffer.
Indeed, both allocate and wrap set the limit and capacity to be equal to the array size, and the position to zero. This means that read operations can read up to the whole array, and write operations can fill the whole array. You can therefore do either reading or writing with a newly allocated or wrapped ByteBuffer.
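A short sketch of both usages may help. It also shows that flip() is not really a toggle: it sets the limit to the current position and resets the position to zero, which is only meaningful after something has been written. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

class BufferStateSketch {
    // Reads straight from a freshly wrapped array: position is 0 and
    // limit equals capacity, so no flip() is needed first.
    static int readFirstInt(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        return buffer.getInt(); // byte order is big-endian by default
    }

    // Write-then-read cycle: flip() sets limit to the current position
    // and position back to 0, exposing exactly the bytes just written.
    static int writeThenRead(int value) {
        ByteBuffer buffer = ByteBuffer.allocate(10);
        buffer.putInt(value); // position = 4, limit = 10
        buffer.flip();        // position = 0, limit = 4
        return buffer.getInt();
    }

    public static void main(String[] args) {
        byte[] data = {0, 0, 1, 2, 0, 0, 0, 0, 0, 0};
        System.out.println(readFirstInt(data));  // 0x00000102, prints 258
        System.out.println(writeThenRead(42));   // prints 42
    }
}
```

Calling flip() on the freshly wrapped buffer before reading would set the limit to 0 and make getInt() throw a BufferUnderflowException, which is why the initial state matters.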

How do I read more than one byte with BufferedInputStream

This is an exact quote from my text:
The purpose of I/O buffering is to improve system performance.
Rather than reading a byte at a time, a large number of bytes are read together
the first time the read() method is invoked.
However, when I use BufferedInputStream.read() all I can do is get a single byte. What am I doing wrong and what do I need to do?
It's not you; it is the stream that reads more than one byte at a time. The BufferedInputStream keeps a buffer, and the next time you call read() the next byte from that buffer is returned without any access to a physical drive (unless the buffer is empty and the next chunk of data has to be read into the buffer).
Note there are methods that read more than a byte, but these don't really have to do with the difference you explicitly asked for in your question.
The BufferedInputStream class adds buffering to your input streams. Rather than reading one byte at a time from the network or disk, it reads a larger block at a time.
You can set the buffer size to be used internally by the BufferedInputStream with the following constructor
InputStream input = new BufferedInputStream(new FileInputStream("PathOfFile"), 2 * 1024);
This example sets the buffer size to 2 KB.
When the BufferedInputStream is created, an internal buffer array is created. As bytes from the stream are read or skipped, the internal buffer is refilled as necessary from the contained input stream, many bytes at a time.
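For completeness, here is a sketch of the array form of read(), which hands back several bytes per call. The helper name readUpTo is hypothetical, and a ByteArrayInputStream stands in for a FileInputStream so that the example is self-contained.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class BulkReadSketch {
    // Reads up to `wanted` bytes using the array form of read(), looping
    // because a single call may return fewer bytes than requested.
    static byte[] readUpTo(InputStream in, int wanted) {
        byte[] chunk = new byte[wanted];
        int filled = 0;
        try {
            while (filled < wanted) {
                int n = in.read(chunk, filled, wanted - filled);
                if (n == -1) {
                    break; // end of stream reached early
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(chunk, filled);
    }

    public static void main(String[] args) {
        byte[] source = {1, 2, 3, 4, 5, 6, 7, 8};
        // ByteArrayInputStream stands in for a FileInputStream here (assumption).
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(source), 2 * 1024);
        byte[] firstFive = readUpTo(in, 5);
        // prints "5 bytes, last = 5"
        System.out.println(firstFive.length + " bytes, last = " + firstFive[4]);
    }
}
```

Whether you call read() or read(byte[], int, int), the BufferedInputStream still fills its internal buffer in large chunks; the array form only saves method calls on your side.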

Custom java serialization of message

When writing a message to the wire, I want to write the number of bytes in the data, followed by the data itself.
Message format:
{num of bytes in data}{data}
I can do this by writing the data to a temporary ByteArrayOutputStream, obtaining the byte array and its size from it, and then writing the size followed by the byte array. This approach involves a lot of overhead, viz. unnecessary creation of temporary byte arrays, creation of temporary streams, etc.
Do we have a better (considering both CPU and garbage creation) way of achieving this?
A typical approach would be to introduce a re-useable ByteBuffer. For example:
ByteBuffer out = ...
int oldPos = out.position(); // Remember current position.
out.position(oldPos + 2); // Leave space for message length (unsigned short)
out.putInt(...); // Write out data.
// Finally prepend buffer with number of bytes.
out.putShort(oldPos, (short)(out.position() - (oldPos + 2)));
Once the buffer is populated you could then send the data over the wire using SocketChannel.write(ByteBuffer) (assuming you are using NIO).
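A self-contained version of this idea might look as follows. The names writeFrame and readFrame are hypothetical, and the payload is passed as a byte array only to keep the demo short; a real encoder would put its fields into the buffer directly, which is what avoids the temporary arrays.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class LengthPrefixSketch {
    // Appends one {length}{data} frame to `out`. The 2-byte length limits the
    // payload to 65535 bytes; it is written last, into the reserved slot.
    static void writeFrame(ByteBuffer out, byte[] payload) {
        int lengthPos = out.position();          // remember where the length goes
        out.position(lengthPos + Short.BYTES);   // leave room for the length
        out.put(payload);                        // write the data itself
        int dataLength = out.position() - (lengthPos + Short.BYTES);
        out.putShort(lengthPos, (short) dataLength); // absolute put: position unchanged
    }

    // Reads one frame back, treating the length as an unsigned short.
    static byte[] readFrame(ByteBuffer in) {
        int dataLength = in.getShort() & 0xFFFF;
        byte[] payload = new byte[dataLength];
        in.get(payload);
        return payload;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(1024); // reusable across messages
        writeFrame(buffer, "hello".getBytes(StandardCharsets.UTF_8));
        buffer.flip(); // switch from writing to reading (or to sending)
        // prints "hello"
        System.out.println(new String(readFrame(buffer), StandardCharsets.UTF_8));
        buffer.clear(); // ready for the next message
    }
}
```

Because the length is filled in after the data, nothing has to be serialized twice and the same buffer can be cleared and reused for every message.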
Here’s what I would do, in order of preference.
1. Don’t bother about memory consumption and such. Most likely your current approach already is the optimal solution, unless creating the byte representation of your data takes so long that creating it twice has a noticeable impact.
2. (Actually this would be more like #37 on my list, with #2 to #36 being empty.) Include a method in all your data objects that can calculate the size of the byte representation and takes fewer resources than creating the byte representation would.
