Java direct ByteBuffer - decode the characters

I would like to read the bytes into the direct ByteBuffer and then decode them without rewrapping the original buffer into the byte[] array to minimize memory allocations.
Hence I'd like to avoid using StandardCharsets.UTF_8.decode() as it allocates the new array on the heap.
I'm stuck on how to decode the bytes. Consider the following code, which writes a string into the buffer and then reads it back.
ByteBuffer byteBuffer = ByteBuffer.allocateDirect(2 << 16);
byteBuffer.put("Hello Dávid".getBytes(StandardCharsets.UTF_8));
byteBuffer.flip();
CharBuffer charBuffer = byteBuffer.asCharBuffer();
for (int i = charBuffer.position(); i < charBuffer.length(); i++) {
System.out.println(charBuffer.get());
}
The code output:
䡥汬漠
How can I decode the buffer?

I would like to read the bytes into the direct ByteBuffer and then decode them without rewrapping the original buffer into the byte[] array to minimize memory allocations.
ByteBuffer.asCharBuffer() fits your need, indeed, since both wrappers share the same underlying buffer.
This method's javadoc says:
The new buffer's position will be zero, its capacity and its limit will be the number of bytes remaining in this buffer divided by two
Although it is not stated explicitly, this hints that the CharBuffer view treats each pair of bytes as one UTF-16 code unit, in the byte order of the underlying buffer (big-endian by default). You cannot choose the encoding the view uses, so the bytes have to be written in that encoding. Note that StandardCharsets.UTF_16 prepends a byte-order mark when encoding, which would show up as an extra leading character in the view; UTF_16BE writes none:
byteBuffer.put("Hello Dávid".getBytes(StandardCharsets.UTF_16BE));
One thing about your printing loop: CharBuffer.length() is the number of chars remaining between the buffer's position and its limit, so it shrinks every time you call CharBuffer.get(). Either read with get(int), which does not move the position, or change the loop's termination condition to limit().
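The points above can be sketched as follows. This is a minimal illustration, not the asker's final code: the class and method names are made up, the bytes are written as UTF-16BE on the assumption that the buffer keeps its default big-endian order, and getBytes() is used only for brevity (it allocates a heap array).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

final class CharViewDemo {
    // Writes 'text' as UTF-16BE into a direct buffer and reads it back through a char view.
    static String readAll(String text) {
        ByteBuffer byteBuffer = ByteBuffer.allocateDirect(1024);
        // UTF_16BE writes no byte-order mark and matches the default big-endian order.
        byteBuffer.put(text.getBytes(StandardCharsets.UTF_16BE));
        byteBuffer.flip();

        CharBuffer charBuffer = byteBuffer.asCharBuffer(); // view over the same memory
        StringBuilder out = new StringBuilder();
        // hasRemaining() stays correct while get() advances the position,
        // unlike comparing a counter against the shrinking length().
        while (charBuffer.hasRemaining()) {
            out.append(charBuffer.get());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(readAll("Hello Dávid"));
    }
}
```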

You can't specify the encoding of a CharBuffer. See the question "What Charset does ByteBuffer.asCharBuffer() use?"
Also, since buffers are mutable and a String is always immutable, I don't see how you could create a String from a buffer without another memory allocation.
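One option that the answers above do not mention is a CharsetDecoder, which can decode UTF-8 bytes from a ByteBuffer into a CharBuffer that the caller allocates once and reuses. The sketch below is only an illustration: the class name, method name, and buffer sizes are assumptions, and whether the decoder avoids temporary copies internally for a direct buffer depends on the JDK implementation. Turning the result into a String still allocates, as noted above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class DirectDecodeDemo {
    // Decodes the remaining UTF-8 bytes of 'in' into the caller-supplied 'out' buffer.
    static void decodeUtf8(CharsetDecoder decoder, ByteBuffer in, CharBuffer out) {
        decoder.reset();
        CoderResult result = decoder.decode(in, out, true); // true: no more input follows
        if (result.isUnderflow()) {
            result = decoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow (out too small) or malformed input; real code would handle each case.
            throw new IllegalStateException("decode failed: " + result);
        }
        out.flip(); // make the decoded chars readable
    }

    public static void main(String[] args) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(1024);
        bytes.put("Hello Dávid".getBytes(StandardCharsets.UTF_8));
        bytes.flip();

        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder(); // reusable
        CharBuffer chars = CharBuffer.allocate(1024);                 // reusable after clear()
        decodeUtf8(decoder, bytes, chars);
        System.out.println(chars);
    }
}
```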

Related

Java - CRC32.update() on concatenated ByteBuffer

I have the following function:
byte[] test1 = {1,2,3,4};
byte[] test2 = {5,6,7,8};
ByteBuffer bbtest1 = ByteBuffer.wrap(test1).order(ByteOrder.LITTLE_ENDIAN);
ByteBuffer bbtest2= ByteBuffer.wrap(test2).order(ByteOrder.LITTLE_ENDIAN);
ByteBuffer contents = ByteBuffer.allocate(bbtest1.limit() + bbtest2.limit());
contents.put(bbtest1);
contents.put(bbtest2);
CRC32 checksum = new CRC32();
checksum.update(contents);
System.out.println(checksum.getValue());
No matter what values I assign to my byte arrays, getValue() always returns 0 once I have concatenated my ByteBuffers. According to this thread, this is a valid way of concatenating ByteBuffers. If I only call put() on a single ByteBuffer (for example if I comment out the line byte[] test2 = {5,6,7,8};), then getValue() actually returns a valid value.
Is this an issue with the way the ByteBuffers are concatenated, update(ByteBuffer) performs on concatenated ByteBuffers, or maybe something else altogether?

Your calls to put have advanced the position of the ByteBuffer to the point that there is no byte left to read from it for CRC32.update().
If you only put one of the two byte buffers, then there will still be 4 bytes to read for the CRC checksum (all 4 have the value 0).
You need to reset the position of your bytebuffer to zero before calling checksum.update(contents). You can use rewind or flip for that:
contents.rewind();
or
contents.flip();
flip() does the same as rewind(), but it also sets the limit of the ByteBuffer to the position it had before flipping. If you first fill the ByteBuffer and then want to read from it, flip() is the more correct choice, because you do not run the risk of reading parts of the ByteBuffer that you have not written yet.
EJP's answer is insightful, as he points out that you don't need to concatenate byte buffers at all.
You can alternatively do one of:
update the CRC32 with the individual ByteBuffers
use ByteBuffer.put(byte[]) to directly put the test1 and test2 byte arrays in the contents ByteBuffer
skip ByteBuffers altogether and call CRC32.update(byte[]) with test1 and test2 in sequence.
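A minimal sketch of the effect described above, with made-up class and helper names: the checksum of the unflipped buffer stays at 0 because nothing remains between position and limit, while the flipped buffer yields the same value as feeding the two arrays to CRC32 one after the other.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class Crc32FlipDemo {
    // Checksums whatever lies between the buffer's position and limit.
    static long crcOf(ByteBuffer buffer) {
        CRC32 checksum = new CRC32();
        checksum.update(buffer);
        return checksum.getValue();
    }

    // Copies both arrays into one buffer; position == limit afterwards.
    static ByteBuffer concat(byte[] a, byte[] b) {
        ByteBuffer contents = ByteBuffer.allocate(a.length + b.length);
        contents.put(a).put(b);
        return contents;
    }

    public static void main(String[] args) {
        byte[] test1 = {1, 2, 3, 4};
        byte[] test2 = {5, 6, 7, 8};

        ByteBuffer contents = concat(test1, test2);
        long unflipped = crcOf(contents); // no bytes remaining, so nothing is checksummed
        contents.flip();                  // limit = 8, position = 0
        long flipped = crcOf(contents);

        CRC32 separate = new CRC32();     // no ByteBuffer needed at all
        separate.update(test1);
        separate.update(test2);

        System.out.println(unflipped == 0);
        System.out.println(flipped == separate.getValue());
    }
}
```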

You need to prepare the buffer for getting (or writing to a channel) by calling flip() after the puts (or reads from a channel), and compact() it afterwards.
But I don't know why you're concatenating ByteBuffers at all. Just do
checksum.update(bbtest1);
checksum.update(bbtest2);
There's no advantage to creating yet another copy of the data, and NIO contains scatter/gather methods to operate on multiple ByteBuffers at once.

ByteArrayOutputStream capacity restriction

I create ByteArrayOutputStream barr = new ByteArrayOutputStream(1);, i.e. with a capacity of 1 byte, and then write more than 1 byte to it: barr.write("123456789000000".getBytes());. No error occurs, and when I check the length of barr it is 15. Why was my write not blocked or wrapped? Is there a way to prevent writing more than the capacity, and which OutputStream could be used for that?
I am very limited in available memory and don't want to write more than my limits allow.
P.S.
Thanks a lot for the answers! I have a follow-up question.
It would be great if you could take a look.

ByteArrayOutputStream will grow the backing array if you try to write more bytes. This is usually thought of as a good thing.
If you want different behavior, you can always write your own OutputStream implementation that throws an IOException if the number of bytes to write goes beyond the capacity.
ByteArrayOutputStream is not final, so you can extend it. I think all you would have to do is override write(int) and write(byte[], int, int) to throw an exception if the number of bytes to write exceeds the space remaining. Because those two methods do not declare IOException in ByteArrayOutputStream, the overrides have to throw an unchecked exception. The fields buf and count are protected, so your subclass can see the length of the backing array and how much of it has been written.
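The subclass idea might look like the sketch below. The class name and the choice of IllegalStateException are assumptions made for illustration; the limit is tracked in a separate field rather than read from buf.length, so it holds even if the backing array were ever resized.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical class; not part of the JDK.
final class CappedByteArrayOutputStream extends ByteArrayOutputStream {
    private final int maxBytes;

    CappedByteArrayOutputStream(int maxBytes) {
        super(maxBytes); // allocate the full budget up front, so no regrowth is needed
        this.maxBytes = maxBytes;
    }

    @Override
    public synchronized void write(int b) {
        ensureRoom(1);
        super.write(b);
    }

    @Override
    public synchronized void write(byte[] b, int off, int len) {
        ensureRoom(len);
        super.write(b, off, len);
    }

    // 'count' is the protected field holding the number of bytes written so far.
    private void ensureRoom(int len) {
        if (len > maxBytes - count) {
            throw new IllegalStateException(
                    "write of " + len + " bytes exceeds limit of " + maxBytes);
        }
    }

    public static void main(String[] args) {
        CappedByteArrayOutputStream out = new CappedByteArrayOutputStream(4);
        out.write(new byte[] {1, 2, 3}, 0, 3);
        try {
            out.write(new byte[] {4, 5}, 0, 2);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println(out.size());
    }
}
```

A rejected write leaves the stream unchanged, so the caller can decide whether to flush, truncate, or give up.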

That is because the capacity that you specify to the constructor is the initial size of the buffer. If you write more data, the buffer will be automatically re-allocated with a larger size, to fit more data.
As far as I know, there is no way with ByteArrayOutputStream to limit the growth of the buffer. You could use something else instead, for example a java.nio.ByteBuffer, which has a fixed size.
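As a rough illustration of the fixed-size alternative (class and method names are made up): a bulk put into a ByteBuffer is all-or-nothing, and throws BufferOverflowException instead of growing when the data does not fit.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

final class FixedBufferDemo {
    // Returns true if 'data' fits into a fresh buffer of 'capacity' bytes.
    static boolean fits(int capacity, byte[] data) {
        ByteBuffer buffer = ByteBuffer.allocate(capacity);
        try {
            buffer.put(data); // throws if data.length > buffer.remaining(); nothing is copied then
            return true;
        } catch (BufferOverflowException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = "123456789000000".getBytes();
        System.out.println(fits(1, data));
        System.out.println(fits(15, data));
    }
}
```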

I had the same problem, and eventually turned up an existing implementation of the same thing within Hadoop - BoundedByteArrayOutputStream.java

What is the initial "mode" of ByteBuffer?

While studying the ByteBuffer class I got to thinking about an array wrapped ByteBuffer that might be constructed as follows:
byte data[] = new byte[10];
// Populate data array
ByteBuffer myBuffer = ByteBuffer.wrap(data);
int i = myBuffer.getInt();
I thought this would retrieve the first 4 bytes of my byte array as an int value. However, as I studied further, I seemed to find that a ByteBuffer has two modes, read and write, and that we can switch between them using the flip() method. Since flip is basically a toggle, it presupposes that one knows the initial mode in order to flip meaningfully between the read and write states.
What is the definition of the initial state of a ByteBuffer?
write?
read?
A function of how it was created (eg. allocate vs wrap)?

Strictly speaking the ByteBuffer itself doesn't track if it is "read" or "write"; that's merely a function of how it is used. A ByteBuffer can read and write at any time. The reason why we say flip switches the "mode" is because it is part of the common task of writing to the buffer, flipping it, then reading from the buffer.
Indeed, both allocate and wrap set the limit and capacity to be equal to the array size, and the position to zero. This means that read operations can read up to the whole array, and write operations can fill the whole array. You can therefore do either reading or writing with a newly allocated or wrapped ByteBuffer.
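The following sketch (class and helper names are made up) prints position, limit, and capacity at each step. It shows that a freshly wrapped buffer can be read immediately, and that flip() is not a toggle: it sets the limit to the current position and the position to zero, so flipping twice leaves an empty readable range.

```java
import java.nio.ByteBuffer;

final class BufferStateDemo {
    static String state(ByteBuffer b) {
        return "pos=" + b.position() + " lim=" + b.limit() + " cap=" + b.capacity();
    }

    public static void main(String[] args) {
        byte[] data = {0, 0, 1, 2, 9, 9, 9, 9, 9, 9};
        ByteBuffer myBuffer = ByteBuffer.wrap(data);
        System.out.println(state(myBuffer)); // pos=0 lim=10 cap=10

        int i = myBuffer.getInt();           // big-endian by default: 0x00000102
        System.out.println(i);               // 258
        System.out.println(state(myBuffer)); // pos=4 lim=10 cap=10

        myBuffer.flip();                     // limit = old position, position = 0
        System.out.println(state(myBuffer)); // pos=0 lim=4 cap=10

        myBuffer.flip();                     // limit = 0: nothing left to read or write
        System.out.println(state(myBuffer)); // pos=0 lim=0 cap=10
    }
}
```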

Does java socket read the data exactly as it's sent

Suppose we have a socket connection between two devices (A and B). On side A, I write only 16 bytes (the size doesn't matter here) to the socket's output stream (not a BufferedOutputStream) 3 times, or in general more than once, like this:
OutputStream outputStream = socket.getOutputStream();
byte[] buffer = new byte[16];
outputStream.write(buffer);
outputStream.write(buffer);
outputStream.write(buffer);
On side B, I read the data from the socket's input stream (not a BufferedInputStream) with a buffer larger than the sending buffer, for example 1024 bytes:
InputStream inputStream = socket.getInputStream();
byte[] buffer = new byte[1024];
int read = inputStream.read(buffer);
Now I wonder how the data is received on side B. Can it accumulate, or is it read exactly as A sends it? In other words, can the read variable be more than 16?

InputStream makes very few guarantees about how much data will be read by any invocation of the multi-byte read() methods. There is a whole class of common programming errors that revolve around misunderstandings and wrong assumptions about that. For example,
if InputStream.read(byte[]) reads fewer bytes than the provided array can hold, that does not imply that the end of the stream has been reached, or even that another read will necessarily block.
the number of bytes read by any one invocation of InputStream.read(byte[]) does not necessarily correlate to any characteristic of the byte source on which the stream draws, except that it will always read at least one byte when not at the end of the stream, and that it will not read more bytes than are actually available by the time it returns.
the number of available bytes indicated by the available() method does not reliably indicate how many bytes a subsequent read should or will read. A nonzero return value reliably indicates only that the next read will not block; a zero return value tells you nothing at all.
Subclasses may make guarantees about some of those behaviors, but most do not, and anyway you often do not know which subclass you actually have.
In order to use InputStreams properly, you generally must be prepared to perform repeated reads until you get sufficient data to process. That can mean reading until you have accumulated a specific number of bytes, or reading until a delimiter is encountered. In some cases you can handle any number of bytes from any given read; generally these are cases where you are looping anyway, and feeding everything you read to a consumer that can accept variable length chunks (many compression and encryption interfaces are like that, for example).
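The "read until you have a specific number of bytes" case can be sketched like this. The class and method names are illustrative; the JDK's DataInputStream.readFully offers similar behavior if wrapping the stream is acceptable.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class ReadLoopDemo {
    // Reads exactly 'length' bytes, looping because a single read may return fewer.
    static byte[] readExactly(InputStream in, int length) throws IOException {
        byte[] result = new byte[length];
        int filled = 0;
        while (filled < length) {
            // Store the next chunk right after what has been read so far.
            int n = in.read(result, filled, length - filled);
            if (n == -1) {
                throw new EOFException(
                        "stream ended after " + filled + " of " + length + " bytes");
            }
            filled += n;
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a socket stream carrying three 16-byte messages.
        InputStream in = new ByteArrayInputStream(new byte[48]);
        for (int message = 0; message < 3; message++) {
            byte[] chunk = readExactly(in, 16);
            System.out.println(chunk.length);
        }
    }
}
```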

Per the docs:
public int read(byte[] b) throws IOException
Reads some number of bytes from the input stream and stores them into the buffer array b. The number of bytes actually read is returned as an integer. This method blocks until input data is available, end of file is detected, or an exception is thrown. If the length of b is zero, then no bytes are read and 0 is returned; otherwise, there is an attempt to read at least one byte. If no byte is available because the stream is at the end of the file, the value -1 is returned; otherwise, at least one byte is read and stored into b.
The first byte read is stored into element b[0], the next one into b[1], and so on. The number of bytes read is, at most, equal to the length of b. Let k be the number of bytes actually read; these bytes will be stored in elements b[0] through b[k-1], leaving elements b[k] through b[b.length-1] unaffected.
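read(...) tells you how many bytes it put into the array, and yes, that number can be more than 16. TCP is a byte stream and does not preserve the boundaries of the individual writes, so a single read on side B may return anything from 1 byte up to all 48 bytes that have arrived so far. Calling read again gives you whatever has arrived since.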

The off parameter in BufferedInputStream.read(byte[] b, int off, int len)

The javadoc says the following.
Parameters:
b - destination buffer.
off - offset at which to start storing bytes.
len - maximum number of bytes to read.
I would like to confirm my understanding of the "offset at which to start storing bytes". Does this mean that off is "the index in the destination buffer b at which to start storing bytes"? That is how it reads to me. I think the method would be more usable if off were the "offset in the BufferedInputStream at which to start reading bytes", but I want to confirm. I tried looking at the source but it's hard to read.
A side question: When 1024 bytes of a stream is read, will the said 1024 bytes always be removed from the stream?

Does this mean that off is "the index at the destination buffer b at which to start storing bytes"?
It's documented: "The first byte read is stored into element b[off]".
When 1024 bytes of a stream is read, will the said 1024 bytes always be removed from the stream?
Of course, but you seem to be really asking whether 1024 bytes will always be read if you supply a buffer of 1024 bytes. Answer is no. It's documented: "there is an attempt to read at least one byte".

Yes. off is the index in b at which the stream starts storing up to len bytes.
When 1024 bytes of a stream is read, will the said 1024 bytes always be removed from the stream?
Using an InputStream, you have no knowledge of what's going on underneath. All you know are the methods available to you and what they do (what their documentation says). Implementations can do whatever they want.
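A small sketch of what off does, using an in-memory source and made-up names: the bytes always come from the stream's current position, and off only selects where in b they are stored. To start later in the stream itself, skip(long) is the usual tool.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;

final class OffsetReadDemo {
    // Performs one read of up to 'len' bytes from 'source' into a new array, starting at index 'off'.
    static byte[] readAtOffset(byte[] source, int off, int len, int destSize) throws IOException {
        byte[] b = new byte[destSize];
        try (BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(source))) {
            // The return value (bytes actually read) is ignored here for brevity.
            in.read(b, off, len);
        }
        return b;
    }

    public static void main(String[] args) throws IOException {
        byte[] source = {10, 20, 30, 40};
        // Stores the first three stream bytes at b[2], b[3], b[4].
        System.out.println(Arrays.toString(readAtOffset(source, 2, 3, 6))); // [0, 0, 10, 20, 30, 0]
    }
}
```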
