Java Array of Bytes - java

If I create an array of bytes with byte[], what would be the size of each element? Can they be resized/merged?
Thanks,

Not sure what you meant by resized and merged
from the documentation:
byte: The byte data type is an 8-bit signed two's complement integer. It has a minimum value of -128 and a maximum value of 127 (inclusive). The byte data type can be useful for saving memory in large arrays, where the memory savings actually matters. They can also be used in place of int where their limits help to clarify your code; the fact that a variable's range is limited can serve as a form of documentation.
Edit: If by resized/merged you are talking about the array itself, there's nothing special about a byte array compared to other arrays.

There are two ways to allocate an array.
A) allocate an empty array of a given size:
byte[] ba1 = new byte[18]; // 18 elements
B) allocate an array by specifying the contents
byte[] ba2 = {1,2,3,4,5}; // 5 elements

The size would be a byte per element.
They can not be re-sized. However you can merge them yourself using System.arrayCopy() by creating a new array and copying your source arrays into the new array.
Edit 1:
There is also an 8-byte overhead for the object header and a 4-byte overhead for the array length, for a total overhead of 12 bytes. So small arrays are relatively expensive.
Check out GNU Trove and Fastutil. They are libraries that make working with primitive collections easier.
Edit 2:
I read in one of your response that you're doing object serialization. You might be interested in ByteBuffers. Those make it easy to write out various primitive types to a wrapped array and get the resulting array. Also check out Google protocol buffers if you want easily serialized structured data types.

Related

Java hashmap with single byte[] array for all keys

I have a rather large dataset consisting of 2.3GB worth of data spread over 160 Million byte[] arrays with an average data length of 15 bytes. The value for each byte[] key is only an int so memory usage of nearly half the hashmap (which is over 6GB) is made up of the 16 byte overhead of each byte array
overhead = 8 byte header + 4 byte length rounded up by VM to 16 bytes.
So my overhead is 2.5GB.
Does anybody know of a hashmap implementation that stores its (variable length) byte[] keys in one single large byte array so there would be no overhead (apart from 1 byte length field)?
I would rather not use an in memory DB as they usually have a performance overhead compared to a plain Trove TObjectIntHashMap that i'm using and I value cpu cycles even more then memory usage.
Thanks in advance
Since most personal computers have 16GB these days and servers often 32 - 128 GB or more, is there a real problem with there being a degree of bookkeeping overhead?
If we consider the alternative: the byte data concatenated into a single large array -- we should think about what individual values would have to look like, to reference a slice of a larger array.
Most generally you would start with:
public class ByteSlice {
protected byte[] array;
protected int offset;
protected int len;
}
However, that's 8 bytes + the size of a pointer (perhaps just 4 bytes?) + the JVM object header (12 bytes on a 64-bit JVM). So perhaps total 24 bytes.
If we try and make this single-purpose & minimalist, we're still going to need 4 bytes for the offset.
public class DedicatedByteSlice {
protected int offset;
protected byte len;
protected static byte[] getArray() {/*somebody else knows about the array*/}
}
This is still 5 bytes (probably padded to 8) + the JVM object header. Probably still total 20 bytes.
It seems that the cost of dereferencing with an offset & length, and having an object to track that, are not substantially less than the cost of storing the small array directly.
One further theoretical possibility -- de-structuring the Map Key so it is not an object
It is possible to conceive of destructuring the "length & offset" data such that it is no longer in an object. It then is passed as a set of scalar parameters eg (length, offset) and -- in a hashmap implementation -- would be stored by means of arrays of the separate components (eg. instead of a single Object[] keyArray).
However I expect it is extremely unlikely that any library offers an existing hashmap implementation for your (very particular) usecase.
If you were talking about the values it would probably be pointless, since Java does not offer multiple returns or method OUT parameters; which makes communication impractical without "boxing" destructured data back to an object. Since here you are asking about Map Keys specifically, and these are passed as parameters but need not be returned, such an approach may be theoretically possible to consider.
[Extended]
Even given this it becomes tricky -- for your usecase the map API probably has to become asymmetrical for population vs lookup, as population has to be by (offset, len) to define keys; whereas practical lookup is likely still by concrete byte[] arrays.
OTOH: Even quite old laptops have 16GB now. And your time to write this (times 4-10 to maintain) should be worth far more than the small cost of extra RAM.

Efficient growable java byte array that allows access to bytes

Does anyone know of a Java class to store bytes that satisfies the following conditions?
Stores bytes efficiently (i.e. not one object per bytes).
Grows automatically, like a StringBuilder.
Allows indexed access to all of its bytes (without copying everything to a byte[].
Nothing I've found so far satisfies these. Specifically:
byte[] : Doesn't satisfy 2.
ByteBuffer : Doesn't satisfy 2.
ByteArrayOutputStream : Doesn't satisfy 3.
ArrayList : Doesn't satisfy 1 (AFAIK, unless there's some special-case optimisation).
If I can efficiently remove bytes from the beginning of the array that would be nice. If I were writing it from scratch I would implement it as something like
{ ArrayList<byte[256]> data; int startOffset; int size; }
and then the obvious functions. Does something like this exist?
Most straightforward would be to subclass ByteArrayOutputStream and add functionality to access the underlying byte[].
Removal of bytes from the beginning can be implemented in different ways depending on your requirements. If you need to remove a chunk, System.arrayCopy should work fine, if you need to remove single bytes I would put a headIndex which would keep track of the beginning of the data (performing an arraycopy after enough data is "removed").
There are some implementations for high performance primitive collections such as:
hppc or Koloboke
You'd have to write one. Off the top of my head what I would do is create an ArrayList internally and store the bytes 4 to each int, with appropriate functions for masking off the bytes. Performance will be sub optimal for removing and adding individual bytes. However it will store the object in the minimal size if that is a real consideration, wasting no more than 3 bytes for storage (on top of the overhead for the ArrayList).
The laziest method will be ArrayList. Its not as inefficient as you seem to believe, since Byte instances can and will be shared, meaning there will be only 256 byte objects in the entire VM unless you yourself do a "new Byte()" somewhere.

size of java byte[] in memory

So i created a very large array of byte[] so something like byte[50,000,000][16].. so according to my math thats 800,000,000 bytes which is 0.8GB plus some overhead.
To my surprise when to do
memoryBefore = runtime.totalMemory() - runtime.freeMemory()
its using 1.8GB of memory. when i point a profiler at it i get this
https://www.dropbox.com/s/el9vlav5ylg781z/Screen%20Shot%202015-08-30%20at%202.55.00%20pm.png?dl=0
I can see most byte[] are 24bytes not 16 as expected and i see quite a few much larger byte[] of size 472 or more.. Does anyone know whats going on here?
Thanks for reading
All objects have an overhead to maintain the object, information like "type" aka Class. See What is the memory consumption of an object in Java?.
Arrays also have a length, and the overhead might be bigger in 64-bit Java.
Since you are allocating 50,000,000 arrays of 16 bytes, you get 50_000_000 * (16 + overhead) + (overhead + 50_000_000 * pointerSize). Second part is the outer array of arrays.
Depending on your requirements, you might be able to improve this in one of two ways:
Flip the indices of you 2-dimensional array to byte[16][50_000_000]. This reduces overhead from 50,000,001 to 17 and reduced outer array size.
Use a single array byte[16 * 50_000_000] and do offset logic yourself. This will keep your 16 bytes contiguous and eliminate all overhead.
I can see most byte[] are 24bytes not 16
An object in Java has a couple of words of header information, in addition to the space that holds the object's fields. In the case of an array, there is an additional word to hold the array's length field. The length actually needs 4 bytes ('cos length is an int), but the JVM is most likely aligning to an 8 byte boundary on your platform.
If you are seeing 16 byte arrays occupying 24 bytes, then that accounting of space is most likely including (just) the length word.
(Note that the actual space occupied by object / array headers is JVM and platform specific.)
... I see quite a few much larger byte[] of size 472 or more.
Those are unrelated. If some other part of your code is not creating them explicitly, they are most likely created by Java runtime libraries.

What is the reason for the overhead in memory usage for arrays in Java?

In Java, the character data type, char, is represented with 2 bytes. The array of n characters, char[], is represented with 2n+24 bytes.
In general, there is an overhead of 24 bytes for storing an array of n objects (at least if the objects are of primitive type).
Why do we need these additional 24 bytes? How are they used?
EDIT (July 2nd 2015). It was brought to my attention in a comment that an answer to this question is offered here on the programmers StackExchange.
It is the object header, it includes information about the object itself (locked bits, marked bits for the GC), a pointer to its class object and the length.
In addition to object headers and length prefix all java objects also have to be aligned, usually to pointer sizes, possibly multiples thereof (there's an option to tune object alignment).
The need for alignment means there has to be padding end of each object - and thus the array - if the subsequent object wouldn't align properly. That concern will be most prominent with byte[] arrays.

Java - Integer[] and int[] memory difference

I'm trying to get a general idea of the memory cost difference between an Integer array and int array. While there seems to be a lot of information out there about the differences between a primitive int and Integer object, I'm still a little confused as to how to calculate the memory costs of an int[] and Integer[] array (overhead costs, padding, etc).
Any help would be appreciated. Thanks!
In addition to storing the length of the array, an array of ints needs space for N 4-byte elements, while an array of Integers needs space for N references, whose size is platform-dependent; commonly, that would be 4 bytes on 32-bit platforms or 8 bytes on 64-bit platforms.
As far as int[] goes, there is no additional memory required to store data. Integer[], on the other hand, needs objects of type Integer, which could be all distinct or shared (e.g. through interning of small numbers implemented by the Java platform itself). Therefore, Integer[] requires up to N additional objects, each one containing a 4-byte int.
Assuming that all Integers in an Integer[] array are distinct objects, the array and its content will take two to three times the space of an int[] array. On the other hand, if all objects are shared, and the memory costs of shared objects are accounted for, there may be no additional overhead at all (on 32-bit platforms) or the there would be a 2x overhead on 64-bit platforms.
Here is a comparison on jdk6u26 of the size of an array of 1024 Integers as opposed to 1024 ints. Note that in the case of anInteger[] array containing low number Integers, these can be shared with other uses of these Integers in the JVM by the auto-box cache.

Categories