Java efficiently comparing bits of equally sized byte arrays - java

How would one efficiently compare the bits of two equally sized byte[] arrays in Java? A BitSet could be constructed from each of the given arrays, but this approach is presumably not as efficient as stepping through the arrays using bit manipulation and bitmasks. What would such an implementation look like?

"A BitSet could be constructed from each of the given arrays, but this approach is presumably not as efficient as stepping through the arrays using bit manipulation and bitmasks."
BitSet actually uses bit manipulation and bit masks under the hood.
If you still want to use only byte[] (e.g. to avoid the overhead of constructing a BitSet from a byte[]), take a look at the implementation of BitSet#get(int). The basic procedure is:

Create a long value by reading a 64-bit chunk from the first array;
Create another long value by reading the corresponding 64-bit chunk from the second array;
If the XOR of the two longs is not 0, the chunks differ in at least one bit;
If you reach the end of the arrays and every XOR was 0, the arrays are identical.
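
The steps above can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: the class and method names (ByteArrayBits, sameBits, differingBits) are made up for illustration, and it assumes ByteBuffer.wrap is an acceptable way to read 64-bit chunks. Trailing bytes that do not fill a whole long are compared one at a time.

```java
import java.nio.ByteBuffer;

// Hypothetical helper class; the names are illustrative only.
final class ByteArrayBits {

    /** Returns true if both arrays have the same length and identical bits. */
    static boolean sameBits(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        // Compare 8 bytes (one long) at a time; a non-zero XOR means a difference.
        while (x.remaining() >= Long.BYTES) {
            if ((x.getLong() ^ y.getLong()) != 0L) {
                return false;
            }
        }
        // Compare the remaining 0..7 bytes individually.
        while (x.hasRemaining()) {
            if ((x.get() ^ y.get()) != 0) {
                return false;
            }
        }
        return true;
    }

    /** Counts the bit positions at which the two arrays differ. */
    static int differingBits(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int count = 0;
        while (x.remaining() >= Long.BYTES) {
            count += Long.bitCount(x.getLong() ^ y.getLong());
        }
        while (x.hasRemaining()) {
            // Mask to the low 8 bits because bytes are sign-extended to int.
            count += Integer.bitCount((x.get() ^ y.get()) & 0xFF);
        }
        return count;
    }
}
```

If only equality matters, java.util.Arrays.equals(byte[], byte[]) is the simpler choice; the XOR form is mainly useful when you also want to know which or how many bits differ.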

Related

Efficient growable java byte array that allows access to bytes

Does anyone know of a Java class to store bytes that satisfies the following conditions?
1. Stores bytes efficiently (i.e. not one object per byte).
2. Grows automatically, like a StringBuilder.
3. Allows indexed access to all of its bytes (without copying everything to a byte[]).
Nothing I've found so far satisfies these. Specifically:
byte[] : Doesn't satisfy 2.
ByteBuffer : Doesn't satisfy 2.
ByteArrayOutputStream : Doesn't satisfy 3.
ArrayList<Byte> : Doesn't satisfy 1 (AFAIK, unless there's some special-case optimisation).
If I can efficiently remove bytes from the beginning of the array that would be nice. If I were writing it from scratch I would implement it as something like
{ ArrayList<byte[256]> data; int startOffset; int size; }
and then the obvious functions. Does something like this exist?
The most straightforward approach would be to subclass ByteArrayOutputStream and add functionality to access the underlying byte[].
Removal of bytes from the beginning can be implemented in different ways depending on your requirements. If you need to remove a chunk, System.arraycopy should work fine. If you need to remove single bytes, I would keep a headIndex that tracks the beginning of the data (performing an arraycopy after enough data has been "removed").
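
A minimal sketch of that idea, assuming the protected fields buf and count of ByteArrayOutputStream are used directly. The class name IndexedByteStream, the head field, and the "compact once more than half is dead" threshold are illustrative choices, not part of any library. Inherited methods such as writeTo and toString are not head-aware in this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical subclass: growable byte storage with indexed access
// and cheap removal from the front.
final class IndexedByteStream extends ByteArrayOutputStream {
    private int head; // index in buf of the first live byte

    /** Number of live bytes (written minus removed). */
    @Override
    public synchronized int size() {
        return count - head;
    }

    /** Returns the live byte at the given index without copying. */
    synchronized byte get(int index) {
        if (index < 0 || index >= count - head) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buf[head + index];
    }

    /** Drops the first n live bytes; compacts only occasionally. */
    synchronized void removeFirst(int n) {
        if (n < 0 || n > count - head) {
            throw new IllegalArgumentException("n " + n);
        }
        head += n;
        // Shift the live bytes down once more than half of buf's used part is dead.
        if (head > count / 2) {
            System.arraycopy(buf, head, buf, 0, count - head);
            count -= head;
            head = 0;
        }
    }

    @Override
    public synchronized byte[] toByteArray() {
        return Arrays.copyOfRange(buf, head, count);
    }

    @Override
    public synchronized void reset() {
        super.reset();
        head = 0;
    }
}
```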
There are some implementations of high-performance primitive collections, such as HPPC or Koloboke.
You'd have to write one. Off the top of my head, what I would do is create an ArrayList<Integer> internally and store the bytes 4 to each int, with appropriate functions for masking off the bytes. Performance will be suboptimal for removing and adding individual bytes. However, it will store the object in the minimal size if that is a real consideration, wasting no more than 3 bytes of storage (on top of the overhead for the ArrayList).
The laziest method would be ArrayList<Byte>. It's not as inefficient as you seem to believe, since Byte instances can and will be shared, meaning there will be only 256 Byte objects in the entire VM unless you yourself do a "new Byte()" somewhere.

Java multi-bit / compact small integer array

I am working on implementing some bloom filter variants, and a very useful data structure for this would be a compact multi-bit array; that is, an array where each element is a compact integer of around 4 bits.
Space efficiency is of the utmost importance here, so while a plain integer array would give me the functionality I want, it would be bulkier than necessary.
Before I try to implement this functionality myself with bit arithmetic, I was wondering if anyone knows of a library out there that already provides such a data structure.
Edit: Static size is fine.
The ideal case would be an implementation that is flexible with regard to the number of bits per cell. That might be a bit much to hope for though (no pun intended?).
If you aren't modifying the array after creation, java.util.BitSet does all the bit masking for you but is slow to access since you have to fetch each bit individually and do the masking yourself to re-create the int from 4 bits.
Having said that, writing it yourself might be the best way to go. Doing the bit arithmetic yourself isn't that difficult since there are only 2 values per byte: the high 4 bits are decoded with (array[i] & 0xF0) >> 4 and the low 4 bits with array[i] & 0x0F.
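
A fixed-size 4-bit array built on those two expressions might look like the sketch below. The class name NibbleArray and its layout (even indices in the low half of a byte, odd indices in the high half) are assumptions made for this example.

```java
// Hypothetical fixed-size array of 4-bit unsigned values (0..15),
// packed two per byte.
final class NibbleArray {
    private final byte[] data;
    private final int size;

    NibbleArray(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("negative size");
        }
        this.size = size;
        this.data = new byte[(size + 1) / 2]; // round up for odd sizes
    }

    int size() {
        return size;
    }

    /** Reads the 4-bit value at index i. */
    int get(int i) {
        checkIndex(i);
        int b = data[i >> 1];
        // Even index: low nibble. Odd index: high nibble.
        return (i & 1) == 0 ? (b & 0x0F) : ((b & 0xF0) >> 4);
    }

    /** Stores a value in 0..15 at index i, leaving the neighbour nibble intact. */
    void set(int i, int value) {
        checkIndex(i);
        if (value < 0 || value > 15) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        int b = data[i >> 1];
        if ((i & 1) == 0) {
            b = (b & 0xF0) | value;
        } else {
            b = (b & 0x0F) | (value << 4);
        }
        data[i >> 1] = (byte) b;
    }

    private void checkIndex(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("index " + i);
        }
    }
}
```

For a counting bloom filter, increment and decrement could be layered on top of get and set, saturating at 15.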
Take a look at the compressed BitSet provided by http://code.google.com/p/javaewah/. It lets you set bits freely and uses compression to keep memory usage low.
I.e. something like
EWAHCompressedBitmap32 set = new EWAHCompressedBitmap32();
set.set(0);
set.set(1000000);
will still only occupy a few bytes, not roughly 125 KB as with the Java BitSet...
You should be able to map the 4-bit integers onto the BitSet by multiplying the index into the BitSet accordingly.

Is there a way to efficiently store a sequence of enum values in Java?

I'm looking for a way to encode a sequence of enum values in Java that packs better than one object reference per element. In fantasy-code:
List<MyEnum> list = new EnumList<MyEnum>(MyEnum.class);
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits per element. Is there an existing implementation for this, or a simple way to do it?
It would be sufficient to have a class that encodes a sequence of numbers of arbitrary radix (i.e. if there are 5 possible enum values then use base 5) into a sequence of bytes, since a simple wrapper class could be used to implement List<MyEnum>.
I would prefer a general, existing solution, but as a poor man's solution I might just use an array of longs and radix-encode as many elements as possible into each long. With 5 enum values, 27 elements will fit into a long and waste only ~1.3 bits, which is pretty good.
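
The "poor man's solution" of radix-encoding into a long can be sketched as below. The class and method names (RadixPacker, pack, unpack) are invented for this example; the digits would be the enum ordinals. With radix 5, 27 digits fit because 5^27 is smaller than 2^63.

```java
// Hypothetical helper that packs small numbers (e.g. enum ordinals)
// into a single long using an arbitrary radix.
final class RadixPacker {

    /** Packs digits, each in [0, radix), into one long; digits[0] is least significant. */
    static long pack(int[] digits, int radix) {
        long packed = 0;
        for (int i = digits.length - 1; i >= 0; i--) {
            if (digits[i] < 0 || digits[i] >= radix) {
                throw new IllegalArgumentException("digit out of range: " + digits[i]);
            }
            // The exact-arithmetic methods throw ArithmeticException on overflow
            // instead of silently wrapping when too many digits are supplied.
            packed = Math.addExact(Math.multiplyExact(packed, (long) radix), (long) digits[i]);
        }
        return packed;
    }

    /** Recovers count digits from a value produced by pack with the same radix. */
    static int[] unpack(long packed, int radix, int count) {
        int[] digits = new int[count];
        for (int i = 0; i < count; i++) {
            digits[i] = (int) (packed % radix);
            packed /= radix;
        }
        return digits;
    }
}
```

A List<MyEnum> wrapper could keep a long[] of such packed words and translate between ordinals and constants with MyEnum.values().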
Note: I'm not looking for a set implementation. That wouldn't preserve the sequence.
You can store bits in an int (32 bits, 32 "switches"). But aside from the exercise value, what's the point? You're really talking about a very small amount of memory. A better question might be: why do you want to save a few bytes in enum references? Other parts of your program are likely to be using much more memory.
If you're concerned with transferring data efficiently, you could consider leaving the Enums alone but using custom serialization, though again, it'd be an unusual situation where it'd be worth the effort.
One object reference typically occupies one 32-bit or 64-bit word. To do better than that, you need to convert the enum values into numbers that are smaller than 32 bits, and hold them in an array.
Converting to a number is as simple as calling ordinal(). From there you could:
cast to a byte or short, then represent the sequence as an array of byte / short values, or
use a suitable compression algorithm on the array of int values.
Of course, all of this comes at the cost of making your code more complicated. For instance you cannot make use of the collection APIs, and you have to do your own sequence management. I doubt that this will be worth it unless you have to deal with very large sequences or huge numbers of sequences.
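
As a rough illustration of the byte-array option, the sketch below stores one ordinal per byte. The class name EnumBytes and its methods are hypothetical, and it assumes the enum has at most 256 constants.

```java
import java.util.List;

// Hypothetical helper: represents a sequence of enum values as one byte each.
final class EnumBytes {

    /** Encodes each value as its ordinal; requires ordinals to fit in one byte. */
    static <E extends Enum<E>> byte[] encode(List<E> values) {
        byte[] out = new byte[values.size()];
        for (int i = 0; i < out.length; i++) {
            int ordinal = values.get(i).ordinal();
            if (ordinal > 0xFF) {
                throw new IllegalArgumentException("ordinal does not fit in a byte");
            }
            out[i] = (byte) ordinal;
        }
        return out;
    }

    /** Decodes the element at index back into its enum constant. */
    static <E extends Enum<E>> E decode(byte[] data, int index, Class<E> type) {
        // getEnumConstants() copies the array on each call; cache it in real code.
        return type.getEnumConstants()[data[index] & 0xFF];
    }
}
```

Note that ordinals change if constants are reordered, so this encoding is only safe while the enum declaration stays stable.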
In principle it should be possible to encode each element using log2(MyEnum.values().length) bits.
In fact you may be able to do better than that ... by compressing the sequences. It depends on how much redundancy there is.

Which is faster and uses less memory in Java: int[] or boolean[]?

In Java, which is faster and uses less memory: int[n], boolean[n], or maybe BitSet(n)?
The question is applicable for arrays of small (n is up to 1000), medium (n is between 1000 and 100000) and huge (n is greater than 100000) sizes. Thank you.
I want to achieve flags (1/0) storage.
On most JVMs, an array or object has a 12-16 byte overhead. An int uses 4 bytes and a boolean uses one byte (it doesn't have to, but it does with OpenJDK/HotSpot). BitSet uses two objects and more memory for small sets, but only one bit per flag. So for small collections an int[] can be smaller than a BitSet, but as the size grows, BitSet will be the smallest.
If the data structure is smaller than your cache, the fastest is int[], then boolean[], then BitSet. This is because there is non-trivial overhead in breaking an int into bytes or bits.
However once your cache size becomes important, it can be that the overhead of BitSet fades compared to the overhead of using a slower cache or main memory.
In short: if in doubt, use BitSet, as it makes your intent clearer and it's likely to be faster.
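
To make the size comparison concrete, here is a small sketch that estimates the payload of each representation using the per-element figures given above (4 bytes per int, 1 byte per boolean on HotSpot, 1 bit per flag in a BitSet, which stores its bits in 64-bit words). The class FlagStorage and its methods are hypothetical, and object and array headers are deliberately ignored.

```java
import java.util.BitSet;

// Hypothetical helper comparing approximate payload sizes for n flags.
// Header overhead (roughly 12-16 bytes per object or array) is not included.
final class FlagStorage {

    static long intArrayBytes(int n) {
        return 4L * n; // one 4-byte int per flag
    }

    static long booleanArrayBytes(int n) {
        return n; // one byte per flag on OpenJDK/HotSpot
    }

    static long bitSetBytes(int n) {
        return 8L * ((n + 63) / 64); // one bit per flag, rounded up to whole longs
    }

    /** Typical BitSet usage for flag storage: sets every third flag and counts them. */
    static int everyThirdFlag(int n) {
        BitSet flags = new BitSet(n);
        for (int i = 0; i < n; i += 3) {
            flags.set(i);
        }
        return flags.cardinality();
    }
}
```

For n = 1000 this gives about 4000 bytes for int[], 1000 bytes for boolean[], and 128 bytes for the BitSet payload.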
Actually, it is JVM dependent. For example, the Sun JVM treats the boolean type as an int, which means that even a single boolean variable uses 32 bits. However, the JVM optimizes boolean arrays and reserves 8 bits per array element.
Java stores a boolean as an int internally, so int[] and boolean[] are exactly the same.
BitSet uses less memory. Whether it is faster or not depends on your usage pattern.
Descending order of memory usage (for large n, with one byte per boolean element):
int[n] > boolean[n] > BitSet(n)
However, in accessing the indexes, there should not be any difference.
Consider replacing bit flags with enums. Then you can use e.g. EnumSet instead of BitSet.

Java Array of Bytes

If I create an array of bytes with byte[], what would be the size of each element? Can they be resized/merged?
Thanks,
Not sure what you meant by resized and merged.
From the documentation:
byte: The byte data type is an 8-bit signed two's complement integer. It has a minimum value of -128 and a maximum value of 127 (inclusive). The byte data type can be useful for saving memory in large arrays, where the memory savings actually matters. They can also be used in place of int where their limits help to clarify your code; the fact that a variable's range is limited can serve as a form of documentation.
Edit: If by resized/merged you are talking about the array itself, there's nothing special about a byte array compared to other arrays.
There are two ways to allocate an array.
A) allocate an empty array of a given size:
byte[] ba1 = new byte[18]; // 18 elements
B) allocate an array by specifying the contents
byte[] ba2 = {1,2,3,4,5}; // 5 elements
The size would be a byte per element.
They cannot be resized. However, you can merge them yourself using System.arraycopy() by creating a new array and copying your source arrays into the new array.
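
A minimal sketch of that merge, plus a "resize" by copying. The class and method names (ByteArrays, concat, resize) are made up for this example; System.arraycopy and Arrays.copyOf are the standard library calls.

```java
import java.util.Arrays;

// Hypothetical helpers for merging and "resizing" byte arrays by copying.
final class ByteArrays {

    /** Returns a new array holding the bytes of first followed by those of second. */
    static byte[] concat(byte[] first, byte[] second) {
        byte[] merged = new byte[first.length + second.length];
        System.arraycopy(first, 0, merged, 0, first.length);
        System.arraycopy(second, 0, merged, first.length, second.length);
        return merged;
    }

    /** Returns a copy with the new length, truncated or zero-padded as needed. */
    static byte[] resize(byte[] source, int newLength) {
        return Arrays.copyOf(source, newLength);
    }
}
```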
Edit 1:
There is also an 8-byte overhead for the object header and a 4-byte overhead for the array length, for a total overhead of 12 bytes (the exact figures depend on the JVM; 64-bit JVMs typically need 16 bytes). So small arrays are relatively expensive.
Check out GNU Trove and Fastutil. They are libraries that make working with primitive collections easier.
Edit 2:
I read in one of your responses that you're doing object serialization. You might be interested in ByteBuffers. They make it easy to write various primitive types to a wrapped array and get the resulting array. Also check out Google Protocol Buffers if you want easily serialized structured data types.
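
As a small illustration of the ByteBuffer suggestion, the sketch below writes three primitives into a byte[] and reads one back. The class PrimitiveWriter and the field names (id, timestamp, value) are invented for this example; ByteBuffer uses big-endian byte order by default.

```java
import java.nio.ByteBuffer;

// Hypothetical example of writing primitives into a byte[] via ByteBuffer.
final class PrimitiveWriter {

    /** Encodes an int, a long and a double into a 20-byte array (big-endian). */
    static byte[] encode(int id, long timestamp, double value) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES + Double.BYTES);
        buf.putInt(id).putLong(timestamp).putDouble(value);
        return buf.array(); // backing array of the heap buffer, no extra copy
    }

    /** Reads back the leading int from an array produced by encode. */
    static int decodeId(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }
}
```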
