Java Resizable Bit Array?

Suppose I have a bit stream that emits one bit at a time, and the stream can stop at any point. What is the idiomatic way to record the output? My main usage for this data structure is to convert it later into an ASCII string of 8-bit blocks. List<Boolean> doesn't sound right because it's messy to convert to an 8-bit block bit array. BitSet can't grow dynamically. List<Character> has a problem when the stream stops after emitting a number of bits that is not a multiple of 8. Any ideas?

I recommend using a ByteBuffer. http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html

You can simply construct a BitList using a long[] array and an integer nBits that keeps track of the number of bits:
public class BitList {

    private int nBits = 0;
    private long[] data = new long[2];

    // bit must be 0 or 1
    public void add (byte bit) {
        if (nBits >= 64*data.length) {
            // grow the backing array by doubling its length
            long[] newdata = new long[2*data.length];
            for (int i = 0; i < data.length; i++) {
                newdata[i] = data[i];
            }
            this.data = newdata;
        }
        data[nBits/64] |= ((long) bit) << (nBits & 0x3f);
        nBits++;
    }

    public byte get (int index) {
        long val = data[index/64] >> (index & 0x3f);
        return (byte) (val & 0x01);
    }

    //and so on.
}
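For the original use case (converting the recorded bits to 8-bit blocks), here is a minimal sketch of a helper that could be added to BitList; the toByteArray name is my own addition, not part of the answer, and the last partial byte is padded with zero bits:
// Hypothetical extra method for BitList, not shown above.
public byte[] toByteArray() {
    byte[] out = new byte[(nBits + 7) / 8];
    for (int i = 0; i < nBits; i++) {
        if (get(i) != 0) {
            out[i / 8] |= 1 << (i % 8);   // bit i goes into byte i/8
        }
    }
    return out;
}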
Or you might wait until the stream has emitted a multiple of eight bits by packing them into bytes:
public class Packer {

    private byte data;

    // returns the packed byte and resets the buffer
    public byte getData () {
        byte result = this.data;
        this.data = 0;
        return result;
    }

    // only the last bit of the argument counts (thus bit is 0 or 1)
    public void addBit (byte bit) {
        this.data <<= 0x01;
        this.data |= (bit & 0x01);
    }
}
In that case the Packer can be used to ease the implementation, since you can store the packed bytes in an ArrayList<Byte> and use an integer to keep track of the number of bits (without having to implement the add/remove/etc. methods yourself).
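A minimal sketch of that combination; the BitRecorder name and its fields are my own illustration, not part of the answer:
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper that packs incoming bits into bytes as they arrive.
public class BitRecorder {
    private final List<Byte> full = new ArrayList<>();
    private final Packer packer = new Packer();
    private int bitsInPacker = 0;

    public void addBit(byte bit) {
        packer.addBit(bit);
        if (++bitsInPacker == 8) {       // a whole byte is ready
            full.add(packer.getData());
            bitsInPacker = 0;
        }
    }
}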

You might want to try
List<Byte>
This would convert most naturally into a byte string, and the bits in each byte are identifiable.
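For example, a minimal sketch of that conversion, assuming the packed bytes were collected in a List<Byte> named bits (the variable names are my own):
// Hypothetical: copy the boxed bytes into a primitive array and decode as ASCII.
byte[] raw = new byte[bits.size()];
for (int i = 0; i < raw.length; i++) {
    raw[i] = bits.get(i);
}
String ascii = new String(raw, java.nio.charset.StandardCharsets.US_ASCII);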

Related

Java uint32 (stored as long) to 4 byte array

I'm writing to a storage format that has uint32, with a max allowed value of "4294967295".
Integer in Java is, of course, just under half that at "2147483647". So internally, I have to use either Long or Guava's UnsignedInteger.
To write to this format, the byte array length needs to be 4, which fits Integer just fine, but converting Long to a byte array requires an array of length 8.
How can I convert a Long or UnsignedInteger representing a max value of "4294967295" as a 4 byte array?
Simply convert it to an 8 byte array and then take only the last 4 bytes:
public static byte[] fromUnsignedInt(long value)
{
byte[] bytes = new byte[8];
ByteBuffer.wrap(bytes).putLong(value);
return Arrays.copyOfRange(bytes, 4, 8);
}
To reverse this you can use the following method:
public static long toUnsignedInt(byte[] bytes)
{
ByteBuffer buffer = ByteBuffer.allocate(8).put(new byte[]{0, 0, 0, 0}).put(bytes);
buffer.position(0);
return buffer.getLong();
}
Note that this method CAN take a negative long or a long that exceeds the range of an unsigned int and won't throw an exception in such a case!
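If you would rather have the out-of-range case fail fast, here is a minimal sketch of a checked variant; the fromUnsignedIntChecked name is my own, and it assumes the same ByteBuffer and Arrays imports as the code above:
public static byte[] fromUnsignedIntChecked(long value)
{
    if (value < 0 || value > 0xFFFFFFFFL) {
        throw new IllegalArgumentException("Value out of unsigned int range: " + value);
    }
    byte[] bytes = new byte[8];
    ByteBuffer.wrap(bytes).putLong(value);
    return Arrays.copyOfRange(bytes, 4, 8);
}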
You can just cast it to an int and do whatever you do that turns ints into arrays, such as this: (not tested)
public static byte[] getUnsignedInt(long value)
{
byte[] bytes = new byte[4];
ByteBuffer.wrap(bytes).putInt((int)value);
return bytes;
}
Of course if you're putting these things in a ByteBuffer anyway, you might as well do that directly.
The "meaning" or "interpretation" of the top bit is irrelevant if all you're doing it storing it. For example, 4294967295 would be interpreted as -1, but it's really the same number: 0xFFFFFFFF in hexadecimal, so you will get the byte array { 0xFF, 0xFF, 0xFF, 0xFF }.
To reverse it, you could do something like this (not tested)
public static long toUnsignedInt(byte[] bytes)
{
ByteBuffer buffer = ByteBuffer.allocate(4).put(bytes);
buffer.position(0);
return buffer.getInt() & 0xFFFFFFFFL;
}
An answer without the object creation and array copying of the accepted answer... It's easy to do yourself with shift operations. See:
import java.io.*;

public class TestUINT32 {

    public static void writeLongAsUINT32(long value, OutputStream os) throws IOException {
        // write the low 4 bytes, least significant byte first
        for (int i = 0; i < 4; ++i) {
            os.write((byte) value);
            value = value >> 8;
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        long value = 0xFFEEDDBB;
        writeLongAsUINT32(value, os);
        byte[] ba = os.toByteArray();
        // print the bytes most significant first
        for (int i = ba.length; i > 0; --i) {
            System.out.print(String.format("%02X", ba[i - 1]));
        }
        System.out.println();
    }
}
Example run:
$ java TestUINT32
FFEEDDBB

Convert array of doubles to byte array: What is the Java way of C# Buffer.BlockCopy?

I need to serialize an array of doubles to base64 in Java. I have the following method from C#:
public static string DoubleArrayToBase64( double[] dValues ) {
byte[] bytes = new byte[dValues.Length * sizeof( double )];
Buffer.BlockCopy( dValues, 0, bytes, 0, bytes.Length );
return Convert.ToBase64String( bytes );
}
How do I do that in Java? I tried
Byte[] bytes = new Byte[abundaceArray.length * Double.SIZE];
System.arraycopy(abundaceArray, 0, bytes, 0, bytes.length);
abundanceValues = Base64.encodeBase64String(bytes);
however this leads to an IndexOutOfBoundsException.
How can I achieve this in Java?
EDIT:
Buffer.BlockCopy copies at the byte level; the last parameter is the number of bytes. System.arraycopy's last parameter is the number of elements to copy. So yes, it should be abundaceArray.length, but then an ArrayStoreException is thrown.
EDIT2:
The base64 string must be the same as the one created with the C# code!
You get an ArrayStoreException when the array types passed to the method are not the same primitive type, so double to byte will not work. Here is a workaround I patched up that seems to work. I do not know of any method in the Java core that does automatic conversion from a primitive array to a byte block:
public class CUSTOM {

    public static void main(String[] args) {
        double[] arr = new double[]{1.1, 1.3};
        byte[] barr = toByteArray(arr);
        for (byte b : barr) {
            System.out.println(b);
        }
    }

    public static byte[] toByteArray(double[] from) {
        byte[] output = new byte[from.length * Double.SIZE / 8]; // Double.SIZE is given in bits
        int step = Double.SIZE / 8; // 8 bytes per double
        int index = 0;
        for (double d : from) {
            long bits = Double.doubleToLongBits(d); // first transform to a primitive that allows bit shifting
            for (int i = 0; i < step; i++) {
                byte b = (byte) ((bits >>> (i * 8)) & 0xFF); // take the i-th byte, least significant first
                output[i + (index * step)] = b;
            }
            index++;
        }
        return output;
    }
}
Double.SIZE is 64, which is the number of bits, so I suggest initializing the array like this:
Byte[] bytes = new Byte[abundaceArray.length * 8];
Not sure what this C# function does, but I suspect you should replace this line
System.arraycopy(abundaceArray, 0, bytes, 0, bytes.length);
with this
System.arraycopy(abundaceArray, 0, bytes, 0, abundaceArray.length);
I'm guessing you're using the Apache Commons Base64 class. That only has methods accepting an array of bytes (the primitive type), not Byte (the object wrapper around the primitive type).
It's not clear what type your 'abundaceArray' is - whether it's doubles or Doubles.
Either way, you can't use System.arraycopy to copy between arrays of difference primitive types.
I think your best bet is to serialise your array object to a byte array, then base64 encode that.
eg:
ByteArrayOutputStream b = new ByteArrayOutputStream(); // to store output from serialization in a byte array
ObjectOutputStream o = new ObjectOutputStream(b); // to do the serialization
o.writeObject(abundaceArray); // arrays of primitive types are serializable
String abundanceValues = Base64.encodeBase64String(b.toByteArray());
There is of course an ObjectInputStream for going in the other direction at the other end.
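A minimal sketch of that other direction, assuming the same Apache Commons Codec Base64 class and leaving exception handling out (the variable names are mine):
byte[] raw = Base64.decodeBase64(abundanceValues);            // back to the serialized form
ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw));
double[] restored = (double[]) in.readObject();               // arrays of primitives deserialize directly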

Improvement of Algorithm: Counting set bits in Byte-Arrays

We store knowledge in byte arrays as bits. Counting the number of set bits is pretty slow. Any suggestion to improve the algorithm is welcome:
public static int countSetBits(byte[] array) {
    int setBits = 0;
    if (array != null) {
        for (int byteIndex = 0; byteIndex < array.length; byteIndex++) {
            for (int bitIndex = 0; bitIndex < 8; bitIndex++) {
                if (getBit(bitIndex, array[byteIndex])) {
                    setBits++;
                }
            }
        }
    }
    return setBits;
}

public static boolean getBit(int index, final byte b) {
    byte t = setBit(index, (byte) 0);
    return (b & t) != 0;
}

public static byte setBit(int index, final byte b) {
    return (byte) ((1 << index) | b);
}
Counting the bits of a byte array of length 156,564 takes 300 ms; that's too much!
Try Integer.bitCount to obtain the number of bits set in each byte. It will be more efficient if you can switch from a byte array to an int array. If this is not possible, you could also construct a look-up table for all 256 bytes to quickly look up the count rather than iterating over individual bits.
And if it's always the whole array's count you're interested in, you could wrap the array in a class that stores the count in a separate integer whenever the array changes. (edit: Or, indeed, as noted in comments, use java.util.BitSet.)
I would use the same global loop but instead of looping inside each byte I would simply use a (precomputed) array of size 256 mapping bytes to their bit count. That would probably be very efficient.
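A minimal sketch of that table-based approach (the PopCount class and its names are my own illustration, not from the answer):
public final class PopCount {
    // precomputed bit counts for every possible byte value 0..255
    private static final int[] TABLE = new int[256];
    static {
        for (int i = 0; i < 256; i++) {
            TABLE[i] = Integer.bitCount(i);
        }
    }

    public static int countSetBits(byte[] array) {
        int setBits = 0;
        for (byte b : array) {
            setBits += TABLE[b & 0xFF]; // mask to index with the unsigned value
        }
        return setBits;
    }
}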
If you need even more speed, then you should separately maintain the count and increment it and decrement it when setting bits (but that would mean a big additional burden on those operations so I'm not sure it's applicable for you).
Another solution would be based on the BitSet implementation: it uses an array of long (and not bytes), and here's how it counts:
int sum = 0;
for (int i = 0; i < wordsInUse; i++)
    sum += Long.bitCount(words[i]);
return sum;
I would use:
byte[] yourByteArray = ...
BitSet bitset = BitSet.valueOf(yourByteArray); // java.util.BitSet
int setBits = bitset.cardinality();
I don't know how fast it is, but I think it will be faster than what you have. Let me know?
Your method would look like
public static int countSetBits(byte[] array) {
return BitSet.valueOf(array).cardinality();
}
You say:
We store knowledge in byte arrays as bits.
I would recommend using a BitSet for that. It gives you convenient methods, and you seem to be interested in bits, not bytes, so it is a much more appropriate data type compared to a byte[]. (Internally it uses a long[].)
By far the fastest way is counting set bits "in parallel"; the method is called Hamming weight, and as far as I know it is what Integer.bitCount(int i) implements.
As per my understanding,
1 byte = 8 bits
So if the byte array size = n, then isn't the total number of bits = n*8?
Please correct me if my understanding is wrong.
Thanks
Vinod

Simple data serialization in C

I am currently re-designing an application and stumbled upon a problem serializing some data.
Say I have an array of size m x n
double **data;
that I want to serialize into a
char *dataSerialized
using simple delimiters (one for rows, one for elements).
De-serialization is fairly straightforward, counting delimiters and allocating size for the data to be stored. However, what about the serialize function, say
serialize_matrix(double **data, int m, int n, char **dataSerialized);
What would be the best strategy to determine the size needed by the char array and allocate the appropriate memory for it?
Perhaps using some fixed-width exponential representation of doubles in a string? Is it possible to just convert all the bytes of a double into chars and have a sizeof(double)-aligned char array? How would I keep the accuracy of the numbers intact?
NOTE:
I need the data in a char array, not in binary, not in a file.
The serialized data will be sent over the network using ZeroMQ between a C server and a Java client. Would it be possible, given the array dimensions and sizeof(double) that it can always be accurately reconstructed between those two?
Java has pretty good support for reading raw bytes and converting into whatever you want.
You can decide on a simple wire-format, and then serialize to this in C, and unserialize in Java.
Here's an example of an extremely simple format, with code to unserialize and serialize.
I've written a slightly larger test program that I can dump somewhere if you want; it creates a random data array in C, serializes, writes the serialized string base64-encoded to stdout. The much smaller java-program then reads, decodes and deserializes this.
C code to serialize:
/*
I'm using this format:

  [number of elements in outer array]  - 32 bit signed int
  [number of elements in inner array]  - 32 bit signed int
  [elements]                           - see below

[elements] is built like
[element(0,0)][element(0,1)]...[element(0,y)][element(1,0)]...

Each element is sent as a 64 bit IEEE 754 "double". If your C compiler/architecture
is doing something different with its "double"s, look forward to hours of fun :)

I'm using a couple of non-standard functions for byte-swapping here, originally
from a BSD, but present in glibc >= 2.9.
*/
/* Calculate the bytes required to store a message of x*y doubles */
size_t calculate_size(size_t x, size_t y)
{
    /* The two dimensions in the array - each in 32 bits - (2 * 4) */
    size_t sz = 8;
    /* a 64 bit IEEE 754 double is by definition 8 bytes long :) */
    sz += ((x * y) * 8);
    /* and a NUL */
    sz++;
    return sz;
}
/* Helpers */
static char* write_int32(int32_t, char*);
static char* write_double(double, char*);
/* Actual conversion. That wasn't so hard, was it? */
void convert_data(double** src, size_t x, size_t y, char* dst)
{
    dst = write_int32((int32_t) x, dst);
    dst = write_int32((int32_t) y, dst);
    for(int i = 0; i < x; i++) {
        for(int j = 0; j < y; j++) {
            dst = write_double(src[i][j], dst);
        }
    }
    *dst = '\0';
}
static char* write_int32(int32_t num, char* c)
{
    char* byte;
    int i = sizeof(int32_t);
    /* Convert to network byte order */
    num = htobe32(num);
    byte = (char*) (&num);
    while(i--) {
        *c++ = *byte++;
    }
    return c;
}
static char* write_double(double d, char* c)
{
    /* Here I'm assuming your C programs use IEEE 754 'double' precision natively.
       If you don't, you should be able to convert into this format. A helper library
       most likely already exists for your platform.
       Note that IEEE 754 endianness isn't defined, but in practice, normal platforms
       use the same byte order as they do for integers. */
    char* byte;
    int i = sizeof(uint64_t);
    uint64_t num = *((uint64_t*)&d);
    /* convert to network byte order */
    num = htobe64(num);
    byte = (char*) (&num);
    while(i--) {
        *c++ = *byte++;
    }
    return c;
}
Java code to unserialize:
/* The raw char array from C is now read into the byte[] `bytes` in Java */
DataInputStream stream = new DataInputStream(new ByteArrayInputStream(bytes));
int dim_x; int dim_y;
double[][] data;
try {
    dim_x = stream.readInt();
    dim_y = stream.readInt();
    data = new double[dim_x][dim_y];
    for(int i = 0; i < dim_x; ++i) {
        for(int j = 0; j < dim_y; ++j) {
            data[i][j] = stream.readDouble();
        }
    }
    System.out.println("Client:");
    System.out.println("Dimensions: "+dim_x+" x "+dim_y);
    System.out.println("Data:");
    for(int i = 0; i < dim_x; ++i) {
        for(int j = 0; j < dim_y; ++j) {
            System.out.print(" "+data[i][j]);
        }
        System.out.println();
    }
} catch(IOException e) {
    System.err.println("Error reading input");
    System.err.println(e.getMessage());
    System.exit(1);
}
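For reference, a minimal sketch of how the byte[] `bytes` above could be obtained from the base64 text that the C program prints, assuming the encoded output has been read into a String named encoded (java.util.Base64 is my choice here, not necessarily what the original test program used):
// Hypothetical: decode the base64 text produced by the C side into raw bytes.
byte[] bytes = java.util.Base64.getDecoder().decode(encoded.trim());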
If you are writing a binary file, you should think of a good way to serialize the actual binary data (64bit) of your double. This could go from directly writing the content of the double to the file (minding endianness) to some more elaborate normalizing serialization schemes (e.g. with a well-defined representation of NaN). That's up to you really. If you expect to be basically among homogeneous architectures, a direct memory dump would probably suffice.
If you want to write to a text file and are looking for an ASCII representation, I would strongly discourage a decimal numerical representation. Instead, you could convert the 64-bit raw data to ASCII using base64 or something like that.
You really want to keep all the precision that you have in your double!

Space efficient long representation

I want to take a long value in Java, and convert it to a byte array.
However, I want the representation to be small for small values, so perhaps if the value is less than 127 then it requires only a single byte.
The encoding and decoding algorithms should be extremely efficient.
I'm sure this has been done but I can't find any example code, anyone got any pointers?
You can use stop bit encoding, e.g.:
public static void writeLong(OutputStream out, long value) throws IOException {
    while(value < 0 || value > 127) {
        out.write((byte) (0x80 | (value & 0x7F)));
        value = value >>> 7;
    }
    out.write((byte) value);
}

public static long readLong(InputStream in) throws IOException {
    int shift = 0;
    long b;
    long value = 0;
    while((b = in.read()) >= 0) {
        value += (b & 0x7f) << shift;
        shift += 7;
        if ((b & 0x80) == 0) return value;
    }
    throw new EOFException();
}
This is a fast form of compression, but all compression comes at a cost. (However if you are bandwidth limited it may be faster to transmit and worth the cost)
BTW: Values 0 to 127 use one byte. You can use the same routine for short and int values as well.
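For illustration, a minimal round trip using the methods above (a sketch, assuming the usual java.io imports; the value 300 is an arbitrary example):
ByteArrayOutputStream out = new ByteArrayOutputStream();
writeLong(out, 300);                                                // 300 > 127, so this takes two bytes
long back = readLong(new ByteArrayInputStream(out.toByteArray()));  // back == 300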
EDIT: You can still apply generic compression after this, and the result can be smaller than using generic compression alone.
public static void main(String... args) throws IOException {
    long[] sequence = new long[1024];
    Random rand = new Random(1);
    for (int i = 0; i < sequence.length; i += 2) {
        sequence[i] = (long) Math.pow(2, rand.nextDouble() * rand.nextDouble() * 61);
        // some pattern.
        sequence[i + 1] = sequence[i] / 2;
    }
    testDeflator(sequence);
    testStopBit(sequence);
    testStopBitDeflator(sequence);
}

private static void testDeflator(long[] sequence) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(new DeflaterOutputStream(baos));
    for (long l : sequence)
        dos.writeLong(l);
    dos.close();
    System.out.println("Deflator used " + baos.toByteArray().length);
}

private static void testStopBit(long[] sequence) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    for (long l : sequence)
        writeLong(baos, l);
    baos.close();
    System.out.println("Stop bit used " + baos.toByteArray().length);
}

private static void testStopBitDeflator(long[] sequence) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(new DeflaterOutputStream(baos));
    for (long l : sequence)
        writeLong(dos, l);
    dos.close();
    System.out.println("Stop bit & Deflator used " + baos.toByteArray().length);
}

public static void writeLong(OutputStream out, long value) throws IOException {
    while (value < 0 || value > 127) {
        out.write((byte) (0x80 | (value & 0x7F)));
        value = value >>> 7;
    }
    out.write((byte) value);
}
Prints
Deflator used 3492
Stop bit used 2724
Stop bit & Deflator used 2615
What works best is highly dependent on the data you are sending. E.g. if your data is truly random, any compression technique you use will only make the data larger.
The Deflater format is a stripped down version of the GZip output (minus a header and CRC32).
Simply use a GZIPOutputStream - entropy encoding like GZip basically does exactly what you describe, just generically.
Edit:
Just to be sure: do you realize that a variable-length encoding that uses only 1 byte for small numbers necessarily needs to use more than 8 bytes for most large ones? Unless you know that you'll have far more small than large numbers, it could even end up increasing the overall size of your data. Whereas GZIP adapts to your actual data set and can compress data sets that are skewed in different ways.
See Read7BitEncodedInt in C#. (It's the same concept.)
If you want to store long values with different lengths, then you'll need a delimiter; otherwise you can't decide which byte belongs to which long value... And the delimiters will add extra bytes to the data...
If you're looking for a fast library to store long values (64 bits each), I'd recommend colt. It is fast.
(I might be stating the obvious to some people ... but here goes.)
If you are doing this to reduce the size of long values in some external serialization, go ahead.
However, if you are trying to save memory in a Java program, you are probably wasting your time. The smallest representation of a byte[] in Java is either 2 or 3 32-bit words, and that is for a byte array of length zero. Add some multiple of 32-bit words for any array length greater than zero. Then you've got to allow at least one 32-bit word to hold the reference to the byte[] object.
If you add that up, it takes at least 4 words to represent any given long other than 0L as a byte[].
The only case where you are going to get any saving is if you are representing a number of long values in a single byte[]. You will need at least 3 long values before you can possibly break even, and even then you will lose if the values turn out to be too large on average.
