Space efficient long representation - java

I want to take a long value in Java, and convert it to a byte array.
However, I want the representation to be small for small values, so perhaps if the value is less than 127 then it requires only a single byte.
The encoding and decoding algorithms should be extremely efficient.
I'm sure this has been done but I can't find any example code, anyone got any pointers?

You can use stop bit encoding e.g.
public static void writeLong(OutputStream out, long value) throws IOException {
    while (value < 0 || value > 127) {
        out.write((byte) (0x80 | (value & 0x7F)));
        value = value >>> 7;
    }
    out.write((byte) value);
}
public static long readLong(InputStream in) throws IOException {
    int shift = 0;
    long b;
    long value = 0;
    while ((b = in.read()) >= 0) {
        value += (b & 0x7f) << shift;
        shift += 7;
        if ((b & 0x80) == 0) return value;
    }
    throw new EOFException();
}
This is a fast form of compression, but all compression comes at a cost. (However, if you are bandwidth limited it may be faster to transmit and therefore worth the cost.)
BTW: Values 0 to 127 use one byte. You can use the same routine for short and int values as well.
EDIT: You can still apply generic compression after this, and the result can be smaller than using generic compression alone.
public static void main(String... args) throws IOException {
    long[] sequence = new long[1024];
    Random rand = new Random(1);
    for (int i = 0; i < sequence.length; i += 2) {
        sequence[i] = (long) Math.pow(2, rand.nextDouble() * rand.nextDouble() * 61);
        // some pattern.
        sequence[i + 1] = sequence[i] / 2;
    }
    testDeflator(sequence);
    testStopBit(sequence);
    testStopBitDeflator(sequence);
}
private static void testDeflator(long[] sequence) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(new DeflaterOutputStream(baos));
    for (long l : sequence)
        dos.writeLong(l);
    dos.close();
    System.out.println("Deflator used " + baos.toByteArray().length);
}
private static void testStopBit(long[] sequence) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    for (long l : sequence)
        writeLong(baos, l);
    baos.close();
    System.out.println("Stop bit used " + baos.toByteArray().length);
}
private static void testStopBitDeflator(long[] sequence) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(new DeflaterOutputStream(baos));
    for (long l : sequence)
        writeLong(dos, l);
    dos.close();
    System.out.println("Stop bit & Deflator used " + baos.toByteArray().length);
}
public static void writeLong(OutputStream out, long value) throws IOException {
    while (value < 0 || value > 127) {
        out.write((byte) (0x80 | (value & 0x7F)));
        value = value >>> 7;
    }
    out.write((byte) value);
}
Prints
Deflator used 3492
Stop bit used 2724
Stop bit & Deflator used 2615
What works best is highly dependent on the data you are sending, e.g. if your data is truly random, any compression technique you use will only make the data larger.
The Deflator is a stripped-down version of the GZip output (minus a header and CRC32).

Simply use a GZIPOutputStream - entropy encoding like GZip basically does exactly what you describe, just generically.
Edit:
Just to be sure: do you realize that a variable-length encoding that uses only 1 byte for small numbers necessarily needs to use more than 8 bytes for most large ones? Unless you know that you'll have far more small than large numbers, it could even end up increasing the overall size of your data. Whereas GZIP adapts to your actual data set and can compress data sets that are skewed in different ways.
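As a rough illustration of that trade-off (a sketch reusing the writeLong method from the stop-bit answer above):
static void sizeDemo() throws IOException {
    ByteArrayOutputStream small = new ByteArrayOutputStream();
    writeLong(small, 100L);            // 0..127 fits in a single byte
    ByteArrayOutputStream large = new ByteArrayOutputStream();
    writeLong(large, Long.MAX_VALUE);  // 63 significant bits -> 9 bytes
    ByteArrayOutputStream negative = new ByteArrayOutputStream();
    writeLong(negative, -1L);          // all 64 bits significant -> 10 bytes
    System.out.println(small.size() + " / " + large.size() + " / " + negative.size());   // 1 / 9 / 10
}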

See Read7BitEncodedInt in C#. (It's the same concept.)

If you want to store long values with different lengths, then you'll need a delimiter, otherwise you can't tell which byte belongs to which long value... And the delimiters will add extra bytes to the data...
If you're looking for a fast library to store long values (with 64 bits each), I'd recommend colt. It is fast.

(I might be stating the obvious to some people ... but here goes.)
If you are doing this to reduce the size of long values in some external serialization, go ahead.
However, if you are trying to save memory in a Java program you are probably wasting your time. The smallest representation of a byte[] in Java is either 2 or 3 32-bit words. And that is for a byte array of length zero. Add some multiple of 32-bit words for any array length greater than zero. Then you've got to allow at least one 32-bit word to hold the reference to the byte[] object.
If you add that up, it takes at least 4 words to represent any given long other than 0L as a byte[].
The only case where you are going to get any saving is if you represent a number of long values in a single byte[]. You will need at least 3 long values before you can possibly break even, and even then you will lose if the values turn out to be too large on average.


Fourier transforming a byte array

I am not so proficient in Java, so please keep it quite simple. I will, though, try to understand everything you post. Here's my problem.
I have written code to record audio from an external microphone and store it in a .wav file. Storing this file is relevant for archiving purposes. What I need to do is an FFT of the stored audio.
My approach to this was loading the wav file as a byte array and transforming that, with two problems: 1. there's a header in the way I need to get rid of, but I should be able to do that, and 2. I get a byte array, but most if not all FFT algorithms I found online and tried to patch into my project work with complex / two double arrays.
I tried to work around both these problems and was finally able to plot my FFT array as a graph, only to find out it was just giving me back "0"s. The .wav file is fine though, I can play it back without problems. I thought maybe converting the bytes into doubles was the problem for me, so here's my approach to that (I know it's not pretty):
byte ByteArray[] = Files.readAllBytes(wav_path);
String s = new String(ByteArray);
double[] DoubleArray = toDouble(ByteArray);
// build 2^n array, fill up with zeroes
boolean exp = false;
int i = 0;
int pow = 0;
while (!exp) {
    pow = (int) Math.pow(2, i);
    if (pow > ByteArray.length) {
        exp = true;
    } else {
        i++;
    }
}
System.out.println(pow);
double[] Filledup = new double[pow];
for (int j = 0; j < DoubleArray.length; j++) {
    Filledup[j] = DoubleArray[j];
    System.out.println(DoubleArray[j]);
}
for (int k = DoubleArray.length; k < Filledup.length; k++) {
    Filledup[k] = 0;
}
This is the function I'm using to convert the byte array into a double array:
public static double[] toDouble(byte[] byteArray) {
    ByteBuffer byteBuffer = ByteBuffer.wrap(byteArray);
    double[] doubles = new double[byteArray.length / 8];
    for (int i = 0; i < doubles.length; i++) {
        doubles[i] = byteBuffer.getDouble(i * 8);
    }
    return doubles;
}
The header is still in there, I know that, but that should be the smallest problem right now. I transformed my byte array to a double array, then filled up that array to the next power of 2 with zeroes, so that the FFT can actually work (it needs an array of 2^n values). The FFT algorithm I'm using gets two double arrays as input, one being the real, the other being the imaginary part. I read that for this to work, I'd have to keep the imaginary array empty (but with its length the same as the real array).
Worth mentioning: I'm recording at 44.1 kHz, 16 bit and mono.
If necessary, I'll post the FFT I'm using.
If I try to print the values of the double array, I get kind of weird results:
...
-2.0311904060823147E236
-1.3309975624948503E241
1.630738286366793E-260
1.0682002560745842E-255
-5.961832069690704E197
-1.1476447092561027E164
-1.1008407401197794E217
-8.109566204271759E298
-1.6104556241572942E265
-2.2081172620352248E130
NaN
3.643749694745671E-217
-3.9085815506127892E202
-4.0747557114875874E149
...
I know that somewhere the problem lies with me overlooking something very simple I should be aware of, but I can't seem to find the problem. My question finally is: How can I get this to work?
There's a header in the way I need to get rid of […]
You need to use javax.sound.sampled.AudioInputStream to read the file if you want to "skip" the header. This is useful to learn anyway, because you would need the data in the header to interpret the bytes if you did not know the exact format ahead of time.
I'm recording at 44.1 kHz, 16 bit and mono.
So, this almost certainly means the data in the file is encoded as 16-bit integers (short in Java nomenclature).
Right now, your ByteBuffer code makes the assumption that it's already 64-bit floating point and that's why you get strange results. In other words, you are reinterpreting the binary short data as if it were double.
What you need to do is read in the short data and then convert it to double.
For example, here's a rudimentary routine that does what you're trying to do (supporting 8-, 16-, 32- and 64-bit signed integer PCM):
import javax.sound.sampled.*;
import javax.sound.sampled.AudioFormat.Encoding;
import java.io.*;
import java.nio.*;
static double[] readFully(File file)
        throws UnsupportedAudioFileException, IOException {
    AudioInputStream in = AudioSystem.getAudioInputStream(file);
    AudioFormat fmt = in.getFormat();
    byte[] bytes;
    try {
        if (fmt.getEncoding() != Encoding.PCM_SIGNED) {
            throw new UnsupportedAudioFileException();
        }
        // read the data fully
        bytes = new byte[in.available()];
        in.read(bytes);
    } finally {
        in.close();
    }
    int bits = fmt.getSampleSizeInBits();
    double max = Math.pow(2, bits - 1);
    ByteBuffer bb = ByteBuffer.wrap(bytes);
    bb.order(fmt.isBigEndian() ?
            ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN);
    double[] samples = new double[bytes.length * 8 / bits];
    // convert sample-by-sample to a scale of
    // -1.0 <= samples[i] < 1.0
    for (int i = 0; i < samples.length; ++i) {
        switch (bits) {
            case 8:  samples[i] = (bb.get() / max);
                break;
            case 16: samples[i] = (bb.getShort() / max);
                break;
            case 32: samples[i] = (bb.getInt() / max);
                break;
            case 64: samples[i] = (bb.getLong() / max);
                break;
            default: throw new UnsupportedAudioFileException();
        }
    }
    return samples;
}
The FFT algorithm I'm using gets two double arrays as input, one being the real, the other being the imaginary part. I read that for this to work, I'd have to keep the imaginary array empty (but with its length the same as the real array).
That's right. The real part is the audio sample array from the file, the imaginary part is an array of equal length, filled with 0's e.g.:
double[] realPart = mySamples;
double[] imagPart = new double[realPart.length];
myFft(realPart, imagPart);
More info... "How do I use audio sample data from Java Sound?"
The samples in a wave file are not going to already be 8-byte doubles that can be directly copied, as your posted code assumes.
You need to look up (partially from the WAVE header format and from the RIFF specification) the data type, format, length and endianness of the samples before converting them to doubles.
Try 2 byte little-endian signed integers as a likely possibility.
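A minimal sketch of that conversion, assuming 16-bit little-endian signed PCM and the canonical 44-byte WAV header (both assumptions; the readFully approach above, which reads the format from the header, is safer):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

static double[] toSamples16BitLE(Path wavPath) throws IOException {
    byte[] bytes = Files.readAllBytes(wavPath);
    // skip the assumed 44-byte header; real files can contain extra chunks
    ByteBuffer bb = ByteBuffer.wrap(bytes, 44, bytes.length - 44).order(ByteOrder.LITTLE_ENDIAN);
    double[] samples = new double[bb.remaining() / 2];
    for (int i = 0; i < samples.length; i++) {
        samples[i] = bb.getShort() / 32768.0;   // scale to -1.0 <= s < 1.0
    }
    return samples;
}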

Byte to "Bit"array

A byte is the smallest numeric datatype Java offers, but yesterday I came in contact with byte streams for the first time: at the beginning of every package a marker byte is sent which gives further instructions on how to handle the package. Every bit of the byte has a specific meaning, so I need to disentangle the byte into its 8 bits.
You could probably convert the byte to a boolean array or create a switch for every case, but that certainly can't be the best practice.
How is this possible in Java, and why are there no bit datatypes in Java?
Because there is no bit data type on the physical computer. The smallest allotment you can allocate on most modern computers is a byte, which is also known as an octet, i.e. 8 bits. When you inspect a single bit you are really just pulling that bit out of the byte with arithmetic and adding it to a new byte which still uses an 8-bit space. If you want to put bit data inside of a byte you can, but it will be stored as at least a single byte no matter what programming language you use.
You could load the byte into a BitSet. This abstraction hides the gory details of manipulating single bits.
import java.util.BitSet;
public class Bits {
    public static void main(String[] args) {
        byte[] b = new byte[]{10};
        BitSet bitset = BitSet.valueOf(b);
        System.out.println("Length of bitset = " + bitset.length());
        for (int i = 0; i < bitset.length(); ++i) {
            System.out.println("bit " + i + ": " + bitset.get(i));
        }
    }
}
$ java Bits
Length of bitset = 4
bit 0: false
bit 1: true
bit 2: false
bit 3: true
You can ask for any bit, but the length tells you that all the bits past length() - 1 are set to 0 (false):
System.out.println("bit 75: " + bitset.get(75));
bit 75: false
Have a look at java.util.BitSet.
You might use it to interpret the byte read and can use the get method to check whether a specific bit is set like this:
byte b = (byte) stream.read();   // read() returns an int; cast it (and check for -1 / EOF in real code)
final BitSet bitSet = BitSet.valueOf(new byte[]{b});
if (bitSet.get(2)) {
    state.activateComponentA();
} else {
    state.deactivateComponentA();
}
state.setFeatureBTo(bitSet.get(1));
state.setFeatureBTo(bitSet.get(1));
On the other hand, you can create your own bitmask easily and convert it to a byte array (or just byte) afterwards:
final BitSet output = BitSet.valueOf(ByteBuffer.allocate(1));
output.set(3, state.isComponentXActivated());
if (state.isY) {
    output.set(4);
}
// note: toByteArray() returns an empty array if no bit is set, so guard against that case
final byte w = output.toByteArray()[0];
How is this possible in Java, and why are there no bit datatypes in Java?
There are no bit data types in most languages. And most CPU instruction sets have few (if any) instructions dedicated to addressing single bits. You can think of the lack of these as a trade-off between (language or CPU) complexity and need.
Manipulating a single bit can be thought of as a special case of manipulating multiple bits; and languages as well as CPUs are equipped for the latter.
Very common operations like testing, setting, clearing, inverting as well as exclusive or are all supported on the integer primitive types (byte, short/char, int, long), operating on all bits of the type at once. By choosing the parameters appropriately you can select which bits to operate on.
If you think about it, a byte array is a bit array where the bits are grouped in packages of 8. Addressing a single bit in the array is relatively simple using logical operators (AND &, OR |, XOR ^ and NOT ~).
For example, testing if bit N is set in a byte can be done using a logical AND with a mask where only the bit to be tested is set:
public boolean testBit(byte b, int n) {
    int mask = 1 << n; // equivalent of 2 to the nth power
    return (b & mask) != 0;
}
Extending this to a byte array is no magic either, each byte consists of 8 bits, so the byte index is simply the bit number divided by 8, and the bit number inside that byte is the remainder (modulo 8):
public boolean testBit(byte[] array, int n) {
    int index = n >>> 3;     // divide by 8
    int mask = 1 << (n & 7); // n modulo 8
    return (array[index] & mask) != 0;
}
Here is a sample I hope is useful for you!
DatagramSocket socket = new DatagramSocket(6160, InetAddress.getByName("0.0.0.0"));
socket.setBroadcast(true);
while (true) {
    byte[] recvBuf = new byte[26];
    DatagramPacket packet = new DatagramPacket(recvBuf, recvBuf.length);
    socket.receive(packet);
    String bitArray = toBitArray(recvBuf);
    System.out.println(Integer.parseInt(bitArray.substring(0, 8), 2));  // convert first byte binary to decimal
    System.out.println(Integer.parseInt(bitArray.substring(8, 16), 2)); // convert second byte binary to decimal
}

public static String toBitArray(byte[] byteArray) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < byteArray.length; i++) {
        sb.append(String.format("%8s", Integer.toBinaryString(byteArray[i] & 0xFF)).replace(' ', '0'));
    }
    return sb.toString();
}

Improvement of Algorithm: Counting set bits in Byte-Arrays

We store knowledge in byte arrays as bits. Counting the number of set bits is pretty slow. Any suggestion to improve the algorithm is welcome:
public static int countSetBits(byte[] array) {
    int setBits = 0;
    if (array != null) {
        for (int byteIndex = 0; byteIndex < array.length; byteIndex++) {
            for (int bitIndex = 0; bitIndex < 8; bitIndex++) {  // all 8 bits, not just the first 7
                if (getBit(bitIndex, array[byteIndex])) {
                    setBits++;
                }
            }
        }
    }
    return setBits;
}
public static boolean getBit(int index, final byte b) {
    byte t = setBit(index, (byte) 0);
    return (b & t) != 0;  // != 0 rather than > 0, because bit 7 makes the masked value negative
}
public static byte setBit(int index, final byte b) {
    return (byte) ((1 << index) | b);
}
Counting the bits of a byte array with length 156'564 takes 300 ms; that's too much!
Try Integer.bitCount to obtain the number of bits set in each byte. It will be more efficient if you can switch from a byte array to an int array. If this is not possible, you could also construct a look-up table for all 256 bytes to quickly look up the count rather than iterating over individual bits.
And if it's always the whole array's count you're interested in, you could wrap the array in a class that stores the count in a separate integer whenever the array changes. (edit: Or, indeed, as noted in comments, use java.util.BitSet.)
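A minimal sketch of the per-byte Integer.bitCount approach (note the & 0xFF, which avoids sign extension for negative bytes):
public static int countSetBits(byte[] array) {
    int setBits = 0;
    if (array != null) {
        for (byte b : array) {
            setBits += Integer.bitCount(b & 0xFF);   // bit count of the unsigned byte value
        }
    }
    return setBits;
}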
I would use the same global loop but instead of looping inside each byte I would simply use a (precomputed) array of size 256 mapping bytes to their bit count. That would probably be very efficient.
If you need even more speed, then you should separately maintain the count and increment it and decrement it when setting bits (but that would mean a big additional burden on those operations so I'm not sure it's applicable for you).
Another solution is based on the BitSet implementation: it uses an array of long (not bytes), and here's how it counts:
int sum = 0;
for (int i = 0; i < wordsInUse; i++)
    sum += Long.bitCount(words[i]);
return sum;
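A sketch of the precomputed 256-entry table mentioned above (built once, then a single array lookup per byte):
private static final int[] BIT_COUNTS = new int[256];
static {
    for (int i = 0; i < 256; i++) {
        BIT_COUNTS[i] = Integer.bitCount(i);   // fill the table once at class load time
    }
}
public static int countSetBits(byte[] array) {
    int setBits = 0;
    for (byte b : array) {
        setBits += BIT_COUNTS[b & 0xFF];       // mask to get the unsigned index 0..255
    }
    return setBits;
}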
I would use:
byte[] yourByteArray = ...
BitSet bitset = BitSet.valueOf(yourByteArray); // java.util.BitSet
int setBits = bitset.cardinality();
I don't know exactly how much faster it is, but I think it will be faster than what you have. Let me know?
Your method would look like
public static int countSetBits(byte[] array) {
    return BitSet.valueOf(array).cardinality();
}
You say:
We store knowledge in byte arrays as bits.
I would recommend to use a BitSet for that. It gives you convenient methods, and you seem to be interested in bits, not bytes, so it is a much more appropriate data type compared to a byte[]. (Internally it uses a long[]).
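For instance, keeping the data in a BitSet directly might look like this (a small sketch; the size echoes the 156'564-byte figure from the question):
BitSet knowledge = new BitSet(156_564 * 8);   // capacity hint in bits (java.util.BitSet)
knowledge.set(42);                            // store one "fact"
boolean known = knowledge.get(42);            // query it
int setBits = knowledge.cardinality();        // count all set bits in one call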
By far the fastest way is counting set bits in "parallel"; the method is called Hamming weight and it is implemented in Integer.bitCount(int i), as far as I know.
As per my understanding,
1 Byte = 8 Bits
So if the byte array size = n, then isn't the total number of bits = n*8?
Please correct me if my understanding is wrong.
Thanks
Vinod

Java: efficiently store boolean[32]?

In Java, I would like to store (>10'000) arrays of boolean values (boolean[]) with length 32 to the disk and read them again later on for further computation and comparison.
Since a single array will have a length of 32, I wonder whether it makes sense to store it as an integer value to speed up the reading and writing (on a 32-bit machine). Would you suggest using BitSet and then converting to int? Or even forgetting about int and using bytes?
For binary storage, use int and a DataOutputStream (DataInputStream for reading).
I think boolean arrays are stored as byte or int arrays internally in Java, so you may want to consider avoiding the overhead and keeping the int encoding all the time, i.e. not use boolean[] at all.
Instead, have something like
public class BooleanArray32 {
    private int values;

    public boolean get(int pos) {
        return (values & (1 << pos)) != 0;
    }

    public void set(int pos, boolean value) {
        int mask = 1 << pos;
        values = (values & ~mask) | (value ? mask : 0);
    }

    public void write(DataOutputStream dos) throws IOException {
        dos.writeInt(values);
    }

    public void read(DataInputStream dis) throws IOException {
        values = dis.readInt();
    }

    public int compare(BooleanArray32 b2) {
        return countBits(b2.values & values);
    }

    // From http://graphics.stanford.edu/~seander/bithacks.html
    // Disclaimer: I did not fully double check whether this works for Java's signed ints
    public static int countBits(int v) {
        v = v - ((v >>> 1) & 0x55555555);                 // reuse input as temporary
        v = (v & 0x33333333) + ((v >>> 2) & 0x33333333);  // temp
        return ((v + (v >>> 4) & 0xF0F0F0F) * 0x1010101) >>> 24;
    }
}
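A possible usage sketch for writing and reading many of these with DataOutputStream/DataInputStream (the file name and count are just examples):
static void roundTrip() throws IOException {
    BooleanArray32[] arrays = new BooleanArray32[10_000];
    for (int i = 0; i < arrays.length; i++) {
        arrays[i] = new BooleanArray32();
        arrays[i].set(i % 32, true);                       // example content
    }
    try (DataOutputStream dos = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream("flags.bin")))) {
        for (BooleanArray32 a : arrays)
            a.write(dos);                                  // 4 bytes per array
    }
    BooleanArray32[] loaded = new BooleanArray32[arrays.length];
    try (DataInputStream dis = new DataInputStream(
            new BufferedInputStream(new FileInputStream("flags.bin")))) {
        for (int i = 0; i < loaded.length; i++) {
            loaded[i] = new BooleanArray32();
            loaded[i].read(dis);
        }
    }
}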
I was initially under the strong impression that any compression you apply to pack your boolean values would increase the read and write time (my mistake, I was clearly missing my medication). What you will gain is in terms of the storage involved.
BitSet is a sensible choice on your business logic side. It internally stores a long[], which you could convert to an int. However, since BitSet is prude enough not to show you its privates, you need to get each bit index in sequence. This means that I guess there is no real advantage in converting to an int rather than just using bytes directly.
The roll-your-own solution of Stefan Haustein (extended as necessary to mimic BitSet) is therefore preferable for your storage requirement, since you do not incur any unnecessary overhead.

Current best way to populate mixed type byte array

I'm trying to send and receive a byte stream in which certain ranges of bytes represent different pieces of data. I've found ways to convert single primitive datatypes into bytes, but I'm wondering if there's a straightforward way to place certain pieces of data into specified byte regions.
For example, I might need to produce or read something like the following:
byte 1 - int
byte 2-5 - int
byte 6-13 - double
byte 14-21 - double
byte 25 - int
byte 26-45 - string
Any suggestions would be appreciated.
Try DataOutputStream/DataInputStream or, for arrays, the ByteBuffer class.
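For instance, a sketch with ByteBuffer for a layout like the one above (I'm treating the first field as a single byte, leaving bytes 22-24 unused, and encoding the string as fixed-width, zero-padded ASCII; all of those are assumptions you'd adapt to your actual protocol):
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

static byte[] pack(byte flag, int count, double x, double y, byte code, String name) {
    ByteBuffer buf = ByteBuffer.allocate(45);     // total size of the example layout
    buf.put(flag);                                // byte 1
    buf.putInt(count);                            // bytes 2-5
    buf.putDouble(x);                             // bytes 6-13
    buf.putDouble(y);                             // bytes 14-21
    buf.position(24);                             // skip bytes 22-24 (unspecified above)
    buf.put(code);                                // byte 25
    byte[] str = Arrays.copyOf(name.getBytes(StandardCharsets.US_ASCII), 20);
    buf.put(str);                                 // bytes 26-45, truncated/zero-padded to 20 bytes
    return buf.array();
}
Reading it back is the mirror image: wrap the received array in a ByteBuffer and issue the corresponding get/getInt/getDouble calls in the same order.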
For storing the integer in X bytes, you may use the following method. If you think it is badly named, you may use the much less descriptive i2os name which is used in several (crypto) algorithm descriptions. Note that the returned octet string uses Big Endian encoding of unsigned ints, which you should specify for your protocol.
public static byte[] possitiveIntegerToOctetString(
        final long value, final int octets) {
    if (value < 0) {
        throw new IllegalArgumentException("Cannot encode negative values");
    }
    if (octets < 1) {
        throw new IllegalArgumentException("Cannot encode a number in negative or zero octets");
    }
    final int longSizeBytes = Long.SIZE / Byte.SIZE;
    final int byteBufferSize = Math.max(octets, longSizeBytes);
    final ByteBuffer buf = ByteBuffer.allocate(byteBufferSize);
    for (int i = 0; i < byteBufferSize - longSizeBytes; i++) {
        buf.put((byte) 0x00);
    }
    buf.mark();
    buf.putLong(value);
    // more bytes than long encoding
    if (octets >= longSizeBytes) {
        return buf.array();
    }
    // less bytes than long encoding (reset to mark first)
    buf.reset();
    for (int i = 0; i < longSizeBytes - octets; i++) {
        if (buf.get() != 0x00) {
            throw new IllegalArgumentException("Value does not fit in " + octets + " octet(s)");
        }
    }
    final byte[] result = new byte[octets];
    buf.get(result);
    return result;
}
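For example (a quick check of the method above, with the expected contents in comments):
byte[] two = possitiveIntegerToOctetString(1025L, 2);    // {0x04, 0x01}
byte[] ten = possitiveIntegerToOctetString(1025L, 10);   // eight 0x00 padding bytes, then 0x04, 0x01
// possitiveIntegerToOctetString(70000L, 2) throws, because the value needs 3 octets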
EDIT: before storing the string, think of a padding mechanism (spaces would be the most common) and a character encoding, e.g. String.getBytes(Charset.forName("ASCII")) or "Latin-1". Those are the most common encodings with a single byte per character. Calculating the size of "UTF-8" is slightly more difficult (encode first, then add 0x20-valued bytes at the end using ByteBuffer).
You may want to consider having a constant size for each data type. For example, the 32-bit Java int will take up 4 bytes, a long will take 8, etc. In fact, if you use Java's DataInputStream and DataOutputStream, you'll basically be doing that anyway. They have really nice methods like read/writeInt, etc.
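For example, a small fragment writing one record with fixed-size fields (the field values are made up; read it back with the matching DataInputStream calls in the same order):
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
dos.writeByte(1);                                      // 1 byte
dos.writeInt(42);                                      // 4 bytes
dos.writeDouble(3.14);                                 // 8 bytes
dos.writeDouble(2.71);                                 // 8 bytes
dos.writeByte(7);                                      // 1 byte
dos.writeBytes(String.format("%-20s", "some name"));   // 20 bytes, space-padded ASCII
dos.close();
byte[] record = baos.toByteArray();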
