We have this use case where we would like to compress and store objects (in-memory) and decompress them as and when required.
The data we want to compress is quite varied, from float vectors to strings to dates.
Can someone suggest any good compression technique to do this ?
We are looking at ease of compression and speed of decompression as the most important factors.
Thanks.
If you want to compress instances of MyObject you could have it implement Serializable and then stream the objects into a compressed byte array, like so:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream gzipOut = new GZIPOutputStream(baos);
ObjectOutputStream objectOut = new ObjectOutputStream(gzipOut);
objectOut.writeObject(myObj1);
objectOut.writeObject(myObj2);
objectOut.close();
byte[] bytes = baos.toByteArray();
Then to uncompress your byte[] back into the objects:
ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
GZIPInputStream gzipIn = new GZIPInputStream(bais);
ObjectInputStream objectIn = new ObjectInputStream(gzipIn);
MyObject myObj1 = (MyObject) objectIn.readObject();
MyObject myObj2 = (MyObject) objectIn.readObject();
objectIn.close();
Similar to previous answers, except I suggest you use DeflatorOutputStream and InflatorInputStream as these are simpler/faster/smaller than the alternatives. The reason it is smaller is it just does the compression whereas the alternatives add file format extensions like CRC checks and headers.
If size is important, you might like to have a simple serialization of your own. This is because ObjectOutputStream has a significant overhead making small objects much larger. (It improves for larger object especially when compressed)
e.g. an Integer takes 81 byte, and compression won't help much for such a small number of bytes. It is possible to cut this significantly.
One proposal could be to use a combination of the following streams:
ObjectOutputStream / ObjectInputStream for serializing/deserializing Java objects
GZIPOutputStream / GZIPInputStream for compressing/uncompressing. There are other options to be found in the java.util.zip package.
ByteArrayOutputStream / ByteArrayInputStream for storing the data in memory as a byte array
Compression of searilized objects in Java is usually not well... not so good.
First of all you need to understand that a Java object has a lot of additional information not needed. If you have millions of objects you have this overhead millions of times.
As an example lets us a double linked list. Each element has a previous and a next pointer + you store a long value (timestamp) + byte for the kind of interaction and two integers for the user ids. Since we use pointer compression we are 6Bytes * 2 + 8 + 4 * 2= 28Bytes. Java adds 8 Bytes + 12bytes for the padding. This makes 48Bytes per Element.
Now we create 10 million lists with 20 Elements each (time series of click events of users for the last three years (we want to find patterns)).
So we have 200Million * 48 Bytes of elements = 10GB memory (ok not much).
Ok beside the Garbage collection kills us and the overhead inside the JDK skyrocks, we end with 10GB memory.
Now lets use our own memory / object storage. We store it as a column wise data table where each object is actually a single row. So we have 200Million rows in a timestamp, previous, next, userIdA and userIdB collection.
Previous and next are now point to row ids and become 4byte (or 5bytes if we exceed 4billion entries (unlikely)).
So we have 8 + 4 + 4 + 4 + 4 => 24 * 200 Mio = 4.8GB + no GC problem.
Since the timestamp column stores the timestamps in a min max fashion and our timestamps all are within three years, we only need 5bytes to store each of the timestamps. Since the pointer are now stored relative (+ and -) and due the click series are timely closely related we only need 2bytes in average for both previous and next and for the user ids we use a dictionary since the click series are for roughly 500k users we only need three bytes each.
So we now have 5 + 2 + 2 + 3 + 3 => 15 * 200Mio => 3GB + Dictionary of 4 * 500k * 4 = 8MB = 3GB + 8MB. Sounds different to 10GB right?
But we are not finished yet. Since we now have no objects but rows and datas, we store each series as a table row and use special columns being collections of array that actually are storing 5 values and a pointer to the next five values + a pointer previous.
So we have 10Mio lists with 20 enries each (since we have overhead), we have per list 20 * (5 + 3 + 3) + 4 * 6 (lets add some overhead of partly filled elements) => 20 * 11 + 5 * 6 => 250 * 10Mio => 2,5GB + we can access the arrays faster than walking elements.
But hey its not over yet... the timestamps are now relatively stored only requiring 3 bytes per entry + 5 at the first entry. -> so we save a lot more 20 * 9 + 2 + 5 * 6 => 212 * 10Mio => 2,12 GB. And now storing it all to memory using gzip it and we result in 1GB since we can store it all lineary first storing the length of the array, all timestamps, all user ids making it very highly that there are patterns in the bits to be compressable. Since we use a dictionary we just sort it according the propability of each userId to be part of a series.
And since everything is a table you can deserialize everything in almost read speed so 1GB on a modern SSD cost 2 second to load. Try this with serialization / deserialization and you can hear inner user cry.
So before you ever compress serialized data, store it in tables, check each column / property if it can be logically be compressed. And finally have fun with it.
And remember 1TB (ECC) cost 10k today. Its nothing. And 1TB SSD 340 Euro. So do not waste your time on that issue unless you really have to.
The best compression technology I know is ZIP. Java supports ZipStream. All you need is to serialize your object into byte array and then zip it.
Tips: Use ByteArrayOutputStream, DataStream, ZipOutputStream.
There are various compression algorithm implemented in the JDK. Check the [java.util.zip](http://download.oracle.com/javase/6/docs/api/java/util/zip/package-summary.html) for all algorithm implemented. However it may not be a good thing to compress all your data. For instance a serialized empty array may be several dozen of bytes long as the name of the underlying class is in the serialized data stream. Also most compression algorithm are designed to remove redundancy from large data blocks. On small to medium Java objects you'll probably have very little or no gain at all.
This is a tricky problem:
First, using ObjectOutputStream is probably not the answer. The stream format includes a lot of type-related metadata. If you are serializing small objects, the mandatory metadata will make it hard for the compression algorithm to "break even", even if you implement custom serialization methods.
Using DataOutputStream with minimal (or no) added type information will give a better result, but mixed data is not generally that compressible using a general purpose compression algorithms.
For better compression, you may need to look at the properties of the data that you are compressing. For instance:
Date objects could be represented as int values if you know that have a precision of 1 day.
Sequences of int values could be run-length encoded, or delta-encoded if they have the right properties.
and so on.
However way you do it, you will need to do a serious amount of work to get a worthwhile amount of compression. IMO, a better idea would be to write the objects to a database, datastore or file and use caching to keep frequently used objects in memory.
If you need to compress arbitrary objects, a possible approach is to serialize the object into a byte array, and then use e.g. the DEFLATE algorithm (the one used by GZIP) to compress it. When you need the object, you can decompress and deserialize it. Not sure about how efficient this would be, but it will be completely general.
Related
Before starting to explain my problem, I should mention that I am not looking for a way to increase Java heap memory. I should strictly store these objects.
I am working on storing huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (with length 32 or less) consists of 'A', 'C', 'G', 'T', and 'N' (undefined) chars. As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower level languages like C and C++. Thus, if I store this sequence as string (it holds about 100 MB memory for a sequence with length ~30), I see the error.
I tried to represent nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and neglect 'N' (because it ruins the char to 2-bit transform as the 5-th acid). Then, concatenate these 2-bit acids into byte array. It brought some improvement but unfortunately I see the error after a couple of hours again. I need a convenient solution or at least a workaround to handle this error. Thank you in advance.
Being fairly complex maybe this here is a weird idea, and would require quite a lot of work, but this is what I would try:
You already pointed out two individual subproblems of your overall task:
the default HashMap implementation may be suboptimal for such large collection sizes
you need to store something else than strings
The map implementation
I would recommend to write a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack your whole string into a single long, well, let's stick to two longs for a key. You can make string to/back long[2] conversion fairly efficiently on the fly when providing a string key to your map implementation (use bit shifts etc).
As for packing the values, here are some considerations:
for a key-value pair a standard hashmap will need to have an array of N longs for buckets, where N is the current capacity, when the bucket is found from the hash key it will need to have a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way:
use a long[] of size 3N where N is the capacity to store both keys and values in a continuous array
in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1 you store the long[2] representation of the key, of the first key that matches this bucket or of the only one (on insertion, zero otherwise), at location 3 * (hashcode % N) + 2 you store the corresponding count
for all those cases where a different key results in the same hash code and thus the same bucket, your store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above (and resize correspondingly) large enough to have by far the largest part of the data in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hashmap
do not expand the capacity in N=2N iterations, make smaller growth steps, e.g. 10-20%. this will cost performance on populating the map, but will keep your memory footprint under control
The keys
Given the inequality 5^32 > 2^64 your idea to use bits to encode 5 letters seems to be the best I can think of right now. Use 3 bits and correspondingly long[2].
I recommend you look into the Trove4j Collections API; it offers Collections that hold primitives which will use less memory than their boxed, wrapper classes.
Specifically, you should check out their TObjectIntHashMap.
Also, I wouldn't recommended storing anything as a String or char until JDK 9 is released, as the backing char array of a String is UTF-16 encoded, using two bytes per char. JDK 9 defaults to UTF-8 where only one byte is used.
If you're using on the order of ~10gb of data, or at least data with an in memory representation size of ~10gb, then you might need to think of ways to write the data you don't need at the moment to disk and load individual portions of your dataset into memory to work on them.
I had this exact problem a few years ago when I was conducting research with monte carlo simulations so I wrote a Java data structure to solve it. You can clone/fork the source here: github.com/tylerparsons/surfdep
The library supports both MySQL and SQLite as the underlying database. If you don't have either, I'd recommend SQLite as it's much quicker to set up.
Full disclaimer: this is not the most efficient implementation, but it will handle very large datasets if you let it run for a few hours. I tested it successfully with matrices of up to 1 billion elements on my Windows laptop.
I have a rather large dataset consisting of 2.3GB worth of data spread over 160 Million byte[] arrays with an average data length of 15 bytes. The value for each byte[] key is only an int so memory usage of nearly half the hashmap (which is over 6GB) is made up of the 16 byte overhead of each byte array
overhead = 8 byte header + 4 byte length rounded up by VM to 16 bytes.
So my overhead is 2.5GB.
Does anybody know of a hashmap implementation that stores its (variable length) byte[] keys in one single large byte array so there would be no overhead (apart from 1 byte length field)?
I would rather not use an in memory DB as they usually have a performance overhead compared to a plain Trove TObjectIntHashMap that i'm using and I value cpu cycles even more then memory usage.
Thanks in advance
Since most personal computers have 16GB these days and servers often 32 - 128 GB or more, is there a real problem with there being a degree of bookkeeping overhead?
If we consider the alternative: the byte data concatenated into a single large array -- we should think about what individual values would have to look like, to reference a slice of a larger array.
Most generally you would start with:
public class ByteSlice {
protected byte[] array;
protected int offset;
protected int len;
}
However, that's 8 bytes + the size of a pointer (perhaps just 4 bytes?) + the JVM object header (12 bytes on a 64-bit JVM). So perhaps total 24 bytes.
If we try and make this single-purpose & minimalist, we're still going to need 4 bytes for the offset.
public class DedicatedByteSlice {
protected int offset;
protected byte len;
protected static byte[] getArray() {/*somebody else knows about the array*/}
}
This is still 5 bytes (probably padded to 8) + the JVM object header. Probably still total 20 bytes.
It seems that the cost of dereferencing with an offset & length, and having an object to track that, are not substantially less than the cost of storing the small array directly.
One further theoretical possibility -- de-structuring the Map Key so it is not an object
It is possible to conceive of destructuring the "length & offset" data such that it is no longer in an object. It then is passed as a set of scalar parameters eg (length, offset) and -- in a hashmap implementation -- would be stored by means of arrays of the separate components (eg. instead of a single Object[] keyArray).
However I expect it is extremely unlikely that any library offers an existing hashmap implementation for your (very particular) usecase.
If you were talking about the values it would probably be pointless, since Java does not offer multiple returns or method OUT parameters; which makes communication impractical without "boxing" destructured data back to an object. Since here you are asking about Map Keys specifically, and these are passed as parameters but need not be returned, such an approach may be theoretically possible to consider.
[Extended]
Even given this it becomes tricky -- for your usecase the map API probably has to become asymmetrical for population vs lookup, as population has to be by (offset, len) to define keys; whereas practical lookup is likely still by concrete byte[] arrays.
OTOH: Even quite old laptops have 16GB now. And your time to write this (times 4-10 to maintain) should be worth far more than the small cost of extra RAM.
Does anyone know of a Java class to store bytes that satisfies the following conditions?
Stores bytes efficiently (i.e. not one object per bytes).
Grows automatically, like a StringBuilder.
Allows indexed access to all of its bytes (without copying everything to a byte[].
Nothing I've found so far satisfies these. Specifically:
byte[] : Doesn't satisfy 2.
ByteBuffer : Doesn't satisfy 2.
ByteArrayOutputStream : Doesn't satisfy 3.
ArrayList : Doesn't satisfy 1 (AFAIK, unless there's some special-case optimisation).
If I can efficiently remove bytes from the beginning of the array that would be nice. If I were writing it from scratch I would implement it as something like
{ ArrayList<byte[256]> data; int startOffset; int size; }
and then the obvious functions. Does something like this exist?
Most straightforward would be to subclass ByteArrayOutputStream and add functionality to access the underlying byte[].
Removal of bytes from the beginning can be implemented in different ways depending on your requirements. If you need to remove a chunk, System.arrayCopy should work fine, if you need to remove single bytes I would put a headIndex which would keep track of the beginning of the data (performing an arraycopy after enough data is "removed").
There are some implementations for high performance primitive collections such as:
hppc or Koloboke
You'd have to write one. Off the top of my head what I would do is create an ArrayList internally and store the bytes 4 to each int, with appropriate functions for masking off the bytes. Performance will be sub optimal for removing and adding individual bytes. However it will store the object in the minimal size if that is a real consideration, wasting no more than 3 bytes for storage (on top of the overhead for the ArrayList).
The laziest method will be ArrayList. Its not as inefficient as you seem to believe, since Byte instances can and will be shared, meaning there will be only 256 byte objects in the entire VM unless you yourself do a "new Byte()" somewhere.
I have a bunch of Strings I'd like a fast lookup for. Each String is 22 chars long and is looked up by the first 12 only (the "key" so to say), the full set of Strings is recreated periodically. They are loaded from a file and refreshed when the file changes. I have to deal with too little available memory, other server processes on my VPS need it too and need it more.
How do I best store the Strings and search for them?
My current idea is to store them all one after another inside a char[] (to save RAM), and sort them for faster lookups (I figure the lookup is fastest if I have them presorted so I can use binary or interpolation search). But I'm not exactly sure how I should code it - if anyone is in the mood for a challenging puzzle: here it is...
Btw: It's probably ok to exceed the memory constraints for a while during the recreation / sorting, but it shouldn't be by much or for long.
Thanks!
Update
For the "I want to know specifics" crowd (correct me if I'm wrong in the Java details): The source files contain about 320 000 entries (all ANSI text), I really want to stay (WAY!) below 64 MB RAM usage and the data is only part of my program. Here's some information on sizes of Java types in memory.
My VPS is a 32bit OS, so...
one byte[], all concatenated = 12 + length bytes
one char[], all concatenated = 12 + length * 2 bytes
String = 32 + length * 2 bytes (is Object, has char[] + 3 int)
So I have to keep in memory:
~7 MB if all are stored in a byte[]
~14 MB if all are stored in a char[]
~25 MB if all are stored in a String[]
> 40 MB if they are stored in a HashTable / Map (for which I'd probably have to finetune the initial capacity)
A HashTable is not magical - it helps on insertion, but in principle it's just a very long array of String where the hashCode modulus capacity is used as an index, the data is stored in the next free position after the index and searched lineary if it's not found there on lookup. But for a Hashtable, I'd need the String itself and a substring of the first 12 chars for lookup. I don't want that (or do I miss something here?), sorry folks...
I would probably use a cache solution for that, may be even guava will do. Of course sort them, then binary search. Unfortunately I do not have the time for it :(
Sounds like a HashTable would be the right implementation for this situation.
Searching is done in constant time and refreshing could be done in linear time.
Java Data Structure Big-O (Warning PDF)
I coded a solution myself - but it's a little different than the question I posted because I could use information I didn't publish (I'll do better next time, sorry).
I'm just answering this because it's solved, I won't accept one of the other answers because they didn't really help with the memory constraints (and were a little short for my taste). They still got an upvote each, no hard feelings and thanks for taking the time!
I managed to push all of the info into two longs (with the key completely residing in the first one). The first 12 chars are an ISIN which can be compressed into a long because it only uses digits and capital letters, always starts with two capital letters and ends with a digit which can be reconstructed from the other chars. The product of all possible values leaves a little more than 3 bits to spare.
I store all entries from my source file in a long[] (packed ISIN first, other stuff in the second long) and sort them based on the first of two longs.
When I do a query by a key, I transform it to a long, do a binary search (which I'll maybe change to an interpolation search) and return the matching index. The different parts of the value are retrievable by said index - I get the second long from the array, unpack it and return the requested data.
The result: RAM usage dropped from ~110 MB to < 50 MB including Jetty (btw - I used a HashTable before) and lookups are lightning fast.
While writing a message on wire, I want to write down the number of bytes in the data followed by the data.
Message format:
{num of bytes in data}{data}
I can do this by writing the data to a temporary byteArrayOutput stream and then obtaining the byte array size from it, writing the size followed by the byte array. This approach involves a lot of overhead, viz. unnecessary creation of temporary byte arrays, creation of temporary streams, etc.
Do we have a better (considering both CPU and garbage creation) way of achieving this?
A typical approach would be to introduce a re-useable ByteBuffer. For example:
ByteBuffer out = ...
int oldPos = out.position(); // Remember current position.
out.position(oldPos + 2); // Leave space for message length (unsigned short)
out.putInt(...); // Write out data.
// Finally prepend buffer with number of bytes.
out.putShort(oldPos, (short)(out.position() - (oldPos + 2)));
Once the buffer is populated you could then send the data over the wire using SocketChannel.write(ByteBuffer) (assuming you are using NIO).
Here’s what I would do, in order of preference.
Don’t bother about memory consumption and stuff. Most likely this already is the optimal solution unless it takes a lot of time to create the byte representation of your data so that creating it twice is a noticable impact.
(Actually this would be more like #37 on my list, with #2 to #36 being empty.) Include a method in your all your data objects that can calculate the size of the byte representation and takes less resources than it would to create the byte representation.