I'm writing some "big data" software that needs to hold a lot of data in memory. I wrote a prototype in C++ that works great. However, the actual end users typically code in Java, so they've asked me to also write a Java prototype.
I've done background reading on memory footprint in Java and some preliminary tests. For example, let's say I have this object:
public class DataPoint {
    int cents, time, product_id, store_id;

    public DataPoint(int cents, int time, int product_id, int store_id) {
        this.cents = cents;
        this.time = time;
        this.product_id = product_id;
        this.store_id = store_id;
    }
}
In C++ the sizeof this structure is 16 bytes, which makes sense. In Java we have to be indirect: if I create, e.g., 10 million of these objects, use Runtime.totalMemory() - Runtime.freeMemory() before
and after, and then divide as appropriate, I get approximately 36 bytes per structure. A ~2.4x memory difference is pretty nasty; it's going to get ugly when we try to hold hundreds of millions of DataPoints in memory.
I read somewhere that in cases like this in Java it's better to store the data as arrays -- essentially a column-based store rather than a row-based store. I think I understand this: the column-based layout reduces the number of references, and perhaps the JVM can even pack the ints into 8-byte words intelligently.
What other tricks can I use for reducing the memory-footprint of what is essentially a memory block that has one very large dimension (millions/billions of datapoints) and one very small dimension (the O(1) number of columns/variables)?
Turns out storing the data as 4 int arrays used exactly 16 bytes per entry. The lesson: small objects have nasty proportional overhead in Java.
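For reference, here is a minimal sketch of that 4-array layout (the class and method names are illustrative, not from the original code):

import java.util.Arrays;

// Column-based store: one primitive array per field instead of one
// object per data point, so overhead is per-array rather than per-entry.
public class DataPointStore {
    private final int[] cents, time, productId, storeId;

    public DataPointStore(int capacity) {
        cents = new int[capacity];
        time = new int[capacity];
        productId = new int[capacity];
        storeId = new int[capacity];
    }

    public void set(int i, int cents, int time, int productId, int storeId) {
        this.cents[i] = cents;
        this.time[i] = time;
        this.productId[i] = productId;
        this.storeId[i] = storeId;
    }

    public int getCents(int i) { return cents[i]; }
    public int getTime(int i) { return time[i]; }
    // ... accessors for the remaining columns
}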
It isn't that straightforward to see how much memory your data structure takes in Java. totalMemory() shows the space allocated for the VM, which is larger than the actual usage. You could try a Java profiler that shows the space consumption of your data structures; they are quite easy to set up and run. One handy free tool is Java's own VisualVM, which, for example, shows the memory behaviour of your application; you will also learn a bit about how Java's GC works if you use it.
[VisualVM screenshot showing the application's memory footprint; image from http://visualvm.java.net/features.html]
You should also consider making the variables final if possible. It allows the Java VM to optimize the code a bit better (not sure if it saves space, though).
First of all, an object in Java will always be slightly larger than a C++ version, since the object encapsulates runtime type information that enables you to do instanceof etc., which is not possible in C++. It also takes care of the memory management you would otherwise have to do manually, so you can consider that manual-management part of your C++ code as not part of the comparison.
You could look into the Flyweight pattern to reduce memory requirements, so that you reuse DataPoints (make the class immutable). I assume that if you have billions of points, as you say, some will probably have the same values.
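A minimal sketch of that flyweight idea, assuming DataPoint is made immutable with proper equals()/hashCode() (the factory name is illustrative):

import java.util.HashMap;
import java.util.Map;

public class DataPointFactory {
    private final Map<DataPoint, DataPoint> pool = new HashMap<>();

    // Returns a shared instance for equal values, so duplicate
    // points cost one object instead of one object each.
    public DataPoint intern(DataPoint candidate) {
        DataPoint existing = pool.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }
}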
I am sure others here will give more concrete information on optimizing memory usage.
Depending on the value ranges you might be able to use smaller data types. Can you get away with using byte or short for some of the members?
Java programs can be very memory hungry. For example, a Double object takes 24 bytes: 8 bytes of data and 16 bytes of JVM-imposed overhead. In general, the objects that represent the primitive types are very expensive.
The same happens for any collection in the Java Standard Library. There are even some counterintuitive facts such as a HashSet being more memory hungry than a HashMap, since a HashSet contains a HashMap inside (http://docs.oracle.com/javase/7/docs/api/java/util/HashSet.html).
Could you come up with some advice when modeling data and delegation of objects in high performance settings so that these "weaknesses" of Java are mitigated?
Some techniques I use to reduce memory:
Make your own IntArrayList (etc) class that prevents boxing (see the sketch after this list)
Make your own IntHashMap (etc) class where keys are primitives
Use NIO's ByteBuffer to store large arrays of data efficiently (and, with allocateDirect, in native memory outside the heap). It's like a byte array, but contains methods to store/retrieve all primitive types from the buffer at any arbitrary offset (trading some speed for memory)
Don't use pooling because pools keep unused instances explicitly alive.
Use threads sparingly; they're super memory hungry (each thread's stack lives in native memory, outside the heap)
When making substrings of a big string and discarding the original, note that on older JVMs (before Java 7u6) the substrings still refer to the original's character array, so use new String(...) to let the old big string be collected.
A linear array is smaller than a multidimensional array, and if the size of all but the last dimension is a power of two, calculating indices is fastest: array[x|y<<4] for a 16xN array.
Initialize collections and StringBuilder with an initial capacity chosen such that it prevents internal reallocation in a typical circumstance.
Use StringBuilder instead of string concatenation, because the compiled class files use new StringBuilder() without an initial capacity to concatenate strings.
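Here is a bare-bones sketch of the IntArrayList idea from the first point, assuming you only need add/get/size; a real implementation would need more operations:

import java.util.Arrays;

// Growable list of int that never boxes into Integer objects.
public class IntArrayList {
    private int[] data = new int[16];
    private int size;

    public void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // grow geometrically
        }
        data[size++] = value;
    }

    public int get(int index) {
        if (index >= size) throw new IndexOutOfBoundsException(String.valueOf(index));
        return data[index];
    }

    public int size() { return size; }
}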
It depends on the application, but generally speaking:
Layout data structures in (parallel) arrays of primitives
Try to make big "flat" objects, inlining otherwise sensible sub-structures
Specialize collections of primitives
Reuse objects, use object pools, ThreadLocals
Go off-heap (see the sketch below)
I cannot say these practices are "best", because, unfortunately, they make you suffer: they lose much of the point of using Java, and they reduce the flexibility, supportability, reliability, testability and other "good" properties of the codebase.
But they certainly allow you to lower the memory footprint and GC pressure.
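As an illustration of the "go off-heap" point above, a minimal sketch that packs fixed-size records into a direct ByteBuffer (the four-int record layout is an assumption for the example):

import java.nio.ByteBuffer;

public class OffHeapPoints {
    private static final int RECORD_SIZE = 16; // four ints per record
    private final ByteBuffer buf;

    public OffHeapPoints(int capacity) {
        // allocateDirect places the buffer in native memory, outside the heap
        buf = ByteBuffer.allocateDirect(capacity * RECORD_SIZE);
    }

    public void put(int i, int cents, int time, int productId, int storeId) {
        int off = i * RECORD_SIZE;
        buf.putInt(off, cents);
        buf.putInt(off + 4, time);
        buf.putInt(off + 8, productId);
        buf.putInt(off + 12, storeId);
    }

    public int getCents(int i) { return buf.getInt(i * RECORD_SIZE); }
}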
One of the memory problems that are easy to overlook in Java is memory leakage. Nicholas Greene already pointed you to memory profiling.
Many people assume that Java's garbage collection prevents memory leaks, but that is not actually true - all it takes is one forgotten reference somewhere to keep an object around in perpetuity. Paradoxically, trying to optimize your program may introduce more opportunities for memory leaks because you end up with more complex data structures.
One example of a memory leak, if you are implementing, for instance, a stack:
Integer[] stack = new Integer[10];
int stackPtr = 0;

// a few push operations on our stack...
stack[stackPtr++] = new Integer(5);
stack[stackPtr++] = new Integer(3);

// ...and pop from the stack again
--stackPtr;
--stackPtr;

// At this point the stack is logically empty, but the Integer
// objects are still referenced by the array, and are effectively leaked.
The correct solution would have been:
stack[--stackPtr] = null;
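Wrapped into a method, a sketch of a leak-free pop (field names follow the example above):

public Integer pop() {
    Integer top = stack[--stackPtr];
    stack[stackPtr] = null; // clear the slot so the GC can reclaim the object
    return top;
}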
If you have high performance constraints and need to use collections for simple types, you might take a look at some implementations of Primitive Collections for Java.
Some are:
HPPC
GNU Trove
Apache Commons Primitives
Also, as a reference, take a look at this question: Why can Java Collections not directly store primitive types?
Luís Bianchin already gave you a few libraries which implement optimal collections in Java.
Nevertheless, it seems that you are especially concerned about Java collections' memory allocation. In that case, there are a few alternatives which are quite straightforward.
Cache
You could use a cache to limit the memory the collection (the cache) can allocate. By doing that, you only load the most frequently used entries into main memory, and you don't need to load the whole data set from disk/network/whatever. I highly recommend Guava Cache as it's very well documented and pretty mature.
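A minimal sketch with Guava's CacheBuilder, assuming a Record value type and a loadFromDisk lookup of your own (both are placeholders here, not part of Guava):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class RecordCache {
    // Bound memory by capping the number of in-memory entries;
    // evicted entries are simply reloaded on demand.
    private final LoadingCache<String, Record> cache = CacheBuilder.newBuilder()
            .maximumSize(100_000)
            .build(new CacheLoader<String, Record>() {
                @Override
                public Record load(String key) {
                    return loadFromDisk(key); // hypothetical loader
                }
            });

    public Record get(String key) {
        return cache.getUnchecked(key); // loads on miss, cached afterwards
    }

    private Record loadFromDisk(String key) {
        // ... fetch the entry from disk/network/whatever
        throw new UnsupportedOperationException("stub");
    }
}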
Persistent Collections
Sometimes a cache is not a solution for your problem. For example, in an ETL solution, you might know you will only load each entry once. For this scenario I recommend going with persistent collections: disk-stored collections that are way faster than traditional databases but have nice Java APIs.
MapDB and PCollections are for me the best libraries.
Profile memory usage
On top of that, if you really want to know the actual state of your program's memory allocation, I highly recommend using a profiler. This way you will not only know how much memory your collections occupy, but also how the GC behaves over time.
In fact, you should only try an alternative to Java's collections and data structures if there is an actual memory problem, and that is something a profiler can tell you.
The JDK has a profiler called VisualVM which does a great job. Nevertheless, I recommend using a commercial profiler if you can afford it; commercial profilers usually have a lower impact on the application's performance than VisualVM.
Memory-optimal serialization
Finally, this is not strictly related to your question, but it's closely connected: in case you want to serialize your Java objects into an optimal binary representation, I recommend Google Protocol Buffers in Java. Protocol buffers are ideal for transferring data structures through the network using the least bandwidth possible, with really fast encoding/decoding.
Well, there are a lot of things you can do.
Here are a few problems and solutions:
When you "change" the value of a string in Java, the string is not actually overwritten. Instead, a new string is created to replace the old one, while the old string still exists until it is collected. This can be a problem when using RAM efficiently is a concern. Here are some solutions:
When using a string to specify something like the "state" of an object, or anything else that can only have a specific set of possible values, don't use a string; use an enum instead. If you don't know what an enum is or how to use one yet, here's a link to a tutorial on what enums are and how to use them!
If you are using a string as a variable whose value will change at some point in the program, don't build it with plain string concatenation. Instead, use the StringBuilder class from the java.lang package. StringBuilder maintains a mutable character buffer: when you change its contents, it doesn't create a duplicate string with a different value to replace the old one, it modifies the buffer in place. Therefore, since you aren't creating duplicate strings, this saves RAM. Here is a link to the StringBuilder class in the Java API.
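For example:

StringBuilder sb = new StringBuilder(64); // preallocate to avoid internal resizing
for (int i = 0; i < 10; i++) {
    sb.append(i).append(','); // mutates the internal buffer; no temporary Strings
}
String csv = sb.toString(); // a single String is created at the end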
Writer and reader objects, such as FileWriters and FileReaders, also take up RAM and hold underlying OS resources. If you have a lot of them, this can also cause problems. Here are some solutions:
All reader and writer objects have a method called close(). It doesn't delete the object itself, but it releases the underlying file handles and buffers. Whenever you reach the point in your code where you know you will never use a reader or writer again, call close() (ideally via try-with-resources) and drop your references to it, which frees some RAM.
Every object in Java takes up memory. When you have an object that you won't use anymore, it isn't worth keeping around.
Note, however, that the Object class's finalize() method does not free anything: it is a callback the garbage collector invokes before reclaiming an object, and calling it yourself releases no memory. When you aren't going to use an object anymore, simply drop all references to it (set fields to null, remove it from collections) and let the GC reclaim the RAM.
Beware of early optimisation.
See When is optimisation premature?
While not knowing the exact requirements of your application or runtime environment, in my experience Java was able to handle anything I threw at it. Doing some profiling on your demo / proof-of-concept app might be time well spent if performance or garbage collection (you tagged memory leaks) is an issue.
I have a big HashMap in Java storing mappings from String to Integer. It has 400K records. It runs OK, but I am wondering if there is a better way to optimize it in terms of memory usage. After the map is initialized, it will only be searched; there are no other update operations.
I vaguely remember that I came across some suggestions to convert the string keys to ints, but I'm not sure about that. Please help or share your ideas on this.
Thanks.
I vaguely remember that I came across some suggestions to convert the string keys to ints, but I'm not sure about that.
If the string keys are actually string representations of integers, then it could make sense to convert them to Integer or Long objects using Integer.valueOf(String). You will save some memory, since the primitive wrapper classes use less memory than the corresponding String objects. The space saving is likely to be significant (maybe ~16 bytes versus ~40 bytes per key, depending on your platform).
The flip-side of this is that you would need to convert candidate keys from String to the real key type before doing a hashmap lookup. That conversion takes a bit of time, and typically generates a bit of garbage.
But if the String keys do not represent integers, then this is simply not going to work. (Or at least ... I don't know what "conversion" you are referring to ...)
Note also that the key type has to be Integer / Long rather than int / long. Generic type parameters must be reference types.
There may be 3rd-party collection implementations that would help as well, depending on precisely how your data structure works; e.g. Trove, Guava, fastutil. Try combining them with the String -> Integer preconversion.
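A minimal sketch of that preconversion, assuming the keys really are decimal integer strings (stringMap stands for your existing map):

import java.util.HashMap;
import java.util.Map;

static Map<Integer, Integer> preconvert(Map<String, Integer> stringMap) {
    Map<Integer, Integer> compact = new HashMap<>(stringMap.size() * 2);
    for (Map.Entry<String, Integer> e : stringMap.entrySet()) {
        // throws NumberFormatException if a key is not a decimal integer
        compact.put(Integer.valueOf(e.getKey()), e.getValue());
    }
    return compact;
}

// Lookups must convert the candidate key the same way:
// Integer value = compact.get(Integer.valueOf(someStringKey));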
On the suggestion of using a database: if
you don't need the query / update / transactional capabilities of a database, AND
you can afford the memory to hold the data in memory, AND
you can afford the startup cost of loading the data into memory,
then using a database is just a big, unnecessary performance hit on each lookup.
You might want to tune initialCapacity and loadFactor, and improve hashCode() to avoid collisions, if you want to read at a higher rate; if you have too many writes, you might want to benchmark hashCode() as well.
Even if this is too big for your app, you might want to consider moving it outside the JVM to some cache (Redis), or maybe a database, if you can afford the small read/write delay.
Writing the data to a database is ultimately the best solution if the data gets too big, but 400k is still doable in memory.
However, Java's built-in HashMap implementation uses separate chaining, and every key-value pair needs a separate entry object. I've gotten great (30%) speed improvements and awesome (50%) memory improvements by building a quadratic-probing implementation of Map.
I suggest you search around on the internet. There are plenty of good implementations around!
You could use Guava's ImmutableMap -- it's optimized for write-once data, and takes ~15% less memory than HashMap.
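For example, assuming Guava is on the classpath and mutableMap is your existing HashMap<String, Integer>:

import com.google.common.collect.ImmutableMap;
import java.util.Map;

Map<String, Integer> frozen = ImmutableMap.copyOf(mutableMap); // compact, read-only copy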
I'm using ArrayList<Integer> in my research project. I need to keep an unknown number of integers in this list. Sometimes I need to update the list: remove existing records or add new ones. As Integer is an object, it takes much more memory than a plain int. Is there any alternative way to maintain the list that will consume less memory than Integer?
Try an integer list implementation that is optimized for memory usage, such as the one from the Colt library:
http://acs.lbl.gov/software/colt/api/cern/colt/list/IntArrayList.html
Java Integer objects usually require more overhead than an int primitive, so you need an implementation that is space-optimized.
From Colt:
Scientific and technical computing, as, for example, carried out at CERN, is characterized by demanding problem sizes and a need for high performance at reasonably small memory footprint. [...]
You could use an array of ints and write your own methods with the same logic that ArrayList uses. But IMO that is a bad idea - modern machines have enough memory to use Integer objects, trust me... :)
That depends on the language you use, but I assume it's Java. In Java, as you probably know, you can't store ints in an ArrayList, because they are a primitive datatype. To use ints, you'd have to use regular arrays, which are fixed-size. That means you need to create a new, larger array whenever an addition would exceed the current array's length. You trade memory for complexity, as you have to write a lot more code and mess around with moving ints back and forth.
The reduced memory use is unlikely to be worth the work and the extra risk of bugs in implementing such a solution.
You should also think about an alternative storage system to your ArrayList. With a linked structure, every element has overhead that sometimes consumes more memory than the value itself. Maybe you don't need the elements ordered. Have you thought about a Map or a simple Set, if applicable, or implementing your own data structure?
Possible Duplicate:
In Java, what is the best way to determine the size of an object?
In Actionscript I can usually just do:
var myVar:uint = 5;
getSize(myVar)
//outputs 4 bytes
How do I do this in Java?
If you run with -XX:-UseTLAB (which disables thread-local allocation buffers), you can check Runtime.freeMemory() before and after an allocation. However, local variables don't take space on the heap (they live on the stack), so you can't get their size this way.
However, an int is a 32-bit signed value, and you can expect it to use 4 bytes (or more, depending on the JVM, stack alignment, etc.).
The sizeof in C++ is useful for pointer arithmetic. Since Java doesn't allow that, it isn't useful, and is possibly deliberately hidden to avoid developers worrying about low-level details.
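A rough sketch of that before/after measurement, averaged over many allocations (run with -XX:-UseTLAB for more stable numbers; results are approximate and JVM-dependent):

public class SizeOfEstimate {
    public static void main(String[] args) {
        final int count = 1_000_000;
        Runtime rt = Runtime.getRuntime();
        Object[] hold = new Object[count];  // keep references so nothing is collected
        System.gc();                        // best-effort: reduce noise before measuring
        long before = rt.totalMemory() - rt.freeMemory();
        for (int i = 0; i < count; i++) {
            hold[i] = new Object();
        }
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("approx bytes/object: " + (after - before) / (double) count);
    }
}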
The only reason C had a sizeof operator was that it was needed for manual memory management and some pointer-arithmetic stuff.
There's no need for that in Java. Also, how much memory an object takes up is completely implementation-defined and can't be answered reliably, but you can get some statistics by allocating lots of the same object and averaging - this can work nicely if you observe some basic principles, but that's boring.
If we know some basics about our VM we can also just count memory, so for Hotspot:
2 words overhead per object
every object is 8-byte aligned (i.e. you have to round up to the next multiple of 8)
at least 1 word for variables, i.e. even if you have an object without any variables we "waste" 1 word
Also, you should know your language spec a bit, so that you understand why an inner class has one more reference than is obvious, and why a static nested class does not.
A bit of work, but then it's generally a rather useless thing to know - if you're that worried about memory, you shouldn't be using ActionScript or Java but C/C++ - you may get identical performance in Java, but you'll generally use about a factor of 2 more memory while doing so...
I believe there is no direct way of doing this. @Peter Lawrey's suggestion could be a close approximation. But you cannot rely on calculating an object's size by taking the difference between the available free memory before and after the object is built, as there could be lots of other allocations happening in the background from other threads, and the garbage collector could fire up and free some memory in between your operations. Especially in a multithreaded environment, relying on the memory difference is definitely not a solution.
Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:
public class Record {
    private final int field1;
    private final int field2;
    private final long field3;
    /* constructor & accessors here */
}
List<Record> records = new ArrayList<Record>();
If I end up using a large number (> 10^6) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) would compare with an optimized approach for storage costs:
public class OptimizedRecordStore {
    private final int[] field1;
    private final int[] field2;
    private final long[] field3;

    Record getRecord(int i) { return new Record(field1[i], field2[i], field3[i]); }
    /* constructor and other accessors & methods */
}
edit:
assume the # of records is something that is changed infrequently or never
I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access on the records, e.g. if field1 and field2 are "time" and "position", and it's important for me to get those values as an array for use with MATLAB, so I can graph/analyze them efficiently.
The answers that give the general "optimise when you have to" advice are unhelpful in this case because, IMHO, programmers should always be aware of the performance implications of different design choices when a choice leads to an order-of-magnitude performance penalty, particularly API writers.
The original question is quite valid and I would tend to agree that the second approach is better, given his particular situation. I've written image processing code where each pixel requires a data structure, a situation not too dissimilar to this, except I needed frequent random access to each pixel. The overhead of creating one object for each pixel was enormous.
The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects and it will do so on accesses. That could lead to a lot of garbage (from a GC point of view). Not good.
Generally speaking, you should worry about how you use the objects long before you think about performance. So you have a record with three fields or three arrays. Which one more accurately depicts what you're modeling? By this I mean, when you insert or delete an item, are you doing one of the three arrays or all three as a block?
I suspect it's the latter in which case the former makes far more sense.
If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate, perhaps a SortedSet or a Map or SortedMap.
If you have millions of records, the second approach has several advantages:
Memory usage: the first approach uses more memory because a) every Java object on the heap has a header (containing a class id, lock state, etc.); b) objects are aligned in memory; c) each reference to an object costs 4 bytes (on 64-bit JVMs with compressed OOPs or on 32-bit JVMs) or 8 bytes (on 64-bit JVMs without compressed OOPs). See e.g. CompressedOops for more details. So the first approach takes about two times more memory (more precisely: according to my benchmark, an object with 16 bytes of payload plus a reference to it took 28 bytes on 32-bit Java 7, 36 bytes on 64-bit Java 7 with compressed OOPs, and 40 bytes on 64-bit Java 7 without compressed OOPs).
Garbage collection: although the second approach seems to create many objects (one on each call of getRecord), it might not be so, as modern server JVMs (e.g. Oracle's Java 7) can apply escape analysis and stack allocation to avoid heap allocation of temporary objects in some cases; in any case, GCing short-lived objects is cheap. On the other hand, it is probably easier for the garbage collector if there are not millions of long-lived objects (as there are in the first approach) whose reachability it has to check (or at least, such objects may make your application need more careful tuning of GC generation sizes). Thus the second approach may be better for GC performance. However, to see whether it makes a difference in a real situation, one should benchmark it oneself.
Serialization speed: the speed of (de)serializing a large array of primitives to disk is limited only by HDD speed; serializing many small objects is inevitably slower (especially if you use Java's default serialization).
Therefore I have used the second approach quite often for very large collections. But of course, if you have enough memory and don't care about serialization, the first approach is simpler.
How are you going to access the data? If accesses to the fields are always coupled, then use the first option; if you are going to process the fields on their own, then the second option is better.
See this article in wikipedia: Parallel Array
A good example of when it's more convenient to have separate arrays is simulations where the numerical data is packed together in one array, and other attributes like name, colour, etc., accessed only for presentation of the data, are kept in another array.
I was curious so I actually ran a benchmark. If you don't re-create the object like you are[1], then SoA beats AoS by 5-100% depending on workload[2]. See my code here:
https://gist.github.com/twolfe18/8168262c5420c7a62d39
[1] I didn't add that because if you are concerned enough about speed to consider this refactor, it would be silly to do that.
[2] This also doesn't account for re-allocation, but again, this is often something you can either amortize away or know statically. This is a reasonable assumption for a pure-speed benchmark.
Notice that the second approach might have a negative impact on caching behaviour. If you want to access a single record at a time, you'd better not have that record scattered all across the place.
Also, the only memory you win in the second approach is (possibly) due to member alignment and to avoiding the per-record allocation of a separate object.
Otherwise, they have exactly the same memory use, asymptotically. The first option is much better due to locality, IMO
Whenever I have tried doing number crunching in Java, I have always had to revert to C-style coding (i.e. close to your option 2). It minimises the number of objects floating around in your system: instead of 1,000,000 objects, you only have 3. I was able to do a bit of FFT analysis of real-time sound data using the C style, whereas it was far too slow using objects.
I'd choose the first method (array of structures) unless you access the store relatively infrequently and are running into serious memory pressure issues.
First version basically stores the objects in their "natural" form (+1 BTW for using immutable records). This uses a little more memory because of the per-object overhead (probably around 8-16 bytes depending on your JVM) but is very good for accessing and returning objects in a convenient and human-understandable form in one simple step.
Second version uses less memory overall, but the allocation of a new object on every "get" is a pretty ugly solution that will not perform well if accesses are frequent.
Some other possibilities to consider:
An interesting "extreme" variant would be to take the second version but write your algorithms / access methods to interact with the underlying arrays directly. This is clearly going to result in complex inter-dependencies and some ugly code, but would probably give you the absolute best performance if you really needed it. It's quite common to use this approach for intensive graphics applications such as manipulating a large array of 3D coordinates.
A "hybrid" option would be to store the underlying data in a structure of arrays as in the second version, but cache the accessed objects in a HashMap so that you only generate the object the first time a particular index is accessed. Might make sense if only a small fraction of objects are ever likely to accessed, but all data is needed "just in case".
(Not a direct answer, but one that I think should be given)
From your comment,
"cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs. – Jason S Jul 14 '09 at 14:27"
You should always ignore optimization until it presents itself as a problem. The most important thing is to have the system be usable by a developer (so they can make it usable by a user). There are very few times that you should concern yourself with optimization; in fact, in ~20 years of professional coding I have cared about optimization a total of two times:
Writing a program that had its primary purpose to be faster than another product
Writing a smartphone app with the intention of reducing the amount of data sent between the client and server
In the first case I wrote some code, then ran it through a profiler. When I wanted to do something and was not sure which approach was best (for speed/memory), I would code it one way and see the result in the profiler, then code it the other way and see the result. Then I would choose the faster of the two. This works, and you learn a lot about low-level decisions. I did not, however, allow it to impact the higher-level classes.
In the second case, there was no programming involved, but I did the same basic thing of looking at the data being sent and figuring out how to reduce the number of messages being sent as well as the number of bytes being sent.
If your code is clear then it will be easier to speed up once you find out it is slow. As cletus said in his answer, you are resizing one array -vs- three arrays... resizing one will be faster than three. From a higher point of view, resizing once is simpler to understand than resizing three times, and thus more likely to be correct.
Personally, I'd rather get the right answer slowly than the wrong answer quickly. Once I know how to get the right answer, I can find out where the system is slow and replace those parts of it with faster implementations.
Because you are making the int[] fields final, you are stuck with just the one initialization of each array and that is it. Thus, if you wanted 10^6 field1 values, Java would need to reserve that much memory for each of those int[]s, because you cannot resize those arrays. With an ArrayList, if you do not know the number of records beforehand and will potentially be removing records, you save a lot of space upfront, and later on as well when you remove records.
I would go for the ArrayList version too, so I don't need to worry about growing it. Do you need column-like access to the values? What is the scenario behind your question?
Edit: You could also use a common long[][] matrix.
I don't know how you pass the columns to MATLAB, but I guess you wouldn't gain much speed with column-based storage; more likely you'd lose speed in the Java computation.