Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:
public class Record {
    private final int field1;
    private final int field2;
    private final long field3;
    /* constructor & accessors here */
}
List<Record> records = new ArrayList<Record>();
If I end up using a large number (> 10^6) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) compares with an optimized approach for storage costs:
public class OptimizedRecordStore {
    private final int[] field1;
    private final int[] field2;
    private final long[] field3;

    Record getRecord(int i) { return new Record(field1[i], field2[i], field3[i]); }
    /* constructor and other accessors & methods */
}
edit:
assume the # of records is something that is changed infrequently or never
I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access on the records, e.g. if field1 and field2 are "time" and "position", and it's important for me to get those values as an array for use with MATLAB, so I can graph/analyze them efficiently.
The answers that give the general "optimise when you have to" advice are unhelpful in this case because, IMHO, programmers should always be aware of the performance of different design choices when a choice leads to an order-of-magnitude performance penalty, particularly API writers.
The original question is quite valid and I would tend to agree that the second approach is better, given his particular situation. I've written image processing code where each pixel requires a data structure, a situation not too dissimilar to this, except I needed frequent random access to each pixel. The overhead of creating one object for each pixel was enormous.
The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects and it will do so on accesses. That could lead to a lot of garbage (from a GC point of view). Not good.
Generally speaking, you should worry about how you use the objects long before you think about performance. So you have a record with three fields, or three arrays. Which one more accurately depicts what you're modelling? By this I mean: when you insert or delete an item, are you operating on just one of the three arrays, or on all three as a block?
I suspect it's the latter, in which case the record-based version makes far more sense.
If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate, perhaps a SortedSet or a Map or SortedMap.
If you have millions of records, the second approach has several advantages:
Memory usage: the first approach uses more memory because a) every Java object in heap has a header (containing class id, lock state etc.); b) objects are aligned in memory; c) each reference to an object costs 4 bytes (on 64-bit JVMs with Compressed OOPs or 32-bit JVMs) or 8 bytes (64-bit JVMs without Compressed OOPs). See e. g. CompressedOops for more details. So the first approach takes about two times more memory (more precisely: according to my benchmark, an object with 16 bytes of payload + a reference to it took 28 bytes on 32-bit Java 7, 36 bytes on 64-bit Java 7 with compressed OOPs, and 40 bytes on 64-bit Java 7 w/o compressed OOPs).
Garbage collection: although the second approach seems to create many objects (one on each call of getRecord), it might not be so, as modern server JVMs (e. g. Oracle's Java 7) can apply escape analysis and stack allocation to avoid heap allocation of temporary objects in some cases; in any case, GCing short-lived objects is cheap. On the other hand, it is probably easier for the garbage collector if it doesn't have millions of long-lived objects (as in the first approach) whose reachability it must check (or at least, such objects may make your application need more careful tuning of GC generation sizes). Thus the second approach may be better for GC performance. However, to see whether it makes a difference in your real situation, you should benchmark it yourself.
Serialization speed: the speed of (de)serializing a large array of primitives to disk is limited only by HDD speed; serializing many small objects is inevitably slower (especially if you use Java's default serialization). A small sketch of the primitive-array write follows below.
Therefore I have used the second approach quite often for very large collections. But of course, if you have enough memory and don't care about serialization, the first approach is simpler.
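For illustration, here is a minimal sketch (not from the original answer) of writing one primitive column with DataOutputStream, the kind of bulk write the serialization point above refers to; the class and method names are made up.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class ColumnWriter {
    // Writes one long[] column as a length prefix followed by raw values.
    // This streams primitives directly, unlike default object serialization,
    // which writes per-object metadata for every Record.
    static void writeColumn(String path, long[] column) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeInt(column.length);
            for (long value : column) {
                out.writeLong(value);
            }
        }
    }
}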
How are you going to access the data? If the fields are always accessed together, use the first option; if you are going to process each field on its own, the second option is better.
See this article in wikipedia: Parallel Array
A good example of when it's more convenient to have separate arrays is a simulation where the numerical data is packed together in one array, while other attributes like name, colour, etc., which are accessed only for presentation, live in another array.
I was curious so I actually ran a benchmark. If you don't re-create the object like you are[1], then SoA beats AoS by 5-100% depending on workload[2]. See my code here:
https://gist.github.com/twolfe18/8168262c5420c7a62d39
[1] I didn't add that because if you are concerned enough about speed to consider this refactor, it would be silly to do that.
[2] This also doesn't account for re-allocation, but again, this is often something you can either amortize away or know statically. This is a reasonable assumption for a pure-speed benchmark.
Notice that the second approach might have a negative impact on caching behaviour. If you want to access a single record at a time, you'd better not have that record scattered all over the place.
Also, the only memory you win in the second approach is (possibly) due to member alignment and to not having to allocate a separate object per record.
Otherwise, the two have exactly the same memory use, asymptotically. The first option is much better due to locality, IMO.
Whenever I have tried doing number crunching in Java, I have always had to revert to C-style coding (i.e. close to your option 2). It minimised the number of objects floating around in the system: instead of 1,000,000 objects, you only have 3. I was able to do a bit of FFT analysis of real-time sound data using the C style, and it was far too slow using objects.
I'd choose the first method (array of structures) unless you access the store relatively infrequently and are running into serious memory pressure issues.
The first version basically stores the objects in their "natural" form (+1 BTW for using immutable records). This uses a little more memory because of the per-object overhead (probably around 8-16 bytes depending on your JVM) but is very good for accessing and returning objects in a convenient and human-understandable form in one simple step.
The second version uses less memory overall, but allocating a new object on every "get" is a pretty ugly solution that will not perform well if accesses are frequent.
Some other possibilities to consider:
An interesting "extreme" variant would be to take the second version but write your algorithms / access methods to interact with the underlying arrays directly. This is clearly going to result in complex inter-dependencies and some ugly code, but would probably give you the absolute best performance if you really needed it. It's quite common to use this approach for intensive graphics applications such as manipulating a large array of 3D coordinates.
A "hybrid" option would be to store the underlying data in a structure of arrays as in the second version, but cache the accessed objects in a HashMap so that you only generate the object the first time a particular index is accessed. Might make sense if only a small fraction of objects are ever likely to accessed, but all data is needed "just in case".
(Not a direct answer, but one that I think should be given)
From your comment,
"cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs. – Jason S Jul 14 '09 at 14:27"
You should always ignore optimization until it presents itself as a problem. Most important is to have the system be usable by a developer (so they can make it usable by a user). There are very few times that you should concern yourself with optimization, in fact in ~20 years of professional coding I have cared about optimization a total of two times:
Writing a program that had its primary purpose to be faster than another product
Writing a smartphone app with the intention of reducing the amount of data sent between the client and server
In the first case I wrote some code, then ran it through a profiler. When I wanted to do something and was not sure which approach was best (for speed/memory), I would code it one way and see the result in the profiler, then code it the other way and see the result. Then I would choose the faster of the two. This works and you learn a lot about low-level decisions. I did not, however, allow it to impact the higher-level classes.
In the second case, there was no programming involved, but I did the same basic thing of looking at the data being sent and figuring out how to reduce the number of messages being sent as well as the number of bytes being sent.
If your code is clear then it will be easier to speed up once you find out it is slow. As Cletus said in his answer, you are resizing one array versus three... one will be faster than three. From a higher point of view, the one is simpler to understand than the three, and thus more likely to be correct.
Personally I'd rather get the right answer slowly than the wrong answer quickly. Once I know how to get the right answer, then I can find out where the system is slow and replace those parts of it with faster implementations.
Because you are making the int[] fields final, you are stuck with a single initialization of each array and that's it. Thus, if you wanted 10^6 field1 values, Java would need to allocate that much memory for each of those int[] up front, because you cannot resize those arrays. With an ArrayList, if you do not know the number of records beforehand and will potentially be removing records, you save a lot of space up front and then later on as well when you remove records.
I would go for the ArrayList version too, so I don't need to worry about growing it. Do you need column-like access to values? What is the scenario behind your question?
Edit: You could also use a single long[][] matrix.
I don't know how you pass the columns to MATLAB, but I guess you don't gain much speed with column-based storage; more likely you lose speed in the Java computation.
Related
Guava contains utilities for splitting and joining Strings, but it requires the instantiation of a Splitter/Joiner object to do so. These are small objects that typically only contain the character(s) on which to split/join. Is it a good idea to maintain references to these objects in order to reuse them, or is it preferable to just create them whenever you need them and let them be garbage collected?
For example, I could implement this method in the following two ways:
String joinLines(List<String> lines) {
    return Joiner.on("\n").join(lines);
}
OR
static final Joiner LINE_JOINER = Joiner.on("\n");

String joinLines(List<String> lines) {
    return LINE_JOINER.join(lines);
}
I find the first way more readable, but it seems wasteful to create a new Joiner each time the method is called.
To be honest, this sounds like premature optimization to me. I agree with @Andy Turner: write whatever is easiest to understand and maintain.
If you plan to use Joiner.on("\n") in a few places, make it a well named constant; go with option two.
If you only plan to use it in your joinLines method, a constant seems overly verbose; go with option one.
It depends greatly on how often you expect the code to be called and what tradeoffs you want to make between CPU time, memory consumption and readability. Since Joiner is such a small thing, it's not going to make a huge difference either way: if you make it a constant, you save the (fairly minimal) costs of allocating it and GCing it for each call, while adding the (fairly minimal) memory consumption overhead to the program.
It also depends in part on what platform you're running the code on: if you're running on the server, typically you'll have plenty of memory so keeping a constant won't be an issue. On the other hand, if you're running on Android you're more memory constrained, but you also want to avoid unnecessary allocations since garbage collection is going to be worse and more impactful to your performance.
Personally, I tend to allocate a constant unless I know it's only going to be used some fixed number of times as opposed to repeatedly throughout the program.
I have a big HashMap in Java storing mappings from String to Integer. It has 400K records. It runs OK, but I am wondering if there is a better way to optimize it in terms of memory usage. After the map is initialized, it will only be searched; there are no other update operations.
I vaguely remember that I came across some suggestions to convert the string keys to int, but I'm not sure about that. Please help or share your ideas on this.
Thanks.
I vaguely remember that I came across some suggestions to convert the string keys as int, but not sure about that.
If the string keys are actually the string representations of integers, then it could make sense to convert them to Integer or Long objects using Integer.valueOf(String). You will save some memory, since the primitive wrapper classes use less memory than the corresponding String objects. The space saving is likely to be significant (maybe ~16 bytes versus ~40 bytes per key, depending on your platform).
The flip-side of this is that you would need to convert candidate keys from String to the real key type before doing a hashmap lookup. That conversion takes a bit of time, and typically generates a bit of garbage.
But if the String keys do not represent integers, then this is simply not going to work. (Or at least ... I don't know what "conversion" you are referring to ...)
Note also that the key type has to be Integer / Long rather than int / long. Generic type parameters must be reference types.
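A minimal sketch of that preconversion, assuming every key string really is a decimal integer (the class and method names here are made up):

import java.util.HashMap;
import java.util.Map;

public class KeyConversion {
    // Rebuild the map with Integer keys; lookups must then convert the
    // candidate String key with Integer.valueOf before calling get().
    static Map<Integer, Integer> convertKeys(Map<String, Integer> original) {
        Map<Integer, Integer> converted = new HashMap<Integer, Integer>(original.size() * 2);
        for (Map.Entry<String, Integer> e : original.entrySet()) {
            converted.put(Integer.valueOf(e.getKey()), e.getValue());
        }
        return converted;
    }
}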
There may be 3rd-party collection implementations that would help as well, depending on precisely how your data structure works; e.g. Trove, Guava, fastutil. Try combining them with the String -> Integer preconversion.
On the suggestion of using a database: if
you don't need the query / update / transactional capabilities of a database, AND
you can afford the memory to hold the data in memory, AND
you can afford the startup cost of loading the data into memory,
then using a database is just a big, unnecessary performance hit on each lookup.
You might want to tune initialCapacity and loadFactor, and also improve hashCode() to avoid collisions, if you need a higher read rate; if you have many writes you might want to benchmark hashCode() as well.
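For example, a sketch of sizing the map up front so the 400K entries never trigger a rehash (0.75f is just the default load factor; the factory name is made up):

import java.util.HashMap;
import java.util.Map;

class LookupMapFactory {
    static Map<String, Integer> newLookupMap(int expectedEntries) {
        // Capacity chosen so that expectedEntries stays below capacity * loadFactor.
        int initialCapacity = (int) (expectedEntries / 0.75f) + 1;
        return new HashMap<String, Integer>(initialCapacity, 0.75f);
    }
}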
Even if this is too big for your app, you might want to consider moving it outside the JVM to some cache (Redis), or maybe a database, if you can afford the small read/write delay.
Writing the data to a database is ultimately the best solution if the data gets too big, but 400k is still doable in memory.
However, Java's built-in HashMap implementation uses separate chaining, and every key-value pair is wrapped in a separate entry object. I've gotten great (30%) speed improvements and awesome (50%) memory improvements by building a quadratic-probing implementation of Map.
I suggest you search around on the internet. There are plenty of good implementations around!
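For a rough idea of what such an open-addressing map can look like, here is a small sketch (not the answerer's implementation: fixed capacity, no resizing, no removal):

public class ProbingStringIntMap {
    private final String[] keys;
    private final int[] values;
    private final int mask; // capacity is a power of two

    public ProbingStringIntMap(int expectedEntries) {
        // Pick a power-of-two capacity at least twice the expected entry count.
        int capacity = Integer.highestOneBit(expectedEntries * 2 - 1) * 2;
        keys = new String[capacity];
        values = new int[capacity];
        mask = capacity - 1;
    }

    public void put(String key, int value) {
        int i = indexOf(key);
        keys[i] = key;
        values[i] = value;
    }

    // Returns defaultValue if the key is absent.
    public int get(String key, int defaultValue) {
        int i = indexOf(key);
        return keys[i] != null ? values[i] : defaultValue;
    }

    // Quadratic (triangular-number) probing: offsets 0, 1, 3, 6, ... which
    // visits every slot when the capacity is a power of two.
    private int indexOf(String key) {
        int i = key.hashCode() & mask;
        for (int step = 1; keys[i] != null && !keys[i].equals(key); step++) {
            i = (i + step) & mask;
        }
        return i;
    }
}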
You could use Guava's ImmutableMap -- it's optimized for write-once data, and takes ~15% less memory than HashMap.
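For instance (a sketch; assumes Guava is on the classpath, and the class name is made up):

import com.google.common.collect.ImmutableMap;
import java.util.Map;

class Frozen {
    // Snapshot a map that will only ever be read from here on.
    static Map<String, Integer> freeze(Map<String, Integer> source) {
        return ImmutableMap.copyOf(source);
    }
}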
I am currently writing some code in Java meant to be a little framework for a project which revolves around a database with some billions of entries. I want to keep it high-level, and the data retrieved from the database should be easily usable for statistical inference. I have resolved to use the Map interface in this project.
A core concept is mapping the attributes ("columns in the database") to values ("cells") when handling single datasets (by which I mean rows in the database), for readable code: I use enum objects (named "Attribute") for the attribute types, which means mapping <Attribute, String>, because the data elements are all Strings (also not very large, 40 characters maximum or so).
There are 15 columns, so there are 15 enum constants, and each map will have at most that many entries, or fewer.
So it appears I will at times have a very large number of Map objects floating around, each with a comparatively small payload (15 entries or fewer). My goal is to keep the implementation's memory overhead from exploding relative to the actual payload. (Stretch goal: do the same with CPU usage ;] )
I was not really familiar with all the different implementations of the Java Collections to date, and when the problem dawned on me today, I looked into my to-date all-time favourite, HashMap, and was not happy with how much memory overhead it carries. I am sure that, in addition to the standard implementations, there are a number of implementations not shipped with Java. Googling my case did not bring up much of a result, so I am asking you:
Do you know a good implementation of Map for my use case (low entry count, low value size, enumerable keys, ...)?
I hope I made my use case clear, and am anxious for your input =)
Thanks a lot!
Stretch answer goal, absolutely optional and only if you got the time and knowledge:
What other implementations of collections are suitable for:
handling attribute vectors (the String things), and matrices for inference data (counts/probabilities) (matrices: here I am really clueless for now; I've done no serious math work with Java to date)
math libraries for statistical inference, see above
Use EnumMap; it is the best map implementation if you have enums as keys, for both performance and memory usage.
The trick is that this map implementation is the only one that does not store the keys; it only needs a single array with the values (similar to an ArrayList of the values). There is only a little bit of overhead if there are keys that are not mapped to a value, but in most cases this won't be a problem because enums usually do not have too many constants.
Compared to HashMap, you additionally get a predictable iteration order for free.
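A quick sketch of what that looks like for the use case above (the Attribute constants and class names here are invented for illustration):

import java.util.EnumMap;
import java.util.Map;

enum Attribute { NAME, CITY, PRODUCT /* ... the other columns ... */ }

class RowFactory {
    static Map<Attribute, String> newRow() {
        // Backed internally by a single Object[] indexed by the enum ordinal,
        // so there are no per-entry node objects and no stored keys.
        Map<Attribute, String> row = new EnumMap<Attribute, String>(Attribute.class);
        row.put(Attribute.NAME, "Alice");
        row.put(Attribute.CITY, "Berlin");
        return row;
    }
}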
Since you start off saying you want to store lots of data, eventually, you'll also want to access/modify that data. There are many high performance libraries out there.
Look at
Trove4j : https://bitbucket.org/robeden/trove/
HPPC: http://labs.carrotsearch.com/hppc.html
FastUtil: http://fastutil.di.unimi.it/
When you find a bottleneck, you can switch to using a lower level API (more efficient)
You'll have many more choices if you look a bit more: What is the most efficient Java Collections library?
EDIT: if your strings are not unique, you could save significant amounts of memory using String.intern() : Is it good practice to use java.lang.String.intern()?
You can squeeze out a bit of memory with a simple map implementation that uses two array lists (keys and values). For larger maps that means insertion and lookup become much slower, because you have to scan the entire list. However, for small maps it is actually faster this way, since you don't have to calculate any hash codes and only have to look at a small number of entries.
If you need an implementation, take a look at my SimpleMap in my jsonj project: https://github.com/jillesvangurp/jsonj/blob/master/src/main/java/com/github/jsonj/SimpleMap.java
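If you'd rather see the idea than follow the link, a minimal sketch of such a two-list map (not the SimpleMap implementation itself) could look like this:

import java.util.ArrayList;
import java.util.List;

public class TinyMap<K, V> {
    private final List<K> keys = new ArrayList<K>();
    private final List<V> values = new ArrayList<V>();

    // Linear scan instead of hashing: cheap for a handful of entries.
    public V put(K key, V value) {
        int i = keys.indexOf(key);
        if (i >= 0) {
            return values.set(i, value);
        }
        keys.add(key);
        values.add(value);
        return null;
    }

    public V get(K key) {
        int i = keys.indexOf(key);
        return i >= 0 ? values.get(i) : null;
    }

    public int size() {
        return keys.size();
    }
}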
I'm writing some "big data" software that needs to hold a lot of data in memory. I wrote a prototype in c++ that works great. However the actual end-users typically code in Java so they've asked me to also write a java prototype.
I've done background reading on memory footprint in Java and some preliminary tests. For example, let's say I have this object:
public class DataPoint {
    int cents, time, product_id, store_id;

    public DataPoint(int cents, int time, int product_id, int store_id) {
        this.cents = cents;
        this.time = time;
        this.product_id = product_id;
        this.store_id = store_id;
    }
}
In C++ the sizeof of this structure is 16 bytes, which makes sense. In Java we have to be indirect. If I create, e.g., 10m of these objects, use Runtime.totalMemory() - Runtime.freeMemory() before and after, and then divide as appropriate, I get approximately 36 bytes per structure. A ~2.4x memory difference is pretty nasty; it's gonna get ugly when we try to hold hundreds of millions of DataPoints in memory.
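A rough sketch of that measurement (approximate by nature: GC timing and the reference array itself both blur the numbers; the class name is made up):

public class FootprintTest {
    public static void main(String[] args) {
        final int n = 10_000_000;
        Runtime rt = Runtime.getRuntime();
        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        DataPoint[] points = new DataPoint[n];
        for (int i = 0; i < n; i++) {
            points[i] = new DataPoint(i, i, i, i);
        }

        long after = rt.totalMemory() - rt.freeMemory();
        // Includes the reference held in points[] for each object (~4-8 bytes).
        System.out.println("~" + (after - before) / n + " bytes per DataPoint");
        System.out.println(points.length); // keep the array reachable
    }
}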
I read somewhere that in cases like this in Java it's better to store the data as arrays -- essentially a column-based store rather than a row-based store. I think I understand why: the column-based layout reduces the number of references, and perhaps the JVM can even pack the ints into 8-byte words intelligently.
What other tricks can I use for reducing the memory-footprint of what is essentially a memory block that has one very large dimension (millions/billions of datapoints) and one very small dimension (the O(1) number of columns/variables)?
Turns out storing the data as 4 int arrays used exactly 16 bytes per entry. The lesson: small objects have nasty proportional overhead in Java.
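The edit above corresponds to something like this column-oriented layout (a sketch; the class and field names are illustrative):

public class DataPointColumns {
    final int[] cents, time, productId, storeId;

    public DataPointColumns(int capacity) {
        cents = new int[capacity];
        time = new int[capacity];
        productId = new int[capacity];
        storeId = new int[capacity];
    }

    // One row is just index i across the four parallel arrays.
    void set(int i, int cents, int time, int productId, int storeId) {
        this.cents[i] = cents;
        this.time[i] = time;
        this.productId[i] = productId;
        this.storeId[i] = storeId;
    }
}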
It isn't that straightforward to see how much memory your data structure takes in Java. totalMemory() shows the space allocated for the VM, which is larger than the actual usage. You could try using a Java profiler that shows the space consumption of your data structures; they are quite easy to set up and run. One handy free tool is Java's own VisualVM, which for example shows the memory behaviour of your application; you will also learn a bit about how Java's GC works if you use it.
(VisualVM screenshot showing performance footprint; image from http://visualvm.java.net/features.html)
You should also consider making the variables final if possible. It allows the Java VM to optimize the code a bit better (though I'm not sure whether it saves space).
First of all, an object in Java will always be slightly larger than a C++ version, since the object encapsulates runtime type information that enables you to do instanceof etc., which is not possible in C++. Additionally, it handles the memory management you would otherwise have to do manually, so you can also consider that part of your C++ code as not really part of the code base.
You could look into the Flyweight pattern to reduce memory requirements, so that you reuse DataPoints (make the class immutable). I assume that if you have billions of points, as you say, some will probably have the same values.
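A sketch of that flyweight idea (it assumes DataPoint is made immutable and overrides equals/hashCode; the pool class is made up, and it only pays off if many points repeat):

import java.util.HashMap;
import java.util.Map;

public class DataPointPool {
    private final Map<DataPoint, DataPoint> pool = new HashMap<DataPoint, DataPoint>();

    // Returns a canonical instance so identical points share one object.
    public DataPoint intern(DataPoint candidate) {
        DataPoint existing = pool.get(candidate);
        if (existing == null) {
            pool.put(candidate, candidate);
            return candidate;
        }
        return existing;
    }
}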
I am sure others here will give some more concrete information on optimizing in memory space
Depending on the value ranges you might be able to use smaller data types. Can you get away with using byte or short for some of the members?
I’m using ArrayList<Integer> in my research project. I need to keep an unknown number of integers in this list. Sometimes I need to update the list: remove existing records or add new records. As Integer is an object, it takes much more memory than a plain int. Is there an alternative way to maintain the list that consumes less memory than Integer?
Try an integer list implementation that is optimized for memory usage, such as the one from the Colt library:
http://acs.lbl.gov/software/colt/api/cern/colt/list/IntArrayList.html
Java Integer objects usually require more overhead than an int primitive, so you need an implementation that is space-optimized.
From Colt:
Scientific and technical computing, as, for example, carried out at CERN, is characterized by demanding problem sizes and a need for high performance at reasonably small memory footprint. [...]
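Usage is close to ArrayList but with primitives. A sketch (I'm going from memory of the Colt API here, so double-check the method names against the Javadoc linked above):

import cern.colt.list.IntArrayList;

class ColtExample {
    static int sum(int[] input) {
        IntArrayList list = new IntArrayList();
        for (int value : input) {
            list.add(value); // stored as a primitive int, no Integer boxing
        }
        int total = 0;
        for (int i = 0; i < list.size(); i++) {
            total += list.get(i);
        }
        return total;
    }
}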
You could use an array of ints and write your own methods with the same logic that ArrayList uses. But IMO that is a bad idea; modern machines have enough memory to use Integer objects, trust me... :)
That depends on the language you use, but I assume it's Java. In Java, as you probably know, you can't put ints in an ArrayList, because it only holds objects, not primitives. To use ints you'd have to use regular arrays, which are fixed-size. That means you need to create a new, larger array each time an addition would exceed the current array's length. You trade memory for complexity, as you have to write a lot more code and mess around with moving ints back and forth (a rough sketch of such a list follows below).
The reduced memory use is unlikely to be worth the work and the extra risk of bugs in implementing such a solution.
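For what it's worth, a minimal sketch of such a hand-rolled list (grow-by-doubling, no bounds checking beyond the array's own):

import java.util.Arrays;

public class IntList {
    private int[] data = new int[16];
    private int size;

    public void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // double when full
        }
        data[size++] = value;
    }

    public int get(int index) {
        return data[index];
    }

    public void removeAt(int index) {
        // Shift the tail left by one; O(n), like ArrayList.remove(int).
        System.arraycopy(data, index + 1, data, index, size - index - 1);
        size--;
    }

    public int size() {
        return size;
    }
}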
You should also think about an alternative storage structure to your ArrayList. With some list implementations' linking mechanisms, every element carries overhead that (sometimes) consumes more memory than the value itself. Maybe you don't need the elements ordered: have you thought about a Map or a simple Set, if applicable, or about implementing your own data structure?