Would a hashtable/hashmap use a lot of memory if it only consists of object references and int's?
As a school project we had to map a database to objects (which is what ORMs like Hibernate do nowadays), but eager to find a good way to avoid storing IDs in the objects just so we can save them again, we thought of putting every object we create into a HashMap/Hashtable, so we could easily retrieve its ID. My question is whether this, in my opinion more elegant, way of solving the problem would cost me performance.
"Would a hashtable/hashmap use a lot of memory if it only consists of object references and int's?"
"a lot" depends on how many objects you have. For a few hundreds or a few thousands, you're not going to notice.
But typically the default Java collections are incredibly inefficient when you're working with primitives, because of the constant boxing/unboxing between primitive and wrapper (say, int and Integer), from both a performance and a memory standpoint (the two being related but not identical).
If you have a lot of entries, like hundreds of thousands or millions, I suggest using for example the Trove collections.
In your case, you'd use this:
TIntObjectHashMap<SomeJavaClass>
or this:
TObjectIntHashMap<SomeJavaClass>
In any case, those will run circles around the default Java collections memory-wise and CPU-wise (and they trigger far less GC, etc.).
You're dodging the unnecessary automatic (un)boxing between int and Integer, the collections create far less garbage, they resize in a much smarter way, etc.
Don't even get me started on the default Java HashMap<Integer,Integer> compared to Trove's TIntIntHashMap or I'll go berserk ;)
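As a minimal sketch of the TObjectIntHashMap variant above, assuming GNU Trove 3 is on the classpath (the Object value type is just a placeholder for your own class):
import gnu.trove.map.hash.TObjectIntHashMap;

public class IdRegistry {
    // maps objects to primitive int IDs without ever boxing the ints
    private final TObjectIntHashMap<Object> ids = new TObjectIntHashMap<>();

    public void register(Object entity, int id) {
        ids.put(entity, id);
    }

    public int idOf(Object entity) {
        return ids.get(entity); // returns the primitive directly, no Integer created
    }
}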
Minimally, you'd need an implementation of the Map.Entry interface holding a reference to the key object and a reference to the value object. If either the key or the value is a primitive type, such as int, you'll need a wrapper type (e.g. Integer) for it as well. The Map.Entry objects are referenced from the map's internal bucket array.
Take a look at this question for more information on how to measure your memory consumption in Java.
It's impossible to answer this without some figures. How many objects are you looking to store? Don't forget you're storing the objects already, so the key/object reference combination should be fairly small.
The only sensible thing to do is to try this and see if it works for you. Don't forget that the JVM will have a default maximum memory allocation and you can increase this (if you need) via -Xmx
Java programs can be very memory hungry. For example, a Double object has 24 bytes: 8 bytes of data and 16 bytes of JVM-imposed overhead. In general, the objects that represent the primitive types are very expensive.
The same happens for any collection in the Java Standard Library. There are even some counterintuitive facts such as a HashSet being more memory hungry than a HashMap, since a HashSet contains a HashMap inside (http://docs.oracle.com/javase/7/docs/api/java/util/HashSet.html).
Could you come up with some advice when modeling data and delegation of objects in high performance settings so that these "weaknesses" of Java are mitigated?
Some techniques I use to reduce memory:
Make your own IntArrayList (etc) class that prevents boxing
Make your own IntHashMap (etc) class where keys are primitives
Use NIO's ByteBuffer to store large arrays of data efficiently (and in native memory, outside the heap). It's like a byte array, but it has methods to store/retrieve every primitive type at any arbitrary offset (trade memory for speed; see the sketch after this list)
Don't use pooling because pools keep unused instances explicitly alive.
Use threads sparingly; they're super memory hungry (each thread's stack lives in native memory, outside the heap)
When you make substrings of a big string and discard the original, the substrings can still refer to the original's character array (this was String.substring's behavior in JVMs before Java 7u6). Use new String(substring) to cut loose the old big string.
A linear array is smaller than a multidimensional array, and if the size of all but the last dimension is a power of two, calculating indices is fastest: array[x|y<<4] for a 16xN array.
Initialize collections and StringBuilder with an initial capacity chosen such that it prevents internal reallocation in a typical circumstance.
Use StringBuilder instead of string concatenation, because the compiled class files use new StringBuilder() without initial capacity to concatenate strings.
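As a minimal sketch of the ByteBuffer point above (the size and offsets are arbitrary; adapt them to your data layout):
import java.nio.ByteBuffer;

public class OffHeapInts {
    public static void main(String[] args) {
        // allocateDirect places the buffer in native memory, outside the Java heap
        ByteBuffer buf = ByteBuffer.allocateDirect(1_000_000 * Integer.BYTES);

        buf.putInt(0, 42);         // write an int at byte offset 0
        buf.putInt(4, 7);          // the next int starts 4 bytes later
        int first = buf.getInt(0); // absolute reads don't move the buffer position

        System.out.println(first);
    }
}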
Depends on the application, but generally speaking
Lay out data structures in (parallel) arrays of primitives (see the sketch after this list)
Try to make big "flat" objects, inlining otherwise sensible sub-structures
Specialize collections of primitives
Reuse objects, use object pools, ThreadLocals
Go off-heap
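As a minimal sketch of the parallel-arrays idea, here is a hypothetical point store that keeps two primitive arrays instead of allocating one Point object per entry:
// one object header per *array*, not one per entry
public class PointStore {
    private final double[] xs;
    private final double[] ys;

    public PointStore(int capacity) {
        xs = new double[capacity];
        ys = new double[capacity];
    }

    public void set(int i, double x, double y) {
        xs[i] = x;
        ys[i] = y;
    }

    public double distanceFromOrigin(int i) {
        return Math.sqrt(xs[i] * xs[i] + ys[i] * ys[i]);
    }
}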
I cannot say these practices are "best", because they unfortunately make you suffer: they undercut the reason you're using Java in the first place, and they reduce the flexibility, supportability, reliability, testability and other "good" properties of the codebase.
But they certainly lower the memory footprint and GC pressure.
One of the memory problems that are easy to overlook in Java is memory leakage. Nicholas Greene already pointed you to memory profiling.
Many people assume that Java's garbage collection prevents memory leaks, but that is not actually true - all it takes is one forgotten reference somewhere to keep an object around in perpetuity. Paradoxically, trying to optimize your program may introduce more opportunities for memory leaks because you end up with more complex data structures.
One example of a memory leak, if you are implementing, for instance, a stack:
Integer[] stack = new Integer[10];
int stackPtr = 0;

// a few push operations on our stack
stack[stackPtr++] = new Integer(5);
stack[stackPtr++] = new Integer(3);

// and pop from the stack again
--stackPtr;
--stackPtr;

// at this point, the stack is logically empty, but
// the Integer objects are still referenced by the array,
// and are effectively leaked
The correct pop would also have cleared the array slot:
stack[--stackPtr] = null; // drop the reference so the Integer can be collected
If you have high performance constraints and need to use collections for simple types, you might take a look on some implementations of Primitive Collections for Java.
Some are:
HPPC
GNU Trove
Apache Commons Primitives
Also, as a reference take a look at this question: Why can Java Collections not directly store Primitives types?
Luís Bianchin already gave you a few libraries which implement optimal collections in Java.
Nevertheless, it seems you are especially concerned about the memory the Java collections allocate. In that case, there are a few alternatives which are quite straightforward.
Cache
You could use a cache to limit the memory the collection (the cache) may allocate. That way you keep only the most frequently used entries in main memory and don't need to load the whole data set from disk/network/whatever. I highly recommend Guava Cache, as it's very well documented and pretty mature.
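A minimal sketch of a size-bounded Guava cache (the Long/String types are placeholders for your own key and value types):
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class BoundedCacheExample {
    public static void main(String[] args) {
        // at most 10,000 entries are kept in memory; older ones get evicted
        Cache<Long, String> cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)
                .build();

        cache.put(1L, "first row");
        String row = cache.getIfPresent(1L); // null if evicted or never loaded
        System.out.println(row);
    }
}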
Persistent Collections
Sometimes a cache is not the solution to your problem. For example, in an ETL pipeline you might know you will only load each entry once. For this scenario I recommend persistent collections: disk-backed collections that are way faster than traditional databases but have nice Java APIs.
MapDB and PCollections are for me the best libraries.
Profile memory usage
On top of that, if you really want to know the actual state of your program's memory allocation, I highly recommend using a profiler. That way you will know not only how much memory your collections occupy, but also how the GC behaves over time.
In fact, you should only try an alternative to Java's collections and data structures if there is an actual memory problem, and that is something a profiler can tell you.
The JDK ships with a profiler, VisualVM, which does a great job. Nevertheless, I recommend a commercial profiler if you can afford it; commercial profilers usually have a lower impact on the application's performance than VisualVM.
Memory-optimal data on the network
Finally, this is not strictly related to your question, but it's closely connected. In case you want to serialize your Java objects into an optimal binary representation, I recommend Google Protocol Buffers in Java. Protocol buffers are ideal for transferring data structures through the network using the least bandwidth possible, with really fast encoding/decoding.
Well, there are a lot of things you can do.
Here are a few problems and solutions:
When you change the value of a string in Java, the string is not actually overwritten. Instead, a new string is created to replace the old one, and the old string object sticks around until the garbage collector reclaims it. This can be a problem when using RAM efficiently is a concern. Here are some solutions to this problem:
When using a string to specify something like the "state" of an object or anything else that can only have a specific set of possible values, don't use a string. Instead use an enum. If you don't know what an enum is or how to use one yet, here's a link to a tutorial on what enums are and how to use them!
If you are using a string as a variable whose value will change at some point in the program, don't define it as a plain String. Instead, use the StringBuilder class from the java.lang package. StringBuilder handles its contents differently than String: when you change its value, it doesn't create a duplicate string to replace the old one, it mutates its internal character buffer in place. Since you aren't creating duplicate strings, this saves RAM (a sketch follows below). Here is a link to the StringBuilder class in the Java API.
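A minimal sketch of the difference, building one string out of many pieces (the loop contents are arbitrary):
public class BuilderExample {
    public static void main(String[] args) {
        // one mutable buffer, sized up front so it rarely has to grow
        StringBuilder sb = new StringBuilder(64);
        for (int i = 0; i < 10; i++) {
            sb.append(i).append(','); // mutates the buffer, no new String per step
        }
        String result = sb.toString(); // a single String created at the end
        System.out.println(result);
    }
}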
Writer and reader objects, such as FileWriters and FileReaders, also take up RAM, and, more importantly, they hold on to OS-level resources such as file handles. If you have a lot of them, this can also cause problems. Here are some solutions:
All reader and writer objects have a method called close(). As you can probably guess, it closes the reader or writer, releasing the underlying resources so they can be reclaimed. Whenever you reach the point in your code where you know you will never use a reader or writer again, call close() on it, or better, use try-with-resources, as sketched below.
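A minimal sketch using try-with-resources, which closes the reader for you even when an exception is thrown (the file name is a placeholder):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadFirstLine {
    public static void main(String[] args) throws IOException {
        // the reader is closed automatically when the try block exits
        try (BufferedReader reader = new BufferedReader(new FileReader("data.txt"))) {
            System.out.println(reader.readLine());
        }
    }
}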
Every object in Java takes up memory, and keeping objects around that you will never use again wastes it.
Don't reach for the Object class's finalize() method here, though: calling it yourself does not free memory. It is merely a hook the garbage collector may invoke before reclaiming an object (and it is deprecated in modern Java). The way to get rid of an unused object is simply to drop all references to it, for example by nulling out the fields or collection slots that point to it, so the garbage collector can free the RAM.
Beware of early optimisation.
See When is optimisation premature?
While not knowing the exact requirements of your application or runtime environment, in my experience Java was able to handle anything I threw at it. Doing some profiling on your demo/proof-of-concept app might be time well spent if performance or garbage collection (you tagged memory-leaks) is an issue.
I have a big HashMap in Java storing mappings from String to Integer. It has 400K records. It runs OK, but I am wondering if there is a better way to optimize it in terms of memory usage. After the map is initialized, it will only be searched; there are no other update operations.
I vaguely remember that I came across some suggestions to convert the string keys to int, but I am not sure about that. Please help or share your ideas on this.
Thanks.
I vaguely remember that I came across some suggestions to convert the string keys to int, but I am not sure about that.
If the string keys are actually the string representations of integers, then it could make sense to convert them to Integer or Long objects using Integer.valueOf(String). You will save some memory, since the primitive wrapper classes use less memory than the corresponding String objects. The space saving is likely to be significant (maybe ~16 bytes versus ~40 bytes per key, depending on your platform).
The flip-side of this is that you would need to convert candidate keys from String to the real key type before doing a hashmap lookup. That conversion takes a bit of time, and typically generates a bit of garbage.
But if the String keys do not represent integers, then this is simply not going to work. (Or at least ... I don't know what "conversion" you are referring to ...)
Note also that the key type has to be Integer / Long rather than int / long. Generic type parameters must be reference types.
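A minimal sketch of the preconversion idea, assuming the keys really are decimal strings:
import java.util.HashMap;
import java.util.Map;

public class PreconvertedKeys {
    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>();

        // convert once while loading: "12345" -> 12345
        map.put(Integer.valueOf("12345"), 42);

        // convert each candidate key the same way before the lookup
        String candidate = "12345";
        Integer value = map.get(Integer.valueOf(candidate));
        System.out.println(value);
    }
}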
There may be 3rd-party collection implementations that would help as well, depending on precisely how your data structure works; e.g. Trove, Guava, fastutil. Try combining them with the String -> Integer preconversion.
On the suggestion of using a database: if
you don't need the query / update / transactional capabilities of a database, AND
you can afford the memory to hold the data in memory, AND
you can afford the startup cost of loading the data into memory,
then using a database is just a big, unnecessary performance hit on each lookup.
You might want to tune initialCapacity and loadFactor, and improve hashCode() to avoid collisions, if you want to read at a higher rate; if you have many writes, you might want to benchmark hashCode() as well.
If this ever gets too big for your app, you might want to consider moving it outside the JVM to some cache (Redis) or maybe a database, if you can afford the little read/write delay.
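A minimal sketch of that capacity tuning (the numbers assume ~400K entries and the default load factor):
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        // capacity >= entries / loadFactor avoids rehashing:
        // 400,000 / 0.75 is roughly 534,000, so 550,000 leaves headroom
        Map<String, Integer> map = new HashMap<>(550_000, 0.75f);

        map.put("some-key", 1);
        System.out.println(map.get("some-key"));
    }
}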
Writing the data to a database is ultimately the best solution if the data gets too big, but 400k is still doable in memory.
However, Java's built-in HashMap implementation uses separate chaining, and every key-value pair gets its own entry object. I've gotten great (30%) speed improvements and awesome (50%) memory improvements by building a quadratic-probing implementation of Map.
I suggest you search around on the internet. There are plenty of good implementations around!
You could use Guava's ImmutableMap -- it's optimized for write-once data, and takes ~15% less memory than HashMap.
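A minimal sketch with Guava's ImmutableMap builder (the entries are placeholders):
import com.google.common.collect.ImmutableMap;
import java.util.Map;

public class FrozenMap {
    public static void main(String[] args) {
        // built once, then read-only; no spare capacity kept around for growth
        Map<String, Integer> map = ImmutableMap.<String, Integer>builder()
                .put("alpha", 1)
                .put("beta", 2)
                .build();

        System.out.println(map.get("alpha"));
    }
}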
I am currently writing some Java code meant to be a little framework for a project which revolves around a database with some billions of entries. I want to keep it high-level, and the data retrieved from the database should be easily usable for statistical inference. I have resolved to use the Map interface in this project.
A core concept is mapping the attributes ("columns" in the database) to values ("cells") when handling single datasets (by which I mean single rows of the database), for readable code: I use enum constants (of an enum named "Attribute") for the attribute types, which means mapping <Attribute, String>, because the data elements are all Strings (and not very large ones, 40 characters at most).
There are 15 columns, so there are 15 enum constants, and each map will have at most that many entries.
So it appears I will at times have a very large number of Map objects floating around, but each with a comparatively small payload (15 entries or fewer). My goal is to keep memory from exploding due to the implementation's overhead relative to the actual payload. (Stretch goal: do the same with CPU usage ;] )
I was not really familiar with all the different implementations of the Java Collections to date, and when the problem dawned on me today, I looked into my all-time favorite, HashMap, and was not happy with how much memory overhead it declares. I am sure that, in addition to the standard implementations, there are a number of implementations not shipped with Java. Googling my case did not bring up much, so I am asking you:
Do you know a good implementation of Map for my use case (low entry count, small values, enumerable keys, ...)?
I hope I made my use case clear, and am anxious for your input =)
Thanks a lot!
Stretch answer goal, absolutely optional and only if you got the time and knowledge:
What other implementations of collections are suitable for:
handling attribute vectors (the String things), and matrices for inference data (counts/probabilities) (matrices: here I am really clueless for now; I have done no serious math work with Java to date)
math libraries for statistical inference, see above
Use EnumMap, this is the best map implementation if you have enums as key, for both performance and memory usage.
The trick is that this map implementation is the only one that does not store the keys: it only needs a single array with the values, indexed by the enum's ordinal (similar to an ArrayList of the values). There is a little bit of overhead for keys that are not mapped to a value, but in most cases this won't be a problem, because enums usually do not have too many constants.
Compared to HashMap, you additionally get a predictable iteration order for free.
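A minimal sketch, with a hypothetical Attribute enum standing in for the 15 columns:
import java.util.EnumMap;
import java.util.Map;

public class RowExample {
    // hypothetical subset of the 15 column attributes
    enum Attribute { NAME, CITY, COUNTRY }

    public static void main(String[] args) {
        // backed by a single values-array indexed by ordinal, no hashing
        Map<Attribute, String> row = new EnumMap<>(Attribute.class);
        row.put(Attribute.NAME, "Alice");
        row.put(Attribute.CITY, "Berlin");

        System.out.println(row.get(Attribute.NAME));
    }
}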
Since you start off saying you want to store lots of data, eventually, you'll also want to access/modify that data. There are many high performance libraries out there.
Look at
Trove4j : https://bitbucket.org/robeden/trove/
HPPC: http://labs.carrotsearch.com/hppc.html
FastUtil: http://fastutil.di.unimi.it/
When you find a bottleneck, you can switch to using a lower level API (more efficient)
You'll find many more choices if you look a bit more: What is the most efficient Java Collections library?
EDIT: if your strings are not unique, you could save significant amounts of memory using String.intern() : Is it good practice to use java.lang.String.intern()?
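A minimal sketch of what interning does: equal strings collapse to one canonical instance.
public class InternExample {
    public static void main(String[] args) {
        String a = new String("status=OK"); // a fresh object, distinct from the literal
        String b = a.intern();              // the canonical instance from the string pool

        System.out.println(a == "status=OK"); // false: two different objects
        System.out.println(b == "status=OK"); // true: the same pooled instance
    }
}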
You can squeeze out a bit of memory with a simple map implementation that uses two array lists (keys and values). For larger maps, that means insertion and lookup become much slower, because you have to scan the entire list. However, for small maps it is actually faster this way, since you don't have to calculate any hash codes and only have to look at a small number of entries.
If you need an implementation, take a look at my SimpleMap in my jsonj project: https://github.com/jillesvangurp/jsonj/blob/master/src/main/java/com/github/jsonj/SimpleMap.java
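For illustration, a minimal sketch of the two-list approach (not the jsonj SimpleMap itself, just the idea):
import java.util.ArrayList;
import java.util.List;

public class TwoListMap<K, V> {
    private final List<K> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();

    public V get(K key) {
        int i = keys.indexOf(key); // linear scan; cheap for small maps
        return i < 0 ? null : values.get(i);
    }

    public void put(K key, V value) {
        int i = keys.indexOf(key);
        if (i < 0) {
            keys.add(key);
            values.add(value);
        } else {
            values.set(i, value);
        }
    }

    public int size() {
        return keys.size();
    }
}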
I'm using an ArrayList<Integer> in my research project. I need to keep an unknown number of integers in this list. Sometimes I need to update the list: remove existing records or add new ones. As Integer is an object, it takes much more memory than a plain int. Is there any alternative way to maintain the list that consumes less memory than Integer?
Try an integer list implementation that is optimized for memory usage, such as the one from the Colt library:
http://acs.lbl.gov/software/colt/api/cern/colt/list/IntArrayList.html
Java Integer objects usually require more overhead than an int primitive, so you need an implementation that is space-optimized.
From Colt:
Scientific and technical computing, as, for example, carried out at CERN, is characterized by demanding problem sizes and a need for high performance at reasonably small memory footprint. [...]
You could use an array of ints and write your own methods with the same logic that ArrayList has. But IMO that is a bad idea; modern machines have enough memory to use Integer objects, trust me... :)
That depends on the language you use, but I assume it's Java. In Java, as you probably know, you can't store primitive ints in an ArrayList; they get autoboxed to Integer. To store ints directly, you'd have to use regular arrays, which are fixed size. That means you need to create a new, larger array whenever adding an element would exceed the current array's capacity (a sketch follows below). You trade memory for complexity, as you have to write more code and mess around with moving ints back and forth.
The reduced memory use is unlikely to be worth the work and the extra risk of bugs in implementing such a solution.
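Still, as a minimal sketch of what such a growable int array would look like (avoiding Integer boxing entirely):
import java.util.Arrays;

public class IntList {
    private int[] data = new int[16];
    private int size = 0;

    public void add(int v) {
        if (size == data.length) {
            // double the backing array when full, as ArrayList does internally
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = v;
    }

    public int get(int i) {
        return data[i];
    }

    public int size() {
        return size;
    }
}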
You should also think about an alternative storage structure to your ArrayList. Depending on the implementation's linking mechanism, every element carries overhead that (sometimes) consumes more memory than the value itself. Maybe you don't need the elements ordered. Have you thought about a Map or a simple Set, if that is applicable, or about implementing your own data structure?
Which type of data structure uses more memory?
Hashtable
Hashmap
ArrayList
Could you please give me a brief explanation of which one is less prone to memory leakage?
...which one to use for avoiding the memory leakage
The answer is all of them and none of them.
Memory leakage is not related to the data structure, but to the way you use it.
The amount of memory used by each one is irrelevant when you aim to avoid "memory leakage".
The best thing you can do is: when you detect that an object won't be used any longer in the application, remove it from the collection (not only those you've listed, but any other you might use: List, Map, Set, or even arrays).
That way the garbage collector will be able to release the memory used by that object.
You can take a look at the article "How Garbage Collection Works" for further explanation of how Java releases the memory of the objects it uses.
Edit:
There are other data structures in Java which help with reference management, such as WeakHashMap, but that may be considered an "advanced topic" (a sketch follows below).
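A minimal sketch of WeakHashMap's behavior (exactly when the entry vanishes depends on the garbage collector):
import java.util.Map;
import java.util.WeakHashMap;

public class WeakCacheExample {
    public static void main(String[] args) {
        Map<Object, String> cache = new WeakHashMap<>();

        Object key = new Object();
        cache.put(key, "payload");

        key = null;  // the map's weak reference no longer keeps the key alive
        System.gc(); // only a hint; after a collection the entry disappears

        System.out.println(cache.size()); // typically 0 once the key is collected
    }
}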
Most likely you should really just use a Collection that suits your current need. In the most common cases, if you need a List, use ArrayList, and if you need a Map, use HashMap. For a tutorial, see e.g. http://java.sun.com/docs/books/tutorial/collections/
When your profiler shows you there is an actual memory leak related to the use of Java Collections, then do something about it.
Your question is woefully underspecified because the concrete data structures you specify are not of comparable structure.
The HashMap/HashTable are comparable since they both function as maps (key -> value lookups).
ArrayLists (and lists in general) do not.
The HashMap/Hashtable part is easy to answer, as the two are largely identical (the major difference being null keys): the former is not synchronized and the latter is, thus HashMap will generally be faster (assuming the synchronization is not required). Modern JVMs are reasonably fast at uncontended locks, though, so the difference will be small in a microbenchmark.
Well, I've actually been, recently, in a situation where I had to hold onto large collections of custom objects, where the size of the collections was one of the application's limiting factors. If that's your situation, a few suggestions:
There are a few implementations of collections using primitives (list here). I played around a bit with trove4j and found a somewhat smaller memory footprint (as long as you're dealing with primitives, of course).
If you're dealing with large collections, you'll probably get more bang for your buck, in terms of reducing memory footprint, by optimizing the objects you're holding. After all, you've got a lot more of them; otherwise you wouldn't need a collection, right?
Some collections are naturally smaller per element than others (e.g. an ArrayList will be a bit smaller than a LinkedList, which pays for a node object per element), but the difference will probably be swamped by the differences in how they're used.
Most of the Java collections can be manually sized: you can initialize your ArrayList of 100 elements with a capacity of 100, and you can set your maps to keep less open space at the cost of slower performance. It's all in the Javadocs.
Ultimately the simplest thing to do is to test for yourself.
You're not comparing like with like: HashMap and Hashtable implement the Map interface, and ArrayList implements the List interface.
In a direct comparison between Hashtable and HashMap, HashMap will probably offer better performance because Hashtable is synchronized.
If you give some indication about what you're using the collections for, you might get a more insightful answer.
Hash tables (be it HashMap or Hashtable) take a little more memory than what they strictly need to store the information.
Hashing performance comes at a price.
A java.util.Collection stores references to objects.
A java.util.Map stores references to Map.Entry instances, which hold references to keys and objects.
So a java.util.Collection holding N references to objects will require less memory than a java.util.Map holding onto the same N references, because the Map has to point to the keys as well.
Performance for reading and writing differs depending on the implementation of each of these interfaces.
I don't see any java.util.Collection analogous to WeakHashMap. I'd read about that class if you're worried about garbage collection and memory leaks.
As others have pointed out, your question is underspecified.
Still, sometimes an ArrayList-based implementation can replace a HashMap-based implementation (I would not consider Hashtable at all; it's obsolete). You might need to search the ArrayList linearly, but for small lists that may still be fast enough, and the ArrayList will need less memory for the same data because it has less overhead.
In most languages it depends on how good you are at picking up your toys after you're done with them.
In Java it matters less: garbage collection runs automatically, so you never free memory yourself, but you can still leak it by holding on to references you no longer need.