I want to read some XML files and convert them to a graph (no graphics, just a model). But because the files are very large (2.2 GB), my model object, which holds all the information, becomes even larger (4x the size of the file...).
Searching the web, I tried to find ways to reduce the object size. I tried different collection types but would like to stick with a HashMap (because I need random access). The actual keys and values make up just a small amount of the allocated memory; most of the hash table is empty...
If I'm not totally wrong, a garbage collection doesn't help me free the allocated memory and reduce the size of the HashMap. Is there another way to release unused memory and shrink the HashMap? Or is there a way to do perfect hashing? Or should I just use another collection?
Thanks in advance,
Sebastian
A HashMap is typically just a large array of references filled to a certain percentage of capacity. If only 80% of the map is filled, the remaining 20% of the array cells are unused (i.e., are null). The extra overhead is really just those empty (null) cells.
On a 32-bit CPU, each array cell is usually 4 bytes in size (although some JVM implementations may allocate 8 bytes). That's not really that much unused space overall.
Once your map is filled, you can copy it to another HashMap with a more appropriate (smaller) size giving a larger fill percentage.
Your question seems to imply that there are more allocated but unused objects that you're worried about. But how is that the case?
Addendum
Once a map is filled past its load factor (by default, 75% of capacity), a larger array is allocated, the old array's contents are copied to the new array, and then the smaller array is left to be garbage collected. This is obviously an expensive operation, so choosing a reasonably large initial size for the map is key to improving performance.
If you can (over)estimate the number of cells needed, preallocating a map can reduce or even eliminate the resizing operations.
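As a rough sketch (the element count of 1,000,000 and the value type Node are just assumed for illustration), you could preallocate the map up front, or compact an already-filled, oversized map by copying it:

import java.util.HashMap;
import java.util.Map;

// Node is a placeholder for whatever value type the model uses.
// Preallocate: capacity chosen so ~1,000,000 entries stay below the default 0.75 load factor,
// which avoids any resize-and-copy while the map is being filled.
Map<String, Node> nodes = new HashMap<>(1_400_000);

// Compact an oversized map after it is filled: the copy constructor sizes the new
// table just large enough for the current entries (at the default load factor).
Map<String, Node> compacted = new HashMap<>(nodes);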
What you are asking is not entirely clear: it is not clear whether the memory is taken by the objects that you put inside the HashMap or by the HashMap itself, which shouldn't be the case since it only holds references.
In any case, take a look at WeakHashMap; maybe it is what you are looking for. It is a HashMap which doesn't guarantee that keys are kept inside it, so it should be used as a sort of cache, but from your description I can't really tell whether that fits your case or not.
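A minimal sketch of that behaviour (the key and the byte[] payload are just placeholders): once nothing else strongly references a key, its entry may be dropped after a garbage collection.

import java.util.Map;
import java.util.WeakHashMap;

Map<String, byte[]> cache = new WeakHashMap<>();
String key = new String("element-42");   // placeholder key; new String(..) avoids keying on the interned literal
cache.put(key, new byte[1024]);

key = null;                              // drop the last strong reference to the key
System.gc();                             // only a hint; collection is not guaranteed
System.out.println(cache.size());        // likely prints 0 once the key has been collected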
If you get nowhere with reducing the memory footprint of your hashmap, you could always put the data in a database. Depending on how the data is accessed, you might still get reasonable performance if you introduce a cache in front of the db.
One thing that might come into play is that you might have substrings referencing the original, larger strings, and those substrings then make it impossible for the GC to collect the big char arrays behind them.
This happens with some XML parsers that return attributes/values as substrings of a larger string. (In older JDKs, a substring is only a view onto the larger string's char array.)
Try to put your strings in the map by doing something like this:
map.put(new String(key), new String(value));
Note that the GC then might get more work to do when you are populating the map, and this might not help you if you don't have that many substrings that are referencing larger strings.
If you're really serious about this and you have time to spare, you can make your own implementation of the Map interface based on minimal perfect hashing.
If your keys are Strings, then there apparently is a map available for you here.
I haven't tried it myself but it brags about reduced memory usage.
You might give the Trove collections a shot. They advertise it as a more time and space efficient drop-in replacement for the java.util Collections.
Related
I have to count the number of repetitions of different strings in Java. The strings can be huge, they come from several data sources, and a large number of them are repeated.
I need only the 20 most frequent of those strings for every hour.
I considered counting the occurrences of each string and storing them in a huge HashMap, with a PriorityQueue to keep the top strings, but that will also consume a lot of memory. At the start of every hour, the old HashMap would be dropped and a new one created to count the new hour's 20 most frequent strings. This could cause the JVM to spend a long time garbage collecting that memory.
String#intern could help a little, but the HashMap itself is also a memory problem. In the future I also want to store the aggregated data off-heap, but the unknown number of distinct strings makes it hard to estimate the off-heap memory and to decide how to store those strings. Is there any advice on avoiding a map off-heap?
I'm also interested in cardinality estimation, but it seems hard to use it to count the number of repetitions of each string.
A HashMap is the answer. It uses less memory than you think, because the map holds references to unique Strings, and uses O(1) space per entry. There's no getting around having to store one copy of each string, so a map won't cost much more memory than the (unique) strings themselves. Just accumulate the total occurrences of each string and use it to find the top 20.
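A rough sketch of that idea (incomingStrings is a stand-in for whatever feeds you the hour's data): count with a HashMap, then keep only the top 20 in a small min-heap ordered by count.

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

Map<String, Long> counts = new HashMap<>();
for (String s : incomingStrings) {          // incomingStrings: the hour's input, assumed Iterable<String>
    counts.merge(s, 1L, Long::sum);         // accumulate the occurrence count per distinct string
}

// Min-heap bounded at 20 entries, ordered by count, so the least frequent candidate is evicted first.
PriorityQueue<Map.Entry<String, Long>> top =
        new PriorityQueue<>(Map.Entry.comparingByValue());
for (Map.Entry<String, Long> e : counts.entrySet()) {
    top.offer(e);
    if (top.size() > 20) {
        top.poll();                         // drop the smallest count among the current candidates
    }
}
// "top" now holds the 20 most frequent strings of the hour (in heap order, not sorted).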
If you run out of memory, you'll have to implement the map on disk, e.g. a relational database, a NoSQL store, or something else. The principle of using a map (or map-like structure) is the way to go.
I argue that a SortedMultiset from Guava would be easier to use in this case. You can pass it a custom Comparator so that you can easily grab the first 20 entries (the most frequent strings). It uses the same amount of memory as a Map implementation, and it automatically handles the accumulation for you.
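If you go the Guava route, one way to read the counts back in frequency order (shown here with Multisets.copyHighestCountFirst instead of a hand-written Comparator) might look roughly like this:

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;

Multiset<String> counts = HashMultiset.create();
counts.add("foo");     // one add() per occurrence; counts accumulate automatically
counts.add("foo");
counts.add("bar");

// The copy iterates highest count first; take the first 20 elements.
Multisets.copyHighestCountFirst(counts).elementSet().stream()
        .limit(20)
        .forEach(s -> System.out.println(s + " -> " + counts.count(s)));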
Generally, they say that we have moved from arrays to ArrayList for the following reason:
Arrays are fixed size, whereas ArrayLists are not.
One of the disadvantages of ArrayList is:
When it reaches its capacity, the ArrayList grows to 3/2 of its current size. As a result, memory can be wasted if we do not utilize the space properly. In this scenario, arrays are preferred.
If we use ArrayList.trimToSize(), will that make ArrayList the unanimous choice, eliminating the only advantage (fixed size) arrays have over it?
One short answer would be: trimToSize doesn't solve everything, because shrinking an array after it has grown is not the same as preventing growth in the first place; the former has the cost of copying plus garbage collection.
The longer answer would be: int[] is low level, while ArrayList is high level, which means it's more convenient but gives you less control over the details. Thus in business-oriented code (e.g. manipulating a short list of "Products") I'll prefer ArrayList, so that I can forget about the technicalities and focus on the business. In mathematically-oriented code I'll probably go for int[].
There are additional subtle differences, but I'm not sure how relevant they are to you. E.g. concurrency: if you structurally modify an ArrayList from one thread while another is iterating it, it is designed to fail fast (with a ConcurrentModificationException) rather than behave unpredictably, because that's the intuitive requirement for most business code. An int[] will allow you to do whatever you want, leaving it up to you to make sure it makes sense. Again, this can all be summarized as "low level"...
If you are developing an extremely memory-critical application that also needs resizability, and performance can be traded off, then trimming the ArrayList is your best bet. That is the only time an ArrayList with trimming will be the unanimous choice.
In other situations, what you are actually doing is the following:
1. You create an ArrayList. The default capacity of the list is 10.
2. You add an element and apply the trim operation. Both size and capacity are now 1. How does trimToSize() work? It basically creates a new array with the actual size of the list and copies the old array's data into the new array; the old array is left for garbage collection.
3. You add another element. Since the list is full, it is reallocated with about 50% more space, following a procedure similar to step 2.
4. You call trimToSize() again, and it follows the same procedure as step 2.
5. And so on (sketched in code below).
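As a rough sketch of that cycle (the capacities in the comments assume the usual growth behaviour of java.util.ArrayList):

import java.util.ArrayList;

ArrayList<String> list = new ArrayList<>();   // backing array of length 10 (allocated lazily on recent JDKs)
list.add("a");
list.trimToSize();                            // new backing array of length 1; the old one is left for GC
list.add("b");                                // list is full again, so the array is reallocated and copied
list.trimToSize();                            // shrinks to length 2, costing yet another copy
// ...and so on: every add/trim pair pays for an allocation plus an array copy.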
So you see, we incur a lot of performance overhead just to keep the list's capacity and size the same. A fixed size is not offering you anything advantageous here except saving a few extra slots, which is hardly an issue on modern machines.
In a nutshell, if you want resizability without writing lots of boilerplate code, then ArrayList is the unanimous choice. But if the size never changes and you don't need any dynamic functionality such as removal, then an array is the better choice. A few extra bytes are hardly an issue.
I need to store a large amount of information, say for example 'names', in a Java List. The number of items can change (in short, I cannot predefine the size). I am of the opinion that, from a memory allocation perspective, LinkedList would be a better option than ArrayList, since for an ArrayList, once the maximum size is reached, the memory allocation automatically doubles, and hence there would always be a chance of more memory being allocated than what is needed.
I understand from other posts here that individual elements stored in a LinkedList take more space than in an ArrayList, as a LinkedList also needs to store the node information, but I am still guessing that for the scenario I have defined LinkedList might be a better option. Also, I do not want to get into the performance aspects (fetching, deleting, etc.), as much has already been discussed on them.
LinkedList might allocate fewer entries, but those entries are astronomically more expensive than they'd be for ArrayList -- enough that even the worst-case ArrayList is cheaper as far as memory is concerned.
(FYI, I think you've got it wrong; ArrayList grows by 1.5x when it's full, not 2x.)
See e.g. https://github.com/DimitrisAndreou/memory-measurer/blob/master/ElementCostInDataStructures.txt : LinkedList consumes 24 bytes per element, while ArrayList consumes in the best case 4 bytes per element, and in the worst case 6 bytes per element. (Results may vary depending on 32-bit versus 64-bit JVMs, and compressed object pointer options, but in those comparisons LinkedList costs at least 36 bytes/element, and ArrayList is at best 8 and at worst 12.)
UPDATE:
I understand from other posts here that individual elements stored in a LinkedList take more space than in an ArrayList, as a LinkedList also needs to store the node information, but I am still guessing that for the scenario I have defined LinkedList might be a better option. Also, I do not want to get into the performance aspects (fetching, deleting, etc.), as much has already been discussed on them.
To be clear, even in the worst case, ArrayList is 4x smaller than a LinkedList with the same elements. The only possible way to make LinkedList win is to deliberately fix the comparison by calling ensureCapacity with a deliberately inflated value, or to remove lots of values from the ArrayList after they've been added.
In short, it's basically impossible to make LinkedList win the memory comparison, and if you care about space, then calling trimToSize() on the ArrayList will instantly make ArrayList win again by a huge margin. Seriously. ArrayList wins.
... but I am still guessing that for the scenario I have defined LinkedList might be a better option
Your guess is incorrect.
Once you have got past the initial capacity of the array list, the size of the backing array will be between 1 and 2 references times the number of entries. This is due to the strategy used to grow the backing array.
For a linked list, the nodes occupy AT LEAST 3 references times the number of entries, because each node has a next and a prev reference as well as the entry reference. (And in fact, it is more than 3 times, because of the space used by the nodes' object headers and padding. Depending on the JVM and pointer size, it can be as much as 6 times.)
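Roughly, each LinkedList entry is backed by a node object shaped like this (simplified from the JDK source), which is where the three references per entry come from:

// Simplified shape of java.util.LinkedList's internal node class:
class Node<E> {
    E item;        // reference to the stored element
    Node<E> next;  // reference to the following node
    Node<E> prev;  // reference to the preceding node
}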
The only situation where a linked list will use less space than an array list is if you badly over-estimate the array list's initial capacity. (And for very small lists ...)
When you think about it, the only real advantage linked lists have over array lists is when you are inserting and removing elements. Even then, it depends on how you do it.
ArrayList uses one reference per object (or two when it's double the size it needs to be). This is typically 4 bytes.
LinkedList uses only the nodes it needs, but these can be 24 bytes each.
So even at its worst, ArrayList will be 3x smaller than LinkedList.
For fetching, ArrayList supports random access in O(1), but LinkedList is O(n). For deleting from the end, both are O(1); for deleting from somewhere in the middle, ArrayList is O(n).
Unless you have millions of entries, the size of the collection is unlikely to matter. What will matter first is the size of entries which is the same regardless of the collection used.
Back of the envelope worst-case:
500,000 names in an array sized to 1,000,000 = 500,000 used, 500,000 empty pointers in the unused portion of the allocated array.
500,000 entries in a linked list = 3 pointers per entry (each Node object holds references to the current, previous, and next elements) = 1,500,000 pointers in memory. (Then you have to add the size of the Node objects themselves.)
ArrayList.trimToSize() may satisfy you.
Trims the capacity of this ArrayList instance to be the list's current size. An application can use this operation to minimize the storage of an ArrayList instance.
By the way, in Java 6's ArrayList it's not double the capacity; it grows by about 1.5 times when the maximum size is reached.
In my project, I get entries of a form from two servers and keep them in a HashMap.
The key is the server name and the value is a 2D ArrayList (ArrayList<ArrayList<Object>>).
In the ArrayList, I keep the values of the fields that belong to the form on that server.
I compare these values between the two servers and print them to an Excel file.
My problem is that when I get a form with 12,000 entries and 100 fields, this map uses around 400 MB of memory. I don't want my program to use this much memory. Can you suggest anything?
I doubt it's the HashMap that is causing you problems; it's more likely the ArrayLists, since each one allocates room for 10 entries by default. If you're only storing one or two values per index, that will be wasteful.
You could try setting the initial size to 1 or 2 to see if that helps. A potential downside is that if the size is too small, it will cause frequent reallocation. But you will see yourself if that causes any significant slowdown.
The HashMap is not at all the problem here. What objects are actually contained in the ArrayList<ArrayList<Object>>?
You really should use VisualVM and do some heap profiling to see what actually takes up your memory. That's much better than the guesswork here, and you may be surprised by the result.
I suppose that much of the memory waste results from using a lot of ArrayLists. They are designed for dynamic use (additions and removals), so they usually have many unused positions. If your matrix is static, consider using a 2D array instead of a list of lists. Otherwise, try to set the capacity of each ArrayList to some estimated value instead of the default.
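A rough sketch of both suggestions (12,000 and 100 are the sizes from the question):

import java.util.ArrayList;
import java.util.List;

// Static matrix: one fixed-size 2D array instead of thousands of ArrayLists with spare capacity.
Object[][] formValues = new Object[12_000][100];

// Or, if ArrayLists must stay, size them exactly instead of relying on the default capacities.
List<List<Object>> rows = new ArrayList<>(12_000);
for (int i = 0; i < 12_000; i++) {
    rows.add(new ArrayList<>(100));
}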
The problem is obviously not the HashMap itself, because it has no more than two entries (the keys are your two server names). You just have to handle a large amount of data (2 x 12,000 x 100 values, if I understand it right, plus the result, which is an Excel file). That simply needs some memory. The big objects are the two 2D array lists; the map just holds references to those data structures.
Usually I wouldn't care and would just increase the max heap size to 512 MB or even 1 GB.
Say I instantiate 100,000 Vectors:
a[0..100k] = new Vector<Integer>();
If I do this instead:
a[0..100k] = new Vector<Integer>(1);
Will they take less memory? That is, ignoring whether they have anything in them and the overhead of expanding them when there has to be more than one element.
According to the Javadoc, the default capacity for a vector is 10, so I would expect it to take more memory than a vector of capacity 1.
In practice, you should probably use an ArrayList unless you need to work with another API that requires vectors.
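A quick way to see the difference (capacity() reports the length of Vector's internal array):

import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

Vector<Integer> defaultSized = new Vector<>();   // internal array of length 10
Vector<Integer> small = new Vector<>(1);         // internal array of length 1
System.out.println(defaultSized.capacity());     // prints 10
System.out.println(small.capacity());            // prints 1

// The ArrayList equivalent, if no legacy API forces Vector on you:
List<Integer> list = new ArrayList<>(1);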
When you create a Vector, you either specify the initial capacity you want it to have or accept the default. But it should be noted that, in any case, everything stored in a Vector is just a bunch of references, which take up very little space compared to the objects they actually point to.
So yes, you will save space initially, but only by an amount equal to the difference between the default capacity and the specified one, multiplied by the size of a reference. If you create a really large number of Vectors, as in your case, the initial size does matter.
Well, sort of, yes. Vector internally allocates room for 10 elements by default, which means that, due to byte alignment and other things done by the underlying VM, you'll save a considerable amount of memory initially.
What are you trying to accomplish, though?
Yes, they will. Putting in reasonable "initial sizes" for collections is one of the first things I do when confronted with a need to radically improve memory consumption of my program.
Yes, it will. By default, Vector allocates space for 10 elements.
Vector()
Constructs an empty vector so that its internal data array has size 10 and its standard capacity increment is zero.
Therefore, it reserves memory for 10 memory references.
That being said, in real-life situations this is rarely a concern. If you are truly generating 100,000 Vectors, you need to rethink your design.