Java HashSet performance

Java HashSet performance - java

I understand HashSet based on HashMap, since they are pretty similar. It makes the code more flexible and minimizes implementation effort. However, one reference variable in the HashSet's Entry seem to be unnecessary for me if the class forbids null element, therefore the whole Entry makes no point. Despite this fact, an Entry takes 24 byte memory / element, whereas a single array with the set's elements would take only 4 byte / element if my figures are correct. (aside from array's header)
If my argument is correct, does the advantages really overweight this performance hit?
(if i am wrong, i would learn from it aswell)

Though this question is primarily opinion-based, I'll summarize a few points on the topic:
HashSet appeared in Java 1.2 many years ago. It's hard to guess now the exact reasons for design decisions made at that times, but clearly Java wasn't used for high-loaded applications; performance played less role than simplicity.
You are right that HashSet is suboptimal in its memory consumption. The problem is known, the bug JDK-6624565 is registered, and discussions at core-libs-dev are held from time to time. But is this a blocker for many real world applications? Probably, no.
For those uncommon applications where HashSet memory usage is unacceptable, there are already good alternatives, like trove THashSet.
Note that open addressing algorithms has their disadvantages, e.g. significant performance degradation with load factors close to 1; element removal difficulties. See the related answer.

Related

List iteration, which is the most efficient?

I am working on a game and have several customers stored in an ArrayList, and each customer has their own unique ID which is saved as a variable inside the object.
If I want to retrieve a customer from the list using their ID, which of these would be better practice?
Iterate through all customers in the list until I find a match.
Convert the list to a HashMap of keys (the ID) and values (the customers), and then simply use the .get() method. Maybe this does exactly the same as option one?

HashMap will be more efficient (in most cases O(1), in worst case O(n)), iterating over a list would be O(n).
Of course it depends on the size of your data, if you have lots of it then HashMap is the obvious choice, if you have few (e.g. 5 maybe 10), it might be the case that a List would be more efficient - constant factors have to be taken under consideration here, which Big-Oh notation ignores.

As written by others, collections having "hash" in their name typically provide improved performance properties for certain kind of actions.
But please keep in mind the very old rule to not do premature optimization. Seriously: unless we are talking a "real production setup"; and 10 000+ paying customers ... then you should understand that you should very much more focus on good design, and creating readable + maintainable code than on potential performance issues.
Yes, one should avoid outright stupid designs; but the thing is: if you focus too much and too early on performance, you risk two things:
a) missing the real bottleneck. If you really encounter performance issues, you have profile your application to understand where time is spent. Far too often people assume that their problem is X; and then they spend a lot of time fixing X - to later find out that actually the Just-in-Time compiler addresses X good enough; and that their real problem is some other Y they never thought of.
b) putting yourself into a corner. Thing is: good designs need to be able to evolve and change. If you assume you must do this or that to preserve some holy performance grail ... chances are, that this impacts your overall application in a negative way.

I know your question was specific, but I'm going to propose another solution.
Build you own set, backed by an array of Costumers. Once a Costumer is inserted, the Customer's ID is set to the actual index it gets when placed into the array.
that way you can have direct access to all you costumers, using their ID.

The hashmap and all sorted list can use the index to find the right item faster. It can pick an index in the middle, if the value it's looking for is higher or lower it can use that information and pick a new index in the lower or higher half and repeat the same task as before. This reduce the time to find the item. It's called binary search (thx fabian)
You can try it and compare the result!
EDIT:
I am with Jägermeister with this one, you should wait to up the performance and have it as a final step.

Lightweight Map implementation Java (little memory overhead)

I am currently writing some code in java meant to be a little framework for a project which revolves around a database with some billions of entries. I want to keep it high-level and the data retriueved from the database shoud be easily usable for statistic inference. I am resolved to use the Map interface in this project.
a core concept is mapping the attributes ("columns in the database") to values ("cells") when handling single datasets (with which I mean a columns in the database) for readable code: I use enum objects (named "Attribute") for the attribute types, which means mapping <Attribute, String>, because the data elements are all String (also not very large, maximum 40 characters or so).
There are 15 columns, so there are 15 enums, and the maps will have only so much entries, or less.
So it appears, I will be having a very large number of Map objects floating around, at times, but with comparatively little payload (15-). My goal is to not make the memory explode due to the implementation memory overhead, compared to the actual payload. (Stretch goal: do the same with cpu usage ;] )
I was not really familiar with all the different implementations of Java Collections to date, and when the problem dawned at me today, I looked into my to-date all-time favorite 'HashMap', and was not happy with how much memory overhead there was declared. I am sure, that additonal to the standard implementations, there are a number of implementations not shipped with Java. Googling my case brought not up much of a result, So I am asking you:
Do you know a good implementation of Map for my use case (low entry count, low value size, enumerable keys, ...)
I hope I made my use case clear, and am anxious for your input =)
Thanks a lot!
Stretch answer goal, absolutely optional and only if you got the time and knowledge:
What other implementations of collections are suitable for:
handling attribute (the String things) vectors, and matrices for inference data (counts/probabilities) (Matrices: here I am really clueless for now, Did really no serious math work with java to date)
math libraries for statistical inference, see above

Use EnumMap, this is the best map implementation if you have enums as key, for both performance and memory usage.
The trick is that this map implementation is the only one that that does not store the keys, it only needs a single array with the values (similar to an ArrayList of the values). There is only a little bit of overhead if there are keys that are not mapped to a value, but in most cases this won't be a problem because enums usually do not have too many instances.
Compared to HashMap, you additionally get a predictable iteration order for free.

Since you start off saying you want to store lots of data, eventually, you'll also want to access/modify that data. There are many high performance libraries out there.
Look at
Trove4j : https://bitbucket.org/robeden/trove/
HPPC: http://labs.carrotsearch.com/hppc.html
FastUtil: http://fastutil.di.unimi.it/
When you find a bottleneck, you can switch to using a lower level API (more efficient)
You'll many more choices if look a bit more: What is the most efficient Java Collections library?
EDIT: if your strings are not unique, you could save significant amounts of memory using String.intern() : Is it good practice to use java.lang.String.intern()?

You can squeeze out a bit of memory with a simple map implementation that uses two array lists (keys and values). For larger maps, that is going to mean insertion and look up speeds become much slower because you have to scan the entire list. However, for small maps it is actually faster this way since you don't have to calculate any hashcodes and only have to look at a small number of entries.
If you need an implementation, take a look at my SimpleMap in my jsonj project: https://github.com/jillesvangurp/jsonj/blob/master/src/main/java/com/github/jsonj/SimpleMap.java

Java - List or Array? [duplicate]

Want to improve this post? Provide detailed answers to this question, including citations and an explanation of why your answer is correct. Answers without enough detail may be edited or deleted.
This question already has answers here:
What does it mean to "program to an interface"?
(33 answers)
When to use a List over an Array in Java?
(9 answers)
Closed 9 years ago.
I know Lists make things much easier in Java instead of working with hard-set arrays (lists allow you to add/remove elements at will and they automagically resize, etc).
I've read some stuff suggesting that array's should be avoided in java whenever possible since they are not flexible at all (and sometimes can impose weird limitations, such as if you don't know the size an array needs to be, etc).
Is this "good practice" to stop using arrays at all and use only List logic instead? I'm sure the List type consumes more memory than an array and thus have higher overhead, but is this significant? Most Lists would be GC'ed during runtime if they are left laying around anyways so maybe it isn't as big of a deal as I'm thinking?

I don't like dogma. Know the rules; know when to break the rules.
"Never" is too strong, especially when it comes to software.
Arrays and Lists are both potential targets for the GC, so that's a wash.
Yes, you have to know the size of an array before you start. For the cases when you do, there's nothing wrong with it.
It's easy to go back and forth as needed using java.util.Collections and java.util.Arrays classes.

I think a good rule of thumb would be to use Lists unless you NEED an Array (for memory/performance reasons). Otherwise Lists are typically easier to maintain and thus less likely to cause future bugs.
Lists provide more flexibility/functionality in terms of auto-expansion, so unless you are either pressed for memory (and can't afford the overhead that Lists create) or do not mind maintaining the Array size as it expands/shrinks, I would recommend Lists.
Try not to micromanage the code too much, and instead focus on more discernible and readable components.

It depends on the list. A LinkedList probably takes up space only as it's needed, while an ArrayList typically greatly increases its space whenever its capacity is reached. Internally, an ArrayList is implemented using an array, but it's an array that's always larger than what you want. However, since it stores references, not objects, the memory overhead is negligible in most cases and the convenience is worth it, I believe.

I would have to say I follow this approach of using the Collections framework where I might otherwise have used an array. The collections offer you many benefits and convenience over arrays but yes there is likely to be some performance hit.
It is better to write code that is easy to understand and hard to break, arrays require you to put in a lot of checking code to ensure you don't access bits of the array you shouldn't or put to many things in it etc. Given that the majority of the time performance is not a problem it shouldn't be a concern.

Initializing ArrayList with a new operator in Java?

What is the best practice for initializing an ArrayList in Java?
If I initialize a ArrayList using the new operator then the ArrayList will by default have memory allocated for 10 buckets. Which is a performance hit.
I don't know, maybe I am wrong, but it seems to me that I should create a ArrayList by mentioning the size, if I am sure about the size!

Which is a performance hit.
I wouldn't worry about the "performance hit". Object creation in Java is very fast. The performance difference is unlikely to be measurable by you.
By all means use a size if you know it. If you don't, there's nothing to be done about it anyway.
The kind of thinking that you're doing here is called "premature optimization". Donald Knuth says it's the root of all evil.
A better approach is to make your code work before you make it fast. Optimize with data in hand that tells you where your code is slow. Don't guess - you're likely to be wrong. You'll find that you rarely know where the bottlenecks are.

If you know how many elements you will add, initialize the ArrayList with correct number of objects. If you don't, don't worry about it. The performance difference is probably insignificant.

This is the best advice I can give you:
Don't worry about it. Yes, you have several options to create an ArrayList, but using the new, the default option provided by the library, isn't a BAD choice, otherwise it'd be stupid to make it the default choice for everyone without clarifying what's better.
If it turns out that this is a problem, you'll quickly discover it when you profile. That's the proper place to find problems, when you profile your application for performance/memory problems. When you first write the code, you don't worry about this stuff -- that's premature optimization -- you just worry about writing good, clean code, with good design.
If your design is good, you should be able to fix this problem in no time, with little impact to the rest of the system. Effective Java 2nd Edition, Item 52: Refer to objects by their interfaces. You may even be able to switch to a LinkedList, or any other kind of List out there, if that turns out to be a better data structure. Design for this kinds of flexibility.
Finally, Effective Java 2nd Edition, Item 1: Consider static factory methods instead of constructors. You may even be able to combine this with Item 5: Avoid creating unnecessary objects, if in fact no new instances are actually needed (e.g. Integer.valueOf doesn't always create a new instance).
Related questions
Java Generics Syntax - in-depth about type inferring static factory methods (also in Guava)
On ArrayList micromanagement
Here are some specific tips if you need to micromanage an ArrayList:
You can use ArrayList(int initialCapacity) to set the initial capacity of a list. The list will automatically grow beyond this capacity if needed.
When you're about to populate/add to an ArrayList, and you know what the total number of elements will be, you can use ensureCapacity(int minCapacity) (or the constructor above directly) to reduce the number of intermediate growth. Each add will run in amortized constant time regardless of whether or not you do this (as guaranteed in the API), so this can only reduce the cost by a constant factor.
You can trimToSize() to minimize the storage usage.
This kind of micromanagement is generally unnecessary, but should you decide (justified by conclusive profiling results) that it's worth the hassle, you may choose to do so.
See also
Collections.singletonList - Returns an immutable list containing only the specified object.

If you already know the size of your ArrayList (approximately) you should use the constructor with capacity. But most of the time developers don't really know what will be in the List, so with a capacity of 10 it should be sufficient for most of the cases.
10 buckets is an approximation and isn't a performance hit unless you already know that your ArrayList contains tons of elements, in this case, the need to resize your array all the time will be the performance hit.

You don't need to tell initial size of ArrayList. You can always add/remove any element from it easily.
If this is a performance matter, please keep in mind following things :
Initialization of ArrayList is very fast. Don't worry about it.
Adding/removing element from ArrayList is also very fast. Don't worry about it.
If you find your code runs too slow. The first to blame is your algorithm, no offense. Machine specs, OS and Language indeed participate too. But their participation is considered insignificant compared to your algorithm participation.

If you don't know the size of theArrayList, then you're probably better off using a LinkedList, since the LinkedList.add() operation is constant speed.
However as most people here have said you should not worry about speed before you do some kind of profiling.
You can use this old, but good (in my opinion) article for reference.
http://chaoticjava.com/posts/linkedlist-vs-arraylist/

Since ArrayList is implemented by array underlying, we have to choose a initial size for the array.

If you really care you can call trimToSize() once you have constructed and populated the object. The javadoc states that the capacity will be at least as large as the list size. As previously stated, its unlikely you will find that the memory allocated to an ArrayList is a performance bottlekneck, and if it were, I would recommend you use an array instead.

array of structures, or structure of arrays?

Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:
public class Record {
final private int field1;
final private int field2;
final private long field3;
/* constructor & accessors here */
}
List<Record> records = new ArrayList<Record>();
If I end up using a large number (> 106 ) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) would compare with an optimized approach for storage costs:
public class OptimizedRecordStore {
final private int[] field1;
final private int[] field2;
final private long[] field3;
Record getRecord(int i) { return new Record(field1[i],field2[i],field3[i]); }
/* constructor and other accessors & methods */
}
edit:
assume the # of records is something that is changed infrequently or never
I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access on the records, e.g. if field1 and field2 are "time" and "position", and it's important for me to get those values as an array for use with MATLAB, so I can graph/analyze them efficiently.

The answers that give the general "optimise when you have to" is unhelpful in this case because , IMHO, programmers should always be aware of the performance in different in design choices when that choice leads to an order of magnitude performance penalty, particularly API writers.
The original question is quite valid and I would tend to agree that the second approach is better, given his particular situation. I've written image processing code where each pixel requires a data structure, a situation not too dissimilar to this, except I needed frequent random access to each pixel. The overhead of creating one object for each pixel was enormous.

The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects and it will do so on accesses. That could lead to a lot of garbage (from a GC point of view). Not good.
Generally speaking, you should worry about how you use the objects long before you think about performance. So you have a record with three fields or three arrays. Which one more accurately depicts what you're modeling? By this I mean, when you insert or delete an item, are you doing one of the three arrays or all three as a block?
I suspect it's the latter in which case the former makes far more sense.
If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate, perhaps a SortedSet or a Map or SortedMap.

If you have millions of records, the second approach has several advantages:
Memory usage: the first approach uses more memory because a) every Java object in heap has a header (containing class id, lock state etc.); b) objects are aligned in memory; c) each reference to an object costs 4 bytes (on 64-bit JVMs with Compressed OOPs or 32-bit JVMs) or 8 bytes (64-bit JVMs without Compressed OOPs). See e. g. CompressedOops for more details. So the first approach takes about two times more memory (more precisely: according to my benchmark, an object with 16 bytes of payload + a reference to it took 28 bytes on 32-bit Java 7, 36 bytes on 64-bit Java 7 with compressed OOPs, and 40 bytes on 64-bit Java 7 w/o compressed OOPs).
Garbage collection: although the second approach seems to create many objects (one on each call of getRecord), it might not be so, as modern server JVMs (e. g. Oracle's Java 7) can apply escape analysis and stack allocation to avoid heap allocation of temporary objects in some cases; anyway, GCing short-lived objects is cheap. On the other hand, it is probably easier for the garbage collector if there are not millions of long-lived objects (as there are in the first approach) whose reachability to check (or at least, such objects may make your application need more careful tuning of GC generation sizes). Thus the second approach may be better for GC performance. However, to see whether it makes a difference in the real situation, one should make a benchmark oneself.
Serialization speed: the speed of (de)serializing a large array of primitives on disk is only limited by HDD speed; serializing many small objects is inevitably slower (especially if you use Java's default serialization).
Therefore I have used the second approach quite often for very large collections. But of course, if you have enough memory and don't care about serialization, the first approach is simpler.

How are you going to access the data? If the accesses over the fields are always coupled, then use the first option, if you are going to process the fields by its own, then the second option is better.
See this article in wikipedia: Parallel Array
A good example about when it's more convenient to have separate arrays could be simulations where the numerical data is packed together in the same array, and other attributes like name, colour, etc. that are accessed just for presentation of the data in other array.

I was curious so I actually ran a benchmark. If you don't re-create the object like you are[1], then SoA beats AoS by 5-100% depending on workload[2]. See my code here:
https://gist.github.com/twolfe18/8168262c5420c7a62d39
[1] I didn't add that because if you are concerned enough about speed to consider this refactor, it would be silly to do that.
[2] This also doesn't account for re-allocation, but again, this is often something you can either amortize away or know statically. This is a reasonable assumption for a pure-speed benchmark.

Notice that the second approach might have negative impact on caching behaviour. If you want to access a single record at a time, you'd better have that record not scattered all across the place.
Also, the only memory you win in the second approach, is (possibly) due to member alignment. (and having to allocate a separate object).
Otherwise, they have exactly the same memory use, asymptotically. The first option is much better due to locality, IMO

Whenever I have tried doing number crunching in Java, I have always had to revert to C-style coding (i.e. close to your option 2). It minimised the number of objects floating around in your system, as instead of 1,000,000 objects, you only have 3. I was able to do a bit of FFT analysis of real-time sound data using the C-style, and it was far too slow using objects.

I'd choose the first method (array of structures) unless you access the store relatively infrequently and are running into serious memory pressure issues.
First version basically stores the objects in their "natural" form (+1 BTW for using immutable records). This uses a little more memory because of the per-object overhead (probably around 8-16 bytes depending on your JVM) but is very good for accessing and returning objects in a convenient and human-understandable form in one simple step.
Second version uses less memory overall, but the allocation of a new object on every "get" is a pretty ugly solution that will not perform well if accesses are frequent.
Some other possibilities to consider:
An interesting "extreme" variant would be to take the second version but write your algorithms / access methods to interact with the underlying arrays directly. This is clearly going to result in complex inter-dependencies and some ugly code, but would probably give you the absolute best performance if you really needed it. It's quite common to use this approach for intensive graphics applications such as manipulating a large array of 3D coordinates.
A "hybrid" option would be to store the underlying data in a structure of arrays as in the second version, but cache the accessed objects in a HashMap so that you only generate the object the first time a particular index is accessed. Might make sense if only a small fraction of objects are ever likely to accessed, but all data is needed "just in case".

(Not a direct answer, but one that I think should be given)
From your comment,
"cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs. – Jason S Jul 14 '09 at 14:27"
You should always ignore optimization until it presents itself as a problem. Most important is to have the system be usable by a developer (so they can make it usable by a user). There are very few times that you should concern yourself with optimization, in fact in ~20 years of professional coding I have cared about optimization a total of two times:
Writing a program that had its primary purpose to be faster than another product
Writing a smartphone app with the intention of reducing the amount of data sent between the client and server
In the first case I wrote some code, then ran it through a profiler, when I wanted to do something and I was not sure which approach was best (for speed/memory) I would code one way and see the result in the profiler, then code the other way and see the result. Then I would chose the faster of the two. This works and you learn a lot about low level decisions. I did not, however, allow it to impact the higher level classes.
In the second case, there was no programming involved, but I did the same basic thing of looking at the data being sent and figuring out how to reduce the number of messages being sent as well as the number of bytes being sent.
If your code is clear then it will be easier to speed up once you find out it is slow. As Cletus said in his answer, you are resizing one time -vs- three times... one time will be faster than three. From a higher point of view the one time is simpler to understand than the three times, thus it is more likely to be correct.
Personally I'd rather get the right answer slowly then the wrong answer quickly. Once I know how to get the right answer then I can find out where the system is slow and replace those parts of it with faster implementations.

Because you are making the int[] fields final, you are stuck with just the one initialization of the array and that is it. Thus, if you wanted 10^6 field1's, Java would need to separate that much memory for each of those int[], because you cannot reassign the size of those arrays. With an ArrayList, if you do not know the number of records beforehand and will be removing records potentially, you save a lot of space upfront and then later on as well when you go to remove records.

I would go for the ArrayList version too, so I don't need to worry about growing it. Do you need to have a column like access to values? What is your scenario behind your question?
Edit You could also use a common long[][] matrix.
I don't know how you pass the columns to Matlab, but I guess you don't gain much speed with a column based storage, more likely you loose speed in the java computation.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.