How do HashSets in Java work? [duplicate] - java

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How does Java hashmap work?
Can someone explain to me how HashSets in Java work and why they are faster than using ArrayLists?

A HashSet is actually a HashMap where the value is always the same.
The way a HashMap works is described in many places (it is referred to as "hashtable" as well). In short: it generates hashes of keys (objects) and positions them into a table. Then each time you look for a key, its hash is computed and the bucket in the table is referenced directly. This means you have just one operation (best case) to access the map.
The HashSet simply contains the keys, so .contains(..) is O(1). That and remove(..) are the only operations where a HashSet is faster than an ArrayList (for which they are O(n)). Iteration is the same, addition is the same.
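A minimal sketch of the contains(..) difference described above, assuming Java 8 or later. The class name ContainsDemo and the helper fill are made up for illustration; only add() and contains() are real collection methods.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of any library.
final class ContainsDemo {
    // Adds the integers 0..n-1 to the given collection.
    static void fill(Collection<Integer> target, int n) {
        for (int i = 0; i < n; i++) {
            target.add(i);
        }
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        fill(list, 100_000);
        fill(set, 100_000);

        // ArrayList.contains walks the list from the front and calls equals() on each element: O(n).
        System.out.println(list.contains(99_999));
        // HashSet.contains hashes the argument and only looks in the matching bucket: O(1) on average.
        System.out.println(set.contains(99_999));
    }
}
```

Both calls give the same answer; the difference is only in how many elements are inspected to get it.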

First, HashSet, unlike ArrayList, is a Set: it cannot contain duplicates, while an ArrayList can - so they are built for different purposes. It also does not guarantee ordering - again, unlike a list.
Second - a HashSet is built on the hash table data structure, which allows O(1) seek time for an element.
Note that in many cases a HashSet is slower than an ArrayList - for example, if you want to iterate over the elements, doing it in an ArrayList will usually be faster than in a HashSet [because of the bad cache performance of hashing, among other reasons].

These are 2 different data structures.
The concept behind HashSet is key probing.
I.e. you use a transformation of the input key to get the index of the value's location in an array.
This is a constant-time, O(1), operation since an array allows random access.
An ArrayList is also O(1) for access, since it is also backed by an array.
But only for access by index and for appending at the end (amortized).
Search, though, is an O(N) operation for an ArrayList, since you have to go through the elements in the list one by one to get to the value, unlike the HashSet, where you just transform the key and access the array. Search in a HashSet is O(1).

As a matter of fact, for example iterating over and appending to an ArrayList is faster.
And heck, you cannot even sort a HashSet.
But the fastest of all is the NoOp. There is nothing just remotely as fast as the NoOp. Granted, it doesn't do much, the NoOp. But it's really fast at that!
You need to be more precise in what you consider to be "faster than".

Related

Insertion order in CopyOnWriteArraySet VS HashSet

Everyone knows HashSet stores elements in buckets based on the size of the hashtable and the elements' hash code values.
But how does CopyOnWriteArraySet store elements? I thought it makes a snapshot of those buckets and copies them. Looks like it doesn't. Does it store them in a 'normal' array one by one and check equals()?
Does it even use the hashing principle?
CopyOnWriteArraySet is a Set-wrapper for CopyOnWriteArrayList, which stores its elements in an array, so it does not use hashing. That's why it doesn't have the O(1) lookup benefit of a HashSet.
The docs say it is only suitable for small sets.
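A small sketch of what that means for ordering, assuming Java 8 or later. The class and method names (CowSetDemo, insertionOrder) are invented for this example; CopyOnWriteArraySet itself is the real java.util.concurrent class, whose iterators traverse elements in the order they were added.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

// Hypothetical demo class; not part of any library.
final class CowSetDemo {
    // Adds the items to a CopyOnWriteArraySet and returns them in iteration order.
    static List<String> insertionOrder(String... items) {
        Set<String> set = new CopyOnWriteArraySet<>();
        for (String item : items) {
            // add() scans the backing array with equals(); a duplicate is simply not added.
            set.add(item);
        }
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        // prints [pear, apple, fig]: insertion order, with the second "pear" dropped
        System.out.println(insertionOrder("pear", "apple", "pear", "fig"));
    }
}
```

Because every add() copies the backing array, this is only reasonable for small sets that are read far more often than they are modified.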

What are the benefits of using Map over ArrayList of custom class

I am learning Java now and I am learning about different kinds of collections, so far I learned about LinkedList, ArrayList and Array[].
Now I've been introduced to the hash types of collections, HashSet and HashMap, and I didn't quite understand why they are useful, because the list of commands that they support is quite limited; also, they are sorted in a random order and I need to override the equals and hashCode methods in order to make them work right with my class.
Now, what I don't understand is the benefit, given the hassle, of using these types instead of an ArrayList of a custom class.
I mean, what Map is doing is connecting 2 objects as 1, but wouldn't it just be better to create a class that contains these 2 objects as fields, and has getters to modify and use them?
If the benefit is that these hash objects can only contain 1 object of the same name, wouldn't it just be easier to make the ArrayList check that the type is not already there before adding it?
So far I learned to choose when to use LinkedList, ArrayList or Array[] by the rule of "if it's really simple, use Array[]; if it's a bit more complex, use ArrayList (for example to hold a collection of a certain class); and if the list is dynamic, with a lot of objects inside that need to change order when removing or adding one in the middle, or if you go back and forth within the list, then use LinkedList".
But I couldn't understand when to prefer HashMap or HashSet, and I would be really glad if you could explain it to me.
Let me help you out here...
Hashed collections are the most efficient to add, search and remove data, since they hash the key (in HashMap) or the element (in HashSet) to find the place where they belong in a single step.
The concept of hashing is really simple. It is the process of representing an object as a number that can work as its ID.
For example, if you have a string in Java like String name = "Jeremy";, and you print its hash code with System.out.println(name.hashCode());, you will see a big number (-2079637766) that was computed from that string object's values (for a string, its characters). That way, that number can be used as an ID for that object.
So the hashed collections mentioned above use this number as an array index to find the elements in no time. But obviously it is too big to use as an index for a possibly small array, so they need to reduce that number so it fits in the range of the array size. (HashMap and HashSet use arrays to store their elements.)
The operation that they use to reduce that number is part of hashing, and is something like this: Math.abs(-2079637766 % arrayLength);.
It's not like that exactly, it's a bit more complex, but this is to simplify.
Let's say that arrayLength = 16;
The % operator will reduce that big number to a number smaller than 16, so that it fits in the array.
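The reduction just described can be sketched like this. It follows the simplified Math.abs(hash % arrayLength) formula from the text, not what java.util.HashMap really does internally (that uses bit operations on a power-of-two table). The names BucketIndexDemo and bucketIndex are made up for the example.

```java
// Hypothetical demo class; a simplified model, not the real HashMap internals.
final class BucketIndexDemo {
    // Reduces a key's hash code to an index in [0, arrayLength).
    // The % is applied first, so the value passed to Math.abs is always
    // strictly between -arrayLength and arrayLength.
    static int bucketIndex(Object key, int arrayLength) {
        return Math.abs(key.hashCode() % arrayLength);
    }

    public static void main(String[] args) {
        String name = "Jeremy";
        System.out.println(name.hashCode());       // prints -2079637766
        System.out.println(bucketIndex(name, 16)); // prints 6
    }
}
```

With arrayLength = 16, the string "Jeremy" would therefore be stored at index 6 in this simplified model.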
That is why a hashed collection will not allow duplicates: if you try to add the same object or an equivalent one (like 2 strings with the same characters), it will produce the same hash code and land on the same index, where equals() confirms that it is already there. A HashSet then keeps the existing element, and a HashMap replaces the value stored for that key.
In your question, you mentioned that if we are worried about duplicate items in an ArrayList, we can just check if the item is there before inserting it, so this way we don't need to use a HashSet. But that is not a good idea, because if you call list.contains(elem); on an ArrayList, it needs to go through the elements one by one, comparing each to see if it's there. If you have 1 million elements in the ArrayList and you check for an element that is not there, the ArrayList iterates over 1 million elements, which is not good. But a HashSet would only hash the object, go directly to where it is supposed to be in the array and check, doing it in just 1 step instead of 1 million. So you see how efficient a HashSet is compared to an ArrayList.
The same happens with a HashMap of size 1 million: it will take only 1 step to check if a key is there, not 1 million.
The same thing happens when you need to add, find and remove an element: the hashed collections do all that in a single step on average (constant time, it doesn't depend on the size of the map), while that varies for other structures.
That's why it is really efficient and widely used.
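To make the "check before adding" comparison concrete, here is a rough sketch assuming Java 8 or later. DedupDemo and its two methods are hypothetical names; they produce the same result, but the first one rescans the list for every element while the second one asks a HashSet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of any library.
final class DedupDemo {
    // Keeps first occurrences by checking the result list itself: each contains() is O(n).
    static List<String> dedupWithListOnly(List<String> input) {
        List<String> result = new ArrayList<>();
        for (String s : input) {
            if (!result.contains(s)) {
                result.add(s);
            }
        }
        return result;
    }

    // Same output, but a HashSet answers "seen before?" in O(1) on average.
    // Set.add() returns false when the element was already present.
    static List<String> dedupWithHashSet(List<String> input) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String s : input) {
            if (seen.add(s)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(dedupWithListOnly(names)); // prints [a, b, c]
        System.out.println(dedupWithHashSet(names));  // prints [a, b, c]
    }
}
```

For a handful of elements the difference is negligible; it starts to matter as the collection grows.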
Main Difference between an ArrayList and a LinkedList:
If you want to find the element at place 500 in an ArrayList of size 1000, you do: list.get(500); and it will do that in a single step, because an ArrayList is implemented with an array, so with that 500, it goes directly where the element is in the array.
But a LinkedList is not implemented with an array, but with objects pointing to each other. This way, it needs to go linearly, counting one by one until it gets to 500, which is not really efficient compared to the 1 single step of the ArrayList.
But when you need to add and remove elements in an ArrayList, sometimes the array will need to be recreated so more elements fit in it, and the elements after the insertion or removal point have to be shifted, increasing the overhead.
But that doesn't happen with the LinkedList, since no array has to be recreated; only the objects (nodes) have to be re-referenced, which is done in a single step once you are at the right node.
So an ArrayList is good when you won't be deleting or adding a lot of elements in the structure, but you are going to read a lot from it.
If you are going to add and remove a lot of elements, then a linked list is better, since it has less work to do for those operations.
Why do you need to implement the equals() and hashCode() methods for user-defined classes when you want to use those objects in HashMaps, and implement the Comparable interface when you want to use those objects with TreeMaps?
Based on what I mentioned earlier for HashMaps, it is possible that 2 different objects produce the same hash. If that happens, Java will not overwrite the previous one or remove it, but will keep them both at the same index. That is why you need to implement hashCode(): equal objects must return the same hash code, and you want to make sure that your objects will not have a really simple hash code that is easily duplicated.
And the reason why it is recommended to override the equals() method is that if there is a collision (2 or more objects sharing the same hash in a HashMap), then how do you tell them apart? Well, by asking the equals() method of those objects if they are the same. So if you ask the map if it contains a certain key, and at that index it finds 3 elements, it asks the equals() method of each element whether it equals the key that was passed; if so, it returns that one. If you don't override the equals() method properly and specify what things you want to check for equality (like the properties name, age, etc.), then some unwanted overwrites or duplicate entries inside the HashMap will happen and you will not like it.
If you create your own class, say, Person, and it has properties like name, age, lastName and email, you can use those properties in the equals() method, and if 2 different objects are passed but have the same values in the properties you selected for equality, then you return true to indicate that they are the same, or false otherwise. It is like the class String: if you do s1.equals(s2); with s1 = new String("John"); and s2 = new String("John");, even though they are different objects in Java heap memory, the implementation of the String.equals method uses the characters to determine if the objects are equal, and it returns true for this example.
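As an illustration, a Person class along those lines might look like the sketch below. The choice of fields (only name and age) is an assumption made for brevity, and PersonDemo is a made-up driver class; Objects.hash is the real java.util helper.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Illustrative class: equality is based on name and age only (an assumption for this example).
final class Person {
    private final String name;
    private final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person other = (Person) o;
        return age == other.age && name.equals(other.name);
    }

    // Must use the same fields as equals(), so that equal objects get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(name, age);
    }
}

// Hypothetical driver class.
final class PersonDemo {
    public static void main(String[] args) {
        Set<Person> people = new HashSet<>();
        people.add(new Person("John", 30));
        people.add(new Person("John", 30)); // different object, but equal: not added again
        System.out.println(people.size());  // prints 1
    }
}
```

Without the two overrides, the set would fall back to identity comparison and keep both objects.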
To use a TreeMap with user-defined classes as keys, you need to implement the Comparable interface (or pass a Comparator to the TreeMap constructor). Since the TreeMap will compare and sort the objects based on some properties, you need to specify by which properties your objects will be sorted. Will your objects be sorted by age? By name? By id? Or by any other property that you would like? Then, when you implement the Comparable interface and override the compareTo(UserDefinedClass o) method, you do your logic and return a positive number if the current object is greater than the o object passed, 0 if they are the same and a negative number if the current object is smaller. That way, the TreeMap will know how to sort them, based on the number returned.
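A short sketch of that idea, assuming sorting by age. Student, TreeMapKeyDemo and namesInKeyOrder are hypothetical names for the example; Integer.compare and TreeMap are standard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative key class, ordered by age (an assumption for this example).
final class Student implements Comparable<Student> {
    final String name;
    final int age;

    Student(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Negative if this is younger, 0 if same age, positive if older.
    // Note: a TreeMap treats keys that compare as 0 as the same key.
    @Override
    public int compareTo(Student o) {
        return Integer.compare(age, o.age);
    }
}

// Hypothetical driver class.
final class TreeMapKeyDemo {
    // Returns the student names in the order the TreeMap iterates its keys.
    static List<String> namesInKeyOrder(TreeMap<Student, String> map) {
        List<String> names = new ArrayList<>();
        for (Student s : map.keySet()) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        TreeMap<Student, String> grades = new TreeMap<>();
        grades.put(new Student("Ann", 30), "A");
        grades.put(new Student("Bob", 20), "B");
        grades.put(new Student("Cy", 25), "C");
        System.out.println(namesInKeyOrder(grades)); // prints [Bob, Cy, Ann]
    }
}
```

Because the map only consults compareTo, two students of the same age would count as one key here; a real class would usually compare more fields.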
First, HashSet. With a HashSet you can easily find out whether it contains a given element. Say you have a set of people in your class and you want to ask whether a guy is in your class. You can make an ArrayList of strings, and if you want to ask whether a guy is in your class, you have to iterate through the whole list until you find him, which might be too slow for longer lists. If you use a HashSet instead, the operation is much faster: you calculate the hash of the searched string and go directly to the matching bucket, so you don't need to pass over so many elements to answer your question. Well, you can also make a workaround to make the ArrayList faster to access for this purpose, but HashSet is already prepared for it.
And now HashMap. Now imagine that you also want to store a score for each person. So now you can use a HashMap. You enter the name and you get his score in a short time, without the need to iterate through the whole data structure.
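The name-to-score idea could look roughly like this. The names and scores are invented sample data, and ScoreDemo and sampleScores are hypothetical; put, get and containsKey are the standard Map methods.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class with made-up sample data.
final class ScoreDemo {
    // Builds a small name -> score map.
    static Map<String, Integer> sampleScores() {
        Map<String, Integer> scores = new HashMap<>();
        scores.put("Ann", 90);
        scores.put("Bob", 75);
        scores.put("Cy", 82);
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = sampleScores();
        // The key is hashed and only the matching bucket is inspected; no loop over all entries.
        System.out.println(scores.get("Bob"));         // prints 75
        System.out.println(scores.containsKey("Dan")); // prints false
        System.out.println(scores.get("Dan"));         // prints null: no mapping for that key
    }
}
```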
Does it make sense?
Concerning your question:
"But I couldn't understand when to prefer HashMap or HashSet, and I
would be really glad if you could explain it to me"
HashMap implements the Map interface, used for mapping a key (K) to a value (V) in constant time, where order doesn't matter, so you can put and retrieve the data efficiently if you know the key.
And HashSet implements the Set interface, but internally uses a HashMap. Its role is to be used as a Set, meaning you're not supposed to retrieve an element; you just check whether it is in the set or not (mostly).
In a HashMap, you can have identical values, while you can't in a Set (because it's a property of a Set).
Concerning this question:
"If the benefit is that this Hash objects can only contain 1 object of the same name, wouldn't it just be easier to make the ArrayList check that the type is not already there before adding it?"
When dealing with collections, you may base your choice of a particular one on the data representation, but also on the way you want to access and store the data: how do you access it? Do you need to sort it? Because each implementation may have a different complexity (https://en.wikipedia.org/wiki/Time_complexity), this becomes important.
Using the doc,
For ArrayList:
The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking).
For HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
So it's about the time complexity.
You may choose even more untypical collection for certain problems :).
This has little to do with Java specifically, and the choice depends mostly on performance requirements, but there's a fundamental difference that must be highlighted. Conceptually, Lists are types of collections that keep the order of insertion and may have duplicates, Sets are more like bags of items that have no specific order and no duplicates. Of course, different implementations may find a way around it (like a TreeSet).
First, let's check the difference between ArrayList and LinkedList. A linked list is a set of nodes, where each node contains a value and a link to the next and previous nodes. This makes inserting an element into a linked list a matter of appending a node to the end of the list, which is a quick operation since the memory does not have to be contiguous, as long as a node keeps a reference to the next node. On the other hand, accessing a specific element requires traversing the list until finding it.
An array list, as the name implies, wraps an array. Accessing elements in an array by index is direct access, but inserting an element may imply resizing the array (when it is full) to make room for the new element, since the memory it occupies is contiguous, making writes a bit heavier in this case.
A HashMap works like a dictionary, where for each key there's a value. The behavior of the insertion will mostly depend on how the hashCode and equals functions of the object used as a key are implemented. If the hashCode of two keys is the same, there's a hash collision, so equals will be used to understand if it's the same key or not. If equals returns true, then it's the same key, so the value is replaced. If not, the new value is added to the collection. Accessing and writing values depend mostly on calculating the hash of the key followed by direct access to the value, making both operations really quick, O(1).
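The collision behavior described above can be observed with two short strings that happen to share a hash code: "Aa" and "BB" both hash to 2112. This is only a sketch; CollisionDemo and build are made-up names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; not part of any library.
final class CollisionDemo {
    // Inserts two colliding keys, then writes to one of them again.
    static Map<String, Integer> build() {
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2); // same hashCode as "Aa", but equals() is false: both entries are kept
        map.put("Aa", 3); // equals() is true for the existing key: the value 1 is replaced by 3
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // prints true
        Map<String, Integer> map = build();
        System.out.println(map.size());    // prints 2
        System.out.println(map.get("Aa")); // prints 3
        System.out.println(map.get("BB")); // prints 2
    }
}
```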
A set is pretty much like a hash map, without the "values" part, thus, it follows the same rules regarding the implementation of hashCode and equals operations for the added value.
It might be handy to study a bit about the Big-O notation and complexity of algorithms. If you are starting with Java, I'd strongly recommend the book Effective Java, by Joshua Bloch.
Hope it helps you dig further.

Java HashSet vs Array Performance

I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are (and the number won't change), and was wondering whether Array would have a notable performance advantage over HashSet for storing/retrieving said elements.
On paper, Array guarantees constant time insertion (since I know the size ahead of time) and retrieval, but the code for HashSet looks much cleaner and adds some flexibility, so I'm wondering if I'm losing anything performance-wise using it, at least, theoretically.
Depends on your data;
HashSet gives you an O(1) contains() method but doesn't preserve order.
ArrayList contains() is O(n) but you can control the order of the entries.
With an array, if you need to insert anything in between, the worst case can be O(n), since you will have to move the data down and make room for the insertion. For a Set, you can directly use a SortedSet (such as TreeSet), whose insertion is O(log n), and which offers more flexible operations.
I believe Set is more flexible.
The choice greatly depends on what you want to do with it.
If it is what is mentioned in your question:
I have a collection of objects that are guaranteed to be distinct (in particular, indexed by a unique integer ID). I also know exactly how many of them there are
If this is what you need to do, then you need neither of them. There is a size() method in Collection from which you can get its size, which means how many of them there are in the collection.
If what you mean by "collection of objects" is not really a collection, and you need to choose a type of collection to store your objects for further processing, then you need to know that different kinds of collections have different capabilities and characteristics.
First, I believe that to have a fair comparison you should consider using ArrayList instead of an array, so that you don't need to deal with the reallocation.
Then it becomes a choice of ArrayList vs HashSet, which is quite straightforward:
Do you need a List or a Set? They are for different purposes: Lists provide you indexed access, and iteration is in order of index, while Sets are mainly for keeping a distinct set of data, and given their nature, you won't have indexed access.
After you have decided whether to use a List or a Set, it is a choice of List/Set implementation. Normally for Lists you choose between ArrayList and LinkedList, while for Sets you choose between HashSet and TreeSet.
The choice all depends on what you want to do with that collection of data. They perform differently on different actions.
For example, an indexed access in ArrayList is O(1), in HashSet (though not meaningful) it is O(n) (just for your interest, in LinkedList it is O(n), and in TreeSet it is also O(n), since you have to iterate to the position).
For adding a new element, both ArrayList and HashSet are O(1) operations (amortized). Inserting in the middle is O(n) for ArrayList, while it doesn't make sense in HashSet. Both will suffer from reallocation, and both of them need O(n) for the reallocation (HashSet is normally slower in reallocation, because every element has to be placed into a bucket of the new table).
To find out whether a certain element exists in the collection, ArrayList is O(n) and HashSet is O(1).
There are still lots of other operations you can do, so it is quite meaningless to discuss performance without knowing what you want to do.
Theoretically, and as the SCJP6 Study guide says :D
arrays are faster than collections, and as said, most of the collections depend mainly on arrays (Maps are not considered Collections, but they are included in the Collections framework).
If you guarantee that the number of your elements won't change, why get stuck in objects built on objects (collections built on arrays) while you can use the root objects directly (arrays)?
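It looks like you will want a HashMap that maps IDs to counts. Particularly,
import java.util.HashMap;
HashMap<Integer,Integer> counts=new HashMap<Integer,Integer>();
// merge() starts the count at 1 when the ID is not in the map yet; counts.get(uniqueID)+1 would throw a NullPointerException for a missing key
counts.merge(uniqueID,1,Integer::sum);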
This way, you get amortized O(1) adds, contains and retrievals. Essentially, an array with unique id's associated with each object IS a HashMap. By using the HashMap, you get the added bonus of not having to manage the size of the array, not having to map the keys to an array index yourself AND constant access time.

Randomly getting elements in a HashMap or HashSet without looping

I have roughly 420,000 elements that I need to store easily in a Set or List of some kind. The restriction, though, is that I need to be able to pick a random element and that it needs to be fast.
Initially I used an ArrayList and a LinkedList, however with that many elements it was very slow. When I profiled it, I saw that the equals() method in the object I was storing was called roughly 21 million times in a very short period of time.
Next I tried a HashSet. What I gain in performance I lose in functionality: I can't pick a random element. HashSet is backed by a HashMap, which is backed by an array of HashMap.Entry objects. However, when I attempted to expose them I was hindered by the crazy private and package-private visibility of the entire Java Collections Framework (even copying and pasting the class didn't work; the JCF is very "Use what we have or roll your own").
What is the best way to randomly select an element stored in a HashSet or HashMap? Due to the size of the collection I would prefer not to use looping.
IMPORTANT EDIT: I forgot a really important detail: exactly how I use the Collection. I populate the entire Collection at the beginning of the table. During the program I pick and remove a random element, then pick and remove a few more known elements, then repeat. The constant lookup and changing is what causes the slowness.
There's no reason why an ArrayList or a LinkedList would need to call equals()... although you don't want a LinkedList here as you want quick random access by index.
An ArrayList should be ideal - create it with an appropriate capacity, add all the items to it, and then you can just repeatedly pick a random number in the appropriate range, and call get(index) to get the relevant value.
HashMap and HashSet simply aren't suitable for this.
If ALL you need to do is get a large collection of values and pick a random one, then ArrayList is (literally) perfect for your needs. You won't get significantly faster (unless you went directly to primitive array, where you lose benefits of abstraction.)
If this is too slow for you, it's because you're using other operations as well. If you update your question with ALL the operations the collection must service, you'll get a better answer.
If you don't call contains() (which will call equals() many times), you can use ArrayList.get(randomNumber) and that will be O(1)
You can't do it with a HashMap - it stores the objects internally in an array, where the index is derived from the object's hash code. Even if you had that table, you'd need to guess which buckets contain objects. So a HashMap is not an option for random access.
Assuming that the equals() calls are there because you sort out duplicates with contains(), you may want to keep both a HashSet (for quick if-already-present lookup) and an ArrayList (for quick random access). Or, if the operations don't interleave, build a HashSet first, then extract its data with toArray() or transform it into an ArrayList with the latter's constructor.
If your problems are due to the remove() call on ArrayList, don't use it and instead:
if the element you remove is not the last one, just replace it (with set()) with the last element;
shrink the list size by 1.
This will of course screw up element order, but apparently you don't need it, judging by description. Or did you omit another important detail?
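The two steps above can be sketched as follows, assuming a non-empty list and that element order does not matter. SwapRemoveDemo and removeRandom are hypothetical names; get, set, remove(int) and Random.nextInt are the standard calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical demo class; not part of any library.
final class SwapRemoveDemo {
    // Picks a random element, removes it in O(1), and returns it.
    // Assumes the list is not empty; element order is not preserved.
    static <T> T removeRandom(List<T> list, Random rng) {
        int index = rng.nextInt(list.size());
        int last = list.size() - 1;
        T picked = list.get(index);
        // Overwrite the picked slot with the last element (harmless when index == last).
        list.set(index, list.get(last));
        // Removing the last element of an ArrayList shifts nothing.
        list.remove(last);
        return picked;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5));
        Integer picked = removeRandom(items, new Random());
        System.out.println(picked + " removed, " + items.size() + " left");
    }
}
```

Removing a known element (rather than a random index) would still need a search, which is where pairing the list with a HashSet or a map from element to index comes in.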

TreeMap or HashMap? [duplicate]

This question already has answers here:
Difference between HashMap, LinkedHashMap and TreeMap
(17 answers)
What is the difference between a HashMap and a TreeMap? [duplicate]
(8 answers)
Closed 8 years ago.
When to use hashmaps or treemaps?
I know that I can use TreeMap to iterate over the elements when I need them to be sorted.
But is that all? Is there no optimization when I just want to look things up in the maps, or are there some specific uses where one is optimal?
TreeMap provides guaranteed O(log n) lookup time (and insertion etc), whereas HashMap provides O(1) lookup time if the hash code disperses keys appropriately.
Unless you need the entries to be sorted, I'd stick with HashMap. Or there's ConcurrentHashMap of course. I can't remember the details of the differences between all of them, but HashMap is a perfectly reasonable "default" option :)
For completeness, I should point out that there was a discussion on Stack Overflow a month or so ago about the internals of various maps. See the comments in this question, which I will copy into this answer if bestsss is happy for me to do so.
Hashtables (usually) perform search operations (lookup) with complexity bounded by O(1)<=T(n)<=O(n), with an average case complexity of O(1 + n/k); however, binary search trees (BSTs) perform search operations (lookup) with complexity bounded by O(log_2(n))<=T(n)<=O(n), with an average case complexity of O(log_2(n)). The implementation of each (and every) data structure should be known (by you), to understand the advantages, drawbacks, time complexity of operations, and code complexity.
For example, a hashtable often has some fixed number of slots (some part of which may not be filled at all) with lists of collisions. Trees, on the other hand, usually have two pointers (references) per node, but this can be more if the implementation allows more than two child nodes per node, and this allows the tree to grow as nodes are added, but may not allow duplicates. (The default implementation of a Java TreeMap does not allow for duplicates.)
There are special cases to consider as well, for example, what if the number of elements in a particular data structure increases without bound or approaches the limit of an underlying part of the data structure? What about amortized operations that perform some rebalancing or cleanup operation?
For example, in a hashtable, when the number of elements in the table becomes sufficiently large, an arbitrary number of collisions can occur. On the other hand, trees usually require some re-balancing procedure after an insertion (or deletion).
So, if you have something like a cache (e.g. the number of elements is bounded, or the size is known) then a hashtable is probably your best bet; however, if you have something more like a dictionary (e.g. populated once and looked up many times) then I'd use a tree.
This is only the general case, however (no information was given). You have to understand which processes happen and how they happen to make the right choice in deciding which data structure to use.
When I need a multi-map (ranged lookup) or sorted flattening of a collection, then it can't be a hashtable.
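For instance, a ranged lookup on a TreeMap might look like the sketch below. The keys and values are made-up sample data, and RangeLookupDemo and sample are hypothetical names; subMap, floorKey and firstKey are real TreeMap methods.

```java
import java.util.TreeMap;

// Hypothetical demo class with made-up sample data.
final class RangeLookupDemo {
    // Builds a small sorted map; the insertion order is deliberately unsorted.
    static TreeMap<Integer, String> sample() {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(5, "five");
        map.put(1, "one");
        map.put(9, "nine");
        map.put(3, "three");
        return map;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> map = sample();
        // Keys from 2 (inclusive) to 9 (exclusive), already sorted.
        System.out.println(map.subMap(2, 9)); // prints {3=three, 5=five}
        // Greatest key less than or equal to 4.
        System.out.println(map.floorKey(4));  // prints 3
        System.out.println(map.firstKey());   // prints 1
    }
}
```

A HashMap has no equivalent of these calls, because its buckets carry no ordering information.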
The largest difference between the two is the underlying structure used in the implementation.
HashMaps use an array and a hashing function to store elements. When you try to insert or delete an item in the array, the hashing function converts the key into an index on the array where the object is/should be stored (ignoring conflicts). While hashmaps are generally very fast because they don't need to iterate over large amounts of data, they slow down when they're filled because they need to copy all the keys/values into a new array.
TreeMaps store the data in a sorted tree structure. While this means that they'll never have to allocate more space and copy over to it, operations require that part of the data already stored be iterated over, sometimes changing large parts of the structure.
Out of the two, HashMaps will generally have better performance when you don't need sorting.
Inserting new elements into a HashMap will, on average, be a good deal faster than inserting elements into a TreeMap. Unless you need your elements sorted, I'd go with the HashMap.
Don't forget there is also LinkedHashMap which is nearly as fast as HashMap for add/contains/remove operations but also maintains the insertion order.
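A quick way to see the ordering difference between the three maps, as a sketch assuming Java 8 or later. MapOrderDemo and keysAfterInserting are invented names, and the three fruit keys are sample data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; not part of any library.
final class MapOrderDemo {
    // Inserts the same three keys into the given map and returns them in iteration order.
    static List<String> keysAfterInserting(Map<String, Integer> map) {
        map.put("pear", 1);
        map.put("apple", 2);
        map.put("fig", 3);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // TreeMap: sorted by key.
        System.out.println(keysAfterInserting(new TreeMap<>()));       // prints [apple, fig, pear]
        // LinkedHashMap: insertion order.
        System.out.println(keysAfterInserting(new LinkedHashMap<>())); // prints [pear, apple, fig]
        // HashMap: some order that depends on the hash codes; not guaranteed by the API.
        System.out.println(keysAfterInserting(new HashMap<>()));
    }
}
```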
