I just stumbled upon this statement about java.util.HashSet: "This class makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time." Can anyone explain this statement?
(The statement comes from the java.util.HashSet Javadoc.)
HashSet uses N buckets and stores each element, based on its hash code, in one of those buckets to make searching faster: when you search for an element, the set computes the element's hash to know which bucket to look in, then checks whether that bucket contains the element. On average this makes searching roughly N times faster, since the set doesn't need to check the other N-1 buckets.
For a small number of elements, the number of buckets can be small. But as more elements arrive, the buckets start to contain more elements, which means searching gets slower. To solve this problem, the set has to add more buckets and redistribute its elements across the new buckets.
Now, when we iterate over a HashSet, we start with the elements in the first bucket, then the second bucket, and so on. You can see that a set which uses only buckets cannot guarantee the same order of elements, since the bucket layout can change between iterations.
Because HashSet is not ordered, the iterator probably walks all buckets and steps through each bucket's contents in turn. This means that if more items are added so that the buckets are rehashed, the order can change.
E.g. if you have 1,2,3 and you iterate you may well get 1,3,2. Also, if you later add 4 you could then get 4,2,3,1 or any other order.
It basically means a HashSet has no order, so you should not rely on the order of values in your code.
If your set contains the values {2, 1, 3} (insertion order), nothing guarantees that iterating over it will return {2, 1, 3} or {1, 2, 3}.
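To make this concrete, here is a minimal sketch (the exact HashSet output depends on the JDK version and the table size) comparing a HashSet with a LinkedHashSet, which does preserve insertion order:
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IterationOrderDemo {
    public static void main(String[] args) {
        List<Integer> inserted = List.of(20, 1, 29, 5, 13);

        // HashSet: iteration order depends on the internal bucket layout.
        Set<Integer> hashSet = new HashSet<>(inserted);
        System.out.println("HashSet:       " + hashSet);       // e.g. [1, 20, 5, 29, 13]

        // LinkedHashSet: iteration order is the insertion order.
        Set<Integer> linkedSet = new LinkedHashSet<>(inserted);
        System.out.println("LinkedHashSet: " + linkedSet);     // [20, 1, 29, 5, 13]
    }
}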
Related
I know the underlying data structure for the HashSet is an array. I thought I could get a random value from the HashSet by using iterator().next().
I looked at the source code but couldn't really tell. Does the iterator not traverse the values in the HashSet in a random order?
The iterator will traverse the elements by hash table bucket, based on the hash codes of the objects, so they will come out in an arbitrary order that may well look random; however, the order is consistent for a given HashSet size and contents. Because the order is arbitrary, hash-based containers make no guarantees about the iteration order of their elements, but they also make no effort to randomize it.
Random access, in data-structure terms, means that you can get an element with an array-like operation using an index: you can select any location by specifying that index. Lists are also random access, as they have a get() method. If you want the elements in a random order, you can put them in a List and then shuffle the list.
List<Integer> list = new ArrayList<>(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9));
Collections.shuffle(list);
for (int i : list) {
    System.out.println(i);
}
prints something like the following, with no repeated elements:
4
5
3
7
6
1
9
8
2
If you want to just get values randomly, possibly with repeated elements, then use Random as suggested: generate a random index from 0 (inclusive) to list.size() (exclusive) and retrieve the value with list.get(). You can do that as many times as required without exhausting the supply.
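For example, a small sketch of that approach (class and variable names are just illustrative):
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomPickDemo {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9));
        Random random = new Random();

        // Pick 5 values at random; the same value may come up more than once.
        for (int n = 0; n < 5; n++) {
            int index = random.nextInt(list.size()); // 0 (inclusive) to size (exclusive)
            System.out.println(list.get(index));
        }
    }
}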
So I understand that hashmaps use buckets and hash codes and whatnot. From my experience, Java hash codes are usually not small but rather large numbers, so I assume they are not used directly as internal indexes. Unless the hash code quality is poor, resulting in roughly equal bucket lengths and bucket counts, what makes hashmaps faster than a list of name->value pairs?
Hashmaps work by mapping elements to "buckets" using a hash function. When someone tries to insert an element, a hash code is calculated and a modulus operation is applied to it in order to get the index of the bucket in which the element should be inserted (that is the reason it doesn't matter how big the hash code is). For example, if you have 4 buckets and your hash code is 40, it will be inserted in bucket 0, because 40 mod 4 is 0.
When two elements are mapped to the same bucket, a "collision" occurs, and usually the element is stored in a list under that bucket.
If you try to obtain an element, the key is mapped again using the hash function. If the bucket contains a list of elements, the equals() method is used to identify which element is the correct one (that is the reason you must implement equals() and hashCode() to use a custom object as a hashmap key).
So, if you search for an element and your hashmap does not have any lists in its buckets, you have an O(1) cost. The worst case would be having only 1 bucket and a list containing all the elements, in which case obtaining an element would be the same as searching a list: O(N).
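To illustrate the equals()/hashCode() point, here is a minimal sketch with a made-up Point key; it works as a HashMap key because both methods are overridden consistently:
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Equal points must produce equal hash codes, or lookups will miss.
        return Objects.hash(x, y);
    }
}

public class CustomKeyDemo {
    public static void main(String[] args) {
        Map<Point, String> map = new HashMap<>();
        map.put(new Point(1, 2), "first");
        // Found because equals() and hashCode() agree for equal points.
        System.out.println(map.get(new Point(1, 2))); // first
    }
}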
I looked in the Java implementation and found that it does a bitwise AND, which acts like a modulus and makes a lot of sense for reducing the hash to fit the array size. This allows the O(1) access that makes HashMaps nice.
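A rough sketch of that index computation, assuming a power-of-two table size as the OpenJDK implementation uses (the method names here are illustrative, not the actual internal ones):
public class BucketIndexDemo {
    // Reduce an arbitrary hash code to a bucket index for a power-of-two table size.
    static int bucketIndex(int hashCode, int tableSize) {
        // Equivalent to hashCode % tableSize for non-negative hashes,
        // because tableSize is a power of two.
        return hashCode & (tableSize - 1);
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex(40, 16));        // 8
        System.out.println(bucketIndex(12303512, 16));  // 8
        System.out.println(bucketIndex(40, 32));        // 8
        System.out.println(bucketIndex(12303512, 32));  // 24
    }
}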
I was asked this question in a recent interview.
You are given an array that has a million elements. All the elements are duplicates except one. My task is to find the unique element.
var arr = [3, 4, 3, 2, 2, 6, 7, 2, 3........]
My approach was to go through the entire array in a for loop and build a map with the number from the array as the key and the frequency of that number as the value. Then loop through the map and return the key whose count is 1.
I said my approach would take O(n) time complexity. The interviewer told me to optimize it to better than O(n). I said that we cannot, as we have to go through the entire array of a million elements.
Finally, he didn't seem satisfied and moved onto the next question.
I understand that going through a million elements in the array is expensive, but how could we find the unique element without doing a linear scan of the entire array?
PS: the array is not sorted.
I'm certain that you can't solve this problem without going through the whole array, at least if you don't have any additional information (like the elements being sorted or restricted to certain values), so the problem has a minimum time complexity of O(n). You can, however, reduce the memory complexity to O(1) with a XOR-based solution, provided every element other than the unique one appears an even number of times (which seems to be the most common variant of the problem), if that's of any interest to you:
int unique(int[] array) {
    int unpaired = array[0];
    for (int i = 1; i < array.length; i++) {
        // Pairs cancel out: x ^ x == 0, so only the unpaired value survives.
        unpaired = unpaired ^ array[i];
    }
    return unpaired;
}
Basically, each pair of equal elements cancels out under XOR, so the result is the only element that didn't cancel out.
Assuming the array is unordered, you can't. Every value is independent of the next, so nothing can be deduced about a value from any of the other values.
If it's an ordered array of values, then that's another matter and depends entirely on the ordering used.
I agree the easiest way is to have another container and store the frequency of the values.
In fact, since the number of elements in the array is fixed, you could do much better than what you proposed.
By "creating a map with index as the number in the array and the value as the frequency of the number occurring in the array", you create a map with 2^32 positions (assuming the array had 32-bit integers), and then you have to pass though that map to find the first position whose value is one. It means that you are using a large auxiliary space and in the worst case you are doing about 10^6+2^32 operations (one million to create the map and 2^32 to find the element).
Instead of doing so, you could sort the array with some n*log(n) algorithm and then search for the element in the sorted array, because in your case, n = 10^6.
For instance, using the merge sort, you would use a much smaller auxiliary space (just an array of 10^6 integers) and would do about (10^6)*log(10^6)+10^6 operations to sort and then find the element, which is approximately 21*10^6 (many many times smaller than 10^6+2^32).
PS: sorting the array decreases the search from a quadratic cost (comparing each element against every other) to a linear one, because in a sorted array we only have to check the adjacent positions to see whether the current element is unique.
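Here is a small sketch of that sort-and-scan idea, assuming exactly one element appears once and every other element appears at least twice:
import java.util.Arrays;

public class UniqueBySorting {
    // Returns the single element that appears exactly once,
    // assuming every other element appears at least twice.
    static int findUnique(int[] array) {
        int[] sorted = array.clone();   // keep the input intact
        Arrays.sort(sorted);            // O(n log n)

        for (int i = 0; i < sorted.length; i++) {
            boolean differsFromPrev = (i == 0) || sorted[i] != sorted[i - 1];
            boolean differsFromNext = (i == sorted.length - 1) || sorted[i] != sorted[i + 1];
            if (differsFromPrev && differsFromNext) {
                return sorted[i];
            }
        }
        throw new IllegalArgumentException("no unique element found");
    }

    public static void main(String[] args) {
        System.out.println(findUnique(new int[]{3, 4, 3, 2, 2, 6, 4, 6, 7})); // 7
    }
}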
Your approach seems fine. It could be that he was looking for an edge case where the array has an even size, meaning there are either no unmatched elements or there are two or more. He just went about asking it the wrong way.
My Set is sometimes sorted, and sometimes not.
Here is the example:
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class SetOfInteger {
    public static void main(String[] args) {
        Random rand = new Random(47);
        Set<Integer> intset = new HashSet<>();
        for (int i = 0; i < 10; i++) {
            int j = rand.nextInt(30);
            System.out.print(j + " ");
            intset.add(j);
        }
        System.out.println();
        System.out.println(intset);
    }
}
The result shows that the set is not sorted.
8 5 13 11 1 29 28 20 12 7
[1, 20, 5, 7, 8, 11, 12, 29, 28, 13]
When I change the termination expression to i < 20 in the for statement, the result shows that the set becomes sorted.
8 5 13 11 1 29 28 20 12 7 18 18 21 19 29 28 28 1 20 28
[1, 5, 7, 8, 11, 12, 13, 19, 18, 21, 20, 29, 28]
It is so strange, isn't it? I just don't know how to explain it, and I need some help. Thank you very much.
A HashSet does not guarantee sorted iteration, but under very specific circumstances its internal data structure may act like a bucket sort.
Specifically, for integer keys in the range [0,65535] and a table size that is greater than the largest key, the index of the bucket a key is stored in is equal to the key itself, and since the iterator iterates in bucket order, it emits the elements in sorted order.
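To see why, here is a sketch mirroring how OpenJDK 8's HashMap computes bucket indexes (HashSet is backed by a HashMap; method names here are illustrative). Integer.hashCode() is the int value itself, and for values below 2^16 the "spreading" step changes nothing, so whenever the table is larger than the value, the bucket index equals the value:
public class SmallIntBucketDemo {
    // Mirrors OpenJDK 8's HashMap.hash(): XOR the high half into the low half.
    static int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    // Bucket index for a power-of-two table size.
    static int bucketIndex(int hashCode, int tableSize) {
        return spread(hashCode) & (tableSize - 1);
    }

    public static void main(String[] args) {
        // With a table of 32 buckets, every value below 30 lands in the bucket
        // equal to its own value, so bucket-order iteration looks sorted.
        for (int value : new int[]{29, 5, 13, 28}) {
            System.out.println(value + " -> bucket " + bucketIndex(value, 32));
        }
        // With only 16 buckets, 29 and 28 wrap around to buckets 13 and 12.
        System.out.println(29 + " -> bucket " + bucketIndex(29, 16)); // 13
        System.out.println(28 + " -> bucket " + bucketIndex(28, 16)); // 12
    }
}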
There are some good answers all around, but none attempt to explain what exactly happens in this particular situation, so I'll limit my answer to that, rather than add another explanation of how the HashSet works. I'm taking that understanding as granted.
The default constructor of HashSet creates a set with a capacity of 16 and a load factor of 0.75. That means there are 16 bins, and this capacity is increased when you insert 16 * 0.75 = 12 unique elements.
That's why in the first case, the numbers are sorted by their remainder when divided by 16: the set started with a table size of 16, "hashing" each element to a bin by taking x % 16. Then when there were 12 elements, it grew the table and performed a rehash (see Javier Martin's answer if that's not clear), probably growing the table to 32. (I could only find information about how it grows in the java 6 doc, which states that the number of buckets is "approximately" doubled, whatever that means.) That gave each integer under 30 its own bin, so when the set iterated over each bin in order, it iterated over the numbers in order. If you inserted numbers below 64, you'd probably find that you need to insert 32*0.75 = 24 elements before the iteration appears sorted.
Also note that this way of assigning bins is not guaranteed behavior. HashSets in other Java versions/implementations might do something more complicated with the objects' hashCode() values than simply taking a remainder. (As noted by ruakh and fluffy in the comments - thanks!)
Your question points out that item order changes as the set grows bigger. However, you can't count on the order being preserved. A Set has one guarantee: there is only one of each kind of element. There are other Set objects that provide further guarantees, but a simple HashSet provides no guarantee of order.
The re-ordering you see is simply an internal reshuffling due to the way the HashSet is stored internally. In a very simplified way of thinking, the HashSet has a certain number of "slots" to store values (in Java's implementation, always a power of two). The hash codes from hashCode() are used to assign each object to a slot. When there is a hash code collision, the HashSet uses equals() to determine whether the objects are in fact unique.
As you add items to a HashSet several things happen:
The object's hashCode() is computed and further hashed to find which slot it belongs in
The object is assigned to that internal slot
If there's a slot collision, then we test for equality. If it's the same object we discard it, if not we add it to a list in that slot
When the number of objects exceeds a threshold (the number of slots times the load factor), the HashSet needs to resize itself
It creates a bigger array of slots (in Java's implementation, double the previous size)
The existing items are remapped into the new collection of slots -- this is where order can change
The bottom line is that if the objects magically sorted themselves, that's not an implementation you can count on unless you are using a TreeSet which imposes a sort order on the set items.
The iteration order of a HashSet is not defined; the only guarantee is that it is consistent: iterating over a HashSet that has not been modified will produce the same sequence each time.
Internally, as a commenter said, the class uses the hashCode method of each element to store them in a certain number of bins. So, for example, if it's using 20 bins then it could take o.hashCode() % 20 as the bin index. Each bin can have several items in a list, which are then distinguished by the equals method. Thus, even if the hash of an Integer is its int value, the order need not be the natural integer ordering.
Furthermore, the set monitors its load when inserting and removing elements, considering things like the fraction of free bins, the maximum bin list size, or the average number of items per bin. When it deems it appropriate, it performs a rehash, which means changing the number of bins used to store the elements; each element's bin index then changes, because the n in o.hashCode() % n changes. Every element is "reshuffled" to its new place (this is a costly operation), which explains the different ordering you see after adding more elements.
Interesting question. A HashSet uses an array of linked lists to store its elements. hashCode() is used (indirectly) to find the position at which an object is stored in the Set.
If two objects need to be stored at the same position, the second object is stored in the next node of the linked list at that position.
The size of the array is dynamic and is computed at run time according to the number of objects in it. I'm not sure, but I assume you see your numbers as sorted because the Set may have increased its size. The hashCode() of each number depends on its value, and since the underlying array grew as the loop got longer, there would have been no collisions, so the output comes out sorted.
But still, I would like to emphasize this so that my answer does not lead to any misconception: HashSet does not guarantee any ordering of the elements.
You must sort it manually, because there is no guarantee that the HashSet will be sorted. If you want, you can also use a TreeSet, which provides the functionality you want, but if you want to use a HashSet anyway, try this:
Set<Integer> intset = new HashSet<>();
List<Integer> sortedIntList = new ArrayList<>(intset);
Collections.sort(sortedIntList);
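Alternatively, if all you need is a sorted view, a TreeSet keeps its elements sorted at all times; a minimal sketch:
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class SortedViewDemo {
    public static void main(String[] args) {
        Set<Integer> intset = new HashSet<>(Set.of(29, 5, 13, 1));
        // TreeSet sorts by the natural ordering of its elements.
        Set<Integer> sorted = new TreeSet<>(intset);
        System.out.println(sorted); // [1, 5, 13, 29]
    }
}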
In some post I read:
ConcurrentHashMap groups elements by a proximity based on loadfactor
How does this grouping happen?
Let's say I override the hashCode() function so that it always returns 1. How are higher and lower values of the load factor going to affect inserts into a ConcurrentHashMap?
Now suppose I override the hashCode() function so that it always returns different hash codes. How are higher and lower values of the load factor going to affect inserts into a ConcurrentHashMap?
A hashmap is essentially an array of lists. For example, let's say a given hashmap has an array of 100 lists. When you add something to it, the hashCode is calculated for that object. Then the modulus of that value and the number of lists (in this case 100) is used to determine which list it is added to. So if you add an object with hash code 13, it gets added to list 13. If you add an object with hash code 12303512, it gets added to list 12.
The load factor tells the hashmap when to increase the number of lists. It's based on the number of items in the entire map and the current capacity.
In your first scenario where hashcode always returns 1, no matter how many lists there are, your objects will end up in the same list (this is bad.) In the second scenario, they will be distributed more evenly across the lists (this is good.)
Since the load factor is based on the overall size of the map and not that of the lists, the quality of your hashcodes doesn't really interact with the loadfactor. In the first scenario, it will grow just like in the second one but everything will still end up in the same list regardless.
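To illustrate the first scenario, here is a small sketch (BadKey is a made-up class) of a key type whose hashCode() always returns 1: every entry ends up in the same bin, so lookups degrade from O(1) toward O(n), even though the map still behaves correctly.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConstantHashDemo {
    // A deliberately bad key: all instances collide in the same bin.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 1; // every key maps to the same bucket
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new ConcurrentHashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey(i), i);
        }
        // Still correct, just slow: each lookup has to search one big bin.
        System.out.println(map.get(new BadKey(1234))); // 1234
    }
}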