My Set is sometimes sorted, and sometimes not.
Here is the example:
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class SetOfInteger {
    public static void main(String[] args) {
        Random rand = new Random(47);
        Set<Integer> intset = new HashSet<>();
        for (int i = 0; i < 10; i++) {
            int j = rand.nextInt(30);
            System.out.print(j + " ");
            intset.add(j);
        }
        System.out.println();
        System.out.println(intset);
    }
}
The result shows that the set is not sorted.
8 5 13 11 1 29 28 20 12 7
[1, 20, 5, 7, 8, 11, 12, 29, 28, 13]
When I change the termination expression to i < 20 in the for statement, the result shows that the set becomes sorted.
8 5 13 11 1 29 28 20 12 7 18 18 21 19 29 28 28 1 20 28
[1, 5, 7, 8, 11, 12, 13, 18, 19, 20, 21, 28, 29]
It seems so strange, doesn't it? I just don't know how to explain it and would appreciate some help. Thank you very much.
A HashSet does not guarantee sorted iteration, but under very specific circumstances its internal data structure may act like a bucket sort.
Specifically, for integer keys in the range [0,65535] and a table size that is greater than the largest key, the index of the bucket a key is stored in is equal to the key itself, and since the iterator iterates in bucket order, it emits the elements in sorted order.
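To see why, here is a minimal sketch of the arithmetic (this mirrors the OpenJDK 8 HashMap internals, which are an implementation detail, not part of the specification): Integer.hashCode() returns the value itself, the hash is then spread by XOR-ing in its high 16 bits, and the bucket index is the hash masked by the table size minus one. For a key below 65536 and a table larger than the key, all of that collapses to the key itself.

int key = 29;                               // example element
int tableSize = 32;                         // assumed power-of-two table size, larger than the key
int h = Integer.valueOf(key).hashCode();    // == 29 for an Integer
int spread = h ^ (h >>> 16);                // the high 16 bits are zero, so spread == 29
int bucket = (tableSize - 1) & spread;      // == 29: the bucket index equals the key
System.out.println(bucket);                 // prints 29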
There are some good answers all around, but none attempt to explain what exactly happens in this particular situation, so I'll limit my answer to that, rather than add another explanation of how the HashSet works. I'm taking that understanding as granted.
The default constructor of HashSet creates a set with a capacity of 16 and a load factor of 0.75. That means there are 16 bins, and this capacity is increased when you insert 16 * 0.75 = 12 unique elements.
That's why in the first case, the numbers are sorted by their remainder when divided by 16: the set started with a table size of 16, "hashing" each element to a bin by taking x % 16. Then when there were 12 elements, it grew the table and performed a rehash (see Javier Martin's answer if that's not clear), probably growing the table to 32. (I could only find information about how it grows in the java 6 doc, which states that the number of buckets is "approximately" doubled, whatever that means.) That gave each integer under 30 its own bin, so when the set iterated over each bin in order, it iterated over the numbers in order. If you inserted numbers below 64, you'd probably find that you need to insert 32*0.75 = 24 elements before the iteration appears sorted.
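To make the arithmetic above concrete, a hedged sketch (the 16 and 0.75 defaults come from the HashSet documentation; the doubling to 32 is an implementation detail, not a guarantee):

int initialCapacity = 16;
float loadFactor = 0.75f;
int threshold = (int) (initialCapacity * loadFactor);    // 12 -- the 13th unique element triggers a resize
int grownCapacity = initialCapacity * 2;                 // 32, assuming the table roughly doubles
int nextThreshold = (int) (grownCapacity * loadFactor);  // 24 -- matches the "numbers below 64" remark above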
Also note that this way of assigning bins is not guaranteed behavior. HashSets in other Java versions/implementations might do something more complicated with the objects' hashCode() values than simply taking a remainder. (As noted by ruakh and fluffy in the comments - thanks!)
Your question points out that item order changes as the set grows bigger. However, you can't count on the order being preserved. A Set has one guarantee: there is only one of each kind of element. There are other Set objects that provide further guarantees, but a simple HashSet provides no guarantee of order.
The re-ordering you see is simply an internal reshuffling due to the way the HashSet is stored internally. In a very simplified way of thinking, the HashSet has a certain number of "slots" to store values, which is usually an odd number, if not also prime. The hashcodes from hashCode() are used to assign each object to a slot. When there is a hashcode collision, the HashSet uses the equals() method to determine whether the objects are in fact unique.
As you add items to a HashSet several things happen:
The object's hashcode is computed and then further hashed to find which slot it belongs in
The object is assigned to that internal slot
If there's a slot collision, we test for equality: if it's the same object we discard it; if not, we add it to a list in that slot
When the number of objects exceed the number of slots, the HashSet needs to resize itself
It creates a bigger set of slots which is still usually an odd or prime number
The existing items are remapped into the new collection of slots -- this is where order can change (a short sketch follows this list)
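Here is a small hedged sketch of that last remapping step; the printed orderings are implementation details of the JDK you run and may differ between versions:

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ResizeDemo {
    public static void main(String[] args) {
        Set<Integer> set = new HashSet<>();
        Collections.addAll(set, 20, 5, 29);
        System.out.println(set);              // printed in bucket order of the initial table

        for (int i = 100; i < 110; i++) {     // push the size past the resize threshold
            set.add(i);
        }
        System.out.println(set);              // after the resize, the relative order of 20, 5 and 29 may change
    }
}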
The bottom line is that if the objects magically sorted themselves, that's not an implementation you can count on unless you are using a TreeSet which imposes a sort order on the set items.
The iteration order of a HashSet is not defined; the only guarantee is that it is consistent: iterating over a HashSet that has not been modified will produce the same sequence each time.
Internally, as a commenter said, the class uses the hashCode method of each element to store them in a certain number of bins. So, for example, if it's using 20 bins then it could take o.hashCode() % 20 as the bin index. Each bin can have several items in a list, which are then distinguished by the equals method. Thus, even if the hash of an Integer is its int value, the order need not be the natural integer ordering.
Furthermore, the set monitors its load factor as elements are inserted and removed, considering the fraction of free bins, the maximum bin list size, the average number of items per bin, and so on. When it considers it appropriate, it performs a rehash, which means changing the number of bins used to store the elements; each element's bin index changes because the n in o.hashCode() % n changes. Every element is "reshuffled" to its new place (a costly operation), which explains the different ordering you see after adding more elements.
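As a toy illustration of that rehash, using the simplified o.hashCode() % n model from above (the real implementation uses power-of-two tables and bit masking, but the effect is the same):

int value = 29;                           // Integer.hashCode() is the value itself
int binsBefore = 16;                      // assumed bin count before the rehash
int binsAfter = 32;                       // assumed bin count after the rehash
System.out.println(value % binsBefore);   // 13 -- bin index before
System.out.println(value % binsAfter);    // 29 -- bin index after, so the element moves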
Interesting question. A HashSet uses an array of linked lists to store its elements. hashCode() is used to find (indirectly) the position at which an object is stored in the set.
If two objects need to be stored at the same position, the second object is stored in the next node of the linked list at that position.
The size of the array is dynamic and recomputed at run time based on the number of objects in it. I'm not certain, but I assume you see your numbers as sorted because the set increased the size of its array: the hashCode() of an Integer depends on its value, so once the underlying array grew along with your loop, there were no collisions and the output came out sorted.
Still, I would like to emphasize this so that my answer does not lead to any misconception: HashSet does not guarantee any ordering of the elements.
You must sort it manually, because there is no guarantee that the HashSet will be sorted. If you want, you can also use a TreeSet, which provides the functionality you want; but if you want to use a HashSet anyway, try this:
Set<Integer> intset = new HashSet<>();
List<Integer> sortedIntList = new ArrayList<>(intset);
Collections.sort(sortedIntList);
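If you would rather keep the data sorted at all times, a sketch of the TreeSet alternative mentioned above would be:

Set<Integer> sortedSet = new TreeSet<>(intset);   // copies the elements and keeps them in natural order
System.out.println(sortedSet);                    // iterates in ascending order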
Related
I was asked this question in a recent interview.
You are given an array that has a million elements. All the elements are duplicates except one. My task is to find the unique element.
var arr = [3, 4, 3, 2, 2, 6, 7, 2, 3........]
My approach was to go through the entire array in a for loop and build a map, with each number in the array as the key and the frequency of that number as the value. Then loop through the map again and return the key whose value is 1.
I said my approach would take O(n) time complexity. The interviewer told me to optimize it to less than O(n) complexity. I said that we cannot, as we have to go through the entire array of a million elements.
Finally, he didn't seem satisfied and moved on to the next question.
I understand going through a million elements in the array is expensive, but how could we find the unique element without doing a linear scan of the entire array?
PS: the array is not sorted.
I'm certain that you can't solve this problem without going through the whole array, at least if you don't have any additional information (like the elements being sorted or restricted to certain values), so the problem has a minimum time complexity of O(n). You can, however, reduce the memory complexity to O(1) with a XOR-based solution, provided every element other than the unique one appears an even number of times (which seems to be the most common variant of the problem), if that's of any interest to you:
int unique(int[] array)
{
    int unpaired = array[0];
    for (int i = 1; i < array.length; i++)
        unpaired = unpaired ^ array[i];
    return unpaired;
}
Basically, every XORed element cancels out with the other one, so your result is the only element that didn't cancel out.
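For example, assuming the unique() method above is in scope and every other value appears exactly twice:

int[] arr = {3, 4, 3, 2, 2, 4, 7};
System.out.println(unique(arr));   // prints 7; 3^3, 4^4 and 2^2 all cancel to 0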
Assuming the array is unordered, you can't. Every value is mutually exclusive of the next, so nothing can be deduced about a value from any of the other values.
If it's an ordered array of values, then that's another matter and depends entirely on the ordering used.
I agree the easiest way is to have another container and store the frequency of the values.
In fact, since the number of elements in the array is fixed, you could do much better than what you have proposed.
By "building a map with each number as the key and its frequency as the value", you create a map with 2^32 possible positions (assuming the array holds 32-bit integers), and then you have to pass through that map to find the first position whose value is one. That means you are using a large auxiliary space, and in the worst case you are doing about 10^6 + 2^32 operations (one million to create the map and 2^32 to find the element).
Instead of doing so, you could sort the array with some n*log(n) algorithm and then search for the element in the sorted array, because in your case, n = 10^6.
For instance, using the merge sort, you would use a much smaller auxiliary space (just an array of 10^6 integers) and would do about (10^6)*log(10^6)+10^6 operations to sort and then find the element, which is approximately 21*10^6 (many many times smaller than 10^6+2^32).
PS: sorting the array decreases the search from a quadratic to a linear cost, because with a sorted array we just have to check the adjacent positions to see whether the current position is unique or not.
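A minimal sketch of the sort-then-scan idea above, assuming exactly one value occurs once and every other value occurs at least twice (the method name is mine):

import java.util.Arrays;

static int uniqueBySorting(int[] arr) {
    int[] a = arr.clone();                 // keep the caller's array intact
    Arrays.sort(a);                        // O(n log n); duplicates become adjacent
    for (int i = 0; i < a.length; i++) {
        boolean differsFromPrev = (i == 0) || a[i] != a[i - 1];
        boolean differsFromNext = (i == a.length - 1) || a[i] != a[i + 1];
        if (differsFromPrev && differsFromNext) {
            return a[i];                   // no equal neighbour, so this is the unique element
        }
    }
    throw new IllegalArgumentException("no unique element found");
}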
Your approach seems fine. It could be that he was looking for an edge case where the array is of even size, meaning there is either no unmatched element or there are two or more. He just went about asking it the wrong way.
In some post I read:
ConcurrentHashMap groups elements by a proximity based on loadfactor
How does this grouping happen?
Let's say I override the hashCode() function so that it always returns 1. Now how are higher and lower values of the load factor going to affect inserts into a ConcurrentHashMap?
Now suppose I override the hashCode() function so that it always returns different hashcodes. How are higher and lower values of the load factor going to affect inserts into a ConcurrentHashMap?
A hashmap is essentially an array of lists. For example, let's say a given hashmap has an array of 100 lists. When you add something to it, the hashCode is calculated for that object. Then the modulus of that value and the number of lists (in this case 100) is used to determine which list it is added to. So if you add an object with hashcode 13, it gets added to list 13. If you add an object with hashcode 12303512, it gets added to list 12.
The load factor tells the hashmap when to increase the number of lists. It's based on the number of items in the entire map and the current capacity.
In your first scenario where hashcode always returns 1, no matter how many lists there are, your objects will end up in the same list (this is bad.) In the second scenario, they will be distributed more evenly across the lists (this is good.)
Since the load factor is based on the overall size of the map and not on the size of the individual lists, the quality of your hashcodes doesn't really interact with the load factor. In the first scenario the map will grow just like in the second one, but everything will still end up in the same list regardless.
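To make the two scenarios concrete, here is a hedged sketch (BadKey and GoodKey are made-up classes for this illustration, not part of any library):

class BadKey {
    final int id;
    BadKey(int id) { this.id = id; }
    @Override public int hashCode() { return 1; }      // every key maps to the same list
    @Override public boolean equals(Object o) { return o instanceof BadKey && ((BadKey) o).id == id; }
}

class GoodKey {
    final int id;
    GoodKey(int id) { this.id = id; }
    @Override public int hashCode() { return id; }     // keys spread across the lists
    @Override public boolean equals(Object o) { return o instanceof GoodKey && ((GoodKey) o).id == id; }
}

Whatever the load factor, every BadKey lands in the same list, so lookups degrade to scanning that one long list; GoodKey entries stay spread out, so the map keeps its near-constant-time behaviour as it grows.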
I just stumbled upon this statement about java.util.HashSet, which reads: "This class makes no guarantees as to the iteration order of the set; in particular, it does not guarantee that the order will remain constant over time." Can anyone explain this statement?
Statement source: the java.util.HashSet Javadoc.
A HashSet uses N buckets and stores each element, based on its hashcode, in one of these buckets to make searching faster: when you search for an element, the set calculates the element's hash to know which bucket to search, then checks whether that bucket contains the element. This makes searching roughly N times faster, since the set doesn't need to check the other N-1 buckets.
For a small number of elements, the number of buckets can be small. But with more elements arriving, the buckets will start to contain more elements which means that searching will go slower. To solve this problem, the set will need to add more buckets and rearrange its elements to use the new buckets.
Now when we iterate over a HashSet, we do it by starting with the elements from the first bucket, then the second bucket, and so on. You can see that sets which rely only on buckets can't guarantee a stable order of elements, since the bucket layout can change between iterations once more elements are added.
Because a HashSet is not ordered, the iterator probably walks all the buckets and steps through each bucket's contents in turn. This means that if more items are added and the buckets are rebalanced, the order can change.
E.g. if you have 1,2,3 and you iterate you may well get 1,3,2. Also, if you later add 4 you could then get 4,2,3,1 or any other order.
It basically means a HashSet has no defined order, so you should not rely on the order of its values in your code.
If your set contains the values {2, 1, 3} (insertion order), nothing guarantees that iterating over it will return {2, 1, 3} or {1, 2, 3}.
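For instance (the actual output is an implementation detail of the JDK you run; neither ordering is guaranteed):

Set<Integer> s = new HashSet<>();
s.add(2);
s.add(1);
s.add(3);
System.out.println(s);   // on a typical OpenJDK run this happens to print [1, 2, 3], not the insertion order [2, 1, 3]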
Below I've listed a problem I'm having some trouble with. This problem is a simple nested loop away from an O(n^2) solution, but I need it to be O(n). Any ideas on how this should be tackled? Would it be possible to form two equations?
Given an integer array A, check if there are two indices i and j such that A[j] = 2∗A[i]. For example, on the array (25, 13, 16, 7, 8) the algorithm should output “true” (since 16 = 2 * 8), whereas on the array (25, 17, 44, 24) the algorithm should output “false”. Describe an algorithm for this problem with worst-case running time that is better than O(n^2), where n is the length of A.
Thanks!
This is a great spot to use a hash table. Create a hash table and enter each number in the array into the hash table. Then, iterate across the array one more time and check whether 2*A[i] exists in the hash table for each i. If so, then you know a pair of indices exists with this property. If not, you know no such pair exists.
In expectation this takes O(n) time, since each of the n hash table operations takes expected, amortized O(1) time.
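A quick sketch of that approach (the method name is mine; note that a 0 in the array would match itself, so add a separate check if i and j must be distinct):

import java.util.HashSet;
import java.util.Set;

static boolean hasDoublePair(int[] a) {
    Set<Integer> values = new HashSet<>();
    for (int x : a) {
        values.add(x);                 // expected O(1) per insert
    }
    for (int x : a) {
        if (values.contains(2 * x)) {  // is some A[j] equal to 2 * A[i]?
            return true;
        }
    }
    return false;
}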
Hope this helps!
templatetypedef's suggestion to use a hash table is a good one. I want to explain a little more about why.
The key here is realizing that you are essentially searching for some value in a set. You have a set of numbers you are searching in (2 * each value in the input array), and a set of numbers you are searching for (each value in the input array). Your brute-force naive case is just looking up values directly in the search-in array. What you want to do is pre-load your "search-in" set into something with faster lookups than an array (like a hash table), then you can search from there.
You can also further prune your results by not searching for A[i] where A[i] is odd; because you know that A[i] = 2 * A[j] can never be true if A[i] is odd. You can also compute the minimum and maximum values in the "search-in" array on the fly during initialization and prune all A[i] outside that range.
The performance there is hard to express in big O form since it depends on the nature of the data, but you can calculate a best- and worst- case and an amortized case as well.
However, a proper choice of hash table size (if your value range is small, you can simply choose a capacity larger than your value range and use the value itself as the hash) may actually make pruning more costly than not pruning in some cases; you'd have to profile it to find out.
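Putting those pruning ideas together in a hedged sketch (names are mine; this builds the "search-in" set of doubled values, tracks its range, and skips odd values, with the same caveat about a lone 0 as in the earlier sketch):

import java.util.HashSet;
import java.util.Set;

static boolean hasDoubleWithPruning(int[] a) {
    Set<Integer> doubled = new HashSet<>();
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    for (int x : a) {
        int d = 2 * x;
        doubled.add(d);                    // the "search-in" set: 2 * each value
        min = Math.min(min, d);            // track its range on the fly
        max = Math.max(max, d);
    }
    for (int x : a) {
        if (x % 2 != 0) continue;          // an odd value can never be twice an integer
        if (x < min || x > max) continue;  // outside the range of the search-in set
        if (doubled.contains(x)) return true;
    }
    return false;
}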
The basics of my question: given a List object in Java, what's the fastest way to return a collection of just the unique data?
The more specific version, is that I have a 2d ArrayList (think of it like a table), and I want to loop through a given column index and return the unique data.
Here's my current setup:
public Set<Object> getDistinctColumnData( int colIndex ) {
    //dataByIndex = List<List<Object>>
    Set<Object> colDistinctData = new HashSet<Object>( dataByIndex.size() + 1, 1f ) ;
    for( List<Object> row : dataByIndex ) {
        colDistinctData.add( row.get( colIndex ) ) ;
    }
    return colDistinctData ;
}
I got a small performance gain when I set the initial capacity to the size of the non-distinct collection plus one and the load factor to 1 (my thinking was that it won't need to grow until it hits 100% full, and that shouldn't happen even if the original data is already 100% distinct. Or am I wrong?).
Is there a faster way?
I think it would be way faster if you just maintained two collections. Keep your dataByIndex list, but also maintain a dataSet collection (a Set). When you insert into your dataByIndex list, also put the value into your dataSet. Then just use dataSet wherever you need the unique values. The Set maintains uniqueness by nature of being a Set.
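One way to read that suggestion as code, a rough sketch with made-up class and field names:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Table {
    private final List<List<Object>> dataByIndex = new ArrayList<>();
    private final Map<Integer, Set<Object>> distinctByColumn = new HashMap<>();

    void addRow(List<Object> row) {
        dataByIndex.add(row);
        for (int col = 0; col < row.size(); col++) {   // keep the per-column sets in sync on insert
            distinctByColumn.computeIfAbsent(col, c -> new HashSet<>()).add(row.get(col));
        }
    }

    Set<Object> getDistinctColumnData(int colIndex) {
        return distinctByColumn.getOrDefault(colIndex, Collections.emptySet());  // no per-call scan
    }
}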
I don't think it makes much sense to set the capacity and load factor to the values you specified. What hashing function do you use? Maybe it degrades to a linked list?
You are likely to get a further performance increase (on average) if you increase the initial capacity of the HashSet even more. This is because the distribution of the hash values of the objects in your list may be such that collisions are more likely.
For instance, given the following list, all but the first insertion will result in a collision, despite there being no duplicate values. (the Java hash function for integers is the value of the integer itself, and HashSet uses open addressing and linear probing in case of a collision).
[0,10,1,2,3,4,5,6,7]
or even worse, because each insertion has to check every non-free slot before it can be inserted.
[0, 5, 25, 125]
In the last example, 0 gets put at index 0. 5 initially maps to index 0, since 5 % size (i.e. 5) equals 0, so it then goes to index 1. 125 would also go to index 0, but 0 is at index 0, 5 is at index 1 and 25 is at index 2, meaning that after three checks 125 can finally be inserted at index 3.
If you increase the initial capacity, this decreases the probability of collisions (on average) and decreases the number of checks required if a collision does occur (also on average). By default Java uses a load factor of 0.75 as a good balance between performance and memory usage. So dividing the expected number of elements by the 0.75 load factor and adding 1 should give you a good initial capacity.
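Applying that rule of thumb to the getDistinctColumnData method from the question would look something like this (a sketch; 0.75f is the default load factor):

int expected = dataByIndex.size();                                  // upper bound on the number of distinct values
Set<Object> colDistinctData = new HashSet<>((int) (expected / 0.75f) + 1);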