HashSet iteration - java

I have a query regarding the iterator of HashSet in Java.
In the book "Java Generics and Collections", the following is stated:
The chief attraction of a hash table implementation for sets is the (ideally) constant-time
performance for the basic operations of add, remove, contains, and size. Its main
disadvantage is its iteration performance; since iterating through the table involves
examining every bucket, its cost is proportional to the table size regardless of the size
of the set it contains.
It states that the iterator looks in every bucket of the underlying table. But going through the actual implementation (JDK 8), I see that HashIterator stores a reference to the next node, so it seems the iterator doesn't need to visit every single bucket.
Is the book wrong here, or is my understanding wrong?

The book is right. Although KeyIterator indeed just calls nextNode().key, like this:
final class KeyIterator extends HashIterator implements Iterator<K> {
    public final K next() {
        return nextNode().key;
    }
}
the code for nextNode() in the base class HashIterator has the loop that the documentation is talking about:
final Node<K,V> nextNode() {
    Node<K,V>[] t;
    Node<K,V> e = next;
    if (modCount != expectedModCount)
        throw new ConcurrentModificationException();
    if (e == null)
        throw new NoSuchElementException();
    if ((next = (current = e).next) == null && (t = table) != null) {
        do {} while (index < t.length && (next = t[index++]) == null);
    }
    return e;
}
The do/while loop with an empty body traverses the buckets one by one, looking for the next entry.
The only time this may be relevant is when you iterate over a hash set which you pre-allocated with a large number of buckets, but have not populated with a large number of items yet. When you let your HashSet grow by itself as you add more items, the number of buckets will be proportional to the number of items that you inserted so far, so the slowdown would not be significant.
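To make that edge case concrete, here's a rough, self-contained sketch of my own (not from the book or the JDK; the timing is crude and only illustrative): a nearly empty set that was pre-sized with millions of buckets iterates much more slowly than a default-sized set holding the same ten elements, because the iterator still scans every empty bucket.

import java.util.HashSet;
import java.util.Set;

public class SparseIterationDemo {
    public static void main(String[] args) {
        // Same 10 elements, but one set was pre-allocated with ~4 million buckets.
        Set<Integer> small = new HashSet<>();
        Set<Integer> sparse = new HashSet<>(1 << 22);
        for (int i = 0; i < 10; i++) {
            small.add(i);
            sparse.add(i);
        }

        // Crude timing; the sparse set's iterator has to walk millions of empty buckets.
        long sink = 0;
        long t0 = System.nanoTime();
        for (int x : small) sink += x;
        long t1 = System.nanoTime();
        for (int x : sparse) sink += x;
        long t2 = System.nanoTime();

        System.out.println("default capacity:  " + (t1 - t0) + " ns");
        System.out.println("pre-sized (2^22):  " + (t2 - t1) + " ns");
        System.out.println("(sink=" + sink + ")"); // keep the loops from being optimized away
    }
}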

Detail the big-O of Hashmap - put() method by real code in Java 8 [duplicate]

I'm a novice at algorithms. I have read that the big-O of put(K key, V value) in HashMap is O(1).
But when I looked at the core of the HashMap class:
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        //...
    if ((p = tab[i = (n - 1) & hash]) == null)
        //...
    else {
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        else if (p instanceof TreeNode)
            // ...
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    // ...
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key
            // ...
        }
    }
    ...
    return null;
}
As you can see, when adding a new item to the HashMap, it can iterate over up to n items (all the items in the HashMap) with the for loop above:
for (int binCount = 0; ; ++binCount) {
Now, the big-O of the for loop here is O(n), so why can the big-O of put(K key, V value) in HashMap be O(1)?
Where is my understanding wrong?
Thanks very much.
The HashMap is actually a collection (backed by an array) of buckets, which are in turn backed by a red-black tree once they grow large (as of Java 8). If you have a very poor hash function that puts all the elements into the same bin, then performance degrades to O(log(n)).
From Baeldung,
HashMap has O(1) complexity, or constant-time complexity, of putting and getting the elements. Of course, lots of collisions could degrade the performance to O(log(n)) time complexity in the worst case, when all elements land in a single bucket. This is usually solved by providing a good hash function with a uniform distribution.
From the docs,
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
The idea is that a single bin of the hash table has an expected constant number of elements. Therefore the loop you mentioned would run in O(1) expected time (assuming the hashCode() of your keys is not terrible).
Specifically, for the HashMap implementation, there's a load factor whose value is 0.75 by default. This means that on average, each bin of the HashMap should have <= 0.75 elements (once there are more than load-factor * number-of-bins entries in the HashMap, the number-of-bins is doubled in order to maintain this invariant). Therefore, the mentioned loop should have a single iteration on average.
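To make that invariant concrete, here's a tiny back-of-the-envelope sketch of my own (using the default initial capacity of 16 and the default load factor of 0.75 mentioned above) of how the bucket count doubles as entries are added:

// Illustrative only: mimics HashMap's resize rule (size > capacity * loadFactor)
// with the default initial capacity of 16 and load factor of 0.75.
public class ResizeSketch {
    public static void main(String[] args) {
        int capacity = 16;
        double loadFactor = 0.75;
        for (int size = 1; size <= 100; size++) {
            if (size > capacity * loadFactor) {
                capacity *= 2; // the real HashMap rehashes into a table twice as large
            }
        }
        // 100 entries end up spread over 256 buckets, so the average chain
        // length stays below one entry per bucket.
        System.out.println("buckets: " + capacity); // buckets: 256
    }
}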
Big-O notation has to do with the performance of a function relative to the number of elements it operates on. Just having a for loop doesn't suddenly make a HashMap lookup's performance grow by a fixed amount for every element within the HashMap.
There are patterns in Big-O notation. A loop across the entire set of elements is O(n), but a loop in general doesn't mean the lookup is O(n). To illustrate, I'll use the (silly) example below.
Function with O(1) performance
public int add_o_one(int x, int y) {
    return x + y;
}
Function with O(1) performance, with loops added
public int add_still_o_one(int x, int y) {
    int[] numbers = new int[2];
    numbers[0] = x;
    numbers[1] = y;
    int result = 0;
    for (int index = 0; index < numbers.length; index++) {
        result += numbers[index];
    }
    return result;
}
While I would expect the latter one to be a bit less efficient, there's no way to alter its run-time speed by choosing different (or more) numbers. Therefore, it is still O(1).
In your HashMap, the looping over bucket lists does alter the speed relative to the input, but it doesn't alter it by a constant amount for each element. Also, most hash map implementations grow their bucket count as the map fills. This means that what you are seeing is not the common case, and is likely to be hit seldom (unless you implement a really bad hashCode).
While you might consider that "something larger than O(1)", it is very difficult to make the code behave in a way that is inconsistent with O(1) unless you specifically break the algorithm (by providing objects that all hash to the same value, for example).
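For example, a key class like the hypothetical BadKey below (my own illustration, not from the question's code) hashes every instance to the same value, which is exactly the degenerate case described above:

import java.util.HashMap;
import java.util.Map;

public class WorstCaseDemo {
    // Hypothetical key whose hashCode is deliberately constant, so every
    // instance collides into the same bucket.
    static final class BadKey {
        private final String name;
        BadKey(String name) { this.name = name; }

        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        // All 100_000 entries land in one bucket; each put has to walk the
        // whole chain (or tree, in Java 8) instead of finishing in O(1).
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey("key" + i), i);
        }
        System.out.println(map.size()); // 100000
    }
}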

Java - Array index out of range - Vector?

I'm testing a method that adds a linked list of hash pairs inside a vector. However, I'm running into an IndexOutOfBoundsException and I'm having trouble understanding where the problem lies.
import java.util.*;

class HashPair<K, E> {
    K key;
    E element;
}

public class Test4<K, E> {
    private Vector<LinkedList<HashPair<K, E>>> table;

    public Test4(int tableSize) {
        if (tableSize <= 0)
            throw new IllegalArgumentException("Table Size must be positive");
        table = new Vector<LinkedList<HashPair<K, E>>>(tableSize);
    }

    public E put(K key, E element) {
        if (key == null || element == null)
            throw new NullPointerException("Key or element is null");
        int i = hash(key);
        LinkedList<HashPair<K, E>> onelist = table.get(i);
        ListIterator<HashPair<K, E>> cursor = onelist.listIterator();
        HashPair<K, E> pair;
        E answer = null;
        while (cursor.hasNext()) {
            pair = cursor.next();
            if (pair.key.equals(key)) {
                answer = pair.element;
                pair.element = element;
                return answer;
            }
        }
        pair = new HashPair<K, E>();
        pair.key = key;
        pair.element = element;
        onelist.addFirst(pair);
        return answer;
    }

    private int hash(K key) {
        return Math.abs(key.hashCode() % table.capacity());
    }

    public static void main(String[] args) {
        Test4<Integer, Integer> obj = new Test4<Integer, Integer>(10);
        obj.put(0, 10);
    }
}
My compiler says that the problem is here:
LinkedList<HashPair<K, E>> onelist = table.get(i);
From what I understand, I'm trying to get the table entry at index i, which is a hash value generated from the hash(K key) method. So, in my main method, if I set the key to 0 as an example, why is the index out of range?
Here is the exception
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 0
    at java.util.Vector.get(Vector.java:748)
    at Test4.put(Test4.java:24)
    at Test4.main(Test4.java:55)
The problem here is that you are considering the capacity of a vector to be the number of elements in the vector. This is not what capacity of a collection represents.
The capacity of a collection in the standard Java libraries is the size of the internal array used by that collection. The number of elements in the collection, however, is represented by size.
Whenever an element is added to/removed from such a collection, the size property is modified. This does not affect the capacity of the collection unless the internal array needs to be resized.
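A quick way to see the difference (a small snippet of my own, not part of the question's code) is to compare capacity() and size() on a freshly constructed Vector:

import java.util.Vector;

public class CapacityVsSize {
    public static void main(String[] args) {
        Vector<String> v = new Vector<>(10);
        // The backing array has room for 10 elements...
        System.out.println(v.capacity()); // 10
        // ...but nothing has been added yet, so v.get(0) would throw.
        System.out.println(v.size());     // 0
    }
}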
The solution: modify hash() to the following:
private int hash(K key) {
    return Math.abs(key.hashCode() % table.size());
}
And make sure that the table vector contains at least one element before calling hash and table.get.
I presume that you are creating an implementation of a HashMap with buckets. If you are, then ponder this: How can you go about storing a value in a bucket if there aren't any buckets? You need to have at least one bucket before trying to get a bucket.
It seems your code is failing at Test4.java:24, which is:
LinkedList<HashPair<K, E>> onelist = table.get(i);
The description Array index out of range: 0 means you're trying to get an object at slot 0 when there is no such slot available at the time. In short: your vector is empty. And by looking at your code, the reason becomes pretty evident. The only treatment this Vector called table receives before Test4.put() is called comes down to this, at line 15:
table = new Vector<LinkedList<HashPair<K, E>>>(tableSize);
So, yes, you're properly creating an object and initializing a variable, and you are even specifying an initial capacity, but you never added anything to your brand new Vector, and both lists and vectors need to be filled manually with stuff first. Keep in mind that this "capacity" refers to how much stuff this Vector is supposed to hold without needing to resize the array it uses internally.
It gives me the impression you are trying to create a class whose objects behave like HashMaps, but I can't wrap my mind around the need for a Vector of LinkedLists of HashPairs when a single collection of HashPairs should be enough unless... wait, what is that hash() method doing? Oh... ohh... oh, I see what you did there.
So, right, the solution. As your Vector is properly created but empty, you need to fill it with whatever it is supposed to hold. In this case, it holds LinkedLists of HashPairs, so let's fill it with just enough of them to cover the capacity you set through the constructor. This modification to the constructor should do the trick:
public Test4(int tableSize) {
    if (tableSize <= 0)
        throw new IllegalArgumentException("Table Size must be positive");
    table = new Vector<LinkedList<HashPair<K, E>>>(tableSize);
    // Prepare the fast lookup table (at least that's what I think it could be called)
    for (int i = 0; i < tableSize; i++) {
        table.add(new LinkedList<HashPair<K, E>>());
    }
}
And that's pretty much it. I even tested it here just to be sure it worked fine after my patch.
Hope this helps you.
PS: Splitting your structure into n pieces to speed up search/store? I like the idea.

Best way to keep track of maximum 5 values found while parsing a stream in Java

I'm parsing a large file, line by line, reading substrings in each line. I will obtain an integer value from each substring, ~30 per line, and need to return the highest 5 values from the file. What data structure will be the most efficient for keeping track of the 5 largest values while going through?
This problem is usually solved with a heap, but (perhaps counter-intuitively) you use a min-heap (the smallest element is the "top" of the heap).
The algorithm is basically this:
for each item parsed
    if the heap contains less than n items,
        add the new item to the heap
    else
        if the new item is "greater" than the "smallest" item in the heap
            remove the smallest item and replace it with the new item
When you are done, you can pop the elements off the heap from least to greatest.
Or, concretely:
static <T extends Comparable<T>> List<T> top(Iterable<? extends T> items, int k) {
    if (k < 0) throw new IllegalArgumentException();
    if (k == 0) return Collections.emptyList();
    PriorityQueue<T> top = new PriorityQueue<>(k);
    for (T item : items) {
        if (top.size() < k) top.add(item);
        else if (item.compareTo(top.peek()) > 0) {
            top.remove();
            top.add(item);
        }
    }
    List<T> hits = new ArrayList<>(top.size());
    while (!top.isEmpty())
        hits.add(top.remove());
    Collections.reverse(hits);
    return hits;
}
You can compare the new item to the top of the heap efficiently, and you don't need to keep all of the elements strictly ordered all the time, so this is faster than a completely ordered collection like a TreeSet.
For a very short list of five elements, iterating over an array may be faster. But if the size of the "top hits" collection grows, this heap-based method will win out.
I would use a TreeSet (basically a sorted set), where you drop the first (lowest) element each time you add to the already-full set. Note that this will discard duplicates.
SortedSet<Integer> set = new TreeSet<>();
for (...) {
    ...
    if (set.size() < 5) {
        set.add(num);
    } else if (num > set.first()) {
        set.remove(set.first());
        set.add(num);
    }
}
You could use a LinkedList, inserting in sorted order. For each new int, you would check the end to make sure it belongs among the maximum values. If it does, iterate in descending order, and where newInt > the node's int, insert the new int there, then removeLast() to maintain the length of 5.
An array also works, but you'll have to shift elements yourself; a rough sketch follows below.
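Here is a sketch of that array approach (my own, assuming duplicates are allowed and the array is seeded with Integer.MIN_VALUE so any real value qualifies at first):

import java.util.Arrays;

public class Top5Array {
    // top5 is kept sorted ascending, so top5[0] is the smallest of the
    // current best five.
    static void offer(int[] top5, int value) {
        if (value <= top5[0]) return;              // not in the top five
        int i = 0;
        while (i + 1 < top5.length && value > top5[i + 1]) {
            top5[i] = top5[i + 1];                 // shift smaller values down
            i++;
        }
        top5[i] = value;                           // insert at its sorted slot
    }

    public static void main(String[] args) {
        int[] top5 = new int[5];
        Arrays.fill(top5, Integer.MIN_VALUE);      // seed so any value qualifies
        for (int v : new int[]{7, 3, 9, 1, 12, 5, 8, 10}) {
            offer(top5, v);
        }
        System.out.println(Arrays.toString(top5)); // [7, 8, 9, 10, 12]
    }
}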
The Guava library has an Ordering.greatestOf method that returns the greatest K elements from an Iterable in O(N) time and O(K) space.
The implementation is in a package-private TopKSelector class.
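If Guava is already on your classpath, using it is essentially a one-liner. A small usage sketch (greatestOf returns the k largest elements in descending order):

import com.google.common.collect.Ordering;
import java.util.Arrays;
import java.util.List;

public class GuavaTop5 {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(7, 3, 9, 1, 12, 5, 8, 10);
        // Runs in O(N) time and O(k) space, as noted above.
        List<Integer> top5 = Ordering.<Integer>natural().greatestOf(values, 5);
        System.out.println(top5); // [12, 10, 9, 8, 7]
    }
}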

Removing element from list in counted loop vs iterator [duplicate]

Why is this legal:
for (int i = 0; i < arr.size(); i++) {
    arr.remove(i);
}
But using an iterator, or the syntactic sugar of a for-each loop, results in a ConcurrentModificationException:
for (String myString : arr) {
    arr.remove(myString);
}
Before everyone starts jumping on the bandwagon telling me to use iterator.remove(): I'm asking why the behavior differs, not how to avoid the ConcurrentModificationException. Thanks.
Let's take a look at how, e.g., ArrayLists's iterator is implemented:
private class Itr implements Iterator<E> {
    int cursor;       // index of next element to return
    int lastRet = -1; // index of last element returned; -1 if no such

    public E next() {
        checkForComodification();
        int i = cursor;
        if (i >= size) throw new NoSuchElementException();
        // ...
        cursor = i + 1;
        return (E) elementData[lastRet = i];
    }

    public void remove() {
        // ...
        ArrayList.this.remove(lastRet);
        // ...
        cursor = lastRet;
        lastRet = -1;
    }
}
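The elided checkForComodification() is essentially the fail-fast check below (paraphrased from the same JDK source):

final void checkForComodification() {
    // The list increments modCount on every structural change; the iterator
    // captured expectedModCount when it was created, so any mismatch means
    // the list was modified behind the iterator's back.
    if (modCount != expectedModCount)
        throw new ConcurrentModificationException();
}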
Let's look at an example:
List<Integer> list = new ArrayList<>(Arrays.asList(1, 2, 3, 4));
Iterator<Integer> it = list.iterator();
Integer item = it.next();
We remove the first element
list.remove(0);
If we want to call it.remove() now, the iterator would remove number 2 because that's what field lastRet points to now.
if (item == 1) {
    it.remove(); // list contains 3, 4
}
This would be incorrect behavior! The contract of the iterator states that remove() deletes the last element returned by next() but it couldn't hold its contract in the presence of concurrent modifications. Therefore it chooses to be on the safe side and throw an exception.
The situation may be even more complex for other collections. If you modify a HashMap, it may grow or shrink as needed. At that time, elements would fall to different buckets and an iterator keeping pointer to a bucket before rehashing would be completely lost.
Notice that iterator.remove() doesn't throw an exception by itself because it is able to update both its own internal state and the collection. Calling remove() on two iterators of the same collection instance would throw, however, because it would leave one of the iterators in an inconsistent state.
Looking at your code, I am assuming arr is a List. In the top loop you operate on the list directly, and "re-calibrate" your condition at the top when you check
i < arr.size()
So if you remove an element, i is compared against a smaller size on the next iteration.
On the other hand, in the second case you operate on the collection after an iterator has been instantiated, and don't really re-calibrate yourself.
Hope this helps.
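As a small illustration of that re-calibration point (my own snippet, not from the answers above): the index-based loop never trips the fail-fast check, but it silently skips the elements that slide into the removed slots.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IndexedRemovalDemo {
    public static void main(String[] args) {
        List<String> arr = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        for (int i = 0; i < arr.size(); i++) {
            arr.remove(i); // removes "a", then "c"; "b" and "d" shift into
                           // the removed slots and are never visited
        }
        System.out.println(arr); // [b, d]
    }
}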
In the first one you are modifying a list that is not being accessed through an iterator in your for loop.
In the second one you are iterating over the list with an iterator while modifying it at the same time, which is why it throws a ConcurrentModificationException.

Internal structure of Hashmap

I was going through the internal structure of HashMap and got stuck on the concept of how a bucket stores multiple objects.
As HashMap is an array of Entry objects, every index of the array is a bucket. The Entry class looks like this:
static class Entry<K,V> implements Map.Entry<K,V> {
    K key;
    V value;
    Entry<K,V> next_entry;
    int hash;
}
On adding a new key-value pair:
1. If we are adding a value with a key which has already been saved in the HashMap, then the value gets overwritten.
2. Otherwise the element is added to the bucket. If the bucket already has at least one element, the new one gets added and placed in the first position in the bucket; its next field refers to the old element.
So how can a bucket store multiple objects, as per the 2nd point?
This is about HashMap in Oracle JDK 1.7.0.55.
Creating a new entry is done through:
void createEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<>(hash, key, value, e);
    size++;
}
which clearly shows that the already existing element is stored as the next element of the new element. So the array contains the buckets, and the buckets themselves are singly linked lists made up of Entry elements.
When a get operation is performed, this singly linked list is iterated over, as can be seen in the for loop of getEntry (comment by me):
final Entry<K,V> getEntry(Object key) {
    if (size == 0) {
        return null;
    }
    int hash = (key == null) ? 0 : hash(key);
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) { // <- see here
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}
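For reference, the indexFor call used above just masks the hash into the table's range; it is essentially the following (from the same JDK 1.7 source):

// Returns the bucket index for a hash; this works because table.length is
// always a power of two, so the mask keeps only the low-order bits.
static int indexFor(int h, int length) {
    return h & (length - 1);
}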
So the Entry elements of the singly linked list all have keys with the same hash. But by the contract of hashCode, objects that are not equal to each other may still have the same hash code. So the key.equals(k) check in the above for loop will not necessarily be true in the first iteration, and the loop may have to be traversed until the end of the linked list.
Java's HashMap uses a linked list for its buckets (but not a java.util.LinkedList). If a class hard-codes the hashCode() method to a single value, then for instances of such a class loaded into a HashMap, the structure degenerates into a linked list. You override equals() to support replacement within the "bucket".
HashMap maintains an array, Entry[]; each element of that array represents a "bucket". The remainder of the entries in the bucket are accessed by traversing a linked list maintained by Entry.next.
