I'm new to using hash table structures. I'm using LinkedHashMap (ex: cache = new LinkedHashMap<K,V>(...)) to implement my own cache. I have a list of questions about this data structure:
I set the capacity parameter to 100 (for example), which I take to mean that the number of items in the map is limited to 100. If I then insert a new item into this cache (when the cache size is 100), am I correct in thinking the eviction policy will kick in?
In my implementation, keys are composite objects made up of two fields, like this:
class Key {
    public String a;
    public String b;

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((a == null) ? 0 : a.hashCode());
        result = prime * result + ((b == null) ? 0 : b.hashCode());
        return result;
    }
}
With this hashCode(), suppose the map already has 100 items. When I insert a new item, assuming that hashCode() returns the same value as for a previous item's key, my understanding is that LinkedHashMap will remove the eldest item using the eviction policy and use a linked list to handle the collision for the new item, so the number of items will be 99. Is that right?
Is there any way to identify which buckets currently contain a chain for handling collisions?
Answering question one:
You need to explicitly override the removeEldestEntry method to make eviction work.
The default implementation returns false, so it won't remove any element:
protected boolean removeEldestEntry(Map.Entry<K,V> eldest) {
return false;
}
Question two: Nothing will be removed in your case unless you override removeEldestEntry.
Question three: I don't think there is a way to do that.
Please read this article to become more familiar with eviction algorithms based on LinkedHashMap:
http://javarticles.com/2012/06/lru-cache.html
For further reading, see also LFU eviction: http://javarticles.com/2012/06/lfu-cache.html
I set a parameter capacity = 100 (e.g.), which means that the number of items in the bucket is limited to 100. Then if I insert a new item into this cache (when cache size = 100), the eviction policy will kick in, right?
No, the capacity parameter is a hint to the constructor of how large you expect the map to become. It uses this to attempt to avoid needlessly resizing the map as you add elements. If you add more than the specified capacity it will just resize the map to fit more elements efficiently.
When I insert a new item, assuming that hashCode() returns the same value as for one of the previous items, LinkedHashMap will remove the eldest item as its eviction policy and use a linked list to handle the collision for the new item, so the number of items in the bucket will be 99. Is that right?
No, if two non-equal elements are inserted with the same hash code they will simply be placed in the same bucket, but both will still exist and be accessible. Of course if you specify a key that is equal to a key that currently exists in the map, that entry will be overwritten.
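Both cases can be seen with the strings "Aa" and "BB", which are unequal but happen to share the same hash code. A minimal sketch (the class name CollisionDemo and its build() method are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CollisionDemo {
    // Builds a map containing two unequal keys that share a hash code.
    static Map<String, Integer> build() {
        Map<String, Integer> map = new LinkedHashMap<>(100);
        map.put("Aa", 1);
        map.put("BB", 2);   // same hash code as "Aa" but a different key: both entries are kept
        map.put("Aa", 3);   // equal key: the old value 1 is overwritten
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println(build());                            // {Aa=3, BB=2}
    }
}
```

Nothing is evicted in either case; the map simply ends up with two entries.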
Is there any way to identify which buckets currently contain a chain for handling collisions?
Generally no. You could use reflection, but that would be arduous at best. What are you trying to accomplish that makes you think you'd need to do this?
The caching behavior provided by LinkedHashMap depends on you extending the class and overriding removeEldestEntry(). As the example in that method's documentation shows, you can add a check such as size() > MAX_ENTRIES to instruct the map to remove the oldest element when put() or putAll() is called.
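As a rough sketch of that approach, assuming you want least-recently-used eviction (the class name LruCache and the maxEntries field are hypothetical, not part of the JDK):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-bounded cache that evicts the least recently used entry.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true orders entries by last access instead of by insertion
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put() and putAll() after a new entry has been inserted
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");      // "a" is now the most recently used entry
        cache.put("c", 3);   // exceeds the limit, so "b" is evicted
        System.out.println(cache.keySet()); // [a, c]
    }
}
```

Passing false (or using the one-argument constructor) instead would evict in insertion order, which gives a FIFO cache rather than an LRU one.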
If you need a more powerful cache you might like Guava's Cache and LoadingCache classes.
Capacity is not fixed. It will dynamically change based on the map usage.
From javadocs:
An instance of HashMap has two parameters that affect its
performance: initial capacity and load factor. The capacity is the
number of buckets in the hash table, and the initial capacity is
simply the capacity at the time the hash table is created. The load
factor is a measure of how full the hash table is allowed to get
before its capacity is automatically increased. When the number of
entries in the hash table exceeds the product of the load factor and
the current capacity, the hash table is rehashed (that is, internal
data structures are rebuilt) so that the hash table has approximately
twice the number of buckets.
So the map will not remove items based on the number of entries.
A simple ready-made cache is provided by the Guava library.
How do you calculate the complexity of the HashMap search algorithm? Everything I find by googling gives the result O(1), but I don't understand how they arrived at it.
HashMap works on the hashing principle. It is a data structure that allows you to store and retrieve data in O(1) time, provided you know the key.
In hashing, hash functions are used to link key and value in HashMap. Objects are stored by calling the put(key, value) method of HashMap and retrieved by calling the get(key) method. When we call put, the hashCode() method of the key object is called so that the hash function of the map can find a bucket location to store the value object; that location is actually an index into the internal array, known as the table. HashMap internally stores each mapping as a Map.Entry object, which contains both the key and the value. When you want to retrieve the object, you call the get() method and again pass the key object. The key object must generate the same hash code as before (this is why keys should not be mutated while they are in the map, and why immutable types such as String make good keys), so we end up at the same bucket location. If there is only one entry there and its key matches, its value is returned, and that is the value object you stored earlier. Things get a little tricky when collisions occur.
Collision: Since the internal array of HashMap has a finite size, if you keep storing objects, at some point the hash function will return the same bucket location for two different keys. This is called a collision in HashMap. In this case, a linked list is formed at that bucket location and the new entry is stored as the next node.
If we try to retrieve an object from this linked list, we need an extra check to find the correct value, and this is done with the equals() method. Since each node contains an entry, HashMap keeps comparing each entry's key object with the passed key using equals(), and when it returns true, the map returns the corresponding value.
Since searching a linked list is an O(n) operation, in the worst case hash collisions reduce a map to a linked list. Java 8 addressed this issue by converting a bucket's linked list into a balanced tree once the bucket grows large enough, so that searching it takes O(log n) time.
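The put/get logic described above can be sketched as a toy separate-chaining map. This is only an illustration under simplifying assumptions (fixed 16 buckets, no resizing, no tree conversion); the names ChainedMap, Node and indexFor are made up and this is not the actual HashMap source:

```java
import java.util.Objects;

// Toy separate-chaining map, for illustration only; not the real HashMap code.
class ChainedMap<K, V> {
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    private int indexFor(Object key) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(Objects.hashCode(key), table.length);
    }

    V put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) {   // same key: replace the value
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        table[i] = new Node<>(key, value, table[i]); // empty bucket or collision: prepend a node
        return null;
    }

    V get(Object key) {
        // Walk the chain in the bucket, using equals() to find the right entry
        for (Node<K, V> n = table[indexFor(key)]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) {
                return n.value;
            }
        }
        return null;
    }
}
```

With well-spread hash codes each chain holds at most a handful of nodes, which is where the O(1) average comes from; if every key landed in one bucket, get() would walk a single chain of length n.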
By the way, you can easily verify how HashMap works by looking at the code of HashMap.java in your Eclipse IDE if you are keenly interested in the code, otherwise the logic is explained above.
Information On Buckets : An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
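The resize rule in that paragraph is plain arithmetic, sketched below. The class and method names are hypothetical, and the sketch assumes the initial capacity is already a power of two (the real HashMap rounds it up to one); 16 and 0.75 are HashMap's documented defaults.

```java
class ResizeMath {
    // Returns the table capacity after inserting `entries` items, following the
    // documented rule: double the capacity once entries exceed capacity * loadFactor.
    static int capacityAfter(int entries, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        while (entries > capacity * loadFactor) {
            capacity *= 2;
        }
        return capacity;
    }

    public static void main(String[] args) {
        // With the defaults (16 buckets, load factor 0.75) the threshold is 12 entries
        System.out.println(capacityAfter(12, 16, 0.75f));  // 16
        System.out.println(capacityAfter(13, 16, 0.75f));  // 32
        System.out.println(capacityAfter(100, 16, 0.75f)); // 256
    }
}
```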
In Java.
How can I map a set of numbers(integers for example) to another set of numbers?
All the numbers are positive and all the numbers are unique in their own set.
The first set of numbers can have any value, the second set of numbers represent indexes of an array, and so the goal is to be able to access the numbers in the second set through the numbers in the first set. This is a one to one association.
Speed is crucial as the method will have to be called many times each second.
Edit: I tried it with SE hashmap implementation, but found it to be slow for my purposes.
There's an article devoted to this problem (with a solution): Implementing a world fastest Java int-to-int hash map
The code can be found in the related GitHub repository. (The best results are in class IntIntMap4a.java.)
Citation from the article:
Summary
If you want to optimize your hash map for speed, you have to do as much as you can of the following list:
Use underlying array(s) with capacity equal to a power of 2 - it will allow you to use cheap & instead of expensive % for array index
Do not store the state in the separate array - use dedicated fields for free/removed keys and values.
Interleave keys and values in the one array - it will allow you to load a value into memory for free.
Implement a strategy to get rid of 'removed' cells - you can sacrifice some of remove performance in favor of more frequent get/put.
Scramble the keys while calculating the initial cell index - this is required to deal with the case of consecutive keys.
Yes, I know how to use citation formatting. But it looks awful and doesn't handle bullet lists well.
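To make a few of those points concrete, here is a minimal sketch of an open-addressing int-to-int map. It is not the article's code: the class IntIntMap below, its NO_VALUE marker (which assumes stored values are non-negative, as array indexes are) and the scrambling constant are assumptions chosen for illustration, and it supports only put and get.

```java
// Hypothetical minimal int-to-int map with linear probing; supports put and get only.
class IntIntMap {
    private static final int FREE_KEY = 0;   // marks an empty cell
    static final int NO_VALUE = -1;          // returned for absent keys (assumes values >= 0)

    private int[] data;          // keys and values interleaved: [k0, v0, k1, v1, ...]
    private int mask;            // capacity - 1, valid because capacity is a power of two
    private int size;
    private boolean hasFreeKey;  // key 0 lives in dedicated fields, not in the array
    private int freeValue;

    IntIntMap(int capacityPowerOfTwo) {
        data = new int[capacityPowerOfTwo * 2];
        mask = capacityPowerOfTwo - 1;
    }

    private static int scramble(int key) {
        int h = key * 0x9E3779B9;    // odd multiplier spreads consecutive keys apart
        return h ^ (h >>> 16);
    }

    void put(int key, int value) {
        if (key == FREE_KEY) { hasFreeKey = true; freeValue = value; return; }
        if ((size + 1) * 2 > mask + 1) grow();            // keep the table at most half full
        int cell = scramble(key) & mask;                  // cheap & instead of %
        while (data[cell * 2] != FREE_KEY && data[cell * 2] != key) {
            cell = (cell + 1) & mask;                     // linear probing
        }
        if (data[cell * 2] == FREE_KEY) size++;
        data[cell * 2] = key;
        data[cell * 2 + 1] = value;                       // value sits right next to its key
    }

    int get(int key) {
        if (key == FREE_KEY) return hasFreeKey ? freeValue : NO_VALUE;
        int cell = scramble(key) & mask;
        while (data[cell * 2] != FREE_KEY) {
            if (data[cell * 2] == key) return data[cell * 2 + 1];
            cell = (cell + 1) & mask;
        }
        return NO_VALUE;
    }

    private void grow() {
        int[] old = data;
        data = new int[old.length * 2];
        mask = data.length / 2 - 1;
        size = 0;
        for (int i = 0; i < old.length; i += 2) {
            if (old[i] != FREE_KEY) put(old[i], old[i + 1]);
        }
    }

    public static void main(String[] args) {
        IntIntMap map = new IntIntMap(8);
        for (int i = 1; i <= 100; i++) map.put(i, i * 10);
        System.out.println(map.get(42));   // 420
        System.out.println(map.get(1000)); // -1
    }
}
```

Removal is left out on purpose, since handling 'removed' cells is the part of the list that needs the most care.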
The structure you are looking for is called an associative array. In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears just once in the collection.
In Java in particular, as already mentioned, this is easily done with a HashMap.
HashMap<Integer, Integer> cache = new HashMap<Integer, Integer>();
You can insert elements with the method put
cache.put(21, 42);
and you can retrieve a value with get
Integer key = 21;
Integer value = cache.get(key);
System.out.println("Key: " + key +" value: "+ value);
Key: 21 value: 42
If you want to iterate through the data, you can use an iterator over the key set:
Iterator<Integer> it = cache.keySet().iterator();
while (it.hasNext()) {
    Integer k = it.next();
    System.out.println("key: " + k + " value: " + cache.get(k));
}
Sounds like HashMap<Integer,Integer> is what you're looking for.
If you are willing to use an external library, you can use apache's IntToIntMap, which is a part of Apache Lucene.
It implements a pretty efficient int to int map that uses primitives for tasks that should not suffer the boxing overhead.
If you have a limit on the values in the first set, you can just use a large array. Suppose you know the first set only has numbers 0-99; then you can use int[100] and use the first number as an array index.
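A small sketch of that idea, assuming the keys are known to lie in 0-99 (the class name DirectLookup and the ABSENT marker are made up for illustration):

```java
import java.util.Arrays;

class DirectLookup {
    static final int ABSENT = -1;   // hypothetical marker for "no mapping"

    // Builds a table where table[key] is the mapped index; keys must lie in [0, maxKey].
    static int[] build(int maxKey, int[] keys, int[] indexes) {
        int[] table = new int[maxKey + 1];
        Arrays.fill(table, ABSENT);
        for (int i = 0; i < keys.length; i++) {
            table[keys[i]] = indexes[i];
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = build(99, new int[] {21, 7, 42}, new int[] {0, 1, 2});
        System.out.println(table[21]); // 0
        System.out.println(table[42]); // 2
        System.out.println(table[5]);  // -1
    }
}
```

A lookup is then a single array read with no hashing and no boxing, at the cost of memory proportional to the largest key.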
Your requirements can be satisfied by the Map interface. As an example, see HashMap<K,V>.
See Map and HashMap
While debugging I found a strange behavior.
I got a HashMap<Integer, Set<Term>> (Term is a class which only contains a String) the normal toString() shows this:
When I click the table property of the HashMap I get this:
My question now is: why are there null values in the table's toString()?
Edit: Thanks for your fast answers! If I could, I would accept all of them...
HashMap is a Map implementation whose crucial feature is constant-time, O(1), lookup.
The only data structure in computer science with constant time lookup is an array of fixed length. When you initialise the HashMap it's creating a fixed length array that it will expand when your entries exceed the current array's size.
Edit: @kutschkem has pointed out that java.util.HashMap expands its fixed length array when the number of entries is around 75% of the current array's size (the default load factor is 0.75), rather than when the entries exceed the current array's size.
Because the Map implementation you are using is working with a starting set of HashBuckets some of which are NULL at beginning (determined by initialCapacity). If you exceed the number of entries it will start creating more HashBuckets / slots for your Objects. Think of this as a growth reserve the HashMap automatically creates for you.
Read more:
https://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html
The HashMap stores its entries in a hashtable. That is an array, and the hash function maps the key to one of the array entries (also called hash buckets).
With the default load factor of 0.75, at least 25% of the hash buckets are always empty. If they are not, then the array is resized to make sure there is enough free space.
The reason is that as the hash table gets filled up, collisions between hashes get more and more likely. You lose all advantages of the HashMap if collisions are too frequent. Too full, and your HashMap would be no better than a LinkedList (yes, LinkedList, not ArrayList). It would probably be even worse.
That is how a hash map works: there is a large array (the table), and for a given key the following table entry is tried:
table[key.hashCode() % table.length]
That table slot is then used. If the slot already holds a key that is not equals(key), the collision has to be resolved; java.util.HashMap does this by chaining the entries within the same slot.
So initially the table contains only nulls, and has size initialCapacity. The array is grown when the hash map becomes too full (loadFactor).
The HashMap uses internally an array to store the entries. Very much simplified, it does something like array_index = hashcode % array_length (again: very simplified, as it also needs to take care of hash collisions etc). This internal array is typically larger than the number of elements you store in the HashMap -- otherwise, the array would have to be resized every time you add an element to it. So what you see as null are the yet unused slots in the array.
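The index calculation and the resulting null slots can be sketched as follows. The class BucketIndex and its method names are hypothetical; the second method mirrors the hash-spreading and masking step that java.util.HashMap has used since Java 8, while the table here is just a plain Object[] standing in for the real one.

```java
import java.util.Arrays;

class BucketIndex {
    // Simplified form from the text; floorMod avoids negative indexes for negative hash codes.
    static int simpleIndex(Object key, int tableLength) {
        return Math.floorMod(key.hashCode(), tableLength);
    }

    // Mixes the high bits into the low bits, then masks with (length - 1),
    // which works because the table length is a power of two.
    static int hashMapStyleIndex(Object key, int tableLength) {
        int h = key.hashCode();
        return (h ^ (h >>> 16)) & (tableLength - 1);
    }

    public static void main(String[] args) {
        Object[] table = new Object[16];      // starts out as all nulls
        for (int key : new int[] {3, 20, 37}) {
            table[hashMapStyleIndex(key, table.length)] = key;
        }
        // Only slots 3, 4 and 5 are filled; the other 13 print as null
        System.out.println(Arrays.toString(table));
    }
}
```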
This is normal behavior.
There are null values because the table array was initialized as being filled with nulls, and uses null to indicate that there are no values stored in that hash bucket.
The toString() function provided doesn't skip over them because seeing them was useful to the folks debugging the HashMap implementation.
If you want to see the contents without the nulls, you'll have to write your own display function, either by subclassing HashMap and overriding toString() or by providing a convenience function somewhere in your code.
I simulate a cache replacement algorithm. I have a cache implemented as an arraylist that fills in with the requested items (objects of custom-type Request). These objects are identified by their unique numeric ID (reqID) - for N items, an integer between 1 and N. I insert items in the cache from the start of the cache arraylist and evict them from the end. That is,
// insertion
this.cache.add(0, item);
// eviction
this.cache.remove(this.cache.size()-1);
I maintain the cache arraylist sorted according to the score of the items which I keep in a HashMap reqID (K) - score (V).
At some point in my code I have to check for cache reordering due to a cache hit (since the score of that cached item is increased by 1). If required, in order to keep the cache ordered, the requested item and the item at its left in the cache arraylist will exchange position in the cache.
Therefore, I have to know the index of the requested item in the cache. I can use the following:
int ind = this.cache.indexOf(request);
and therefore the index of the next item at its left will be:
int indLeft = ind - 1;
and thus I will easily be able to exchange their positions in the cache arraylist.
However, I would really like to avoid using indexOf since it makes use internally of a for loop. Hence, I will have to store the positions of the items in some data structure. What kind of data structure would you suggest? My first thought was a HashMap positions (reqID - index) due to constant complexity of the operations that I care about, but then I noticed that each time I insert an item in the cache, the index of the other items is increased by 1. How could I possibly increase by 1 the value of all the previous keys (reqIDs) in that positions map each time I put a new key-value pair without using a for loop? Thus, probably some other data structure is required. Or some other idea...
You can keep your arraylist, but to more efficiently find, replace, and remove, you can use 2 dimensional arraylist if you are incorporating blocks and then a hashtable to index the elements.
So anytime you get a "hit", you can grab the key and replace it from your arraylist. This method requires no loops.
Create a CacheItem class to wrap your key, value and score.
public class CacheItem {
private String key;
private Object value;
private int score;
//getters and setters
}
Then implement the cache pool with HashMap:
//cache pool structure
Map<String, CacheItem> cache = new HashMap<String, CacheItem>();
CacheItem item = cache.get("somekey");
if(item != null) { //hit
item.setScore(item.getScore() + 1); //incr score
}
But when you want to check the score of each cache item, you must use a loop.
I was reading the Java api docs on Hashtable class and came across several questions. In the doc, it says "Note that the hash table is open: in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially. " I tried the following code myself
Hashtable<String, Integer> me = new Hashtable<String, Integer>();
me.put("one", new Integer(1));
me.put("two", new Integer(2));
me.put("two", new Integer(3));
System.out.println(me.get("one"));
System.out.println(me.get("two"));
The output was
1
3
Is this what it means by "open"?
What happened to the Integer 2? Was it collected as garbage?
Is there a "closed" example?
No, this is not what is meant by "open".
Note the difference between a key collision and a hash collision.
The Hashtable will not allow more than one entry with the same key (as in your example, you put two entries with the key "two", the second one (3) replaced the first one (2), and you were left with only the second one in the Hashtable).
A hash collision is when two different keys have the same hashcode (as returned by their hashCode() method). Different hash table implementations could treat this in different ways, mostly in terms of low-level implementation. Being "open", Hashtable will store a linked list of entries whose keys hash to the same value. This can cause, in the worst case, O(N) performance for simple operations, that normally would be O(1) in a hash map where the hashes mostly were different values.
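As for what happens to the replaced value: put() returns the previous value for the key, and once nothing else references that object it becomes eligible for garbage collection. A short sketch based on the question's code (the class name KeyReplacementDemo is made up):

```java
import java.util.Hashtable;

class KeyReplacementDemo {
    // Returns the value that the second put("two", ...) displaced.
    static Integer replaced() {
        Hashtable<String, Integer> me = new Hashtable<>();
        me.put("one", 1);
        me.put("two", 2);
        return me.put("two", 3);   // put returns the previous value for the key, here 2
    }

    public static void main(String[] args) {
        System.out.println(replaced()); // 2
    }
}
```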
It means that two items with different keys that have the same hashcode end up in the same bucket.
In your case the keys "two" are the same and so the second put overwrites the first one.
But assuming that you have your own class
class Thingy {
    private final String name;

    public Thingy(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        ...
    }

    @Override
    public int hashCode() {
        // not the world's best idea
        return 1;
    }
}
And created multiple instances of it. i.e.
Thingy a = new Thingy("a");
Thingy b = new Thingy("b");
Thingy c = new Thingy("c");
And inserted them into a map. Then one bucket i.e. the bucket containing the stuff with hashcode 1 will contain a list (chain) of the three items.
Map<Thingy, Thingy> map = new HashMap<Thingy, Thingy>();
map.put(a, a);
map.put(b, b);
map.put(c, c);
So getting an item by any Thingy key would result in a lookup of the hashcode O(1) followed by a linear search O(n) on the list of items in the bucket with hashcode 1.
Also be careful to obey the correct relationship when implementing hashCode and equals. Namely, if two objects are equal then they must have the same hash code, but not necessarily the other way round, as multiple unequal keys are likely to get the same hash code.
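Because the equals() body above is elided, here is a self-contained stand-in that shows the effect. The class CollidingKey and its name-based equals() are assumptions made for the sake of a runnable example:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class CollidingKey {
    private final String name;

    CollidingKey(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CollidingKey && Objects.equals(name, ((CollidingKey) o).name);
    }

    @Override
    public int hashCode() {
        return 1;   // deliberately terrible: every key lands in the same bucket
    }

    static Map<CollidingKey, String> build() {
        Map<CollidingKey, String> map = new HashMap<>();
        map.put(new CollidingKey("a"), "first");
        map.put(new CollidingKey("b"), "second");
        map.put(new CollidingKey("c"), "third");
        return map;
    }

    public static void main(String[] args) {
        Map<CollidingKey, String> map = build();
        System.out.println(map.size());                        // 3
        System.out.println(map.get(new CollidingKey("b")));    // second
    }
}
```

All three entries survive and remain retrievable; the constant hash code only costs speed, because every lookup has to compare keys with equals() inside the single shared bucket.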
Oh and for the full definitions of Open hashing and Closed hash tables look here http://www.c2.com/cgi/wiki?HashTable
Open means that if two keys are not equal, but have the same hash value, then they will be stored in the same "bucket". In this case, you can think of each bucket as a linked list, so if many things are stored in the same bucket, search performance will decrease.
Bucket 0: Nothing
Bucket 1: Item 1
Bucket 2: Item 2 -> Item 3
Bucket 3: Nothing
Bucket 4: Item 4
In this case, if you search for a key that hashes to bucket 2, you have to then perform an O(n) search on the list to find the key that equals what you're searching for. If the key hashes to Bucket 0, 1, 3, or 4, then you get an O(1) search performance.
It means that Hashtable uses open hashing (also known as separate chaining) to deal with hash collisions. If two separate keys have the same hashcode, both of them will be stored in the same bucket (in a list).
A hash is a computed function that maps one object ("one" or "two" in your sample) to (in this case) an integer. This means that there may be multiple values that map to the same integer ( an integer has a finite number of permitted values while there may be an infinite number of inputs) . In this case "equals" must be able to tell these two apart. So your code example is correct, but there may be some other key that has the same hashcode (and will be put in the same bucket as "two")
Warning: there are contradictory definitions of "open hashing" in common usage:
Quoting from http://www.c2.com/cgi/wiki?HashTable cited in another answer:
Caution: some people use the term
"open hashing" to mean what I've
called "closed hashing" here! The
usage here is in accordance with that
in TheArtOfComputerProgramming and
IntroductionToAlgorithms, both of
which are recommended references if
you want to know more about hash
tables.
For example, the above page defines "open hashing" as follows:
There are two main strategies. Open
hashing, also called open addressing,
says: when the table entry you need
for a new key/value pair is already
occupied, find another unused entry
somehow and put it there. Closed
hashing says: each entry in the table
is a secondary data structure (usually
a linked list, but there are other
possibilities) containing the actual
data, and this data structure can be
extended without limit.
By contrast, the definition supplied by Wikipedia is:
In the strategy known as separate
chaining, direct chaining, or simply
chaining, each slot of the bucket
array is a pointer to a linked list
that contains the key-value pairs that
hashed to the same location. Lookup
requires scanning the list for an
entry with the given key. Insertion
requires appending a new entry record
to either end of the list in the
hashed slot. Deletion requires
searching the list and removing the
element. (The technique is also called
open hashing or closed addressing,
which should not be confused with
'open addressing' or 'closed
hashing'.)
If so-called "experts" cannot agree what the term "open hashing" means, it is best to avoid using it.