I am looking for verification of two different but related arguments: those above (A) and below (B) the first comment line (//====) in this question.
(A) The way HashMap is structured is:
A HashMap is a plain table; that's direct memory access (DMA).
The whole idea behind HashMap (or hashing in general) at the first place
is to put into use this constant-time memory access for
a.) accessing records by their own data content (< K,V >),
not by their locations in DMA (the table index)
b.) managing variable number of records-- a number of
records not of a given size, and may/not remain constant
in size throughout the use of this structure.
So, the overall structure in a Java Hash is:
a table: table // I'm using the identifier used in HashMap
each cell of this table is a bucket.
Each bucket is a linked list of type Entry--
i.e., each node of this linked list (not the linked list of Java/API, but the data structure) is of type Entry which in turn is a < K,V > pair.
When a new pair comes in to be added to the hash,
a unique hashCode is calculated for this < K,V > pair.
This hashCode is the key to the index of this < K,V > in table-- it tells
which bucket this < K,V > will go in in the hash.
Note: hashCode is "normalized" through the function hash() (in HashMap for one)
to better fit the current length of the table. indexFor() is also used
to determine which bucket, i.e., cell of table, the < K,V > will go in.
When the bucket is determined, the < K,V > is added to the beginning of the linked list in this bucket-- as a result, it is the first < K,V > entry in this bucket, and the first entry of the linked list that already existed is now
the "next" entry that is pointed to by this newly added one.
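The structure described in (A) can be sketched as follows. This is only a minimal illustration, not the JDK source: the class name TinyChainedMap, the fixed table of 16 buckets (no resizing) and the use of Math.floorMod in place of hash()/indexFor() are assumptions made for brevity. Note that in this sketch, as in HashMap, only the key is hashed, not the whole pair.

```java
import java.util.Objects;

/** Minimal chained hash table sketch; hypothetical, not the JDK implementation. */
class TinyChainedMap<K, V> {
    /** One node of a bucket's singly linked chain. */
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Fixed-size table for simplicity; a real map would grow it.
    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];
    private int size;

    private int indexFor(Object key) {
        // Only the key is hashed; floorMod keeps the index non-negative.
        return Math.floorMod(Objects.hashCode(key), table.length);
    }

    V put(K key, V value) {
        int i = indexFor(key);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) { // key already present: overwrite
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        table[i] = new Entry<>(key, value, table[i]); // new entry becomes the head
        size++;
        return null;
    }

    V get(Object key) {
        for (Entry<K, V> e = table[indexFor(key)]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

With Integer keys 1 and 17, both land in bucket 1 of this sketch, and the entry for 17 becomes the head of that chain with the entry for 1 as its "next".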
//===============================================================
(B)
From what I see in HashMap, the resizing of the table-- the hash is only done upon a decision based on
hash size and capacity, which are the current and max. # entries in the entire hash.
There is no re-structuring or resizing upon individual bucket sizes-- like "resize() when the max.#entries in a bucket exceeds such&such".
It is not probable, but is possible that a significant number of entries may be bulked up in a bucket while the rest of the hash is pretty much empty.
If this is the case, i.e., there is no upper limit on the size of each bucket, then access to the hash is not constant but linear, at least in theory: it takes $O(n)$ time to get hold of an entry, where $n$ is the total number of entries. But it shouldn't be.
//===============================================================
I don't think I'm missing anything in Part (A) above.
I'm not entirely sure of Part (B). It is a significant issue and I'm looking to find out how accurate this argument is.
I'm looking for verification on both parts.
Thanks in advance.
//===============================================================
EDIT:
Maximum bucket size being fixed, i.e., hash being restructured whenever
the #entries in a bucket hits a maximum would resolve it-- the access time is plain
constant in theory and in use.
This isn't a well-structured fix but a quick one, and it would work just fine for the sake of constant access.
The hashCodes are likely to be evenly distributed throughout the buckets and it isn't so likely
that any one of the buckets will hit the bucket-max before the threshold on the overall size of the hash is hit.
This is the assumption the current setup of HashMap is using as well.
Also based on Peter Lawrey's discussion below.
Collisions in HashMap are only a problem in pathological cases such as denial of service attacks.
In Java 7, you can change the hashing strategy such that an external party cannot predict your hashing algo.
AFAIK, in Java 8 HashMap uses a balanced tree instead of a linked list for buckets with many collisions. This means O(log N) worst case instead of O(N) access times.
I'm looking to increase the table size when everything is in the same hash. The hash-to-bucket mapping changes when the size of the table does.
Your idea sounds good. And it is completely true and basically what HashMap does when the table size is smaller than desired / the average amount of elements per bucket gets too large.
It does not do that by looking at each bucket and checking whether there is too much in it, because the average number of elements per bucket is easy to calculate from the total size and the table length.
The implementation of HashMap.get() in OpenJDK according to this is
public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}
That shows pretty well how HashMap finds elements, but it's written in a very confusing way. After a bit of renaming, commenting and rewriting it could look roughly like this:
public V get(Object key) {
    if (key == null)
        return getForNullKey();
    // get the key's hash & try to fix the distribution.
    // -> this may turn every 42 that goes in into a 9,
    //    but it can't turn it into a 9 one time and into an 8 the next
    int hash = hash(key.hashCode());
    // calculate the bucket index; the same hash must result in the same index
    // as well, since the table length is fixed at this point.
    int bucketIndex = indexFor(hash, table.length);
    // we have just found the right bucket. O(1) so far.
    // and this is the whole point of hash based lookup:
    // instantly knowing the nearly exact position where to find the element.
    // next see if the key is found in the bucket > get the list in the bucket
    LinkedList<Entry> bucketContentList = table[bucketIndex];
    // check each element, in the worst case O(n) time if everything is in this bucket.
    for (Entry entry : bucketContentList) {
        if (entry.key.equals(key))
            return entry.value;
    }
    return null;
}
What we see here is that the bucket indeed depends on both the .hashCode() returned from each key object and the current table size. When the table is resized, the bucket of a key will usually change, but two keys can only end up in different buckets if their .hashCode() values differ.
If you had an enormous table with 2^32 elements you could simply say bucketIndex = key.hashCode() and it would be as perfect as it can get. There is unfortunately not enough memory to do that, so you have to use fewer buckets and map 2^32 hashes into just a few buckets. That's what indexFor essentially does: it maps a large number space into a small one.
That is perfectly fine in the typical case where (almost) no object has the same .hashCode() of any other. But the one thing that you must not do with HashMaps is to add only elements with exactly the same hash.
If every hash is the same, your hash based lookup results in the same bucket and all your HashMap has become is a LinkedList (or whatever data structure holds the elements of a bucket). And now you have the worst case scenario of O(N) access time because you have to iterate over all the N elements.
Related
I am having confusion in hashing:
When we use Hashtable/HashMap (key,value), first I understood the internal data structure is an array (already allocated in memory).
Java's hashCode() method has an int return type, so I think this hash value will be used as an index for the array, and in this case we should have 2^32 entries in the array in RAM, which is not what actually happens.
So does Java create an index from the hashCode() that has a smaller range?
Answer:
As the guys pointed out below and from the documentation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java
HashMap is an array. The hashCode() is rehashed again but is still an integer, and the index in the array becomes: h & (length-1); so if the length of the array is 2^n then the index is the lowest n bits of the re-hashed value.
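A small sketch of that masking step, assuming a table length that is a power of two. The class name IndexBits and the sample value are only illustrative; the method mirrors the h & (length-1) expression quoted above.

```java
class IndexBits {
    /** Bucket index for a power-of-two table length, as in h & (length - 1). */
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int h = 0b1011_0110_1101;             // 2925, an arbitrary example value
        System.out.println(indexFor(h, 16));  // keeps the low 4 bits: 0b1101 = 13
        System.out.println(indexFor(h, 64));  // keeps the low 6 bits: 0b10_1101 = 45
        System.out.println(indexFor(-1, 16)); // a negative hash still gives a valid index: 15
    }
}
```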
The structure for a Java HashMap is not just an array. It is an array, but not of 2^31 entries (int is a signed type!), but of some smaller number of buckets, by default 16 initially. The Javadocs for HashMap explain that.
When the number of entries exceeds a certain fraction (the "load factor") of the capacity, the array grows to a larger size.
Each element of the array does not hold only one entry. Each element of the array holds a structure of entries (a linked list, which since Java 8 is converted to a red-black tree when a bucket grows large). Each entry of the structure has a hash code that transforms internally to the same bucket position in the array.
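The growth rule described above can be sketched with the documented defaults (initial capacity 16, load factor 0.75). The class and method names are hypothetical, and the real implementation has extra details (such as a maximum capacity) that are left out here.

```java
class ResizeThreshold {
    /** Number of entries above which a table of the given capacity is grown. */
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        int capacity = 16;         // documented default initial capacity
        float loadFactor = 0.75f;  // documented default load factor
        for (int i = 0; i < 4; i++) {
            System.out.println(capacity + " buckets -> grow once more than "
                    + threshold(capacity, loadFactor) + " entries are stored");
            capacity *= 2;         // the table doubles on each resize
        }
    }
}
```

So with the defaults, the 13th entry triggers the first resize from 16 to 32 buckets.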
Have you read the docs on this type?
http://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html
You really should.
Generally the base data structure will indeed be an array.
The methods that need to find an entry (or empty gap in the case of adding a new object) will reduce the hash code to something that fits the size of the array (generally by modulo), and use this as an index into that array.
Of course this makes the chance of collisions more likely, since many objects could have a hash code that reduces to the same index (possible anyway since multiple objects might have exactly the same hash code, but now much more likely). There are different strategies for dealing with this, generally either by using a linked-list-like structure or a mechanism for picking another slot if the first slot that matched was occupied by a non-equal key.
Since this adds cost, the more often such collisions happen the slower things become, and in the worst case lookup would in fact be O(n) (and slow as O(n) goes, too).
Increasing the size of the internal store will generally improve this though, especially if it is not to a multiple of the previous size (so the operation that reduced the hash code to find an index won't take a bunch of items colliding on the same index and then give them all the same index again). Some mechanisms will increase the internal size before absolutely necessary (while there is some empty space remaining) in certain cases (certain percentage, certain number of collisions with objects that don't have the same full hash code, etc.)
This means that unless the hash codes are very bad (most obviously, if they are in fact all exactly the same), the order of operation stays at O(1).
As per the following link document: Java HashMap Implementation
I'm confused with the implementation of HashMap (or rather, an enhancement in HashMap). My queries are:
Firstly
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;
Why and how are these constants used? I want some clear examples for this.
How are they achieving a performance gain with this?
Secondly
If you see the source code of HashMap in JDK, you will find the following static inner class:
static final class TreeNode<K, V> extends java.util.LinkedHashMap.Entry<K, V> {
    HashMap.TreeNode<K, V> parent;
    HashMap.TreeNode<K, V> left;
    HashMap.TreeNode<K, V> right;
    HashMap.TreeNode<K, V> prev;
    boolean red;

    TreeNode(int arg0, K arg1, V arg2, HashMap.Node<K, V> arg3) {
        super(arg0, arg1, arg2, arg3);
    }

    final HashMap.TreeNode<K, V> root() {
        HashMap.TreeNode arg0 = this;
        while (true) {
            HashMap.TreeNode arg1 = arg0.parent;
            if (arg0.parent == null) {
                return arg0;
            }
            arg0 = arg1;
        }
    }
    //...
}
How is it used? I just want an explanation of the algorithm.
HashMap contains a certain number of buckets. It uses hashCode to determine which bucket to put each entry into. For simplicity's sake imagine it as a modulus.
If our hashcode is 123456 and we have 4 buckets, 123456 % 4 = 0 so the item goes in the first bucket, Bucket 1.
If our hashCode function is good, it should provide an even distribution so that all the buckets will be used somewhat equally. In this case, the bucket uses a linked list to store the values.
But you can't rely on people to implement good hash functions. People will often write poor hash functions which will result in a non-even distribution. It's also possible that we could just get unlucky with our inputs.
The less even this distribution is, the further we're moving from O(1) operations and the closer we're moving towards O(n) operations.
The implementation of HashMap tries to mitigate this by organising some buckets into trees rather than linked lists if the buckets become too large. This is what TREEIFY_THRESHOLD = 8 is for. If a bucket contains more than eight items, it should become a tree.
This tree is a Red-Black tree, presumably chosen because it offers some worst-case guarantees. It is first sorted by hash code. If the hash codes are the same, it uses the compareTo method of Comparable if the objects implement that interface, else the identity hash code.
If entries are removed from the map, the number of entries in the bucket might reduce such that this tree structure is no longer necessary. That's what the UNTREEIFY_THRESHOLD = 6 is for. If the number of elements in a bucket drops below six, we might as well go back to using a linked list.
Finally, there is the MIN_TREEIFY_CAPACITY = 64.
When a hash map grows in size, it automatically resizes itself to have more buckets. If we have a small HashMap, the likelihood of us getting very full buckets is quite high, because we don't have that many different buckets to put stuff into. It's much better to have a bigger HashMap, with more buckets that are less full. This constant basically says not to start making buckets into trees if our HashMap is very small - it should resize to be larger first instead.
To answer your question about the performance gain, these optimisations were added to improve the worst case. You would probably only see a noticeable performance improvement because of these optimisations if your hashCode function was not very good.
It is designed to protect against bad hashCode implementations and also provides basic protection against collision attacks, where a bad actor may attempt to slow down a system by deliberately selecting inputs which occupy the same buckets.
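How the three constants interact can be summarised in a small decision function. This is a simplified model of the behaviour described above, not the JDK code: the names BinPolicy, Action and afterInsert are hypothetical, and the ordinary load-factor resize, which happens independently, is ignored.

```java
class BinPolicy {
    static final int TREEIFY_THRESHOLD = 8;
    static final int MIN_TREEIFY_CAPACITY = 64;

    enum Action { KEEP_LIST, RESIZE_TABLE, TREEIFY_BIN }

    /**
     * Models what happens after a node is appended to a list bin.
     * binSize is the size of the bin including the new node;
     * tableLength is the current number of buckets.
     */
    static Action afterInsert(int binSize, int tableLength) {
        if (binSize <= TREEIFY_THRESHOLD) {
            return Action.KEEP_LIST;       // bin is still small enough
        }
        if (tableLength < MIN_TREEIFY_CAPACITY) {
            return Action.RESIZE_TABLE;    // small table: spread entries out first
        }
        return Action.TREEIFY_BIN;         // large table, crowded bin: build a tree
    }

    public static void main(String[] args) {
        System.out.println(afterInsert(8, 64));  // KEEP_LIST
        System.out.println(afterInsert(9, 16));  // RESIZE_TABLE
        System.out.println(afterInsert(9, 64));  // TREEIFY_BIN
    }
}
```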
To put it more simply (as simply as I can), plus some more details.
These properties depend on a lot of internal things that would be very cool to understand before moving to them directly.
TREEIFY_THRESHOLD -> when a single bucket grows beyond this size (and the table has at least MIN_TREEIFY_CAPACITY buckets), it is transformed into a balanced red-black tree. Why? Because of search speed. Think about it in a different way:
it would take on the order of 32 steps (log2 of Integer.MAX_VALUE is about 31) to search for an Entry within a bucket/bin with Integer.MAX_VALUE entries.
Some intro for the next topic. Why is the number of bins/buckets always a power of two? At least two reasons: faster than modulo operation and modulo on negative numbers will be negative. And you can't put an Entry into a "negative" bucket:
int arrayIndex = hashCode % numberOfBuckets; // negative whenever hashCode is negative
buckets[arrayIndex] = entry; // a negative index throws ArrayIndexOutOfBoundsException
Instead, there is a nice trick that replaces modulo:
(n - 1) & hash // n is the number of bins, hash is the hash of the key
When n is a power of two, this gives the same result as a modulo operation that never goes negative. It keeps the lower bits. This has an interesting consequence when you do:
Map<String, String> map = new HashMap<>();
In the case above (the default of 16 buckets), the decision of where an entry goes is based only on the last 4 bits of your hash code.
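The difference between % and the mask can be checked directly. This is only a small sketch; the class name MaskVsModulo and the sample hash values 42 and -7 are chosen for illustration. Math.floorMod is used as the reference for a modulo that never goes negative.

```java
class MaskVsModulo {
    /** The (n - 1) & hash trick; n must be a power of two. */
    static int maskIndex(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16; // number of buckets
        for (int hash : new int[] {42, -7}) {
            System.out.println("hash=" + hash
                    + "  %=" + (hash % n)                       // 10, then -7
                    + "  mask=" + maskIndex(hash, n)            // 10, then 9
                    + "  floorMod=" + Math.floorMod(hash, n));  // 10, then 9
        }
    }
}
```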
This is where multiplying the buckets comes into play. Under certain conditions (it would take a lot of time to explain in exact detail), the number of buckets is doubled. Why? When the number of buckets is doubled, there is one more bit coming into play.
So you have 16 buckets - last 4 bits of the hashcode decide where an entry goes. You double the buckets: 32 buckets - 5 last bits decide where entry will go.
This process is called re-hashing. It might get slow. That is why (for people who care) HashMap is jokingly described as: fast, fast, fast, slooow. There are other implementations - search for "pauseless hashmap"...
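The effect of the extra bit can be seen with a few hand-picked hash values that all share the same low 4 bits. The class name and the values are hypothetical, chosen only so that the arithmetic is easy to follow.

```java
class SplitOnResize {
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int[] hashes = {5, 21, 37, 53}; // all end in the bits 0101
        for (int h : hashes) {
            System.out.println(h + ": 16 buckets -> " + indexFor(h, 16)
                    + ", 32 buckets -> " + indexFor(h, 32));
        }
        // With 16 buckets all four land in bucket 5.
        // With 32 buckets, 5 and 37 stay in bucket 5, while 21 and 53 move to bucket 21 (5 + 16).
    }
}
```

Each entry either stays at its old index or moves by exactly the old table length, depending on the newly significant bit.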
Now UNTREEIFY_THRESHOLD comes into play after re-hashing. At that point, some entries might move from this bin to others (one more bit is added to the (n-1)&hash computation, and as such they might move to other buckets) and the bin might shrink to this UNTREEIFY_THRESHOLD. At this point it does not pay off to keep the bin as a red-black tree; it is kept as a linked list instead, like
entry.next.next....
MIN_TREEIFY_CAPACITY is the minimum number of buckets before a certain bucket is transformed into a Tree.
TreeNode is an alternative way to store the entries that belong to a single bin of the HashMap. In older implementations the entries of a bin were stored in a linked list. In Java 8, if the number of entries in a bin passed a threshold (TREEIFY_THRESHOLD), they are stored in a tree structure instead of the original linked list. This is an optimization.
From the implementation:
/*
* Implementation notes.
*
* This map usually acts as a binned (bucketed) hash table, but
* when bins get too large, they are transformed into bins of
* TreeNodes, each structured similarly to those in
* java.util.TreeMap. Most methods try to use normal bins, but
* relay to TreeNode methods when applicable (simply by checking
* instanceof a node). Bins of TreeNodes may be traversed and
* used like any others, but additionally support faster lookup
* when overpopulated. However, since the vast majority of bins in
* normal use are not overpopulated, checking for existence of
* tree bins may be delayed in the course of table methods.
You would need to visualize it: say there is a Class Key with only hashCode() function overridden to always return same value
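public class Key implements Comparable<Key> {
    private final String name;

    public Key(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return 1; // deliberately constant: every key collides
    }

    public String keyName() {
        return this.name;
    }

    @Override
    public int compareTo(Key key) {
        // returns a positive or negative integer (0 for equal names)
        return this.name.compareTo(key.name);
    }
}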
and then somewhere else, I am inserting 9 entries into a HashMap with all keys being instances of this class. e.g.
Map<Key, String> map = new HashMap<>();
Key key1 = new Key("key1");
map.put(key1, "one");
Key key2 = new Key("key2");
map.put(key2, "two");
Key key3 = new Key("key3");
map.put(key3, "three");
Key key4 = new Key("key4");
map.put(key4, "four");
Key key5 = new Key("key5");
map.put(key5, "five");
Key key6 = new Key("key6");
map.put(key6, "six");
Key key7 = new Key("key7");
map.put(key7, "seven");
Key key8 = new Key("key8");
map.put(key8, "eight");
// Since the hash code is the same, all entries land in the same bucket; let's call it bucket 1.
// Up to here, all entries in bucket 1 are arranged in a linked list structure, e.g. key1 -> key2 -> key3 -> ... and so on.
// But when I insert one more entry
Key key9 = new Key("key9");
map.put(key9, "nine");
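the threshold value of 8 is exceeded and, provided the table already has at least MIN_TREEIFY_CAPACITY (64) buckets, the entries of bucket 1 are rearranged into a tree (red-black) structure, replacing the old linked list (with a smaller table, such as the default 16 buckets, the table is resized first instead). e.g.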
key1
/ \
key2 key3
/ \ / \
Tree traversal is faster {O(log n)} than LinkedList {O(n)} and as n grows, the difference becomes more significant.
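A self-contained variant of this experiment is sketched below. It is an illustration rather than a benchmark: the names CollidingKeysDemo, BadKey and fill are hypothetical, and equals() is overridden here (unlike in the Key class above) so that lookups with a fresh key instance succeed. It only checks that a map still behaves correctly when every key collides; it does not measure the speed difference.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeysDemo {
    /** Key whose hashCode is deliberately constant, so every instance collides. */
    static final class BadKey implements Comparable<BadKey> {
        final String name;

        BadKey(String name) {
            this.name = name;
        }

        @Override
        public int hashCode() {
            return 1;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }

        @Override
        public int compareTo(BadKey other) {
            return name.compareTo(other.name);
        }
    }

    /** Inserts n keys that all share one bucket. */
    static Map<BadKey, Integer> fill(int n) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new BadKey("key" + i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = fill(1000);
        System.out.println(map.size());                    // 1000
        System.out.println(map.get(new BadKey("key500"))); // 500
    }
}
```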
The change in the HashMap implementation was added with JEP-180. The purpose was to:
Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. Implement the same improvement in the LinkedHashMap class
However, pure performance is not the only gain. It will also mitigate HashDoS attacks, in case a hash map is used to store user input, because the red-black tree that is used to store data in the bucket has worst-case insertion complexity of O(log n). The tree is used after certain criteria are met - see Eugene's answer.
To understand the internal implementation of HashMap, you need to understand hashing.
Hashing, in its simplest form, is a way of assigning a code (ideally, but not necessarily, unique) to any variable/object by applying a formula/algorithm to its properties.
A true hash function must follow this rule –
“Hash function should return the same hash code each and every time when the function is applied on same or equal objects. In other words, two equal objects must produce the same hash code consistently.”
When I run this program
public class MyHashMapOperationsDebug {
    public static void main(String[] args) {
        MyHashMap hashMap = new MyHashMap(); // MyHashMap is a replica of HashMap
        for (int i = 1; i <= 11; i++)
            hashMap.put(i, i + 100);
    }
}
and MyHashMap.java has
void addEntry(int hash, K key, V value, int bucketIndex) { // replica of HashMap's addEntry method
    Entry<K,V> e = table[bucketIndex];
    System.out.println("bucketIndex : " + bucketIndex);
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
    if (size++ >= threshold)
        resize(2 * table.length);
}
OUTPUT:
bucketIndex : 7
bucketIndex : 14
bucketIndex : 4
bucketIndex : 13
bucketIndex : 1
bucketIndex : 8
bucketIndex : 2
bucketIndex : 11
bucketIndex : 11
bucketIndex : 2
bucketIndex : 8
Why do some keys go to the same bucket, even when only 11 keys are stored in a map of size 16? E.g. the buckets at index 2 and 11 have two keys each.
EDIT:
After reading the inputs below, one question: what will be the complexity in the above case, where Java's HashMap and Integer are used? Is it O(1)?
Because it's impossible, without knowing all the keys in advance, to design an algorithm that will guarantee that they will be evenly distributed. And even when knowing all the keys in advance, if two of them have the same hashCode, they will always be in the same bucket.
That doesn't mean the HashMap isn't O(1). Even assuming that every bucket has 2 entries, regardless of the number of entries in the map, that still makes every get operation execute in time that doesn't depend on the number of entries in the map, which is the definition of O(1).
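To get a feeling for how likely collisions are even with few keys, one can compute the chance that 11 keys all land in different buckets out of 16. The sketch below assumes an idealised model in which every key picks a bucket uniformly and independently, which is an assumption and not a property of any particular hash function; the names are hypothetical.

```java
class NoCollisionOdds {
    /** Probability that `keys` uniformly random keys land in pairwise distinct buckets. */
    static double allDistinct(int keys, int buckets) {
        double p = 1.0;
        for (int i = 0; i < keys; i++) {
            // The (i+1)-th key must avoid the i buckets that are already taken.
            p *= (double) (buckets - i) / buckets;
        }
        return p;
    }

    public static void main(String[] args) {
        // Roughly 1%: under this model, some collision among 11 keys in 16 buckets is almost certain.
        System.out.println(allDistinct(11, 16));
        // With more keys than buckets a collision is unavoidable.
        System.out.println(allDistinct(17, 16)); // 0.0
    }
}
```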
What will be the complexity in above case where HashMap & Integer of Java is used. Is it O(1) ?
Yes. The Integer.hashCode() method returns the value of the Integer itself, and that will be uniformly distributed across the space of possible hash values.
So the performance of the hash table will be optimal; i.e. O(1) for get operations and O(1) (amortized) for put operations. And since there are only 2^32 unique keys possible, we don't need to consider the issue of how HashMap scales beyond that point.
E.g. the buckets at index 2 and 11 have two keys each: that's because of hash collisions. HashMap can give O(n) lookup performance if you have a hash collision for all n elements. That is the sign of a badly designed hashing algorithm.
In fact, in a way you can say that some extra space is allocated to avoid these collisions. Your hashing technique makes sure that you don't have many collisions, and to do that you obviously need extra buckets.
But at the same time, you can't avoid collisions completely, because if your hashing technique is such that each bucket holds only one entry, you will need a lot of space. So, within limits, hash collisions are actually good to have.
It's very difficult to know the key distribution beforehand in order to design an O(1) hash function. Even if you know the key distribution, your keys may still map to the same slot. So you need to do rehashing once your load factor reaches a certain fraction. Suppose your map size is 16 and you have 17 keys: then it will have a collision. So in this situation you need some mechanism to rehash the map to remove potential collisions.
The find operation in HashMap is O(1) on average, but it can degrade to O(n) as well.
Since I'm working on time complexity, I've been searching through the Oracle Java class library for the time complexity of some standard methods used on Lists, Maps and Classes (more specifically, ArrayList, HashSet and HashMap).
Now, when looking at the HashMap javadoc page, they only really speak about the get() and put() methods.
The methods I still need to know are:
remove(Object o)
size()
values()
I think that remove() will have the same complexity as get(), O(1), assuming we don't have a giant HashMap with equal hashCodes, etc.
For size() I'd also assume O(1), since a HashSet, which also has no order, has a size() method with complexity O(1).
The one I have no idea about is values() - I'm not sure whether this method will just somehow "copy" the HashMap, giving a time complexity of O(1), or if it will have to iterate over the HashMap, making the complexity proportional to the number of elements stored in the HashMap.
Thanks.
The source is often helpful: http://kickjava.com/src/java/util/HashMap.java.htm
remove: O(1)
size: O(1)
values: O(n) (on traversal through iterator)
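That values() returns a view rather than a copy can be observed directly: a value added to the map after the collection was obtained still shows up in it. The class and method names in this sketch are hypothetical.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

class ValuesViewDemo {
    /** Obtains the values view first, then adds another mapping to the map. */
    static Collection<Integer> viewThenPut(Map<String, Integer> map) {
        Collection<Integer> values = map.values(); // a view; nothing is copied here
        map.put("b", 2);                           // added after the view was obtained
        return values;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        Collection<Integer> values = viewThenPut(map);
        System.out.println(values.size());         // 2: the view reflects the later put
        int sum = 0;
        for (int v : values) {                     // traversal is what visits every entry
            sum += v;
        }
        System.out.println(sum);                   // 3
    }
}
```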
The code for remove (as in rt.jar for HashMap) is:
/**
 * Removes and returns the entry associated with the specified key
 * in the HashMap. Returns null if the HashMap contains no mapping
 * for this key.
 */
final Entry<K,V> removeEntryForKey(Object key) {
    int hash = (key == null) ? 0 : hash(key.hashCode());
    int i = indexFor(hash, table.length);
    Entry<K,V> prev = table[i];
    Entry<K,V> e = prev;
    while (e != null) {
        Entry<K,V> next = e.next;
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k)))) {
            modCount++;
            size--;
            if (prev == e)
                table[i] = next;
            else
                prev.next = next;
            e.recordRemoval(this);
            return e;
        }
        prev = e;
        e = next;
    }
    return e;
}
Clearly, the worst case is O(n).
Search: O(1+k/n)
Insert: O(1)
Delete: O(1+k/n)
where k is the no. of collision elements added to the same LinkedList (k elements had same hashCode)
Insertion is O(1) because you add the element right at the head of LinkedList.
Amortized time complexities are close to O(1) given a good hash function. If you are very concerned about lookup time, then try resolving the collisions using a binary search tree instead of Java's default implementation, i.e. a linked list.
Just want to add a comment regarding the claim above that HashMap may go to O(n) in deletion and search in the worst case: that will not happen asymptotically, as we are talking about the Java HashMap implementation.
For a small table (fewer than 64 buckets), colliding entries are kept in plain linked lists, so in an unfortunate enough, but still very unlikely, case access is linear; but asymptotically speaking, we should say that in the worst case HashMap is O(log N).
You can always take a look on the source code and check it yourself.
Anyway... I once checked the source code and what I remember is that there is a variable named size that always hold the number of items in the HashMap so size() is O(1).
On average, HashMap insertion, deletion and search take O(1) constant time.
That said, in the worst case, Java takes O(n) time for searching, insertion, and deletion.
Mind you, the time complexity of HashMap depends on the load factor n/b (the number of entries present in the hash table divided by the total number of buckets in the hash table) and on how well the hash function spreads the inserted keys. A hash function might map two very different objects to the same bucket (this is called a collision). There are various methods of handling collisions, known as collision resolution techniques, such as
Using a better hashing function
Open addressing
Chaining, etc.
Java uses chaining and rehashing to handle collisions.
Chaining drawbacks: in the worst case, deletion and searching take O(n) operations, as it might happen that all objects are mapped to one particular bucket, which eventually grows into an O(n) chain.
Rehashing drawbacks: Java uses a load factor (n/b) of 0.75 as the rehashing limit (to my knowledge, chaining requires on average O(1+(n/b)) operations per lookup; if rehashing keeps n/b < 0.99, it is constant time). Rehashing gets out of hand when the table is massive, and in this case, if we use it for real-time applications, response time could be problematic.
In the worst case, then, Java HashMap takes O(n) time to search, insert, and delete.
This might sound like a very vague question upfront but it is not. I have gone through the Hash Function description on the wiki but it did not help me much in understanding it.
I am looking for simple answers to rather complex topics like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow ?
What is the difference between HashMap, HashTable and HashList ?
What do we mean by 'Constant Time Complexity' and why do different implementations of the hash give constant time operations ?
Lastly, why are Hash and LinkedList asked about in most interview questions; is there any specific logic to it for testing the interviewee's knowledge?
I know my question list is big but I would really appreciate if I can get some clear answers to these questions as I really want to understand the topic.
Here is a good explanation about hashing. For example, if you want to store the string "Rachel", you apply a hash function to that string to get a memory location: myHashFunction(key: "Rachel" value: "Rachel") --> 10. The function may return 10 for the input "Rachel", so assuming you have an array of size 100 you store "Rachel" at index 10. If you want to retrieve that element you just call GetmyHashFunction("Rachel") and it will return 10. Note that for this example the key is "Rachel" and the value is "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in this case you have a collision, and if you are implementing your own hash table you have to take care of it, perhaps using a linked list or other techniques.
Here are some common hash functions used. A good hash function satisfies this: each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One of the methods is called the division method. We map a key k into one of n slots by taking the remainder of k divided by n: h(k) = k mod n. For example, if your array size is n = 100 and your key is an integer k = 15, then h(k) = 15 mod 100 = 15.
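The division method is short enough to try out. In this sketch the class and method names are hypothetical, and Math.floorMod is used instead of % only so that negative keys also get a valid slot.

```java
class DivisionMethod {
    /** h(k) = k mod n, kept non-negative for negative keys. */
    static int slot(int k, int n) {
        return Math.floorMod(k, n);
    }

    public static void main(String[] args) {
        int n = 100;                       // array size
        System.out.println(slot(15, n));   // 15
        System.out.println(slot(115, n));  // 15, so keys 15 and 115 collide
        System.out.println(slot(110, n));  // 10
    }
}
```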
Hashtable is synchronised and HashMap is not.
HashMap allows a null key (and null values) but Hashtable does not.
The purpose of a hash table is to have O(1) constant time complexity in adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list until you reach it, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key and the hash function tells you where the desired element is. If the hash function is well implemented, this takes constant time, O(1). This means you don't have to traverse all the elements stored in the hash table. You will get the element "instantly".
Of course a programmer/developer/computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and Hashtable are maps; they are a collection of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
These are things that people ought to know.
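The point about strings can be made concrete: String.hashCode() is documented as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], which takes one step per character. The sketch below recomputes it by hand; the class and method names are hypothetical.

```java
class StringHashCost {
    /** Recomputes String.hashCode() with one multiply-add per character. */
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // work grows linearly with the string length
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "Rachel";
        System.out.println(manualHash(s) == s.hashCode()); // true
    }
}
```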
Hashing is transforming a given entity (in Java terms, an object) to some number (or sequence). The hash function is not reversible - i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object) by the JVM, typically based on some memory address.
The JVM address thing is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs allow for generating good hashCode methods.
Hashtable and HashMap are conceptually the same thing. They store key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values - only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant. That is - 1, or close to 1.
This is basic computer-science material, and it is supposed that everyone is familiar with it. I think Google has said that the hashtable is the most important data structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such a list would have O(n) complexity, meaning that you have to traverse the whole list (or half of it, on average) to perform such an operation.
Hashing is a very simple and effective way of speeding it up: consider that we split the whole list into a set of small lists. Items in one such small list would have something in common, and this something can be deduced from the key. For example, given a list of names, we could use the first letter as the quality that chooses which small list to look in. In this way, by partitioning the data by the first letter of the key, we obtain a simple hash that is able to split the whole list into ~30 smaller lists, so that each operation would take roughly O(n)/30 time.
However, we could note that the results are not that perfect. First, there are only 30 of them, and we can't change it. Second, some letters are used more often than others, so that the set with Y or Z will be much smaller than the set with A. For better results, it's better to find a way to partition the items into sets of roughly the same size. How could we solve that? This is where you use hash functions. A hash function is able to create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
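#include <string.h>

#define NUMBER_OF_PARTITIONS 64 /* example value, an assumption */

unsigned hash(const char *str) {
    unsigned rez = 0; /* unsigned: overflow wraps instead of being undefined */
    size_t len = strlen(str);
    for (size_t i = 0; i < len; i++)
        rez = rez * 37 + (unsigned char)str[i];
    return rez % NUMBER_OF_PARTITIONS;
}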
This would assure a quite even distribution and configurable number of sets (also called buckets).
What do we mean by Hashing, how does it work internally ?
Hashing is the transformation of a string into a shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table. It contains an array of items. Hash tables compute an index from the data item's key and use this index to place the data into the array.
What algorithm does it follow ?
In simple words most of the Hash algorithms work on the logic "index = f(key, arrayLength)"
Lastly, why in most interview questions Hash and LinkedList are asked, is there any specific logic for it from testing interviewee's knowledge ?
It's about how good you are at logical reasoning. It is the most important data structure, one that every programmer should know.