When we use a HashTable for storing data, it is said that searching takes O(1) time. I am confused; can anybody explain?
Well it's a little bit of a lie -- it can take longer than that, but it usually doesn't.
Basically, a hash table is an array containing all of the keys to search on. The position of each key in the array is determined by the hash function, which can be any function that always maps the same input to the same output. We shall assume that the hash function is O(1).
So when we insert something into the hash table, we use the hash function (let's call it h) to find the location where to put it, and put it there. Now we insert another thing, hashing and storing. And another. Each time we insert data, it takes O(1) time to insert it (since the hash function is O(1)).
Looking up data is the same. If we want to find a value, x, we have only to find out h(x), which tells us where x is located in the hash table. So we can look up any hash value in O(1) as well.
Now to the lie: The above is a very nice theory with one problem: what if we insert data and there is already something in that position of the array? There is nothing which guarantees that the hash function won't produce the same output for two different inputs (unless you have a perfect hash function, but those are tricky to produce). Therefore, when we insert we need to take one of two strategies:
Store multiple values at each spot in the array (say, each array slot has a linked list). Now when you do a lookup, it is still O(1) to arrive at the correct place in the array, but potentially a linear search down a (hopefully short) linked list. This is called "separate chaining".
If you find something is already there, hash again and find another location. Repeat until you find an empty spot, and put it there. The lookup procedure can follow the same rules to find the data. Now it's still O(1) to get to the first location, but there is a potentially (hopefully short) linear search to bounce around the table till you find the data you are after. This is called "open addressing".
Basically, both approaches are still mostly O(1) but with a hopefully-short linear sequence. We can assume for most purposes that it is O(1). If the hash table is getting too full, those linear searches can become longer and longer, and then it is time to "re-hash" which means to create a new hash table of a much bigger size and insert all the data back into it.
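As a rough sketch (illustrative only, not any particular library's implementation), a separately chained table in Java might look like this; the capacity and the use of hashCode() are my own choices:

import java.util.LinkedList;

// Minimal separate-chaining hash table: an array of (hopefully short) chains.
class ChainedTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    private int index(K key) { // O(1), assuming hashCode() is O(1)
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    void put(K key, V value) {
        for (Entry<K, V> e : buckets[index(key)]) {
            if (e.key.equals(key)) { e.value = value; return; } // key already present
        }
        buckets[index(key)].add(new Entry<>(key, value)); // append to the chain
    }

    V get(K key) {
        // O(1) to reach the right bucket, then a linear scan down the chain.
        for (Entry<K, V> e : buckets[index(key)]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}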
Searching takes O(1) time only if there are no collisions in the hashtable, so strictly speaking it is incorrect to say that searching in a hashtable always takes O(1) or constant time.
See how Hashtable works on MSDN.
EDIT
mgiuca explains it beautifully, and I am just adding one more collision-resolution technique, called chaining.
In this technique, we maintain a linked list of values at each location. When there is a collision, your value is added to the linked list at that position, so when you search for a value you may have to scan the whole linked list; in that case searching is not an O(1) operation.
Related
I understand that hash tables are designed for easy storage and retrieval of data when storing massive amounts of it. However, when retrieving a specific piece of data, how do they retrieve it if it was stored in an alternative location due to a collision?
Say there are 10 indexes and data A was stored in index 3 and data E runs into collision because data A is stored in index 3 already and collision prevention puts it in index 7 instead. When it comes time to retrieve data E, how does it retrieve E instead of using the first hash function and retrieving A instead?
Sorry if this is dumb question. I'm still somewhat new to programming.
I don't believe that Java will resolve a hashing collision by moving an item to a different bucket. Doing that would make it difficult if not impossible to determine the correct bucket into which it should have been hashed. If you read this SO article carefully, you will note that it points out two tools which Java has at its disposal. First, it maintains a list of values for each bucket* (read note below). Second, if the list becomes too large it can increase the number of buckets.
I believe that the list has now been replaced with a tree. This ensures O(log n) performance for lookup, insertion, etc. in the worst case, whereas with a list the worst-case performance was O(n).
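To answer the example directly: in an open-addressing table (which is what the question describes, though Java's HashMap does not work this way), the table stores the key next to the value, so a lookup can probe past A and recognize E by comparing keys. A minimal sketch, assuming linear probing and a hypothetical 10-slot table:

// Hypothetical linear-probing table; illustrative only, assumes it never fills up.
class ProbingTable {
    static final int SIZE = 10;
    final String[] keys = new String[SIZE];
    final Object[] vals = new Object[SIZE];

    int slot(String key) { return (key.hashCode() & 0x7fffffff) % SIZE; }

    void put(String key, Object val) {
        int i = slot(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % SIZE; // slot taken by another key: step to the next one
        }
        keys[i] = key;
        vals[i] = val;
    }

    Object get(String key) {
        int i = slot(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return vals[i]; // compare stored keys: A is not E
            i = (i + 1) % SIZE; // not ours: keep probing
        }
        return null; // reached an empty slot: the key was never inserted
    }
}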
I have two lists of phone numbers. The 1st list is a subset of the 2nd list. I ran the two different algorithms below to determine which phone numbers are contained in both lists.
Way 1:
Sorting the 1st list: Arrays.sort(FirstList);
Looping over the 2nd list to find matched elements: if Arrays.binarySearch(FirstList, 'each of 2nd list') then OK
Way 2:
Converting the 1st list into a HashMap with key/value pairs of ('each of 1st list', Boolean.TRUE)
Looping over the 2nd list to find matched elements: if FirstList.containsKey('each of 2nd list') then OK
The result: Way 2 ran in 5 seconds, considerably faster than Way 1 at 39 seconds. I can't understand the reason why. I'd appreciate any comments.
Because hashing is O(1) and binary searching is O(log N).
HashMap relies on a very efficient algorithm called 'hashing' which has been in use for many years and is reliable and effective. Essentially the way it works is to split the items in the collection into much smaller groups which can be accessed extremely quickly. Once the group is located a less efficient search mechanism can be used to locate the specific item.
Identifying the group for an item occurs via an algorithm called a 'hashing function'. In Java the hashing method is Object.hashCode() which returns an int representing the group. As long as hashCode is well defined for your class you should expect HashMap to be very efficient which is exactly what you've found.
There's a very good discussion on the various types of Map and which to use at Difference between HashMap, LinkedHashMap and TreeMap
My shorthand rule-of-thumb is to always use HashMap unless you can't define an appropriate hashCode for your keys or the items need to be ordered (either natural or insertion).
Look at the source code for HashMap: it creates and stores a hash for each added (key, value) pair, then the containsKey() method calculates a hash for the given key, and uses a very fast operation to check if it is already in the map. So most retrieval operations are very fast.
Way 1:
Sorting: around O(n log n)
Search: around O(log n) per lookup
Way 2:
Creating the HashTable: O(n) for low density (no collisions)
Contains: O(1) per lookup
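A self-contained sketch of the comparison (the sizes and the random long data are made up; the real lists held phone numbers):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class LookupComparison {
    public static void main(String[] args) {
        int n = 1_000_000; // hypothetical list size
        Random rnd = new Random(42);
        long[] second = new long[n];
        for (int i = 0; i < n; i++) second[i] = rnd.nextLong();
        long[] first = Arrays.copyOf(second, n / 2); // 1st list is a subset of the 2nd

        // Way 1: sort once, O(n log n), then O(log n) per binarySearch.
        long[] sorted = first.clone();
        Arrays.sort(sorted);
        long t1 = System.nanoTime();
        int hits1 = 0;
        for (long x : second) if (Arrays.binarySearch(sorted, x) >= 0) hits1++;
        t1 = System.nanoTime() - t1;

        // Way 2: build a HashMap once, O(n), then O(1) per containsKey on average.
        Map<Long, Boolean> map = new HashMap<>();
        for (long x : first) map.put(x, Boolean.TRUE);
        long t2 = System.nanoTime();
        int hits2 = 0;
        for (long x : second) if (map.containsKey(x)) hits2++;
        t2 = System.nanoTime() - t2;

        System.out.printf("binarySearch: %d hits in %d ms%n", hits1, t1 / 1_000_000);
        System.out.printf("containsKey:  %d hits in %d ms%n", hits2, t2 / 1_000_000);
    }
}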
I'm studying Java collection performance characteristics, Big O notation and complexity, etc. There's a real-world part I can't wrap my head around, and that's why HashMap and other hash containers are considered O(1), which should mean that finding an entry by key in a 1,000 entry table should take about the same time as a 1,000,000 entry table.
Let's say you have HashMap myHashMap, stored with a key of first name + last name. If you call myHashMap.get("FredFlinstone"), how can it instantly find Fred Flinstone's Person object? How can it not have to iterate through the set of keys stored in the HashMap to find the pointer to the object? If there were 1,000,000 entries in the HashMap, the list of keys would also be 1,000,000 long (assuming no collisions), which must take more time to go through than a list of 1,000, even if it were sorted. So how can the get() or containsKey() time not change with n?
Note: I thought my question would be answered in Is a Java hashmap really O(1)? but the answers didn't really address this point. My question is also not about collisions.
"My question is also not about collisions." - Actually it is indirectly. No collision = O(1) ...
In the worst case (pathological case), there would be one bucket with N items hanging off it, then it would be O(N)
Let's take a look at a very simple example of a hash map and a hash function. To keep things simple, let's say that your hash map has 10 buckets and that it uses integers as keys. For the purposes of this example we shall use the following hash function:
public int hash(int key) {
    return key % 10;
}
Now, when we want to store an entry in the map, we hash the key, get an integer between 0 and 9, and then put that entry in the corresponding bucket. Then, when we need to look up a key, we just have to compute its hash and we know exactly which bucket it is in (or would be in) without having to look in any of the others.
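Putting the two halves together, a toy version of that map might look like this (illustrative only: it ignores collisions and assumes non-negative keys):

// Toy map from int keys to String values with ten buckets,
// using hash(key) = key % 10; a colliding key simply overwrites.
class TinyIntMap {
    private final String[] buckets = new String[10];

    private int hash(int key) { return key % 10; } // assumes key >= 0

    void put(int key, String value) { buckets[hash(key)] = value; }

    String get(int key) {
        return buckets[hash(key)]; // direct index: no searching the other buckets
    }
}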
Computing the hash function on a given key takes constant time. Looking up whether there is a value stored for that key is a random-access operation, since the hashmap is backed by an array. The only problem is making sure that different keys with the SAME hash value (a hash collision) don't happen too often. If it happens only about once in n operations, that's enough for constant time in the average case.
I have an assignment that I am working on, and I can't get a hold of the professor to get clarity on something. The idea is that we are writing an anagram solver, using a given set of words, that we store in 3 different dictionary classes: Linear, Binary, and Hash.
So we read in the words from a text file, and for the first 2 dictionary objects (linear and binary), we store the words in an ArrayList... easy enough.
But for the HashDictionary, he wants us to store the words in a HashTable. I'm just not sure what the values would be in the HashTable, or why you would do that. The instructions say we store the words in a Hashtable for quick retrieval, but I just don't get what the point of that is. It makes sense to store words in an ArrayList, but I'm just not sure how key/value pairing helps with a dictionary.
Maybe I'm not giving enough details, but I figured someone might have seen something like this before and it would be obvious to them.
Each of our classes has a contains method that returns a boolean representing whether or not a word passed in is in the dictionary. The linear does a linear search of the ArrayList, the binary does a binary search of the ArrayList, and I'm not sure about the hash...
The difference is speed. Both methods work, but the hash table is fast.
When you use an ArrayList, or any sort of List, to find an element, you must inspect each list item, one by one, until you find the desired word. If the word isn't there, you've looped through the entire list.
When you use a HashTable, you perform some "magic" on the word you are looking up known as calculating the word's hash. Using that hash value, instead of looping through a list of values, you can immediately deduce where to find your word - or, if your word doesn't exist in the hash, that your word isn't there.
I've oversimplified here, but that's the general idea. You can find another question here with a variety of explanations on how a hash table works.
Here is a small code snippet utilizing a HashMap.
// We will map our words to their definitions; word is the key, definition is the value
Map<String, String> dictionary = new HashMap<String, String>();
dictionary.put("hello", "A common salutation");
dictionary.put("chicken", "A delightful vessel for protein");
// Later ...
dictionary.get("chicken"); // Returns "A delightful vessel for protein"
The problem you describe asks that you use a HashMap as the basis for a dictionary that fulfills three requirements:
Adding a word to the dictionary
Removing a word from the dictionary
Checking if a word is in the dictionary
It seems counter-intuitive to use a map, which stores a key and a value, since all you really want to do is store just a key (or just a value). However, as I described above, a HashMap makes it extremely quick to find the value associated with a key. Similarly, it makes it extremely quick to see if the HashMap knows about a key at all. We can leverage this quality by storing each of the dictionary words as a key in the HashMap, and associating it with a garbage value (since we don't care about it), such as null.
You can see how to fulfill the three requirements, as follows.
Map<String, Object> map = new HashMap<String, Object>();
// Add a word
map.put("word", null);
// Remove a word
map.remove("word");
// Check for the presence of a word
map.containsKey("word");
I don't want to overload you with information, but the requirements we have here align with a data structure known as a Set. In Java, a commonly used Set is the HashSet, which is almost exactly what you are implementing with this bit of your homework assignment. (In fact, if this weren't a homework assignment explicitly instructing you to use a HashMap, I'd recommend you instead use a HashSet.)
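For context (not what the assignment asks for), the same three operations with a HashSet would be:

import java.util.HashSet;
import java.util.Set;

Set<String> dictionary = new HashSet<String>();
dictionary.add("word");      // add a word
dictionary.remove("word");   // remove a word
dictionary.contains("word"); // check for the presence of a word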
Arrays are hard to find stuff in. If I gave you array[0] = "cat"; array[1] = "dog"; array[2] = "pikachu";, you'd have to check each element just to know whether jigglypuff is a word. If I gave you hash["cat"] = 1; hash["dog"] = 1; hash["pikachu"] = 1;, the lookup is instant: you just look the word up directly. The value 1 doesn't matter in this particular case, although you can put useful information there, such as how many times you've looked up a word; or maybe 1 could mean a real word and 2 the name of a Pokémon; or, for a real dictionary, it could hold a sentence-long definition. Less relevant.
It sounds like you don't really understand hash tables then. Even Wikipedia has a good explanation of this data structure.
Your hash table is just going to be a large array of strings (initially all empty). You compute a hash value using the characters in your word, and then insert the word at that position in the table.
There are issues when the hash value for two words is the same. And there are a few solutions. One is to store a list at each array position and just shove the word onto that list. Another is to step through the table by a known amount until you find a free position. Another is to compute a secondary hash using a different algorithm.
The point of this is that hash lookup is fast. It's very quick to compute a hash value, and then all you have to do is check that the word at that array position exists (and matches the search word). You follow the same rules for hash value collisions (in this case, mismatches) that you used for the insertion.
You want your table size to be a prime number that is larger than the number of elements you intend to store. You also need a hash function that diverges quickly so that your data is more likely to be dispersed widely through your hash table (rather than being clustered heavily in one region).
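A minimal sketch of that lookup, assuming the step-through-the-table (linear probing) strategy; the table size and the use of String.hashCode() are my own choices for illustration:

// Illustrative contains() for a string hash table with linear probing.
// TABLE_SIZE is a prime larger than the expected word count.
static final int TABLE_SIZE = 10007;
static final String[] table = new String[TABLE_SIZE];

static boolean contains(String word) {
    int i = (word.hashCode() & 0x7fffffff) % TABLE_SIZE; // mask the sign bit
    while (table[i] != null) {
        if (table[i].equals(word)) return true; // word found at (or near) its slot
        i = (i + 1) % TABLE_SIZE;               // mismatch: step to the next slot
    }
    return false; // reached an empty slot: the word was never inserted
}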
Hope this helps and points you in the right direction.
I am working on an algorithm that will store a small list of objects as a sublist of a larger set of objects. The objects are inherently ordered, so an ordered list is required.
The most common operations performed will be, in order of frequency:
retrieving the nth element from the list (for some arbitrary n)
inserting a single element at the beginning or end of the list
removing the first or last n elements from the list (for some arbitrary n)
Removing and inserting from the middle will never be done, so there is no need to consider the efficiency of that.
My question is: which implementation of List is most efficient for this use case in Java (i.e. LinkedList, ArrayList, Vector, etc.)? Please defend your answer by explaining the implementations of the different data structures so that I can make an informed decision.
Thanks.
NOTE
No, this is not a homework question. No, I do not have an army of research assistants who can do the work for me.
Based on your first criteria (arbitrary access) you should use an ArrayList. ArrayLists (and arrays in general) provide lookup/retrieval in constant time. In contrast, it takes linear time to look up items in a LinkedList.
For ArrayLists, insertion or deletion at the end is effectively free (amortized constant time). It may also be with LinkedLists, but that would be an implementation-specific optimization (it's linear otherwise).
For ArrayLists, insertion or deletion at front requires linear time (with consistent reuse of space, these may become constant depending on implementation). LinkedList operations at front of list are constant.
The last two usage cases somewhat balance each other out, however your most common case definitely suggests array-based storage.
As far as basic implementation details:
ArrayLists are basically just sequential sections of memory. If you know where the beginning is, you can just do a single addition to find the location of any element. Operations at the front are expensive because elements may have to be shifted to make room.
LinkedLists are disjoint in memory and consist of nodes linked to each other (with a reference to the first node). To find the nth node, you have to start at the first node and follow links until you reach the desired node. Operations at the front just require creating a node and updating your start pointer.
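To make the lookup cost concrete, here is roughly what finding the nth node involves (a sketch of a bare singly linked chain, not java.util.LinkedList's actual internals):

// Why linked-list indexing is O(n): n links must be followed one by one.
class Node<E> { E value; Node<E> next; }

static <E> E getNth(Node<E> first, int n) {
    Node<E> cur = first;
    for (int i = 0; i < n; i++) cur = cur.next; // one hop per index
    return cur.value; // an ArrayList does the same job with one base-plus-offset index
}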
I vote for a doubly linked list. http://docs.oracle.com/javase/6/docs/api/java/util/Deque.html
Probably the best data structure for this purpose would be a deque implemented with a dynamic array, which is basically an ArrayList that starts adding elements to the middle of the internal array instead of the beginning. Unfortunately Java's ArrayDeque does not support looking up an nth element.
It is, however, pretty easy to implement one yourself (or lookup an existing implementation), and then all three of the described operations can be done in O(1).
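A sketch of what implementing one yourself might look like: keep the elements in the middle of a backing array and track where the first one sits, so all three operations only touch the ends. Names, sizing, and growth policy are illustrative.

// Illustrative array-backed deque with O(1) indexed access.
// Not a drop-in ArrayDeque replacement; bounds checks omitted for brevity.
class IndexedDeque<E> {
    private Object[] buf = new Object[16];
    private int head = 8; // index of the first element
    private int size = 0;

    void addFirst(E e) { if (head == 0) grow(); buf[--head] = e; size++; }

    void addLast(E e) { if (head + size == buf.length) grow(); buf[head + size++] = e; }

    @SuppressWarnings("unchecked")
    E get(int n) { return (E) buf[head + n]; } // O(1): plain array indexing

    // Truncation is O(1) if you only move the boundaries; nulling the dropped
    // slots (so the GC can reclaim the objects) costs O(n) but is optional.
    void removeFirstN(int n) { head += n; size -= n; }
    void removeLastN(int n) { size -= n; }

    private void grow() { // re-centre into a buffer twice as large
        Object[] bigger = new Object[buf.length * 2];
        int newHead = (bigger.length - size) / 2;
        System.arraycopy(buf, head, bigger, newHead, size);
        buf = bigger;
        head = newHead;
    }
}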
You can do all of them with an ArrayList with minimal confusion if you're not worried about efficiency.
I would use some sort of queue or stack if I were only inserting at the front or end. They have the least overhead. Or you could also use a linked list.
To remove n elements from the front or end, I would use a linked list: you just drop the link at the cut point and the detached nodes disappear. E.g., to delete the first 5 elements, advance the head pointer past the 5th node and the ones before it are gone; to delete the last 6 elements, clear the link leading into the 6th-to-last node and the rest are gone. Java will do the garbage collecting for you. The truncation itself is O(1), although reaching the cut point from an end still takes O(n).
is this a homework question?
Definitely go for LinkedList. For both inserting a value at the beginning/end of the list and removing the first/last element in the list, it runs in O(1). This is because all that needs to be changed to carry out these operations is a couple of pointers, a minimally costly operation.
Although ArrayLists retrieve the nth element in O(1) while LinkedLists retrieve the nth element in O(n), ArrayLists run the danger of having to resize when elements are inserted. What do you suppose happens when the memory allotted for the ArrayList is used up and you try to insert another element? The ArrayList copies itself into a newly allocated, larger backing array (typically 1.5 to 2 times the old size), a costly operation, even if its cost is amortized across insertions. LinkedLists don't have this problem since, again, all that is done is the addition of a pointer.
I don't know a whole lot about Java Vectors, but if they're anything like C++ vectors, then they're very similar to ArrayLists (Vector is essentially a synchronized ArrayList).
I hope this helps.
Use a java.util.TreeMap of Long to Object, and look up index i as tm.get(i + tm.firstKey()).
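Expanded into a sketch (my reading of this suggestion; note that TreeMap operations are O(log n), not O(1)):

import java.util.TreeMap;

TreeMap<Long, Object> tm = new TreeMap<Long, Object>();

// Append at either end by extending the key range.
tm.put(tm.isEmpty() ? 0L : tm.lastKey() + 1, "last");   // add to the end
tm.put(tm.isEmpty() ? 0L : tm.firstKey() - 1, "first"); // add to the front

Object ith = tm.get(tm.firstKey() + 1); // the element at index 1

tm.headMap(tm.firstKey() + 1).clear(); // drop the first 1 element(s); headMap is a live view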