How is data retrieved from hash tables for collisions


I understand that hash tables are designed for easy storage and retrieval of data when storing massive amounts of it. However, when retrieving a specific piece of data, how is it found if it was stored in an alternative location because of a collision?
Say there are 10 indexes and data A is stored in index 3. Data E then collides with it, because index 3 is already occupied, and collision resolution puts E in index 7 instead. When it comes time to retrieve data E, how does the table retrieve E rather than applying the first hash function and retrieving A?
Sorry if this is a dumb question. I'm still somewhat new to programming.

I don't believe that Java will resolve a hashing collision by moving an item to a different bucket. Doing that would make it difficult, if not impossible, to determine the correct bucket into which it should have been hashed. If you read this SO article carefully, you will note that it points out two tools which Java has at its disposal. First, it maintains a list of values for each bucket* (read note below). Second, if the list becomes too large it can increase the number of buckets.
I believe that the list is now (since Java 8) replaced with a tree once a bucket grows large enough. This ensures O(log n) performance for lookup, insertion, etc., within a bucket in the worst case, whereas with a list the worst-case performance was O(n).
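A minimal sketch of this behaviour, using only java.util.HashMap. The class name CollisionDemo and the stored values are made up for illustration. The strings "Aa" and "BB" have the same hashCode() in Java, so they land in the same bucket, yet each get() still returns the right value because the map compares keys with equals() inside the bucket.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: two keys with equal hash codes stored in one HashMap.
class CollisionDemo {
    static Map<String, String> build() {
        Map<String, String> map = new HashMap<>();
        map.put("Aa", "data A"); // "Aa".hashCode() == 2112
        map.put("BB", "data E"); // "BB".hashCode() == 2112 as well, so same bucket
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = build();
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        // Lookup hashes the key to find the bucket, then uses equals() to pick the entry.
        System.out.println(map.get("Aa")); // data A
        System.out.println(map.get("BB")); // data E
    }
}
```

So the colliding item is never moved elsewhere; it shares the bucket and is told apart by key equality.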

Related

Efficient Search Algorithm For Specific Data Fields

So I have been assigned to write algorithms for filtering/searching.
Task: Filter: search and list objects that fulfill specified attribute(s).
Say the whole system is a student registration record system.
I have the data shown below. I need to filter and search by these attributes, for example by gender, student name, or date of birth.
Student Name, Gender, Date Of Birth, Mobile No
Is there a specific efficient algorithm or method for each of these fields?
For example, strings and integers each have their own kind of efficient search algorithm, right?
Here's what I am going to do.
I am going to code a binary search algorithm for searching/filtering on the fields above.
That's it. That part is easy, to be honest.
But I am curious: what is the proper coding approach for an efficient search/filter algorithm for each of these fields? What would you do?
I will obviously not be using a sequential search, because this involves a huge amount of data and iterating over every record would hurt performance.
Sequential search will be used only where the data is small.

Searching is a very broad topic, and the right approach depends entirely on your use case.
While building an efficient search algorithm you should take the factors below into consideration:
What is the size of your data? Is it fixed, or does it vary periodically?
How often are you going to insert, modify, or delete data?
Is your data sorted or unsorted?
Do you need prefix-based search such as autocomplete or longest-prefix search?
Now let's think about the solution/approach:
If your data is small and unsorted, you can try linear search, which has O(n) time complexity, where n is the size of your data/array.
If your data is already sorted, which is not always the case, you can use binary search, whose complexity is O(log n). If your data is not sorted, sorting it first takes O(n log n). In Java, Arrays.sort() uses a dual-pivot quicksort for primitives and a merge sort variant (TimSort) for objects, both typically O(n log n).
If fast retrieval is the main objective, you can think of HashMaps. The elements of a HashMap are indexed by hash code, so the time to search for any element is close to constant (if your hash function implementation is good).
Prefix-based search: since you mentioned searching by name, you also have the option of using the trie data structure.
Tries are an excellent option if you perform insert/delete/update operations frequently.
Lookup of an element in a trie is O(k), where k is the length of the string being searched.
Since you have registration data where insertion, update, and deletion are common, a trie is a good option to consider.
Also, check this link to choose between tries and hash tables: TriesVsMaps
[Image: sample representation of a trie; source: Hackerearth]
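To make the trade-offs concrete, here is a rough sketch of two of the options above applied to student records: binary search on a list sorted by name, and a hash index on a unique field. Everything here is assumed for illustration (the StudentSearch and Student classes, the field names, the sample data). It also groups by gender, since a field with only a few distinct values matches many rows and is better served by grouping than by a single-result search.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and sample data are invented for illustration.
class StudentSearch {
    static final class Student {
        final String name, gender, mobileNo;
        Student(String name, String gender, String mobileNo) {
            this.name = name; this.gender = gender; this.mobileNo = mobileNo;
        }
    }

    static final Comparator<Student> BY_NAME = Comparator.comparing((Student s) -> s.name);

    static List<Student> sample() {
        List<Student> list = new ArrayList<>();
        list.add(new Student("Bea", "F", "555-0101"));
        list.add(new Student("Chen", "M", "555-0102"));
        list.add(new Student("Ali", "F", "555-0103"));
        return list;
    }

    // O(log n) lookup; the list must already be sorted with BY_NAME.
    static Student findByName(List<Student> sortedByName, String name) {
        int i = Collections.binarySearch(sortedByName, new Student(name, null, null), BY_NAME);
        return i >= 0 ? sortedByName.get(i) : null;
    }

    // Expected O(1) lookup by a unique field after an O(n) build.
    static Map<String, Student> indexByMobile(List<Student> students) {
        Map<String, Student> index = new HashMap<>();
        for (Student s : students) index.put(s.mobileNo, s);
        return index;
    }

    // Non-unique field: map each value to all matching students.
    static Map<String, List<Student>> groupByGender(List<Student> students) {
        Map<String, List<Student>> groups = new HashMap<>();
        for (Student s : students) {
            groups.computeIfAbsent(s.gender, g -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Student> students = sample();
        students.sort(BY_NAME); // O(n log n), done once
        System.out.println(findByName(students, "Chen").mobileNo);       // 555-0102
        System.out.println(indexByMobile(students).get("555-0103").name); // Ali
        System.out.println(groupByGender(students).get("F").size());      // 2
    }
}
```

The sorted list has to be re-sorted (or kept sorted) as records change, while the hash indexes can be updated per record, which matters if inserts and deletes are frequent.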

What is the most memory efficient method of storing a large number of Strings in a map?

I want to store huge amounts of Strings in a Map<String, MagicObject>, so that the MagicObjects can be accessed quickly. There are so many entries to this Map that memory is becoming a bottleneck. Assuming the MagicObjects can't be optimized, what is the most efficient type of map I could use for this situation? I am currently using the following:
gnu.trove.map.hash.TCustomHashMap<byte[], MagicObject>

If your keys are long enough and share a lot of sufficiently long common prefixes, then you can save memory by using a trie (prefix tree) data structure. Answers to this question point to a couple of Java implementations of tries.

To keep an open mind, consider compressing your strings with Huffman coding before putting them in the map, as long as your strings are fixed (the number and content of the strings don't change).

I'm a little late to this party but this question came up in a related search and piqued my interest. I don't usually answer Java questions.
There are so many entries to this Map that memory is becoming a bottleneck.
I doubt it.
For the storage of strings in memory to become a bottleneck you need an awfully large number of unique strings[1]. To put things into perspective, I recently worked with a 1.8m word dictionary (1.8m unique English words) and they took up around 1.6MB in RAM at runtime.
If you used every word in the dictionary as a key you'd still only use 1.6MB of RAM[2] to store the keys, hence memory cannot be your bottleneck.
What I suspect you are experiencing is the cost of string matching rather than memory pressure. By this I mean that as more keys are added, each lookup may have to compare more strings character by character, and performance degrades[3]. Some comparison cost is unavoidable if you are using strings as keys.
If you want to speed things up a bit, store each key in a hashtable that doesn't store duplicates and use the hash key as the key to your map.
NOTES:
[1] I'm assuming the strings are all unique or else you would not attempt to use them as a key into a map.
[2] Even if Java uses 2 bytes per character, it still only comes to 3.2MB of memory, total.
[3] It slows down even more if you choose the wrong data structure, such as an unbalanced binary tree, to store your values. I don't know how the map stores values internally, but an unbalanced binary tree can degrade to O(n) per lookup, which is no better than scanning a list.

Which list implementation is optimal for removing and inserting from the front and back?

I am working on an algorithm that will store a small list of objects as a sublist of a larger set of objects. The objects are inherently ordered, so an ordered list is required.
The most common operations performed will be, in order of frequency:
retrieving the nth element from the list (for some arbitrary n)
inserting a single element at the beginning or end of the list
removing the first or last n elements from the list (for some arbitrary n)
Removing and inserting in the middle will never be done, so there is no need to consider the efficiency of that.
My question is: which implementation of List is most efficient for this use case in Java (i.e. LinkedList, ArrayList, Vector, etc.)? Please defend your answer by explaining the implementations of the different data structures so that I can make an informed decision.
Thanks.
NOTE
No, this is not a homework question. No, I do not have an army of research assistants who can do the work for me.

Based on your first criterion (arbitrary access) you should use an ArrayList. ArrayLists (and arrays in general) provide lookup/retrieval in constant time. In contrast, it takes linear time to look up items in a LinkedList.
For ArrayLists, insertion or deletion at the end takes amortized constant time. It is also constant for a doubly linked list that keeps a tail reference, as Java's LinkedList does; for a singly linked list without one it is linear.
For ArrayLists, insertion or deletion at the front requires linear time (with consistent reuse of space, these may become constant depending on the implementation). LinkedList operations at the front of the list are constant.
The last two use cases somewhat balance each other out; however, your most common case definitely suggests array-based storage.
As far as basic implementation details:
ArrayLists are basically just sequential sections of memory. If you know where the beginning is, you can just do a single addition to find the location of any element. Operations at the front are expensive because elements may have to be shifted to make room.
LinkedLists are disjoint in memory and consist of nodes linked to each other (with a reference to the first node). To find the nth node, you have to start at the first node and follow links until you reach the desired node. Operations at the front just require creating a node and updating your start pointer.

I vote for a doubly linked list. http://docs.oracle.com/javase/6/docs/api/java/util/Deque.html

Probably the best data structure for this purpose would be a deque implemented with a dynamic array, which is basically an ArrayList that starts adding elements to the middle of the internal array instead of the beginning. Unfortunately Java's ArrayDeque does not support looking up an nth element.
It is, however, pretty easy to implement one yourself (or look up an existing implementation), and then all three of the described operations can be done in O(1).
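For what it's worth, here is a rough sketch of such a structure as a circular buffer. IndexedDeque is a made-up name, not a library class, and the sketch assumes single-threaded use. get, addFirst and addLast are O(1) (amortized for the adds, because of occasional growth). The bulk removals only move an index and shrink the size, but this version also nulls out the removed slots so the garbage collector can reclaim them, which costs time proportional to the number of elements removed.

```java
// Hypothetical array-backed deque with indexed access (a sketch, not a library class).
class IndexedDeque<T> {
    private Object[] items = new Object[8];
    private int head = 0; // array index of the first element
    private int size = 0;

    int size() { return size; }

    @SuppressWarnings("unchecked")
    T get(int n) {
        if (n < 0 || n >= size) throw new IndexOutOfBoundsException("index " + n);
        return (T) items[(head + n) % items.length];
    }

    void addFirst(T value) {
        growIfFull();
        head = (head - 1 + items.length) % items.length; // step back, wrapping around
        items[head] = value;
        size++;
    }

    void addLast(T value) {
        growIfFull();
        items[(head + size) % items.length] = value;
        size++;
    }

    void removeFirst(int n) {
        checkCount(n);
        for (int i = 0; i < n; i++) items[(head + i) % items.length] = null; // release references
        head = (head + n) % items.length;
        size -= n;
    }

    void removeLast(int n) {
        checkCount(n);
        for (int i = size - n; i < size; i++) items[(head + i) % items.length] = null;
        size -= n;
    }

    private void checkCount(int n) {
        if (n < 0 || n > size) throw new IllegalArgumentException("cannot remove " + n);
    }

    private void growIfFull() {
        if (size < items.length) return;
        Object[] bigger = new Object[items.length * 2];
        for (int i = 0; i < size; i++) bigger[i] = items[(head + i) % items.length];
        items = bigger;
        head = 0;
    }

    public static void main(String[] args) {
        IndexedDeque<String> d = new IndexedDeque<>();
        d.addLast("b");
        d.addLast("c");
        d.addFirst("a");
        System.out.println(d.get(0) + d.get(1) + d.get(2)); // abc
        d.removeFirst(1);
        System.out.println(d.get(0)); // b
    }
}
```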

You can do all of them with an ArrayList with minimal confusion if you're not worried about efficiency.
I would use some sort of queue or stack if I were only inserting at the front or end. They have the least overhead. Or you could also use a linked list.
To remove n elements from the front or end I would use a linked list: repoint the head (or tail) reference at the node just past the removed run, and the unlinked nodes are gone. For example, to delete the first 5 elements, make the 6th node the new head and the ones before it disappear. Likewise, to delete the last 6 elements, make the 7th-to-last node the new tail. Java will do the garbage collecting for you. The relinking itself is O(1), although walking to that node still takes time proportional to n.
Is this a homework question?

Definitely go for LinkedList. For both inserting a value at the beginning/end of the list and removing the first/last element in the list, it runs in O(1). This is because all that needs to be changed to carry out these operations is a couple of pointers, a minimally costly operation.
Although ArrayLists retrieve the nth element in O(1) while LinkedLists retrieve the nth element in O(n), ArrayLists run the danger of having to adjust their size when elements are inserted. What do you suppose happens when the memory allotted for the ArrayList is used up and you try to insert another element? Well, what happens is that the ArrayList allocates a larger backing array (about 1.5 times the old capacity in OpenJDK) and copies every element into it, a costly operation when it happens. LinkedLists don't have this problem since, again, all that is done is the addition of a pointer.
I don't know a whole lot about Java Vectors, but if they're anything like C++ vectors, then they're very similar to ArrayLists.
I hope this helps.

java.util.TreeMap of Long to Object, and use index of i+tm.firstKey()

How does hashing have an o(1) search time? [duplicate]

When we use a hash table for storing data, it is said that searching takes O(1) time. I am confused; can anybody explain?

Well it's a little bit of a lie -- it can take longer than that, but it usually doesn't.
Basically, a hash table is an array containing all of the keys to search on. The position of each key in the array is determined by the hash function, which can be any function which always maps the same input to the same output. We shall assume that the hash function is O(1).
So when we insert something into the hash table, we use the hash function (let's call it h) to find the location where to put it, and put it there. Now we insert another thing, hashing and storing. And another. Each time we insert data, it takes O(1) time to insert it (since the hash function is O(1)).
Looking up data is the same. If we want to find a value, x, we have only to compute h(x), which tells us where x is located in the hash table. So we can look up any value in O(1) as well.
Now to the lie: The above is a very nice theory with one problem: what if we insert data and there is already something in that position of the array? There is nothing which guarantees that the hash function won't produce the same output for two different inputs (unless you have a perfect hash function, but those are tricky to produce). Therefore, when we insert we need to take one of two strategies:
Store multiple values at each spot in the array (say, each array slot has a linked list). Now when you do a lookup, it is still O(1) to arrive at the correct place in the array, but potentially a linear search down a (hopefully short) linked list. This is called "separate chaining".
If you find something is already there, hash again and find another location. Repeat until you find an empty spot, and put it there. The lookup procedure can follow the same rules to find the data. Now it's still O(1) to get to the first location, but there is potentially a (hopefully short) linear search, bouncing around the table until you find the data you are after. This is called "open addressing".
Basically, both approaches are still mostly O(1) but with a hopefully-short linear sequence. We can assume for most purposes that it is O(1). If the hash table is getting too full, those linear searches can become longer and longer, and then it is time to "re-hash" which means to create a new hash table of a much bigger size and insert all the data back into it.
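Here is a small sketch of the open addressing idea, using linear probing (step to the next slot) as the simplest rule for "find another location" instead of a second hash. ProbingTable is a hypothetical class with fixed capacity and no removal or re-hashing, just to show that lookup retraces the same probe sequence as insertion and compares the stored key at each step.

```java
// Hypothetical fixed-capacity open-addressing table with linear probing.
class ProbingTable {
    private final String[] keys;
    private final String[] values;

    ProbingTable(int capacity) {
        keys = new String[capacity];
        values = new String[capacity];
    }

    // The slot the hash function points at before any probing.
    private int home(String key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    // Stores the pair and returns the slot index that was finally used.
    int put(String key, String value) {
        int slot = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[slot] == null || keys[slot].equals(key)) {
                keys[slot] = key;
                values[slot] = value;
                return slot;
            }
            slot = (slot + 1) % keys.length; // occupied by another key: try the next slot
        }
        throw new IllegalStateException("table is full");
    }

    String get(String key) {
        int slot = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[slot] == null) return null;              // empty slot: key is absent
            if (keys[slot].equals(key)) return values[slot];  // stored key matches
            slot = (slot + 1) % keys.length;                  // same probe rule as put
        }
        return null;
    }

    public static void main(String[] args) {
        ProbingTable table = new ProbingTable(10);
        // "Aa" and "BB" have the same hashCode in Java, so both start at the same slot.
        int first = table.put("Aa", "data A");
        int second = table.put("BB", "data E");
        System.out.println(first + " " + second); // 2 3
        System.out.println(table.get("BB"));      // data E
    }
}
```

Because the key is stored alongside the value, a lookup for "BB" that first lands on "Aa" sees the mismatch and keeps probing rather than returning the wrong data.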

Searching takes O(1) time only if there are no collisions in the hashtable, so it is incorrect to say that searching in a hashtable always takes O(1), or constant, time.
See how Hashtable works on MSDN.
EDIT
mgiuca explains it beautifully, and I am just adding one more collision resolution technique, which is called chaining.
In this technique, we maintain a linked list of values at each location. When you have a collision, your value is added to the linked list at that position, so when you search for a value there may be scenarios where you need to search the whole linked list; in that case searching will not be an O(1) operation.
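A bare-bones sketch of chaining, under the assumption of a fixed number of buckets and String keys. ChainedTable and its chainLength helper are invented names for illustration. Each bucket is a linked list of key/value entries, and a lookup walks only the list in the key's own bucket.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical separate-chaining table; no resizing, for illustration only.
class ChainedTable {
    private static final class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private final List<LinkedList<Entry>> buckets = new ArrayList<>();

    ChainedTable(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) buckets.add(new LinkedList<>());
    }

    private LinkedList<Entry> bucketFor(String key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    void put(String key, String value) {
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) { e.value = value; return; } // key exists: overwrite
        }
        bucketFor(key).add(new Entry(key, value)); // new key: append to the chain
    }

    String get(String key) {
        for (Entry e : bucketFor(key)) { // linear scan of one chain only
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    // Number of entries sharing this key's bucket.
    int chainLength(String key) { return bucketFor(key).size(); }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(10);
        table.put("Aa", "first");  // "Aa" and "BB" have equal hash codes in Java
        table.put("BB", "second");
        System.out.println(table.chainLength("Aa")); // 2
        System.out.println(table.get("BB"));         // second
    }
}
```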

Hash Tables - Java

I am about to do a homework assignment, and I need to store quite a lot of information (a dictionary) in a data structure of my choice. I heard people in my class saying hash tables are the way to go. How come?

Advantages
When you first hear about hash tables they sound too good to be true. The reason is that no matter how many items there are, searching and insertion (and sometimes deletion) take approximately O(1) time, which is pretty much instantaneous from the user's point of view. Given this speed, hash tables are used mainly, though not exclusively, in programs that need to look up thousands of items in less than a second (for example spell-checkers and search engines). Personally I find hash tables much easier to program than any sort of binary tree, and I am no expert, so if you are a beginner that might be an advantage too.
Disadvantages
As hash tables are based on arrays they can be difficult to expand once created. Also, I have read that for certain hash tables, once they are full or getting full, performance degrades noticeably. As a result of both, when programming you will need a fairly accurate idea of how many items you need to store. Additionally, it is not possible to visit the items in a hash table in order, for example from the smallest to the biggest, so if that is something you are looking for it might not be what you need.
Extra Info
Wikipedia articles - Hash Table - Big O Notation
Tutorial on Hash Tables - Tutorial
All how to's about Hash Tables - Java2S
Book Advice
I advise you to get a book called "Data Structures & Algorithms in Java - Second Edition - Robert Lafore". It's a big book, but it explains everything very clearly; for me it is the only programming book so far that I can read like a novel.
Additional info regarding Big O notation - O(1)
O(1) doesn't mean "pretty much instantaneous" (an O(1) algorithm could take hours, weeks or years). It means (in this case) "is independent of the size of the collection" (assuming the hash code is good enough). – Ben Lings
Thanks to Ben for his clarification.
P.S.: You might want to be more descriptive in the future when you ask a question; that way other users can pinpoint what you are looking for.

To help you out on deciding what type of collection is better for you, take a look at this Java Tutorials lesson:
Lesson: Introduction to Collections
Reading this you can see which collection fits your needs.

The best structure for your dictionary would be a prefix tree in which each node's 'key' is a letter from one of your words and the 'value' of the node where a word ends is the meaning of that word (the dictionary translation). Word lookup is linear in the word's length (the same as a hashtable, since your hash function would ideally be linear), or O(1) if we consider words as a whole. The advantage over a hash table is that a hash table will take a lot of space in order to ensure O(1) access and, depending on the words in the dictionary, it might be very sparsely populated. The prefix tree, on the other hand, can actually provide compression: common parts of words are shared along the tree structure, so the tree can hold all the original information in less space than before. Dictionaries usually have tens of thousands of words, which makes a prefix tree an attractive solution.
P.S. As mentioned earlier, the tree has almost infinite scalability, in contrast to a hash table.
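As a rough illustration of the prefix sharing described above, here is a small trie that maps words to meanings and counts its nodes. DictionaryTrie is a made-up class, and using a HashMap per node is just one simple assumption; how much memory a real trie saves depends heavily on how the nodes are represented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical word-to-meaning trie; a sketch, not a tuned implementation.
class DictionaryTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String meaning; // stays null unless a word ends at this node
    }

    private final Node root = new Node();
    private int nodeCount = 1; // counts the root

    void put(String word, String meaning) {
        Node node = root;
        for (char c : word.toCharArray()) {
            Node next = node.children.get(c);
            if (next == null) { // only letters not already on this path create nodes
                next = new Node();
                node.children.put(c, next);
                nodeCount++;
            }
            node = next;
        }
        node.meaning = meaning;
    }

    // Lookup follows one child link per letter, so it is linear in the word's length.
    String get(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node.meaning;
    }

    int nodeCount() { return nodeCount; }

    public static void main(String[] args) {
        DictionaryTrie trie = new DictionaryTrie();
        trie.put("car", "a vehicle");
        trie.put("card", "a piece of stiff paper");
        trie.put("care", "attention or concern");
        System.out.println(trie.get("card"));
        // 10 letters were inserted, but "car" is shared: root + c, a, r, d, e.
        System.out.println(trie.nodeCount()); // 6
    }
}
```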

It depends on what you want to store and how you want to access it. You don't really provide enough information.
Hash tables provide O(1) lookup times so they can be used to retrieve values based on a key very quickly. If the hashing algorithm is expensive you may find that it is outperformed by other data structures. This is especially true if you are doing a lot of inserting and removing of items from the structure.

If you are planning on using a hash table implementation from the Java libraries, be sure to note that there are two of them: Hashtable and HashMap. One of them is commonly used these days, and one is outdated and generally found in legacy code. Do some research to find out which is which, and why the newer one is better.

A hashtable allows you to map keys to objects.
If you're storing values that have unique keys, and you will need to look up the values by their keys, hashtables are the way to go.
If you just want to store an ordered set of objects without unique keys, an ordinary ArrayList is the way to go. (In particular, note that ordinary hashtables are unordered.)
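If you need both lookup by key and keys in sorted order, one option not mentioned above is java.util.TreeMap, which keeps its keys sorted at the cost of O(log n) operations. A tiny sketch, with OrderedLookup as a made-up helper class:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: view the keys of any map in sorted order via a TreeMap copy.
class OrderedLookup {
    static List<String> sortedKeys(Map<String, String> entries) {
        // TreeMap orders String keys by their natural (lexicographic) ordering.
        return new ArrayList<>(new TreeMap<>(entries).keySet());
    }

    public static void main(String[] args) {
        Map<String, String> words = new HashMap<>(); // iteration order unspecified
        words.put("pear", "a fruit");
        words.put("apple", "another fruit");
        words.put("fig", "yet another fruit");
        System.out.println(sortedKeys(words)); // [apple, fig, pear]
    }
}
```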

Hash tables are a good option, but when using one you have to decide on a good hash function. This question can have many answers and depends on the programmer. I personally feel you could check out a B+ tree or a trie. One of the main uses of a trie is dictionary representation. Trie in Wiki
Hope this helps!
