Data structures for HashMap, List and Set - Java

Can anyone please guide me to an in-depth look at the data structures used, and how they are implemented, in the List, Set and Map classes of the java.util collections package?
In interviews most of the questions are about the algorithms, but I have never seen the implementation details written up anywhere. Can anyone please share this information?

To learn how Java implements collections, the definitive place to go is the source code itself, which is freely available. Generally, lists are implemented as either arrays (ArrayList) or linked lists (LinkedList); sets are either hash tables (HashSet) or trees (TreeSet); and maps are hash tables (HashMap) or trees (TreeMap).
Algorithms for manipulating arrays, linked lists, hashtables, and binary or n-ary trees (add, remove, search, sort) are complex enough in themselves that an entire course is necessary to cover them all. Anyone doing their own program design typically needs to understand these algorithms and their performance tradeoffs by heart. There's no substitute here for textbook study and/or practice.

The source code of the API is available: get a JDK and open the src.zip file in the installation folder.

ArrayList: array
LinkedList: doubly linked list (Entry objects)
HashMap: array of Entry objects, each Entry pointing to a singly linked list
HashSet: internally uses a HashMap, storing the element as the key and a shared dummy Object as the value.
TreeMap: red-black tree of Entry objects.
TreeSet: internally uses a TreeMap, with the element as the key and a dummy object as the value.
*Entry: an internal class in these collections; it generally holds a key, a value, and references to other Entry objects.
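As an illustration of the HashSet point above, here is a minimal sketch (not the JDK source; the class name SimpleHashSet is made up) of a set that stores each element as a key in an internal HashMap, with one shared dummy object as the value:

import java.util.HashMap;

public class SimpleHashSet<E> {
    // Single shared dummy value stored against every key.
    private static final Object PRESENT = new Object();
    private final HashMap<E, Object> map = new HashMap<>();

    public boolean add(E e)      { return map.put(e, PRESENT) == null; }
    public boolean contains(E e) { return map.containsKey(e); }
    public boolean remove(E e)   { return map.remove(e) == PRESENT; }
    public int size()            { return map.size(); }
}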

You can always open the source files; it's all there. However, I wouldn't recommend it, as the sources are usually quite hard to understand. Instead, I'd try finding out the underlying data structure and looking that up. Wikipedia contains most of the information you want on these subjects, and Google covers the rest.
An ArrayList is just a dynamic array,
a Set is a... set,
and maps are usually hash tables keyed by each key's hash, with the entries stored as key-value pairs.
If you're going to dive into the source code, I'd recommend familiarizing yourself with "how it probably works" first, because otherwise it will be hard to understand, especially the hash table.
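For the "how it probably works" part, here is a rough sketch of how a hash table typically turns a key into a bucket index (assuming, as HashMap does, a table length that is a power of two; the class and method names are made up):

public class BucketIndexDemo {
    // Derive a bucket index from a key's hash code.
    static int bucketIndexFor(Object key, int tableLength) {
        int h = (key == null) ? 0 : key.hashCode();
        h ^= (h >>> 16);              // spread high bits into the low bits
        return h & (tableLength - 1); // same as h % tableLength when the length is a power of two
    }

    public static void main(String[] args) {
        System.out.println(bucketIndexFor("hello", 16)); // prints a value in [0, 15]
    }
}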

Related

Collection Map in Java

I want to write my own Map in Java. I know how a map works, but I don't really know where you can keep the keys and values. Can I keep them, for example, in a List? So the keys would be stored in one list and the values would be stored in another list?
Best would be if you checked out some of the concepts behind HashMap, TreeMap, LinkedHashMap etc.
Once you understand those concepts, you're far better prepared for writing your own map when it comes to speed.
In other words: unless you know the concepts of all available implementations, it is very unlikely your wheel-re-invention will be a better solution.
Also be sure to test your implementation very thoroughly, as collections are the backbone and heart of any good application.
Two very very simple (but slow) solutions are these:
1) As suggested above, you can use an ArrayList<Pair> and add your own getItemByKey() method (in Java commonly named 'get').
2) You can use two arrays, both kept at the same size, with keys and values matched by their respective indices.
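A sketch of option 1, assuming a hand-rolled Pair class (both SimpleListMap and Pair are made-up names); every operation scans the list linearly, which is why this is simple but slow:

import java.util.ArrayList;
import java.util.List;

public class SimpleListMap<K, V> {
    private static class Pair<K, V> {
        final K key;
        V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<Pair<K, V>> pairs = new ArrayList<>();

    public V get(K key) {                 // O(n) lookup
        for (Pair<K, V> p : pairs) {
            if (p.key.equals(key)) return p.value;
        }
        return null;
    }

    public void put(K key, V value) {     // O(n) insert or replace
        for (Pair<K, V> p : pairs) {
            if (p.key.equals(key)) { p.value = value; return; }
        }
        pairs.add(new Pair<>(key, value));
    }
}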
For the underlying data structure there is little better than an array of entries (key/value pairs), because the main goal of a map is to map objects to objects, i.e. keys to values.
Arrays give fast, constant O(1) access, but there is one small problem: when your map is full, you have to create a new array and copy the old entries over.
Note: HashMap resizes in the same way.
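To illustrate that grow-and-copy step, a small sketch (the class name is made up) that doubles the backing arrays when they fill up; HashMap's internal resize follows the same general pattern, though it also redistributes the entries into new buckets:

import java.util.Arrays;

public class GrowableEntryArray {
    private Object[] keys = new Object[8];
    private Object[] values = new Object[8];
    private int size = 0;

    public void add(Object key, Object value) {
        if (size == keys.length) {
            // Full: allocate bigger arrays and copy the old entries across.
            keys = Arrays.copyOf(keys, keys.length * 2);
            values = Arrays.copyOf(values, values.length * 2);
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }
}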

Is there a good Java library for generating order-preserving O(1) hash codes, based on a set of attributes and a comparator?

Given a set of attributes and a comparator, I'd like to generate an order-preserving hash code that provides O(1) access. Is there a Java library for this sort of thing, or would I have to design the hash function myself?
Try:
java.util.LinkedHashMap
There is no single collection that will do this. Depending on the detailed requirements there are several options to choose from.
For simplicity, I would just use a HashMap for lookups and when I need the sorted data, I'd make a copy of the values and sort it:
List<V> sorted = new ArrayList<>(hashMap.values());
Collections.sort(sorted, comparator);
This suffices for most real world use cases.
You could also write your own super-container that internally holds the elements in two collections, say a HashMap and a TreeSet. You can then provide access methods that use whichever contained collection better suits the purpose of the method. Just make sure that additions and removals affect both contained collections.
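A hypothetical sketch of that super-container (the name SortedLookupContainer is made up): lookups by key go through a HashMap, ordered iteration goes through a TreeSet, and every write touches both. It assumes the comparator never treats two distinct values as equal, since a TreeSet would silently drop one of them:

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class SortedLookupContainer<K, V> {
    private final Map<K, V> byKey = new HashMap<>();   // fast lookup by key
    private final SortedSet<V> sorted;                 // ordered view of the values

    public SortedLookupContainer(Comparator<V> order) {
        this.sorted = new TreeSet<>(order);
    }

    public void put(K key, V value) {
        V previous = byKey.put(key, value);
        if (previous != null) sorted.remove(previous); // keep both structures in sync
        sorted.add(value);
    }

    public V get(K key) { return byKey.get(key); }     // expected O(1)

    public V remove(K key) {
        V removed = byKey.remove(key);
        if (removed != null) sorted.remove(removed);
        return removed;
    }

    public Iterable<V> inOrder() { return sorted; }    // iterate in comparator order
}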

Java - Is it common practice to use a hashtable (e.g. HashMap) to map objects to themselves?

I'm making a Java application that is going to store a bunch of random words (which can be added to or deleted from the application at any time). I want fast lookups to see whether a given word is in the dictionary or not. What would be the best Java data structure to use for this? As of now I was thinking about using a HashMap and using the same word as both the key and the value for that key. Is this common practice? Using the same string for both the key and the value in a (key, value) pair seems weird to me, so I wanted to make sure there wasn't some better idea I was overlooking.
I was also thinking about alternatively using a TreeMap to keep the words sorted, giving me an O(lg n) lookup time, but the HashMap should give an expected O(1) lookup time as I understand it, so I figured that would be better.
So basically I just want to make sure the HashMap idea, with the strings doubling as both key and value in each (key, value) pair, would be a good decision. Thanks.
I want fast lookups to see whether a given word is in the dictionary or not. What would be the best java data structure to use for this?
This is the textbook use case of a Set. You can use a HashSet. The naive implementation of a Set<T> uses a corresponding Map<T, Object> and simply marks whether an entry exists or not.
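A straightforward usage sketch of that advice (the Dictionary class name is made up): words can be added or removed at any time, and membership checks are expected O(1):

import java.util.HashSet;
import java.util.Set;

public class Dictionary {
    private final Set<String> words = new HashSet<>();

    public void addWord(String word)     { words.add(word); }
    public void removeWord(String word)  { words.remove(word); }
    public boolean contains(String word) { return words.contains(word); }

    public static void main(String[] args) {
        Dictionary d = new Dictionary();
        d.addWord("apple");
        System.out.println(d.contains("apple")); // true
        System.out.println(d.contains("pear"));  // false
    }
}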
If you're storing it as a collection of words in a dictionary, I'd suggest taking a look at tries. They can require less memory than a Set and have quick lookup times, worst case O(string length).
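For reference, a minimal trie sketch; this simple version uses a HashMap per node, so it doesn't show the memory savings a packed implementation would give, but it does show the O(word length) lookup:

import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a word ends at this node
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false; // path breaks off: word not present
        }
        return node.isWord;
    }
}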
Any class that is a Set should serve your purpose. However, do note that a Set will not allow duplicates; for that matter, even a Map won't allow duplicate keys. I would suggest using an ArrayList (assuming synchronization is not needed) if you need to add duplicate entries and treat them as separate.
My only concern would be memory: if you use the HashSet and you have a very large collection of words, then you will have to load the entire collection into memory. If that's not a problem (and your collection would have to be very large for it to be one), then the HashSet should be fine. If you do have a very large collection of words, you can try using a tree and only load into memory the parts you are interested in.
Also keep in mind that insertion into the tree is fast, but not as fast as into the HashSet; for the tree to work, Java keeps every element in sorted order. Again, nothing major, but if you add a lot of words at a time, you may want to weigh that before choosing a tree.

Any suggestions for speeding up Java code that clones hash maps?

I have a Java class that contains a hash map as a member. Many objects of this class are created. In many of these cases, one object of this type is cloned into another object and then changed. The cloning is required because the changes modify the hash map, and I need to keep the original object's hash map intact.
I am wondering if anyone has suggestions for how to speed up the cloning part, or maybe a trick to avoid it. When I profile the code, most of the time is spent on cloning these hash maps (which usually hold a very small set of values, a few hundred or so).
(I am currently using the colt OpenIntDoubleHashMap implementation.)
You should use a more effective data structure for this. Look at the pcollections library (http://code.google.com/p/pcollections/); its PMap structure provides persistent, immutable maps.
UPDATE
If your map is quite small (you said only a few hundred entries), maybe two plain arrays would be more effective:
int[] keys = new int[size];
double[] values = new double[size];
In this case, to clone the map you just need to use System.arraycopy, which should be very fast.
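A small sketch of that parallel-array idea (the class name is made up, and the lookup/insert logic is omitted); cloning reduces to two flat array copies:

public class IntDoubleArrayMap {
    private int[] keys;
    private double[] values;
    private int size;

    public IntDoubleArrayMap(int capacity) {
        keys = new int[capacity];
        values = new double[capacity];
    }

    private IntDoubleArrayMap() { } // used internally by copy()

    // Cloning is just two System.arraycopy calls.
    public IntDoubleArrayMap copy() {
        IntDoubleArrayMap c = new IntDoubleArrayMap();
        c.keys = new int[keys.length];
        c.values = new double[values.length];
        System.arraycopy(keys, 0, c.keys, 0, size);
        System.arraycopy(values, 0, c.values, 0, size);
        c.size = size;
        return c;
    }
}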
Maybe implement a copy-on-write wrapper for your map if the original only changes occasionally.
If only a small fraction of the objects change, you could implement a two-layer structure:
Layer 1 is the original map.
Layer 2 keeps the changed elements only.
Any object from the original map that needs to change gets cloned, modified and put into the layer-2 map.
Lookups first consult the layer-2 map and, if the object is not found, fall back to the layer-1 map.
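A hypothetical sketch of that two-layer map (the LayeredMap name is made up; removals and null values are not handled): reads fall through from the overlay of changed entries to the shared base map, and writes only touch the overlay, so the base stays intact for every object sharing it:

import java.util.HashMap;
import java.util.Map;

public class LayeredMap<K, V> {
    private final Map<K, V> base;                       // layer 1: original, never modified
    private final Map<K, V> overlay = new HashMap<>();  // layer 2: changed entries only

    public LayeredMap(Map<K, V> base) { this.base = base; }

    public V get(K key) {
        V v = overlay.get(key);                          // consult layer 2 first
        return (v != null) ? v : base.get(key);          // fall back to layer 1
    }

    public void put(K key, V value) { overlay.put(key, value); }
}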

Do all hash-based data structures in Java use the 'bucket' concept?

The hash structures I am aware of are Hashtable, HashSet and HashMap.
Do they all use the bucket structure, i.e. when two hash codes are exactly the same, one element does not overwrite the other; instead, both are placed in the same bucket associated with that hash code?
In Sun's current implementation of the Java library, IdentityHashMap and the internal implementation in ThreadLocal use probing structures.
The general problem with probing hash tables in Java is that hashCode and equals may be relatively expensive. Therefore you want to cache the hash value. You can't have an array that mixes references and primitives, so you'd need to do something relatively complicated. On the other hand, if you are using == to check matches, then you can check many references without a performance problem.
IIRC, Azul had a fast concurrent quadratic probing hash map.
A linked list is used at each bucket to deal with hash collisions. Note that the Java HashSet is actually implemented by a HashMap underneath (all keys being mapped to the same singleton value across all HashSets) and hence uses the same bucket structure.
When an element is added, its equality is checked against all items in the linked list (via .equals) before it is appended at the end. Hence hash collisions are particularly bad: the check becomes increasingly expensive as the linked list grows.
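A rough sketch of that chaining behaviour (the ChainedHashSet name is made up, and resizing is omitted): each bucket holds a linked list, and an insert walks the list calling equals() before appending, which is exactly why long chains hurt:

import java.util.LinkedList;

public class ChainedHashSet<E> {
    private final LinkedList<E>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    public boolean add(E element) {
        LinkedList<E> bucket = buckets[Math.floorMod(element.hashCode(), buckets.length)];
        for (E existing : bucket) {
            if (existing.equals(element)) return false; // already present in this chain
        }
        return bucket.add(element);                     // append to the end of the chain
    }
}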
I believe Java's hash structures all use a form of chaining to deal with collisions when hashing, which places the items that have the same hash into a list.
I do not believe that Java uses open addressing for its hash-based data structures (open addressing recomputes hashes based on retry sequences until it finds an open slot in the table).
No; open addressing is an alternative way of representing hash tables, where objects are stored directly in the table instead of residing in a linked list. Only one object can be stored at a given index, so resolving collisions is more complicated.
When adding an object for which another object already resides at the same index, a probing sequence is used to determine the new index at which to store the new object. Removing objects is also more complicated, since if you remove an object you need to leave a marker that says "there used to be an object here"; for more details, see Wikipedia.
Open addressing is preferable when the objects being stored are small and will rarely be deleted. Open addressing has better cache performance, since you don't need to go through an extra level of indirection walking a linked list.
The classes you mentioned (Hashtable, HashSet, and HashMap) don't use open addressing, but you could easily create new classes that implement open addressing and provide the same APIs as those classes.
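To make the probing idea concrete, a small sketch of a set using linear probing (the class name is made up; removal, and therefore the tombstone markers mentioned above, is omitted, and the table is assumed never to fill completely):

public class LinearProbingSet {
    private final Object[] table; // entries live directly in the table

    public LinearProbingSet(int capacity) { table = new Object[capacity]; }

    public boolean add(Object element) {
        int i = Math.floorMod(element.hashCode(), table.length);
        while (table[i] != null) {
            if (table[i].equals(element)) return false; // already present
            i = (i + 1) % table.length;                 // probe the next slot
        }
        table[i] = element;
        return true;
    }

    public boolean contains(Object element) {
        int i = Math.floorMod(element.hashCode(), table.length);
        while (table[i] != null) {
            if (table[i].equals(element)) return true;
            i = (i + 1) % table.length;
        }
        return false; // hit an empty slot: not present
    }
}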
The APIs define the behaviour; the internals of how hash collisions are managed don't affect the guarantees of the API. The performance impact of a bad hash value computation is another story. Let's just hash everything to 42 and see how it behaves.
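Taking that joke literally, a quick sketch (ConstantHashKey is a made-up class) showing that a key type whose hashCode() always returns 42 still behaves correctly in a HashMap; every entry just lands in the same bucket, so lookups degrade from expected O(1) towards O(n):

import java.util.HashMap;
import java.util.Map;

public class ConstantHashKey {
    private final String name;

    public ConstantHashKey(String name) { this.name = name; }

    @Override public int hashCode() { return 42; } // deliberately terrible hash
    @Override public boolean equals(Object o) {
        return o instanceof ConstantHashKey && ((ConstantHashKey) o).name.equals(name);
    }

    public static void main(String[] args) {
        Map<ConstantHashKey, Integer> map = new HashMap<>();
        map.put(new ConstantHashKey("a"), 1);
        map.put(new ConstantHashKey("b"), 2);
        System.out.println(map.get(new ConstantHashKey("a"))); // still prints 1
    }
}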
Map and Set are the interfaces that determine the behaviour of a HashMap or HashSet. A HashSet is a Set, and so it behaves like a Set (i.e. duplicates are not allowed). A HashMap acts like a Map: it will not overwrite a key that merely has the same hash code, but it will overwrite the value if the exact same key is used again. This will be the same regardless of what data structure is backing the Map internally. See the javadoc for Set and HashMap for more.
Did you mean to ask something about the specific implementation of one of these structures?
Except the HashSet: a Set by definition contains unique elements.
(This was a mistake; please see the comments below.)
