I am trying to get my head around the details of a HAMT. I've implemented one myself in Java just to understand it. I am familiar with tries and I think I get the main concept of the HAMT.
Basically,
Two types of nodes:
Key/Value
Key Value Node:
K key
V value
Index
Index Node:
int bitmap (32 bits)
Node[] table (max length of 32)
Generate a 32-bit hash for an object.
Step through the hash 5-bits at a time. (0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-31)
note: the last step (7th) is only 2 bits.
At each step, find the position of that 5-bit integer in the bitmap. e.g. integer==5 bitmap==00001
If the bit is a 1 then that part of the hash exists.
If the bit is a 0 then the key doesn't exist.
If the key exists, find its index into the table by counting the number of 1s in the bitmap between 0 and the position. e.g. integer==6 bitmap==0101010101 index==3
If the table points to a key/value node then compare the keys.
If the table points to an index node then go back to step 2 (stepping through the hash), moving ahead one step.
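A minimal Java sketch of the bitmap steps above, assuming a 32-bit hash consumed in 5-bit fragments from the low bits upward. The class and method names are hypothetical; they are not taken from the paper or from any particular implementation.

```java
// Hypothetical helper names; a sketch of the bitmap arithmetic only.
final class HamtBitmapSketch {
    // Extracts the 5-bit fragment for the given level (0 = lowest 5 bits).
    // Level 6 yields only the top 2 bits of a 32-bit hash.
    static int fragment(int hash, int level) {
        return (hash >>> (5 * level)) & 0x1F;
    }

    // True if the bitmap has a child for this fragment.
    static boolean isPresent(int bitmap, int fragment) {
        return (bitmap & (1 << fragment)) != 0;
    }

    // Index into the compact table: the number of 1 bits below the fragment's position.
    static int tableIndex(int bitmap, int fragment) {
        return Integer.bitCount(bitmap & ((1 << fragment) - 1));
    }
}
```

With bitmap 0101010101 and fragment 6, tableIndex returns 3, which matches the example in the list above.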
The part I don't quite understand is collision detection and mitigation. In the linked paper he alludes to it:
The existing key is then inserted in the new sub-hash table and the
new key added. Each time 5 more bits of the hash are used the
probability of a collision reduces by a factor of 1/32. Occasionally
an entire 32 bit hash may be consumed and a new one must be computed
to differentiate the two keys.
If I were to compute a "new" hash and store the object at that new hash; how would you ever be able to look-up the object in the structure? When doing a look-up, wouldn't it generate the "initial" hash and not the "re-computed hash".
I must be missing something.....
BTW: The HAMT performs fairly well; it sits between a hash map and a tree map in my tests.
Data Structure        Add time   Remove time  Sorted add time  Sorted remove time  Lookup time  Size
Java's Hash Map       38.67 ms   18 ms        30 ms            15 ms               16.33 ms     8.19 MB
HAMT                  68.67 ms   30 ms        57.33 ms         35.33 ms            33 ms        10.18 MB
Java's Tree Map       86.33 ms   78 ms        75 ms            39.33 ms            76 ms        8.79 MB
Java's Skip List Map  111.33 ms  106 ms       72 ms            41 ms               72.33 ms     9.19 MB
HAMT is a great, highly performant structure, especially when one needs immutable objects, i.e. when every modification creates a new copy of the data structure.
As for your question on hash collisions, I have found a C# implementation (which is currently buggy) that shows how it works: on each hash collision the key is rehashed and the lookup is retried recursively until a maximum iteration limit is reached.
Currently I am also exploring HAMT in a functional programming context and studying existing code. There are several reference implementations of HAMT, in Haskell as Data.HashMap and in Clojure as PersistentHashMap.
There are some other, simpler implementations on the web that do not deal with collisions, but they are useful for understanding the concept. Here they are in Haskell and OCaml.
I have found a nice summary article that describes HAMT with pictures and links to the original research papers by Phil Bagwell.
Related points:
While implementing HAMT in F# I noticed that the popCount function implementation described here really matters: it gives a 10-15% improvement over the naive implementations described in the subsequent answers in the link. Not great, but a free lunch.
Related IntMap structures (Haskell and its port to F#) are very good when the key can be an integer; they implement the related PATRICIA/radix trie.
I believe all these implementations are very good for learning efficient immutable data structures, and functional languages show all their beauty on these examples. They really shine together!
There are two sections of the paper I think you might have missed. The first is the bit immediately preceding the bit you quoted:
Or the key will collide with an existing one. In which case the existing key
must be replaced with a sub-hash table and the next 5 bit hash of the existing key
computed. If there is still a collision then this process is repeated until no collision
occurs.
So if you have object A in the table and you add object B which clashes, the cell at which their keys clashed will be a pointer to another subtable (where they don't clash).
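To make the "subtable where they don't clash" idea concrete, here is a small Java sketch that finds the level at which two hashes first land in different slots. It assumes the 5-bits-per-level scheme from the question; the class and method names are made up for illustration.

```java
// Hypothetical names; assumes a 32-bit hash split into 5-bit fragments.
final class HamtSplitSketch {
    static final int BITS_PER_LEVEL = 5;
    static final int LEVELS = 7; // six full fragments plus a final 2-bit one

    // Returns the first level at which the two hashes fall into different
    // slots, or -1 if the hashes are identical (a full 32-bit collision).
    static int firstDivergingLevel(int hashA, int hashB) {
        for (int level = 0; level < LEVELS; level++) {
            int fragA = (hashA >>> (BITS_PER_LEVEL * level)) & 0x1F;
            int fragB = (hashB >>> (BITS_PER_LEVEL * level)) & 0x1F;
            if (fragA != fragB) {
                return level;
            }
        }
        return -1;
    }
}
```

A result of -1 is the case the paper handles by computing a new hash: no amount of splitting on the first 32 bits separates the two keys.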
Next, Section 3.7 of the paper you linked describes the method for generating a new hash when you run off the end of your first 32 bits:
The hash function was tailored to give a 32 bit hash. The algorithm requires that
the hash can be extended to an arbitrary number of bits. This was accomplished by
rehashing the key combined with an integer representing the trie level, zero being
the root. Hence if two keys do give the same initial hash then the rehash has a
probability of 1 in 2^32 of a further collision.
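One way to picture the rehash in Java is sketched below. It is only an illustration under stated assumptions: the keys are strings, the hash is an FNV-1a style function seeded with a "round" number (trie depth divided by 7), and all names are hypothetical. The paper only says the key is rehashed combined with the trie level; it does not prescribe this function.

```java
// Hypothetical sketch: a seeded string hash that can be extended past 32 bits.
final class LevelRehashSketch {
    static final int BITS_PER_STEP = 5;
    static final int STEPS_PER_HASH = 7; // six 5-bit steps plus a final 2-bit step

    // FNV-1a style hash over the characters, seeded with the rehash round.
    // The seeding scheme is an assumption made for this example.
    static int seededHash(String key, int round) {
        int h = 0x811C9DC5 ^ round;
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    // Fragment used at a given trie depth. Depths 0-6 use round 0,
    // depths 7-13 use round 1, and so on.
    static int fragmentAt(String key, int depth) {
        int round = depth / STEPS_PER_HASH;
        int step = depth % STEPS_PER_HASH;
        return (seededHash(key, round) >>> (BITS_PER_STEP * step)) & 0x1F;
    }
}
```

Because the round is derived from the depth alone, insertion and lookup compute exactly the same sequence of fragments for a key, so nothing is ever stored "at" a hash that the lookup cannot regenerate.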
If this doesn't seem to explain anything, say and I'll extend this answer with more detail.
If I were to compute a "new" hash and store the object at that new
hash; how would you ever be able to look-up the object in the
structure? When doing a look-up, wouldn't it generate the "initial"
hash and not the "re-computed hash".
When doing a look-up the initial hash is used. When the bits in the initial
hash are exhausted, one of the following conditions is true:
we end up with a key/value node - return it
we end up with an index node - this is the hint that we have to go
deeper by computing a new hash.
The key here is hash bits exhaustion.
The chance of collision is presumably very low, and generally only problematic for huge trees. Given this, you're better off just storing collisions in an array at the leaf and searching it linearly (I do this in my C# HAMT).
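A rough Java sketch of that alternative: a leaf that keeps every entry sharing the same full hash in small parallel arrays and scans them linearly. The class is hypothetical and is not the C# implementation mentioned above.

```java
import java.util.Arrays;

// Hypothetical collision leaf: entries whose full hashes are equal live side by side.
final class CollisionLeafSketch<K, V> {
    private Object[] keys = new Object[0];
    private Object[] values = new Object[0];

    // Linear search; returns null when the key is absent.
    @SuppressWarnings("unchecked")
    V get(K key) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) {
                return (V) values[i];
            }
        }
        return null;
    }

    // Replaces the value of an existing key or appends a new entry.
    void put(K key, V value) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) {
                values[i] = value;
                return;
            }
        }
        keys = Arrays.copyOf(keys, keys.length + 1);
        values = Arrays.copyOf(values, values.length + 1);
        keys[keys.length - 1] = key;
        values[values.length - 1] = value;
    }

    int size() {
        return keys.length;
    }
}
```

In Java, "Aa" and "BB" have the same hashCode(), so they make a convenient pair for exercising such a leaf.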
I want to store 1*10^8 objects in a map for searching. When my program starts, it will read these objects and store them in a map. After reading ends, this map is never updated until the program exits. I don't want the JVM to discard any of them. I have learned that HashMap wastes a lot of memory; is there any type of map that can store this many objects and save memory?
I also know that the JVM will scan these objects, which wastes time. How can I avoid this?
Sorry, the situation is this: I am writing a bolt with Apache Storm. I want to read data from databases. When a bolt is processing a tuple, I need to do calculations with the data from the databases. For performance I have to keep the data in memory. I know the JVM is not good at managing a lot of memory, so maybe I should try Koloboke?
A HashMap needs to allocate a backing array of sufficient size in order to minimize hash collisions. It can happen that two or more objects that are not equal have the same hash code; the probability of this depends on the quality of the hash function. Collisions are resolved by techniques such as linear probing, which stores the entry at the next unoccupied index (hash + i) mod length; quadratic probing, which stores the entry at the next unoccupied index (hash + i^k) mod length; and separate chaining, which stores a linked list of entries in each bucket. The collision probability is decreased by increasing the length of the backing array, which wastes memory.
However, you can use a TreeMap, which stores entries in a tree structure and creates exactly as many nodes as there are entries, i.e. it uses memory efficiently.
Note that there is a difference in the complexity of the get, put and remove operations. HashMap has O(1) complexity on average, while TreeMap has O(log n) complexity.
Suppose you want to get an entry from a map of size 100 000 000. In the worst case (the element to be found is a leaf, i.e. it is located at the last level of the tree), the path that has to be followed down the tree has a length of about log2(100 000 000) ≈ 27 in a perfectly balanced binary tree. (The base-10 logarithm is 8, but TreeMap is a binary red-black tree, which can be up to about twice as deep as a perfectly balanced one.)
Well, I am back.
At first I used about 30 GB to store about 5x10^7 key-value entries, but GC was not stable. I had made the mistake of using a String to store a double; it is bigger than a double in memory, and a char is 16 bits in Java. After I fixed this mistake GC was better, but still not good enough. Finally I used 'filedb' in MapDB to fix this.
Before starting to explain my problem, I should mention that I am not looking for a way to increase Java heap memory. I should strictly store these objects.
I am working on storing a huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (with length 32 or less) consist of the chars 'A', 'C', 'G', 'T', and 'N' (undefined). As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower-level languages like C and C++. Thus, if I store this sequence as a String (it holds about 100 MB memory for a sequence with length ~30), I see the error.
I tried to represent nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and neglect 'N' (because it ruins the char to 2-bit transform as the 5-th acid). Then, concatenate these 2-bit acids into byte array. It brought some improvement but unfortunately I see the error after a couple of hours again. I need a convenient solution or at least a workaround to handle this error. Thank you in advance.
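For reference, the 2-bit scheme described above could look roughly like this in Java when packed into a single long instead of a byte array (32 bases * 2 bits = 64 bits). This is a sketch with hypothetical names; like the approach described, it rejects 'N'.

```java
// Hypothetical sketch of the 2-bit encoding: 'A'=00, 'C'=01, 'G'=10, 'T'=11.
final class Dna2BitSketch {
    private static final String BASES = "ACGT"; // position in this string = 2-bit code

    // Packs up to 32 bases into one long; any other character (including 'N') is rejected.
    static long pack(String seq) {
        if (seq.length() > 32) {
            throw new IllegalArgumentException("at most 32 bases fit in 64 bits");
        }
        long packed = 0L;
        for (int i = 0; i < seq.length(); i++) {
            int code = BASES.indexOf(seq.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("unsupported base: " + seq.charAt(i));
            }
            packed = (packed << 2) | code;
        }
        return packed;
    }

    // The length has to be kept separately, because leading 'A's encode as zero bits.
    static String unpack(long packed, int length) {
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = BASES.charAt((int) (packed & 3L));
            packed >>>= 2;
        }
        return new String(out);
    }
}
```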
This may be a weird idea, being fairly complex, and it would require quite a lot of work, but this is what I would try:
You already pointed out two individual subproblems of your overall task:
the default HashMap implementation may be suboptimal for such large collection sizes
you need to store something else than strings
The map implementation
I would recommend writing a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack your whole string into a single long; let's stick to two longs per key. You can convert a string to and from long[2] fairly efficiently on the fly when a string key is provided to your map implementation (use bit shifts etc.).
As for packing the values, here are some considerations:
For key-value pairs a standard hash map needs an array of N longs (references) for buckets, where N is the current capacity; once the bucket is found from the hash code, it needs a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way:
use a long[] of size 3N, where N is the capacity, to store both keys and values in one contiguous array
in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1 you store the long[2] representation of the first (or only) key that maps to this bucket (written on insertion, zero otherwise); at location 3 * (hashcode % N) + 2 you store the corresponding count
for all those cases where a different key results in the same hash code and thus the same bucket, you store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above large enough (resizing accordingly) that by far the largest part of the data lives in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hash map
do not expand the capacity in N=2N steps; make smaller growth steps, e.g. 10-20%. This will cost performance when populating the map, but will keep your memory footprint under control
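A compact Java sketch of the layout in the list above, under several assumptions: keys arrive already packed as two longs, a zero count marks an empty bucket (instead of a zero key), the fallback map is keyed by a List<Long> rather than a dedicated wrapper class, and resizing is left out. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one contiguous long[] for the common case, a HashMap for bucket clashes.
final class FlatCountMapSketch {
    private final int capacity;
    private final long[] slots; // per bucket: key word 0, key word 1, count
    private final Map<List<Long>, Long> overflow = new HashMap<>();

    FlatCountMapSketch(int capacity) {
        this.capacity = capacity;
        this.slots = new long[3 * capacity];
    }

    // Start offset of the bucket for this key inside the flat array.
    private int base(long k0, long k1) {
        long h = 31 * k0 + k1;
        return 3 * (int) Math.floorMod(h ^ (h >>> 32), (long) capacity);
    }

    void increment(long k0, long k1) {
        int b = base(k0, k1);
        if (slots[b + 2] == 0) {                 // empty bucket: claim it
            slots[b] = k0;
            slots[b + 1] = k1;
            slots[b + 2] = 1;
        } else if (slots[b] == k0 && slots[b + 1] == k1) {
            slots[b + 2]++;                      // same key: bump the count in place
        } else {
            overflow.merge(Arrays.asList(k0, k1), 1L, Long::sum); // clash: fall back
        }
    }

    long count(long k0, long k1) {
        int b = base(k0, k1);
        if (slots[b + 2] != 0 && slots[b] == k0 && slots[b + 1] == k1) {
            return slots[b + 2];
        }
        return overflow.getOrDefault(Arrays.asList(k0, k1), 0L);
    }
}
```

The fallback map still boxes its keys and values, which is acceptable only as long as the flat array is sized so that few keys end up there.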
The keys
Given the inequality 5^32 > 2^64, your idea of using bits to encode the 5 letters seems to be the best I can think of right now. Use 3 bits per letter and, correspondingly, a long[2].
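A possible Java sketch of the 3-bit variant, assuming code 0 is reserved as padding so the sequence length does not need to be stored, and assuming 21 symbols per long (63 bits) so that no symbol straddles the two words. The names and the exact layout are illustrative choices, not something prescribed above.

```java
// Hypothetical sketch: 3 bits per symbol, up to 32 symbols in a long[2].
final class Dna3BitSketch {
    private static final String SYMBOLS = " ACGTN"; // code 0 (the space) is padding
    private static final int PER_WORD = 21;          // 21 symbols * 3 bits = 63 bits

    static long[] pack(String seq) {
        if (seq.length() > 32) {
            throw new IllegalArgumentException("sequence longer than 32");
        }
        long[] words = new long[2];
        for (int i = 0; i < seq.length(); i++) {
            long code = SYMBOLS.indexOf(seq.charAt(i));
            if (code <= 0) {
                throw new IllegalArgumentException("unsupported symbol: " + seq.charAt(i));
            }
            words[i / PER_WORD] |= code << (3 * (i % PER_WORD));
        }
        return words;
    }

    // Reads symbols until the first padding code.
    static String unpack(long[] words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2 * PER_WORD; i++) {
            int code = (int) ((words[i / PER_WORD] >>> (3 * (i % PER_WORD))) & 7L);
            if (code == 0) {
                break;
            }
            sb.append(SYMBOLS.charAt(code));
        }
        return sb.toString();
    }
}
```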
I recommend you look into the Trove4j Collections API; it offers Collections that hold primitives which will use less memory than their boxed, wrapper classes.
Specifically, you should check out their TObjectIntHashMap.
Also, I wouldn't recommend storing anything as a String or char until JDK 9 is released, as the backing char array of a String is UTF-16 encoded, using two bytes per char. JDK 9 introduces compact strings, which store a String with one byte per character (Latin-1) when all of its characters fit, and fall back to UTF-16 otherwise.
If you're working with on the order of ~10 GB of data, or at least data with an in-memory representation of ~10 GB, then you might need to think of ways to write the data you don't need at the moment to disk and load individual portions of your dataset into memory to work on them.
I had this exact problem a few years ago when I was conducting research with Monte Carlo simulations, so I wrote a Java data structure to solve it. You can clone/fork the source here: github.com/tylerparsons/surfdep
The library supports both MySQL and SQLite as the underlying database. If you don't have either, I'd recommend SQLite as it's much quicker to set up.
Full disclaimer: this is not the most efficient implementation, but it will handle very large datasets if you let it run for a few hours. I tested it successfully with matrices of up to 1 billion elements on my Windows laptop.
Let me put the question first: considering the situation and requirements I'll describe further down, what data structures would make sense/help achieving the non-functional requirements?
I tried to look up several structures but wasn't very successful so far, which might be due to me missing some terminology.
Since we'll implement that in Java any answers should take that into account (e.g. no pointer-magic, assume 8-byte references etc.).
The situation
We have somewhat large set of values that are mapped via a 4-dimensional key (let's call those dimensions A, B, C and D). Each dimension can have a different size, so we'll assume the following:
A: 100
B: 5
C: 10000
D: 2
This means a completely filled structure would contain 10 million elements. Not considering their size, the space needed to hold the references alone would be about 80 megabytes, so that would be considered a lower bound for memory consumption.
We can further assume that the structure won't be completely filled, but it will be quite dense.
The requirements
Since we build and query that structure quite often we have the following requirements:
constructing the structure should be fast
queries on single elements and ranges (e.g. [A1-A5, B3, any C, D0]) should be efficient
fast deletion of elements isn't required (won't happen too often)
the memory footprint should be low
What we already considered
kd-trees
Building such a tree takes some time since it can get quite deep and we'd either have to accept slower queries or take rebalancing measures. Additionally, the memory footprint is quite high since we need to hold the complete key in each node (there might be ways to reduce that though).
Nested maps/map tree
Using nested maps we could store only the key for each dimension as well as a reference to the next dimension map or the values - effectively building a tree out of those maps. To support range queries we'd keep sorted sets of the possible keys and access those while traversing the tree.
Construction and queries were way faster than with kd-trees but the memory footprint was much higher (as expected).
A single large map
An alternative would be to keep the sets for individual available keys and use a single large map instead.
Construction and queries were fast as well but memory consumption was even higher due to each map node being larger (they need to hold all dimensions of a key now).
What we're thinking of at the moment
Building insertion-order index-maps for the dimension keys, i.e. we map each incoming key to a new integer index as it comes in. Thus we can make sure that those indices grow one step at a time without any gaps (not considering deletions).
With those indices we'd then access a tree of n-dimensional arrays (flattened to a 1-d array of course) - aka n-ary tree. That tree would grow on demand, i.e. if we need a new array then instead of creating a larger one and copying all the data we'd just create the new block. Any needed non-leaf nodes would be created on demand, replacing the root if needed.
Let me illustrate that with an example of 2 dimensions A and B. We'll allocate 2 elements for each dimension resulting in a 2x2 matrix (array of length 4).
Adding the first element A1/B1 we'd get something like this:
[A1/B1,null,null,null]
Now we add element A2/B2:
[A1/B1,null,A2/B2,null]
Now we add element A3/B3. Since we can't map the new element to the existing array we'll create a new one as well as a common root:
[x,null,x,null]
/ \
[A1/B1,null,A2/B2,null] [A3/B3,null,null,null]
Memory consumption for densely filled matrices should be rather low depending on the size of each array (having 4 dimensions and 4 values per dimension in an array we'd have arrays of length 256 and thus get a maximum tree depth of 2-4 in most cases).
Does this make sense?
If the structure will be "quite densely" filled, then I think it makes sense to assume that it will be full. That simplifies things quite a bit. And it's not like you're going to save a lot (or anything) using a sparse matrix representation of a densely filled matrix.
I'd try the simplest possible structure first. It might not be the most memory efficient, but it should be reasonable and quite easy to work with.
First, a simple array of 10,000,000 references. That is (and please pardon the C#, as I'm not really a Java programmer):
MyStructure[] theArray = new MyStructure[10000000];
As you say, that's going to consume 80 megabytes.
Next is four different dictionaries (maps, I think, in Java), one for each key type:
Dictionary<KeyAType, int> ADict;
Dictionary<KeyBType, int> BDict;
Dictionary<KeyCType, int> CDict;
Dictionary<KeyDType, int> DDict;
When you add an element at {A,B,C,D}, you look up the respective keys in the dictionary to get their indexes (or add a new index if that key doesn't exist), and do the math to compute an index into the array. The math is, I think:
DIndex + 2*(CIndex + 10000*(BIndex + 5*AIndex));
In .NET, dictionary overhead is something like 24 bytes per key. But you only have 10,107 total keys (100 + 5 + 10,000 + 2), so the dictionaries are going to consume something like 250 kilobytes.
This should be very quick to query directly, and range queries should be as fast as a single lookup and then some array manipulation.
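Since the question targets Java, here is a rough Java rendering of the same idea: per-dimension index dictionaries plus the flat-index arithmetic. The dimension sizes come from the question; the class and method names are hypothetical, and the range helper only illustrates a query of the shape [A lo..hi, fixed B, any C, fixed D].

```java
import java.util.Map;

// Hypothetical sketch of the flat-array addressing described above.
final class DenseGridSketch {
    // Dimension sizes from the question: A=100, B=5, C=10000, D=2.
    static final int B = 5, C = 10_000, D = 2;

    // Same arithmetic as the formula above.
    static int flatIndex(int a, int b, int c, int d) {
        return d + D * (c + C * (b + B * a));
    }

    // Assigns each previously unseen key the next free index, in insertion order.
    static <K> int indexOf(Map<K, Integer> dict, K key) {
        Integer idx = dict.get(key);
        if (idx == null) {
            idx = dict.size();
            dict.put(key, idx);
        }
        return idx;
    }

    // Flat indices touched by a query over an inclusive A range, one B, every C, one D.
    static int[] rangeIndices(int aLo, int aHi, int b, int d) {
        int[] out = new int[(aHi - aLo + 1) * C];
        int n = 0;
        for (int a = aLo; a <= aHi; a++) {
            for (int c = 0; c < C; c++) {
                out[n++] = flatIndex(a, b, c, d);
            }
        }
        return out;
    }
}
```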
One thing I'm not clear on is if you want a key, to resolve to the same index with every build. That is, if "foo" maps to index 1 in one build, will it always map to index 1?
If so, you probably should statically construct the dictionaries. I guess it depends on if your range queries always expect things in the same key order.
Anyway, this is a very simple and very effective data structure. If you can afford 81 megabytes as the maximum size of the structure (minus the actual data), it seems like a good place to start. You could probably have it working in a couple of hours.
At best it's all you'll have to do. And if you end up having to replace it, at least you have a working implementation that you can use to verify the correctness of whatever new structure you come up with.
There are other multidimensional trees that are usually better than kd-trees: quadtrees, R*-trees (like the R-tree, but much faster for updates) or the PH-Tree.
The PH-Tree is like a quadtree, but much more space efficient; it scales better with dimensions, and its depth is limited by the maximum bit width of the values, i.e. a maximum of '10000' requires 14 bits, so the depth will not be more than 14.
Java implementations of all trees can be found on my repo, either here (quadtree may be a bit buggy) or here.
EDIT
The following optimization can probably be ignored. Of course the described query will result in a full scan, but that may not be as bad as it sounds, because it will on average anyway return 33%-50% of the whole tree.
Possible optimisation (not tested, but might work for the PH-Tree):
One problem with range queries is the different selectivity of your dimensions, which may result in something close to a full scan of the tree, for example when querying for [0..100][0..5][0..10000][1..1], i.e. constraining only the last dimension (the one with the least selectivity).
To avoid this, especially for the PH-Tree, I would try to multiply your values by a fixed constant. For example multiply A by 100, B by 2000, C by 1 and D by 5000. This allows all values to range from 0 to 10000, which may improve query performance when constraining only dimensions with low selectivity (the 2nd or 4th).
If a hash set contains only one instance of any distinct element(s), how might collision occur at this case?
And how could load factor be an issue since there is only one of any given element?
While this is homework, it is not for me. I am tutoring someone, and I need to know how to explain it to them.
Let's assume you have a HashSet of Integers, and your hash function is mod 4. The integers 0, 4, 8, 12, 16, etc. will all collide if you try to insert them. (mod 4 is a terrible hash function, but it illustrates the concept.)
Assuming a proper function, the load factor is correlated to the chance of having a collision; please note that I say correlated and not equal because it depends on the strategy you use to handle collisions. In general, a high load factor increases the possibility of collisions. Assuming that you have 4 slots and you use mod 4 as the hash function, when the load factor is 0 (empty table), you won't have a collision. When you have one element, the probability of a collision is .25, which obviously degrades the performance, since you have to resolve the collision.
Now, assuming that you use linear probing (i.e. on collision, use the next entry available), once you reach 3 entries in the table, you have a .75 probability of a collision, and if you have a collision, in the best case you will go to the next entry, but in the worst case you will have to go through all 3 entries. So a collision means that instead of a direct access, you need on average a linear search over about 2 items.
Of course, there are better strategies to handle collisions, and generally, in non-pathological cases, a load of .7 is acceptable, but beyond that collisions shoot up and performance degrades.
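The 4-slot, mod-4 example above can be tried out with a small Java sketch. It is a toy (hypothetical class, fixed capacity, no removal) whose only purpose is to show how the number of probes grows as the table fills up.

```java
// Hypothetical toy table: "value mod capacity" as the hash, linear probing on collision.
final class LinearProbingSketch {
    private final Integer[] slots;

    LinearProbingSketch(int capacity) {
        slots = new Integer[capacity];
    }

    // Inserts the value and returns how many slots were examined,
    // or -1 if the table is full and the value is not in it.
    int insert(int value) {
        int start = Math.floorMod(value, slots.length);
        for (int probes = 1; probes <= slots.length; probes++) {
            int i = (start + probes - 1) % slots.length;
            if (slots[i] == null || slots[i] == value) {
                slots[i] = value;
                return probes;
            }
        }
        return -1;
    }
}
```

Inserting 0, 4, 8 and 12 into a 4-slot table examines 1, 2, 3 and 4 slots respectively, because every value hashes to slot 0; inserting 16 afterwards fails because the table is full.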
The general idea behind a "hash table" (which a "hash set" is a variety of) is that you have a number of objects containing "key" values (eg, character strings) that you want to put into some sort of container and then be able to find individual objects by their "key" values easily, without having to examine every item in the container.
One could, eg, put the values into a sorted array and then do a binary search to find a value, but maintaining a sorted array is expensive if there are lots of updates.
So the key values are "hashed". One might, for instance, add together all of the ASCII values of the characters to create a single number which is the "hash" of the character string. (There are better hash computation algorithms, but the precise algorithm doesn't matter, and this is an easy one to explain.)
When you do this you'll get a number that, for a ten-character string, will be in the range from maybe 600 to 1280. Now, if you divide that by, say, 500 and take the remainder, you'll have a value between 0 and 499. (Note that the string doesn't have to be ten characters -- longer strings will add to larger values, but when you divide and take the remainder you still end up with a number between 0 and 499.)
Now create an array of 500 entries, and each time you get a new object, calculate its hash as described above and use that value to index into the array. Place the new object into the array entry that corresponds to that index.
But (especially with the naive hash algorithm above) you could have two different strings with the same hash. Eg, "ABC" and "CBA" would have the same hash, and would end up going into the same slot in the array.
To handle this "collision" there are several strategies, but the most common is to create a linked list off the array entry and put the various "hash synonyms" into that list.
You'd generally try to have the array large enough (and have a better hash calculation algorithm) to minimize such collisions, but, using the hash scheme, there's no way to absolutely prevent collisions.
Note that the multiple entries in a synonym list are not identical -- they have different key values -- but they have the same hash value.
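The scheme described above (character-sum hash, 500 slots, a linked list per slot for synonyms) might be sketched in Java like this. The class is hypothetical and deliberately uses the naive hash from the explanation rather than a good one.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch of a chained hash set of strings with the naive hash from the text.
final class ChainedStringSetSketch {
    private static final int SLOTS = 500;
    private final List<LinkedList<String>> table = new ArrayList<>();

    ChainedStringSetSketch() {
        for (int i = 0; i < SLOTS; i++) {
            table.add(new LinkedList<>());
        }
    }

    // Sum of the character values, then the remainder after dividing by the table size.
    static int slotOf(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);
        }
        return sum % SLOTS;
    }

    // Returns false if the key was already present.
    boolean add(String key) {
        LinkedList<String> chain = table.get(slotOf(key));
        if (chain.contains(key)) {
            return false;
        }
        chain.add(key);
        return true;
    }

    boolean contains(String key) {
        return table.get(slotOf(key)).contains(key);
    }

    int chainLength(String key) {
        return table.get(slotOf(key)).size();
    }
}
```

"ABC" and "CBA" both land in slot 198 (65 + 66 + 67), so after adding both, that slot's list holds two different keys with the same hash.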
This might sound like a very vague question upfront but it is not. I have gone through the hash function description on the wiki, but it did not help me understand.
I am looking for simple answers to rather complex topics like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow?
What is the difference between HashMap, HashTable and HashList?
What do we mean by 'constant time complexity', and why do different implementations of hashing give constant-time operations?
Lastly, why are hashes and linked lists asked about in most interview questions? Is there any specific logic behind it for testing the interviewee's knowledge?
I know my question list is big but I would really appreciate if I can get some clear answers to these questions as I really want to understand the topic.
Here is a good explanation about hashing. For example, to store the string "Rachel" you apply a hash function to that string to get a memory location: myHashFunction(key: "Rachel", value: "Rachel") --> 10. The function may return 10 for the input "Rachel", so assuming you have an array of size 100 you store "Rachel" at index 10. To retrieve that element you just call myHashFunction("Rachel") again and it will return 10. Note that in this example both the key and the value are "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in this case you have a collision, and if you are implementing your own hash table you have to take care of this, perhaps using a linked list or other techniques.
Here are some common hash functions used. A good hash function satisfies the following: each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One of the methods is called the division method. We map a key k into one of n slots by taking the remainder of k divided by n: h(k) = k mod n. For example, if your array size is n = 100 and your key is the integer k = 15, then h(k) = 15.
Hashtable is synchronized and HashMap is not.
HashMap allows one null key and any number of null values, but Hashtable allows neither.
The purpose of a hash table is to have O(1) constant time complexity for adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list until you reach it, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key, and the hash function tells you where the desired element is stored. If the hash function is well implemented, this takes constant time, O(1). This means you don't have to traverse all the elements stored in the hash table. You get the element "instantly".
Of course a programmer/developer/computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and HashTable are maps; they are a collection of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
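As an illustration of that last point, this Java sketch recomputes a string hash with the same recurrence that String.hashCode() is documented to use; the loop visits every character, so the cost grows with the length of the string. The class name is hypothetical.

```java
// Hypothetical re-implementation of the documented String.hashCode() formula:
// s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using int arithmetic that wraps on overflow.
final class StringHashSketch {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) { // one step per character: O(length)
            h = 31 * h + s.charAt(i);
        }
        return h;
    }
}
```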
These are things that people ought to know.
Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible, i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object) typically by deriving a number from the object's memory address in the JVM.
The JVM address thing is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs can generate good hashCode methods.
Hashtable and HashMap are conceptually the same thing. They store key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values, only keys.
Constant-time means that no matter how many entries there are in the hashtable (or any other collection), the number of operations needed to find a given object by its key is constant. That is - 1, or close to 1
This is basic computer-science material, and it is supposed that everyone is familiar with it. I think google have specified that the hashtable is the most important data-structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such a list would have O(n) complexity, meaning that you have to traverse the whole list (or half of it, on average) to perform such an operation.
Hashing is a very simple and effective way of speeding it up: consider that we split the whole list into a set of small lists. Items in one such small list would have something in common, and this something can be deduced from the key. For example, given a list of names, we could use the first letter as the property that chooses which small list to look in. In this way, by partitioning the data by the first letter of the key, we obtain a simple hash that splits the whole list into ~30 smaller lists, so that each operation would take O(n)/30 time.
However, we could note that the results are not that perfect. First, there are only 30 of them, and we can't change that. Second, some letters are used more often than others, so that the set with Y or Z will be much smaller than the set with A. For better results, it's better to find a way to partition the items into sets of roughly the same size. How could we solve that? This is where you use hash functions. A hash function is a function that is able to create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
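#include <string.h>

/* Polynomial string hash. Unsigned arithmetic wraps on overflow instead of
   being undefined, and the result can never be negative. */
unsigned int hash(const char* str){
    unsigned int rez = 0;
    size_t len = strlen(str);  /* computed once, not on every iteration */
    for (size_t i = 0; i < len; i++)
        rez = rez * 37 + (unsigned char)str[i];
    return rez % NUMBER_OF_PARTITIONS;
}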
This would assure a quite even distribution and configurable number of sets (also called buckets).
What do we mean by Hashing, how does
it work internally ?
Hashing is the transformation of a string into a shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table, which contains an array of items. A hash table computes an index from the data item's key and uses this index to place the data into the array.
What algorithm does it follow ?
In simple words most of the Hash algorithms work on the logic "index = f(key, arrayLength)"
Lastly, why in most interview
questions Hash and LinkedList are
asked, is there any specific logic for
it from testing interviewee's
knowledge ?
It's about how good you are at logical reasoning. It is one of the most important data structures, and every programmer should know it.