While debugging I found a strange behavior.
I have a HashMap<Integer, Set<Term>> (Term is a class which only contains a String). The normal toString() shows this:
When I click the table property of the HashMap I get this:
My question now: why are there null values in the table's toString()?
Edit: Thanks for your fast answers! If I could, I would accept all of them...
HashMap is a Map implementation whose crucial feature is constant-time O(1) lookup.
The only basic data structure with constant-time lookup is a fixed-length array. When you initialise the HashMap, it creates a fixed-length array that it will expand once the number of entries grows past a threshold derived from the current array's size.
Edit: #kutschkem has pointed out that java.util.HashMap expands its fixed-length array when the number of entries reaches about 75% of the current array's size (the default load factor), rather than only when the entries exceed the current array's size.
Because the Map implementation you are using starts with a set of hash buckets, some of which are null at the beginning (how many is determined by initialCapacity). If the number of entries exceeds the threshold, it will start creating more hash buckets / slots for your objects. Think of this as a growth reserve the HashMap automatically creates for you.
Read more:
https://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html
The HashMap stores its entries in a hashtable. That is an array, and the hash function maps the key to one of the array entries (also called hash buckets).
By default, the table is resized before the number of entries exceeds 75% of its capacity (the load factor), so a share of the buckets is always kept free.
The reason is that as the hash table gets filled up, collisions between hashes get more and more likely. You lose all advantages of the HashMap if collisions are too frequent. Too full, and your HashMap would be no better than a LinkedList (yes, LinkedList, not ArrayList). It would probably be even worse.
That is how a hash map works: there is a large array (the table), and for a given key the following table entry is tried:
table[key.hashCode() % table.length]
That table slot is then used. If the slot is already occupied by an entry whose key is not equals(key), the collision has to be resolved; java.util.HashMap does this by chaining entries within the bucket.
So initially the table contains only nulls and has size initialCapacity. The array is grown when the hash map becomes too full (controlled by loadFactor).
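To make the slot computation above concrete, here is a minimal sketch (the class and method names are made up for illustration); the real java.util.HashMap additionally spreads the hash bits and uses a power-of-two table size with a bit mask instead of %:
public class SlotDemo {
    // Simplified bucket-index computation as described above.
    static int slotFor(Object key, int tableLength) {
        int h = key.hashCode();
        return (h & 0x7fffffff) % tableLength; // strip the sign bit so the index is never negative
    }

    public static void main(String[] args) {
        // The same key always lands in the same slot, which is what makes get() possible.
        System.out.println(slotFor("hello", 16));
        System.out.println(slotFor("hello", 16));
    }
}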
The HashMap uses internally an array to store the entries. Very much simplified, it does something like array_index = hashcode % array_length (again: very simplified, as it also needs to take care of hash collisions etc). This internal array is typically larger than the number of elements you store in the HashMap -- otherwise, the array would have to be resized every time you add an element to it. So what you see as null are the yet unused slots in the array.
This is normal behavior.
There are null values because the table array was initialized as being filled with nulls, and uses null to indicate that there are no values stored in that hash bucket.
The toString() function provided doesn't skip over them because seeing them was useful to the folks debugging the HashMap implementation.
If you want to see the contents without the nulls, you'll have to write your own display function, either by subclassing HashMap and overriding toString() or by providing a convenience function somewhere in your code.
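As a rough illustration of such a convenience function (a hypothetical helper, not part of any library): it formats only the entries the Map actually exposes, because the public Map API never shows the empty buckets you see in the debugger.
import java.util.Map;
import java.util.stream.Collectors;

public final class MapDisplay {
    private MapDisplay() {}

    // Prints one "key -> value" pair per line; internal nulls never appear here,
    // because entrySet() only contains real entries.
    public static <K, V> String prettyPrint(Map<K, V> map) {
        return map.entrySet().stream()
                .map(e -> e.getKey() + " -> " + e.getValue())
                .collect(Collectors.joining(System.lineSeparator()));
    }
}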
How do you calculate the complexity of the HashMap search algorithm? Googling gives O(1), but I don't understand how that result is arrived at.
HashMap works on the hashing principle. It is a data structure that allows you to store and retrieve data in O(1) time, provided you know the key.
In hashing, hash functions are used to link key and value in a HashMap. Objects are stored by calling the put(key, value) method of HashMap and retrieved by calling the get(key) method. When we call put, the hashCode() method of the key object is called so that the hash function of the map can find a bucket location to store the value object; this is actually an index into the internal array, known as the table. HashMap internally stores the mapping in the form of a Map.Entry object, which contains both the key and the value object. When you want to retrieve the object, you call the get() method and again pass the key object. This time the key object generates the same hash code (it must do so to retrieve the object, which is why HashMap keys should be immutable, e.g. String) and we end up at the same bucket location. If there is only one object there, it is returned, and that's the value object you stored earlier. Things get a little tricky when collisions occur.
Collision: since the internal array of HashMap is of fixed size, if you keep storing objects, at some point the hash function will return the same bucket location for two different keys; this is called a collision in HashMap. In this case, a linked list is formed at that bucket location and the new entry is stored as the next node.
If we try to retrieve an object from this linked list, we need an extra check to find the correct value; this is done with the equals() method. Since each node contains an entry, HashMap keeps comparing each entry's key object with the passed key using equals(), and when it returns true, the Map returns the corresponding value.
Since searching a linked list is an O(n) operation, in the worst case hash collisions reduce a map to a linked list. This issue was addressed in Java 8 by replacing the linked list with a balanced tree once a bucket grows too large, so that worst-case lookup is O(log n).
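To tie the put/get, collision and equals() steps together, here is a deliberately simplified chained hash table. All names are invented for illustration; this is not how java.util.HashMap is actually written (it also resizes, spreads hash bits, and treeifies long chains).
import java.util.LinkedList;

class ToyHashMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] table = new LinkedList[16]; // fixed size: no resizing here

    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % table.length; // hash code -> bucket index
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (table[i] == null) table[i] = new LinkedList<>();
        for (Entry<K, V> e : table[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // same key: overwrite
        }
        table[i].add(new Entry<>(key, value)); // empty bucket or collision: append to the chain
    }

    public V get(K key) {
        int i = indexFor(key);
        if (table[i] == null) return null;
        for (Entry<K, V> e : table[i]) {
            if (e.key.equals(key)) return e.value; // equals() picks the right node in the chain
        }
        return null;
    }
}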
By the way, you can easily verify how HashMap works by looking at the source of HashMap.java in your IDE if you are keenly interested in the code; otherwise, the logic is explained above.
Information On Buckets : An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
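A small sketch of how those two parameters can be used in practice; the numbers and class name are made up, but the HashMap(initialCapacity, loadFactor) constructor is the standard one.
import java.util.HashMap;
import java.util.Map;

public class CapacityDemo {
    public static void main(String[] args) {
        int expectedEntries = 1000;
        float loadFactor = 0.75f; // the documented default

        // Rehashing happens once size exceeds capacity * loadFactor,
        // so pick a capacity large enough that the expected entries fit without a resize.
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);

        Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedEntries; i++) {
            map.put("key" + i, i);
        }
        System.out.println(map.size()); // 1000
    }
}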
I was trying to implement a trie and read in an example implementation that it would be more space-efficient to use a small array of size 26 to store the children, because then you wouldn't have to waste space with a HashMap (the code was in Java, if that makes a difference).
But wouldn't a map be more space-efficient, since you don't necessarily need to store all 26 values? Or does a HashMap with Character objects as keys just take more space, because a simple int[] does not carry the extra behind-the-scenes overhead that the implementation of these more complex objects does?
Just wanting to check if maybe this person was mistaken or if there's some overhead involved in using object types like HashMap that I should be aware of.
A hashmap stores keys and values, so if you were to implement a trie using a hashmap, you would be storing not only the values, but also the keys. If you use an array, then the key is actually the index of the value in the array, so you do not have to store it anywhere.
Besides that, hashmaps are less space efficient than arrays because they always have a load factor that is smaller than one, which is the same as saying that they keep more slots allocated than there are entries stored, due to the way they work. I am not expanding on this because it is not related to your issue, but if you are curious, search for hashmap load factor.
This is a classic programming compromise. In this case, using a HashMap will take more space, but that may be a price you're willing to pay for improved performance. Or it might not. Depends on your problem.
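A rough sketch of the two layouts being compared, assuming lowercase ASCII words (the class names are made up for illustration):
import java.util.HashMap;
import java.util.Map;

class ArrayTrieNode {
    // One fixed slot per letter 'a'..'z'; the index itself is the key,
    // so an absent child only costs a single null reference.
    ArrayTrieNode[] children = new ArrayTrieNode[26];
    boolean isWord;

    ArrayTrieNode child(char c) {
        return children[c - 'a'];
    }
}

class MapTrieNode {
    // Only present children are stored, but each of them also needs a boxed
    // Character key, an entry object and the map's own bucket array.
    Map<Character, MapTrieNode> children = new HashMap<>();
    boolean isWord;

    MapTrieNode child(char c) {
        return children.get(c);
    }
}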
I have n objects, each with an identifying number. I get them unsorted, but the range of indexes [0, n-1] is used to identify them. I want to access them as fast as possible. I suppose that an ArrayList would be the best option; I'd add the object with identifier n at the position of the ArrayList with index n by:
list.add(identifier, object);
The problem is that when I am adding the objects I get an IndexOutOfBoundsException, because I'm adding them unsorted and size() is still smaller, although I know that the earlier positions will eventually be filled as well.
Another option is to use a HashMap but I suppose that this will decrease performance.
Do you know a collection that has the behavior described above?
Do you know a collection that has the behavior described above?
It sounds like you need a plain old Java array. And if you need it as a collection, then use "Arrays.asList(...)" to create a List wrapper for it.
Now this won't work if you need to add or remove elements from the array / collection, but it sounds like you don't need to, from your problem description.
If you do need to add / remove elements (as distinct from using set to update the element at a given position), then Peter Lawrey's approach is best.
By contrast, a HashMap<Integer, Object> would be an expensive alternative. At a rough estimate, I'd say that "indexing" operations would be at least 10 times slower, and the data structure would take 10 times the space, compared to an equivalent array or ArrayList type. A hash table based solution is only really a viable alternative (from a performance perspective) if the array is large and sparse.
Sometimes you get the indexes out of order. This requires you to add dummy entries which may be filled later.
int indexToAdd = ...                     // the identifier, which is also the target index
E elementToAdd = ...                     // the object to store there
while (list.size() <= indexToAdd) list.add(null); // pad with dummy nulls up to that index
list.set(indexToAdd, elementToAdd);
This will allow you to add entries beyond the current end of the list.
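A self-contained version of that idiom, with made-up indices and a hypothetical helper name, just to show it compiles and runs:
import java.util.ArrayList;
import java.util.List;

public class PaddingDemo {
    static <E> void setOrGrow(List<E> list, int index, E element) {
        while (list.size() <= index) list.add(null); // pad with placeholders up to the index
        list.set(index, element);
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        setOrGrow(list, 3, "d"); // arrives out of order, yet no IndexOutOfBoundsException
        setOrGrow(list, 0, "a");
        System.out.println(list); // [a, null, null, d]
    }
}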
The Javadoc for List.add(int, E) and List.set(int, E) both state
IndexOutOfBoundsException - if the index is out of range (index < 0 || index > size())
If you attempt to add entries beyond the end
List<Integer> list = new ArrayList<>();
list.add(1, 1);
you get
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 0
at java.util.ArrayList.rangeCheckForAdd(ArrayList.java:612)
at java.util.ArrayList.add(ArrayList.java:426)
at Main.main(Main.java:28)
I'm not sure how much more expensive a HashMap<Integer, T> would be. Integer.hashCode() is quite efficient, though the only really expensive operation might be copying the data into a new larger array as the number of items increases. However, if you know your n, you could use a normal array.
As an alternative, you could implement your own Map<Integer, T> that does not use the hash code but the Integer value itself. Before you do this, make sure that a plain array really is not sufficient and that HashMap really is not efficient enough!
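For example, here is a minimal sketch of an array-backed store keyed directly by the int identifier, assuming the identifiers are dense in [0, n); it is a hypothetical helper, not a full java.util.Map implementation.
class IntKeyedStore<T> {
    private final Object[] values;

    IntKeyedStore(int n) {
        values = new Object[n]; // one slot per possible identifier
    }

    void put(int id, T value) {
        values[id] = value; // the identifier itself is the array index
    }

    @SuppressWarnings("unchecked")
    T get(int id) {
        return (T) values[id];
    }
}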
I think you have at least two good options.
The first is to use a straight Java array initialized to the appropriate length, then add objects with the syntax:
theArray[i] = object;
The second would be to use a HashMap, and add objects with the syntax:
theMap.put(i, object);
I'm not sure what performance issues you're worried about, but adding elements within the range, clearing (out of an array) or removing (out of a HashMap), and finding elements by a given index (or key, for a HashMap) are all O(1) for both structures. I would also suggest taking a look at Wikipedia's list of data structures if neither of these seems good.
I have a source of strings (let us say, a text file) and many strings repeat multiple times. I need to get the top X most common strings in the order of decreasing number of occurrences.
The idea that came to mind first was to create a sortable Bag (something like org.apache.commons.collections.bag.TreeBag) and supply a comparator that will sort the entries in the order I need. However, I cannot figure out what is the type of objects I need to compare. It should be some kind of an internal map that combines my object (String) and the number of occurrences, generated internally by TreeBag. Is this possible?
Or would I be better off simply using a HashMap and sorting it by value, as described in, for example, Java sort HashMap by value?
Why don't you put the strings in a map, mapping each string to the number of times it appears in the text?
In step 2, traverse the entries in the map and keep adding them to a min-heap of size X. If the heap is full, always extract the minimum before inserting (see the sketch below).
Takes O(n log x) time.
Otherwise, after step 1, sort the items by number of occurrences and take the first x items. A TreeMap would come in handy here :) (I'd add a link to the javadocs, but I'm on a tablet.)
Takes O(n log n) time.
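Here is a rough sketch of the count-then-min-heap approach described above, assuming the input is already available as a List<String> (the names and types are my own choices, not from the question):
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopX {
    static List<String> topX(List<String> strings, int x) {
        // Step 1: count occurrences (O(n)).
        Map<String, Integer> counts = new HashMap<>();
        for (String s : strings) {
            counts.merge(s, 1, Integer::sum);
        }

        // Step 2: keep only the x largest counts in a min-heap (O(n log x)).
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > x) {
                heap.poll(); // evict the current minimum
            }
        }

        // Drain and sort the survivors so the most frequent string comes first.
        List<Map.Entry<String, Integer>> top = new ArrayList<>(heap);
        top.sort(byCount.reversed());
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : top) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "a", "c", "b", "a");
        System.out.println(topX(words, 2)); // [a, b]
    }
}
The counting pass is O(n) and the heap never holds more than x entries, which is where the O(n log x) bound above comes from.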
With Guava's TreeMultiset, just use Multisets.copyHighestCountFirst.
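A short sketch of that Guava approach (assuming Guava is on the classpath); a HashMultiset works just as well as a TreeMultiset for the counting step:
import com.google.common.collect.HashMultiset;
import com.google.common.collect.ImmutableMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;

import java.util.List;

public class TopXWithGuava {
    public static void main(String[] args) {
        List<String> words = List.of("apple", "banana", "apple", "cherry", "banana", "apple");

        // Count occurrences of each string.
        Multiset<String> counts = HashMultiset.create(words);

        // copyHighestCountFirst returns a multiset whose iteration order is by descending count.
        ImmutableMultiset<String> byCount = Multisets.copyHighestCountFirst(counts);

        int x = 2;
        byCount.entrySet().stream()
                .limit(x)
                .forEach(e -> System.out.println(e.getElement() + " x " + e.getCount()));
    }
}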
When we use a HashTable for storing data, it is said that searching takes O(1) time. I am confused; can anybody explain?
Well it's a little bit of a lie -- it can take longer than that, but it usually doesn't.
Basically, a hash table is an array containing all of the keys to search on. The position of each key in the array is determined by the hash function, which can be any function which always maps the same input to the same output. We shall assume that the hash function is O(1).
So when we insert something into the hash table, we use the hash function (let's call it h) to find the location where to put it, and put it there. Now we insert another thing, hashing and storing. And another. Each time we insert data, it takes O(1) time to insert it (since the hash function is O(1)).
Looking up data is the same. If we want to find a value, x, we have only to find out h(x), which tells us where x is located in the hash table. So we can look up any hash value in O(1) as well.
Now to the lie: The above is a very nice theory with one problem: what if we insert data and there is already something in that position of the array? There is nothing which guarantees that the hash function won't produce the same output for two different inputs (unless you have a perfect hash function, but those are tricky to produce). Therefore, when we insert we need to take one of two strategies:
Store multiple values at each spot in the array (say, each array slot has a linked list). Now when you do a lookup, it is still O(1) to arrive at the correct place in the array, but potentially a linear search down a (hopefully short) linked list. This is called "separate chaining".
If you find something is already there, hash again and find another location. Repeat until you find an empty spot, and put it there. The lookup procedure can follow the same rules to find the data. Now it's still O(1) to get to the first location, but there is a potentially (hopefully short) linear search to bounce around the table till you find the data you are after. This is called "open addressing".
Basically, both approaches are still mostly O(1) but with a hopefully-short linear sequence. We can assume for most purposes that it is O(1). If the hash table is getting too full, those linear searches can become longer and longer, and then it is time to "re-hash" which means to create a new hash table of a much bigger size and insert all the data back into it.
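To make the open-addressing idea concrete, here is a tiny linear-probing table. The names are invented, it skips deletion and resizing, and it assumes the table never fills up completely.
class ProbingTable<K, V> {
    private final Object[] keys = new Object[16];
    private final Object[] values = new Object[16];

    private int slot(Object key) {
        return (key.hashCode() & 0x7fffffff) % keys.length;
    }

    void put(K key, V value) {
        int i = slot(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length; // collision: probe the next slot
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        int i = slot(key);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return (V) values[i];
            i = (i + 1) % keys.length; // keep probing until the key or an empty slot is found
        }
        return null; // hitting an empty slot means the key is absent
    }
}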
Searching takes O(1) time only if there are no collisions in the hashtable, so it is not strictly correct to say that searching in a hashtable always takes O(1), i.e. constant, time.
See how Hashtable works on MSDN?
EDIT
mgiuca explains it beautifully, and I am just adding one more collision-resolution technique, which is called chaining.
In this technique, we maintain a linked list of values at each location, so when you have a collision, your value is added to the linked list at that position. When you are searching for a value, you may then need to search the whole linked list, in which case the search will not be an O(1) operation.