Java: Best collection for a small number of key-value pairs

I'm currently creating a prototype of a Rock, Paper, Scissors program and I need to store the selections in an Integer/String format.
What would be the "best" collection to use, in terms of search speed and memory usage? The premise is that the computer will pick a random number, and the key-value pairs are then searched to find the appropriate selection name for use later in the program.
EDIT:
To clarify, there will be at most 5 key-value pairs in the collection, with integers as the keys.
Further clarification based on comments: I'm looking for a key/value collection for a small number of pairs (5 at most).

An ArrayList or a HashMap would be fine if your keys are Integers.
Both offer O(1) access and expect unique keys.
Otherwise, if the keys are Strings, only a HashMap would fit.

HashMap would be the best one. get() has O(1) time complexity. O(1) means the cost is independent of the number of elements, i.e. constant.

If you need a Map and you do not need a SortedMap, HashMap is almost always the right choice.
Note that one of the constructors of HashMap has an argument telling it the initial size of the map. The constructor will use that to allocate a reasonable amount of memory for the map, which is helpful if the size of your map is likely to be smaller than the default size (which is 16).
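As a minimal sketch of that advice applied to the question (the numeric keys, class and variable names are assumptions for illustration), a small presized HashMap mapping the computer's random number to a selection name might look like this:
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class Selections {
    // Presize the map; 8 is enough headroom for at most 5 entries
    // with the default load factor of 0.75.
    private static final Map<Integer, String> MOVES = new HashMap<>(8);

    static {
        MOVES.put(0, "Rock");
        MOVES.put(1, "Paper");
        MOVES.put(2, "Scissors");
    }

    public static void main(String[] args) {
        int pick = new Random().nextInt(MOVES.size()); // computer picks a random key
        System.out.println("Computer chose: " + MOVES.get(pick));
    }
}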

I'd suggest this is a perfect opportunity to use enums and an EnumMap which is optimised for exactly the kind of scenario you are considering - i.e. a discrete set of possible keys.
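A minimal sketch of that EnumMap approach (the enum constants and labels are assumptions for illustration):
import java.util.EnumMap;
import java.util.Map;

public class Moves {
    enum Move { ROCK, PAPER, SCISSORS }

    public static void main(String[] args) {
        // EnumMap is backed by an array indexed by the enum's ordinal,
        // so lookups are very cheap and memory use is minimal.
        Map<Move, String> labels = new EnumMap<>(Move.class);
        labels.put(Move.ROCK, "Rock");
        labels.put(Move.PAPER, "Paper");
        labels.put(Move.SCISSORS, "Scissors");

        // Picking a "random" move by ordinal:
        Move pick = Move.values()[(int) (Math.random() * Move.values().length)];
        System.out.println(labels.get(pick));
    }
}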

Related

Should I use a `HashSet` or a `TreeSet` for a very large dataset?

I have a requirement to store 2 to 15 million accounts (each a String of length 15) in a data structure for lookups and uniqueness checks. Initially I planned to store them in a HashSet, but I suspect the lookups may be slow because of hash collisions and may eventually be slower than a TreeMap (which uses binary search).
There is no requirement for the data to be sorted. I am using Java 7. I have a 64 GB system with 48 GB dedicated to this application.
This question is not a duplicate of HashSet and TreeSet performance test because that question is about the performance of adding elements to a Set and this question is about the performance of checking an existing Set for duplicate values.
If you have 48 GB of dedicated Memory for your 2 million to 15 million records, your best bet is probably to use a HashMap<Key, Record>, where your key is an Integer or a String depending on your requirements.
You will be fine as far as hash collisions go as long as you give enough memory to the Map and have an appropriate load factor.
I recommend using the following constructor: new HashMap<>(13_000_000); (30% more than your expected number of records - which will be automatically expanded by HashMap's implementation to 2^24 cells).
Tell your application that this Map will be very large from the get-go so it doesn't need to automatically grow as you populate it.
HashMap has O(1) access time for its members, whereas TreeMap has O(log n) lookup time, but can be more efficient with memory and doesn't need a clever hashing function. However, if you're using String or Integer keys, you don't need to worry about designing a hashing function, and the constant-time lookups will be a huge improvement. Also, the other advantage of TreeMap / TreeSet is the sorted ordering, which you stated you don't care about; use HashMap.
If the only purpose of the structure is to check for unique account numbers, then everything I've said above is still true, but as you stated in your question, you should use a HashSet<String>, not a HashMap. The performance recommendations and constructor argument are still applicable.
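A hedged sketch of that advice for the uniqueness check (the class and method names are assumptions; the capacity mirrors the figure suggested above):
import java.util.HashSet;
import java.util.Set;

public class AccountRegistry {
    // Presize so the set never needs to resize while loading the accounts;
    // HashSet rounds this up to the next power of two internally (2^24 here).
    private final Set<String> accounts = new HashSet<>(13_000_000);

    /** Returns true if the account was new, false if it was a duplicate. */
    public boolean register(String accountNumber) {
        return accounts.add(accountNumber);
    }
}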
Further reading: HashSet and TreeSet performance test
When we tried to store 50 million records in a HashMap with proper initialization parameters, insertion started to slow down, especially after 35 million records. Changing to TreeMap gave constant insertion and retrieval performance.
Observation: TreeMap gave better performance than HashMap for that large input set. For a smaller set, of course, HashMap will give better performance.

Java HashMap<Integer, Integer> vs int[]

I have an integer array of size 10000, which is gradually filled with other integers (context: http://uva.onlinejudge.org/external/1/100.pdf), but it is not large enough. I plan to replace it with a HashMap, and was wondering if this is a better idea than making the array arbitrarily larger (e.g. increasing the size to 100000).
Also, what are the main differences between a HashMap and an integer array?
N.B. In this case, only odd keys are used in the HashMap/array.
Both, obviously, provide a mapping from a subset of the integers into the integers. There are several differences, but the short answer is that an array is likely to work better for dense keys and a HashMap for sparse keys.
The memory cost per key that you use will be 32 bits for the array, but several times that for the HashMap. The memory cost per key that is in the range but that you don't use is also 32 bits for the array, but can be close to zero for the HashMap.
Array access will be faster than HashMap access.
If you expect to use as many as 50% of the entries, you are much better off with the array. If only odd keys are needed, and the array is large, consider using array index (i-1)/2 to represent the element with key i.
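A hedged sketch of that odd-key indexing trick (the class and method names are assumptions):
// Only odd keys are stored, so key i maps to array index (i - 1) / 2.
// This halves the memory compared to indexing the array directly by key.
public class OddKeyArray {
    private final int[] values;

    public OddKeyArray(int maxOddKey) {
        values = new int[(maxOddKey - 1) / 2 + 1];
    }

    public void put(int oddKey, int value) {
        values[(oddKey - 1) / 2] = value;
    }

    public int get(int oddKey) {
        return values[(oddKey - 1) / 2];
    }
}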
The best way to find out which is better for your situation, including finding the density threshold for switching between them, is by testing. This is the procedure I would follow (a minimal sketch follows the list):
1. Define an interface for the data structure that has methods for the operations you need to be able to do on it.
2. Write your code, except for the actual creation of the structure, in terms of that interface only.
3. Define two classes that each implement the interface, one using the array and the other using a HashMap.
4. Measure using each of the classes. For the HashMap, you can also experiment with the HashMap constructor arguments.
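A minimal sketch of such an interface and its two implementations (all names are assumptions, not from the question):
import java.util.HashMap;
import java.util.Map;

// The operations the solution needs, independent of the backing structure.
interface IntStore {
    void put(int key, int value);
    int get(int key);
}

// Dense keys: plain array, one slot per possible key.
class ArrayStore implements IntStore {
    private final int[] slots;
    ArrayStore(int maxKey) { slots = new int[maxKey + 1]; }
    public void put(int key, int value) { slots[key] = value; }
    public int get(int key) { return slots[key]; }
}

// Sparse keys: HashMap, memory proportional to the keys actually used.
class MapStore implements IntStore {
    private final Map<Integer, Integer> map = new HashMap<>();
    public void put(int key, int value) { map.put(key, value); }
    public int get(int key) { return map.getOrDefault(key, 0); }
}
Swapping implementations is then a one-line change, so both can be timed against the same workload.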
An array is a list of values. An int array is a list of integers. You access elements by index.
A map (or hash map) is a key --> value structure. In a hash map you retrieve values by key.
public class Book {}

// Mapping from a String (= key) to a Book object (= value).
HashMap<String, Book> books = new HashMap<String, Book>();
books.put("Harry Potter", new Book());
// etc.
So if you want to access elements by a key, then a hash map is what you need. The keys should be immutable (like Integer or String).
So pick what suits you the best.

HashMap speed greater for smaller maps

This may be a strange question, but it is based on some results I get using a Java Map: is element retrieval faster in a HashMap when the map is smaller?
I have some code that uses the containsKey and get(key) methods of a HashMap, and it seems to run faster if the number of elements in the Map is smaller. Is that so?
My understanding is that a HashMap uses a hash function to map a key to a certain slot of the map, and there are implementations in which that slot is a reference to a linked list (because several keys can hash to the same slot), or to other slots in the map.
Is this correct: can retrieval be faster if the Map has fewer elements?
I need to extend my question with a concrete example.
I have 2 cases; in both, the total number of elements is the same.
In the first case, I have 10 HashMaps; I'm not aware how the elements are distributed. The execution time of that part of the algorithm is 141 ms.
In the second case, I have 25 HashMaps with the same total number of elements. The execution time of the same algorithm is 69 ms.
In both cases, I have a for loop that goes through each of the HashMaps, tries to find the same elements, and gets them if present.
Can it be that the total execution time is smaller because each individual search inside a smaller HashMap is faster, and therefore so is their sum?
I know that this is very strange, but is something like this somehow possible, or am I doing something wrong?
A Map<Integer, Double> is used. It is hard to tell what the distribution of elements is, since this is actually an implementation of the KMeans clustering algorithm and the elements are representations of cluster centroids. That means they will largely depend on the initialization of the algorithm. The total number of elements will not always be the same either, but I have tried to simplify the problem; sorry if that was misleading.
The number of collisions is decisive for a slowdown.
Assume an array of some size; the hash code modulo the size then points to an index where the object is put. Two objects with the same index collide.
Having a large capacity (array size) with respect to number of elements helps.
With HashMap there are overloaded constructors with extra settings.
public HashMap(int initialCapacity, float loadFactor)
Constructs an empty HashMap with the specified initial capacity and load factor.
You might experiment with that.
For a specific key class used with a HashMap, having a good hashCode can help too. Hash codes are a separate mathematical field.
Of course using less memory helps on the processor / physical memory level, but I doubt it has an influence in this case.
Does your timing take into account only the cost of get / containsKey, or are you also performing puts in the timed code section? If so, and if you're using the default constructor (initial capacity 16, load factor 0.75), then the larger hash tables will need to resize themselves more often than the smaller ones. As Joop Eggen says in his answer, try playing around with the initial capacity in the constructor, e.g. if you know that you have N elements, set the initial capacity to N / number_of_hash_tables or something along those lines. This ought to give both the smaller and the larger hash tables sufficient capacity that they won't need to be resized.
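A hedged sketch of that presizing idea (the element counts and names are assumptions):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Presized {
    public static void main(String[] args) {
        int totalElements = 100_000;   // assumed total, for illustration
        int numberOfMaps = 25;

        // Divide the capacity so each map can hold its share without resizing;
        // dividing by 0.75 accounts for the default load factor.
        int perMapCapacity = (int) (totalElements / (double) numberOfMaps / 0.75) + 1;

        List<Map<Integer, Double>> maps = new ArrayList<>();
        for (int i = 0; i < numberOfMaps; i++) {
            maps.add(new HashMap<>(perMapCapacity));
        }
        System.out.println("Each map presized to " + perMapCapacity);
    }
}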

Is efficiency of Java's TreeMap based on number of keys or values?

Since Java uses a red-black tree to implement the TreeMap class, is the efficiency of put() and get() lg(N), where N = number of distinct keys, or N = number of insertions/retrievals you plan to do?
For example, say I want to use a
TreeMap<Integer, ArrayList<String>>
to store the following data:
1 million <1, "bob"> pairs and 1 million <2, "jack"> pairs (the strings get inserted into the arraylist value corresponding to the key)
The final TreeMap will have 2 keys, each one storing an ArrayList of a million "bob" or "jack" strings. Is the time efficiency lg(2 million) or lg(2)? I am guessing it's lg(2), since that's how a red-black tree works, but I just wanted to check.
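For concreteness, the insertion pattern described might look like this (a sketch using Java 8's Map.computeIfAbsent; the loop bound mirrors the 1 million figure):
import java.util.ArrayList;
import java.util.TreeMap;

public class TwoKeyTree {
    public static void main(String[] args) {
        TreeMap<Integer, ArrayList<String>> map = new TreeMap<>();

        for (int i = 0; i < 1_000_000; i++) {
            // Each lookup walks a tree of at most 2 keys, so the cost
            // depends on the key count, not on the sizes of the lists.
            map.computeIfAbsent(1, k -> new ArrayList<>()).add("bob");
            map.computeIfAbsent(2, k -> new ArrayList<>()).add("jack");
        }
        System.out.println("Keys: " + map.size()); // 2
    }
}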
Performance of a TreeMap with 2 pairs will behave as N=2, regardless of how many duplicate additions were previously made. There is no "memory" of the excess additions so they cannot possibly produce any overhead.
So yes, you can informally assume that time efficiency is "log 2".
Although it's fairly meaningless, as big-O notation is intended to describe asymptotic efficiency rather than be relevant for small sizes. An O(N^3) algorithm could easily be faster than an O(log N) algorithm for N=2.
For this case, a TreeMap is lg(n) where n=2, as you describe. There are only 2 values in the map: one ArrayList and another ArrayList. No matter what is contained inside those, the map only knows of two values.
While not directly concerned with your question, you may want to consider not using a treemap for this... I mean, how do you plan to access the data stored inside your "bob" or "jack" lists? These are going to be O(n) searches unless you're going to use some kind of binary search on them or something, and the n here is 1 million. If you elaborate more on your end goal, perhaps a more encompassing solution can be achieved.

Hash: How does it work internally?

This might sound like a very vague question up front, but it is not. I have gone through the hash function description on Wikipedia, but it was not very helpful for understanding it.
I am looking for simple answers to rather complex topics like hashing. Here are my questions:
What do we mean by hashing? How does it work internally?
What algorithm does it follow?
What is the difference between HashMap, Hashtable and HashList?
What do we mean by 'constant time complexity', and why do different implementations of a hash give constant-time operations?
Lastly, why are hashes and linked lists asked about in most interview questions? Is there any specific logic to it in terms of testing the interviewee's knowledge?
I know my question list is long, but I would really appreciate some clear answers to these questions, as I really want to understand the topic.
Here is a good explanation of hashing. Say you want to store the string "Rachel". You apply a hash function to that string to get a memory location: myHashFunction(key: "Rachel", value: "Rachel") --> 10. The function may return 10 for the input "Rachel", so assuming you have an array of size 100, you store "Rachel" at index 10. If you want to retrieve that element, you just call getMyHashFunction("Rachel") and it will return 10. Note that in this example the key is "Rachel" and the value is "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in this case you have a collision. If you are implementing your own hash table you have to take care of this, perhaps using a linked list or other techniques.
Here are some common hash functions used. A good hash function satisfies the following: each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One of the methods is called the division method. We map a key k into one of n slots by taking the remainder of k divided by n: h(k) = k mod n. For example, if your array size is n = 100 and your key is the integer k = 110, then h(k) = 10.
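A minimal sketch of the division method in Java (the slot count n = 100 follows the example above):
public class DivisionHash {
    static final int N = 100; // number of slots, assumed for illustration

    // h(k) = k mod n; Math.floorMod keeps the result non-negative
    // even for negative keys.
    static int h(int k) {
        return Math.floorMod(k, N);
    }

    public static void main(String[] args) {
        System.out.println(h(110)); // 10
        System.out.println(h(15));  // 15
    }
}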
Hashtable is synchronized and HashMap is not.
HashMap allows a null key (and null values), but Hashtable does not.
The purpose of a hash table is to have O(c) constant time complexity for adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list until you get it, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key and the hash function will return you the desired element. If the hash function is well implemented it will be in constant time, O(c). This means you don't have to traverse all the elements stored in the hash table. You will get the element "instantly".
Of course a programmer/developer/computer scientist needs to know about data structures and complexity =)
Hashing means generating a (hopefully) unique number that represents a value.
Different types of values (Integer, String, etc) use different algorithms to compute a hashcode.
HashMap and Hashtable are maps; they are a collection of unique keys, each of which is associated with a value.
Java doesn't have a HashList class. A HashSet is a set of unique values.
Getting an item from a hashtable is constant-time with regard to the size of the table.
Computing a hash is not necessarily constant-time with regard to the value being hashed.
For example, computing the hash of a string involves iterating the string, and isn't constant-time with regard to the size of the string.
These are things that people ought to know.
Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible, i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object) by deriving a number from the object's memory address in the JVM.
The JVM address thing is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs allow generating good hashCode methods.
Hashtable and HashMap are essentially the same thing. They store key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values, only keys.
Constant time means that no matter how many entries there are in the hash table (or any other collection), the number of operations needed to find a given object by its key is constant. That is, 1, or close to 1.
This is basic computer-science material, and everyone is expected to be familiar with it. I think Google has stated that the hash table is the most important data structure in computer science.
I'll try to give simple explanations of hashing and of its purpose.
First, consider a simple list. Each operation (insert, find, delete) on such list would have O(n) complexity, meaning that you have to parse the whole list (or half of it, on average) to perform such an operation.
Hashing is a very simple and effective way of speeding this up: consider that we split the whole list into a set of small lists. Items in one such small list have something in common, and this something can be deduced from the key. For example, given a list of names, we could use the first letter to choose in which small list to look. In this way, by partitioning the data by the first letter of the key, we obtain a simple hash that splits the whole list into ~30 smaller lists, so that each operation takes O(n)/30 time.
However, we could note that the results are not that perfect. First, there are only 30 partitions, and we can't change that. Second, some letters are used more often than others, so the set with Y or Z will be much smaller than the set with A. For better results, it's better to find a way to partition the items into sets of roughly the same size. How could we solve that? This is where you use hash functions. It's a function that is able to create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like
#include <string.h>

#define NUMBER_OF_PARTITIONS 30  /* roughly one bucket per first letter, as above */

int hash(const char* str) {
    /* unsigned so that overflow wraps around and the modulo stays non-negative */
    unsigned int rez = 0;
    size_t len = strlen(str);
    for (size_t i = 0; i < len; i++)
        rez = rez * 37 + str[i];
    return rez % NUMBER_OF_PARTITIONS;
}
This would assure a quite even distribution and configurable number of sets (also called buckets).
What do we mean by hashing, how does it work internally?
Hashing is the transformation of a string into a shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table. It contains an array of items. A hash table computes an index from the data item's key and uses this index to place the data into the array.
What algorithm does it follow?
In simple words, most hash algorithms work on the logic index = f(key, arrayLength).
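A small illustration of that logic in Java (the bucket array size is an assumption):
public class IndexDemo {
    public static void main(String[] args) {
        Object[] buckets = new Object[16];
        String key = "Rachel";
        // index = f(key, arrayLength): hash the key, then fold it into the array range.
        int index = Math.floorMod(key.hashCode(), buckets.length);
        System.out.println(index);
    }
}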
Lastly, why are hashes and linked lists asked about in most interview questions? Is there any specific logic to it in terms of testing the interviewee's knowledge?
It's about how good you are at logical reasoning. The hash table is one of the most important data structures, and every programmer should know it.
