We have to look up some data based on three input data fields. The lookup has to be fast. There are only about 20 possible lookup combinations. We've implemented this using a static HashMap instance where we create a key by concatenating the three data fields. Is there a better way to do this, or is this the way to go? Code is below.
Update: I'm not implying that this code is slow. Just curious if there is a better way to do this. I thought there might be a more elegant solution but I'm happy to keep this in place if there are no compelling alternatives!
Create class level static HashMap instance:
private static Map<String, String> map = new HashMap<String, String>();
How we load data into memory:
private void load(Iterator iterator) {
    while (iterator.hasNext()) {
        MyObject o = (MyObject) iterator.next(); // MyObject stands in for the actual data class
        String key = o.getField1() + "-" + o.getField2() + "-" + o.getField3();
        map.put(key, o.getData());
    }
}
And how we look up the data based on the three fields:
private String getData(String f1, String f2, String f3) {
    String key = f1 + "-" + f2 + "-" + f3;
    return map.get(key);
}
Well, the question to ask yourself is of course "is it fast enough?" Because unless your application needs to be speedier and this is the bottleneck, it really doesn't matter. What you've got is already reasonably efficient.
That being said, if you want to squeeze every bit of speed possible out of this routine (without rewriting it in assembly language ;-) you might consider using an array instead of a HashMap, since there are only a small, limited number of keys. You'd have to develop some sort of hash function that hashes each object to a unique number between 0 and 19 (or however many elements you actually have). You may also be able to optimize the implementation of that hash function, although I couldn't tell you how exactly to do that without knowing the details of the objects you're working with.
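As a sketch of that array idea: if the legal values of each field are known up front, the three value indices can be combined into a unique slot number, which is a trivial perfect hash. All the field values below are hypothetical placeholders; the real tables would come from your domain.

```java
import java.util.Arrays;

// Sketch only: assumes each field takes one of a few known values, so a
// combination maps to a unique array index. Inputs outside the known
// values are not handled here.
class ArrayLookup {
    // Hypothetical known values for each field.
    private static final String[] F1 = {"A", "B"};
    private static final String[] F2 = {"X", "Y"};
    private static final String[] F3 = {"P", "Q", "R"};
    private static final String[] DATA = new String[F1.length * F2.length * F3.length];

    static int index(String f1, String f2, String f3) {
        int i1 = Arrays.asList(F1).indexOf(f1);
        int i2 = Arrays.asList(F2).indexOf(f2);
        int i3 = Arrays.asList(F3).indexOf(f3);
        // Mixed-radix combination: each valid triple gets a unique slot.
        return (i1 * F2.length + i2) * F3.length + i3;
    }

    static void put(String f1, String f2, String f3, String data) {
        DATA[index(f1, f2, f3)] = data;
    }

    static String get(String f1, String f2, String f3) {
        return DATA[index(f1, f2, f3)];
    }
}
```

A plain array read is about as cheap as a lookup gets, but whether it beats a warm HashMap in practice would need measuring.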
You could create a special key object having three String fields to avoid building up the key string:
class MapKey {
    public final String k1;
    public final String k2;
    public final String k3;

    public MapKey(String k1, String k2, String k3) {
        this.k1 = k1; this.k2 = k2; this.k3 = k3;
    }

    public MapKey(MyObject o) { // MyObject stands in for the actual data class
        this.k1 = o.getField1(); this.k2 = o.getField2(); this.k3 = o.getField3();
    }

    @Override
    public int hashCode() {
        return k1.hashCode(); // if k1 values tend to repeat, mix in hashes from k2 and k3 as well
    }

    @Override
    public boolean equals(Object obj) { // HashMap needs equals() as well as hashCode()
        if (!(obj instanceof MapKey)) return false;
        MapKey other = (MapKey) obj;
        return k1.equals(other.k1) && k2.equals(other.k2) && k3.equals(other.k3);
    }
}
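For completeness, here is a self-contained variant of this idea including the equals() override that HashMap also needs in order to locate entries (Objects.hash is just one convenient way to combine the three field hashes):

```java
import java.util.Objects;

// Sketch: a complete three-field key. HashMap uses hashCode() to pick a
// bucket and equals() to find the matching entry, so both are required.
final class TripleKey {
    private final String f1, f2, f3;

    TripleKey(String f1, String f2, String f3) {
        this.f1 = f1; this.f2 = f2; this.f3 = f3;
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TripleKey)) return false;
        TripleKey k = (TripleKey) o;
        return f1.equals(k.f1) && f2.equals(k.f2) && f3.equals(k.f3);
    }

    @Override public int hashCode() {
        return Objects.hash(f1, f2, f3); // combines all three field hashes
    }
}
```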
In your case I would keep using the implementation you outlined. For a large list of constant keys mapping to constant data, you could use Minimal Perfect Hashing. As it is not trivial to code this, and I am not sure about existing libraries, you have to consider the implementation cost before using this.
I think your approach is pretty fast. Any gains by implementing your own hashing algorithm would be very small, especially compared to the effort required.
One remark about your key format. You better make sure that your separator cannot occur in the field toString() values, otherwise you might get key collisions:
field1="a-", field2="b-", field3="c" -> key="a--b--c"
field1="a", field2="-b", field3="-c" -> key="a--b--c"
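This collision is easy to reproduce, and one possible fix (among several) is to escape the separator inside each field before joining:

```java
// Demonstrates the separator collision, then a safer variant that
// escapes the separator before joining. The escaping scheme here is
// just one hypothetical choice.
class KeyCollision {
    static String naiveKey(String f1, String f2, String f3) {
        return f1 + "-" + f2 + "-" + f3; // collides when fields contain '-'
    }

    static String safeKey(String f1, String f2, String f3) {
        return escape(f1) + "-" + escape(f2) + "-" + escape(f3);
    }

    private static String escape(String s) {
        // Escape the escape character first, then the separator itself.
        return s.replace("\\", "\\\\").replace("-", "\\-");
    }
}
```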
Concatenating strings is a bad idea for creating a key. My main objection is that it is unclear. But in practice a significant proportion of such implementations have bugs, notably that the separator can actually occur in the strings. In terms of performance, I have seen a program speed up ten percent simply by changing the key from a string hack to a meaningful key object. (If you really must be lazy about the code, you can use Arrays.asList to make the key - see the List.equals API doc.)
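A sketch of that Arrays.asList approach, which sidesteps the separator problem entirely because List.equals and List.hashCode are defined element-wise:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a list of the three fields works as a composite key with no
// separator problems, since list equality compares the elements.
class ListKeyExample {
    private static final Map<List<String>, String> map = new HashMap<List<String>, String>();

    static void put(String f1, String f2, String f3, String data) {
        map.put(Arrays.asList(f1, f2, f3), data);
    }

    static String get(String f1, String f2, String f3) {
        return map.get(Arrays.asList(f1, f2, f3));
    }
}
```

Note that this allocates a small list per lookup; for 20 entries that is unlikely to matter, but it is a trade-off to be aware of.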
Since you only have 20 combinations it might be feasible to handcraft a "give me the index 1..20 of this combination" based on knowing the characteristics of each combination.
Are you in a position to list the exact list of combinations?
Another way to get this done is to create an object to handle your key, in which you override equals() (and hashCode()) to do a test against an incoming key, testing field1, field2 and field3 in turn.
EDIT (in response to comment):
As the value returned from hashCode() is used by your Map to put your keys into buckets, (from which it then will test equals), the value could theoretically be the same for all keys. I wouldn't suggest doing that, however, as you would not reap the benefits of HashMaps performance. You would essentially be iterating over all of your items in a bucket and testing equals().
One approach you could take would be to delegate the call to hashCode() to one of the values in your key container. You could always return the hashCode from field3, for example. In this case, you will distribute your keys to potentially as many buckets as there are distinct values for field3. Once your HashMap finds the bucket, it will still need to iterate over the items in the bucket to test the result of equals() until it finds a match.
Another value you could return would be the sum of the values returned by hashCode() on all of your fields. As just discussed, this value does not need to be unique. Further, the potential for collisions, and therefore larger buckets, is much smaller. With that in mind, your lookups on the HashMap should be quicker.
EDIT2:
the question of a good hash code for this key has been answered in a separate question here
Related
I have the following scenario (modified from the actual business purpose).
I have a program which predicts how many calories a person will lose over the next 13 weeks, based on certain attributes.
I want to cache this result in the database so that I don't call the prediction again for the same combination.
I have class person
class Person { int personId; String weekStartDate; }
I have HashMap<List<Person>, Integer> - The key is 13 weeks data of a person and the value is the prediction
I will keep the hash value in the database for caching purposes.
Is there a better way to handle the above scenario? Is there any design pattern to support such scenarios?
Depends: the implementation of hashCode() uses the elements of your list. So adding elements later on changes the result of that operation:
public int hashCode() {
int hashCode = 1;
for (E e : this)
hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
return hashCode;
}
Maps aren't built for keys that can change their hash values! And of course, it doesn't really make sense to implement that method differently.
So: it can work when your lists are all immutable, meaning that neither the list nor any of its members is modified after the list was used as key. But there is a certain risk: if you forget about that contract later on, and these lists see modifications, then you will run into interesting issues.
This works because the hashcode of the standard List implementations is computed with the hashcodes of the contents. You need to make sure, however, to also implement hashCode and equals in the Person class, otherwise you will get the same problem this guy had. See also my answer on that question.
I would suggest you define a class (say Data) and use it as a key in your hashmap. Override equals/hashcode accordingly with knowledge of data over weeks.
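A minimal sketch of what that implies for the Person class above: both equals() and hashCode() must be value-based so that List<Person> behaves as a stable map key (the constructor and final fields are assumptions, not in the original class):

```java
import java.util.Objects;

// Sketch: Person must override equals() and hashCode() for a
// List<Person> to work as a value-based map key. Fields are final so
// the key cannot change after insertion.
class Person {
    final int personId;
    final String weekStartDate;

    Person(int personId, String weekStartDate) {
        this.personId = personId;
        this.weekStartDate = weekStartDate;
    }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return personId == p.personId && Objects.equals(weekStartDate, p.weekStartDate);
    }

    @Override public int hashCode() {
        return Objects.hash(personId, weekStartDate);
    }
}
```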
I have three hashCode methods as follows, I prioritised them based on their efficiency. I am wondering if there is any other way to make a more efficient hashCode method.
1) public int hashCode() { //terrible
return 5;
}
2) public int hashCode() { //a bit less terrible
return name.length(); // assuming name is a String
}
3) public int hashCode() { //better
final int prime = 31;
int result = 1;
result = prime * result + ((name == null) ? 0 : name.hashCode());
return result;
}
There is no surefire way to guarantee that your hashcode function is optimal because it is measured by two different metrics.
Efficiency - How quick it is to calculate.
Collisions - What is the chance of collision.
Your three:
1. Maximises efficiency at the expense of collisions.
2. Finds a spot somewhere in the middle - but still not good.
3. Least efficient, but best for avoiding collisions - still not necessarily the best overall.
You have to find the balance yourself.
Sometimes it is obvious when there is a very efficient method that never collides (e.g. the ordinal of an enum).
Sometimes memoising the values is a good solution - this way even a very inefficient method can be mitigated, because it is only ever calculated once. There is an obvious memory cost to this, which also must be balanced.
Sometimes the overall functionality of your code contributes to your choice. Say you want to put File objects in a HashMap. A number of options are clear:
Use the hashcode of the file name.
Use the hashcode of the file path.
Use a crc of the contents of the file.
Use the hashcode of the SHA1 digest of the contents of the file.
Why collisions are bad
One of the main uses of hashcode is when inserting objects into a HashMap. The algorithm requests a hash code from the object and uses it to decide which bucket to put the object in. If the hash collides with that of another object, both objects end up in the same bucket, and finding the right one then costs extra equals() comparisons. If all hashes are unique then the map holds one item per bucket and is thus maximally efficient.
See the excellent WikiPedia article on Hash Table for a deeper discussion on how HashMap works.
I prioritised them based on their efficiency
Your list is sorted by ascending efficiency—if by "efficiency" you mean the performance of your application as opposed to the latency of the hashCode method isolated from everything else. A hashcode with bad dispersion will result in a linear or near-linear search through a linked list inside HashMap, completely annulling the advantages of a hashtable.
Especially note that, on today's architectures, computation is much cheaper than pointer dereference, and it comes at a fixed low cost. A single cache miss is worth a thousand simple arithmetic operations and each pointer dereference is a potential cache miss.
In addition to the valuable answers so far, I'd like to add some other methods to consider:
3a):
public int hashCode() {
return Objects.hashCode(name);
}
Not many pros/cons in terms of performance, but a bit more concise.
4.) You should either provide more information about the class that you are talking about, or reconsider your design. But using a class as the key of a hash map when the only property of this class is a String, then you might also be able to just use the String directly. So option 4 is:
// Changing this...
Map<Key, Value> map;
map.put(key, value);
Value value = map.get(key);
// ... to this:
Map<String, Value> map;
map.put(key.getName(), value);
Value value = map.get(key.getName());
(And if this is not possible, because the "name" of a Key might change after it has been created, you're in bigger trouble anyhow - see the next point)
5.) Maybe you can precompute the hash code. In fact, this is also done in the java.lang.String class:
public final class String
implements java.io.Serializable, Comparable<String>, CharSequence {
...
/** Cache the hash code for the string */
private int hash; // Default to 0
But of course, this only makes sense for immutable classes. You should be aware of the fact that using mutable classes as keys of a Map is "dangerous" and may lead to consistency errors, and should only be done when you're absolutely sure that the instances that are used as keys won't change.
So if you want to use your class as the keys, and maybe your class even has more fields than just a single one, then you could store the hash code as a field:
class Key
{
private final String name;
... // Other fields...
private final int hashCode;
Key(String name, ...)
{
this.name = name;
... // Other fields
// Pre-compute and store the hash code:
this.hashCode = computeHashCode();
}
private int computeHashCode()
{
int result = 31;
result = 31 * result + Objects.hashCode(name);
result = 31 * result + ... // Other fields
return result;
}
}
My answer goes down a different path - basically it is not an answer, but a question: why do you worry about the performance of hashCode()?
Did you do exhaustive profiling of your application and find that there is a performance problem originating from that one method on some of your objects?
If the answer to that question is "no" ... then why do you think you need to worry about this one method? Why do you think that the default, generated by Eclipse and probably used billions of times each day, isn't good enough for you?
See here for explanations why it is in general a very bad idea to waste ones time with such questions.
Yes, there are better alternatives.
xxHash or MurmurHash3 are general-purpose hashing algorithms that are both faster and better in quality.
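If pulling in a library is not an option, the 32-bit finalizer from MurmurHash3 ("fmix32", with the constants from Austin Appleby's public-domain reference implementation) is a cheap way to improve the dispersion of an existing hashCode(). A sketch, using a hypothetical single-field class:

```java
// Sketch: fmix32 from MurmurHash3's reference implementation. It is a
// bijective mixing step, so distinct inputs stay distinct while the
// bits get spread much more evenly.
final class Mixed {
    private final String name;

    Mixed(String name) { this.name = name; }

    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    @Override public int hashCode() {
        return fmix32(name.hashCode()); // re-mix the String hash
    }

    @Override public boolean equals(Object o) {
        return o instanceof Mixed && name.equals(((Mixed) o).name);
    }
}
```

Note that HashMap already applies its own mild spreading internally, so measure before assuming this helps.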
For example, I need 10 objects stored in the hashmap.
So it creates keys 1,2,3,4,5
Then when I'm finished with '3', it deletes the whole entry and key for '3', making that key able to be re-used for new object mappings - in case I run over via integer overflow or something.
Thoughts?
public static HashMap<Integer, GameState> myMap = new HashMap<Integer, GameState>();
int i = 0;

public void mapNewGameState(GameState gs) {
    myMap.put(i, gs);
    i++;
}

myMap.remove(3);
// Now I want to be sure that my mapNewGameState function is able to eventually map a new GameState to the key 3 later on,
this is more a question about if HashMaps can be used in this way.
As I understand it, you propose a key pool where you take a key and, when you don't need it anymore, put it back into the pool? If so, this doesn't really make much sense, since it adds complexity to your code with no other benefit (usually you pool something that's expensive to create or hold). And usually you want to recycle the value, not the key.
To create truly (practically) unique keys use UUID.randomUUID(); with this you don't have to worry about keys at all.
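A minimal sketch of that suggestion:

```java
import java.util.UUID;

// Sketch: random UUIDs are practically collision-free, so keys never
// need to be tracked or recycled.
class UuidKeys {
    static String newKey() {
        return UUID.randomUUID().toString(); // 36-character string, e.g. "xxxxxxxx-xxxx-..."
    }
}
```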
I came across this old question because I have a use case in Mono Webassembly (where we have to code as though CPU speeds are back to 2000s levels) that warrants this, and was disappointed that no answer was actually given. So let me offer my solution that works for integer keys (and could be adapted to work with keys of any type by providing a parameterized way to generate a new key):
public class RecyclingDictionary<T> : Dictionary<int, T>
{
Queue<int> _freedKeys = new Queue<int>();
int _lastAssignedKey = -1;
public int Add(T item)
{
var key = GetNextKey();
base[key] = item;
return key;
}
public new void Remove(int key)
{
base.Remove(key);
_freedKeys.Enqueue(key);
}
private int GetNextKey()
{
if (_freedKeys.Count > 0)
{
int key = _freedKeys.Dequeue();
return key;
}
return ++_lastAssignedKey;
}
}
As for why one might want to do this, well, here are a couple reasons:
Guids are expensive to generate and store in memory (36 bytes when stringified vs. 4 for an int). They're also more expensive to hash and thus more expensive to use as Dictionary keys.
Integer keys are usually much faster than string keys, however, there is an article (which I sadly cannot find) from years ago that analyzed the efficiency of different key types and concluded that simple incrementing integers were actually a POOR key type. As I recall, it had to do with the way dictionaries bucket their entries for purposes of the binary search tree. Thus the impetus to recycle keys.
The best look-up structure is a hash table. It provides constant-time access on average (linear in the worst case).
This depends on the hash function. Ok.
My question is the following. Assuming a good implementation of a hash table, e.g. HashMap, is there a best practice concerning the keys passed to the map? I mean, it is recommended that the key be an immutable object, but I was wondering if there are other recommendations.
Example the size of the key? For example in a good hashmap (in the way described above) if we used String as keys, won't the "bottleneck" be in the string comparison for equals (trying to find the key)? So should the keys be kept small? Or are there objects that should not be used as keys? E.g. a URL? In such cases how can you choose what to use as a key?
The best performing key for a HashMap is probably an Integer, where hashCode() and equals() are implemented as:
public int hashCode() {
return value;
}
public boolean equals(Object obj) {
if (obj instanceof Integer) {
return value == ((Integer)obj).intValue();
}
return false;
}
That said, the purpose of a HashMap is to map some objects (keys) to others (values). The hash function is applied to the key in order to provide fast, constant-time access to the values.
it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.
The recommendation is to Map objects to what you need: don't think what is faster; but think what is the best for your business logic to address the objects to retrieve.
The important requirement is that the key object must be immutable, because if you change the key object after storing it in the Map it may be not possible to retrieve the associated value later.
The key word in HashMap is Map. Your object should just map. If you sacrifice the mapping task optimizing the key, you are defeating the purpose of the Map - without probably achieving any performance boost.
I 100% agree with the first two comments in your question:
the major constraint is that it has to be the thing that you want to base the lookup on ;)
– Oli Charlesworth
The general rule is to use as the key whatever you need to look up with.
– Louis Wasserman
Remember the two rules for optimization:
Don't.
(for experts only) don't yet.
The third rule is: profile before you optimize.
You should use whatever key you want to use to lookup things in the data structure, it's typically a domain-specific constraint. With that said, keep in mind that both hashCode() and equals() will be used in finding a key in the table.
hashCode() is used to find the position of the key, while equals() is used to determine if the key you are searching for is actually the key that we just found using hashCode().
For example, consider two keys a and b that have the same hash code in a table using separate chaining. Then a search for a would require testing if a.equals(key) for potentially both a and b in the table once we find the index of the list containing a and b from hashCode().
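This is easy to demonstrate with a deliberately colliding key; the class below is purely illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a key whose hashCode() always collides. HashMap still works
// correctly, but must fall back on equals() to distinguish keys that
// share a bucket.
class CollidingKeyDemo {
    static final class Key {
        final String id;
        Key(String id) { this.id = id; }
        @Override public int hashCode() { return 42; } // every key lands in one bucket
        @Override public boolean equals(Object o) {
            return o instanceof Key && id.equals(((Key) o).id);
        }
    }

    static String lookup(String a, String b, String probe) {
        Map<Key, String> m = new HashMap<Key, String>();
        m.put(new Key(a), "value-" + a);
        m.put(new Key(b), "value-" + b);
        return m.get(new Key(probe)); // equals() resolves the collision
    }
}
```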
it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.
The key of the value should be final.
Most times a field of the object is used as the key. If that field changes, then the map cannot find it:
void foo(Employee e) {
    map.put(e.getId(), e);
    String newId = e.getId() + "new";
    e.setId(newId);
    Employee e2 = map.get(newId);
    // e2 == null: the map still holds e under the old id
}
So Employee should not have a setId() method at all, but that is difficult because when you are writing Employee you don't know what it will be keyed by.
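A self-contained sketch of this pitfall, with a hypothetical Employee class, can be exercised directly:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: once the field the key was derived from changes, the new id
// finds nothing - the map entry still lives under the old id.
class MutableKeyDemo {
    static final class Employee {
        private String id;
        Employee(String id) { this.id = id; }
        String getId() { return id; }
        void setId(String id) { this.id = id; }
    }

    static Employee lookupAfterRename() {
        Map<String, Employee> map = new HashMap<String, Employee>();
        Employee e = new Employee("42");
        map.put(e.getId(), e);       // stored under "42"
        e.setId(e.getId() + "new");  // id changes, map entry does not
        return map.get(e.getId());   // looking up "42new" finds nothing
    }
}
```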
I dug up the implementation. I had assumed that the effectiveness of the hashCode() method would be the key factor.
When I looked into the HashMap and the Hashtable implementations, I found that they are quite similar (with one exception). Both use and store an internal hash code for all entries, so that's a good point: hashCode() does not influence the performance as heavily as one might think.
Both have a number of buckets, in which the values are stored. It is important to balance the number of buckets (say n) against the average number of keys within a bucket (say k). The bucket is found in O(1) time, its contents are iterated in O(k) time, but the more buckets we have, the more memory is allocated. Also, if many buckets are empty, it means that the hashCode() method of the key class does not spread the hash codes widely enough.
The algorithm works like this:
Take the `hashCode()` of the Key (and make a slight bijective transformation on it)
Find the appropriate bucket
Loop through the content of the bucket (which is some kind of LinkedList)
Make the comparison of the keys as follows:
1. Compare the hashcodes
(it is calculated in the first step, and stored for the entry)
2. Examine if key `==` the stored key (still no equals() call)
(this step is missing from Hashtable)
3. Compare the keys by `key.equals(storedKey)`
To summarize:
hashCode() is called once per call (this is a must, you cannot do without it)
equals() is called if the hashCode is not so well spread, and two keys happen to have the same hashcode
The same algorithm runs for get() and put() (because in the put() case you can set the value for an existing key). So the most important thing is how the hashCode() method is implemented; that is the most frequently called method.
Two strategies are: make it fast and make it effective (well-spread). The JDK developers made efforts to make it both, but it's not always possible to have them both.
Numeric types are good
Object (and classes that do not override it) are good (hashCode() is native), except that you cannot specify your own equals()
String is not good, iterates through the characters, but caches after that (see my comment below)
Any class with synchronized hashCode() is not good
Any class that has an iteration is not good
Classes that have hashcode cache are a bit better (depends on the usage)
Comment on the String: to make it fast, in the first versions of the JDK the String hash code calculation was made over the first 32 characters only. But the hash code it produced was not well spread, so they decided to take all the characters into the hash code.
I have a similar problem to the one discussed here, but with a stronger practical motivation.
For example, I have a Map<String, Integer> and some function which, given a key, puts null into the map if the mapped integer value is negative:
Map<String, Integer> map = new HashMap<String, Integer>();
public void nullifyIfNegative(String key) {
Integer value = map.get(key);
if (value != null && value.intValue() < 0) {
map.put(key, null);
}
}
In this case, the lookup (and hence the hashCode calculation for the key) is done twice: once for the lookup and once for the replacement. It would be nice to have another method (which already exists inside HashMap, just not publicly) that would make this more efficient:
public void nullifyIfNegative(String key) {
Map.Entry<String, Integer> entry = map.getEntry(key);
if (entry != null && entry.getValue().intValue() < 0) {
entry.setValue(null);
}
}
The same concerns cases, when you want to manipulate immutable objects, which can be map values:
Map<String, String>: I want to append something to the string value.
Map<String, int[]>: I want to insert a number into the array.
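For what it's worth, later JDKs (Java 8+) added exactly this kind of single-traversal update to the Map interface itself; for the string-append case above, Map.merge does it in one lookup. A sketch:

```java
import java.util.Map;

// Sketch: Map.merge (Java 8+) performs the lookup-and-update in a
// single traversal of the map.
class MergeDemo {
    static String append(Map<String, String> map, String key, String suffix) {
        // Concatenates onto the existing value, or inserts suffix if absent.
        return map.merge(key, suffix, String::concat);
    }
}
```

Map.compute works similarly for arbitrary conditional updates, with the caveat that returning null from the remapping function removes the entry rather than storing null.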
So the case is quite common. Solutions, which might work, but not for me:
Reflection. Is good, but I cannot sacrifice performance just for this nice feature.
Use org.apache.commons.collections.map.AbstractHashedMap (it has at least protected getEntry() method), but unfortunately, commons-collections do not support generics.
Use generic commons-collections, but this library (AFAIK) is out-of-date (not in sync with latest library version from Apache), and (what is critical) is not available in central maven repository.
Use value wrappers, which means "making values mutable" (e.g. use mutable integers [e.g. org.apache.commons.lang.mutable.MutableInt], or collections instead of arrays). This solution leads to memory overhead, which I would like to avoid.
Try to extend java.util.HashMap with custom class implementation (which should be in java.util package) and put it to endorsed folder (as java.lang.ClassLoader will refuse to load it in Class<?> defineClass(String name, byte[] b, int off, int len), see sources), but I don't want to patch JDK and it seems like the list of packages that can be endorsed, does not include java.util.
The similar question is already raised on sun.com bugtracker, but I would like to know, what is the opinion of the community and what can be the way out taking in mind the maximum memory & performance effectiveness.
If you agree, this is nice and beneficiary functionality, please, vote this bug!
As a logical matter, you're right: the single getEntry would save you a hash lookup. As a practical matter, unless you have a specific use case where you have reason to be concerned about the performance hit (which seems pretty unlikely: hash lookup is common, O(1), and well optimized), what you're worrying about is probably negligible.
Why don't you write a test? Create a hashtable with a few 10's of millions of objects, or whatever's an order of magnitude greater than what your application is likely to create, and average the time of a get() over a million or so iterations (hint: it's going to be a very small number).
A bigger issue with what you're doing is synchronization. You should be aware that if you're doing conditional alterations on a map you could run into issues, even if you're using a Synchronized map, as you'd have to lock access to the key covering the span of both the get() and set() operations.
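If you are on Java 8 or later, one way to sidestep that locking concern is ConcurrentHashMap.compute, which runs the check-then-update atomically. Note that ConcurrentHashMap forbids null values, so a sentinel stands in for "nullified" in this sketch:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: compute() runs the whole read-test-write as one atomic
// operation, so no external lock spanning get() and put() is needed.
class AtomicUpdate {
    static final Integer NULLIFIED = Integer.MIN_VALUE; // sentinel: CHM cannot store null
    static final ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();

    static void nullifyIfNegative(String key) {
        map.compute(key, (k, v) ->
            (v != null && v < 0) ? NULLIFIED : v);
    }
}
```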
Not pretty, but you could use a lightweight object to hold a reference to the actual value to avoid the second lookup.
HashMap<String, String[]> map = ...;
// append value to the current value of key
String key = "key";
String value = "value";
// I use an array to hold a reference - even uglier than the whole idea itself ;)
String[] ref = new String[1]; // lightweight object
String[] prev = map.put(key, ref);
ref[0] = (prev != null) ? prev[0] + value : value;
I wouldn't worry about hash lookup performance too much though (Steve B's answer is pretty good in pointing out why). Especially with String keys, I wouldn't worry too much about hashCode() as its result is cached. You could worry about equals() though as it might be called more than once per lookup. But for short strings (which are often used as keys) this is negligible too.
There is no performance gain from this proposal, because the performance of a Map is O(1) in the average case. But enabling access to the raw Entry in such a case would raise another problem: it would become possible to change the key inside an entry (even if only via reflection) and thereby break the internal structure of the map.