Creating a hash from several Java string objects - java

What would be the fastest and most robust (in terms of uniqueness) way to implement a method like
public abstract String hash(String[] values);
The values[] array has 100 to 1,000 members, each of which has a few dozen characters, and the method needs to be run about 10,000 times/sec on a different values[] array each time.
Should a long string be built using a StringBuilder buffer and then a hash method invoked on the buffer contents, or is it better to keep invoking the hash method for each string from values[]?
Obviously a hash of at least 64 bits is needed (e.g., MD5) to avoid collisions, but is there anything simpler and faster that could be done, at the same quality?
For example, what about
public String hash(String[] values)
{
long result = 0;
for (String v:values)
{
result += v.hashCode();
}
return String.valueOf(result);
}

Definitely don't use plain addition due to its linearity properties, but you can modify your code just slightly to achieve very good dispersion.
public String hash(String[] values) {
long result = 17;
for (String v:values) result = 37*result + v.hashCode();
return String.valueOf(result);
}
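To make the linearity problem concrete: a plain sum ignores the order of the elements, while the multiply-then-add form does not. The following is only a small sketch; the class and method names are made up for illustration.

```java
final class CombineDemo {
    // Plain sum of the element hash codes: the order of the values does not matter.
    static long additive(String[] values) {
        long result = 0;
        for (String v : values) {
            result += v.hashCode();
        }
        return result;
    }

    // Multiply-then-add, with the same constants (17 and 37) as the answer's snippet.
    static long multiplicative(String[] values) {
        long result = 17;
        for (String v : values) {
            result = 37 * result + v.hashCode();
        }
        return result;
    }
}
```

With {"foo", "bar"} and {"bar", "foo"}, additive returns the same value for both arrays, whereas multiplicative returns two different values.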

It doesn't provide a 64 bit hash, but given the title of the question it's probably worth mentioning that since Java 1.7 there is java.util.Objects#hash(Object...).

Here is a simple implementation using the Objects class, available since Java 7.
@Override
public int hashCode()
{
return Objects.hash(this.variable1, this.variable2);
}

You should watch out for creating weaknesses when combining methods (the Java hash function and your own). I did a little research on cascaded ciphers, and this is an example of it: the addition might interfere with the internals of hashCode().
The internals of hashCode() look like this:
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
so adding numbers together will cause the last characters of all strings in the array to just be added, which doesn't lower the randomness (this is already bad enough for a hash function).
If you want real pseudorandomness, take a look at the FNV hash algorithm. It is a very fast, simple hash algorithm that is well suited for use in hash tables such as HashMaps.
It goes like this:
long hash = 0xCBF29CE484222325L;
for(String s : strings)
{
hash ^= s.hashCode();
hash *= 0x100000001B3L;
}
^ This is not the actual implementation of FNV as it takes ints as input instead of bytes, but I think it works just as well.
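For comparison, a byte-wise FNV-1a (64 bit) could look roughly like the sketch below. The class and method names are made up, and the 0xFF separator fed in after each string is an assumption of this sketch (that byte never occurs in UTF-8 output), added so that {"ab", "c"} and {"a", "bc"} hash differently. Note that getBytes allocates a byte array per string, which may matter at 10,000 calls/sec.

```java
import java.nio.charset.StandardCharsets;

final class Fnv1a64 {
    private static final long OFFSET_BASIS = 0xCBF29CE484222325L;
    private static final long PRIME = 0x100000001B3L;

    // Standard FNV-1a over a byte array: XOR in each byte, then multiply by the prime.
    static long hashBytes(byte[] data) {
        long h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= PRIME;
        }
        return h;
    }

    // Hashes all strings as one byte stream, with a separator byte after each string.
    static long hashStrings(String[] values) {
        long h = OFFSET_BASIS;
        for (String s : values) {
            for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                h ^= (b & 0xFF);
                h *= PRIME;
            }
            h ^= 0xFF; // separator (assumption of this sketch, not part of FNV itself)
            h *= PRIME;
        }
        return h;
    }
}
```

If a String result is required, Long.toHexString(Fnv1a64.hashStrings(values)) would do.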

First, a hash code is typically numeric, e.g. an int. Your version of the hash function computes a number and then returns its string representation, which IMHO does not make sense.
I'd improve your hash method as follows:
public int hash(String[] values) {
    int result = 0;
    for (String v : values) {
        result = result * 31 + v.hashCode();
    }
    return result;
}
Take a look at hashCode() as implemented in class java.lang.String.

Related

Is there a data structure that only stores hash codes and not the actual objects?

My use-case is that I'm looking for a data structure in Java that will let me see if an object with the same hash code is inside (by calling contains()), but I will never need to iterate through the elements or retrieve the actual objects. A HashSet is close, but from my understanding, it still contains references to the actual objects, and that would be a waste of memory since I won't ever need the contents of the actual objects. The best option I can think of is a HashSet of type Integer storing only the hash codes, but I'm wondering if there is a built-in data structure that would accomplish the same thing (and only accept one type as opposed to HashSet of type Integer which will accept the hash code of any object).
A Bloom filter can tell whether an object might be a member, or is definitely not a member. You can control the likelihood of false positives. Each hash value maps to a single bit.
The Guava library provides an implementation in Java.
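If pulling in Guava is not an option, the idea can be sketched on top of java.util.BitSet. This is a minimal illustration, not Guava's implementation: the class name, the bit mixing, and the way the k bit positions are derived from hashCode() are all assumptions of this sketch.

```java
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Sets numHashes bits derived from the object's hash code.
    void add(Object o) {
        int h1 = mix(o.hashCode());
        int h2 = mix(h1 + 0x9E3779B9) | 1;
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    // False means "definitely not added"; true means "possibly added".
    boolean mightContain(Object o) {
        int h1 = mix(o.hashCode());
        int h2 = mix(h1 + 0x9E3779B9) | 1;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th position is h1 + i * h2, reduced to [0, numBits).
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Simple bit mixer so that similar hash codes spread over the bit array.
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return h;
    }
}
```

More bits and a suitable number of hash positions lower the false-positive rate; the filter never stores a reference to the added objects.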
You could use a primitive collection implementation like IntSet to store values of hash codes. Obviously as others have mentioned this assumes collisions aren't a problem.
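If you want to track whether a hash code is already present and to do so in a memory-efficient way, a BitSet may suit your requirements. Note that BitSet indices must be non-negative, so this only works directly for non-negative hash codes.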
Look at the following example:
public static void main(String[] args) {
BitSet hashCodes = new BitSet();
hashCodes.set("1".hashCode());
System.out.println(hashCodes.get("1".hashCode())); // true
System.out.println(hashCodes.get("2".hashCode())); // false
}
The BitSet "implements a vector of bits that grows as needed.". It's a JDK "built-in data structure" which doesn't contain "references to the actual objects". It stores only if "the same hash code is inside".
EDIT:
As @Steve mentioned in his comment the implementation of the BitSet isn't the most memory efficient one. But there are more memory efficient implementations of a bit set - though not built-in.
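One practical caveat with the BitSet approach: hashCode() can be negative, and BitSet.set throws an IndexOutOfBoundsException for a negative index. A possible workaround, sketched here with a made-up class name, is to keep a second BitSet for negative hash codes and index it with the bitwise complement.

```java
import java.util.BitSet;

final class HashCodeBits {
    private final BitSet nonNegative = new BitSet();
    private final BitSet negative = new BitSet();

    void add(Object o) {
        int h = o.hashCode();
        if (h >= 0) {
            nonNegative.set(h);
        } else {
            negative.set(~h); // ~h is in [0, Integer.MAX_VALUE] for negative h
        }
    }

    boolean containsHashCodeOf(Object o) {
        int h = o.hashCode();
        return h >= 0 ? nonNegative.get(h) : negative.get(~h);
    }
}
```

Keep in mind that a single hash code close to Integer.MAX_VALUE makes the BitSet grow to roughly 256 MB, so this only pays off when many hash codes are stored.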
There is no such built-in data structure, because such a data structure is rarely needed. It's easy to build one, though.
import java.util.HashSet;

public class HashCodeSet<T> {
    // Only the hash codes are stored, never the elements themselves.
    private final HashSet<Integer> hashCodes;

    public HashCodeSet() {
        hashCodes = new HashSet<>();
    }

    public HashCodeSet(int initialCapacity) {
        hashCodes = new HashSet<>(initialCapacity);
    }

    public HashCodeSet(HashCodeSet<?> toCopy) {
        hashCodes = new HashSet<>(toCopy.hashCodes);
    }

    public void add(T element) {
        hashCodes.add(element.hashCode());
    }

    public boolean containsHashCodeOf(T element) {
        return hashCodes.contains(element.hashCode());
    }

    @Override
    public boolean equals(Object o) {
        return o == this || o instanceof HashCodeSet &&
                ((HashCodeSet<?>) o).hashCodes.equals(hashCodes);
    }

    @Override
    public int hashCode() {
        return hashCodes.hashCode(); // hash-ception
    }

    @Override
    public String toString() {
        return hashCodes.toString();
    }
}

How to efficiently store a set of tuples/pairs in Java

I need to perform a check if the combination of a long value and an integer value were already seen before in a very performance-critical part of an application. Both values can become quite large, at least the long will use more than MAX_INT values in some cases.
Currently I have a very simple implementation using a Set<Pair<Integer, Long>>, however this will require too many allocations, because even when the object is already in the set, something like seen.add(Pair.of(i, l)) to add/check existence would allocate the Pair for each call.
Is there a better way in Java (without libraries like Guava, Trove or Apache Commons), to do this check with minimal allocations and in good O(?)?
Two ints would be easy because I could combine them into one long in the Set, but the long cannot be avoided here.
Any suggestions?
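For reference, the two-int case mentioned above can be sketched as follows (class and method names are made up for illustration). An int plus a long needs 96 bits, so the same trick cannot pack the pair into a single long without losing information.

```java
final class IntPairPacking {
    // Puts 'high' into the upper 32 bits and 'low' into the lower 32 bits.
    static long pack(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFFFFFFL);
    }

    static int high(long packed) {
        return (int) (packed >>> 32);
    }

    static int low(long packed) {
        return (int) packed;
    }
}
```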
Here are two possibilities.
One thing in both of the following suggestions is to store a bunch of pairs together as triple ints in an int[]. The first int would be the int and the next two ints would be the upper and lower half of the long.
If you didn't mind a 33% extra space disadvantage in exchange for an addressing speed advantage, you could use a long[] instead and store the int and long in separate indexes.
You'd never call an equals method. You'd just compare the three ints with three other ints, which would be very fast. You'd never call a compareTo method. You'd just do a custom lexicographic comparison of the three ints, which would be very fast.
B* tree
If memory usage is the ultimate concern, you can make a B* tree using an int[][] or an ArrayList<int[]>. B* trees are relatively quick and fairly compact.
There are also other types of B-trees that might be more appropriate to your particular use case.
Custom hash set
You can also implement a custom hash set with a custom, fast-calculated hash function (perhaps XOR the int and the upper and lower halves of the long together, which will be very fast) rather than relying on the hashCode method.
You'd have to figure out how to implement the int[] buckets to best suit the performance of your application. For example, how do you want to convert your custom hash code into a bucket number? Do you want to rebucket everything when the buckets start getting too many elements? And so on.
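One way the custom hash set could be laid out is open addressing with linear probing over parallel primitive arrays, which avoids per-pair objects entirely. The sketch below makes several assumptions: the class name, the multiplicative hash (instead of the plain XOR suggested above), the load factor of 0.5, and the lack of any handling for tables beyond 2^30 slots.

```java
final class IntLongPairSet {
    private int[] ints;
    private long[] longs;
    private boolean[] used;
    private int size;

    IntLongPairSet(int initialCapacity) {
        int capacity = 16;
        while (capacity < initialCapacity && capacity < (1 << 30)) {
            capacity <<= 1; // keep the capacity a power of two
        }
        ints = new int[capacity];
        longs = new long[capacity];
        used = new boolean[capacity];
    }

    // Adds the pair; returns true if it was not present before.
    boolean add(int i, long l) {
        if ((size + 1) * 2L > used.length) {
            grow(); // keep the table at most half full so probing stays short
        }
        int slot = findSlot(i, l);
        if (used[slot]) {
            return false;
        }
        ints[slot] = i;
        longs[slot] = l;
        used[slot] = true;
        size++;
        return true;
    }

    boolean contains(int i, long l) {
        return used[findSlot(i, l)];
    }

    // Linear probing: returns the slot holding the pair, or the first free slot.
    private int findSlot(int i, long l) {
        int mask = used.length - 1;
        int slot = hash(i, l) & mask;
        while (used[slot] && (ints[slot] != i || longs[slot] != l)) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    private static int hash(int i, long l) {
        long h = l * 0x9E3779B97F4A7C15L + i;
        return (int) (h ^ (h >>> 32));
    }

    // Doubles the table and re-inserts every stored pair.
    private void grow() {
        int[] oldInts = ints;
        long[] oldLongs = longs;
        boolean[] oldUsed = used;
        int newCapacity = oldUsed.length * 2;
        ints = new int[newCapacity];
        longs = new long[newCapacity];
        used = new boolean[newCapacity];
        for (int s = 0; s < oldUsed.length; s++) {
            if (oldUsed[s]) {
                int slot = findSlot(oldInts[s], oldLongs[s]);
                ints[slot] = oldInts[s];
                longs[slot] = oldLongs[s];
                used[slot] = true;
            }
        }
    }
}
```

Because add reports whether the pair was new, the "seen before?" check and the insertion are a single call, and the only allocations happen when the table grows.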
How about creating a class that holds two primitives instead? You would drop at least 24 bytes just for the headers of Integer and Long in a 64 bit JVM.
Under these conditions you are looking for a pairing function, i.e. a way to generate a unique number from two numbers. The Wikipedia page on pairing functions has a very good (and simple) example of one such possibility.
How about
class Pair {
int v1;
long v2;
@Override
public boolean equals(Object o) {
if (!(o instanceof Pair)) return false; // also covers null
Pair other = (Pair) o;
return v1 == other.v1 && v2 == other.v2;
}
@Override
public int hashCode() {
return 31 * (31 + Integer.hashCode(v1)) + Long.hashCode(v2);
}
}
class Store {
// initial capacity should be tweaked
private static final Set<Pair> store = new HashSet<>(100*1024);
private static final ThreadLocal<Pair> threadPairUsedForContains = new ThreadLocal<>();
void init() { // each thread has to call init() first
threadPairUsedForContains.set(new Pair());
}
boolean contains(int v1, long v2) { // zero allocation contains()
Pair pair = threadPairUsedForContains.get();
pair.v1 = v1;
pair.v2 = v2;
return store.contains(pair);
}
void add(int v1, long v2) {
Pair pair = new Pair();
pair.v1 = v1;
pair.v2 = v2;
store.add(pair);
}
}

Combining Hash of String and Hash of Long

I have the following java class:
public class Person{
String name; //a unique name
Long DoB; //a unique time
.
.
.
@Override
public int hashCode(){
return name.hashCode() + DoB.hashCode();
}
}
Is my hashCode method correct (i.e. would it return a unique number for all combinations)?
I have a feeling I'm missing something here.
You could let java.util.Arrays do it for you:
return Arrays.hashCode(new Object[]{ name, DoB });
You might also want to use something more fluent and more NPE-bulletproof like Google Guava:
@Override
public int hashCode(){
return Objects.hashCode(name, DoB);
}
@Override
public boolean equals(Object o) {
if ( this == o ) {
return true;
}
if ( o == null || o.getClass() != Person.class ) {
return false;
}
final Person that = (Person) o;
return Objects.equal(name, that.name) && Objects.equal(DoB, that.DoB);
}
Edit:
IntelliJ IDEA and Eclipse can generate more efficient hashCode() and equals().
Aside from the obvious, which is that you might want to implement the equals method as well...
Summing two hash codes can overflow int. In Java the result silently wraps around, so this does not break anything by itself.
The sum itself seems like a bit of a weak methodology to provide unique hash codes. I would instead try some bitwise manipulation and use a seed.
See Bloch's Effective Java #9.
But you should start with an initial value (so that subsequent zero values are significant), and combine the fields that apply to the result along with a multiplier so that order is significant (so that similar classes will have much different hashes.)
Also, you will have to treat things like long fields and Strings a little different. e.g., for longs:
(int) (field ^ (field>>>32))
So, this means something like:
@Override public int hashCode() {
int result = 17;
result = 31 * result + (name == null ? 0 : name.hashCode());
result = 31 * result + (int) (DoB ^ (DoB >>> 32));
return result;
}
31 is slightly magic, but odd primes can make it easier for the compiler to optimize the math to shift-subtraction. (Or you can do the shift-subtraction yourself, but why not let the compiler do it.)
Usually a hash code is built like so:
@Override
public int hashCode(){
return name.hashCode() ^ DoB.hashCode();
}
But the important thing to remember when writing a hashCode method is what it is used for. The hashCode method is used to put different objects into different buckets in a hash table or another hash-based collection. As such, it's important to have a method that gives different answers for different objects at a low run-time cost. It doesn't have to be different for every object, though it's better that way.
This hash is used by other code when storing or manipulating the
instance – the values are intended to be evenly distributed for varied
inputs in order to use in clustering. This property is important to
the performance of hash tables and other data structures that store
objects in groups ("buckets") based on their computed hash values
and
The general contract for overridden implementations of this method is
that they behave in a way consistent with the same object's equals()
method: that a given object must consistently report the same hash
value (unless it is changed so that the new version is no longer
considered "equal" to the old), and that two objects which equals()
says are equal must report the same hash value.
Your hash code implementation is fine and correct. It could be better if you follow any of the suggestions other people have made, but it satisfies the contract for hashCode, and collisions aren't particularly likely, though they could be made less likely.
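Pulling the suggestions in this thread together, a null-safe equals/hashCode pair can also be written with the JDK's own java.util.Objects (Java 7+) instead of Guava. This is a sketch with a made-up class name, not the asker's actual class.

```java
import java.util.Objects;

final class PersonSketch {
    private final String name;
    private final Long dob;

    PersonSketch(String name, Long dob) {
        this.name = name;
        this.dob = dob;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof PersonSketch)) {
            return false;
        }
        PersonSketch that = (PersonSketch) o;
        // Objects.equals handles null fields on either side.
        return Objects.equals(name, that.name) && Objects.equals(dob, that.dob);
    }

    @Override
    public int hashCode() {
        // Order-sensitive combination with multiplier 31; null fields count as 0.
        return Objects.hash(name, dob);
    }
}
```

Objects.hash boxes its arguments into an array on every call, so for very hot code paths the hand-written 17/31 version may still be preferable.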

java: memoizing construction through hash function

I have an X object whose constructor takes in 4 integer fields. To calculate its hash function, I simply throw them in an array and use Arrays.hashCode.
Currently the constructor is private and I have a static creator method. I'd like to memoize construction so that whenever the creator method is called with 4 integer parameters that have been called before, I can return the same object as last time. [Ideally without having to create another X object to compare with.]
Originally I tried a hashSet but that required me to create a new X to check if my hashSet.contains the equal object... nevermind the fact that I can't 'get' out of a hashSet.
My next idea is to use a HashTable which maps:
the hashCode of the int array of the 4 fields --> object. I'm not sure why, but that doesn't feel right. It feels like I'm doing too much work, isn't the point of a hashCode to be a sort of mapping to a bunch of objects which calculate to the same hashCode?
I appreciate your advice.
The purpose of a hash code is generally to narrow down the location in which to look for a particular object. Or put another way, the idea is that your hash code makes it so that if two objects have the same hash code they are "very likely" to be the same object.
Now, how likely is "very likely" essentially depends on the width (number of bits) and quality of the hash code. In the case of Java, with 32 bit hash codes, this "very likely" still generally means "not near enough to 100% that you can do away with an actual comparison of the object data". So as well as implementing hashCode(), you need to implement equals() on an object that is used as the key to a Java Map (HashMap etc).
Or put another way: your implementation is essentially correct, even though it looks like you're doing a lot of work. The upshot is that if what you are looking for is a performance improvement, you may as well just create a new object each time. But if functionally you require that there never exists more than one object with a given set of values, then your implementation is essentially correct.
Things you could do in principle:
if you had a large number of ints, then for the hashCode(), just form the hash code from a 'sample' of a couple of them -- the idea is to 'narrow down the choices' or make it 'fairly but not 100% likely' that equal hash code will mean equal object-- your equals() has to go through and check them anyway, so there's little point in cycling through all values in both hashCode() and equals();
potentially, you can use a stronger hash code, so that you literally assume that equal hash codes mean equal objects. In effect, you cycle through all of the values once in the hash code function and don't have an equals function at all. In practice this means using at least a strong-ish 64 bit hash code. It's probably not worth it for the case you mention. But if you want to understand a little about how it would work, I would point you to a tutorial I wrote on the advanced use of hash codes in Java.
If the 4 integers during construction mean the resulting object will be exactly the same, then use those as the key, not their hash. Notice I'm not using your full Object as the key, just the 4 integer values. The MyObjectSpecification below will be a tiny object.
public class MyObjectSpecification {
private final int i1, i2, i3, i4;
public MyObjectSpecification(int i1, int i2, int i3, int i4) {
this.i1 = i1;
this.i2 = i2;
this.i3 = i3;
this.i4 = i4;
}
public boolean equals(Object o) {
// ...
}
public int hashCode() {
// ...
}
}
public class MyObject {
private static final Map<MyObjectSpecification, MyObject> myObjects
= new ConcurrentHashMap<MyObjectSpecification, MyObject>();
private MyObject(MyObjectSpecification spec) {
// ...
}
public static MyObject getMyObject(int i1, int i2, int i3, int i4) {
MyObjectSpecification spec = new MyObjectSpecification(i1, i2, i3, i4);
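MyObject existing = myObjects.get(spec);
if (existing != null) {
return existing;
}
MyObject newObject = new MyObject(spec);
// putIfAbsent keeps the first instance if another thread stored one in the meantime
existing = myObjects.putIfAbsent(spec, newObject);
return existing != null ? existing : newObject;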
}
}
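For completeness, here is one way the elided equals() and hashCode() could be filled in, together with a Java 8 computeIfAbsent lookup. The class names Spec and Memoized are made up for this sketch, and the hash combination is just the usual multiply-by-31 scheme, not necessarily what the answer had in mind.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class Spec {
    final int i1, i2, i3, i4;

    Spec(int i1, int i2, int i3, int i4) {
        this.i1 = i1;
        this.i2 = i2;
        this.i3 = i3;
        this.i4 = i4;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Spec)) {
            return false;
        }
        Spec s = (Spec) o;
        return i1 == s.i1 && i2 == s.i2 && i3 == s.i3 && i4 == s.i4;
    }

    @Override
    public int hashCode() {
        return ((i1 * 31 + i2) * 31 + i3) * 31 + i4;
    }
}

final class Memoized {
    private static final Map<Spec, Memoized> CACHE = new ConcurrentHashMap<>();

    private final Spec spec;

    private Memoized(Spec spec) {
        this.spec = spec;
    }

    // Returns the cached instance for these four values, creating it on first use.
    static Memoized of(int i1, int i2, int i3, int i4) {
        return CACHE.computeIfAbsent(new Spec(i1, i2, i3, i4), Memoized::new);
    }
}
```

The map compares keys with equals(), so two specifications that happen to share a hash code, such as (0, 0, 1, 0) and (0, 0, 0, 31) here, still map to different objects.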
Not sure how you plan to use the Hashtable but I think below would do your job:
private static Hashtable<Integer, MyObject> objectInstances =
new Hashtable<Integer, MyObject>();
public static MyObject instance(int i1, int i2, int i3, int i4){
int hashKey = Arrays.hashCode(new int[]{i1, i2,i3,i4});
//get the object from hashtable
MyObject myObject = objectInstances.get(hashKey);
//if object was not already created, create now and put in the hashtable
if(myObject == null){
myObject = new MyObject(i1,i2,i3,i4);
objectInstances.put(hashKey, myObject);
}
return myObject;
}
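One thing to be aware of with this variant: the map key is only the hash code, so two different sets of four ints with the same Arrays.hashCode would be handed the same object. A small demonstration (the class name is made up for illustration):

```java
import java.util.Arrays;

final class HashKeyCollisionDemo {
    // Same key computation as in the instance(...) method above.
    static int key(int i1, int i2, int i3, int i4) {
        return Arrays.hashCode(new int[]{i1, i2, i3, i4});
    }

    public static void main(String[] args) {
        System.out.println(key(0, 0, 1, 0));   // 923552
        System.out.println(key(0, 0, 0, 31));  // 923552
        System.out.println(Arrays.equals(new int[]{0, 0, 1, 0}, new int[]{0, 0, 0, 31})); // false
    }
}
```

Whether that matters depends on how the four values are distributed, but it is the reason for keying on the values themselves rather than on their hash.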

How to check if a parameter value is in a list of constants

I have a list of constants and want to check if a value passed to a method equals one of the constants. I know about enums, which would be fine, but I need an int value here for performance reasons. Enum values can have a value, but I'm afraid that getting the int value is slower than using an int directly.
public static final int
a = Integer.parseInt("1"), // parseInt is necessary in my application
b = Integer.parseInt("2"),
c = Integer.parseInt("3");
public void doIt(int param) {
checkIfDefined(param);
...
}
The question is about checkIfDefined. I could use if:
if(param == a)
return;
if(param == b)
return;
if(param == c)
return;
throw new IllegalArgumentException("not defined");
Or I could use Arrays.binarySearch(int[] a, int key), or ... any better idea?
Maybe enums would be better not only in elegance but also in speed, because I don't need the runtime checks.
Solution
Since neither enum nor switch worked in my special case with parseInt, I now use plain old int constants together with Arrays.binarySearch.
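A sketch of what that solution might look like is given below. The class name and the helper methods are assumptions; the one requirement that comes from Arrays.binarySearch itself is that the array must be sorted before searching.

```java
import java.util.Arrays;

final class DefinedConstants {
    static final int
        A = Integer.parseInt("1"),
        B = Integer.parseInt("2"),
        C = Integer.parseInt("3");

    // Sorted once at class initialization; binarySearch requires sorted input.
    private static final int[] SORTED = sortedCopy(A, B, C);

    private static int[] sortedCopy(int... values) {
        int[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    static boolean isDefined(int param) {
        // binarySearch returns a non-negative index only if the key is present.
        return Arrays.binarySearch(SORTED, param) >= 0;
    }

    static void checkIfDefined(int param) {
        if (!isDefined(param)) {
            throw new IllegalArgumentException("not defined: " + param);
        }
    }
}
```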
You might prefer to use enums:
public enum Letters {
A("1"), B("2"), C("3");
private final String value;
private Letters(String value) {
this.value = value;
}
public static Letters getByValue(String value) {
for (Letters letter : values()) {
if (letter.value.equals(value)) return letter;
}
throw new IllegalArgumentException("Letter #" + value + " doesn't exist");
}
public int toInt() {
return Integer.parseInt(value);
}
}
I am reasonably sure that getting an int property value of an enum is about as fast as getting the value of an int constant. (But if in doubt, why not make a prototype and measure the performance of both solutions?) Using enums makes your code so much more readable, it is always worth that little extra effort.
For fast lookup, if you only have unsigned integer values, you could store the enum values in an ArrayList, indexed by their respective int value.
If your list of constants is long, load them into a HashSet. That will give you the fastest lookup time: all lookups into hash maps are O(1) time (and with low constant cost, too, aside from the autoboxing of the integer).
If your list of constants is short and invariant, use a switch statement and let the compiler essentially build a HashSet for you (with no autoboxing).
If it's somewhere in between, throw them into an array, sort it, and use Arrays.binarySearch. You can do about 5-10 comparisons in the time it takes to box one integer, so I would switch over to HashSet once the number gets into the hundreds.
If it's very short, and you know which number is most likely to come up, code it by hand in if statements, checking the most common ones first.
How about a switch statement. You can throw exception in the default case.
I guess the array solution you posted is fastest as long as the number of params stays small. If it gets larger than 128 and you need this frequently, then I would go with a HashSet:
Set<Integer> params = new HashSet<Integer>();
params.add(Integer.valueOf("1"));
params.add(Integer.valueOf("2"));
params.add(Integer.valueOf("3"));
public boolean checkIfDefined(int param){
return params.contains(param);
}
The auto-boxing is certainly slow, but the hash look-up is O(1) and not O(log n) like the binary search.
2.) The fastest solution, if memory is not much of a concern or the parameters do not go up to a high value, is to use a boolean[] with the params as an index:
boolean[] params = new boolean[MAX_PARAMS+1];
params[Integer.parseInt("1")] = true;
params[Integer.parseInt("2")] = true;
params[Integer.parseInt("3")] = true;
public boolean checkIfDefined(int param){
if (param < 0 || params.length <= param)
return false;
return params[param];
}
