So I know I can convert a string to a hash code simply by calling .hashCode(), but is there a way (or some other function out there) that returns a double between 0 and 1 instead of an integer? I was thinking of just dividing the number by the maximum possible integer, but wasn't sure if there was a better way.
*Edit (more information about why I'm trying to do this): I'm doing a mathematical operation, and I'm trying to group different objects so that each group performs the same mathematical operation but with a different parameter passed into the function. Each member has a list of characteristics that "group" them, so I was thinking of putting the characteristics into a string, hashing it, and deriving the group value from that.
You couldn't just divide by Integer.MAX_VALUE, as that wouldn't deal with negative numbers. You could use:
private static double INTEGER_RANGE = 1L << 32;
...
// First need to put it in the range [0, INTEGER_RANGE)
double doubleHash = ((long) text.hashCode() - Integer.MIN_VALUE) / INTEGER_RANGE;
That should be okay, as far as I'm aware... but I'm not going to make any claims about the distribution. There may well be a fairly simple way of using the 32 bits to make a unique double (per unique hash code) in the right range, but if you don't care too much about that, this will be simpler.
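Pulled together into a compilable sketch (the class and method names here are just for illustration):

public final class HashToUnit {
    // 2^32 as a double; dividing by this maps the shifted hash into [0, 1).
    private static final double INTEGER_RANGE = 1L << 32;

    // Maps any String's hashCode() onto [0, 1). The distribution is only as
    // good as String.hashCode() itself.
    static double toUnitInterval(String text) {
        return ((long) text.hashCode() - Integer.MIN_VALUE) / INTEGER_RANGE;
    }

    public static void main(String[] args) {
        System.out.println(toUnitInterval("hello"));              // some value in [0, 1)
        System.out.println(toUnitInterval("polygenelubricants")); // hashCode() is Integer.MIN_VALUE, so 0.0
    }
}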
Dividing it should be OK, but you might lose some "precision" due to rounding problems, etc., that doubles can have.
In general a hash is used to identify something while trying to ensure it's unique; losing precision could cause problems there.
You could write your own String.hashCodeDouble() returning the desired number, perhaps using a common hash algorithm (let's say, MD5) and adapting it to your required response range.
Example: do the MD5 of the String to get a hash, then simply put a 0. in front of it...
Remember that .hashCode() is used by lots of classes in Java; you can't simply override it to return a double.
This smells bad but might do what you want:
Integer iHash = "123".hashCode();      // note: hashCode() can be negative for other strings
String sHash = "0." + Math.abs(iHash); // strip the sign, or Double.valueOf would choke on "0.-..."
Double dHash = Double.valueOf(sHash);
I'm using Java Murmur3 from the Guava lib to get long values representing a hash. Is there any possibility to get only positive long numbers? Right now Guava returns both positive and negative results, which is not good for me.
I use Murmur3 to convert string ids to a numerical representation because of calculation-framework limitations. I'm not afraid of a small number of collisions, but I am wary of just taking abs(murmur3Value): would that significantly raise the probability of collisions?
I have ~1*10^8 unique ids; is it OK to abs their hashed values without getting too many collisions?
I don't have any collisions on 10^7 values, but the hashes are positive and negative, and I would like to use only positive values.
Using Math.abs is wrong... as Math.abs(Long.MIN_VALUE) == Long.MIN_VALUE. It's also needlessly slow, given that there are simple options:
x >>> 1
and
x & Long.MAX_VALUE
In any case you lose one bit, either the most or the least significant one. I guess in the case of Murmur3 it doesn't matter.
Concerning collisions, it really shouldn't matter which operation you choose - you'll have 2^63, i.e., about 9e18 different hashes. With 1e8 inputs, collisions are very rare if any (I'm too lazy to look up the formula).
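For instance, a small sketch using Guava's 128-bit Murmur3 (assuming Guava is on the classpath; the class and method names here are illustrative):

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class PositiveMurmur {
    // Clears the sign bit so the result is always non-negative,
    // at the cost of one bit of the hash.
    static long positiveHash(String id) {
        long h = Hashing.murmur3_128()
                        .hashString(id, StandardCharsets.UTF_8)
                        .asLong(); // lower 64 bits of the 128-bit hash
        return h & Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(positiveHash("some-id-123")); // always >= 0
    }
}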
I am working in a java-based system where I need to set an id for certain elements in the visual display. One category of elements is Strings, so I decided to use the String.hashCode() method to get a unique identifier for these elements.
The problem I ran into, however, is that the system I am working in borks if the id is negative, and String.hashCode often returns negative values. One quick solution is to just wrap the hashCode call in Math.abs() to guarantee a positive result. What I'm wondering about this approach is: what are the chances of two distinct elements ending up with the same id?
For example, if one string returns a hash code of -10 and another string returns a hash code of 10, an error would occur. In my system we're talking about collections that typically aren't more than 30 elements large, so I don't think this would really be an issue, but I am curious as to what the math says.
Hash codes can be thought of as pseudo-random numbers. Statistically, with a positive int hash code the chance of a collision between any two elements reaches 50% when the population size is about 54K (and 77K for any int). See Birthday Problem Probability Table for collision probabilities of various hash code sizes.
Also, your idea to use Math.abs() alone is flawed: it does not always return a positive number! In two's complement arithmetic, the absolute value of Integer.MIN_VALUE is Integer.MIN_VALUE itself! Famously, the hash code of "polygenelubricants" is exactly this value.
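A quick check of both claims (expected output in the comments):

public class AbsPitfall {
    public static void main(String[] args) {
        System.out.println("polygenelubricants".hashCode()); // -2147483648, i.e. Integer.MIN_VALUE
        System.out.println(Math.abs(Integer.MIN_VALUE));     // -2147483648, still negative!
        // A safer way to drop the sign bit:
        System.out.println("polygenelubricants".hashCode() & Integer.MAX_VALUE); // 0
    }
}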
Hashes are not unique, hence they are not appropriate for a unique id.
As for the probability of a hash collision, read about the birthday paradox. Roughly (from what I recall), when drawing from a uniform distribution of N values you should expect a collision after about sqrt(N) draws (you could get one much earlier). The problem is that Java's implementation of hashCode (especially when hashing short strings) doesn't provide a uniform distribution, so you'll get a collision much earlier.
You already can get two strings with the same hashcode. This should be obvious if you think that you have an infinite number of strings and only 2^32 possible hashcodes.
You just make it a little more probable when taking the absolute value. The risk is small, but if you need a unique id, this isn't the right approach.
What you can do when you only have 30-50 values, as you said, is register each String you get into a HashMap together with a running counter as the value:
HashMap<String, Integer> stringMap = new HashMap<>();
stringMap.put("Test", 1);
stringMap.put("AnotherTest", 2);
You can then get your unique ID by calling this:
stringMap.get("Test"); // returns 1
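A slightly more general sketch that hands out the next counter value automatically (assuming Java 8+ for computeIfAbsent; the class and method names are illustrative):

import java.util.HashMap;
import java.util.Map;

public class StringIdRegistry {
    private final Map<String, Integer> ids = new HashMap<>();
    private int nextId = 1;

    // Returns the existing id for the string, or assigns the next free one.
    int idFor(String s) {
        return ids.computeIfAbsent(s, k -> nextId++);
    }

    public static void main(String[] args) {
        StringIdRegistry registry = new StringIdRegistry();
        System.out.println(registry.idFor("Test"));        // 1
        System.out.println(registry.idFor("AnotherTest")); // 2
        System.out.println(registry.idFor("Test"));        // 1 again
    }
}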
I have come across situations in interviews where I needed to use a hash function for integer numbers or for strings. In such situations, which ones should we choose? I've gotten this wrong before because I ended up choosing ones that generate a lot of collisions, but hash functions tend to be so mathematical that you can't recall them in an interview. Are there any general recommendations so that at least the interviewer is satisfied with your approach for integer or string inputs? Which functions would be adequate for both kinds of input in an "interview situation"?
Here is a simple recipe from Effective Java, page 33:
1. Store some constant nonzero value, say, 17, in an int variable called result.
2. For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
   a. Compute an int hash code c for the field:
      - If the field is a boolean, compute (f ? 1 : 0).
      - If the field is a byte, char, short, or int, compute (int) f.
      - If the field is a long, compute (int) (f ^ (f >>> 32)).
      - If the field is a float, compute Float.floatToIntBits(f).
      - If the field is a double, compute Double.doubleToLongBits(f), and then hash the resulting long as in the long case above.
      - If the field is an object reference and this class's equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a "canonical representation" for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
      - If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each significant element by applying these rules recursively, and combine these values per step 2b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
   b. Combine the hash code c computed in step 2a into result as follows:
      result = 31 * result + c;
3. Return result.
When you are finished writing the hashCode method, ask yourself whether
equal instances have equal hash codes. Write unit tests to verify your intuition!
If equal instances have unequal hash codes, figure out why and fix the problem.
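As a concrete illustration of the recipe, here is a sketch for a made-up class with a few different field types (the Person class and its fields are hypothetical):

import java.util.Arrays;

public final class Person {
    private final String name;   // object reference
    private final int age;       // int
    private final long id;       // long
    private final double score;  // double
    private final char[] tags;   // array

    Person(String name, int age, long id, double score, char[] tags) {
        this.name = name;
        this.age = age;
        this.id = id;
        this.score = score;
        this.tags = tags;
    }

    @Override
    public int hashCode() {
        int result = 17;                                           // step 1
        result = 31 * result + (name == null ? 0 : name.hashCode());
        result = 31 * result + age;
        result = 31 * result + (int) (id ^ (id >>> 32));
        long scoreBits = Double.doubleToLongBits(score);           // double -> long -> int
        result = 31 * result + (int) (scoreBits ^ (scoreBits >>> 32));
        result = 31 * result + Arrays.hashCode(tags);              // array field
        return result;                                             // step 3
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return age == p.age && id == p.id
                && Double.compare(score, p.score) == 0
                && (name == null ? p.name == null : name.equals(p.name))
                && Arrays.equals(tags, p.tags);
    }
}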
You should ask the interviewer what the hash function is for - the answer to this question will determine what kind of hash function is appropriate.
If it's for use in hashed data structures like hashmaps, you want it to be as simple as possible (fast to execute) and to avoid collisions (most common values map to different hash values). A good example is an integer hashing to itself - this is the standard hashCode() implementation in java.lang.Integer.
If it's for security purposes, you will want to use a cryptographic hash function. These are primarily designed so that it is hard to reverse the hash function or find collisions.
If you want fast pseudo-random-ish hash values (e.g. for a simulation) then you can usually modify a pseudo-random number generator to create these. My personal favourite is:
public static final int hash(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
If you are computing a hash for some form of composite structure (e.g. a string with multiple characters, or an array, or an object with multiple fields), then there are various techniques you can use to create a combined hash function. I'd suggest something that XORs the rotated hash values of the constituent parts, e.g.:
public static <T> int hashCode(T[] data) {
    int result = 0;
    for (int i = 0; i < data.length; i++) {
        result ^= data[i].hashCode();
        result = Integer.rotateRight(result, 1);
    }
    return result;
}
Note the above is not cryptographically secure, but it will do for most other purposes. You will obviously get collisions, but that's unavoidable when hashing a large structure down to an integer :-)
For integers, I usually go with k % p where p is the size of the hash table and a prime number, and for strings I use the hashCode from the String class. Is this sufficient for an interview with a major tech company? – phoenix 2 days ago
Maybe not. It's not uncommon to need to provide a hash function to a hash table whose implementation is unknown to you. Further, if you hash in a way that depends on the implementation using a prime number of buckets, then your performance may degrade if the implementation changes due to a new library, compiler, OS port etc..
Personally, I think the important thing at interview is a clear understanding of the ideal characteristics of a general-purpose hash algorithm, which is basically that for any two input keys with values varying by as little as one bit, each and every bit in the output has about a 50/50 chance of flipping. I found that quite counter-intuitive, because a lot of the hashing functions I first saw used bit-shifts and XOR, where a flipped input bit usually flipped one output bit (usually in another bit position), so 1-input-bit-affects-many-output-bits was a little revelation moment when I read it in one of Knuth's books. With this knowledge you're at least capable of testing and assessing specific implementations regardless of how they're implemented.
One approach I'll mention, because it achieves this ideal and is easy to remember (though the memory usage may make it slower than mathematical approaches - it could be faster too, depending on hardware), is to simply use each byte in the input to look up a table of random ints. For example, given a 24-bit RGB value and int table[3][256], table[0][r] ^ table[1][g] ^ table[2][b] is a great int-sized hash value - indeed "perfect" if the inputs are randomly scattered through the int values (rather than, say, incrementing - see below). This approach isn't ideal for long or arbitrary-length keys, though you can start revisiting the tables and bit-shifting the values, etc.
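Here is a rough sketch of that table-lookup idea in Java for a 24-bit RGB input (the table contents and the fixed seed are arbitrary choices for illustration):

import java.util.Random;

public class TabulationHash {
    // One table of 256 random ints per input byte.
    private static final int[][] TABLE = new int[3][256];
    static {
        Random rnd = new Random(42); // fixed seed so hashes are repeatable across runs
        for (int[] row : TABLE) {
            for (int i = 0; i < row.length; i++) {
                row[i] = rnd.nextInt();
            }
        }
    }

    // table[0][r] ^ table[1][g] ^ table[2][b], as described above.
    public static int hashRgb(int rgb) {
        int r = (rgb >>> 16) & 0xFF;
        int g = (rgb >>> 8) & 0xFF;
        int b = rgb & 0xFF;
        return TABLE[0][r] ^ TABLE[1][g] ^ TABLE[2][b];
    }
}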
All that said, you can sometimes do better than this randomising approach for specific cases where you are aware of the patterns in the input keys and/or the number of buckets involved (for example, you may know the input keys are contiguous from 1 to 100 and there are 128 buckets, so you can pass the keys through without any collisions). If, however, the input ceases to meet your expectations, you can get horrible collision problems, while a "randomising" approach should never get much worse than load (size() / buckets) implies. Another interesting insight is that when you want a quick-and-mediocre hash, you don't necessarily have to incorporate all the input data when generating the hash: e.g. last time I looked at Visual C++'s string hashing code it picked ten letters evenly spaced along the text to use as inputs....
Ok, I need a hashing function to meet the following requirements. The idea is to be able to link together directories that are part of the same logical structure but stored in different physical areas of the file system.
I need to implement it in Java, it must be consistent across execution sessions and it can return a long.
I will be hashing directory names / strings. This should work so that "somefolder1" and "somefolder2" will return different hashes, as would "JJK" and "JJL". I'd also like some idea of when clashes are likely to occur.
Any suggestions?
Thanks
Well, nearly all hashing functions have the property that small changes in the input yield large changes in the output, meaning that "somefolder1" and "somefolder2" will always yield a different hash.
As for clashes, just look at how large the hash output is. Java's own hashCode() returns an int, so you can expect clashes more often than with MD5 or SHA-1, for example, which yield 128 and 160 bits respectively.
You shouldn't try creating such a function from scratch, though.
However, I didn't quite understand whether collisions must never occur in your use case or whether they are acceptable if rare. For linking folders I'd definitely use a guaranteed-to-be-unique identifier instead of something that might occur more than once.
You haven't described under what circumstances different strings should return the same hash.
In general, I would approach designing a hashing function by first implementing the equality function. That should show you which bits of data you need to include in the hash, and which should be discarded. If the equality between two different bits of data is complicated (e.g. case-insensitivity) then hopefully there will be a corresponding hash function for that particular comparison.
Whatever you do, don't assume that equal hashes mean equal keys (i.e. that hashing is unique) - that's always a cause of potential problems.
Java's String hashCode will give you an int; if you want a long, you could take the least-significant 64 bits of the MD5 sum of the String.
Collisions could occur, your system must be prepared for that. Maybe if you give a little more detail as to what the hash codes will be used for, we can see if collisions would cause problems or not.
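For instance, a sketch that takes the low 64 bits of an MD5 digest (treating the last 8 bytes of the digest as the least-significant bits; the method name is illustrative):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5ToLong {
    // Hashes the string with MD5 and returns its last 8 bytes as a long.
    static long md5Low64(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest, 8, 8).getLong(); // bytes 8..15 of the 16-byte digest
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is required on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Low64("somefolder1"));
        System.out.println(md5Low64("somefolder2")); // almost certainly different
    }
}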
With a uniformly random hash function with M possible values, the odds of a collision happening after N hashes reach 50% when
N = 0.5 + sqrt(0.25 - 2 * M * ln(0.5)), which is roughly 1.18 * sqrt(M).
Look up the birthday problem for more analysis.
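As a rough sanity check of that formula (the class is just illustrative):

public class BirthdayBound {
    // N at which the collision probability reaches 50%, given M possible hash values.
    static double fiftyPercentPoint(double m) {
        return 0.5 + Math.sqrt(0.25 - 2.0 * m * Math.log(0.5));
    }

    public static void main(String[] args) {
        System.out.printf("32-bit hash: ~%.0f keys%n", fiftyPercentPoint(Math.pow(2, 32))); // ~77,000
        System.out.printf("64-bit hash: ~%.3e keys%n", fiftyPercentPoint(Math.pow(2, 64))); // ~5.1e9
    }
}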
You can avoid collisions if you know all your keys in advance, using perfect hashing.
I am looking for a way to create an int/long representation of an arbitrary alphanumeric String. Hash codes won't do it, because I can't afford hash collisions, i.e. the representation must be unique and repeatable.
The numeric representation will be used to perform efficient (hopefully) compares. The creation of the numeric key will take some time, but it only has to happen once, whereas I need to perform vast numbers of comparisons with it - which will hopefully be much faster than comparing the raw Strings.
Any other ideas on faster String comparison will be most appreciated too...
Unless your string is limited in length, you can't avoid collisions.
There are 4294967296 possible values for an integer (2^32). If you have a string of more than 4 ASCII characters, or more than two unicode characters, then there are more possible string values than possible integer values. You can't have a unique integer value for every possible 5 character string. Long values have more possible values, but they would only provide a unique value for every possible string of 8 ASCII characters.
Hash codes are useful as a two step process: first see if the hash code matches, then check the whole string. For most strings that don't match, you only need to do the first step, and it's really fast.
Can't you just start with a hash code, and if the hash codes match, do a character by character comparison?
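That two-step idea in a tiny sketch (the HashedKey class is hypothetical):

final class HashedKey {
    final String value;
    final int hash;

    HashedKey(String value) {
        this.value = value;
        this.hash = value.hashCode(); // computed once, up front
    }

    // Cheap int comparison first; the full string comparison only runs on a hash match.
    boolean sameAs(HashedKey other) {
        return hash == other.hash && value.equals(other.value);
    }
}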
How long are the strings? If they are very short, then a unique ID can be generated by treating the characters as digits in base 36 (26 + 10) that form an n-digit number, where n is the length of the string. On the other hand, if the strings are short enough to allow this, direct comparison won't be an issue anyway.
Otherwise you'll have to generate a collision-free hash and this can only be done when the complete problem space is known in advance (i.e. if you know all strings that can possibly occur). You will want to have a look at perfect hashing, although the only feasible algorithm to find a perfect hash function that I know is probabilistic so collisions are still theoretically possible.
There might be other ways to find such a function. Knuth called this a “rather amusing … puzzle” in TAoCP but he doesn't give an algorithm either.
In general, you give far too little information to find an algorithm that doesn't require probing the whole problem space in some manner. That invariably means the problem has exponential running time, though it could be tackled with machine-learning heuristics. I'm not sure if this is advisable in your case.
Perhaps:
String y = "oiu291981u39u192u3198u389u28u389u";
// Treat the whole alphanumeric string as one big base-36 number:
// collision-free, but the value grows with the string's length.
BigInteger bi = new BigInteger(y, 36);
System.out.println(bi);
At the end of the day, a single alphanumeric character has at least 36 possible values. If you include punctuation, lower case, etc then you can easily pass 72 possible values.
A non-colliding number that allows you to quickly compare strings would necessarily grow exponentially with the length of the string.
So you first must decide on the longest string you are expecting to compare. Assuming it's N characters in length, and assuming you ONLY need uppercase letters and the numerals 0-9 then you need to have an integer representation that can be as high as
36^N
For a string of length 25 (common name field) then you end up needing a binary number with 130 bits.
If you compose that into 32 bit numbers, you'll need five of them (four only gives you 128 bits). Then you can compare each number in turn (five integer compares should take no time compared to walking the string). I would recommend a big number library, but for this specialized case I'm pretty sure you can write your own and get better performance.
If you want to handle 72 possible values per character (uppercase, lowercase, numerals, punctuation...) and you need 10 characters, then you'll need 62 bits - two 32 bit integers (or one 64 bit if you're on a system that supports 64 bit computing)
If, however, you are not able to restrict the numbers in the string (ie, could be any of the 256 letters/numbers/characters/etc) and you can't define the size of the string, then comparing the strings directly is the only way to go, but there's a shortcut.
Cast the pointer of the string to a 32 bit unsigned integer array, and compare the string 4 bytes at a time (or 64 bits / 8 bytes at a time on a 64 bit processor). This means that a 100 character string requires at most 25 compares to find which is greater.
You may need to re-define the character set (and convert the strings) so that the characters with higher precedence are assigned values closer to 0, and lower precedence values closer to 255 (or vice versa, depending on how you are comparing them).
Good luck!
-Adam
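For the restricted uppercase-plus-digits case, here is a sketch of packing a short key into a single long as base-36 digits (up to 12 characters fit, since 36^12 < 2^63; the class and method names are illustrative):

public class PackedKey {
    // Packs an alphanumeric string into a long by treating each character
    // as a base-36 digit. Throws if the string is too long or not alphanumeric.
    static long pack(String s) {
        if (s.length() > 12) throw new IllegalArgumentException("too long to pack into a long");
        long packed = 0;
        for (int i = 0; i < s.length(); i++) {
            int digit = Character.digit(Character.toUpperCase(s.charAt(i)), 36);
            if (digit < 0) throw new IllegalArgumentException("not alphanumeric: " + s.charAt(i));
            packed = packed * 36 + digit;
        }
        return packed;
    }

    public static void main(String[] args) {
        // For keys of equal length the packed values compare in the same order
        // as the original strings, so a single long compare does the job.
        System.out.println(pack("JJK") < pack("JJL")); // true
    }
}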
As long as it's a hash function, be it String.hashCode(), MD5 or SHA-1, collisions are unavoidable unless you have a fixed limit on the string's length. It is mathematically impossible to have a one-to-one mapping from an infinite set to a finite set.
Stepping back, is collision avoidance absolutely necessary?
A few questions in the beginning:
Did you test that simple string comparison is too slow?
What does the comparison look like ('ABC' == 'abc' or 'ABC' != 'abc')?
How many strings do you have to compare?
How many comparisons do you have to do?
What do your strings look like (length, letter case)?
As far as I remember, identical String literals in Java point to the same interned object.
So maybe it would be enough to compare object references (String comparison probably already starts with such a check).
If that doesn't help, you can try a Pascal-style string representation where the first element is the length; if your strings have varying lengths, comparing lengths first should save some CPU time.
How long are your strings? Unless you choose an int representation that's longer than the string, collisions will always be possible no matter what conversion you're using. So if you're using a 32 bit integer, you can only uniquely represent strings of up to 4 bytes.
How big are your strings? Arbitrarily long strings cannot be compressed into 32/64 bit format.
If you don't want collisions, try something insane like SHA-512. I can't guarantee there won't be collisions, but I don't think they have found any yet.
Assuming "alphanumeric" means letters and numbers, you could treat each letter/number as a base-36 digit. Unfortunately, large strings will cause the number to grow rapidly and you'd have to resort to big integers, which are hardly efficient.
If your strings are usually different when you make the comparison (i.e. searching for a specific string) the hash might be your best option. Once you get a potential hit, you can do the string comparison to be sure. A well-designed hash will make collisions exceedingly rare.
It would seem that an MD5 hash would work fine. The risk of a hash collision would be extremely unlikely. Depending on the length of your string, a hash that generates an int/long would run into max value problems very quickly.
Why don't you do something like 1stChar + (36 x 2ndChar) + 36^2 x (3rdChar) ...., where you use a simple integer value for each character, i.e. a = 1, b = 2, etc., or just the integer value if it's not a letter? The multiplier has to be at least as large as the number of possible character values (base 10 would allow collisions), but then this gives a distinct value for each string, even for two strings that are the same letters in a different order.
Of course it gets more complicated if you need to worry about Unicode rather than just ASCII, and the numbers get large quickly if you need to use long strings.
Are the standard Java string comparison functions definitely not efficient enough?
String length may vary, but let's say 10 characters for now.
In that case, in order to guarantee uniqueness you'd have to use some sort of big integer representation. I doubt that doing comparisons on big integers would be substantially faster than doing string comparisons in the first place. I'll second what others have said here: use some sort of hash, then in the event of a hash match check the original strings to weed out any collisions.
In any case, if your strings are around 10 characters, I doubt that comparing, say, a bunch of 32 bit hashes will be all that much faster than direct string comparisons. I think you have to ask yourself whether it's really worth the additional complexity.