Get only positive long from murmur3 guava - java

I'm using the Java murmur3 implementation from the Guava library to get long values representing a hash. Is there any way to get only positive long numbers? Right now Guava returns both positive and negative results, which is not good for me.
I use murmur3 to convert string ids to a numerical representation because of calculation framework limitations. I am not afraid of a small number of collisions, but I am afraid to just take abs(murmur3Value). It should significantly raise the probability of collisions. Am I right?
I have ~ 1*10^8 unique ids. Is it OK to take abs of their hashed values without getting too many collisions?
I don't have any collisions on 10^7 values, but the hashes are positive and negative, and I would like to use only positive values.

Using Math.abs is wrong... as Math.abs(Long.MIN_VALUE) == Long.MIN_VALUE. It's also needlessly slow, given that there are simple options:
x >>> 1
and
x & Long.MAX_VALUE
In any case you lose one bit, either the most or the least significant one. I guess in case of Murmur3 it doesn't matter.
Concerning collisions, it really shouldn't matter which operation you choose: you'll have 2**63, i.e., about 9e18, different hashes. With 1e8 inputs, the birthday approximation n^2 / (2 * 2^63) gives roughly 5e-4 expected collisions, so a collision is very unlikely.
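A minimal sketch of the two options above, plus the birthday estimate. The class and method names are made up for illustration, and the long passed in stands for whatever value the Guava hash returns (Guava itself is left out so the snippet runs on its own).

```java
final class NonNegativeHash {
    /** Clears the sign bit, keeping the low 63 bits of the hash. */
    static long maskSign(long hash) {
        return hash & Long.MAX_VALUE;
    }

    /** Drops the lowest bit instead; the result is also in [0, 2^63). */
    static long shiftRight(long hash) {
        return hash >>> 1;
    }

    /** Birthday approximation: expected colliding pairs is about n^2 / (2 * space). */
    static double expectedCollisions(double n, double space) {
        return n * n / (2.0 * space);
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(Long.MIN_VALUE) < 0);      // true: abs overflows
        System.out.println(maskSign(Long.MIN_VALUE));          // 0
        System.out.println(shiftRight(-1L) == Long.MAX_VALUE); // true
        // Roughly 5.4e-4 for 1e8 ids spread over 2^63 values
        System.out.println(expectedCollisions(1e8, Math.pow(2, 63)));
    }
}
```

Both variants map every input into [0, 2^63), so either is a safe replacement for Math.abs.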

Related

Is there a way to pow 2 BigInteger Numbers in java?

I have to pow a bigInteger number with another BigInteger number.
Unfortunately, only one BigInteger.pow(int) is allowed.
I have no clue on how I can solve this problem.
"I have to pow a bigInteger number with another BigInteger number."
No, you don't.
You read a crypto spec and it seemed to say that. But that's not what it said; you didn't read carefully enough. The mathematical 'universe' that the math in the paper / spec you're reading operates in is different from normal math. It's a modulo-space. All operations are implicitly performed modulo X, where X is some number the crypto algorithm explains.
You can do that just fine.
Alternatively, the spec is quite clear and says something like: C = (A^B) % M and you've broken that down in steps (... first, I must calculate A to the power of B. I'll worry about what the % M part is all about later). That's not how that works - you can't lop that operation into parts. (A^B) % M is quite doable, and has its own efficient algorithm. (A^B) is simply not calculable without a few years worth of the planet's entire energy and GDP output.
The reason I know that must be what you've been reading, is because (A ^ B) % M is a common operation in crypto. (Well, that, and the simple fact that A^B can't be done).
Just to be crystal clear: when I say impossible, I mean it in the same way that 'travelling faster than the speed of light' is impossible. It's a law in the physics sense of the word. If you really want plain A^B, not in a modspace, with a B so large that it doesn't fit in an int, a computer cannot calculate it, and the result will be gigabytes large. An int can hold about 9 digits. Just for fun, imagine doing X^Y where both X and Y are 20-digit numbers.
The result would have 10^21 digits.
That's roughly equal to the total amount of disk space available worldwide. 10^12 is a terabyte. You're asking to calculate a number where, forget about calculating it, merely storing it requires one thousand million harddisks each of 1TB.
Thus, I'm 100% certain that you do not want what you think you want.
TIP: If you can't follow the math (which is quite bizarre; it's not like you get modulo-space math in your basic AP math class!), generally rolling your own implementation of a crypto algorithm isn't going to work out. The problem with crypto is, if you mess up, often a unit test cannot catch it. No; someone will hack your stuff and then you know, and that's a high price to pay. Rely on experts to build the algorithm, spend your time ensuring the protocol is correct (which is still quite difficult to get right, don't take that lightly!). If you insist, make dang sure you have a heap of plaintext+keys / encrypted (or plaintext / hashed, or whatever it is you're doing) pairs to test against, and assume that whatever you wrote, even if it passes those tests, is still insecure because e.g. it is trivial to leak the key out of your algorithm using timing attacks.
Since you want to use it in a modulo operation with a prime number anyway, as @Progman said in the comments, you can use modPow().
Below is some example code:
import java.math.BigInteger;

// Create BigInteger objects
BigInteger biginteger1, biginteger2, exponent, result;
// Prime number used as the modulus
int pNumber = 5;
// Initializing all BigInteger objects
biginteger1 = new BigInteger("23895");
biginteger2 = BigInteger.valueOf(pNumber);
exponent = new BigInteger("15");
// Perform modPow: result = (biginteger1 ^ exponent) mod biginteger2
result = biginteger1.modPow(exponent, biginteger2);
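To illustrate why (A^B) % M is cheap even when B is a 20-digit number, here is a sketch of square-and-multiply next to BigInteger.modPow. The powMod helper is hypothetical and written only for comparison (it assumes a non-negative exponent and a positive modulus); in real code you would simply call modPow.

```java
import java.math.BigInteger;

final class ModPowSketch {
    /** Left-to-right square-and-multiply; every intermediate value stays below modulus^2. */
    static BigInteger powMod(BigInteger base, BigInteger exponent, BigInteger modulus) {
        BigInteger result = BigInteger.ONE.mod(modulus);
        BigInteger b = base.mod(modulus);
        for (int i = exponent.bitLength() - 1; i >= 0; i--) {
            result = result.multiply(result).mod(modulus);   // square
            if (exponent.testBit(i)) {
                result = result.multiply(b).mod(modulus);    // multiply when the bit is set
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BigInteger a = new BigInteger("12345678901234567890");
        BigInteger b = new BigInteger("98765432109876543210");
        BigInteger m = new BigInteger("1000000007");
        // The built-in and the sketch agree; the loop runs once per exponent bit.
        System.out.println(a.modPow(b, m).equals(powMod(a, b, m))); // true
    }
}
```

The loop runs once per bit of the exponent (67 iterations here), which is why the modular form is feasible while the plain power is not.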

Even distribution of long integer identifiers into buckets

I have a huge set of long integer identifiers that need to be distributed into (n) buckets as uniformly as possible. The long integer identifiers might have pockets of missing identifiers.
With that being the criteria, is there a difference between using the long integer as is and taking it modulo (n) [long integer], or is it better to generate a hashCode for the string version of the long integer (to improve the distribution) and then take that modulo (n) [hash_code of string(long integer)]? Is the additional string conversion necessary to get a uniform spread via the hash code?
Since I got feedback that my question does not have enough background information. I am adding some more information.
The identifiers are basically auto-incrementing numeric row identifiers that are autogenerated in a database representing an item id. The reason for pockets of missing identifiers is because of deletes.
The identifiers themselves are long integers.
The identifiers (items) themselves are in the order of (10s-100)+ million in some cases and in the order of thousands in some cases.
Only in the case where the identifiers are in the order of millions do I want to really spread them out into buckets (identifier count >> bucket count) for storage in a no-SQL system(partitions).
I was wondering whether, because items get deleted, I should resort to (Long).toString().hashCode() to get a uniform spread instead of using the long value directly. I had a feeling that doing toString().hashCode() is not going to gain me much, and I also did not like the fact that Java's hashCode does not guarantee the same value across Java revisions (though for String the hashCode implementation seems to be documented and has been stable across releases for years).
There's no need to involve String.
new Integer(i).hashCode()
... gives you a hash - designed for the very purpose of evenly distributing into buckets.
new Integer(i).hashCode() % n
... will give you a number in the range you want.
However Integer.hashCode() is just:
return value;
So new Integer(i).hashCode() % n is equivalent to i % n.
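Since the identifiers in the question are long values, the same idea can be sketched without boxing. The class and method names are illustrative; Math.floorMod is used instead of % only so that a negative id (not expected for auto-increment keys) would still land in a valid bucket.

```java
final class Buckets {
    /** Maps an id to a bucket index in [0, bucketCount). */
    static int bucketOf(long id, int bucketCount) {
        return (int) Math.floorMod(id, (long) bucketCount);
    }

    public static void main(String[] args) {
        System.out.println(bucketOf(10L, 4));  // 2
        System.out.println(bucketOf(-1L, 4));  // 3, whereas -1 % 4 would be -1
    }
}
```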
Your question as is cannot be answered. @slim's try is the best you will get, because crucial information is missing in your question.
To distribute a set of items, you have to know something about their initial distribution.
If they are uniformly distributed and the range of the inputs is significantly higher than the number of buckets, then slim's answer is the way to go. If either of those conditions doesn't hold, it won't work.
If the range of inputs is not significantly higher than the number of buckets, you need to make sure the range of inputs is an exact multiple of the number of buckets; otherwise the last buckets won't get as many items. For instance, with range [0-999] and 400 buckets, the first 200 buckets get items [0-199], [400-599] and [800-999], while the other 200 buckets get items [200-399] and [600-799].
That is, half of your buckets end up with 50% more items than the other half.
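The imbalance in this example can be checked by counting. This is a small sketch with a made-up helper name that assigns ids 0..rangeSize-1 with a plain modulo:

```java
final class BucketImbalance {
    /** Counts how many ids in [0, rangeSize) land in each bucket under id % buckets. */
    static int[] countPerBucket(int rangeSize, int buckets) {
        int[] counts = new int[buckets];
        for (int id = 0; id < rangeSize; id++) {
            counts[id % buckets]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countPerBucket(1000, 400);
        System.out.println(counts[0]);    // 3 (ids 0, 400, 800)
        System.out.println(counts[399]);  // 2 (ids 399, 799)
    }
}
```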
If they are not uniformly distributed, the output distribution is not uniform either, because the modulo operator doesn't change the distribution except by wrapping it.
This is when you need a hash function.
But to build a hash function, you must know how to characterize the input distribution, since the point of the hash function is to break the recurring, predictable aspects of your input.
To be fair, there are some hash functions that work fairly well on most datasets, for instance Knuth's multiplicative method (assuming not too large inputs). You might, say, compute
hash(input) = input * 2654435761 % 2^32
It is good at breaking clusters of values. However, it fails at divisibility. That is, if most of your inputs are divisible by 2, the outputs will be too. [credit to this answer]
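A sketch of the formula above in Java. The class name is made up; masking with 0xFFFFFFFFL plays the role of % 2^32 and also keeps the result non-negative.

```java
final class KnuthHash {
    static final long KNUTH = 2654435761L;

    /** hash(input) = input * 2654435761 % 2^32, computed with a 32-bit mask. */
    static long hash(long input) {
        return (input * KNUTH) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        System.out.println(hash(1));  // 2654435761
        System.out.println(hash(2));  // 1013904226
        // The divisibility weakness: an even input always yields an even output.
        System.out.println(hash(2) % 2 == 0);  // true
    }
}
```

Neighbouring ids end up far apart, but parity survives the multiplication, which is the divisibility weakness described above.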
I found this gist has an interesting compilation of diverse hashing functions and their characteristics, you might pick one that best matches the characteristics of your dataset.

Using hashcode for a unique ID

I am working in a java-based system where I need to set an id for certain elements in the visual display. One category of elements is Strings, so I decided to use the String.hashCode() method to get a unique identifier for these elements.
The problem I ran into, however, is that the system I am working in borks if the id is negative and String.hashCode often returns negative values. One quick solution is to just use Math.abs() around the hashcode call to guarantee a positive result. What I was wondering about this approach is what are the chances of two distinct elements having the same hashcode?
For example, if one string returns a hashcode of -10 and another string returns a hashcode of 10 an error would occur. In my system we're talking about collections of objects that aren't more than 30 elements large typically so I don't think this would really be an issue, but I am curious as to what the math says.
Hash codes can be thought of as pseudo-random numbers. Statistically, with a positive int hash code the chance of a collision between any two elements reaches 50% when the population size is about 54K (and 77K for any int). See Birthday Problem Probability Table for collision probabilities of various hash code sizes.
Also, your idea to use Math.abs() alone is flawed: it does not always return a positive number! In two's complement arithmetic, the absolute value of Integer.MIN_VALUE is itself! Famously, the hash code of "polygenelubricants" is this value.
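A minimal sketch of the pitfall and one possible workaround: clearing the sign bit, which is my suggestion rather than something this answer proposes. The helper name is hypothetical.

```java
final class NonNegativeId {
    /** Clears the sign bit, so the result is always in [0, Integer.MAX_VALUE]. */
    static int nonNegative(int hash) {
        return hash & Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(Integer.MIN_VALUE));     // -2147483648
        System.out.println(nonNegative(Integer.MIN_VALUE));  // 0
        // The string cited above, reported to hash to Integer.MIN_VALUE
        System.out.println(nonNegative("polygenelubricants".hashCode()) >= 0);  // true
    }
}
```

Unlike Math.abs, the mask does not map -10 to 10; it maps each negative value to some other non-negative value, but it halves the output range in the same way.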
Hashes are not unique, hence they are not appropriate as a unique ID.
As to the probability of a hash collision, you could read about the birthday paradox. Actually (from what I recall) when drawing from a uniform distribution of N values, you should expect a collision after drawing about $\sqrt{N}$ values (you could get a collision much earlier). The problem is that Java's implementation of hashCode (especially when hashing short strings) doesn't provide a uniform distribution, so you'll get collisions much earlier.
You already can get two strings with the same hashcode. This should be obvious if you think that you have an infinite number of strings and only 2^32 possible hashcodes.
You just make it a little more probable when taking the absolute value. The risk is small but if you need an unique id, this isn't the right approach.
What you can do when you only have 30-50 values, as you said, is register each String you get in a HashMap together with a running counter as the value:
import java.util.HashMap;
import java.util.Map;

Map<String, Integer> stringMap = new HashMap<>();
stringMap.put("Test", 1);
stringMap.put("AnotherTest", 2);
You can then get your unique ID by calling this:
stringMap.get("Test"); // returns 1
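The running counter can be wrapped in a small class so that ids are assigned on first use. This is only a sketch; IdRegistry and idFor are names invented here, and the class is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

final class IdRegistry {
    private final Map<String, Integer> ids = new HashMap<>();
    private int next = 1;  // running counter; ids start at 1 and are never negative

    /** Returns the id already assigned to key, or assigns the next free one. */
    int idFor(String key) {
        Integer existing = ids.get(key);
        if (existing == null) {
            existing = next++;
            ids.put(key, existing);
        }
        return existing;
    }

    public static void main(String[] args) {
        IdRegistry registry = new IdRegistry();
        System.out.println(registry.idFor("Test"));         // 1
        System.out.println(registry.idFor("AnotherTest"));  // 2
        System.out.println(registry.idFor("Test"));         // 1 again
    }
}
```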

Java hashcode of string from 0-1

So I know I can convert a string to a hashcode simply by doing .hashCode(), but is there a way to convert (or use some other function if there is one out there) that will instead of returning an integer return a double between 0 and 1? I was thinking of just dividing the number by the maximum possible integer but wasn't sure if there was a better way.
*Edit (more information about why i'm trying to do this): i'm doing a mathematical operation, and i'm trying to group different objects to perform the same mathematical operation in their group but have a different parameter into the function. each member has a list of characteristics that "group" them... so i was thinking to put the characteristics into a string and then hashcode it and find their group value from that
You couldn't just divide by Integer.MAX_VALUE, as that wouldn't deal with negative numbers. You could use:
private static double INTEGER_RANGE = 1L << 32;
...
// First need to put it in the range [0, INTEGER_RANGE)
double doubleHash = ((long) text.hashCode() - Integer.MIN_VALUE) / INTEGER_RANGE;
That should be okay, as far as I'm aware... but I'm not going to make any claims about the distribution. There may well be a fairly simple way of using the 32 bits to make a unique double (per unique hash code) in the right range, but if you don't care too much about that, this will be simpler.
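Wrapped into a runnable sketch (the class and method names are illustrative), the mapping sends Integer.MIN_VALUE to 0.0, a hash of 0 to 0.5, and Integer.MAX_VALUE to just below 1.0:

```java
final class UnitHash {
    private static final double INTEGER_RANGE = 1L << 32;

    /** Shifts the int hash into [0, 2^32) and scales it into [0, 1). */
    static double toUnitInterval(int hash) {
        return ((long) hash - Integer.MIN_VALUE) / INTEGER_RANGE;
    }

    public static void main(String[] args) {
        System.out.println(toUnitInterval(Integer.MIN_VALUE));        // 0.0
        System.out.println(toUnitInterval(0));                        // 0.5
        System.out.println(toUnitInterval(Integer.MAX_VALUE) < 1.0);  // true
        System.out.println(toUnitInterval("some text".hashCode()));
    }
}
```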
Dividing should be OK, but you might lose some "precision" due to the rounding problems that doubles can have.
In general, a hash is used to identify something while trying to ensure that it is unique; losing precision could cause problems there.
You could write your own helper, say hashCodeDouble(String), that returns the desired number, perhaps using a common hash algorithm (let's say, MD5) and adapting it to your required response range.
Example: compute the MD5 of the String to get a hash, then simply put a 0. in front of it...
Remember that .hashCode() is used by lots of functions in Java; you can't simply override it.
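This smells bad but might do what you want:
// Clear the sign bit first: a negative hash would give "0.-123", which does not parse
int iHash = "123".hashCode() & Integer.MAX_VALUE;
String sHash = "0." + iHash;
double dHash = Double.parseDouble(sHash);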

Why multiply by a prime before xoring in many GetHashCode Implementations?

I understand that multiplication by a large number before xoring should help with badly distributed operands but why should the multiplier be a prime?
Related:
Why should hash functions use a prime number modulus?
Close, but not quite a Duplicate:
Why does Java’s hashCode() in String use 31 as a multiplier?
There's a good article on the Computing Life blog that discusses this topic in detail. It was originally posted as a response to the Java hashCode() question I linked to in the question. According to the article:
Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions.
Given a string “Samuel”, you can generate a unique hash by multiply each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.
However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.
Multiplying by a non-prime has a cyclic repeating pattern much smaller than the number. If you use a prime then the cyclic repeating pattern is guaranteed to be at least as large as the prime number.
I'm not sure exactly which algorithm you're talking about, but typically the constants in such algorithms need to be relatively prime. Otherwise, you get cycles and not all the possible values show up in the result.
The number probably doesn't need to be prime in your case, only relatively prime to some other numbers, but making it prime guarantees that. It also covers the cases where the other magic numbers change.
For example, if you are talking about taking the last bits of some number, then the multiplier needs to not be a multiple of 2. So, 9 would work even though it's not prime.
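The "last bits" example can be checked by brute force. This sketch, with a made-up helper name, counts how many distinct low-bit patterns come out when every possible low-bit input is multiplied by a given constant:

```java
import java.util.HashSet;
import java.util.Set;

final class MultiplierCoverage {
    /** Number of distinct values of (x * multiplier) in the lowest `bits` bits. */
    static int distinctLowBits(int multiplier, int bits) {
        int mask = (1 << bits) - 1;
        Set<Integer> seen = new HashSet<>();
        for (int x = 0; x <= mask; x++) {
            seen.add((x * multiplier) & mask);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctLowBits(9, 4));  // 16: odd, not prime, full coverage
        System.out.println(distinctLowBits(6, 4));  // 8: shares a factor of 2 with 16
        System.out.println(distinctLowBits(8, 4));  // 2: only 0 and 8 ever appear
    }
}
```

Any odd multiplier gives full coverage of the low bits, which matches the point that being relatively prime to the modulus is what matters here.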
Consider the simplest multiplication: x2.
It is equivalent to a left-bitshift. In other words, it really didn't "randomize" the data, it just shifted it over.
Same with x4, or any power of two. The original data is intact, just shifted.
Now, multiplication by other numbers (non-powers of two) is not as obvious, but still has the same problem, more or less. The original data is intact, or trivially transformed. (e.g. x5 is the same as left-bitshift two places, then add on the original data).
The point of GetHashCode is to essentially distribute the data as randomly as possible. Multiplying by a prime number guarantees that the answer won't be a simpler transform like bit-shifting or adding a number to itself.
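The shift equivalences mentioned in this answer can be verified directly. This is a tiny sketch with hypothetical method names; the identities hold even when int arithmetic overflows.

```java
final class ShiftIdentity {
    /** x2 as a single left shift. */
    static int timesTwo(int x) {
        return x << 1;
    }

    /** x5 as "shift left two places, then add the original". */
    static int timesFive(int x) {
        return (x << 2) + x;
    }

    public static void main(String[] args) {
        int x = 12345;
        System.out.println(timesTwo(x) == x * 2);   // true
        System.out.println(timesFive(x) == x * 5);  // true
    }
}
```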
