Generate random numbers that depend on string hash - java

I'm trying to generate n random numbers that depend on an input string. It would be a function generateNumbers(String input) that generates the same set of numbers for the same input string but entirely different numbers for a slightly different input string.
My question is: Is there an easy way to do this?

I agree with nihlon, if what you want is a function f() returning an int such that f(string1) != f(string2) for any string1, string2 in some set of strings S, then you're looking for a perfect hash.
Obviously, if S is the set of all possible strings, there are way more than 2^32, or even 2^64, so no such f() can exist returning an int or even long. Hence, the question is: how is S characterized?
Also, are you sure you need unique numbers for different strings? In most problem domains regular hashing is adequate...

As Roberto says, a hash is one way to do this, with a small possibility of two different strings hashing to the same value. That probability depends on the maximum size of string you allow and the bit-size of the resulting hash number.
You could also use an encryption, but then you would have to limit the string size to one or two blocks of a block cipher. Two blocks of AES is 32 characters, and will produce a 256 bit number.
Pick the smallest string size you can live with, and the largest hash size/block size you can work with. A non-cryptographic hash like the fnv hash will be faster than a cryptographic hash like SHA-256, but obviously less secure. You do not say how important security is to you.
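As a rough illustration of the non-cryptographic route, the sketch below hashes the string with 64-bit FNV-1a and uses the result to seed java.util.Random. The class name SeededNumbers and the method signatures are made up for this example, and the n parameter is an assumption about how the asker's generateNumbers would be called. Note that java.util.Random only uses 48 bits of its seed, so this is suitable for repeatability, not for security.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical helper: same input string -> same sequence of numbers.
final class SeededNumbers {
    // Standard 64-bit FNV-1a parameters.
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // 64-bit FNV-1a over the UTF-8 bytes of the input.
    static long fnv1a64(String input) {
        long hash = FNV_OFFSET_BASIS;
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);   // mix in the next byte
            hash *= FNV_PRIME;    // multiply, wrapping modulo 2^64
        }
        return hash;
    }

    // Generates n pseudo-random ints that depend only on the input string.
    static int[] generateNumbers(String input, int n) {
        Random rnd = new Random(fnv1a64(input));
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = rnd.nextInt();
        }
        return out;
    }
}
```

Calling generateNumbers("hello", 5) twice returns the same five values, while a slightly different string such as "hellp" produces a different seed and hence an unrelated-looking sequence.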


Generate two random numbers between 1 to 10 using specific string

I would like to implement logic that, based on a provided string, generates two random numbers between 1 and 10.
For example, given a string like Johnsen, it should generate two numbers like 1 and 3, and the next time it is given the same string it should return the same numbers.
I need help developing this algorithm or logic.
Java has random number generator support via the java.util.Random class. This class 'works' by having a seed value, then giving you some random data based on this seed value, which then updates the seed value to something new.
This pragmatically means:
2 instances of j.u.Random with the same seed value will produce the same sequence of values if you invoke the same sequence of calls on it to give you random data.
But, seed values are of type long - 64 bits worth of data.
Thus, to do what you want, you need to write an algorithm that turns any String into a long.
Given that long, you simply make an instance of j.u.Random with that long as seed, using the new Random(seedValue) constructor.
So that just leaves: How do I turn a string into a long?
Easy way
The simplest answer is to invoke hashCode() on them. But, note, hashcodes only have 32 bits of info (they are int, not long), so this doesn't cover all possible seed values. This is unlikely to matter unless you're doing this for crypto purposes. If you ARE, then you need to stop what you are doing and do a lot more research, because it is extremely easy to mess up and have working code that seems to test fine, but which is easy to hack. You don't want that. For starters, you'd want SecureRandom instead, but that's just the tip of the iceberg.
Harder way
Hashing algorithms exist that turn arbitrary data into fixed size hash representations. The hashCode algorithm of string [A] only makes 32-bits worth of hash, and [B] is not cryptographically secure: If you task me to make a string that hashes to a provided value I can trivially do so; a cryptographically secure hash has the property that I can't just cook you up a string that hashes to a desired value.
You can search the web for hashing strings or byte arrays (you can turn a string into one with str.getBytes(StandardCharsets.UTF_8)).
You can 'collapse' a byte array containing a hash into a long also quite easily - just take any 8 bytes in that hash and use them to construct a long. "Turn 8 bytes into a long" also has tons of tutorials if you search the web for it.
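A minimal sketch of the harder way, assuming SHA-256 as the hash and taking the first 8 bytes of the digest as the seed. The class and method names (StringSeed, seedFrom, twoNumbers) are invented for this example. It still feeds the seed to java.util.Random, so it is not a cryptographic generator; it only spreads the seed over 64 bits instead of 32.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

// Hypothetical helper: derive a 64-bit seed from a string via SHA-256.
final class StringSeed {
    static long seedFrom(String key) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(key.getBytes(StandardCharsets.UTF_8));
            // "Collapse" the hash: read the first 8 bytes as a big-endian long.
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Two numbers in the range 1..10 that depend only on the key.
    static int[] twoNumbers(String key) {
        Random rnd = new Random(seedFrom(key));
        return new int[] { rnd.nextInt(10) + 1, rnd.nextInt(10) + 1 };
    }
}
```

StringSeed.twoNumbers("Johnsen") returns the same pair on every call and on every JVM, since both SHA-256 and the java.util.Random algorithm are fully specified.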
I assume the easy way is good enough for this exercise, however.
Thus:
String key = ...;
Random rnd = new Random(key.hashCode());
int number1 = rnd.nextInt(10) + 1;
int number2 = rnd.nextInt(10) + 1;
System.out.println("First number: " + number1);
System.out.println("Second number: " + number2);
You could get the hashcode of the string, then use that to seed a random number generator. Use the RNG to get numbers in the range 1 - 10.

Why is the given hash function a poor hash function?

Assume that the hash table is an array with indices 0 through HASHSIZE-1. The function returns a value in the correct range and does not generate any run-time errors. Assume that the passed in String has at least 2 characters. Why is it a poor hash function?
public static int hash(String key) {
    return (key.charAt(0)
            + key.charAt(1)
            + key.charAt(key.length() - 1)) % HASHSIZE;
}
The quality of a hash function depends on the number of collisions it creates among the expected population of keys. Good functions make situations in which different keys produce the same hash code less likely.
The quality of this approach depends on the expected length of keys in use. For keys of length three this is a perfectly acceptable approach, although it is not ideal, because hash does not change based on letter ordering.
For keys of length 10 this approach will generate collisions for all keys starting in the same pair of letters that have the same letter at the end. When the two initial letters and the last letter combination repeat a lot, you will get collisions, rendering this hash function less useful.
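The collisions described above are easy to reproduce. The sketch below assumes HASHSIZE = 101 (the question does not give a value) and compares the function from the question with one that uses every character via String.hashCode(). PoorHashDemo and betterHash are names invented for this example.

```java
// Hypothetical demo of why sampling three characters collides so often.
final class PoorHashDemo {
    static final int HASHSIZE = 101; // assumed table size, not from the question

    // The function from the question: only three characters, order ignored.
    static int poorHash(String key) {
        return (key.charAt(0)
                + key.charAt(1)
                + key.charAt(key.length() - 1)) % HASHSIZE;
    }

    // One possible alternative: every character and its position contribute.
    static int betterHash(String key) {
        return Math.floorMod(key.hashCode(), HASHSIZE);
    }

    public static void main(String[] args) {
        // Same letters in a different order collide under poorHash.
        System.out.println(poorHash("abc") == poorHash("bac"));
        // Same first two and last characters collide, whatever is in between.
        System.out.println(poorHash("hello world") == poorHash("hexxxxxxxxd"));
        // betterHash separates the first pair for this table size.
        System.out.println(betterHash("abc") == betterHash("bac"));
    }
}
```

Math.floorMod is used in betterHash because hashCode() can be negative, and a plain % would then produce a negative index.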

one way function from a set of unique numbers to another set of unique numbers

I am looking for a function (hash function? not sure) that maps a set of (unique) numbers to another set of (unique) numbers. I had a look at perfect hashing -and the fact that it's impossible to do :( - and it seems quite close to what I am after.
In detail, I want the following:
Map any given 12 digit number to another number, 12 digit or more (I don't care about the resulting size but it must fit in a Java long).
Must be able to guarantee that the same 12 digit number maps to the same number every time.
Must be able to guarantee that a different 12 digit number always maps to a different number, i.e. there are no collisions in the result set.
Must be able to guarantee that the function is one way, which (in my mind anyway) means that you cannot compute the number you started from given the result of the function.
Something like a minimal hashing function would not work in my case, as each number has to be computed on a different computer, meaning that the function itself has to guarantee these features without checking for collisions in the results and/or having any centralized control of the result set.
An example of such a function would just be a simple:
take number, add 1, output number.
The only problem with that is that you can easily get first number just by subtracting one. I want it to be very difficult, or preferably impossible to get the previous number back.
Any thoughts ?
Translating the function from math to java I don't mind doing on my own. Unless you can suggest an already existing java library.
You are looking for format-preserving encryption algorithms.
It looks like a cryptography problem to me.
If you want to map a String to another String and must be able to guarantee that the mapping is one way and has no collisions, you need to encrypt the input String.
For example you can use DES.
If the output must be digits you can interpret the bytes of the output as hexadecimal and then convert to base 10.
What you want is a cipher. A good old fashioned symmetric cipher. Like AES.
Imagine for a second that instead of 12 digits, your numbers were 128 bits long. Let's say you set up an AES cipher with a key of your choosing, and use it to encrypt numbers. What is the outcome?
You will map any given 128-bit number to another number, of exactly 128 bits (AES's block size is 128 bits)
You can guarantee that the same 128-bit number maps to the same number every time (encryption is deterministic)
You can guarantee that two different 128-bit numbers always map to different numbers (encryption is reversible - there are no two plaintexts which encrypt to the same ciphertext, otherwise how could you decrypt them?)
You can guarantee that, for someone who doesn't have the key, the function is one way (encryption is not reversible without the key - it would be less useful if it was)
Now, 128 bits is bigger than you want, so AES is not the right cipher for you. All you need to do is choose a cipher which has a 12-digit block size.
There aren't any conventional ciphers which have a 12-digit block size. But it's actually pretty easy to construct one. You can use the Feistel construction to take a hash function and construct a block cipher. You can either build a binary cipher of a suitable size (40 bits in your case) and then use the "hasty pudding trick" to restrict its domain to 12 digits, or construct a cipher that works directly in decimal (more or less).
I wrote an answer to another question a while ago that explained this in some more detail; I even wrote some code to implement the idea, although looking at it now, I don't know how comprehensible it is. The key classes in it are TinyCipher, which implements a cipher with a block size of up to 32 bits (and could easily be extended up to 64 bits), and TrickCipher, which uses the hasty pudding trick to implement a cipher over an arbitrary-sized set (such as all 12-digit numbers).
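The following is an independent sketch of the construction described above, not the TinyCipher/TrickCipher code from that other answer. It builds a 40-bit balanced Feistel network whose round function is SHA-256 over (key, round number, half block), then applies the "hasty pudding trick" (cycle walking): re-encrypt until the result falls below 10^12. The class name TwelveDigitCipher, the round count of 8, and the treatment of inputs as values 0 to 999,999,999,999 are all assumptions made for illustration; this has not been reviewed as a secure design.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical keyed permutation over the 12-digit numbers 0 .. 10^12 - 1.
final class TwelveDigitCipher {
    private static final long DOMAIN = 1_000_000_000_000L;   // 10^12 values
    private static final int HALF_BITS = 20;                  // 2 x 20 = 40 bits
    private static final long HALF_MASK = (1L << HALF_BITS) - 1;
    private static final int ROUNDS = 8;                      // assumed round count

    private final byte[] key;

    TwelveDigitCipher(byte[] key) {
        this.key = key.clone();
    }

    // Round function: SHA-256(key || round || half), truncated to 20 bits.
    private long round(int round, long half) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(key);
            md.update(ByteBuffer.allocate(12).putInt(round).putLong(half).array());
            return ByteBuffer.wrap(md.digest()).getLong() & HALF_MASK;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // One pass of the 40-bit Feistel network: (L, R) -> (R, L xor F(R)) per round.
    private long feistelEncrypt(long block) {
        long left = block >>> HALF_BITS;
        long right = block & HALF_MASK;
        for (int r = 0; r < ROUNDS; r++) {
            long next = left ^ round(r, right);
            left = right;
            right = next;
        }
        return (left << HALF_BITS) | right;
    }

    // Inverse pass: undo the rounds in reverse order.
    private long feistelDecrypt(long block) {
        long left = block >>> HALF_BITS;
        long right = block & HALF_MASK;
        for (int r = ROUNDS - 1; r >= 0; r--) {
            long prev = right ^ round(r, left);
            right = left;
            left = prev;
        }
        return (left << HALF_BITS) | right;
    }

    private static void checkRange(long n) {
        if (n < 0 || n >= DOMAIN) {
            throw new IllegalArgumentException("expected 0 <= n < 10^12, got " + n);
        }
    }

    // Cycle walking: keep encrypting until the value is back inside the domain.
    long encrypt(long n) {
        checkRange(n);
        long c = feistelEncrypt(n);
        while (c >= DOMAIN) {
            c = feistelEncrypt(c);
        }
        return c;
    }

    long decrypt(long c) {
        checkRange(c);
        long n = feistelDecrypt(c);
        while (n >= DOMAIN) {
            n = feistelDecrypt(n);
        }
        return n;
    }
}
```

Because the Feistel network is a permutation of the 40-bit values and cycle walking restricts it to a permutation of the 12-digit values, two different inputs can never map to the same output, and every machine holding the same key computes the same mapping without coordination. About 91% of 40-bit values are below 10^12, so the loop rarely runs more than once or twice.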

Using DHT to lookup stuff. SHA-1. Chord protocol

I'm trying to implement the Chord protocol in order to quickly look up some nodes and keys in a small network. What I can't figure out is this: Chord considers the nodes and keys as being placed on a circle, with their placement dictated by the hash values obtained by applying the SHA-1 hash function. How exactly do I operate on those values? Do I treat them as a string such as de9f2c7f d25e1b3a fad3e85a 0bd17d9b 100db4b3 and then compare them as such, considering that "a" < "b" is true? Or how? How do I know if a key is before or after another?
Since the keyspace is a ring, a single value can't be said to be greater than another, because if you go the other way around the ring, the opposite is true. You can say a value is within a range or not. In the Chord DHT, each server is responsible for the keys within the range of values between it and its predecessor.
I would advise against using strings for the hash values. You shouldn't use the hashCode function for distributed systems, but you do need to do math on the hash keys when adding new nodes. You could try converting the hashes into BigIntegers instead.
SHA-1 hashes are not strings but very long numbers, usually written in hex. They are often stored as strings because they would otherwise require a native 160-bit number type. They are built as five 32-bit hex numbers and then often 'strung' together.
Using SHA-1 strings as the numbers they represent is not hard but requires a library that can handle such large numbers (like BigInt or bcmath). These libraries work by calculating the numbers within the string one column at a time from right to left, much like a person using pen and paper to add, multiply, divide, etc. They will typically have functions for doing common math as well as comparisons, and often take strings as arguments. Also, make sure that you use a big-number conversion function any time you need to go from hex to dec, or else your 160-bit hex number will likely get rounded into a 64-bit float or similar and lose most of its accuracy.
More/less-than comparisons are used in Chord to figure out ranges, but they do so using modulo so that they 'wrap', making ranges such as [64, 2] possible. The actual formula is
fingers[k] = find_successor((n + 2^(k-1)) mod 2^160)
where 'n' is the SHA-1 of a node and 'k' is the finger number.
Remember, 'n' will be hex while 'k' and 'mod 2^160' will typically be dec, so this is where your BigInt hex to BigInt dec conversion will be needed.
Even if your programming framework will let you create these vars as hex, 160 is specifically a dec (literally meaning one hundred and sixty bits) and besides, wrapping your brain around 'mod 2^160' is already hard enough without visualizing it as hex. Convert 'n' to dec rather than converting 'k' etc. to hex, and then use a BigInt lib to do the math, including comparisons.
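In Java, java.math.BigInteger covers all of this. The sketch below shows one way to turn a SHA-1 digest (or its hex string) into an unsigned 160-bit number, test whether an id lies in a wrapping range such as (64, 2], and compute the start of finger k. The class name ChordIds and its method names are invented for this example, and the half-open interval (from, to] is an assumption about which range check is needed.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helpers for working with Chord identifiers on a 2^160 ring.
final class ChordIds {
    static final int M = 160;                                    // bits in a SHA-1 id
    static final BigInteger RING = BigInteger.ONE.shiftLeft(M);  // 2^160

    // SHA-1 of a name as a non-negative BigInteger.
    static BigInteger id(String name) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(name.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1: treat bytes as unsigned
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses a hex id such as "de9f2c7f d25e1b3a ..." (spaces are ignored).
    static BigInteger fromHex(String hex) {
        return new BigInteger(hex.replace(" ", ""), 16);
    }

    // True if x lies in (from, to] when walking clockwise around the ring.
    static boolean inInterval(BigInteger x, BigInteger from, BigInteger to) {
        if (from.compareTo(to) < 0) {
            return x.compareTo(from) > 0 && x.compareTo(to) <= 0;
        }
        // The range wraps past zero; from == to is treated as the whole ring.
        return x.compareTo(from) > 0 || x.compareTo(to) <= 0;
    }

    // Start of finger k (1 <= k <= 160): (n + 2^(k-1)) mod 2^160.
    static BigInteger fingerStart(BigInteger n, int k) {
        return n.add(BigInteger.ONE.shiftLeft(k - 1)).mod(RING);
    }
}
```

With these helpers, "is key x owned by this node?" becomes inInterval(x, predecessorId, nodeId), and no string comparison of hex digits is needed.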

Getting an int representation of a String

I am looking for a way to create an int\long representation of an arbitrary alpha-numeric String. Hash codes won't do it, because I can't afford hash collisions i.e. the representation must be unique and repeatable.
The numeric representation will be used to perform efficient (hopefully) compares. The creation of the numeric key will take some time, but it only has to happen once, whereas I need to perform vast numbers of comparisons with it - which will hopefully be much faster than comparing the raw Strings.
Any other ideas on faster String comparison would be most appreciated too...
Unless your string is limited in length, you can't avoid collisions.
There are 4294967296 possible values for an integer (2^32). If you have a string of more than 4 ASCII characters, or more than two unicode characters, then there are more possible string values than possible integer values. You can't have a unique integer value for every possible 5 character string. Long values have more possible values, but they would only provide a unique value for every possible string of 8 ASCII characters.
Hash codes are useful as a two step process: first see if the hash code matches, then check the whole string. For most strings that don't match, you only need to do the first step, and it's really fast.
Can't you just start with a hash code, and if the hash codes match, do a character by character comparison?
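A small sketch of that two-step comparison, assuming the keys are wrapped once and compared many times. HashedKey is a hypothetical name. Java's String already caches its own hashCode() after the first call, so in practice the gain over plain equals() may be small; the wrapper mainly makes the idea explicit.

```java
// Hypothetical wrapper: compare the precomputed hash first, the text second.
final class HashedKey {
    private final String value;
    private final int hash;

    HashedKey(String value) {
        this.value = value;
        this.hash = value.hashCode(); // computed once, up front
    }

    boolean sameAs(HashedKey other) {
        // Step 1: different hashes prove the strings differ (cheap int compare).
        // Step 2: equal hashes might be a collision, so confirm with equals().
        return hash == other.hash && value.equals(other.value);
    }
}
```

For example, "Aa" and "BB" have the same hashCode() in Java, so the first step alone would wrongly report them as equal; the second step catches it.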
How long are the strings? If they are very short, then a unique ID can be generated by considering the characters as digits in base 36 (26 + 10) that form an n-digit number, where n is the length of the string. On the other hand, if the strings are short enough to allow this, direct comparison won't be an issue anyway.
Otherwise you'll have to generate a collision-free hash and this can only be done when the complete problem space is known in advance (i.e. if you know all strings that can possibly occur). You will want to have a look at perfect hashing, although the only feasible algorithm to find a perfect hash function that I know is probabilistic so collisions are still theoretically possible.
There might be other ways to find such a function. Knuth called this a “rather amusing … puzzle” in TAoCP but he doesn't give an algorithm either.
In general, you give far too little information to find an algorithm that doesn't require probing the whole problem space in some manner. This does invariably mean that the problem has exponential running time but could be solved using machine-learning heuristics. I'm not sure if this is advisable in your case.
Perhaps:
String y = "oiu291981u39u192u3198u389u28u389u";
BigInteger bi = new BigInteger(y, 36);
System.out.println(bi);
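For strings short enough, the same base-36 idea fits in a primitive long without BigInteger. The sketch below is one possible take, with Base36Key and toKey as invented names. It assumes ASCII letters and digits only, treats upper and lower case as the same character (that is how radix-36 parsing works in Java), and prepends a "1" digit so that leading zeros and different lengths still give different keys. With that extra digit, at most 11 characters fit in a long.

```java
// Hypothetical helper: collision-free long key for short alphanumeric strings.
final class Base36Key {
    static final int MAX_LENGTH = 11; // "1" + 11 digits stays below 36^12 < 2^63

    static long toKey(String s) {
        if (s.isEmpty() || s.length() > MAX_LENGTH) {
            throw new IllegalArgumentException("expected 1.." + MAX_LENGTH + " characters");
        }
        // The leading "1" keeps "0a" and "a" apart; parseLong throws
        // NumberFormatException for characters that are not base-36 digits.
        return Long.parseLong("1" + s, 36);
    }
}
```

Once the keys are built, comparing two strings for equality is a single long comparison. Anything longer than 11 characters, or any larger alphabet, runs into the size limits discussed in the other answers.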
At the end of the day, a single alphanumeric character has at least 36 possible values. If you include punctuation, lower case, etc then you can easily pass 72 possible values.
A non-colliding number that allows you to quickly compare strings would necessarily grow exponentially with the length of the string.
So you first must decide on the longest string you are expecting to compare. Assuming it's N characters in length, and assuming you ONLY need uppercase letters and the numerals 0-9 then you need to have an integer representation that can be as high as
36^N
For a string of length 25 (common name field) then you end up needing a binary number with 130 bits.
If you split that into 32-bit numbers, you'll need 5 (4 x 32 = 128 bits is just short of 130). Then you can compare each number (five integer compares should take no time, compared to walking the string). I would recommend a big number library, but for this specialized case I'm pretty sure you can write your own and get better performance.
If you want to handle 72 possible values per character (uppercase, lowercase, numerals, punctuation...) and you need 10 characters, then you'll need 62 bits - two 32 bit integers (or one 64 bit if you're on a system that supports 64 bit computing)
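The bit counts above can be checked with a couple of lines of BigInteger arithmetic. The helper below (KeyBits.bitsNeeded, a name made up for this example) computes how many bits are needed to number every string of a given length over a given alphabet size, that is, the bit length of alphabetSize^length - 1.

```java
import java.math.BigInteger;

// Hypothetical helper for sizing a collision-free numeric key.
final class KeyBits {
    // Bits needed to represent the largest key, alphabetSize^length - 1.
    static int bitsNeeded(int alphabetSize, int length) {
        return BigInteger.valueOf(alphabetSize)
                .pow(length)
                .subtract(BigInteger.ONE)
                .bitLength();
    }

    public static void main(String[] args) {
        System.out.println(bitsNeeded(36, 25)); // prints 130
        System.out.println(bitsNeeded(72, 10)); // prints 62
    }
}
```

Each extra character adds about log2(alphabetSize) bits, roughly 5.17 bits for a 36-symbol alphabet and 6.17 bits for 72 symbols, which is why a fixed-width int or long runs out so quickly.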
If, however, you are not able to restrict the numbers in the string (ie, could be any of the 256 letters/numbers/characters/etc) and you can't define the size of the string, then comparing the strings directly is the only way to go, but there's a shortcut.
Cast the pointer of the string to a 32 bit unsigned integer array, and compare the string 4 bytes at a time (or 64 bits/8bytes at a time on a 64 bit processor). This means that a 100 character string only requires 25 compares maximum to find which is greater.
You may need to re-define the character set (and convert the strings) so that the characters with higher precedence are assigned values closer to 0, and lower precedence values closer to 255 (or vice versa, depending on how you are comparing them).
Good luck!
-Adam
As long as it's a hash function, be it String.hashCode(), MD5 or SHA1, collision is unavoidable unless you have a fixed limit on the string's length. It is mathematically impossible to have one-to-one mapping from an infinite group to a finite group.
Stepping back, is collision avoidance absolutely necessary?
A few questions in the beginning:
Did you test that simple string comparison is too slow?
What does the comparison look like ('ABC' == 'abc' or 'ABC' != 'abc')?
How many strings do you have to compare?
How many comparisons do you have to do?
What do your strings look like (the length, letter case)?
As far as I remember, a String in Java is an object, and two identical string literals point to the same object. Strings built at runtime are separate objects unless they are interned with intern().
So, maybe it would be enough to compare objects (probably string comparison is already implemented in this way).
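A quick way to see when comparing objects is safe is to try it. In the sketch below (StringIdentityDemo is a name invented for this example), == compares references while equals() compares contents. Reference comparison only agrees with content comparison when both strings come from the intern pool, as literals do.

```java
// Hypothetical demo: reference equality versus content equality for Strings.
final class StringIdentityDemo {
    static boolean sameObject(String a, String b) {
        return a == b;        // reference comparison
    }

    static boolean sameContent(String a, String b) {
        return a.equals(b);   // character-by-character comparison
    }

    public static void main(String[] args) {
        String literal = "abc";
        String runtime = new String("abc"); // a distinct object with equal content

        System.out.println(sameObject(literal, "abc"));            // true: both literals
        System.out.println(sameObject(literal, runtime));          // false: different objects
        System.out.println(sameObject(literal, runtime.intern())); // true: interned copy
        System.out.println(sameContent(literal, runtime));         // true
    }
}
```

So if every key is interned once up front, later comparisons can use ==, at the cost of keeping all keys in the intern pool.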
If that doesn't help, you can try a Pascal-style string implementation, where the first element is the length; if your strings have various lengths, this should save some CPU time.
How long are your strings? Unless you choose an int representation that's longer than the string, collisions will always be possible no matter what conversion you're using. So if you're using a 32 bit integer, you can only uniquely represent strings of up to 4 bytes.
How big are your strings? Arbitrarily long strings cannot be compressed into 32/64 bit format.
If you don't want collisions, try something insane like SHA-512. I can't guarantee there won't be collisions, but I don't think they have found any yet.
Assuming "alphanumeric" means letters and numbers, you could treat each letter/number as a base-36 digit. Unfortunately, large strings will cause the number to grow rapidly and you'd have to resort to big integers, which are hardly efficient.
If your strings are usually different when you make the comparison (i.e. searching for a specific string) the hash might be your best option. Once you get a potential hit, you can do the string comparison to be sure. A well-designed hash will make collisions exceedingly rare.
It would seem that an MD5 hash would work fine. The risk of a hash collision would be extremely unlikely. Depending on the length of your string, a hash that generates an int/long would run into max value problems very quickly.
Why don't you do something like 1stChar + (10 x 2ndChar) + 100 x (3rdChar) ...., where you use the simple integer value of each character, i.e. a = 1, b = 2 etc, or just the integer value if it's not a letter. This will give a unique value for each string, even for 2 strings that are just the same letters in a different order.
Of course it gets more complicated if you need to worry about Unicode rather than just ASCII, and the numbers could get large if you need to use long strings.
Are the standard Java string comparison functions definitely not efficient enough?
String length may vary, but let's say 10 characters for now.
In that case, in order to guarantee uniqueness you'd have to use some sort of big integer representation. I doubt that doing comparisons on big integers would be substantially faster than doing string comparisons in the first place. I'll second what others have said here: use some sort of hash, then in the event of a hash match check the original strings to weed out any collisions.
In any case, if your strings are around 10 characters, I doubt that comparing, say, a bunch of 32-bit hashes will be all that much faster than direct string comparisons. I think you have to ask yourself if it's really worth the additional complexity.
