I'm trying to implement the Chord protocol in order to quickly look up some nodes and keys in a small network. What I can't figure out is this: Chord considers the nodes and keys as being placed on a circle, with their placement dictated by the hash values obtained by applying the SHA-1 hash function. How exactly do I operate on those values? Do I treat them as a string such as de9f2c7f d25e1b3a fad3e85a 0bd17d9b 100db4b3 and then compare them as such, considering that "a" < "b" is true? Or how? How do I know if a key is before or after another?
Since the keyspace is a ring, a single value can't be said to be greater than another, because if you go the other way around the ring, the opposite is true. You can say a value is within a range or not. In the Chord DHT, each server is responsible for the keys within the range of values between it and its predecessor.
I would advise against using strings for the hash values. You shouldn't use the hashCode function for distributed systems, but you do need to do math on the hash keys when adding new nodes. You could try converting the hashes into BigIntegers instead.
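As a rough illustration of the two points above (hashes as BigIntegers, and "within a range" instead of "greater than"), here is a minimal Java sketch. The class name ChordRing and the helpers id and inInterval are made up for this example, and the half-open interval (from, to] is an assumption about how a node's range is defined.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of any Chord library.
final class ChordRing {
    // Size of the identifier space: 2^160 for SHA-1.
    static final BigInteger RING = BigInteger.ONE.shiftLeft(160);

    // Hash a name (node address or key) to a non-negative 160-bit identifier.
    static BigInteger id(String name) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(name.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1 keeps the value positive
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if x lies in the ring interval (from, to], walking clockwise from 'from'.
    static boolean inInterval(BigInteger x, BigInteger from, BigInteger to) {
        if (from.compareTo(to) < 0) {
            // Ordinary case: the interval does not pass through zero.
            return x.compareTo(from) > 0 && x.compareTo(to) <= 0;
        }
        // The interval wraps past zero (or covers the whole ring when from == to).
        return x.compareTo(from) > 0 || x.compareTo(to) <= 0;
    }
}
```

With this, a node with identifier n and predecessor p would own key k when ChordRing.inInterval(k, p, n) is true, including the case where p is numerically larger than n.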
SHA-1 hashes are not strings but very long numbers, usually written in hex. They are often stored as strings because they would otherwise require a native 160-bit number type. They are built from five 32-bit words that are then often 'strung' together.
Using SHA-1 strings as the numbers they represent is not hard, but it requires a library that can handle such large numbers (like BigInt or bcmath). These libraries work by calculating the numbers within the string one column at a time from right to left, much like a person using pen and paper to add, multiply, divide, etc. They typically have functions for common math as well as comparisons, and often take strings as arguments. Also, make sure that you use a big-number function any time you need to go from hex to dec, or else your 160-bit hex number will likely get rounded into a 64-bit float or similar and lose most of its accuracy.
Greater-than/less-than comparisons are used in Chord to determine ranges, but they do so using modulo so that they 'wrap', making ranges such as [64, 2] possible. The actual formula is
fingers[k] = find_successor((n + 2^(k-1)) mod 2^160)
where 'n' is the SHA-1 of a node and 'k' is the finger number.
Remember, 'n' will be hex while 'k' and 'mod 2^160' will typically be dec, so this is where your BigInt hex to BigInt dec conversion will be needed.
Even if your programming framework will let you create these vars as hex, 160 is specifically a dec (literally meaning one hundred and sixty bits) and besides, wrapping your brain around 'mod 2^160' is already hard enough without visualizing it as hex. Convert 'n' to dec rather than converting 'k' etc. to hex, and then use a BigInt lib to do the math, including comparisons.
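The finger computation described above can be sketched in Java with BigInteger. The names FingerTable, fingerStart and fromHex are invented for this example; only the arithmetic (n plus 2^(k-1), reduced modulo 2^160) comes from the answer. The find_successor lookup itself is not shown.

```java
import java.math.BigInteger;

// Hypothetical helper class for illustration only.
final class FingerTable {
    static final int M = 160;                                  // bits in a SHA-1 identifier
    static final BigInteger MOD = BigInteger.ONE.shiftLeft(M); // 2^160

    // Start of the k-th finger (1 <= k <= M): (n + 2^(k-1)) mod 2^160.
    static BigInteger fingerStart(BigInteger n, int k) {
        if (k < 1 || k > M) {
            throw new IllegalArgumentException("k out of range: " + k);
        }
        return n.add(BigInteger.ONE.shiftLeft(k - 1)).mod(MOD);
    }

    // Parse a 40-character hex SHA-1 string (spaces between words allowed).
    static BigInteger fromHex(String hex) {
        return new BigInteger(hex.replace(" ", ""), 16);
    }
}
```

For example, FingerTable.fingerStart(FingerTable.fromHex("de9f2c7f d25e1b3a fad3e85a 0bd17d9b 100db4b3"), 160) wraps around the ring because adding 2^159 to that value exceeds 2^160.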
Related
I'm trying to generate n random numbers that depend on an input string. It would be a function generateNumbers(String input) that generates the same set of numbers for the same input string but entirely different numbers for a slightly different input string.
My question is: Is there an easy way to do this?
I agree with nihlon, if what you want is a function f() returning an int such that f(string1) != f(string2) for any string1, string2 in some set of strings S, then you're looking for a perfect hash.
Obviously, if S is the set of all possible strings, there are way more than 2^32, or even 2^64, so no such f() can exist returning an int or even long. Hence, the question is: how is S characterized?
Also, are you sure you need unique numbers for different strings? In most problem domains regular hashing is adequate...
As Roberto says, a hash is one way to do this, with a small possibility of two different strings hashing to the same value. That probability depends on the maximum size of string you allow and the bit-size of the resulting hash number.
You could also use an encryption, but then you would have to limit the string size to one or two blocks of a block cipher. Two blocks of AES is 32 characters, and will produce a 256 bit number.
Pick the smallest string size you can live with, and the largest hash size/block size you can work with. A non-cryptographic hash like the fnv hash will be faster than a cryptographic hash like SHA-256, but obviously less secure. You do not say how important security is to you.
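To make the hashing approach concrete, here is a minimal Java sketch that hashes the input with SHA-256 and uses the first 8 bytes of the digest to seed java.util.Random. The extra count parameter n and the class name SeededNumbers are assumptions added for the example; the question's signature only takes the string. Distinct inputs can still collide with very small probability, and java.util.Random is not suitable where security matters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

// Hypothetical class name; a sketch, not a vetted implementation.
final class SeededNumbers {
    // Returns n numbers that depend only on the input string.
    static int[] generateNumbers(String input, int n) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the digest as the seed.
            long seed = ByteBuffer.wrap(digest).getLong();
            Random rng = new Random(seed);
            int[] out = new int[n];
            for (int i = 0; i < n; i++) {
                out[i] = rng.nextInt();
            }
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because a small change in the input changes the digest completely, the seed and therefore the whole sequence changes as well.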
In a project I'm working on, I need to generate 16 character long unique IDs, consisting of 10 numbers plus 26 uppercase letters (only uppercase). They must be guaranteed to be universally unique, with zero chance of a repeat ever.
The IDs are not stored forever. An ID is thrown out of the database after a period of time and a new unique ID must be generated. The IDs can never repeat with the thrown out ones either.
So randomly generating 16 digits and checking against a list of previously generated IDs is not an option because there is no comprehensive list of previous IDs. Also, UUID will not work because the IDs must be 16 digits in length.
Right now I'm using 16-Digit Unique IDs, that are guaranteed to be universally unique every time they're generated (I'm using timestamps to generate them plus unique server ID). However, I need the IDs to be difficult to predict, and using timestamps makes them easy to predict.
So what I need to do is map the 16 digit numeric IDs that I have into the larger range of 10 digits + 26 letters without losing uniqueness. I need some sort of hashing function that maps from a smaller range to a larger range, guaranteeing a one-to-one mapping so that the unique IDs are guaranteed to stay unique after being mapped.
I have searched and so far have not found any hashing or mapping functions that are guaranteed to be collision-free, but one must exist if I'm mapping to a larger space. Any suggestions are appreciated.
Brandon Staggs wrote a good article on Implementing a Partial Serial Number Verification System. The examples are written in Delphi, but could be converted to other languages.
EDIT: This is an updated answer, as I misread the constraints on the final ID.
Here is a possible solution.
Let's define:
UID16 = 16-digit unique ID
LUID = 16-symbol UID (using digits+letters)
SECRET = a secret string
HASH = some hash of SECRET+UID16
Now, you can compute:
LUID = BASE36(UID16) + SUBSTR(BASE36(HASH), 0, 5)
BASE36(UID16) will produce an 11-character string (because 16 / log10(36) ~= 10.28)
It is guaranteed to be unique because the original UID16 is fully included in the final ID. If you happen to get a hash collision with two different UID16, you'll still have two distinct LUID.
Yet, it is difficult to predict because the 5 other symbols are based on a non-predictable hash.
NB: you'll only get log2(36^5) ~= 26 bits of entropy on the hash part, which may or may not be enough depending on your security requirements. The less predictable the original UID16, the better.
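The scheme above could look roughly like this in Java. Everything named here (the class Luid, the helper pad, the choice of SHA-256 over the concatenation of secret and ID) is an assumption made for the sketch; the answer only says "some hash". An HMAC would be the more conventional way to mix in a secret.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical sketch of LUID = BASE36(UID16) + first 5 symbols of BASE36(HASH).
final class Luid {
    private static final long LIMIT = 10_000_000_000_000_000L; // 10^16

    // Left-pad s with '0' up to the given width.
    static String pad(String s, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    static String luid(long uid16, String secret) {
        if (uid16 < 0 || uid16 >= LIMIT) {
            throw new IllegalArgumentException("not a 16-digit ID: " + uid16);
        }
        try {
            // 11 base-36 symbols are enough for any value below 10^16.
            String idPart = pad(Long.toString(uid16, 36), 11);
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest((secret + uid16).getBytes(StandardCharsets.UTF_8));
            String hashPart = pad(new BigInteger(1, digest).toString(36), 5).substring(0, 5);
            return (idPart + hashPart).toUpperCase(Locale.ROOT);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Uniqueness comes entirely from the first 11 symbols, so anyone who knows the scheme can still read the original ID out of the result; only the last 5 symbols are hard to guess.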
One general solution to your problem is encryption. Because encryption is reversible it is always a one-to-one mapping from plaintext to cyphertext. If you encrypt the numbers [0, 1, 2, 3, ...] then you are guaranteed that the resulting cyphertexts are also unique, as long as you keep the same key, do not repeat a number or overflow the allowed size. You just need to keep track of the next number to encrypt, incrementing as needed, and check that it never overflows.
The problem then reduces to the size (in bits) of the encryption and how to present it as text. You say: "10 numbers plus 26 uppercase letters (only uppercase)." That is similar to Base32 encoding, which uses the digits 2, 3, 4, 5, 6, 7 and 26 letters. Not exactly what you require, but perhaps close enough and available off the shelf. 16 characters at 5 bits per Base32 character is 80 bits. You could use an 80 bit block cipher and convert the output to Base32. Either roll your own simple Feistel cipher or use Hasty Pudding cipher, which can be set for any block size. Do not roll your own if there is a major security requirement here. Your own Feistel cipher will give you uniqueness and obfuscation, not security. Hasty Pudding gives security as well.
If you really do need all 10 digits and 26 letters, then you are looking at a number in base 36. Work out the required bit size for 36^16 and proceed as before. Convert the cyphertext bits to a number expressed in base 36.
If you write your own cipher then it appears that you do not need the decryption function, which will save a little work.
You want to map from a space consisting of 10^16 distinct values to one with 36^16 values.
The ratio of the sizes of these two spaces is ~795,866,110.
Use BigDecimal and multiply each input value by the ratio to distribute the input keys equally over the output space. Then base-36 encode the resulting value.
Here's a sample of 16-digit values consisting of 11 digits "timestamp" and 5 digits server ID encoded using the above scheme.
Decimal ID Base-36 Encoding
---------------- ----------------
4156333000101044 -> EYNSC8L1QJD7MJDK
4156333000201044 -> EYNSC8LTY4Y8Y7A0
4156333000301044 -> EYNSC8MM5QJA9V6G
4156333000401044 -> EYNSC8NEDC4BLJ2W
4156333000501044 -> EYNSC8O6KXPCX6ZC
4156333000601044 -> EYNSC8OYSJAE8UVS
4156333000701044 -> EYNSC8PR04VFKIS8
4156333000801044 -> EYNSC8QJ7QGGW6OO
The first 11 digits form the "timestamp" and I calculated the result for a series incremented by 1; the last five digits are an arbitrary "server ID", in this case 01044.
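A sketch of this scaling idea in Java follows. It uses exact BigInteger arithmetic (multiply by 36^16, then divide by 10^16) instead of a rounded BigDecimal ratio, so its output is not guaranteed to match the sample table digit for digit. The class name ScaledId is made up for the example.

```java
import java.math.BigInteger;
import java.util.Locale;

// Hypothetical sketch: spread 16-digit IDs evenly over the 16-symbol base-36 space.
final class ScaledId {
    private static final long LIMIT = 10_000_000_000_000_000L;          // 10^16
    static final BigInteger IN_SPACE = BigInteger.TEN.pow(16);           // 10^16
    static final BigInteger OUT_SPACE = BigInteger.valueOf(36).pow(16);  // 36^16

    static String encode(long id) {
        if (id < 0 || id >= LIMIT) {
            throw new IllegalArgumentException("not a 16-digit ID: " + id);
        }
        // floor(id * 36^16 / 10^16); the ratio is above 1, so distinct IDs stay distinct.
        BigInteger scaled = BigInteger.valueOf(id).multiply(OUT_SPACE).divide(IN_SPACE);
        String digits = scaled.toString(36).toUpperCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < 16; i++) {
            sb.append('0'); // left-pad to 16 symbols
        }
        return sb.append(digits).toString();
    }
}
```

The mapping preserves order, so it keeps uniqueness but does little to hide the sequence from someone who knows the ratio.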
I need to generate a reservation code of 6 alphanumeric characters that is random and unique, in Java.
I tried using UUID.randomUUID().toString(). However, the ID is too long, and the requirement demands that it be only 6 characters.
What approaches are possible to achieve this?
Just to clarify, (Since this question is getting marked as duplicate).
The other solutions I've found are simply generating random characters, which is not enough in this case. I need to reasonably ensure that a random code is not generated again.
Consider using the hashids library to generate salted hashes of integers (your database ids or other random integers which is probably better).
http://hashids.org/java/
Hashids hashids = new Hashids("this is my salt",6);
String id = hashids.encode(1, 2, 3);
long[] numbers = hashids.decode(id);
You have 36 characters in the alphanumeric character set (0-9 digits + a-z letters). With 6 places you get 36^6 = 2,176,782,336 different options, which is slightly larger than 2^31.
Therefore you can use Unix time to create a unique ID. However, you must ensure that no two IDs are generated within the same second.
If you cannot guarantee that, you end up with a (synchronized) counter within your class. Also, if you want to survive a JVM restart, you should save the current value (e.g. to a database, file, etc. whatever options you have).
Despite their name, UUIDs are not unique. It's simply extremely unlikely to get a 128-bit collision. With 6 characters (less than 32 bits) it's very likely that you get a collision if you just hash stuff or generate a random string.
If the uniqueness constraint is necessary, then you need to:
1. Generate a random 6-character string.
2. Check whether you generated that string before by querying your database.
3. If you generated it before, go back to step 1.
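The loop described in these steps can be sketched as follows. The in-memory Set stands in for the database lookup, and the class name ReservationCodes is invented for the example; a real system would more likely rely on a unique constraint in the database and retry when the insert fails.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: generate, check, retry.
final class ReservationCodes {
    private static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int LENGTH = 6;

    private final SecureRandom random = new SecureRandom();
    // Stand-in for the database table of codes that were already issued.
    private final Set<String> issued = new HashSet<>();

    synchronized String next() {
        while (true) {
            StringBuilder sb = new StringBuilder(LENGTH);
            for (int i = 0; i < LENGTH; i++) {
                sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
            }
            String code = sb.toString();
            // add() returns false if the code was issued before, so we loop again.
            if (issued.add(code)) {
                return code;
            }
        }
    }
}
```

Retries stay rare as long as the number of issued codes is small compared to the 36^6 possible values.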
Another way would be to use a pseudorandom permutation (PRP) of size 32 bits. Block ciphers are modeled as PRP functions, but there aren't many that support 32-bit block sizes. Two that do are Speck by the NSA and the Hasty Pudding Cipher.
With a PRP you could for example take an already unique value like your database primary key and encrypt it with the block cipher. If the input is not bigger than 32 bit then the output will still be unique.
Then you would run Base62 (or at least Base 41) over the output and remove the padding characters to get a 6 character output.
if you do a substring that value may not be unique
for more info please see following similar link
Generating 8-character only UUIDs
Let's say your corpus is the collection of alphanumeric characters: a-zA-Z0-9.
char[] corpus = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
We can use SecureRandom to generate a seed, which will ask the OS for entropy, depending on the OS. The trick here is to keep a uniform distribution: each byte has 256 possible values, but we only need 62, so I propose rejection sampling.
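import java.security.SecureRandom;

int generated = 0;
int desired = 6;
char[] result = new char[desired];
while (generated < desired) {
    // Ask for a fresh batch of random bytes on every pass.
    byte[] ran = SecureRandom.getSeed(desired);
    for (byte b : ran) {
        // Rejection sampling: keep only bytes that are valid corpus indexes.
        if (b >= 0 && b < corpus.length) {
            result[generated] = corpus[b];
            generated += 1;
            if (generated == desired) break;
        }
    }
}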
Improvements could include smarter wrapping of generated values.
When can we expect a repeat? Let's stick with the corpus of 62 and assume that the distribution is completely random. In that case we have the birthday problem, with N = 62^6 possibilities. We want to find n where the chance of a repeat is around 10%.
p(r)= 1 - N!/(N^n (N-n)!)
Using the approximation given on the Wikipedia page:
n = sqrt(-2N ln(0.9))
Which gives us about 109,000 numbers for a 10% chance. For a 0.1% chance it would take about 10,700 numbers.
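For anyone who wants to plug in their own alphabet size or target probability, the approximation is a one-liner in Java. The class and method names are made up for this sketch.

```java
// Hypothetical helper around the birthday-problem approximation.
final class Birthday {
    // Approximate number of random draws from a space of the given size after
    // which the probability of at least one repeat reaches p (0 < p < 1).
    static double drawsForCollisionProbability(double spaceSize, double p) {
        return Math.sqrt(-2.0 * spaceSize * Math.log(1.0 - p));
    }

    public static void main(String[] args) {
        double space = Math.pow(62, 6); // 6 characters from a 62-symbol corpus
        System.out.println(drawsForCollisionProbability(space, 0.10));
        System.out.println(drawsForCollisionProbability(space, 0.001));
    }
}
```

The approximation is only meant for small p relative to 1 and for n much smaller than the space size, which holds for both cases above.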
You can try taking a substring of your generated UUID.
String uuid = UUID.randomUUID().toString();
System.out.println("uuid = " + uuid.substring(0, 6));
I am looking for a function (hash function? not sure) that maps a set of (unique) numbers to another set of (unique) numbers. I had a look at perfect hashing -and the fact that it's impossible to do :( - and it seems quite close to what I am after.
In detail, I want the following:
Map any given 12 digit number to another number, 12 digit or more (I don't care about the resulting size but it must fit in a Java long).
Must be able to guarantee that the same 12 digit number maps to the same number every time.
Must be able to guarantee that a different 12 digit number always maps to a different number, i.e. there are no collisions in the result set.
Must be able to guarantee that the function is one way, which (in my mind anyway) means that you cannot compute the number you started from given the result of the function.
Something like minimal hashing function would not work in my case, as each number has to be computed on a different computer, meaning that the function itself has to guarantee these features without checking for collisions in the results and or having any centralized control of the result set.
An example of such a function would just be a simple:
take number, add 1, output number.
The only problem with that is that you can easily get first number just by subtracting one. I want it to be very difficult, or preferably impossible to get the previous number back.
Any thoughts ?
Translating the function from math to java I don't mind doing on my own. Unless you can suggest an already existing java library.
You are looking for format-preserving encryption algorithms.
It looks like a cryptography problem to me.
If you want to map a String to another String and must be able to guarantee that the mapping is one way and has no collisions, you need to encrypt the input String.
For example, you can use DES.
If the output must be digits you can interpret the bytes of the output as hexadecimal and then convert to base 10.
What you want is a cipher. A good old fashioned symmetric cipher. Like AES.
Imagine for a second that instead of 12 digits, your numbers were 128 bits long. Let's say you set up an AES cipher with a key of your choosing, and use it to encrypt numbers. What is the outcome?
You will map any given 128-bit number to another number, of exactly 128 bits (AES's block size is 128 bits)
You can guarantee that the same 128-bit number maps to the same number every time (encryption is deterministic)
You can guarantee that two different 128-bit numbers always map to different numbers (encryption is reversible - there are no two plaintexts which encrypt to the same ciphertext, otherwise how could you decrypt them?)
You can guarantee that, for someone who doesn't have the key, the function is one way (encryption is not reversible without the key - it would be less useful if it were)
Now, 128 bits is bigger than you want, so AES is not the right cipher for you. All you need to do is choose a cipher which has a 12-digit block size.
There aren't any conventional ciphers which have a 12-digit block size. But it's actually pretty easy to construct one. You can use the Feistel construction to take a hash function and construct a block cipher. You can either build a binary cipher of a suitable size (40 bits in your case) and then use the "hasty pudding trick" to restrict its domain to 12 digits, or construct a cipher that works directly in decimal (more or less).
I wrote an answer to another question a while ago that explained this in some more detail; I even wrote some code to implement the idea, although looking at it now, I don't know how comprehensible it is. The key classes in it are TinyCipher, which implements a cipher with a block size of up to 32 bits (and could easily be extended up to 64 bits), and TrickCipher, which uses the hasty pudding trick to implement a cipher over an arbitrary-sized set (such as all 12-digit numbers).
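This is not the TinyCipher/TrickCipher code mentioned above, just a minimal independent sketch of the same idea: a 40-bit Feistel network, re-applied until the result falls below 10^12 (the "hasty pudding trick", also called cycle walking). The class name, the number of rounds, and especially the round function are assumptions; the round function is a simple bit mixer chosen for brevity and gives obfuscation only, not real security.

```java
// Hypothetical sketch: a keyed permutation over the 12-digit numbers 0 .. 10^12 - 1.
final class TwelveDigitCipher {
    private static final long LIMIT = 1_000_000_000_000L;   // 10^12, just below 2^40
    private static final int HALF_BITS = 20;                 // 40-bit block, two 20-bit halves
    private static final long HALF_MASK = (1L << HALF_BITS) - 1;
    private static final int ROUNDS = 8;                     // arbitrary choice for the sketch

    private final long key;

    TwelveDigitCipher(long key) {
        this.key = key;
    }

    // Toy round function: mixes one half with the key and the round number.
    // NOT cryptographically strong; a keyed hash or MAC would be used in practice.
    private long round(long half, int i) {
        long x = half + key + i * 0x9E3779B97F4A7C15L;
        x ^= x >>> 31;
        x *= 0xBF58476D1CE4E5B9L;
        x ^= x >>> 29;
        return x & HALF_MASK;
    }

    // One pass of the Feistel network: a permutation of all 40-bit values.
    private long permute40(long block) {
        long left = block >>> HALF_BITS;
        long right = block & HALF_MASK;
        for (int i = 0; i < ROUNDS; i++) {
            long next = left ^ round(right, i);
            left = right;
            right = next;
        }
        return (left << HALF_BITS) | right;
    }

    // Cycle walking: re-encrypt until the result is a valid 12-digit number.
    long encrypt(long n) {
        if (n < 0 || n >= LIMIT) {
            throw new IllegalArgumentException("not a 12-digit number: " + n);
        }
        long c = permute40(n);
        while (c >= LIMIT) {
            c = permute40(c);
        }
        return c;
    }
}
```

Because each pass is a permutation of the 40-bit space, walking the cycle always ends back inside the 12-digit range, and two different inputs can never end on the same output. About 91% of 40-bit values are already below 10^12, so the loop rarely runs more than once or twice.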
I am looking for a way to create an int\long representation of an arbitrary alpha-numeric String. Hash codes won't do it, because I can't afford hash collisions i.e. the representation must be unique and repeatable.
The numeric representation will be used to perform efficient (hopefully) compares. The creation of the numeric key will take some time, but it only has to happen once, whereas I need to perform vast numbers of comparisons with it - which will hopefully be much faster than comparing the raw Strings.
Any other ideas on faster String comparison will be most appreciated too...
Unless your string is limited in length, you can't avoid collisions.
There are 4294967296 possible values for an integer (2^32). If you have a string of more than 4 ASCII characters, or more than two unicode characters, then there are more possible string values than possible integer values. You can't have a unique integer value for every possible 5 character string. Long values have more possible values, but they would only provide a unique value for every possible string of 8 ASCII characters.
Hash codes are useful as a two step process: first see if the hash code matches, then check the whole string. For most strings that don't match, you only need to do the first step, and it's really fast.
Can't you just start with a hash code, and if the hash codes match, do a character by character comparison?
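The two-step comparison suggested in these answers can be sketched as a small wrapper that computes the hash once and only falls back to the full comparison when the hashes agree. The class name HashedString and the method sameAs are made up for the example. Note that java.lang.String already caches its own hashCode after the first call, so the gain over plain equals may be modest.

```java
// Hypothetical wrapper: compare the cheap hash first, the full string only on a match.
final class HashedString {
    private final String value;
    private final int hash; // computed once, reused for every comparison

    HashedString(String value) {
        this.value = value;
        this.hash = value.hashCode();
    }

    boolean sameAs(HashedString other) {
        // Different hashes prove the strings differ; equal hashes still need a full check.
        return hash == other.hash && value.equals(other.value);
    }
}
```

The full check matters because collisions do occur: "Aa" and "BB" have the same hashCode in Java, yet sameAs correctly reports them as different.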
How long are the strings? If they are very short, then a unique ID can be generated by considering the characters as digits in base 36 (26 + 10) that form a n-digits number where n is the length of the string. On the other hand, if the strings are short enough to allow this, direct comparison won't be an issue anyway.
Otherwise you'll have to generate a collision-free hash and this can only be done when the complete problem space is known in advance (i.e. if you know all strings that can possibly occur). You will want to have a look at perfect hashing, although the only feasible algorithm to find a perfect hash function that I know is probabilistic so collisions are still theoretically possible.
There might be other ways to find such a function. Knuth called this a “rather amusing … puzzle” in TAoCP but he doesn't give an algorithm either.
In general, you give far too little information to find an algorithm that doesn't require probing the whole problem space in some manner. This invariably means that the problem has exponential running time, but it could be solved using machine-learning heuristics. I'm not sure if this is advisable in your case.
Perhaps:
String y = "oiu291981u39u192u3198u389u28u389u";
BigInteger bi = new BigInteger(y, 36);
System.out.println(bi);
At the end of the day, a single alphanumeric character has at least 36 possible values. If you include punctuation, lower case, etc then you can easily pass 72 possible values.
A non-colliding number that allows you to quickly compare strings would necessarily grow exponentially with the length of the string.
So you first must decide on the longest string you are expecting to compare. Assuming it's N characters in length, and assuming you ONLY need uppercase letters and the numerals 0-9 then you need to have an integer representation that can be as high as
36^N
For a string of length 25 (common name field) then you end up needing a binary number with 130 bits.
If you compose that into 32-bit numbers, you'll need 5 (four would only hold 128 bits). Then you can compare each number (five integer compares should take no time, compared to walking the string). I would recommend a big number library, but for this specialized case I'm pretty sure you can write your own and get better performance.
If you want to handle 72 possible values per character (uppercase, lowercase, numerals, punctuation...) and you need 10 characters, then you'll need 62 bits - two 32 bit integers (or one 64 bit if you're on a system that supports 64 bit computing)
If, however, you are not able to restrict the numbers in the string (ie, could be any of the 256 letters/numbers/characters/etc) and you can't define the size of the string, then comparing the strings directly is the only way to go, but there's a shortcut.
Cast the pointer of the string to a 32 bit unsigned integer array, and compare the string 4 bytes at a time (or 64 bits/8bytes at a time on a 64 bit processor). This means that a 100 character string only requires 25 compares maximum to find which is greater.
You may need to re-define the character set (and convert the strings) so that the characters with higher precedence are assigned values closer to 0, and lower precedence values closer to 255 (or vice versa, depending on how you are comparing them).
Good luck!
-Adam
As long as it's a hash function, be it String.hashCode(), MD5 or SHA1, collision is unavoidable unless you have a fixed limit on the string's length. It is mathematically impossible to have one-to-one mapping from an infinite group to a finite group.
Stepping back, is collision avoidance absolutely necessary?
A few questions to begin with:
Did you test that simple string comparison is too slow?
What does the comparison look like ('ABC' == 'abc' or 'ABC' != 'abc')?
How many strings do you have to compare?
How many comparisons do you have to do?
What do your strings look like (length, letter case)?
As far as I remember, String in Java is an object, and identical string literals (or strings passed through intern()) point to the same object.
So, maybe it would be enough to compare object references for interned strings (String.equals already checks for reference equality first).
If that doesn't help, you can try a Pascal-style string object, where the first element is the length; if your strings have various lengths, this should save some CPU time.
How long are your strings? Unless you choose an int representation that's longer than the string, collisions will always be possible no matter what conversion you're using. So if you're using a 32 bit integer, you can only uniquely represent strings of up to 4 bytes.
How big are your strings? Arbitrarily long strings cannot be compressed into 32/64 bit format.
If you don't want collisions, try something insane like SHA-512. I can't guarantee there won't be collisions, but I don't think they have found any yet.
Assuming "alphanumeric" means letters and numbers, you could treat each letter/number as a base-36 digit. Unfortunately, large strings will cause the number to grow rapidly and you'd have to resort to big integers, which are hardly efficient.
If your strings are usually different when you make the comparison (i.e. searching for a specific string) the hash might be your best option. Once you get a potential hit, you can do the string comparison to be sure. A well-designed hash will make collisions exceedingly rare.
It would seem that an MD5 hash would work fine. A hash collision would be extremely unlikely. Depending on the length of your string, a hash that generates an int/long would run into max value problems very quickly.
Why don't you do something like 1stChar + (B x 2ndChar) + (B^2 x 3rdChar) ..., where you use a simple integer value for each character, i.e. a = 1, b = 2, etc., with non-letters given their own distinct values, and B is larger than the largest character value? (With a multiplier of 10, "ka" and "ab" would both give 21.) This will give a unique value for each string, even for 2 strings that are just the same letters in a different order.
Of course it gets more complicated if you need to worry about Unicode rather than just ASCII, and the numbers could get large if you need to use long strings.
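A positional encoding of this kind is only collision-free if the multiplier is larger than every character value. Here is a minimal Java sketch under some assumptions of my own: the alphabet is the digits plus lowercase letters, comparison is case-insensitive, character values run from 1 to 36 so that strings of different lengths never clash, and the base is therefore 37. With those choices, strings of up to 12 characters fit in a long, because 37^12 is below 2^63. The class name ShortKey is invented for the example.

```java
// Hypothetical sketch: pack a short alphanumeric string into a unique long.
final class ShortKey {
    private static final String ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";
    private static final int BASE = ALPHABET.length() + 1; // 37: character values are 1..36
    static final int MAX_LENGTH = 12;                       // 37^12 < 2^63

    static long encode(String s) {
        if (s.length() > MAX_LENGTH) {
            throw new IllegalArgumentException("too long for a long: " + s);
        }
        long result = 0;
        for (int i = 0; i < s.length(); i++) {
            // Case is folded, so "ABC" and "abc" get the same key (an assumption).
            int index = ALPHABET.indexOf(Character.toLowerCase(s.charAt(i)));
            if (index < 0) {
                throw new IllegalArgumentException("not alphanumeric: " + s.charAt(i));
            }
            // No character maps to 0, so "0a" and "a" produce different keys.
            result = result * BASE + index + 1;
        }
        return result;
    }
}
```

Two keys are equal exactly when the (case-folded) strings are equal, so a single long comparison replaces the string comparison. The numeric order of the keys does not match alphabetical order for strings of different lengths, so this is meant for equality tests only.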
Are the standard Java string comparison functions definitely not efficient enough?
String length may vary, but let's say 10 characters for now.
In that case, in order to guarantee uniqueness you'd have to use some sort of big integer representation. I doubt that doing comparisons on big integers would be substantially faster than doing string comparisons in the first place. I'll second what others have said here: use some sort of hash, then in the event of a hash match check the original strings to weed out any collisions.
In any case, if your strings are around 10 characters, I doubt that comparing, say, a bunch of 32-bit hashes will be all that much faster than direct string comparisons. I think you have to ask yourself if it's really worth the additional complexity.