I was wondering if I'm experiencing a bug, or have just run into a limitation of the Hashids algorithm.
I'm using a custom alphabet consisting of all uppercase letters except "O" and "I", plus the digits 2-9 (32 characters in total).
After generating several million hashes, I noticed that duplicates started to appear. I'm confused by this, especially since Hashids claims that duplicates are not possible: the algorithm is essentially a base-converted encoding of an integer, so as long as the input integers remain unique (such as a counter incrementing forever), the hashes should too.
Does a custom alphabet make duplicates more likely? Also, I was expecting the number of unique 7-character hashes for my alphabet to be 32^7 = 34,359,738,368, but before my counter reached this number, the generated hashids grew from 7 characters long to 8.
Does anyone have any ideas as to why this is happening?
Edit: another rather strange anomaly: after generating 10,647 hashes, the rest (2.9 million plus) all start with either a K or an X. I'm beginning to think the custom alphabet plus the length of the salt affect how the letters get shuffled.
I have the same problem. Just try this:
var hashids = new Hashids("BSomeoneNameN161179IBRB46", 5, "ABCDEFGHIJKLMNPQRSTUVWXYZ1234567890");
var id = hashids.encode(1234567);
var numbers = hashids.decode(id);
Changing the salt by deleting the last five characters one by one shows the same result each time.
Keeping the salt to no more than 20 characters seems to solve the problem.
I solved this issue by adding the letters and digits I, O, 0 and 1 back into the alphabet. With the increased length of the alphabet, the rotation calculated by Hashids changed. I then simply filtered out any output that included an I, O, 0 or 1 using a regular expression.
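Here's a rough sketch of that workaround (the salt and alphabet are illustrative, and I'm using the org.hashids Java port; adapt to your language of choice):

import java.util.regex.Pattern;
import org.hashids.Hashids;

public class FilteredIds {
    // Full 36-character alphabet keeps the rotation behaviour intact.
    private static final Pattern AMBIGUOUS = Pattern.compile("[IO01]");
    private final Hashids hashids =
            new Hashids("example salt", 7, "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789");

    // Returns the hash for id, or null if it contains an ambiguous
    // character, in which case the caller should try the next id.
    public String encodeOrSkip(long id) {
        String hash = hashids.encode(id);
        return AMBIGUOUS.matcher(hash).find() ? null : hash;
    }
}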
I wrote a small program in Java that generates 5000 random UUIDs and finds the most recurring character in them overall. I always get the result that the most recurring character after "-" (always 20,000 occurrences, obviously) is "4"; I ran the program several times, always getting the same result.
I was just curious about this fact and was wondering if someone had a technical explanation or if it's really just a coincidence.
Thanks!
This is the function I used to generate the 5000 random UUIDs.
UUID.randomUUID().toString();
Because UUIDs aren't entirely random. Check the Universally unique identifier article on Wikipedia, which explains the various versions.
They look like:
xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
Where the M and the N are definitely not random (they indicate version and variant), and the rest may not be random either, depending on the mode you're using. The code you wrote gets you version 4, which means 'M' is always 4, and half of 'N' is fixed as well. You get 122 bits of randomness, not 128.
4 is the most common digit because the 13th 'digit' is always a 4, as per the UUID design.
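If you want to see it for yourself, here's a minimal sketch that tallies character frequencies over 5000 random UUIDs (nothing assumed beyond the standard library):

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class UuidCharCount {
    public static void main(String[] args) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < 5000; i++) {
            for (char c : UUID.randomUUID().toString().toCharArray()) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        // '-' always totals 20,000 (4 per UUID); among the hex digits,
        // '4' wins because position 13 is always '4' in a version 4 UUID.
        counts.entrySet().stream()
                .sorted(Map.Entry.<Character, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}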
I have a requirement to generate a random and unique key of exactly 10 characters for every record, based on some particular fields.
It should give me the same key if I supply the same set of information next time.
In summary I am looking for a way to convert a longer string to a 10 characters string, consistently.
Something like an MD5 hash, but it should output just 10 characters.
Thanks.
Note: as one of the comments asks what I have tried: basically I haven't found any proper solution, despite a lot of research.
The only solution I can think of is to store the MD5 hash and the 10-character key in a DB that I can look up next time. A custom 10-character key could be generated which won't have any relation to the MD5 hash as such, other than through that mapping DB.
What you're asking to do is not possible. The question title says, "generate random and unique key of exactly 10 characters based on longer input string".
By the Pigeonhole principle, you can't do that. Because there are more strings that are longer than 10 characters than there are strings that are exactly 10 characters long, any hashing function you come up with will generate duplicates. Multiple long strings will map to the same 10-character string.
You can't guarantee uniqueness.
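If a vanishingly small (but nonzero) collision probability is acceptable, the usual compromise is a truncated cryptographic hash. A minimal sketch, assuming SHA-256 and a base-36 rendering (this does NOT guarantee uniqueness, per the above):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ShortKey {
    // Deterministic 10-character key: same input, same key.
    static String key10(String input) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        // A 256-bit digest renders as roughly 49 base-36 digits,
        // so taking the first 10 is safe.
        return new BigInteger(1, digest).toString(36).substring(0, 10);
    }
}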
Convert the string to a char array, each char to an int, and multiply the last and first chars, the second and the second-to-last, and so on, until you have a 10-digit number.
In a project I'm working on, I need to generate 16 character long unique IDs, consisting of 10 numbers plus 26 uppercase letters (only uppercase). They must be guaranteed to be universally unique, with zero chance of a repeat ever.
The IDs are not stored forever. An ID is thrown out of the database after a period of time and a new unique ID must be generated. The IDs can never repeat with the thrown out ones either.
So randomly generating 16 characters and checking against a list of previously generated IDs is not an option, because there is no comprehensive list of previous IDs. Also, UUIDs will not work because the IDs must be exactly 16 characters long.
Right now I'm using 16-digit unique IDs that are guaranteed to be universally unique every time they're generated (I build them from a timestamp plus a unique server ID). However, I need the IDs to be difficult to predict, and using timestamps makes them easy to predict.
So what I need to do is map the 16 digit numeric IDs that I have into the larger range of 10 digits + 26 letters without losing uniqueness. I need some sort of hashing function that maps from a smaller range to a larger range, guaranteeing a one-to-one mapping so that the unique IDs are guaranteed to stay unique after being mapped.
I have searched and so far have not found any hashing or mapping functions that are guaranteed to be collision-free, but one must exist if I'm mapping to a larger space. Any suggestions are appreciated.
Brandon Staggs wrote a good article on Implementing a Partial Serial Number Verification System. The examples are written in Delphi, but could be converted to other languages.
EDIT: This is an updated answer, as I misread the constraints on the final ID.
Here is a possible solution.
Let us define:
UID16 = 16-digit unique ID
LUID = 16-symbol UID (using digits+letters)
SECRET = a secret string
HASH = some hash of SECRET+UID16
Now, you can compute:
LUID = BASE36(UID16) + SUBSTR(BASE36(HASH), 0, 5)
BASE36(UID16) will produce an 11-character string (because 16 / log10(36) ≈ 10.28, which rounds up to 11)
It is guaranteed to be unique because the original UID16 is fully included in the final ID. If you happen to get a hash collision for two different UID16 values, you'll still get two distinct LUIDs.
Yet, it is difficult to predict because the 5 other symbols are based on a non-predictable hash.
NB: you'll only get log2(36^5) ~= 26 bits of entropy on the hash part, which may or may not be enough depending on your security requirements. The less predictable the original UID16, the better.
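A minimal Java sketch of this scheme, assuming SHA-256 for HASH and padding the base-36 head to a fixed 11 symbols so the total is always 16 characters (the padding detail is my assumption, not spelled out above):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Locale;

public class Luid {
    private static final String SECRET = "my secret"; // illustrative

    static String luid(long uid16) throws Exception {
        // 11 base-36 symbols cover any 16-digit decimal number.
        String head = leftPad(Long.toString(uid16, 36), 11);
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest((SECRET + uid16).getBytes(StandardCharsets.UTF_8));
        String tail = new BigInteger(1, hash).toString(36).substring(0, 5);
        return (head + tail).toUpperCase(Locale.ROOT);
    }

    private static String leftPad(String s, int width) {
        StringBuilder sb = new StringBuilder();
        while (sb.length() + s.length() < width) sb.append('0');
        return sb.append(s).toString();
    }
}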
One general solution to your problem is encryption. Because encryption is reversible it is always a one-to-one mapping from plaintext to cyphertext. If you encrypt the numbers [0, 1, 2, 3, ...] then you are guaranteed that the resulting cyphertexts are also unique, as long as you keep the same key, do not repeat a number or overflow the allowed size. You just need to keep track of the next number to encrypt, incrementing as needed, and check that it never overflows.
The problem then reduces to the size (in bits) of the encryption and how to present it as text. You say: "10 numbers plus 26 uppercase letters (only uppercase)." That is similar to Base32 encoding, which uses the digits 2, 3, 4, 5, 6, 7 and 26 letters. Not exactly what you require, but perhaps close enough and available off the shelf. 16 characters at 5 bits per Base32 character is 80 bits. You could use an 80 bit block cipher and convert the output to Base32. Either roll your own simple Feistel cipher or use Hasty Pudding cipher, which can be set for any block size. Do not roll your own if there is a major security requirement here. Your own Feistel cipher will give you uniqueness and obfuscation, not security. Hasty Pudding gives security as well.
If you really do need all 10 digits and 26 letters, then you are looking at a number in base 36. Work out the required bit size for 36^16 and proceed as before. Convert the cyphertext bits to a number expressed in base 36.
If you write your own cipher then it appears that you do not need the decryption function, which will save a little work.
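To illustrate why a Feistel network gives a one-to-one mapping, here is a toy sketch scaled down to 32-bit blocks (two 16-bit halves). The round keys and round function are arbitrary choices of mine; this demonstrates uniqueness and obfuscation only, NOT security:

public class ToyFeistel {
    private static final int[] KEYS = {0x3A5F, 0x1C99, 0x7E21, 0x55D3};

    // A bijection on 32-bit ints: every round is invertible, so distinct
    // inputs always produce distinct outputs.
    static int encrypt(int block) {
        int left = (block >>> 16) & 0xFFFF;
        int right = block & 0xFFFF;
        for (int key : KEYS) {
            int next = left ^ roundFunction(right, key);
            left = right;
            right = next;
        }
        return (left << 16) | right;
    }

    // Any deterministic mixing function works; it need not be invertible.
    private static int roundFunction(int half, int key) {
        int x = (half * 0x9E37 + key) & 0xFFFF;
        return (x ^ (x >>> 7)) & 0xFFFF;
    }
}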
You want to map from a space consisting of 10^16 distinct values to one with 36^16 values.
The ratio of the sizes of these two spaces is ~795,866,110.
Use BigDecimal and multiply each input value by the ratio to distribute the input keys equally over the output space. Then base-36 encode the resulting value.
Here's a sample of 16-digit values consisting of 11 digits "timestamp" and 5 digits server ID encoded using the above scheme.
Decimal ID Base-36 Encoding
---------------- ----------------
4156333000101044 -> EYNSC8L1QJD7MJDK
4156333000201044 -> EYNSC8LTY4Y8Y7A0
4156333000301044 -> EYNSC8MM5QJA9V6G
4156333000401044 -> EYNSC8NEDC4BLJ2W
4156333000501044 -> EYNSC8O6KXPCX6ZC
4156333000601044 -> EYNSC8OYSJAE8UVS
4156333000701044 -> EYNSC8PR04VFKIS8
4156333000801044 -> EYNSC8QJ7QGGW6OO
The first 11 digits form the "timestamp" and I calculated the result for a series incremented by 1; the last five digits are an arbitrary "server ID", in this case 01044.
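Here's a sketch of the scheme in Java, using BigInteger and the integer part of the ratio for simplicity (the answer above used BigDecimal, so exact outputs may differ slightly in the low-order symbols):

import java.math.BigInteger;
import java.util.Locale;

public class SpreadEncode {
    // floor(36^16 / 10^16) = 795,866,110: spreads 16-digit decimal IDs
    // across the 16-symbol base-36 space.
    private static final BigInteger RATIO =
            BigInteger.valueOf(36).pow(16).divide(BigInteger.TEN.pow(16));

    static String encode(long decimalId) {
        String base36 = BigInteger.valueOf(decimalId)
                .multiply(RATIO)
                .toString(36)
                .toUpperCase(Locale.ROOT);
        // Left-pad with '0' to a fixed width of 16 symbols.
        StringBuilder sb = new StringBuilder();
        while (sb.length() + base36.length() < 16) sb.append('0');
        return sb.append(base36).toString();
    }
}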
I need to generate a reservation code of 6 alphanumeric characters that is random and unique, in Java.
I tried using UUID.randomUUID().toString(), however the ID is too long and the requirement demands that it be only 6 characters.
What approaches are possible to achieve this?
Just to clarify (since this question is getting marked as a duplicate):
The other solutions I've found are simply generating random characters, which is not enough in this case. I need to reasonably ensure that a random code is not generated again.
Consider using the hashids library to generate salted hashes of integers (your database ids or other random integers which is probably better).
http://hashids.org/java/
Hashids hashids = new Hashids("this is my salt",6);
String id = hashids.encode(1, 2, 3);
long[] numbers = hashids.decode(id);
You have 36 characters in the alphanumeric character set (digits 0-9 plus letters a-z). With 6 places you get 36^6 = 2,176,782,336 different options, which is slightly larger than 2^31.
Therefore you can use Unix time to create a unique ID. However, you must ensure that no two IDs are generated within the same second.
If you cannot guarantee that, you end up with a (synchronized) counter within your class. Also, if you want to survive a JVM restart, you should save the current value (e.g., to a database or file, whatever options you have).
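A minimal sketch of that idea: a counter seeded from Unix time, rendered as a zero-padded 6-symbol base-36 string. Persisting the counter across restarts is left out, as noted above, and the counter wraps after 36^6 values:

import java.util.concurrent.atomic.AtomicLong;

public class TimeBasedCode {
    // Seeded once from seconds-since-epoch; incremented per ID so two
    // calls in the same second still differ.
    private static final AtomicLong COUNTER =
            new AtomicLong(System.currentTimeMillis() / 1000);

    static String nextCode() {
        // 36^6 = 2,176,782,336 distinct 6-symbol codes.
        long value = COUNTER.getAndIncrement() % 2_176_782_336L;
        String base36 = Long.toString(value, 36).toUpperCase();
        StringBuilder sb = new StringBuilder();
        while (sb.length() + base36.length() < 6) sb.append('0');
        return sb.append(base36).toString();
    }
}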
Despite the name, UUIDs are not truly unique; it's simply extremely unlikely to get a collision among 128-bit values. With 6 characters (less than 32 bits), it's very likely that you get a collision if you just hash stuff or generate a random string.
If the uniqueness constraint is necessary, then you need to:
1. Generate a random 6-character string.
2. Check whether you generated that string before by querying your database.
3. If you generated it before, go back to 1.
Another way would be to use a pseudorandom permutation (PRP) with a 32-bit block size. Block ciphers are modeled as PRPs, but there aren't many that support 32-bit block sizes. Two that do are Speck by the NSA and the Hasty Pudding cipher.
With a PRP you could for example take an already unique value like your database primary key and encrypt it with the block cipher. If the input is not bigger than 32 bit then the output will still be unique.
Then you would run Base62 (or at least Base 41) over the output and remove the padding characters to get a 6 character output.
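For that last step, six symbols from a 41-character alphabet are enough, because 41^6 = 4,750,104,241 > 2^32. A sketch of the encoding (the alphabet here is an arbitrary choice of mine):

public class Base41Code {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcde"; // 41 symbols

    static String encode(int prpOutput) {
        long value = prpOutput & 0xFFFFFFFFL; // treat as unsigned 32-bit
        char[] out = new char[6];
        for (int i = 5; i >= 0; i--) {
            out[i] = ALPHABET.charAt((int) (value % 41));
            value /= 41;
        }
        return new String(out);
    }
}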
If you take a substring, that value may not be unique.
For more info, please see the following similar link:
Generating 8-character only UUIDs
Let's say your corpus is the collection of alphanumeric characters, a-zA-Z0-9.
char[] corpus = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
We can use SecureRandom, which will ask the OS for entropy. The trick here is to keep a uniform distribution: each byte has 256 possible values, but we only need around 62, so I will propose rejection sampling.
// Requires java.security.SecureRandom.
int generated = 0;
int desired = 6;
char[] result = new char[desired];
while (generated < desired) {
    // Ask the OS for a fresh batch of random bytes.
    byte[] ran = SecureRandom.getSeed(desired);
    for (byte b : ran) {
        // Rejection sampling: only accept bytes in [0, corpus.length)
        // so that every corpus character is equally likely.
        if (b >= 0 && b < corpus.length) {
            result[generated] = corpus[b];
            generated += 1;
            if (generated == desired) break;
        }
    }
}
Improvements could include smarter handling of the rejected bytes.
When can we expect a repeat? Let's stick with the corpus of 62 and assume that the distribution is completely random. In that case we have the birthday problem. That gives us N = 62^6 possibilities. We want to find n where the chance of a repeat is around 10%.
p(n) = 1 - N! / (N^n (N - n)!)
Using the approximation given on the Wikipedia page:
n ≈ sqrt(-2N ln(0.9))
That gives us about 109,000 numbers for a 10% chance of a repeat. For a 0.1% chance it would take about 10,000 numbers.
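You can reproduce those figures directly from the approximation:

public class BirthdayBound {
    public static void main(String[] args) {
        double space = Math.pow(62, 6); // ~5.68e10 possible 6-char codes
        // Codes drawn before a repeat occurs with probability p:
        for (double p : new double[] {0.1, 0.001}) {
            System.out.printf("p = %.3f -> about %.0f codes%n",
                    p, Math.sqrt(-2 * space * Math.log(1 - p)));
        }
    }
}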
You can try making a substring of your generated UUID:
String uuid = UUID.randomUUID().toString();
System.out.println("uuid = " + uuid.substring(0,5);
I implemented a custom HashMap class (in C++, but that shouldn't matter). The implementation is simple:
A large array holds pointers to Items.
Each item contains the key - value pair, and a pointer to an Item (to form a linked list in case of key collision).
I also implemented an iterator for it.
My implementation of incrementing/decrementing the iterator is not very efficient: from the present position, the iterator scans the bucket array for the next non-null entry. This is very inefficient when the map is sparsely populated (which it will be for my use case).
Can anyone suggest a faster implementation, without affecting the complexity of other operations like insert and find? My primary use case is find, secondary is insert. Iteration is not even needed, I just want to know this for the sake of learning.
PS: Why did I implement a custom class? Because I need to find strings with some error tolerance, while the ready-made hash maps I have seen provide only exact matching.
EDIT: To clarify, I am talking about incrementing/decrementing an already obtained iterator. Yes, this is mostly done in order to traverse the whole map.
The errors in the strings (keys) in my case come from OCR, so I cannot use the error-handling techniques used to detect typing errors: the chance of the first character being wrong is almost the same as that of the last one.
Also, my keys are always strings, one word to be exact. The number of entries will be less than 5000, so a hash table size of 2^16 is enough for me. It will still be sparsely populated, but that's OK.
My hash function:
The hash code size is 16 bits.
The first 5 bits encode the word length. ==> Max possible key length = 32. Reasonable, given that a key is a single word.
The last 11 bits hold the sum of the char codes. I only store English alphabet characters and do not need case sensitivity, so 26 codes are enough, 0 to 25. A key of 32 'z' characters sums to 25 * 32 = 800, which is well within 2^11. I even have scope to add case sensitivity, if needed in the future.
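A minimal sketch of this hash function as described (length in the top 5 bits, character sum in the low 11; in Java here, though my implementation is C++):

public class TolerantHash {
    // 16-bit hash: top 5 bits = length - 1, low 11 bits = sum of char
    // codes with a = 0 .. z = 25.
    static int hash(String key) {
        int sum = 0;
        for (char c : key.toLowerCase().toCharArray()) {
            sum += c - 'a';
        }
        return ((key.length() - 1) << 11) | (sum & 0x7FF);
    }
}

For 'hello' this gives (4 << 11) | 47, i.e. the (00100)(00000101111) pattern worked out below.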
Now, when you compare a key containing an error with the correct one,
say "helo" with "hello":
1. The length of the keys is approximately the same.
2. The sum of their chars will differ by the sum of the dropped/added/distorted chars.
In the hash code, since the first 5 bits are for length, the whole table has fixed sections for every possible key length. All sections are the same size: the first section stores keys of length 1, the second keys of length 2, and so on.
Now 'hello' is stored in the 5th section, as its length is 5. When we try to find 'helo':
Hashcode of 'hello' = (length - 1) (sum of chars) = (4) (7 + 4 + 11 + 11 + 14) = (4) (47)
= (00100)(00000101111)
Similarly, the hashcode of 'helo' = (3)(36)
= (00011)(00000100100)
We jump to its bucket, and don't find it there.
So we check for ONE distorted character. This does not change the length, but changes the sum of characters by at most -25 to +25. So we search from 25 places backwards to 25 places forwards, i.e., we check the sum part from (36 - 25) to (36 + 25) in the same section. We won't find it.
Next we check for an additional-character error. That would mean the correct string contains only 3 characters, so we go to the third section. The sum of chars would have increased by at most 25 because of the additional char, which has to be compensated for, so we search the appropriate 25 places in the third section, from (36 - 0) to (36 - 25). Again we don't find it.
Now we consider the case of a missing character. The original string would then contain 5 chars, and the sum-of-chars part of its hashcode would be larger by 0 to 25. So we search the corresponding 25 buckets in the 5th section, (36 + 0) to (36 + 25). As 47 (the sum part of 'hello') lies in this range, we find a match on the hashcode. We also know that this match is due to a missing character, so we compare the keys allowing a tolerance of 1 missing character. And we get a match!
In reality, this has been implemented to allow more than one error in key.
It can also be optimized to use only 25 places for the first section (since it has only one character) and so on.
Also, checking 25 places seems like overkill, as we already know the largest and smallest chars of the key. But it gets complex in the case of multiple errors.
You mention an 'error tolerance' for the strings. Why not build the tolerance into the hash function itself, and thus obviate the need for iteration?
You could go the way of Java's LinkedHashMap class. It adds efficient iteration to a hash map by also making it a doubly-linked list.
The entries are key-value pairs that have pointers to the previous and next entries. The hashmap itself has the large array as well as the head of the linked list.
Insertion/deletion are constant time for both data structures, searches are done via the hashmap, and iteration via the linked list.
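A minimal sketch of the entry layout (in Java, since LinkedHashMap is the model; the C++ translation is mechanical):

class LinkedEntry<K, V> {
    K key;
    V value;
    LinkedEntry<K, V> nextInBucket;   // collision chain within a bucket
    LinkedEntry<K, V> before, after;  // doubly-linked iteration order
}

Iteration then just follows the 'after' pointers from the list head: O(1) per step, no matter how sparse the bucket array is.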