I need to generate a unique hash from an ID value of type Long. My concern is that it must never generate the same hash for two different Long/long values.
MD5 hashing looks like a nice solution but the hash string is very long. I only need the characters
0-9
a-z and A-Z
And just 6-characters like: j4qwO7
What could be the simplest solution?
Your requirements cannot be met. You've got an alphabet of 62 possible characters, and 6 characters available - which means there are 62^6 possible IDs of that form.
However, there are 256^8 = 2^64 possible long values. By the pigeonhole principle, it's impossible to give each of those long values a different ID of the given form.
You don't have to use the hex representation. Build your own hash representation by using the actual hash bytes from the function. You could truncate the hash output to simplify the hash representation, but that would make collisions more probable.
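For illustration, a sketch of that truncate-and-re-encode idea (all class and method names here are mine, not from any library; remember the collision caveat from above - truncating to 4 bytes makes collisions likely after roughly 2^16 ids by the birthday bound):

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class ShortHash {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Hashes the id, keeps only the first 4 digest bytes, and renders them
    // in base 62. Since 2^32 < 62^6, the result is at most 6 characters.
    // Truncation means this is NOT globally unique.
    public static String hash6(long id) {
        final byte[] digest;
        try {
            digest = MessageDigest.getInstance("MD5")
                    .digest(Long.toString(id).getBytes());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
        BigInteger n = new BigInteger(1, Arrays.copyOf(digest, 4));
        BigInteger base = BigInteger.valueOf(62);
        StringBuilder sb = new StringBuilder();
        do {
            BigInteger[] qr = n.divideAndRemainder(base);
            sb.append(ALPHABET.charAt(qr[1].intValue()));
            n = qr[0];
        } while (n.signum() > 0);
        return sb.reverse().toString();
    }
}
```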
Edit:
The other answers stating that what you ask isn't possible, based on the number of possible long values, are theoretically true, if you actually need the whole range.
If your IDs are auto-incremented from zero and up, then 62^6 = 56800235584 values might be more than enough for you, depending on your needs.
Step 1. Switch to using ints instead of longs, or allow for a longer "hash". See every other answer for discussion of why 6 characters is insufficient for dealing with longs.
Step 2. Encrypt your number using an algorithm that does not use padding. Personally, I suggest Skip32 encryption. I make no promises that this is strong enough for security, but if your goal is "make random-looking IDs," it works well.
Step 3. Encode your number as a base-62 number (as opposed to base-10; not to be confused with Base64 encoding).
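Step 3 can be sketched as follows (the Base62 class is my own, not a library; any non-negative int fits in 6 characters because 62^6 > 2^32):

```java
public class Base62 {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Repeated divide-by-62, emitting the least significant digit first,
    // then reversing - the standard positional-encoding loop.
    public static String encode(long value) {
        if (value == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET.charAt((int) (value % 62)));
            value /= 62;
        }
        return sb.reverse().toString();
    }

    public static long decode(String s) {
        long value = 0;
        for (char c : s.toCharArray()) {
            value = value * 62 + ALPHABET.indexOf(c);
        }
        return value;
    }
}
```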
Your question doesn't make sense.
'Unique hash' is a contradiction in terms.
A 'unique hash' value of a Java long must be 64 bits in length, like the long itself, and of course the simplest hash function for that is f(x) = x, i.e. the long value itself.
6 characters that can be 0-9, A-Z, and a-z can only yield 62^6 = 56800235584 distinct values, which isn't enough.
You can use the long value itself as the hash (for indexing/search purposes).
If you need to obfuscate/hide your long value, you can use any symmetric
encryption algorithm with a 64-bit block, for example DES in ECB mode. (Note that AES's block size is 128 bits, so it would produce a 128-bit result.)
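A minimal sketch of the DES-in-ECB idea (the key bytes here are a placeholder - use your own secret; also note that ECB with a constant key gives deterministic obfuscation, not strong security):

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class LongObfuscator {
    // DES keys are exactly 8 bytes; this one is a placeholder.
    private static final byte[] KEY = "8bytekey".getBytes();

    public static long obfuscate(long id) {
        return crypt(Cipher.ENCRYPT_MODE, id);
    }

    public static long deobfuscate(long value) {
        return crypt(Cipher.DECRYPT_MODE, value);
    }

    // DES has a 64-bit block, so one long maps to exactly one block
    // and no padding is needed.
    private static long crypt(int mode, long value) {
        try {
            Cipher cipher = Cipher.getInstance("DES/ECB/NoPadding");
            cipher.init(mode, new SecretKeySpec(KEY, "DES"));
            byte[] out = cipher.doFinal(
                    ByteBuffer.allocate(8).putLong(value).array());
            return ByteBuffer.wrap(out).getLong();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```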
Update:
No need to use Hashids. Base 36 is quite enough.
long id = 12345;
String hash = Integer.toString(Math.abs((int)id), 36);
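Note that the (int) cast plus Math.abs discards the upper bits, so distinct longs can collide. A variant that keeps the whole long and stays reversible, at the cost of up to 13 characters (the Base36 class name is mine):

```java
public class Base36 {
    // Long.toString with radix 36 keeps the whole 64-bit value, so the
    // mapping is reversible.
    public static String encode(long id) {
        return Long.toString(id, 36);
    }

    public static long decode(String s) {
        return Long.parseLong(s, 36);
    }
}
```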
Original answer, with Hashids:
You might want to use Hashids
long id = 12345;
Hashids hashids = new Hashids("this is my salt");
String hash = hashids.encrypt(id); // "ryBo"
"ryBo" is going to be unique, as it can be converted back to your long. Hashids just converts, doesn't hash further.
long[] numbers = hashids.decrypt("ryBo");
// numbers[0] == 12345
If you really have a 64-bit value, the hash string is going to be quite long (around 11 characters with the default alphabet), but if you don't plan to have more than 2^31 thingies, you can get away with truncating the 64-bit value to 32 bits (an int).
long id = 12345;
String hash = hashids.encrypt(Math.abs((int)id));
Related
I have this code created using Google Guava:
String sha256hex = Hashing.sha256()
.hashString(cardNum, StandardCharsets.UTF_8)
.toString();
How can I verify that the generated value is a properly generated hash?
SHA-256 and, in general, the family of SHA-2 algorithms are wonderfully described in Wikipedia and different RFCs, RFC 6234 and the superseded RFC 4634.
All these sources dictate that the output of the SHA-256 hash function is 256 bits, i.e. 32 bytes, long (roughly speaking, the number that accompanies the SHA prefix is that value for every algorithm in the family).
This sequence of bytes is typically encoded in hex, which is the representation provided by Guava as well.
Then, the problem can be reduced to identify if a string in Java is a valid hex encoding.
That problem has been already answered here, in SO, for example in this question.
For simplicity, consider the solution proposed by @laycat:
boolean isHex = mac_addr.matches("^[0-9a-fA-F]+$");
As every byte is encoded with two hex characters and, as mentioned, the SHA-256 algorithm produces an output of 32 bytes, you can safely check for a string of 64 characters, as suggested in the answer of @D.O. as well. Your validation code could be similar to this:
boolean canBeSha256Output = sha256Hex.matches("^[0-9a-fA-F]{64}$");
Please be aware that there is no way to tell whether a hex string of a certain length is, on its own, the result of a hash function, whichever hash function you consider.
You only can be sure that a hash output is a hash output if and only if it matches the result of applying the corresponding hash function over the original input.
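A minimal JDK-only sketch of that recompute-and-compare check (class and method names are mine, not Guava's):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Check {
    // Computes the SHA-256 digest of the input and hex-encodes it,
    // mirroring what the Guava one-liner in the question produces.
    public static String sha256Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    // The only reliable verification: re-hash the original input and compare.
    public static boolean matches(String input, String expectedHex) {
        return sha256Hex(input).equalsIgnoreCase(expectedHex);
    }
}
```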
You could use a regex to verify that it looks like a SHA-256 hash (64 hexadecimal characters), like
\b[A-Fa-f0-9]{64}\b
I would like to convert any unicode string to a unique number (in Clojure or Java). I want the generated number to have the following properties:
- It is unique for that string
- When a set of such numbers are sorted and mapped back to the original strings, the strings will appear in sorted order. The strings are not all known in advance.
One way this could be done is:
(defn strval [^String s]
(bigdec (reduce #(str %1 (format "%05d" (int %2))) "0." s)))
We can validate the sort order is correct with:
(assert (< (strval "a") (strval "b")))
(assert (< (strval "a") (strval "aa")))
(assert (< (strval "aa") (strval "ab")))
(Ignore, if you like, that “int” is not necessarily the best way to get the sort order of an individual character.)
For those not familiar with Clojure, this algorithm:
Converts the string into a sequence of characters
Gets the integer value of one character
Converts this integer to a string and pads it with zeros so that it makes a string of five characters.
Appends this string to a result string that starts with “0.”
If there are more characters, goes back to step 2; otherwise
Converts the result string to a Java BigDecimal
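For readers who don't know Clojure, a rough Java transliteration of those steps could look like this (class and method names are mine):

```java
import java.math.BigDecimal;

public class StrVal {
    // Builds "0." followed by five zero-padded decimal digits per
    // character, then parses the whole string as a BigDecimal - a direct
    // transliteration of the Clojure strval above.
    public static BigDecimal strval(String s) {
        StringBuilder sb = new StringBuilder("0.");
        for (char c : s.toCharArray()) {
            sb.append(String.format("%05d", (int) c));
        }
        return new BigDecimal(sb.toString());
    }
}
```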
However, the process of creating a BigDecimal in this way is sub-optimal:
It relies on converting between numbers and strings and then back to the final number.
Padding each value with zeros does not produce the most compact representation.
What alternatives are there to the function that will speed it up and make the generated number smaller if possible, while retaining the uniqueness and sorting properties described above?
Note: The solution does not have to produce a BigDecimal, it just has to produce a number, but I don't know how you could make this work with a BigInteger. Also, I realise the function can be memoized to speed up subsequent executions but I’m after a performance increase in the initial execution.
Not possible in general, but possible if your entire universe of strings is known in advance. What you are asking for is a hash function that preserves lexicographic sort order. In order to do that, the hash function has to produce a unique value for every possible string -- i.e. a hash function with no collisions over all possible inputs. The length of the hash value in this case has a lower bound equal to the number of bits of information in the input.
To see why this is impossible in general, consider a collection of random strings of length, say, 1000 consisting of only [A-Za-z0-9]. There are 62 possible values for each letter, call it 6 bits of data (rounded up slightly). Thus the number of possible distinct values is approximately 62^1000, or about 10^1792. How do you plan to encode those values in your hash function? Preserving order such that you could correctly sort "[999 random characters]A" and "[same 999 random characters]B" would require a hash code at least 6000 bits long.
If you know in advance all the possible strings you can sort the list and assign hash values in increasing order, but that probably is not what you want.
Also, if the maximum length of the strings is bounded (i.e. all strings are less than some reasonable value) you might be able to come up with an encoding that works. You'd need to figure out the total number of bits required to encode all possible values, which would be
ceil(log2(A^L))
where L is the maximum length of string, and A is the size of the alphabet, i.e. the number of distinct characters that can occur in each position of a maximal-length string. So, for example, for a max length of 10 and an alphabet consisting of [A-Z], the number of bits required would be the base-2 logarithm of 26^10 which, rounded up, is 48.
Designing an order-preserving hash that fits in the optimal 48 bits would probably be pretty difficult. A slightly less optimal approach is to calculate the number of bits required for each symbol, which is
ceil(log2(A))
which in your case is 5 bits. Encode each 8-bit byte down to 5 bits, pack those bits into a binary string and write it out as a byte stream.
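A sketch of that pack-each-symbol-into-5-bits idea, assuming the alphabet [A-Z] and equal-length strings (with unequal lengths, a shorter string that is a prefix of a longer one would not sort correctly under this simple packing; the class name is mine):

```java
import java.math.BigInteger;

public class FiveBitPacker {
    // Packs each character's index in [A-Z] into 5 bits, most significant
    // character first, so the resulting numbers sort in the same order as
    // the strings - provided all strings have the same length.
    public static BigInteger pack(String s) {
        BigInteger result = BigInteger.ZERO;
        for (char c : s.toCharArray()) {
            result = result.shiftLeft(5).or(BigInteger.valueOf(c - 'A'));
        }
        return result;
    }
}
```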
This application is covered in the JDK by the class java.text.CollationKey. CollationKey is a representation of some string’s (Unicode) collation order.
So, if you’re on the Java platform, you can easily obtain collation keys and compare them directly – that’s what they’re made for:
(def root-collator (java.text.Collator/getInstance java.util.Locale/ROOT))
(defn collation-key [s]
(.getCollationKey root-collator s))
(compare (collation-key "a") (collation-key "b")) ; => -1
CollationKey has a toByteArray method that returns an array of bytes representing the key. Since these byte arrays are directly comparable with one another, you can pour their contents into big integers if you must:
(defn bigint-key [s]
(-> s collation-key .toByteArray bigint))
;; these all pass:
(assert (< (bigint-key "a") (bigint-key "b")))
(assert (< (bigint-key "a") (bigint-key "aa")))
(assert (< (bigint-key "aa") (bigint-key "ab")))
(I’m not 100% sure bigint-key is correct. A collation key byte array is unsigned, but a java.math.BigInteger byte array is a two’s complement representation; some legwork to deal with signedness may be necessary.)
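One way to deal with the signedness concern is the BigInteger(int signum, byte[] magnitude) constructor, which treats the byte array as an unsigned big-endian magnitude. In Java interop terms (class and method names are mine):

```java
import java.math.BigInteger;
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationKeyNumber {
    private static final Collator COLLATOR =
            Collator.getInstance(Locale.ROOT);

    // The (signum, magnitude) constructor interprets the key bytes as a
    // non-negative big-endian magnitude, sidestepping two's-complement
    // sign issues entirely.
    public static BigInteger keyNumber(String s) {
        CollationKey key = COLLATOR.getCollationKey(s);
        return new BigInteger(1, key.toByteArray());
    }
}
```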
You stress that you have some constraints on space/performance, so I’m not sure this solution is at all helpful. Still, it’s good to be aware that a thing such as CollationKey exists in the JDK and can be applied to this problem with a minimal amount of code.
I don’t know how Clojure or Java handle Interfacing to C, but that sounds like the C standard library’s strxfrm function. That said, strxfrm’s results only work if both strings are transformed with the same LC_COLLATE setting. In other words, it wouldn’t make sense to compare a German word to a French word, since these languages have different rules about how to sort words.
If you can use collation with byte string comparison (which covers all the times I’ve needed something like this), then strxfrm is all you need. But if you really need numeric comparison, then you have to do more.
If you do need numeric comparison, then you have to involve arbitrary precision integers (like Java’s BigInteger; you don’t need BigDecimal). After all, you can’t compare two seven-character strings as 64-bit integers (by the pigeonhole principle).
In that case, your best bet is to interpret the resulting byte string as an arbitrary-precision big-endian number. In other words, if the length of the byte string is seven bytes, you’ll need to build a resulting number as (byte_string[0]<<48) + (byte_string[1]<<40) + … + (byte_string[6]<<0) (each byte is left-shifted by (length − 1 − its position) × 8 bits).
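In Java, that shift-and-add construction can be sketched as follows (it computes the same value as new BigInteger(1, bytes); the class name is mine):

```java
import java.math.BigInteger;

public class BigEndianNumber {
    // Builds the number by the shift-and-add rule described above:
    // each byte ends up shifted left by (length - 1 - index) * 8 bits.
    public static BigInteger toNumber(byte[] bytes) {
        BigInteger result = BigInteger.ZERO;
        for (byte b : bytes) {
            result = result.shiftLeft(8).or(BigInteger.valueOf(b & 0xFF));
        }
        return result;
    }
}
```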
I haven’t actually come across a situation where it’s useful to transform a string into an arbitrary-precision number in a manner that preserves collation, like you’re trying to do here. Usually, I find that I need to transform a Unicode string into a bytestring that preserves collation order under a memcmp-like comparison. However, certainly there may be some database layers that require what you’re asking for (presumably using something like Elias gamma coding under the hood for arbitrary-precision numbers). If that’s what you’ve got, then using strxfrm followed by an arbitrary-precision big-endian interpretation (as I’ve described here) may be what you need.
I have written a method to convert plain text into its hash using the MD5 algorithm. Please find the code I used below.
public static String convertToMD5Hash(final String plainText) {
    final MessageDigest messageDigest;
    try {
        messageDigest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
        // Every Java platform implementation is required to ship MD5,
        // so continuing with a null digest (and a later NPE) makes no sense.
        throw new IllegalStateException("MD5 algorithm not found", e);
    }
    messageDigest.update(plainText.getBytes());
    final byte[] digest = messageDigest.digest();
    final BigInteger bigInt = new BigInteger(1, digest);
    // Radix 8 produces octal; radix 16 would give the usual hex form.
    return bigInt.toString(8);
}
This method works perfectly but it returns a lengthy hash. I need to limit the hash text to 8 characters. Is there any possibility to set the length of hashes in Java?
Yes and no. You can use a substring of the original hash if you always cut it the same way (i.e. the first or last 8 characters). What you are going to do with that "semi-hash" is another matter.
Whatever it is you're going to do, be sure it has nothing to do with security.
Here's why: MD5 is a 128-bit hash, so there are 2^128 = ~340,000,000,000,000,000,000,000,000,000,000,000,000 possible values. That quite astronomical number of values is what makes brute-forcing such a string virtually impossible. By cutting down to 8 hex characters, you'll end up with a 32-bit hash. This is because a single hex value takes 4 bits to represent (thus, also, 128 bits / 4 bits = 32 hex values). With a 32-bit hash there are only 2^32 = 4,294,967,296 combinations. That's about 79,228,162,514,264,337,593,543,950,336 times less secure than the original 128-bit hash and can be broken in a matter of seconds by any old computer with the processing power of an 80's calculator.
No. MD5 is defined to return 128 bit values. You could use Base64 to encode them to ASCII and truncate it using String#substring(0, 8).
In Java 8 (not officially released yet), you can encode a byte[] to Base64 as follows:
String base64 = Base64.getEncoder().encodeToString(digest);
For earlier Java versions see Decode Base64 data in Java
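Putting the two steps together, a sketch (the class name is mine):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class TruncatedMd5 {
    // Base64 encodes 6 bits per character, so keeping 8 characters
    // preserves 48 of the 128 digest bits - more than the 32 bits kept
    // by 8 hex characters, though still useless for security.
    public static String shortHash(String plainText) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(plainText.getBytes());
            return Base64.getEncoder().encodeToString(digest).substring(0, 8);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }
}
```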
All hash algorithms should randomly change bits across the whole hash whenever any part of the data changes, so you can just choose 8 characters from your hash. Just don't pick them at random - the choice must be reproducible.
Firstly, as everyone has mentioned, a 64-bit hash is not secure enough. Ultimately it depends on what exactly you plan to do with the hash.
If you still need to convert this to 8 characters, I suggest downcasting the BigInteger to a long value using BigInteger.longValue().
It will ensure that the long value it produces is consistent with the hash that was produced.
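A sketch of that truncation (class and method names are mine); longValue() keeps the low-order 64 bits of the digest:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LowBitsHash {
    // BigInteger.longValue() discards everything above the low-order
    // 64 bits, so equal 128-bit digests always yield equal 64-bit values.
    public static long low64(String plainText) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(plainText.getBytes());
            return new BigInteger(1, digest).longValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }
}
```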
I am not sure that taking the most significant 64 bits of the 128-bit hash is a good idea. I would rather take the least significant 64 bits. This ensures that
when hash(128, a) = hash(128, b) then hash(64, a) = hash(64, b) will always be true.
But we have to live with collision in case of 64 bits i.e. when hash(64, a) = hash(64, b) then hash(128, a) = hash(128, b) is not always true.
In a nutshell, we ensure that we do not have a case where 128 bit hashes of 2 texts are different, but their 64 bit hashes are same. It depends on what you really use the hash for, but I personally feel this approach is more correct.
In the guidelines to write a good hashCode() written in Effective java, the author mentions the following step if the field is long.
If the field is a long, compute (int) (f ^ (f >>> 32)).
I am not able to get why this is done. Why are we doing this ?
In Java, a long is 64-bit, and an int is 32-bit.
So this is simply taking the upper 32 bits, and bitwise-XORing them with the lower 32 bits.
Because hashCode is a 32-bit integer value and long is 64-bit. You need hashCode to differ for long values that share the same lower 32 bits, and this function ensures that.
Just to be clear, you're hashing a 64-bit value into a 32-bit one. Also, a good hash function will produce an even distribution of values (for hopefully obvious reasons!).
You could ignore half the bits, but then any two values differing only in the ignored half would produce the same hashcode. So, you want to take all the bits into account somehow when producing the hashcode.
Options for mashing the bits together are: AND, OR, XOR. If you think about it, AND and OR aren't going to produce an even distribution of values at all. XOR does, so it's the only good choice.
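This is exactly the formula the JDK itself uses: Long.hashCode(long) is specified as (int)(value ^ (value >>> 32)). A trivial demonstration (the class name is mine):

```java
public class LongHashDemo {
    // Reproduces the Effective Java recipe; it matches Long.hashCode.
    public static int hash(long f) {
        return (int) (f ^ (f >>> 32));
    }
}
```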
hashCode returns an int not long. A good hashCode algorithm tries to have different values for different inputs.
I am developing a piece of code to generate a unique hexadecimal value from an input string. The output size must be less than 11 bytes, which comes as a requirement. Can someone please give me an insight into this? I have done the string-to-binary conversion and then the hexadecimal mapping, which produces a combination of alphanumeric characters, but the size is always greater than 11 bytes. I also need to regenerate the input from this unique id. Is that possible?
Thanks in advance
If your result must be absolutely unique and your input can be any length, then your task is impossible.
Think of it that way: how many different combinations of 11 bytes are there? 256^11 (or 2^(11×8) = 2^88).
That's a big number, right? Yes, but it's not big enough.
For simplicity's sake we'll talk about ASCII strings only, so we have 128 different values (in reality there are many more possibilities for a character in a Java String, but the principle stays the same; for simplicity's sake we also ignore that a \0 character in a String is rather unlikely).
Now, there are 128^13 different 13-character ASCII strings. That's 2^(7×13), or 2^91, different combinations. Obviously you can't have a unique id out of 2^88 possible ids for 2^91 different strings.
Less than 11 bytes means maximum 10 bytes.
10 bytes can hold 256^10 = 2^80 different values.
2^80 is a huge number.
So if you take your hex value modulo that number, the result will fit into the 10 bytes. Convert the remainder back to hex.
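That modulo reduction can be sketched as (the class name is mine):

```java
import java.math.BigInteger;

public class TenByteId {
    private static final BigInteger MOD = BigInteger.ONE.shiftLeft(80); // 2^80

    // Reduces an arbitrarily long hex hash modulo 2^80, so the result
    // always fits in 10 bytes (20 hex characters). Not reversible.
    public static String truncate(String hexHash) {
        return new BigInteger(hexHash, 16).mod(MOD).toString(16);
    }
}
```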
Regenerating the input will not be possible if your input is allowed to be longer than 11 bytes: that would amount to lossless compression of arbitrarily long data into a fixed size, which is impossible.