A two way String hash function

A two way String hash function - java

I want to get a unique numeric representation of a String. I know there are lots of ways of doing this, my question is which do you think is the best? I don't want to have negative numbers - so the hashcode() function in java is not so good, although I could override it ... but I'd rather not since I am not so confident and don't want to accidentally break something.
My Strings are all semantic-web URIS. The reason for the numeric representation is that when I display the data for a URI on a page I need something to pass into the query String or put into various fields in my javascript. The URI itself is too unwieldy and looks bad when you have a URI as a value in a URI.
Basically I want to have a class called Resource which will look like this
Resource{
int id;
String uri;
String value; // this is the label or human readable name
// .... other code/getters/setters here
public int getId(){
return id = stringToIntFunction();
}
private int stringToIntFunction(String uri){
// do magic here
}
}
Can you suggestion a function that would do this if:
It had to be two way, that is you could also recover the original string from the numeric value
It doesn't have to be two way
Also are there other issues that are important that I am not considering?

If you want it to be reversible, you're in trouble. Hashes are designed to be one-way.
In particular, given that an int has 32 bits of information, and a char has 16 bits of information, requiring reversibility means you can only have strings of zero, one or two characters (and even that's assuming that you're happy to encode "" as "\0\0" or something similar). That's assuming you don't have any storage, of course. If you can use storage, then just store numbers sequentially... something like:
private int stringToIntFunction(String uri) {
Integer existingId = storage.get(uri);
if (existingId != null) {
return existingId.intValue();
}
return storage.put(uri);
}
Here storage.put() would increase a counter internally, store the URI as being associated with that counter value, and return it. My guess is that that's not what you're after though.
Basically, to perform a reversible encryption, I'd use a standard encryption library having converted the string to a binary format first (e.g. using UTF-8). I would expect the result to be a byte[].
If it doesn't have to be reversible, I'd consider just taking the absolute value of the normal hashCode() result (but mapping Integer.MIN_VALUE to something specific, as its absolute value can't be represented as an int).

Hashes are one way only (that's part of the reason they have a fixed length regardless of the input size). If you need two-way, you're looking at something like Base64 encoding.
Why can't you have negative numbers? Where do the URIs come from? Are they in a database? Why not use the Database Key ID? If they are not in a database, can you generate them for the user given a set of variables/parameters? (So the query string only contains things like foo=1&bar=two and you generate the URL on the Server or JavaScript side)

Given all the remars done above (hash function is one way), I would go for 2 possible solutions:
Use some encrypting function to get a long string representing your URL (you'll get something like -> param=456ab894ce897b98f (this could be longer and/or shorter depending on the URL). See DES encryption for instance or base64url.
Keep track of the URLs in a database (could be also a simple file-based database such as SQLite). Then you'll effectively have an uint <=> URL equivalence.

"Unique representation" implies that the Java supplied string.hashcode would be useless - you'd soon come across two URIs that shared the same hashcode.
Any two-way scheme is going to result in an unwieldy string - unless you store the URIs in a database and use the record ID as your unique identifier.
As far as one-way goes - an MD5 hash would be considerably more unique (but by no means unique) than the simple hashcode - but might be verging on "unwieldy" depending on your definition!

Q1: If you want to recover the string from the number then you could use:
1a: an encryption of the string, which is going to be the same size, or longer, unless you zip the string first. This will give an array of random looking bytes, which could be displayed as Base-64.
1b: a database, or a map, and the number is the index of the string in the map/database.
Q2: The string does not have to be recoverable.
Various ideas are possible here. You can display the hash in hex or in Base-64 to avoid negative signs. The only non-alphanumeric characters in Base-64 are '+', '/' and '='. For an almost unique hash you will need something of cryptographic size, MD5 (128 bits), SHA-1 (160 bits) or SHA-2 (256 or 512 bits).
An MD5 hash looks like "d131dd02c5e6eec4693d9a0698aff95c" in hex; the larger the hash the less likely a collision is.
rossum

Related

How to decode a string using com.google.common.hash.Hashing

I have a string that I have hashed using
Hashing.sha256().hashString("abc", Charsets.UTF_8).toString()
Now I want to decode the encrypted string. How should I do so?
The library I am using right now is
com.google.common.hash.Hashing

If you want to extract the original string, sha256 is specifically designed to make that almost impossible. If it was possible to do that efficiently then sha256 would be useless and everyone would be sad.
If you want it to be reversible, you should use an encryption algorithm, which is different from a hashing algorithm. Reversing a hash, especially a cryptographic hash function, is supposed to be hard.
(In addition, more than one string may have the exact same hash code, meaning that you can't be sure a string with the same hash was actually the same as the original string you used.)

Is it safe (in matter of uniqueness) to use UUID to generate a unique identifier for specific string?

String myText;
UUID.nameUUIDFromBytes((myText).getBytes()).toString();
I am using above code to generate a representative for specific texts.
For example 'Moien' should always be represeted with "e9cad067-56f3-3ea9-98d2-26e25778c48f", not any changes like project rebuild should be able to change that UUID.
The reason why I'm doing this is so that I don't want those specific texts to be readable(understandable) to human.
Note: I don't need the ability to regenerate the main text (e.g "Moien") after hashing .
I have an alternative way too :
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest((matcher.group(1)).getBytes("UTF-8"));
String a = Base64.encode(hash);
Which od you think is better for my problem?

UUID.nameUUIDFromBytes appears to basically just be MD5 hashing, with the result being represented as a UUID.
It feels clearer to me to use a base64-encoded hash explicitly, partly as you can then control which hash gets used - which could be relevant if collisions pose any sort of security risk. (SHA-256 is likely a better option than MD5 for exactly that reason.) The string will be longer from SHA-256 of course, but hopefully that's not a problem.
Note that in either case, I'd convert the string to text using a fixed encoding via StandardCharsets. Don't use the platform default (as per your first snippet) and prefer StandardCharsets over magic string values (as per your second snippet).

Map a set of strings with similarities to shorter strings

i have a set of strings each of same length (10chars) with the following properties.
The size of the set is around 5000 - 10,000 strings. The data set can change frequently.
Although each string is unique, a sub string of a particular pattern would appear in most of these strings not necessarily at the same position.
Some examples are
123abc7gh0
t123abcmla
wp12123abc
123abc being the substring which appears in most of the strings
The problem is to map each string to a shorter string, and such mapping should be deterministic.
I could use a simple enumeration algorithm which maps each string encountered to an incremented counter value(on set of sorted strings). But since the set is bound to change frequently, i cannot use this algorithm to compute the map in a deterministic way for various runs.
I could also use data compression algorithm like Huffman encoding to compress each individual string. But i do not believe that would be effective as each string in itself has very less duplicate characters.
what should be the approach i should adapt to solve the problem by taking advantage of the properties of the data set? Note that i do not want to compress the whole set of data but would like to map each string in the set to a shortened string.

Replace the 'common string' by a character not appearing elsewhere in any string.
Do a probabilistic analysis of all strings
Create a Hufman tree based on the analysis, i.e. most frequent characters are at the top of the tree, resulting in short codes.
Replace sample strings by their hufman encoding based on the tree of #3 and compare the resulting size with the original. If most of the characters are uniformly spread even between the strings, then the Hufman coding will not reduce but increase the size.
If Hufman does not gain any improvement, you might try LZW or any other dictionary based compression method. However, this only works, if the structure of the strings (i.e. the distribution of characters/substrings) does not completely change over time. For example, if the strings would consist of english words, the substring dictionary compression (LZW) might be a good candidate.
But if the distribution changes or the character distribution is merely equal over all characters, I am afraid there is no compression method suitable of reducing the string size.
But the last question remains: What for? Why bother compressing 10000 strings?
Edit: The answer is: The strings are used to create folder names (paths). As there is a restriction on the total length, it should be as compact as possible.
You might try to create a database (i.e. dictionary) and use the index (coded e.g. as Base64) as a compressed string. This gives you a maximum of 5 chars when assuming a maximum dictionary size of 2^32-1.

If you can pre-process the set of strings and could know the pattern which occurs in each of the strings, you could treat that as a single character (use some encoding) which would shorten that string.

I'm confronted with the same kind of task and wonder whether it is possible to achieve the mapping without making use of persistence.
If persisting the mappings in (previous) use is allowed, then the solution is simple:
you can just assign a number to each of the strings (using a representation of a sufficiently high base so that you get the required maximum size of the numbers' string representation). For each of the source strings you would assign a next number and using the persisted mappings make sure not to use the same number a second time.
This policy would give you consistent results, even if you go through the procedure multiple times with a changing set of data: a string occurring for the first time would receive its private number and this private number would stay reserved to it forever - numbers that are no longer in use would never be reused.
The more challenging question is: is it possible to guarantee uniqueness without the aid of a persisted mapping? I'm afraid it is not, since size reduction is always prone to lead to collisions.

Could anyone verify the correctness of getting a md5 hash using this method?

MessageDigest m=MessageDigest.getInstance("MD5");
StringBuffer sb = new StringBuffer();
if(nodeName!=null) sb.append(nodeName);
if(nodeParentName!=null) sb.append(nodeParentName);
if(nodeParentFieldName!=null) sb.append(nodeParentFieldName);
if(nodeRelationName!=null) sb.append(nodeRelationName);
if(nodeViewName!=null) sb.append(nodeViewName);
if(treeName!=null) sb.append(treeName);
if(nodeValue!=null && nodeValue.trim().length()>0) sb.append(nodeValue);
if(considerParentHash) sb.append(parentHash);
m.update(sb.toString().getBytes("UTF-8"),0,sb.toString().length());
BigInteger i = new BigInteger(1,m.digest());
hash = String.format("%1$032X", i);
The idea behind these lines of code is that we append all the values of a class/model into a StringBuilder and then return the padded hash of that (the Java implementation returns md5 hashes that are lenght 30 or 31, so the last line formats the hash to be padded with 0s).
I can verify that this works, but I have a feeling it fails at one point (our application fails and I believe this to be the probable cause).
Can anyone see a reason why this wouldn't work? Are there any workarounds to make this code less prone to errors (e.g. removing the need for the strings to be UTF-8).

There are a few weird things in your code.
UTF-8 encoding of a character may use more than one byte. So you should not use the string length as final parameter to the update() call, but the length of the array of bytes that getBytes() actually returned. As suggested by Paŭlo, use the update() method which takes a single byte[] as parameter.
The output of MD5 is a sequence of 16 bytes with quite arbitrary values. If you interpret it as an integer (that's what you do with your call to BigInteger()), then you will get a numerical value which will be smaller than 2160, possibly much smaller. When converted back to hexadecimal digits, you may get 32, 31, 30... or less than 30 characters. Your usage of the the "%032X" format string left-pads with enough zeros, so your code works, but it is kind of indirect (the output of MD5 has never been an integer to begin with).
You assemble the hash input elements with raw concatenation. This may induce issues. For instance, if modeName is "foo" and modeParentName is "barqux", then the MD5 input will begin with (the UTF-8 encoding of) "foobarqux". If modeName is "foobar" and modeParentName is "qux", then the MD5 input will also begin with "foobarqux". You do not tell why you want to use a hash function, but usually, when one uses a hash function, it is to have a unique trace of some piece of data; two distinct data elements should yield distinct hash inputs.
When handling nodeValue, you call trim(), which means that this string could begin and/or end with whitespace, and you do not want to include that whitespace into the hash input -- but you do include it, since you append nodeValue and not nodeValue.trim().
If what you are trying to do has any relation to security then you should not use MD5, which is cryptographically broken. Use SHA-256 instead.
Hashing an XML element is normally done through canonicalization (which handles whitespace, attribute order, text representation, and so on). See this question on the topic of canonicalizing XML data with Java.

One possible problem is here:
m.update(sb.toString().getBytes("UTF-8"),0,sb.toString().length());
As said by Robing Green, the UTF-8 encoding can produce a byte[] which is longer than your original string (it will do this exactly when the String contains non-ASCII characters). In this case, you are only hashing the start of your String.
Better write it like this:
m.update(sb.toString().getBytes("UTF-8"));
Of course, this would not cause an exception, simply another hash than would be produced otherwise, if you have non-ASCII-characters in your string. You should try to brew your failure down to an SSCCE, like lesmana recommended.

How does a person go about learning Java? (convert byte array to hex string)

I know this sounds like a broad question but I can narrow it down with an example. I am VERY new at Java. For one of my "learning" projects, I wanted to create an in-house MD5 file hasher for us to use. I started off very simple by attempting to hash a string and then moving on to a file later. I created a file called MD5Hasher.java and wrote the following:
import java.security.*;
import java.io.*;
public class MD5Hasher{
public static void main(String[] args){
String myString = "Hello, World!";
byte[] myBA = myString.getBytes();
MessageDigest myMD;
try{
myMD = MessageDigest.getInstance("MD5");
myMD.update(myBA);
byte[] newBA = myMD.digest();
String output = newBA.toString();
System.out.println("The Answer Is: " + output);
} catch(NoSuchAlgorithmException nsae){
// print error here
}
}
}
I visited java.sun.com to view the javadocs for java.security to find out how to use MessageDigest class. After reading I knew that I had to use a "getInstance" method to get a usable MessageDigest object I could use. The Javadoc went on to say "The data is processed through it using the update methods." So I looked at the update methods and determined that I needed to use the one where I fed it a byte array of my string, so I added that part. The Javadoc went on to say "Once all the data to be updated has been updated, one of the digest methods should be called to complete the hash computation." I, again, looked at the methods and saw that digest returned a byte array, so I added that part. Then I used the "toString" method on the new byte array to get a string I could print. However, when I compiled and ran the code all that printed out was this:
The Answer Is: [B#4cb162d5
I have done some looking around here on StackOverflow and found some information here:
How can I generate an MD5 hash?
that gave the following example:
String plaintext = 'your text here';
MessageDigest m = MessageDigest.getInstance("MD5");
m.reset();
m.update(plaintext.getBytes());
byte[] digest = m.digest();
BigInteger bigInt = new BigInteger(1,digest);
String hashtext = bigInt.toString(16);
// Now we need to zero pad it if you actually want the full 32 chars.
while(hashtext.length() < 32 ){
hashtext = "0"+hashtext;
}
It seems the only part I MAY be missing is the "BigInteger" part, but I'm not sure.
So, after all of this, I guess what I am asking is, how do you know to use the "BigInteger" part? I wrongly assumed that the "toString" method on my newBA object would convert it to a readable output, but I was, apparently, wrong. How is a person supposed to know which way to go in Java? I have a background in C so this Java thing seems pretty weird. Any advice on how I can get better without having to "cheat" by Googling how to do something all the time?
Thank you all for taking the time to read. :-)

The key in this particular case is that you need to realize that bytes are not "human readable", but characters are. So you need to convert bytes to characters in a certain format. For arbitrary bytes like hashes, usually hexadecimal is been used as "human readable" format. Every byte is then to be converted to a 2-character hexadecimal string which you in turn concatenate together.
This is unrelated to the language you use. You just have to understand/realize how it works "under the hoods" in a language agnostic way. You have to understand what you have (a byte array) and what you want (a hexstring). The programming language is just a tool to achieve the desired result. You just google the "functional requirement" along with the programming language you'd like to use to achieve the requirement. E.g. "convert byte array to hex string in java".
That said, the code example you found is wrong. You should actually determine each byte inside a loop and test if it is less than 0x10 and then pad it with zero instead of only padding the zero depending on the length of the resulting string (which may not necessarily be caused by the first byte being less than 0x10!).
StringBuilder hex = new StringBuilder(bytes.length * 2);
for (byte b : bytes) {
if ((b & 0xff) < 0x10) hex.append("0");
hex.append(Integer.toHexString(b & 0xff));
}
String hexString = hex.toString();
Update as per the comments on the answer of #extraneon, using new BigInteger(byte[]) is also the wrong solution. This doesn't unsign the bytes. Bytes (as all primitive numbers) in Java are signed. They have a negative range. The byte in Java ranges from -128 to 127 while you want to have a range of 0 to 255 to get a proper hexstring. You basically just need to remove the sign to make them unsigned. The & 0xff in the above example does exactly that.
The hexstring as obtained from new BigInteger(bytes).toString(16) is NOT compatible with the result of all other hexstring producing MD5 generators the world is aware of. They will differ whenever you've a negative byte in the MD5 digest.

You have actually successfully digested the message. You just don't know how to present the found digest value properly. What you have is a byte array. That's a bit difficult to read, and a toString of a byte array yields [B#somewhere which is not useful at all.
The BigInteger comes into it as a tool to format the byte array to a single number.
What you do is:
construct a BigInteger with the proper value (in this case that value happens to be encoded in the form of a byte array - your digest
Instruct the BigInteger object to return a String representation (e.g. plain, readable text) of that number, base 16 (e.g. hex)
And the while loop prefixes that value with 0-characters to get a width of 32. I'd probably use String.format for that, but whatever floats your boat :)

MessageDigests compute a byte array of something, the string that you usually see (such as 1f3870be274f6c49b3e31a0c6728957f) is actually just a conversion of the byte array to a hexadecimal string.
When you call MessageDigest.toString(), it calls MessageDigest.digest().toString(), and in Java, the toString method for a byte[] (returned by MessageDigest.digest()) returns a sort of reference to the bytes, not the actual bytes.
In the code you posted, the byte array is changed to an integer (in this case a BigInteger because it would be extremely large), and then converted to hexadecimal to be printed to a String.
The byte array computed by the digest represents a number (a 128-bit number according to http://en.wikipedia.org/wiki/MD5), and that number can be converted to any other base, so the result of the MD5 could be represented as a base-10 number, a base-2 number (as in a byte array), or, most commonly, a base-16 number.

It is OK to google for answers as long as you (eventually) understand what you copy-pasted into your app :-)
In general, I recommend starting with a good Java introductory book, or web tutorial. See these threads for more tips:
https://stackoverflow.com/questions/77839/what-are-the-best-resources-for-learning-java-books-websites-etc
Learning Java
https://stackoverflow.com/questions/78293/good-book-to-learn-to-program-well-in-java-engineering-or-architecture-wise-not

Though I'm afraid that I have no experience whatsoever using Java to play with MD5 hashes, I can recommend Sun's Java Tutorials as a fantastic resource for learning Java. They go through most of the language, and helped me out a ton when I was learing Java.
Also look around for other posts asking the same thing and see what suggestions popped up there.

The reason BigInteger is used is because the byte array is very long, too big too fit into an int or long. However, if you do want to see everything in the byte array, there's an alternate approach. You could just replace the line:
String output = newBA.toString();
with:
String output = Arrays.toString(newBA);
This will print out the contents of the array, not the reference address.

Use an IDE that shows you where the "toString()" method is coming from. In most cases it's just from the Object class and won't be very useful. It's generally recommended to overwrite the toString-method to provide some clean output, but many classes don't do this.

I'm also a newbie to development. For the current problem, I suggest the Book "Introduction To Cryptography With Java Applets" by David Bishop. It demonstrates what you need and so forth...

Any advice on how I can get better
without having to "cheat" by Googling
how to do something all the time?
By by not starting out with an MD5 hasher! Seriously, work your way up little by little on programs that you can complete without worrying about domain-specific stuff like MD5.
If you're dumping everything into main, you're not programming Java.
In a program of this scale, your main() should do one thing: create an MD5Hasher object and then call some methods on it. You should have a constructor that takes an initial string, a method to "do the work" (update, digest), and a method to print the result.
Get some tutorials and spend time on simple, traditional exercises (a Fibonacci generator, a program to solve some logic puzzle), so you understand the language basics before bothering with the libraries, which is what you are struggling with now. Then you can start doing useful stuff.

I wrongly assumed that the "toString" method on my newBA object would convert it to a readable output, but I was, apparently, wrong. How is a person supposed to know which way to go in Java?
You could replace here Java with the language of your choice that you don't know/haven't mastered yet. Even if you worked 10 years in a specific language, you will still get those "Aha! This is the way it's working!"-effects, though not that often as in the beginning.
The point you need to learn here is that toString() is not returning the representation you want/expect, but any the implementer has chosen. The default implementation of toString() is like this (javadoc):
Returns a string representation of the object. In general, the toString method returns a string that "textually represents" this object. The result should be a concise but informative representation that is easy for a person to read. It is recommended that all subclasses override this method.
The toString method for class Object returns a string consisting of the name of the class of which the object is an instance, the at-sign character `#', and the unsigned hexadecimal representation of the hash code of the object. In other words, this method returns a string equal to the value of:
getClass().getName() + '#' + Integer.toHexString(hashCode())

How is a person supposed to know which
way to go in Java? I have a background
in C so this Java thing seems pretty
weird. Any advice on how I can get
better without having to "cheat" by
Googling how to do something all the
time?
Obvious answers are 1- google when you have questions (and it's not considered cheating imo) and 2- read books on the subject matter.
Apart from these two, I would recommend trying to find a mentor for yourself. If you do not have experienced Java developers at work, then try to join a local Java developer user group. You can find more experienced developers there and perhaps pick their brains to get answers to your questions.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.