It is well known that Java's hashCode() returns an int, so there are only about 4 billion distinct hash codes.
I am using the hash code of a string (for example Fname + Lname + DOB + DATE) as the primary key of my database.
In @PrePersist I set the key to this hash code, which gives every new user an ID (which has to be unique).
Now I am running out of hash codes. A possible alternative is to use SHA-2, MD5, etc.
How can I increase the size of the hash code and still avoid collisions?
If your goal is to create a unique identifier for the database, I would suggest using UUID.
UUID Version 3, as it uses a namespace, will fit your case.
Some databases have native support for UUID, for instance PostgreSQL
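As a rough sketch of the name-based approach: java.util.UUID.nameUUIDFromBytes produces a version 3 (MD5-based) UUID from a byte array. Note that this JDK method does not take a separate namespace argument; if you want one, you would have to prepend it to the bytes yourself. The class name, method name and field values below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class NameBasedId {
    // Hypothetical helper: derives a version 3 (name-based, MD5) UUID from the user's fields.
    static UUID forUser(String firstName, String lastName, String dateOfBirth) {
        // A NUL delimiter keeps ("ab", "c") and ("a", "bc") from producing the same name,
        // assuming the fields themselves never contain a NUL character.
        String name = String.join("\u0000", firstName, lastName, dateOfBirth);
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        UUID id = forUser("Ada", "Lovelace", "1815-12-10");
        System.out.println(id + " (version " + id.version() + ")");
    }
}
```

The same input always yields the same UUID, so this only gives unique keys if the combination of fields is itself unique.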
I think you are confusing int Object.hashCode(), which you can override and which returns an int, with a secure hash function. Those are two different things. Object.hashCode is not intended to return unique integers (returning 1 is a valid implementation). So, using String.hashCode() for object identity is not a great idea since it can and will have collisions. It's intended for use with e.g. hash tables, which means it is optimized for performance and not for avoiding collisions.
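To make the collision point concrete, here is a small example with two short strings that are known to share a hash code under the algorithm documented for String.hashCode(). The class and method names are only illustrative.

```java
final class HashCodeCollision {
    // True when two different strings end up with the same String.hashCode().
    static boolean collides(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // 'A' * 31 + 'a' == 'B' * 31 + 'B' == 2112
        System.out.println("Aa".hashCode());
        System.out.println("BB".hashCode());
        System.out.println(collides("Aa", "BB"));
    }
}
```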
You can indeed use sha1, sha2, sha3, or md5 if you want some kind of content hash. If not, use SecureRandom or UUID to generate something random. All of these have a very low probability of ever giving you a collision (not completely 0 of course).
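A minimal sketch of the content-hash option, using the JDK's MessageDigest with SHA-256 and rendering the 32-byte digest as a 64-character hex string. The class and method names are assumptions for illustration; how you build the input string from your fields is up to you.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ContentHash {
    // Returns the SHA-256 digest of the input as lowercase hex (64 characters).
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc"));
    }
}
```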
I am writing a MongoDB Collection that contains a specific set of data, and I want to run comparisons against that data by taking an MD5 (or maybe SHA256) hash of the data and basing comparisons off of that.
I was wondering if using a fixed-length character string of hex digits is the right way of doing this. Is there a better datatype to use, such as a "blob" or even a 64-bit long integer to hold the values? (This may require me to use a hashing function that produces longs -- I don't know of one except maybe overriding the Java .hashCode() function with Eclipse?)
If there is a better way entirely, advice on best practice would be appreciated here!
Storing MD5 Hashes in MongoDB
You have to use String or Binary (half the size) in case you decide to store a MD5 hash (see here).
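The "half the size" remark can be checked without any MongoDB driver: an MD5 digest is 16 raw bytes, and its hex form is 32 characters. The sketch below only computes both representations; the class and method names are made up, and how the bytes are handed to the driver is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Sizes {
    // Raw MD5 digest: always 16 bytes.
    static byte[] md5(String data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hex rendering of the digest: two characters per byte, so 32 characters.
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] raw = md5("some document content");
        System.out.println(raw.length + " bytes as binary, " + toHex(raw).length() + " characters as hex");
    }
}
```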
Best Hash Function
This is tough to answer, since it highly depends on the kind of data in your collection. I personally think that MD5 hashes are a good way, but again it depends on the use-case. In case you want to customize/optimize your hash, this post and this post might get you started. They cover some simple recipes on writing a custom hash function.
I have an
HashMap<String,AnObject>
and I'd like to give the string key a value from some infos the AnObject value contains.
Suppose AnObject is made this way:
public class AnObject {
public String name;
public String surname;
}
Is it correct to assign the key to:
String.valueOf(o.name.hashcode()+o.surname.hashcode());
? Or Is there a better way to compute a String hash code from a value list?
No, absolutely not. hashCode() is not guaranteed to be unique.
The rules of a hash code are simple:
Two equal values must have the same hash code
Two non-equal values will ideally have different hash codes, but can have the same hash code. In particular, there are only 2^32 possible values to return from hashCode(), but more than 2^32 possible strings, making uniqueness impossible.
The hash code of an object should not change unless some equality-sensitive aspect of it changes. Indeed, it's generally a good idea to make types implementing value equality immutable, at least in equality-sensitive aspects. Otherwise you can easily find that you can't look up an entry using the exact same object reference that you previously used for the key!
Hash codes are an optimization technique to make it quick to find a "probably small" set of candidate values equal to some target, which you then iterate through with a rigorous equality check to find whether any of them is actually equal to the target. That's what lets you quickly look something up by key in a hash-based collection. The key isn't the hash itself.
If you need to create a key from two strings, you're going to basically have to make it from those two strings (with some sort of delimiter so you can tell the difference between {"a", "bc"} and {"ab", "c"} - understanding that the delimiter itself might appear in the values if you're not careful).
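One way to handle the delimiter caveat is to escape the delimiter (and the escape character itself) inside each value before joining. This is only a sketch: the class name, the choice of "|" as delimiter and "\" as escape character are all assumptions.

```java
final class CompositeKey {
    // Escape the escape character first, then the delimiter.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("|", "\\|");
    }

    // Builds a map key from both strings; distinct pairs give distinct keys.
    static String of(String name, String surname) {
        return escape(name) + "|" + escape(surname);
    }

    public static void main(String[] args) {
        System.out.println(of("a", "bc")); // a|bc
        System.out.println(of("ab", "c")); // ab|c
        System.out.println(of("a|", "b")); // a\||b
        System.out.println(of("a", "|b")); // a|\|b
    }
}
```

The resulting string can be used directly as the HashMap key; the map still calls hashCode() on it internally, but equality is decided by the full string.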
See Eric Lippert's blog post on the topic for more information; that's based on .NET rather than Java, but the principles all apply. It's also worth understanding that the semantics of hashCode aren't necessarily the same as those of a cryptographic hash. In particular, it's fine for the result of hashCode() to change if you start a new JVM but create an object with the same fields - no-one should be persisting the results of hashCode. That's not the case with something like SHA-256, which should be permanently stable for a particular set of data.
The hash code for String is lossy; many String values will result in the same hash code. An integer has 32 bit positions and each position has two values. There's no way to map even just the 32-character strings (for instance) (each character having lots of possibilities) into 32 bits without collisions. They just won't fit.
If you want to use arbitrary precision arithmetic (say, BigInteger), then you can just take each character as an integer and concatenate them all together.
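A sketch of that idea, assuming each Java char is treated as a 16-bit integer and the values are appended one after another. The leading 1 bit is an addition of this sketch, so that strings starting with '\0' characters do not lose them; the class and method names are made up.

```java
import java.math.BigInteger;

final class StringAsNumber {
    // Encodes the string as one big number: a sentinel 1 bit followed by 16 bits per char.
    static BigInteger encode(String s) {
        BigInteger value = BigInteger.ONE; // sentinel, preserves leading '\0' chars
        for (int i = 0; i < s.length(); i++) {
            value = value.shiftLeft(16).or(BigInteger.valueOf(s.charAt(i)));
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode("A"));  // (1 << 16) + 65 = 65601
        System.out.println(encode("Aa").equals(encode("BB"))); // false, unlike their hashCode()
    }
}
```

The number grows by 16 bits per character, so this is lossless but no longer fits a fixed-size column.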
No, hashCode() (BTW, pay attention to the case of the letter C) does not guarantee uniqueness. You can have a lot of objects that produce the same hash code.
If you need a unique identifier, use the class java.util.UUID.
I have a table whose PK consists of two short varchars (15 and 5) and one datetime field.
My thought on creating a hashCode was to format the datetime to something like yyyyMMddHHmmss and then concatenate it with the other two fields using some delimiter (e.g. _) and then ask for the hash code on that string.
Was wondering if there may be a more elegant approach.
Thanks
All depends on what you mean by "bulletproof". If you just mean it can be used as the hashCode of a Java object, then it should be fine. Doesn't Hibernate return a datetime as a java Date? If so, just use hashCode on that Date. You can xor (or add, ...) with the other hashCodes instead of concatenating and hashing, it may be a bit faster.
If by "bulletproof" you need a cryptographically secure hash, then you need to do more.
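A minimal sketch of the xor variant for a key made of two strings and a date. The class and field names are hypothetical, since the original table is not shown. Note that xor is symmetric, so swapping the two string values gives the same hash code; that is allowed, it just means a few more equals() calls.

```java
import java.util.Date;

final class RowKey {
    private final String code;    // hypothetical name for the varchar(15) column
    private final String suffix;  // hypothetical name for the varchar(5) column
    private final Date timestamp; // the datetime column

    RowKey(String code, String suffix, Date timestamp) {
        this.code = code;
        this.suffix = suffix;
        this.timestamp = new Date(timestamp.getTime()); // defensive copy, Date is mutable
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof RowKey)) return false;
        RowKey other = (RowKey) o;
        return code.equals(other.code)
                && suffix.equals(other.suffix)
                && timestamp.equals(other.timestamp);
    }

    @Override
    public int hashCode() {
        // Combine the parts' own hash codes instead of formatting and concatenating strings.
        return code.hashCode() ^ suffix.hashCode() ^ timestamp.hashCode();
    }
}
```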
What is a general collision-free Java best practice to generate hash codes for any-type (atomic types) multi-column primary keys?
I thought about it for a few hours and came to the conclusion, that a string concatenated by all primary key columns would be the only reliable way to do so. Then calling Java's hashCode method on that concatenated string should yield a unique integer. (it would in fact somehow mimic what a database index does, not sure here though)
For a multi-column primary key of the form
CREATE TABLE PlayerStats
(
game_id INTEGER,
is_home BOOLEAN,
player_id SMALLINT,
roster_id SMALLINT,
... -- (game_id, is_home) FK to score, (player_id, roster_id) FK to team member
PRIMARY KEY (game_id, is_home, player_id, roster_id)
)
a hash code could be calculated like:
@Override
public int hashCode()
{
    // each column is zero-padded to its maximum width (sign included):
    String surrogate = String.format("%011d", this.gameId)        // 11
                     + String.format("%01d", this.isHome ? 1 : 0) // 1
                     + String.format("%06d", this.playerId)       // 6
                     + String.format("%06d", this.rosterId);      // 6
    System.out.println("surrogate = '" + surrogate + "'");
    return surrogate.hashCode();
}
Of course, this only works with HashSets and Hashtable when equals is also based on this.
My question: is this a good general strategy?
I can see that on-the-fly calculation might not be the fastest. You might want to recalculate the hash code whenever a composite key value is changed (e.g. call a rehash() method from within every setter operating on a key property).
Suggestions and improvements welcome. Aren't there any generally known strategies for this? A pattern?
The hash code is used as an index to look up elements in the data set that have the same code. The equals method is then used to find matches within the set of elements that have the same hash code. As such, the generated hash code doesn't have to be 100% unique. It just needs to be "unique enough" to create a decent distribution among the data elements so that there isn't a need to invoke the equals method on a large number of items with the same hashCode value.
From that perspective, generating lots and lots of strings and computing hash codes on those strings seems like an expensive way to avoid an equals operation that consists of 3 integer and 1 boolean comparison. It also doesn't necessarily guarantee uniqueness in the hash code value.
My recommendation would be to start with a simple approach of having the hash code of the key being the sum of the hash codes of its constituents. If that doesn't provide a good distribution because all of the ids are in a similar range, you could try multiplying the ids by some different factors before summing.
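A sketch of that recommendation for the PlayerStats key from the question, using the common multiply-by-31 chain, which amounts to summing the fields with different factors. The class name PlayerStatsKey is an assumption; the fields mirror the primary key columns.

```java
final class PlayerStatsKey {
    private final int gameId;
    private final boolean isHome;
    private final short playerId;
    private final short rosterId;

    PlayerStatsKey(int gameId, boolean isHome, short playerId, short rosterId) {
        this.gameId = gameId;
        this.isHome = isHome;
        this.playerId = playerId;
        this.rosterId = rosterId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PlayerStatsKey)) return false;
        PlayerStatsKey other = (PlayerStatsKey) o;
        // The cheap, exact check: three integer comparisons and one boolean comparison.
        return gameId == other.gameId
                && isHome == other.isHome
                && playerId == other.playerId
                && rosterId == other.rosterId;
    }

    @Override
    public int hashCode() {
        // Each field ends up weighted by a different power of 31 before being summed.
        int result = gameId;
        result = 31 * result + (isHome ? 1 : 0);
        result = 31 * result + playerId;
        result = 31 * result + rosterId;
        return result;
    }
}
```

Because the fields are final, the hash code never changes after construction, so no rehash() call from setters is needed.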
I wish to store UUIDs created using java.util.UUID in a HSQLDB database.
The obvious option is to simply store them as strings (in the code they will probably just be treated as such), i.e. varchar(36).
What other options should I consider for this, considering issues such as database size and query speed (neither of which are a huge concern due to the volume of data involved, but I would like to consider them at least)
HSQLDB has a built-in UUID type. Use that:
CREATE TABLE t (
id UUID PRIMARY KEY
);
You have a few options:
Store it as a VARCHAR(36), as you already have suggested. This will take 36 bytes (288 bits) of storage per UUID, not counting overhead.
Store each UUID in two BIGINT columns, one for the least-significant bits and one for the most-significant bits; use UUID#getLeastSignificantBits() and UUID#getMostSignificantBits() to grab each part and store it appropriately. This will take 128 bits of storage per UUID, not counting any overhead.
Store each UUID as an OBJECT; this stores it as the binary serialized version of the UUID class. I have no idea how much space this takes up; I'd have to run a test to see what the default serialized form of a Java UUID is.
The upsides and downsides of each approach depend on how you're passing the UUIDs around your app -- if you're passing them around as their string equivalents, then the downside of requiring more than double the storage capacity for the VARCHAR(36) approach is probably outweighed by not having to convert them each time you do a DB query or update. If you're passing them around as native UUIDs, then the BIGINT method is probably pretty low-overhead.
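For the two-BIGINT option, the conversion in both directions is short. This sketch only shows the split and the reassembly; the class and method names are made up, and binding the two longs to JDBC parameters is left out.

```java
import java.util.UUID;

final class UuidAsLongs {
    // Splits a UUID into the two 64-bit halves that would go into the two BIGINT columns.
    static long[] split(UUID id) {
        return new long[] { id.getMostSignificantBits(), id.getLeastSignificantBits() };
    }

    // Rebuilds the UUID from the two stored halves.
    static UUID join(long mostSignificantBits, long leastSignificantBits) {
        return new UUID(mostSignificantBits, leastSignificantBits);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        long[] halves = split(original);
        System.out.println(original.equals(join(halves[0], halves[1]))); // true
    }
}
```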
Oh, and it's nice that you're looking to consider speed and storage space issues, but as many better than me have said, it's also good that you recognize that these might not be critically important given the amount of data your app will be storing and maintaining. As always, micro-optimization for the sake of performance is only important if not doing so leads to unacceptable cost or performance. Otherwise, these two issues -- the storage space of the UUIDs, and the time it takes to maintain and query them in the DB -- are reasonably low-importance given the cheap cost of storage and the ability of DB indices to make your life much easier. :)
I would recommend char(36) instead of varchar(36). Not sure about hsqldb, but in many DBMS char is a little faster.
For lookups, if the DBMS is smart, then you can use an integer value to "get closer" to your UUID.
For example, add an int column to your table as well as the char(36). When you insert into your table, insert the uuid.hashCode() into the int column. Then your searches can be like this
WHERE intCol = ? and uuid = ?
As I said, if hsqldb is smart like mysql or sql server, it will narrow the search by the intCol and then only compare at most a few values by the uuid. We use this trick to search through million+ record tables by string, and it is essentially as fast as an integer lookup.
Using BINARY(16) is another possibility. Less storage space than character types. Use CREATE TYPE UUID .. or CREATE DOMAIN UUID .. as suggested above.
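For the BINARY(16) option, a UUID can be packed into a 16-byte array with java.nio.ByteBuffer. As above, this is just a sketch with made-up class and method names; it assumes big-endian order (most significant bits first), which is ByteBuffer's default.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

final class UuidAsBytes {
    // Packs the UUID into 16 bytes, most significant half first.
    static byte[] toBytes(UUID id) {
        return ByteBuffer.allocate(16)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    // Reads the two halves back in the same order.
    static UUID fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long mostSignificantBits = buffer.getLong();
        long leastSignificantBits = buffer.getLong();
        return new UUID(mostSignificantBits, leastSignificantBits);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        byte[] stored = toBytes(original);
        System.out.println(stored.length + " bytes, round trip ok: " + original.equals(fromBytes(stored)));
    }
}
```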
I think the easiest thing to do would be to create your own domain thus creating your own UUID "type" (not really a type, but almost).
You also should consider the answer to this question (especially if you plan to use it instead of a "normal" primary key)
INT, BIGINT or UUID/GUID in HSQLDB? (deleted by community ...)
HSQLDB: Domain Creation and Manipulation