Searching for a Fast Hash Algorithm - java

I am searching for a fast hash algorithm. Actually, I am trying to build a hash table whose keys are URL's. I have used MD5 to hash the URL's, however it is too slow (I have used java's built in function). Can anybody help me by informing about some fast hash algorithm.

Java's String class already implements .hashCode(). This is likely going to be the fastest, 32bit hash, for Java, as its heavily optimized at the core. This is also the hash in use when using the built-in collections, such as java.util.HashMap.

Google open-sourced a very fast hashing algo: CityHash

MD5 is a cryptographic hash, so it will be slow compared to non-cryptographic hashes. As Yann says, the Java hash is likely to be fastest if you want a 64 bit hash.
If that doesn't suit then there are other fast non-cryptographic hashes available in various sizes, such as Fowler–Noll–Vo.

Related

Why is one function for an MD5 hash calculation preferable for smaller files yet inefficient for large files?

I am working on generating hash values for files as a means of disallowing duplicate files in a small database. As I was researching, I found the following thread: How to generate an MD5 checksum for a file in Android?
Why is the first answer "not efficient" for large files and it is best for small strings, whereas the answer provided by dentex is better-suited for large files? Is it because of the way the solution was programmed, or is there a caveat with MD5 hashing that I am unaware of?
MD5 generates a 128-bit digest.
SHA-1 generates a 160-bit digest.
SHA-2 generates a 224-, 256-, 384- or 512-bit digest.
More bits means more distinct values, means less likelihood of a two distinct inputs generating the same digest.

Difference in hash map in java 7 and 8

What is the difference in Hash Map of Java 7 and Java 8 when both works on constant complexity algorithm? As per my understanding hash map searches in constant time by generating a hash key for an object through hash function.
In Java 7 after calculating hash from hash function if more then one element has same hash than they are searched by linear search so it's complexity is (n). In Java 8 that search is performed by binary search so the complexity will become log(n). So, this concept is wrong that hash map searches an object in constant complexity because it is not the case at all times.
You might find the latest issues of the Java Specialist newsletter very helpful. It goes into great depth discussing hashing in Java over the course of the years; for example pointing out that you better make sure your map keys implement Comparable (when using Java8).

How to "hash" long String into String[64] in Java

I have a Java application which works with MySQL database.
I want to be able to store long texts and check whether table contains them. For this I want to use index, and search by reduced "hash" of full_text.
MY_TABLE [
full_text: TEXT
text_hash: varchar(255) - indexed
]
Thing is, I cannot use String.hashCode() as:
Implementation may vary across JVM versions.
Value is too short, which means many collisions.
I want to find a fast hashing function that will read the long text value and produce a long hash value for it, say 64 symbols long.
Such reliable hash methods are not fast. They're probably fast enough, though. You're looking for a cryptographic message digest method (like the ones used to identify files in P2P networks or commits in Git). Look for the MessageDigest class, and pick your algorithm (SHA1, MD5, SHA256, etc.).
Such a hash function will take bytes as argument, and produce bytes as a result, so make sure to convert your strings using a constant encoding (UTF8, for example), and to transform the produced byte array (typically of 16 or 20 bytes) to a readable String using hexadecimal or Base64 encoding.
I'd suggest that you to revisit String.hashCode().
First, it does not vary across implementations. The exact hash is specified; see the String.hashCode javadoc specification.
Second, while the String hash algorithm isn't the best there possibly is (and certainly it will have more collisions than a cryptographic hash) it does do a reasonably good job of spreading the hashes over the 32-bit result space. For example, I did a quick check of a text file on my machine (/usr/share/dict/web2a) which has 235,880 words, and there were six collisions.
Third and fourth: String.hashCode() should be considerably faster, and the storage required for the hash values should be considerably smaller, than a cryptographic hash.
If you're storing strings in a database table, and their hash values are indexed, having a few collisions shouldn't matter. Looking up a string should get you the right database rows really quickly, and having to (maybe) check a couple actual strings should be very fast compared to the database I/O.

Hashing function for strings in C

I want to implement a hashing technique in C where all the permutation of a string have same hash keys.
e.g. abc & cab both should have same keys.
I have thought of adding the ascii values & then checking frequency of characters[important otherwise both abc & aad would have same keys which we do not want].
But, it doesn't seem to be much efficient.
Is there any better hashing function which resolves collisions well & also doesn't result into sparse hash table?
Which hashing technique is used internally by Java [for strings] which not only minimizes the collisions but also the operations[insertion ,deletion, search] are fast enough?
Why not sort the string's characters before hashing?
The obvious technique is to simply sort the string. You could simply use the sorted string as the lookup key, or you can hash it with any algorithm deemed appropriate. Or you could use a run-length encoded (RLE) representation of your string (so the RLE of banana would be a3bn2), and optionally hash that.
A lot depends on what you're going to do with the hashes, and how resistant they must be to collisions. A simple CRC (cylic redundancy checksum) might be adequate, or it might be that cryptographic checksums such as MD5 or SHA1 are not secure enough for you.
Which hashing technique is used internally by Java [for strings] which
not only minimizes the collisions but also the operations[insertion
,deletion, search] are fast enough?
The basic "trick" used in Java for speed is caching of the hash value making it a member variable of a String and so you only compute it once. BUT this can only work in Java since strings are immutable.
The main rule about hashing is "Don't invent your own hashing algorithm. Ever.". You could just sort characters in string and apply standard hashing strategy.
Also read that if you are interested in hashing.

How should I implement a string hashing function for these requirements?

Ok, I need a hashing function to meet the following requirements. The idea is to be able to link together directories that are part of the same logical structure but stored in different physical areas of the file system.
I need to implement it in Java, it must be consistent across execution sessions and it can return a long.
I will be hashing directory names / strings. This should work so that "somefolder1" and "somefolder2" will return different hashes, as would "JJK" and "JJL". I'd also like some idea of when clashes are likely to occur.
Any suggestions?
Thanks
Well, nearly all hashing functions have the property that small changes in the input yield large changes in the output, meaning that "somefolder1" and "somefolder2" will always yield a different hash.
As for clashes, just look at how large the hash output is. Java's own hashcode() returns an int, therefore you can expect clashes more often than with MD5 or SHA-1, for example which yield 128 and 160 bit, respectively.
You shouldn't try creating such a function from scratch, though.
However, I didn't quite understand whether collisions shouldn't ever occur with your use case or whether they are acceptable if rare. For linking folders I'd definitely use a guarenteed-to-be-unique identifier instead of something that might occur more than once.
You haven't described under what circumstances different strings should return the same hash.
In general, I would approach designing a hashing function by first implementing the equality function. That should show you which bits of data you need to include in the hash, and which should be discarded. If the equality between two different bits of data is complicated (e.g. case-insensitivity) then hopefully there will be a corresponding hash function for that particular comparison.
Whatever you do, don't assume that equal hashes mean equal keys (i.e. that hashing is unique) - that's always a cause of potential problems.
Java's String hashcode will give you an int, if you want a long, you could take the least-significant 64 bits of the MD5 sum for the String.
Collisions could occur, your system must be prepared for that. Maybe if you give a little more detail as to what the hash codes will be used for, we can see if collisions would cause problems or not.
With a uniformly random hash function with M possible values, the odds of a collision happening after N hashes are 50% when
N = .5 + SQRT(.25 - 2 * M * ln(.5))
Look up the birthday problem for more analysis.
You can avoid collisions if you know all your keys in advance, using perfect hashing.

Categories