Hashing function for strings in C / Java

I want to implement a hashing technique in C where all permutations of a string get the same hash key.
e.g. abc & cab should both have the same key.
I have thought of adding the ASCII values and then checking the frequency of characters [important, otherwise both abc & aad would get the same key, which we do not want].
But it doesn't seem very efficient.
Is there any better hashing function, one that resolves collisions well and also doesn't result in a sparse hash table?
Which hashing technique is used internally by Java [for strings] that not only minimizes collisions but also keeps the operations [insertion, deletion, search] fast enough?

Why not sort the string's characters before hashing?

The obvious technique is to simply sort the string. You could use the sorted string itself as the lookup key, or you can hash it with any algorithm deemed appropriate. Or you could use a run-length encoded (RLE) representation of the sorted string (banana, sorted to aaabnn, becomes a3bn2), and optionally hash that.
A lot depends on what you're going to do with the hashes, and how resistant they must be to collisions. A simple CRC (cyclic redundancy check) might be adequate, or you might need a cryptographic checksum such as MD5 or SHA1.
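For instance, a minimal sketch of the sort-then-hash idea in Java (the helper name is mine; the same approach works in C with qsort and any string hash):

    import java.util.Arrays;

    // Sketch: all permutations of a string collapse to the same sorted
    // canonical form, so they get the same hash.
    static int permutationInsensitiveHash(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);                  // "cab" -> "abc", "abc" -> "abc"
        return new String(chars).hashCode(); // hash the canonical form
    }

With this, permutationInsensitiveHash("abc") == permutationInsensitiveHash("cab"), while "aad" sorts to a different canonical form and so hashes differently.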

Which hashing technique is used internally by Java [for strings] which not only minimizes collisions but also keeps the operations [insertion, deletion, search] fast enough?
The basic "trick" used in Java for speed is caching of the hash value making it a member variable of a String and so you only compute it once. BUT this can only work in Java since strings are immutable.

The main rule about hashing is "Don't invent your own hashing algorithm. Ever." You could just sort the characters in the string and apply a standard hashing strategy.
Also read up on hashing in general if you are interested.

Related

How to "hash" long String into String[64] in Java

I have a Java application which works with a MySQL database.
I want to be able to store long texts and check whether the table already contains them. For this I want to use an index, and search by a reduced "hash" of full_text.
MY_TABLE [
full_text: TEXT
text_hash: varchar(255) - indexed
]
Thing is, I cannot use String.hashCode() because:
The implementation may vary across JVM versions.
The value is too short, which means many collisions.
I want to find a fast hashing function that will read the long text value and produce a long hash value for it, say 64 characters long.
Such reliable hash methods are not fast. They're probably fast enough, though. You're looking for a cryptographic message digest method (like the ones used to identify files in P2P networks or commits in Git). Look for the MessageDigest class, and pick your algorithm (SHA1, MD5, SHA256, etc.).
Such a hash function will take bytes as argument, and produce bytes as a result, so make sure to convert your strings using a constant encoding (UTF8, for example), and to transform the produced byte array (typically of 16 or 20 bytes) to a readable String using hexadecimal or Base64 encoding.
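A sketch of that recipe, assuming SHA-256 (whose 32-byte digest conveniently hex-encodes to exactly the 64 characters the question asks for):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    static String textHash(String fullText) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(fullText.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // two hex chars per byte
        }
        return hex.toString(); // 64 hex characters: fits varchar(255) easily
    }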
I'd suggest that you revisit String.hashCode().
First, it does not vary across implementations: the exact algorithm is specified in the String.hashCode javadoc (see the sketch below this answer).
Second, while the String hash algorithm isn't the best possible (and it will certainly have more collisions than a cryptographic hash), it does a reasonably good job of spreading the hashes over the 32-bit result space. For example, I did a quick check of a text file on my machine (/usr/share/dict/web2a) which has 235,880 words, and there were six collisions.
Third and fourth: String.hashCode() should be considerably faster, and the storage required for the hash values should be considerably smaller, than a cryptographic hash.
If you're storing strings in a database table, and their hash values are indexed, having a few collisions shouldn't matter. Looking up a string should get you the right database rows really quickly, and having to (maybe) check a couple actual strings should be very fast compared to the database I/O.
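For reference, the algorithm the javadoc specifies, s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], transcribed directly:

    // Equivalent to String.hashCode() as specified in the javadoc.
    static int specifiedStringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // Horner's method over the chars
        }
        return h;
    }
    // specifiedStringHash(s) == s.hashCode() on any compliant JVM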

Java Key-Value Collection with complexity of O(1) for millions of random unordered keys

I am stuck with a problem where I have millions of key-value pairs that I need to access using the keys randomly (not by using an iterator).
The range of keys is not known at compile time, but total number of the key-value pairs is known.
I have looked into the HashMap and HashSet data structures, but they are not truly O(1): when hash codes collide, a bucket becomes a linked list, which has linear search complexity in the worst case.
I have also considered increasing the number of buckets in the HashMap, but that does not ensure that every element will be stored in a separate bucket.
Is there any way to store and access millions of key-value pairs with O(1) complexity?
Ideally I would like every key to be like a variable and corresponding value should be the value assigned to that key
Thanks in advance.
I think you are confusing what Big O notation represents. It defines limiting behavior of a function, not necessarily actual behavior.
The average complexity of a hash map is O(1) for insert, delete, and search operations. What does this mean? It means that, on average, those operations complete in constant time regardless of the size of the hash map. So, depending on the implementation of the map, a lookup might not take exactly one step, but it will most likely not involve more than a few steps, regardless of the hash map's size.
How well a hash map actually behaves for those operations is determined by a few factors. The most obvious is the hash function used to bucket keys. Hash functions that distribute the computed hashes more uniformly over the hash range and limit the number of collisions are preferred. The better the hash function in those areas, the closer a hash map will actually operate in constant time.
Another factor that affects actual hash map behavior is how storage is managed. How a map resizes and repositions entries as items are added and removed helps control hash collisions by keeping the number of buckets near optimal. Managing the hash map storage effectively allows the hash map to operate close to constant time.
With all that said, there are ways to construct hash maps that have O(1) worst case behavior for lookups. This is accomplished using a perfect hash function. A perfect hash function is an invertible 1-1 function between keys and hashes. With a perfect hash function and the proper hash map storage, O(1) lookups can be achieved. The prerequisite for using this approach is knowing all the key values in advance so a perfect hash function can be developed.
Sadly, your case does not involve known keys, so a perfect hash function cannot be constructed, but the available research might help you build a near-perfect hash function for your case.
No, there isn't such a (known) data structure for generic data types.
If there were, it would most likely have replaced hash tables in most commonly-used libraries, unless there's some significant disadvantage like a massive constant factor or ridiculous memory usage, either of which would probably make it nonviable for you as well.
I said "generic data types" above, as there may be some specific special cases for which it's possible, such as when the key is a integer in a small range - in this case you could just have an array where each index corresponds to the same key, but this is also really a hash table where the key hashes to itself.
Note that you need a terrible hash function, the pathological input for your hash function, or a very undersized hash table to actually get the worst-case O(n) performance for your hash table. You really should test it and see if it's fast enough before you go in search of something else. You could also try TreeMap, which, with its O(log n) operations, will sometimes outperform HashMap.

At which length is a String key of a HashMap considered bad-practice?

I try to pay attention to good performance and clean code all the time.
I'm having difficulties trying to grasp whether it's sane to have a HashMap with keys of 150 characters.
Is there an unwritten law to the length of the HashMap key?
Is it considered bad practice to have String keys of let's say 150 characters?
Does it affect performance? At which length?
Not really; a 150-character String is relatively trivial to calculate a hashCode for.
That being said, in circumstances like this I would advise you to test it!
Create a routine that populates a HashMap with (insert a size here that is representative of your use scenario) random values, using 5-character strings as keys. Measure how long it takes. Then do the same for 150-character keys, and see how it scales.
Also, Strings in Java are immutable, which means the hashCode can be cached per String object and doesn't need to be recalculated when you call hashCode on the same String object again.
This means that although you're calculating larger hash codes when creating your map, on access many of those will already be pre-calculated and cached, making the size of the original String even less relevant.
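A rough benchmarking sketch along those lines (sizes and key alphabet are placeholders; a proper benchmark would also warm up the JIT):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    // Times populating a HashMap with random keys of the given length.
    // Note: this also times key construction, so treat results as rough.
    static long timePopulate(int entries, int keyLength) {
        Random rnd = new Random(42);
        Map<String, Integer> map = new HashMap<>(entries * 2);
        long start = System.nanoTime();
        for (int i = 0; i < entries; i++) {
            StringBuilder key = new StringBuilder(keyLength);
            for (int j = 0; j < keyLength; j++) {
                key.append((char) ('a' + rnd.nextInt(26)));
            }
            map.put(key.toString(), i);
        }
        return System.nanoTime() - start;
    }
    // Compare timePopulate(1_000_000, 5) with timePopulate(1_000_000, 150).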
Is there an unwritten law to the length of the HashMap key?
If there is, it is also unspoken. I would measure your use case in a profiler and only worry about the things you can measure as a problem, not the things you can imagine might be a problem.
Is it considered bad practice to have String keys of let's say 150 characters?
I doubt it.
Does it affect performance? At which length?
Everything affects performance, but usually the effect is too small to matter, or sometimes even to measure. The question should be: do you need 150-character keys? If you do, then use them.
There is an exotic case where adding strings with a hashCode() of zero is a bad idea. This is because Java 1.0 to 6 doesn't optimise the case of a hashCode of zero (it is recomputed on every call), and such hash codes can be predicted, which makes denial-of-service attacks possible. Java 7 fixes this by adding a secondary, less predictable hash code.
Why doesn't String's hashCode() cache 0?
Long answer: A quick look at the source code of String::hashCode() reveals that the hash is cached after the first call. Meanwhile, String::equals() is O(n) if the strings are equal yet not identical (i.e., equals() returns true but == is false because they're allocated at different addresses).
So the impacts on performance you will see are with:
Passing never-before-hashed strings in calls to HashMap functions. However, generating lots of new strings will impact performance in itself.
Calls to HashMap::get() and HashMap::put() using a string key that is equal to a key already in the HashMap (if the key isn't in the collection, most likely only hashCode() will be called; but if it is, equals() will compare all characters until it determines the strings are equal). But only if the strings passed to these functions are not the same objects already in the HashMap, because in that case equals() is very fast.
Moreover, string literals, string constants, and manually intern()'d strings join the String Constant Pool, in which all "equal" strings are the same object with the same address. So if working exclusively with such strings, hashCode and equals are very fast.
Of course, the performance impact won't be at all noticeable unless you're doing the aforementioned operations in a tight loop (because 150 chars isn't long and hashCode() and equals() are both efficient).
Short answer: Benchmark.
First, there is no "unwritten rule". If long strings as keys make sense from an algorithmic perspective, use them. If profiling indicates that there is a problem, then you optimize.
So how can long strings affect hash table performance?
Long strings take more memory than short ones, and that could result in measurably longer garbage collection times, and other secondary performance effects related to hardware memory caches, TLBs and (potentially) physical memory page contention.
The hashCode algorithm for String uses all characters of the string, and therefore its cost is proportional to the string length. This is mitigated by the fact that String hash codes are cached: the second and subsequent times you call hashCode on a String, you get the cached value. However, that only helps (here) if you do multiple hash table operations with the same String object as a key.
When you get a hash collision, the hash table falls back to using String.equals() to compare keys while searching the selected hash chain. In the worst case (e.g. when the strings are equal but not ==), String.equals() involves comparing all characters of the 2 strings.
As you can see, these effects are going to be specific to the actual application, and hence they are hard to predict. Hence an "unwritten rule" is unlikely to be helpful.

Searching for a Fast Hash Algorithm

I am searching for a fast hash algorithm. Actually, I am trying to build a hash table whose keys are URLs. I have used MD5 to hash the URLs, but it is too slow (I have used Java's built-in function). Can anybody suggest a fast hash algorithm?
Java's String class already implements .hashCode(). This is likely going to be the fastest 32-bit hash for Java, as it's heavily optimized at the core. This is also the hash used by the built-in collections, such as java.util.HashMap.
Google open-sourced a very fast hashing algo: CityHash
MD5 is a cryptographic hash, so it will be slow compared to non-cryptographic hashes. As Yann says, the built-in Java hash is likely to be the fastest if a 32-bit hash is enough.
If that doesn't suit then there are other fast non-cryptographic hashes available in various sizes, such as Fowler–Noll–Vo.
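For example, a sketch of 64-bit FNV-1a in Java, using the standard offset basis and prime:

    import java.nio.charset.StandardCharsets;

    // FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;               // FNV 64-bit prime
        }
        return hash;
    }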

How should I implement a string hashing function for these requirements?

Ok, I need a hashing function to meet the following requirements. The idea is to be able to link together directories that are part of the same logical structure but stored in different physical areas of the file system.
I need to implement it in Java, it must be consistent across execution sessions and it can return a long.
I will be hashing directory names / strings. This should work so that "somefolder1" and "somefolder2" will return different hashes, as would "JJK" and "JJL". I'd also like some idea of when clashes are likely to occur.
Any suggestions?
Thanks
Well, nearly all hashing functions have the property that small changes in the input yield large changes in the output, meaning that "somefolder1" and "somefolder2" will almost always yield different hashes.
As for clashes, just look at how large the hash output is. Java's own hashCode() returns an int, so you can expect clashes more often than with MD5 or SHA-1, for example, which yield 128 and 160 bits, respectively.
You shouldn't try creating such a function from scratch, though.
However, I didn't quite understand whether collisions should never occur in your use case, or whether they are acceptable if rare. For linking folders I'd definitely use a guaranteed-to-be-unique identifier instead of something that might occur more than once.
You haven't described under what circumstances different strings should return the same hash.
In general, I would approach designing a hashing function by first implementing the equality function. That should show you which bits of data you need to include in the hash, and which should be discarded. If the equality between two different bits of data is complicated (e.g. case-insensitivity) then hopefully there will be a corresponding hash function for that particular comparison.
Whatever you do, don't assume that equal hashes mean equal keys (i.e. that hashing is unique) - that's always a cause of potential problems.
Java's String hashCode will give you an int; if you want a long, you could take the least-significant 64 bits of the MD5 sum of the String.
Collisions could occur, your system must be prepared for that. Maybe if you give a little more detail as to what the hash codes will be used for, we can see if collisions would cause problems or not.
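A sketch of that suggestion (here taking the last 8 of MD5's 16 digest bytes as the "least-significant" 64 bits):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Stable across executions and JVMs, since MD5 and UTF-8 are fully specified.
    static long md5ToLong(String s) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 8; i < 16; i++) {      // last 8 of the 16 digest bytes
            h = (h << 8) | (digest[i] & 0xffL);
        }
        return h;
    }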
With a uniformly random hash function with M possible values, the odds of a collision happening after N hashes are 50% when
N = .5 + SQRT(.25 - 2 * M * ln(.5))
Look up the birthday problem for more analysis.
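Plugging numbers into that formula (a quick sketch; the values are approximate):

    // N = 0.5 + sqrt(0.25 - 2 * M * ln(0.5)): the number of uniform hashes
    // over M possible values at which a collision becomes more likely than not.
    static double collisionThreshold(double m) {
        return 0.5 + Math.sqrt(0.25 - 2.0 * m * Math.log(0.5));
    }
    // collisionThreshold(Math.pow(2, 32)) ~ 7.7e4  (32-bit hash)
    // collisionThreshold(Math.pow(2, 64)) ~ 5.1e9  (64-bit hash)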
You can avoid collisions if you know all your keys in advance, using perfect hashing.
