How to "hash" long String into String[64] in Java - java

I have a Java application which works with a MySQL database.
I want to be able to store long texts and check whether the table already contains them. For this I want to use an index, and search by a reduced "hash" of full_text.
MY_TABLE [
full_text: TEXT
text_hash: varchar(255) - indexed
]
The thing is, I cannot use String.hashCode(), because:
The implementation may vary across JVM versions.
The value is too short, which means many collisions.
I want to find a fast hashing function that will read the long text value and produce a long hash value for it, say, 64 characters long.

Such reliable hash methods are not fast, but they're probably fast enough. You're looking for a cryptographic message digest (like the ones used to identify files in P2P networks or commits in Git). Look at the MessageDigest class and pick your algorithm (SHA-1, MD5, SHA-256, etc.).
Such a hash function takes bytes as its argument and produces bytes as its result, so make sure to convert your strings using a fixed encoding (UTF-8, for example), and to transform the produced byte array (typically 16, 20, or 32 bytes) into a readable String using hexadecimal or Base64 encoding.
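As a rough sketch of that (assuming SHA-256, whose 32-byte digest yields exactly 64 hex characters and fits the indexed varchar(255) column; the class and method names are just for illustration):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class TextHash {
    // Hash an arbitrarily long text to a 64-character hex string using SHA-256.
    static String sha256Hex(String text) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8)); // 32 bytes
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // two hex characters per byte
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha256Hex("some long text ..."));
    }
}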

I'd suggest that you revisit String.hashCode().
First, it does not vary across implementations. The exact hash is specified; see the String.hashCode javadoc specification.
Second, while the String hash algorithm isn't the best there possibly is (and certainly it will have more collisions than a cryptographic hash) it does do a reasonably good job of spreading the hashes over the 32-bit result space. For example, I did a quick check of a text file on my machine (/usr/share/dict/web2a) which has 235,880 words, and there were six collisions.
Third and fourth: computing String.hashCode() should be considerably faster, and the storage required for the hash values considerably smaller, than for a cryptographic hash.
If you're storing strings in a database table, and their hash values are indexed, having a few collisions shouldn't matter. Looking up a string should get you the right database rows really quickly, and having to (maybe) check a couple of actual strings should be very fast compared to the database I/O.
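A minimal sketch of that lookup pattern, assuming the MY_TABLE layout from the question, a plain JDBC connection, and the hash stored as the decimal string of the int (everything beyond the question's table and column names is illustrative):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TextLookup {
    // True if fullText is already stored: the indexed hash column narrows the
    // search, and comparing the full text rules out hash collisions.
    static boolean contains(Connection conn, String fullText) throws Exception {
        String hash = Integer.toString(fullText.hashCode());
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT full_text FROM MY_TABLE WHERE text_hash = ?")) {
            ps.setString(1, hash);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    if (fullText.equals(rs.getString(1))) {
                        return true; // a real match, not just a colliding hash
                    }
                }
            }
        }
        return false;
    }
}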

Related

Which digits of a UUID are least likely to collide if the generator (e.g. Java version of UUID) is unknown?

Suppose we have an existing set of UUIDs (say, millions, though it doesn't matter) that may have been generated by different clients, so that we don't know the algorithm that generated any UUID. But we can assume that they are popular implementations.
Is there a set of 8 or more digits (not necessarily contiguous, though ideally so) that is less or more likely to collide?
For example, I've seen the uuid() function in MySQL, when used twice in the same statement, generate 2 UUIDs exactly the same except the 5th through 8th digits:
0dec7a69-ded8-11e8-813e-42010a80044f
0decc891-ded8-11e8-813e-42010a80044f
^^^^
What is the answer generally?
The application is to expose a more compact ID for customers to copy and paste or communicate over the phone. Unfortunately we're bound to using UUIDs in the backend, and understandably reluctant to create mappings between long and short versions of IDs, but we can live with using a truncated UUID that occasionally collides and returns more than one result.
Suggestion: first 8 digits
1c59f6a6-21e6-481d-80ee-af3c54ac400a
^^^^^^^^
All generator implementations are required to use the same algorithm for a given UUID version, so worry about the version rather than the implementation.
UUID version 1 & version 2 are generally arranged from most to least entropy for a given source. So, the first 8 digits are probably the least likely to collide.
UUID version 4 and version 3 & 5 are designed to have uniform entropy, aside from the reserved digits for version and variant. So the first 8 digits are as good as any others.
There is one method that will work, no matter the caveats of the UUID specification. Since a UUID is in itself intended to be globally unique, a secure hash made out of it using a proper algorithm with at least the same bit size will have the same properties.
Except that the secure hash will have its entropy spread across the whole hash value rather than concentrated at specific locations.
As an example, you could do:
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest(uuid.toString().getBytes(StandardCharsets.UTF_8));
And then you take as many bits out of the hash as you need and convert them back to a String.
This is a one-way function though; to map it back to the UUID in a fast and efficient manner, you need to keep a mapping table. (You can of course check whether a UUID matches the shorter code by performing the one-way hash on the UUID again.)
However, if you were to take a non-contiguous portion out of the UUID, you would have the same issue.
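A sketch of that idea (the 8-character length and the hex encoding are arbitrary choices here, not part of any spec):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

public class ShortCode {
    // Derive a short, customer-facing code from a UUID by hashing it and
    // keeping the first few bytes. Collisions are possible, so callers must
    // be prepared to get more than one match back.
    static String shortCode(UUID uuid) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(uuid.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder code = new StringBuilder();
        for (int i = 0; i < 4; i++) {                  // 4 bytes -> 8 hex digits
            code.append(String.format("%02x", hash[i]));
        }
        return code.toString();
    }
}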

How to reduce memory usage for a HashMap<String, Integer> like data structure

Before starting to explain my problem, I should mention that I am not looking for a way to increase the Java heap memory; I strictly need to store these objects.
I am working on storing a huge number (5-10 GB) of DNA sequences and their counts (Integer) in a hash table. The DNA sequences (of length 32 or less) consist of the chars 'A', 'C', 'G', 'T', and 'N' (undefined). As we know, when storing a large number of objects in memory, Java has poor space efficiency compared to lower-level languages like C and C++. So if I store these sequences as Strings (about 100 MB of memory for sequences of length ~30), I run into the memory error.
I tried representing the nucleic acids as 'A'=00, 'C'=01, 'G'=10, 'T'=11 and neglecting 'N' (because, as a fifth letter, it ruins the char-to-2-bit transform), then concatenating these 2-bit codes into a byte array. That brought some improvement, but unfortunately I see the error again after a couple of hours. I need a convenient solution, or at least a workaround, to handle this error. Thank you in advance.
This may be a weird idea, and it would require quite a lot of work, but here is what I would try:
You already pointed out two individual subproblems of your overall task:
the default HashMap implementation may be suboptimal for such large collection sizes
you need to store something other than strings
The map implementation
I would recommend writing a highly tailored hash map implementation for the Map<String, Long> interface. Internally you do not have to store strings. Unfortunately 5^32 > 2^64, so there is no way to pack the whole string into a single long; so let's stick to two longs per key. You can do the string-to-long[2] conversion (and back) fairly efficiently on the fly when a string key is passed to your map implementation (use bit shifts etc.).
As for packing the values, here are some considerations:
a standard hash map will need an array of N bucket entries, where N is the current capacity; once the bucket is found from the key's hash, it needs a linked list of key-value pairs to resolve keys that produce identical hash codes. For your specific case you could try to optimize it in the following way:
use a long[] of size 3N, where N is the capacity, to store both keys and values in one contiguous array
in this array, at locations 3 * (hashcode % N) and 3 * (hashcode % N) + 1, store the long[2] representation of the first (or only) key that maps to this bucket (or zero if the bucket is empty); at location 3 * (hashcode % N) + 2, store the corresponding count
for all those cases where a different key results in the same hash code and thus the same bucket, store the data in a standard HashMap<Long2KeyWrapper, Long>. The idea is to keep the capacity of the array mentioned above (and resize it correspondingly) large enough that by far the largest part of the data lives in that contiguous array and not in the fallback hash map. This will dramatically reduce the storage overhead of the hash map
do not double the capacity (N → 2N) on every resize; make smaller growth steps, e.g. 10-20%. This will cost some performance while populating the map, but will keep your memory footprint under control
The keys
Given the inequality 5^32 > 2^64, your idea of using bits to encode the 5 letters is the best I can think of right now. Use 3 bits per letter and, correspondingly, a long[2] key.
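A rough sketch of that packing, assuming sequences of at most 32 characters over the A/C/G/T/N alphabet (3 bits per letter, so at most 96 bits, which fits in two longs):
public class DnaKey {
    // Pack a DNA sequence (length <= 32, letters A/C/G/T/N) into two longs,
    // 3 bits per letter: the first 21 letters go into key[0], the rest into key[1].
    static long[] pack(String seq) {
        long[] key = new long[2];
        for (int i = 0; i < seq.length(); i++) {
            long code = "ACGTN".indexOf(seq.charAt(i)) + 1; // 1..5; 0 is left for "no letter"
            if (i < 21) {
                key[0] |= code << (3 * i);
            } else {
                key[1] |= code << (3 * (i - 21));
            }
        }
        return key;
    }
}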
I recommend you look into the Trove4j Collections API; it offers Collections that hold primitives which will use less memory than their boxed, wrapper classes.
Specifically, you should check out their TObjectIntHashMap.
Also, I wouldn't recommend storing anything as a String or char until JDK 9 is released, as the backing char array of a String is UTF-16 encoded, using two bytes per char. JDK 9's compact strings use a single byte per char for strings that contain only Latin-1 characters.
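For example, counting sequences with Trove might look roughly like this (this assumes gnu.trove.map.hash.TObjectIntHashMap and its adjustOrPutValue method; check the API of the Trove version you use):
import gnu.trove.map.hash.TObjectIntHashMap;

public class SequenceCounts {
    public static void main(String[] args) {
        // Maps each sequence to a primitive int count, avoiding boxed Integers.
        TObjectIntHashMap<String> counts = new TObjectIntHashMap<>();
        counts.adjustOrPutValue("ACGTN", 1, 1); // add 1 to the count, or insert 1 if absent
        counts.adjustOrPutValue("ACGTN", 1, 1);
        System.out.println(counts.get("ACGTN")); // prints 2
    }
}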
If you're working with on the order of ~10 GB of data, or at least data with an in-memory representation of ~10 GB, then you might need to think of ways to write the data you don't need at the moment to disk and load individual portions of your dataset into memory to work on them.
I had this exact problem a few years ago when I was conducting research with Monte Carlo simulations, so I wrote a Java data structure to solve it. You can clone/fork the source here: github.com/tylerparsons/surfdep
The library supports both MySQL and SQLite as the underlying database. If you don't have either, I'd recommend SQLite as it's much quicker to set up.
Full disclaimer: this is not the most efficient implementation, but it will handle very large datasets if you let it run for a few hours. I tested it successfully with matrices of up to 1 billion elements on my Windows laptop.

Base64 Encoding encrypted password hashes

I'm currently creating a web application that requires passwords to be encrypted and stored in a database. I found the following guide, which encrypts passwords using PBKDF2WithHmacSHA1.
In the example provided, the getEncryptedPassword method returns a byte array.
Are there any advantages to Base64-encoding the result?
Any disadvantages?
The byte[] array is the smallest mechanism for storing the value (storage-space-wise). If you have lots of these values it may make sense to store them as bytes. Depending on where you store the result, the format may make a difference too. Most databases will accommodate byte[] values fairly well, but it can be cumbersome (depending on the database). Stores like text files and XML documents, etc. obviously will struggle with the byte[] array.
In most circumstances I feel there are two formats that make sense, Hexadecimal representation, or byte[]. I seldom think that the advantages of Base64 for short values (less than 32 characters) are worth it (for larger items then sure, use base64, and there's a fantastic library for it too).
This is obviously all subjective.....
Converting values to Hexadecimal are quite easy: see How to convert a byte array to a hex string in Java?
Hex output is convenient and easier to manage than Base64, which has a more complicated algorithm to build and is thus slightly slower.....
Assuming a reasonable database there is no advantage, since it's just an encoding scheme. There is a size increase as a consequence of base 64 encoding, which is a disadvantage. If your database reliably stores 8-bit bytes, just store the hash in the database.
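To make the trade-off concrete, here is a small sketch that derives a PBKDF2 hash and renders the same byte[] both ways (the password, salt, iteration count, and key length are placeholders, not recommendations):
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class HashEncoding {
    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16]; // in real code, fill this with SecureRandom
        PBEKeySpec spec = new PBEKeySpec("password".toCharArray(), salt, 10000, 160);
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
        byte[] hash = factory.generateSecret(spec).getEncoded(); // 20 bytes

        // Hexadecimal: 2 characters per byte -> 40 characters.
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        // Base64: 4 characters per 3 bytes -> 28 characters including padding.
        String base64 = Base64.getEncoder().encodeToString(hash);

        System.out.println(hex);
        System.out.println(base64);
    }
}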

Hashing function for strings in C

I want to implement a hashing technique in C where all permutations of a string have the same hash key.
e.g. abc & cab should both have the same key.
I have thought of adding the ASCII values & then checking the frequency of characters [important, as otherwise both abc & aad would have the same key, which we do not want].
But it doesn't seem to be very efficient.
Is there a better hashing function which resolves collisions well & also doesn't result in a sparse hash table?
Which hashing technique is used internally by Java [for strings] that not only minimizes collisions but also keeps the operations [insertion, deletion, search] fast enough?
Why not sort the string's characters before hashing?
The obvious technique is to simply sort the string. You could simply use the sorted string as the lookup key, or you can hash it with any algorithm deemed appropriate. Or you could use a run-length encoded (RLE) representation of your string (so the RLE of banana would be a3bn2), and optionally hash that.
A lot depends on what you're going to do with the hashes, and how resistant they must be to collisions. A simple CRC (cyclic redundancy check) might be adequate, or it might be that even cryptographic checksums such as MD5 or SHA-1 are not secure enough for you.
Which hashing technique is used internally by Java [for strings] that not only minimizes collisions but also keeps the operations [insertion, deletion, search] fast enough?
The basic "trick" used in Java for speed is caching of the hash value making it a member variable of a String and so you only compute it once. BUT this can only work in Java since strings are immutable.
The main rule about hashing is "Don't invent your own hashing algorithm. Ever." You could just sort the characters in the string and apply a standard hashing strategy.
Also read that if you are interested in hashing.
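Although this question is about C, in the spirit of the Java answers here, the sort-then-hash idea looks roughly like this (the same approach translates directly to C with qsort and any string hash of your choice):
import java.util.Arrays;

public class PermutationKey {
    // All permutations of a string share the same key, because the characters
    // are sorted into a canonical order before hashing.
    static int permutationHash(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars).hashCode();
    }

    public static void main(String[] args) {
        System.out.println(permutationHash("abc") == permutationHash("cab")); // true
        System.out.println(permutationHash("abc") == permutationHash("aad")); // false
    }
}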

How should I implement a string hashing function for these requirements?

Ok, I need a hashing function to meet the following requirements. The idea is to be able to link together directories that are part of the same logical structure but stored in different physical areas of the file system.
I need to implement it in Java, it must be consistent across execution sessions and it can return a long.
I will be hashing directory names / strings. This should work so that "somefolder1" and "somefolder2" will return different hashes, as would "JJK" and "JJL". I'd also like some idea of when clashes are likely to occur.
Any suggestions?
Thanks
Well, nearly all hashing functions have the property that small changes in the input yield large changes in the output, meaning that "somefolder1" and "somefolder2" will always yield a different hash.
As for clashes, just look at how large the hash output is. Java's own hashCode() returns an int, so you can expect clashes more often than with MD5 or SHA-1, for example, which yield 128 and 160 bits, respectively.
You shouldn't try creating such a function from scratch, though.
However, I didn't quite understand whether collisions must never occur in your use case or whether they are acceptable if rare. For linking folders I'd definitely use a guaranteed-to-be-unique identifier instead of something that might occur more than once.
You haven't described under what circumstances different strings should return the same hash.
In general, I would approach designing a hashing function by first implementing the equality function. That should show you which bits of data you need to include in the hash, and which should be discarded. If the equality between two different bits of data is complicated (e.g. case-insensitivity) then hopefully there will be a corresponding hash function for that particular comparison.
Whatever you do, don't assume that equal hashes mean equal keys (i.e. that hashing is unique) - that's always a cause of potential problems.
Java's String hashCode() will give you an int; if you want a long, you could take the least-significant 64 bits of the MD5 sum of the String.
Collisions could occur, your system must be prepared for that. Maybe if you give a little more detail as to what the hash codes will be used for, we can see if collisions would cause problems or not.
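A sketch of that MD5-to-long idea (taking the last 8 bytes of the 16-byte digest as the low 64 bits; which 8 bytes you keep doesn't really matter):
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class LongHash {
    // A stable 64-bit hash of a string, built from 8 bytes of its MD5 digest.
    static long md5Long(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(s.getBytes(StandardCharsets.UTF_8)); // 16 bytes
        return ByteBuffer.wrap(digest, 8, 8).getLong();      // bytes 8..15 as one long
    }
}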
With a uniformly random hash function with M possible values, the odds of a collision happening after N hashes are 50% when
N = .5 + SQRT(.25 - 2 * M * ln(.5))
Look up the birthday problem for more analysis.
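As a quick sanity check of that formula (the class name and the chosen hash sizes are just for illustration): a 32-bit hash space reaches 50% collision probability after roughly 77 thousand hashes, a 64-bit space after roughly 5 billion.
public class BirthdayBound {
    // N = 0.5 + sqrt(0.25 - 2 * M * ln(0.5)), the 50% collision point for M possible values.
    static double fiftyPercentPoint(double m) {
        return 0.5 + Math.sqrt(0.25 - 2 * m * Math.log(0.5));
    }

    public static void main(String[] args) {
        System.out.printf("32-bit space: %.0f hashes%n", fiftyPercentPoint(Math.pow(2, 32))); // ~77,164
        System.out.printf("64-bit space: %.0f hashes%n", fiftyPercentPoint(Math.pow(2, 64))); // ~5.06e9
    }
}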
You can avoid collisions if you know all your keys in advance, using perfect hashing.
