Using an MD5 Hash as an index - java

I am writing a MongoDB Collection that contains a specific set of data, and I want to run comparisons against that data by taking an MD5 (or maybe SHA256) hash of the data and basing comparisons off of that.
I was wondering if using a fixed-length character string of hex-numbers is the right way of doing this. Is there a better datatype to use, such as a "blob" or even a 64bit long integer to hold the values? (This may require me to use a hashing function that produces longs -- I don't know of one except maybe overriding the Java .hashCode() function with Eclispe?)
If there is a better way entirely, advise on best practice would be appreciated here!

Storing MD5 Hashes in MongoDB
You have to use String or Binary (half the size) in case you decide to store a MD5 hash (see here).
Best Hash Function
This is tough to answer, since it highly depends on the kind of data in your collection. I personally think that MD5 hashes are a good way, but again it depends on the use-case. In case you want to customize/optimize your hash, this post and this post might get you started. They cover some simple recipes on writing a custom hash function.

Related

Convert string to specific int

I am creating foreground notification with ID like so:
startForeground(1, notification)
When initialising the service I am sending to it some string (ex: Hello). I wish that the service and notification will be bind to this string so I wish to use it as my id. So, how can I convert string to unique ID? For example the word "Hello" will always generate 123 And the word Bye will always generate 456.
That sounds like you want a "Hash Code"; a value derived from some other information that is (hopefully, but not always) unique.
There are a lot of different algorithms available to do this and if you search for "hash code" you will find lots of them (especially in the security domain; sha, md5 etc)
However,
It sounds like you may not really need to get that complex (some of the more secure and "unique" hash code algorithms can be slow to calculate).
Is there any reason why you can't use the string itself?
String comparison may be slow, but maybe not as slow as a good hash. Also you might be able to use a Hash Table if you need a faster "lookup".. hashmap
Anyway, if you really do need a hash code from a string, a quick search found this (which looks reasonable) Sam Clarke; Kotlin Hash Strings

HashMap - Fastest access using a key with several fields

I have an object that is identified by 3 fields. One them is a String that represents 6 hex bytes, the other two are integers of not more than 1 bytes each. This all summed up is 8 bytes of data, which fits in a 64 bit integer.
I need to map these objects for fast access, and I can think of two approaches:
Use the 3 fields to generate a 64 bit key used to map the objects. This however would mean parsing the String to Hex for every access (and there will a lot of accesses, which need to be fast).
Use 3 HashMap levels, each nested inside the next, to represent the 3 identifying fields.
My question is which of these approaches should be the fastest.
Why not use a MultiKeyMap?
This might be not related to your question.
I have a suggestion for you.
Create an object with the 3 attributes that will form the key. Use the object has the key because it will be unique.
Map<ObjectKey,Object> map = new HashMap<>();
This makes sense for your use case? If you can add a bit more explanation maybe I can go further in suggest you possible solutions.
EDIT: You can override the equals and do something using this kind of logic:
#Override
public boolean equals(Object obj) {
if (!(obj instanceof Key))
return false;
ObjectKey objectKey= (Key) obj;
return this.key1.equals(objectKey.key1) && this.key2.equals(objectKey.key2) &&
...
this.keyN.equals(objectKey.keyN)
}
I would take the following steps:
Write it in the most readable way first, and profile it.
Refactor it to an implementation you think might be faster, then profile it again.
Compare.
Repeat.
Your key fits into a 64-bit value. Assuming you will build the HashMap in one go and then read from it multiple times (using it as a lookup table), my hunch is that using a Long type as the key of your HashMap will be about as fast as you can get.
You are concerned about having to parse the string as a hex number every time you look up a key in the map. What's the alternative? If you use a key containing the three separate fields, you will still have to parse the string to calculate its hash code (or, rather, the Java API implementation will calculate its hash code by parsing the string contents). The HashMap will not only call String.hashCode() but also String.equals(), so your string will be iterated twice. By contrast, calculating a Long and comparing it to the precalculated keys in the HashMap will consist of iterating the string only once.
If you use three levels of HashMap, as per your second suggestion, you will still have to calculate the hash code of your string, as well as having to look up the values of all three fields anyway, so the multi-level map doesn't give you any performance advantage.
You should also experiment with the HashMap constructor arguments to get the most efficiency. These will determine how efficiently your data will get spread into separate buckets.

What is the most memory efficient method of storing a large number of Strings in a map?

I want to store huge amounts of Strings in a Map<String, MagicObject>, so that the MagicObjects can be accessed quickly. There are so many entries to this Map that memory is becoming a bottleneck. Assuming the MagicObjects can't be optimized, what is the most efficient type of map I could use for this situation? I am currently using the following:
gnu.trove.map.hash.TCustomHashMap<byte[], MagicObject>
If your keys are long enough and have a lot of long enough common prefixes then you can save memory by using a trie (prefix tree) data structure. Answers to this question point to a a couple of Java implementations of trie.
To open mind, consider Huffman coding to compress your strings first before
put in map, as long as your strings are fixed(number and content of string don't change).
I'm a little late to this party but this question came up in a related search and piqued my interest. I don't usually answer Java questions.
There are so many entries to this Map that memory is becoming a bottleneck.
I doubt it.
For the storage of strings in memory to become a bottleneck you need an awfully large number of unique strings[1]. To put things into perspective, I recently worked with a 1.8m word dictionary (1.8m unique english words) and they took up around 1.6MB in RAM at runtime.
If you used every word in the dictionary as a key you'll still only use 1.6MB of RAM[2] to store the keys, hence memory cannot be your bottleneck.
What I suspect you are experiencing is the O(n^2) performance of string matching. By this I mean that as more keys are added performance slows down exponentially[3]. This is unavoidable if you are using strings are keys.
If you want to speed things up a bit, store each key into a hashtable that doesn't store duplicates and use the hash key as the key to your map.
NOTES:
[1] I'm assuming the strings are all unique or else you would not attempt to use them as a key into a map.
[2] Even if Java uses 2 bytes per character, it still only comes to 3.2MB of memory, total.
[3] It slows down even more if you choose the wrong data structure, such as an unbalanced binary tree, to store your values. I don't know how map stores values internally, but an unbalanced binary tree will have O(2^n) performance - pretty much the worst performance you can find.

Generating Unique Hash for URL crawled by crawler

I am implementing a Crawler and I wanted to generate a unique hash code for every URL crawled by my system. This will help me in checking duplicate URLs, matching complete URL can be a expensive stuff. Crawler will crawl millions of pages daily. So output of this hash function should be unique.
Unless you know every address ahead of time, and there happens to be a perfect hash for said set of addresses, this task is theoretically impossible.
By the pigeonhole principle, there must exist at least two strings that have the same Integer value no matter what technique you use for conversion, considering that Integers have a finite range, and strings do not. While addresses, in reality, are not infinitely long, you're still going to get multiple addresses that map to the same hash value. In theory, there are infinitely many strings that will map to the same Integer value.
So, in conclusion, you should probably just use a standard HashMap.
Additionally, you need to worry about the following:
www.stackoverflow.com http://www.stackoverflow.com
http://stackoverflow.com stackoverflow.com ...
which are all equivalent, so you would need to normalize first, then hash. While there are some algorithms that given the set first will generate a perfect hash, I doubt that that is necessary for your purposes.
I think the solution is to normalize URLs first by removing first parts like http:// or http://www. from the beginning and last parts like / or ?... or #....
After this cleaning, you should have a clean domain URL, and you can do a hash for it.
But the best solution is to use a bloomfilter (a probabilistic data structure) which can tell you of the URL was probably visited or guaranteed not visited

Hash Tables - Java

Am about to do a homework, and i need to store quite a lot of information (Dictionary) in a data structure of my choice. I heard people in my classroom saying hash-tables are the way to go. How come?
Advantages
When you first hear about hash tables they sound too good to be true. The reason is that is does not matter how many items there are searching, insertion (deletion sometimes) can take approximately 0(1) which is pretty much instantaneous from the user POV. Given its performance capabilities in terms of speed, hash tables are used mainly yet not limited to programs that need to look up thousands of items in less than a sec (for example spell-checkers / search engines). From my particular point of view I find H tables much easier to program than any sort of binary trees, and am not expert, so if you are a beginner that might too be an advantage.
Disadvantages
As hash tables are based on arrays they can be difficult to expand once created. Also I have read that for certain hash tables once full or getting full the speed when performing a task reduces notoriously. As a result of both when programming you will need to be fairly accurate of how many items you need to store. Additionally is not possible to search the items in the hash table in order for example from the smallest to the biggest, so if that is something you are looking for it might not be what you need.
Extra Info
Wikipedia article's - Hash Table - Big O Notation
Tutorial on Hash Tables - Tutorial
All how to's about Hash Tables - Java2S
Book Advice
I advice you to get a book called "Data Structures & Algorithms in Java - Second Edition - Robert Lafore" its a big book, but it has everything explained very subtle, for me is the only programming book so far i can read like is a novel.
Additional info regarding Big O notation - O(1)
O(1) doesn't mean "pretty much instantaneous" (an O(1) algorithm could take hours, weeks or years). It means (in this case) "is independent of the size of the collection" (assuming the hash code is good enough). – Ben Lings
Thanks to Ben for his clarification.
P.S: You might want to be more descriptive in the future when you ask a question that way other users can pin-point what you are looking for.
To help you out on deciding what type of collection is better for you, take a look at this Java Tutorials lesson:
Lesson: Introduction to Collections
Reading this you can see which collection fits your needs.
The best structure for your Dictionary would be a Prefix tree in which each node's 'key' is a letter from one of your words and each node's 'value' is the meaning of the word (dictionary translation). Word lookup is linear on the word's length (the same as a hashtable, since your hash function would ideally be linear), or O(1) if we consider words as a whole. The thing that is better than hash tables is that a hash table will take a lot of space in order to ensure O(1) access and, depending on the words in the dictionary, it might be very sparsely populated. The prefix tree on the other hand actually provides compression - the tree itself will contain all the original information in less space than before, since common parts of words are shared along the tree structure. Dictionaries usually have tens of thousands words, leaving a prefix tree the only viable solution.
P.S. As mentioned earlier, the tree has almost infinite scalability, in contrast to a hash table.
It depends on what you want to store and how you want to access it. You don't really provide enough information.
Hash tables provide O(1) lookup times so they can be used to retrieve values based on a key very quickly. If the hashing algorithm is expensive you may find that it is outperformed by other data structures. This is especially true if you are doing a lot of inserting and removing of items from the structure.
If you are planning on using a hash table implementation from the Java libraries, be sure to note that there are two of them - HashTable, and HashMap. One of them is commonly used these days, and one is outdated and generally found in legacy code. Do some research to find out which is which, and why the newer one is better.
A hashtable allows you to map keys to objects.
If you're storing values that have unique keys, and you will need to lookup the values by their keys, hashtables are the way to go.
If you just want to store an ordered set of objects without unique keys, an ordinary ArrayList is the way to go. (In particular, note that ordinary hashtables are unordered)
Hash Tables are good option but while using it you might have to decide what can be the good hash function.. this question can have many answers and depends on the programmer. I personally feel you can check out B+ tree or Trie. One of the main use of Trie is Dictionary representation.Trie in Wiki
Hope this helps !!

Categories