Hash Tables - Java

I'm about to do a homework assignment, and I need to store quite a lot of information (a dictionary) in a data structure of my choice. I heard people in my class saying hash tables are the way to go. How come?

Advantages
When you first hear about hash tables they sound too good to be true. The reason is that, no matter how many items there are, searching and insertion (and sometimes deletion) take approximately O(1) time, which is pretty much instantaneous from the user's point of view. Given that speed, hash tables are used mainly, though not only, in programs that need to look up thousands of items in less than a second (for example spell-checkers and search engines). Personally I find hash tables much easier to program than any sort of binary tree, and I am no expert, so if you are a beginner that might be an advantage too.
Disadvantages
As hash tables are based on arrays, they can be difficult to expand once created. I have also read that for certain hash tables performance drops noticeably once the table is full or nearly full. Because of both points, you need a fairly accurate idea of how many items you will store. Additionally, it is not possible to visit the items in a hash table in order, for example from smallest to largest, so if that is something you need, a hash table might not be the right choice.
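To make the trade-off concrete, here is a minimal sketch using the standard java.util collections. The class name DictionaryDemo and the sample words are made up for illustration. Note that java.util.HashMap resizes itself as it fills, so the initial capacity is only a hint, and that ordered traversal is done here by copying the keys into a TreeMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DictionaryDemo {
    // Builds a small word -> definition map; the entries are made-up sample data.
    static Map<String, String> buildDictionary() {
        // The initial capacity is only a hint: HashMap grows automatically.
        Map<String, String> dict = new HashMap<>(16);
        dict.put("pear", "a sweet fruit");
        dict.put("apple", "a round fruit");
        dict.put("zebra", "a striped animal");
        return dict;
    }

    // A HashMap has no useful iteration order, so sort by copying into a TreeMap.
    static List<String> sortedWords(Map<String, String> dict) {
        return new ArrayList<>(new TreeMap<>(dict).keySet());
    }

    public static void main(String[] args) {
        Map<String, String> dict = buildDictionary();
        System.out.println(dict.get("apple"));   // lookup by key, O(1) on average
        System.out.println(sortedWords(dict));   // words in alphabetical order
    }
}
```

If you need sorted order all the time, it may be simpler to use a TreeMap from the start and accept O(log n) lookups.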
Extra Info
Wikipedia articles - Hash Table - Big O Notation
Tutorial on Hash Tables - Tutorial
All how to's about Hash Tables - Java2S
Book Advice
I advise you to get a book called "Data Structures & Algorithms in Java - Second Edition" by Robert Lafore. It's a big book, but everything is explained very clearly; for me it is the only programming book so far that I can read like a novel.
Additional info regarding Big O notation - O(1)
O(1) doesn't mean "pretty much instantaneous" (an O(1) algorithm could take hours, weeks or years). It means (in this case) "is independent of the size of the collection" (assuming the hash code is good enough). – Ben Lings
Thanks to Ben for his clarification.
P.S.: You might want to be more descriptive when you ask a question in the future, so that other users can pinpoint what you are looking for.

To help you out on deciding what type of collection is better for you, take a look at this Java Tutorials lesson:
Lesson: Introduction to Collections
Reading this you can see which collection fits your needs.

The best structure for your dictionary would be a prefix tree (trie), in which each node's 'key' is a letter from one of your words and the node that ends a word holds the 'value', the meaning of the word (dictionary translation). Word lookup is linear in the word's length (the same as a hash table, since the hash function has to read the whole word), or O(1) if we consider words as a whole. Where the trie can beat a hash table is space: a hash table keeps spare capacity in order to ensure O(1) access and, depending on the words in the dictionary, it might be sparsely populated. The prefix tree, on the other hand, can provide a form of compression, since common prefixes of words are shared along the tree structure and stored only once. Dictionaries usually have tens of thousands of words, which is where this sharing pays off.
P.S. As mentioned earlier, the tree has almost infinite scalability, in contrast to a hash table.
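A minimal sketch of such a prefix tree, assuming each node keeps its children in a HashMap keyed by character. The class and method names (TrieDictionary, put, get) are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

class TrieDictionary {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String meaning; // non-null only on a node that ends a stored word
    }

    private final Node root = new Node();

    // Walks (and creates) one node per letter, then stores the meaning at the last node.
    void put(String word, String meaning) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.meaning = meaning;
    }

    // Follows the letters of the word; the cost depends on the word length only.
    String get(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return null; // no stored word has this prefix
            }
        }
        return node.meaning;
    }
}
```

Words such as "car" and "cart" share the nodes for c-a-r, which is the prefix sharing described above. Whether this actually saves memory compared to a HashMap depends on the per-node overhead, so it is worth measuring on your own word list.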

It depends on what you want to store and how you want to access it. You don't really provide enough information.
Hash tables provide O(1) lookup times on average, so they can be used to retrieve values based on a key very quickly. If the hashing algorithm is expensive, you may find that a hash table is outperformed by other data structures. This is especially true if you are doing a lot of inserting and removing of items from the structure.

If you are planning on using a hash table implementation from the Java libraries, be sure to note that there are two of them: Hashtable and HashMap. One of them is commonly used these days, and one is outdated and generally found in legacy code. Do some research to find out which is which, and why the newer one is better.

A hashtable allows you to map keys to objects.
If you're storing values that have unique keys, and you will need to lookup the values by their keys, hashtables are the way to go.
If you just want to store an ordered set of objects without unique keys, an ordinary ArrayList is the way to go. (In particular, note that ordinary hashtables are unordered)

Hash tables are a good option, but when using one you have to decide on a good hash function. That question can have many answers and depends on the programmer. Personally, I would suggest checking out a B+ tree or a trie. One of the main uses of a trie is dictionary representation: Trie in Wiki
Hope this helps !!

Related

Efficient Search Algorithm For Specific Data Fields

So I have been assigned to write algorithms for filtering/searching.
Task: Filter: search and list objects that fulfill specified attribute(s).
Say the whole system is a student registration record system.
I have data as shown below. I will need to filter and search by these attributes, e.g. by gender, student name, date of birth, etc.
Student Name, Gender, Date Of Birth, Mobile No
Is there a specific efficient algorithm or method for each of these fields?
For example, strings and integers each have their own type of efficient search algorithm, right?
Here's what I am going to do.
I am going to code a binary search algorithm for searching/filtering based on the fields above.
That's it. But that's easy, to be honest.
I am just curious: what is the proper coding approach you would take for an efficient search/filter algorithm for each of these fields?
I will not be using a sequential search algorithm, obviously, as this will involve a huge amount of data and iterating over every record would hurt performance.
A sequential search will be used only where the data is small.
Searching is a very broad topic, and the right approach depends completely on your use case.
While building an efficient search algorithm you should take the factors below into consideration:
What is the size of your data? Is it fixed, or does it vary periodically?
How often are you going to insert/modify/delete your data?
Is your data sorted or unsorted?
Do you need a prefix-based search such as autocomplete or longest-prefix search?
Now let's think about the solution/approach:
If your data is small and unsorted, you can try linear search (which has O(n) time complexity, where "n" is the size of your data/array).
If your data is already sorted, which is not always the case, you can use binary search, as its complexity is O(log n). If your data is not sorted, then sorting it first typically takes O(n log n); in Java, Arrays.sort() uses a dual-pivot quicksort for primitive arrays and a merge-sort variant (TimSort) for object arrays.
If fast retrieval is the main objective, you can think of a HashMap. The elements of a HashMap are indexed by hash code, so the time to search for any element is almost constant (if your hash function implementation is good).
Prefix-based search: since you mentioned searching by names, you also have the option of using the "trie" data structure.
Tries are an excellent option if you are performing insert/delete/update operations frequently.
Lookup of an element in a trie is O(k), where "k" is the length of the string being searched for.
Since you have registration data where insert, update and delete are common, the trie data structure is a good option to consider.
Also, check this link to choose between tries and hash tables: TriesVsMaps
(Figure: sample representation of a trie; image source: Hackerearth)
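As a rough sketch of two of the options above applied to the student records: a HashMap index for an exact-match filter (gender) and a binary search over a list sorted by name. The Student class and its field names are assumptions made for this example, not something given in the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StudentSearch {
    // Hypothetical record type with only the two fields used below.
    static final class Student {
        final String name;
        final String gender;
        Student(String name, String gender) {
            this.name = name;
            this.gender = gender;
        }
    }

    static final Comparator<Student> BY_NAME = Comparator.comparing((Student s) -> s.name);

    // Groups students by gender so that an exact-match filter is a single hash lookup.
    static Map<String, List<Student>> indexByGender(List<Student> students) {
        Map<String, List<Student>> index = new HashMap<>();
        for (Student s : students) {
            index.computeIfAbsent(s.gender, g -> new ArrayList<>()).add(s);
        }
        return index;
    }

    // Binary search by name; the list must already be sorted with BY_NAME.
    static Student findByName(List<Student> sortedByName, String name) {
        int pos = Collections.binarySearch(sortedByName, new Student(name, null), BY_NAME);
        return pos >= 0 ? sortedByName.get(pos) : null;
    }
}
```

Usage would be to sort the list once with list.sort(StudentSearch.BY_NAME) and build the index once, then reuse both for every query. If records change often, the sorted list and the index have to be kept up to date, which is part of the trade-off discussed above.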

What is the most memory efficient method of storing a large number of Strings in a map?

I want to store huge amounts of Strings in a Map<String, MagicObject>, so that the MagicObjects can be accessed quickly. There are so many entries to this Map that memory is becoming a bottleneck. Assuming the MagicObjects can't be optimized, what is the most efficient type of map I could use for this situation? I am currently using the following:
gnu.trove.map.hash.TCustomHashMap<byte[], MagicObject>
If your keys are long enough and have a lot of sufficiently long common prefixes, then you can save memory by using a trie (prefix tree) data structure. Answers to this question point to a couple of Java implementations of a trie.
To keep an open mind, consider Huffman coding to compress your strings before putting them in the map, as long as your strings are fixed (the number and content of the strings don't change).
I'm a little late to this party but this question came up in a related search and piqued my interest. I don't usually answer Java questions.
There are so many entries to this Map that memory is becoming a bottleneck.
I doubt it.
For the storage of strings in memory to become a bottleneck you need an awfully large number of unique strings[1]. To put things into perspective, I recently worked with a 1.8m word dictionary (1.8m unique english words) and they took up around 1.6MB in RAM at runtime.
If you used every word in the dictionary as a key you'll still only use 1.6MB of RAM[2] to store the keys, hence memory cannot be your bottleneck.
What I suspect you are experiencing is the O(n^2) performance of string matching. By this I mean that as more keys are added, performance degrades quadratically[3]. This is unavoidable if you are using strings as keys.
If you want to speed things up a bit, store each key into a hashtable that doesn't store duplicates and use the hash key as the key to your map.
NOTES:
[1] I'm assuming the strings are all unique or else you would not attempt to use them as a key into a map.
[2] Even if Java uses 2 bytes per character, it still only comes to 3.2MB of memory, total.
[3] It slows down even more if you choose the wrong data structure, such as an unbalanced binary tree, to store your values. I don't know how the map stores values internally, but an unbalanced binary tree can degrade to O(n) per lookup in the worst case, far worse than the O(log n) of a balanced tree.

How is data retrieved from hash tables for collisions

I understand that hash tables are designed to allow easy storage and retrieval of data when storing massive amounts of it. However, when retrieving a specific piece of data, how do they retrieve it if it was stored in an alternative location due to a collision?
Say there are 10 indexes and data A was stored in index 3, and data E runs into a collision because data A is stored in index 3 already, so collision resolution puts it in index 7 instead. When it comes time to retrieve data E, how does it retrieve E instead of using the first hash function and retrieving A?
Sorry if this is a dumb question. I'm still somewhat new to programming.
I don't believe that Java will resolve a hashing collision by moving an item to a different bucket. Doing that would make it difficult if not impossible to determine the correct bucket into which it should have been hashed. If you read this SO article carefully, you will note that it points out two tools which Java has at its disposal. First, it maintains a list of values for each bucket* (read note below). Second, if the list becomes too large it can increase the number of buckets.
* I believe that the list has now been replaced with a tree (since Java 8, HashMap converts a bucket's list into a balanced tree once the bucket grows large). This ensures O(log n) performance for lookup, insertion, etc., in the worst case, whereas with a list the worst-case performance was O(n).
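To show how retrieval works when two keys land in the same bucket, here is a toy table with one list per bucket. This is only an illustration of the chaining idea, not how java.util.HashMap is actually written; the class name ChainedTable and the deliberately weak hash (the key's length) are assumptions chosen to make collisions easy to produce.

```java
import java.util.ArrayList;
import java.util.List;

class ChainedTable {
    private static final class Entry {
        final String key;
        String value;
        Entry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<List<Entry>> buckets = new ArrayList<>();

    ChainedTable(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Deliberately weak hash: all keys of the same length share a bucket.
    private List<Entry> bucketFor(String key) {
        return buckets.get(key.length() % buckets.size());
    }

    void put(String key, String value) {
        List<Entry> bucket = bucketFor(key);
        for (Entry e : bucket) {
            if (e.key.equals(key)) { // same key: overwrite the old value
                e.value = value;
                return;
            }
        }
        bucket.add(new Entry(key, value)); // collision: both entries stay in this bucket
    }

    String get(String key) {
        // The hash picks the bucket; equals() picks the right entry inside it.
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

With this table, the keys "A" and "E" both hash to the same bucket, yet get("E") still returns E's value because each entry keeps its key and the lookup compares keys with equals().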

Modifying .tim and .tip files in Lucene Index

I have a Lucene application with multiple indices in which the relevancy scoring suffers due to differences in the term frequencies across the different indices. My understanding is that the Term Dictionary (.tim file) contains "term statistics" such as the document frequency statistics on each term. I was thinking that one approach might be to modify the .tim files for each index (and related segments) and update the "term statistics". Is it possible to overwrite or modify the .tim and .tip files in such a way?
relevancy scoring suffers
From the FAQ:
score values are meaningful only for purposes of comparison between other documents for the exact same query and the exact same index.
when you try to compute a percentage, you are setting up an implicit comparison with scores from other queries.
Is it possible? I suppose, but it strikes me as about as good an idea as attempting to change an application by directly modifying the compiled binaries.
If you need very specific things from scoring, then you should generally implement a Similarity that does what you need. Extending TFIDFSimilarity is often a good idea. I'm not really clear on what the exact problem is, so I can't provide any more specific guidance than that, but perhaps that points you in the right general direction.

Parsing a lot of text based on a constant set of search terms

I have a set of search terms like [+dog -"jack russels" +"fox terrier"], [+cat +persian -tabby]. These could be quite long with maybe 30 sub-terms making up each term.
I now have some online news articles extracts such as ["My fox terrier is the cutest dog in the world..."] and ["Has anyone seen my lost persian cat? He went missing ..."]. They're not too long, perhaps 500 characters at most each.
In traditional search engines one expects a huge number of articles that are pre-processed into indexes, allowing for speed-ups when searching for given 'search terms', using set theory/boolean logic to reduce the articles to only the ones that match the phrases. In this situation, however, the number of search terms is on the order of 10^5, and I'd like to be able to process a single article at a time, to see ALL the sets of search terms that article would be matched with (i.e. all the + terms are in the text and none of the - terms).
I have a possible solution using two maps (one for the positive sub-phrases, one for the negative sub-phrases), but I don't think it'll be very efficient.
First prize would be a library that solves this problem, second prize is a push in the right direction towards solving this.
Kind regards,
Assuming all the positive sub-terms are required for a match:
Put all the sub-terms from your search terms into a hashtable. The sub-term is the key, the value is a pointer to the full search term data structure (which should include a unique id and a map of sub-terms to a boolean).
Additionally, when processing a news item, create a "candidates" map, indexed by the term id. Each candidate structure has a pointer to the term definition, a set that contains the seen sub-terms and a "rejected" flag.
Iterate over the words of the news article, looking each one up in the hashtable.
For each hit, look up the candidate entry. If not there, create and add an empty one.
If the candidate's rejected flag is set, you are done with this hit.
Otherwise, look up the sub-term in the term data structure.
If negative, set the rejected flag.
If positive, add the sub-term to the set of seen sub-terms.
In the end, iterate over the candidates. All candidates that are not rejected and whose seen set has as many entries as the term has positive sub-terms are your hits.
Implementation: https://docs.google.com/document/d/1boieLJboLTy7X2NH1Grybik4ERTpDtFVggjZeEDQH74/edit
Runtime is O(n * m) where n is the number of words in the article and m is the maximum number of terms sharing the same sub-term (expected to be relatively small).
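The steps above could be sketched roughly as follows. This is not the linked implementation; it is a simplified version that assumes every sub-term is a single lowercase word (multi-word phrases such as "fox terrier" would need extra handling, for example by also looking up word pairs), and all class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TermMatcher {
    static final class SearchTerm {
        final int id;
        // sub-term -> true if required (+), false if forbidden (-)
        final Map<String, Boolean> subTerms = new HashMap<>();
        int positiveCount;

        SearchTerm(int id) {
            this.id = id;
        }

        // Assumes each sub-term is added at most once per term.
        SearchTerm add(String subTerm, boolean positive) {
            subTerms.put(subTerm, positive);
            if (positive) {
                positiveCount++;
            }
            return this;
        }
    }

    private static final class Candidate {
        final SearchTerm term;
        final Set<String> seen = new HashSet<>();
        boolean rejected;

        Candidate(SearchTerm term) {
            this.term = term;
        }
    }

    // sub-term -> all search terms that contain it
    private final Map<String, List<SearchTerm>> index = new HashMap<>();

    void register(SearchTerm term) {
        for (String sub : term.subTerms.keySet()) {
            index.computeIfAbsent(sub, k -> new ArrayList<>()).add(term);
        }
    }

    // Returns the ids of all registered terms that the article satisfies.
    Set<Integer> match(String article) {
        Map<Integer, Candidate> candidates = new HashMap<>();
        for (String word : article.toLowerCase().split("\\W+")) {
            for (SearchTerm term : index.getOrDefault(word, Collections.emptyList())) {
                Candidate c = candidates.computeIfAbsent(term.id, k -> new Candidate(term));
                if (c.rejected) {
                    continue; // a forbidden sub-term was already seen
                }
                if (term.subTerms.get(word)) {
                    c.seen.add(word);
                } else {
                    c.rejected = true;
                }
            }
        }
        Set<Integer> hits = new TreeSet<>();
        for (Candidate c : candidates.values()) {
            if (!c.rejected && c.seen.size() == c.term.positiveCount) {
                hits.add(c.term.id);
            }
        }
        return hits;
    }
}
```

The index is built once from all search terms; match() is then called per article and only touches terms that share at least one word with it, which is where the O(n * m) bound comes from.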
First of all, I think building a suffix tree of your document makes the searching much faster, since you need to build it only once but can reuse it for every search term you match against the document.
Second, you need to check all of the search terms (both + and - ones) to be sure the answer is yes (that is, the document matches the query). For a "no" answer, however, you don't! If the answer is no, then the order in which you match the search terms against the document really matters: one order may give you a faster "no" than another. Now the question is, "What is the optimal order to get a fast NO?" It really depends on the application, but a good starting point is that multi-word terms such as "red big cat" appear in documents less often than short terms such as "cat". So check the +"Loo ooo ooo ooo ooo ong" terms and the -"short" terms first.
