Efficient Search Algorithm For Specific Data Fields

Efficient Search Algorithm For Specific Data Fields - java

So I am actually been assigned to write algorithms on filtering/searching.
Task : Filter: search and list objects that fulfill specified attribute(s)
Say The whole system is a student registration record system.
I have data as shown below. I will need to filter and search by these attributes say search/filter by gender or student name or date of birth etc.
Student Name
, Gender
, Date Of Birth
, Mobile No
Is there specific efficient algorithm formula or method for each of these field.
Example , strings and integers each has their own type of efficient search algorithm right?
Here's what I am going to do.
I am going to code a binary search algorithm for searching/filtering based on these fields above.
That's it. But yeah that's easy to be honest.
But I am just curious like what's the proper and appropriate coding approach for a efficient search/filter algorithm for each of these fields will you guys do?
I will not be using sequential search algorithm obviously as this will involve huge data so I am not going to iterate each of these data to downgrade efficiency performance.
Sequential search algorithm will be used when needed if data is less.

Searching is a very broad topic and it completely depends upon your use case.
while building an efficient Searching algorithm you should take below factors into consideration
What's the size of your data? -is it fixed or it keeps varying
periodically?
How often you are going to Insert/modify/delete
your data?
Is your data sorted or unsorted?
Do you need a prefix based search like autosearch,autocomplete,longest prefix search etc?
Now let's think about the solution/approach
if your data is less and unsorted as you can try Linear
Search(which has O(n)time complexity where "n" is size of your
data/array)
if your data is already sorted which is not always the case you can
use Binary search as it's complexity is 0(log n). if your
data is not sorted then sorting the data again takes
(nlogn)~typically if you are using Java,Arrays.sort() by default uses Merge sort or Quick sort which is (nlogn).
if faster retrieval is the main object you can think of HashMaps or HashMaps. the elements of Hashmap are indexed by Hashcode, the
time to search for any element would almost be 1 or constant time(if
your hash function implementation is good)
Prefix based search :since you mentioned about searching by Names,you also have the option of using
"Tries" data structure.
Tries are excellent option if you are performing Insert/Delete/Update functionalities frequently .
Lookup of an elements in a Trie is 0(k) where "k" is the length of the string to be searched.
Since you have registration data where insert,update,deletion is common TRIES Data Structure is a good option to consider.
Also,check this link to choose between Tries and HashTables TriesVsMaps
Below is the sample representation of Tries(img src:Hackerearth)

Related

Data structure to use for binary search on large data sets

I am trying to implement binary search into my application.
I am creating a method to go through the user's contact list, add the numbers to an array, sort it and then use a binary search to locate numbers etc.
But I was thinking what kind of array should I just use ArrayList, then sort it and then implement a binary search.
Or is there a way to store the data? like sets, or maps etc?
Scenario - I'll be getting the users contacts from their phone. Every number, of course, needs to be stored in an array or list (whichever is better).
Then sort that array.
Now I want to search for a number using a Binary search. Since a user can have a large contact set, I thought this would be a good method

There are three basic options:
Sorted list or array + binary search.
Tree-based structure like TreeMap.
Hash-based structure like HashMap.
The question is why you need binary search. If you simply want to look up contact info by number, then a HashMap would probably be a better choice from time complexity perspective.
Binary search would make sense if you have some order in keys and are interested in something like range queries. But even in this case a tree-based structure like TreeMap would be a better choice. Not so much for the time complexity (that will be pretty much the same) but more from the interface point of view.

I would suggest using a HashMap, since it has O(1) look-up vs O(log n) look-up in a sorted array.
So if your main concern is look-up (search), go for Hash.

How is data retrieved from hash tables for collisions

I understand that hash tables are designed to have easy sorting and retrieval of data when storing massive amounts of them. However, when retrieving a specific piece of data, how do they retrieve it if they were stored in an alternative location due to collision?
Say there are 10 indexes and data A was stored in index 3 and data E runs into collision because data A is stored in index 3 already and collision prevention puts it in index 7 instead. When it comes time to retrieve data E, how does it retrieve E instead of using the first hash function and retrieving A instead?
Sorry if this is dumb question. I'm still somewhat new to programming.

I don't believe that Java will resolve a hashing collision by moving an item to a different bucket. Doing that would make it difficult if not impossible to determine the correct bucket into which it should have been hashed. If you read this SO article carefully, you will note that it points out two tools which Java has at its disposal. First, it maintains a list of values for each bucket* (read note below). Second, if the list becomes too large it can increase the number of buckets.
I believe that the list has now been replaced with a tree. This will ensure O(n*lgn) performance for lookup, insertion, etc., in the worst case, whereas with a list the worst case performance was O(n).

What data structure to use for indexing data for partial %infix% searching?

Imagine you have a huge cache of data that is to be searched through by 4 ways :
exact match
prefix%
%suffix
%infix%
I'm using Trie for the first 3 types of searching, but I can't figure out how to approach the fourth one other than sequential processing of huge array of elements.

If your dataset is huge cosider using a search platform like Apache Solr so that you dont end up in a performance mess.

You can construct a navigable map or set (eg TreeMap or TreeSet) for the 2 (with keys in normal order) and 3 (keys in reverse)
For option 4 you can construct a collection with a key for every starting letter. You can simplify this depending on your requirement. This can lead to more space being used but get O(log n) lookup times.

For #4 I am thinking if you pre-compute the number of occurances of each character then you can look up in that table for entires that have at least as many occurances of the characters in the search string.
How efficient this algorithm is will probably depend on the nature of the data and the search string. It might be useful to give some examples of both here to get better answers.

Hash Tables - Java

Am about to do a homework, and i need to store quite a lot of information (Dictionary) in a data structure of my choice. I heard people in my classroom saying hash-tables are the way to go. How come?

Advantages
When you first hear about hash tables they sound too good to be true. The reason is that is does not matter how many items there are searching, insertion (deletion sometimes) can take approximately 0(1) which is pretty much instantaneous from the user POV. Given its performance capabilities in terms of speed, hash tables are used mainly yet not limited to programs that need to look up thousands of items in less than a sec (for example spell-checkers / search engines). From my particular point of view I find H tables much easier to program than any sort of binary trees, and am not expert, so if you are a beginner that might too be an advantage.
Disadvantages
As hash tables are based on arrays they can be difficult to expand once created. Also I have read that for certain hash tables once full or getting full the speed when performing a task reduces notoriously. As a result of both when programming you will need to be fairly accurate of how many items you need to store. Additionally is not possible to search the items in the hash table in order for example from the smallest to the biggest, so if that is something you are looking for it might not be what you need.
Extra Info
Wikipedia article's - Hash Table - Big O Notation
Tutorial on Hash Tables - Tutorial
All how to's about Hash Tables - Java2S
Book Advice
I advice you to get a book called "Data Structures & Algorithms in Java - Second Edition - Robert Lafore" its a big book, but it has everything explained very subtle, for me is the only programming book so far i can read like is a novel.
Additional info regarding Big O notation - O(1)
O(1) doesn't mean "pretty much instantaneous" (an O(1) algorithm could take hours, weeks or years). It means (in this case) "is independent of the size of the collection" (assuming the hash code is good enough). – Ben Lings
Thanks to Ben for his clarification.
P.S: You might want to be more descriptive in the future when you ask a question that way other users can pin-point what you are looking for.

To help you out on deciding what type of collection is better for you, take a look at this Java Tutorials lesson:
Lesson: Introduction to Collections
Reading this you can see which collection fits your needs.

The best structure for your Dictionary would be a Prefix tree in which each node's 'key' is a letter from one of your words and each node's 'value' is the meaning of the word (dictionary translation). Word lookup is linear on the word's length (the same as a hashtable, since your hash function would ideally be linear), or O(1) if we consider words as a whole. The thing that is better than hash tables is that a hash table will take a lot of space in order to ensure O(1) access and, depending on the words in the dictionary, it might be very sparsely populated. The prefix tree on the other hand actually provides compression - the tree itself will contain all the original information in less space than before, since common parts of words are shared along the tree structure. Dictionaries usually have tens of thousands words, leaving a prefix tree the only viable solution.
P.S. As mentioned earlier, the tree has almost infinite scalability, in contrast to a hash table.

It depends on what you want to store and how you want to access it. You don't really provide enough information.
Hash tables provide O(1) lookup times so they can be used to retrieve values based on a key very quickly. If the hashing algorithm is expensive you may find that it is outperformed by other data structures. This is especially true if you are doing a lot of inserting and removing of items from the structure.

If you are planning on using a hash table implementation from the Java libraries, be sure to note that there are two of them - HashTable, and HashMap. One of them is commonly used these days, and one is outdated and generally found in legacy code. Do some research to find out which is which, and why the newer one is better.

A hashtable allows you to map keys to objects.
If you're storing values that have unique keys, and you will need to lookup the values by their keys, hashtables are the way to go.
If you just want to store an ordered set of objects without unique keys, an ordinary ArrayList is the way to go. (In particular, note that ordinary hashtables are unordered)

Hash Tables are good option but while using it you might have to decide what can be the good hash function.. this question can have many answers and depends on the programmer. I personally feel you can check out B+ tree or Trie. One of the main use of Trie is Dictionary representation.Trie in Wiki
Hope this helps !!

Data structure for search engine in JAVA?

I m MCS 2nd year student.I m doing a project in Java in which I have different images. For storing description of say IMAGE-1, I have ArrayList named IMAGE-1, similarly for IMAGE-2 ArrayList IMAGE-2 n so on.....
Now I need to develop a search engine, in which i need to find a all image's whose description matches with a word entered in search engine..........
FOR EX If i enter "computer" then I should be able to find all images whose description contain "computer".
So my question is...
How should i do this efficiently?
How should i maintain all those
ArrayList since i can have 100 of
such...? or should i use another
data structure instead of ArrayList?

A simple implementation is to tokenize the description and use a Map<String, Collection<Item>> to store all items for a token.
Building:
for(String token: tokenize(description)) map.get(token).add(item)
(A collection is needed as multiple entries could be found for a token. The initialization of the collection is missing in the code. But the idea should be clear.)
Use:
List<Item> result = map.get("Computer")
The the general purpose HashMap implementation is not the most efficient in this case. When you start getting memory problems you can look into a tree implementation that is more efficient (like radix trees - implementation).
The next step could be to use some (in-memory) database. These could be relational (HSQL) or not (Berkeley DB).

If you have a small number of images and short descriptions (< 1000 characters), load them into an array and search for words using String.indexOf() (i.e. one entry in the array == one complete image description). This is efficient enough for, say, less than 10'000 images.
Use toLowerCase() to fold the case of the characters (so users will find "Computer" when they type "computer"). String.indexOf() will also work for short words (using "comp" to find "Computer" or "compare").
If you have lots of images and long descriptions and/or you want to give your users some comforts for the search (like Google does), then use Lucene.

There is no simple, easy-to-use data structure that supports efficient fulltext search.
But do you actually need efficiency? Is this a desktop app or a web app? In the former case, don't worry about efficiency, a modern CPU can search through megabytes of text in fractions of a second - simply look through all your descriptions using String.contains() (or a regexp to allow more flexible searches).
If you really need efficiency (such as for a webapp where many people could do searches at the same time), look into Apache Lucene.
As for your ArrayLists, it seems strange to use one for the description of a single image. Why a list, what does the index represent? Lines? If so, and unless you actually need to access lines directly, replace the lists with a simple String - it can contain newline characters just fine.

I would suggest you to use the Hashtable class or to organize your content into a tree to optimize searching.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.