I am trying to implement binary search in my application.
I am writing a method that goes through the user's contact list, adds the numbers to an array, sorts it, and then uses binary search to locate numbers.
But I was wondering what kind of collection to use: should I just use an ArrayList, sort it, and then implement a binary search?
Or is there a better way to store the data, such as sets or maps?
Scenario - I'll be getting the user's contacts from their phone. Every number, of course, needs to be stored in an array or a list (whichever is better).
Then that array is sorted.
Now I want to search for a number using binary search. Since a user can have a large contact set, I thought this would be a good method.
There are three basic options:
Sorted list or array + binary search.
Tree-based structure like TreeMap.
Hash-based structure like HashMap.
The question is why you need binary search at all. If you simply want to look up contact info by number, then a HashMap would probably be the better choice from a time-complexity perspective.
Binary search would make sense if your keys have some order and you are interested in something like range queries. But even in that case a tree-based structure like TreeMap would be a better choice, not so much for the time complexity (which will be pretty much the same) but from the interface point of view.
I would suggest using a HashMap, since it has O(1) look-up vs O(log n) look-up in a sorted array.
So if your main concern is look-up (search), go for Hash.
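To make the comparison concrete, here is a minimal sketch of both options side by side; the contact numbers and names are made up for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ContactLookup {
    public static void main(String[] args) {
        // Hypothetical contact data (parallel arrays for the sketch).
        String[] numbers = {"555-0103", "555-0101", "555-0102"};
        String[] names   = {"Carol",    "Alice",    "Bob"};

        // Option 1: sorted array + binary search.
        // O(n log n) to sort once, O(log n) per lookup.
        String[] sorted = numbers.clone();
        Arrays.sort(sorted);
        int idx = Arrays.binarySearch(sorted, "555-0102");
        System.out.println("binarySearch found at index: " + idx); // 1

        // Option 2: HashMap keyed by number.
        // O(n) to build, O(1) expected per lookup, no sorting needed.
        Map<String, String> byNumber = new HashMap<>();
        for (int i = 0; i < numbers.length; i++) {
            byNumber.put(numbers[i], names[i]);
        }
        System.out.println("HashMap lookup: " + byNumber.get("555-0102")); // Bob
    }
}
```

Note that `Arrays.binarySearch` returns a negative value when the key is absent, so real code should check `idx >= 0` before using the index.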
Related
So I have actually been assigned to write algorithms for filtering/searching.
Task: Filter - search and list objects that fulfill specified attribute(s).
Say the whole system is a student registration record system.
I have the data shown below. I will need to filter and search by these attributes, say search/filter by gender, student name, date of birth, etc.
Student Name
Gender
Date Of Birth
Mobile No
Is there a specific efficient algorithm or method for each of these fields?
For example, strings and integers each have their own type of efficient search algorithm, right?
Here's what I am going to do.
I am going to code a binary search algorithm for searching/filtering based on the fields above.
That's it. But that's easy, to be honest.
I am just curious: what would be the proper and appropriate coding approach for an efficient search/filter algorithm for each of these fields?
I will obviously not be using a sequential search algorithm, as this will involve huge amounts of data and I don't want to iterate over all of it and degrade performance.
A sequential search algorithm will be used only when the data set is small.
Searching is a very broad topic and it depends entirely on your use case.
While building an efficient searching algorithm you should take the factors below into consideration:
What's the size of your data? Is it fixed, or does it keep varying periodically?
How often are you going to insert/modify/delete your data?
Is your data sorted or unsorted?
Do you need a prefix-based search, such as auto-search, autocomplete, or longest-prefix search?
Now let's think about the solution/approach.
If your data set is small and unsorted, you can try linear search (which has O(n) time complexity, where "n" is the size of your data/array).
If your data is already sorted, which is not always the case, you can use binary search, as its complexity is O(log n). If your data is not sorted, then sorting it first takes O(n log n); typically, if you are using Java, Arrays.sort() uses merge sort or a dual-pivot quicksort, both of which are O(n log n).
If faster retrieval is the main objective, you can think of HashMaps. The elements of a HashMap are indexed by hash code, so the time to search for any element is almost constant (if your hash function implementation is good).
Prefix-based search: since you mentioned searching by names, you also have the option of using the "trie" data structure.
Tries are an excellent option if you perform insert/delete/update operations frequently.
Lookup of an element in a trie is O(k), where "k" is the length of the string being searched.
Since you have registration data where insert, update and deletion are common, the trie data structure is a good option to consider.
Also, check this link to choose between tries and hash tables: TriesVsMaps.
(Sample trie diagram omitted; img src: Hackerearth.)
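As a concrete illustration of the trie option, here is a minimal sketch supporting insert and exact-word lookup; the class and method names are my own, not from any library:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie: insert and lookup both run in O(k),
// where k is the length of the word, independent of the number of words stored.
class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            // Create the child node for this character if it doesn't exist yet.
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.endOfWord = true;
    }

    boolean contains(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false; // path breaks: word not present
        }
        return cur.endOfWord;
    }
}
```

For the student-record use case, you would insert every name once and then answer name lookups in time proportional to the name's length, no matter how many records there are.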
I have two lists of phone numbers. The 1st list is a subset of the 2nd list. I ran the two different algorithms below to determine which phone numbers are contained in both lists.
Way 1:
Sort the 1st list: Arrays.sort(firstList);
Loop over the 2nd list to find matching elements: if Arrays.binarySearch(firstList, <each element of the 2nd list>) >= 0 then OK
Way 2:
Convert the 1st list into a HashMap whose key/value pairs are (<each element of the 1st list>, Boolean.TRUE)
Loop over the 2nd list to find matching elements: if firstMap.containsKey(<each element of the 2nd list>) then OK
The result: Way 2 ran in 5 seconds, considerably faster than Way 1's 39 seconds. I can't understand why.
I'd appreciate any comments.
Because hashing is O(1) and binary searching is O(log N).
HashMap relies on a very efficient technique called 'hashing', which has been in use for many years and is reliable and effective. Essentially, the way it works is to split the items in the collection into much smaller groups that can be accessed extremely quickly. Once the group is located, a less efficient search mechanism can be used to locate the specific item.
Identifying the group for an item is done by an algorithm called a 'hash function'. In Java, the hashing method is Object.hashCode(), which returns an int identifying the group. As long as hashCode() is well defined for your class, you should expect HashMap to be very efficient, which is exactly what you've found.
There's a very good discussion on the various types of Map and which to use at Difference between HashMap, LinkedHashMap and TreeMap
My shorthand rule of thumb is to always use HashMap unless you can't define an appropriate hashCode for your keys or the items need to be ordered (either by natural order or by insertion order).
Look at the source code for HashMap: it creates and stores a hash for each added (key, value) pair, then the containsKey() method calculates a hash for the given key, and uses a very fast operation to check if it is already in the map. So most retrieval operations are very fast.
Way 1:
Sorting: around O(n log n)
Searching: around O(log n) per lookup
Way 2:
Creating the hash table: O(n) for low density (no collisions)
containsKey(): O(1)
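The two ways can be sketched like this (the phone numbers are made up; on millions of entries the per-lookup difference between O(log n) and O(1) is what produces the 39 s vs 5 s gap):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompareLists {
    public static void main(String[] args) {
        String[] firstList  = {"111", "333", "555"};
        String[] secondList = {"111", "222", "333", "444", "555"};

        // Way 1: sort + binary search. O(n log n) to sort, O(log n) per lookup.
        String[] sortedFirst = firstList.clone();
        Arrays.sort(sortedFirst);
        List<String> matchesWay1 = new ArrayList<>();
        for (String s : secondList) {
            if (Arrays.binarySearch(sortedFirst, s) >= 0) {
                matchesWay1.add(s);
            }
        }

        // Way 2: HashMap. O(n) to build, O(1) expected per lookup.
        Map<String, Boolean> firstMap = new HashMap<>();
        for (String s : firstList) {
            firstMap.put(s, Boolean.TRUE);
        }
        List<String> matchesWay2 = new ArrayList<>();
        for (String s : secondList) {
            if (firstMap.containsKey(s)) {
                matchesWay2.add(s);
            }
        }

        System.out.println(matchesWay1); // [111, 333, 555]
        System.out.println(matchesWay2); // [111, 333, 555]
    }
}
```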
I am required to implement a general dictionary in Java that allows efficient, O(log N) or better, insertion, deletion and random access.
My question is: what type of tree will give me the best time performance for a huge number of insertions and deletions? AVL, red-black, plain binary search, splay or B-trees?
You can use the trie data structure to implement a dictionary. Building the trie takes time proportional to the total number of characters inserted; after that you can search for, insert and delete a word in O(k), where "k" is the length of the word.
For more background you can refer to this NPTEL LINK, which covers the basics of the trie data structure.
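If you specifically want one of the balanced trees you listed, note that Java's TreeMap is a red-black tree out of the box, giving O(log n) insertion, deletion and lookup without implementing anything yourself. A minimal sketch:

```java
import java.util.TreeMap;

public class TreeDictionary {
    public static void main(String[] args) {
        // TreeMap is a red-black tree: put, remove and get are all O(log n).
        TreeMap<String, String> dict = new TreeMap<>();
        dict.put("apple", "a fruit");
        dict.put("binary", "base two");
        dict.put("trie", "a prefix tree");

        System.out.println(dict.get("binary"));        // base two
        dict.remove("apple");                          // O(log n) deletion
        System.out.println(dict.containsKey("apple")); // false

        // Bonus over a hash table: keys stay sorted, enabling range queries.
        System.out.println(dict.firstKey());           // binary
    }
}
```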
I am looking for some kind of map with a fixed size, for example 20 entries, that keeps only the lowest values. Say I'm evaluating some function and inserting the results into my map (I need a map because I have to keep key-value pairs), but I only want to keep the 20 lowest results. I was thinking about sorting and then removing the last element, but I need to do this for millions of records, so sorting every time I add a value is not efficient. Is there a better way?
Thanks for any help.
There is no built-in data structure for this in Java. You can try looking for one in the Guava library. Otherwise, think about using a LinkedHashMap or a TreeMap for this; you can wrap it in your own class to take care of the limiting.
If you care about efficiency, be advised that TreeMap is in fact a red-black tree internally, so put() has a time complexity of O(log n).
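A wrapper along those lines might look like this; the class name and the eviction policy are my own choices, not a standard API:

```java
import java.util.TreeMap;

// Keeps only the `limit` entries with the smallest keys.
// Each put() is O(log n); when the map overflows, the largest key is evicted,
// so there is never any need to re-sort.
class LowestKeysMap<K extends Comparable<K>, V> {
    private final TreeMap<K, V> map = new TreeMap<>();
    private final int limit;

    LowestKeysMap(int limit) {
        this.limit = limit;
    }

    void put(K key, V value) {
        map.put(key, value);
        if (map.size() > limit) {
            map.pollLastEntry(); // drop the entry with the largest key
        }
    }

    TreeMap<K, V> view() {
        return map;
    }
}
```

With `new LowestKeysMap<Double, String>(20)` you can feed in millions of results and at any moment the map holds only the 20 entries with the smallest keys.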
Imagine you have a huge cache of data that is to be searched in 4 ways:
exact match
prefix%
%suffix
%infix%
I'm using a trie for the first 3 types of search, but I can't figure out how to approach the fourth one, other than sequentially processing the huge array of elements.
If your dataset is huge, consider using a search platform like Apache Solr so that you don't end up in a performance mess.
You can construct a navigable map or set (e.g. TreeMap or TreeSet) for case 2 (with keys in normal order) and case 3 (with keys reversed).
For option 4 you can construct a collection keyed by every starting letter. You can simplify this depending on your requirements. This can lead to more space being used, but gives O(log n) lookup times.
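The navigable-set idea for case 2 can be sketched with TreeSet.subSet; case 3 works the same way on a second set holding the reversed strings:

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixSearch {
    // Returns all entries starting with `prefix`: O(log n) to locate the
    // range, plus time proportional to the number of matches returned.
    static SortedSet<String> withPrefix(TreeSet<String> set, String prefix) {
        // The half-open range [prefix, prefix + '\uffff') covers exactly
        // the strings that begin with the prefix.
        return set.subSet(prefix, prefix + '\uffff');
    }

    public static void main(String[] args) {
        TreeSet<String> data = new TreeSet<>();
        data.add("alice");
        data.add("alan");
        data.add("bob");

        System.out.println(withPrefix(data, "al")); // [alan, alice]
    }
}
```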
For #4, I am thinking that if you pre-compute the number of occurrences of each character, then you can look in that table for entries that have at least as many occurrences of each character as the search string.
How efficient this algorithm is will probably depend on the nature of the data and of the search string. It might be useful to give some examples of both here to get better answers.
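That character-count idea could be sketched as below; note it is only a pruning pre-filter, so surviving candidates still need a real substring check (the method names and the a-z assumption are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class InfixPrefilter {
    // Count of each of the 26 lowercase letters in s (input assumed a-z).
    static int[] counts(String s) {
        int[] c = new int[26];
        for (char ch : s.toCharArray()) {
            c[ch - 'a']++;
        }
        return c;
    }

    // A string can contain `query` as a substring only if it has at least
    // as many occurrences of every character, so the count check prunes
    // most candidates cheaply before the expensive contains() test.
    static List<String> search(List<String> data, String query) {
        int[] need = counts(query);
        List<String> out = new ArrayList<>();
        for (String s : data) {
            int[] have = counts(s); // in practice, pre-computed once per entry
            boolean maybe = true;
            for (int i = 0; i < 26; i++) {
                if (have[i] < need[i]) { maybe = false; break; }
            }
            if (maybe && s.contains(query)) {
                out.add(s);
            }
        }
        return out;
    }
}
```

With pre-computed count tables the filter costs 26 comparisons per entry, and only the few survivors pay for the full substring scan.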