Dictionary data structure + fast complexity methods - java

I'm trying to build, from scratch, a data structure that would be able to hold a vast dictionary (of words/characters).
The "words" can be made of an arbitrarily large number of characters.
The dictionary would need standard methods such as search, insert, delete.
I need the methods to have time complexity better than O(log n), i.e. between O(log n) and O(1), e.g. O(log log n),
where n = dictionary size (number of elements).
I've looked into various tree structures, for example the B-tree, which has O(log n) methods (not fast enough), as well as the trie, which seemed most appropriate for the dictionary; but because the words can be arbitrarily long, it seemed like its complexity would not be faster than O(log n).
Could you please provide an explanation?

A trie has significant memory requirements, but its access time is usually faster than O(log n).
If I recall correctly, the access time depends on the length of the word, not on the number of words in the structure.
The efficiency and memory consumption also depend on exactly which trie implementation you choose to use; there are some pretty efficient implementations out there.
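For concreteness, here is a minimal, deliberately simplified trie sketch over lowercase a-z (the layout and method names are illustrative, not a reference implementation). Both operations run in O(L), where L is the word length, independent of the number n of words stored:

class Trie {
    private static final int R = 26;          // alphabet size (a-z only)
    private final Node root = new Node();

    private static class Node {
        Node[] next = new Node[R];
        boolean isWord;
    }

    // O(L) where L = word.length(), independent of dictionary size n
    public void insert(String word) {
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.isWord = true;
    }

    public boolean search(String word) {
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (x.next[c] == null) return false;
            x = x.next[c];
        }
        return x.isWord;
    }
}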
For more information on Tries see:
http://en.wikipedia.org/wiki/Trie
http://algs4.cs.princeton.edu/52trie/
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/

If your question is how to achieve as few string comparisons as possible, then a hash table is probably a very good answer, as it requires close to O(1) string comparisons. Note that hashing the key takes time proportional to the string length, as can the time for a string comparison.
But this is nothing new. Can we do better for long strings? To be more precise, we will assume the string length to be bounded by M. We will also assume that the length of every string is known (for long strings, this can make a difference).
First notice that the search time is bounded below by the string length, and is Ω(M) in the worst case: comparing two strings can require comparing all characters, as the strings can differ only in the last character compared. On the other hand, in the best case, the comparison can conclude immediately, either because the lengths are different or because the strings differ in the first characters compared.
Now you can reason as follows: consider the whole set of strings in the dictionary and find the position of the first character on which they differ. Based on the value of this character, you will decompose the set into a number of subsets. And you can continue this decomposition recursively until you get singletons.
For example,
able
about
above
accept
accident
accompany
is organized as
*bl*
*bou*
*bov*
*c*e**
*c*i****
*c*o*****
where an asterisk stands for a character which is simply ignored, and the remaining characters are used for discrimination.
As you can see, in this particular example two or three character comparisons are enough to recognize any word in the dictionary.
This representation can be described as a finite state automaton such that in every state you know which character to check next and what are the possible outcomes, leading to the next states. It has a K-ary tree structure (where K is the size of the alphabet).
For an efficient implementation, every state can be represented by the position of the decision character and an array of links to the next states. Actually, this is a trie structure with path compression. (As @peter.petrov said, there are many variants of the trie structure.)
How do we use it? There are two situations:
1) the search string is known to be in the dictionary: then a simple traversal of the tree is guaranteed to find it. It will do so after a number of character comparisons equal to the depth D of the corresponding leaf in the tree, i.e. O(D). This can be a very significant saving.
2) the search string may not be in the dictionary: during traversal of the tree you can observe an early rejection; otherwise, in the end you find a single potential match. Then you can't avoid performing an exhaustive comparison, O(1) in the best case, O(M) in the worst. (On average O(M) for random strings, but probably better for real-world distributions.) But you will compare against a single string, never more.
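A sketch of this structure in Java (the names and node layout are hypothetical, not a reference implementation): each internal node records which character position to test and maps the character found there to a child; a leaf holds its single candidate string, which situation 2 verifies with one exhaustive comparison.

import java.util.HashMap;
import java.util.Map;

// Sketch of a path-compressed trie: every internal node is a
// "decision at position pos" state; leaves hold one candidate string.
class CompressedTrieNode {
    final int pos;                                 // index of the decision character
    final Map<Character, CompressedTrieNode> next = new HashMap<>();
    final String leafWord;                         // non-null only for leaves

    CompressedTrieNode(int pos, String leafWord) {
        this.pos = pos;
        this.leafWord = leafWord;
    }

    // Returns true iff key is in the dictionary rooted at this node.
    boolean lookup(String key) {
        CompressedTrieNode x = this;
        while (x.leafWord == null) {
            if (x.pos >= key.length()) return false;      // early rejection
            CompressedTrieNode child = x.next.get(key.charAt(x.pos));
            if (child == null) return false;              // early rejection
            x = child;
        }
        return x.leafWord.equals(key);  // single exhaustive comparison (situation 2)
    }
}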
In addition to that device, if your distribution of key lengths is sparse, it may be useful to maintain a hash table of the key lengths, so that immediate rejection of the search string can occur.
As final remarks, notice that the cost of this solution is not directly a function of N, and that it is likely that time sublinear in M could be achieved by suitable heuristics taking advantage of the particular distribution of the strings.

Related

If array needs to be sorted would it count as part of the binary search algorithm

I am trying to understand the speed of the Binary Search algorithm.
I understand it needs to operate on a sorted array.
However, if the array comes in unsorted, we have to sort it first. Wouldn't the sorting then be part of the binary search, so that its performance would be slower?
I am confused because I think that there is very little chance to use this algorithm if the data does not come in sorted.
And if my code needs to sort it, then why doesn't that count towards the search algorithm?
Sorry if I am being confusing,
Thank you for helping.
You can't just point at an algorithm and say: It's got O(n^2) complexity!
That's what people usually say, sure. But that's shorthand. They're omitting things; assuming that the listener / reader will make assumptions.
You need to fully describe the exact algorithm, the conditions under which it is applied, and the precise definition of n and any other variable.
Then, you can answer that question. The problem you're having here is that the definition of 'what is the performance of binary search' is unclear. If you assume it means X whilst your buddy assumes it means Y, and you then argue about the answers, you're not actually having a constructive debate at all. You're just tilting at windmills; the real problem is that neither of you figured out the problem is communicating the basics.
Given that there is some confusion here, I'll give you three different, more or less equally sensible, fully fleshed-out definitions, along with the actual answer for each such definition. Hint: for one of them, 'binary search' isn't the fastest algorithm!
Given [1] a list that is already sorted, and [2] a single value, write me an algorithm that determines if this value is in the list or not.
The best answer would be: the binary search algorithm, and its complexity would be O(log n).
Given [1] a list that is not sorted, and [2] a single value, write me an algorithm that determines if this value is in the list or not.
The best answer would be: just iterate through the list. Its complexity would be O(n), and binary search is not part of this answer at all.
Given [1] a list that is not sorted, and [2] a list of tests, whereby each individual test is defined by a single value, but they all use the same unsorted input list, write an algorithm that will, for each test, determine if the value for that test is in the list or not, and then give me the amortized complexity (basically, the complexity of the whole thing, divided by the number of tests we ran).
Then the best answer would be: First sort the list, spending O(n log n) time to do so, but we get to amortize that over the test case count, and then use binary search for each individual test, adding an O(log n) complexity to each test. If we term n the size of the input list and t the number of tests we have, this gets us:
O((n log n)/t + log n)
Which is the actual answer to the question, complex as it may look. But, if t is large or even considered effectively infinite in size, OR we add one more rider to the question:
The list from [1] is given to you in advance and, within reasonable time and memory limits, you may preprocess this data without needing to amortize these costs across your test cases
then that boils down to just O(log n), as the large value for t makes that (n log n) / t factor approach zero.
In communicating this to your buddy, given that we don't talk in entire scientific papers, one might then say: "The algorithmic complexity of the binary search algorithm is O(log n)", even if that omits a gigantic chunk of the full story.
You interpret the question as per the second case (input is unsorted, the input comprises both the list and the value to search for, no multi-test clause). Someone who says 'binary search is O(log n)' is labouring under either the first or third. You're both right.
NB: The third definition seems unusually complicated. However, it matches common scenarios. For example: 'we have compiled a list of folks living in town and their phone numbers, and we want to print them in a giant book with the aim of letting recipients of this book look up phone numbers'. We expect over the lifetime of a single print run that the 100,000 recipients of the book will each do on average about 50 lookups, for a grand total of 5 million lookups for this single list. That gives you t = 5 million and n = 200,000 (let's say 200k people live in the township and are listed, half of whom get a phonebook). Plug those numbers in and sorting the phonebook wins by a landslide vs. releasing the phonebook in arbitrary, unsorted order. Even if, yes, you start 'down' the effort of sorting it, and won't make up for that loss until enough folks have speedily looked up a few phone numbers.
Yes. If
the data comes in unsorted
you only need to search for one element
...then you would have to first sort the data to use binary search, which would take a total of O(n log n + log n) = O(n log n) time.
But once the data is sorted, you can then binary search on that data as many times as you want. You don't have to sort it again each time.
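In Java terms (a throwaway example): pay the O(n log n) sort once, then every lookup is O(log n).

import java.util.Arrays;

public class SortOnceSearchMany {
    public static void main(String[] args) {
        int[] data = {42, 7, 19, 3, 88};   // arrives unsorted

        Arrays.sort(data);                 // pay O(n log n) once

        // Each subsequent lookup is O(log n); a non-negative
        // return value from binarySearch means "found".
        System.out.println(Arrays.binarySearch(data, 19) >= 0); // true
        System.out.println(Arrays.binarySearch(data, 5) >= 0);  // false
    }
}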

Are Tries appropriate for languages which have no alphabet?

I am trying to work out the most efficient way of achieving similar efficiency to using a trie to store English words, but for words in languages that have no alphabet, such as Chinese. For example, I want to be able to load a word list and have an application which, as the user is typing, gives suggestions in real time based on the characters typed so far. Any suggestions how this could be achieved? If I use tries, I will have an enormous number of child nodes per node, as there are thousands of unique characters. Is there any established way of achieving what I have described above?
A terminological detour: the word "alphabet" is commonly used to refer to the symbols in writing systems (like those of the various European languages) where each symbol roughly corresponds to a single phoneme (sound). There are also writing systems in which symbols correspond to syllables, morphemes or whole words; the symbols of such languages, which are much more numerous than alphabets, have different technical names: syllabaries, abugida, logographs, and so on, but the discrimination is not precise.
In computational theory, however, it is usual to use the word "alphabet" to describe any finite collection of symbols, regardless of how small or large the set is. Any alphabet -- or finite set of symbols -- can be transcoded into fixed-length sequences from a smaller alphabet of size at least two, with a change in length which is logarithmic in the size of the alphabet. Consequently, it is often convenient to assume that only binary representations are used: that is, representations from the alphabet {0, 1}.
A trie will work with any alphabet size; there is no requirement that the trie's alphabet be an "alphabet" from a human writing system, nor is there a prohibition against it being a larger collection of symbols, although naïve implementations with large alphabets can be very wasteful of space. In particular, nothing stops you from using a recoding of the original written characters into a smaller alphabet, using several trie levels for each character. For example, if the original string is represented in UTF-8, then you could use the individual bytes (and you might want to distinguish between nodes corresponding to leading bytes, where the alphabet size is 178 although only 99 correspond to "letters", and nodes corresponding to continuation bytes, where the alphabet size is 64). Alternatively, you could just split the Unicode code point into three or four bit sequences, each of a manageable size.
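A minimal sketch of that byte-level recoding (assuming UTF-8 and a plain 256-way child array; the compression schemes discussed below address the obvious space waste of such arrays):

import java.nio.charset.StandardCharsets;

// Sketch: index the trie by UTF-8 bytes instead of full code points,
// so each Chinese character becomes 3 (or 4) trie levels.
class ByteTrie {
    private static class Node {
        Node[] next = new Node[256]; // one slot per possible byte value
        boolean isWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node x = root;
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            int i = b & 0xFF;                 // byte -> 0..255
            if (x.next[i] == null) x.next[i] = new Node();
            x = x.next[i];
        }
        x.isWord = true;
    }

    public boolean contains(String word) {
        Node x = root;
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            x = x.next[b & 0xFF];
            if (x == null) return false;
        }
        return x.isWord;
    }
}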
You can optimize tries by compressing successive nodes which have only a single child; that may be effective with the above schemes. A compact trie over the binary alphabet is called a Patricia trie, and it might be worth looking at as well.
Another common solution for dealing with sparse trie nodes is to use some kind of associative structure for children, rather than an array. In ternary search trees, the children are simply kept in a sorted list so that the correct child can be found with a binary search in time logarithmic in the alphabet size, which is constant for a given alphabet. (The time is actually logarithmic in the number of children, which could be much smaller than the alphabet size.)
Another solution, practical for medium-sized alphabets, is to keep a bit vector of present children as well as a sorted vector of children; modern CPUs have instructions which can rapidly sum the number of set bits in a word, making it efficient to use the bit vector to find the index of the child in the vector.
Yet another possible solution is to use a hash table whose entries are keyed by a 2-tuple consisting of the parent node's id and the child's leading character. This data structure is easy to maintain and space-efficient, but has very poor locality of reference. (One disadvantage is that additional work needs to be done to construct the list of children of a parent node: for example, by explicitly linking the children.)
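A sketch of that last scheme, with hypothetical names: one global hash table keyed by the (parent id, leading character) 2-tuple replaces the per-node child arrays.

import java.util.HashMap;
import java.util.Map;

// Sketch: all trie edges live in a single hash table keyed by (parentId, char).
class HashedTrie {
    private final Map<Long, Integer> edges = new HashMap<>();
    private int nextId = 1;                   // 0 is reserved for the root

    private static long key(int parentId, char c) {
        return ((long) parentId << 32) | c;   // pack the 2-tuple into a long
    }

    // Walks/creates the path for word; returns the id of its last node.
    public int insert(String word) {
        int node = 0;
        for (int i = 0; i < word.length(); i++) {
            node = edges.computeIfAbsent(key(node, word.charAt(i)), k -> nextId++);
        }
        return node;
    }

    // Returns the node id reached by word, or -1 if the path doesn't exist.
    public int lookup(String word) {
        int node = 0;
        for (int i = 0; i < word.length(); i++) {
            Integer child = edges.get(key(node, word.charAt(i)));
            if (child == null) return -1;
            node = child;
        }
        return node;
    }
}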

Is it possible to add/update a sorted list in constant time?

Suppose you are given a list of integers that have already been sorted such as (1,7,13,14,50). It should be noted that the list will contain no duplicates.
Is there some data structure that could store this while allowing me to add any new element (at its proper location) in constant time? add(10) would yield (1,7,10,13,14,50).
Similarly, would I be able to update an element (such as changing 7 to 19) and shift the order accordingly in constant time? change(7,19) yields (1,13,14,19,50).
For a class I need to write a data structure that performs these operations as quickly as possible, but I just wanted to know if constant time could be done and if not, then what would the ideal runtime be?
Inserting in constant time, O(1), would only occur as a best case for any of these data structures. Hash tables generally have the best insertion time, but it might not always be O(1) if there are collisions and separate chaining is used. In any case, you do not keep a hash table sorted, so its complexity is irrelevant here.
Binary trees have good insertion time and, as a bonus, the data is already sorted upon inserting a new node. Insertion takes O(log n) on average, however. The best case for inserting is O(1), when the tree is empty.
Those were just a couple examples, see here for more info on the complexities of these operations: http://bigocheatsheet.com/
In general? No. Determining where to insert a new element, or re-ordering the list after insertion, involves analysing the list's contents, which involves reading the elements of the list, which (in general) means iterating over some portion of the length of the list. This (again, in general) depends on how many elements are in the list, which by definition is not a constant. Hence, a constant-time sorted insert is simply not possible except in special cases.
A binary tree (TreeSet) would be adequate. An int[] array with Arrays.binarySearch and Arrays.copyOf would be fine too, because here we have ints and therefore do not need the wrapper class Integer.
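For instance (a throwaway sketch using the numbers from the question):

import java.util.List;
import java.util.TreeSet;

public class SortedIntSet {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>(List.of(1, 7, 13, 14, 50));
        set.add(10);      // insert at the proper position, O(log n)
        set.remove(7);    // change(7, 19) = remove + add, both O(log n)
        set.add(19);
        System.out.println(set); // [1, 10, 13, 14, 19, 50]
    }
}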
For real constant time, O(1), one must pay in space. Use a BitSet. To add 17, simply set bit 17 to true. There are optimized methods to find the next set bit and so on.
But I doubt optimizing is really needed at this spot. File I/O might pay off more.
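A sketch of the BitSet idea (only viable because the elements are non-negative ints in a bounded range):

import java.util.BitSet;

public class BitSetDict {
    public static void main(String[] args) {
        BitSet bits = new BitSet();
        for (int v : new int[] {1, 7, 13, 14, 50}) bits.set(v);

        bits.set(17);                 // add(17) in O(1)
        bits.clear(7); bits.set(19);  // change(7, 19), also O(1)

        // Iterate in sorted order via nextSetBit:
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            System.out.print(i + " "); // 1 13 14 17 19 50
        }
    }
}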

Best way to compare - Using Sorting or adding to Set

I have two Strings like
String one = "one, two, three, four";
String two ="two,nine,ten";
Now, if any of the numbers two / nine / ten is present in the first string, I need to return true.
I split both strings, so now I have splitOne[] and splitTwo[].
Now, one way would be to compare each element against every other element, rather like a bubble sort.
This will give me a complexity of O(n^2).
Will adding the elements to a HashSet get me better complexity ?
For adding to set, I need to iterate through both lists and add each element.
Which of these will require less time? Is there any significant difference?
It really depends on your use-case.
There is no point in trying to sort it yourself. There are much more efficient methods available, one of which is, obviously, using a HashSet.
If you really are working with up to around 30 words then HashSet is certainly your way to go. However, as the number of strings gets bigger you are going to start running into space problems. For a start String.split will eat huge amounts of memory when you get into the thousands of strings, let alone the HashSet.
If you wish to avoid using database then there are solutions such as a Bloom Filter.
At the extreme end you would probably want to use a database of some sort.
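A minimal HashSet version of the check (names are illustrative; note the trim, since the first string has spaces after its commas):

import java.util.HashSet;
import java.util.Set;

public class OverlapCheck {
    // Returns true if any token of 'two' also occurs in 'one': O(N + M) expected.
    static boolean anyCommon(String one, String two) {
        Set<String> tokens = new HashSet<>();
        for (String s : one.split(",")) tokens.add(s.trim());
        for (String s : two.split(",")) {
            if (tokens.contains(s.trim())) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(anyCommon("one, two, three, four", "two,nine,ten")); // true
    }
}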
You can use a HashMap and maintain the number of occurrences as the value.
Or, instead of splitting both strings, split just one string and check each token against the other string:
private boolean anyTokenPresent() {
    String one = "one, two, three,four,nine,ten";
    String two = "two,nine,ten,11";
    String[] strTwo = two.split(",");
    for (String token : strTwo) {
        // Note: contains does substring matching, so "ten" would also
        // match a word like "intent"; trim and compare for exact tokens.
        if (one.contains(token)) return true;
    }
    return false;
}
Let's say the number of elements in the first set is N and the number of elements in the second set is M.
Using a HashSet will require O(N+M), as O(N) is used for adding while O(M) is used for checking (assuming comparisons are O(1)).
The 'bubble sort' way will take O(NM).
Theoretically, O(N+M) with a HashSet is the better complexity. However, the constant factor of a HashSet is higher, so you might not see any improvement for small values of N and M.
Alternatively, since you are dealing with strings, comparisons between strings aren't O(1). You can create a trie from the first set, taking O(A) time, where A is the number of characters in the first set, and then O(B) in total to traverse the trie for checking, where B is the number of characters in the second set. This might give you better performance than a HashSet, as it is independent of any hashing function (and hence of collision checking).

Is a Java hashmap search really O(1)?

I've seen some interesting claims on SO regarding Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring. For a hash map, that of course is a collision, with respect to how full the map happens to be. A collision is pretty easy to estimate:
p(collision) = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k:
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions:
p(2 collisions) = (n / capacity)^2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to
p(k collisions) = (n / capacity)^k
And now we can disregard some arbitrary number of collisions and end up with a vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash map has O(1) access with high probability.
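As a quick numeric illustration of how fast that estimate falls off (the numbers are arbitrary):

public class CollisionOdds {
    public static void main(String[] args) {
        double load = 100.0 / 1000.0;   // n / capacity, arbitrary example values
        for (int k = 1; k <= 4; k++) {
            // p(k collisions) ~ (n / capacity)^k under this rough model
            System.out.printf("k=%d: %.4f%n", k, Math.pow(load, k));
        }
        // Prints 0.1000, 0.0100, 0.0010, 0.0001: the probability of needing
        // more than k probes shrinks geometrically in k.
    }
}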
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using perfect hashing), but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
How does a HashMap work in Java?
Using hashCode to locate the corresponding bucket (inside the bucket container model).
Each bucket is a linked list (or, starting from Java 8, a balanced red-black tree under some conditions) of the items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
Remember that O(1) does not mean that each lookup only examines a single item; it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n), because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 implements the buckets as TreeMaps once they exceed a threshold, which makes the actual time O(log n).
O(1+n/k) where k is the number of buckets.
If the implementation sets k = n/alpha then it is O(1 + alpha) = O(1), since alpha is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked lists (nodes); each linked list in the array represents a bucket for the unique hash value of one or more keys.
While adding an entry to the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
During the get operation it uses the same way to determine the location of the bucket for the key. In the best case, each key has a unique hashcode and results in a unique bucket for each key; in this case the get method spends time only to determine the bucket location and retrieve the value, which is constant, O(1).
In the worst case, all the keys have the same hashcode and are stored in the same bucket; this results in traversing the entire list, which leads to O(n).
In the case of Java 8, the linked-list bucket is replaced with a balanced tree if its size grows to more than 8 entries; this improves the worst-case search cost to O(log n).
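To check the arithmetic from the example above (the 16-slot mask is just an illustrative table size):

public class BucketIndex {
    public static void main(String[] args) {
        int h = "ABC".hashCode();
        System.out.println(h);        // 64578
        System.out.println(100 & h);  // 64, matching the example above
        // A real HashMap masks with table.length - 1 (a power of two minus
        // one, e.g. 15 for a 16-slot table), after spreading the hash with
        // h ^ (h >>> 16) in Java 8+.
        System.out.println(15 & h);   // 2
    }
}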
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different than average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α=1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives down the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining, then the worst-case scenario happens when every data element hashes to the same value (a poor choice of hash function, for example). In that case, data lookup is no different from a linear search on a linked list, i.e. O(n). However, the probability of that happening is negligible, and the lookup's best and average cases remain constant, i.e. O(1).
Only in the theoretical case where hashcodes are always different, and the bucket for every hash code is also different, will the strict O(1) hold. Otherwise it is still of constant order, i.e. as the hashmap grows, its order of search remains constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the hashmap will depend based on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have a very good performance (this is not strictly O(1) in every possible case but it is in most cases).
For example, the default implementation in the Oracle JRE is to use a random number (which is stored in the object instance so that it doesn't change, but this also disables biased locking, though that's another discussion), so the chance of collisions is very low.
