I'm accessing the minimum element of a binary tree lots of times. What implementations allow me to access the minimum element in constant time, rather than O(log n)?
Depending on your other requirements, a min-heap might be what you are looking for. It gives you constant time retrieval of the minimum element.
However, you cannot perform some other operations as easily as with a simple binary search tree, such as determining whether a value is in the tree. You could also look at splay trees, a kind of self-balancing binary tree that provides improved access time to recently accessed elements.
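As a minimal sketch of the min-heap suggestion, assuming Java and its standard java.util.PriorityQueue (the class name MinHeapDemo and the helper heapOf are made up for illustration): peek() returns the minimum without walking the structure, while membership tests fall back to a linear scan.

```java
import java.util.PriorityQueue;

class MinHeapDemo {
    // Builds a min-heap (natural ordering) from the given values.
    static PriorityQueue<Integer> heapOf(int... values) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int v : values) {
            heap.add(v); // O(log n) per insertion
        }
        return heap;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap = heapOf(42, 7, 19);
        System.out.println(heap.peek());       // minimum element, constant time
        heap.poll();                           // removes the minimum, O(log n)
        System.out.println(heap.peek());       // the next minimum
        System.out.println(heap.contains(42)); // membership test is a linear scan
    }
}
```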
Find it once in O(log n) and then compare new values which you are going to add with this cached minimum element.
UPD: on how this works if you delete the minimum element: you just spend O(log n) one more time to find the new one.
Let's imagine that you have 10 000 000 000 000 integers in your tree. Each element needs 4 bytes of memory, so the whole tree needs about 40 TB of hard drive space. The O(log n) time spent searching for the minimum element in this huge tree is about 43 operations. Of course these are not the simplest operations, but it's still almost nothing even for 20-year-old processors.
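The arithmetic behind those two figures, assuming a balanced tree (so the depth is about log base 2 of n) and decimal terabytes:

```latex
\begin{align*}
10^{13}\ \text{integers} \times 4\ \text{bytes} &= 4 \times 10^{13}\ \text{bytes} = 40\ \text{TB} \\
\log_2\!\left(10^{13}\right) &= 13 \log_2 10 \approx 13 \times 3.32 \approx 43.2
\end{align*}
```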
Of course this applies if it's a real-world problem. If for some purpose (maybe academic) you need a true O(1) algorithm, then I'm not sure my approach can give you such performance without using additional memory.
This may sound silly, but if you mostly access the minimum element, and don't change the tree too much, maintaining a pointer to the minimal element on add/delete (on any tree) may be the best solution.
Walking the tree will always be O(log n). Did you write the tree implementation yourself? You can always simply stash a reference to the current lowest value element alongside your data structure and keep it updated when adding/removing nodes. (If you didn't write the tree, you could also do this by wrapping the tree implementation in your own wrapper object that does the same thing.)
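A minimal sketch of the wrapper idea, assuming Java and using java.util.TreeSet as the underlying tree. The class MinCachingSet and its method names are hypothetical, not part of any library.

```java
import java.util.TreeSet;

// Hypothetical wrapper that caches the smallest element of an underlying tree.
class MinCachingSet {
    private final TreeSet<Integer> tree = new TreeSet<>();
    private Integer min = null; // cached minimum, null while the set is empty

    boolean add(int value) {
        boolean added = tree.add(value);   // O(log n)
        if (min == null || value < min) {
            min = value;                   // O(1) cache update
        }
        return added;
    }

    boolean remove(int value) {
        boolean removed = tree.remove(value); // O(log n)
        if (removed && value == min) {
            // The minimum was removed: pay O(log n) once more to find the new one.
            min = tree.isEmpty() ? null : tree.first();
        }
        return removed;
    }

    // O(1): returns the cached value without walking the tree.
    Integer min() {
        return min;
    }

    public static void main(String[] args) {
        MinCachingSet set = new MinCachingSet();
        set.add(5);
        set.add(3);
        set.add(9);
        System.out.println(set.min()); // smallest of 5, 3, 9
        set.remove(3);
        System.out.println(set.min()); // recomputed after the old minimum left
    }
}
```

Removing the minimum is the only operation that has to consult the tree again, which matches the "keep it updated when adding/removing nodes" description.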
There is an implementation in TAOCP that uses the "spare" pointers in non-full nodes to complete a doubly linked list along the nodes in order (I don't recall the details right now, but I imagine you need a "has_child" flag for each direction to make it work).
With that and a start pointer, you could have the starting element in O(1) time.
This solution is not faster or more efficient than caching the minimum.
If by minimum element you mean the element with the smallest value, then you could store the elements in a TreeSet with a custom Comparator that sorts them into the correct order, and then just call SortedSet#first() or #last() to get the smallest/biggest values as efficiently as possible.
Note that inserting new items into a TreeSet is slightly slower than into other Sets/Lists, but if you don't have a huge number of elements that constantly change, it shouldn't be a problem.
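A small sketch of the TreeSet approach, assuming Java. The class TreeSetEnds and the by-length ordering are made up for illustration. first() returns the smallest element according to the Comparator and last() the largest.

```java
import java.util.Comparator;
import java.util.TreeSet;

class TreeSetEnds {
    // Orders strings by length; ties are broken alphabetically so that
    // distinct strings of equal length are not treated as duplicates.
    static TreeSet<String> byLength(String... words) {
        Comparator<String> byLen = Comparator.comparingInt(String::length);
        Comparator<String> order = byLen.thenComparing(Comparator.<String>naturalOrder());
        TreeSet<String> set = new TreeSet<>(order);
        for (String w : words) {
            set.add(w); // O(log n) per insertion
        }
        return set;
    }

    public static void main(String[] args) {
        TreeSet<String> set = byLength("pear", "fig", "banana");
        System.out.println(set.first()); // shortest word
        System.out.println(set.last());  // longest word
    }
}
```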
If you can use a little memory it sounds like a combined collection might work for you.
For instance, what you are looking for sounds a lot like a sorted linked list. You can always get to the minimum element, but an insert or lookup of an arbitrary node might take longer because you have to do an O(n) search.
If you combine a linked list and a tree you might get the best of both worlds. To look up an item for get/insert/delete operations you would use the tree to find the element. The element's "holder" node would have to have ways to cross over from the tree to the linked list for delete operations. Also the linked list would have to be a doubly linked list.
So I think getting the smallest item would be O(1), any arbitrary lookup/delete would be O(logN)--I think even an insert would be O(logN) because you could find where to put it in the tree, look at the previous element and cross over to your linked list node from there, then add "next".
Hmm, this is starting to seem like a really useful data structure, perhaps a little wasteful in memory, but I don't think any operation would be worse than O(logN) unless you have to rebalance the tree.
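The combined tree-plus-linked-list idea could be sketched roughly as follows, assuming Java. The class LinkedTreeSet is hypothetical. It uses a TreeMap as the tree and lets each map value be the "holder" node of a doubly linked list, so the minimum is always the list head.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical combined structure: a TreeMap index over doubly linked nodes.
class LinkedTreeSet {
    private static final class Node {
        final int value;
        Node prev, next;
        Node(int value) { this.value = value; }
    }

    private final TreeMap<Integer, Node> index = new TreeMap<>();
    private Node head; // always the smallest element

    boolean add(int value) {
        if (index.containsKey(value)) return false;
        Node node = new Node(value);
        Map.Entry<Integer, Node> lower = index.lowerEntry(value); // O(log n)
        if (lower == null) {            // new minimum: becomes the head
            node.next = head;
            if (head != null) head.prev = node;
            head = node;
        } else {                        // splice in right after the predecessor
            Node p = lower.getValue();
            node.prev = p;
            node.next = p.next;
            if (p.next != null) p.next.prev = node;
            p.next = node;
        }
        index.put(value, node);
        return true;
    }

    boolean remove(int value) {
        Node node = index.remove(value); // O(log n) to find the holder node
        if (node == null) return false;
        if (node.prev != null) node.prev.next = node.next; else head = node.next;
        if (node.next != null) node.next.prev = node.prev;
        return true;
    }

    // O(1): the head of the list is the minimum.
    Integer min() {
        return head == null ? null : head.value;
    }

    // Walks the linked list, which is kept in sorted order.
    List<Integer> toList() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) out.add(n.value);
        return out;
    }

    public static void main(String[] args) {
        LinkedTreeSet set = new LinkedTreeSet();
        set.add(5);
        set.add(2);
        set.add(8);
        System.out.println(set.min());
        System.out.println(set.toList());
    }
}
```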
If you upgrade / "upcomplex" your binary tree to a threaded binary tree, then you can get O(1) first and last elements.
You basically keep a reference to the current first and last nodes.
Immediately after an insert, if first's previous is non-null, then update first. Similarly for last.
Whenever you remove a node, first check whether it is the stored first or last node, and if so update the stored reference accordingly.
Related
Suppose you are given a list of integers that have already been sorted such as (1,7,13,14,50). It should be noted that the list will contain no duplicates.
Is there some data structure that could store this while allowing me to add any new element (at its proper location) in constant time? add(10) would yield (1,7,10,13,14,50).
Similarly, would I be able to update an element (such as changing 7 to 19) and shift the order accordingly in constant time? change(7,19) yields (1,13,14,19,50).
For a class I need to write a data structure that performs these operations as quickly as possible, but I just wanted to know if constant time could be done and if not, then what would the ideal runtime be?
Constant-time, O(1), insertion occurs only as a best case for any of these data structures. Hash tables generally have the best insertion time, but it might not always be O(1) if there are collisions and separate chaining is used. A hash table is not sorted, however, so its insertion complexity is irrelevant here.
Binary search trees have a good insertion time and, as a bonus, the data is already sorted after inserting a new node. This takes O(log n) time on average, however. The best case for inserting is O(1), if the tree is empty.
Those were just a couple examples, see here for more info on the complexities of these operations: http://bigocheatsheet.com/
In general? No. Determining where to insert a new element or re-ordering the list after insertion involves performing analysis of the list's contents, which involves reading the elements of the list, which (in general) means iterating over some portion of the length of the list. This (again, in general) is dependent on how many elements are in the list, which by definition is not a constant. Hence, a constant-time sorted insert is simply not possible except in special cases.
A binary tree, TreeSet, would be adequate. An array with Arrays.binarySearch and Arrays.copyOf would be fine too, because here we have ints and so do not need the wrapper class Integer.
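A minimal sketch of the plain-array variant, assuming Java. The class SortedIntArray and its insert helper are hypothetical; the copy step here uses Arrays.copyOf plus System.arraycopy. The search is O(log n), but the copy still makes each insert O(n).

```java
import java.util.Arrays;

class SortedIntArray {
    // Returns a new sorted array that also contains value.
    // If value is already present, the input array is returned unchanged.
    static int[] insert(int[] sorted, int value) {
        int pos = Arrays.binarySearch(sorted, value); // O(log n)
        if (pos >= 0) {
            return sorted;                            // no duplicates
        }
        int at = -pos - 1;                            // insertion point
        int[] result = Arrays.copyOf(sorted, sorted.length + 1); // O(n) copy
        // Shift the tail one slot to the right to make room.
        System.arraycopy(sorted, at, result, at + 1, sorted.length - at);
        result[at] = value;
        return result;
    }

    public static void main(String[] args) {
        int[] data = {1, 7, 13, 14, 50};
        System.out.println(Arrays.toString(insert(data, 10)));
    }
}
```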
For real constant time, O(1), one must pay in space. Use a BitSet. To add 17 simply set 17 to true. There are optimized methods to find the next set bit and so on.
But I doubt optimizing is really needed at this spot. File I/O might pay off more.
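The BitSet idea mentioned above might look like this in Java. This is a sketch that assumes all values are non-negative ints and that memory proportional to the largest value is acceptable; the class BitSetSorted and its helpers are made up for illustration.

```java
import java.util.BitSet;

class BitSetSorted {
    // Builds a BitSet with one bit set per value (values must be >= 0).
    static BitSet of(int... values) {
        BitSet set = new BitSet();
        for (int v : values) {
            set.set(v);
        }
        return set;
    }

    // change(from, to): clear one bit and set another; no shifting needed.
    static void change(BitSet set, int from, int to) {
        set.clear(from);
        set.set(to);
    }

    public static void main(String[] args) {
        BitSet set = of(1, 7, 13, 14, 50);
        set.set(10);        // add(10)
        change(set, 7, 19); // change(7, 19)
        // Iterate in ascending order using nextSetBit.
        for (int i = set.nextSetBit(0); i >= 0; i = set.nextSetBit(i + 1)) {
            System.out.print(i + " ");
        }
        System.out.println();
    }
}
```

Setting a bit does not depend on how many elements are stored, although the BitSet may occasionally need to grow its backing storage when a larger value arrives.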
I have a sorted array, let's say D={1,2,3,4,5,6}, and I want to add the number 5 in the middle. I can do that by adding the value 5 in the middle and moving the other values one step to the right.
The problem is that I have an array of length 1000 and I need to do that operation 10,000 times, so I need a faster way.
What options do I have? Can I use LinkedLists for better performance?
That depends on how you add said numbers. If only in ascending or descending order - then yes, LinkedList will do the trick, but only if you keep the node reference in between inserts.
If you're adding numbers in arbitrary order, you may want to deconstruct your array, add the new entries and reconstruct it again. This way you can use a data structure that's good at adding and removing entries while maintaining "sortedness". You have to relax one of your assumptions however.
Option 1
Assuming you don't need constant time random access while adding numbers:
Use a binary sorted tree.
The downside: while you're adding, you cannot read or reference an element by its position, at least not easily. Best case scenario: you're using a tree that keeps track of how many elements each left subtree has, so it can get the ith element in log(n) time. You can still get pretty good performance if you're just iterating through the elements, though.
Total runtime is down to n * log(n) from n^2. Random access is log(n).
Option 2
Assuming you don't need the elements sorted while you're adding them.
Use a normal array, but add elements to the end of it, then sort it all when you're done.
Total runtime: n * log(n). Random access is O(1), however elements are not sorted.
Option 3
(This is kinda cheating, but...)
If you have a limited number of values, then employing the idea of BucketSort will help you achieve great performance. Essentially - you would replace your array with a sorted map.
Runtime is O(n), random access is O(1), but it's only applicable to a very small number of situations.
TL;DR
Getting arbitrary values, quick adding and constant-time positional access, while maintaining sortedness is difficult. I don't know any such structure. You have to relax some assumption to have room for optimizations.
A LinkedList will probably not help you very much, if at all. Basically you are exchanging the cost of shifting every value on insert with the cost of having to traverse each node in order to reach the insertion point.
This traversal cost will also need to be paid whenever accessing each node. A LinkedList shines as a queue, but if you need to access the internal nodes individually it's not a great choice.
In your case, you want a sorted tree of some sort. A BST (Binary Search Tree, also referred to as a Sorted Binary Tree) is one of the simplest types and is probably a good place to start.
A good option is a TreeSet, which is likely functionally equivalent to how you were using an array, if you simply need to keep track of a set of sorted numbers.
My goal is a sorted data structure that can accomplish 2 things:
Fast insertion (at the location according to sort order)
I can quickly segment my data into the sets of everything greater than or less than or equal to an element. I need to know the size of each of these partitions, and I need to be able to "get" these partitions.
Currently, I'm implementing this in java using an ArrayList which provides #2 very easily since I can perform binary search (Collections.binarySearch) and get an insertion index telling me at what point an element would be inserted. Then based on the fact that indices range from 0 to the size of the array, I immediately know how many elements are greater than my element or smaller than my elements, and I can easily get at those elements (as a sublist). However, this doesn't have property #1, and results in too much array copying.
This makes me want to use something like a SkipList or RedBlackTree that could perform the insertions faster, but then I can't figure out how to satisfy property #2 without making it take O(N) time.
Any suggestions would be appreciated. Thanks
EDIT: Thanks for the answers below that reference data structures that perform the insertion in O(logN) time and that can partition quickly as well, but I want to highlight the size() requirement: I need to know the size of these partitions without having to traverse the entire partition (which, according to this, is what the TreeSet does). The reasoning behind this is that in my use case I maintain my data using several different copies of data structures, each using a different comparator, and then need to ask "according to what comparator is the set of all things larger than a particular element smallest". In the ArrayList case, this is actually easy and takes only O(YlogN), where Y is the number of comparators, because I just binary search each of the Y arrays and return the ArrayList with the highest insertion index. It's unclear to me how I could do this with a TreeSet without taking O(YN).
I should also add that an approximate answer for the insertion index would still be valuable even if it couldn't be solved exactly.
Use a common Java TreeSet. Insertion takes O(logN), so #1 of your requirements is done. Here's the quote from the documentation:
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
And as it implements the NavigableSet interface, you have #2 of your requirements with the following methods:
tailSet(someElem) returns a Set view starting from someElem (inclusive) till the last element
headSet(someElem) returns a Set view starting from the first element till someElem (exclusive)
subSet(fromElem, toElem) returns a Set view starting from fromElem (inclusive) till toElem (exclusive)
These operations are overloaded with versions that include/exclude the bounds provided.
TreeSet is quite flexible: it allows you to define a Comparator to order the Set in a custom way, or you can also rely on the natural ordering of the elements.
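A short sketch of these range views, assuming Java and the sample values 1, 7, 13, 14, 50 (the class RangeViews is made up for illustration). By default headSet excludes its bound and tailSet includes it; the two-argument overloads let you choose.

```java
import java.util.Arrays;
import java.util.TreeSet;

class RangeViews {
    static TreeSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(1, 7, 13, 14, 50));
    }

    public static void main(String[] args) {
        TreeSet<Integer> set = sample();
        System.out.println(set.headSet(13));        // [1, 7]
        System.out.println(set.headSet(13, true));  // [1, 7, 13]
        System.out.println(set.tailSet(13));        // [13, 14, 50]
        System.out.println(set.tailSet(13, false)); // [14, 50]
        System.out.println(set.subSet(7, 14));      // [7, 13]

        // The views are live: later changes to the backing set show through.
        set.add(10);
        System.out.println(set.headSet(13));        // [1, 7, 10]
    }
}
```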
EDIT:
As for the requirement that the size() operation of the returned subsets not be O(n), I'm afraid there's no ready-made implementation in the Java API.
It is true that the set views returned by TreeSet range operations implement size() by 'jumping' to the first element of the view in O(log n) time and then iterating over the subsequent elements, adding 1 in each iteration, until the end of the subset is reached.
I must say this is quite unfortunate: it's not always necessary to traverse the returned subset view, but sometimes knowing the size of the subset in advance can be quite useful (as in your use case).
So, in order to fulfil your requirement, you need another structure, or at least, an auxiliary structure. After some research, I suggest you use a Fenwick tree. A Fenwick tree is also known as a Binary Indexed Tree (BIT), and can be either immutable or mutable. The immutable version is implemented with an array, while the mutable version could be implemented with a balanced binary tree, i.e. a Red-Black tree (Java TreeSet is actually implemented as a Red-Black tree). Fenwick trees are mainly used to store frequencies and calculate the sum of all frequencies up to a given element in O(log n) time.
Please refer to this question here on Stack Overflow for a complete introduction to this quite unknown but yet incredibly useful structure. (As the explanation is here in Stack Overflow, I won't copy it here).
Here's another Stack Overflow question asking how to properly initialize a Fenwick tree, and here's actual Java code showing how to implement Fenwick tree's operations. Finally, here's a very good theoretic explanation about the structure and the underlying algorithms being used.
The problem with all the samples on the web is that they use the immutable version of the structure, which is not suitable for you, since you need to interleave queries with adding elements to the structure. However, they are all very useful for fully understanding the structure and the algorithms being used.
My suggestion is that you study Java TreeMap's implementation and see how to modify/extend it so that you can turn it into a Fenwick tree that keeps 1 as a value for every key. This 1 would be each key's frequency. So Fenwick tree's basic operation getSum(someElement) would actually return the size of the subset from first element up to someElement, in O(log n) time.
So the challenge is to implement a balanced tree (a descendant of Java's Red-Black TreeMap, actually) that implements all the Fenwick tree operations you need. I believe you'd be done with getSum(someElement), but maybe you could also extend the returned subtree range views so that they all refer to getSum(someElement) when implementing the size() operation for range views.
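To make the getSum idea concrete, here is a minimal sketch of the simpler array-based Fenwick tree in Java. It is not the TreeMap-based variant proposed above: it assumes the keys have already been mapped to integers in the range 1..n, and the class name FenwickCounter is hypothetical. Each key's frequency is kept as 0 or 1, so getSum(key) is the number of stored elements that are less than or equal to key.

```java
// Array-based Fenwick tree over the key universe 1..n (assumption: keys are
// small positive integers known in advance, unlike the TreeMap-based idea).
class FenwickCounter {
    private final int[] tree; // 1-based; tree[0] is unused

    FenwickCounter(int n) {
        tree = new int[n + 1];
    }

    // Adds delta to the frequency of key (1 <= key <= n) in O(log n).
    void add(int key, int delta) {
        for (int i = key; i < tree.length; i += i & -i) {
            tree[i] += delta;
        }
    }

    // Returns the sum of frequencies of keys 1..key in O(log n).
    int getSum(int key) {
        int sum = 0;
        for (int i = key; i > 0; i -= i & -i) {
            sum += tree[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        FenwickCounter counter = new FenwickCounter(64);
        for (int key : new int[]{1, 7, 13, 14, 50}) {
            counter.add(key, 1); // frequency 1 for every stored key
        }
        // Size of the subset "everything up to and including 13".
        System.out.println(counter.getSum(13));
    }
}
```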
Hope this helps, at least I hope it's a good place to start. Please, let me know if you need clarifications, as well as examples.
If you don't need duplicate elements (or if you can make the elements look distinct), I'd use a java.util.TreeSet. It meets your stated requirements.
O(log n) sorted insertion due to binary tree structure
O(log n) segmentation time using in-place subsets
Unfortunately, the O(log n) segmentation time is effectively slowed to O(n) by your requirement to always know the size of the segment, due to the reason in the answer you linked. The in-place subsets don't know their size until you ask them, and then they count. The counted size is stored, but if the backing set is changed in any way, the subset has to count again.
I think the best data structure for this problem would be a B-Tree with a dense index. Such a B-Tree is built from:
- inner nodes containing only pointers to child nodes
- leaves containing pointers to paged arrays
- a number of equal-sized-arrays (pages)
Unfortunately there are few generic implementations of a B-Tree in Java, probably because so many variations exist.
The cost of insertion would be
O(log(n)) to find the position
O(p) to insert a new element into a page (where p is the constant page size)
Maybe this data structure also covers your segmentation problem. If not: The cost of extracting would be
O(log(n)) to find the borders
O(e) to copy the extract (where e is the size of the extract)
One easy way to get what you want involves augmenting your favourite binary search tree data structure (red-black trees, AVL trees, etc...) with left and right subtree sizes at each node --- call them L-size and R-size.
Assume that updating these fields in your tree data structures can be done efficiently (say constant time). Then here is what you get:
Insertion, deletion, and all the regular binary search tree operations as efficient as your choice of data structure --- O(log n) for red-black trees.
Given a key x, you can get the number of elements in your tree that have keys less than x in O(log n) time, by descending down the tree to find the appropriate location for x, summing up the L-sizes (plus one for the actual node you're traversing) each time you "go right". The "greater than" case is symmetrical.
Given a key x, you can get the sorted list x_L of elements that are less than x in O(log n + |x_L|) time by, again, descending down the tree to find the appropriate location for x, and each time you go right you tag the node you just traversed, appending it to a list h_L. Then doing in-order traversals of each of the nodes in h_L (in order of addition to h_L) will give you x_L (sorted). The "greater than" case is symmetrical.
Finally, for my answer to work, I need to guarantee you that we can maintain these L- and R-sizes efficiently for your choice of specific tree data structure. I'll consider the case of red-black trees.
Note that maintaining L-sizes and R-sizes is done in constant time for vanilla binary search trees (when you add a node starting from the root, just add one to L-sizes if the node should go in the left subtree, or one to the R-sizes if it goes in the right subtree). Now the additional balancing procedures of red-black trees only alter the tree structure through local rotations of nodes --- see Wikipedia's depiction of rotations in red-black trees. It's easy to see that the post-rotation L-size and R-size of P and Q can be recalculated from the L-sizes and R-sizes of A,B,C. This only adds a constant amount of work to the red-black tree's operations.
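The counting step can be sketched as follows in Java. For brevity this sketch uses a plain, unbalanced binary search tree, so it shows how the L-sizes and R-sizes are maintained and summed but omits the red-black rotations. The class SizeAugmentedBst and its method names are hypothetical.

```java
// Unbalanced BST augmented with left/right subtree sizes (no rebalancing,
// which a real implementation would add on top, e.g. as a red-black tree).
class SizeAugmentedBst {
    private static final class Node {
        final int key;
        Node left, right;
        int leftSize, rightSize; // the L-size and R-size of this node
        Node(int key) { this.key = key; }
    }

    private Node root;

    boolean contains(int key) {
        Node cur = root;
        while (cur != null) {
            if (key == cur.key) return true;
            cur = key < cur.key ? cur.left : cur.right;
        }
        return false;
    }

    // Inserts key if absent, bumping the sizes along the search path.
    boolean insert(int key) {
        if (contains(key)) return false; // check first so sizes stay exact
        if (root == null) { root = new Node(key); return true; }
        Node cur = root;
        while (true) {
            if (key < cur.key) {
                cur.leftSize++;
                if (cur.left == null) { cur.left = new Node(key); return true; }
                cur = cur.left;
            } else {
                cur.rightSize++;
                if (cur.right == null) { cur.right = new Node(key); return true; }
                cur = cur.right;
            }
        }
    }

    // Number of keys strictly less than x: add L-size + 1 on every "go right".
    int countLess(int x) {
        int count = 0;
        Node cur = root;
        while (cur != null) {
            if (x <= cur.key) {
                cur = cur.left;
            } else {
                count += cur.leftSize + 1;
                cur = cur.right;
            }
        }
        return count;
    }

    // Symmetric case: number of keys strictly greater than x.
    int countGreater(int x) {
        int count = 0;
        Node cur = root;
        while (cur != null) {
            if (x >= cur.key) {
                cur = cur.right;
            } else {
                count += cur.rightSize + 1;
                cur = cur.left;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        SizeAugmentedBst tree = new SizeAugmentedBst();
        for (int key : new int[]{13, 7, 50, 1, 14}) tree.insert(key);
        System.out.println(tree.countLess(13));
        System.out.println(tree.countGreater(13));
    }
}
```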
Concerning Trie from Wikipedia:
[Compared to HashTable] Tries support ordered iteration
I am not sure what is meant here first of all. Is it the same as sorted iteration?
Additionally is this supposed to be an inherent feature of this datastructure?
I mean if one uses e.g. a HashSet for the children of each node in a Trie we can get O(1) access trying to find the children to branch or by using a LinkedList you save space on the nodes.
Perhaps I am way off, but from my point of view the only way to support ordered iteration would be to keep an array of all keys per node, even the unused ones.
Isn't this approach bad?
And one last point:
If "ordered" here refers to insertion order (and not sorted order), how would we get that? We insert each word (using the chars as keys) into the corresponding nodes, and I can't see how this gives us any information about the insertion order.
Could someone help me clear these things up in my mind?
Thank you.
What they mean is that a depth-first search of the trie will yield a lexicographically ordered output of the strings in the trie.
But yes, you are right, this assumes that all sibling nodes on a given level are visited in lexicographical order and this is far from given, especially with large alphabets, where it makes sense to implement the child node table via a hash table.
In summary, I think your doubts are justified and that the Wikipedia article is wrong.
But it’s worth noting that a lexicographically ordered iteration over a trie is affordable even if the child nodes aren’t ordered, since ordering them while iterating is expected to be relatively cheap: there will be few child nodes in each trie node, so the overall performance of iterating the trie will still be in O(n) expected time. This is in contrast with a hash table, where ordered iteration effectively implies sorting all elements, an O(n log n) operation.
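As an illustration of that point, here is a minimal sketch in Java of a trie whose child tables are unordered HashMaps, with the children sorted only during the depth-first walk. The class HashTrie and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>(); // unordered
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.isWord = true;
    }

    // Depth-first walk that yields the stored words in lexicographic order.
    List<String> sortedWords() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.isWord) out.add(prefix.toString());
        List<Character> keys = new ArrayList<>(node.children.keySet());
        Collections.sort(keys); // sorts only this node's (few) children
        for (char c : keys) {
            prefix.append(c);
            collect(node.children.get(c), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        HashTrie trie = new HashTrie();
        for (String w : new String[]{"tea", "ten", "to", "a", "inn", "in"}) {
            trie.insert(w);
        }
        System.out.println(trie.sortedWords());
    }
}
```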
It means sorted. You can't get the insertion order out of a trie without extra effort. Not exactly sure what you mean by inherent feature. The implementation needs to provide it (or provide access to the internals), but it is straightforward to do so.
I am optimizing an implementation of a sorted LinkedList.
To insert an element I traverse the list and compare each element until I have the correct index, and then break loop and insert.
I would like to know if there is any other way that I can insert the element at the same time as traversing the list to reduce the insert from O(n + (n capped at size()/2)) to O(n).
A ListIterator is almost what I'm after because of its add() method, but unfortunately, in the case where there are elements in the list equal to the insert, the insert has to be placed after them in the list. To implement this, ListIterator needs a peek(), which it doesn't have.
edit: I have my answer, but will add this anyway since a lot of people haven't understood correctly:
I am searching for an insertion point AND inserting, which combined is higher than O(n)
You may consider a skip list, which is implemented using multiple linked lists at varying granularity. E.g. the linked list at level 0 contains all items, level 1 only links to every 2nd item on average, level 2 to only every 4th item on average, etc.... Searching starts from the top level and gradually descends to lower levels until it finds an exact match. This logic is similar to a binary search. Thus search and insertion is an O(log n) operation.
A concrete example in the Java class library is ConcurrentSkipListSet (although it may not be directly usable for you here).
I'd favor Péter Török's suggestion, but I'd still like to add something for the iterator approach:
Note that ListIterator provides a previous() method to iterate through the list backwards. Thus first iterate until you find the first element that is greater and then go to the previous element and call add(...). If you hit the end, i.e. all elements are smaller, then just call add(...) without going back.
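A minimal sketch of this iterator approach in Java (the class SortedListInsert and the method insertSorted are made up for illustration). The value ends up after any equal elements, and the search and the insertion share a single pass.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

class SortedListInsert {
    // Inserts value into an already sorted list, after any equal elements.
    static <T extends Comparable<? super T>> void insertSorted(List<T> list, T value) {
        ListIterator<T> it = list.listIterator();
        while (it.hasNext()) {
            if (it.next().compareTo(value) > 0) {
                it.previous(); // step back in front of the first greater element
                break;
            }
        }
        it.add(value); // inserts at the cursor; at the end if nothing was greater
    }

    public static void main(String[] args) {
        List<Integer> list = new LinkedList<>(List.of(1, 3, 3, 5));
        insertSorted(list, 3);
        insertSorted(list, 0);
        insertSorted(list, 9);
        System.out.println(list);
    }
}
```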
I have my answer, but will add this anyway since a lot of people havent understood correctly: I am searching for an insertion point AND inserting, which combined is higher than O(n).
You need to maintain a collection of (possibly) non-unique elements that can be iterated in an order given by an ordering function. This can be achieved in a variety of ways. (In the following I use "total insertion cost" to mean the cost of inserting a number (N) of elements into an initially empty data structure.)
A singly or doubly linked list offers O(N^2) total insertion cost (whether or not you combine the steps of finding the position and doing the insertion!), and O(N) iteration cost.
A TreeSet offers O(NlogN) total insertion cost and O(N) iteration cost, but has the restriction of no duplicates.
A tree-based multiset (e.g. TreeMultiset) has the same complexity as a TreeSet, but allows duplicates.
A skip-list data structure also has the same complexity as the previous two.
Clearly, the complexity measures say that a data structure that uses a linked list performs the worst as N gets large. For this particular group of requirements, a well-implemented tree-based multiset is probably the best, assuming there is only one thread accessing the collection. If the collection is heavily used by many threads (and it is a set), then a ConcurrentSkipListSet is probably better.
You also seem to have a misconception about how "big O" measures combine. If I have one step of an algorithm that is O(N) and a second step that is also O(N), then the two steps combined are STILL O(N), not "more than O(N)". You can derive this from the definition of "big O". (I won't bore you with the details, but the math is simple.)
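For reference, the derivation is short. The symbols f, g, c_1, c_2, N_1 and N_2 below are introduced only for this sketch: f and g are the costs of the two steps.

```latex
\begin{align*}
f(N) &\le c_1 N \quad \text{for all } N \ge N_1, \\
g(N) &\le c_2 N \quad \text{for all } N \ge N_2, \\
f(N) + g(N) &\le (c_1 + c_2)\, N \quad \text{for all } N \ge \max(N_1, N_2),
\end{align*}
```

so f(N) + g(N) is O(N) with the constant c_1 + c_2.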