I have been thinking about a problem I find very interesting, and I would like to share it. You define a hypothetical data structure (it can be a list, array, tree, binary search tree, red-black tree, B-tree, etc.). The goal is to optimize insertion, search, deletion and update (you can treat an update as a search with replacement). The time complexity has to be as low as possible for every type of operation: ideally O(1) or O(log n), and preferably not O(n).
The second part of the problem is that, during a normal working day, this structure receives new elements whose keys increase from 1 to N, where N can be as large as Long.MAX_VALUE. Each new key has to be inserted immediately when it arrives, so the keys come in as follows:
[1,2,3,4,...,N]
I think I am close to a solution, but I am still missing some optimization. I was thinking of using either a tree or a hash table. With a hash table there is a problem: when N becomes very large, the entire structure has to be rehashed, otherwise the complexity would degrade towards O(n). This is not a problem with a tree, but because every new element has to be inserted as it arrives and the keys are increasing, I think the tree may degenerate into a chain of elements in which every node has only a right child.
In that case the tree is not really a tree, it is a list, and using a BST would give the same result.
I think the correct structure to use is a BST (or something like it, for example a red-black tree) together with a way to always keep it balanced, but I am missing something.
If the "key" is an integer and the keys are generated by incrementing a counter starting from 1, then the obvious data structure for representing the key -> value mapping is a ValueType[]. Yes, an array.
There are two problems with this:
1. Arrays do not "grow" in Java. Solutions:
   - Preallocate the array to be big enough to start with.
   - Use an ArrayList instead of an array.
   - "Borrow" the algorithm that ArrayList uses to grow a list and use it with a bare array.
2. Arrays cannot have more than Integer.MAX_VALUE elements. (And ArrayList has the same problem.) Solution: use an array of arrays, and do some arithmetic to convert the long keys into a pair of ints for indexing the arrays.
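The array-of-arrays idea could be sketched roughly as follows. This is only an illustration under some assumptions: the class name ChunkedArray, the chunk size of 2^16 slots and the method names are made up for the example, and the outer list is itself an ArrayList, so the sketch tops out at about 2^47 keys rather than Long.MAX_VALUE (memory runs out long before that anyway).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: append-only "array of arrays" for keys 1, 2, 3, ... (long).
class ChunkedArray<V> {
    private static final int CHUNK_BITS = 16;               // assumed chunk size: 2^16 slots
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final List<Object[]> chunks = new ArrayList<>();
    private long size = 0;                                  // number of keys stored so far

    // Stores the value under the next key (size + 1) and returns that key. Amortized O(1).
    long append(V value) {
        int offset = (int) (size & CHUNK_MASK);
        if (offset == 0) {
            chunks.add(new Object[CHUNK_SIZE]);             // previous chunk is full, or none exists yet
        }
        chunks.get(chunks.size() - 1)[offset] = value;
        return ++size;
    }

    // Looks up the value for a key in O(1): the high bits pick the chunk, the low bits the slot.
    @SuppressWarnings("unchecked")
    V get(long key) {
        checkKey(key);
        long index = key - 1;
        return (V) chunks.get((int) (index >>> CHUNK_BITS))[(int) (index & CHUNK_MASK)];
    }

    // Replaces the value for an existing key in O(1).
    void set(long key, V value) {
        checkKey(key);
        long index = key - 1;
        chunks.get((int) (index >>> CHUNK_BITS))[(int) (index & CHUNK_MASK)] = value;
    }

    private void checkKey(long key) {
        if (key < 1 || key > size) {
            throw new IndexOutOfBoundsException("key " + key);
        }
    }
}
```

Because existing chunks are never copied when the structure grows, an append only ever allocates one new chunk, which avoids the large copy that a single growing array would need.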
Related
Suppose you are given a list of integers that have already been sorted such as (1,7,13,14,50). It should be noted that the list will contain no duplicates.
Is there some data structure that could store this while allowing me to add any new element (at its proper location) in constant time? add(10) would yield (1,7,10,13,14,50).
Similarly, would I be able to update an element (such as changing 7 to 19) and shift the order accordingly in constant time? change(7,19) yields (1,13,14,19,50).
For a class I need to write a data structure that performs these operations as quickly as possible, but I just wanted to know if constant time could be done and if not, then what would the ideal runtime be?
Constant-time, O(1), insertion only occurs as a best case for any of these data structures. Hash tables generally have the best insertion time, but it might not always be O(1) if there are collisions and separate chaining is used. A hash table is not kept in sorted order, though, so its complexity is not really relevant here.
Binary search trees have a good insertion time and, as a bonus, the data is already sorted once a new node has been inserted. This takes O(log n) time on average, however. The best case for inserting is O(1), if the tree is empty.
Those were just a couple examples, see here for more info on the complexities of these operations: http://bigocheatsheet.com/
In general? No. Determining where to insert a new element, or re-ordering the list after insertion, involves analysing the list's contents, which involves reading the elements of the list, which (in general) means iterating over some portion of the list. This (again, in general) is dependent on how many elements are in the list, which by definition is not a constant. Hence, a constant-time sorted insert is simply not possible except in special cases.
A binary tree, TreeSet, would be adequate. An array with Arrays.binarySearch and Arrays.copyOf (or System.arraycopy) would be fine too, because here we have ints, and then we do not need the wrapper class Integer.
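As a rough sketch of the plain-array variant, assuming a sorted int[] without duplicates (the class and method names are made up for the example): Arrays.binarySearch finds the insertion point in O(log n), and System.arraycopy does the O(n) shifting.

```java
import java.util.Arrays;

// Hypothetical helper: sorted insert into an int[] without boxing.
class SortedInts {
    // Returns a new sorted array containing everything in 'sorted' plus 'value'.
    // If 'value' is already present, the input array is returned unchanged.
    static int[] insert(int[] sorted, int value) {
        int pos = Arrays.binarySearch(sorted, value);       // O(log n) search
        if (pos >= 0) {
            return sorted;                                  // no duplicates allowed
        }
        int at = -(pos + 1);                                // decode the insertion point
        int[] result = new int[sorted.length + 1];
        System.arraycopy(sorted, 0, result, 0, at);         // elements before the new value
        result[at] = value;
        System.arraycopy(sorted, at, result, at + 1, sorted.length - at); // elements after it
        return result;
    }
}
```

The search is logarithmic but the copy is still linear, so this is O(n) per insert overall, just with a small constant.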
For real constant time, O(1), one must pay in space. Use a BitSet. To add 17 simply set 17 to true. There are optimized methods to find the next set bit and so on.
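A small sketch of the BitSet idea, assuming the values are non-negative ints and reasonably small (the memory used grows with the largest value, not with the number of values). The wrapper class SortedBits and its method names are made up for the example; set, clear, nextSetBit and stream are the actual java.util.BitSet methods.

```java
import java.util.BitSet;

// Hypothetical wrapper: a sorted set of non-negative ints backed by a BitSet.
class SortedBits {
    private final BitSet bits = new BitSet();

    // add(17) just sets bit 17; no shifting or searching is needed.
    void add(int value) {
        bits.set(value);
    }

    // change(7, 19): clear the old bit and set the new one.
    void change(int oldValue, int newValue) {
        bits.clear(oldValue);
        bits.set(newValue);
    }

    // Smallest stored value >= 'from', or -1 if there is none.
    int nextAtOrAfter(int from) {
        return bits.nextSetBit(from);
    }

    // All stored values in ascending order.
    int[] toSortedArray() {
        return bits.stream().toArray();
    }
}
```

With the example from the question, adding 10 to (1,7,13,14,50) and then changing 7 to 19 leaves the values (1,10,13,14,19,50) when read back in order.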
But I doubt optimizing is really needed at this spot. File I/O might pay off more.
I want to store pairs of <Integer, Object> and keep them in ascending order by the integer key. However, it should be allowed to keep the same key for different objects in the structure so I can't use one of the standard Maps.
Furthermore, the pairs should be able to be addressed by an index. So if I want to address the pair at index 2 (the third-greatest integer value), it should return the object stored there. Afterwards I want to change the integer value, sort the structure back again and rearrange the indices according to the ascending order.
The number of pairs in this structure is going to be constant so I don't need efficient insertion or deletion, only efficient sorting.
Is there such a data structure in Java (or at least in general)?
If you are going to use the same key for multiple entries, you should consider using Multimap from Google Guava, for example.
http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/Multimap.html
One of the subinterfaces of Multimap is SortedSetMultimap (implemented by TreeMultimap), which can be quite useful in your case.
This might be a bit too specific to exist in a general library, but a data structure that can be used to solve this problem efficiently is an order statistic tree (that is, a self-balancing binary search tree (BST) where each node stores the size of the subtree rooted at that node).
Insertion into and deletion from this tree (we can define an update as a delete followed by an insert) would look exactly the same as it would to a regular BST, apart from increasing and decreasing the subtree sizes appropriately.
To get the i-th element, we start at the root and repeatedly determine in which child subtree the target node would be by looking at the number of nodes.
Any of the above operations would take O(log n) time.
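To illustrate how the subtree sizes are used to find the i-th element, here is a minimal sketch. It deliberately leaves out the self-balancing part, so the O(log n) bound only holds while the tree happens to stay balanced; the class name SizedBst and its methods are made up for the example.

```java
// Hypothetical sketch: a plain (unbalanced) BST where each node stores its subtree size.
class SizedBst {
    private static final class Node {
        final int key;
        Node left, right;
        int size = 1;                       // number of nodes in the subtree rooted here
        Node(int key) { this.key = key; }
    }

    private Node root;

    private static int size(Node n) {
        return n == null ? 0 : n.size;
    }

    void insert(int key) {
        root = insert(root, key);
    }

    private static Node insert(Node n, int key) {
        if (n == null) {
            return new Node(key);
        }
        if (key < n.key) {
            n.left = insert(n.left, key);
        } else {
            n.right = insert(n.right, key); // equal keys go to the right
        }
        n.size = 1 + size(n.left) + size(n.right);
        return n;
    }

    // Returns the i-th smallest key (0-based) by comparing i with the left subtree size.
    int select(int i) {
        if (i < 0 || i >= size(root)) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        Node n = root;
        while (true) {
            int leftSize = size(n.left);
            if (i < leftSize) {
                n = n.left;                 // target is in the left subtree
            } else if (i == leftSize) {
                return n.key;               // exactly leftSize keys come before this node
            } else {
                i -= leftSize + 1;          // skip the left subtree and this node
                n = n.right;
            }
        }
    }
}
```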
If you're able to order elements with the same integer value in some way (e.g. using the corresponding object) the above would work as is. If they can't be ordered, it's also possible to have each node be a collection (e.g. List) of elements with the same integer value, and then, instead of the number of nodes, we keep track of the number of elements (i.e. the sum of the sizes of each node) in each subtree.
Note: If you can't find a library or self-balancing implementation, it might be easiest to take a working implementation of a red-black tree and modify that to keep track of the sizes.
If you'd prefer simplicity:
If you're happy with O(n) to get the i-th element, and O(log n) per update, you can, similar to what I described above, just use a TreeMap<Integer, List<Object>> (or Guava's Multimap), i.e. have each node store a list of objects with that integer value. Then, to find the i-th element, you can simply iterate through and keep track of how many elements you've encountered thus far.
If you're happy with O(1) to get the i-th element, but O(n) per update, you can simply use a sorted ArrayList and iterate through to find the correct element and shift it to the correct position. The update can be optimised to use binary search, but it will still be O(n).
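The first of these two simpler options might look roughly like this. It is only a sketch: the class name MultiValueIndex and its methods are made up, and objects that share a key are kept in insertion order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: sorted key -> objects mapping that allows duplicate keys.
class MultiValueIndex {
    private final TreeMap<Integer, List<Object>> map = new TreeMap<>();

    // O(log n): find (or create) the bucket for the key and append the object.
    void add(int key, Object value) {
        map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Removes one occurrence of the object under the given key, if present.
    boolean remove(int key, Object value) {
        List<Object> bucket = map.get(key);
        if (bucket == null || !bucket.remove(value)) {
            return false;
        }
        if (bucket.isEmpty()) {
            map.remove(key);                // drop empty buckets
        }
        return true;
    }

    // An update is a delete followed by an insert.
    void update(int oldKey, int newKey, Object value) {
        if (remove(oldKey, value)) {
            add(newKey, value);
        }
    }

    // O(n): walk the buckets in key order, counting elements until the index is reached.
    Object get(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int seen = 0;
        for (List<Object> bucket : map.values()) {
            if (index < seen + bucket.size()) {
                return bucket.get(index - seen);
            }
            seen += bucket.size();
        }
        throw new IndexOutOfBoundsException("index " + index);
    }
}
```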
I have a sorted array, let's say D={1,2,3,4,5,6}, and I want to add the number 5 in the middle. I can do that by adding the value 5 in the middle and moving the other values one step to the right.
The problem is that I have an array of length 1000 and I need to do that operation 10,000 times, so I need a faster way.
What options do I have? Can I use LinkedLists for better performance?
That depends on how you add said numbers. If only in ascending or descending order - then yes, LinkedList will do the trick, but only if you keep the node reference in between inserts.
If you're adding numbers in arbitrary order, you may want to deconstruct your array, add the new entries and reconstruct it again. This way you can use a data structure that's good at adding and removing entries while maintaining "sortedness". You have to relax one of your assumptions however.
Option 1
Assuming you don't need constant time random access while adding numbers:
Use a sorted binary tree (a binary search tree).
The downside: while you're adding, you cannot read or reference an element by its position, not easily at least. Best case scenario, you're using a tree that keeps track of how many elements the left subtree of each node has, and can get the i-th element in log(n) time. You can still get pretty good performance if you're just iterating through the elements though.
Total runtime is down to n * log(n) from n^2. Random access is log(n).
Option 2
Assuming you don't need the elements sorted while you're adding them.
Use a normal array, but add elements to the end of it, then sort it all when you're done.
Total runtime: n * log(n). Random access is O(1), however elements are not sorted.
Option 3
(This is kinda cheating, but...)
If you have a limited number of values, then employing the idea of BucketSort will help you achieve great performance. Essentially - you would replace your array with a sorted map.
Runtime is O(n), random access is O(1), but it's only applicable to a very small number of situations.
TL;DR
Getting arbitrary values, quick adding and constant-time positional access, while maintaining sortedness is difficult. I don't know any such structure. You have to relax some assumption to have room for optimizations.
A LinkedList will probably not help you very much, if at all. Basically you are exchanging the cost of shifting every value on insert with the cost of having to traverse each node in order to reach the insertion point.
This traversal cost will also need to be paid whenever accessing each node. A LinkedList shines as a queue, but if you need to access the internal nodes individually it's not a great choice.
In your case, you want a sorted tree of some sort. A BST (Binary Search Tree, also referred to as a sorted binary tree) is one of the simplest types and is probably a good place to start.
A good option is a TreeSet, which is likely functionally equivalent to how you were using an array, if you simply need to keep track of a set of sorted numbers.
My goal is a sorted data structure that can accomplish 2 things:
1. Fast insertion (at the location according to sort order)
2. I can quickly segment my data into the sets of everything greater than or less than or equal to an element. I need to know the size of each of these partitions, and I need to be able to "get" these partitions.
Currently, I'm implementing this in java using an ArrayList which provides #2 very easily since I can perform binary search (Collections.binarySearch) and get an insertion index telling me at what point an element would be inserted. Then based on the fact that indices range from 0 to the size of the array, I immediately know how many elements are greater than my element or smaller than my elements, and I can easily get at those elements (as a sublist). However, this doesn't have property #1, and results in too much array copying.
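For reference, the ArrayList approach described above can be sketched like this, assuming the list is sorted and contains no duplicates (with duplicates, Collections.binarySearch does not say which of the equal elements it finds). The class and method names are made up for the example.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: partitioning a sorted list around a key via binary search.
class SortedPartition {
    // Index at which 'key' is, or would be inserted; equals the number of elements < key.
    static <T extends Comparable<? super T>> int insertionIndex(List<T> sorted, T key) {
        int pos = Collections.binarySearch(sorted, key);    // O(log n) on a random-access list
        return pos >= 0 ? pos : -(pos + 1);
    }

    // View of all elements strictly smaller than 'key'; its size() is O(1).
    static <T extends Comparable<? super T>> List<T> smaller(List<T> sorted, T key) {
        return sorted.subList(0, insertionIndex(sorted, key));
    }

    // View of all elements greater than or equal to 'key'; its size() is O(1).
    static <T extends Comparable<? super T>> List<T> greaterOrEqual(List<T> sorted, T key) {
        return sorted.subList(insertionIndex(sorted, key), sorted.size());
    }
}
```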
This makes me want to use something like a SkipList or RedBlackTree that could perform the insertions faster, but then I can't figure out how to satisfy property #2 without making it take O(N) time.
Any suggestions would be appreciated. Thanks
EDIT: Thanks for the answers below that reference data structures that perform the insertion in O(log N) time and that can partition quickly as well, but I want to highlight the size() requirement: I need to know the size of these partitions without having to traverse the entire partition (which, according to this, is what the TreeSet does). The reasoning behind this is that in my use case I maintain my data using several different copies of data structures, each using a different comparator, and then need to ask "according to what comparator is the set of all things larger than a particular element smallest". In the ArrayList case, this is actually easy and takes only O(Y log N), where Y is the number of comparators, because I just binary search each of the Y arrays and return the ArrayList with the highest insertion index. It's unclear to me how I could do this with a TreeSet without taking O(YN).
I should also add that an approximate answer for the insertion index would still be valuable even if it couldn't be solved exactly.
Use a common Java TreeSet. Insertion takes O(log N), so #1 of your requirements is done. Here's the quote from the documentation:
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
And as it implements the NavigableSet interface, you have #2 of your requirements with the following methods:
- tailSet(someElem) returns a Set view starting from someElem (inclusive) till the last element
- headSet(someElem) returns a Set view starting from the first element till someElem (exclusive)
- subSet(fromElem, toElem) returns a Set view starting from fromElem (inclusive) till toElem (exclusive)
These operations are overloaded with versions that let you choose whether each bound is included or excluded.
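A short usage sketch of these views (the class name RangeViewDemo and the sample values are just for illustration):

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical demo of the range views offered by TreeSet / NavigableSet.
class RangeViewDemo {
    static NavigableSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(1, 7, 13, 14, 50));
    }

    public static void main(String[] args) {
        NavigableSet<Integer> set = sample();
        System.out.println(set.headSet(13));         // [1, 7]        (13 excluded by default)
        System.out.println(set.headSet(13, true));   // [1, 7, 13]    (inclusive overload)
        System.out.println(set.tailSet(13));         // [13, 14, 50]  (13 included by default)
        System.out.println(set.tailSet(13, false));  // [14, 50]      (exclusive overload)
        System.out.println(set.subSet(7, 14));       // [7, 13]

        // The views are backed by the set, so later insertions show up in them.
        NavigableSet<Integer> below13 = set.headSet(13, false);
        set.add(10);
        System.out.println(below13);                 // [1, 7, 10]
    }
}
```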
TreeSet is quite flexible: it allows you to define a Comparator to order the Set in a custom way, or you can also rely on the natural ordering of the elements.
EDIT:
As for the requirement that the size() operation of the returned subsets not be O(n), I'm afraid there is no ready-made implementation in the Java API.
It is true: the set views returned by TreeSet range operations implement size() by 'jumping' to the first element of the view in O(log n) time and then iterating over the subsequent elements, adding 1 in each iteration, until the end of the subset is reached.
I must say this is quite unfortunate, since it is not always necessary to traverse the returned subset view, and sometimes knowing the size of the subset in advance can be quite useful (as in your use case).
So, in order to fulfil your requirement, you need another structure, or at least, an auxiliary structure. After some research, I suggest you use a Fenwick tree. A Fenwick tree is also known as a Binary Indexed Tree (BIT), and can be either immutable or mutable. The immutable version is implemented with an array, while the mutable version could be implemented with a balanced binary tree, i.e. a Red-Black tree (Java TreeSet is actually implemented as a Red-Black tree). Fenwick trees are mainly used to store frequencies and calculate the sum of all frequencies up to a given element in O(log n) time.
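To give an idea of the basic operations, here is a minimal sketch of the array-based version. It assumes the keys are integers in a fixed range 1..maxKey that is known up front, which is exactly the restriction the tree-based version is meant to lift; the class and method names are made up for the example.

```java
// Hypothetical sketch: array-based Fenwick tree (binary indexed tree) over keys 1..maxKey.
class FenwickTree {
    private final long[] tree;              // 1-based; tree[0] is unused

    FenwickTree(int maxKey) {
        tree = new long[maxKey + 1];
    }

    // Adds 'delta' to the frequency of 'key' in O(log maxKey).
    void add(int key, long delta) {
        if (key < 1 || key >= tree.length) {
            throw new IllegalArgumentException("key out of range: " + key);
        }
        for (int i = key; i < tree.length; i += i & -i) {   // move to the next covering range
            tree[i] += delta;
        }
    }

    // Returns the sum of the frequencies of all keys in 1..key in O(log maxKey).
    long getSum(int key) {
        if (key < 0 || key >= tree.length) {
            throw new IllegalArgumentException("key out of range: " + key);
        }
        long sum = 0;
        for (int i = key; i > 0; i -= i & -i) {             // strip the lowest set bit
            sum += tree[i];
        }
        return sum;
    }
}
```

If every stored key gets a frequency of 1, getSum(x) is the number of stored keys less than or equal to x, i.e. the size of the partition up to x, without iterating over that partition.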
Please refer to this question here on Stack Overflow for a complete introduction to this little-known yet incredibly useful structure. (As the explanation is already on Stack Overflow, I won't copy it here.)
Here's another Stack Overflow question asking how to properly initialize a Fenwick tree, and here's actual Java code showing how to implement a Fenwick tree's operations. Finally, here's a very good theoretical explanation of the structure and the underlying algorithms being used.
The problem with all the samples on the web is that they use the immutable version of the structure, which is not suitable for you, since you need to interleave queries with adding elements to the structure. However, they are all very useful for fully understanding the structure and the algorithms being used.
My suggestion is that you study Java TreeMap's implementation and see how to modify/extend it so that you can turn it into a Fenwick tree that keeps 1 as a value for every key. This 1 would be each key's frequency. So Fenwick tree's basic operation getSum(someElement) would actually return the size of the subset from first element up to someElement, in O(log n) time.
So the challenge is to implement a balanced tree (a descendant of Java's Red-Black TreeMap, actually) that implements all the Fenwick tree operations you need. I believe you'd be done with getSum(someElement), but maybe you could also extend the returned range views so that they all refer to getSum(someElement) when implementing the size() operation for range views.
Hope this helps, at least I hope it's a good place to start. Please, let me know if you need clarifications, as well as examples.
If you don't need duplicate elements (or if you can make the elements look distinct), I'd use a java.util.TreeSet. It meets your stated requirements.
O(log n) sorted insertion due to binary tree structure
O(log n) segmentation time using in-place subsets
Unfortunately, the O(log n) segmentation time is effectively slowed to O(n) by your requirement to always know the size of the segment, due to the reason in the answer you linked. The in-place subsets don't know their size until you ask them, and then they count. The counted size is stored, but if the backing set is changed in any way, the subset has to count again.
I think the best data structure for this problem would be a B-Tree with a dense index. Such a B-Tree is built from:
- inner nodes containing only pointers to child nodes
- leaves containing pointers to paged arrays
- a number of equal-sized arrays (pages)
Unfortunately there are few generic implementations of a B-Tree in Java, probably because so many variations exist.
The cost of insertion would be
- O(log(n)) to find the position
- O(p) to insert a new element into a page (where p is the constant page size)
Maybe this data structure also covers your segmentation problem. If not: the cost of extracting would be
- O(log(n)) to find the borders
- O(e) to copy the extract (where e is the size of the extract)
One easy way to get what you want involves augmenting your favourite binary search tree data structure (red-black trees, AVL trees, etc...) with left and right subtree sizes at each node --- call them L-size and R-size.
Assume that updating these fields in your tree data structures can be done efficiently (say constant time). Then here is what you get:
Insertion, deletion, and all the regular binary search tree operations as efficient as your choice of data structure --- O(log n) for red-black trees.
Given a key x, you can get the number of elements in your tree that have keys less than x in O(log n) time, by descending down the tree to find the appropriate location for x, summing up the L-sizes (plus one for the actual node you're traversing) each time you "go right". The "greater than" case is symmetrical.
Given a key x, you can get the sorted list x_L of elements that are less than x in O(log n + |x_L|) time by, again, descending down the tree to find the appropriate location for x, and each time you go right you tag the node you just traversed, appending it to a list h_L. Then doing in-order traversals of each of the nodes in h_L (in order of addition to h_L) will give you x_L (sorted). The "greater than" case is symmetrical.
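The counting queries above can be sketched as follows. To keep it short, the sketch uses a plain unbalanced BST, so the O(log n) bounds only hold once the balancing of a red-black or AVL tree is added; the class name RankTree and its methods are made up for the example.

```java
// Hypothetical sketch: BST nodes augmented with L-size and R-size, used for counting queries.
class RankTree {
    private static final class Node {
        final int key;
        Node left, right;
        int lSize, rSize;                   // number of nodes in the left / right subtree
        Node(int key) { this.key = key; }
    }

    private Node root;

    // Walks down from the root, bumping the size on the side the new key goes to.
    void insert(int key) {
        if (root == null) {
            root = new Node(key);
            return;
        }
        Node n = root;
        while (true) {
            if (key < n.key) {
                n.lSize++;
                if (n.left == null) { n.left = new Node(key); return; }
                n = n.left;
            } else {                        // equal keys go to the right
                n.rSize++;
                if (n.right == null) { n.right = new Node(key); return; }
                n = n.right;
            }
        }
    }

    // Number of stored keys strictly less than x.
    int countLess(int x) {
        int count = 0;
        Node n = root;
        while (n != null) {
            if (x <= n.key) {
                n = n.left;                 // this node and its right subtree are >= x
            } else {
                count += n.lSize + 1;       // "go right": left subtree plus this node are < x
                n = n.right;
            }
        }
        return count;
    }

    // Number of stored keys strictly greater than x (the symmetrical case).
    int countGreater(int x) {
        int count = 0;
        Node n = root;
        while (n != null) {
            if (x >= n.key) {
                n = n.right;                // this node and its left subtree are <= x
            } else {
                count += n.rSize + 1;       // "go left": right subtree plus this node are > x
                n = n.left;
            }
        }
        return count;
    }
}
```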
Finally, for my answer to work, I need to guarantee you that we can maintain these L- and R-sizes efficiently for your choice of specific tree data structure. I'll consider the case of red-black trees.
Note that maintaining L-sizes and R-sizes is done in constant time for vanilla binary search trees (when you add a node starting from the root, just add one to L-sizes if the node should go in the left subtree, or one to the R-sizes if it goes in the right subtree). Now the additional balancing procedures of red-black trees only alter the tree structure through local rotations of nodes --- see Wikipedia's depiction of rotations in red-black trees. It's easy to see that the post-rotation L-size and R-size of P and Q can be recalculated from the L-sizes and R-sizes of A,B,C. This only adds a constant amount of work to the red-black tree's operations.
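As a sketch of that constant amount of extra work, here are the two rotations with the size bookkeeping added, using the P, Q, A, B, C naming from the usual rotation diagram (P is the left child of Q before a right rotation). The class and field names are made up, and the colour handling of a real red-black tree is left out.

```java
// Hypothetical sketch: tree rotations that keep L-size and R-size up to date.
class SizedRotations {
    static final class Node {
        Node left, right;
        int lSize, rSize;                   // sizes of the left and right subtrees
    }

    // Before: Q(P(A, B), C).  After: P(A, Q(B, C)).  Returns the new subtree root P.
    static Node rotateRight(Node q) {
        Node p = q.left;
        q.left = p.right;                   // B moves from P to Q
        p.right = q;
        q.lSize = p.rSize;                  // Q's left subtree is now B
        p.rSize = q.lSize + q.rSize + 1;    // P's right subtree is now Q with B and C
        return p;
    }

    // Before: P(A, Q(B, C)).  After: Q(P(A, B), C).  Returns the new subtree root Q.
    static Node rotateLeft(Node p) {
        Node q = p.right;
        p.right = q.left;                   // B moves from Q to P
        q.left = p;
        p.rSize = q.lSize;                  // P's right subtree is now B
        q.lSize = p.lSize + p.rSize + 1;    // Q's left subtree is now P with A and B
        return q;
    }
}
```

Only the two nodes taking part in the rotation change their sizes; the sizes stored inside A, B and C are untouched.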
I'm accessing the minimum element of a binary tree lots of times. What implementations allow me to access the minimum element in constant time, rather than O(log n)?
Depending on your other requirements, a min-heap might be what you are looking for. It gives you constant time retrieval of the minimum element.
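In Java, java.util.PriorityQueue is a binary min-heap by default, so a quick sketch could look like this (the class name MinHeapDemo and the sample values are just for illustration):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical demo: java.util.PriorityQueue used as a min-heap.
class MinHeapDemo {
    static PriorityQueue<Integer> sample() {
        return new PriorityQueue<>(Arrays.asList(13, 7, 50, 1, 14));
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap = sample();
        System.out.println(heap.peek());        // 1, read in O(1) without removing it
        System.out.println(heap.poll());        // 1, removed in O(log n)
        System.out.println(heap.peek());        // 7 is the new minimum
        heap.add(3);                            // O(log n) insertion
        System.out.println(heap.peek());        // 3
        System.out.println(heap.contains(14));  // true, but this is a linear scan
    }
}
```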
However you cannot do some other operations with the same ease as with a simple binary search tree, like determining if a value is in the tree or not. You can take a look at splay trees, a kind of self-balancing binary tree that provides improved access time to recently accessed elements.
Find it once in O(log n) and then compare new values which you are going to add with this cached minimum element.
UPD: about how can this work if you delete the minimum element. You'll just need to spend O(log n) one more time and find new one.
Let's imagine that you have 10,000,000,000,000 integers in your tree. Each element needs 4 bytes of memory, so the whole tree needs about 40 TB of hard drive space. The O(log n) time spent searching for the minimum element in this huge tree amounts to about 43 operations. Of course these are not the simplest operations, but it is still almost nothing, even for 20-year-old processors.
Of course, this applies if it's a real-world problem. If for some purpose (maybe academic) you need a real O(1) algorithm, then I'm not sure that my approach can give you such performance without using additional memory.
This may sound silly, but if you mostly access the minimum element, and don't change the tree too much, maintaining a pointer to the minimal element on add/delete (on any tree) may be the best solution.
Walking the tree will always be O(log n). Did you write the tree implementation yourself? You can always simply stash a reference to the current lowest value element alongside your data structure and keep it updated when adding/removing nodes. (If you didn't write the tree, you could also do this by wrapping the tree implementation in your own wrapper object that does the same thing.)
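A minimal sketch of such a wrapper, using a TreeSet<Integer> as a stand-in for "your tree" (the class name MinCachingSet and its methods are made up for the example):

```java
import java.util.TreeSet;

// Hypothetical wrapper: keeps a cached reference to the minimum alongside the tree.
class MinCachingSet {
    private final TreeSet<Integer> tree = new TreeSet<>();  // stand-in for any tree implementation
    private Integer min;                                    // cached minimum; null when empty

    boolean add(int value) {
        boolean added = tree.add(value);
        if (added && (min == null || value < min)) {
            min = value;                                    // new element is the new minimum
        }
        return added;
    }

    boolean remove(int value) {
        boolean removed = tree.remove(value);
        if (removed && value == min) {
            // Only when the minimum itself is removed do we pay for one more tree walk.
            min = tree.isEmpty() ? null : tree.first();
        }
        return removed;
    }

    // O(1): just return the cached value.
    Integer min() {
        return min;
    }
}
```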
There is an implementation in TAOCP that uses the "spare" pointers in non-full nodes to complete a doubly linked list along the nodes in order (I don't recall the details right now, but I imagine you have to have a "has_child" flag for each direction to make it work).
With that and a start pointer, you could have the starting element in O(1) time.
This solution is not faster or more efficient than caching the minimum.
If by minimum element you mean the element with the smallest value, then you could use a TreeSet, with a custom Comparator that sorts the items into the correct order, to store the individual elements, and then just call SortedSet#first() or #last() to get the smallest/biggest values as efficiently as possible.
Note that inserting new items to TreeSet is slightly slow compared to other Sets/Lists but if you don't have a huge amount of elements which constantly change then it shouldn't be a problem.
If you can use a little memory it sounds like a combined collection might work for you.
For instance, what you are looking for sounds a lot like a linked list. You can always get to the minimum element, but an insert or lookup of an arbitrary node might take longer because you have to do an O(n) lookup.
If you combine a linked list and a tree you might get the best of both worlds. To look up an item for get/insert/delete operations you would use the tree to find the element. The element's "holder" node would have to have ways to cross over from the tree to the linked list for delete operations. Also the linked list would have to be a doubly linked list.
So I think getting the smallest item would be O(1), any arbitrary lookup/delete would be O(logN)--I think even an insert would be O(logN) because you could find where to put it in the tree, look at the previous element and cross over to your linked list node from there, then add "next".
Hmm, this is starting to seem like a really useful data structure, perhaps a little wasteful in memory, but I don't think any operation would be worse than O(logN) unless you have to rebalance the tree.
If you upgrade / "upcomplex" your binary tree to a threaded binary tree, then you can get O(1) first and last elements.
You basically keep a reference to the current first and last nodes.
Immediately after an insert, if first's previous is non-null, then update first. Similarly for last.
Whenever you remove, first check whether the node being removed is the first or the last, and update the stored first and last references appropriately.