Finding the median in B+ tree - java

I need to implement a B+ tree.
I need to create the following methods:
Insert(x) - O(log_t(x)).
Search - successful search in O(log_t(x)); unsuccessful search in O(1) (with high likelihood).
So I started by implementing Insert(x). Each time I have a full leaf, I want to split it into two separate leaves:
one leaf with keys lower than or equal to the median key, and a second one containing the keys greater than the median.
How can I find this median without hurting the running time?
I thought about:
Representing each internal node and leaf as a smaller B+ tree. But then the median is the root (or one of the elements in the root) only when the tree is fully balanced.
Representing each internal node and leaf as a doubly linked list and tracking the median key as the input is inserted. But there are inputs for which this doesn't work.
Representing each node as an array would give me the middle element, but then when I split it I need at least O(n/2) work to copy the keys into a new array.
What can I do?
And about the search, idea-wise: the difference between a successful and an unsuccessful search lies in the leaves, but I still need to walk through the keys of the tree to determine whether the key is present at all. So how can it be O(1)?

In B+ trees, all the values are stored in the leaves.
Note that you can add a pointer from each leaf to the following leaf, and you get, in addition to the standard B+ tree, an ordered linked list of all the elements.
Now note that, assuming you know what the current median in this linked list is, upon insertion/deletion you can cheaply compute the new median (it is either the same node, the next node, or the previous node; there are no other choices).
Note that updating this pointer is O(1) (though the insertion/deletion itself is O(log n)).
Given that, you can cache a pointer to the median element and make sure to maintain it on every insertion/deletion. When you ask for the median, just take it from the cache: O(1).
Regarding unsuccessful search in O(1) with high likelihood: this one screams Bloom filter, a probabilistic set implementation that never has false negatives (it never says something is not in the set when it is), but has some false positives (it sometimes says something is in the set when in fact it isn't).
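A minimal Bloom-filter sketch in Java, for illustration only: the two hash functions here are simplistic stand-ins, and a real filter sizes the bit array and number of hashes from the expected element count and the target false-positive rate.

    import java.util.BitSet;

    /** Minimal Bloom filter sketch: no false negatives, some false positives. */
    class BloomFilter {
        private final BitSet bits;
        private final int size;

        BloomFilter(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        // Two illustrative hash functions derived from hashCode(); a real filter
        // uses k independent hashes tuned to the expected number of elements.
        private int h1(Object o) { return Math.floorMod(o.hashCode(), size); }
        private int h2(Object o) { return Math.floorMod(31 * o.hashCode() + 17, size); }

        void add(Object o) {
            bits.set(h1(o));
            bits.set(h2(o));
        }

        /** false = definitely absent (O(1)); true = probably present, so search the tree. */
        boolean mightContain(Object o) {
            return bits.get(h1(o)) && bits.get(h2(o));
        }
    }

Checking mightContain() before descending the tree makes most unsuccessful searches return in O(1), which matches the "with high likelihood" wording of the requirement.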

You don't need the median of the B+-tree. You need the median key in the node you're splitting. You have to split at that median to satisfy the condition that each node has N/2 <= n <= N keys. The median key in a node is just the one in the middle, at n/2, where n is the number of actual keys in the node. That's where you split the node. Computing that is O(1): it won't hurt the runtime.
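As a sketch, assuming a leaf stores its keys in a sorted array with a fill count (the class and field names are illustrative, not from any particular implementation):

    /** Illustrative leaf: a sorted key array plus a fill count. */
    class Leaf {
        int[] keys;
        int count;

        Leaf(int capacity) { keys = new int[capacity]; }

        /** Split this full leaf around its middle key. */
        Leaf split() {
            int mid = count / 2;                    // the node-local median: found in O(1)
            Leaf right = new Leaf(keys.length);
            right.count = count - mid;
            System.arraycopy(keys, mid, right.keys, 0, right.count); // upper half moves
            count = mid;                            // lower half stays in this leaf
            return right;                           // caller pushes right.keys[0] up a level
        }
    }

Copying the upper half touches half the node's capacity, but that capacity is a constant determined by t, not by the total number of keys n, so the split is O(1) with respect to n and the O(log_t n) insert bound is preserved.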
You can't get O(1) search failure time from a B+-tree without superimposing another data structure.

I've already posted an answer (and since deleted it), but it's possible I've misinterpreted, so here's an answer for another interpretation...
What if you need to always know which item is the median of the complete B+ tree container?
As amit says, you can keep a pointer (along with your root pointer) to the current leaf node that contains the median. You can also keep an index into that leaf node. So you get O(1) access by following those directly to the correct node and item.
The issue is maintaining that. Certainly amit is correct that on each insert, the median either remains the same item or steps to the one just before or after it. And if you have a linked list through the leaf nodes, that can be handled efficiently even if it means stepping to an adjacent leaf node.
I'm not convinced, though, that it's trivial to determine whether to step, or which way, except in the special case where the median and the insert happen to fall in the same leaf node.
If you know the size of the complete tree (which you can easily store and maintain with the root pointer), you can at least determine which index the median item should be at both before and after the insert.
However, you need to know whether the previous median item had its index shifted up by the insert, i.e. whether the insert point was before or after the median. Unless the insert point and the median happen to be in the same node, that's a problem.
Overkill way - augment the B+ tree to support calculating the index of an item and searching for indexes. The trick for that is that each node keeps a total of the number of items in the leaf nodes of its subtree. That can be pushed up a level so each branch node has an array of subtree sizes along with its array of child node pointers.
This offers two solutions. You could use the information to determine the index for the insert point as you search, or (providing nodes have parent pointers) you could use it to re-determine the index of the previous median item after the insert.
[Actually three. After inserting, you could just search for the new half-way index based on the new size without reference to the previous median link.]
In terms of data stored for augmentation, though, this turns out to be overkill. You don't need to know the index of the insert point or the previous median - you can make do with knowing which side of the median the insert occurred on. If you know the trail to follow from the root to the median item, you should be able to keep track of which side of it you are as you search for the insert point. So you only need to augment with enough information to find and maintain that trail.
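To make the stepping rule concrete, here's a sketch of the cached-median bookkeeping. It defines the median as the element of rank floor((n-1)/2), assumes a doubly linked LeafNode with a fill count, and assumes the cached index has already been adjusted for any shift or split inside its own leaf; everything here is illustrative, not part of any standard API.

    /** Minimal leaf stand-in: doubly linked, with a fill count. */
    class LeafNode {
        LeafNode prev, next;
        int count;                 // number of keys currently stored in this leaf
    }

    /** Cached median over the leaf-level linked list. */
    class MedianCache {
        LeafNode leaf;             // leaf currently holding the median
        int index;                 // position of the median inside that leaf
        long treeSize;             // total number of keys in the tree

        /** Call once per insert; insertedBelow = true if the new key landed left of the median. */
        void afterInsert(boolean insertedBelow) {
            treeSize++;
            if (insertedBelow && treeSize % 2 == 0) stepLeft();        // median shifts down
            else if (!insertedBelow && treeSize % 2 == 1) stepRight(); // median shifts up
            // otherwise the median stays on the same element
        }

        private void stepLeft() {
            if (index > 0) index--;
            else { leaf = leaf.prev; index = leaf.count - 1; }
        }

        private void stepRight() {
            if (index < leaf.count - 1) index++;
            else { leaf = leaf.next; index = 0; }
        }
    }

The parity checks encode exactly the "same, previous, or next" behaviour described above; the hard part this answer identifies, computing insertedBelow, is what the trail-tracking in the last paragraph is for.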

Related

Efficient algorithm to create a tree where you know each element's path

This is probably quite an easy question to answer but for some reason my mind has been blank and I've been unable to come up with an efficient solution.
The task is as follows: I have a number of elements, each of which contains an array (its path) and its name. I want to create a tree from this list in the following format (the specific syntax and symbols are not important):
Elements:
( name: Element 1,
Elements:
( name: Element 1.1
)
Name: Element 2,
Elements:
( ... )
)
So given items in the following style, do you have any suggestions for an algorithm to solve this task as efficiently as possible?
The item style is: [ Great Grandfather, Grandfather, Father ], Element Name.
And the number of elements in the path could be any number. The only obvious solution I can come up with is starting at the first item in each of the parent arrays and adding it to the tree if it doesn't exist, then moving on to Parents[1], then Parents[2], etc. Any ideas?
I see at least two approaches. The simpler one is O(n d ln b), with d being the maximum depth and b the branching factor, i.e. the maximum/average number of children per parent. You might assume these constant (and hence use O(n)), but they are probably functionally dependent on n.
The other one is O(n) time but may degrade to quadratic for bad hash functions and other reasons.
Version 1
each node keeps an ordered list of its children.
from root down, find next child, i.e.
if the path is A.B.C, find A, then recurse into A and find B.
finally insert C as child of B (if C does not exist).
Use binarySearch() to find the insertion point (or the matching element).
Since the list of children is sorted, each individual search is O(ln(b)), with b the maximum number of children. You need to go at worst d levels deep, so O(d ln(b)) per item, and the whole thing repeats n times.
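A sketch of Version 1 in Java; Node and getOrAddChild are illustrative names. Note that ArrayList.add(index, element) itself shifts O(b) entries, so a production version might prefer a sorted map per node.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    /** Version 1 sketch: every node keeps its children sorted by name. */
    class Node implements Comparable<Node> {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        public int compareTo(Node other) { return name.compareTo(other.name); }

        /** Find a child by name, inserting it in sorted position if absent: O(ln b) search. */
        Node getOrAddChild(String childName) {
            Node probe = new Node(childName);
            int pos = Collections.binarySearch(children, probe);
            if (pos >= 0) return children.get(pos);
            children.add(-pos - 1, probe);  // binarySearch encodes the insertion point
            return probe;
        }

        /** Walk/create the path from this node, then insert the element: O(d ln b). */
        void insert(String[] path, String elementName) {
            Node current = this;
            for (String ancestor : path) current = current.getOrAddChild(ancestor);
            current.getOrAddChild(elementName);
        }
    }

For the item style in the question, root.insert(new String[] { "Great Grandfather", "Grandfather", "Father" }, "Element 1") creates any missing ancestors on the way down.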
Version 2
This is more memory intensive and might not actually be faster.
While building the tree, keep a map from path to node.
Whenever you encounter a node in the input:
look up the node's parent by path in the map (O(1)).
append the child to the parent and store the full path of the new node in the map.
If you assume the hash table works as advertised, you get O(n) (or O(n ln(n)) if the children are sorted) but need to have enough memory for the hash table.
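A sketch of Version 2, reusing the Node class from the Version 1 sketch; the '/' path separator is an assumption and must not occur in names.

    import java.util.HashMap;
    import java.util.Map;

    /** Version 2 sketch: a map from full path to node gives O(1) expected parent lookup. */
    class PathTreeBuilder {
        final Node root = new Node("");
        private final Map<String, Node> byPath = new HashMap<>();

        void insert(String[] path, String elementName) {
            StringBuilder key = new StringBuilder();
            Node parent = root;
            for (String ancestor : path) {
                key.append('/').append(ancestor);       // build the full path incrementally
                Node node = byPath.get(key.toString()); // O(1) expected lookup
                if (node == null) {                     // first time we see this path
                    node = parent.getOrAddChild(ancestor);
                    byPath.put(key.toString(), node);
                }
                parent = node;
            }
            parent.getOrAddChild(elementName);
        }
    }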
Since Version 2 seems not that much better, I'd stick with Version 1 which is simpler to implement.

Looking for indexable, self-sorting data structure with multiple keys

I want to store pairs of <Integer, Object> and keep them in ascending order by the integer key. However, it should be allowed to keep the same key for different objects in the structure so I can't use one of the standard Maps.
Furthermore, the pairs should be able to be addressed by an index. So if I want to address the pair at index 2 (the third-greatest integer value), it should return the object stored there. Afterwards I want to change the integer value, sort the structure back again and rearrange the indices according to the ascending order.
The number of pairs in this structure is going to be constant so I don't need efficient insertion or deletion, only efficient sorting.
Is there such a data structure in Java (or at least in general)?
If you are going to use the same key for multiple entries, you should consider using Multimap from Google Guava, for example.
http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/Multimap.html
One of the variants of the Multimap interface is SortedSetMultimap, which can be quite useful in your case.
This might be a bit too specific to exist in a general library, but a data structure that can be used to solve this problem efficiently is an order statistic tree (that is, a self-balancing binary search tree (BST) where each node stores the size of the subtree rooted at that node).
Insertion into and deletion from this tree (we can define an update as a delete followed by an insert) would look exactly the same as it would to a regular BST, apart from increasing and decreasing the subtree sizes appropriately.
To get the i-th element, we start at the root and repeatedly determine which child subtree the target node lies in by comparing i against the stored subtree sizes.
Any of the above operations would take O(log n) time.
If you're able to order elements with the same integer value in some way (e.g. using the corresponding object) the above would work as is. If they can't be ordered, it's also possible to have each node be a collection (e.g. List) of elements with the same integer value, and then, instead of the number of nodes, we keep track of the number of elements (i.e. the sum of the sizes of each node) in each subtree.
Note: If you can't find a library or self-balancing implementation, it might be easiest to take a working implementation of a red-black tree and modify that to keep track of the sizes.
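For concreteness, a minimal sketch of the selection walk on a size-augmented node; the balancing machinery (rotations that must also fix the size fields) is omitted, and the caller is assumed to pass 0 <= i < size.

    /** Order statistic node sketch: a BST node that also stores its subtree size. */
    class OSNode {
        int key;
        Object value;
        OSNode left, right;
        int size = 1;                // number of nodes in the subtree rooted here

        /** Return the i-th smallest entry (0-based) in O(height) time. */
        OSNode select(int i) {
            int leftSize = (left == null) ? 0 : left.size;
            if (i < leftSize) return left.select(i);
            if (i == leftSize) return this;
            return right.select(i - leftSize - 1);  // skip the left subtree and this node
        }
    }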
If you'd prefer simplicity:
If you're happy with O(n) to get the i-th element, and O(log n) per update, you can, similar to what I described above, just use a TreeMap<Integer, List<Object>> (or Guava's MultiMap), i.e. have each node store a list of objects with that integer value. Then, to find the i-th element, you can simply iterate through and keep track of how many elements we've encountered thus far.
If you're happy with O(1) to get the i-th element, but O(n) per update, you can simply use a sorted ArrayList and iterate through to find the correct element and shift it to the correct position. The update can be optimised to use binary search, but it will still be O(n).

Sorted data structure with O(logN) insertion that gives insertion point index

My goal is a sorted data structure that can accomplish 2 things:
Fast insertion (at the location according to sort order)
I can quickly segment my data into the sets of everything greater than, or less than or equal to, a given element. I need to know the size of each of these partitions, and I need to be able to "get" these partitions.
Currently, I'm implementing this in java using an ArrayList which provides #2 very easily since I can perform binary search (Collections.binarySearch) and get an insertion index telling me at what point an element would be inserted. Then based on the fact that indices range from 0 to the size of the array, I immediately know how many elements are greater than my element or smaller than my elements, and I can easily get at those elements (as a sublist). However, this doesn't have property #1, and results in too much array copying.
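For reference, the ArrayList approach described above looks roughly like this:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class PartitionDemo {
        public static void main(String[] args) {
            List<Integer> sorted = new ArrayList<>(List.of(1, 3, 5, 7, 9));

            int pos = Collections.binarySearch(sorted, 6);
            int insertionIndex = (pos >= 0) ? pos : -pos - 1;  // 3: 6 belongs before 7

            int below = insertionIndex;                  // size of the "less than" side
            int atOrAbove = sorted.size() - insertionIndex;
            List<Integer> lessThan = sorted.subList(0, insertionIndex);  // O(1) view

            sorted.add(insertionIndex, 6);  // the painful part: O(n) element shifting
            System.out.println(below + " below, " + atOrAbove + " at or above: " + lessThan);
        }
    }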
This makes me want to use something like a SkipList or RedBlackTree that could perform the insertions faster, but then I can't figure out how to satisfy property #2 without making it take O(N) time.
Any suggestions would be appreciated. Thanks
EDIT: Thanks for the answers below that reference data structures that perform the insertion in O(log N) time and that can partition quickly as well, but I want to highlight the size() requirement: I need to know the size of these partitions without having to traverse the entire partition (which, according to this, is what TreeSet does). The reasoning is that in my use case I maintain several copies of the data in structures using different comparators, and then need to ask "according to which comparator is the set of all things larger than a particular element smallest?". In the ArrayList case this is actually easy and takes only O(Y log N), where Y is the number of comparators, because I just binary search each of the Y arrays and return the list with the highest insertion index. It's unclear to me how I could do this with a TreeSet without taking O(YN).
I should also add that an approximate answer for the insertion index would still be valuable even if it couldn't be solved exactly.
Use a common Java TreeSet. Insertion takes O(log N), so #1 of your requirements is done. Here's a quote from the documentation:
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
And as it implements the NavigableSet interface, you have #2 of your requirements with the following methods:
tailSet(someElem) returns a Set view starting from someElem till the last element
headSet(someElem) returns a Set view starting from the first element till someElem
subSet(fromElem, toElem) returns a Set view starting from fromElem till toElem
These operations are overloaded with versions that include/exclude the bounds provided.
TreeSet is quite flexible: it allows you to define a Comparator to order the Set in a custom way, or you can also rely on the natural ordering of the elements.
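A quick illustration of those range views:

    import java.util.List;
    import java.util.TreeSet;

    public class RangeViews {
        public static void main(String[] args) {
            TreeSet<Integer> set = new TreeSet<>(List.of(10, 20, 30, 40, 50));
            System.out.println(set.headSet(30));         // [10, 20]      below 30
            System.out.println(set.headSet(30, true));   // [10, 20, 30]  inclusive bound
            System.out.println(set.tailSet(30, false));  // [40, 50]      strictly above 30
            System.out.println(set.subSet(20, 40));      // [20, 30]      half-open [20, 40)
        }
    }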
EDIT:
As for the requirement that the returned subsets' size() operation not be O(n), I'm afraid there's no ready-made implementation in the Java API.
It is true: the set views returned by TreeSet's range operations implement size() by 'jumping' to the first element of the view in O(log n) time and then iterating over the subsequent elements, adding 1 per iteration, until the end of the subset is reached.
I must say this is quite unfortunate, since it's not always necessary to traverse the returned subset view, and knowing the size of the subset in advance can be quite useful (as in your use case).
So, in order to fulfil your requirement, you need another structure, or at least, an auxiliary structure. After some research, I suggest you use a Fenwick tree. A Fenwick tree is also known as a Binary Indexed Tree (BIT), and can be either immutable or mutable. The immutable version is implemented with an array, while the mutable version could be implemented with a balanced binary tree, i.e. a Red-Black tree (Java TreeSet is actually implemented as a Red-Black tree). Fenwick trees are mainly used to store frequencies and calculate the sum of all frequencies up to a given element in O(log n) time.
Please refer to this question here on Stack Overflow for a complete introduction to this little-known but incredibly useful structure. (As the explanation is already on Stack Overflow, I won't copy it here.)
Here's another Stack Overflow question asking how to properly initialize a Fenwick tree, and here's actual Java code showing how to implement Fenwick tree's operations. Finally, here's a very good theoretic explanation about the structure and the underlying algorithms being used.
The problem with all the samples on the web is that they use the immutable version of the structure, which is not suitable for you, since you need to interleave queries with adding elements to the structure. However, they are all very useful for fully understanding the structure and the algorithms being used.
My suggestion is that you study Java TreeMap's implementation and see how to modify/extend it so that you can turn it into a Fenwick tree that keeps 1 as a value for every key. This 1 would be each key's frequency. So Fenwick tree's basic operation getSum(someElement) would actually return the size of the subset from first element up to someElement, in O(log n) time.
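For reference, here is the classic array-based version. It assumes keys can be mapped to a fixed index range up front, which is exactly the limitation mentioned above; the challenge described next is to graft these two operations onto a balanced tree instead.

    /** Classic array-based Fenwick tree over positions 1..n, storing frequencies.
     *  With every present key's frequency set to 1, prefixSum(i) is the number of
     *  elements at positions <= i, i.e. the size of that head partition. */
    class FenwickTree {
        private final long[] tree;   // 1-based internal array

        FenwickTree(int n) { tree = new long[n + 1]; }

        /** Add delta at position i (1-based): O(log n). */
        void update(int i, long delta) {
            for (; i < tree.length; i += i & -i) tree[i] += delta;
        }

        /** Sum of the frequencies at positions 1..i: O(log n). */
        long prefixSum(int i) {
            long sum = 0;
            for (; i > 0; i -= i & -i) sum += tree[i];
            return sum;
        }
    }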
So the challenge is to implement a balanced tree (a descendant of Java's red-black TreeMap, actually) that implements all the Fenwick tree operations you need. I believe you'd be done with getSum(someElement), but maybe you could also extend the returned subtree range views so that they all refer to getSum(someElement) when implementing the size() operation for range views.
Hope this helps, at least I hope it's a good place to start. Please, let me know if you need clarifications, as well as examples.
If you don't need duplicate elements (or if you can make the elements look distinct), I'd use a java.util.TreeSet. It meets your stated requirements.
O(log n) sorted insertion due to binary tree structure
O(log n) segmentation time using in-place subsets
Unfortunately, the O(log n) segmentation time is effectively slowed to O(n) by your requirement to always know the size of the segment, due to the reason in the answer you linked. The in-place subsets don't know their size until you ask them, and then they count. The counted size is stored, but if the backing set is changed in any way, the subset has to count again.
I think the best data structure for this problem would be a B-Tree with a dense index. Such a B-Tree is built from:
- inner nodes containing only pointers to child nodes
- leafs containing pointers to paged arrays
- a number of equal-sized-arrays (pages)
Unfortunately there are few generic implementations of a B-Tree in Java, probably because so many variations exist.
The cost of insertion would be
O(log(n)) to find the position
O(p) to insert a new element into a page (where p is the constant page size)
Maybe this data structure also covers your segmentation problem. If not: The cost of extracting would be
O(log(n)) to find the borders
O(e) to copy the extract (where e is the size of the extract)
One easy way to get what you want involves augmenting your favourite binary search tree data structure (red-black trees, AVL trees, etc...) with left and right subtree sizes at each node --- call them L-size and R-size.
Assume that updating these fields in your tree data structures can be done efficiently (say constant time). Then here is what you get:
Insertion, deletion, and all the regular binary search tree operations remain as efficient as your choice of data structure allows --- O(log n) for red-black trees.
Given a key x, you can get the number of elements in your tree that have keys less than x in O(log n) time, by descending down the tree to find the appropriate location for x, summing up the L-sizes (plus one for the actual node you're traversing) each time you "go right". The "greater than" case is symmetrical.
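A sketch of that descent, assuming each node stores its left-subtree size (field names are illustrative; the balancing code must keep lSize correct, as discussed below):

    /** BST node augmented with its left-subtree size (the "L-size"). */
    class RankNode {
        int key;
        RankNode left, right;
        int lSize;                  // number of keys in the left subtree

        /** Number of keys strictly less than x, in O(height) time. */
        int countLess(int x) {
            if (x <= key) {
                return (left == null) ? 0 : left.countLess(x);
            }
            int below = lSize + 1;  // going right: the left subtree and this node are all < x
            return (right == null) ? below : below + right.countLess(x);
        }
    }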
Given a key x, you can get the sorted list x_L of elements that are less than x in O(log n + |x_L|) time by, again, descending down the tree to find the appropriate location for x, and each time you go right you tag the node you just traversed, appending it to a list h_L. Then doing in-order traversals of each of the nodes in h_L (in order of addition to h_L) will give you x_L (sorted). The "greater than" case is symmetrical.
Finally, for my answer to work, I need to guarantee you that we can maintain these L- and R-sizes efficiently for your choice of specific tree data structure. I'll consider the case of red-black trees.
Note that maintaining L-sizes and R-sizes is done in constant time for vanilla binary search trees (when you add a node starting from the root, just add one to L-sizes if the node should go in the left subtree, or one to the R-sizes if it goes in the right subtree). Now the additional balancing procedures of red-black trees only alter the tree structure through local rotations of nodes --- see Wikipedia's depiction of rotations in red-black trees. It's easy to see that the post-rotation L-size and R-size of P and Q can be recalculated from the L-sizes and R-sizes of A,B,C. This only adds a constant amount of work to the red-black tree's operations.

The Best Search Algorithm for a Linked List

I have to write a program, as efficiently as possible, that will insert given nodes into a sorted LinkedList. I'm thinking of how binary search is faster than linear search in the average and worst cases, but when I Googled it, the runtime was O(n log n)? Should I do linear search on a singly linked list or binary search on a doubly linked list, and why is that one faster?
Also, how is the binary search algorithm worse than O(log n) for a doubly linked list?
(Please don't recommend a skip list; I think they're against the rules since we have another implementation strictly for that data structure.)
You have two choices.
Linearly search an unordered list. This is O(N).
Linearly search an ordered list. This is also O(N), but it is twice as fast on average, as the item you search for will on average be in the middle, and you can stop there if it isn't found.
You don't have the choice of binary searching it, as you don't have direct access to elements of a linked list.
But if you consider search to be a rate-determining step, you shouldn't use a linked list at all: you should use a sorted array, a heap, a tree, etc.
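For completeness, the ordered-insert-with-early-exit from choice 2 looks like this; ListIterator.add() inserts before the element the iterator would return next.

    import java.util.LinkedList;
    import java.util.ListIterator;

    public class SortedInsert {
        /** Insert into an already-sorted list, stopping at the first larger element.
         *  Still O(N) worst case: a linked list gives no random access to bisect on. */
        static void insertSorted(LinkedList<Integer> list, int value) {
            ListIterator<Integer> it = list.listIterator();
            while (it.hasNext()) {
                if (it.next() > value) {  // walked one element past the insertion point
                    it.previous();        // step back so add() lands before the larger item
                    break;
                }
            }
            it.add(value);                // O(1) once the position is found
        }
    }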
Binary search is very fast on arrays simply because it's very fast and simple to access the middle index between any two given indexes of elements in the array. This makes its running time O(log n) while taking constant O(1) space.
For a linked list it's different: in order to access the middle element we need to traverse the list node by node, so just finding the middle node can take O(n).
Thus binary search is slow on linked lists and fast on arrays.
Binary search is possible using a skip list. You will spend about twice as many pointers as a plain linked list if you add skip links spanning 2, 4, 8, ..., 2^n nodes at the same time, and then you get O(log n) for each search.
If the data stored in each node is quite big, applying this can be very efficient.
You can read more on https://www.geeksforgeeks.org/skip-list/amp/
So basically, binary search on a linked list is O(n log n) because for each of the log n halvings of the search range you need up to n steps just to reach the probed element. But this is only true if you traverse the list from the beginning every time.
Ideally, if you figure out some method (which is possible!) to start from somewhere else, like the middle of the searched set, then you eliminate the need to traverse the list from the start each time and can optimize your algorithm towards O(log n).

Binary tree to get minimum element in O(1)

I'm accessing the minimum element of a binary tree lots of times. What implementations allow me to access the minimum element in constant time, rather than O(log n)?
Depending on your other requirements, a min-heap might be what you are looking for. It gives you constant time retrieval of the minimum element.
However you cannot do some other operations with the same ease as with a simple binary search tree, like determining if a value is in the tree or not. You can take a look at splay trees, a kind of self-balancing binary tree that provides improved access time to recently accessed elements.
Find it once in O(log n) and then compare new values that you are about to add against this cached minimum element.
UPD: regarding how this works if you delete the minimum element: you just spend O(log n) one more time and find the new one.
Let's imagine that you have 10,000,000,000,000 integers in your tree. Each element needs 4 bytes of memory, so the whole tree needs about 40 TB of storage. The O(log n) search for the minimum element in this huge tree takes about 43 operations. Of course they're not the simplest operations, but that's still almost nothing, even for a 20-year-old processor.
Of course this holds if it's a real-world problem. If for some purpose (maybe academic) you need a true O(1) algorithm, then I'm not sure my approach can give you such performance without using additional memory.
This may sound silly, but if you mostly access the minimum element, and don't change the tree too much, maintaining a pointer to the minimal element on add/delete (on any tree) may be the best solution.
Walking the tree will always be O(log n). Did you write the tree implementation yourself? You can always simply stash a reference to the current lowest value element alongside your data structure and keep it updated when adding/removing nodes. (If you didn't write the tree, you could also do this by wrapping the tree implementation in your own wrapper object that does the same thing.)
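A sketch of that wrapper idea, using TreeSet as a stand-in for the underlying tree (with TreeSet itself, first() already exists; the point here is the O(1) cached read):

    import java.util.TreeSet;

    /** Wrapper that caches the minimum of an underlying sorted tree. */
    class MinCachedTree<T extends Comparable<T>> {
        private final TreeSet<T> tree = new TreeSet<>();
        private T min;                      // cached minimum, maintained on add/remove

        void add(T value) {
            tree.add(value);                // O(log n)
            if (min == null || value.compareTo(min) < 0) min = value;  // O(1) upkeep
        }

        void remove(T value) {
            tree.remove(value);             // O(log n)
            if (value.equals(min)) {        // only deleting the minimum forces a refresh
                min = tree.isEmpty() ? null : tree.first();
            }
        }

        T getMin() { return min; }          // O(1)
    }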
There is an implementation in TAOCP that uses the "spare" pointers in non-full nodes to complete a doubly linked list over the nodes in order (I don't recall the details right now, but I imagine you have to have a "has_child" flag for each direction to make it work).
With that and a start pointer, you could have the starting element in O(1) time.
This solution is not faster or more efficient than caching the minimum.
If by minimum element you mean the element with the smallest value, then you could use a TreeSet with a custom Comparator that sorts the items into the correct order, and then just call SortedSet#first() or #last() to get the smallest/largest values as efficiently as possible.
Note that inserting new items into a TreeSet is slightly slow compared to other Sets/Lists, but if you don't have a huge number of elements that constantly change, it shouldn't be a problem.
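For example, with a custom ordering:

    import java.util.Comparator;
    import java.util.TreeSet;

    public class FirstLastDemo {
        public static void main(String[] args) {
            Comparator<String> byLen = Comparator.comparingInt(String::length)
                    .thenComparing(Comparator.naturalOrder());
            TreeSet<String> byLength = new TreeSet<>(byLen);
            byLength.add("pear");
            byLength.add("fig");
            byLength.add("banana");

            System.out.println(byLength.first());  // fig    -- smallest by the comparator
            System.out.println(byLength.last());   // banana -- largest by the comparator
        }
    }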
If you can use a little memory it sounds like a combined collection might work for you.
For instance, what you are looking for sounds a lot like a linked list. You can always get to the minimum element, but an insert or lookup of an arbitrary node might take longer because you have to do an O(n) lookup.
If you combine a linked list and a tree, you might get the best of both worlds. To look up an item for get/insert/delete operations you would use the tree to find the element. The element's "holder" node would need a way to cross over from the tree to the linked list for delete operations. Also, the linked list would have to be doubly linked.
So I think getting the smallest item would be O(1), and any arbitrary lookup/delete would be O(log N). I think even an insert would be O(log N), because you could find where to put it in the tree, look at the previous element, cross over to your linked-list node from there, and then add "next".
Hmm, this is starting to seem like a really useful data structure, perhaps a little wasteful in memory, but I don't think any operation would be worse than O(log N) unless you have to rebalance the tree.
If you upgrade / "upcomplex" your binary tree to a threaded binary tree, then you can get O(1) first and last elements.
You basically keep a reference to the current first and last nodes.
Immediately after an insert, if first's previous is non-null, then update first. Similarly for last.
Whenever you remove, you first check whether the node being removed is the first or last. And update that stored first, last appropriately.
