TreeRangeMap time and space complexities - java

I'm looking at Guava's TreeRangeMap, which seems to suit my needs for a project very well. The Javadoc says it is based on a (standard Java?) TreeMap, which has O(log(n)) time for get, put, and next.
But TreeRangeMap should be some kind of range tree implementation, which according to this SO question has O(k + log(n)) query time and O(n) space, with k being the range size. Can somebody confirm this?
I'm also very interested in the time complexity of the TreeRangeMap.subRangeMap() operation. Does it have the same O(k + log(n))?
Thanks.

It's a view, not an actual mutation or anything. subRangeMap returns in O(1) time, and the RangeMap it returns has O(log n) additive cost for each of its query operations -- that is, all of its operations still take O(log n), just with a higher constant factor.
Source: I'm "the guy who implemented it."
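For illustration, here is a minimal sketch of the pattern (my own example; the class name, ranges, and values are made up), showing that subRangeMap only builds a view and that queries on the view go through the backing tree:

import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;

public class SubRangeMapDemo {
    public static void main(String[] args) {
        RangeMap<Integer, String> map = TreeRangeMap.create();
        map.put(Range.closedOpen(0, 10), "low");
        map.put(Range.closedOpen(10, 20), "mid");
        map.put(Range.closedOpen(20, 30), "high");

        // O(1): no copying, just a view restricted to [5, 25)
        RangeMap<Integer, String> view = map.subRangeMap(Range.closedOpen(5, 25));

        // Each query on the view delegates to the backing map: O(log n)
        System.out.println(view.get(7));  // "low"
        System.out.println(view.get(22)); // "high"
        System.out.println(view.get(27)); // null -- outside the view
    }
}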

We generally use a range tree to find the points that lie in a given interval [x1, x2] with x1 < x2. If the range tree is a balanced binary tree (as in the case of TreeMap, which is implemented as a red-black tree), the search paths to x1 (or its successor) and x2 (or its predecessor) cost O(log n). Once we have found them, if k points lie in this range, we report them with a tree traversal, which costs O(k). So in total: O(k + log(n)).
"I'm also very interested in the time complexity of TreeRangeMap.subRangeMap() operation. Does it have the same O(k + log(n))?"
subRangeMap(Range<K> subRange) returns a view of the part of this range map that intersects with the given range, which is effectively just another balanced binary search tree. So why not?
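To make the O(k + log(n)) reporting cost concrete, here is a small sketch (my own example, using the standard java.util.TreeMap rather than Guava) of a one-dimensional range query: O(log n) to locate the boundaries, then O(k) to iterate over the k hits:

import java.util.NavigableMap;
import java.util.TreeMap;

public class RangeQueryDemo {
    public static void main(String[] args) {
        NavigableMap<Integer, String> points = new TreeMap<>();
        for (int i = 0; i < 100; i += 7) {
            points.put(i, "p" + i);
        }

        // subMap is a view: locating the endpoints costs O(log n), nothing is copied
        NavigableMap<Integer, String> inRange = points.subMap(20, true, 50, true);

        // Iterating over the k matching entries adds O(k) on top of that
        inRange.forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}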

Related

Should I use TreeSet or HashSet?

I have a large number of strings, and I need to print the unique strings in sorted order.
TreeSet stores them in sorted order, but insertion time is O(log n) per insertion. HashSet takes O(1) time to add, but then I will have to get a list of the set and sort it using Collections.sort(), which takes O(n log n) (I assume there is no memory overhead here since only the references of the Strings will be copied into the new collection, i.e. the List). Is it fair to say that overall either choice is the same, since in the end the total time will be the same?
That depends on how close you look. Yes, the asymptotic time complexity is O(n log n) in either case, but the constant factors differ. So it's not like one method can get 100 times faster than the other, but it's certainly possible that one method is twice as fast as the other.
For most parts of a program, a factor of 2 is totally irrelevant, but if your program actually spends a significant part of its running time in this algorithm, it would be a good idea to implement both approaches, and measure their performance.
Measuring is the way to go, but if you're talking purely theoretically and ignoring reading from the collection after sorting, then consider, for number of strings = x:
HashSet:
x * O(1) add operations + 1 O(n log n) (where n is x) sort operation = approximately O(n + n log n) (ok, that's a gross oversimplification, but..)
TreeSet:
x * O(log n) (where n increases from 1 to x) + O(0) sort operation = approximately O(n log (n/2)) (also a gross oversimplification, but..)
And continuing in the oversimplification vein, O(n + n log n) > O(n log (n/2)). Maybe TreeSet is the way to go?
If you distinguish the total number of strings (n) and number of unique strings (m), you get more detailed results for both approaches:
Hash set + sort: O(n) + O(m log m)
TreeSet: O(n log m)
So if n is much bigger than m, using a hash set and sorting the result should be slightly better.
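A minimal sketch of the two approaches being compared (my own example; the class and method names are made up):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class UniqueSortedDemo {
    // Approach 1: O(n) expected inserts into a HashSet, then one O(m log m) sort
    static List<String> viaHashSetAndSort(List<String> input) {
        Set<String> unique = new HashSet<>(input);     // dedupe: O(n) expected
        List<String> sorted = new ArrayList<>(unique); // copies m references only
        Collections.sort(sorted);                      // O(m log m)
        return sorted;
    }

    // Approach 2: O(n log m) total inserts into a TreeSet, which stays sorted
    static Collection<String> viaTreeSet(List<String> input) {
        return new TreeSet<>(input);
    }
}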
You should take into account which methods will be executed more frequently and base your decision on that.
Apart from HashSet and TreeSet you can use LinkedHashSet, which keeps insertion order while offering performance close to HashSet (it does not sort, though). If you want to learn more about their differences in performance I suggest you read 6 Differences between TreeSet HashSet and LinkedHashSet in Java.

Fastest way to find number of elements in a range

Given an array with n elements, how to find the number of elements greater than or equal to a given value (x) in the given range index i to index j in O(log n) or better complexity?
My implementation is this, but it is O(n):
int count = 0;
for (int a = i; a <= j; a++) {
    if (p[a] >= x) { // p[] is the array containing n elements
        count++;
    }
}
If you are allowed to preprocess the array, then with O(n log n) preprocessing time, we can answer any [i,j] query in O(log n) time.
Two ideas:
1) Observe that it is enough to be able to answer [0,i] and [0,j] queries, since the answer for [i,j] is query([0,j], x) - query([0,i-1], x).
2) Use a persistent* balanced order-statistics binary tree, which maintains n versions of the tree; version i is formed from version i-1 by adding a[i] to it. To answer query([0,i], x), you query the version-i tree for the number of elements >= x (basically rank information). An order-statistics tree lets you do that.
*: persistent data structures are an elegant functional programming concept for immutable data structures and have efficient algorithms for their construction.
If the array is sorted you can locate the first value greater than or equal to X with a binary search, and the number of elements >= X is the number of items from that position to the end of the range. That would be O(log(n)).
If the array is not sorted there is no way of doing it in less than O(n) time since you will have to examine every element to check if it's greater than or equal to X.
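For the sorted case, a minimal sketch (my own example) of counting the elements >= x within indexes i..j using one binary search:

public class SortedCountDemo {
    // Counts elements >= x in the ascending-sorted array p, restricted to indexes i..j.
    static int countAtLeast(int[] p, int i, int j, int x) {
        int lo = i, hi = j + 1;       // binary search within [i, j]
        while (lo < hi) {             // find the first index with p[index] >= x
            int mid = (lo + hi) >>> 1;
            if (p[mid] >= x) hi = mid;
            else lo = mid + 1;
        }
        return j - lo + 1;            // everything from lo..j is >= x
    }

    public static void main(String[] args) {
        int[] p = {1, 3, 3, 5, 8, 13, 21};
        System.out.println(countAtLeast(p, 0, p.length - 1, 5)); // prints 4
    }
}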
Impossible in O(log N) because you have to inspect all the elements, so an O(N) method is expected.
The standard algorithm for this is based on quicksort's partition, sometimes called quick-select.
The idea is that you don't sort the array, but rather just partition the section containing x, and stop when x is your pivot value. After the procedure is completed, all elements greater than or equal to x are to the right of the partition point. This is the same procedure as when finding the k-th largest element.
Read about a very similar problem at How to find the kth largest element in an unsorted array of length n in O(n)?.
The requirement index i to j is not a restriction that introduces any complexity to the problem.
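A minimal sketch of that partition-style variant for this particular question (my own illustration, not code from the linked answer): rearrange the range so that everything >= x ends up on the right, and the count is the size of that block:

public class PartitionCountDemo {
    // Rearranges p[i..j] so that elements < x come first, then returns how many
    // elements in that range are >= x (they all end up in the right-hand block).
    static int partitionAndCount(int[] p, int i, int j, int x) {
        int boundary = i;                       // start of the ">= x" block
        for (int a = i; a <= j; a++) {
            if (p[a] < x) {
                int t = p[a]; p[a] = p[boundary]; p[boundary] = t;
                boundary++;
            }
        }
        return j - boundary + 1;                // size of the ">= x" block
    }
}

This is still O(n), of course; the partition only pays off if you go on to reuse the rearranged halves, as in quickselect.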
Given your requirements where the data is not sorted in advance and constantly changing between queries, O(n) is the best complexity you can hope to achieve, since there's no way to count the number of elements greater than or equal to some value without looking at all of them.
It's fairly simple if you think about it: you cannot avoid inspecting every element of a range for any type of search if you have no idea how it's represented/ordered in advance.
You could construct a balanced binary tree, or even radix sort on the fly, but you're just pushing the overhead elsewhere, to the same linear or worse, linearithmic O(N log N) complexity, since such algorithms once again have you inspecting every element in the range first in order to sort it.
So there's actually nothing wrong with O(N) here. That is the ideal, and you're looking at either changing the whole nature of the data involved outside to allow it to be sorted efficiently in advance or micro-optimizations (ex: parallel fors to process sub-ranges with multiple threads, provided they're chunky enough) to tune it.
In your case, your requirements seem rigid so the latter seems like the best bet with the aid of a profiler.

Is a binary or sequential/linear search more efficient when searching a sorted linked list?

I need to write a program that searches through a previously sorted linked list and I'm not sure which search would be more efficient.
All traversals of linked lists are in order, so a linear search is the best you can do, with an average case linear in the number of elements. If there are going to be lots of searches and you still need a linked list (instead of a random-access structure such as an array), consider a skip list, which lets you skip forward until you get near the desired element.
Linear search would be more efficient, and here's why.
In order to get to the kth location in a doubly linked list of size n, you have to iterate over at most n/2 elements.
If you were to apply binary search on top of that, then you'd wind up having to go down k elements every time, plus the work to perform a binary search.
O(n + log(n)) = O(n), which is equivalent to the performance of a linear search.
If your comparisons are cheap, the other answers are correct that linear and binary search are mostly equivalent (binary search has slightly higher expected number of node traversals, though the big-O is the same, O(n) traversals).
But this assumes that comparisons are a negligible cost relative to traversals. And that's not always a good assumption. If your comparisons are expensive, then binary search is still worth it, because binary search remains O(log n) on number of comparisons, while linear search is O(n). For example, if your comparison operation is something ridiculously expensive (say, the MD5 hash of file data, and you didn't cache it with the file names for whatever reason), and you've got a 1000 element list, linear search means you're, on average, computing the MD5 hash of 500 files (in this case, you could probably reduce it a bit by choosing the end to start from based on whether the MD5 you're searching for begins with 0-7 or 8-f, but even so, it's O(n) comparisons divided by a constant factor). Binary search means at most 10 (or 11, can't be bothered with off-by-one errors) file reads and MD5 computations. If the files are large enough, that's the difference between taking ~1 second to run and taking ~50 seconds.
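As a concrete illustration (my own sketch; the deliberately "expensive" comparator just stands in for something like the MD5 example), java.util.Collections.binarySearch on a large list that does not implement RandomAccess performs an iterator-based binary search: O(n) link traversals overall, but only O(log n) comparisons:

import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedList;

public class LinkedBinarySearchDemo {
    public static void main(String[] args) {
        LinkedList<String> sorted = new LinkedList<>();
        for (int i = 0; i < 10000; i++) {
            sorted.add(String.format("file-%05d", i)); // already in sorted order
        }

        // Pretend each comparison is expensive (e.g. hashing file contents).
        Comparator<String> expensive = (a, b) -> {
            // ... imagine an MD5 computation happening here ...
            return a.compareTo(b);
        };

        // On a large LinkedList this walks O(n) links in total but makes
        // only O(log n) calls to the comparator.
        int idx = Collections.binarySearch(sorted, "file-05000", expensive);
        System.out.println(idx); // 5000
    }
}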

What is the time complexity of the tailSet operation of java.util.TreeSet?

I was implementing the 2D closest pair algorithm using a sweep line, and it says that you need to find the six points above a certain y coordinate. What I did is put the points in a TreeSet sorted by y-coordinate, used the tailSet method to get all points above a certain point, and iterated up to 6 times.
I was wondering if the complexity of the tailSet operation is O(log n), and if it is, is iterating over the tailSet at most six times also O(log n)?
Reference: http://people.scs.carleton.ca/~michiel/lecturenotes/ALGGEOM/sweepclosestpair.pdf
AFAIK taking the tailSet is O(log n), but iterating over the last m elements is O(m * log n).
Hmm... that seems strange to me. I thought that, in terms of big O, creating a tailSet inside java.util.TreeSet is O(1).
Small clarification: tailSet(), headSet() and subSet() return very small objects that delegate all the hard work to methods of the underlying set. No new set is constructed. Hence O(1) and pretty insignificant.
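A small sketch of the pattern from the question (my own example, using plain y-coordinates instead of full points): the tailSet view is created in O(1), iteration starts with an O(log n) descent, and looking at no more than six candidates adds only a constant:

import java.util.NavigableSet;
import java.util.TreeSet;

public class TailSetDemo {
    public static void main(String[] args) {
        NavigableSet<Double> ys = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            ys.add(Math.random());
        }

        // tailSet returns a lightweight view in O(1); no elements are copied
        NavigableSet<Double> candidates = ys.tailSet(0.75, true);

        // The first step of iteration descends the tree in O(log n); examining
        // at most 6 candidates after that is a constant amount of extra work.
        int examined = 0;
        for (double y : candidates) {
            if (examined++ == 6) break;
            System.out.println("candidate y = " + y);
        }
    }
}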

Quicksort - how do pivot-choosing strategies affect the overall big-O behavior of quicksort?

I have come up with several strategies, but I am not entirely sure how they affect the overall behavior. I know the average case is O(N log N), so I would assume that would be in the answer somewhere. I want to just put N log N + 1 for the case where I simply select the 1st item in the array as the pivot for the quicksort, but I don't know whether that is correct or acceptable. If anyone could enlighten me on this subject that would be great. Thanks!
Possible Strategies:
a) Array is random: pick the first item since that is the most cost effective choice.
b) Array is mostly sorted: pick the middle item so we are likely to complement the binary recursion of splitting in half each time.
c) Array is relatively large: pick the first, middle and last indexes in the array and compare them, picking the median of the three to ensure we avoid the worst case.
d) Perform 'c' with randomly generated indexes to make selection less deterministic.
An important fact you should know is that in an array of distinct elements, quicksort with a random choice of partition will run in O(n lg n). There are many good proofs of this, and the one on Wikipedia actually has a pretty good discussion of it. If you're willing to go for a slightly less formal proof that's mostly mathematically sound, the intuition goes as follows.
Whenever we pick a pivot, let's say that a "good" pivot is a pivot that gives us at least a 75%/25% split; that is, it's greater than at least 25% of the elements and at most 75% of the elements. We want to bound the number of times that we can get a pivot of this sort before the algorithm terminates. Suppose that we get k splits of this sort and consider the size of the largest subproblem generated this way. It has size at most (3/4)^k * n, since on each iteration we're getting rid of at least a quarter of the elements. If we consider the specific case where k = log_{3/4}(1/n) = log_{4/3}(n), then the size of the largest subproblem after k good pivots are chosen will be 1, and the recursion will stop. This means that if we get O(lg n) good pivots, the recursion will terminate.
But on each iteration, what's the chance of getting such a pivot? Well, if we pick the pivot randomly, then there's a 50% chance that it's in the middle 50% of the elements, and so in expectation we'll choose two random pivots before we get a good pivot. Each step of choosing a pivot takes O(n) time, and so we should spend roughly O(n) time before getting each good pivot. Since we get at most O(lg n) good pivots, the overall runtime is O(n lg n) in expectation.
An important detail in the above discussion is that if you replace the 75%/25% split with any constant split - say, a (100 - k)% / k% split - the overall asymptotic analysis is the same. You'll get that quicksort takes, on average, O(n lg n) time.
The reason that I've mentioned this proof is that it gives you a good framework for thinking about how to choose a pivot in quicksort. If you can pick a pivot that's pretty close to the middle on each iteration, you can guarantee O(n lg n) runtime. If you can't guarantee that you'll get a good pivot on any iteration, but can say that in expectation it takes only a constant number of iterations before you get a good pivot, then you can also guarantee O(n lg n) expected runtime.
Given this, let's take a look at your proposed pivot schemes. For (a), if the array is random, picking the first element as the pivot is essentially the same as picking a random pivot at each step, and so by the above analysis you'll get O(n lg n) runtime on expectation. For (b), if you know that the array is mostly sorted, then picking the median is a good strategy. The reason is that if we can say that each element is "pretty close" to where it should be in the sorted sequence, then you can make an argument that every pivot you choose is a good pivot, giving you the O(n lg n) runtime you want. (The term "pretty close" isn't very mathematically precise, but I think you could formalize this without too much difficulty if you wanted to).
As for (c) and (d), of the two, (d) is the only one guaranteed to get O(n lg n) on expectation. If you deterministically pick certain elements to use as pivots, your algorithm will be vulnerable to deterministic sequences that can degenerate it to O(n^2) behavior. There's actually a really interesting paper on this called "A Killer Adversary for Quicksort" by McIlroy that describes how you can take any deterministic quicksort and construct a pathologically worst-case input for it by using a malicious comparison function. You almost certainly want to avoid this in any real quicksort implementation, since otherwise malicious users could launch DoS attacks on your code by feeding in these killer sequences to force your program to sort in quadratic time and thus hang. On the other hand, because (d) is picking its sample points randomly, it is not vulnerable to this attack, because on any sequence the choice of pivots is random.
Interestingly, though, for (d), while it doesn't hurt to pick three random elements and take the median, you don't need to do this. The earlier proof is enough to show that you'll get O(n lg n) on expectation with a single random pivot choice. I actually don't know if picking the median of three random values will improve the performance of the quicksort algorithm, though since quicksort is always Ω(n lg n) it certainly won't be asymptotically any better than just picking random elements as the pivots.
I hope that this helps out a bit - I really love the quicksort algorithm and all the design decisions involved in building a good quicksort implementation. :-)
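For reference, here is a minimal sketch (my own, not part of the answer above) of quicksort with a uniformly random pivot, which is the scheme the expectation argument applies to:

import java.util.Random;

public class RandomizedQuicksort {
    private static final Random RNG = new Random();

    static void quicksort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        quicksort(a, lo, p - 1);
        quicksort(a, p + 1, hi);
    }

    // Lomuto partition around a uniformly random pivot
    static int partition(int[] a, int lo, int hi) {
        int pivotIndex = lo + RNG.nextInt(hi - lo + 1);
        swap(a, pivotIndex, hi);          // move the pivot to the end
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                swap(a, i++, j);
            }
        }
        swap(a, i, hi);                   // place the pivot in its final spot
        return i;
    }

    static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}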
You have to understand that there are already many algorithms that will allow you to sustain an O(n log(n)) complexity. Using randomized quicksort has an expected time complexity of O(n log(n)), and it is usually considered better than other approaches.
You would be able to sustain O(n log(n)) if you used a mix of all of the above, i.e. conditionally applied one of them based on the "profile" of your input data set. That being said, categorising an input data set is in itself a challenge. In any case, to do any better, you have to research your input data set and choose among the possible alternatives.
The best pivot is the one which divides the array into exactly two halves. The median of the array is of course the best choice, so I would suggest this approach (see the sketch below):
Select some random indexes
Calculate the median of these elements
Use that median as the pivot element
From the O(n) median-finding algorithm, I think 5 random indexes should be sufficient.
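A rough sketch of that pivot-selection step (my own illustration; the sample size of 5 just follows the suggestion above):

import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class SampledMedianPivot {
    private static final Random RNG = new Random();

    // Picks up to 5 random positions in a[lo..hi] and returns the index holding
    // the median of the sampled values, to be used as the quicksort pivot.
    static int choosePivotIndex(int[] a, int lo, int hi) {
        int sampleSize = Math.min(5, hi - lo + 1);
        Integer[] sample = new Integer[sampleSize];
        for (int s = 0; s < sampleSize; s++) {
            sample[s] = lo + RNG.nextInt(hi - lo + 1);
        }
        // Sorting a constant-sized sample by the values it points at is O(1)
        Arrays.sort(sample, Comparator.comparingInt((Integer idx) -> a[idx]));
        return sample[sampleSize / 2];    // index of the sampled median
    }
}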
