Given an array with n elements, how can I find the number of elements greater than or equal to a given value x in a given range (index i to index j) in O(log n) or better?
My implementation is this, but it is O(n):
    count = 0;
    for (a = i; a <= j; a++)
        if (p[a] >= x)   // p[] is the array containing the n elements
            count++;
If you are allowed to preprocess the array, then with O(n log n) preprocessing time, we can answer any [i,j] query in O(log n) time.
Two ideas:
1) Observe that it is enough to be able to answer [0,i] and [0,j] queries.
2) Use a persistent* balanced order statistics binary tree, which maintains n versions of the tree; version i is formed from version i-1 by adding a[i] to it. To answer query([0,i], x), you query the version-i tree for the number of elements >= x (basically rank information). An order statistics tree lets you do that.
*: persistent data structures are an elegant functional programming concept for immutable data structures and have efficient algorithms for their construction.
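The persistent order statistics tree gives O(log n) per query, but it is fiddly to implement. If O(log^2 n) per query is acceptable, a merge sort tree (a segment tree whose nodes store sorted copies of their ranges) is a much simpler alternative with the same O(n log n) preprocessing. Here is a rough sketch of that simpler structure, not the persistent one described above; the class and method names are mine:

    import java.util.ArrayList;
    import java.util.List;

    // Segment tree where each node stores a sorted copy of its range ("merge sort tree").
    // Build: O(n log n) time and space. Query "how many elements >= x in a[i..j]": O(log^2 n).
    class MergeSortTree {
        private final List<List<Integer>> tree = new ArrayList<>();
        private final int n;

        MergeSortTree(int[] a) {
            n = a.length;
            for (int t = 0; t < 4 * n; t++) tree.add(null);
            build(1, 0, n - 1, a);
        }

        private void build(int node, int lo, int hi, int[] a) {
            if (lo == hi) {
                List<Integer> leaf = new ArrayList<>();
                leaf.add(a[lo]);
                tree.set(node, leaf);
                return;
            }
            int mid = (lo + hi) / 2;
            build(2 * node, lo, mid, a);
            build(2 * node + 1, mid + 1, hi, a);
            // Merge the two sorted child lists, as in merge sort.
            List<Integer> left = tree.get(2 * node), right = tree.get(2 * node + 1);
            List<Integer> merged = new ArrayList<>(left.size() + right.size());
            int i = 0, j = 0;
            while (i < left.size() || j < right.size()) {
                if (j == right.size() || (i < left.size() && left.get(i) <= right.get(j))) {
                    merged.add(left.get(i++));
                } else {
                    merged.add(right.get(j++));
                }
            }
            tree.set(node, merged);
        }

        // Number of elements >= x in a[i..j], inclusive.
        int countAtLeast(int i, int j, int x) {
            return query(1, 0, n - 1, i, j, x);
        }

        private int query(int node, int lo, int hi, int i, int j, int x) {
            if (j < lo || hi < i) return 0;              // disjoint ranges
            if (i <= lo && hi <= j) {                    // fully covered: binary search this node
                List<Integer> sorted = tree.get(node);
                int l = 0, r = sorted.size();
                while (l < r) {                          // lower bound of x
                    int m = (l + r) / 2;
                    if (sorted.get(m) < x) l = m + 1; else r = m;
                }
                return sorted.size() - l;
            }
            int mid = (lo + hi) / 2;
            return query(2 * node, lo, mid, i, j, x) + query(2 * node + 1, mid + 1, hi, i, j, x);
        }
    }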
If the array is sorted you can locate the first value greater than or equal to X with a binary search, and the number of elements greater than or equal to X is the number of items from that position to the end. That would be O(log(n)).
If the array is not sorted there is no way of doing it in less than O(n) time since you will have to examine every element to check if it's greater than or equal to X.
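To illustrate the sorted case, a minimal sketch of that lookup (the helper names are mine), counting the elements >= X in O(log n):

    // Index of the first element >= x in the sorted slice a[from..to) (returns to if none).
    static int lowerBound(int[] a, int from, int to, int x) {
        int lo = from, hi = to;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < x) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Number of elements >= x in the sorted range a[i..j], inclusive.
    static int countAtLeast(int[] a, int i, int j, int x) {
        return (j + 1) - lowerBound(a, i, j + 1, x);
    }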
Impossible in O(log N) because you have to inspect all the elements, so an O(N) method is expected.
The standard algorithm for this is based on quicksort's partition, sometimes called quick-select.
The idea is that you don't sort the array, but rather just partition it around x, stopping when x is your pivot element. After the procedure is completed you have all elements equal to or greater than x to the right of x. This is the same procedure as when finding the k-th largest element.
Read about a very similar problem at How to find the kth largest element in an unsorted array of length n in O(n)?.
The requirement index i to j is not a restriction that introduces any complexity to the problem.
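For illustration, a single partition pass over a[i..j] with x as the pivot value both groups the elements and yields the count; a rough sketch (the function name is mine):

    // Rearrange a[i..j] so that every element >= x ends up at the right end,
    // and return how many such elements there are. One pass, O(j - i + 1).
    static int countAtLeastByPartition(int[] a, int i, int j, int x) {
        int left = i, right = j;
        while (left <= right) {
            if (a[left] >= x) {
                int tmp = a[left]; a[left] = a[right]; a[right] = tmp;  // move it to the ">= x" side
                right--;
            } else {
                left++;
            }
        }
        return j - left + 1;  // a[left..j] now holds exactly the elements >= x
    }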
Given your requirements where the data is not sorted in advance and constantly changing between queries, O(n) is the best complexity you can hope to achieve, since there's no way to count the number of elements greater than or equal to some value without looking at all of them.
It's fairly simple if you think about it: you cannot avoid inspecting every element of a range for any type of search if you have no idea how it's represented/ordered in advance.
You could construct a balanced binary tree, or even radix sort on the fly, but you're just pushing the overhead elsewhere to the same linear, or worse linearithmic O(N log N), complexity, since such algorithms once again have you inspecting every element in the range in order to sort it.
So there's actually nothing wrong with O(N) here. That is the ideal, and you're looking at either changing the whole nature of the surrounding data to allow it to be sorted efficiently in advance, or micro-optimizations (e.g. parallel fors to process sub-ranges with multiple threads, provided they're chunky enough) to tune it.
In your case, your requirements seem rigid so the latter seems like the best bet with the aid of a profiler.
Related
I recently had a coding test during an interview. I was told:
There is a large unsorted array of one million ints. User wants to retrieve K largest elements. What algorithm would you implement?
During this, I was strongly hinted that I needed to sort the array.
So, I suggested using the built-in sort(), or maybe a custom implementation if performance really mattered. I was then told that by using a Collection or array to store the k largest and a for-loop it is possible to achieve approximately O(N). In hindsight, I think it's O(N*k), because each iteration needs to compare against the K-sized array to find the smallest element to replace, while the need to sort the array would cause the code to be at least O(N log N).
I then reviewed this link on SO that suggests priority queue of K numbers, removing the smallest number every time a larger element is found, which would also give O(N log N). Write a program to find 100 largest numbers out of an array of 1 billion numbers
Is the for-loop method bad? How should I justify pros/cons of using the for-loop or the priorityqueue/sorting methods? I'm thinking that if the array is already sorted, it could help by not needing to iterate through the whole array again, i.e. if some other method of retrieval is called on the sorted array, it should be constant time. Is there some performance factor when running the actual code that I didn't consider when theorizing pseudocode?
Another way of solving this is using Quickselect. This should give you a total average time complexity of O(n). Consider this:
Find the kth largest number x using Quickselect (O(n))
Iterate through the array again (or just through the right-side partition) (O(n)) and save all elements ≥ x
Return your saved elements
(If there are repeated elements, you can avoid them by keeping count of how many duplicates of x you need to add to the result.)
The difference between your problem and the one in the SO question you linked to is that you have only one million elements, so they can definitely be kept in memory to allow normal use of Quickselect.
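Here is a rough sketch of those three steps (all names are mine; the pivot is chosen at random, so the O(n) bound is in expectation rather than worst case):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    static List<Integer> largestK(int[] a, int k) {
        int[] copy = a.clone();                            // keep the caller's array intact
        int x = quickselect(copy, 0, copy.length - 1, copy.length - k);  // k-th largest value
        List<Integer> result = new ArrayList<>();
        int remaining = k;                                 // slots still to fill
        for (int v : a) if (v > x) { result.add(v); remaining--; }
        for (int v : a) if (v == x && remaining > 0) { result.add(v); remaining--; }  // duplicates of x
        return result;
    }

    // Returns the element that would sit at index targetIndex if the array were sorted.
    static int quickselect(int[] a, int lo, int hi, int targetIndex) {
        while (true) {
            if (lo == hi) return a[lo];
            int pivotIndex = ThreadLocalRandom.current().nextInt(lo, hi + 1);
            pivotIndex = partition(a, lo, hi, pivotIndex);
            if (targetIndex == pivotIndex) return a[targetIndex];
            else if (targetIndex < pivotIndex) hi = pivotIndex - 1;
            else lo = pivotIndex + 1;
        }
    }

    // Lomuto partition around a[pivotIndex]; returns the pivot's final position.
    static int partition(int[] a, int lo, int hi, int pivotIndex) {
        int pivot = a[pivotIndex];
        swap(a, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) { swap(a, i, store); store++; }
        }
        swap(a, store, hi);
        return store;
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }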
There is a large unsorted array of one million ints. The user wants to retrieve the K largest elements.
During this, I was strongly hinted that I needed to sort the array.
So, I suggested using a built-in sort() or maybe a custom
implementation
That wasn't really a hint I guess, but rather a sort of trick to deceive you (to test how strong your knowledge is).
If you choose to approach the problem by sorting the whole source array using the built-in Dual-Pivot Quicksort, you can't obtain time complexity better than O(n log n).
Instead, we can maintain a PriorityQueue which stores the result. While iterating over the source array, for each element we check whether the queue has reached size K. If not, the element is added to the queue; otherwise (the size equals K) we compare the element against the lowest element in the queue: if it is smaller or equal we ignore it, and if it is greater, the lowest element is removed and the new element is added.
The time complexity of this approach is O(n log k), because adding a new element to (or removing the smallest element from) a PriorityQueue of size k costs O(log k), and in the worst-case scenario these operations are performed n times (because we're iterating over the array of size n).
Note that the best case time complexity would be Ω(n), i.e. linear.
So the difference between sorting and using a PriorityQueue, in terms of Big O, boils down to the difference between O(n log n) and O(n log k). When k is much smaller than n, this approach gives a significant performance gain.
Here's an implementation:
    public static int[] getHighestK(int[] arr, int k) {
        Queue<Integer> queue = new PriorityQueue<>();   // min-heap holding the k largest seen so far
        for (int next : arr) {
            // If the heap is full and its smallest element loses to `next`, evict it.
            if (queue.size() == k && queue.peek() < next) queue.remove();
            // Add `next` only if there is room (we just evicted, or the heap isn't full yet).
            if (queue.size() < k) queue.add(next);
        }
        return toIntArray(queue);
    }

    public static int[] toIntArray(Collection<Integer> source) {
        return source.stream().mapToInt(Integer::intValue).toArray();
    }
main()
    public static void main(String[] args) {
        System.out.println(Arrays.toString(getHighestK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
    }
Output:
[9, 12, 27]
Sorting in O(n)
We can achieve worst case time complexity of O(n) when there are some constraints regarding the contents of the given array. Let's say it contains only numbers in the range [-1000,1000] (sure, you haven't been told that, but it's always good to clarify the problem requirements during the interview).
In this case, we can use Counting sort which has linear time complexity. Or better, just build a histogram (first step of Counting Sort) and look at the highest-valued buckets until you've seen K counts. (i.e. don't actually expand back to a fully sorted array, just expand counts back into the top K sorted elements.) Creating a histogram is only efficient if the array of counts (possible input values) is smaller than the size of the input array.
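As an illustration, here is a minimal sketch of that histogram variant, under the assumed [-1000, 1000] value range (method name is mine):

    static int[] highestKByHistogram(int[] arr, int k) {
        final int MIN = -1000, MAX = 1000;          // assumed value range
        int[] counts = new int[MAX - MIN + 1];
        for (int v : arr) counts[v - MIN]++;        // histogram: O(n)

        int[] result = new int[k];
        int filled = 0;
        // Walk buckets from the highest value down, expanding only the top k counts.
        for (int bucket = counts.length - 1; bucket >= 0 && filled < k; bucket--) {
            for (int c = counts[bucket]; c > 0 && filled < k; c--) {
                result[filled++] = bucket + MIN;
            }
        }
        return result;                              // descending order, O(n + range)
    }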
Another possibility is when the given array is partially sorted, consisting of several sorted chunks. In this case, we can use Timsort, which is good at finding sorted runs; it will deal with them in linear time.
And Timsort is already implemented in Java; it's used to sort objects (not primitives). So we can take advantage of the well-optimized and thoroughly tested implementation instead of writing our own, which is great. But since we are given an array of primitives, using the built-in Timsort would have an additional cost - we need to copy the contents of the array into a list (or array) of a wrapper type.
This is a classic problem that can be solved with so-called heapselect, a simple variation on heapsort. It can also be solved with quickselect, but, like quicksort, that has a poor quadratic worst-case time complexity.
Simply keep a priority queue, implemented as binary heap, of size k of the k smallest values. Walk through the array, and insert values into the heap (worst case O(log k)). When the priority queue is too large, delete the minimum value at the root (worst case O(log k)). After going through the n array elements, you have removed the n-k smallest elements, so the k largest elements remain. It's easy to see the worst-case time complexity is O(n log k), which is faster than O(n log n) at the cost of only O(k) space for the heap.
Here is one idea. Create an int array of maximum size (2147483647, the maximum value of an int). Then, for every number you get from the original array in a for-each loop, increment the element at that index (i.e. use the number itself as the index) in the array you created.
At the end of this for-each you will have something like [1, 0, 2, 0, 3] (the array you created), which represents the numbers [0, 2, 2, 4, 4, 4] (the initial array).
To find the K biggest elements, loop backwards over the created array and count down from K to 0 every time you hit a non-zero element. If an element is, for example, 2, you have to count that number 2 times.
The limitation of this approach is that it only works with integers, because of the nature of the array...
Also, the range of int in Java is -2147483648 to 2147483647, which means that only the non-negative numbers can be placed in the array that needs to be created.
NOTE: if you know the maximum value in the input, you can lower the size of the created array accordingly. For example, if the maximum int is 1000, then the array you need to create has size 1001 (indices 0 to 1000), and this algorithm should perform very fast.
I think you misunderstood what you needed to sort.
You need to keep the K-sized list sorted; you don't need to sort the original N-sized input array. That way the time complexity would be O(N * log(K)) in the worst case (assuming you need to update the K-sized list almost every time).
The requirements said that N was very large, but K is much smaller, so O(N * log(K)) is also smaller than O(N * log(N)).
You only need to update the K-sized list for each record that is larger than the K-th largest element before it. For a randomly distributed list with N much larger than K, that will be negligible, so the time complexity will be closer to O(N).
For the K-sized list, you can take a look at the implementation of Is there a PriorityQueue implementation with fixed capacity and custom comparator?, which uses a PriorityQueue with some additional logic around it.
There is an algorithm to do this in worst-case time complexity O(n*log(k)) with very benign time constants (since there is just one pass through the original array, and the inner part that contributes to the log(k) is only accessed relatively seldom if the input data is well-behaved).
Initialize a priority queue implemented with a binary heap A of maximum size k (internally using an array for storage). In the worst case, this has O(log(k)) for inserting, deleting and searching/manipulating the minimum element (in fact, retrieving the minimum is O(1)).
Iterate through the original unsorted array, and for each value v:
If A is not yet full then
insert v into A,
else, if v>min(A) then (*)
insert v into A,
remove the lowest value from A.
(*) Note that A can return repeated values if some of the highest k values occur repeatedly in the source set. You can avoid that with a search operation that makes sure v is not yet in A. You'd also want to find a suitable data structure for that (as the priority queue's membership check has linear complexity), i.e. a secondary hash table or balanced binary search tree or something like that, both of which are available in java.util.
The java.util.PriorityQueue helpfully guarantees the time complexity of its operations:
this implementation provides O(log(n)) time for the enqueuing and dequeuing methods (offer, poll, remove() and add); linear time for the remove(Object) and contains(Object) methods; and constant time for the retrieval methods (peek, element, and size).
Note that as laid out above, we only ever remove the lowest (first) element from A, so we enjoy the O(log(k)) for that. If you want to avoid duplicates as mentioned above, then you also need to search for any new value added to it (with O(k)), which opens you up to a worst-case overall scenario of O(n*k) instead of O(n*log(k)) in case of a pre-sorted input array, where every single element v causes the inner loop to fire.
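For completeness, here is a minimal sketch of that dedup variant, pairing the heap with a HashSet so the membership check is O(1) on average instead of the O(k) of PriorityQueue.contains (method and variable names are mine):

    import java.util.HashSet;
    import java.util.PriorityQueue;
    import java.util.Queue;
    import java.util.Set;

    // Keep the k largest *distinct* values seen so far; assumes k >= 1.
    static Queue<Integer> distinctTopK(int[] arr, int k) {
        Queue<Integer> heap = new PriorityQueue<>();   // min-heap of the kept values
        Set<Integer> inHeap = new HashSet<>();         // mirrors the heap's contents
        for (int v : arr) {
            if (inHeap.contains(v)) continue;          // duplicate of a value already kept
            if (heap.size() < k) {
                heap.add(v);
                inHeap.add(v);
            } else if (heap.peek() < v) {
                inHeap.remove(heap.poll());            // evict the current minimum
                heap.add(v);
                inHeap.add(v);
            }
        }
        return heap;
    }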
This is the pseudo code whose time complexity I want to calculate. I think it is a binary search algorithm, but I fail when calculating the complexity because it reduces logarithmically.
USE variables half-array, found, middle element
SET half-array = initial array;
SET found = False;
Boolean SearchArray(half-array)
    find middle element in half-array;
    compare search key with middle element;
    IF middle element == search key THEN
        SET found = True;
    ELSE
        IF search key < middle element THEN
            SET half-array = lower half of half-array;
        ELSE
            SET half-array = upper half of half-array;
        SearchArray(half-array)
It looks like you are running this method recursively, and with each iteration you are reducing the number of elements being searched by half. This is going to be a logarithmic reduction, i.e. O(log n).
Since you are reducing the elements by half each time, you need to determine how many executions are needed to reduce the search to a single element. This previous answer provides a proof, and if you are a more visual person, the diagram in that response illustrates it.
Yes, it is indeed a binary search algorithm. The reason it is called a 'binary' search is because, as you may have noticed, after each iteration your problem space is reduced by roughly half (I say roughly because of the floor function).
So now, to find the complexity, we have to devise a recurrence relation, which we can use to determine the worst-case time complexity of binary search.
Let T(n) denote the number of comparisons binary search does for n elements. In the worst case, no element is found. Also, to make our analysis easier, assume that n is a power of 2.
BINARY SEARCH:
When there is a single element, there is only one check, hence T(1) = 1.
It calculates the middle entry then compares it with our key. If it is equal to the key, it returns the index; otherwise it halves the range by updating the upper and lower bounds such that n/2 elements are in the range.
We then check only one of the two halves, and this is done recursively until a single element is left.
Hence, we get the recurrence relation:
T(n) = T(n/2) + 1
Using the Master Theorem, we get the time complexity to be T(n) ∈ Θ(log n).
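For intuition, you can also unroll the recurrence directly (with n a power of 2): T(n) = T(n/2) + 1 = T(n/4) + 2 = ... = T(n/2^k) + k, which bottoms out at T(1) after k = log2(n) halvings, giving T(n) = log2(n) + 1 ∈ Θ(log n).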
Also refer: Master Theorem
You are correct in saying that this algorithm is Binary Search (compare your pseudo code to the pseudo code on this Wikipedia page: Binary Search)
That being the case, this algorithm has a worst case time complexity of O(log n), where n is the number of elements in the given array. This is due to the fact that in every recursive call where you don't find the target element, you divide the array in half.
This reduction process is logarithmic because, at the end of this algorithm, you will have reduced the list to a single element by repeatedly dividing the number of elements that still need to be checked by 2. The number of times you do that is roughly equivalent (see below) to the number of times you would have to multiply 2 by itself to obtain a number equal to the size of the given array.
*I say roughly above because the number of recursive calls made is always going to be an integral value, whereas the power you would have to raise 2 to will not be an integer if the size of the given list is not a power of two.
I have to write a program as efficiently as possible that will insert given nodes into a sorted LinkedList. I'm thinking of how binary search is faster than linear in average and worst case, but when I Googled it, the runtime was O(nlogn)? Should I do linear on a singly-LinkedList or binary search on a doubly-LinkedList and why is that one (the one to chose) faster?
Also how is the binary search algorithm > O(logn) for a doubly-LinkedList?
(No one recommend SkipList, I think they're against the rules since we have another implementation strictly for that data structure)
You have two choices.
Linearly search an unordered list. This is O(N).
Linearly search an ordered list. This is also O(N) but it is twice as fast, as on average the item you search for will be in the middle, and you can stop there if it isn't found.
You don't have the choice of binary searching it, as you don't have direct access to elements of a linked list.
But if you consider search to be a rate-determining step, you shouldn't use a linked list at all: you should use a sorted array, a heap, a tree, etc.
Binary search is very fast on arrays simply because it's very fast and simple to access the middle index between any two given indexes of elements in the array. This makes its running time complexity O(log n), while taking constant O(1) space.
For a linked list, it's different: in order to access the middle element we need to traverse to it node by node, so finding the middle node alone can take O(n).
Thus binary search is slow on a linked list and fast on arrays.
Binary search is possible by using a skip list. You will spend about twice as many pointers as a plain linked list if you add skip links of lengths 2, 4, 8, ..., 2^n at the same time. You can then get O(log n) for each search.
If the data stored in each node is quite big, applying this will be very efficient.
You can read more on https://www.geeksforgeeks.org/skip-list/amp/
So basically binary search on a linked list is O(n log n), because each of the log n halving steps requires an O(n) traversal to reach the middle of the current search range. But this is only true if you are traversing the list from the beginning every time.
Ideally, if you figure out some method (which is possible, e.g. with the extra links of a skip list) to jump straight to the middle of the searched set rather than walking there from the head, then you eliminate the need to traverse the list from the start each time and can bring the search down to O(log n).
I have an assignment in my intro to programming course that I don't understand at all. I've been falling behind because of problems at home. I'm not asking you to do my assignment for me I'm just hoping for some help for a programming boob like me.
The question is this:
Calculate the time complexity in average case for searching, adding, and removing in a
- unsorted vector
- sorted vector
- unsorted singlelinked list
- sorted singlelinked list
- hash table
Let n be the number of elements in the datastructure
and present the solution in a
table with three rows and five columns.
I'm not sure what this even means.. I've read as much as I can about time complexity but I don't understand it.. It's so confusing. I don't know where I would even start.. Remember I'm a novice programmer, as dumb as they come. I did really well last semester but had problems at home at the start of this one so I missed a lot of lectures and the first assignments so now I'm in over my head..
Maybe if someone could give me the answer and the reasoning behind it on a couple of them I could maybe understand it and do the others? I have a hard time learning through theory, examples work best.
Time complexity is a formula that describes how the cost of an operation varies related to the number of elements. It is usually expressed using "big-O" notation, for example O(1) or constant time, O(n) where cost relates linearly to n, O(n^2) where cost increases as the square of the size of the input. There can be others involving exponentials or logarithms. Read up on "Big-O Notation".
You are being asked to evaluate five different data structures, and provide average cost for three different operations on each data structure (hence the table with three rows and five columns).
Time complexity is an abstract concept that allows us to compare algorithms by looking at how many operations they perform in order to handle their inputs. To be precise, the exact number of operations isn't important; the bottom line is how the number of operations scales with increasing size of the input.
Generally, the number of inputs is denoted as n and the complexity is denoted as O(p(n)), with p(n) being some kind of expression in n. If an algorithm has O(n) complexity, it means that it scales linearly: with every new input, the time needed to run the algorithm increases by the same amount.
If an algorithm has complexity of O(n^2), it means that the number of operations grows as the square of the number of inputs. This goes on and on, up to exponentially complex algorithms that are effectively useless for large enough inputs.
What your professor asks of you is to have a look at the given operations and judge how they are going to scale with increasing size of the structures you are handling. Basically this is done by looking at the algorithm and imagining what kinds of loops are going to be necessary. For example, if the task is to pick the first element, the complexity is O(1), meaning that it doesn't depend on the size of the input. However, if you want to find a given element in a list, you need to scan the whole list, and this costs you depending on the list size. Hope this gives you a bit of an idea of how algorithmic complexity works; good luck with your assignment.
Ok, well there are a few things you have to start with first. Algorithmic complexity has a lot of heavy math behind it and so it is hard for novices to understand, especially if you try to look up Wikipedia definitions or more-formal definitions.
A simple definition is that time-complexity is basically a way to measure how much an operation costs to perform. Alternatively, you can also use it to see how long a particular algorithm can take to run.
Complexity is described using what is known as big-O notation. You'll usually end up seeing things like O(1) and O(n). n is usually the number of elements (possibly in a structure) on which the algorithm is operating.
So let's look at a few big-O notations:
O(1): This means that the operation runs in constant time. What this means is that regardless of the number of elements, the operation always runs in constant time. An example is looking at the first element in a non-empty array (arr[0]). This will always run in constant time because you only have to directly look at the very first element in an array.
O(n): This means that the time required for the operation increases linearly with the number of elements. An example is if you have an array of numbers and you want to find the largest number. To do this, you will have to, in the worst case, look at every single number in the array until you find the largest one. Why is that? This is because you can have a case where the largest number is the last number in the array. So you cannot be sure until you have examined every number in the array. This is why the cost of this operation is O(n).
O(n^2): This means that the time required for the operation increases quadratically with the number of elements. This usually means that for each element in the set of elements, you are running through the entire set. So that is n x n or n^2. A well-known example is the bubble-sort algorithm. In this algorithm you run through and swap adjacent elements to ensure that the array is sorted according to the order you need. The array is sorted when no more swaps need to be made. So you have multiple passes through the array, which in the worst case is equal to the number of elements in the array.
Now there are certain things in code that you can look at to get a hint to see if the algorithm is O(n) or O(n^2).
Single loops are usually O(n), since it means you are iterating over a set of elements once:
    for (int i = 0; i < n; i++) {
        ...
    }
Doubly-nested loops are usually O(n^2), since you are iterating over an entire set of elements for each element in the set:
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            ...
        }
    }
Now how does this apply to your homework? I'm not going to give you the answer directly but I will give you enough and more hints to figure it out :). What I wrote above, describing big-O, should also help you. Your homework asks you to apply runtime analyses to different data structures. Well, certain data structures have certain runtime properties based on how they are set up.
For example, in a linked list, the only way you can get to an element in the middle of the list, is by starting with the first element and then following the next pointer until you find the element that you want. Think about that. How many steps would it take for you to find the element that you need? What do you think those steps are related to? Do the number of elements in the list have any bearing on that? How can you represent the cost of this function using big-O notation?
For each datastructure that your teacher has asked you about, try to figure out how they are set up and try to work out manually what each operation (searching, adding, removing) entails. I'm talking about writing the steps out and drawing pictures of the structures on paper! This will help you out immensely! Looking at that, you should have enough information to figure out the number of steps required and how it relates to the number of elements in the set.
Using this approach you should be able to solve your homework. Good luck!
Given a large unordered array of long random numbers and a target long, what's the most efficient algorithm for finding the closest number?
    @Test
    public void findNearest() throws Exception {
        final long[] numbers = {90L, 10L, 30L, 50L, 70L};
        Assert.assertEquals("nearest", 10L, findNearest(numbers, 12L));
    }
Iterate through the array of longs once. Store the current closest number and the distance to that number. Continue checking each number to see if it is closer, and replace the current closest number when you encounter a closer one.
This gets you best performance of O(n).
Building a binary tree as suggested by another answerer will take O(n log n). Of course future searches will only take O(log n)... so it may be worth it if you do a lot of searches.
If you are pro, you can parallelize this with openmp or thread library, but I am guessing that is out of the scope of your question.
If you do not intend to do multiple such requests on the array, there is no better way than the brute force linear-time check of each number.
If you will do multiple requests on the same array, first sort it and then do a binary search on it - this will reduce the time for each such request to O(log(n)), but you still pay the O(n*log(n)) for the sort, so this is only reasonable if the number of requests is reasonably large, i.e. k*n >> (a lot bigger than) n*log(n) + k*log(n), where k is the number of requests.
If the array will change, then create a binary search tree and do a lower-bound request on it. This again is only reasonable if the number of nearest-number requests is relatively large in comparison to the number of array-change requests and to the number of elements. As the cost of building the tree is O(n*log(n)) and the cost of updating it is O(log n), you need to have k*log(n) + n*log(n) + k*log(n) << (a lot smaller than) k*n.
IMHO, you should use a Binary Heap (http://en.wikipedia.org/wiki/Binary_heap), which has an insertion time of O(log n), i.e. O(n log n) for the entire array. For me, the coolest thing about the binary heap is that it can be built in place inside your own array, without overhead. Take a look at the heapify section.
"Heapifying" your array makes it possible to get the largest/smallest element in O(1).
If you build a binary search tree from your numbers and search against it, O(log n) would be the complexity in the worst case. In your case you won't search for equality; instead, you'll look for the value with the smallest difference.
I would check the difference between the numbers while iterating through the array and save the min value for that difference.
If you plan to use findNearest multiple times, I would calculate the difference while sorting (with a sorting algorithm of complexity n*log(n)) after each change of values in that array.
The time complexity of doing this job is O(n), where n is the length of numbers.
    final long[] numbers = {90L, 10L, 30L, 50L, 70L};
    long tofind = 12L;
    long delta = Long.MAX_VALUE;
    int index = -1;
    int i = 0;
    while (i < numbers.length) {
        long tmp = Math.abs(tofind - numbers[i]);   // distance from the target
        if (tmp < delta) {                          // closer than anything seen so far
            delta = tmp;
            index = i;
        }
        i++;
    }
    System.out.println(numbers[index]); // if index is not -1
But if you want to find many times with different values such as 12L against the same numbers array, you may sort the array first and binary search against the sorted numbers array.
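For the repeated-query case, here is a rough sketch using Arrays.sort plus Arrays.binarySearch (the helper name is mine; the array must be sorted once up front):

    import java.util.Arrays;

    // numbers must be sorted in advance (Arrays.sort(numbers)); each query is then O(log n).
    static long nearestSorted(long[] numbers, long target) {
        int pos = Arrays.binarySearch(numbers, target);
        if (pos >= 0) return numbers[pos];                 // exact match
        int insertion = -pos - 1;                          // index of the first element > target
        if (insertion == 0) return numbers[0];
        if (insertion == numbers.length) return numbers[numbers.length - 1];
        long below = numbers[insertion - 1], above = numbers[insertion];
        return (target - below) <= (above - target) ? below : above;  // ties go to the lower value
    }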
If your search is a one-off, you can partition the array like in quicksort, using the input value as pivot.
If you keep track - while partitioning - of the min item in the right half, and the max item in the left half, you should have it in O(n) and 1 single pass over the array.
I'd say it's not possible to do it in less than O(n) since it's not sorted and you have to scan the input at the very least.
If you need to do many subsequent search, then a BST could help indeed.
You could do it in the steps below:
Step 1: Sort the array
Step 2: Find the index of (or the insertion point for) the search element
Step 3: Based on the index, display the numbers on its right and left side
Let me know in case of any queries...