Find the highest N numbers in an infinite list - java

I was asked this question in a recent Java interview.
Given a List containing millions of items, maintain a list of the highest n items. Sorting the list in descending order then taking the first n items is definitely not efficient due to the list size.
Below is what I did. I'd appreciate it if anyone could provide a more efficient or elegant solution, as I believe this could also be solved using a PriorityQueue:
public TreeSet<Integer> findTopNNumbersInLargeList(final List<Integer> largeNumbersList,
        final int highestValCount) {
    TreeSet<Integer> highestNNumbers = new TreeSet<Integer>();
    for (int number : largeNumbersList) {
        if (highestNNumbers.size() < highestValCount) {
            highestNNumbers.add(number);
        } else {
            for (int i : highestNNumbers) {
                if (i < number) {
                    highestNNumbers.remove(i);
                    highestNNumbers.add(number);
                    break;
                }
            }
        }
    }
    return highestNNumbers;
}

The for loop at the bottom is unnecessary, because you can tell right away if the number should be kept or not.
TreeSet lets you find the smallest element in O(log N)*. Compare that smallest element to number. If the number is greater, add it to the set, and remove the smallest element. Otherwise, keep walking to the next element of largeNumbersList.
The worst case is when the original list is sorted in ascending order, because you would have to replace an element in the TreeSet at each step. In this case the algorithm takes O(K log N), where K is the number of items in the original list, an improvement by a factor of log K / log N over sorting the whole list, which is O(K log K).
Note: If your list consists of Integers, you could use a linear sorting algorithm that is not based on comparisons to get the overall asymptotic complexity down to O(K). This does not mean that the linear solution would necessarily be faster than the original for any fixed number of elements.
* You can maintain the value of the smallest element as you go to make it O(1).

You don't need nested loops, just keep inserting and remove the smallest number when the set is too large:
public Set<Integer> findTopNNumbersInLargeList(final List<Integer> largeNumbersList,
        final int highestValCount) {
    TreeSet<Integer> highestNNumbers = new TreeSet<Integer>();
    for (int number : largeNumbersList) {
        highestNNumbers.add(number);
        if (highestNNumbers.size() > highestValCount) {
            highestNNumbers.pollFirst();
        }
    }
    return highestNNumbers;
}
The same code should work with a PriorityQueue, too. The runtime should be O(n log highestValCount) in any case.
P.S. As pointed out in the other answer, you can optimize this some more (at the cost of readability) by keeping track of the lowest number, avoiding unnecessary inserts.
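For illustration, here is a sketch of that optimization (assuming highestValCount > 0; this is not code from the question or either answer). Caching the current minimum lets most elements be rejected without touching the tree:

public Set<Integer> findTopNNumbersInLargeList(final List<Integer> largeNumbersList,
        final int highestValCount) {
    final TreeSet<Integer> highestNNumbers = new TreeSet<>();
    int lowest = Integer.MIN_VALUE; // cached smallest value currently kept
    for (int number : largeNumbersList) {
        if (highestNNumbers.size() < highestValCount) {
            highestNNumbers.add(number);
            lowest = highestNNumbers.first();
        } else if (number > lowest) {
            // add() returns false for duplicates, in which case there is
            // nothing to evict
            if (highestNNumbers.add(number)) {
                highestNNumbers.pollFirst();
                lowest = highestNNumbers.first();
            }
        }
        // numbers <= lowest fall through without any tree operation
    }
    return highestNNumbers;
}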

It's possible to support amortized O(1) processing of new elements and O(n) querying of the current top elements as follows:
Maintain a buffer of size 2n, and whenever you see a new element, add it to the buffer. When the buffer gets full, use quick select or another linear median finding algorithm to select the current top n elements, and discard the rest. This is an O(n) operation, but you only need to perform it every n elements, which balances out to O(1) amortized time.
This is the algorithm Guava uses for Ordering.leastOf, which extracts the n least elements from an Iterator or Iterable according to the ordering (Ordering.greatestOf gives the n greatest). It is fast enough in practice to be quite competitive with a PriorityQueue based approach, and it is much more resistant to worst case input.
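A minimal sketch of the buffered approach (illustrative names, not Guava's actual internals). For simplicity it sorts on compaction, which is O(n log n) per compaction and therefore amortized O(log n) per element; swapping the sort for a linear-time selection such as quickselect gives the amortized O(1) bound described above:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch; this is not Guava's actual implementation.
class BufferedTopN<T extends Comparable<T>> {
    private final int n;
    private final List<T> buffer;

    BufferedTopN(int n) {
        this.n = n;
        this.buffer = new ArrayList<>(2 * n);
    }

    void offer(T item) {
        buffer.add(item);
        if (buffer.size() >= 2 * n) {
            compact(); // runs only once every n elements
        }
    }

    List<T> top() {
        compact();
        return new ArrayList<>(buffer); // descending order, at most n items
    }

    // Keeps only the n largest elements. This sketch sorts for simplicity;
    // a linear-time selection (quickselect) here yields amortized O(1)
    // per element, as described above.
    private void compact() {
        buffer.sort(Comparator.reverseOrder());
        if (buffer.size() > n) {
            buffer.subList(n, buffer.size()).clear();
        }
    }
}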

I would start by saying that your question, as stated, is impossible. There is no way to find the highest n items in a List without fully traversing it. And there is no way to fully traverse an infinite List.
That said, the text of your question differs from the title. There is a massive difference between very large and infinite. Please bear that in mind.
To answer the feasible question, I would begin by implementing a buffer class to encapsulate the behaviour of keeping the top N; let's call it TopNBuffer:
class TopNBuffer<T extends Comparable<T>> {
    private final NavigableSet<T> backingSet = new TreeSet<>();
    private final int limit;

    public TopNBuffer(int limit) {
        this.limit = limit;
    }

    public void add(final T t) {
        if (backingSet.add(t) && backingSet.size() > limit) {
            backingSet.pollFirst();
        }
    }

    public SortedSet<T> highest() {
        return Collections.unmodifiableSortedSet(backingSet);
    }
}
All we do here is, on add: if the number was actually added (i.e. it is not a duplicate) and adding it pushes the Set over its limit, we simply remove the lowest element from the Set.
The method highest gives an unmodifiable view of the current highest elements. So, in Java 8 syntax, all you need to do is:
final TopNBuffer<Integer> topN = new TopNBuffer<>(n);
largeNumbersList.forEach(topN::add);
final Set<Integer> highestN = topN.highest();
I think in an interview environment, it's not enough to simply whack lots of code into a method. Demonstrating an understanding of OO programming and separation of concerns is also important.

Related

What is the time complexity of getting the headSet of a TreeSet in Java? Also, what if I call the headSet method 'n' times?

I was doing a HackerRank question that requires me to find the number of shifts that need to occur to sort an array using insertion sort, without actually sorting the array with insertion sort, because that would be O(n^2) time complexity!
Here is my code, which gets timed out. From what I know, calling the headSet method n times should come to around O(n log n).
static class MyComp implements Comparator<Integer> {
    @Override
    public int compare(Integer o1, Integer o2) {
        return o1 <= o2 ? -1 : 1;
    }
}

// Complete the insertionSort function below.
static int insertionSort(int[] arr) {
    SortedSet<Integer> set = new TreeSet<>(new MyComp());
    int count = 0;
    for (int i = arr.length - 1; i >= 0; i--) {
        set.add(arr[i]);
        int pos = set.headSet(arr[i]).size();
        // System.out.println(pos);
        if (pos > 0) {
            count = count + pos;
        }
    }
    return count;
}
The complexity of creating a headSet is O(1).
Why?
Because a headSet is not a new set. It is actually a view of an existing set. Creating one doesn't involve copying the original set, and doesn't even involve finding the "bound" element in the set.
Thus, doing it N times is O(N).
However, the reason that your code is not O(N) is that
set.headSet(someElement).size();
is NOT O(1). The reason is that the size() method on a subset view of a TreeSet is computed by counting the elements in the view.
(AFAIK, this is not stated in the javadocs, but you can deduce it from looking at the source code for TreeSet and TreeMap.)
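For instance, a small demo of the view semantics (a sketch; the printed values follow from the javadoc-specified behaviour):

import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

public class HeadSetDemo {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5));
        SortedSet<Integer> head = set.headSet(4); // view of {1, 2, 3}, nothing copied
        head.add(0);                              // writes through to the backing set
        System.out.println(set);                  // prints [0, 1, 2, 3, 4, 5]
        System.out.println(head.size());          // 4: counts the view's elements
    }
}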
Some corrections to the above: TreeSet operations such as add and lookup are O(log n), not O(1), so the loop in the question could never be O(n); even ignoring the headSet calls it is at least O(n log n). And it is actually worse than that.
Note, though, that headSet really is a view of the existing set, as the accepted answer says: you can add an element to it and the change writes through to the backing set. The real cost is in size(), which counts the view's elements one by one, making each iteration O(n) and the whole loop O(n^2).
You can test it with the following code:
TreeSet<Integer> test = new TreeSet<>();
long time = System.currentTimeMillis();
Random r = new Random(5);
for (int i = 0; i < 1e6; i++)
    test.add(i);
long ans = 0;
for (int i = 0; i < 1e6; i++) {
    int l = r.nextInt((int) 1e6);
    ans += test.headSet(l).size();
}
System.out.println(ans + " " + (System.currentTimeMillis() - time));
If the loop were O(n) overall, it would run in about 1/100th of a second; if it were O(n log n), in about 2 seconds. You can see this takes about 10^4 seconds. The code is O(n^2).

Removing duplicates from sorted array list and return size in java without using extra space

Is there a better way to remove duplicates from the ArrayList than the code below, which is supposed to do the work in O(n) even for large inputs? Any suggestions would be appreciated. Thank you.
Note: can't use any extra space; it should be solved in place.
Input: a sorted array with duplicates.
Code :-
public int removeDuplicates(ArrayList<Integer> a) {
    if (a.size() > 1) {
        for (int i = 0; i < a.size() - 1; i++) {
            if (a.get(i).intValue() == a.get(i + 1).intValue()) {
                a.remove(i);
                i--;
            }
        }
    }
    return a.size();
}
Please test the code at the CoderPad link below.
https://coderpad.io/MXNFGTJC
If this code is for removing elements of an unsorted list, then:
The algorithm is incorrect.
The Question is a duplicate of How do I remove repeated elements from ArrayList? (for example ...)
If the list is sorted, then:
The algorithm is correct.
The algorithm is NOT O(N). It is actually O(ND) on average, where N is the list length and D is the number of duplicates.
Why? Because ArrayList::remove(int) is on average an O(N) operation!
There are two efficient ways to remove a large number of elements from a list:
Create a new list, iterate the old list and add the elements that you want to retain to the new list. Then either discard the old list or clear it and copy the new list to the old one.
This works efficiently (O(N)) for all standard kinds of list.
Perform a sliding window removal. The algorithm with arrays is like this:
int i = 0;
for (int j = 0; j < array.length; j++) {
    if (/* should remove array[j] */) {
        // do nothing
    } else {
        array[i++] = array[j];
    }
}
// trim array to length i, or assign nulls or something.
As you can see, this performs one pass through the array, and is O(N). It also avoids allocating any temporary space.
You can implement the sliding window removal using ArrayList::get(int) and ArrayList::set(int, <E>) ... followed by repeated removal of the last element to trim the list.
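For instance, here is a sketch of that, applied to the sorted-duplicates problem (the method name is illustrative):

import java.util.ArrayList;

public static int removeDuplicatesInPlace(ArrayList<Integer> a) {
    if (a.isEmpty()) {
        return 0;
    }
    int i = 0; // index of the last unique element kept so far
    for (int j = 1; j < a.size(); j++) {
        if (!a.get(j).equals(a.get(i))) {
            a.set(++i, a.get(j)); // slide the next distinct value forward
        }
    }
    // Trim the leftover tail; removing from the end is O(1) per element.
    while (a.size() > i + 1) {
        a.remove(a.size() - 1);
    }
    return a.size();
}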
Here are some ideas to improve performance:
Removing elements one by one from an ArrayList can be expensive, since you must shift all the contents after that element. Instead of ArrayList you might consider a different list implementation that allows O(1) removal. Alternatively, if you must use ArrayList and are not allowed any temporary data structures, you can rebuild the list in place using set() instead of remove(), as in the sliding window approach above.
For lists with millions of elements, consider a parallel processing solution to leverage multiple cores. Java streams are a simple way to achieve this.
List<Integer> l = new ArrayList<Integer>();
//add some elements to l
System.out.println(l.stream().distinct().collect(Collectors.toList()));

Best way to keep track of maximum 5 values found while parsing a stream in Java

I'm parsing a large file, line by line, reading substrings in each line. I will obtain an integer value from each substring, ~30 per line, and need to return the highest 5 values from the file. What data structure will be the most efficient for keeping track of the 5 largest values while going through?
This problem is usually solved with a heap, but (perhaps counter-intuitively) you use a min-heap (the smallest element is the "top" of the heap).
The algorithm is basically this:
for each item parsed
    if the heap contains less than n items,
        add the new item to the heap
    else
        if the new item is "greater" than the "smallest" item in the heap
            remove the smallest item and replace it with the new item
When you are done, you can pop the elements off the heap from least to greatest.
Or, concretely:
static <T extends Comparable<T>> List<T> top(Iterable<? extends T> items, int k) {
    if (k < 0) throw new IllegalArgumentException();
    if (k == 0) return Collections.emptyList();
    PriorityQueue<T> top = new PriorityQueue<>(k);
    for (T item : items) {
        if (top.size() < k) top.add(item);
        else if (item.compareTo(top.peek()) > 0) {
            top.remove();
            top.add(item);
        }
    }
    List<T> hits = new ArrayList<>(top.size());
    while (!top.isEmpty())
        hits.add(top.remove());
    Collections.reverse(hits);
    return hits;
}
You can compare the new item to the top of the heap efficiently, and you don't need to keep all of the elements strictly ordered all the time, so this is faster than a completely ordered collection like a TreeSet.
For a very short list of five elements, iterating over an array may be faster. But if the size of the "top hits" collection grows, this heap-based method will win out.
I would use a TreeSet (basically a sorted set), where you drop off the first (lowest) element each time you add to the set. Note that this will discard duplicates.
SortedSet<Integer> set = new TreeSet<>();
for (...) {
    ...
    if (set.size() < 5) {
        set.add(num);
    } else if (num > set.first()) {
        set.remove(set.first());
        set.add(num);
    }
}
You could use a LinkedList, inserting in sorted order. For each new int, check the last element to see whether the new value belongs among the maxima at all. If it does, iterate in descending order, insert the new int before the first node whose value it exceeds, and then removeLast() to maintain the length of 5.
An array also works, but you'll have to shift elements.
The Guava library has an Ordering.greatestOf method that returns the greatest K elements from an Iterable in O(N) time and O(K) space.
The implementation is in a package-private TopKSelector class.
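A usage sketch (assuming Guava is on the classpath; the class name and sample values are made up):

import com.google.common.collect.Ordering;
import java.util.Arrays;
import java.util.List;

public class TopFiveDemo {
    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(9, 4, 7, 1, 8, 3);
        // greatestOf returns the k greatest elements in descending order.
        List<Integer> top5 = Ordering.<Integer>natural().greatestOf(values, 5);
        System.out.println(top5); // [9, 8, 7, 4, 3]
    }
}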

Insert Objects in a Constant Length List - Java

I am looking for a good optimal strategy to write a code for the following problem.
I have a List of Objects.
The Objects have a String "valuation" field among other fields. The valuation field may or may not be unique.
The List is of CONSTANT length which is calculated within the program. The length would usually be between 100 and 500.
The Objects are all sorted within the list based on String field - valuation
As new objects are found or created: The String field valuation is compared with the existing members of the list.
If the comparison fails e.g. with the bottom member of the list, then the Object is NOT added to the list.
If the comparison succeeds, the new Object is added to the list in the right position according to the sort criteria, and the bottom member is ousted from the list to keep the length of the list constant.
One strategy which I am thinking:
Keep adding members to the list - till it reaches maxLength
Sort - (e.g Collections.sort with a comparator) the list
When a new member is created - compare it with the bottom member of the list.
If success - replace the bottom member else continue
Re-Sort the List - if success
and continue.
The program loops through a million or more iterations, so optimized comparison and running time have become an issue.
Any guidance on a good strategy to address this within the Java domain. What lists will be the most effective e.g. LinkedList or ArrayLists or Sets etc. Which sort/insert (standard package) will be the most effective?
Consider this example based on TreeSet, comparing Elements via a String valuation. As you can see, after enough iterations, only elements with very large keys are left in the set. On my quite old laptop, I processed 10,000 items in less than 50 ms, so roughly 5 s per million list operations.
public class Valuation {
    public static class Element implements Comparable<Element> {
        String valuation;
        String data;

        Element(String v, String d) {
            valuation = v;
            data = d;
        }

        @Override
        public int compareTo(Element e) {
            return valuation.compareTo(e.valuation);
        }
    }

    private TreeSet<Element> ts = new TreeSet<Element>();
    private final static int LISTLENGTH = 500;

    public static void main(String[] args) {
        NumberFormat nf = new DecimalFormat("00000");
        Random r = new Random();
        Valuation v = new Valuation();
        for (long l = 1; l < 150; ++l) {
            long start = System.currentTimeMillis();
            for (int j = 0; j < 10000; ++j) {
                v.pushNew(new Element(nf.format(r.nextInt(50000)),
                        UUID.randomUUID().toString()));
            }
            System.out.println("10.000 finished in "
                    + (System.currentTimeMillis() - start)
                    + "ms. Set contains: " + v.ts.size());
        }
        for (Element e : v.ts) {
            System.out.println("-> " + e.valuation);
        }
    }

    private void pushNew(Element hexString) {
        if (ts.size() < LISTLENGTH) {
            ts.add(hexString);
        } else {
            if (ts.first().compareTo(hexString) < 0) {
                ts.add(hexString);
                if (ts.size() > LISTLENGTH) {
                    ts.remove(ts.first());
                }
            }
        }
    }
}
Any guidance on a good strategy to address this within the Java domain.
My advice would be - there is no need to do any sorting. You can ensure your data is sorted by doing binary insertion as you add more objects into your collection.
This way, as you add more items, the collection itself is already in a sorted state.
After the 500th item, if you want to add another one, we just perform another binary insertion. Finding the insertion point always remains O(log n), and there is no need to perform any sorting.
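Here is a minimal sketch of bounded binary insertion over a Comparable key (class and method names are invented for illustration; the question's objects would compare by their valuation String):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BoundedSortedList<T extends Comparable<T>> {
    private final List<T> items = new ArrayList<>(); // kept in ascending order
    private final int maxLength;

    BoundedSortedList(int maxLength) {
        this.maxLength = maxLength;
    }

    void offer(T candidate) {
        // Once full, reject anything not better than the current worst
        // (index 0 holds the smallest element).
        if (items.size() == maxLength
                && candidate.compareTo(items.get(0)) <= 0) {
            return;
        }
        int pos = Collections.binarySearch(items, candidate);
        if (pos < 0) {
            pos = -pos - 1; // decode "not found" into the insertion point
        }
        items.add(pos, candidate); // O(log n) search + O(n) shift, no re-sort
        if (items.size() > maxLength) {
            items.remove(0); // oust the bottom member
        }
    }

    List<T> contents() {
        return Collections.unmodifiableList(items);
    }
}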
Comparing with your algorithm
Your algorithm works fine for steps 1 - 4. But step 5 will likely be the bottleneck of your algorithm:
5.Re-Sort the List - if success
This is because even though your list will only ever hold a maximum of 500 items, there can be an unbounded number of insertions to perform after the 500th item is added.
Imagine having another 1 million insertions where (in the worst-case scenario) all 1 million items "succeed" and must be inserted into the list; that implies your algorithm will need to perform 1 million more sorts!
That will be 1 million * O(n log n) for sorting.
Compare with binary insertion: in the worst case it is 1 million * O(log n) for finding the insertion point (no sorting).
What lists will be the most effective e.g. LinkedList or ArrayLists or Sets etc.
If you use an ArrayList, insertion won't be as efficient as with a linked list, since an ArrayList is backed by an array and inserting means shifting elements. However, accessing elements by index is O(1) for an ArrayList, compared to O(n) for a linked list. So there isn't a data structure that is efficient for all scenarios. You will have to plan your algorithm first and see which one fits your strategy best.
Which sort/insert (standard package) will be the most effective?
As far as I know, there are Arrays.sort() and Collections.sort() available, which will give you a good O(n log n) performance: Arrays.sort() uses a dual-pivot quicksort for primitives, and Collections.sort() uses TimSort for objects. Both will be more effective than a simple insertion/bubble/selection sort created by yourself.

Performance (runtime) of pulling arbitrary element from HashSet

I am using HashSets in an implementation to have fast adding, removing and element testing (amortized constant time).
However, I'd also like a method to obtain an arbitrary element from that set. The only way I am aware of is
Object arbitraryElement = set.iterator().next();
My question is - how fast (asymptotically speaking) is this? Does this work in (not amortized) constant time in the size of the set, or does the iterator().next() method do some operations that are slower? I ask because I seem to lose a log-factor in my implementation as experiments show, and this is one of the few lines affected.
Thank you very much!
HashSet.iterator().next() linearly scans the table to find the next contained item.
For the default load factor of .75, you would have three full slots for every empty one.
There is, of course, no guarantee what the distribution of the objects in the backing array will be, and the set will never actually be that full, so scans will take longer.
I think you'd get amortized constant time.
Edit: The iterator does not create a deep copy of anything in the set. It only references the array in the HashSet. Your example creates a few objects, but nothing more, and no big copies.
I wouldn't expect this to be a logarithmic factor, on average, but it might be slow in some rare cases. If you care about this, use LinkedHashSet, which will guarantee constant time.
I would maintain an ArrayList of your keys, and when you need a random object, just generate an index, grab the key, and pull it out of the set. O(1) baby...
Getting the first element out of a HashSet using the iterator is pretty fast: I think it's amortised O(1) in most cases. This assumes the HashSet is reasonably well-populated for its given capacity - if the capacity is very large compared to the number of elements then it will be more like O(capacity/n), which is the average number of buckets the iterator needs to scan before finding a value.
Even scanning an entire HashSet with an iterator is only O(n+capacity), which is effectively O(n) if your capacity is appropriately scaled. So it's still not particularly expensive (unless your HashSet is very large).
If you want better than that, you'll need a different data structure.
If you really need the fast access of arbitrary elements by index then I'd personally just put the objects in an ArrayList which will give you very fast O(1) access by index. You can then generate the index as a random number if you want to select an arbitrary element with equal probability.
Alternatively, if you want to get an arbitrary element but don't care about indexed access then a LinkedHashSet may be a good alternative.
This is from the JDK 7 JavaDoc for HashSet:
Iterating over this set requires time proportional to the sum of the HashSet instance's size (the number of elements) plus the "capacity" of the backing HashMap instance (the number of buckets). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
I looked at the JDK 7 implementation of HashSet and LinkedHashSet. For the former, the next operation is a linked-list traversal within a hash bucket, and between buckets an array traversal, where the size of the array is given by the capacity. The latter is strictly a linked-list traversal.
If you need an arbitrary element in the probabilistic sense, you could use the following approach.
class MySet<A> {
    ArrayList<A> contents = new ArrayList<>();
    HashMap<A, Integer> indices = new HashMap<A, Integer>();
    Random R = new Random();

    // selects random element in constant O(1) time
    A randomKey() {
        return contents.get(R.nextInt(contents.size()));
    }

    // adds new element in constant O(1) time
    void add(A a) {
        indices.put(a, contents.size());
        contents.add(a);
    }

    // removes element in constant O(1) time by swapping in the last element
    void remove(A a) {
        int index = indices.get(a);
        A last = contents.get(contents.size() - 1);
        contents.set(index, last);            // overwrite the removed slot
        contents.remove(contents.size() - 1);
        indices.put(last, index);             // update the moved element's index
        indices.remove(a);
    }

    // all other operations (contains(), ...) are those from indices.keySet()
}
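A hypothetical usage of the sketch above:

MySet<String> set = new MySet<>();
set.add("a");
set.add("b");
set.add("c");
System.out.println(set.randomKey()); // uniformly random among "a", "b", "c"
set.remove("b");
System.out.println(set.randomKey()); // now uniformly random among "a", "c"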
If you are repeatedly choosing an arbitrary set element using an iterator and often removing that element, this can lead to a situation where the internal representation becomes unbalanced and finding the first element degrades to linear time complexity.
This is actually a pretty common occurrence when implementing algorithms involving graph traversal.
Use a LinkedHashSet to avoid this problem.
Demonstration:
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class SetPeek {
    private static final Random rng = new Random();

    private static <T> T peek(final Iterable<T> i) {
        return i.iterator().next();
    }

    private static long testPeek(Set<Integer> items) {
        final long t0 = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            peek(items);
        }
        final long t1 = System.currentTimeMillis();
        return t1 - t0;
    }

    private static <S extends Set<Integer>> S createSet(Supplier<S> factory) {
        final S set = new Random().ints(100000).boxed()
                .collect(Collectors.toCollection(factory));

        // Remove first half of elements according to internal iteration
        // order. With the default load factor of 0.75 this will not trigger
        // a rebalancing.
        final Iterator<Integer> it = set.iterator();
        for (int k = 0; k < 50000; k++) {
            it.next();
            it.remove();
        }

        return set;
    }

    public static void main(String[] args) {
        final long hs = testPeek(createSet(HashSet::new));
        System.err.println("HashSet: " + hs + " ms");
        final long lhs = testPeek(createSet(LinkedHashSet::new));
        System.err.println("LinkedHashSet: " + lhs + " ms");
    }
}
Results:
HashSet: 6893 ms
LinkedHashSet: 8 ms
