Get the N smallest [Comparable] items in a set - java

I have an unsorted Collection of objects [that are comparable]. Is it possible to get a sub-list of the N smallest items without having to call sort?
I was looking at the possibility of doing a SortedList with a limited capacity, but that didn't look like the right option.
I could easily write this, but I was wondering if there was another way.
I am not able to modify the existing collection's structure.

Since you don't want to call sort(), it seems like you are trying to avoid an O(n log(n)) runtime cost. There is actually a way to do that in O(n) time -- you can use a selection algorithm.
There are methods to do this in the Guava libraries (Google's core Java libraries); look in Ordering and check out:
public <E extends T> List<E> leastOf(Iterable<E> iterable, int k)
public <E extends T> List<E> greatestOf(Iterable<E> iterable, int k)
These are implementations of quickselect, and since they're written generically, you could just call them on your Set and get a list of the k smallest things. If you don't want to use the entire Guava libraries, the docs link to the source code, and I think it should be straightforward to port the methods to your project.
If you don't want to deviate too far from the standard libraries, you can always use a sorted set like TreeSet, though this gets you logarithmic insert/remove time instead of the nice O(1) performance of a hash-based Set, and it is O(n log(n)) overall. Others have mentioned using heaps; that also gives you O(n log(n)) running time, unless you use one of the fancier heap variants. There's a Fibonacci heap implementation in GraphMaker if you're looking for one of those.
Which of these makes sense really depends on your project, but I think that covers most of the options.
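The selection idea can be sketched in plain Java. This is a minimal quickselect (Lomuto partition, random pivot), not Guava's actual implementation; after selection, the first k slots hold the k smallest items, in no particular order:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Smallest {
    // Returns the k smallest elements in O(n) expected time via quickselect.
    // The returned list is not necessarily sorted.
    static <T extends Comparable<? super T>> List<T> leastOf(Collection<T> items, int k) {
        List<T> a = new ArrayList<>(items);
        if (k >= a.size()) return a;
        quickselect(a, 0, a.size() - 1, k);
        return new ArrayList<>(a.subList(0, k));
    }

    static <T extends Comparable<? super T>> void quickselect(List<T> a, int lo, int hi, int k) {
        Random rnd = new Random();
        while (lo < hi) {
            int p = partition(a, lo, hi, lo + rnd.nextInt(hi - lo + 1));
            if (p == k) return;          // first k slots now hold the k smallest
            if (p < k) lo = p + 1; else hi = p - 1;
        }
    }

    // Lomuto partition around the element at pivotIdx; returns its final position.
    static <T extends Comparable<? super T>> int partition(List<T> a, int lo, int hi, int pivotIdx) {
        Collections.swap(a, pivotIdx, hi);
        T pivot = a.get(hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a.get(i).compareTo(pivot) < 0) {
                Collections.swap(a, i, store++);
            }
        }
        Collections.swap(a, store, hi);
        return store;
    }
}
```

If you do pull in Guava, Ordering.natural().leastOf(set, k) gives you the same result with less code.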

I would probably create a sorted set. Insert the first N items from your unsorted collection into your sorted set. Then for the remainder of your unsorted collection:
insert the item into the sorted set
delete the largest item from the sorted set
Repeat until you've processed all items in the unsorted collection
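The steps above can be sketched with a TreeSet (note the caveat that a set discards duplicates, so this assumes distinct elements):

```java
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class BoundedSortedSet {
    // Keeps at most n smallest distinct items: insert, then evict the largest.
    static <T extends Comparable<? super T>> List<T> nSmallest(Collection<T> items, int n) {
        TreeSet<T> set = new TreeSet<>();
        for (T item : items) {
            set.add(item);
            if (set.size() > n) {
                set.pollLast();   // drop the current largest
            }
        }
        return List.copyOf(set);  // ascending order
    }
}
```

Each element costs O(log n) to insert/evict, so the whole pass is O(m log n) for m input elements.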

Yes: keep a max-heap capped at size N. Add an item only if the heap isn't full yet, or if the item is smaller than the current maximum (checked with peek()); in the latter case, evict the maximum first. Once you have processed every item, the heap by definition holds the N smallest. Each heap operation costs O(log N), so the whole pass is O(M log N) (where M is the size of the set), and the peek() check rejects most items in O(1). With java.util.PriorityQueue (a min-heap by default, so reverse the ordering):
PriorityQueue<Item> maxHeap = new PriorityQueue<>(Comparator.reverseOrder());
for (Item x : mySetOfItems) {
    if (maxHeap.size() < N) {
        maxHeap.add(x);                          // heap not full yet
    } else if (x.compareTo(maxHeap.peek()) < 0) {
        maxHeap.poll();                          // evict the current largest
        maxHeap.add(x);
    }
}
The Apache Commons Collections class PriorityBuffer seems to be their flagship binary heap data structure; you could try that one as well.

http://en.wikipedia.org/wiki/Heap_%28data_structure%29
Don't you just want to make a heap?

Related

Providing a comparator and collection to PriorityQueue

I have a Map<Integer, Integer> freqMap where each value is the frequency of the key.
Now I want to build a PriorityQueue from this map.
I was expecting a constructor like PriorityQueue<Integer> pq = new PriorityQueue<>(freqMap.keySet(), comparator);
But there is no such constructor.
I can construct the queue with a comparator and then add the keySet elements using addAll, but that will internally add the elements one by one. I want to build a max-heap out of the key set, ordered by the values, and I am not sure how to do this.
My Thoughts:
One way could be that, instead of Integer, I make a custom class that wraps those integers and implements the Comparable interface. Then, when I pass that collection to the PriorityQueue constructor, it should construct the queue in O(n) time, whereas if I use the addAll method it will probably take O(n log n) time. I am not sure my reasoning is correct here, though, and it does seem a little complicated to use a wrapper class that implements Comparable for this tiny purpose.
The Comparable will compare based on the values, so the key with the highest value should be on top.
construct the priority Queue in O(n) time
Construction from a collection is actually linear. The copy-constructor taking another PriorityQueue internally invokes initFromPriorityQueue(), which copies the underlying array of the given queue in O(n); constructing from a SortedSet likewise copies the already-ordered elements in O(n). And for any other collection, PriorityQueue(Collection) copies the elements and then calls heapify(), the bottom-up (Floyd) algorithm, which also runs in O(n).
What you cannot do is pass a collection and a custom Comparator to the same constructor: the comparator-taking constructors create an empty queue. From there you have to enqueue elements one-by-one, each costing O(log n), so populating via add()/addAll() has an overall time complexity of O(n log n).
Here's a quote from the documentation regarding the time complexity of operations:
Implementation note:
this implementation provides O(log(n)) time for
the enqueuing and dequeuing methods (offer, poll, remove() and add);
Since a binary heap (PriorityQueue is an implementation of the binary heap data structure) has worst-case time complexity O(log n) for inserting a new element, inserting n elements that way runs in linearithmic time.
Regarding the mechanism behind addAll(): as in many other collections, it delegates to the method add(), which has a logarithmic worst-case time complexity (see the implementation note quoted above).
Note
All the information provided above is relevant for the PriorityQueue class from the JDK, which is implemented as a Binary Heap (don't confuse this class with the Priority queue data structure).
There are many ways to implement the heap data structure, and some of them, like the Fibonacci heap, have amortized O(1) insertion, which allows populating them with n elements in linear time O(n). If such an implementation were included in the JDK in the future, it would almost certainly not replace the current PriorityQueue, but would rather be introduced as a new class (that's how Java has been developed since its earliest days: new things come, almost nothing goes away).
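For what it's worth, in current OpenJDK the PriorityQueue(Collection) constructor heapifies Comparable elements bottom-up in O(n), so the wrapper idea from the question can be sketched like this (Entry is a made-up name; it orders keys by descending frequency, so the queue's head is the key with the highest value):

```java
import java.util.Map;
import java.util.PriorityQueue;

public class MaxHeapBuild {
    // Hypothetical wrapper: Comparable by descending frequency.
    record Entry(int key, int freq) implements Comparable<Entry> {
        public int compareTo(Entry o) {
            return Integer.compare(o.freq, freq);  // descending by frequency
        }
    }

    static PriorityQueue<Entry> buildMaxHeap(Map<Integer, Integer> freqMap) {
        // The Collection constructor heapifies in O(n); no Comparator allowed here,
        // which is exactly why the elements themselves must be Comparable.
        return new PriorityQueue<>(
            freqMap.entrySet().stream()
                   .map(e -> new Entry(e.getKey(), e.getValue()))
                   .toList());
    }
}
```

The O(n) behaviour is an implementation detail rather than a documented guarantee, so don't build a correctness argument on it, only a performance expectation.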

Is get(0) on java.util.List always O(1)?

To my knowledge, there are the following implementations:
ArrayList
LinkedList
Vector
Stack
(based on http://tutorials.jenkov.com/java-collections/list.html; please correct me if this is wrong)
ArrayList is a dynamic array implementation, so, as with an array, get is O(1). LinkedList has O(1) get from the head. Vector and Stack are also array-based, hence O(1).
So is get(0) O(1) on EVERY built-in implementation of List? (You could of course write your own implementation with a get(0) of O(n!), hence "built-in".)
Is get(0) on java.util.List always O(1)?
Let us assume that there is a parameter N which stands for the length of the list¹.
For the 4 implementations of List that you mentioned, get(0) is indeed an O(1) operation:
ArrayList, Vector and Stack all implement get(i) using array subscripting and that is an O(1) operation.
LinkedList.get(i) involves i link traversals which is O(i). But if i is a constant, that reduces to O(1).
However there are other "built in" implementations of List. Indeed, there are a considerable number of them if you include the various non-public implementations, such as the List classes that implement sublists, unmodifiable lists, and so on. Generalizing from those 4 to "all of them" is not sound².
But get(0) won't be O(1) for all possible implementations of List.
Consider a simple linked list where the elements are chained in reverse order. Then get(0) needs to traverse to the end of the list, which is N link traversals: O(N).
Consider a list that is fully populated from the rows in a database query's result set the first time that you attempt to retrieve a list element. The first get call will be at least O(N) because you are fetching N rows. (It could be worse than O(N) if the database query is not O(N).) So the worst case complexity for any call to get is O(N) ... or worse.
Indeed, with some ingenuity, one could invent a custom list where get(0) has any Big-O complexity that you care to propose.
1 - I am being deliberately vague here. On the one hand, we need to identify a variable N denoting the "problem" size for complexity analysis to make sense. (The length of the list is the obvious choice.) On the other hand, the length of a List is a surprisingly "rubbery" concept when you consider all of the possible ways to implement the interface.
2 - I assume that you are asking this question because you want to write some library code that relies on List.get(0) being O(1). Since you can't prevent someone from using your library with a non-builtin list implementation, your "assume it is builtin" constraint in your question doesn't really help ... even if we could check all possible (past, current or future) builtin List implementations for you.
Ignoring custom implementations, and only looking at built-in implementations, like suggested at the end of the question, you still cannot say that get(0) will be O(1) regardless of list size.
As an example, calling get(0) on a sublist based on a LinkedList will be O(n):
List<Integer> list = new LinkedList<>(Arrays.asList(1,2,3,4,5,6,7,8,9));
List<Integer> subList = list.subList(4, 8);
Integer num = subList.get(0); // <===== O(n), not O(1)
In that code, subList.get(0) internally calls list.get(4), which has O(n) time complexity.
Yes, for all implementations of List you mentioned get(0) is O(1).

Java: is there a way to construct a max-heap from an array in O(n) using PriorityQueue?

Correct me if I'm wrong, but I think the PriorityQueue(Collection c) constructor will create a min-heap from a collection in time O(n). However, I couldn't find a constructor where I can pass both a collection and a comparator (in order to convert the min-heap to a max-heap). So I was wondering if there is a way to construct a max-heap from an array (say, int array) in O(n) using PriorityQueue?
No, having a set of elements arranged in a min-heap does not provide any advantage for rearranging them into a max-heap. Also, you seem to be assuming that the PriorityQueue constructors that accept a collection have O(n) asymptotic complexity. That's plausible -- even likely -- but it is not documented, so it is not safe to rely on it.
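If the O(n) bound really matters, a sketch that avoids PriorityQueue entirely is to heapify a plain int[] bottom-up yourself (Floyd's algorithm), which runs in O(n) and gives you the max-heap ordering directly:

```java
public class Heapify {
    // Bottom-up (Floyd) heap construction: turns a into a max-heap in O(n).
    static void buildMaxHeap(int[] a) {
        for (int i = a.length / 2 - 1; i >= 0; i--) {
            siftDown(a, i, a.length);
        }
    }

    // Restores the max-heap property for the subtree rooted at index i.
    static void siftDown(int[] a, int i, int n) {
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;
            if (child + 1 < n && a[child + 1] > a[child]) child++;  // pick larger child
            if (a[i] >= a[child]) break;
            int tmp = a[i]; a[i] = a[child]; a[child] = tmp;
            i = child;
        }
    }
}
```

The trade-off is that you lose PriorityQueue's API (offer/poll/resizing) and have to maintain the heap invariant yourself on every mutation.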

List vs. Map: Which takes less space and more efficient?

I have two classes Foo and Bar.
class Foo
{
    Set<Integer> bars;       // Foo objects have a collection of Bars.
    Set<Integer> adjacents;  // Adjacency list of Foos.
}
class Bar
{
    int foo;                        // ID of the Foo this object belongs to
    Ipsum ipsum;                    // An arbitrary class, but it must be present
    Map<Integer, Float> adjacents;  // Adjacency list of Bars
}
The number of Bars is predefined (up to 1000), hence I may use an array.
But the number of Foos is undefined (at most #ofBars/4).
Considering addition, deletion and get(), I need the option that is faster and takes less space (because I'm going to use serialization).
Here are my options (as far as I have thought)
Option 1: Don't define a class for Foo. Instead, use List<Set<Integer>> foo; and another map, Map<Integer, Set<Integer>> fooAdjacencies;
Option 2: Use Map<Integer, Set<Integer>> foo; if I want to get the bars of i, I simply write foo.get(i).
Option 3: Don't define classes. Instead, use option 2 and, for the Bar class:
Map<Integer, Ipsum> bar;
Map<Integer, Map<Integer, Float>> barAdjacencies;
Which option should I choose in terms of space and time efficiency?
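For reference, Option 3 can be sketched like this (the field and class names are the question's; Ipsum is stubbed as an empty class since its contents weren't given):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Graph {
    static class Ipsum {}  // stub for the question's arbitrary class

    // Option 3: no Foo/Bar classes, just maps keyed by ID.
    Map<Integer, Set<Integer>> foo = new HashMap<>();             // bars of each Foo
    Map<Integer, Ipsum> bar = new HashMap<>();                    // payload of each Bar
    Map<Integer, Map<Integer, Float>> barAdjacencies = new HashMap<>();

    void addBar(int fooId, int barId, Ipsum payload) {
        foo.computeIfAbsent(fooId, k -> new HashSet<>()).add(barId);
        bar.put(barId, payload);
    }

    void addEdge(int from, int to, float weight) {
        barAdjacencies.computeIfAbsent(from, k -> new HashMap<>()).put(to, weight);
    }
}
```

With hash maps, addBar/addEdge and lookups are all O(1) on average, at the cost of per-entry object overhead.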
This sounds like it'd be very helpful for you (specifically the Data Structures section): http://bigocheatsheet.com/
You say
I need my structure to be efficient while adding, removing and finding elements. No other behavior.
The problem is that Lists and Maps are usually used in totally different cases. Their names describe their use cases fairly well -- you use a List if you need to list something (probably in some sequential order), while a Map would be used if you need to map an input to an output. You can use a Map as a List by mapping Integers to your elements, but that's overcomplicating things a bit. However, even within List and Map you can have different implementations that differ wildly in asymptotic performance.
With few exceptions, data structures will take O(n) space, which makes sense. If memory serves, anything other than an ArrayList (or other collections backed only by a primitive array) will have a decent amount of space overhead as they use other objects (e.g. Nodes for LinkedLists and Entry objects for Maps) to organize the underlying structure. I wouldn't worry too much about this overhead though unless space really is at a premium.
For best-performance addition, deletion, and search, you want to look at how the data structure is implemented.
LinkedList-style implementation will net you O(1) addition and deletion (and with a good constant factor, too!), but will have a pretty expensive get() with O(n) time, because the list will have to be traversed every time you want to get something. Java's LinkedList implementation, though, removes in O(n) time; while the actual act of deletion is O(1), that's only if you have a reference to the actual node that you're removing. Because you don't, removals in Java's LinkedList are O(n) -- O(n) for searching for the node to remove, and O(1) for removal.
Data structures backed with a plain array will have O(1) get() because it's an array, but takes O(n) to add, and delete, because any addition/deletion other than at the last element requires all other elements to be shuffled (in Java's implementation at least). Searching for something using an object instead of an index is done in O(n) time because you have to iterate over the array to find the object.
The following two structures are usually Maps, and so usually require you to implement equals() (and hashCode() for HashMaps):
Data structures backed by a tree (e.g. TreeMap) have O(lg n) add/remove, as a good implementation is self-balancing, so worst-case additions/deletions only have to walk the height of the tree. get() operations are likewise O(lg n). Using a tree requires that your elements be sortable/comparable in some way, which could be a bonus or a hindrance, depending on your usage.
Hash-based data structures have amortized (average) O(1) everything, albeit with a slightly higher constant factor due to the overhead of hashing (and of following chains if the hash spread is poor). HashMap performance can degrade badly if you write a bad hashCode() function, though, so you want to be careful with that, although the implementers of Java's HashMap did do some work behind the scenes to at least partially negate the effect of bad hashCode() implementations.
Hope that rundown helped. If you clear up how your program is structured, I might be able to give a recommendation. Until then, the best I can do is show you the options and let you pick.
I find this problem description a little hard to follow, but I think you're just looking for general collections/data structures advice.
A list (say, an ArrayList) easily allows you to add and iterate over elements. When it is expanded beyond the size of the underlying array, a one-off costly resize operation is executed to add more space; that is fine because it happens rarely and the amortized cost is low. Searching for a specific element in a list is slow because you need to traverse it in order; there is no implied ordering in most lists. Deleting elements depends on the underlying list implementation: an ArrayList shifts every subsequent element down (via System.arraycopy), so removal from anywhere but the end is O(n). When using lists you also have to consider where you are adding elements: linked lists are slower to iterate but can cheaply add and remove elements at any position you already hold, while array lists cannot cheaply add an element anywhere but the end.
Per your requirements, if you need to execute a get or find on an element, then you need some kind of search structure to speed it up. This makes a map better: you can locate elements in O(lg n) time (tree-based) or O(1) on average (hash-based), instead of the linear time needed to search an unordered list. Adding and removing elements in a map is also relatively fast, so that's probably your best option.
Most importantly, implement it more than one way and profile it yourself to learn more :) Lists are rarely a good choice when searching is required though.

Best way to remove and add elements from the java List

I have 100,000 objects in a list. I want to remove a few elements from the list based on a condition. Can anyone tell me the best approach in terms of memory and performance?
The same question for adding objects based on a condition.
Thanks in advance,
Raju
Your container is not just a List. List is an interface that can be implemented by, for example ArrayList and LinkedList. The performance will depend on which of these underlying classes is actually instantiated for the object you are polymorphically referring to as List.
ArrayList can access elements in the middle of the list quickly, but if you delete one of them you need to shift a whole bunch of elements. LinkedList is the opposite in this respect, requiring iteration for access, but deletion is just a matter of reassigning pointers.
Your performance depends on the implementation of List, and the best choice of implementation depends on how you will be using the List and which operations are most frequent.
If you're going to be iterating a list and applying tests to each element, then a LinkedList will be most efficient in terms of CPU time, because you don't have to shift any elements in the list. It will, however consume more memory than an ArrayList, because each list element is actually held in an entry.
However, it might not matter. 100,000 is a small number, and if you aren't removing a lot of elements the cost to shift an ArrayList will be low. And if you are removing a lot of elements, it's probably better to restructure it as a copy-with-filter.
However, the only real way to know is to write the code and benchmark it.
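On Java 8 and later, a simple option for the removal case is Collection.removeIf; ArrayList overrides it to filter in a single O(n) compacting pass instead of shifting elements after every individual removal. A minimal sketch (the threshold predicate is made up):

```java
import java.util.ArrayList;
import java.util.List;

public class FilterDemo {
    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>(List.of(1, 200, 3, 400, 5));
        // Single pass over the list; elements matching the predicate are dropped.
        values.removeIf(v -> v >= 100);
        System.out.println(values);  // [1, 3, 5]
    }
}
```

This keeps the original list; if you instead want a new filtered list, a stream with filter() and a collector does the copy-with-filter variant mentioned above.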
Collections2.filter (from Guava) produces a filtered collection based on a predicate.
List<Number> myNumbers = Arrays.asList(Integer.valueOf(1), Double.valueOf(1e6));
Collection<Number> bigNumbers = Collections2.filter(
    myNumbers,
    new Predicate<Number>() {
        public boolean apply(Number n) {
            return n.doubleValue() >= 100d;
        }
    });
Note, that some operations like size() are not efficient with this scheme. If you tend to follow Josh Bloch's advice and prefer isEmpty() and iterators to unnecessary size() checks, then this shouldn't bite you in practice.
LinkedList could be a good choice.
LinkedList does "remove and add elements" more effectively than ArrayList, and there is no need to call a method such as ArrayList.trimToSize() to release unused memory. But LinkedList is a doubly-linked list: each element is wrapped in an Entry, which needs extra memory.
