I'm trying to write a data structure which is a combination of Stack and HashSet with fast push/pop/membership (I'm looking for constant time operations). Think of Python's OrderedDict.
I tried a few things and came up with the following code: HashInt and SetInt. I still need to add some documentation to the source, but basically I use a hash with linear probing to store indices into a vector of the keys. Since linear probing always puts the last element at the end of a contiguous range of already filled cells, pop() can be implemented very easily, without a sophisticated remove operation.
I have the following problems:
the data structure consumes a lot of memory (some improvement is obvious: stackKeys is larger than needed).
some operations are slower than if I had used fastutil (e.g. pop(), and even push() in some scenarios). I tried rewriting the classes using fastutil and trove4j, but the overall speed of my application halved.
What performance improvements would you suggest for my code?
What open-source library/code do you know that I can try?
You've already got a pretty good implementation. The only improvement obvious to me is that you do more work than you need to by searching when popping. You should store in the stack not the key itself but the index into the key array. This gives you trivially fast pops at the expense of only one more pointer indirection when you want to peek the last item.
Just size your stack to LOAD_FACTOR*(heap array size), in addition to that, and you should have about as fast an implementation as you could expect with as little memory as you can manage given your speed requirements.
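As a rough illustration of the idea above (the stack holds table slots rather than keys), here is a minimal sketch. It is not the asker's HashInt/SetInt code: the class name IntStackSet, the fixed capacity, the hash multiplier, and the restriction to non-negative keys are all assumptions made to keep the example short.

```java
import java.util.Arrays;
import java.util.NoSuchElementException;

/** Sketch: fixed-capacity stack + set of non-negative ints using linear probing. */
final class IntStackSet {
    private static final int EMPTY = -1; // marks a free slot, so keys must be >= 0
    private final int[] table;           // open-addressing hash table holding the keys
    private final int[] slots;           // stack of table indices, in push order
    private int size = 0;

    IntStackSet(int maxElements) {
        table = new int[2 * maxElements + 1]; // keeps the load factor below 0.5
        slots = new int[maxElements];
        Arrays.fill(table, EMPTY);
    }

    /** Returns the slot holding key, or the empty slot where it would be inserted. */
    private int probe(int key) {
        int i = Math.floorMod(key * 0x9E3779B9, table.length);
        while (table[i] != EMPTY && table[i] != key) {
            i = (i + 1) % table.length;
        }
        return i;
    }

    boolean contains(int key) {
        return key >= 0 && table[probe(key)] == key;
    }

    /** Pushes key; returns false if it is already present. */
    boolean push(int key) {
        if (key < 0) throw new IllegalArgumentException("key must be non-negative");
        int i = probe(key);
        if (table[i] == key) return false;
        if (size == slots.length) throw new IllegalStateException("stack is full");
        table[i] = key;
        slots[size++] = i; // remember the slot, not the key
        return true;
    }

    /** Pops the most recently pushed key without searching the table. */
    int pop() {
        if (size == 0) throw new NoSuchElementException("pop from empty stack");
        int i = slots[--size];
        int key = table[i];
        table[i] = EMPTY; // safe only because removals are strictly last-in, first-out
        return key;
    }
}
```

Clearing the slot directly is only valid because the popped key is always the most recent insertion, so no other key's probe sequence runs through it. A growable version would have to rebuild the slot stack whenever the table is rehashed.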
I think that what you want is (almost) already available in the libraries: LinkedHashSet is a hash-set with an underlying doubly linked list (which makes it iterable). LinkedHashMap even has a removeEldestEntry which sounds very similar to a pop-method.
How is the performance of a naive solution like:
class HashStack<T> {
private HashMap<T, Integer> counts = new HashMap<T, Integer>();
private Stack<T> stack = new Stack<T>();
public void push(T t) {
stack.push(t);
counts.put(t, 1 + getCount(t));
}
public T pop() {
T t = stack.pop();
counts.put(t, counts.get(t) - 1);
return t;
}
private int getCount(T t) {
return counts.containsKey(t) ? counts.get(t) : 0;
}
public boolean contains(T t) {
return getCount(t) > 0;
}
public String toString() {
return stack.toString();
}
}
I would suggest using TreeSet<T> as it provides guaranteed O(log n) cost for add, remove, and contains.
I'm using a huge ArrayList with the code below
public final List<MyClass> list = new ArrayList<>();
public void update(MyClass myClass) {
int i;
for (i=0; i < list.size(); i++) {
if (myClass.foo(list.get(i))) {
list.set(i, myClass);
break;
}
}
if (i == list.size()) {
list.add(myClass);
}
}
The list is extremely large. Is there something else I can do to improve performance in this scenario? Maybe using some Java 8 feature, replacing ArrayList, or something like that.
Another piece of code related to this List that takes too long to run is the one below:
public List<MyClass> something(Integer amount) {
list.sort((m1, m2) -> Double.compare(m2.getBar(), m1.getBar()));
return list.stream()
.limit(amount)
.collect(Collectors.toList());
}
Any help is welcome, thank you all
It seems that ArrayList is not a good choice here.
In the first case, you try to find an object in the list by one of its properties. To do that, you have to check each element of the list, so the bigger the list, the longer it takes. (The worst-case complexity is O(N) with ArrayList.)
If you use a HashMap instead of a List, you can use that property as the key of the map. You can then select the object you need to update directly, without checking each element, and the execution time no longer depends on the number of entries. (A HashMap lookup is O(1) on average, assuming a reasonable hashCode.)
If you use a HashMap instead of an ArrayList, your update code will look like this:
public void update(MyClass myClass) {
map.put(myClass.getKey(), myClass);
}
(where getKey() returns the property you compare for equality in your foo method).
But this only covers the first case. With the information we have, it seems to be the best solution.
There is something else that I can do to increase the performance with this scenario?
The problem is that your algorithm has to apply myClass.foo to every element of the list until you find the first match. If you do this serially, then the worst-case complexity is O(N) where N is the list size. (And the list size is large.)
Now, you could do the searching in parallel. However, if there can be multiple matches, then matching the first one in the list is going to be tricky. And you still end up with O(N/C) where C is the number of cores available.
The only way to get better than O(N) is to use a different data structure. But without knowing what the MyClass::foo method does, it is hard to say what that data structure should be.
Your second problem seems to be trying to solve the "top K of N" problem. This can be implemented in O(N log K) and possibly better; see Optimal algorithm for returning top k values from an array of length N.
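A minimal sketch of the O(N log K) approach mentioned above, using a bounded min-heap instead of sorting the whole list. The class and method names (TopK, topK) are made up for this example. For the question's code, the call would presumably look like topK(list, amount, Comparator.comparingDouble(MyClass::getBar)).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopK {
    /** Returns the k largest elements according to cmp, largest first. */
    static <T> List<T> topK(Iterable<T> items, int k, Comparator<? super T> cmp) {
        if (k <= 0) return new ArrayList<>();
        // Min-heap of at most k elements: the head is the smallest element kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(k, cmp);
        for (T item : items) {
            if (heap.size() < k) {
                heap.add(item);
            } else if (cmp.compare(item, heap.peek()) > 0) {
                heap.poll();    // evict the current smallest
                heap.add(item); // O(log k)
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(cmp.reversed()); // only k elements are sorted
        return result;
    }
}
```

Unlike list.sort, this leaves the original list untouched, which also avoids re-sorting it on every call.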
I have implemented a graph.
I want to sort a given subset of vertices with respect to their degrees.
Therefore, I've written a custom comparator named DegreeComparator.
private class DegreeComparator implements Comparator<Integer>
{
@Override
public int compare(Integer arg0, Integer arg1)
{
if(adj[arg1].size() == adj[arg0].size()) return arg1 - arg0;
else return adj[arg1].size() - adj[arg0].size();
}
}
So, which one of the below is more efficient?
Using TreeSet
public Collection<Integer> sort(Collection<Integer> unsorted)
{
Set<Integer> sorted = new TreeSet<Integer>(new DegreeComparator());
sorted.addAll(unsorted);
return sorted;
}
Using ArrayList
Collections.sort(unsorted, new DegreeComparator());
Notice that the second approach is not a function, but a one-line code.
Intuitively, I'd rather choose the second one. But I'm not sure if it is more efficient.
Java API contains numerous Collection and Map implementations so it might be confusing to figure out which one to use. Here is a quick flowchart that might help with choosing from the most common implementations
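A TreeSet is a Set. It drops any element that its comparator reports as equal to one already present (with the comparator above, which breaks degree ties by vertex ID, that means repeated vertex IDs; a comparator that looked only at the degree would drop every vertex with an already-seen degree). So the two approaches aren't equivalent.
Anyway, if what you naturally want is a sorted list, then sort the list. This works whether or not the collection has duplicates, and even though it has the same complexity (O(n*log(n))) as populating a TreeSet, it is probably faster (because it just has to move elements in an array, instead of having to create lots of tree nodes).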
If you only sort once, then the ArrayList is an obvious winner. The TreeSet is better if you add or remove items often as sorting a list again and again would be slow.
Note also that all tree structures need more memory and memory access indirection which makes them slower.
In the case of medium-sized lists that change rather frequently by a single element, the fastest solution might be to use an ArrayList and insert into the proper position (obviously assuming the list gets sorted initially).
You'd need to determine the insert position via Collections.binarySearch and then insert or remove. Actually, I wouldn't do it unless the performance were really critical and a benchmark showed that it helps. It gets slow when the list gets really big, and the gain is limited as Java uses TimSort, which is optimized for such a case.
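For illustration, a sketch of the insert-at-the-proper-position step. It uses Collections.binarySearch, the List counterpart of Arrays.binarySearch. The helper name insertSorted is made up for this example.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class SortedInsert {
    /** Inserts item into an already sorted list, keeping the list sorted. */
    static <T> void insertSorted(List<T> sorted, T item, Comparator<? super T> cmp) {
        int pos = Collections.binarySearch(sorted, item, cmp);
        if (pos < 0) {
            pos = -(pos + 1); // not found: the result encodes the insertion point
        }
        sorted.add(pos, item); // O(n) for ArrayList, since later elements shift right
    }
}
```

The search itself is O(log n), but the shift in add is linear, which is why this only pays off for lists of moderate size.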
As pointed out in a comment, ensuring that the Comparator returns different values for different elements is sometimes non-trivial. Fortunately, there's Guava's Ordering#arbitrary, which solves the problem if you don't need to be compatible with equals. In case you do, a similar method can be written (I'm sure I could find it somewhere if requested).
I am looking for a fast queue implementation in Java. I see that LinkedList implements the Queue interface, but it will only be as fast as a LinkedList right? Is there a way to have a queue that will be faster especially for add (I only need poll, add and check for empty).
Down the line I may also need a PriorityQueue but not yet.
If multiple threads are going to be accessing the queue then consider using an ArrayBlockingQueue. Otherwise take a look at ArrayDeque. From the ArrayDeque API:
"This class is likely to be faster than Stack when used as a stack, and faster than LinkedList when used as a queue."
Specifically, an array-based queue does not allocate a node object for every element, and it only has to resize the underlying array when the existing capacity is exhausted, so additions to the queue are generally faster than with LinkedList. Be aware that ArrayBlockingQueue is a bounded implementation whereas ArrayDeque will resize as required.
The flip-side is that LinkedList will typically provide a much more compact representation, particularly in cases where your queue grows and shrinks by a large amount. For example, if you added 10,000,000 elements to an ArrayDeque and then removed 9,999,999 elements, the underlying array would still be of length 10,000,000 whereas a LinkedList would not suffer from this problem.
In reality, for single-threaded access to a non-blocking queue I tend to favour LinkedList. I imagine the performance differences are so negligible you wouldn't notice them anyway.
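For reference, the three operations from the question (add, poll, and an emptiness check) are all part of the Queue interface, so the implementation can be swapped without touching the calling code. A small sketch; the class name QueueDemo and the method fillAndDrain are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class QueueDemo {
    /** Adds 0..n-1 to the queue, then drains it and returns the sum of the polled values. */
    static long fillAndDrain(Queue<Integer> queue, int n) {
        for (int i = 0; i < n; i++) {
            queue.add(i);
        }
        long sum = 0;
        while (!queue.isEmpty()) {
            sum += queue.poll();
        }
        return sum;
    }

    public static void main(String[] args) {
        // The argument is only an initial capacity hint; ArrayDeque still grows on demand.
        Queue<Integer> queue = new ArrayDeque<>(1024);
        System.out.println(fillAndDrain(queue, 1000)); // prints 499500
    }
}
```

Passing new LinkedList<>() instead of the ArrayDeque is a one-line change, which makes it easy to benchmark both against your own workload.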
I see that LinkedList implements the Queue interface, but it will only be as fast as a LinkedList right?
Eyeballing the source code, LinkedList is O(1) for Queue.add, Queue.poll, and Queue.peek operations.
I hope that's fast enough.
If performance of a linked list was really a problem, an alternative would be to implement a "circular queue" in an array, i.e. a queue where the start and end point move as entries are added and deleted. I can give more details if you care. When I was using languages that did not have a library of collections, this was how I always implemented queues because it was easier to write than a linked list and it was faster. But with built-in collections, the effort of writing and debugging my own collection for a special case is not worth the trouble 99% of the time: When it's already written, the fact that I could write it a different way faster than I could re-write it the way Java does is pretty much an irrelevant fact. And any performance gain is likely to be too small to be worth the trouble. I sub-type existing collections to get special behavior I need now and then, but I'm hard-pressed to think of the last time that I wrote one from scratch.
You may want to have a look at https://dzone.com/articles/gaplist-lightning-fast-list which introduces GapList. This new list implementation combines the strengths of both ArrayList and LinkedList.
It therefore implements the Deque interface, but can also be presized like the above mentioned ArrayDeque. In addition, you also get all the possibilities of the List interface for free. Get it at https://github.com/magicwerk/brownies-collections.
Start with really simplistic rotating Queue implementation with "C/C++ like" attitude and fixed size.
class SimpleQueue<E>
{
int index = 0;
int head = 0;
int size = 100;
int counter = 0;
E[] data ;
@SuppressWarnings("unchecked")
SimpleQueue()
{
data = (E[]) new Object[size];
}
public void add(E e)
{
data[index]=e;
index=(index+1)%size;
counter++;
}
public E poll()
{
E value = data[head];
head=(head+1)%size;
counter--;
return value;
}
public boolean empty()
{ return counter==0; }
//Test
public static void main(String[] args)
{
SimpleQueue<Integer> s = new SimpleQueue<Integer>();
System.out.println(s.empty());
for(int i=0; i< 10; i++)
s.add(i);
System.out.println(s.empty());
for(int i=0; i<10; i++)
System.out.print(s.poll()+",");
System.out.println("\n"+s.empty());
}
}
And then improve it.
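One possible next step, as a sketch only: the fixed-size version above silently overwrites old entries once more than 100 elements are queued, and poll() on an empty queue returns garbage. The variant below (the name RingQueue is made up) grows when full and fails loudly when empty. Note that it throws on an empty queue, whereas java.util.Queue.poll() returns null.

```java
import java.util.NoSuchElementException;

/** Sketch of a growable circular queue; not thread-safe. */
final class RingQueue<E> {
    private Object[] data = new Object[8];
    private int head = 0;  // index of the next element to poll
    private int count = 0; // number of stored elements

    void add(E e) {
        if (count == data.length) {
            grow();
        }
        data[(head + count) % data.length] = e; // tail position wraps around
        count++;
    }

    @SuppressWarnings("unchecked")
    E poll() {
        if (count == 0) throw new NoSuchElementException("queue is empty");
        E value = (E) data[head];
        data[head] = null; // drop the reference so it can be garbage collected
        head = (head + 1) % data.length;
        count--;
        return value;
    }

    boolean isEmpty() {
        return count == 0;
    }

    /** Doubles the capacity and unwraps the elements so that head moves to index 0. */
    private void grow() {
        Object[] bigger = new Object[data.length * 2];
        for (int i = 0; i < count; i++) {
            bigger[i] = data[(head + i) % data.length];
        }
        data = bigger;
        head = 0;
    }
}
```

This is essentially what ArrayDeque already does, so in practice it mostly serves to show how the library class works.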
I'm trying to use a PriorityQueue to order objects using a Comparator.
This can be achieved easily, but the object's fields (from which the comparator calculates the priority) may change after the initial insertion. Most people have suggested the simple solution of removing the object, updating the values and reinserting it, as insertion is when the priority queue's comparator is put into action.
Is there a better way other than just creating a wrapper class around the PriorityQueue to do this?
You have to remove and re-insert, as the queue works by putting new elements in the appropriate position when they are inserted. This is much faster than the alternative of finding the highest-priority element every time you pull out of the queue. The drawback is that you cannot change the priority after the element has been inserted. A TreeMap has the same limitation (as does a HashMap, which also breaks when the hashcode of its elements changes after insertion).
If you want to write a wrapper, you can move the comparison code from enqueue to dequeue. You would not need to sort at enqueue time anymore (because the order it creates would not be reliable anyway if you allow changes).
But this will perform worse, and you want to synchronize on the queue if you change any of the priorities. Since you need to add synchronization code when updating priorities, you might as well just dequeue and enqueue (you need the reference to the queue in both cases).
I don't know if there is a Java implementation, but if you're changing key values a lot, you can use a Fibonacci heap, which has O(1) amortized cost to decrease the key value of an entry in the heap, rather than O(log(n)) as in an ordinary heap.
One easy solution is to simply add the element to the priority queue again. This does not change the way you extract elements. It does consume more space, but usually not enough to affect your running time.
To illustrate this, consider the Dijkstra implementation below:
public int[] dijkstra() {
int distance[] = new int[this.vertices];
int previous[] = new int[this.vertices];
for (int i = 0; i < this.vertices; i++) {
distance[i] = Integer.MAX_VALUE;
previous[i] = -1;
}
distance[0] = 0;
previous[0] = 0;
PriorityQueue<Node> pQueue = new PriorityQueue<>(this.vertices, new NodeComparison());
addValues(pQueue, distance);
while (!pQueue.isEmpty()) {
Node n = pQueue.remove();
List<Edge> neighbours = adjacencyList.get(n.position);
for (Edge neighbour : neighbours) {
if (distance[neighbour.destination] > distance[n.position] + neighbour.weight) {
distance[neighbour.destination] = distance[n.position] + neighbour.weight;
previous[neighbour.destination] = n.position;
pQueue.add(new Node(neighbour.destination, distance[neighbour.destination]));
}
}
}
return previous;
}
Here our interest is in line
pQueue.add(new Node(neighbour.destination, distance[neighbour.destination]));
I am not changing the priority of a particular node by removing it and adding it again; I am just adding a new node with the same value but a different priority.
At extraction time I will always get this new node first, because I have implemented a min-heap here: the node with the greater value (lower priority) is always extracted afterwards, so all neighbouring nodes will already have been relaxed by the time the lower-priority entry is extracted.
Without reimplementing the priority queue yourself (so by only using java.util.PriorityQueue) you have essentially two main approaches:
1) Remove and put back
Remove element then put it back with new priority. This is explained in the answers above. Removing an element is O(n) so this approach is quite slow.
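A sketch of this first approach, assuming a hypothetical Task element type whose priority field is read by the queue's comparator. The important detail is that the field is only modified while the element is outside the queue.

```java
import java.util.PriorityQueue;

final class Reprioritize {
    /** Hypothetical element type whose priority can change. */
    static final class Task {
        final String name;
        int priority;

        Task(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    /** Changes a task's priority without corrupting the heap order. */
    static void changePriority(PriorityQueue<Task> queue, Task task, int newPriority) {
        boolean wasQueued = queue.remove(task); // linear scan: O(n)
        task.priority = newPriority;            // mutate only while outside the queue
        if (wasQueued) {
            queue.add(task);                    // re-insert: O(log n)
        }
    }
}
```

The queue is expected to be built with a comparator over the priority field, for example new PriorityQueue<>(Comparator.comparingInt((Task t) -> t.priority)).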
2) Use a Map and keep stale items in the queue
Keep a HashMap of item -> priority. The keys of the map are the items (without their priority) and the values of the map are the priorities.
Keep it in sync with the PriorityQueue (i.e. every time you add or remove an item from the Queue, update the Map accordingly).
Now when you need to change the priority of an item, simply add the same item to the queue with a different priority (and update the map, of course). When you poll an item from the queue, check whether its priority is the same as in your map. If not, ditch it and poll again.
If you don't need to change the priorities too often, this second approach is faster. Your heap will be larger and you might need to poll more times, but you don't need to find your item.
The 'change priority' operation would be O(f(n) log n*), with f(n) the number of 'change priority' operations per item and n* the actual size of your heap (which is n · f(n)).
I believe that if f(n) is O(n / log n) (for example f(n) = O(sqrt(n))), this is faster than the first approach.
Note: in the explanation above, by priority I mean all the variables that are used in your Comparator. Also, your items need to implement equals and hashCode, and neither method should use the priority variables.
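The second approach can be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: the class name LazyQueue, the int priorities, and the min-first ordering are all choices made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Sketch: min-priority queue that handles priority changes by leaving stale entries behind. */
final class LazyQueue<T> {
    private static final class Entry<T> {
        final T item;
        final int priority;

        Entry(T item, int priority) {
            this.item = item;
            this.priority = priority;
        }
    }

    private final PriorityQueue<Entry<T>> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.priority, b.priority));
    private final Map<T, Integer> current = new HashMap<>(); // item -> live priority

    /** Adds item, or changes its priority if it is already queued. */
    void put(T item, int priority) {
        current.put(item, priority);
        heap.add(new Entry<>(item, priority)); // older entries for item become stale
    }

    /** Removes and returns the item with the smallest live priority. */
    T poll() {
        while (!heap.isEmpty()) {
            Entry<T> e = heap.poll();
            Integer live = current.get(e.item);
            if (live != null && live == e.priority) {
                current.remove(e.item);
                return e.item;
            }
            // Otherwise the entry is stale: its priority changed or the item was already polled.
        }
        throw new NoSuchElementException("queue is empty");
    }

    boolean isEmpty() {
        return current.isEmpty();
    }
}
```

The map, not the heap, decides emptiness, because the heap may still hold stale entries after the last live item has been polled.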
It depends a lot on whether you have direct control of when the values change.
If you know when the values change, you can remove and reinsert (which is in fact fairly expensive, as removing requires a linear scan over the heap!).
Alternatively, you can use an UpdatableHeap structure (not in stock Java though) for this situation. Essentially, that is a heap that tracks the position of its elements in a hashmap. This way, when the priority of an element changes, it can repair the heap. Third, you can look for a Fibonacci heap, which does the same.
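To make the position-tracking idea concrete, here is a minimal sketch of such a heap. It is an illustration written for this answer, not the API of any existing library: the class name IndexedHeap, the int priorities and the min-first ordering are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Sketch: binary min-heap that tracks each item's position, so updates cost O(log n). */
final class IndexedHeap<T> {
    private final List<T> heap = new ArrayList<>();           // items in heap order
    private final Map<T, Integer> index = new HashMap<>();    // item -> position in heap
    private final Map<T, Integer> priority = new HashMap<>(); // item -> current priority

    boolean isEmpty() {
        return heap.isEmpty();
    }

    /** Inserts item, or repairs the heap around it after its priority changed. */
    void addOrUpdate(T item, int newPriority) {
        Integer pos = index.get(item);
        priority.put(item, newPriority);
        if (pos == null) {
            heap.add(item);
            pos = heap.size() - 1;
            index.put(item, pos);
        }
        siftDown(siftUp(pos)); // at most one of the two actually moves the item
    }

    T poll() {
        if (heap.isEmpty()) throw new NoSuchElementException("heap is empty");
        T top = heap.get(0);
        T last = heap.remove(heap.size() - 1);
        index.remove(top);
        priority.remove(top);
        if (!heap.isEmpty()) {
            place(0, last); // move the last leaf to the root, then restore order
            siftDown(0);
        }
        return top;
    }

    private int siftUp(int pos) {
        T item = heap.get(pos);
        while (pos > 0) {
            int parent = (pos - 1) / 2;
            if (prio(heap.get(parent)) <= prio(item)) break;
            place(pos, heap.get(parent));
            pos = parent;
        }
        place(pos, item);
        return pos;
    }

    private void siftDown(int pos) {
        T item = heap.get(pos);
        int n = heap.size();
        while (2 * pos + 1 < n) {
            int child = 2 * pos + 1;
            if (child + 1 < n && prio(heap.get(child + 1)) < prio(heap.get(child))) child++;
            if (prio(item) <= prio(heap.get(child))) break;
            place(pos, heap.get(child));
            pos = child;
        }
        place(pos, item);
    }

    private int prio(T item) {
        return priority.get(item);
    }

    /** Stores item at pos and records the new position in the index map. */
    private void place(int pos, T item) {
        heap.set(pos, item);
        index.put(item, pos);
    }
}
```

As with any map-based structure, the items need stable equals and hashCode implementations that do not depend on the priority.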
Depending on your update rate, a linear scan / quicksort / QuickSelect each time might also work. In particular, if you have many more updates than pulls, this is the way to go. QuickSelect is probably best if you have batches of updates followed by batches of pull operations.
To trigger reheapify try this:
if(!priorityQueue.isEmpty()) {
priorityQueue.add(priorityQueue.remove());
}
Something I've tried, and which has worked so far, is peeking to see whether the object you're changing is the same reference as the head of the PriorityQueue. If it is, you poll(), change it, then re-insert it; otherwise you can change it without polling, because when the head is polled the heap is re-heapified anyway.
DOWNSIDE: This changes the priority for Objects with the same Priority.
Is there a better way other than just creating a wrapper class around the PriorityQueue to do this?
It depends on the definition of "better" and the implementation of the wrapper.
If the implementation of the wrapper is to re-insert the value using the PriorityQueue's .remove(...) and .add(...) methods,
it's important to point out that .remove(...) runs in O(n) time.
Depending on the heap implementation,
updating the priority of a value can be done in O(log n) or even O(1) time,
therefore this wrapper suggestion may fall short of common expectations.
If you want to minimize your effort to implement,
as well as the risk of bugs of any custom solution,
then a wrapper that performs re-insert looks easy and safe.
If you want the implementation to be faster than O(n),
then you have some options:
Implement a heap yourself. The Wikipedia entry describes multiple variants with their properties. This approach is likely to get you the best performance; at the same time, the more code you write yourself, the greater the risk of bugs.
Implement a different kind of wrapper: handle updating the priority by marking the entry as removed and adding a new entry with the revised priority.
This is relatively easy to do (less code), see below, though it has its own caveats.
I came across the second idea in Python's documentation,
and applied it to implement a reusable data structure in Java (see caveats at the bottom):
public class UpdatableHeap<T> {
private final PriorityQueue<Node<T>> pq = new PriorityQueue<>(Comparator.comparingInt(node -> node.priority));
private final Map<T, Node<T>> entries = new HashMap<>();
public void addOrUpdate(T value, int priority) {
if (entries.containsKey(value)) {
entries.remove(value).removed = true;
}
Node<T> node = new Node<>(value, priority);
entries.put(value, node);
pq.add(node);
}
public T pop() {
while (!pq.isEmpty()) {
Node<T> node = pq.poll();
if (!node.removed) {
entries.remove(node.value);
return node.value;
}
}
throw new IllegalStateException("pop from empty heap");
}
public boolean isEmpty() {
return entries.isEmpty();
}
private static class Node<T> {
private final T value;
private final int priority;
private boolean removed = false;
private Node(T value, int priority) {
this.value = value;
this.priority = priority;
}
}
}
Note some caveats:
Entries marked removed stay in memory until they are popped
This can be unacceptable in use cases with very frequent updates
The internal Node wrapped around the actual values is an extra memory overhead (constant per entry). There is also an internal Map, mapping all the values currently in the priority queue to their Node wrapper.
Since the values are used in a map, users must be aware of the usual cautions when using a map, and make sure to have appropriate equals and hashCode implementations.