How to increase performance with a Java ArrayList

I'm using a huge ArrayList with the code below:
public final List<MyClass> list = new ArrayList<>();

public void update(MyClass myClass) {
    int i;
    for (i = 0; i < list.size(); i++) {
        if (myClass.foo(list.get(i))) {
            list.set(i, myClass);
            break;
        }
    }
    if (i == list.size()) {
        list.add(myClass);
    }
}
The list is extremely large. Is there something else that I can do to increase the performance in this scenario? Maybe using some Java 8 feature, replacing ArrayList, or something like that.
Another piece of code involving this list that is taking too long to run is shown below:
public List<MyClass> something(Integer amount) {
    list.sort((m1, m2) -> Double.compare(m2.getBar(), m1.getBar()));
    return list.stream()
            .limit(amount)
            .collect(Collectors.toList());
}
Any help is welcome. Thank you all.

It seems like ArrayList is not a good choice here.
In the first case, you attempt to find an object by its properties in the list. To find it, you have to check each element of the list; the bigger the list, the longer that takes. (You have a worst-case complexity of O(N) with ArrayList.)
If you use a HashMap instead of a List, you can use that property as the key of your map. That way you can select the object you need to update directly, without checking each element, and the execution time no longer depends on the number of entries. (You have an expected complexity of O(1) with HashMap.)
If you use a HashMap instead of an ArrayList, your update code will look like this:
public void update(MyClass myClass) {
    map.put(myClass.getKey(), myClass);
}
(where getKey() returns the property that your foo method compares for equality).
This only covers the first case, but with the information we have, it seems like the best solution.
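A minimal sketch of that map-based version, assuming the property that foo compares is a String exposed as getKey() (both MyClass's shape and the names here are illustrative, not from the question):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the question's MyClass: assumes the property
// that foo() compares is a String key.
class MyClass {
    private final String key;
    MyClass(String key) { this.key = key; }
    String getKey() { return key; }
}

class MyClassStore {
    private final Map<String, MyClass> map = new HashMap<>();

    // put() both inserts a new entry and replaces an existing one,
    // so the whole update is a single expected-O(1) operation.
    void update(MyClass myClass) {
        map.put(myClass.getKey(), myClass);
    }

    MyClass find(String key) {
        return map.get(key);
    }
}
```

If you still need the sorted top-K query from the second method, you can run it over map.values(), which iterates all entries just like the list did.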

Is there something else that I can do to increase the performance in this scenario?
The problem is that your algorithm has to apply myClass.foo to every element of the list until you find the first match. If you do this serially, then the worst-case complexity is O(N) where N is the list size. (And the list size is large.)
Now, you could do the searching in parallel. However, if there can be multiple matches, then matching the first one in the list is going to be tricky. And you still end up with O(N/C) where C is the number of cores available.
The only way to get better than O(N) is to use a different data structure. But without knowing what the MyClass::foo method does, it is hard to say what that data structure should be.
Your second problem is the "top K of N" problem. This can be solved in O(N log K), and possibly better; see "Optimal algorithm for returning top k values from an array of length N".
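A sketch of the O(N log K) approach, assuming the sort key is a plain double (in the question it would come from getBar()): a min-heap of size K keeps the K largest values seen so far, so each element costs at most O(log K) to process.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopK {
    static List<Double> topK(List<Double> values, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (double v : values) {
            if (heap.size() < k) {
                heap.add(v);
            } else if (v > heap.peek()) {
                heap.poll();   // evict the smallest of the current top K
                heap.add(v);
            }
        }
        // Drain the heap and sort descending to present the top K in order.
        List<Double> result = new ArrayList<>(heap);
        result.sort(Comparator.reverseOrder());
        return result;
    }
}
```

Unlike the sort-then-limit version, this never reorders the original list, which also avoids mutating shared state inside the query method.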

Related

Sorting a list in ascending order using a for loop repeatedly for binarySearch (Java)

I have a list of objects that I am attempting to keep sorted in ascending order by inserting items in a for loop, like so:
private static void addObjectToIndex(classObject object) {
    for (int i = 0; i < collection.size(); i++) {
        if (collection.get(i).ID >= object.ID) {
            collection.insertElementAt(object, i);
            return;
        }
    }
    if (classObject.size() == 0)
        classObject.add(object);
}
This is faster than sorting the list every time I call that function; sorting would be simpler but slower, since insertion gives O(N) time per call as opposed to Collections.sort's O(N log N) every time (unless I'm wrong).
The problem is that when I run Collections.binarySearch to grab an item out of the Vector collection (the collection requires method calls on an atomic basis), it still ends up returning negative numbers, as shown in the code below.
Comparator<classObject> c = new Comparator<classObject>() {
    public int compare(classObject u1, classObject u2) {
        int z1 = (int) (u1).ID;
        int z2 = (int) (u2).ID;
        if (z1 > z2)
            return 1;
        return z2 <= z1 ? 0 : -1;
    }
};
int result = Collections.binarySearch(collection, new classObject(pID), c);
if (result < 0)
    return null;
if (collection.get(result).ID != pID)
    return null;
else
    return collection.get(result);
Something like
result = -1043246
Shows up in the debugger, resulting in the second code snippet returning null.
Is there something I'm missing here? It's probably brain-dead simple. I've tried adjusting the for loop that places things in order (<=, >=, < and >) and it doesn't work. Inserting the object at index i+1 doesn't work either. It still returns null, which makes the entire program blow up.
Any insight would be appreciated.
Boy, did you get here from the 80s, because it sure sounds like you've missed quite a few API updates!
This is faster than sorting the list every time I call that function; sorting would be simpler but slower, since insertion gives O(N) time per call as opposed to Collections.sort's O(N log N) every time (unless I'm wrong).
You're now spending an O(n) investment on every insert, so that's O(n^2) total, vs. the model of 'add everything you want to add without sorting it' and then 'at the very end, sort the entire list', which is O(n log n).
Vector is threadsafe which is why I'm using it as opposed to something else, and that can't change
Nope. Threadsafety is not that easy; what you've written isn't thread safe.
Vector is obsolete and should never be used. What Vector does (vs. ArrayList) is that each individual operation on a vector is thread safe (i.e. atomic). Note that you can get this behaviour from any list if you really need it with: List<T> myList = Collections.synchronizedList(someList);, but it is highly unlikely you want this.
Take your current impl of addObjectToIndex: it is not atomic. It makes many different method calls on your vector, and these have zero guarantee of being consistent with each other. If two threads both call addObjectToIndex and your computer has more than one core, then you will eventually end up with a list that looks like [1, 2, 5, 4, 10] - i.e., not sorted.
Take your addObjectToIndex method: it just doesn't work properly unless its view of your collection is consistent for the entirety of the run. In other words, that block needs to be 'atomic' - it either does all of it or none of it, and it needs a consistent view throughout. Stick a synchronized block around the whole thing. Contrast that with Vector, which makes each individual call atomic and nothing else, which doesn't work here. More generally, 'just synchronize' is a rather inefficient way to do multicore; the various collections in java.util.concurrent are usually vastly more efficient and much easier to use. You should read through that API and see if there's anything that'll work for you.
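For instance, a sketch using ConcurrentSkipListSet from java.util.concurrent, which keeps its elements sorted and makes each operation thread safe, so the manual insertion-position search disappears entirely (the plain int IDs here are illustrative):

```java
import java.util.concurrent.ConcurrentSkipListSet;

class ConcurrentSortedIds {
    // Sorted at all times, and safe to call from multiple threads
    // without external synchronization.
    private final ConcurrentSkipListSet<Integer> ids = new ConcurrentSkipListSet<>();

    void add(int id) {
        ids.add(id);              // O(log n), atomic: no lost or misplaced inserts
    }

    boolean contains(int id) {
        return ids.contains(id);  // O(log n)
    }
}
```

Like TreeSet (discussed below), this is a set, so duplicate IDs are ignored rather than inserted twice.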
if(z1 > z2) return 1;
I'm pretty sure your insert code sorts ascending, but your comparator sorts descending. Which would break the binary search code (the binary search code is specced to return arbitrary garbage if the list isn't sorted, and as far as the comparator you use here is concerned, it isn't). You should use the same comparator anytime it is relevant, and not re-implement the logic multiple times (or if you do, at least test it!).
There is also no need to write all this code.
Comparator<classObject> c = Comparator.comparingInt(co -> co.ID);
is all you need.
However
It looks like what you really want is a collection that keeps itself continually sorted. Java has that; it's called a TreeSet. You pass it a Comparator (or you don't, and TreeSet expects that the elements you put in have a natural order, either is fine), and it will keep the collection sorted, at very cheap cost (far better than your O(n^2)!), continually. It IS a set, meaning if the comparator says that 2 items are equal, then adding both to the set results in the second add call being ignored (sets cannot contain the same element more than once, and for a TreeSet, 'the same element' is defined solely by 'comparing them returns 0' - TreeSet ignores hashCode and equals entirely).
This sounds like what you really want. If you need 2 different objects with the same ID to be added anyway, then add some more fields to your comparator (instead of returning 0 upon same ID, move on to checking the insertion timestamp or whatnot). But, with a name like 'ID', sounds like duplicates aren't going to happen.
The reason you want to use this off-the-shelf stuff is that otherwise you need to write it yourself, and to do that well you need to be a good programmer - which you clearly aren't yet (we all started as newbies and learned to become good later; it's the natural order of things). For example, if you try to add an element to a non-empty collection where the element has a larger ID than anything already in the collection, your method just won't add anything. That's because you wrote if (classObject.size() == 0) classObject.add(object); but you wanted classObject.add(object); without the if. Also, in Java we write ClassObject, not classObject, and more generally, classObject is a completely meaningless name. Find a better name; it helps make code less confusing, and this question does suggest you could use some of that.
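A minimal sketch of the TreeSet approach, with a hypothetical Item class standing in for the question's classObject:

```java
import java.util.Comparator;
import java.util.TreeSet;

// Illustrative element type; only the int ID matters for ordering.
class Item {
    final int id;
    Item(int id) { this.id = id; }
}

class SortedItems {
    // One comparator, used both for keeping order and for lookups,
    // so the two can never disagree.
    private final TreeSet<Item> items =
            new TreeSet<>(Comparator.comparingInt((Item it) -> it.id));

    void add(Item item) {
        items.add(item);   // O(log n); the set stays sorted at all times
    }

    Item findById(int id) {
        // ceiling() returns the least element >= the probe, in O(log n).
        Item candidate = items.ceiling(new Item(id));
        return (candidate != null && candidate.id == id) ? candidate : null;
    }
}
```

Compared to the question's code, there is no insertion-position scan and no separate binarySearch call that could use a mismatched comparator.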

What is the best way to iterate over a list

I have worked pretty much on collection but I have few doubts.
I am aware that we can iterate list with iterator.
Another way is to go through it as below:
for (int i = 0; i < list.size(); i++) {
    list.get(i);
}
Here I think there is a problem: each iteration will call list.size(), which I worry will hurt performance.
I thought of another solution as well:
int s = list.size();
for (int i = 0; i < s; i++) {
    list.get(i);
}
I think this solves that problem. I am not much exposed to threads, so I am not sure whether this is the right approach or not.
Another way I thought of is:
for (Object obj : list) {
}
With this new for loop, I think the compiler again checks the size of the list.
Please suggest the best of these, or an alternative, performance-efficient approach. Thank you for your help.
Calling size() at each iteration is not really a problem. This operation is O(1) for all the collections I know of: size() simply returns the value of a field of the list holding its size.
The main problem of the first way is the repeated call to get(i). This operation is O(1) for an ArrayList, but O(n) for a LinkedList, making the whole iteration O(n^2) instead of O(n): get(i) forces the list to start from its first element (or its last one) and walk node by node until the ith element.
Using an iterator, or a foreach loop (which internally uses an iterator), guarantees that the most appropriate way of iterating is used, because the iterator knows how the list is implemented and how best to go from one element to the next.
BTW, this is also the only way to iterate through non-indexed collections, like Sets. So you'd better get used to using that kind of loop.
For your example, this is the best way:
for (Object obj : list) {
}
It is the same as the pre-Java 5 idiom:
for (Iterator it = hs.iterator(); it.hasNext(); ) { }
It uses the collection's iterator, so you don't actually need the size of the collection. The .size() method does not build any tree, but .get() may have to walk to the given element; how .get() and .size() behave depends on the List implementation. In ArrayList, .get() is actually O(1), not O(n).
UPDATE
In Java 8 you can use:
myList.forEach(elem -> {
    // do something
});
The best way to iterate a list in terms of performance is to use an iterator (your second approach, using foreach).
If you use list.get(i), performance depends on the implementation of the list. For ArrayList, list.get(i) is O(1), whereas it's O(n) for LinkedList.
Also, list.size() is O(1) and should not have any impact over the performance.
for (Object obj: list){
}
The code above is, for me, the best way: it is clean and can be read easily.
The forEach in Java 8 is nice too.
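A small sketch of the difference the answers above describe: both methods below compute the same sum, but on a LinkedList the indexed version does O(n^2) work overall (each get(i) walks from the head), while the for-each version uses the list's iterator and is O(n).

```java
import java.util.List;

class IterationDemo {
    // Indexed traversal: O(1) per get(i) on an ArrayList,
    // but O(i) per get(i) on a LinkedList, so O(n^2) overall there.
    static long sumIndexed(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        return sum;
    }

    // For-each traversal: uses the list's own iterator, which
    // remembers its position, so this is O(n) for any List.
    static long sumForEach(List<Integer> list) {
        long sum = 0;
        for (int value : list) {
            sum += value;
        }
        return sum;
    }
}
```

The results are identical; only the cost differs, and only on lists without fast random access.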

ArrayList vs array for finding an element?

Which is the most efficient way of finding an element in terms of performance? Say I have hundreds of strings, and I need to find whether a specified string is among them. ArrayList has a contains() method, but with a plain array I would need to iterate myself. Can anyone explain which is the better way of doing this in terms of performance?
Say I have hundreds of strings, and I need to find whether a specified string is among them.
That sounds like you want a HashSet<String> - not a list or an array. At least, that's the case if the hundreds of strings are the same every time you want to search. If you're searching within a different set of strings every time, you're not going to do better than O(N) if you receive the set in an arbitrary order.
In general, checking for containment in a list/array is an O(N) operation, whereas in a hash-based data structure it's O(1). Of course there's also the cost of performing the hashing and equality checking, but that's a different matter.
Another option would be a sorted list, which would be O(log N).
If you care about the ordering, you might want to consider a LinkedHashSet<String>, which maintains insertion order but still has O(1) access. (It's basically a linked list combined with a hash set.)
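A minimal sketch of the difference, with illustrative strings: the list scan is O(N) per lookup, while the HashSet costs O(N) once to build and expected O(1) per lookup afterwards.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ContainsDemo {
    static final List<String> STRINGS = Arrays.asList("alpha", "beta", "gamma");

    // O(N): contains() walks the backing array element by element.
    static boolean inList(String s) {
        return STRINGS.contains(s);
    }

    // Build the hash index once (O(N))...
    static final Set<String> INDEX = new HashSet<>(STRINGS);

    // ...then each lookup hashes the probe: expected O(1).
    static boolean inSet(String s) {
        return INDEX.contains(s);
    }
}
```

For a few hundred strings either is fast in absolute terms; the set only pays off when the same collection is searched repeatedly.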
An ArrayList uses an array as its backing data, so the performance will be roughly the same for both.
Look at the implementation of ArrayList#contains, which calls indexOf():
public int indexOf(Object o) {
    if (o == null) {
        for (int i = 0; i < size; i++)
            if (elementData[i] == null)
                return i;
    } else {
        for (int i = 0; i < size; i++)
            if (o.equals(elementData[i]))
                return i;
    }
    return -1;
}
You would do the exact same thing if you implemented the contains() on your own for an array.
With only hundreds of strings, you don't have to worry much about performance; it will not matter much either way. The contains() method on ArrayList is good and easy to use.

Stack and Hash joint

I'm trying to write a data structure which is a combination of Stack and HashSet with fast push/pop/membership (I'm looking for constant-time operations). Think of Python's OrderedDict.
I tried a few things and came up with the following code: HashInt and SetInt. I need to add some documentation to the source, but basically I use a hash table with linear probing to store indices into a vector of the keys. Since linear probing always puts the last element at the end of a contiguous range of already-filled cells, pop() can be implemented very easily without a sophisticated remove operation.
I have the following problems:
the data structure consumes a lot of memory (one obvious improvement: stackKeys is larger than needed).
some operations are slower than if I had used fastutil (e.g. pop(), or even push() in some scenarios). I tried rewriting the classes using fastutil and trove4j, but the overall speed of my application halved.
What performance improvements would you suggest for my code?
What open-source library/code do you know that I can try?
You've already got a pretty good implementation. The only improvement obvious to me is that you do more work than you need to by searching when popping. You should store in the stack not the key itself but the index into the key array. This gives you trivially fast pops at the expense of only one more pointer indirection when you want to peek at the last item.
Just size your stack to LOAD_FACTOR*(heap array size), in addition to that, and you should have about as fast an implementation as you could expect with as little memory as you can manage given your speed requirements.
I think that what you want is (almost) already available in the libraries: LinkedHashSet is a hash-set with an underlying doubly linked list (which makes it iterable). LinkedHashMap even has a removeEldestEntry which sounds very similar to a pop-method.
How is the performance of a naive solution like:
class HashStack<T> {
    private HashMap<T, Integer> counts = new HashMap<T, Integer>();
    private Stack<T> stack = new Stack<T>();

    public void push(T t) {
        stack.push(t);
        counts.put(t, 1 + getCount(t));
    }

    public T pop() {
        T t = stack.pop();
        counts.put(t, counts.get(t) - 1);
        return t;
    }

    private int getCount(T t) {
        return counts.containsKey(t) ? counts.get(t) : 0;
    }

    public boolean contains(T t) {
        return getCount(t) > 0;
    }

    public String toString() {
        return stack.toString();
    }
}
I would suggest using TreeSet<T>, as it provides guaranteed O(log n) cost for add, remove, and contains.

Performance of traditional for loop vs Iterator/foreach in Java

Are there any performance test results available comparing a traditional for loop with an Iterator while traversing an ArrayList, a HashMap, and other collections?
Or simply why should I use Iterator over for loop or vice versa?
Assuming this is what you meant:
// traditional for loop
for (int i = 0; i < collection.size(); i++) {
    T obj = collection.get(i);
    // snip
}

// using iterator
Iterator<T> iter = collection.iterator();
while (iter.hasNext()) {
    T obj = iter.next();
    // snip
}

// using iterator internally (confirm it yourself using javap -c)
for (T obj : collection) {
    // snip
}
Iterator is faster for collections with no random access (e.g. TreeSet, HashMap, LinkedList). For arrays and ArrayLists, performance differences should be negligible.
Edit: I believe that micro-benchmarking is the root of pretty much all evil, just like premature optimization. But then again, I think it's good to have a feeling for the implications of such quite trivial things. Hence I've run a small test:
iterate over a LinkedList and an ArrayList respectively
with 100,000 "random" strings
summing up their length (just something to avoid that compiler optimizes away the whole loop)
using all 3 loop styles (iterator, for each, for with counter)
Results are similar for all except "for with counter" on the LinkedList. The other five took less than 20 milliseconds each to iterate over the whole list; using list.get(i) on the LinkedList 100,000 times took more than 2 minutes (!) to complete (60,000 times slower). Wow! :) Hence it's best to use an iterator (explicitly, or implicitly using for-each), especially if you don't know what type and size of list you're dealing with.
The first reason to use an iterator is obvious correctness. If you use a manual index, there may be very innocuous off-by-one errors that you can only see if you look very closely: did you start at 1 or at 0? Did you finish at length - 1? Did you use < or <=? If you use an iterator, it is much easier to see that it is really iterating the whole array. "Say what you do, do what you say."
The second reason is uniform access to different data structures. An array can be accessed efficiently through an index, but a linked list is best traversed by remembering the last element accessed (otherwise you get a "Shlemiel the painter"). A hashmap is even more complicated. By providing a uniform interface from these and other data structures (e.g., you can also do tree traversals), you get obvious correctness again. The traversing logic has to be implemented only once, and the code using it can concisely "say what it does, and do what it says."
Performance is similar in most cases.
However, whenever a code receives a List, and loops on it, there is well-known case:
the Iterator is way better for all List implementations that do not implement RandomAccess (example: LinkedList).
The reason is that for these lists, accessing an element by index is not a constant time operation.
So you can also consider the Iterator as more robust (to implementation details).
As always, performance concerns should not hide readability issues.
The Java 5 foreach loop is a big win on that aspect :-)
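One way generic code can stay robust to these implementation details is to check the RandomAccess marker interface, which ArrayList implements and LinkedList does not. A sketch:

```java
import java.util.List;
import java.util.RandomAccess;

class Traversal {
    static long sum(List<Integer> list) {
        long total = 0;
        if (list instanceof RandomAccess) {
            // Indexed access is O(1) per element for these lists.
            for (int i = 0; i < list.size(); i++) {
                total += list.get(i);
            }
        } else {
            // Fall back to the iterator for sequential-access lists,
            // keeping the traversal O(n) instead of O(n^2).
            for (int value : list) {
                total += value;
            }
        }
        return total;
    }
}
```

In practice the for-each branch alone is fine for both cases; the check only matters if you have a specific reason to prefer indexed access.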
Yes, it does make a difference for collections which are not random-access based, like LinkedList. A linked list is internally implemented by nodes pointing to the next one (starting at a head node).
The get(i) method in a linked list starts from the head node and navigates through the links all the way to the ith node. When you iterate over the linked list using a traditional for loop, you start again from the head node each time, so the overall traversal becomes quadratic time.
for (int i = 0; i < list.size(); i++) {
    list.get(i); // this starts every time from the head node instead of the previous node
}
The for-each loop, on the other hand, iterates using the iterator obtained from the linked list and calls its next() method. The iterator maintains the state of the last access and thus does not start all the way from the head every time.
for (Object item : list) {
    // item is obtained from the iterator's next() method.
}
One of the best reasons to use an iterator over the i++ syntax is that not all data structures will support random access let alone have it perform well. You should also be programming to the list or collection interface so that if you later decided that another data structure would be more efficient you'd be able to swap it out without massive surgery. In that case (the case of coding to an interface) you won't necessarily know the implementation details and it's probably wiser to defer that to the data structure itself.
One of the reasons I've learned to stick with the for each is that it simplifies nested loops, especially over 2+ dimensional loops. All the i's, j's, and k's that you may end up manipulating can get confusing very quickly.
Use JAD or JD-GUI against your generated code, and you will see that there is no real difference. The advantage of the new iterator form is that it looks cleaner in your codebase.
Edit: I see from the other answers that you actually meant the difference between using get(i) versus an iterator. I took the original question to mean the difference between the old and new ways of using the iterator.
Using get(i) and maintaining your own counter, especially for the List classes is not a good idea, for the reasons mentioned in the accepted answer.
Note that
for (T obj : collection) {
does not calculate .size() each time through the loop, and is therefore faster than
for (int i = 0; i < collection.size(); i++) {
+1 to what sfussenegger said. FYI, whether you use an explicit iterator or an implicit one (i.e. for each) won't make a performance difference because they compile to the same byte code.
