Does anyone know of something similar to ArrayList that is better geared to handling really large amounts of data as quickly as possible?
I've got a program with a really large ArrayList that's getting choked up when it tries to explore or modify the ArrayList.
Presumably when you do:
//i is an int;
arrayList.remove(i);
The code behind the scenes runs something like:
public T remove(int i) {
    // Let's say ArrayList stores its data in a T[] array called "contents".
    T output = contents[i];
    T[] overwrite = new T[contents.length - 1];
    // (Yes, I know generic arrays aren't created this simply. Bear with me here...)
    // Copy everything before the removed index...
    for (int x = 0; x < i; x++) {
        overwrite[x] = contents[x];
    }
    // ...then everything after it, shifted left by one slot.
    for (int x = i + 1; x < contents.length; x++) {
        overwrite[x - 1] = contents[x];
    }
    contents = overwrite;
    return output;
}
When the size of the ArrayList is a couple million units or so, all those cycles rearranging the positions of items in the array would take a lot of time.
I've tried to alleviate this problem by creating my own custom ArrayList subclass which segments its data storage into smaller ArrayLists. Any process that requires the ArrayList to scan its data for a specific item spawns a new search thread for each of the smaller ArrayLists within (to take advantage of my multiple CPU cores).
But this system doesn't work: when the thread calling the search holds a lock on an item in any of the ArrayLists, it can block those separate search threads from completing their search, which in turn locks up the original thread that called the search, essentially deadlocking the whole program.
I really need some kind of data storage class oriented to containing and manipulating large amounts of objects as quickly as the PC is capable.
Any ideas?
I really need some kind of data storage class oriented to containing and manipulating large amounts of objects as quickly as the PC is capable.
The answer depends a lot on what sort of data you are talking about and the specific operations you need. You use the word "explore" without defining it.
If you are talking about looking up a record, then nothing beats a HashMap – ConcurrentHashMap for threaded operation. If you are talking about keeping items in order, especially when dealing with threads, then I'd recommend a ConcurrentSkipListMap, which has O(log N) lookup, insert, remove, etc.
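For illustration, a minimal sketch of the skip-list option (the key and value types here are placeholders):

import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class SkipListDemo {
    public static void main(String[] args) {
        // Keys stay sorted; get/put/remove are all O(log N) and thread-safe.
        ConcurrentNavigableMap<Long, String> records = new ConcurrentSkipListMap<>();
        records.put(2L, "b");
        records.put(1L, "a");
        records.put(3L, "c");
        records.remove(2L);                                       // O(log N)
        for (Map.Entry<Long, String> e : records.entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());  // ascending key order
        }
    }
}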
You may also want to consider using multiple collections. You need to be careful that the collections don't get out of sync, which can be especially challenging with threads, but that might be faster depending on the various operations you are making.
When the size of the ArrayList is a couple million units or so, all those cycles rearranging the positions of items in the array would take a lot of time.
As mentioned, ConcurrentSkipListMap is O(log N) for rearranging an item, i.e. a remove followed by an add at the new position.
The [ArrayList.remove(i)] code behind the scenes runs something like: ...
Well, not really. You can look at the code in the JDK, right? ArrayList uses System.arraycopy(...) for these sorts of operations. That may still not be efficient enough for your case, and removal is still O(N) in the worst case, but it is a single optimized bulk copy within the existing backing array, not a fresh allocation plus an element-by-element loop as sketched above.
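To illustrate the difference, here is a simplified sketch (not the actual JDK source) of the in-place approach, wrapped in a toy class so it stands alone:

import java.util.Arrays;

class SimpleArrayList<T> {
    private Object[] contents = new Object[10];
    private int size = 0;

    public void add(T element) {
        if (size == contents.length) {
            contents = Arrays.copyOf(contents, size * 2);  // grow only when full
        }
        contents[size++] = element;
    }

    @SuppressWarnings("unchecked")
    public T remove(int i) {
        T removed = (T) contents[i];
        int numMoved = size - i - 1;
        if (numMoved > 0) {
            // One bulk copy shifts the tail left inside the same array;
            // no new array is allocated on removal.
            System.arraycopy(contents, i + 1, contents, i, numMoved);
        }
        contents[--size] = null;   // clear the stale slot for the GC
        return removed;
    }
}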
One example of good usage for a linked list is where the list elements are very large, i.e. large enough that only one or two can fit in CPU cache at the same time. At this point the advantage that contiguous-block containers like vectors or arrays have for iteration is more or less nullified, and a performance advantage may be possible if many insertions and removals are occurring in real time.
ref: Under what circumstances are linked lists useful?
ref: https://coderanch.com/t/508171/java/Collection-datastructure-large-data
Different collection types have different time complexity for various operations. Typical complexities are O(1), O(N), and O(log(N)). To choose a collection, you first need to decide which operation you use most often, and avoid collections that have O(N) complexity for that operation. Here you often use ArrayList.remove(i), which is O(N). Even worse, you use remove(i) and not remove(element). If removing a known element during iteration had been the only operation used often, then LinkedList could help: removal through its iterator is O(1). But LinkedList.remove(i) is also O(N), and so is LinkedList.remove(element), since the element must be found first.
I doubt that a List with remove(i) complexity of O(1) can be implemented. The best possible time is O(log(N)), which is definitely better than O(N). The Java standard library has no such implementation; you can try googling the keywords "binary indexed tree".
But the first thing I would do is review the algorithm and try to get rid of the List.remove(i) operation entirely; one common rework is sketched below.
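For illustration, a minimal sketch of one such rework, assuming the order of elements doesn't matter (swapRemove is my name, not a JDK method): move the last element into the vacated slot, so nothing needs to shift.

import java.util.List;

class ListUtil {
    // O(1) removal when element order does not matter.
    static <T> T swapRemove(List<T> list, int i) {
        int last = list.size() - 1;
        T removed = list.set(i, list.get(last));  // returns the old value at i
        list.remove(last);    // removing the last element of an ArrayList is O(1)
        return removed;
    }
}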
Related
I have a problem that involves many insertions at the beginning of a list, after which search and retrieval operations are used extensively. So which approach is good and efficient?
Approach 1: Use LinkedList as my data structure for the whole program.
Approach 2: Use ArrayList as my data structure for the whole program.
Approach 3: Use LinkedList as my data structure at the beginning for insertion and do
    ArrayList al = new ArrayList(ll);
for retrieval operations.
How much does changing the data structure cost? Is it actually worth doing?
Since they both implement the same interface, you can find this out for yourself by writing your code so that the constructor can be plugged in, then testing your code both ways. Benchmarking can be done with jmh.
You can plug in the constructor by using the Supplier interface.
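For example, a rough sketch of plugging the constructor in via Supplier; System.nanoTime() keeps the sketch short, but a real measurement should go through jmh:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Supplier;

public class ListComparison {
    // The list implementation is injected, so the same code runs both ways.
    static long run(Supplier<List<Integer>> factory, int n) {
        List<Integer> list = factory.get();
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(0, i);            // insertion at the beginning
        }
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += list.get(i);        // retrieval by index (O(n) per call on LinkedList!)
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("checksum " + sum);   // use sum so the loops aren't dead code
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println("ArrayList:  " + run(ArrayList::new, 10_000) + " ns");
        System.out.println("LinkedList: " + run(LinkedList::new, 10_000) + " ns");
    }
}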
Depending on the nature of your problem you may find that using a Deque is appropriate.
Allow me to suggest a 4th approach: using ArrayDeque for the whole program. It has efficient insertion at the front and the back (possibly even faster than LinkedList) and search efficiency like ArrayList. ArrayDeque is an unfairly overlooked class in the Java Collections Framework, possibly because it was added later (Java 6, I think). It does not implement the List interface, so you will have to write your program specifically for it.
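For illustration, a minimal sketch of that 4th approach:

import java.util.ArrayDeque;

public class DequeDemo {
    public static void main(String[] args) {
        ArrayDeque<String> deque = new ArrayDeque<>();
        deque.addFirst("b");                 // amortized O(1) at the front
        deque.addLast("c");                  // amortized O(1) at the back
        deque.addFirst("a");
        boolean found = deque.contains("c"); // linear scan, like ArrayList
        System.out.println(deque + ", contains c: " + found);  // [a, b, c], contains c: true
    }
}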
Other than that, the only two valid answers to your question are
Do not worry about efficiency until you absolutely have to.
If and when you have to worry about efficiency, you will have to make your own measurements of what performs satisfactorily on your data in your environment. No one here can tell you.
It depends on the frequencies of the insert and retrieve operations. Here are the complexities:
ArrayList -> Insertion in the beginning : O(n)
ArrayList -> Retrieval based on index : O(1)
LinkedList -> Insertion in the beginning : O(1)
LinkedList -> Retrieval based on index : O(i), where i is the index, i.e. the number of elements that must be scanned.
So, if you have more retrievals than insertions, go for ArrayList, if not, go for LinkedList.
I've read quite a few questions here that discuss the cost of using ArrayLists vs LinkedLists in Java. One of the most useful I've seen thus far is here: When to use LinkedList over ArrayList?.
I want to be sure that I'm correctly understanding.
In my current use case, I have multiple situations where I have objects stored in a List structure. The number of objects in the list changes for each run, and random access to objects in the list is never required. Based on this information, I have elected to use LinkedLists with ListIterators to traverse the entire content of the list.
For example, my code may look something like this:
for (Object thisObject : theLinkedList) {
    // do something
}
If this is a bad choice, please help me understand why.
My current understanding is that traversing the entire list of objects in a LinkedList would incur O(n) cost using the iterative solution. Since there is no random access to the list (i.e. no need to get item #3, for example), my current understanding is that this would be basically the same as looping over the content of an ArrayList and requesting each element by index.
Assuming I knew the number of objects to be stored in the list beforehand, my current line of thinking is that it would be better to initialize an ArrayList to the appropriate size and switch to that structure entirely without using a ListIterator. Is this logic sound?
As always, I greatly appreciate everyone's input!
Iteration over a LinkedList and ArrayList should take roughly the same amount of time to complete, since in each case the cost of stepping from one element to the next is a constant. The ArrayList might be a bit better due to locality of reference, though, so it might be worth profiling to see what happens.
If you are guaranteed that there will always be a fixed number of elements, and there won't be insertions and deletions in random locations, then a raw array might be a good choice, since it's extremely fast and well-optimized for this case.
That said, your analysis of why to use LinkedList seems sound. Again, it doesn't hurt to profile the program and see if ArrayList would actually be faster for your use case.
Hope this helps!
I need a sorted list in a scenario dominated by iteration (compared to insert/remove; no random get at all). For this reason I thought about using a skip list rather than a tree (the iterator should be faster).
The problem is that Java 6 only has a concurrent implementation of a skip list, so I was wondering whether it makes sense to use it in a non-concurrent scenario or whether the overhead makes it the wrong decision.
From what I know, ConcurrentSkipList* are basically lock-free implementations based on CAS, so they should not carry (much) overhead, but I wanted to hear somebody else's opinion.
EDIT:
Some micro-benchmarking (running iteration multiple times on different-sized TreeSet, LinkedList, ConcurrentSkipList, and ArrayList instances) shows that there's quite an overhead. ConcurrentSkipList does store the elements in a linked list inside, so the only reason it would be slower on iteration than a LinkedList is the aforementioned overhead.
If thread safety's not required, I'd say skip the java.util.concurrent package altogether.
What's interesting is that sometimes ConcurrentSkipList is slower than TreeSet on the same input and I haven't sorted out yet why.
I mean, have you seen the source code for ConcurrentSkipListMap? :-) I always have to smile when I look at it. It's 3000 lines of some of the most insane, scary, and at the same time beautiful code I've ever seen in Java. (Kudos to Doug Lea and co. for getting all the concurrency utils integrated so nicely with the collections framework!) Having said that, on modern CPUs the code and algorithmic complexity won't even matter so much. What usually makes more difference is having the data to iterate co-located in memory, so that the CPU cache can do its job better.
So in the end I'll wrap ArrayList with a new addSorted() method that does a sorted insert into the ArrayList.
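For illustration, a minimal sketch of what that addSorted() wrapper could look like (the class and method names are mine, not an existing API); Collections.binarySearch returns (-(insertionPoint) - 1) when the element is absent, which is exactly the slot needed to keep the list sorted:

import java.util.ArrayList;
import java.util.Collections;

public class SortedArrayList<E extends Comparable<? super E>> extends ArrayList<E> {
    public void addSorted(E element) {
        int index = Collections.binarySearch(this, element);
        if (index < 0) {
            index = -index - 1;        // recover the insertion point
        }
        add(index, element);           // O(n) shift, but iteration stays cache-friendly
    }
}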
Sounds good. If you really need to squeeze every drop of performance out of iteration you could also try iterating a raw array directly. Repopulate it upon each change, e.g. by calling TreeSet.toArray() or generating it then sorting it in-place using Arrays.sort(T[], Comparator<? super T>). But the gain could be tiny (or even nothing if the JIT does its job well) so it might not be worth the inconvenience.
As measured using Open JDK 6 on typical production hardware my company uses, you can expect all add and query operations on a skip-list map to take roughly double the time as the same operation on a tree map.
examples:
63 usec vs 31 usec to create and add 200 entries.
145 ns vs 77 ns for get() on that 200-element map.
And the ratio doesn't change all that much for smaller and larger sizes.
(The code for this benchmark will eventually be shared so you can review it and run it yourself; sorry we're not ready to do that yet.)
Well, you can use a lot of other structures to do what a skip list does. It lives in the concurrent package because concurrent data structures are a lot more complicated, and because a concurrent skip list costs less than mimicking a skip list with other concurrent data structures.
In a single-threaded world it's different: you can use a sorted set, a binary tree, or a custom data structure that would perform better than a concurrent skip list.
The complexity of iterating a tree or a skip list will always be O(n). But if you instead use a linked list or an array list, you have a problem with insertion: inserting an item in the right position (to keep the list sorted) is O(n), instead of O(log n) for a binary tree or a skip list.
You can iterate over TreeMap.keySet() to obtain all inserted keys in order, and it will not be that slow.
There is also the TreeSet class, which is probably what you need, but since it is just a wrapper around TreeMap, direct use of TreeMap would be faster.
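For example, a tiny demonstration of the ordered iteration:

import java.util.TreeSet;

public class TreeSetIterationDemo {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();
        set.add(30);
        set.add(10);
        set.add(20);
        for (int value : set) {
            System.out.println(value);   // prints 10, 20, 30 in sorted order
        }
    }
}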
Without concurrency, it is usually more efficient to use a balanced binary search tree. In Java, this would be a TreeMap.
Skip lists are generally reserved for concurrent programming because of their ease of implementation and their speed in multithreaded applications.
You seem to have a good grasp of the trade-off here, so I doubt anyone can give you a definitive, principled answer. Fortunately, this is pretty straightforward to test.
I started by creating a simple Iterator<String> that loops indefinitely over a finite list of randomly generated strings. (That is: on initialization, it generates an array _strings of a random strings, each of length b, drawn from a pool of c distinct characters. The first call to next() returns _strings[0], the next call returns _strings[1] … the (a+1)th call returns _strings[0] again.) The strings returned by this iterator were what I used in all calls to SortedSet<String>.add(...) and SortedSet<String>.remove(...).
I then wrote a test method that accepts an empty SortedSet<String> and loops d times. On each iteration, it adds e elements, then removes f elements, then iterates over the entire set. (As a sanity check, it keeps track of the set's size by using the return values of add() and remove(), and when it iterates over the entire set, it makes sure it finds the expected number of elements. Mostly I did that just so there would be something in the body of the loop.)
I don't think I need to explain what my main(...) method does. :-)
I tried various values for the various parameters, and I found that sometimes ConcurrentSkipListSet<String> performed better, and sometimes TreeSet<String> did, but the difference was never much more than twofold. In general, ConcurrentSkipListSet<String> performed better when:
a, b, and/or c were relatively large. (I mean, within the ranges I tested. My a's ranged from 1000 to 10000, my b's from 3 to 10, my c's from 10 to 80. Overall, the resulting set-sizes ranged from around 450 to exactly 10000, with modes of 666 and 6666 because I usually used e=2f.) This suggests that ConcurrentSkipListSet<String> copes somewhat better than TreeSet<String> with larger sets, and/or with more-expensive string-comparisons. Trying specific values designed to tease apart these two factors, I got the impression that ConcurrentSkipListSet<String> coped noticeably better than TreeSet<String> with larger sets, and slightly less well with more-expensive string-comparisons. (That's basically what you'd expect; TreeSet<String>'s binary-tree approach aims to do the absolute minimum possible number of comparisons.)
e and f were small; that is, when I called add(...)s and remove(...)s only a small number of times per iteration. (This is exactly what you predicted.) The exact turn-over point depended on a, b, and c, but to a first approximation, ConcurrentSkipListSet<String> performed better when e + f was less than around 10, and TreeSet<String> performed better when e + f was more than around 20.
Of course, this was on a machine that may look nothing like yours, using a JDK that may look nothing like yours, and using very artificial data that might look nothing like yours. I'd recommend that you run your own tests. Since Tree* and ConcurrentSkipList* both implement Sorted*, you should have no difficulty trying your code both ways and seeing what you find.
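For reference, a heavily compressed sketch of that kind of harness (the parameter names d, e, and f echo the description above; a real run should use the random-string iterator described, plus proper warm-up or a tool like jmh):

import java.util.Random;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;

public class SortedSetBench {
    static volatile long sink;   // consume results so the loops aren't dead code

    // d iterations; each adds e strings, removes f strings, then walks the set.
    static long run(SortedSet<String> set, int d, int e, int f, Random rnd) {
        long start = System.nanoTime();
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < e; j++) set.add("s" + rnd.nextInt(10_000));
            for (int j = 0; j < f; j++) set.remove("s" + rnd.nextInt(10_000));
            long n = 0;
            for (String s : set) n += s.length();   // full traversal
            sink += n;
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        System.out.println("TreeSet:  " + run(new TreeSet<>(), 1_000, 4, 2, new Random(1)));
        System.out.println("SkipList: " + run(new ConcurrentSkipListSet<>(), 1_000, 4, 2, new Random(1)));
    }
}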
From what I know, ConcurrentSkipList* are basically lock-free implementations based on CAS, so they should not carry (much) overhead, […]
My understanding is that this will depend on the machine. On some systems a lock-free implementation may not be possible, in which case these classes will have to use locks. (But since you're not actually multi-threading, even locks may not be all that expensive. Synchronization has overhead, of course, but its main cost is lock contention and forced single-threading. That isn't an issue for you. Again, I think you'll just have to test and see how the two versions perform.)
As noted, SkipList has a lot of overhead compared to TreeMap, and the TreeMap iterator isn't well suited to your use case because it just repeatedly calls the successor() method, which turns out to be very slow.
So one alternative that will be significantly faster than the previous two is to write your own TreeMap iterator. Actually, I would dump TreeMap altogether, since 3000 lines of code is a bit bulkier than you probably need, and just write a clean AVL tree implementation with the methods you need. The basic AVL logic is just a few hundred lines of code in any language; then add the iterator that works best in your case.
What is the fastest collection in Java?
I only need the add and remove operations; order is not important, and equal elements are not an issue. Nothing more than add and remove matters.
Not having a size limit is important too.
The collection will hold Objects.
Currently I'm using ArrayDeque, because I gather it is the fastest Queue implementation.
ArrayDeque is best. See this benchmark, which comes from this blog post about the results of benchmarking this. ArrayDeque doesn't have the overhead of node allocations that LinkedList does, nor the overhead of shifting the array contents left on remove that ArrayList has. In the benchmark, it performs about 3x as well as LinkedList for large queues and even slightly better than ArrayList for empty queues. For best performance, you'll probably want to give it an initial capacity large enough to hold the number of elements it's likely to hold at a time, to avoid many resizes.
Between ArrayList and LinkedList, it seems that it depends on the average number of total elements the queue will contain at any given time and that LinkedList beats ArrayList starting at about 10 elements.
You can use a java.util.LinkedList - it's doubly-linked, so adding to one end and taking from the other are O(1).
Whatever implementation you choose, refer to it by the Queue interface, so that you can easily change it if it turns out not to fit your case (if, of course, a queue is what you need in the first place)
Update: Colin's answer shows a benchmark that concludes that ArrayDeque is better. Both have O(1) operations, but LinkedList creates new objects (nodes), which slightly affects performance. Since both are O(1), I don't think it would be too wrong to choose LinkedList, though.
ConcurrentLinkedDeque is the best choice for a multi-threaded Queue.
I have a program where I need to make 100,000 to 1,000,000 random-access reads to a List-like object in as little time as possible (as in milliseconds) for a cellular automata-like program. I think the update algorithm I'm using is already optimized (keeps track of active cells efficiently, etc). The Lists do need to change size, but that performance is not as important. So I am wondering if the performance from using Arrays instead of ArrayLists is enough to make a difference when dealing with that many reads in such short spans of time. Currently, I'm using ArrayLists.
Edit:
I forgot to mention: I'm just storing integers, so another factor is using the Integer wrapper class (in the case of ArrayLists) versus ints (in the case of arrays). Does anyone know if using ArrayList will actually require 3 pointer lookups (one for the ArrayList, one for the underlying array, and one for the Integer-to-int), whereas the array would only require 1 (array address plus offset to the specific int)? Would HotSpot optimize the extra lookups away? How significant are those extra lookups?
Edit2:
Also, I forgot to mention I need to do random access writes as well (writes, not insertions).
Now that you've mentioned that your arrays are actually arrays of primitive types, consider using the collection-of-primitive-type classes in the Trove library.
@viking reports a significant (ten-fold!) speedup using Trove in his application - see comments. The flip side is that Trove collection types are not type-compatible with Java's standard collection APIs. So Trove (or similar libraries) won't be the answer in all cases.
Try both, but measure.
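For illustration, a minimal sketch using Trove's primitive-backed list (the import path is for Trove 3.x; older versions used gnu.trove.TIntArrayList):

import gnu.trove.list.array.TIntArrayList;

public class TroveDemo {
    public static void main(String[] args) {
        // Backed by a plain int[]: no Integer boxing, no per-element objects.
        TIntArrayList values = new TIntArrayList(1_000_000);   // pre-sized
        for (int i = 0; i < 1_000_000; i++) {
            values.add(i);
        }
        long sum = 0;
        for (int i = 0; i < values.size(); i++) {
            sum += values.get(i);      // primitive in, primitive out
        }
        System.out.println(sum);
    }
}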
Most likely you could hack something together to make the inner loop use arrays without changing all that much code. My suspicion is that HotSpot will already inline the method calls and you will see no performance gain.
Also, try Java 6 update 14 and use -XX:+DoEscapeAnalysis
ArrayLists are slower than arrays, but most people consider the difference to be minor. In your case it could matter, though, since you're dealing with hundreds of thousands of reads.
By the way, duplicate: Array or List in Java. Which is faster?
I would go with Kevin's advice.
Stay with the lists first and measure your performance. If your program is too slow, compare it to a version with an array. If that gives you a measurable performance boost, go with the arrays; if not, stay with the lists, because they will make your life much, much easier.
There will be an overhead from using an ArrayList instead of an array, but it is very likely to be small. In fact, the useful bit of data in the ArrayList can be stored in registers, although you will probably use more (List size for instance).
You mention in your edit that you are using wrapper objects. These do make a huge difference. If you are typically using the same value repeatedly, then a sensible cache policy may be useful (Integer.valueOf returns cached instances for -128 to 127). For primitives, primitive arrays usually win comfortably.
As a refinement, you might want to make sure that adjacent cells tend to be adjacent in the array (you can do better than rows of columns by using a space-filling curve), as sketched below.
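For illustration, a minimal sketch of a flat primitive grid in row-major order, so horizontally adjacent cells are also adjacent in memory (a space-filling curve layout would be a further refinement):

public class Grid {
    final int width;
    final int height;
    final int[] cells;   // one flat int[] instead of Integer objects

    Grid(int width, int height) {
        this.width = width;
        this.height = height;
        this.cells = new int[width * height];
    }

    int get(int x, int y) {
        return cells[y * width + x];       // row-major indexing
    }

    void set(int x, int y, int value) {
        cells[y * width + x] = value;
    }
}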
One possibility would be to re-implement ArrayList (it's not that hard), but expose the backing array via a lock/release call cycle. This gets you convenience for your writes, but exposes the array for a large series of read/write operations that you know in advance won't impact the array size. If the list is locked, add/delete is not allowed - just get/set.
for example:
SomeObj[] directArray = myArrayList.lockArray();
try {
    // myArrayList.add() or myArrayList.remove() would throw an IllegalStateException here
    for (int i = 0; i < 50000; i++) {
        directArray[i].increment();   // assuming SomeObj has such a method; += doesn't compile on object arrays
    }
} finally {
    myArrayList.unlockArray();
}
This approach still encapsulates the array-growth and related behaviors of ArrayList.
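For illustration, a minimal sketch of that idea (lockArray/unlockArray are the hypothetical names from the example above, not a real JDK API; to stay simple, this version hands out a snapshot copy and writes it back on unlock, at the cost of one copy each way):

import java.util.ArrayList;

public class LockableArrayList<E> extends ArrayList<E> {
    private Object[] locked;                 // non-null while the array is exposed

    public Object[] lockArray() {
        locked = toArray();                  // snapshot of the current contents
        return locked;
    }

    @SuppressWarnings("unchecked")
    public void unlockArray() {
        for (int i = 0; i < size(); i++) {
            set(i, (E) locked[i]);           // write any mutations back
        }
        locked = null;
    }

    @Override
    public boolean add(E e) {
        if (locked != null) throw new IllegalStateException("array is locked");
        return super.add(e);
    }

    @Override
    public E remove(int index) {
        if (locked != null) throw new IllegalStateException("array is locked");
        return super.remove(index);
    }
}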
Java uses double indirection for its objects so they can be moved about in memory and have their references still be valid; this means every reference lookup is equivalent to two pointer lookups. These extra lookups cannot be optimised away completely.
Perhaps even worse, your cache performance will be terrible. Accessing values in cache is going to be many times faster (perhaps 10x) than accessing values in main memory. If you have an int[], you know the values will be consecutive in memory and thus load into cache readily. However, for Integer[], the individual Integer objects can appear anywhere across your memory and are much more likely to be cache misses. Also, an Integer uses 24 bytes, which means the values are much less likely to fit into your caches than 4-byte ints.
If you update an Integer, this often results in a new object being created, which is many orders of magnitude more expensive than updating an int value.
If you're creating the list once and doing thousands of reads from it, the overhead from ArrayList may well be slight enough to ignore. If you're creating thousands of lists, go with the standard array. Object creation in a loop adds up fast, simply because of all the overhead of instantiating the member variables, calling the constructors up the inheritance chain, etc.
Because of this -- and to answer your second question -- stick with standard ints rather than the Integer class. Profile both and you'll quickly (or, rather, slowly) see why.
If you're not going to be doing a lot more than reads from this structure, then go ahead and use an array as that would be faster when read by index.
However, consider how you're going to get the data in there, and if sorting, inserting, deleting, etc, are a concern at all. If so, you may want to consider other collection based structures.
Primitives are much (much much) faster. Always. Even with JIT escape analysis, etc. Skip wrapping things in java.lang.Integer. Also, skip the array bounds check most ArrayList implementations do on get(int). Most JITs can recognize simple loop patterns and remove the bounds check, but there isn't much reason to muck with it if you're worried about performance.
You don't have to code primitive access yourself - I'd bet you could cut over to using IntArrayList from the COLT library (see http://acs.lbl.gov/~hoschek/colt/ - "Colt provides a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java") in a few minutes of refactoring.
The options are:
1. To use an array
2. To use the ArrayList which internally uses an array
It is obvious that ArrayList introduces some overhead (look into the ArrayList source code). For 99% of use cases this overhead can be easily ignored. However, if you implement time-sensitive algorithms and do tens of millions of reads from a list by index, then using bare arrays instead of lists should bring noticeable time savings. USE COMMON SENSE.
Please take a look here: http://robaustin.wikidot.com/how-does-the-performance-of-arraylist-compare-to-array. I would personally tweak the test to avoid compiler optimizations, e.g. I would change "j =" into "j +=", with subsequent use of "j" after the loop.
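For illustration, a rough sketch of that tweak: accumulate into j and print it after the loops so the JIT can't discard the reads as dead code (a harness like jmh would additionally handle warm-up):

import java.util.ArrayList;
import java.util.List;

public class ReadBench {
    public static void main(String[] args) {
        int n = 10_000_000;
        int[] a = new int[n];
        List<Integer> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) { a[i] = i; list.add(i); }

        long t0 = System.nanoTime();
        long j = 0;
        for (int i = 0; i < n; i++) j += a[i];        // raw array reads
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) j += list.get(i); // ArrayList reads (with unboxing)
        long t2 = System.nanoTime();

        // Using j here keeps both loops alive.
        System.out.println(j + "  array: " + (t1 - t0) + " ns, list: " + (t2 - t1) + " ns");
    }
}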
An array will be faster simply because, at a minimum, it skips a method call (i.e. get(i)).
If you have a static size, then arrays are your friend.