I am looking for a collection with the most efficient concurrent "add item" operation. Scala or Java is fine.
I typically:
insert hundreds of thousands of entries, one at a time
don't care about order
only read and clear the collection when there are no appends, so those operations are not critical
Also, it should work with multiple threads (hence the concurrent constraint). But I need it to be most effective when there is no concurrency: the design for thread-safety should not add much overhead when there is no concurrent access.
I use this collection to record performance measurements, which is why it should be as efficient as possible, so as not to bias the actual measurements too much. However, because the collection size is possibly big and not known in advance, it should cope efficiently with growth.
So which collection would be best to use?
I currently use a mutable.ListBuffer with buffer.synchronized { ... } around the append (and clear) operations. I tried a var buf: List (Scala) with a similar synchronized { ... } block, but it seriously impacted the measurements.
I'd say ConcurrentLinkedQueue. Insertion is O(1) using CAS, so under moderate load you probably won't get faster inserts. If you have very, very high load, you may want to consider a LinkedBlockingQueue.
Since you say adds will mostly be single-threaded, the CLQ with its CAS-based insert will be your best option.
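For illustration, a minimal Java sketch of that approach (the class and method names are mine, not from the question). Note that boxing every measurement into a Long has its own cost; if even that overhead matters, per-thread primitive buffers merged at read time would be the next step.

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative recorder: lock-free appends via CAS, drained only when appends have stopped.
public class MeasurementRecorder {
    private final Queue<Long> samples = new ConcurrentLinkedQueue<>();

    // Hot path: O(1) offer backed by CAS, never blocks even under contention.
    public void record(long elapsedNanos) {
        samples.offer(elapsedNanos);
    }

    // Called only when no appends are in flight, matching the usage in the question.
    public List<Long> drainAndClear() {
        List<Long> snapshot = new ArrayList<>();
        for (Long sample; (sample = samples.poll()) != null; ) {
            snapshot.add(sample);
        }
        return snapshot;
    }
}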
Related
I'm wondering if there are any existent concurrent data structures in Java that offer O(1) writes in the worst case. For my purposes, I intend to store no more than a fixed maximum of 16 elements, each of which will be written to many times and read only once, after all writes have completed.
I've been looking at ConcurrentHashMap and am wondering if there's some way to control which segment the map assigns an element to, so that each segment would only be written to by one thread. Moreover, if this were possible, would writes improve to O(1) in the worst case? I'm sure effecting such a change in the class would be a delicate and difficult operation to say the least, so I would appreciate any advice on that.
I don't believe there's anything like a ConcurrentArray, which could offer O(1) writes, but would be very happy to be proven wrong there!
If this isn't possible, my plan B is to create 16 arrays (with some arbitrary, unchanging capacity) and use CAS to assign each thread to an "unoccupied" array. I'm not especially confident in the efficiency of that approach, though...
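For reference, a rough sketch of what I mean by plan B (the slot count and capacity are arbitrary placeholders; the single read after all writes would still need a happens-before edge, e.g. joining the writer threads):

import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of "16 fixed arrays, CAS to claim an unoccupied one per writer thread".
public class SlotBuffers {
    private static final int SLOTS = 16;
    private static final int CAPACITY = 1024;      // arbitrary, unchanging capacity

    private final AtomicIntegerArray occupied = new AtomicIntegerArray(SLOTS);
    private final long[][] buffers = new long[SLOTS][CAPACITY];

    // Each writer claims one slot exactly once; its later writes are uncontended O(1).
    public int claimSlot() {
        for (int i = 0; i < SLOTS; i++) {
            if (occupied.compareAndSet(i, 0, 1)) {
                return i;
            }
        }
        throw new IllegalStateException("more writers than slots");
    }

    public void write(int slot, int index, long value) {
        buffers[slot][index] = value;               // plain write, single owner per slot
    }

    // Safe only after the writers have been joined (or another happens-before exists).
    public long read(int slot, int index) {
        return buffers[slot][index];
    }
}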
Thanks in advance for suggestions.
EDIT:
To address a few comments made about the requirement for O(1) writes: O(1) worst-case writes are important within the context of the problem I'm working on. They may not impact the program's overall performance significantly; nonetheless, they are a requirement for reasons that are beyond the scope of this question.
With Java 8 and lambdas it's easy to iterate over collections as streams, and just as easy to use a parallel stream. Two examples from the docs, the second one using parallelStream:
myShapesCollection.stream()
.filter(e -> e.getColor() == Color.RED)
.forEach(e -> System.out.println(e.getName()));
myShapesCollection.parallelStream() // <-- This one uses parallel
.filter(e -> e.getColor() == Color.RED)
.forEach(e -> System.out.println(e.getName()));
As long as I don't care about the order, would it always be beneficial to use a parallel stream? One would think dividing the work across more cores is faster.
Are there other considerations? When should a parallel stream be used, and when should the non-parallel one be used?
(This question is asked to trigger a discussion about how and when to use parallel streams, not because I think always using them is a good idea.)
A parallel stream has a much higher overhead compared to a sequential one. Coordinating the threads takes a significant amount of time. I would use sequential streams by default and only consider parallel ones if
I have a massive amount of items to process (or the processing of each item takes time and is parallelizable)
I have a performance problem in the first place
I don't already run the process in a multi-thread environment (for example: in a web container, if I already have many requests to process in parallel, adding an additional layer of parallelism inside each request could have more negative than positive effects)
In your example, the performance will anyway be driven by the synchronized access to System.out.println(), and making this process parallel will have no effect, or even a negative one.
Moreover, remember that parallel streams don't magically solve all the synchronization problems. If a shared resource is used by the predicates and functions used in the process, you'll have to make sure that everything is thread-safe. In particular, side effects are things you really have to worry about if you go parallel.
In any case, measure, don't guess! Only a measurement will tell you if the parallelism is worth it or not.
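As a very rough illustration of "measure, don't guess" (a proper benchmark would use a harness such as JMH; a single System.nanoTime comparison is only indicative), here is a sketch where a made-up CPU-bound predicate stands in for real per-element work:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class SequentialVsParallel {
    public static void main(String[] args) {
        List<Long> numbers = LongStream.rangeClosed(1, 2_000_000)
                .boxed()
                .collect(Collectors.toList());

        long t0 = System.nanoTime();
        long seq = numbers.stream().filter(SequentialVsParallel::isPrime).count();
        long t1 = System.nanoTime();
        long par = numbers.parallelStream().filter(SequentialVsParallel::isPrime).count();
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d primes in %d ms%n", seq, (t1 - t0) / 1_000_000);
        System.out.printf("parallel:   %d primes in %d ms%n", par, (t2 - t1) / 1_000_000);
    }

    // Deliberately expensive per-element work so parallelism has something to gain.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }
}

On a typical multi-core machine the parallel run wins here because each element carries real work; replace the predicate with something trivial and the ordering can easily flip.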
The Stream API was designed to make it easy to write computations in a way that was abstracted away from how they would be executed, making switching between sequential and parallel easy.
However, just because it's easy doesn't mean it's always a good idea; in fact, it is a bad idea to drop .parallel() all over the place simply because you can.
First, note that parallelism offers no benefits other than the possibility of faster execution when more cores are available. A parallel execution will always involve more work than a sequential one, because in addition to solving the problem, it also has to perform dispatching and coordinating of sub-tasks. The hope is that you'll be able to get to the answer faster by breaking up the work across multiple processors; whether this actually happens depends on a lot of things, including the size of your data set, how much computation you are doing on each element, the nature of the computation (specifically, does the processing of one element interact with processing of others?), the number of processors available, and the number of other tasks competing for those processors.
Further, note that parallelism often exposes nondeterminism in the computation that sequential implementations usually hide; sometimes this doesn't matter, or it can be mitigated by constraining the operations involved (i.e., reduction operators must be stateless and associative).
In reality, sometimes parallelism will speed up your computation, sometimes it will not, and sometimes it will even slow it down. It is best to develop first using sequential execution and then apply parallelism where
(A) you know that there's actually benefit to increased performance and
(B) that it will actually deliver increased performance.
(A) is a business problem, not a technical one. If you are a performance expert, you'll usually be able to look at the code and determine (B), but the smart path is to measure. (And, don't even bother until you're convinced of (A); if the code is fast enough, better to apply your brain cycles elsewhere.)
The simplest performance model for parallelism is the "NQ" model, where N is the number of elements, and Q is the computation per element. In general, you need the product NQ to exceed some threshold before you start getting a performance benefit. For a low-Q problem like "add up numbers from 1 to N", you will generally see a breakeven between N=1000 and N=10000. With higher-Q problems, you'll see breakevens at lower thresholds.
But the reality is quite complicated. So until you achieve experthood, first identify when sequential processing is actually costing you something, and then measure if parallelism will help.
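To illustrate the low-Q end of that model: summing numbers is about the cheapest per-element work possible, so the parallel version only has a chance to pay off at large N. A rough sketch (exact breakeven points vary by machine and JVM):

import java.util.stream.LongStream;

public class NQBreakeven {
    public static void main(String[] args) {
        for (long n : new long[] {1_000, 10_000, 1_000_000, 100_000_000}) {
            long t0 = System.nanoTime();
            long s = LongStream.rangeClosed(1, n).sum();
            long t1 = System.nanoTime();
            long p = LongStream.rangeClosed(1, n).parallel().sum();
            long t2 = System.nanoTime();
            System.out.printf("N=%-11d sequential %8d us, parallel %8d us (sums %d / %d)%n",
                    n, (t1 - t0) / 1_000, (t2 - t1) / 1_000, s, p);
        }
    }
}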
I watched one of the presentations of Brian Goetz (Java Language Architect & specification lead for Lambda Expressions). He explains in detail the following 4 points to consider before going for parallelization:
Splitting / decomposition costs
– Sometimes splitting is more expensive than just doing the work!
Task dispatch / management costs
– Can do a lot of work in the time it takes to hand work to another thread.
Result combination costs
– Sometimes combination involves copying lots of data. For example, adding numbers is cheap whereas merging sets is expensive.
Locality
– The elephant in the room, and an important point everyone may miss. You should consider cache misses: if a CPU is waiting for data because of cache misses, you won't gain anything from parallelization. That's why array-based sources parallelize best: the next indices (near the current index) are already cached, so the CPU is less likely to experience a cache miss.
He also mentions a relatively simple formula to determine a chance of parallel speedup.
NQ Model:
N x Q > 10000
where,
N = number of data items
Q = amount of work per item
Other answers have already covered profiling to avoid premature optimization and overhead cost in parallel processing. This answer explains the ideal choice of data structures for parallel streaming.
As a rule, performance gains from parallelism are best on streams over ArrayList, HashMap, HashSet, and ConcurrentHashMap instances; arrays; int ranges; and long ranges. What these data structures have in common is that they can all be accurately and cheaply split into subranges of any desired sizes, which makes it easy to divide work among parallel threads. The abstraction used by the streams library to perform this task is the spliterator, which is returned by the spliterator method on Stream and Iterable.
Another important factor that all of these data structures have in common is that they provide good-to-excellent locality of reference when processed sequentially: sequential element references are stored together in memory. The objects referred to by those references may not be close to one another in memory, which reduces locality-of-reference. Locality-of-reference turns out to be critically important for parallelizing bulk operations: without it, threads spend much of their time idle, waiting for data to be transferred from memory into the processor’s cache. The data structures with the best locality of reference are primitive arrays because the data itself is stored contiguously in memory.
Source: Item #48 Use Caution When Making Streams Parallel, Effective Java 3e by Joshua Bloch
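To make the splitting point concrete, here is a small sketch showing an ArrayList spliterator being split in half; array-backed sources can keep halving their index range cheaply and evenly, which is exactly what the parallel stream machinery relies on:

import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitDemo {
    public static void main(String[] args) {
        List<Integer> list = IntStream.range(0, 1_000)
                .boxed()
                .collect(Collectors.toCollection(ArrayList::new));

        Spliterator<Integer> right = list.spliterator();
        Spliterator<Integer> left = right.trySplit();   // splits the index range in two

        System.out.println("left half:  " + left.estimateSize());   // 500
        System.out.println("right half: " + right.estimateSize());  // 500
    }
}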
Never parallelize an infinite stream with a limit. Here is what happens:
public static void main(String[] args) {
// let's count to 1 in parallel
System.out.println(
IntStream.iterate(0, i -> i + 1)
.parallel()
.skip(1)
.findFirst()
.getAsInt());
}
Result
Exception in thread "main" java.lang.OutOfMemoryError
at ...
at java.base/java.util.stream.IntPipeline.findFirst(IntPipeline.java:528)
at InfiniteTest.main(InfiniteTest.java:24)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.stream.SpinedBuffer$OfInt.newArray(SpinedBuffer.java:750)
at ...
Same if you use .limit(...)
Explanation here:
Java 8, using .parallel in a stream causes OOM error
Similarly, don't use parallel if the stream is ordered and has many more elements than you want to process, e.g.
public static void main(String[] args) {
// let's skip the first 100 numbers in parallel
System.out.println(
IntStream.range(1, 1000_000_000)
.parallel()
.skip(100)
.findFirst()
.getAsInt());
}
This may run much longer, because the parallel threads work on many number ranges instead of the crucial one, 0-100, so it can take a very long time.
Collection.parallelStream() is a great way to do work in parallel. However, you need to keep in mind that it effectively uses a common thread pool with only a few worker threads internally (by default, the number of threads equals the number of CPU cores), see ForkJoinPool.commonPool(). If some of the pool's tasks are long-running, I/O-bound work, then other, potentially fast, parallelStream calls will get stuck waiting for free pool threads. This effectively requires fork-join tasks to be non-blocking and short or, in other words, CPU-bound. For a better understanding of the details I strongly recommend carefully reading the java.util.concurrent.ForkJoinTask javadoc; here are some relevant quotes:
The efficiency of ForkJoinTasks stems from ... their main use as computational tasks calculating pure functions or operating on purely isolated objects.
Computations should ideally avoid synchronized methods or blocks, and should minimize other blocking synchronization
Subdividable tasks should also not perform blocking I/O
These indicate that the main purpose of parallelStream() tasks is short computations over isolated in-memory structures. I'd also recommend checking out the article Common parallel stream pitfalls.
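One commonly used workaround for keeping slow, blocking work away from the common pool is to run the parallel pipeline inside a dedicated ForkJoinPool. Be aware that this relies on an implementation detail (tasks forked from a fork-join worker run in that worker's pool) rather than a documented guarantee; the names and the sleep below are placeholders for real blocking I/O:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

public class DedicatedPoolExample {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        List<String> urls = Arrays.asList("a", "b", "c");   // stand-ins for real work items

        // A private pool, so these blocking tasks cannot starve ForkJoinPool.commonPool().
        ForkJoinPool pool = new ForkJoinPool(8);
        try {
            List<String> results = pool.submit(() ->
                    urls.parallelStream()
                        .map(DedicatedPoolExample::slowFetch)   // hypothetical blocking call
                        .collect(Collectors.toList())
            ).get();
            System.out.println(results);
        } finally {
            pool.shutdown();
        }
    }

    static String slowFetch(String url) {
        try {
            Thread.sleep(100);                               // simulates blocking I/O
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "response for " + url;
    }
}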
I am just wondering why the JDK HashMap rehashing process does not take the gradual approach that Redis uses. Though the rehash calculation of the JDK HashMap is quite elegant and effective, it will still take noticeable time when the original HashMap contains a large number of entries.
I am not an experienced Java user, so I have always supposed that there must be a consideration by the Java designers that is beyond the limit of my understanding.
Gradual rehashing, as in Redis, effectively distributes the workload across each put, delete, or get on the HashMap, which could significantly reduce resize/rehash time.
I have also compared the two hash methods, which in my mind don't prevent the JDK from doing gradual rehashing.
I hope someone can give a clue or some inspiration. Thanks a lot in advance.
If you think about the costs and benefits of incremental rehashing for something like HashMap, it turns out that the costs are not insignificant and the benefits are not as great as you might like.
An incrementally rehashing HashMap:
Uses 50% more memory on average, because it needs to keep both the old table and new table around during the incremental rehash; and
Has a somewhat higher computational cost per operation. Also:
The rehashing is still not entirely incremental, because allocating the new hash table array has to be done all at once; so
There are no improvements in the asymptotic complexity of any operation. And finally:
Almost nothing that would really need incremental rehashing can be implemented in Java at all, due to the unpredictable GC pauses, so why bother?
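If the practical concern is the pause caused by one big rehash, the usual remedy in Java is to pre-size the map so that no resize ever happens, rather than to spread the rehash out. A small sketch, with a made-up expected size:

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expectedEntries = 1_000_000;   // hypothetical figure

        // Capacity chosen so expectedEntries stays under the default 0.75 load factor,
        // meaning no resize/rehash ever happens while the map is being filled.
        Map<Integer, String> map = new HashMap<>((int) (expectedEntries / 0.75f) + 1);

        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size());
    }
}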
I'm writing a highly concurrent application, needing access to a large fine-grained set of shared resources. I'm currently writing a global lock manager to organize this. I'm wondering if I can piggyback off the standard ConcurrentHashMap and use that to handle the locking? I'm thinking of a system like the following:
A single global ConcurrentHashMap object contains a mapping between the unique string id of the resource and the unique id of the thread currently using that resource
Tune the concurrency factor to reflect the need for a high level of concurrency
Locks are acquired using the atomic conditional replace(K key, V oldValue, V newValue) method in the hashmap
To prevent lock contention when locking multiple resources, locks must be acquired in alphabetical order
Are there any major issues with the setup? How will the performance be?
I know this is probably going to be much slower and more memory-heavy than a properly written locking system, but I'd rather not spend days trying to write my own, especially given that I probably won't be able to match Java's professionally-written concurrency code implementing the map.
Also, I've never used ConcurrentHashMap in a high-load situation, so I'm interested in the following:
How well will this scale to large numbers of elements? (I'm looking at ~1,000,000 being a good cap. If I reach beyond that I'd be willing to rewrite this more efficiently)
The documentation states that re-sizing is "relatively" slow. Just how slow is it? I'll probably have to re-size the map once every minute or so. Is this going to be problematic with the size of map I'm looking at?
Edit: Thanks Holger for pointing out that HashMaps shouldn't have that big of an issue with scaling
Also, is there a better/more standard method out there? I can't find any places where a system like this is used, so I'm guessing that either I'm not seeing a major flaw, or there's something else going on.
Edit:
The application I'm writing is a network service, handling a variable number of requests. I'm using the Grizzly project to balance the requests among multiple threads.
Each request uses a small number of the shared resources (~30), so in general I do not expect a great deal of contention. The requests usually finish working with the resources in under 500 ms. Thus, I'd be fine with a bit of blocking/continuous polling, as the requests aren't extremely time-sensitive and contention should be minimal.
In general, seeing that a proper solution would be quite similar to how ConcurrentHashMap works behind the scenes, I'm wondering if I can safely use that as a shortcut instead of writing/debugging/testing my own version.
The re-sizing issue is not relevant, as you already gave an estimate of the number of elements in your question. So you can give a ConcurrentHashMap an initial capacity large enough to avoid any rehashing.
The performance will not depend on the number of elements (that's the main goal of hashing), but on the number of concurrent threads.
The main problem is that you don’t have a plan of how to handle failed locks. Unless you want to poll until locking succeeds (which is not recommended) you need a way of putting a thread to sleep which implies that the thread currently owning the lock has to wake up a sleeping thread on release if one exists. So you end up requiring conventional Lock features a ConcurrentHashMap does not offer.
Creating a Lock per element (as you said ~1,000,000) would not be a solution.
A solution would look a bit like how ConcurrentHashMap works internally. Given a certain concurrency level, i.e. the number of threads you might have (rounded up), you create that number of Locks (a far smaller number than 1,000,000).
Now you assign each element one of the Locks. A simple assignment would be based on the element’s hashCode, assuming it is stable. Then locking an element means locking the assigned Lock which gives you up to the configured concurrency level if all currently locked elements are mapped to different Locks.
This might imply that threads locking different elements block each other if the elements are mapped to the same Lock, but with a predictable likelihood. You can try fine-tuning the concurrency level (as said, use a number higher than the number of threads) to find the best trade-off.
A big advantage of this approach is that you do not need to maintain a data structure whose size depends on the number of elements. AFAIK, the new parallel ClassLoader uses a similar technique.
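For illustration, a minimal sketch of that striping idea (the stripe count and the use of String resource ids are assumptions based on the question); Guava's Striped class packages essentially the same idea if you prefer not to hand-roll it:

import java.util.concurrent.locks.ReentrantLock;

// Fixed pool of locks; each resource id maps to one stripe by hash.
public class StripedLocks {
    private final ReentrantLock[] stripes;

    public StripedLocks(int concurrencyLevel) {     // e.g. a small multiple of the thread count
        stripes = new ReentrantLock[concurrencyLevel];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    private ReentrantLock stripeFor(String resourceId) {
        int h = resourceId.hashCode();
        h ^= (h >>> 16);                            // spread the hash bits a little
        return stripes[(h & Integer.MAX_VALUE) % stripes.length];
    }

    public void withLock(String resourceId, Runnable action) {
        ReentrantLock lock = stripeFor(resourceId);
        lock.lock();                                // parks the thread until the stripe is free
        try {
            action.run();
        } finally {
            lock.unlock();                          // waiter wake-up is handled by the Lock itself
        }
    }
}

To lock several resources at once without deadlock, acquire their stripes in ascending index order and skip duplicates; that replaces the alphabetical ordering from the original plan.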