Increasing program speedup when using shared memory - java

I have a program that calculates Pi from the Chudnovsky formula. It's written in Java and uses a shared Vector to save intermediate calculations, like factorials and powers, that depend on the index of the element.
However, I believe that since it's a synchronized Vector (thread safe by default), only one thread at a time can read or write to it. So with lots of threads, instead of increasing speedup, we see the computation time become constant.
Is there anything that I can do to circumvent that? What to do when there are too many threads reading/writing to the same shared memory?

When the access pattern is lots of reads and occasional writes, you can protect an unsynchronized data structure with a ReentrantReadWriteLock. It allows multiple concurrent readers, but only a single writer.
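For illustration, a minimal sketch of that pattern, assuming the intermediates live in a plain ArrayList of BigInteger (class and method names here are made up for the example):

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class IntermediateStore {
    private final List<BigInteger> values = new ArrayList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Many threads may read concurrently; none may read while a write is in progress.
    BigInteger get(int index) {
        lock.readLock().lock();
        try {
            return index < values.size() ? values.get(index) : null;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writers are exclusive: they block all readers and other writers.
    void append(BigInteger value) {
        lock.writeLock().lock();
        try {
            values.add(value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```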
Depending on your implementation, you might also benefit from using a ConcurrentHashMap.
You might be able to cheat a bit and use either an AtomicIntegerArray or an AtomicReferenceArray of Futures/CompletionStages.
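For a factorial/power cache like the one in the question, a ConcurrentHashMap used as a memo table is often the simplest of these options: computeIfAbsent runs the computation at most once per key and lets lookups of different keys proceed in parallel. A rough sketch (note that the mapping function must not update the map itself, so the factorial is computed iteratively inside it):

```java
import java.math.BigInteger;
import java.util.concurrent.ConcurrentHashMap;

class FactorialCache {
    private final ConcurrentHashMap<Integer, BigInteger> cache = new ConcurrentHashMap<>();

    // computeIfAbsent blocks only threads asking for the same key;
    // already-computed keys are answered with a lock-free read.
    BigInteger factorial(int n) {
        return cache.computeIfAbsent(n, k -> {
            BigInteger result = BigInteger.ONE;
            for (int i = 2; i <= k; i++) {
                result = result.multiply(BigInteger.valueOf(i));
            }
            return result; // the mapping function must not touch the map recursively
        });
    }
}
```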

Store the results of each thread on a stack. One collector thread pops the results from every thread and adds them together (it just has to handle the case where the stack is momentarily empty).
If you want multiple threads to work on factorials, why not create a thread or two that produce a list of factorial results? Other threads can then just look up results as needed.

Instead of having the same shared memory, you can have multiple threads, each with its own private memory, whose results are collected on a stack. Eventually (or occasionally), add all of these up with one thread!
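A sketch of that idea, where term() is an illustrative stand-in for the real series term: each thread sums its own slice into a private accumulator, and only one thread combines the partials at the end.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.ArrayList;
import java.util.List;

public class PartialSums {
    // Illustrative stand-in for the real Chudnovsky term.
    static BigDecimal term(int k) {
        return BigDecimal.ONE.divide(BigDecimal.valueOf(k + 1), MathContext.DECIMAL64);
    }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = 4, termsPerThread = 250_000;
        BigDecimal[] partial = new BigDecimal[nThreads]; // one private slot per thread
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            Thread th = new Thread(() -> {
                BigDecimal sum = BigDecimal.ZERO;          // thread-private accumulator
                for (int k = id * termsPerThread; k < (id + 1) * termsPerThread; k++)
                    sum = sum.add(term(k));
                partial[id] = sum;                         // single uncontended write
            });
            threads.add(th);
            th.start();
        }
        for (Thread th : threads) th.join();               // join() makes partial[] visible
        BigDecimal total = BigDecimal.ZERO;
        for (BigDecimal p : partial) total = total.add(p); // one thread combines at the end
        System.out.println(total);
    }
}
```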

If you need high throughput, you can consider using Disruptor and RingBuffer.
At a crude level you can think of a Disruptor as a multicast graph of queues where producers put objects on it that are sent to all the consumers for parallel consumption through separate downstream queues. When you look inside you see that this network of queues is really a single data structure - a ring buffer.
Each producer and consumer has a sequence counter to indicate which slot in the buffer it's currently working on. Each producer/consumer writes its own sequence counter but can read the others' sequence counters.
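To make the sequence-counter idea concrete, here is a toy single-producer/single-consumer ring buffer (Thread.onSpinWait needs Java 9+). The real Disruptor adds cache-line padding, batching, multi-consumer coordination, and pluggable wait strategies, so treat this purely as an illustration of the mechanism:

```java
import java.util.concurrent.atomic.AtomicLong;

class ToyRingBuffer<T> {
    private final Object[] slots;
    private final int mask;                                   // capacity must be a power of two
    private final AtomicLong producerSeq = new AtomicLong(0); // next slot to write
    private final AtomicLong consumerSeq = new AtomicLong(0); // next slot to read

    ToyRingBuffer(int capacityPowerOfTwo) {
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    void publish(T value) {
        long seq = producerSeq.get();
        while (seq - consumerSeq.get() >= slots.length)
            Thread.onSpinWait();            // buffer full: wait for the consumer to catch up
        slots[(int) (seq & mask)] = value;
        producerSeq.set(seq + 1);           // advancing our own counter releases the slot
    }

    @SuppressWarnings("unchecked")
    T take() {
        long seq = consumerSeq.get();
        while (producerSeq.get() <= seq)
            Thread.onSpinWait();            // buffer empty: wait for the producer
        T value = (T) slots[(int) (seq & mask)];
        consumerSeq.set(seq + 1);
        return value;
    }
}
```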
A few useful links:
https://lmax-exchange.github.io/disruptor
http://martinfowler.com/articles/lmax.html
https://softwareengineering.stackexchange.com/questions/244826/can-someone-explain-in-simple-terms-what-is-the-disruptor-pattern

Related

Improving performance with Distributed Counter, looking for library

We have a system with many threads, each incrementing the same counter. At the end, we need the total number of increments across all threads. Due to the size of the final result and the cost of synchronization, we suspect a performance issue with our current solution, which uses synchronized access to a single variable.
To avoid synchronization, I would like to use a Distributed Counter (correct term?), where each thread increments its own counter copy. The individual counters are summed up only once, when the final result is requested.
I could implement such a counter from scratch. But I guess I'm not the first one with such a requirement. Surprisingly, a quick search did not turn up any library. Could you suggest some library or demo code? I'm looking for a simple solution, not a heavy framework.
Does your system have many different processes managing all the different threads?
If all threads are managed by the same process, I don't think you need a distributed resource (counter); you can just use an AtomicInteger, as suggested.
Atomic means that it is thread safe: it can be accessed from many threads and no data corruption will happen.
If your system does use many processes, then you will need a distributed resource.
You can use any type of database to achieve that.
It seems to me that Redis might be a good option, or any MySQL database if you want 100% data consistency.
The solution you propose is a CRDT counter. Perhaps searching for that keyword lets you find a suitable implementation.
If it is within one JVM process, just read thread-local counters to sum them up.
If it is inter-process, memory-mapped files are great for performance, though the file-level (or buffer-level) I/O API is fiddly when it comes to reading and writing.
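Within a single JVM the pattern actually ships with the JDK: java.util.concurrent.atomic.LongAdder (Java 8+) keeps internally striped per-thread cells and sums them only on demand, which is exactly the "distributed counter" described. A minimal usage sketch:

```java
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder counter = new LongAdder();
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                // Each increment hits one of several internal cells, so
                // contended threads rarely collide on the same memory word.
                for (int j = 0; j < 1_000_000; j++) counter.increment();
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        System.out.println(counter.sum()); // the cells are summed only here
    }
}
```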

Java: How to make sure that a shared data write is visible to multiple readers

I am trying to code a processor-intensive task, so I would like to use multithreading and share the calculation between the available processor cores.
Let's say I have thousands of iterations, and every iteration has two phases:
Phase 1: several working threads scan through hundreds of thousands of options, reading data from a shared array (or some other data structure); the data is not modified during this phase.
Phase 2: one thread collects the results from all the working threads (while they are waiting) and makes modifications on the shared array.
The phases run in sequence, so there is no overlap (no concurrent writing and reading of the data). My problem is: how can I be sure that the data (cache) is updated for the working threads before the next Phase 1 starts?
I am assuming that when people speak about cache or caching in this context, they mean the processor cache (correct me if I'm wrong).
As I understand it, volatile can be used for non-reference types only, and there is no point in using synchronized, because the working threads would block each other on every read (there can be thousands of reads while processing one option).
What else can I use in this case?
Right now I have a few ideas, but I have no idea how costly they are (most probably quite costly):
create new working threads for all iterations
in a synchronized block, make a copy of the array (which can be up to 195 kB in size) for each thread before a new iteration begins
I read about ReentrantReadWriteLock, but I can't understand how it is related to caching. Does acquiring a read lock force the reader's cache to update?
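For what it's worth, the two-phase structure described above maps naturally onto a CyclicBarrier: its documentation guarantees that actions before await() in one thread happen-before actions after the barrier is released in the others, so workers see the writer's updates without any copying. A sketch with placeholder phase bodies:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class TwoPhase {
    static final int WORKERS = 4;
    static final int[] shared = new int[50_000];

    public static void main(String[] args) {
        // The barrier action runs alone, between rounds: this is Phase 2.
        CyclicBarrier barrier = new CyclicBarrier(WORKERS, () -> {
            shared[0]++; // placeholder for "modify the shared array"
        });
        for (int w = 0; w < WORKERS; w++) {
            new Thread(() -> {
                try {
                    for (int iter = 0; iter < 1_000; iter++) {
                        // Phase 1: read-only scanning of the shared array.
                        long sum = 0;
                        for (int v : shared) sum += v;
                        // await() publishes the barrier action's writes to every worker.
                        barrier.await();
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```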
The thing I was searching for was mentioned in the Java Tutorials on concurrency; I just had to look deeper. In this case it was the AtomicIntegerArray class. Unfortunately it is not efficient enough for my needs, but I ran some tests, and maybe they are worth sharing.
I approximated the cost of different memory access methods by running them many times, averaging the elapsed times, and breaking everything down to one average read or write.
I used an integer array of size 50,000 and repeated every test method 100 times, then averaged the results. The read tests perform 50,000 random(ish) reads. The results show the approximate time of one read/write access. This can't be called an exact measurement, but I believe it gives a good sense of the relative costs of the different access methods. On different processors or with different array sizes the results may be completely different, owing to different cache sizes and clock speeds.
So the results are:
1. Fill time with set(): 15.9227 ns
2. Fill time with lazySet(): 4.5303 ns
3. Atomic read time: 9.1466 ns
4. Synchronized read time: 57.8583 ns
5. Single-threaded fill time: 0.2879 ns
6. Single-threaded read time: 0.3152 ns
7. Immutable copy time: 0.2921 ns
8. Immutable read time: 0.6506 ns
Points 1 and 2 show the write results on an AtomicIntegerArray with sequential writes. In some article I read about the good efficiency of the lazySet() method, so I wanted to test it. It usually outperforms the set() method by about a factor of four, though different array sizes show different results.
Points 3 and 4 show the difference between "atomic" access and synchronized access (a synchronized getter) to one item of the array, via random(ish) reads by four different threads simultaneously. This clearly indicates the benefit of "atomic" access.
Since the first four values looked shockingly high, I really wanted to measure the access times without multithreading, so I got the results of points 5 and 6. I tried to copy and modify the methods from the previous tests to keep the code as close as possible. Of course there may be optimizations I can't affect.
Then, just out of curiosity, I came up with points 7 and 8, which imitate immutable access. Here one thread creates the array (by sequential writes) and passes its reference to another thread, which performs the random(ish) read accesses on it.
The results vary heavily if the parameters are changed, such as the size of the array or the number of methods running.
The conclusion:
If an algorithm is extremely memory intensive (lots of reads from the same small array, interrupted by short calculations, which is my case), multithreading can slow the calculation down instead of speeding it up. But if it performs very many reads compared to the size of the array, it may help to use an immutable copy of the array and multiple threads.
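To illustrate points 7 and 8, here is a sketch of the immutable-copy approach: the writer builds a fresh array and publishes it through a single volatile reference, so readers pay one volatile read per iteration and then work on plain, cache-friendly memory (the names are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

class Snapshot {
    // One volatile read picks up the latest published array; the array itself
    // is never mutated after publication, so element reads are ordinary loads.
    private volatile int[] current = new int[50_000];

    // Writer: build a fresh copy, then publish it in one volatile write.
    void update() {
        int[] next = current.clone();
        next[ThreadLocalRandom.current().nextInt(next.length)]++;
        current = next;
    }

    // Reader: snapshot once, then do all the random(ish) reads on plain memory.
    long scan() {
        int[] snap = current;
        long sum = 0;
        for (int i = 0; i < snap.length; i++)
            sum += snap[ThreadLocalRandom.current().nextInt(snap.length)];
        return sum;
    }
}
```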

Java: Using ConcurrentHashMap as a lock manager

I'm writing a highly concurrent application, needing access to a large fine-grained set of shared resources. I'm currently writing a global lock manager to organize this. I'm wondering if I can piggyback off the standard ConcurrentHashMap and use that to handle the locking? I'm thinking of a system like the following:
A single global ConcurrentHashMap object contains a mapping between the unique string id of the resource and the unique id of the thread currently using the resource
Tune the concurrency factor to reflect the need for a high level of concurrency
Locks are acquired using the atomic conditional replace(K key, V oldValue, V newValue) method in the hashmap
To prevent lock contention when locking multiple resources, locks must be acquired in alphabetical order
Are there any major issues with the setup? How will the performance be?
I know this is probably going to be much slower and more memory-heavy than a properly written locking system, but I'd rather not spend days trying to write my own, especially given that I probably won't be able to match Java's professionally-written concurrency code implementing the map.
Also, I've never used ConcurrentHashMap in a high-load situation, so I'm interested in the following:
How well will this scale to large numbers of elements? (I'm looking at ~1,000,000 being a good cap. If I reach beyond that I'd be willing to rewrite this more efficiently)
The documentation states that re-sizing is "relatively" slow. Just how slow is it? I'll probably have to re-size the map once every minute or so. Is this going to be problematic with the size of map I'm looking at?
Edit: Thanks Holger for pointing out that HashMaps shouldn't have that big of an issue with scaling
Also, is there a better/more standard method out there? I can't find any places where a system like this is used, so I'm guessing that either I'm not seeing a major flaw, or there's something else.
Edit:
The application I'm writing is a network service, handling a variable number of requests. I'm using the Grizzly project to balance the requests among multiple threads.
Each request uses a small number of the shared resources (~30), so in general I do not expect a great deal of contention. The requests usually finish working with the resources in under 500 ms. Thus, I'd be fine with a bit of blocking/continuous polling, as the requests aren't extremely time-sensitive and contention should be minimal.
In general, seeing that a proper solution would be quite similar to how ConcurrentHashMap works behind the scenes, I'm wondering if I can safely use that as a shortcut instead of writing/debugging/testing my own version.
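For concreteness, the acquire/release protocol sketched in the question might look like the following, using putIfAbsent for the uncontended acquire and the two-argument remove for release; the busy-poll in lock() is exactly the weak point the answer below addresses (all names here are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

class MapLockManager {
    private final ConcurrentHashMap<String, Long> owners =
            new ConcurrentHashMap<>(1 << 21); // sized up front to avoid resizing

    // Try to claim the resource for the current thread; true on success.
    boolean tryLock(String resourceId) {
        return owners.putIfAbsent(resourceId, Thread.currentThread().getId()) == null;
    }

    // Busy-poll until acquired -- simple, but wastes CPU under contention.
    void lock(String resourceId) {
        while (!tryLock(resourceId))
            Thread.onSpinWait();
    }

    // Release succeeds only if we are the recorded owner.
    void unlock(String resourceId) {
        owners.remove(resourceId, Thread.currentThread().getId());
    }
}
```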
The re-sizing issue is not relevant, as you already gave an estimate of the number of elements in your question. So you can give a ConcurrentHashMap an initial capacity large enough to avoid any rehashing.
The performance will not depend on the number of elements (that's the main goal of hashing) but on the number of concurrent threads.
The main problem is that you don’t have a plan of how to handle failed locks. Unless you want to poll until locking succeeds (which is not recommended) you need a way of putting a thread to sleep which implies that the thread currently owning the lock has to wake up a sleeping thread on release if one exists. So you end up requiring conventional Lock features a ConcurrentHashMap does not offer.
Creating a Lock per element (as you said ~1,000,000) would not be a solution.
A solution would look a bit like how ConcurrentHashMap works internally. Given a certain concurrency level, i.e. the number of threads you might have (rounded up), you create that number of Locks (which would be a far smaller number than 1,000,000).
Now you assign each element one of the Locks. A simple assignment would be based on the element’s hashCode, assuming it is stable. Then locking an element means locking the assigned Lock which gives you up to the configured concurrency level if all currently locked elements are mapped to different Locks.
This might imply that threads locking different elements block each other if the elements are mapped to the same Lock, but with a predictable likelihood. You can try fine-tuning the concurrency level (as said, use a number higher than the number of threads) to find the best trade-off.
A big advantage of this approach is that you do not need to maintain a data structure that depends on the number of elements. Afaik, the new parallel ClassLoader uses a similar technique.
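A sketch of the striping scheme described here: a fixed array of locks sized by the expected concurrency level, with each element mapped to a lock by its hashCode (the power-of-two rounding and bit spreading are my own assumptions, not from the answer):

```java
import java.util.concurrent.locks.ReentrantLock;

class StripedLocks {
    private final ReentrantLock[] stripes;

    StripedLocks(int concurrencyLevel) {
        // Round up to a power of two so we can mask instead of using modulo.
        int size = Integer.highestOneBit(Math.max(2, concurrencyLevel * 2 - 1));
        stripes = new ReentrantLock[size];
        for (int i = 0; i < size; i++) stripes[i] = new ReentrantLock();
    }

    // Stable mapping from element to stripe via its hashCode.
    private ReentrantLock lockFor(Object element) {
        int h = element.hashCode();
        h ^= (h >>> 16); // spread the bits, as HashMap does
        return stripes[h & (stripes.length - 1)];
    }

    void withLock(Object element, Runnable action) {
        ReentrantLock lock = lockFor(element);
        lock.lock(); // blocks (and sleeps) on contention: a real Lock, unlike the map trick
        try {
            action.run();
        } finally {
            lock.unlock();
        }
    }
}
```

Guava's Striped class offers a ready-made version of the same idea.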

What does "less predictable performance of LinkedBlockingQueue in concurrent applications" mean?

For the logging feature I am working on, I need a processing thread that sits waiting for jobs and executes them in batches when the count reaches or exceeds a certain number. Since it is a standard case of the producer-consumer problem, I intend to use BlockingQueues. I have a number of producers adding entries to the queue using the add() method, whereas there is only one consumer thread that uses take() to wait on the queue.
LinkedBlockingQueue seems like a good option since it does not have any size restriction, but I am confused by this passage from the documentation:
Linked queues typically have higher throughput than array-based queues but less predictable performance in most concurrent applications.
It is not clearly explained what they mean by this statement. Can someone please shed some light on it? Does it mean LinkedBlockingQueue is not thread safe? Did any of you encounter any issues using LinkedBlockingQueue?
Since there are many more producers, there is always a scenario where the queue is overwhelmed with a large number of entries to be added. If I were to use an ArrayBlockingQueue instead, which takes the size of the queue as a constructor parameter, I could run into capacity-full exceptions. To avoid this, I am not sure how to determine what size to instantiate my ArrayBlockingQueue with. Have you had to solve a similar problem using ArrayBlockingQueue?
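A sketch of the batching consumer described in the question: take() blocks for the first entry, then drainTo grabs whatever else has accumulated, so a batch is written in one go (BATCH_SIZE and writeBatch are placeholders, not from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BatchLogger implements Runnable {
    static final int BATCH_SIZE = 100;                 // illustrative threshold
    final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void run() {
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        try {
            while (true) {
                batch.add(queue.take());               // block until at least one entry
                queue.drainTo(batch, BATCH_SIZE - 1);  // grab the rest without blocking
                writeBatch(batch);
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void writeBatch(List<String> entries) { /* placeholder: flush to the log */ }
}
```

Producers simply call queue.add(entry), as in the question.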
Does it mean LinkedBlockingQueue is not thread safe?
It certainly does not mean that. The phrase "less predictable performance" is talking about just that -- performance -- and not some violation of the thread-safety or Java collections contract.
I suspect this is more about the fact that it is a linked list, so iterating and other operations on the collection will be slower, and the class will hold locks longer. It also has to deal with more memory structures, since each element has its own linked-list node as opposed to just an entry in an array. This means it has to flush more dirty memory pages between processors when synchronizing. Again, this impacts performance.
What they are trying to say is that, if you can, you should use ArrayBlockingQueue; otherwise I wouldn't worry about it.
Did any of you encounter any issues using LinkedBlockingQueue?
I've used it a lot and not seen any problems. It also is used a lot in the ExecutorService classes which are used everywhere.

Should I consider parallelism in statistical calculations?

We are going to implement software for various statistical analyses, in Java. The main concept is to get an array of points on a graph, then iterate through it and find some results (like the longest rising sequence and various indicators).
Problem: lots of data
Problem 2: it must also work on the client's PC, not only the server (no specific server tuning possible)
Partial solution: do the computation in the background and let the user stare at an empty screen waiting for the result :(
Question: Is there a way to increase the performance of the computation itself (lots of iterations) using parallelism? If so, please provide links to articles, samples, whatever is usable here...
The main reason to use parallel processing is the presence of a large amount of data or large computations that can be performed independently of each other. For example, you can compute the factorial of 10000 with many threads by splitting it into the parts 1..1000, 1001..2000, 2001..3000, etc., processing each part, and then accumulating the results with *. On the other hand, you cannot split the task of computing a big Fibonacci number, since later values depend on previous ones.
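A sketch of exactly that splitting for 10000!, with each thread multiplying one range and the main thread combining the partial products:

```java
import java.math.BigInteger;

public class ParallelFactorial {
    public static void main(String[] args) throws InterruptedException {
        final int N = 10_000, CHUNK = 1_000;
        int chunks = N / CHUNK;
        BigInteger[] partial = new BigInteger[chunks];
        Thread[] workers = new Thread[chunks];
        for (int c = 0; c < chunks; c++) {
            final int lo = c * CHUNK + 1, hi = (c + 1) * CHUNK, id = c;
            workers[c] = new Thread(() -> {
                // Product of lo..hi, fully independent of the other chunks.
                BigInteger p = BigInteger.ONE;
                for (int i = lo; i <= hi; i++) p = p.multiply(BigInteger.valueOf(i));
                partial[id] = p;
            });
            workers[c].start();
        }
        for (Thread w : workers) w.join();
        BigInteger result = BigInteger.ONE;
        for (BigInteger p : partial) result = result.multiply(p); // accumulate with *
        System.out.println(result.bitLength() + " bits");
    }
}
```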
The same goes for large amounts of data. If you have a collected array of points and want to find some particular points (bigger than some constant, the max of all) or just collect statistical information (sum of coordinates, number of occurrences), use parallel computations. If you need to collect "ongoing" information (longest rising sequence)... well, this is still possible, but much harder.
The difference between servers and client PCs is that client PCs don't have many cores, and parallel computation on a single core will only decrease performance, not increase it. So, do not create more threads than the number of cores in the user's PC (and the same goes for computing clusters: do not split the task into more subtasks than the number of computers in the cluster).
Hadoop's MapReduce lets you create parallel computations efficiently. You can also search for more specific Java libraries that allow evaluating in parallel. For example, Parallel Colt implements high-performance concurrent algorithms for working with big matrices, and there are lots of such libraries for many data representations.
In addition to what Roman said, you should see whether the client's PC has multiple CPUs/CPU cores/hyperthreading. If there's just a single CPU with a single core and no hyperthreading, you won't benefit from parallelizing a computation. Otherwise, it depends on the nature of your computation.
If you are going to parallelize, make sure to use Java 1.5+ so that you can use the concurrency API. At runtime, determine the number of CPU cores with Runtime.getRuntime().availableProcessors(). For most tasks, you will want to create a thread pool with that many threads, e.g. Executors.newFixedThreadPool(numThreads), and submit tasks to the Executor. In order to get more specific, you will have to provide information about your particular computation, as Roman suggested.
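Putting those two calls together, a sketch of the recommended setup for, say, summing a large array of point values in parallel (the data and the chunking are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        double[] points = new double[1_000_000];          // illustrative data
        int numThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);

        int chunk = points.length / numThreads;
        List<Future<Double>> futures = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            final int lo = t * chunk;
            final int hi = (t == numThreads - 1) ? points.length : lo + chunk;
            futures.add(pool.submit(() -> {
                double sum = 0;                           // each task sums its own slice
                for (int i = lo; i < hi; i++) sum += points[i];
                return sum;
            }));
        }
        double total = 0;
        for (Future<Double> f : futures) total += f.get(); // combine the partial sums
        pool.shutdown();
        System.out.println(total);
    }
}
```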
If the problem you're going to solve is naturally parallelizable then there's a way to use multithreading to improve performance.
If there are many parts which should be computed serially (i.e. you can't compute the second part until the first part is computed) then multithreading isn't the way to go.
Describe the concrete problem and, maybe, we'll be able to provide you more help.
