Improving performance with Distributed Counter, looking for library


We have a system with many threads, each incrementing the same counter. At the end, we need the total number of increments across all threads. Due to the size of the final result and the cost of synchronization, we suspect a performance issue with our current solution, which uses synchronized access to a single variable.
To avoid synchronization, I would like to use a Distributed Counter (correct term?), where each thread increments its own copy of the counter. The individual counters are summed up only once, when the final result is requested.
I could implement such a counter from scratch, but I guess I'm not the first one with such a requirement. Surprisingly, a quick search did not turn up any library. Could you suggest a library or some demo code? I'm looking for a simple solution, not a heavy framework.

Does your system have many different processes managing all the different threads?
If all threads are managed by the same process, I don't think you need a distributed resource (counter); you can just use an AtomicInteger, as suggested.
Atomic means that it is thread safe: it can be accessed from many threads without data corruption.
If your system does use many processes, then you will need a distributed resource.
You can use any type of database to achieve that.
Redis seems to me like a good option,
or any MySQL database if you want 100% data consistency.

The solution you propose yourself is a CRDT counter. Perhaps searching for that keyword lets you find a suitable implementation.

If it is within one JVM process, just read the thread-local counters and sum them up.
If it is inter-process, memory-mapped files are great for performance; only the file-level (or buffer-level) I/O API is fiddly when it comes to reading and writing.
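For the single-JVM case, the JDK (since Java 8) already ships a class that behaves much like the counter described in the question: java.util.concurrent.atomic.LongAdder spreads contended updates over several internal cells and only adds them up when sum() is called. The following is a minimal sketch, not a benchmark; the class name, method name and thread counts are made up for illustration.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class: many threads increment one LongAdder,
// and the total is read once after all of them have finished.
class LongAdderCounterSketch {

    static long countWithThreads(int threadCount, int incrementsPerThread) {
        LongAdder counter = new LongAdder();
        Thread[] workers = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.increment(); // no synchronized block needed
                }
            });
            workers[t].start();
        }

        for (Thread worker : workers) {
            try {
                worker.join(); // wait until every worker is done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        // The internal cells are summed only here.
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(8, 100_000));
    }
}
```

Whether this is actually faster than a synchronized variable or an AtomicLong depends on how heavily the counter is contended, so it is worth measuring in the real system.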

Related

Increasing program speedup when using shared memory

I have a program that calculates Pi from the Chudnovsky formula. It's written in Java and it uses a shared Vector that is used to save intermediate calculations like factorials and powers that include the index of the element.
However, I believe that since it's a synchronized Vector (thread safe by default), only one thread can read or write to it at a time. So when we have lots of threads, instead of seeing increasing speedup, we see the computation time become constant.
Is there anything that I can do to circumvent that? What to do when there are too many threads reading/writing to the same shared memory?

When the access pattern is lots of reads and occasional writes, you can protect an unsynchronized data structure with a ReentrantReadWriteLock. It allows multiple readers, but only a single writer.
Depending on your implementation, you might also benefit from using a ConcurrentHashMap.
You might be able to cheat a bit and use either an AtomicIntegerArray or an AtomicReferenceArray of Futures/CompletionStages.
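As a rough illustration of the ReentrantReadWriteLock suggestion, here is a sketch of a factorial cache backed by a plain ArrayList. The class and method names are hypothetical and not taken from the asker's program; the point is only that lookups of already computed values share the read lock, while extending the cache takes the write lock.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical cache of factorials shared between worker threads.
class FactorialCacheSketch {
    private final List<BigInteger> values = new ArrayList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    FactorialCacheSketch() {
        values.add(BigInteger.ONE); // 0! = 1
    }

    BigInteger factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Fast path: many threads may hold the read lock at the same time.
        lock.readLock().lock();
        try {
            if (n < values.size()) {
                return values.get(n);
            }
        } finally {
            lock.readLock().unlock();
        }
        // Slow path: only one thread at a time may extend the list.
        lock.writeLock().lock();
        try {
            // Start from the current size, because another thread may
            // have extended the list while we waited for the write lock.
            for (int i = values.size(); i <= n; i++) {
                values.add(values.get(i - 1).multiply(BigInteger.valueOf(i)));
            }
            return values.get(n);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        FactorialCacheSketch cache = new FactorialCacheSketch();
        System.out.println(cache.factorial(20));
    }
}
```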

Store the results of each thread in a stack. One thread collects results from every thread and adds them together. Of course the stack should not be empty.
If you want multiple threads to work on factorials why not create a thread or two that produce a list of factorial results. Other threads can just look up results if needed.

Instead of having the same shared memory, you can have multiple threads with individual memories in a stack. Eventually, add all these up together (or occasionally) with one thread!
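The idea of per-thread results that one thread adds up at the end can be sketched with an ExecutorService: each task accumulates into a local variable and returns it, and only the calling thread combines the partial results. The workload here (summing 1..n) and all names are placeholders chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo: no shared mutable state while the workers run.
class PartialSumsSketch {

    static long parallelSum(int workers, int n) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            int chunk = (n + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = w * chunk + 1;
                final int to = Math.min(n, (w + 1) * chunk);
                Callable<Long> task = () -> {
                    long local = 0; // private to this task
                    for (int i = from; i <= to; i++) {
                        local += i;
                    }
                    return local;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // combined by a single thread
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(4, 1000)); // prints 500500
    }
}
```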

If you need high throughput, you can consider using Disruptor and RingBuffer.
At a crude level you can think of a Disruptor as a multicast graph of queues where producers put objects on it that are sent to all the consumers for parallel consumption through separate downstream queues. When you look inside you see that this network of queues is really a single data structure - a ring buffer.
Each producer and consumer has a sequence counter to indicate which slot in the buffer it's currently working on. Each producer/consumer writes its own sequence counter but can read the others' sequence counters
A few useful links:
https://lmax-exchange.github.io/disruptor
http://martinfowler.com/articles/lmax.html
https://softwareengineering.stackexchange.com/questions/244826/can-someone-explain-in-simple-terms-what-is-the-disruptor-pattern

Why does sharing a static variable between threads reduce performance?

I asked a question here and someone left a comment saying that the problem is that I'm sharing a static variable.
Why is that a problem?

Sharing a static variable in and of itself should have no adverse effect on performance. Global data is common in all programs, starting with the JVM and OS constructs.
Mutable shared data is a different story as the mutation of shared data can lead to both performance issues (cache misses at the very least) and correctness issues which are a pain and are often solved using locks, which lead to potentially other performance issues.

The wiki static variable looks like a pretty substantial part of your program. Not knowing anything about what it's doing or how it's coded, I would guess that it does locking in order to keep a consistent state. If most of your threads are spending their time blocking, waiting to acquire access to this same object, then that would explain why you're not seeing any gain from using multiple threads.
For threads to make a difference to the performance of your program they have to be reasonably independent, and not all locking on the same thing. The more locking they have to do, the less gain you will see. So try to split out the work so as much can be done independently as possible. For instance if there are work items that can be gathered independently, then you might be better off by having multiple threads go find the work items, then feed them to a queue that a dedicated thread can use to pull work items off the queue and feed them to the wiki object.
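A minimal sketch of that hand-off, assuming the work items can be represented as integers: several producer threads gather items independently and put them on a BlockingQueue, and one dedicated thread is the only one that touches the shared state (a plain long total stands in for the wiki object here). The names and the poison-pill marker are illustrative choices, not part of the original program.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo: many producers, one consumer that owns the shared state.
class SingleWriterQueueSketch {
    private static final int POISON = -1; // marks "this producer is done"

    static long run(int producerCount, int itemsPerProducer) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        long[] total = new long[1]; // only the consumer thread writes this

        Thread consumer = new Thread(() -> {
            int finishedProducers = 0;
            try {
                while (finishedProducers < producerCount) {
                    int item = queue.take(); // blocks until an item arrives
                    if (item == POISON) {
                        finishedProducers++;
                    } else {
                        total[0] += item; // no lock: single writer
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        Thread[] producers = new Thread[producerCount];
        for (int p = 0; p < producerCount; p++) {
            producers[p] = new Thread(() -> {
                for (int i = 1; i <= itemsPerProducer; i++) {
                    queue.add(i); // stands in for "found a work item"
                }
                queue.add(POISON);
            });
            producers[p].start();
        }

        for (Thread producer : producers) {
            join(producer);
        }
        join(consumer); // after this, total[0] is safely visible here
        return total[0];
    }

    private static void join(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(3, 100)); // prints 15150
    }
}
```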

Java: Using ConcurrentHashMap as a lock manager

I'm writing a highly concurrent application, needing access to a large fine-grained set of shared resources. I'm currently writing a global lock manager to organize this. I'm wondering if I can piggyback off the standard ConcurrentHashMap and use that to handle the locking? I'm thinking of a system like the following:
A single global ConcurrentHashMap object contains a mapping between the unique string id of the resource and the unique id of the thread using the resource
Tune the concurrency factor to reflect the need for a high level of concurrency
Locks are acquired using the atomic conditional replace(K key, V oldValue, V newValue) method in the hashmap
To prevent lock contention when locking multiple resources, locks must be acquired in alphabetical order
Are there any major issues with the setup? How will the performance be?
I know this is probably going to be much slower and more memory-heavy than a properly written locking system, but I'd rather not spend days trying to write my own, especially given that I probably won't be able to match Java's professionally-written concurrency code implementing the map.
Also, I've never used ConcurrentHashMap in a high-load situation, so I'm interested in the following:
How well will this scale to large numbers of elements? (I'm looking at ~1,000,000 being a good cap. If I reach beyond that I'd be willing to rewrite this more efficiently)
The documentation states that re-sizing is "relatively" slow. Just how slow is it? I'll probably have to re-size the map once every minute or so. Is this going to be problematic with the size of map I'm looking at?
Edit: Thanks Holger for pointing out that HashMaps shouldn't have that big of an issue with scaling
Also, is there a better/more standard method out there? I can't find any places where a system like this is used, so I'm guessing that either I'm not seeing a major flaw, or there's something else.
Edit:
The application I'm writing is a network service, handling a variable number of requests. I'm using the Grizzly project to balance the requests among multiple threads.
Each request uses a small number of the shared resources (~30), so in general, I do not expect a great deal of contention. The requests usually finish working with the resources in under 500ms. Thus, I'd be fine with a bit of blocking/continuous polling, as the requests aren't extremely time-sensitive and contention should be minimal.
In general, seeing that a proper solution would be quite similar to how ConcurrentHashMap works behind the scenes, I'm wondering if I can safely use that as a shortcut instead of writing/debugging/testing my own version.

The re-sizing issue is not relevant, as you already gave an estimate of the number of elements in your question. So you can give a ConcurrentHashMap an initial capacity large enough to avoid any rehashing.
The performance will not depend on the number of elements (that's the main goal of hashing) but on the number of concurrent threads.
The main problem is that you don’t have a plan of how to handle failed locks. Unless you want to poll until locking succeeds (which is not recommended) you need a way of putting a thread to sleep which implies that the thread currently owning the lock has to wake up a sleeping thread on release if one exists. So you end up requiring conventional Lock features a ConcurrentHashMap does not offer.
Creating a Lock per element (as you said ~1,000,000) would not be a solution.
A solution would look a bit like the ConcurrentHashMap works internally. Given a certain concurrency level, i.e. the number of threads you might have (rounded up), you create that number of Locks (which would be a far smaller number than 1,000,000).
Now you assign each element one of the Locks. A simple assignment would be based on the element’s hashCode, assuming it is stable. Then locking an element means locking the assigned Lock which gives you up to the configured concurrency level if all currently locked elements are mapped to different Locks.
This might imply that threads locking different elements block each other if the elements are mapped to the same Lock, but with a predictable likelihood. You can try fine-tuning the concurrency level (as said, use a number higher than the number of threads) to find the best trade-off.
A big advantage of this approach is that you do not need to maintain a data structure that depends on the number of elements. Afaik, the new parallel ClassLoader uses a similar technique.
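The scheme described in this answer is often called lock striping. A minimal sketch, assuming resources are identified by objects with a stable hashCode (for example the string ids from the question); the class name, method name and stripe count are hypothetical.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical lock-striping helper: a fixed pool of locks is shared by
// an arbitrary number of resources, selected by the resource's hash code.
class StripedLocksSketch {
    private final Lock[] stripes;

    StripedLocksSketch(int stripeCount) {
        if (stripeCount <= 0) {
            throw new IllegalArgumentException("stripeCount must be positive");
        }
        stripes = new Lock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    // The same id always maps to the same lock; different ids may collide.
    Lock lockFor(Object resourceId) {
        int index = Math.floorMod(resourceId.hashCode(), stripes.length);
        return stripes[index];
    }

    public static void main(String[] args) {
        StripedLocksSketch locks = new StripedLocksSketch(64);
        Lock lock = locks.lockFor("resource-42");
        lock.lock(); // blocks (sleeps) if another thread holds this stripe
        try {
            System.out.println("working on resource-42");
        } finally {
            lock.unlock();
        }
    }
}
```

One design point to keep in mind: if a request needs several resources at once, acquiring the stripes in a consistent order (for example by stripe index rather than by resource name) is what avoids deadlock, since two different names can share a stripe.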

Automatic parallelization

What is your opinion of a project that would try to take code and split it into threads automatically (maybe at compile time, probably at runtime)?
Take a look at the code below:
for (int i = 0; i < 100; i++)
    sum1 += rand(100);
for (int j = 0; j < 100; j++)
    sum2 += rand(100) / 2;
This kind of code can automatically get split to 2 different threads that run in parallel.
Do you think it's even possible?
I have a feeling that theoretically it's impossible (it reminds me of the halting problem) but I can't justify this thought.
Do you think it's a useful project? Is there anything like it?

This is called automatic parallelization. If you're looking for some program you can use that does this for you, it doesn't exist yet. But it may eventually. This is a hard problem and is an area of active research. If you're still curious...
It's possible to automatically split your example into multiple threads, but not in the way you're thinking. Some current techniques try to run each iteration of a for-loop in its own thread. One thread would get the even indices (i=0, i=2, ...), the other would get the odd indices (i=1, i=3, ...). Once that for-loop is done, the next one could be started. Other techniques might get crazier, executing the i++ increment in one thread and the rand() on a separate thread.
As others have pointed out, there is a true dependency between iterations because rand() has internal state. That doesn't stand in the way of parallelization by itself. The compiler can recognize the memory dependency, and the modified state of rand() can be forwarded from one thread to the other. But it probably does limit you to only a few parallel threads. Without dependencies, you could run this on as many cores as you had available.
If you're truly interested in this topic and don't mind sifting through research papers:
Automatic thread extraction with decoupled software pipelining (2005) by G. Ottoni.
Speculative parallelization using software multi-threaded transactions (2010) by A. Raman.

This is practically not possible.
The problem is that you need to know, in advance, a lot more information than is readily available to the compiler, or even the runtime, in order to parallelize effectively.
While it would be possible to parallelize very simple loops, even then there's a risk involved. For example, your code above could only be parallelized if rand() is thread-safe, and many random number generation routines are not. (Java's Math.random() is synchronized for you, however.)
Trying to do this type of automatic parallelization is, at least at this point, not practical for any "real" application.

It's certainly possible, but it is an incredibly hard task. This has been the central thrust of compiler research for several decades. The basic issue is that we cannot make a tool that can find the best partition into threads for Java code (this is equivalent to the halting problem).
Instead we need to relax our goal from the best partition into some partition of the code. This is still very hard in general. So then we need to find ways to simplify the problem, one is to forget about general code and start looking at specific types of program. If you have simple control-flow (constant bounded for-loops, limited branching....) then you can make much more head-way.
Another simplification is reducing the number of parallel units that you are trying to keep busy. If you put both of these simplifications together then you get the state of the art in automatic vectorisation (a specific type of parallelisation that is used to generate MMX / SSE style code). Getting to that stage has taken decades but if you look at compilers like Intel's then performance is starting to get pretty good.
If you move from vector instructions inside a single thread to multiple threads within a process then you have a huge increase in latency moving data between the different points in the code. This means that your parallelisation has to be a lot better in order to win against the communication overhead. Currently this is a very hot topic in research, but there are no automatic user-targeted tools available. If you can write one that works it would be very interesting to many people.
For your specific example, if you assume that rand() is a parallel version so you can call it independently from different threads then it's quite easy to see that the code can be split into two. A compiler would just need dependency analysis to see that neither loop uses data from or affects the other. So the order between them in the user-level code is a false dependency that could be split (i.e. by putting each in a separate thread).
But this isn't really how you would want to parallelise the code. It looks as if each loop iteration is dependent on the previous as sum1 += rand(100) is the same as sum1 = sum1 + rand(100) where the sum1 on the right-hand-side is the value from the previous iteration. However the only operation involved is addition, which is associative, so we can rewrite the sum in many different ways.
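Done by hand, that split might look like the sketch below. It assumes rand(n) returns an integer in [0, n) and uses ThreadLocalRandom so that each thread has its own generator; both are assumptions, since the question does not say how rand() is implemented. The class and method names are made up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical demo: the two independent loops from the question,
// each submitted as its own asynchronous task.
class IndependentLoopsSketch {

    // Assumed meaning of rand(n): uniform integer in [0, n).
    static int rand(int bound) {
        return ThreadLocalRandom.current().nextInt(bound);
    }

    static int[] runBothLoops() {
        CompletableFuture<Integer> first = CompletableFuture.supplyAsync(() -> {
            int sum1 = 0;
            for (int i = 0; i < 100; i++) {
                sum1 += rand(100);
            }
            return sum1;
        });
        CompletableFuture<Integer> second = CompletableFuture.supplyAsync(() -> {
            int sum2 = 0;
            for (int j = 0; j < 100; j++) {
                sum2 += rand(100) / 2;
            }
            return sum2;
        });
        // join() waits for each task; neither loop reads the other's data.
        return new int[] { first.join(), second.join() };
    }

    public static void main(String[] args) {
        int[] sums = runBothLoops();
        System.out.println("sum1=" + sums[0] + " sum2=" + sums[1]);
    }
}
```

For loops this small, the cost of handing the work to other threads will most likely outweigh any gain; the sketch only shows that the two loops are independent.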
sum1 = (((rand_0 + rand_1) + rand_2) + rand_3) ....
sum1 = (rand_0 + rand_1) + (rand_2 + rand_3) ...
The advantage of the second is that each single addition in brackets can be computed in parallel to all of the others. Once you have 50 results then they can be combined in a further 25 additions and so on... The total work stays about the same (50+25+12+6+3+2+1 = 99 additions to combine 100 values, versus 100 in the original), but there are only 7 sequential steps, so apart from the parallel forking/joining and communication overhead it runs about 14 times quicker. This tree of additions is usually called a reduction in parallel architectures (it is sometimes loosely described as a gather), and it tends to be the expensive part of a computation.
On a very parallel architecture such as a GPU the above description would be the best way to parallelise the code. If you're using threads within a process it would get killed by the overhead.
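To make the tree of additions concrete, the sketch below performs the pairwise combination level by level and counts what it actually does. It runs on a single thread: the additions inside one level are independent of each other, which is where a parallel machine would gain, but here they are simply executed in a loop. The names are hypothetical, and the input 1..100 replaces the random values so that the result can be checked.

```java
// Hypothetical demo of a pairwise (tree) reduction over an array.
class PairwiseReductionSketch {

    static long[] oneToN(int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = i + 1;
        }
        return values;
    }

    // Returns { sum, number of levels, number of additions }.
    static long[] reduce(long[] values) {
        if (values.length == 0) {
            return new long[] { 0, 0, 0 };
        }
        long[] current = values.clone();
        int n = current.length;
        long levels = 0;
        long additions = 0;
        while (n > 1) {
            int pairs = n / 2;
            // Every addition in this loop is independent of the others.
            for (int i = 0; i < pairs; i++) {
                current[i] = current[2 * i] + current[2 * i + 1];
                additions++;
            }
            if (n % 2 == 1) {
                current[pairs] = current[n - 1]; // odd element carried over
            }
            n = pairs + (n % 2);
            levels++;
        }
        return new long[] { current[0], levels, additions };
    }

    public static void main(String[] args) {
        long[] result = reduce(oneToN(100));
        System.out.println("sum=" + result[0]
                + " levels=" + result[1]
                + " additions=" + result[2]);
        // prints sum=5050 levels=7 additions=99
    }
}
```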
In summary: it is impossible to do perfectly, it is very hard to do well, there is lots of active research in finding out how much we can do.

Whether it's possible in the general case to know whether a piece of code can be parallelized does not really matter, because even if your algorithm cannot detect all cases that can be parallelized, maybe it can detect some of them.
That does not mean it would be useful. Consider the following:
First of all, to do it at compile-time, you have to inspect all code paths you can potentially reach inside the construct you want to parallelize. This may be tricky for anything but simple computations.
Second, you have to somehow decide what is parallelizable and what is not. You cannot trivially break up a loop that modifies the same state into several threads, for example. This is probably a very difficult task and in many cases you will end up with not being sure - two variables might in fact reference the same object.
Even if you could achieve this, it would end up confusing for the user. It would be very difficult to explain why his code was not parallelizable and how it should be changed.
I think that if you want to achieve this in Java, you need to write it more as a library, and let the user decide what to parallelize (library functions together with annotations? just thinking aloud). Functional languages are much more suited for this.
As a piece of trivia: during a parallel programming course, we had to inspect code and decide whether it was parallelizable or not. I cannot remember the specifics (something about the "at-most-once" property? Someone fill me in?), but the moral of the story is that it was extremely difficult even for what appeared to be trivial cases.

There are some projects that try to simplify parallelization - such as Cilk. It doesn't always work that well, however.

I've learnt that as of JDK 1.8 (Java 8), you can make use of multiple CPU cores when working with streams by using parallelStream().
However, before going to production with parallelStream(), it is always better to benchmark the sequential and parallel versions against each other and then decide which one is the better fit.
The reason is that there are scenarios where the parallel stream performs dramatically worse than the sequential one, for example when the operation has to do automatic boxing and unboxing. For those scenarios it's advisable to use the Java 8 primitive streams such as IntStream, LongStream and DoubleStream.
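A small sketch of the variants being compared, using the sum of 1..n as a placeholder workload. The class and method names are made up, and no timings are shown because they depend on the machine and on how the measurement is done; a benchmark harness such as JMH is the usual way to get trustworthy numbers.

```java
import java.util.stream.LongStream;
import java.util.stream.Stream;

// Hypothetical demo: the same sum written three ways.
class ParallelStreamSketch {

    // Boxed Long values; Stream.iterate is also hard to split into chunks.
    static long boxedParallelSum(long n) {
        return Stream.iterate(1L, i -> i + 1)
                .limit(n)
                .parallel()
                .reduce(0L, Long::sum);
    }

    // Primitive stream over a range: no boxing, easy to split.
    static long primitiveParallelSum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    // Sequential baseline to compare against.
    static long sequentialSum(long n) {
        return LongStream.rangeClosed(1, n).sum();
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println(boxedParallelSum(n));
        System.out.println(primitiveParallelSum(n));
        System.out.println(sequentialSum(n));
    }
}
```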
Reference: Modern Java in Action: Manning Publications 2019

The programming language is Java, and Java runs on a virtual machine. So shouldn't one be able to execute the code at runtime on different threads owned by the VM? Since all the memory etc. is handled by the VM, it would not cause any corruption. You could see the code as a stack of instructions, estimate their execution time, and then distribute them over an array of threads, each of which has an execution stack of roughly the same duration. It might be dangerous, though: some graphics code, like OpenGL immediate mode, needs to maintain order and mostly should not be threaded at all.

Terracotta Performance and Tips

I am just learning how to use Terracotta after discovering it about a month ago. It is a very cool technology.
Basically what I am trying to do:
My root (System of Record) is a ConcurrentHashMap.
The main Instrumented Class is a "JavaBean" with 30 or so fields that I want to exist in the HashMap.
There will be about 20000 of these JavaBeans that exist in the Hashmap.
Each bean has (at least) 5 fields that will be updated every 5 seconds.
(The reason I am using Terracotta for this is because these JavaBeans need to be accessible across JVMs and nodes.)
Anyone with more experience than me with TC have any tips? Performance is key.
Any examples of other similar applications?

You might find that batching several changes under one lock scope will perform better. Each synchronized block/method forms a write transaction (assuming you use a write lock) that must be sent to the server (and possibly back out to other nodes). By changing a bunch of fields, possibly on a bunch of objects under one lock, you reduce the overhead of creating a transaction. Something to play with at least.
Partitioning is also a key way to improve performance. Changes only need to be sent to nodes that are actually using an object. So if you can partition which nodes usually touch specific objects that reduces the number of changes that have to be sent around the cluster, which improves performance.
unnutz's suggestions about using CHM or CSM are good ones. CHM allows greater concurrency (as each internal segment can be locked and used concurrently) - make sure to experiment with larger segment counts too. CSM has effectively one lock per entry so has effectively N partitions in an N-sized table. That can greatly reduce lock contention (at the cost of managing more internal lock objects). Changes coming soon for CSM will make the lock mgmt cost much lower.
Generally we find a good strategy is:
Build a performance test (it should be multi-threaded and multi-node and similar to your app, or your actual app!)
Tune objects - look at your clustered object graph in the dev-console to find objects that don't need to be clustered at all - sometimes this happens accidentally (remove or cut the cluster with a transient field). Sometimes you might be clustering a Date where a long would do. Small change but that's one object per map entry and that might make a difference.
Tune locks - use the lock profiler in the dev-console to find hot locks or locks that are too narrow or too wide. The clustered stats recorder can help look at transaction size as well.
Tune GC and DGC - tune JVM garbage collection, then tune Terracotta distributed GC by turning on or changing the frequency of young-generation GC.
Tune TC server - lots of very detailed tunings to do here, but usually not worth it till the stuff above is tuned.
Feel free to ask on the Terracotta forums as well - all of engineering, field engineering, product mgmt watch those and answer there.

Firstly, I would suggest you raise this question on their forums too.
Secondly, the performance of your application clustered over Terracotta will depend on the number of write transactions that happen. So you could consider using ConcurrentStringMap (if your keys are Strings) or ConcurrentHashMap. Note that CSM is much better than CHM from a performance point of view.
After all, POJOs are loaded lazily. That means each property is loaded on-demand.
Hope that helps.
Cheers
