Terracotta Performance and Tips - java

I am just learning how to use Terracotta after discovering it about a month ago. It is a very cool technology.
Basically what I am trying to do:
My root (System of Record) is a ConcurrentHashMap.
The main Instrumented Class is a "JavaBean" with 30 or so fields that I want to exist in the HashMap.
There will be about 20000 of these JavaBeans that exist in the Hashmap.
Each bean has (at least) 5 fields that will be updated every 5 seconds.
(The reason I am using Terracotta for this is because these JavaBeans need to be accessible across JVMs and nodes.)
Anyone with more experience than me with TC have any tips? Performance is key.
Any examples of other, similar applications?

You might find that batching several changes under one lock scope will perform better. Each synchronized block/method forms a write transaction (assuming you use a write lock) that must be sent to the server (and possibly back out to other nodes). By changing a bunch of fields, possibly on a bunch of objects under one lock, you reduce the overhead of creating a transaction. Something to play with at least.
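As a rough sketch of that idea (the bean class, its fields, and the update method below are invented for illustration, not taken from the question; the exact transaction boundaries depend on your Terracotta lock configuration):

import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of batching several field updates under one lock scope.
// QuoteBean and its fields stand in for the ~30-field JavaBean from the question.
public class BatchUpdateExample {

    public static class QuoteBean {
        private double bid, ask, last;
        private long volume, timestamp;
        public void setBid(double v)     { bid = v; }
        public void setAsk(double v)     { ask = v; }
        public void setLast(double v)    { last = v; }
        public void setVolume(long v)    { volume = v; }
        public void setTimestamp(long v) { timestamp = v; }
    }

    private final ConcurrentHashMap<String, QuoteBean> root =
            new ConcurrentHashMap<String, QuoteBean>();

    // One synchronized block around all five setters should form (roughly) one
    // clustered write transaction, instead of one transaction per setter call.
    public void updateQuote(String id, double bid, double ask, double last, long volume) {
        QuoteBean bean = root.get(id);   // assumes the bean was already put into the map
        synchronized (bean) {
            bean.setBid(bid);
            bean.setAsk(ask);
            bean.setLast(last);
            bean.setVolume(volume);
            bean.setTimestamp(System.currentTimeMillis());
        }
    }
}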
Partitioning is also a key way to improve performance. Changes only need to be sent to nodes that are actually using an object. So if you can partition which nodes usually touch specific objects that reduces the number of changes that have to be sent around the cluster, which improves performance.
unnutz's suggestions about using CHM or CSM are good ones. CHM allows greater concurrency (as each internal segment can be locked and used concurrently) - make sure to experiment with larger segment counts too. CSM has effectively one lock per entry so has effectively N partitions in an N-sized table. That can greatly reduce lock contention (at the cost of managing more internal lock objects). Changes coming soon for CSM will make the lock mgmt cost much lower.
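If you do experiment with segment counts, note that on the pre-Java-8 ConcurrentHashMap the third constructor argument is what controls them; the numbers below are placeholders to tune, not recommendations:

import java.util.concurrent.ConcurrentHashMap;

public class SegmentedMapExample {
    // initialCapacity, loadFactor, concurrencyLevel (number of internal segments)
    private final ConcurrentHashMap<String, Object> map =
            new ConcurrentHashMap<String, Object>(20000, 0.75f, 64);
}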
Generally we find a good strategy is:
Build a performance test - it should be multi-threaded, multi-node, and similar to your app (or be your actual app!).
Tune objects - look at your clustered object graph in the dev-console to find objects that don't need to be clustered at all - sometimes this happens accidentally (remove them, or cut them out of the cluster with a transient field). Sometimes you might be clustering a Date where a long would do. Small change, but that's one object per map entry, and that might make a difference.
Tune locks - use the lock profiler in the dev-console to find hot locks or locks that are too narrow or too wide. The clustered stats recorder can help look at transaction size as well.
Tune GC and DGC - tune JVM garbage collection, then tune Terracotta distributed GC by turning on young-gen GC or changing its frequency.
Tune TC server - lots of very detailed tunings to do here, but usually not worth it till the stuff above is tuned.
Feel free to ask on the Terracotta forums as well - all of engineering, field engineering, product mgmt watch those and answer there.

Firstly, I would suggest you raise this question on their forums too.
Secondly, the performance of your application clustered over Terracotta will depend on the number of write transactions that happen. So you could consider using ConcurrentStringMap (if your keys are Strings) or ConcurrentHashMap. Note that CSM performs much better than CHM.
Also, POJOs are loaded lazily, meaning each property is loaded on demand.
Hope that helps.
Cheers

Related

Why do we need the volatile keyword when the core cache synchronization is done on the hardware level?

So I’m currently listening to this talk.
At minute 28:50 the following statement is made: "the fact that on the hardware it could be in main memory, in multiple level 3 caches, in four level 2 caches […] is not your problem. That's the problem for the hardware designers."
Yet, in Java we have to declare a boolean that stops a thread as volatile, since when another thread calls the stop method, it's not guaranteed that the running thread will be aware of this change.
Why is this the case, when the hardware level should take care of updating every cache with the correct value?
I’m sure I’m missing something here.
Code in question:
public class App {
    public static void main(String[] args) {
        Worker worker = new Worker();
        worker.start();
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        worker.signalStop();
        System.out.println(worker.isShouldStop());
        System.out.println(worker.getVal());
        System.out.println(worker.getVal());
    }

    static class Worker extends Thread {
        private /*volatile*/ boolean shouldStop = false;
        private long val = 0;

        @Override
        public void run() {
            while (!shouldStop) {
                val++;
            }
            System.out.println("Stopped");
        }

        public void signalStop() {
            this.shouldStop = true;
        }

        public long getVal() {
            return val;
        }

        public boolean isShouldStop() {
            return shouldStop;
        }
    }
}
You are assuming the following:
Compiler doesn't reorder the instructions
CPU performs the loads and stores in the order as specified by your program
Then your reasoning makes sense, and this consistency model is called sequential consistency (SC): there is a total order over loads/stores that is consistent with the program order of each thread. In simple terms: just some interleaving of the loads/stores. The requirements for SC are a bit more strict, but this captures the essence.
If Java and the CPU were SC, there would be no purpose in making something volatile.
The problem is that you would get terrible performance. A lot of compiler optimizations rely on rewriting the instructions into something more efficient, and this can lead to reordering of loads and stores. The compiler could even decide to optimize out a load or a store so that it doesn't happen at all. This is all perfectly fine as long as there is just a single thread involved, because the thread will not be able to observe this reordering of loads/stores.
Apart from the compiler, the CPU also likes to reorder loads/stores. Imagine that a CPU needs to make a write, and the cache line for that write isn't in the right state. The CPU would block, and this would be very inefficient. Since the store is going to be made anyway, it is better to queue the store in a buffer so that the CPU can continue; as soon as the cache line is returned in the right state, the store is written to the cache line and committed to the cache. Store buffering is a technique used by a lot of processors (e.g. ARM/x86).
One problem with it is that it can lead to an earlier store to some address being reordered with a newer load to a different address. So instead of having a total order over all loads and stores like SC, you only get a total order over all stores. This model is called TSO (Total Store Order) and you can find it on x86 and SPARC v8/v9. This approach assumes that the stores in the store buffer are written to the cache in program order, but there is also a relaxation possible such that stores in the store buffer to different cache lines can be committed to the cache in any order; this is called PSO (Partial Store Order) and you can find it on SPARC v8/v9.
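A classic litmus test makes the store-load reordering concrete. This sketch is not from the question, but with plain (non-volatile) fields it can print r1=0, r2=0, which would be impossible under SC:

public class StoreBufferingDemo {
    static int x = 0, y = 0;   // declare these volatile to rule out the (0, 0) outcome
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Under sequential consistency at least one of r1, r2 must be 1.
        System.out.println("r1=" + r1 + ", r2=" + r2);
    }
}

(A single run rarely shows the relaxed outcome; in practice you run the two threads many times, e.g. with a harness like jcstress.)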
SC/TSO/PSO are strong memory models because every load and store is a synchronization action, so they order surrounding loads/stores. This can be pretty expensive, because for most instructions any ordering that preserves the data-dependency order would be fine, given that:
most memory is not shared between different CPUs.
if memory is shared, there is often some external synchronization, like an unlock/lock of a mutex or a release-store/acquire-load, that takes care of it. So the synchronization can be delayed.
CPUs with weak memory models, like ARM and Itanium, make use of this. They make a separation between plain loads and stores and synchronizing loads/stores, and for plain loads and stores any ordering is fine. Modern processors execute instructions out of order anyway; there is a lot of parallelism inside a single CPU.
Modern processors do implement cache coherence. The only modern processor that doesn't need to implement cache coherence is a GPU. Cache coherence can be implemented in two ways:
for small systems, the caches can sniff the bus traffic. This is where you see the MESI protocol. This technique is called sniffing (or snooping).
for larger systems, you can have a directory that knows the state of each cache line, which CPUs are sharing the cache line, and which CPU is owning the cache line (here there is some MESI-like protocol). All requests for a cache line go through the directory.
The cache coherence protocol makes sure that a cache line is invalidated on other CPUs before a different CPU can write to it. Cache coherence will give you a total order of loads/stores on a single address, but will not provide any ordering of loads/stores between different addresses.
Coming back to volatile:
So what volatile does is:
prevent reordering of loads and stores by the compiler and CPU.
ensure that a load/store becomes visible; so it prevents the compiler from optimizing out a load or store.
the load/store is atomic; so you don't get problems like a torn read/write. This includes compiler behavior like natural alignment of the field.
I have given you some technical information about what is happening behind the scenes. But to properly understand volatile, you need to understand the Java Memory Model. It is an abstract model that doesn't care about any of the implementation details described above. If you do not apply volatile in your example, you have a data race, because a happens-before edge is missing between concurrent conflicting accesses.
A great book on this topic is
A Primer on Memory Consistency and Cache Coherence, Second Edition. You can download it for free.
I can't recommend you any book on the Java Memory Model because it is all explained in an awful manner. Best to get an understanding of memory models in general before diving into the JMM. Probably the best sources are this doctoral dissertation by Jeremy Manson, and Aleksey Shipilëv: One Stop Page.
PS:
There are situations when you don't care about any ordering guarantees, e.g.
stop flag for a thread
progress indicators
blackholes for microbenchmarks.
This is where the VarHandle.getOpaque/setOpaque can be useful. It provides visibility and atomicity, but it doesn't provide any ordering guarantees with respect to other variables. This is mostly a compiler concern. Most engineers will never need this level of control.
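As a rough sketch (Java 9+), an opaque stop flag could look like the following; getOpaque/setOpaque give visibility and atomicity for this field only, without ordering other accesses around it. The class and method names are invented for illustration:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hedged sketch of an opaque stop flag. The write eventually becomes visible and
// is atomic, but there is no ordering with respect to other variables.
public class OpaqueStopFlag {
    private boolean shouldStop;  // accessed only through the VarHandle below

    private static final VarHandle SHOULD_STOP;
    static {
        try {
            SHOULD_STOP = MethodHandles.lookup()
                    .findVarHandle(OpaqueStopFlag.class, "shouldStop", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public void signalStop() {
        SHOULD_STOP.setOpaque(this, true);
    }

    public boolean shouldStop() {
        return (boolean) SHOULD_STOP.getOpaque(this);
    }
}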
What you're suggesting is that hardware designers just make the world all ponies and rainbows for you.
They cannot do that - what you want makes the notion of an on-core cache completely impossible. How could a CPU core possibly know that a given memory location needs to be synced up with another core before accessing it any further, short of just keeping the entire cache in sync on a permanent basis, completely invalidating the entire idea of an on-core cache?
If the talk is strongly suggesting that you as a software engineer can just blame hardware engineers for not making life easy for you, it's a horrible and stupid talk. I bet it was put with a little more nuance than that.
At any rate, you took the wrong lesson from it.
It's a two-way street. The hardware engineering team works together with the JVM team, effectively, to set up a consistent model that is a good equilibrium between 'With these constraints and limited guarantees to the software engineer, the hardware team can make reliable and significant performance improvements' and 'A software engineer can build multicore software with this model without tearing their hair out'.
This happy equilibrium in Java is the JMM (Java Memory Model), which primarily boils down to: all field accesses may have a local thread cache or not, you do not know, and you cannot test whether they do. Essentially the JVM has an evil coin and will flip it every time you read a field. Tails, you get the local copy. Heads, it syncs first. The coin is evil in that it is not fair and will land heads throughout development, testing, and the first week, every time, even if you flip it a million times. And then the important potential customer demoes your software and you start getting tails.
The solution is to make the JVM never flip it, and this means you need to establish Happens-Before/Happens-After relationships anytime you have a situation anywhere in your code where one thread writes a field and another reads it. volatile is one way to do it.
In other words, to give hardware engineers something to work with, you, the software engineer, effectively made the promise that you'll establish HB/HA if you care about synchronizing between threads. So that's your part of the 'deal'. Their part of the deal is that the hardware guarantees the behaviour if you keep up your end of the deal, and that the hardware is very very fast.
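For the posted Worker, keeping up your end of the deal can be as simple as uncommenting volatile on shouldStop. As an alternative sketch (not the only way), an AtomicBoolean flag gives the same guarantee:

import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: the AtomicBoolean provides the visibility/happens-before edge that the
// plain boolean in the question lacks; a volatile boolean works equally well here.
class StoppableWorker extends Thread {
    private final AtomicBoolean shouldStop = new AtomicBoolean(false);
    private long val = 0;

    @Override
    public void run() {
        while (!shouldStop.get()) {
            val++;
        }
        System.out.println("Stopped after " + val + " increments");
    }

    public void signalStop() {
        shouldStop.set(true);
    }
}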

Akka actor message needs memory pool

I am new to Java. I'm a C++ programmer and have been studying Java for 2 months.
Sorry for my poor English.
I have a question about whether a memory pool or object pool is needed for the Akka actor model. I think that if I send messages from one actor to another, I have to allocate some heap memory (e.g. a new String, a new BigInteger, and so on). Over time the garbage collector will kick in (I'm not sure whether it will), and that will slow my application down.
So I searched for a way to make a memory pool and failed (Java does not support memory pools). I could build an object pool, but in other projects I did not find anybody using an object pool with actors (including on the Akka homepage).
Are there any documents about this topic on the Akka homepage? Please tell me the link, or tell me the solution to my question.
Thanks.
If, as is likely, you are using Akka across multiple computers, messages are serialized on the wire and sent to the other instance. This means that a local memory pool alone won't suffice.
While it's technically possible to write a custom JSerializer implementation (see the doc here) that stores local messages in a memory pool after deserializing them, I feel that's a bit of overkill for most applications (and easy to get wrong, actually worsening performance with lookup times in the map).
Yes, when the GC kicks in, the app will lag a bit under heavy loads. But in 95% of the scenarios, especially under a performant framework like Akka, GC will not be your bottleneck: IO will.
I'm not saying you shouldn't do it. I'm saying that before you take on the task, given its non-triviality, you should measure the impact of GC on your app at runtime with things like Kamon or other Akka-specialized monitoring solutions, and go for it only after you are sure it's worth it.
Using an ArrayBlockingQueue to hold a pool of your objects should help.
Here is the example code.
To create a pool and insert an instance of a pooled object into it:
BlockingQueue<YOURCLASS> queue = new ArrayBlockingQueue<YOURCLASS>(256); // Adjust 256 to your desired count. An ArrayBlockingQueue's size cannot be adjusted once it is initialized.
queue.put(YOUROBJ); // This should be in the code that instantiates the pool
and later, where you need it (in the actor that receives the message):
YOURCLASS instanceName = queue.take();
You might have to write some code around this to create and manage the pool.
But this is the gist of it.
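A hedged sketch of such a wrapper (the names SimplePool, borrow, and release are invented here), pre-filling the queue at construction and blocking when the pool is exhausted:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Illustrative pool wrapper around an ArrayBlockingQueue; adjust the capacity and
// blocking behaviour to your actor's needs.
public class SimplePool<T> {
    private final BlockingQueue<T> queue;

    public SimplePool(int capacity, Supplier<T> factory) {
        this.queue = new ArrayBlockingQueue<T>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.add(factory.get());   // pre-fill the pool
        }
    }

    // Blocks until an instance is available.
    public T borrow() throws InterruptedException {
        return queue.take();
    }

    // Return the instance when the actor is done with the message.
    public void release(T instance) {
        queue.offer(instance);          // silently dropped if the pool is somehow full
    }
}

// Usage: SimplePool<StringBuilder> pool = new SimplePool<>(256, StringBuilder::new);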
One can do object pooling to minimise the long tail of latency (by sacrificing the median in a multithreaded environment). Consider using appropriate queues, e.g. from JCTools, Disruptor, or Agrona. Don't forget the rules of engagement for state exchange via mutable state shared by multiple threads in the stored objects - https://youtu.be/nhYIEqt-jvY (the best content I was able to find).
Again, don't expect to improve throughput using such slightly dangerous techniques. You will lose L1-L3 cache efficiency and pollute the CPU pipeline with barriers.
A bit of a tangent (to get a sense of low-latency technology):
One may consider a GC implementation with lower latency if you want to stick with Akka, or use a custom reactive model where the object pool is used by a single thread, or where memory is copied over, e.g. the Disruptor's approach.
Another alternative is using memory regions (the way the Erlang VM works). It creates garbage, but in a form that is easy for the GC to handle!
If you go for very low-latency IO and are the biggest enemy of latency: forget legacy TCP (use RDMA over InfiniBand), forget switches (go switchless), forget accessing disk via OS calls and a file system (use RDMA), forget interrupts shared by the same core, forget cores not pinned (and not spinning for input) to a real CPU core (vs virtual/hyperthreads), avoid inter-NUMA communication, avoid delivering messages one by one instead of hardware multicast (or better, an optical switch) for multiple consumers, and don't forget turning on Epsilon GC for the JVM ;)

Why sharing a static variable between threads reduce performance?

I asked a question here and someone left a comment saying that the problem is that I'm sharing a static variable.
Why is that a problem?
Sharing a static variable in and of itself should have no adverse effect on performance. Global data is common in all programs, starting with the JVM and OS constructs.
Mutable shared data is a different story, as the mutation of shared data can lead to both performance issues (cache misses at the very least) and correctness issues, which are a pain and are often solved using locks - which in turn can lead to other performance issues.
The wiki static variable looks like a pretty substantial part of your program. Not knowing anything about what it's doing or how it's coded, I would guess that it does locking in order to keep a consistent state. If most of your threads are spending their time blocked waiting to acquire access to this same object, then that would explain why you're not seeing any gain from using multiple threads.
For threads to make a difference to the performance of your program they have to be reasonably independent, and not all locking on the same thing. The more locking they have to do, the less gain you will see. So try to split out the work so as much can be done independently as possible. For instance if there are work items that can be gathered independently, then you might be better off by having multiple threads go find the work items, then feed them to a queue that a dedicated thread can use to pull work items off the queue and feed them to the wiki object.
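A rough sketch of that split (class and method names below are placeholders, not from the question): gatherer threads only produce work items, and a single dedicated thread is the only one that touches the shared object.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative producer/consumer split: contention moves from the shared object
// to the queue, which is designed for concurrent hand-off.
class WorkPipeline {
    private final BlockingQueue<String> workItems = new LinkedBlockingQueue<>();

    void startGatherers(int n) {
        for (int i = 0; i < n; i++) {
            new Thread(() -> {
                // gather work items independently, without touching the shared object
                workItems.offer("item found by " + Thread.currentThread().getName());
            }).start();
        }
    }

    void startConsumer() {
        new Thread(() -> {
            try {
                while (true) {
                    String item = workItems.take();
                    process(item);   // only this thread touches the shared (static) object
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();
    }

    private void process(String item) { /* update the shared structure here */ }
}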

Java: Using ConcurrentHashMap as a lock manager

I'm writing a highly concurrent application, needing access to a large fine-grained set of shared resources. I'm currently writing a global lock manager to organize this. I'm wondering if I can piggyback off the standard ConcurrentHashMap and use that to handle the locking? I'm thinking of a system like the following:
A single global ConcurrentHashMap object contains a mapping between the unique string id of the resource and the unique id of the thread currently using the resource
Tune the concurrency factor to reflect the need for a high level of concurrency
Locks are acquired using the atomic conditional replace(K key, V oldValue, V newValue) method in the hashmap (see the sketch after this list)
To prevent lock contention when locking multiple resources, locks must be acquired in alphabetical order
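To make the idea concrete, here is a minimal, untested sketch of the scheme (it uses putIfAbsent and the conditional remove(key, value), which are the same kind of atomic conditional update as replace; the class and method names are just for illustration):

import java.util.concurrent.ConcurrentHashMap;

// Sketch only: a resource is "locked" while its id maps to the owning thread's id.
class ChmLockManager {
    private final ConcurrentHashMap<String, Long> owners = new ConcurrentHashMap<>();

    // Non-blocking try-lock: succeeds only if no thread currently owns the resource.
    boolean tryLock(String resourceId) {
        return owners.putIfAbsent(resourceId, Thread.currentThread().getId()) == null;
    }

    // Release only if the calling thread is the owner (atomic conditional remove).
    boolean unlock(String resourceId) {
        return owners.remove(resourceId, Thread.currentThread().getId());
    }
}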
Are there any major issues with the setup? How will the performance be?
I know this is probably going to be much slower and more memory-heavy than a properly written locking system, but I'd rather not spend days trying to write my own, especially given that I probably won't be able to match Java's professionally-written concurrency code implementing the map.
Also, I've never used ConcurrentHashMap in a high-load situation, so I'm interested in the following:
How well will this scale to large numbers of elements? (I'm looking at ~1,000,000 being a good cap. If I reach beyond that I'd be willing to rewrite this more efficiently)
The documentation states that re-sizing is "relatively" slow. Just how slow is it? I'll probably have to re-size the map once every minute or so. Is this going to be problematic with the size of map I'm looking at?
Edit: Thanks Holger for pointing out that HashMaps shouldn't have that big of an issue with scaling
Also, is there a better/more standard method out there? I can't find any places where a system like this is used, so I'm guessing that either I'm not seeing a major flaw, or there's something else.
Edit:
The application I'm writing is a network service, handling a variable number of requests. I'm using the Grizzly project to balance the requests among multiple threads.
Each request uses a small number of the shared resources (~30), so in general I do not expect a great deal of contention. The requests usually finish working with the resources in under 500ms. Thus, I'd be fine with a bit of blocking/continuous polling, as the requests aren't extremely time-sensitive and contention should be minimal.
In general, seeing that a proper solution would be quite similar to how ConcurrentHashMap works behind the scenes, I'm wondering if I can safely use that as a shortcut instead of writing/debugging/testing my own version.
The re-sizing issue is not relevant, as you already gave an estimate of the number of elements in your question. So you can give a ConcurrentHashMap an initial capacity large enough to avoid any rehashing.
The performance will not depend on the number of elements (that's the main goal of hashing) but on the number of concurrent threads.
The main problem is that you don’t have a plan of how to handle failed locks. Unless you want to poll until locking succeeds (which is not recommended) you need a way of putting a thread to sleep which implies that the thread currently owning the lock has to wake up a sleeping thread on release if one exists. So you end up requiring conventional Lock features a ConcurrentHashMap does not offer.
Creating a Lock per element (as you said ~1,000,000) would not be a solution.
A solution would look a bit like the ConcurrentHashMap works internally. Given a certain concurrency level, i.e. the number of threads you might have (rounded up), you create that number of Locks (which would be a far smaller number than 1,000,000).
Now you assign each element one of the Locks. A simple assignment would be based on the element’s hashCode, assuming it is stable. Then locking an element means locking the assigned Lock which gives you up to the configured concurrency level if all currently locked elements are mapped to different Locks.
This might imply that threads locking different elements block each other if the elements are mapped to the same Lock, but with a predictable likelihood. You can try fine-tuning the concurrency level (as said, use a number higher than the number of threads) to find the best trade-off.
A big advantage of this approach is that you do not need to maintain a data structure that depends on the number of elements. Afaik, the new parallel ClassLoader uses a similar technique.
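A minimal sketch of that striping idea (the sizes and names below are illustrative):

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// A fixed number of locks; each element is mapped to one of them by its hashCode.
class StripedLocks {
    private final Lock[] locks;

    StripedLocks(int concurrencyLevel) {
        locks = new Lock[concurrencyLevel];
        for (int i = 0; i < locks.length; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    // Stable mapping from a resource id to one of the stripes.
    Lock lockFor(String resourceId) {
        return locks[Math.floorMod(resourceId.hashCode(), locks.length)];
    }
}

// Usage:
// Lock lock = stripes.lockFor(resourceId);
// lock.lock();
// try { /* use the resource */ } finally { lock.unlock(); }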

Object Pooling in Java

What are the pros and cons of maintaining a pool of frequently used objects and grabbing one from the pool instead of creating a new one? Something like string interning, except that it would be possible for objects of any class.
For example, it can be considered good since it saves GC time and object-creation time. On the other hand, it can be a synchronization bottleneck if used from multiple threads, it demands explicit deallocation, and it introduces the possibility of memory leaks. By tying up memory that could be reclaimed, it places additional pressure on the garbage collector.
First law of optimization: don't do it. Second law: don't do it unless you actually have measured and know for a fact that you need to optimize and where.
It can be effective only if objects are really expensive to create and can actually be reused (i.e. you can reset their state to something reusable using only public operations).
The two gains you mention are not really true: memory allocation in Java is essentially free (the cost is close to 10 CPU instructions, which is nothing). So reducing the creation of objects only saves you the time spent in the constructor. This can be a gain with really heavy objects that can be reused (database connections, threads) without changing: you reuse the same connection, the same thread.
GC time is not reduced. In fact it can be worse. With moving generational GCs (Java is, or was up to 1.5) the cost of a GC run is determined by the number of alive objects, not by the released memory. Alive objects will be moved to another space in memory (this is what makes memory allocation so fast: free memory is contiguous inside each GC block) a couple of times before being marked as old and moved into the older generation memory space.
Programming languages and support, as GC, were designed keeping in mind the common usage. If you steer away from the common usage in many cases you may end up with harder to read code that is less efficient.
Unless the object is expensive to create, I wouldn't bother.
Benefits:
Fewer objects created - if object creation is expensive, this can be significant. (The canonical example is probably database connections, where "creation" includes making a network connection to the server, providing authentication etc.)
Downsides:
More complicated code
Shared resource = locking; potential bottleneck
Violates the GC's expectations of object lifetimes (most objects will be short-lived)
Do you have an actual problem you're trying to solve, or is this speculative? I wouldn't think about doing something like this unless you've got benchmarks/profile runs showing that there's a problem.
Pooling will mean that you, typically, cannot make objects immutable. This leads to defensive copying, so you ultimately wind up making many more copies than you would if you just made a new immutable object.
Immutability is not always desirable, but more often than not you will find that things can be immutable. Making them not immutable so that you can reuse them in a pool is probably not a great idea.
So, unless you know for certain that it is an issue don't bother. Make the code clear and easy to follow and odds are it will be fast enough. If it isn't then the fact that the code is clear and easy to follow will make it easier to speed it up (in general).
Don't.
This is 2001 thinking. The only object "pool" that is still worth anything nowadays is a singleton. I use singletons only to reduce object creation for purposes of profiling (so I can see more clearly what is impacting the code).
With anything else you are just fragmenting memory for no good purpose.
Go ahead and run a profile on creating 1,000,000 objects. It is insignificant.
Old article here.
It entirely depends on how expensive your objects are to create, compared to the number of times you create them... for instance, objects that are just glorified structs (e.g. contain only a couple of fields, and no methods other than accessors) can be a real use case for pooling.
A real life example: I needed to repetitively extract the n highest ranked items (integers) from a process generating a great number of integer/rank pairs. I used a "pair" object (an integer, and a float rank value) in a bounded priority queue. Reusing the pairs, versus emptying the queue, throwing the pairs away, and recreating them, yielded a 20% performance improvement... mainly in the GC charge, because the pairs never needed to be reallocated throughout the entire life of the JVM.
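The reusable pair in that example could be as simple as the following sketch (names invented, not from the original code); the point is that the same instances are overwritten in place instead of being thrown away and reallocated:

// Mutable, reusable pair illustrating the reuse pattern described above.
final class RankedItem {
    int item;
    float rank;

    // Overwrite in place so the instance can be reused for the next candidate.
    RankedItem set(int item, float rank) {
        this.item = item;
        this.rank = rank;
        return this;
    }
}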
Object pools are generally only a good idea for expensive objects like database connections. Up to Java 1.4.2, object pools could improve performance, but as of Java 5.0 object pools were more likely to harm performance than help, and object pools were often removed to improve performance (and simplicity).
I agree with Jon Skeet's points: if you don't have a specific reason to create a pool of objects, I wouldn't bother.
There are some situations when a pool is really helpful/necessary though. If you have a resource that is expensive to create, but can be reused (such as a database connection), it might make sense to use a pool. Also, in the case of database connections, a pool is useful for preventing your apps from opening too many concurrent connections to the database.
