Measure contention in wait-free multi-threaded Java programs

I have a wait-free implementation for binary search trees, but I am not able to figure out concrete methods to measure thread contention. By contention, here I mean the number of threads that try to access the same piece of memory at the same time.
So far, I have looked at the ThreadMXBean and ThreadInfo classes, but since there are no locks involved, I haven't found a solution yet.

There is no way to measure the contention over a "memory location" without prohibitive performance costs. Direct measurement (e.g. a properly synchronized counter wrapping all the accesses) will introduce artificial bottlenecks, which will destroy the reliability of the test.
"Same time" is loosely defined at the scale you want to measure it, because only a single CPU "owns" a particular location in memory at any given time. The best you can do in this case is to measure the rate at which CPUs are dealing with memory conflicts, e.g. through the hardware counters. Doing that requires an understanding of the memory subsystem on a given platform. Also, the hardware counters attribute events to machine (= CPU) state, not to memory state; in other words, you can estimate how many conflicts particular instructions have experienced, not how many CPUs accessed a given memory location.

Trying to measure at the source of the contention is the wrong approach. What would a raw contention count tell you anyway? What you actually care about is how the data structure scales.
So, first of all, set up a benchmarking suite which runs typical access patterns on your data structure, and graph the performance for different thread counts. There is a nice example of this on the nitro cache performance page.
If you scale almost linearly: congrats, you are done!
If you don't scale linearly, you need more insight. Then you need to profile the system as a whole and find the reason, e.g. CPU pipeline stalls. The best way is to use low-overhead tracing for this. On Linux you can use OProfile, which also has Java support that helps you correlate the JITed machine code with your Java program.
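A minimal sketch of such a suite using JMH (this assumes JMH is on the classpath; a ConcurrentSkipListSet stands in for your tree here). Run it with different -t thread counts and graph the throughput:

    import java.util.concurrent.ConcurrentSkipListSet;
    import java.util.concurrent.ThreadLocalRandom;
    import org.openjdk.jmh.annotations.*;

    // Measures read throughput under a typical access pattern; vary the
    // JMH worker thread count (-t 1, 2, 4, ...) and plot ops/s.
    @State(Scope.Benchmark)
    public class TreeBenchmark {
        ConcurrentSkipListSet<Integer> set;

        @Setup
        public void setup() {
            set = new ConcurrentSkipListSet<>();
            for (int i = 0; i < 100_000; i++) set.add(i);
        }

        @Benchmark
        public boolean contains() {
            return set.contains(ThreadLocalRandom.current().nextInt(100_000));
        }
    }

If throughput at 8 threads is well below 8x the single-thread number, that gap is your effective contention, measured without perturbing the structure itself.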

Related

Why is increasing the number of threads useless?

I've been looking for a way to increase the speed of processing messages received from a RabbitMQ queue. The only way I've found is to run more than one thread doing the same thing: receiving and processing. This gave me some profit: after I created 4 threads the speed quadrupled. As I have an 8-core processor I decided to increase the number of threads to 8, but this gave no performance increase. YourKit shows that only 50% of the CPU is used. Somebody might say that my app is lightweight, and so it is, but it can't do more work than it currently does, regardless of how much more work I give it. Why doesn't this work?
There are many different issues that can constrain the maximum speed of an application on a given system. For example, it can be limited by memory bandwidth, by Amdahl's Law effects (time needed for non-parallel code, including synchronized blocks), by I/O bandwidth, or by cache space.
If you want further improvement you need to do some measurements and profiling to find where the time is going, and then work on that.
The short (and not particularly helpful) answer is "overheads and bottlenecks".
For instance:
Creating threads in Java is relatively expensive. If the amount of work done by a thread isn't large, the overhead of creating the thread can outweigh the benefits (see the rough timing sketch after this list).
Context switching between threads is relatively expensive, especially when you take account of memory-related overheads such as cache misses and TLB misses. (These overheads actually hit when a native thread is assigned to a core. If the OS can somehow keep a native thread on a single core continuously (i.e. with no other threads on the same core), then it can use spinlocking ... and avoid the context switch. But the more Java threads you have, the less likely it is that the OS can do this.)
The threads may be spending a large proportion of their time waiting for I/O to complete. The I/O system's throughput or the speed / latency of some external service can be a bottleneck.
You may have contention over data structures; e.g. threads requiring exclusive access to safely read or update (say) a shared Map. If threads regularly need to wait for others to release locks, then you have a bottleneck.
Your computation may be dominated by the costs of "feeding" the threads. For example, if there is a single master thread that hands out "work" to worker threads, then the master thread's activities could be the bottleneck; i.e. it may not be able to provide enough work to keep the workers busy.
Since your tags imply that you are using a message queue, it is possible that that is the bottleneck, especially if the messages are big or the "work" done on each one is relatively small.
(Using a separate message queue service is liable to increase context switches, add I/O latency, add protocol overheads and so on. It's not an automatic route to performance improvement for small-scale systems.)
It is also possible that you have "hyperthreaded" cores not real cores, or that the operating system is stopping your JVM from using all cores.
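To make the thread-creation point above concrete, here is a rough, illustrative timing sketch; the numbers it prints are ballpark only, and a proper harness such as JMH should be used for real measurements:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Runs the same tiny tasks once with a fresh thread per task and
    // once on a reused fixed pool, to show the per-thread setup cost.
    public class ThreadOverheadDemo {
        public static void main(String[] args) throws InterruptedException {
            final int tasks = 10_000;
            Runnable tiny = () -> { /* trivial work */ };

            long t0 = System.nanoTime();
            for (int i = 0; i < tasks; i++) {
                Thread t = new Thread(tiny);
                t.start();
                t.join();                     // one short-lived thread per task
            }
            long perThread = System.nanoTime() - t0;

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            long t1 = System.nanoTime();
            for (int i = 0; i < tasks; i++) pool.execute(tiny);
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            long pooled = System.nanoTime() - t1;

            System.out.printf("per-thread: %d ms, pooled: %d ms%n",
                    perThread / 1_000_000, pooled / 1_000_000);
        }
    }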
If CPU or waiting for I/O is your bottleneck, adding independent threads can make a big difference.
If a shared resource is the bottleneck, e.g. your L3 cache, your network adapter, or your kernel, adding threads won't help, because CPU is not the problem. In fact it can often make things worse by adding overhead.
my app is lightweight
In which case CPU is unlikely to be your issue, and you are doing well to see a speed-up with more than 1 CPU. Most likely you are speeding up the CPU used by RabbitMQ. Ideally it should be more efficient and this shouldn't really help much; IMHO, efficient messaging solutions don't gain much from multiple CPUs, as they are not bottlenecked on CPU.
One way or another, you're only using 4 cores. There's a lot that can stop you from doubling your performance by doubling your threads, but from your 4-thread success you've gotten past all that. I'm guessing there's a bug in your code to set off 8 threads and it's only firing up 4. (Even with hyperthreading, you'd get some improvement. Even with every possible problem, you'd get some improvement.) Otherwise, I'll go with T.J. Crowder and Stephen C: I don't think you really have 8 cores.
I'd try using different numbers of threads: 3, 5, 6. See what changes. I think you'll stumble on the problem soon enough.
To be fair to Java: if you write thread-safe code and avoid bottlenecks, it handles threads really well, as you've noticed going from one thread to 4. I've always found the overhead costs to be trivial.
Your application does not have linear speedup, and therefore it does not have good scalability.
To keep increasing the number of threads you need to ensure that the amount of data being handled grows accordingly. For a fixed amount of data, increasing the number of threads (and/or cores) hits diminishing returns at some point, since the overhead of creating threads outweighs each thread's compute time.
Make sure to look up the following links:
Amdahl's law
Gustafson's law
Gustafson's law is a great counterpoint to Amdahl's law, so I highly recommend understanding that article.
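For concreteness, here is a small sketch that tabulates both laws for an assumed parallel fraction (the p = 0.9 is illustrative, not a measurement):

    // Amdahl:    speedup(n) = 1 / ((1 - p) + p / n)   -- fixed problem size
    // Gustafson: speedup(n) = (1 - p) + p * n         -- work grows with n
    public class ScalingLaws {
        public static void main(String[] args) {
            double p = 0.9; // assumed parallelizable fraction of the work
            for (int n : new int[] {1, 2, 4, 8, 16}) {
                double amdahl = 1.0 / ((1.0 - p) + p / n);
                double gustafson = (1.0 - p) + p * n;
                System.out.printf("n=%2d  Amdahl=%5.2f  Gustafson=%5.2f%n",
                        n, amdahl, gustafson);
            }
        }
    }

With p = 0.9, Amdahl caps the speedup below 10 no matter how many cores you add, which is exactly the diminishing-returns behavior described above.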

Design a test program: increase/decrease CPU usage

I am trying to design and then write a Java program that can increase/decrease CPU usage. Here is my basic idea: write a multi-threaded program where each thread does floating-point calculations, and increase/decrease CPU usage by adding/removing threads.
I am not sure what kinds of floating-point operations are best for this test case, especially as I am going to test on a VMware virtual machine.
You can just sum up the reciprocals of the natural numbers. Since this sum doesn't converge, the compiler will not dare to optimize it away. Just make sure that the result is somehow used in the end.
1/1 + 1/2 + 1/3 + 1/4 + 1/5 ...
This will of course occupy the floating-point unit, but not necessarily the rest of the central processing unit; whether that is good enough for your test is the main question you should pose.
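A minimal sketch of this idea, spawning one load thread per requested unit of CPU usage (the volatile sink keeps the JIT from discarding the loop as dead code):

    public class FpLoad {
        static volatile double sink; // published so the sum counts as "used"

        public static void main(String[] args) {
            int threads = Integer.parseInt(args[0]); // more threads = more CPU usage
            for (int t = 0; t < threads; t++) {
                new Thread(() -> {
                    double sum = 0.0;
                    for (long i = 1; ; i++) {
                        sum += 1.0 / i;                  // divergent harmonic series
                        if ((i & 0xFFFFF) == 0) sink = sum;
                    }
                }).start();
            }
        }
    }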
Simple busy loops will already increase the CPU usage; I am not aware that doing FP calculations changes this significantly or achieves a more consistent load factor, even though it does exercise the FPU and not just the ALU.
While creating a similar proof-of-concept in C# I used a fixed number of threads and varied the sleep/work durations of each thread. Bear in mind that this process isn't exact and is subject to CPU and process throttling, as well as other factors of modern preemptive operating systems. Adding VMware to the mix may compound the observed behaviors further. In degenerate cases, harmonics can form between the code designed to adjust the load and the load reported by the system.
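A rough Java rendering of that duty-cycle idea (the parameter names are illustrative); the achieved load only approximates busyMs / (busyMs + idleMs) because of scheduling jitter:

    class DutyCycleLoad implements Runnable {
        final long busyMs, idleMs;

        DutyCycleLoad(long busyMs, long idleMs) {
            this.busyMs = busyMs;
            this.idleMs = idleMs;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    long end = System.currentTimeMillis() + busyMs;
                    while (System.currentTimeMillis() < end) { /* spin */ }
                    Thread.sleep(idleMs); // e.g. 30ms busy / 70ms idle ~ 30% load
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop cleanly on interrupt
            }
        }
    }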
If lower-level constructs were used (generally requiring "kernel mode" access) then a more consistent throttle could be implemented -- partially because of the ability to avoid certain [thread] preemptions :-)
Another alternative that may be looked into, with the appropriate hardware, is setting the CPU clock and then running at 100%. The current Intel Core i line of chips is very flexible this way (the CPU multiplier can be set discretely through the entire range), although access from Java may be problematic. This would be done in the host, of course, not in VMware.
Happy coding.

Normalization of speed for testing on different multicore processors

I want to measure the run time of some simple C programs on different multi-core processors. But as we know, with the advancement of technology, new processors incorporate more techniques for faster computation, such as higher clock speeds. How can I normalize away such speed changes (i.e. filter out the effect of processor advances other than multiple cores), since I only want results based on the number of cores?
Under Linux, you can boot with the kernel command line parameter maxcpus=N to limit the machine to only N CPUs. See Documentation/kernel-parameters.txt in the kernel source for details.
Most BIOS environments also have the ability to turn off hyperthreading; depending upon your benchmarks, HT may speed up or slow down your tests; being in control of HT would be ideal.
Decide on a known set of reference hardware, run some sort of repeatable reference benchmark against this, and get a good known value to compare to. Then you can run this benchmark against other systems to figure out how to scale the values you get from your target benchmark runs.
The closer your reference benchmark is to your actual application, the more accurate the results of your scaling will be. You could have a single deterministic run (single code path, maybe average of multiple executions) of your application used as your reference benchmark.
If I understand you correctly, you are trying to find a measurement approach that allows you to separate the effect of scaling the number of cores from the advances in single-processor performance. I am afraid that is not easily possible. E.g. if you compare a multi-core system to one single core of that system, you get a non-linear correlation, because there are shared resources such as the memory bus. If you use only one core of a multi-core system it can use the complete memory bandwidth, while in the multi-core case it has to share. Similar arguments apply to many shared resources: caches, buses, I/O capabilities, ALUs, etc.
Your issue is the automatic scaling of core frequency based on the number of active cores at any given time. For instance, AMD Phenom 6-core chips operate at 3.4GHz (or thereabouts), and if your application creates more than 3 threads the clock goes down to 2.8GHz (or thereabouts). Intel, on the other hand, uses a bunch of heuristics to determine the right frequency at any given time.
However, you can always turn these settings off in the BIOS, and then the results will be comparable, differing only in clock frequency. Usually, people measure gigaflops to get comparable results.

Java Beginner: cost of threading over serial

Suppose I have N double vectors, each of large length, say 10,000 elements, and I want to thread the operation (multiply each vector by 3) by creating N threads.
I was wondering whether there is a cost (memory, speed, etc.) to using N parallel threads over serial one-by-one operations.
Or is using threading actually the better idea, since I read it would use the available cores?
Yes: each thread will use up resources: at least memory, but possibly also OS processes or other OS resources. The details will depend on the JVM implementation.
If the memory usage becomes too high you might also take a performance hit due to more frequent GC, paging, and whatever else computers do to manage their memory.
When performing memory-intensive operations, the size of your cache is more likely to be important than the size of your main memory. As the lower caches are shared, intensive use of that one shared cache may not give you much scalability when using multiple cores. However, if you minimise the amount of data each core is using, or your operations are non-trivial, you can get near-linear scalability.
You wouldn't want to create a new thread for each vector (the overhead of creating a thread is significant). However, it may make sense to do this work in a thread pool to exploit multiple cores.
It sounds as if:
The calculations are large enough that you would benefit from harnessing multiple threads
The individual vectors are large enough that they could be used as chunks of work
If the above is true, then it may make sense to use a ThreadPoolExecutor, which will enable you to send each vector to the pool of threads for processing.
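A minimal sketch of that approach, assuming the vectors are plain double arrays (the sizes here are placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class VectorScale {
        public static void main(String[] args) throws Exception {
            double[][] vectors = new double[100][10_000]; // N = 100 placeholder vectors

            // One pool sized to the machine, not one thread per vector.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<?>> pending = new ArrayList<>();
            for (double[] v : vectors) {
                pending.add(pool.submit(() -> {
                    for (int i = 0; i < v.length; i++) v[i] *= 3; // per-vector work
                }));
            }
            for (Future<?> f : pending) f.get(); // wait for all vectors to finish
            pool.shutdown();
        }
    }

Each vector is one unit of work, so the pool keeps all cores busy without paying thread-creation costs N times.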

Java Performance Degradation on removing locks

I am testing my Java application for performance bottlenecks. The application uses java.util.concurrent (concurrent.jar) for its locking.
I have a computation-heavy call which takes and releases a lock around its operations.
On removing the lock/unlock mechanism from the code, I have seen the performance degrade severalfold, contrary to my expectations. Among other things I observed an increase in CPU consumption, which made me feel that the program was running faster, but actually it was not.
Q1. What can be the reason for this degradation in performance when we remove locks?
This can be quite a common finding, depending on what you're doing and what you're using as an alternative to locks.
Essentially, what happens is that constructs such as ReentrantLock have some logic built into them that knows "when to back off" when they realistically can't acquire the lock. This reduces the amount of CPU that's burnt just in the logic of repeatedly trying to acquire the lock, which can happen if you use simpler locking constructs.
As an example, have a look at the graph I've hurriedly put up here. It shows the throughput of threads continually accessing random elements of an array, using different constructs as the locking mechanism. Along the X axis is the number of threads; the Y axis is throughput. The blue line is a ReentrantLock; the yellow, green and brown lines use variants of a spinlock. Notice how with low numbers of threads the spinlock gives higher throughput, as you might expect, but as the number of threads ramps up, the back-off logic of ReentrantLock kicks in and it ends up doing better, while under high contention the spinlocks just sit burning CPU.
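For reference, a stripped-down sketch of the two kinds of construct being compared (the spinlock is deliberately naive):

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.ReentrantLock;

    // A naive test-and-set spinlock burns CPU the whole time it waits...
    class NaiveSpinLock {
        private final AtomicBoolean held = new AtomicBoolean();

        void lock()   { while (!held.compareAndSet(false, true)) { /* spin */ } }
        void unlock() { held.set(false); }
    }

    // ...whereas ReentrantLock parks waiting threads under heavy contention.
    class Guarded {
        private final ReentrantLock lock = new ReentrantLock();
        private long counter;

        void increment() {
            lock.lock();
            try { counter++; }
            finally { lock.unlock(); }
        }
    }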
By the way, this was really a trial run done on a dual-processor machine; I also ran it in the Amazon cloud (effectively an 8-way Xeon), but I've, ahem, mislaid the file. I'll either find it or run the experiment again soon and post an update. But you get an essentially similar pattern, as I recall.
Update: whether it's in locking code or not, a phenomenon that can happen on some multiprocessor architectures is that as the multiple processors do a high volume of memory accesses, you can end up flooding the memory bus, and in effect the processors slow each other down. (It's a bit like Ethernet: the more machines you add to the network, the more chance of collisions as they send data.)
Profile it. Anything else here will be just a guess and an uninformed one at that.
Using a profiler like YourKit will not only tell you which methods are "hot spots" in terms of CPU time, but will also tell you where threads are spending most of their time BLOCKED or WAITING.
Is it still performing correctly? For instance, there was a case in an app server where an unsynchronised HashMap caused an occasional infinite loop. And it is not too difficult to see how work could simply be repeated.
The most likely culprit for performance declining and CPU usage increasing when you remove shared-memory protection is a race condition: two or more threads could be continually flipping a state flag back and forth on a shared object.
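A contrived sketch of the kind of race that produces this symptom (the class and methods are hypothetical):

    class RacyStateMachine {
        private boolean processed; // no volatile, no lock: reads and writes race

        void process() {
            if (!processed) {      // several threads may all see `false`...
                doExpensiveWork(); // ...and all repeat the expensive work
                processed = true;
            }
        }

        private void doExpensiveWork() { /* ... */ }
    }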
More description of the purpose of your application would help with diagnosis.
