Blocking Behaviour - Concurrent Data Structures in Java

I'm currently running a highly concurrent benchmark which accesses a ConcurrentSkipListMap from the Java concurrent collections. I'm finding that threads appear to be blocked inside that collection, more precisely here:
java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:828)
java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1626)
(This was obtained by printing the stack trace of each individual thread at 10-second intervals.) The situation is still not resolved after several minutes.
Is this expected behaviour for these collections? Which of the other concurrent collections are likely to experience blocking?
Having tested it, I see similar behaviour with ConcurrentHashMap:
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:994)

This could well be a spurious result.
When you ask Java for a dump of all its current stack traces, it tells each thread to wait when it gets to a yield point, then it captures the traces, and then it resumes all the threads. As you can imagine, this means that yield points are over-represented in these traces; these include synchronized methods, volatile accesses, etc. ConcurrentSkipListMap.head, a volatile field, is accessed in doGet.
See this paper for a more detailed analysis.
Solaris Studio has a profiler that captures stack traces from the OS and translates them to Java stack traces. This does away with the bias toward yield points and gives you more accurate results; you might find that doGet goes away almost entirely. I've only had luck running it on Linux, and even then it's not out-of-the-box. If you're interested, ask me in the comments how to set it up, I'd be happy to help.
As an easier approach, you could wrap your calls to ConcurrentSkipListMap.get with System.nanoTime() to get a sanity check on whether this is really where your time is going. Figure out how much time you're spending in that method, and confirm whether it's roughly what you'd expect given the percentage of time the profiler says you're spending there.
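For instance, a rough sketch of that sanity check (the map and key types, and the LongAdder accumulator, are placeholders for whatever your benchmark actually uses) might look like:

import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch only: time each get() and accumulate the total, so it can be
// compared against what the profiler claims at the end of the run.
class TimedLookup {
    static final LongAdder NANOS_IN_GET = new LongAdder();

    static String timedGet(ConcurrentSkipListMap<Long, String> map, Long key) {
        long start = System.nanoTime();
        try {
            return map.get(key);
        } finally {
            NANOS_IN_GET.add(System.nanoTime() - start);
        }
    }
}

At the end of the benchmark, compare NANOS_IN_GET.sum() against the total wall-clock time of the run.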
Shameless self-plug: I created a simple program that demonstrates this a few months ago for a presentation at work. If you run it against a profiler, it should show that SpinWork.work appears a lot, while HardWork.work doesn't show up at all -- even though the latter actually takes a lot more time. HardWork.work doesn't show up because it doesn't contain yield points.

Well, it isn't blocking in the truest sense. Blocking implies the suspension of thread activity. ConcurrentSkipListMap is non-blocking: it will spin until it succeeds. But it also guarantees it will eventually succeed (that is, it shouldn't get into an infinite loop).
That being said, unless you are doing a very large number of gets and puts concurrently, I don't see how you could be spending so much time in this method.
If you can re-create it with an example and share with us that may help.

ConcurrentHashMap.get performs a volatile read, which means the CPU must finish all outstanding writes before it can perform the read. This is called a STORE/LOAD barrier. Depending on how much is going on in the other threads/cores, this can take a long time. See https://www.cs.umd.edu/users/pugh/java/memoryModel/jsr-133-faq.html.

Related

Timing a remote call in a multithreaded java program

I am writing a stress test that will issue many calls to a remote server. I want to collect the following statistics after the test:
Latency (in milliseconds) of the remote call.
Number of operations per second that the remote server can handle.
I can successfully get (2), but I am having problems with (1). My current implementation is very similar to the one shown in this other SO question. And I have the same problem described in that question: latency reported by using System.currentTimeMillis() is longer than expected when the test is run with multiple threads.
I analyzed the problem and I am positive the problem comes from the thread interleaving (see my answer to the other question that I linked above for details), and that System.currentTimeMillis() is not the way to solve this problem.
It seems that I should be able to do it using java.lang.management, which has some interesting methods like:
ThreadMXBean.getCurrentThreadCpuTime()
ThreadMXBean.getCurrentThreadUserTime()
ThreadInfo.getWaitedTime()
ThreadInfo.getBlockedTime()
My problem is that even though I have read the API, it is still unclear to me which of these methods will give me what I want. In the context of the other SO question that I linked, this is what I need:
long start_time = rightMethodToCall();
result = restTemplate.getForObject("Some URL",String.class);
long difference = (rightMethodToCall() - start_time);
So that the difference gives me a very good approximation of the time that the remote call took, even in a multi-threaded environment.
Restriction: I'd like to avoid protecting that block of code with a synchronized block because my program has other threads that I would like to allow to continue executing.
EDIT: Providing more info:
The issue is this: I want to time the remote call, and just the remote call. If I use System.currentTimeMillis or System.nanoTime(), AND if I have more threads than cores, then it is possible that I could have this thread interleaving:
Thread1: long start_time ...
Thread1: result = ...
Thread2: long start_time ...
Thread2: result = ...
Thread2: long difference ...
Thread1: long difference ...
If that happens, then the difference calculated by Thread2 is correct, but the one calculated by Thread1 is incorrect (it would be greater than it should be). In other words, for the measurement of the difference in Thread1, I would like to exclude the time of lines 4 and 5. Is this time that the thread was WAITING?
Summarizing question in a different way in case it helps other people understand it better (this quote is how #jason-c put it in his comment.):
[I am] attempting to time the remote call, but running the test with multiple threads just to increase testing volume.
Use System.nanoTime() (but see updates at end of this answer).
You definitely don't want to use the current thread's CPU or user time, as user-perceived latency is wall clock time, not thread CPU time. You also don't want to use the current thread's blocking or waiting time, as that measures per-thread contention times, which also don't accurately represent what you are trying to measure.
System.nanoTime() will return relatively accurate results (although granularity is technically only guaranteed to be as good or better than currentTimeMillis(), in practice it tends to be much better, generally implemented with hardware clocks or other performance timers, e.g. QueryPerformanceCounter on Windows or clock_gettime on Linux) from a high resolution clock with a fixed reference point, and will measure exactly what you are trying to measure.
long start_time = System.nanoTime();
result = restTemplate.getForObject("Some URL",String.class);
long difference = (System.nanoTime() - start_time);
long milliseconds = difference / 1000000;
System.nanoTime() does have its own set of issues, but be careful not to get whipped up in paranoia; for most applications it is more than adequate. You just wouldn't want to use it for, say, precise timing when sending audio samples to hardware or something (which you wouldn't do directly in Java anyway).
Update 1:
More importantly, how do you know the measured values are longer than expected? If your measurements are showing true wall clock time, and some threads are taking longer than others, that is still an accurate representation of user-perceived latency, as some users will experience those longer delay times.
Update 2 (based on clarification in comments):
Much of my above answer is still valid then; but for different reasons.
Using per-thread time does not give you an accurate representation because a thread could be idle/inactive while the remote request is still processing, and you would therefore exclude that time from your measurement even though it is part of perceived latency.
Further inaccuracies are introduced by the remote server taking longer to process the simultaneous requests you are making - this is an extra variable that you are adding (although it may be acceptable as representative of the remote server being busy).
Wall time is also not completely accurate because, as you have seen, variances in local thread overhead may add extra latency that isn't typically present in single-request client applications (although this still may be acceptable as representative of a client application that is multi-threaded, but it is a variable you cannot control).
Of those two, wall time will still get you closer to the actual result than per-thread time, which is why I left the previous answer above. You have a few options:
You could do your tests on a single thread, serially -- this is ultimately the most accurate way to achieve your stated requirements (a sketch appears at the end of this answer).
You could create no more threads than cores, e.g. a fixed-size thread pool with affinities bound (tricky: Java thread affinity) to each core, and the measurements running as tasks on each. Of course this still adds variables due to synchronization of underlying mechanisms that are beyond your control. This may reduce the risk of interleaving (especially if you set the affinities), but you still do not have full control over e.g. other threads the JVM is running or other unrelated processes on the system.
You could measure the request handling time on the remote server; of course this does not take network latency into account.
You could continue using your current approach and do some statistical analysis on the results to remove outliers.
You could not measure this at all, and simply do user tests and wait for a comment on it before attempting to optimize it (i.e. measure it with people, who are what you're developing for anyways). If the only reason to optimize this is for UX, it could very well be the case that users have a pleasant experience and the wait time is totally acceptable.
Also, none of this makes any guarantees that other unrelated threads on the system won't be affecting your timings, but that is why it is important to both a) run your test multiple times and average (obviously) and b) set an acceptable requirement for the timing error that you are OK with (do you really need to know this to e.g. 0.1 ms accuracy?).
Personally, I would either do the first, single-threaded approach and let it run overnight or over a weekend, or use your existing approach and remove outliers from the result and accept a margin of error in the timings. Your goal is to find a realistic estimate within a satisfactory margin of error. You will also want to consider what you are going to ultimately do with this information when deciding what is acceptable.
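For what it's worth, a rough sketch of that first, single-threaded approach (call() stands in for the remote request; the run count and the reported percentiles are arbitrary choices):

import java.util.Arrays;

// Sketch only: run the call serially, record wall-clock latencies, and report
// percentiles so isolated outliers don't distort the result.
public class SerialLatencyTest {
    static void call() { /* e.g. restTemplate.getForObject("Some URL", String.class); */ }

    public static void main(String[] args) {
        long[] nanos = new long[1000];
        for (int i = 0; i < nanos.length; i++) {
            long start = System.nanoTime();
            call();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        System.out.println("median ms: " + nanos[nanos.length / 2] / 1_000_000);
        System.out.println("p95 ms:    " + nanos[(int) (nanos.length * 0.95)] / 1_000_000);
    }
}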

How to avoid 100% CPU utilization without removing while(true)

I am working on a system where I need a while(true) where the loop constantly listens to a queue and increments counts in memory.
The data is constantly coming in the queue, so I cannot avoid a while(true) condition. But naturally it increases my CPU utilization to 100%.
So, how can I keep a thread alive which listens to the tail of the queue and performs some action, but at the same time bring the CPU utilization down from 100%?
Blocking queues were invented exactly for this purpose.
Also see this: What are the advantages of Blocking Queue in Java?
LinkedBlockingQueue.take() is what you should be using. This waits for an entry to arrive on the queue, with no additional synchronization mechanism needed.
(There are one or two other blocking queues in Java, IIRC, but they have features that make them unsuitable in the general case. Don't know why such an important mechanism is buried so deeply in arcane classes.)
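A minimal consumer sketch along those lines (the String message type and the process() step are placeholders for your actual data and counting logic):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueConsumer implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void submit(String message) {
        queue.offer(message);                   // producer side
    }

    @Override
    public void run() {
        try {
            while (true) {
                String message = queue.take();  // blocks, using ~0% CPU, until something arrives
                process(message);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit cleanly if interrupted
        }
    }

    private void process(String message) { /* increment in-memory counts here */ }
}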
Usually a queue has a way to retrieve an item from it such that your thread will be descheduled (thus using 0% CPU) until something arrives in the queue...
Based on your comments on another answer, you want a queue that is fed by changes in hsqldb.
Some quick googling turns up:
http://hsqldb.org/doc/guide/triggers-chapt.html
It appears you can set it up so that changes cause a trigger to occur, which will notify a class you write implementing the org.hsqldb.Trigger interface. Have that class contain a reference to a LinkedBlockingDeque from the java.util.concurrent package, and have the trigger add the change to the queue.
You now have a blocking queue that your reading thread will block on until hsqldb fires a trigger (from an update by a writer), which will put something in the queue. The waiting thread will then unblock and take the item off the queue.
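Very roughly, and hedged - the exact org.hsqldb.Trigger.fire(...) signature differs between HSQLDB versions, so check the Javadoc of the version you run - the wiring could look like:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;

// Sketch only: pushes changed rows from an HSQLDB trigger onto a shared queue.
// The fire(...) signature below follows the 1.8/2.x-style Trigger interface;
// verify it against the HSQLDB version you actually use.
public class QueueingTrigger implements org.hsqldb.Trigger {
    public static final BlockingQueue<Object[]> CHANGES = new LinkedBlockingDeque<>();

    public void fire(int type, String triggerName, String tableName,
                     Object[] oldRow, Object[] newRow) {
        CHANGES.offer(newRow);   // hand the changed row to the waiting reader thread
    }
}

The reading thread then simply blocks on QueueingTrigger.CHANGES.take() until a row arrives.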
lbalazscs and Brain have excellent answers. Since I couldn't share my code, it was hard for them to give me the exact fix for my issue. And having a while(true) which constantly polls a queue is surely the wrong way to go about it. So, here is what I did:
I used ScheduledExecutorService with a 10sec delay.
I read a block of messages (say 10k) and process those messages
The thread is invoked again and the "loop" continues.
This considerably reduces my CPU usage. Suggestions welcomed.
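Roughly what that looks like (the 10-second delay, the 10k batch size, and the queue type come from the description above; process() is a placeholder):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the approach described above: every 10 seconds, drain up to
// 10k messages from the queue and process them as a batch.
public class BatchPoller {
    private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleWithFixedDelay(() -> {
            List<String> batch = new ArrayList<>();
            queue.drainTo(batch, 10_000);   // non-blocking: takes whatever is available
            for (String message : batch) {
                process(message);
            }
        }, 0, 10, TimeUnit.SECONDS);
    }

    private void process(String message) { /* update in-memory counts */ }
}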
Many of the answers here come from theory rather than from having run into this directly, so here is the blunt, practical view.
while(true) will make your program use all of the CPU time that the OS scheduler allots to it, running whatever is in the loop as fast as possible, over and over. This doesn't mean that an application showing 100% CPU will starve everything else; if you run a game at the same time, it should still run as intended. It is more of a reporting artifact, similar to the Windows idle process and a few other processes. The fix is to add a Sleep(1) (at least 1 millisecond), or better yet a Sleep(5), so that other work can run and the CPU is not spinning through your while(true) as fast as possible. This will generally drop the reported CPU usage to 0% or 1%, since even one full millisecond is a long rest for even an older CPU.
Many times while(true) or other endless loops are bad designs and can be slowed down drastically, even to Sleep(1000) - one-second interval checks - or more. Endless loops are not always bad design, but they can usually be improved.
It's funny to see this issue, which I first ran into when I was about 12 learning C, pop up here alongside the 'dumb' answers.
Just know that if you try it, unless the higher-level language you are using has fixed this for you somewhere along the line, Windows will report a lot of CPU use for an empty loop even though the OS actually has free resources to spend.
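For completeness, the sleep-in-the-loop idea translated into Java looks roughly like this (the queue and process() are placeholders), although the blocking take() shown earlier is usually the better fix:

import java.util.concurrent.LinkedBlockingQueue;

public class SleepPoller {
    private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Poll with a short sleep instead of a hot spin: CPU drops to near 0%,
    // at the cost of up to ~5 ms of extra latency per message.
    public void pollLoop() throws InterruptedException {
        while (true) {
            String message = queue.poll();   // non-blocking check
            if (message != null) {
                process(message);
            } else {
                Thread.sleep(5);             // give the CPU back instead of spinning
            }
        }
    }

    private void process(String message) { /* handle the message */ }
}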

Why aren't all methods displayed in VisualVM profiler?

I am using VisualVM to see where my application is slow, but it does not show all methods - probably not even the methods that are actually delaying the application.
I have a realtime application (sound processing) and have a time deficit of a few hundred microseconds.
Is it possible that VisualVM hides methods which are fast themselves?
UPDATE 1
I found the slow method by sampling and guessing. It was a toString() method called from debug logging that was turned off, but it was still consuming time.
The settings helped, and now I know how to see it: it depended on the "Start profiling from" option.
Other than the filters mentioned by Ryan Stewart, here are a couple of additional reasons why methods may not appear in the profiler:
Sampling profiles are inherently stochastic: a sample of the current stack of all threads is taken every N ms. Some methods which actually executed but which aren't caught in any sample during your run just won't appear. This is generally not too problematic, since the very fact that they didn't appear in any sample means that, with very high probability, these are methods which aren't taking up a large part of your runtime.
When using instrumentation-based profiling in VisualVM (called "CPU profiling"), you need to define the entry point for profiled methods (the "Start profiling from" option). I have found this fails for methods in the default package, and also that it won't pick up time in methods which are currently running when the profiler is attached (for the duration of the current invocation - it will catch later invocations). This is probably because the instrumented method won't be swapped in until the current invocation finishes.
Sampling is subject to a potentially serious issue with stack-trace-based profiling, which is that samples are only taken at safepoints in the code. When a trace is requested, each thread is forced to a safepoint, and then the stack is captured. In some cases you may have a hot spot in your code which does no safepoint polling (common for simple loops that the JIT can guarantee terminate after a fixed number of iterations), interleaved with a bit of code that does have a safepoint poll. Your stacks will always show your process in the safepoint-bearing code, never in the safepoint-free code, even though the latter may be taking the majority of CPU time.
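As a hedged illustration of that effect (whether a given loop gets safepoint polls is JIT- and JVM-version-dependent, so treat this only as a sketch of the kind of code where the bias can show up):

// Sketch: hotLoop() is an int-counted loop that the JIT may compile without
// safepoint polls, so a safepoint-biased sampler can attribute almost none of
// the run time to it, even if it dominates the CPU.
public class SafepointBiasDemo {
    static long hotLoop() {
        long acc = 0;
        for (int i = 0; i < 2_000_000_000; i++) {   // counted loop: may carry no safepoint poll
            acc += i * 31L;
        }
        return acc;
    }

    public static void main(String[] args) {
        while (true) {
            System.out.println(hotLoop());          // samples tend to land around here instead
        }
    }
}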
I don't have it in front of me at the moment, but before you start profiling, there's a settings pane that's hidden by default which lets you enter regexes for filtering out methods. By default, it filters out a lot of the core JDK stuff.
I had the same problem with my pet project. I added a package name and the problem was solved. I don't understand why. VisualVM 1.4.1, jdk1.8.0_181 and jdk-10.0.2, Windows 10.

Analyze the "runnable" thread dump under high load server side

The thread dump from a java based application is easy to get but hard to analyze!
There is something interesting we can see from the thread dump.
Suppose we have a heavily loaded Java web app, and I often take 10 or 15 thread dump files during peak time (under high load) to get a wide sample of data. First, there is no doubt that we need to tune the code whose threads show up in the Blocked and Monitor states. But I can't dig any deeper into the remaining Runnable threads.
So, if a method appears in the thread dumps many times, can we say it is slower than the others under high load on the server side? Of course, we could use proper profiling tools to check that, but the thread dumps may give us the same useful information, especially since we are in the production environment.
Thank you in advance!
Vance
I would look carefully at the call stack of the thread(s) in each dump, regardless of the thread's state, and ask "What exactly is it doing or waiting for at that point in time, and why?"
You don't just look at functions on the call stack, you look at the lines of code where the functions are called. That tells you the local reason for the call. If you combine the local reasons for the calls on the stack, that gives you the complete reason (the "why chain") for what the thread was doing at that time.
What you're looking for is bad reasons that appear on more than one snapshot.
(It only takes one unnecessary call on the stack to make that whole sample improvable, so the deeper the stack, the better the hunting.)
Since they're bad, they can be fixed and you will get improved performance. The amount of the performance improvement is roughly the fraction of snapshots that show them.
That's this technique.
I'd say that if a method appears very often in a thread dump you'd have to either
optimize that method since it is called many times or
check whether that method is called too often
If you see that the running threads spend lots of time in a particular method, there might also be a bug (like we had with a special regex that hit a bug in the regex engine). So you'd need to investigate that.

Java Performance Degradation on removing locks

I am testing my java application for any performance bottlenecks. The application uses concurrent.jar for locking purposes.
I have a high computation call which calls lock and unlock functions for its operations.
On removing the lock-unlock mechanism from the code, I have seen performance degrade severalfold, contrary to my expectations. Among other things, I observed an increase in CPU consumption, which made it look as if the program was running faster, but actually it was not.
Q1. What can be the reason for this degradation in performance when we remove locks?
Best Regards !!!
This can be quite a usual finding, depending on what you're doing and what you're using as an alternative to Locks.
Essentially, what happens is that constructs such as ReentrantLock have some logic built into them that knows "when to back off" when they realistically can't acquire the lock. This reduces the amount of CPU that's burnt just in the logic of repeatedly trying to acquire the lock, which can happen if you use simpler locking constructs.
As an example, have a look at the graph I've hurriedly put up here. It shows the throughput of threads continually accessing random elements of an array, using different constructs as the locking mechanism. Along the X axis is the number of threads; the Y axis is throughput. The blue line is a ReentrantLock; the yellow, green and brown lines use variants of a spinlock. Notice how with low numbers of threads, the spinlock gives higher throughput as you might expect, but as the number of threads ramps up, the back-off logic of ReentrantLock kicks in and it ends up doing better, while under high contention the spinlocks just sit burning CPU.
By the way, this was really a trial run done on a dual-processor machine; I also ran it in the Amazon cloud (effectively an 8-way Xeon), but I've, ahem, mislaid the file, so I'll either find it or run the experiment again soon and post an update. You get an essentially similar pattern, as I recall.
Update: whether it's in locking code or not, a phenomenon that can happen on some multiprocessor architectures is that as the multiple processors do a high volume of memory accesses, you can end up flooding the memory bus, and in effect the processors slow each other down. (It's a bit like with ethernet-- the more machines you add to the network, the more chance of collisions as they send data.)
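For reference, a minimal sketch of the kind of comparison behind that graph - not the actual benchmark, and the shared counter is just placeholder work - could be:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: the same critical section guarded either by a ReentrantLock (which
// parks threads under contention) or by a naive spinlock (which burns CPU
// retrying). Run each variant with varying thread counts and compare the
// number of operations completed per second.
public class LockComparison {
    static final ReentrantLock LOCK = new ReentrantLock();
    static final AtomicBoolean SPIN = new AtomicBoolean(false);
    static long counter = 0;

    static void withReentrantLock() {
        LOCK.lock();
        try { counter++; } finally { LOCK.unlock(); }
    }

    static void withSpinLock() {
        while (!SPIN.compareAndSet(false, true)) { /* busy-wait */ }
        try { counter++; } finally { SPIN.set(false); }
    }
}

Under low contention the spinlock version tends to win; once threads outnumber cores, the parking behaviour of ReentrantLock usually pays off, which matches the pattern described above.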
Profile it. Anything else here will be just a guess and an uninformed one at that.
Using a profiler like YourKit will not only tell you which methods are "hot spots" in terms of CPU time but it will also tell you where threads are spending most of their time BLOCKED or WAITING
Is it still performing correctly? For instance, there was a case in an app server where an unsynchronised HashMap caused an occasional infinite loop. It is not too difficult to see how work could simply be repeated.
The most likely culprit for seeing performance decline and CPU usage increase when you remove shared memory protection is a race condition. Two or more threads could be continually flipping a state flag back and forth on a shared object.
More description of the purpose of your application would help with diagnosis.
