I am writing a stress test that will issue many calls to a remote server. I want to collect the following statistics after the test:
Latency (in milliseconds) of the remote call.
Number of operations per second that the remote server can handle.
I can successfully get (2), but I am having problems with (1). My current implementation is very similar to the one shown in this other SO question. And I have the same problem described in that question: latency reported by using System.currentTimeMillis() is longer than expected when the test is run with multiple threads.
I analyzed the problem and I am positive the problem comes from the thread interleaving (see my answer to the other question that I linked above for details), and that System.currentTimeMillis() is not the way to solve this problem.
It seems that I should be able to do it using java.lang.management, which has some interesting methods like:
ThreadMXBean.getCurrentThreadCpuTime()
ThreadMXBean.getCurrentThreadUserTime()
ThreadInfo.getWaitedTime()
ThreadInfo.getBlockedTime()
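For reference, here is roughly how I would read these counters (a rough sketch; thread contention monitoring has to be enabled for the last two, otherwise they return -1):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

ThreadMXBean threads = ManagementFactory.getThreadMXBean();
threads.setThreadContentionMonitoringEnabled(true);      // needed for getWaitedTime()/getBlockedTime()
long cpuNanos  = threads.getCurrentThreadCpuTime();      // CPU time used by the calling thread, in ns
long userNanos = threads.getCurrentThreadUserTime();     // user-mode portion of that CPU time
ThreadInfo info = threads.getThreadInfo(Thread.currentThread().getId());
long waitedMillis  = info.getWaitedTime();               // total time the thread has spent WAITING
long blockedMillis = info.getBlockedTime();              // total time the thread has spent BLOCKED on monitors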
My problem is that even though I have read the API, it is still unclear to me which of these methods will give me what I want. In the context of the other SO question that I linked, this is what I need:
long start_time = **rightMethodToCall()**;
result = restTemplate.getForObject("Some URL",String.class);
long difference = (**rightMethodToCall()** - start_time);
So that the difference gives me a very good approximation of the time that the remote call took, even in a multi-threaded environment.
Restriction: I'd like to avoid protecting that block of code with a synchronized block because my program has other threads that I would like to allow to continue executing.
EDIT: Providing more info:
The issue is this: I want to time the remote call, and just the remote call. If I use System.currentTimeMillis or System.nanoTime(), AND if I have more threads than cores, then it is possible that I could have this thread interleaving:
Thread1: long start_time ...
Thread1: result = ...
Thread2: long start_time ...
Thread2: result = ...
Thread2: long difference ...
Thread1: long difference ...
If that happens, then the difference calculated by Thread2 is correct, but the one calculated by Thread1 is incorrect (it would be greater than it should be). In other words, for the measurement of the difference in Thread1, I would like to exclude the time of lines 4 and 5. Is this time that the thread was WAITING?
Summarizing the question in a different way in case it helps other people understand it better (this is how @jason-c put it in his comment):
[I am] attempting to time the remote call, but running the test with multiple threads just to increase testing volume.
Use System.nanoTime() (but see updates at end of this answer).
You definitely don't want to use the current thread's CPU or user time, as user-perceived latency is wall clock time, not thread CPU time. You also don't want to use the current thread's blocked or waited time, as those measure per-thread contention, which also doesn't accurately represent what you are trying to measure.
System.nanoTime() will return relatively accurate results from a high-resolution clock with a fixed reference point, and will measure exactly what you are trying to measure. (Its granularity is technically only guaranteed to be as good as or better than currentTimeMillis(), but in practice it tends to be much better, as it is generally backed by hardware clocks or other performance timers, e.g. QueryPerformanceCounter on Windows or clock_gettime on Linux.)
long start_time = System.nanoTime();
result = restTemplate.getForObject("Some URL",String.class);
long difference = (System.nanoTime() - start_time);
long milliseconds = difference / 1000000;
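(The manual division is fine; if you prefer, java.util.concurrent.TimeUnit does the same conversion:)

long milliseconds = TimeUnit.NANOSECONDS.toMillis(difference);   // requires import java.util.concurrent.TimeUnit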
System.nanoTime() does have its own set of issues, but be careful not to get whipped up in paranoia; for most applications it is more than adequate. You just wouldn't want to use it for, say, precise timing when sending audio samples to hardware (which you wouldn't do directly in Java anyway).
Update 1:
More importantly, how do you know the measured values are longer than expected? If your measurements are showing true wall clock time, and some threads are taking longer than others, that is still an accurate representation of user-perceived latency, as some users will experience those longer delay times.
Update 2 (based on clarification in comments):
Much of my above answer is still valid then; but for different reasons.
Using per-thread time does not give you an accurate representation because a thread could be idle/inactive while the remote request is still processing, and you would therefore exclude that time from your measurement even though it is part of perceived latency.
Further inaccuracies are introduced by the remote server taking longer to process the simultaneous requests you are making - this is an extra variable that you are adding (although it may be acceptable as representative of the remote server being busy).
Wall time is also not completely accurate because, as you have seen, variances in local thread overhead may add extra latency that isn't typically present in single-request client applications (although this still may be acceptable as representative of a client application that is multi-threaded, but it is a variable you cannot control).
Of those two, wall time will still get you closer to the actual result than per-thread time, which is why I left the previous answer above. You have a few options:
You could do your tests on a single thread, serially -- this is ultimately the most accurate way to achieve your stated requirements.
You could avoid creating more threads than cores, e.g. a fixed-size thread pool, optionally with affinities bound to each core (tricky: Java thread affinity), and run the measurements as tasks on it (see the sketch after this list). Of course this still adds variables from underlying synchronization mechanisms that are beyond your control. It reduces the risk of interleaving (especially if you set the affinities), but you still do not have full control over e.g. other threads the JVM is running or other unrelated processes on the system.
You could measure the request handling time on the remote server; of course this does not take network latency into account.
You could continue using your current approach and do some statistical analysis on the results to remove outliers.
You could skip measuring this entirely and simply do user tests, waiting for complaints before attempting to optimize (i.e. measure it with people, who are what you're developing for anyway). If the only reason to optimize is UX, it could very well be the case that users have a pleasant experience and the wait time is perfectly acceptable.
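As a rough illustration of the fixed-size-pool option above (affinity left out; the URL, restTemplate reference, request count and latency sink are placeholders, not a definitive harness):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LoadTestSketch {
    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);  // no more threads than cores
        Queue<Long> latenciesNanos = new ConcurrentLinkedQueue<>();  // thread-safe result sink

        for (int i = 0; i < 10_000; i++) {                           // hypothetical request count
            pool.submit(() -> {
                long start = System.nanoTime();
                // result = restTemplate.getForObject("Some URL", String.class);  // the remote call
                latenciesNanos.add(System.nanoTime() - start);
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        // Post-process latenciesNanos here: average, percentiles, drop outliers, etc.
    }
}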
Also, none of this guarantees that other unrelated threads on the system won't affect your timings, which is why it is important to both a) run your test multiple times and average (obviously) and b) decide on an acceptable margin of timing error (do you really need to know this to e.g. 0.1 ms accuracy?).
Personally, I would either do the first, single-threaded approach and let it run overnight or over a weekend, or use your existing approach and remove outliers from the result and accept a margin of error in the timings. Your goal is to find a realistic estimate within a satisfactory margin of error. You will also want to consider what you are going to ultimately do with this information when deciding what is acceptable.
Related
I'd like to know how long a context switch takes on my operating system. Is there a hack to do this? Can I do it in Java, or will I need native code (e.g. in C)? Does context-switch time differ for different threads?
From user-space processes you can get a rough estimate by running several threads/processes, each of them reading the wall-clock time (or processor ticks, RDTSC) as frequently as possible for some significant amount of time, and then finding the minimal discrepancy between the measurements of different threads. Make sure they are running on the same core.
Another estimate can be obtained by using some kind of waiting on mutex or conditional variable, but that would rather show performance of thread/process wake-up.
In Java you may get an additional overhead for JVM.
I guess the only reliable way is to profile your kernel or maybe find the numbers in kernel documentation.
Probably before trying all that you should make sure why you need to know such a thing. Performance of a multi-threaded/multi-process application depends on many factors, and context switching is most often the minor one.
Just call sleep(0) a large number of times; observe the total elapsed time; and divide. Do this in a high-priority process so that it is always the next process to be scheduled itself.
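A minimal Java sketch of that idea (whether sleep(0) actually forces a reschedule is OS/JVM dependent, so treat the result as a rough figure; the iteration count is arbitrary):

public class SleepZeroEstimate {
    public static void main(String[] args) throws InterruptedException {
        final int iterations = 1_000_000;            // arbitrary
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Thread.sleep(0);                         // may or may not yield the CPU, depending on the OS
        }
        long perCall = (System.nanoTime() - start) / iterations;
        System.out.println("~" + perCall + " ns per sleep(0) call");
    }
}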
It's easy to write a bit of code to measure it. It's worth writing it yourself because you'll then get an answer relevant to your choice of language, etc. You should be able to write such a thing in any language that has threads and semaphores.
You need two threads or two processes. Have one of them record a high-precision time of day (these days it will need to be good to nanoseconds, and that might be quite difficult; it will depend on what the hardware/OS/language provides) in a shared buffer and then post a semaphore. Your other thread or process should be waiting on that semaphore. When it gets it, it should record the high-precision time of day too and subtract the time that the other thread/process put in the shared buffer.
The reason why you might want to measure the context switch time for threads and processes is that a thread context switch time in a lot of OSes is less than that for processes (that's certainly true for Windows).
You could refine the answer with repeated runs and measure an average time. You could also similarly measure the time taken to post and take a semaphore to remove that component from the context switch time. I wouldn't bother with that because if you're worrying about the effect of context switch times you probably also care about the time taken to cause a context switch (such as posting a semaphore) as well.
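A rough Java version of this two-thread/semaphore scheme (round count arbitrary; as noted, the average includes semaphore post/take overhead, and possibly a cross-core wake-up rather than a pure same-core context switch):

import java.util.concurrent.Semaphore;

public class HandOffTimer {
    static volatile long postedAt;                  // the "shared buffer" holding the posted timestamp

    public static void main(String[] args) throws InterruptedException {
        final int rounds = 100_000;                 // arbitrary; average over many rounds
        Semaphore ready = new Semaphore(0);         // posted after the timestamp is written
        Semaphore done = new Semaphore(0);          // posted back so rounds don't overlap
        long[] totalNanos = {0};

        Thread waiter = new Thread(() -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    ready.acquire();                // wait on the semaphore
                    totalNanos[0] += System.nanoTime() - postedAt;
                    done.release();
                }
            } catch (InterruptedException ignored) { }
        });
        waiter.start();

        for (int i = 0; i < rounds; i++) {
            postedAt = System.nanoTime();           // record the time, then post
            ready.release();
            done.acquire();
        }
        waiter.join();
        System.out.println("avg ns per hand-off: " + totalNanos[0] / rounds);
    }
}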
I don't really know what results to expect these days. I know that VxWorks was achieving 10us context switch times on 200MHz PowerPC chips back in the 1990's, and that was really fast in those days.
==EDIT==
Context switching in multi core machines is potentially a much more variable thing. In a single core machine the OS must always switch execution contexts every time a different thread is to be run. But on a multi core machine the OS can distribute threads across cores, so there's not necessarily the need to do all the memory transactions associated with a context switch (I don't actually know if any OS out there does that). But given that thread / core distribution is itself a very variable thing depending on machine workload, etc, one's application might experience wildly variable CST depending on whether the user is moving the mouse, etc etc.
I am trying to compare the accuracy of timing methods with C++ and Java.
With C++ I usually use clock() and CLOCKS_PER_SEC: I run the block of code I want to time for a certain amount of time and then calculate how long it took, based on how many times the block was executed.
With Java I usually use System.nanoTime().
Which one is more accurate, the one I use for C++ or the one I use for Java? Is there any other way to time in C++ so I don't have to repeat the piece of code to get a proper measurement? Basically, is there a System.nanoTime() method for C++?
I am aware that both use system calls which cause considerable latencies. How does this distort the real value of the timing? Is there any way to prevent this?
Every method has errors. Before you spend a great deal of time on this question, you have to ask yourself "how accurate do I need my answer to be"? Usually the solution is to run a loop / piece of code a number of times, and keep track of the mean / standard deviation of the measurement. This is a good way to get a handle on the repeatability of your measurement. After that, assume that latency is "comparable" between the "start time" and "stop time" calls (regardless of what function you used), and you have a framework to understand the issues.
Bottom line: clock() function typically gives microsecond accuracy.
See https://stackoverflow.com/a/20497193/1967396 for an example of how to go about this in C (in that instance, using a usec precision clock). There's the ability to use ns timing - see for example the answer to clock_gettime() still not monotonic - alternatives? which uses clock_gettime(CLOCK_MONOTONIC_RAW, &tSpec);
Note that you have to extract seconds and nanoseconds separately from that structure.
Be careful using System.nanoTime() as it is still limited by the resolution that the machine you are running on can give you.
Also, there are complications timing Java: the first few times through a function will be a lot slower, until the JIT optimizes it for your system.
Virtually all modern systems use pre-emptive multi threading and multiple cores, etc - so all timings will vary from run to run. (For example if control gets switched away from your thread while it in the method).
To get reliable timings you need to
Warm up the system by running the code you are timing a few hundred times before you start measuring.
Run the code for a good number of times and average the results.
The reliability issues are the same for any language, so they apply just as well to C as to Java. C may not need the warm-up loop, but you will still need to take a lot of samples and average them.
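A minimal Java harness along those lines (the workload is a placeholder; warm-up and sample counts are arbitrary):

public class TimingHarness {
    static double sink;                                  // keeps the JIT from removing the workload

    public static void main(String[] args) {
        final int warmup = 10_000, samples = 1_000;      // arbitrary counts

        for (int i = 0; i < warmup; i++) {
            workload();                                  // warm up: let the JIT optimize this path first
        }

        long total = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            workload();
            total += System.nanoTime() - start;
        }
        System.out.println("average ns per call: " + total / samples);
    }

    static void workload() {
        sink += Math.sqrt(12345.678);                    // placeholder for the code being measured
    }
}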
Good evening,
I'm developing a Java TCP server for communication between clients.
At this point I'm load testing the developed server.
This morning I got my hands on a profiler (YourKit) and started looking for problem spots in my server.
I now have 480 clients sending messages to the server every 500 msec. The server forwards every received message to 6 clients.
The server is now using about 8% of my cpu, when being on constant load.
My question is about the Java functions that use the most CPU cycles.
The Java function that uses the most CPU cycles is, strangely, Thread.sleep, followed by BufferedReader.readLine.
Both of these functions block the current thread while waiting for something (sleep waits for a few msec, readLine waits for data).
Can somebody explain why these two functions take up that many CPU cycles? I was also wondering if there are alternative approaches that use fewer CPU cycles.
Kind regards,
T. Akhayo
sleep() and readLine() can use a lot of CPU because they both result in system calls, which can context switch. It is also possible that the timing for these methods is not accurate for this reason (it may be an overestimate).
A way to reduce the overhead of context switches/sleep() is to have fewer threads and avoid the need for sleep (e.g. use a ScheduledExecutorService); readLine() overhead can be reduced by using NIO, but that is likely to add some complexity.
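For example, a single scheduler thread can replace many sleep-in-a-loop threads (interval and task body are placeholders):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerSketch {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // One scheduler thread fires every 500 ms; no per-client thread sitting in sleep().
        scheduler.scheduleAtFixedRate(() -> {
            // ... do the periodic work here (e.g. flush queued messages) ...
        }, 0, 500, TimeUnit.MILLISECONDS);
    }
}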
Sleeping shouldn't be an issue, unless you have a bunch of threads sleeping for short periods of time (100-150 ms is 'short' when you have 480 threads running a loop that just sleeps and does something trivial).
The readLine call should use next to nothing when it's not actually reading something, except when you first call it. Like you said, it blocks, and it shouldn't use a noticeable amount of CPU unless the windows where it blocks are very small, you're reading tons of data, or it's the initial call.
So your loops are too tight, and you're receiving too many messages too quickly, which ultimately causes 'tons' of context switching and processing. I'd suggest using an NIO framework (like Netty) if you're not comfortable enough with NIO to use it on your own.
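For orientation, a bare-bones java.nio selector loop looks roughly like this (port and forwarding logic are placeholders; a framework like Netty handles the many edge cases this sketch ignores):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioServerSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));        // hypothetical port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                           // one thread blocks here for all clients
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {     // client disconnected
                        key.cancel();
                        client.close();
                        continue;
                    }
                    buffer.flip();
                    // ... parse the message and forward it to the 6 recipients here ...
                }
            }
        }
    }
}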
Also, 8% CPU isn't that much for 480 clients that send 2 messages per second.
Here is a program in which sleep uses almost 100% of the cpu cycles given to the application:
for (int i = 0; i < bigNumber; i++) {
    Thread.sleep(someTime);   // InterruptedException handling omitted
}
Why? Because it doesn't use very many actual CPU cycles at all, and of the ones it does use, nearly all of them are spent entering and leaving sleep.
Does that mean it's a real problem? Of course not.
That's the problem with profilers that only look at CPU time.
You need a sampler that samples on wall-clock time, not CPU time.
It should sample the stack, not just the program counter.
It should show you by line of code (not by function) the fraction of stack samples containing that line.
The usual objection to sampling on wall-clock time is that the measurements will be inaccurate due to sharing the machine with other processes.
But that doesn't matter, because to find time drains does not require precision of measurement.
It requires precision of location.
What you are looking for is precise code locations, and call sites, that are on the stack a healthy fraction of actual time, as determined by stack sampling that's uncorrelated with the state of the program.
Competition with other processes does not change the fraction of time that call sites are on the stack by a large enough amount to result in missing the problems.
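In-process, a crude wall-clock stack sampler can be sketched with Thread.getAllStackTraces() (the 10 ms interval is arbitrary; the aggregation is left as a comment):

import java.util.Map;

public class WallClockSampler {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // Snapshot every live thread's stack on wall-clock time, ~100 samples/second.
            Map<Thread, StackTraceElement[]> samples = Thread.getAllStackTraces();
            record(samples);
            Thread.sleep(10);
        }
    }

    static void record(Map<Thread, StackTraceElement[]> samples) {
        // Aggregate here: the fraction of samples in which a given line (or call site)
        // appears approximates the fraction of wall-clock time it spends on the stack.
    }
}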
A little background on my requirement and what I have accomplished so far:
There are 18 scheduler tasks that run at regular intervals (the shortest being 30 minutes). Each takes nearly 5000 eligible employees as input, runs them through a static method that iterates over them, generates mail content for each employee, and sends the mail. An average task takes about 9 minutes; multiplied by 18, that is roughly 162 minutes, during which the next tasks will already be queueing up (I assume).
So my plan is something like the below loop
try {
    // Handle the ArrayList of alert-eligible employees
    Iterator employee = empIDs.iterator();
    while (employee.hasNext()) {
        ScheduledImplementation.getInstance()
            .sendAlertInfoToEmpForGivenAlertType((Long) employee.next(), configType, schedType);
    }
} catch (Exception vEx) {
    _log.error("Exception Caught During sending " + configType + " messages:" + configType, vEx);
}
Since I know how many employees will come to my method, I plan to divide the while loop into two and perform simultaneous operations on two or three employees at a time. Is this possible? Or are there other ways I can improve the performance?
Some of the things I have implemented so far:
1. Wherever possible, made methods static, and variables too.
2. Didn't bother to catch exceptions and send them back, because these are background tasks (and I assume this improves performance).
3. Got the DB values in one query instead of multiple hits.
If I am successful in optimizing the while loop, I think I can save a couple of minutes.
Thanks
Wherever possible make methods static and variables too
Don't bother to catch exceptions and send back because these are background tasks. (And I assume this improves performance)
That just makes for very bad code style, nothing to do with performance.
Get the DB values in one query instead of multiple hits.
Yes, this can be quite essential.
But in general, learn how to use a profiler to see where your code actually spends its time so that you can improve those parts, rather than making random "optimizations" based on hearsay, as you seem to be doing now.
Using static and final improves performance only when you are not using a decent JIT compiler. Since most JREs already come with a good JIT compiler, you can ignore final and static for performance purposes.
Better check your locking and synchronization style. A good indicator of locking problems is the CPU usage of your application: if it is low when the application should be working hard, there may be locks or DB queries blocking it.
Maybe you can use techniques like Copy-On-Write or ReadWriteLocks.
Also check out the concurrent and atomic packages for some ideas how to improve or eliminate locking.
If there's one CPU-core with high load while others are idling around, try to do things in parallel. An ExecutorService may be helpful here.
Please don't fix anything until you know what needs to be fixed. Michael is right - profile - but I prefer this method.
Get a sense of time scales. When people optimize while loops, they are working at the level of nanoseconds. When you do a DB query, it is at the level of milliseconds. When you send an email and wait for the I/O to complete, you're talking seconds. If you have delays on the order of seconds, doing something that only saves milliseconds or nanoseconds will leave you disappointed in the results. **
So find out (don't guess) what is causing most of the delay, and work from there.
** To give a sense of the time scale, in 1 second a photon can travel most of the way to the moon; in 1 millisecond it can travel across a medium-sized country, and in 1 nanosecond it can travel the length of your foot.
Assuming you have profiled your application first and found that most of the delay is not in your program but somewhere else, e.g. the network/mail server...
You can create a fixed-size thread pool, e.g. 2-4 threads, and add a task for each mail you would otherwise have sent in the current thread.
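A rough sketch of that idea, reusing the identifiers from the question (pool size and timeout are arbitrary; this is not a drop-in replacement):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

ExecutorService mailPool = Executors.newFixedThreadPool(4);   // 2-4 threads as suggested
for (Object id : empIDs) {
    final Long empId = (Long) id;                             // same cast as in the question
    mailPool.submit(() -> {
        try {
            ScheduledImplementation.getInstance()
                .sendAlertInfoToEmpForGivenAlertType(empId, configType, schedType);
        } catch (Exception vEx) {
            _log.error("Exception caught during sending " + configType + " messages", vEx);
        }
    });
}
mailPool.shutdown();
mailPool.awaitTermination(30, TimeUnit.MINUTES);              // InterruptedException handling omitted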
Review your code, removing constructions which are knowingly slow.
The garbage collector is your friend, but it charges you a very high price. You must do your best to avoid creating garbage as much as possible; prefer simple data structures made of primitive types.
Exceptions are another very handy thing that charges you a very high price.
They are expensive because each time an exception is thrown, the stack trace must be created and populated. Imagine a balance-transfer operation which fails in 1% of cases due to lack of funds. Even with this relatively low failure rate, performance may be severely impacted. See this benchmark.
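For instance, returning a status instead of throwing on the expected failure keeps the stack-trace cost off the hot path (a hypothetical sketch, not taken from any code in the question):

class Transfers {
    enum Result { OK, INSUFFICIENT_FUNDS }

    // The expected ~1% failure is a return value, so no stack trace is built for it;
    // exceptions stay reserved for truly exceptional conditions.
    static Result transfer(long balanceCents, long amountCents) {
        if (amountCents > balanceCents) {
            return Result.INSUFFICIENT_FUNDS;
        }
        // ... perform the transfer ...
        return Result.OK;
    }
}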
Review your logic and algorithms. Try to perform operations in parallel.
Use bulk data moves, when possible.
Use bulk database operations and always use PreparedStatements.
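For the bulk-database point, a typical JDBC pattern is a single PreparedStatement executed as a batch (table and column names are made up for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchInsertSketch {
    // One statement, compiled once, executed for the whole batch.
    static void insertAll(Connection conn, List<String> names) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("INSERT INTO items(name) VALUES (?)")) {
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();
            }
            ps.executeBatch();   // typically far fewer round trips than one execute per row
        }
    }
}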
Only after that use profiling techniques.
Profiling the code should be the last resort.
If you've done your homework properly, profiling will contribute only marginally.
Java gives access to two methods for getting the current time: System.nanoTime() and System.currentTimeMillis(). The first one returns a value in nanoseconds, but its actual accuracy is much worse than that (many microseconds).
Is the JVM already providing the best possible value for each particular machine?
Otherwise, is there some Java library that can give finer measurement, possibly by being tied to a particular system?
The problem with getting super precise time measurements is that some processors can't/don't provide such tiny increments.
As far as I know, System.currentTimeMillis() and System.nanoTime() are the best measurements you will be able to find.
Note that both return a long value.
It's a bit pointless in Java to measure time down to the nanosecond scale; an occasional GC hit will easily wipe out any accuracy this may have given. In any case, the documentation states that while it gives nanosecond precision, that's not the same thing as nanosecond accuracy; and there are operating systems that don't report nanoseconds in any case (which is why you'll find answers quantized to 1000 when accessing them; it's not luck, it's limitation).
Not only that, but depending on how the feature is actually implemented by the OS, you might find quantized results coming through anyway (e.g. answers that always end in 64 or 128 instead of intermediate values).
It's also worth noting that the purpose of the method is to find the time difference between some (nearby) start time and now; if you take System.nanoTime() at the start of a long-running application and then take System.nanoTime() a long time later, it may have drifted quite far from real time. So you should only really use it for periods of less than 1 s; if you need a longer running time than that, milliseconds should be enough. (And if it's not, then make up the last few numbers; you'll probably impress clients and the result will be just as valid.)
Unfortunately, I don't think java RTS is mature enough at this moment.
Java does try to provide the best value (it actually delegates to native code that reads the kernel's time). However, the JVM spec makes this coarse-time-measurement disclaimer mainly because of things like GC activity and what the underlying system supports:
1. Certain GC activities will block all threads, even if you are running a concurrent GC.
2. The default Linux clock-tick precision is only 10 ms; Java cannot make it any better if the Linux kernel does not support it.
I haven't figured out how to address #1 unless your app doesn't need to do GC at all. A decent, medium-sized application will occasionally spend tens of milliseconds on GC pauses. You are probably out of luck if your precision requirement is below 10 ms.
As for #2, you can tune the Linux kernel to give more precision. However, you are also getting less out of your box, because the kernel now context-switches more often.
Perhaps we should look at it from a different angle: is there a reason Ops needs precision of 10 ms or lower? Is it okay to tell Ops that precision is 10 ms, and to also look at the GC log around that time, so they know the timestamp is accurate to +/- 10 ms as long as there was no GC activity near it?
If you are looking to record some type of phenomenon on the order of nanoseconds, what you really need is a real-time operating system. The accuracy of the timer will greatly depend on the operating system's implementation of its high resolution timer and the underlying hardware.
However, you can still stay with Java since there are RTOS versions available.
JNI:
Create a simple function to access the Intel RDTSC instruction or the PMCCNTR register of co-processor p15 in ARM.
Pure Java:
You can possibly get better values if you are willing to wait for a clock tick: spin checking System.nanoTime() until the value changes. If you know, for instance, that the value of System.nanoTime() changes every 10000 loop iterations on your platform, by amount DELTA, then the actual event time was finalNanoTime - DELTA * ITERATIONS / 10000. You will need to "warm up" the code before taking actual measurements.
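A small sketch of the calibration part, measuring how far apart consecutive distinct System.nanoTime() values are on the current platform (iteration count arbitrary):

public class NanoTimeGranularity {
    public static void main(String[] args) {
        long minDelta = Long.MAX_VALUE;
        for (int i = 0; i < 1_000_000; i++) {        // plenty of iterations, which also warms up the loop
            long t0 = System.nanoTime();
            long t1;
            do {
                t1 = System.nanoTime();              // spin until the reported value changes
            } while (t1 == t0);
            minDelta = Math.min(minDelta, t1 - t0);
        }
        System.out.println("Smallest observed nanoTime() step: " + minDelta + " ns");
    }
}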
Hack (for profiling, etc, only):
If garbage collection is throwing you off, you could always measure the time using a high-priority thread running in a second JVM that doesn't create objects. Have it spin incrementing a long in shared memory, which you use as a clock.