Why is my System.nanoTime() broken?

Another developer on my team and I recently moved from a Core 2 Duo machine at work to a new Core 2 Quad 9505; both run Windows XP SP3 32-bit with JDK 1.6.0_18.
Upon doing so, a couple of our automated unit tests for some timing/statistics/metrics aggregation code promptly started failing, due to what appear to be ridiculous values coming back from System.nanoTime().
Test code that shows this behaviour, reliably, on my machine is:
import static org.junit.Assert.assertThat;
import org.hamcrest.Matchers;
import org.junit.Test;
public class NanoTest {

    @Test
    public void testNanoTime() throws InterruptedException {
        final long sleepMillis = 5000;

        long nanosBefore = System.nanoTime();
        long millisBefore = System.currentTimeMillis();

        Thread.sleep(sleepMillis);

        long nanosTaken = System.nanoTime() - nanosBefore;
        long millisTaken = System.currentTimeMillis() - millisBefore;

        System.out.println("nanosTaken=" + nanosTaken);
        System.out.println("millisTaken=" + millisTaken);

        // Check it slept within 10% of requested time
        assertThat((double) millisTaken, Matchers.closeTo(sleepMillis, sleepMillis * 0.1));
        assertThat((double) nanosTaken, Matchers.closeTo(sleepMillis * 1000000, sleepMillis * 1000000 * 0.1));
    }
}
Typical output:
millisTaken=5001
nanosTaken=2243785148
Running it 100x yields nano results between 33% and 60% of the actual sleep time; usually around 40% though.
I understand the weaknesses in accuracy of timers in Windows, and have read related threads like Is System.nanoTime() consistent across threads?. However, my understanding is that System.nanoTime() is intended for exactly the purpose we're using it for: measuring elapsed time, and doing so more accurately than currentTimeMillis().
Does anyone know why it's returning such crazy results? Is this likely to be a hardware architecture problem (the only major thing that has changed is the CPU/Motherboard on this machine)? A problem with the Windows HAL with my current hardware? A JDK problem? Should I abandon nanoTime()? Should I log a bug somewhere, or any suggestions on how I could investigate further?
UPDATE 19/07 03:15 UTC: After trying finnw's test case below, I did some more Googling and came across entries such as bugid:6440250. It also reminded me of some other strange behaviour I noticed late on Friday, where pings were coming back negative. So I added /usepmtimer to my boot.ini, and now all the tests behave as expected, and my pings are normal too.
I'm a bit confused about why this was still an issue though; from my reading I thought TSC vs PMT issues were largely resolved in Windows XP SP3. Could it be because my machine was originally SP2, and was patched to SP3 rather than installed originally as SP3? I now also wonder whether I should be installing patches like the one at MS KB896256. Maybe I should take this up with the corporate desktop build team?

The problem was resolved (with some open suspicions about the suitability of nanoTime() on multi-core systems!) by adding /usepmtimer to the end of my C:\boot.ini string; forcing Windows to use the Power Management timer rather than the TSC. It's an open question as to why I needed to do this given I'm on XP SP3, as I understood that this was the default, however perhaps it was due to the manner in which my machine was patched to SP3.
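For anyone needing the same fix: the switch goes at the end of the existing OS entry in C:\boot.ini. A representative entry follows (the OS string and the other switches will vary per machine; only /usepmtimer is the relevant addition):

[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /noexecute=optin /fastdetect /usepmtimer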

On my system (Windows 7 64-bit, Core i7 980X):
nanosTaken=4999902563
millisTaken=5001
System.nanoTime() uses OS-specific calls, so I expect that you are seeing a bug in your Windows/processor combination.

You probably want to read the answers to this other Stack Overflow question: Is System.nanoTime() completely useless?.
In summary, it appears that nanoTime relies on operating-system timers that may be affected by the presence of multiple CPU cores. As such, nanoTime may not be that useful on certain combinations of OS and CPU, and care should be taken when using it in portable Java code intended to run on multiple target platforms. There seems to be a lot of complaining on the web about this subject, but not much consensus on a meaningful alternative.
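If you do need to keep using nanoTime() on suspect hardware, one defensive option (a minimal sketch of my own, not from the linked question; the class name is illustrative) is to fall back to the millisecond clock whenever the nano delta comes back negative:

public class GuardedTimer {
    // Sketch: guard against nanoTime() jumps between cores by
    // cross-checking against currentTimeMillis().
    private final long nanoStart = System.nanoTime();
    private final long millisStart = System.currentTimeMillis();

    public long elapsedNanos() {
        long nanoDelta = System.nanoTime() - nanoStart;
        // If the nano reading went backwards, trust the coarser clock instead.
        if (nanoDelta < 0) {
            return (System.currentTimeMillis() - millisStart) * 1000000L;
        }
        return nanoDelta;
    }
}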

It is difficult to tell whether this is a bug or just normal timer variation between cores.
An experiment you could try is to use native calls to force the thread to run on a specific core.
Also, to rule out power management effects, try spinning in a loop as an alternative to sleep():
import com.sun.jna.Native;
import com.sun.jna.NativeLong;
import com.sun.jna.platform.win32.Kernel32;
import com.sun.jna.platform.win32.W32API;
public class AffinityTest {

    private static void testNanoTime(boolean sameCore, boolean spin)
            throws InterruptedException {
        W32API.HANDLE hThread = kernel.GetCurrentThread();

        final long sleepMillis = 5000;

        kernel.SetThreadAffinityMask(hThread, new NativeLong(1L));
        Thread.yield();
        long nanosBefore = System.nanoTime();
        long millisBefore = System.currentTimeMillis();
        kernel.SetThreadAffinityMask(hThread, new NativeLong(sameCore ? 1L : 2L));
        if (spin) {
            Thread.yield();
            while (System.currentTimeMillis() - millisBefore < sleepMillis)
                ;
        } else {
            Thread.sleep(sleepMillis);
        }
        long nanosTaken = System.nanoTime() - nanosBefore;
        long millisTaken = System.currentTimeMillis() - millisBefore;

        System.out.println("nanosTaken=" + nanosTaken);
        System.out.println("millisTaken=" + millisTaken);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Sleeping, different cores");
        testNanoTime(false, false);

        System.out.println("\nSleeping, same core");
        testNanoTime(true, false);

        System.out.println("\nSpinning, different cores");
        testNanoTime(false, true);

        System.out.println("\nSpinning, same core");
        testNanoTime(true, true);
    }

    private static final Kernel32Ex kernel =
            (Kernel32Ex) Native.loadLibrary(Kernel32Ex.class);
}

interface Kernel32Ex extends Kernel32 {
    NativeLong SetThreadAffinityMask(HANDLE hThread, NativeLong dwAffinityMask);
}
If you get very different results depending on core selection (e.g. 5000ms on the same core but 2200ms on different cores) that would suggest that the problem is just natural timer variation between cores.
If you get very different results from sleeping vs. spinning, it is more likely due to power management slowing down the clocks.
If none of the four results are close to 5000ms, then it might be a bug.

Related

Why are Java HTTP requests so slow (in comparison to Python), and how can I make them faster?

Java is a beautiful language, and is also supposedly very efficient. Coming from a background of using Python, I wanted to see the difference between the two languages, and from the start I was very impressed by the explicitness and clarity of Java's OOP-based syntax. However, I also wanted to test the performance differences between the languages.
I started off by testing the performance difference via computation speed. For this, I wrote some code in each language; the program calculated a mathematical problem, iterating many times. I won't be adding that code here, but I will share the results: Python was almost two times slower (measured by wall-clock time) than Java. Interesting, but expected. After all, the whole reason I wanted to try Java is the amount of people bragging about its computation speed.
Following that, I proceeded to my second test- making HTTP connections to websites, in order to download web pages. For this test, I wrote another test program which would do the same as the last test, except instead of computing a math equation, it would download a web page using an HTTP library.
I ended up writing the following script in Python. It is very straightforward, it iterates 10 times over downloading a web page, then prints the average.
from requests import get
from time import time

# Start the timer
start = time()

# Loop 10 times
for i in range(10):
    # Execute GET request
    get("https://httpbin.org/get")

# Stop the timer
stop = time()

# Calculate and print average
avg = (stop - start) / 10
print(avg)
# Prints 0.5385, on my system.
For the Java test, I wrote the following piece of code. It is the same test as before, but implemented in Java.
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import java.io.IOException;
import java.util.Objects;
public class Test {

    public static String run(String url) throws IOException {
        // Code taken from OKHTTP docs
        // https://square.github.io/okhttp/
        // https://raw.githubusercontent.com/square/okhttp/master/samples/guide/src/main/java/okhttp3/guide/GetExample.java
        OkHttpClient client = new OkHttpClient();
        Request request = new Request.Builder()
                .url(url)
                .build();
        try (Response response = client.newCall(request).execute()) {
            return Objects.requireNonNull(response.body()).string();
        }
    }

    public static void main(String[] args) throws IOException {
        // Start the timer
        long startTime = System.nanoTime();

        // Loop 10 times
        for (int i = 0; i < 10; i++) {
            // Execute GET request
            run("https://httpbin.org/get");
        }

        // Stop the timer
        long endTime = System.nanoTime();

        // Calculate the average
        float average = (((float) (endTime - startTime)) / 1000000000) / 10;

        // Print results (1.05035 on my system)
        System.out.println(average);
    }
}
Huh... that's unexpected. Isn't Java supposed to be faster than Python? I was shocked to see Java come out almost two times slower than Python in this test, but I was determined to find a conclusion that favors Java. So I re-wrote the test using the Java standard library instead of the OkHttp library. I won't be showing the code here since it's pretty long, but I used HttpURLConnection. My result was essentially the same, only slightly quicker than with the OkHttp library.
My final test was to do the same as the previous tests, but against http:// sites (in case the slowness was due to the SSL connection). My result was still the same: Python was faster by almost two times. The only explanation I could think of was that the Python requests library might be written in C, but as the "Languages" section of its GitHub page shows, requests is written in pure Python.
I would like to understand why Java is so slow at running HTTP connections, and if I did something wrong with my system setup or Java test code, what should I have written to improve the results? Also, if it's possible, how can a Java HTTP request be sent so that it is faster than its Python requests counterpart?
I was really skeptical about the results you got, so I gave it a try with the exact same Python code and main Java method (that uses https) as yours.
Here is the Java run method that reads the entire JSON content of the response:
private static String run(String url) throws IOException {
    final URLConnection c = new URL(url).openConnection();
    c.connect();
    try (InputStream is = c.getInputStream()) {
        final byte[] buf = new byte[1024];
        final StringBuilder b = new StringBuilder();
        int read = 0;
        while ((read = is.read(buf)) != -1) {
            b.append(new String(buf, 0, read));
        }
        return b.toString();
    }
}
Results on my system:
Python 2.7.12: 0.5117351770401001
Python 3.5.2: 0.48344600200653076
Java 1.8: 0.19684727
10 iterations is probably not enough to get a good result, but here, Java is at least 2 times faster.
TLDR:
The majority of a request's lifespan is spent in actual network traffic. Even though Java is faster than Python, it can only shave off a few milliseconds per request, as most of the time recorded for one request is due to server lag/latency. Also, reuse the Python Session and Java OkHttpClient objects in order to opt in to critical optimizations that cut out a drastic amount of computation time.
There are a few mistakes I made in my post. The first is that I generated a new OkHttpClient object for each request (and, on the Python side, used the module-level get function directly). As Jesse pointed out in a comment, by doing this I missed out on heavy optimizations and therefore got an unfair result.
To fix this, I used a Session object to persist my request history, and saved the same OkHttpClient object likewise.
My improvement implemented in Python:
from requests import Session
from time import time

# Start the timer
start = time()

# Create a new Session <-----
s = Session()

# Loop a few times
for i in range(50):
    # Execute GET request
    s.get("http://httpbin.org/get")

# Stop the timer
stop = time()

# Calculate and print average
avg = (stop - start) / 50
print(avg)  # 180ms on my system
Likewise, I implemented the same concept in Java using a basic singleton class and a few wrapper classes on top of the OkHttp library. I won't be posting the entire code here, as I decided to spread it across many classes, but the underlying idea is simple. After making these changes and logging my newfound statistics, I ended up with a chart (not reproduced here) showing the following: Python does in fact have a quicker initialization process for the first request executed. However, for the average request (the average of 50 consecutive, synchronous requests) the Java libraries (URLConnection & OkHttp) shave off a few milliseconds compared to the Python requests library.
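For reference, a minimal sketch of the client-reuse idea (not my full multi-class version; the class name is illustrative) looks like this:

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import java.io.IOException;
import java.util.Objects;

public class ReusedClientTest {
    // One shared client: its connection pool and threads are reused across requests.
    private static final OkHttpClient CLIENT = new OkHttpClient();

    static String run(String url) throws IOException {
        Request request = new Request.Builder().url(url).build();
        try (Response response = CLIENT.newCall(request).execute()) {
            return Objects.requireNonNull(response.body()).string();
        }
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < 50; i++) {
            run("https://httpbin.org/get");
        }
        long elapsed = System.nanoTime() - start;
        System.out.println((elapsed / 1000000000.0) / 50 + " s/request");
    }
}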
Summary:
By reusing the Python Session and Java OkHttpClient objects (initialize the object once and use it for all requests, instead of making a new one for each request), heavy optimizations are done, and therefore the execution times are drastically lowered. However, when it comes to averages, Java only beats Python by a few measly milliseconds, as most of the time spent during a request is from network transmission time (the time it takes to send data between computers over the Internet).
If anyone would like to comment more information or show their own findings in another answer, I would be ecstatic to read more about it. Thank you to those who commented on my question and helped me figure out a few key components to the optimization process. Long live Java :)

Why is a particular Guava Stopwatch.elapsed() call much later than others? (output in post)

I am working on a small game project and want to track time in order to process physics. After scrolling through different approaches, I at first decided to use Java's Instant and Duration classes, and have now switched over to Guava's Stopwatch implementation. However, in my snippet, both of those approaches show a big gap at the second call of runtime.elapsed(). That doesn't seem like a big problem in the long run, but why does it happen?
I have tried running the code below both in focus and as a Thread, on Windows and on Linux (Ubuntu 18.04), and the result stays the same: the exact values differ, but the gap occurs. I am using the IntelliJ IDEA environment with JDK 11.
Snippet from Main:
public static void main(String[] args) {
    MassObject[] planets = {
        new Spaceship(10, 0, 6378000)
    };
    planets[0].run();
}
This is part of my class MassObject, which extends Thread:
public void run() {
    // I am using StringBuilder to eliminate flushing delays.
    StringBuilder output = new StringBuilder();
    Stopwatch runtime = Stopwatch.createStarted();
    // massObjectList = static List<MassObject>;
    for (MassObject b : massObjectList) {
        if (b != this) calculateGravity(this, b);
    }
    for (int i = 0; i < 10; i++) {
        output.append(runtime.elapsed().getNano()).append("\n");
    }
    System.out.println(output);
}
Stdout:
30700
1807000
1808900
1811600
1812400
1813300
1830200
1833200
1834500
1835500
Thanks for your help.
You're calling Duration.getNano() on the Duration returned by elapsed(), which isn't what you want.
The internal representation of a Duration is a number of seconds plus a nano offset for whatever additional fraction of a whole second there is in the duration. Duration.getNano() returns that nano offset, and should almost never be called unless you're also calling Duration.getSeconds().
The method you probably want to be calling is toNanos(), which converts the whole duration to a number of nanoseconds.
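A small example of the difference:

import java.time.Duration;

public class DurationDemo {
    public static void main(String[] args) {
        Duration d = Duration.ofSeconds(2, 345);
        System.out.println(d.getNano());  // 345: only the nano-of-second offset
        System.out.println(d.toNanos());  // 2000000345: the whole duration in nanos
    }
}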
Edit: In this case that doesn't explain what you're seeing because it does appear that the nano offsets being printed are probably all within the same second, but it's still the case that you shouldn't be using getNano().
The actual issue is probably some combination of classloading or extra work that has to happen during the first call, and/or JIT improving performance of future calls (though I don't think looping 10 times is necessarily enough that you'd see much of any change from JIT).

Start event upon a given timespan

I have been researching this and am struggling to choose the best option. I am using a Processing sketch that runs Java code, and I want to start an animation on several computers (OS X and Windows) at the same time. The basic idea is to send an OSC message to each computer; after they receive the message, they store the current time plus a timespan (say, 10 seconds). Each computer then tracks the current time, and when it reaches the intended time, it starts the animation. What I cannot figure out is which method I should use: System.currentTimeMillis() or System.nanoTime(). I have already tested with two computers (with both methods) and it seems to work. Both computers are OS X, though; I have never tried with a Windows one, and it seems System.currentTimeMillis() can have a lag of 50 ms. I'm really confused about this; can someone explain or point me in the right direction?
Thanks in advance.
Simultaneous simulations on two or more computers are tricky for several reasons.
First of all, I would make sure all connected computers synchronize their clocks with NTP (see https://en.wikipedia.org/wiki/Network_Time_Protocol for more). Then the biggest difference between clocks is at most about 50 ms, as far as I know. Without that, every approach will fail because of the differences between the clocks.
Second, clocks on different systems have different accuracy. I can recommend reading Alexey Shipilev's blog post about the accuracy of clocks on machines in general: https://shipilev.net/blog/2014/nanotrusting-nanotime/.
Third, you need to know that Linux has a round-robin time slice of about 1 ms and Windows of about 10-15 ms. Therefore, Thread.sleep(...) will not work reliably with smaller time spans.
If you want to work with smaller time spans you need to do a kind of "busy waiting", which is ugly but necessary:
public class SleepUtil {

    public static final long MIN_PRECISION_IN_MICROS = 15L;

    public static void main(String[] args) {
        long before = System.nanoTime();
        while (true) {
            final long after = System.nanoTime();
            long diff = (after - before) / 1000L;
            before = after;
            System.out.println(diff + " micros");
            SleepUtil.sleepMicros(500);
        }
    }

    private static void sleepMicros(int waitTimeInMicros) {
        final long startTimeInNanos = System.nanoTime();
        long elapsedTimeInMicros = 0L;
        while (elapsedTimeInMicros < waitTimeInMicros - MIN_PRECISION_IN_MICROS) {
            elapsedTimeInMicros = (System.nanoTime() - startTimeInNanos) / 1000L;
        }
    }
}
However, it will keep your CPU busy, and it is not always reliable (though it is most of the time).

Java: Why is calling a method for the first time slower?

Recently, I was writing a plugin in Java and found that retrieving an element (using get()) from a HashMap for the first time is very slow. Originally, I wanted to ask a question about that, and found this (no answers though). With further experiments, however, I noticed that this phenomenon happens with ArrayList too, and in fact with all methods.
Here is the code:
public class Test {

    public static void main(String[] args) {
        long startTime, stopTime;

        // Method 1
        System.out.println("Test 1:");
        for (int i = 0; i < 20; ++i) {
            startTime = System.nanoTime();
            testMethod1();
            stopTime = System.nanoTime();
            System.out.println((stopTime - startTime) + "ns");
        }

        // Method 2
        System.out.println("Test 2:");
        for (int i = 0; i < 20; ++i) {
            startTime = System.nanoTime();
            testMethod2();
            stopTime = System.nanoTime();
            System.out.println((stopTime - startTime) + "ns");
        }
    }

    public static void testMethod1() {
        // Do nothing
    }

    public static void testMethod2() {
        // Do nothing
    }
}
Snippet: Test Snippet
The output would be like this:
Test 1:
2485ns
505ns
453ns
603ns
362ns
414ns
424ns
488ns
325ns
426ns
618ns
794ns
389ns
686ns
464ns
375ns
354ns
442ns
404ns
450ns
Test 2:
3248ns
700ns
538ns
531ns
351ns
444ns
321ns
424ns
523ns
488ns
487ns
491ns
551ns
497ns
480ns
465ns
477ns
453ns
727ns
504ns
I ran the code a few times and the results were about the same. The first call can be even longer (>8000 ns) on my computer (Windows 8.1, Oracle Java 8u25).
Apparently, the first call is usually slower than the following calls (though some later calls may randomly take longer too).
Update:
I tried to learn some JMH and wrote a test program.
Code w/ sample output: Code
I don't know whether it's a proper benchmark (if the program has problems, tell me), but I found that the first warm-up iterations take more time (I use two warm-up iterations in case the warm-ups affect the results). I think the first warm-up iteration contains the very first call and is slower, so the phenomenon exists, if the test is valid.
So why does it happen?
You're calling System.nanoTime() inside a loop. Those calls are not free, so in addition to the time taken for an empty method you're actually measuring the time it takes to exit nanoTime call #1 and to enter nanoTime call #2.
To make things worse, you're doing that on Windows, where nanoTime performs worse than on other platforms.
Regarding JMH: I don't think it's much help in this situation. It's designed to measure by averaging many iterations, to avoid dead-code elimination, to account for JIT warmup, to avoid ordering dependence, ... and as far as I know it simply uses nanoTime under the hood too.
Its design goals pretty much aim for the opposite of what you're trying to measure.
You are measuring something. But that something might be several cache misses, nanoTime call overhead, some JVM internals (class loading? some kind of lazy initialization in the interpreter?), ... probably a combination thereof.
The point is that your measurement can't really be taken at face value. Even if there is a certain cost to calling a method for the first time, the time you're measuring only provides an upper bound for that.
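To get a feel for the overhead of the timer itself, you can time back-to-back calls (a rough illustration, not a rigorous benchmark):

public class NanoOverhead {
    public static void main(String[] args) {
        // Each delta approximates the cost of one nanoTime() call pair:
        // exiting call #1 plus entering call #2, as described above.
        for (int i = 0; i < 10; i++) {
            long a = System.nanoTime();
            long b = System.nanoTime();
            System.out.println("delta=" + (b - a) + "ns");
        }
    }
}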
This kind of behaviour is often caused by the JIT compiler or runtime: it starts optimizing execution after the first iteration. Additionally, class loading can have an effect (though I guess not in your example code, as all classes are loaded during the first loop at the latest).
See this thread for a similar problem.
Please keep in mind this kind of behaviour is often dependent on the environment/OS it's running on.

Java BlockingQueue latency high on Linux

I am using BlockingQueues (trying both ArrayBlockingQueue and LinkedBlockingQueue) to pass objects between different threads in an application I'm currently working on. Performance and latency are relatively important in this application, so I was curious how much time it takes to pass objects between two threads using a BlockingQueue. In order to measure this, I wrote a simple program with two threads (one consumer and one producer), where I let the producer pass a timestamp (taken using System.nanoTime()) to the consumer; see the code below.
I recall reading somewhere on some forum that it took about 10 microseconds for someone else who tried this (I don't know on what OS and hardware that was), so I was not too surprised when it took ~30 microseconds for me on my Windows 7 box (Intel E7500 Core 2 Duo CPU, 2.93 GHz), while running a lot of other applications in the background. However, I was quite surprised when I did the same test on our much faster Linux server (two Intel X5677 3.46 GHz quad-core CPUs, running Debian 5 with kernel 2.6.26-2-amd64). I expected the latency to be lower than on my Windows box, but on the contrary it was much higher: ~75-100 microseconds. Both tests were done with Sun's HotSpot JVM version 1.6.0-23.
Has anyone else done any similar tests with similar results on Linux? Or does anyone know why it is so much slower on Linux (with better hardware)? Could it be that thread switching is simply this much slower on Linux compared to Windows? If that's the case, it seems like Windows is actually much better suited for some kinds of applications. Any help in understanding the relatively high figures is much appreciated.
Edit:
After a comment from DaveC, I also did a test where I restricted the JVM (on the Linux machine) to a single core (i.e. all threads running on the same core). This changed the results dramatically: the latency went down to below 20 microseconds, i.e. better than the results on the Windows machine. I also did some tests where I restricted the producer thread to one core and the consumer thread to another (trying both to have them on the same socket and on different sockets), but this did not seem to help; the latency was still ~75 microseconds. Btw, this test application is pretty much all I'm running on the machine while performing the tests.
Does anyone know if these results make sense? Should it really be that much slower if the producer and the consumer are running on different cores? Any input is really appreciated.
Edited again (6 January):
I experimented with different changes to the code and running environment:
1. I upgraded the Linux kernel to 2.6.36.2 (from 2.6.26.2). After the kernel upgrade, the measured time changed to 60 microseconds with very small variations, from 75-100 before the upgrade. Setting CPU affinity for the producer and consumer threads had no effect, except when restricting them to the same core. When running on the same core, the latency measured was 13 microseconds.
2. In the original code, I had the producer go to sleep for 1 second between every iteration, in order to give the consumer enough time to calculate the elapsed time and print it to the console. If I remove the call to Thread.sleep() and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly? My first guess was that the change had the effect that the producer called queue.put() before the consumer called queue.take(), so the consumer never had to block, but after playing around with a modified version of ArrayBlockingQueue, I found this guess to be false: the consumer did in fact block. If you have some other guess, please let me know. (Btw, if I let the producer call both Thread.sleep() and barrier.await(), the latency remains at 60 microseconds.)
3. I also tried another approach: instead of calling queue.take(), I called queue.poll() with a timeout of 100 microseconds. This reduced the average latency to below 10 microseconds, but is of course much more CPU-intensive (though probably less CPU-intensive than busy waiting?).
Edited again (10 January) - Problem solved:
ninjalj suggested that the latency of ~60 microseconds was due to the CPU having to wake up from deeper sleep states - and he was completely right! After disabling C-states in BIOS, the latency was reduced to <10 microseconds. This explains why I got so much better latency under point 2 above - when I sent objects more frequently the CPU was kept busy enough not to go to the deeper sleep states. Many thanks to everyone who has taken time to read my question and shared your thoughts here!
...
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CyclicBarrier;
public class QueueTest {

    ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(10);
    Thread consumerThread;
    CyclicBarrier barrier = new CyclicBarrier(2);
    static final int RUNS = 500000;
    volatile int sleep = 1000;

    public void start() {
        consumerThread = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    barrier.await();
                    for (int i = 0; i < RUNS; i++) {
                        consume();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        consumerThread.start();

        try {
            barrier.await();
        } catch (Exception e) { e.printStackTrace(); }

        for (int i = 0; i < RUNS; i++) {
            try {
                if (sleep > 0)
                    Thread.sleep(sleep);
                produce();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public void produce() {
        try {
            queue.put(System.nanoTime());
        } catch (InterruptedException e) {
        }
    }

    public void consume() {
        try {
            long t = queue.take();
            long now = System.nanoTime();
            long time = (now - t) / 1000; // Divide by 1000 to get result in microseconds
            if (sleep > 0) {
                System.out.println("Time: " + time);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        QueueTest test = new QueueTest();
        System.out.println("Starting...");
        // Run first once, ignoring results
        test.sleep = 0;
        test.start();
        // Run again, printing the results
        System.out.println("Starting again...");
        test.sleep = 1000;
        test.start();
    }
}
Your test is not a good measure of queue handoff latency, because you have a single thread reading off the queue which writes synchronously to System.out (doing a String and long concatenation while it's at it) before it takes again. To measure this properly you need to move that activity out of this thread and do as little work as possible in the taking thread.
You'd be better off just doing the calculation (now - then) in the taker and adding the result to some other collection which is periodically drained by another thread that outputs the results. I tend to do this by adding to an appropriately pre-sized array-backed structure accessed via an AtomicReference (hence the reporting thread just has to getAndSet on that reference with another instance of that storage structure in order to grab the latest batch of results; e.g. make two lists, set one as active, and every x seconds a thread wakes up and swaps the active and the passive ones), as sketched below. You can then report some distribution instead of every single result (e.g. a decile range), which means you don't generate vast log files with every run and still get useful information printed for you.
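A rough sketch of that two-batch swap (class and field names are my own, purely illustrative):

import java.util.concurrent.atomic.AtomicReference;

public class LatencyRecorder {

    // Pre-sized, array-backed storage the taking thread appends to.
    static final class Batch {
        final long[] values = new long[1000000];
        int count; // written only by the taking thread
    }

    private final AtomicReference<Batch> active = new AtomicReference<Batch>(new Batch());

    // Called from the taking thread: as little work as possible.
    public void record(long latencyNanos) {
        Batch b = active.get();
        if (b.count < b.values.length) {
            b.values[b.count++] = latencyNanos;
        }
    }

    // Called periodically from the reporting thread; swaps in a fresh
    // batch and returns the old one for printing distributions. A brief
    // race at the swap is tolerated, as in the scheme described above.
    public Batch drain() {
        return active.getAndSet(new Batch());
    }
}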
FWIW, I concur with the times Peter Lawrey stated, and if latency is really critical then you need to think about busy waiting with appropriate CPU affinity (i.e. dedicating a core to that thread).
EDIT after Jan 6
If I remove the call to Thread.sleep () and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly?
You're looking at the difference between java.util.concurrent.locks.LockSupport#park (and the corresponding unpark) and Thread#sleep. Most j.u.c. machinery is built on LockSupport (often via an AbstractQueuedSynchronizer, which ReentrantLock provides or uses directly), and in HotSpot this resolves down to sun.misc.Unsafe#park (and unpark), which tends to end up in the hands of the pthread (POSIX threads) library: typically pthread_cond_broadcast to wake up, and pthread_cond_wait or pthread_cond_timedwait for things like BlockingQueue#take.
I can't say I've ever looked at how Thread#sleep is actually implemented (because I've never come across anything low-latency that isn't a condition-based wait), but I would imagine that it causes the thread to be demoted by the scheduler more aggressively than the pthread signalling mechanism does, and that is what accounts for the latency difference.
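To see the park/unpark path in isolation, here is a toy illustration (note that park may also return spuriously, which a real condition-based wait would guard against):

import java.util.concurrent.locks.LockSupport;

public class ParkDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(new Runnable() {
            public void run() {
                long t0 = System.nanoTime();
                LockSupport.park(); // the same primitive j.u.c. blocking resolves to
                System.out.println("woke after " + (System.nanoTime() - t0) + "ns");
            }
        });
        waiter.start();
        Thread.sleep(100);          // give the waiter time to park
        LockSupport.unpark(waiter); // wake it, analogous to pthread_cond signalling
    }
}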
I would just use an ArrayBlockingQueue if you can. When I have used it, the latency was between 8 and 18 microseconds on Linux. Some points of note:
The cost is largely the time it takes to wake up the thread. When you wake up a thread, its data/code won't be in cache, so you will find that if you time what happens after a thread has woken, it can take 2-5x longer than if you were to run the same thing repeatedly.
Certain operations use OS calls (such as locking/cyclic barriers); these are often more expensive in a low-latency scenario than busy waiting. I suggest trying to busy-wait in your producer rather than using a CyclicBarrier. You could busy-wait in your consumer as well, but this could be unreasonably expensive on a real system.
@Peter Lawrey
Certain operations use OS calls (such as locking/cyclic barriers)
Those are NOT OS (kernel) calls. They are implemented via simple CAS (which on x86 comes with a free memory fence as well).
One more thing: don't use ArrayBlockingQueue unless you know why you're using it.
@OP:
Look at ThreadPoolExecutor; it offers an excellent producer/consumer framework.
Edit below:
To reduce the latency (barring the busy wait), change the queue to a SynchronousQueue and add the following line before starting the consumer:
...
consumerThread.setPriority(Thread.MAX_PRIORITY);
consumerThread.start();
This is the best you can get.
Edit2:
Here it is with the synchronous queue, and without printing the results:
package t1;
import java.math.BigDecimal;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
public class QueueTest {

    static final int RUNS = 250000;

    final SynchronousQueue<Long> queue = new SynchronousQueue<Long>();
    int sleep = 1000;
    long[] results = new long[0];

    public void start(final int runs) throws Exception {
        results = new long[runs];
        final CountDownLatch barrier = new CountDownLatch(1);
        Thread consumerThread = new Thread(new Runnable() {
            @Override
            public void run() {
                barrier.countDown();
                try {
                    for (int i = 0; i < runs; i++) {
                        results[i] = consume();
                    }
                } catch (Exception e) {
                    return;
                }
            }
        });
        consumerThread.setPriority(Thread.MAX_PRIORITY);
        consumerThread.start();

        barrier.await();
        final long sleep = this.sleep;
        for (int i = 0; i < runs; i++) {
            try {
                doProduce(sleep);
            } catch (Exception e) {
                return;
            }
        }
    }

    private void doProduce(final long sleep) throws InterruptedException {
        produce();
    }

    public void produce() throws InterruptedException {
        queue.put(new Long(System.nanoTime())); // new Long() is faster than valueOf()
    }

    public long consume() throws InterruptedException {
        long t = queue.take();
        long now = System.nanoTime();
        return now - t;
    }

    public static void main(String[] args) throws Throwable {
        QueueTest test = new QueueTest();
        System.out.println("Starting + warming up...");
        // Run first once, ignoring results
        test.sleep = 0;
        test.start(15000); // 10k is the normal warm-up for -server hotspot
        // Run again, printing the results
        System.gc();
        System.out.println("Starting again...");
        test.sleep = 1000; // ignored now
        Thread.yield();
        test.start(RUNS);

        long sum = 0;
        for (long elapsed : test.results) {
            sum += elapsed;
        }
        BigDecimal elapsed = BigDecimal.valueOf(sum, 3)
                .divide(BigDecimal.valueOf(test.results.length), BigDecimal.ROUND_HALF_UP);
        System.out.printf("Avg: %1.3f micros%n", elapsed);
    }
}
If latency is critical and you do not require strict FIFO semantics, then you may want to consider JSR-166's LinkedTransferQueue. It enables elimination so that opposing operations can exchange values instead of synchronizing on the queue data structure. This approach helps reduce contention, enables parallel exchanges, and avoids thread sleep/wake-up penalties.
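A minimal handoff sketch (assuming java.util.concurrent.LinkedTransferQueue, which is where the JSR-166 class ended up as of Java 7; the class name here is illustrative):

import java.util.concurrent.LinkedTransferQueue;

public class TransferDemo {
    public static void main(String[] args) throws InterruptedException {
        final LinkedTransferQueue<Long> queue = new LinkedTransferQueue<Long>();

        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    long t = queue.take();
                    System.out.println("latency=" + (System.nanoTime() - t) + "ns");
                } catch (InterruptedException ignored) {
                }
            }
        });
        consumer.start();

        Thread.sleep(100);                 // let the consumer block in take()
        queue.transfer(System.nanoTime()); // hands the value directly to the waiting taker
    }
}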
