Groovy ASTBuilder bad performance with multiple threads

I'm using Groovy's ASTBuilder (version 2.5.5) in a project. It's being used to parse and analyze groovy expressions received via a REST API. This REST service receives thousands of requests, and the analysis is done on the fly.
I'm noticing some serious performance issues in a multithreaded environment. Below is a simulation, running 100 threads in parallel:
int numthreads = 100;
final Callable<Void> task = () -> {
    long initial = System.currentTimeMillis();
    // Simple rule
    new AstBuilder().buildFromString("a+b");
    System.out.print(String.format("\n\nThread took %s ms.",
            System.currentTimeMillis() - initial));
    return null;
};
final ExecutorService executorService = Executors.newFixedThreadPool(numthreads);
final List<Callable<Void>> tasks = new ArrayList<>();
while (numthreads-- > 0) {
    tasks.add(task);
}
for (Future<Void> future : executorService.invokeAll(tasks)) {
    future.get();
}
I'm trying different thread loads. The greater the number of threads, the slower each one runs:
100 threads => ~1800ms
200 threads => ~2500ms
300 threads => ~4000ms
However, if I serialize the threads (for example, by setting the pool size to 1), I get much better results, around 10 ms per thread. Can someone please help me understand why this is happening?

When running multithreaded code, the computer shares its physical CPU cores between the threads. That means the more the number of threads exceeds the number of cores, the less benefit you get from each thread. In your example, the number of threads increases with the number of tasks, so as the task count grows, every CPU core is forced to process more and more threads. At the same time, you may notice that the difference between numthreads = 1 and numthreads = 4 is very small, because in that case every core processes only a few threads (or even just one). Don't set the number of threads much higher than the number of hardware threads your CPU provides, because it doesn't make a lot of sense.
Additionally, in your example you're comparing how different numbers of threads perform with different numbers of tasks. But in order to see the efficiency of multithreaded code, you have to compare how different numbers of threads perform with the same number of tasks. I would change the example as follows:
int threadNumber = 16;
int taskNumber = 200;
// ... task defined as above
final ExecutorService executorService = Executors.newFixedThreadPool(threadNumber);
final List<Callable<Void>> tasks = new ArrayList<>();
while (taskNumber-- > 0) {
    tasks.add(task);
}
long start = System.currentTimeMillis();
for (Future<Void> future : executorService.invokeAll(tasks)) {
    future.get();
}
long end = System.currentTimeMillis() - start;
System.out.println(end);
executorService.shutdown();
Try this code for threadNumber = 1 and, let's say, threadNumber = 16, and you'll see the difference.
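As a rule of thumb (my addition, not part of the answer above), you can derive the pool size from the hardware instead of hard-coding it:

// Size the pool from the number of hardware threads the JVM reports.
int threadNumber = Runtime.getRuntime().availableProcessors();
ExecutorService executorService = Executors.newFixedThreadPool(threadNumber);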

Dynamic evaluation of expressions involves a lot of resources including class loading, security manager, compilation and execution. It is not designed for high performance. If you just need to evaluate an expression for its value, you could try groovy.util.Eval. It may not consume as many resources as AstBuilder. However, it is probably not going to be that much different, so don't expect too much.
If you want to get the AST only and not any extra information like types, you could call the parser more directly. This would involve a lot fewer resources. See org.codehaus.groovy.control.ParserPluginFactory for more direct access to the source parser.
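For instance, a minimal sketch of the Eval route (my example, not from the answer; Eval.xy binds two variables, so the expression is written in terms of x and y):

import groovy.util.Eval;

// Evaluates the expression with x = 2, y = 3. Note that each call still
// compiles and loads a script class under the hood, so cache results
// where you can.
Object result = Eval.xy(2, 3, "x + y");
System.out.println(result); // prints 5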

Related

Why does parallelStream not use the entire available parallelism?

I have a custom ForkJoinPool created with parallelism of 25.
customForkJoinPool = new ForkJoinPool(25);
I have a list of 700 file names and I used code like this to download the files from S3 in parallel and cast them to Java objects:
customForkJoinPool.submit(() -> {
    return fileNames
        .parallelStream()
        .map((fileName) -> {
            Logger log = Logger.getLogger("ForkJoinTest");
            long startTime = System.currentTimeMillis();
            log.info("Starting job at Thread:" + Thread.currentThread().getName());
            MyObject obj = readObjectFromS3(fileName);
            long endTime = System.currentTimeMillis();
            log.info("completed a job with Latency:" + (endTime - startTime));
            return obj;
        })
        .collect(Collectors.toList());
});
When I look at the logs, I see only 5 threads being used. With a parallelism of 25, I expected this to use 25 threads. The average latency to download and convert the file to an object is around 200ms. What am I missing?
Maybe a better question is: how does a parallel stream figure out how much to split the original list before creating threads for it? In this case, it looks like it decided to split it 5 times and stop.
Why are you doing this with ForkJoinPool? It's meant for CPU-bound tasks with subtasks that are too fast to warrant individual scheduling. Your workload is IO-bound and with 200ms latency the individual scheduling overhead is negligible.
Use an Executor:
import static java.util.stream.Collectors.toList;
import static java.util.concurrent.CompletableFuture.supplyAsync;

ExecutorService threads = Executors.newFixedThreadPool(25);
List<MyObject> result = fileNames.stream()
        .map(fn -> supplyAsync(() -> readObjectFromS3(fn), threads))
        .collect(toList()).stream()
        .map(CompletableFuture::join)
        .collect(toList());
I think that the answer is in this excerpt from the ForkJoinPool javadoc:
"The pool attempts to maintain enough active (or available) threads by dynamically adding, suspending, or resuming internal worker threads, even if some tasks are stalled waiting to join others. However, no such adjustments are guaranteed in the face of blocked I/O or other unmanaged synchronization."
In your case, the downloads will be performing blocking I/O operations.
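If you do want to keep the ForkJoinPool, one option (my sketch, not part of the original answer) is to wrap the blocking download in a ForkJoinPool.ManagedBlocker, which lets the pool start a compensating worker thread; readObjectFromS3 and MyObject are the question's own placeholders:

import java.util.concurrent.ForkJoinPool;

public class S3Downloads {
    // Sketch: tell the pool this task is about to block on I/O,
    // so it can keep its target parallelism with extra workers.
    static MyObject downloadWithManagedBlock(final String fileName) throws InterruptedException {
        class S3Blocker implements ForkJoinPool.ManagedBlocker {
            MyObject result;
            public boolean block() {
                result = readObjectFromS3(fileName); // the blocking download
                return true;                         // finished; no need to call again
            }
            public boolean isReleasable() {
                return result != null;               // already done?
            }
        }
        S3Blocker b = new S3Blocker();
        ForkJoinPool.managedBlock(b);
        return b.result;
    }
}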

WordNetSimalarity in large dataset of synsets

I use the WordNet similarity Java API to measure similarity between two synsets, like this:
public class WordNetSimalarity {
    private static ILexicalDatabase db = new NictWordNet();
    private static RelatednessCalculator[] rcs = {
            new HirstStOnge(db), new LeacockChodorow(db), new Lesk(db), new WuPalmer(db),
            new Resnik(db), new JiangConrath(db), new Lin(db), new Path(db)
    };

    public static double computeSimilarity(String word1, String word2) {
        WS4JConfiguration.getInstance().setMFS(true);
        double s = 0;
        for (RelatednessCalculator rc : rcs) {
            s = rc.calcRelatednessOfWords(word1, word2);
            // System.out.println(rc.getClass().getName() + "\t" + s);
        }
        return s;
    }
}
Main class
public static void main(String[] args) {
    long t0 = System.currentTimeMillis();
    File source = new File("TagsFiltered.txt");
    File target = new File("fich4.txt");
    ArrayList<String> sList = new ArrayList<>();
    try {
        if (!target.exists()) target.createNewFile();
        Scanner scanner = new Scanner(source);
        PrintStream psStream = new PrintStream(target);
        while (scanner.hasNext()) {
            sList.add(scanner.nextLine());
        }
        for (int i = 0; i < sList.size(); i++) {
            for (int j = i + 1; j < sList.size(); j++) {
                psStream.println(sList.get(i) + " " + sList.get(j) + " "
                        + WordNetSimalarity.computeSimilarity(sList.get(i), sList.get(j)));
            }
        }
        psStream.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    long t1 = System.currentTimeMillis();
    System.out.println("Done in " + (t1 - t0) + " msec.");
}
My database contains 595 synsets, which means the computeSimilarity method will be called (595*594/2) times.
Computing the similarity between two words takes more than 5000 ms!
So to finish my task I would need at least a week!
My question is how to reduce this running time.
How can I improve performance?
I don't think language is your issue.
You can help yourself with parallelism. I think this would be a good candidate for map reduce and Hadoop.
Have you tried the MatrixCalculator?
I don't know if it is possible to optimize this algorithm-wise.
But you can definitely run this much faster. On my machine this operation takes half the time, so if you have an eight-core i7, you'd need 15 hours to process everything (if you process the loop in parallel).
You can get virtual machines from Amazon Web Services. So if you get several machines and run multithreaded processing on different chunks of the data on each machine, you will complete in several hours.
Technically it is possible to use Hadoop for this, but if you need to run this just once, making the computation parallel and launching it on several machines will be simpler, in my opinion.
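As an illustration of "processing the loop in parallel", here is a sketch (my code, not the answerer's; it assumes WordNetSimalarity.computeSimilarity is thread-safe, which is worth verifying for WS4J before relying on this):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelPairs {
    public static void main(String[] args) throws Exception {
        List<String> sList = new ArrayList<>(); // filled from TagsFiltered.txt as before
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < sList.size(); i++) {
            for (int j = i + 1; j < sList.size(); j++) {
                final String a = sList.get(i), b = sList.get(j);
                // Each pair becomes an independent task.
                results.add(pool.submit(() ->
                        a + " " + b + " " + WordNetSimalarity.computeSimilarity(a, b)));
            }
        }
        for (Future<String> f : results) {
            System.out.println(f.get()); // results come back in submission order
        }
        pool.shutdown();
    }
}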
Perl is different from a lot of other languages when it comes to threading/forking.
One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with: you don't have to worry about the thread safety of libraries or most of your code, just the threaded part. However, it can be a performance drag and memory hungry, as Perl must put a copy of the interpreter and all loaded modules into each thread.
When it comes to forking, I will only be talking about Unix. Perl emulates fork on Windows using threads; it works, but it can be slow and buggy.
Forking Advantages
Very fast to create a fork
Very robust
Forking Disadvantages
Communicating between the processes can be slow and awkward
Thread Advantages
Thread coordination and data interchange is fairly easy
Threads are fairly easy to use
Thread Disadvantages
Each thread takes a lot of memory
Threads can be slow to start
Threads can be buggy (the more recent your Perl, the better)
Database connections are not shared across threads
In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.
For either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built in inter-process communication.
For threads you're going to want to use threads::shared, Thread::Queue and read perlthrtut.
Also, the number of threads is going to be dependent on the number of cores that your computer has. If you have four cores, creating more than 3 threads isn't really going to be very helpful (3 + 1 for your main program).
To be honest, though, threads/forking may not be the way to go. In fact, in many situations they can even slow stuff down due to overhead. If you really need the speed, the best way to get it is going to be through distributed computing. I would suggest that you look into some sort of distributed computing platform to improve your runtime. If you can reduce the dimensionality of your search/compareTo space to less than n^2, then map reduce or Hadoop may be a good option; otherwise, you'll just have a bunch of overhead and no use of the real scalability that Hadoop offers (@Thomas Jungblut).

Java- FixedThreadPool with known pool size but unknown workers

So I think I sort of understand how fixed thread pools work (using Executors.newFixedThreadPool built into Java), but from what I can see, there's usually a set number of jobs you want done, and you know how many when you start the program. For example:
int numWorkers = Integer.parseInt(args[0]);
int threadPoolSize = Integer.parseInt(args[1]);
ExecutorService tpes = Executors.newFixedThreadPool(threadPoolSize);
WorkerThread[] workers = new WorkerThread[numWorkers];
for (int i = 0; i < numWorkers; i++) {
    workers[i] = new WorkerThread(i);
    tpes.execute(workers[i]);
}
where each WorkerThread does something really simple; that part is arbitrary. What I want to know is: what if you have a fixed pool size (say 8 max), but you don't know how many workers you'll need to finish the task until runtime?
The specific example is: If I have a pool size of 8 and I'm reading from standard input. As I read, I split the input into blocks of a set size. Each one of these blocks is given to a thread (along with some other information) so that they can compress it. As such, I don't know how many threads I'll need to create as I need to keep going until I reach the end of the input. I also have to somehow ensure that the data stays in the same order. If thread 2 finishes before thread 1 and just submits its work, my data will be out of order!
Would a thread pool be the wrong approach in this situation then? It seems like it'd be great (since I can't use more than 8 threads at a time).
Basically, I want to do something like this:
ExecutorService tpes = Executors.newFixedThreadPool(threadPoolSize);
BufferedInputStream inBytes = new BufferedInputStream(System.in);
byte[] buff = new byte[BLOCK_SIZE];
byte[] dict = new byte[DICT_SIZE];
WorkerThread worker;
int bytesRead = 0;
while ((bytesRead = inBytes.read(buff)) != -1) {
    System.arraycopy(buff, BLOCK_SIZE - DICT_SIZE, dict, 0, DICT_SIZE);
    worker = new WorkerThread(buff, dict);
    tpes.execute(worker);
}
This is not working code, I know, but I'm just trying to illustrate what I want.
I left out a bit, but see how buff and dict have changing values and that I don't know how long the input is. I don't think I can actually do this, though, because worker already exists after the first call! I can't just say worker = new WorkerThread a bunch of times, since isn't it already pointing at an existing thread (true, a thread that might be dead)? And obviously, in this implementation, if it did work, I wouldn't be running in parallel. But my point is: I want to keep creating threads until I hit the max pool size, wait till a thread is done, then keep creating threads until I hit the end of the input.
I also need to keep stuff in order, which is the part that's really annoying.
Your solution is completely fine (the only point is that parallelism is perhaps not necessary if the workload of your WorkerThreads is very small).
With a thread pool, the number of submitted tasks is not relevant. There may be fewer or more tasks than threads in the pool; the thread pool takes care of that.
However, and this is important: you rely on some kind of ordering of the results of your WorkerThreads, but when using parallelism, this order is not guaranteed! It doesn't matter whether you use a thread pool, or how many worker threads you have, etc.; it will always be possible for your results to finish in an arbitrary order!
To keep the order right, give each WorkerThread the number of the current item in its constructor, and let them put their results in the right order after they are finished:
int noOfWorkItem = 0;
while ((bytesRead = inBytes.read(buff)) != -1) {
    System.arraycopy(buff, BLOCK_SIZE - DICT_SIZE, dict, 0, DICT_SIZE);
    worker = new WorkerThread(buff, dict, noOfWorkItem++);
    tpes.execute(worker);
}
As @ignis points out, parallel execution may not be the best answer for your situation.
However, to answer the more general question, there are several other Executor implementations to consider beyond FixedThreadPool, some of which may have the characteristics that you desire.
As far as keeping things in order, typically you would submit tasks to the executor, and for each submission, you get a Future (which is an object that promises to give you a result later, when the task finishes). So, you can keep track of the Futures in the order that you submitted tasks, and then when all tasks are done, invoke get() on each Future in order, to get the results.
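As an illustration of that Future-based approach, here is a minimal sketch (my code, not the answerer's): compress and BLOCK_SIZE are placeholders, and each block is copied because buff is reused between reads.

import java.io.BufferedInputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

public class OrderedCompress {
    static final int BLOCK_SIZE = 64 * 1024;

    static byte[] compress(byte[] block) { return block; } // placeholder

    public static void main(String[] args) throws Exception {
        ExecutorService tpes = Executors.newFixedThreadPool(8);
        BufferedInputStream inBytes = new BufferedInputStream(System.in);
        byte[] buff = new byte[BLOCK_SIZE];
        List<Future<byte[]>> futures = new ArrayList<>();
        int bytesRead;
        while ((bytesRead = inBytes.read(buff)) != -1) {
            final byte[] block = Arrays.copyOf(buff, bytesRead); // copy: buff is reused
            futures.add(tpes.submit(() -> compress(block)));
        }
        for (Future<byte[]> f : futures) {
            System.out.write(f.get()); // get() blocks per future; output keeps input order
        }
        System.out.flush();
        tpes.shutdown();
    }
}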

Why using Java threads is not much faster?

I have the following program to remove even numbers from a string vector. When the vector grows larger, it can take a long time, so I thought of threads, but using 10 threads is not faster than one thread. My PC has 6 cores and 12 threads, so why?
import java.util.*;
public class Test_Threads
{
static boolean Use_Threads_To_Remove_Duplicates(Vector<String> Good_Email_Address_Vector,Vector<String> To_Be_Removed_Email_Address_Vector)
{
boolean Removed_Duplicates=false;
int Threads_Count=10,Delay=5,Average_Size_For_Each_Thread=Good_Email_Address_Vector.size()/Threads_Count;
Remove_Duplicate_From_Vector_Thread RDFVT[]=new Remove_Duplicate_From_Vector_Thread[Threads_Count];
Remove_Duplicate_From_Vector_Thread.To_Be_Removed_Email_Address_Vector=To_Be_Removed_Email_Address_Vector;
for (int i=0;i<Threads_Count;i++)
{
Vector<String> Target_Vector=new Vector<String>();
if (i<Threads_Count-1) for (int j=i*Average_Size_For_Each_Thread;j<(i+1)*Average_Size_For_Each_Thread;j++) Target_Vector.add(Good_Email_Address_Vector.elementAt(j));
else for (int j=i*Average_Size_For_Each_Thread;j<Good_Email_Address_Vector.size();j++) Target_Vector.add(Good_Email_Address_Vector.elementAt(j));
RDFVT[i]=new Remove_Duplicate_From_Vector_Thread(Target_Vector,Delay);
}
try { for (int i=0;i<Threads_Count;i++) RDFVT[i].Remover_Thread.join(); }
catch (Exception e) { e.printStackTrace(); } // Wait for all threads to finish
for (int i=0;i<Threads_Count;i++) if (RDFVT[i].Changed) Removed_Duplicates=true;
if (Removed_Duplicates) // Collect results
{
Good_Email_Address_Vector.clear();
for (int i=0;i<Threads_Count;i++) Good_Email_Address_Vector.addAll(RDFVT[i].Target_Vector);
}
return Removed_Duplicates;
}
public static void out(String message) { System.out.print(message); }
public static void Out(String message) { System.out.println(message); }
public static void main(String[] args)
{
long start=System.currentTimeMillis();
Vector<String> Good_Email_Address_Vector=new Vector<String>(),To_Be_Removed_Email_Address_Vector=new Vector<String>();
for (int i=0;i<1000;i++) Good_Email_Address_Vector.add(i+"");
Out(Good_Email_Address_Vector.toString());
for (int i=0;i<1500000;i++) To_Be_Removed_Email_Address_Vector.add(i*2+"");
Out("=============================");
Use_Threads_To_Remove_Duplicates(Good_Email_Address_Vector,To_Be_Removed_Email_Address_Vector); // [ Approach 1 : Use 10 threads ]
// Good_Email_Address_Vector.removeAll(To_Be_Removed_Email_Address_Vector); // [ Approach 2 : just one thread ]
Out(Good_Email_Address_Vector.toString());
long end=System.currentTimeMillis();
Out("Time taken for execution is " + (end - start));
}
}
class Remove_Duplicate_From_Vector_Thread
{
static Vector<String> To_Be_Removed_Email_Address_Vector;
Vector<String> Target_Vector;
Thread Remover_Thread;
boolean Changed=false;
public Remove_Duplicate_From_Vector_Thread(final Vector<String> Target_Vector,final int Delay)
{
this.Target_Vector=Target_Vector;
Remover_Thread=new Thread(new Runnable()
{
public void run()
{
try
{
Thread.sleep(Delay);
Changed=Target_Vector.removeAll(To_Be_Removed_Email_Address_Vector);
}
catch (InterruptedException e) { e.printStackTrace(); }
finally { }
}
});
Remover_Thread.start();
}
}
In my program you can try "[ Approach 1 : Use 10 threads ]" or "[ Approach 2 : just one thread ]"; there isn't much difference speed-wise. I expected it to be several times faster. Why?
The simple answer is that your threads are all trying to access a single vector calling synchronized methods. The synchronized modifier on those methods ensures that only one thread can be executing any of the methods on that object at any given time. So a significant part of the parallel part of the computation involves waiting for other threads.
The other problem is that for an O(N) input list, you have an O(N) setup ... population of the Target_Vector objects ... that is done in one thread. Plus the overheads of thread creation.
All of this adds up to not much speedup.
You should get a significant speedup (with multiple threads) if you used a single ConcurrentHashMap instead of a single Good_Email_Address_Vector object that gets split into multiple Target_Vector objects:
the remove operation is O(1) not O(n),
reduced copying,
the data structure provides better multi-threaded performance due to better handling of contention, and
you don't need to jump through hoops to avoid ConcurrentModificationException.
In addition, the To_Be_Removed_Email_Address_Vector object should be replaced with an unsynchronized List, and List.sublist(...) should be used to create views that can be passed to the threads.
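For illustration, a small sketch of the subList idea (hypothetical names, my code; the views share the backing array, so nothing is copied, and they should be treated as read-only here):

import java.util.ArrayList;
import java.util.List;

public class Partition {
    // Split a read-only list into nThreads contiguous views without copying.
    static List<List<String>> partition(List<String> all, int nThreads) {
        List<List<String>> parts = new ArrayList<>();
        int chunk = (all.size() + nThreads - 1) / nThreads; // ceiling division
        for (int i = 0; i < all.size(); i += chunk) {
            parts.add(all.subList(i, Math.min(i + chunk, all.size())));
        }
        return parts;
    }
}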
In short, you are better off throwing away your current code and starting again. And please use sensible identifier names that follow the Java coding conventions, and wrap your code at around 80 columns so that people can read it!
Vector Synchronization Creates Contention
You've split up the vector to be modified, which avoids some contention. But multiple threads are accessing the static Vector To_Be_Removed_Email_Address_Vector, so much contention still remains (all Vector methods are synchronized).
Use an unsynchronized data structure for the shared, read-only information so that there is no contention between threads. On my machine, running your test with ArrayList in place of Vector cut the execution time in half.
Even without contention, thread-safe structures are slower, so don't use them when only a single thread has access to an object. Additionally, Vector has been largely obsolete since Java 5. Avoid it unless you have to inter-operate with a legacy API you can't alter.
Choose a Suitable Data Structure
A list data structure is going to provide poor performance for this task. Since email addresses are likely to be unique, a set should be a suitable replacement, and removeAll() will run much faster on large sets. Using HashSet in place of the original Vector cut execution time on my (8 core) machine from over 5 seconds to around 3 milliseconds. Roughly half of this improvement is due to using the right data structure for the job.
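A minimal single-threaded sketch of that substitution, using the question's data sizes (the class and names are mine, and it assumes the addresses are unique):

import java.util.HashSet;
import java.util.Set;

public class HashSetFilter {
    public static void main(String[] args) {
        Set<String> good = new HashSet<>();
        Set<String> toRemove = new HashSet<>();
        for (int i = 0; i < 1000; i++) good.add(i + "");
        for (int i = 0; i < 1500000; i++) toRemove.add(i * 2 + "");

        long start = System.currentTimeMillis();
        good.removeAll(toRemove); // hash lookups instead of linear scans
        System.out.println("Filtered in " + (System.currentTimeMillis() - start) + " ms");
    }
}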
Concurrent Structures Are a Bad Fit
Using a concurrent data structure is relatively slow and doesn't simplify the code, so I don't recommend it.
Using a more up-to-date concurrent data structure is much faster than contending for a Vector, but the concurrency overhead of these data structures is still much higher than single-threaded structures. For example, running the original code on my machine took more than five seconds, while a ConcurrentSkipListSet took half a second, and a ConcurrentHashMap took one eighth of a second. But remember, when each thread had its own HashSet to update, the total time was just 3 milliseconds.
Even when all threads are updating a single concurrent data structure, the code needed to partition the workload is very similar to that used to create a separate Vector for each thread in the original code. From a readability and maintenance standpoint, all of these solutions have equivalent complexity.
If you had a situation where "bad" email addresses were being added to the set asynchronously, and you wanted readers of the "good" list to see those updates auto-magically, a concurrent set would be a good choice. But, with the current design of the API, where consumers of the "good" list explicitly call a blocking filter method to update the list, a concurrent data structure may be the wrong choice.
All your threads are working on the same vector. Your access to the vector is serialized (i.e. only one thread can access it at a time) so using multiple threads is likely to be the same speed at best, but more likely to be much slower.
Multiple threads work much faster when you have independent tasks to perform.
In this case, the fastest option is likely to be to create a new List which contains all the elements you want to retain, and replace the original, all in one thread. This will be faster than using a concurrent collection with multiple threads.
For comparison, this is what you can do with one thread. As the collection is fairly small, the JVM doesn't warm up in just one run, so there are multiple dummy runs which are not printed.
public static void main(String... args) throws IOException, InterruptedException, ParseException {
    for (int n = -50; n < 5; n++) {
        List<String> allIds = new ArrayList<String>();
        for (int i = 0; i < 1000; i++) allIds.add(String.valueOf(i));

        long start = System.nanoTime();
        List<String> oddIds = new ArrayList<String>();
        for (String id : allIds) {
            if ((id.charAt(id.length() - 1) % 2) != 0)
                oddIds.add(id);
        }
        long time = System.nanoTime() - start;
        if (n >= 0)
            System.out.println("Time taken to filter " + allIds.size() + " entries was " + time / 1000 + " micro-seconds");
    }
}
prints
Time taken to filter 1000 entries was 136 micro-seconds
Time taken to filter 1000 entries was 141 micro-seconds
Time taken to filter 1000 entries was 136 micro-seconds
Time taken to filter 1000 entries was 137 micro-seconds
Time taken to filter 1000 entries was 138 micro-seconds

Java BlockingQueue latency high on Linux

I am using BlockingQueues (trying both ArrayBlockingQueue and LinkedBlockingQueue) to pass objects between different threads in an application I'm currently working on. Performance and latency are relatively important in this application, so I was curious how much time it takes to pass objects between two threads using a BlockingQueue. In order to measure this, I wrote a simple program with two threads (one consumer and one producer), where I let the producer pass a timestamp (taken using System.nanoTime()) to the consumer; see the code below.
I recall reading somewhere on some forum that it took about 10 microseconds for someone else who tried this (don't know on what OS and hardware that was), so I was not too surprised when it took ~30 microseconds for me on my Windows 7 box (Intel E7500 Core 2 Duo CPU, 2.93 GHz), whilst running a lot of other applications in the background. However, I was quite surprised when I did the same test on our much faster Linux server (two Intel X5677 3.46 GHz quad-core CPUs, running Debian 5 with kernel 2.6.26-2-amd64). I expected the latency to be lower than on my Windows box, but on the contrary it was much higher: ~75-100 microseconds! Both tests were done with Sun's HotSpot JVM version 1.6.0-23.
Has anyone else done any similar tests with similar results on Linux? Or does anyone know why it is so much slower on Linux (with better hardware)? Could it be that thread switching simply is this much slower on Linux compared to Windows? If that's the case, it seems like Windows is actually much better suited for some kinds of applications. Any help in understanding the relatively high figures is much appreciated.
Edit:
After a comment from DaveC, I also did a test where I restricted the JVM (on the Linux machine) to a single core (i.e. all threads running on the same core). This changed the results dramatically: the latency went down to below 20 microseconds, i.e. better than the results on the Windows machine. I also did some tests where I restricted the producer thread to one core and the consumer thread to another (trying both to have them on the same socket and on different sockets), but this did not seem to help; the latency was still ~75 microseconds. Btw, this test application is pretty much all I'm running on the machine while performing the tests.
Does anyone know if these results make sense? Should it really be that much slower if the producer and the consumer are running on different cores? Any input is really appreciated.
Edited again (6 January):
I experimented with different changes to the code and running environment:
I upgraded the Linux kernel to 2.6.36.2 (from 2.6.26.2). After the kernel upgrade, the measured time changed to 60 microseconds with very small variations, from 75-100 before the upgrade. Setting CPU affinity for the producer and consumer threads had no effect, except when restricting them to the same core. When running on the same core, the latency measured was 13 microseconds.
In the original code, I had the producer go to sleep for 1 second between every iteration, in order to give the consumer enough time to calculate the elapsed time and print it to the console. If I remove the call to Thread.sleep() and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly? My first guess was that the change had the effect that the producer called queue.put() before the consumer called queue.take(), so the consumer never had to block, but after playing around with a modified version of ArrayBlockingQueue, I found this guess to be false: the consumer did in fact block. If you have some other guess, please let me know. (Btw, if I let the producer call both Thread.sleep() and barrier.await(), the latency remains at 60 microseconds.)
I also tried another approach: instead of calling queue.take(), I called queue.poll() with a timeout of 100 micros. This reduced the average latency to below 10 microseconds, but is of course much more CPU intensive (though probably less CPU intensive than busy waiting?).
Edited again (10 January) - Problem solved:
ninjalj suggested that the latency of ~60 microseconds was due to the CPU having to wake up from deeper sleep states - and he was completely right! After disabling C-states in BIOS, the latency was reduced to <10 microseconds. This explains why I got so much better latency under point 2 above - when I sent objects more frequently the CPU was kept busy enough not to go to the deeper sleep states. Many thanks to everyone who has taken time to read my question and shared your thoughts here!
...
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CyclicBarrier;
public class QueueTest {
ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(10);
Thread consumerThread;
CyclicBarrier barrier = new CyclicBarrier(2);
static final int RUNS = 500000;
volatile int sleep = 1000;
public void start() {
consumerThread = new Thread(new Runnable() {
@Override
public void run() {
try {
barrier.await();
for(int i = 0; i < RUNS; i++) {
consume();
}
} catch (Exception e) {
e.printStackTrace();
}
}
});
consumerThread.start();
try {
barrier.await();
} catch (Exception e) { e.printStackTrace(); }
for(int i = 0; i < RUNS; i++) {
try {
if(sleep > 0)
Thread.sleep(sleep);
produce();
} catch (Exception e) {
e.printStackTrace();
}
}
}
public void produce() {
try {
queue.put(System.nanoTime());
} catch (InterruptedException e) {
}
}
public void consume() {
try {
long t = queue.take();
long now = System.nanoTime();
long time = (now - t) / 1000; // Divide by 1000 to get result in microseconds
if(sleep > 0) {
System.out.println("Time: " + time);
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
QueueTest test = new QueueTest();
System.out.println("Starting...");
// Run first once, ignoring results
test.sleep = 0;
test.start();
// Run again, printing the results
System.out.println("Starting again...");
test.sleep = 1000;
test.start();
}
}
Your test is not a good measure of queue hand-off latency, because you have a single thread reading off the queue which writes synchronously to System.out (doing a String and long concatenation while it is at it) before it takes again. To measure this properly, you need to move that activity out of the taking thread and do as little work as possible in it.
You'd be better off just doing the calculation (then - now) in the taker and adding the result to some other collection which is periodically drained by another thread that outputs the results. I tend to do this by adding to an appropriately pre-sized, array-backed structure accessed via an AtomicReference: the reporting thread just has to getAndSet on that reference with another instance of that storage structure in order to grab the latest batch of results (e.g. make two lists, set one as active; every x seconds a thread wakes up and swaps the active and the passive ones). You can then report some distribution instead of every single result (e.g. a decile range), which means you don't generate vast log files with every run and still get useful information printed for you.
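A hedged sketch of that double-buffer idea (the names are mine, not the answerer's; note the benign race where a sample recorded during the swap can land in the batch being drained):

import java.util.concurrent.atomic.AtomicReference;

class LatencyRecorder {
    static final class Batch {
        final long[] values = new long[1 << 20]; // pre-sized, array-backed
        int count;                               // written only by the taker thread
    }

    private final AtomicReference<Batch> active = new AtomicReference<>(new Batch());

    // Called from the taking thread: no locks, no I/O on the hot path.
    void record(long nanos) {
        Batch b = active.get();
        if (b.count < b.values.length) {
            b.values[b.count++] = nanos;
        }
    }

    // Called periodically from a reporting thread: swap in a fresh batch
    // and summarize the old one offline (e.g. print a decile range).
    Batch drain() {
        return active.getAndSet(new Batch());
    }
}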
FWIW I concur with the times Peter Lawrey stated, and if latency is really critical then you need to think about busy waiting with appropriate CPU affinity (i.e. dedicate a core to that thread).
EDIT after Jan 6
If I remove the call to Thread.sleep () and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly?
You're looking at the difference between java.util.concurrent.locks.LockSupport#park (and corresponding unpark) and Thread#sleep. Most j.u.c. stuff is built on LockSupport (often via an AbstractQueuedSynchronizer that ReentrantLock provides or directly) and this (in Hotspot) resolves down to sun.misc.Unsafe#park (and unpark) and this tends to end up in the hands of the pthread (posix threads) lib. Typically pthread_cond_broadcast to wake up and pthread_cond_wait or pthread_cond_timedwait for things like BlockingQueue#take.
I can't say I've ever looked at how Thread#sleep is actually implemented (because I've never come across something low-latency that isn't a condition-based wait), but I would imagine that it causes the thread to be demoted by the scheduler in a more aggressive way than the pthread signalling mechanism does, and that is what accounts for the latency difference.
I would just use an ArrayBlockingQueue if you can. When I have used it, the latency was between 8 and 18 microseconds on Linux. Some points of note:
The cost is largely the time it takes to wake up the thread. When you wake up a thread, its data/code won't be in cache, so you will find that if you time what happens after a thread has woken, it can take 2-5x longer than if you were to run the same thing repeatedly.
Certain operations use OS calls (such as locking/cyclic barriers); these are often more expensive in a low-latency scenario than busy waiting. I suggest trying to busy wait in your producer rather than using a CyclicBarrier. You could busy wait in your consumer as well, but this could be unreasonably expensive on a real system.
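For illustration, a minimal busy-waiting consumer sketch (my code, not Peter's): it spins on poll() instead of blocking in take(), keeping the core hot at the cost of saturating it.

import java.util.concurrent.ArrayBlockingQueue;

public class BusyWaitConsumer {
    // Spin until an element arrives; avoids the sleep-state wake-up cost,
    // at the price of 100% CPU on one core. Pinning this thread to a
    // dedicated core makes the numbers more stable.
    static long consume(ArrayBlockingQueue<Long> queue) {
        Long t;
        while ((t = queue.poll()) == null) {
            // busy spin: deliberately do nothing
        }
        return System.nanoTime() - t;
    }
}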
@Peter Lawrey
Certain operations use OS calls (such as locking/cyclic barriers)
Those are NOT OS (kernel) calls. They are implemented via a simple CAS (which on x86 comes with a free memory fence as well).
One more: don't use ArrayBlockingQueue unless you know why you use it.
@OP:
Look at ThreadPoolExecutor, it offers an excellent producer/consumer framework.
Edit below:
To reduce the latency (barring busy waiting), change the queue to SynchronousQueue and add the following line before starting the consumer:
...
consumerThread.setPriority(Thread.MAX_PRIORITY);
consumerThread.start();
This is the best you can get.
Edit2:
Here it is with the synchronous queue, and without printing the results.
package t1;
import java.math.BigDecimal;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
public class QueueTest {
static final int RUNS = 250000;
final SynchronousQueue<Long> queue = new SynchronousQueue<Long>();
int sleep = 1000;
long[] results = new long[0];
public void start(final int runs) throws Exception {
results = new long[runs];
final CountDownLatch barrier = new CountDownLatch(1);
Thread consumerThread = new Thread(new Runnable() {
@Override
public void run() {
barrier.countDown();
try {
for(int i = 0; i < runs; i++) {
results[i] = consume();
}
} catch (Exception e) {
return;
}
}
});
consumerThread.setPriority(Thread.MAX_PRIORITY);
consumerThread.start();
barrier.await();
final long sleep = this.sleep;
for(int i = 0; i < runs; i++) {
try {
doProduce(sleep);
} catch (Exception e) {
return;
}
}
}
private void doProduce(final long sleep) throws InterruptedException {
produce();
}
public void produce() throws InterruptedException {
queue.put(new Long(System.nanoTime())); // new Long() is faster than valueOf
}
public long consume() throws InterruptedException {
long t = queue.take();
long now = System.nanoTime();
return now-t;
}
public static void main(String[] args) throws Throwable {
QueueTest test = new QueueTest();
System.out.println("Starting + warming up...");
// Run first once, ignoring results
test.sleep = 0;
test.start(15000);//10k is the normal warm-up for -server hotspot
// Run again, printing the results
System.gc();
System.out.println("Starting again...");
test.sleep = 1000;//ignored now
Thread.yield();
test.start(RUNS);
long sum = 0;
for (long elapsed: test.results){
sum+=elapsed;
}
BigDecimal elapsed = BigDecimal.valueOf(sum, 3).divide(BigDecimal.valueOf(test.results.length), BigDecimal.ROUND_HALF_UP);
System.out.printf("Avg: %1.3f micros%n", elapsed);
}
}
If latency is critical and you do not require strict FIFO semantics, then you may want to consider JSR-166's LinkedTransferQueue. It enables elimination so that opposing operations can exchange values instead of synchronizing on the queue data structure. This approach helps reduce contention, enables parallel exchanges, and avoids thread sleep/wake-up penalties.
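For reference, LinkedTransferQueue implements BlockingQueue, so it is a drop-in replacement for the queue in the test above (shown as a sketch; at the time it was a JSR-166 extra, and it later became java.util.concurrent.LinkedTransferQueue in JDK 7):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedTransferQueue;

// Sketch: only the queue field of the test needs to change;
// put()/take() keep working, but opposing operations can exchange
// values directly instead of contending on shared queue state.
BlockingQueue<Long> queue = new LinkedTransferQueue<Long>();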
