I use the neo4j java core api and want to update 10 million nodes.
I thought it will be better to do it with multithreading but the performance is not that good (35 minutes for setting properties).
To explain: Each node "Person" has at least one relation "POINTSREL" to a "Point" node, which has the property "Points". I want to sum up the points from the "Point" node and set it as property to the "Person" node.
Here is my code:
Transaction transaction = service.beginTx();
ResourceIterator<Node> iterator = service.findNodes(Labels.person);
transaction.success();
transaction.close();
ExecutorService executor = Executors.newFixedThreadPool(5);
while(iterator.hasNext()){
executor.execute(new MyJob(iterator.next()));
}
//wait until all threads are done
executor.shutdown();
try {
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
e.printStackTrace();
}
And here the runnable class
private class MyJob implements Runnable {
private Node node;
/* collect useful parameters in the constructor */
public MyJob(Node node) {
this.node = node;
}
public void run() {
Transaction transaction = service.beginTx();
Iterable<org.neo4j.graphdb.Relationship> rel = this.node.getRelationships(RelationType.POINTSREL, Direction.OUTGOING);
double sum = 0;
for(org.neo4j.graphdb.Relationship entry : rel){
try{
sum += (Double)entry.getEndNode().getProperty("Points");
} catch(Exception e){
e.printStackTrace();
}
}
this.node.setProperty("Sum", sum);
transaction.success();
transaction.close();
}
}
Is there a better (faster) way to do that?
About my setting:
AWS Instance with 8 CPUs and 32GB ram
neo4j-wrapper.conf
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=16000
wrapper.java.maxmemory=16000
neo4j.properties
# The type of cache to use for nodes and relationships.
cache_type=soft
cache.memory_ratio=30.0
neostore.nodestore.db.mapped_memory=2G
neostore.relationshipstore.db.mapped_memory=7G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=2G
neostore.propertystore.db.arrays.mapped_memory=512M
From my perspective there is something that can be improved.
Offtopic
If you are using Java 7 (or greater) consider using try with resource to handler transaction. It will prevent you from errors.
Performance
First of all - batching. Currently you are:
Creating Job
Starting thread (actually, there is pool in executor)
Starting transaction
For each node. You should consider to make updates in batches. This means that you should:
Collect N nodes (i.e. N=1000)
Create single job for N nodes
Create single transaction in job
Update N nodes in that transaction
Close transaction
Setup
You have 8 CPU's. That means that you can create bigger thread pool. I think Executors.newFixedThreadPool(16) will be OK.
Hacks
You have 32GB RAM. I can suggest:
Decrease java heap size to 8GB. From my experience large heap size can lead to large GC pauses and performance degradation
Increate mapped memory size. Just to make sure that more data can be kept in cache.
Just for your case. If all your data can fit in RAM, then you can change cache_type to hard for this change purposes. Details.
Configuration
As you said - you are using Core API. Is this Embedded graph database or server extension?
If this is Embedded graph database - you should verify that your database settings is applied to created instance.
I found out that there was, amongst others, a problem with the property "cache_type=soft". I set it to "cache_type=none" and the duration of the execution decreased from 30 minutes to 2 minutes. After some updates there were always threads which were blocked for about 30 seconds - changing this property helps to avoid these blockings. I will search for a more detail explantation.
Related
I have collection of elements that I want to process in parallel. When I use a List, parallelism works. However, when I use a Set, it does not run in parallel.
I wrote a code sample that shows the problem:
public static void main(String[] args) {
ParallelTest test = new ParallelTest();
List<Integer> list = Arrays.asList(1,2);
Set<Integer> set = new HashSet<>(list);
ForkJoinPool forkJoinPool = new ForkJoinPool(4);
System.out.println("set print");
try {
forkJoinPool.submit(() ->
set.parallelStream().forEach(test::print)
).get();
} catch (Exception e) {
return;
}
System.out.println("\n\nlist print");
try {
forkJoinPool.submit(() ->
list.parallelStream().forEach(test::print)
).get();
} catch (Exception e) {
return;
}
}
private void print(int i){
System.out.println("start: " + i);
try {
TimeUnit.SECONDS.sleep(1);
} catch (InterruptedException e) {
}
System.out.println("end: " + i);
}
This is the output that I get on windows 7
set print
start: 1
end: 1
start: 2
end: 2
list print
start: 2
start: 1
end: 1
end: 2
We can see that the first element from the Set had to finish before the second element is processed. For the List, the second element starts before the first element finishes.
Can you tell me what causes this issue, and how to avoid it using a Set collection?
I can reproduce the behavior you see, where the parallelism doesn't match the parallelism of the fork-join pool parallelism you've specified. After setting the fork-join pool parallelism to 10, and increasing the number of elements in the collection to 50, I see the parallelism of the list-based stream rising only to 6, whereas the parallelism of the set-based stream never gets above 2.
Note, however, that this technique of submitting a task to a fork-join pool to run the parallel stream in that pool is an implementation "trick" and is not guaranteed to work. Indeed, the threads or thread pool that is used for execution of parallel streams is unspecified. By default, the common fork-join pool is used, but in different environments, different thread pools might end up being used. (Consider a container within an application server.)
In the java.util.stream.AbstractTask class, the LEAF_TARGET field determines the amount of splitting that is done, which in turn determines the amount of parallelism that can be achieved. The value of this field is based on ForkJoinPool.getCommonPoolParallelism() which of course uses the parallelism of the common pool, not whatever pool happens to be running the tasks.
Arguably this is a bug (see OpenJDK issue JDK-8190974), however, this entire area is unspecified anyway. However, this area of the system definitely needs development, for example in terms of splitting policy, the amount of parallelism available, dealing with blocking tasks, among other issues. A future release of the JDK may address some of these issues.
Meanwhile, it is possible to control the parallelism of the common fork-join pool through the use of system properties. If you add this line to your program,
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "10");
and you run the streams in the common pool (or if you submit them to your own pool that has a sufficiently high level of parallelism set) you will observe that many more tasks are run in parallel.
You can also set this property on the command line using the -D option.
Again, this is not guaranteed behavior, and it may change in the future. But this technique will probably work for JDK 8 implementations for the forseeable future.
UPDATE 2019-06-12: The bug JDK-8190974 was fixed in JDK 10, and the fix has been backported to an upcoming JDK 8u release (8u222).
I am using an ThreadPoolExecutor with 5 active threads, number of tasks is huge 20,000.
The queue is filled up (pool.execute(new WorkingThreadTask())) with instances of a Runnable tasks almost immediately.
Each WorkingThreadTask has a HashMap:
Map<Integer, HashMap<Integer, String>> themap ;
each map can have up to 2000 items, and each sub-map has 5 items. There is also a shared BlockingQueue.
When process is running I am getting out of memory. I'm running with: (32bit -Xms1024m -Xmx1024m)
How can I handle this problem? I don't think I have leaks in hashmap... When the thread is finished hashmap is cleaned right?
Update:
After running a profiler and checking the memory, the biggest hit is:
byte[] 2,516,024 hits, 918 MB
I don't know from where it's called or used.
Name Instance count Size (bytes)
byte[ ] 2519560 918117496
oracle.jdbc.ttc7.TTCItem 2515402 120739296
char[ ] 357882 15549280
java.lang.String 9677 232248
int[ ] 2128 110976
short[ ] 2097 150024
java.lang.Class 1537 635704
java.util.concurrent.locks.ReentrantLock$NonfairSync 1489 35736
java.util.Hashtable$Entry 1417 34008
java.util.concurrent.ConcurrentHashMap$HashEntry[ ] 1376 22312
java.util.concurrent.ConcurrentHashMap$Segment 1376 44032
java.lang.Object[ ] 1279 60216
java.util.TreeMap$Entry 828 26496
oracle.jdbc.dbaccess.DBItem[ ] 802 10419712
oracle.jdbc.ttc7.v8TTIoac 732 52704
I'm not sure about the inner map but I suspect the problem is that you are creating a large number of tasks that is filling memory. You should be using a bounded task queue and limit the job producer.
Take a look at my answer here: Process Large File for HTTP Calls in Java
To summarize it, you should create your own bounded queue and then use a RejectedExecutionHandler to block the producer until there is space in the queue. Something like:
final BlockingQueue<WorkingThreadTask> queue =
new ArrayBlockingQueue<WorkingThreadTask>(100);
ThreadPoolExecutor threadPool =
new ThreadPoolExecutor(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS, queue);
// we need our RejectedExecutionHandler to block if the queue is full
threadPool.setRejectedExecutionHandler(new RejectedExecutionHandler() {
#Override
public void rejectedExecution(WorkingThreadTask task,
ThreadPoolExecutor executor) {
try {
// this will block the producer until there's room in the queue
executor.getQueue().put(task);
} catch (InterruptedException e) {
throw new RejectedExecutionException(
"Unexpected InterruptedException", e);
}
}
});
Edit:
I don't think I have leeks in hashmap... when thread is finished hashmap is cleaned right?
You might consider aggressively calling clear() on the work HashMap and other collections when the task completes. Although they should be reaped by the GC eventually, giving the GC some help may solve your problem if you have limited memory.
If this doesn't work, a profiler is the way to go to help you identify where the memory is being held.
Edit:
After looking at the profiler output, the byte[] is interesting. Typically this indicates some sort of serialization or other IO. You may also be storing blobs in a database. The oracle.jdbc.ttc7.TTCItem is very interesting however. That indicates to me that you are not closing a database connection somewhere. Make sure to use proper try/finally blocks to close your connections.
HashMap carries quite a lot of overhead in terms of memory usage..... it carries about 36 bytes minimum per entry, plus the size of the key/value itself - each will be at least 32 bytes (I think that's about the typical value for 32-bit sun JVM).... doing some quick math:
20,000 tasks, each with map with 2000 entry hashmap. The value in the map is another map with 5 entries.
-> 5-entry map is 1* Map + 5* Map.Object entries + 5*keys + 5*values = 16 objects at 32 bytes => 512 bytes per sub-map.
-> 2000 entry map is 1* Map, 2000*Map.Object + 2000 keys + 2000 submaps (each is 512 bytes) => 2000*(512+32+32) + 32 => 1.1MB
-> 20,000 tasks, each of 1.1MB -> 23GB
So, your overall footprint is 23GB.
The logical solution is to restrict the depth of your blocking queue feeding the ExecutorService, and only create enough child tasks to keep it busy..... set a limit of about 64 entries in the queue, and then you will never have more than 64 + 5 tasks instantiated at one time. When wpace comes available in the executor's queue, you can create and add another task.
You can improve the efficiency by not adding so many tasks ahead of what is being processed. Try checking the queue and only adding to it if there is less than 1000 entries.
You can also make the data structures more efficient. A Map with an Integer key can often be reduced to an array of some kind.
Lastly, 1 GB isn't that much these days. My mobile phone has 2 GB. If you are going to process large amount of data, I suggest getting a machine with 32-64 GB of memory and a 64-bit JVM.
From the large byte[]s, I'd suspect IO related issues (unless you are handling video/audio or something).
Things to look at:
DB: Are you trying to read large amount of stuff at once? You can
e.g. use a cursor to not do that
File/Network: Are you trying to read large amounts of stuff from file/network at once? You should "propagate the load" to whatever is reading and regulate the rate of read.
UPDATE: OK, so you are using a cursor to read from DB. Now you need to make sure that the reading from the cursor only progresses as you finish stuff (aka "propagate the load"). To do this, use a thread pool like this:
BlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>(queueSize);
ThreadPoolExecutor tpe = new ThreadPoolExecutor(
threadNum,
threadNum,
1000,
TimeUnit.HOURS,
queue,
new ThreadPoolExecutor.CallerRunsPolicy());
Now when you post to this service from your code which reads from the DB, it will block when the queue is full (the calling thread is used to run tasks and hence blocks).
I used a while loop to fetch message from Amazon SQS. Partial code is as follows:
ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl);
while (true) {
List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
if (messages.size() > 0) {
MemcachedClient c = new MemcachedClient(new BinaryConnectionFactory(), AddrUtil.getAddresses(memAddress));
for (Message message : messages) {
// get message from aws sqs
String messageid = message.getBody();
String messageRecieptHandle = message.getReceiptHandle();
sqs.deleteMessage(new DeleteMessageRequest(myQueueUrl, messageRecieptHandle));
// get details info from memcache
String result = null;
String key = null;
key = "message-"+messageid;
result = c.get(key);
}
c.shutdown();
}
}
Will it cause memory leak in such case?
I checked using "ps aux". What I found is that the RSS (resident set size, the non-swapped physical memory that a task used) is growing slowly.
You can't evaluate whether your Java application has a memory leak simply based on the RSS of the process. Most JVMs are pretty greedy, they would rather take more memory from the OS than spend a lot of work on Garbage Collection.
That said your while loop doesn't seem like it has any obvious memory "leaks" either, but that depends on what some of the method calls do (which isn't included above). If you are storing things in static variables, that can be a cause of concern but if the only references are within the scope of the loop you're probably fine.
The simplest way to know if you have a memory leak in a certain area of code is to rigorously exercise that code within a single run of your application (potentially set with a relatively low maximum heap size). If you get an OutOfMemoryError, you probably have a memory leak.
Sorry, but I don't see here code to remove message from the message queue. Did you clean the message list? In case that DeleteRequest removes message from the queue then you try to modify message list which you itereate.
Also you can get better memory usage statistic with visualvm tool which is part of JDK now.
I have the following program to remove even numbers from a string vector, when the vector size grows larger, it might take a long time, so I thought of threads, but using 10 threads is not faster then one thread, my PC has 6 cores and 12 threads, why ?
import java.util.*;
public class Test_Threads
{
static boolean Use_Threads_To_Remove_Duplicates(Vector<String> Good_Email_Address_Vector,Vector<String> To_Be_Removed_Email_Address_Vector)
{
boolean Removed_Duplicates=false;
int Threads_Count=10,Delay=5,Average_Size_For_Each_Thread=Good_Email_Address_Vector.size()/Threads_Count;
Remove_Duplicate_From_Vector_Thread RDFVT[]=new Remove_Duplicate_From_Vector_Thread[Threads_Count];
Remove_Duplicate_From_Vector_Thread.To_Be_Removed_Email_Address_Vector=To_Be_Removed_Email_Address_Vector;
for (int i=0;i<Threads_Count;i++)
{
Vector<String> Target_Vector=new Vector<String>();
if (i<Threads_Count-1) for (int j=i*Average_Size_For_Each_Thread;j<(i+1)*Average_Size_For_Each_Thread;j++) Target_Vector.add(Good_Email_Address_Vector.elementAt(j));
else for (int j=i*Average_Size_For_Each_Thread;j<Good_Email_Address_Vector.size();j++) Target_Vector.add(Good_Email_Address_Vector.elementAt(j));
RDFVT[i]=new Remove_Duplicate_From_Vector_Thread(Target_Vector,Delay);
}
try { for (int i=0;i<Threads_Count;i++) RDFVT[i].Remover_Thread.join(); }
catch (Exception e) { e.printStackTrace(); } // Wait for all threads to finish
for (int i=0;i<Threads_Count;i++) if (RDFVT[i].Changed) Removed_Duplicates=true;
if (Removed_Duplicates) // Collect results
{
Good_Email_Address_Vector.clear();
for (int i=0;i<Threads_Count;i++) Good_Email_Address_Vector.addAll(RDFVT[i].Target_Vector);
}
return Removed_Duplicates;
}
public static void out(String message) { System.out.print(message); }
public static void Out(String message) { System.out.println(message); }
public static void main(String[] args)
{
long start=System.currentTimeMillis();
Vector<String> Good_Email_Address_Vector=new Vector<String>(),To_Be_Removed_Email_Address_Vector=new Vector<String>();
for (int i=0;i<1000;i++) Good_Email_Address_Vector.add(i+"");
Out(Good_Email_Address_Vector.toString());
for (int i=0;i<1500000;i++) To_Be_Removed_Email_Address_Vector.add(i*2+"");
Out("=============================");
Use_Threads_To_Remove_Duplicates(Good_Email_Address_Vector,To_Be_Removed_Email_Address_Vector); // [ Approach 1 : Use 10 threads ]
// Good_Email_Address_Vector.removeAll(To_Be_Removed_Email_Address_Vector); // [ Approach 2 : just one thread ]
Out(Good_Email_Address_Vector.toString());
long end=System.currentTimeMillis();
Out("Time taken for execution is " + (end - start));
}
}
class Remove_Duplicate_From_Vector_Thread
{
static Vector<String> To_Be_Removed_Email_Address_Vector;
Vector<String> Target_Vector;
Thread Remover_Thread;
boolean Changed=false;
public Remove_Duplicate_From_Vector_Thread(final Vector<String> Target_Vector,final int Delay)
{
this.Target_Vector=Target_Vector;
Remover_Thread=new Thread(new Runnable()
{
public void run()
{
try
{
Thread.sleep(Delay);
Changed=Target_Vector.removeAll(To_Be_Removed_Email_Address_Vector);
}
catch (InterruptedException e) { e.printStackTrace(); }
finally { }
}
});
Remover_Thread.start();
}
}
In my program you can try "[ Approach 1 : Use 10 threads ]" or "[ Approach 2 : just one thread ]" there isn't much difference speed wise, I expext it to be several times faster, why ?
The simple answer is that your threads are all trying to access a single vector calling synchronized methods. The synchronized modifier on those methods ensures that only one thread can be executing any of the methods on that object at any given time. So a significant part of the parallel part of the computation involves waiting for other threads.
The other problem is that for an O(N) input list, you have an O(N) setup ... population of the Target_Vector objects ... that is done in one thread. Plus the overheads of thread creation.
All of this adds up to not much speedup.
You should get a significant speedup (with multiple threads) if you used a single ConcurrentHashMap instead of a single Good_Email_Address_Vector object that gets split into multiple Target_Vector objects:
the remove operation is O(1) not O(n),
reduced copying,
the data structure provides better multi-threaded performance due to better handling of contention, and
you don't need to jump through hoops to avoid ConcurrentModificationException.
In addition, the To_Be_Removed_Email_Address_Vector object should be replaced with an unsynchronized List, and List.sublist(...) should be used to create views that can be passed to the threads.
In short, you are better of throwing away your current code and starting again. And please use sensible identifier names that follow the Java coding conventions, and
wrap your code at line ~80 so that people can read it!
Vector Synchronization Creates Contention
You've split up the vector to be modified, which avoids a some contention. But multiple threads are accessing a the static Vector To_Be_Removed_Email_Address_Vector, so much contention still remains (all Vector methods are synchronized).
Use an unsynchronized data structure for the shared, read-only information so that there is no contention between threads. On my machine, running your test with ArrayList in place of Vector cut the execution time in half.
Even without contention, thread-safe structures are slower, so don't use them when only a single thread has access to an object. Additionally, Vector is largely obsolete by Java 5. Avoid it unless you have to inter-operate with a legacy API you can't alter.
Choose a Suitable Data Structure
A list data structure is going to provide poor performance for this task. Since email addresses are likely to be unique, a set should be a suitable replace, and will removeAll() much faster on large sets. Using HashSet in place of the original Vector cut execution time on my (8 core) machine from over 5 seconds to around 3 milliseconds. Roughly half of this improvement is due to using the right data structure for the job.
Concurrent Structures Are a Bad Fit
Using a concurrent concurrent data structure is relatively slow, and doesn't simplify the code, so I don't recommend it.
Using a more up-to-date concurrent data structure is much faster than contending for a Vector, but the concurrency overhead of these data structures is still much higher than single-threaded structures. For example, running the original code on my machine took more than five seconds, while a ConcurrentSkipListSet took half a second, and a ConcurrentHashMap took one eighth of a second. But remember, when each thread had its own HashSet to update, the total time was just 3 milliseconds.
Even when all threads are updating a single concurrent data structure, the code needed to partition the workload is very similar to that used to create a separate Vector for each thread in the original code. From a readability and maintenance standpoint, all of these solutions have equivalent complexity.
If you had a situation where "bad" email addresses were being added to the set asynchronously, and you wanted readers of the "good" list to see those updates auto-magically, a concurrent set would be a good choice. But, with the current design of the API, where consumers of the "good" list explicitly call a blocking filter method to update the list, a concurrent data structure may be the wrong choice.
All your threads are working on the same vector. Your access to the vector is serialized (i.e. only one thread can access it at a time) so using multiple threads is likely to be the same speed at best, but more likely to be much slower.
Multiple threads work much faster when you have independent tasks to perform.
In this case, the fastest option is likely to be to create a new List which contains all the elements you want to retain and replacing the original, in one thread. This will be fastest than using a concurrent collection with multiple threads.
For comparison, this is what you can do with one thread. As the collection is fairly small, the JVM doesn't warmup in just one run, so there are multiple dummy runs which are not printed.
public static void main(String... args) throws IOException, InterruptedException, ParseException {
for (int n = -50; n < 5; n++) {
List<String> allIds = new ArrayList<String>();
for (int i = 0; i < 1000; i++) allIds.add(String.valueOf(i));
long start = System.nanoTime();
List<String> oddIds = new ArrayList<String>();
for (String id : allIds) {
if ((id.charAt(id.length()-1) % 2) != 0)
oddIds.add(id);
}
long time = System.nanoTime() - start;
if (n >= 0)
System.out.println("Time taken to filter " + allIds.size() + " entries was " + time / 1000 + " micro-seconds");
}
}
prints
Time taken to filter 1000 entries was 136 micro-seconds
Time taken to filter 1000 entries was 141 micro-seconds
Time taken to filter 1000 entries was 136 micro-seconds
Time taken to filter 1000 entries was 137 micro-seconds
Time taken to filter 1000 entries was 138 micro-seconds
I am using BlockingQueue:s (trying both ArrayBlockingQueue and LinkedBlockingQueue) to pass objects between different threads in an application I’m currently working on. Performance and latency is relatively important in this application, so I was curious how much time it takes to pass objects between two threads using a BlockingQueue. In order to measure this, I wrote a simple program with two threads (one consumer and one producer), where I let the producer pass a timestamp (taken using System.nanoTime()) to the consumer, see code below.
I recall reading somewhere on some forum that it took about 10 microseconds for someone else who tried this (don’t know on what OS and hardware that was on), so I was not too surprised when it took ~30 microseconds for me on my windows 7 box (Intel E7500 core 2 duo CPU, 2.93GHz), whilst running a lot of other applications in the background. However, I was quite surprised when I did the same test on our much faster Linux server (two Intel X5677 3.46GHz quad-core CPUs, running Debian 5 with kernel 2.6.26-2-amd64). I expected the latency to be lower than on my windows box , but on the contrary it was much higher - ~75 – 100 microseconds! Both tests were done with Sun’s Hotspot JVM version 1.6.0-23.
Has anyone else done any similar tests with similar results on Linux? Or does anyone know why it is so much slower on Linux (with better hardware), could it be that thread switching simply is this much slower on Linux compared to windows? If that’s the case, it’s seems like windows is actually much better suited for some kind of applications. Any help in helping me understanding the relatively high figures are much appreciated.
Edit:
After a comment from DaveC, I also did a test where I restricted the JVM (on the Linux machine) to a single core (i.e. all threads running on the same core). This changed the results dramatically - the latency went down to below 20 microseconds, i.e. better than the results on the Windows machine. I also did some tests where I restricted the producer thread to one core and the consumer thread to another (trying both to have them on the same socket and on different sockets), but this did not seem to help - the latency was still ~75 microseconds. Btw, this test application is pretty much all I'm running on the machine while performering test.
Does anyone know if these results make sense? Should it really be that much slower if the producer and the consumer are running on different cores? Any input is really appreciated.
Edited again (6 January):
I experimented with different changes to the code and running environment:
I upgraded the Linux kernel to 2.6.36.2 (from 2.6.26.2). After the kernel upgrade, the measured time changed to 60 microseconds with very small variations, from 75-100 before the upgrade. Setting CPU affinity for the producer and consumer threads had no effect, except when restricting them to the same core. When running on the same core, the latency measured was 13 microseconds.
In the original code, I had the producer go to sleep for 1 second between every iteration, in order to give the consumer enough time to calculate the elapsed time and print it to the console. If I remove the call to Thread.sleep () and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly? My first guess was that the change had the effect that the producer called queue.put() before the consumer called queue.take(), so the consumer never had to block, but after playing around with a modified version of ArrayBlockingQueue, I found this guess to be false – the consumer did in fact block. If you have some other guess, please let me know. (Btw, if I let the producer call both Thread.sleep() and barrier.await(), the latency remains at 60 microseconds).
I also tried another approach – instead of calling queue.take(), I called queue.poll() with a timeout of 100 micros. This reduced the average latency to below 10 microseconds, but is of course much more CPU intensive (but probably less CPU intensive that busy waiting?).
Edited again (10 January) - Problem solved:
ninjalj suggested that the latency of ~60 microseconds was due to the CPU having to wake up from deeper sleep states - and he was completely right! After disabling C-states in BIOS, the latency was reduced to <10 microseconds. This explains why I got so much better latency under point 2 above - when I sent objects more frequently the CPU was kept busy enough not to go to the deeper sleep states. Many thanks to everyone who has taken time to read my question and shared your thoughts here!
...
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CyclicBarrier;
public class QueueTest {
ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(10);
Thread consumerThread;
CyclicBarrier barrier = new CyclicBarrier(2);
static final int RUNS = 500000;
volatile int sleep = 1000;
public void start() {
consumerThread = new Thread(new Runnable() {
#Override
public void run() {
try {
barrier.await();
for(int i = 0; i < RUNS; i++) {
consume();
}
} catch (Exception e) {
e.printStackTrace();
}
}
});
consumerThread.start();
try {
barrier.await();
} catch (Exception e) { e.printStackTrace(); }
for(int i = 0; i < RUNS; i++) {
try {
if(sleep > 0)
Thread.sleep(sleep);
produce();
} catch (Exception e) {
e.printStackTrace();
}
}
}
public void produce() {
try {
queue.put(System.nanoTime());
} catch (InterruptedException e) {
}
}
public void consume() {
try {
long t = queue.take();
long now = System.nanoTime();
long time = (now - t) / 1000; // Divide by 1000 to get result in microseconds
if(sleep > 0) {
System.out.println("Time: " + time);
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
QueueTest test = new QueueTest();
System.out.println("Starting...");
// Run first once, ignoring results
test.sleep = 0;
test.start();
// Run again, printing the results
System.out.println("Starting again...");
test.sleep = 1000;
test.start();
}
}
Your test is not a good measure of queue handoff latency because you have a single thread reading off the queue which writes synchronously to System.out (doing a String and long concatenation while it is at it) before it takes again. To measure this properly you need to move this activity out of this thread and do as little work as possible in the taking thread.
You'd be better off just doing the calculation (then-now) in the taker and adding the result to some other collection which is periodically drained by another thread that outputs the results. I tend to do this by adding to an appropriately presized array backed structure accessed via an AtomicReference (hence the reporting thread just has to getAndSet on that reference with another instance of that storage structure in order to grab the latest batch of results; e.g. make 2 lists, set one as active, every x s a thread wakes up and swaps the active and the passive ones). You can then report some distribution instead of every single result (e.g. a decile range) which means you don't generate vast log files with every run and get useful information printed for you.
FWIW I concur with the times Peter Lawrey stated & if latency is really critical then you need to think about busy waiting with appropriate cpu affinity (i.e. dedicate a core to that thread)
EDIT after Jan 6
If I remove the call to Thread.sleep () and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly?
You're looking at the difference between java.util.concurrent.locks.LockSupport#park (and corresponding unpark) and Thread#sleep. Most j.u.c. stuff is built on LockSupport (often via an AbstractQueuedSynchronizer that ReentrantLock provides or directly) and this (in Hotspot) resolves down to sun.misc.Unsafe#park (and unpark) and this tends to end up in the hands of the pthread (posix threads) lib. Typically pthread_cond_broadcast to wake up and pthread_cond_wait or pthread_cond_timedwait for things like BlockingQueue#take.
I can't say I've ever looked at how Thread#sleep is actually implemented (cos I've never come across something low latency that isn't a condition based wait) but I would imagine that it causes it to be demoted by the schedular in a more aggressive way than the pthread signalling mechanism and that is what accounts for the latency difference.
I would use just an ArrayBlockingQueue if you can. When I have used it the latency was between 8-18 microseconds on Linux. Some point of note.
The cost is largely the time it takes to wake up the thread. When you wake up a thread its data/code won't be in cache so you will find that if you time what happens after a thread has woken that can take 2-5x longer than if you were to run the same thing repeatedly.
Certain operations use OS calls (such as locking/cyclic barriers) these are often more expensive in a low latency scenario than busy waiting. I suggest trying to busy wait your producer rather than use a CyclicBarrier. You could busy wait your consumer as well but this could be unreasonably expensive on a real system.
#Peter Lawrey
Certain operations use OS calls (such as locking/cyclic barriers)
Those are NOT OS (kernel) calls. Implemented via simple CAS (which on x86 comes w/ free memory fence as well)
One more: dont use ArrayBlockingQueue unless you know why (you use it).
#OP:
Look at ThreadPoolExecutor, it offers excellent producer/consumer framework.
Edit below:
to reduce the latency (baring the busy wait), change the queue to SynchronousQueue add the following like before starting the consumer
...
consumerThread.setPriority(Thread.MAX_PRIORITY);
consumerThread.start();
This is the best you can get.
Edit2:
Here w/ sync. queue. And not printing the results.
package t1;
import java.math.BigDecimal;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
public class QueueTest {
static final int RUNS = 250000;
final SynchronousQueue<Long> queue = new SynchronousQueue<Long>();
int sleep = 1000;
long[] results = new long[0];
public void start(final int runs) throws Exception {
results = new long[runs];
final CountDownLatch barrier = new CountDownLatch(1);
Thread consumerThread = new Thread(new Runnable() {
#Override
public void run() {
barrier.countDown();
try {
for(int i = 0; i < runs; i++) {
results[i] = consume();
}
} catch (Exception e) {
return;
}
}
});
consumerThread.setPriority(Thread.MAX_PRIORITY);
consumerThread.start();
barrier.await();
final long sleep = this.sleep;
for(int i = 0; i < runs; i++) {
try {
doProduce(sleep);
} catch (Exception e) {
return;
}
}
}
private void doProduce(final long sleep) throws InterruptedException {
produce();
}
public void produce() throws InterruptedException {
queue.put(new Long(System.nanoTime()));//new Long() is faster than value of
}
public long consume() throws InterruptedException {
long t = queue.take();
long now = System.nanoTime();
return now-t;
}
public static void main(String[] args) throws Throwable {
QueueTest test = new QueueTest();
System.out.println("Starting + warming up...");
// Run first once, ignoring results
test.sleep = 0;
test.start(15000);//10k is the normal warm-up for -server hotspot
// Run again, printing the results
System.gc();
System.out.println("Starting again...");
test.sleep = 1000;//ignored now
Thread.yield();
test.start(RUNS);
long sum = 0;
for (long elapsed: test.results){
sum+=elapsed;
}
BigDecimal elapsed = BigDecimal.valueOf(sum, 3).divide(BigDecimal.valueOf(test.results.length), BigDecimal.ROUND_HALF_UP);
System.out.printf("Avg: %1.3f micros%n", elapsed);
}
}
If latency is critical and you do not require strict FIFO semantics, then you may want to consider JSR-166's LinkedTransferQueue. It enables elimination so that opposing operations can exchange values instead of synchronizing on the queue data structure. This approach helps reduce contention, enables parallel exchanges, and avoids thread sleep/wake-up penalties.