Concurrency: Java Map
What is the best way to push 20 Million entities into a java map object?
Without multi-threading it is taking ~40 seconds.
Using ForkJoinPool it is taking ~25 seconds, where I have created 2 tasks and each of these tasks pushes 10 million entities.
I believe that both these tasks run on 2 different cores.
Question: when I create 1 task that pushes 10 million entities, it takes ~9 seconds; so when running 2 tasks where each pushes 10 million entities, why does it take ~26 seconds? Am I doing something wrong?
Is there a different solution for inserting 20 million entities that takes less than 10 seconds?
Without seeing your code, the most probable cause of these bad performance results is garbage collection activity. To demonstrate it, I wrote the following program:
import java.lang.management.ManagementFactory;
import java.util.*;
import java.util.concurrent.*;

public class TestMap {
    // we assume NB_ENTITIES is divisible by NB_TASKS
    static final int NB_ENTITIES = 20_000_000, NB_TASKS = 2;
    static Map<String, String> map = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        try {
            System.out.printf("running with nb entities = %,d, nb tasks = %,d, VM args = %s%n",
                    NB_ENTITIES, NB_TASKS, ManagementFactory.getRuntimeMXBean().getInputArguments());
            ExecutorService executor = Executors.newFixedThreadPool(NB_TASKS);
            int entitiesPerTask = NB_ENTITIES / NB_TASKS;
            List<Future<?>> futures = new ArrayList<>(NB_TASKS);
            long startTime = System.nanoTime();
            // submit the tasks, each with its own range of keys to insert
            for (int i = 0; i < NB_TASKS; i++) {
                MyTask task = new MyTask(i * entitiesPerTask, (i + 1) * entitiesPerTask - 1);
                futures.add(executor.submit(task));
            }
            // wait for all the tasks to complete
            for (Future<?> f : futures) {
                f.get();
            }
            long elapsed = System.nanoTime() - startTime;
            executor.shutdownNow();
            System.gc();
            Runtime rt = Runtime.getRuntime();
            long usedMemory = rt.maxMemory() - rt.freeMemory();
            System.out.printf("processing completed in %,d ms, usedMemory after GC = %,d bytes%n",
                    elapsed / 1_000_000L, usedMemory);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    static class MyTask implements Runnable {
        private final int startIdx, endIdx;

        public MyTask(final int startIdx, final int endIdx) {
            this.startIdx = startIdx;
            this.endIdx = endIdx;
        }

        @Override
        public void run() {
            long startTime = System.nanoTime();
            for (int i = startIdx; i <= endIdx; i++) {
                map.put("sambit:rout:" + i, "C:\\Images\\Provision_Images");
            }
            long elapsed = System.nanoTime() - startTime;
            System.out.printf("task[%,d - %,d], completed in %,d ms%n",
                    startIdx, endIdx, elapsed / 1_000_000L);
        }
    }
}
At the end of the processing, this code computes an approximation of the used memory by performing a System.gc() immediately followed by Runtime.maxMemory() - Runtime.freeMemory(). This shows that the map with 20 million entries occupies approximately 2.2 GB, which is considerable. I ran it with 1 and 2 threads, for various values of the -Xmx and -Xms JVM arguments; here are the resulting outputs (just to be clear: 2560m = 2.5g):
running with nb entities = 20,000,000, nb tasks = 1, VM args = [-Xms2560m, -Xmx2560m]
task[0 - 19,999,999], completed in 11,781 ms
processing completed in 11,782 ms, usedMemory after GC = 2,379,068,760 bytes
running with nb entities = 20,000,000, nb tasks = 2, VM args = [-Xms2560m, -Xmx2560m]
task[0 - 9,999,999], completed in 8,269 ms
task[10,000,000 - 19,999,999], completed in 12,385 ms
processing completed in 12,386 ms, usedMemory after GC = 2,379,069,480 bytes
running with nb entities = 20,000,000, nb tasks = 1, VM args = [-Xms3g, -Xmx3g]
task[0 - 19,999,999], completed in 12,525 ms
processing completed in 12,527 ms, usedMemory after GC = 2,398,339,944 bytes
running with nb entities = 20,000,000, nb tasks = 2, VM args = [-Xms3g, -Xmx3g]
task[0 - 9,999,999], completed in 12,220 ms
task[10,000,000 - 19,999,999], completed in 12,264 ms
processing completed in 12,265 ms, usedMemory after GC = 2,382,777,776 bytes
running with nb entities = 20,000,000, nb tasks = 1, VM args = [-Xms4g, -Xmx4g]
task[0 - 19,999,999], completed in 7,363 ms
processing completed in 7,364 ms, usedMemory after GC = 2,402,467,040 bytes
running with nb entities = 20,000,000, nb tasks = 2, VM args = [-Xms4g, -Xmx4g]
task[0 - 9,999,999], completed in 5,466 ms
task[10,000,000 - 19,999,999], completed in 5,511 ms
processing completed in 5,512 ms, usedMemory after GC = 2,381,821,576 bytes
running with nb entities = 20,000,000, nb tasks = 1, VM args = [-Xms8g, -Xmx8g]
task[0 - 19,999,999], completed in 7,778 ms
processing completed in 7,779 ms, usedMemory after GC = 2,438,159,312 bytes
running with nb entities = 20,000,000, nb tasks = 2, VM args = [-Xms8g, -Xmx8g]
task[0 - 9,999,999], completed in 5,739 ms
task[10,000,000 - 19,999,999], completed in 5,784 ms
processing completed in 5,785 ms, usedMemory after GC = 2,396,478,680 bytes
These results can be summarized in the following table:
--------------------------------
heap | exec time (ms) for:
size (gb) | 1 thread | 2 threads
--------------------------------
2.5 | 11782 | 12386
3.0 | 12527 | 12265
4.0 | 7364 | 5512
8.0 | 7779 | 5785
--------------------------------
I also observed that, for the 2.5g and 3g heap sizes, there was high CPU activity, with spikes at 100% during the whole processing time due to the GC activity, whereas for 4g and 8g high CPU was only observed at the end, due to the System.gc() call.
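If you want to confirm the GC activity on your own runs, GC logging can be enabled with standard JVM flags; treat these command lines as a sketch to adapt (TestMap is the class above):

java -Xlog:gc -Xms2560m -Xmx2560m TestMap      (JDK 9 and later, unified logging)
java -verbose:gc -Xms2560m -Xmx2560m TestMap   (works on JDK 8 and later)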
To conclude:
if your heap is sized inappropriately, the garbage collection will kill any performance gain you would hope to obtain. You should make it large enough to avoid the side effects of long GC pauses.
you must also be aware that using a concurrent collection such as ConcurrentHashMap has a significant performance overhead. To illustrate this, I slightly modified the code so that each task uses its own HashMap, then at the end all the maps are aggregated (with Map.putAll()) into the map of the first task. The processing time fell to around 3,200 ms. A sketch of that variant follows.
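This is a minimal sketch of that per-task-map variant, not the exact code behind the 3,200 ms figure; it assumes the MyTask instances are kept in a list named tasks so their maps can be merged after the futures complete:

static class MyTask implements Runnable {
    private final int startIdx, endIdx;
    // each task fills its own non-concurrent map
    final Map<String, String> localMap = new HashMap<>();

    MyTask(final int startIdx, final int endIdx) {
        this.startIdx = startIdx;
        this.endIdx = endIdx;
    }

    @Override
    public void run() {
        for (int i = startIdx; i <= endIdx; i++) {
            localMap.put("sambit:rout:" + i, "C:\\Images\\Provision_Images");
        }
    }
}

// after f.get() has returned for every future:
Map<String, String> merged = tasks.get(0).localMap;
for (int i = 1; i < tasks.size(); i++) {
    merged.putAll(tasks.get(i).localMap);
}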
An addition probably takes one CPU cycle, so if your CPU runs at 3GHz, that's 0.3 nanoseconds. Do it 20M times and that becomes 6,000,000 nanoseconds, or 6 milliseconds. So your measurement is more affected by the overhead of starting threads, thread switching, JIT compilation, etc. than by the operation you are trying to measure.
Garbage collection may also play a role and slow you down.
I suggest you use a specialized library for micro-benchmarking, such as JMH. A minimal sketch follows.
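For example, a JMH benchmark along these lines (a sketch, assuming the JMH dependency and annotation processor are set up; class and method names are illustrative):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MapPutBenchmark {
    ConcurrentHashMap<String, String> map;
    int i;

    @Setup(Level.Iteration)
    public void setup() {
        // fresh map per measurement iteration
        map = new ConcurrentHashMap<>();
        i = 0;
    }

    @Benchmark
    public String put() {
        // returning the result prevents dead-code elimination
        return map.put("key:" + i++, "value");
    }
}

JMH takes care of warmup, JIT compilation and dead-code-elimination pitfalls that a hand-rolled timing loop does not.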
Thanks to assylias's post, which helped me write my response.
While I have not tried multiple threads, I did try all 7 appropriate Map types of the 10 provided by Java 11.
My results were all substantially faster than your reported 25 to 40 seconds. My results for 20,000,000 entries of < String , UUID > are more like 3-9 seconds for any of the 7 map classes.
I am using Java 13 on:
Model Name: Mac mini
Model Identifier: Macmini8,1
Processor Name: Intel Core i5
Processor Speed: 3 GHz
Number of Processors: 1
Total Number of Cores: 6
L2 Cache (per Core): 256 KB
L3 Cache: 9 MB
Memory: 32 GB
Preparing.
size of instants: 20000000
size of uuids: 20000000
Running test.
java.util.HashMap took: PT3.645250368S
java.util.WeakHashMap took: PT3.199812894S
java.util.TreeMap took: PT8.97788412S
java.util.concurrent.ConcurrentSkipListMap took: PT7.347253106S
java.util.concurrent.ConcurrentHashMap took: PT4.494560252S
java.util.LinkedHashMap took: PT2.78054883S
java.util.IdentityHashMap took: PT5.608737472S
My code:
System.out.println( "Preparing." );
int limit = 20_000_000; // 20_000_000
Set < String > instantsSet = new TreeSet <>(); // Use `Set` to forbid duplicates.
List < UUID > uuids = new ArrayList <>( limit );
while ( instantsSet.size() < limit )
{
instantsSet.add( Instant.now().toString() );
}
List < String > instants = new ArrayList <>( instantsSet );
for ( int i = 0 ; i < limit ; i++ )
{
uuids.add( UUID.randomUUID() );
}
System.out.println( "size of instants: " + instants.size() );
System.out.println( "size of uuids: " + uuids.size() );
System.out.println( "Running test." );
// Using 7 of the 10 `Map` implementations bundled with Java 11.
// Omitting `EnumMap`, as it requires enums for the key.
// Omitting `Map.of` because it is for literals.
// Omitting `HashTable` because it is outmoded, replaced by `ConcurrentHashMap`.
List < Map < String, UUID > > maps = List.of(
new HashMap <>( limit ) ,
new WeakHashMap <>( limit ) ,
new TreeMap <>() ,
new ConcurrentSkipListMap <>() ,
new ConcurrentHashMap <>( limit ) ,
new LinkedHashMap <>( limit ) ,
new IdentityHashMap <>( limit )
);
for ( Map < String, UUID > map : maps )
{
long start = System.nanoTime();
for ( int i = 0 ; i < instants.size() ; i++ )
{
map.put( instants.get( i ) , uuids.get( i ) );
}
long stop = System.nanoTime();
Duration d = Duration.of( stop - start , ChronoUnit.NANOS );
System.out.println( map.getClass().getName() + " took: " + d );
// Free up memory.
map = null;
System.gc(); // Request garbage collector do its thing. No guarantee!
try
{
Thread.sleep( TimeUnit.SECONDS.toMillis( 4 ) ); // Wait for garbage collector to hopefully finish. No guarantee!
}
catch ( InterruptedException e )
{
e.printStackTrace();
}
}
System.out.println("Done running test.");
And here is a table I wrote comparing the various Map implementations.
Related
java parallelStreams on different machines
I have a function that iterates a list using parallelStream, calling an API with each item as a parameter, and I then store the result in a HashMap:

try {
    return answerList.parallelStream()
            .map(answer -> getReplyForAnswerCombination(answer))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
} catch (final NullPointerException e) {
    log.error("Error in generating final results.", e);
    return null;
}

When I run it on laptop 1, it takes 1 hour. But on laptop 2, it takes 5 hours.
Doing some basic research I found that parallel streams use the default ForkJoinPool.commonPool, which by default has one less thread than you have processors. Laptop 1 and laptop 2 have different processors.
Is there a way to find out how many streams can run in parallel on laptop 1 and laptop 2? Can I use the suggestion given here to safely increase the number of parallel streams on laptop 2?

long start = System.currentTimeMillis();
IntStream s = IntStream.range(0, 20);
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "20");
s.parallel().forEach(i -> {
    try {
        Thread.sleep(100);
    } catch (Exception ignore) {}
    System.out.print((System.currentTimeMillis() - start) + " ");
});
Project Loom

If you want maximum performance on threaded code that blocks (as opposed to CPU-bound code), then use virtual threads (fibers) provided in Project Loom. Preliminary builds are available now, based on early-access Java 16.

Virtual threads

Virtual threads can be dramatically faster because a virtual thread is “parked” while blocked, set aside, so another virtual thread can make progress. This is so efficient for blocking tasks that threads can number in the millions.
Drop the streams approach. Merely send off each input to a virtual thread.

Full example code

Let's define classes for Answer and Reply, our inputs & outputs. We will use record, a new feature coming to Java 16, as an abbreviated way to define an immutable data-driven class. The compiler implicitly creates default implementations of constructor, getters, equals & hashCode, and toString.

public record Answer (String text) { }

…and:

public record Reply (String text) { }

Define our task to be submitted to an executor service. We write a class named ReplierTask that implements Runnable (has a run method). Within the run method, we sleep the current thread to simulate waiting for a call to a database, file system, and/or remote service.

package work.basil.example;

import java.time.Duration;
import java.time.Instant;
import java.util.UUID;
import java.util.concurrent.ConcurrentMap;

public class ReplierTask implements Runnable {
    private Answer answer;
    ConcurrentMap < Answer, Reply > map;

    public ReplierTask ( Answer answer , ConcurrentMap < Answer, Reply > map ) {
        this.answer = answer;
        this.map = map;
    }

    private Reply getReplyForAnswerCombination ( Answer answer ) {
        // Simulating a call to some service to produce a `Reply` object.
        try {
            Thread.sleep( Duration.ofSeconds( 1 ) ); // Simulate blocking to wait for call to service or db or such.
        } catch ( InterruptedException e ) {
            e.printStackTrace();
        }
        return new Reply( UUID.randomUUID().toString() );
    }

    // `Runnable` interface
    @Override
    public void run ( ) {
        System.out.println( "`run` method at " + Instant.now() + " for answer: " + this.answer );
        Reply reply = this.getReplyForAnswerCombination( this.answer );
        this.map.put( this.answer , reply );
    }
}

Lastly, some code to do the work. We make a class named Mapper that contains a main method. We simulate some input by populating an array of Answer objects. We create an empty ConcurrentMap in which to collect the results. And we assign each Answer object to a new thread where we call for a new Reply object and store the Answer/Reply pair as an entry in the map.

package work.basil.example;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class Mapper {
    public static void main ( String[] args ) {
        System.out.println( "Runtime.version(): " + Runtime.version() );
        System.out.println( "availableProcessors: " + Runtime.getRuntime().availableProcessors() );
        System.out.println( "maxMemory: " + Runtime.getRuntime().maxMemory() + " | maxMemory/(1024*1024) -> megs: " + Runtime.getRuntime().maxMemory() / ( 1024 * 1024 ) );
        Mapper app = new Mapper();
        app.demo();
    }

    private void demo ( ) {
        // Simulate our inputs, a list of `Answer` objects.
        int limit = 10_000;
        List < Answer > answers = new ArrayList <>( limit );
        for ( int i = 0 ; i < limit ; i++ ) {
            answers.add( new Answer( String.valueOf( i ) ) );
        }

        // Do the work.
        Instant start = Instant.now();
        System.out.println( "Starting work at: " + start + " on count of tasks: " + limit );
        ConcurrentMap < Answer, Reply > results = new ConcurrentHashMap <>();
        try (
                ExecutorService executorService = Executors.newVirtualThreadExecutor() ;
                // Executors.newFixedThreadPool( 5 )
                // Executors.newFixedThreadPool( 10 )
                // Executors.newFixedThreadPool( 1_000 )
        ) {
            for ( Answer answer : answers ) {
                ReplierTask task = new ReplierTask( answer , results );
                executorService.submit( task );
            }
        }
        // At this point the flow-of-control blocks until all submitted tasks are done.
        // The executor service is automatically closed by this point as well.
        Duration elapsed = Duration.between( start , Instant.now() );
        System.out.println( "results.size() = " + results.size() + ". Elapsed: " + elapsed );
    }
}

We can change out the Executors.newVirtualThreadExecutor() with a pool of platform threads, to compare against virtual threads. Let's try a pool of 5, 10, and 1,000 platform threads on a Mac mini Intel with macOS Mojave sporting 6 real cores, no hyper-threading, 32 gigs of memory, and OpenJDK special build version 16-loom+9-316 assigned maxMemory of 8 gigs.

-----------------------------------------------------------------
10,000 tasks at 1 second each | Total elapsed time
-----------------------------------------------------------------
5 platform threads            | half-hour — PT33M29.755792S
10 platform threads           | quarter-hour — PT16M43.318973S
1,000 platform threads        | 10 seconds — PT10.487689S
10,000 platform threads       | Error…unable to create native thread: possibly out of memory or process/resource limits reached
virtual threads               | Under 3 seconds — PT2.645964S
-----------------------------------------------------------------

Caveats

Caveat: Project Loom is experimental and subject to change, not intended for production use yet. The team is asking for folks to give feedback now.
Caveat: CPU-bound tasks such as encoding video should stick with platform/kernel threads rather than virtual threads. Most common code doing blocking operations such as I/O, like accessing files, logging, hitting a database, or making network calls, will likely see massive performance boosts with virtual threads.
Caveat: You must have enough memory available for many or even all of your tasks to be running simultaneously. If not enough memory will be available, you must take additional steps to throttle the virtual threads.
The setting java.util.concurrent.ForkJoinPool.common.parallelism will have an effect on the threads available for operations which make use of the ForkJoinPool, such as Stream.parallel(). However: whether your task uses more threads depends on the number of items in the stream, and whether it takes less time to run depends on the nature of each task and your available processors.
This test program shows the effect of changing this system property with a trivial task:

public static void main(String[] args) {
    ConcurrentHashMap<String, String> threads = new ConcurrentHashMap<>();
    int max = Integer.parseInt(args[0]);
    boolean parallel = args.length < 2 || !"single".equals(args[1]);
    int[] arr = IntStream.range(0, max).toArray();
    long start = System.nanoTime();
    IntStream stream = Arrays.stream(arr);
    if (parallel) stream = stream.parallel();
    stream.forEach(i -> {
        threads.put("hc=" + Thread.currentThread().hashCode() + " tn=" + Thread.currentThread().getName(), "value");
    });
    long end = System.nanoTime();
    System.out.println("parallelism: " + System.getProperty("java.util.concurrent.ForkJoinPool.common.parallelism"));
    System.out.println("Threads: " + threads.keySet());
    System.out.println("Array size: " + arr.length + " threads used: " + threads.size() + " ms=" + TimeUnit.NANOSECONDS.toMillis(end - start));
}

Adding more threads won't necessarily speed things up. Here are some examples from test runs to count the threads used. It may help you decide on the best approach for your own task contained in getReplyForAnswerCombination().

java -cp example.jar -Djava.util.concurrent.ForkJoinPool.common.parallelism=1000 App 100000
Array size: 100000 threads used: 37

java -cp example.jar -Djava.util.concurrent.ForkJoinPool.common.parallelism=50 App 100000
Array size: 100000 threads used: 20

java -cp example.jar App 100000 single
Array size: 100000 threads used: 1

I suggest you see the thread pooling (with or without Loom) in @Basil Bourque's answer, and also the JDK source code of the ForkJoinPool constructor, which has some details on this system property:

private ForkJoinPool(byte forCommonPoolOnly)
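For comparing the two laptops in the original question, a minimal sketch (standard library only; class name is illustrative) that prints the values governing the common pool on each machine:

import java.util.concurrent.ForkJoinPool;

public class PoolInfo {
    public static void main(String[] args) {
        // Number of logical processors visible to the JVM.
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        // Actual parallelism of the common pool used by parallel streams.
        System.out.println("commonPool parallelism = " + ForkJoinPool.commonPool().getParallelism());
    }
}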
Java - Difference between Java 8 parallelStream and creating threads ourselves
I was trying to find the difference between using Java 8's parallelStream (method 1) and creating parallel threads ourselves (method 2). I measured the time taken by method 1 and method 2, but I found a huge deviation: method 2 (~700 ms) is way faster than method 1 (~20 sec).

Method 1 (the list has about 100 entries):

list.parallelStream()
    .forEach(ele -> {
        // Do something.
    });

Method 2:

for (i = 0; i < 100; i++) {
    Runnable task = () -> {
        // Do something.
    };
    Thread thread = new Thread(task);
    thread.start();
}

NOTE: "Do something" is an expensive operation like hitting a database.
I added System.out.println() messages to both. I found that method 1 (parallelStream) appeared to be executing sequentially, while in method 2 the messages were printed very fast. Can anyone explain what is happening?
Can anyone explain what is happening? Most likely you are doing something wrong, but it's not clear what.

for (int i = 0; i < 3; i++) {
    long start = System.currentTimeMillis();
    IntStream.range(0, 100).parallel()
            .forEach(ele -> {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException ignored) {
                }
            });
    long time = System.currentTimeMillis() - start;
    System.out.printf("Took %,d ms to perform 100 tasks of 100 ms on %d processors%n",
            time, Runtime.getRuntime().availableProcessors());
}

prints

Took 475 ms to perform 100 tasks of 100 ms on 32 processors
Took 401 ms to perform 100 tasks of 100 ms on 32 processors
Took 401 ms to perform 100 tasks of 100 ms on 32 processors

That is about what you would expect: 100 tasks spread over a 32-thread common pool run in ceil(100/32) = 4 waves of roughly 100 ms each.
Performance Issues with newFixedThreadPool vs newSingleThreadExecutor
I am trying to benchmark our client code, so I decided to write a multithreaded program to do the benchmarking. I am trying to measure how much time (95th percentile) the method below takes:

attributes = deClient.getDEAttributes(columnsList);

Below is the multithreaded code I wrote to benchmark the above method. I am seeing a lot of variation between my two scenarios:

1) First, with multithreaded code using 20 threads and running for 15 minutes, I get a 95th percentile of 37 ms. And I am using:

ExecutorService service = Executors.newFixedThreadPool(20);

2) But if I run my same program for 15 minutes using:

ExecutorService service = Executors.newSingleThreadExecutor();

instead of newFixedThreadPool(20), I get a 95th percentile of 7 ms, which is way less than the number above.

Can anyone tell me what can be the reason for such a big performance difference between newSingleThreadExecutor and newFixedThreadPool(20)? In both cases I am running my program for 15 minutes.

Below is my code:

public static void main(String[] args) {
    try {
        // create thread pool with given size
        //ExecutorService service = Executors.newFixedThreadPool(20);
        ExecutorService service = Executors.newSingleThreadExecutor();
        long startTime = System.currentTimeMillis();
        long endTime = startTime + (15 * 60 * 1000); // Running for 15 minutes
        for (int i = 0; i < threads; i++) {
            service.submit(new ServiceTask(endTime, serviceList));
        }
        // wait for termination
        service.shutdown();
        service.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
    } catch (InterruptedException e) {
    } catch (Exception e) {
    }
}

Below is the class that implements the Runnable interface:

class ServiceTask implements Runnable {
    private static final Logger LOG = Logger.getLogger(ServiceTask.class.getName());
    private static Random random = new SecureRandom();
    public static volatile AtomicInteger countSize = new AtomicInteger();
    private final long endTime;
    private final LinkedHashMap<String, ServiceInfo> tableLists;
    public static ConcurrentHashMap<Long, Long> selectHistogram = new ConcurrentHashMap<Long, Long>();

    public ServiceTask(long endTime, LinkedHashMap<String, ServiceInfo> tableList) {
        this.endTime = endTime;
        this.tableLists = tableList;
    }

    @Override
    public void run() {
        try {
            while (System.currentTimeMillis() <= endTime) {
                double randomNumber = random.nextDouble() * 100.0;
                ServiceInfo service = selectRandomService(randomNumber);
                final String id = generateRandomId(random);
                final List<String> columnsList = getColumns(service.getColumns());
                List<DEAttribute<?>> attributes = null;
                DEKey bk = new DEKey(service.getKeys(), id);
                List<DEKey> list = new ArrayList<DEKey>();
                list.add(bk);
                Client deClient = new Client(list);
                final long start = System.nanoTime();
                attributes = deClient.getDEAttributes(columnsList);
                final long end = System.nanoTime() - start;
                final long key = end / 1000000L;
                boolean done = false;
                while (!done) {
                    Long oldValue = selectHistogram.putIfAbsent(key, 1L);
                    if (oldValue != null) {
                        done = selectHistogram.replace(key, oldValue, oldValue + 1);
                    } else {
                        done = true;
                    }
                }
                countSize.getAndAdd(attributes.size());
                handleDEAttribute(attributes);
                if (BEServiceLnP.sleepTime > 0L) {
                    Thread.sleep(BEServiceLnP.sleepTime);
                }
            }
        } catch (Exception e) {
        }
    }
}

Updated: here is my processor spec. I am running my program on a Linux machine with 2 processors defined as:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping        : 7
cpu MHz         : 2599.999
cache size      : 20480 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm arat pln pts
bogomips        : 5199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
Can anyone tell me what can be the reason for such high performance issues with newSingleThreadExecutor vs newFixedThreadPool(20)...
If you are running many more tasks in parallel (20 in this case) than you have processors (and I doubt that you have a 20+ processor box), then yes, each individual task is going to take longer to complete. It is easier for the computer to execute one task at a time instead of switching between multiple threads running at the same time. Even if you limit the number of threads in the pool to the number of CPUs you have, each task will probably run slower, albeit slightly.
If, however, you compare the throughput (the amount of time needed to complete a number of tasks) of your different-sized thread pools, you should see that the 20-thread throughput is higher. If you execute 1000 tasks with 20 threads, overall they will finish much sooner than with just 1 thread. Each task may take longer, but they will be executing in parallel. It will probably not be 20 times faster, given thread overhead, etc., but it might be something like 15 times faster.
You should not be worrying about individual task speed; rather, you should be trying to maximize task throughput by tuning the number of threads in your pool. How many threads to use depends heavily on the amount of IO, the CPU cycles used by each task, locks, synchronized blocks, other applications running on the OS, and other factors.
People often use 1-2 times the number of CPUs as a good place to start in terms of the number of threads in the pool to maximize throughput. More IO requests or thread-blocking operations, then add more threads. More CPU bound, then reduce the number of threads to be closer to the number of CPUs available. If your application is competing for OS cycles with other more important applications on the server, then even fewer threads may be required.
In a nutshell, if your tasks are CPU intensive (i.e. there are no reads/writes or blocking calls that keep a thread idle), then the thread count can be set near your core count. This uses all resources efficiently and avoids too much context switching. If the tasks are IO intensive, making API or DB calls, then it is better to have a higher number of threads, so that while one thread is waiting for a response another thread can use the otherwise idle CPU. A sizing sketch follows.
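A sketch of one common sizing heuristic, from "Java Concurrency in Practice" by Goetz et al.; the wait/compute ratio here is an assumption you must measure for your own task, not a given:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        // threads = cores * targetUtilization * (1 + waitTime / computeTime)
        int cores = Runtime.getRuntime().availableProcessors();
        double targetUtilization = 1.0;  // fraction of each core to keep busy
        double waitComputeRatio = 50.0;  // assumption: e.g. 50 ms blocked per 1 ms of CPU work
        int poolSize = (int) (cores * targetUtilization * (1 + waitComputeRatio));
        System.out.println("suggested pool size = " + poolSize);
        ExecutorService service = Executors.newFixedThreadPool(poolSize);
        service.shutdown();
    }
}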
Why does this code not see any significant performance gain when I use multiple threads on a quadcore machine?
I wrote some Java code to learn more about the Executor framework. Specifically, I wrote code to verify the Collatz Hypothesis - this says that if you iteratively apply the following function to any integer, you get to 1 eventually:

f(n) = ((n % 2) == 0) ? n/2 : 3*n + 1

CH is still unproven, and I figured it would be a good way to learn about Executor. Each thread is assigned a range [l,u] of integers to check. Specifically, my program takes 3 arguments - N (the number up to which I want to check CH), RANGESIZE (the length of the interval that a thread has to process), and NTHREADS, the size of the thread pool.
My code works fine, but I saw much less speedup than I expected - of the order of 30% when I went from 1 to 4 threads. My logic was that the computation is completely CPU bound, and each subtask (checking CH for a fixed-size range) takes roughly the same time.
Does anyone have ideas as to why I'm not seeing a 3 to 4x increase in speed? If you could report your runtimes as you increase the number of threads (along with the machine, JVM and OS), that would also be great.

Specifics

Runtimes:
java -d64 -server -cp . Collatz 10000000 1000000 4 => 4 threads, takes 28412 milliseconds
java -d64 -server -cp . Collatz 10000000 1000000 1 => 1 thread, takes 38286 milliseconds

Processor: Quadcore Intel Q6600 at 2.4GHZ, 4GB. The machine is unloaded.
Java: java version "1.6.0_15"; Java(TM) SE Runtime Environment (build 1.6.0_15-b03); Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
OS: Linux quad0 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux

Code: (I can't get the code to post, I think it's too long for SO requirements; the source is available on Google Docs)

import java.math.BigInteger;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MyRunnable implements Runnable {
    public int lower;
    public int upper;

    MyRunnable(int lower, int upper) {
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    public void run() {
        for (int i = lower; i <= upper; i++) {
            Collatz.check(i);
        }
        System.out.println("(" + lower + "," + upper + ")");
    }
}

public class Collatz {

    public static boolean check(BigInteger X) {
        if (X.equals(BigInteger.ONE)) {
            return true;
        } else if (X.getLowestSetBit() == 1) { // odd
            BigInteger Y = (new BigInteger("3")).multiply(X).add(BigInteger.ONE);
            return check(Y);
        } else {
            BigInteger Z = X.shiftRight(1); // fast divide by 2
            return check(Z);
        }
    }

    public static boolean check(int x) {
        BigInteger X = new BigInteger(new Integer(x).toString());
        return check(X);
    }

    static int N = 10000000;
    static int RANGESIZE = 1000000;
    static int NTHREADS = 4;

    static void parseArgs(String[] args) {
        if (args.length >= 1) {
            N = Integer.parseInt(args[0]);
        }
        if (args.length >= 2) {
            RANGESIZE = Integer.parseInt(args[1]);
        }
        if (args.length >= 3) {
            NTHREADS = Integer.parseInt(args[2]);
        }
    }

    public static void maintest(String[] args) {
        System.out.println("check(1): " + check(1));
        System.out.println("check(3): " + check(3));
        System.out.println("check(8): " + check(8));
        parseArgs(args);
    }

    public static void main(String[] args) {
        long lDateTime = new Date().getTime();
        parseArgs(args);
        List<Thread> threads = new ArrayList<Thread>();
        ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
        for (int i = 0; i < (N / RANGESIZE); i++) {
            Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
            executor.execute(worker);
        }
        executor.shutdown();
        while (!executor.isTerminated()) { }
        System.out.println("Finished all threads");
        long fDateTime = new Date().getTime();
        System.out.println("time in milliseconds for checking to " + N + " is " +
                (fDateTime - lDateTime) + " (" + N / (fDateTime - lDateTime) + " per ms)");
    }
}
Busy waiting can be a problem:

while (!executor.isTerminated()) { }

You can use awaitTermination() instead:

while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {}
You are using BigInteger. It consumes a lot of register space. What you most likely have on the compiler level is register spilling that makes your process memory-bound. Also note that when you are timing your results you are not taking into account extra time taken by the JVM to allocate threads and work with the thread pool. You could also have memory conflicts when you are using constant Strings. All strings are stored in a shared string pool and so it may become a bottleneck, unless java is really clever about it. Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.
As @axtavt answered, busy waiting can be a problem. You should fix that first, as it is part of the answer, but not all of it. It won't appear to help in your case (on the Q6600), because it seems to be bottlenecked at 2 cores for some reason, so another core is available for the busy loop and there is no apparent slowdown, but on my Core i5 it speeds up the 4-thread version noticeably.
I suspect that in the case of the Q6600 your particular app is limited by the amount of shared cache available, or something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, which means CPUs are sharing them, and there is no L3 cache. On my Core i5, each CPU has a dedicated L2 cache (256K), and then there is a larger 8MB shared L3 cache. 256K more per-CPU cache might make a difference... otherwise something else architecture-wise does.
Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.
On my work PC, which is also a Q6600 @ 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21 (64-bit), here are some basic results:

10000000 500000 1 (avg of three runs): 36982 ms
10000000 500000 4 (avg of three runs): 21252 ms

Faster, certainly - but not completing in a quarter of the time like you would expect, or even half... (though it is roughly just a bit more than half, more on that in a moment). Note that in my case I halved the size of the work units, and have a default max heap of 1500m.
At home on my Core i5 750 (4 cores, no hyperthreading), 4GB RAM, Windows 7 64-bit, JDK 1.6.0_22 (64-bit):

10000000 500000 1 (avg of 3 runs): 32677 ms
10000000 500000 4 (avg of 3 runs): 8825 ms
10000000 500000 4 (avg of 3 runs): 11475 ms (without the busy wait fix, for reference)

The 4-thread version takes 27% of the time the 1-thread version takes when the busy-wait loop is removed. Much better. Clearly the code can make efficient use of 4 cores...
NOTE: Java 1.6.0_18 and later have modified default heap settings - so my default heap size is almost 1500m on my work PC, and around 1000m on my home PC. You may want to increase your default heap, just in case garbage collection is happening and slowing your 4-threaded version down a bit. It might help, it might not.
At least in your example, there's a chance your larger work unit size is skewing your results slightly... halving it may help you get closer to at least 2x the speed, since 4 threads will be kept busy for a longer portion of the time. I don't think the Q6600 will do much better at this particular task... whether it is cache or some other inherent architecture thing.
In all cases, I am simply running "java Collatz 10000000 500000 X", where X = # of threads indicated. The only changes I made to your java file were to make one of the println's into a print, so there were fewer line breaks for my runs with 500000 per work unit and I could see more results in my console at once, and I ditched the busy wait loop, which matters on the i5 750 but didn't make a difference on the Q6600.
You could try using the submit method and then watching the Futures that are returned, checking them to see if the threads have finished. isTerminated() doesn't return true until there has been a shutdown.

Future<?> submit(Runnable task)
    Submits a Runnable task for execution and returns a Future representing that task.
isTerminated()
    Returns true if all tasks have completed following shut down.

Try this...

public static void main(String[] args) {
    long lDateTime = new Date().getTime();
    parseArgs(args);
    List<Thread> threads = new ArrayList<Thread>();
    List<Future> futures = new ArrayList<Future>();
    ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
    for (int i = 0; i < (N / RANGESIZE); i++) {
        Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
        futures.add(executor.submit(worker));
    }
    boolean done = false;
    while (!done) {
        for (Future future : futures) {
            done = true;
            if (!future.isDone()) {
                done = false;
                break;
            }
        }
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    System.out.println("Finished all threads");
    long fDateTime = new Date().getTime();
    System.out.println("time in milliseconds for checking to " + N + " is " +
            (fDateTime - lDateTime) + " (" + N / (fDateTime - lDateTime) + " per ms)");
    System.exit(0);
}
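An alternative not shown in the answer above: invokeAll blocks until every submitted task completes, which removes the polling loop entirely. A sketch, reusing the question's MyRunnable, N and RANGESIZE (invokeAll throws InterruptedException, so declare or catch it):

List<Callable<Object>> tasks = new ArrayList<>();
for (int i = 0; i < (N / RANGESIZE); i++) {
    // Executors.callable wraps a Runnable as a Callable for invokeAll
    tasks.add(Executors.callable(new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE)));
}
executor.invokeAll(tasks); // blocks until every task has completed
executor.shutdown();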
Performance of ThreadLocal variable
How much slower is a read from a ThreadLocal variable than from a regular field? More concretely, is simple object creation faster or slower than access to a ThreadLocal variable?
I assume that it is fast enough so that keeping a ThreadLocal<MessageDigest> instance is much faster than creating an instance of MessageDigest every time. But does that also apply to a byte[10] or a byte[1000], for example?
Edit: The question is, what is really going on when calling ThreadLocal's get? If that is just a field, like any other, then the answer would be "it's always fastest", right?
In 2009, some JVMs implemented ThreadLocal using an unsynchronised HashMap in the Thread.currentThread() object. This made it extremely fast (though not nearly as fast as using a regular field access, of course), as well as ensuring that the ThreadLocal object got tidied up when the Thread died. Updating this answer in 2016, it seems most (all?) newer JVMs use a ThreadLocalMap with linear probing. I am uncertain about the performance of those – but I cannot imagine it is significantly worse than the earlier implementation. Of course, new Object() is also very fast these days, and the garbage collectors are also very good at reclaiming short-lived objects. Unless you are certain that object creation is going to be expensive, or you need to persist some state on a thread by thread basis, you are better off going for the simpler allocate when needed solution, and only switching over to a ThreadLocal implementation when a profiler tells you that you need to.
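For reference, the pattern under discussion in the question looks like this; a minimal sketch, with SHA-256 as an arbitrary example algorithm:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Digests {
    // One MessageDigest per thread, created lazily on first get().
    private static final ThreadLocal<MessageDigest> DIGEST = ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    });

    public static byte[] hash(byte[] input) {
        return DIGEST.get().digest(input);
    }
}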
Running unpublished benchmarks, ThreadLocal.get takes around 35 cycles per iteration on my machine. Not a great deal. In Sun's implementation a custom linear-probing hash map in Thread maps ThreadLocals to values. Because it is only ever accessed by a single thread, it can be very fast.
Allocation of small objects takes a similar number of cycles, although because of cache exhaustion you may get somewhat lower figures in a tight loop.
Construction of a MessageDigest is likely to be relatively expensive. It has a fair amount of state, and construction goes through the Provider SPI mechanism. You may be able to optimise by, for instance, cloning or providing the Provider.
Just because it may be faster to cache in a ThreadLocal rather than create does not necessarily mean that the system performance will increase. You will have additional overheads related to GC which slow everything down. Unless your application very heavily uses MessageDigest, you might want to consider using a conventional thread-safe cache instead.
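A sketch of the cloning idea mentioned above; it assumes the chosen MessageDigest implementation actually supports clone(), which is not guaranteed:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestFactory {
    private final MessageDigest prototype;

    public DigestFactory(String algorithm) throws NoSuchAlgorithmException {
        this.prototype = MessageDigest.getInstance(algorithm); // SPI lookup happens once
    }

    // Cloning skips the Provider SPI lookup done by getInstance();
    // not every implementation is cloneable, hence the checked exception.
    public MessageDigest newDigest() throws CloneNotSupportedException {
        return (MessageDigest) prototype.clone();
    }
}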
Good question, I've been asking myself that recently. To give you definite numbers, the benchmarks below (in Scala, compiled to virtually the same bytecodes as the equivalent Java code):

var cnt: String = ""
val tlocal = new java.lang.ThreadLocal[String] {
  override def initialValue = ""
}

def loop_heap_write = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (cnt ne "") cnt = "!"
    i += 1
  }
  cnt
}

def threadlocal = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (tlocal.get eq null) i = until + i + 1
    i += 1
  }
  if (i > until) println("thread local value was null " + i)
}

available here, were performed on an AMD 4x 2.8 GHz dual-core and a quad-core i7 with hyperthreading (2.67 GHz). These are the numbers:

i7 Specs: Intel i7 2x quad-core @ 2.67 GHz
Test: scala.threads.ParallelTests

Test name: loop_heap_read

Thread num.: 1, Total tests: 200
Run times: (showing last 5) 9.0069 9.0036 9.0017 9.0084 9.0074 (avg = 9.1034 min = 8.9986 max = 21.0306)
Thread num.: 2, Total tests: 200
Run times: (showing last 5) 4.5563 4.7128 4.5663 4.5617 4.5724 (avg = 4.6337 min = 4.5509 max = 13.9476)
Thread num.: 4, Total tests: 200
Run times: (showing last 5) 2.3946 2.3979 2.3934 2.3937 2.3964 (avg = 2.5113 min = 2.3884 max = 13.5496)
Thread num.: 8, Total tests: 200
Run times: (showing last 5) 2.4479 2.4362 2.4323 2.4472 2.4383 (avg = 2.5562 min = 2.4166 max = 10.3726)

Test name: threadlocal

Thread num.: 1, Total tests: 200
Run times: (showing last 5) 91.1741 90.8978 90.6181 90.6200 90.6113 (avg = 91.0291 min = 90.6000 max = 129.7501)
Thread num.: 2, Total tests: 200
Run times: (showing last 5) 45.3838 45.3858 45.6676 45.3772 45.3839 (avg = 46.0555 min = 45.3726 max = 90.7108)
Thread num.: 4, Total tests: 200
Run times: (showing last 5) 22.8118 22.8135 59.1753 22.8229 22.8172 (avg = 23.9752 min = 22.7951 max = 59.1753)
Thread num.: 8, Total tests: 200
Run times: (showing last 5) 22.2965 22.2415 22.3438 22.3109 22.4460 (avg = 23.2676 min = 22.2346 max = 50.3583)

AMD Specs: AMD 8220 4x dual-core @ 2.8 GHz
Test: scala.threads.ParallelTests

Test name: loop_heap_read

Total work: 20000000
Thread num.: 1, Total tests: 200
Run times: (showing last 5) 12.625 12.631 12.634 12.632 12.628 (avg = 12.7333 min = 12.619 max = 26.698)
Total work: 20000000
Run times: (showing last 5) 6.412 6.424 6.408 6.397 6.43 (avg = 6.5367 min = 6.393 max = 19.716)
Thread num.: 4, Total tests: 200
Run times: (showing last 5) 3.385 4.298 9.7 6.535 3.385 (avg = 5.6079 min = 3.354 max = 21.603)
Thread num.: 8, Total tests: 200
Run times: (showing last 5) 5.389 5.795 10.818 3.823 3.824 (avg = 5.5810 min = 2.405 max = 19.755)

Test name: threadlocal

Thread num.: 1, Total tests: 200
Run times: (showing last 5) 200.217 207.335 200.241 207.342 200.23 (avg = 202.2424 min = 200.184 max = 245.369)
Thread num.: 2, Total tests: 200
Run times: (showing last 5) 100.208 100.199 100.211 103.781 100.215 (avg = 102.2238 min = 100.192 max = 129.505)
Thread num.: 4, Total tests: 200
Run times: (showing last 5) 62.101 67.629 62.087 52.021 55.766 (avg = 65.6361 min = 50.282 max = 167.433)
Thread num.: 8, Total tests: 200
Run times: (showing last 5) 40.672 74.301 34.434 41.549 28.119 (avg = 54.7701 min = 28.119 max = 94.424)

Summary

A thread-local read is around 10-20x slower than the heap read. It also seems to scale well on this JVM implementation and these architectures with the number of processors.
@Pete is correct: test before you optimise.
I would be very surprised if constructing a MessageDigest has any serious overhead when compared to actually using it.
Misusing ThreadLocal can be a source of leaks and dangling references that don't have a clear life cycle; generally I don't ever use ThreadLocal without a very clear plan of when a particular resource will be removed.
Here goes another test. The results show that ThreadLocal is a bit slower than a regular field, but in the same order of magnitude - approximately 12% slower.

public class Test {
    private static final int N = 100000000;
    private static int fieldExecTime = 0;
    private static int threadLocalExecTime = 0;

    public static void main(String[] args) throws InterruptedException {
        int execs = 10;
        for (int i = 0; i < execs; i++) {
            new FieldExample().run(i);
            new ThreadLocaldExample().run(i);
        }
        System.out.println("Field avg:" + (fieldExecTime / execs));
        System.out.println("ThreadLocal avg:" + (threadLocalExecTime / execs));
    }

    private static class FieldExample {
        private Map<String, String> map = new HashMap<String, String>();

        public void run(int z) {
            System.out.println(z + "-Running field sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                map.put(s, "a");
                map.remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            fieldExecTime += t;
            System.out.println(z + "-End field sample:" + t);
        }
    }

    private static class ThreadLocaldExample {
        private ThreadLocal<Map<String, String>> myThreadLocal = new ThreadLocal<Map<String, String>>() {
            @Override
            protected Map<String, String> initialValue() {
                return new HashMap<String, String>();
            }
        };

        public void run(int z) {
            System.out.println(z + "-Running thread local sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                myThreadLocal.get().put(s, "a");
                myThreadLocal.get().remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            threadLocalExecTime += t;
            System.out.println(z + "-End thread local sample:" + t);
        }
    }
}

Output:

0-Running field sample
0-End field sample:6044
0-Running thread local sample
0-End thread local sample:6015
1-Running field sample
1-End field sample:5095
1-Running thread local sample
1-End thread local sample:5720
2-Running field sample
2-End field sample:4842
2-Running thread local sample
2-End thread local sample:5835
3-Running field sample
3-End field sample:4674
3-Running thread local sample
3-End thread local sample:5287
4-Running field sample
4-End field sample:4849
4-Running thread local sample
4-End thread local sample:5309
5-Running field sample
5-End field sample:4781
5-Running thread local sample
5-End thread local sample:5330
6-Running field sample
6-End field sample:5294
6-Running thread local sample
6-End thread local sample:5511
7-Running field sample
7-End field sample:5119
7-Running thread local sample
7-End thread local sample:5793
8-Running field sample
8-End field sample:4977
8-Running thread local sample
8-End thread local sample:6374
9-Running field sample
9-End field sample:4841
9-Running thread local sample
9-End thread local sample:5471
Field avg:5051
ThreadLocal avg:5664

Env:
openjdk version "1.8.0_131"
Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
Ubuntu 16.04 LTS
Build it and measure it. Also, you only need one threadlocal if you encapsulate your message digesting behaviour into an object. If you need a local MessageDigest and a local byte[1000] for some purpose, create an object with a messageDigest and a byte[] field and put that object into the ThreadLocal rather than both individually.
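A minimal sketch of that encapsulation; the class name, field names, and the algorithm are illustrative:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Hashing {
    // Bundle the digest and its scratch buffer so one ThreadLocal covers both.
    private static class DigestContext {
        final MessageDigest messageDigest;
        final byte[] buffer = new byte[1000];

        DigestContext() {
            try {
                messageDigest = MessageDigest.getInstance("SHA-256"); // arbitrary example
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    private static final ThreadLocal<DigestContext> CONTEXT =
            ThreadLocal.withInitial(DigestContext::new);

    public static byte[] hash(byte[] input) {
        return CONTEXT.get().messageDigest.digest(input);
    }
}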