java parallelStreams on different machines


I have a function that iterates over a list using parallelStream and calls an API with each item as the parameter. I then store the results in a HashMap.
try {
return answerList.parallelStream()
.map(answer -> getReplyForAnswerCombination(answer))
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
} catch (final NullPointerException e) {
log.error("Error in generating final results.", e);
return null;
}
When I run it on laptop 1, it takes 1 hour.
But on laptop 2, it takes 5 hours.
After some basic research, I found that parallel streams use the default ForkJoinPool.commonPool(), which by default has one fewer thread than you have processors.
Laptop1 and laptop2 have different processors.
Is there a way to find out how many stream tasks can run in parallel on Laptop1 and Laptop2?
Can I use the suggestion given here to safely increase the number of parallel streams in laptop2?
long start = System.currentTimeMillis();
IntStream s = IntStream.range(0, 20);
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "20");
s.parallel().forEach(i -> {
try { Thread.sleep(100); } catch (Exception ignore) {}
System.out.print((System.currentTimeMillis() - start) + " ");
});

Project Loom
If you want maximum performance on threaded code that blocks (as opposed to CPU-bound code), then use virtual threads (fibers) provided in Project Loom. Preliminary builds are available now, based on early-access Java 16.
Virtual threads
Virtual threads can be dramatically faster because a virtual thread is “parked” while blocked, set aside, so another virtual thread can make progress. This is so efficient for blocking tasks that threads can number in the millions.
Drop the streams approach. Merely send off each input to a virtual thread.
Full example code
Let's define classes for Answer and Reply, our inputs & outputs. We will use record, a new feature coming to Java 16, as an abbreviated way to define an immutable data-driven class. The compiler implicitly creates default implementations of constructor, getters, equals & hashCode, and toString.
public record Answer (String text)
{
}
…and:
public record Reply (String text)
{
}
Define our task to be submitted to an executor service. We write a class named ReplierTask that implements Runnable (has a run method).
Within the run method, we sleep the current thread to simulate waiting for a call to a database, file system, and/or remote service.
package work.basil.example;
import java.time.Duration;
import java.time.Instant;
import java.util.UUID;
import java.util.concurrent.ConcurrentMap;
public class ReplierTask implements Runnable
{
private Answer answer;
ConcurrentMap < Answer, Reply > map;
public ReplierTask ( Answer answer , ConcurrentMap < Answer, Reply > map )
{
this.answer = answer;
this.map = map;
}
private Reply getReplyForAnswerCombination ( Answer answer )
{
// Simulating a call to some service to produce a `Reply` object.
try { Thread.sleep( Duration.ofSeconds( 1 ) ); } catch ( InterruptedException e ) { e.printStackTrace(); } // Simulate blocking to wait for call to service or db or such.
return new Reply( UUID.randomUUID().toString() );
}
// `Runnable` interface
@Override
public void run ( )
{
System.out.println( "`run` method at " + Instant.now() + " for answer: " + this.answer );
Reply reply = this.getReplyForAnswerCombination( this.answer );
this.map.put( this.answer , reply );
}
}
Lastly, some code to do the work. We make a class named Mapper that contains a main method.
We simulate some input by populating an array of Answer objects. We create an empty ConcurrentMap in which to collect the results. And we assign each Answer object to a new thread where we call for a new Reply object and store the Answer/Reply pair as an entry in the map.
package work.basil.example;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
public class Mapper
{
public static void main ( String[] args )
{
System.out.println("Runtime.version(): " + Runtime.version() );
System.out.println("availableProcessors: " + Runtime.getRuntime().availableProcessors());
System.out.println("maxMemory: " + Runtime.getRuntime().maxMemory() + " | maxMemory/(1024*1024) -> megs: " +Runtime.getRuntime().maxMemory()/(1024*1024) );
Mapper app = new Mapper();
app.demo();
}
private void demo ( )
{
// Simulate our inputs, a list of `Answer` objects.
int limit = 10_000;
List < Answer > answers = new ArrayList <>( limit );
for ( int i = 0 ; i < limit ; i++ )
{
answers.add( new Answer( String.valueOf( i ) ) );
}
// Do the work.
Instant start = Instant.now();
System.out.println( "Starting work at: " + start + " on count of tasks: " + limit );
ConcurrentMap < Answer, Reply > results = new ConcurrentHashMap <>();
try
(
ExecutorService executorService = Executors.newVirtualThreadExecutor() ;
// Executors.newFixedThreadPool( 5 )
// Executors.newFixedThreadPool( 10 )
// Executors.newFixedThreadPool( 1_000 )
// Executors.newVirtualThreadExecutor()
)
{
for ( Answer answer : answers )
{
ReplierTask task = new ReplierTask( answer , results );
executorService.submit( task );
}
}
// At this point the flow-of-control blocks until all submitted tasks are done.
// The executor service is automatically closed by this point as well.
Duration elapsed = Duration.between( start , Instant.now() );
System.out.println( "results.size() = " + results.size() + ". Elapsed: " + elapsed );
}
}
We can change out the Executors.newVirtualThreadExecutor() with a pool of platform threads, to compare against virtual threads. Let's try a pool of 5, 10, and 1,000 platform threads on a Mac mini Intel with macOS Mojave sporting 6 real cores, no hyper-threading, 32 gigs of memory, and OpenJDK special build version 16-loom+9-316 assigned maxMemory of 8 gigs.
10,000 tasks at 1 second each | Total elapsed time
5 platform threads | half-hour (PT33M29.755792S)
10 platform threads | quarter-hour (PT16M43.318973S)
1,000 platform threads | 10 seconds (PT10.487689S)
10,000 platform threads | Error…unable to create native thread: possibly out of memory or process/resource limits reached
virtual threads | Under 3 seconds (PT2.645964S)
Caveats
Caveat: Project Loom is experimental and subject to change, not intended for production use yet. The team is asking for folks to give feedback now.
Caveat: CPU-bound tasks such as encoding video should stick with platform/kernel threads rather than virtual threads. Most common code doing blocking operations such as I/O, like accessing files, logging, hitting a database, or making network calls, will likely see massive performance boosts with virtual threads.
Caveat: You must have enough memory available for many or even all of your tasks to be running simultaneously. If not enough memory will be available, you must take additional steps to throttle the virtual threads.
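One common way to do that throttling is a counting semaphore that caps how many tasks are inside the blocking section at once. The following is only a minimal sketch: the class name ThrottledTasks, the method runThrottled, and the 10 ms sleep are made up for illustration, and a cached pool of platform threads stands in for a virtual-thread executor so that the code runs on any recent JDK.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ThrottledTasks {

    /**
     * Submits taskCount simulated blocking tasks but lets at most maxInFlight
     * of them run the blocking section at once. Returns the peak concurrency seen.
     */
    static int runThrottled(ExecutorService executor, int taskCount, int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < taskCount; i++) {
            executor.execute(() -> {
                try {
                    permits.acquire(); // wait here until a permit is free
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(10); // hypothetical stand-in for the blocking call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                    permits.release();
                }
            });
        }

        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        int peak = runThrottled(Executors.newCachedThreadPool(), 50, 4);
        System.out.println("Peak concurrent tasks: " + peak);
    }
}
```

The executor is passed in as a parameter, so the same method should work unchanged with an executor backed by virtual threads.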

The setting java.util.concurrent.ForkJoinPool.common.parallelism will have an effect on the threads available to use for operations which make use of the ForkJoinPool, such as Stream.parallel(). However: whether your task uses more threads depends on the number of items in the stream, and whether it takes less time to run depends on the nature of each task and your available processors.
This test program shows the effect of changing this system property with a trivial task:
public static void main(String[] args) {
ConcurrentHashMap<String,String> threads = new ConcurrentHashMap<>();
int max = Integer.parseInt(args[0]);
boolean parallel = args.length < 2 || !"single".equals(args[1]);
int [] arr = IntStream.range(0, max).toArray();
long start = System.nanoTime();
IntStream stream = Arrays.stream(arr);
if (parallel)
stream = stream.parallel();
stream.forEach(i -> {
threads.put("hc="+Thread.currentThread().hashCode()+" tn="+Thread.currentThread().getName(), "value");
});
long end = System.nanoTime();
System.out.println("parallelism: "+System.getProperty("java.util.concurrent.ForkJoinPool.common.parallelism"));
System.out.println("Threads: "+threads.keySet());
System.out.println("Array size: "+arr.length+" threads used: "+threads.size()+" ms="+TimeUnit.NANOSECONDS.toMillis(end-start));
}
Adding more threads won't necessarily speed things up. Here are some examples from test runs that count the threads used. They may help you decide on the best approach for your own task contained in getReplyForAnswerCombination().
java -cp example.jar -Djava.util.concurrent.ForkJoinPool.common.parallelism=1000 App 100000
Array size: 100000 threads used: 37
java -cp example.jar -Djava.util.concurrent.ForkJoinPool.common.parallelism=50 App 100000
Array size: 100000 threads used: 20
java -cp example.jar App 100000 single
Array size: 100000 threads used: 1
I suggest you look at the thread pooling (with or without Loom) in @Basil Bourque's answer. The JDK source code of the ForkJoinPool constructor also has some details on this system property:
private ForkJoinPool(byte forCommonPoolOnly)
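To compare the two laptops directly, you can print what the JVM reports on each one. This is a small sketch; the class name ParallelismReport is made up for illustration, and it uses only Runtime.availableProcessors() and ForkJoinPool.getParallelism(). Keep in mind that the system property is read when the common pool is first initialized, so setting it on the command line with -D is more reliable than setting it in code later.

```java
import java.util.concurrent.ForkJoinPool;

class ParallelismReport {

    /** Summarizes the processor count and common-pool parallelism seen by this JVM. */
    static String describe() {
        int processors = Runtime.getRuntime().availableProcessors();
        int commonParallelism = ForkJoinPool.commonPool().getParallelism();
        // null unless the property was set, for example with -D on the command line
        String override = System.getProperty("java.util.concurrent.ForkJoinPool.common.parallelism");
        return "availableProcessors=" + processors
                + " commonPoolParallelism=" + commonParallelism
                + " overrideProperty=" + override;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this on both machines shows whether the difference in run time lines up with a difference in the number of worker threads.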

Related

Multithreaded vs Asynchronous programming in a single core

If in real time the CPU performs only one task at a time then how is multithreading different from asynchronous programming (in terms of efficiency) in a single processor system?
Let's say, for example, we have to count from 1 to IntegerMax. In the following program on my multicore machine, the two thread final count is almost half of the single thread count. What if we ran this on a single-core machine? And is there any way we could achieve the same result there?
class Demonstration {
public static void main( String args[] ) throws InterruptedException {
SumUpExample.runTest();
}
}
class SumUpExample {
long startRange;
long endRange;
long counter = 0;
static long MAX_NUM = Integer.MAX_VALUE;
public SumUpExample(long startRange, long endRange) {
this.startRange = startRange;
this.endRange = endRange;
}
public void add() {
for (long i = startRange; i <= endRange; i++) {
counter += i;
}
}
static public void twoThreads() throws InterruptedException {
long start = System.currentTimeMillis();
SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);
Thread t1 = new Thread(() -> {
s1.add();
});
Thread t2 = new Thread(() -> {
s2.add();
});
t1.start();
t2.start();
t1.join();
t2.join();
long finalCount = s1.counter + s2.counter;
long end = System.currentTimeMillis();
System.out.println("Two threads final count = " + finalCount + " took " + (end - start));
}
static public void oneThread() {
long start = System.currentTimeMillis();
SumUpExample s = new SumUpExample(1, MAX_NUM );
s.add();
long end = System.currentTimeMillis();
System.out.println("Single thread final count = " + s.counter + " took " + (end - start));
}
public static void runTest() throws InterruptedException {
oneThread();
twoThreads();
}
}
Output:
Single thread final count = 2305843008139952128 took 1003
Two threads final count = 2305843008139952128 took 540

For a purely CPU-bound operation you are correct. Most programs (99.9999%) need to do input, output, and invoke other services. Those are orders of magnitude slower than the CPU, so while waiting for the results of an external operation, the OS can schedule and run other (many other) processes in time slices.
Hardware multithreading benefits primarily when 2 conditions are met:
CPU-intensive operations;
That can be efficiently divided into independent subsets
Or you have lots of different tasks to run that can be efficiently divided among multiple hardware processors.
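The point about waiting on external operations can be seen in a small experiment. This is only a sketch: the class name BlockingOverlap is made up, and Thread.sleep stands in for a call to a database or remote service. Two blocking calls made on separate threads overlap their waits, even on a single core, because a sleeping thread does not need the CPU.

```java
class BlockingOverlap {

    /** Hypothetical stand-in for a call that waits on I/O for about 300 ms. */
    static void blockingCall() {
        try {
            Thread.sleep(300);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Two blocking calls, one after the other, on the calling thread. */
    static long sequentialMillis() {
        long start = System.nanoTime();
        blockingCall();
        blockingCall();
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** The same two blocking calls, each on its own thread. */
    static long threadedMillis() {
        long start = System.nanoTime();
        Thread t1 = new Thread(BlockingOverlap::blockingCall);
        Thread t2 = new Thread(BlockingOverlap::blockingCall);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("sequential ms: " + sequentialMillis());
        System.out.println("threaded ms:   " + threadedMillis());
    }
}
```

A CPU-bound loop, like the one in the question, would not show this gain on one core, since both threads would be competing for the same processor.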

In the following program on my multicore machine, the two thread final count is almost half of the single thread count.
That is what I would expect from a valid benchmark when the application is using two cores.
However, looking at your code, I am somewhat surprised that you are getting those results ... so reliably.
Your benchmark doesn't take account of JVM warmup effects, particularly JIT compilation.
Your benchmark's add method could potentially be optimized by the JIT compiler to get rid of the loop entirely. (But at least the counts are "used" ... by printing them out.)
I guess you got lucky ... but I'm not convinced those results will be reproducible for all versions of Java, or if you tweaked the benchmark.
Please read this:
How do I write a correct micro-benchmark in Java?
What if we ran this in a single core machine?
Assuming the following:
You rewrote the benchmark to correct the flaws above.
You are running on a system where hardware hyper-threading [1] is disabled [2].
Then ... I would expect the two-thread version to take more than twice as long as it did on two cores, that is, somewhat longer than the one-thread version.
Q: Why "more than"?
A: Because there is a significant overhead in starting a new thread. Depending on your hardware, OS and Java version, it could be more than a millisecond. Certainly, the time taken is significant if you repeatedly use and discard threads.
And is there any way we could achieve the same result there?
Not sure what you are asking here. But if you are asking how to simulate the behavior of one core on a multi-core machine, you would probably need to do this at the OS level. See https://superuser.com/questions/309617 for Windows and https://askubuntu.com/questions/483824 for Linux.
1 - Hyperthreading is a hardware optimization where a single core's processing hardware supports (typically) two hyper-threads. Each hyperthread has its own sets of registers, but it shares functional units such as the ALU with the other hyperthread. So the two hyperthreads behave like (typically) two cores, except that they may be slower, depending on the precise instruction mix. A typical OS will treat a hyperthread as if it is a regular core. Hyperthreading is typically enabled / disabled at boot time; e.g. via a BIOS setting.
2 - If hyperthreading is enabled, it is possible that two Java threads won't be twice as fast as one in a CPU-intensive computation like this ... due to possible slowdown caused by the "other" hyperthread on respective cores. Did someone mention that benchmarking is complicated?
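As a rough illustration of the warmup point, here is a hand-rolled timing loop that runs the work a few times untimed before measuring. It is only a sketch under stated assumptions: the class name WarmedUpTiming and its methods are made up, and the round counts are arbitrary. For real measurements a dedicated harness such as JMH is the safer choice.

```java
class WarmedUpTiming {

    // Every result is stored here so the computed value is actually used,
    // which makes it less likely that the JIT discards the loop as dead code.
    static volatile long sink;

    /** Sums the integers from 'from' to 'to', inclusive. */
    static long sumRange(long from, long to) {
        long total = 0;
        for (long i = from; i <= to; i++) {
            total += i;
        }
        return total;
    }

    /**
     * Runs warmupRounds untimed rounds so the JIT can compile sumRange,
     * then returns the fastest of measuredRounds timed rounds, in nanoseconds.
     */
    static long bestNanos(long upTo, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            sink = sumRange(1, upTo);
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            sink = sumRange(1, upTo);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        long nanos = bestNanos(100_000_000L, 5, 5);
        System.out.println("best of 5 after warmup: " + (nanos / 1_000_000) + " ms");
    }
}
```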

Best way to process incoming data in parallel? (Java/Groovy)

I have a program that parses reports coming from multiple devices (about 1000 devices), saves them to a DB and then does additional processing on them.
Parsing the reports can be done concurrently, but the saving to the DB and the additional processing requires some synchronization based on what device ID they come from (since it might be needed to update the same data on the DB).
So, I can run the processing in parallel as long as the threads are handling reports from different device IDs.
What could be the most efficient way to process this?
Example
I initially thought about using a thread pool and locking on the device ID, but that won't be efficient if I get a burst of reports coming from a single device.
For example, considering a thread pool with 4 threads and 10 incoming reports:
Report # | DeviceID
1 | A
2 | A
3 | A
4 | A
5 | A
6 | B
7 | C
8 | D
9 | E
10 | F
Thread 1 would start processing A's report, thread 2-4 would wait until thread 1 finishes, and the rest of the reports would get queued.
It would be more efficient if the rest of A's reports could be queued instead, allowing B/C/D reports to be processed concurrently. Is there an efficient way to do this?

Try using a priority queue. The highest priority items in the queue would be chosen for processing by the thread pool. For example:
NOTE: I know priority queues are not typically implemented using an array and that some priority queues use smaller index values for higher priority. I just use this notation for simplicity's sake.
Let (DeviceID, Priority).
Let current thread pool be empty -> []
Say, we get an incoming 10 reports ->
[(A, 1), (A, 1), (A, 1), (B, 1), (B, 1), (C, 1), (D, 1), (E, 1), (F, 1), (G, 1)]
(represents the filled priority queue upon receiving the reports).
So, you dequeue the first item and give it to the thread pool. Then decrement the priority of all items in the priority queue with DeviceID A. This would look like the following:
(A, 1) is dequeued so you just get A. The priority queue would then shift after decrementing the priorities of A's still in the queue.
[(B, 1), (B, 1), (C, 1), (D, 1), (E, 1), (F, 1), (G, 1), (A, 0), (A, 0)]
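The dequeue-and-demote step above could be sketched in Java roughly as follows. The class name DemotingReportQueue and its methods are made up for illustration, and the arrival sequence number used to break ties is an assumption of this sketch. How and when a device's priority is restored after its report finishes is not covered above, so it is left out here too.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class DemotingReportQueue {

    static final class Report {
        final String deviceId;
        final long sequence; // arrival order, used to break ties
        int priority = 1;

        Report(String deviceId, long sequence) {
            this.deviceId = deviceId;
            this.sequence = sequence;
        }
    }

    // Higher priority first; among equal priorities, earlier arrival first.
    private static final Comparator<Report> ORDER =
            Comparator.comparingInt((Report r) -> -r.priority)
                    .thenComparingLong(r -> r.sequence);

    private final PriorityQueue<Report> queue = new PriorityQueue<>(ORDER);
    private long nextSequence = 0;

    synchronized void add(String deviceId) {
        queue.add(new Report(deviceId, nextSequence++));
    }

    /** Removes the head and demotes every queued report from the same device. */
    synchronized String dequeue() {
        Report head = queue.poll();
        if (head == null) {
            return null;
        }
        // Take matching reports out before changing their priority, because
        // mutating an element while it sits in the heap would corrupt the ordering.
        List<Report> sameDevice = new ArrayList<>();
        Iterator<Report> it = queue.iterator();
        while (it.hasNext()) {
            Report r = it.next();
            if (r.deviceId.equals(head.deviceId)) {
                it.remove();
                sameDevice.add(r);
            }
        }
        for (Report r : sameDevice) {
            r.priority--;
            queue.add(r);
        }
        return head.deviceId;
    }

    /** Device IDs of the pending reports, in the order they would be dequeued now. */
    synchronized List<String> pendingInOrder() {
        List<Report> copy = new ArrayList<>(queue);
        copy.sort(ORDER);
        List<String> ids = new ArrayList<>();
        for (Report r : copy) {
            ids.add(r.deviceId);
        }
        return ids;
    }
}
```

Note that demotion only pushes a busy device's reports to the back. If the pool drains the queue completely, two reports from the same device can still be picked up together, so some per-device lock would still be needed around the database update.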

Project Loom
After having seen some late 2020 videos on YouTube.com with Ron Pressler, head of Project Loom at Oracle, the solution is quite simple with the new virtual threads (fibers) feature coming to a future release of Java:
Call a new Executors method to create an executor service that uses virtual threads (fibers) rather than platform/kernel threads.
Submit all incoming report processing tasks to that executor service.
Inside each task, attempt to grab a semaphore, one semaphore for each of your 1,000 devices.
That semaphore will be the way to process only one input per device at a time, to parallelize per source-device. If the semaphore representing a particular device is not available, simply block — let your report processing thread wait until the semaphore is available.
Project Loom maintains many lightweight virtual threads (fibers), even millions, that are run on a few heavyweight platform/kernel threads. This makes blocking a thread cheap.
Early builds of a JDK binary with Project Loom built-in for macOS/Linux/Windows are available now.
Caveat: I’m no expert on concurrency nor on Project Loom. But your particular use-case seems to match some specific recommendations made by Ron Pressler in his videos.
Example code
Here is some example code that I noodled around with. I am not at all sure whether this is a good example.
I used an early-access build of Java 16, specially built with Project Loom technology: Build 16-loom+9-316 (2020/11/30) for macOS Intel.
package work.basil.example;
import java.time.*;
import java.util.*;
import java.util.concurrent.*;
/**
* An example of using Project Loom virtual threads to more simply process incoming data on background threads.
* <p>
* This code was built as a possible solution to this Question at StackOverflow.com: https://stackoverflow.com/q/65327325/642706
* <p>
* Posted in my Answer at StackOverflow.com: https://stackoverflow.com/a/65328799/642706
* <p>
* ©2020 Basil Bourque. 2020-12.
* <p>
* This work by Basil Bourque is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0
* <p>
* Caveats:
* - Project Loom is still in early-release, available only as a special build of OpenJDK for Java 16.
* - I am *not* an expert on concurrency in general, nor Project Loom in particular. This code is merely me guessing and experimenting.
*/
public class App
{
// FYI, Project Loom links:
// https://wiki.openjdk.java.net/display/loom/Main
// http://jdk.java.net/loom/ (special early-access builds of Java 16 with Project Loom built-in)
// https://download.java.net/java/early_access/loom/docs/api/ (Javadoc)
// https://www.youtube.com/watch?v=23HjZBOIshY (Ron Pressler talk, 2020-07)
public static void main ( String[] args )
{
System.out.println( "java.version: " + System.getProperty( "java.version" ) );
App app = new App();
app.checkForProjectLoom();
app.demo();
}
public static boolean projectLoomIsPresent ( )
{
try
{
Thread.class.getDeclaredMethod( "startVirtualThread" , Runnable.class );
return true;
}
catch ( NoSuchMethodException e )
{
return false;
}
}
private void checkForProjectLoom ( )
{
if ( App.projectLoomIsPresent() )
{
System.out.println( "INFO - Running on a JVM with Project Loom technology. " + Instant.now() );
} else
{
throw new IllegalStateException( "Project Loom technology not present in this Java implementation. " + Instant.now() );
}
}
record ReportProcessorRunnable(Semaphore semaphore , Integer deviceIdentifier , boolean printToConsole , Queue < String > fauxDatabase) implements Runnable
{
@Override
public void run ( )
{
// Our goal is to serialize the report-processing per device.
// Each device can have only one report being processed at a time.
// In Project Loom this can be accomplished simply by spawning virtual threads for all such
// reports but process them serially by synchronizing on a binary (single-permit) semaphore.
// Each thread working on a report submitted for that device waits on semaphore assigned to that device.
// Blocking to wait for the semaphore is cheap in Project Loom using virtual threads. The underlying
// platform/kernel thread carrying this virtual thread will be assigned other work while this
// virtual thread is parked.
try
{
semaphore.acquire(); // Blocks until the semaphore for this particular device becomes available. Blocking is cheap on a virtual thread.
// Simulate more lengthy work being done by sleeping the virtual thread handling this task via the executor service.
try {Thread.sleep( Duration.ofMillis( 100 ) );} catch ( InterruptedException e ) {e.printStackTrace();}
String fauxData = "Insert into database table for device ID # " + this.deviceIdentifier + " at " + Instant.now();
fauxDatabase.add( fauxData );
if ( Objects.nonNull( this.printToConsole ) && this.printToConsole ) { System.out.println( fauxData ); }
semaphore.release(); // For fun, comment-out this line to see the effect of the per-device semaphore at runtime.
}
catch ( InterruptedException e )
{
e.printStackTrace();
}
}
}
record IncomingReportsSimulatorRunnable(Map < Integer, Semaphore > deviceToSemaphoreMap ,
ExecutorService reportProcessingExecutorService ,
int countOfReportsToGeneratePerBatch ,
boolean printToConsole ,
Queue < String > fauxDatabase)
implements Runnable
{
@Override
public void run ( )
{
if ( printToConsole ) System.out.println( "INFO - Generating " + countOfReportsToGeneratePerBatch + " reports at " + Instant.now() );
for ( int i = 0 ; i < countOfReportsToGeneratePerBatch ; i++ )
{
// Make a new Runnable task containing report data to be processed, and submit this task to the executor service using virtual threads.
// To simulate a device sending in a report, we randomly pick one of the devices to pretend it is our source of report data.
final List < Integer > deviceIdentifiers = List.copyOf( deviceToSemaphoreMap.keySet() );
int randomIndexNumber = ThreadLocalRandom.current().nextInt( 0 , deviceIdentifiers.size() );
Integer deviceIdentifier = deviceIdentifiers.get( randomIndexNumber );
Semaphore semaphore = deviceToSemaphoreMap.get( deviceIdentifier );
Runnable processReport = new ReportProcessorRunnable( semaphore , deviceIdentifier , printToConsole , fauxDatabase );
reportProcessingExecutorService.submit( processReport );
}
}
}
private void demo ( )
{
// Configure experiment.
Duration durationOfExperiment = Duration.ofSeconds( 20 );
int countOfReportsToGeneratePerBatch = 7; // Would be 40 per the Stack Overflow Question.
boolean printToConsole = true;
// To use as a concurrent list, I found this suggestion to use `ConcurrentLinkedQueue`: https://stackoverflow.com/a/25630263/642706
Queue < String > fauxDatabase = new ConcurrentLinkedQueue < String >();
// Represent each of the thousand devices that are sending us report data to be processed.
// We map each device to a Java `Semaphore` object, to serialize the processing of multiple reports per device.
final int firstDeviceNumber = 1_000;
final int countDevices = 10; // Would be 1_000 per the Stack Overflow question.
final Map < Integer, Semaphore > deviceToSemaphoreMap = new TreeMap <>();
for ( int i = 0 ; i < countDevices ; i++ )
{
Integer deviceIdentifier = i + firstDeviceNumber; // Our devices are identified as numbered 1,000 to 1,999.
Semaphore semaphore = new Semaphore( 1 , true ); // A single permit to make a binary semaphore, and make it fair.
deviceToSemaphoreMap.put( deviceIdentifier , semaphore );
}
// Run experiment.
// Notice that in Project Loom the `ExecutorService` interface is now `AutoCloseable`, for use in try-with-resources syntax.
try (
ScheduledExecutorService reportGeneratingExecutorService = Executors.newSingleThreadScheduledExecutor() ;
ExecutorService reportProcessingExecutorService = Executors.newVirtualThreadExecutor() ;
)
{
Runnable simulateIncommingReports = new IncomingReportsSimulatorRunnable( deviceToSemaphoreMap , reportProcessingExecutorService , countOfReportsToGeneratePerBatch , printToConsole , fauxDatabase );
ScheduledFuture scheduledFuture = reportGeneratingExecutorService.scheduleAtFixedRate( simulateIncommingReports , 0 , 1 , TimeUnit.SECONDS );
try {Thread.sleep( durationOfExperiment );} catch ( InterruptedException e ) {e.printStackTrace();}
}
// Notice that when reaching this point we block until all submitted tasks still running are finished,
// because that is the new behavior of `ExecutorService` being `AutoCloseable`.
System.out.println( "INFO - executor services shut down at this point. " + Instant.now() );
// Results of experiment
System.out.println( "fauxDatabase.size(): " + fauxDatabase.size() );
System.out.println( "fauxDatabase = " + fauxDatabase );
}
}

System.out.print causing latency? [duplicate]

I've got a simple program that I got from my Java programming book, just added a bit to it.
package personal;
public class SpeedTest {
public static void main(String[] args) {
double DELAY = 5000;
long startTime = System.currentTimeMillis();
long endTime = (long)(startTime + DELAY);
long index = 0;
while (true) {
double x = Math.sqrt(index);
long now = System.currentTimeMillis();
if (now >= endTime) {
break;
}
index++;
}
System.out.println(index + " loops in " + (DELAY / 1000) + " seconds.");
}
}
This returns 128478180 loops in 5.0 seconds.
If I add System.out.println(x); before the if statement, then my number of loops in 5 seconds goes down to the 400,000s. Is that due to latency in System.out.println()? Or is it just that x was not being calculated when I wasn't printing it out?

Anytime you "do output" within a very-busy loop, in any programming language whatsoever, you introduce two possibly-very-significant delays:
The data must be converted to printable characters, then be written to whatever display/device it might be going to ... and ...
"The act of outputting anything" obliges the process to synchronize itself with any-and-every-other process that might also be generating output.
One alternative strategy that is often used for this purpose is a "trace table." This is an in-memory array, of some fixed size, which contains strings. Entries are added to this table in a "round-robin" fashion: the oldest entry is continuously being replaced by the newest one. This strategy provides a history without requiring output. (The only requirement that remains is that anyone who is adding an entry to the table, or reading from it, must synchronize their activities, e.g. using a mutex.)
Processes which wish to display the contents of the trace-table should grab the mutex, make an in-memory copy of the content-of-interest, then release the mutex before preparing their output. In this way, the various processes which are contributing entries to the trace-table will not be delayed by I/O-associated sources of delay.
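A trace table along those lines might look like the sketch below. The class name TraceTable and its methods are made up for illustration, and synchronized methods play the role of the mutex.

```java
import java.util.ArrayList;
import java.util.List;

class TraceTable {

    private final String[] entries;
    private long written = 0; // total number of entries ever added

    TraceTable(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.entries = new String[capacity];
    }

    /** Stores a message, overwriting the oldest entry once the table is full. */
    synchronized void add(String message) {
        entries[(int) (written % entries.length)] = message;
        written++;
    }

    /** Copies the retained entries, oldest first, while holding the lock. */
    synchronized List<String> snapshot() {
        int size = (int) Math.min(written, entries.length);
        List<String> copy = new ArrayList<>(size);
        for (long i = written - size; i < written; i++) {
            copy.add(entries[(int) (i % entries.length)]);
        }
        return copy;
    }

    public static void main(String[] args) {
        TraceTable trace = new TraceTable(3);
        for (int i = 1; i <= 5; i++) {
            trace.add("x=" + Math.sqrt(i));
        }
        // Formatting and printing happen after the lock has been released.
        System.out.println(trace.snapshot());
    }
}
```

The busy loop only pays for an array write under a short lock. The slow part, printing, works on the copy returned by snapshot().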

Odd c++/java multi threading performance results compared to single thread

I struggled for two days to understand what is going on with C++ thread-pool performance compared to a single thread. Then I decided to do the same in Java, and noticed that the behaviour is the same in C++ and Java. My code is simple and straightforward.
package com.examples.threading;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
public class ThreadPool {
final static AtomicLong lookups = new AtomicLong(0);
final static AtomicLong totalTime = new AtomicLong(0);
public static class Task implements Runnable
{
int start = 0;
Task(int s) {
start = s;
}
@Override
public void run()
{
for (int j = start ; j < start + 3000; j++ ) {
long st = System.nanoTime();
boolean a = false;
long et = System.nanoTime();
totalTime.getAndAdd((et - st));
lookups.getAndAdd(1L);
}
}
}
public static void main(String[] args)
{
// change threads from 1 -> 100 then you will get different numbers
ExecutorService executor = Executors.newFixedThreadPool(1);
for (int i = 0; i <= 1000000; i++)
{
if (i % 3000 == 0) {
Task task = new Task(i);
executor.execute(task);
System.out.println("in time " + (totalTime.doubleValue()/lookups.doubleValue()) + " lookups: " + lookups.toString());
}
}
executor.shutdown();
while (!executor.isTerminated()) {
;
}
System.out.println("in time " + (totalTime.doubleValue()/lookups.doubleValue()) + " lookups: " + lookups.toString());
}
}
Now, when you run the same code with a different pool size, say 100 threads, the overall elapsed time changes.
one thread:
in time 36.91493612774451 lookups: 1002000
100 threads:
in time 141.47934530938124 lookups: 1002000
The question is: the code is the same, so why is the overall elapsed time different? What exactly is going on here?

You have a couple of obvious possibilities here.
One is that System.nanoTime may serialize internally, so even though each thread is making its call separately, it may internally execute those calls in sequence (and, for example, queue up calls as they come in). This is particularly likely when nanoTime directly accesses a hardware clock, such as on Windows (where it uses Windows' QueryPerformanceCounter).
Another point at which you get essentially sequential execution is your atomic variables. Even though you're using lock-free atomics, the basic fact is that each has to execute a read/modify/write as an atomic sequence. With locked variables, that's done by locking, then reading, modifying, writing, and unlocking. With lock-free, you eliminate some of the overhead in doing that, but you're still stuck with the fact that only one thread can successfully read, modify, and write a particular memory location at a given time.
In this case the only "work" each thread is doing is trivial, and the result is never used, so the optimizer can (and probably will) eliminate it entirely. So all you're really measuring is the time to read the clock and increment your variables.
To gain at least some of the speed back, you could (for one example) give each thread its own lookups and totalTime variable. Then when all the threads finish, you can add together the values for the individual threads to get an overall total for each.
Preventing serialization of the timing is a little more difficult (to put it mildly). At least in the obvious design, each call to nanoTime directly accesses a hardware register, which (at least with most typical hardware) can only happen sequentially. It could be fixed at the hardware level (provide a high-frequency timer register that's directly readable per-core, guaranteed to be synced between cores). That's a somewhat non-trivial task, and (more importantly) most current hardware just doesn't include such a thing.
Other than that, do some meaningful work in each thread, so when you execute in multiple threads, you have something that can actually use the resources of your multiple CPUs/cores to run faster.
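One way to sketch the per-thread totals idea in Java is to have each task count in a local variable and return it, then sum the results once all tasks are done. The class name PerTaskCounters and its method are made up for illustration. The sketch keeps a total per task rather than per thread, which has the same effect of avoiding a shared atomic inside the hot loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerTaskCounters {

    /** Runs the tasks on a fixed pool; each task counts locally and returns its total. */
    static long countWithLocalTotals(int threads, int tasks, int iterationsPerTask) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> work = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                work.add(() -> {
                    long local = 0; // private to this task, so no contention
                    for (int j = 0; j < iterationsPerTask; j++) {
                        local++;
                    }
                    return local;
                });
            }
            long total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Long> result : executor.invokeAll(work)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // 334 tasks of 3000 iterations matches the 1002000 lookups in the question.
        System.out.println("lookups: " + countWithLocalTotals(100, 334, 3000));
    }
}
```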

Why does this code not see any significant performance gain when I use multiple threads on a quadcore machine?

I wrote some Java code to learn more about the Executor framework.
Specifically, I wrote code to verify the Collatz Hypothesis - this says that if you iteratively apply the following function to any integer, you get to 1 eventually:
f(n) = ((n % 2) == 0) ? n/2 : 3*n + 1
CH is still unproven, and I figured it would be a good way to learn about Executor. Each thread is assigned a range [l,u] of integers to check.
Specifically, my program takes 3 arguments - N (the number to which I want to check CH), RANGESIZE (the length of the interval that a thread has to process), and NTHREAD, the size of the threadpool.
My code works fine, but I saw much less speedup than I expected - of the order of 30% when I went from 1 to 4 threads.
My logic was that the computation is completely CPU bound, and each subtask (checking CH for a fixed size range) takes roughly the same time.
Does anyone have ideas as to why I'm not seeing a 3 to 4x increase in speed?
If you could report your runtimes as you increase the number of threads (along with the machine, JVM and OS) that would also be great.
Specifics
Runtimes:
java -d64 -server -cp . Collatz 10000000 1000000 4 => 4 threads, takes 28412 milliseconds
java -d64 -server -cp . Collatz 10000000 1000000 1 => 1 thread, takes 38286 milliseconds
Processor:
Quadcore Intel Q6600 at 2.4GHZ, 4GB. The machine is unloaded.
Java:
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
OS:
Linux quad0 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux
Code: (I can't get the code to post; I think it's too long for SO requirements. The source is available on Google Docs.)
import java.math.BigInteger;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
class MyRunnable implements Runnable {
public int lower;
public int upper;
MyRunnable(int lower, int upper) {
this.lower = lower;
this.upper = upper;
}
@Override
public void run() {
for (int i = lower ; i <= upper; i++ ) {
Collatz.check(i);
}
System.out.println("(" + lower + "," + upper + ")" );
}
}
public class Collatz {
public static boolean check( BigInteger X ) {
if (X.equals( BigInteger.ONE ) ) {
return true;
} else if ( X.getLowestSetBit() == 0 ) {
// odd
BigInteger Y = (new BigInteger("3")).multiply(X).add(BigInteger.ONE);
return check(Y);
} else {
BigInteger Z = X.shiftRight(1); // fast divide by 2
return check(Z);
}
}
public static boolean check( int x ) {
BigInteger X = new BigInteger( new Integer(x).toString() );
return check(X);
}
static int N = 10000000;
static int RANGESIZE = 1000000;
static int NTHREADS = 4;
static void parseArgs( String [] args ) {
if ( args.length >= 1 ) {
N = Integer.parseInt(args[0]);
}
if ( args.length >= 2 ) {
RANGESIZE = Integer.parseInt(args[1]);
}
if ( args.length >= 3 ) {
NTHREADS = Integer.parseInt(args[2]);
}
}
public static void maintest(String [] args ) {
System.out.println("check(1): " + check(1));
System.out.println("check(3): " + check(3));
System.out.println("check(8): " + check(8));
parseArgs(args);
}
public static void main(String [] args) {
long lDateTime = new Date().getTime();
parseArgs( args );
List<Thread> threads = new ArrayList<Thread>();
ExecutorService executor = Executors.newFixedThreadPool( NTHREADS );
for( int i = 0 ; i < (N/RANGESIZE); i++) {
Runnable worker = new MyRunnable( i*RANGESIZE+1, (i+1)*RANGESIZE );
executor.execute( worker );
}
executor.shutdown();
while (!executor.isTerminated() ) {
}
System.out.println("Finished all threads");
long fDateTime = new Date().getTime();
System.out.println("time in milliseconds for checking to " + N + " is " +
(fDateTime - lDateTime ) +
" (" + N/(fDateTime - lDateTime ) + " per ms)" );
}
}

Busy waiting can be a problem:
while (!executor.isTerminated() ) {
}
You can use awaitTermination() instead:
while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {}
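For context, here is a rough sketch of how that call might sit in a full shutdown sequence. The class name AwaitDemo and the trivial counting task are placeholders, not part of the original program; awaitTermination() throws InterruptedException, so it needs handling somewhere.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitDemo {
    // Runs nTasks trivial tasks on nThreads threads and returns how many completed.
    static int runTasks(int nThreads, int nTasks) {
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        final AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < nTasks; i++) {
            // Placeholder task; the real program would submit a MyRunnable here.
            executor.execute(() -> completed.incrementAndGet());
        }
        executor.shutdown(); // stop accepting new tasks, let queued ones finish
        try {
            // Blocks for up to one second per call instead of spinning on isTerminated().
            while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {
                // still running; loop and wait again
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();             // give up on remaining tasks
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runTasks(4, 10));
    }
}
```

The main thread sleeps inside awaitTermination() rather than burning a core, which leaves all cores free for the workers.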

You are using BigInteger. It consumes a lot of register space. What you most likely have on the compiler level is register spilling that makes your process memory-bound.
Also note that when you are timing your results you are not taking into account extra time taken by the JVM to allocate threads and work with the thread pool.
You could also have memory conflicts when you are using constant Strings. String literals are interned in a shared string pool and so it may become a bottleneck, unless Java is really clever about it.
Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.
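One way to test the BigInteger theory without leaving Java would be to swap in a primitive long. This is only a sketch: the class name CollatzLong is made up, and it assumes every intermediate value fits in a long. Math.multiplyExact and Math.addExact (Java 8+) throw an ArithmeticException if that assumption fails, rather than wrapping around silently.

```java
class CollatzLong {
    // Returns the number of Collatz steps needed to reach 1 from n.
    // Iterative, so no recursion depth to worry about, and no object allocation.
    static int steps(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive: " + n);
        }
        int steps = 0;
        while (n != 1) {
            if ((n & 1) == 0) {
                n >>= 1; // even: divide by 2
            } else {
                // odd: 3n + 1, with overflow detection
                n = Math.addExact(Math.multiplyExact(3L, n), 1L);
            }
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println("steps(27) = " + steps(27)); // prints steps(27) = 111
    }
}
```

If the long version scales better across threads than the BigInteger one on the same machine, that would point at allocation and memory traffic rather than at the thread pool.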

As @axtavt answered, busy waiting can be a problem. You should fix that first, as it is part of the answer, but not all of it. It won't appear to help in your case (on the Q6600), because that machine seems to be bottlenecked at 2 cores for some reason, so another core is available for the busy loop and there is no apparent slowdown, but on my Core i5 it speeds up the 4-thread version noticeably.
I suspect that in the case of the Q6600 your particular app is limited by the amount of shared cache available or something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, which means CPUs are sharing them, and no L3 cache. On my Core i5, each CPU has a dedicated L2 cache (256K), and then there is a larger 8MB shared L3 cache. 256K more per-CPU cache might make a difference... otherwise something else architecture-wise does.
Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.
On my work PC, which is also a Q6600 @ 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21* (64-bit), here are some basic results:
10000000 500000 1 (avg of three runs): 36982 ms
10000000 500000 4 (avg of three runs): 21252 ms
Faster, certainly - but not completing in a quarter of the time like you would expect, or even half... (though it is roughly just a bit more than half, more on that in a moment). Note in my case I halved the size of the work units, and have a default max heap of 1500m.
At home on my Core i5 750 (4 cores no hyperthreading), 4GB RAM, Windows 7 64-bit, jdk 1.6.0_22 (64-bit):
10000000 500000 1 (avg of 3 runs) 32677 ms
10000000 500000 4 (avg of 3 runs) 8825 ms
10000000 500000 4 (avg of 3 runs) 11475 ms (without the busy wait fix, for reference)
The 4-thread version takes 27% of the time the 1-thread version takes when the busy-wait loop is removed. Much better. Clearly the code can make efficient use of 4 cores...
NOTE: Java 1.6.0_18 and later have modified default heap settings - so my default heap size is almost 1500m on my work PC, and around 1000m on my home PC.
You may want to increase your default heap, just in case garbage collection is happening and slowing your 4 threaded version down a bit. It might help, it might not.
At least in your example, there's a chance your larger work unit size is skewing your results slightly...halving it may help you get closer to at least 2x the speed since 4 threads will be kept busy for a longer portion of the time. I don't think the Q6600 will do much better at this particular task...whether it is cache or some other inherent architecture thing.
In all cases, I am simply running "java Collatz 10000000 500000 X", where x = # of threads indicated.
The only changes I made to your java file were to make one of the println's into a print, so there were fewer line breaks for my runs with 500000 per work unit and I could see more results in my console at once, and I ditched the busy wait loop, which matters on the i5 750 but didn't make a difference on the Q6600.

You should try using the submit function and then watching the Futures that are returned, checking them to see whether each task has finished.
isTerminated() doesn't return true until after a shutdown.
Future<?> submit(Runnable task)
Submits a Runnable task for execution and returns a Future representing that task.
isTerminated()
Returns true if all tasks have completed following shut down.
Try this...
public static void main(String[] args) {
long lDateTime = new Date().getTime();
parseArgs(args);
List<Thread> threads = new ArrayList<Thread>();
List<Future> futures = new ArrayList<Future>();
ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
for (int i = 0; i < (N / RANGESIZE); i++) {
Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
futures.add(executor.submit(worker));
}
boolean done = false;
while (!done) {
done = true; // assume finished until a pending task is found (also covers an empty list)
for (Future future : futures) {
if (!future.isDone()) {
done = false;
break;
}
}
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
System.out.println("Finished all threads");
long fDateTime = new Date().getTime();
System.out.println("time in milliseconds for checking to " + N + " is " +
(fDateTime - lDateTime) +
" (" + N / (fDateTime - lDateTime) + " per ms)");
System.exit(0);
}
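As a variation on the polling loop above, you could also block on each Future with get() so there is no sleep interval at all. This is a sketch under assumptions: the class name FutureWait is invented, and the task body just counts the integers in its range as a stand-in for Collatz.check(v).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureWait {
    // Splits 1..n into ranges of rangeSize, processes them on nThreads threads,
    // and returns how many integers were processed in total.
    static long countChecked(int n, int rangeSize, int nThreads) {
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < n / rangeSize; i++) {
                final int lower = i * rangeSize + 1;
                final int upper = (i + 1) * rangeSize;
                Callable<Long> task = () -> {
                    long count = 0;
                    for (int v = lower; v <= upper; v++) {
                        count++; // stand-in for Collatz.check(v)
                    }
                    return count;
                };
                futures.add(executor.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // blocks until this task has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdown(); // lets the pool threads exit, so no System.exit(0) is needed
        }
    }

    public static void main(String[] args) {
        System.out.println("checked: " + countChecked(10_000_000, 500_000, 4));
    }
}
```

Using Callable instead of Runnable also lets each task hand back a result, and any exception thrown inside a task surfaces at get() instead of being lost.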

