I want to know how many games my computer can play in 1000 ms. I ran the test earlier without threads (it played about 13k games). Now that I think I'm using threads, I still get the same number. Since I don't have much experience with Java threads, I assume I'm doing something wrong, but I can't work out what.
Thanks in advance
public class SpeedTest<T extends BoardGame> implements Runnable
{
public static int gamesPlayed = 0;
private ElapsedTimer timer;
private double maxTime;
private BoardAgent<T> agent;
private BoardGame<T> game;
public SpeedTest(BoardGame<T> game, ElapsedTimer timer, double maxTime, Random rng)
{
this.game = game;
this.timer = timer;
this.maxTime = maxTime;
this.agent = new RandomAgent<T>(rng);
}
@Override
public void run()
{
while (true)
{
BoardGame<T> newBoard = game.copy();
while (!newBoard.isGameOver())
newBoard.makeMove(agent.move(newBoard));
gamesPlayed++;
if (timer.elapsedMilliseconds() > maxTime) {
break;
}
}
}
public static void main(String[] args)
{
Random rng = new Random();
BoardGame<Connect4> game = new Connect4(6, 7);
double maxTime = 1000;
ElapsedTimer timer = new ElapsedTimer();
SpeedTest<Connect4> speedTest1 = new SpeedTest<Connect4>(game, timer, maxTime, rng);
SpeedTest<Connect4> speedTest2 = new SpeedTest<Connect4>(game, timer, maxTime, rng);
Thread t1 = new Thread(speedTest1);
Thread t2 = new Thread(speedTest2);
t1.start();
t2.start();
try {
Thread.sleep((long) maxTime);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("Games: " + SpeedTest.gamesPlayed);
}
}
I suspect that the reason that you are not seeing any speedup is that your application is only using 1 physical processor. If it is only using one processor, then the two threads won't be running in parallel. Instead, the processor will be "time-slicing" between the two threads.
What can you do about this?
Run on a dual-core etc processor. Or if you have a single processor machine with HT support, enable HT.
Run the test over a longer time; e.g. a number of minutes.
The reason I suggest the latter is that this could be a JVM warmup effect. When a JVM starts a new application, it needs to do a lot of class loading and JIT compilation behind the scenes. These tasks will be largely (if not totally) single-threaded. Running the tests over a longer period of time reduces the contribution of the "warm up" overheads to the average time per "game".
There is a fix that you ought to make to make the program thread-safe. Change
public static int gamesPlayed = 0;
to
private static final AtomicInteger gamesPlayed = new AtomicInteger();
and then use getAndIncrement() to increment the counter and intValue() to fetch its value. (This is simpler than having each thread maintain its own counter and summing them at the end.)
However, I strongly suspect that this change (or @Erik's alternative) will make little difference to the results you are seeing. I'm now sure it is either:
JVM warmup issue as described above,
a consequence of high object creation rates and/or heap starvation, or
some hidden synchronization issue between the instances of your game.
Don't use a static int, use a normal member int.
Instead of the sleep, call .join on both threads.
Then finally add the member ints.
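A minimal sketch of those three steps, for illustration only. GameCounter and its fixed-length loop are hypothetical stand-ins for SpeedTest and its timed game loop; the point is just that each worker owns an instance counter, the main thread waits with join(), and the counts are added at the end.

```java
// Sketch only: GameCounter is a hypothetical stand-in for SpeedTest.
class PerThreadCount {
    /** Each worker owns its counter, so no other thread ever writes to it. */
    static final class GameCounter implements Runnable {
        private final int gamesToPlay;
        private int gamesPlayed = 0; // instance field, not static

        GameCounter(int gamesToPlay) {
            this.gamesToPlay = gamesToPlay;
        }

        @Override
        public void run() {
            for (int i = 0; i < gamesToPlay; i++) {
                gamesPlayed++; // stands in for playing one full game
            }
        }

        int gamesPlayed() {
            return gamesPlayed;
        }
    }

    /** Starts two workers, waits for both with join(), then adds their counts. */
    static int runTwoWorkers(int gamesPerWorker) {
        GameCounter c1 = new GameCounter(gamesPerWorker);
        GameCounter c2 = new GameCounter(gamesPerWorker);
        Thread t1 = new Thread(c1);
        Thread t2 = new Thread(c2);
        t1.start();
        t2.start();
        try {
            t1.join(); // after join() returns, the worker's writes are visible here
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return c1.gamesPlayed() + c2.gamesPlayed();
    }

    public static void main(String[] args) {
        System.out.println("Games: " + runTwoWorkers(10_000));
    }
}
```

Because no counter is shared, no synchronization or atomic variable is needed.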
Related
In the Java multi-threading tutorial, there is an example of Memory Consistency Errors, but I cannot reproduce it. Is there any other way to simulate Memory Consistency Errors?
The example provided in the tutorial:
Suppose a simple int field is defined and initialized:
int counter = 0;
The counter field is shared between two threads, A and B. Suppose thread A increments counter:
counter++;
Then, shortly afterwards, thread B prints out counter:
System.out.println(counter);
If the two statements had been executed in the same thread, it would be safe to assume that the value printed out would be "1". But if the two statements are executed in separate threads, the value printed out might well be "0", because there's no guarantee that thread A's change to counter will be visible to thread B — unless the programmer has established a happens-before relationship between these two statements.
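For contrast, here is a small sketch of one way to establish that happens-before relationship. It relies on the rule that everything a thread does happens-before another thread's successful return from join() on it. The class and method names are made up for this illustration.

```java
// Sketch only: shows that Thread.join() gives the reader a guaranteed view of the write.
class HappensBeforeDemo {
    static int counter = 0; // plain field, deliberately not volatile

    /** Thread A increments counter; the caller reads it only after joining A. */
    static int incrementInOtherThread() {
        counter = 0;
        Thread a = new Thread(() -> counter++); // thread A's write
        a.start();
        try {
            a.join(); // A's actions happen-before join() returning
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counter; // guaranteed to observe A's increment
    }

    public static void main(String[] args) {
        System.out.println(incrementInOtherThread());
    }
}
```

Without the join() (or another synchronization action such as a volatile write and read), the reader would have no such guarantee, which is the situation the tutorial describes.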
I answered a question a while ago about a bug in Java 5. Why doesn't volatile in java 5+ ensure visibility from another thread?
Given this piece of code:
public class Test {
volatile static private int a;
static private int b;
public static void main(String [] args) throws Exception {
for (int i = 0; i < 100; i++) {
new Thread() {
@Override
public void run() {
int tt = b; // makes the jvm cache the value of b
while (a==0) {
}
if (b == 0) {
System.out.println("error");
}
}
}.start();
}
b = 1;
a = 1;
}
}
The volatile store of a happens after the normal store of b. So when the thread runs and sees a != 0, because of the rules defined in the JMM, we must see b == 1.
The bug in the JRE allowed the thread to make it to the error line and was subsequently resolved. This definitely would fail if you don't have a defined as volatile.
This might reproduce the problem; at least on my computer, I can reproduce it after some loops.
Suppose you have a Holder class:
class Holder {
boolean flag = false;
long modifyTime = Long.MAX_VALUE;
}
Let thread_A set flag to true and save the time into modifyTime.
Let another thread, say thread_B, read the Holder's flag. If thread_B still gets false at a time later than modifyTime, then we can say we have reproduced the problem.
Example code
class Holder {
boolean flag = false;
long modifyTime = Long.MAX_VALUE;
}
public class App {
public static void main(String[] args) {
while (!test());
}
private static boolean test() {
final Holder holder = new Holder();
new Thread(new Runnable() {
@Override
public void run() {
try {
Thread.sleep(10);
holder.flag = true;
holder.modifyTime = System.currentTimeMillis();
} catch (Exception e) {
e.printStackTrace();
}
}
}).start();
long lastCheckStartTime = 0L;
long lastCheckFailTime = 0L;
while (true) {
lastCheckStartTime = System.currentTimeMillis();
if (holder.flag) {
break;
} else {
lastCheckFailTime = System.currentTimeMillis();
System.out.println(lastCheckFailTime);
}
}
if (lastCheckFailTime > holder.modifyTime
&& lastCheckStartTime > holder.modifyTime) {
System.out.println("last check fail time " + lastCheckFailTime);
System.out.println("modify time " + holder.modifyTime);
return true;
} else {
return false;
}
}
}
Result
last check time 1565285999497
modify time 1565285999494
This means thread_B got false from the Holder's flag field at time 1565285999497, even though thread_A had set it to true at time 1565285999494 (3 milliseconds earlier).
The example used is a poor one for demonstrating a memory consistency issue. Making it work requires brittle reasoning and complicated code, and even then you may not see the result. Multi-threading issues occur due to unlucky timing, so to increase the chance of observing one, we need to increase the chance of unlucky timing.
The following program achieves that.
public class ConsistencyIssue {
static int counter = 0;
public static void main(String[] args) throws InterruptedException {
Thread thread1 = new Thread(new Increment(), "Thread-1");
Thread thread2 = new Thread(new Increment(), "Thread-2");
thread1.start();
thread2.start();
thread1.join();
thread2.join();
System.out.println(counter);
}
private static class Increment implements Runnable{
@Override
public void run() {
for(int i = 1; i <= 10000; i++)
counter++;
}
}
}
Execution 1 output: 10963,
Execution 2 output: 14552
The final count should have been 20000, but it is less than that. The reason is that counter++ is a multi-step operation:
1. read counter
2. increment it
3. store the result
Two threads may both read the same value, say 1, at the same time, each increment it to 2, and each write back 2, so one increment is lost. In a serial execution the value would have gone 1 -> 2 -> 3.
We need a way to make all 3 steps atomic, i.e. executed by only one thread at a time.
Solution 1: Synchronized
Surround the increment with a synchronized block. Since counter is a static variable, you need to use class-level synchronization:
@Override
public void run() {
for (int i = 1; i <= 10000; i++)
synchronized (ConsistencyIssue.class) {
counter++;
}
}
Now it outputs: 20000
Solution 2: AtomicInteger
import java.util.concurrent.atomic.AtomicInteger;
public class ConsistencyIssue {
static AtomicInteger counter = new AtomicInteger(0);
public static void main(String[] args) throws InterruptedException {
Thread thread1 = new Thread(new Increment(), "Thread-1");
Thread thread2 = new Thread(new Increment(), "Thread-2");
thread1.start();
thread2.start();
thread1.join();
thread2.join();
System.out.println(counter.get());
}
private static class Increment implements Runnable {
@Override
public void run() {
for (int i = 1; i <= 10000; i++)
counter.incrementAndGet();
}
}
}
We could also use semaphores or explicit locking, but for this simple code AtomicInteger is enough.
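For completeness, here is a rough sketch of the semaphore variant. A java.util.concurrent.Semaphore with a single permit behaves like a mutex around the increment. The class name SemaphoreCounter is made up for this illustration.

```java
import java.util.concurrent.Semaphore;

// Sketch only: a one-permit semaphore used as a mutex around counter++.
class SemaphoreCounter {
    private static final Semaphore PERMIT = new Semaphore(1);
    private static int counter = 0;

    /** Runs two threads that each increment the shared counter the given number of times. */
    static int countWithTwoThreads(int incrementsPerThread) {
        counter = 0;
        Runnable increment = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                PERMIT.acquireUninterruptibly(); // only one thread gets past this point
                try {
                    counter++;
                } finally {
                    PERMIT.release(); // always give the permit back
                }
            }
        };
        Thread t1 = new Thread(increment, "Thread-1");
        Thread t2 = new Thread(increment, "Thread-2");
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(countWithTwoThreads(10_000));
    }
}
```

It is more ceremony than AtomicInteger for a single counter, which is why the atomic version is usually preferable here.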
Sometimes when I try to reproduce some real concurrency problems, I use the debugger.
Make a breakpoint on the print and a breakpoint on the increment and run the whole thing.
Releasing the breakpoints in different sequences gives different results.
Maybe too simple, but it worked for me.
Please have another look at how the example is introduced in your source.
The key to avoiding memory consistency errors is understanding the happens-before relationship. This relationship is simply a guarantee that memory writes by one specific statement are visible to another specific statement. To see this, consider the following example.
This example illustrates the fact that multi-threading is not deterministic, in the sense that you get no guarantee about the order in which operations of different threads will be executed, which might result in different observations across several runs. But it does not illustrate a memory consistency error!
To understand what a memory consistency error is, you first need some insight into memory consistency. The simplest model of memory consistency was introduced by Lamport in 1979. Here is the original definition.
The result of any execution is the same as if the operations of all the processes were executed in some sequential order and the operations of each individual process appear in this sequence in the order specified by its program
Now, consider this example multi-threaded program, please have a look at this image from a more recent research paper about sequential consistency. It illustrates what a real memory consistency error might look like.
To finally answer your question, please note the following points:
A memory consistency error always depends on the underlying memory model (a particular programming language may allow more behaviours for optimization purposes). Which memory model is best is still an open research question.
The example given above shows a sequential consistency violation, but there is no guarantee that you can observe it with your favorite programming language, for two reasons: it depends on the exact memory model of the programming language, and, due to nondeterminism, you have no way to force a particular incorrect execution.
Memory models are a wide topic. To get more information, you can for example have a look at Torsten Hoefler and Markus Püschel course at ETH Zürich, from which I understood most of these concepts.
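To make the idea concrete in Java, here is a sketch of a classic "store buffering" litmus test. It is not the program from the referenced image; it is a standard textbook example chosen for illustration. Thread A writes x and then reads y, thread B writes y and then reads x. Under sequential consistency at least one thread must see the other's write, so both reads returning 0 is impossible. With the fields declared volatile, as below, the Java memory model rules that outcome out. With plain fields the outcome would be permitted, although, as noted above, there is no way to force it to show up.

```java
// Sketch only: store-buffering litmus test with volatile fields.
class StoreBufferingLitmus {
    static volatile int x, y; // volatile accesses are totally ordered
    static int r1, r2;        // written by the workers, read after join()

    /** Runs the litmus test repeatedly and counts runs where both threads read 0. */
    static int countBothZero(int runs) {
        int bothZero = 0;
        for (int i = 0; i < runs; i++) {
            x = 0;
            y = 0;
            Thread a = new Thread(() -> { x = 1; r1 = y; });
            Thread b = new Thread(() -> { y = 1; r2 = x; });
            a.start();
            b.start();
            try {
                a.join();
                b.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            if (r1 == 0 && r2 == 0) {
                bothZero++; // would be a sequential consistency violation
            }
        }
        return bothZero;
    }

    public static void main(String[] args) {
        System.out.println("both-zero outcomes: " + countBothZero(1_000));
    }
}
```

Removing volatile turns this into an experiment: a non-zero count may or may not appear, depending on the JVM, the hardware, and timing.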
Sources
Leslie Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocessor Programs, 1979
Wei-Yu Chen, Arvind Krishnamurthy, Katherine Yelick, Polynomial-Time Algorithms for Enforcing Sequential Consistency in SPMD Programs with Arrays, 2003
Design of Parallel and High-Performance Computing course, ETH Zürich
I need to do some computations/processing on a large set of ids (about 100k to 1 million). Since the number of ids is quite large and each one takes some time to process, I was thinking about using threads in my Java code.
Assuming we can't have 100K threads running at once, how do I implement threading in this case?
Note - The only solution I can think of is to have about 100 or more threads running, where each thread would process about 1000 or more IDs.
Use Java's built in thread pooling and executors.
ExecutorService foo = Executors.newFixedThreadPool(100);
foo.submit(new MyRunnable());
There are various thread pools you can create to tailor how many you want, if it's dynamic, etc.
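As a rough sketch of how that could look for the ids in the question: split the id range into chunks, submit one task per chunk, and collect the results. The processId method and the chunk size are assumptions made for this example; here processId just returns the id so the total can be checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: one pooled task per chunk of ids.
class ChunkedIdProcessing {
    /** Hypothetical per-id work; it simply returns the id. */
    static long processId(int id) {
        return id;
    }

    /** Processes ids 0..totalIds-1 in chunks (chunkSize must be positive) and sums the results. */
    static long processAll(int totalIds, int chunkSize) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int start = 0; start < totalIds; start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, totalIds);
                tasks.add(() -> {
                    long sum = 0;
                    for (int id = from; id < to; id++) {
                        sum += processId(id);
                    }
                    return sum;
                });
            }
            long total = 0;
            for (Future<Long> result : pool.invokeAll(tasks)) { // waits for all chunks
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown(); // let the pool threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(1_000_000, 1_000));
    }
}
```

The pool never has more threads than cores, no matter how many chunks are queued.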
Using ThreadPool:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class ThreadIDS implements Runnable
{
public static final int totalIDS = 1000000;
int start;
int range;
public ThreadIDS(int start, int range)
{
this.start=start;
this.range=range;
}
public static void main(String[] args) throws InterruptedException
{
int availableProcessors = Runtime.getRuntime().availableProcessors();
int eachThread = totalIDS/availableProcessors + 1;
ExecutorService threads = Executors.newFixedThreadPool(availableProcessors);
for(int i = 0 ; i < availableProcessors ; i++)
{
threads.submit(new ThreadIDS(i*eachThread, eachThread));
}
threads.shutdown(); // without this, awaitTermination can never return true
while(!threads.awaitTermination(1000, TimeUnit.MILLISECONDS))System.out.println("Waiting for threads to finish");
}
public void processID(int id)
{
}
public void run()
{
for(int i = start ; i < Math.min(start+range, totalIDS) ; i++)
{
processID(i);
}
}
}
Edited the run method. Since we add 1 when dividing to avoid integer division making us miss ids, we could potentially run over the totalIDS limit. The Math.min avoids that.
If you don't want to use ThreadPools, then change the main to:
public static void main(String[] args)
{
int availableProcessors = Runtime.getRuntime().availableProcessors();
int eachThread = totalIDS/availableProcessors + 1;
for(int i = 0 ; i < availableProcessors ; i++)
{
new Thread(new ThreadIDS(i * eachThread, eachThread)).start();
}
}
Run as many threads as you have CPU cores (Runtime.getRuntime().availableProcessors()). Let each thread run a loop like this:
public void run() {
Id id;
// poll() returns null once the queue is empty; calling isEmpty() first would race with the other threads
while ((id = ids.poll()) != null) { // exact access method depends on how your set of ids is organized
processId(id);
}
}
Compared to using a thread pool, this is simpler and requires less memory (no need to create a Runnable for each id).
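A self-contained sketch of this approach, assuming the ids live in a ConcurrentLinkedQueue (the answer leaves the actual container open). The counter stands in for the real per-id processing.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: one thread per core, all draining the same thread-safe queue.
class QueueDrainingWorkers {
    /** Fills a queue with ids 0..totalIds-1 and returns how many ids the workers processed. */
    static long drain(int totalIds) {
        Queue<Integer> ids = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < totalIds; i++) {
            ids.add(i);
        }
        AtomicLong processed = new AtomicLong();
        Runnable worker = () -> {
            Integer id;
            while ((id = ids.poll()) != null) { // null means the queue is empty
                processed.incrementAndGet();    // stand-in for processId(id)
            }
        };
        Thread[] threads = new Thread[Runtime.getRuntime().availableProcessors()];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(drain(1_000_000));
    }
}
```

Each id is handed to exactly one worker, and a slow id only delays the thread that picked it up.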
Splitting your work into 4 Runnables (1 per core) is probably not the best idea if there is any variation in processing time for a given ID. A better solution would be to split your work up into small chunks so that one core doesn't get stuck with all the "hard" work while the other 3 cores plow through theirs and then do nothing.
You could split your tasks into small chunks in advance and submit them to a ThreadPoolExecutor, but it might be better to use the Fork/Join framework. It's designed to handle this type of thing very efficiently.
Something like this would make sure all 4 cores stayed busy until all the work was done:
public class Test
{
public void workTest()
{
ForkJoinPool pool = new ForkJoinPool(); //Defaults to # of cores
List<ObjectThatWeProcess> work = getWork(); //Get IDs or whatever
FJAction action = new FJAction(work);
pool.invoke(action);
}
public static class FJAction extends RecursiveAction
{
private static final int workSize = 1000; //Only do work if 1000 objects or less
List<ObjectThatWeProcess> work;
FJAction(List<ObjectThatWeProcess> work)
{
this.work = work;
}
public void compute()
{
if(work.size() > workSize)
{
invokeAll(new FJAction(work.subList(0,work.size()/2)),
new FJAction(work.subList(work.size()/2,work.size())));
}
else
processWork();
}
private void processWork()
{
//do something
}
}
}
You could also extend RecursiveTask<T> if the "work" returned a value that was relevant to you.
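A small sketch of that RecursiveTask variant, assuming the "work" is an int array of ids and the result of interest is their sum. The class name SumTask and the threshold are made up for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sketch only: a RecursiveTask that splits the range until it is small enough.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // process directly at or below this size
    private final int[] ids;
    private final int from;
    private final int to;

    SumTask(int[] ids, int from, int to) {
        this.ids = ids;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += ids[i]; // stand-in for real per-id work
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(ids, from, mid);
        SumTask right = new SumTask(ids, mid, to);
        left.fork();                         // run the left half asynchronously
        long rightResult = right.compute();  // do the right half in this thread
        return left.join() + rightResult;    // combine both halves
    }

    /** Sums all ids using the common fork/join pool. */
    static long sum(int[] ids) {
        return ForkJoinPool.commonPool().invoke(new SumTask(ids, 0, ids.length));
    }

    public static void main(String[] args) {
        int[] ids = new int[1_000_000];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        System.out.println(sum(ids));
    }
}
```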
I am trying to measure the performance of Database Insert. So for that I have written a StopWatch class which will reset the counter before executeUpdate method and calculate the time after executeUpdate method is done.
And I am trying to see how much time each thread is taking, so I am keeping those numbers in a ConcurrentHashMap.
Below is my main class-
public static void main(String[] args) {
final int noOfThreads = 4;
final int noOfTasks = 100;
final AtomicInteger id = new AtomicInteger(1);
ExecutorService service = Executors.newFixedThreadPool(noOfThreads);
for (int i = 0; i < noOfTasks * noOfThreads; i++) {
service.submit(new Task(id));
}
while (!service.isTerminated()) {
}
//printing the histogram
System.out.println(Task.histogram);
}
Below is the class that implements Runnable in which I am trying to measure each thread performance in inserting to database meaning how much time each thread is taking to insert to database-
class Task implements Runnable {
private final AtomicInteger id;
private StopWatch totalExecTimer = new StopWatch(Task.class.getSimpleName() + ".totalExec");
public static ConcurrentHashMap<Long, AtomicLong> histogram = new ConcurrentHashMap<Long, AtomicLong>();
public Task(AtomicInteger id) {
this.id = id;
}
@Override
public void run() {
dbConnection = getDBConnection();
preparedStatement = dbConnection.prepareStatement(Constants.INSERT_ORACLE_SQL);
//other preparedStatement
totalExecTimer.resetLap();
preparedStatement.executeUpdate();
totalExecTimer.accumulateLap();
final AtomicLong before = histogram.putIfAbsent(totalExecTimer.getCumulativeTime() / 1000, new AtomicLong(1L));
if (before != null) {
before.incrementAndGet();
}
}
}
Below is the StopWatch class
/**
* A simple stop watch.
*/
protected static class StopWatch {
private final String name;
private long lapStart;
private long cumulativeTime;
public StopWatch(String _name) {
name = _name;
}
/**
* Resets lap start time.
*/
public void resetLap() {
lapStart = System.currentTimeMillis();
}
/**
* Accumulates the lap time and return the current lap time.
*
* @return the current lap time.
*/
public long accumulateLap() {
long lapTime = System.currentTimeMillis() - lapStart;
cumulativeTime += lapTime;
return lapTime;
}
/**
* Gets the current cumulative lap time.
*
* @return
*/
public long getCumulativeTime() {
return cumulativeTime;
}
public String getName() {
return name;
}
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
sb.append(name);
sb.append("=");
sb.append((cumulativeTime / 1000));
sb.append("s");
return sb.toString();
}
}
After running the above program, I can see 400 rows got inserted. And when it is printing the histogram, I am only seeing like this-
{0=400}
which means 400 calls came back in 0 seconds? It's not possible for sure.
I am just trying to see how much time each thread is taking to insert the record and then store those numbers in a Map and print that map from the main thread.
I think the problem is caused by thread safety here, and that is the reason zero is getting put in the Map whenever resetLap is called, I guess.
If yes, how can I avoid this problem? And is it required to pass the histogram map from the main thread to the constructor of Task? I need to print that Map after all the threads are finished to see what numbers are there.
Update:-
If I remove divide by 1000 thing to store the number as milliseconds then I am able to see some numbers apart from zero. So that looks good.
But one more thing I found is that the numbers are not consistent. If I sum up each thread's time, I get one number, and I am also printing how long the whole program takes to finish. When I compare these two numbers, they differ by a big margin.
To avoid concurrency issues with your stopwatch you're probably better off creating a new one as a local variable within the run method of your Runnable. That way each thread has its own stopwatch.
As for the timing you're seeing, I would absolutely hope that a simple record insert would happen in well under a second. Seeing 400 inserts that all happen in less than a second each doesn't surprise me at all. You may get better results by using the millisecond value from your stopwatch as your HashMap key.
Update
For the stopwatch concurrency problem I'm suggesting something like this:
class Task implements Runnable {
private final AtomicInteger id;
// Remove the stopwatch from here
//private StopWatch totalExecTimer = new StopWatch(Task.class.getSimpleName() + ".totalExec");
public static ConcurrentHashMap<Long, AtomicLong> histogram = new ConcurrentHashMap<Long, AtomicLong>();
public Task(AtomicInteger id) {
this.id = id;
}
@Override
public void run() {
// And add it here
StopWatch totalExecTimer = new StopWatch(Task.class.getSimpleName() + ".totalExec");
dbConnection = getDBConnection();
In this way each thread, indeed each Task, gets its own copy, and you don't have to worry about concurrency. Making the StopWatch thread-safe as-is is probably more trouble than it's worth.
Update 2
Having said that then the approach you mentioned in your comment would probably give better results, as there's less overhead in the timing mechanism.
To answer your question about the difference between the cumulative thread time and the total running time of the program I would glibly say, "What did you expect?".
There are two issues here. One is that you're not measuring the total running time of each thread, just the bit where you're doing the DB insert.
The other is that measuring the running time of the whole application does not take into account any overlap in the execution times of the threads. Even if you were measuring the total time of each task, and assuming you're running on a multi-core machine, I would expect the cumulative time to be more than the elapsed time of program execution. That's the benefit of parallel programming.
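The overlap effect is easy to see in a small experiment. In this sketch a Thread.sleep stands in for the DB insert (an assumption, so the example needs no database), each task times itself, and the sum of those times is compared with the wall-clock time of the whole run.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: cumulative per-task time versus elapsed wall-clock time.
class CumulativeVsElapsed {
    /** Returns {sum of per-task times, wall-clock time}, both in milliseconds. */
    static long[] measure(int tasks, long sleepMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        AtomicLong cumulativeNanos = new AtomicLong();
        long wallStart = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                long start = System.nanoTime();
                try {
                    Thread.sleep(sleepMillis); // stand-in for the DB insert
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                cumulativeNanos.addAndGet(System.nanoTime() - start);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        long wallNanos = System.nanoTime() - wallStart;
        return new long[] {
            TimeUnit.NANOSECONDS.toMillis(cumulativeNanos.get()),
            TimeUnit.NANOSECONDS.toMillis(wallNanos)
        };
    }

    public static void main(String[] args) {
        long[] result = measure(4, 100);
        System.out.println("cumulative ms: " + result[0] + ", elapsed ms: " + result[1]);
    }
}
```

With four overlapping tasks the cumulative figure comes out at roughly four times the task duration, while the elapsed time stays close to one task duration.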
As an additional note, System.currentTimeMillis() reports wall-clock time, which has limited resolution and can jump when the system clock is adjusted. System.nanoTime() is the better choice for measuring elapsed time:
long start = System.nanoTime();
// ... code being timed ...
long end = System.nanoTime();
long timeInSeconds = TimeUnit.NANOSECONDS.toSeconds(end - start);
For a number of reasons, currentTimeMillis is apt to not "refresh" its value on every call. You should use nanoTime for high-resolution measurements.
And your code is throwing away fractions of a second. Your toString method should use sb.append((cumulativeTime / 1000.0)); so that you get fractional seconds.
But the overhead of your timing mechanism is substantial, and if you ever do measure something a big chunk of the time will just be the timing overhead. It's much better to measure a number of operations rather than just one.
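One way to follow that advice, sketched here with made-up names: time a whole batch of calls with System.nanoTime() and divide by the number of repetitions, so the cost of reading the clock is spread over many operations.

```java
// Sketch only: amortize the timing overhead over many repetitions.
class BatchTiming {
    /** Runs the operation the given number of times and returns the average nanoseconds per call. */
    static double averageNanos(Runnable operation, int repetitions) {
        long start = System.nanoTime();
        for (int i = 0; i < repetitions; i++) {
            operation.run();
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / repetitions;
    }

    public static void main(String[] args) {
        // Math.random() stands in for the operation of interest.
        double avg = averageNanos(() -> Math.random(), 1_000_000);
        System.out.println("average ns per call: " + avg);
    }
}
```

For JDBC inserts the same idea applies: time a few hundred executeUpdate calls together rather than each one on its own.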
So this seems like a pretty common use case, and maybe I'm overthinking it, but I'm having an issue with keeping centralized metrics from multiple threads. Say I have multiple worker threads all processing records, and every 1000 records I want to spit out some metric. Now I could have each thread log individual metrics, but then to get throughput numbers I'd have to add them up manually (and of course time boundaries won't be exact). Here's a simple example:
public class Worker implements Runnable {
private static int count = 0;
private static long processingTime = 0;
public void run() {
while (true) {
...get record
count++;
long start = System.currentTimeMillis();
...do work
long end = System.currentTimeMillis();
processingTime += (end-start);
if (count % 1000 == 0) {
... log some metrics
processingTime = 0;
count = 0;
}
}
}
}
Hope that makes some sense. Also I know the two static variables will probably be AtomicInteger and AtomicLong . . . but maybe not. Interested in what kinds of ideas people have. I had thought about using Atomic variables and using a ReentrantReadWriteLock - but I really don't want the metrics to stop the processing flow (i.e. the metrics should have very very minimal impact on the processing). Thanks.
Offloading the actual processing to another thread can be a good idea. The idea is to encapsulate your data and hand it off to a processing thread quickly so you minimize impact on the threads that are doing meaningful work.
There is a small handoff contention, but that cost is usually so much smaller than any other type of synchronization that it should be a good candidate in many situations. I think M. Jessup's solution is pretty close to mine, but hopefully the following code illustrates the point clearly.
public class Worker implements Runnable {
private static final Metrics metrics = new Metrics();
public void run() {
while (true) {
...get record
long start = System.currentTimeMillis();
...do work
long end = System.currentTimeMillis();
// process the metric asynchronously
metrics.addMetric(end - start);
}
}
private static final class Metrics {
// a single "background" thread that actually handles
// processing
private final ExecutorService metricThread =
Executors.newSingleThreadExecutor();
// data (no synchronization needed)
private int count = 0;
private long processingTime = 0;
public void addMetric(final long time) {
metricThread.execute(new Runnable() {
public void run() {
count++;
processingTime += time;
if (count % 1000 == 0) {
... log some metrics
processingTime = 0;
count = 0;
}
}
});
}
}
}
I would suggest that if you don't want the logging to interfere with the processing, you should have a separate log worker thread and have your processing threads simply provide some type of value object that can be handed off. In the example I chose a LinkedBlockingQueue since offer() blocks for at most an insignificant amount of time, and you can defer the blocking to another thread that pulls the values from the queue. You might need additional logic in the MetricProcessor to order data, etc., depending on your requirements, but even if it is a long-running operation it won't keep the VM thread scheduler from restarting the real processing threads in the meantime.
public class Worker implements Runnable {
public void run() {
while (true) {
... do some stuff
if (count % 1000 == 0) {
... log some metrics
if(MetricProcessor.getInstance().addMetrics(
new WorkerMetrics(processingTime, count, ...))) {
processingTime = 0;
count = 0;
} else {
//the call would have blocked for a more significant
//amount of time, here the results
//could be abandoned or just held and attempted again
//as a larger data set later
}
}
}
}
}
public class WorkerMetrics {
...some interesting data
public WorkerMetrics(... data){
...
}
...getter setters etc
}
public class MetricProcessor implements Runnable {
LinkedBlockingQueue<WorkerMetrics> metrics = new LinkedBlockingQueue<WorkerMetrics>();
public boolean addMetrics(WorkerMetrics m) {
return metrics.offer(m); //This may block, but not for a significant amount of time.
}
public void run() {
while(true) {
WorkerMetrics m = metrics.take(); //wait here for something to come in
//the above call does all the significant blocking without
//interrupting the real processing
...do some actual logging, aggregation, etc of the metrics
}
}
}
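Since the code above is deliberately schematic, here is a compact runnable sketch of the same handoff. The fields of WorkerMetrics, the STOP marker used to end the logger thread, and the queue capacity are all assumptions made for the example. The worker side only calls offer(), which returns false instead of waiting when the queue is full, and the logger thread does the blocking take().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: workers hand metrics to a dedicated logger thread through a queue.
class MetricsHandoff {
    /** Hypothetical value object handed from a worker to the logger thread. */
    static final class WorkerMetrics {
        final long processingTime;
        final int count;

        WorkerMetrics(long processingTime, int count) {
            this.processingTime = processingTime;
            this.count = count;
        }
    }

    private static final WorkerMetrics STOP = new WorkerMetrics(-1, -1); // end-of-stream marker

    /** Hands off the given number of 1000-record batches and returns the records the logger saw. */
    static long handOff(int batches) {
        BlockingQueue<WorkerMetrics> queue = new LinkedBlockingQueue<>(batches + 1);
        long[] totalRecords = new long[1]; // written only by the logger thread
        Thread logger = new Thread(() -> {
            try {
                for (WorkerMetrics m = queue.take(); m != STOP; m = queue.take()) {
                    totalRecords[0] += m.count; // stand-in for real logging or aggregation
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        logger.start();
        for (int i = 0; i < batches; i++) {
            // offer() does not wait for space; a false result would mean "drop or retry later"
            queue.offer(new WorkerMetrics(5L, 1000));
        }
        queue.offer(STOP);
        try {
            logger.join(); // makes the logger's writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return totalRecords[0];
    }

    public static void main(String[] args) {
        System.out.println("records seen by logger: " + handOff(10));
    }
}
```

The queue is sized so every offer in this demo succeeds; a real worker would need to decide what to do when offer() returns false.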
If you depend on the state of count and the state of processingTime being in sync, then you have to use a Lock. For example, if when ++count % 1000 == 0 is true you want to evaluate the metrics of processingTime at THAT time.
For that case, it would make sense to use a ReentrantLock. I wouldn't use a RRWL because there isn't really an instance where a pure read is occurring. It is always a read/write set. But you would need to Lock around all of
count++;
processingTime += (end-start);
if (count % 1000 == 0) {
... log some metrics
processingTime = 0;
count = 0;
}
Whether or not count++ is going to be at that location, you will need to lock around that also.
Finally if you are using a Lock, you do not need an AtomicLong and AtomicInteger. It just adds to the overhead and isn't more thread-safe.
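Put together, a sketch of that locked version could look like the following. The reports counter stands in for "log some metrics" so the behaviour can be checked, and the class and method names are made up for this illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: count and processingTime are updated and reset under one lock.
class LockedMetrics {
    private final ReentrantLock lock = new ReentrantLock();
    private int count = 0;           // plain fields: the lock provides the safety
    private long processingTime = 0;
    private int reports = 0;         // how many 1000-record batches were reported

    /** Records one processed item; every 1000th call reports and resets the metrics. */
    void record(long elapsedMillis) {
        lock.lock();
        try {
            count++;
            processingTime += elapsedMillis;
            if (count % 1000 == 0) {
                reports++; // stand-in for "log some metrics"
                processingTime = 0;
                count = 0;
            }
        } finally {
            lock.unlock();
        }
    }

    int reports() {
        lock.lock();
        try {
            return reports;
        } finally {
            lock.unlock();
        }
    }

    /** Runs the given number of worker threads and returns how many reports were made. */
    static int simulate(int threads, int recordsPerThread) {
        LockedMetrics metrics = new LockedMetrics();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int r = 0; r < recordsPerThread; r++) {
                    metrics.record(1L);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return metrics.reports();
    }

    public static void main(String[] args) {
        System.out.println("reports: " + simulate(4, 2_500));
    }
}
```

The critical section is only a few field updates, so the lock is held very briefly; anything slow, such as the actual logging, is better done outside it or on another thread.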
I'm trying to figure out how to correctly use Java's Executors. I realize submitting tasks to an ExecutorService has its own overhead. However, I'm surprised to see it is as high as it is.
My program needs to process huge amount of data (stock market data) with as low latency as possible. Most of the calculations are fairly simple arithmetic operations.
I tried to test something very simple: "Math.random() * Math.random()"
The simplest test runs this computation in a simple loop. The second test does the same computation inside an anonymous Runnable (this is supposed to measure the cost of creating new objects). The third test passes the Runnable to an ExecutorService (this measures the cost of introducing executors).
I ran the tests on my dinky laptop (2 cpus, 1.5 gig ram):
(in milliseconds)
simpleCompuation:47
computationWithObjCreation:62
computationWithObjCreationAndExecutors:422
(about once out of four runs, the first two numbers end up being equal)
Notice that executors take far, far more time than executing on a single thread. The numbers were about the same for thread pool sizes between 1 and 8.
Question: Am I missing something obvious or are these results expected? These results tell me that any task I pass in to an executor must do some non-trivial computation. If I am processing millions of messages, and I need to perform very simple (and cheap) transformations on each message, I still may not be able to use executors...trying to spread computations across multiple CPUs might end up being costlier than just doing them in a single thread. The design decision becomes much more complex than I had originally thought. Any thoughts?
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class ExecServicePerformance {
private static int count = 100000;
public static void main(String[] args) throws InterruptedException {
//warmup
simpleCompuation();
computationWithObjCreation();
computationWithObjCreationAndExecutors();
long start = System.currentTimeMillis();
simpleCompuation();
long stop = System.currentTimeMillis();
System.out.println("simpleCompuation:"+(stop-start));
start = System.currentTimeMillis();
computationWithObjCreation();
stop = System.currentTimeMillis();
System.out.println("computationWithObjCreation:"+(stop-start));
start = System.currentTimeMillis();
computationWithObjCreationAndExecutors();
stop = System.currentTimeMillis();
System.out.println("computationWithObjCreationAndExecutors:"+(stop-start));
}
private static void computationWithObjCreation() {
for(int i=0;i<count;i++){
new Runnable(){
@Override
public void run() {
double x = Math.random()*Math.random();
}
}.run();
}
}
private static void simpleCompuation() {
for(int i=0;i<count;i++){
double x = Math.random()*Math.random();
}
}
private static void computationWithObjCreationAndExecutors()
throws InterruptedException {
ExecutorService es = Executors.newFixedThreadPool(1);
for(int i=0;i<count;i++){
es.submit(new Runnable() {
@Override
public void run() {
double x = Math.random()*Math.random();
}
});
}
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
}
}
Using executors is about utilizing CPUs and / or CPU cores, so if you create a thread pool that utilizes the amount of CPUs at best, you have to have as many threads as CPUs / cores.
You are right, creating new objects costs too much. So one way to reduce the expense is to use batches. If you know the kind and amount of computations to do, you create batches. So think about thousands of computations done in one executed task. You create batches for each thread. As soon as the computation is done (java.util.concurrent.Future), you create the next batch. Even the creation of new batches can be done in parallel (4 CPUs -> 3 threads for computation, 1 thread for batch provisioning). In the end, you may end up with more throughput, but with higher memory demands (batches, provisioning).
Edit: I changed your example and I let it run on my little dual-core x200 laptop.
provisioned 2 batches to be executed
simpleCompuation:14
computationWithObjCreation:17
computationWithObjCreationAndExecutors:9
As you see in the source code, I took the batch provisioning and executor lifecycle out of the measurement, too. That's more fair compared to the other two methods.
See the results by yourself...
import java.util.List;
import java.util.Vector;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class ExecServicePerformance {
private static int count = 100000;
public static void main( String[] args ) throws InterruptedException {
final int cpus = Runtime.getRuntime().availableProcessors();
final ExecutorService es = Executors.newFixedThreadPool( cpus );
final Vector< Batch > batches = new Vector< Batch >( cpus );
final int batchComputations = count / cpus;
for ( int i = 0; i < cpus; i++ ) {
batches.add( new Batch( batchComputations ) );
}
System.out.println( "provisioned " + cpus + " batches to be executed" );
// warmup
simpleCompuation();
computationWithObjCreation();
computationWithObjCreationAndExecutors( es, batches );
long start = System.currentTimeMillis();
simpleCompuation();
long stop = System.currentTimeMillis();
System.out.println( "simpleCompuation:" + ( stop - start ) );
start = System.currentTimeMillis();
computationWithObjCreation();
stop = System.currentTimeMillis();
System.out.println( "computationWithObjCreation:" + ( stop - start ) );
// Executor
start = System.currentTimeMillis();
computationWithObjCreationAndExecutors( es, batches );
es.shutdown();
es.awaitTermination( 10, TimeUnit.SECONDS );
// Note: Executor#shutdown() and Executor#awaitTermination() requires
// some extra time. But the result should still be clear.
stop = System.currentTimeMillis();
System.out.println( "computationWithObjCreationAndExecutors:"
+ ( stop - start ) );
}
private static void computationWithObjCreation() {
for ( int i = 0; i < count; i++ ) {
new Runnable() {
@Override
public void run() {
double x = Math.random() * Math.random();
}
}.run();
}
}
private static void simpleCompuation() {
for ( int i = 0; i < count; i++ ) {
double x = Math.random() * Math.random();
}
}
private static void computationWithObjCreationAndExecutors(
ExecutorService es, List< Batch > batches )
throws InterruptedException {
for ( Batch batch : batches ) {
es.submit( batch );
}
}
private static class Batch implements Runnable {
private final int computations;
public Batch( final int computations ) {
this.computations = computations;
}
@Override
public void run() {
int countdown = computations;
while ( countdown-- > 0 ) { // "> -1" would run one extra computation per batch
double x = Math.random() * Math.random();
}
}
}
}
This is not a fair test for the thread pool, for the following reasons:
You are not taking advantage of the pooling at all because you only have 1 thread.
The job is so simple that the pooling overhead can't be justified. A multiplication on a CPU with FPP only takes a few cycles.
Consider the following extra steps the thread pool has to perform besides creating the object and running the job:
Put the job in the queue
Remove the job from queue
Get the thread from the pool and execute the job
Return the thread to the pool
When you have a real job and multiple threads, the benefit of the thread pool will be apparent.
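The extra steps listed above can be made visible with a toy single-thread pool. This is only a sketch and not how ThreadPoolExecutor is actually implemented; ToyPool and its methods are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy pool: one long-lived worker thread fed through a queue.
class ToyPool {
    private static final Runnable STOP = () -> { }; // marker job that ends the worker
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    ToyPool() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    Runnable job = queue.take(); // remove the job from the queue
                    if (job == STOP) {
                        return;
                    }
                    job.run();                   // execute it on the pooled thread
                }                                // looping again "returns" the thread for reuse
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
    }

    void submit(Runnable job) {
        queue.add(job); // put the job in the queue
    }

    void shutdownAndWait() {
        queue.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs `jobs` trivial tasks through the pool and returns how many ran.
    static int runTrivialJobs(int jobs) {
        AtomicInteger done = new AtomicInteger();
        ToyPool pool = new ToyPool();
        for (int i = 0; i < jobs; i++) {
            pool.submit(done::incrementAndGet);
        }
        pool.shutdownAndWait();
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runTrivialJobs(100_000));
    }
}
```

Every job pays for one queue insertion and one queue removal, which is why a job that is only a single multiplication cannot amortize the hand-off.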
The 'overhead' you mention has nothing to do with ExecutorService; it is caused by multiple threads synchronizing on Math.random, which creates lock contention.
So yes, you are missing something (and the 'correct' answer below is not actually correct).
Here is some Java 8 code to demonstrate 8 threads running a simple function in which there is no lock contention:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleFunction;
import com.google.common.base.Stopwatch;
public class ExecServicePerformance {
private static final int repetitions = 120;
private static int totalOperations = 250000;
private static final int cpus = 8;
private static final List<Batch> batches = batches(cpus);
private static DoubleFunction<Double> performanceFunc = (double i) -> {return Math.sin(i * 100000 / Math.PI); };
public static void main( String[] args ) throws InterruptedException {
printExecutionTime("Synchronous", ExecServicePerformance::synchronous);
printExecutionTime("Synchronous batches", ExecServicePerformance::synchronousBatches);
printExecutionTime("Thread per batch", ExecServicePerformance::asynchronousBatches);
printExecutionTime("Executor pool", ExecServicePerformance::executorPool);
}
private static void printExecutionTime(String msg, Runnable f) throws InterruptedException {
long time = 0;
for (int i = 0; i < repetitions; i++) {
Stopwatch stopwatch = Stopwatch.createStarted();
f.run(); //remember, this is a single-threaded synchronous execution since there is no explicit new thread
time += stopwatch.elapsed(TimeUnit.MILLISECONDS);
}
System.out.println(msg + " exec time: " + time);
}
private static void synchronous() {
for ( int i = 0; i < totalOperations; i++ ) {
performanceFunc.apply(i);
}
}
private static void synchronousBatches() {
for ( Batch batch : batches) {
batch.synchronously();
}
}
private static void asynchronousBatches() {
CountDownLatch cb = new CountDownLatch(cpus);
for ( Batch batch : batches) {
Runnable r = () -> { batch.synchronously(); cb.countDown(); };
Thread t = new Thread(r);
t.start();
}
try {
cb.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
private static void executorPool() {
final ExecutorService es = Executors.newFixedThreadPool(cpus);
for ( Batch batch : batches ) {
Runnable r = () -> { batch.synchronously(); };
es.submit(r);
}
es.shutdown();
try {
es.awaitTermination( 10, TimeUnit.SECONDS );
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
private static List<Batch> batches(final int cpus) {
List<Batch> list = new ArrayList<Batch>();
for ( int i = 0; i < cpus; i++ ) {
list.add( new Batch( totalOperations / cpus ) );
}
System.out.println("Batches: " + list.size());
return list;
}
private static class Batch {
private final int operationsInBatch;
public Batch( final int ops ) {
this.operationsInBatch = ops;
}
public void synchronously() {
for ( int i = 0; i < operationsInBatch; i++ ) {
performanceFunc.apply(i);
}
}
}
}
Result timings for 120 tests of 250k operations (ms):
Synchronous exec time: 9956
Synchronous batches exec time: 9900
Thread per batch exec time: 2176
Executor pool exec time: 1922
Winner: Executor Service.
I don't think this is at all realistic since you're creating a new executor service every time you make the method call. Unless you have very strange requirements that seems unrealistic - typically you'd create the service when your app starts up, and then submit jobs to it.
If you try the benchmarking again but initialise the service as a field, once, outside the timing loop; then you'll see the actual overhead of submitting Runnables to the service vs. running them yourself.
But I don't think you've grasped the point fully - Executors aren't meant to be there for efficiency, they're there to make co-ordinating and handing off work to a thread pool simpler. They will always be less efficient than just invoking Runnable.run() yourself (since at the end of the day the executor service still needs to do this, after doing some extra housekeeping beforehand). It's when you are using them from multiple threads needing asynchronous processing, that they really shine.
Also consider that you're looking at the relative time difference of a basically fixed cost (Executor overhead is the same whether your tasks take 1ms or 1hr to run) compared to a very small variable amount (your trivial runnable). If the executor service takes 5ms extra to run a 1ms task, that's not a very favourable figure. If it takes 5ms extra to run a 5 second task (e.g. a non-trivial SQL query), that's completely negligible and entirely worth it.
So to some extent it depends on your situation - if you have an extremely time-critical section, running lots of small tasks, that don't need to be executed in parallel or asynchronously then you'll get nothing from an Executor. If you're processing heavier tasks in parallel and want to respond asynchronously (e.g. a webapp) then Executors are great.
Whether they are the best choice for you depends on your situation, but really you need to try the tests with realistic representative data. I don't think it would be appropriate to draw any conclusions from the tests you've done unless your tasks really are that trivial (and you don't want to reuse the executor instance...).
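A minimal sketch of the "create the service once" suggestion above, assuming a single shared pool is acceptable. ReusedExecutor, POOL and runAll are illustrative names; the daemon thread factory is my own addition so that the sketch cannot keep the JVM alive if shutdown() is never called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class ReusedExecutor {
    // Created once, e.g. at application start-up, and shared afterwards.
    static final ExecutorService POOL = Executors.newFixedThreadPool(1, r -> {
        Thread t = new Thread(r, "shared-pool");
        t.setDaemon(true); // assumption: daemon is fine for this demo
        return t;
    });

    // Submits `count` trivial tasks to the shared pool and waits for all of them.
    static int runAll(int count) {
        AtomicInteger done = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            futures.add(POOL.submit(() -> { done.incrementAndGet(); }));
        }
        for (Future<?> f : futures) {
            try {
                f.get();
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        return done.get();
    }

    public static void main(String[] args) {
        runAll(100_000); // warm-up, not measured
        long start = System.nanoTime();
        int done = runAll(100_000);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(done + " tasks in " + micros + " us");
        POOL.shutdown(); // only once the application is finished with the pool
    }
}
```

Measured this way, the number reflects the per-task submission cost only, not the cost of starting and stopping the pool.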
Math.random() actually synchronizes on a single Random number generator. Calling Math.random() results in significant contention for the number generator. In fact the more threads you have, the slower it's going to be.
From the Math.random() javadoc:
This method is properly synchronized to allow correct use by more than
one thread. However, if many threads need to generate pseudorandom
numbers at a great rate, it may reduce contention for each thread to
have its own pseudorandom-number generator.
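A sketch of what the javadoc suggests, using ThreadLocalRandom (in java.util.concurrent since Java 7) so that each thread draws from its own generator. PerThreadRandom and sumOfProducts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

class PerThreadRandom {

    // Each of `threads` tasks computes `perThread` products of two random doubles
    // and returns their sum, so the work has an observable result.
    static double sumOfProducts(int threads, int perThread) {
        ExecutorService es = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Double>> tasks = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                tasks.add(() -> {
                    // current() returns the generator owned by the calling thread.
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    double sum = 0;
                    for (int i = 0; i < perThread; i++) {
                        sum += rnd.nextDouble() * rnd.nextDouble();
                    }
                    return sum;
                });
            }
            double total = 0;
            for (Future<Double> f : es.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            es.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfProducts(4, 1_000_000));
    }
}
```

Swapping rnd.nextDouble() back to Math.random() in the same harness should make the contention visible on a multi-core machine, though I have not included timings here.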
Firstly, there are a few issues with the microbenchmark. You do a warm-up, which is good. However, it is better to run the test multiple times, which should give a feel as to whether it has really warmed up and for the variance of the results. It also tends to be better to test each algorithm in a separate run, otherwise you might cause deoptimisation when the algorithm changes.
The task is very small, although I'm not entirely sure how small, so a "number of times faster" figure is pretty meaningless. In multithreaded situations, the threads will touch the same volatile locations, which can cause really bad performance (use a Random instance per thread). Also, a run of 47 milliseconds is a bit short.
Certainly going to another thread for a tiny operation is not going to be fast. Split tasks up into bigger sizes if possible. JDK7 looks as if it will have a fork-join framework, which attempts to support fine tasks from divide and conquer algorithms by preferring to execute tasks on the same thread in order, with larger tasks pulled out by idle threads.
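For reference, the fork-join framework is available as java.util.concurrent.ForkJoinPool. A minimal divide-and-conquer sketch follows; SumTask and the THRESHOLD value are made up for illustration, a sensible threshold depends on the workload, and ForkJoinPool.commonPool() needs Java 8.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: sums the integers in [from, to) by recursive splitting.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // assumed cut-off, tune per workload
    private final int from;
    private final int to;

    SumTask(int from, int to) {
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: do the work directly on the current thread.
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += i;
            }
            return sum;
        }
        int mid = from + (to - from) / 2;
        SumTask left = new SumTask(from, mid);
        SumTask right = new SumTask(mid, to);
        left.fork();                          // make the left half available to idle threads
        return right.compute() + left.join(); // keep working on the right half here
    }

    static long sum(int from, int to) {
        return ForkJoinPool.commonPool().invoke(new SumTask(from, to));
    }

    public static void main(String[] args) {
        System.out.println(sum(0, 1_000_000));
    }
}
```

The threshold plays the same role as the batch size discussed elsewhere in this thread: below it, splitting further costs more than it gains.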
Here are results on my machine (OpenJDK 8 on 64-bit Ubuntu 14.0, Thinkpad W530)
simpleCompuation:6
computationWithObjCreation:5
computationWithObjCreationAndExecutors:33
There's certainly overhead. But remember what these numbers are: milliseconds for 100k iterations. In your case, the overhead was about 4 microseconds per iteration. For me, the overhead was about a quarter of a microsecond.
The overhead is synchronization, internal data structures, and possibly a lack of JIT optimization due to complex code paths (certainly more complex than your for loop).
The tasks that you'd actually want to parallelize would be worth it, despite the quarter microsecond overhead.
FYI, this would be a very bad computation to parallelize. I upped the thread count to 8 (the number of cores):
simpleCompuation:5
computationWithObjCreation:6
computationWithObjCreationAndExecutors:38
It didn't make it any faster. This is because Math.random() is synchronized.
The fixed thread pool's ultimate purpose is to reuse already created threads, so the performance gain comes from not having to create a new thread every time a task is submitted. Hence the stop time must be taken inside the submitted task, as the last statement of the run method.
You need to somehow group execution, in order to submit larger portions of computation to each thread (e.g. build groups based on stock symbol).
I got the best results in similar scenarios by using the Disruptor. It has a very low per-job overhead. Still, it's important to group jobs; naive round robin usually creates many cache misses.
see http://java-is-the-new-c.blogspot.de/2014/01/comparision-of-different-concurrency.html
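One way to read the grouping suggestion above, sketched with a plain ExecutorService rather than the Disruptor and assuming the work items carry a key such as a stock symbol: group the items first, then submit one task per group instead of one task per item. Trade, GroupedExecution and totalsBySymbol are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class GroupedExecution {

    // Hypothetical work item.
    static final class Trade {
        final String symbol;
        final long quantity;

        Trade(String symbol, long quantity) {
            this.symbol = symbol;
            this.quantity = quantity;
        }
    }

    static Map<String, Long> totalsBySymbol(List<Trade> trades, int threads) {
        // Group first, so that each submitted task covers a whole symbol.
        Map<String, List<Trade>> groups = new HashMap<>();
        for (Trade t : trades) {
            groups.computeIfAbsent(t.symbol, k -> new ArrayList<>()).add(t);
        }
        ExecutorService es = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<Long>> pending = new HashMap<>();
            for (Map.Entry<String, List<Trade>> e : groups.entrySet()) {
                List<Trade> group = e.getValue();
                Callable<Long> task = () -> {
                    long total = 0;
                    for (Trade t : group) {
                        total += t.quantity;
                    }
                    return total;
                };
                pending.put(e.getKey(), es.submit(task));
            }
            Map<String, Long> totals = new HashMap<>();
            for (Map.Entry<String, Future<Long>> e : pending.entrySet()) {
                totals.put(e.getKey(), e.getValue().get());
            }
            return totals;
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException(ex);
        } finally {
            es.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Trade> trades = new ArrayList<>();
        trades.add(new Trade("AAA", 10));
        trades.add(new Trade("BBB", 5));
        trades.add(new Trade("AAA", 7));
        System.out.println(totalsBySymbol(trades, 2));
    }
}
```

Because all items of one symbol are processed by the same task, the data for that symbol tends to stay with one thread, which is the cache-friendliness argument made above.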
In case it is useful to others, here are test results with a realistic scenario - use ExecutorService repeatedly until the end of all tasks - on a Samsung Android device.
Simple computation (MS): 102
Use threads (MS): 31049
Use ExecutorService (MS): 257
Code:
ExecutorService executorService = Executors.newFixedThreadPool(1);
int count = 100000;
//Simple computation
Instant instant = Instant.now();
for (int i = 0; i < count; i++) {
double x = Math.random() * Math.random();
}
Duration duration = Duration.between(instant, Instant.now());
Log.d("ExecutorPerformanceTest", "Simple computation (MS): " + duration.toMillis());
//Use threads
instant = Instant.now();
for (int i = 0; i < count; i++) {
new Thread(() -> {
double x = Math.random() * Math.random();
}
).start();
}
duration = Duration.between(instant, Instant.now());
Log.d("ExecutorPerformanceTest", "Use threads (MS): " + duration.toMillis());
//Use ExecutorService
instant = Instant.now();
for (int i = 0; i < count; i++) {
executorService.execute(() -> {
double x = Math.random() * Math.random();
}
);
}
duration = Duration.between(instant, Instant.now());
Log.d("ExecutorPerformanceTest", "Use ExecutorService (MS): " + duration.toMillis());
I've faced a similar problem, but Math.random() was not the issue.
The problem is having many small tasks that take just a few milliseconds each to complete. Each one is not much, but a lot of small tasks in series adds up to a lot of time, so I needed to parallelize.
So, the solution I found, and it might work for those of you facing this same problem: do not use any of the executor services. Instead create your own long living Threads and feed them tasks.
Here is an example. It is just an idea; don't try to copy-paste it, because it probably won't work as-is: I am using Kotlin and translating to Java in my head. The concept is what's important:
First, the Thread, a Thread that can execute a task and then continue there waiting for the next one:
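import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
public class Worker extends Thread {
    private Callable<?> task;
    private final Semaphore semaphore;
    private CountDownLatch latch;
    public Worker(Semaphore semaphore) {
        this.semaphore = semaphore;
    }
    @Override
    public void run() {
        try {
            while (true) {
                semaphore.acquire(); // this will block, the while(true) won't go crazy
                if (task == null) continue;
                try {
                    task.call(); // Callable exposes call(), not run()
                } catch (Exception e) {
                    e.printStackTrace(); // keep the worker alive if a task fails
                }
                task = null; // clear before signalling, so a task set right after await() is not wiped
                if (latch != null) latch.countDown();
            }
        } catch (InterruptedException e) {
            // interrupted while waiting for work: let the thread end
        }
    }
    public void setTask(Callable<?> task) {
        this.task = task;
    }
    public void setCountDownLatch(CountDownLatch latch) {
        this.latch = latch;
    }
}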
There are two things here that need explanation:
the Semaphore: gives you control over how many tasks and when they are executed by this thread
the CountDownLatch: is the way to notify someone else that a task was completed
So this is how you would use this Worker, first just a simple example:
Semaphore semaphore = new Semaphore(0); // initially the semaphore is closed
Worker worker = new Worker(semaphore);
worker.start();
worker.setTask( .. your callable task .. );
semaphore.release(); // this will allow one task to be processed by the worker
Now a more complicated example, with two Threads and waiting for both to complete using the CountDownLatch:
Semaphore semaphore1 = new Semaphore(0);
Worker worker1 = new Worker(semaphore1);
worker1.start();
Semaphore semaphore2 = new Semaphore(0);
Worker worker2 = new Worker(semaphore2);
worker2.start();
// same countdown latch for both workers, with a counter of 2
CountDownLatch countDownLatch = new CountDownLatch(2);
worker1.setCountDownLatch(countDownLatch);
worker2.setCountDownLatch(countDownLatch);
worker1.setTask( .. your callable task .. );
worker2.setTask( .. your callable task .. );
semaphore1.release();
semaphore2.release();
countDownLatch.await(); // this will block until 2 tasks have been completed
And after that code runs you could just add more tasks to the same threads and reuse them. That's the whole point of this, reusing the threads instead of creating new ones.
It is unpolished as f*** but hopefully this gives you an idea. For me this was an improvement compared to no multi threading. And it was much much better than any executor service with any number of threads in the pool by far.