Why threads do not cache object locally? - java

I have a String and ThreadPoolExecutor that changes the value of this String. Just check out my sample:
String str_example = "";
ThreadPoolExecutor poolExecutor = new ThreadPoolExecutor(10, 30, (long)10, TimeUnit.SECONDS, runnables);
for (int i = 0; i < 80; i++){
    poolExecutor.submit(new Runnable() {
        @Override
        public void run() {
            try {
                Thread.sleep((long) (Math.random() * 1000));
                String temp = str_example + "1";
                str_example = temp;
                System.out.println(str_example);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
}
After executing this, I get something like this:
1
11
111
1111
11111
.......
So the question is: I would expect a result like this only if my String field had the volatile modifier. But I get the same result with and without it.

There are several reasons why you see "correct" execution.
First, CPU designers do as much as they can so that our programs run correctly even in the presence of data races. Cache coherence operates on cache lines and tries to minimize possible conflicts. For example, only one CPU can write to a cache line at any point in time. After a write is done, other CPUs must request that cache line before they can write to it. Not to mention that the x86 architecture (most probably the one you use) is very strict compared to others.
Second, your program is slow and threads sleep for some random period of time. So they do almost all the work at different points of time.
How to achieve inconsistent behavior? Try a for loop without any sleep. In that case the field value will most probably be cached in CPU registers and some updates will not be visible.
P.S. Updates of the field str_example are not atomic, so your program may produce the same string values even in the presence of the volatile keyword.

When you talk about concepts like thread caching, you're talking about the properties of a hypothetical machine that Java might be implemented on. The logic is something like "Java permits an implementation to cache things, so it requires you to tell it when such things would break your program". That does not mean that any actual machine does anything of the sort. In reality, most machines you are likely to use have completely different kinds of optimizations that don't involve the kind of caches that you're thinking of.
Java requires you to use volatile precisely so that you don't have to worry about what kinds of absurdly complex optimizations the actual machine you're working on might or might not have. And that's a really good thing.

Your code is unlikely to exhibit concurrency bugs because it executes with very low concurrency. You have 10 threads, each of which sleeps on average 500 ms before doing a string concatenation. As a rough guess, string concatenation takes about 1 ns per character, and because your string is only 80 characters long, this would mean that each thread spends about 80 out of 500000000 ns executing. The chance of two or more threads running at the same time is therefore vanishingly small.
If we change your program so that several threads are running concurrently all the time, we see quite different results:
static String s = "";
public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(5);
    for (int i = 0; i < 10_000; i ++) {
        executor.submit(() -> {
            s += "1";
        });
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.MINUTES);
    System.out.println(s.length());
}
In the absence of data races, this should print 10000. On my computer, this prints about 4200, meaning over half the updates have been lost in the data race.
What if we declare s volatile? Interestingly, we still get about 4200 as a result, so data races were not prevented. That makes sense, because volatile ensures that writes are visible to other threads, but does not prevent intermediary updates, i.e. what happens is something like:
Thread 1 reads s and starts making a new String
Thread 2 reads s and starts making a new String
Thread 1 stores its result in s
Thread 2 stores its result in s, overwriting the previous result
To prevent this, you can use a plain old synchronized block:
executor.submit(() -> {
    synchronized (Test.class) {
        s += "1";
    }
});
And indeed, this returns 10000, as expected.

It works because you are using Thread.sleep((long) (Math.random() * 1000)); so every thread has a different sleep time, and the threads may end up executing one by one while all the others are asleep or have already completed. But although your code appears to work, it is not thread safe. Even declaring the field volatile will not make it thread safe. volatile only guarantees visibility, i.e. when one thread makes a change, other threads are able to see it.
In your case the operation is a multi-step process: reading the variable, updating it, then writing it back to memory. So you need a locking mechanism to make it thread safe.
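Besides a lock, an atomic read-modify-write also closes the gap between reading and writing. The following is a minimal sketch, not the asker's code: the class name AtomicAppendDemo and the method appendConcurrently are made up for illustration, and it assumes the string can live in a java.util.concurrent.atomic.AtomicReference.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; names and task counts are illustrative only.
class AtomicAppendDemo {

    // Appends "1" from `tasks` concurrent tasks and returns the final length.
    static int appendConcurrently(int tasks) {
        AtomicReference<String> ref = new AtomicReference<>("");
        ExecutorService executor = Executors.newFixedThreadPool(5);
        for (int i = 0; i < tasks; i++) {
            // updateAndGet retries the read-modify-write until no other
            // thread has replaced the value in between, so no append is lost.
            executor.execute(() -> ref.updateAndGet(s -> s + "1"));
        }
        executor.shutdown();
        try {
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ref.get().length();
    }

    public static void main(String[] args) {
        System.out.println(appendConcurrently(10_000));
    }
}
```

The update function may run more than once under contention, so it should stay free of side effects, as the string concatenation here is.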

Related

Trying to understand shared variables in java threads

I have the following code :
class thread_creation extends Thread{
    int t;
    thread_creation(int x){
        t=x;
    }
    public void run() {
        increment();
    }
    public void increment() {
        for(int i =0 ; i<10 ; i++) {
            t++;
            System.out.println(t);
        }
    }
}
public class test {
    public static void main(String[] args) {
        int i =0;
        thread_creation t1 = new thread_creation(i);
        thread_creation t2 = new thread_creation(i);
        t1.start();
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        t2.start();
    }
}
When I run it , I get :
1
2
3
4
5
6
7
8
9
10
1
2
3
4
5
6
7
8
9
10
Why am I getting this output? According to my understanding, the variable i is shared between the two threads. So the first thread will execute and increment i 10 times, and hence i will be equal to 10. The second thread will start after the first one because of the sleep statement, and since i is shared, the second thread will start with i=10 and increment it 10 times to reach i = 20. But this is not the case in the output, so why is that?
You seem to think that int t; in thread_creation is a shared variable. I'm afraid you are mistaken. Each t instance is a different variable. So the two threads are updating distinct counters.
The output you are seeing reflects that.
This is the nub of your question:
How do I pass a shared variable then ?
Actually, you can't [1]. Strictly a shared variable is actually a variable belonging to a shared object. You cannot pass a variable per se. Java does not allow passing of variables. This is what "Java does not support call-by-reference" really means. You can't pass or return a variable or the address of a variable in any method call. (Or in any other way.)
In Java you pass and return values: either primitives, or references to objects. The values may read from a variable by the call's parameter expression or assigned to a variable after the call's return. But you are not passing the variable. A variable and its value / contents are different things.
So the only way to implement a shared counter is to implement it as a shared counter object.
Note that "variable" and "object" mean different things, both in Java and in other programming languages. You should NOT use the two terms interchangeably. For example, when I declare this in Java:
String s = "Hello";
the s variable is not a String object. It is a variable that contains a reference to the String object. Other variables may contain references to the same String object as well. The distinction is even more stark when the objects are mutable. (String is not mutable ... in Java.)
Here are the two (IMO) best ways to implement a shared counter object.
You could create a custom Java Counter class with a count variable, a get method, and methods for incrementing and decrementing the counter. The class needs to implement these methods as thread-safe and atomic; e.g. by using synchronized methods or blocks [2].
You could just use an AtomicInteger instance. That takes care of atomicity and thread-safety ... to the extent that it is possible with this kind of API.
The latter approach is simpler and likely more efficient ... unless you need to do something special each time the counter changes.
(It is conceivable that you could implement a shared counter other ways, but that is too much detail for this answer.)
1 - I realize that I just said the same thing more than 3 times. But as the Bellman says in "The Hunting of the Snark": "What I tell you three times is true."
2 - If the counter is not implemented using synchronized or an equivalent mutual exclusion mechanism with the appropriate happens before semantics, you are liable to see Heisenbugs; e.g. race conditions and memory visibility problems.
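The first option above (a custom counter class with synchronized methods) could look roughly like the sketch below. SyncCounter, CountingThread and SharedCounterDemo are hypothetical names chosen for illustration; the point is that both threads receive a reference to the same object, so they update the same count field.

```java
// All class names here are hypothetical and only serve the illustration.
class SharedCounterDemo {
    static int runTwoThreads() {
        SyncCounter shared = new SyncCounter();   // one object for both threads
        Thread t1 = new CountingThread(shared);
        Thread t2 = new CountingThread(shared);
        t1.start();
        t2.start();
        try {
            t1.join();                            // wait for both threads to finish
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return shared.get();
    }

    public static void main(String[] args) {
        System.out.println(runTwoThreads());      // prints 20
    }
}

class SyncCounter {
    private int count;                            // guarded by this object's lock

    // synchronized makes each read-modify-write atomic and visible to other threads
    synchronized void increment() { count++; }
    synchronized int get() { return count; }
}

class CountingThread extends Thread {
    private final SyncCounter counter;            // copy of the reference, same object

    CountingThread(SyncCounter counter) { this.counter = counter; }

    @Override
    public void run() {
        for (int i = 0; i < 10; i++) counter.increment();
    }
}
```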
Two crucial things you're missing. Both individually explain this behaviour - you can 'fix' either one and you'll still see this, you'd have to fix both to see 1-20:
Java is pass-by-value
When you pass i, you pass a copy of it. In fact, in java, all parameters to methods are always copies. Hence, when the thread does t++, it has absolutely no effect whatsoever on your i. You can trivially test this, and you don't need to mess with threads to see it:
public static void main(String[] args) {
    int i = 0;
    add5(i);
    System.out.println(i); // prints 0!!
}
static void add5(int i) {
    i = i + 5;
}
Note that all non-primitives are references. That means: A copy of the reference is passed. It's like passing the address of a house and not the house itself. If I have an address book, and I hand you a scanned copy of a page that contains the address to my summer home, you can still drive over there and toss a brick through the window, and I'll 'see' that when I go follow my copy of the address. So, when you pass e.g. a list and the method you passed the list to runs list.add("foo"), you DO see that. You may think: AHA! That means java does not pass a copy, it passed the real list! Not so. Java passed a copy of a street address (A reference). The method I handed that copy to decided to drive over there and act - that you can see.
In other words, =, ++, that sort of thing? That is done to the copy. . is java for 'drive to the address and enter the house'. Anything you 'do' with . is visible to the caller, = and ++ and such are not.
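The address-book analogy can be checked with a small sketch. PassByValueDemo and its two helper methods are hypothetical names: mutate follows the copied reference with '.', while reassign only changes its own local copy of the reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class illustrating "a copy of the reference is passed".
class PassByValueDemo {

    // '.' drives to the address: the caller sees the change to the shared object.
    static void mutate(List<String> list) {
        list.add("foo");
    }

    // '=' only overwrites this method's copy of the address; the caller's
    // variable still points at the original list.
    static void reassign(List<String> list) {
        list = new ArrayList<>();
        list.add("bar");
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        mutate(names);
        reassign(names);
        System.out.println(names); // prints [foo]
    }
}
```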
Fixing the code to avoid the pass-by-value problem
Change your code to:
class thread_creation extends Thread {
    static int t; // now its global!
    public void run() {
        increment();
    }
    public void increment() {
        for(int i =0 ; i<10 ; i++) {
            t++;
            // System.out.println(t);
        }
    }
}
public class test {
    public static void main(String[] args) throws Exception {
        thread_creation t1 = new thread_creation();
        thread_creation t2 = new thread_creation();
        t1.start();
        Thread.sleep(500);
        t2.start();
        Thread.sleep(500);
        System.out.println(thread_creation.t);
    }
}
Note that I remarked out the print line. I did that intentionally - see below. If you run the above code, you'd think you see 20, but depending on your hardware, the OS, the song playing on your mp3 playing app, which websites you have open, and the phase of the moon, it may be less than 20. So what's going on there? Enter the...
The evil coin.
The relevant spec here is the JMM (The Java Memory Model). This spec explains precisely what a JVM must do, and therefore, what a JVM is free not to do, especially when it comes to how memory is actually managed.
The crucial aspect is the following:
Any effects (updates to fields, such as that t field) may or may not be observable, JVM's choice. There's no guarantee that anything you do is visible to anything else... unless there exists a Happens-Before/Happens-After relationship: Any 2 statements with such a relationship have the property that the JVM guarantees that you cannot observe the lack of the update done by the HB line from the HA line.
HB/HA can be established in various ways:
The 'natural' way: Anything that is 'before' something else and runs in the same thread has an HB/HA relationship. In other words, if you do in one thread x++; System.out.println(x); then you can't observe that the x++ hasn't happened yet. It's stated like this so that if you're not observing, you get no guarantees, which gives the JVM the freedom to optimize. For example, given x++;y++; and that's all you do, the JVM is free to re-order that and increment y before x. Or not. There are no guarantees, a JVM can do whatever it wants.
synchronized. The moment of 'exiting' a synchronized (x) {} block has HB to the HA of another thread 'entering' the top of any synchronized block on the same object, if it enters later.
volatile - but note that with volatile it's basically impossible to tell which one came first. But one of them did, and any interaction with a volatile field is HB relative to another thread accessing the same field later.
thread starting. thread.start() is HB relative to the first line of the run() of that thread.
thread joining. thread.join() is HA relative to the last line of that thread: once join() returns, everything that thread did is visible to the joining thread.
There are a few more exotic ways to establish HB/HA but that's pretty much it.
Crucially, in your code there is no HB/HA between any of the statements that modify or print t!
In other words, the JVM is free to run it all in such a way that the effects of various t++ statements run by one thread aren't observed by another thread.
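As a minimal sketch of how such a relationship makes a write to a plain field reliably visible: Thread.start() and Thread.join() are two of the happens-before edges the Java Language Specification defines. HappensBeforeDemo and its field are hypothetical names used only for this illustration.

```java
// Hypothetical demo class; plainField is deliberately neither volatile nor locked.
class HappensBeforeDemo {
    static int plainField;

    static int writeThenJoin() {
        plainField = 0;
        Thread writer = new Thread(() -> plainField = 42);
        writer.start();        // start() happens-before every action in the new thread
        try {
            writer.join();     // every action in the thread happens-before join() returning
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return plainField;     // guaranteed to observe the write of 42
    }

    public static void main(String[] args) {
        System.out.println(writeThenJoin()); // prints 42
    }
}
```

Without the join() call, reading plainField from the main thread would be a data race, and nothing would guarantee that 42 is observed.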
What the.. WHY????
Because of efficiency. Your memory banks on your CPU are, relative to how fast CPUs are, oceans away from the CPU core. Fetching or writing to core memory from a CPU takes an incredibly long time - your CPU is twiddling its thumbs for a very long time while it waits for the memory controller to get the job done. It could be running hundreds of instructions in that time.
So, CPU cores do not write to memory AT ALL. Instead they work with caches: They have an on-core cache page, and the only interaction with your main memory banks (which are shared by CPU cores) is 'load in an entire cache page' and 'write an entire cache page'. That cache page is then effectively a 'local copy' that only that core can see and interact with (but can do so very very quickly, as that IS very close to the core, unlike the main memory banks), and then once the algorithm is done it can flush that page back to main memory.
The JVM needs to be free to use this. Had the JVM actually worked like you want (that anything any thread does is instantly observable by all others), then anything that any line does must first wait 500 cycles to load the relevant page, then wait another 500 cycles to write it back. All java apps would literally be 1000x slower than they could be.
This in passing also explains that actual synchronizing is really slow. Nothing java can do about that, it is a fundamental limitation of our modern multi-core CPUs.
So, evil coin?
Note that the JVM does not guarantee that the CPU must necessarily work with this cache stuff, nor does it make any promises about when cache pages are flushed. It merely limits the guarantees so that JVMs can be efficiently written on CPUs that work like that.
That means that any read or write to any field any java code ever does can best be thought of as follows:
The JVM first flips a coin. On heads, it uses a local cached copy. On tails, it copies over the value from some other thread's cached copy instead.
The coin is evil: It is not reliably a 50/50 arrangement. It is entirely plausible that throughout developing a feature and testing it, the coin lands tails every time it is flipped. It remains flipping tails 100% of the time for the first week that you deployed it. And then just when that big potential customer comes in and you're demoing your app, the coin, being an evil, evil coin, starts flipping heads a few times and breaking your app.
The correct conclusion is that the coin will mess with you and that you cannot unit test against it. The only way to win the game is to ensure that the coin is never flipped.
You do this by never touching a field from multiple threads unless it is constant (final, or simply never changes), or if all access to it (both reads and writes) has clearly established HB/HA between all threads.
This is hard to do. That's why the vast majority of apps don't do it at all. Instead, they:
Talk between threads using a database, which has vastly more advanced synchronization primitives: Transactions.
Talk using a message bus such as RabbitMQ or similar.
Use stuff from the java.util.concurrent package such as CountDownLatch, ForkJoinPool, ConcurrentMap, or AtomicInteger. These are easier to use (specifically: it is a lot harder to write code against these abstractions that is buggy in a way that cannot be observed or tested for on the developer's machine and only blows up much later in production. But not impossible, of course).
Let's fix it!
volatile doesn't 'fix' ++. x++; is 'read x, increment by 1, write result to x' and volatile doesn't make that atomic, so we cannot use this. We can either replace t++ with:
synchronized(thread_creation.class) {
    t++;
}
Which works fine but is really slow (and you shouldn't lock on publicly visible stuff if you can help it, so make a custom object to lock on, but you get the gist hopefully), or, better, dig into that j.u.c package for something that seems useful. And so there is! AtomicInteger!
class thread_creation extends Thread {
    static AtomicInteger t = new AtomicInteger();
    public void run() {
        increment();
    }
    public void increment() {
        for(int i =0 ; i<10 ; i++) {
            t.incrementAndGet();
        }
    }
}
public class test {
    public static void main(String[] args) throws Exception {
        thread_creation t1 = new thread_creation();
        thread_creation t2 = new thread_creation();
        t1.start();
        Thread.sleep(500);
        t2.start();
        Thread.sleep(500);
        System.out.println(thread_creation.t.get());
    }
}
That code will print 20. Every time (unless those threads take longer than 500msec which technically could be, but is rather unlikely of course).
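For completeness, here is a sketch of the synchronized variant mentioned earlier, using a private lock object instead of the publicly visible class, and join() instead of sleeping so the result no longer depends on the 500 msec guess. LockedCounterDemo, LOCK and run are hypothetical names, not part of the code above.

```java
// Hypothetical demo class: synchronized on a private lock plus join().
class LockedCounterDemo {
    private static final Object LOCK = new Object(); // not reachable by outside code
    private static int t;                            // guarded by LOCK

    static int run() {
        synchronized (LOCK) {
            t = 0;
        }
        Runnable work = () -> {
            for (int i = 0; i < 10; i++) {
                synchronized (LOCK) {
                    t++;                             // read-modify-write under the lock
                }
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();                               // wait for completion, no timing guess
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        synchronized (LOCK) {
            return t;
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 20
    }
}
```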
Why did you remark out the print statement?
That HB/HA stuff can sneak up on you: When you call code you did not write, such as System.out.println, who knows what kind of HB/HA relationships are in that code? Javadoc isn't that kind of specific, they won't tell you. Turns out that on most OSes and JVM implementations, interaction with standard out, such as System.out.println, causes synchronization; either the JVM does it, or the OS does. Thus, introducing print statements 'to test stuff' doesn't work - that makes it impossible to observe the race conditions your code does have. Similarly, involving debuggers is a great way to make that coin really go evil on you and flip juuust so that you can't tell your code is buggy.
That is why I remarked it out, because with it in, I bet on almost all hardware you end up seeing 20 even though the JVM doesn't guarantee it and that first version is broken. Even if on your particular machine, on this day, with this phase of the moon, it seems to reliably print 20 every single time you run it.

why using two threads for a counter decreases performance in Java?

In Python, using two threads for a simple counter program (as demonstrated below) is slower than the same program with a single thread. The reason given for this is the mechanism behind the Global Interpreter Lock.
I tested the same in Java to see the performance. Here again, I see that a single thread outperforms the two-threaded version by a significant margin. Why is that so?
Here is the code:
public class ThreadTiming {
    static void threadMessage(String message) {
        String threadName =
            Thread.currentThread().getName();
        System.out.format("%s: %s%n",
                          threadName,
                          message);
    }
    private static class Counter implements Runnable {
        private int count=500000000;
        @Override
        public void run() {
            while(count>0) {
                count--;
            }
            threadMessage("done processing");
        }
    }
    public static void main(String[] args) throws InterruptedException{
        Thread t1 = new Thread(new Counter());
        Thread t2 = new Thread(new Counter());
        long startTime=System.currentTimeMillis();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        long endTime=System.currentTimeMillis();
        System.out.println("Time taken by two threads "+ (endTime-startTime)/1000.0);
        startTime=System.currentTimeMillis();
        Calculate(2*500000000);
        endTime=System.currentTimeMillis();
        System.out.println("Time taken by single thread "+ (endTime-startTime)/1000.0);
    }
    public static void Calculate(int x){
        while (x>0){
            x--;
        }
        threadMessage("Done processing");
    }
}
Output:
Thread-1: done processing
Thread-2: done processing
Time taken by two threads 0.052
main: Done processing
Time taken by single thread 0.0010
Very simple. The single-threaded version uses a local variable, and HotSpot has no problem proving that it never leaves the scope, hence the whole function is reduced to a no-op.
On the other hand, proving that the instance variable never leaves scope (hello reflection!) is much harder, and obviously HotSpot cannot do it here, hence the loop isn't removed.
On a general note, benchmarking is hard (I count at least three other mistakes that could lead to "wrong" results) and requires tons of knowledge. You are better off using JMH (the Java Microbenchmark Harness), which takes care of most things.
The basic answer is you have code the optimiser can eliminate and you are timing how long it takes to detect this. You are also adding the time it takes to start and stop two threads which could be more than half this time.
The second test doesn't start a new thread, it uses the current one so you just need to wait for it to detect the loop doesn't do anything.
For example you have timed that a single thread can do 1 billion loops in 1 ms. If you have a 3.33 GHz processor, this would have to do 300 iterations in a single clock cycle. If this sounds too good to be true, that is because it is. ;)
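The arithmetic behind that figure, written out under the same assumptions as the paragraph above (10^9 loop iterations, a measured time of 1 ms, and a 3.33 GHz clock):

```latex
\[
\frac{10^{9}\ \text{iterations}}
     {(3.33 \times 10^{9}\ \text{cycles/s}) \times (10^{-3}\ \text{s})}
= \frac{10^{9}}{3.33 \times 10^{6}}\ \frac{\text{iterations}}{\text{cycle}}
\approx 300\ \text{iterations per cycle}
\]
```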
@Voo seems to be generally right, as you can see by moving ThreadTiming.Counter.count to be a local variable of ThreadTiming.Counter.run(). That eliminates any possibility of non-local references, and the resulting program exhibits much less single-thread vs. dual-thread performance difference.
HOWEVER, that doesn't eliminate all the difference. The timing reported for the dual-thread case is still worse by about a factor of 9 for me. But if I then swap so that the single-threaded case is measured first, the two-thread case wins by about a factor of 2.
But that, too, is illusory, because the two tests are running different -- albeit similar -- code. The single-thread case can easily be made to run exactly the same code as the dual thread case:
Counter c = new Counter();
c.run();
c.run();
(Using the version where count is local to run().) If that approach is used then I observe no difference in performance (at the resolution of the measurement) between single- and dual-threaded, regardless of which case is tested first.
As @Voo said, benchmarking is hard.
It just looks like it's from loading each thread and its context into the CPU. It's thrashing. There's probably a more detailed answer waiting to strike, but let's start by posting the basics...
When running two threads, your timer is including the time taken to launch the two threads. Creating and starting threads has some overhead, and in this case, the overhead is longer than the time to actually carry out the process.

These three threads don't take turns when using Thread.yield()?

In an effort to practice my rusty Java, I wanted to try a simple multi-threaded shared data example and I came across something that surprised me.
Basically we have a shared AtomicInteger counter between three threads that each take turns incrementing and printing the counter.
main
AtomicInteger counter = new AtomicInteger(0);
CounterThread ct1 = new CounterThread(counter, "A");
CounterThread ct2 = new CounterThread(counter, "B");
CounterThread ct3 = new CounterThread(counter, "C");
ct1.start();
ct2.start();
ct3.start();
CounterThread
public class CounterThread extends Thread
{
    private AtomicInteger _count;
    private String _id;
    public CounterThread(AtomicInteger count, String id)
    {
        _count = count;
        _id = id;
    }
    public void run()
    {
        while(_count.get() < 1000)
        {
            System.out.println(_id + ": " + _count.incrementAndGet());
            Thread.yield();
        }
    }
}
I expected that when each thread executed Thread.yield(), that it would give over execution to another thread to increment _count like this:
A: 1
B: 2
C: 3
A: 4
...
Instead, I got output where A would increment _count 100 times, then pass it off to B. Sometimes all three threads would take turns consistently, but sometimes one thread would dominate for several increments.
Why doesn't Thread.yield() always yield processing over to another thread?
I expected that when each thread executed Thread.yield(), that it would give over execution to another thread to increment _count like this:
In threaded applications that are spinning, predicting the output is extremely hard. You would have to do a lot of work with locks and stuff to get perfect A:1 B:2 C:3 ... type output.
The problem is that everything is a race condition and unpredictable due to hardware, race-conditions, time-slicing randomness, and other factors. For example, when the first thread starts, it may run for a couple of millis before the next thread starts. There would be no one to yield() to. Also, even if it yields, maybe you are on a 4 processor box so there is no reason to pause any other threads at all.
Instead, I got output where A would increment _count 100 times, then pass it off to B. Sometimes all three threads would take turns consistently, but sometimes one thread would dominate for several increments.
Right, in general with these spinning loops, you see bursts of output from a single thread as it gets time slices. This is also confused by the fact that System.out.println(...) is synchronized which affects the timing as well. If it was not doing a synchronized operation, you would see even more bursty output.
Why doesn't Thread.yield() always yield processing over to another thread?
I very rarely use Thread.yield(). It is a hint to the scheduler at best and probably is ignored on some architectures. The idea that it "pauses" the thread is very misleading. It may cause the thread to be put back to the end of the run queue but there is no guarantee that there are any threads waiting so it may keep running as if the yield were removed.
See my answer here for more info : unwanted output in multithreading
Let's read some javadoc, shall we?
A hint to the scheduler that the current thread is willing to yield
its current use of a processor. The scheduler is free to ignore this
hint.
[...]
It is rarely appropriate to use this method. It may be useful
for debugging or testing purposes, where it may help to reproduce bugs
due to race conditions. It may also be useful when designing
concurrency control constructs such as the ones in the
java.util.concurrent.locks package.
You cannot guarantee that another thread will obtain the processor after a yield(). It's up to the scheduler and it seems he/she doesn't want to in your case. You might consider sleep()ing instead, for testing.
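If strict A, B, C turn-taking is really wanted, it has to be coordinated explicitly rather than hoped for from the scheduler. The following is one possible sketch using wait/notifyAll on a shared lock. TurnTakingDemo and its methods are hypothetical names, and the output is collected in a list rather than printed so the order is easy to inspect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the ids mirror the question's threads A, B and C.
class TurnTakingDemo {
    private final Object lock = new Object();
    private final List<String> log = new ArrayList<>(); // guarded by lock
    private int count = 0;                               // guarded by lock

    // Worker number `index` may only increment while count % workers == index.
    private void work(String id, int index, int workers, int limit) {
        synchronized (lock) {
            while (true) {
                while (count < limit && count % workers != index) {
                    try {
                        lock.wait();          // releases the lock while waiting
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                if (count >= limit) {
                    return;                   // all increments are done
                }
                count++;
                log.add(id + ": " + count);
                lock.notifyAll();             // let the other workers re-check their turn
            }
        }
    }

    static List<String> run(int limit) {
        TurnTakingDemo demo = new TurnTakingDemo();
        String[] ids = {"A", "B", "C"};
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            final int index = i;
            threads.add(new Thread(() -> demo.work(ids[index], index, ids.length, limit)));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return demo.log;
    }

    public static void main(String[] args) {
        run(6).forEach(System.out::println); // A: 1, B: 2, C: 3, A: 4, B: 5, C: 6
    }
}
```

Note that this fully serializes the three threads, so it demonstrates ordering, not a performance gain.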

Weak performance of CyclicBarrier with many threads: Would a tree-like synchronization structure be an alternative?

Our application requires all worker threads to synchronize at a defined point. For this we use a CyclicBarrier, but it does not seem to scale well. With more than eight threads, the synchronization overhead seems to outweigh the benefits of multithreading. (However, I cannot support this with measurement data.)
EDIT: Synchronization happens very frequently, in the order of 100k to 1M times.
If synchronization of many threads is "hard", would it help building a synchronization tree? Thread 1 waits for 2 and 3, which in turn wait for 4+5 and 6+7, respectively, etc.; after finishing, threads 2 and 3 wait for thread 1, thread 4 and 5 wait for thread 2, etc..
1
| \
2 3
|\ |\
4 5 6 7
Would such a setup reduce synchronization overhead? I'd appreciate any advice.
See also this featured question: What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?
With more than eight threads, the synchronization overhead seems to outweigh the benefits of multithreading. (However, I cannot support this with measurement data.)
Honestly, there's your problem right there. Figure out a performance benchmark and prove that this is the problem, or risk spending hours / days solving the entirely wrong problem.
You are thinking about the problem in a subtly wrong way that tends to lead to very bad coding. You don't want to wait for threads, you want to wait for work to be completed.
Probably the most efficient way is a shared, waitable counter. When you make new work, increment the counter and signal the counter. When you complete work, decrement the counter. If there is no work to do, wait on the counter. If you drop the counter to zero, check if you can make new work.
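A rough sketch of such a waitable counter follows, assuming a simplified case where all work is created up front and nobody waits for new work to appear. WorkCounter, WorkCounterDemo and their methods are hypothetical names; java.util.concurrent also ships ready-made classes in this spirit, such as CountDownLatch and Phaser.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: the caller waits for work to finish, not for threads.
class WorkCounterDemo {
    static int process(int items) {
        WorkCounter counter = new WorkCounter();
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (int i = 0; i < items; i++) {
                counter.workAdded();               // count the work before handing it off
                pool.execute(() -> {
                    try {
                        processed.incrementAndGet(); // stand-in for the real work
                    } finally {
                        counter.workDone();
                    }
                });
            }
            counter.awaitIdle();                   // blocks until the counter drops to zero
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(process(100)); // prints 100
    }
}

// Hypothetical waitable counter built on the intrinsic lock of the object.
class WorkCounter {
    private int pending;                           // guarded by this

    synchronized void workAdded() {
        pending++;
    }

    synchronized void workDone() {
        if (--pending == 0) {
            notifyAll();                           // wake anyone waiting for "no work left"
        }
    }

    synchronized void awaitIdle() throws InterruptedException {
        while (pending > 0) {
            wait();
        }
    }
}
```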
If I understand correctly, you're trying to break your solution up into parts and solve them separately, but concurrently, right? Then have your current thread wait for those tasks? You want to use something like a fork/join pattern.
List<CustomThread> threads = new ArrayList<CustomThread>();
for (Something something : somethings) {
    threads.add(new CustomThread(something));
}
for (CustomThread thread : threads) {
    thread.start();
}
for (CustomThread thread : threads) {
    thread.join(); // Blocks until thread is complete
}
List<Result> results = new ArrayList<Result>();
for (CustomThread thread : threads) {
    results.add(thread.getResult());
}
// do something with results.
In Java 7, there's even further support via a fork/join pool. See ForkJoinPool and its trail, and use Google to find one of many other tutorials.
You can recurse on this concept to get the tree you want, just have the threads you create generate more threads in the exact same way.
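As an illustration of the fork/join pool mentioned above, here is a minimal sketch that splits a summation recursively into a tree of subtasks. SumTask, its THRESHOLD value and the sum helper are hypothetical and only stand in for whatever work the real application divides up.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical task: sums data[from, to) by splitting the range in two.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;   // illustrative cut-off, tune as needed
    private final int[] data;
    private final int from;
    private final int to;

    SumTask(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;                            // small enough: do it directly
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                               // schedule the left half in the pool
        return right.compute() + left.join();      // do the right half here, then combine
    }

    static long sum(int[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[10_000];
        java.util.Arrays.fill(data, 1);
        System.out.println(sum(data)); // prints 10000
    }
}
```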
Edit: I was under the impression that you wouldn't be creating that many threads, so this is better for your scenario. The example won't be horribly short, but it goes along the same vein as the discussion you're having in the other answer, that you can wait on jobs, not threads.
First, you need a Callable for your sub-jobs that takes an Input and returns a Result:
import java.util.concurrent.Callable;

public class SubJob implements Callable<Result> {
    private final Input input;

    // The constructor name must match the class name
    public SubJob(Input input) {
        this.input = input;
    }

    @Override
    public Result call() {
        // Actually process input here and return a result
        return JobWorker.processInput(input);
    }
}
Then to use it, create an ExecutorService with a fix-sized thread pool. This will limit the number of jobs you're running concurrently so you don't accidentally thread-bomb your system. Here's your main job:
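import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MainJob extends Thread {
    // Adjust the pool to the appropriate number of concurrent
    // threads you want running at the same time
    private static final ExecutorService pool = Executors.newFixedThreadPool(30);
    private final List<Input> inputs;

    public MainJob(List<Input> inputs) {
        super("MainJob");
        this.inputs = new ArrayList<Input>(inputs);
    }

    @Override
    public void run() {
        CompletionService<Result> compService = new ExecutorCompletionService<Result>(pool);
        List<Result> results = new ArrayList<Result>();
        int submittedJobs = inputs.size();
        for (Input input : inputs) {
            // Starts the job when a thread is available
            compService.submit(new SubJob(input));
        }
        try {
            for (int i = 0; i < submittedJobs; i++) {
                // take() blocks until a job is completed; get() unwraps its result
                results.add(compService.take().get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
            return;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause()); // a sub-job failed
        }
        // Do something with results
    }
}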
This will allow you to reuse threads instead of generating a bunch of new ones every time you want to run a job. The completion service will do the blocking while it waits for jobs to complete. Also note that the results list will be in order of completion.
You can also use Executors.newCachedThreadPool, which creates a pool with no upper limit (like using Integer.MAX_VALUE). It will reuse a thread if one is available and create a new one if all the threads in the pool are running a job. This may be desirable later if you start encountering deadlocks (because there are so many jobs waiting in the fixed thread pool that sub-jobs can't run and complete). This will at least limit the number of threads you're creating/destroying.
Lastly, you'll need to shut down the ExecutorService manually, perhaps via a shutdown hook, or the threads that it contains will not allow the JVM to terminate.
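A minimal sketch of that manual shutdown, assuming a hypothetical helper named shutdownAndAwait and a timeout picked only for illustration. It follows the usual pattern of shutdown(), then awaitTermination(), then shutdownNow() as a fallback:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolShutdown {
    // Hypothetical helper: returns true if the pool terminated within the timeout.
    static boolean shutdownAndAwait(ExecutorService pool, long timeoutSeconds) {
        pool.shutdown(); // stop accepting new tasks; already queued tasks still run
        try {
            if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                pool.shutdownNow(); // interrupt tasks that are still running
                return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            }
            return true;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        pool.submit(new Runnable() {
            public void run() {
                System.out.println("job done");
            }
        });
        // Without this call the pool's non-daemon threads keep the JVM alive.
        System.out.println("terminated: " + shutdownAndAwait(pool, 5));
    }
}
```

Calling the helper at the end of the main job is usually enough; a shutdown hook only runs once the JVM has already started exiting (for example after System.exit or Ctrl-C).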
Hope that helps/makes sense.
If you have a generation task (like the example of processing columns of a matrix) then you may be stuck with a CyclicBarrier. That is to say, if every single piece of work for generation 1 must be done in order to process any work for generation 2, then the best you can do is to wait for that condition to be met.
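To illustrate the "wait for the whole generation" idea, here is a rough sketch with a CyclicBarrier. The class name, the arrays and the toy arithmetic are made up for the example; the only point is that no worker starts generation 2 until every worker has finished generation 1:

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class BarrierGenerations {
    static int[] run(final int workers) throws InterruptedException {
        final int[] gen1 = new int[workers];
        final int[] gen2 = new int[workers];
        final CyclicBarrier barrier = new CyclicBarrier(workers);
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(new Runnable() {
                public void run() {
                    try {
                        gen1[id] = id + 1; // generation 1 work for this worker
                        barrier.await();   // wait until every worker finished generation 1
                        // generation 2 reads a neighbour's generation 1 value
                        gen2[id] = gen1[id] + gen1[(id + 1) % workers];
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (BrokenBarrierException e) {
                        throw new IllegalStateException(e);
                    }
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            t.join(); // wait for generation 2 to finish everywhere
        }
        return gen2;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run(4))); // prints [3, 5, 7, 5]
    }
}
```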
If there are thousands of tasks in each generation, then it may be better to submit all of those tasks to an ExecutorService (ExecutorService.invokeAll) and simply wait for the results to return before proceeding to the next step. The advantage of doing this is that it cuts down on context switching and on the time/memory wasted allocating hundreds of threads when the number of physical CPUs is bounded.
If your tasks are not generational but instead more of a tree-like structure in which only a subset need to be complete before the next step can occur on that subset, then you might want to consider a ForkJoinPool, and you don't need Java 7 to do that. You can get a reference implementation for Java 6. It is part of JSR 166, which introduced the ForkJoinPool library code.
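Something along these lines, as a sketch only: GenerationRunner, runGeneration and the doubling "work" are placeholders, not part of any library. invokeAll blocks until every task of the generation is done and returns the futures in the same order as the task list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class GenerationRunner {
    // Runs one generation: every input is processed before this method returns.
    static List<Integer> runGeneration(ExecutorService pool, List<Integer> inputs)
            throws InterruptedException, ExecutionException {
        List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
        for (final Integer input : inputs) {
            tasks.add(new Callable<Integer>() {
                public Integer call() {
                    return input * 2; // placeholder for the real per-item work
                }
            });
        }
        List<Integer> results = new ArrayList<Integer>();
        // invokeAll waits for all tasks; all futures are already done here
        for (Future<Integer> f : pool.invokeAll(tasks)) {
            results.add(f.get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Integer> gen1 = runGeneration(pool, Arrays.asList(1, 2, 3));
            // Generation 2 starts only after generation 1 is complete
            List<Integer> gen2 = runGeneration(pool, gen1);
            System.out.println(gen2); // prints [4, 8, 12]
        } finally {
            pool.shutdown();
        }
    }
}
```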
I also have another answer which provides a rough implementation in Java 6:
public class Fib implements Callable<Integer> {
int n;
Executor exec;
Fib(final int n, final Executor exec) {
this.n = n;
this.exec = exec;
}
/**
* {@inheritDoc}
*/
@Override
public Integer call() throws Exception {
if (n == 0 || n == 1) {
return n;
}
//Divide the problem
final Fib n1 = new Fib(n - 1, exec);
final Fib n2 = new Fib(n - 2, exec);
//FutureTask only allows run to complete once
final FutureTask<Integer> n2Task = new FutureTask<Integer>(n2);
//Ask the Executor for help
exec.execute(n2Task);
//Do half the work ourselves
final int partialResult = n1.call();
//Do the other half of the work if the Executor hasn't
n2Task.run();
//Return the combined result
return partialResult + n2Task.get();
}
}
Keep in mind that if you have divided the tasks up too much and the unit of work being done by each thread is too small, there will be negative performance impacts. For example, the above code is a terribly slow way to solve Fibonacci.

Dynamically spawn threads Java for a case like this

Suppose I have a List of integers. Each int I have must be multiplied by 100. To do this with a for loop I'd construct something like the following:
for(Integer i : numbers){
i = i*100;
}
But suppose for performance reasons I wanted to simultaneously spawn a thread for each number in numbers and perform a single multiplication on each thread returning the result to the same List. What would be the best way of doing such a thing?
My actual problem isn't as trivial as multiplication of ints but rather a task that each iteration of the loop takes a substantial amount of time, and so I'd like to do them all at the same time in order to decrease execution time.
If you can use Java 7, the Fork/Join framework is created for precisely this problem. If not, there is a JSR166 (the fork/join proposal) source code at this link.
Essentially, you would create a task for each step (in your case, for each index in the array) and submit it to a service that can pool threads (the fork part). Then you wait for everything to complete and merge the results (the join part).
The reason to use a service, as opposed to launching your own threads, is that creating threads carries overhead, and in some cases you may want to limit the number of threads. For example, if you're on a four CPU machine, it wouldn't make much sense to have more than four threads running concurrently.
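As a rough sketch of the fork and join steps described above (Java 7+), assuming an int array instead of a List and a made-up MultiplyTask class with an arbitrary THRESHOLD:

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class MultiplyTask extends RecursiveAction {
    private static final int THRESHOLD = 2; // illustrative cut-off; tune it by measuring
    private final int[] data;
    private final int from;
    private final int to;

    MultiplyTask(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: do the work directly in this thread
            for (int i = from; i < to; i++) {
                data[i] *= 100;
            }
            return;
        }
        int mid = (from + to) >>> 1;
        // Fork both halves and wait for both to complete (the join part)
        invokeAll(new MultiplyTask(data, from, mid), new MultiplyTask(data, mid, to));
    }

    public static void main(String[] args) {
        int[] numbers = {1, 3, 19, 7, 42};
        new ForkJoinPool().invoke(new MultiplyTask(numbers, 0, numbers.length));
        System.out.println(Arrays.toString(numbers)); // prints [100, 300, 1900, 700, 4200]
    }
}
```

Each task touches a disjoint index range, so the subtasks never write to the same element.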
If your tasks are independent of each other, you can use the Executors framework.
Note that you would gain more speed if you create no more threads than you have CPU cores at your disposal.
Sample:
class WorkInstance {
final int argument;
final int result;
WorkInstance(int argument, int result) {
this.argument = argument;
this.result = result;
}
public String toString() {
return "WorkInstance{" +
"argument=" + argument +
", result=" + result +
'}';
}
}
public class Main {
public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
int numOfCores = 4;
final ExecutorService executor = Executors.newFixedThreadPool(numOfCores);
List<Integer> toMultiplyBy100 = Arrays.asList(1, 3, 19);
List<Future<WorkInstance>> tasks = new ArrayList<Future<WorkInstance>>(toMultiplyBy100.size());
for (final Integer workInstance : toMultiplyBy100)
tasks.add(executor.submit(new Callable<WorkInstance>() {
public WorkInstance call() throws Exception {
return new WorkInstance(workInstance, workInstance * 100);
}
}));
for (Future<WorkInstance> result : tasks)
System.out.println("Result: " + result.get());
executor.shutdown();
}
}
Spawning a new thread for "each number in numbers" is not a good idea. However, using a fixed thread pool of size matching the number of cores/CPUs might increase your performance slightly.
The quick and dirty way to get started is to use a thread pool, such as one returned by Executors.newCachedThreadPool(). Then create tasks that implement Runnable and submit() them to your thread pool. Also read up on the classes and interfaces linked from those Javadocs; there is lots of cool stuff you can try.
See the concurrency chapter in Effective Java, 2nd ed for a great introduction to multithreaded Java.
Take a look at ThreadPoolExecutor and create a task for each iteration. A prerequisite is that those tasks are independent though.
The use of a thread pool allows you to create a task per iteration but only run as many concurrently as there are threads, since you'd want to limit the number of threads, for example to the number of cores or hardware threads available. Creating a whole lot of threads would be counterproductive since they'd require a lot of context switching, which hurts performance.
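A sketch of that idea, with one task per element and the results written back into the same list. PoolPerIteration and multiplyInPlace are hypothetical names, and sizing the pool with availableProcessors() is an assumption about what "number of cores" should mean on your machine:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolPerIteration {
    // The list must support set(), e.g. an ArrayList or Arrays.asList(...)
    static List<Integer> multiplyInPlace(List<Integer> numbers)
            throws InterruptedException, ExecutionException {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per iteration; only 'threads' of them run at the same time
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (final Integer n : numbers) {
                futures.add(pool.submit(new Callable<Integer>() {
                    public Integer call() {
                        return n * 100; // stands in for the expensive per-item work
                    }
                }));
            }
            // Only the calling thread writes to the list, so there is no contention on it
            for (int i = 0; i < futures.size(); i++) {
                numbers.set(i, futures.get(i).get());
            }
            return numbers;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(multiplyInPlace(Arrays.asList(1, 3, 19))); // prints [100, 300, 1900]
    }
}
```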
I assume you are on a commodity PC. You will at most have N threads executing at the same time on your machine, where N is the # of cores of your CPUs, so most likely in the [1, 4] range. Plus the contention on the shared list.
But even more importantly, the cost of spawning a new thread is much greater than the cost of doing a multiplication. One could have a thread pool... but in this specific case, it's not even worth talking about it. Really.
If it is the only application on a node, you should determine which number of threads will finish the job most quickly (max_throughput). This depends on the processor you use and on how much the JIT can optimize your code, so there is no general advice other than to measure.
After that you could distribute the jobs to a pool of worker threads by their index modulo max_throughput.
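One way to read the modulo distribution, sketched with plain threads. StripedWorkers and multiplyAll are made-up names, and maxThroughput is assumed to be the thread count you measured beforehand:

```java
import java.util.Arrays;

class StripedWorkers {
    static int[] multiplyAll(final int[] input, final int maxThroughput)
            throws InterruptedException {
        final int[] output = new int[input.length];
        Thread[] workers = new Thread[maxThroughput];
        for (int w = 0; w < maxThroughput; w++) {
            final int offset = w;
            workers[w] = new Thread(new Runnable() {
                public void run() {
                    // This worker handles every index i with i % maxThroughput == offset
                    for (int i = offset; i < input.length; i += maxThroughput) {
                        output[i] = input[i] * 100;
                    }
                }
            });
            workers[w].start();
        }
        for (Thread t : workers) {
            t.join(); // wait for all stripes to finish
        }
        return output;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = multiplyAll(new int[] {1, 3, 19, 5, 8}, 2);
        System.out.println(Arrays.toString(result)); // prints [100, 300, 1900, 500, 800]
    }
}
```

Because each index belongs to exactly one worker, no locking is needed on the output array.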
