Odd c++/java multi threading performance results compared to single thread


I have been struggling for two days to understand C++ thread pool performance compared to a single thread, so I decided to run the same test in Java, and that is when I noticed that the behaviour is the same in C++ and Java. My code is simple and straightforward.
package com.examples.threading;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
public class ThreadPool {
final static AtomicLong lookups = new AtomicLong(0);
final static AtomicLong totalTime = new AtomicLong(0);
public static class Task implements Runnable
{
int start = 0;
Task(int s) {
start = s;
}
@Override
public void run()
{
for (int j = start ; j < start + 3000; j++ ) {
long st = System.nanoTime();
boolean a = false;
long et = System.nanoTime();
totalTime.getAndAdd((et - st));
lookups.getAndAdd(1L);
}
}
}
public static void main(String[] args)
{
// change threads from 1 -> 100 then you will get different numbers
ExecutorService executor = Executors.newFixedThreadPool(1);
for (int i = 0; i <= 1000000; i++)
{
if (i % 3000 == 0) {
Task task = new Task(i);
executor.execute(task);
System.out.println("in time " + (totalTime.doubleValue()/lookups.doubleValue()) + " lookups: " + lookups.toString());
}
}
executor.shutdown();
while (!executor.isTerminated()) {
;
}
System.out.println("in time " + (totalTime.doubleValue()/lookups.doubleValue()) + " lookups: " + lookups.toString());
}
}
Now, when you run the same code with a different pool size, say 100 threads, the reported time (the average per loop iteration, in nanoseconds) changes.
one thread:
in time 36.91493612774451 lookups: 1002000
100 threads:
in time 141.47934530938124 lookups: 1002000
The question is: the code is the same, so why is the measured time different? What exactly is going on here?

You have a couple of obvious possibilities here.
One is that System.nanoTime may serialize internally, so even though each thread is making its call separately, it may internally execute those calls in sequence (and, for example, queue up calls as they come in). This is particularly likely when nanoTime directly accesses a hardware clock, such as on Windows (where it uses Windows' QueryPerformanceCounter).
Another point at which you get essentially sequential execution is your atomic variables. Even though you're using lock-free atomics, the basic fact is that each has to execute a read/modify/write as an atomic sequence. With locked variables, that's done by locking, then reading, modifying, writing, and unlocking. With lock-free, you eliminate some of the overhead in doing that, but you're still stuck with the fact that only one thread can successfully read, modify, and write a particular memory location at a given time.
In this case the only "work" each thread is doing is trivial, and the result is never used, so the optimizer can (and probably will) eliminate it entirely. So all you're really measuring is the time to read the clock and increment your variables.
To gain at least some of the speed back, you could (for one example) give each thread its own lookups and totalTime variable. Then when all the threads finish, you can add together the values for the individual threads to get an overall total for each.
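A minimal sketch of the per-thread totals idea, assuming Java 8 or later. The class name PerTaskCounters and the method runTasks are made up for illustration and do not come from the question. Each task accumulates into local variables and hands its totals back through a Future, so nothing shared is touched inside the hot loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: the names below are not from the original post.
class PerTaskCounters {

    // Runs `tasks` tasks of `iterations` loop passes each on `threads` threads
    // and returns {totalLookups, totalTimeNs}.
    static long[] runTasks(int threads, int tasks, int iterations) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        List<Future<long[]>> results = new ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            results.add(executor.submit(() -> {
                long lookups = 0; // local to this task, so no contention
                long timeNs = 0;
                for (int j = 0; j < iterations; j++) {
                    long st = System.nanoTime();
                    long et = System.nanoTime();
                    timeNs += et - st;
                    lookups++;
                }
                return new long[] {lookups, timeNs};
            }));
        }
        long totalLookups = 0;
        long totalTimeNs = 0;
        for (Future<long[]> f : results) { // combine once per task, after it finishes
            long[] r = f.get();
            totalLookups += r[0];
            totalTimeNs += r[1];
        }
        executor.shutdown();
        return new long[] {totalLookups, totalTimeNs};
    }

    public static void main(String[] args) throws Exception {
        // 334 tasks of 3000 iterations matches the 1002000 lookups in the question.
        long[] totals = runTasks(4, 334, 3000);
        System.out.println("in time " + ((double) totals[1] / totals[0])
                + " lookups: " + totals[0]);
    }
}
```

This only removes the contention on the two atomics. The calls to System.nanoTime are still there, so whatever cost they carry remains in the measurement.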
Preventing serialization of the timing is a little more difficult (to put it mildly). At least in the obvious design, each call to nanoTime directly accesses a hardware register, which (at least with most typical hardware) can only happen sequentially. It could be fixed at the hardware level (provide a high-frequency timer register that's directly readable per-core, guaranteed to be synced between cores). That's a somewhat non-trivial task, and (more importantly) most current hardware just doesn't include such a thing.
Other than that, do some meaningful work in each thread, so when you execute in multiple threads, you have something that can actually use the resources of your multiple CPUs/cores to run faster.

Related

Multithreaded vs Asynchronous programming in a single core

If in real time the CPU performs only one task at a time then how is multithreading different from asynchronous programming (in terms of efficiency) in a single processor system?
Let's say, for example, we have to count from 1 to IntegerMax. In the following program, on my multicore machine, the two-thread run takes almost half the time of the single-thread run. What if we ran this on a single-core machine? And is there any way we could achieve the same result there?
class Demonstration {
public static void main( String args[] ) throws InterruptedException {
SumUpExample.runTest();
}
}
class SumUpExample {
long startRange;
long endRange;
long counter = 0;
static long MAX_NUM = Integer.MAX_VALUE;
public SumUpExample(long startRange, long endRange) {
this.startRange = startRange;
this.endRange = endRange;
}
public void add() {
for (long i = startRange; i <= endRange; i++) {
counter += i;
}
}
static public void twoThreads() throws InterruptedException {
long start = System.currentTimeMillis();
SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);
Thread t1 = new Thread(() -> {
s1.add();
});
Thread t2 = new Thread(() -> {
s2.add();
});
t1.start();
t2.start();
t1.join();
t2.join();
long finalCount = s1.counter + s2.counter;
long end = System.currentTimeMillis();
System.out.println("Two threads final count = " + finalCount + " took " + (end - start));
}
static public void oneThread() {
long start = System.currentTimeMillis();
SumUpExample s = new SumUpExample(1, MAX_NUM );
s.add();
long end = System.currentTimeMillis();
System.out.println("Single thread final count = " + s.counter + " took " + (end - start));
}
public static void runTest() throws InterruptedException {
oneThread();
twoThreads();
}
}
Output:
Single thread final count = 2305843008139952128 took 1003
Two threads final count = 2305843008139952128 took 540

For a purely CPU-bound operation you are correct. Most (99.9999%) of programs need to do input, output, and invoke other services. Those are orders of magnitude slower than the CPU, so while waiting for the results of an external operation, the OS can schedule and run other (many other) processes in time slices.
Hardware multithreading benefits primarily when 2 conditions are met:
CPU-intensive operations;
That can be efficiently divided into independent subsets
Or you have lots of different tasks to run that can be efficiently divided among multiple hardware processors.

In the following program for my multicore machine, the two thread final count is almost half of the single thread count.
That is what I would expect from a valid benchmark when the application is using two cores.
However, looking at your code, I am somewhat surprised that you are getting those results ... so reliably.
Your benchmark doesn't take account of JVM warmup effects, particularly JIT compilation.
Your benchmark's add method could potentially be optimized by the JIT compiler to get rid of the loop entirely. (But at least the counts are "used" ... by printing them out.)
I guess you got lucky ... but I'm not convinced those results will be reproducible for all versions of Java, or if you tweaked the benchmark.
Please read this:
How do I write a correct micro-benchmark in Java?
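As a rough sketch of what accounting for warm-up can look like, the code below runs the summing loop a few times before taking any timings and keeps every result live so the loop cannot simply be discarded. The class name WarmupSketch, the method sumRange, and the round counts are assumptions chosen for illustration. A hand-rolled loop like this is not a substitute for a proper benchmark harness.

```java
// Hypothetical sketch: a hand-rolled warm-up loop, not a replacement for a
// dedicated micro-benchmark harness.
class WarmupSketch {

    // Sums the integers in [from, to], mirroring the add() method in the question.
    static long sumRange(long from, long to) {
        long counter = 0;
        for (long i = from; i <= to; i++) {
            counter += i;
        }
        return counter;
    }

    public static void main(String[] args) {
        long max = 100_000_000L;
        long sink = 0; // accumulates results so they count as "used"

        // Warm-up rounds: give the JIT compiler a chance to compile sumRange.
        for (int round = 0; round < 5; round++) {
            sink += sumRange(1, max);
        }

        // Measured rounds: report each one so outliers stay visible.
        for (int round = 0; round < 5; round++) {
            long start = System.nanoTime();
            long result = sumRange(1, max);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            sink += result;
            System.out.println("round " + round + ": sum = " + result
                    + " took " + elapsedMs + " ms");
        }
        System.out.println("ignore: " + sink);
    }
}
```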
What if we ran this in a single core machine?
Assuming the following:
You rewrote the benchmark to correct the flaws above.
You are running on a system where hardware hyper-threading1 is disabled2.
Then ... I would expect two threads to take more than twice as long as the one thread version.
Q: Why "more than"?
A: Because there is a significant overhead in starting a new thread. Depending on your hardware, OS and Java version, it could be more than a millisecond. Certainly, the time taken is significant if you repeatedly use and discard threads.
And is there any way we could achieve the same result there?
Not sure what you are asking here. But if you are asking how to simulate the behavior of one core on a multi-core machine, you would probably need to do this at the OS level. See https://superuser.com/questions/309617 for Windows and https://askubuntu.com/questions/483824 for Linux.
1 - Hyperthreading is a hardware optimization where a single core's processing hardware supports (typically) two hyper-threads. Each hyperthread has its own sets of registers, but it shares functional units such as the ALU with the other hyperthread. So the two hyperthreads behave like (typically) two cores, except that they may be slower, depending on the precise instruction mix. A typical OS will treat a hyperthread as if it is a regular core. Hyperthreading is typically enabled / disabled at boot time; e.g. via a BIOS setting.
2 - If hyperthreading is enabled, it is possible that two Java threads won't be twice as fast as one in a CPU-intensive computation like this ... due to possible slowdown caused by the "other" hyperthread on respective cores. Did someone mention that benchmarking is complicated?

Multi threaded matrix multiplication performance issue

I am using java for multi threaded multiplication. I am practicing multi threaded programming. Following is the code that I took from another post of stackoverflow.
public class MatMulConcur {
private final static int NUM_OF_THREAD =1 ;
private static Mat matC;
public static Mat matmul(Mat matA, Mat matB) {
matC = new Mat(matA.getNRows(),matB.getNColumns());
return mul(matA,matB);
}
private static Mat mul(Mat matA,Mat matB) {
int numRowForThread;
int numRowA = matA.getNRows();
int startRow = 0;
Worker[] myWorker = new Worker[NUM_OF_THREAD];
for (int j = 0; j < NUM_OF_THREAD; j++) {
if (j<NUM_OF_THREAD-1){
numRowForThread = (numRowA / NUM_OF_THREAD);
} else {
numRowForThread = (numRowA / NUM_OF_THREAD) + (numRowA % NUM_OF_THREAD);
}
myWorker[j] = new Worker(startRow, startRow+numRowForThread,matA,matB);
myWorker[j].start();
startRow += numRowForThread;
}
for (Worker worker : myWorker) {
try {
worker.join();
} catch (InterruptedException e) {
}
}
return matC;
}
private static class Worker extends Thread {
private int startRow, stopRow;
private Mat matA, matB;
public Worker(int startRow, int stopRow, Mat matA, Mat matB) {
super();
this.startRow = startRow;
this.stopRow = stopRow;
this.matA = matA;
this.matB = matB;
}
@Override
public void run() {
for (int i = startRow; i < stopRow; i++) {
for (int j = 0; j < matB.getNColumns(); j++) {
double sum = 0;
for (int k = 0; k < matA.getNColumns(); k++) {
sum += matA.get(i, k) * matB.get(k, j);
}
matC.set(i, j, sum);
}
}
}
}
}
I ran this program for 1,10,20,...,100 threads but performance is decreasing instead. Following is the time table
Thread 1 takes 18 Milliseconds
Thread 10 takes 18 Milliseconds
Thread 20 takes 35 Milliseconds
Thread 30 takes 38 Milliseconds
Thread 40 takes 43 Milliseconds
Thread 50 takes 48 Milliseconds
Thread 60 takes 57 Milliseconds
Thread 70 takes 66 Milliseconds
Thread 80 takes 74 Milliseconds
Thread 90 takes 87 Milliseconds
Thread 100 takes 98 Milliseconds
Any Idea?

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so1.
There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.
1. A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)
2. A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.
3. A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.
4. When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was past contention on the lock.
5. Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.
6. There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.
There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1, 2 and 5 ... depending on how many cores are used, how big the CPU's memory caches are, how big the matrix is, and other factors.
1 - Indeed, if this was true then we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.
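One way to act on the points about core count and thread creation cost is to size a thread pool from the number of available processors instead of picking 10 to 100 threads by hand. The sketch below is only an illustration under stated assumptions: plain double[][] arrays stand in for the Mat class (whose code is not shown in the question), and the names MatMulPool and multiply are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: double[][] replaces the Mat class from the question.
class MatMulPool {

    static double[][] multiply(double[][] a, double[][] b) throws Exception {
        int rows = a.length, inner = b.length, cols = b[0].length;
        double[][] c = new double[rows][cols];

        // Never use more threads than cores, or more threads than rows.
        int threads = Math.max(1,
                Math.min(rows, Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();

        int chunk = (rows + threads - 1) / threads; // rows per task, rounded up
        for (int start = 0; start < rows; start += chunk) {
            final int from = start;
            final int to = Math.min(rows, start + chunk);
            pending.add(pool.submit(() -> {
                // Each task writes a disjoint block of rows of c.
                for (int i = from; i < to; i++) {
                    for (int j = 0; j < cols; j++) {
                        double sum = 0;
                        for (int k = 0; k < inner; k++) {
                            sum += a[i][k] * b[k][j];
                        }
                        c[i][j] = sum;
                    }
                }
            }));
        }
        for (Future<?> f : pending) {
            f.get(); // wait for every block; rethrows a worker failure
        }
        pool.shutdown();
        return c;
    }

    public static void main(String[] args) throws Exception {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

For a matrix small enough to multiply in 18 milliseconds, even this version may not beat a single thread, because the setup cost can still dominate.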

Using multi-threading is not primarily for performance, but for parallelization. There are cases where parallelization can benefit performance, though.
Your computer doesn't have infinite resources. Adding more and more threads will decrease performance. It's like starting more and more applications, you wouldn't expect a program to run faster when you start another program, and you probably wouldn't be surprised if it runs slower.
Up to a certain point performance will remain constant (your computer still has resources to handle the demand), but at some point you reach the maximum your computer can handle and performance will drop. That's exactly what your result shows. Performance stays somewhat constant with 1 or 10 threads, and then drops steadily.

AtomicInteger in multithreading

I want to find out all the prime numbers from 0 to 1000000. For that I wrote this stupid method:
public static boolean isPrime(int n) {
for(int i = 2; i < n; i++) {
if (n % i == 0)
return false;
}
return true;
}
It's good enough for me and it doesn't need any editing. Then I wrote the following code:
private static ExecutorService executor = Executors.newFixedThreadPool(10);
private static AtomicInteger counter = new AtomicInteger(0);
private static AtomicInteger numbers = new AtomicInteger(0);
public static void main(String args[]) {
long start = System.currentTimeMillis();
while (numbers.get() < 1000000) {
final int number = numbers.getAndIncrement(); // (1) - fast
executor.submit(new Runnable() {
@Override
public void run() {
// int number = numbers.getAndIncrement(); // (2) - slow
if (Main.isPrime(number)) {
System.out.println("Ts: " + new Date().getTime() + " " + Thread.currentThread() + ": " + number + " is prime!");
counter.incrementAndGet();
}
}
});
}
executor.shutdown();
try {
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
System.out.println("Primes: " + counter);
System.out.println("Delay: " + (System.currentTimeMillis() - start));
} catch (Exception e) {
e.printStackTrace();
}
}
Please, pay attention to (1) and (2) marked rows. When (1) is enabled the program runs fast, but when (2) is enabled it works slower.
The output shows small portions with large delay
Ts: 1480489699692 Thread[pool-1-thread-9,5,main]: 350431 is prime!
Ts: 1480489699692 Thread[pool-1-thread-6,5,main]: 350411 is prime!
Ts: 1480489699692 Thread[pool-1-thread-4,5,main]: 350281 is prime!
Ts: 1480489699692 Thread[pool-1-thread-5,5,main]: 350257 is prime!
Ts: 1480489699693 Thread[pool-1-thread-7,5,main]: 350447 is prime!
Ts: 1480489711996 Thread[pool-1-thread-6,5,main]: 350503 is prime!
and threads get equal number value:
Ts: 1480489771083 Thread[pool-1-thread-8,5,main]: 384733 is prime!
Ts: 1480489712745 Thread[pool-1-thread-6,5,main]: 384733 is prime!
Please explain to me why option (2) is slower, and why threads get the same value for number even though AtomicInteger is thread-safe.

In the (2) case, up to 11 threads (the ten from the ExecutorService plus the main thread) are contending for access to the AtomicInteger, whereas in case (1) only the main thread accesses it. In fact, for case (1) you could use int instead of AtomicInteger.
The AtomicInteger class makes use of CAS (compare-and-swap) operations. It reads the value, does the increment, and then swaps in the new value only if the stored value is still the one that was originally read (compare and swap). If another thread has changed the value, it retries by starting again: read, increment, compare-and-swap, until it is successful.
The advantage is that this is lockless, and therefore potentially faster than using locks. But it performs poorly under heavy contention. More contention means more retries.
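The read, increment, compare-and-swap cycle can be written out by hand with AtomicInteger.compareAndSet. This is only a sketch of the idea: the class name CasIncrement and the method getAndIncrementWithCas are made up, and the real AtomicInteger may be implemented differently (for example with a JVM intrinsic) rather than with an explicit loop like this.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the retry loop; not the actual JDK implementation.
class CasIncrement {

    // Behaves like getAndIncrement(): returns the value seen before the increment.
    static int getAndIncrementWithCas(AtomicInteger value) {
        while (true) {
            int current = value.get();                // read
            int next = current + 1;                   // increment
            if (value.compareAndSet(current, next)) { // swap only if unchanged
                return current;
            }
            // Another thread changed the value in the meantime: retry.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    getAndIncrementWithCas(counter);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println("final value: " + counter.get());
    }
}
```

Every failed compareAndSet is wasted work, which is why more contending threads mean more retries.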
Edit
As @teppic points out, another problem makes case (2) slower than case (1). As the increment of numbers happens in the posted jobs, the loop condition remains true for much longer than needed. While all 10 threads of the executor are churning away to determine whether their given number is a prime, the main thread keeps posting new jobs to the executor. These new jobs don't get an opportunity to increment numbers until preceding jobs are done. So while they're on the queue, numbers does not increase and the main thread can meanwhile complete one or more loop iterations, posting new jobs. The end result is that many more jobs can be created and posted than the needed 1000000.

Your outer loop is:
while (numbers.get() < 1000000)
This allows you to continue submitting more Runnables than intended to the ExecutorService in the main thread.
You could try changing the loop to: for(int i=0; i < 1000000; i++)
(As others have mentioned you are obviously increasing the amount of contention, but I suspect the extra worker threads are a larger factor in the slowdown you are seeing.)
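A sketch of the suggested loop, with the number handed to each job as a plain int so that exactly one job is submitted per candidate. The class name PrimeCount and the method countPrimes are made up for illustration. As a deliberate change from the question's code, isPrime here rejects values below 2 and only tests divisors up to the square root, which keeps the example quick to run.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: names and the faster isPrime are not from the question.
class PrimeCount {

    static boolean isPrime(int n) {
        if (n < 2) {
            return false; // the question's version would report 0 and 1 as prime
        }
        for (int i = 2; (long) i * i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }

    // Counts the primes in [0, limit) using a fixed pool of `threads` threads.
    static int countPrimes(int limit, int threads) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        AtomicInteger counter = new AtomicInteger(0);
        for (int i = 0; i < limit; i++) {  // exactly `limit` jobs are submitted
            final int number = i;          // each job gets its own value
            executor.submit(() -> {
                if (isPrime(number)) {
                    counter.incrementAndGet();
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        System.out.println("Primes: " + countPrimes(1_000_000, 10));
        System.out.println("Delay: " + (System.currentTimeMillis() - start));
    }
}
```

Only the main thread produces numbers here, so the shared AtomicInteger for numbers is no longer needed. The counter of primes is still shared, but it is only touched once per prime found.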
As for your second question, I'm pretty sure that it is against the contract of AtomicInteger for two child threads to see the same value of getAndIncrement. So something else must be going on which I am not seeing from your code sample. Might it be that you are seeing output from two separate runs of the program?

Explain me why option (2) is more slowly?
Simply because you do it inside run(). So multiple threads will try to do it at the same time, hence there will be waits and releases. Bowmore has given a low level explanation.
In (1) it is sequential. So there will be no such a scenario.
Why threads get equal value for number despite AtomicInteger multithreading safe?
I don't see any way for this to happen. If there were such a case, it should happen from 0.

You miss two main points here: what AtomicInteger is for and how multithreading works in general.
Regarding why Option 2 is slower, @bowmore provided an excellent answer already.
Now regarding printing the same number twice. AtomicInteger is like any other object. You launch your threads, and they check the value of this object. Since they compete with your main thread, which increases the counter, two child threads may still see the same value. I would pass an int to each Runnable to avoid that.

How to achieve a guaranteed sleep time on a thread

I have a requirement for a class method to be called every 50 milliseconds. I don't use Thread.sleep because it's very important that it happens as precisely as possible to the milli, whereas sleep only guarantees a minimum time. The basic set up is this:
public class ClassA{
public void setup(){
ScheduledExecutorService se = Executors.newScheduledThreadPool(20);
se.scheduleAtFixedRate(this::onCall, 2000, 50, TimeUnit.MILLISECONDS);
}
protected void onCall(Event event) {
// do something
}
}
Now this by and large works fine. I have put System.out.println(System.nanoTime()) in onCall to check it's being called as precisely as I hope it is. I have found that there is a drift of 1-5 milliseconds over the course of 100s of calls, which corrects itself now and again.
A 5 ms drift unfortunately is pretty hefty for me. 1 milli drift is ok but at 5ms it messes up the calculation I'm doing in onCall because of states of other objects. It would be almost OK if I could get the scheduler to auto-correct such that if it's 5ms late on one call, the next one would happen in 45ms instead of 50.
My question is: Is there a more precise way to achieve this in Java? The only solution I can think of at the moment is to call a check method every 1ms and check the time to see if it's at the 50ms mark. But then I'd need to maintain some logic if, on the off-chance, the precise 50ms interval is missed (49,51).
Thanks

Can I achieve a guaranteed sleep time on a thread?
Sorry, but No.
There is no way to get reliable, precise delay timing in a Java SE JVM. You need to use a Real time Java implementation running on a real time operating system.
Here are a couple of reasons why Java SE on a normal OS cannot do this.
At certain points, the GC in a Java SE JVM needs to "stop the world". While this is happening, no user thread can run. If your timer goes off in a "stop the world" pause, it can't be scheduled until the pause is over.
Scheduling of threads in a JVM is actually done by the host operating system. If the system is busy, the host OS may decide not to schedule the JVM's threads when your application needs this to happen.
The java.util.Timer.scheduleAtFixedRate approach is probably as good as you will get on Java SE. It should address long-term drift, but you can't get rid of the "jitter". And that jitter could easily be hundreds of milliseconds ... or even seconds.
Spinlocks won't help if the system is busy and the OS is preempting or not scheduling your threads. (And spinlocking in user code is wasteful ...)

According to the comment, the primary goal is not to concurrently execute multiple tasks at this precise interval. Instead, the goal is to execute a single task at this interval as precisely as possible.
Unfortunately, neither the ScheduledExecutorService nor any manual constructs involving Thread#sleep or LockSupport#parkNanos are very precise in that sense. And as pointed out in the other answers: There may always be influencing factors that are beyond your control - namely, details of the JVM implementation, garbage collection, JIT runs etc.
Nevertheless, a comparatively simple approach to achieve a high precision here is busy waiting. (This was already mentioned in an answer that is now deleted). But of course, this has several caveats. Most importantly, it will burn processing resources of one CPU. (And on a single-CPU-system, this may be particularly bad).
But in order to show that it may be far more precise than other waiting approaches, here is a simple comparison of the ScheduledExecutorService approach and the busy waiting:
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
public class PreciseSchedulingTest
{
public static void main(String[] args)
{
long periodMs = 50;
PreciseSchedulingA a = new PreciseSchedulingA();
a.setup(periodMs);
PreciseSchedulingB b = new PreciseSchedulingB();
b.setup(periodMs);
}
}
class CallTracker implements Runnable
{
String name;
long expectedPeriodMs;
long baseTimeNs;
long callTimesNs[];
int numCalls;
int currentCall;
CallTracker(String name, long expectedPeriodMs)
{
this.name = name;
this.expectedPeriodMs = expectedPeriodMs;
this.baseTimeNs = System.nanoTime();
this.numCalls = 50;
this.callTimesNs = new long[numCalls];
}
@Override
public void run()
{
callTimesNs[currentCall] = System.nanoTime();
currentCall++;
if (currentCall == numCalls)
{
currentCall = 0;
double maxErrorMs = 0;
for (int i = 1; i < numCalls; i++)
{
long ns = callTimesNs[i] - callTimesNs[i - 1];
double ms = ns * 1e-6;
double errorMs = ms - expectedPeriodMs;
if (Math.abs(errorMs) > Math.abs(maxErrorMs))
{
maxErrorMs = errorMs;
}
//System.out.println(errorMs);
}
System.out.println(name + ", maxErrorMs : " + maxErrorMs);
}
}
}
class PreciseSchedulingA
{
public void setup(long periodMs)
{
CallTracker callTracker = new CallTracker("A", periodMs);
ScheduledExecutorService se = Executors.newScheduledThreadPool(20);
se.scheduleAtFixedRate(callTracker, periodMs,
periodMs, TimeUnit.MILLISECONDS);
}
}
class PreciseSchedulingB
{
public void setup(long periodMs)
{
CallTracker callTracker = new CallTracker("B", periodMs);
Thread thread = new Thread(new Runnable()
{
@Override
public void run()
{
while (true)
{
long periodNs = periodMs * 1000 * 1000;
long endNs = System.nanoTime() + periodNs;
while (System.nanoTime() < endNs)
{
// Busy waiting...
}
callTracker.run();
}
}
});
thread.setDaemon(true);
thread.start();
}
}
Again, this should be taken with a grain of salt, but the results on My Machine® are as follows:
A, maxErrorMs : 1.7585339999999974
B, maxErrorMs : 0.06753599999999693
A, maxErrorMs : 1.7669149999999973
B, maxErrorMs : 0.007193999999998368
A, maxErrorMs : 1.7775299999999987
B, maxErrorMs : 0.012780999999996823
showing that the error for the waiting times is in the range of few microseconds.
In order to apply such an approach in practice, a more sophisticated infrastructure would be necessary. E.g. the bookkeeping that is necessary to compensate for waiting times that have been too high. (I think they can't be too low). Also, all this still does not guarantee a precisely timed execution. But it may be an option to consider, at least.
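One possible form of that bookkeeping, offered only as a sketch, is to compute every deadline from a fixed start time instead of from the end of the previous wait, so a call that runs late does not push back all the calls after it. The class name DeadlineLoop and the method runAtFixedRate are made up for illustration, and the loop still burns a full core while waiting.

```java
// Hypothetical sketch: busy waiting against absolute deadlines.
class DeadlineLoop {

    // Calls `task` `calls` times, aiming call k at startNs + (k + 1) * periodNs,
    // and returns the System.nanoTime() value recorded just before each call.
    static long[] runAtFixedRate(Runnable task, long periodNs, int calls) {
        long[] callTimesNs = new long[calls];
        long startNs = System.nanoTime();
        for (int k = 0; k < calls; k++) {
            // The deadline depends only on startNs, so lateness does not accumulate.
            long deadlineNs = startNs + (k + 1) * periodNs;
            while (System.nanoTime() - deadlineNs < 0) {
                // Busy waiting...
            }
            callTimesNs[k] = System.nanoTime();
            task.run();
        }
        return callTimesNs;
    }

    public static void main(String[] args) {
        long periodNs = 50_000_000L; // 50 ms
        long[] times = runAtFixedRate(() -> { }, periodNs, 20);
        double maxOffsetMs = 0;
        for (int k = 1; k < times.length; k++) {
            // Distance of call k from where it should be relative to the first call.
            double offsetMs = ((times[k] - times[0]) - k * periodNs) * 1e-6;
            if (Math.abs(offsetMs) > Math.abs(maxOffsetMs)) {
                maxOffsetMs = offsetMs;
            }
        }
        System.out.println("max offset from the ideal schedule (ms): " + maxOffsetMs);
    }
}
```

If one call is delayed past the next deadline, the inner loop exits immediately and the following call runs right away, which is the same catch-up behavior that fixed-rate scheduling has. None of this protects against the thread being preempted or paused by the garbage collector.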

If you really have hard time constraints, you want to use a real-time operating system. General computing does not have hard time constraints; if your OS goes to virtual memory in one of your intervals, then you can miss your sleep interval. The real-time OS will make the tradeoff that you may get less done, but that work can be better scheduled.
If you need to do this on a normal OS, you can spinlock instead of sleeping. This is really inefficient, but if you really have hard time constraints, it's the best way to approximate that.

That will be hard - think about GC... What I would do is to grab time with nanoTime, and use it in calculations. Or in other words I would get exact time and use it in calculations.

Yes (assuming you only want to prevent long term drifts and don't worry about each delay individually). java.util.Timer.scheduleAtFixedRate:
...In fixed-rate execution, each execution is scheduled relative to the scheduled execution time of the initial execution. If an execution is delayed for any reason (such as garbage collection or other background activity), two or more executions will occur in rapid succession to "catch up." In the long run, the frequency of execution will be exactly the reciprocal of the specified period (assuming the system clock underlying Object.wait(long) is accurate). ...
Basically, do something like this:
new Timer().scheduleAtFixedRate(new TimerTask() {
@Override
public void run() {
onCall(); // calls the enclosing class's method; "this" here would refer to the TimerTask
}
}, 2000, 50);

Cyclic barrier Java, How to verify?

I am preparing for interviews and just want to prepare some basic threading examples and structures so that I can use them during my white board coding if I have to.
I was reading about CyclicBarrier and was just trying my hands at it, so I wrote a very simple code:
import java.util.concurrent.CyclicBarrier;
public class Threads
{
/**
* @param args
*/
public static void main(String[] args)
{
// ******************************************************************
// Using CyclicBarrier to make all threads wait at a point until all
// threads reach there
// ******************************************************************
barrier = new CyclicBarrier(N);
for (int i = 0; i < N; ++i)
{
new Thread(new CyclicBarrierWorker()).start();
}
// ******************************************************************
}
static class CyclicBarrierWorker implements Runnable
{
public void run()
{
try
{
long id = Thread.currentThread().getId();
System.out.println("I am thread " + id + " and I am waiting for my friends to arrive");
// Do Something in the Thread
Thread.sleep(1000*(int)(4*Math.random()*10));
// Now Wait till all the thread reaches this point
barrier.await();
}
catch (Exception e)
{
e.printStackTrace();
}
//Now do whatever else after all threads are released
long id1 = Thread.currentThread().getId();
System.out.println("Thread:"+id1+" We all got released ..hurray!!");
System.out.println("We all got released ..hurray!!");
}
}
final static int N = 4;
static CyclicBarrier barrier = null;
}
You can copy paste it as is and run in your compiler.
What I want to verify is that indeed all threads wait at this point in code:
barrier.await();
I put some wait and was hoping that I would see 4 statements appear one after other in a sequential fashion on the console, followed by 'outburst' of "released..hurray" statement. But I am seeing outburst of all the statements together no matter what I select as the sleep.
Am I missing something here ?
Thanks
P.S: Is there an online editor like http://codepad.org/F01xIhLl where I can just put Java code and hit a button to run a throw away code ? . I found some which require some configuration before I can run any code.

The code looks fine, but it might be more enlightening to write to System.out before the sleep. Consider this in run():
long id = Thread.currentThread().getId();
System.out.println("I am thread " + id + " and I am waiting for my friends to arrive");
// Do Something in the Thread
Thread.sleep(1000*8);
On my machine, I still see a burst, but it is clear that the threads are blocked on the barrier.

If you want to avoid the first burst, use a random value in the sleep:
Thread.sleep(1000*(int)(8*Math.random()));

I put some wait and was hoping that I would see 4 statements appear one after other in a sequential fashion on the console, followed by 'outburst' of "released..hurray" statement. But I am seeing outburst of all the statements together no matter what I select as the sleep.
The behavior I'm observing is that all the threads created sleep for approximately the same amount of time. Remember that other threads can perform their work in the interim, and will therefore get scheduled; since all threads created sleep for the same amount of time, there is very little difference between the instants of time when the System.out.println calls are invoked.
Edit: The other answer of sleeping of a random amount of time will aid in understanding the concept of a barrier better, for it would guarantee (to some extent) the possibility of multiple threads arriving at the barrier at different instants of time.
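One way to check the barrier without relying on console timing, sketched here under assumptions, is to give the CyclicBarrier a barrier action and record how many threads had arrived at the moment it tripped. The class name BarrierCheck, the method arrivalsWhenTripped, and the sleep range are made up for illustration.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: verifies the barrier with a counter instead of by eye.
class BarrierCheck {

    // Starts n workers that sleep up to maxSleepMs, then wait on a barrier.
    // Returns how many workers had arrived when the barrier tripped.
    static int arrivalsWhenTripped(int n, int maxSleepMs) throws InterruptedException {
        AtomicInteger arrived = new AtomicInteger(0);
        AtomicInteger seenAtTrip = new AtomicInteger(-1);
        // The barrier action runs once, in the last arriving thread,
        // before any waiting thread is released.
        CyclicBarrier barrier =
                new CyclicBarrier(n, () -> seenAtTrip.set(arrived.get()));

        Thread[] workers = new Thread[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new Thread(() -> {
                try {
                    // Random sleep so the workers reach the barrier at different times.
                    Thread.sleep(ThreadLocalRandom.current().nextInt(maxSleepMs + 1));
                    System.out.println(Thread.currentThread().getName() + " arrived");
                    arrived.incrementAndGet();
                    barrier.await();
                    System.out.println(Thread.currentThread().getName() + " released");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return seenAtTrip.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 4;
        int seen = arrivalsWhenTripped(n, 3000);
        System.out.println("arrived when the barrier tripped: " + seen + " of " + n);
    }
}
```

If the barrier works as documented, the reported count equals the number of workers, because every worker increments the counter before it calls await.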
