Performance Issues with newFixedThreadPool vs newSingleThreadExecutor

Performance Issues with newFixedThreadPool vs newSingleThreadExecutor - java

I am trying to Benchmark our Client code. So I decided I will write a multithreading program to do the benchmarking of my client code. I am trying to measure how much time (95 Percentile) below method will take-
attributes = deClient.getDEAttributes(columnsList);
So below is the multithreaded code I wrote to do the benchmarking on the above method. I am seeing lot of variations in my two scenarios-
1) Firstly, with multithreaded code by using 20 threads and running for 15 minutes. I get 95 percentile as 37ms. And I am using-
ExecutorService service = Executors.newFixedThreadPool(20);
2) But If I am running my same program for 15 minutes using-
ExecutorService service = Executors.newSingleThreadExecutor();
instead of
ExecutorService service = Executors.newFixedThreadPool(20);
I get 95 percentile as 7ms which is way less than the above number when I am running my code with newFixedThreadPool(20).
Can anyone tell me what can be the reason for such high performance issues with-
newSingleThreadExecutor vs newFixedThreadPool(20)
And by both ways I am running my program for 15 minutes.
Below is my code-
public static void main(String[] args) {
try {
// create thread pool with given size
//ExecutorService service = Executors.newFixedThreadPool(20);
ExecutorService service = Executors.newSingleThreadExecutor();
long startTime = System.currentTimeMillis();
long endTime = startTime + (15 * 60 * 1000);//Running for 15 minutes
for (int i = 0; i < threads; i++) {
service.submit(new ServiceTask(endTime, serviceList));
}
// wait for termination
service.shutdown();
service.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
} catch (InterruptedException e) {
} catch (Exception e) {
}
}
Below is the class that implements Runnable interface-
class ServiceTask implements Runnable {
private static final Logger LOG = Logger.getLogger(ServiceTask.class.getName());
private static Random random = new SecureRandom();
public static volatile AtomicInteger countSize = new AtomicInteger();
private final long endTime;
private final LinkedHashMap<String, ServiceInfo> tableLists;
public static ConcurrentHashMap<Long, Long> selectHistogram = new ConcurrentHashMap<Long, Long>();
public ServiceTask(long endTime, LinkedHashMap<String, ServiceInfo> tableList) {
this.endTime = endTime;
this.tableLists = tableList;
}
#Override
public void run() {
try {
while (System.currentTimeMillis() <= endTime) {
double randomNumber = random.nextDouble() * 100.0;
ServiceInfo service = selectRandomService(randomNumber);
final String id = generateRandomId(random);
final List<String> columnsList = getColumns(service.getColumns());
List<DEAttribute<?>> attributes = null;
DEKey bk = new DEKey(service.getKeys(), id);
List<DEKey> list = new ArrayList<DEKey>();
list.add(bk);
Client deClient = new Client(list);
final long start = System.nanoTime();
attributes = deClient.getDEAttributes(columnsList);
final long end = System.nanoTime() - start;
final long key = end / 1000000L;
boolean done = false;
while(!done) {
Long oldValue = selectHistogram.putIfAbsent(key, 1L);
if(oldValue != null) {
done = selectHistogram.replace(key, oldValue, oldValue + 1);
} else {
done = true;
}
}
countSize.getAndAdd(attributes.size());
handleDEAttribute(attributes);
if (BEServiceLnP.sleepTime > 0L) {
Thread.sleep(BEServiceLnP.sleepTime);
}
}
} catch (Exception e) {
}
}
}
Updated:-
Here is my processor spec- I am running my program from Linux machine with 2 processors defined as:
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2670 0 # 2.60GHz
stepping : 7
cpu MHz : 2599.999
cache size : 20480 KB
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm arat pln pts
bogomips : 5199.99
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

Can anyone tell me what can be the reason for such high performance issues with newSingleThreadExecutor vs newFixedThreadPool(20)...
If you are running many more tasks in parallel (20 in the case) than you have processors (I doubt that you have 20+ processor box) then yes, each individual task is going to take longer to complete. It is easier for the computer to execute one task at a time instead of switching between multiple threads running at the same time. Even if you limit the number of threads in the pool to the number of CPUs you have, each task probably will run slower, albeit slightly.
If, however, you compare the throughput (amount of time needed to complete a number of tasks) of your different sized thread pools, you should see that the 20 thread throughput should be higher. If you execute 1000 tasks with 20 threads, they overall will finish much sooner than with just 1 thread. Each task may take longer but they will be executing in parallel. It will probably not be 20 times faster given thread overhead, etc. but it might be something like 15 times faster.
You should not be worrying about the individual task speed but rather you should be trying to maximize the task throughput by tuning the number of threads in your pool. How many threads to use depends heavily on the amount of IO, the CPU cycles used by each task, locks, synchronized blocks, other applications running on the OS, and other factors.
People often use 1-2 times the number of CPUs as a good place to start in terms of the number of threads in the pool to maximize throughput. More IO requests or thread blocking operations then add more threads. More CPU bound then reduce the number of threads to be closer to the number of CPUs available. If your application is competing for OS cycles with other more important applications on the server then even less threads may be required.

In a nutshell
if your tasks are CPU intensive (i.e there are no read/writes or blocking tasks which keep thread idle) then thread size can be set near to your core counts. This will use all resources efficiently & avoid too many context switching.
If they are memory intensive where you make API or DB calls then its better to have higher number of threads so that while a thread is waiting for response another thread can be switched with current idle thread.

Related

Multithreaded vs Asynchronous programming in a single core

If in real time the CPU performs only one task at a time then how is multithreading different from asynchronous programming (in terms of efficiency) in a single processor system?
Lets say for example we have to count from 1 to IntegerMax. In the following program for my multicore machine, the two thread final count count is almost half of the single thread count. What if we ran this in a single core machine? And is there any way we could achieve the same result there?
class Demonstration {
public static void main( String args[] ) throws InterruptedException {
SumUpExample.runTest();
}
}
class SumUpExample {
long startRange;
long endRange;
long counter = 0;
static long MAX_NUM = Integer.MAX_VALUE;
public SumUpExample(long startRange, long endRange) {
this.startRange = startRange;
this.endRange = endRange;
}
public void add() {
for (long i = startRange; i <= endRange; i++) {
counter += i;
}
}
static public void twoThreads() throws InterruptedException {
long start = System.currentTimeMillis();
SumUpExample s1 = new SumUpExample(1, MAX_NUM / 2);
SumUpExample s2 = new SumUpExample(1 + (MAX_NUM / 2), MAX_NUM);
Thread t1 = new Thread(() -> {
s1.add();
});
Thread t2 = new Thread(() -> {
s2.add();
});
t1.start();
t2.start();
t1.join();
t2.join();
long finalCount = s1.counter + s2.counter;
long end = System.currentTimeMillis();
System.out.println("Two threads final count = " + finalCount + " took " + (end - start));
}
static public void oneThread() {
long start = System.currentTimeMillis();
SumUpExample s = new SumUpExample(1, MAX_NUM );
s.add();
long end = System.currentTimeMillis();
System.out.println("Single thread final count = " + s.counter + " took " + (end - start));
}
public static void runTest() throws InterruptedException {
oneThread();
twoThreads();
}
}
Output:
Single thread final count = 2305843008139952128 took 1003
Two threads final count = 2305843008139952128 took 540

For a purely CPU-bound operation you are correct. Most (99.9999%) of programs need to do input, output, and invoke other services. Those are orders of magnitude slower than the CPU, so while waiting for the results of an external operation, the OS can schedule and run other (many other) processes in time slices.
Hardware multithreading benefits primarily when 2 conditions are met:
CPU-intensive operations;
That can be efficiently divided into independent subsets
Or you have lots of different tasks to run that can be efficiently divided among multiple hardware processors.

In the following program for my multicore machine, the two thread final count count is almost half of the single thread count.
That is what I would expect from a valid benchmark when the application is using two cores.
However, looking at your code, I am somewhat surprised that you are getting those results ... so reliably.
Your benchmark doesn't take account of JVM warmup effects, particularly JIT compilation.
You benchmark's add method could potentially be optimized by the JIT compiler to get rid of the loop entirely. (But at least the counts are "used" ... by printing them out.)
I guess you got lucky ... but I'm not convinced those results will be reproducible for all versions of Java, or if you tweaked the benchmark.
Please read this:
How do I write a correct micro-benchmark in Java?
What if we ran this in a single core machine?
Assuming the following:
You rewrote the benchmark to corrected the flaws above.
You are running on a system where hardware hyper-threading1 is disabled2.
Then ... I would expect it to take two threads to take more than twice as long as the one thread version.
Q: Why "more than"?
A: Because there is a significant overhead in starting a new thread. Depending on your hardware, OS and Java version, it could be more than a millisecond. Certainly, the time taken is significant if you repeatedly use and discard threads.
And is there any way we could achieve the same result there?
Not sure what you are asking here. But are if you are asking how to simulate the behavior of one core on a multi-core machine, you would probably need to do this at the OS level. See https://superuser.com/questions/309617 for Windows and https://askubuntu.com/questions/483824 for Linux.
1 - Hyperthreading is a hardware optimization where a single core's processing hardware supports (typically) two hyper-threads. Each hyperthread
has its own sets of registers, but it shares functional units such as the ALU with the other hyperthread. So the two hyperthreads behave like (typically) two cores, except that they may be slower, depending on the precise instruction mix. A typical OS will treat a hyperthread as if it is a regular core. Hyperthreading is typically enabled / disabled at boot time; e.g. via a BIOS setting.
2 - If hyperthreading is enabled, it is possible that two Java threads won't be twice as fast as one in a CPU-intensive computation like this ... due to possible slowdown caused by the "other" hyperthread on respective cores. Did someone mention that benchmarking is complicated?

Multi threaded matrix multiplication performance issue

I am using java for multi threaded multiplication. I am practicing multi threaded programming. Following is the code that I took from another post of stackoverflow.
public class MatMulConcur {
private final static int NUM_OF_THREAD =1 ;
private static Mat matC;
public static Mat matmul(Mat matA, Mat matB) {
matC = new Mat(matA.getNRows(),matB.getNColumns());
return mul(matA,matB);
}
private static Mat mul(Mat matA,Mat matB) {
int numRowForThread;
int numRowA = matA.getNRows();
int startRow = 0;
Worker[] myWorker = new Worker[NUM_OF_THREAD];
for (int j = 0; j < NUM_OF_THREAD; j++) {
if (j<NUM_OF_THREAD-1){
numRowForThread = (numRowA / NUM_OF_THREAD);
} else {
numRowForThread = (numRowA / NUM_OF_THREAD) + (numRowA % NUM_OF_THREAD);
}
myWorker[j] = new Worker(startRow, startRow+numRowForThread,matA,matB);
myWorker[j].start();
startRow += numRowForThread;
}
for (Worker worker : myWorker) {
try {
worker.join();
} catch (InterruptedException e) {
}
}
return matC;
}
private static class Worker extends Thread {
private int startRow, stopRow;
private Mat matA, matB;
public Worker(int startRow, int stopRow, Mat matA, Mat matB) {
super();
this.startRow = startRow;
this.stopRow = stopRow;
this.matA = matA;
this.matB = matB;
}
#Override
public void run() {
for (int i = startRow; i < stopRow; i++) {
for (int j = 0; j < matB.getNColumns(); j++) {
double sum = 0;
for (int k = 0; k < matA.getNColumns(); k++) {
sum += matA.get(i, k) * matB.get(k, j);
}
matC.set(i, j, sum);
}
}
}
}
I ran this program for 1,10,20,...,100 threads but performance is decreasing instead. Following is the time table
Thread 1 takes 18 Milliseconds
Thread 10 takes 18 Milliseconds
Thread 20 takes 35 Milliseconds
Thread 30 takes 38 Milliseconds
Thread 40 takes 43 Milliseconds
Thread 50 takes 48 Milliseconds
Thread 60 takes 57 Milliseconds
Thread 70 takes 66 Milliseconds
Thread 80 takes 74 Milliseconds
Thread 90 takes 87 Milliseconds
Thread 100 takes 98 Milliseconds
Any Idea?

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so1.
There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.
A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)
A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.
A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.
When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was past contention on the lock.
Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.
There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.
There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1 2 and 5 ... depending on how many cores are used, how big the CPUs memory caches are, how big the matrix is, and other factors.
1 - Indeed, if this was true then we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.

Using multi-threading is not primarily for performance, but for parallelization. There are cases where parallelization can benefit performance, though.
Your computer doesn't have infinite resources. Adding more and more threads will decrease performance. It's like starting more and more applications, you wouldn't expect a program to run faster when you start another program, and you probably wouldn't be surprised if it runs slower.
Up to a certain point performance will remain constant (your computer still has resources to handle the demand), but at some point you reach the maximum your computer can handle and performance will drop. That's exactly what your result shows. Performance stays somewhat constant with 1 or 10 threads, and then drops steadily.

Performance of ExecutorService in Java

I am trying to understand the ExecutorService in java. There is not much performance difference when I use 1 thread or 4 threads. I have a quad core CPU and I do not have any other process running.
ExecutorService exService = Executors.newFixedThreadPool(4);
exService.execute(new Test().new RunnableThread());
exService.awaitTermination(25, TimeUnit.SECONDS);
class RunnableThread implements Runnable {
#Override
public void run() {
StopWatch stopWatch = new StopWatch();
stopWatch.start();
long cnt = 0;
for (cnt = 0; cnt < 999999999; cnt++) {
try {
for (long j = 0; j < 20; j++){
x += j;
}
} catch (Exception e) {
e.printStackTrace();
}
}
stopWatch.stop();
System.out.println(stopWatch.getTime());
}
}
If my understanding is right, my task should have close to 4x performance improvement when I say newFixedThreadPool(4) right?

Unfortunately, there is no magic in the allocation of workload to threads.
Every task runs on its own thread. It does not somehow automatically get transformed into concurrent execution paths.
If you have only one task, the remaining three threads will be idle.
Multiple threads only speed up things if you can split your workload into multiple tasks that can run concurrently (and you have to do that splitting yourself).

If my understanding is right, my task should have close to 4x performance improvement when I say newFixedThreadPool(4) right?
Yes, if you're actually running 4 concurrent tasks.
Currently, you have a single task that you are submitting to the executor. Let's say that it takes 10 seconds. Even if you have 4 cores and 4 threads, Java will not be able to parallelize a single task. However, if you submit 4 independent tasks (that have no memory or lock contention), then you will see all of them complete in those 10 seconds that it took the 1 task.

Using parallelism in Java makes program slower (four times slower!!!)

I'm writing conjugate-gradient method realization.
I use Java multi threading for matrix back-substitution.
Synchronization is made using CyclicBarrier, CountDownLatch.
Why it takes so much time to synchronize threads?
Are there other ways to do it?
code snippet
private void syncThreads() {
// barrier.await();
try {
barrier.await();
} catch (InterruptedException e) {
} catch (BrokenBarrierException e) {
}
}

You need to ensure that each thread spends more time doing useful work than it costs in overhead to pass a task to another thread.
Here is an example of where the overhead of passing a task to another thread far outweighs the benefits of using multiple threads.
final double[] results = new double[10*1000*1000];
{
long start = System.nanoTime();
// using a plain loop.
for(int i=0;i<results.length;i++) {
results[i] = (double) i * i;
}
long time = System.nanoTime() - start;
System.out.printf("With one thread it took %.1f ns per square%n", (double) time / results.length);
}
{
ExecutorService ex = Executors.newFixedThreadPool(4);
long start = System.nanoTime();
// using a plain loop.
for(int i=0;i<results.length;i++) {
final int i2 = i;
ex.execute(new Runnable() {
#Override
public void run() {
results[i2] = i2 * i2;
}
});
}
ex.shutdown();
ex.awaitTermination(1, TimeUnit.MINUTES);
long time = System.nanoTime() - start;
System.out.printf("With four threads it took %.1f ns per square%n", (double) time / results.length);
}
prints
With one thread it took 1.4 ns per square
With four threads it took 715.6 ns per square
Using multiple threads is much worse.
However, increase the amount of work each thread does and
final double[] results = new double[10 * 1000 * 1000];
{
long start = System.nanoTime();
// using a plain loop.
for (int i = 0; i < results.length; i++) {
results[i] = Math.pow(i, 1.5);
}
long time = System.nanoTime() - start;
System.out.printf("With one thread it took %.1f ns per pow 1.5%n", (double) time / results.length);
}
{
int threads = 4;
ExecutorService ex = Executors.newFixedThreadPool(threads);
long start = System.nanoTime();
int blockSize = results.length / threads;
// using a plain loop.
for (int i = 0; i < threads; i++) {
final int istart = i * blockSize;
final int iend = (i + 1) * blockSize;
ex.execute(new Runnable() {
#Override
public void run() {
for (int i = istart; i < iend; i++)
results[i] = Math.pow(i, 1.5);
}
});
}
ex.shutdown();
ex.awaitTermination(1, TimeUnit.MINUTES);
long time = System.nanoTime() - start;
System.out.printf("With four threads it took %.1f ns per pow 1.5%n", (double) time / results.length);
}
prints
With one thread it took 287.6 ns per pow 1.5
With four threads it took 77.3 ns per pow 1.5
That's an almost 4x improvement.

How many threads are being used in total? That is likely the source of your problem. Using multiple threads will only really give a performance boost if:
Each task in the thread does some sort of blocking. For example, waiting on I/O. Using multiple threads in this case enables that blocking time to be used by other threads.
or You have multiple cores. If you have 4 cores or 4 CPUs, you can do 4 tasks simultaneously (or 4 threads).
It sounds like you are not blocking in the threads so my guess is you are using too many threads. If you are for example using 10 different threads to do the work at the same time but only have 2 cores, that would likely be much slower than running all of the tasks in sequence. Generally start the number of threads equal to your number of cores/CPUs. Increase the threads used slowly gaging the performance each time. This will give you the optimal thread count to use.

Perhaps you could try to implement to re-implement your code using fork/join from JDK 7 and see what it does?
The default creates a thread-pool with exactly the same amount of threads as you have cores in your system. If you choose the threshold for dividing your work into smaller chunks reasonably this will probably execute much more efficient.

You are most likely aware of this, but in case you aren't, please read up on Amdahl's Law. It gives the relationship between expected speedup of a program by using parallelism and the sequential segments of the program.

synchronizing across cores is much slower than on a single cored environment see if you can limit the jvm to 1 core (see this blog post)
or you can use a ExecuterorService and use invokeAll to run the parallel tasks

Why does this code not see any significant performance gain when I use multiple threads on a quadcore machine?

I wrote some Java code to learn more about the Executor framework.
Specifically, I wrote code to verify the Collatz Hypothesis - this says that if you iteratively apply the following function to any integer, you get to 1 eventually:
f(n) = ((n % 2) == 0) ? n/2 : 3*n + 1
CH is still unproven, and I figured it would be a good way to learn about Executor. Each thread is assigned a range [l,u] of integers to check.
Specifically, my program takes 3 arguments - N (the number to which I want to check CH), RANGESIZE (the length of the interval that a thread has to process), and NTHREAD, the size of the threadpool.
My code works fine, but I saw much less speedup that I expected - of the order of 30% when I went from 1 to 4 threads.
My logic was that the computation is completely CPU bound, and each subtask (checking CH for a fixed size range) is takes roughly the same time.
Does anyone have ideas as to why I'm not seeing a 3 to 4x increase in speed?
If you could report your runtimes as you increase the number of thread (along with the machine, JVM and OS) that would also be great.
Specifics
Runtimes:
java -d64 -server -cp . Collatz 10000000 1000000 4 => 4 threads, takes 28412 milliseconds
java -d64 -server -cp . Collatz 10000000 1000000 1 => 1 thread, takes 38286 milliseconds
Processor:
Quadcore Intel Q6600 at 2.4GHZ, 4GB. The machine is unloaded.
Java:
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
OS:
Linux quad0 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux
Code: (I can't get the code to post, I think it's too long for SO requirements, the source is available on Google Docs
import java.math.BigInteger;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
class MyRunnable implements Runnable {
public int lower;
public int upper;
MyRunnable(int lower, int upper) {
this.lower = lower;
this.upper = upper;
}
#Override
public void run() {
for (int i = lower ; i <= upper; i++ ) {
Collatz.check(i);
}
System.out.println("(" + lower + "," + upper + ")" );
}
}
public class Collatz {
public static boolean check( BigInteger X ) {
if (X.equals( BigInteger.ONE ) ) {
return true;
} else if ( X.getLowestSetBit() == 1 ) {
// odd
BigInteger Y = (new BigInteger("3")).multiply(X).add(BigInteger.ONE);
return check(Y);
} else {
BigInteger Z = X.shiftRight(1); // fast divide by 2
return check(Z);
}
}
public static boolean check( int x ) {
BigInteger X = new BigInteger( new Integer(x).toString() );
return check(X);
}
static int N = 10000000;
static int RANGESIZE = 1000000;
static int NTHREADS = 4;
static void parseArgs( String [] args ) {
if ( args.length >= 1 ) {
N = Integer.parseInt(args[0]);
}
if ( args.length >= 2 ) {
RANGESIZE = Integer.parseInt(args[1]);
}
if ( args.length >= 3 ) {
NTHREADS = Integer.parseInt(args[2]);
}
}
public static void maintest(String [] args ) {
System.out.println("check(1): " + check(1));
System.out.println("check(3): " + check(3));
System.out.println("check(8): " + check(8));
parseArgs(args);
}
public static void main(String [] args) {
long lDateTime = new Date().getTime();
parseArgs( args );
List<Thread> threads = new ArrayList<Thread>();
ExecutorService executor = Executors.newFixedThreadPool( NTHREADS );
for( int i = 0 ; i < (N/RANGESIZE); i++) {
Runnable worker = new MyRunnable( i*RANGESIZE+1, (i+1)*RANGESIZE );
executor.execute( worker );
}
executor.shutdown();
while (!executor.isTerminated() ) {
}
System.out.println("Finished all threads");
long fDateTime = new Date().getTime();
System.out.println("time in milliseconds for checking to " + N + " is " +
(fDateTime - lDateTime ) +
" (" + N/(fDateTime - lDateTime ) + " per ms)" );
}
}

Busy waiting can be a problem:
while (!executor.isTerminated() ) {
}
You can use awaitTermination() instead:
while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {}

You are using BigInteger. It consumes a lot of register space. What you most likely have on the compiler level is register spilling that makes your process memory-bound.
Also note that when you are timing your results you are not taking into account extra time taken by the JVM to allocate threads and work with the thread pool.
You could also have memory conflicts when you are using constant Strings. All strings are stored in a shared string pool and so it may become a bottleneck, unless java is really clever about it.
Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.

As #axtavt answered, busy waiting can be a problem. You should fix that first, as it is part of the answer, but not all of it. It won't appear to help in your case (on Q6600), because it seems to be bottlenecked at 2 cores for some reason, so another is available for the busy loop and so there is no apparent slowdown, but on my Core i5 it speeds up the 4-thread version noticeably.
I suspect that in the case of the Q6600 your particular app is limited by the amount of shared cache available or something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, which means CPUs are sharing them, and no L3 cache. On my core i5, each CPU has a dedicated L2 cache (256K, then there is a larger 8MB shared L3 cache. 256K more per-CPU cache might make a difference... otherwise something else architecture wise does.
Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.
On my work PC, which is also a Q6600 # 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21* (64-bit), here are some basic results:
10000000 500000 1 (avg of three runs): 36982 ms
10000000 500000 4 (avg of three runs): 21252 ms
Faster, certainly - but not completing in quarter of the time like you would expect, or even half... (though it is roughly just a bit more than half, more on that in a moment). Note in my case I halved the size of the work units, and have a default max heap of 1500m.
At home on my Core i5 750 (4 cores no hyperthreading), 4GB RAM, Windows 7 64-bit, jdk 1.6.0_22 (64-bit):
10000000 500000 1 (avg of 3 runs) 32677 ms
10000000 500000 4 (avg of 3 runs) 8825 ms
10000000 500000 4 (avg of 3 runs) 11475 ms (without the busy wait fix, for reference)
the 4 threads version takes 27% of the time the 1 thread version takes when the busy-wait loop is removed. Much better. Clearly the code can make efficient use of 4 cores...
NOTE: Java 1.6.0_18 and later have modified default heap settings - so my default heap size is almost 1500m on my work PC, and around 1000m on my home PC.
You may want to increase your default heap, just in case garbage collection is happening and slowing your 4 threaded version down a bit. It might help, it might not.
At least in your example, there's a chance your larger work unit size is skewing your results slightly...halving it may help you get closer to at least 2x the speed since 4 threads will be kept busy for a longer portion of the time. I don't think the Q6600 will do much better at this particular task...whether it is cache or some other inherent architecture thing.
In all cases, I am simply running "java Collatz 10000000 500000 X", where x = # of threads indicated.
The only changes I made to your java file were to make one of the println's into a print, so there were less linebreaks for my runs with 500000 per work unit so I could see more results in my console at once, and I ditched the busy wait loop, which matters on the i5 750 but didn't make a difference on the Q6600.

You can should try using the submit function and then watching the Future's that are returning checking them to see if the thread has finished.
Terminate doesn't return until there is a shutdown.
Future submit(Runnable task)
Submits a Runnable task for execution and returns a Future representing that task.
isTerminated()
Returns true if all tasks have completed following shut down.
Try this...
public static void main(String[] args) {
long lDateTime = new Date().getTime();
parseArgs(args);
List<Thread> threads = new ArrayList<Thread>();
List<Future> futures = new ArrayList<Future>();
ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
for (int i = 0; i < (N / RANGESIZE); i++) {
Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
futures.add(executor.submit(worker));
}
boolean done = false;
while (!done) {
for(Future future : futures) {
done = true;
if( !future.isDone() ) {
done = false;
break;
}
}
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
System.out.println("Finished all threads");
long fDateTime = new Date().getTime();
System.out.println("time in milliseconds for checking to " + N + " is " +
(fDateTime - lDateTime) +
" (" + N / (fDateTime - lDateTime) + " per ms)");
System.exit(0);
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.