I had an interview yesterday. I couldn't figure out a solution to one programming problem and I'd like to get some ideas here. The problem is:
I need to implement a TimeWindowBuffer in Java, which stores the numbers a user continuously receives as time goes on. The buffer has a maxBufferSize. The user wants to know the average value of the past several seconds, a timeWindow passed in by the user (so this is a sliding window). We can get the current time from the system (e.g. System.currentTimeMillis() in Java). The TimeWindowBuffer class looks like this:
public class TimeWindowBuffer {
    private int maxBufferSize;
    private int timeWindow;

    public TimeWindowBuffer(int maxBufferSize, int timeWindow) {
        this.maxBufferSize = maxBufferSize;
        this.timeWindow = timeWindow;
    }

    public void addValue(long value) {
        ...
    }

    public double getAvg() {
        ...
        return average;
    }

    // other auxiliary methods
}
Example:
Say a user receives a number every second (though the user may not receive numbers at a fixed rate) and wants to know the average value of the past 5 seconds.
Input:
maxBufferSize = 5, timeWindow = 5 (s)
numbers={-5 4 -8 -8 -8 1 6 1 8 5}
Output (I list the formula here for illustration, but the user only needs the result):
-5 / 1 (t=1)
(-5 + 4) / 2 (t=2)
(-5 + 4 - 8) / 3 (t=3)
(-5 + 4 - 8 - 8) / 4 (t=4)
(-5 + 4 - 8 - 8 - 8) / 5 (t=5)
(4 - 8 - 8 - 8 + 1) / 5 (t=6)
(-8 - 8 - 8 + 1 + 6) / 5 (t=7)
(-8 - 8 + 1 + 6 + 1) / 5 (t=8)
(-8 + 1 + 6 + 1 + 8) / 5 (t=9)
(1 + 6 + 1 + 8 + 5) / 5 (t=10)
Since the data structure of the TimeWindowBuffer is not specified, I've been thinking about keeping pairs of each value and the time it was added. So my declaration of the underlying buffer looks like this:
private ArrayList<Pair> buffer = new ArrayList<Pair>(maxBufferSize);
where
class Pair {
    private long value;
    private long time;
    ...
}
Since the Pairs are added in time order, I could do a binary search on the list and calculate the average of the numbers that fall into the timeWindow. The problem is that the buffer has a maxBufferSize (although ArrayList doesn't), and I have to remove the oldest value when the buffer is full. That removed value could still fall inside the timeWindow, but now it goes off the record and I will never know when it expires.
That's where I'm stuck at the moment.
I don't need a direct answer, but I'd like some discussion or ideas here. Please let me know if anything about the problem or my description is unclear.
I enjoy little puzzles like this. I did not compile this code, nor did I take into account all the things you would have to for production use. For example, I did not design a way to set a missed value to 0, i.e. if a value does not come in at every tick.
But this will give you another way to think about it...
import java.util.TimerTask;

public class TickTimer
{
    // volatile: the tick is incremented on the Timer thread and read from other threads
    private volatile int tick = 0;
    private java.util.Timer timer = new java.util.Timer();

    public TickTimer(double timeWindow)
    {
        timer.scheduleAtFixedRate(new TickerTask(),
                                  0,                              // initial delay
                                  Math.round(1000 / timeWindow)); // interval
    }

    private class TickerTask extends TimerTask
    {
        public void run()
        {
            tick++;
        }
    }

    public int getTicks()
    {
        return tick;
    }
}
public class TimeWindowBuffer
{
    int buffer[];
    int maxBufferSize;
    TickTimer timer;
    final Object bufferSync = new Object();

    public TimeWindowBuffer(int maxBufferSize, double timeWindow)
    {
        this.maxBufferSize = maxBufferSize;
        buffer = new int[maxBufferSize];
        timer = new TickTimer(timeWindow);
    }

    public boolean add(int value)
    {
        synchronized (bufferSync)
        {
            // overwrite the slot for the current tick, wrapping around the buffer
            buffer[timer.getTicks() % maxBufferSize] = value;
        }
        return true;
    }

    public int averageValue()
    {
        int average = 0;
        synchronized (bufferSync)
        {
            for (int i : buffer)
            {
                average += i;
            }
        }
        return average / maxBufferSize;
    }
}
Your question could be summarized as using constant memory to compute some statistics on a stream.
To me this calls for a heap (priority queue) with the time as the key and the value as the value, and the least time on top.
When you receive a new (time, value) pair, add it to the heap. If the heap size is greater than the buffer size, just remove the root node of the heap until the heap is small enough.
Also, by using a heap you can get the minimum time in the buffer (i.e. the heap) in O(1), so just keep removing the root (the node with the minimum time) until all outdated pairs are cleared.
For the statistics, keep a running integer sum. When you add a new pair to the heap, do sum = sum + value of the pair. When you remove the root from the heap, do sum = sum - value of the root.
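A minimal sketch of that idea (the class and field names here, such as HeapTimeWindowBuffer and TimedValue, are made up for illustration and are not part of the original question):

import java.util.PriorityQueue;

public class HeapTimeWindowBuffer {
    private static class TimedValue {
        final long time;   // millis when the value arrived
        final long value;
        TimedValue(long time, long value) { this.time = time; this.value = value; }
    }

    private final int maxBufferSize;
    private final long timeWindowMillis;
    // least recent time on top
    private final PriorityQueue<TimedValue> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a.time, b.time));
    private long sum = 0;

    public HeapTimeWindowBuffer(int maxBufferSize, int timeWindowSeconds) {
        this.maxBufferSize = maxBufferSize;
        this.timeWindowMillis = timeWindowSeconds * 1000L;
    }

    public synchronized void addValue(long value) {
        heap.add(new TimedValue(System.currentTimeMillis(), value));
        sum += value;
        // enforce the size cap by evicting the oldest entries
        while (heap.size() > maxBufferSize) {
            sum -= heap.poll().value;
        }
    }

    public synchronized double getAvg() {
        // drop entries that have fallen out of the time window
        long cutoff = System.currentTimeMillis() - timeWindowMillis;
        while (!heap.isEmpty() && heap.peek().time < cutoff) {
            sum -= heap.poll().value;
        }
        return heap.isEmpty() ? 0.0 : (double) sum / heap.size();
    }
}

Since values arrive in time order in this problem, a plain FIFO queue would behave the same; the heap mainly buys you tolerance for out-of-order arrivals.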
Related
I have a specific requirement to generate a unique sequence number for the day. The utility should be able to generate the sequence without repeating, even if there's a JVM restart.
Prerequisites:
It should not use a database sequence.
It should not store anything in the filesystem.
The sequence can repeat across days.
The sequence should not repeat within a day, even if there's a JVM restart (this is already ensured with a different attribute).
The minimum requirement is 99 sequence numbers per second.
Sequence format: ######## (8 digits max)
Note: this will be running on different node instances, hence the first digit of the sequence is reserved for identifying the node.
A simple clock-based solution may look like this:
static int seq(int nodeId) {
    int val = nodeId * 100_000_000 + (int) (System.currentTimeMillis() % 100_000_000);
    try {
        Thread.sleep(1); // introduce delay to ensure the generated values are unique
    } catch (InterruptedException e) {}
    return val;
}
The delay may be randomized additionally (up to 5 millis):
static Random random = new SecureRandom();

static int seq(int nodeId) {
    int val = nodeId * 100_000_000 + (int) (System.currentTimeMillis() % 100_000_000);
    try {
        Thread.sleep(1 + random.nextInt(4));
    } catch (InterruptedException e) {}
    return val;
}
In the process of creating a voxel game, I'm doing some performance tests for the basic chunk system.
A chunk is made of 16 tiles on the y axis. A tile is a HashMap of material ids.
The key is a byte, and the material id is a short.
According to my calculations a chunk should be 12KB plus a little bit (let's just say 16KB): 16*16*16*3 = 12,288 bytes, where the 3 is for a byte plus a short (3 bytes).
What I basically don't understand is that my application uses much more memory than expected: around 256KB for each chunk. It uses around 2GB when running 8192 chunks. Note that this is a chunk storage performance test, and therefore not a whole game.
Another strange thing is that the memory usage varies from 1.9GB to 2.2GB each time I run it. There's no randomizer in the code, so it should always be the same number of variables, arrays, elements, etc.
Here's my code:
import java.util.ArrayList;
import java.util.List;

public class ChunkTest {
    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<Chunk>();
        long time = System.currentTimeMillis();
        for (int i = 0; i < 8192; i++) {
            chunks.add(new Chunk());
        }
        long time2 = System.currentTimeMillis();
        System.out.println(time2 - time);
        System.out.println("Done");
        //System.out.println(chunk.getBlock((byte)0, (byte)0, (byte)0));
        while (1 == 1) {
            //Just to keep it running to view memory usage
        }
    }
}
And the other class
import java.util.HashMap;

public class Chunk {
    int x;
    int y;
    int z;
    boolean solidUp;
    boolean solidDown;
    boolean solidNorth;
    boolean solidSouth;
    boolean solidWest;
    boolean solidEast;
    private HashMap<Byte, HashMap<Byte, Short>> tiles = new HashMap<Byte, HashMap<Byte, Short>>();

    public Chunk() {
        HashMap<Byte, Short> tile;
        //Create 16 tiles
        for (byte i = 0; i < 16; i++) {
            //System.out.println(i);
            tile = new HashMap<Byte, Short>();
            //Create 16 by 16 blocks (1 is the default id)
            for (short e = 0; e < 256; e++) {
                //System.out.println(e);
                tile.put((byte) e, (short) 1);
            }
            tiles.put(i, tile);
        }
    }

    public short getBlock(byte x, byte y, byte z) {
        HashMap<Byte, Short> tile = tiles.get(y);
        short block = tile.get((byte) (x + (z * 16)));
        return block;
    }
}
I'm using the Windows Task Manager to monitor the memory usage.
Is that a very inaccurate tool for monitoring? Does it just estimate, which would explain why it varies from instance to instance?
And what is making each chunk 20 times heavier than it should be?
A little bonus question, if you know: if I know the index of what I'm trying to find, is a HashMap or an ArrayList faster?
A chunk is made of 16 tiles on the y axis. A tile is a HashMap of material ids. The key is a byte, and the material id is a short.
According to my calculations a chunk should be 12KB plus a little bit (let's just say 16KB): 16*16*16*3 = 12,288 bytes, where the 3 is for a byte plus a short (3 bytes).
That's bad. Even without knowing the exact per-entry size of a HashMap, I can see you're being too optimistic.
A Map.Entry is an object. Add 4 or 8 bytes for its header.
Its key is an object, never a primitive. Count 8 bytes.
The same goes for the value.
A HashMap.Entry also stores the hash (an int, 4 bytes) and a reference to the next Entry (4 or 8 bytes). The HashMap additionally maintains an array of references to its entries (4 or 8 bytes per element), which by default is kept at most 75% full.
So we have much more than what you expected. The exact values depend on your JVM and some of my figures above may be wrong. Anyway, you're off by a factor of maybe 10 or more.
I'd suggest you post your code to Code Review with all the details needed for the size estimation. Consider using some primitive map or maybe just an array...
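To make the "just an array" suggestion concrete, here is a minimal sketch (the ArrayChunk name and layout are mine, not from your code): a chunk of 16*16*16 shorts is about 8KB of payload plus a single array header, instead of 4096 boxed map entries.

public class ArrayChunk {
    // 16 * 16 * 16 block ids, stored flat: roughly 8KB of data per chunk
    private final short[] blocks = new short[16 * 16 * 16];

    public ArrayChunk() {
        java.util.Arrays.fill(blocks, (short) 1); // 1 is the default id
    }

    public short getBlock(int x, int y, int z) {
        return blocks[(y * 16 + z) * 16 + x];
    }

    public void setBlock(int x, int y, int z, short id) {
        blocks[(y * 16 + z) * 16 + x] = id;
    }
}

As a side effect this also answers the bonus question: indexing an array by a known index is faster than a HashMap lookup, which has to box the key, hash it and chase references.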
Introduction
I have written a very simple program as an attempt to re-introduce myself to multi-threaded programming in Java. The objective of my program is derived from this rather neat set of articles, written by Jakob Jankov. For the program's original, unmodified version, consult the bottom of the linked article.
Jankov's program does not System.out.println the variables, so you cannot see what is happening. If you print the resulting value you get the same result every time (the program is thread safe); however, if you print some of the inner workings, the "inner behaviour" is different each time.
I understand the issues involved in thread scheduling and the unpredictability of when a thread runs. I believe that may be a factor in the question I ask below.
Program's Three Parts
The Main Class:
public class multiThreadTester {

    public static void main(String[] args) {
        // Counter object to be shared between two threads:
        Counter counter = new Counter();

        // Instantiation of Threads:
        Thread counterThread1 = new Thread(new CounterThread(counter), "counterThread1");
        Thread counterThread2 = new Thread(new CounterThread(counter), "counterThread2");

        counterThread1.start();
        counterThread2.start();
    }
}
The objective of the above class is simply to share an object. In this case, the threads share an object of type Counter:
Counter Class
public class Counter {

    long count = 0;

    // Adding a value to count data member:
    public synchronized void add(long value) {
        this.count += value;
    }

    public synchronized long getValue() {
        return count;
    }
}
The above is simply the definition of the Counter class, which includes only a primitive member of type long.
CounterThread Class
Below is the CounterThread class, virtually unmodified from the code provided by Jankov. The only real difference (besides my implementing Runnable as opposed to extending Thread) is the addition of System.out.println(). I added this to watch the inner workings of the program.
public class CounterThread implements Runnable {

    protected Counter counter = null;

    public CounterThread(Counter aCounter) {
        this.counter = aCounter;
    }

    public void run() {
        for (int i = 0; i < 10; i++) {
            System.out.println("BEFORE add - " + Thread.currentThread().getName() + ": " + this.counter.getValue());
            counter.add(i);
            System.out.println("AFTER add - " + Thread.currentThread().getName() + ": " + this.counter.getValue());
        }
    }
}
Question
As you can see, the code is very simple. The above code's only purpose is to watch what happens as two threads share a thread-safe object.
My question comes as a result of the output of the program (which I have tried to condense below). The output is hard to "get consistent" enough to demonstrate my question, as the spread of the difference (see below) can be quite large:
Here's the condensed output (trying to minimize what you have to look at):
AFTER add - counterThread1: 0
BEFORE add - counterThread1: 0
AFTER add - counterThread1: 1
BEFORE add - counterThread1: 1
AFTER add - counterThread1: 3
BEFORE add - counterThread1: 3
AFTER add - counterThread1: 6
BEFORE add - counterThread1: 6
AFTER add - counterThread1: 10
BEFORE add - counterThread2: 0 // This BEFORE add statement is the source of my question
And one more output that better demonstrates:
BEFORE add - counterThread1: 0
AFTER add - counterThread1: 0
BEFORE add - counterThread1: 0
AFTER add - counterThread1: 1
BEFORE add - counterThread2: 0
AFTER add - counterThread2: 1
BEFORE add - counterThread2: 1
AFTER add - counterThread2: 2
BEFORE add - counterThread2: 2
AFTER add - counterThread2: 4
BEFORE add - counterThread2: 4
AFTER add - counterThread2: 7
BEFORE add - counterThread2: 7
AFTER add - counterThread2: 11
BEFORE add - counterThread1: 1 // Here, counterThread1 still believes the value of Counter's counter is 1
AFTER add - counterThread1: 13
BEFORE add - counterThread1: 13
AFTER add - counterThread1: 16
BEFORE add - counterThread1: 16
AFTER add - counterThread1: 20
My question(s):
Thread safety ensures the safe mutation of a variable, i.e. only one thread can access the object at a time. This ensures that the "read" and "write" methods behave appropriately, only writing after a thread has released its lock (eliminating races).
Why, despite the correct write behaviour, does counterThread2 "believe" Counter's value (not the iterator i) to still be zero? What is happening in memory? Is this a matter of the thread containing its own, local Counter object?
Or, more simply: after counterThread1 has updated the value, why does counterThread2 not see (in this case, via System.out.println()) the correct value? Despite not seeing the value, the correct value is written to the object.
Why, despite the correct write behaviour, does counterThread2 "believe" Counter's value to still be zero?
The threads interleaved in such a way as to cause this behaviour. Because the print statements are outside of the synchronized method, it is possible for a thread to read the counter value and then pause due to OS scheduling while the other thread increments the counter multiple times. When the waiting thread finally resumes and enters the add method, the value of the counter will have moved on quite a bit and will no longer match what was printed in the BEFORE log line.
As an example, I have modified your code to make it more evident that both threads are working on the same counter. First I moved the print statements into the counter, then I added a unique thread label so that we can tell which thread was responsible for each increment, and finally I only increment by one so that any jumps in the counter value stand out more clearly.
public class Main {

    public static void main(String[] args) {
        // Counter object to be shared between two threads:
        Counter counter = new Counter();

        // Instantiation of Threads:
        Thread counterThread1 = new Thread(new CounterThread("A", counter), "counterThread1");
        Thread counterThread2 = new Thread(new CounterThread("B", counter), "counterThread2");

        counterThread1.start();
        counterThread2.start();
    }
}

class Counter {

    long count = 0;

    // Adding a value to count data member:
    public synchronized void add(String label, long value) {
        System.out.println(label + " BEFORE add - " + Thread.currentThread().getName() + ": " + this.count);
        this.count += value;
        System.out.println(label + " AFTER add - " + Thread.currentThread().getName() + ": " + this.count);
    }

    public synchronized long getValue() {
        return count;
    }
}

class CounterThread implements Runnable {

    private String label;
    protected Counter counter = null;

    public CounterThread(String label, Counter aCounter) {
        this.label = label;
        this.counter = aCounter;
    }

    public void run() {
        for (int i = 0; i < 10; i++) {
            counter.add(label, 1);
        }
    }
}
I need to calculate the mean of some numbers from a huge file and then take the square root of that mean:
1, 2, 3, 4, 5,\n
6, 7, 8, 9, 10,\n
11, 12, 13, 14,15,\n
...
This is the code:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class App1 {

    int res, c;
    double mean, root;
    ArrayList list = new ArrayList();

    public App1() {
        // einlesen (read in the file)
        Scanner sc = null;
        try {
            sc = new Scanner(new File("file.txt")).useDelimiter("[,\\s]+");
        } catch (FileNotFoundException ex) {
            System.err.println(ex);
        }

        while (sc.hasNextInt()) {
            list.add(sc.nextInt());
            res += (int) list.get(c);
            c++;
        }
        sc.close();

        // Mean
        mean = res / list.size();

        // Root
        root = Math.sqrt(mean);

        System.out.println("Mean: " + mean);
        System.out.println("Root: " + root);
    }

    public static void main(String[] args) {
        App1 app = new App1();
    }
}
Is there any way to parallelize it?
Before calculating the mean I need all the numbers, so one thread can't calculate while another is still fetching the numbers from the file.
The same goes for taking the root: a thread can't take the root of the mean if the mean hasn't been calculated yet.
I thought about Future; would that be a solution?
There's something critical you will have to accept up front: you will not be able to process the data any faster than you can read it from the file. So first, time how long it takes to read through the whole file, and accept that you won't improve on that.
That said, have you considered a ForkJoinPool?
You can calculate the mean in parallel, because the mean is simply the sum divided by the count. There is no reason why you cannot sum up the values in parallel, and count them as well, then just do the division later.
Consider a class:
public class PartialSum() {
private final int partialcount;
private final int partialsum;
public PartialSum(int count, int sum) {
partialsum = sum;
partialcount = count;
public int getCount() {
return partialcount;
}
public int getSum() {
return partialsum;
}
}
Now, this could be the return type of a Future, as in Future<PartialSum>.
So, what you need to do is split the file into parts, and then send the parts to individual threads.
Each thread calculates a PartialSum. Then, as the threads complete, you can do:
int sum = 0;
int count = 0;
for (Future<PartialSum> partial : futures) {
    PartialSum ps = partial.get();
    sum += ps.getSum();
    count += ps.getCount();
}
double mean = (double) sum / count;
double root = ....
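For completeness, here is a rough sketch of how the pieces could fit together using the PartialSum class above with a plain ExecutorService. The ParallelMean class, the sumTask helper, the pool size of 4 and the batch size of 1000 lines are all my own illustrative choices, not something prescribed by the question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMean {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<PartialSum>> futures = new ArrayList<>();

        // Reading stays single-threaded; only the parsing/summing of each
        // batch of lines is handed off to the pool.
        try (BufferedReader reader = new BufferedReader(new FileReader("file.txt"))) {
            List<String> batch = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == 1000) {
                    futures.add(pool.submit(sumTask(batch)));
                    batch = new ArrayList<>();
                }
            }
            if (!batch.isEmpty()) {
                futures.add(pool.submit(sumTask(batch)));
            }
        }

        int sum = 0;
        int count = 0;
        for (Future<PartialSum> partial : futures) {
            PartialSum ps = partial.get();
            sum += ps.getSum();
            count += ps.getCount();
        }
        pool.shutdown();

        double mean = (double) sum / count;
        System.out.println("Mean: " + mean + ", Root: " + Math.sqrt(mean));
    }

    // Each task parses its batch of comma-separated lines and returns a PartialSum.
    private static Callable<PartialSum> sumTask(List<String> lines) {
        return () -> {
            int sum = 0;
            int count = 0;
            for (String line : lines) {
                for (String token : line.split("[,\\s]+")) {
                    if (!token.isEmpty()) {
                        sum += Integer.parseInt(token);
                        count++;
                    }
                }
            }
            return new PartialSum(count, sum);
        };
    }
}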
I think it's possible.
int offset = (filesize / number of threads)
Create n threads.
Each thread starts reading from offset * thread number, e.g. thread 0 starts reading from byte 0, thread 1 starts reading from offset * 1, thread 2 starts reading from offset * 2.
If the thread number != 0, read ahead until you hit a newline character and start from there.
Add up an average per thread. Save it to "thread_average" or something.
When all threads are finished, total average = the average of all the "thread_average" variables.
Take the square root of the total average.
It will need a bit of messing around to make sure the threads don't read too far into another thread's block of the file, but it should be doable; a rough sketch of the offset handling follows below.
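A minimal sketch of just that offset-and-skip-to-a-newline part, using RandomAccessFile (the FileSlice name and the sum/count return value are mine; the thread setup and the averaging described in the steps above are omitted):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileSlice {

    // Sums the numbers on the lines that start inside [start, start + length).
    // Every slice except the first seeks to start - 1 and discards up to the next
    // newline, so no slice begins in the middle of a line and a line straddling a
    // boundary is counted exactly once (by the slice it starts in).
    static long[] sumAndCountSlice(String path, long start, long length) throws IOException {
        long sum = 0;
        long count = 0;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (start > 0) {
                file.seek(start - 1);
                file.readLine();
            }
            String line;
            while (file.getFilePointer() < start + length && (line = file.readLine()) != null) {
                for (String token : line.split("[,\\s]+")) {
                    if (!token.isEmpty()) {
                        sum += Long.parseLong(token);
                        count++;
                    }
                }
            }
        }
        return new long[] { sum, count };
    }
}

Returning a per-slice sum and count (rather than a per-slice average) also sidesteps the case where slices end up with different numbers of values.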
No, there is no way to parallelize this. Although you could do something that looks like threading, the result would be overly complex yet still run at about the same speed as before.
The reason for this is that file access is, and has to be, single-threaded, and besides reading from the file all you do is two add operations. So in the best case those add operations could be parallelized; however, since they take almost no execution time, the gain would be maybe 5% to 10% at best. And that gain is negated (or worse) by thread creation and maintenance.
One thing you can do to speed things up is to remove the part where you put everything into a list (assuming that you don't need those values later):
while (sc.hasNextInt()) {
    res += sc.nextInt();
    ++c;
}
mean = res / c;
I wrote some Java code to learn more about the Executor framework.
Specifically, I wrote code to verify the Collatz Hypothesis - this says that if you iteratively apply the following function to any integer, you get to 1 eventually:
f(n) = ((n % 2) == 0) ? n/2 : 3*n + 1
CH is still unproven, and I figured it would be a good way to learn about Executor. Each thread is assigned a range [l,u] of integers to check.
Specifically, my program takes 3 arguments - N (the number up to which I want to check CH), RANGESIZE (the length of the interval that a thread has to process), and NTHREADS, the size of the thread pool.
My code works fine, but I saw much less speedup than I expected - on the order of 30% when I went from 1 to 4 threads.
My logic was that the computation is completely CPU bound, and each subtask (checking CH for a fixed-size range) takes roughly the same time.
Does anyone have ideas as to why I'm not seeing a 3x to 4x increase in speed?
If you could report your runtimes as you increase the number of threads (along with the machine, JVM and OS), that would also be great.
Specifics
Runtimes:
java -d64 -server -cp . Collatz 10000000 1000000 4 => 4 threads, takes 28412 milliseconds
java -d64 -server -cp . Collatz 10000000 1000000 1 => 1 thread, takes 38286 milliseconds
Processor:
Quad-core Intel Q6600 at 2.4GHz, 4GB RAM. The machine is unloaded.
Java:
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
OS:
Linux quad0 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux
Code (I couldn't get the code to post at first, I think it's too long for SO requirements; the source is also available on Google Docs):
import java.math.BigInteger;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MyRunnable implements Runnable {
    public int lower;
    public int upper;

    MyRunnable(int lower, int upper) {
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    public void run() {
        for (int i = lower; i <= upper; i++) {
            Collatz.check(i);
        }
        System.out.println("(" + lower + "," + upper + ")");
    }
}

public class Collatz {

    public static boolean check(BigInteger X) {
        if (X.equals(BigInteger.ONE)) {
            return true;
        } else if (X.getLowestSetBit() == 1) {
            // odd
            BigInteger Y = (new BigInteger("3")).multiply(X).add(BigInteger.ONE);
            return check(Y);
        } else {
            BigInteger Z = X.shiftRight(1); // fast divide by 2
            return check(Z);
        }
    }

    public static boolean check(int x) {
        BigInteger X = new BigInteger(new Integer(x).toString());
        return check(X);
    }

    static int N = 10000000;
    static int RANGESIZE = 1000000;
    static int NTHREADS = 4;

    static void parseArgs(String[] args) {
        if (args.length >= 1) {
            N = Integer.parseInt(args[0]);
        }
        if (args.length >= 2) {
            RANGESIZE = Integer.parseInt(args[1]);
        }
        if (args.length >= 3) {
            NTHREADS = Integer.parseInt(args[2]);
        }
    }

    public static void maintest(String[] args) {
        System.out.println("check(1): " + check(1));
        System.out.println("check(3): " + check(3));
        System.out.println("check(8): " + check(8));
        parseArgs(args);
    }

    public static void main(String[] args) {
        long lDateTime = new Date().getTime();
        parseArgs(args);
        List<Thread> threads = new ArrayList<Thread>();
        ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
        for (int i = 0; i < (N / RANGESIZE); i++) {
            Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
            executor.execute(worker);
        }
        executor.shutdown();
        while (!executor.isTerminated()) {
        }
        System.out.println("Finished all threads");
        long fDateTime = new Date().getTime();
        System.out.println("time in milliseconds for checking to " + N + " is " +
                (fDateTime - lDateTime) +
                " (" + N / (fDateTime - lDateTime) + " per ms)");
    }
}
Busy waiting can be a problem:
while (!executor.isTerminated() ) {
}
You can use awaitTermination() instead:
while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {}
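Note that awaitTermination() throws the checked InterruptedException (and needs java.util.concurrent.TimeUnit imported), so in main you would either declare the exception or wrap the loop, for example:

try {
    while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {
        // keep waiting until all submitted tasks have finished
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}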
You are using BigInteger. It consumes a lot of register space. What you most likely have at the compiler level is register spilling, which makes your process memory-bound.
Also note that when you are timing your results you are not taking into account extra time taken by the JVM to allocate threads and work with the thread pool.
You could also have memory conflicts when you are using constant Strings. All strings are stored in a shared string pool, so it may become a bottleneck, unless Java is really clever about it.
Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.
As @axtavt answered, busy waiting can be a problem. You should fix that first, as it is part of the answer, but not all of it. It won't appear to help in your case (on the Q6600), because it seems to be bottlenecked at 2 cores for some reason, so another core is available for the busy loop and there is no apparent slowdown; but on my Core i5 it speeds up the 4-thread version noticeably.
I suspect that in the case of the Q6600 your particular app is limited by the amount of shared cache available or something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, which means pairs of cores share them, and there is no L3 cache. On my Core i5, each core has a dedicated L2 cache (256K), and then there is a larger 8MB shared L3 cache. 256K more per-core cache might make a difference... otherwise something else architecture-wise does.
Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.
On my work PC, which is also a Q6600 @ 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21 (64-bit), here are some basic results:
10000000 500000 1 (avg of three runs): 36982 ms
10000000 500000 4 (avg of three runs): 21252 ms
Faster, certainly - but not completing in a quarter of the time like you would expect, or even half... (though it is roughly just a bit more than half, more on that in a moment). Note that in my case I halved the size of the work units, and have a default max heap of 1500m.
At home on my Core i5 750 (4 cores no hyperthreading), 4GB RAM, Windows 7 64-bit, jdk 1.6.0_22 (64-bit):
10000000 500000 1 (avg of 3 runs) 32677 ms
10000000 500000 4 (avg of 3 runs) 8825 ms
10000000 500000 4 (avg of 3 runs) 11475 ms (without the busy wait fix, for reference)
The 4-thread version takes 27% of the time the 1-thread version takes when the busy-wait loop is removed. Much better. Clearly the code can make efficient use of 4 cores...
NOTE: Java 1.6.0_18 and later have modified default heap settings - so my default heap size is almost 1500m on my work PC, and around 1000m on my home PC.
You may want to increase your default heap, just in case garbage collection is happening and slowing your 4-threaded version down a bit. It might help, it might not.
At least in your example, there's a chance your larger work unit size is skewing your results slightly... halving it may help you get closer to at least a 2x speedup, since 4 threads will be kept busy for a larger portion of the time. I don't think the Q6600 will do much better at this particular task... whether it is cache or some other inherent architecture thing.
In all cases, I am simply running "java Collatz 10000000 500000 X", where X = the number of threads indicated.
The only changes I made to your Java file were to make one of the println's into a print, so there were fewer line breaks for my runs with 500000 per work unit and I could see more results in my console at once, and I ditched the busy-wait loop, which matters on the i5 750 but didn't make a difference on the Q6600.
You should try using the submit method and then watching the Futures that are returned, checking them to see whether each task has finished.
isTerminated() doesn't return true until shutdown has happened.
Future<?> submit(Runnable task)
Submits a Runnable task for execution and returns a Future representing that task.
isTerminated()
Returns true if all tasks have completed following shut down.
Try this...
public static void main(String[] args) {
    long lDateTime = new Date().getTime();
    parseArgs(args);
    List<Thread> threads = new ArrayList<Thread>();
    List<Future> futures = new ArrayList<Future>();
    ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
    for (int i = 0; i < (N / RANGESIZE); i++) {
        Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
        futures.add(executor.submit(worker));
    }
    boolean done = false;
    while (!done) {
        for (Future future : futures) {
            done = true;
            if (!future.isDone()) {
                done = false;
                break;
            }
        }
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    System.out.println("Finished all threads");
    long fDateTime = new Date().getTime();
    System.out.println("time in milliseconds for checking to " + N + " is " +
            (fDateTime - lDateTime) +
            " (" + N / (fDateTime - lDateTime) + " per ms)");
    System.exit(0);
}