Efficient serialization of native Java arrays with java.io

I have a question about Java serialization.
I'm simply writing 10 arrays, each created as int[] array = new int[2^28] (that is, 2^28 elements; I know that's rather big, but I need it that way), to my hard disk using a FileOutputStream and a BufferedOutputStream in combination with a DataOutputStream. Before each serialization I create a new FileOutputStream and all the other streams, and afterwards I flush and close my streams.
Problem:
The first serialization takes about 2 seconds; afterwards it increases to as much as 17 seconds and stays at that level. What's the problem here? If I step into the code I can see that the FileOutputStreams take a huge amount of time for writeByte(...). Is this due to the HDD cache being full? How can I avoid this? Can I clear it?
Here is my simple code:
public static void main(String[] args) throws IOException {
System.out.println("### Starting test");
for (int k = 0; k < 10; k++) {
System.out.println("### Run nr ... " + k);
// Creating the test array....
int[] testArray = new int[(int) Math.pow(2, 28)];
for (int i = 0; i < testArray.length; i++) {
if (i % 2 == 0) {
testArray[i] = i;
}
}
BufferedDataOutputStream dataOut = new BufferedDataOutputStream(
new FileOutputStream("e:\\test" + k + "_" + 28 + ".dat"));
// Serializing...
long start = System.nanoTime();
dataOut.write(testArray);
System.out.println((System.nanoTime() - start) / 1000000000.0
+ " s");
dataOut.flush();
dataOut.close();
}
}
where dataOut.write(int[], 0, end)
public void write(int[] i, int start, int len) throws IOException {
for (int ii = start; ii < start + len; ii += 1) {
if (count + 4 > buf.length) {
checkBuf(4);
}
buf[count++] = (byte) (i[ii] >>> 24);
buf[count++] = (byte) (i[ii] >>> 16);
buf[count++] = (byte) (i[ii] >>> 8);
buf[count++] = (byte) (i[ii]);
}
}
and
protected void checkBuf(int need) throws IOException {
if (count + need > buf.length) {
out.write(buf, 0, count);
count = 0;
}
}
BufferedDataOutputStream, which extends BufferedOutputStream, comes with the fits framework. It simply combines BufferedOutputStream with DataOutputStream to reduce the number of method calls when you write big arrays (which makes it a lot faster, up to 10 times).
Here is the output:
Starting benchmark
STARTING RUN 0
2.001972271
STARTING RUN 1
1.986544604
STARTING RUN 2
15.663881232
STARTING RUN 3
17.652161328
STARTING RUN 4
18.020969301
STARTING RUN 5
11.647542466
STARTING RUN 6
Why does the time increase so much?
Thank you,
Eeth

In this program I populate 1 GB as int values and "force" these to be written to disk.
String dir = args[0];
for (int i = 0; i < 24; i++) {
long start = System.nanoTime();
File tmp = new File(dir, "deleteme." + i);
tmp.deleteOnExit();
RandomAccessFile raf = new RandomAccessFile(tmp, "rw");
final MappedByteBuffer map = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 1 << 30);
IntBuffer array = map.order(ByteOrder.nativeOrder()).asIntBuffer();
for (int n = 0; n < array.capacity(); n++)
array.put(n, n);
map.force();
((DirectBuffer) map).cleaner().clean();
raf.close();
long time = System.nanoTime() - start;
System.out.printf("Took %.1f seconds to write 1 GB%n", time / 1e9);
}
With each file forced to disk, every write takes about the same amount of time.
Took 7.7 seconds to write 1 GB
Took 7.5 seconds to write 1 GB
Took 7.7 seconds to write 1 GB
Took 7.9 seconds to write 1 GB
Took 7.6 seconds to write 1 GB
Took 7.7 seconds to write 1 GB
However, if I comment out the map.force(); I see this profile.
Took 0.8 seconds to write 1 GB
Took 1.0 seconds to write 1 GB
Took 4.9 seconds to write 1 GB
Took 7.2 seconds to write 1 GB
Took 7.0 seconds to write 1 GB
Took 7.2 seconds to write 1 GB
Took 7.2 seconds to write 1 GB
It appears that it will buffer about 2.5 GB which is about 10% of my main memory before it slows down.
You can clear your cache by waiting for the previous writes to finish.
Basically you have 1 GB of data, and the sustained write speed of your disk appears to be about 60 MB/s, which is reasonable for a SATA hard drive. If you get speeds higher than this, it's because the data hasn't really been written to disk and is actually still in memory.
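As a rough sanity check of these figures, here is a minimal sketch of the arithmetic, assuming the 60 MB/s sustained rate estimated above and the 2^28-element int array from the question. The class and method names are made up for illustration.

```java
class WriteTimeEstimate {
    /** Seconds needed to write the given number of bytes at a sustained rate in MB/s (10^6 bytes). */
    static double secondsToWrite(long bytes, double megabytesPerSecond) {
        return bytes / (megabytesPerSecond * 1_000_000.0);
    }

    public static void main(String[] args) {
        // 2^28 ints at 4 bytes each is 2^30 bytes, i.e. 1 GiB.
        long bytes = (1L << 28) * Integer.BYTES;
        // Assumed rate: the 60 MB/s estimate from the answer, not a measured value.
        System.out.println(secondsToWrite(bytes, 60.0) + " s"); // about 17.9 s
    }
}
```

That estimate is close to the roughly 17 seconds the question reports once the fast first runs are over.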
If you want this to be faster you can use a memory-mapped file. This has the benefit of writing to disk in the background as you are populating the "array", i.e. it can be finished writing almost as soon as you finish setting the values.
Another option is to get a faster drive. A single 250 GB SSD drive can sustain writes of around 200 MB/s. Using multiple drives in a RAID configuration can also increase write speed.
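Independently of the disk-speed limit, the per-element byte shuffling in the question's write(int[], int, int) could be replaced by a bulk copy through java.nio. The following is only a sketch under that assumption, not how the fits library does it; IntArrayWriter and chunkInts are hypothetical names. It writes big-endian ints, the same byte layout DataOutputStream.writeInt produces.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class IntArrayWriter {
    /** Writes all ints of data to file in big-endian order, chunkInts values at a time. */
    static void write(int[] data, Path file, int chunkInts) throws IOException {
        // ByteBuffer defaults to big-endian, matching DataOutputStream.
        ByteBuffer bytes = ByteBuffer.allocateDirect(chunkInts * Integer.BYTES);
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int offset = 0; offset < data.length; offset += chunkInts) {
                int n = Math.min(chunkInts, data.length - offset);
                bytes.clear();
                // Bulk copy through an int view; this does not move the byte buffer's position.
                bytes.asIntBuffer().put(data, offset, n);
                bytes.limit(n * Integer.BYTES);
                while (bytes.hasRemaining()) {
                    channel.write(bytes);
                }
            }
        }
    }
}
```

A call such as IntArrayWriter.write(testArray, Paths.get("test.dat"), 1 << 20) would write the array in 4 MB chunks. This only reduces CPU work per int; it cannot push the drive beyond its sustained write rate.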

The first writes may just be filling up your hard drive's cache without actually writing to disk yet.

Related

Trying to measure how much time an insert takes in a database

I have a multithreaded program that inserts into one of my tables, and I run it like this:
java -jar CannedTest.jar 100 10000
which means:
Number of threads is 100
Number of tasks is 10000
So each thread will insert 10000 records in my table. So that means total count (100 * 10000) in the table should be 1,000,000 after program is finished executing.
I am trying to measure how long an insert into my table takes as part of our LnP testing. I store the timings in a ConcurrentHashMap, keyed by the number of milliseconds an insert took, as shown below.
long start = System.nanoTime();
callableStatement[pos].executeUpdate(); // flush the records.
long end = System.nanoTime() - start;
final AtomicLong before = insertHistogram.putIfAbsent(end / 1000000L, new AtomicLong(1L));
if (before != null) {
before.incrementAndGet();
}
When all the threads have finished executing all the tasks, I print the numbers from the ConcurrentHashMap insertHistogram, sorted by key (milliseconds), and I get a result like this:
Milliseconds Number
0 2335
1 62488
2 60286
3 54967
4 52374
5 93034
6 123083
7 179355
8 118686
9 87126
10 42305
.. ..
.. ..
.. ..
And also, from the same ConcurrentHashMap insertHistogram I tried to make a Histogram like below.
17:46:06,112 INFO LoadTest:195 - Insert Histogram List:
17:46:06,112 INFO LoadTest:212 - 64823 came back between 1 and 2 ms
17:46:06,112 INFO LoadTest:212 - 115253 came back between 3 and 4 ms
17:46:06,112 INFO LoadTest:212 - 447846 came back between 5 and 8 ms
17:46:06,112 INFO LoadTest:212 - 330533 came back between 9 and 16 ms
17:46:06,112 INFO LoadTest:212 - 29188 came back between 17 and 32 ms
17:46:06,112 INFO LoadTest:212 - 6548 came back between 33 and 64 ms
17:46:06,112 INFO LoadTest:212 - 3821 came back between 65 and 128 ms
17:46:06,113 INFO LoadTest:212 - 1988 came back greater than 128 ms
NOTE: The database into which I am inserting records is currently in memory-only mode.
Problem statement:
Take a look at this line in my result above, which is printed by sorting on the key:
0 2335
I am not sure how it is possible that 2335 calls were inserted in 0 milliseconds, given that I am using System.nanoTime() to measure the insert.
Below is the code which will print out the above logs-
private static void logHistogramInfo() {
int[] definition = { 0, 2, 4, 8, 16, 32, 64, 128 };
long[] buckets = new long[definition.length];
System.out.println("Milliseconds Number");
SortedSet<Long> keys = new TreeSet<Long>(Task.insertHistogram.keySet());
for (long key : keys) {
AtomicLong value = Task.insertHistogram.get(key);
System.out.println(key+ " " + value);
}
LOG.info("Insert Histogram List: ");
for (Long time : Task.insertHistogram.keySet()) {
for (int i = definition.length - 1; i >= 0; i--) {
if (time >= definition[i]) {
buckets[i] += Task.insertHistogram.get(time).get();
break;
}
}
}
for (int i = 0; i < definition.length; i++) {
String period = "";
if (i == definition.length - 1) {
period = "greater than " + definition[i] + " ms";
} else {
period = "between " + (definition[i] + 1) + " and " + definition[i + 1] + " ms";
}
LOG.info(buckets[i] + " came back " + period);
}
}
I am not sure why 0 milliseconds is getting shown when I try to print the values from the Map directly by sorting it on the key.
But the same 0 milliseconds doesn't get shown when I try to make the histogram in the same logHistogramInfo method.
Is there anything wrong I am doing in my calculation process in my above method?
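One reading of the numbers above, offered as a sketch rather than a confirmed diagnosis: end / 1000000L is integer division, so any insert that takes less than one millisecond is stored under key 0. The bucket loop then places key 0 in the first bucket (because time >= definition[0] holds for 0), while the label for that bucket is printed as definition[i] + 1. The logged totals are consistent with this: 2335 + 62488 = 64823, which is exactly the count reported as "between 1 and 2 ms". The class below is hypothetical and only mirrors the two calculations.

```java
class MillisBucket {
    // Same bucket boundaries as in logHistogramInfo().
    static final int[] DEFINITION = { 0, 2, 4, 8, 16, 32, 64, 128 };

    /** Truncating conversion, as in end / 1000000L in the question. */
    static long toMillis(long nanos) {
        return nanos / 1_000_000L;
    }

    /** Index of the bucket the question's loop would pick for a key. */
    static int bucketIndex(long millis) {
        for (int i = DEFINITION.length - 1; i >= 0; i--) {
            if (millis >= DEFINITION[i]) {
                return i;
            }
        }
        return -1; // only reachable for negative input
    }

    public static void main(String[] args) {
        System.out.println(toMillis(999_999L));   // 0: anything under 1 ms truncates to 0
        System.out.println(toMillis(1_000_000L)); // 1
        System.out.println(bucketIndex(0));       // 0: key 0 lands in the first bucket
        System.out.println(bucketIndex(1));       // 0: so does key 1
        System.out.println(bucketIndex(2));       // 1
    }
}
```

Under this reading, bucket i covers keys from definition[i] up to definition[i + 1] - 1, so the first bucket would be more accurately labelled "between 0 and 1 ms".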

Profiling java code that calls Runtime.freeMemory()

I have some code that profiles Runtime.freeMemory. Here is my code:
package misc;
import java.util.ArrayList;
import java.util.Random;
public class FreeMemoryTest {
private final ArrayList<Double> l;
private final Random r;
public FreeMemoryTest(){
this.r = new Random();
this.l = new ArrayList<Double>();
}
public static boolean memoryCheck() {
double freeMem = Runtime.getRuntime().freeMemory();
double totalMem = Runtime.getRuntime().totalMemory();
double fptm = totalMem * 0.05;
boolean toReturn = fptm > freeMem;
return toReturn;
}
public void freeMemWorkout(int max){
for(int i = 0; i < max; i++){
memoryCheck();
l.add(r.nextDouble());
}
}
public void workout(int max){
for(int i = 0; i < max; i++){
l.add(r.nextDouble());
}
}
public static void main(String[] args){
FreeMemoryTest f = new FreeMemoryTest();
int count = Integer.parseInt(args[1]);
long startTime = System.currentTimeMillis();
if(args[0].equals("f")){
f.freeMemWorkout(count);
} else {
f.workout(count);
}
long endTime = System.currentTimeMillis();
System.out.println(endTime - startTime);
}
}
When I run the profiler using -Xrunhprof:cpu=samples, the vast majority of the calls are to the Runtime.freeMemory(), like this:
CPU SAMPLES BEGIN (total = 531) Fri Dec 7 00:17:20 2012
rank self accum count trace method
1 83.62% 83.62% 444 300274 java.lang.Runtime.freeMemory
2 9.04% 92.66% 48 300276 java.lang.Runtime.totalMemory
When I run the profiler using -Xrunhprof:cpu=time, I don't see any of the calls to Runtime.freeMemory at all, and the top five calls are as follows:
CPU TIME (ms) BEGIN (total = 10042) Fri Dec 7 00:29:51 2012
rank self accum count trace method
1 13.39% 13.39% 200000 307547 java.util.Random.next
2 9.69% 23.08% 1 307852 misc.FreeMemoryTest.freeMemWorkout
3 7.41% 30.49% 100000 307544 misc.FreeMemoryTest.memoryCheck
4 7.39% 37.88% 100000 307548 java.util.Random.nextDouble
5 4.35% 42.23% 100000 307561 java.util.ArrayList.add
These two profiles are so different from one another. I thought that samples was supposed to at least roughly approximate the results from the times, but here we see a very radical difference, something that consumes more than 80% of the samples doesn't even appear in the times profile. This does not make any sense to me, does anyone know why this is happening?
More on this:
$ java -Xmx1000m -Xms1000m -jar memtest.jar a 20000000
5524
// does not have the calls to Runtime.freeMemory()
$ java -Xmx1000m -Xms1000m -jar memtest.jar f 20000000
9442
// has the calls to Runtime.freeMemory()
Running with the freeMemory calls requires approximately twice as much time as running without them. If 80% of the CPU time were spent in java.lang.Runtime.freeMemory(), then removing that call should speed the program up by a factor of approximately 5. As we can see above, the program speeds up by a factor of approximately 2.
A factor of 5 is far from the factor of 2 observed empirically, so what I do not understand is how the sampling profiler can be so far off from reality.
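The arithmetic behind that expectation can be written out as a small sketch. It assumes the simple model that removing a fraction p of the run time leaves 1 - p of it; the class and method names are hypothetical, and the two timings are the ones reported above.

```java
class SpeedupEstimate {
    /** Speedup expected if a fraction p of the run time is removed entirely. */
    static double expectedSpeedup(double fractionRemoved) {
        return 1.0 / (1.0 - fractionRemoved);
    }

    /** Fraction of the slower run that the removed work must have accounted for. */
    static double impliedFraction(double timeWith, double timeWithout) {
        return 1.0 - timeWithout / timeWith;
    }

    public static void main(String[] args) {
        System.out.println(expectedSpeedup(0.8));          // about 5
        System.out.println(impliedFraction(9442.0, 5524.0)); // about 0.415
    }
}
```

By this model the measured wall-clock times suggest the extra calls cost roughly 40% of the run, not the 80% or more that the sampling profile attributes to them.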

The Runtime freeMemory() and totalMemory() are native calls.
See http://www.docjar.com/html/api/java/lang/Runtime.java.html
The timer cannot time them, but the sampler can.

Adding numbers using Java Long wrapper versus primitive longs

I am running this code and getting unexpected results. I expect that the loop which adds the primitives would perform much faster, but the results do not agree.
import java.util.*;
public class Main {
public static void main(String[] args) {
StringBuilder output = new StringBuilder();
long start = System.currentTimeMillis();
long limit = 1000000000; //10^9
long value = 0;
for(long i = 0; i < limit; ++i){}
long i;
output.append("Base time\n");
output.append(System.currentTimeMillis() - start + "ms\n");
start = System.currentTimeMillis();
for(long j = 0; j < limit; ++j) {
value = value + j;
}
output.append("Using longs\n");
output.append(System.currentTimeMillis() - start + "ms\n");
start = System.currentTimeMillis();
value = 0;
for(long k = 0; k < limit; ++k) {
value = value + (new Long(k));
}
output.append("Using Longs\n");
output.append(System.currentTimeMillis() - start + "ms\n");
System.out.print(output);
}
}
Output:
Base time
359ms
Using longs
1842ms
Using Longs
614ms
I have tried running each individual test in its own Java program, but the results are the same. What could cause this?
Small detail: running java 1.6
Edit:
I asked 2 other people to try out this code, one gets the same exact strange results that I get. The other gets results that actually make sense! I asked the guy who got normal results to give us his class binary. We run it and we STILL get the strange results. The problem is not at compile time (I think). I'm running 1.6.0_31, the guy who gets normal results is on 1.6.0_16, the guy who gets strange results like I do is on 1.7.0_04.
Edit: Get same results with a Thread.sleep(5000) at the start of program. Also get the same results with a while loop around the whole program (to see if the times would converge to normal times after java was fully started up)

I suspect that this is a JVM warmup effect. Specifically, the code is being JIT compiled at some point, and this is distorting the times that you are seeing.
Put the whole lot in a loop, and ignore the times reported until they stabilize. (But note that they won't entirely stabilize. Garbage is being generated, and therefore the GC will need to kick in occasionally. This is liable to distort the timings, at least a bit. The best way to deal with this is to run a huge number of iterations of the outer loop, and calculate / display the average times.)
Another problem is that the JIT compiler on some releases of Java may be able to optimize away the stuff you are trying to test:
It could figure out that the creation and immediate unboxing of the Long objects could be optimized away. (Thanks Louis!)
It could figure out that the loops are doing "busy work" ... and optimize them away entirely. (The value of value is not used once each loop ends.)
FWIW, it is generally recommended that you use Long.valueOf(long) rather than new Long(long) because the former can make use of a cached Long instance. However, in this case, we can predict that there will be a cache miss in all but the first few loop iterations, so the recommendation is not going to help. If anything, it is likely to make the loop in question slower.
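To illustrate the cache-miss point, here is a minimal sketch. Current Javadoc for Long.valueOf(long) guarantees caching only for values from -128 to 127; whether anything outside that range is cached is up to the implementation. LongCacheDemo is a hypothetical name.

```java
class LongCacheDemo {
    /** True if two valueOf calls for the same value return the very same object. */
    static boolean sameInstance(long v) {
        return Long.valueOf(v) == Long.valueOf(v); // reference comparison on purpose
    }

    public static void main(String[] args) {
        System.out.println(sameInstance(127L));       // true: -128..127 is always cached
        System.out.println(sameInstance(1_000_000L)); // typically false, but not guaranteed
    }
}
```

In the benchmark loop the counter leaves the cached range after the first 128 iterations, so almost every Long.valueOf call pays for a range check and then allocates anyway.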
UPDATE
I did some investigation of my own, and ended up with the following:
import java.util.*;
public class Main {
public static void main(String[] args) {
while (true) {
test();
}
}
private static void test() {
long start = System.currentTimeMillis();
long limit = 10000000; //10^7
long value = 0;
for(long i = 0; i < limit; ++i){}
long t1 = System.currentTimeMillis() - start;
start = System.currentTimeMillis();
for(long j = 0; j < limit; ++j) {
value = value + j;
}
long t2 = System.currentTimeMillis() - start;
start = System.currentTimeMillis();
for(long k = 0; k < limit; ++k) {
value = value + (new Long(k));
}
long t3 = System.currentTimeMillis() - start;
System.out.print(t1 + " " + t2 + " " + t3 + " " + value + "\n");
}
}
which gave me the following output.
28 58 2220 99999990000000
40 58 2182 99999990000000
36 49 157 99999990000000
34 51 157 99999990000000
37 49 158 99999990000000
33 52 158 99999990000000
33 50 159 99999990000000
33 54 159 99999990000000
35 52 159 99999990000000
33 52 159 99999990000000
31 50 157 99999990000000
34 51 156 99999990000000
33 50 159 99999990000000
Note that the first two columns are pretty stable, but the third one shows a significant speedup on the 3rd iteration ... probably indicating that JIT compilation has occurred.
Interestingly, before I separated out the test into a separate method, I didn't see the speedup on the 3rd iteration. The numbers all looked like the first two rows. And that seems to be saying that the JVM (that I'm using) won't JIT compile a method that is currently executing ... or something like that.
Anyway, this demonstrates (to me) that there should be a warm up effect. If you don't see a warmup effect, your benchmark is doing something that is inhibiting JIT compilation ... and therefore isn't meaningful for real applications.

I'm surprised, too.
My first guess would have been inadvertent "autoboxing", but that's clearly not an issue in your example code.
This link might give a clue:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Long.html
valueOf
public static Long valueOf(long l)
Returns a Long instance representing the specified long value. If a new Long instance is not required, this method should generally be
used in preference to the constructor Long(long), as this method is
likely to yield significantly better space and time performance by
caching frequently requested values.
Parameters:
l - a long value.
Returns:
a Long instance representing l.
Since:
1.5
But yes, I would expect using a wrapper (e.g. "Long") to take MORE time, and MORE space. I would not expect using the wrapper to be three times FASTER!
================================================================================
ADDENDUM:
I got these results with your code:
Base time 6878ms
Using longs 10515ms
Using Longs 428022ms
I'm running JDK 1.6.0_16 on a pokey 32-bit, single-core CPU.

OK - here's a slightly different version, along with my results (running JDK 1.6.0_16 on a pokey 32-bit single-core CPU):
import java.util.*;
/*
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 343 896 3431 6025
1 342 957 3401 5796
2 342 881 3379 5742
*/
public class LongTest {
private static int limit = 100000000;
private static int ntimes = 3;
private static final long[] base = new long[ntimes];
private static final long[] primitives = new long[ntimes];
private static final long[] wrappers1 = new long[ntimes];
private static final long[] wrappers2 = new long[ntimes];
private static void test_base (int idx) {
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){}
base[idx] = System.currentTimeMillis() - start;
}
private static void test_primitive (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + i;
}
primitives[idx] = System.currentTimeMillis() - start;
}
private static void test_wrappers1 (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + new Long(i);
}
wrappers1[idx] = System.currentTimeMillis() - start;
}
private static void test_wrappers2 (int idx) {
long value = 0;
long start = System.currentTimeMillis();
for (int i = 0; i < limit; ++i){
value = value + Long.valueOf(i);
}
wrappers2[idx] = System.currentTimeMillis() - start;
}
public static void main(String[] args) {
for (int i=0; i < ntimes; i++) {
test_base (i);
test_primitive(i);
test_wrappers1 (i);
test_wrappers2 (i);
}
System.out.println ("Test Base longs Longs/new Longs/valueOf");
System.out.println ("---- ---- ----- --------- -------------");
for (int i=0; i < ntimes; i++) {
System.out.printf (" %2d %6d %6d %6d %6d\n",
i, base[i], primitives[i], wrappers1[i], wrappers2[i]);
}
}
}
=======================================================================
5.28.2012:
Here are some additional timings, from a faster (but still modest), dual-core CPU running Windows 7/64 and running the same JDK revision 1.6.0_16:
/*
PC 1: limit = 100,000,000, ntimes = 3, JDK 1.6.0_16 (32-bit):
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 343 896 3431 6025
1 342 957 3401 5796
2 342 881 3379 5742
PC 2: limit = 1,000,000,000, ntimes = 5,JDK 1.6.0_16 (64-bit):
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 3 2 5627 5573
1 0 0 5494 5537
2 0 0 5475 5530
3 0 0 5477 5505
4 0 0 5487 5508
PC 2: "for loop" counters => long; limit = 10,000,000,000, ntimes = 5:
Test Base longs Longs/new Longs/valueOf
---- ---- ----- --------- -------------
0 6278 6302 53713 54064
1 6273 6286 53547 53999
2 6273 6294 53606 53986
3 6274 6325 53593 53938
4 6274 6279 53566 53974
*/
You'll notice:
I'm not using StringBuilder, and I separate out all of the I/O until the end of the program.
"long" primitive is consistently equivalent to a "no-op"
"Long" wrappers are consistently much, much slower
"new Long()" is slightly faster than "Long.valueOf()"
Changing the loop counters from "int" to "long" makes the first two columns ("base" and "longs") much slower.
"JIT warmup" is negligible after the first few iterations...
... provided I/O (like System.out) and potentially memory-intensive activities (like StringBuilder) are moved outside of the actual test sections.

How to measure internet bandwidth

I have a problem and can't find answers. I would like to measure internet bandwidth with Java, but I don't know how.
It would be great to get some hints; I know that I have to open a socket, send data to a defined server, get it back and then use the elapsed time.
But how would I code this?

Well I'd implement this simply by downloading a fixed size file. Not tested, but something along these lines should work just fine
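// Requires java.io.InputStream and java.net.URL.
byte[] buffer = new byte[BUFFERSIZE];
// A Socket cannot be opened from a URL string; URL.openStream() issues the request for us.
InputStream is = new URL(urlOfKnownFile).openStream();
long totalBytes = 0;
long start = System.nanoTime();
int n;
while ((n = is.read(buffer)) != -1) {
    totalBytes += n; // count what was actually read, in case the file size is not known
}
long end = System.nanoTime();
is.close();
long time = end - start;
// Now we know that it took about time ns to download totalBytes bytes,
// i.e. roughly totalBytes / (time / 1e9) bytes per second.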

How about fixing an arbitrary time window and sending the data within it?
For example, let's say I want my server to limit its bandwidth usage to 100 bytes/s.
So I fix a window of 1 second and keep sending as long as I stay within 1 second and 100 bytes.
Here's some pseudocode to show what I'm talking about:
timer_get (a);
sent_data = 0;
while (not_finished_sending_data)
{
timer_get (b);
if ((b - a) < 1 ) // 1 second
{
if (sent_data < 100) // 100 bytes
{
// We actually send here
sent_data += send();
}
}
else
{
timer_get (a);
sent_data = 0;
}
}
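The pseudocode above could be sketched in Java roughly as follows. This is one possible rendering, not a tested rate limiter: ThrottledSender and its parameters are hypothetical names, and unlike the pseudocode it sleeps for the rest of the window instead of spinning once the budget is used up.

```java
import java.io.IOException;
import java.io.OutputStream;

class ThrottledSender {
    /**
     * Writes data to out, sending at most bytesPerWindow bytes in each window
     * of windowMillis milliseconds. Returns the number of windows used.
     */
    static int send(byte[] data, OutputStream out, int bytesPerWindow, long windowMillis)
            throws IOException, InterruptedException {
        if (bytesPerWindow <= 0 || windowMillis <= 0) {
            throw new IllegalArgumentException("budget and window must be positive");
        }
        int offset = 0;
        int windows = 1;
        int sentInWindow = 0;
        long windowStart = System.nanoTime();              // timer_get(a)
        while (offset < data.length) {
            long elapsedMillis = (System.nanoTime() - windowStart) / 1_000_000L;
            if (elapsedMillis >= windowMillis) {
                windowStart = System.nanoTime();           // start a new window
                sentInWindow = 0;
                windows++;
            } else if (sentInWindow < bytesPerWindow) {
                int chunk = Math.min(bytesPerWindow - sentInWindow, data.length - offset);
                out.write(data, offset, chunk);            // we actually send here
                offset += chunk;
                sentInWindow += chunk;
            } else {
                Thread.sleep(windowMillis - elapsedMillis); // budget used up: wait out the window
            }
        }
        return windows;
    }
}
```

For the 100 bytes per second example, a call would look like ThrottledSender.send(payload, socket.getOutputStream(), 100, 1000), where payload and socket are whatever the server already has at hand.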

Why does this code not see any significant performance gain when I use multiple threads on a quadcore machine?

I wrote some Java code to learn more about the Executor framework.
Specifically, I wrote code to verify the Collatz Hypothesis - this says that if you iteratively apply the following function to any integer, you get to 1 eventually:
f(n) = ((n % 2) == 0) ? n/2 : 3*n + 1
CH is still unproven, and I figured it would be a good way to learn about Executor. Each thread is assigned a range [l,u] of integers to check.
Specifically, my program takes 3 arguments - N (the number to which I want to check CH), RANGESIZE (the length of the interval that a thread has to process), and NTHREAD, the size of the threadpool.
My code works fine, but I saw much less speedup than I expected: on the order of 30% when I went from 1 to 4 threads.
My logic was that the computation is completely CPU bound, and each subtask (checking CH for a fixed-size range) takes roughly the same time.
Does anyone have ideas as to why I'm not seeing a 3 to 4x increase in speed?
If you could report your runtimes as you increase the number of threads (along with the machine, JVM and OS), that would also be great.
Specifics
Runtimes:
java -d64 -server -cp . Collatz 10000000 1000000 4 => 4 threads, takes 28412 milliseconds
java -d64 -server -cp . Collatz 10000000 1000000 1 => 1 thread, takes 38286 milliseconds
Processor:
Quadcore Intel Q6600 at 2.4GHZ, 4GB. The machine is unloaded.
Java:
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
OS:
Linux quad0 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux
Code: (I can't get the code to post, I think it's too long for SO requirements; the source is available on Google Docs)
import java.math.BigInteger;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
class MyRunnable implements Runnable {
public int lower;
public int upper;
MyRunnable(int lower, int upper) {
this.lower = lower;
this.upper = upper;
}
@Override
public void run() {
for (int i = lower ; i <= upper; i++ ) {
Collatz.check(i);
}
System.out.println("(" + lower + "," + upper + ")" );
}
}
public class Collatz {
public static boolean check( BigInteger X ) {
if (X.equals( BigInteger.ONE ) ) {
return true;
} else if ( X.getLowestSetBit() == 0 ) {
// odd: the lowest set bit of an odd number is bit 0
BigInteger Y = (new BigInteger("3")).multiply(X).add(BigInteger.ONE);
return check(Y);
} else {
BigInteger Z = X.shiftRight(1); // fast divide by 2
return check(Z);
}
}
public static boolean check( int x ) {
BigInteger X = new BigInteger( new Integer(x).toString() );
return check(X);
}
static int N = 10000000;
static int RANGESIZE = 1000000;
static int NTHREADS = 4;
static void parseArgs( String [] args ) {
if ( args.length >= 1 ) {
N = Integer.parseInt(args[0]);
}
if ( args.length >= 2 ) {
RANGESIZE = Integer.parseInt(args[1]);
}
if ( args.length >= 3 ) {
NTHREADS = Integer.parseInt(args[2]);
}
}
public static void maintest(String [] args ) {
System.out.println("check(1): " + check(1));
System.out.println("check(3): " + check(3));
System.out.println("check(8): " + check(8));
parseArgs(args);
}
public static void main(String [] args) {
long lDateTime = new Date().getTime();
parseArgs( args );
List<Thread> threads = new ArrayList<Thread>();
ExecutorService executor = Executors.newFixedThreadPool( NTHREADS );
for( int i = 0 ; i < (N/RANGESIZE); i++) {
Runnable worker = new MyRunnable( i*RANGESIZE+1, (i+1)*RANGESIZE );
executor.execute( worker );
}
executor.shutdown();
while (!executor.isTerminated() ) {
}
System.out.println("Finished all threads");
long fDateTime = new Date().getTime();
System.out.println("time in milliseconds for checking to " + N + " is " +
(fDateTime - lDateTime ) +
" (" + N/(fDateTime - lDateTime ) + " per ms)" );
}
}

Busy waiting can be a problem:
while (!executor.isTerminated() ) {
}
You can use awaitTermination() instead:
while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {}
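For context, here is a self-contained sketch of that pattern: submit the work, call shutdown(), then block in awaitTermination() instead of spinning on isTerminated(). AwaitTerminationDemo and runAll are hypothetical names, and the trivial counter task stands in for the Collatz range check.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitTerminationDemo {
    /** Runs the given number of trivial tasks on a fixed pool and returns how many completed. */
    static int runAll(int tasks, int threads) throws InterruptedException {
        AtomicInteger completed = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            executor.execute(completed::incrementAndGet); // placeholder for real work
        }
        executor.shutdown(); // no new tasks; already submitted ones still run
        // Blocks for up to one second per call, so the waiting thread does not burn a core.
        while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {
            // still running; loop and wait again
        }
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(10, 4)); // 10
    }
}
```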

You are using BigInteger. It consumes a lot of register space. What you most likely have on the compiler level is register spilling that makes your process memory-bound.
Also note that when you are timing your results you are not taking into account extra time taken by the JVM to allocate threads and work with the thread pool.
You could also have memory conflicts when you are using constant Strings. All strings are stored in a shared string pool and so it may become a bottleneck, unless java is really clever about it.
Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.

As @axtavt answered, busy waiting can be a problem. You should fix that first, as it is part of the answer, but not all of it. It won't appear to help in your case (on the Q6600), because it seems to be bottlenecked at 2 cores for some reason, so another core is available for the busy loop and there is no apparent slowdown, but on my Core i5 it speeds up the 4-thread version noticeably.
I suspect that in the case of the Q6600 your particular app is limited by the amount of shared cache available or something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, which means CPUs are sharing them, and no L3 cache. On my Core i5, each CPU has a dedicated L2 cache (256K), and there is a larger 8MB shared L3 cache. 256K more per-CPU cache might make a difference... otherwise something else architecture-wise does.
Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.
On my work PC, which is also a Q6600 @ 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21* (64-bit), here are some basic results:
10000000 500000 1 (avg of three runs): 36982 ms
10000000 500000 4 (avg of three runs): 21252 ms
Faster, certainly - but not completing in quarter of the time like you would expect, or even half... (though it is roughly just a bit more than half, more on that in a moment). Note in my case I halved the size of the work units, and have a default max heap of 1500m.
At home on my Core i5 750 (4 cores no hyperthreading), 4GB RAM, Windows 7 64-bit, jdk 1.6.0_22 (64-bit):
10000000 500000 1 (avg of 3 runs) 32677 ms
10000000 500000 4 (avg of 3 runs) 8825 ms
10000000 500000 4 (avg of 3 runs) 11475 ms (without the busy wait fix, for reference)
the 4 threads version takes 27% of the time the 1 thread version takes when the busy-wait loop is removed. Much better. Clearly the code can make efficient use of 4 cores...
NOTE: Java 1.6.0_18 and later have modified default heap settings - so my default heap size is almost 1500m on my work PC, and around 1000m on my home PC.
You may want to increase your default heap, just in case garbage collection is happening and slowing your 4 threaded version down a bit. It might help, it might not.
At least in your example, there's a chance your larger work unit size is skewing your results slightly...halving it may help you get closer to at least 2x the speed since 4 threads will be kept busy for a longer portion of the time. I don't think the Q6600 will do much better at this particular task...whether it is cache or some other inherent architecture thing.
In all cases, I am simply running "java Collatz 10000000 500000 X", where x = # of threads indicated.
The only changes I made to your java file were to make one of the println's into a print, so there were less linebreaks for my runs with 500000 per work unit so I could see more results in my console at once, and I ditched the busy wait loop, which matters on the i5 750 but didn't make a difference on the Q6600.

You should try using the submit function and then watching the Futures that are returned, checking them to see whether each task has finished.
isTerminated() never returns true unless shutdown() has been called first.
Future submit(Runnable task)
Submits a Runnable task for execution and returns a Future representing that task.
isTerminated()
Returns true if all tasks have completed following shut down.
Try this...
public static void main(String[] args) {
long lDateTime = new Date().getTime();
parseArgs(args);
List<Thread> threads = new ArrayList<Thread>();
List<Future> futures = new ArrayList<Future>();
ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
for (int i = 0; i < (N / RANGESIZE); i++) {
Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
futures.add(executor.submit(worker));
}
boolean done = false;
while (!done) {
for(Future future : futures) {
done = true;
if( !future.isDone() ) {
done = false;
break;
}
}
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
System.out.println("Finished all threads");
long fDateTime = new Date().getTime();
System.out.println("time in milliseconds for checking to " + N + " is " +
(fDateTime - lDateTime) +
" (" + N / (fDateTime - lDateTime) + " per ms)");
System.exit(0);
}
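A variation on the same idea, sketched under the assumption that blocking is acceptable: Future.get() waits until its task is done, so the polling loop with Thread.sleep(100) is not strictly needed. FutureWaitDemo and sumOfSquares are hypothetical names, and the squaring task is only a stand-in for the real work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureWaitDemo {
    /** Computes 1^2 + ... + n^2 with one task per term, waiting on each Future in turn. */
    static int sumOfSquares(int n, int threads) throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final int x = i;
                futures.add(executor.submit(() -> x * x)); // submitted as a Callable<Integer>
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get(); // blocks until this task has finished
            }
            return total;
        } finally {
            executor.shutdown(); // lets the pool threads exit without System.exit
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumOfSquares(4, 2)); // 30
    }
}
```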
