How can I measure the searching time in LinkedLists in Java?

I need to measure the searching time for LinkedLists of different sizes. My solution does not seem logical, so I don't know whether I'm on the right path or not.
This is my solution:
LinkedList<Integer> myList = new LinkedList<>();
for (int i = 1; i <= 1000; i++)
    myList.add(i);
Collections.shuffle(myList);
Random rand = new Random();
myList.contains(rand.nextInt(myList.size()));
System.out.println(System.nanoTime());

LinkedList<Integer> myList2 = new LinkedList<>();
for (int i = 1; i <= 2000; i++)
    myList2.add(i);
Collections.shuffle(myList2);
myList2.contains(rand.nextInt(myList2.size()));
System.out.println(System.nanoTime());
These are my outputs:
38565758048600
38565759163200

You can't measure performance like this. Computers are not predictable timing-wise; not today, and they haven't been for decades. OSes pre-empt, the JVM JITs, CPUs have multiple cores, and cores have pipelines and tiered caches. You're out of luck; the nanoTime around a single contains call tells you absolutely nothing - it's like trying to measure the effect of someone in Honolulu peeing in the water by checking the height of the ocean lapping at a pier in Amsterdam. About a billion other factors of far greater magnitude are involved.
The normal principle for measuring performance is to mark the time, do the thing, and then mark the time again; the difference between the two marks is how long it took. But you have to run the operation not once but many, many times (and then take the average), and you shouldn't take the timings seriously until the code has run a few times so that cache lines and the JIT have stabilized. You also need to ensure that the VM's hotspot engine won't optimize your entire code away, so you must actually use the result: you are throwing away the boolean that contains returns, which may lead the hotspot engine to realize it can eliminate the entire call, as it couldn't possibly change the list.
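To make that concrete, here is a deliberately simplified sketch of that mark/do/mark pattern (mine, not part of the original answer; it keeps and uses the boolean results so the calls can't be eliminated, but it still suffers from everything described next):

import java.util.Collections;
import java.util.LinkedList;

public class NaiveTiming {
    public static void main(String[] args) {
        LinkedList<Integer> list = new LinkedList<>();
        for (int i = 1; i <= 1000; i++) list.add(i);
        Collections.shuffle(list);

        int hits = 0;                               // consume the results so the JIT can't drop the calls
        int runs = 100_000;
        long start = System.nanoTime();             // mark
        for (int r = 0; r < runs; r++) {
            if (list.contains(r % 1000)) hits++;    // do the thing, many times
        }
        long elapsed = System.nanoTime() - start;   // mark again; the difference is the elapsed time
        System.out.println("avg ns per contains: " + (elapsed / (double) runs) + " (hits=" + hits + ")");
    }
}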
Getting this right is, in a word, completely and utterly impossible for any Java newbie; even a grizzled old veteran is extremely likely to get it wrong, and because it is about timing, it's very hard to 'test' that your timing framework actually works.
Fortunately, you don't have to write it. It already exists: JMH. You either use that, or a direct competitor of it, or you are almost certainly getting complete lies and fabrications out of your timing code.
Note that involving Random is a very bad idea for timing; you want the code you are timing to be as deterministic as possible. So do not shuffle the list, and do not ask for a random number inside the measured code. Ask for a specific number every time, or for a fixed set of numbers - for example, duplicate the list, shuffle the copy, then search for every number in it. That way every number in the input list is searched for exactly once, but in an arbitrary order.
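As a rough illustration (mine, not part of the original answer; the class name and parameter values are invented for the sketch), a JMH benchmark following that advice might look something like this, run through the usual JMH runner:

import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class LinkedListSearchBenchmark {

    @Param({"1000", "2000", "4000"})
    int size;                       // list sizes to compare

    LinkedList<Integer> list;
    List<Integer> queries;          // every element, searched exactly once, in arbitrary order

    @Setup
    public void setUp() {
        list = new LinkedList<>();
        for (int i = 1; i <= size; i++) list.add(i);
        queries = new ArrayList<>(list);
        Collections.shuffle(queries);
    }

    @Benchmark
    public void containsAll(Blackhole bh) {
        for (Integer q : queries) {
            bh.consume(list.contains(q));   // consume the result so the call can't be optimized away
        }
    }
}

JMH takes care of the warmup iterations, forking, and averaging, which is exactly the bookkeeping described above.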

Related

Fibonacci sequence using BigInteger does not yield an answer

I'm to find the 100th element of a Fibonacci sequence. I initially tried to store the values using an int, but it overflowed and switched to negative values, and long did the same. Then I came across BigInteger. I did get the solution using a simple for loop and an array to store the results and access the previous elements. Now, as I'm trying to do the same problem using recursion, the program does not seem to terminate. Am I missing something here? Or are BigIntegers not suited to be used with recursion? Here is the code:
import java.math.BigInteger;

class Test {
    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(100);
        System.out.println(fib(n));
    }

    public static BigInteger fib(BigInteger n) {
        if (n.compareTo(BigInteger.valueOf(1)) == 0 || n.compareTo(BigInteger.valueOf(1)) == -1)
            return n;
        return fib(n.subtract(BigInteger.valueOf(1))).add(fib(n.subtract(BigInteger.valueOf(2))));
    }
}
In the comments, you mentioned that your assumption that the program doesn't terminate is based on the fact that it ran for over 5 minutes. That is not how you prove non-termination.
If you observe the program terminating within a certain amount of time, then you can conclude that it does, indeed, terminate. However, if you don't observe it terminating within a certain amount of time, then you can say precisely nothing about whether it terminates or not. It may terminate if you wait a little bit longer, it may terminate if you wait a lot longer, it may even theoretically terminate but take longer than the heat death of the universe.
In your specific case, the algorithm is perfectly correct, and it always terminates. It is simply not a very efficient algorithm: for computing fib(n), fib gets called fib(n) times, because you compute the same numbers over and over and over and over again.
If we assume that you can execute fib once per clock cycle (which is an optimistic assumption, since a single call to fib performs one comparison, two subtractions, one addition, and two recursive calls in most cases, and a single addition may already take multiple clock cycles depending on the CPU), and we further assume that you have a 100-core CPU clocked at 100 GHz, 100 such CPUs per machine, a cluster of 100 machines, and your code actually runs fully in parallel, then it will still take you about an hour.
Under some more realistic assumptions, the time it takes your program to finish is more on the order of tens of thousands of years.
Since your code is not parallelized, in order for your code to finish in 5 minutes on a more realistic 4 GHz CPU, it would need to execute fib almost 300 million times per clock cycle.
It often helps to do some very rough guesstimates of the expected performance of your code. As you can see, you don't need to be an expert in Java or JVM or compilers or optimization or computer organization or CPU design or performance engineering. You don't need to know what, exactly, your code gets compiled down to. You don't need to know how many clock cycles an integer ADD takes. Because even when you make some totally over-the-top ridiculous assumptions, you can still easily see that your code cannot possibly finish in minutes or even hours.
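For reference (not part of the original answer), the usual fix is to compute each Fibonacci number exactly once, either iteratively or with memoization; a minimal iterative sketch:

import java.math.BigInteger;

class FibIterative {
    // Computes fib(n) in n steps by keeping only the last two values,
    // instead of recomputing the same subproblems exponentially often.
    static BigInteger fib(int n) {
        BigInteger a = BigInteger.ZERO, b = BigInteger.ONE;
        for (int i = 0; i < n; i++) {
            BigInteger next = a.add(b);
            a = b;
            b = next;
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(fib(100)); // 354224848179261915075
    }
}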

Oracle JDK8: unrolled loop converted to NOOP by JVM?

I've got a little program that is a fairly pointless exercise in simple number crunching that has thrown me for a loop.
The program spawns a bunch of worker threads that do simple mathematical operations. Recently I changed the inner loop of one variant of worker from:
do
{
    int3 = int1 + int2;
    int3 = int1 * int2;
    int1++;
    int2++;
    i++;
}
while (i < 128);
to something akin to:
int3 = tempint4[0] + tempint5[0];
int3 = tempint4[0] * tempint5[0];
int3 = tempint4[1] + tempint5[1];
int3 = tempint4[1] * tempint5[1];
int3 = tempint4[2] + tempint5[2];
int3 = tempint4[2] * tempint5[2];
int3 = tempint4[3] + tempint5[3];
int3 = tempint4[3] * tempint5[3];
...
int3 = tempint4[127] + tempint5[127];
int3 = tempint4[127] * tempint5[127];
The arrays are populated by random integers no higher than 1025 in value, and the array values do not change.
The end result was that the program ran much faster, though closer examination seems to indicate that the CPU isn't actually doing anything when running the newer version of the code. It seems that the JVM has figured out that it can safely ignore the code that replaced the inner loop after one iteration of the outer loop since it is only redoing the same calculations on the same set of data over and over again.
To illustrate my point, the old code took maybe ~27000 ms to run and noticeably increased the operating temperature of the CPU (it also showed 100% utilization for all cores). The new code takes maybe 5 ms to run (sometimes less) and causes nary a spike in CPU utilization or temperature. Increasing the number of outer loop iterations does nothing to change the behavior of the new code, even when the number of iterations increases by a hundred times or more.
I have another version of the worker that is identical to the one above except that it has a division operation along with the addition and multiplication operations. In its new unrolled form, the division-enabled version is also much faster than its previous form, but it actually takes a little while (~300 ms on the first run and ~200 ms on subsequent runs, despite warmup, which is a little odd) and produces a profound spike in CPU temperature for its brief run. Increasing the number of outer loop iterations seems to cause the temperature phenomenon to mostly cease after a certain amount of time has passed while running the program, though utilization still shows 100% for all cores. My guess is that the JVM is taking much longer to figure out which operations it can safely ignore when handling division operations, and that it is not ignoring all of them.
Short of adding division operations to all my code (which isn't really a fix anyway beyond a certain number of outer loop iterations), is there any way I can get the JVM to stop reducing my code to apparent NOOPs? I've tried several solutions to the problem, such as generating new random values per iteration of the outer loop, going back to simple integer variables with incrementation, and some other nonsense, but none of those solutions have produced desirable results. Either it continues to ignore the series of instructions, or the performance hit from modifications is bad enough that my division-heavy variant actually performs better than the code without division operations.
edit: to provide some context:
i: this variable is an integer that is used as a loop counter in a do/while loop. It is defined in the class file containing the worker code. Its initial value is 0. It is no longer used in the newer version of the worker.
int1/int2: These are integers defined in the class file containing the worker code. Their initial values are both 0. They were used in the old version of the code to provide changing values for each iteration of the internal loop. All I had to do was increment them upward by one per loop iteration, and the JVM would be forced to carry out every operation faithfully. Unfortunately, this loop apparently prevented the use of SIMD. Each time the outer loop iterated, int1 and int2 had their values reset to prevent overflow of int1, int2, or int3 (I have discovered that integer overflow can slow down the code unnecessarily, as can allowing a float to reach Infinity).
tempint4/tempint5: These are references to a pair of integer arrays defined in the main class file for the program (Mathtester. Yes, unimaginative, I know). When the program first starts, there is a short do/while loop that fills each array with random integers ranging from 1 to 1025. The arrays are 128 integers in size. Each array is static, though the reference variables are not. In truth there is no particular reason for me to use the reference variables. They are leftovers from when I was trying to do an array reference swap so that, after each iteration of the outer loop, tempint4 and tempint5 would refer to the opposite array. It was my hope that the JVM would stop ignoring my code block. For the division-enabled version of the code, this seems to have worked (sort of), since it fundamentally changes the values to be calculated. Swapping tempint4 for tempint5 and vice versa does not change the results of the addition and multiplication operations, so the JVM can still ignore those.
edit: Making tempint4 and tempint5 (since they are only reference variables, I am actually referring to the main arrays, Mathtester.int4 and Mathtester.int5) volatile worked, without notably reducing the amount of CPU activity or the level of CPU temperature. It did slow down the code a bit, but that is a probable indicator that the JVM was NOOPing more than I knew.
Is there any way I can get the JVM to stop reducing my code to apparent NOOPs?
Yes, by making int3 volatile.
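A hedged sketch of what that change looks like in the worker (mine, not the answerer's code; the field and array names follow the question):

class Worker implements Runnable {
    // volatile: every write is an observable side effect, so the JIT cannot
    // eliminate the arithmetic that produces it as dead code.
    private volatile int int3;

    private final int[] tempint4;
    private final int[] tempint5;

    Worker(int[] a, int[] b) {
        this.tempint4 = a;
        this.tempint5 = b;
    }

    @Override
    public void run() {
        for (int i = 0; i < tempint4.length; i++) {
            int3 = tempint4[i] + tempint5[i];
            int3 = tempint4[i] * tempint5[i];
        }
    }
}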
One of the first things when dealing with Java performance that you have to learn by heart is this:
"A single line of Java code means nothing at all in isolation".
Modern JVMs are very complex beasts, and do all kinds of optimization. If you try to measure some small piece of code, the chances are that you will not be measuring what you think you are - it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing.
In this case, yes, it's entirely likely that the JVM is optimizing away the loop. There's no simple way to prevent it from doing this, and almost all techniques are fragile and JVM-version specific (because new & cleverer optimizations are developed & added to the JVM all the time).
So, let me turn the question around: "What are you really trying to achieve here? Why do you want to prevent the JVM from optimizing?"

Programmatically determine asymptotic runtime of a given algorithm? [duplicate]

I wonder whether there is any automatic way of determining (at least roughly) the Big-O time complexity of a given function?
If I graphed an O(n) function vs. an O(n lg n) function I think I would be able to visually ascertain which is which; I'm thinking there must be some heuristic solution which enables this to be done automatically.
Any ideas?
Edit: I am happy to find a semi-automated solution, just wondering whether there is some way of avoiding doing a fully manual analysis.
It sounds like what you are asking for is an extension of the Halting Problem. I do not believe that such a thing is possible, even in theory.
Just answering the question "Will this line of code ever run?" would be very difficult if not impossible to do in the general case.
Edited to add:
Although the general case is intractable, see here for a partial solution: http://research.microsoft.com/apps/pubs/default.aspx?id=104919
Also, some have stated that doing the analysis by hand is the only option, but I don't believe that is really the correct way of looking at it. An intractable problem is still intractable even when a human being is added to the system/machine. Upon further reflection, I suppose that a 99% solution may be doable, and might even work as well as or better than a human.
You can run the algorithm over various size data sets, and you could then use curve fitting to come up with an approximation. (Just looking at the curve you create probably will be enough in most cases, but any statistical package has curve fitting).
Note that some algorithms exhibit one shape with small data sets, but another with large... and the definition of large remains a bit nebulous. This means that an algorithm with a good performance curve could have so much real world overhead that (for small data sets) it doesn't work as well as the theoretically better algorithm.
As far as code inspection techniques, none exist. But instrumenting your code to run at various lengths and outputting a simple file (RunSize RunLength would be enough) should be easy. Generating proper test data could be more complex (some algorithms work better/worse with partially ordered data, so you would want to generate data that represented your normal use-case).
Because of the problems with the definition of "what is large" and the fact that performance is data dependent, I find that static analysis often is misleading. When optimizing performance and selecting between two algorithms, the real world "rubber hits the road" test is the only final arbitrator I trust.
A short answer is that it's impossible because constants matter.
For instance, I might write a function that runs in O((n^3/k) + n^2). This simplifies to O(n^3) because as n approaches infinity, the n^3 term will dominate the function, irrespective of the constant k.
However, if k is very large in the above example function, the function will appear to run in almost exactly n^2 until some crossover point, at which the n^3 term will begin to dominate. Because the constant k will be unknown to any profiling tool, it will be impossible to know just how large a dataset to test the target function with. If k can be arbitrarily large, you cannot craft test data to determine the big-oh running time.
I am surprised to see so many attempts to claim that one can "measure" complexity by a stopwatch. Several people have given the right answer, but I think that there is still room to drive the essential point home.
Algorithm complexity is not a "programming" question; it is a "computer science" question. Answering the question requires analyzing the code from the perspective of a mathematician, such that computing the Big-O complexity is practically a form of mathematical proof. It requires a very strong understanding of the fundamental computer operations, algebra, perhaps calculus (limits), and logic. No amount of "testing" can be substituted for that process.
The Halting Problem applies, so the complexity of an algorithm is fundamentally undecidable by a machine.
The limits of automated tools apply, so it might be possible to write a program to help, but it would only be able to help about as much as a calculator helps with one's physics homework, or as much as a refactoring browser helps with reorganizing a code base.
For anyone seriously considering writing such a tool, I suggest the following exercise. Pick a reasonably simple algorithm, such as your favorite sort, as your subject algorithm. Get a solid reference (book, web-based tutorial) to lead you through the process of calculating the algorithm complexity and ultimately the "Big-O". Document your steps and results as you go through the process with your subject algorithm. Perform the steps and document your progress for several scenarios, such as best-case, worst-case, and average-case. Once you are done, review your documentation and ask yourself what it would take to write a program (tool) to do it for you. Can it be done? How much would actually be automated, and how much would still be manual?
Best wishes.
I am curious as to why it is that you want to be able to do this. In my experience when someone says: "I want to ascertain the runtime complexity of this algorithm" they are not asking what they think they are asking. What you are most likely asking is what is the realistic performance of such an algorithm for likely data. Calculating the Big-O of a function is of reasonable utility, but there are so many aspects that can change the "real runtime performance" of an algorithm in real use that nothing beats instrumentation and testing.
For example, the following algorithms have the same exact Big-O (wacky pseudocode):
example a:
huge_two_dimensional_array foo
for i = 0, i < foo[i].length, i++
    for j = 0; j < foo[j].length, j++
        do_something_with foo[i][j]
example b:
huge_two_dimensional_array foo
for j = 0, j < foo[j].length, j++
    for i = 0; i < foo[i].length, i++
        do_something_with foo[i][j]
Again, exactly the same big-O... but one of them uses row ordinality and one of them uses column ordinality. It turns out that due to locality of reference and cache coherency you might have two completely different actual runtimes, especially depending on the actual size of the array foo. This doesn't even begin to touch the actual performance characteristics of how the algorithm behaves if it's part of a piece of software that has some concurrency built in.
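A rough Java illustration of that point (mine, not part of the answer; the timing is deliberately crude): both loops below are O(n^2), but the row-major one walks memory sequentially while the column-major one jumps a whole row per step.

class TraversalOrder {
    public static void main(String[] args) {
        int n = 4000;
        int[][] foo = new int[n][n];

        long sum = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++)          // row-major: cache-friendly
            for (int j = 0; j < n; j++)
                sum += foo[i][j];
        long rowMajor = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int j = 0; j < n; j++)          // column-major: strided access
            for (int i = 0; i < n; i++)
                sum += foo[i][j];
        long colMajor = System.nanoTime() - t0;

        // Print sum so the loops cannot be optimized away.
        System.out.println("row-major ns: " + rowMajor + ", column-major ns: " + colMajor + " (sum=" + sum + ")");
    }
}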
Not to be a negative nelly but big-O is a tool with a narrow scope. It is of great use if you are deep inside algorithmic analysis or if you are trying to prove something about an algorithm, but if you are doing commercial software development the proof is in the pudding, and you are going to want to have actual performance numbers to make intelligent decisions.
Cheers!
This could work for simple algorithms, but what about O(n^2 lg n), or O(n lg^2 n)?
You could get fooled visually very easily.
And if it's a really bad algorithm, maybe it wouldn't even return on n=10.
Proof that this is undecidable:
Suppose that we had some algorithm HALTS_IN_FN(Program, function) which determined whether a program halted in O(f(n)) for all n, for some function f.
Let P be the following program:
if (HALTS_IN_FN(P, f(n)))
{
    while (1);
}
halt;
Since the function and the program are fixed, HALTS_IN_FN on this input is constant time. If HALTS_IN_FN returns true, the program runs forever and so certainly does not halt in O(f(n)). If HALTS_IN_FN returns false, the program halts in O(1) time.
Either way we have a contradiction, so no such HALTS_IN_FN can exist; the problem is undecidable.
A lot of people have commented that this is an inherently unsolvable problem in theory. Fair enough, but beyond that, even solving it for any but the most trivial cases would seem to be incredibly difficult.
Say you have a program that has a set of nested loops, each based on the number of items in an array. O(n^2). But what if the inner loop is only run in a very specific set of circumstances? Say, on average, it's run in approximately log(n) cases. Suddenly our "obviously" O(n^2) algorithm is really O(n log n). Writing a program that could determine whether the inner loop would be run, and how often, is potentially more difficult than the original problem.
Remember, big-O isn't god; high constants can and will change the playing field. Quicksort algorithms are O(n log n) of course, but when the recursion gets small enough, say down to 20 items or so, many implementations of quicksort will change tactics to a separate algorithm, as it's actually quicker to do a different type of sort - say insertion sort, which has a worse big-O but a much smaller constant.
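A hedged sketch of that cutoff idea (mine, not the answer's; the threshold of 20 is just the figure mentioned above):

class HybridSort {
    private static final int CUTOFF = 20;   // below this size, insertion sort tends to win in practice

    static void sort(int[] a, int lo, int hi) {
        if (hi - lo + 1 <= CUTOFF) {
            insertionSort(a, lo, hi);
            return;
        }
        int p = partition(a, lo, hi);
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }

    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int key = a[i], j = i - 1;
            while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
            a[j + 1] = key;
        }
    }

    // Lomuto partition: returns the final position of the pivot.
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi], i = lo - 1;
        for (int j = lo; j < hi; j++) {
            if (a[j] <= pivot) { i++; int t = a[i]; a[i] = a[j]; a[j] = t; }
        }
        int t = a[i + 1]; a[i + 1] = a[hi]; a[hi] = t;
        return i + 1;
    }
}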
So, understand your data, make educated guesses, and test.
I think it's pretty much impossible to do this automatically. Remember that O(g(n)) is the worst-case upper bound and many functions perform better than that for a lot of data sets. You'd have to find the worst-case data set for each one in order to compare them. That's a difficult task on its own for many algorithms.
You must also take care when running such benchmarks. Some algorithms will have a behavior heavily dependent on the input type.
Take Quicksort, for example. It is O(n²) in the worst case, but usually O(n log n), for two inputs of the same size.
The traveling salesman problem is (I think, not sure) O(n²) (EDIT: the correct value is O(n!) for the brute force algorithm), but most algorithms get rather good approximate solutions much faster.
This means that the benchmarking structure mostly has to be adapted on an ad hoc basis. Imagine writing something generic for the two examples mentioned: it would be very complex, probably unusable, and would likely give incorrect results anyway.
Jeffrey L Whitledge is correct. A simple reduction from the halting problem proves that this is undecidable...
ALSO, if I could write this program, I'd use it to solve P vs NP, and have $1million... B-)
I'm using a big_O library (link here) that fits the change in execution time against independent variable n to infer the order of growth class O().
The package automatically suggests the best fitting class by measuring the residual from collected data against each class growth behavior.
Check the code in this answer.
Example of output:
Measuring .columns[::-1] complexity against rapid increase in # rows
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
--------------------------------------------------------------------------------
Constant: time = 0.032 (res: 0.021)
Linear: time = -0.051 + 0.024*n (res: 0.011)
Quadratic: time = -0.026 + 0.0038*n^2 (res: 0.0077)
Cubic: time = -0.017 + 0.00067*n^3 (res: 0.0052)
Polynomial: time = -6.3 * x^1.5 (res: 6)
Logarithmic: time = -0.026 + 0.053*log(n) (res: 0.015)
Linearithmic: time = -0.024 + 0.012*n*log(n) (res: 0.0094)
Exponential: time = -7 * 0.66^n (res: 3.6)
--------------------------------------------------------------------------------
I guess this isn't possible in a fully automatic way since the type and structure of the input differs a lot between functions.
Well, since you can't prove whether or not a function even halts, I think you're asking a little much.
Otherwise @Godeke has it.
I don't know what your objective is in doing this, but we had a similar problem in a course I was teaching. The students were required to implement something that works at a certain complexity.
In order not to go over their solutions manually and read their code, we used the method @Godeke suggested. The objective was to find students who used a linked list instead of a balanced search tree, or students who implemented bubble sort instead of heap sort (i.e. implementations that do not work in the required complexity - but without actually reading their code).
Surprisingly, the results did not reveal students who cheated. That might be because our students are honest and want to learn (or just knew that we'd check this ;-) ). It is possible to miss cheating students if the inputs are small, or if the input itself is ordered or such. It is also possible to be wrong about students who did not cheat but have large constant values.
But in spite of the possible errors, it is well worth it, since it saves a lot of checking time.
As others have said, this is theoretically impossible. But in practice, you can make an educated guess as to whether a function is O(n) or O(n^2), as long as you don't mind being wrong sometimes.
First, time the algorithm, running it on inputs of various sizes n. Plot the points on a log-log graph. Draw the best-fit line through the points. If the line fits all the points well, then the data suggests that the algorithm is O(n^k), where k is the slope of the line.
I am not a statistician. You should take all this with a grain of salt. But I have actually done this in the context of automated testing for performance regressions. The patch here contains some JS code for it.
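As a rough Java sketch of that slope estimate (mine, not the answer's; the sample timings are the bubble-sort measurements from the Python answer further down):

class PowerLawFit {
    // Given measurements (n[i], t[i]), fit log t = k*log n + c by least squares;
    // the slope k estimates the exponent in t ~ n^k.
    static double estimateExponent(double[] n, double[] t) {
        int m = n.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(n[i]), y = Math.log(t[i]);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        return (m * sxy - sx * sy) / (m * sxx - sx * sx);
    }

    public static void main(String[] args) {
        // Roughly quadratic timings, so the slope comes out close to 2.
        double[] n = {1000, 2000, 3000, 4000, 5000};
        double[] t = {0.078, 0.344, 0.765, 1.344, 2.141};
        System.out.printf("estimated exponent: %.2f%n", estimateExponent(n, t));
    }
}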
If you have lots of homogeneous computational resources, I'd time it against several sample sizes and do a regression, then simply take the highest term.
It's easy to get an indication (e.g. "is the function linear? sub-linear? polynomial? exponential")
It's hard to find the exact complexity.
For example, here's a Python solution: you supply the function, and a function that creates arguments of size N for it. You get back a list of (n, time) values to plot or to run regression analysis on. It times each call only once, for speed; to get a really good indication it would have to time it many times to minimize interference from environmental factors (e.g. with the timeit module).
import time

def measure_run_time(func, args):
    start = time.time()
    func(*args)
    return time.time() - start

def plot_times(func, generate_args, plot_sequence):
    return [
        (n, measure_run_time(func, generate_args(n+1)))
        for n in plot_sequence
    ]
And to use it to time bubble sort:
def bubble_sort(l):
    for i in xrange(len(l)-1):
        for j in xrange(len(l)-1-i):
            if l[j+1] < l[j]:
                l[j], l[j+1] = l[j+1], l[j]

import random

def gen_args_for_sort(list_length):
    result = range(list_length)  # list of 0..N-1
    random.shuffle(result)       # randomize order
    # should return a tuple of arguments
    return (result,)

# timing for N = 1000, 2000, ..., 5000
times = plot_times(bubble_sort, gen_args_for_sort, xrange(1000, 6000, 1000))

import pprint
pprint.pprint(times)
This printed on my machine:
[(1000, 0.078000068664550781),
(2000, 0.34400010108947754),
(3000, 0.7649998664855957),
(4000, 1.3440001010894775),
(5000, 2.1410000324249268)]

Can Java recognize the SIMD advantages of a CPU, or is this just the optimization effect of loop unrolling?

This part of the code is from the dotproduct method of a vector class of mine. The method computes the inner product against a target array of vectors (1000 vectors).
When the vector length is an odd number (262145), the compute time is 4.37 seconds. When the vector length (N) is 262144 (a multiple of 8), the compute time is 1.93 seconds.
time1 = System.nanoTime();
int count = 0;
for (int j = 0; j < 1000; i++)
{
    b = vektors[i]; // selects next vector(b) to multiply as inner product.
                    // each vector has an array of float elements.
    if (((N / 2) * 2) != N)
    {
        for (int i = 0; i < N; i++)
        {
            t1 += elements[i] * b.elements[i];
        }
    }
    else if (((N / 8) * 8) == N)
    {
        float[] vek = new float[8];
        for (int i = 0; i < (N / 8); i++)
        {
            vek[0] = elements[i] * b.elements[i];
            vek[1] = elements[i + 1] * b.elements[i + 1];
            vek[2] = elements[i + 2] * b.elements[i + 2];
            vek[3] = elements[i + 3] * b.elements[i + 3];
            vek[4] = elements[i + 4] * b.elements[i + 4];
            vek[5] = elements[i + 5] * b.elements[i + 5];
            vek[6] = elements[i + 6] * b.elements[i + 6];
            vek[7] = elements[i + 7] * b.elements[i + 7];
            t1 += vek[0] + vek[1] + vek[2] + vek[3] + vek[4] + vek[5] + vek[6] + vek[7];
            // t1 is total sum of all dot products.
        }
    }
}
time2 = System.nanoTime();
time3 = (time2 - time1) / 1000000000.0; // seconds
Question: Could the reduction of time from 4.37 s to 1.93 s (2x as fast) be the JIT's wise decision to use SIMD instructions, or just the positive effect of my loop unrolling?
If the JIT cannot do SIMD optimization automatically, does that mean there is also no automatic unrolling optimization done by the JIT in this example?
For 1M iterations (vectors) and a vector size of 64, the speedup multiplier goes up to 3.5x (a cache advantage?).
Thanks.
Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?
Your first loop does this, indented more conventionally:
for(int j=0;j<1000;i++) {
b=vektors[i]; // selects next vector(b) to multiply as inner product.
// each vector has an array of float elements.
}
Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other if you're using floating-point arithmetic because they're fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.
Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.
I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code opto generates.
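To illustrate the point about only covering the first eighth (my sketch, not part of the answer): to compute the same dot product as the rolled loop, the unrolled body would have to step i by 8 and cover the whole vector, roughly like this:

// Assumes N is a multiple of 8; elements and b.elements are float arrays of length N.
// Note: this groups the additions differently from the rolled loop, so the
// floating-point result can differ slightly; using several partial sums instead
// of one t1 would also help break the dependent add chain described above.
for (int i = 0; i < N; i += 8)
{
    t1 += elements[i]     * b.elements[i]
        + elements[i + 1] * b.elements[i + 1]
        + elements[i + 2] * b.elements[i + 2]
        + elements[i + 3] * b.elements[i + 3]
        + elements[i + 4] * b.elements[i + 4]
        + elements[i + 5] * b.elements[i + 5]
        + elements[i + 6] * b.elements[i + 6]
        + elements[i + 7] * b.elements[i + 7];
}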
I have done a little research on this (and am drawing from knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt as I am by no means an expert on this topic.
As for your first question, I think the speedup is coming from your loop unrolling; you're making roughly 87% fewer condition checks in terms of the for loop. As far as I know, the JVM has supported SSE since 1.4, but to actually control whether your code uses vectorization (and to know for sure), you'll need to use JNI.
See an example of JNI here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
When you decrease the size of your vector to 64 from 262144, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache. One thing you might want to do is check your cache size.
Just as a side note: It might be a better idea to measure performance in flops rather than seconds, just because the runtime (in seconds) of your program can vary based on many different factors, such as CPU usage at the time.

Divide a list of numbers into smaller lists with approximately equal sums

I execute around 2000 tests on a grid, with each test run as a separate task. The tests have a rather big startup time. The total execution amounts to 500 hours of compute time, but finishes in less than 10 hours on a 60-node SunGridEngine. Runtimes of the tests vary from 5 minutes to 90 minutes. Combining tests without much intelligence gave some performance gain. I would like to create "tasks" that are of approximately equal size. How can I do this?
(What we do now: sort all tests and keep adding them to a task until the sum of execution times is approximately 5 hours. I'm looking for something better.)
Doing this optimally is NP-complete. This is a variation of the partition problem, which is a special case of the subset sum problem, which is itself a special case of the knapsack problem.
In your case you probably don't need an exact solution, so you can probably use some heuristics to get something "good enough" in a reasonable amount of time. See the Methods section of the partition problem page for a description of some approaches.
What you are looking for is the partition problem for k sets.
There is some literature about k=3, called the 3-partition problem, which is NP-complete in the strong sense.
There are many heuristics that should give an approximate result quickly.
I suggest you start here: http://en.wikipedia.org/wiki/Partition_problem
Hope this helps.
This is a version of the subset-sum problem, and is NP-complete. Your best bet is to employ some subset-sum heuristics.
Your problem sounds a little like a shop scheduling problem. There are all kinds of different sequencing approaches, some of which are described here. Sorting in increasing order of processing time, for instance, will minimize the mean waiting time and a whole bunch of other measures. If you elaborate a bit more on the objective, the setup times, the processing times, and any interdependence, that would help.
Looking at the links Laurence posted, I thought I would try whipping something up. The algorithm is to assign the longest test to the shortest task list (and repeat until all the tests are assigned). Using your numbers and random test times, the standard deviation was pretty low, less than 2 minutes across several runs (code in C#, but nothing that wouldn't be trivial to convert):
private static void BuildJobs()
{
    PriorityQueue<Task> tasks = new PriorityQueue<Task>();

    // create a task list for each node
    for (int i = 0; i < 60; i++)
    {
        Task t = new Task();
        tasks.Enqueue(t);
    }

    // get the list of tests, in order from longest to shortest
    int[] testList = new int[2000];
    for (int i = 0; i < testList.Length; i++)
    {
        testList[i] = random.Next(5, 90);
    }
    Array.Sort<int>(testList);
    Array.Reverse(testList);

    // add the longest running test to the current shortest task list
    foreach (int time in testList)
    {
        Task t = tasks.Dequeue();
        t.addTest(time);
        tasks.Enqueue(t);
    }

    Debug.WriteLine(CalculateStdDev(tasks));
}
