I have written code for a neural network, but when I train my network it does not produce the desired output (the network is not learning, and I sometimes get NaN values while training). What is wrong with my backpropagation algorithm? Attached below is how I derived the formulas for the weight and bias gradients respectively. Full code can be found here.
public double[][] predict(double[][] input) {
if(input.length != this.activations.get(0).length || input[0].length != this.activations.get(0)[0].length) {
throw new IllegalArgumentException("Prediction Error!");
}
this.activations.set(0, input);
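// Forward pass: activations[l] = sigmoid(weights[l-1] * activations[l-1] + biases[l-1])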
for(int i = 1; i < this.activations.size(); i++) {
this.activations.set(i, this.sigmoid(this.add(this.multiply(this.weights.get(i-1), this.activations.get(i-1)), this.biases.get(i-1))));
}
return this.activations.get(this.n-1);
}
public void train(double[][] input, double[][] target) {
//calculate activations
this.predict(input);
//calculate weight gradients
for(int l = 0; l < this.weightGradients.size(); l++) {
for(int i = 0; i < this.weightGradients.get(l).length; i++) {
for(int j = 0; j < this.weightGradients.get(l)[0].length; j++) {
this.weightGradients.get(l)[i][j] = this.gradientOfWeight(l, i, j, target);
}
}
}
//calculate bias gradients
for(int l = 0; l < this.biasGradients.size(); l++) {
for(int i = 0; i < this.biasGradients.get(l).length; i++) {
for(int j = 0; j < this.biasGradients.get(l)[0].length; j++) {
this.biasGradients.get(l)[i][j] = this.gradientOfBias(l, i, j, target);
}
}
}
//apply gradient
for(int i = 0; i < this.weights.size(); i++) {
this.weights.set(i, this.subtract(this.weights.get(i), this.weightGradients.get(i)));
}
for(int i = 0; i < this.biases.size(); i++) {
this.biases.set(i, this.subtract(this.biases.get(i), this.biasGradients.get(i)));
}
}
private double gradientOfWeight(int l, int i, int j, double[][] t) { //when referring to A, use l+1 because A[0] is input vector, n-1 because n starts at 1
double z = (this.activations.get(l + 1)[i][0] * (1.0 - this.activations.get(l + 1)[i][0]) * this.activations.get(l)[j][0]);
if((l + 1) < (this.n - 1)) {
double sum = 0.0;
for(int k = 0; k < this.weights.get(l + 1).length; k++) {
sum += this.gradientOfWeight(l + 1, k, i, t)*this.weights.get(l + 1)[k][i];
}
return ((z * sum) / this.activations.get(l + 1)[i][0]);
} else if((l + 1) == (this.n - 1)) {
return 2.0 * (this.activations.get(l + 1)[i][0] - t[i][0]) * z;
}
throw new IllegalArgumentException("Weight Gradient Calculation Error!");
}
The amount of math involved in this question, combined with the lack of data and of a runnable reproduction of your code, makes it nearly impossible to answer the original question of "where is my NaN".
Instead, I would propose you reframe it as a simpler question: "How can I tell where a value like NaN is coming from in my code?"
If you can run your code in an IDE, most of them support conditional breakpoints, i.e. breakpoints that pause your code whenever a variable reaches a given value. In your case, I would recommend running your code in your preferred IDE with a conditional breakpoint that detects when a value is NaN.
You can read more about how to set one up in this SO post, where checking a double for NaN in a breakpoint condition is covered nicely:
Eclipse Debugger doesn't stop at conditional breakpoint
Another follow-up consideration is to think WHERE you need to put these breakpoints. The short answer is to put them wherever a double is computed, because any of these computations might introduce the NaN.
To that effect, I make the following two recommendations:
First, put a breakpoint where you currently compute doubles to see if NaNs come from those computations. That would be these two variables:
double z = ...
double sum = ...
Second, refactor your calls to gradientOfWeight to return into a temporary variable, and then put a similar breakpoint on THOSE interim computations.
So instead of
this.weightGradients.get(l)[i][j] = this.gradientOfWeight(l, i, j, target);
You would have:
double interimComputationToListenForNaNOn = this.gradientOfWeight(l, i, j, target);
this.weightGradients.get(l)[i][j] = interimComputationToListenForNaNOn;
Having these interim variables is more of a convenience, giving you an easy way to monitor the computation without changing the call in any significant way. There may be a smarter way to do this without requiring an interim variable, but this one seems the easiest to monitor and explain.
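If you prefer a programmatic guard over debugger breakpoints, a small helper can fail fast at the exact computation that first produces a NaN. A minimal sketch (checkNaN is a hypothetical helper, not part of the original code):

private static double checkNaN(double value, String label) {
    // Fail fast: throw the moment any intermediate computation produces NaN.
    if (Double.isNaN(value)) {
        throw new IllegalStateException("NaN produced at: " + label);
    }
    return value;
}

You would then wrap each suspect expression, e.g. double z = checkNaN(..., "z in gradientOfWeight");, and the resulting stack trace points at the first offending computation.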
The NaN you see is likely due to underflow; you could use the BigDecimal class instead of double for higher precision. Refer to these for a better understanding: BigDecimal class Java sample use, BigDecimal API Reference
So I am new to ML and trying to make a simple "library" so I can learn more about neural networks.
My question:
According to my understanding, I have to take the derivative of each layer according to its activation function so I can calculate the deltas and adjust the weights, etc...
For ReLU, sigmoid, tanh, it's super simple to implement them in Java (which is the language I am using BTW)
But to go from output to the input I have to start from (obviously) the output which has an activation function of softmax.
So do I have to take the derivative of the output layer as well or does it just apply to every other layer?
If I do have to get the derivative how can I implement the derivative that in Java?
Thanks.
I have read a lot of pages explaining the derivative of the softmax algorithm, but they were really complicated for me, and as I said I just started to learn ML and didn't want to use an off-the-shelf library, so here I am.
This is the class I store my activation functions.
public class ActivationFunction {
public static double tanh(double val) {
return Math.tanh(val);
}
public static double sigmoid(double val) {
// Parentheses are required: "1 / 1 + Math.exp(-val)" evaluates as 1 + e^(-val).
return 1 / (1 + Math.exp(-val));
}
public static double relu(double val) {
return Math.max(val, 0);
}
public static double leaky_relu(double val) {
double result = 0;
if (val > 0) result = val;
else result = val * 0.01;
return result;
}
public static double[] softmax(double[] array) {
double max = max(array);
double sum = 0;
double[] result = new double[array.length];
// Subtract the max before exponentiating for numerical stability,
// without mutating the caller's array.
for (int i = 0; i < array.length; i++) {
result[i] = Math.exp(array[i] - max);
sum += result[i];
}
for (int i = 0; i < result.length; i++) {
result[i] /= sum;
}
return result;
}
public static double dTanh(double x) {
double tan = Math.tanh(x);
return 1 - tan * tan; // d/dx tanh(x) = 1 - tanh^2(x); "(1 / tan) - tan" was wrong
}
public static double dSigmoid(double x) {
return x * (1 - x); // valid only when x is already the sigmoid *output*, not the raw input
}
public static double dRelu(double x) {
double result;
if (x > 0) result = 1;
else result = 0;
return result;
}
public static double dLeaky_Relu(double x) {
double result;
if (x > 0) result = 1;
else if (x < 0) result = 0.01;
else result = 0;
return result;
}
private static double max(double[] array) {
double result = Double.NEGATIVE_INFINITY; // Double.MIN_VALUE is the smallest *positive* double, which fails for all-negative inputs
for (int i = 0; i < array.length; i++) {
if (array[i] > result) result = array[i];
}
return result;
}
}
I am expecting to get the answer for the question: Do I need the derivative of softmax or not?
If so how can I implement it?
A short answer to your first question is yes, you need to compute the derivative of softmax.
The longer version involves some computation, since in order to implement backpropagation you train your network by means of a first-order optimization algorithm that requires calculating the partial derivatives of the cost function w.r.t. the weights, i.e. the terms ∂C/∂wij.
However, since you are using softmax for your last layer, it is very likely that you are going to optimize a cross-entropy cost function while training your neural network, namely:
C = −∑j tj · ln(aj)
where tj is a target value and aj is the softmax result for class j.
Softmax itself represents a probability distribution over n classes:
aj = exp(zj) / ∑k exp(zk)
where all of the z's are simple sums of the activations of the previous layer times the corresponding weights:
zj = ∑i wij(n) · ai(n−1)
where n is the layer number, i indexes the neurons in the previous layer, and j indexes the neurons in our softmax layer.
So in order to take the partial derivative with respect to any of these weights, one should calculate:
∂C/∂wij = ∑k (∂C/∂ak) · (∂ak/∂zj) · (∂zj/∂wij)
where the second partial derivative ∂ak/∂zj is indeed the softmax derivative and can be computed in the following way:
∂ak/∂zj = ak · (δkj − aj)
where δkj is the Kronecker delta (1 when k = j, 0 otherwise).
But if you try to compute the aforementioned sum term of the derivative of the cost function w.r.t. the weights, using ∂C/∂ak = −tk/ak for the cross-entropy cost, you will get:
∑k (∂C/∂ak) · (∂ak/∂zj) = ∑k (−tk/ak) · ak(δkj − aj) = −tj + aj ∑k tk = aj − tj
assuming the targets tk sum to 1 (e.g. a one-hot encoding). So in this particular case the final result of the computation is quite neat and represents a simple difference between the outputs of the network and the target values; all you need to compute this sum term of partial derivatives is just aj − tj.
So to answer your second question: you can combine the computation of the partial derivative of the cross-entropy cost function w.r.t. the output activation (i.e. the softmax) with the partial derivative of the output activation w.r.t. zj, which results in a short and clear implementation. In non-vectorized form it looks like this:
for (int i = 0; i < lenOfClasses; ++i)
{
// Note the sign convention: t[i] - a[i] equals -(∂C/∂zi).
// Use a[i] - t[i] instead if your update step subtracts the gradient.
dCdz[i] = t[i] - a[i];
}
And subsequently you can use dCdz for backpropagating to the rest of the layers of the neural network.
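If you ever need the softmax derivative on its own (e.g. with a cost other than cross-entropy), the full Jacobian follows directly from ∂ak/∂zj = ak(δkj − aj) above. A minimal sketch in the style of your ActivationFunction class, assuming the argument is the already-computed softmax output:

public static double[][] dSoftmax(double[] a) {
    // a holds the softmax outputs; returns J[k][j] = ∂ak/∂zj = ak * (δkj − aj).
    double[][] jacobian = new double[a.length][a.length];
    for (int k = 0; k < a.length; k++) {
        for (int j = 0; j < a.length; j++) {
            double delta = (k == j) ? 1.0 : 0.0; // Kronecker delta
            jacobian[k][j] = a[k] * (delta - a[j]);
        }
    }
    return jacobian;
}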
I wrote this algorithm. It works (at least with my short test cases), but takes too long on larger inputs. How can I make it faster?
// Returns an array of length 2 with the two closest points to each other from the
// original array of points "arr"
private static Point2D[] getClosestPair(Point2D[] arr)
{
int n = arr.length;
float min = 1.0f;
float dist = 0.0f;
Point2D[] ret = new Point2D[2];
// If array only has 2 points, return array
if (n == 2) return arr;
// Algorithm says to brute force at 3 or lower array items
if (n <= 3)
{
for (int i = 0; i < arr.length; i++)
{
for (int j = 0; j < arr.length; j++)
{
// If points are identical but the point is not looking
// at itself, return because shortest distance is 0 then
if (i != j && arr[i].equals(arr[j]))
{
ret[0] = arr[i];
ret[1] = arr[j];
return ret;
}
// If points are not the same and current min is larger than
// current stored distance
else if (i != j && dist < min)
{
dist = distanceSq(arr[i], arr[j]);
ret[0] = arr[i];
ret[1] = arr[j];
min = dist;
}
}
}
return ret;
}
int halfN = n/2;
// Left hand side
Point2D[] LHS = Arrays.copyOfRange(arr, 0, halfN);
// Right hand side
Point2D[] RHS = Arrays.copyOfRange(arr, halfN, n);
// Result of left recursion
Point2D[] LRes = getClosestPair(LHS);
// Result of right recursion
Point2D[] RRes = getClosestPair(RHS);
float LDist = distanceSq(LRes[0], LRes[1]);
float RDist = distanceSq(RRes[0], RRes[1]);
// Calculate minimum of both recursive results
if (LDist > RDist)
{
min = RDist;
ret[0] = RRes[0];
ret[1] = RRes[1];
}
else
{
min = LDist;
ret[0] = LRes[0];
ret[1] = LRes[1];
}
for (Point2D q : LHS)
{
// If q is close to the median line
if ((halfN - q.getX()) < min)
{
for (Point2D p : RHS)
{
// If p is close to q
if ((p.getX() - q.getX()) < min)
{
dist = distanceSq(q, p);
if (!q.equals(p) && dist < min)
{
min = dist;
ret[0] = q;
ret[1] = p;
}
}
}
}
}
return ret;
}
private static float distanceSq(Point2D p1, Point2D p2)
{
return (float)Math.pow((p1.getX() - p2.getX()) + (p1.getY() - p2.getY()), 2);
}
I am loosely following the algorithm explained here: http://www.cs.mcgill.ca/~cs251/ClosestPair/ClosestPairDQ.html
and a different resource with pseudocode here:
http://i.imgur.com/XYDTfBl.png
I cannot change the return type of the function, or add any new arguments.
Thanks for any help!
There are several things you can do.
First, you can cut the running time very simply by changing the inner iteration to run only over the "remainder" points. This avoids calculating both (i,j) and (j,i) for each pair of values. To do so, simply change:
for (int j = 0; j < arr.length; j++)
to
for (int j = i+1; j < arr.length; j++)
This will still be O(n^2) though.
You can achieve O(n log n) time by iterating over the points and storing each one in a smart data structure (most likely a kd-tree). Before each insertion, find the closest point already stored in the DS (a kd-tree supports this in O(log n) time on average); it is your candidate for the minimal distance. A sketch of that loop follows below.
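Assuming a hypothetical KdTree class with insert and nearest methods (the JDK has no kd-tree, so you would have to write or import one), the loop could look like:

private static Point2D[] closestPairViaKdTree(Point2D[] pts, KdTree tree) {
    // Assumes at least two points in pts.
    double best = Double.POSITIVE_INFINITY;
    Point2D[] pair = new Point2D[2];
    tree.insert(pts[0]);
    for (int i = 1; i < pts.length; i++) {
        // Closest point among those inserted so far, O(log n) on average.
        Point2D nn = tree.nearest(pts[i]);
        double d = pts[i].distanceSq(nn);
        if (d < best) {
            best = d;
            pair[0] = pts[i];
            pair[1] = nn;
        }
        tree.insert(pts[i]);
    }
    return pair;
}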
I believe the linked algorithm mentions sorting the array by one coordinate, so that given a q on the LHS (say among points 1 to 2000), once an RHS p (say point 200) is already more than min away in its x distance alone, you can skip checking the remaining points 201 to 2000.
I figured it out and cut the time by a vast amount. The distanceSq function is wrong; it is best to use Java's built-in Point2D method somepoint.distanceSq(otherpoint) instead.
As for the original brute force when n is 3 (it will only ever be 3 or 2 in that scenario), a linear search is better and more effective.
The checks against the min variable in the inner for loops after the brute-force condition are also wrong. Using squared distances is fine, but min then holds a squared distance while the coordinate checks compare plain (unsquared) differences against it, which means min must be square-rooted in both checks (once in the outer loop, once in the inner loop).
So,
if ((p.getX() - q.getX()) < min)
Should be
if ((p.getX() - q.getX()) < Math.sqrt(min))
Same goes for the other check.
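For reference, a corrected distanceSq equivalent to Point2D.distanceSq would be:

private static float distanceSq(Point2D p1, Point2D p2)
{
    // Squared Euclidean distance: dx*dx + dy*dy.
    // (The original squared the *sum* of the coordinate differences.)
    float dx = (float)(p1.getX() - p2.getX());
    float dy = (float)(p1.getY() - p2.getY());
    return dx * dx + dy * dy;
}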
Thanks for your answers everyone!
During a 45 minute technical interview with Google, I was asked a Leaper Graph problem.
I wrote working code, but later was declined the job offer because I lacked Data structure knowledge. I'm wondering what I could have done better.
The problem was as following:
"Given an N sized board, and told that a piece can jump i positions horizontally (left or right) and j positions vertically (up or down) (I.e, sort of like a horse in chess), can the leaper reach every spot on the board?"
I wrote the following algorithm. It recursively determines whether every position on the board is reachable by marking all visited spots on the grid. If some spot was not reachable, at least one field remains false and the function returns false.
static boolean reachable(int i, int j, int n) {
boolean grid[][] = new boolean[n][n];
reachableHelper(0, 0, grid, i, j, n - 1);
for (int x = 0; x < n; x++) {
for (int y = 0; y < n; y++) {
if (!grid[x][y]) {
return false;
}
}
}
return true;
}
static void reachableHelper(int x, int y, boolean[][] grid, int i, int j, int max) {
if (x > max || y > max || x < 0 || y < 0 || grid[x][y]) {
return;
}
grid[x][y] = true;
int i2 = i;
int j2 = j;
for (int a = 0; a < 2; a++) {
for (int b = 0; b < 2; b++) {
reachableHelper(x + i2, y + j2, grid, i, j, max);
reachableHelper(x + j2, y + i2, grid, i, j, max);
i2 = -i2;
}
j2 = -j2;
}
}
Now, later it was pointed out that the optimal solution would be to implement Donald Knuth's co-prime implementation:
http://arxiv.org/pdf/math/9411240v1.pdf
Is this something that one should be able to figure out in a 45-minute technical interview?
Besides the above, is there anything I could have done better?
edit:
- I enquired about starting position. I was told starting at 0,0 is fine.
edit2
Based on feedback, I wrote a while-loop with queue approach.
The recursive approach runs into a stack-overflow when n = 85.
However, the while-loop-with-queue method below works up to roughly n = 30,000 (after that it runs into heap issues, with memory use exceeding gigabytes). If you know how to optimize it further, please let me know.
static boolean isReachableLoop(int i, int j, int n) {
boolean [][] grid = new boolean [n][n];
LinkedList<Point> queue = new LinkedList<Point>();
queue.add(new Point(0,0)); // starting position.
int nodesVisited = 0;
while (queue.size() != 0) {
Point pos = queue.removeFirst();
if (pos.x >= 0 && pos.y >= 0 && pos.x < n && pos.y < n) {
if (!grid[pos.x][pos.y]) {
grid[pos.x][pos.y] = true;
nodesVisited++;
int i2 = i;
int j2 = j;
for (int a = 0; a < 2; a++) {
for (int b = 0; b < 2; b++) {
queue.add(new Point(pos.x+i2, pos.y+j2));
queue.add(new Point(pos.x+j2, pos.y+i2));
i2 = -i2;
}
j2 = -j2;
}
}
}
}
return nodesVisited == (n * n);
}
I ask a lot of interview questions like this. I don't think you would be expected to figure out the coprime method during the interview, but I would have docked you for using O(n^2) stack space -- especially since you passed all those parameters to each recursive call instead of using an object.
I would have asked you about that, and expected you to come up with a BFS or DFS using a stack or queue on the heap. If you failed on that, I might have a complaint like "lacked data structure knowledge".
I would also have asked questions to make sure you knew what you were doing when you allocated that 2D array.
If you were really good, I would ask you if you can use the symmetry of the problem to reduce your search space. You really only have to search a j*j-sized grid (assuming j >= i).
It's important to remember that the interviewer isn't just looking at your answer. He's looking at the way you solve problems and what tools you have in your brain that you can bring to bear on a solution.
Edit: thinking about this some more, there are lots of incremental steps on the way to the coprime method that you might also come up with. Nobody will expect that, but it would be impressive!
I'm sorry, I feel like I'm missing something.
If you can only go up or down by i and left or right by j, then a square (x,y) is reachable from a starting square (a,b) only if there are integers m and k such that
a + m*i = x
b + k*j = y
That is, unless i and j are both 1, some squares are unreachable, and the answer is false for any square board with n > 1.
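Under that reading, the per-square test is just divisibility. A minimal sketch (assuming i, j > 0):

static boolean reachableSimple(int a, int b, int x, int y, int i, int j) {
    // (x, y) is reachable from (a, b) only if each offset is an exact
    // multiple of the corresponding step size.
    return (x - a) % i == 0 && (y - b) % j == 0;
}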
If you meant more like a knight in chess, so that you can go up/down by i and left/right by j OR up/down by j and left/right by i, you can use the same technique. It just becomes two equations to solve:
a + m * i + n * j = x
b + o * i + p * j = y
If there are no integers m, n, o and p that satisfy those equations, you can't reach that point.
How can I use differential evolution to find the maximum value of the function f(x) = -x(x+1) on the interval from -500 to 500? I need this for a chess program I am making. I have begun researching differential evolution and am still finding it quite difficult to understand, let alone use in a program. Can anyone please help by introducing me to the algorithm in a simple way, and possibly giving some example pseudo-code for such a program?
First of all, sorry for the late reply.
I bet that you won't know the derivatives of the function you're trying to maximize; that's why you want to use the differential evolution algorithm and not something like the Newton-Raphson method.
I found a great link that explains Differential Evolution in a straightforward manner: http://web.as.uky.edu/statistics/users/viele/sta705s06/diffev.pdf.
On the first page, there is a section with an explanation of the algorithm:
Let each generation of points consist of n points, with j terms in each.
Initialize an array of size j and fill it with j distinct random x values from -500 to 500, the interval you are considering right now. Ideally, you would know roughly where the maximum value is, and you would make it more probable for your x values to be there.
For each j, randomly select two points yj,1 and yj,2 uniformly from the set of points x(m).
Construct a candidate point cj = xj(m) + α(yj,1 − yj,2). Basically the two y values involve picking a random direction and distance, and the candidate is found by adding that random direction and distance (scaled by α) to the current value.
(Here x(m) denotes the m-th generation of points.)
Hmm... this is a bit more complicated. Iterate through the array you made in the last step. For each x value at index j, pick two random entries yj1 and yj2. Construct a candidate x value with cx = x[j] + α(yj1 − yj2), where you choose your α. You can try experimenting with different values of alpha.
Check which one evaluates larger: the candidate value or the x value at j. If the candidate evaluates larger, replace the x value at j with it.
Do this until all of the values in the array are more or less similar.
Ta-dah, any of the values in the array will then be (approximately) the x that maximizes the function. Just to reduce randomness (or maybe this is not important...), average them all together.
The more stringent you make the about method, the better approximations you will get, but the more time it will take.
For example, instead of Math.abs(a - b) <= alpha /10, I would do Math.abs(a - b) <= alpha /10000 to get a better approximation.
You will get a good approximation of the value that you want.
Happy coding!
Code I wrote for this response:
public class DifferentialEvolution {
public static final double alpha = 0.001;
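// f(x) = -x*(x+1) = -x^2 - x peaks at x = -0.5 (where f = 0.25), handy for sanity-checking the result.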
public static double evaluate(double x) {
return -x*(x+1);
}
public static double max(int N) { // N is initial array size.
double[] xs = new double[N];
for(int j = 0; j < N; j++) {
xs[j] = Math.random()*1000.0 - 500.0; // Number from -500 to 500.
}
boolean done = false;
while(!done) {
for(int j = 0; j < N; j++) {
double yj1 = xs[(int)(Math.random()*N)]; // This might include xs[j], but that shouldn't be a problem.
double yj2 = xs[(int)(Math.random()*N)]; // It will only slow things down a bit.
double cj = xs[j] + alpha*(yj1-yj2);
if(evaluate(cj) > evaluate(xs[j])) {
xs[j] = cj;
}
}
double average = average(xs); // Edited
done = true;
for(int j = 0; j < N; j++) { // Edited
if(!about(xs[j], average)) { // Edited
done = false;
break;
}
}
}
return average(xs);
}
public static double average(double[] values) {
double sum = 0;
for(int i = 0; i < values.length; i++) {
sum += values[i];
}
return sum/values.length;
}
public static boolean about(double a, double b) {
if(Math.abs(a - b) <= alpha /10000) { // This should work.
return true;
}
return false;
}
public static void main(String[] args) {
long t = System.currentTimeMillis();
System.out.println(max(3));
System.out.println("Time (Milliseconds): " + (System.currentTimeMillis() - t));
}
}
If you have any questions after reading this, feel free to ask them in the comments. I'll do my best to help.
Introduction:
Using two identical mergesort algorithms, I tested the execution speed of C++ (using Visual Studio C++ 2010 Express) vs Java (using NetBeans 7.0). I conjectured that the C++ version would be at least slightly faster, but testing revealed that it was 4 to 10 times slower than the Java version. I believe that I have enabled all the speed optimisations for C++, and I am building a Release rather than a Debug configuration. Why is this speed discrepancy occurring?
Code:
Java:
public class PerformanceTest1
{
/**
* Sorts the array in place using a merge sort algorithm
* @param array The array to be sorted
*/
public static void sort(double[] array)
{
if(array.length > 1)
{
int centre;
double[] left;
double[] right;
int arrayPointer = 0;
int leftPointer = 0;
int rightPointer = 0;
centre = (int)Math.floor((array.length) / 2.0);
left = new double[centre];
right = new double[array.length - centre];
System.arraycopy(array,0,left,0,left.length);
System.arraycopy(array,centre,right,0,right.length);
sort(left);
sort(right);
while((leftPointer < left.length) && (rightPointer < right.length))
{
if(left[leftPointer] <= right[rightPointer])
{
array[arrayPointer] = left[leftPointer];
leftPointer += 1;
}
else
{
array[arrayPointer] = right[rightPointer];
rightPointer += 1;
}
arrayPointer += 1;
}
if(leftPointer < left.length)
{
System.arraycopy(left,leftPointer,array,arrayPointer,array.length - arrayPointer);
}
else if(rightPointer < right.length)
{
System.arraycopy(right,rightPointer,array,arrayPointer,array.length - arrayPointer);
}
}
}
public static void main(String args[])
{
//Number of elements to sort
int arraySize = 1000000;
//Create the variables for timing
double start;
double end;
double duration;
//Build array
double[] data = new double[arraySize];
for(int i = 0;i < data.length;i += 1)
{
data[i] = Math.round(Math.random() * 10000);
}
//Run performance test
start = System.nanoTime();
sort(data);
end = System.nanoTime();
//Output performance results
duration = (end - start) / 1E9;
System.out.println("Duration: " + duration);
}
}
C++:
#include <iostream>
#include <windows.h>
using namespace std;
//Mergesort
void sort1(double *data,int size)
{
if(size > 1)
{
int centre;
double *left;
int leftSize;
double *right;
int rightSize;
int dataPointer = 0;
int leftPointer = 0;
int rightPointer = 0;
centre = (int)floor((size) / 2.0);
leftSize = centre;
left = new double[leftSize];
for(int i = 0;i < leftSize;i += 1)
{
left[i] = data[i];
}
rightSize = size - leftSize;
right = new double[rightSize];
for(int i = leftSize;i < size;i += 1)
{
right[i - leftSize] = data[i];
}
sort1(left,leftSize);
sort1(right,rightSize);
while((leftPointer < leftSize) && (rightPointer < rightSize))
{
if(left[leftPointer] <= right[rightPointer])
{
data[dataPointer] = left[leftPointer];
leftPointer += 1;
}
else
{
data[dataPointer] = right[rightPointer];
rightPointer += 1;
}
dataPointer += 1;
}
if(leftPointer < leftSize)
{
for(int i = dataPointer;i < size;i += 1)
{
data[i] = left[leftPointer++];
}
}
else if(rightPointer < rightSize)
{
for(int i = dataPointer;i < size;i += 1)
{
data[i] = right[rightPointer++];
}
}
delete left;
delete right;
}
}
void main()
{
//Number of elements to sort
int arraySize = 1000000;
//Create the variables for timing
LARGE_INTEGER start; //Starting time
LARGE_INTEGER end; //Ending time
LARGE_INTEGER freq; //Rate of time update
double duration; //end - start
QueryPerformanceFrequency(&freq); //Determinine the frequency of the performance counter (high precision system timer)
//Build array
double *temp2 = new double[arraySize];
QueryPerformanceCounter(&start);
srand((int)start.QuadPart);
for(int i = 0;i < arraySize;i += 1)
{
double randVal = rand() % 10000;
temp2[i] = randVal;
}
//Run performance test
QueryPerformanceCounter(&start);
sort1(temp2,arraySize);
QueryPerformanceCounter(&end);
delete temp2;
//Output performance test results
duration = (double)(end.QuadPart - start.QuadPart) / (double)(freq.QuadPart);
cout << "Duration: " << duration << endl;
//Dramatic pause
system("pause");
}
Observations:
For 10000 elements, the C++ execution takes roughly 4 times the amount of time as the Java execution.
For 100000 elements, the ratio is about 7:1.
For 10000000 elements, the ratio is about 10:1.
For over 10000000, the Java execution completes, but the C++ execution stalls, and I have to manually kill the process.
I think there might be a mistake in the way you ran the program. When you hit F5 in Visual C++ Express, the program runs under the debugger, and it will be a LOT slower. In other editions of Visual C++ 2010 (e.g. the Ultimate edition that I use), try hitting CTRL+F5 (i.e. Start Without Debugging), or try running the executable file itself (in Express), and you'll see the difference.
I ran your program with only one modification on my machine (adding delete[] left; delete[] right; to get rid of the memory leak; otherwise it would run out of memory in 32-bit mode!). I have an i7 950. To be fair, I also passed the same array to Arrays.sort() in Java and to std::sort in C++. I used an array size of 10,000,000.
Here are the results (time in seconds):

Java code: 7.13
Java Arrays.sort: 0.93

C++ code (32-bit): 3.57
C++ std::sort (32-bit): 0.81

C++ code (64-bit): 2.77
C++ std::sort (64-bit): 0.76
So the C++ code is much faster, and even the standard library sort, which is highly tuned in both Java and C++, shows a slight advantage for C++.
Edit: I just realized that in your original test, you ran the C++ code in debug mode. You should switch to Release mode AND run it outside the debugger (as I explained in my post) to get a fair result.
I don't program C++ professionally (or even unprofessionally :)) but I notice that you are allocating double arrays on the heap (e.g. double *temp2 = new double[arraySize];). Heap allocation is expensive compared to Java initialisation, but more importantly, the left and right arrays allocated in every recursive call are never freed correctly, so this constitutes a memory leak. That could explain why your C++ implementation stalls: it has essentially run out of memory.
To start with, did you try using std::sort (or std::stable_sort, which is typically a mergesort) to get a baseline performance figure in C++?
I can't comment on the Java code but for the C++ code:
Unlike Java, new in C++ requires manual intervention to free the memory, so every recursive call leaks memory. I would suggest using std::vector instead, as it manages all the memory for you, AND its iterator-range constructor will even do the copy (and is possibly better optimized than your for loop). This is almost certainly the cause of your performance difference.
You use System.arraycopy in Java but don't use the corresponding library facility (std::copy) in C++, although again this wouldn't matter if you used vector.
Nit: declare and initialize your variables on the same line, at the point where you first need them, not all at the top of the function.
If you're allowed to use parts of the standard library, std::merge could replace your merge algorithm.
EDIT: If you really are using delete left; to clean up memory, that's probably your problem. The correct syntax is delete[] left; for deallocating an array.
Your version was leaking so much memory that the timings were meaningless.
I am sure the time was spent thrashing the memory allocator.
Rewrite it to use standard C++ objects for memory management std::vector and see what happens.
Personally I would still expect the Java version to win (just), because the JIT allows machine-specific optimizations, while C++ will generally only do generic architecture optimizations (unless you provide the exact architecture flags).
Note: Don't forget to compile with optimizations turned on (for MSVC, the Release configuration's /O2; for g++, -O2 or higher).
Just cleaning up your C++ (I have not tried to make a good merge sort here, just re-written yours in a C++ style):
void sort1(std::vector<double>& data)
{
if(data.size() > 1)
{
std::size_t centre = data.size() / 2;
std::size_t lftSize = centre;
std::size_t rhtSize = data.size() - lftSize;
// Why are we allocating new arrays here??
// Is the whole point of the merge sort to do it in place?
// I forget, but I think you need to go look at a Knuth book.
//
std::vector<double> lft(data.begin(), data.begin() + lftSize);
std::vector<double> rht(data.begin() + lftSize, data.end());
sort1(lft);
sort1(rht);
std::size_t dataPointer = 0;
std::size_t lftPointer = 0;
std::size_t rhtPointer = 0;
while((lftPointer < lftSize) && (rhtPointer < rhtSize))
{
data[dataPointer++] = (lft[lftPointer] <= rht[rhtPointer])
? lft[lftPointer++]
: rht[rhtPointer++];
}
// Only one of these actually copies anything: the side with elements remaining.
std::copy(lft.begin() + lftPointer, lft.end(), &data[dataPointer]);
std::copy(rht.begin() + rhtPointer, rht.end(), &data[dataPointer]);
}
}
Thinking about merge sort. I would try this:
I have not tested it, so it may not work correctly. But it is an attempt to avoid allocating huge amounts of memory during the sort; instead it uses a single temp area and copies the result back when the merge is done.
void mergeSort(double* begin, double* end, double* tmp)
{
if (end - begin <= 1)
{ return;
}
std::size_t size = end - begin;
double* middle = begin + (size / 2);
mergeSort(begin, middle, tmp);
mergeSort(middle, end, tmp);
double* lft = begin;
double* rht = middle;
double* dst = tmp;
while((lft < middle) && (rht < end))
{
*dst++ = (*lft < *rht)
? *lft++
: *rht++;
}
std::size_t count = dst - tmp;
memcpy(begin, tmp, sizeof(double) * count);
memcpy(begin + count, lft, sizeof(double) * (middle - lft));
// Any right-hand remainder is already in place at the tail of the range,
// so it does not need to be copied back (the original self-memcpy here was redundant).
}
void sort2(std::vector<double>& data)
{
double* left = &data[0];
double* right = left + data.size(); // one-past-the-end pointer; &data[data.size()] would be undefined behavior
std::vector<double> tmp(data.size());
mergeSort(left,right, &tmp[0]);
}
A couple of things.
Java is highly optimized: once the code has been executed a few times, the JIT compiler runs it as native code.
Your System.arraycopy in Java is going to execute much faster than copying each element one at a time; try replacing the element-by-element copy loops in the C++ version with memcpy and you will see that it is much faster.
EDIT:
Look at this post: C++ performance vs. Java/C#
It is hard to tell just from looking at your code, but I would hazard a guess that the cause lies in the handling of recursion rather than in the actual computations. Try using a sorting algorithm that relies on iteration instead of recursion and share the results of the performance comparison.
I don't know why Java is so much faster here.
I compared it with the built in Arrays.sort() and it was 4x faster again. (It doesn't create any objects).
Usually if there is a test where Java is much faster, it's because Java is much better at removing code which doesn't do anything.
Perhaps you could use memcpy rather than a loop at the end.
Try making a global vector as a buffer, and try not to allocate a lot of memory.
The following will run faster than your code because it uses some tricks: it uses only one buffer, and the memory is allocated when the program starts, so the heap will not become fragmented:
#include <cstdio>
#define N 500001
int a[N];
int x[N];
int n;
void merge (int a[], int l, int r)
{
int m = (l + r) / 2;
int i, j, k = l - 1;
for (i = l, j = m + 1; i <= m && j <= r;)
if (a[i] < a[j])
x[++k] = a[i++];
else
x[++k] = a[j++];
for (; i <= m; ++i)
x[++k] = a[i];
for (; j <= r; ++j)
x[++k] = a[j];
for (i = l; i <= r; ++i)
a[i] = x[i];
}
void mergeSort (int a[], int l, int r)
{
if (l >= r)
return;
int m = (l + r) / 2;
mergeSort (a, l, m);
mergeSort (a, m + 1, r);
merge (a, l, r);
}
int main ()
{
int i;
freopen ("algsort.in", "r", stdin);
freopen ("algsort.out", "w", stdout);
scanf ("%d\n", &n);
for (i = 1; i <= n; ++i)
scanf ("%d ", &a[i]);
mergeSort (a, 1, n);
for (i = 1; i <= n; ++i)
printf ("%d ", a[i]);
return 0;
}