Fast 4x4 matrix multiplication in Java with NIO float buffers


I know there are a LOT of questions like this, but I can't find one specific to my situation. I have 4x4 matrices implemented as NIO float buffers (these matrices are used for OpenGL). Now I want to implement a multiply method that multiplies matrix A by matrix B and stores the result in matrix C. The code might look like this:
class Matrix4f
{
private FloatBuffer buffer = FloatBuffer.allocate(16);
public Matrix4f multiply(Matrix4f matrix2, Matrix4f result)
{
{{{result = this * matrix2}}} <-- I need this code
return result;
}
}
What is the fastest possible code to do this multiplication? Some OpenGL implementations (like the OpenGL ES stuff in Android) provide native code for this, but others don't. So I want to provide a generic multiplication method for those implementations.

The real answer is of course to test different implementations and check which one is fastest.
My guess, without testing, would be that as the matrices are so small, expanding the loops by hand would result in the fastest code. E.g. something like
result[0][0] = this[0][0] * matrix2[0][0] + this[0][1] * matrix2[1][0]
+ this[0][2] * matrix2[2][0] + this[0][3] * matrix2[3][0];
result[0][1] = // ... and so forth
or maybe just unroll the innermost loop and keep the two outer ones, which saves some typing as well as instruction cache (I$).

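Go through FloatBuffer.array() if that operation is supported (hasArray() returns true for heap buffers such as the one FloatBuffer.allocate creates; direct buffers usually have no backing array). Then just perform the necessary multiplications through that array, and return the resulting matrix.
Have a look at GameDev.net - Matrix Math for the exact computations.
If you want to optimize it further, you could try out Strassen's algorithm. You wouldn't even need to pad your matrices, since they are square and of a size that is a power of 2. For matrices this small, though, the extra additions may well outweigh the saved multiplications, so measure before committing to it.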
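As a rough illustration of the array-based approach, here is a minimal sketch. It assumes a heap buffer (so array() works), column-major storage as OpenGL expects, and a result matrix that is a different object from both operands. The class name Matrix4fSketch is made up for this example.

```java
import java.nio.FloatBuffer;

final class Matrix4fSketch {
    // 16 floats, column-major (assumed): element (row, col) lives at index col * 4 + row.
    final FloatBuffer buffer = FloatBuffer.allocate(16);

    /** Computes result = this * other. Assumes result is neither this nor other. */
    Matrix4fSketch multiply(Matrix4fSketch other, Matrix4fSketch result) {
        // allocate() creates a heap buffer, so array() is available here;
        // a direct buffer would need absolute get(int) / put(int, float) instead.
        float[] a = buffer.array();
        float[] b = other.buffer.array();
        float[] r = result.buffer.array();
        for (int col = 0; col < 4; col++) {
            int c = col * 4;
            float b0 = b[c], b1 = b[c + 1], b2 = b[c + 2], b3 = b[c + 3];
            // Innermost loop (over the shared dimension) unrolled by hand.
            r[c]     = a[0] * b0 + a[4] * b1 + a[8]  * b2 + a[12] * b3;
            r[c + 1] = a[1] * b0 + a[5] * b1 + a[9]  * b2 + a[13] * b3;
            r[c + 2] = a[2] * b0 + a[6] * b1 + a[10] * b2 + a[14] * b3;
            r[c + 3] = a[3] * b0 + a[7] * b1 + a[11] * b2 + a[15] * b3;
        }
        return result;
    }
}
```

Whether this beats a fully unrolled version or plain nested loops depends on the JVM and device, so it is worth benchmarking on the target hardware.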

Related

How to divide Array Length into arrays with lengths that are multiples of 3

So I have a fairly large array that contains xyz coordinates, where array[0] = x0, array[1] = y0, array[2] = z0, array[3] = x1, array[4] = y1... and so on.
I'm running an algorithm on this array that is taking longer than I would like it to, and I want to split the work amongst threads. I have my threads set up, but I am not sure how to divide this array properly so I can distribute this work across 3 threads.
Even though my array length is divisible by 3, this won't work, because splitting into 3 can split an xyz coordinate (for instance, if my array had size 15, dividing it by 3 would give me arrays of size 5, which means I'm splitting an XYZ coordinate).
How can I split this array (it doesn't have to necessarily be equal in size) so that I can distribute the work? (for instance, in the previous example, I would like to have two arrays of size 6 and one of size 3).
Note: The size of the array is variable, but is always divisible by 3.
EDIT: Sorry, should have mentioned that I'm working in Java. My algorithm iterates through a collection of coordinates and determines which coordinates lie inside of a particular 3d shape (such as an ellipsoid). It saves these coordinates and I perform other tasks with these coordinates (I'm working on a computer graphics app).
EDIT2: I'm going to elaborate on the algorithm a bit more.
Basically, I am working in Android OpenGL ES 3.0. I have a complex 3D object with somewhere around 230000 vertices and close to a million triangles.
In the app, the user moves either an ellipsoid or a box (they choose which one) to a location close to or on the object. After moving it, they click a button, which runs my algorithm.
The purpose of the algorithm is to determine which points from my object lie inside of the ellipsoid or box. These points are subsequently changed to a different color. To add to the complexity, however, is the fact that I have transformation matrices applied to both the points of the object and the points of the ellipsoid/box.
My current algorithm begins by iterating through all the points of the object. For those of you unclear on my iteration, this is my loop.
for(int i = 0; i < numberOfVertices*3;)
{
pointX = vertices[i];
i++;
pointY = vertices[i];
i++;
pointZ = vertices[i];
i++;
//consider transformations, then run algorithm
}
I perform the necessary steps to consider all my transformations, and after that is finished, I have a point from my object and the location of my ellipsoid/box centroid.
Then, depending on the shape, one of the following algorithms is used:
Ellipsoid: I use the centroid of the ellipse and apply the formula
$(x-c)^T R^T A R (x-c)$. Here x is a column vector describing the xyz point from my object that I am on in my iteration, c is a column vector describing the xyz point of my centroid, and T means transpose. R is my rotation matrix. A is a diagonal matrix with entries (1/a^2, 1/b^2, 1/c^2), and I have values for a, b and c. If this formula is > 1, then x lies outside of my ellipsoid and is not a valid point. If it is <= 1, then I save x.
Box: I simply check if the point falls within a range. If the point of the object lies a certain distance in the X-direction, Y-direction, and Z-direction from the centroid, I save it.
These algorithms are accurate and work as intended. The issue, obviously, is efficiency. I don't seem to have a good understanding of what makes my app strain and what doesn't. I thought multi-threading would work, and I tried some of the techniques described, but they didn't significantly improve performance. If anyone has ideas on narrowing down my search so I'm not iterating through all these points, it would help.

May I suggest a slightly different way to handle it. I know this isn't a direct answer to your question, but please consider it.
This could be easier to see if you implemented it as coordinate objects, each with x, y and z values. Your "array" would now be 1/3 as long. You might think this would be less efficient--and you might be right--but you'd be surprised at how well Java can optimize things. Often Java optimizes for the cases people use the most, and manually manipulating this array as you suggest is possibly even slower than using objects. Until you've proven the most readable design too slow, you shouldn't optimize it.
Now you have a collection of coordinate objects. Java has queues that multiple threads can pull from efficiently. Dump all your objects into a queue and have each of your threads pull one, process it, and put it in a "Completed" queue. Note that this gives you the ability to add or remove threads easily, without affecting your code except for 1 number. How would you take the array-based solution to 4 or 6 threads?
Good luck

Here is a demo of the work explained below.
Observations
Each coordinate is 3 indexes.
You have 3 threads.
Let's say you have 17 coordinates, that's 51 indexes. You want to split the 17 coordinates among your 3 threads.
var arraySize = 51;
var numberOfThreads = 3;
var numberOfIndexesPerCoordinate = 3;
var numberOfCoordinates = arraySize / numberOfIndexesPerCoordinate; //17 coordinates
Now split that 17 coordinates among your threads.
var coordinatesPerThread = numberOfCoordinates / numberOfThreads; //5.6667
This isn't a whole number, so you need to distribute unevenly. We can use Math.floor and modulo to distribute.
var floored = Math.floor(coordinatesPerThread); //5 - every thread gets at least 5.
var modulod = numberOfCoordinates % numberOfThreads; // 2 - there will be 2 left that need to be placed sequentially into your thread pool
This should give you all the information you need. Without knowing what language you are using, I don't want to give any real code samples.
I see you edited your question to specify Java as your language. I'm not going to do the threading work for you, but I'll give a rough idea.
float[] coordinates = new float[17 * 3]; //17 coordinates with 3 indexes each.
int numberOfThreads = 3;
int numberOfIndexesPerCoordinate = 3;
int numberOfCoordinates = coordinates.length / numberOfIndexesPerCoordinate; //51 indexes / 3 indexes each = 17 coordinates
//Every thread has at least this many coordinates (integer division already rounds down)
int coordinatesPerThread = numberOfCoordinates / numberOfThreads;
//This is the number of coordinates remaining that couldn't evenly be split.
int remainingCoordinates = numberOfCoordinates % numberOfThreads;
//To make things easier, I'm just going to track the offset in the original array. It could probably be computed instead, but it's just an int.
int offset = 0;
for (int i = 0; i < numberOfThreads; i++) {
int numberOfIndexes = coordinatesPerThread * numberOfIndexesPerCoordinate;
//If this index is one of the remainders, then increase by 1 coordinate (3 indexes).
if (i < remainingCoordinates)
numberOfIndexes += numberOfIndexesPerCoordinate ;
float[] dest = new float[numberOfIndexes];
System.arraycopy(coordinates, offset, dest, 0, numberOfIndexes);
offset += numberOfIndexes;
//Put the dest array of indexes into your threads.
}
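If copying the data is undesirable, the same arithmetic can produce index ranges into the original array instead. The following is only a sketch of that idea; CoordinateRanges and split are names invented here, and it assumes the array length is a multiple of 3 as stated in the question.

```java
final class CoordinateRanges {
    /**
     * Returns threads + 1 boundaries into a flat xyz array. Worker t handles
     * indexes [bounds[t], bounds[t + 1]); every boundary is a multiple of 3,
     * so no coordinate is split between workers.
     */
    static int[] split(int arrayLength, int threads) {
        int coordinates = arrayLength / 3;
        int base = coordinates / threads;   // every worker gets at least this many
        int extra = coordinates % threads;  // the first `extra` workers get one more
        int[] bounds = new int[threads + 1];
        for (int t = 0; t < threads; t++) {
            int count = base + (t < extra ? 1 : 0);
            bounds[t + 1] = bounds[t] + count * 3;
        }
        return bounds;
    }
}
```

For the size-15 example from the question, split(15, 3) gives the boundaries 0, 6, 12, 15, that is, two ranges of 6 indexes and one of 3.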
Another, potentially better option would be to use a Concurrent Deque that has all of your coordinates, and have each thread pull from it as they need a new coordinate to work with. For this solution, you'd need to create Coordinate objects.
Declare a Coordinate object
public static class Coordinate {
protected float x;
protected float y;
protected float z;
public Coordinate(float x, float y, float z) {
this.x = x;
this.y = y;
this.z = z;
}
}
Declare a task to do your work, and pass it your concurrent deque.
public static class CoordinateTask implements Runnable {
private final Deque<Coordinate> deque;
public CoordinateTask(Deque<Coordinate> deque) {
this.deque = deque;
}
public void run() {
Coordinate coordinate;
while ((coordinate = this.deque.poll()) != null) {
//Do your processing here.
System.out.println(String.format("Processing coordinate <%f, %f, %f>.",
coordinate.x,
coordinate.y,
coordinate.z));
}
}
}
Here's the main method showing the example in action
public static void main(String []args){
Coordinate[] coordinates = new Coordinate[17];
for (int i = 0; i < coordinates.length; i++)
coordinates[i] = new Coordinate(i, i + 1, i + 2);
final Deque<Coordinate> deque = new ConcurrentLinkedDeque<Coordinate>(Arrays.asList(coordinates));
Thread t1 = new Thread(new CoordinateTask(deque));
Thread t2 = new Thread(new CoordinateTask(deque));
Thread t3 = new Thread(new CoordinateTask(deque));
t1.start();
t2.start();
t3.start();
}
See this demo.

Before trying to optimize with concurrency, try to minimize the number of points you need to test, and minimize the cost of those tests, by using the most efficient collision detection methods at your disposal.
Some general suggestions:
Consider normalizing everything to a common frame of reference before running through your calculations. For example, instead of applying transformations to each point, transform the selection box/ellipsoid into the shape's coordinate system so you can perform your collision detection without the transformations within each iteration.
You may also be able to combine some or all of your transformations (rotation, translation, etc.) into a single matrix calculation, but that won't gain you much unless you're performing a lot of transformations, which you should try to avoid.
Generally speaking it's beneficial to keep the transformation pipeline as streamlined as possible, and keep all coordinate calculations in the same space to avoid transformations as much as possible.
Try to minimize the number of points you need to perform your slowest calculations on. The most accurate collision test should only be necessary for points that you can't rule out as being inside the shape by faster means, using an approximation of the shape, such as a collection of spheres, or the shape's convex hull. Simplifying the shape allows you to limit the slowest calculations to only those points that lie very close to your shape's actual bounds.
In my own 2D work in the past I found that even calculating the convex hulls for hundreds of complex animated shapes in real time was faster than doing collision detection directly without using their convex hulls, because they enable much faster collision calculations.
Consider calculating/storing additional information about the shape, such as an inner and outer collision sphere (one sphere inside all points, and one outside all points) which you can use as a fast initial filter. Anything inside the smaller sphere is guaranteed to be inside your shape, anything outside the outer sphere is known to be outside your shape. You might even want to store a simplified version of your shape, (or its convex hull), which you could calculate in advance and use to aid collision detection.
Similarly, consider using one or more spheres to approximate your ellipsoid in initial calculations, to minimize which points you need to test for collision.
Instead of calculating actual distances, calculate the squared distances and use those for comparison. However, prefer using faster tests for collision if possible. For example, for convex polygons you can use the Separating Axis Theorem, which projects vertices onto a common axis/plane to permit very quick overlap calculations.
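To make a couple of these suggestions concrete, here is a hedged sketch of the ellipsoid test from the question with a squared-distance bounding-sphere rejection in front of it. The class and method names are invented for illustration, and it assumes R is supplied as a row-major 3x3 rotation that maps world-space offsets into the ellipsoid's local frame.

```java
final class EllipsoidTest {
    /**
     * center: ellipsoid centre (3 floats); r: row-major 3x3 rotation R (9 floats);
     * a, b, c: semi-axis lengths. Returns true when (x-c)^T R^T A R (x-c) <= 1.
     */
    static boolean contains(float px, float py, float pz,
                            float[] center, float[] r, float a, float b, float c) {
        float dx = px - center[0], dy = py - center[1], dz = pz - center[2];
        // Cheap rejection using squared distances: a point outside the sphere
        // whose radius is the longest semi-axis cannot be inside the ellipsoid.
        float maxAxis = Math.max(a, Math.max(b, c));
        if (dx * dx + dy * dy + dz * dz > maxAxis * maxAxis) {
            return false;
        }
        // Rotate the offset into the ellipsoid's local frame: l = R * d.
        float lx = r[0] * dx + r[1] * dy + r[2] * dz;
        float ly = r[3] * dx + r[4] * dy + r[5] * dz;
        float lz = r[6] * dx + r[7] * dy + r[8] * dz;
        // Diagonal A means the quadratic form reduces to three scaled squares.
        return (lx * lx) / (a * a) + (ly * ly) / (b * b) + (lz * lz) / (c * c) <= 1f;
    }
}
```

The rejection step is only worth having when most points are far from the ellipsoid, which seems likely for a small selection shape on a 230000-vertex object, but that is an assumption to verify by profiling.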

Performance of nested loop vs hard coded matrix multiplication

I am reading a book on 2D game programming and am being walked through a 3x3 matrix class for linear transformations. The author has written a method for multiplying two 3x3 matrices as follows.
public Matrix3x3f mul(Matrix3x3f m1)
{
return new Matrix3x3f(new float[][]
{
{
this.m[0][0] * m1.m[0][0] // M[0,0]
+ this.m[0][1] * m1.m[1][0]
+ this.m[0][2] * m1.m[2][0],
this.m[0][0] * m1.m[0][1] // M[0,1]
+ this.m[0][1] * m1.m[1][1]
+ this.m[0][2] * m1.m[2][1],
this.m[0][0] * m1.m[0][2] // M[0,2]
+ this.m[0][1] * m1.m[1][2]
+ this.m[0][2] * m1.m[2][2],
},
{
this.m[1][0] * m1.m[0][0] // M[1,0]
+ this.m[1][1] * m1.m[1][0]
+ this.m[1][2] * m1.m[2][0],
this.m[1][0] * m1.m[0][1] // M[1,1]
+ this.m[1][1] * m1.m[1][1]
+ this.m[1][2] * m1.m[2][1],
this.m[1][0] * m1.m[0][2] // M[1,2]
+ this.m[1][1] * m1.m[1][2]
+ this.m[1][2] * m1.m[2][2],
},
{
this.m[2][0] * m1.m[0][0] // M[2,0]
+ this.m[2][1] * m1.m[1][0]
+ this.m[2][2] * m1.m[2][0],
this.m[2][0] * m1.m[0][1] // M[2,1]
+ this.m[2][1] * m1.m[1][1]
+ this.m[2][2] * m1.m[2][1],
this.m[2][0] * m1.m[0][2] // M[2,2]
+ this.m[2][1] * m1.m[1][2]
+ this.m[2][2] * m1.m[2][2],
},
});
}
If I personally needed to write a method to do the same I would have come up with some nested loop which did all of these calculations automatically, I am assuming that perhaps the author has written it out this way so that people with little math background can follow along easier.
Does this sound like a fair assumption or could a nested loop version of this method possibly cause performance issues when used heavily in a loop where performance is vital?

I think this is a performance issue.
If you use a loop, it will execute a lot of jump instructions, since on every iteration it needs to check "if cond goto ___". You should read this post on Branch Prediction and also learn a bit about computer architecture to understand how instructions affect performance; in this case I think you might find caching interesting.

From the looks of it, I think it's for clarity's sake, not for performance's sake. Consider the fact that it's Java code. There's object allocation in the return statement. If it were so performance critical that the conditional jump of a for-loop can't be afforded, the result would be written into a mutable instance.

If the hardcoded operations are exactly the same as the operations performed by a loop, I can see no reason why the loop would be less efficient (or at least, not considerably so). Actually, large loops (which is not the case here) are far more efficient than hardcoding because:
some optimizations can be applied by the compiler and by the JVM at runtime
they allow clearer code and a shorter binary
I have heard that it can sometimes be better to hardcode the operations if the loop iterates over a tiny range, but I don't think it is really worth doing.
Finally, for multiplying matrices, using a loop or not won't change much; what could speed up your calculations is using dynamic programming. I don't know if it's worth doing for small matrices, but if I were you I would give it a try.

This is definitely a performance issue. Nested loops that have to increment the loop index and check whether the loop has ended always make for a slower implementation. In computer graphics and CAD/CAM software, the 3x3 or 4x4 matrix multiplication is done for every rendering action, so it can easily be done millions of times. Therefore, implementing 3x3 or 4x4 matrix multiplication without nested loops is a common practice, especially in the older days when there was no such thing as a GPU. For matrices with more than 4 rows/columns, the nested-loop approach is still used.
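Since the answers disagree, the practical way to settle it is to benchmark both forms. Below is a sketch of the nested-loop counterpart to the book's method, written against plain float[3][3] arrays rather than the book's Matrix3x3f class (whose full definition is not shown), and writing into a caller-supplied destination to avoid the allocation mentioned above. The names are hypothetical.

```java
final class Matrix3x3Loops {
    /** Writes a * b into dest (all row-major float[3][3]). Assumes dest is neither a nor b. */
    static void mul(float[][] a, float[][] b, float[][] dest) {
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 3; col++) {
                float sum = 0f;
                // Dot product of row `row` of a with column `col` of b.
                for (int k = 0; k < 3; k++) {
                    sum += a[row][k] * b[k][col];
                }
                dest[row][col] = sum;
            }
        }
    }
}
```

With fixed trip counts this small, a JIT compiler may unroll the loops itself, so any measured difference could be smaller than the reasoning about jumps suggests.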

Matrix library vs for loops for element-wise operations in Java

I'm looking to do some element-wise operations (addition, multiplication, sqrt, etc.) on floating point arrays that are ~800x300 elements in size.
How much of a speedup (if any) would I get from doing this with matrix libraries (JAMA, EJML, etc.) over just doing the element-wise operations in for loops?
For loops look more appealing because my equations can get kind of complicated, and for loops would mean I could keep all my equations as is -- in plain old infix notation. Since Java doesn't support operator overloading, using a matrix library wouldn't be as simple. So, I only want to use a matrix library if it's going to mean a real speedup. (Speed will be important here.)

I would suggest using one of the matrix libraries for that. In most cases it should run as fast as simple for loops, but it can also run faster. So, what you get for free is an API and equal or better performance. It also saves you a bit of time writing element-wise operations.
As the author of the la4j library, I can say that using a third-party library gives you an opportunity to get faster and faster code from new releases. For example, you could choose la4j for your needs. It currently (versions 0.4.0-0.4.5) uses simple for-loop calculations for element-wise operations, so it won't be faster than hand-written code. But I'm now in the middle of developing a new parallel engine for la4j that allows code to run in parallel mode without any significant changes to the API. Like this:
Matrix a = new Basic2DMatrix(...); // simple 2D array matrix
Matrix b = new Basic2DMatrix(...); // that is too
Matrix c = a.multiply(b); // a * b in sequential mode
Matrix c = a.par().multiply(b); // a * b in parallel mode
So, all you need to do is change one piece of the code. You get all these advantages for free with libraries like la4j. Just let the libraries do their job and spend your time solving real problems.
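For comparison, the plain-loop alternative the question describes could look roughly like the sketch below. The formula is an arbitrary stand-in for the asker's real equations, and ElementWise and apply are names made up for this example; it does not use any library API.

```java
final class ElementWise {
    /** out[i][j] = sqrt(a[i][j] * b[i][j] + a[i][j]); all three arrays must share one shape. */
    static void apply(float[][] a, float[][] b, float[][] out) {
        for (int i = 0; i < a.length; i++) {
            // Hoist the row lookups so the inner loop works on flat float[] rows.
            float[] ra = a[i], rb = b[i], ro = out[i];
            for (int j = 0; j < ra.length; j++) {
                ro[j] = (float) Math.sqrt(ra[j] * rb[j] + ra[j]);
            }
        }
    }
}
```

The equation stays in ordinary infix notation and the whole expression is evaluated in a single pass over the data.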

Java Array Manipulation

I have a function named resize, which takes a source array and resizes it to a new width and height. The method I'm using is, I think, inefficient. I heard there's a better way to do it. Anyway, the code below works when scale is an int. However, there's a second function called half, which uses resize to shrink an image to half its size. So I made scale a double, and used a typecast to convert it back to an int. This method is not working, and I don't know what the error is (the teacher uses his own grading and tests on these functions, and it's not passing them). Can you spot the error, or is there a more efficient way to make a resize function?
public static int[][] resize(int[][] source, int newWidth, int newHeight) {
int[][] newImage=new int[newWidth][newHeight];
double scale=newWidth/(source.length);
for(int i=0;i<newWidth/scale;i++)
for(int j=0;j<newHeight/scale;j++)
for (int s1=0;s1<scale;s1++)
for (int s2=0;s2<scale;s2++)
newImage[(int)(i*scale+s1)][(int)(j*scale+s2)] =source[i][j];
return newImage;
}
/**
* Half the size of the image. This method should be just one line! Just
* delegate the work to resize()!
*/
public static int[][] half(int[][] source) {
int[][] newImage=new int[source.length/2][source[0].length/2];
newImage=resize(source,source.length/2,source[0].length/2);
return newImage;
}

So one scheme for changing the size of an image is to resample it (technically this is really the only way, every variation is really just a different kind of resampling function).
Cutting an image in half is super easy, you want to read every other pixel in each direction, and then load that pixel into the new half sized array. The hard part is making sure your bookkeeping is strong.
static int[][] halfImage(int[][] orig){
int[][] hi = new int[orig.length/2][orig[0].length/2];
for(int r = 0, newr = 0; r < orig.length; r += 2, newr++){
for(int c = 0, newc = 0; c < orig[0].length; c += 2, newc++){
hi[newr][newc] = orig[r][c];
}
}
return hi;
}
In the code above I'm indexing into the original image reading every other pixel in every other row starting at the 0th row and 0th column (assuming images are row major, here). Thus, r tells us which row in the original image we're looking at, and c tells us which column in the original image we're looking at. orig[r][c] gives us the "current" pixel.
Similarly, newr and newc index into the "half-image" matrix designated hi. For each increment in newr or newc we increment r and c by 2, respectively. By doing this, we skip every other pixel as we iterate through the image.
Writing a generalized resize routine that doesn't operate on nice fractional quantities (like 1/2, 1/4, 1/8, etc.) is really pretty hard. You'd need to define a way to determine the value of a sub-pixel -- a point between pixels -- for more complicated factors, like 0.13243, for example. This is, of course, easy to do, and you can develop a very naive linear interpolation principle, where when you need the value between two pixels you simply take the surrounding pixels, construct a line between their values, then read the sub-pixel point from the line. More complicated versions of interpolation might be a sinc based interpolation...or one of many others in widely published literature.
Blowing up the size of the image involves something a little different than we've done here (and if you do in fact have to write a generalized resize function you might consider splitting your function to handle upscaling and downscaling differently). You need to somehow create more values than you have originally -- those interpolation functions work for that too. A trivial method might simply be to repeat a value between points until you have enough, and slight variations on this as well, where you might take so many values from the left and so many from the right for a particular position.
What I'd encourage you to think about -- and since this is homework I'll stay away from the implementation -- is treating the scaling factor as something that causes you to make observations on one image, and writes on the new image. When the scaling factor is less than 1 you generally sample from the original image to populate the new image and ignore some of the original image's pixels. When the scaling factor is greater than 1, you generally write more often to the new image and might need to read the same value several times from the old image. (I'm doing a poor job highlighting the difference here, hopefully you see the dualism I'm getting at.)

What you have is pretty understandable, and I think it IS an O(n^4) algorithm. Ouchies!
You can improve it slightly by pushing the i*scale and j*scale out of the inner two loops - they are invariant where they are now. The optimizer might be doing it for you, however. There are also some other similar optimizations.
Regarding the error, run it twice, once with an input array that's got an even length (6x6) and another that's odd (7x7). And 6x7 and 7x6 while you're at it.

Based on your other question, it seems like you may be having trouble with mixing of types - with numeric conversions. One way to do this, which can make your code more debuggable and more readable to others not familiar with the problem space, would be to split the problematic line into multiple lines. Each minor operation would be one line, until you reach the final value. For example,
newImage[(int)(i*scale+s1)][(int)(j*scale+s2)] =source[i][j];
would become
double x = i * scale;
x += s1;
double y = j * scale;
y += s2;
newImage[(int) x][(int) y] = source[i][j];
Now, you can run the code in a debugger and look at the values of each item after each operation is performed. When a value doesn't match what you think it should be, look at it and figure out why.
Now, back to the suspected problem: I expect that you need to use doubles somewhere, not ints - in your other question you talked about scaling factors. Is the factor less than 1? If so, when it's converted to an int, it'll be 0, and you'll get the wrong result.
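One way to sidestep the int/double scale problem entirely is to iterate over the destination pixels and compute each source index with integer arithmetic. The sketch below uses nearest-neighbour sampling, which is a technique swapped in here for illustration and not necessarily what the assignment expects; it keeps the question's convention that the first index is the width.

```java
final class NearestNeighbour {
    /** Resizes by picking, for each destination pixel, the nearest source pixel at or before it. */
    static int[][] resize(int[][] source, int newWidth, int newHeight) {
        int oldWidth = source.length;
        int oldHeight = source[0].length;
        int[][] out = new int[newWidth][newHeight];
        for (int x = 0; x < newWidth; x++) {
            // Multiply before dividing so nothing truncates to 0 when shrinking;
            // the result is always in [0, oldWidth).
            int sx = x * oldWidth / newWidth;
            for (int y = 0; y < newHeight; y++) {
                int sy = y * oldHeight / newHeight;
                out[x][y] = source[sx][sy];
            }
        }
        return out;
    }

    /** Halves the image by delegating to resize. */
    static int[][] half(int[][] source) {
        return resize(source, source.length / 2, source[0].length / 2);
    }
}
```

This is two nested loops over the output rather than four, and it behaves the same way for odd and even input sizes.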

Static Typing and Writing a Simple Matrix Library

Aye it's been done a million times before, but damnit I want to do it again. I'm writing a simple Matrix Library for C++ with the intention of doing it right. I've come across something that's fairly obvious in mathematics, but not so obvious to a strongly typed system -- the fact that a 1x1 matrix is just a number. To avoid this, I started walking down the hairy path of matrices as a composition of vectors, but also stumbled upon the fact that two vectors multiplied together could either be a number or a dyad, depending on the orientation of the two.
My question is, what is the right way to deal with this situation in a strongly typed language like C++ or Java?

"something that's fairly obvious in mathematics, but not so obvious to a strongly typed system -- the fact that a 1x1 matrix is just a number."
That's arguable. A hardcore mathematician (I'm not one) would probably argue against it; he would say that a 1x1 matrix can be regarded as isomorphic (or something like that) to a scalar, but that they are conceptually different things. Only in some informal sense is "a 1x1 matrix a scalar" (similar to, though stronger than, the sense in which a complex number without an imaginary part "is a real").
I don't think that correspondence should be reflected in a strongly typed language. And I don't think it is, in typical implementations (of complex or matrix), e.g. Java Apache Commons Math. For example, a Complex with zero imaginary part is not a Number (from the type POV - they cannot be cast one into another).
In the case of matrices, the correspondence is even more disputable. Should we be able to multiply two matrices of sizes (4x3) x (1x1) ? If we regard the second as a scalar, it's valid, but not as a matrix, since it violates the restriction on matrix dimensions for multiplication. And I believe Commons sticks to that.
In a weakly typed language (eg Matlab) it would be another story.

If you aren't worried about SIMD optimisations and the like then I would have thought the best way would be to set up a templated tensor. Choose your maximum tensor dimensions and then you can do things like this:
typedef Tensor3D< float, 4, 1, 1 > Vector4;
And so forth. The mathematics, if implemented correctly, will just work with all forms of "matrix" and "vector". Both are, after all, just special cases of tensors.
Edit: knowing the size of a template is actually pretty easy. Add in a GetRows() etc function and you can return the value you pass into the template at instantiation.
ie
template< typename T, int rows, int cols > class Tensor2D
{
public:
int GetRows() { return rows; }
int GetCols() { return cols; }
};

My advice? Don't worry about the 1x1 case and sleep at night. You shouldn't be worried about any users suddenly deciding to use your library to model a bunch of numbers as 1x1 matrices and complaining about your implementation.
No one who solves these problems will be so foolish. If you're smart enough to use matrices, you're smart enough to use them properly.
As for all the permutations that scalars introduce, I'd say that you must account for them. As a matrix library user, I'd expect to be able to multiply two matrices together to get another matrix, multiply a matrix by a (column or row) vector to get a vector result, and multiply a scalar by a matrix to get another matrix.
If I multiply two vectors I can get a scalar (inner product) or a matrix (outer product). Your library had better give them to me.
It's not trivial. It's been done "right" by others, but kudos to working it through for yourself.
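One common way to give the type system what it needs is to expose the two vector products as separate, explicitly named operations with different return types, so that orientation never has to be inferred. A minimal Java sketch follows; the Vec class and its method names are invented here and not taken from any particular library.

```java
final class Vec {
    final double[] v;

    Vec(double... v) {
        this.v = v.clone();
    }

    /** Inner product (row vector times column vector): the result is a scalar. */
    double dot(Vec other) {
        if (other.v.length != v.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0;
        for (int i = 0; i < v.length; i++) {
            sum += v[i] * other.v[i];
        }
        return sum;
    }

    /** Outer product (column vector times row vector): the result is a matrix, i.e. a dyad. */
    double[][] outer(Vec other) {
        double[][] m = new double[v.length][other.v.length];
        for (int i = 0; i < v.length; i++) {
            for (int j = 0; j < other.v.length; j++) {
                m[i][j] = v[i] * other.v[j];
            }
        }
        return m;
    }
}
```

The caller states the intent through the method name, and the compiler then knows statically whether a scalar or a matrix comes back, which avoids having to treat a 1x1 matrix as a number.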
