OpenCL Shared Memory Among Tasks

OpenCL Shared Memory Among Tasks - java

I've been working to create a GPU based conway's game of life program. If you're not familiar with it, here is the Wikipedia Page. I created one version that works by keeping an array of values where 0 represents a dead cell, and 1 a live one. The kernel then simply writes to an image buffer data array to draw an image based on the cell data and then checks each cell's neighbors to update the cell array for the next execution to render.
However, a faster method instead represents the value of a cell as a negative number if dead and a positive number if alive. The number of that cell represents the amount of neighbors it has plus one (making zero an impossible value since we cannot differentiate 0 from -0). However this means that when spawning or killing a cell we must update it's eight neighbor's values accordingly. Thus unlike the working procedure, which only has to read from the neighboring memory slots, this procedure must write to those slots. Doing so is inconsistent and the outputted array is not valid. For example cells contain numbers such as 14 which indicates 13 neighbors, an impossible value. The code is correct as I wrote the same procedure on the cpu and it works as expected. After testing, I believe that when tasks try to write to the memory at the same time there is a delay that leads to a writing error of some kind. For example, perhaps there is a delay between reading the array data and setting in which time the data is changed making another task's procedure incorrect. I've tried using semaphors and barriers, but have just learned OpenCL and parallel processing and don't quite grasp them completely yet. The kernel is as follows.
int wrap(int val, int limit){
int response = val;
if(response<0){response+=limit;}
if(response>=limit){response-=limit;}
return response;
}
__kernel void optimizedModel(
__global uint *output,
int sizeX, int sizeY,
__global uint *colorMap,
__global uint *newCellMap,
__global uint *historyBuffer
)
{
// the x and y coordinates that currently being computed
unsigned int x = get_global_id(0);
unsigned int y = get_global_id(1);
int cellValue = historyBuffer[sizeX*y+x];
int neighborCount = abs(cellValue)-1;
output[y*sizeX+x] = colorMap[cellValue > 0 ? 1 : 0];
if(cellValue > 0){// if alive
if(neighborCount < 2 || neighborCount > 3){
// kill
for(int i=-1; i<2; i++){
for(int j=-1; j<2; j++){
if(i!=0 || j!=0){
int wxc = wrap(x+i, sizeX);
int wyc = wrap(y+j, sizeY);
newCellMap[sizeX*wyc+wxc] -= newCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
}
}
}
newCellMap[sizeX*y+x] *= -1;
// end kill
}
}else{
if(neighborCount==3){
// spawn
for(int i=-1; i<2; i++){
for(int j=-1; j<2; j++){
if(i!=0 || j!=0){
int wxc = wrap(x+i, sizeX);
int wyc = wrap(y+j, sizeY);
newCellMap[sizeX*wyc+wxc] += newCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
}
}
}
newCellMap[sizeX*y+x] *= -1;
// end spawn
}
}
}
The array output is the image buffer data used to render the
kernel's computation.
The sizeX and sizeY constants are the width and height of the image buffer respectively.
The colorMap array contains the rgb integer values for black and white respectively which are used to change the image buffer's values properly to render colors.
The newCellMap array is the updated cell map being calculated once rendering is determined.
The historyBuffer is the old state of the cells at the beginning of the kernel call. Every time the kernel is executed, this array is updated to the newCellMap array.
Additionally the wrap function makes the space toroidal. How could I fix this code such that it works as expected. And why doesn't the global memory update with each change by a task? Isn't it supposed to be shared memory?

As sharpneli said in his answer, you are reading and writing same memory zones from different threads and that gives an undefined behaviour.
Solution:
You need to split your newCellMap in 2 arrays, one for the previous execution and one where the new value will be stored. Then, you need to change the kernel arguments from the host side in each call, so that the oldvalues of the next iteration are the newvalues of the previous iteration. Due to how you structurize your algorithm, you will also need to perform a copybuffer of oldvalues to newvalues before you run it.
__kernel void optimizedModel(
__global uint *output,
int sizeX, int sizeY,
__global uint *colorMap,
__global uint *oldCellMap,
__global uint *newCellMap,
__global uint *historyBuffer
)
{
// the x and y coordinates that currently being computed
unsigned int x = get_global_id(0);
unsigned int y = get_global_id(1);
int cellValue = historyBuffer[sizeX*y+x];
int neighborCount = abs(cellValue)-1;
output[y*sizeX+x] = colorMap[cellValue > 0 ? 1 : 0];
if(cellValue > 0){// if alive
if(neighborCount < 2 || neighborCount > 3){
// kill
for(int i=-1; i<2; i++){
for(int j=-1; j<2; j++){
if(i!=0 || j!=0){
int wxc = wrap(x+i, sizeX);
int wyc = wrap(y+j, sizeY);
newCellMap[sizeX*wyc+wxc] -= oldCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
}
}
}
newCellMap[sizeX*y+x] *= -1;
// end kill
}
}else{
if(neighborCount==3){
// spawn
for(int i=-1; i<2; i++){
for(int j=-1; j<2; j++){
if(i!=0 || j!=0){
int wxc = wrap(x+i, sizeX);
int wyc = wrap(y+j, sizeY);
newCellMap[sizeX*wyc+wxc] += oldCellMap[sizeX*wyc+wxc] > 0 ? 1 : -1;
}
}
}
newCellMap[sizeX*y+x] *= -1;
// end spawn
}
}
}
Regarding your question about shared memory has a simple answer. OpenCL does not have shared memory across HOST-DEVICE.
When you create a memory buffer for the device, you first have to init that memory zone with clEnqueueWriteBuffer() and read it with clEnqueueWriteBuffer() to get the results. Even if you do have a pointer to the memory zone, your pointer is a pointer to the host side copy of that zone. Which is likely not to have the last version of device computed output.
PD: I created long time ago a "Live" game on OpenCL, I found that the easyer and faster way to do it is simply to create a big 2D array of bits (bit addressing). And then write a piece of code without any branches that simply analize the neibours and gets the updated value for that cell. Since bit addressing is used, the amount of memory read/write by each thread is considerably lower that addressing chars/ints/other. I achieved 33Mcells/sec in a very old OpenCL HW (nVIDIA 9100M G). Just to let you know that your if/else approach is probably not the most efficient one.

Just as a reference, I let you here my implementation of the game of life (OpenCL kernel):
//Each work-item processess one 4x2 block of cells, but needs to access to the (3x3)x(4x2) block of cells surrounding it
// . . . . . .
// . * * * * .
// . * * * * .
// . . . . . .
__kernel void life (__global unsigned char * input, __global unsigned char * output){
int x_length = get_global_size(0);
int x_id = get_global_id(0);
int y_length = get_global_size(1);
int y_id = get_global_id(1);
//int lx_length = get_local_size(0);
//int ly_length = get_local_size(1);
int x_n = (x_length+x_id-1)%x_length; //Negative X
int x_p = (x_length+x_id+1)%x_length; //Positive X
int y_n = (y_length+y_id-1)%y_length; //Negative Y
int y_p = (y_length+y_id+1)%y_length; //Positive X
//Get the data of the surrounding blocks (TODO: Make this shared across the local group)
unsigned char block[3][3];
block[0][0] = input[x_n + y_n*x_length];
block[1][0] = input[x_id + y_n*x_length];
block[2][0] = input[x_p + y_n*x_length];
block[0][1] = input[x_n + y_id*x_length];
block[1][1] = input[x_id + y_id*x_length];
block[2][1] = input[x_p + y_id*x_length];
block[0][2] = input[x_n + y_p*x_length];
block[1][2] = input[x_id + y_p*x_length];
block[2][2] = input[x_p + y_p*x_length];
//Expand the block to points (bool array)
bool point[6][4];
point[0][0] = (bool)(block[0][0] & 1);
point[1][0] = (bool)(block[1][0] & 8);
point[2][0] = (bool)(block[1][0] & 4);
point[3][0] = (bool)(block[1][0] & 2);
point[4][0] = (bool)(block[1][0] & 1);
point[5][0] = (bool)(block[2][0] & 8);
point[0][1] = (bool)(block[0][1] & 16);
point[1][1] = (bool)(block[1][1] & 128);
point[2][1] = (bool)(block[1][1] & 64);
point[3][1] = (bool)(block[1][1] & 32);
point[4][1] = (bool)(block[1][1] & 16);
point[5][1] = (bool)(block[2][1] & 128);
point[0][2] = (bool)(block[0][1] & 1);
point[1][2] = (bool)(block[1][1] & 8);
point[2][2] = (bool)(block[1][1] & 4);
point[3][2] = (bool)(block[1][1] & 2);
point[4][2] = (bool)(block[1][1] & 1);
point[5][2] = (bool)(block[2][1] & 8);
point[0][3] = (bool)(block[0][2] & 16);
point[1][3] = (bool)(block[1][2] & 128);
point[2][3] = (bool)(block[1][2] & 64);
point[3][3] = (bool)(block[1][2] & 32);
point[4][3] = (bool)(block[1][2] & 16);
point[5][3] = (bool)(block[2][2] & 128);
//Process one point of the game of life!
unsigned char out = (unsigned char)0;
for(int j=0; j<2; j++){
for(int i=0; i<4; i++){
char num = point[i][j] + point[i+1][j] + point[i+2][j] + point[i][j+1] + point[i+2][j+1] + point[i][j+2] + point[i+1][j+2] + point[i+2][j+2];
if(num == 3 || num == 2 && point[i+1][j+1] ){
out |= (128>>(i+4*j));
}
}
}
output[x_id + y_id*x_length] = out; //Assign to the output the new cells value
};
Here you don't save any intermediate states, just the cell status at the end (live/death). It does not have branches, so it is quite fast in the process.

Related

How to save a large array of 5 bit numbers in Java?

How to efficiently save and access a large array of 5 bit numbers in memory?
For example
01100
01101
01110
01111
10000
10001
which I will later convert to a byte to check what number it is?
I was thinking of just using an array of bytes but after a while this will be wasting a lot of memory as this will be a continually growing array. Also I will want to save this array efficiently. I will only be using exactly 5 bits.

This is the code that I use for a bit array implementation in C, in JAVA it's going to be the same, I must reconsider what I said about the list, maybe an array is going to be better.
Anyway, you consider the array as a contiguous segments of bits. Those functions set, get, and read the k-th bit of the array. In this case I'm using an array of integers, so you see '32', is you use an array of bytes, then you'd use '8'.
void set_bit(int a[], int k)
{
int i = k / 32;
int pos = k % 32;
unsigned int flag = 1; // flag = 0000....00001
flag = flag << pos; // flag = 0000...00100..0000
a[i] = a[i] | flag; // set the bit at the k-th position in a[i]
}
void clear_bit(int a[], int k)
{
int i = k / 32;
int pos = k % 32;
unsigned int flag = 1; // flag = 0000....00001
flag = flag << pos; // flag = 0000...00100..0000
flag = ~flag;
a[i] = a[i] & flag; // set the bit at the k-th position in a[i]
}
int test_bit(int a[], int k)
{
int i = k / 32;
int pos = k % 32;
unsigned int flag = 1; // flag = 0000....00001
flag = flag << pos; // flag = 0000...00100..0000
if (a[i] & flag) // test the k-th bit of a to be 1
return 1;
else
return 0;
}
I don't know how you store the five bits number, you'll have to insert them bit by bit, and also keep track of the last empty position in the bit array.

"I was thinking of just using an array of bytes but after a while this will be wasting a lot of memory as this will be a continually growing array."
I've dealt with a similar problem and decided to write a file based BitInputStream and a BitOutputSteam. Therefore running out of memory was no longer an issue. Please note that the given links are not my work but good examples of how to write a bit input/output stream.

I wrote an implementation of a 5-bit byte vector on top of an 8-bit byte vector in Javascript some time ago that might be of some help.
const ByteVector = require('bytevector');
class FiveBuffer {
constructor(buffer = [0], bitsAvailable = 8) {
this.buf = new ByteVector(buffer);
this.bitsAvailable = bitsAvailable;
this.size = Math.floor(((this.byteSize() * 8) - this.bitsAvailable) / 5);
}
push(num) {
if (num > 31 || num < 0)
throw new Error(`Only 5-bit unsigned integers (${num} not among them) are accepted`);
var firstShift = 5 - this.bitsAvailable;
var secondShift = this.bitsAvailable + 3;
var firstShifted = shiftRight(num, firstShift);
var backIdx = this.buf.length - 1;
var back = this.buf.get(backIdx);
this.buf.set(backIdx, back | firstShifted);
if (secondShift < 8) {
var secondShifted = num << secondShift;
this.buf.push(secondShifted);
}
this.bitsAvailable = secondShift % 8;
this.size++;
}
get(idx) {
if (idx > this.size)
throw new Error(`Index ${idx} is out of bounds for FiveBuffer of size ${this.size}`);
var bitIdx = idx * 5;
var byteIdx = Math.floor(bitIdx / 8);
var byte = this.buf.get(byteIdx);
var bit = bitIdx % 8;
var firstShift = 3 - bit;
var firstShifted = shiftRightDestroy(byte, firstShift);
var final = firstShifted;
var secondShift = 11 - bit;
if (secondShift < 8) {
var secondShifted = this.buf.get(byteIdx + 1) >> secondShift;
final = final | secondShifted;
}
return final;
}
buffer() {
this.buf.shrink_to_fit();
return this.buf.buffer();
}
debug() {
var arr = [];
this.buffer().forEach(x => arr.push(x.toString(2)));
console.log(arr);
}
byteSize() {
return this.buf.size();
}
}
function shiftRightDestroy(num, bits) {
var left = 3 - bits;
var res = (left > 0) ? ((num << left) % 256) >> left : num;
return shiftRight(res, bits);
}
function shiftRight(num, bits) {
return (bits < 0) ?
num << -bits :
num >> bits;
}
module.exports = FiveBuffer;

Optimizing and finding the computation of this inequality

Let's assume I have a set of integers, out of which I want to find out all the maximum number of integers which satisfy a particular inequality. For sake of explanation,
r1, r2, r3, ... rn when ri is a positive integer. I want the to find the maximum z which would range from 1 to n for which ri <= 0.5 * (r1 + r2 + r3 + ... + rn) for all i from 3 to z. How to approach such problems? I have approaches the naive method of finding all subsets of sizes from 1 to n and iterating through each of the subset to check whether each element satisfies the condition or not? Any other approach?

I feel kind of bad, especially for the false edit I have done to the question... Anyway:
First, the most naive, direct and straightforward way to solve this question; which would be to start off from the 1st number in the set, calculate the nth partial sum on each step, compare that partial sum with the double of each element starting from the 3rd element, up till the nth element. If the comparison holds for each element, mark the current last as the maximum z.
The following macro largestzfinder, dependant on the function largestzfinderfunc does that:
// indices are zero based
#define largestzfinder(_x_) largestzfinderfunc((_x_), sizeof(_x_) / sizeof(*_x_))
unsigned int largestzfinderfunc( unsigned int set[], size_t size ) {
unsigned int largestz = 0;
unsigned int partialsumsofar = 0;
int disqualified;
for ( int i = 0; i < size; i++ ){
partialsumsofar += set[i];
disqualified = 0;
for ( int j = 2; j <= i; j++ ) { // for all j from 2 to i (inclusive)
if ( 2 * set[j] > partialsumsofar ) {
disqualified = 1;
break;
}
}
if ( !disqualified ) // if comparison held for all j
largestz = i;
}
return largestz;
}
With this method, we gradually reach to our largestz by starting off from the smallest z and finding the next bigger z until we reach the largest. We could simplify that process by starting from the end, in which case we wouldn't have to go through all that other zs except for the largest one, and we don't need the others either...
To do that, we would need to pre-calculate the whole sum, and then reduce the elements from the last one by one, make comparisons with shrinking partial sums the same way, and return an answer as soon as we find a non disqualified candidate. Following code does that:
// indices are zero based
#define largestzfinder(_x_) largestzfinderfunc((_x_), sizeof(_x_) / sizeof(*_x_))
unsigned int largestzfinderfunc( unsigned int set[], size_t size ) {
unsigned int largestz = 0;
unsigned int partialsumsofar = 0;
int disqualified;
for ( int i = 0; i < size; i++ )
partialsumsofar += set[i];
for ( int i = size - 1; i >= 0; i-- ){
disqualified = 0;
for ( int j = 2; j <= i; j++ ) { // for all j from 2 to i (inclusive)
if ( 2 * set[j] > partialsumsofar ) {
disqualified = 1;
break;
}
}
if ( !disqualified ) { // if comparison held for all j
largestz = i;
break;
}
partialsumsofar -= set[i]; // updates/reduces partialsumsofar
}
return largestz;
}
Now, you see, we check if the condition holds for every single element from 3rd to last, one by one... while we could just check it for the largest among them! If largestamongthem <= partialsumsofar / 2, then all of them will simply be less than or equal to the partial sum that far.
How would you determine the largestsofar? Well, things get complicated, especially when you do the process starting from the end. If we were doing it from the start, then we could just start off with largestsofar = 0; and compare each subsequent element with it, update our largestsofar.
One way to do it, is to make a new array of integers that has the same size with the set array, which will hold the largestsofar for each point. The following code uses that method:
// indices are zero based
#define largestzfinder(_x_) largestzfinderfunc((_x_), sizeof(_x_) / sizeof(*_x_))
unsigned int largestzfinderfunc( unsigned int set[], size_t size ) {
unsigned int largestz = 0;
unsigned int partialsumsofar = 0;
unsigned int * largestsofar = calloc( size, sizeof * largestsofar );
for ( int i = 0; i < size; i++ )
partialsumsofar += set[i];
largestsofar[0] = 0;
largestsofar[1] = 0;
largestsofar[2] = set[2];
for ( int i = 3; i < size; i++ )
largestsofar[i] = (set[i] > largestsofar[i-1]) ? set[i] : largestsofar[i-1];
for ( int i = size - 1; i >= 0; i-- ){
if ( 2 * largestsofar[i] <= partialsumsofar ) {
largestz = i;
break;
}
partialsumsofar -= set[i];
}
free( largestsofar );
return largestz;
}
Well, if you aren't really happy with allocating memory, keeping a list of largest number so far in it, I'm with you. The following recursive function does the thing, without keeping a list. It also looks far shorter. The downside is that it is less reader-friendly. Here:
// indices are zero based
#define largestzfinder(_x_) largestzfinderfunc((_x_), sizeof(_x_) / sizeof(*_x_), (_x_)[0] + (_x_)[1], 1, 0)
unsigned int largestzfinderfunc( unsigned int set[], size_t size, unsigned int partialsumsofar, unsigned int i, unsigned int largestsofar ){
unsigned int j;
if ( i < size - 1 && ( j = largestzfinderfunc( set, size, partialsumsofar + set[i + 1], i + 1, ( set[i + 1] > largestsofar ) ? set[i + 1] : largestsofar ) ) ){
return j;
}
else if ( 2 * largestsofar <= partialsumsofar ) {
return i;
}
else
return 0;
}
Let me explain what it does briefly: By a call to the macro, it first passes r_0 + r_1 as partialsumsofar, 1 as the current z to be checked, and 0 as the largest number up till the r_1 (this is because r_0 and r_1 are not to be considered during comparison).
It doesn't check right away, however; provided that this was not the last element (i < size - 1), it will call for the function one level beyond, with partialsumsofar updated, current z to be checked incremented, and largest number to be updated as well. This will be done until the last element is reached.
After that, the recursion tree will collapse. If the last one fails to satisfy the condition, it will return a 0, causing the parent recursion branch to check if he/she satisfies the condition.
As soon as one parent satisfies the condition, it will return its current z, causing each parent to return the same z, since they all store the returned value under j and return it, if j is not zero.
In essence, it actually does allocate memory for each largestsumsofar and partialsumsofar and all, but it frees all those one by one as the recursion tree/line/whatever collapses.
This wasn't so brief, but whatever, I hope I've done it right this time. I want to get over this question now...

Reduce treatment time of the FFT

I'm currently working on Java for Android. I try to implement the FFT in order to realize a kind of viewer of the frequencies.
Actually I was able to do it, but the display is not fluid at all.
I added some traces in order to check the treatment time of each part of my code, and the fact is that the FFT takes about 300ms to be applied on my complex array, that owns 4096 elements. And I need it to take less than 100ms, as my thread (that displays the frequencies) is refreshed every 100ms. I reduced the initial array in order that the FFT results own only 1028 elements, and it works, but the result is deprecated.
Does someone have an idea ?
I used the default fft.java and Complex.java classes that can be found on the internet.
For information, my code computing the FFT is the following :
int bytesPerSample = 2;
Complex[] x = new Complex[bufferSize/2] ;
for (int index = 0 ; index < bufferReadResult - bytesPerSample + 1; index += bytesPerSample)
{
// 16BITS = 2BYTES
float asFloat = Float.intBitsToFloat(asInt);
double sample = 0;
for (int b = 0; b < bytesPerSample; b++) {
int v = buffer[index + b];
if (b < bytesPerSample - 1 || bytesPerSample == 1) {
v &= 0xFF;
}
sample += v << (b * 8);
}
double sample32 = 100 * (sample / 32768.0); // don't know the use of this compute...
x[index/bytesPerSample] = new Complex(sample32, 0);
}
Complex[] tx = new Complex[1024]; // size = 2048
///// reduction of the size of the signal in order to improve the fft traitment time
for (int i = 0; i < x.length/4; i++)
{
tx[i] = new Complex(x[i*4].re(), 0);
}
// Signal retrieval thanks to the FFT
fftRes = FFT.fft(tx);

I don't know Java, but you're way of converting between your input data and an array of complex values seems very convoluted. You're building two arrays of complex data where only one is necessary.
Also it smells like your complex real and imaginary values are doubles. That's way over the top for what you need, and ARMs are veeeery slow at double arithmetic anyway. Is there a complex class based on single precision floats?
Thirdly you're performing a complex fft on real data by filling the imaginary part of your complexes with zero. Whilst the result will be correct it is twice as much work straight off (unless the routine is clever enough to spot that, which I doubt). If possible perform a real fft on your data and save half your time.
And then as Simon says there's the whole issue of avoiding garbage collection and memory allocation.
Also it looks like your FFT has no preparatory step. This mean that the routine FFT.fft() is calculating the complex exponentials every time. The longest part of the FFT calculation is working out the complex exponentials, which is a shame because for any given FFT length the exponentials are constants. They don't depend on your input data at all. In the real time world we use FFT routines where we calculate the exponentials once at the start of the program and then the actual fft itself takes that const array as one of its inputs. Don't know if your FFT class can do something similar.
If you do end up going to something like FFTW then you're going to have to get used to calling C code from your Java. Also make sure you get a version that supports (I think) NEON, ARM's answer to SSE, AVX and Altivec. It's worth ploughing through their release notes to check. Also I strongly suspect that FFTW will only be able to offer a significant speed up if you ask it to perform an FFT on single precision floats, not doubles.
Google luck!
--Edit--
I meant of course 'good luck'. Give me a real keyboard quick, these touchscreen ones are unreliable...

First, thanks for all your answers.
I followed them and made two test :
first one, I replace the double used in my Complex class by float. The result is just a bit better, but not enough.
then I've rewroten the fft method in order not to use Complex anymore, but a two-dimensional float array instead. For each row of this array, the first column contains the real part, and the second one the imaginary part.
I also changed my code in order to instanciate the float array only once, on the onCreate method.
And the result... is worst !! Now it takes a little bit more than 500ms instead of 300ms.
I don't know what to do now.
You can find below the initial fft fonction, and then the one I've re-wroten.
Thanks for your help.
// compute the FFT of x[], assuming its length is a power of 2
public static Complex[] fft(Complex[] x) {
int N = x.length;
// base case
if (N == 1) return new Complex[] { x[0] };
// radix 2 Cooley-Tukey FFT
if (N % 2 != 0) { throw new RuntimeException("N is not a power of 2 : " + N); }
// fft of even terms
Complex[] even = new Complex[N/2];
for (int k = 0; k < N/2; k++) {
even[k] = x[2*k];
}
Complex[] q = fft(even);
// fft of odd terms
Complex[] odd = even; // reuse the array
for (int k = 0; k < N/2; k++) {
odd[k] = x[2*k + 1];
}
Complex[] r = fft(odd);
// combine
Complex[] y = new Complex[N];
for (int k = 0; k < N/2; k++) {
double kth = -2 * k * Math.PI / N;
Complex wk = new Complex(Math.cos(kth), Math.sin(kth));
y[k] = q[k].plus(wk.times(r[k]));
y[k + N/2] = q[k].minus(wk.times(r[k]));
}
return y;
}
public static float[][] fftf(float[][] x) {
/**
* x[][0] = real part
* x[][1] = imaginary part
*/
int N = x.length;
// base case
if (N == 1) return new float[][] { x[0] };
// radix 2 Cooley-Tukey FFT
if (N % 2 != 0) { throw new RuntimeException("N is not a power of 2 : " + N); }
// fft of even terms
float[][] even = new float[N/2][2];
for (int k = 0; k < N/2; k++) {
even[k] = x[2*k];
}
float[][] q = fftf(even);
// fft of odd terms
float[][] odd = even; // reuse the array
for (int k = 0; k < N/2; k++) {
odd[k] = x[2*k + 1];
}
float[][] r = fftf(odd);
// combine
float[][] y = new float[N][2];
double kth, wkcos, wksin ;
for (int k = 0; k < N/2; k++) {
kth = -2 * k * Math.PI / N;
//Complex wk = new Complex(Math.cos(kth), Math.sin(kth));
wkcos = Math.cos(kth) ; // real part
wksin = Math.sin(kth) ; // imaginary part
// y[k] = q[k].plus(wk.times(r[k]));
y[k][0] = (float) (q[k][0] + wkcos * r[k][0] - wksin * r[k][1]);
y[k][1] = (float) (q[k][1] + wkcos * r[k][1] + wksin * r[k][0]);
// y[k + N/2] = q[k].minus(wk.times(r[k]));
y[k + N/2][0] = (float) (q[k][0] - (wkcos * r[k][0] - wksin * r[k][1]));
y[k + N/2][1] = (float) (q[k][1] - (wkcos * r[k][1] + wksin * r[k][0]));
}
return y;
}

actually I think I don't understand everything.
First, about Math.cos and Math.sin : how do you want me not to compute it each time ? Do you mean that I should instanciate the whole values only once (e.g store it in an array) and use them for each compute ?
Second, about the N % 2, indeed it's not very useful, I could make the test before the call of the function.
Third, about Simon's advice : I mixed what he said and what you said, that's why I've replaced the Complex by a two-dimensional float[][]. If that was not what he suggested, then what was it ?
At least, I'm not a FFT expert, so what do you mean by making a "real FFT" ? Do you mean that my imaginary part is useless ? If so, I'm not sure, because later in my code, I compute the magnitude of each frequence, so sqrt(real[i]*real[i] + imag[i]*imag[i]). And I think that my imaginary part is not equal to zero...
thanks !

Change the alpha value of a hex integer on the fly

I've got this loop that is run thousands of times (so needs to be efficient). It changes the value of the bitmap pixel.
I want to be able to run thought the loop and "switch" a certain group of pixels to alpha and then switch them back at a later point.
My question is.
How do I switch the values? So say 0xFFCC1BE0 becomes 0x00CC1BE0 and then if I want to switch back to 0xFFCC1BE0 I simply take the 00 and turn it too FF.
I can't make two bitmaps as I run out of memory :-(
Anyhow here's what I've got so far:
private void setTransparencyOnLightMap(float WidthPercentage, float LeftPosition)
{
int blankPixel = 0x00000000;
int savedPixel = 0x00000000;
int desiredAlpha = 200; //Can also be 0x00
//Find away of turning alhpa off and on.
for(int BMx = 0; BMx < mLightMap.getWidth(); BMx++)
{
for(int BMy = 0; BMy < mLightMap.getHeight(); BMy++)
{
if(mLightMap.getPixel(BMx, BMy) != blankPixel) //Make sure don't overwrite blank transparent pixels.
{
savedPixel = mLightMap.getPixel(BMx,BMy);
savedPixel = savedPixel | (desiredAlpha << 24);
mLightMap.setPixel(BMx, BMy, savedPixel);
}
}
}
}

You could switch the alpha of a pixel like so:
savedPixel = savedPixel & 0x00FFFFFF;
savedPixel = savedPixel | (desiredAlpha << 24);
The first line zeros out the 8 most significant bits of savedPixel (these are the bits where the alpha is held). The second line sets the 8 most significant bits of savedPixel to desiredAlpha. Note that desiredAlpha must be between 0 and 255 (These are the ints that can be stored in 8 bits).
Note that this uses bitwise operators, (&, |, <<) which are very efficient.

It seems to me, to reduce memory use, you can just save the original Alpha value of each pixel rather than the whole ARGB value - to do this use a byte array which will be 1/4 the size of the original bitmap. Also use a bit-mask for the new Alpha so you can use bitwise AND (&) as described by Tristan Hull...
byte[] savedAlphaArray = new byte[mLightMap.getWidth(), mLightMap.getHeight()];
int desiredAlphaMask = 0x00FFFFFF;
int pixel;
Then to save the Alpha values and apply the bit-mask do the following...
for (int i = 0; i < mLightMap.getWidth(); i++) {
for (int j = 0; j < mLightMap.getHeight(); j++) {
pixel = mLightMap.getPixel(i, j);
savedAlphaArray[i, j] = (pixel & 0xFF000000) >> 24;
mLightMap.setPixel(i, j, desiredAlphaMask & pixel);
}
}
To 'switch' back do the following...
for (int i = 0; i < mLightMap.getWidth(); i++) {
for (int j = 0; j < mLightMap.getHeight(); j++) {
pixel = mLightMap.getPixel(i, j);
mLightMap.setPixel(i, j, savedAlphaArray[i, j] << 24 & pixel);
}
}

Convolution Filter - Float Precision C Vs Java

I'm porting a library of image manipulation routines into C from Java and I'm getting some very small differences when I compare the results. Is it reasonable that these differences are in the different languages' handling of float values or do I still have work to do!
The routine is Convolution with a 3 x 3 kernel, it's operated on a bitmap represented by a linear array of pixels, a width and a depth. You need not understand this code exactly to answer my question, it's just here for reference.
Java code;
for (int x = 0; x < width; x++){
for (int y = 0; y < height; y++){
int offset = (y*width)+x;
if(x % (width-1) == 0 || y % (height-1) == 0){
input.setPixel(x, y, 0xFF000000); // Alpha channel only for border
} else {
float r = 0;
float g = 0;
float b = 0;
for(int kx = -1 ; kx <= 1; kx++ ){
for(int ky = -1 ; ky <= 1; ky++ ){
int pixel = pix[offset+(width*ky)+kx];
int t1 = Color.red(pixel);
int t2 = Color.green(pixel);
int t3 = Color.blue(pixel);
float m = kernel[((ky+1)*3)+kx+1];
r += Color.red(pixel) * m;
g += Color.green(pixel) * m;
b += Color.blue(pixel) * m;
}
}
input.setPixel(x, y, Color.rgb(clamp((int)r), clamp((int)g), clamp((int)b)));
}
}
}
return input;
Clamp restricts the bands' values to the range [0..255] and Color.red is equivalent to (pixel & 0x00FF0000) >> 16.
The C code goes like this;
for(x=1;x<width-1;x++){
for(y=1; y<height-1; y++){
offset = x + (y*width);
rAcc=0;
gAcc=0;
bAcc=0;
for(z=0;z<kernelLength;z++){
xk = x + xOffsets[z];
yk = y + yOffsets[z];
kOffset = xk + (yk * width);
rAcc += kernel[z] * ((b1[kOffset] & rMask)>>16);
gAcc += kernel[z] * ((b1[kOffset] & gMask)>>8);
bAcc += kernel[z] * (b1[kOffset] & bMask);
}
// Clamp values
rAcc = rAcc > 255 ? 255 : rAcc < 0 ? 0 : rAcc;
gAcc = gAcc > 255 ? 255 : gAcc < 0 ? 0 : gAcc;
bAcc = bAcc > 255 ? 255 : bAcc < 0 ? 0 : bAcc;
// Round the floats
r = (int)(rAcc + 0.5);
g = (int)(gAcc + 0.5);
b = (int)(bAcc + 0.5);
output[offset] = (a|r<<16|g<<8|b) ;
}
}
It's a little different xOffsets provides the xOffset for the kernel element for example.
The main point is that my results are out by at most one bit. The following are pixel values;
FF205448 expected
FF215449 returned
44 wrong
FF56977E expected
FF56977F returned
45 wrong
FF4A9A7D expected
FF4B9B7E returned
54 wrong
FF3F9478 expected
FF3F9578 returned
74 wrong
FF004A12 expected
FF004A13 returned
Do you believe this is a problem with my code or rather a difference in the language?
Kind regards,
Gav

After a quick look:
do you realize that (int)r will floor the r value instead of rounding it normally?
in the c code, you seem to use (int)(r + 0.5)

Further to Fortega's answer, try the roundf() function from the C math library.

Java's floating point behaviour is quite precise. What I expect to be happening here is that the value as being kept in registers with extended precision. IIRC, Java requires that the precision is rounded to that of the appropriate type. This is to try to make sure you always get the same result (full details in the JLS). C compilers will tend to leave any extra precision there, until the result in stored into main memory.

I would suggest you use double instead of float. Float is almost never the best choice.

This might be due to different default round in the two languages. I'm not saying they have (you need to read up to determine that), but it's an idea.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

OpenCL Shared Memory Among Tasks - java

Related

How to save a large array of 5 bit numbers in Java?

Optimizing and finding the computation of this inequality

Reduce treatment time of the FFT

Change the alpha value of a hex integer on the fly

Convolution Filter - Float Precision C Vs Java

Categories

Resources