Generate various data intensity on bus


I am trying to generate different levels of bus traffic in a multiprocessor environment. Basically I need two patterns: almost negligible bus traffic and very high bus traffic. Initially I was thinking of accessing only registers, without writing them back to cache or main memory, to generate low bus traffic, but I am not sure about the idea. Also, I am doing the coding in Java.
Any hints how to do this?
Architecture: x86_64
EDIT: I have the following code snippet.
mutex.lock();
try{
// Generate Bus traffic
}
finally{
mutex.unlock();
}
For each thread, I am trying to generate the traffic in the critical section.

Accessing registers generates zero traffic on the bus (unless you mean some CPU-internal bus). To generate maximum traffic on the CPU-memory bus, just read an array bigger than your biggest cache (typically L3, a few megabytes). Make sure the data you read actually gets used, so that dead code elimination (DCE) doesn't kick in.
long[] data = new long[1<<24]; // 16M longs * 8 bytes = 128 MB
volatile long blackhole;
void saturateBus() {
long sum = 0;
for (long x : data) sum += x;
blackhole = sum;
}
This should saturate your memory bus on a modern amd64 architecture, as the loop can execute in 1 cycle per element. With some unbelievably fast memory, you might need to unroll manually like this:
long sum0 = 0, sum1 = 0;
for (int i=0; i<data.length; i+=2) { // assuming even `data.length`
sum0 += data[i+0];
sum1 += data[i+1];
}
blackhole = sum0 + sum1;
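For the opposite pattern (almost no traffic on the memory bus), a minimal sketch is a loop that works only on local variables, which the JIT compiler will normally keep in registers. That is an assumption about typical HotSpot behaviour, not a guarantee, and the class and method names below are hypothetical. The high-traffic method is the same array scan as above, shown next to it for comparison.

```java
final class BusTrafficSketch {
    // Written once per call so the JIT cannot discard the loops as dead code.
    static volatile long blackhole;

    // Low traffic: an xorshift sequence computed purely on local variables.
    static long lowTraffic(long iterations) {
        long x = 88172645463325252L;
        for (long i = 0; i < iterations; i++) {
            x ^= x << 13;
            x ^= x >>> 7;
            x ^= x << 17;
        }
        blackhole = x;
        return x;
    }

    // High traffic: stream through an array that should be larger than the last-level cache.
    static long highTraffic(long[] data) {
        long sum = 0;
        for (long v : data) {
            sum += v;
        }
        blackhole = sum;
        return sum;
    }
}
```

Either method could be called inside the mutex.lock() / mutex.unlock() section from the question, one pattern per experiment.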

Related

System.out.print consumes too much memory when printing to console. Is it possible to reduce?

I have a simple program:
public class Test {
public static void main(String[] args) {
for (int i = 0; i < 1_000_000; i++) {
System.out.print(1);
}
}
}
And launched profiling. Here are the results:
I assume that memory grows because of this method calls:
public void print(int i) {
write(String.valueOf(i));
}
Is there a way to print int values in the console without memory drawdown?
On my local machine I tried adding if (i % 10000 == 0) System.gc(); to the loop, and memory consumption evened out. But the system that checks the solution still does not accept it. I tried changing the step value, but it still fails either the memory limit (it must use less than 20 MB) or the time limit (< 1 sec).
EDIT: I tried this:
String str = "hz";
for (int i = 0; i < 1_000_0000; i++) {
System.out.print(str);
}
But same result:
EDIT 2: If I write this code:
public class Test {
public static void main(String[] args) {
byte[] bytes = "hz".getBytes();
for (int i = 0; i < 1_000_0000; i++) {
System.out.write(bytes, 0, bytes.length);
}
}
}
I have this
Therefore, I do not believe this is just background noise from the JVM itself. That would show up in both cases.

You need to convert the int into characters without generating a new String each time you do it. This could be done in a couple of ways:
Write a custom "int to characters" method that converts to ASCII bytes in a byte[] (see @AndyTurner's example code). Then write the byte[]. And repeat.
Use ByteBuffer, fill it directly using a custom "int to characters" converter method, and use a Channel to output the bytes when the buffer is full. And repeat.
If done correctly, you should be able to output the numbers without generating any garbage ... other than your once-off buffers.
Note that System.out is a PrintStream wrapping a BufferedOutputStream wrapping a FileOutputStream. And, when you output a String directly or indirectly using one of the print methods, that actually goes through a BufferedWriter that is internal to the PrintStream. It is complicated ... and apparently the print(String) method generates garbage somewhere in that complexity.
Concerning your EDIT 1: when you repeatedly print out a constant string, you are still apparently generating garbage. I was surprised by this, but I guess it is happening in the BufferedWriter.
Concerning your EDIT 2: when you repeatedly write from a byte[], the garbage generation all but disappears. This confirms that at least one of my suggestions will work.
However, since you are monitoring the JVM with an external profiler, your JVM is also running an agent that is periodically sending updates to your profiler. That agent will most likely be generating a small amount of garbage. And there could be other sources of garbage in the JVM; e.g. if you have JVM GC logging enabled.
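The ByteBuffer and Channel suggestion above could look roughly like the sketch below. The class name, the buffer capacity parameter and the choice to rethrow IOException unchecked are all assumptions for illustration. The digits are produced in the negative domain so that Integer.MIN_VALUE needs no special case.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

final class IntChannelWriter {
    private final byte[] scratch = new byte[11]; // "-2147483648" is 11 characters
    private final ByteBuffer buffer;
    private final WritableByteChannel channel;

    IntChannelWriter(WritableByteChannel channel, int capacity) {
        if (capacity < 11) {
            throw new IllegalArgumentException("capacity must be at least 11");
        }
        this.channel = channel;
        this.buffer = ByteBuffer.allocate(capacity);
    }

    // Appends the decimal ASCII form of value without creating a String.
    void print(int value) {
        int p = scratch.length;
        boolean negative = value < 0;
        int n = negative ? value : -value; // n <= 0, so MIN_VALUE cannot overflow
        do {
            scratch[--p] = (byte) ('0' - n % 10); // n % 10 is in -9..0
            n /= 10;
        } while (n != 0);
        if (negative) {
            scratch[--p] = '-';
        }
        int len = scratch.length - p;
        if (buffer.remaining() < len) {
            flush();
        }
        buffer.put(scratch, p, len);
    }

    // Drains the buffer to the channel; call once more after the last print.
    void flush() {
        buffer.flip();
        try {
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.clear();
    }
}
```

For standard output, one option is to pass new FileOutputStream(FileDescriptor.out).getChannel() as the channel. Whether the channel implementation itself allocates anything depends on the JVM, so it is worth checking with the profiler.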

Since you have discovered that printing a byte[] keeps memory allocation within the required bounds, you can use this fact:
Allocate a byte array the length of the ASCII representation of Integer.MIN_VALUE (11 - the longest an int can be). Then you can fill the array backwards to convert a number i:
int p = buffer.length;
if (i == Integer.MIN_VALUE) {
buffer[--p] = (byte) ('0' - i % 10);
i /= 10;
}
boolean neg = i < 0;
if (neg) i = -i;
do {
buffer[--p] = (byte) ('0' + i % 10);
i /= 10;
} while (i != 0);
if (neg) buffer[--p] = '-';
Then write this to your stream:
out.write(buffer, p, buffer.length - p);
You can reuse the same buffer to write as many numbers as you wish.

This pattern of memory usage is typical for Java; your code is irrelevant. To control Java memory usage you need to use some -X parameters. For example, "-Xms512m -Xmx512m" sets both the minimum and maximum heap size to 512 MB. BTW, to minimize the sawtooth-like memory graph it is recommended to set the minimum and maximum size to the same value. These parameters are given to java on the command line when you run your program, for example:
java -Xms512m -Xmx512m myProgram
There are other ways as well. Here is one link where you can read more about it: Oracle docs. There are other parameters that control stack size and some other things. Code written without regard to memory usage may influence memory usage as well, but in your case the code is too trivial to matter. Most memory concerns are addressed by configuring the JVM memory parameters.
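To check what the -Xms/-Xmx settings actually resulted in, a small sketch using the standard Runtime API can print the heap figures from inside the program. The class and method names are hypothetical.

```java
final class HeapReport {
    // Upper bound the heap may grow to (roughly what -Xmx controls).
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Heap currently reserved by the JVM (starts near what -Xms controls).
    static long committedBytes() {
        return Runtime.getRuntime().totalMemory();
    }

    // Part of the committed heap that is currently occupied, including garbage not yet collected.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("max heap (MB):       " + maxBytes() / mb);
        System.out.println("committed heap (MB): " + committedBytes() / mb);
        System.out.println("used heap (MB):      " + usedBytes() / mb);
    }
}
```

Running it as java -Xms512m -Xmx512m HeapReport should show the maximum and committed values close to each other.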

Multi threaded matrix multiplication performance issue

I am using java for multi threaded multiplication. I am practicing multi threaded programming. Following is the code that I took from another post of stackoverflow.
public class MatMulConcur {
private final static int NUM_OF_THREAD =1 ;
private static Mat matC;
public static Mat matmul(Mat matA, Mat matB) {
matC = new Mat(matA.getNRows(),matB.getNColumns());
return mul(matA,matB);
}
private static Mat mul(Mat matA,Mat matB) {
int numRowForThread;
int numRowA = matA.getNRows();
int startRow = 0;
Worker[] myWorker = new Worker[NUM_OF_THREAD];
for (int j = 0; j < NUM_OF_THREAD; j++) {
if (j<NUM_OF_THREAD-1){
numRowForThread = (numRowA / NUM_OF_THREAD);
} else {
numRowForThread = (numRowA / NUM_OF_THREAD) + (numRowA % NUM_OF_THREAD);
}
myWorker[j] = new Worker(startRow, startRow+numRowForThread,matA,matB);
myWorker[j].start();
startRow += numRowForThread;
}
for (Worker worker : myWorker) {
try {
worker.join();
} catch (InterruptedException e) {
}
}
return matC;
}
private static class Worker extends Thread {
private int startRow, stopRow;
private Mat matA, matB;
public Worker(int startRow, int stopRow, Mat matA, Mat matB) {
super();
this.startRow = startRow;
this.stopRow = stopRow;
this.matA = matA;
this.matB = matB;
}
@Override
public void run() {
for (int i = startRow; i < stopRow; i++) {
for (int j = 0; j < matB.getNColumns(); j++) {
double sum = 0;
for (int k = 0; k < matA.getNColumns(); k++) {
sum += matA.get(i, k) * matB.get(k, j);
}
matC.set(i, j, sum);
}
}
}
}
}
I ran this program for 1,10,20,...,100 threads but performance is decreasing instead. Following is the time table
Thread 1 takes 18 Milliseconds
Thread 10 takes 18 Milliseconds
Thread 20 takes 35 Milliseconds
Thread 30 takes 38 Milliseconds
Thread 40 takes 43 Milliseconds
Thread 50 takes 48 Milliseconds
Thread 60 takes 57 Milliseconds
Thread 70 takes 66 Milliseconds
Thread 80 takes 74 Milliseconds
Thread 90 takes 87 Milliseconds
Thread 100 takes 98 Milliseconds
Any Idea?

People think that using multiple threads will automatically (magically!) make any computation go faster. This is not so [1].
There are a number of factors that can make multi-threading speedup less than you expect, or indeed result in a slowdown.
1. A computer with N cores (or hyperthreads) can do computations at most N times as fast as a computer with 1 core. This means that when you have T threads where T > N, the computational performance will be capped at N. (Beyond that, the threads make progress because of time slicing.)
2. A computer has a certain amount of memory bandwidth; i.e. it can only perform a certain number of read/write operations per second on main memory. If you have an application where the demand exceeds what the memory subsystem can achieve, it will stall (for a few nanoseconds). If there are many cores executing many threads at the same time, then it is the aggregate demand that matters.
3. A typical multi-threaded application working on shared variables or data structures will either use volatile or explicit synchronization to do this. Both of these increase the demand on the memory system.
4. When explicit synchronization is used and two threads want to hold a lock at the same time, one of them will be blocked. This lock contention slows down the computation. Indeed, the computation is likely to be slowed down if there was past contention on the lock.
5. Thread creation is expensive. Even acquiring an existing thread from a thread pool can be relatively expensive. If the task that you perform with the thread is too small, the setup costs can outweigh the possible speedup.
There is also the issue that you may be running into problems with a poorly written benchmark; e.g. the JVM may not be properly warmed up before taking the timing measurements.
There is insufficient detail in your question to be sure which of the above factors is likely to affect your application's performance. But it is likely to be a combination of 1, 2 and 5 ... depending on how many cores are used, how big the CPU's memory caches are, how big the matrix is, and other factors.
[1] Indeed, if this were true then we would not need to buy computers with lots of cores. We could just use more and more threads. Provided you had enough memory, you could do an infinite amount of computation on a single machine. Bitcoin mining would be a doddle. Of course, it isn't true.
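The core-count and thread-creation points above suggest capping the number of workers at the number of available processors and reusing pooled threads. A minimal sketch of that idea, using plain double[][] arrays instead of the Mat class from the question (the class name and the row-chunking scheme are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PooledMatMul {
    // Multiplies a (rows x inner) by b (inner x cols); b must have at least one row.
    static double[][] multiply(double[][] a, double[][] b) {
        int rows = a.length, inner = b.length, cols = b[0].length;
        double[][] c = new double[rows][cols];
        // Never use more workers than cores, and never more than there are rows.
        int threads = Math.max(1, Math.min(rows, Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            int chunk = (rows + threads - 1) / threads; // rows per task, rounded up
            for (int start = 0; start < rows; start += chunk) {
                final int from = start;
                final int to = Math.min(rows, start + chunk);
                pending.add(pool.submit(() -> {
                    // Each task writes a disjoint block of rows, so no locking is needed.
                    for (int i = from; i < to; i++) {
                        for (int k = 0; k < inner; k++) {
                            double aik = a[i][k];
                            for (int j = 0; j < cols; j++) {
                                c[i][j] += aik * b[k][j];
                            }
                        }
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for the task and makes its writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return c;
    }
}
```

For a matrix that multiplies in 18 milliseconds, even this version may not beat a single thread, because the pool setup alone can cost a noticeable fraction of that time.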

Using multi-threading is not primarily for performance, but for parallelization. There are cases where parallelization can benefit performance, though.
Your computer doesn't have infinite resources. Adding more and more threads will decrease performance. It's like starting more and more applications, you wouldn't expect a program to run faster when you start another program, and you probably wouldn't be surprised if it runs slower.
Up to a certain point performance will remain constant (your computer still has resources to handle the demand), but at some point you reach the maximum your computer can handle and performance will drop. That's exactly what your result shows. Performance stays somewhat constant with 1 or 10 threads, and then drops steadily.

Aparapi GPU execution slower than CPU

I am trying to test the performance of Aparapi.
I have seen some blogs where the results show that Aparapi does improve the performance while doing data parallel operations.
But I am not able to see that in my tests. Here is what I did, I wrote two programs, one using Aparapi, the other one using normal loops.
Program 1: In Aparapi
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;
public class App
{
public static void main( String[] args )
{
final int size = 50000000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float) (Math.random() * 100);
b[i] = (float) (Math.random() * 100);
}
final float[] sum = new float[size];
Kernel kernel = new Kernel(){
@Override public void run() {
int gid = getGlobalId();
sum[gid] = a[gid] + b[gid];
}
};
long t1 = System.currentTimeMillis();
kernel.execute(Range.create(size));
long t2 = System.currentTimeMillis();
System.out.println("Execution mode = "+kernel.getExecutionMode());
kernel.dispose();
System.out.println(t2-t1);
}
}
Program 2: using loops
public class App2 {
public static void main(String[] args) {
final int size = 50000000;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float) (Math.random() * 100);
b[i] = (float) (Math.random() * 100);
}
final float[] sum = new float[size];
long t1 = System.currentTimeMillis();
for(int i=0;i<size;i++) {
sum[i]=a[i]+b[i];
}
long t2 = System.currentTimeMillis();
System.out.println(t2-t1);
}
}
Program 1 takes around 330ms whereas Program 2 takes only around 55ms.
Am I doing something wrong here? I did print out the execution mode in the Aparapi program, and it prints that the mode of execution is GPU.

You did not do anything wrong - except for the benchmark itself.
Benchmarking is always tricky, particularly where a JIT is involved (as for Java), and for libraries where many nitty-gritty details are hidden from the user (as for Aparapi). In both cases, you should at least execute the code section that you want to benchmark multiple times.
For the Java version, one might expect the computation time for a single execution of the loop to decrease when the loop itself is executed multiple times, due to the JIT kicking in. There are many additional caveats to consider - for details, you should refer to this answer. In this simple test, the effect of the JIT may not really be noticeable, but in more realistic or complex scenarios, this will make a difference. Anyhow: when repeating the loop 10 times, the time for a single execution of the loop on my machine was about 70 milliseconds.
For the Aparapi version, the point of possible GPU initialization was already mentioned in the comments. And here, this is indeed the main problem: When running the kernel 10 times, the timings on my machine are
1248
72
72
72
73
71
72
73
72
72
You see that the initial call causes all the overhead. The reason for this is that, during the first call to Kernel#execute(), it has to do all the initializations (basically converting the bytecode to OpenCL, compiling the OpenCL code etc.). This is also mentioned in the documentation of the KernelRunner class:
The KernelRunner is created lazily as a result of calling Kernel.execute().
The effect of this - namely, a comparatively large delay for the first execution - has led to this question on the Aparapi mailing list: A way to eagerly create KernelRunners. The only workaround suggested there was to create an "initialization call" like
kernel.execute(Range.create(1));
without a real workload, only to trigger the whole setup, so that the subsequent calls are fast. (This also works for your example).
You may have noticed that, even after the initialization, the Aparapi version is still not faster than the plain Java version. The reason for that is that the task of a simple vector addition like this is memory bound - for details, you may refer to this answer, which explains this term and some issues with GPU programming in general.
As an overly suggestive example for a case where you might benefit from the GPU, you might want to modify your test, in order to create an artificial compute bound task: When you change the kernel to involve some expensive trigonometric functions, like this
Kernel kernel = new Kernel() {
@Override
public void run() {
int gid = getGlobalId();
sum[gid] = (float)(Math.cos(Math.sin(a[gid])) + Math.sin(Math.cos(b[gid])));
}
};
and the plain Java loop version accordingly, like this
for (int i = 0; i < size; i++) {
sum[i] = (float)(Math.cos(Math.sin(a[i])) + Math.sin(Math.cos(b[i])));
}
then you will see a difference. On my machine (GeForce 970 GPU vs. AMD K10 CPU) the timings are about 140 milliseconds for the Aparapi version, and a whopping 12000 milliseconds for the plain Java version - that's a speedup of nearly 90 through Aparapi!
Also note that even in CPU mode, Aparapi may offer an advantage compared to plain Java. On my machine, in CPU mode, Aparapi needs only 2300 milliseconds, because it still parallelizes the execution using a Java thread pool.
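The repeated measurements described above can be taken with a small helper like the following sketch. It is plain Java with no Aparapi dependency; the class and method names are hypothetical, and a harness such as JMH would be the more rigorous choice.

```java
final class RepeatTimer {
    // Runs the task the given number of times and returns each run's wall-clock time in milliseconds.
    static long[] timeRunsMillis(Runnable task, int runs) {
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    public static void main(String[] args) {
        final int size = 5_000_000;
        final float[] a = new float[size];
        final float[] b = new float[size];
        final float[] sum = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }
        long[] timings = timeRunsMillis(() -> {
            for (int i = 0; i < size; i++) {
                sum[i] = a[i] + b[i];
            }
        }, 10);
        // The first entry includes warm-up effects; later entries are more representative.
        for (long ms : timings) {
            System.out.println(ms);
        }
    }
}
```

For the Aparapi version, the task passed in would be () -> kernel.execute(Range.create(size)), which makes the one-time initialization cost show up as the first entry only.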

Just add the following before executing the kernel:
kernel.setExplicit(true);
kernel.put(a);
kernel.put(b);
and
kernel.get(sum);
after it.
Although Aparapi does analyze the byte code of the Kernel.run()
method (and any method reachable from Kernel.run()) Aparapi has no
visibility to the call site. In the above code there is no way for
Aparapi to detect that that hugeArray is not modified within the for
loop body. Unfortunately, Aparapi must default to being ‘safe’ and
copy the contents of hugeArray backwards and forwards to the GPU
device.
https://github.com/aparapi/aparapi/blob/master/doc/ExplicitBufferHandling.md

Code inside thread slower than outside thread..?

I'm trying to alter some code so it can work with multithreading. I stumbled upon a performance loss when putting a Runnable around some code.
For clarification: The original code, let's call it
//doSomething
got a Runnable around it like this:
Runnable r = new Runnable()
{
public void run()
{
//doSomething
}
}
Then I submit the runnable to a CachedThreadPool ExecutorService. This is my first step towards multithreading this code, to see if the code runs as fast with one thread as the original code.
However, this is not the case. Where //doSomething executes in about 2 seconds, the Runnable executes in about 2.5 seconds. I need to mention that some other code, say, //doSomethingElse, inside a Runnable had no performance loss compared to the original //doSomethingElse.
My guess is that //doSomething has some operations that are not as fast when working in a Thread, but I don't know what it could be or what, in that aspect is the difference with //doSomethingElse.
Could it be the use of final int[]/float[] arrays that makes a Runnable so much slower? The //doSomethingElse code also used some finals, but //doSomething uses more. This is the only thing I could think of.
Unfortunately, the //doSomething code is quite long and out of context, but I will post it here anyway. For those who know the Mean Shift segmentation algorithm, this is a part of the code where the mean shift vector is being calculated for each pixel. The for-loop
for(int i=0; i<L; i++)
runs through each pixel.
timer.start(); // this is where I start the timer
// Initialize mode table used for basin of attraction
char[] modeTable = new char [L]; // (L is a class property and is about 100,000)
Arrays.fill(modeTable, (char)0);
int[] pointList = new int [L];
// Allocate memory for yk (current vector)
double[] yk = new double [lN]; // (lN is a final int, defined earlier)
// Allocate memory for Mh (mean shift vector)
double[] Mh = new double [lN];
int idxs2 = 0; int idxd2 = 0;
for (int i = 0; i < L; i++) {
// if a mode was already assigned to this data point
// then skip this point, otherwise proceed to
// find its mode by applying mean shift...
if (modeTable[i] == 1) {
continue;
}
// initialize point list...
int pointCount = 0;
// Assign window center (window centers are
// initialized by createLattice to be the point
// data[i])
idxs2 = i*lN;
for (int j=0; j<lN; j++)
yk[j] = sdata[idxs2+j]; // (sdata is an earlier defined final float[] of about 100,000 items)
// Calculate the mean shift vector using the lattice
/*****************************************************/
// Initialize mean shift vector
for (int j = 0; j < lN; j++) {
Mh[j] = 0;
}
double wsuml = 0;
double weight;
// find bucket of yk
int cBucket1 = (int) yk[0] + 1;
int cBucket2 = (int) yk[1] + 1;
int cBucket3 = (int) (yk[2] - sMinsFinal) + 1;
int cBucket = cBucket1 + nBuck1*(cBucket2 + nBuck2*cBucket3);
for (int j=0; j<27; j++) {
idxd2 = buckets[cBucket+bucNeigh[j]]; // (buckets is a final int[] of about 75,000 items)
// list parse, crt point is cHeadList
while (idxd2>=0) {
idxs2 = lN*idxd2;
// determine if inside search window
double el = sdata[idxs2+0]-yk[0];
double diff = el*el;
el = sdata[idxs2+1]-yk[1];
diff += el*el;
//...
idxd2 = slist[idxd2]; // (slist is a final int[] of about 100,000 items)
}
}
//...
}
timer.end(); // this is where I stop the timer.
There is more code, but the last while loop was where I first noticed the difference in performance.
Could anyone think of a reason why this code runs slower inside a Runnable than original?
Thanks.
Edit: The measured time is inside the code, so excluding startup of the thread.

All code always runs "inside a thread".
The slowdown you see is most likely caused by the overhead that multithreading adds. Try parallelizing different parts of your code - the tasks should neither be too large, nor too small. For example, you'd probably be better off running each of the outer loops as a separate task, rather than the innermost loops.
There is no single correct way to split up tasks, though, it all depends on how the data looks and what the target machine looks like (2 cores, 8 cores, 512 cores?).
Edit: What happens if you run the test repeatedly? E.g., if you do it like this:
Executor executor = ...;
for (int i = 0; i < 10; i++) {
final int lap = i;
Runnable r = new Runnable() {
public void run() {
long start = System.currentTimeMillis();
//doSomething
long duration = System.currentTimeMillis() - start;
System.out.printf("Lap %d: %d ms%n", lap, duration);
}
};
executor.execute(r);
}
Do you notice any difference in the results?

I personally do not see any reason for this. Any program has at least one thread. All threads are equal. All threads are created by default with medium priority (5). So, the code should show the same performance in both the main application thread and other thread that you open.
Are you sure you are measuring the time of "do something" and not the overall time that your program runs? I believe that you are measuring the time of operation together with the time that is required to create and start the thread.

When you create a new thread you always have an overhead. If you have a small piece of code, you may experience performance loss.
Once you have more code (bigger tasks) you may get a performance improvement from your parallelization (the code on the thread will not necessarily run faster, but you are doing two things at once).
Just a detail: deciding how small a task can be while still being worth parallelizing is a well-known topic in parallel computing :)

You haven't explained exactly how you are measuring the time taken. Clearly there are thread start-up costs but I infer that you are using some mechanism that ensures that these costs don't distort your picture.
Generally speaking, when measuring performance it's easy to get misled when measuring small pieces of work. I would be looking to get a run at least 1,000 times longer, putting the whole thing in a loop or whatever.
Here the one difference between the "No Thread" and "Threaded" cases is actually that you have gone from having one thread (as has been pointed out, you always have a thread) to two threads, so now the JVM has to mediate between two threads. For this kind of work I can't see why that should make a difference, but it is a difference.
I would want to be using a good profiling tool to really dig into this.
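Before reaching for a profiler, a quick sanity check is to time the same work both directly and on a pool thread, several laps each, with the timer inside the task as in the question. The sketch below uses a placeholder workload in place of //doSomething, and all names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class InlineVsPooled {
    static volatile double sink; // keeps the JIT from discarding the work

    // Placeholder for //doSomething.
    static void work() {
        double acc = 0;
        for (int i = 1; i <= 2_000_000; i++) {
            acc += Math.sqrt(i);
        }
        sink = acc;
    }

    // Times work() on whichever thread calls this method.
    static long timeInlineNanos() {
        long start = System.nanoTime();
        work();
        return System.nanoTime() - start;
    }

    // Times work() on a pool thread; the timer runs inside the task, so submission cost is excluded.
    static long timePooledNanos(ExecutorService pool) {
        Callable<Long> timed = InlineVsPooled::timeInlineNanos;
        Future<Long> result = pool.submit(timed);
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int lap = 0; lap < 10; lap++) {
                long inline = timeInlineNanos();
                long pooled = timePooledNanos(pool);
                System.out.printf("Lap %d: inline %d us, pooled %d us%n", lap, inline / 1000, pooled / 1000);
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

If the two columns converge after the first few laps, the original difference was probably a warm-up effect rather than something inherent to running in a Runnable.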

Microphone level in Java

I'm trying to access the level of the mic through Java.
I don't need to record anything, I just want to know a relative scale of sound level.
Is this possible in real-time?
If it's impossible, maybe this could work:
Start recording when the level is over a certain value, stop when the level drops under a certain level for a certain time
Recording bits of a quarter second and reading its volume, and if it's under the threshold stop recording.
Thanks in advance

http://www.technogumbo.com/tutorials/Java-Microphone-Selection-And-Level-Monitoring/Java-Microphone-Selection-And-Level-Monitoring.php
Pretty good article on this. Helped me out a lot.
From what I can tell, this uses the root mean squared stuff talked about in @Nick's answer.
Basically:
public int calculateRMSLevel(byte[] audioData)
{
long lSum = 0;
for(int i=0; i < audioData.length; i++)
lSum = lSum + audioData[i];
double dAvg = (double) lSum / audioData.length; // cast avoids integer division
double sumMeanSquare = 0d;
for(int j=0; j < audioData.length; j++)
sumMeanSquare += Math.pow(audioData[j] - dAvg, 2d);
double averageMeanSquare = sumMeanSquare / audioData.length;
return (int)(Math.pow(averageMeanSquare,0.5d) + 0.5);
}
and the usage:
int level = 0;
byte tempBuffer[] = new byte[6000];
stopCapture = false;
try {
while (!stopCapture) {
if (targetRecordLine.read(tempBuffer, 0, tempBuffer.length) > 0) {
level = calculateRMSLevel(tempBuffer);
}
}
targetRecordLine.close();
} catch (Exception e) {
System.out.println(e);
System.exit(0);
}

You can access microphones through the Sound API, but it won't give you a simple loudness level. You'll just have to capture the data and make your own decision about how loud it is.
http://download.oracle.com/javase/tutorial/sound/capturing.html
Recording implies saving the data, but here you can discard the data once you've finished determining its loudness.
The root mean squared method is a good way of calculating the amplitude of a section of wave data.
In answer to your comment, yes, you'd capture a short length of data (maybe just a few milliseconds worth) and calculate the amplitude of that. Repeat this periodically depending on how often you need updates. If you want to keep track of previous loudnesses and compare them, that's up to you - at this point it's just comparing numbers. You might use the average of recent loudnesses to calculate the ambient loudness of the room, so you can detect sudden increases in noise.
I don't know how much overhead there is in turning audio capture on and off, but you may be better off keeping the TargetDataLine open all the time, and just calculating the loudness when you need it. While the line is open you do need to keep calling read() on it though, otherwise the application will hang waiting for you to read data.
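As a rough illustration of the root mean squared calculation, here is a sketch that assumes the line was opened with 16-bit signed little-endian PCM (the format is an assumption; a different AudioFormat needs different decoding). The class and method names are hypothetical, and unlike the snippet in the other answer it does not subtract the mean first.

```java
final class PcmLevel {
    // Returns the RMS level of the first `length` bytes, scaled to the range 0.0 .. 1.0.
    static double rms16le(byte[] pcm, int length) {
        int samples = length / 2; // two bytes per 16-bit sample
        if (samples == 0) {
            return 0.0;
        }
        double sumSquares = 0.0;
        for (int i = 0; i < samples; i++) {
            int lo = pcm[2 * i] & 0xFF; // low byte, treated as unsigned
            int hi = pcm[2 * i + 1];    // high byte, keeps the sign
            int sample = (hi << 8) | lo;
            sumSquares += (double) sample * sample;
        }
        return Math.sqrt(sumSquares / samples) / 32768.0;
    }
}
```

In a capture loop this would be called as rms16le(buffer, bytesRead) after each read() on the TargetDataLine, and the result compared against a threshold of your choosing.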
