False Sharing only became noticeable on certain machines - java

I wrote the following test class in Java to reproduce the performance penalty introduced by "false sharing".
Basically, you can tweak the "size" of the array from 4 to a much larger value (e.g. 10000) to turn the false-sharing phenomenon on or off. To be specific, when size = 4, different threads are more likely to update values within the same cache line, causing far more frequent cache misses. In theory, the test program should run much faster with size = 10000 than with size = 4.
I ran the same test on two different machines multiple times:
Machine A: Lenovo X230 laptop w/ Intel® Core™ i5-3210M Processor (2 core, 4 threads) Windows 7 64bit
size = 4 => 5.5 seconds
size = 10000 => 5.4 seconds
Machine B: Dell OptiPlex 780 w/ Intel® Core™2 Duo Processor E8400 (2 core) Windows XP 32bit
size = 4 => 14.5 seconds
size = 10000 => 7.2 seconds
I later ran the tests on a few other machines; quite obviously, false sharing only becomes noticeable on certain machines, and I couldn't figure out the decisive factor that makes the difference.
Can anyone kindly take a look at this problem and explain why the false sharing introduced in this test class only becomes noticeable on certain machines?
public class FalseSharing {
interface Oper {
int eval(int value);
}
//try tweak the size
static int size = 4;
//try tweak the op
static Oper op = new Oper() {
@Override
public int eval(int value) {
return value + 2;
}
};
static int[] array = new int[10000 + size];
static final int interval = (size / 4);
public static void main(String args[]) throws InterruptedException {
long start = System.currentTimeMillis();
Thread t1 = new Thread(new Runnable() {
@Override
public void run() {
System.out.println("Array index:" + 5000);
for (int j = 0; j < 30; j++) {
for (int i = 0; i < 1000000000; i++) {
array[5000] = op.eval(array[5000]);
}
}
}
});
Thread t2 = new Thread(new Runnable() {
@Override
public void run() {
System.out.println("Array index:" + (5000 + interval));
for (int j = 0; j < 30; j++) {
for (int i = 0; i < 1000000000; i++) {
array[5000 + interval] = op.eval(array[5000 + interval]);
}
}
}
});
Thread t3 = new Thread(new Runnable() {
@Override
public void run() {
System.out.println("Array index:" + (5000 + interval * 2));
for (int j = 0; j < 30; j++) {
for (int i = 0; i < 1000000000; i++) {
array[5000 + interval * 2] = op.eval(array[5000 + interval * 2]);
}
}
}
});
Thread t4 = new Thread(new Runnable() {
@Override
public void run() {
System.out.println("Array index:" + (5000 + interval * 3));
for (int j = 0; j < 30; j++) {
for (int i = 0; i < 1000000000; i++) {
array[5000 + interval * 3] = op.eval(array[5000 + interval * 3]);
}
}
}
});
t1.start();
t2.start();
t3.start();
t4.start();
t1.join();
t2.join();
t3.join();
t4.join();
System.out.println("Finished!" + (System.currentTimeMillis() - start));
}
}

False sharing only occurs within blocks of 64 bytes. You need to be accessing the same 64-byte block in all four threads. I suggest you create an object or an array with long[8], update different cells of that array from all four threads, and compare with four threads accessing independent arrays.
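For instance, a minimal sketch of that comparison (the class name and iteration counts are mine, and the 64-byte line size is an assumption; a plain loop like this can also be partially optimized away by the JIT, so treat it as illustrative rather than a rigorous benchmark):
public class SharedLineDemo {
static final long[] shared = new long[8]; // 64 bytes: likely a single cache line
static final long[][] separate = new long[4][16]; // one padded array per thread
static void run(final boolean useShared) throws InterruptedException {
Thread[] threads = new Thread[4];
for (int t = 0; t < 4; t++) {
final int id = t;
threads[t] = new Thread(new Runnable() {
@Override
public void run() {
for (int i = 0; i < 100_000_000; i++) {
if (useShared) shared[id]++; // all four threads hit the same line
else separate[id][0]++; // each thread hits its own line
}
}
});
threads[t].start();
}
for (Thread t : threads) t.join();
}
public static void main(String[] args) throws InterruptedException {
long start = System.currentTimeMillis();
run(true);
System.out.println("shared: " + (System.currentTimeMillis() - start) + " ms");
start = System.currentTimeMillis();
run(false);
System.out.println("separate: " + (System.currentTimeMillis() - start) + " ms");
}
}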

Your code is probably fine. Here is a simpler version, with results:
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
public class TestFalseSharing {
static long T0 = System.currentTimeMillis();
static void p(Object msg) {
System.out.format("%09.3f %-10s %s%n", 0.001 * (System.currentTimeMillis() - T0), Thread.currentThread().getName(), " : " + msg);
}
public static void main(String args[]) throws InterruptedException {
int NT = Runtime.getRuntime().availableProcessors();
p("Available processors: "+NT);
int MAXSPAN = 0x1000; //4kB
final byte[] array = new byte[NT*MAXSPAN];
for(int i=1; i<=MAXSPAN; i<<=1) {
testFalseSharing(NT, i, array);
}
}
static void testFalseSharing(final int NT, final int span, final byte[] array) throws InterruptedException {
final int L1 = 10;
final int L2 = 10_000_000;
final CountDownLatch cl = new CountDownLatch(NT*L1);
long t0 = System.nanoTime();
for (int i = 0; i < NT; i++) { // spawn NT threads to match the NT*L1 latch count
final int startOffset = i*span;
Thread t = new Thread(new Runnable() {
@Override
public void run() {
//p("Offset:" + startOffset);
for (int j = 0; j < L1; j++) {
for (int k = 0; k < L2; k++) {
array[startOffset] += 1;
}
cl.countDown();
}
}
});
t.start();
}
while(!cl.await(10, TimeUnit.SECONDS)) {
p(""+cl.getCount()+" left");
}
long d = System.nanoTime() - t0;
p("Duration: " + 1e-9*d + " seconds, Span="+span+" bytes");
}
}
Results:
00000.000 main : Available processors: 4
00002.843 main : Duration: 2.837645384 seconds, Span=1 bytes
00005.689 main : Duration: 2.8454065760000002 seconds, Span=2 bytes
00008.659 main : Duration: 2.9697156340000004 seconds, Span=4 bytes
00011.640 main : Duration: 2.979306959 seconds, Span=8 bytes
00013.780 main : Duration: 2.140246744 seconds, Span=16 bytes
00015.387 main : Duration: 1.6061148440000002 seconds, Span=32 bytes
00016.729 main : Duration: 1.34128957 seconds, Span=64 bytes
00017.944 main : Duration: 1.215005455 seconds, Span=128 bytes
00019.208 main : Duration: 1.263007368 seconds, Span=256 bytes
00020.477 main : Duration: 1.269272208 seconds, Span=512 bytes
00021.719 main : Duration: 1.241061631 seconds, Span=1024 bytes
00022.975 main : Duration: 1.256024242 seconds, Span=2048 bytes
00024.171 main : Duration: 1.195086858 seconds, Span=4096 bytes
So to answer: this confirms the 64-byte cache line theory, at least on my Core i5 laptop.
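A follow-up note: once the line size is known, the usual remedy is padding, so that hot values land on distinct lines. A minimal sketch, assuming 8-byte longs and a 64-byte line (the class and names are mine; on newer JDKs the @jdk.internal.vm.annotation.Contended annotation achieves the same, but needs -XX:-RestrictContended for user code):
public class PaddedCounters {
// Stride of 8 longs = 64 bytes, so each counter gets its own cache line.
static final int STRIDE = 8;
final long[] slots;
PaddedCounters(int nThreads) {
// One extra stride of headroom for the array header/alignment.
slots = new long[(nThreads + 1) * STRIDE];
}
void increment(int threadId) {
slots[(threadId + 1) * STRIDE]++; // each thread touches only its own line
}
long get(int threadId) {
return slots[(threadId + 1) * STRIDE];
}
}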

Related

Running a single Java thread is faster than main?

I'm doing a few concurrency experiments in Java.
I have this prime calculation method, which is just meant to mimic a semi-expensive operation:
static boolean isprime(int n){
if (n == 1)
return false;
boolean flag = false;
for (int i = 2; i <= n / 2; ++i) {
if (n % i == 0) {
flag = true;
break;
}
}
return ! flag;
}
And then I have this main method, which simply calculates all prime numbers from 0 to N and stores the results in an array of booleans:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class Main {
public static void main(String[] args) {
final int N = 100_000;
int T = 1;
boolean[] bool = new boolean[N];
ExecutorService es = Executors.newFixedThreadPool(T);
final int partition = N / T;
long start = System.nanoTime();
for (int j = 0; j < N; j++ ){
boolean res = isprime(j);
bool[j] = res;
}
System.out.println(System.nanoTime()-start);
}
}
This gives me results like: 893888901 ns, 848995600 ns.
And I also have this driver code, where I use an ExecutorService with one thread to do the same:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Main {
public static void main(String[] args) {
final int N = 100_000;
int T = 1;
boolean[] bool = new boolean[N];
ExecutorService es = Executors.newFixedThreadPool(T);
final int partition = N / T;
long start = System.nanoTime();
for (int i = 0; i < T; i++ ){
final int current = i;
es.execute(new Runnable() {
@Override
public void run() {
for (int j = current*partition; j < current*partition+partition; j++ ){
boolean res = isprime(j);
bool[j] = res;
}
}
});
}
es.shutdown();
try {
es.awaitTermination(1, TimeUnit.MILLISECONDS);
} catch (Exception e){
System.out.println("what?");
}
System.out.println(System.nanoTime()-start);
}
}
This gives results like: 9523201 ns, 15485300 ns.
Now the second example is, as you can see, much faster than the first. I can't really understand why that is. Shouldn't the ExecutorService version (with 1 thread) be slower, since it's basically doing the work sequentially, plus the overhead of waking the thread, compared to the main thread?
I was expecting the ExecutorService to become faster once I started adding multiple threads, but this is a little counterintuitive.
It's the timeout at the bottom of your code. If you set it higher, you arrive at pretty similar execution times.
es.awaitTermination(1000, TimeUnit.MILLISECONDS);
The execution times you mention for the first main are much higher than the millisecond you allow the second main to wait for the threads to finish.
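A sketch of the corrected measurement, taking the time only after the pool has really finished (the one-minute timeout is an arbitrary generous value):
es.shutdown();
// Block until all submitted tasks complete, then measure.
if (!es.awaitTermination(1, TimeUnit.MINUTES)) {
System.out.println("tasks did not finish in time");
}
System.out.println(System.nanoTime() - start);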

Multithreading not working for me

I have to write a program that finds the sum of a 2D array of int.
I coded everything as well as I know how, and there are no syntax errors, but when I check my code the threads are not working at all, or sometimes only some of them work, not all of them.
I filled the array with the number 1 to check the summation.
I added a lock to make sure no two threads are in the summation method at the same time, just to be safe,
and the n counter to see how many times the add method is entered.
public class extend extends Thread {
int a, b;
private static int sum = 0;
static int n;
boolean lock;
int[][] arr;
public extend() {
arr = new int[45][45];
for (int i = 0; i < 45; i++) {
for (int j = 0; j < 45; j++)
arr[i][j] = 1;
}
n = 0;
lock = false;
}
public extend(int a, int b) {
arr = new int[45][45];
for (int i = 0; i < 45; i++) {
for (int j = 0; j < 45; j++)
arr[i][j] = 1;
}
n = 0;
lock = false;
this.a = a;
this.b = b;
}
public void run() {
add(a, b);
}
public void add(int st, int e) {
n++;
while (lock) ;
lock = true;
int sums = 0;
synchronized (this) {
for (int i = st; i < e; i++) {
for (int j = 0; j < 45; j++) {
sums += arr[i][j];
}
}
}
sum = sums;
lock = false;
}
public int getSum() {
return sum;
}
public static void main(String[] args) {
long ss = System.currentTimeMillis();
Thread t1 = new Thread(new extend(0, 9));
Thread t2 = new Thread(new extend(9, 18));
Thread t3 = new Thread(new extend(18, 27));
Thread t4 = new Thread(new extend(27, 36));
Thread t5 = new Thread(new extend(36, 45));
t1.start();
t2.start();
t3.start();
t4.start();
t5.start();
long se = System.currentTimeMillis();
System.out.println("The sum for 45*45 array is: " + sum);
System.out.println("time start;" + (se - ss));
System.out.print(n);
}
}
I'm sorry to say, but there's so much wrong with this code, it's hard to point at one problem:
You start your threads, but you don't wait for them to finish using .join()
Extending Thread when you actually meant implementing Runnable
Using busy-waiting in your thread with while (lock) ;
Using a static int for counting
But, if there's only one thing you must fix, wait for your threads:
t1.join();
...
t5.join();
Your locking of the sum variable may not even result in a speedup once you take into account the overhead of creating threads, but your main problem is that you are not adding sums to sum.
Change:
sum = sums;
to:
sum += sums;
This will make your code work some of the time. It is not guaranteed to work, and it will sometimes output weird results like 1620 instead of 2025. You should learn more about how to properly handle multithreading, race conditions, and atomic operations.
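For reference, a minimal corrected sketch, replacing the hand-rolled lock with an AtomicInteger and joining the threads before reading the result (the class and variable names are mine):
import java.util.concurrent.atomic.AtomicInteger;
public class ArraySum extends Thread {
static final AtomicInteger sum = new AtomicInteger(0);
static final int[][] arr = new int[45][45];
final int startRow, endRow;
ArraySum(int startRow, int endRow) {
this.startRow = startRow;
this.endRow = endRow;
}
@Override
public void run() {
int partial = 0;
for (int i = startRow; i < endRow; i++)
for (int j = 0; j < 45; j++)
partial += arr[i][j];
sum.addAndGet(partial); // atomic, so no explicit lock is needed
}
public static void main(String[] args) throws InterruptedException {
for (int[] row : arr)
java.util.Arrays.fill(row, 1);
Thread[] threads = new Thread[5];
for (int t = 0; t < 5; t++) {
threads[t] = new ArraySum(t * 9, (t + 1) * 9);
threads[t].start();
}
for (Thread t : threads)
t.join(); // wait for all partial sums before reading the total
System.out.println("The sum for the 45*45 array is: " + sum.get()); // 2025
}
}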

Gaussian Elimination Multithreading

I'm trying to make a parallel program for Gaussian elimination in Java.
I create two random matrices A and B at the start, and I am not using pivoting.
My code where I create the threads is:
GaussianElimination threads[] = new GaussianElimination[T];
long startTime = System.currentTimeMillis();
for (int k = 0; k < N; k++) {
/**This for statement creates threads that behave like objects.
* With the start() method we execute the run() process, and with
* join() the main thread waits for all the other threads to finish.
*/
for (int i = 0; i < T; i++) {
threads[i] = new GaussianElimination(T, k, i, A, B);
threads[i].start();
}
for (int i = 0; i < T; i++) {
try {
threads[i].join();
} catch (InterruptedException e) {
System.err.println("this should not happen");
}
}
}
long endTime = System.currentTimeMillis();
float time = (endTime - startTime) / 1000.0f;
System.out.println("Computation time: " + time);
After this, the run method is:
class GaussianElimination extends Thread {
private int myid;
private double[] B;
private double[][] A;
int k;
int threads;
GaussianElimination(int threads, int k, int myid, double[][] A, double[] B) {
this.A = A;//Matrix A
this.B = B;//Matrix B
this.myid = myid;//Id of thread
this.k = k;//k value from exterior loop
this.threads = threads; //size of threads
}
/**Method run() where the threads do their work.
*
* The data are distributed among the threads cyclically.
* e.g. For 3 threads the rows will be distributed like this:
* thread 1 = 1,4,7,10...
* thread 2 = 2,5,8,11...
* thread 3 = 3,6,9,12...
*/
public void run() {
int N = B.length;
long startTime = System.currentTimeMillis();//Clock starts
for (int i = myid + k + 1; i < N; i += threads) {
double factor = A[i][k] / A[k][k];
B[i] -= factor * B[k];
for (int j = k; j < N; j++)
A[i][j] -= factor * A[k][j];
long endTime = System.currentTimeMillis();
float time = (endTime - startTime) ;//clock ends
System.out.println("Computation time of thread: " + time);
}
}
}
After that I do a serial back substitution and then print the solution.
So the program runs, but it isn't running in parallel.
I have tried to measure the time in every thread, but that didn't lead me to a solution.
I couldn't find many Java examples for similar problems, so I am asking here.
Is it bad architecture and logic, or are there coding errors in my program?
Also, here is the serial code that I used: https://www.sanfoundry.com/java-program-gaussian-elimination-algorithm/
Thanks for your cooperation!
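Not a full answer, but one thing that commonly prevents a speedup with this structure is creating and joining T brand-new threads for every one of the N outer iterations, which can easily cost more than the row work itself. A sketch that reuses a fixed pool instead, keeping the same cyclic row distribution (A, B, N and T are assumed to be the same variables as in the code above, and the enclosing method is assumed to throw InterruptedException):
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
ExecutorService pool = Executors.newFixedThreadPool(T);
for (int k = 0; k < N; k++) {
final int kk = k;
final CountDownLatch done = new CountDownLatch(T);
for (int t = 0; t < T; t++) {
final int id = t;
pool.execute(() -> {
// Same cyclic distribution as the original run() method.
for (int i = id + kk + 1; i < N; i += T) {
double factor = A[i][kk] / A[kk][kk];
B[i] -= factor * B[kk];
for (int j = kk; j < N; j++)
A[i][j] -= factor * A[kk][j];
}
done.countDown();
});
}
done.await(); // step k must finish before step k+1 starts
}
pool.shutdown();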

Quicksort Multithreading

I'd like to implement a quicksort algorithm for a 2D array with multithreading.
It works very fast in a single thread, but now I have tried to speed it up. This is my code to sort every part of the 2D array correctly (the sorting algorithm itself should be very fast). It works directly on "c".
public static void sort(int[][] c) {
int[][] a = new int[][] { { 0, -4, 1, 2 }, { 1, 0, 3 }, { 2, 3, 0 } };
for (int i = 0; i < c.length; i++) {
sort(c[i],0,c[i].length-1);
}
}
What I have tried up to now:
splitting the for loop into small "loopers" which each perform a task of "x" iterations, but this slowed the algorithm down.
Can someone help me speed it up?
A couple of possibilities:
public static void sort(int[][] c) {
for (int i = 0; i < c.length; i++) {
//sort(c[i],0,c[i].length-1);
Arrays.sort(c[i]);
}
}
public static void parallelSort(int[][] c) {
Arrays.asList(c).parallelStream().forEach(d -> Arrays.sort(d));
}
public static void threadedSort(int[][] c) throws InterruptedException {
int count = 4;
Thread[] threads = new Thread[count];
for (int i = 0; i < count; i++) {
final int finalI = i;
threads[i] = new Thread(
() -> sortOnThread(c, (c.length / count) * finalI, c.length / count),
"Thread " + i
);
threads[i].start();
}
for (Thread thread : threads) {
thread.join();
}
}
private static void sortOnThread(int[][] c, int first, int length) {
for (int i = first; i < first + length; i++) {
Arrays.sort(c[i]);
}
}
public static void main(String[] args) throws InterruptedException {
int[][] c = new int[10_000_000][75];
shuffle(c);
System.out.println("Starting sort()");
long before = System.currentTimeMillis();
sort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
shuffle(c);
System.out.println("Starting parallelSort()");
before = System.currentTimeMillis();
parallelSort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
shuffle(c);
System.out.println("Starting threadedSort()");
before = System.currentTimeMillis();
threadedSort(c);
System.out.println("Took " + (System.currentTimeMillis() - before) + "ms");
}
private static void shuffle(int[][] c) {
// Fisher-Yates per row; Collections.shuffle(Arrays.asList(c[i])) would be a
// no-op here, since Arrays.asList on an int[] yields a one-element List<int[]>.
java.util.Random random = new java.util.Random();
for (int i = 0; i < c.length; i++) {
for (int j = 0; j < c[i].length; j++)
c[i][j] = j;
for (int j = c[i].length - 1; j > 0; j--) {
int k = random.nextInt(j + 1);
int tmp = c[i][j]; c[i][j] = c[i][k]; c[i][k] = tmp;
}
}
}
Which produced these timings on a quad core (i5-2430M):
Starting sort()
Took 2486ms
Starting parallelSort()
Took 984ms
Starting threadedSort()
Took 875ms
The parallelStream() approach was the least code, but clearly comes with a bit more overhead (sending each sort through the ForkJoinPool) than direct threading. This was more noticeable when the array was the smaller [100_000][75]:
Starting sort()
Took 48ms
Starting parallelSort()
Took 101ms
Starting threadedSort()
Took 21ms
Just in case it's useful ... initially while coding this, I found the timings for the three approaches were much more similar:
Starting sort()
Took 2403ms
Starting parallelSort()
Took 2435ms
Starting threadedSort()
Took 2284ms
This turned out to be because I was naively allocating new sub-arrays each time in my shuffle() method. Clearly this was generating a lot of extra GC work - even a short sleep before calling the sort methods made all the difference.

Is multiplication faster than array access?

To my surprise, I get a longer time (10 milliseconds) when "optimizing" multiplications by pregenerating the results in an array, compared to the original 8 milliseconds. Is that just a Java quirk, or is that general to PC architecture? I have a Core i5 760 with Java 7, Windows 8 64-bit.
public class Test {
public static void main(String[] args) {
long start = System.currentTimeMillis();
long sum=0;
int[] sqr = new int[1000];
for(int a=1;a<1000;a++) {sqr[a]=a*a;}
for(int b=1;b<1000;b++)
// for(int a=1;a<1000;a++) {sum+=a*a+b*b;}
for(int a=1;a<1000;a++) {sum+=sqr[a]+sqr[b];}
System.out.println(System.currentTimeMillis()-start+"ms");
System.out.println(sum);
}
}
Konrad Rudolph commented on the issues with the benchmarking, so I am ignoring the benchmark and focusing on the question:
Is multiplication faster than array access?
Yes, it is very likely. It used to be the other way around 20 or 30 years ago.
Roughly speaking, you can do an integer multiplication in 3 cycles (pessimistically, if you don't get vector instructions), and a memory access costs you 4 cycles if you get it straight from the L1 cache, but it is straight downhill from there. For reference, see:
Latencies and throughput in Appendix C of the Intel 64 and IA-32 Architectures Optimization Reference Manual
Approximate cost to access various caches and main memory?
Herb Sutter's presentation on this very subject: Machine Architecture: Things Your Programming Language Never Told You
One thing specific to Java was pointed out by Ingo in a comment below: You also get bounds checking in Java, which makes the already slower array access even slower...
A more reasonable benchmark would be:
import java.math.BigDecimal;
import java.util.Random;
public abstract class Benchmark {
final String name;
public Benchmark(String name) {
this.name = name;
}
abstract int run(int iterations) throws Throwable;
private BigDecimal time() {
try {
int nextI = 1;
int i;
long duration;
do {
i = nextI;
long start = System.nanoTime();
run(i);
duration = System.nanoTime() - start;
nextI = (i << 1) | 1;
} while (duration < 1000000000 && nextI > 0);
return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
@Override
public String toString() {
return name + "\t" + time() + " ns";
}
private static void shuffle(int[] a) {
Random chaos = new Random();
for (int i = a.length; i > 0; i--) {
int r = chaos.nextInt(i);
int t = a[r];
a[r] = a[i - 1];
a[i - 1] = t;
}
}
public static void main(String[] args) throws Exception {
final int[] table = new int[1000];
final int[] permutation = new int[1000];
for (int i = 0; i < table.length; i++) {
table[i] = i * i;
permutation[i] = i;
}
shuffle(permutation);
Benchmark[] marks = {
new Benchmark("sequential multiply") {
@Override
int run(int iterations) throws Throwable {
int sum = 0;
for (int j = 0; j < iterations; j++) {
for (int i = 0; i < table.length; i++) {
sum += i * i;
}
}
return sum;
}
},
new Benchmark("sequential lookup") {
@Override
int run(int iterations) throws Throwable {
int sum = 0;
for (int j = 0; j < iterations; j++) {
for (int i = 0; i < table.length; i++) {
sum += table[i];
}
}
return sum;
}
},
new Benchmark("random order multiply") {
@Override
int run(int iterations) throws Throwable {
int sum = 0;
for (int j = 0; j < iterations; j++) {
for (int i = 0; i < table.length; i++) {
sum += permutation[i] * permutation[i];
}
}
return sum;
}
},
new Benchmark("random order lookup") {
@Override
int run(int iterations) throws Throwable {
int sum = 0;
for (int j = 0; j < iterations; j++) {
for (int i = 0; i < table.length; i++) {
sum += table[permutation[i]];
}
}
return sum;
}
}
};
for (Benchmark mark : marks) {
System.out.println(mark);
}
}
}
which prints on my Intel Core Duo (yes, it's old):
sequential multiply 2218.666 ns
sequential lookup 1081.220 ns
random order multiply 2416.923 ns
random order lookup 2351.293 ns
So, if I access the lookup array sequentially, which minimizes the number of cache misses and permits the HotSpot JVM to optimize bounds checking on array access, there is a slight improvement on an array of 1000 elements. If we do random access into the array, that advantage disappears. Also, if the table is larger, the lookup gets slower. For instance, for 10000 elements, I get:
sequential multiply 23192.236 ns
sequential lookup 12701.695 ns
random order multiply 24459.697 ns
random order lookup 31595.523 ns
So, array lookup is not faster than multiplication, unless the access pattern is (nearly) sequential and the lookup array small.
In any case, my measurements indicate that a multiplication (and addition) takes merely 4 processor cycles (2.3 ns per loop iteration on a 2GHz CPU). You're unlikely to get much faster than that. Also, unless you do half a billion multiplications per second, the multiplications are not your bottleneck, and optimizing other parts of the code will be more fruitful.
