While taking my first steps with JMH to benchmark my class, I encountered a behavior that confuses me, and I'd like to clarify the issue before moving on.
The situation that confuses me:
When I run the benchmarks while the CPU is loaded (78%-80%) by extraneous processes, the results shown by JMH look quite plausible and stable:
Benchmark Mode Cnt Score Error Units
ArrayOperations.a_bigDecimalAddition avgt 5 264,703 ± 2,800 ns/op
ArrayOperations.b_quadrupleAddition avgt 5 44,290 ± 0,769 ns/op
ArrayOperations.c_bigDecimalSubtraction avgt 5 286,266 ± 2,454 ns/op
ArrayOperations.d_quadrupleSubtraction avgt 5 46,966 ± 0,629 ns/op
ArrayOperations.e_bigDecimalMultiplication avgt 5 546,535 ± 4,988 ns/op
ArrayOperations.f_quadrupleMultiplication avgt 5 85,056 ± 1,820 ns/op
ArrayOperations.g_bigDecimalDivision avgt 5 612,814 ± 5,943 ns/op
ArrayOperations.h_quadrupleDivision avgt 5 631,127 ± 4,172 ns/op
The relatively large errors are there because I only need a rough estimate right now, and I deliberately trade precision for speed.
But the results obtained without extraneous load on the processor seem amazing to me:
Benchmark Mode Cnt Score Error Units
ArrayOperations.a_bigDecimalAddition avgt 5 684,035 ± 370,722 ns/op
ArrayOperations.b_quadrupleAddition avgt 5 83,743 ± 25,762 ns/op
ArrayOperations.c_bigDecimalSubtraction avgt 5 531,430 ± 184,980 ns/op
ArrayOperations.d_quadrupleSubtraction avgt 5 85,937 ± 103,351 ns/op
ArrayOperations.e_bigDecimalMultiplication avgt 5 641,953 ± 288,545 ns/op
ArrayOperations.f_quadrupleMultiplication avgt 5 102,692 ± 31,625 ns/op
ArrayOperations.g_bigDecimalDivision avgt 5 733,727 ± 161,827 ns/op
ArrayOperations.h_quadrupleDivision avgt 5 820,388 ± 546,990 ns/op
Everything seems to run almost twice as slow, iteration times are very unstable (they may vary from 500 to 1300 ns/op between neighboring iterations), and the errors are accordingly unacceptably large.
The first set of results was obtained with a bunch of applications running, including the Folding@home distributed computing client (FahCore_a7.exe), which takes 75% of CPU time, a BitTorrent client that actively uses the disks, a dozen browser tabs, an e-mail client, etc. The average CPU load is about 85%. During benchmark execution, FahCore decreases its load so that Java takes 25% and the total load is 100%.
The second set of results was taken when all unnecessary processes were stopped and the CPU was practically idle: only Java takes its 25%, and a couple of percent go to system needs.
My CPU is an Intel i5-4460 (4 cores, 3.2 GHz), RAM 32 GB, OS Windows Server 2008 R2.
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
The questions are:
Why do the benchmarks show much worse and less stable results when Java is the only task loading the machine?
Can I consider the first set of results more or less reliable when they depend on the environment so dramatically?
Should I set up the environment somehow to eliminate this dependency?
Or is my code to blame?
The code:
package com.mvohm.quadruple.benchmarks;
// Required imports here
import com.mvohm.quadruple.Quadruple; // The class under tests
@State(value = Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(java.util.concurrent.TimeUnit.NANOSECONDS)
@Fork(value = 1)
@Warmup(iterations = 3, time = 7)
@Measurement(iterations = 5, time = 10)
public class ArrayOperations {
// To do BigDecimal arithmetic with the precision close to this of Quadruple
private static final MathContext MC_38 = new MathContext(38, RoundingMode.HALF_EVEN);
private static final int DATA_SIZE = 0x1_0000; // 65536
private static final int INDEX_MASK = DATA_SIZE - 1; // 0xFFFF
private static final double RAND_SCALE = 1e39; // To provide a sensible range of operands,
// so that the actual calculations don't get bypassed
private final BigDecimal[] // Data to apply operations to
bdOp1 = new BigDecimal[DATA_SIZE], // BigDecimals
bdOp2 = new BigDecimal[DATA_SIZE],
bdResult = new BigDecimal[DATA_SIZE];
private final Quadruple[]
qOp1 = new Quadruple[DATA_SIZE], // Quadruples
qOp2 = new Quadruple[DATA_SIZE],
qResult = new Quadruple[DATA_SIZE];
private int index = 0;
@Setup
public void initData() {
final Random rand = new Random(12345); // for reproducibility
for (int i = 0; i < DATA_SIZE; i++) {
bdOp1[i] = randomBigDecimal(rand);
bdOp2[i] = randomBigDecimal(rand);
qOp1[i] = randomQuadruple(rand);
qOp2[i] = randomQuadruple(rand);
}
}
private static Quadruple randomQuadruple(Random rand) {
return Quadruple.nextNormalRandom(rand).multiply(RAND_SCALE); // ranged 0 .. 9.99e38
}
private static BigDecimal randomBigDecimal(Random rand) {
return Quadruple.nextNormalRandom(rand).multiply(RAND_SCALE).bigDecimalValue();
}
@Benchmark
public void a_bigDecimalAddition() {
bdResult[index] = bdOp1[index].add(bdOp2[index], MC_38);
index = (index + 1) & INDEX_MASK; // wrap around the data array
}
@Benchmark
public void b_quadrupleAddition() {
// semantically the same as above
qResult[index] = Quadruple.add(qOp1[index], qOp2[index]);
index = (index + 1) & INDEX_MASK;
}
// Other methods are similar
public static void main(String... args) throws IOException, RunnerException {
final Options opt = new OptionsBuilder()
.include(ArrayOperations.class.getSimpleName())
.forks(1)
.build();
new Runner(opt).run();
}
}
The reason was very simple, and I should have understood it immediately. Power saving mode was enabled in the OS, which reduced the clock frequency of the CPU under low load. The moral is, always disable power saving when benchmarking!
Related
I'm trying to improve the following code:
public int applyAsInt(String ipAddress) {
var ipAddressInArray = ipAddress.split("\\.");
...
So I compile the regular expression into a static constant:
private static final Pattern PATTERN_DOT = Pattern.compile(".", Pattern.LITERAL);
public int applyAsInt(String ipAddress) {
var ipAddressInArray = PATTERN_DOT.split(ipAddress);
...
The rest of the code remained unchanged.
To my amazement, the new code is slower than the previous one.
Below are the test results:
Benchmark (ipAddress) Mode Cnt Score Error Units
ConverterBenchmark.mkyongConverter 1.2.3.4 avgt 10 166.456 ± 9.087 ns/op
ConverterBenchmark.mkyongConverter 120.1.34.78 avgt 10 168.548 ± 2.996 ns/op
ConverterBenchmark.mkyongConverter 129.205.201.114 avgt 10 180.754 ± 6.891 ns/op
ConverterBenchmark.mkyong2Converter 1.2.3.4 avgt 10 253.318 ± 4.977 ns/op
ConverterBenchmark.mkyong2Converter 120.1.34.78 avgt 10 263.045 ± 8.373 ns/op
ConverterBenchmark.mkyong2Converter 129.205.201.114 avgt 10 331.376 ± 53.092 ns/op
Help me understand why this is happening, please.
String.split has code aimed at exactly this use case:
https://github.com/openjdk/jdk17u/blob/master/src/java.base/share/classes/java/lang/String.java#L3102
/* fastpath if the regex is a
* (1) one-char String and this character is not one of the
* RegEx's meta characters ".$|()[{^?*+\\", or
* (2) two-char String and the first char is the backslash and
* the second is not the ascii digit or ascii letter.
*/
That means that when using split("\\.") the string is effectively not split using a regular expression - the method splits the string directly at the '.' characters.
This optimization is not possible when you write PATTERN_DOT.split(ipAddress).
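To see what that buys, here is an illustration-only sketch of the fastpath idea (not the actual JDK code; the real String.split also trims trailing empty strings): the string is scanned with indexOf, and the regex engine is never involved.

import java.util.ArrayList;
import java.util.List;

class LiteralSplit {
    // The spirit of String.split's fastpath for a single non-meta character.
    static List<String> splitOnLiteralChar(String s, char ch) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int next;
        while ((next = s.indexOf(ch, start)) != -1) { // plain char scan, no regex
            parts.add(s.substring(start, next));
            start = next + 1;
        }
        parts.add(s.substring(start)); // trailing segment
        return parts;
    }
}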
I'm trying to determine whether a 32-bit integer is even or odd. I have set up two approaches:
modulo(%) approach
int r = (i % 2);
bitwise(&) approach
int r = (i & 0x1);
Both approaches work correctly. So I ran each line 15,000 times to test performance.
Result:
modulo(%) approach (source code)
mean 141.5801887ns | SD 270.0700275ns
bitwise(&) approach (source code)
mean 141.2504ns | SD 193.6351007ns
Questions:
Why is the bitwise (&) approach more stable than the modulo (%) approach?
Does the JVM optimize modulo (%) using AND (&), as described here?
Let's try to reproduce with JMH.
@Benchmark
@Measurement(timeUnit = TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public int first() throws IOException {
return i % 2;
}
@Benchmark
@Measurement(timeUnit = TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public int second() throws IOException {
return i & 0x1;
}
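The snippet above omits the state that holds i. Here is a minimal sketch of the assumed setup; the field value is arbitrary, and the class name is taken from the results below. A non-final instance field in a @State class is what keeps the JIT from constant-folding the value:

import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class MyBenchmark {
    // read by first() and second(); non-final, so it cannot be constant-folded
    int i = 0xDEADBEEF;
    // ... the two benchmark methods from above go here
}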
Okay, it is reproducible. The first is slightly slower than the second. Now let's figure out why. Run it with -prof perfnorm:
Benchmark Mode Cnt Score Error Units
MyBenchmark.first avgt 50 2.674 ± 0.028 ns/op
MyBenchmark.first:CPI avgt 10 0.301 ± 0.002 #/op
MyBenchmark.first:L1-dcache-load-misses avgt 10 0.001 ± 0.001 #/op
MyBenchmark.first:L1-dcache-loads avgt 10 11.011 ± 0.146 #/op
MyBenchmark.first:L1-dcache-stores avgt 10 3.011 ± 0.034 #/op
MyBenchmark.first:L1-icache-load-misses avgt 10 ≈ 10⁻³ #/op
MyBenchmark.first:LLC-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:LLC-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:LLC-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:LLC-stores avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:branch-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:branches avgt 10 4.006 ± 0.054 #/op
MyBenchmark.first:cycles avgt 10 9.322 ± 0.113 #/op
MyBenchmark.first:dTLB-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:dTLB-loads avgt 10 10.939 ± 0.175 #/op
MyBenchmark.first:dTLB-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:dTLB-stores avgt 10 2.991 ± 0.045 #/op
MyBenchmark.first:iTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.first:iTLB-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.first:instructions avgt 10 30.991 ± 0.427 #/op
MyBenchmark.second avgt 50 2.263 ± 0.015 ns/op
MyBenchmark.second:CPI avgt 10 0.320 ± 0.001 #/op
MyBenchmark.second:L1-dcache-load-misses avgt 10 0.001 ± 0.001 #/op
MyBenchmark.second:L1-dcache-loads avgt 10 11.045 ± 0.152 #/op
MyBenchmark.second:L1-dcache-stores avgt 10 3.014 ± 0.032 #/op
MyBenchmark.second:L1-icache-load-misses avgt 10 ≈ 10⁻³ #/op
MyBenchmark.second:LLC-load-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:LLC-loads avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:LLC-store-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:LLC-stores avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:branch-misses avgt 10 ≈ 10⁻⁴ #/op
MyBenchmark.second:branches avgt 10 4.014 ± 0.045 #/op
MyBenchmark.second:cycles avgt 10 8.024 ± 0.098 #/op
MyBenchmark.second:dTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:dTLB-loads avgt 10 10.989 ± 0.161 #/op
MyBenchmark.second:dTLB-store-misses avgt 10 ≈ 10⁻⁶ #/op
MyBenchmark.second:dTLB-stores avgt 10 3.004 ± 0.042 #/op
MyBenchmark.second:iTLB-load-misses avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:iTLB-loads avgt 10 ≈ 10⁻⁵ #/op
MyBenchmark.second:instructions avgt 10 25.076 ± 0.296 #/op
Note the difference in cycles and instructions. Now it is kind of obvious: the first has to care about the sign (in Java, the result of % takes the sign of the dividend), but the second does not, as it is just a bitwise AND. To make sure this is the reason, take a look at the assembly fragments:
first:
0x00007f91111f8355: mov 0xc(%r10),%r11d ;*getfield i
0x00007f91111f8359: mov %r11d,%edx
0x00007f91111f835c: and $0x1,%edx
0x00007f91111f835f: mov %edx,%r10d
0x00007f6bd120a6e2: neg %r10d
0x00007f6bd120a6e5: test %r11d,%r11d
0x00007f6bd120a6e8: cmovl %r10d,%edx
second:
0x00007ff36cbda580: mov $0x1,%edx
0x00007ff36cbda585: mov 0x40(%rsp),%r10
0x00007ff36cbda58a: and 0xc(%r10),%edx
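In Java terms, the extra instructions in the first fragment (neg, test, cmovl) implement exactly that sign correction; a rough source-level equivalent of what the JIT computes for i % 2 (my sketch, not actual compiler output):

int r = i & 1;   // remainder magnitude
if (i < 0) {     // compiled branch-free via cmovl above
    r = -r;      // % must return a result with the sign of the dividend
}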
An execution time of 150 ns is about 500 clock cycles. I don't think there has ever been a processor that went about the business of checking a bit this inefficiently :-).
The problem is that your test harness is flawed in many ways. In particular:
you make no attempt to trigger JIT compilation before starting the clock
System.nanoTime() is not guaranteed to have nanosecond accuracy
System.nanoTime() is quite a bit more expensive to call than the code you want to measure
See How do I write a correct micro-benchmark in Java? for a more complete list of things to keep in mind.
Here's a better benchmark:
public abstract class Benchmark {
final String name;
public Benchmark(String name) {
this.name = name;
}
@Override
public String toString() {
return name + "\t" + time() + " ns / iteration";
}
private BigDecimal time() {
try {
// automatically detect a reasonable iteration count (and trigger just in time compilation of the code under test)
int iterations;
long duration = 0;
for (iterations = 1; iterations < 1_000_000_000 && duration < 1_000_000_000; iterations *= 2) {
long start = System.nanoTime();
run(iterations);
duration = System.nanoTime() - start;
cleanup();
}
// the for-update doubles `iterations` once more after the last measured
// run, so halve it to match the recorded duration
return new BigDecimal(duration * 1000 / (iterations / 2)).movePointLeft(3);
} catch (Throwable e) {
throw new RuntimeException(e);
}
}
/**
* Executes the code under test.
* @param iterations
* number of iterations to perform
* @return any value that requires the entire code to be executed (to
* prevent dead code elimination by the just in time compiler)
* @throws Throwable
* if the test could not complete successfully
*/
protected abstract Object run(int iterations) throws Throwable;
/**
* Cleans up after a run, setting the stage for the next.
*/
protected void cleanup() {
// do nothing
}
public static void main(String[] args) throws Exception {
System.out.println(new Benchmark("%") {
@Override
protected Object run(int iterations) throws Throwable {
int sum = 0;
for (int i = 0; i < iterations; i++) {
sum += i % 2;
}
return sum;
}
});
System.out.println(new Benchmark("&") {
@Override
protected Object run(int iterations) throws Throwable {
int sum = 0;
for (int i = 0; i < iterations; i++) {
sum += i & 1;
}
return sum;
}
});
}
}
On my machine, it prints:
% 0.375 ns / iteration
& 0.139 ns / iteration
So the difference is, as expected, on the order of a couple clock cycles. That is, & 1 was optimized slightly better by this JIT on this particular hardware, but the difference is so small it is extremely unlikely to have a measurable (let alone significant) impact on the performance of your program.
The two operations correspond to different JVM bytecode instructions:
irem // int remainder (%)
iand // bitwise and (&)
Somewhere I read that irem is usually implemented in the JVM runtime, while iand maps directly to a hardware instruction. Oracle explains the two instructions as follows:
iand
An int result is calculated by taking the bitwise AND (conjunction) of value1 and value2.
irem
The int result is value1 - (value1 / value2) * value2.
It seems reasonable to me to assume that iand results in fewer CPU cycles.
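A quick check makes the difference concrete: for a negative operand irem and iand disagree, which is exactly why the JIT cannot blindly replace % 2 with & 1 and has to add sign handling.

public class RemVsAnd {
    public static void main(String[] args) {
        int i = -7;
        // irem: value1 - (value1 / value2) * value2 = -7 - (-3) * 2 = -1
        System.out.println(i % 2);   // prints -1
        // iand: plain bitwise conjunction of the two's-complement bits
        System.out.println(i & 0x1); // prints 1
    }
}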
I know I can simply iterate from start to end and clear those cells, but I was wondering whether there is any faster way (perhaps using the natively implemented System.arraycopy)?
If I got it right, you need to nullify an array, or a sub-range of an array containing references to objects to make them eligible for GC. And you have a regular Java array, which stores data on-heap.
Answering your question: System.arraycopy is the fastest way to null a sub-range of an array. It is worse memory-wise than Arrays.fill, though, since in the worst case you have to allocate a second array of the same size, filled with nulls, to copy from. If you need to null an entire array, however, it is even faster to create a new empty array (e.g. new Object[desiredLength]) and replace the one you want to nullify with it.
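A minimal sketch of both options just described; the NULLS source array and its capacity are my own illustrative assumptions:

// Preallocated all-null source, sized to the largest range we ever clear (assumed bound).
private static final int MAX_RANGE = 1 << 16;
private static final Object[] NULLS = new Object[MAX_RANGE];

// Option 1: null a sub-range with one bulk copy.
static void clearRange(Object[] holder, int from, int len) {
    System.arraycopy(NULLS, 0, holder, from, len);
}

// Option 2: if the whole array must go, replace it instead of clearing it.
static Object[] clearAll(Object[] holder) {
    return new Object[holder.length];
}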
Unsafe, DirectByteBuffer and DirectLongBuffer implementations don't provide any performance gain in a naive, straightforward implementation (i.e. if you just replace the array with a DirectByteBuffer or Unsafe). They are also slower than a bulk System.arraycopy. And since those implementations have nothing to do with a Java array, they're out of scope of your question anyway.
Here's a snippet of my JMH benchmark (the full benchmark code is available via gist). It includes the unsafe.setMemory case as per @apangin's comment, the ByteBuffer.put(long[] src, int srcOffset, int longCount) case as per @jan-chaefer, and an equivalent of the Arrays.fill loop as per @scott-carey, to check whether Arrays.fill is an intrinsic in JDK 8.
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayFill() {
Arrays.fill(objectHolderForFill, null);
}
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayFillManualLoop() {
for (int i = 0, len = objectHolderForLoop.length; i < len; i++) {
objectHolderForLoop[i] = null;
}
}
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayCopy() {
System.arraycopy(nullsArray, 0, objectHolderForArrayCopy, 0,
objectHolderForArrayCopy.length);
}
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directByteBufferManualLoop() {
while (referenceHolderByteBuffer.hasRemaining()) {
referenceHolderByteBuffer.putLong(0);
}
}
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directByteBufferBatch() {
referenceHolderByteBuffer.put(nullBytes, 0, nullBytes.length);
}
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directLongBufferManualLoop() {
while (referenceHolderLongBuffer.hasRemaining()) {
referenceHolderLongBuffer.put(0L);
}
}
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directLongBufferBatch() {
referenceHolderLongBuffer.put(nullLongs, 0, nullLongs.length);
}
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void unsafeArrayManualLoop() {
long addr = referenceHolderUnsafe;
long pos = 0;
for (int i = 0; i < size; i++) {
unsafe.putLong(addr + pos, 0L);
pos += 1 << 3;
}
}
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void unsafeArraySetMemory() {
unsafe.setMemory(referenceHolderUnsafe, size*8, (byte) 0);
}
Here's what I got (Java 1.8, JMH 1.13, Core i3-6100U 2.30 GHz, Win10):
100 elements
Benchmark Mode Cnt Score Error Units
ArrayNullFillBench.arrayCopy sample 5234029 39,518 ± 0,991 ns/op
ArrayNullFillBench.directByteBufferBatch sample 6271334 43,646 ± 1,523 ns/op
ArrayNullFillBench.directLongBufferBatch sample 4615974 45,252 ± 2,352 ns/op
ArrayNullFillBench.arrayFill sample 4745406 76,997 ± 3,547 ns/op
ArrayNullFillBench.arrayFillManualLoop sample 5549216 78,677 ± 13,013 ns/op
ArrayNullFillBench.unsafeArrayManualLoop sample 5980381 78,811 ± 2,870 ns/op
ArrayNullFillBench.unsafeArraySetMemory sample 5985884 85,062 ± 2,096 ns/op
ArrayNullFillBench.directLongBufferManualLoop sample 4697023 116,242 ± 2,579 ns/op <-- wow
ArrayNullFillBench.directByteBufferManualLoop sample 7504629 208,440 ± 10,651 ns/op <-- wow
I skipped all** the loop implementations from further tests
** - except arrayFill and arrayFillManualLoop for scale
1000 elements
Benchmark Mode Cnt Score Error Units
ArrayNullFillBench.arrayCopy sample 6780681 184,516 ± 14,036 ns/op
ArrayNullFillBench.directLongBufferBatch sample 4018778 293,325 ± 4,074 ns/op
ArrayNullFillBench.directByteBufferBatch sample 4063969 313,171 ± 4,861 ns/op
ArrayNullFillBench.arrayFillManualLoop sample 6270397 543,801 ± 20,325 ns/op
ArrayNullFillBench.arrayFill sample 6590416 548,250 ± 13,475 ns/op
10000 elements
Benchmark Mode Cnt Score Error Units
ArrayNullFillBench.arrayCopy sample 2551851 2024,543 ± 12,533 ns/op
ArrayNullFillBench.directLongBufferBatch sample 2958517 4469,210 ± 10,376 ns/op
ArrayNullFillBench.directByteBufferBatch sample 2892258 4526,945 ± 33,443 ns/op
ArrayNullFillBench.arrayFill sample 2578580 5532,063 ± 20,705 ns/op
ArrayNullFillBench.arrayFillManualLoop sample 2562569 5550,195 ± 40,666 ns/op
P.S.
Speaking of ByteBuffer and Unsafe: their main benefit in your case is that they store data off-heap, and you can implement your own memory deallocation algorithm that suits your data structure better than regular GC does. So you wouldn't need to nullify them, and you could compact memory as you please. Most likely the effort won't be worth much, since it's much easier to end up with less performant and more error-prone code than what you have now.
I want to transpose a double[][] matrix with the most compact and efficient expression possible. Right now I have this:
public static Function<double[][], double[][]> transpose() {
return (m) -> {
final int rows = m.length;
final int columns = m[0].length;
double[][] transpose = new double[columns][rows];
range(0, rows).forEach(r -> {
range(0, columns).forEach(c -> {
transpose[c][r] = m[r][c];
});
});
return transpose;
};
}
Thoughts?
You could have:
public static UnaryOperator<double[][]> transpose() {
return m -> {
return range(0, m[0].length).mapToObj(r ->
range(0, m.length).mapToDouble(c -> m[c][r]).toArray()
).toArray(double[][]::new);
};
}
This code does not use forEach but prefers mapToObj and mapToDouble for mapping each row to its transposition. I also changed Function<double[][], double[][]> to UnaryOperator<double[][]>, since the argument and return types are the same.
However, it probably won't be more efficient than a simple for loop like in assylias's answer.
Sample code:
public static void main(String[] args) {
double[][] m = { { 2, 3 }, { 1, 2 }, { -1, 1 } };
double[][] tm = transpose().apply(m);
System.out.println(Arrays.deepToString(tm)); // prints [[2.0, 1.0, -1.0], [3.0, 2.0, 1.0]]
}
I've written a JMH benchmark comparing the code above, the for-loop version, and the code above run in parallel. All three methods are called with random square matrices of size 100, 1000 and 3000. The results show that for small matrices the for-loop version is faster, but with bigger matrices the parallel Stream solution is indeed better in terms of performance (Windows 10, JDK 1.8.0_66, i5-3230M @ 2.60 GHz):
Benchmark (matrixSize) Mode Cnt Score Error Units
StreamTest.forLoopTranspose 100 avgt 30 0,026 ± 0,001 ms/op
StreamTest.forLoopTranspose 1000 avgt 30 14,653 ± 0,205 ms/op
StreamTest.forLoopTranspose 3000 avgt 30 222,212 ± 11,449 ms/op
StreamTest.parallelStreamTranspose 100 avgt 30 0,113 ± 0,007 ms/op
StreamTest.parallelStreamTranspose 1000 avgt 30 7,960 ± 0,207 ms/op
StreamTest.parallelStreamTranspose 3000 avgt 30 122,587 ± 7,100 ms/op
StreamTest.streamTranspose 100 avgt 30 0,040 ± 0,003 ms/op
StreamTest.streamTranspose 1000 avgt 30 14,059 ± 0,444 ms/op
StreamTest.streamTranspose 3000 avgt 30 216,741 ± 5,738 ms/op
Benchmark code:
@Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(3)
public class StreamTest {
private static final UnaryOperator<double[][]> streamTranspose() {
return m -> {
return range(0, m[0].length).mapToObj(r ->
range(0, m.length).mapToDouble(c -> m[c][r]).toArray()
).toArray(double[][]::new);
};
}
private static final UnaryOperator<double[][]> parallelStreamTranspose() {
return m -> {
return range(0, m[0].length).parallel().mapToObj(r ->
range(0, m.length).parallel().mapToDouble(c -> m[c][r]).toArray()
).toArray(double[][]::new);
};
}
private static final Function<double[][], double[][]> forLoopTranspose() {
return m -> {
final int rows = m.length;
final int columns = m[0].length;
double[][] transpose = new double[columns][rows];
for (int r = 0; r < rows; r++)
for (int c = 0; c < columns; c++)
transpose[c][r] = m[r][c];
return transpose;
};
}
@State(Scope.Benchmark)
public static class MatrixContainer {
#Param({ "100", "1000", "3000" })
private int matrixSize;
private double[][] matrix;
@Setup(Level.Iteration)
public void setUp() {
ThreadLocalRandom random = ThreadLocalRandom.current();
matrix = random.doubles(matrixSize).mapToObj(i -> random.doubles(matrixSize).toArray()).toArray(double[][]::new);
}
}
@Benchmark
public double[][] streamTranspose(MatrixContainer c) {
return streamTranspose().apply(c.matrix);
}
@Benchmark
public double[][] parallelStreamTranspose(MatrixContainer c) {
return parallelStreamTranspose().apply(c.matrix);
}
@Benchmark
public double[][] forLoopTranspose(MatrixContainer c) {
return forLoopTranspose().apply(c.matrix);
}
}
As compact and more efficient:
for (int r = 0; r < rows; r++)
for (int c = 0; c < cols; c++)
transpose[c][r] = m[r][c];
Note that if you have a Matrix class that holds a double[][], an alternative option would be to return a view that has the same underlying array but swaps the columns/rows indices. You would save on copying but you may get worse performance on iteration due to worse cache locality.
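A sketch of such a view; the Matrix wrapper here is hypothetical, not something from the question's code:

// Hypothetical wrapper: transposition as an O(1) view over the same backing array.
final class Matrix {
    private final double[][] data;
    private final boolean transposed;

    Matrix(double[][] data) {
        this(data, false);
    }

    private Matrix(double[][] data, boolean transposed) {
        this.data = data;
        this.transposed = transposed;
    }

    // Swapping indices on access is what hurts cache locality:
    // reading a "row" of the view walks a column of the backing array.
    double get(int r, int c) {
        return transposed ? data[c][r] : data[r][c];
    }

    Matrix transpose() {
        return new Matrix(data, !transposed); // no copying
    }
}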
If you assume a rectangular input (as your original code seems to rely on), you could write it as
public static Function<double[][], double[][]> transpose() {
return m -> range(0, m[0].length)
.mapToObj(c->range(0, m.length).mapToDouble(r->m[r][c]).toArray())
.toArray(double[][]::new);
}
This could run in parallel, but I suppose you’d need a damn big matrix to get a benefit from it.
My advice: for simple low-level math you should use plain old for loops instead of the Stream API. You should also benchmark such code very carefully.
As for @Tunaki's benchmark: first, you should not limit a single measurement to 1 microsecond. The results for matrixSize = 100 are complete junk (0,093 ± 0,054 and 0,237 ± 0,134: the error is more than 50%). Note that the time measurement performed before and after each iteration is not magic and takes time too. And such a small interval can easily be spoiled by some Windows service which suddenly wakes up, takes some CPU cycles to check something, then goes to sleep again. I usually set every warmup/measurement time to 500 ms; this value feels comfortable to me.
Second, when testing the Stream API with a very simple payload (such as copying numbers to a primitive array), you should always test with type profile pollution, as it really matters. In a clean benchmark the JIT compiler can inline everything into a single method, because it knows, for example, that after some range call you always call the same mapToObj with the same lambda expression. But in a real application it's not the same. I modified the MatrixContainer class this way:
@State(Scope.Benchmark)
public static class MatrixContainer {
#Param({"true", "false"})
private boolean pollute;
#Param({ "100", "1000", "3000" })
private int matrixSize;
private double[][] matrix;
@Setup(Level.Iteration)
public void setUp() {
ThreadLocalRandom random = ThreadLocalRandom.current();
matrix = random.doubles(matrixSize)
.mapToObj(i -> random.doubles(matrixSize).toArray())
.toArray(double[][]::new);
if(!pollute) return;
// do some seemingly harmless operations which will
// poison JIT compiler type profile with some other lambdas
for(int i=0; i<100; i++) {
range(0, 1000).map(x -> x+2).toArray();
range(0, 1000).map(x -> x+5).toArray();
range(0, 1000).mapToObj(x -> x*2).toArray();
range(0, 1000).mapToObj(x -> x*3).toArray();
}
}
}
Also, I set 5 forks, because for the Stream API the JIT compiler may behave differently from run to run. Compilation goes on in a background thread, and the profiling info at the compilation point may differ due to races, which may change the result of the compilation significantly. So within a fork the results will be the same, but between forks they might be completely different.
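In JMH terms that's a one-annotation change on the benchmark class:

@Fork(5) // instead of @Fork(3) above; each fork gets a fresh JVM and a fresh profile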
My results are (Windows 7, Oracle JVM 8u45 64bit, some not-very-new i5-2410 laptop):
Benchmark (matrixSize) (pollute) Mode Cnt Score Error Units
StreamTest.forLoopTranspose 100 true avgt 50 0,033 ± 0,001 ms/op
StreamTest.forLoopTranspose 100 false avgt 50 0,032 ± 0,001 ms/op
StreamTest.forLoopTranspose 1000 true avgt 50 17,094 ± 0,060 ms/op
StreamTest.forLoopTranspose 1000 false avgt 50 17,065 ± 0,080 ms/op
StreamTest.forLoopTranspose 3000 true avgt 50 260,173 ± 7,855 ms/op
StreamTest.forLoopTranspose 3000 false avgt 50 258,774 ± 7,557 ms/op
StreamTest.streamTranspose 100 true avgt 50 0,096 ± 0,001 ms/op
StreamTest.streamTranspose 100 false avgt 50 0,055 ± 0,012 ms/op
StreamTest.streamTranspose 1000 true avgt 50 21,497 ± 0,439 ms/op
StreamTest.streamTranspose 1000 false avgt 50 15,883 ± 0,265 ms/op
StreamTest.streamTranspose 3000 true avgt 50 272,806 ± 8,534 ms/op
StreamTest.streamTranspose 3000 false avgt 50 260,515 ± 9,159 ms/op
Now the errors are much smaller, and you can see that type pollution makes the stream results worse while not affecting the for-loop results. For matrices around 100x100 the difference is quite significant.
I'm adding an implementation example that includes the parallel switch. I'm curious what you all think of it.
/**
* Returns a {@link UnaryOperator} that transposes the matrix.
*
* Example: {@code transpose(true).apply(m);}
*
* @param parallel
* Whether to perform the transpose concurrently.
*/
public static UnaryOperator<ArrayMatrix> transpose(boolean parallel) {
return (m) -> {
double[][] data = m.getData();
IntStream stream = range(0, m.getColumnDimension());
stream = parallel ? stream.parallel() : stream;
double[][] transpose =
stream.mapToObj(
column -> range(0, data.length).mapToDouble(row -> data[row][column]).toArray())
.toArray(double[][]::new);
return new ArrayMatrix(transpose);
};
}
In Java I need to construct a string of n zeros with n unknown at compile time. Ideally I'd use
String s = new String('0', n);
But no such constructor exists. CharSequence doesn't seem to have a suitable constructor either. So I'm tempted to build my own loop using StringBuilder.
Before I do this and risk getting defenestrated by my boss, could anyone advise: is there a standard way of doing this in Java? In C++, one of the std::string constructors allows this.
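For reference, the StringBuilder loop the question is tempted by is only a few lines:

int n = 8; // example length
StringBuilder sb = new StringBuilder(n); // presize to avoid intermediate resizing
for (int i = 0; i < n; i++) {
    sb.append('0');
}
String s = sb.toString(); // "00000000"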
If you don't mind creating an extra string:
String zeros = new String(new char[n]).replace((char) 0, '0');
Or more explicit (and probably more efficient):
char[] c = new char[n];
Arrays.fill(c, '0');
String zeros = new String(c);
Performance-wise, the Arrays.fill option seems to perform best in most situations, especially for large strings. Using a StringBuilder is quite slow for large strings but efficient for small ones. Using replace is a nice one-liner and performs OK for larger strings, but not as well as fill.
Micro benchmark for different values of n:
Benchmark (n) Mode Samples Score Error Units
c.a.p.SO26504151.builder 1 avgt 3 29.452 ± 1.849 ns/op
c.a.p.SO26504151.builder 10 avgt 3 51.641 ± 12.426 ns/op
c.a.p.SO26504151.builder 1000 avgt 3 2681.956 ± 336.353 ns/op
c.a.p.SO26504151.builder 1000000 avgt 3 3522995.218 ± 422579.979 ns/op
c.a.p.SO26504151.fill 1 avgt 3 30.255 ± 0.297 ns/op
c.a.p.SO26504151.fill 10 avgt 3 32.638 ± 7.553 ns/op
c.a.p.SO26504151.fill 1000 avgt 3 592.459 ± 91.413 ns/op
c.a.p.SO26504151.fill 1000000 avgt 3 706187.003 ± 152774.601 ns/op
c.a.p.SO26504151.replace 1 avgt 3 44.366 ± 5.153 ns/op
c.a.p.SO26504151.replace 10 avgt 3 51.778 ± 2.959 ns/op
c.a.p.SO26504151.replace 1000 avgt 3 1385.383 ± 289.319 ns/op
c.a.p.SO26504151.replace 1000000 avgt 3 1486335.886 ± 1807239.775 ns/op
Create an n-sized char array and convert it to a String:
char[] myZeroCharArray = new char[n];
for(int i = 0; i < n; i++) myZeroCharArray[i] = '0';
String myZeroString = new String(myZeroCharArray);
See StringUtils in Apache Commons Lang
https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#repeat%28java.lang.String,%20int%29
There isn't a standard JDK way, but Apache Commons Lang (almost a de facto standard) has the StringUtils.repeat() method, e.g.:
String s = StringUtils.repeat('x', 5); // s = "xxxxx"
Or use plain old String.format:
int n = 10;
String s = String.format("%" + n + "s", "").replace(' ', '0');
System.out.println(s);
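For what it's worth, on Java 11 and later the standard library covers this directly with String.repeat:

String s = "0".repeat(n); // since Java 11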