I need to convert double to string with given precision. String.format("%.3f", value) (or DecimalFormat) does the job, but benchmarks show that it is slow. Even Double.toString conversion takes about 1-3 seconds to convert 1 million numbers on my machine.
Is there a better way to do it?
UPDATE: Benchmarking results
Random numbers from 0 to 1000000, results are in operations per millisecond (Java 1.7.0_45), higher is better:
Benchmark Mean Mean error Units
String_format 747.394 13.197 ops/ms
BigDecimal_toPlainString 1349.552 31.144 ops/ms
DecimalFormat_format 1890.917 28.886 ops/ms
Double_toString 3341.941 85.453 ops/ms
DoubleFormatUtil_formatDouble 7760.968 87.630 ops/ms
SO_User_format 14269.388 168.206 ops/ms
UPDATE:
Java 10, +ryu, higher is better:
Mode Cnt Score Error Units
String_format thrpt 20 998.741 ± 52.704 ops/ms
BigDecimal_toPlainString thrpt 20 2079.965 ± 101.398 ops/ms
DecimalFormat_format thrpt 20 2040.792 ± 48.378 ops/ms
Double_toString thrpt 20 3575.301 ± 112.548 ops/ms
DoubleFormatUtil_formatDouble thrpt 20 7206.281 ± 307.348 ops/ms
ryu_doubleToString thrpt 20 9626.312 ± 285.778 ops/ms
SO_User_format thrpt 20 17143.901 ± 1307.685 ops/ms
Disclaimer: I only recommend that you use this if speed is an absolute requirement.
On my machine, the following can do 1 million conversions in about 130ms:
private static final int[] POW10 = {1, 10, 100, 1000, 10000, 100000, 1000000};

public static String format(double val, int precision) {
    StringBuilder sb = new StringBuilder();
    if (val < 0) {
        sb.append('-');
        val = -val;
    }
    int exp = POW10[precision];
    long lval = (long) (val * exp + 0.5); // scale up and round half-up
    sb.append(lval / exp).append('.');
    long fval = lval % exp;
    // left-pad the fractional part with zeros up to the requested precision
    for (int p = precision - 1; p > 0 && fval < POW10[p]; p--) {
        sb.append('0');
    }
    sb.append(fval);
    return sb.toString();
}
The code as presented has several shortcomings: it can only handle a limited range of doubles, and it doesn't handle NaNs. The former can be addressed (but only partially) by extending the POW10 array. The latter can be explicitly handled in the code.
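For completeness, here is one way those guards might look. This is only a sketch of my own (not part of the original answer); it assumes falling back to String.format for anything the fast path cannot handle.

// Sketch only: delegate awkward inputs to the JDK before taking the fast path.
public static String safeFormat(double val, int precision) {
    if (Double.isNaN(val) || Double.isInfinite(val)
            || Math.abs(val) > Long.MAX_VALUE / POW10[precision]) {
        // slow but correct fallback for NaN, infinities and huge magnitudes
        return String.format("%." + precision + "f", val);
    }
    return format(val, precision);
}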
If you don't need thread-safe code, you can re-use the buffer for a little more speed (to avoid recreating a new object each time), such as:
private static final int[] POW10 = {1, 10, 100, 1000, 10000, 100000, 1000000};
private static final StringBuilder BUFFER = new StringBuilder();

public String format( double value, final int precision ) {
    final var sb = BUFFER;
    sb.setLength( 0 );
    if( value < 0 ) {
        sb.append( '-' );
        value = -value;
    }
    final int exp = POW10[ precision ];
    final long lval = (long) (value * exp + 0.5);
    sb.append( lval / exp ).append( '.' );
    final long fval = lval % exp;
    for( int p = precision - 1; p > 0 && fval < POW10[ p ]; p-- ) {
        sb.append( '0' );
    }
    sb.append( fval );
    return sb.toString();
}
If you need both speed and precision, I've developed a fast DoubleFormatUtil class at xmlgraphics-commons: http://xmlgraphics.apache.org/commons/changes.html#version_1.5rc1
You can see the code there:
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/util/DoubleFormatUtil.java?view=markup
It's faster than both DecimalFormat and BigDecimal, about as fast as Double.toString, and it's precise and well tested.
It's licensed under Apache License 2.0, so you can use it as you want.
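A usage sketch (my addition; it assumes the formatDouble(double source, int decimals, int precision, StringBuffer target) signature the class had around version 1.5):

// Assumed API: formatDouble(source, decimals, precision, target).
StringBuffer target = new StringBuffer();
DoubleFormatUtil.formatDouble(1234.56789, 3, 3, target);
System.out.println(target); // expected output: 1234.568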
To my knowledge the fastest and most complete implementation is that of Jack Shirazi:
http://web.archive.org/web/20150623133220/http://archive.oreilly.com/pub/a/onjava/2000/12/15/formatting_doubles.html
Code:
The original implementation is no longer available online (http://archive.oreilly.com/onjava/2000/12/15/graphics/DoubleToString.java); a copy can be found here: https://raw.githubusercontent.com/openxal/openxal/57392be263b98565738d1962ba3b53e5ca60e64e/core/src/xal/tools/text/DoubleToString.java
It provides formatted (number of decimals) and unformatted doubleToString conversion. My observation is that the JDK performance of unformatted conversion has improved dramatically over the years, so the gain there is not so big anymore.
For formatted conversion it still is.
For benchmarkers: It often makes a big difference which kind of doubles are used, e.g. doubles very close to 0.
I haven't benchmarked this, but how about using BigDecimal?
// requires java.math.BigDecimal and java.math.RoundingMode
BigDecimal bd = new BigDecimal(value).setScale(3, RoundingMode.HALF_UP);
return bd.toString();
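One caveat worth adding (my note, not the original answer's): new BigDecimal(double) wraps the exact binary value of the double, while BigDecimal.valueOf(double) goes through Double.toString and usually gives the short decimal form you expect.

// new BigDecimal(double) exposes the exact base-2 value before any rounding:
System.out.println(new BigDecimal(0.1));
// 0.1000000000000000055511151231257827021181583404541015625
System.out.println(BigDecimal.valueOf(0.1)); // 0.1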
I am trying to microbenchmark a KMeans program. I am focusing on the Euclidean distance at the moment. I thought that (due to the square root, below) increasing the number of decimal places of each coordinate (x, y) would cause computation time to increase.
Here is how I calculate the Euclidean distance:
Math.sqrt((x - otherPoint.x) * (x - otherPoint.x) + (y - otherPoint.y) * (y - otherPoint.y))
Here are my results from microbenchmarking:
Benchmark (noOfFloatingPoints) (noOfPoints) Mode Cnt Score Error Units
TheBenchmark.toyBenchmark 16 5000 avgt 5 251214.457 ± 40224.490 ns/op
TheBenchmark.toyBenchmark 8 5000 avgt 5 319809.483 ± 560434.712 ns/op
TheBenchmark.toyBenchmark 2 5000 avgt 5 477652.450 ± 1068570.972 ns/op
As you can see, the score actually increases as the number of decimal places decreases! I have tried this on 5000 points, but it remains the same no matter how few or how many points I use.
Why is this the case? I thought that the more floating points, the more computation would be required, especially due to the square root.
To increase the number of decimal places I have created this function:
public static double generateRandomToDecimalPlace(Random rnd,
                                                  int lowerBound,
                                                  int upperBound,
                                                  int decimalPlaces) {
    final double dbl = (rnd.nextDouble() * (upperBound - lowerBound)) + lowerBound;
    return roundAvoid(dbl, decimalPlaces);
}

public static double roundAvoid(double value, int places) {
    double scale = Math.pow(10, places);
    return Math.round(value * scale) / scale;
}
I am randomly generating points between a certain range (-100 to 100) and specific number of decimal points:
#Param({"16", "8", "2"})
public int noOfFloatingPoints;
The type double is a binary, fixed-length data type. It always uses 64 bits to represent a value, no matter how many decimal places your number has. Furthermore, since it is coded in binary, it doesn't even have decimal places: it uses a floating-point representation in base-2 arithmetic.
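A quick illustration of this (my example, not the answerer's): there is no exact base-2 representation of 0.1, so a double stores only the nearest binary approximation.

// Printing more digits exposes the binary approximation the double stores.
System.out.printf("%.20f%n", 0.1);    // 0.10000000000000000555
System.out.println(0.1 + 0.2 == 0.3); // false, for the same reason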
I know that using += on strings in loops takes O(n^2) time, where n is the number of loop iterations. But if the loop runs at most 20 times, does that change the time complexity to O(1)? For example,
List<String> strList = new ArrayList<>();
//some operations to add string to strList
for(String str : strList) appendStr += str + ",";
I know that the size of strList will never exceed 20. Also each string in strList will have less than 20 characters.
If the string concatenation in this case still has O(n^2) time complexity, would it be better to use google.common.base.Joiner if I want my algorithm to have a better time complexity?
I have completely erased my previous answer, because the tests that I had were seriously flawed. Here are some updated results and code:
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
public class DifferentConcats {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(DifferentConcats.class.getSimpleName())
                .verbosity(VerboseMode.EXTRA)
                .build();
        new Runner(opt).run();
    }

    @Param(value = {"1", "10", "100", "1000", "10000"})
    private int howMany;

    private static final Joiner JOINER = Joiner.on(",");

    @Benchmark
    @Fork(3)
    public String guavaJoiner() {
        List<String> list = new ArrayList<>(howMany);
        for (int i = 0; i < howMany; ++i) {
            list.add("" + i);
        }
        return JOINER.join(list);
    }

    @Benchmark
    @Fork(3)
    public String java9Default() {
        List<String> list = new ArrayList<>(howMany);
        for (int i = 0; i < howMany; ++i) {
            list.add("" + i);
        }
        String result = "";
        for (String s : list) {
            result += s;
        }
        return result;
    }
}
And the results:
Benchmark (howMany) Mode Cnt Score Error Units
DifferentConcats.guavaJoiner 1 avgt 15 62.582 ± 0.756 ns/op
DifferentConcats.java9Default 1 avgt 15 47.209 ± 0.708 ns/op
DifferentConcats.guavaJoiner 10 avgt 15 430.310 ± 4.690 ns/op
DifferentConcats.java9Default 10 avgt 15 377.203 ± 4.071 ns/op
DifferentConcats.guavaJoiner 100 avgt 15 4115.152 ± 38.505 ns/op
DifferentConcats.java9Default 100 avgt 15 4659.620 ± 182.488 ns/op
DifferentConcats.guavaJoiner 1000 avgt 15 43917.367 ± 360.601 ns/op
DifferentConcats.java9Default 1000 avgt 15 362959.115 ± 6604.020 ns/op
DifferentConcats.guavaJoiner 10000 avgt 15 435289.491 ± 5391.097 ns/op
DifferentConcats.java9Default 10000 avgt 15 47132980.336 ± 1152934.498 ns/op
TL;DR
The other (accepted) answer is absolutely correct.
In a very pedantic sense, yes: if your input is capped at a fixed size, then any operation performed on that input is effectively constant-time. However, that misses the purpose of such analysis. Examine how your code behaves in the asymptotic case if you are interested in its time complexity, not how it behaves for a single specific input.
Even if you cap the size of the list to 20 elements, you're still doing O(n^2) "work" in order to concatenate the elements. Contrast that with using a StringBuilder or a higher-level tool such as Joiner, which are designed to be more efficient than repeated concatenation. Joiner only has to do O(n) "work" in order to construct the string you need.
Put simply, there's never a reason to do:
for(String str : strList) appendStr += str + ",";
instead of:
Joiner.on(',').join(strList);
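For what it's worth (my note): since Java 8 you don't even need Guava for this; the JDK has equally O(n) options built in.

// Both build the result in a single growing buffer internally.
String joined = String.join(",", strList);
String collected = strList.stream().collect(java.util.stream.Collectors.joining(","));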
It is impossible to state with 100% assurance that Guava's Joiner will work more efficiently, due to JVM runtime optimizations; under certain circumstances plain concatenation will work faster.
That said, prefer Joiner (or similar constructs that use a StringBuilder under the hood) for concatenating collections, since its readability and performance are, in general, better.
I found a great blog post explaining the performance of each concatenation technique in detail: java-string-concatenation-which-way-is-best
Note: concatenation performance varies with the number of strings to concatenate. For example, to concatenate 1-10 strings, StringBuilder, StringBuffer, and the plus operator work best. To concatenate hundreds of strings, Guava's Joiner and Apache's StringUtils library also work great.
Please go through the above blog; it explains the performance of the various concatenation techniques very well.
Thanks.
I have a benchmark:

@BenchmarkMode(Mode.Throughput)
@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS, batchSize = 1000)
@Measurement(iterations = 40, time = 1, timeUnit = TimeUnit.SECONDS, batchSize = 1000)
public class StringConcatTest {

    private int aInt;

    @Setup
    public void prepare() {
        aInt = 100;
    }

    @Benchmark
    public String emptyStringInt() {
        return "" + aInt;
    }

    @Benchmark
    public String valueOfInt() {
        return String.valueOf(aInt);
    }
}
And here is the result:
Benchmark Mode Cnt Score Error Units
StringConcatTest.emptyStringInt thrpt 40 66045.741 ± 1306.280 ops/s
StringConcatTest.valueOfInt thrpt 40 43947.708 ± 1140.078 ops/s
It shows that concatenating an empty string with an integer is 30% faster than calling String.valueOf(100).
I understand that "" + 100 is converted to
new StringBuilder().append(100).toString()
and -XX:+OptimizeStringConcat optimization is applied that makes it fast. What I do not understand is why valueOf itself is slower than concatenation.
Can someone explain what exactly is happening and why "" + 100 is faster. What magic does OptimizeStringConcat make?
As you've mentioned, the HotSpot JVM has the -XX:+OptimizeStringConcat optimization that recognizes the StringBuilder pattern and replaces it with a highly tuned, hand-written IR graph, while String.valueOf() relies on general compiler optimizations.
I've found the following key differences by analyzing the generated assembly code:
Optimized concat does not zero the char[] array created for the result string, while the array created by Integer.toString is cleared after allocation just like any other regular object.
Optimized concat translates digits to chars by simply adding the '0' constant, while Integer.getChars uses a table lookup with the related array bounds checks etc.
There are other minor differences in the implementation of PhaseStringOpts::int_getChars vs. Integer.getChars, but I guess they are not that significant for performance.
BTW, if you take a bigger number (e.g. 1234567890), the performance difference will be negligible because of an extra loop in Integer.getChars that converts two digits at once.
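To illustrate the "two digits at once" idea, the loop is roughly of the following shape (a simplified sketch of the approach, not the actual JDK source of Integer.getChars):

// Digit-pair trick: one division per two digits, both characters
// taken from precomputed lookup tables.
static final char[] DIGIT_TENS = new char[100];
static final char[] DIGIT_ONES = new char[100];
static {
    for (int i = 0; i < 100; i++) {
        DIGIT_TENS[i] = (char) ('0' + i / 10);
        DIGIT_ONES[i] = (char) ('0' + i % 10);
    }
}

static String toStringTwoDigitsAtOnce(int value) { // assumes value >= 0
    char[] buf = new char[10]; // enough for any non-negative int
    int pos = buf.length;
    while (value >= 100) {
        int r = value % 100;
        value /= 100;
        buf[--pos] = DIGIT_ONES[r];
        buf[--pos] = DIGIT_TENS[r];
    }
    buf[--pos] = DIGIT_ONES[value];
    if (value >= 10) {
        buf[--pos] = DIGIT_TENS[value];
    }
    return new String(buf, pos, buf.length - pos);
}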
I've been looking at the implementation of ThreadLocal in the JDK, out of curiosity, and I found this :
/**
 * Increment i modulo len.
 */
private static int nextIndex(int i, int len) {
    return ((i + 1 < len) ? i + 1 : 0);
}
It looks fairly obvious that this could be implemented with a simple return (i + 1) % len, but I think these guys know their stuff. Any idea why they did this?
This code is highly oriented towards performance, with a custom map for holding thread-local mappings, weak references to help the GC be clever, and so on, so I guess this is a matter of performance. Is modulo slow in Java?
% is avoided for performance reasons in this example.
div/rem operations are slower even at the CPU architecture level, not only in Java. For example, the minimum latency of the idiv instruction on Haswell is about 10 cycles, but only 1 cycle for add.
Let's benchmark using JMH.
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class Modulo {

    @Param("16")
    int len;

    int i;

    @Benchmark
    public int baseline() {
        return i;
    }

    @Benchmark
    public int conditional() {
        return i = (i + 1 < len) ? i + 1 : 0;
    }

    @Benchmark
    public int mask() {
        return i = (i + 1) & (len - 1);
    }

    @Benchmark
    public int mod() {
        return i = (i + 1) % len;
    }
}
Results:
Benchmark (len) Mode Cnt Score Error Units
Modulo.baseline 16 avgt 10 2,951 ± 0,038 ns/op
Modulo.conditional 16 avgt 10 3,517 ± 0,051 ns/op
Modulo.mask 16 avgt 10 3,765 ± 0,016 ns/op
Modulo.mod 16 avgt 10 9,125 ± 0,023 ns/op
As you can see, using % is ~2.6x slower than a conditional expression. The JIT cannot optimize this automatically in the ThreadLocal code under discussion, because the divisor (table.length) is variable.
mod is not that slow in Java. It's implemented as the bytecode instructions irem and frem for integers and floats, respectively. The JIT does a good job of optimizing this.
In my benchmarks (see article), irem calls in JDK 1.8 take about 1 nanosecond. That's pretty quick. frem calls are about 3x slower, so use integers where possible.
If you're using natural integers (e.g. array indices) and a power-of-2 divisor (e.g. 8 thread locals), then you can use a bit-twiddling trick to get a 20% performance gain.
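Concretely, the trick is the same masking shown in the mask benchmark above:

// Equivalent to (i + 1) % len only when len is a power of two and i >= 0.
int next = (i + 1) & (len - 1);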
I want to transpose a double[][] matrix with the most compact and efficient expression possible. Right now I have this:
public static Function<double[][], double[][]> transpose() {
    return (m) -> {
        final int rows = m.length;
        final int columns = m[0].length;
        double[][] transpose = new double[columns][rows];
        range(0, rows).forEach(r -> {
            range(0, columns).forEach(c -> {
                transpose[c][r] = m[r][c];
            });
        });
        return transpose;
    };
}
Thoughts?
You could have:
public static UnaryOperator<double[][]> transpose() {
    return m -> {
        return range(0, m[0].length).mapToObj(r ->
            range(0, m.length).mapToDouble(c -> m[c][r]).toArray()
        ).toArray(double[][]::new);
    };
}
This code does not use forEach but prefers mapToObj and mapToDouble for mapping each row to its transposition. I also changed Function<double[][], double[][]> to UnaryOperator<double[][]> since the return type is the same.
However, it probably won't be more efficient than a simple for loop like in assylias's answer.
Sample code:
public static void main(String[] args) {
    double[][] m = { { 2, 3 }, { 1, 2 }, { -1, 1 } };
    double[][] tm = transpose().apply(m);
    System.out.println(Arrays.deepToString(tm)); // prints [[2.0, 1.0, -1.0], [3.0, 2.0, 1.0]]
}
I've written a JMH benchmark comparing the code above, the for loop version, and the code above run in parallel. All three methods are called with random square matrices of size 100, 1000 and 3000. For small matrices the for loop version is faster, but with bigger matrices the parallel Stream solution is indeed better in terms of performance (Windows 10, JDK 1.8.0_66, i5-3230M @ 2.60 GHz):
Benchmark (matrixSize) Mode Cnt Score Error Units
StreamTest.forLoopTranspose 100 avgt 30 0,026 ± 0,001 ms/op
StreamTest.forLoopTranspose 1000 avgt 30 14,653 ± 0,205 ms/op
StreamTest.forLoopTranspose 3000 avgt 30 222,212 ± 11,449 ms/op
StreamTest.parallelStreamTranspose 100 avgt 30 0,113 ± 0,007 ms/op
StreamTest.parallelStreamTranspose 1000 avgt 30 7,960 ± 0,207 ms/op
StreamTest.parallelStreamTranspose 3000 avgt 30 122,587 ± 7,100 ms/op
StreamTest.streamTranspose 100 avgt 30 0,040 ± 0,003 ms/op
StreamTest.streamTranspose 1000 avgt 30 14,059 ± 0,444 ms/op
StreamTest.streamTranspose 3000 avgt 30 216,741 ± 5,738 ms/op
Benchmark code:
@Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(3)
public class StreamTest {

    private static final UnaryOperator<double[][]> streamTranspose() {
        return m -> {
            return range(0, m[0].length).mapToObj(r ->
                range(0, m.length).mapToDouble(c -> m[c][r]).toArray()
            ).toArray(double[][]::new);
        };
    }

    private static final UnaryOperator<double[][]> parallelStreamTranspose() {
        return m -> {
            return range(0, m[0].length).parallel().mapToObj(r ->
                range(0, m.length).parallel().mapToDouble(c -> m[c][r]).toArray()
            ).toArray(double[][]::new);
        };
    }

    private static final Function<double[][], double[][]> forLoopTranspose() {
        return m -> {
            final int rows = m.length;
            final int columns = m[0].length;
            double[][] transpose = new double[columns][rows];
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < columns; c++)
                    transpose[c][r] = m[r][c];
            return transpose;
        };
    }

    @State(Scope.Benchmark)
    public static class MatrixContainer {

        @Param({ "100", "1000", "3000" })
        private int matrixSize;

        private double[][] matrix;

        @Setup(Level.Iteration)
        public void setUp() {
            ThreadLocalRandom random = ThreadLocalRandom.current();
            matrix = random.doubles(matrixSize)
                    .mapToObj(i -> random.doubles(matrixSize).toArray())
                    .toArray(double[][]::new);
        }
    }

    @Benchmark
    public double[][] streamTranspose(MatrixContainer c) {
        return streamTranspose().apply(c.matrix);
    }

    @Benchmark
    public double[][] parallelStreamTranspose(MatrixContainer c) {
        return parallelStreamTranspose().apply(c.matrix);
    }

    @Benchmark
    public double[][] forLoopTranspose(MatrixContainer c) {
        return forLoopTranspose().apply(c.matrix);
    }
}
As compact and more efficient:
for (int r = 0; r < rows; r++)
    for (int c = 0; c < cols; c++)
        transpose[c][r] = m[r][c];
Note that if you have a Matrix class that holds a double[][], an alternative option would be to return a view that has the same underlying array but swaps the columns/rows indices. You would save on copying but you may get worse performance on iteration due to worse cache locality.
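A minimal sketch of that view idea, assuming a hypothetical Matrix wrapper (the names here are illustrative, not from any library):

// Hypothetical wrapper: transpose() returns an O(1) view that swaps the
// row/column indices on every read instead of copying the array.
final class Matrix {
    private final double[][] data;
    private final boolean transposed;

    Matrix(double[][] data) { this(data, false); }

    private Matrix(double[][] data, boolean transposed) {
        this.data = data;
        this.transposed = transposed;
    }

    double get(int r, int c) {
        return transposed ? data[c][r] : data[r][c];
    }

    Matrix transpose() { // no copying, just flips the flag
        return new Matrix(data, !transposed);
    }
}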
If you assume a rectangular input (as your original code seems to rely on), you could write it as
public static Function<double[][], double[][]> transpose() {
    return m -> range(0, m[0].length)
            .mapToObj(c -> range(0, m.length).mapToDouble(r -> m[r][c]).toArray())
            .toArray(double[][]::new);
}
This could run in parallel but I suppose you’d need a damn big matrix to get a benefit from it.
My advice: for simple low-level math, use plain old for loops instead of the Stream API. Also, you should benchmark such code very carefully.
As for @Tunaki's benchmark: first, you should not limit a single measurement to 1 microsecond. The results for matrixSize = 100 are complete junk: 0,093 ± 0,054 and 0,237 ± 0,134; the error is more than 50%. Note that the time measurement performed before and after each iteration is not magic and takes time too, and such a small interval can easily be spoiled by some Windows service which suddenly woke up, took some CPU cycles to check something, then went to sleep again. I usually set every warmup/measurement time to 500 ms; this number looks comfortable to me.
Second, when testing the Stream API with a very simple payload (such as copying numbers to a primitive array), you should always test with type profile pollution, as it really matters. In a clean benchmark the JIT compiler can inline everything into a single method, because it knows, for example, that after some range you always call the same mapToObj with the same lambda expression. But in a real application it's not the same. I modified the MatrixContainer class this way:
@State(Scope.Benchmark)
public static class MatrixContainer {

    @Param({"true", "false"})
    private boolean pollute;

    @Param({ "100", "1000", "3000" })
    private int matrixSize;

    private double[][] matrix;

    @Setup(Level.Iteration)
    public void setUp() {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        matrix = random.doubles(matrixSize)
                .mapToObj(i -> random.doubles(matrixSize).toArray())
                .toArray(double[][]::new);
        if (!pollute) return;
        // do some seemingly harmless operations which will
        // poison the JIT compiler type profile with some other lambdas
        for (int i = 0; i < 100; i++) {
            range(0, 1000).map(x -> x + 2).toArray();
            range(0, 1000).map(x -> x + 5).toArray();
            range(0, 1000).mapToObj(x -> x * 2).toArray();
            range(0, 1000).mapToObj(x -> x * 3).toArray();
        }
    }
}
Also, I set 5 forks, as with the Stream API the JIT compiler may behave differently from run to run. Compilation goes on in a background thread, and the profiling info may differ at the compilation point due to races, which may change the results of compilation significantly. So within a fork the results will be the same, but between forks they might be completely different.
My results are (Windows 7, Oracle JVM 8u45 64bit, some not-very-new i5-2410 laptop):
Benchmark (matrixSize) (pollute) Mode Cnt Score Error Units
StreamTest.forLoopTranspose 100 true avgt 50 0,033 ± 0,001 ms/op
StreamTest.forLoopTranspose 100 false avgt 50 0,032 ± 0,001 ms/op
StreamTest.forLoopTranspose 1000 true avgt 50 17,094 ± 0,060 ms/op
StreamTest.forLoopTranspose 1000 false avgt 50 17,065 ± 0,080 ms/op
StreamTest.forLoopTranspose 3000 true avgt 50 260,173 ± 7,855 ms/op
StreamTest.forLoopTranspose 3000 false avgt 50 258,774 ± 7,557 ms/op
StreamTest.streamTranspose 100 true avgt 50 0,096 ± 0,001 ms/op
StreamTest.streamTranspose 100 false avgt 50 0,055 ± 0,012 ms/op
StreamTest.streamTranspose 1000 true avgt 50 21,497 ± 0,439 ms/op
StreamTest.streamTranspose 1000 false avgt 50 15,883 ± 0,265 ms/op
StreamTest.streamTranspose 3000 true avgt 50 272,806 ± 8,534 ms/op
StreamTest.streamTranspose 3000 false avgt 50 260,515 ± 9,159 ms/op
Now you have much smaller errors and can see that type pollution makes the stream results worse, while it does not affect the for-loop results. For matrices like 100x100 the difference is quite significant.
I'm adding an implementation example that includes the parallel switch. I'm curious what you all think of it.
/**
 * Returns a {@link UnaryOperator} that transposes the matrix.
 *
 * Example: {@code transpose(true).apply(m);}
 *
 * @param parallel
 *            Whether to perform the transpose concurrently.
 */
public static UnaryOperator<ArrayMatrix> transpose(boolean parallel) {
    return (m) -> {
        double[][] data = m.getData();
        IntStream stream = range(0, m.getColumnDimension());
        stream = parallel ? stream.parallel() : stream;
        double[][] transpose =
            stream.mapToObj(
                column -> range(0, data.length).mapToDouble(row -> data[row][column]).toArray())
                .toArray(double[][]::new);
        return new ArrayMatrix(transpose);
    };
}