Is modulo slow in Java?

I've been looking at the implementation of ThreadLocal in the JDK, out of curiosity, and I found this:
/**
* Increment i modulo len.
*/
private static int nextIndex(int i, int len) {
return ((i + 1 < len) ? i + 1 : 0);
}
It looks fairly obvious that this could be implemented with a simple return (i + 1) % len, but I think these guys know their stuff. Any idea why they did this?
This code is highly oriented towards performance, with a custom map for holding thread-local mappings, weak references to help the GC be clever, and so on, so I guess this is a matter of performance. Is modulo slow in Java?

% is avoided for performance reasons in this example.
div/rem operations are slower even at the CPU architecture level, not only in Java. For example, the minimum latency of the idiv instruction on Haswell is about 10 cycles, compared to 1 cycle for add.
Let's benchmark using JMH.
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class Modulo {

    @Param("16")
    int len;

    int i;

    @Benchmark
    public int baseline() {
        return i;
    }

    @Benchmark
    public int conditional() {
        return i = (i + 1 < len) ? i + 1 : 0;
    }

    @Benchmark
    public int mask() {
        return i = (i + 1) & (len - 1);
    }

    @Benchmark
    public int mod() {
        return i = (i + 1) % len;
    }
}
Results:
Benchmark (len) Mode Cnt Score Error Units
Modulo.baseline 16 avgt 10 2,951 ± 0,038 ns/op
Modulo.conditional 16 avgt 10 3,517 ± 0,051 ns/op
Modulo.mask 16 avgt 10 3,765 ± 0,016 ns/op
Modulo.mod 16 avgt 10 9,125 ± 0,023 ns/op
As you can see, using % is ~2.6x slower than a conditional expression. The JIT cannot optimize this automatically in the discussed ThreadLocal code, because the divisor (table.length) is variable rather than a compile-time constant.

mod is not that slow in Java. It's implemented with the bytecode instructions irem and frem for ints and floats respectively. The JIT does a good job of optimizing this.
In my benchmarks (see article), irem calls in JDK 1.8 take about 1 nanosecond. That's pretty quick. frem calls are about 3x slower, so use integers where possible.
If you're using non-negative integers (e.g. array indices) and a power-of-2 divisor (e.g. 8 thread locals), then you can use a bit-twiddling trick to get a ~20% performance gain.
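The trick in question, as a sketch of the general idea (not the actual JDK code): replace the remainder with a bitwise AND, which only works for non-negative values and a power-of-two divisor. This is exactly what the mask case in the JMH benchmark above measures.
// Valid only when len is a power of two and i >= 0:
// (i + 1) & (len - 1) is then equivalent to (i + 1) % len.
private static int nextIndexPow2(int i, int len) {
    return (i + 1) & (len - 1);
}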

Related

time complexity: google.common.base.Joiner vs String concatenation

I know that using += on strings in a loop takes O(n^2) time, where n is the number of iterations. But if the loop will run at most 20 times, will that change the time complexity to O(1)? For example,
List<String> strList = new ArrayList<>();
//some operations to add string to strList
for(String str : strList) appendStr += str + ",";
I know that the size of strList will never exceed 20. Also each string in strList will have less than 20 characters.
If the string concatenation in this case still has O(n^2) time complexity, would it be better to use google.common.base.Joiner if I want my algorithm to have a better time complexity?
I have completely erased my previous answer, because the tests that I had were seriously flawed. Here are some updated results and code:
import com.google.common.base.Joiner;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
public class DifferentConcats {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(DifferentConcats.class.getSimpleName())
                .verbosity(VerboseMode.EXTRA)
                .build();
        new Runner(opt).run();
    }

    @Param(value = {"1", "10", "100", "1000", "10000"})
    private int howMany;

    private static final Joiner JOINER = Joiner.on(",");

    @Benchmark
    @Fork(3)
    public String guavaJoiner() {
        List<String> list = new ArrayList<>(howMany);
        for (int i = 0; i < howMany; ++i) {
            list.add("" + i);
        }
        return JOINER.join(list);
    }

    @Benchmark
    @Fork(3)
    public String java9Default() {
        List<String> list = new ArrayList<>(howMany);
        for (int i = 0; i < howMany; ++i) {
            list.add("" + i);
        }
        String result = "";
        for (String s : list) {
            result += s;
        }
        return result;
    }
}
And the results:
Benchmark (howMany) Mode Cnt Score Error Units
DifferentConcats.guavaJoiner 1 avgt 15 62.582 ± 0.756 ns/op
DifferentConcats.java9Default 1 avgt 15 47.209 ± 0.708 ns/op
DifferentConcats.guavaJoiner 10 avgt 15 430.310 ± 4.690 ns/op
DifferentConcats.java9Default 10 avgt 15 377.203 ± 4.071 ns/op
DifferentConcats.guavaJoiner 100 avgt 15 4115.152 ± 38.505 ns/op
DifferentConcats.java9Default 100 avgt 15 4659.620 ± 182.488 ns/op
DifferentConcats.guavaJoiner 1000 avgt 15 43917.367 ± 360.601 ns/op
DifferentConcats.java9Default 1000 avgt 15 362959.115 ± 6604.020 ns/op
DifferentConcats.guavaJoiner 10000 avgt 15 435289.491 ± 5391.097 ns/op
DifferentConcats.java9Default 10000 avgt 15 47132980.336 ± 1152934.498 ns/op
TL;DR
The other, accepted answer is absolutely correct.
In a very pedantic sense, yes: if your input is capped at a fixed size, then any operations performed on that input are effectively constant-time. However, that misses the purpose of such analysis. If you are interested in time complexity, examine how your code behaves in the asymptotic case, not how it behaves for a single specific input.
Even if you cap the size of the list to 20 elements, you're still doing O(n^2) "work" in order to concatenate the elements. Contrast that with a StringBuilder or a higher-level tool such as Joiner, which are designed to be more efficient than repeated concatenation; Joiner only has to do O(n) "work" in order to construct the string you need.
Put simply, there's never a reason to do:
for(String str : strList) appendStr += str + ",";
instead of:
Joiner.on(',').join(strList);
Due to JVM runtime optimizations, it is impossible to state with 100% assurance that Guava's Joiner will always be more efficient; under certain circumstances a plain concatenation will be faster.
That said, prefer Joiner (or similar constructs that use a StringBuilder under the hood) for concatenating collections, since its readability and performance are, in general, better.
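For reference, here is a minimal sketch of the plain-JDK equivalent, a single StringBuilder appended in a loop, which is roughly what Joiner does internally and keeps the work at O(n):
// O(n) alternative to the += loop: one builder, appended in place.
StringBuilder sb = new StringBuilder();
for (String str : strList) {
    sb.append(str).append(','); // same output as the += version, trailing comma included
}
String appendStr = sb.toString();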
I found a great blog post explaining the performance of each concatenation technique in detail: java-string-concatenation-which-way-is-best.
Note: concatenation performance varies with the number of strings to concatenate. For example, to concatenate 1-10 strings, StringBuilder, StringBuffer and the plus operator work best; to concatenate hundreds of strings, Guava's Joiner and Apache's StringUtils also work great.
Please go through the blog above; it explains the performance of the various concatenation techniques very well.
Thanks.

Why is String concatenation faster than String.valueOf for converting an Integer to a String?

I have a benchmark:
@BenchmarkMode(Mode.Throughput)
@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS, batchSize = 1000)
@Measurement(iterations = 40, time = 1, timeUnit = TimeUnit.SECONDS, batchSize = 1000)
public class StringConcatTest {

    private int aInt;

    @Setup
    public void prepare() {
        aInt = 100;
    }

    @Benchmark
    public String emptyStringInt() {
        return "" + aInt;
    }

    @Benchmark
    public String valueOfInt() {
        return String.valueOf(aInt);
    }
}
And here is the result:
Benchmark Mode Cnt Score Error Units
StringConcatTest.emptyStringInt thrpt 40 66045.741 ± 1306.280 ops/s
StringConcatTest.valueOfInt thrpt 40 43947.708 ± 1140.078 ops/s
It shows that concatenating an empty string with an integer is 30% faster than calling String.valueOf(100).
I understand that "" + 100 is converted to
new StringBuilder().append(100).toString()
and the -XX:+OptimizeStringConcat optimization is applied, which makes it fast. What I do not understand is why valueOf itself is slower than the concatenation.
Can someone explain what exactly is happening and why "" + 100 is faster? What magic does OptimizeStringConcat do?
As you've mentioned, the HotSpot JVM has the -XX:+OptimizeStringConcat optimization that recognizes the StringBuilder pattern and replaces it with a highly tuned, hand-written IR graph, while String.valueOf() relies on general compiler optimizations.
I've found the following key differences by analyzing the generated assembly code:
Optimized concat does not zero char[] array created for the result string, while the array created by Integer.toString is cleared after allocation just like any other regular object.
Optimized concat translates digits to chars by a simple addition of the '0' constant, while Integer.getChars uses a table lookup with the related array bounds checks, etc. (see the sketch below).
There are other minor differences in the implementation of PhaseStringOpts::int_getChars vs. Integer.getChars, but I guess they are not that significant for performance.
BTW, if you take a bigger number (e.g. 1234567890), the performance difference will be negligible because of an extra loop in Integer.getChars that converts two digits at once.
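To illustrate the second difference above, here is a deliberately simplified sketch (not the actual HotSpot IR or JDK code) of the two ways a decimal digit can become a char:
// Arithmetic conversion, as the optimized concat effectively does:
static char digitByAddition(int digit) {
    return (char) ('0' + digit); // one add, no memory access
}

// Table lookup, in the spirit of Integer.getChars:
private static final char[] DIGITS = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};

static char digitByLookup(int digit) {
    return DIGITS[digit]; // load from array, plus bounds check
}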

What is the fastest way to set an arbitrary range of elements in a Java array to null?

I know I can simply iterate from start to end and clear those cells, but I was wondering whether there is a faster way (perhaps using the JNI-backed System.arraycopy)?
If I got it right, you need to nullify an array, or a sub-range of an array, containing references to objects in order to make them eligible for GC. And you have a regular Java array, which stores data on-heap.
Answering your question, System.arraycopy is the fastest way to null a sub-range of an array. It is worse memory-wise than Arrays.fill, though, since in the worst case you have to allocate twice as much memory: an array of nulls of the same length to copy from. If you need to null an entire array, it's even faster to just create a new empty array (e.g. new Object[desiredLength]) and replace the one you want to nullify with it.
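As an illustration (a sketch with made-up names, not taken from the benchmark below), the arraycopy approach needs a pre-allocated source array of nulls at least as long as the range you want to clear:
// NULLS is allocated once; object arrays are all-null by default.
// MAX_RANGE_LENGTH is a hypothetical upper bound on the ranges you clear.
private static final Object[] NULLS = new Object[MAX_RANGE_LENGTH];

// Nulls out holder[from .. from + count) by copying nulls over it.
static void clearRange(Object[] holder, int from, int count) {
    System.arraycopy(NULLS, 0, holder, from, count);
}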
Unsafe, DirectByteBuffer and DirectLongBuffer implementations don't provide any performance gain in a naive, straightforward implementation (i.e. if you just replace the array with a DirectByteBuffer or Unsafe). They are slower than bulk System.arraycopy as well. Since those implementations have nothing to do with a Java array, they're out of the scope of your question anyway.
Here's my JMH benchmark snippet (full benchmark code available via gist), including an unsafe.setMemory case as per @apangin's comment, a ByteBuffer.put(long[] src, int srcOffset, int longCount) case as per @jan-chaefer, and an equivalent of an Arrays.fill loop as per @scott-carey, to check whether Arrays.fill could be an intrinsic in JDK 8.
@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayFill() {
    Arrays.fill(objectHolderForFill, null);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayFillManualLoop() {
    for (int i = 0, len = objectHolderForFill.length; i < len; i++) {
        objectHolderForLoop[i] = null;
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayCopy() {
    System.arraycopy(nullsArray, 0, objectHolderForArrayCopy, 0,
            objectHolderForArrayCopy.length);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directByteBufferManualLoop() {
    while (referenceHolderByteBuffer.hasRemaining()) {
        referenceHolderByteBuffer.putLong(0);
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directByteBufferBatch() {
    referenceHolderByteBuffer.put(nullBytes, 0, nullBytes.length);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directLongBufferManualLoop() {
    while (referenceHolderLongBuffer.hasRemaining()) {
        referenceHolderLongBuffer.put(0L);
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directLongBufferBatch() {
    referenceHolderLongBuffer.put(nullLongs, 0, nullLongs.length);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void unsafeArrayManualLoop() {
    long addr = referenceHolderUnsafe;
    long pos = 0;
    for (int i = 0; i < size; i++) {
        unsafe.putLong(addr + pos, 0L);
        pos += 1 << 3;
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void unsafeArraySetMemory() {
    unsafe.setMemory(referenceHolderUnsafe, size * 8, (byte) 0);
}
Here's what I got (Java 1.8, JMH 1.13, Core i3-6100U 2.30 GHz, Win10):
100 elements
Benchmark Mode Cnt Score Error Units
ArrayNullFillBench.arrayCopy sample 5234029 39,518 ± 0,991 ns/op
ArrayNullFillBench.directByteBufferBatch sample 6271334 43,646 ± 1,523 ns/op
ArrayNullFillBench.directLongBufferBatch sample 4615974 45,252 ± 2,352 ns/op
ArrayNullFillBench.arrayFill sample 4745406 76,997 ± 3,547 ns/op
ArrayNullFillBench.arrayFillManualLoop sample 5549216 78,677 ± 13,013 ns/op
ArrayNullFillBench.unsafeArrayManualLoop sample 5980381 78,811 ± 2,870 ns/op
ArrayNullFillBench.unsafeArraySetMemory sample 5985884 85,062 ± 2,096 ns/op
ArrayNullFillBench.directLongBufferManualLoop sample 4697023 116,242 ± 2,579 ns/op <-- wow
ArrayNullFillBench.directByteBufferManualLoop sample 7504629 208,440 ± 10,651 ns/op <-- wow
I skipped all** the loop implementations from further tests
** - except arrayFill and arrayFillManualLoop for scale
1000 elements
Benchmark Mode Cnt Score Error Units
ArrayNullFillBench.arrayCopy sample 6780681 184,516 ± 14,036 ns/op
ArrayNullFillBench.directLongBufferBatch sample 4018778 293,325 ± 4,074 ns/op
ArrayNullFillBench.directByteBufferBatch sample 4063969 313,171 ± 4,861 ns/op
ArrayNullFillBench.arrayFillManualLoop sample 6270397 543,801 ± 20,325 ns/op
ArrayNullFillBench.arrayFill sample 6590416 548,250 ± 13,475 ns/op
10000 elements
Benchmark Mode Cnt Score Error Units
ArrayNullFillBench.arrayCopy sample 2551851 2024,543 ± 12,533 ns/op
ArrayNullFillBench.directLongBufferBatch sample 2958517 4469,210 ± 10,376 ns/op
ArrayNullFillBench.directByteBufferBatch sample 2892258 4526,945 ± 33,443 ns/op
ArrayNullFillBench.arrayFill sample 2578580 5532,063 ± 20,705 ns/op
ArrayNullFillBench.arrayFillManualLoop sample 2562569 5550,195 ± 40,666 ns/op
P.S.
Speaking of ByteBuffer and Unsafe - their main benefit in your case is that they store data off-heap, and you can implement your own memory-deallocation algorithm that suits your data structure better than the regular GC does. So you wouldn't need to nullify them, and you could compact memory as you please. Most likely the effort wouldn't be worth much, since it would be much easier to end up with less performant and more error-prone code than you have now.

When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?

I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams.
However the performance with Streams is always turning out to be worse than traditional loops.
int totalSize = 32768;
int filterValue = 1280;
int[] array = new int[totalSize];
Random rnd = new Random(0);
int loopCount = 10000;
for (int i = 0; i < totalSize; i++) {
// array[i] = rnd.nextInt() % 2560; // Unsorted Data
array[i] = i; // Sorted Data
}
long start = System.nanoTime();
long sum = 0;
for (int j = 0; j < loopCount; j++) {
for (int c = 0; c < totalSize; ++c) {
sum += array[c] >= filterValue ? array[c] : 0;
}
}
long total = System.nanoTime() - start;
System.out.printf("Conditional Operator Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));
start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
for (int c = 0; c < totalSize; ++c) {
if (array[c] >= filterValue) {
sum += array[c];
}
}
}
total = System.nanoTime() - start;
System.out.printf("Branch Statement Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));
start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
sum += Arrays.stream(array).filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));
start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
sum += Arrays.stream(array).parallel().filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Parallel Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));
Output :
For Sorted-Array :
Conditional Operator Time : 294062652 ns, (0.294063 sec)
Branch Statement Time : 272992442 ns, (0.272992 sec)
Streams Time : 806579913 ns, (0.806580 sec)
Parallel Streams Time : 2316150852 ns, (2.316151 sec)
For Un-Sorted Array:
Conditional Operator Time : 367304250 ns, (0.367304 sec)
Branch Statement Time : 906073542 ns, (0.906074 sec)
Streams Time : 1268648265 ns, (1.268648 sec)
Parallel Streams Time : 2420482313 ns, (2.420482 sec)
I tried the same code using List:
list.stream() instead of Arrays.stream(array)
list.get(c) instead of array[c]
Output :
For Sorted-List :
Conditional Operator Time : 860514446 ns, (0.860514 sec)
Branch Statement Time : 663458668 ns, (0.663459 sec)
Streams Time : 2085657481 ns, (2.085657 sec)
Parallel Streams Time : 5026680680 ns, (5.026681 sec)
For Un-Sorted List
Conditional Operator Time : 704120976 ns, (0.704121 sec)
Branch Statement Time : 1327838248 ns, (1.327838 sec)
Streams Time : 1857880764 ns, (1.857881 sec)
Parallel Streams Time : 2504468688 ns, (2.504469 sec)
I referred to a few blogs (this & this) which suggest the same performance issue w.r.t. streams.
I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them? Is there something I'm missing out on?
Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?
In none of the scenarios could I see streams taking advantage of branch prediction (I tried with sorted and unordered streams, but to no avail; they gave more than double the performance impact compared to normal streams).
I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them?
Performance is rarely an issue. It would be unusual for more than 10% of your streams to need to be rewritten as loops to get the performance you need.
Is there something I'm missing out on?
Parallelism is much easier with streams (just call parallelStream()) and possibly more efficient, as it's hard to write efficient concurrent code by hand.
Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?
Your benchmark is flawed in the sense that the code hasn't been JIT-compiled when it starts. I would run the whole test in a loop as JMH does, or simply use JMH.
In none of the scenarios could I see streams taking advantage of branch prediction
Branch prediction is a CPU feature not a JVM or streams feature.
Java is a high level language saving the programmer from considering low level performance optimization.
Never choose a certain approach for performance reasons unless you have proven that this is a problem in your real application.
Your measurements show some negative effect for streams, but the difference is below observability. Therefore, it's not a problem.
Also, this test is a "synthetic" situation, and the code may behave completely differently in a heavy-duty production environment.
Furthermore, the machine code created from your Java (byte) code by the JIT may change in future Java (maintenance) releases and make your measurements obsolete.
In conclusion: Choose the syntax or approach that most expresses your (the programmer's) intention. Keep to that same approach or syntax throughout the program unless you have a good reason to change.
Everything has been said already, but I want to show you how your code could look using JMH.
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

    private final int totalSize = 32_768;
    private final int filterValue = 1_280;
    private final int loopCount = 10_000;
    // private Random rnd;
    private int[] array;

    @Setup
    public void setup() {
        array = IntStream.range(0, totalSize).toArray();
        // rnd = new Random(0);
        // array = rnd.ints(totalSize).map(i -> i % 2560).toArray();
    }

    @Benchmark
    public long conditionalOperatorTime() {
        long sum = 0;
        for (int j = 0; j < loopCount; j++) {
            for (int c = 0; c < totalSize; ++c) {
                sum += array[c] >= filterValue ? array[c] : 0;
            }
        }
        return sum;
    }

    @Benchmark
    public long branchStatementTime() {
        long sum = 0;
        for (int j = 0; j < loopCount; j++) {
            for (int c = 0; c < totalSize; ++c) {
                if (array[c] >= filterValue) {
                    sum += array[c];
                }
            }
        }
        return sum;
    }

    @Benchmark
    public long streamsTime() {
        long sum = 0;
        for (int j = 0; j < loopCount; j++) {
            sum += IntStream.of(array).filter(value -> value >= filterValue).sum();
        }
        return sum;
    }

    @Benchmark
    public long parallelStreamsTime() {
        long sum = 0;
        for (int j = 0; j < loopCount; j++) {
            sum += IntStream.of(array).parallel().filter(value -> value >= filterValue).sum();
        }
        return sum;
    }
}
The results for a sorted array:
Benchmark Mode Cnt Score Error Units
MyBenchmark.branchStatementTime avgt 30 119833793,881 ± 1345228,723 ns/op
MyBenchmark.conditionalOperatorTime avgt 30 118146194,368 ± 1748693,962 ns/op
MyBenchmark.parallelStreamsTime avgt 30 499436897,422 ± 7344346,333 ns/op
MyBenchmark.streamsTime avgt 30 1126768177,407 ± 198712604,716 ns/op
Results for unsorted data:
Benchmark Mode Cnt Score Error Units
MyBenchmark.branchStatementTime avgt 30 534932594,083 ± 3622551,550 ns/op
MyBenchmark.conditionalOperatorTime avgt 30 530641033,317 ± 8849037,036 ns/op
MyBenchmark.parallelStreamsTime avgt 30 489184423,406 ± 5716369,132 ns/op
MyBenchmark.streamsTime avgt 30 1232020250,900 ± 185772971,366 ns/op
I can only say that there are many possible JVM optimizations and maybe branch prediction is also involved. Now it is up to you to interpret the benchmark results.
I will add my 0.02$ here.
I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams
Branch prediction is a CPU feature; it has nothing to do with the JVM. It is needed to keep the CPU pipeline full and ready to do something. Measuring or predicting branch prediction is extremely hard (unless you actually know the EXACT things the CPU will do), and it depends on at least the load the CPU is under right now (which might be a lot more than your program alone).
However the performance with Streams is always turning out to be worse than traditional loops
This statement and the previous one are unrelated. Yes, streams will be slower for simple examples like yours, up to 30% slower, which is OK.
You could measure, for a particular case, how much slower or faster they are via JMH as others have suggested, but that proves only that case and only that load.
At the same time, you might be working with Spring/Hibernate/services, etc., that do things in milliseconds while your streams take nanoseconds, and you worry about the performance? You are questioning the speed of the fastest part of your code? That is, of course, a theoretical concern.
As for your last point, that you tried with sorted and unsorted arrays and got bad results: this is no indication of branch prediction one way or the other. You have no idea at which point the prediction happened, or whether it did at all, unless you can look inside the actual CPU pipelines - which you did not.
How can my Java program run fast?
Long story short, Java programs can be accelerated by:
Multithreading
JIT
Do streams relate to Java program speedup?
Yes!
Note the Collection.parallelStream() and Stream.parallel() methods for multithreading (see the sketch below).
One can write a for loop whose body is too large for the JIT to compile. Lambdas are typically small and can be compiled by the JIT, so there's a possibility to gain performance.
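As a quick sketch of the multithreading point (reusing the array and filterValue from the question, so purely illustrative), the parallel form is a one-call change:
// Sequential vs. parallel: the only difference is the .parallel() call.
int sequentialSum = Arrays.stream(array).filter(v -> v >= filterValue).sum();
int parallelSum   = Arrays.stream(array).parallel().filter(v -> v >= filterValue).sum();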
In what scenario can a stream be faster than a for loop?
Let's take a look at jdk/src/share/vm/runtime/globals.hpp
develop(intx, HugeMethodLimit, 8000,
"Don't compile methods larger than this if "
"+DontCompileHugeMethods")
If you have a long enough loop body, it won't be compiled by the JIT and will run slowly.
If you rewrite such a loop using streams, you'll probably use the map, filter and flatMap methods, which split the code into pieces, and every piece can be small enough to fit under the limit (see the sketch below).
For sure, writing huge methods has other downsides apart from missed JIT compilation.
This scenario is worth considering if, for example, you have a lot of generated code.
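A sketch of that idea (the Order type, its methods and the orders list are hypothetical, only meant to show the shape): a single huge loop body becomes a chain of small lambdas, each of which the JIT can compile on its own.
// Each stage is a small method reference or lambda, well under HugeMethodLimit.
long total = orders.stream()
        .filter(Order::isValid)      // small predicate
        .map(Order::normalize)       // small transformation
        .mapToLong(Order::amount)    // small extractor
        .sum();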
What about branch prediction?
Of course streams take advantage of branch prediction, as any other code does.
However, branch prediction isn't a technique explicitly used to make streams faster, AFAIK.
So, when do I rewrite my loops to streams to achieve the best performance?
Never.
Premature optimization is the root of all evil
©Donald Knuth
Try to optimize the algorithm instead. Streams are an interface for functional-style programming, not a tool to speed up loops.

Fast double to string conversion with given precision

I need to convert a double to a string with a given precision. String.format("%.3f", value) (or DecimalFormat) does the job, but benchmarks show that it is slow. Even Double.toString conversion takes about 1-3 seconds to convert 1 million numbers on my machine.
Are there any better way to do it?
UPDATE: Benchmarking results
Random numbers from 0 to 1000000, results are in operations per millisecond (Java 1.7.0_45), higher is better:
Benchmark Mean Mean error Units
String_format 747.394 13.197 ops/ms
BigDecimal_toPlainString 1349.552 31.144 ops/ms
DecimalFormat_format 1890.917 28.886 ops/ms
Double_toString 3341.941 85.453 ops/ms
DoubleFormatUtil_formatDouble 7760.968 87.630 ops/ms
SO_User_format 14269.388 168.206 ops/ms
UPDATE:
Java 10, +ryu, higher is better:
Mode Cnt Score Error Units
String_format thrpt 20 998.741 ± 52.704 ops/ms
BigDecimal_toPlainString thrpt 20 2079.965 ± 101.398 ops/ms
DecimalFormat_format thrpt 20 2040.792 ± 48.378 ops/ms
Double_toString thrpt 20 3575.301 ± 112.548 ops/ms
DoubleFormatUtil_formatDouble thrpt 20 7206.281 ± 307.348 ops/ms
ryu_doubleToString thrpt 20 9626.312 ± 285.778 ops/ms
SO_User_format thrpt 20 17143.901 ± 1307.685 ops/ms
Disclaimer: I only recommend that you use this if speed is an absolute requirement.
On my machine, the following can do 1 million conversions in about 130ms:
private static final int[] POW10 = {1, 10, 100, 1000, 10000, 100000, 1000000};
public static String format(double val, int precision) {
StringBuilder sb = new StringBuilder();
if (val < 0) {
sb.append('-');
val = -val;
}
int exp = POW10[precision];
long lval = (long)(val * exp + 0.5);
sb.append(lval / exp).append('.');
long fval = lval % exp;
for (int p = precision - 1; p > 0 && fval < POW10[p]; p--) {
sb.append('0');
}
sb.append(fval);
return sb.toString();
}
The code as presented has several shortcomings: it can only handle a limited range of doubles, and it doesn't handle NaNs. The former can be addressed (but only partially) by extending the POW10 array. The latter can be explicitly handled in the code.
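A minimal sketch of how one might guard against those shortcomings, falling back to the JDK for anything the fast path cannot represent (safeFormat and the exact bound are my additions, not part of the original answer):
// Falls back to String.format for NaN, infinities and values too large
// to survive the (long)(val * exp + 0.5) conversion above.
public static String safeFormat(double val, int precision) {
    if (!Double.isFinite(val) || Math.abs(val) > Long.MAX_VALUE / POW10[precision]) {
        return String.format("%." + precision + "f", val); // slow but always correct
    }
    return format(val, precision);
}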
If you don't need thread-safe code, you can re-use the buffer for a little more speed (to avoid recreating a new object each time), such as:
private static final int[] POW10 = {1, 10, 100, 1000, 10000, 100000, 1000000};
private static final StringBuilder BUFFER = new StringBuilder();
public String format( double value, final int precision ) {
final var sb = BUFFER;
sb.setLength( 0 );
if( value < 0 ) {
sb.append( '-' );
value = -value;
}
final int exp = POW10[ precision ];
final long lval = (long) (value * exp + 0.5);
sb.append( lval / exp ).append( '.' );
final long fval = lval % exp;
for( int p = precision - 1; p > 0 && fval < POW10[ p ]; p-- ) {
sb.append( '0' );
}
sb.append( fval );
return sb.toString();
}
If you need both speed and precision, I've developed a fast DoubleFormatUtil class at xmlgraphics-commons: http://xmlgraphics.apache.org/commons/changes.html#version_1.5rc1
You can see the code there:
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/util/DoubleFormatUtil.java?view=markup
It's faster than both DecimalFormat and BigDecimal, as fast as Double.toString, and it's precise and well tested.
It's licensed under Apache License 2.0, so you can use it as you want.
To my knowledge the fastest and most complete implementation is that of Jack Shirazi:
http://web.archive.org/web/20150623133220/http://archive.oreilly.com/pub/a/onjava/2000/12/15/formatting_doubles.html
Code:
Original implementation is no longer available online (http://archive.oreilly.com/onjava/2000/12/15/graphics/DoubleToString.java). An implementation can be found here: https://raw.githubusercontent.com/openxal/openxal/57392be263b98565738d1962ba3b53e5ca60e64e/core/src/xal/tools/text/DoubleToString.java
It provides formatted (number of decimals) and unformatted doubleToString conversion. My observation is that the JDK performance of unformatted conversion has dramatically improved over the years, so the gain there is not so big anymore.
For formatted conversion it still is.
For benchmarkers: It often makes a big difference which kind of doubles are used, e.g. doubles very close to 0.
I haven't benchmarked this, but how about using BigDecimal?
BigDecimal bd = new BigDecimal(value).setScale(3, RoundingMode.HALF_UP);
return bd.toString();
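One caveat worth adding (my note, not part of the original suggestion): new BigDecimal(double) works on the exact binary value of the double, whereas BigDecimal.valueOf(value) goes through the canonical decimal string and usually matches the rounding people expect:
// new BigDecimal(2.675) sees 2.67499999999999982..., so HALF_UP at scale 2 gives "2.67".
// BigDecimal.valueOf(2.675) is equivalent to new BigDecimal("2.675") and gives "2.68".
String exactBinary = new BigDecimal(2.675).setScale(2, RoundingMode.HALF_UP).toPlainString();
String viaDecimal  = BigDecimal.valueOf(2.675).setScale(2, RoundingMode.HALF_UP).toPlainString();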
