I wrote two blocks of time-measurement code. The printed result for t1 is always much bigger than t2. Block 1 and block 2 do exactly the same thing. If I write block 2 before block 1, then the printed result for t2 is much bigger than t1.
I wonder why this happens.
@Test
fun test() {
    val list = (1..100000).toList()

    // block 1
    var t1 = System.nanoTime()
    list.filter { it % 7 == 0 }
    t1 = System.nanoTime() - t1

    // block 2
    var t2 = System.nanoTime()
    list.filter { it % 7 == 0 }
    t2 = System.nanoTime() - t2

    // print
    println(t1)
    println(t2)
}
What you are experiencing is called warm-up. The first requests made to a Kotlin application (or any other JVM-based language) are often substantially slower than the average response time. This warm-up period is caused by lazy class loading and just-in-time compilation.
There are a few ways to measure performance more reliably. One of them is to perform a manual warm-up before the test itself is executed. An even more reliable method is to use a specialized library such as JMH.
Example of manual warmup:
// warmup
for (i in 1..9999) {
    val list = (1..100000).toList()
    list.filter { it % 7 == 0 }
}
// rest of the test
As a side note, Kotlin has built-in functions you can use instead of manually calculating the time difference: measureTimeMillis and measureNanoTime.
It would be used like this:
val time = measureNanoTime {
    list.filter { it % 7 == 0 }
}
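The same effect is easy to reproduce in plain Java (a sketch; exact numbers depend on your JVM): timing the identical work twice usually shows a much larger first measurement, because the first pass pays for class loading and interpreted execution before the JIT kicks in.

```java
import java.util.stream.IntStream;

public class WarmupDemo {
    // Times one pass of the same filtering work the Kotlin example does.
    static long timeOnce() {
        long start = System.nanoTime();
        long count = IntStream.rangeClosed(1, 100_000).filter(i -> i % 7 == 0).count();
        if (count == 0) throw new AssertionError(); // keep the work from being optimised away
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long t1 = timeOnce(); // cold: includes class loading / interpretation
        long t2 = timeOnce(); // warm: usually far smaller
        System.out.println(t1 + " vs " + t2);
    }
}
```

Running this a few times will typically show t1 several times larger than t2, though the exact ratio varies by machine and JVM.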
Related
I am using Scala, with a timing function to time a method:
def timing[T](f: => T): T = {
  val start = System.currentTimeMillis()
  val result = f
  val end = System.currentTimeMillis()
  // print (end - start) here
  result
}
I have a fun() and use the following to time it:
(0 to 10000).map { _ =>
  timing(fun())
}
It is 8 ms on average. Then I use the following and time it again:
(0 to 10000).map { _ =>
  timing(fun())
  Singleton.synchronized(Singleton.i += 1)
  Thread.sleep(50)
  Singleton.synchronized(Singleton.i -= 1)
}

object Singleton {
  var i = 0
}
The timing now shows fun takes 30 ms on average. Very few records are still 8 ms; most are around 30-35 ms.
But the timing is taken entirely outside the synchronized block. How does this happen? How does the synchronization introduce this overhead?
This is just a hypothetical question, but could be a way to get around an issue I have been having.
Imagine you want to run a calculation function based not on the answer, but on the time it takes to calculate. So instead of finding out what a + b is, you wish to keep performing some calculation while time < x seconds.
Look at this pseudo code:
public static void performCalculationsForTime(int seconds)
{
    // Get start time
    long millisStart = System.currentTimeMillis();

    // Perform calculation to find the 1000th digit of PI

    // Check if the given amount of seconds have passed since millisStart
    // If that number of seconds has not passed, redo the 1000th PI digit calculation

    // At this point the time has passed, return from the function.
}
Now I know that I am a horrible, despicable person for using precious CPU cycles simply to let time pass, but what I am wondering is:
A) Is this possible and would JVM start complaining about non-responsiveness?
B) If it is possible, what calculations would be best to try to perform?
Update - Answer:
Based on the answers and comments, the answer seems to be: "Yes, this is possible, but only if it is not done on the Android main UI thread, because the GUI will become unresponsive and the system will throw an ANR after 5 seconds."
A) Is this possible and would JVM start complaining about non-responsiveness?
It is possible, and if you run it in the background, neither JVM nor Dalvik will complain.
B) If it is possible, what calculations would be best to try to perform?
If the objective is just to run some calculation for x seconds, keep adding to a sum until the required time has been reached. Off the top of my head, something like:
public static void performCalculationsForTime(int seconds)
{
    // Get start time
    long secondsStart = System.currentTimeMillis() / 1000;
    long requiredEndTime = secondsStart + seconds;
    float sum = 0;
    while (secondsStart < requiredEndTime) {
        sum = sum + 0.1f;
        secondsStart = System.currentTimeMillis() / 1000;
    }
}
You can, and the JVM won't complain, as long as your code is not part of some complex system that actually tracks thread execution time.
long startTime = System.currentTimeMillis();
while (System.currentTimeMillis() - startTime < 100000) {
    // do something
}
Or even a for loop that checks the time only every 1000 cycles.
for (int i = 0; ; i++) {
    if (i % 1000 == 0 && System.currentTimeMillis() - startTime >= 100000)
        break;
    // do something
}
As for your second question, the answer is probably calculating some value that can always be improved upon, like your PI digits example.
I've written a class that makes a running Java application continue only when the current second is a multiple of 5 (i.e. Calendar.SECOND % 5 == 0).
The class code is presented below. What I'm curious about is: am I doing this the right way? It doesn't seem like an elegant solution, blocking execution like this and getting the instance over and over.
public class Synchronizer {
    private static Calendar c;

    public static void timeInSync() {
        do {
            c = Calendar.getInstance();
        } while (c.get(Calendar.SECOND) % 5 != 0);
    }
}
Synchronizer.timeInSync() is called in another class's constructor and an instance of that class is created at the start of the main method. Then the application runs forever with a TimerTask that's called every 5 seconds.
Is there a cleaner solution for synchronizing the time?
Update:
I think I did not state it clearly, but what I'm looking for here is synchronization with the system time without busy waiting.
So I need to be able to get
12:19:00
12:19:05
12:19:10
...
What you have now is called busy waiting (also sometimes referred to as polling), and yes, it's inefficient both in terms of processor usage and in terms of energy usage. Your code executes whenever the OS allows it, and in doing so it prevents the CPU from doing other work, or, when there is no other work, prevents the CPU from taking a nap, wasting energy (heating the CPU, draining the battery...).
What you should do is put your thread to sleep until the time where you want to do something arrives. This allows the CPU to perform other tasks or go to sleep.
There is a method on java.lang.Thread to do just that: Thread.sleep(long milliseconds) (it also has a cousin taking an additional nanos parameter, but the nanos may be ignored by the VM, and that kind of precision is rarely needed).
So first you determine when you need to do some work. Then you sleep until then. A naive implementation could look like that:
public static void waitUntil(long timestamp) {
    long millis = timestamp - System.currentTimeMillis();
    // return immediately if time is already in the past
    if (millis <= 0)
        return;
    try {
        Thread.sleep(millis);
    } catch (InterruptedException e) {
        throw new RuntimeException(e.getMessage(), e);
    }
}
This works fine if you don't have too strict requirements on hitting the time precisely; you can expect it to return reasonably close to the specified time (a few tens of ms away, probably) if the time isn't too far in the future (a few secs tops). You have, however, no guarantees: occasionally, when the OS is really busy, it may return much later.
A slightly more accurate method is to determine the required sleep time, sleep for half of it, re-evaluate the required sleep, sleep again for half of that, and so on until the remaining time becomes very small, then busy-wait the last few milliseconds.
However System.currentTimeMillis() does not guarantee the actual resolution of time; it may change once every millisecond, but it might as well only change every ten ms by 10 (this depends on the platform). Same goes for System.nanoTime().
Waiting for an exact point in time is not possible in high level programming languages in a multi-tasking environment (practically everywhere nowadays). If you have strict requirements, you need to turn to the operating system specifics to create an interrupt at the specified time and handle the event in the interrupt (that means assembler or at least C for the interrupt handler). You won't need that in most normal applications, a few ms +/- usually don't matter in a game/application.
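The halving strategy described above can be sketched like this (an illustration, not production code; the 2 ms spin threshold is an arbitrary choice):

```java
public class PreciseWait {
    // Sleeps in halves of the remaining time, then spins for the last ~2 ms.
    static void waitUntil(long timestamp) throws InterruptedException {
        long remaining;
        while ((remaining = timestamp - System.currentTimeMillis()) > 2) {
            Thread.sleep(remaining / 2);
        }
        // busy-wait the final couple of milliseconds for better precision
        while (System.currentTimeMillis() < timestamp) { }
    }

    public static void main(String[] args) throws InterruptedException {
        long target = System.currentTimeMillis() + 150;
        waitUntil(target);
        System.out.println("overshoot: " + (System.currentTimeMillis() - target) + " ms");
    }
}
```

Each sleep burns half the remaining interval, so the CPU stays idle for almost the whole wait, and only the final couple of milliseconds cost a spin.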
As @ChrisK suggests, you could simplify by just making a direct call to System.currentTimeMillis().
For example:
long time = 0;
do
{
    time = System.currentTimeMillis();
} while (time % 5000 != 0);
Note that you need to change the comparison value to 5000 because the representation of the time is in milliseconds.
Also, there are possible pitfalls to doing any comparison so directly like this. Since the looping call depends on processor availability, there is a chance that an implementation such as this could make one call that returns
`1411482384999`
and then the next call in the loop returns
`1411482385001`
Meaning that your condition has been skipped by virtue of hardware availability.
If you want to use a built in scheduler, I suggest looking at the answer to a similar question here java: run a function after a specific number of seconds
You should use
System.nanoTime()
instead of
System.currentTimeMillis()
because it measures elapsed time rather than wall-clock system time, so nanoTime is not affected by system clock changes.
public class Synchronizer
{
    public static void timeInSync()
    {
        long lastNanoTime = System.nanoTime();
        long nowTime = System.nanoTime();
        while (nowTime / 1000000 - lastNanoTime / 1000000 < 5000)
        {
            nowTime = System.nanoTime();
        }
    }
}
The first main point is that you should never use busy waiting. In Java you can avoid it by using either Object.wait(timeout) or Thread.sleep(timeout). The latter is more suitable for your case, because your case doesn't involve a monitor lock.
Next, you can use two approaches to wait until your time condition is satisfied. You can either precalculate your whole wait time or wait for small time intervals in loop, checking the condition.
I will illustrate both approaches here:
private static long nextWakeTime(long time) {
    if (time / 1000 % 5 == 0) { // current time is multiple of five seconds
        return time;
    }
    return (time / 1000 / 5 + 1) * 5000;
}

private static void waitUsingCalculatedTime() {
    long currentTime = System.currentTimeMillis();
    long wakeTime = nextWakeTime(currentTime);
    while (currentTime < wakeTime) {
        try {
            System.out.printf("Current time: %d%n", currentTime);
            System.out.printf("Wake time: %d%n", wakeTime);
            System.out.printf("Waiting: %d ms%n", wakeTime - currentTime);
            Thread.sleep(wakeTime - currentTime);
        } catch (InterruptedException e) {
            // ignore
        }
        currentTime = System.currentTimeMillis();
    }
}

private static void waitUsingSmallTime() {
    while (System.currentTimeMillis() / 1000 % 5 != 0) {
        try {
            System.out.printf("Current time: %d%n", System.currentTimeMillis());
            Thread.sleep(100);
        } catch (InterruptedException e) {
            // ignore
        }
    }
}
As you can see, waiting for the precalculated time is more complex, but it is more precise and more efficient (since in the general case it is done in a single iteration). Waiting iteratively in small time intervals is simpler but less efficient and less precise (the precision depends on the selected size of the interval).
Also please note how I calculate if the time condition is satisfied:
(time / 1000 % 5 == 0)
In the first step you need to calculate the seconds, and only then check whether they are a multiple of five. Checking time % 5000 == 0, as suggested in the other answer, is wrong: it is true only during the first millisecond of each fifth second.
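A few quick checks of that calculation (with hypothetical timestamps, just to illustrate the rounding behaviour):

```java
public class WakeTimeCheck {
    // Same calculation as in the answer: round up to the next multiple of 5 seconds.
    static long nextWakeTime(long time) {
        if (time / 1000 % 5 == 0) { // current time is within a multiple-of-5 second
            return time;
        }
        return (time / 1000 / 5 + 1) * 5000;
    }

    public static void main(String[] args) {
        System.out.println(nextWakeTime(4999)); // -> 5000 (round up)
        System.out.println(nextWakeTime(5001)); // -> 5001 (already inside the 5 s mark)
        System.out.println(nextWakeTime(6001)); // -> 10000 (next multiple)
    }
}
```

Note how any millisecond within a multiple-of-five second counts as "in sync", which is exactly what time % 5000 == 0 fails to capture.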
I wrote some Java code to learn more about the Executor framework.
Specifically, I wrote code to verify the Collatz Hypothesis - this says that if you iteratively apply the following function to any integer, you get to 1 eventually:
f(n) = ((n % 2) == 0) ? n/2 : 3*n + 1
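As a compact illustration of that iteration (using plain long arithmetic rather than the BigInteger version shown later, so it can overflow for very large n):

```java
public class CollatzSketch {
    // Applies f(n) repeatedly until n reaches 1, returning the number of steps.
    static int steps(long n) {
        int count = 0;
        while (n != 1) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(steps(6));  // 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1: 8 steps
        System.out.println(steps(27)); // a famously long trajectory: 111 steps
    }
}
```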
CH is still unproven, and I figured it would be a good way to learn about Executor. Each thread is assigned a range [l,u] of integers to check.
Specifically, my program takes 3 arguments: N (the number up to which I want to check CH), RANGESIZE (the length of the interval that a thread has to process), and NTHREADS, the size of the thread pool.
My code works fine, but I saw much less speedup than I expected: on the order of 30% when I went from 1 to 4 threads.
My logic was that the computation is completely CPU-bound, and each subtask (checking CH for a fixed-size range) takes roughly the same time.
Does anyone have ideas as to why I'm not seeing a 3 to 4x increase in speed?
If you could report your runtimes as you increase the number of threads (along with the machine, JVM and OS), that would also be great.
Specifics
Runtimes:
java -d64 -server -cp . Collatz 10000000 1000000 4 => 4 threads, takes 28412 milliseconds
java -d64 -server -cp . Collatz 10000000 1000000 1 => 1 thread, takes 38286 milliseconds
Processor:
Quadcore Intel Q6600 at 2.4GHZ, 4GB. The machine is unloaded.
Java:
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02, mixed mode)
OS:
Linux quad0 2.6.26-2-amd64 #1 SMP Tue Mar 9 22:29:32 UTC 2010 x86_64 GNU/Linux
Code: (I can't get the code to post; I think it's too long for SO's requirements. The source is available on Google Docs.)
import java.math.BigInteger;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
class MyRunnable implements Runnable {
    public int lower;
    public int upper;

    MyRunnable(int lower, int upper) {
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    public void run() {
        for (int i = lower; i <= upper; i++) {
            Collatz.check(i);
        }
        System.out.println("(" + lower + "," + upper + ")");
    }
}

public class Collatz {

    public static boolean check(BigInteger X) {
        if (X.equals(BigInteger.ONE)) {
            return true;
        } else if (X.getLowestSetBit() == 1) {
            // odd
            BigInteger Y = (new BigInteger("3")).multiply(X).add(BigInteger.ONE);
            return check(Y);
        } else {
            BigInteger Z = X.shiftRight(1); // fast divide by 2
            return check(Z);
        }
    }

    public static boolean check(int x) {
        BigInteger X = new BigInteger(new Integer(x).toString());
        return check(X);
    }

    static int N = 10000000;
    static int RANGESIZE = 1000000;
    static int NTHREADS = 4;

    static void parseArgs(String[] args) {
        if (args.length >= 1) {
            N = Integer.parseInt(args[0]);
        }
        if (args.length >= 2) {
            RANGESIZE = Integer.parseInt(args[1]);
        }
        if (args.length >= 3) {
            NTHREADS = Integer.parseInt(args[2]);
        }
    }

    public static void maintest(String[] args) {
        System.out.println("check(1): " + check(1));
        System.out.println("check(3): " + check(3));
        System.out.println("check(8): " + check(8));
        parseArgs(args);
    }

    public static void main(String[] args) {
        long lDateTime = new Date().getTime();
        parseArgs(args);
        List<Thread> threads = new ArrayList<Thread>();
        ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
        for (int i = 0; i < (N / RANGESIZE); i++) {
            Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
            executor.execute(worker);
        }
        executor.shutdown();
        while (!executor.isTerminated()) {
        }
        System.out.println("Finished all threads");
        long fDateTime = new Date().getTime();
        System.out.println("time in milliseconds for checking to " + N + " is " +
                (fDateTime - lDateTime) +
                " (" + N / (fDateTime - lDateTime) + " per ms)");
    }
}
Busy waiting can be a problem:
while (!executor.isTerminated() ) {
}
You can use awaitTermination() instead (note that it throws InterruptedException, which must be handled or declared):
while (!executor.awaitTermination(1, TimeUnit.SECONDS)) {}
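A fuller sketch of the shutdown sequence (with a trivial stand-in task, not the original MyRunnable): awaitTermination blocks the caller for up to the given timeout instead of spinning on isTerminated().

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownDemo {
    static final AtomicInteger done = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 10; i++) {
            executor.execute(done::incrementAndGet); // stand-in for MyRunnable
        }
        executor.shutdown(); // stop accepting tasks; queued tasks still run
        // Blocks instead of busy-waiting; wakes up once per second to re-check.
        while (!executor.awaitTermination(1, TimeUnit.SECONDS)) { }
        System.out.println("completed: " + done.get());
    }
}
```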
You are using BigInteger. It consumes a lot of register space. What you most likely have on the compiler level is register spilling that makes your process memory-bound.
Also note that when you are timing your results you are not taking into account extra time taken by the JVM to allocate threads and work with the thread pool.
You could also have memory conflicts when you are using constant Strings. All strings are stored in a shared string pool and so it may become a bottleneck, unless java is really clever about it.
Overall, I wouldn't advise using Java for this kind of stuff. Using pthreads would be a better way to go for you.
As #axtavt answered, busy waiting can be a problem. You should fix that first, as it is part of the answer, but not all of it. It won't appear to help in your case (on Q6600), because it seems to be bottlenecked at 2 cores for some reason, so another is available for the busy loop and so there is no apparent slowdown, but on my Core i5 it speeds up the 4-thread version noticeably.
I suspect that in the case of the Q6600 your particular app is limited by the amount of shared cache available, or by something else specific to the architecture of that CPU. The Q6600 has two 4MB L2 caches, each shared by a pair of cores, and no L3 cache. On my Core i5, each core has a dedicated L2 cache (256K), and then there is a larger 8MB shared L3 cache. 256K more per-core cache might make a difference... otherwise something else architecture-wise does.
Here is a comparison of a Q6600 running your Collatz.java, and a Core i5 750.
On my work PC, which is also a Q6600 @ 2.4GHz like yours, but with 6GB RAM, Windows 7 64-bit, and JDK 1.6.0_21 (64-bit), here are some basic results:
10000000 500000 1 (avg of three runs): 36982 ms
10000000 500000 4 (avg of three runs): 21252 ms
Faster, certainly, but not completing in a quarter of the time as you would expect, or even half... (though it is roughly a bit more than half; more on that in a moment). Note that in my case I halved the size of the work units, and have a default max heap of 1500m.
At home on my Core i5 750 (4 cores no hyperthreading), 4GB RAM, Windows 7 64-bit, jdk 1.6.0_22 (64-bit):
10000000 500000 1 (avg of 3 runs) 32677 ms
10000000 500000 4 (avg of 3 runs) 8825 ms
10000000 500000 4 (avg of 3 runs) 11475 ms (without the busy wait fix, for reference)
The 4-thread version takes 27% of the time the 1-thread version takes when the busy-wait loop is removed. Much better. Clearly the code can make efficient use of 4 cores...
NOTE: Java 1.6.0_18 and later have modified default heap settings - so my default heap size is almost 1500m on my work PC, and around 1000m on my home PC.
You may want to increase your default heap, just in case garbage collection is happening and slowing your 4 threaded version down a bit. It might help, it might not.
At least in your example, there's a chance your larger work-unit size is skewing your results slightly... halving it may help you get closer to at least a 2x speedup, since 4 threads will be kept busy for a larger portion of the time. I don't think the Q6600 will do much better at this particular task, whether it is the cache or some other inherent architectural thing.
In all cases, I am simply running "java Collatz 10000000 500000 X", where X = the number of threads indicated.
The only changes I made to your Java file were to make one of the println's into a print, so there were fewer line breaks for my runs with 500000 per work unit and I could see more results in my console at once, and I ditched the busy-wait loop, which matters on the i5 750 but didn't make a difference on the Q6600.
You should try using the submit function and then watching the Futures that are returned, checking them to see whether each task has finished.
isTerminated() doesn't become true until after a shutdown.
Future submit(Runnable task)
Submits a Runnable task for execution and returns a Future representing that task.
isTerminated()
Returns true if all tasks have completed following shut down.
Try this...
public static void main(String[] args) {
long lDateTime = new Date().getTime();
parseArgs(args);
List<Thread> threads = new ArrayList<Thread>();
List<Future> futures = new ArrayList<Future>();
ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
for (int i = 0; i < (N / RANGESIZE); i++) {
Runnable worker = new MyRunnable(i * RANGESIZE + 1, (i + 1) * RANGESIZE);
futures.add(executor.submit(worker));
}
boolean done = false;
while (!done) {
for(Future future : futures) {
done = true;
if( !future.isDone() ) {
done = false;
break;
}
}
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
System.out.println("Finished all threads");
long fDateTime = new Date().getTime();
System.out.println("time in milliseconds for checking to " + N + " is " +
(fDateTime - lDateTime) +
" (" + N / (fDateTime - lDateTime) + " per ms)");
System.exit(0);
}
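Instead of polling isDone() in a loop, you can also simply block on each Future with get(), which waits until that task completes (a sketch with a trivial Callable standing in for the range check, not the original MyRunnable):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureJoin {
    static int sumRanges() throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final int base = i;
            futures.add(executor.submit(() -> base * 10)); // stand-in for checking one range
        }
        int total = 0;
        for (Future<Integer> f : futures) {
            total += f.get(); // blocks until this task is done
        }
        executor.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumRanges()); // 0 + 10 + 20 + 30
    }
}
```

This removes both the busy-wait and the sleep-and-poll loop: get() parks the calling thread until the result is available.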
How much is read from ThreadLocal variable slower than from regular field?
More concretely is simple object creation faster or slower than access to ThreadLocal variable?
I assume that it is fast enough that having a ThreadLocal<MessageDigest> instance is much faster than creating an instance of MessageDigest every time. But does that also apply for byte[10] or byte[1000], for example?
Edit: The question is what is really going on when calling ThreadLocal's get()? If that is just a field like any other, then the answer would be "it's always fastest", right?
In 2009, some JVMs implemented ThreadLocal using an unsynchronised HashMap in the Thread.currentThread() object. This made it extremely fast (though not nearly as fast as using a regular field access, of course), as well as ensuring that the ThreadLocal object got tidied up when the Thread died. Updating this answer in 2016, it seems most (all?) newer JVMs use a ThreadLocalMap with linear probing. I am uncertain about the performance of those – but I cannot imagine it is significantly worse than the earlier implementation.
Of course, new Object() is also very fast these days, and the garbage collectors are also very good at reclaiming short-lived objects.
Unless you are certain that object creation is going to be expensive, or you need to persist some state on a thread by thread basis, you are better off going for the simpler allocate when needed solution, and only switching over to a ThreadLocal implementation when a profiler tells you that you need to.
Running unpublished benchmarks, ThreadLocal.get takes around 35 cycles per iteration on my machine. Not a great deal. In Sun's implementation, a custom linear-probing hash map in Thread maps ThreadLocals to values. Because it is only ever accessed by a single thread, it can be very fast.
Allocation of small objects takes a similar number of cycles, although because of cache exhaustion you may get somewhat lower figures in a tight loop.
Construction of MessageDigest is likely to be relatively expensive. It has a fair amount of state and construction goes through the Provider SPI mechanism. You may be able to optimise by, for instance, cloning or providing the Provider.
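The caching pattern under discussion looks roughly like this (a sketch; since Java 8, ThreadLocal.withInitial keeps it concise):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestCache {
    // One MessageDigest per thread, created lazily on the first get().
    static final ThreadLocal<MessageDigest> DIGEST = ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    });

    static byte[] hash(byte[] data) {
        MessageDigest md = DIGEST.get(); // no construction cost after the first call
        md.reset();                      // defensive: clear any partial state
        return md.digest(data);
    }

    public static void main(String[] args) {
        System.out.println(hash("abc".getBytes()).length); // SHA-256 -> 32 bytes
    }
}
```

Each thread pays the Provider SPI construction cost once, and every later hash() call only does the map lookup.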
Just because it may be faster to cache in a ThreadLocal rather than create does not necessarily mean that the system performance will increase. You will have additional overheads related to GC which slows everything down.
Unless your application very heavily uses MessageDigest you might want to consider using a conventional thread-safe cache instead.
Good question, I've been asking myself that recently. To give you definite numbers, the benchmarks below (in Scala, compiled to virtually the same bytecode as the equivalent Java code):
var cnt: String = ""
val tlocal = new java.lang.ThreadLocal[String] {
  override def initialValue = ""
}

def loop_heap_write = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (cnt ne "") cnt = "!"
    i += 1
  }
  cnt
}

def threadlocal = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (tlocal.get eq null) i = until + i + 1
    i += 1
  }
  if (i > until) println("thread local value was null " + i)
}
available here, were performed on an AMD machine with four 2.8 GHz dual-core CPUs and on a quad-core i7 with hyperthreading (2.67 GHz).
These are the numbers:
i7
Specs: Intel i7 2x quad-core @ 2.67 GHz
Test: scala.threads.ParallelTests
Test name: loop_heap_read
Thread num.: 1
Total tests: 200
Run times: (showing last 5)
9.0069 9.0036 9.0017 9.0084 9.0074 (avg = 9.1034 min = 8.9986 max = 21.0306 )
Thread num.: 2
Total tests: 200
Run times: (showing last 5)
4.5563 4.7128 4.5663 4.5617 4.5724 (avg = 4.6337 min = 4.5509 max = 13.9476 )
Thread num.: 4
Total tests: 200
Run times: (showing last 5)
2.3946 2.3979 2.3934 2.3937 2.3964 (avg = 2.5113 min = 2.3884 max = 13.5496 )
Thread num.: 8
Total tests: 200
Run times: (showing last 5)
2.4479 2.4362 2.4323 2.4472 2.4383 (avg = 2.5562 min = 2.4166 max = 10.3726 )
Test name: threadlocal
Thread num.: 1
Total tests: 200
Run times: (showing last 5)
91.1741 90.8978 90.6181 90.6200 90.6113 (avg = 91.0291 min = 90.6000 max = 129.7501 )
Thread num.: 2
Total tests: 200
Run times: (showing last 5)
45.3838 45.3858 45.6676 45.3772 45.3839 (avg = 46.0555 min = 45.3726 max = 90.7108 )
Thread num.: 4
Total tests: 200
Run times: (showing last 5)
22.8118 22.8135 59.1753 22.8229 22.8172 (avg = 23.9752 min = 22.7951 max = 59.1753 )
Thread num.: 8
Total tests: 200
Run times: (showing last 5)
22.2965 22.2415 22.3438 22.3109 22.4460 (avg = 23.2676 min = 22.2346 max = 50.3583 )
AMD
Specs: AMD 8220 4x dual-core @ 2.8 GHz
Test: scala.threads.ParallelTests
Test name: loop_heap_read
Total work: 20000000
Thread num.: 1
Total tests: 200
Run times: (showing last 5)
12.625 12.631 12.634 12.632 12.628 (avg = 12.7333 min = 12.619 max = 26.698 )
Test name: loop_heap_read
Total work: 20000000
Thread num.: 2
Total tests: 200
Run times: (showing last 5)
6.412 6.424 6.408 6.397 6.43 (avg = 6.5367 min = 6.393 max = 19.716 )
Thread num.: 4
Total tests: 200
Run times: (showing last 5)
3.385 4.298 9.7 6.535 3.385 (avg = 5.6079 min = 3.354 max = 21.603 )
Thread num.: 8
Total tests: 200
Run times: (showing last 5)
5.389 5.795 10.818 3.823 3.824 (avg = 5.5810 min = 2.405 max = 19.755 )
Test name: threadlocal
Thread num.: 1
Total tests: 200
Run times: (showing last 5)
200.217 207.335 200.241 207.342 200.23 (avg = 202.2424 min = 200.184 max = 245.369 )
Thread num.: 2
Total tests: 200
Run times: (showing last 5)
100.208 100.199 100.211 103.781 100.215 (avg = 102.2238 min = 100.192 max = 129.505 )
Thread num.: 4
Total tests: 200
Run times: (showing last 5)
62.101 67.629 62.087 52.021 55.766 (avg = 65.6361 min = 50.282 max = 167.433 )
Thread num.: 8
Total tests: 200
Run times: (showing last 5)
40.672 74.301 34.434 41.549 28.119 (avg = 54.7701 min = 28.119 max = 94.424 )
Summary
A ThreadLocal read is around 10-20x slower than a heap read. It also seems to scale well with the number of processors on this JVM implementation and on these architectures.
@Pete is correct: test before you optimise.
I would be very surprised if constructing a MessageDigest has any serious overhead compared to actually using it.
Misusing ThreadLocal can be a source of leaks and dangling references that don't have a clear life cycle. Generally, I don't ever use ThreadLocal without a very clear plan for when a particular resource will be removed.
Here is another test. The results show that ThreadLocal is a bit slower than a regular field, but of the same order: approximately 12% slower.
public class Test {

    private static final int N = 100000000;
    private static int fieldExecTime = 0;
    private static int threadLocalExecTime = 0;

    public static void main(String[] args) throws InterruptedException {
        int execs = 10;
        for (int i = 0; i < execs; i++) {
            new FieldExample().run(i);
            new ThreadLocaldExample().run(i);
        }
        System.out.println("Field avg:" + (fieldExecTime / execs));
        System.out.println("ThreadLocal avg:" + (threadLocalExecTime / execs));
    }

    private static class FieldExample {
        private Map<String, String> map = new HashMap<String, String>();

        public void run(int z) {
            System.out.println(z + "-Running field sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                map.put(s, "a");
                map.remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            fieldExecTime += t;
            System.out.println(z + "-End field sample:" + t);
        }
    }

    private static class ThreadLocaldExample {
        private ThreadLocal<Map<String, String>> myThreadLocal = new ThreadLocal<Map<String, String>>() {
            @Override protected Map<String, String> initialValue() {
                return new HashMap<String, String>();
            }
        };

        public void run(int z) {
            System.out.println(z + "-Running thread local sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                myThreadLocal.get().put(s, "a");
                myThreadLocal.get().remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            threadLocalExecTime += t;
            System.out.println(z + "-End thread local sample:" + t);
        }
    }
}
Output:
0-Running field sample
0-End field sample:6044
0-Running thread local sample
0-End thread local sample:6015
1-Running field sample
1-End field sample:5095
1-Running thread local sample
1-End thread local sample:5720
2-Running field sample
2-End field sample:4842
2-Running thread local sample
2-End thread local sample:5835
3-Running field sample
3-End field sample:4674
3-Running thread local sample
3-End thread local sample:5287
4-Running field sample
4-End field sample:4849
4-Running thread local sample
4-End thread local sample:5309
5-Running field sample
5-End field sample:4781
5-Running thread local sample
5-End thread local sample:5330
6-Running field sample
6-End field sample:5294
6-Running thread local sample
6-End thread local sample:5511
7-Running field sample
7-End field sample:5119
7-Running thread local sample
7-End thread local sample:5793
8-Running field sample
8-End field sample:4977
8-Running thread local sample
8-End thread local sample:6374
9-Running field sample
9-End field sample:4841
9-Running thread local sample
9-End thread local sample:5471
Field avg:5051
ThreadLocal avg:5664
Env:
openjdk version "1.8.0_131"
Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
Ubuntu 16.04 LTS
Build it and measure it.
Also, you only need one ThreadLocal if you encapsulate your message-digesting behaviour into an object. If you need a local MessageDigest and a local byte[1000] for some purpose, create an object with a messageDigest field and a byte[] field, and put that object into the ThreadLocal rather than both individually.
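That suggestion might look like the sketch below (the class and field names are made up for illustration):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestContext {
    final MessageDigest digest;          // per-thread digest instance
    final byte[] buffer = new byte[1000]; // per-thread scratch buffer

    DigestContext() {
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // One ThreadLocal holding both per-thread resources, instead of two.
    static final ThreadLocal<DigestContext> CONTEXT =
            ThreadLocal.withInitial(DigestContext::new);

    public static void main(String[] args) {
        DigestContext ctx = CONTEXT.get();
        System.out.println(ctx.buffer.length + " / " + ctx.digest.getAlgorithm());
    }
}
```

With a single holder object, each thread pays for one ThreadLocal lookup per use instead of two, and the related resources share one life cycle.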