The javadocs for ThreadPoolExecutor#getActiveCount() say that the method "Returns the approximate number of threads that are actively executing tasks."
What makes this number approximate, and not exact? Will it over or under-report active threads?
Here is the method:
/**
 * Returns the approximate number of threads that are actively
 * executing tasks.
 *
 * @return the number of threads
 */
public int getActiveCount() {
    final ReentrantLock mainLock = this.mainLock;
    mainLock.lock();
    try {
        int n = 0;
        for (Worker w : workers)
            if (w.isLocked())
                ++n;
        return n;
    } finally {
        mainLock.unlock();
    }
}
The method iterates over the worker set and counts the workers whose lock is currently held; a worker holds its lock only while it is executing a task.
By the time counting reaches the end of the list, some of the workers previously counted may have finished. (Or some unused workers may have been given a task.)
But you shouldn't be relying on this knowledge as a client, just the fact that it's a best effort approximation. Note that this "inaccuracy" isn't a result of sloppy implementation, it's inherent in every truly multi-threaded system. In such systems there's no global moment of "present". Even if you stop all the workers to count them, by the time you return the result, it may be inaccurate.
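To see this snapshot behaviour, here is a small, self-contained demo (my own sketch, not JDK code): it polls getActiveCount() while sleeping tasks finish, and the printed value may already be stale by the time it appears.

import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class ActiveCountDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { }
            });
        }
        for (int i = 0; i < 5; i++) {
            // The count is a snapshot: workers may finish (or pick up queued
            // tasks) between the count being taken and this line printing.
            System.out.println("active ~ " + pool.getActiveCount());
            Thread.sleep(50);
        }
        pool.shutdown();
    }
}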
I have recently begun to learn the CodaHale/DropWizard Metrics library. I cannot understand how the Meter class is thread-safe (it is, according to the documentation), especially the mark() and tickIfNecessary() methods here:
https://github.com/dropwizard/metrics/blob/3.2-development/metrics-core/src/main/java/com/codahale/metrics/Meter.java#L54-L77
public void mark(long n) {
    tickIfNecessary();
    count.add(n);
    m1Rate.update(n);
    m5Rate.update(n);
    m15Rate.update(n);
}
private void tickIfNecessary() {
    final long oldTick = lastTick.get();
    final long newTick = clock.getTick();
    final long age = newTick - oldTick;
    if (age > TICK_INTERVAL) {
        final long newIntervalStartTick = newTick - age % TICK_INTERVAL;
        if (lastTick.compareAndSet(oldTick, newIntervalStartTick)) {
            final long requiredTicks = age / TICK_INTERVAL;
            for (long i = 0; i < requiredTicks; i++) {
                m1Rate.tick();
                m5Rate.tick();
                m15Rate.tick();
            }
        }
    }
}
I can see that there is a lastTick of type AtomicLong, but there can still be a situation where the m1-m15 rates take a little longer to tick, so another thread could invoke those ticks as well, as part of the next TICK_INTERVAL. Wouldn't that be a race condition, since the tick() method of the rates is not synchronized at all? https://github.com/dropwizard/metrics/blob/3.2-development/metrics-core/src/main/java/com/codahale/metrics/EWMA.java#L86-L95
public void tick() {
    final long count = uncounted.sumThenReset();
    final double instantRate = count / interval;
    if (initialized) {
        rate += (alpha * (instantRate - rate));
    } else {
        rate = instantRate;
        initialized = true;
    }
}
Thanks,
Marian
It is thread-safe because this line from tickIfNecessary() returns true only once per newIntervalStartTick:
if (lastTick.compareAndSet(oldTick, newIntervalStartTick))
What happens if two threads enter tickIfNecessary() at almost the same time?
Both threads read the same value from oldTick, decide that at least TICK_INTERVAL nanoseconds have passed and calculate a newIntervalStartTick.
Now both threads try to do lastTick.compareAndSet(oldTick, newIntervalStartTick). As the name compareAndSet implies, this method compares the current value of lastTick to oldTick, and only if the value equals oldTick is it atomically replaced with newIntervalStartTick, returning true.
Since this is an atomic instruction (at the hardware level!), only one thread can succeed. When the other thread executes this method, it already sees newIntervalStartTick as the current value of lastTick. Since this value no longer matches oldTick, the update fails, the method returns false, and therefore that thread does not call m1Rate.tick() through m15Rate.tick().
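To make this concrete, here is a minimal, runnable sketch (my own illustration, not code from the Metrics library) showing that exactly one of two racing threads wins a compareAndSet on an AtomicLong:

import java.util.concurrent.atomic.AtomicLong;

public class CasWinnerDemo {
    public static void main(String[] args) throws InterruptedException {
        final AtomicLong lastTick = new AtomicLong(0);
        final long oldTick = lastTick.get();
        final long newIntervalStartTick = 100; // pretend a new interval started
        Runnable attempt = () -> {
            if (lastTick.compareAndSet(oldTick, newIntervalStartTick)) {
                System.out.println(Thread.currentThread().getName() + " won the CAS");
            } else {
                System.out.println(Thread.currentThread().getName() + " lost the CAS");
            }
        };
        Thread t1 = new Thread(attempt, "t1");
        Thread t2 = new Thread(attempt, "t2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Exactly one thread prints "won"; the loser already sees the new
        // value, so its compareAndSet fails and it would skip the tick() calls.
    }
}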
The EWMA.update(n) method uses a java.util.concurrent.atomic.LongAdder to accumulate the event counts, which gives similar thread-safety guarantees.
As far as I can see, you are right. If tickIfNecessary() is called with age > TICK_INTERVAL while another call is still running, it is possible that m1Rate.tick() and the other tick() methods are called at the same time from multiple threads. So it boils down to whether tick() and the routines/operations it calls are safe.
Let's dissect tick():
public void tick() {
    final long count = uncounted.sumThenReset();
    final double instantRate = count / interval;
    if (initialized) {
        rate += (alpha * (instantRate - rate));
    } else {
        rate = instantRate;
        initialized = true;
    }
}
alpha and interval are set only on instance initialization and are marked final, and are thus thread-safe since they are read-only. count and instantRate are local and thus not visible to other threads anyway. rate and initialized are marked volatile, so their writes should always be visible to subsequent reads.
If I'm not wrong, pretty much everything from the first read of initialized to the last write to either initialized or rate is open to races, but some of them are without effect, like when two threads race to switch initialized to true.
It seems the majority of consequential races can happen in rate += (alpha * (instantRate - rate));, especially dropped or mixed calculations like the following:
Assumed: initialized is true
Thread1: calculates count, instantRate, checks initialized, does the first read of rate which we call previous_rate and for whatever reason stalls
Thread2: calculates count, instantRate, checks initialized, and calculates rate += (alpha * (instantRate - rate));
Thread1: continues its operation and calculates rate += (alpha * (instantRate - previous_rate)); using its stale previous_rate, thereby overwriting Thread2's update
A drop would occur if the reads and writes somehow get ordered such that rate is read by all threads first and then written by all threads, effectively dropping one or more calculations.
But the probability of such races (both threads seeing age > TICK_INTERVAL so that they run into the same tick() method, and especially both executing rate += (alpha * (instantRate - rate)) at the same time) may be extremely low and, depending on the values, not noticeable.
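To illustrate why such a compound update on a volatile field can lose writes, here is a small, self-contained sketch (my own, unrelated to the Metrics codebase). If the updates were atomic it would always print 200000.0; real runs typically print less:

public class VolatileLostUpdateDemo {
    private static volatile double rate = 0.0;

    public static void main(String[] args) throws InterruptedException {
        Runnable updater = () -> {
            for (int i = 0; i < 100000; i++) {
                rate += 1.0; // read, add, write: three separate steps
            }
        };
        Thread t1 = new Thread(updater);
        Thread t2 = new Thread(updater);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Updates interleaved between another thread's read and write are
        // overwritten, so the total is usually below 200000.0.
        System.out.println("rate = " + rate);
    }
}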
The mark() method seems to be thread-safe as long as the LongAdderProxy uses a thread-safe data structure for update/add, and for sumThenReset in the tick() method.
I think the only ones who can answer the questions left open (whether the races are without noticeable effect or otherwise mitigated) are the project authors, or people who have in-depth knowledge of these parts of the project and the values calculated.
I want to find all the prime numbers from 0 to 1000000. For that I wrote this stupid method:
public static boolean isPrime(int n) {
    for (int i = 2; i < n; i++) {
        if (n % i == 0)
            return false;
    }
    return true;
}
It's good enough for me and doesn't need any edits. Then I wrote the following code:
private static ExecutorService executor = Executors.newFixedThreadPool(10);
private static AtomicInteger counter = new AtomicInteger(0);
private static AtomicInteger numbers = new AtomicInteger(0);

public static void main(String args[]) {
    long start = System.currentTimeMillis();
    while (numbers.get() < 1000000) {
        final int number = numbers.getAndIncrement(); // (1) - fast
        executor.submit(new Runnable() {
            @Override
            public void run() {
                // int number = numbers.getAndIncrement(); // (2) - slow
                if (Main.isPrime(number)) {
                    System.out.println("Ts: " + new Date().getTime() + " " + Thread.currentThread() + ": " + number + " is prime!");
                    counter.incrementAndGet();
                }
            }
        });
    }
    executor.shutdown();
    try {
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        System.out.println("Primes: " + counter);
        System.out.println("Delay: " + (System.currentTimeMillis() - start));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Please pay attention to the lines marked (1) and (2). When (1) is enabled the program runs fast, but when (2) is enabled instead, it runs slower.
The output shows small bursts of results separated by large delays:
Ts: 1480489699692 Thread[pool-1-thread-9,5,main]: 350431 is prime!
Ts: 1480489699692 Thread[pool-1-thread-6,5,main]: 350411 is prime!
Ts: 1480489699692 Thread[pool-1-thread-4,5,main]: 350281 is prime!
Ts: 1480489699692 Thread[pool-1-thread-5,5,main]: 350257 is prime!
Ts: 1480489699693 Thread[pool-1-thread-7,5,main]: 350447 is prime!
Ts: 1480489711996 Thread[pool-1-thread-6,5,main]: 350503 is prime!
and threads get the same value for number:
Ts: 1480489771083 Thread[pool-1-thread-8,5,main]: 384733 is prime!
Ts: 1480489712745 Thread[pool-1-thread-6,5,main]: 384733 is prime!
Please explain to me why option (2) is slower, and why threads get the same value for number even though AtomicInteger is thread-safe.
In the (2) case, up to 11 threads (the ten from the ExecutorService plus the main thread) are contending for access to the AtomicInteger, whereas in case (1) only the main thread accesses it. In fact, for case (1) you could use int instead of AtomicInteger.
The AtomicInteger class makes use of compare-and-swap (CAS) operations. It reads the value, computes the incremented value, and then swaps it into the variable only if the variable still holds the value that was originally read (compare and swap). If another thread has changed the value in the meantime, it retries from the start: read, increment, compare-and-swap, until it is successful.
The advantage is that this is lockless, and therefore potentially faster than using locks. But it performs poorly under heavy contention. More contention means more retries.
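For illustration, here is roughly what that retry loop looks like when written by hand against the public AtomicInteger API (a sketch of the idea; the actual getAndIncrement implementation differs across JDK versions):

import java.util.concurrent.atomic.AtomicInteger;

public class CasIncrement {
    static int getAndIncrement(AtomicInteger value) {
        while (true) {
            int current = value.get();                 // read
            int next = current + 1;                    // modify
            if (value.compareAndSet(current, next)) {  // swap if unchanged
                return current;                        // success: old value
            }
            // Another thread changed the value in between: retry.
        }
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        System.out.println(getAndIncrement(counter)); // prints 0
        System.out.println(counter.get());            // prints 1
    }
}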
Edit
As @teppic points out, another problem makes case (2) slower than case (1). As the increment of numbers happens in the posted jobs, the loop condition remains true for much longer than needed. While all 10 threads of the executor are churning away to determine whether their given number is prime, the main thread keeps posting new jobs to the executor. These new jobs don't get an opportunity to increment numbers until the preceding jobs are done. So while they sit on the queue, numbers does not increase, and the main thread can meanwhile complete one or more loop iterations, posting yet more jobs. The end result is that many more jobs can be created and posted than the needed 1000000.
Your outer loop is:
while (numbers.get() < 1000000)
This allows you to continue submitting more Runnables than intended to the ExecutorService in the main thread.
You could try changing the loop to: for (int i = 0; i < 1000000; i++), as sketched below.
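A sketch of that fix (my own rewrite, assuming the same counter field and isPrime method as in the question): each job takes its number from the loop variable, so nothing increments a shared counter and exactly 1000000 jobs are submitted:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedSubmit {
    private static ExecutorService executor = Executors.newFixedThreadPool(10);
    private static AtomicInteger counter = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 1000000; i++) {
            final int number = i; // each job owns its number; no shared counter
            executor.submit(() -> {
                if (isPrime(number)) {
                    counter.incrementAndGet();
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        System.out.println("Primes: " + counter);
    }

    static boolean isPrime(int n) {
        for (int i = 2; i < n; i++) {
            if (n % i == 0)
                return false;
        }
        return true;
    }
}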
(As others have mentioned you are obviously increasing the amount of contention, but I suspect the extra worker threads are a larger factor in the slowdown you are seeing.)
As for your second question, I'm pretty sure that it is against the contract of AtomicInteger for two child threads to see the same value of getAndIncrement. So something else must be going on which I am not seeing from your code sample. Might it be that you are seeing output from two separate runs of the program?
Why is option (2) slower?
Simply because you do the increment inside run(), so multiple threads try to do it at the same time, and hence there are waits and releases. @bowmore has given a low-level explanation.
In (1) it is sequential, so no such scenario arises.
Why do threads get the same value for number despite AtomicInteger being thread-safe?
I don't see any way for this to happen. If there were such a case, it would happen from 0.
You are missing two main points here: what AtomicInteger is for, and how multithreading works in general.
Regarding why option (2) is slower, @bowmore provided an excellent answer already.
Now regarding printing the same number twice: AtomicInteger is like any other object. You launch your threads, and they check the value of this object. Since they compete with your main thread, which increments the counter, two child threads may still see the same value. I would pass an int to each Runnable to avoid that.
I have a requirement for a class method to be called every 50 milliseconds. I don't use Thread.sleep because it's very important that the call happens as precisely as possible to the millisecond, whereas sleep only guarantees a minimum time. The basic set-up is this:
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ClassA {
    public void setup() {
        ScheduledExecutorService se = Executors.newScheduledThreadPool(20);
        se.scheduleAtFixedRate(this::onCall, 2000, 50, TimeUnit.MILLISECONDS);
    }

    protected void onCall() { // no-arg, so the method reference fits Runnable
        // do something
    }
}
Now this by and large works fine. I have put System.out.println(System.nanoTime()) in onCall to check that it's being called as precisely as I hope. I have found that there is a drift of 1-5 milliseconds over the course of hundreds of calls, which corrects itself now and again.
A 5 ms drift unfortunately is pretty hefty for me. A 1 ms drift is OK, but at 5 ms it messes up the calculation I'm doing in onCall because of the states of other objects. It would be almost OK if I could get the scheduler to auto-correct, so that if it's 5 ms late on one call, the next one happens in 45 ms instead of 50.
My question is: is there a more precise way to achieve this in Java? The only solution I can think of at the moment is to call a check method every 1 ms and test whether it's at the 50 ms mark. But then I'd need extra logic for the off-chance that the precise 50 ms mark is missed (49, 51).
Thanks
Can I achieve a guaranteed sleep time on a thread?
Sorry, but No.
There is no way to get reliable, precise delay timing in a Java SE JVM. You need to use a real-time Java implementation running on a real-time operating system.
Here are a couple of reasons why Java SE on a normal OS cannot do this.
At certain points, the GC in a Java SE JVM needs to "stop the world". While this is happening, no user thread can run. If your timer goes off in a "stop the world" pause, it can't be scheduled until the pause is over.
Scheduling of threads in a JVM is actually done by the host operating system. If the system is busy, the host OS may decide not to schedule the JVM's threads when your application needs this to happen.
The java.util.Timer.scheduleAtFixedRate approach is probably as good as you will get on Java SE. It should address long-term drift, but you can't get rid of the "jitter". And that jitter could easily be hundreds of milliseconds ... or even seconds.
Spinlocks won't help if the system is busy and the OS is preempting or not scheduling your threads. (And spinlocking in user code is wasteful ...)
According to the comment, the primary goal is not to concurrently execute multiple tasks at this precise interval. Instead, the goal is to execute a single task at this interval as precisely as possible.
Unfortunately, neither the ScheduledExecutorService nor any manual constructs involving Thread#sleep or LockSupport#parkNanos are very precise in that sense. And as pointed out in the other answers: There may always be influencing factors that are beyond your control - namely, details of the JVM implementation, garbage collection, JIT runs etc.
Nevertheless, a comparatively simple approach to achieve high precision here is busy waiting. (This was already mentioned in an answer that is now deleted.) But of course, this has several caveats. Most importantly, it will burn the processing resources of one CPU. (And on a single-CPU system, this may be particularly bad.)
But in order to show that it may be far more precise than other waiting approaches, here is a simple comparison of the ScheduledExecutorService approach and busy waiting:
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PreciseSchedulingTest
{
    public static void main(String[] args)
    {
        long periodMs = 50;
        PreciseSchedulingA a = new PreciseSchedulingA();
        a.setup(periodMs);
        PreciseSchedulingB b = new PreciseSchedulingB();
        b.setup(periodMs);
    }
}

class CallTracker implements Runnable
{
    String name;
    long expectedPeriodMs;
    long baseTimeNs;
    long callTimesNs[];
    int numCalls;
    int currentCall;

    CallTracker(String name, long expectedPeriodMs)
    {
        this.name = name;
        this.expectedPeriodMs = expectedPeriodMs;
        this.baseTimeNs = System.nanoTime();
        this.numCalls = 50;
        this.callTimesNs = new long[numCalls];
    }

    @Override
    public void run()
    {
        callTimesNs[currentCall] = System.nanoTime();
        currentCall++;
        if (currentCall == numCalls)
        {
            currentCall = 0;
            double maxErrorMs = 0;
            for (int i = 1; i < numCalls; i++)
            {
                long ns = callTimesNs[i] - callTimesNs[i - 1];
                double ms = ns * 1e-6;
                double errorMs = ms - expectedPeriodMs;
                if (Math.abs(errorMs) > Math.abs(maxErrorMs))
                {
                    maxErrorMs = errorMs;
                }
                //System.out.println(errorMs);
            }
            System.out.println(name + ", maxErrorMs : " + maxErrorMs);
        }
    }
}

class PreciseSchedulingA
{
    public void setup(long periodMs)
    {
        CallTracker callTracker = new CallTracker("A", periodMs);
        ScheduledExecutorService se = Executors.newScheduledThreadPool(20);
        se.scheduleAtFixedRate(callTracker, periodMs,
            periodMs, TimeUnit.MILLISECONDS);
    }
}

class PreciseSchedulingB
{
    public void setup(long periodMs)
    {
        CallTracker callTracker = new CallTracker("B", periodMs);
        Thread thread = new Thread(new Runnable()
        {
            @Override
            public void run()
            {
                while (true)
                {
                    long periodNs = periodMs * 1000 * 1000;
                    long endNs = System.nanoTime() + periodNs;
                    while (System.nanoTime() < endNs)
                    {
                        // Busy waiting...
                    }
                    callTracker.run();
                }
            }
        });
        thread.setDaemon(true);
        thread.start();
    }
}
Again, this should be taken with a grain of salt, but the results on My Machine® are as follows:
A, maxErrorMs : 1.7585339999999974
B, maxErrorMs : 0.06753599999999693
A, maxErrorMs : 1.7669149999999973
B, maxErrorMs : 0.007193999999998368
A, maxErrorMs : 1.7775299999999987
B, maxErrorMs : 0.012780999999996823
showing that the error of the busy-waiting approach is in the range of a few microseconds.
In order to apply such an approach in practice, a more sophisticated infrastructure would be necessary, e.g. the bookkeeping needed to compensate for waiting times that were too long. (I think they can't be too short.) Also, all this still does not guarantee precisely timed execution. But it may be an option to consider, at least.
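As a sketch of that bookkeeping idea (my own addition, with the same caveats as the demo above): waiting against absolute deadlines means that a late iteration is followed by a correspondingly shorter wait, so the error does not accumulate:

public class FixedRateBusyWait {
    public static void main(String[] args) {
        final long periodNs = 50_000_000L; // 50 ms
        long nextNs = System.nanoTime() + periodNs;
        for (int i = 0; i < 10; i++) {
            while (System.nanoTime() < nextNs) {
                // Busy waiting until the absolute deadline
            }
            System.out.println("tick at " + System.nanoTime());
            // Advance by the period, not from "now": if this tick was late,
            // the next wait is automatically shorter.
            nextNs += periodNs;
        }
    }
}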
If you really have hard timing constraints, you want to use a real-time operating system. General-purpose computing does not have hard timing constraints; if your OS pages out to virtual memory during one of your intervals, then you can miss your sleep deadline. A real-time OS makes the trade-off that you may get less done, but that work can be scheduled more reliably.
If you need to do this on a normal OS, you can spin (busy-wait) instead of sleeping. This is really inefficient, but if you truly have hard timing constraints, it's the best way to approximate them.
That will be hard; think about GC. What I would do is grab the exact time with nanoTime and use it in the calculations, rather than assuming that exactly 50 ms have passed.
Yes (assuming you only want to prevent long-term drift and don't worry about each delay individually). java.util.Timer.scheduleAtFixedRate:
...In fixed-rate execution, each execution is scheduled relative to the scheduled execution time of the initial execution. If an execution is delayed for any reason (such as garbage collection or other background activity), two or more executions will occur in rapid succession to "catch up." In the long run, the frequency of execution will be exactly the reciprocal of the specified period (assuming the system clock underlying Object.wait(long) is accurate). ...
Basically, do something like this:
new Timer().scheduleAtFixedRate(new TimerTask() {
    @Override
    public void run() {
        onCall(); // resolves to the enclosing ClassA's onCall()
    }
}, 2000, 50);
I was struggling for two days to understand what is going on with C++ thread-pool performance compared to a single thread. Then I decided to try the same in Java, and noticed that the behaviour is the same in C++ and Java. Basically my code is simple and straightforward:
package com.examples.threading;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class ThreadPool {
    final static AtomicLong lookups = new AtomicLong(0);
    final static AtomicLong totalTime = new AtomicLong(0);

    public static class Task implements Runnable {
        int start = 0;

        Task(int s) {
            start = s;
        }

        @Override
        public void run() {
            for (int j = start; j < start + 3000; j++) {
                long st = System.nanoTime();
                boolean a = false;
                long et = System.nanoTime();
                totalTime.getAndAdd((et - st));
                lookups.getAndAdd(1L);
            }
        }
    }

    public static void main(String[] args) {
        // change threads from 1 -> 100 then you will get different numbers
        ExecutorService executor = Executors.newFixedThreadPool(1);
        for (int i = 0; i <= 1000000; i++) {
            if (i % 3000 == 0) {
                Task task = new Task(i);
                executor.execute(task);
                System.out.println("in time " + (totalTime.doubleValue() / lookups.doubleValue()) + " lookups: " + lookups.toString());
            }
        }
        executor.shutdown();
        while (!executor.isTerminated()) {
            ;
        }
        System.out.println("in time " + (totalTime.doubleValue() / lookups.doubleValue()) + " lookups: " + lookups.toString());
    }
}
Now, when you run the same code with a different pool size, say 100 threads, the overall elapsed time changes.
one thread:
in time 36.91493612774451 lookups: 1002000
100 threads:
in time 141.47934530938124 lookups: 1002000
The question is: the code is the same, so why is the overall elapsed time different? What exactly is going on here?
You have a couple of obvious possibilities here.
One is that System.nanoTime may serialize internally, so even though each thread is making its call separately, it may internally execute those calls in sequence (and, for example, queue up calls as they come in). This is particularly likely when nanoTime directly accesses a hardware clock, such as on Windows (where it uses Windows' QueryPerformanceCounter).
Another point at which you get essentially sequential execution is your atomic variables. Even though you're using lock-free atomics, the basic fact is that each has to execute a read/modify/write as an atomic sequence. With locked variables, that's done by locking, then reading, modifying, writing, and unlocking. With lock-free, you eliminate some of the overhead in doing that, but you're still stuck with the fact that only one thread can successfully read, modify, and write a particular memory location at a given time.
In this case the only "work" each thread is doing is trivial, and the result is never used, so the optimizer can (and probably will) eliminate it entirely. So all you're really measuring is the time to read the clock and increment your variables.
To gain at least some of the speed back, you could (for one example) give each thread its own lookups and totalTime variables. Then, when all the threads finish, you can add together the values from the individual threads to get an overall total, as sketched below.
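A sketch of that idea (my own rework of the question's code, keeping only its structure): each task accumulates into its own plain fields, and the totals are combined once after all tasks are known to have completed:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskCounters {
    static class Task implements Runnable {
        long lookups;     // private to this task: no contention in the loop
        long totalTimeNs;

        @Override
        public void run() {
            for (int j = 0; j < 3000; j++) {
                long st = System.nanoTime();
                long et = System.nanoTime();
                totalTimeNs += (et - st);
                lookups++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Task> tasks = new ArrayList<>();
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            Task task = new Task();
            tasks.add(task);
            futures.add(executor.submit(task));
        }
        for (Future<?> f : futures) {
            f.get(); // waiting on the Future also makes the task's writes visible
        }
        executor.shutdown();
        // Combine the per-task results only once, after all work is done.
        long lookups = 0;
        long totalNs = 0;
        for (Task t : tasks) {
            lookups += t.lookups;
            totalNs += t.totalTimeNs;
        }
        System.out.println("in time " + ((double) totalNs / lookups)
                + " lookups: " + lookups);
    }
}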
Preventing serialization of the timing is a little more difficult (to put it mildly). At least in the obvious design, each call to nanoTime directly accesses a hardware register, which (at least with most typical hardware) can only happen sequentially. It could be fixed at the hardware level (provide a high-frequency timer register that's directly readable per-core, guaranteed to be synced between cores). That's a somewhat non-trivial task, and (more importantly) most current hardware just doesn't include such a thing.
Other than that, do some meaningful work in each thread, so when you execute in multiple threads, you have something that can actually use the resources of your multiple CPUs/cores to run faster.
The Javadoc of RecursiveAction mentions:
The following example illustrates some refinements and idioms that may lead to better performance: RecursiveActions need not be fully recursive, so long as they maintain the basic divide-and-conquer approach. Here is a class that sums the squares of each element of a double array, by subdividing out only the right-hand-sides of repeated divisions by two, and keeping track of them with a chain of next references. It uses a dynamic threshold based on method getSurplusQueuedTaskCount, but counterbalances potential excess partitioning by directly performing leaf actions on unstolen tasks rather than further subdividing.
The related code:
protected void compute() {
    int l = lo;
    int h = hi;
    Applyer right = null;
    while (h - l > 1 && getSurplusQueuedTaskCount() <= 3) {
        int mid = (l + h) >>> 1;
        right = new Applyer(array, mid, h, right);
        right.fork();
        h = mid;
    }
    double sum = atLeaf(l, h);
    while (right != null) {
        if (right.tryUnfork()) // directly calculate if not stolen
            sum += right.atLeaf(right.lo, right.hi);
        else {
            right.join();
            sum += right.result;
        }
        right = right.next;
    }
    result = sum;
}
I just wonder about the reasoning behind getSurplusQueuedTaskCount() <= 3.
The Javadoc of ForkJoinTask.getSurplusQueuedTaskCount() mentions:
Returns an estimate of how many more locally queued tasks are held by the current worker thread than there are other worker threads that might steal them. This value may be useful for heuristic decisions about whether to fork other tasks. In many usages of ForkJoinTasks, at steady state, each worker should aim to maintain a small constant surplus (for example, 3) of tasks, and to process computations locally if this threshold is exceeded.
Again, why do we have to process computations locally if this threshold is exceeded?
My guess:
getSurplusQueuedTaskCount() = number of locally queued tasks - number of other worker threads that might steal them (locally queued tasks)
getSurplusQueuedTaskCount() > 3 means locally queued tasks outnumber the other worker threads that could steal them. Thus, the other workers are already busy enough that they won't be able to steal any newly created tasks soon. Therefore the current thread should perform the calculation itself instead of subdividing further (i.e. creating new tasks).
Is my guess correct?
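For reference, here is a self-contained sketch of the same heuristic (my own illustration, not code from the Javadoc), using a RecursiveTask so it compiles standalone: it keeps forking while the surplus is small and otherwise sums locally:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SurplusSum extends RecursiveTask<Long> {
    final long[] array;
    final int lo, hi;

    SurplusSum(long[] array, int lo, int hi) {
        this.array = array;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        // Fork only while other workers might still steal our queued tasks.
        if (hi - lo > 1000 && getSurplusQueuedTaskCount() <= 3) {
            int mid = (lo + hi) >>> 1;
            SurplusSum right = new SurplusSum(array, mid, hi);
            right.fork();
            long leftSum = new SurplusSum(array, lo, mid).compute();
            return leftSum + right.join();
        }
        long sum = 0; // surplus too high (or range small): compute locally
        for (int i = lo; i < hi; i++) {
            sum += array[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1000000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        long sum = ForkJoinPool.commonPool().invoke(
                new SurplusSum(data, 0, data.length));
        System.out.println(sum); // 499999500000
    }
}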