I am using BlockingQueue:s (trying both ArrayBlockingQueue and LinkedBlockingQueue) to pass objects between different threads in an application I’m currently working on. Performance and latency is relatively important in this application, so I was curious how much time it takes to pass objects between two threads using a BlockingQueue. In order to measure this, I wrote a simple program with two threads (one consumer and one producer), where I let the producer pass a timestamp (taken using System.nanoTime()) to the consumer, see code below.
I recall reading somewhere on some forum that it took about 10 microseconds for someone else who tried this (don’t know on what OS and hardware that was on), so I was not too surprised when it took ~30 microseconds for me on my windows 7 box (Intel E7500 core 2 duo CPU, 2.93GHz), whilst running a lot of other applications in the background. However, I was quite surprised when I did the same test on our much faster Linux server (two Intel X5677 3.46GHz quad-core CPUs, running Debian 5 with kernel 2.6.26-2-amd64). I expected the latency to be lower than on my windows box , but on the contrary it was much higher - ~75 – 100 microseconds! Both tests were done with Sun’s Hotspot JVM version 1.6.0-23.
Has anyone else done any similar tests with similar results on Linux? Or does anyone know why it is so much slower on Linux (with better hardware), could it be that thread switching simply is this much slower on Linux compared to windows? If that’s the case, it’s seems like windows is actually much better suited for some kind of applications. Any help in helping me understanding the relatively high figures are much appreciated.
Edit:
After a comment from DaveC, I also did a test where I restricted the JVM (on the Linux machine) to a single core (i.e. all threads running on the same core). This changed the results dramatically - the latency went down to below 20 microseconds, i.e. better than the results on the Windows machine. I also did some tests where I restricted the producer thread to one core and the consumer thread to another (trying both to have them on the same socket and on different sockets), but this did not seem to help - the latency was still ~75 microseconds. Btw, this test application is pretty much all I'm running on the machine while performering test.
Does anyone know if these results make sense? Should it really be that much slower if the producer and the consumer are running on different cores? Any input is really appreciated.
Edited again (6 January):
I experimented with different changes to the code and running environment:
I upgraded the Linux kernel to 2.6.36.2 (from 2.6.26.2). After the kernel upgrade, the measured time changed to 60 microseconds with very small variations, from 75-100 before the upgrade. Setting CPU affinity for the producer and consumer threads had no effect, except when restricting them to the same core. When running on the same core, the latency measured was 13 microseconds.
In the original code, I had the producer go to sleep for 1 second between every iteration, in order to give the consumer enough time to calculate the elapsed time and print it to the console. If I remove the call to Thread.sleep () and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly? My first guess was that the change had the effect that the producer called queue.put() before the consumer called queue.take(), so the consumer never had to block, but after playing around with a modified version of ArrayBlockingQueue, I found this guess to be false – the consumer did in fact block. If you have some other guess, please let me know. (Btw, if I let the producer call both Thread.sleep() and barrier.await(), the latency remains at 60 microseconds).
I also tried another approach – instead of calling queue.take(), I called queue.poll() with a timeout of 100 micros. This reduced the average latency to below 10 microseconds, but is of course much more CPU intensive (but probably less CPU intensive that busy waiting?).
Edited again (10 January) - Problem solved:
ninjalj suggested that the latency of ~60 microseconds was due to the CPU having to wake up from deeper sleep states - and he was completely right! After disabling C-states in BIOS, the latency was reduced to <10 microseconds. This explains why I got so much better latency under point 2 above - when I sent objects more frequently the CPU was kept busy enough not to go to the deeper sleep states. Many thanks to everyone who has taken time to read my question and shared your thoughts here!
...
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CyclicBarrier;
public class QueueTest {
ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(10);
Thread consumerThread;
CyclicBarrier barrier = new CyclicBarrier(2);
static final int RUNS = 500000;
volatile int sleep = 1000;
public void start() {
consumerThread = new Thread(new Runnable() {
#Override
public void run() {
try {
barrier.await();
for(int i = 0; i < RUNS; i++) {
consume();
}
} catch (Exception e) {
e.printStackTrace();
}
}
});
consumerThread.start();
try {
barrier.await();
} catch (Exception e) { e.printStackTrace(); }
for(int i = 0; i < RUNS; i++) {
try {
if(sleep > 0)
Thread.sleep(sleep);
produce();
} catch (Exception e) {
e.printStackTrace();
}
}
}
public void produce() {
try {
queue.put(System.nanoTime());
} catch (InterruptedException e) {
}
}
public void consume() {
try {
long t = queue.take();
long now = System.nanoTime();
long time = (now - t) / 1000; // Divide by 1000 to get result in microseconds
if(sleep > 0) {
System.out.println("Time: " + time);
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
QueueTest test = new QueueTest();
System.out.println("Starting...");
// Run first once, ignoring results
test.sleep = 0;
test.start();
// Run again, printing the results
System.out.println("Starting again...");
test.sleep = 1000;
test.start();
}
}
Your test is not a good measure of queue handoff latency because you have a single thread reading off the queue which writes synchronously to System.out (doing a String and long concatenation while it is at it) before it takes again. To measure this properly you need to move this activity out of this thread and do as little work as possible in the taking thread.
You'd be better off just doing the calculation (then-now) in the taker and adding the result to some other collection which is periodically drained by another thread that outputs the results. I tend to do this by adding to an appropriately presized array backed structure accessed via an AtomicReference (hence the reporting thread just has to getAndSet on that reference with another instance of that storage structure in order to grab the latest batch of results; e.g. make 2 lists, set one as active, every x s a thread wakes up and swaps the active and the passive ones). You can then report some distribution instead of every single result (e.g. a decile range) which means you don't generate vast log files with every run and get useful information printed for you.
FWIW I concur with the times Peter Lawrey stated & if latency is really critical then you need to think about busy waiting with appropriate cpu affinity (i.e. dedicate a core to that thread)
EDIT after Jan 6
If I remove the call to Thread.sleep () and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly?
You're looking at the difference between java.util.concurrent.locks.LockSupport#park (and corresponding unpark) and Thread#sleep. Most j.u.c. stuff is built on LockSupport (often via an AbstractQueuedSynchronizer that ReentrantLock provides or directly) and this (in Hotspot) resolves down to sun.misc.Unsafe#park (and unpark) and this tends to end up in the hands of the pthread (posix threads) lib. Typically pthread_cond_broadcast to wake up and pthread_cond_wait or pthread_cond_timedwait for things like BlockingQueue#take.
I can't say I've ever looked at how Thread#sleep is actually implemented (cos I've never come across something low latency that isn't a condition based wait) but I would imagine that it causes it to be demoted by the schedular in a more aggressive way than the pthread signalling mechanism and that is what accounts for the latency difference.
I would use just an ArrayBlockingQueue if you can. When I have used it the latency was between 8-18 microseconds on Linux. Some point of note.
The cost is largely the time it takes to wake up the thread. When you wake up a thread its data/code won't be in cache so you will find that if you time what happens after a thread has woken that can take 2-5x longer than if you were to run the same thing repeatedly.
Certain operations use OS calls (such as locking/cyclic barriers) these are often more expensive in a low latency scenario than busy waiting. I suggest trying to busy wait your producer rather than use a CyclicBarrier. You could busy wait your consumer as well but this could be unreasonably expensive on a real system.
#Peter Lawrey
Certain operations use OS calls (such as locking/cyclic barriers)
Those are NOT OS (kernel) calls. Implemented via simple CAS (which on x86 comes w/ free memory fence as well)
One more: dont use ArrayBlockingQueue unless you know why (you use it).
#OP:
Look at ThreadPoolExecutor, it offers excellent producer/consumer framework.
Edit below:
to reduce the latency (baring the busy wait), change the queue to SynchronousQueue add the following like before starting the consumer
...
consumerThread.setPriority(Thread.MAX_PRIORITY);
consumerThread.start();
This is the best you can get.
Edit2:
Here w/ sync. queue. And not printing the results.
package t1;
import java.math.BigDecimal;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
public class QueueTest {
static final int RUNS = 250000;
final SynchronousQueue<Long> queue = new SynchronousQueue<Long>();
int sleep = 1000;
long[] results = new long[0];
public void start(final int runs) throws Exception {
results = new long[runs];
final CountDownLatch barrier = new CountDownLatch(1);
Thread consumerThread = new Thread(new Runnable() {
#Override
public void run() {
barrier.countDown();
try {
for(int i = 0; i < runs; i++) {
results[i] = consume();
}
} catch (Exception e) {
return;
}
}
});
consumerThread.setPriority(Thread.MAX_PRIORITY);
consumerThread.start();
barrier.await();
final long sleep = this.sleep;
for(int i = 0; i < runs; i++) {
try {
doProduce(sleep);
} catch (Exception e) {
return;
}
}
}
private void doProduce(final long sleep) throws InterruptedException {
produce();
}
public void produce() throws InterruptedException {
queue.put(new Long(System.nanoTime()));//new Long() is faster than value of
}
public long consume() throws InterruptedException {
long t = queue.take();
long now = System.nanoTime();
return now-t;
}
public static void main(String[] args) throws Throwable {
QueueTest test = new QueueTest();
System.out.println("Starting + warming up...");
// Run first once, ignoring results
test.sleep = 0;
test.start(15000);//10k is the normal warm-up for -server hotspot
// Run again, printing the results
System.gc();
System.out.println("Starting again...");
test.sleep = 1000;//ignored now
Thread.yield();
test.start(RUNS);
long sum = 0;
for (long elapsed: test.results){
sum+=elapsed;
}
BigDecimal elapsed = BigDecimal.valueOf(sum, 3).divide(BigDecimal.valueOf(test.results.length), BigDecimal.ROUND_HALF_UP);
System.out.printf("Avg: %1.3f micros%n", elapsed);
}
}
If latency is critical and you do not require strict FIFO semantics, then you may want to consider JSR-166's LinkedTransferQueue. It enables elimination so that opposing operations can exchange values instead of synchronizing on the queue data structure. This approach helps reduce contention, enables parallel exchanges, and avoids thread sleep/wake-up penalties.
Related
I'm using Groovy's ASTBuilder (version 2.5.5) in a project. It's being used to parse and analyze groovy expressions received via a REST API. This REST service receives thousands of requests, and the analysis is done on the fly.
I'm noticing some serious performance issues in a multithreaded environment. Below is a simulation, running 100 threads in parallel:
int numthreads = 100;
final Callable<Void> task = () -> {
long initial = System.currentTimeInMillis();
// Simple rule
new AstBuilder().buildFromString("a+b");
System.out.print(String.format("\n\nThread took %s ms.",
System.currentTimeInMillis() - initial));
return null;
};
final ExecutorService executorService = Executors.newFixedThreadPool(numthreads);
final List<Callable<Void>> tasks = new ArrayList<>();
while (numthreads-- > 0) {
tasks.add(task);
}
for (Future<Void> future : executorService.invokeAll(tasks)) {
future.get();
}
Im trying with different thread loads. The greater the number, the slower.
100 threads => ~1800ms
200 threads => ~2500ms
300 threads => ~4000ms
However, if I serialize the threads, (like setting the pool size to 1), I get much better results, around 10ms each thread. Can someone please help me understand why is this happening?
Performing multiple threaded code, computer shares threads between physical CPU cores. That means the more the number of threads exceeds number of cores, the less benefit you get from every thread. In your example the number of threads increases with number of tasks. So with growing up of the task number every CPU core forced to process the more and more threads. At the same time you may notice that difference between numthreads = 1 and numthreads = 4 is very small. Because in this case every core processes only few (or even just one) thread. Don't set number of threads much more than number of physical CPU threads because it doesn't make a lot of sense.
Additionally in your example you're trying to compare how different numbers of threads performs with different numbers of tasks. But in order to see the efficiency of multiple threaded code you have to compare how the different numbers of threads performs with the same number of tasks. I would change the example the next way:
int threadNumber = 16;
int taskNumber = 200;
//...task method
final ExecutorService executorService = Executors.newFixedThreadPool(threadNumber);
final List<Callable<Void>> tasks = new ArrayList<>();
while (taskNumber-- > 0) {
tasks.add(task);
}
long start = System.currentTimeMillis();
for (Future<Void> future : executorService.invokeAll(tasks)) {
future.get();
}
long end = System.currentTimeMillis() - start;
System.out.println(end);
executorService.shutdown();
Try this code for threadNumber=1 and, lets say, threadNumber=16 and you'll see the difference.
Dynamic evaluation of expressions involves a lot of resources including class loading, security manager, compilation and execution. It is not designed for high performance. If you just need to evaluate an expression for its value, you could try groovy.util.Eval. It may not consume as many resources as AstBuilder. However, it is probably not going to be that much different, so don't expect too much.
If you want to get the AST only and not any extra information like types, you could call the parser more directly. This would involve a lot fewer resources. See org.codehaus.groovy.control.ParserPluginFactory for more direct access to the source parser.
Does anyone have a Fairly effective way of running a function repetitively in a precise and accurate number of milliseconds. I have tried to accomplish this by using the code below to try to run a function called wave() once a second for 30 seconds:
startTime = System.nanoTime();
wholeTime = System.nanoTime();
while (loop) {
if (startTime >= time2) {
startTime = System.nanoTime();
wave();
sec++;
}
if (sec == 30) {
loop = false;
endTime = System.nanoTime();
System.out.println(wholeTime - System.nanoTime());
}
}
}
This code did not work and am wondering why this code didn't work and if their is a better approach to the problem. Any ideas on how to improve fix the above code or other successful ways of accomplishing the problem are all welcome. Thank you for your help!
more simple:
long start=System.currentTimeMillis(); // Not very very accurate
while (System.currentTimeMillis()-start<30000)
{
wave();
// count something
}
You can use a Timer+TimerTask: https://docs.oracle.com/javase/7/docs/api/java/util/Timer.html
https://docs.oracle.com/javase/7/docs/api/java/util/TimerTask.html
http://bioportal.weizmann.ac.il/course/prog2/tutorial/essential/threads/timer.html
You may use Thread.sleep():
public static void main (String[] args) throws InterruptedException {
int count = 30;
long start = System.currentTimeMillis();
for(int i=0; i<count; i++) {
wave();
// how many milliseconds till the end of the second?
long sleep = start+(i+1)*1000-System.currentTimeMillis();
if(sleep > 0) // condition might be false if wave() runs longer than second
Thread.sleep(sleep);
}
}
Does anyone have a Fairly effective way of running a function repetitively in a precise and accurate number of milliseconds.
There is no way to do this kind of thing reliably and accurately in standard Java. The problem is that there is no way that you can guarantee that your thread will run when you want ti to run. For example:
your thread could be suspended to allow the GC to run
your thread could be preempted to allow another thread in your application to run
your thread could be suspended by the OS while it fetches pages by the JVM back from disk.
You can only get reliable behavior for this kind of code if you run on a hard realtime OS, and an realtime Java.
Note that this is not an issue with clock accuracy. The real problem is that the scheduler does not give you the kind of guarantees you need. For instance, none of the "sleep until X" functionality in a JVM can guarantee that your thread will wake up at time X exactly ... for any useful meaning of "exactly".
The other answers suggest various ways to do this, but beware that they are not (and cannot be) reliable and accurate in all circumstances .. or even on a typical machine running other things as well as your application.
I am playing around with multithreading and came across an inconsistency when running a small snippet of code. The following code should print out 123123... but what I'm getting is
class RunnableDemo implements Runnable {
private String message;
RunnableDemo(String m) {
message = m;
}
public void run() {
try {
for (int i = 0; i < message.length(); i++) {
System.out.print(message.charAt(i));
Thread.sleep(1000);
}
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
public class TestThread {
public static void main(String args[]) throws InterruptedException {
new Thread(new RunnableDemo("1111111")).start();
new Thread(new RunnableDemo("2222222")).start();
new Thread(new RunnableDemo("3333333")).start();
}
}
Output: 123231231132123231321
Output: 123213123123213213213
Output: 123231213213231231213
What I don't get is that it run correctly the first pass through (prints '123') but then the second pass through it prints '231'. If the thread is printing a char, sleeping 1 second, then repeating. Shouldn't the pattern 123123... be consistent each time I run the code or at least follow the pattern of the first 3 chars?
The following code should print out 123123
Not necessarily. You should basically never rely on threads with no synchronization between them happening to wake up and execute in any particular order.
Let's take the very first character output: there's no guarantee that that will be 1. Yes, you're starting the thread printing 1 first, but that doesn't mean that's the first thread that will actually start executing run first - or even if it does, that doesn't mean that's the first thread that will get as far as the System.out.print call.
Given the fairly long sleep, I would expect (but ideally not rely on) the output being a sequence of 7 "chunks", where each "chunk" consists of the characters "123" in some permutation. But if you've got three threads which all go to sleep for a second at "roughly" the same time, you shouldn't expect them to necessarily wake up in the order 1, 2, 3 - and again, even if they do, one of them may pre-empt another within the loop body.
On a really, really slow machine, even that expectation would be invalid - imagine it takes a random amount of time between 0 and 20 seconds to call charAt - unlikely, but it's a feasible thought experiment. At that point, one of the threads could race ahead and finish its output before another of the threads managed to print anything.
Threads are designed to be independent - if you want them to work in a coordinated fashion, you have to specify that coordination yourself. There are plenty of tools for the job, but don't expect it to happen magically.
You can't predict what piece of program CPU runs at a time. While running some process the CPU converts the process into small pieces of work. As multiple processes are running at a time. CPU has to schedule according to scheduling algorithm implemented. So, in short, you cannot predict what CPU does next unless you programmatically synchronize the pieces of code.
I have a server with plenty of resources in terms of processors, storage throughput and memory for processing a huge mass of files.
I am doing some performance tests and have adapted a small java program to test the parallel reading.Code is below
import java.io.*;
import java.lang.*;
class MultiThreadedFileRead extends Thread
{
InputStream in;
MultiThreadedFileRead(String fname) throws Exception
{
in=new FileInputStream(fname);
this.start();
}
public void run()
{
int i=0;
while(i!=-1)
{
try
{
i=in.read();
//System.out.print((char)i);
continue;
}catch(Exception e){}
}
try
{
in.close();
}catch(Exception e){}
}
public static void main(String a[]) throws Exception
{
int n=[0];
MultiThreadedFileRead fr[]=new MultiThreadedFileRead[n];
long tim;
tim=System.currentTimeMillis();
for(int i=1;i<n;i++)
fr[i]=new MultiThreadedFileRead(a[i]);
for(int i=1;i<n;i++)
{
try
{
fr[i].join();
}catch(Exception e){}
}
System.out.println("Time Required : "+(System.currentTimeMillis()-tim)+" miliseconds.");
}
}
The results seems corrects: reading 10 files in parallel (10 threads) takes about the same time as reading one file/one thread plus some overhead. (sorry, I don't have the actual numbers here, might edit later adding it).
To be sure though, I'd like to know is what would be the expected, or "reasonable" overhead for opening threads for parallel reading...?
Also, I am not a java developer, so although the program is pretty simple, if I got something wrong please point it out.
ps. to run the program I have 10x10mb files (named tf0, tf1, tf2, etc.), and I run the test as java MultiThreadedFileRead 10 tf* (for 10 threads).
There is no point in researching the time it takes to fire up a thread for two main reasons:
It will be insignificant compared to the the amount of time it takes to read the file - unless you have some seriously empowered system.
The statistics you gather will be specific to the system you are testing on - move it to another system, run it on a hot/cold day and your statistics will be meaningless.
You would be better to investigate actually distributing the processes across multiple machines if you are looking for a real performance boost.
I think that the problem might be that the IO itself is not multithreaded.
If you create X threads and they all issue a read on IO it will not be X times faster because it does the IO reads sequentially.
I have tested the code for 9 threads and for 1 thread files (100MB each). Result were 208 seconds and 94 seconds.
There are two issues with the test as I see it:
Reading without buffering (byte by byte) is not getting you max throughput.
You are creating n-1 threads in the test.
If this is a server, you should probably use a thread pool rather than re-creating the threads. This would remove the thread creation overhead (except for a single time) and would prevent your server from going down when there are too many requests.
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
Also, you may want to consider non-blocking io, for a real performance boost.
http://tutorials.jenkov.com/java-nio/nio-vs-io.html
When using normal reads, as #mrVoid suggested, you should use buffered readers.
And you might want to throw in some caching mechanism.
I am doing a past exam paper of Java, I am confused about one question listed below:
What would happen when a thread executes the following statement in its run() method? (Choose all that apply.)
sleep(500);
A. It is going to stop execution, and start executing exactly 500 milliseconds later.
B. It is going to stop execution, and start executing again not earlier than 500 milliseconds later.
C. It is going to result in a compiler error because you cannot call the sleep(…) method inside the run() method.
D. It is going to result in a compiler error because the sleep(…) method does not take any argument.
I select A,B. but the key answer is only B, does there exist any circumstances that A could also happen? Could anyone please clarify that for me? Many thanks.
I select A,B. but the key answer is only B, does there exist any circumstances that A could also happen? Could anyone please clarify that for me?
Yes, depending on your application, you certainly might get 500ms of sleep time and not a nanosecond more.
However, the reason why B is the better answer is that there are no guarantees about when any thread will be run again. You could have an application with a large number of CPU bound threads. Even though the slept thread is now able to be run, it might not get any cycles for a significant period of time. The precise sleep time also depends highly on the particulars of the OS thread scheduler and clock accuracy. Your application also may also have to compete with other applications on the same system which may delay its continued execution.
For example, this following program on my extremely fast 8xi7 CPU Macbook Pro shows a max-sleep of 604ms:
public class MaxSleep {
public static void main(String[] args) throws Exception {
final AtomicLong maxSleep = new AtomicLong(0);
ExecutorService threadPool = Executors.newCachedThreadPool();
// fork 1000 threads
for (int i = 0; i < 1000; i++) {
threadPool.submit(new Runnable() {
#Override
public void run() {
for (int i = 0; i < 10; i++) {
long total = 0;
// spin doing something that eats CPU
for (int j = 0; j < 10000000; j++) {
total += j;
}
// this IO is the real time sink though
System.out.println("total = " + total);
try {
long before = System.currentTimeMillis();
Thread.sleep(500);
long diff = System.currentTimeMillis() - before;
// update the max value
while (true) {
long max = maxSleep.get();
if (diff <= max) {
break;
}
if (maxSleep.compareAndSet(max, diff)) {
break;
}
}
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
});
}
threadPool.shutdown();
threadPool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
System.out.println("max sleep ms = " + maxSleep);
}
}
JVM cannot guarantee exactly 500 ms but it will start on or after ~500 ms as it will need to start its 'engine' back considering no other threads are blocking any resources which may delay a bit.
Read: Inside the Hotspot VM: Clocks, Timers and Scheduling Events
Edit: As Gray pointed out in the comment - the scheduling with other threads also a factor, swapping from one to another may cost some time.
According to Javadoc:-
Sleep()
Causes the currently executing thread to sleep (temporarily cease
execution) for the specified number of milliseconds, subject to the
precision and accuracy of system timers and schedulers. The thread
does not lose ownership of any monitors.
So it may be ~500ms
B. It is going to stop execution, and start executing again not earlier than 500 milliseconds later.
Looks more prominent.
As you know that there are two very closely-related states in Thread : Running and Runnable.
Running : means currently executing thread. Its execution is currently going on.
Runnable : means thread is ready to get processed or executed. But is waiting to be picked up by Thread Schedular. Now this thread schedular, as per its own wish/defined algorithm depending upon the JVM(for example, slicing algorithm) will pick up one of the available/Runnable threads to process them.
So whenever you call sleep method, it only guarantees that the execution of the code by the thread which it ran on, is paused for the specified milliseconds in the argument(e.g. 300ms in threadRef.sleep(300);). As soon as the time defined goes by, it comes back to Runnable state(means back to Available State to be picked up by Thread Schedular).
Therefore, there is no guarantee that your remaining code will start getting executed immediately after sleep method completion.
These sleep times are not guaranteed to be precise, because they are limited by the facilities provided by the underlying OS. Option B: Not earlier than 500 is more correct.
You cannot select A and B as they are opposite to each other, the main difference: exactly 500 milliseconds later and not earlier than 500 milliseconds later
first mean exactly what it means (500 milliseconds only), second means that it can sleep 501 or 502 or even 50000000000000
next question - why B is true, it is not so simply question, you need to understand what is the different between hard-realtime and soft-realtime, and explaining of all reasons is quite offtopic, so simply answer - because of many technical reasons java cannot guarantee hard real time execution of your code, this is why it stated that sleep will finish not earlier than ...
you can read about thread scheduling, priorities, garbage collection, preemptive multitasking - all these are related to this