What is busy spin in a multi-threaded environment? - java

What is "Busy Spin" in multi-threaded environment?
How it is useful and how can it be implemented in java in a multi-threaded environment?
In what way can it be useful in improving the performance of an application?

Some of the other answers miss the real problem with busy waiting.
Unless you're talking about an application where you are concerned with conserving electrical power, then burning CPU time is not, in and of itself, a Bad Thing. It's only bad when there is some other thread or process that is ready-to-run. It's really bad when one of the ready-to-run threads is the thread that your busy-wait loop is waiting for.
That's the real issue. A normal, user-mode program running on a normal operating system has no control over which threads run on which processors. A normal operating system has no way to tell the difference between a thread that is busy waiting and a thread that is doing work, and even if the OS knew that the thread was busy-waiting, it would have no way to know what the thread was waiting for.
So, it's entirely possible for the busy waiter to wait for many milliseconds (practically an eternity) for an event, while the only thread that could make the event happen sits on the sideline (i.e., in the run queue) waiting for its turn to use a CPU.
Busy waiting is often used in systems where there is tight control over which threads run on which processors. Busy waiting can be the most efficient way to wait for an event when you know that the thread that will cause it is actually running on a different processor. That often is the case when you're writing code for the operating system itself, or when you're writing an embedded, real-time application that runs under a real-time operating system.
Kevin Walters wrote about the case where the time to wait is very short. A CPU-bound, ordinary program running on an ordinary OS may be allowed to execute millions of instructions in each time slice. So, if the program uses a spin-lock to protect a critical section consisting of just a few instructions, then it is highly unlikely that any thread will lose its time slice while it is in the critical section. That means, if thread A finds the spin-lock locked, then it is highly likely that thread B, which holds the lock, actually is running on a different CPU. That's why it can be OK to use spin-locks in an ordinary program when you know it's going to run on a multi-processor host.
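To make the spin-lock case concrete, here is a minimal sketch of one protecting a critical section of only a few instructions. The SpinLock and SpinLockDemo classes are hypothetical names made up for illustration (the JDK has no public spin-lock class), and Thread.onSpinWait() assumes Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical test-and-test-and-set spin lock; not a JDK class.
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock() {
        while (true) {
            // Spin on a plain read first so waiters do not issue a CAS on every iteration.
            while (locked.get()) {
                Thread.onSpinWait();
            }
            if (locked.compareAndSet(false, true)) {
                return; // we now own the lock
            }
        }
    }

    void unlock() {
        locked.set(false);
    }
}

class SpinLockDemo {
    private static long counter; // guarded by the SpinLock created in run()

    static long run(int threads, int incrementsPerThread) throws InterruptedException {
        counter = 0;
        SpinLock lock = new SpinLock();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    lock.lock();
                    try {
                        counter++; // critical section: just a few instructions
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return counter;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 100_000)); // prints 400000
    }
}
```

On a single-CPU host the same code still produces the right count, but every failed lock() attempt burns the rest of a time slice, which is exactly the problem described above.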

Busy-waiting, or spinning, is a technique in which a thread repeatedly checks whether a condition is true instead of calling a wait or sleep method and releasing the CPU.
1. It is mainly useful on multicore processors when the condition is going to become true very quickly, i.e. within milliseconds or microseconds.
2. The advantage of not releasing the CPU is that all cached data and instructions remain unaffected; they may be lost if the thread is suspended on one core and resumed on another.

Busy spin is one of the techniques to wait for events without releasing the CPU. It's often done to avoid losing the data in the CPU cache, which is lost if the thread is paused and resumed on some other core.
So, if you are working on a low latency system where your order processing thread currently doesn't have any order, instead of sleeping or calling wait(), you can just loop and then again check the queue for new messages. It's only beneficial if you need to wait for a very small amount of time e.g. in microseconds or nanoseconds.
The LMAX Disruptor framework, a high-performance inter-thread messaging library, has a BusySpinWaitStrategy which is based on this concept and uses a busy spin loop for EventProcessors waiting on the barrier.
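As a rough illustration of the order-processing case, the sketch below has a consumer that spins on a non-blocking poll() instead of sleeping or calling wait(). It uses only JDK classes and is not the Disruptor's implementation; the BusySpinConsumer class and its method names are assumptions made for this example.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical consumer that busy-spins on an empty queue.
class BusySpinConsumer {
    // Spins on poll() until `expected` orders have been consumed; returns their sum.
    static long consume(ConcurrentLinkedQueue<Integer> orders, int expected) {
        long sum = 0;
        int seen = 0;
        while (seen < expected) {
            Integer order = orders.poll(); // non-blocking; returns null when the queue is empty
            if (order == null) {
                Thread.onSpinWait(); // stay on the CPU, only hint that we are spinning
                continue;
            }
            sum += order;
            seen++;
        }
        return sum;
    }

    static long demo(int count) throws InterruptedException {
        ConcurrentLinkedQueue<Integer> orders = new ConcurrentLinkedQueue<>();
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= count; i++) {
                orders.add(i);
            }
        });
        producer.start();
        long sum = consume(orders, count);
        producer.join();
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo(1000)); // prints 500500
    }
}
```

The consumer never gives up its core, so it reacts to a new order as soon as the producer's write becomes visible, at the price of one fully loaded CPU.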

A "busy spin" is constantly looping in one thread to see if the other thread has completed some work. It is a "Bad Idea" as it consumes resources as it is just waiting. The busiest of spins don't even have a sleep in them, but spin as fast as possible waiting for the work to get finished. It is less wasteful to have the waiting thread notified by the completion of the work directly and just let it sleep until then.
Note: I call this a "Bad Idea", but it is used in some cases in low-level code to minimize latency. That is rarely (if ever) needed in Java code.

Busy spinning/waiting is normally a bad idea from a performance standpoint. In most cases, it is preferable to sleep and wait for a signal when you are ready to run than to spin. Take the scenario where there are two threads, and thread 1 is waiting for thread 2 to set a variable (say, it waits until var == true). Then, it would busy spin by just doing
while (var == false)
    ;
In this case, you will take up a lot of time that thread 2 can potentially be running, because when you wake up you are just executing the loop mindlessly. So, in a scenario where you are waiting for something like this to happen, it is better to let thread 2 have all control by putting yourself to sleep and having it wake you up when it is done.
BUT, in rare cases where the time you need to wait is very short, it is actually faster to spinlock. This is because of the time it takes to perform the signaling functions; spinning is preferable if the time spent spinning is less than the time it would take to perform the signaling. So, in that way it may be beneficial and could actually improve performance, but this is definitely not the most frequent case.
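The two options can be put side by side in Java. This is only a sketch: the WaitStyles class and its method names are made up, and CountDownLatch stands in for "sleep until thread 2 wakes you up". Note that in Java the flag has to be volatile (or otherwise synchronized), because without it the spinning thread is not guaranteed to ever see the write.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical comparison of spinning on a flag versus blocking on a signal.
class WaitStyles {
    private static volatile boolean done; // volatile so the spinning thread sees thread 2's write

    // Thread 1 spins until thread 2 sets the flag.
    static boolean spinWait() throws InterruptedException {
        done = false;
        Thread setter = new Thread(() -> done = true);
        setter.start();
        while (!done) {
            // busy spin: burns CPU until the flag flips
        }
        setter.join();
        return done;
    }

    // Thread 1 sleeps until thread 2 signals completion.
    static boolean blockingWait() throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        Thread setter = new Thread(latch::countDown);
        setter.start();
        latch.await(); // the waiting thread is descheduled until countDown() is called
        setter.join();
        return latch.getCount() == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(spinWait());     // prints true
        System.out.println(blockingWait()); // prints true
    }
}
```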

Spin waiting means that you constantly check until a condition becomes true. The opposite is waiting for a signal (as with wait() and notify()).
There are two ways of waiting: semi-active (sleep / yield) and active (busy waiting).
When busy waiting, a program idles actively using special opcodes like HLT or NOP, or other time-consuming operations. Others just use a while loop that checks for a condition becoming true.
The Java platform provides Thread.sleep, Thread.yield and the LockSupport.parkXXX() methods for a thread to hand over the CPU. Sleep waits for a specific amount of time but, on my system, always takes over a millisecond even if a nanosecond was specified. The same is true for LockSupport.parkNanos(1). Thread.yield allows for a resolution of 100 ns on my example system (Win7 + i5 mobile).
The problem with yield is the way it works. If the system is fully utilized, yield can take up to 800 ms in my test scenario (100 worker threads all counting up a number (a+=a;) indefinitely). Since yield frees the CPU and adds the thread to the end of the queue of threads within its priority group, yield is unstable unless CPU utilization stays below a certain level.
Busy waiting will block a CPU (core) for multiple milliseconds.
The Java framework (check the Condition class implementations) uses active (busy) waiting for periods of less than 1000 ns (1 microsecond). On my system an average invocation of System.nanoTime takes 160 ns, so busy waiting amounts to: check the condition, spend 160 ns on nanoTime, and repeat.
So basically the concurrency framework of Java (queues etc.) spins for waits under a microsecond, hitting the waiting period within a granularity of N, where N is the number of nanoseconds needed to check the time constraint, and otherwise waits for one ms or longer (on my current system).
So active busy waiting increases utilization but aids the overall responsiveness of the system.
While burning CPU time, one should use special instructions that reduce the power consumption of the core executing the time-consuming operations.
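The "spin for very short waits, park for longer ones" idea can be sketched as below. This is not the JDK's code: the SpinThenPark class is hypothetical, and its 1000 ns threshold is simply taken from the figure quoted above. As in the real framework, the signalling thread is expected to unpark the waiter.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Hypothetical timed wait that parks for long waits and spins for the last stretch.
class SpinThenPark {
    static final long SPIN_THRESHOLD_NANOS = 1_000L; // assumed value, from the 1000 ns mentioned above

    // Returns true if the condition became true before the timeout elapsed.
    static boolean await(BooleanSupplier condition, long timeoutNanos) {
        final long deadline = System.nanoTime() + timeoutNanos;
        while (!condition.getAsBoolean()) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false; // timed out
            }
            if (remaining > SPIN_THRESHOLD_NANOS) {
                LockSupport.parkNanos(remaining); // may return early; the loop re-checks
            } else {
                Thread.onSpinWait(); // too short to be worth parking
            }
        }
        return true;
    }

    static boolean demo() throws InterruptedException {
        AtomicBoolean ready = new AtomicBoolean(false);
        Thread waiter = Thread.currentThread();
        Thread setter = new Thread(() -> {
            ready.set(true);
            LockSupport.unpark(waiter); // wake the waiter if it is parked
        });
        setter.start();
        boolean result = await(ready::get, 5_000_000_000L);
        setter.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());                       // prints true
        System.out.println(await(() -> false, 2_000_000)); // prints false
    }
}
```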

Busy spin is nothing but looping until the thread(s) complete. E.g., say you have 10 threads and you want to wait for all of them to finish before continuing:
while(ALL_THREADS_ARE_NOT_COMPLETE);
//Continue with rest of the logic
For example in java you can manage multiple thread with ExecutorService
ExecutorService executor = Executors.newFixedThreadPool(10);
for (int i = 0; i < 10; i++) {
Runnable worker = new WorkerThread("" + i);
executor.execute(worker);
}
executor.shutdown();
//This loop keeps spinning until all the threads have finished.
while (!executor.isTerminated());
This is a busy spin: it consumes resources because the CPU is not sitting idle but keeps running the loop. We should have a mechanism to notify the main thread (parent thread) that all threads are done and it can continue with the rest of the task.
With the preceding example, instead of a busy spin, you can use a different mechanism to improve performance.
ExecutorService executor = Executors.newFixedThreadPool(10);
for (int i = 0; i < 10; i++) {
Runnable worker = new WorkerThread("" + i);
executor.execute(worker);
}
executor.shutdown();
try {
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
log.fatal("Exception ",e);
}

Related

Thread::yield vs Thread::onSpinWait

Well the title basically says it all, with the small addition that I would really like to know when to use them. And it might be simple enough - I've read the documentation for them both, still can't tell the difference much.
There are answers like this here that basically say:
Yielding also was useful for busy waiting...
I can't agree much with them for the simple reason that ForkJoinPool uses Thread::yield internally and that is a pretty recent addition in the jdk world.
The thing that really bothers me is usages like this in the JDK too (StampedLock::tryDecReaderOverflow):
else if ((LockSupport.nextSecondarySeed() & OVERFLOW_YIELD_RATE) == 0)
Thread.yield();
else
Thread.onSpinWait();
return 0L;
So it seems there are cases when one would be preferred over the other. And no, I don't have an actual example where I might need this - the only one I actually used was Thread::onSpinWait because 1) I happened to busy wait 2) the name is pretty much self explanatory that I should have used it in the busy spin.
When blocking a thread, there are a few strategies to choose from: spin, wait() / notify(), or a combination of both. Pure spinning on a variable is a very low latency strategy but it can starve other threads that are contending for CPU time. On the other hand, wait() / notify() will free up the CPU for other threads but can cost thousands of CPU cycles in latency when descheduling/scheduling threads.
So how can we avoid pure spinning as well as the overhead associated with descheduling and scheduling the blocked thread?
Thread.yield() is a hint to the thread scheduler to give up its time slice if another thread with equal or higher priority is ready. This avoids pure spinning but doesn't avoid the overhead of rescheduling the thread.
The latest addition is Thread.onSpinWait() which inserts architecture-specific instructions to hint the processor that a thread is in a spin loop. On x86, this is probably the PAUSE instruction, on aarch64, this is the YIELD instruction.
What's the use of these instructions? In a pure spin loop, the processor will speculatively execute the loop over and over again, filling up the pipeline. When the variable the thread is spinning on finally changes, all that speculative work will be thrown out due to memory order violation. What a waste!
A hint to the processor could prevent the pipeline from speculatively executing the spin loop until prior memory instructions are committed. In the context of SMT (hyperthreading), this is useful as the pipeline will be freed up for other hardware threads.
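A small sketch of how the two hints can be combined in one spin loop, loosely in the spirit of the StampedLock snippet quoted in the question. The SpinHints class is hypothetical, and the "yield on every 1024th iteration" rate is an arbitrary assumption, not a value taken from the JDK.

```java
// Hypothetical spin loop that mostly uses onSpinWait() and occasionally yields.
class SpinHints {
    private static volatile boolean ready;

    // Spins until `ready` is set; returns the number of loop iterations.
    static long spinUntilReady() {
        long spins = 0;
        while (!ready) {
            spins++;
            if ((spins & 0x3FF) == 0) {
                Thread.yield();      // every 1024th iteration: offer the core to the scheduler
            } else {
                Thread.onSpinWait(); // otherwise: processor-level hint, the thread stays scheduled
            }
        }
        return spins;
    }

    static boolean demo() throws InterruptedException {
        ready = false;
        Thread setter = new Thread(() -> ready = true);
        setter.start();
        spinUntilReady();
        setter.join();
        return ready;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // prints true
    }
}
```

The occasional yield gives the thread that will set the flag a chance to run if it is stuck in the run queue, while the onSpinWait() calls keep latency low when it is already running on another core.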

Why Non blocking Concurrency is better than blocking concurrency

I just want to know why non-blocking concurrency is better than blocking concurrency. In blocking concurrency, your thread must wait until the other thread completes its execution, so the thread would not be consuming CPU in that case.
But with non-blocking concurrency, threads do not wait to get a lock; they return immediately if another thread is holding the lock.
For example, in the ConcurrentHashMap class, inside the put() method there is a tryLock() in a loop. The other thread will be active and continuously checking whether the lock has been released, because tryLock() is non-blocking. I assume that in this case CPU is used unnecessarily.
So is it not better to suspend the thread until the other thread completes its execution, and wake it up when the work is finished?
Whether or not blocking or non-blocking concurrency is better depends on how long you expect to have to wait to acquire the resource you're waiting on.
With a blocking wait (i.e. a mutex lock, in C parlance), the operating system kernel puts the waiting thread to sleep. The CPU scheduler will not allocate any time to it until after the required resource has become available. The advantage here is that, as you said, this thread won't consume any CPU resources while it is sleeping.
There is a disadvantage, however: the process of putting the thread to sleep, determining when it should be woken, and waking it up again is complex and expensive, and may negate the savings achieved by not having the thread consume CPU while waiting. In addition (and possibly because of this), the OS may choose not to wake the thread immediately once the resource becomes available, so the lock may be waited on longer than is necessary.
A non-blocking wait (also known as a spinlock) does consume CPU resource while waiting, but saves the expense of putting the thread to sleep, determining when it should be woken, and waking it. It also may be able to respond faster once the lock becomes free, as it is less at the whim of the OS in terms of when it can proceed with execution.
So, as a very general rule, you should prefer a spinlock if you expect to only wait a short time (e.g. the few CPU cycles it might take for another thread to finish with an entry in ConcurrentHashMap). For longer waits (e.g. on synchronized I/O, or a number of threads waiting on a single complex computation), a mutex (blocking wait) may be preferable.
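One way to hedge between the two is to spin for a bounded number of attempts and only then block. The sketch below does this with ReentrantLock.tryLock(); the SpinThenBlockLock class and the MAX_SPINS value are made up for illustration and would need tuning on a real system.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper: try a few non-blocking acquisitions, then fall back to a blocking lock().
class SpinThenBlockLock {
    static final int MAX_SPINS = 100; // assumed tuning value

    static void acquire(ReentrantLock lock) {
        for (int i = 0; i < MAX_SPINS; i++) {
            if (lock.tryLock()) {
                return; // acquired without ever being descheduled
            }
            Thread.onSpinWait();
        }
        lock.lock(); // the wait is no longer "short": let the scheduler put this thread to sleep
    }

    static int run(int threads, int perThread) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        int[] counter = {0}; // guarded by lock
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    acquire(lock);
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return counter[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 50_000)); // prints 200000
    }
}
```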
If you take ConcurrentHashMap as an example, the overhead you describe, with multiple threads performing update operations (like put) and actively waiting for locks to be released, is not always going to occur.
Compared to Hashtable, concurrency control in ConcurrentHashMap is split up, so multiple threads can each acquire a lock (on different segments of the table).
Originally, the ConcurrentHashMap class supported a hard-wired preset concurrency level of 32. This allows a maximum of 32 put and/or remove operations to proceed concurrently (factors other than synchronization tend to be bottlenecks when more than 32 threads concurrently attempt updates).
Also, successful retrievals (when the key is present) using get(key) and containsKey(key) usually run without locking.
So, for instance, while one thread is in the process of adding an element, what cannot be done with such a locking strategy is an operation like adding an element only if it is not already present (ConcurrentReaderHashMap provides such facilities).
Also, the size() and isEmpty() methods require accumulations across 32 control segments, and so might be slightly slower.
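For what it's worth, the java.util.concurrent.ConcurrentHashMap that ships with the JDK does offer an atomic "add only if absent" operation, putIfAbsent. A minimal sketch, with a made-up PutIfAbsentDemo class, in which several threads race to insert the same key and exactly one of them wins:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: many threads race on putIfAbsent for one key.
class PutIfAbsentDemo {
    static int countWinners(int threads) throws InterruptedException {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        AtomicInteger winners = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                // putIfAbsent returns null only for the thread whose value was actually stored.
                if (map.putIfAbsent("key", id) == null) {
                    winners.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return winners.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWinners(8)); // prints 1
    }
}
```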

What happens after sleeping thread wakes up?

I know the behavior of sleep method in Java.
When I call the sleep method, the currently executing thread stops its execution and goes into the sleep state. While it is sleeping, it keeps any locks it holds.
For example if I call sleep method as follows
Thread.sleep(50)
My question is what happens after 50 ms.
Will it wake up and directly start executing, or
will it go into the runnable state and wait for the CPU to give it a chance to execute?
In other words, will it go to the runnable state and compete for the CPU with other threads?
Please let me know the answer.
It will go into runnable state. There's never a guarantee a thread will be executing at a particular moment. But you can set the thread's priority to give it a better chance at getting CPU time.
Actually, it depends on which operating system you use; different operating systems have different process scheduling algorithms.
Most desktop operating systems are not real-time operating system. There is no guarantee about the precision of the sleep. When you call sleep, the thread is suspended and is not runnable until the requested duration elapses. When it's runnable again, it's up to the scheduler to run the thread again when some execution time is available.
For example, most Linux distros use CFS as the default scheduling algorithm. CFS uses a concept called "sleeper fairness", which considers sleeping or waiting tasks equivalent to those on the runqueue. So in your case, the thread will get a comparable share of CPU time after sleeping.
It's up to the operating system scheduler. Typically, if the sleep is "sufficiently small" and the thread has enough of its timeslice left, the thread will hold onto the core and resume immediately when the sleep is finished. If the sleep is "too long" (typically around 10ms or more), then the core will be available to do other work and the thread will just be made ready-to-run when the sleep finishes. Depending on relative priorities, a new ready-to-run thread may pre-empt currently-running threads.
It will go in runnable state and wait for CPU to give it a chance to execute
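You can observe the effect by measuring how long a sleep really takes. In this small sketch (the SleepWakeup class name is made up), the measured time is usually a bit more than what was requested, and the extra is the time the thread spent runnable before the scheduler gave it a CPU again. The exact numbers depend on the OS and the load, so no expected output is given.

```java
// Hypothetical measurement of how long Thread.sleep() actually takes.
class SleepWakeup {
    // Sleeps for the requested time and returns the elapsed wall-clock time in milliseconds.
    static long sleepAndMeasureMillis(long requestedMillis) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(requestedMillis);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            System.out.println("asked for 50 ms, got " + sleepAndMeasureMillis(50) + " ms");
        }
    }
}
```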

Understanding Threads + Asynchronous

So I have a program that I made that needs to send a lot (like 10,000+) of GET requests to a URL and I need it to be as fast as possible. When I first created the program I just put the connections into a for loop but it was really slow because it would have to wait for each connection to complete before continuing. I wanted to make it faster so I tried using threads and it made it somewhat faster but I am still not satisfied.
I'm guessing the correct way to go about this and making it really fast is using an asynchronous connection and connecting to all of the URLs. Is this the right approach?
Also, I have been trying to understand threads and how they work but I can't seem to get it. The computer I am on has an Intel Core i7-3610QM quad-core processor. According to Intel's website for the specifications for this processor, it has 8 threads. Does this mean I can create 8 threads in a Java application and they will all run concurrently? Any more than 8 and there will be no speed increase?
What exactly does the number represent next to "Threads" in the task manager under the "Performance" tab? Currently, my task manager is showing "Threads" as over 1,000. Why is it this number and how can it even go past 8 if that's all my processor supports?
I also noticed that when I tried my program with 500 threads as a test, the number in the task manager increased by 500 but it had the same speed as if I set it to use 8 threads instead. So if the number is increasing according to the number of threads I am using in my Java application, then why is the speed the same?
Also, I have tried doing a small test with threads in Java but the output doesn't make sense to me.
Here is my Test class:
import java.text.SimpleDateFormat;
import java.util.Date;
public class Test {
private static int numThreads = 3;
private static int numLoops = 100000;
private static SimpleDateFormat dateFormat = new SimpleDateFormat("[hh:mm:ss] ");
public static void main(String[] args) throws Exception {
for (int i=1; i<=numThreads; i++) {
final int threadNum = i;
new Thread(new Runnable() {
public void run() {
System.out.println(dateFormat.format(new Date()) + "Start of thread: " + threadNum);
for (int i=0; i<numLoops; i++)
for (int j=0; j<numLoops; j++);
System.out.println(dateFormat.format(new Date()) + "End of thread: " + threadNum);
}
}).start();
Thread.sleep(2000);
}
}
}
This produces an output such as:
[09:48:51] Start of thread: 1
[09:48:53] Start of thread: 2
[09:48:55] Start of thread: 3
[09:48:55] End of thread: 3
[09:48:56] End of thread: 1
[09:48:58] End of thread: 2
Why does the third thread start and end right away while the first and second take 5 seconds each? If I add more than 3 threads, the same thing happens for all threads above 2.
Sorry if this was a long read, I had a lot of questions.
Thanks in advance.
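Your processor has 8 hardware threads (4 physical cores with Hyper-Threading), which the operating system sees as 8 logical cores. This does in fact mean that only 8 things can be running at any given moment. That doesn't mean that you are limited to only 8 software threads, however.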
When a thread is synchronously opening a connection to a URL it will often sleep while it waits for the remote server to get back to it. While that thread is sleeping other threads can be doing work. If you have 500 threads and all 500 are sleeping then you aren't using any of the cores of your CPU.
On the flip side, if you have 500 threads and all 500 threads want to do something then they can't all run at once. To handle this scenario there is a special tool. Processors (or more likely the operating system or some combination of the two) have a scheduler which determines which threads get to be actively running on the processor at any given time. There are many different rules and sometimes random activity that controls how these schedulers work. This may explain why in the above example thread 3 always seems to finish first. Perhaps the scheduler is preferring thread 3 because it was the most recent thread to be scheduled by the main thread, it can be impossible to predict the behavior sometimes.
Now to answer your question regarding performance. If opening a connection never involved a sleep then it wouldn't matter if you were handling things synchronously or asynchronously you would not be able to get any performance gain above 8 threads. In reality, a lot of the time involved in opening a connection is spent sleeping. The difference between asynchronous and synchronous is how to handle that time spent sleeping. Theoretically you should be able to get nearly equal performance between the two.
With a multi-threaded model you simply create more threads than there are cores. When the threads hit a sleep they let the other threads do work. This can sometimes be easier to handle because you don't have to write any scheduling or interaction between the threads.
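A quick way to convince yourself of this is to simulate the blocking part with a sleep. In the sketch below (BlockingPoolDemo is a hypothetical class, and Thread.sleep stands in for waiting on a remote server), 32 tasks that each block for 100 ms finish in roughly 100 ms on a pool of 32 threads, even on a machine with far fewer cores, while a pool of 1 thread needs about 32 times as long.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical benchmark: tasks that mostly sleep, run on pools of different sizes.
class BlockingPoolDemo {
    // Runs `tasks` blocking tasks on a pool of `poolSize` threads; returns elapsed milliseconds.
    static long runMillis(int poolSize, int tasks, long blockMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Callable<Void>> work = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            work.add(() -> {
                Thread.sleep(blockMillis); // stand-in for waiting on the remote server
                return null;
            });
        }
        long start = System.nanoTime();
        pool.invokeAll(work); // blocks until every task has finished
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        pool.shutdown();
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("32 threads: " + runMillis(32, 32, 100) + " ms");
        System.out.println(" 1 thread:  " + runMillis(1, 32, 100) + " ms");
    }
}
```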
With an asynchronous model you only create a single thread per core. If that thread needs to sleep then it doesn't sleep but actually has to have code to handle switching to the next connection. For example, assume there are three steps in opening a connection (A,B,C):
while (!connectionsList.isEmpty()) {
    //use an explicit Iterator so finished connections can be removed while looping
    Iterator<Connection> it = connectionsList.iterator();
    while (it.hasNext()) {
        Connection connection = it.next();
        if (connection.getState() == READY_FOR_A) {
            connection.stepA();
            //this method should return immediately and the connection
            //should go into the waiting state for some time before going
            //into the READY_FOR_B state
        }
        if (connection.getState() == READY_FOR_B) {
            connection.stepB();
            //same immediate return behavior as above
        }
        if (connection.getState() == READY_FOR_C) {
            connection.stepC();
            //same immediate return behavior as above
        }
        if (connection.getState() == WAITING) {
            //Do nothing, skip over
        }
        if (connection.getState() == FINISHED) {
            it.remove();
        }
    }
}
Notice that at no point does the thread sleep so there is no point in having more threads than you have cores. Ultimately, whether to go with a synchronous approach or an asynchronous approach is a matter of personal preference. Only at absolute extremes will there be performance differences between the two and you will need to spend a long time profiling to get to the point where that is the bottleneck in your application.
It sounds like you're creating a lot of threads and not getting any performance gain. There could be a number of reasons for this.
It's possible that your establishing a connection isn't actually sleeping in which case I wouldn't expect to see a performance gain past 8 threads. I don't think this is likely.
It's possible that all of the threads are using some common shared resource. In this case the other threads can't work because the sleeping thread has the shared resource. Is there any object that all of the threads share? Does this object have any synchronized methods?
It's possible that you have your own synchronization. This can create the issue mentioned above.
It's possible that each thread has to do some kind of setup/allocation work that is defeating the benefit you are gaining by using multiple threads.
If I were you I would use a tool like JVisualVM to profile your application when running with some smallish number of threads (20). JVisualVM has a nice colored thread graph which will show when threads are running, blocking, or sleeping. This will help you understand the thread/core relationship as you should see that the number of running threads is less than the number of cores you have. In addition if you see a lot of blocked threads then that can help lead you to your bottleneck (if you see a lot of blocked threads use JVisualVM to create a thread dump at that point in time and see what the threads are blocked on).
Some concepts:
You can have many threads in the system, but only some of them (max 8 in your case) will be "scheduled" on the CPU at any point of time. So, you cannot get more performance than 8 threads running in parallel. In fact the performance will probably go down as you increase the number of threads, because of the work involved in creating, destroying and managing threads.
Threads can be in different states : http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html
Out of those states, the RUNNABLE threads stand to get a slice of CPU time. Operating System decides assignment of CPU time to threads. In a regular system with 1000's of threads, it can be completely unpredictable when a certain thread will get CPU time and how long it will be on CPU.
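To see some of these states for yourself, you can query Thread.getState() at different points in a thread's life. A minimal sketch, with a made-up ThreadStateDemo class; it also prints the number of logical processors the JVM sees, which is the upper bound on threads that can run at the same instant.

```java
// Hypothetical demo that samples a thread's state before, during and after it runs.
class ThreadStateDemo {
    static Thread.State[] observe() throws InterruptedException {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread.State before = t.getState(); // NEW: created but not started
        t.start();
        Thread.State during;
        do {
            during = t.getState(); // poll until the thread is inside sleep()
        } while (during != Thread.State.TIMED_WAITING && during != Thread.State.TERMINATED);
        t.join();
        Thread.State after = t.getState(); // TERMINATED: run() has returned
        return new Thread.State[] { before, during, after };
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("logical processors: " + Runtime.getRuntime().availableProcessors());
        for (Thread.State s : observe()) {
            System.out.println(s);
        }
    }
}
```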
About the problem you are solving:
You seem to have figured out the correct solution - making parallel asynchronous network requests. However, practically speaking starting 10000+ threads and that many network connections, at the same time, may be a strain on the system resources and it may just not work. This post has many suggestions for asynchronous I/O using Java. (Tip: Don't just look at the accepted answer)
This solution is more specific to the general problem of trying to make 10k requests as fast as possible. I would suggest that you abandon the Java HTTP libraries and use Apache's HttpClient instead. They have several suggestions for maximizing performance which may be useful. I have heard the Apache HttpClient library is just faster in general as well, lighter weight and less overhead.

Idle threads left with ScheduledThreadPoolExecutor.schedule

I have a Java application that is structured as:
One thread watching a java.nio.Selector for IO.
A java.util.concurrent.ScheduledThreadPoolExecutor thread pool handling either work to be done immediately — dispatching IO read by the IO thread — or work to be done after a delay, usually errors.
The ScheduledThreadPoolExecutor has an upper bound on the number of threads to create; currently 5000 in the app, but I haven't tuned that number at all.
After running the app for a while, I get thousands and thousands of threads that have this stack trace:
"pool-1-thread-5262" prio=10 tid=0x00007f636c2df800 nid=0x2516 waiting on condition [0x00007f60246a5000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000581c49520> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
at java.util.concurrent.DelayQueue.poll(DelayQueue.java:209)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.poll(ScheduledThreadPoolExecutor.java:611)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.poll(ScheduledThreadPoolExecutor.java:602)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:945)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:662)
I assume that the above is being caused by my calls to schedule(java.lang.Runnable, long, java.util.concurrent.TimeUnit), which certainly happens often in the app. Is this the expected behavior?
Having all of these threads hanging around doesn't seem to impact the application at all — if a worker thread is needed, it does not appear like these TIMED_WAITING threads prevent tasks from running when submitted through the submit method, but I'm not totally sure of that. Does having thousands of threads hanging around in this parked state impact the app or system performance?
Tasks that are submitted via the schedule method are very simple: they basically just re-schedule the Channel back with the Selector. So, these tasks are not very long-lived, they just need to execute at some point in the future. Normal worker threads will do traditional blocking-IO to perform their work, and are generally more long-lived.
A related question: is it better to do delayed tasks in an explicit, single thread instead of using the schedule method? That is, have a loop like this:
DelayQueue<SomeTaskClass> tasks = ...;
while (true) {
    SomeTaskClass task = tasks.take();
    threadpool.submit(task);
}
Does DelayQueue use any worker threads to implement its functionality? I was going to just experiment with it today, but advice would be well appreciated.
After running the app for a while, I get thousands and thousands of threads that have this stack trace.
Unless you actually plan on having 5000 threads all operating at once, that number is too high. If they are blocked on IO then that should be fine. Unless you are starting with a minimum number of threads that is too large, their existence in your thread dump means that at some point they were all needed to process the tasks submitted to the executor. So at some point you had 5000 tasks being run at once, blocking or whatever. If you show the actual executor constructor call I can be more specific.
If you have the time, playing with that upper bound might be good to see if it does affect application behavior.
Does having thousands of threads hanging around in this parked state impact the app or system performance?
They will take up more memory which may affect JVM performance but otherwise it should not impact the application unless too many are running at once. They may just be wasting some system resources which is the only reason why I'd play with the 5000 and other executor constructor args.
is it better to do delayed tasks in an explicit, single thread instead of using the schedule method?
I'd say no. Just about anytime you can replace by-hand thread code with a use of the ExecutorService classes it is a good thing. I think the idea of doing a task and then delaying for a while is a great use of the ScheduledThreadPoolExecutor.
Does DelayQueue use any worker threads to implement its functionality?
No. It is just a BlockingQueue implementation that helps with delaying of tasks. I've never used the class actually, although I would have if I'd known about it. The ScheduledThreadPoolExecutor uses this class to do its job so using DelayQueue yourself is again a waste. Just stick with STPE.
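For reference, a delayed task on a small STPE looks roughly like this. The ScheduleDemo class and the pool size of 2 are assumptions for the example, not a recommendation for your app; the point is that a short delayed task does not need a thread of its own while it waits.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: schedule one delayed task on a small pool and report when it ran.
class ScheduleDemo {
    // Returns how many milliseconds after submission the task actually ran.
    static long delayedRunMillis(long delayMillis) throws Exception {
        ScheduledThreadPoolExecutor stpe = new ScheduledThreadPoolExecutor(2); // assumed pool size
        long start = System.nanoTime();
        Callable<Long> task = () -> (System.nanoTime() - start) / 1_000_000;
        ScheduledFuture<Long> future = stpe.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
        long ranAfter = future.get(); // blocks until the delayed task has run
        stpe.shutdown();
        return ranAfter;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("task ran after " + delayedRunMillis(100) + " ms");
    }
}
```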
