What exactly makes Java Virtual Threads better - java

I am pretty hyped for Project Loom, but there is one thing that I can't fully understand.
Most Java servers use thread pools with a certain limit of threads (200, 300 ..), however, you are not limited by the OS to spawn many more, I've read that with special configurations for Linux you can reach huge numbers.
OS threads are more expensive and they are slower to start/stop, have to deal with context switching (magnified by their number) and you are dependent on the OS which might refuse to give you more threads.
Having said that virtual threads also consume similar amounts of memory (or at least that is what I understood). With Loom we get tail-call optimizations which should reduce memory usage. Also, synchronization and thread context copy should still be a problem of a similar size.
Indeed you are able to spawn millions of Virtual Threads
public static void main(String[] args) {
for (int i = 0; i < 1_000_000; i++) {
Thread.startVirtualThread(() -> {
try {
Thread.sleep(1000);
} catch (Exception e) {
e.printStackTrace();
}
});
}
}
the code above breaks at around 25k with an OOM exception when I use Platform threads.
My question is what exactly makes these threads so light, what is preventing us from spawning 1 million platform threads and working with them, is it only the context switching that makes the regular threads so "heavy".
One very similar question
Things I found so far:
Context Switching is expensive. Generally speaking even in the ideal case where the OS knows how the threads would behave it will still have to give each thread an equal chance to execute, given they have the same priority. If we spawn 10k OS threads it will have to constantly switch between them and this task alone can occupy up to 80% of the CPU time in some cases, so we have to be very careful with the numbers. With Virtual Threads, context switching is done by the JVM which makes it basically free
Cheap start/stop. When we interrupt a thread we essentially tell the task, "Kill the OS thread you are running on". However if for example, that thread is in a thread pool, by the time we are asking, the thread might be released by the current task and then given to another and the other task might get the interruption signal. This makes the interruption process quite complex. Virtual Threads are simply objects that live in the heap, we can just let the GC collect them in the background
Hard upper limits (tens of thousands at most) of threads, due to how the OS handles them. The OS can’t be fine-tuned to the specific applications and programming language so it has to prepare for the worst-case scenario memory-wise. It has to allocate more memory that will actually be used to accommodate all needs. While doing all of this it has to ensure that the vital OS processes are still working. With VT you are only limited by the memory which is cheap
Thread that performs a transaction behaves very differently than a Thread that does video processing, again the OS has to prepare for the worst-case scenario and accommodate both cases the best way it can, which means we get suboptimal performance in most cases. Since VT are spawned and managed by Java itself, this allows for complete control over them and task-specific optimizations that are not bound to the OS
Resizable stack. The OS gives Threads a big stack to fit all use cases, Virtual Threads have a resizable stack that lives in the heap space, it is dynamically resized to fit the problem which makes it smaller
Smaller metadata size. Platform threads use 1MB as mentioned above, whereas Virtual Threads need 200-300 bytes to store their metadata

One big advantage of coroutines (so virtual threads) is that they can generate high levels of concurrency without the drawback of callbacks.
let me first introduce Little's Law:
concurrency = arrival_rate * latency
And we can rewrite this to:
arrival_rate = concurrency/latency
In a stable system, the arrival rate equals throughput.
throughput = concurrency/latency
To increase throughput, you have 2 options:
decrease latency; which typically is very hard since you have little influence on how much time a remote call or a request to disk takes.
increase concurrency
With regular threads, it is difficult to reach high levels of concurrency with blocking calls due to context switch overhead. Requests can be issued asynchronously in some cases (e.g. NIO + Epoll or Netty io_uring binding), but then you need to deal with callbacks and callback hell.
With a virtual thread, the request can be issued asynchronously and park the virtual thread and schedule another virtual thread. Once the response is received, the virtual thread is rescheduled and this is done completely transparently. The programming model is much more intuitive than using classic threads and callbacks.

Virtual threads are wrapped upon platform threads, so you may consider them an illusion that JVM provides, the whole idea is to make lifecycle of threads to CPU bound operations.
What exactly makes Java Virtual Threads better ?
Virtual threads advantages
exhibits exact the same behavior as platform threads.
disposable and can be scaled to millions.
much more lightweight than platform threads.
fast creation time, as fast as creating string object.
the JVM does delimited continuation on IO operations, no IO for
virtual threads.
yet can have the sequential code as previous but way more effective.
the JVM gives an illusion of virtual threads, underneath whole story
goes on platform threads.
Just with usage of virtual thread CPU core become much more concurrent, the combination of virtual threads and multi core CPU with ComputableFutures to parallelized code is very powerful
Virtual threads usage cautions
Don not use monitor i.e the synchronized block, however this will fix in new release of JDK's, an alternative to do so is to use 'ReentrantLock' with try-final statement.
Blocking with native frames on stack, JNI's. its very rare
Control memory per stack (reduce thread locales and no deep recursion)
Monitoring tools not updated yet like debuggers, JConsole, VisualVM etc
Platform Threads versus Virtual threads. Platform threads take OS
threads hostage in IO based tasks and operations limited to number of
applicable threads with in thread pool and OS threads, by default
they are non Daemon threads
Virtual threads are implemented with JVM, in CPU bound operations the
associated to platform threads and retuning them to thread pool,
after IO bound operation finished a new thread will be called from
thread pool, so no hostage in this case.
Fourth level architecture to have better understanding.
CPU
Multicore CPU multicores with in cpu executing operations.
OS
OS threads the OS scheduler allocating cpu time to engaged OS
threads.
JVM
platform threads are wrapped totally upon OS threads with both task
operations
virtual threads are associated to platform threads in each CPU bound
operation, each virtual thread can be associated with multiple
platform threads as different times.
Virtual threads with Executer service
More effective to use executer service cause it associated to thread pool an limited to applicable threads with it, however in compare of virtual threads, with Executer service and virtual contained we do not ned to handle or manage the associated thread pool.
try(ExecutorService service = Executors.newVirtualThreadPerTaskExecutor()) {
service.submit(ExecutorServiceVirtualThread::taskOne);
service.submit(ExecutorServiceVirtualThread::taskTwo);
}
Executer service implements Auto Closable interface in JDK 19, thus when used with in 'try with resource', once it reach to end of 'try' block the 'close' api being called, alternatively main thread will wait till all submitted task with their dedicated virtual threads finish their lifecycle and associated thread pool being shutdown.
ThreadFactory factory = Thread.ofVirtual().name("user thread-", 0).factory();
try(ExecutorService service = Executors.newThreadPerTaskExecutor(factory)) {
service.submit(ExecutorServiceThreadFactory::taskOne);
service.submit(ExecutorServiceThreadFactory::taskTwo);
}
Executer service can be created with virtual thread factory as well, just putting thread factory with it constructor argument.
Can benefits features of Executer service like Future and Completable Future.
Find more on JEP-425

Sometimes people have to build systems able to handle an enormous number of simultaneous clients. Native threads are inadequate means for doing that due to RAM consumption and context switching costs.
Virtual threads give us an ability to run millions of I/O bound tasks simultaneously without changing our mental model.
That's why Golang made its way into the industry (besides Google support). Goroutines are a concept very similar to Java's virtual threads and they solve the same problem.
There are other ways to achieve what virtual thread do (such as NIO and the related Reactor pattern). This, however, entails using message loops and callbacks which warp your mind (that's why so many people hate JavaScript). There are layers of abstractions on top of them making things a bit easier but they also have a cost.

Related

Performance of Multi-Threaded Client Network Application

I have implemented a client application that sends requests to a server. The way it works can be described very simply. I specify a number of threads. Each of this threads repeatedly sends requests to a server and waits for the answer.
I have plotted the total throughput of the client, for a various number of threads. The number of virtual clients is not important, I am interested by the maximal, saturated performance, at the very right of the graph.
I am surprised because I did not expect the performance to scale with the number of threads. Indeed, most of the processor time is spent in blocking i/o in Java (blocking sockets), as the client-server communication has a 1ms latency, and the client is running on a 8 core machine.
I have looked for solutions online, this answer on Quora seems to imply that the waiting time for blocking i/o can be scheduled to use for other tasks. Is is true, specifically for Java blocking sockets ? In that case, why don't I get linear scaling with the number of threads ?
If this important, I am running this application in the cloud. Also, this is part of a larger application, but I have identified this component as the bottleneck of the whole setup.
I have looked for solutions online, this answer on Quora seems to
imply that the waiting time for blocking i/o can be scheduled to use
for other tasks. Is is true, specifically for Java blocking sockets ?
Regular Java threads map to OS-level threads one-to-one. They're equivalent. So yes, it's true of Java, and in fact every other language. Unless it's using Green Threads or non-blocking IO.
In that case, why don't I get linear scaling with the number of
threads ?
Think about what you're doing from the perspective of the CPU. The CPU performs a costly context switch and allows some thread to run. That thread uses the CPU for a very short duration to prepare a network call, and then it blocks for a long time (milliseconds are quite a lot for a CPU working at 3GHz).
So each thread is doing only a tiny bit of work before another context switch is required. That means that a lot of the CPU's time is wasted on context switches instead of doing useful work.
Contrast that with a thread that's doing a CPU-bound task. The context switch takes the same time. But when a CPU-bound task is allowed to run, it manages to utilize the CPU for a long time, making the context-switch cheaper by comparison. This increases the overall CPU utilization.
So on one hand, you see higher rates with every new thread because you're essentially performing more concurrent I/O operations. On the other hand, every new thread adds a cost. So the marginal benefit of each additional thread is a bit smaller every time. If you keep adding threads, at some point you'll even reach a point where the rate will fall with each new thread.

Java: I want to adjust the size of a thread pool. How can I detect either CMS collection starts, or other system-wide factors affecting available CPU?

(The specifics for this question are for a mod for Minecraft. In general, the question deals with resizing a threadpool based on system load and CPU availability).
I am coming from an Objective C background, and Apple's libdispatch (Grand Central Dispatch) for thread scheduling.
The immediate concern I have is trying to reduce the size of the threadpool when a CMS tenured collection is running. The program in question (Minecraft) only works well with CMS collections. A much less immediate, but still "of interest", is reducing the threadpool size when other programs are demanding significant CPU (specifically, either a screen recorder, or a twitch stream).
In Java, I have just found out about (deep breath):
Executors, which provide access to thread pools (both fixed size, and adjustable size), with cached thread existence (to avoid the overhead of constantly re-creating new threads, or to avoid the worry of coding threads to pause and resume based on workload),
Executor (no s), which is the generic interface for saying "Now it is time to execute this runnable()",
ExecutorService, which manages the threadpools according to Executor,
ThreadPoolExecutor, which is what actually manages the thread pool, and has the ability to say "This is the maximum number of threads to use".
Under normal operation, about 5 times a second, there will be 50 high priority, and 400 low priority operations submitted to the thread pool per user on the server. This is for high-powered machines.
What I want to do is:
Work with less-powerful machines. So, if a computer only has 2 cores, and the main program (two primary threads, plus some minor assistant threads) is already maxing out the CPU, these background tasks will be competing with the main program and the garbage collector. In this case, I don't want to reduce the number of background threads (it will probably stay at 2), but I do want to reduce how much work I schedule. So this is just "How do I detect when the work-load is going up". I suspect that this is just a case of watching the size of the work queue I use when Executors.newCachedThreadPool()
But the first problem: I can't find anything to return the size of the work queue! ThreadPoolExecutor() can return the queue, and I can ask that for a size, but newCachedThreadPool() only returns an ExecutorService, which doesn't let me query for size (or rather, I don't see how to).
If I have "enough cores", I want to tell the pool to use more threads. Ideally, enough to keep CPU usage near max. Most of the tasks that I want to run are CPU bound (disk I/O will be the exception, not the rule; concurrency blocking will also be rare). But I don't want to heavily over-schedule threads. How do I determine "enough threads" without going way over the available cores?
If, for example, screen recording (or streaming) activates, CPU core usage by other programs will go up, and then I want to reduce the number of threads; as the number of threads go down, and queue backlog goes up, I can reduce the amount of tasks I add to the queue. But I have no idea how to detect this.
I think that the best advice I / we can give is to not try to "micro-manage" the number of threads in the thread pools. Set it to sensible size that is proportional to the number of physical cores ... and leave it. By all means provide some static tuning parameters (e.g. in config files), but don't to make the system tune itself dynamically. (IMO, the chances that dynamic tuning will work better than static are ... pretty slim.)
For "real-time" streaming, your success is going to depend on the overall load and the scheduler's ability to prioritize more than the number of threads. However it is a fact that standard Java SE on a standard OS is not suited to hard real-time, so your real-time performance is liable to deteriorate badly if you push the envelope.

What is the relationship between number of CPU cores and number of threads in an app in java?

I'm new to java multi-threaded programming. The question that has came to my mind is that how many threads can I run according to the number of my CPU cores. and if I run threads more than CPU cores will it be an overhead for the machine to run the app. for example when we have a server machine which has a server software that run 2 threads(main thread + developer thread), will it be an overhead for the server when more simultaneous clients make socket connections to the server or not?
Thanks.
The number of threads a system can execute simultaneously is (of course) identical to the number of cores in the system.
The number of threads that can exist on the system is limited by the available memory (each thread requires a stack and a structure used by the OS to manage the thread), and possibly there is a limitation how many threads the OS allows (this depends on the OS architecture, some OS' may use a fixed size table and once its full no more threads can be created).
Commonly, todays computers can handle hundreds to thousands of threads.
The reason why more threads are used than cores exist in the system is: Most threads will inevitably spend much of their time waiting for some event (example: word processor waiting for user to type on keyboard). The OS manages it that threads that wait in such a manner do not consume CPU time.
Idea behind it is don't let your CPU sleep, neither load it too much that it waste most of time in thread switching.
Its helpful to check Tuning the pool size, In IBMs paper
Idea behind is, it depends on the nature of task, if its all in-memory computation tasks you can use N+1 threads (N numbers of cores (included hyper threading)).
Or
we need to do the application profiling and find out waiting time (WT) , service time (ST) for a typical request and approximately N*(1+WT/ST) number of optimal threads we can have, considering 100% utilization of CPU.
That depends on what the threads are doing. The CPU is only able to do X things at once, where X is the number of cores it has. That means X threads at most can be active at any one time - however the other threads can wait their turn and the CPU will process them at appropriate moments.
You should also consider that a lot of the time threads are waiting for a response, or waiting for data to load, or a network message to arrive, etc so are not actually trying to do anything. These idle/waiting threads have very little load on the system.
Don't worry about getting a higher number of threads than CPU cores; that is actually not in your hands, but in OS'.
Assuming the JVM maps your java threads over OS threads (which is fairly normal these days), it depends on the thread management your OS does. There you rely on how smart the kernel implementation is to get performance out of your cores.
What you must keep in mind is that your design must be sustainable. For example, application servers are built on a threadpool full of worker threads. Those threads are awaken in order to serve requests. Do you want a thread for each request? Then you will surely have a problem - requests can arrive in the thousands to the server, and that could be a problem for the kernel to manage. Actually the threadpool size should be limited (between 1 and X and easily changed even in real time), threads should get work from a concurrent queue (java gives you some excellent classes for that) and each one attend requests sequentially.
I hope that being of help
Having less threads than CPUs can mean you are not using all the CPUs in your system. Having more threads might improve throughput if CPU is your bottleneck.
Having more threads than CPU does introduce an overhead and if CPU is your bottleneck this can hurt performance. However, if network IO, is your bottleneck, this overhead is a price worth paying as it usually allows you to handle many more connections. e.g. You can have 1000 TCP connections with their own threads.
There doesn't have to be any relation. A computer can have any number of cores; a process can have any number of threads.
There are several different reasons that processes utilize threading, including:
Programming abstraction. Dividing up work and assigning each division to a unit of execution (a thread) is a natural approach to many problems. Programming patterns that utilize this approach include the reactor, thread-per-connection, and thread pool patterns. Some, however, view threads as an anti-pattern. The inimitable Alan Cox summed this up well with the quote, "threads are for people who can't program state machines."
Blocking I/O. Without threads, blocking I/O halts the whole process. This can be detrimental to both throughput and latency. In a multithreaded process, individual threads may block, waiting on I/O, while other threads make forward progress. Blocking I/O via threads is thus an alternative to asynchronous & non-blocking I/O.
Memory savings. Threads provide an efficient way to share memory yet utilize multiple units of execution. In this manner they are an alternative to multiple processes.
Parallelism. In machines with multiple processors, threads provide an efficient way to achieve true parallelism. As each thread receives its own virtualized processor and is an independently schedulable entity, multiple threads may run on multiple processors at the same time, improving a system's throughput. To the extent that threads are used to achieve parallelism—that is, there are no more threads than processors—the "threads are for people who can't program state machines" quote does not apply.
The first three bullets utilize threads with no relationship to cores. If you are using threads as a programming abstraction to handle UI elements, for example, you'll have one thread per UI element (or whatever) regardless of whether you have 1 core or 12. Similarly, if you were using threads to perform blocking I/O, you'd scale your thread count with your I/O capacity, not your processing power.
The fourth bullet, however, does relate threads to cores. If the goal of threading is parallelism, then the number of threads should scale linearly with the number of cores. For example, if you double the number of cores in a system, then you would double the number of threads in your application. This is true for cores in the logical sense—that is, including SMT.
When threading is used to achieve parallelism—and this is both a common and the best use of threading—you will often have, say, one or two threads per core. Oftentimes, applications are written so as to dynamically size thread pools off the number of available cores. A single thread per core is ideal, but applications often use a larger multiplier, such as two threads per core, due to bugs and inefficiencies in their code, such as operations that block when none should.
Best performance will be when number of cores(NOC) equals number of thread (NOT), because if NOT > NOC then processor should switch context or OS will try to do that work, which is expensive enough opperation. But you have to understand that it impossible to have NOC = NOT on Web Servers because you can't predict how much clients will be at the same time. Take a look on load balancing concept to solve this issue in best way.

What does Thread Affinity mean?

Somewhere I have heard about Thread Affinity and Thread Affinity Executor. But I cannot find a proper reference for it at least in java. Can someone please explain to me what is it all about?
There are two issues. First, it’s preferable that threads have an affinity to a certain CPU (core) to make the most of their CPU-local caches. This must be handled by the operating system. This CPU affinity for threads is often also called “thread affinity”. In case of Java, there is no standard API to get control over this. But there are 3rd party libraries, as mentioned by other answers.
Second, in Java there is the observation that in typical programs objects are thread-affine, i.e. typically used by only one thread most of the time. So it’s the task of the JVM’s optimizer to ensure, that objects affine to one thread are placed close to each other in memory to fit into one CPU’s cache but place objects affine to different threads not too close to each other to avoid that they share a cache line as otherwise two CPUs/Cores have to synchronize them too often.
The ideal situation is that a CPU can work on some objects independently to another CPU working on other objects placed in an unrelated memory region.
Practical examples of optimizations considering Thread Affinity of Java objects are
Thread-Local Allocation Buffers (TLABs)
With TLABs, each object starts its lifetime in a memory region dedicated to the thread which created it. According to the main hypothesis behind generational garbage collectors (“the majority of all objects will die young”), most objects will spent their entire lifetime in such a thread local buffer.
Biased Locking
With Biased Locking, JVMs will perform locking operations with the optimistic assumption that the object will be locked by the same thread only, switching to a more expensive locking implementation only when this assumption does not hold.
#Contended
To address the other end, fields which are known to be accessed by multiple threads, HotSpot/OpenJDK has an annotation, currently not part of a public API, to mark them, to direct the JVM to move these data away from the other, potentially unshared data.
Let me try explaining it. With the rise of multicore processors, message passing between threads & thread pooling, scheduling has become more costlier affair. Why this has become much heavier than before, for that we need to understand the concept of "mechanical sympathy". For details you can go through a blog on it. But in crude words, when threads are distributed across different cores of a processor, when they try to exchange messages; cache miss probability is high. Now coming to your specific question, thread affinity being able to assign specific threads to a particular processor/core. Here is one of the library for java that can be used for it.
The Java Thread Affinity version 1.4 library attempts to get the best of both worlds, by allowing you to reserve a logical thread for critical threads, and reserve a whole core for the most performance sensitive threads. Less critical threads will still run with the benefits of hyper threading. e.g. following code snippet
AffinityLock al = AffinityLock.acquireLock();
try {
// find a cpu on a different socket, otherwise a different core.
AffinityLock readerLock = al.acquireLock(DIFFERENT_SOCKET, DIFFERENT_CORE);
new Thread(new SleepRunnable(readerLock, false), "reader").start();
// find a cpu on the same core, or the same socket, or any free cpu.
AffinityLock writerLock = readerLock.acquireLock(SAME_CORE, SAME_SOCKET, ANY);
new Thread(new SleepRunnable(writerLock, false), "writer").start();
Thread.sleep(200);
} finally {
al.release();
}
// allocate a whole core to the engine so it doesn't have to compete for resources.
al = AffinityLock.acquireCore(false);
new Thread(new SleepRunnable(al, true), "engine").start();
Thread.sleep(200);
System.out.println("\nThe assignment of CPUs is\n" + AffinityLock.dumpLocks());
Thread affinity (or process affinity) describes on which processor cores the thread/process is allowed to run. Normally, this setting is equal to the (logical) CPUs in your system, and there's hardly a reason for changing this, because the operating system then has the best possibilities to schedule your tasks among the available processors.
See i.e. http://msdn.microsoft.com/en-us/library/windows/desktop/ms683213(v=vs.85).aspx for how this works in windows. I don't know whether java offers an API to set these.

Why does ScheduledThreadPoolExecutor only accept a fixed number of threads?

I may imagine some tasks scheduled to take a very long time and ScheduledThreadPoolExecutor would create additional threads for the other tasks that need to be run, until a maximum number of threads is reached.
But seems that I can only specify a fixed number of threads for the pool, why is that so ?
As the why, I don't know either. But I can imagine.
The amount of resources of a computer is limited. Not all resources can be handled concurrently either.
If multiple processes concurrently load files, they will be loaded slower than if they were being loaded sequentially (at least on a harddisk).
A processor also has limited support for handling multiple threads concurrently. At some point the OS or JVM will spend more time switching threads, than threads spend executing their code.
That is a good reason for the ScheduledThreadPoolExecutor to be designed the way it is. You can put any amount of jobs on the queue, but there are never executed more jobs at the same time than can be run efficiently. It's up to you to balance that, of course.
If your tasks are IO bound, I'd set the pool size small, and if they are CPU bound, a bit larger (32 or so). You can also make multiple ScheduledThreadPoolExecutors, one for IO bound tasks and one for CPU bound tasks.
While digging further about SchecduledThreadPoolExecutor I found this link, This link explains SchecduledThreadPoolExecutor solves many problems of Timer class. And also reason for introducing SchecduledThreadPoolExecutor was to replace Timer (due to various problems with Timer).
I think Reason for fixed number of threads passed to SchecduledThreadPoolExecutor is in one of problems this class solves. i.e.
The Timer class starts only one
thread. While it is more efficient
than creating a thread per task, it is
not an optimal solution. The optimal
solution may be to use a number of
threads between one thread for all
tasks and one thread per task. In
effect, the best solution is to place
the tasks in a pool of threads. The
number of threads in the pool should
be assignable during construction to
allow the program to determine the
optimal number of threads in the pool.
So in my opinion this is where use case for SchecduledThreadPoolExecutor is coming from. In your case, You should be able to decide optimal value depending upon tasks you plan to schedule and time these tasks takes to finish. If I have 4 long running tasks which are scheduled for same time I would prefer my pool size be greater than 4 if there are other tasks to be executed during same time. I would also prefer to seperate out long running tasks in different executor as indicated in earlier answers.
Hope this helps :)
According to Java Concurrency In Practice there are the following disadvantages of unbounded thread creation:
Thread lifecyle overhead
Thread creation and teardown are not free. Thread creation takes time, and requires some processing activity by the JVM and OS.
Resource consumption
Active threads consume system resources, especially memory. When there are more runnable threads than available processors, threads sit idle. Having many idle threads can tie up a lot of memory, putting pressure on the garbage collector, and having many threads competing for the CPUs can impose other performance costs as well. If you have enough threads to keep all the CPUs busy, creating more threads won't help and may even hurt.
Stability
There is a limit on how many threads can be created. The limit varies by platform and is affected by factors including JVM invocation parameters, the requested stack size in the Thread constructor, and limits on threads placed by the underlying operating system. When you hit htis limit, the most likely result is an OutOfMemoryError. Trying to recover from such an error is very risky; it is far easier to structur your program to avoid hitting this limit.
Up to a certain point, more threads can improve throughput, but beyond that point creating more threads just slows down your application, and creating one thread too many can cause your entire application to crash horribly. The way to stay out of danger is to place some bound on how many threads your application creates, and to test your application thoroughly to ensure that, even when this bound is reached, it does not run out of resources.
Unbounded thread creation may appear to work just fine during prototyping and development, with problems surfacing only when the application is deployed and under heavy load.

Categories