I have heard somewhere about Thread Affinity and a Thread Affinity Executor, but I cannot find a proper reference for it, at least in Java. Can someone please explain to me what it is all about?
There are two issues. First, it’s preferable that threads have an affinity to a certain CPU (core), to make the most of their CPU-local caches. This must be handled by the operating system. This CPU affinity for threads is often also called “thread affinity”. In the case of Java, there is no standard API to get control over this, but there are 3rd-party libraries, as mentioned in other answers.
Second, in Java there is the observation that in typical programs objects are thread-affine, i.e. typically used by only one thread most of the time. So it’s the task of the JVM’s optimizer to place objects affine to one thread close to each other in memory, so that they fit into one CPU’s cache, but to place objects affine to different threads not too close to each other, to avoid them sharing a cache line; otherwise two CPUs/cores would have to synchronize them too often.
The ideal situation is that a CPU can work on some objects independently of another CPU working on other objects placed in an unrelated memory region.
Practical examples of optimizations that consider the thread affinity of Java objects are:
Thread-Local Allocation Buffers (TLABs)
With TLABs, each object starts its lifetime in a memory region dedicated to the thread that created it. According to the main hypothesis behind generational garbage collectors (“the majority of all objects will die young”), most objects will spend their entire lifetime in such a thread-local buffer.
Biased Locking
With biased locking, JVMs perform locking operations with the optimistic assumption that an object will only ever be locked by the same thread, switching to a more expensive locking implementation only when this assumption does not hold. (Note that biased locking has been disabled by default and deprecated since JDK 15.)
@Contended
To address the other end, for fields that are known to be accessed by multiple threads, HotSpot/OpenJDK has an annotation (currently not part of a public API) to mark them, directing the JVM to move that data away from other, potentially unshared data.
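For illustration, here is a minimal sketch of that annotation in use. In recent JDKs it lives in jdk.internal.vm.annotation (formerly sun.misc), so it needs --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED and -XX:-RestrictContended to take effect; the class and field names here are hypothetical:

    import jdk.internal.vm.annotation.Contended;

    class Counters {
        @Contended                   // padded away from neighboring fields
        volatile long writerCounter; // hot field updated by one thread
        volatile long readerCounter; // hot field updated by another thread
    }

Without the padding, both counters could land on the same cache line, and every write by one thread would invalidate the other thread's cached copy (false sharing).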
Let me try explaining it. With the rise of multicore processors, message passing between threads, and thread pooling, scheduling has become a costlier affair. To understand why this has become much heavier than before, we need the concept of "mechanical sympathy"; for details you can go through a blog on it. In crude words: when threads are distributed across different cores of a processor and try to exchange messages, the cache-miss probability is high. Now, coming to your specific question: thread affinity means being able to pin specific threads to a particular processor/core. Here is one of the libraries for Java that can be used for it.
The Java Thread Affinity version 1.4 library attempts to get the best of both worlds by allowing you to reserve a logical CPU for critical threads, and a whole core for the most performance-sensitive threads, while less critical threads still run with the benefits of hyper-threading. E.g., the following code snippet:
    AffinityLock al = AffinityLock.acquireLock();
    try {
        // find a cpu on a different socket, otherwise a different core.
        AffinityLock readerLock = al.acquireLock(DIFFERENT_SOCKET, DIFFERENT_CORE);
        new Thread(new SleepRunnable(readerLock, false), "reader").start();

        // find a cpu on the same core, or the same socket, or any free cpu.
        AffinityLock writerLock = readerLock.acquireLock(SAME_CORE, SAME_SOCKET, ANY);
        new Thread(new SleepRunnable(writerLock, false), "writer").start();

        Thread.sleep(200);
    } finally {
        al.release();
    }

    // allocate a whole core to the engine so it doesn't have to compete for resources.
    al = AffinityLock.acquireCore(false);
    new Thread(new SleepRunnable(al, true), "engine").start();
    Thread.sleep(200);

    System.out.println("\nThe assignment of CPUs is\n" + AffinityLock.dumpLocks());
Thread affinity (or process affinity) describes on which processor cores the thread/process is allowed to run. Normally, this set equals all (logical) CPUs in your system, and there's hardly a reason for changing it, because the operating system then has the best chances to schedule your tasks among the available processors.
See e.g. http://msdn.microsoft.com/en-us/library/windows/desktop/ms683213(v=vs.85).aspx for how this works in Windows. I don't know whether Java offers an API to set these.
I am pretty hyped for Project Loom, but there is one thing that I can't fully understand.
Most Java servers use thread pools with a certain limit of threads (200, 300, ...); however, the OS does not prevent you from spawning many more. I've read that with special configuration on Linux you can reach huge numbers.
OS threads are more expensive: they are slower to start/stop, they have to deal with context switching (magnified by their number), and you are dependent on the OS, which might refuse to give you more threads.
Having said that, virtual threads also consume similar amounts of memory (or at least that is what I understood). With Loom we get tail-call optimization, which should reduce memory usage. Also, synchronization and thread-context copying should still be a problem of similar size.
Indeed, you are able to spawn millions of virtual threads:
    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) {
            Thread.startVirtualThread(() -> {
                try {
                    Thread.sleep(1000);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
    }
The code above breaks at around 25k threads with an OutOfMemoryError when I use platform threads.
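For reference, a sketch of the platform-thread version of the same loop:

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) {
            new Thread(() -> {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }).start(); // fails with "unable to create native thread" long before 1M
        }
    }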
My question is: what exactly makes these threads so light? What is preventing us from spawning a million platform threads and working with them? Is it only the context switching that makes regular threads so "heavy"?
One very similar question
Things I found so far:
Context switching is expensive. Generally speaking, even in the ideal case where the OS knows how the threads will behave, it still has to give each thread of the same priority an equal chance to execute. If we spawn 10k OS threads, it has to constantly switch between them, and in some cases this task alone can occupy up to 80% of the CPU time, so we have to be very careful with the numbers. With virtual threads, context switching is done by the JVM in user space, which makes it far cheaper (though not literally free).
Cheap start/stop. When we interrupt a thread, we essentially tell the task, "kill the OS thread you are running on". However, if that thread is in a thread pool, by the time we ask, the thread might already have been released by the current task and handed to another one, so the wrong task might receive the interruption signal. This makes the interruption process quite complex. Virtual threads, by contrast, are simply objects that live on the heap; we can just let the GC collect them in the background.
Hard upper limits on the number of threads (tens of thousands at most), due to how the OS handles them. The OS can’t be fine-tuned to a specific application and programming language, so it has to prepare for the worst case memory-wise and allocate more memory than will actually be used, all while ensuring that vital OS processes keep working. With virtual threads you are limited only by memory, which is cheap.
A thread that performs a transaction behaves very differently from a thread that does video processing; again, the OS has to prepare for the worst-case scenario and accommodate both cases as best it can, which means we get suboptimal performance in most cases. Since virtual threads are spawned and managed by Java itself, this allows complete control over them and task-specific optimizations that are not bound to the OS.
Resizable stack. The OS gives threads a big stack to fit all use cases; virtual threads have a resizable stack that lives in the heap and is dynamically resized to fit the problem, which makes it smaller.
Smaller metadata size. Platform threads reserve about 1 MB for their stack by default, whereas virtual threads need 200-300 bytes to store their metadata.
One big advantage of coroutines (and therefore of virtual threads) is that they can deliver high levels of concurrency without the drawback of callbacks.
Let me first introduce Little's Law:
concurrency = arrival_rate * latency
And we can rewrite this to:
arrival_rate = concurrency/latency
In a stable system, the arrival rate equals throughput.
throughput = concurrency/latency
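For example, if each request has a latency of 100 ms, then sustaining a throughput of 10,000 requests/s requires 10,000 × 0.1 = 1,000 requests in flight concurrently.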
To increase throughput, you have 2 options:
decrease latency, which is typically very hard since you have little influence over how long a remote call or a disk request takes.
increase concurrency
With regular threads, it is difficult to reach high levels of concurrency with blocking calls due to context-switch overhead. Requests can be issued asynchronously in some cases (e.g. NIO + epoll, or Netty's io_uring binding), but then you need to deal with callbacks and callback hell.
With virtual threads, the runtime issues the request asynchronously, parks the virtual thread, and schedules another virtual thread. Once the response is received, the virtual thread is rescheduled, and all of this happens completely transparently. The programming model is much more intuitive than using classic threads and callbacks.
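For example, here is a minimal sketch of that blocking style on virtual threads (JDK 21+; the URL is just a placeholder):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BlockingStyle {
        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            List<URI> urls = List.of(URI.create("https://example.com/"));
            try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
                for (URI url : urls) {
                    pool.submit(() -> {
                        // Looks synchronous; the virtual thread parks during the call
                        // instead of blocking an OS thread.
                        HttpResponse<String> resp = client.send(
                                HttpRequest.newBuilder(url).build(),
                                HttpResponse.BodyHandlers.ofString());
                        System.out.println(url + " -> " + resp.statusCode());
                        return null;
                    });
                }
            } // close() waits for all submitted tasks to finish
        }
    }

No callbacks are involved; each task is written as straight-line blocking code, and the JVM multiplexes the virtual threads over a small pool of carrier threads.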
Virtual threads are built on top of platform threads, so you may consider them an illusion that the JVM provides; the whole idea is that a carrier thread is only occupied while a virtual thread does actual CPU-bound work.
What exactly makes Java Virtual Threads better?
Virtual threads advantages
They exhibit exactly the same behavior as platform threads.
They are disposable and can be scaled to millions.
They are much more lightweight than platform threads.
Creation time is fast, comparable to creating a String object.
The JVM performs delimited continuations on I/O operations: a virtual thread does not block on I/O.
You can keep the same sequential code as before, but it is far more effective.
The JVM gives the illusion of virtual threads; underneath, the whole story happens on platform threads.
Just with the usage of virtual threads, a CPU core becomes much more concurrent; the combination of virtual threads and a multicore CPU with CompletableFutures to parallelize code is very powerful.
Virtual threads usage cautions
Do not use monitors, i.e. synchronized blocks, as they pin the carrier thread (this is expected to be fixed in upcoming JDK releases); an alternative is to use ReentrantLock with a try/finally statement, as in the sketch after this list.
Beware of blocking with native frames on the stack (JNI); this is very rare.
Control memory per stack (reduce thread locals and avoid deep recursion).
Monitoring tools such as debuggers, JConsole, VisualVM etc. are not fully updated yet.
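A minimal sketch of the ReentrantLock alternative mentioned in the first point (class and method names are made up):

    import java.util.concurrent.locks.ReentrantLock;

    class PinningSafeCounter {
        private final ReentrantLock lock = new ReentrantLock();
        private long count;

        void increment() {
            lock.lock();   // unlike synchronized, does not pin the carrier thread
            try {
                count++;
            } finally {
                lock.unlock();
            }
        }
    }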
Platform threads versus virtual threads: platform threads hold an OS thread hostage during I/O-bound tasks, operations are limited to the number of applicable threads in the thread pool and OS threads, and by default they are non-daemon threads.
Virtual threads are implemented by the JVM: for CPU-bound operations they are mounted on a platform thread, the platform thread is returned to the pool while the virtual thread waits, and after an I/O-bound operation finishes a (possibly different) platform thread is taken from the pool, so no OS thread is held hostage.
Four-level architecture, for a better understanding:
CPU
A multicore CPU has multiple cores within it executing operations.
OS
OS threads: the OS scheduler allocates CPU time to the runnable OS threads.
JVM
Platform threads are wrapped one-to-one around OS threads, for both CPU-bound and I/O-bound work; virtual threads are mounted on a platform thread for each CPU-bound operation, and a single virtual thread can be associated with multiple different platform threads over time.
Virtual threads with ExecutorService
It is often more effective to use an ExecutorService: normally it is tied to a thread pool and limited to the threads available in it. With the virtual-thread-per-task ExecutorService, however, we do not need to handle or manage an associated thread pool at all.
    try (ExecutorService service = Executors.newVirtualThreadPerTaskExecutor()) {
        service.submit(ExecutorServiceVirtualThread::taskOne);
        service.submit(ExecutorServiceVirtualThread::taskTwo);
    }
ExecutorService implements the AutoCloseable interface in JDK 19, so when used with try-with-resources, the close API is called once the end of the try block is reached; the main thread then waits until all submitted tasks finish on their dedicated virtual threads, and the executor is shut down.
    ThreadFactory factory = Thread.ofVirtual().name("user thread-", 0).factory();
    try (ExecutorService service = Executors.newThreadPerTaskExecutor(factory)) {
        service.submit(ExecutorServiceThreadFactory::taskOne);
        service.submit(ExecutorServiceThreadFactory::taskTwo);
    }
An ExecutorService can also be created with a virtual-thread factory: just pass the thread factory as the factory-method argument.
It can still benefit from ExecutorService features like Future and CompletableFuture.
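For instance, a small sketch of using a Future with the virtual-thread executor (class name and task are hypothetical):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class VirtualFutureDemo {
        public static void main(String[] args) throws Exception {
            try (ExecutorService service = Executors.newVirtualThreadPerTaskExecutor()) {
                Future<Integer> answer = service.submit(() -> 40 + 2);
                System.out.println(answer.get()); // waits for the task's virtual thread
            }
        }
    }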
You can find more in JEP 425.
Sometimes people have to build systems able to handle an enormous number of simultaneous clients. Native threads are an inadequate means of doing that, due to RAM consumption and context-switching costs.
Virtual threads give us an ability to run millions of I/O bound tasks simultaneously without changing our mental model.
That's why Golang made its way into the industry (besides Google support). Goroutines are a concept very similar to Java's virtual threads and they solve the same problem.
There are other ways to achieve what virtual threads do (such as NIO and the related Reactor pattern). This, however, entails using message loops and callbacks, which warp your mind (that's why so many people hate JavaScript). There are layers of abstraction on top of them that make things a bit easier, but they also have a cost.
As threads execute on multi-processor/multi-core machines, they cause CPU caches to load data from RAM.
If threads are supposed to 'see' the same data, that is not guaranteed, because thread1 may cause an update in the cache of the CPU where it is currently executing, and this change will not be immediately visible to thread2.
To solve this problem, programming languages like Java provide constructs like volatile.
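For example, the classic visibility idiom looks like this (a minimal sketch; class and method names are made up):

    class Worker implements Runnable {
        // Without volatile, the JIT may hoist the read out of the loop
        // and the loop might never observe the change.
        private volatile boolean running = true;

        @Override
        public void run() {
            while (running) {
                // do work
            }
        }

        void stop() {
            running = false; // guaranteed to become visible to run()
        }
    }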
It is clear to me what the problem with multiple threads executing on different CPUs is.
I am pretty sure that a given thread is not bound to a particular CPU for its lifetime and can get scheduled to run on a different CPU. But it is not clear to me why that does not cause problems similar to the one with different threads on different CPUs?
After all this thread may have caused an update in one CPU's cache which is yet to be written to RAM. If this thread now gets scheduled on another CPU wouldn't it have access to stale data?
The only possibility I can think of, as of now, is that context switching of threads involves writing all the data visible to the thread back to RAM, and that when a thread gets scheduled on a CPU, its cache gets refreshed from RAM (to prevent the thread from seeing stale values). However, this looks problematic from a performance point of view, as time-slicing means threads are being scheduled all the time.
Can some expert please advise what the real story is?
Caches on modern CPUs are always coherent. So if a store is performed by one CPU, then a subsequent load on a different CPU will see that store. In other words: the cache is the source of truth; memory is just an overflow bucket and could be completely out of sync with reality. Since the caches are coherent, it doesn't matter on which CPU a thread runs.
Even on a single-CPU system, the lack of volatile can cause problems due to compiler optimizations. A compiler could, for example, hoist a variable out of a loop, and then a write made by one thread will never be seen by another thread, no matter whether it is running on the same CPU.
I would suggest not thinking in terms of hardware. If you use Java, make sure you understand the Java Memory Model (JMM). This is an abstract model that prevents thinking in terms of hardware, since the JMM needs to work independently of the hardware.
On a single thread, there is a happens-before relationship between actions that take place, regardless of how the scheduling is done. This is enforced by the implementation of the JVM as part of the Java memory model contract promised in the Java Language Specification:
Two actions can be ordered by a happens-before relationship. If one action happens-before another, then the first is visible to and ordered before the second.
If we have two actions x and y, we write hb(x, y) to indicate that x happens-before y.
If x and y are actions of the same thread and x comes before y in program order, then hb(x, y).
How exactly this is achieved by the operating system is implementation dependent.
it is not clear to me why that does not cause problems similar to the one with different threads on different CPUs? After all this thread may have caused an update in one CPU's cache which is yet to be written to RAM. If this thread now gets scheduled on another CPU wouldn't it have access to stale data?
Yes, it may have access to stale data, but it is more likely to have data in its cache that is unhelpful, i.e. not relevant to the memory it needs. First off, the permissions from the OS (if written correctly) won't let one program see the data of another; yes, there are many stories about hardware vulnerabilities in the news these days, so I am talking about how it should work. The cache will be clear if another process gets swapped onto a CPU.
Whether or not the cache memory is stale is a function of the timing of the cache-coherence systems of the architecture, and of whether memory fences are crossed.
context switching of threads involves writing all the data visible to the thread back to RAM and that when a thread gets scheduled on a CPU, its cache gets refreshed from RAM (to prevent thread seeing stale values).
That's pretty close to what happens, although the cache is not refreshed when a thread gets scheduled. When a thread is context-switched out of the CPU, all dirty pages of memory are flushed to RAM. When a thread is swapped onto a CPU, the cache is either flushed (if it comes from another process) or contains memory that may not be helpful to the incoming thread. This causes a much higher miss ratio on initial memory accesses, meaning that a thread takes longer to access memory until the rows it needs are loaded into the cache.
However this looks problematic from performance point of view as time-slicing means threads are getting scheduled all the time.
Yes, there is a performance hit. This highlights why it is so important to properly size your thread pools. If you have too many threads running CPU-intensive tasks, you can lose performance because of the context switches. If the threads are waiting for I/O, then increasing the number of threads is a must; but if you are just calculating something, using fewer threads can result in higher throughput, because each thread stays on the processor longer and the ratio of cache hits to misses goes up.
For those who might not go through all the comments on the different answers, here is a simplified summary of what I have modelled in my head (please feel free to comment if any point is incorrect; I will edit this post).
1. http://tutorials.jenkov.com/java-concurrency/volatile.html is not accurate and gives rise to questions like this one. CPU caches are always coherent. If CPU 1 has written to a memory address X in its cache, and CPU 2 then reads the same memory address from its own cache, CPU 2 will read what was written by CPU 1. No special instruction is required to enforce this.
2. However, modern CPUs also have store buffers, which accumulate writes so that they can be committed to the cache in their own time, freeing the CPU from waiting for the cache-coherence protocol to finish.
3. Whatever is in the store buffer of a CPU is not yet visible to other CPUs.
4. In addition, CPUs and compilers are free, in order to improve performance, to re-order instructions as long as that does not change the outcome of the computation from a single thread's point of view.
5. Also, some compiler optimizations may keep a variable entirely in CPU registers for a routine, thereby 'hiding' it from shared memory and hence making writes to that variable invisible to other threads.
Points 3, 4 and 5 above are the reason why Java exposes keywords like volatile. When you use volatile, the JVM itself does not re-order instructions if that would break the 'happens-before' guarantee; it also asks the CPU not to re-order them, using memory barrier/fence instructions, and it refrains from any optimization that would break the guarantee. Overall, once a write to a volatile field has happened, any subsequent read by another thread is guaranteed to see the correct value, not only for that field but for all fields that were visible to the first thread when it wrote the volatile field.
How does above relate to this question about single thread using different CPUs in its lifetime?
1. If the thread, while executing on a CPU, has already written to that CPU's cache, there is nothing more to consider: even if the thread later runs on another CPU, it will see its own writes thanks to cache coherence.
2. If the thread's writes are still waiting in the store buffer when it gets moved off the CPU, the context switch ensures that those writes get committed to the cache. After that, it is the same as point 1.
3. Any state that lives only in CPU registers gets backed up and restored as part of the context switch anyway.
Due to the above points, a single thread does not face any problem when it executes on different CPUs during its lifetime.
(The specifics of this question concern a mod for Minecraft. In general, the question deals with resizing a thread pool based on system load and CPU availability.)
I am coming from an Objective C background, and Apple's libdispatch (Grand Central Dispatch) for thread scheduling.
The immediate concern I have is trying to reduce the size of the thread pool when a CMS tenured collection is running. The program in question (Minecraft) only works well with CMS collections. A much less immediate, but still "of interest", concern is reducing the thread pool size when other programs demand significant CPU (specifically, either a screen recorder or a Twitch stream).
In Java, I have just found out about (deep breath):
Executors, which provides access to thread pools (both fixed-size and adjustable), with cached thread reuse (to avoid the overhead of constantly re-creating new threads, or the worry of coding threads to pause and resume based on workload),
Executor (no s), which is the generic interface for saying "Now it is time to execute this runnable()",
ExecutorService, which extends Executor with task submission and lifecycle management of the underlying pool,
ThreadPoolExecutor, which is what actually manages the thread pool, and has the ability to say "This is the maximum number of threads to use".
Under normal operation, about 5 times a second, there will be 50 high priority, and 400 low priority operations submitted to the thread pool per user on the server. This is for high-powered machines.
What I want to do is:
Work with less-powerful machines. So, if a computer only has 2 cores and the main program (two primary threads plus some minor assistant threads) is already maxing out the CPU, these background tasks will be competing with the main program and the garbage collector. In this case, I don't want to reduce the number of background threads (it will probably stay at 2), but I do want to reduce how much work I schedule. So this is just "how do I detect when the workload is going up?". I suspect that this is just a case of watching the size of the work queue I use with Executors.newCachedThreadPool().
But here is the first problem: I can't find anything that returns the size of the work queue! ThreadPoolExecutor can return the queue, and I can ask that for its size, but newCachedThreadPool() only returns an ExecutorService, which doesn't let me query the size (or rather, I don't see how to).
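As a sketch of one workaround: the Executors factory methods return a ThreadPoolExecutor under the hood, so you can down-cast to reach the queue. Note, though, that newCachedThreadPool() is backed by a SynchronousQueue, whose size is effectively always zero; a fixed-size pool backed by a LinkedBlockingQueue gives a measurable backlog:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadPoolExecutor;

    public class QueueSizeDemo {
        public static void main(String[] args) {
            ExecutorService service = Executors.newFixedThreadPool(4);
            if (service instanceof ThreadPoolExecutor tpe) {
                System.out.println("queued: " + tpe.getQueue().size());  // waiting tasks
                System.out.println("active: " + tpe.getActiveCount());   // busy threads
            }
            service.shutdown();
        }
    }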
If I have "enough cores", I want to tell the pool to use more threads. Ideally, enough to keep CPU usage near max. Most of the tasks that I want to run are CPU bound (disk I/O will be the exception, not the rule; concurrency blocking will also be rare). But I don't want to heavily over-schedule threads. How do I determine "enough threads" without going way over the available cores?
If, for example, screen recording (or streaming) activates, CPU usage by other programs will go up, and then I want to reduce the number of threads; as the number of threads goes down and the queue backlog goes up, I can reduce the number of tasks I add to the queue. But I have no idea how to detect this.
I think that the best advice I / we can give is to not try to "micro-manage" the number of threads in the thread pools. Set it to a sensible size that is proportional to the number of physical cores ... and leave it. By all means provide some static tuning parameters (e.g. in config files), but don't try to make the system tune itself dynamically. (IMO, the chances that dynamic tuning will work better than static tuning are ... pretty slim.)
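For example, a minimal sketch of that kind of static sizing:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            // Leave headroom for the main/render threads and the GC.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, cores - 1));
            // ... submit work ...
            pool.shutdown();
        }
    }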
For "real-time" streaming, your success is going to depend on the overall load and the scheduler's ability to prioritize more than the number of threads. However it is a fact that standard Java SE on a standard OS is not suited to hard real-time, so your real-time performance is liable to deteriorate badly if you push the envelope.
I want to get the process ID of a Thread to see how much memory it takes.
It depends a lot on the OS and how it manages threads. Theoretically it also depends on how the JVM implements threads, but all modern JVMs implement them as native threads.
On Linux, each thread used to get its own process ID, but most tools hide all but one thread per process (i.e. you don't usually see them unless you explicitly ask; ps uses the -m flag, for example). This comes from the fact that the Linux kernel doesn't really make much of a distinction between threads and tasks.
Edit: as I just learned, this is no longer necessarily the case: a thread can be created with the exact same PID as its parent, in which case the threads are distinguished by different thread IDs.
However, since a thread shares its memory with all the other threads in the same process, this doesn't help you find out "how much memory a thread takes": all threads in a process will show the exact same amount, because they all share the same memory, so the real memory use is shown_memory_use and not shown_memory_use * number_of_threads.
Threads do not have PIDs, processes do, so what you're asking is not possible. There is also no fully reliable way to retrieve your PID from within a Java process (although the first part of the value returned by ManagementFactory.getRuntimeMXBean().getName() is usually the PID).
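For example (a sketch; the runtime-name format is not guaranteed by the spec, and on Java 9+ ProcessHandle is the supported route):

    import java.lang.management.ManagementFactory;

    public class PidDemo {
        public static void main(String[] args) {
            // Java 9+: the supported way to get the current process ID.
            System.out.println("pid: " + ProcessHandle.current().pid());

            // Older JVMs: parse the runtime name, typically "12345@hostname".
            String name = ManagementFactory.getRuntimeMXBean().getName();
            System.out.println("pid (legacy): " + name.split("@")[0]);
        }
    }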
As the name implies, PID means process ID. Each process can spawn multiple threads, which all share the same PID. Are you sure you don't mean Thread ID?
A feature of threads is that they share the heap with all other threads. This means that any one thread can potentially use almost all the memory of the process. The only thing a thread doesn't have access to is the stack and local variables of the other threads.
As such, it is not useful to try to determine how much memory an individual thread uses. Instead, it can be useful to determine how much memory a data structure uses (although this has similar difficulties).
It is worth noting that main memory is relatively cheap. Your situation may be different but a typical new server with 24 GB can cost as little as £1K. You can buy a 96 GB PC for around £2K. Sometimes it is not worth worrying about how much memory you are using until you know it is a problem.
What are the Light weight and heavy weight threads in terms of Java?
It's related to the amount of "context" associated with a thread, and consequently the amount of time it takes to perform a "context switch".
Heavyweight threads (usually kernel/OS-level threads) have a lot of context (hardware registers, kernel stacks, etc.), so it takes a lot of time to switch between them. Heavyweight threads may also have restrictions on them; for example, on some OSes, kernel threads cannot be pre-empted, which means they can't be forcibly switched out until they give up control.
Lightweight threads, on the other hand (usually user-space threads), have much less context: they essentially share the same hardware context and only need to store the context of the user stack, hence the time taken to switch lightweight threads is much shorter.
On most OSes, any threads you create as a programmer in user space will be lightweight in comparison to the kernel space threads. There is no formal definition of heavyweight and lightweight, it's just more of a comparison between threads with more context and threads with less context. Don't forget that every OS has its own different implementation of threads, and the lines between heavy and light threads are not necessarily clearly defined. In some programming languages and frameworks, when you create a "Thread" you might not even be getting a full thread, you might just be getting some abstraction that hides the real number of threads underneath.
[Some OSes allow threads to share address space, so threads that would usually be quite heavy, are slightly lighter]
Java's standard threads are reasonably heavy in comparison to Erlang processes, which are very light spawnable processes. Erlang demonstrates a distributed finite-state machine.
However, as an example, http://kilim.malhar.net/, a Java extension library based on the Actor model of concurrency, offers a construct for lightweight threads in Java: instead of a Thread implementing run(), a Kilim task implements an execute() method from the Kilim library. Apparently it shows Java's runtime outperforming Erlang's (at least in a local environment, AFAIK). Java actually did have such things in the original language spec, called 'green threads', but subsequent Java versions dropped them in favor of native threads.
In most systems, lightweight threads are the normal threads you create with the help of a library, like pthreads on Linux.
"Heavyweight", in some systems, refers to a system process with its own virtual memory and a more complex structure, e.g. information about process performance/statistics.
For more information:
http://www.computerworld.com/s/article/66405/Processes_and_Threads
http://msdn.microsoft.com/en-us/library/ms684841(VS.85).aspx