I've been reading about Java fibers: small units of work that get mapped onto threads. In the case of a blocking call, a different fiber is mapped onto the same thread. Since threads in Java are kernel-level threads, this prevents the threads from becoming exhausted.
I've been using Spring WebFlux, so I just wanted to understand what happens internally when a Netty server receives 100 requests/sec, each of which includes reactive database access. How are these requests mapped onto the 40 threads that the Netty server spawns by default?
How is a flux different from a fiber? How does flux guarantee asynchronous behaviour with a limited number of threads?
What happens internally when a Netty server receives 100 requests/sec, each of which includes reactive database access? How are these requests mapped onto the 40 threads that the Netty server spawns by default?
In brief, it takes those requests and assigns them round-robin style to each available underlying thread (as those threads become available). The same thing happens with all other reactive calls too, of course with the caveat that, depending on the configuration, they may be running on other schedulers, and so on other underlying thread pools with different numbers of threads.
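As a rough illustration of what that looks like in code (the userRepository, enrich, and User names are hypothetical, not from the question), a handler doing reactive database access never holds an event-loop thread while the query is in flight, and the chain can hop to another scheduler for downstream work:

    import reactor.core.publisher.Mono;
    import reactor.core.scheduler.Schedulers;

    // Sketch of a WebFlux handler: findById is assumed to return Mono<User>
    // from a non-blocking (e.g. R2DBC) driver.
    Mono<User> handle(String id) {
        return userRepository.findById(id)      // begins on a Netty event-loop thread
            .publishOn(Schedulers.parallel())   // downstream operators continue on the
                                                // parallel scheduler's own thread pool
            .map(user -> enrich(user));         // the event loop was released the moment
    }                                           // the database call was issued

Because no thread is parked while the database responds, 100 requests/sec simply interleave as events across however many event-loop threads are configured.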
How is a flux different from a fiber?
That's a very big topic, but the "high level" overview is that flux (by which I assume you mean "reactive" Java rather than a Flux itself) is an asynchronous model where no thread is allowed to block, while fibers are "green" threads, designed to be used synchronously, that rely on cooperative scheduling (amongst other techniques) to map onto far fewer underlying kernel-level threads.
In practice, that means you can keep pretty much the same threading model and coding techniques you use today with fibers, whereas reactive programming will require you to adopt new paradigms.
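A minimal sketch of that difference, with made-up httpClient, url, and transform placeholders (getAsync is assumed to return a Reactor Mono): the fiber version reads as ordinary sequential code, while the reactive version describes a pipeline in which no thread ever blocks.

    import reactor.core.publisher.Mono;

    // Fiber style: looks blocking; when get() waits, the runtime parks the fiber
    // and reuses the underlying kernel thread for other fibers.
    String fiberStyle() {
        String data = httpClient.get(url);   // "blocks" only the fiber
        return transform(data);
    }

    // Reactive style: returns immediately; transform runs whenever the response arrives.
    Mono<String> reactiveStyle() {
        return httpClient.getAsync(url)      // assumed to return Mono<String>
                         .map(data -> transform(data));
    }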
How does flux guarantee asynchronous behaviour with a limited number of threads?
Quite simply: because it's architected to be asynchronous. The question here seems to be based on a false premise - asynchronous behaviour isn't guaranteed (or not guaranteed) by the number of threads available, but by the model; it can't "spill over" into synchronous behaviour if it's overwhelmed by requests.
Related
A small question regarding Spring WebFlux and Project Reactor, please.
From the official doc, we can see:
https://docs.spring.io/spring-framework/docs/current/reference/html/web-reactive.html#threading-model
Threading Model
What threads should you expect to see on a server running with Spring WebFlux?
On a “vanilla” Spring WebFlux server (for example, no data access nor other optional dependencies), you can expect one thread for the server and several others for request processing (typically as many as the number of CPU cores). Servlet containers, however, may start with more threads (for example, 10 on Tomcat), in support of both servlet (blocking) I/O and servlet 3.1 (non-blocking) I/O usage.
What happens if the hardware has only one CPU, please?
I have a webapp which takes a Flux of strings as input and performs a heavy operation on each element.
Please note: the heavy operation is non-blocking. It has been BlockHound-tested and, for sure, does not contain any database or web-call I/O.
Yet the computation is heavy and lengthy (but, again, non-blocking).
What heavyComputation does is take the string, perform some in-memory decryption, convert it to some objects, check some fields against a BCrypt hash, and re-encrypt it in memory.
heavyComputation is very heavy and takes up to 5 seconds to complete for one string.
    @GetMapping(value = "/upload-flux", consumes = MediaType.MULTIPART_FORM_DATA_VALUE, produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> question(@RequestPart("question") Flux<String> stringFlux) {
        return stringFlux.map(oneString -> heavyComputation(oneString));
    }

    private String heavyComputation(String oneString) {
        // heavy and time-consuming in-memory decryption
        // heavy and time-consuming conversion to a Java object
        // heavy and time-consuming validation of fields against the BCrypt hash
        // heavy and time-consuming re-encryption
        return oneString; // placeholder: the real method returns the re-encrypted string
    }
I was hoping that by using Spring WebFlux I would see some concurrency and asynchrony, as the hardware is constrained to only 1 CPU.
Sadly, I observe that everything is done on the reactor-http-nio-1 thread, and it looks fairly sequential: first one string, then heavyComputation, which takes 5-ish seconds, then the second string, and so on.
What am I doing wrong, please?
Thank you
WebFlux is built around the concept that new incoming requests don't spawn new threads (*) the way traditional web servers such as servlet containers do. Instead, requests get queued to be processed by a single long-running thread (assuming a single CPU core), similar to how e.g. click events are processed in JavaScript or desktop UI libraries. The benefit is that the CPU is freed from much of the overhead associated with managing threads, which is very expensive. It gets to do one job after another instead of creating the illusion that it can do thousands of jobs at once.
This doesn't magically make your CPU go faster; it just makes it waste less time on thread context switching, which is notoriously expensive. Your CPU-bound computations need as many CPU cycles as they need, no matter what thread they run on. Also, with WebFlux, if request processing involves a long-running CPU-intensive computation, the CPU can't process new requests until it's finished with the current one - unless you explicitly offload the computation to a worker thread (and wrap it in a Mono). That, however, will effectively nullify the benefits of the reactive model if such CPU-bound computations are what the application is busy with most of the time, because the CPU will once again have to do thread context switching to alternately assign CPU time to request processing and the newly spawned worker thread - or, worse, juggle multiple such parallel worker threads as they get spawned by new requests.
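For reference, a minimal sketch of that offloading, reusing the heavyComputation method from the question (the scheduler choice is a judgment call: Schedulers.parallel() is sized to the number of CPU cores, while Schedulers.boundedElastic() is the usual choice for blocking work). As explained above, on a single core this only keeps the event loop responsive; it cannot add parallelism:

    import reactor.core.publisher.Mono;
    import reactor.core.scheduler.Schedulers;

    // Wrap each element in a Mono and subscribe it on a worker pool, so the
    // reactor-http-nio-1 event-loop thread is not held for ~5 s per string.
    return stringFlux.flatMap(oneString ->
            Mono.fromCallable(() -> heavyComputation(oneString))
                .subscribeOn(Schedulers.parallel()));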
You can expect performance gains from WebFlux if your application needs to process a very large number of requests per second, but where individual requests need very few CPU cycles to process and I/O is non-blocking. Your use case seems to be the opposite, so the Reactive model might not actually do anything for you compared to the simpler Servlet model.
If, on the other hand, your use case is such that the CPU-bound work can be parallelized, you will need multiple (or at least hyperthreading-enabled) CPU cores to benefit from that. Reactive can't help you there.
(*) Yes, that's an oversimplification, I'm aware of thread pools, I'm just trying to get the point across.
I need to use the memcached Java API in my Scala/Akka code. This API gives you both synchronous and asynchronous methods. The asynchronous ones return java.util.concurrent.Future. There was a question about dealing with Java Futures in Scala here: How do I wrap a java.util.concurrent.Future in an Akka Future?. However, in my case I have two options:
Using the synchronous API and wrapping the blocking code in a Future, marked with blocking:
    Future {
      blocking {
        cache.get(key) // synchronous blocking call
      }
    }
Using the asynchronous Java API and polling the Java Future every n ms to check whether it has completed (as described in one of the answers to the linked question above).
Which one is better? I am leaning towards the first option because polling can dramatically impact response times. Shouldn't the blocking { } block prevent the whole pool from being blocked?
I always go with the first option, but I do it in a slightly different way: I don't use the blocking feature (actually, I have not thought about it yet). Instead, I provide a custom execution context to the Future that wraps the synchronous blocking call. So it looks basically like this:
    val ecForBlockingMemcachedStuff = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(100)) // whatever number you think is appropriate
    // I create a separate EC for each blocking client/resource/API I use

    Future {
      cache.get(key) // synchronous blocking call
    }(ecForBlockingMemcachedStuff) // or mark the execution context implicit. I like to mention it explicitly.
So all the blocking calls use a dedicated execution context (= thread pool), separated from your main execution context, which is responsible for the non-blocking stuff.
This approach is also explained in an online training video for Play/Akka provided by Typesafe. There is a video in lesson 4 about how to handle blocking calls. It is explained by Nilanjan Raychaudhuri (I hope I spelled that correctly), who is a well-known author of Scala books.
Update: I had a discussion with Nilanjan on Twitter. He explained the difference between the blocking approach and a custom ExecutionContext. The blocking feature just creates a special ExecutionContext that takes a naive approach to the question of how many threads you will need: it spawns a new thread whenever all the other existing threads in the pool are busy. So it is actually an uncontrolled ExecutionContext: it could create lots of threads and lead to problems like out-of-memory errors. The solution with a custom execution context is therefore better, because it makes this problem obvious. Nilanjan also added that you need to consider circuit breaking for the case where this pool gets overloaded with requests.
TL;DR: Yeah, blocking calls suck. Use a custom/dedicated ExecutionContext for blocking calls. Also consider circuit breaking.
The Akka documentation provides a few suggestions on how to deal with blocking calls:
In some cases it is unavoidable to do blocking operations, i.e. to put a thread to sleep for an indeterminate time, waiting for an external event to occur. Examples are legacy RDBMS drivers or messaging APIs, and the underlying reason is typically that (network) I/O occurs under the covers. When facing this, you may be tempted to just wrap the blocking call inside a Future and work with that instead, but this strategy is too simple: you are quite likely to find bottlenecks or run out of memory or threads when the application runs under increased load.
The non-exhaustive list of adequate solutions to the "blocking problem" includes the following suggestions:
- Do the blocking call within an actor (or a set of actors managed by a router), making sure to configure a thread pool which is either dedicated for this purpose or sufficiently sized.
- Do the blocking call within a Future, ensuring an upper bound on the number of such calls at any point in time (submitting an unbounded number of tasks of this nature will exhaust your memory or thread limits).
- Do the blocking call within a Future, providing a thread pool with an upper limit on the number of threads which is appropriate for the hardware on which the application runs.
- Dedicate a single thread to manage a set of blocking resources (e.g. a NIO selector driving multiple channels) and dispatch events as they occur as actor messages.
The first possibility is especially well-suited for resources which are single-threaded in nature, like database handles which traditionally can only execute one outstanding query at a time and use internal synchronization to ensure this. A common pattern is to create a router for N actors, each of which wraps a single DB connection and handles queries as sent to the router. The number N must then be tuned for maximum throughput, which will vary depending on which DBMS is deployed on what hardware.
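As a rough sketch of that first suggestion in classic Akka (the dispatcher name, pool sizes, and the DbActor class are all invented for illustration): define a bounded dispatcher in application.conf and put a router of connection-wrapping actors on it, so blocked threads can never starve the default dispatcher.

    // application.conf:
    //
    // blocking-io-dispatcher {
    //   type = Dispatcher
    //   executor = "thread-pool-executor"
    //   thread-pool-executor {
    //     fixed-pool-size = 16
    //   }
    //   throughput = 1
    // }

    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.routing.RoundRobinPool;

    ActorSystem system = ActorSystem.create("app");

    // N actors, each wrapping a single DB connection, behind a round-robin router;
    // every one of them runs on the dedicated, bounded blocking-io-dispatcher.
    ActorRef dbRouter = system.actorOf(
            new RoundRobinPool(8).props(
                    Props.create(DbActor.class)               // hypothetical actor class
                         .withDispatcher("blocking-io-dispatcher")),
            "db-router");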
I recently read about Quasar, which provides "lightweight" / Go-like "user mode" threads for the JVM (it also has an Erlang-inspired actor system like Akka, but that's not the main question).
For example:
    package jmodern;

    import co.paralleluniverse.fibers.Fiber;
    import co.paralleluniverse.strands.Strand;
    import co.paralleluniverse.strands.channels.Channel;
    import co.paralleluniverse.strands.channels.Channels;

    public class Main {
        public static void main(String[] args) throws Exception {
            final Channel<Integer> ch = Channels.newChannel(0);

            new Fiber<Void>(() -> {
                for (int i = 0; i < 10; i++) {
                    Strand.sleep(100);
                    ch.send(i);
                }
                ch.close();
            }).start();

            new Fiber<Void>(() -> {
                Integer x;
                while ((x = ch.receive()) != null)
                    System.out.println("--> " + x);
            }).start().join(); // join waits for this fiber to finish
        }
    }
As far as I understand, the code above doesn't spawn any JVM/kernel threads; it is all done in user-mode threads (or so they claim), which are supposed to be cheaper (just like Go's goroutines, if I understood correctly).
My question is this: as far as I understand, in Akka everything is still based on JVM threads, which most of the time map to native OS kernel threads (e.g. pthreads on POSIX systems). To the best of my understanding, there are no user-mode threads / Go-like coroutines / lightweight threads in Akka. Did I understand that correctly?
If so, do you know whether that is a design choice, or whether there is a plan for Go-like lightweight threads in Akka in the future?
My understanding is that if you have a million actors but most of them are blocked waiting, then Akka can handle it with far fewer physical threads; but if most of them are active and you still need some responsiveness from the system (e.g. serving a million small requests for streaming some data), then I can see the benefit of a user-mode threading implementation, which can allow many more "threads" to be alive with a lower cost of creating, switching, and terminating them (of course, the only benefit is evenly dividing responsiveness among many clients, but responsiveness is still important).
Is my understanding more or less correct? Please correct me if I'm wrong.
*I might be completely confusing user-mode threads with goroutines/coroutines and lightweight threads; the question above relies on my poor understanding that they are all one and the same.
Akka is a very flexible library and it allows you to schedule actors using (essentially, through a chain of traits, it boils down to this) a simple trait, ExecutionContext, which, as you can see, accepts Runnables and somehow executes them. So, as far as I can see, it is likely possible to write a binding to something like Quasar and use it as a "backend" for Akka actors.
However, Quasar and similar libraries are likely to provide special utilities for communication between fibers. I also don't know how they would handle blocking tasks like I/O; they probably have a mechanism for that too. Because of this, I'm not sure Akka would be able to run correctly over green threads. Quasar also seems to rely on bytecode instrumentation, which is a rather advanced technique that can have a lot of implications preventing it from backing Akka.
However, you shouldn't really worry about the lightweightness of threads when using Akka actors. In fact, Akka is perfectly able to create millions of actors on a single system (see here), and all these actors will work just fine.
This is achieved via clever scheduling over special kinds of thread pools, like the fork-join thread pool. This means that unless actors are blocked in some long-running computation, they can run over a number of threads significantly smaller than the number of actors. For example, you can create a thread pool which will use at most 8 threads (one for each core of an 8-core processor), and all actor activity will be scheduled on those threads.
Akka's flexibility allows you to configure the exact dispatcher to use for specific actors, if needed. You can create dedicated dispatchers for actors which stay in long-running tasks, like database access. See here for more information.
So, in short, no, you don't need userland threads for actors, because actors don't map one-to-one to native threads (unless you force them to, that is, but that should be avoided at all costs).
Akka actors are essentially asynchronous, and that's why you can have a lot of them. Quasar actors (yes, Quasar offers an actor implementation too), which are very close to Erlang's, are instead synchronous or blocking, but they use fibers rather than Java (i.e., at present, OS) threads, so you can still have a lot of them with the same performance as async code, but with a programming style just as straightforward as when using regular threads and blocking calls.
I/O is handled through integrations: Quasar includes a framework to convert both async and sync APIs into fiber-blocking ones, and Comsat includes many such integrations already (one with Java NIO is included in Quasar directly).
My reply to another question contains further info.
I am also thinking of integrating the Disruptor pattern into our application. I am a bit unsure about a few things before I start using the Disruptor.
I have 3 producers: mainly a FIX thread which de-serialises the requests, another thread which continuously modifies the order price as the market moves, and one more thread which is responsible for de-serialising requests sent from a GUI application. All three threads currently write to a blocking queue (hence we see a lot of contention on the queue).
The Disruptor talks about the single-writer principle, and from what I have read that approach scales best. Is there any way we could make the above three threads obey the single-writer principle?
Also, in a typical request/response application, and especially in our case, we have contention on an in-memory cache, as we need to lock the cache when we update it with the response while a request might be arriving for the same order. How do we handle this through the Disruptor, i.e. how do I tie a response to a particular request? Can I eliminate the lock on the cache, and if so, how?
Any suggestions/pointers would be highly appreciated. We are currently using Java 1.6.
I'm new to the Disruptor and am trying to understand as many use cases as possible. I have tried to answer your questions.
1. Yes, the Disruptor can be used to sequence calls from multiple producers. I understand that all 3 threads try to update the state of a shared object, and a single consumer takes the necessary action on the shared object. Internally, you can have the single consumer delegate calls to the appropriate single-threaded handler based on responsibility.
2. The Disruptor does exactly this: it sequences the calls such that the state is accessed by only one thread at a time. If there's a specific order in which the event handlers are to be invoked, set up a memory barrier. The latest version of the Disruptor has a DSL that lets you set up the order easily.
3. The cache can be abstracted and accessed through the Disruptor. At any time, only one reader or writer gets access to the cache, since all calls to the cache are sequential.
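To make the first two points concrete, here is a minimal sketch against the Disruptor 3.x DSL (older versions differ slightly; the OrderEvent fields and sizes are invented, and it is written without lambdas since Java 1.6 was mentioned). ProducerType.MULTI lets all three threads publish safely to one ring buffer, while the single event handler owns the cache and therefore needs no lock:

    import java.util.concurrent.Executors;

    import com.lmax.disruptor.BlockingWaitStrategy;
    import com.lmax.disruptor.EventFactory;
    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;

    public class OrderPipelineSketch {

        // Mutable event slot reused by the ring buffer; the payload field is invented.
        static class OrderEvent {
            String payload;
        }

        public static void main(String[] args) {
            Disruptor<OrderEvent> disruptor = new Disruptor<OrderEvent>(
                    new EventFactory<OrderEvent>() {
                        public OrderEvent newInstance() { return new OrderEvent(); }
                    },
                    1024,                              // ring size, must be a power of two
                    Executors.defaultThreadFactory(),
                    ProducerType.MULTI,                // three producer threads -> MULTI
                    new BlockingWaitStrategy());

            // Single consumer: the only thread that ever touches the shared cache,
            // so the cache no longer needs a lock.
            disruptor.handleEventsWith(new EventHandler<OrderEvent>() {
                public void onEvent(OrderEvent event, long sequence, boolean endOfBatch) {
                    // update the in-memory cache / order state here, single-threaded
                }
            });

            RingBuffer<OrderEvent> ring = disruptor.start();

            // Each producer (FIX, market data, GUI) publishes like this:
            long seq = ring.next();
            try {
                ring.get(seq).payload = "new-order";   // fill the claimed slot
            } finally {
                ring.publish(seq);                     // make it visible to the consumer
            }
        }
    }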
I've searched the site a bit for help understanding this, but haven't found anything super clear, so I thought I'd post my use case and see if anybody could shed some light.
I have a question about the scaling of JVM threads vs OS threads when used in Akka for I/O operations. From the Akka site:
Akka supports dispatchers for both event-driven lightweight threads, allowing creation of millions threads on a single workstation, and thread-based Actors, where each dispatcher is bound to a dedicated OS thread.
The event-based Actors currently consume ~600 bytes per Actor which means that you can create more than 6.5 million Actors on 4 G RAM.
In this context, can you all help me understand how that matters on a workstation with only 1 processor (for simplicity)? For my example use case, I want to take a list of, say, 1000 'Users' and then go query a database (or several) for various information about each user. If I were to dispatch each of these 'get' tasks to an actor, and each actor is going to do I/O, wouldn't that actor block based on the OS thread limit for the workstation?
How does the Akka actor model give me lift in a scenario like this? I know that I am probably missing something, as I am not wildly knowledgeable about the inner workings of VM threads vs OS threads, so if one of the smart folks here could spell it out for me, that would be great.
If I use Futures, don't I need to use await() or get() to block and wait for the reply?
In my use case, regardless of actors, would it end up just 'feeling' like I'm making 1000 sequential database requests?
If code snippets are useful in helping me understand this, Java would be preferred, as I am still coming up to speed on Scala syntax - but a nice, clear textual explanation of how these millions of threads can interoperate on a single-processor machine while doing database I/O would be fine too.
It is really hard to figure out what you are actually asking here, but here are some pointers:
If you are running on a modern JVM, there is typically a one-to-one relationship between Java threads and OS threads. (IIRC, Solaris allows you to do this differently ... but that's the exception.)
The amount of real parallelism you will get using threads, or anything built on top of threads, is limited by the number of processors/cores that are available to the application. Beyond that, you will find that not all threads are actually executing at any given instant.
If you have 1000 Actors all trying to access the database "at the same time", then most of them will actually be waiting on the database itself, or on the thread scheduler. Whether this amounts to making 1000 sequential requests (i.e. strict serialization) will depend on the database and the queries / updates that the actors are doing.
The bottom line is that a computer system has hard limits on the resources available for doing stuff; e.g. number of processors, speed of processors, memory bandwidth, disc access times, network bandwidth, etc. You can design an application to be smart about the way it uses available resources, but you can't get it to use more resources than there actually are.
On reading the text that you quoted, it seems to me that it is talking about two different kinds of actors:
Thread-based actors have a 1-to-1 relationship with threads. There's no way you could have millions of this kind of actor in 4 GB of memory.
Event-based actors work differently. Instead of holding a thread at all times, they mostly sit in a queue waiting for an event to happen. When that happens, an event-processing thread grabs the actor from the queue and executes the "action" associated with the event. When the action finishes, the thread moves on to another actor/event pair.
The quoted text is saying that the memory overhead of an event-based actor is ~600 bytes. That doesn't include the event thread... because the event thread is shared by multiple actors.
Now, I'm not an expert on Scala / actors, but it is pretty obvious that there are certain things you should avoid when using event-based actors. For instance, you should probably avoid talking directly to an external database, because that is liable to block the event-processing thread.
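On the await()/get() worry from the question: with actors you normally never block on the Future; you run the blocking database call on a separate, bounded pool and pipe the result back to the asker as an ordinary message. A hedged Java sketch, assuming a recent classic Akka (UserDao, UserInfo, and the pool are stand-ins for illustration):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.CompletionStage;
    import java.util.concurrent.Executor;

    import akka.actor.AbstractActor;
    import akka.pattern.Patterns;

    public class UserLookupActor extends AbstractActor {
        // Stand-ins for illustration: a blocking DAO and a bounded pool for it.
        interface UserDao { UserInfo loadUser(String userId); }
        static class UserInfo {}

        private final UserDao dao;
        private final Executor blockingPool;

        public UserLookupActor(UserDao dao, Executor blockingPool) {
            this.dao = dao;
            this.blockingPool = blockingPool;
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(String.class, userId -> {
                        // Run the blocking JDBC-style call on the dedicated pool...
                        CompletionStage<UserInfo> reply = CompletableFuture.supplyAsync(
                                () -> dao.loadUser(userId), blockingPool);
                        // ...and deliver the result to the asker as a message; no
                        // await()/get() anywhere, this actor is free immediately.
                        Patterns.pipe(reply, getContext().dispatcher()).to(getSender());
                    })
                    .build();
        }
    }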
I think there may be a typo there. I think they meant to say:
Akka supports dispatchers for both event-driven lightweight actors, allowing creation of millions of actors on a single workstation, and thread-based Actors, where each actor is bound to a dedicated OS thread.
The event-driven actors use a thread pool: all of the (potentially millions of) actors share the same pool of threads. I'm not that familiar with Akka actors, but generally you would not want to do blocking I/O with event-driven actors; otherwise you could cause starvation.