Spring Webflux Threading Model on machine with ONE cpu - java

A small question regarding Spring WebFlux and Project Reactor, please.
From the official doc, we can see:
https://docs.spring.io/spring-framework/docs/current/reference/html/web-reactive.html#threading-model
Threading Model
What threads should you expect to see on a server running with Spring WebFlux?
On a “vanilla” Spring WebFlux server (for example, no data access nor other optional dependencies), you can expect one thread for the server and several others for request processing (typically as many as the number of CPU cores). Servlet containers, however, may start with more threads (for example, 10 on Tomcat), in support of both servlet (blocking) I/O and servlet 3.1 (non-blocking) I/O usage.
What happens if the hardware only has one CPU, please?
I have a web app which takes a Flux of strings as input and performs a heavy operation on each of them.
Please note, the heavy operation is non-blocking. It has been BlockHound tested and, for sure, does not contain any database or web-call I/O.
Yet the computation is heavy and lengthy (but again, non-blocking).
What heavyComputation does is take the string, perform some in-memory decryption, convert it to some objects, check some fields against a BCrypt hash, and re-encrypt it in memory.
heavyComputation is very heavy and takes up to 5 seconds to complete for one string.
@GetMapping(value = "/upload-flux", consumes = MediaType.MULTIPART_FORM_DATA_VALUE, produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> question(@RequestPart("question") Flux<String> stringFlux) {
    return stringFlux.map(oneString -> heavyComputation(oneString));
}

private String heavyComputation(String oneString) {
    // heavy and time-consuming in-memory decryption
    // heavy and time-consuming conversion to a Java object
    // heavy and time-consuming validation of fields against a BCrypt hash
    // heavy and time-consuming re-encryption
    return oneString; // placeholder: the real method returns the re-encrypted string
}
I was hoping that by using Spring WebFlux I could see some concurrency and asynchrony, since the hardware is constrained to one CPU only.
Sadly, I observe that everything is done on the reactor-http-nio-1 thread, and it looks fairly sequential: the first string, its heavyComputation which takes roughly 5 seconds, then the second string, and so on.
What am I doing wrong, please?
Thank you

WebFlux is built around the concept that new incoming requests don't spawn new threads (*) like traditional web servers such as servlet containers do. Instead, requests get queued to be processed by a single long-running thread (assuming a single CPU core), similar to how e.g. click events are processed in JavaScript or desktop UI libraries. The benefit of that is that the CPU is freed from much of the overhead associated with managing threads, which is very expensive. It gets to do one job after another instead of creating the illusion that it can do thousands of jobs at once.
This doesn't magically make your CPU go faster; it just makes it waste less time on thread context switching, which is notoriously expensive. Your CPU-bound computations need as many CPU cycles as they need, no matter what thread they run on. Also, with WebFlux, if request processing involves a long-running CPU-intensive computation, the CPU can't process new requests until it's finished with the current one - unless you explicitly offload the computation to a worker thread (and wrap it in a Mono). This will, however, effectively nullify the benefits of the reactive model if those CPU-bound computations are what the application is busy with most of the time, because the CPU will now yet again have to do thread context switching to alternately assign CPU time to request processing and the newly spawned worker thread. Or worse yet, it will have to juggle multiple such worker threads in parallel as they get spawned by new requests.
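For illustration, a minimal sketch of that offloading, reusing the heavyComputation method from the question (assuming reactor.core.publisher.Mono and reactor.core.scheduler.Schedulers are imported). On a one-core machine this still executes one computation at a time; it only keeps the event-loop thread free to accept new requests:

// Offload each CPU-bound computation to a worker scheduler so the
// reactor-http-nio event loop stays free to accept new connections.
// Schedulers.parallel() sizes its pool to the number of CPU cores,
// so with one core there is still no real parallelism.
return stringFlux.flatMap(oneString ->
        Mono.fromCallable(() -> heavyComputation(oneString))
            .subscribeOn(Schedulers.parallel()));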
You can expect performance gains from WebFlux if your application needs to process a very large number of requests per second, but where individual requests need very few CPU cycles to process and I/O is non-blocking. Your use case seems to be the opposite, so the Reactive model might not actually do anything for you compared to the simpler Servlet model.
If, on the other hand, your use case is such that the CPU-bound work can be parallelized, you will need multiple (or at least hyperthreading-enabled) CPU cores to benefit from that. Reactive can't help you there.
(*) Yes, that's an oversimplification, I'm aware of thread pools, I'm just trying to get the point across.

Related

Parallel Flux vs Flux in project Reactor

So, what I have understood from the docs is that a parallel Flux essentially divides the Flux elements into separate rails (essentially something like grouping), and as far as threads are concerned, that is the job of schedulers. All of this will be run on the same scheduler instance provided via the runOn() method.
Let's consider a situation like the one below:
Mono<Response> responseMono = webClientCallApi(..) // function returning a Mono from a WebClient call
Now let's say we make around 100 calls:
Flux.range(0, 100).subscribeOn(Schedulers.boundedElastic()).flatMap(i -> webClientCallApi(i)).collectList() // or subscribe somehow
And if we use a parallel Flux:
Flux.range(0, 100).parallel().runOn(Schedulers.boundedElastic()).flatMap(i -> webClientCallApi(i)).sequential().collectList();
So if my understanding is correct, the two pretty much seem to be similar. So what are the advantages of ParallelFlux over Flux, and when should you use a parallel Flux over a plain Flux?
In practice, you'll likely very rarely need to use a parallel flux, including in this example.
In your example, you're firing off 100 web service calls. Bear in mind the actual work needed to do this is very low - you generate and fire off an asynchronous request, and then some time later you receive a response back. In between that request and response you're not doing any work at all; it simply takes a tiny amount of CPU resources when each request is sent, and another tiny amount when each response is received. (This is one of the core advantages of using an asynchronous framework to make your web requests: you're not tying up any threads while the request is in flight.)
If you split this flux and run it in parallel, you're saying that you want these tiny amounts of CPU resources to be split so they can run simultaneously, on different CPU cores. This makes absolutely no sense - the overhead of splitting the flux, running it in parallel and then combining it later is going to be much, much greater than just leaving it to execute on a normal, sequential scheduler.
On the other hand, let's say I had a Flux<Integer> and I wanted to check if each of those integers was a prime for example - or perhaps a Flux<String> of passwords that I wanted to check against a BCrypt hash. Those sorts of operations are genuinely CPU intensive, so in that case a parallel flux, used to split execution across cores, could make a lot of sense. In reality though, those situations occur quite rarely in the normal reactor use cases.
(Also, just as a closing note, you almost always want to use Schedulers.parallel() with a parallel flux, not Schedulers.boundedElastic().)
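As an illustration of that CPU-bound case, here is a minimal sketch (passwords as a Flux&lt;String&gt; and the checkAgainstBCryptHash(...) helper are assumptions for the example, not part of the original answer):

// Split genuinely CPU-intensive work across cores. Schedulers.parallel()
// keeps one worker per CPU core, which is what pure computation wants.
Flux<Boolean> results = passwords
        .parallel()                             // split into rails
        .runOn(Schedulers.parallel())           // one worker per CPU core
        .map(pw -> checkAgainstBCryptHash(pw))  // CPU-heavy work per element
        .sequential();                          // merge rails back into one Flux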

Difference between Reactor Flux and Java Fiber

I've been reading about Java fibers as small units of work that get mapped to threads. In the case of a blocking call, a different fiber would be mapped to the same thread. Since threads in Java are kernel-level threads, this prevents threads from getting exhausted.
I've been using Spring WebFlux, so I just wanted to understand: what happens internally when a Netty server receives 100 requests/sec, each of which includes reactive database access? How are these requests mapped to the 40 threads which the Netty server spawns by default?
How is a Flux different from a fiber? How does Flux guarantee asynchronous behaviour with a limited number of threads?
What happens internally when a Netty server receives 100 requests/sec, each of which includes reactive database access? How are these requests mapped to the 40 threads which the Netty server spawns by default?
In brief, it takes those requests and assigns them "round robin" style to each available underlying thread (as those threads become available.) The same thing happens with all other reactive calls too, of course with the caveat that depending on the configuration, they may be running on other schedulers and so other underlying thread pools with different numbers of threads.
How is a Flux different from a fiber?
That's a very big topic, but the "high level" overview is that flux (by which I assume you mean "reactive" Java rather than a Flux itself) is an asynchronous model where no thread is allowed to block, while fibers are "green" threads, designed to be used synchronously, that make use of cooperative, continuation-based scheduling (amongst other techniques) to map onto far fewer underlying kernel-level threads.
In practice, that means that you can use pretty much the same threading model & code techniques you do today with fibers, but reactive programming will require you to adopt new paradigms.
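To make that contrast concrete, here is a minimal sketch (the httpClient/webClient instances, the request, and process(...) are assumed for illustration; error handling omitted):

// Fiber / virtual-thread style: the code looks synchronous and may block,
// but only the lightweight thread is parked, not a kernel thread.
String body = httpClient.send(request, HttpResponse.BodyHandlers.ofString()).body();
process(body);

// Reactive style: nothing blocks; you declare a pipeline and the result
// is delivered to a callback when it arrives.
webClient.get().uri("/resource")
         .retrieve()
         .bodyToMono(String.class)
         .subscribe(b -> process(b));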
How does Flux guarantee asynchronous behaviour with a limited number of threads?
Quite simply, because it's architected to be asynchronous. The question here seems like it's based on a false premise - asynchronous behaviour isn't guaranteed or not guaranteed by the number of threads available, but by your model (it can't "spill over" into synchronous behaviour if it's overwhelmed by requests.)

Best practices with Akka in Scala and third-party Java libraries

I need to use the memcached Java API in my Scala/Akka code. This API gives you both synchronous and asynchronous methods. The asynchronous ones return java.util.concurrent.Future. There was a question here about dealing with Java Futures in Scala: How do I wrap a java.util.concurrent.Future in an Akka Future?. However, in my case I have two options:
Using the synchronous API and wrapping the blocking code in a Future marked as blocking:
Future {
  blocking {
    cache.get(key) // synchronous blocking call
  }
}
Using the asynchronous Java API and polling the Java Future every n ms to check whether it has completed (as described in one of the answers to the linked question above).
Which one is better? I am leaning towards the first option because polling can dramatically impact response times. Shouldn't the blocking { } block prevent blocking the whole pool?
I always go with the first option, but I am doing it in a slightly different way: I don't use the blocking feature. (Actually I have not thought about it yet.) Instead I am providing a custom execution context to the Future that wraps the synchronous blocking call. So it looks basically like this:
val ecForBlockingMemcachedStuff = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(100)) // whatever number you think is appropriate
// I create a separate ec for each blocking client/resource/api I use

Future {
  cache.get(key) // synchronous blocking call
}(ecForBlockingMemcachedStuff) // or mark the execution context implicit. I like to mention it explicitly.
So all the blocking calls use a dedicated execution context (= thread pool), separated from your main execution context, which is responsible for the non-blocking stuff.
This approach is also explained in an online training video for Play/Akka provided by Typesafe. There is a video in lesson 4 about how to handle blocking calls. It is explained by Nilanjan Raychaudhuri (hope I spelled it correctly), who is a well-known author of Scala books.
Update: I had a discussion with Nilanjan on Twitter. He explained the difference between the blocking approach and a custom ExecutionContext. The blocking feature just creates a special ExecutionContext that provides a naive answer to the question of how many threads you will need: it spawns a new thread whenever all the other threads in the pool are busy. So it is actually an uncontrolled ExecutionContext: it could create lots of threads and lead to problems like an out-of-memory error. The solution with the custom execution context is actually better, because it makes this problem obvious. Nilanjan also added that you need to consider circuit breaking for the case where this pool gets overloaded with requests.
TLDR: Yeah, blocking calls suck. Use a custom/dedicated ExecutionContext for blocking calls. Also consider circuit breaking.
The Akka documentation provides a few suggestions on how to deal with blocking calls:
In some cases it is unavoidable to do blocking operations, i.e. to put a thread to sleep for an indeterminate time, waiting for an external event to occur. Examples are legacy RDBMS drivers or messaging APIs, and the underlying reason is typically that (network) I/O occurs under the covers. When facing this, you may be tempted to just wrap the blocking call inside a Future and work with that instead, but this strategy is too simple: you are quite likely to find bottlenecks or run out of memory or threads when the application runs under increased load.
The non-exhaustive list of adequate solutions to the “blocking problem” includes the following suggestions:
- Do the blocking call within an actor (or a set of actors managed by a router), making sure to configure a thread pool which is either dedicated for this purpose or sufficiently sized.
- Do the blocking call within a Future, ensuring an upper bound on the number of such calls at any point in time (submitting an unbounded number of tasks of this nature will exhaust your memory or thread limits).
- Do the blocking call within a Future, providing a thread pool with an upper limit on the number of threads which is appropriate for the hardware on which the application runs.
- Dedicate a single thread to manage a set of blocking resources (e.g. a NIO selector driving multiple channels) and dispatch events as they occur as actor messages.
The first possibility is especially well-suited for resources which are single-threaded in nature, like database handles which traditionally can only execute one outstanding query at a time and use internal synchronization to ensure this. A common pattern is to create a router for N actors, each of which wraps a single DB connection and handles queries as sent to the router. The number N must then be tuned for maximum throughput, which will vary depending on which DBMS is deployed on what hardware.
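As a minimal sketch of that first suggestion (Akka classic, Java API; the DbActor class and the fake query result are illustrative, not from the documentation):

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.routing.RoundRobinPool;

// Each actor wraps one blocking resource (e.g. a single DB connection).
class DbActor extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, query -> {
                    // the blocking call would go here; run this pool on a
                    // dedicated dispatcher so it cannot starve the default one
                    getSender().tell("result of: " + query, getSelf());
                })
                .build();
    }
}

public class DbRouterExample {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("db");
        // N = 10 actors behind a round-robin router; tune N for your DBMS/hardware
        ActorRef dbRouter = system.actorOf(
                new RoundRobinPool(10).props(Props.create(DbActor.class)), "dbRouter");
        dbRouter.tell("SELECT 1", ActorRef.noSender());
    }
}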

nonblocking-io vs blocking-io on raw data throughput

In the Apache HttpComponents documentation there is this statement:
"Contrary to the popular belief, the performance of NIO in terms of raw data throughput is significantly lower than that of blocking I/O."
Is that true? Can someone explain this in more detail? And what is a typical use case where request/response handling needs to be decoupled?
Non-blocking IO should be used when you can handle the request, dispatch it for processing on some other execution context (a different thread, an RPC call to another server, some other async mechanism) and release the web server's thread to handle more incoming requests. When the processing of the response is complete, a response-handling thread will be invoked, and it will send the response to the client.
I would recommend reading netty documentation for better understanding of the concept.
As for higher throughput: when your server sends/receives large amounts of data, all those context switches, and passing data between threads, can really hurt overall performance. Think of it like this: you receive a large request (a PUT request with a large file). All you need to do is save it to disk and return OK. Starting to toss it between threads could result in a few more memory-copy operations than would have been needed had you just written it to disk on the same thread. And handling this operation in an async manner would not improve performance: though you could release the request-handling thread back to the web server's thread pool and let it process other requests, your main performance bottleneck is your disk IO, and in this case trying to save more files simultaneously would only make things slower.
I hope I was clear enough. Please feel free to ask more questions in comments if you need more explanations.
The first statement is true only when the number of concurrent requests is relatively small (rather in the tens than the thousands). It's all about using many threads (blocking) instead of one or a few threads (non-blocking). Let's say you want to write an application which only downloads a file from a remote server. If your application needs to download only one file at a time, you need only one thread. But if you have a crawler which runs thousands of HTTP requests, then you need to have thousands of threads (or use a limited number of threads + NIO instead). For so big a number of threads the problem is context switching, which can slow down your application dramatically (therefore, for that number of concurrent requests, NIO is better).
But let's get back to your question: why can NIO be slower in terms of raw data throughput? The reason is the amount of CPU time used by NIO-driven applications. In the blocking model your code is doing only one thing - waiting for data (it executes the recv() operation in a loop). In an NIO application the logic is much more complicated: in a loop, the code uses the selector to select a set of keys (which involves the epoll_wait system call on Linux with the Oracle JVM), then iterates through the set, picks up a channel for every key, and then reads the data from the channel (the read() operation in the OS). In the standard blocking model, all you do is execute the recv() system function. In summary: an NIO-driven application in such a case uses more CPU time and generates more mode switches because of the higher number of system calls (by mode switch I mean the switch from user mode to kernel mode). Therefore the time needed to download the file will be higher.
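To make the two loops concrete, here is a minimal sketch of what the answer describes (the socket and channel setup is assumed; error handling omitted):

// Blocking model: one read() system call per iteration, nothing else.
InputStream in = socket.getInputStream();
byte[] buf = new byte[8192];
int n;
while ((n = in.read(buf)) != -1) {
    // consume n bytes
}

// NIO model: select the ready keys, iterate them, then read each channel.
Selector selector = Selector.open();
channel.configureBlocking(false);
channel.register(selector, SelectionKey.OP_READ);
ByteBuffer buffer = ByteBuffer.allocate(8192);
while (selector.select() > 0) {                 // epoll_wait under the hood on Linux
    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey key = it.next();
        it.remove();
        SocketChannel ch = (SocketChannel) key.channel();
        buffer.clear();
        ch.read(buffer);                        // the actual read() system call
    }
}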

akka jvm threads vs os threads when performing io

I've searched the site a bit for help understanding this, but haven't found anything super clear, so I thought I'd post my use case and see if anybody could shed some light.
I have a question about the scaling of JVM threads vs OS threads when used in Akka for IO operations. From the Akka site:
Akka supports dispatchers for both event-driven lightweight threads, allowing creation of millions threads on a single workstation, and thread-based Actors, where each dispatcher is bound to a dedicated OS thread.
The event-based Actors currently consume ~600 bytes per Actor which means that you can create more than 6.5 million Actors on 4 G RAM.
In this context, can you all help me understand how that matters on a workstation with only 1 processor (for simplicity)? So, for my example use case, I want to take a list of, say, 1000 'Users' and then query a database (or several) for various information about each user. So if I were to dispatch each of these 'get' tasks to an actor, and that actor is going to do IO, wouldn't that actor block based on the OS thread limit for the workstation?
How does the Akka actor model give me lift in a scenario like this? I know that I am probably missing something, as I am not wildly knowledgeable about the inner workings of VM threads vs OS threads, so if one of the smart folks here could spell it out for me, that would be great.
If I use Futures, don't I need to use await() or get() to block and wait for the reply?
In my use case, regardless of actors, would it end up just 'feeling' like I'm making 1000 sequential database requests?
If code snippets are useful in helping me understand this, Java would be preferred, as I am still coming up to speed on Scala syntax - but a nice clear textual explanation of how these millions of threads can interoperate on a single-processor machine while doing database IO would be fine too.
It is really hard to figure out what you are actually asking here, but here are some pointers:
If you are running on a modern JVM, there is typically a one-to-one relationship between Java threads and OS threads. (IIRC, Solaris allows you to do this differently ... but that's the exception.)
The amount of real parallelism you will get using threads, or anything built on top of threads is limited by the number of processors / cores that are available to the application. Beyond that, you will find that not all threads are actually executing at any given instant.
If you have 1000 Actors all trying to access the database "at the same time", then most of them will actually be waiting on the database itself, or on the thread scheduler. Whether this amounts to making 1000 sequential requests (i.e. strict serialization) will depend on the database and the queries / updates that the actors are doing.
The bottom line is that a computer system has hard limits on the resources available for doing stuff; e.g. number of processors, speed of processors, memory bandwidth, disc access times, network bandwidth, etc. You can design an application to be smart about the way it uses available resources, but you can't get it to use more resources than there actually are.
On reading the text that you quoted, it seems to me that it is talking about two different kinds of actors:
Thread-based actors have a 1 to 1 relationship with threads. There's no way you could have millions of this kind of actor in 4Gb memory.
Event-based actors work differently. Instead of having a thread at all times, they would mostly be sitting in a queue waiting for an event to happen. When that happens, an event-processing thread grabs the actor from the queue and executes the "action" associated with the event. When the action finishes, the thread moves on to another actor/event pair.
The quoted text is saying that the memory overhead of an event-based actor is ~600 bytes. They don't include the event thread ... because the event thread is shared by multiple actors.
Now I'm not an expert on Scala / Actors, but it is pretty obvious that there are certain things that you should avoid when using event-based actors. For instance, you should probably avoid talking directly to an external database because that is liable to block the event processing thread.
I think there may be a typo there. I think they meant to say:
Akka supports dispatchers for both event-driven lightweight actors, allowing creation of millions of actors on a single workstation, and thread-based Actors, where each actor is bound to a dedicated OS thread.
The event-driven actors use a thread pool - all of the (potentially millions of) actors share the same pool of threads. I'm not that familiar with Akka actors but generally you would not want to do blocking I/O with event-driven actors, otherwise you could cause starvation.