Lightweight threads in Akka

Lightweight threads in Akka - java

I recently read about Quasar which provides "lightweight" / Go-like "user mode" threads to the JVM (it also has an Erlang inspired Actor system like Akka but that's not the main question)
For example:
package jmodern;
import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.strands.Strand;
import co.paralleluniverse.strands.channels.Channel;
import co.paralleluniverse.strands.channels.Channels;
public class Main {
public static void main(String[] args) throws Exception {
final Channel<Integer> ch = Channels.newChannel(0);
new Fiber<Void>(() -> {
for (int i = 0; i < 10; i++) {
Strand.sleep(100);
ch.send(i);
}
ch.close();
}).start();
new Fiber<Void>(() -> {
Integer x;
while((x = ch.receive()) != null)
System.out.println("--> " + x);
}).start().join(); // join waits for this fiber to finish
}
}
As far as I understand the code above doesn't spawn any JVM / Kernel threads, all is done in user mode threads (or so they claim) which is supposed to be cheaper (just like Go co-routines if I understood correctly)
My question is this - as far as I understand, in Akka, everything is still based on JVM Threads which is most of the time maps to native OS kernel threads (e.g. pthreads in POSIX systems), e.g. to the best of my understanding there are no user-mode threads / go like co-routines / lightweight threads in Akka, did I understand correctly?
If so, then do you know if it's a design choice? or there is a plan for go-like lightweight threads in Akka in the future?
My understanding is that if you have a million Actors but most of them are blocking then Akka can handle it with much less physical threads, but if most of them are non blocking and you still need some responsiveness from the system (e.g. service million of small requests for streaming some data) then I can see the benefits of a user mode threading implementation, which can allow many more "threads" to be alive with a lower cost of creating switching and terminating (of course the only benefit is evenly dividing responsiveness for many clients, but responsiveness is still important)
Is my understanding correct more or less? please correct me if I'm wrong.
*I might be completely confusing user-mode threads with go/co-routines and lightweight threads, the question above relies on my poor understanding that they are all one of the same.

Akka is a very flexible library and it allows you to schedule actors using (essentially it boils down to that through a chain of traits) a simple trait ExecutionContext, which, as you can see, accepts Runnables and somehow executes them. So, as far as I can see, it is likely that it is possible to write a binding to something like Quasar and use it as a "backend" for Akka actors.
However, Quasar and similar libraries are likely to provide special utilities for communication between fibers. I also don't know how they would handle blocking tasks like I/O, probably they would have a mechanism for that too. I'm not sure if Akka will be able to run correctly over green threads because of this. Quasar also seems to rely on bytecode instrumentation, and this is a rather advanced technique which can have a lot of implications preventing it from backing Akka.
However, you shouldn't really worry about lightweightness of threads when using Akka actors. In fact, Akka is perfectly able to create millions of actors on the single system (see here), and all these actors will work just fine.
This is achieved via clever scheduling over special kinds of thread pools, like fork-join thread pool. This means that unless actors are blocked on some long-running computation they can run over a number of threads significantly less than the number of these actors. For example, you can create a thread pool which will use at most 8 threads (one for each core of 8-core processor), and all actors activities will be scheduled on these threads.
Akka flexibility allows you to configure exact dispatcher to use for specific actors, if it is needed. You can create dedicated dispatchers for actors which stay in long-running tasks, like database access. See here for more information.
So, in short, no, you don't need userland threads for actors, because actors don't map one-to-one to native threads (unless you force them to, that is, but this should be avoided at all costs).

Akka actors are essentially asynchronous and that's why you can have a lot of them, while Quasar actors (yes, Quasar offers an actors implementation too), that are very close to Erlang's, are synchronous or blocking but they use fibers rather than Java (i.e. at present OS) threads, so still you can have a lot of them with the same performance as async but a programming style just as straightforward as when using regular threads and blocking calls.
I/O is handled through integrations: Quasar includes a framework to convert both async and sync APIs into fiber-blocking and Comsat include many such integrationsalready (one with Java NIO is included in Quasar directly).
My reply to another question contains further info.

Related

Difference between Reactor Flux and Java Fiber

I've been reading about Java Fibers as small unit of work which would mapped to Threads. In case of a blocking call a different Fiber would be mapped to the same Thread. Since Threads in Java are kernel level threads, this would prevent Threads from getting exhausted.
I've been using Spring Web-Flux, so just wanted to understand What all happens internally when Netty server receives 100 request/sec each of which include reactive databases access, How are these requests mapped to 40 threads which Netty server spawns by default?
How is a flux different from a Fiber? How does flux guarantee asynchronous behaviour with limited number of threads?

What all happens internally when Netty server receives 100 request/sec each of which include reactive databases access, How are these requests mapped to 40 threads which Netty server spawns by default?
In brief, it takes those requests and assigns them "round robin" style to each available underlying thread (as those threads become available.) The same thing happens with all other reactive calls too, of course with the caveat that depending on the configuration, they may be running on other schedulers and so other underlying thread pools with different numbers of threads.
How is a flux different from a Fiber?
That's a very big topic, but the "high level" overview is that flux (by which I assume you mean "reactive" Java rather than a Flux itself) is an asynchronous model where no thread is allowed to block, and fibers are "green" threads, designed to be used synchronously, that make use of preemptive scheduling (amongst other techniques) to map to far fewer underlying kernel level threads.
In practice, that means that you can use pretty much the same threading model & code techniques you do today with fibers, but reactive programming will require you to adopt new paradigms.
How does flux guarantee asynchronous behaviour with limited number of threads?
Quite simply, because it's architected to be asynchronous. The question here seems like it's based on a false premise - asynchronous behaviour isn't guaranteed or not guaranteed by the number of threads available, but by your model (it can't "spill over" into synchronous behaviour if it's overwhelmed by requests.)

Stream processing architecture

I am in the process of designing a system where there's a main stream of objects and there are multiple workers which produces some result from that object. Finally, there is some special/unique worker (sort of a "sink", in terms of graph theory) which takes all the results, and process them to some final object which is written to some DB.
It is possible for a worker to be dependent on the result of some other workers (hence, waiting for their results)
Now, I'm facing several problems:
It could be that one worker is much slower than another. How do you deal with that? Adding more workers (= scaling) of the slower type? (maybe dynamically)
Suppose W_B is dependent on W_A. If W_B is down for some reason then the flow will stop and the system will stop working. So I'd like the system to bypass this worker, somehow.
Moreover, how do the final worker decide when to operate on the set of results? Suppose it has the results of A and B but lacking the result of C. It may be that C is down or it's just very slow at the moment. How can it make a decision?
It is worth mentioning that it's not a realtime application but rather an offline processing system (i.e. you may access the DB and alter a record), but at the same time, it has to deal with relatively large amount of objects in an "high pace".
Regarding technologies,
I'm developing the system with Java but I'm not bounded to a specific technology.
I'd be glad if you could help me with the general design of the system.
Thanks a lot!

As Peter said, it really depends on the use case. Some general remarks though:
If a worker is slower than the other, maybe create more instances of that type; eg Kubernetes allows dynamic Node creation, and Kafka allows to partition a topic so more than one instance can read off and process it.
If B depends on A and A is down, B can't work and that's it. Maybe restart A? Maybe you can do a regular health check on it.
If the final worker needs the results of A, B and C, how would it process without C being available? If it can, it can store the results of A and B, install a timer, and if that goes off without C having arrived, continue.

Some additional thoughts:
If you mean to say that some subtasks of the overall application are quicker to execute than others, then it can be a good idea to slice up the application so that each worker is doing a bit of everything -- in other words, a share of the quick work and a share of the slow work. But if you mean to say that some machines are slower than others, then you could run fewer workers on the slow machines, and more on the faster ones, so as to balance things so that each worker has roughly the same resources.
You might want to decouple your architecture with some sort of durable queueing between the workers.
It's common to use heartbeats with timeouts and restarts.
Distributed stream processing quickly becomes very complex. Your life will be much easier if you build on top a stream processing framework that provides high availability and exactly-once semantics out of the box.

Best practices with Akka in Scala and third-party Java libraries

I need to use memcached Java API in my Scala/Akka code. This API gives you both synchronous and asynchronous methods. The asynchronous ones return java.util.concurrent.Future. There was a question here about dealing with Java Futures in Scala here How do I wrap a java.util.concurrent.Future in an Akka Future?. However in my case I have two options:
Using synchronous API and wrapping blocking code in future and mark blocking:
Future {
blocking {
cache.get(key) //synchronous blocking call
}
}
Using asynchronous Java API and do polling every n ms on Java Future to check if the future completed (like described in one of the answers above in the linked question above).
Which one is better? I am leaning towards the first option because polling can dramatically impact response times. Shouldn't blocking { } block prevent from blocking the whole pool?

I always go with the first option. But i am doing it in a slightly different way. I don't use the blocking feature. (Actually i have not thought about it yet.) Instead i am providing a custom execution context to the Future that wraps the synchronous blocking call. So it looks basically like this:
val ecForBlockingMemcachedStuff = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(100)) // whatever number you think is appropriate
// i create a separate ec for each blocking client/resource/api i use
Future {
cache.get(key) //synchronous blocking call
}(ecForBlockingMemcachedStuff) // or mark the execution context implicit. I like to mention it explicitly.
So all the blocking calls will use a dedicated execution context (= Threadpool). So it is separated from your main execution context responsible for non blocking stuff.
This approach is also explained in a online training video for Play/Akka provided by Typesafe. There is a video in lesson 4 about how to handle blocking calls. It is explained by Nilanjan Raychaudhuri (hope i spelled it correctly), who is a well known author for Scala books.
Update: I had a discussion with Nilanjan on twitter. He explained what the difference between the approach with blocking and a custom ExecutionContext is. The blocking feature just creates a special ExecutionContext. It provides a naive approach to the question how many threads you will need. It spawns a new thread every time, when all the other existing threads in the pool are busy. So it is actually an uncontrolled ExecutionContext. It could create lots of threads and lead to problems like an out of memory error. So the solution with the custom execution context is actually better, because it makes this problem obvious. Nilanjan also added that you need to consider circuit breaking for the case this pool gets overloaded with requests.
TLDR: Yeah, blocking calls suck. Use a custom/dedicated ExecutionContext for blocking calls. Also consider circuit breaking.

The Akka documentation provides a few suggestions on how to deal with blocking calls:
In some cases it is unavoidable to do blocking operations, i.e. to put
a thread to sleep for an indeterminate time, waiting for an external
event to occur. Examples are legacy RDBMS drivers or messaging APIs,
and the underlying reason is typically that (network) I/O occurs under
the covers. When facing this, you may be tempted to just wrap the
blocking call inside a Future and work with that instead, but this
strategy is too simple: you are quite likely to find bottlenecks or
run out of memory or threads when the application runs under increased
load.
The non-exhaustive list of adequate solutions to the “blocking
problem” includes the following suggestions:
Do the blocking call within an actor (or a set of actors managed by a router), making sure to configure a thread pool which is either
dedicated for this purpose or sufficiently sized.
Do the blocking call within a Future, ensuring an upper bound on the number of such calls at any point in time (submitting an unbounded
number of tasks of this nature will exhaust your memory or thread
limits).
Do the blocking call within a Future, providing a thread pool with an upper limit on the number of threads which is appropriate for the
hardware on which the application runs.
Dedicate a single thread to manage a set of blocking resources (e.g. a NIO selector driving multiple channels) and dispatch events as they
occur as actor messages.
The first possibility is especially well-suited for resources which
are single-threaded in nature, like database handles which
traditionally can only execute one outstanding query at a time and use
internal synchronization to ensure this. A common pattern is to create
a router for N actors, each of which wraps a single DB connection and
handles queries as sent to the router. The number N must then be tuned
for maximum throughput, which will vary depending on which DBMS is
deployed on what hardware.

Work/Task Stealing ThreadPoolExecutor

In my project I am building a Java execution framework that receives work requests from a client. The work (varying size) is broken down in to a set of tasks and then queued up for processing. There are separate queues to process each type of task and each queue is associated with a ThreadPool. The ThreadPools are configured in a way such that the overall performance of the engine is optimal.
This design helps us load balance the requests effectively and large requests don't end up hogging the system resources. However at times the solution becomes ineffective when some of the queues are empty and their respective thread pools sitting idle.
To make this better I was thinking of implementing a work/task stealing technique so that the heavily loaded queue can get help from the other ThreadPools. However this may require implementing my own Executor as Java doesn't allow multiple queues to be associated with a ThreadPool and doesn't support the work stealing concept.
Read about Fork/Join but that doesn't seem like a fit for my needs. Any suggestions or alternative way to build this solution could be very helpful.
Thanks
Andy

Executors.newWorkStealingPool
Java 8 has factory and utility methods for that in the Executors class: Executors.newWorkStealingPool
That is an implementation of a work-stealing thread pool, I believe, is exactly what you want.

Have you considered the ForkJoinPool? The fork-join framework was implemented in a nice modular fashion so you can just use the work-stealing thread pool.

you could implement a custom BlockingQueue implementation (i think you mainly need to implement the offer() and take() methods) which is backed by a "primary" queue and 0 or more secondary queues. take would always take from the primary backing queue if non-empty, otherwise it can pull from the secondary queues.
in fact, it may be better to have 1 pool where all workers have access to all the queues, but "prefer" a specific queue. you can come up with your optimal work ratio by assigning different priorities to different workers. in a fully loaded system, your workers should be working at the optimal ratio. in an underloaded system, your workers should be able to help out with other queues.

akka jvm threads vs os threads when performing io

I've searched the site a bit for help understanding this, but haven't found anything super clear, so I thought I'd post my use case and see if anybody could shed some light.
I have a question about the scaling of jvm threads vs os threads when used in akka for io operations. From the akka site:
Akka supports dispatchers for both event-driven lightweight threads, allowing creation of millions threads on a single workstation, and thread-based Actors, where each dispatcher is bound to a dedicated OS thread.
The event-based Actors currently consume ~600 bytes per Actor which means that you can create more than 6.5 million Actors on 4 G RAM.
In this context, can you all help me understand how that matters on a workstation with only 1 processor (for simplicity). So, for my example use case, I want to take a list of say 1000 'Users' and then go query a database (or several) for various information about each user. So if I were to dispatch each of these 'get' tasks to an actor, and that actor is going to do IO, wouldn't that actor block based on the os thread limit for the workstation?
How does the akka actor model give me lift in a scenario like this? I know that I am probably missing something as I am not wildly knowledgeable on the interworkings of vm threads vs os threads, so if one of the smart folks here could spell it out for me, that would be great.
If I use Futures, don't I need to use await() or get() to block and wait for the reply?
In my use case, regardless of actors, would it end up just 'feeling' like I'm making 1000 sequential database requests?
If code snips are useful in helping me understand this, Java would be preferred as I am still coming up to speed on scala syntax - but a nice clear textual explanation of how these millions of threads can interoperate on a single processor machine while doing database IO would be fine too.

It is really hard to figure out what you are actually asking here, but here are some pointers:
If you are running on a modern JVM, there is typically a one-to-one relationship between Java threads and OS threads. (IIRC, Solaris allows you to do this differently ... but that's the exception.)
The amount of real parallelism you will get using threads, or anything built on top of threads is limited by the number of processors / cores that are available to the application. Beyond that, you will find that not all threads are actually executing at any given instant.
If you have 1000 Actors all trying to access the database "at the same time", then most of them will actually be waiting on the database itself, or on the thread scheduler. Whether this amounts to making 1000 sequential requests (i.e. strict serialization) will depend on the database and the queries / updates that the actors are doing.
The bottom line is that a computer system has hard limits on the resources available for doing stuff; e.g. number of processors, speed of processors, memory bandwidth, disc access times, network bandwidth, etc. You can design an application to be smart about the way it uses available resources, but you can't get it to use more resources than there actually are.
On reading the text that you quoted, it seems to me that it is talking about two different kinds of actors:
Thread-based actors have a 1 to 1 relationship with threads. There's no way you could have millions of this kind of actor in 4Gb memory.
Event-based actors work differently. Instead of having a thread at all times, they would mostly be sitting in a queue waiting for an event to happen. When that happened, an event processing thread would grab the actor from the queue and execute the "action" associated with the event. When the action finished, the thread moves onto another actor / event pair.
The quoted text is saying that the memory overhead of an event-based actor is ~600 bytes. They don't include the event thread ... because the event thread is shared by multiple actors.
Now I'm not an expert on Scala / Actors, but it is pretty obvious that there are certain things that you should avoid when using event-based actors. For instance, you should probably avoid talking directly to an external database because that is liable to block the event processing thread.

I think there may be a typo there. I think they meant to say:
Akka supports dispatchers for both event-driven lightweight actors,
allowing creation of millions actors on a single workstation, and thread-based Actors, where each actor is bound to a dedicated OS thread.
The event-driven actors use a thread pool - all of the (potentially millions of) actors share the same pool of threads. I'm not that familiar with Akka actors but generally you would not want to do blocking I/O with event-driven actors, otherwise you could cause starvation.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.