This question already has answers here:
How does Actors work compared to threads?
(2 answers)
Closed 5 years ago.
I'm trying to understand the difference in definition of Actors and threads. Some articles say that actors are lightweight threads, while other articles states that actors are not threads. I also understand that a thread can run multiple actors. I'm confused as to how to exactly differentiate actors from threads. Please help me understand thanks!
Actors are at a higher level of abstraction when compared to Threads.
Thread is a JVM concept, whereas an Actor is a normal java class that runs in the JVM and hence the question is not so much about Actor vs Thread, its more about how Actor uses Threads.
What is an Actor?
At a very simple level, an Actor is an entity that receives messages, one at a time, and reacts to those messages.
How does Actor use Threads?
When Actor receives a message, it performs some action in response. How does the action code run in the JVM? Again, if you simplify the situation, you could imagine the Action executing the action task on the current thread. Also, it is possible that the Actor decides to perform the action task on a thread pool. It does not really matter as long as the Actor makes sure that only one message is processed at a time.
I'm trying to understand the difference in definition of Actors and threads.
Honestly, they really don't have much to do with each other, so it is kinda hard to talk about their difference. It's like talking about the difference between a Toyota Camry and the color blue.
An Actor is a self-contained, encapsulated entity that communicates with other self-contained, encapsulated entities purely by sending messages.
Now, if that sounds exactly like the definition of Object to you, you are right! There is a deep connection between Actors and Objects: the Actor Model of Computation was conceived by Carl Hewitt, partially based on the Message-Directed Execution of early Smalltalks (Smalltalk-71, Smalltalk-72). Alan Kay, in turn, cites PLANNER' Goal-Directed Execution as a heavy influence on Smalltalk's Message-Directed Execution … PLANNER in turn had been designed by Carl Hewitt. (PLANNER is also the precursor to Prolog, which in turn was the precursor to Erlang; Erlang is based on Prolog, started out as a library in Prolog, and the first implementations were written in Prolog.) Also, both Carl Hewitt and Alan Kay cite the early work of Vint Cerf and Bob Kahn on what would later become the Internet as inspiration, and both of them were heavily inspired by nature, Carl Hewitt by physics and Alan Kay by microbiology. So, the deep connections and similarities are very much not surprising.
Actors and Objects are pretty much the same thing. We usually expect Message sends in Object-Oriented Systems to be synchronous, instantaneous, reliable, and ordered, while no such guarantees exist for Actors. That is the main difference between Actors and Objects: Actors are Objects where Messages may get lost, re-ordered, take a long time, and are asynchronous. (The only guarantee is that a Message may get delivered At Most Once.) Also, the Receiver of a Message typically has no implicit reference to the Sender, so if the Receiver wants to reply to the Message, the Sender needs to pass its own reference along with the Message (unlike typical Object-Oriented Systems, where I can always return something to the Sender).
An Actor can do three things, and exactly those three things and nothing more:
Create new Actors
Send Messages to Addresses of Actors it already knows (i.e. Addresses of Actors it created, and Addresses it got sent in a Message)
Designate how to handle the next Message (this is essentially how to model mutable state)
Note that in order to send a Message to an Actor, you need to have an Address for the Actor. You can only send Messages to Addresses. An Actor may have multiple Addresses, and multiple Actors may be hidden behind a single Address. Addresses are unforgeable, you cannot construct one yourself, you have to have it handed to you. (This is an important difference to pointers in C, and more like object references in Java.)
There is a very nice video of a whiteboarding session (mostly) between Carl Hewitt himself and Erik Meijer about Actors on Microsoft's Channel9 community website: Hewitt, Meijer and Szyperski: The Actor Model (everything you wanted to know, but were afraid to ask). It explains among other things the fundamental ideas behind Actors as well as the fundamental Axioms of Actors in an accessible way.
Carl Hewitt was also invited to give a lecture at Stanford's EE380 Computer Systems Colloquium, where he talked about How to Program the Many Cores For Inconsistency Robustness, including an introduction to Actors.
Threads, OTOH are just strands of execution. They are essentially nothing but an instruction pointer and a call stack. In particular, Threads don't have memory (of their own). All Threads of a Process share the same memory. Which means among other things, that a single misbehaving Thread can crash all Threads of the Process, and that all access to this shared memory must be carefully controlled.
Some articles say that actors are lightweight threads, while other articles states that actors are not threads. I also understand that a thread can run multiple actors.
You are confusing the concept of an Actor with a possible implementation.
A very simple implementation of Actors in an existing environment would be to make every Actor a Process. That's how BEAM (the dominant implementation of Erlang) works, for example. This works, because on BEAM, Processes are extremely cheap and lightweight; a process only weighs about 300 Byte, and there can be tens of millions of Processes on a cheap 32 Bit single-core laptop from 10 years ago. This would not work with Windows Processes, for example, which are much more expensive.
There is also an implementation of Erlang on the JVM (called Erjang) which uses Kilim, an implementation of extremely lightweight Actors on the JVM using bytecode re-writing. (Note that both Erjang and Kilim appear to be dead.)
If you want to implement Actors on Threads, you have the problem that Actors are completely isolated, whereas Threads are completely shared. So, you need to provide this isolation somehow.
A common way to implement Actors on systems such as the JVM, is to implement Actors as Objects, and then schedule them to run on a Thread Pool to get the asynchronous properties. That's how Akka works.
I highly recommend to read through official documentation first:
https://doc.akka.io/docs/akka/current/scala/general/actors.html
Behind the scenes Akka will run sets of actors on sets of real threads, where typically many actors share one thread, and subsequent invocations of one actor may end up being processed on different threads. Akka ensures that this implementation detail does not affect the single-threadedness of handling the actor’s state.
A thread is a "sequence of activity" - independent of specific objects. It can execute code that belongs to arbitrary objects.
Whereas an actor is about a specific object that "owns" its own "sequence of activity". But please note that this actor sequence isn't necessarily a thread. To the contrary!
Looking at this from the implementation side, one could use one thread to drive many actor objects.
Related
I am in the process of designing a system where there's a main stream of objects and there are multiple workers which produces some result from that object. Finally, there is some special/unique worker (sort of a "sink", in terms of graph theory) which takes all the results, and process them to some final object which is written to some DB.
It is possible for a worker to be dependent on the result of some other workers (hence, waiting for their results)
Now, I'm facing several problems:
It could be that one worker is much slower than another. How do you deal with that? Adding more workers (= scaling) of the slower type? (maybe dynamically)
Suppose W_B is dependent on W_A. If W_B is down for some reason then the flow will stop and the system will stop working. So I'd like the system to bypass this worker, somehow.
Moreover, how do the final worker decide when to operate on the set of results? Suppose it has the results of A and B but lacking the result of C. It may be that C is down or it's just very slow at the moment. How can it make a decision?
It is worth mentioning that it's not a realtime application but rather an offline processing system (i.e. you may access the DB and alter a record), but at the same time, it has to deal with relatively large amount of objects in an "high pace".
Regarding technologies,
I'm developing the system with Java but I'm not bounded to a specific technology.
I'd be glad if you could help me with the general design of the system.
Thanks a lot!
As Peter said, it really depends on the use case. Some general remarks though:
If a worker is slower than the other, maybe create more instances of that type; eg Kubernetes allows dynamic Node creation, and Kafka allows to partition a topic so more than one instance can read off and process it.
If B depends on A and A is down, B can't work and that's it. Maybe restart A? Maybe you can do a regular health check on it.
If the final worker needs the results of A, B and C, how would it process without C being available? If it can, it can store the results of A and B, install a timer, and if that goes off without C having arrived, continue.
Some additional thoughts:
If you mean to say that some subtasks of the overall application are quicker to execute than others, then it can be a good idea to slice up the application so that each worker is doing a bit of everything -- in other words, a share of the quick work and a share of the slow work. But if you mean to say that some machines are slower than others, then you could run fewer workers on the slow machines, and more on the faster ones, so as to balance things so that each worker has roughly the same resources.
You might want to decouple your architecture with some sort of durable queueing between the workers.
It's common to use heartbeats with timeouts and restarts.
Distributed stream processing quickly becomes very complex. Your life will be much easier if you build on top a stream processing framework that provides high availability and exactly-once semantics out of the box.
I've been reading something about, found some libraries which really messed with my thoughts, like Akka, Quasar, Reactor and Disruptor, Akka and Quasar implements the Actor Pattern and the Disruptor is a Inter-Thread Messaging library, and Reactor is based on . So what are the advantages, use cases for using a message driven architecture over simple method calls?
Given a RabbitMQ queue listener, I receive a message from the method, decide which Type the RabbitMQ message is (NewOrder,Payment,...).
With a Message Driven library I could do.
Pseudo code:
actor.tell('decider-mailbox',message)
Which basically says "I'm putting this message here, when you guys can handle it, do it") and so on until it gets saved.
And the actor is ready again to receive another message
But with directly calling the method like messageHandler.handle(message), wouldn't be better and less abstracted ?
The Actors Model looks a lot like people working together; it is based on message-passing but there is much more to it and I'd say that not all message-passing models are the same, for example Quasar actually supports not only Erlang-like actors but also Go-like channels which are simpler but don't provide a fault-tolerance model (and fibers BTW, that are just like threads but much more lightweight, which you can use even without any message-passing at all).
Methods/functions follow a strict, nestable call-return (so request-response) discipline and usually don't involve any concurrency (at least in imperative and non-pure functional languages).
Message passing instead, very broadly speaking, allows looser coupling because doesn't enforce a request-response discipline and allows the communicating parties to execute concurrently, which also helps in isolating failures and in hot-upgrades and generally maintenance (for example, the Actors Model offers these features). Often message passing will also allow looser data contracts by using a more dynamic typing for messages (this is especially true for the Actors Model where each party, or actor, has a single incoming channel, that is his mailbox).
Other than that the details depends a lot on the messaging model/solution you're considering, for example the communication channels can synchronize the interacting parts or have limited/unlimited buffering, allow multiple source and/or multiple producers and consumers etc.
Note that RPC is really message passing but with a strict request-response communication discipline.
This means that, depending on the situation, one or the other may suit you better: methods/functions are better when you're in a call-return discipline and/or you're simply making your sequential code more modular. Message-passing is better when you need a network of potentially concurrent, autonomous "agents" that communicate but not necessarily in a request-response discipline.
As for the Actors Model I think you can build more insight about it for example by reading the first part of this blog post (notice: I'm the main author of the post and I'm part of the Parallel Universe - and Quasar - development team):
The actor model is a design pattern for fault-tolerant and highly scalable systems. Actors are independent worker-modules that communicate with other actors only through message-passing, can fail in isolation from other actors but can monitor other actors for failure and take some recovery measures when that happens. Actors are simple, isolated yet coordinated, concurrent workers.
Actor-based design brings many benefits:
Adaptive behaviour: interacting only through a message-queue makes actors loosely coupled and allows them to:
Isolate faults: mailboxes are decoupling message queues that allow actor restart without service disruption.
Manage evolution: they enable actor replacement without service disruption.
Regulate concurrency: receiving messages very often and discarding overflow or, alternatively, increasing mailbox size can maximize concurrency at the expense of reliability or memory usage respectively.
Regulate load: reducing the frequency of receive calls and using small mailboxes reduces concurrency and increases latencies, applying back-pressure through the boundaries of the actor system.
Maximum concurrency capacity:
Actors are extremely lightweight both in memory consumption and
management overhead, so it’s possible to spawn even millions in a
single box.
Because actors do not share state, they can safely run in parallel.
Low complexity:
Each actor can implements stateful behaviour by mutating its private state without worrying about concurrent modification.
Actors can simplify their state transition logic by selectively receiving messages from the mailbox in logical, rather than arrival order.
The difference is that the processing takes place in a different thread so the current one is ready to receive and forwards the next message. When you call the handler from the current thread it is blocked until processing is finished.
In a way it is just a matter of defining abstractions. Some say that originally object oriented programming was actually supposed to be based on message passing, and calling a method on an object would have the semantics of sending it a message (with similar async non-blocking behavior as in actors).
The way we implemented OO in most popular languages is such that it became what it is today - a "synchronous blocking order" to an object, controlled and ran from the same execution context (thread/process) as the caller. This is nice because it is easy to understand, but it has its limitations when designing concurrent systems.
In theory, you could create a language with similar syntax as Java, but give it different semantics - making object.method(arg) actually internally be something similar to actor.tell(msg). There are a lot of idioms that try to hide asynchronous calling and message passing behind simple method invocations, but as always it depends on the use case.
Akka provides a nice new syntax which makes it clear that what we are doing is something completely different than invoking methods on an object, in part to cause less confusion and make the message passing more explicit. In the end, you are stating the same thing - you are sending a message to an actor in the system, but you are doing it with less constraints than if you were calling one of its methods directly.
I've searched the site a bit for help understanding this, but haven't found anything super clear, so I thought I'd post my use case and see if anybody could shed some light.
I have a question about the scaling of jvm threads vs os threads when used in akka for io operations. From the akka site:
Akka supports dispatchers for both event-driven lightweight threads, allowing creation of millions threads on a single workstation, and thread-based Actors, where each dispatcher is bound to a dedicated OS thread.
The event-based Actors currently consume ~600 bytes per Actor which means that you can create more than 6.5 million Actors on 4 G RAM.
In this context, can you all help me understand how that matters on a workstation with only 1 processor (for simplicity). So, for my example use case, I want to take a list of say 1000 'Users' and then go query a database (or several) for various information about each user. So if I were to dispatch each of these 'get' tasks to an actor, and that actor is going to do IO, wouldn't that actor block based on the os thread limit for the workstation?
How does the akka actor model give me lift in a scenario like this? I know that I am probably missing something as I am not wildly knowledgeable on the interworkings of vm threads vs os threads, so if one of the smart folks here could spell it out for me, that would be great.
If I use Futures, don't I need to use await() or get() to block and wait for the reply?
In my use case, regardless of actors, would it end up just 'feeling' like I'm making 1000 sequential database requests?
If code snips are useful in helping me understand this, Java would be preferred as I am still coming up to speed on scala syntax - but a nice clear textual explanation of how these millions of threads can interoperate on a single processor machine while doing database IO would be fine too.
It is really hard to figure out what you are actually asking here, but here are some pointers:
If you are running on a modern JVM, there is typically a one-to-one relationship between Java threads and OS threads. (IIRC, Solaris allows you to do this differently ... but that's the exception.)
The amount of real parallelism you will get using threads, or anything built on top of threads is limited by the number of processors / cores that are available to the application. Beyond that, you will find that not all threads are actually executing at any given instant.
If you have 1000 Actors all trying to access the database "at the same time", then most of them will actually be waiting on the database itself, or on the thread scheduler. Whether this amounts to making 1000 sequential requests (i.e. strict serialization) will depend on the database and the queries / updates that the actors are doing.
The bottom line is that a computer system has hard limits on the resources available for doing stuff; e.g. number of processors, speed of processors, memory bandwidth, disc access times, network bandwidth, etc. You can design an application to be smart about the way it uses available resources, but you can't get it to use more resources than there actually are.
On reading the text that you quoted, it seems to me that it is talking about two different kinds of actors:
Thread-based actors have a 1 to 1 relationship with threads. There's no way you could have millions of this kind of actor in 4Gb memory.
Event-based actors work differently. Instead of having a thread at all times, they would mostly be sitting in a queue waiting for an event to happen. When that happened, an event processing thread would grab the actor from the queue and execute the "action" associated with the event. When the action finished, the thread moves onto another actor / event pair.
The quoted text is saying that the memory overhead of an event-based actor is ~600 bytes. They don't include the event thread ... because the event thread is shared by multiple actors.
Now I'm not an expert on Scala / Actors, but it is pretty obvious that there are certain things that you should avoid when using event-based actors. For instance, you should probably avoid talking directly to an external database because that is liable to block the event processing thread.
I think there may be a typo there. I think they meant to say:
Akka supports dispatchers for both event-driven lightweight actors,
allowing creation of millions actors on a single workstation, and thread-based Actors, where each actor is bound to a dedicated OS thread.
The event-driven actors use a thread pool - all of the (potentially millions of) actors share the same pool of threads. I'm not that familiar with Akka actors but generally you would not want to do blocking I/O with event-driven actors, otherwise you could cause starvation.
I am kicking off my final year project right now. I am going to be investigating the concurrency approaches from java and scala perspectives. Having come out of a java concurrency module, I can see why people say that the shared state threading approach is difficult to reason about. You have critical sections to worry about, run the risk of race conditions and deadlocks etc due to the non deterministic way in which java threads operate. With 1.5 this reasoning was given some clarity ,but still, far from crystal clear.
At first view, scala appears to remove this complex reasoning through the actors class. This has given the programmer the ability to develop concurrent systems from a more sequential viewpoint and easier to conceptualize. But, for this positive, am I right in saying that there are some drawbacks? For instance, say we want to sort a large list in both scenarios - with java you create two threads split the list in two, worry about the critical sections, atomic actions etc and go code. With scala, because it is "share nothing" you actually have to pass the list/2 to two actors to peform the sort operation, right?
I guess my question is that the price you pay for simpler reasoning is performance overhead of having to pass the collection to your actors, in scala?
I was thinking of doing some benchmark tests to this effect (selection sort, quick sort etc;) but because one is functional and one is imperative - I will not be comparing apples with apples from an algorithm viewpoint.
I would really appreciate any views you guys have on the above to give me some ideas to get me started.
Many thanks.
The nice thing about Scala is that you can do concurrency the Java way if you want. All the Java classes are available.
So it really boils down to the difference between a model where you have threads with concurrent access to mutable variables, and a model where you have stateful actors which send messages to each other but do not peek into each others' internals. And you're absolutely right that in some scenarios you have to trade off performance against ease of getting the code correct.
I generally find as a rough rule of thumb that if you're going to have a pile of threads spending a significant amount of time waiting for a lock to open up, using a Java model, and there is no clean way to separate the work to avoid having everyone waiting for that resource, and if the execution switches between threads quickly, then the Java model is far superior to an actor model where the actor sends an "I'm done" message back to a supervisor, which then sends out a "Here's new work!" message to an existing non-busy actor. Sorting algorithms, depending on how you envision them, can very much fall into this category.
For most everything else, the performance penalty associated with actors doesn't amount to much as far as I've seen. If you can conceive of your problem as lots and lots of reactive elements (i.e. they only need time when they've received a message), then actors can scale particularly well (millions available, though only a handful are working at any given instant); with threads, you'd need to have some sort of extra internal state to keep track of who should be doing what work, since you couldn't handle that many active threads.
I'm just going to point out here that Scala does not copy arguments passed to actors, so actors can share whatever it is passed to them.
As opposed to Erlang, it is the programmer's responsibility to avoid sharing mutable stuff. However, there is no penalty in sharing immutable stuff, since there's no need to lock it, as all accesses to it are read-only. And Scala has strong support for immutable data structures.
I have a problem which I believe is the classic master/worker pattern, and I'm seeking advice on implementation. Here's what I currently am thinking about the problem:
There's a global "queue" of some sort, and it is a central place where "the work to be done" is kept. Presumably this queue will be managed by a kind of "master" object. Threads will be spawned to go find work to do, and when they find work to do, they'll tell the master thing (whatever that is) to "add this to the queue of work to be done".
The master, perhaps on an interval, will spawn other threads that actually perform the work to be done. Once a thread completes its work, I'd like it to notify the master that the work is finished. Then, the master can remove this work from the queue.
I've done a fair amount of thread programming in Java in the past, but it's all been prior to JDK 1.5 and consequently I am not familiar with the appropriate new APIs for handling this case. I understand that JDK7 will have fork-join, and that that might be a solution for me, but I am not able to use an early-access product in this project.
The problems, as I see them, are:
1) how to have the "threads doing the work" communicate back to the master telling them that their work is complete and that the master can now remove the work from the queue
2) how to efficiently have the master guarantee that work is only ever scheduled once. For example, let's say this queue has a million items, and it wants to tell a worker to "go do these 100 things". What's the most efficient way of guaranteeing that when it schedules work to the next worker, it gets "the next 100 things" and not "the 100 things I've already scheduled"?
3) choosing an appropriate data structure for the queue. My thinking here is that the "threads finding work to do" could potentially find the same work to do more than once, and they'd send a message to the master saying "here's work", and the master would realize that the work has already been scheduled and consequently should ignore the message. I want to ensure that I choose the right data structure such that this computation is as cheap as possible.
Traditionally, I would have done this in a database, in sort of a finite-state-machine manner, working "tasks" through from start to complete. However, in this problem, I don't want to use a database because of the high volume and volatility of the queue. In addition, I'd like to keep this as light-weight as possible. I don't want to use any app server if that can be avoided.
It is quite likely that this problem I'm describing is a common problem with a well-known name and accepted set of solutions, but I, with my lowly non-CS degree, do not know what this is called (i.e. please be gentle).
Thanks for any and all pointers.
As far as I understand your requirements, you need ExecutorService. ExecutorService have
submit(Callable task)
method which return value is Future. Future is a blocking way to communicate back from worker to master. You could easily expand this mechanism to work is asynchronous manner. And yes, ExecutorService also maintaining work queue like ThreadPoolExecutor. So you don't need to bother about scheduling, in most cases. java.util.concurrent package already have efficient implementations of thread safe queue (ConcurrentLinked queue - nonblocking, and LinkedBlockedQueue - blocking).
Check out java.util.concurrent in the Java library.
Depending on your application it might be as simple as cobbling together some blocking queue and a ThreadPoolExecutor.
Also, the book Java Concurrency in Practice by Brian Goetz might be helpful.
First, why do you want to hold the items after a worker started doing them? Normally, you would have a queue of work and a worker takes items out of this queue. This would also solve the "how can I prevent workers from getting the same item"-problem.
To your questions:
1) how to have the "threads doing the
work" communicate back to the master
telling them that their work is
complete and that the master can now
remove the work from the queue
The master could listen to the workers using the listener/observer pattern
2) how to efficiently have the master
guarantee that work is only ever
scheduled once. For example, let's say
this queue has a million items, and it
wants to tell a worker to "go do these
100 things". What's the most efficient
way of guaranteeing that when it
schedules work to the next worker, it
gets "the next 100 things" and not
"the 100 things I've already
scheduled"?
See above. I would let the workers pull the items out of the queue.
3) choosing an appropriate data
structure for the queue. My thinking
here is that the "threads finding work
to do" could potentially find the same
work to do more than once, and they'd
send a message to the master saying
"here's work", and the master would
realize that the work has already been
scheduled and consequently should
ignore the message. I want to ensure
that I choose the right data structure
such that this computation is as cheap
as possible.
There are Implementations of a blocking queue since Java 5
Don't forget Jini and Javaspaces. What you're describing sounds very like the classic producer/consumer pattern that space-based architectures excel at.
A producer will write the jobs into the space. 1 or more consumers will take out jobs (under a transaction) and work on that in parallel, and then write the results back. Since it's under a transaction, if a problem occurs the job is made available again for another consumer .
You can scale this trivially by adding more consumers. This works especially well when the consumers are separate VMs and you scale across the network.
If you are open to the idea of Spring, then check out their Spring Integration project. It gives you all the queue/thread-pool boilerplate out of the box and leaves you to focus on the business logic. Configuration is kept to a minimum using #annotations.
btw, the Goetz is very good.
This doesn't sound like a master-worker problem, but a specialized client above a threadpool. Given that you have a lot of scavenging threads and not a lot of processing units, it may be worthwhile simply doing a scavaging pass and then a computing pass. By storing the work items in a Set, the uniqueness constraint will remove duplicates. The second pass can submit all of the work to an ExecutorService to perform the process in parallel.
A master-worker model generally assumes that the data provider has all of the work and supplies it to the master to manage. The master controls the work execution and deals with distributed computation, time-outs, failures, retries, etc. A fork-join abstraction is a recursive rather than iterative data provider. A map-reduce abstraction is a multi-step master-worker that is useful in certain scenarios.
A good example of master-worker is for trivially parallel problems, such as finding prime numbers. Another is a data load where each entry is independant (validate, transform, stage). The need to process a known working set, handle failures, etc. is what makes a master-worker model different than a thread-pool. This is why a master must be in control and pushes the work units out, whereas a threadpool allows workers to pull work from a shared queue.