Stream processing architecture

Stream processing architecture - java

I am in the process of designing a system where there's a main stream of objects and there are multiple workers which produces some result from that object. Finally, there is some special/unique worker (sort of a "sink", in terms of graph theory) which takes all the results, and process them to some final object which is written to some DB.
It is possible for a worker to be dependent on the result of some other workers (hence, waiting for their results)
Now, I'm facing several problems:
It could be that one worker is much slower than another. How do you deal with that? Adding more workers (= scaling) of the slower type? (maybe dynamically)
Suppose W_B is dependent on W_A. If W_B is down for some reason then the flow will stop and the system will stop working. So I'd like the system to bypass this worker, somehow.
Moreover, how do the final worker decide when to operate on the set of results? Suppose it has the results of A and B but lacking the result of C. It may be that C is down or it's just very slow at the moment. How can it make a decision?
It is worth mentioning that it's not a realtime application but rather an offline processing system (i.e. you may access the DB and alter a record), but at the same time, it has to deal with relatively large amount of objects in an "high pace".
Regarding technologies,
I'm developing the system with Java but I'm not bounded to a specific technology.
I'd be glad if you could help me with the general design of the system.
Thanks a lot!

As Peter said, it really depends on the use case. Some general remarks though:
If a worker is slower than the other, maybe create more instances of that type; eg Kubernetes allows dynamic Node creation, and Kafka allows to partition a topic so more than one instance can read off and process it.
If B depends on A and A is down, B can't work and that's it. Maybe restart A? Maybe you can do a regular health check on it.
If the final worker needs the results of A, B and C, how would it process without C being available? If it can, it can store the results of A and B, install a timer, and if that goes off without C having arrived, continue.

Some additional thoughts:
If you mean to say that some subtasks of the overall application are quicker to execute than others, then it can be a good idea to slice up the application so that each worker is doing a bit of everything -- in other words, a share of the quick work and a share of the slow work. But if you mean to say that some machines are slower than others, then you could run fewer workers on the slow machines, and more on the faster ones, so as to balance things so that each worker has roughly the same resources.
You might want to decouple your architecture with some sort of durable queueing between the workers.
It's common to use heartbeats with timeouts and restarts.
Distributed stream processing quickly becomes very complex. Your life will be much easier if you build on top a stream processing framework that provides high availability and exactly-once semantics out of the box.

Related

Alternative for synchronisation in multi threading in JAVA

I am a beginner in Multi threading and have this one doubt:
Is there any other alternative for traditional Synchronisation(which uses synchronised keywords) in java,since it affects the performance of the application?

As others have indicated, it depends on what you're trying to avoid, as well as what you're trying to achieve with multithreading.
If you mean "is there a zero-overhead way to do multithreading with shared resources," the answer is almost certainly "no." If two cars going in different directions approach an intersection at the same time, one of them will have to wait for the other one - there's no way that the cars can occupy the same space at the same time. That's why we have stop signs and traffic lights. (Alternatively, there are things like traffic circles, but even those have some overhead - you really can't just go through them at full speed as if they weren't there).
There are lots of ways of doing asynchronous and parallel operations other than using that type of synchronization:
Non-blocking I/O. The argument here is that, when you're interacting with a server or slow I/O device or something, most of the time is spent waiting for a response from the device or server, so you really don't need multiple threads to handle that - you just need to allow the original thread to do other work while it's waiting for a response. My usual analogy here is: suppose you go out to eat with a group of 10 people. When the waiter comes to take orders, the first person he asks to order isn't ready yet. The sensible thing to do, of course, is for the waiter to take other people's orders first, and then to come back to the first guy. There's no need to bring in separate waiters for each person's orders, bring in another waiter to wait for the first guy, or anything like that.
Promise/futures based async
Event-driven async
Using immutable data structures to minimize the amount of shared resources.
There are, of course, a lot of types of locking and synchronization mechanisms available other than just the synchronized keywords, such as counting semaphores, reader-writer locks, etc.
There are a lot of other types of concurrency as well, such as the actor model.
When used properly, these can help minimize your overhead and possibly reduce the amount of explicit locking and synchronization required. They all have overhead, though.
TL;DR You have overhead no matter what you do - just select the design and primitives that result in the smallest overhead for your particular use case.

You can look for ReentrantLock and ReentrantReadWriteLock.

Java Multithreading - More Threads That Do Less, or Fewer Threads that Do More?

EDIT: This question might be appropriate for other languages as well - the overall theory behind it seems mostly language agnostic. However, as this will run in a JVM, I'm sure there's differences between JVM overheads/threading and those of other environments.
EDIT 2: To clarify a little better, I guess the main question is which is better for scalability: to have smaller threads that can return quicker to enable processing other chunks of work for other workloads, or try to get a single workload through as quickly as possible? The workloads are sequential and multithreading won't help speed up a single unit of work in this case; it's more in hopes of increasing the throughput of the system overall (thanks to Uri for leading me towards the clarification).
I'm working on a system that's replacing an existing system; the current system has a pretty heavy load, so we already know the replacement needs to be highly scalable. It communicates with several outside processes, such as email, other services, databases, etc., and I'm already planning on making it multithreaded to help with scaling. I've worked on multithreaded apps before, just nothing with this high of a performance/scalability requirement, so I don't have much experience when it comes to getting the absolute most out of concurrency.
The question I have is what's the best way to divide the work up between threads? I'm looking at two different versions, one that creates a single thread for the full workflow, and another that creates a thread for each of the individual steps, continuing on to the next step (in a new/different thread) when the previous step completes - probably with a NodeJS-style callback system, but not terribly concerned about the direct implementation details.
I don't know much about the nitty-gritty details of multithreading - things like context switching, for example - so I don't know if the overhead of multiple threads would swamp the execution time in each of the threads. On one hand, the single thread model seems like it would be fastest for an individual work flow compared to the multiple threads; however, it would also tie up a single thread for the entire workflow, whereas the multiple threads would be shorter lived and would return to the pool quicker (I imagine, at least).
Hopefully the underlying concept is easy enough to understand; here's a contrived pseudo-code example though:
// Single-thread approach
foo();
bar();
baz();
Or:
// Multiple Thread approach
Thread.run(foo);
when foo.isDone()
Thread.run(bar);
when bar.isDone()
Thread.run(baz);
UPDATE: Completely forgot. The reason I'm considering the multithreaded approach is the (possibly mistaken) belief that, since the threads will have smaller execution times, they'll be available for other instances of the overall workload. If each operation takes, say 5 seconds, then the single-thread version locks up a thread for 15 seconds; the multiple thread version would lock up a single thread for 5 seconds, and then it can be used for another process.
Any ideas? If there's anything similar out there in the interwebs, I'd love even a link - I couldn't think of how to search for this (I blame Monday for that, but it would probably be the same tomorrow).

Multithreading is not a silver bullet. It's means to an end.
Before making any changes, you need to ask yourself where your bottlenecks are, and what you're really trying to parallelize. I'm not sure that without more information that we can give good advice here.
If foo, bar, and baz are part of a pipeline, you're not necessarily going to improve the overall latency of a single sequence by using multiple threads.
What you might be able to do is to increase your throughput by letting multiple executions of the pipeline over different input pieces work in parallel, by letting later items to travel through the pipeline while earlier items are blocked on something (e.g., I/O). For instance, if bar() for a particular input is blocked and waiting on a notification, it's possible that you could do computationally heavy operations on another input, or have CPU resources to devote to foo(). A particularly important question is whether any of the external dependencies act as a limited shared resource. e.g., if one thread is accessing system X, is another thread going to be affected?
Threads are also very effective if you want to divide and conquer your problem - splitting your input into smaller parts, running each part through the pipeline, and then waiting on all the pieces to be ready. Is that possible with the kind of workflow you're looking at?

If you need to first do foo, then do bar, and then do baz, you should have one thread do each of these steps in sequence. This is simple and makes obvious sense.
The most common case where you're better off with the assembly line approach is when keeping the code in cache is more important than keeping the data in cache. In this case, having one thread that does foo over and over can keep the code for this step in cache, keep branch prediction information around, and so on. However, you will have data cache misses when you hand the results of foo to the thread that does bar.
This is more complex and should only be attempted if you have good reason to think it will work better.

Use a single thread for the full workflow.
Dividing up the workflow can't improve the completion time for one piece of work: since the parts of the workflow have to be done sequentially anyway, only one thread can work on the piece of work at a time. However, breaking up the stages can delay the completion time for one piece of work, because a processor which could have picked up the last part of one piece of work might instead pick up the first part of another piece of work.
Breaking up the stages into multiple threads is also unlikely to improve the time to completion of all your work, relative to executing all the stages in one thread, since ultimately you still have to execute all the stages for all the pieces of work.
Here's an example. If you have 200 of these pieces of work, each requiring three 5 second stages, and say a thread pool of two threads running on two processors, keeping the entire workflow in a single thread results in your first two results after 15 seconds. It will take 1500 seconds to get all your results, but you only need the working memory for two of the pieces of work at a time. If you break up the stages, then it may take a lot longer than 15 seconds to get your first results, and you potentially may need memory for all 200 pieces of work proceeding in parallel if you still want to get all the results in 1500 seconds.
In most cases, there are no efficiency advantages to breaking up sequential stages into different threads, and there may be substantial disadvantages. Threads are generally only useful when you can use them to do work in parallel, which does not seem to be the case for your work stages.
However, there is a huge disadvantage to breaking up the stages into separate threads. That disadvantage is that you now need to write multithreaded code that manages the stages. It's extremely easy to write bugs in such code, and such bugs can be very difficult to catch prior to production deployment.
The way to avoid such bugs is to keep the threading code as simple as possible given your requirements. In the case of your work stages, the simplest possible threading code is none at all.

akka jvm threads vs os threads when performing io

I've searched the site a bit for help understanding this, but haven't found anything super clear, so I thought I'd post my use case and see if anybody could shed some light.
I have a question about the scaling of jvm threads vs os threads when used in akka for io operations. From the akka site:
Akka supports dispatchers for both event-driven lightweight threads, allowing creation of millions threads on a single workstation, and thread-based Actors, where each dispatcher is bound to a dedicated OS thread.
The event-based Actors currently consume ~600 bytes per Actor which means that you can create more than 6.5 million Actors on 4 G RAM.
In this context, can you all help me understand how that matters on a workstation with only 1 processor (for simplicity). So, for my example use case, I want to take a list of say 1000 'Users' and then go query a database (or several) for various information about each user. So if I were to dispatch each of these 'get' tasks to an actor, and that actor is going to do IO, wouldn't that actor block based on the os thread limit for the workstation?
How does the akka actor model give me lift in a scenario like this? I know that I am probably missing something as I am not wildly knowledgeable on the interworkings of vm threads vs os threads, so if one of the smart folks here could spell it out for me, that would be great.
If I use Futures, don't I need to use await() or get() to block and wait for the reply?
In my use case, regardless of actors, would it end up just 'feeling' like I'm making 1000 sequential database requests?
If code snips are useful in helping me understand this, Java would be preferred as I am still coming up to speed on scala syntax - but a nice clear textual explanation of how these millions of threads can interoperate on a single processor machine while doing database IO would be fine too.

It is really hard to figure out what you are actually asking here, but here are some pointers:
If you are running on a modern JVM, there is typically a one-to-one relationship between Java threads and OS threads. (IIRC, Solaris allows you to do this differently ... but that's the exception.)
The amount of real parallelism you will get using threads, or anything built on top of threads is limited by the number of processors / cores that are available to the application. Beyond that, you will find that not all threads are actually executing at any given instant.
If you have 1000 Actors all trying to access the database "at the same time", then most of them will actually be waiting on the database itself, or on the thread scheduler. Whether this amounts to making 1000 sequential requests (i.e. strict serialization) will depend on the database and the queries / updates that the actors are doing.
The bottom line is that a computer system has hard limits on the resources available for doing stuff; e.g. number of processors, speed of processors, memory bandwidth, disc access times, network bandwidth, etc. You can design an application to be smart about the way it uses available resources, but you can't get it to use more resources than there actually are.
On reading the text that you quoted, it seems to me that it is talking about two different kinds of actors:
Thread-based actors have a 1 to 1 relationship with threads. There's no way you could have millions of this kind of actor in 4Gb memory.
Event-based actors work differently. Instead of having a thread at all times, they would mostly be sitting in a queue waiting for an event to happen. When that happened, an event processing thread would grab the actor from the queue and execute the "action" associated with the event. When the action finished, the thread moves onto another actor / event pair.
The quoted text is saying that the memory overhead of an event-based actor is ~600 bytes. They don't include the event thread ... because the event thread is shared by multiple actors.
Now I'm not an expert on Scala / Actors, but it is pretty obvious that there are certain things that you should avoid when using event-based actors. For instance, you should probably avoid talking directly to an external database because that is liable to block the event processing thread.

I think there may be a typo there. I think they meant to say:
Akka supports dispatchers for both event-driven lightweight actors,
allowing creation of millions actors on a single workstation, and thread-based Actors, where each actor is bound to a dedicated OS thread.
The event-driven actors use a thread pool - all of the (potentially millions of) actors share the same pool of threads. I'm not that familiar with Akka actors but generally you would not want to do blocking I/O with event-driven actors, otherwise you could cause starvation.

Tutorial about Using multi-threading in jdbc

Our company has a Batch Application which runs every day, It does some database related jobs mostly, import data into database table from file for example.
There are 20+ tasks defined in that application, each one may depends on other ones or not.
The application execute tasks one by one, the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think it's too long, so I think maybe I can improve performance by multi-threading.
I think as there is dependency between tasks, it not good (or it's not easy) to make tasks run in parallel, but maybe I can use multi-threading to improve performance inside a task.
for example : we have a task defined as "ImportBizData", which copy data into a database table from a data file(usually contains 100,0000+ rows). I wonder is that worth to use multi-threading?
As I know a little about multi-threading, I hope some one provide some tutorial links on this topic.

Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate the last point: Currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contains a small, but not too small "unit of work". Push those into a queue
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the threads in step #1 wait for the slow hard disk to return the new lines of data. The result of this conversion step goes into the next queue
One or more threads to upload the data via JDBC.
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). Plus inserting data in a database is a very complex task (allocating space, updating indexes, checking foreign keys)
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you
You can quickly change the number of threads for each step to tweak performance

Multi threading may be of help, if the lines are uncorrelated, you may start off two processes one reading even lines, another uneven lines, and get your db connection from a connection pool (dbcp) and analyze performance. But first I would investigate whether jdbc is the best approach normally databases have optimized solutions for imports like this. These solutions may also temporarily switch of constraint checking of your table, and turn that back on later, which is also great for performance. As always depending on your requirements.
Also you may want to checkout springbatch which is designed for batch processing.

As far as I know,the JDBC Bridge uses synchronized methods to serialize all calls to ODBC so using mutliple threads won't give you any performance boost unless it boosts your application itself.

I am not all that familiar with JDBC but regarding the multithreading bit of your question, what you should keep in mind is that parallel processing relies on effectively dividing your problem into bits that are independent of one another and in some way putting them back together (their output that is). If you dont know the underlying dependencies between tasks you might end up having really odd errors/exceptions in your code. Even worse, it might all execute without any problems, but the results might be off from true values. Multi-threading is tricky business, in a way fun to learn (at least I think so) but pain in the neck when things go south.
Here are a couple of links that might provide useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting effort to getting into multi-threading I can recommend GOETZ, BRIAN: JAVA CONCURRENCY, amazing book really..
Good luck

I had a similar task. But in my case, all the tables were unrelated to each other.
STEP1:
Using SQL Loader(Oracle) for uploading data into database(very fast) OR any similar bulk update tools for your database.
STEP2:
Running each uploading process in a different thread(for unrelated tasks) and in a single thread for related tasks.
P.S. You could identify different inter-related jobs in your application and categorize them in groups; and running each group in different threads.
Links to run you up:
JAVA Threading
follow the last example in the above link(Example: Partitioning a large task with multiple threads)
SQL Loader can dramatically improve performance

The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc

If Multi threading would complicate your work, you could go with Async messaging. I'm not fully aware of what your needs are, so, the following is from what I am seeing currently.
Create a file reader java whose purpose is to read the biz file and put messages into the JMS queue on the server. This could be plain Java with static void main()
Consume the JMS messages in the Message driven beans(You can set the limit on the number of beans to be created in the pool, 50 or 100 depending on the need) if you have mutliple servers, well and good, your job is now split into multiple servers.
Each row of data is asynchronously split between 2 servers and 50 beans on each server.
You do not have to deal with threads in the whole process, JMS is ideal because your data is within a transaction, if something fails before you send an ack to the server, the message will be resent to the consumer, the load will be split between the servers without you doing anything special like multi threading.
Also, spring is providing spring-batch which can help you. http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios

Patterns/Principles for thread-safe queues and "master/worker" program in Java

I have a problem which I believe is the classic master/worker pattern, and I'm seeking advice on implementation. Here's what I currently am thinking about the problem:
There's a global "queue" of some sort, and it is a central place where "the work to be done" is kept. Presumably this queue will be managed by a kind of "master" object. Threads will be spawned to go find work to do, and when they find work to do, they'll tell the master thing (whatever that is) to "add this to the queue of work to be done".
The master, perhaps on an interval, will spawn other threads that actually perform the work to be done. Once a thread completes its work, I'd like it to notify the master that the work is finished. Then, the master can remove this work from the queue.
I've done a fair amount of thread programming in Java in the past, but it's all been prior to JDK 1.5 and consequently I am not familiar with the appropriate new APIs for handling this case. I understand that JDK7 will have fork-join, and that that might be a solution for me, but I am not able to use an early-access product in this project.
The problems, as I see them, are:
1) how to have the "threads doing the work" communicate back to the master telling them that their work is complete and that the master can now remove the work from the queue
2) how to efficiently have the master guarantee that work is only ever scheduled once. For example, let's say this queue has a million items, and it wants to tell a worker to "go do these 100 things". What's the most efficient way of guaranteeing that when it schedules work to the next worker, it gets "the next 100 things" and not "the 100 things I've already scheduled"?
3) choosing an appropriate data structure for the queue. My thinking here is that the "threads finding work to do" could potentially find the same work to do more than once, and they'd send a message to the master saying "here's work", and the master would realize that the work has already been scheduled and consequently should ignore the message. I want to ensure that I choose the right data structure such that this computation is as cheap as possible.
Traditionally, I would have done this in a database, in sort of a finite-state-machine manner, working "tasks" through from start to complete. However, in this problem, I don't want to use a database because of the high volume and volatility of the queue. In addition, I'd like to keep this as light-weight as possible. I don't want to use any app server if that can be avoided.
It is quite likely that this problem I'm describing is a common problem with a well-known name and accepted set of solutions, but I, with my lowly non-CS degree, do not know what this is called (i.e. please be gentle).
Thanks for any and all pointers.

As far as I understand your requirements, you need ExecutorService. ExecutorService have
submit(Callable task)
method which return value is Future. Future is a blocking way to communicate back from worker to master. You could easily expand this mechanism to work is asynchronous manner. And yes, ExecutorService also maintaining work queue like ThreadPoolExecutor. So you don't need to bother about scheduling, in most cases. java.util.concurrent package already have efficient implementations of thread safe queue (ConcurrentLinked queue - nonblocking, and LinkedBlockedQueue - blocking).

Check out java.util.concurrent in the Java library.
Depending on your application it might be as simple as cobbling together some blocking queue and a ThreadPoolExecutor.
Also, the book Java Concurrency in Practice by Brian Goetz might be helpful.

First, why do you want to hold the items after a worker started doing them? Normally, you would have a queue of work and a worker takes items out of this queue. This would also solve the "how can I prevent workers from getting the same item"-problem.
To your questions:
1) how to have the "threads doing the
work" communicate back to the master
telling them that their work is
complete and that the master can now
remove the work from the queue
The master could listen to the workers using the listener/observer pattern
2) how to efficiently have the master
guarantee that work is only ever
scheduled once. For example, let's say
this queue has a million items, and it
wants to tell a worker to "go do these
100 things". What's the most efficient
way of guaranteeing that when it
schedules work to the next worker, it
gets "the next 100 things" and not
"the 100 things I've already
scheduled"?
See above. I would let the workers pull the items out of the queue.
3) choosing an appropriate data
structure for the queue. My thinking
here is that the "threads finding work
to do" could potentially find the same
work to do more than once, and they'd
send a message to the master saying
"here's work", and the master would
realize that the work has already been
scheduled and consequently should
ignore the message. I want to ensure
that I choose the right data structure
such that this computation is as cheap
as possible.
There are Implementations of a blocking queue since Java 5

Don't forget Jini and Javaspaces. What you're describing sounds very like the classic producer/consumer pattern that space-based architectures excel at.
A producer will write the jobs into the space. 1 or more consumers will take out jobs (under a transaction) and work on that in parallel, and then write the results back. Since it's under a transaction, if a problem occurs the job is made available again for another consumer .
You can scale this trivially by adding more consumers. This works especially well when the consumers are separate VMs and you scale across the network.

If you are open to the idea of Spring, then check out their Spring Integration project. It gives you all the queue/thread-pool boilerplate out of the box and leaves you to focus on the business logic. Configuration is kept to a minimum using #annotations.
btw, the Goetz is very good.

This doesn't sound like a master-worker problem, but a specialized client above a threadpool. Given that you have a lot of scavenging threads and not a lot of processing units, it may be worthwhile simply doing a scavaging pass and then a computing pass. By storing the work items in a Set, the uniqueness constraint will remove duplicates. The second pass can submit all of the work to an ExecutorService to perform the process in parallel.
A master-worker model generally assumes that the data provider has all of the work and supplies it to the master to manage. The master controls the work execution and deals with distributed computation, time-outs, failures, retries, etc. A fork-join abstraction is a recursive rather than iterative data provider. A map-reduce abstraction is a multi-step master-worker that is useful in certain scenarios.
A good example of master-worker is for trivially parallel problems, such as finding prime numbers. Another is a data load where each entry is independant (validate, transform, stage). The need to process a known working set, handle failures, etc. is what makes a master-worker model different than a thread-pool. This is why a master must be in control and pushes the work units out, whereas a threadpool allows workers to pull work from a shared queue.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.