How to have a dynamic number of producers and consumers? - java

There is a program implemented using the producer-consumer pattern. The producer fetches data from the database based on a list of queries and puts it in an ArrayBlockingQueue. The consumer prepares an Excel report from the data in the queue. To increase performance, I want a dynamic number of producers and consumers: for example, when the producer is slow, add more producers; when the consumer is slow, add more consumers. How can I have dynamic producers and consumers?

If you do this, you must first ask yourself a couple of questions:
How will you make sure that multiple parallel producers put items in the queue in the correct order? This might or might not be possible - it depends on the kind of problem you are dealing with.
How will you make sure that multiple parallel consumers don't "steal" each other's items from the queue? Again, this depends on your problem; in some cases this might be desirable and in others it's forbidden. You didn't provide enough information, but typically if you prepare data for a report, you will need a single consumer and have to wait until the report data is complete.
Is this actually going to achieve any speedup? Did you actually measure that the bottleneck is I/O bound on the producer side, or are you just assuming? If the bottleneck is CPU-bound, you will not achieve anything.
So, assuming that you need complete data for report (i.e. single consumer, which needs the full data), and that your data can be "sharded" to independent subsets, and that the bottleneck is in fact what you think it is, you could do it like this:
As multiple producers will be producing different parts of the results, they will not be sequential. So a list is not a good option; you would need a data structure where you store interim results and track which ranges have been completed and which are still missing. Alternatively, you could use one list per producer as a buffer and have a "merge" thread which writes to a single output list for the consumer.
You need to split input data to several input pieces (one per producer)
You need to somehow track the ordering and ensure that the consumer takes out pieces in correct order
You can start consumer at the moment the first output piece comes out
You must stop the consumer when the last piece is produced.
In short, this is the kind of problem for which you should probably think about using something like MapReduce.
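The scheme above can be sketched with plain java.util.concurrent: futures naturally give you "one piece per producer" plus consumption in submission order. This is only a sketch under the assumptions stated in the answer (shardable input, single consumer); fetchFromDb is a hypothetical stand-in for the real per-query DB call.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: shard the queries across several producers and let a single
// consumer read the pieces back in submission order.
public class ShardedProducers {

    // Hypothetical stand-in for the real per-query database fetch.
    static String fetchFromDb(String query) {
        return query.toUpperCase();
    }

    static List<String> run(List<String> queries, int producers) {
        ExecutorService pool = Executors.newFixedThreadPool(producers);
        List<Future<String>> pieces = new ArrayList<>();
        for (String q : queries) {                   // split the input per producer
            pieces.add(pool.submit(() -> fetchFromDb(q)));
        }
        List<String> report = new ArrayList<>();
        try {
            for (Future<String> piece : pieces) {    // single consumer, in order: it can
                report.add(piece.get());             // start on the first finished piece
            }                                        // and stops after the last one
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return report;
    }
}
```

Because the consumer walks the futures in submission order, ordering is preserved even though the producers finish in arbitrary order; `get()` on the first future is exactly "start the consumer when the first piece comes out".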


Advantage of "forcing" the partition

I have a topic theimportanttopic with three partitions.
Is there an advantage to "forcing" the partition assignment?
For instance, I have three consumers in one group. I want consumer1 to always and only consume from partition-0, consumer2 to always and only consume from partition-1 and consumer3 to always and only consume from partition-2.
One consumer should not touch any partition at any point in time except the one that was assigned to it.
A drawback I can think of is that when one of the consumers goes down, no one is consuming from the partition.
Let's suppose a fancy self-healing architecture is in place that can bring back any of those lost consumers very efficiently.
Would it be an advantage, knowing there won't be any partition reassignment cost to the healthy consumers? The healthy consumers can focus on their own partition, etc.
Are there any other pros and cons?
https://docs.spring.io/spring-kafka/reference/html/#tip-assign-all-parts
https://docs.spring.io/spring-kafka/reference/html/#manual-assignment
It seems the API allows forcing the partition; I was wondering if this use case was one of the purposes of this design.
How do you know what the "number" of each consumer is? Based on your last questions, you've either used Kubernetes or set concurrency in Spring Kafka. In either case, pods/threads rebalance across partitions of the same executable application, so you cannot scale them and assign them to specific partitions without extra external locking logic.
In my opinion, all executable consumers should be equally able to handle any partition.
Plus, as you pointed out, there's downtime if one stops.
But the use case is to exactly match the producer: you've produced data with some custom partitioner logic, so you need matching consumers that read only a subset of that data.
Also, manual assignment doesn't use consumer groups, so while there would be no rebalancing, it is not possible to monitor lag using tools like Burrow or the consumer-groups CLI. Lag would need to be gathered directly from the consumer metrics themselves.
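The trade-off can be illustrated with a toy stdlib model (no Kafka client involved): each in-memory collection stands in for a partition, and each live consumer drains only the partition it was pinned to. Killing one consumer leaves its partition untouched, which is exactly the drawback noted in the question.

```java
import java.util.*;

// Toy model of static assignment: each Deque stands in for a Kafka
// partition; consumer index p is pinned to partition p. There is no
// rebalancing, so a dead consumer's partition is simply never drained.
public class PinnedAssignment {

    // Drains partition p for every consumer index in `alive`; returns how
    // many records are left behind in each partition afterwards.
    static List<Integer> leftover(List<Deque<String>> partitions, Set<Integer> alive) {
        for (int p = 0; p < partitions.size(); p++) {
            if (alive.contains(p)) {         // consumer p is up: it drains partition p
                partitions.get(p).clear();   // (stand-in for "consume everything")
            }                                // consumer p is down: nobody else touches it
        }
        List<Integer> remaining = new ArrayList<>();
        for (Deque<String> part : partitions) remaining.add(part.size());
        return remaining;
    }
}
```

With consumers 0 and 2 alive and consumer 1 dead, partitions 0 and 2 drain to empty while partition 1 keeps accumulating; a group-managed assignment would instead rebalance partition 1 onto a survivor.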

Mechanism for reading all messages from a queue in one request

I need a solution for the following scenario which is similar to a queue:
I want to write messages to a queue continuously. My message is very big, containing a lot of data so I do want to make as few requests as possible.
So my queue will contain a lot of messages at some point.
My consumer will read from the queue every hour (not whenever a new message is written), and it will read all the messages from the queue.
The problem is that I need a way to read ALL the messages from the queue using only one call (I also want the consumer to make as few requests to the queue as possible).
A close solution would be ActiveMQ, but the problem is that you can only read one message at a time, and I need to read them all in one request.
So my question is: would there be other ways of doing this more efficiently? What I actually need is to persist, in some way, messages created continuously by some application and then consume them (and delete them) with the same application all at once, every hour.
The reason I thought a queue would be fit is because as the messages are consumed they are also deleted but I need to consume them all at once.
I think there are some important things to keep in mind as you're searching for a solution:
In what way do you need to be "more efficient" (e.g. time, monetary cost, computing resources, etc.)?
It's incredibly hard to prove that there are, in fact, no other "more efficient" ways to solve a particular problem, as that would require one to test all possible solutions. What you really need to know is, given your specific use-case, what solution is good enough. This, of course, requires knowing specifically what kind of performance numbers you need and the constraints on acquiring those numbers (e.g. time, monetary cost, computing resources, etc.).
Modern message broker clients (e.g. those shipped with either ActiveMQ 5.x or ActiveMQ Artemis) don't make a network round-trip for every message they consume as that would be extremely inefficient. Rather, they fetch blocks of messages in configurable sizes (e.g. prefetchSize for ActiveMQ 5.x, and consumerWindowSize for ActiveMQ Artemis). Those messages are stored locally in a buffer of sorts and fed to the client application when the relevant API calls are made to receive a message.
Making "as few requests as possible" is rarely a way to increase performance. Modern message brokers scale well with concurrent consumers. Consuming all the messages with a single consumer drastically limits the message throughput as compared to spinning up multiple threads which each have their own consumer. Rather than limiting the number of consumer requests you should almost certainly be maximizing them until you reach a point of diminishing returns.
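The prefetch behavior described above can be sketched with the standard library alone: instead of one "network round-trip" per message, the client pulls up to prefetchSize messages per trip into a local buffer and serves receive() calls from that buffer, in the spirit of ActiveMQ's prefetchSize / Artemis' consumerWindowSize. The broker here is just an in-process queue, so this is an illustration of the idea, not the real client.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of client-side prefetch: fetch a block of messages per
// "round-trip" and serve receive() from a local buffer.
public class PrefetchingClient {
    private final BlockingQueue<String> broker;     // stands in for the remote queue
    private final Deque<String> buffer = new ArrayDeque<>();
    private final int prefetchSize;
    int roundTrips = 0;                             // how often we "hit the network"

    PrefetchingClient(BlockingQueue<String> broker, int prefetchSize) {
        this.broker = broker;
        this.prefetchSize = prefetchSize;
    }

    String receive() {
        if (buffer.isEmpty() && !broker.isEmpty()) { // buffer exhausted: fetch a block
            roundTrips++;
            broker.drainTo(buffer, prefetchSize);
        }
        return buffer.poll();                        // null once everything is consumed
    }

    // Convenience for the sketch: round-trips needed to drain `messages` messages.
    static int roundTripsToDrain(int messages, int prefetchSize) {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(messages);
        for (int i = 0; i < messages; i++) q.add("m" + i);
        PrefetchingClient client = new PrefetchingClient(q, prefetchSize);
        while (client.receive() != null) { }
        return client.roundTrips;
    }
}
```

Draining 10 messages with a prefetch of 4 costs three "round-trips" instead of ten, which is the point the answer is making: the client API looks one-message-at-a-time, but the network traffic is batched underneath.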

What's the best way to asynchronously handle low-speed consumer (database) in high performance Java application

One EventHandler (DatabaseConsumer) of the Disruptor calls stored procedures in the database, which are so slow that they block the Disruptor for some time.
Since I need the Disruptor to keep running without blocking, I am thinking of adding an extra queue so that the EventHandler could serve as a producer and another newly created thread could serve as a consumer handling the database work, asynchronously, without affecting the Disruptor.
Here are some constraints:
The object that the Disruptor passes to the EventHandler is around 30 KB, and there are about 400k of them. In theory, the total size of the objects that need to be handled is around 30 KB × 400K = 12 GB, so the extra queue has to be able to hold that much.
Since performance matters, GC pause should be avoided.
The heap size of the Java program is only 2GB.
I'm thinking of a text file as an option: the EventHandler (producer) writes the objects to the file, and the consumer reads from it and calls the stored procedure. The problem is how to handle reaching the end of the file and how to detect newly written lines.
Has anyone solved a situation like this before? Any advice?
The short answer is: size your Disruptor to cope with the size of your bursts, not your entire volume. Bear in mind the Disruptor can just contain a reference to the 30 KB object; the entire object does not need to be in the ring buffer.
Any form of buffering before your database will require memory for the buffer. The Disruptor offers you the option of applying back pressure on the rest of the system when the database has fallen too far behind; that is to say, you can slow down the inputs to the Disruptor.
The other option for spooling to files is to look at Java Chronicle, which uses memory-mapped files to persist things to disk.
The much more complicated answer is to take advantage of the batching effects of the Disruptor so that your DB can catch up, i.e. using an EventHandler which collects a batch of events together and submits them to the database as one unit.
This practice allows the EventHandler to become more efficient as things back up, thus increasing throughput.
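In the real Disruptor this batching hook is the `endOfBatch` flag passed to `onEvent`. Since that needs the Disruptor library, here is the same idea with a plain stdlib queue, where one `drainTo` plays the role of one batch: the further behind the consumer falls, the bigger (and therefore more efficient) each database submission gets. `submitToDb` is a hypothetical stand-in for "call the stored procedure once per batch".

```java
import java.util.*;
import java.util.concurrent.*;

// Stdlib sketch of the batching idea: drain whatever has backed up and
// hand it to the database as one unit instead of one call per event.
public class BatchingConsumer {

    // Hypothetical stand-in for one stored-procedure call covering a batch.
    static void submitToDb(List<String> batch) {
        // e.g. one INSERT ... VALUES (...), (...), (...) for the whole batch
    }

    // Returns the number of DB calls made to clear the given backlog.
    static int drainInBatches(BlockingQueue<String> queue, int maxBatch) {
        List<String> batch = new ArrayList<>();
        int dbCalls = 0;
        while (queue.drainTo(batch, maxBatch) > 0) {  // one drain = one batch
            submitToDb(batch);
            dbCalls++;
            batch.clear();
        }
        return dbCalls;
    }
}
```

With a backlog of 10 events and a batch limit of 4, the backlog is cleared in three DB calls (4 + 4 + 2) rather than ten, which is exactly the "more efficient as things back up" effect described above.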
Short answer: don't use disruptor. Use a distributed MQ with retransmission support.
Long answer: if you have fast producers with slow consumers you will need some sort of retransmission mechanism. I don't think you can escape from that unless you can tolerate nasty blocks (i.e. huge latencies) in your system. That's where distributed MQs (message queues) come into play. The Disruptor is not a distributed MQ, but you could try to implement something similar. The idea is:
All messages are sequenced and processed in order by the consumer
If the queue gets full, messages are dropped
If the consumer detects a message gap it will request a retransmission of the lost messages, buffering the future messages until it receives the gap
With that approach the consumer can be as slow as it wants because it can always request the retransmission of any message it lost at any time. What we are missing here is the retransmission entity. In a distributed MQ that will be a separate and independent node persisting all messages to disk, so it can replay back any message to any other node at any time. Since you are not talking about an MQ here, but about disruptor, then you will have to somehow implement that retransmission mechanism yourself on another thread. This is a very interesting problem without an easy answer or recipe. I would use multiple disruptor queues so your consumer could do something like:
Read from the main channel (i.e. main disruptor queue)
If you detect a sequence gap, go to another disruptor queue connected to the replayer thread. You will actually need two queues there, one to request the missing messages and another one to receive them.
The replayer thread would have another disruptor queue from where it is receiving all messages and persisting it to disk.
It is left to you to make sure your replayer thread can write messages to disk fast enough. If it cannot, then there is no escape besides blocking the whole system. Fortunately, disk I/O can be done very fast if you know what you are doing.
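The gap-detection step at the heart of this scheme can be sketched in isolation: the consumer tracks the next expected sequence number, buffers anything that arrives ahead of it, and asks the replayer for the missing range. This is a toy single-threaded model; the retransmission callback stands in for the queue to the (hypothetical) replayer thread.

```java
import java.util.*;
import java.util.function.*;

// Toy sketch of sequence-gap detection: deliver in-order messages, buffer
// future ones, and request retransmission of anything missing in between.
public class GapDetector {
    private long nextExpected = 0;
    private final SortedMap<Long, String> pending = new TreeMap<>(); // buffered future msgs
    private final List<String> delivered = new ArrayList<>();
    private final LongConsumer requestRetransmit;   // callback to the replayer

    GapDetector(LongConsumer requestRetransmit) {
        this.requestRetransmit = requestRetransmit;
    }

    void onMessage(long seq, String msg) {
        if (seq > nextExpected) {                   // gap: buffer and ask for a replay
            pending.put(seq, msg);
            for (long missing = nextExpected; missing < seq; missing++) {
                if (!pending.containsKey(missing)) requestRetransmit.accept(missing);
            }
            return;
        }
        if (seq < nextExpected) return;             // duplicate: ignore
        delivered.add(msg);                         // in order: deliver it...
        nextExpected++;
        while (pending.containsKey(nextExpected)) { // ...plus any buffered follow-ups
            delivered.add(pending.remove(nextExpected));
            nextExpected++;
        }
    }

    List<String> delivered() { return delivered; }
}
```

Note that the consumer never blocks on the gap: it keeps buffering while the replay is in flight, and delivery resumes in order as soon as the missing message arrives.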
You can forget all I said if you can just afford to block the producers if the consumers are slow. But if the producers are getting messages from the network, blocking them will eventually give you packet drops (UDP) and probably an IOException (TCP).
As you can see this is a very interesting question with a very complicated answer. At Coral Blocks we have experience developing distributed MQs like that on top of CoralReactor. You can take a look in some of the articles we have on our website.

producer-consumer: how to inform consumers that production has completed

I have the following situation:
Read data from database
do work "calculation"
write result to database
I have a thread that reads from the database and puts the generated objects into a BlockingQueue. These objects are extremely heavyweight, hence the queue to limit the number of objects in memory.
Multiple threads take objects from the queue, perform work and put the results in a second queue.
The final thread takes results from second queue and saves result to database.
The problem is how to prevent deadlocks, e.g. the "calculation threads" need to know when no more objects will be put into the queue.
Currently I achieve this by passing references to the threads (Callables) around and checking isDone() before a poll or offer, and then checking whether the element is null. I also check the size of the queue; as long as there are elements in it, they must be consumed. Using take or put leads to deadlocks.
Is there a simpler way to achieve this?
One way to accomplish this is to put a "dummy" or "poison" message on the queue as the last message, once you are sure that no more tasks will arrive on the queue, for example after putting the message for the last row of the DB query. So the producer puts a dummy message on the queue, and a consumer receiving this dummy message knows that no more meaningful work is expected in this batch.
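A minimal sketch of the poison-pill idea, assuming one sentinel per consumer (the usual variant when several calculation threads share the queue): with a guaranteed end marker, blocking take()/put() become safe again and the isDone()/null checks disappear.

```java
import java.util.*;
import java.util.concurrent.*;

// Poison-pill sketch: the producer ends the stream with one sentinel per
// consumer, so every consumer sees exactly one pill and stops cleanly.
public class PoisonPill {
    static final String PILL = "__END_OF_WORK__";   // sentinel, never a real payload

    // Returns the total number of real items processed across all consumers.
    static int runConsumers(BlockingQueue<String> queue, int consumers) {
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        List<Future<Integer>> counts = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            counts.add(pool.submit(() -> {
                int processed = 0;
                while (true) {
                    String item = queue.take();     // blocking take is now safe
                    if (item.equals(PILL)) break;   // my pill: production is done
                    processed++;                    // the "calculation" would go here
                }
                return processed;
            }));
        }
        int total = 0;
        for (Future<Integer> c : counts) {
            try { total += c.get(); } catch (Exception e) { throw new RuntimeException(e); }
        }
        pool.shutdown();
        return total;
    }
}
```

An alternative with a single pill is to have each consumer put the pill back before exiting, so it cascades through all consumers; one pill per consumer is simply the easier variant to reason about.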
Maybe you should take a look at CompletionService.
It is designed to combine executor and queue functionality in one.
Tasks which have completed execution will be available from the completion service via completionServiceInstance.take().
You can then use another executor for step 3, i.e. filling the DB with the results, which you feed with the results taken from the completionServiceInstance.
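A short sketch of that suggestion: submit the calculation tasks to an ExecutorCompletionService and take() results as they finish. The end-of-stream bookkeeping from the question disappears, because the number of submitted tasks is known up front. The `row * row` calculation is a hypothetical stand-in for the real work.

```java
import java.util.*;
import java.util.concurrent.*;

// CompletionService sketch: the executor is the worker pool and the
// completion queue replaces the hand-rolled second queue.
public class CompletionServiceDemo {
    static List<Integer> process(List<Integer> rows) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Integer> service = new ExecutorCompletionService<>(pool);
        for (int row : rows) {
            service.submit(() -> row * row);        // stand-in for the calculation
        }
        List<Integer> results = new ArrayList<>();
        try {
            for (int i = 0; i < rows.size(); i++) {
                results.add(service.take().get());  // blocks until the next task finishes
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        Collections.sort(results);                  // completion order is arbitrary
        return results;
    }
}
```

Note the sort at the end: take() yields results in completion order, not submission order, so if the DB-writing step needs the original order you would carry the row index along with each result instead.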

How to achieve maximum concurrency for a distributed application using a database as medium of communication

I have an application which is similar to the classic producer-consumer problem. I just wanted to check out all the possible implementations to achieve it. The problem is:
Process A: inserts a row into the table in database (producers)
Process B: reads M rows from the table and deletes them after processing (consumers)
Tasks in process B:
1. Read M rows
2. Process these rows
3. Delete these rows
N1 instances of process A,
N2 instances of process B runs concurrently.
Each instance runs on a different box.
Some requirements:
If a process p1 is reading rows (0, M-1), process p2 should not wait for p1 to release the lock on these rows; instead it should read rows (M, 2M-1).
I bet there are better ways of parallel processing than using the DB as the exchange point between producer and consumer. Why not queues? Have you checked the tools/frameworks designed for Map/Reduce? Hadoop, GridGain and JPPF can all do this.
A similar concept is used in Java 1.5's ConcurrentHashMap, where different segments can be locked independently so readers and writers don't all wait on one lock.
A list of rows which are being processed should be maintained separately. When any process needs to interact with the DB, it should check whether those rows are being processed by another process. If so, it should wait on that condition; otherwise it can process them. Maintaining indexes might help in such a case.
I think that if this application is implemented this way, it actually uses a hand-made queue. I believe that JMS is much better in this case. There are a lot of JMS implementations available, most of them open source.
In your case process A should insert tasks into the queue, and process B should block on receive(), get N messages and then process them. You probably have reasons to take a bulk of tasks from your queue, but if you change to a JMS-based implementation you probably do not need that at all: you can just listen to the queue and process each message immediately. The implementation becomes almost trivial, very flexible and scalable. You can run as many processes A and B as you want and distribute them among separate boxes.
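The contract the JMS suggestion relies on can be sketched with a stdlib BlockingQueue: process A put()s tasks, process B blocks taking them, and there is no row locking and no delete step, because taking a message already removes it. A real JMS broker adds persistence and cross-machine distribution on top of this same contract.

```java
import java.util.*;
import java.util.concurrent.*;

// Stdlib sketch of the queue-based exchange: the producer's "insert a row"
// becomes add(), the consumer's "read M rows + delete" becomes take().
public class QueueExchange {
    static List<String> runOnce(List<String> tasks) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        Thread processA = new Thread(() -> {
            for (String t : tasks) queue.add(t);    // producer: publish a task
        });
        Thread processB = new Thread(() -> {
            try {
                for (int i = 0; i < tasks.size(); i++) {
                    processed.add(queue.take());    // consumer: receive() removes it
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        processA.start();
        processB.start();
        try { processA.join(); processB.join(); } catch (InterruptedException e) { }
        return processed;
    }
}
```

Note that the "p2 should not wait for p1's rows" requirement is satisfied for free: two consumers calling take() on the same queue never receive the same task, with no extra locking logic in application code.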
You may also want to take a look into Amazon Elastic Map Reduce
http://aws.amazon.com/elasticmapreduce/
