I need a solution for the following scenario which is similar to a queue:
I want to write messages to a queue continuously. My messages are very big and contain a lot of data, so I want to make as few requests as possible.
So my queue will contain a lot of messages at some point.
My consumer will read from the queue once every hour (not whenever a new message is written), and it will read all the messages from the queue.
The problem is that I need a way to read ALL the messages from the queue using only one call (I also want the consumer to make as few requests to the queue as possible).
A close solution would be ActiveMQ but the problem is that you can only read one message at a time and I need to read them all in one request.
So my question is: are there other ways of doing this more efficiently? What I actually need is to persist messages that are created continuously by some application, and then have the same application consume (and delete) them all at once, every hour.
The reason I thought a queue would fit is that messages are deleted as they are consumed, but I need to consume them all at once.
I think there are some important things to keep in mind as you're searching for a solution:
In what way do you need to be "more efficient" (e.g. time, monetary cost, computing resources, etc.)?
It's incredibly hard to prove that there are, in fact, no other "more efficient" ways to solve a particular problem, as that would require one to test all possible solutions. What you really need to know is, given your specific use-case, what solution is good enough. This, of course, requires knowing specifically what kind of performance numbers you need and the constraints on acquiring those numbers (e.g. time, monetary cost, computing resources, etc.).
Modern message broker clients (e.g. those shipped with either ActiveMQ 5.x or ActiveMQ Artemis) don't make a network round-trip for every message they consume as that would be extremely inefficient. Rather, they fetch blocks of messages in configurable sizes (e.g. prefetchSize for ActiveMQ 5.x, and consumerWindowSize for ActiveMQ Artemis). Those messages are stored locally in a buffer of sorts and fed to the client application when the relevant API calls are made to receive a message.
Making "as few requests as possible" is rarely a way to increase performance. Modern message brokers scale well with concurrent consumers. Consuming all the messages with a single consumer drastically limits the message throughput as compared to spinning up multiple threads which each have their own consumer. Rather than limiting the number of consumer requests you should almost certainly be maximizing them until you reach a point of diminishing returns.
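As a rough illustration of the last two points (fetching in blocks and consuming concurrently), here is a minimal sketch that uses an in-process java.util.concurrent.BlockingQueue as a stand-in for the broker. It is not ActiveMQ client code: the class name ParallelDrain and the parameters workers and batchSize are made up for the example, and with a real broker each worker would own its own consumer and the block size would come from the client's prefetch or window setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: empties a queue with several workers, each pulling blocks of messages.
final class ParallelDrain {

    /** Drains everything currently queued and returns how many messages were handled. */
    static int drain(BlockingQueue<String> queue, int workers, int batchSize) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                results.add(pool.submit(() -> {
                    int handled = 0;
                    List<String> batch = new ArrayList<>(batchSize);
                    // drainTo does not block; it returns 0 once the queue is empty.
                    while (queue.drainTo(batch, batchSize) > 0) {
                        handled += batch.size(); // placeholder for the real per-message work
                        batch.clear();
                    }
                    return handled;
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("drain failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether four workers beat one depends on where the time goes in the per-message work, so this is something to measure rather than assume.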
There is a program which is implemented using the producer-consumer pattern. The producer fetches data from the DB based on a list of queries and puts it in an ArrayBlockingQueue. The consumer prepares an Excel report based on the data in the queue. To increase performance, I want to have a dynamic number of producers and consumers: for example, when the producer is slow, have more producers, and when the consumer is slow, have more consumers. How can I have dynamic producers and consumers?
If you do this, you must first ask yourself a couple of questions:
How will you make sure that multiple parallel producers put items in the queue in the correct order? This might or might not be possible - it depends on the kind of problem you are dealing with.
How will you make sure that multiple parallel consumers don't "steal" each other's items from the queue? Again, this depends on your problem, in some cases this might be desirable and in others it's forbidden. You didn't provide enough information, but typically if you prepare data for report, you will need to have a single consumer and wait until the report data is complete.
Is this actually going to achieve any speedup? Did you actually measure that the bottleneck is I/O bound on the producer side, or are you just assuming? If the bottleneck is CPU-bound, you will not achieve anything.
So, assuming that you need complete data for report (i.e. single consumer, which needs the full data), and that your data can be "sharded" to independent subsets, and that the bottleneck is in fact what you think it is, you could do it like this:
As multiple producers will be producing different parts of results, they will not be sequential. So a list is not a good option; you would need a data structure where you would store interim results and care about which ranges have been completed and which ranges are still missing. Possibly, you could use one list per producer as a buffer and have a "merge" thread which will write to a single output list for consumer.
You need to split input data to several input pieces (one per producer)
You need to somehow track the ordering and ensure that the consumer takes out pieces in correct order
You can start consumer at the moment the first output piece comes out
You must stop the consumer when the last piece is produced.
In short, this is the kind of problem for which you should probably think about using something like MapReduce.
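The steps above can be sketched with plain java.util.concurrent, under the assumptions already stated (independent shards, a single consumer that needs the data in order). The class name OrderedShardPipeline and the produce function are hypothetical. Ordering is tracked simply by keeping one Future per shard in shard order, which replaces the separate "merge" thread mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: parallel producers over shards, one consumer reading pieces in order.
final class OrderedShardPipeline {

    static <I, O> List<O> run(List<I> input, int shards, Function<I, O> produce) {
        ExecutorService producers = Executors.newFixedThreadPool(shards);
        try {
            // Split the input into contiguous pieces, one producer task per piece.
            List<Future<List<O>>> pieces = new ArrayList<>();
            int chunk = (input.size() + shards - 1) / shards;
            for (int start = 0; start < input.size(); start += chunk) {
                List<I> shard = input.subList(start, Math.min(start + chunk, input.size()));
                pieces.add(producers.submit(() -> {
                    List<O> out = new ArrayList<>(shard.size());
                    for (I item : shard) {
                        out.add(produce.apply(item));
                    }
                    return out;
                }));
            }
            // Single consumer: starts as soon as the first piece is ready,
            // takes pieces in shard order, and stops after the last one.
            List<O> report = new ArrayList<>(input.size());
            for (Future<List<O>> piece : pieces) {
                report.addAll(piece.get());
            }
            return report;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("pipeline failed", e);
        } finally {
            producers.shutdown();
        }
    }
}
```

Here the consumer only appends to a list; in the report scenario it would write rows to the spreadsheet instead.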
One EventHandler (DatabaseConsumer) of the Disruptor calls stored procedures in the database, which is so slow that it blocks the Disruptor for some time.
I need the Disruptor to keep running without blocking, so I am thinking of adding an extra queue: the EventHandler would act as producer and a newly created thread would act as consumer and handle the database work asynchronously, without affecting the Disruptor.
Here are some constraints:
The objects that the Disruptor passes to the EventHandler are around 30 KB each, and there are about 400K of them. In theory, the total size of the objects that need to be handled is around 30 KB × 400K = 12 GB, so the extra queue would need to be large enough to hold them.
Since performance matters, GC pause should be avoided.
The heap size of the Java program is only 2GB.
I'm thinking of a text file as an option. The EventHandler (producer) writes the objects to the file and the consumer reads them and calls the stored procedure. The problem is how to handle reaching the end of the file and how to detect newly appended lines.
Has anyone solved this situation before? Any advice?
The short answer is: size your Disruptor to cope with the size of your bursts, not your entire volume. Bear in mind that the Disruptor can just contain a reference to the 30 KB object; the entire object does not need to be in the ring buffer.
Any form of buffering in front of your database will require memory for the buffer. The Disruptor offers you the option of applying back pressure to the rest of the system when the database has fallen too far behind. That is to say, you can slow down the inputs to the Disruptor.
Another option, for spooling to files, is to look at Java Chronicle, which uses memory-mapped files to persist things to disk.
The much more complicated answer is to take advantage of the batching effects of the Disruptor so that your DB can catch up, i.e. use an EventHandler which collects a batch of events together and submits them to the database as one unit.
This practice allows the EventHandler to become more efficient as things back up thus increasing throughput.
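A minimal sketch of such a batching handler follows. To stay self-contained it does not import the Disruptor library; the onEvent method simply has the same shape as EventHandler.onEvent(event, sequence, endOfBatch). The class name BatchingDbHandler, the maxBatch limit and the dbWriter callback (standing in for a single stored-procedure call per batch) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical handler: collects events and hands them to the database as one unit.
final class BatchingDbHandler<T> {
    private final int maxBatch;
    private final Consumer<List<T>> dbWriter; // assumed: performs one DB round trip per call
    private final List<T> pending = new ArrayList<>();

    BatchingDbHandler(int maxBatch, Consumer<List<T>> dbWriter) {
        this.maxBatch = maxBatch;
        this.dbWriter = dbWriter;
    }

    // Same parameters as the Disruptor's EventHandler.onEvent.
    public void onEvent(T event, long sequence, boolean endOfBatch) {
        pending.add(event);
        // endOfBatch is true for the last event currently available to this handler,
        // so a slow database automatically leads to larger batches next time round.
        if (endOfBatch || pending.size() >= maxBatch) {
            dbWriter.accept(new ArrayList<>(pending));
            pending.clear();
        }
    }
}
```

Because everything is flushed by the time endOfBatch is seen, the handler never holds events across batches; the maxBatch cap only keeps a single database call from growing without bound.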
Short answer: don't use disruptor. Use a distributed MQ with retransmission support.
Long answer: If you have fast producers with slow consumers you will need some sort of retransmission mechanism. I don't think you can escape from that unless you can tolerate nasty blocks (i.e. huge latencies) in your system. That's when distributed MQs (message queues) come into play. Disruptor is not a distributed MQ, but you could try to implement something similar. The idea is:
All messages are sequenced and processed in order by the consumer
If the queue gets full, messages are dropped
If the consumer detects a message gap it will request a retransmission of the lost messages, buffering the future messages until it receives the gap
With that approach the consumer can be as slow as it wants because it can always request the retransmission of any message it lost at any time. What we are missing here is the retransmission entity. In a distributed MQ that will be a separate and independent node persisting all messages to disk, so it can replay back any message to any other node at any time. Since you are not talking about an MQ here, but about disruptor, then you will have to somehow implement that retransmission mechanism yourself on another thread. This is a very interesting problem without an easy answer or recipe. I would use multiple disruptor queues so your consumer could do something like:
Read from the main channel (i.e. main disruptor queue)
If you detect a sequence gap, go to another disruptor queue connected to the replayer thread. You will actually need two queues there, one to request the missing messages and another one to receive them.
The replayer thread would have another disruptor queue from which it receives all messages and persists them to disk.
You are left to make sure your replayer thread can write messages fast enough to disk. If it cannot then there is no escape besides blocking the whole system. Fortunately disk i/o can be done very fast if you know what you are doing.
You can forget all I said if you can just afford to block the producers if the consumers are slow. But if the producers are getting messages from the network, blocking them will eventually give you packet drops (UDP) and probably an IOException (TCP).
As you can see this is a very interesting question with a very complicated answer. At Coral Blocks we have experience developing distributed MQs like that on top of CoralReactor. You can take a look at some of the articles we have on our website.
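The consumer side of the sequencing idea above (process in order, detect a gap, buffer future messages until the gap is filled) might look roughly like this. It is only a sketch: the class name GapDetectingConsumer and the two callbacks are hypothetical, the replayer itself is not shown, and retransmitted messages are assumed to arrive through the same onMessage method.

```java
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical consumer: delivers messages strictly in sequence order.
final class GapDetectingConsumer {
    private final Consumer<String> sink;                     // receives payloads in order
    private final BiConsumer<Long, Long> requestRetransmit;  // asked for a missing range (from, to)
    private final TreeMap<Long, String> held = new TreeMap<>(); // future messages waiting for a gap
    private long nextExpected;

    GapDetectingConsumer(long firstSequence, Consumer<String> sink,
                         BiConsumer<Long, Long> requestRetransmit) {
        this.nextExpected = firstSequence;
        this.sink = sink;
        this.requestRetransmit = requestRetransmit;
    }

    void onMessage(long sequence, String payload) {
        if (sequence < nextExpected || held.containsKey(sequence)) {
            return; // duplicate, already delivered or already buffered
        }
        if (sequence > nextExpected) {
            // Gap detected: request only the range that has not been requested yet.
            long highestSeen = held.isEmpty() ? nextExpected - 1 : held.lastKey();
            if (sequence > highestSeen + 1) {
                requestRetransmit.accept(highestSeen + 1, sequence - 1);
            }
            held.put(sequence, payload);
            return;
        }
        deliver(payload);
        // The gap may now be closed: release buffered messages that follow on directly.
        while (!held.isEmpty() && held.firstKey() == nextExpected) {
            deliver(held.pollFirstEntry().getValue());
        }
    }

    private void deliver(String payload) {
        sink.accept(payload);
        nextExpected++;
    }
}
```

The buffer of held messages is unbounded here; a real implementation would need a limit and a timeout for retransmission requests that are never answered.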
We currently have a distributed setup where we are publishing events to SQS and we have an application which has multiple hosts that drains messages from the queue and does some transformation over it and transmits to interested parties. I have a use case where the receiving end point has scalability concerns with the message volume and hence we would like to batch these messages periodically (say every 15 mins) in the application before sending it.
The incoming message rate is around 200 messages per second and each message is no more than 10 KB. This system need not be real time, although that would definitely be good to have, and the order is not important (it's okay if a batch containing older messages gets sent first).
One approach that I can think of is maintaining an embedded database within the application (each host) that batches the events and another thread that runs periodically and clears the data.
Another approach could be to create timestamped buckets in a distributed key-value store (S3, Dynamo, etc.) where we write each message to the correct bucket based on the message's timestamp, and we periodically clear the buckets.
We can run into several issues here: since the messages would be out of order, a bucket might already have been cleared when a late message arrives (this can be solved by having a default bucket, though), we would need to decide accurately when to clear a bucket, etc.
The way I see it, at least two components would be required: one which does the batching into temporary storage and another that clears it.
Any feedback on the above approaches would help. Also, this looks like a common problem; are there any existing solutions that I can leverage?
Thanks
I am working on designing a system that uses an ETL tool to retrieve batches of data, i.e., inserts/updates/deletes for one or more tables, and puts them on a JMS topic to be processed later by multiple clients. Right now, each message on the topic represents a single record I/U/D and we have a special message to delimit the end of the batch. It's important to process each batch in a single transaction, so having a bunch of messages delimited by a special one is not ideal: both the publishing and the receiving sessions must be designed for multiple messages; the batch delimiter message is a messy solution (each time we receive a message we need to check if it's the last) and very error prone; the system is difficult to debug and maintain; the number of messages on the topic quickly becomes huge (up to millions).
Now, I think that the next natural step to improve the architecture is to pack all the records in a single JMS message so that when a message is received, it encompasses a single transaction, it's easy to detect failures, there are no "orphan" records on the topic, etc. I only see advantages in doing so! Now here are my questions:
What's the best way to create such a packed message? I think my choices are StreamMessage, BytesMessage or ObjectMessage. I excluded text and map messages because the first will require text parsing, which will kill performance, and the second doesn't really seem to fit the scenario. I'm kinda leaning towards StreamMessage because it seems quite compact, although it will require a lot of work writing custom serialization code (even worse for BytesMessage). Not sure about ObjectMessage, how does it perform? Is there an out of the box solution for this?
What's the maximum size allowed per message? Could it be in the order of hundreds of KB or even few MB?
Thanks for the thoughts!
Giovanni
Instead of using one large message, you could use two (or more) queues, correlation ids and a message selector.
Queueing:
Post a notification message to "notification queue" to indicate that processing should start
Post command messages to "command queue" with the correlation ID set to the notification message's message ID (you can use multiple command queues if the queue depth gets too high)
Commit the transaction
Processing:
Receive the notification message from "notification queue" (e.g. with message driven bean)
Receive and process all the related messages from "command queue" using a message selector
Commit the transaction
Using bytes (e.g. a BytesMessage) is likely the least memory-intensive option.
If you manipulate Java objects, you can use a fast and byte-efficient serialization/deserialization library like Kryo.
We happily use Kryo in production on a messaging system, but you have plenty of alternatives, such as the popular Google Protocol Buffers.
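To give an idea of what packing a whole batch into bytes involves, here is a small hand-rolled sketch using only java.io (so not Kryo or Protocol Buffers, which would replace most of this code). The Change class and its three fields are an assumed record shape invented for the example. The resulting array is what you would pass to BytesMessage.writeBytes(byte[]) on the publishing side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical codec: one byte[] payload per batch of insert/update/delete records.
final class ChangeBatchCodec {

    static final class Change {
        final char op;      // 'I', 'U' or 'D'
        final String table;
        final String row;   // assumed: the record data, already flattened to a string

        Change(char op, String table, String row) {
            this.op = op;
            this.table = table;
            this.row = row;
        }
    }

    static byte[] encode(List<Change> batch) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(batch.size()); // record count up front, so no delimiter message is needed
            for (Change c : batch) {
                out.writeByte(c.op);
                out.writeUTF(c.table); // note: writeUTF is limited to 65535 encoded bytes per string
                out.writeUTF(c.row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<Change> decode(byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            int count = in.readInt();
            List<Change> batch = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                batch.add(new Change((char) in.readUnsignedByte(), in.readUTF(), in.readUTF()));
            }
            return batch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```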
I was curious regarding the most common (or recommended) implementations of disruptor about the journaling step. And the most common questions of mine are:
How is it actually implemented (by example)?
Is it wise to use JPA?
What DB is commonly used (by the community that has already implemented projects with the Disruptor)?
Is it wise to journal in the intermediate handlers (of the EventProcessors), so that the state of each message is saved at each step, rather than only before and after the business logic runs?
By the way (I am sorry, I know this is not related to the journaling step), what is the right way to delete a message from the RingBuffer during EventHandler processing (assuming that the message is dead/expired and should be removed from the whole procedure)? I was thinking of something similar to the Dead Letter Channel pattern.
Cheers!
The Disruptor is usually used for low latency, high throughput processing, e.g. millions of messages with a typical latency in the hundreds of microseconds. As very few databases can handle this rate of updates with reasonably bounded delays, journaling is often done to a raw file with replication to a second (or third) system.
For reporting purposes, a system reads this file or listens to messages and updates a database as quickly as it can but this is taken out of the critical path.
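As a rough picture of what journaling to a raw file can mean, here is a minimal sketch that appends length-prefixed records with java.nio and forces them to disk at the end of each batch. The class name FileJournal and the record layout (4-byte length followed by the payload) are assumptions for the example; replication to a second system is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical append-only journal: [int length][payload bytes] per record.
final class FileJournal implements AutoCloseable {
    private final FileChannel channel;

    FileJournal(Path file) {
        try {
            channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends one record; syncs to disk only when the caller signals the end of a batch. */
    void append(byte[] payload, boolean endOfBatch) {
        ByteBuffer record = ByteBuffer.allocate(4 + payload.length);
        record.putInt(payload.length).put(payload).flip();
        try {
            while (record.hasRemaining()) {
                channel.write(record);
            }
            if (endOfBatch) {
                channel.force(false); // one sync per batch instead of one per message
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A journaling event handler would call append from its onEvent method and pass the endOfBatch flag straight through; the reporting system mentioned above can then read the same file at its own pace.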
An entry is dead in the ring buffer when every event processor has processed it.
The slot a message uses is not available until every event processor has processed it and all the messages before it. Deleting a message would be too expensive, both in terms of performance and impact on the design.
Every event processor sees every message. As this happens concurrently, there is little cost in doing this, but it is quite normal for event processors to ignore messages as a result (possibly most messages).