The collection aggregator used in the Mule 2.0 framework works a bit like this:
An inbound router takes a collection of messages and splits it up into a number of smaller messages; each smaller message gets stamped with a correlation id corresponding to the parent message
These messages flow through various services
Finally these messages arrive at an inbound aggregator that collects up the messages based on the correlation id of the parent message and the number of expected messages. Once all of the expected messages have been received then the aggregation function is called and the result is returned.
Now this works fine when the number of messages in a group is reasonably small. However once the number of messages in a group becomes huge ~100k then a lot of memory is tied up holding onto the group of messages waiting for the later messages to arrive. This is made worse if there are multiple groups being aggregated at the same time.
A way around this issue would be to implement a streaming aggregator. In my use case I am essentially summing up the various messages based on a key and this could be done without having to see all of the messages in the group at the same time. I'd only want to know that all of the messages had been received before forwarding the result onto the endpoint.
Does this sound like a reasonable solution to the problem?
Is this already implemented somewhere in Mule?
Are there better ways of doing this?
This seems like a reasonable approach (I'm not a Mule expert by any means). I have read all of the Mule documentation and don't think anything like this exists already: the streaming support is limited to a few connectors and transformers, and it's pretty simple in that it just passes around an InputStream. Only a few things in Mule stream, so if you use transformers you may need modified versions of them that stream as well. You would just implement an aggregator that provides an InputStream and starts streaming as soon as it has received some consecutive sequence of messages.
However, one sentence in your description, "... all of the messages had been received before forwarding the results to the endpoint", could be troubling. By its very nature this defeats the purpose of streaming, unless you mean that you (presumably in your service component) will keep track of whether you have received everything before forwarding the (presumably much smaller) processed result onwards.
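To make the "keep track and forward the small result" idea concrete, here is a minimal sketch in plain Java. It is not a Mule API: the class name, the method signature, and the assumption that each message carries a correlation id, an expected group size, a key, and a numeric value are all hypothetical. It only shows that a running sum per key plus a counter is enough state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical streaming aggregator (illustrative names, not part of Mule).
 * Each message is folded into a running per-key sum as it arrives, so the
 * group itself is never buffered.
 */
class StreamingSumAggregator {

    /** Running state for one correlation group. */
    private static final class GroupState {
        final Map<String, Long> sums = new HashMap<>();
        int received;
    }

    private final Map<String, GroupState> groups = new HashMap<>();

    /**
     * Folds one message into its group. Returns the per-key sums once
     * expectedCount messages have been seen, otherwise Optional.empty().
     */
    synchronized Optional<Map<String, Long>> accept(
            String correlationId, int expectedCount, String key, long value) {
        GroupState state = groups.computeIfAbsent(correlationId, id -> new GroupState());
        state.sums.merge(key, value, Long::sum);
        state.received++;
        if (state.received < expectedCount) {
            return Optional.empty();
        }
        // The group is complete: release its state and hand back the result.
        groups.remove(correlationId);
        return Optional.of(state.sums);
    }
}
```

Memory then grows with the number of distinct keys per group rather than with the number of messages. Note that this sketch only counts messages, so a redelivered duplicate would be counted twice; detecting that would need extra state such as sequence numbers.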
Related
I am processing messages from Kafka in a standard processing loop:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processMessage(record);
    }
}
What should I do if my Kafka consumer hits a timeout while processing the records? I mean the timeout controlled by the property session.timeout.ms.
When this happens, my consumer should stop processing the records, because it would lose its partitions and the records that it processes could be already processed by another consumer. If the original consumer writes some processing results into a database, it could overwrite the records produced by the "new" consumer that got the partitions after my original consumer timed out.
I know about the ConsumerRebalanceListener, but from my understanding its method onPartitionsLost would only be called after I call the poll method from the consumer. Therefore this doesn't help me to stop the processing loop of the batch of records that I received from the previous poll.
I would expect that the heartbeat thread could notify me that it was not able to contact the broker and that we have a session timeout in the consumer, but there doesn't seem to be anything like that...
Am I missing something?
Adding this as an answer as it would be too long in a comment.
Kafka offers a few delivery semantics that can be used to process messages:
At most once;
At least once; and
Exactly once.
You are describing that you would like to use Kafka with exactly-once semantics (which, by the way, is the least common way of using Kafka). Producers also need to play nicely, because by default Kafka can produce the same message more than once.
It's a lot more common to build services that use the at-least-once mechanism. That way you can receive (or process) the same message more than once, but you need a way to deduplicate (it's the same idea as idempotency in HTTP APIs). You'll need something in the message that is unique, and you need to record that this ID has already been processed. If the payload has nothing you can use to deduplicate, you can add a header to the message and use that.
This is also useful in the scenario that you have to reset the offset, so the service can go through old messages without breaking.
I suggest searching for more details on how to implement the above.
Here's a blog post from Confluent about developing exactly-once semantics, "Improved Robustness and Usability of Exactly-Once Semantics in Apache Kafka", and the Kafka docs explain the different semantics.
About the point of the ConsumerRebalanceListener, you don't need to do anything if you follow the solution of using idempotency in the consumer. Rebalances also happen when an app crashes, and in that scenario the service might have processed some records, but not committed them yet to Kafka.
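As a rough illustration of the deduplication idea, here is a minimal sketch in plain Java with no Kafka dependency. The class name and the in-memory set are assumptions made for the example; in a real service the set of processed IDs would more likely be a database table with a unique constraint, ideally written in the same transaction as the processing result.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/**
 * Hypothetical idempotent wrapper around a message handler.
 * The in-memory set stands in for a durable store of processed IDs.
 */
class IdempotentProcessor {

    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    IdempotentProcessor(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Returns true if the payload was handled, false if the ID was a duplicate. */
    boolean process(String messageId, String payload) {
        // add() returns false when the ID is already present, i.e. a redelivery.
        if (!processedIds.add(messageId)) {
            return false;
        }
        try {
            handler.accept(payload);
            return true;
        } catch (RuntimeException e) {
            // Forget the ID again so that a later redelivery can retry the work.
            processedIds.remove(messageId);
            throw e;
        }
    }
}
```

Inside the poll loop you would call something like process(uniqueId, record.value()), where uniqueId comes from the payload or from a header the producer sets.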
A mini tip I give to everyone who is starting with Kafka: it looks simple from the outside, but it's a complex technology. Don't use it in production until you know the nitty-gritty details of how it works and have done a good amount of negative testing (unless you are OK with losing data).
I am creating two Apache Camel (Blueprint XML) Kafka projects: one is kafka-producer, which accepts requests and stores them in the Kafka server, and the other is kafka-consumer, which picks up messages from the Kafka server and processes them.
This setup is working fine for single topic and single consumer. However how do I create separate consumer groups within same kafka topic? How to route multiple consumer specific messages within same topic inside different consumer groups? Any help is appreciated. Thank you.
Your question is quite general as it's not very clear what's the problem you are trying to solve, therefore it's hard to understand if there's a better way to implement the solution.
Anyway, let's start by saying that, as far as I can understand, you are looking for a Selective Consumer (EIP), which is not supported out of the box by Kafka and its Consumer API. A Selective Consumer can choose which messages to pick from the queue or topic based on specific selector values that a producer sets in advance. This feature must be implemented in the message broker as well, but Kafka has no such capability.
Kafka does implement a hybrid solution between pure pub/sub and queue. That being said, what you can do is subscribe to the topic with one or more consumer groups (more on that later) and filter out all the messages you're not interested in by inspecting the messages themselves. In the messaging and EIP world, this pattern is known as Array of Filters. As you can imagine, this happens after the message has been broadcast to all subscribers. If that solution does not fit your requirements or context, you can think of implementing a Content Based Router, which is intended to dispatch the message to a subset of consumers only, under your centralized control (this would imply intermediate consumer-specific channels, which could be other Kafka topics or seda/VM queues, of course).
Moving to the second question, here is the official Kafka Component website: https://camel.apache.org/components/latest/kafka-component.html.
In order to create different consumer groups, you just have to define multiple routes, each of them with a dedicated groupId. By adding the groupId property, you inform the consumer group coordinators (which reside in the Kafka brokers) that multiple separate groups of consumers exist, and the brokers use that to treat them separately (each group receives its own copy of every message stored in the topic)...
Here is an example:
public void configure() throws Exception {
    // First consumer group: receives its own copy of every message on myTopic.
    from("kafka:myTopic?brokers={{kafkaBootstrapServers}}"
            + "&groupId=myFirstConsumerGroup")
        .log("Message received by myFirstConsumerGroup : ${body}");

    // Second consumer group: consumes the same topic independently of the first.
    from("kafka:myTopic?brokers={{kafkaBootstrapServers}}"
            + "&groupId=mySecondConsumerGroup")
        .log("Message received by mySecondConsumerGroup : ${body}");
}
As you can see, I created two routes in the same RouteBuilder, and therefore in the same Java process. That's a very bad design decision in most of the use cases I can think of, because there is no single responsibility, the concerns are not segregated, and the routes will not scale. But again, it depends on your requirements/context.
For completeness, please consider taking a look at all the other Kafka Component properties, as there may be many other configurations of interest to you, such as the number of consumer threads per group.
I tried to stay high level, in order to initiate the discussion... I'll edit my answer in case of new updates from you. Hope I helped!
I need a solution for the following scenario which is similar to a queue:
I want to write messages to a queue continuously. My message is very big, containing a lot of data so I do want to make as few requests as possible.
So my queue will contain a lot of messages at some point.
My consumer will read from the queue once every hour (not whenever a new message is written), and it will read all the messages from the queue.
The problem is that I need a way to read ALL the messages from the queue using only one call (I also want the consumer to make as few requests to the queue as possible).
A close solution would be ActiveMQ but the problem is that you can only read one message at a time and I need to read them all in one request.
So my question is: would there be other ways of doing this more efficiently? What I actually need is to persist, in some way, messages created continuously by some application and then consume them (and also delete them) from the same application all at once, every hour.
The reason I thought a queue would be fit is because as the messages are consumed they are also deleted but I need to consume them all at once.
I think there are some important things to keep in mind as you're searching for a solution:
In what way do you need to be "more efficient" (e.g. time, monetary cost, computing resources, etc.)?
It's incredibly hard to prove that there are, in fact, no other "more efficient" ways to solve a particular problem, as that would require one to test all possible solutions. What you really need to know is, given your specific use-case, what solution is good enough. This, of course, requires knowing specifically what kind of performance numbers you need and the constraints on acquiring those numbers (e.g. time, monetary cost, computing resources, etc.).
Modern message broker clients (e.g. those shipped with either ActiveMQ 5.x or ActiveMQ Artemis) don't make a network round-trip for every message they consume as that would be extremely inefficient. Rather, they fetch blocks of messages in configurable sizes (e.g. prefetchSize for ActiveMQ 5.x, and consumerWindowSize for ActiveMQ Artemis). Those messages are stored locally in a buffer of sorts and fed to the client application when the relevant API calls are made to receive a message.
Making "as few requests as possible" is rarely a way to increase performance. Modern message brokers scale well with concurrent consumers. Consuming all the messages with a single consumer drastically limits the message throughput as compared to spinning up multiple threads which each have their own consumer. Rather than limiting the number of consumer requests you should almost certainly be maximizing them until you reach a point of diminishing returns.
I am working on designing a system that uses an ETL tool to retrieve batches of data, i.e., inserts/updates/deletes for one or more tables, and puts them on a JMS topic to be processed later by multiple clients. Right now, each message on the topic represents a single record I/U/D, and we have a special message to delimit the end of the batch. It's important to process each batch in a single transaction, so having a bunch of messages delimited by a special one is not ideal: both the publishing and the receiving sessions must be designed for multiple messages; the batch delimiter message is a messy and very error-prone solution (each time we receive a message we need to check whether it's the last); the system is difficult to debug and maintain; and the number of messages on the topic quickly becomes huge (up to millions).
Now, I think that the next natural step to improve the architecture is to pack all the records in a single JMS message so that when a message is received, it encompasses a single transaction, it's easy to detect failures, there are no "orphan" records on the topic, etc. I only see advantages in doing so! Now here are my questions:
What's the best way to create such a packed message? I think my choices are StreamMessage, BytesMessage or ObjectMessage. I excluded text and map messages because the first would require text parsing, which would kill performance, and the second doesn't really seem to fit the scenario. I'm kind of leaning towards StreamMessage because it seems quite compact, although it will require a lot of work writing custom serialization code (even more so for BytesMessage). I'm not sure about ObjectMessage: how does it perform? Is there an out-of-the-box solution for this?
What's the maximum size allowed per message? Could it be in the order of hundreds of KB or even few MB?
Thanks for the thoughts!
Giovanni
Instead of using one large message, you could use two (or more) queues, correlation ids and a message selector.
Queueing:
Post a notification message to "notification queue" to indicate that processing should start
Post command messages to "command queue" with the correlation id set to the notification message's message id (you can use multiple command queues if the queue depth gets too high)
Commit the transaction
Processing:
Receive the notification message from "notification queue" (e.g. with message driven bean)
Receive and process all the related messages from "command queue" using a message selector
Commit the transaction
Using bytes (e.g. a BytesMessage) is likely the least memory-intensive option.
If you manipulate Java objects, you can use a fast and byte-efficient serialization/deserialization library like Kryo.
We happily use Kryo in production on a messaging system, but you have plenty of alternatives, such as the popular Google Protocol Buffers.
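If you want to see what the "bytes" route looks like without pulling in a library, here is a minimal hand-rolled sketch using only the JDK. The Change record layout (operation code, table name, payload string) is an assumption made for illustration, not something taken from the question. The resulting byte[] could then be written into a BytesMessage with writeBytes(byte[]).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical codec that packs a whole batch of changes into one byte array. */
class BatchCodec {

    /** One change record; the fields are illustrative. */
    static final class Change {
        final char op;        // 'I', 'U' or 'D'
        final String table;
        final String payload;

        Change(char op, String table, String payload) {
            this.op = op;
            this.table = table;
            this.payload = payload;
        }
    }

    /** Writes a record count followed by each record's fields. */
    static byte[] encode(List<Change> batch) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(batch.size());
            for (Change change : batch) {
                out.writeChar(change.op);
                // writeUTF is limited to 65535 encoded bytes per string.
                out.writeUTF(change.table);
                out.writeUTF(change.payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back exactly what encode() wrote. */
    static List<Change> decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            List<Change> batch = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                char op = in.readChar();
                String table = in.readUTF();
                String payload = in.readUTF();
                batch.add(new Change(op, table, payload));
            }
            return batch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A library such as Kryo or Protocol Buffers gives you the same kind of compact encoding plus schema handling, so this hand-written version mostly makes sense for very simple record shapes.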
We have a custom messaging system written in Java, and I want to implement a basic batching/compression feature that, under heavy load, aggregates a bunch of push responses into a single push response.
Essentially:
if we detect 3 messages were sent in the past second then start batching responses and schedule a timer to fire in 5 seconds
The timer will aggregate all the message responses received in the next 5 seconds into a single message
I'm sure this has been implemented before; I'm just looking for the best example of it in Java. I'm not looking for a full-blown messaging layer, just the basic "detect messages per second and schedule some task" logic. Obviously I can easily write this myself; I just want to compare it with existing algorithms to make sure I'm not missing any edge cases and that I've simplified the problem as much as possible.
Are there any good open source examples of basic QoS batching/throttling/compression implementations?
We are using a very similar mechanism for high load.
It works as you described it:
* Aggregate messages over a given interval
* Send a List instead of a single message after that.
* Start aggregating again.
You should watch out for the following pitfalls:
* If you are using a transacted messaging system like JMS, you can get into trouble because your implementation will not be able to send inside the JMS transaction, so it will keep aggregating. Depending on the size of the data structure that holds the messages, this can run out of space. If you have very long transactions that send many messages, this can pose a problem.
* Sending a message this way happens asynchronously, because a different thread sends the message and the thread calling the send() method only puts it into the data structure.
* Sticking to the JMS example, you should keep in mind that the way messages are consumed is also changed by this approach, because you will receive the list of messages from JMS as a single message. So once you commit this single JMS message, you have committed the entire list of messages. You should check whether this is a problem for your requirements.
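For what it's worth, the aggregate/send/restart cycle above can be sketched in a few lines of plain Java. This is only an assumption-laden illustration, not our production code: the class name and the sink callback are hypothetical, and it batches on a fixed interval only, without the "3 messages in the past second" trigger from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical interval batcher: collects messages and sends them as one list. */
class IntervalBatcher<T> implements AutoCloseable {

    private final Object lock = new Object();
    private final Consumer<List<T>> sink;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private List<T> pending = new ArrayList<>();

    IntervalBatcher(Consumer<List<T>> sink, long interval, TimeUnit unit) {
        this.sink = sink;
        // Caveat: if the sink throws, the executor cancels all further runs.
        timer.scheduleAtFixedRate(this::flush, interval, interval, unit);
    }

    /** Called by sender threads; only enqueues, so sending is asynchronous. */
    void add(T message) {
        synchronized (lock) {
            pending.add(message);
        }
    }

    /** Swaps out the pending list and hands it to the sink as one batch. */
    void flush() {
        List<T> batch;
        synchronized (lock) {
            if (pending.isEmpty()) {
                return;
            }
            batch = pending;
            pending = new ArrayList<>();
        }
        // Deliver outside the lock so add() is never blocked by a slow sink.
        sink.accept(batch);
    }

    /** Stops the timer and sends whatever is still pending. */
    @Override
    public void close() {
        timer.shutdown();
        flush();
    }
}
```

The pending list here is unbounded, which is exactly the "can run out of space" pitfall from the first bullet; a real implementation would cap its size and flush early or push back on senders.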