Spring Boot kafkaTemplate consumer message load and processing message - java

In my application I am using Spring Boot's KafkaTemplate and spring-kafka for consuming messages. I am new to Kafka with Spring Boot. I have added a consumer as below:
@KafkaListener(topics = "#{'${app.kafka.consumer.topic}'.split(',')}")
public void receivedMessage(ConsumerRecord<String, String> cr, @Payload String message) {
    log.info("Message received from topic {} ", cr.topic());
    // TODO
}
On one topic we will receive roughly 200K messages per second. Once a message is received, it is passed to another method which filters it based on certain criteria and then publishes the filtered message to another topic.
My question is: will the above @KafkaListener method handle this load, or do I need any special treatment such as threading or concurrency?

It depends entirely on your workload, the number of CPU cores, and so on. In general, increasing the concurrency will provide more throughput (as long as the topic has at least that many partitions).
But, your downstream code must be completely thread-safe.
Even then, there may be other bottlenecks in your code (DB etc).
If all you are doing is computation, transformation, and publishing to another topic, without doing any other I/O, increasing the concurrency should definitely help.
The only real solution is experimentation and, if you don't get the throughput you need, profile your application.
You can't learn such skills without actually doing it.
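For reference, here is a minimal sketch (not from the question; the class name and concurrency value are arbitrary, and the concurrency attribute assumes a reasonably recent Spring Kafka version) showing how listener concurrency can be raised:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class FilteringListener {

    // Spins up 6 listener containers (one consumer thread each); this only helps
    // if the topic has at least 6 partitions, and everything called from here
    // must be thread-safe.
    @KafkaListener(topics = "#{'${app.kafka.consumer.topic}'.split(',')}", concurrency = "6")
    public void receivedMessage(ConsumerRecord<String, String> cr) {
        // filter the record and publish the result to another topic (e.g. via KafkaTemplate)
    }
}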

Related

How to handle session timeout while processing Kafka messages?

I am processing messages from Kafka in a standard processing loop:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processMessage(record);
    }
}
What should I do if my Kafka consumer runs into a timeout while processing the records? I mean the timeout controlled by the session.timeout.ms property.
When this happens, my consumer should stop processing the records, because it would lose its partitions and the records that it processes could be already processed by another consumer. If the original consumer writes some processing results into a database, it could overwrite the records produced by the "new" consumer that got the partitions after my original consumer timed out.
I know about the ConsumerRebalanceListener, but from my understanding its method onPartitionsLost would only be called after I call the poll method from the consumer. Therefore this doesn't help me to stop the processing loop of the batch of records that I received from the previous poll.
I would expect that the heartbeat thread could notify me that it was not able to contact the broker and that we have a session timeout in the consumer, but there doesn't seem to be anything like that...
Am I missing something?
Adding this as an answer as it would be too long in a comment.
Kafka has a few delivery semantics that can be used to process messages:
At most once;
At least once; and
Exactly once.
You are describing that you would like to use Kafka with exactly-once semantics (which, by the way, is the least common way of using Kafka). Producers also need to play nicely, as by default Kafka can produce the same message more than once.
It's a lot more common to build services that use the at-least-once mechanism: you can receive (or process) the same message more than once, but you need a way to deduplicate messages (it's the same idea behind idempotency in HTTP APIs). You'll need something in the message that is unique, plus a record of which ids have already been processed. If the payload has nothing you can use to deduplicate, you can add a header to the message and use that.
This is also useful in the scenario that you have to reset the offset, so the service can go through old messages without breaking.
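To make the idea concrete, here is a minimal sketch of an idempotent consumer. The "messageId" header name, the DedupConsumer class and the in-memory processedIds set are all hypothetical; a real service would keep the processed ids in a durable store.

import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

public class DedupConsumer {
    // In-memory store purely for illustration; persist this in a real service.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    public void handle(ConsumerRecord<String, String> record) {
        Header idHeader = record.headers().lastHeader("messageId"); // hypothetical header name
        String id = (idHeader != null) ? new String(idHeader.value(), StandardCharsets.UTF_8) : null;
        if (id == null || processedIds.add(id)) {
            processMessage(record); // first time we see this id: process it
        }
        // otherwise: duplicate delivery, skip it
    }

    private void processMessage(ConsumerRecord<String, String> record) {
        // application-specific processing
    }
}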
I would suggest googling a bit for details on how to implement the above.
Here's a blog post from Confluent about developing exactly-once semantics, Improved Robustness and Usability of Exactly-Once Semantics in Apache Kafka, and the Kafka docs explaining the different semantics.
About the point of the ConsumerRebalanceListener, you don't need to do anything if you follow the solution of using idempotency in the consumer. Rebalances also happen when an app crashes, and in that scenario the service might have processed some records, but not committed them yet to Kafka.
A mini tip I give to everyone who is starting with Kafka: Kafka looks simple from the outside, but it's a complex technology. Don't use it in production until you know the nitty-gritty details of how it works and have done a good amount of negative testing (unless you are OK with losing data).

Apache Camel - Kafka component - single producer multiple consumer

I am creating two Apache Camel (Blueprint XML) Kafka projects: one is kafka-producer, which accepts requests and stores them in the Kafka server, and the other is kafka-consumer, which picks up messages from the Kafka server and processes them.
This setup works fine for a single topic and a single consumer. However, how do I create separate consumer groups within the same Kafka topic? And how do I route consumer-specific messages within the same topic to different consumer groups? Any help is appreciated. Thank you.
Your question is quite general, as it's not very clear what problem you are trying to solve; therefore, it's hard to tell whether there's a better way to implement the solution.
Anyway, let's start by saying that, as far as I can understand, you are looking for a Selective Consumer (EIP), which is something not supported out of the box by Kafka and its Consumer API. A Selective Consumer can choose which messages to pick from the queue or topic based on specific selector values set in advance by a producer. This feature must be implemented in the message broker as well, but Kafka has no such capability.
Kafka implements a hybrid between pure pub/sub and a queue. That being said, what you can do is subscribe to the topic with one or more consumer groups (more on that later) and filter out all messages you're not interested in by inspecting the messages themselves. In the messaging and EIP world, this pattern is known as an Array of Filters. As you can imagine, this happens after the message has been broadcast to all subscribers; therefore, if that solution does not fit your requirements or context, you can think of implementing a Content-Based Router, which dispatches the message to a subset of consumers only, under your centralized control (this would imply intermediate consumer-specific channels, which could be other Kafka topics or seda/VM queues, of course).
Moving to the second question, here is the official Kafka Component website: https://camel.apache.org/components/latest/kafka-component.html.
In order to create different consumer groups, you just have to define multiple routes, each of them having a dedicated groupId. By adding the groupId property, you inform the consumer group coordinators (which reside in the Kafka brokers) about the existence of multiple separate groups of consumers, and the brokers will use that to treat them separately (by sending each group a copy of every message stored in the topic)...
Here is an example:
public void configure() throws Exception {
    from("kafka:myTopic?brokers={{kafkaBootstrapServers}}" +
            "&groupId=myFirstConsumerGroup")
        .log("Message received by myFirstConsumerGroup : ${body}");

    from("kafka:myTopic?brokers={{kafkaBootstrapServers}}" +
            "&groupId=mySecondConsumerGroup")
        .log("Message received by mySecondConsumerGroup : ${body}");
}
As you can see, I created two routes in the same RouteBuilder, not to mention in the same Java process. That's a poor design decision in most of the use cases I can think of, because there is no single responsibility, no segregation of concerns, and it will not scale. But again, it depends on your requirements/context.
For completeness, please consider taking a look at all the other Kafka component properties, as there may be other configurations of interest, such as the number of consumer threads per group.
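For example, a hedged sketch in the same style as the routes above (this goes inside a RouteBuilder.configure(); consumersCount is my recollection of the camel-kafka option that controls how many consumers the endpoint creates, so please verify it against the component documentation for your Camel version):

// Inside RouteBuilder.configure(), same endpoint as before, with more consumers:
from("kafka:myTopic?brokers={{kafkaBootstrapServers}}"
        + "&groupId=myFirstConsumerGroup"
        + "&consumersCount=3")   // assumed option name; check the camel-kafka docs
    .log("Message received by myFirstConsumerGroup : ${body}");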
I tried to stay high level, in order to initiate the discussion... I'll edit my answer in case of new updates from you. Hope I helped!

Listening to multiple topics on single consumer

I have recently started working on Apache Kafka. One thing which I keep on seeing on various blogs is that multiple topics are configured to the same listener.
My question is: is it a good practice to do so?
Let's say we are receiving 100 messages per second on each topic. Messages from each topic need different customization, and messages from individual topics go to their respective tables. Example: messages from topic1 go to the topic_1 table.
It's a Spring Boot app that I'm working on. Also I would like to know what other challenges I might face going forward.
Update: Code sample
@KafkaListener(topics = "#{'${kafka-consumer.topics}'.split(',')}", groupId = "${kafka-consumer.groupId}")
public void consume(KafkaConsumer<String, String> record) {
    int count = 0;
    ConsumerRecords<String, String> records = record.poll(1000);
    for (ConsumerRecord<String, String> data : records) {
        System.out.println(data.value());
        count++;
    }
    // record.listTopics()
    if (count > 0) {
        record.commitAsync();
    }
}
My question is: is it a good practice to do so?
It depends on the use case. In your example, where a topic correlates to a table, you should probably have a consumer per topic, because if your consumer is consuming from many unrelated topics then consumption will slow down. Consuming is less efficient than producing, so the most common approach is to split your topic into multiple partitions and have multiple consumers per topic.
It would make sense to consume from multiple topics if the topics were related. There is a use case that Confluent has written a white paper about where they replicate data between data centers and the topics are prefixed with a data center ID. The consumers then consume from all topics with matching names but different data center IDs.
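To illustrate the consumer-per-topic suggestion, here is a minimal Spring Kafka sketch (not from the question; the class name is arbitrary, and the topic/table names simply reuse the question's example):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class PerTopicListeners {

    @KafkaListener(topics = "topic1", groupId = "${kafka-consumer.groupId}")
    public void consumeTopic1(ConsumerRecord<String, String> record) {
        // topic1-specific customization, then insert into the topic_1 table
    }

    @KafkaListener(topics = "topic2", groupId = "${kafka-consumer.groupId}")
    public void consumeTopic2(ConsumerRecord<String, String> record) {
        // topic2-specific customization, then insert into the topic_2 table
    }
}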
It's not a good practice at all, because it dramatically decreases the consumption rate, for obvious reasons...
But in some cases you'll have to use it: if you have many producers that can spawn dynamically, and you'd like to keep consuming data from every one of them while still having the ability to send data to a specific device.
For example: a lot of sensors, where each one sends to its own topic named with its id, like outgoing/12445646.
The consumer of the data from all those sensors will listen to the outgoing/* topics, but can still send a message directly to a specific sensor on a channel like incoming/12445646.
A separate outgoing channel can also be very handy for traffic control, where one can spawn dedicated consumers for high-throughput channels and similar scenarios, or deal with a specific device without affecting the rest.
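For illustration, a minimal sketch of consuming from all sensor topics via a pattern subscription (dot-separated names are used here because Kafka topic names only allow alphanumerics, '.', '_' and '-'; the pattern and loop are illustrative, not from the answer above):

import java.time.Duration;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorFanInConsumer {
    // consumer is assumed to be an already-configured KafkaConsumer<String, String>
    public void listenToAllSensors(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Pattern.compile("outgoing\\..*")); // matches outgoing.12445646, outgoing.12445647, ...
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                // record.topic() identifies which sensor this message came from
            }
        }
    }
}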

Lagom PubSubRef subscriber drops messages

[Attention] The question is Lagom framework specific!
In my current project we have observed that messages are dropped between the Source and the Kafka topic publisher when the upstream is fast and the downstream cannot handle all messages in time. As far as I can tell, the dropping is related to the behavior of the PubSubRef.subscribe() method: https://github.com/lagom/lagom/blob/master/pubsub/javadsl/src/main/scala/com/lightbend/lagom/javadsl/pubsub/PubSubRef.scala#L85
The full method definition:
def subscriber(): Source[T, NotUsed] = {
  scaladsl.Source.actorRef[T](bufferSize, OverflowStrategy.dropHead)
    .mapMaterializedValue { ref =>
      mediator ! Subscribe(topic.name, ref)
      NotUsed
    }.asJava
}
OverflowStrategy.dropHead is used here. Can it be changed to use a back-pressure strategy?
UPD#1:
The use case is pretty simple: when a query request is published to the command topic, we receive it and query objects from a DB table, and the resulting list is pushed into the result Kafka topic. Code snippet:
objectsResultTopic = pubSub.refFor(TopicId.of(CustomObject.class, OBJECTS_RESULT_TOPIC));
objectQueryTopic().subscribe().atLeastOnce(
    Flow.fromSinkAndSource(
        Flow.fromFunction(this::deserializeCommandAndQueryObjects)
            .mapAsync(concurrency, objects -> objects)
            .flatMapMerge(concurrency, objects -> objects)
            .alsoTo(Sink.foreach(object -> LOG.trace("Sending object {}", object)))
            .to(objectsResultTopic.publisher()),
        Source.repeat(Done.getInstance())
    )
);
When the stream of objects generated by the deserializeCommandAndQueryObjects function contains more than the default buffer-size = 1000 elements, it starts dropping elements (in our case ~2.5k objects).
UPD#2:
The source of objects data is:
// returns CompletionStage<Source<CustomObject, ?>>
jdbcSession.withConnection(
    connection -> Source.from(runQuery(connection, rowConverter))
)
And there's a subscription to Kafka objectsResultTopic:
TopicProducer.singleStreamWithOffset(
    offset -> objectsResultTopic.subscriber().map(gm -> {
        JsonNode node = mapper.convertValue(gm, JsonNode.class);
        return Pair.create(node, offset);
    }));
It sounds like Lagom's distributed publish-subscribe feature may not be the best tool for the job you have.
Your question mentions Kafka, but this feature does not make use of Kafka. Instead, it works by directly broadcasting messages to all subscribers in the cluster. This is an "at most once" messaging transport that may indeed lose messages, and is intended for consumers who care more about keeping up with recent messages than processing every single one. The overflow strategy is not customizable, and you would not want to use back-pressure in these use cases, as it would mean that one slow consumer could slow down delivery to all of the other subscribers.
There are a few other options that you have:
If you do want to use Kafka, you should use Lagom's message broker API. This supports "at least once" delivery semantics, and can be used to ensure that each consumer processes every message (at the cost of possibly increasing latency).
In this case, Kafka acts as a giant durable buffer, so it's even better than back-pressure: the producer and consumer can proceed at different paces, and (when used with partitioning) you can add consumers in order to scale out and process messages more quickly when needed.
The message broker API can be used when producers and consumers are all in the same service, but it is particularly suitable for communication between services.
If the messages you are sending are persistent entity events, and the consumers are part of the same service, then a persistent read-side processor might be a good option.
This also provides "at least once" delivery, and if the only effects of processing messages are database updates, then the built-in support for Cassandra read-side databases and relational read-side databases provide "effectively once" semantics, where the database updates are run transactionally to ensure that failures that occur during event processing cannot result in partial updates.
If the messages you are sending are persistent entity events, the consumers are part of the same service, but you want to process the events as a stream, you can access a raw stream of events.
If your use case does not fit into one of the use cases that Lagom supports explicitly, you can use lower-level Akka APIs, including distributed publish-subscribe, to implement something more tailored to your needs.
The best choice will depend on the specifics of your use case: the source of the messages and the types of consumers you want. If you update your question with more details and add a comment to this answer, I can edit the answer with more specific suggestions.
In case someone is interested, we finally solved the problem by using the Akka Streams Kafka Producer API, like this:
ProducerSettings<String, CustomObject> producerSettings =
    ProducerSettings.create(system, new StringSerializer(), new CustomObjectSerializer());
objectQueryTopic().subscribe().atLeastOnce(
    Flow.fromSinkAndSource(
        Flow.fromFunction(this::deserializeCommandAndQueryObjects)
            .mapAsync(concurrency, objects -> objects)
            .flatMapMerge(concurrency, objects -> objects)
            .alsoTo(Sink.foreach(object -> LOG.trace("Sending event {}", object)))
            .map(object -> new ProducerRecord<String, CustomObject>(OBJECTS_RESULT_TOPIC, object))
            .to(Producer.plainSink(producerSettings)),
        Source.repeat(Done.getInstance())));
It works without buffering; it just pushes the messages straight into the Kafka topic.

JMS (ActiveMQ) Performance

I have a Java application with a number of components communicating via JMS (ActiveMQ). Currently the application and the JMS hub are on the same server, although we eventually plan to split out the components for scalability. We are having significant performance issues, all seemingly around JMS; most notably, and the focus of this question, the amount of time it is taking to publish a message to a topic.
We have around 50 dynamically created topics used for communication between the components of the application. One component reads records from a table and processes them one at a time, the processing involves creating a JMS Object message and publishing it to one of the topics. This processing could not keep up with the rate at which records were being written to the source table ~23/sec, so we changed the processing to create the JMS Object message and add it to a queue. A new thread was created which read from this queue and published the message to the appropriate topic. Obviously this does not speed the processing up but it did allow us to see how far behind we were getting by looking at the size of the queue.
At the start of the day no messages are going through the system; this quickly ramps up from 1,560,000 messages (433/sec) through the hub in the first hour to 2,100,000 (582/sec) in the third hour, and then stays at that level. At the start of the first hour the message publishing from the component reading records from the database table keeps up; however, by the end of that hour there are 2,000 messages in the queue waiting to be sent, and by the third hour the queue has 9,000 messages in it.
Below are the appropriate sections of the code which send the JMS messages; any advice on what we are doing wrong or how we can improve this performance is much appreciated. Looking at stats on the web, JMS should easily handle ~1000-2000 large messages/sec or ~10000 small messages/sec. Our messages are around 500 bytes each, so I imagine they sit somewhere in the middle of that scale.
Code for getting the publisher:
private JmsSessionPublisher getJmsSessionPublisher(String topicName) throws JMSException {
    if (!this.topicPublishers.containsKey(topicName)) {
        TopicSession pubSession = (ActiveMQTopicSession) topicConnection.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
        ActiveMQTopic topic = getTopic(topicName, pubSession);
        // Create a JMS publisher and subscriber
        TopicPublisher publisher = pubSession.createPublisher(topic);
        this.topicPublishers.put(topicName, new JmsSessionPublisher(pubSession, publisher));
    }
    return this.topicPublishers.get(topicName);
}
Sending the message:
JmsSessionPublisher jmsSessionPublisher = getJmsSessionPublisher(topicName);
ObjectMessage objMessage = jmsSessionPublisher.getSession().createObjectMessage(messageObj);
objMessage.setJMSCorrelationID(correlationID);
objMessage.setJMSTimestamp(System.currentTimeMillis());
jmsSessionPublisher.getPublisher().publish(objMessage, DeliveryMode.NON_PERSISTENT, 4, 0);
Code which adds messages to the queue:
List<EventQueue> events = eventQueueDao.getNonProcessedEvents();
for (EventQueue eventRow : events) {
    IEvent event = eventRow.getEvent();
    AbstractEventFactory.EventType eventType = AbstractEventFactory.EventType.valueOf(event.getEventType());
    String topic = event.getTopicName() + topicSuffix;
    EventMsgPayload eventMsg = AbstractEventFactory.getFactory(eventType).getEventMsgPayload(event);
    synchronized (queue) {
        queue.add(new QueueElement(eventRow.getEventId(), topic, eventMsg));
        queue.notify();
    }
}
Code in the thread removing items from the queue:
jmsSessionFactory.publishMessageToTopic(e.getTopic(), e.getEventMsg(), Integer.toString(e.getEventMsg().hashCode()));
publishMessageToTopic executes the 'Sending the message' code above.
Other JMS implementations are an option if the consensus is that ActiveMQ may not be the best option.
Thank you,
James
We do not use ActiveMQ, but we ran into similar issues and discovered that they were with the back-end processing, not with the Java side. There could be multiple issues here:
The program processing the messages from the queue could be slow (e.g. CICS on the mainframe) and might not be able to keep up with the messages being sent to the queue. One possible solution is to increase the processing power (or optimize the back-end code that processes the messages).
Check the messages on the queue; sometimes there are lots of uncommitted poison messages on the queue. We use a separate queue for such messages.
It would be nice to know the answers to the questions asked by Karianna.
It's not 100% clear where you are experiencing the slow performance, but it sounds like what you are describing is slowness in publishing the messages. Are you creating a new publisher every time you publish a message? If so, this is terribly inefficient and you should consider creating one publisher and using it over and over to send messages. Furthermore, if you are sending persistent messages, then you are probably using synchronous sends to the broker. You might want to consider using asynchronous sends to speed things up (a small sketch follows this answer). For more info, see the doc about Async Sends.
Also, how is the performance of the consumers? How many consumers are being used? Are they able to keep pace with the rate at which messages are being published?
Additionally, what is the broker configuration that you are using? Has it been tuned at all?
Bruce
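As a minimal illustration of the async-send suggestion above (the broker URL and class name are just placeholders, not from the question):

import javax.jms.Connection;
import javax.jms.JMSException;
import org.apache.activemq.ActiveMQConnectionFactory;

public class AsyncSendExample {
    public static Connection createConnection() throws JMSException {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker URL
        factory.setUseAsyncSend(true); // don't block the producer waiting for broker acknowledgements
        return factory.createConnection();
    }
}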
Although this is an old question, there is one very important piece of advice missing:
Investigate the number of topics and queues that you have.
ActiveMQ keeps topic subscriptions in separate threads. Particularly when you have a large number of different topics, this will drag down any server. Think about using JMS selectors instead.
I ran into a similar situation where I had thousands of market data messages per second. When I naively dumped each message into an instrument-specific channel, the server lasted about an hour before it started spitting out error messages to the message producers. I changed the design to have ONE channel, "MARKET_DATA"; I then set header properties on all produced messages and set a selector on the consumer side to select just the messages that I want. Note that my selector is in SQL-like syntax and runs on the server, though... (yeah, let's skip the CEP marketing hype bashing)...
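To make the single-channel-plus-selector approach concrete, here is a minimal sketch (the INSTRUMENT_ID property name, instrument ids, and class name are hypothetical; "MARKET_DATA" reuses the name from the answer above):

import javax.jms.*;

public class MarketDataSelectorExample {

    // Producer side: stamp each message with an instrument id as a header property.
    static void publish(TopicSession session, TopicPublisher publisher, java.io.Serializable payload)
            throws JMSException {
        ObjectMessage msg = session.createObjectMessage(payload);
        msg.setStringProperty("INSTRUMENT_ID", "12445646"); // hypothetical property name and value
        publisher.publish(msg);
    }

    // Consumer side: the broker filters server-side using a SQL-92-like selector.
    static MessageConsumer subscribe(Session session) throws JMSException {
        return session.createConsumer(
                session.createTopic("MARKET_DATA"),
                "INSTRUMENT_ID IN ('12445646', '12445647')");
    }
}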
