Kafka Streams application producing topic with same message

I am facing an issue with my Kafka streams application, where messages are being processed multiple times and the result topic is constantly receiving messages. This issue is only present in production and not in my local environment. Can you help me determine the root cause of this problem, based on the transformer code?
    @Override
    public KeyValue<String, UserClicks> transform(final String user, final Long clicks) {
        UserClicks userClicks = tempStore.get(user);
        if (userClicks != null) {
            userClicks.clicks += clicks;
        } else {
            final String region = regionStore.get(user).value();
            userClicks = new UserClicks(user, region, clicks);
        }
        if (userClicks.clicks < CLICKS_THRESHOLD) {
            tempStore.put(user, userClicks);
        } else {
            tempStore.delete(user);
        }
        return KeyValue.pair(user, userClicks);
    }
When I remove the KStore from the transformer, everything seems to work fine.

Usually this problem occurs because Kafka Streams can't save its state, so it keeps re-reading the same batch of messages. The KStore persists its state to a changelog topic by producing messages to it. If the producer can't produce for some reason, the new offset can never be committed.
To resolve the issue, either lower the minimum number of in-sync replicas to 1 or set the replication factor to 2. By default, Kafka Streams has historically created its internal topics with a replication factor of 1.
An easy way to configure this is through Conduktor: go to the topic config and change the min.insync.replicas property.
It can also be done through the Kafka CLI by running this command:
    kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name configured-topic --add-config min.insync.replicas=1
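For the replication-factor route, the setting belongs to the Streams application's own configuration rather than to the topic. Below is a minimal sketch, assuming plain string keys instead of the `StreamsConfig` constants so that it compiles without the Kafka jars; the class name, application id, and broker address are placeholders, not values from the question.

```java
import java.util.Properties;

final class StreamsReplicationConfig {

    // Builds the properties that would be passed to `new KafkaStreams(topology, props)`.
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("application.id", "user-clicks-app"); // placeholder name
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Replication factor for the internal changelog and repartition topics
        // that Kafka Streams creates. It has to be at least as large as the
        // min.insync.replicas in effect, otherwise writes with acks=all are rejected.
        props.setProperty("replication.factor", "2");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092");
        System.out.println("replication.factor=" + props.getProperty("replication.factor"));
    }
}
```

As far as I know, this only affects internal topics that Streams creates afterwards; a changelog topic that already exists keeps its current replication factor and would have to be altered or recreated separately.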

Related

how to consume a kafka topic from a specific offset?

I have recently started using Kafka.
I have a topic and I am using the following code to consume it:
    @KafkaListener(topics = "topic_name", groupId = "_id", id = "pro", containerFactory = "kafkaListenerContainerFactory")
    public void consume(ConsumerRecord<String, String> record, Acknowledgment ack) {
        kafkaService.proccessorConsumer(record);
        ack.acknowledge();
    }
Everything works fine, but I need to handle the case where the service stops for any reason and is then restarted: I want to continue consuming from the last message that was processed. I understand that the acknowledgment helps with this, but to be certain I also save the last consumed offset somewhere.
My question is: how can I use that stored offset to start consuming the topic from it?

As @OneCricketeer indicates, what you are trying to achieve is the default behaviour of the Kafka consumer, if you haven't disabled automatic commit.
You can check this by describing your consumer group by its group id, as follows; just check that the committed offset of your consumer is the same as the one you have stored elsewhere.
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group-id

flink tumbling window is not triggered (no watermark strategy)

Problem statement: stream events from kafka source. These event payloads are of string format. Parse them into Documents and batch insert them into DB every 5 seconds based on event time.
map() functions are getting executed. But program control is not going into apply(). Hence bulk insert is not happening. I tried with keyed and non-keyed windows. None of them are working. No error is being thrown.
flink version: 1.15.0
Below is the code for my main method. How should I fix this?
    public static void main(String[] args) throws Exception {
        final Logger logger = LoggerFactory.getLogger(Main.class);
        final StreamExecutionEnvironment streamExecutionEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamExecutionEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        KafkaConfig kafkaConfig = Utils.getAppConfig().getKafkaConfig();
        logger.info("main() Loading kafka config :: {}", kafkaConfig);
        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
                .setBootstrapServers(kafkaConfig.getBootstrapServers())
                .setTopics(kafkaConfig.getTopics())
                .setGroupId(kafkaConfig.getConsumerGroupId())
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema()).build();
        logger.info("main() Configured kafka source :: {}", kafkaSource);
        DataStreamSource<String> dataStream = streamExecutionEnv.fromSource(kafkaSource,
                WatermarkStrategy.noWatermarks(), "mySource");
        logger.info("main() Configured kafka dataStream :: {}", dataStream);
        DataStream<Document> dataStream1 = dataStream.map(new DocumentMapperFunction());
        DataStream<InsertOneModel<Document>> dataStream2 = dataStream1.map(new InsertOneModelMapperFunction());
        DataStream<Object> dataStream3 = dataStream2
                .windowAll(TumblingEventTimeWindows.of(Time.seconds(5), Time.seconds(0)))
                /*.keyBy(insertOneModel -> insertOneModel.getDocument().get("ackSubSystem"))
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))*/
                .apply(new BulkInsertToDB2())
                .setParallelism(1);
        logger.info("main() before streamExecutionEnv execution");
        dataStream3.print();
        streamExecutionEnv.execute();
    }

Use TumblingProcessingTimeWindows instead of event-time windows.
As David has mentioned, TumblingEventTimeWindows requires a watermark strategy.

Event time windows require a watermark strategy. Without one, the windows are never triggered.
Furthermore, even with forMonotonousTimestamps, a given window will not be triggered until Flink has processed at least one event belonging to the following window from every Kafka partition. (If there are idle (or empty) Kafka partitions, you should use withIdleness to withdraw those partitions from the overall watermark calculations.)
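The triggering rule described above can be made concrete with a small simulation. This is a framework-free sketch of the arithmetic only, not Flink code: the class and method names are made up for illustration, and it assumes the ascending-timestamp behaviour where a partition's watermark trails its highest seen timestamp by 1 ms and the overall watermark is the minimum across partitions.

```java
final class WatermarkSketch {

    // Marks a partition that has not delivered any event yet.
    static final long NO_EVENTS = Long.MIN_VALUE;

    // Overall watermark = minimum of the per-partition watermarks.
    static long watermark(long[] maxTimestampPerPartition) {
        long min = Long.MAX_VALUE;
        for (long ts : maxTimestampPerPartition) {
            if (ts == NO_EVENTS) {
                return NO_EVENTS; // an empty partition holds everything back
            }
            min = Math.min(min, ts - 1);
        }
        return min;
    }

    // An event-time window [start, end) fires once the watermark reaches end - 1.
    static boolean windowFires(long windowEndMillis, long watermark) {
        return watermark >= windowEndMillis - 1;
    }

    public static void main(String[] args) {
        long windowEnd = 5000; // the window [0, 5000)

        // One partition is still inside the first window: no trigger.
        System.out.println(windowFires(windowEnd, watermark(new long[] {5000, 4900}))); // false

        // Every partition has delivered an event from the next window: trigger.
        System.out.println(windowFires(windowEnd, watermark(new long[] {5000, 5001}))); // true

        // One partition is empty: the watermark never advances.
        System.out.println(windowFires(windowEnd, watermark(new long[] {5000, NO_EVENTS}))); // false
    }
}
```

In this model, marking a partition as idle corresponds to leaving it out of the array passed to `watermark`, which is why withIdleness unblocks the windows when some partitions receive no data.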

Batch consumer camel kafka

I am unable to read in batch with the kafka camel consumer, despite following an example posted here. Are there changes I need to make to my producer, or is the problem most likely with my consumer configuration?
The application in question utilizes the kafka camel component to ingest messages from a rest endpoint, validate them, and place them on a topic. I then have a separate service that consumes them from the topic and persists them in a time-series database.
The messages were being produced and consumed one at a time, but the database expects the messages to be consumed and committed in batch for optimal performance. Without touching the producer, I tried adjusting the consumer to match the example in the answer to this question:
How to transactionally poll Kafka from Camel?
I wasn't sure how the messages would appear, so for now I'm just logging them:
    from(kafkaReadingConsumerEndpoint).routeId("rawReadingsConsumer").process(exchange -> {
        // simple approach to generating errors
        String body = exchange.getIn().getBody(String.class);
        if (body.startsWith("error")) {
            throw new RuntimeException("can't handle the message");
        }
        log.info("BODY:{}", body);
    }).process(kafkaOffsetManager);
But the messages still appear to be coming across one at a time with no batch read.
My consumer config is this:
    kafka:
      host: myhost
      port: myport
      consumer:
        seekTo: beginning
        maxPartitionFetchBytes: 55000
        maxPollRecords: 50
        consumerCount: 1
        autoOffsetReset: earliest
        autoCommitEnable: false
        allowManualCommit: true
        breakOnFirstError: true
Does my config need work, or are there changes I need to make to the producer to have this work correctly?

At the lowest layer, the KafkaConsumer#poll method returns a ConsumerRecords object, which is an Iterable<ConsumerRecord>; there's no way around that.
I don't have in-depth experience with Camel, but in order to get a "batch" of records, you'll need some intermediate collection to "queue" the data that you want to eventually send downstream to some "collection consumer" process. Then you will need some "switch" processor that says "wait, process this batch" or "continue filling this batch".
As far as databases go, that process is exactly what Kafka Connect JDBC Sink does with batch.size config.
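The "intermediate collection plus switch" idea can be sketched without any framework. This is only an illustration of the pattern, not Camel or Kafka Connect code; `BatchBuffer`, its size limit, and the downstream callback are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchBuffer<T> {
    private final int maxSize;
    private final Consumer<List<T>> downstream;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int maxSize, Consumer<List<T>> downstream) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be >= 1");
        }
        this.maxSize = maxSize;
        this.downstream = downstream;
    }

    // The "switch": keep filling until the batch is full, then hand it over.
    void add(T record) {
        pending.add(record);
        if (pending.size() >= maxSize) {
            flush();
        }
    }

    // Also call this on a timer or at shutdown so a partial batch is not stranded.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        downstream.accept(new ArrayList<>(pending)); // copy, so the receiver owns the batch
        pending.clear();
    }

    public static void main(String[] args) {
        BatchBuffer<String> buffer =
                new BatchBuffer<>(3, batch -> System.out.println("insert " + batch));
        for (String record : List.of("a", "b", "c", "d")) {
            buffer.add(record);
        }
        buffer.flush(); // prints "insert [a, b, c]" from add(), then "insert [d]" here
    }
}
```

The sketch is not thread-safe and has no time-based completion; offsets should only be committed after the downstream call for the batch has succeeded, otherwise a crash can lose buffered records.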

We solved a similar requirement by using the Aggregation [1] capability provided by Camel.
A rough code snippet:
    @Override
    public void configure() throws Exception {
        // 1. Define your Aggregation Strat
        AggregationStrategy agg = AggregationStrategies.flexible(String.class)
                .accumulateInCollection(ArrayList.class)
                .pick(body());
        from("kafka:your-topic?and-other-params")
            // 2. Define your Aggregation Strat Params
            .aggregate(constant(true), agg)
                .completionInterval(1000)
                .completionSize(100)
                .parallelProcessing(true)
            // 3. Generate bulk insert statement
            .process(exchange -> {
                List<String> body = (List<String>) exchange.getIn().getBody();
                String query = generateBulkInsertQueryStatement("target-table", body);
                exchange.getMessage().setBody(query);
            })
            .to("jdbc:dataSource");
    }
There are a variety of strategies that you can implement, but we chose this particular one because it allows you to create a List of strings for the message contents that we need to ingest into the db. [2]
We set a variety of params, such as completionInterval and completionSize. The most important one for us was parallelProcessing(true) [3]; without it, we came nowhere near the required throughput.
Once the aggregation has either collected 100 messages or 1000 ms has passed, then the processor generates a bulk insert statement, which then gets sent to the db.
[1] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html
[2] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html#_aggregating_into_a_list
[3] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html#_worker_pools

Rabbit MQ doesn't flush acks?

The problem appeared in logs: Consumer failed to start in 60000 milliseconds; does the task executor have enough threads to support the container concurrency?
We try to open handlers for about 50 queues dynamically via SimpleMessageListenerContainer.addQueueNames(), then the application is started. It consumes some messages, but the RabbitMQ admin panel shows them as unacked. After a significant amount of time, the unacked messages stack up to 6 per queue (the queues see a fairly low number of messages per minute), which adds up to 300 messages in total; then something happens and they are all consumed and acked. While messages are unacked, the container seems to keep trying to start another consumer until it hits the limit.
We rely on AUTO acknowledgment mode now; when it was MANUAL, it was fine.
There are two questions:
What can be the reason for unacked messages? Is there any flushing mechanism that doesn't trigger often?
What do I do with "not enough threads" message?
Those two seem to be closely related.
Here's the setup:
    @Bean
    fun queueMessageListenerContainer(
        connectionFactory: ConnectionFactory,
        retryOperationsInterceptor: RetryOperationsInterceptor,
        vehicleQueueListenerFactory: QueueListenerFactory,
    ): SimpleMessageListenerContainer {
        return SimpleMessageListenerContainer().also {
            it.connectionFactory = connectionFactory
            it.setConsumerTagStrategy { queueName -> consumerTag(queueName) }
            it.setMessageListener(vehicleQueueListenerFactory.create())
            it.setConcurrentConsumers(2)
            it.setMaxConcurrentConsumers(5)
            it.setListenerId("queue-consumer")
            it.setAdviceChain(retryOperationsInterceptor)
            it.setRecoveryInterval(RABBIT_HEARTH_BEAT.toMillis())
            // had 10-100 threads, didn't help
            it.setTaskExecutor(rabbitConsumersExecutorService)
            // AUTO is supposed to ack the messages, right?
            it.acknowledgeMode = AcknowledgeMode.AUTO
        }
    }

    @Bean
    fun connectionFactory(rabbitProperties: RabbitProperties): AbstractConnectionFactory {
        val rabbitConnectionFactory = com.rabbitmq.client.ConnectionFactory().also { connectionFactory ->
            connectionFactory.isAutomaticRecoveryEnabled = true
            connectionFactory.isTopologyRecoveryEnabled = true
            connectionFactory.networkRecoveryInterval = RABBIT_HEARTH_BEAT.toMillis()
            connectionFactory.requestedHeartbeat = RABBIT_HEARTH_BEAT.toSeconds().toInt()
            // was up to 100 connections, didn't help
            connectionFactory.setSharedExecutor(rabbitConnectionExecutorService)
            connectionFactory.host = rabbitProperties.host
            connectionFactory.port = rabbitProperties.port ?: connectionFactory.port
        }
        return CachingConnectionFactory(rabbitConnectionFactory)
            .also {
                it.cacheMode = rabbitProperties.cache.connection.mode
                it.connectionCacheSize = rabbitProperties.cache.connection.size
                it.setConnectionNameStrategy { "simulation-gateway:${springProfiles.firstOrNull()}:event-consumer" }
            }
    }

    class QueueListenerFactory {
        fun create(): MessageListener {
            return MessageListener {
                try {
                    // no ack, rely on AUTO acknowledgement mode
                    handle()
                } catch (e: Throwable) {
                    ...
                }
            }
        }
    }

Okay, I figured out what the problem was. Basically, it couldn't start all of the queue consumers in time: not only is this a slow process for SimpleMessageListenerContainer, but we also called addQueueNames for the queues one by one.
    userRepository.findAll()
        .map { user -> queueName(user) }
        .onEach { queueName ->
            simpleContainerListener.addQueueNames(queueName)
        }
But the following line of the SimpleMessageListenerContainer documentation went unnoticed:
The existing consumers will be cancelled after they have processed any pre-fetched messages and new consumers will be created
This means that what actually happened was the recreation of (1, 2, ... N) consumers. To make it even worse, when a request came in through the API, we made exactly the same simpleContainerListener.addQueueNames(queueName) call after handling the request, which recreated all of the consumers yet again.
The recreation of the consumers was also the reason why AUTO acknowledgement didn't work: threads were hanging while trying to build enough consumers before the timeout.
I fixed this by adding a DirectMessageListenerContainer to handle recently added queues, which is blazing fast compared to SimpleMessageListenerContainer for the particular case of adding just one new consumer.
    DirectMessageListenerContainer(connectionFactory).also {
        it.setConsumerTagStrategy { queueName -> consumerTag(queueName, RECENT_CONSUMER_TAG) }
        it.setMessageListener(ListenerFactory.create())
        it.setListenerId("queue-consumer-recent")
        it.setAdviceChain(retryOperationsInterceptor)
        it.setTaskExecutor(recentQueuesTaskExecutor)
        it.acknowledgeMode = AcknowledgeMode.AUTO
    }
The downside is that DirectMessageListenerContainer uses 1 thread per queue on every instance. This is exactly why I didn't want to use it in the first place, but using DirectMessageListenerContainer for recent queues and SimpleMessageListenerContainer for already existing ones significantly reduces the number of threads required to handle those queues. As far as I understand, overusing DirectMessageListenerContainer will eventually lead to an OOM, so the next step could be to move queues from the direct container to the simple container in batches.

Message transfer in between two topics in google cloud pub sub

We have a use case where, on any action from the UI, we need to read messages from Google Pub/Sub Topic A synchronously and move those messages to Topic B.
Below is the code written to handle this behavior; it is based on the Google Pub/Sub docs for pulling messages synchronously.
    public static int subscribeSync(String projectId, String subscriptionId, Integer numOfMessages, int count, String acknowledgementTopic) throws IOException {
        SubscriberStubSettings subscriberStubSettings =
            SubscriberStubSettings.newBuilder()
                .setTransportChannelProvider(
                    SubscriberStubSettings.defaultGrpcTransportProviderBuilder()
                        .setMaxInboundMessageSize(20 * 1024 * 1024) // 20MB (maximum message size).
                        .build())
                .build();
        try (SubscriberStub subscriber = GrpcSubscriberStub.create(subscriberStubSettings)) {
            String subscriptionName = ProjectSubscriptionName.format(projectId, subscriptionId);
            PullRequest pullRequest =
                PullRequest.newBuilder()
                    .setMaxMessages(numOfMessages)
                    .setSubscription(subscriptionName)
                    .build();
            // Use pullCallable().futureCall to asynchronously perform this operation.
            PullResponse pullResponse = subscriber.pullCallable().call(pullRequest);
            List<String> ackIds = new ArrayList<>();
            for (ReceivedMessage message : pullResponse.getReceivedMessagesList()) {
                // START - CODE TO PUBLISH MESSAGE TO TOPIC B
                publishMessage(message.getMessage(), acknowledgementTopic, projectId);
                // END - CODE TO PUBLISH MESSAGE TO TOPIC B
                ackIds.add(message.getAckId());
            }
            // Acknowledge received messages.
            AcknowledgeRequest acknowledgeRequest =
                AcknowledgeRequest.newBuilder()
                    .setSubscription(subscriptionName)
                    .addAllAckIds(ackIds)
                    .build();
            // Use acknowledgeCallable().futureCall to asynchronously perform this operation.
            subscriber.acknowledgeCallable().call(acknowledgeRequest);
            count = pullResponse.getReceivedMessagesList().size();
        } catch (Exception e) {
            log.error(e.getMessage());
        }
        return count;
    }
Below is the sample code that publishes messages to Topic B:
    public static void publishMessage(PubsubMessage pubsubMessage, String Topic, String projectId) {
        Publisher publisher = null;
        ProjectTopicName topicName = ProjectTopicName.newBuilder().setProject(projectId).setTopic(Topic).build();
        try {
            // Publish the messages to normal topic.
            publisher = Publisher.newBuilder(topicName).build();
        } catch (IOException e) {
            log.error(e.getMessage());
        }
        publisher.publish(pubsubMessage);
    }
Is this the right way to handle this use case, or can it be handled in some other way? We do not want to use Cloud Dataflow. Can someone let us know whether this is fine or whether there is an issue?
The code works, but sometimes messages stay on Topic A even after they are consumed synchronously.
Thanks

There are some issues with the code as presented.
You should really only use synchronous pull if there are specific reasons why you need to do so. In general, it is much better to use asynchronous pull via the client libraries. It will be more efficient and reduce the latency of moving messages from one topic to the other. You do not show how you call subscribeSync, but in order to process messages efficiently and ensure that you actually process all messages, you'd need to be calling it many times in parallel continuously. If you are going to stick with synchronous pull, then you should reuse the SubscriberStub object as recreating it for every call will be inefficient.
You don't reuse your Publisher object. As a result, you are not able to take advantage of the batching that the publisher client can do. You should create the Publisher once and reuse it across your calls for publishes to the same topic. If the passed-in topic can differ across messages, then keep a map from topic to publisher and retrieve the right one from the map.
You don't wait for the result of the call to publish. It is possible that this call fails, but you do not handle that failure. As a result, you could acknowledge the message on the first topic without it having actually been published, resulting in message loss.
With regard to your question about duplicates, Pub/Sub offers at-least-once delivery guarantees, so even with proper acking, it is still possible to receive messages again (typical duplicate rates are around 0.1%). There can be many different reasons for duplicates. In your case, since you are processing messages sequentially and recreating a publisher for every call, it could be that later messages are not acked before the ack deadline expires, which results in redelivery.
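Two of the points above, reusing one publisher per topic and acknowledging only after the publish has succeeded, can be sketched in plain Java. This is an illustration under stated assumptions rather than Pub/Sub client code: `TopicPublisher` is a hypothetical stand-in for the real `Publisher` (whose `publish` returns an `ApiFuture<String>`, not a `CompletableFuture`), and `ForwardThenAck` and its methods are made-up names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ForwardThenAck {

    // Hypothetical stand-in for a publisher client: the future completes
    // with a message id once the service has accepted the message.
    interface TopicPublisher {
        CompletableFuture<String> publish(String payload);
    }

    private final Map<String, TopicPublisher> publishers = new ConcurrentHashMap<>();
    private final Function<String, TopicPublisher> factory;

    ForwardThenAck(Function<String, TopicPublisher> factory) {
        this.factory = factory;
    }

    // One publisher per topic, created on first use and reused afterwards.
    TopicPublisher publisherFor(String topic) {
        return publishers.computeIfAbsent(topic, factory);
    }

    // Publishes every (ackId -> payload) entry and returns only the ack IDs
    // whose publish succeeded; the others stay unacked and will be redelivered.
    List<String> forward(String topic, Map<String, String> payloadByAckId) {
        TopicPublisher publisher = publisherFor(topic);
        Map<String, CompletableFuture<String>> inFlight = new LinkedHashMap<>();
        payloadByAckId.forEach((ackId, payload) -> inFlight.put(ackId, publisher.publish(payload)));

        List<String> ackable = new ArrayList<>();
        inFlight.forEach((ackId, future) -> {
            try {
                future.join(); // wait for the publish result
                ackable.add(ackId);
            } catch (RuntimeException e) {
                // publish failed: leave this message unacked
            }
        });
        return ackable;
    }

    public static void main(String[] args) {
        ForwardThenAck forwarder = new ForwardThenAck(topic -> payload -> {
            CompletableFuture<String> result = new CompletableFuture<>();
            if (payload.isEmpty()) {
                result.completeExceptionally(new IllegalArgumentException("empty payload"));
            } else {
                result.complete("message-id");
            }
            return result;
        });
        Map<String, String> pulled = new LinkedHashMap<>();
        pulled.put("ack-1", "hello");
        pulled.put("ack-2", "");
        System.out.println(forwarder.forward("topic-b", pulled)); // prints [ack-1]
    }
}
```

Starting all publishes before waiting on any of them keeps the batching benefit, and the list returned by `forward` is what would go into the acknowledge request.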
