Kafka producer is losing messages when broker is down - java

Given the following scenario:
I bring up ZooKeeper and a single Kafka broker on my local machine and create a "test" topic as described in the Kafka quickstart: https://kafka.apache.org/quickstart
Then I run a simple Java program that produces a message to the "test" topic every second. After some time I bring down my local Kafka broker and see that the producer keeps producing messages without throwing any exception. Finally, I bring the broker up again. The producer is able to reconnect and continues producing messages, but all the messages produced during the broker downtime are lost. The producer doesn't replay them when it detects a healthy broker.
How can I prevent this? I want the Kafka producer to replay those messages when it detects that the broker is back online. Here is my producer config:
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("linger.ms", 0);
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());

The Kafka producer library has a built-in retry mechanism, but in older client versions it is turned off by default (since Kafka 2.1 the default for retries is Integer.MAX_VALUE). Set the retries producer config to a value greater than 0 to turn it on. You should also experiment with retry.backoff.ms and request.timeout.ms to customise producer retries.
Example Kafka Producer config with enabled retries:
retries=2147483647 //Integer.MAX_VALUE
retry.backoff.ms=1000
request.timeout.ms=305000 //305 seconds, just over 5 minutes
max.block.ms=2147483647 //Integer.MAX_VALUE
You can find more information about those properties in Apache Kafka documentation.
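For readers who configure the producer in code, as in the question, the same settings could be sketched with plain java.util.Properties. This is only an illustration: the class name ProducerRetryConfig is made up, string keys are used so the snippet compiles without the Kafka client on the classpath, and the values are simply the ones from the example above, not recommendations.

```java
import java.util.Properties;

// Hypothetical helper (not part of Kafka): collects the retry-related
// producer settings from the answer above into a Properties object.
final class ProducerRetryConfig {

    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Retry failed sends instead of dropping them after the first failure.
        props.put("retries", String.valueOf(Integer.MAX_VALUE));
        // Wait one second between retry attempts.
        props.put("retry.backoff.ms", "1000");
        // How long to wait for a broker response before the request counts as failed.
        props.put("request.timeout.ms", "305000");
        // How long send() may block while metadata is unavailable or the buffer is full.
        props.put("max.block.ms", String.valueOf(Integer.MAX_VALUE));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092");
        // Print the resulting configuration in a stable order.
        props.stringPropertyNames().stream().sorted()
             .forEach(k -> System.out.println(k + "=" + props.getProperty(k)));
    }
}
```

The resulting Properties object would then be passed to the KafkaProducer constructor in place of the props from the question.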

Since you're running just one broker, I'm afraid you won't be able to store messages when your broker is down.
However, it is strange that you don't get any exception/warning/errors when you bring your broker down.
I would expect a "Failed to update metadata" or "expiring messages" error, because before sending messages the producer first asks the broker(s) listed in the bootstrap.servers property for cluster metadata (the partitions and their leaders); the producer does not contact ZooKeeper itself. In your case, since you're running Kafka in stand-alone mode, the producer cannot obtain the leader information while the broker is down and should error out.
Could you please check what the following properties are set to:
request.timeout.ms
max.block.ms
and experiment with these values (reducing them, perhaps), then check the results?
One more option you might want to try out is to send messages to Kafka in a synchronous fashion (blocking send() method until the messages are received) and here's a code snippet that might help (taken from this documentation reference):
If you want to simulate a simple blocking call you can call the get() method immediately:
byte[] key = "key".getBytes();
byte[] value = "value".getBytes();
ProducerRecord<byte[],byte[]> record = new ProducerRecord<byte[],byte[]>("my-topic", key, value);
producer.send(record).get();
In this case, kafka should throw an exception if the messages are not sent successfully for any reason.
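To show how a failure surfaces from the blocking get() call, here is a minimal sketch that uses only the JDK. It assumes nothing about Kafka itself: a CompletableFuture stands in for the Future that producer.send(record) would return, and the class name BlockingSendSketch and the method await are invented for the illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: blocks on a Future and reports what happened.
// With a real producer the Future would come from producer.send(record).
final class BlockingSendSketch {

    static String await(Future<?> sendResult, long timeoutMs) {
        try {
            sendResult.get(timeoutMs, TimeUnit.MILLISECONDS);
            return "sent";
        } catch (ExecutionException e) {
            // The asynchronous operation failed; the real cause is wrapped.
            return "failed: " + e.getCause().getMessage();
        } catch (TimeoutException e) {
            // No result arrived within the timeout.
            return "timed out";
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can react to it.
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> ok = CompletableFuture.completedFuture("metadata");
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("broker unavailable"));
        CompletableFuture<String> pending = new CompletableFuture<>();

        System.out.println(await(ok, 100));
        System.out.println(await(failed, 100));
        System.out.println(await(pending, 100));
    }
}
```

The point of the design is that a failed send is not lost silently: the caller sees the wrapped cause and can decide whether to retry, log, or stop.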
I hope this helps.

Related

Azure eventhub Kafka org.apache.kafka.common.errors.TimeoutException for some of the records

I have an ArrayList containing 80 to 100 records and am trying to stream and send each individual record (a POJO, not the entire list) to a Kafka topic (Event Hub). A cron job is scheduled every hour to send these records to Event Hub.
I can see messages being sent to Event Hub, but after 3 to 4 successful runs I get the following exception (some messages are sent and several fail with the exception below):
Expiring 14 record(s) for eventhubname: 30125 ms has passed since batch creation plus linger time
Following is the config for Producer used,
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ACKS_CONFIG, "1");
props.put(ProducerConfig.RETRIES_CONFIG, "3");
Message Retention period - 7
Partition - 6
using spring Kafka(2.2.3) to send the events
The method that calls the Kafka send is marked as @Async:
@Async
protected void send() {
kafkatemplate.send(record);
}
Expected: no exception thrown from Kafka
Actual: org.apache.kafka.common.errors.TimeoutException is thrown
Prakash - we have seen a number of issues where spiky producer patterns see batch timeout.
The problem here is that the producer has two TCP connections that can go idle for > 4 mins - at that point, Azure load balancers close out the idle connections. The Kafka client is unaware that the connections have been closed so it attempts to send a batch on a dead connection, which times out, at which point retry kicks in.
Set connections.max.idle.ms to < 4mins – this allows Kafka client’s network client layer to gracefully handle connection close for the producer’s message-sending TCP connection
Set metadata.max.age.ms to < 4mins – this is effectively a keep-alive for the producer metadata TCP connection
Feel free to reach out to the EH product team on Github, we are fairly good about responding to issues - https://github.com/Azure/azure-event-hubs-for-kafka
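The two settings above could be applied in code roughly as follows. This is a sketch under assumptions: the class IdleConnectionConfig and its method are hypothetical, string keys are used so no Kafka dependency is needed, and 180000 ms (3 minutes) is just one value below the 4-minute limit described in the answer.

```java
import java.util.Properties;

// Hypothetical helper: adds idle-connection settings that stay below the
// 4-minute idle cutoff of the Azure load balancers mentioned above.
final class IdleConnectionConfig {

    static final long AZURE_IDLE_LIMIT_MS = 4 * 60 * 1000L;

    static Properties withIdleSettings(Properties base, long idleMs) {
        if (idleMs <= 0 || idleMs >= AZURE_IDLE_LIMIT_MS) {
            throw new IllegalArgumentException(
                "idleMs must be positive and below " + AZURE_IDLE_LIMIT_MS + " ms");
        }
        // Copy the caller's settings so the input object is not modified.
        Properties props = new Properties();
        props.putAll(base);
        // Let the client close idle connections itself before Azure drops them.
        props.put("connections.max.idle.ms", String.valueOf(idleMs));
        // Refresh metadata often enough to keep the metadata connection active.
        props.put("metadata.max.age.ms", String.valueOf(idleMs));
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.put("acks", "1");
        base.put("retries", "3");
        Properties props = withIdleSettings(base, 180_000L);
        props.stringPropertyNames().stream().sorted()
             .forEach(k -> System.out.println(k + "=" + props.getProperty(k)));
    }
}
```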
This exception indicates you are queueing records at a faster rate than they can be sent. Once a record is added to a batch, there is a time limit for sending that batch to ensure it has been sent within a specified duration. This is controlled by the producer configuration parameter request.timeout.ms (in client versions from 2.1 onwards, by delivery.timeout.ms). If the batch has been queued longer than the timeout limit, the exception will be thrown. Records in that batch will be removed from the send queue.
Please check the similar question below, which might help:
Kafka producer TimeoutException: Expiring 1 record(s)
You can also check this link for more details about the reason for the batch expired exception:
when-does-the-apache-kafka-client-throw-a-batch-expired-exception/34794261#34794261
Also implement a proper retry policy.
Note that this does not account for any network issues on the scanner side. With network issues you will not be able to send to either hub.
Hope it helps.

Kafka Streams error - Offset commit failed on partition, request timed out

We use Kafka Streams for consuming, processing and producing messages, and in our PROD environment we ran into errors on multiple topics:
ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=app-xxx-StreamThread-3-consumer, groupId=app]
Offset commit failed on partition xxx-1 at offset 13920:
The request timed out.[]
These errors occur rarely for topics with a small load, but for topics with a high load (and spikes) they occur dozens of times a day per topic. Topics have multiple partitions (e.g. 10). This issue does not seem to affect data processing (apart from performance): after the exception is thrown (sometimes several times for the same offset), the consumer later re-reads the message and processes it successfully.
I see that this error message appeared in kafka-clients version 1.0.0 due to PR, but in previous kafka-clients versions a similar message (Offset commit for group {} failed: {}) was logged at debug level for the same use case (Errors.REQUEST_TIMED_OUT on consumer).
In my opinion, it would be more logical to change the log level to warning for this use case.
How can we fix this issue? What could be the root cause? Maybe changing consumer properties or the partition setup could help get rid of it.
We use the following implementation for creating Kafka Streams:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.<String, String>stream(topicName);
stream.foreach((key, value) -> processMessage(key, value));
Topology topology = builder.build();
StreamsConfig streamsConfig = new StreamsConfig(consumerSettings);
new KafkaStreams(topology, streamsConfig);
our Kafka consumer settings:
bootstrap.servers: xxx1:9092,xxx2:9092,...,xxx5:9092
application.id: app
state.dir: /tmp/kafka-streams/xxx
commit.interval.ms: 5000 # also I tried default value 30000
key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
timestamp.extractor: org.apache.kafka.streams.processor.WallclockTimestampExtractor
kafka broker version: kafka_2.11-0.11.0.2.
error occur on both versions of Kafka Streams: 1.0.1 and 1.1.0.
It looks like there is an issue with your Kafka cluster, and the Kafka consumer times out while trying to commit offsets.
You can try increasing the connection-related configs for the Kafka consumer:
request.timeout.ms (by default 305000ms)
The configuration controls the maximum amount of time the client will wait for the response of a request
connections.max.idle.ms (by default 540000ms)
Close idle connections after the number of milliseconds specified by this config.

Kafka 1.0 Streaming API: message consumption from partitions get delayed

Recently, I switched our streaming app from spark-streaming 2.1 to the new kafka-streaming API (1.0) with Kafka broker server 0.11.0.0.
I have implemented my own Processor class, and in the process method I just print the message content.
I have a Kafka cluster of 3 machines, and the topic I am consuming from has 300 partitions.
I ran the streaming app with 100 threads on a machine with 32 GB of RAM and 8 cores.
My problem is that in some cases I get a message as soon as it reaches the Kafka topic/partition, and in other cases I get it 10-15 minutes after it reached the topic, and I don't know why.
I used the command line below to track the lag on the Kafka topic for the streaming app's group.id.
./bin/kafka-run-class.sh kafka.admin.ConsumerGroupCommand --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092 --new-consumer --describe --group kf_streaming_gp_id
Unfortunately, it does not consistently give accurate results, or sometimes any result at all. Does anybody know why?
Is there something I missed in the streaming app that would let me read messages consistently as soon as they reach the partitions?
Are there any consumer properties that fix such a problem?
My kafka-streaming app structure is as below:
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "kf_streaming_gp_id");
config.put(StreamsConfig.CLIENT_ID_CONFIG, "kf_streaming_gp_id");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092,kafka2:9092,kafka3:9092");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, DocumentSerde.class);
config.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, CustomTimeExtractor.class);
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 100);
StreamsBuilder builder = new StreamsBuilder(); // declaration missing from the original snippet
KStream<String, Document> topicStreams = builder.stream(sourceTopic);
topicStreams.process(() -> new DocumentProcessor(appName, environment, dimensions, vector, sinkTopic));
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
I figured out what the problem was in my case.
It turned out that some threads were stuck doing highly CPU-intensive work, which stopped other threads from consuming messages. That's why I saw such bursts. When I stopped this CPU-intensive logic, everything was super fast, and messages got to the streaming job as soon as they reached the Kafka topic.

Kafka Consumer Properties to read from the maximum offset

I have written a Java Kafka consumer. How can I explicitly ensure that once the consumer is started, it only reads messages sent by the producer from that time onwards, i.e. it does not read any messages that were already sent to Kafka? Can anyone explain how to ensure this?
Here is a snippet of the properties I use
Properties properties = new Properties();
properties.put("zookeeper.connect", zookeeperHost);
properties.put("group.id", group);
properties.put("auto.offset.reset","largest");
ConsumerConfig consumerConfig = new ConsumerConfig(properties);
consumerConnector = Consumer.createJavaConsumerConnector(consumerConfig);
UPDATE Sept14:
I am using the following properties, but it seems that the consumer still reads from the beginning at times. Can someone tell me what's wrong now?
I am using Kafka Version 0.8.2
properties.put("auto.offset.reset","largest");
properties.put("auto.commit.enable","false");
Based on answers above, it seems that the correct mechanism is as follows for setting properties of the consumer:
properties.put("auto.offset.reset","largest");
properties.put("auto.commit.enable","false");
This ensures reading from the maximum offset
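For readers on the newer Java consumer (0.9 and later) rather than the 0.8.2 consumer used here, the setting names differ: the value is "latest" instead of "largest", and the switch is enable.auto.commit instead of auto.commit.enable. Note also that auto.offset.reset only takes effect when the group has no valid committed offset, which may explain occasional reads from an older position. The sketch below is an assumption-laden illustration: the class name is invented, string keys avoid a Kafka dependency, and using a fresh random group.id on every start is just one possible way to make sure no committed offset exists.

```java
import java.util.Properties;
import java.util.UUID;

// Hypothetical helper: builds settings for a consumer that should start
// at the end of the log every time it is launched.
final class LatestOffsetConsumerConfig {

    static Properties build(String bootstrapServers, String groupPrefix) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // A fresh group id has no committed offsets, so auto.offset.reset applies.
        props.put("group.id", groupPrefix + "-" + UUID.randomUUID());
        // Start from the end of each partition when no committed offset exists.
        props.put("auto.offset.reset", "latest");
        // Do not commit offsets automatically.
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092", "my-group");
        props.stringPropertyNames().stream().sorted()
             .forEach(k -> System.out.println(k + "=" + props.getProperty(k)));
    }
}
```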

How can I send large messages with Kafka (over 15MB)?

I send String messages to Kafka v0.8 with the Java Producer API.
If the message size is about 15 MB I get a MessageSizeTooLargeException.
I have tried to set message.max.bytes to 40 MB, but I still get the exception. Small messages work without problems.
(The exception appears in the producer; I don't have a consumer in this application.)
What can I do to get rid of this exception?
My example producer config:
private ProducerConfig kafkaConfig() {
Properties props = new Properties();
props.put("metadata.broker.list", BROKERS);
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("request.required.acks", "1");
props.put("message.max.bytes", "" + 1024 * 1024 * 40);
return new ProducerConfig(props);
}
Error-Log:
4709 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 214 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
4869 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 217 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5035 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 220 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5198 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 223 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5305 [main] ERROR kafka.producer.async.DefaultEventHandler - Failed to send requests for topics datasift with correlation ids in [213,224]
kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
at kafka.producer.async.DefaultEventHandler.handle(Unknown Source)
at kafka.producer.Producer.send(Unknown Source)
at kafka.javaapi.producer.Producer.send(Unknown Source)
You need to adjust three (or four) properties:
1. Consumer side: fetch.message.max.bytes - this will determine the largest size of a message that can be fetched by the consumer.
2. Broker side: replica.fetch.max.bytes - this will allow for the replicas in the brokers to send messages within the cluster and make sure the messages are replicated correctly. If this is too small, then the message will never be replicated, and therefore, the consumer will never see the message because the message will never be committed (fully replicated).
3. Broker side: message.max.bytes - this is the largest size of the message that can be received by the broker from a producer.
4. Broker side (per topic): max.message.bytes - this is the largest size of the message the broker will allow to be appended to the topic. This size is validated pre-compression. (Defaults to broker's message.max.bytes.)
I found out the hard way about number 2 - you don't get ANY exceptions, messages, or warnings from Kafka, so be sure to consider this when you are sending large messages.
Minor changes required for Kafka 0.10 and the new consumer compared to laughing_man's answer:
Broker: No changes, you still need to increase properties message.max.bytes and replica.fetch.max.bytes. message.max.bytes has to be equal or smaller(*) than replica.fetch.max.bytes.
Producer: Increase max.request.size to send the larger message.
Consumer: Increase max.partition.fetch.bytes to receive larger messages.
(*) Read the comments to learn more about message.max.bytes<=replica.fetch.max.bytes
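The relationships between these settings can be sanity-checked with a few comparisons. The sketch below is only an illustration with an invented class and method name; it compares raw sizes and deliberately ignores record and batch overhead, so real settings should leave some headroom above the payload size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker for the size settings discussed above (Kafka 0.10+ names).
final class LargeMessageSettingsCheck {

    static List<String> check(long messageBytes,
                              long maxRequestSize,         // producer: max.request.size
                              long messageMaxBytes,        // broker: message.max.bytes
                              long replicaFetchMaxBytes,   // broker: replica.fetch.max.bytes
                              long maxPartitionFetchBytes) // consumer: max.partition.fetch.bytes
    {
        List<String> problems = new ArrayList<>();
        if (maxRequestSize < messageBytes) {
            problems.add("producer max.request.size is smaller than the message");
        }
        if (messageMaxBytes < messageBytes) {
            problems.add("broker message.max.bytes is smaller than the message");
        }
        if (messageMaxBytes > replicaFetchMaxBytes) {
            problems.add("broker message.max.bytes exceeds replica.fetch.max.bytes");
        }
        if (maxPartitionFetchBytes < messageBytes) {
            problems.add("consumer max.partition.fetch.bytes is smaller than the message");
        }
        return problems;
    }

    public static void main(String[] args) {
        long fifteenMb = 15L * 1024 * 1024; // 15728640 bytes
        long oneMb = 1024L * 1024;

        // All limits raised to 15 MB: no problems reported.
        System.out.println(check(fifteenMb, fifteenMb, fifteenMb, fifteenMb, fifteenMb));
        // Producer and consumer left at 1 MB: two problems reported.
        System.out.println(check(fifteenMb, oneMb, fifteenMb, fifteenMb, oneMb));
    }
}
```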
The answer from @laughing_man is quite accurate. But still, I wanted to give a recommendation which I learned from Kafka expert Stephane Maarek. We actively applied this solution in our live systems.
Kafka isn’t meant to handle large messages.
Your API should use cloud storage (for example, AWS S3) and simply push a reference to S3 to Kafka or any other message broker. You'll need to find a place to save your data, whether it can be a network drive or something else entirely, but it shouldn't be a message broker.
If you don't want to proceed with the recommended and reliable solution above,
The default maximum message size in Apache Kafka is 1MB (the setting in your brokers is called message.max.bytes). If you really need it badly, you could increase that size and make sure to increase the network buffers for your producers and consumers.
And if you really care about splitting your message, make sure each message split has the exact same key so that it gets pushed to the same partition, and your message content should report a “part id” so that your consumer can fully reconstruct the message.
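The splitting idea from the paragraph above could look roughly like this. It is a sketch, not a tested production approach: the class MessageChunker and its Part type are invented, only the JDK is used, and actually sending each part to Kafka (with the shared key as the record key) is left out.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: splits a payload into parts that share one key and
// carry a part index, then reassembles them on the consumer side.
final class MessageChunker {

    static final class Part {
        final String key;   // same key for every part, so all go to one partition
        final int index;    // the "part id" used for reassembly
        final int total;    // number of parts the payload was split into
        final byte[] data;

        Part(String key, int index, int total, byte[] data) {
            this.key = key;
            this.index = index;
            this.total = total;
            this.data = data;
        }
    }

    static List<Part> split(String key, byte[] payload, int maxPartBytes) {
        if (maxPartBytes <= 0) {
            throw new IllegalArgumentException("maxPartBytes must be positive");
        }
        // Round up; an empty payload still yields one (empty) part.
        int total = Math.max(1, (payload.length + maxPartBytes - 1) / maxPartBytes);
        List<Part> parts = new ArrayList<>(total);
        for (int i = 0; i < total; i++) {
            int from = i * maxPartBytes;
            int to = Math.min(payload.length, from + maxPartBytes);
            parts.add(new Part(key, i, total, Arrays.copyOfRange(payload, from, to)));
        }
        return parts;
    }

    static byte[] join(List<Part> parts) {
        // Sort a copy by part index so arrival order does not matter.
        List<Part> sorted = new ArrayList<>(parts);
        sorted.sort(Comparator.comparingInt(p -> p.index));
        if (sorted.isEmpty() || sorted.size() != sorted.get(0).total) {
            throw new IllegalStateException("missing parts");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Part p : sorted) {
            out.write(p.data, 0, p.data.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = new byte[10];
        for (int i = 0; i < payload.length; i++) payload[i] = (byte) i;
        List<Part> parts = split("doc-1", payload, 4);
        System.out.println("parts: " + parts.size());
        System.out.println("round trip ok: " + Arrays.equals(payload, join(parts)));
    }
}
```

A real consumer would also need to buffer parts per key until all of them have arrived and to handle parts that never show up, which this sketch only signals with an exception.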
If the message is text-based try to compress the data, which may reduce the data size, but not magically.
Again, you should use an external system to store that data and just push an external reference to Kafka. That is a very common and widely accepted architecture, and one you should go with.
Keep in mind that Kafka works best when messages are large in number but not in size.
Source: https://www.quora.com/How-do-I-send-Large-messages-80-MB-in-Kafka
The idea is to have the same message size limit all the way from the Kafka producer to the Kafka broker and then to the Kafka consumer, i.e.
Kafka producer --> Kafka Broker --> Kafka Consumer
If the requirement is to send a 15 MB message, then the producer, the broker and the consumer all need to be in sync.
Kafka Producer sends 15 MB --> Kafka Broker Allows/Stores 15 MB --> Kafka Consumer receives 15 MB
The setting therefore should be:
a) on Broker:
message.max.bytes=15728640
replica.fetch.max.bytes=15728640
b) on Consumer:
fetch.message.max.bytes=15728640
You need to override the following properties:
Broker Configs($KAFKA_HOME/config/server.properties)
replica.fetch.max.bytes
message.max.bytes
Consumer Configs($KAFKA_HOME/config/consumer.properties)
This step didn't work for me. I added it to the consumer app instead and it worked fine
fetch.message.max.bytes
Restart the server.
look at this documentation for more info:
http://kafka.apache.org/08/configuration.html
I think most of the answers here are somewhat outdated or not entirely complete.
Referring to the answer by Sacha Vetter (with the update for Kafka 0.10), I'd like to provide some additional information and links to the official documentation.
Producer Configuration:
max.request.size (Link) has to be increased for files bigger than 1 MB, otherwise they are rejected
Broker/Topic configuration:
message.max.bytes (Link) may be set if one would like to increase the message size at broker level. But, from the documentation: "This can be set per topic with the topic level max.message.bytes config."
max.message.bytes (Link) may be increased if only one topic should be able to accept larger files. The broker configuration does not need to be changed.
I'd always prefer a topic-restricted configuration, because as a client of the Kafka cluster I can configure the topic myself (e.g. with the admin client). I may not have any influence on the broker configuration itself.
In the answers from above, some more configurations are mentioned as necessary:
replica.fetch.max.bytes (Link) (Broker config)
From the documentation: "This is not an absolute maximum, if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that progress can be made."
max.partition.fetch.bytes (Link) (Consumer config)
From the documentation: "Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress."
fetch.max.bytes (Link) (Consumer config; not mentioned above, but same category)
From the documentation: "Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress."
Conclusion: The configurations regarding fetching messages do not need to be changed to process messages larger than the default values of these configurations (I tested this in a small setup). Probably, the consumer may always get batches of size 1. However, two of the configurations from the first block have to be set, as mentioned in the answers before.
This clarification says nothing about performance and is not a recommendation to set or not to set these configurations. The best values have to be evaluated individually, depending on the concrete planned throughput and data structure.
One key thing to remember is that the message.max.bytes attribute must be in sync with the consumer's fetch.message.max.bytes property. The fetch size must be at least as large as the maximum message size, otherwise there could be a situation where producers send messages larger than the consumer can consume/fetch. It might be worth taking a look at it.
Which version of Kafka are you using? Please also provide a more detailed trace of what you are getting. Is something like "... payload size of xxxx larger than 1000000" coming up in the log?
For people using landoop kafka:
You can pass the config values in the environment variables like:
docker run -d --rm -p 2181:2181 -p 3030:3030 -p 8081-8083:8081-8083 -p 9581-9585:9581-9585 -p 9092:9092 \
-e KAFKA_TOPIC_MAX_MESSAGE_BYTES=15728640 -e KAFKA_REPLICA_FETCH_MAX_BYTES=15728640 landoop/fast-data-dev:latest
This sets topic.max.message.bytes and replica.fetch.max.bytes on the broker.
And if you're using rdkafka then pass the message.max.bytes in the producer config like:
const producer = new Kafka.Producer({
'metadata.broker.list': 'localhost:9092',
'message.max.bytes': '15728640',
'dr_cb': true
});
Similarly, for the consumer,
const kafkaConf = {
"group.id": "librd-test",
"fetch.message.max.bytes":"15728640",
... .. }
Here is how I successfully sent data up to 100 MB using kafka-python==2.0.2:
Consumer:
consumer = KafkaConsumer(
...
max_partition_fetch_bytes=max_bytes,
fetch_max_bytes=max_bytes,
)
Producer (See final solution at the end):
producer = KafkaProducer(
...
max_request_size=KafkaSettings.MAX_BYTES,
)
Then:
producer.send(topic, value=data).get()
After sending data like this, the following exception appeared:
MessageSizeTooLargeError: The message is n bytes when serialized which is larger than the total memory buffer you have configured with the buffer_memory configuration.
Finally I increased buffer_memory (default 32mb) to receive the message on the other end.
producer = KafkaProducer(
...
max_request_size=KafkaSettings.MAX_BYTES,
buffer_memory=KafkaSettings.MAX_BYTES * 3,
)
