Apache Kafka Cluster start fails with NoNodeException

I'm trying to start a Spark Streaming job that consumes from a Kafka topic, and I'm using ZooKeeper for configuration management. However, when I try to start it, the following exception is thrown:
18/03/26 09:25:49 INFO ZookeeperConnection: Checking Kafka topic core-data-tickets does exists ...
18/03/26 09:25:49 INFO Broker: Kafka topic core-data-tickets exists
18/03/26 09:25:49 INFO Broker: Processing topic : core-data-tickets
18/03/26 09:25:49 WARN ZookeeperConnection: Resetting Topic Offset
org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /consumers/clt/offsets/core-data-tickets/4
at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766)
at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
at kafka.utils.ZkUtils$.readData(ZkUtils.scala:443)
at kafka.utils.ZkUtils.readData(ZkUtils.scala)
at net.core.data.connection.ZookeeperConnection.readTopicPartitionOffset(ZookeeperConnection.java:145)
I have already created the relevant Kafka topic.
Any insights on this would be highly appreciated.
I'm using the following command to run the Spark job:
spark-submit --class net.core.data.compute.Broker --executor-memory 512M --total-executor-cores 2 --driver-java-options "-Dproperties.path=/ebs/tmp/continuous-loading-tool/continuous-loading-tool/src/main/resources/dev.properties" --conf spark.ui.port=4045 /ebs/tmp/dev/data/continuous-loading-tool/target/continuous-loading-tool-1.0-SNAPSHOT.jar

I suspect this error is related to offsets retention. By default, offsets are kept for only 1440 minutes (i.e. 24 hours); this was the broker default before Kafka 2.0. Therefore, if the group has not committed offsets within a day, Kafka no longer has any information about them.
A possible workaround is to raise the value of offsets.retention.minutes accordingly. The broker configuration documentation describes it as follows:
offsets.retention.minutes
Offsets older than this retention period will be discarded
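To make the retention arithmetic concrete, here is a minimal sketch under a simplified model: a committed offset is treated as expired once it is older than offsets.retention.minutes. The class and method names are hypothetical, and real brokers apply additional rules (newer versions, for example, only start the clock once the group is empty), so treat it as an illustration rather than the broker's actual logic.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: a simplified model of offset expiry, not broker code.
final class OffsetRetentionCheck {
    // 1440 minutes is the default quoted in the answer above.
    static final long DEFAULT_RETENTION_MINUTES = 1440;

    // Returns true if an offset committed at lastCommit is older than the
    // retention period when observed at time now.
    static boolean isExpired(Instant lastCommit, Instant now, long retentionMinutes) {
        Duration age = Duration.between(lastCommit, now);
        return age.compareTo(Duration.ofMinutes(retentionMinutes)) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant twoDaysAgo = now.minus(Duration.ofDays(2));
        // A group that last committed two days ago, checked against the 24h default.
        System.out.println(isExpired(twoDaysAgo, now, DEFAULT_RETENTION_MINUTES));
    }
}
```

Under this model, a consumer group that is idle for longer than the retention period has to fall back to its reset policy on restart, which is why raising the retention is suggested as a workaround.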

Related

io.smallrye.mutiny.TimeoutException when using kafka vs redis

I'm using Kafka + Redis in my project.
I get messages from Kafka, process them, and save them to Redis, but after my code has been running for some time it fails with the error below:
io.smallrye.mutiny.TimeoutException
at io.smallrye.mutiny.operators.uni.UniBlockingAwait.await(UniBlockingAwait.java:64)
at io.smallrye.mutiny.groups.UniAwait.atMost(UniAwait.java:65)
at io.quarkus.redis.client.runtime.RedisClientImpl.await(RedisClientImpl.java:1046)
at io.quarkus.redis.client.runtime.RedisClientImpl.set(RedisClientImpl.java:687)
at worker.redis.process.implementation.ProductImplementation.refresh(ProductImplementation.java:34)
at worker.redis.Worker.refresh(Worker.java:51)
at kafka.InComingProductKafkaConsume.lambda$consume$0(InComingProductKafkaConsume.java:38)
at business.core.hpithead.ThreadStart.doRun(ThreadStart.java:34)
at business.core.hpithead.core.NotifyingThread.run(NotifyingThread.java:27)
at java.base/java.lang.Thread.run(Thread.java:833)
The record 51761 from topic-partition 'mer-outgoing-master-item-0' has waited for 153 seconds to be acknowledged. This waiting time is greater than the configured threshold (150000 ms). At the moment 2 messages from this partition are awaiting acknowledgement. The last committed offset for this partition was 51760. This error is due to a potential issue in the application which does not acknowledged the records in a timely fashion. The connector cannot commit as a record processing has not completed.
@Incoming("mer_product")
@Blocking
public CompletionStage<Void> consume2(Message<String> payload) {
    var objectDto = configThreadLocal.mapper.readValue(payload.getPayload(), new TypeReference<KafkaPayload<ItemKO>>(){});
    worker.refresh(objectDto.payload.castDto());
    return payload.ack();
}

Does anyone know a fix for a hanging Kafka producer?

Can anyone please tell me about this exception.
ERROR [kafka-producer-network-thread | producer-2] c.o.p.a.s.CalculatorAdapter [CalculatorAdapter.java:285]
Cannot send outgoingDto with decision id = 46d1-9491-123ce9c7a916 in kafka:
org.springframework.kafka.core.KafkaProducerException: Failed to send;
nested exception is org.apache.kafka.common.errors.TimeoutException:
Expiring 1 record(s) for save-request-0:604351 ms has passed since batch creation
at org.springframework.kafka.core.KafkaTemplate.lambda$buildCallback$4(KafkaTemplate.java:602)
at org.springframework.kafka.core.DefaultKafkaProducerFactory$CloseSafeProducer$1.onCompletion(DefaultKafkaProducerFactory.java:871)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1356)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:231)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:197)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:676)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:380)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:323)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.common.errors.TimeoutException:
Expiring 1 record(s) for save-request-0:604351 ms has passed since batch creation
I have been fighting with this for two weeks now.
I have tried a number of suggested fixes, but none of them helped.
My program sends messages of about 60 kilobytes each, but they do not reach the Kafka server.
The entire Java application log is filled with exceptions of this kind.

My guess is that filling the batch takes longer than the timeout, so the message is never sent.
// example
Properties props = new Properties();
...
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 60000); // about 60 KB
...
Producer<String, String> producer = new KafkaProducer<>(props);
Check out these articles:
Kafka Producer Batch
Kafka Producer batch size
Batch size configuration
http://cloudurable.com/blog/kafka-tutorial-kafka-producer-advanced-java-examples/index.html
https://kafka.apache.org/26/javadoc/org/apache/kafka/clients/producer/ProducerConfig.html
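The "Expiring N record(s)" message means a batch stayed in the producer's buffer longer than the producer was willing to wait. Below is a minimal sketch of the settings involved, using plain string keys so that it needs nothing beyond java.util.Properties. The values and the bootstrap address are illustrative assumptions, not recommendations, and delivery.timeout.ms only exists in Kafka clients 2.1 and later. The helper name isConsistent is hypothetical; it mirrors the client's rule that delivery.timeout.ms must be at least linger.ms + request.timeout.ms.

```java
import java.util.Properties;

// Hypothetical sketch: producer settings that bound how long a batch may wait.
final class ProducerTimeoutSketch {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("batch.size", "65536");           // upper bound on batch size, in bytes
        props.put("linger.ms", "5");                // max wait for more records before sending
        props.put("request.timeout.ms", "30000");   // max wait for a broker response
        props.put("delivery.timeout.ms", "120000"); // total time budget for one send()
        return props;
    }

    // Checks delivery.timeout.ms >= linger.ms + request.timeout.ms.
    static boolean isConsistent(Properties props) {
        long linger = Long.parseLong(props.getProperty("linger.ms"));
        long request = Long.parseLong(props.getProperty("request.timeout.ms"));
        long delivery = Long.parseLong(props.getProperty("delivery.timeout.ms"));
        return delivery >= linger + request;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(build()));
    }
}
```

Note that batch.size is only an upper bound: the producer does not wait for a batch to fill, it sends after linger.ms at the latest. A batch that expires after many seconds therefore more likely points to the producer being unable to reach the partition leader than to a batch that is too large.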

Expiring 1 record(s) for xxxxx: 30030 ms has passed since batch creation plus linger time

My use case:
Using Postman, I call a Spring Boot SOAP endpoint. The endpoint creates a KafkaProducer and sends a message to a specific topic. I also have a TaskScheduler that consumes the topic.
The problem:
When calling the SOAP endpoint to push a message to a topic, I get this error:
2017-11-14 21:29:31.463 ERROR 6389 --- [ad | producer-3] DomainEntityProducer : Expiring 1 record(s) for DomainEntityCommandStream-0: 30030 ms has passed since batch creation plus linger time
2017-11-14 21:29:31.464 ERROR 6389 --- [nio-8080-exec-6] DomainEntityProducer : org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for DomainEntityCommandStream-0: 30030 ms has passed since batch creation plus linger time
Here’s the method I use to push to the topic:
public DomainEntity push(DomainEntity pDomainEntity) throws Exception {
    logger.log(Level.INFO, "streaming...");
    wKafkaProperties.put("bootstrap.servers", "localhost:9092");
    wKafkaProperties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    wKafkaProperties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    KafkaProducer wKafkaProducer = new KafkaProducer(wKafkaProperties);
    ProducerRecord wProducerRecord = new ProducerRecord("DomainEntityCommandStream", getJSON(pDomainEntity));
    wKafkaProducer.send(wProducerRecord, (RecordMetadata r, Exception e) -> {
        if (e != null) {
            logger.log(Level.SEVERE, e.getMessage());
        }
    }).get();
    return pDomainEntity;
}
Using the command-line shell scripts
./kafka-console-producer.sh --broker-list 10.0.1.15:9092 --topic DomainEntityCommandStream
and
./kafka-console-consumer.sh --bootstrap-server 10.0.1.15:9092 --topic DomainEntityCommandStream --from-beginning
works very well.
Going through some related problems on Stack Overflow, I have tried to purge the topic:
./kafka-topics.sh --zookeeper 10.0.1.15:9092 --alter --topic DomainEntityCommandStream --config retention.ms=1000
Looking at the Kafka logs, I see that the retention time was altered.
But no luck: I get the same error.
The payload is ridiculously small, so why should I change batch.size?
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:gs="http://soap.problem.com">
<soapenv:Header/>
<soapenv:Body>
<gs:streamDomainEntityRequest>
<gs:domainEntity>
<gs:name>12345</gs:name>
<gs:value>Quebec</gs:value>
<gs:version>666</gs:version>
</gs:domainEntity>
</gs:streamDomainEntityRequest>
</soapenv:Body>
</soapenv:Envelope>

When using Docker with a Kafka 0.11.0.1 image, you need to add the following environment variables to the container:
KAFKA_ZOOKEEPER_CONNECT = X.X.X.X:XXXX (your ZooKeeper IP or domain and port; the default port is 2181)
KAFKA_ADVERTISED_HOST_NAME = X.X.X.X (your Kafka IP or domain)
KAFKA_ADVERTISED_PORT = XXXX (your Kafka port number; the default is 9092)
Optionally:
KAFKA_BROKER_ID = 999 (some value)
KAFKA_CREATE_TOPICS=test:1:1 (a topic to create at startup)
If it doesn't work and you still get the same message ("Expiring X record(s) for xxxxx: XXXXX ms has passed since batch creation plus linger time"), you can try cleaning the Kafka data from ZooKeeper.

Kafka No broker in ISR for partition

We have a Kafka cluster consisting of 6 nodes. Five of the 6 nodes run ZooKeeper.
A Spark Streaming job reads from a streaming server, does some processing, and sends the result to Kafka.
From time to time the Spark job gets stuck, no data is sent to Kafka, and the job is restarted.
The job keeps getting stuck and restarted until we manually restart the Kafka cluster. After restarting Kafka, everything works smoothly.
Checking the Kafka logs, we found that this exception is thrown several times:
2017-03-10 05:12:14,177 ERROR state.change.logger: Controller 133 epoch 616 initiated state change for partition [live_stream_2,52] from OfflinePartition to OnlinePartition failed
kafka.common.NoReplicaOnlineException: No broker in ISR for partition [gnip_live_stream_2,52] is alive. Live brokers are: [Set(133, 137, 134, 135, 143)], ISR brokers are: [142]
at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:66)
at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:345)
at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:205)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:120)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:117)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:117)
at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:70)
at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:333)
at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:164)
at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:259)
at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:823)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
The exception above is thrown for an unused topic (live_stream_2), but it is also thrown for a topic that is in use, with a small difference.
Here is the exception for the topic that is in use:
2017-03-10 12:05:18,535 ERROR state.change.logger: Controller 133 epoch 620 initiated state change for partition [gnip_live_stream,3] from OfflinePartition to OnlinePartition failed
kafka.common.NoReplicaOnlineException: No broker in ISR for partition [live_stream,3] is alive. Live brokers are: [Set(133, 134, 135, 137)], ISR brokers are: [136]
at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:66)
at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:345)
at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:205)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:120)
at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:117)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:117)
at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:70)
at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:333)
at kafka.controller.KafkaController$$anonfun$1.apply$mcV$sp(KafkaController.scala:164)
at kafka.server.ZookeeperLeaderElector.elect(ZookeeperLeaderElector.scala:84)
at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:146)
at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:141)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:259)
at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:141)
at org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:823)
at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
In the first exception, the ISR broker list for partition 52 contains only the broker with ID 142, which is odd because the cluster has no broker with this ID.
In the second exception, the ISR broker list for partition 3 contains only the broker with ID 136, which is not in the live broker list.
I suspect that stale data in ZooKeeper causes the first exception, and that broker 136 was down at that specific time for some reason, which causes the second exception.
My questions:
1- Could these exceptions be the reason Kafka (and consequently the Spark job) gets stuck?
2- How can this be solved?
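The comparison made by hand above (which ISR brokers are missing from the live broker list) can be sketched in code. This is only an illustration: the class and method names are hypothetical, and the regular expression assumes the exact message format shown in the two log excerpts.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds ISR broker ids that are not in the live broker set.
final class IsrLogCheck {
    // Assumes the "Live brokers are: [Set(...)], ISR brokers are: [...]" format above.
    private static final Pattern LINE = Pattern.compile(
        "Live brokers are: \\[Set\\(([^)]*)\\)\\], ISR brokers are: \\[([^\\]]*)\\]");

    // Parses a comma-separated list of broker ids such as "133, 137, 134".
    static Set<Integer> parseIds(String csv) {
        Set<Integer> ids = new LinkedHashSet<>();
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                ids.add(Integer.parseInt(trimmed));
            }
        }
        return ids;
    }

    // Returns the ISR broker ids that are absent from the live broker set.
    static Set<Integer> deadIsrBrokers(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("unrecognized log line");
        }
        Set<Integer> dead = parseIds(m.group(2));
        dead.removeAll(parseIds(m.group(1)));
        return dead;
    }
}
```

A non-empty result means the partition's only in-sync replicas live on brokers the controller cannot see, so no leader can be elected from the ISR for that partition.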

The group coordinator is not available-Kafka

When I write to a Kafka topic, I get the error "Offset commit failed":
2016-10-29 14:52:56.387 INFO [nioEventLoopGroup-3-1][org.apache.kafka.common.utils.AppInfoParser$AppInfo:82] - Kafka version : 0.9.0.1
2016-10-29 14:52:56.387 INFO [nioEventLoopGroup-3-1][org.apache.kafka.common.utils.AppInfoParser$AppInfo:83] - Kafka commitId : 23c69d62a0cabf06
2016-10-29 14:52:56.409 ERROR [nioEventLoopGroup-3-1][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$DefaultOffsetCommitCallback:489] - Offset commit failed.
org.apache.kafka.common.errors.GroupCoordinatorNotAvailableException: The group coordinator is not available.
2016-10-29 14:52:56.519 WARN [kafka-producer-network-thread | producer-1][org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater:582] - Error while fetching metadata with correlation id 0 : {0085000=LEADER_NOT_AVAILABLE}
2016-10-29 14:52:56.612 WARN [pool-6-thread-1][org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater:582] - Error while fetching metadata with correlation id 1 : {0085000=LEADER_NOT_AVAILABLE}
Creating a new topic from the command line works fine:
./kafka-topics.sh --zookeeper localhost:2181 --create --topic test1 --partitions 1 --replication-factor 1 --config max.message.bytes=64000 --config flush.messages=1
This is the producer code, written in Java:
public void create() {
    Properties props = new Properties();
    props.clear();
    String producerServer = PropertyReadHelper.properties.getProperty("kafka.producer.bootstrap.servers");
    String zookeeperConnect = PropertyReadHelper.properties.getProperty("kafka.producer.zookeeper.connect");
    String metaBrokerList = PropertyReadHelper.properties.getProperty("kafka.metadata.broker.list");
    props.put("bootstrap.servers", producerServer);
    props.put("zookeeper.connect", zookeeperConnect); // declare ZooKeeper
    props.put("metadata.broker.list", metaBrokerList); // declare the Kafka brokers
    props.put("acks", "all");
    props.put("retries", 0);
    props.put("batch.size", 1000);
    props.put("linger.ms", 10000);
    props.put("buffer.memory", 10000);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    producer = new KafkaProducer<String, String>(props);
}
What is wrong here?

I faced a similar issue. When you start your Kafka broker there is a property associated with it, KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR. If you are working with a single-node cluster, make sure you set this property to 1, because its default value is 3. This change resolved my problem. (You can check the value in the kafka.properties file.)
Note: I was using the Confluent Kafka base image, version 4.0.0 (confluentinc/cp-kafka:4.0.0).
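For readers wondering how that environment variable relates to the broker setting: the Confluent images derive the property name from the variable name by dropping the KAFKA_ prefix, lower-casing, and turning underscores into dots. A minimal sketch of that basic convention follows; the class and method names are hypothetical, and the special cases the images also support (such as doubled underscores) are deliberately ignored.

```java
import java.util.Locale;

// Hypothetical helper: basic env-var to broker-property name mapping only.
final class KafkaEnvName {
    private static final String PREFIX = "KAFKA_";

    // KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR -> offsets.topic.replication.factor
    static String toBrokerProperty(String envName) {
        if (!envName.startsWith(PREFIX)) {
            throw new IllegalArgumentException("expected a name starting with " + PREFIX);
        }
        return envName.substring(PREFIX.length())
                .toLowerCase(Locale.ROOT)
                .replace('_', '.');
    }

    public static void main(String[] args) {
        System.out.println(toBrokerProperty("KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR"));
    }
}
```

So on a plain (non-Docker) broker the equivalent change would be made to offsets.topic.replication.factor in the broker's properties file.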

Looking at your logs, the problem is probably that the cluster has no connection to the node that is the only known replica of the given topic in ZooKeeper.
You can check this with the following command:
kafka-topics.sh --describe --zookeeper localhost:2181 --topic test1
or using kafkacat:
kafkacat -L -b localhost:9092
Example result:
Metadata for all topics (from broker 1003: localhost:9092/1003):
1 brokers:
broker 1003 at localhost:9092
1 topics:
topic "topic1" with 1 partitions:
partition 0, leader -1, replicas: 1001, isrs: , Broker: Leader not available
If you have a single-node cluster, the broker id (1001) should match the leader of the topic1 partition.
But as you can see, the only known replica of topic1 is 1001, which is not available now, so the topic cannot be recreated on a different node.
The source of the problem may be automatic broker id generation (if you have not specified broker.id, or it is set to -1).
In that case, when the broker (the same single broker) starts, it probably receives a broker id different from the previous one and different from the one recorded in ZooKeeper (this is why deleting the partition can help, but that is not a production solution).
The solution may be to set broker.id in the node config to a fixed value; according to the documentation this should be done in production environments:
broker.id=1
If everything is all right, you should receive something like this:
Metadata for all topics (from broker 1: localhost:9092/1001):
1 brokers:
broker 1 at localhost:9092
1 topics:
topic "topic1" with 1 partitions:
partition 0, leader 1, replicas: 1, isrs: 1
Kafka Documentation:
https://kafka.apache.org/documentation/#prodconfig
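As a quick way to check a broker config for the condition described above, here is a minimal sketch that reads the text of a properties file and reports whether broker.id is set to a fixed, non-negative value. The class and method names are hypothetical, and the file content is passed in as a string so the example stays self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: checks whether broker.id is pinned in a broker config.
final class BrokerIdCheck {
    // Returns true only if broker.id is present and is a non-negative integer;
    // a missing key or -1 means the id would be generated automatically.
    static boolean hasFixedBrokerId(String serverPropertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(serverPropertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String id = props.getProperty("broker.id");
        if (id == null) {
            return false;
        }
        try {
            return Integer.parseInt(id.trim()) >= 0;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasFixedBrokerId("broker.id=1\n"));
    }
}
```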

You have to keep the number of Kafka replicas and the replication factor used by your code the same.
In my case I use 3 replicas and a replication factor of 3.

The solution for me was that I had to make sure KAFKA_ADVERTISED_HOST_NAME was the correct IP address of the server.

We had the same issue; the replica count and the replication factor were both 3, and the partition count was 1. I increased the partition count to 10 and it started working.

We faced the same issue in production too. The code had been working fine for a long time when we suddenly got this exception.
We found no issue in the code, so we asked the deployment team to restart ZooKeeper. Restarting it solved the issue.
