I'm performing a load test for an application that keeps restarting each day, and under a certain amount of load the app gracefully restarts with the following logs:
AbstractCamelContext - Apache Camel 3.2.0 (CamelContext: MyApplication) is shutting down
DefaultShutdownStrategy - Starting to graceful shutdown 22 routes (timeout 45 seconds)
DefaultShutdownStrategy - Waiting as there are still 800 inflight and pending exchanges to complete, timeout in 45 seconds. Inflights per route: [route1 = 200, routingSlip = 200, createAERoute = 121, assignSeatRoute = 71, addSRRRoute = 8, shoppingCart = 200]
I've checked the Camel docs for Graceful Shutdown - Apache Camel and understand how it works once it has already been triggered, but I'm wondering what exactly causes the graceful shutdown to begin in the first place. Does it check JVM memory usage against the max heap? CPU usage? Container limits? Any information is appreciated.
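For context, I do understand that the 45-second window in the logs is the shutdown timeout and that it can be tuned on the context's ShutdownStrategy, e.g. (a minimal sketch of what I mean):

import java.util.concurrent.TimeUnit;
import org.apache.camel.CamelContext;

// Sketch only: tune the graceful-shutdown window seen in the logs above.
void tuneShutdown(CamelContext context) {
    context.getShutdownStrategy().setTimeout(45);                 // the 45s from the logs
    context.getShutdownStrategy().setTimeUnit(TimeUnit.SECONDS);
    context.getShutdownStrategy().setShutdownNowOnTimeout(true);  // force-stop remaining inflight exchanges at timeout
}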
Related
I have a situation while updating and restarting applications with my self-developed CI/CD module.
I have Eureka as the registry center and Zuul as the gateway.
By running a shell script that uses the kill -15 command, I want to shut down my applications gracefully, and that shell script runs correctly.
But so far, while testing and observing this CI/CD module, I have found that restarting the Zuul gateway application takes a very long time to shut down (about 5 minutes of waiting, while the other applications take less than 5 seconds).
As far as I know, Spring Boot first shuts down the thread pools so that new requests are rejected, then lets the remaining threads finish, and then shuts down the application context.
When restarting my gateway application, I go through these steps:
pull down this gateway service from the nginx upstream;
pull down this gateway service from the Eureka server, but do not shut it down yet;
wait for 90 seconds:
30 (Eureka server refreshing its readable server-list cache, default 30s)
+ 30 (Eureka client fetch interval, default 30s)
+ 30 (Ribbon refreshing after the Eureka client fetches the server-list cache, default 30s)
use kill -15 applicationPid to shut down the application;
loop to check whether that PID has exited (a rough Java equivalent of this check is sketched after the list);
start the new application;
wait for 60 seconds until the application is reachable via the Eureka server's API:
30 (Eureka client fetch interval, default 30s)
+ 30 (Ribbon refreshing after the Eureka client fetches the server-list cache, default 30s)
pull this gateway service back up in nginx.
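For reference, the PID check in the list above is done in shell in my script; a rough Java (9+) equivalent using ProcessHandle, just to make the check explicit, would be something like:

// Sketch: poll until the process with the given PID has exited, with a simple timeout.
static void waitForExit(long pid, long timeoutMillis) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false)) {
        if (System.currentTimeMillis() > deadline) {
            throw new IllegalStateException("PID " + pid + " is still alive after " + timeoutMillis + " ms");
        }
        Thread.sleep(1000); // check once per second
    }
}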
The test plan is shown below:
Requests are sent from 20 threads, each thread sending 3 requests per second.
There are 2 Linux servers, A and B, each running a gateway service.
When shutting down A's gateway, nginx points to B and lets B carry the load, and the same applies when B's gateway is shutting down.
As I observed, all requests were resolved correctly and no errors occurred while restarting the gateway applications.
But I don't know why shutting down the gateway application takes so much time. There are no requests coming in at all after nginx is pulled down, yet the application just stays stuck there, and there seem to be no useful logs to show what's going on.
After several minutes the application finally shuts down.
If I send no requests, the gateway application shuts down immediately and gracefully.
When it's stuck, the console log is shown below:
....normal log....
2021-07-19 14:42:08.195 [app:web-gateway,traceId:,spanId:,parentId:] [SpringContextShutdownHook] INFO | EurekaServiceRegistry.java:65 | o.s.c.n.e.s.EurekaServiceRegistry | Unregistering application WEB-GATEWAY with eureka with status DOWN
2021-07-19 14:42:08.195 [app:web-gateway,traceId:,spanId:,parentId:] [SpringContextShutdownHook] WARN | DiscoveryClient.java:1351 | c.netflix.discovery.DiscoveryClient | Saw local status change event StatusChangeEvent [timestamp=1626676928195, current=DOWN, previous=UP]
2021-07-19 14:42:08.195 [app:web-gateway,traceId:,spanId:,parentId:] [DiscoveryClient-InstanceInfoReplicator-0] INFO | DiscoveryClient.java:870 | c.netflix.discovery.DiscoveryClient | DiscoveryClient_WEB-GATEWAY/192.168.24.200:web-gateway:8005:NEW_GATEWAY_DEFAULT_GROUP: registering service...
2021-07-19 14:42:08.199 [app:web-gateway,traceId:,spanId:,parentId:] [DiscoveryClient-InstanceInfoReplicator-0] INFO | DiscoveryClient.java:879 | c.netflix.discovery.DiscoveryClient | DiscoveryClient_WEB-GATEWAY/192.168.24.200:web-gateway:8005:NEW_GATEWAY_DEFAULT_GROUP - registration status: 204
2021-07-19 14:42:08.252 [app:web-gateway,traceId:,spanId:,parentId:] [Thread-17] INFO | EurekaNotificationServerListUpdater.java:71 | c.n.n.l.EurekaNotificationServerListUpdater | Shutting down the Executor for EurekaNotificationServerListUpdater
2021-07-19 14:42:08.745 [app:web-gateway,traceId:,spanId:,parentId:] [SpringContextShutdownHook] INFO | DirectJDKLog.java:173 | o.a.coyote.http11.Http11NioProtocol | Pausing ProtocolHandler ["http-nio-8005"]
2021-07-19 14:43:18.087 [app:web-gateway,traceId:,spanId:,parentId:] [AsyncResolver-bootstrap-executor-0] INFO | ConfigClusterResolver.java:43 | c.n.d.s.r.aws.ConfigClusterResolver | Resolving eureka endpoints via configuration
.....stuck here.....
Because I have manually pulled the gateway application down from Eureka, the 204 status shown in the application log here is acceptable.
I once guessed that the 204 response was what kept the application from shutting down. But the other applications, which also bear requests, shut down immediately and gracefully after the kill -15 command is called. Only the gateway application gets stuck.
Could anyone tell me how to inspect the stuck application to see what's going on after the kill -15 command has been issued?
The problem is solved.
Never question a stable architecture...
I had something wrong in my thread pool: when killing with kill -15, my customized thread pool still had many tasks that had not finished.
By checking the JVM stack I found the problem. After correcting the code, the problem was solved.
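For anyone hitting something similar: a thread dump (for example with jstack <pid>) shows which pools still have live worker threads. In my case the fix boiled down to stopping the custom pool explicitly during shutdown, roughly like this (a sketch; the pool itself is hypothetical):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: stop a custom pool explicitly so its non-daemon worker threads
// don't keep the JVM alive after kill -15.
static void shutdownPool(ExecutorService customPool) throws InterruptedException {
    customPool.shutdown();                                    // stop accepting new tasks
    if (!customPool.awaitTermination(30, TimeUnit.SECONDS)) { // give running tasks time to finish
        customPool.shutdownNow();                             // then interrupt whatever is left
    }
}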
I have an application that calls an external service, and in case of connection errors the application retries a certain number of times.
There are instances where the application has to be shut down on receipt of SIGTERM. During the shutdown process, Camel waits for the in-flight messages to be completed, which is good, but at times the connection-error retries also kick in, causing long delays in shutdown. Is there a way to stop the retries while the application is shutting down?
Per this link from 2012, graceful shutdown was made aggressive after the timeout, but it seems that is no longer the case with v2.24.0.
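One idea I have been considering, assuming the retries come from onException redelivery, is to guard the redelivery with retryWhile and give up once the context reports that it is stopping; a rough sketch:

import java.net.ConnectException;
import org.apache.camel.builder.RouteBuilder;

public class RetryAwareRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Keep retrying connection errors, but give up as soon as the context starts shutting down.
        onException(ConnectException.class)
                .retryWhile(exchange -> !exchange.getContext().isStopping())
                .redeliveryDelay(2000)
                .handled(true)
                .log("giving up retries, context is stopping");

        from("direct:callExternalService")          // hypothetical route
                .to("http://external-service/api"); // hypothetical external call
    }
}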
I am working on a Flink application that sinks to Kafka. I created a Kafka producer that has the default pool size of 5. I have enabled checkpointing with the following config:
env.enableCheckpointing(1800000); // checkpointing every 30 minutes
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// make sure at least 5 seconds of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(60000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
The app sometimes keeps crashing with the following exception. Is this an issue with the Kafka producer pool size or with the checkpoints?
2020-03-20 22:31:23,859 INFO org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction - FlinkKafkaProducer011 0/1 aborted recovered transaction TransactionHolder{handle=KafkaTransactionState [transactionalId=FileSplitReader -> metrics-map -> Sink: components-topic-sink-4ab008489d4c8ed0fe577883438cc1ff-1, producerId=21, epoch=3], transactionStartTime=1584742933826}
2020-03-20 22:31:23,860 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during disposal of stream operator.
java.lang.NullPointerException
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator.dispose(ContinuousFileReaderOperator.java:164)
at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:668)
at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:579)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:481)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
at java.lang.Thread.run(Thread.java:748)
2020-03-20 22:31:23,861 INFO org.apache.flink.runtime.taskmanager.Task - FileSplitReader -> metrics-map -> Sink: components-topic-sink (1/1) (92b7f3ed8f6362fe0087efd40eb94016) switched from RUNNING to FAILED.
org.apache.flink.streaming.connectors.kafka.FlinkKafka011Exception: Too many ongoing snapshots. Increase kafka producers pool size or decrease number of concurrent checkpoints.
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.createTransactionalProducer(FlinkKafkaProducer011.java:934)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.beginTransaction(FlinkKafkaProducer011.java:701)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.beginTransaction(FlinkKafkaProducer011.java:97)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.beginTransactionInternal(TwoPhaseCommitSinkFunction.java:394)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.initializeState(TwoPhaseCommitSinkFunction.java:385)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.initializeState(FlinkKafkaProducer011.java:862)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:284)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:1006)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:454)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:449)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:461)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
at java.lang.Thread.run(Thread.java:748)
I recommend you upgrade to the latest flink/kafka connector -- it looks like you're running FlinkKafkaProducer011, which is intended for Kafka 0.11.
You should be using FlinkKafkaProducer from the universal Kafka connector: flink-connector-kafka. Since Flink 1.9, this uses the Kafka 2.2.0 client.
With Maven you want to specify:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
Or replace 2.11 with 2.12 if you are using Scala 2.12.
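For example, a minimal exactly-once sink with the universal connector might look roughly like this (topic name and broker address are placeholders; some constructor overloads also take a kafkaProducersPoolSize argument if the default pool of 5 is too small):

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092"); // placeholder broker address
// keep the transaction timeout within the broker's transaction.max.timeout.ms
props.setProperty("transaction.timeout.ms", "900000");

KafkaSerializationSchema<String> schema = (element, timestamp) ->
        new ProducerRecord<>("my-topic", element.getBytes(StandardCharsets.UTF_8));

FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "my-topic",                                   // placeholder topic
        schema,
        props,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
// stream.addSink(producer);  // attach to your DataStream<String>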
It's hard to tell without access to the environment.
It may be related to the specific code you are running. You are basically hitting this exception.
A couple of things:
This is a similar issue that was related to an array in the code:
Interrupted while joining ioThread / Error during disposal of stream operator in flink application
It sounds like you are running in Kubernetes, and if you look at this you can see that the problem could be related to a failed teardown or a lack of connectivity between the job and task managers, so you may want to check the networking in your Kubernetes cluster and make sure that all your Flink pods can communicate with each other.
This is a follow-up to a previous question I asked regarding high latency in our Kafka Streams services (Kafka Streams rebalancing latency spikes on high throughput kafka-streams services).
As a quick reminder, our stateless service has very tight latency requirements and we are facing excessive latency (some messages are consumed more than 10 seconds after being produced), especially when a consumer leaves the group gracefully.
After further investigation we found out that, at least for small consumer groups, the rebalance takes less than 500 ms. So we wondered: where is this huge latency (>10 s) when removing one consumer coming from?
We realized that it is the time between the consumer exiting gracefully and the rebalance kicking in.
Those previous tests were executed with all-default configurations in both Kafka and the Kafka Streams application.
We changed the configurations to:
properties.put("max.poll.records", 50); // defaults to 1000 in kafkastreams
properties.put("auto.offset.reset", "latest"); // defaults to latest
properties.put("heartbeat.interval.ms", 1000);
properties.put("session.timeout.ms", 6000);
properties.put("group.initial.rebalance.delay.ms", 0);
properties.put("max.poll.interval.ms", 6000);
And the result is that the time for the rebalance to start dropped to a bit more than 5 seconds.
We also tested killing a consumer non-gracefully with 'kill -9'; the time to trigger the rebalance is exactly the same.
So we have some questions:
- We expected that when a consumer stops gracefully the rebalance is triggered right away. Is that the expected behavior? Why isn't it happening in our tests?
- How can we reduce the time between a consumer gracefully exiting and the rebalance being triggered? What are the trade-offs? More unneeded rebalances?
For more context: our Kafka version is 1.1.0; judging from the libs (we found, for example, kafka/kafka_2.11-1.1.0-cp1.jar), we installed Confluent Platform 4.1.0. On the consumer side, we are using Kafka Streams 2.1.0.
Thank you!
Kafka Streams does not send a "leave group request" when an instance is shut down gracefully -- this is on purpose. The goal is to avoid expensive rebalances if an instance is bounced (e.g., if one upgrades an application, or if one runs in a Kubernetes environment and a pod is restarted quickly and automatically).
To achieve this, a non-public configuration is used. You can override the config via:
props.put("internal.leave.group.on.close", true); // Streams' default is `false`
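A fuller sketch of where that property would go (the application id and bootstrap servers are placeholders); with the override in place, close() makes the instance leave the group immediately instead of waiting for the session timeout to expire, so the rebalance starts right away:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

static KafkaStreams startStreams(Topology topology) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
    props.put("internal.leave.group.on.close", true);                 // non-public config: send a LeaveGroup on close
    KafkaStreams streams = new KafkaStreams(topology, props);
    streams.start();
    return streams;
}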
I'm running into an issue with my KStreams-based application: it runs once, and when I stop/restart it, it gets 'stuck' and won't make progress anymore until I delete the various topics it has created. This doesn't happen every time, but more often than not.
Typically this happens when I copy a new(er) version to the work VM (in the same subnet as the Kafka cluster, for speed reasons).
When it's wedged I'll see:
"Connect": org.apache.zookeeper.ZooKeeper - Initiating client connection
"Client": [StreamThread-1] INFO o.a.k.s.p.internals.StreamTask - Creating restoration consumer client
"Ping" : I'll see these and the app won't shut down normally. It must be kill'd.
In any of these cases the message will typically repeat indefinitely (well, at least all the way through a lunch plus a meeting, i.e. too long).
The app is shutting down 'cleanly' before this happens.
What am I doing wrong?
Edit:
This most recent time, after 20 minutes, I got a stream of errors:
org.apache.kafka.common.errors.TimeoutException: Batch containing 101 record(s) expired due to timeout while requesting metadata from brokers
followed by:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member
--> which is a good trick since there is no other member.
If you are running with Kafka 0.10.0.x then you may be hitting a known issue:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread
This has been resolved in the upcoming 0.10.1.0 release of Kafka, and I would recommend trying out the new version to see if this issue goes away.
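If it helps, part of what KIP-62 changes is that consumer heartbeats are sent from a background thread, so once you are on 0.10.1.0+ the session timeout no longer has to cover your longest processing time between polls; the two concerns become separate consumer settings (the values below are just placeholders):

import java.util.Properties;

Properties props = new Properties();
// post-KIP-62 (0.10.1.0+): failure detection and processing time are tuned independently
props.put("session.timeout.ms", "10000");    // how quickly a dead instance is detected
props.put("max.poll.interval.ms", "300000"); // how long processing may take between poll() calls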