Polling for Pod's ready state - java

I am using the fabric8 library for java for deploying appliations on a Kubernetes cluster.
I want to poll the status of pods to know when they are ready. I started writing my own until I read about the Watcher.
I implemented something like this
deployment =
kubeClient.extensions().deployments().inNamespace(namespaceName).create(deployment);
kubeClient.pods().inNamespace(namespaceName).watch(new Watcher<Pod>() {
#Override
public void eventReceived(io.fabric8.kubernetes.client.Watcher.Action action,
Pod resource) {
logger.info("Pod event {} {}", action, resource);
logger.info("Pod status {} , Reason {} ", resource.getStatus().getPhase(),
resource.getStatus().getReason());
}
#Override
// What causes the watcher to close?
public void onClose(KubernetesClientException cause) {
if (cause != null) {
// throw?
logger.error("Pod event {} ", cause);
}
}
});
I m not sure if I understand the Watcher functionality correctly. Does it time out? Or Do I still write my poller inside the eventReceivedMethod()? What is the use case for a watcher?

// What causes the watcher to close?
Since watches are implemented using websockets, a connection is subject to closure at any time for any reason or no reason.
What is the use case for a watcher?
I would imagine it's two-fold: not paying the TCP/IP + SSL connection setup cost, making it quicker, and having your system be event-driven rather than simple polling, which will make every participant use less resources (the server and your client).
But yes, the answer to your question is that you need to have retry logic to reestablish the watcher if you have not yet reached the Pod state you were expecting.

Related

Spring Cloud Stream - notice and handle errors in broker

I am fairly new to developing distributed applications with messaging, and to Spring Cloud Stream in particular. I am currently wondering about best practices on how to deal with errors on the broker side.
In our application, we need to both consume and produce messages from/to multiple sources/destinations like this:
Consumer side
For consuming, we have defined multiple #Beans of type java.util.function.Consumer. The configuration for those looks like this:
spring.cloud.stream.bindings.consumeA-in-0.destination=inputA
spring.cloud.stream.bindings.consumeA-in-0.group=$Default
spring.cloud.stream.bindings.consumeB-in-0.destination=inputB
spring.cloud.stream.bindings.consumeB-in-0.group=$Default
This part works quite well - wenn starting the application, the exchanges "inputA" and "inputB" as well as the queues "inputA.$Default" and "inputB.$Default" with corresponding binding are automatically created in RabbitMQ.
Also, in case of an error (e.g. a queue is suddenly not available), the application gets notified immediately with a QueuesNotAvailableException and continuously tries to re-establish the connection.
My only question here is: Is there some way to handle this exception in code? Or, what are best practices to deal with failures like this on broker side?
Producer side
This one is more problematic. Producing messages is triggered by some internal logic, we cannot use function #Beans here. Instead, we currently rely on StreamBridge to send messages. The problem is that this approach does not trigger creation of exchanges and queues on startup. So when our code calls streamBridge.send("outputA", message), the message is sent (result is true), but it just disappears into the void since RabbitMQ automatically drops unroutable messages.
I found that with this configuration, I can at least get RabbitMQ to create exchanges and queues as soon as the first message is sent:
spring.cloud.stream.source=produceA;produceB
spring.cloud.stream.default.producer.requiredGroups=$Default
spring.cloud.stream.bindings.produceA-out-0.destination=outputA
spring.cloud.stream.bindings.produceB-out-0.destination=outputB
I need to use streamBridge.send("produceA-out-0", message) in code to make it work, which is not too great since it means having explicit configuration hardcoded, but at least it works.
I also tried to implement the producer in a Reactor style as desribed in this answer, but in this case the exchange/queue also is not created on application startup and the sent message just disappears even though the return status of the sending method is "OK".
Failures on the broker side are not registered at all with this approach - when I simulate one e.g. by deleting the queue or the exchange, it is not registered by the application. Only when another message is sent, I get in the logs:
ERROR 21804 --- [127.0.0.1:32404] o.s.a.r.c.CachingConnectionFactory : Shutdown Signal: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no exchange 'produceA-out-0' in vhost '/', class-id=60, method-id=40)
But still, the result of StreamBridge#send was true in this case. But we need to know that sending did actually fail at this point (we persist the state of the sent object using this boolean return value). Is there any way to accomplish that?
Any other suggestions on how to make this producer scenario more robust? Best practices?
EDIT
I found an interesting solution to the producer problem using correlations:
...
CorrelationData correlation = new CorrelationData(UUID.randomUUID().toString());
messageHeaderAccessor.setHeader(AmqpHeaders.PUBLISH_CONFIRM_CORRELATION, correlation);
Message<String> message = MessageBuilder.createMessage(payload, messageHeaderAccessor.getMessageHeaders());
boolean sent = streamBridge.send(channel, message);
try {
final CorrelationData.Confirm confirm = correlation.getFuture().get(30, TimeUnit.SECONDS);
if (correlation.getReturned() == null && confirm.isAck()) {
// success logic
} else {
// failed logic
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
// failed logic
} catch (ExecutionException | TimeoutException e) {
// failed logic
}
using these additional configurations:
spring.cloud.stream.rabbit.default.producer.useConfirmHeader=true
spring.rabbitmq.publisher-confirm-type=correlated
spring.rabbitmq.publisher-returns=true
This seems to work quite well, although I'm still clueless about the return value of StreamBridge#send, it is always true and I cannot find information in which cases it would be false. But the rest is fine, I can get information on issues with the exchange or the queue from the correlation or the confirm.
But this solution is very much focused on RabbitMQ, which causes two problems:
our application should be able to connect to different brokers (e.g. Azure Service Bus)
in tests we use Kafka binder and I don't know how to configure the application context to make it work in this case, too
Any help would be appreciated.
On the consumer side, you can listen for an event such as the ListenerContainerConsumerFailedEvent.
https://docs.spring.io/spring-amqp/docs/current/reference/html/#consumer-events
On the producer side, producers only know about exchanges, not any queues bound to them; hence the requiredGroups property which causes the queue to be bound.
You only need spring.cloud.stream.default.producer.requiredGroups=$Default - you can send to arbitrary destinations using the StreamBridge and the infrastructure will be created.
#SpringBootApplication
public class So70769305Application {
public static void main(String[] args) {
SpringApplication.run(So70769305Application.class, args);
}
#Bean
ApplicationRunner runner(StreamBridge bridge) {
return args -> bridge.send("foo", "test");
}
}
spring.cloud.stream.default.producer.requiredGroups=$Default

Is there any scenario in which a Vert.x MessageConsumer can fail to unregister from the event bus?

public void unregisterConsumer(MessageConsumer<Object> mc) {
mc.unregister(result -> {
if(result.succeeded())
return;
else
//uh oh
});
}
In the event of a failed AsyncResult, would it be unwise to just call unregisterConsumer again, perhaps with a vertx.setTimer(5000, id -> unregisterConsumer(mc));?
If Vert.x is not clustered, the chances you get a failure are negligible (it would only happen in case of a bug).
If Vert.x is clustered, this may happen if the underlying cluster manager fails to remove the subscription (e.g. if network communication is lost).
As for retrying, it can be a good idea if your application registers consumers dynamically. Otherwise you can ignore the failure and let the process die. The cluster manager will clean-up the subscriptions, eventually.

Akka actors are getting stopped

I'm using akka actors to achieve parallel processing of some http requests. I've initailized a pool of actors using RoundRobinPool like:
ActorRef myActorPool = actorSystem.actorOf(new RoundRobinPool(200).props(Props.create(MyActor.class, args)), MyActor.class.getSimpleName());
It is working fine. But after the process is running for sometime, i'm getting following error
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Recipient[Actor[akka://web_server/user/MyActor#-769383443]] had already been terminated. Sender[null] sent the message of type "com.data.model.Request".
So I've overridden postStop method added a log statement there.
#Override
public void postStop() {
log.warn("Actor is stopped");
}
Now, I can see in the logs that the actors are getting stopped. But I'm not sure for which request it is happening. Once all the actors in the pool terminates (200 is the pool size I've set), I'm getting AskTimeoutException as said before. Is there anyway to debug why the actors are getting terminated?
EDIT 1
In the controller, I'm using the created actor pool like
CompletableFuture<Object> s = ask(myActorPool, request, 1000000000).toCompletableFuture();
return s.join();
The actor processes one kind of messages only.
#Override
public AbstractActor.Receive createReceive() {
return receiveBuilder()
.match(Request.class, this::process)
.build();
}
private void process(Request request) {
try {
// code here
} catch (Exception e) {
log.error(e.getMessage(), e);
getSender().tell(new akka.actor.Status.Failure(e), getSelf());
}
}
As far as you have described the probelm it seems you are processing your data inside the ask call and taking more time than askTimeout, and you are getting the error.
What you can do is either increase the askTimeout or do less processing inside tha ask call.
you should not do CPU bound operations inside the ask call, it can cause slowing down your system it is recommended that you should do the I/O bound operations inside the ask call. That way you can leverage the actors.
For example:
val x=x+1 // CPU bound operation should not do inside the ask call
some db call you can do inside the ask call that is preferable.

How to unblock an app blocked due to publishing events on rabbitmq server having disk space issues?

I am using rabbitmq java client to publish events from my java application.
When the server goes out of space, all the publish messages are blocked.
In my application, I am ok with events not being published but, I want the application to continue to work. To that end I looked at the documentation and found BlockedListener on the connection[1]
I was able to use this. But, the problem is unless I throw a runtime exception in the handleBlocked method, my app/publisher is still blocked. I tried to close/abort the connection but, nothing worked.
Is there any other graceful way to unblock my app
This is my BlockedListener code
private class BlockedConnectionHandler implements BlockedListener {
#Override
public void handleBlocked(String reason) throws IOException {
s_logger.error("rabbitmq connection is blocked with reason: " + reason);
throw new RuntimeException("unblocking the parent thread");
}
#Override
public void handleUnblocked() throws IOException {
s_logger.info("rabbitmq connection in unblocked");
}
}
[1] https://www.rabbitmq.com/connection-blocked.html

How do I stop a Camel route when JVM gets to a certain heap size?

I am using Apache Camel to connect to various endpoints, including JMS topics, and write to a database. Sometimes the database connection fails (for whatever reason, database issue, network blip, etc) and the messages from the topic subscriber start backing up. At a certain point, there are so many messages backed up waiting to be written to the database that the application throws an out of memory error. So far I understand all that.
The problem I have is the following: When the application is frantically trying to garbage collect before eventually giving up and accepting that it is out of memory, the application stops working, but is still alive. This means that the topic subscriber is still seen as active by the JMS provider, but not reading anything off the topic, so the provider starts queueing up the messages. Eventually the provider falls over also when the maximum depth runs out.
How can I configure my application to either disconnect when reaching a certain heap usage, or kill itself completely much much faster when running out of memory? I believe there are some JVM parameters that allow the application to kill itself much quicker when running out of memory, but I am wondering if that is the best solution or whether there is another way?
First of all I think you should use a JDBC connection pool that is capable of refreshing failed connections. So you do not run into the described scenario in the first place. At least not if the DB/network issue is short lived.
Next I'd protect the message broker by applying producer flow control (at least thats how it is called in ActiveMQ). I.e. prevent message producers from submitting more messages if a certain memory threshold has been breached. If the thresholds are set correctly, then that will prevent your message broker from falling over.
As for your original question: I'd use JMX to monitor the VM. If some metric, e.g. memory, breaches a threshold then you can suspend or shut down the route or the whole Camel context via the MBeans Camel exposes.
You can control (start/stop and suspend/resume) Camel routes using the Camel context methods .stop(), .start(), .suspend() and .resume().
You can spin a separate thread that monitors the current VM memory and stops the required route when a certain condition is met.
new Thread() {
#Override
public void run() {
while(true) {
long free = Runtime.getRuntime().freeMemory();
boolean routeRunning = camelContext.isRouteStarted("yourRoute");
if (free < threshold && routeRunning) {
camelContext.stopRoute("yourRoute");
} else if (free > threshold && !routeRunning) {
camelContext.startRoute("yourRoute");
}
// Check every 10 seconds
Thread.sleep(10000);
}
}
}
As commented in the other answer, relying on this is not particularly robust, but at least a little more robust than getting an OutOfMemoryException. Note that you need to .stop() the route, .suspend() does not deallocate resources, which means the connection with the queue provider is still open and the service looks like it is open for business.
You can also stop the route as part of the error handling of the route itself (this is possibly more robust but would require manual intervention to restart the route once the error is cleared, or a scheduled route that periodically checks if the error condition still exists and restart the route if it is gone). The thing to keep in mind is that you cannot stop a route from the same thread that is servicing the route at the time so you need to spin a separate thread that does the stopping. For example:
route("sample").from("jms://myqueue")
// Handle SQL Exceptions by shutting down the route
.onException(SQLException.class)
.process(new Processor() {
// This processor spawns a new thread that stops the current route
Thread stop;
#Override
public void process(final Exchange exchange) throws Exception {
if (stop == null) {
stop = new Thread() {
#Override
public void run() {
try {
// Stop the current route
exchange.getContext().stopRoute("sample");
} catch (Exception e) {}
}
};
}
// start the thread in background
stop.start();
}
})
.end()
// Standard route processors go here
.to(...);

Categories