How to handle RMQ connection loss? - java

We're using Java RMQ client in Scala and we're experiencing some issues on DEV environment. We have this fallback strategy set up:
def addConnectionShutdownListener(connection: Connection): Unit ={
connection.addShutdownListener { cause: ShutdownSignalException =>
logger.error(s"Error on RMQ connection: ${cause.getMessage}", cause)
if (exitOnFail) {
logger.info("Terminating process with RMQ consumer is shut down")
System.exit(1)
}
else if (retryOnFail) {
logger.info(s"Retrying to connect")
retryCreatingConnection(1)
}
}
}
addConnectionShutdownListener(rmqConnection)
In a similar fashion, I added channel connection shutdown listener.
So there are 2 strategies which we use (and modify through config)
exit on fail
retry on fail
I set up exit on fail strategy and sometimes it works correctly. I see this line on log when error happens Terminating process with RMQ consumer is shut down and service is restarted correctly (kubernetes pod is shut down and it is started automatically again). I disabled RMQ auto recovery because it didn't worked at all.
The problem is sometimes some queues are left without consumers and messages are being queued and hanging, but there is no this error message in log. It's really hard to test it, since I don't know what circumstances happened on our DEV environment.
What could happen?
Is there a better way to handle a connection loss, or to be more precise - to handle a scenario when consumers are somehow detached from queue?
Thanks in advance,
Amer

Related

Spring Cloud Stream - notice and handle errors in broker

I am fairly new to developing distributed applications with messaging, and to Spring Cloud Stream in particular. I am currently wondering about best practices on how to deal with errors on the broker side.
In our application, we need to both consume and produce messages from/to multiple sources/destinations like this:
Consumer side
For consuming, we have defined multiple #Beans of type java.util.function.Consumer. The configuration for those looks like this:
spring.cloud.stream.bindings.consumeA-in-0.destination=inputA
spring.cloud.stream.bindings.consumeA-in-0.group=$Default
spring.cloud.stream.bindings.consumeB-in-0.destination=inputB
spring.cloud.stream.bindings.consumeB-in-0.group=$Default
This part works quite well - wenn starting the application, the exchanges "inputA" and "inputB" as well as the queues "inputA.$Default" and "inputB.$Default" with corresponding binding are automatically created in RabbitMQ.
Also, in case of an error (e.g. a queue is suddenly not available), the application gets notified immediately with a QueuesNotAvailableException and continuously tries to re-establish the connection.
My only question here is: Is there some way to handle this exception in code? Or, what are best practices to deal with failures like this on broker side?
Producer side
This one is more problematic. Producing messages is triggered by some internal logic, we cannot use function #Beans here. Instead, we currently rely on StreamBridge to send messages. The problem is that this approach does not trigger creation of exchanges and queues on startup. So when our code calls streamBridge.send("outputA", message), the message is sent (result is true), but it just disappears into the void since RabbitMQ automatically drops unroutable messages.
I found that with this configuration, I can at least get RabbitMQ to create exchanges and queues as soon as the first message is sent:
spring.cloud.stream.source=produceA;produceB
spring.cloud.stream.default.producer.requiredGroups=$Default
spring.cloud.stream.bindings.produceA-out-0.destination=outputA
spring.cloud.stream.bindings.produceB-out-0.destination=outputB
I need to use streamBridge.send("produceA-out-0", message) in code to make it work, which is not too great since it means having explicit configuration hardcoded, but at least it works.
I also tried to implement the producer in a Reactor style as desribed in this answer, but in this case the exchange/queue also is not created on application startup and the sent message just disappears even though the return status of the sending method is "OK".
Failures on the broker side are not registered at all with this approach - when I simulate one e.g. by deleting the queue or the exchange, it is not registered by the application. Only when another message is sent, I get in the logs:
ERROR 21804 --- [127.0.0.1:32404] o.s.a.r.c.CachingConnectionFactory : Shutdown Signal: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no exchange 'produceA-out-0' in vhost '/', class-id=60, method-id=40)
But still, the result of StreamBridge#send was true in this case. But we need to know that sending did actually fail at this point (we persist the state of the sent object using this boolean return value). Is there any way to accomplish that?
Any other suggestions on how to make this producer scenario more robust? Best practices?
EDIT
I found an interesting solution to the producer problem using correlations:
...
CorrelationData correlation = new CorrelationData(UUID.randomUUID().toString());
messageHeaderAccessor.setHeader(AmqpHeaders.PUBLISH_CONFIRM_CORRELATION, correlation);
Message<String> message = MessageBuilder.createMessage(payload, messageHeaderAccessor.getMessageHeaders());
boolean sent = streamBridge.send(channel, message);
try {
final CorrelationData.Confirm confirm = correlation.getFuture().get(30, TimeUnit.SECONDS);
if (correlation.getReturned() == null && confirm.isAck()) {
// success logic
} else {
// failed logic
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
// failed logic
} catch (ExecutionException | TimeoutException e) {
// failed logic
}
using these additional configurations:
spring.cloud.stream.rabbit.default.producer.useConfirmHeader=true
spring.rabbitmq.publisher-confirm-type=correlated
spring.rabbitmq.publisher-returns=true
This seems to work quite well, although I'm still clueless about the return value of StreamBridge#send, it is always true and I cannot find information in which cases it would be false. But the rest is fine, I can get information on issues with the exchange or the queue from the correlation or the confirm.
But this solution is very much focused on RabbitMQ, which causes two problems:
our application should be able to connect to different brokers (e.g. Azure Service Bus)
in tests we use Kafka binder and I don't know how to configure the application context to make it work in this case, too
Any help would be appreciated.
On the consumer side, you can listen for an event such as the ListenerContainerConsumerFailedEvent.
https://docs.spring.io/spring-amqp/docs/current/reference/html/#consumer-events
On the producer side, producers only know about exchanges, not any queues bound to them; hence the requiredGroups property which causes the queue to be bound.
You only need spring.cloud.stream.default.producer.requiredGroups=$Default - you can send to arbitrary destinations using the StreamBridge and the infrastructure will be created.
#SpringBootApplication
public class So70769305Application {
public static void main(String[] args) {
SpringApplication.run(So70769305Application.class, args);
}
#Bean
ApplicationRunner runner(StreamBridge bridge) {
return args -> bridge.send("foo", "test");
}
}
spring.cloud.stream.default.producer.requiredGroups=$Default

ActiveMQ - cannot rollback non transaced session INVIDUAL_ACK

Is it possible to rollback async processed message in ActiveMQ? I'm consuming next message while first one is still processing, so while I'm trying to rollback the first message on another (not activemq pool) thread, I'm getting above error. Eventually should I sednd message to DLQ manually?
Message error handling can work a couple ways:
Broker-side 'redelivery policy'. Where the client invokes a rollback n number (default is usually 6 retries) of times and the broker automatically moves the message to a Dead Letter Queue (DLQ)
Client-side. Application consumes the message and then produces to the DLQ.
Option #1 is good for unplanned/planned outages-- database down, etc. Where you want automatic retry. The re-delivery policy can also be configured when the client connects to the broker.
Option #2 is good for 'bad data' scenarios where you know the message will never be able to be processed. This is ideal, because you can move the message on the 1st consumption and not have to reject the message n number of times.
When you combine infinite retry with #1 and include #2 in your application flow, you can have a robust process flow of automatic retry, and move-bad-data-out-of-the-way-quickly. Best of breed =)
ActiveMQ Redelivery policy

spring rabbit exception in consumer issue

I have a spring rabbit consumer:
public class SlackIdle1Consumer extends AbstractMessageConsumer {
#Override public void process(Message amqpMessage, Channel channel)
throws Exception {
/*very bad exception goes here.
it causes amqp message to be rejected and if no other consumer is available and error
still persists, the message begins looping over and over.
And when the error is fixed,
those messages are being processed but the result of this procession may be harmful.
*/
}
}
}
And somewhere inside an exception happens. Lets imagine this is a bad exception - development logic error. So amqp message begins to spin indefinitely, and when error is fixed and consumer restarted, all old messages are being processed, and it's bad, because logic and data may change since those messages were sent. How to handle it properly?
So the question is: how to fix this situation properly? Should I wrap all my code to try-catch clause or will I have to develop 'checks' in each consumer to prevent consistency issues in my app?
There are several options:
Set the container's defaultRequeueRejected property to false so failed messages are always rejected (discarded or sent to a dead letter exchange depending on queue configuration).
If you want some exceptions to be retried and others not, then add a try catch and throw an AmqpRejectAndDontRequeueException to reject those you don't want retried.
Add a custom ErrorHandler to the container, to do the same thing as #2 - determine which exceptions you want retried - documentation here.
Add a retry advice with a recoverer - the default recoverer simply logs the error, the RejectAndDontRequeueRecoverer causes the message to be rejected after retries are exhausted, the RepublishMessageRecoverer is used to write to a queue with additional diagnostics in headers - documentation here.

Spring AMQP stuck queue due to unack'd message

I am using a SimpleMessageListenerContainer and had problems that every hour or so the queue would get stuck and nothing would be processed due to an unack'd message.
I am sure this is due an error that isn't being caught properly but can't trace the issue.
I have set the acknowledge mode to NONE and this "fixed" the issue but it is really just hiding the issue. Also if I want to throw a AmqpException and re-queue the message this doesn't work with acknowledge mode set to NONE.
My question is how can I trace the issue with the queue getting stuck, is there a way to see the payload of the unack'd message? Or is there an acknowledgement mode that will allow acknowledges to not to be needed but re-queue messages if an exception is thrown?
Here is how I am registering a listener:
final SimpleMessageListenerContainer container = new SimpleMessageListenerContainer();
container.setConnectionFactory(connectionFactory);
container.setQueueNames(queueName);
container.setMessageListener(new MQMessageListenerWrapper(listener));
container.setAcknowledgeMode(AcknowledgeMode.NONE);
container.start();
Thanks.
My best guess is your consumer thread is hung someplace upstream of the listener. When control is returned to the container, the message is ack'd or rejected; it can't be left in an unack'd state if the thread returned to the container.
Use jstack <pid> to find out where the consumer thread is stuck.
You are correct NONE is just masking the issue.
When the queue gets stuck look at the connections listening on the specific queue. Could be a sign of some sort of dead lock scenario because of 2 (or more) consumer-threads listening on the same queue - and therefore being blocked by rabbit.
This was an issue within my code that I finally tracked down as it only occurred in a rare instance.
This had nothing to do with Spring AMQP or RabbitMQ just my bad coding :-)

Is this a realistic expectation of a distributed mechanism?

I've been evaluating ActiveMQ as a candidate message broker. I've written some test code to try and get an understanding of ActiveMQ's performance limitations.
I can produce a failure state in the broker by sending messages as fast as possible like this:
try {
while(true) {
byte[] payload = new byte[(int) (Math.random() * 16384)];
BytesMessage message = session.createBytesMessage();
message.writeBytes(payload);
producer.send(message);
} catch (JMSException ex) { ... }
I was surprised that the line
producer.send(message);
blocks when the broker enters a failed state. I was hoping that some exception would be thrown, so there would be some indication that the broker has failed.
I realize that my test code is spamming the broker, and I expect the broker to fail. However, I would prefer that the broker failed "loudly" as opposed to simply blocking.
Is this an unrealistic expectation?
Update:
Uri's answer references an ActiveMQ bug report that was filed in March. The bug description includes a proposal that sounds like what I'm looking for: "if the request on the transport had a timeout (this is to catch failure scenarios, so something that's not expected to reasonably happen), things would have errored out rather than building waiting threads."
However, after 8 months the bug is currently unassigned with a single vote. So I guess the question still stands, is this something ActiveMQ should (will?) implement?
You are testing the 'slow consumer' and producer flowcontrol issue all message brokers have to deal with. Do you wanna fail producers, block them or spool to disk?
Basically the out of the box default in ActiveMQ is to block producers. But you can configure message cursors to spool to disk.
BTW you've not said if you are using queues/topics or persistent/non-persistent; if you are using non persistent topics there are other strategies you can use for discarding messages etc.
Apprently there's a known issue, not sure if it's been fixed:
https://issues.apache.org/activemq/browse/AMQ-1625
Not sure about ActiveMQ config, but other JMS providers have various configuration options - so you maybe able to get ActiveMQ to do as you wish in that situation.
I know Fiorano has options to specify whether providers block or not in this situation.

Categories