RabbitMQ connections dropping and not recovering despite heartbeat setting - java

I am trying to run down a problem with consumer connections to RabbitMQ being dropped in our production environment. The problem occurs after running for a few days, and restarting our application gets it connecting and working fine for a few more days. My guess is that a period of inactivity is causing the issue, and the AMQP heartbeat seems designed for exactly this problem. We are using spring-amqp 1.3.2.RELEASE and setting the requestedHeartbeat on the ConnectionFactory to 10; however, we are still seeing connections drop.
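For reference, the relevant part of our setup looks roughly like this (a minimal sketch; the host, port, and class wiring are placeholders for our real config):

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;

public class RabbitConfig {
    // Sketch of our connection factory; host/port are placeholders.
    public CachingConnectionFactory connectionFactory() {
        CachingConnectionFactory cf = new CachingConnectionFactory("rabbit.example.com", 5672);
        cf.setRequestedHeartBeat(10); // AMQP heartbeat interval, in seconds
        return cf;
    }
}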
The spring-amqp client will reconnect if I completely disconnect from the internet and reconnect, or if I block the connection with a firewall; however, in production it does not even seem to throw an Exception in the log when this happens. That may be because we use slf4j and logback for our logging while Spring uses commons-logging, so the errors appear on System.out rather than in the log. I have added the jcl-over-slf4j bridge to fix that but have not rolled it out yet, so I do not yet have a stack trace to contribute.
One more piece of information about our architecture: we have HAProxy in front of RabbitMQ.
I would like to somehow run the app in debug within Eclipse to see whether the heartbeats are actually going out. I tried to verify with Wireshark, but our traffic uses two-way SSL encryption and I haven't been able to decrypt it yet.
Does anyone have any suggestions? I have been trying to run this down for weeks and I'm running out of ideas. I would greatly appreciate your input.
Thanks!

On 11-Feb-2015 RabbitMQ released 3.4.4, which has support for automatic reconnection. You could roll your own solution, as we did a while back, but it seems easier to just upgrade to the newest version of RabbitMQ.
https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq_v3_4_4/projects/client/RabbitMQ.Client/src/client/impl/AutorecoveringConnection.cs
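The link above is the .NET client; since the question is about Java, the equivalent knobs in the Java client (3.3.0 and later) look roughly like this sketch (host is a placeholder):

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class RecoveringClient {
    public static Connection connect() throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit.example.com");     // placeholder host
        factory.setRequestedHeartbeat(10);         // seconds
        factory.setAutomaticRecoveryEnabled(true); // reconnect after network failures
        factory.setNetworkRecoveryInterval(5000);  // retry every 5 seconds
        return factory.newConnection();
    }
}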

If you have RabbitMQ's autorecovery enabled, note that Spring AMQP prior to 1.4 is not compatible with it; the problem is that the rabbit client restores the connections/channels itself, but Spring AMQP doesn't know about them. Generally, though, this just causes extra connections/channels, since Spring AMQP simply establishes new ones; I have not heard of it causing the problems you describe.
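If you are staying on Spring AMQP 1.3.x, one way to avoid the mismatch is to keep the client's autorecovery off (assuming an amqp-client version, 3.3+, that has the flag at all) and let Spring AMQP do the reconnecting; a sketch, wiring the native factory yourself:

import com.rabbitmq.client.ConnectionFactory;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;

public class NonRecoveringConfig {
    // Sketch: build the native factory explicitly so autorecovery stays off
    // and Spring AMQP owns connection recovery. Host is a placeholder.
    public CachingConnectionFactory connectionFactory() {
        ConnectionFactory rabbit = new ConnectionFactory();
        rabbit.setHost("rabbit.example.com");
        rabbit.setRequestedHeartbeat(10);
        rabbit.setAutomaticRecoveryEnabled(false);
        return new CachingConnectionFactory(rabbit);
    }
}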
If you can't figure it out by fixing your logging configuration, another thing to try is to jstack your process to see what the threads are doing. But you should be able to figure it out from the logs.

Related

Forcefully kill JMS connection threads

First time posting, so hopefully I can give enough info.
We are currently using Apache ActiveMQ to set up a pretty standard producer/consumer queue app. Our application is hosted on various client servers, and at random times/loads we experience issues where the JMS connection permanently dies, so our producer can no longer reach the consumer and we have to restart the producer. We're fairly sure of the cause: we're running out of connections on the JMS cached connection factory, so we need to do a better job of cleaning up and recycling these connections. This is a relatively common issue, described here (our setup is pretty similar):
Is maven shade-plugin culprit for my jar to not work in server
ActiveMQ Dead Connection issue
Our difficulty is that this problem only occurs when the application is deployed on client servers, which we don't have access to because they house confidential client info, so we can't do any debugging or reproduction where the issues occur; and so far we have not been able to reproduce the issue in our local environment.
So, in short: is there any way we could forcefully kill/corrupt our JMS connection threads so that we can reproduce the problem and test various fixes and approaches? Unfortunately we don't have the option of shipping fixes without testing/demo'ing the solutions, so replicating the issue in our local setup is our only option.
Thanks in advance
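One way to simulate such failures without access to the client servers (a sketch, not something from this thread): run the producer's broker connection through a tiny local TCP proxy and kill the proxy to sever the connection abruptly. Hosts and ports below are placeholders.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Deliberately crude single-connection TCP proxy: point the producer's broker
// URL at tcp://localhost:6161, then kill this process (or let the try block
// exit) to simulate a dead JMS connection.
public class KillableProxy {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(6161);
             Socket client = server.accept();
             Socket broker = new Socket("activemq.example.com", 61616)) {
            Thread up = pump(client.getInputStream(), broker.getOutputStream());
            Thread down = pump(broker.getInputStream(), client.getOutputStream());
            up.join();
            down.join();
        } // leaving the try block closes both sockets abruptly
    }

    private static Thread pump(InputStream in, OutputStream out) {
        Thread t = new Thread(() -> {
            byte[] buf = new byte[8192];
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
            } catch (Exception ignored) {
                // stream closed: the severed-connection case we want to test
            }
        });
        t.start();
        return t;
    }
}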

Some Spring WebSocket Sessions never disconnect

I have a websocket solution for duplex communication between mobile apps and a Java backend system. I am using Spring WebSockets with STOMP. I have implemented a ping-pong solution to keep the websockets open longer than 30 seconds, because I need longer sessions than that. Sometimes I get these errors in the logs, which seem to come from checkSession() in Spring's SubProtocolWebSocketHandler:
server.log: 07:38:41,090 ERROR [org.springframework.web.socket.messaging.SubProtocolWebSocketHandler] (ajp-http-executor-threads - 14526905) No messages received after 60205 ms. Closing StandardWebSocketSession[id=214a10, uri=/base/api/websocket].
They are not very frequent, but they happen every day, and the 60-second timeout seems appropriate, since it's hardcoded into the Spring class mentioned above. But then, after the application has been running for a while, I start getting large numbers of these really long-lived 'timeouts':
server.log: 00:09:25,961 ERROR [org.springframework.web.socket.messaging.SubProtocolWebSocketHandler] (ajp-http-executor-threads - 14199679) No messages received after 208049286 ms. Closing StandardWebSocketSession[id=11a9d9, uri=/base/api/websocket].
And at about this time the application starts experiencing problems.
I've been trying to search for this behavior but haven't found it anywhere on the web. Has anyone seen this problem before, know of a solution, or can explain it to me?
We found some things:
We have added our own ping/pong functionality at the STOMP level that runs every 30 seconds.
The mobile clients had a bug that caused them to keep replying to the pings even after going into screensaver mode. This meant that the websocket was never closed or timed out.
On each pong message that the server received, the Spring check found that no 'real' messages had been received for a very long time and triggered the log line to be written. It then tries to close the websocket with this code:
session.close(CloseStatus.SESSION_NOT_RELIABLE);
but I suspect this doesn't close the session correctly. And even if it did, the mobile clients would try to reconnect. So when 30 more seconds have passed, another pong message is sent to the server, causing yet another one of these log lines to be written. And so on, forever...
The solution was to write some server-side code to close old websockets based on this project and also to fix the bug in the mobile clients that made them respond to ping/pong even when being in screensaver mode.
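The gist of that server-side cleanup, as a sketch (the registry, the touch() hook, and the five-minute cutoff are our own application-level choices, not Spring API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.web.socket.CloseStatus;
import org.springframework.web.socket.WebSocketSession;

// Sketch: track last *real* activity per session ourselves and close sessions
// that only ever pong. Requires @EnableScheduling somewhere in the app config.
public class StaleSessionReaper {

    private final Map<WebSocketSession, Long> lastActivity = new ConcurrentHashMap<>();

    public void touch(WebSocketSession session) { // call on every real message
        lastActivity.put(session, System.currentTimeMillis());
    }

    @Scheduled(fixedRate = 30_000)
    public void reap() {
        long cutoff = System.currentTimeMillis() - 5 * 60 * 1000; // 5 minutes, our choice
        lastActivity.forEach((session, last) -> {
            if (last < cutoff) {
                try {
                    session.close(CloseStatus.SESSION_NOT_RELIABLE);
                } catch (Exception e) {
                    // closing a broken transport can itself fail; drop it anyway
                }
                lastActivity.remove(session);
            }
        });
    }
}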
Oh, one thing that might be good for other people to know: clients should never be trusted, and we saw that they could sometimes send multiple requests for websockets within one millisecond, so make sure to handle these 'duplicate requests' in some way!
I am also facing the same problem.
netstat output on Linux shows the TCP connections and their status as below:
1 LISTEN
13 ESTABLISHED
67 CLOSE_WAIT
67 TCP connections are waiting to be closed, but they never actually get closed.

ActiveMQ Code working in Windows but Fails in CentOS

I'm working with a framework that has a core application and secondary applications that all communicate using JMS via ActiveMQ (through Camel). It all seems to work fine on Windows, but the moment I moved it to our CentOS environment it failed. Let me note that the person who programmed it, and who was our ActiveMQ guy, has left, so I don't know quite how to diagnose the problem. It seems to establish a connection but then does nothing else; it is supposed to begin an exchange of messages, but it doesn't. When I set logging to DEBUG I get messages saying "urlList connectionList:[]" and "waiting x ms before attempting connection", as though it's not connecting. I've made sure there's no firewall and no security policy blocking it, and ActiveMQ is shown to be running. I've tried everything I can think of, but I have no idea what the problem could be. Any recommendations?
Try testing ActiveMQ via the web console to make sure it's functioning properly. Next, try connecting remotely via JMX and verify that works as well. It's likely an environment issue that will be difficult to diagnose from the info given...
See these pages for more information:
http://activemq.apache.org/web-console.html
http://activemq.apache.org/jmx.html
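For the JMX check, a minimal remote probe might look like the sketch below (the service URL and broker name are stock defaults for recent ActiveMQ versions and may differ in your setup):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: connect to ActiveMQ's JMX endpoint and read a broker attribute to
// prove the broker is up and reachable from this host.
public class BrokerProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName broker = new ObjectName(
                    "org.apache.activemq:type=Broker,brokerName=localhost");
            System.out.println("Broker version: "
                    + mbs.getAttribute(broker, "BrokerVersion"));
        }
    }
}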

Detecting ActiveMQ flow control

I have a production system that uses ActiveMQ (5.3.2) to send messages from server A to server B. A few weeks ago, the system inexplicably started taking 10+ seconds to send a message. After a reboot of the producer, the system worked fine.
After investigation, I'm pretty sure this is due to producer flow control (I have a fairly standard ActiveMQ setup). The day before this happened, my consumer software had (for other reasons) been acting erratically and had even stopped accepting connections for a while, so I'm guessing that's what triggered it. (It does puzzle me that requests were still being throttled a day later.)
Question: how can I confirm that the requests were being throttled? I took a heap dump of the server -- is there data in memory I can look for?
Edit: I've found the following:
WireFormatNegotiator.tcpNoDelayEnabled=false for one of the three WireFormatNegotiator instances in memory. I'm trying to figure out what sets this.
And second (and more important), is there a way I can use JMX to tell if the messages are being throttled? I'd like to set up a Nagios alert to let me know if this happens in the future. What property should I check for with JMX?
You can configure the broker to throw javax.jms.ResourceAllocationException back to the producer client, where it can be detected/logged, etc. Just set one of the following in the broker's systemUsage configuration...
<systemUsage>
    <systemUsage sendFailIfNoSpaceAfterTimeout="3000">

...OR...

<systemUsage>
    <systemUsage sendFailIfNoSpace="true">
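With either flag set, a blocked send fails fast on the producer instead of hanging, so you can catch and alert on it; a sketch of the producer side:

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageProducer;
import javax.jms.ResourceAllocationException;

public class GuardedSender {
    // Sketch: with sendFailIfNoSpace(/AfterTimeout) set on the broker, a
    // flow-controlled producer gets a ResourceAllocationException instead of
    // blocking for 10+ seconds.
    public void send(MessageProducer producer, Message message) throws JMSException {
        try {
            producer.send(message);
        } catch (ResourceAllocationException e) {
            // Flow control / store full: log it, trip the Nagios alert, rethrow.
            System.err.println("Broker refused send (flow control?): " + e.getMessage());
            throw e;
        }
    }
}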

Any reason why I would not be permitted to confirm a message using Tibco Rendezvous?

I have a setup in which some applications communicate with each other via Tibco Rendezvous. The applications communicate using certified messaging. My problem is that two of my receivers have recently started getting an Error 27, Not Permitted when they try to confirm a message (the first message in a certified message exchange isn't certified; we've accounted for that).
I've been looking around the internet for people with the same error, and I have found many, but they all get the error when trying to create the Tibco transport. I can create the transport just fine; I just can't confirm any messages received over it.
Our environment uses both Tibco 7.x and 8.x, sometimes intermingled. The problem appears both when the peers use the same Tibco version and when they use different versions. It doesn't show up for all applications, but when it does show up for an application, that application stays "broken". Discarding the ledger files for both sender and receiver does nothing; we still get the error. Both sender and receiver have the proper permissions to write to (and create) the ledger files. We are connecting to permanently running rvds. The sender and receiver are on different machines. Communication worked flawlessly in the past, but at some point it stopped doing so. The application is in Java, and we're using the tibrvj.jar auto-native libraries.
The error is
...
Caused by: TibrvException[error=27,message=Not permitted]
at com.tibco.tibrv.TibrvImplCmTPortC.natConfirmMsg(Native Method)
at com.tibco.tibrv.TibrvImplCmTPortC.confirmMsg(TibrvImplCmTPortC.java:304)
at com.tibco.tibrv.TibrvCmListener.confirmMsg(TibrvCmListener.java:88)
....
I know you're going to ask me "what did you do to make it start happening", and my response is "I don't know".
Any input would be appreciated.
Thanks.
It may be that TCP connections between the two RVDs are not possible. Can you check whether you can connect from one to the other (from the subscriber host back to the publisher)? In my experience, CM acknowledgments are handled over TCP (take this with a grain of salt, as I'm more an end user than a middleware support guy).
As it turns out, it was a screw-up at the application level.
Due to some old code lying around after we updated a dependency (our messaging layer), we had moved from application-level confirmation to container-level confirmation, but we had forgotten to remove an explicit message confirmation in the application code.
To summarize: We tried to confirm the message twice, and the second time it threw this exception.
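In code, the mistake looked roughly like this (a sketch with illustrative names; by default a TibrvCmListener confirms delivery automatically when the callback returns, unless setExplicitConfirm() was called):

import com.tibco.tibrv.TibrvCmListener;
import com.tibco.tibrv.TibrvException;
import com.tibco.tibrv.TibrvListener;
import com.tibco.tibrv.TibrvMsg;
import com.tibco.tibrv.TibrvMsgCallback;

// Sketch of the bug: the container (or the listener's default auto-confirm)
// already confirms delivery, so the leftover explicit confirmMsg() below is a
// second confirmation and fails with error 27, Not permitted.
public class DoubleConfirmBug implements TibrvMsgCallback {

    private TibrvCmListener cmListener; // initialized elsewhere

    public void onMsg(TibrvListener listener, TibrvMsg msg) {
        try {
            handle(msg);
            cmListener.confirmMsg(msg); // leftover explicit confirm: remove this
        } catch (TibrvException e) {
            // TibrvException[error=27,message=Not permitted] lands here
        }
    }

    private void handle(TibrvMsg msg) { /* application logic */ }
}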
I recently encountered the same exception: the application had been working for months, then suddenly started throwing it. In my case, some maintenance had been done on the Windows server the application ran on, and directories had been marked read-only. Once that was cleared, the exception went away.
I discovered this after hours' worth of troubleshooting other potential causes.
Just my two cents: this exception also occurs when you try to explicitly confirm a message on a non-CM transport.
