ActiveMQ: 'channel inactive for too long' exceptions stop broker messaging - java

My system has the following parts:
An ActiveMQ broker exposed over TCP on port 61616
3 Grails/Spring WARs, each living in its own Tomcat server, that publish and consume messages to/from the JMS broker
n remote client systems, each with a JMS listener component to receive client-specific messages, connecting to the JMS broker through a VPN using a hostname and port 61616
So far, all works fine throughout the dev, test, and production environments.
We've just connected a new client system in production and we've noticed that its logs start to report 'channel was inactive for too long' exceptions and it drops the connection.
Worryingly, the overall effect of this one client is that it stops all message consumption on the broker, so it brings our whole system to a halt.
This client listener (using Spring's caching connection factory) appears to connect to the JMS broker OK and process some messages, then after about 3 minutes it reports the exception. I turned on DEBUG in ActiveMQ and got loads of output, but nothing suggesting a warning or error on the broker around the same time.
I believe that ActiveMQ has some internal keep-alive that should keep the connection open even if it is inactive for longer than the default 30 seconds.
Infrastructure guys have monitored the VPN of this client and confirm it stays up and connected the whole time.
I don't believe the code or Spring config is at fault, as we have numerous other instances of the listener in different clients and they all behave fine.
I suppose I have 2 questions really:
What is causing 'channel inactive' exceptions?
Why does this exception in a single client stop ActiveMQ from working?
EDIT - adding exception stacktrace:
2013-04-24 14:02:06,359 WARN - Encountered a JMSException - resetting the underlying JMS Connection (org.springframework.jms.connection.CachingConnectionFactory)
javax.jms.JMSException: Channel was inactive for too (>30000) long: jmsserver/xxx.xx.xx.xxx:61616
at org.apache.activemq.util.JMSExceptionSupport.create(JMSExceptionSupport.java:49)
at org.apache.activemq.ActiveMQConnection.onAsyncException(ActiveMQConnection.java:1833)
at org.apache.activemq.ActiveMQConnection.onException(ActiveMQConnection.java:1850)
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)
at org.apache.activemq.transport.ResponseCorrelator.onException(ResponseCorrelator.java:126)
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)
at org.apache.activemq.transport.WireFormatNegotiator.onException(WireFormatNegotiator.java:160)
at org.apache.activemq.transport.InactivityMonitor.onException(InactivityMonitor.java:266)
at org.apache.activemq.transport.InactivityMonitor$4.run(InactivityMonitor.java:186)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:693)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:719)
at java.lang.Thread.run(Thread.java:813)
Caused by: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too (>30000) long: jmsserver/xxx.xx.xx.xxx:61616
... 4 more

Have you tried the following:
Disable the InactivityMonitor: wireFormat.maxInactivityDuration=0, e.g.
URL: tcp://localhost:61616?wireFormat.maxInactivityDuration=0
If you don't wish to disable it, have you tried setting it to a high number, e.g.: URL: tcp://localhost:61616?wireFormat.maxInactivityDuration=5000000 (just an example - use your own time in ms)
Also, ensure that the jar files are the same version for both client and server.
Hope it helps
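As a minimal sketch of how that URL option could be applied on the client side, assuming the listener uses ActiveMQ's connection factory wrapped in Spring's CachingConnectionFactory as described in the question (the class name, host, and timeout value below are illustrative, not from the original post):

import org.apache.activemq.ActiveMQConnectionFactory;
import org.springframework.jms.connection.CachingConnectionFactory;

public class JmsConnectionConfig {

    // The maxInactivityDuration option is appended to the broker URL;
    // 0 disables the InactivityMonitor, any other value is milliseconds.
    public CachingConnectionFactory connectionFactory() {
        ActiveMQConnectionFactory amqFactory = new ActiveMQConnectionFactory(
                "tcp://jmsserver:61616?wireFormat.maxInactivityDuration=300000");
        return new CachingConnectionFactory(amqFactory);
    }
}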

You just need to change activemq.xml (the configuration file), in the transportConnectors section:
<transportConnector name="ws" uri="ws://0.0.0.0:61614"/>
change it to
<transportConnector name="ws" uri="tcp://0.0.0.0:61614"/>
It works for my Windows and Linux virtual machines.

Related

How to recover in Spring from a disconnected Redis message listener?

We had a situation occur today that took our Redis instance offline for a few minutes. We have several microservices that are configured to listen for messages on various channels. When Redis went offline, the messageListeners lost their connections but provided no feedback that anything was wrong. The services that were publishing messages continued to work fine when Redis came back online, but the listeners are only hooked up during startup.
I'm wondering if there is any way to detect that they are no longer listening, and then reconnect to the message channel without having to restart the application.
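Not part of the original question, but as a sketch of one possible approach, assuming Spring Data Redis is in use: RedisMessageListenerContainer can be given a recovery interval so it retries the subscription itself after a connection loss, and an ErrorHandler so you get feedback instead of silence (the channel name and wiring below are illustrative):

import org.springframework.data.redis.connection.MessageListener;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.listener.ChannelTopic;
import org.springframework.data.redis.listener.RedisMessageListenerContainer;

public class ListenerConfig {

    // Hypothetical wiring: the container re-subscribes on its own once Redis is back.
    public RedisMessageListenerContainer listenerContainer(RedisConnectionFactory factory,
                                                           MessageListener listener) {
        RedisMessageListenerContainer container = new RedisMessageListenerContainer();
        container.setConnectionFactory(factory);
        container.addMessageListener(listener, new ChannelTopic("orders"));
        // Retry the subscription every 5 seconds after a failure.
        container.setRecoveryInterval(5000L);
        // Surface listener failures instead of losing them silently.
        container.setErrorHandler(t ->
                System.err.println("Redis listener error: " + t.getMessage()));
        return container;
    }
}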

How to catch lost connection event in netty channel handler

I'm working on an old app that is using Netty to connect to a couple of remote TCP endpoints.
The app contains an implementation of IdleStateAwareChannelHandler and overrides several methods provided by it and SimpleChannelHandler (channelConnected, channelIdle, messageReceived, exceptionCaught, channelClosed).
This implementation is not able to cope with the scenario where the application server loses connection towards the remote server while my application is running.
I had hoped that introducing my own custom implementation of the channelDisconnected() method would allow me to react to connection loss, but in practice I'm seeing something different:
I simulate connection loss by removing my application server from the network, thus cutting it off from both incoming and outgoing traffic
I leave it isolated for 5-10 min and observe the logs
Then I bring the application server back onto the network
Only once I have restored the application server to the network do I start seeing my debug logs from the exceptionCaught and channelDisconnected methods
While the machine is isolated, I see from the logs that the channelIdle method is being invoked regularly
Question: Is it possible to isolate and react on connection loss in my channel handler ?
Additional Info:
Netty version: 3.2.7.Final
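One commonly used approach (not from the original post) is to treat prolonged reader idleness as a lost connection and close the channel yourself, which then triggers channelDisconnected/channelClosed locally instead of waiting for the OS to notice. A minimal sketch against the Netty 3.x API mentioned above; the handler name and timeout values are illustrative:

import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.handler.timeout.IdleState;
import org.jboss.netty.handler.timeout.IdleStateAwareChannelHandler;
import org.jboss.netty.handler.timeout.IdleStateEvent;

public class DisconnectOnIdleHandler extends IdleStateAwareChannelHandler {

    @Override
    public void channelIdle(ChannelHandlerContext ctx, IdleStateEvent e) {
        // No data read for the configured reader-idle period: assume the peer is gone
        // and close the channel, which fires channelDisconnected/channelClosed.
        if (e.getState() == IdleState.READER_IDLE) {
            e.getChannel().close();
        }
    }
}

// Pipeline wiring sketch: an IdleStateHandler must sit in front of the handler above, e.g.
// pipeline.addLast("idle", new IdleStateHandler(new HashedWheelTimer(), 60, 0, 0));
// pipeline.addLast("disconnectOnIdle", new DisconnectOnIdleHandler());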

StompBrokerRelayMessageHandler - Transport failure: java.lang.IllegalStateException: No TcpConnection available

We use a STOMP broker relay (external broker - ActiveMQ 5.13.2) in our project; see
https://docs.spring.io/spring/docs/current/spring-framework-reference/web.html#websocket-stomp-handle-broker-relay
We use following stack:
org.springframework:spring-jms:jar:5.1.8.RELEASE
org.springframework:spring-messaging:jar:5.1.8.RELEASE
io.projectreactor:reactor-core:jar:3.2.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.8.6.RELEASE
io.netty:netty-all:jar:4.1.34.Final
From time to time (let's say once every 2 weeks) we observe the following error in Tomcat's catalina.out logs:
2019-08-21 13:38:58,891 [tcp-client-scheduler-5] ERROR com.*.websocket.stomp.SimpMessagingSender - BrokerAvailabilityEvent[available=false, StompBrokerRelay[ReactorNettyTcpClient[reactor.netty.tcp.TcpClientDoOn#219abb46]]]
2019-08-21 13:38:58,965 [tcp-client-scheduler-1] ERROR org.springframework.messaging.simp.stomp.StompBrokerRelayMessageHandler - Transport failure: java.lang.IllegalStateException: No TcpConnection available
After that error, STOMP communication is broken (the system connection - a single TCP connection - is not available).
And it seems that everything started when we updated the stack from:
org.springframework:spring-jms:jar:5.0.8.RELEASE
org.springframework:spring-messaging:jar:5.0.8.RELEASE
io.projectreactor:reactor-core:jar:3.1.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.7.8.RELEASE
io.netty:netty-all:jar:4.1.25.Final
The ActiveMQ version was not changed.
There is a bug reported in Spring that auto-reconnect fails when the system connection is lost; see:
https://github.com/spring-projects/spring-framework/issues/22080
And now 3 questions:
How to make this problem more reproducible?
How to fix this reconnect behavior? :)
How to prevent to lose this connection? :)
EDIT 23.09.2019
After the error occurred, the TCP state for port 61613 (STOMP) is the following (note the CLOSE_WAIT state):
netstat -an | grep 61613
tcp6 0 0 :::61613 :::* LISTEN
tcp6 2 0 127.0.0.1:49084 127.0.0.1:61613 CLOSE_WAIT
I can't say that I have enough information to answer your question although I have some input that may help you find a way forward.
ActiveMQ is typically used in an environment that is hosted/distributed, so load and scaling should always be a consideration.
Most DBs/message queues/etc. will need some sort of tuning for load - even on AWS (via requesting higher limits), even though most of that is taken care of by the hosting provider.
But I digress...
In this case it appears you're using the TCP transport for your queue:
https://activemq.apache.org/tcp-transport-reference
As you can see, all of these settings can be tuned and have default values.
So in the case of issues logged from the Spring side connecting to AMQ, you'll want to narrow down the time of the error and then go look at your AMQ metrics and logs.
If you don't have monitoring for AMQ, I suggest:
Add Monitoring - https://activemq.apache.org/how-can-i-monitor-activemq
Add logging (or find out where the logs are). - Then enable detailed logging. (AMQ uses log4j, so just look at the log4j config file or add one.) Beyond this, consider sending the logs to a log aggregator. -- https://activemq.apache.org/how-can-i-enable-detailed-logging
Look at your hosting provider's metrics & downtime. For instance, if using AWS, there are very detailed incident logs for network failures or momentary issues with VPC or cross-region tunneling, network traffic in/out, etc.
Setting up the right tools for your distributed systems to enable your team to search/find errors/logs (and documenting how to do it) is extremely helpful. A step beyond this (for mature systems) is to add a layer on top of your monitoring so that your systems start telling you when there is a problem instead of the other way around (go looking for problems).
That may be a bit verbose - but that all leads up to me asking if you have logs / metrics for the AMQ system at the times of the failure. If you do, please post them!
I make these suggestions because:
There is no information provided on your load expectation, variability of load, or recognition that load is a consideration in a system (via troubleshooting steps).
Logs/errors provided are strictly from the client side.
The reproducibility of the error is infrequent and inconsistent - so it could be almost anything (memory leak, load issue, etc..) - so monitoring is necessary.
Also consider adding Spring Actuator for monitoring your message client on the Spring side, as there are frequently limitations/settings for client connection pools & advanced settings too. Especially if you scale instance size up/down and your instance will be handling more/less load, your client libs may need some settings tuning.
https://www.baeldung.com/spring-boot-actuators
Exposing metrics about current Websocket connections with Spring
You can also catch the exception and tear down & re-create your connection/settings - although this wouldn't be the first thing I recommend without knowing more about the situations & stats at the time of the connection failure.
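As a small sketch of the "catch it and react" idea (not from the original answer): Spring publishes a BrokerAvailabilityEvent when the broker relay loses or regains its system connection - this is the same event visible in the log at the top of the question - so you can at least detect the failure and trigger your own alerting or recovery. The listener class name below is illustrative:

import org.springframework.context.ApplicationListener;
import org.springframework.messaging.simp.broker.BrokerAvailabilityEvent;
import org.springframework.stereotype.Component;

@Component
public class BrokerAvailabilityLogger implements ApplicationListener<BrokerAvailabilityEvent> {

    @Override
    public void onApplicationEvent(BrokerAvailabilityEvent event) {
        // Fired when the STOMP broker relay's TCP "system" connection goes down or comes back up.
        if (!event.isBrokerAvailable()) {
            // Hook for alerting, metrics, or a manual recovery strategy.
            System.err.println("STOMP broker relay lost its connection: " + event);
        }
    }
}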

Application resilience impact using Logstash/Graylog log appender

I have some questions about the gelf module (http://logging.paluch.biz/), in particular about what happens when the Graylog server is not available for some reason.
Will log4j cache the logs somewhere and send them when the connection to Graylog is recovered?
Will the application using this module stop working during an issue with the Graylog server?
Thanks.
Gelf-Appenders are online appenders without a cache. They connect directly to a remote service and submit log events as your application produces them.
If the remote service is down, log events get lost. There are a few options with different impacts:
TCP: TCP comes with transport reliability and requires a connection. If the remote service becomes slow/unresponsive, then your application gets affected as soon as I/O buffers are saturated. logstash-gelf uses NIO in a non-blocking way if all data was sent. If the TCP connection drops, then you will run into connection timeouts if the remote side is not reachable, or connection refused states if the remote port is closed. In any case, you get reliability, but it will affect your application performance.
UDP: UDP has no connection notion, it's used for fire-and-forget communication. If the remote side becomes unhealthy, your application usually is not affected, but you encounter log event loss.
Redis: You can use Redis as an intermediate buffer if your Graylog instance is known to fail or to be taken down for maintenance. Once Graylog is available again, it should catch up, and you prevent log event loss to some degree. If your Redis service becomes unhealthy, see point 1.
HTTP: HTTP is another option that gives you a degree of flexibility. You can put your Graylog servers behind a load-balancer to improve availability and reduce the risk of failure. Log event loss is still possible.
If you want to ensure log continuity and reduce the probability of log event loss, then write logs to disk. It's still no 100% guarantee against loss (disk failure, disk full) but improves application performance. The log file (ideally some JSON-based format) can then be parsed and submitted to Graylog by maintaining a read offset to recover from a remote outage.

Client Side JMS Configuration - JMS Cluster - Connects to only one server

So I wrote a program to connect to a clustered WebLogic server behind a VIP, with 4 servers and 4 queues that are all connected (I think they call them distributed...). When I run the program from my local machine and just get JMS connections, look for messages, and disconnect, it works great. And by that I mean it:
iteration #1
connects to server 1.
look for a message
disconnects
iteration #2
connects to server 2.
look for a message
disconnects
and so on.
When I run it on the server though, the application picks a server and sticks to it. It will never pick a new server, so the queues on the other servers don't ever get worked - like with a "sticky session" setup... My OS is Win7 and the server OS is Win2008r2; the JDK is identical for both machines. How is this configured client side? The server implementation uses "Apache Procrun" to run it as a service, but I haven't seen too many issues with that part...
Is there a session cookie getting written out somewhere?
Any ideas?
Thanks!
Try disabling 'Server Affinity' on the JMS Connection Factory. If you are using the Default Connection Factory, define your own and disable Server Affinity.
EDIT:
Server Affinity is a server-side setting, but it controls how messages are routed to consumers after a WebLogic JMS server receives the message. The other option is to use round-robin DNS and send to only one hostname that resolves to a different IP (managed server) such that each connection goes to a different server.
I'm pretty sure this is the setting you're looking for :)
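Not from the original answer, but for completeness: the one client-side knob you do have is the JNDI provider URL. Listing the managed servers directly, instead of the single VIP hostname, lets the initial context spread connections across them. A rough sketch assuming the standard WebLogic client; the host names and JNDI name are illustrative:

import java.util.Hashtable;
import javax.jms.ConnectionFactory;
import javax.naming.Context;
import javax.naming.InitialContext;

public class ClusterLookup {

    public static ConnectionFactory lookupFactory() throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
        // Listing all managed servers lets the client pick among them
        // instead of always landing on whatever the VIP hands out.
        env.put(Context.PROVIDER_URL, "t3://server1:7001,server2:7001,server3:7001,server4:7001");
        Context ctx = new InitialContext(env);
        // JNDI name of a custom connection factory with Server Affinity disabled (illustrative).
        return (ConnectionFactory) ctx.lookup("jms/MyDistributedConnectionFactory");
    }
}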
