We use the STOMP broker relay (external broker: ActiveMQ 5.13.2) in our project; see
https://docs.spring.io/spring/docs/current/spring-framework-reference/web.html#websocket-stomp-handle-broker-relay
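For context, the relay is configured roughly along the lines of the sketch below (simplified; the endpoint path, destination prefixes, host, and credentials are illustrative placeholders rather than our real values):

    import org.springframework.context.annotation.Configuration;
    import org.springframework.messaging.simp.config.MessageBrokerRegistry;
    import org.springframework.web.socket.config.annotation.EnableWebSocketMessageBroker;
    import org.springframework.web.socket.config.annotation.StompEndpointRegistry;
    import org.springframework.web.socket.config.annotation.WebSocketMessageBrokerConfigurer;

    @Configuration
    @EnableWebSocketMessageBroker
    public class WebSocketConfig implements WebSocketMessageBrokerConfigurer {

        @Override
        public void registerStompEndpoints(StompEndpointRegistry registry) {
            registry.addEndpoint("/ws"); // illustrative endpoint path
        }

        @Override
        public void configureMessageBroker(MessageBrokerRegistry registry) {
            registry.setApplicationDestinationPrefixes("/app");
            registry.enableStompBrokerRelay("/topic", "/queue")
                    .setRelayHost("localhost")   // ActiveMQ host (placeholder)
                    .setRelayPort(61613)         // STOMP port, matches the netstat output below
                    .setSystemLogin("admin")     // placeholder credentials
                    .setSystemPasscode("admin")
                    .setClientLogin("admin")
                    .setClientPasscode("admin");
        }
    }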
We use the following stack:
org.springframework:spring-jms:jar:5.1.8.RELEASE
org.springframework:spring-messaging:jar:5.1.8.RELEASE
io.projectreactor:reactor-core:jar:3.2.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.8.6.RELEASE
io.netty:netty-all:jar:4.1.34.Final
From time to time (let's say once every two weeks) we observe the following error in Tomcat's catalina.out log:
2019-08-21 13:38:58,891 [tcp-client-scheduler-5] ERROR com.*.websocket.stomp.SimpMessagingSender - BrokerAvailabilityEvent[available=false, StompBrokerRelay[ReactorNettyTcpClient[reactor.netty.tcp.TcpClientDoOn#219abb46]]]
2019-08-21 13:38:58,965 [tcp-client-scheduler-1] ERROR org.springframework.messaging.simp.stomp.StompBrokerRelayMessageHandler - Transport failure: java.lang.IllegalStateException: No TcpConnection available
After that error, STOMP communication is broken (the system connection, a single TCP connection, is no longer available).
It seems that everything started when we updated the stack from:
org.springframework:spring-jms:jar:5.0.8.RELEASE
org.springframework:spring-messaging:jar:5.0.8.RELEASE
io.projectreactor:reactor-core:jar:3.1.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.7.8.RELEASE
io.netty:netty-all:jar:4.1.25.Final
The ActiveMQ version did not change.
There is a bug reported against Spring where auto-reconnect fails when the system connection is lost; see:
https://github.com/spring-projects/spring-framework/issues/22080
And now, three questions:
How can we make this problem more reproducible?
How can we fix this reconnect behavior? :)
How can we prevent losing this connection? :)
EDIT 23.09.2019
After the error occurred, the TCP state for port 61613 (STOMP) was the following (note the CLOSE_WAIT state):
netstat -an | grep 61613
tcp6 0 0 :::61613 :::* LISTEN
tcp6 2 0 127.0.0.1:49084 127.0.0.1:61613 CLOSE_WAIT
I can't say that I have enough information to answer your question, but I have some input that may help you find a way forward.
ActiveMQ is typically used in an environment that is hosted/distributed, so load and scaling should always be a consideration.
Most databases/message queues/etc. will need some sort of tuning for load, even on AWS (via requesting higher limits), even though most of that is taken care of by the hosting provider.
But I digress...
In this case it appears you're using the TCP transport for your queue:
https://activemq.apache.org/tcp-transport-reference
As you can see, all of these settings can be tuned and have default values.
So for issues logged on the Spring side when connecting to AMQ, you'll want to narrow down the time of the error and then go look at your AMQ metrics and logs.
If you don't have monitoring for AMQ, I suggest:
Add monitoring - https://activemq.apache.org/how-can-i-monitor-activemq (a small JMX sketch follows this list).
Add logging (or find out where the logs are), then enable detailed logging. (AMQ uses log4j, so just look at the log4j config file or add one.) Beyond this, consider sending the logs to a log aggregator. -- https://activemq.apache.org/how-can-i-enable-detailed-logging
Look at your hosting provider's metrics & downtime. For instance, if you are using AWS, there are very detailed incident logs for network failures or momentary issues with VPC or cross-region tunneling, network traffic in/out, etc.
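For the monitoring point above, a quick way to pull a few broker health numbers is plain JMX. A minimal sketch, assuming the default ActiveMQ JMX endpoint and the standard broker MBean name (adjust host, port, and brokerName for your install):

    import java.util.Arrays;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class AmqJmxCheck {
        public static void main(String[] args) throws Exception {
            // Default ActiveMQ JMX endpoint; adjust host/port/brokerName for your installation.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName broker = new ObjectName(
                        "org.apache.activemq:type=Broker,brokerName=localhost");
                // A few attributes worth graphing/alerting on.
                for (String attr : Arrays.asList(
                        "CurrentConnectionsCount", "MemoryPercentUsage", "StorePercentUsage")) {
                    System.out.println(attr + " = " + mbs.getAttribute(broker, attr));
                }
            } finally {
                connector.close();
            }
        }
    }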
Setting up the right tools for your distributed systems so your team can search for errors/logs (and documenting how to do it) is extremely helpful. A step beyond this (for mature systems) is to add a layer on top of your monitoring so that your systems start telling you when there is a problem instead of you going looking for problems.
That may be a bit verbose - but that all leads up to me asking if you have logs / metrics for the AMQ system at the times of the failure. If you do, please post them!
I make these suggestions because:
There is no information provided on your expected load, the variability of that load, or any indication (via troubleshooting steps) that load has been considered.
Logs/errors provided are strictly from the client side.
The error is infrequent and inconsistent, so it could be almost anything (memory leak, load issue, etc.), which is why monitoring is necessary.
Also consider adding Spring Boot Actuator for monitoring your message client on the Spring side. There are frequently limits/settings for client connection pools and other advanced settings too; especially if you scale instance sizes up/down so that an instance handles more or less load, your client libraries may need some settings tuning.
https://www.baeldung.com/spring-boot-actuators
Exposing metrics about current Websocket connections with Spring
You can also catch the exception and tear down & re-create your connection/settings, although this wouldn't be the first thing I recommend without knowing more about the situation & stats at the time of the connection failure. A rough sketch of that approach follows.
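As a rough illustration only (not a verified fix), a listener for Spring's BrokerAvailabilityEvent, which the log line in the question shows is already being published, gives you a hook for alerting or your own recovery attempt:

    import org.springframework.context.ApplicationListener;
    import org.springframework.messaging.simp.broker.BrokerAvailabilityEvent;
    import org.springframework.stereotype.Component;

    @Component
    public class BrokerWatchdog implements ApplicationListener<BrokerAvailabilityEvent> {

        @Override
        public void onApplicationEvent(BrokerAvailabilityEvent event) {
            if (!event.isBrokerAvailable()) {
                // The relay reported the system connection as gone. At minimum, alert here;
                // optionally trigger your own recovery logic (e.g. restarting the relay bean)
                // if you find the built-in reconnect does not kick in.
                System.err.println("STOMP broker relay reports broker unavailable: " + event);
            } else {
                System.out.println("STOMP broker relay reports broker available again: " + event);
            }
        }
    }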
UPDATE:
My goal is to learn what factors could overwhelm my little Tomcat server, and, when some exception happens, what I can do to resolve or remediate it without switching my server to a better machine. This is not a real app in a production environment but just my own experiment (besides some changes on the server side, I may also do something on my client side).
Both my client and server are very simple: the server only checks the URL format and sends a 201 code if it is correct. Each request sent from my client includes only a simple JSON body. There is no database involved. The two machines (t2.micro) run only the client and the server, respectively.
My client uses OkHttpClient. To avoid timeout exceptions, I already set the timeouts to 1,000,000 milliseconds via setConnectTimeout, setReadTimeout, and setWriteTimeout. I also went to $CATALINA/conf/server.xml on my server and set connectionTimeout="-1" (infinite).
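Roughly, the client timeout setup looks like the following sketch (assuming the OkHttp 2.x-style API implied by those setter names; the endpoint address and path are placeholders):

    import com.squareup.okhttp.OkHttpClient;
    import com.squareup.okhttp.Request;
    import com.squareup.okhttp.Response;
    import java.util.concurrent.TimeUnit;

    public class StressClient {
        public static void main(String[] args) throws Exception {
            OkHttpClient client = new OkHttpClient();
            // Very large timeouts, as described above.
            client.setConnectTimeout(1_000_000, TimeUnit.MILLISECONDS);
            client.setReadTimeout(1_000_000, TimeUnit.MILLISECONDS);
            client.setWriteTimeout(1_000_000, TimeUnit.MILLISECONDS);

            // Placeholder endpoint; the real server checks the URL and returns 201.
            Request request = new Request.Builder()
                    .url("http://203.0.113.10:8080/test")
                    .build();
            Response response = client.newCall(request).execute();
            System.out.println("HTTP " + response.code() + " body=" + response.body().string());
        }
    }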
ORIGINAL POST:
I'm trying to stress my server by having a client launch 3000+ threads sending HTTP requests to it. My client and server reside on different EC2 instances.
Initially I encountered some timeout issues, but after I set the connect, read, and write timeouts to larger values, that exception was resolved. However, with the same setup, I'm now getting a java.net.ConnectException: Failed to connect to my_host_ip:8080 exception, and I do not know its root cause. I'm new to multithreading and distributed systems; can anyone please give me some insight into this exception?
Below are some screenshots from my EC2 instances:
1. Client:
2. Server:
Having gone through a similar exercise in the past, I can say that there is no definitive answer to the problem of scaling.
Here are some general troubleshooting steps that may lead to more specific information. I would suggest running tests, tweaking a few parameters in each test, and measuring the changes in CPU, logs, etc.
Please share what value you have set for the timeout. Increasing the timeout could cause your server (or client) to run out of threads quickly (because each thread can stay busy for longer). Question the need for increasing the timeout: is there any processing that slows your server down?
Check application logs, JVM usage, and memory usage on the client and server. There will be some hints there.
Your client seems to be hitting 99%+ CPU and then coming down. This implies that there could be a problem on the client side in that it maxes out during the test. You might want to resize your client to be able to do more.
Look at open file handles. The limit should be sufficiently high.
Tomcat has a limit on the thread count it uses to handle load (the maxThreads attribute on the Connector). You can check this in server.xml and, if required, change it to handle more. Although the CPU doesn't actually max out on the server side, so this is unlikely to be the problem.
If you have a database, then check the performance of the database. Also check the JDBC connection settings; there is thread and timeout configuration at the JDBC level as well.
Is response compression set up on Tomcat? It will give much better throughput on the server, especially if the data being sent back by each request is more than a few KBs.
--------Update----------
Based on the update to the question, a few more thoughts.
Since the application is fairly simple, the path in terms of stressing the server should be to start low and increase load in increments whilst monitoring various things (CPU, memory, JVM usage, file handle count, network I/O).
The increments of load should be spread over several runs.
Start with something as low as 100 parallel threads.
Record as much information as you can after each run and if the server holds up well, increase load.
Suggested increments: 100, 200, 500, 1000, 1500, 2000, 2500, 3000 (see the sketch after this list for one way to drive such a ramp).
At some level you will see that the server can no longer take it. That would be your breaking point.
As you increase load and monitor, you will likely discover patterns that suggest tuning specific parameters. Each tuning attempt should then be tested against the same level of multithreading. Any improvement will be obvious from the monitoring.
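Here is a minimal sketch of driving such a ramp from the client side, using plain java.net.HttpURLConnection to keep it self-contained; the target URL, timeouts, and pause between levels are placeholders to adapt:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicInteger;

    public class RampLoadTest {

        static final int[] LEVELS = {100, 200, 500, 1000, 1500, 2000, 2500, 3000};
        static final String TARGET = "http://203.0.113.10:8080/test"; // placeholder for your server

        public static void main(String[] args) throws Exception {
            for (int threads : LEVELS) {
                AtomicInteger failures = new AtomicInteger();
                CountDownLatch done = new CountDownLatch(threads);
                long start = System.currentTimeMillis();
                for (int i = 0; i < threads; i++) {
                    new Thread(() -> {
                        try {
                            HttpURLConnection conn =
                                    (HttpURLConnection) new URL(TARGET).openConnection();
                            conn.setConnectTimeout(30_000);
                            conn.setReadTimeout(30_000);
                            conn.getResponseCode();   // one request per thread
                            conn.disconnect();
                        } catch (Exception e) {
                            failures.incrementAndGet();
                        } finally {
                            done.countDown();
                        }
                    }).start();
                }
                done.await();
                System.out.printf("threads=%d failures=%d elapsedMs=%d%n",
                        threads, failures.get(), System.currentTimeMillis() - start);
                // Pause between levels; record CPU/memory/file-handle readings here.
                Thread.sleep(10_000);
            }
        }
    }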
I have a question about the GELF module (http://logging.paluch.biz/), in particular about what happens when the Graylog server is not available for some reason.
Will log4j cache the logs somewhere and send them when the connection to Graylog is recovered?
Will an application using this module stop working while there is an issue with the Graylog server?
Thanks.
GELF appenders are online appenders without a cache. They connect directly to a remote service and submit log events as your application produces them.
If the remote service is down, log events get lost. There are a few options with different impacts:
TCP: TCP comes with transport reliability and requires a connection. If the remote service becomes slow/unresponsive, then your application gets affected as soon as the I/O buffers are saturated. logstash-gelf uses NIO in a non-blocking way as long as all data can be sent. If the TCP connection drops, then you will run into connection timeouts if the remote side is not reachable, or connection-refused states if the remote port is closed. In any case, you get reliability, but it will affect your application's performance.
UDP: UDP has no notion of a connection; it's used for fire-and-forget communication. If the remote side becomes unhealthy, your application usually is not affected, but you will encounter log event loss.
Redis: You can use Redis as an intermediate buffer if your Graylog instance is known to fail or to be taken down for maintenance. Once Graylog is available again, it should catch up, and you prevent log event loss to some degree. If your Redis service becomes unhealthy, see point 1.
HTTP: HTTP is another option that gives you a degree of flexibility. You can put your Graylog servers behind a load-balancer to improve availability and reduce the risk of failure. Log event loss is still possible.
If you want to ensure log continuity and reduce the probability of log event loss, then write logs to disk. It's still not a 100% guarantee against loss (disk failure, disk full), but it improves application performance. The log file (ideally in some JSON-based format) can then be parsed and submitted to Graylog by a shipper that maintains a read offset, so it can recover from a remote outage. A rough sketch of such a shipper is below.
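A minimal sketch of that idea, assuming GELF over TCP (null-byte-delimited JSON frames) and a line-oriented JSON log file; the paths, host, and port are placeholders, and real code would also need log rotation handling, retries, and proper character-encoding care:

    import java.io.OutputStream;
    import java.io.RandomAccessFile;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class GelfFileShipper {

        public static void main(String[] args) throws Exception {
            Path logFile = Paths.get("/var/log/myapp/app-gelf.json");     // placeholder
            Path offsetFile = Paths.get("/var/log/myapp/app-gelf.offset"); // placeholder

            long offset = Files.exists(offsetFile)
                    ? Long.parseLong(new String(Files.readAllBytes(offsetFile),
                            StandardCharsets.UTF_8).trim())
                    : 0L;

            try (Socket socket = new Socket("graylog.example.org", 12201); // placeholder host/port
                 RandomAccessFile reader = new RandomAccessFile(logFile.toFile(), "r")) {
                OutputStream out = socket.getOutputStream();
                reader.seek(offset);
                String line;
                while ((line = reader.readLine()) != null) {
                    // Each line is assumed to already be a GELF/JSON document.
                    out.write(line.getBytes(StandardCharsets.UTF_8));
                    out.write(0); // GELF TCP messages are delimited by a null byte
                    out.flush();
                    // Persist the offset only after a successful send, so a remote outage
                    // simply means we resume from the last acknowledged position.
                    offset = reader.getFilePointer();
                    Files.write(offsetFile, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }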
I am trying to run down a problem with consumer connections to RabbitMQ being dropped in our production environment. The problem seems to occur after running for a few days, and by restarting our application it connects and works fine for a few more days. My guess is that a period of inactivity is causing the issue. It seems the AMQP heartbeat was designed for exactly this problem. We are using spring-amqp 1.3.2.RELEASE and setting the requestedHeartbeat on the ConnectionFactory to 10, but we are still seeing connections drop.
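For reference, the heartbeat is configured roughly as in the sketch below (the host and credentials are placeholders, not our real values):

    import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
    import org.springframework.amqp.rabbit.connection.ConnectionFactory;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class RabbitConfig {

        @Bean
        public ConnectionFactory connectionFactory() {
            // Host and credentials are placeholders.
            CachingConnectionFactory cf = new CachingConnectionFactory("rabbit-host");
            cf.setUsername("app");
            cf.setPassword("secret");
            cf.setRequestedHeartBeat(10); // heartbeat interval in seconds
            return cf;
        }
    }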
The spring-amqp client will reconnect if I completely disconnect from the internet and reconnect, or block the connection with a firewall; however, it does not even seem to throw an exception in the log when this happens in production. Of course, that may be because we are using slf4j and logback for our logging and Spring is using commons-logging, so it is appearing in System.out and not going to the log. I have added the jcl-over-slf4j bridge to fix that but have not rolled it out yet, so I do not have a stack trace to contribute.
One more piece of info about our architecture: we have HAProxy in front of RabbitMQ.
I would like to somehow run the app in debug within Eclipse to see if the heartbeats are actually going out. I tried to verify with Wireshark, but our traffic has two-way SSL encryption and I haven't been able to decrypt it yet.
Does anyone have any suggestions? I have been trying to run this down for weeks and I'm running out of ideas. I would greatly appreciate your input.
Thanks!
On 11-Feb-2015, Rabbit released 3.4.4, which has support for automatic reconnection. You could roll your own solution like we did a while back, but it seems easier to just upgrade to the newest version of the Rabbit client.
https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/rabbitmq_v3_4_4/projects/client/RabbitMQ.Client/src/client/impl/AutorecoveringConnection.cs
If you have rabbitmq's autorecovery enabled, Spring AMQP prior to 1.4 is not compatible with it; the problem being that rabbit restores the connections/channels, but Spring AMQP doesn't know about them. Generally, though, this just causes extra connections/channels - Spring AMQP just establishes new channels. I have not heard of it causing the problems you describe.
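If autorecovery is enabled and you want to rule it out, you can switch it off on the underlying Rabbit client factory that Spring AMQP wraps; a sketch, assuming the standard Java client API (the host value is a placeholder):

    import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;

    public class ConnectionFactorySetup {

        public static CachingConnectionFactory connectionFactory() {
            com.rabbitmq.client.ConnectionFactory rabbitFactory =
                    new com.rabbitmq.client.ConnectionFactory();
            rabbitFactory.setHost("rabbit-host"); // placeholder
            rabbitFactory.setRequestedHeartbeat(10);
            // Let Spring AMQP manage recovery itself; avoid the client-side autorecovery
            // that Spring AMQP versions before 1.4 don't know about.
            rabbitFactory.setAutomaticRecoveryEnabled(false);
            return new CachingConnectionFactory(rabbitFactory);
        }
    }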
If you can't figure it out by fixing your logging configuration, another thing to try is to jstack your process to see what the threads are doing. But you should be able to figure it out from the logs.
This is shameful, but we know there are some ActiveMQ connection leaks. The code is old and has many twists and turns that make finding the leaky flow very hard.
We fire many short-lived jobs from a batch machine. We know that not all paths close the ActiveMQ connection properly. When a connection is not closed but the job terminates, ActiveMQ holds that connection for some amount of time. Ultimately, some critical applications get impacted because ActiveMQ's maximum connection limit is exceeded.
Is it possible to set a connection name or other identifying information so that an improperly closed connection will show up in ActiveMQ's log files? This would tell us which logs need to be examined. The sheer number of jobs makes it very hard to find out exactly which job caused the problem; however, once we know the job, we can deduce enough information from its logs to find and fix the connection leaks.
Right now all we see is the IP address the connection originated from, and since all the jobs originate from the same machine, that's not enough to find out which one caused the problem.
If you add jms.clientID=something to your connection URL and turn on DEBUG logging in your conf/log4j.properties, you will get the client ID in the DEBUG log on the AMQ side. You could then write something to analyze your log, find the AMQ connection ID for a given clientID, and match the logs that way.
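For illustration, setting it from the batch-job side might look like the sketch below (the broker URL and client ID value are made-up examples; pick an ID that encodes the job name or run):

    import javax.jms.Connection;
    import javax.jms.JMSException;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class JobConnectionExample {

        public static Connection connect() throws JMSException {
            // jms.clientID in the URL sets the clientID on the connection factory,
            // so broker-side DEBUG logs can be matched back to this job.
            ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                    "tcp://amq-host:61616?jms.clientID=nightly-report-job-42"); // example values
            Connection connection = factory.createConnection();
            connection.start();
            return connection;
        }
    }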
If your process is truly exiting, though, your connection should be going away at that point (i.e. you can't keep the connection alive if there's no process to service it).
If you are running on Linux, you can do a netstat -anp | grep 61616 (or whatever your AMQ port is) to see which PIDs still have connections to AMQ, and then a ps to see what those processes are.
My system has the following parts:
ActiveMQ broker exposed over TCP on port 61616
3 Grails/Spring WARs that live in their own Tomcat servers; they publish and consume messages to/from the JMS broker
n remote client systems, each with a JMS listener component to receive client-specific messages, connecting to the JMS broker through a VPN using a hostname and port 61616
So far, everything works fine across the dev, test, and production environments.
We've just connected a new client system in production and noticed that its logs start to report 'channel was inactive for too long' exceptions and it drops the connection.
Worryingly, the overall effect of this one client is that it stops all message consumption on the broker, bringing our whole system to a halt.
This client's listener (using Spring's caching connection factory) appears to connect to the JMS broker fine and process some messages, then after about 3 minutes it reports the exception. We turned on DEBUG in ActiveMQ and got loads of output, but nothing suggesting a warning or error on the broker around the same time.
We believe that ActiveMQ has an internal keep-alive that should maintain the connection even if it is inactive for longer than the default 30 seconds.
Our infrastructure team has monitored this client's VPN and confirms it stays up and connected the whole time.
We don't believe the code or Spring config is at fault, as we have numerous other instances of the listener at different clients and they all behave fine.
I suppose I have two questions really:
What is causing the 'channel inactive' exceptions?
Why does this exception in a single client stop ActiveMQ from working?
EDIT - adding exception stacktrace:
2013-04-24 14:02:06,359 WARN - Encountered a JMSException - resetting the underlying JMS Connection (org.springframework.jms.connection.CachingConnectionFactory)
javax.jms.JMSException: Channel was inactive for too (>30000) long: jmsserver/xxx.xx.xx.xxx:61616
at org.apache.activemq.util.JMSExceptionSupport.create(JMSExceptionSupport.java:49)
at org.apache.activemq.ActiveMQConnection.onAsyncException(ActiveMQConnection.java:1833)
at org.apache.activemq.ActiveMQConnection.onException(ActiveMQConnection.java:1850)
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)
at org.apache.activemq.transport.ResponseCorrelator.onException(ResponseCorrelator.java:126)
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)
at org.apache.activemq.transport.WireFormatNegotiator.onException(WireFormatNegotiator.java:160)
at org.apache.activemq.transport.InactivityMonitor.onException(InactivityMonitor.java:266)
at org.apache.activemq.transport.InactivityMonitor$4.run(InactivityMonitor.java:186)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:693)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:719)
at java.lang.Thread.run(Thread.java:813)
Caused by: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too (>30000) long: jmsserver/xxx.xx.xx.xxx:61616
... 4 more
Have you tried the following:
Disable the InactivityMonitor by setting wireFormat.maxInactivityDuration=0, e.g.
URL: tcp://localhost:61616?wireFormat.maxInactivityDuration=0
If you don't wish to disable it, have you tried setting it to a high number, e.g. URL: tcp://localhost:61616?wireFormat.maxInactivityDuration=5000000 (just an example; use your own time in ms)? See the sketch after this list for where the option goes.
Also, ensure that the jar files are the same version for both client and server.
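A quick sketch of where that option goes on the client side (the broker host is a placeholder; the same query string works wherever you supply the broker URL):

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class BrokerConnection {

        // Disable the inactivity monitor entirely (0), or raise it well above the
        // default 30000 ms if you prefer to keep the check.
        public static ActiveMQConnectionFactory connectionFactory() {
            return new ActiveMQConnectionFactory(
                    "tcp://jmsserver:61616?wireFormat.maxInactivityDuration=0"); // placeholder host
        }
    }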
Hope it helps
You just need to change the activemq.xml (configuration file):
In the transportConnectors section, change:
    <transportConnector name="ws" uri="ws://0.0.0.0:61614"/>
to:
    <transportConnector name="ws" uri="tcp://0.0.0.0:61614"/>
This works for my Windows and Linux virtual machines.