This is shameful, but we know there are some ActiveMQ connection leaks. The code is old and has many twists and turns that make finding the leaky flow very hard.
We fire many short-lived jobs from a batch machine. We know that not all paths close the ActiveMQ connection properly. When a connection is not closed but the job terminates, ActiveMQ holds that connection for some amount of time. Ultimately, some critical applications get impacted because ActiveMQ's maximum connection limit is exceeded.
Is it possible to set a connection name or other identifying information so that an improperly closed connection will show up in ActiveMQ's log files? This would tell us which job's logs need to be examined. The sheer number of jobs makes it very hard to find out which exact job caused the problem. However, once we know the job, we can deduce enough information from its logs to find and fix the connection leaks.
Right now all we see is the IP address from which the connection originated, and since all the jobs originate from the same machine, that's not helpful for finding out who caused the problem.
If you add jms.clientID=something to your connection URL and turn on DEBUG logging in your conf/log4j.properties, you will get the client ID in your DEBUG log on the AMQ side. You could then write something to analyze your log, find the AMQ connection ID for a given clientID, and match the logs that way.
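For illustration, a minimal sketch of what the job side could look like, assuming each job builds its broker URL with a unique client ID (the job name, host, and port here are hypothetical):

import javax.jms.Connection;
import org.apache.activemq.ActiveMQConnectionFactory;

// Hypothetical per-job identifier; it must be unique, since the broker
// rejects a second active connection using the same client ID.
String jobName = "nightly-report-42";

// jms.* options on the URL are applied to the ConnectionFactory by the
// ActiveMQ client, so the broker sees (and can log) this client ID.
ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
        "tcp://amq-host:61616?jms.clientID=" + jobName);

Connection connection = factory.createConnection();
connection.start();
// ... job work ...
connection.close(); // the leaks come from paths that skip this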
If your process is truly exiting, though, your connection should be going away at that point (i.e., you can't keep the connection alive if there's no process to service it).
If you are running on Linux, you can do a netstat -anp | grep 61616 (or whatever your AMQ port is) to see which PIDs still have connections to AMQ, and then a ps to see what those processes are.
Related
We use a STOMP broker relay (external broker: ActiveMQ 5.13.2) in our project; see
https://docs.spring.io/spring/docs/current/spring-framework-reference/web.html#websocket-stomp-handle-broker-relay
We use the following stack:
org.springframework:spring-jms:jar:5.1.8.RELEASE
org.springframework:spring-messaging:jar:5.1.8.RELEASE
io.projectreactor:reactor-core:jar:3.2.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.8.6.RELEASE
io.netty:netty-all:jar:4.1.34.Final
From time to time (let's say once every two weeks) we observe the following error in Tomcat's catalina.out log:
2019-08-21 13:38:58,891 [tcp-client-scheduler-5] ERROR com.*.websocket.stomp.SimpMessagingSender - BrokerAvailabilityEvent[available=false, StompBrokerRelay[ReactorNettyTcpClient[reactor.netty.tcp.TcpClientDoOn#219abb46]]]
2019-08-21 13:38:58,965 [tcp-client-scheduler-1] ERROR org.springframework.messaging.simp.stomp.StompBrokerRelayMessageHandler - Transport failure: java.lang.IllegalStateException: No TcpConnection available
After that error, STOMP communication is broken (the system connection, a single TCP connection, is not available).
And it seems that everything started when we updated the stack from:
org.springframework:spring-jms:jar:5.0.8.RELEASE
org.springframework:spring-messaging:jar:5.0.8.RELEASE
io.projectreactor:reactor-core:jar:3.1.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.7.8.RELEASE
io.netty:netty-all:jar:4.1.25.Final
The ActiveMQ version did not change.
There is a bug reported in Spring that auto-reconnect fails when the system connection is lost; see:
https://github.com/spring-projects/spring-framework/issues/22080
And now, three questions:
How can we make this problem more reproducible?
How can we fix this reconnect behavior? :)
How can we prevent losing this connection? :)
EDIT 23.09.2019
After the error occurred, the TCP state for port 61613 (STOMP) was the following (please note the CLOSE_WAIT state):
netstat -an | grep 61613
tcp6 0 0 :::61613 :::* LISTEN
tcp6 2 0 127.0.0.1:49084 127.0.0.1:61613 CLOSE_WAIT
I can't say that I have enough information to answer your question, although I have some input that may help you find a way forward.
ActiveMQ is typically used in an environment that is hosted/distributed, so load and scaling should always be a consideration.
Most DBs/message queues/etc. will need some sort of tuning for load, even on AWS (via requesting higher limits), even though most of that is taken care of by the hosting provider.
But I digress...
In this case it appears you're using the TCP transport for your queue:
https://activemq.apache.org/tcp-transport-reference
As you can see, all of these settings can be tuned and have default values.
So in the case of issues logged from the Spring side connecting to AMQ, you'll want to narrow down the time of the error and then go look at your AMQ metrics and logs.
If you don't have monitoring for AMQ, I suggest:
Add Monitoring - https://activemq.apache.org/how-can-i-monitor-activemq
Add logging (or find out where the logs are). - Then enable detailed logging. (AMQ uses log4j, so just look at the log4j config file or add one.) Beyond this, consider sending the logs to a log aggregator. -- https://activemq.apache.org/how-can-i-enable-detailed-logging
Look at your hosting provider's metrics & downtime. For instance, if using AWS, there are very detailed incident logs for network failures or momentary issues with VPC or cross-region tunneling, network traffic in/out, etc.
Setting up the right tools for your distributed systems to enable your team to search/find errors/logs (and documenting how to do it) is extremely helpful. A step beyond this (for mature systems) is to add a layer on top of your monitoring so that your systems start telling you when there is a problem instead of the other way around (go looking for problems).
That may be a bit verbose - but that all leads up to me asking if you have logs / metrics for the AMQ system at the times of the failure. If you do, please post them!
I make these suggestions because:
There is no information provided on your load expectations, the variability of that load, or any recognition that load is a consideration in the system (via troubleshooting steps).
Logs/errors provided are strictly from the client side.
The error is infrequent and inconsistent to reproduce, so it could be almost anything (memory leak, load issue, etc.), which is why monitoring is necessary.
Also consider adding Spring Actuator for monitoring your message client on the Spring side. Client connection pools frequently have their own limits and advanced settings, and especially if you scale instance sizes up/down so your instances handle more/less load, your client libraries may need some settings tuning.
https://www.baeldung.com/spring-boot-actuators
Exposing metrics about current Websocket connections with Spring
You can also catch the exception and tear down and re-create your connection/settings, although this wouldn't be the first thing I recommend without knowing more about the situation and stats at the time of the connection failure.
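If you do decide to handle it in code, one way to detect the broken system connection (rather than catching the exception itself) is to listen for Spring's BrokerAvailabilityEvent, the same event your SimpMessagingSender already logs. A minimal sketch, with the recovery action left as a placeholder since it depends on your setup:

import org.springframework.context.ApplicationListener;
import org.springframework.messaging.simp.broker.BrokerAvailabilityEvent;
import org.springframework.stereotype.Component;

@Component
public class BrokerAvailabilityWatcher implements ApplicationListener<BrokerAvailabilityEvent> {

    @Override
    public void onApplicationEvent(BrokerAvailabilityEvent event) {
        if (!event.isBrokerAvailable()) {
            // Placeholder: alert, and/or restart the relay
            // (StompBrokerRelayMessageHandler implements Lifecycle, so it can be
            // stopped and started) if the built-in reconnect discussed in the
            // linked Spring issue does not recover on its own.
        }
    }
}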
We are getting the below exception while performing a load test on our application, which uses Gremlin Java.
How can we solve this issue?
Exception:
java.lang.IllegalStateException: org.apache.tinkerpop.gremlin.process.remote.RemoteConnectionException: java.lang.RuntimeException: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timed out while waiting for an available host - check the client configuration and connectivity to the server if this message persists
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.promise(RemoteStep.java:98)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.processNextStart(RemoteStep.java:65)
at org.apache.tinkerpop.gremlin.process.traversal.step.ut
"Timed out while waiting for an available host" - This is most certainly a connectivity issue between your client and DB. There are numerous answers around debugging connectivity to Neptune, please try them out. To begin with, can you try the following from your client machine?
telnet <db-endpoint> <db-port>
You would most likely see that it's waiting to establish the connection, which confirms this hypothesis.
In general, establishing a connection to the server is fairly quick. The only timeout that you need to worry about is the query timeout, and Neptune has a parameter group entry for that.
https://docs.aws.amazon.com/neptune/latest/userguide/parameters.html
I faced the same error. Neptune does not log the error stack trace in its logs. For me, the TimeoutException appeared when CPU usage went above 60 percent. The CPU would go this high because of the many connections being made to the DB.
Gremlin is based on WebSockets, and multiple requests can be multiplexed over the same channel. Adding maxInProcessPerConnection and maxSimultaneousUsagePerConnection helped me reduce the error rate to 0 percent. These parameters set the number of processes that can be multiplexed within one connection. In my case around 50 workers read/write concurrently.
I observed that for my use case setting the values to 32 led to minimum CPU usage.
Below are the Cluster properties I am settling on for now.
By default, the Cluster keeps a pool of at most 8 WebSocket connections if nothing is specified. I was getting the TimeoutException when maxPoolSize was set to 100.
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.ser.Serializers;

Cluster cluster = Cluster.build()
        .addContactPoint(uri)
        .port(port)
        .serializer(Serializers.GRAPHBINARY_V1D0)
        .maxInProcessPerConnection(32)          // in-flight requests multiplexed per connection
        .maxSimultaneousUsagePerConnection(32)  // concurrent borrows of a single connection
        .maxContentLength(10000000)             // max size in bytes of a single message
        .maxWaitForConnection(10)               // ms to wait for a free connection from the pool
        .create();
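For completeness, a hedged sketch of how such a Cluster is typically handed to a remote traversal source (the variable names here are illustrative, not from the original post):

import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

GraphTraversalSource g = traversal().withRemote(DriverRemoteConnection.using(cluster));
// ... run traversals with g ...
cluster.close(); // releases the WebSocket connection pool when the workers are done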
Could be related: Difference between Connection timed out and Read timed out
I have written a Java server application using NIO.
I connected a client to my server application and unplugged the client's network cable. On the server side, I didn't get any exception immediately, but after some time (8 minutes or so) I got an "IOException: Connection timed out".
Here is a partial stack trace:
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:225)
at sun.nio.ch.IOUtil.read(IOUtil.java:198)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375)
........
Up until this point, when I look at the netstat output, the socket state of this particular client connection is shown as ESTABLISHED.
Questions are:
Is this timeout configurable?
Why does the netstat output show the socket state as ESTABLISHED? Ideally it should be CLOSE_WAIT (as the client got disconnected)
No, it is not configurable. It is the result of retransmission timeouts. It wouldn't happen at all unless the application kept writing, or had pending writes when the disconnect happened.
It shouldn't be CLOSE_WAIT, as no FIN had been received. Ergo it should be ESTABLISHED.
That timeout is generally not configurable, as it depends on what the operating system offers. Unix in general does not allow a process to set the connection timeout, and it is generally fixed at around two minutes. Perhaps some Linux/BSD systems allow this to be configured, but that's not portable, and normally it can only be changed by the administrator, not an ordinary user. It has to do with the number of retransmissions and the timeouts used for each retry, and it is under the exclusive control of the TCP implementation.
When you finish a connection you pass through two states (FIN_WAIT and TIME_WAIT) that are not timeout states. The first of the two is for getting the other end's response (you can close your side of the connection, telling the other side you are not going to send more data, but you have to wait for the other end to do the same thing). TIME_WAIT is a special state that the kernel maintains for a closed connection so it can process (and discard) all the possible retransmissions of the last frames that may still be in flight after the connection is closed. These states have nothing to do with timeouts.
A TCP connection has no implicit timeout. Two machines can go weeks without exchanging any data if they have nothing to transmit. You can enable a kind of heartbeat between otherwise silent connections to check their liveness with one socket option (SO_KEEPALIVE). This option makes the TCP stacks on both sides exchange empty packets to know whether the other side is still alive. Again, you can only control whether these packets are used, not their frequency or the number of lost probes that closes the connection (this can be configured on Linux, but only by touching the kernel configuration as administrator).
Note 1 (answer to #Krishna Chaitanya P)
If you unplugged the cable and got an exception some time later, it can be one of two reasons:
You continued writing to that connection and the send buffer filled up without being acknowledged in time (this is rare, as normally your process gets blocked in the write(2) system call when this happens), and some timeout (in the Java socket implementation) occurred.
Your Java TCP socket implementation uses the SO_KEEPALIVE option (the more probable case). As I said before, you have a boolean switch to use it or not, but you cannot adjust the time between keepalives or the number of lost probes that drops your connection. Try calling the getKeepAlive()/setKeepAlive(boolean) methods on the Socket class to control this feature. I have not seen in the documentation whether a connected socket has keepalive enabled by default. This is, by far, a commonly used option in a server, as it allows disconnecting clients that lose their connection without telling the server.
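As a small illustration of those Socket methods (plus the NIO equivalent, since the question uses NIO), keeping in mind that Java only toggles the option while the probe timing stays with the OS; the host and port here are made up:

import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

// Classic socket API:
Socket socket = new Socket("example.com", 12345);
System.out.println("keepalive enabled? " + socket.getKeepAlive());
socket.setKeepAlive(true); // request OS-level keepalive probes on this connection

// NIO equivalent for a SocketChannel:
SocketChannel channel = SocketChannel.open(new InetSocketAddress("example.com", 12345));
channel.setOption(StandardSocketOptions.SO_KEEPALIVE, Boolean.TRUE);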
In my experience, the cause of this exception for a connected socket was always a firewall closing connections that had been idle for too long. I've seen it happen in cloud environments (AWS, Rackspace) in particular, but it's not limited to them. Most likely, you have some kind of firewall between the two connection peers which closes idle connections after some time.
The best fix in an ideal world is to change the firewall configuration, provided you or an operations team has access to it. In any case, it's better if you can handle that use case in your code and gracefully terminate the communication with the other peer.
Because the CLOSE_WAIT state means a FIN has been received from the peer and the local side has not yet sent its corresponding FIN, and that is not the case here (no FIN was received).
This timeout is most probably configurable.
We have a Java Spring MVC 2.5 application using Tomcat 6 and MySQL 5.0. We have a bizarre scenario where, for whatever reason, the number of connections used in the c3p0 connection pool starts spiraling out of control and eventually brings Tomcat down.
We monitor the c3p0 connection pool through JMX, and most of the time connections are barely used. When this spiraling situation happens, our Tomcat connection pool maxes out and Apache starts queuing threads.
In the spiraling scenario, the database has low load and is not reporting any errors or any obviously bad situation.
We're starting to run out of ideas on how to detect this issue. I don't think a Tomcat stack dump would do me any good when the situation is already spiraling out of control, and I'm not sure how I could catch it before it does.
We also use Terracotta, which, judging by the logs, I don't believe is doing anything odd.
Any ideas would be greatly appreciated!
Cheers!
Somewhere you're leaking a connection. This can happen when you explicitly retrieve Hibernate sessions from the session factory rather than getting the connection associated with an in-process transaction (I can't remember the exact method names).
C3P0 will let you debug this situation with two configuration options (the following is copied from the docs, which are part of the download package):
unreturnedConnectionTimeout defines a limit (in seconds) to how long a Connection may remain checked out. If set to a nonzero value, unreturned, checked-out Connections that exceed this limit will be summarily destroyed, and then replaced in the pool.
If you set debugUnreturnedConnectionStackTraces to true, a stack trace will be captured each time a Connection is checked out. Whenever an unreturned Connection times out, that stack trace will be printed, revealing where a Connection was checked out that was not checked in promptly. debugUnreturnedConnectionStackTraces is intended to be used only for debugging, as capturing a stack trace can slow down Connection check-out.
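If you configure c3p0 through a ComboPooledDataSource (for example as a Spring bean), a minimal sketch of wiring those two options looks like the following; the 300-second value is only illustrative and should comfortably exceed your longest legitimate checkout:

import com.mchange.v2.c3p0.ComboPooledDataSource;

ComboPooledDataSource dataSource = new ComboPooledDataSource();
// ... the usual jdbcUrl/user/password/pool-size settings ...

// Destroy connections that stay checked out longer than 5 minutes.
dataSource.setUnreturnedConnectionTimeout(300);
// Record a stack trace at check-out so the leak site is printed when the timeout fires.
dataSource.setDebugUnreturnedConnectionStackTraces(true);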
I am trying to retrieve data from an Oracle database using JDBC (ojdbc14.jar). I have a limited number of concurrent connections when connecting to the database, and these connections are managed by the WebSphere connection pool.
Sometimes when I make the call I see an UncategorizedSQLException thrown in my logs with one of the following Oracle codes:
ORA-01012 (not logged in) exception
ORA-17410 (connection timed out, socket empty),
ORA-02396 exceeded maximum idle time, please connect again
Other times I get no exceptions and it works fine.
Does anyone understand what might be happening here?
In WebSphere I have my statement cache size set to 10. Not sure if that is relevant in this situation, when it looks like the connection is being dropped.
It looks like the database is deciding to drop the connection. It's a good idea to write your code in a way that doesn't require a connection to be held forever. A better choice is to have the program connect to the database, do its work, and disconnect. This eliminates the problem of the database deciding to disconnect the application due to inactivity/server overload/whatever, and the program then needing to figure this out and make a reasonable stab at reconnecting.
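If you are on Java 7 or later, a minimal sketch of that connect/do-work/disconnect shape using try-with-resources might look like this (the DataSource stands in for the WebSphere-managed pool, and the query and parameter are only illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// dataSource would be looked up from WebSphere (e.g. via JNDI); taken as given here.
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement("SELECT name FROM employees WHERE id = ?")) {
    ps.setLong(1, 42L); // illustrative parameter
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getString("name"));
        }
    }
} catch (SQLException e) {
    // Handle or rethrow; the connection has already been returned to the pool.
    throw new RuntimeException(e);
}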
I hope this helps.