FIN_WAIT Issue With Java Monitoring Application - java

I am having issues with sockets stuck in FIN_WAIT1 on my RHEL 5.4 server running Introscope. What I have observed so far is that whenever the target JVM we are monitoring with Introscope hangs, the agent running on that host stops sending data. After some time the socket on the server (the Introscope server) goes into the FIN_WAIT1 state and remains there for a long time. It gets cleaned up if we restart the target JVM.
I would like to know whether this is happening because of a bug in Introscope or whether it has something to do with the TCP layer.

FIN_WAIT1 is a TCP-layer state: your machine's TCP stack has sent its FIN and is waiting for the other side's TCP stack to acknowledge it. It usually doesn't cause much harm, other than holding a small amount of kernel state until it times out. However, it can sometimes prevent you from restarting a server on the same port, in which case you can set the SO_REUSEADDR and/or SO_REUSEPORT options on the socket before binding it the first time. (This does have some security implications if you're sharing the machine.)
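In Java, the address-reuse option is exposed as ServerSocket.setReuseAddress. Here is a minimal sketch of setting it before the bind. The class and method names are made up for illustration, and port 0 (any free port) stands in for whatever fixed port your server really uses. SO_REUSEPORT is not shown because its availability depends on the JDK version and platform.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

final class ReuseAddressSketch {
    // Creates an unbound ServerSocket, enables SO_REUSEADDR, then binds.
    // Setting the option after bind() is not guaranteed to have any effect.
    static ServerSocket openReusable(int port) throws IOException {
        ServerSocket server = new ServerSocket();      // unbound on purpose
        server.setReuseAddress(true);                  // must happen before bind()
        server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
        return server;
    }

    // Opens on an ephemeral port and reports whether the option is set.
    static boolean reuseFlagAfterOpen() {
        try (ServerSocket server = openReusable(0)) {
            return server.getReuseAddress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_REUSEADDR enabled: " + reuseFlagAfterOpen());
    }
}
```

Note that this only helps with rebinding a listening port. It does not make the stuck FIN_WAIT1 socket itself go away any sooner.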

Related

Elasticsearch unclosed client. Live threads after Tomcat shutdown. Memory usage impact?

I am using Elasticsearch 1.5.1 and Tomcat 7. The web application creates a TCP client instance as a singleton during server startup through the Spring Framework.
I just noticed that I failed to close the client during server shutdown.
Analysis with tools such as VisualVM, JConsole, and MAT in Eclipse shows that the threads created by the Elasticsearch client are still alive after the server (Tomcat) shuts down.
Note: after introducing client.close() via the context listener's destroy method, the threads are killed gracefully.
My questions are:
How can I check the memory occupied by these live threads?
What is the memory-leak impact of these threads?
We have had a few out-of-memory (PermGen) errors in PROD. These threads might be a reason, but I would still like to measure this and provide stats.
Any suggestions or help would be appreciated.
Typically clients run in a different process than the services they communicate with. For example, I can open a web page in a web browser, and then shutdown the webserver, and the client will remain open.
This has to do with the underlying design choices of TCP/IP. Glossing over the details, in most cases a client only detects that its server is gone during its next request to the server. Generally speaking, it does not continually poll the server to see if it is alive, nor does the server generally send a "please disconnect" message when shutting down.
The reason that clients don't generally poll servers is because it allows the server to handle more clients. With a polling approach, the server is limited by the number of clients running, but without a polling approach, it is limited by the number of clients actively communicating. This allows it to support more clients because many of the running clients aren't actively communicating.
The reason that servers typically don't send an "I'm shutting down" message is that the server often goes down uncontrollably (power outage, operating system crash, fire, short circuit, etc.). This means that a protocol which requires such a message will leave the clients in a corrupt state if the server goes down in an uncontrolled manner.
So losing a connection is really a function of a failed request to the server. The client will still typically be running until it makes the next attempt to do something.
Likewise, opening a connection to a server often proves very little. To validate that you really have a working connection to a server, you must ask it for some data and get a reply. Most protocols do this automatically to simplify the logic, but if you ever write your own service and don't ask the server for data, you might not have a working connection even if the API says you do. The API can report a good "connection" as soon as everything is configured successfully on your machine. To really know that it works on the other machine, you need to ask for data (and get it).
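As a rough illustration of "ask for data and get a reply", here is a sketch of a probe with a read timeout. The PING/PONG line protocol, the class name, and the 300 ms timeout are all invented for the example; a real service would use whatever request its own protocol already offers. The loopback server exists only to make the sketch runnable on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class ConnectionProbeSketch {
    // Hypothetical one-line protocol: the client sends "PING", the server answers "PONG".
    // Returns false on timeout, reset, end of stream, or an unexpected reply.
    static boolean isAlive(Socket socket, int timeoutMillis) {
        try {
            socket.setSoTimeout(timeoutMillis);            // bound the blocking read below
            OutputStream out = socket.getOutputStream();
            out.write("PING\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            return "PONG".equals(in.readLine());           // readLine() is null on EOF
        } catch (IOException e) {                          // includes SocketTimeoutException
            return false;
        }
    }

    // Demo: probe a loopback server that either answers or stays silent.
    static boolean probeLoopback(boolean serverReplies) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread peerThread = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    new BufferedReader(new InputStreamReader(
                            peer.getInputStream(), StandardCharsets.US_ASCII)).readLine();
                    if (serverReplies) {
                        peer.getOutputStream().write("PONG\n".getBytes(StandardCharsets.US_ASCII));
                        peer.getOutputStream().flush();
                    } else {
                        Thread.sleep(1000);                // stay silent past the probe timeout
                    }
                } catch (IOException | InterruptedException ignored) {
                    // demo only: nothing useful to do here
                }
            });
            peerThread.setDaemon(true);
            peerThread.start();
            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                return isAlive(client, 300);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("replying server: " + probeLoopback(true));
        System.out.println("silent server:   " + probeLoopback(false));
    }
}
```

In both cases the connect succeeds, so the API reports a good connection. Only the reply, or the lack of one, tells you whether the other side is actually working.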
Finally servers sometimes lose their clients, but because they don't waste bandwidth chattering with clients just to see if they are there, often the servers will put a "timeout" on the client connection. Basically if the server doesn't hear from the client in 10 minutes (or the configured value) then it closes the cached connection information for the client (recreating the connection information as necessary if the client comes back).
From your description it is not clear which of the scenarios you might be seeing, but hopefully this general knowledge will help you understand why after closing one side of a connection, the other side of a connection might still think it is open for a while.
There are ways to configure the network connection to report closures sooner, but I would avoid using them unless you are willing to lose a lot of your network bandwidth to keep-alive messages and can accept that your servers will not respond as quickly as they could.

Too many TIME_WAIT connections with Jetty

I am running an API on 10 different servers, all of them are behind a firewall. I am using jetty 8 to serve all the http requests. The use case for this API is short lived connections.
A few months ago I started to get random "Too many open file descriptors" errors. These errors make the server completely unresponsive, and I need to restart the Jetty server to fix it. Today this happens 0-10 times a day, depending on the traffic I am getting.
After some investigations, I noticed that I am exhausting the number of available connections because all of them are stuck in the TIME_WAIT state so I can't create new ones.
ss -s
TCP: 13392 (estab 1549, closed 11439, orphaned 9, synrecv 0, timewait *11438*/0), ports 932
In this example the number of connections in the TIME_WAIT state is fairly low, but it can go up to 50k.
I have been trying several kernel tweaks and I also tried to set the SO_LINGER timer to 1 second for jetty sockets. All these changes helped reduce the frequency but I am still getting errors regularly.
Also worth mentioning, I am receiving around 3k requests/second on each server and the cpu usage is very low. The bottleneck to scale my traffic today is this connection issue.
Does anyone have an idea of what I can do to handle this correctly?
'Too many open file descriptors' is probably caused by a resource leak in your application.
The TIME_WAIT state is caused by being the end that first sends a close, instead of the end that first receives the close. You might want to reconsider your application protocol so that it is the client which closes first. This is not too hard to arrange. It comes for free if you use client-side connection pooling, for example.
These two conditions are not related. The TIME_WAIT state can only occur on a port whose socket has already been closed. It does not cause 'too many open file descriptors' problems.
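To make the "client closes first" arrangement concrete, here is a small sketch, not Jetty-specific, in which the client reuses one connection for several requests and the server closes only after it sees the client's end of stream. The echo protocol and all names are invented for the example. The same idea is what HTTP keep-alive plus a client-side pool gives you.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ClientClosesFirstSketch {
    // Server side: answer every request line and close only once the client
    // has closed its end (readLine() returns null after the client's FIN).
    static void serveUntilClientCloses(Socket peer) {
        try (Socket s = peer) {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII));
            OutputStream out = s.getOutputStream();
            String line;
            while ((line = in.readLine()) != null) {
                out.write(("echo " + line + "\n").getBytes(StandardCharsets.US_ASCII));
                out.flush();
            }
        } catch (IOException ignored) {
            // demo only
        }
    }

    // Client side: send all requests over ONE connection, then close first,
    // so any TIME_WAIT state lands on the client rather than the server.
    static List<String> sendAll(List<String> requests) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread serverThread = new Thread(() -> {
                try {
                    serveUntilClientCloses(server.accept());
                } catch (IOException ignored) {
                    // demo only
                }
            });
            serverThread.setDaemon(true);
            serverThread.start();

            List<String> replies = new ArrayList<>();
            try (Socket client = new Socket(loopback, server.getLocalPort())) {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(client.getInputStream(), StandardCharsets.US_ASCII));
                OutputStream out = client.getOutputStream();
                for (String request : requests) {
                    out.write((request + "\n").getBytes(StandardCharsets.US_ASCII));
                    out.flush();
                    replies.add(in.readLine());
                }
            }                                              // client closes first here
            return replies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sendAll(List.of("a", "b", "c")));
    }
}
```

With one connection per request and the server closing first, every request would leave a TIME_WAIT entry on the server. Here there is one connection in total, and it is the client that closes it.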

Why am I getting a SocketException in a long running application?

I have written a Java socket server application which gives me an error if I run it for a long time, say 4-8 hours. Below is the error I get:
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:130)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:282)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:324)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:153)
at java.io.BufferedReader.readLine(BufferedReader.java:316)
at java.io.BufferedReader.readLine(BufferedReader.java:379)
at LiveRate.processData(LiveRate.java:224)
at LiveRate.mainLiveRate(LiveRate.java:265)
at LiveRate.liveRate(LiveRate.java:126)
at LiveRate.run(LiveRate.java:119)
at java.lang.Thread.run(Thread.java:636)
My socket application reads some values from another TCP/IP server, stores them temporarily, and offers them to other clients. I am not sure whether these errors are caused by heavy load on the system or by memory issues. Please help.
It is probably neither (directly) load nor memory related. Instead, it is more likely to be one of the following:
the remote service is shut down / falls over and is restarted on a regular basis,
the remote service has decided to close its end of the connection because it is "idle",
network connectivity is intermittent and you are occasionally encountering an outage or congestion-induced "brownout" that is too long,
you are using NAT or similar, and the port number that was being used for the connection has been reclaimed by the NAT gateway, or
something is enforcing some policy about TCP/IP connections being open for too long.
The bottom line is that your client software needs to be able to cope with lost connections if you want it to run for extended periods of time. This is the way that the Internet works.
I'd say it's because your connection gets reset by your Internet provider every 24 hours.
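One common way to cope with lost connections is to wrap the connect-and-read logic in a retry loop with a growing delay. The sketch below is only an outline under assumptions: the Session interface, the method names, and the delay values are hypothetical, and in your code the session body would be whatever currently opens the socket and calls readLine().

```java
import java.io.IOException;
import java.net.SocketException;

final class ReconnectSketch {
    // Hypothetical callback: connect, then read until the session ends or fails.
    interface Session {
        void run() throws IOException;
    }

    // Runs the session, retrying after an IOException with a doubling delay
    // (capped at maxDelayMillis). Returns the attempt number that completed
    // normally, or -1 if every attempt failed or the thread was interrupted.
    static int runWithReconnect(Session session, int maxAttempts,
                                long initialDelayMillis, long maxDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                session.run();
                return attempt;
            } catch (IOException e) {                      // e.g. "Connection reset"
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
            }
            if (attempt == maxAttempts) {
                break;                                     // no point sleeping after the last try
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();        // preserve the interrupt flag
                return -1;
            }
            delay = Math.min(delay * 2, maxDelayMillis);
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated session: fails twice with a reset, then completes.
        int result = runWithReconnect(() -> {
            if (++calls[0] < 3) {
                throw new SocketException("Connection reset");
            }
        }, 5, 100, 1000);
        System.out.println("completed on attempt " + result);
    }
}
```

A long-running service would usually reset the delay after a session that stayed up for a while, and would log rather than print. Both are left out to keep the sketch short.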

Sockets in CLOSE_WAIT from Jersey Client

I am using Jersey 1.4, the ApacheHttpClient, and the Apache MultiThreadedHttpConnectionManager class to manage connections. For the HttpConnectionManager, I set staleCheckingEnabled to true, maxConnectionsPerHost to 1000 and maxTotalConnections to 1000. Everything else is default. We are running in Tomcat and making connections out to multiple external hosts using the Jersey client.
I have noticed that after a short period of time I begin to see sockets in a CLOSE_WAIT state that are associated with the Tomcat process. Some monitoring with tcpdump shows that the external hosts appear to be closing the connection after some time, but it's not getting closed on our end. Usually there is some data in the socket read queue, often 24 bytes. The connections are using HTTPS and the data seems to be encrypted, so I'm not sure what it is.
I have checked to be sure that the ClientRequest objects that get created are closed. The sockets in CLOSE_WAIT do seem to get recycled and we're not running out of any resources, at least at this time. I'm not sure what's happening on the external servers.
My question is, is this normal and should I be concerned?
Thanks,
John
This is likely to be a device such as the firewall or the remote server timing out the TCP session. You can analyze packet captures of HTTPS using Wireshark as described on their SSL page:
http://wiki.wireshark.org/SSL
The staleCheckingEnabled flag only issues the check when you go to actually use the connection so you aren't using network resources (TCP sessions) when they aren't needed.

Detecting TCP dropout over an unreliable network

I am doing some experimentation over an unreliable radio network (home brewed) using very rudimentary java socket programming to transfer messages back and forth between the end nodes.
The setup is as follows:
Node A --- Relay Node --- Node B
One problem I am constantly running into is that the connection somehow drops out and neither Node A nor Node B knows that the link is dead, so both continue to transmit data. The TCP connection does not time out either. I have added a heartbeat message that causes a timeout after a while, but I would still like to know the underlying reason why TCP does not time out.
Here are the options I am enabling when setting up a socket:
channel.socket().setKeepAlive(false);
channel.socket().setTrafficClass(0x08); // for max throughput
This behavior is strange, since it is totally different from what I see on a wired network. On a wired network, I can simulate a dropped connection by pulling out the Ethernet cable; once I plug the cable back in, the connection is re-established and messages begin to pass through once more.
On the radio network, the connection is never reestablished and once it silently dies, the messages never resume.
Is there some other Java implementation detail or socket setting that I can use? Also, why am I seeing this behavior in the first place?
And yes, before anyone says anything, I know TCP is not the preferred choice over an unreliable network, but in this case I wanted to ensure no packet loss.
TCP was designed to be quiet. RFC 1122 requires the default keepalive interval to be no less than 2 hours. Unless you have control over the systems on both ends and can change the default 2-hour heartbeat (which sometimes requires a kernel rebuild), you have to add a heartbeat in your own app.
Even if you send a heartbeat, TCP still has to wait for the retransmission timeout, which varies depending on the RTT. On a high-latency network the timeout can be very long, but it should be within minutes.
You get a notification on a local network because the system can detect the link-down status and drop all connections on that network.
BTW, you want to set keepalive to true instead of false. With keepalive, you at least get the slow heartbeat.
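For reference, flipping the option from the question looks like the sketch below. It uses the same channel.socket() style as the question; the class name and the loopback listener are only there so the snippet runs on its own. Keep in mind that this only enables the OS-level probes. How soon they start is an operating system setting, which is why an application-level heartbeat is still worth keeping.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

final class KeepAliveSketch {
    // Connects to a throwaway loopback listener, enables SO_KEEPALIVE on the
    // client channel, and reports what the socket says afterwards.
    static boolean keepAliveOnLoopback() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel channel = SocketChannel.open(server.getLocalAddress())) {
                channel.socket().setKeepAlive(true);       // was setKeepAlive(false) in the question
                return channel.socket().getKeepAlive();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SO_KEEPALIVE enabled: " + keepAliveOnLoopback());
    }
}
```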
In the OSI 7-layer model, the first two layers are physical and data link. Your physical hardware running the data link protocol on wired Ethernet can detect when the cable is pulled. Your wireless hardware, and corresponding protocol, probably not so much. The TCP stack can't do anything to time out the connection if the layer 1/2 stuff isn't signaling that it is disconnected.
Define 'never'?
I expect you will be notified by a send failing eventually. You're probably just expecting to be notified sooner than you will be. The TCP stack will be retransmitting segments that it doesn't get ACKs for and the timeout before retransmission for each attempt is doubled each time it retransmits. Depending on how the stack is working out when to retransmit it's probably going to be longer than you're expecting before the stack will decide that the connection is broken and only then will it let you know.
See here: http://www.ietf.org/rfc/rfc2988.txt, here: http://msdn.microsoft.com/en-us/library/ms819737.aspx, etc.
You're used to having a wired network where the drivers can notify higher level layers that the connection has been physically broken. If you were to configure a wired network to route via a router which you then deliberately set up to not route correctly then you'd probably see similar behaviour....
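To get a feel for why the failure takes longer to surface than you expect, here is the doubling arithmetic as a tiny sketch. The initial timeout, the cap, and the number of retransmissions are illustrative assumptions, not what any particular stack uses; real stacks derive the initial value from the measured RTT and have their own retry limits.

```java
final class RetransmitBackoffSketch {
    // Total time spent waiting when a segment is sent once and then
    // retransmitted `retransmissions` times, with the timeout doubling after
    // each unanswered send and capped at maxRtoMillis.
    static long totalWaitMillis(long initialRtoMillis, long maxRtoMillis, int retransmissions) {
        long total = 0;
        long rto = initialRtoMillis;
        for (int send = 0; send <= retransmissions; send++) {   // original send + retransmissions
            total += rto;                                        // wait for an ACK that never comes
            rto = Math.min(rto * 2, maxRtoMillis);               // exponential backoff with a cap
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed values: 1 s initial timeout, 60 s cap.
        for (int n = 0; n <= 10; n++) {
            System.out.println(n + " retransmissions -> "
                    + totalWaitMillis(1000, 60_000, n) / 1000.0 + " s");
        }
    }
}
```

With these assumed numbers, five retransmissions already add up to about a minute of silence before the stack could report anything, and a higher starting timeout on a slow radio link stretches that further.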
