Too many TIME_WAIT connections with Jetty - java

I am running an API on 10 different servers, all of them behind a firewall, and I am using Jetty 8 to serve all the HTTP requests. The use case for this API is short-lived connections.
A few months ago I started getting random "Too many open file descriptors" errors. These errors make the server completely unresponsive, and I need to restart Jetty to recover. Today this happens 0-10 times a day depending on the traffic I am getting.
After some investigation, I noticed that I am exhausting the number of available connections because all of them are stuck in the TIME_WAIT state, so I can't create new ones.
ss -s
TCP: 13392 (estab 1549, closed 11439, orphaned 9, synrecv 0, timewait 11438/0), ports 932
In this example the number of connections in TIME_WAIT is fairly low, but it can go up to 50k.
I have tried several kernel tweaks, and I also tried setting SO_LINGER to 1 second on the Jetty sockets. All of these changes reduced the frequency of the errors, but I am still getting them regularly.
Also worth mentioning: I am receiving around 3k requests/second on each server and the CPU usage is very low. This connection issue is currently the bottleneck to scaling my traffic.
Does anyone have an idea of what I can do to handle this correctly?

'Too many open file descriptors' is probably caused by a resource leak in your application.
The TIME_WAIT state is caused by being the end that first sends a close, rather than the end that first receives the close. You might want to reconsider your application protocol so that it is the client which closes first. This is not hard to arrange, and it comes for free if you use client-side connection pooling, for example.
These two conditions are not related. The TIME_WAIT state can only occur on a port whose socket has already been closed. It does not cause 'too many open file descriptors' problems.
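For illustration, here is a minimal client-side pooling sketch, assuming Apache HttpClient 4.x and made-up pool sizes; pooled keep-alive connections are reused rather than closed after each request, so neither end churns through TIME_WAIT sockets:

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// A bounded pool of persistent connections, reused across requests.
PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
pool.setMaxTotal(200);          // made-up cap across all routes
pool.setDefaultMaxPerRoute(50); // made-up cap per host:port
CloseableHttpClient client = HttpClients.custom()
        .setConnectionManager(pool)
        .build();
// Each request borrows a pooled connection and returns it when the response is closed.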

Related

How can I create a "soft" broken TCP connection, as if networking hardware had silently disconnected the stream?

I've got a Java program that opens a TCP stream and connects to a listening port on a remote server. I send a request to the server and I receive a response. I then let the stream sit idle for 60 minutes. At that point if I write a new request it will not arrive at the server. In short order TCP/IP will let me know that the connection has gone away.
My client code is running on a Windows laptop which is connected to a corporate environment via a VPN router. The server is whirring away up in Canada, far away from me here in central Massachusetts USA. I'm likely being routed through multiple pieces of networking equipment. I have no idea which one is causing the stream to break. (I keep thinking of Ghostbusters and "Don't cross the streams!")
What is the best term to use when a piece of equipment specifically "forgets" about a TCP connection which has been idle, causing it to break? Is that half-open, half-closed, or just plain gone?
I want to be able to simulate this timeout scenario entirely within my home lab so that I can test more easily -- for example, without having to wait 60 minutes! What's a good technique, and what is the appropriate equipment I should use to simulate this "disconnect"? I've got extra switches here at home, as well as one old (and feisty!) WRT router that could use some lovin'.
I do not want to enable keepalive to mask the problem. Keepalive won't prevent all possible stream disconnection scenarios, AFAIK. I want to do the best that I can at letting the problem occur and handling it quickly and cleanly when it does.
Thank you very much,
Bill S
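One way to make the eventual failure cheap to observe in code -- whatever lab trick ends up dropping the path -- is bounding the client's reads, so a dead connection surfaces in seconds instead of waiting on TCP's own timers. A rough sketch (host, port, and payload are made up):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

Socket socket = new Socket("server.example.com", 4000); // made-up endpoint
socket.setSoTimeout(5000);                  // SO_TIMEOUT: bound each read to 5 seconds
OutputStream out = socket.getOutputStream();
InputStream in = socket.getInputStream();
out.write("PING\n".getBytes(StandardCharsets.US_ASCII));
out.flush();
try {
    int reply = in.read();                  // a silently dropped path shows up here
} catch (SocketTimeoutException e) {
    socket.close();                         // treat the connection as dead and reconnect
}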

Using Spring REST template, either creating too many connections or slow

I have a RESTful service that works very fast. I am testing it on localhost. The client is using Spring REST template. I started by using a naive approach:
RestTemplate restTemplate = new RestTemplate(Collections.singletonList(new GsonHttpMessageConverter()));
Result result = restTemplate.postForObject(url, payload, Result.class);
When I make a lot of these requests, I am getting the following exception:
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on POST request for "http://localhost:8080/myservice":No buffer space available (maximum connections reached?): connect; nested exception is java.net.SocketException: No buffer space available (maximum connections reached?): connect
This is caused by connections not being closed and hanging in the TIME_WAIT state. The exception starts once the ephemeral ports are exhausted; execution then stalls until ports become free again, so I see bursts of peak performance separated by long pauses. The rate I am getting is almost what I need, but of course these TIME_WAIT connections are not good. Tested on both Linux (Ubuntu 14) and Windows (7), with similar results that set in at different times because the ephemeral port ranges differ.
To fix this, I tried using an HttpClient with HttpClientBuilder from Apache Http Components library.
RestTemplate restTemplate = new RestTemplate(Collections.singletonList(new GsonHttpMessageConverter()));
HttpClient httpClient = HttpClientBuilder.create()
.setMaxConnTotal(TOTAL)
.setMaxConnPerRoute(PER_ROUTE)
.build();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory(httpClient));
Result result = restTemplate.postForObject(url, payload, Result.class);
With this client, I see no exceptions. The client is now using only a very limited number of ephemeral ports. But whatever settings I use (TOTAL and PER_ROUTE), I can't get the performance I need.
Using the netstat command, I see that not many connections are actually open to the server. I tried setting the numbers to several thousand, but it seems the client never uses that many.
Is there anything I can do to improve the performance, without opening too many connections?
UPDATE: I've tried setting the total and per-route connection counts to 5000 and 2500, but it still looks like the client is not creating more than about a hundred (judging from netstat -n | wc -l). The REST service is implemented using JAX-RS and running on Jetty.
UPDATE 2: I have now tuned the server with some memory settings and I am getting really good throughput. The naive approach is still a bit faster, but I think that's just the small overhead of pooling on the client side.
Actually Spring is not leaking connections here. What you're seeing is standard behavior of the Linux kernel (and every major OS): all sockets closed from the machine go into the TIME_WAIT state for a period of time. This prevents the next socket that uses that ephemeral port from receiving packets that were actually intended for the previous socket on that port. The difference you're seeing between the two clients is a result of the connection pooling approach each one takes.
More specifically, RestTemplate does not use connection pooling by default. This means every rest call opens a new local ephemeral port and a new connection to the server. If your service is very fast, it will blow through its available local port range in no time at all. With the Apache HttpClient, you are taking advantage of connection pooling. This will prevent your application from seeing the problem that you described. However, given that your service is able to respond faster than the Linux kernel is taking sockets out of TIME_WAIT, connection pooling will make your client slower no matter what you do (if it didn't slow anything down - then you'd run out of local ephemeral ports again).
While it's possible to enable TIME_WAIT reuse in the Linux kernel, it can be dangerous (packets can be delayed, and an ephemeral port could receive stray packets meant for an earlier connection, causing all kinds of problems). The solution here is to use connection pooling, as in your second example, with numbers high enough to get close to the performance you're looking for.
To tune your connection pool, you'll want to tweak the maxConnPerRoute and maxConnTotal parameters. maxConnPerRoute limits the number of connections made to a single IP:port pair, and maxConnTotal limits the total number of connections that will ever be opened. In your case, since all requests appear to go to the same location, you could set them to the same (high) value.
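If all traffic really does go to one IP:port, a sketch of the tuned builder (the values echo the question's UPDATE and are illustrative, not measured optima):

HttpClient httpClient = HttpClientBuilder.create()
        .setMaxConnTotal(5000)      // overall cap on pooled connections
        .setMaxConnPerRoute(5000)   // same cap for the single IP:port route
        .build();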

FIN_WAIT Issue With Java Monitoring Application

Having issues with FIN_WAIT1 on my RHEL 5.4 box running Introscope. What I have observed so far: whenever the target JVM we are monitoring with Introscope hangs, the agent running on that host stops sending data. After some time the socket on the Introscope server goes into the FIN_WAIT1 state and remains there for a long time; it only gets cleaned up if we restart the target JVM.
I would like to know whether this is happening because of a bug in Introscope or is something at the TCP layer.
FIN_WAIT1 is at the TCP layer: it means your machine's TCP stack has sent a close (FIN) and is waiting for the other side's TCP stack to acknowledge it. It usually doesn't cause much harm beyond holding a tiny amount of kernel state until it times out. However, it can sometimes prevent you from restarting a server on the same port, in which case you can set the SO_REUSEADDR and/or SO_REUSEPORT options on the socket before binding it the first time. (This has some security implications if you're sharing the machine.)
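In Java, that corresponds to enabling the reuse-address option on an unbound ServerSocket before bind(); a minimal sketch with a made-up port:

import java.net.InetSocketAddress;
import java.net.ServerSocket;

ServerSocket server = new ServerSocket();    // created unbound so the option applies to bind()
server.setReuseAddress(true);                // SO_REUSEADDR: rebind while old sockets linger
server.bind(new InetSocketAddress(8080));    // made-up port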

Sockets in CLOSE_WAIT from Jersey Client

I am using Jersey 1.4, the ApacheHttpClient, and the Apache MultiThreadedHttpConnectionManager class to manage connections. For the HttpConnectionManager, I set staleCheckingEnabled to true, maxConnectionsPerHost to 1000 and maxTotalConnections to 1000. Everything else is default. We are running in Tomcat and making connections out to multiple external hosts using the Jersey client.
I have noticed that after a short period of time I begin to see sockets in the CLOSE_WAIT state associated with the Tomcat process. Some monitoring with tcpdump shows that the external hosts appear to be closing the connection after some time, but it is not getting closed on our end. Usually there is some data in the socket read queue, often 24 bytes. The connections use HTTPS and the data seems to be encrypted, so I'm not sure what it is.
I have checked to be sure that the ClientRequest objects that get created are closed. The sockets in CLOSE_WAIT do seem to get recycled and we're not running out of any resources, at least at this time. I'm not sure what's happening on the external servers.
My question is, is this normal and should I be concerned?
Thanks,
John
This is most likely a device such as a firewall, or the remote server itself, timing out the TCP session. You can analyze packet captures of HTTPS using Wireshark as described on their SSL page:
http://wiki.wireshark.org/SSL
The staleCheckingEnabled flag only issues the check when you actually go to use the connection, so you aren't holding network resources (TCP sessions) when they aren't needed.
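For reference, the setup described in the question maps onto the commons-httpclient 3.x API roughly like this (a sketch, assuming that API version):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
manager.getParams().setStaleCheckingEnabled(true);         // probe a pooled connection before reuse
manager.getParams().setDefaultMaxConnectionsPerHost(1000);
manager.getParams().setMaxTotalConnections(1000);
HttpClient client = new HttpClient(manager);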

MantaRay JMS: event ID 4226. TCP connect timeout?

I have a problem with MantaRay JMS: I use a static world map because autodiscovery wouldn't work in our network. If more than 10 peers are offline, I get event ID 4226.
The problem is that Microsoft set a limit of 10 half-open connections in Windows XP SP2. MantaRay tries to contact every peer and starts a lot of connections. The first 10 connections are OK; when the 11th starts, our software must wait for another connection attempt to time out. Meanwhile, any other program trying to access the network on the same PC times out as well.
The strange thing is that on some PCs the connection times out after 1-2 seconds and the problem has almost no consequences, while on others we have to wait 10 or 20 seconds. According to Microsoft, there is no way to configure the default TCP connect timeout directly, and there are other factors (network switches, routers, VPN...) that can influence it.
I looked at the MantaRay source code and tried to find a way to set the TCP connect timeout, but MantaRay uses SocketChannels instead of "regular" sockets, and the connect() method has no timeout parameter. Am I missing something?
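The connect() call on SocketChannel indeed takes no timeout argument, but a non-blocking connect combined with a Selector gives the same effect; a minimal sketch (peer address and timeout are made up):

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

SocketChannel channel = SocketChannel.open();
channel.configureBlocking(false);
channel.connect(new InetSocketAddress("peer.example", 9000)); // returns immediately
Selector selector = Selector.open();
channel.register(selector, SelectionKey.OP_CONNECT);
if (selector.select(2000) > 0 && channel.finishConnect()) {
    // connected within 2 seconds
} else {
    channel.close();   // give up instead of waiting for the OS default timeout
}
selector.close();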
You could also patch the TCP/IP connection limit of Windows XP, if you don't mind using such things. Several sites offer patches; search Google for "change winxp tcp connection limit" and you'll find most of them. Use those tools at your own risk, though. Working around the limit in your own code is probably the better approach.
Problem solved.
I replaced MantaRay entirely with a much simpler JMS provider I wrote myself: I send an initial test message over UDP, and a peer is only allowed to open a TCP connection after that first message has been received.
This taught me to be careful when using open-source (GPL) software.
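A rough sketch of that UDP-probe handshake, as I read the description (the peer address, ports, and the ack step are my assumptions):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

DatagramSocket probe = new DatagramSocket();
probe.setSoTimeout(2000);                                  // don't hang on dead peers
InetAddress peer = InetAddress.getByName("peer.example");  // made-up peer
byte[] hello = "HELLO".getBytes(StandardCharsets.US_ASCII);
probe.send(new DatagramPacket(hello, hello.length, peer, 9876));
DatagramPacket ack = new DatagramPacket(new byte[16], 16);
probe.receive(ack);             // SocketTimeoutException here means the peer is offline
// Only pay for a TCP connect once the peer has proven it is alive.
Socket socket = new Socket();
socket.connect(new InetSocketAddress(peer, 9877), 2000);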
