I am using Jersey 1.4, the ApacheHttpClient, and the Apache MultiThreadedHttpConnectionManager class to manage connections. For the HttpConnectionManager, I set staleCheckingEnabled to true, maxConnectionsPerHost to 1000 and maxTotalConnections to 1000. Everything else is default. We are running in Tomcat and making connections out to multiple external hosts using the Jersey client.
I have noticed that after a short period of time I will begin to see sockets in a CLOSE_WAIT state that are associated with the Tomcat process. Some monitoring with tcpdump shows that the external hosts appear to be closing the connection after some time, but it's not getting closed on our end. Usually there is some data in the socket read queue, often 24 bytes. The connections use HTTPS and the data appears to be encrypted, so I'm not sure what it is.
I have checked to be sure that the ClientRequest objects that get created are closed. The sockets in CLOSE_WAIT do seem to get recycled and we're not running out of any resources, at least at this time. I'm not sure what's happening on the external servers.
My question is, is this normal and should I be concerned?
Thanks,
John
This is likely a device such as a firewall, or the remote server itself, timing out the TCP session. You can analyze packet captures of HTTPS traffic with Wireshark as described on their SSL page:
http://wiki.wireshark.org/SSL
The staleCheckingEnabled flag only triggers the check when you actually go to use a connection, so you aren't consuming network resources (TCP sessions) when they aren't needed.
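For reference, a minimal sketch of the connection-manager setup described in the question (Commons HttpClient 3.x, which Jersey 1.x's ApacheHttpClient wraps); how the resulting HttpClient is handed to the Jersey client is left out here:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class PooledCommonsClient {
    static HttpClient create() {
        HttpConnectionManagerParams params = new HttpConnectionManagerParams();
        params.setStaleCheckingEnabled(true);          // checked only when a pooled connection is about to be reused
        params.setMaxTotalConnections(1000);
        params.setDefaultMaxConnectionsPerHost(1000);

        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        manager.setParams(params);

        return new HttpClient(manager);                // one pooled client shared across threads
    }
}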
In my batch application, I am sending requests across a network using a web service and Java. After running about 30000 requests and receiving the responses, the program throws a java.net.ConnectException: Connection timed out.
I am using WildFly, along with some Java code in the middle to configure the requests (Strings) before sending them across the network.
After researching the possible reasons, one suggestion I found is that a firewall is blocking my access, which can't be true since 90% of the requests have already gone through.
I've also seen somewhere that says that I could have overloaded the server, although I'm not sure what that means exactly.
You have filled up the server's listen backlog queue. This is caused by creating new connections faster than the server can accept them. You should look into connection pooling at the client, and handling multiple requests per connection at the server.
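As a rough sketch of what client-side pooling looks like with Apache HttpClient (the URL and pool limits are illustrative, and your web-service stack may expose its own pooling settings instead):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

public class PooledClient {
    public static void main(String[] args) throws Exception {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(20);           // at most 20 open connections in total
        cm.setDefaultMaxPerRoute(20); // reuse them instead of connecting once per request

        try (CloseableHttpClient client = HttpClients.custom().setConnectionManager(cm).build()) {
            for (int i = 0; i < 30000; i++) {
                HttpGet get = new HttpGet("http://example.com/service"); // hypothetical endpoint
                try (CloseableHttpResponse response = client.execute(get)) {
                    EntityUtils.consume(response.getEntity()); // fully consume so the connection returns to the pool
                }
            }
        }
    }
}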
I am using Elasticsearch 1.5.1 and Tomcat 7. The web application creates a TCP client instance as a singleton during server startup through the Spring Framework.
Just noticed that I failed to close the client during server shutdown.
Through analysis with various tools such as VisualVM, JConsole, and MAT in Eclipse, it is evident that the threads created by the Elasticsearch client are still live even after the server (Tomcat) shuts down.
Note: after introducing client.close() via Context Listener destroy methods, the threads are killed gracefully.
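For reference, a minimal sketch of the kind of context listener described above; the ClientHolder accessor is a hypothetical stand-in for however the singleton client is exposed (a Spring-managed bean could equally declare destroy-method="close"):

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import org.elasticsearch.client.Client;

public class ElasticsearchShutdownListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // the client singleton is created elsewhere (e.g. by Spring) at startup
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        Client client = ClientHolder.get(); // hypothetical accessor for the singleton client
        if (client != null) {
            client.close(); // stops the transport threads so they don't outlive the webapp
        }
    }
}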
But my questions here are:
how do I check the memory occupied by these live threads?
what is the memory-leak impact of these threads?
We have had a few OutOfMemoryError: PermGen space errors in PROD. This might be the reason, but I would still like to measure it and provide stats.
Any suggestions/help please.
Typically clients run in a different process than the services they communicate with. For example, I can open a web page in a web browser, then shut down the webserver, and the client will remain open.
This has to do with the underlying design choices of TCP/IP. Glossing over the details, in most cases a client only detects that its server is gone during the next request to the server. (Again, generally speaking) it does not continually poll the server to see if it is alive, nor does the server generally send a "please disconnect" message when shutting down.
The reason that clients don't generally poll servers is because it allows the server to handle more clients. With a polling approach, the server is limited by the number of clients running, but without a polling approach, it is limited by the number of clients actively communicating. This allows it to support more clients because many of the running clients aren't actively communicating.
The reason that servers typically don't send an "I'm shutting down" message is that many times the server goes down uncontrollably (power outage, operating system crash, fire, short circuit, etc.). This means that a protocol which requires such a message would leave the clients in a corrupt state if the server goes down in an uncontrolled manner.
So losing a connection is really a function of a failed request to the server. The client will still typically be running until it makes the next attempt to do something.
Likewise, opening a connection to a server often proves very little by itself. To validate that you really have a working connection to a server, you must ask it for some data and get a reply. Most protocols do this automatically to simplify the logic; but if you ever write your own service and you don't ask for data from the server, then even if the API says you have a good "connection", you might not. The API can report a good "connection" when you merely have everything configured on your machine successfully. To really know if it works 100% on the other machine, you need to ask for data (and get it).
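A minimal sketch of that idea in Java; the host, port, and the "PING"/"PONG" exchange are made-up stand-ins for whatever your protocol actually uses:

import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ConnectionCheck {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("example.com", 7000)) {       // hypothetical endpoint
            socket.setSoTimeout(3000);                                // don't wait forever for the reply
            OutputStream out = socket.getOutputStream();
            out.write("PING\n".getBytes(StandardCharsets.UTF_8));     // application-level probe
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            String reply = in.readLine();                             // blocks until a reply, timeout, or EOF
            System.out.println("Service is alive: " + "PONG".equals(reply));
        }
    }
}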
Finally servers sometimes lose their clients, but because they don't waste bandwidth chattering with clients just to see if they are there, often the servers will put a "timeout" on the client connection. Basically if the server doesn't hear from the client in 10 minutes (or the configured value) then it closes the cached connection information for the client (recreating the connection information as necessary if the client comes back).
From your description it is not clear which of the scenarios you might be seeing, but hopefully this general knowledge will help you understand why after closing one side of a connection, the other side of a connection might still think it is open for a while.
There are ways to configure the network connection to report closures more immediately, but I would avoid using them, unless you are willing to lose a lot of your network bandwidth to keep-alive messages and don't want your servers to respond as quickly as they could.
I have a RESTful service that works very fast. I am testing it on localhost. The client is using Spring REST template. I started by using a naive approach:
RestTemplate restTemplate = new RestTemplate(Collections.singletonList(new GsonHttpMessageConverter()));
Result result = restTemplate.postForObject(url, payload, Result.class);
When I make a lot of these requests, I am getting the following exception:
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on POST request for "http://localhost:8080/myservice":No buffer space available (maximum connections reached?): connect; nested exception is java.net.SocketException: No buffer space available (maximum connections reached?): connect
This is caused by connections not being closed and hanging in the TIME_WAIT state. The exception starts happening when the ephemeral ports are exhausted; execution then waits for ports to be freed again, so I see bursts of peak performance followed by long pauses. The rate I am getting is almost what I need, but of course these TIME_WAIT connections are not good. I tested on both Linux (Ubuntu 14) and Windows (7) with similar results, just hitting the limit at different times due to the different ephemeral port ranges.
To fix this, I tried using an HttpClient with HttpClientBuilder from Apache Http Components library.
RestTemplate restTemplate = new RestTemplate(Collections.singletonList(new GsonHttpMessageConverter()));
HttpClient httpClient = HttpClientBuilder.create()
.setMaxConnTotal(TOTAL)
.setMaxConnPerRoute(PER_ROUTE)
.build();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory(httpClient));
Result result = restTemplate.postForObject(url, payload, Result.class);
With this client, I see no exceptions. The client is now using only a very limited number of ephemeral ports. But whatever settings I use (TOTAL and PER_ROUTE), I can't get the performance I need.
Using the netstat command, I see that there are not many connections done to the server. I tried setting the numbers to several thousands, but it seems the client never uses that much.
Is there anything I can do to improve the performance, without opening too many connections?
UPDATE: I've tried setting the number of total and per-route connections to 5000 and 2500, but it still looks like the client is not creating more than a hundred (judging from netstat -n | wc -l). The REST service is implemented using JAX-RS and running on Jetty.
UPDATE2: I have now tuned the server with some memory settings and I am getting really good throughput. The naive approach is still a bit faster, but I think that's just the small overhead of pooling on the client side.
Actually Spring Boot is not leaking connections. What you're seeing here is standard behavior of the Linux kernel (and every major OS). All sockets that are closed from the machine go to a TIME_WAIT state for some duration of time. This is to prevent the next socket that uses that ephemeral port from receiving packets that were actually intended for the previous socket on that port. The difference you're seeing between the two is a result of the connection pooling approaches each one takes.
More specifically, RestTemplate does not use connection pooling by default. This means every rest call opens a new local ephemeral port and a new connection to the server. If your service is very fast, it will blow through its available local port range in no time at all. With the Apache HttpClient, you are taking advantage of connection pooling. This will prevent your application from seeing the problem that you described. However, given that your service is able to respond faster than the Linux kernel is taking sockets out of TIME_WAIT, connection pooling will make your client slower no matter what you do (if it didn't slow anything down - then you'd run out of local ephemeral ports again).
While it's possible to enable TCP reuse in the Linux kernel, it can get dangerous (packets can be delayed, and an ephemeral port could receive stray packets it doesn't understand, which could cause all kinds of problems). The solution here is to use connection pooling as you have in the second example, with sufficiently high numbers to get close to the performance you're looking for.
To help you tune your connection pool, you'll want to tweak the maxConnPerRoute and maxConnTotal parameters. maxConnPerRoute limits the number of connections that will be made to a single IP:port pair, and maxConnTotal limits the total number of connections that will ever be opened. In your case, since it appears all requests are made to the same location, you could set them to the same (high) value.
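A sketch of that tuning, building on the code from the question; the 500s are illustrative values, not measured recommendations:

import java.util.Collections;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.http.converter.json.GsonHttpMessageConverter;
import org.springframework.web.client.RestTemplate;

HttpClient httpClient = HttpClientBuilder.create()
        .setMaxConnTotal(500)       // total connections across all routes
        .setMaxConnPerRoute(500)    // same value, since every request goes to the one host:port
        .build();

RestTemplate restTemplate = new RestTemplate(Collections.singletonList(new GsonHttpMessageConverter()));
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory(httpClient));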
I am running an API on 10 different servers, all of them behind a firewall. I am using Jetty 8 to serve all the HTTP requests. The use case for this API is short-lived connections.
A few months ago I started to get random "Too many open file descriptors" errors. These errors make the server completely unresponsive and I need to restart the Jetty server to fix it. Today this happens 0-10 times a day depending on the traffic I am getting.
After some investigations, I noticed that I am exhausting the number of available connections because all of them are stuck in the TIME_WAIT state so I can't create new ones.
ss -s
TCP: 13392 (estab 1549, closed 11439, orphaned 9, synrecv 0, timewait 11438/0), ports 932
On this example the number of connections in TIME_WAIT state is pretty low but it can go up to 50k.
I have been trying several kernel tweaks and I also tried to set the SO_LINGER timer to 1 second for jetty sockets. All these changes helped reduce the frequency but I am still getting errors regularly.
Also worth mentioning, I am receiving around 3k requests/second on each server and the cpu usage is very low. The bottleneck to scale my traffic today is this connection issue.
Does anyone have an idea of what I can do to handle this correctly?
'Too many open file descriptors' is probably caused by a resource leak in your application.
The TIME_WAIT state is caused by being the end that sends the first close, rather than the end that receives it. You might want to reconsider your application protocol so that it is the client which closes first. This is not too hard to arrange, and you get it for free if you use client-side connection pooling, for example.
These two conditions are not related. The TIME_WAIT state can only occur on a port whose socket has already been closed. It does not cause 'too many open file descriptors' problems.
I have a typical java client and a server. The client sends some request to the server and waits for the response. The client reads up to say 100 bytes of data from the contained input stream into an array of bytes. It waits for the complete response of 100 bytes to be read within a specified timeout period of say 3 secs. The problem here is to identify if the server went down or crashed while/before writing the response. Basically, we need to identify if the socket was broken or the peer disconnected for some reason. Is there a way to identify this?
How to identify a broken socket connection in Java immediately?
You can't detect it immediately, in Java or any other language. TCP/IP doesn't know, so Java can't know. The only sure way to detect a broken TCP connection is by writing to it and catching IOExceptions, and they won't happen immediately.
The best way to identify that the connection is down is to time it out: you expect a response within a given amount of time, and you flag the connection if that response does not arrive as expected.
When you have a graceful disconnection (e.g. the other end calls close()), a read on the connection will let you know once the buffer has been drained.
However, if there is some other type of failure, you might not be notified until the OS times out the connection (e.g. after 3 minutes), and indeed you may want to keep the connection; e.g. if you pull the network cable out for 10 seconds and put it back in, that doesn't need to be a failure.
EDIT: I don't believe it's a good idea to be too aggressive in automatically handling connection/service "failures". This is usually better handled by a planned fix to the system, based on investigation of the true cause, e.g. increased bandwidth, redundant connectivity, faster servers, or code fixes.
If the connection is broken abnormally, you will receive an IOException when reading; that normally happens quite fast, but there are no guarantees about timing - it all depends on the OS, the network hardware, etc. If the remote end gracefully closes the socket, you'll read -1 as the next byte.
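A rough sketch of how those two outcomes show up when reading the 100-byte response from the question (the method and the logging are illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

public class ResponseReader {
    // Reads the full 100-byte response and reports how the connection ended.
    static void readResponse(Socket socket) {
        byte[] response = new byte[100];
        int total = 0;
        try {
            InputStream in = socket.getInputStream();
            while (total < response.length) {
                int n = in.read(response, total, response.length - total);
                if (n == -1) {
                    // graceful close: the peer called close() and we have drained what it sent
                    System.out.println("Peer closed the connection after " + total + " bytes");
                    return;
                }
                total += n;
            }
            System.out.println("Full response received");
        } catch (IOException e) {
            // abnormal break (e.g. connection reset); how quickly this surfaces depends on the OS and network
            System.out.println("Connection broken: " + e.getMessage());
        }
    }
}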
Assuming everything else works, if the remote peer - the TCP server - was killed then the TCP client will normally receive a TCP RST (reset) and you'll get an IOException in your client application.
However, there are lots of other things that can go wrong besides a process being killed. Basically anything on the network path between the two processes: a cable is yanked, a router dies, a firewall dies, etc. All of this will not immediately be detected.
For the above reasons the general rule is - as pointed out in the answer from EJP - that a broken connection can only be detected by writing to it. This is why it is always recommended that a TCP client and TCP server exchange some type of heartbeat messages at regular intervals. There are different ways to do this. I like best the method where the TCP client will - in the absence of data being received from the TCP server - send a heartbeat message to the server and expect a reply back within a certain time period. This way heartbeat messages will only be sent when really needed.
A sub-optimal approach - if you cannot implement true heartbeating - is to always read with a timeout. Set the timeout on the socket and then catch java.net.SocketTimeoutException. This will let you know that no data has been received on the socket for x milliseconds.
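A small sketch of that timeout approach, using the 3-second budget from the question:

import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimedRead {
    // Any read that sees no data for 3 seconds throws SocketTimeoutException.
    static int read(Socket socket, byte[] buffer) throws java.io.IOException {
        socket.setSoTimeout(3000);                       // applies to all subsequent reads on this socket
        try {
            return socket.getInputStream().read(buffer);
        } catch (SocketTimeoutException e) {
            return 0;                                    // nothing arrived within 3 s; the connection may still be alive
        }
    }
}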
It should be mentioned that there's one scenario where you don't have to use heartbeating or a socket timeout: if the TCP client and the TCP server communicate over a loopback interface, then a broken connection will always be propagated to both the TCP client application and the TCP server application. This is because, in this case, there's really no network infrastructure between the two processes. So if you have an existing application which isn't well designed with respect to its TCP communication (i.e. it doesn't implement some form of heartbeating or at least reading with a timeout), then as a last resort you may 'fix' the problem by moving the two applications onto the same host and letting them communicate over the loopback interface.