I keep catching SocketExceptions of various subspecies, for example Broken pipe or Connection reset. The question is what to do with the slippery bastards once they're caught.
Which ones may I happily ignore and which need further attention? I'm looking for a list of different SocketExceptions and their causes.
In terms of Java web development, a Broken pipe or a Connection reset basically means that the other side has closed the connection. Among other things, this can be caused by the client pressing Esc or navigating away by link/bookmark/address bar while the request is still running. You see this particular error often in long-running requests such as large file downloads and unnecessarily large/slow business tasks (which is not good for the impatient user; about 3 seconds is really the max). In rare cases it can also be caused by a hardware/network problem, such as a network outage at either the server or the client side.
This exception can be thrown when flush() or close() is invoked on the output stream of the response. On the server side you cannot do anything about it. You cannot recover from it, as you cannot (re)connect to the client due to security restrictions in HTTP. In most cases you also shouldn't even try to, because this is often the client's own decision. Just ignore it or log it for pure statistics.
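To make the "ignore it or log it" advice concrete, here is a minimal sketch of a helper that decides whether an exception looks like a client abort. The class and method names are made up for illustration, and matching on the message text is an assumption: the exact wording ("Broken pipe", "Connection reset") comes from the OS and JVM and may differ on your platform.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Locale;

// Hypothetical helper, not part of any servlet API.
final class ClientAbortCheck {

    /**
     * Walks the cause chain and returns true if any IOException in it
     * carries a message that suggests the peer closed the connection.
     */
    static boolean looksLikeClientAbort(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (c instanceof IOException && msg != null) {
                String m = msg.toLowerCase(Locale.ROOT);
                // Assumption: these are the message fragments seen in practice.
                if (m.contains("broken pipe") || m.contains("connection reset")) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(new SocketException("Broken pipe"));
        System.out.println(looksLikeClientAbort(wrapped));
    }
}
```

In a servlet you would call this from the catch block around the response write and log at a low level when it returns true, rethrowing everything else.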
One of the other causes is usually the TCP/IP stack settings of the operating system. I haven't tried it on Linux yet, but one platform I've worked on is Sun's Solaris 9/10. The basic idea is that Solaris has a tunable TCP/IP stack which you can tune while running your web applications.
So there are four parameters that you should be aware of:
tcp_conn_req_max_q0 - queue of incomplete handshakes
tcp_conn_req_max_q - queue of complete handshakes
tcp_keepalive_interval - keepalive
tcp_time_wait_interval - time of a TCP segment that's considered alive in the internet
All the above parameters affect how much load the system can take (from a TCP/IP perspective) and, on the flip side, affect the occurrence of certain types of SocketExceptions, such as the ones BalusC pointed out above.
This is obviously quite convoluted, but the point I'm trying to make is that the OS you're hosting your apps on, more often than not, offers you mitigation strategies.
UPDATE:
My goal is to learn what factors could overwhelm my little Tomcat server, and, when some exception happens, what I could do to resolve or remediate it without switching my server to a better machine. This is not a real app in a production environment but just my own experiment. (Besides some changes on the server side, I may also do something on my client side.)
Both my client and my server are very simple: the server only checks the URL format and sends a 201 code if it is correct. Each request sent from my client only includes a simple JSON body. There is no database involved. The two machines (t2.micro) only run the client and the server respectively.
My client is OkHttpClient(). To avoid timeout exceptions, I already set the timeouts to 1,000,000 ms via setConnectTimeout, setReadTimeout, and setWriteTimeout. I also went to $CATALINA/conf/server.xml on my server and set connectionTimeout = "-1" (infinite).
ORIGINAL POST:
I'm trying to stress my server by having a client launch 3000+ threads that send HTTP requests to it. My client and server reside on different EC2 instances.
Initially, I encountered some timeout issues, but after I set the connection, read and write timeouts to a bigger value, that exception was resolved. However, with the same specification, I'm now getting a java.net.ConnectException: Failed to connect to my_host_ip:8080 exception, and I do not know its root cause. I'm new to multithreading and distributed systems; can anyone please give me some insight into this exception?
Below are screenshots from my EC2 instances (images not included): 1. Client, 2. Server.
Having gone through a similar exercise in the past, I can say that there is no definitive answer to the problem of scaling.
Here are some general troubleshooting steps that may lead to more specific information. I would suggest running tests that tweak a few parameters at a time and measuring the changes in CPU, logs, etc.
Please provide the value you have set for the timeout. Increasing the timeout could cause your server (or client) to run out of threads quickly (because each thread can process for longer). Question the need for increasing the timeout. Is there any processing that slows your server?
Check the application logs, JVM usage and memory usage on the client and the server. There will be some hints there.
Your client seems to be hitting 99%+ and then coming down. This implies that there could be a problem on the client side, in that it maxes out during the test. You might want to resize your client to be able to do more.
Look at open file handles. The number should be sufficiently high.
Tomcat has a limit on the number of threads it uses to handle load. You can check this in server.xml and, if required, change it to handle more. The CPU doesn't actually max out on the server side, though, so it is unlikely that this is the problem.
If you use a database, then check the performance of the database. Also check the JDBC connection settings. There is thread and timeout configuration at the JDBC level as well.
Is response compression set up on the Tomcat? It will give much better throughput on the server, especially if the data being sent back by each request is more than a few KB.
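For the open file handle check in the list above, a rough sketch of how to read the numbers from inside the JVM rather than from the shell. It assumes a HotSpot-style JVM on a Unix-like OS, where the platform bean implements com.sun.management.UnixOperatingSystemMXBean; elsewhere the method below returns null. The class name is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper for logging descriptor usage during a load test.
final class FdUsage {

    /**
     * Returns {open, max} file descriptor counts for this JVM process,
     * or null when the platform bean does not expose them.
     */
    static long[] openAndMaxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null; // not a Unix-like JVM, or the bean is not available
    }

    public static void main(String[] args) {
        long[] fd = openAndMaxFileDescriptors();
        System.out.println(fd == null
                ? "file descriptor counts not available on this platform"
                : "open=" + fd[0] + " max=" + fd[1]);
    }
}
```

Logging this periodically on both client and server shows whether the open count creeps toward the limit as the load grows.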
--------Update----------
Based on the update to the question, a few more thoughts.
Since the application is fairly simple, the path in terms of stressing the server should be to start low and increase load in increments whilst monitoring various things (cpu, memory, JVM usage, file handle count, network i/o).
The increments of load should be spread over several runs.
Start with something as low as 100 parallel threads.
Record as much information as you can after each run and if the server holds up well, increase load.
Suggested increments 100, 200, 500, 1000, 1500, 2000, 2500, 3000.
At some level you will see that the server can no longer take it. That would be your breaking point.
As you increase load and monitor, you will likely discover patterns that suggest tuning of specific parameters. Each tuning attempt should then be tested again at the same level of multithreading. Any improvement will be obvious from the monitoring.
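The incremental approach above could be driven by something like the following sketch. The request itself is passed in as a Callable so the harness stays independent of the HTTP client; the fake request in main is a placeholder you would replace with a real call to your server. Class and method names are illustrative only.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical load-step harness.
final class SteppedLoad {

    /**
     * Runs `threads` copies of `request` in parallel and returns how many
     * of them failed (returned false/null or threw an exception).
     */
    static int runStep(int threads, Callable<Boolean> request) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = Collections.nCopies(threads, request);
            int failures = 0;
            for (Future<Boolean> f : pool.invokeAll(tasks)) {
                try {
                    if (!Boolean.TRUE.equals(f.get())) {
                        failures++;
                    }
                } catch (ExecutionException e) {
                    failures++; // the request threw, e.g. a ConnectException
                }
            }
            return failures;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load step interrupted", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Shortened schedule for the demo; continue with 1000, 1500, ... as above.
        int[] steps = {100, 200, 500};
        Callable<Boolean> fakeRequest = () -> true; // placeholder for a real HTTP call
        for (int n : steps) {
            int failed = runStep(n, fakeRequest);
            System.out.println(n + " threads: " + failed + " failures");
            if (failed > 0) {
                break; // first sign of the breaking point: stop and inspect
            }
        }
    }
}
```

Stopping at the first failing step keeps the monitoring data for that level clean, which makes it easier to compare before and after a tuning change.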
I am using Elasticsearch 1.5.1 and Tomcat 7. The web application creates a TCP client instance as a singleton during server startup through the Spring Framework.
I just noticed that I failed to close the client during server shutdown.
Through analysis with various tools like VisualVM, JConsole and MAT in Eclipse, it is evident that the threads created by the Elasticsearch client are still live even after the server (Tomcat) shuts down.
Note: after introducing client.close() via Context Listener destroy methods, the threads are killed gracefully.
But my questions here are:
How do I check the memory occupied by these live threads?
What is the memory leak impact of these threads?
We have had a few OutOfMemoryError: PermGen space errors in PROD. This might be a reason, but I would still like to measure it and provide stats.
Any suggestions/help please.
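As a starting point for collecting stats, here is a small sketch that lists the JVM's live threads whose names contain a given fragment. It does not measure memory; it only confirms which threads survive and how many there are. The class name is made up, and the fragment "elasticsearch" in main is an assumption about how the client names its threads, so check the actual names in VisualVM first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for counting leftover threads.
final class LiveThreads {

    /** Returns the names of all live threads whose name contains `fragment`. */
    static List<String> namesContaining(String fragment) {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().contains(fragment)) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed name fragment; adjust to what your thread dump shows.
        List<String> leftovers = namesContaining("elasticsearch");
        System.out.println(leftovers.size() + " matching live threads: " + leftovers);
    }
}
```

Calling this before and after client.close() in the context listener gives a simple count to report alongside the heap dump findings.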
Typically clients run in a different process than the services they communicate with. For example, I can open a web page in a web browser, and then shutdown the webserver, and the client will remain open.
This has to do with the underlying design choices of TCP/IP. Glossing over the details, in most cases a client only detects that its server is gone during the next request to the server. (Again, generally speaking) it does not continually poll the server to see if it is alive, nor does the server generally send a "please disconnect" message on shutting down.
The reason that clients don't generally poll servers is because it allows the server to handle more clients. With a polling approach, the server is limited by the number of clients running, but without a polling approach, it is limited by the number of clients actively communicating. This allows it to support more clients because many of the running clients aren't actively communicating.
The reason that servers typically don't send an "I'm shutting down" message is that many times the server goes down uncontrollably (power outage, operating system crash, fire, short circuit, etc.). This means that a protocol which requires such a message will leave the clients in a corrupt state if the server goes down in an uncontrolled manner.
So losing a connection is really a function of a failed request to the server. The client will still typically be running until it makes the next attempt to do something.
Likewise, opening a connection to a server often does very little by itself. To validate that you really have a working connection to a server, you must ask it for some data and get a reply. Most protocols do this automatically to simplify the logic; but if you ever write your own service and you don't ask for data from the server, then even if the API says you have a good "connection", you might not. The API can report a good "connection" as soon as everything is configured successfully on your machine. To really know that it works 100% on the other machine, you need to ask for data (and get it).
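A minimal sketch of "ask for data and get a reply", using plain sockets. The one-byte ping is an invented toy protocol, and the small server helper exists only so the example can run on its own: with reply set to false it accepts the connection but stays silent, which is exactly the case where connecting succeeds and the connection is still not usable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical probe; the '?' ping is not a real protocol.
final class ConnectionProbe {

    /** Sends one byte and waits up to timeoutMillis for any byte to come back. */
    static boolean probe(String host, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            s.setSoTimeout(timeoutMillis); // bounds the blocking read below
            s.getOutputStream().write('?');
            s.getOutputStream().flush();
            return s.getInputStream().read() != -1;
        } catch (IOException e) { // includes SocketTimeoutException
            return false;
        }
    }

    /** Demo-only server: accepts one client and echoes one byte, or stays silent. */
    static int startOneShotServer(boolean reply) {
        try {
            ServerSocket server = new ServerSocket(0); // any free port
            Thread t = new Thread(() -> {
                try (ServerSocket ss = server; Socket c = ss.accept()) {
                    int b = c.getInputStream().read();
                    if (reply && b != -1) {
                        c.getOutputStream().write(b);
                        c.getOutputStream().flush();
                    } else {
                        Thread.sleep(3000); // hold the connection open, say nothing
                    }
                } catch (IOException | InterruptedException ignored) {
                    // demo server: nothing to clean up beyond try-with-resources
                }
            });
            t.setDaemon(true);
            t.start();
            return server.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("echoing server: " + probe("127.0.0.1", startOneShotServer(true), 2000));
        System.out.println("silent server:  " + probe("127.0.0.1", startOneShotServer(false), 500));
    }
}
```

In both cases the TCP connect itself succeeds; only the round trip distinguishes a working peer from one that is merely reachable.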
Finally servers sometimes lose their clients, but because they don't waste bandwidth chattering with clients just to see if they are there, often the servers will put a "timeout" on the client connection. Basically if the server doesn't hear from the client in 10 minutes (or the configured value) then it closes the cached connection information for the client (recreating the connection information as necessary if the client comes back).
From your description it is not clear which of the scenarios you might be seeing, but hopefully this general knowledge will help you understand why after closing one side of a connection, the other side of a connection might still think it is open for a while.
There are ways to configure the network connection to report closures more immediately, but I would avoid using them, unless you are willing to lose a lot of your network bandwidth to keep-alive messages and don't want your servers to respond as quickly as they could.
I have a Scala application which maintains (or tries to) TCP connections to various servers for hours (possibly > 24) at a time. Each server sends a short, ~30 character message about twice a second. These messages are fed into an iteratee where they are parsed and eventually end up making state changes to a database.
If any of these connections fail for any reason, my app needs to continually try to reconnect until I specify otherwise. Any messages getting lost is Bad. I have no control over the servers I connect to, or the protocols used.
It is conceivable there would be as many as 300 of these connections at once. Not exactly a high-load scenario, so I don't think NIO is needed, though it might be nice to have? Other bits of the app are high-load.
I'm looking for some sort of socket controller / manager which can keep these connections up as reliably as possible. I am running my own blocking controller now, but as I'm inexperienced with socket coding (and all the various settings, options, timeouts, etc.) I doubt it will achieve the best possible uptime. Plus I may need SSL support at some point down the line.
Would NIO offer any real advantages?
Would Netty be the best choice here? I've seen the Uptime example here, and was thinking of simply duplicating it, but being new to lower-level networking I wasn't sure if there were better options.
However I'm uncertain of the best strategies for ensuring as few packets are lost as possible, and assumed this would be a "solved" problem in one library or another.
Yup. JMS is an example.
I suppose a lot of it would come down to a timeout guessing strategy? Close and re-open a socket too early and you've lost whatever packets were en-route.
That is correct. That approach is not going to be reliable, especially if connections go up and down regularly.
A real solution involves having the other end keep track of what it has received, and letting the sender know when the connection is re-established. If that can't be done, you have no real way of controlling how much gets lost. (This is what the reliable messaging services do ...)
I have no control over the servers I connect to. So unless there's another way to adapt JMS to a generic TCP stream I don't think it will work.
Yup. And the same applies if you try to implement this by hand. The other end has to cooperate.
I guess you could construct something where you run (say) a JMS end point on each of the remote servers, and have the endpoint use UNIX domain sockets or loopback (i.e. 127.0.0.1) to talk to the server. But you still have potential for message loss.
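For the reconnect side of the question only (it does nothing about message loss, for the reasons given above), a retry loop might look like the sketch below. Exponential backoff with a cap is my assumption about a reasonable policy, not something the question specifies, and the connect attempt is passed in as a BooleanSupplier so the loop does not depend on any particular socket library. All names are illustrative.

```java
import java.util.function.BooleanSupplier;

// Hypothetical retry helper.
final class ReconnectBackoff {

    /** Delay before the next attempt: base, 2x base, 4x base, ... capped at capMillis. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < capMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }

    /**
     * Calls tryConnect until it returns true or maxAttempts is used up.
     * Returns the number of attempts made, or -1 if none succeeded.
     */
    static int connectWithRetry(BooleanSupplier tryConnect, int maxAttempts,
                                long baseMillis, long capMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (tryConnect.getAsBoolean()) {
                return attempt + 1;
            }
            try {
                Thread.sleep(delayMillis(attempt, baseMillis, capMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // treat as "stop trying"
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Placeholder connect attempt that succeeds on the third call.
        int used = connectWithRetry(() -> ++calls[0] >= 3, 10, 10, 1000);
        System.out.println("connected after " + used + " attempts");
    }
}
```

The cap keeps a long outage from stretching the retry interval indefinitely, and interrupting the thread is one way to implement "until I specify otherwise".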
I have a multithreaded Java program that runs on Amazon's EC2. It queries and fetches data items from a vendor via HttpPost and HttpGet, using an org.apache.http.impl.client.DefaultHttpClient. Concurrently, it pushes the retrieved data items into S3 using AWS's Java SDK.
After a few days of running, I get the symptoms that normally come with http connection leaks:
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection
at org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:417)
at org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:300)
at org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:224)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:391)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
Since both AWS and my requests to the data vendor use Http connections, I am not quite sure where exactly I forget to HttpEntity.consume(), or S3ObjectInputStream.close() (unless it is yet something else...).
So here is my question: are there ways to monitor org.apache.http.impl.conn.tsccm.ConnPoolByRoute so that at least I can detect when I am starting to leak connections/entities not properly consumed/http streams not closed? (I have a feeling it happens only under certain conditions, e.g. when certain exceptions are being thrown, by-passing the logic in my code that consumes HttpEntities, closes streams, etc.) Any idea on how to diagnose what eventually causes all my http connections to fail with that ConnectionPoolTimeoutException would be most welcome. I don't feel like waiting 4+ days between attempts to fix the root cause of the problem.
If you're using the PoolingClientConnectionManager note there are the methods getTotalStats() and getStats(final HttpRoute route) which will give you a PoolStats object with the data you're looking to monitor.
Just fetch the ConnectionManager from your httpclient:
PoolingClientConnectionManager poolManager = (PoolingClientConnectionManager) httpClient.getConnectionManager();
If you can access the org.apache.http.impl.conn.tsccm.ConnPoolByRoute, then set its connTTL to a low enough value so that its WaitingThreadAborter will eventually terminate a connection. It will show a nice stack trace there. The other option is to use CGLIB or some other bytecode manipulation framework to create a proxy class wrapping org.apache.http.impl.conn.tsccm.ConnPoolByRoute. Depending on your environment it might not be that easy to set up, but it's a rather valuable tool to debug issues like yours. (And yes, if you happen to use Spring or just plain aspects, the setup will be super easy :) )
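If wrapping the pool itself is too invasive, the same idea can be approximated in application code. The sketch below is a generic tracker, not part of HttpClient or the AWS SDK: you would call acquired() wherever your code obtains a response entity or stream and released() wherever it consumes or closes it. Each outstanding entry remembers the stack trace of where it was acquired, which points at the code path that skipped the cleanup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical leak tracker; all names are illustrative.
final class LeaseTracker {

    // Identity-based so two equal-looking resources are tracked separately.
    private final Map<Object, Throwable> outstanding =
            Collections.synchronizedMap(new IdentityHashMap<Object, Throwable>());

    /** Records that `resource` was handed out, capturing the caller's stack trace. */
    void acquired(Object resource) {
        outstanding.put(resource, new Throwable("acquired here"));
    }

    /** Records that `resource` was consumed/closed. */
    void released(Object resource) {
        outstanding.remove(resource);
    }

    /** Number of resources acquired but not yet released. */
    int outstandingCount() {
        return outstanding.size();
    }

    /** Stack traces of the acquisition sites of everything still outstanding. */
    List<Throwable> acquisitionSites() {
        synchronized (outstanding) { // required when iterating a synchronizedMap
            return new ArrayList<>(outstanding.values());
        }
    }

    public static void main(String[] args) {
        LeaseTracker tracker = new LeaseTracker();
        Object leaked = new Object();
        tracker.acquired(leaked);
        for (Throwable site : tracker.acquisitionSites()) {
            site.printStackTrace(); // shows where the unreleased resource came from
        }
    }
}
```

Logging outstandingCount() every few minutes would show a slow leak within hours instead of days, and the acquisition traces narrow down which exception path bypasses the cleanup.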
My web app is running on 64-bit Java 6.0.23, Tomcat 6.0.29 (with Apache Portable Runtime 1.4.2), on Linux (CentOS). Tomcat's JAVA_OPTS includes -Xincgc, which is supposed to help prevent long garbage collections.
The app is under heavy load and has intermittent failures, and I'd like to troubleshoot it.
Here is the symptom: Very intermittently, an HTTP client will send an HTTP request to the web app and get an empty response back.
The app doesn't use a database, so it's definitely not a problem with JDBC connections. So I figure the problem is perhaps one of: memory (perhaps long garbage collections), out of threads, or out of file descriptors.
I used JavaMelody to view the number of threads that are being used, and it seems that maxThreads is set high enough that we are not running out of threads. Similarly, we have the number of available file descriptors set to a very high number.
The app does use a lot of memory. Does it seem like memory is probably the culprit here, or is there something else that I might be overlooking?
I guess my main confusion, though, is why garbage collections would cause HTTP requests to fail. Intuitively, I would guess that a long garbage collection might cause an HTTP request to take a long time to run, but I would not guess that a long garbage collection would cause an HTTP request to fail.
Additional info in response to Jon Skeet's comments...
The client is definitely not timing out. The empty response happens fairly quickly. When it fails, there is no data and no HTTP headers.
I very much doubt that garbage collection is responsible for the issue.
You really really need to find out exactly what this "empty response" consists of:
Does the server just chop the connection?
Does the client perhaps time out?
Does the server give a valid HTTP response but with no data?
Each of these could suggest very different ways of finding out what's going on. Determining the failure mode should be your primary concern, IMO. Until you know that, it's complete guesswork.
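One way to tell these cases apart from the client side is to bypass the HTTP library and look at the raw bytes. The sketch below is an illustration under assumptions: it sends a bare HTTP/1.0 request (so the server closes the connection after responding), the class and enum names are made up, and the body check is deliberately crude.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper.
final class EmptyResponseDiagnosis {

    enum Mode { CONNECTION_DROPPED, TIMED_OUT, VALID_RESPONSE_NO_BODY, VALID_RESPONSE_WITH_BODY, NOT_HTTP }

    /** Reads the stream to EOF and classifies what came back. */
    static Mode classify(InputStream in) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            for (int n; (n = in.read(buf)) != -1; ) {
                raw.write(buf, 0, n);
            }
        } catch (SocketTimeoutException e) {
            return Mode.TIMED_OUT; // nothing (or nothing more) arrived in time
        } catch (IOException e) {
            return Mode.CONNECTION_DROPPED; // e.g. connection reset mid-read
        }
        if (raw.size() == 0) {
            return Mode.CONNECTION_DROPPED; // peer closed without sending a byte
        }
        String text = new String(raw.toByteArray(), StandardCharsets.ISO_8859_1);
        if (!text.startsWith("HTTP/")) {
            return Mode.NOT_HTTP;
        }
        // Crude check: anything after the blank line that ends the headers is body.
        int headerEnd = text.indexOf("\r\n\r\n");
        boolean hasBody = headerEnd >= 0 && headerEnd + 4 < text.length();
        return hasBody ? Mode.VALID_RESPONSE_WITH_BODY : Mode.VALID_RESPONSE_NO_BODY;
    }

    /** Sends a minimal GET for "/" and classifies the reply. */
    static Mode probe(String host, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            s.setSoTimeout(timeoutMillis);
            OutputStream out = s.getOutputStream();
            out.write(("GET / HTTP/1.0\r\nHost: " + host + "\r\n\r\n")
                    .getBytes(StandardCharsets.ISO_8859_1));
            out.flush();
            return classify(s.getInputStream());
        } catch (SocketTimeoutException e) {
            return Mode.TIMED_OUT;
        } catch (IOException e) {
            return Mode.CONNECTION_DROPPED;
        }
    }
}
```

Running probe in a loop against the app under load and counting the results per mode would turn the "empty response" into one of the three concrete cases listed above.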