Hung threads | java.io.FileInputStream.open(Native Method) | NFS server - java

We have four LPARs, each running one Java instance.
They do a lot of read/write operations against a shared NFS server. When the NFS server goes down abruptly, all the threads that were trying to read an image on each of these four servers get into a hung state.
The trace below shows the same (the process is a WebSphere Application Server process).
While we are working on the issues on the NFS server side, is there a way to avoid this from the code side?
If the underlying connection is TCP based (which I assume it is), should the TCP read/connect timeout take care of this? Basically I want the thread to be returned to the pool instead of waiting indefinitely for the other side to respond.
Or is this something that should be taken care of by the NFS client on the source machine, i.e. some NFS-related config setting on the client side? (FileInputStream.open would not know whether the file it is trying to read is on the local server or on the shared folder on the NFS server.)
Thanks in advance for your answers :)
We are using Java 1.6 on WAS 7.0.
[8/2/15 19:52:41:219 GST] 00000023 ThreadMonitor W WSVR0605W: Thread
"WebContainer : 77" (00003c2b) has been active for 763879 milliseconds
and may be hung. There is/are 110 thread(s) in total in the server
that may be hung.
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:113)
at java.io.FileInputStream.<init>(FileInputStream.java:73)
at org.emarapay.presentation.common.util.ImageServlet.processRequest(Unknown Source)
at org.emarapay.presentation.common.util.ImageServlet.doGet(Unknown Source)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:718)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:831)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1694)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1635)
at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:113)
at com.ibm.ws.webcontainer.filter.WebAppFilterChain._doFilter(WebAppFilterChain.java:80)
at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:908)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:965)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:508)
at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl

Check this solution https://stackoverflow.com/a/9832633/1609655
You can do something similar for reading the image. Basically, wrap the read call in a Java Future and cancel the task when the operation does not finish within a specified amount of time.
It might not be perfect, but it will at least prevent your server from being stuck forever.
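A minimal sketch of that idea (the class name and pool size are invented for this example; note that on a "hard" NFS mount the worker thread may itself stay blocked in the native open/read call, so the timeout only frees the calling web container thread):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.*;

public class TimedImageReader {

    // Dedicated pool for file reads, so hung reads do not consume web container threads.
    private static final ExecutorService IO_POOL = Executors.newFixedThreadPool(10);

    public static byte[] readWithTimeout(final String path, long timeout, TimeUnit unit)
            throws IOException, TimeoutException, InterruptedException {
        Future<byte[]> task = IO_POOL.submit(new Callable<byte[]>() {
            public byte[] call() throws IOException {
                InputStream in = new FileInputStream(path);
                try {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return out.toByteArray();
                } finally {
                    in.close();
                }
            }
        });
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // sets the interrupt flag; blocking native I/O may ignore it
            throw e;
        } catch (ExecutionException e) {
            throw new IOException("Failed to read " + path, e.getCause());
        }
    }
}

The servlet would call readWithTimeout(path, 30, TimeUnit.SECONDS) and turn a TimeoutException into an error response instead of hanging the web container thread.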

This was the response from shodanshok on Server Fault and it helped us.
This probably depends on how the NFS share is mounted. By default, NFS shares are mounted with the "hard" option, meaning that accesses to a non-responding NFS share will block indefinitely.
You can change the client-side mount point, adding one of the following parameters (I'm using the Linux man page here; maybe your specific options are a little different):
soft: if the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application
intr: selects whether to allow signals to interrupt file operations on this mount point. Using the intr option is preferred to using the soft option because it is significantly less likely to result in data corruption. FYI, this was deprecated in Linux kernel 2.6.25+
Source: Linux nfs man page
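For illustration only (the export path and mount point below are placeholders; check your client's nfs man page for the exact option names and sensible timeo/retrans values), a softer mount could look like:

# /etc/fstab entry (hypothetical export and mount point)
nfsserver:/export/images  /mnt/images  nfs  soft,timeo=100,retrans=3  0 0

# or adjust an already-mounted share
mount -o remount,soft,timeo=100,retrans=3 /mnt/images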

http://martinfowler.com/bliki/CircuitBreaker.html
This seems to be the perfect solution for this problem (and similar ones). The idea is to wrap the call in another object which will prevent further calls (based on how you design this object to handle the situation) to the failed service.
E.g. when an external service becomes unresponsive, threads slowly go into a hung state. Instead, it would be good to have a THRESHOLD LEVEL which prevents threads from getting into that state. What if we could configure, say: do not attempt to connect to the external service if it has not responded, or is still waiting to respond, for the previous 30 requests? In that case the 31st request would directly throw an error to the customer trying to access the report (or send an error mail to the team), but at least the 31st thread WILL NOT BE STUCK waiting; instead it can be used to serve requests from other functionalities.
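For illustration only, a deliberately tiny hand-rolled breaker in the spirit of the pattern (the class name and thresholds are invented for this sketch; libraries such as Hystrix, listed in the references below, do this far more robustly):

import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SimpleCircuitBreaker {

    private final int threshold;      // consecutive failures before the breaker opens
    private final long openMillis;    // how long to reject calls before probing again

    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);
    private final AtomicLong openedAt = new AtomicLong(0);

    public SimpleCircuitBreaker(int threshold, long openMillis) {
        this.threshold = threshold;
        this.openMillis = openMillis;
    }

    public <T> T call(Callable<T> protectedCall) throws Exception {
        if (isOpen()) {
            // Fail fast instead of tying up another thread on a dead service.
            throw new IllegalStateException("Circuit breaker is open; call rejected");
        }
        try {
            T result = protectedCall.call();
            consecutiveFailures.set(0); // success closes the breaker again
            return result;
        } catch (Exception e) {
            if (consecutiveFailures.incrementAndGet() >= threshold) {
                openedAt.set(System.currentTimeMillis());
            }
            throw e;
        }
    }

    private boolean isOpen() {
        if (consecutiveFailures.get() < threshold) {
            return false;
        }
        // After the cool-down period, let calls through again to probe the service.
        return System.currentTimeMillis() - openedAt.get() < openMillis;
    }
}

The image read (or any remote call) would then be wrapped in breaker.call(...), so that once the threshold is hit, subsequent requests fail immediately instead of piling up on hung threads.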
References:
http://martinfowler.com/bliki/CircuitBreaker.html
http://doc.akka.io/docs/akka/snapshot/common/circuitbreaker.html
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
https://github.com/Netflix/Hystrix


Failed to connect to Tomcat server on ec2 instance

UPDATE:
My goal is to learn what factors could overwhelm my little Tomcat server, and when some exception happens, what I could do to resolve or remediate it without switching my server to a better machine. This is not a real app in a production environment, just my own experiment (besides changes on the server side, I may also do something on the client side).
Both my client and server are very simple: the server only checks the URL format and sends a 201 code if it is correct. Each request sent from my client only includes a simple JSON body. There is no database involved. The two machines (t2.micro) only run the client and the server respectively.
My client uses OkHttpClient. To avoid timeout exceptions, I already set the timeouts to 1,000,000 milliseconds via setConnectTimeout, setReadTimeout and setWriteTimeout. I also went to $CATALINA/conf/server.xml on my server and set connectionTimeout="-1" (infinite).
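For reference, assuming the OkHttp 2.x API implied by those setter names, the client-side configuration described above looks roughly like this (the answer below questions whether such large timeouts are actually a good idea):

import com.squareup.okhttp.OkHttpClient;
import java.util.concurrent.TimeUnit;

public class ClientFactory {
    public static OkHttpClient newLongTimeoutClient() {
        OkHttpClient client = new OkHttpClient();
        // Very large timeouts, matching the question; they tend to hide problems rather than solve them.
        client.setConnectTimeout(1000000, TimeUnit.MILLISECONDS);
        client.setReadTimeout(1000000, TimeUnit.MILLISECONDS);
        client.setWriteTimeout(1000000, TimeUnit.MILLISECONDS);
        return client;
    }
}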
ORIGINAL POST:
I'm trying to stress my server by having a client launch 3000+ threads sending HTTP requests to it. The client and server reside on different EC2 instances.
Initially I encountered some timeout issues, but after I set the connect, read and write timeouts to a bigger value, that exception was resolved. However, with the same setup, I'm now getting a java.net.ConnectException: Failed to connect to my_host_ip:8080 exception, and I do not know its root cause. I'm new to multithreading and distributed systems; can anyone please give me some insight into this exception?
Below are some screenshots from my EC2 instances:
1. Client:
2. Server:
Having gone through a similar exercise in the past, I can say that there is no definitive answer to the problem of scaling.
Here are some general troubleshooting steps that may lead to more specific information. I would suggest running tests while tweaking a few parameters in each test and measuring the changes in CPU, logs, etc.
Please provide the value you have set for the timeout. Increasing the timeout could cause your server (or client) to run out of threads quickly (because each thread can stay busy for longer). Question the need for increasing the timeout. Is there any processing that slows your server down?
Check the application logs, JVM usage and memory usage on the client and the server. There will be some hints there.
Your client seems to be hitting 99%+ CPU and then coming down. This implies that there could be a problem on the client side, in that it maxes out during the test. You might want to resize your client to be able to do more.
Look at open file handles. The number should be sufficiently high.
Tomcat has a limit on the thread count it uses to handle load. You can check this in server.xml and, if required, change it to handle more (see the connector sketch after this list). Although the CPU doesn't actually max out on the server side, so this is unlikely to be the problem.
If you have a database, then check its performance. Also check the JDBC connection settings; there is thread and timeout configuration at the JDBC level as well.
Is response compression set up on Tomcat? It will give much better throughput on the server, especially if the data sent back by each request is more than a few KB.
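As a rough illustration (the attribute values here are invented for the sketch and must be sized for your own workload), the thread limit lives on the HTTP connector in server.xml:

<!-- server.xml: raise the worker thread ceiling and the accept queue (illustrative values) -->
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="400"
           acceptCount="200"
           connectionTimeout="20000"
           redirectPort="8443" />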
--------Update----------
Based on the update to the question, a few more thoughts.
Since the application is fairly simple, the approach to stressing the server should be to start low and increase the load in increments whilst monitoring various things (CPU, memory, JVM usage, file handle count, network I/O).
The increments of load should be spread over several runs.
Start with something as low as 100 parallel threads.
Record as much information as you can after each run and if the server holds up well, increase load.
Suggested increments 100, 200, 500, 1000, 1500, 2000, 2500, 3000.
At some level you will see that the server can no longer take it. That would be your breaking point.
As you increase the load and monitor, you will likely discover patterns that suggest tuning specific parameters. Each tuning attempt should then be tested against the same level of multithreading. Any improvement will be obvious from the monitoring.

Threads getting into a blocked state while doing a hostname lookup using java.net.InetAddress.getLocalHost

Please find below the application/environment details where the problem is seen.
Java Web application deployed on Tomcat 9.0.35 with JRE Version 1.8.0_231-b11
The application is running in a Docker container deployed on an OpenShift Kubernetes Distribution platform.
I see that a lot of threads in the application are getting into a BLOCKED state, sometimes for a few minutes. Thread dump analysis showed that the java.net.InetAddress.getLocalHost call is taking too much time and a lot of threads are getting stuck there. The hostname is fetched for every log line printed by the application.
The issue is intermittent, but when it occurs the application/Tomcat goes into a paused state, which leads to the accumulation of a lot of threads. After some time (a few seconds), all the blocked threads are unblocked simultaneously. Because of the request concurrency, the application then runs out of the DB connections it maintains in its pool, leading to slowness and service availability issues. As a fix, I have made sure to fetch the hostname only once into a static variable and use that throughout the logging process. I wanted to know the detailed root cause of this issue.
Why is this issue occurring intermittently?
Is there a problem with DNS lookups in this Kubernetes environment?
We are using the IPv4 protocol/addresses.
Are there any better approaches/fixes to handle this issue?
Sample below from the thread dump:
"https-jsse-nio-8443-exec-13" #95 daemon prio=5 os_prio=0 tid=0x00007fccadbba800 nid=0xaf5 waiting for monitor entry 0x00007fcb912d1000
java.lang.Thread.State: BLOCKED (on object monitor)
at java.net.InetAddress.getLocalHost(InetAddress.java:1486)
- waiting to lock <0x00000005e71878a0> (a java.lang.Object)
In JDK 8, InetAddress.getLocalHost() works as follows:
1. Obtain the host name as a string via the native gethostname call.
2. If less than 5 seconds have passed since the last host name resolution, return the cached IP address.
3. Otherwise resolve the host name:
   - using the JDK built-in lookup cache, which has a default TTL of 30 seconds;
   - using a system call, which performs an actual DNS lookup (depending on the configuration, the address may be further cached by the OS and DNS servers).
4. Cache the resolved local host IP address for 5 seconds.
Steps 2-4 are performed under the global cacheLock. If something goes wrong during this process, all threads calling InetAddress.getLocalHost() will block at this lock - exactly what you observe.
Usually local host name resolution does not end up in a network call, as long as the host address is hard-coded in /etc/hosts. But in your case it seems that real network requests are involved (whenever the TTL expires), and when the first DNS request times out (UDP is not a reliable protocol, after all), a delay happens.
The solution is to configure /etc/hosts to contain the name and the address of the local host, e.g.
192.168.1.23 myhost.mydomain
where myhost.mydomain is the same string as returned by hostname command.
Finally, if the host name is not expected to change while the application is running, caching it once and forever on the application level looks like a good fix.
To fix the issue, I am loading the hostname only once and caching it during application startup. I have rolled out this fix to production and we are no longer seeing the thread blocking issues.
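A minimal sketch of that application-level cache (the class name is invented here; it assumes the hostname does not change while the JVM is running):

import java.net.InetAddress;
import java.net.UnknownHostException;

public final class HostNameCache {

    // Resolved exactly once, at class initialization, instead of on every log statement.
    private static final String HOST_NAME = resolve();

    private static String resolve() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            // Fall back to a placeholder rather than failing application startup.
            return "unknown-host";
        }
    }

    public static String get() {
        return HOST_NAME;
    }

    private HostNameCache() { }
}

The logging layer then calls HostNameCache.get() instead of InetAddress.getLocalHost(), so no log statement ever waits on the resolver lock.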
Maybe the server is trying the lookup over IPv6; if IPv6 is not in use, you can configure the JVM to use only IPv4 by adding -Djava.net.preferIPv4Stack=true to the JVM options (or, if you only need IPv6, -Djava.net.preferIPv6Addresses=true). This will force the JVM to use the right protocol.

Elasticsearch unclosed client. Live threads after Tomcat shutdown. Memory usage impact?

I am using Elasticsearch 1.5.1 and Tomcat 7. The web application creates a TCP client instance as a singleton during server startup through the Spring Framework.
I just noticed that I failed to close the client during server shutdown.
Through analysis with various tools like VisualVM, JConsole and MAT in Eclipse, it is evident that the threads created by the Elasticsearch client are still live even after server (Tomcat) shutdown.
Note: after introducing client.close() via the context listener's destroy method, the threads are shut down gracefully.
But my queries here are:
How do I check the memory occupied by these live threads?
Is there a memory leak impact due to these threads?
We have had a few Out of memory: PermGen errors in PROD. This might be a reason, but I would still like to measure and provide stats for it.
Any suggestions/help please.
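For reference, a minimal sketch of the context-listener shutdown mentioned in the note above (the listener name is invented; it assumes the Elasticsearch Client singleton is reachable through the Spring web application context):

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import org.elasticsearch.client.Client;
import org.springframework.web.context.WebApplicationContext;
import org.springframework.web.context.support.WebApplicationContextUtils;

public class ElasticsearchShutdownListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent sce) {
        // Nothing to do on startup; the client singleton is created elsewhere by Spring.
    }

    public void contextDestroyed(ServletContextEvent sce) {
        WebApplicationContext ctx =
                WebApplicationContextUtils.getWebApplicationContext(sce.getServletContext());
        if (ctx != null) {
            Client client = ctx.getBean(Client.class);
            // Stops the client's transport threads so they do not outlive Tomcat.
            client.close();
        }
    }
}

If the client is a Spring bean, declaring it with destroy-method="close" (or a @PreDestroy method) achieves the same result without a dedicated listener.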
Typically clients run in a different process than the services they communicate with. For example, I can open a web page in a web browser, and then shutdown the webserver, and the client will remain open.
This has to do with the underlying design choices of TCP/IP. Glossing over the details, in most cases a client only detects that its server is gone during the next request to the server. (Again, generally speaking) it does not continually poll the server to see if it is alive, nor does the server generally send a "please disconnect" message on shutting down.
The reason that clients don't generally poll servers is because it allows the server to handle more clients. With a polling approach, the server is limited by the number of clients running, but without a polling approach, it is limited by the number of clients actively communicating. This allows it to support more clients because many of the running clients aren't actively communicating.
The reason that servers typically don't send an "I'm shutting down" message is that many times the server goes down uncontrollably (power outage, operating system crash, fire, short circuit, etc.). This means that a protocol which requires such a message would leave the clients in a corrupt state whenever the server goes down in an uncontrolled manner.
So losing a connection is really a function of a failed request to the server. The client will still typically be running until it makes the next attempt to do something.
Likewise, opening a connection to a server often does nothing most of the time, too. To validate that you really have a working connection to a server, you must ask it for some data and get a reply. Most protocols do this automatically to simplify the logic; but if you ever write your own service and you don't ask the server for data, then even if the API says you have a good "connection", you might not. The API can report a good "connection" when everything is configured successfully on your machine. To really know that it works 100% with the other machine, you need to ask for data (and get it).
Finally servers sometimes lose their clients, but because they don't waste bandwidth chattering with clients just to see if they are there, often the servers will put a "timeout" on the client connection. Basically if the server doesn't hear from the client in 10 minutes (or the configured value) then it closes the cached connection information for the client (recreating the connection information as necessary if the client comes back).
From your description it is not clear which of the scenarios you might be seeing, but hopefully this general knowledge will help you understand why after closing one side of a connection, the other side of a connection might still think it is open for a while.
There are ways to configure the network connection to report closures more immediately, but I would avoid using them, unless you are willing to lose a lot of your network bandwidth to keep-alive messages and don't want your servers to respond as quickly as they could.

Glassfish thread pool issues

We're using Glassfish 3.0.1 and experiencing very long response times, on the order of 5 minutes for 25% of our POST/PUT requests; by the time the response comes back, the front-facing load balancer has timed out.
My theory is that the requests are queuing up and waiting for an available thread.
The reason I think this is that the access logs reveal the requests take only a few seconds to complete, yet the time at which the requests are executed is five minutes later than I'd expect.
Does anyone have any advice for debugging what is going on with the thread pools, or what the optimum settings should be for them?
Is it required to take a thread dump periodically, or will a one-off dump be sufficient?
At first glance, this seems to have very little to do with the threadpools themselves. Without knowing much about the rest of your network setup, here are some things I would check:
Is there a dead/nonresponsive node in the load balancer pool? This can cause all requests to be tried against this node until they fail due to timeout before being redirected to the other node.
Is there some issue with initial connections between the load balancer and the Glassfish server? This can be slow or incorrect DNS lookups (though the server should cache results), a missing proxy, or some other network-related problem.
Have you checked that the clocks are synchronized between the machines? This could cause the logs to get out of sync. 5min is a pretty strange timeout period.
If all these come up empty, you may simply have an impedance mismatch between the load balancer and the web server and you may need to add webservers to handle the load. The load balancer should be able to give you plenty of stats on the traffic coming in and how it's stacking up.
Usually you get this behaviour if you have not configured enough worker threads in your server. Default values range from 15 to 100 threads in common web servers. However, if your application blocks the server's worker threads (e.g. by waiting for queries), the defaults are frequently far too low.
You can increase the number of workers up to 1000 without problems (ensure a 64-bit JVM). Also check the number of worker threads (sometimes referred to as 'max concurrent/open requests') of any in-between server (e.g. a proxy, or an Apache instance forwarding via mod_proxy).
Another common pitfall is your software sending requests to itself (e.g. trying to reroute or forward a request) while blocking an incoming request.
Taking thread dumps is the best way to debug what is going on with the thread pools. Take 3-4 thread dumps one after another, with a 1-2 second gap between each dump.
From a thread dump you can identify the worker threads by their name. Find the long-running threads across the multiple thread dumps.
You may use the TDA tool (http://java.net/projects/tda/downloads/download/tda-bin-2.2.zip) for analyzing thread dumps.
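For example (replace <pid> with the Glassfish process id; jstack ships with the JDK):

for i in 1 2 3 4; do
  jstack <pid> > threaddump_$i.txt
  sleep 2
done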

What are some common SocketExceptions and what is causing them?

I've been caught catching SocketExceptions belonging to subspecies like for example Broken pipe or Connection reset. The question is what to do with the slippery bastards once they're caught.
Which ones may I happily ignore and which need further attention? I'm looking for a list of different SocketExceptions and their causes.
In terms of Java web development, a Broken pipe or a Connection reset basically means that the other side has closed the connection. This can, among other things, be caused by the client pressing Esc while the request is still running, or navigating away via a link/bookmark/address bar while the request is still running. You see this particular error often in long-running requests such as large file downloads and unnecessarily large/slow business tasks (which is not good for the impatient user; about 3 secs is really the max). In rare cases it can also be caused by a hardware/network problem, such as a network outage at either the server or the client side.
This exception can be thrown when flush() or close() is invoked on the response's output stream. You, as the server side, cannot do anything about it. You cannot recover from it, as you cannot (re)connect to the client due to security restrictions in HTTP. In most cases you also shouldn't even try to, because this is often the client's own decision. Just ignore it, or log it for pure statistics.
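As an illustration of "just ignore it or log it", a sketch of a servlet that treats the IOException from a client abort as a statistic rather than an error (the servlet and logger names are invented; Tomcat, for example, surfaces this as org.apache.catalina.connector.ClientAbortException, other containers differ):

import java.io.IOException;
import java.io.OutputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class DownloadServlet extends HttpServlet {

    private static final Logger LOG = Logger.getLogger(DownloadServlet.class.getName());

    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        try {
            OutputStream out = resp.getOutputStream();
            // ... stream the file to 'out' ...
            out.flush();
        } catch (IOException e) {
            // Broken pipe / connection reset: the client went away.
            // Nothing to recover; just record it for statistics.
            LOG.log(Level.FINE, "Client aborted the request", e);
        }
    }
}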
One of the other causes is usually the TCP/IP stack settings of the operating system. I haven't tried it on Linux yet, but one platform I've worked on is Sun's Solaris 9/10 operating system. The basic idea is that Solaris has a tunable TCP/IP stack which you can adjust while your web applications are running.
There are a few parameters that you should be aware of:
tcp_conn_req_max_q0 - queue of connections with incomplete handshakes
tcp_conn_req_max_q - queue of completed connections waiting to be accepted
tcp_keepalive_interval - keepalive interval
tcp_time_wait_interval - how long a closed connection stays in the TIME_WAIT state
All the above parameters affect how much load the system can take (from a TCP/IP perspective) and, on the flip side, affect the occurrence of certain types of SocketException, such as the ones BalusC pointed out above.
This is obviously quite convoluted, but the point I'm trying to make is that the OS you host your apps on more often than not offers you mitigation strategies.
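As a hedged example of what such tuning looks like on Solaris (the value is illustrative only, not a recommendation):

# inspect, then adjust, one of the tunables mentioned above
ndd -get /dev/tcp tcp_conn_req_max_q
ndd -set /dev/tcp tcp_conn_req_max_q 1024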
