We hit a strange issue on one of our customers' servers, where Java encounters "Too many open files" errors.
Checking the descriptors via lsof produces a large list of "sock" descriptors with "can't identify protocol".
I suspect it is caused by sockets that are left open for too long, but since our thread dump contains a lot of threads, I have no clear idea which one is the culprit.
Is there any good method to detect which threads exactly open these sockets?
Thanks.
Is there any good method to detect which threads exactly open these sockets?
Not the threads per se.
One approach is to run the application using a profiler. This could well find the problem even if you cannot exactly reproduce the situation on the customer's server. (#SyBer reports that the YourKit profiler has specific support for finding socket leaks ... see comment.)
A second approach is to tweak your test platform by using ulimit to REDUCE the number of open files allowed. This may make it easier to reproduce the "too many open files" scenario in your test environment.
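On Linux, for example, running something like ulimit -n 256 in the shell that launches the JVM lowers the limit for that process and its children (the value 256 is only an illustration; pick whatever makes the leak surface quickly).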
Finally, I'd recommend "grepping" your codebase to find all places where socket objects are created, then examining them all to make sure they correctly use try / finally blocks so that the sockets are always closed.
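As a point of comparison while auditing, here is a minimal sketch of the pattern to look for; the host and port are placeholders, and try-with-resources is just one way to guarantee the close:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class SocketCloseExample {
        public static void main(String[] args) throws IOException {
            // try-with-resources closes the socket on every exit path,
            // including when an exception is thrown mid-conversation.
            try (Socket socket = new Socket("example.com", 80)) {
                OutputStream out = socket.getOutputStream();
                out.write("HEAD / HTTP/1.0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
                out.flush();
            }
        }
    }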
Start with
netstat -anp | grep $YOUR_PROCESS_ID - for Linux (-p adds the owning process ID)
netstat -ano | find "$YOUR_PROCESS_ID" - for Windows (-o adds the owning process ID)
At least you will see whether the connections really exist.
Did you try using ulimit to increase the number of open files allowed? Also, it's possible that you're not closing your sockets properly, so you have a leak.
The only "good" method to detect leaking sockets is either a very verbose log, or a profiler. Do a memory dump and analyse the objects.
Valgrind will identify file descriptor leaks if you pass --track-fds=yes. Valgrind generates short stack traces at the "acquisition" point of the resources it tracks. Once you have located the source lines where the leaks occur, you can log the return value of pthread_self at those points through your logging system (I'm sure you are using one!), or place breakpoints in gdb.
Likely you are neglecting to close() sockets that you are finished with. This needs to be done even when the peer initiates the shutdown.
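A hedged sketch of that point: read() returning -1 tells you the peer has shut down its side, but the local descriptor is only released when you call close() yourself. handleConnection below is a hypothetical handler, not code from the question:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.Socket;

    public class PeerShutdownExample {
        static void handleConnection(Socket socket) throws IOException {
            try {
                InputStream in = socket.getInputStream();
                byte[] buffer = new byte[4096];
                while (in.read(buffer) != -1) {
                    // process the data ...
                }
                // read() == -1: the peer closed its end, but our descriptor is still open
            } finally {
                socket.close(); // without this, the descriptor leaks
            }
        }
    }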
Related
The application I am working on suddenly crashed with
java.io.IOException: ... Too many open files
As I understand it, the issue means that files are being opened but not closed.
The stack trace, of course, comes after the fact and can only help you understand what was happening when the error finally surfaced.
What would be an intelligent way to search your code base to find this issue, which only seems to occur when the app is under a high-stress load?
Use lsof -p pid to check what is causing the leak of file references.
Use ulimit -n to see the limit on open file references for a single process.
Check all I/O resources in your project: are they released in time? Note that files, processes, and sockets (including HTTP connections) are all I/O resources; a small sketch of the release pattern follows this list.
Sometimes, too many threads will cause this problem too.
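The sketch mentioned above, using a file as the example; the path is a placeholder:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class FileCloseExample {
        public static void main(String[] args) throws IOException {
            // Streams, readers, channels, and process/socket streams all hold a
            // file descriptor until they are closed; try-with-resources releases
            // it on every exit path.
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get("/tmp/example.txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }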
I think the best way is to use a tool specifically designed for the purpose, such as this one:
This little Java agent is a tool that keeps track of where/when/who opened files in your JVM. You can have the agent trace these operations to find out about the access pattern or handle leaks, and dump the list of currently open files and where/when/who opened them.
In addition, upon "too many open files" exception, this agent will dump the list, allowing you to find out where a large number of file descriptors are in use.
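Agents of this kind are typically attached at JVM startup with the standard -javaagent flag, e.g. java -javaagent:/path/to/the-agent.jar -jar yourapp.jar, where the jar path and application name here are placeholders.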
I seem to remember YourKit also having some facilities around this, but can't find any specific information at the moment.
What OS? If it's Linux, there is information under /proc that should help (there's no /proc on a Mac, but lsof works there too). On Windows, use Process Explorer.
As far as searching the code base, perhaps look for code that catches or throws IOException - I think I/O methods that already catch/throw this have a high likelihood of needing a close() call.
Have you tried attaching to the running process using jvisualvm (Java 5.0 and later, in the JDK bin directory)? You can open the running process and do a heap dump (which, if you have an older JDK, you will need to analyze using Eclipse or IntelliJ or NetBeans et al.).
In JDK 7 the heap dump button is under the "Monitor" tab. It will create a heap dump tab with a "Classes" sub-tab that you can check to see if any classes that open files exist in high quantity. Another very useful feature is heap dump compare: take a reference heap dump, let your app run a bit, then take another heap dump and compare the two (the link to compare is on the "[heapdump]" tab you get when you take one). There is also a flag in Java for taking a heap dump on a crash or OOM error (for the latter, -XX:+HeapDumpOnOutOfMemoryError); you can go down that route if comparing heap dumps does not give you an obvious class that is causing the problem. Also, the "Instances" sub-tab in the heap dump diff will show you what has been allocated in the time between the two heap dumps, which may also help.
jvisualvm is an awesome tool that does not get enough mentions.
I need to peek into the stacks of 2 deadlocked threads to analyze the situation. The JVM is live right now and the data is there, but I need some kind of tool to extract it from the process. I only care about 6 variables of type String on the stack. Any ideas are greatly appreciated. The JVM version is 6_35, it's on Linux, JMX is enabled, but I don't have a profiler/debugger connection configured on it. It's very difficult to reproduce.
I found a little trick using a heap dump viewer (YourKit in this instance, but maybe others work as well). Basically you enumerate all instances of the Thread class, then you find the thread you want by name and open it. The stack variables are marked as < local variable >.
Not all variables are there, but everything that is passed as a method argument is displayed. I wonder if profilers can address this even better?
You can't do this easily. The normal jstack tool will only dump the stacks. Technically you can try dumping the whole heap (using jmap), but looking for these particular variables can be a pain, if it's possible at all.
Note that this is not easily doable for security reasons. Stack traces can contain credentials or other sensitive data.
You can send the process a SIGQUIT, which will give you a dump and keep the VM running, on a Unix-like OS with the Sun/Oracle JVM (IBM's JVM behaves the same) -- not sure if the output will be suitable for your purposes, though. It is probably similar to jstack/jmap in the other answer.
I ran into a problem that is very similar to another SO question ( jps returns no output even when java processes are running ). Before I read that question I thought that my problem was that jstatd was not running, but the solution in that question implies that jps uses some sort of temporary files. I also realized that it is possible to monitor local JVMs without any network activity at all, and I'm curious how that works. I'm not asking for a solution to my problem, I just want to know how jps and the other tools work locally. It surprises me that I don't know this at all after so many years spent in Java development.
In the local case, the default implementation of MonitoredHost is sun.jvmstat.perfdata.monitor.protocol.local.MonitoredHostProvider, which uses sun.jvmstat.perfdata.monitor.protocol.local.LocalVmManager. Its method activeVms(), where the real work is done, loops through files in user temp directories, searching for files with a known filename format in which started JVMs publish their monitoring data. No TCP at all, as I suspected. Interesting.
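A rough sketch of the same idea, just to make the mechanism concrete: each running HotSpot JVM publishes its perf data in a file named after its pid, under a per-user hsperfdata_<user> directory in java.io.tmpdir, so simply listing those files yields the local pids (this is an illustration, not the actual jps source):

    import java.io.File;

    public class ListLocalJvms {
        public static void main(String[] args) {
            File tmp = new File(System.getProperty("java.io.tmpdir"));
            // Each user gets an hsperfdata_<username> directory in the temp dir...
            File[] userDirs = tmp.listFiles((dir, name) -> name.startsWith("hsperfdata_"));
            if (userDirs == null) {
                return;
            }
            for (File userDir : userDirs) {
                // ...and each running JVM owned by that user publishes a file named after its pid.
                File[] vmFiles = userDir.listFiles();
                if (vmFiles == null) {
                    continue;
                }
                for (File vmFile : vmFiles) {
                    System.out.println(userDir.getName() + " -> pid " + vmFile.getName());
                }
            }
        }
    }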
What are the most likely causes for application server failure?
For example: "out of disk space" is more likely than "2 of the drives in a RAID 4 setup die simultaneously".
My particular environment is Java, so Java-specific answers are welcome, but not required.
EDIT: just to clarify, I'm looking for downtime-related crashes (out of memory is a good example), not just one-time issues (like a temporary network glitch).
If you are trying to keep an application server up, start monitoring it. Nagios, Big Sister, and other Network Monitoring tools can be very useful.
Watch memory availability / usage, disk availability / usage, cpu availability / usage, etc.
The most common reason why a server goes down is rarely the same reason twice. Someone "fixes" the last-most-common-reason, and a new-most-common-reason is born.
Edwin is right - you need monitoring to understand what the problem is. Or better - understand what the problem is AND prevent it from causing downtime.
You should not only track resource consumption but also demand. The difference between the two shows you if you have sized your server correctly.
There are a ton of open source tools like nagios, CollectD, etc. that can give you server specific data - that's only monitoring though, not prevention. Librato Silverline (disclosure: I work there) allows you to monitor individual processes and then throttle the resources they use by placing them in application containers for which you define resource polices.
If your server is 8 cores or less you can use it for free.
"Out of Memory" exception due to memory leaks.
All sorts of things can cause a server to crash, ranging from busted hardware (e.g. disk failures) to faulty code: a memory leak resulting in an out-of-memory error, a network failure that got rethrown as a runtime exception and was never caught, a SEGFAULT in servers that aren't Java servers, etc.
At first, it is usually because of memory leaks, disk space problems, or endless loops eating up the CPU.
Once you monitor those issues and set up correct logging and warning mechanisms, they turn meta on you... and exploding error handling becomes a possible reason for a full lockup: an error occurs (or, more likely, two in an unhappy combination), but when the handler tries to write to the log files or send a warning (by mail or something), it gets another error, which it tries to handle by writing to the log file or sending a warning, and so on. This continues until one of the resources gives out: it may lead to skyrocketing server load, memory problems, filling disk space, or locked-up network traffic, which means the server won't be accessible for a remote user to correct the problem, etc.
I have a Java program for doing a set of scientific calculations across multiple processors by breaking it into pieces and running each piece in a different thread. The problem is trivially partitionable so there's no contention or communication between the threads. The only common data they access are some shared static caches that don't need to have their access synchronized, and some data files on the hard drive. The threads are also continuously writing to the disk, but to separate files.
My problem is that sometimes when I run the program I get very good speed, and sometimes when I run the exact same thing it runs very slowly. If I see it running slowly, Ctrl-C it, and restart it, it will usually start running fast again. It seems to settle into either slow mode or fast mode early on in the run and never switches between modes.
I have hooked it up to jconsole and it doesn't seem to be a memory problem. When I have caught it running slowly, I've tried connecting a profiler to it but the profiler won't connect. I've tried running with -Xprof but the dumps between a slow run and fast run don't seem to be much different. I have tried using different garbage collectors and different sizings of the various parts of the memory space, also.
My machine is a Mac Pro with a striped RAID partition. The CPU usage never drops off whether it's running slowly or quickly, which you would expect if threads were spending too much time blocking on reads from the disk, so I don't think it's a disk read problem.
My question is: what types of problems in my code could cause this? Or could it be an OS problem? I haven't been able to duplicate it on a Windows machine, but I don't have a Windows machine with a similar RAID setup.
You might have threads that have gone into an endless loop.
Try connecting with VisualVM and use the Thread monitor.
https://visualvm.dev.java.net
You may have to connect before the problem occurs.
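If attaching a GUI tool when the slowdown happens is awkward, a rough in-process alternative (my sketch, not part of the original suggestion) is to dump per-thread CPU time via ThreadMXBean and look for a thread whose CPU time keeps climbing while doing no useful work:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class BusyThreadReport {
        public static void report() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // Dump every live thread (without monitor/synchronizer details) and
            // print how much CPU time each one has consumed so far.
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                // getThreadCpuTime returns -1 if CPU time measurement is unsupported or disabled.
                long cpuNanos = threads.getThreadCpuTime(info.getThreadId());
                System.out.printf("%-40s state=%-13s cpu=%d ms%n",
                        info.getThreadName(), info.getThreadState(), cpuNanos / 1_000_000);
            }
        }
    }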
I second that you should be doing this with a profiler, looking at the threads view: how many threads there are, what states they are in, etc. It might be an odd race condition happening every now and then. It could also be the case that instrumenting the classes with profiler hooks (which causes a slowdown) sorts the race condition out, and you will see no slowdown with the profiler attached :/
Please have a look at this post, or rather the answer, where a cache contention problem is mentioned.
Are you spawning the same number of threads each time? Is that number less than or equal to the number of threads available on your platform? That number can be checked, or guesstimated with fair accuracy.
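For reference, the JVM will report how many logical processors it sees, which is a reasonable ceiling for a CPU-bound worker count; a one-line check (my sketch):

    public class CpuCount {
        public static void main(String[] args) {
            // Number of logical processors visible to the JVM; a CPU-bound pool
            // larger than this mostly adds scheduling overhead.
            int cpus = Runtime.getRuntime().availableProcessors();
            System.out.println("Available processors: " + cpus);
        }
    }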
Please post any findings!
Do you have a tool to measure CPU temperature? The OS might be throttling the CPU to deal with temperature issues.
Is it possible that your program is being paged to disk sometimes? In this case, you will need to look at the memory usage of the operating system as whole, rather than just your program. I know from experience there is a huge difference in runtime performance when memory is being continually paged to the disk and back.
I don't know much about OSX, but in linux the "free" command is useful for this purpose.
Another issue that might cause this slowdown is log files. I've seen logging code that slowed the system down incrementally as the log files grew. It's possible that your threads are synchronizing on a log file which is growing in size, and when you restart your program, a new log file is used.