I wrote a Java program that analyses other programs. A run can take a very long time (days). Now, after three days, I have the problem that my program/process is sleeping (state S). It still holds about 50% of the memory and sometimes prints new output, but top shows 0% CPU most of the time.
I used jstack to make sure that there are still runnable threads, so it does not seem to be a deadlock problem. I do not know why the process does not get more CPU time. I changed the niceness of the Java process from 0 to -10, but nothing happened.
More details:
The process runs on a Linux server: Ubuntu 10.04.4 LTS.
I started my process with screen, so I do not have to be logged in all the time:
screen -S analyse ant myParameters
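(The session can later be reattached with screen's -r option:)
screen -r analyse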
The server has almost nothing to do.
Thanks for your help.
Start your program in debug mode. Then you can attach to it with any Java debugger and inspect what it is doing.
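For example (a sketch only; the jar name and port here are placeholders, not the asker's actual command line), the JVM can be started with the standard JDWP agent so that Eclipse, jdb, or any other Java debugger can attach to it later:
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000 -jar yourAnalysis.jar
With suspend=n the program starts normally, and you only attach to port 8000 once it appears to hang.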
We experienced a (at least in our eyes) strange problem:
We have two Wildfly 8.1 installations on the same Linux machine (CentOS 6.6), running the same applications in different versions and listening on different ports.
Now we discovered that, all of a sudden, when starting one of them, the other one got killed. We then found that the amount of free memory was low due to other leaking processes. When we killed those, both Wildflys ran correctly again.
Since I don't think that Linux itself decided to kill another random process, I assume that JBoss either has some mechanism to free memory by killing something it assumes is no longer needed, or that (maybe through wrong configuration) some resource is used by both of them, leading to one of them getting killed when it cannot obtain it.
Has anyone experienced something similar, or does anyone know of a mechanism of that sort?
Most probably it was the Linux OOM killer.
You can verify whether one of the servers was killed by it by checking the log files:
grep -i kill /var/log/messages*
If it was, you should see something like:
host kernel: Out of Memory: Killed process 2592
The OOM killer uses the following algorithm when determining which process to kill:
The function select_bad_process() is responsible for choosing a process to kill. It decides by stepping through each running task and calculating how suitable it is for killing with the function badness(). The badness is calculated as follows; note that the square roots are integer approximations calculated with int_sqrt():
badness_for_task = total_vm_for_task / (sqrt(cpu_time_in_seconds) * sqrt(sqrt(cpu_time_in_minutes)))
This has been chosen to select a process that is using a large amount of memory but is not that long lived. Processes which have been running a long time are unlikely to be the cause of memory shortage so this calculation is likely to select a process that uses a lot of memory but has not been running long.
You can manually see the badness of each process by reading the oom_score file in the process's directory under /proc:
cat /proc/10292/oom_score
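To get an overview of the whole machine, a small sketch like the following (my own illustration, not part of the quoted kernel documentation; Linux-only, Java 7+) reads every /proc/<pid>/oom_score and prints the ten highest-scoring processes, i.e. the likeliest OOM-killer victims:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class OomScores {
    public static void main(String[] args) throws Exception {
        File[] entries = new File("/proc").listFiles();
        if (entries == null) {
            System.err.println("/proc not available (Linux only)");
            return;
        }
        List<int[]> scores = new ArrayList<int[]>();   // {pid, oom_score}
        for (File dir : entries) {
            if (!dir.isDirectory() || !dir.getName().matches("\\d+")) continue;
            try {
                byte[] raw = Files.readAllBytes(Paths.get(dir.getPath(), "oom_score"));
                scores.add(new int[] { Integer.parseInt(dir.getName()),
                                       Integer.parseInt(new String(raw).trim()) });
            } catch (Exception ignored) {
                // the process may have exited between listing and reading
            }
        }
        // highest badness first
        Collections.sort(scores, new Comparator<int[]>() {
            public int compare(int[] a, int[] b) { return b[1] - a[1]; }
        });
        for (int i = 0; i < Math.min(10, scores.size()); i++) {
            System.out.println("pid " + scores.get(i)[0] + "  oom_score " + scores.get(i)[1]);
        }
    }
}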
Situation:
I have a Java 1.7 process with multiple threads running on CentOS 6. The process currently halts (i.e., it is stuck in some kind of loop or waiting call). Due to the complexity of the program, it is difficult to do routine debugging in, for instance, Eclipse (more explanation in the Background section below). Therefore, I'd like to debug the code by looking at the currently running stacks.
Question:
Is there a Linux command that would allow me to print the stacks and identify the threads/methods that are currently running, so that I can find which method is causing the halt?
Background:
The reasons for not being able to debug in Eclipse:
It is a MapReduce program typically run on multiple computers.
Even if I run it on one computer, it still involves multiple threads running simultaneously.
Most importantly, the "halting bug" occurs randomly (i.e., I have not been able to reproduce it). So my best shot is to identify the currently running method that caused the bug.
P.S. My approach may be completely wrong, so feel free to correct me and point me in the right direction.
Thanks for your help.
You can use jstack to get a thread dump of the current state. It will give you the currently running threads and their stack traces.
It will even do more for you: should there be any deadlocks present, it will tell you about them.
Apart from that, you can also use JVisualVM to monitor your application in real time (you can watch the threads live there and take thread dumps from it).
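For example (the PID is a placeholder), a dump including extra lock information can be written to a file with:
jstack -l JAVA_PID > thread-dump.txt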
From RedHat:
Following are methods for generating a Java thread dump on Unix:
1) Note the process ID number of the Java process (e.g. using top, a grep on ps -axw, etc.) and send a QUIT signal to the process with the kill -QUIT or kill -3 command. For example:
kill -3 JAVA_PID
Ubuntu 12.04 LTS
java -version
java version "1.6.0_38"
Java(TM) SE Runtime Environment (build 1.6.0_38-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)
4 core CPU - some Dell server hardware
From time to time, 10 threads run a "heavy" job that takes several minutes. At other times they do nothing.
1 thread is supposed to wake up every 5 seconds or so and send a quick ping over the network to another process. This works nicely as long as the other 10 threads are doing nothing, but when they are running a "heavy" job it never (or very rarely) gets to run and send its ping.
I could understand this if the "heavy" job were CPU intensive, but during such a job top reports something like 50-100% IO-wait and only around 1% CPU usage. Profiling shows that by far most of the time spent by the 10 threads is spent (waiting, I guess) in some NIO call. It all adds up and is kind of expected, because much of the heaviness of the job is reading files from disk.
What I do not understand is that during such a "heavy" job, the 1 thread doing the pings does not get to run. How can that be explained when top shows 1% CPU usage and it seems (from profiling and top) that the 10 threads spend most of their time waiting for IO? Isn't the 1 ping thread supposed to get execution time when the other threads are waiting for IO?
Java thread priority is equal on all 11 threads.
Spreading a few yields here and there in the 10 threads seems to solve (or reduce) the problem, but I simply do not understand why the ping thread does not get to run without the yields, when the other threads do little but wait for IO.
ADDITIONAL INFO 05.03.2014
I have reproduced the problem in a simpler setup, even though it is not very simple yet (you will have to find out how to install an Apache ZooKeeper server, but that is fairly easy; I can provide info later).
Find the Eclipse Kepler project here (Maven; build with "mvn package"): https://dl.dropboxusercontent.com/u/25718039/io-test.zip
Find the binary here: https://dl.dropboxusercontent.com/u/25718039/io-test-1.0-SNAPSHOT-jar-with-dependencies.jar
Start an Apache ZooKeeper 3.4.5 server (on port 2181) on one machine. On another, separate machine (this is where I have Ubuntu 12.04 LTS etc. as described above), run the binary as follows (first create a folder io-test-files; 50 GB of space is needed):
nohup java -cp io-test-1.0-SNAPSHOT-jar-with-dependencies.jar dk.designware.io_test.ZKIOTest ./io-test-files 10 1024 5000000 IP-of-ZooKeeper-server:2181 > ./io-test-files/stdouterr.txt 2>&1 &
First it creates ten 5 GB files (50 GB is way more than the machine's RAM, so the OS file cache does not help much), then it starts a ZooKeeper client (which is supposed to keep its connection with the ZooKeeper server alive by sending pings/heartbeats regularly), and then it starts 10 threads doing random access into the 10 files, creating a lot of disk IO but hardly any real use of the CPU. I see that the ZooKeeper client eventually loses its connection (the "Zk state" prints in stdouterr.txt stop saying "CONNECTED"), and that is basically what I do not understand. The ZooKeeper client thread only wants to send a tiny heartbeat every few seconds, and only if it cannot do that within a period of 20 seconds will it lose its connection. I would expect it to have easy access to the CPU, because all the other threads are basically just waiting for disk IO.
During the test I see the following using top:
Very high "Load average". Above 10 which I do not understand, because there are basically only 10 threads doing something. I also thought that "Load average" only counted threads that actually wanted to do real stuff on the CPU (not including waiting of IO), but according to http://en.wikipedia.org/wiki/Load_%28computing%29 Linux also counts "uninterruptible sleeps" including threads waiting of IO. But I really do not hope/think that it will prevent other threads that have real stuff to do, from getting their hands on the CPU
Very high %wa, but almost no %sy and %us on the CPU(s)
Here is the output from one of my runs: https://dl.dropboxusercontent.com/u/25718039/io-test-output.txt
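For readers who do not want to download the project, the pattern being tested is roughly the following stripped-down sketch (my own, hypothetical reconstruction, not the actual io-test code; the file path, read size and timings are placeholders): 10 threads do random reads from one large pre-created file while a scheduled "ping" thread logs how late each 5-second heartbeat actually fires.

import java.io.RandomAccessFile;
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class IoStarvationSketch {
    public static void main(String[] args) throws Exception {
        final String path = args[0];                       // a pre-created file much larger than RAM
        RandomAccessFile probe = new RandomAccessFile(path, "r");
        final long fileSize = probe.length();
        probe.close();

        // The "ping" thread: should fire every 5 s; prints how much time really passed.
        ScheduledExecutorService ping = Executors.newSingleThreadScheduledExecutor();
        final long[] last = { System.currentTimeMillis() };
        ping.scheduleAtFixedRate(new Runnable() {
            public void run() {
                long now = System.currentTimeMillis();
                System.out.println("heartbeat, " + (now - last[0]) + " ms since previous");
                last[0] = now;
            }
        }, 5, 5, TimeUnit.SECONDS);

        // The 10 "heavy" threads: endless random 1 KB reads, i.e. almost pure I/O wait.
        for (int i = 0; i < 10; i++) {
            new Thread(new Runnable() {
                public void run() {
                    byte[] buf = new byte[1024];
                    Random rnd = new Random();
                    try {
                        RandomAccessFile raf = new RandomAccessFile(path, "r");
                        while (true) {
                            raf.seek((long) (rnd.nextDouble() * (fileSize - buf.length)));
                            raf.readFully(buf);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }, "heavy-" + i).start();
        }
    }
}

If the starvation described above occurs, the heartbeat line should start reporting gaps far larger than 5000 ms while top still shows mostly %wa.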
I have a server that occasionally hangs when it exits. The hang only occurs about 1 time in 10 or less, and so far we can't figure out a way to reliably recreate the issue. I've walked through my code and believe I am closing all resources and stopping my threads, but obviously some of the time something is not closed properly.
Can anyone suggest debugging tips to help me test this when I can't reliably recreate it? I've tried running JVisualVM once it goes down, but it doesn't help much other than showing me that the SIGTERM threads are still running and everything is at 0% CPU, which I assume means a deadlock somewhere.
When the process hangs you can send SIGQUIT (kill -3) to the process and it will generate a thread dump. On HotSpot the dump is written to the JVM's standard output, so make sure that is being captured.
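If capturing the console output is awkward, a thread dump can also be produced from inside the application; the following sketch (my own illustration, not tied to the asker's server) installs a shutdown hook that prints every live thread and its stack when the process is asked to exit, which makes it easy to spot the non-daemon threads that keep it alive:

import java.util.Map;

public class ShutdownDump {
    public static void installHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            public void run() {
                // Dump every live thread with its daemon flag, state and stack to stderr.
                for (Map.Entry<Thread, StackTraceElement[]> e
                        : Thread.getAllStackTraces().entrySet()) {
                    Thread t = e.getKey();
                    System.err.println(t.getName()
                            + " (daemon=" + t.isDaemon() + ", state=" + t.getState() + ")");
                    for (StackTraceElement frame : e.getValue()) {
                        System.err.println("    at " + frame);
                    }
                }
            }
        }, "shutdown-thread-dump"));
    }
}

Call ShutdownDump.installHook() once at startup; the hook runs on a normal exit or SIGTERM, but note that it cannot fire if shutdown never starts at all.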
You could try using JConsole to monitor your server. You can visually monitor memory, CPU usage, the number of threads, and so on. It can also detect deadlocks if there are any.
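If the server is headless, remote JMX can be enabled at startup so JConsole can connect from another machine; something like the following (the port is arbitrary, the jar name is a placeholder, and authentication/SSL are disabled here, so use it only on a trusted network):
java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -jar yourServer.jar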
I'm running a Java application that is supposed to process a batch of files through some decisioning module. I need to process the batch for about 50 hours. The problem I'm facing is that the process runs fine for about an hour and then starts to idle. So I did this: I run the JVM for one hour, shut it down, and restart it after 30 minutes, but for some reason the second run takes almost 4-5 hours to do what the first run does in 1 hour. Any help or insights would be greatly appreciated.
I am running this on a 64-bit Windows R2 server with 2 Intel quad-core processors (2.53 GHz) and 24 GB RAM. The Java version is 1.6.0_22 (64-bit); the memory allotted to the application is a 16 GB heap and 2 GB PermGen.
The external module also runs on a JVM, and I am shutting that down too, but I have a feeling that it is holding on to memory even after shutdown. Before I start the JVM, RAM usage is 1 GB; after I end it, it tends to stay at about 3 GB. Is there any way I can ask Java to forcibly release that memory?
Are you sure the JVM you are trying to close is indeed closed?
Once a process ends, all of the RAM it had allocated is released. There's no way for a process to hang on to memory once it has exited, which also means there's no way for you to tell it to do so; this is handled by the operating system.
Frankly, this sounds like the JVM is still running, or something else is eating the RAM. Also, it sounds like you're trying to work around a vicious bug instead of hunting it down and killing it.
I suspect the JVM isn't exiting at all. I see this symptom regularly with a JBoss instance that grinds to a halt with OutOfMemoryExceptions. When I kill the JVM (via a SIGTERM/SIGQUIT), the JVM doesn't exit in a timely fashion since it's consuming all its resources throwing/handling OOM exceptions.
Check your logs and process table carefully to ensure the JVM quits properly. At that point the OS will clear all resources related to that process (including the consumed memory).
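One quick way to check the process table for lingering JVMs is the jps tool that ships with the JDK; it lists running Java processes with their main class and JVM arguments:
jps -lv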
I've noticed something in the process.
After I shut down the JVM, if I delete all the files that I wrote to the file system, the RAM usage comes back to the original 1 GB.
Does this lead to anything, and can I possibly do something about it?
Out of interest: Have you tried splitting up the process so that it can run in parallel?
A 50-hour job is a big job! You have a decent machine there (8 cores and 24 GB RAM) and you should be able to parallelise some parts of it.
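As a rough illustration (hypothetical names; it assumes the files in the batch can be processed independently of each other), an ExecutorService sized to the machine's cores could split the work like this:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelBatch {
    public static void main(String[] args) throws Exception {
        File[] batch = new File(args[0]).listFiles();            // the batch directory
        int cores = Runtime.getRuntime().availableProcessors();  // 8 on the machine above
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        List<Future<Void>> results = new ArrayList<Future<Void>>();
        for (final File f : batch) {
            results.add(pool.submit(new Callable<Void>() {
                public Void call() throws Exception {
                    process(f);   // placeholder for the real decisioning step
                    return null;
                }
            }));
        }
        for (Future<Void> r : results) {
            r.get();              // propagate any failure from the workers
        }
        pool.shutdown();
    }

    private static void process(File f) {
        // stand-in for the decisioning module
        System.out.println("processed " + f.getName());
    }
}

Whether this actually helps depends on where the 50 hours go; if the decisioning module itself is the bottleneck or is not thread-safe, the pool size would need to be tuned accordingly.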