We have an application deployed on Kubernetes. One of the services is an app running in the JVM.
Our application is faulty, it consumes to much memory. We are hitting the limit set in the Replication Controller, which makes it restart the pod.
Is it a good idea to use the replication controller for this? Or is it a better idea to limit the memory on the JVM (set it to something below the replication controller limit) and use something else inside the pod to restart our application?
If the JVM would stop with an out-of-memory exception, I could use memory dumps written by the JVM. Now I'm blind as to what occupied the memory.
Thank you for your reply!
In fact this is not a responsibility of Deployment / ReplicationController. The restarts are handled within Pod it self. As for handling memory, this is not a trivial issue and you should control it on both levels (things like heap sizes etc.) so that your app avoids hitting the limit, but if it does go awall, limits on pod will handle it, and that is perfectly ok (specially if you run more then one pod so you have HA).
The thing here is, that is is a bit tricky to tune memory limits well, and probably can be done best with trial and error + monitoring pod/container metrics (Prometheus/Grafana to the rescue :) ).
Related
We are running a set of Java applications in docker containers on OpenShift. On a regular basis we experience oom kills for our containers.
To analyse this issue we set up Instana and Grafana for monitoring. In Grafana we have graphs for each of our containers showing memory metrics e.g. JVM heap, memory.usage and memory.total_rss. From these graphs we know that the heap as well as the memory.total_rss of our containers is pretty stable on a certain level over a week. So we assume that we do not have a memory leak in our Java application. However, the memeory.total is constantly increasing over the time and after a couple of days it goes beyond the configured memory limit of the docker container. As far as we can see this doesn't cause Openshift to kill the container immediately but sooner or later it happens.
We increased the memory limit of all our containers and this seems to help since Openshift is not killing our containers that often anymore. However we still see in Grafana that the memeory.total is exceeding the configures memory limit of our containers significantly after a couple of days (rss memory is fine).
To better understand Openshifts OOM killer, does anybody know which memory metric Openshift takes into account to decide if a container has to be killed or not? Is the configured container memory limit related to the memory.usage or the memory.total_rss or something completely different?
Thanks for help in advance.
I usually use Spring Boot + JPA + Hibernate + Postgres.
At the end of the development of a WEB application I compile in Jar, then I run it directly with Java and then I do reverse proxy with Apache (httpd).
I have noticed that when starting there are no problems or latency, when accessing the website it works very quickly, but when several hours pass without anyone making a request to the server and then I want to access I must wait at least 20 seconds until the server responds, after this I can continue to access the site normally.
Why does this happen ?, It is as if Spring were in standby mode every time it detects that it has no load of requests, but I am not sure if it is so or is a problem. If it's some native spring functionality, how can I disable it?
Although I need to use a little more memory in idle state I want the answers to be fast regardless of whether it is loaded or not.
Without knowing more, it is likely that while your webapp is sitting idle, other programs on your server is using memory and cause the JVM memory to be swapped to disk.
When you then access the webapp again, the OS has to swap that JVM memory back into RAM, one page at a time. That takes time, but once the memory is back in RAM, your webapp will run normally.
Unfortunately, the way Java memory works, swapping JVM memory to disk is very bad for performance. That is an issue for most languages that rely on garbage collectors to free memory. Languages with manual memory management, e.g. C++ code, will usually not be hit as badly, when memory is swapped to disk, because memory use is more "focused" in those languages.
Solution: If my guess at the cause of your problem is correct, reconfigure your server so the JVM memory won't be swapped to disk.
Note that when I say server, I mean the physical machine. The "other programs", that your JVM is fighting for memory, might be running in different VMs, i.e. not in the same OS.
I have a very strange issue. I have a java web app (spring boot 1.5) which runs inside a docker container.
At some point the app starts consuming the CPU quire hard.
So i was thinking that the app itself has a bug of some sort.
BUT
If i remove the app from the load balancer, so it will no accept any connections, the app continues to consume a lot of CPU, even it is not accessed at all.
I continue to see a lot of GC log entries from the app in the log file.
It seems that the JVM continues to run GC on Young gen every 300ms, even when the app should be completely idle (and it is idle as there is nothing in the logfile)!
The app itself is just a website using spring boot. Nothing really special there (no scheduled task or what so ever).
Any idea what might going on here ? Can it be docker related ?
Thanks in advance
OK,
it turns out this had nothing to do with docker. Was a bug in the app, creating many(unnecessary) short lived objects which needed GC.
Since moving to Tomcat8/Java8, now and then the Tomcat server is OOM-killed. OOM = Out-of-memory kill by the Linux kernel.
How can I prevent the Tomcat server be OOM-killed?
Can this be the result of a memory leak? I guess I would get a normal Out-of-memory message, but no OOM-kill. Correct?
Should I change settings in the HEAP size?
Should I change settings for the MetaSpace size?
Knowing which Tomcat process is killed, how to retrieve info so that I can reconfigure the Tomcat server?
Firstly check that the oomkill isn't being triggered by another process in the system, or that the server isn't overloaded with other processes. It could be that Tomcat is being unfairly targeted by oomkill when some other greedy process is the culprit.
Heap should be set as a maximum size (-Xmx) to be smaller than the physical RAM on the server. If it is more than this, then paging will cause desperately poor performance when garbage collecting.
If it's caused by the metaspace growing in an unbounded fashion, then you need to find out why that is happening. Simply setting the maximum size of metaspace will cause an outofmemory error once you reach the limit you've set. And raising the limit will be pointless, because eventually you'll hit any higher limit you set.
Run your application and before it crashes (not easy of course but you'll need to judge it), kill -3 the tomcat process. Then analyse the heap and try to find out why metaspace is growing big. It's usually caused by dynamically loading classes. Is this something your application is doing? More likely, it's some framework doing this. (Nb oom killer will kill -9 the tomcat process, and you won't be able to diagnostics after that, so you need to let the app run and intervene before this happens).
Check out also this question - there's an intriguing answer which claims an obscure fix to an XML binding setting cleared the problem (highly questionable but may be worth a try) java8 "java.lang.OutOfMemoryError: Metaspace"
Another very good solution is transforming your application to a Spring Boot JAR (Docker) application. Normally this application has a lot less memory consumption.
So steps the get huge improvements (if you can move to Spring Boot application):
Migrate to Spring Boot application. In my case, this took 3 simple actions.
Use a light-weight base image. See below.
VERY IMPORTANT - use the Java memory balancing options. See the last line of the Dockerfile below. This reduced my running container RAM usage from over 650MB to ONLY 240MB. Running smoothly. So, SAVING over 400MB on 650MB!!
This is my Dockerfile:
FROM openjdk:8-jdk-alpine
ENV JAVA_APP_JAR your.jar
ENV AB_OFF true
EXPOSE 8080
ADD target/$JAVA_APP_JAR /deployments/
CMD ["java","-XX:+UnlockExperimentalVMOptions", "-XX:+UseCGroupMemoryLimitForHeap", "-jar","/deployments/your.jar"]
First, just a bit of background:
One of our customers is experiencing CPU usage spikes for WebSphere instances running one of our web apps (other instances with other apps are fine). They have a test environment and a live environment (both iSeries) which both experience the problem - with a single app per instance setup. We have deployed this application locally in our own test environments and also for many other customers all on iSeries with no similar problems.
What's actually happening:
Every one second or so, the CPU usage for the WebSphere process' CPU usage jumps to anywhere from 7%-20% even though there are no requests being processed at the time. Customer has reported seeing spikes as high as 30%. These spikes average out to be 1.5% of CPU overall - the other WebSphere instances typically use 0%-0.1% when idle.
My investigations so far
So, I had a look at the threads. One thread in there test environment was using ~350 CPU cycles per second. A similar thread in their live environment was using ~1500 CPU cycles per second (showing that it has bigger CPU). The call stack for these threads looks like
Type Program Statement Procedure
QLESPI QSYS 17 LE_Create_Thread2__FP12crtt >
QJVALIBJVM QSYS 7 startThread__FPv
J com/ibm/ws/util/Threa > run
J com/ibm/ws/util/Threa > run
J com/ibm/ws/util/Threa > getTask
J com/ibm/ws/util/Bound > poll
The entire class name from the bottom line is com/ibm/ws/util/BoundedBuffer. I asked the customer to do a JVM Dump for me - the only additional information I got from this was the thread name:
Thread: 00002F82 Deferrable Alarm : 11
Now for my questions:
Can any of you identify the problem, given these symptoms? (Maybe that's a long shot!)
What is Deferrable Alarm? From the JVM Dump, I can see 4 threads with this name. The other three seem to be doing just fine. By debugging my local WebSphere (on Windows) and adding breakpoints in the BoundedBuffer class, I see that BoudedBuffers are polling and periodically invoking some listener.
I don't have access to the WebSphere console for the customer machines, and they aren't owning up to having made any config changes. I can ask them to check the console for me though - what should I be asking them to look at?
I have telnet access to the customer boxes, is there anything else I can investigate here? Looking at the WebSphere profile files, etc? Which files should I be looking at?
Because the Call Stack and JVM Dump don't explicitly reference our code, is it safe to assume that this is a configuration problem?
It's been a long question, so thanks for reading this far.
30 April Update (1)
This morning I've noticed that this behaviour only happens after the first request of the day has been processed (irrespective of which Web Service is invoked). This points the finger back at our application or Apache Axis. Could it be that this is just normal behaviour?!
30 April Update (2)
So it seems that this CPU activity is some kind of housekeeping activity for the web-container or maybe something within Apache Axis. I've now observed this happening on a few different web-applications on a few different servers. Applications with no web component don't suffer the same additional CPU overhead.
I'd imagine if it is housekeeping work, that "tuning" it somehow could be counter productive - by that, I mean that making the App Server idle better would probably negatively affect the amount of "real" work it can do.
You could try to profile and do heap dumps of the application, that could answer a few questions related to memory and cpu usage.
I would recommend following the must gather documentation provided by IBM, and raising a PMR along with your own investigation. Things you might suspect:
Garbage collection (unlikely on low application utilization)
Timers or tasks (such as java.util.Timer or commonj work manager)
Pretest connection that has a complex SQL query (in the DataSource's WebSphere Application Server data source properties)
I would also recommend using the profiler to determine the cause, YourKit profiler is a pretty decent one.
Very instinctively (being unfamiliar with iSeries platforms) I would look at disk IO related issues. Can you describe the disk subsystem? Can you see if your app is spending an unusually large amount of time in iowait?
I know this doesn't quite match your problem. But might be worth a look if your running prior to WAS 6.1 patch 17.
http://www-01.ibm.com/support/docview.wss?uid=swg24018437
Hope this helps. Cheers John
My best guess is that it is some type of monitoring is being done on the instance, like Tivioli etc. Have you ruled out any GC activity?
HTH Tom
Most application servers are implemented in java itself and so is WebSphere. This servers apart from serving client requests have to do other periodical jobs like say resource pool management. Performing this jobs will create some temporary objects that needs to be garbage collected.
Depending on how much heap you have allocated, usage and garbage collector settings, garbage collector will be invoked. I'd say try to see if it is garbage collector thread that is taking up your CPU. For this connect jconsole utility to remote websphere process for a day and see if there is any co-relation between heap usage and cpu usage.
I am also experiencing this very same issue, [Deferrable Alarm:x] using with BoundedBuffer. The only difference I have is that this is on a Windows 7 64bit machine. There is absolutely no Tivioli or other batch process running, no requests being made, the single instance is just idle.
I can run the application in DEBUG mode and pause the Deferrable Alarm thread and the CPU spikes stop, resume and they start again.
I've checked disk activity, network activity and their is nothing happening there.
I am running WebSphere 6.1.0.27 .