I'm running a long task (about 6 hours) on a backend instance in my Google App Engine application.
Here is the backend configuration:
<backend name="backend_1">
  <class>B4_1G</class>
  <options>
    <public>true</public>
    <dynamic>false</dynamic>
  </options>
</backend>
When the process is running (in the default thread or a parallel thread; I tried both), after a random amount of time I get:
2013-09-13 18:52:14.677
Process terminated because the backend took too long to shutdown.
I've looked around for a solution and read about the shutdown hook for backend instances, which I implemented, but it does not seem to be working:
LifecycleManager.getInstance().setShutdownHook(new ShutdownHook() {
    public void shutdown() {
        log().info("Shutting down...");
        LifecycleManager.getInstance().interruptAllRequests();
    }
});
The log message is never shown; I only ever see the "Process terminated because..." message.
I also implemented the isShuttingDown check:
LifecycleManager.getInstance().isShuttingDown();
At the start of every cycle of my process, the first thing I check is whether the backend is shutting down, but this flag is never true.
The process is always "brutally" interrupted without any hook giving me control over the shutdown (maybe I could stop the operation and save some data for future resumption).
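For what it's worth, the resumable pattern I'm trying to get to looks roughly like this. It is a self-contained sketch, not my actual code: isShuttingDown() here is just a flag standing in for LifecycleManager.getInstance().isShuttingDown(), and persisting the checkpoint index (e.g. to the datastore) is only indicated by a comment.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointedTask {
    // Stand-in for LifecycleManager.getInstance().isShuttingDown(),
    // kept as a plain flag so the sketch compiles on its own.
    static volatile boolean shuttingDown = false;

    static boolean isShuttingDown() {
        return shuttingDown;
    }

    // Processes items one at a time and returns the index of the next
    // unprocessed item, so a later run can resume from that checkpoint.
    static int runUntilShutdown(List<String> items, int startIndex, List<String> done) {
        for (int i = startIndex; i < items.size(); i++) {
            if (isShuttingDown()) {
                // Persist `i` here (e.g. to the datastore) before returning,
                // so a fresh instance can resume where this one left off.
                return i;
            }
            done.add(items.get(i).toUpperCase()); // the actual unit of work
        }
        return items.size();
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c", "d");
        List<String> done = new ArrayList<>();
        int resumeAt = runUntilShutdown(items, 0, done);
        System.out.println("processed=" + done.size() + " resumeAt=" + resumeAt);
        // prints: processed=4 resumeAt=4
    }
}
```

The key point is that each unit of work is small, so the loop notices the shutdown flag quickly instead of being killed mid-operation.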
I thought about an out-of-memory error, but I'm not storing any big objects in memory. Also, at the end of every cycle I set the instance variables to null, making that memory eligible for garbage collection.
Also, if this were the problem, I would expect an error like:
Uncaught exception from servlet
java.lang.OutOfMemoryError: Java heap space
Am I the only one experiencing this kind of problem?
I have already read this article, but it offered no solution.
I am having the same issue with Python. In my case, all I see is the terminated message; CPU and memory usage look fine when checked. I nevertheless profiled and brought the memory consumption down, and I'm still seeing the same thing. In your case it looks like a memory issue, so you could try profiling your code and lowering its memory footprint. Also, as I noticed over around 50 runs of the backend, the shutdown handler is not guaranteed to be called; the Google developers make this clear in the Google I/O video on backends.
After chasing this error for some time now, it seems the App Engine development paradigm revolves around URL handlers with limits on time, memory, and so on, and this applies to long-running tasks too. I redid my long-running task as a series of small tasks: task queues trigger small tasks which run and, before finishing, queue the next task. It has never failed, not even once!
The advantage is that task queues have better failsafe/handover behavior than one huge cron job: one task failing does not mean the rest of the huge task list fails.
Also, the shutdown hooks work perfectly when driving backends with queues; they fire every time, compared to only sometimes with cron jobs.
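Sketched in plain Java, the chaining pattern looks something like the following. The Deque stands in for the App Engine task queue so the sketch runs on its own; in real code each task would enqueue a TaskOptions payload carrying the new offset, and the queue service would drive the chain for you.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ChainedTasks {
    static final int BATCH = 100;

    // One "task": process up to BATCH units of work, then, if work
    // remains, enqueue a follow-up task carrying the new offset.
    static void runTask(int offset, int total, Deque<Integer> queue, int[] processed) {
        int end = Math.min(offset + BATCH, total);
        for (int i = offset; i < end; i++) {
            processed[0]++; // the actual unit of work goes here
        }
        if (end < total) {
            queue.add(end); // in real GAE code: enqueue a TaskOptions with this offset
        }
    }

    // Drives the chain to completion; on App Engine the task queue
    // service plays this role, with retries on failure per task.
    static int runChain(int total) {
        Deque<Integer> queue = new ArrayDeque<>();
        int[] processed = {0};
        queue.add(0);
        while (!queue.isEmpty()) {
            runTask(queue.poll(), total, queue, processed);
        }
        return processed[0];
    }

    public static void main(String[] args) {
        System.out.println(runChain(350)); // prints 350
    }
}
```

Because each task is short, it finishes well within the request deadline, and a failure only costs one batch rather than the whole job.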
Related
I have a fairly simple streaming Dataflow pipeline reading from pubsub and writing to BigQuery using BATCH_LOADS (with streaming engine enabled). We do have one version of this pipeline working, but it seems very fragile, and simple additions to the code seem to tip it over and the worker process starts to eat up memory.
Sometimes the Java heap fills up, gets java.lang.OutOfMemoryError, dumps the heap, and jhat shows the heap is full of Windmill.Message objects.
More often, the machine gets really slow (kernel starts swapping), then the kernel OOM killer kills Java.
Today I have further evidence that might help debug this issue: a live worker (compute.googleapis.com/resource_id: "1183238143363133621") that started swapping but managed to come out of that state without crashing. The worker logs show that the Java heap is using 1GB (total memory), but when I ssh into the worker, "top" shows the Java process is using 3.2GB.
What could be causing Java to use so much memory outside of its heap?
I am using Beam 2.15 PubsubIO, and a clone of Beam's BigQueryIO with some modifications. I have tried increasing to a larger machine size, but it only delays the failure. It still eventually fills up its memory when the pubsub subscription has a lot of backlog.
EDIT:
Some more details: the memory issues seem to happen earlier in the pipeline than BigQueryIO. There are two steps between PubsubIO.Read and BigQueryIO.Write: Parse and Enhance. Enhance uses a side input, so I suspect fusion is not being applied to merge those two steps.
Triggers are very slow to fire (why?), so Enhance is slow to start: the side input's Combine.Globally is delayed by about 3 minutes, and even after it is ready, WriteGroupedRecords is sometimes called 10 minutes after I know the data was ready, often with far more than 3 minutes' worth of data.
Often, especially when the pipeline is just starting, the Parse step will pull close to 1,000,000 records from Pub/Sub. Once Enhance starts working, it quickly processes those rows and turns them into TableRows, then keeps pulling more and more data from Pub/Sub, continuing for 10 minutes without WriteGroupedRecords being called. It seems like the runner favors the earlier pipeline steps (maybe because of the sheer number of elements in the backlog) instead of firing the window triggers that activate the later steps (and side inputs) as soon as possible.
I run a web server on Tomcat 7 with Java 8. The server performs a lot of I/O, mostly DB and HTTP calls; each transaction consumes a generous amount of memory, and it serves around 100 concurrent users at a given time.
After some time, around 10,000 requests in (but nothing exact), the server starts to hang, either not responding or responding with empty 500 responses.
I see some errors in the logs which I am currently trying to solve, but what bugs me is that I can't figure out the root cause. The catalina log file shows no heap space exception; moreover, I took some memory dumps, and there always seems to be room to grow and garbage to collect, so I decided it is not a memory problem. Then I took thread dumps. I consistently saw dozens of threads in WAITING, TIMED_WAITING, PARKING, and similar states; from what I read, such threads are available to handle incoming work.
It's worth mentioning that all the work is done asynchronously with no blocking operations, and all the thread pools appear to be available. Moreover, if I stop traffic to the server and let it rest for a while, the issue still doesn't go away. So I figured it's also not a thread problem.
So... my question is:
Could it be a memory issue after all? Could it be a thread/CPU issue? Could it be anything else?
TL;DR: Is there a foolproof (!) way for my master JVM to detect that my slave JVM, spawned via two intermediate scripts, has experienced an OutOfMemoryError on Linux?
Long version:
I'm running some sort of application launcher. Basically, it receives some input and reacts by spawning a slave Java application to process that input. This happens via a Python script (to correctly handle remote kill commands), which in turn calls a Bash script (generated by Gradle to set up the classpath) that actually spawns the slave.
The slave contains a worker thread and a monitor thread that makes callbacks to a remote host for status updates. If status updates fail to arrive for a set amount of time, the slave gets killed by the launcher. The reason for it not responding CAN be an OutOfMemoryError, but it can also be something else; I need to differentiate an OutOfMemoryError in the slave from any other error that caused it to stop working.
I don't just want to monitor memory usage and declare failure once it reaches, say, 90%. It may well be that the GC frees enough memory for the workload to finish. I only want to know whether it failed to clean up and the JVM died because not enough memory could be freed.
What I have tried:
Using the -XX:OnOutOfMemoryError JVM option on the slave to call a script that creates an empty flag file; if the slave dies, the launcher checks whether the flag file exists. This worked like a charm on Windows but did not work at all on Unix, because of a bug that makes executing the hook require roughly the same amount of memory (Xmx) the slave was already using. See https://bugs.openjdk.java.net/browse/JDK-8027434 for the bug. => Solution discarded, because the slave needs the entire memory of the machine.
try {
    longWork();
} catch (OutOfMemoryError e) {
    createOomFlagFile();
    System.exit(100);
}
This does work in some cases, but there are also cases where it does not: the monitor thread simply stops sending status updates, no exception occurs, and no OOM flag file gets created. I know from SSHing onto the machine, though, that Java is eating all the memory available on the system and the whole system is slow.
Is there some (elegant) foolproof way to detect this which I am missing?
You shouldn't wait for the OutOfMemoryError. My suggestion is to track the slave's memory consumption from the master application via the Java management beans and issue warnings when memory consumption becomes critical. I've never done that myself, so I can't be more precise about how, but maybe you can work it out or others here can provide a solution.
Edit: this is the respective MXBean http://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryMXBean.html
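For example, reading heap usage through that bean looks roughly like this. This is a minimal local sketch; monitoring the slave from the master would additionally require connecting to the slave's MBean server over remote JMX, which is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapMonitor {
    // Returns heap usage as a fraction of the maximum heap size,
    // or -1 if the JVM reports no maximum (getMax() can be -1).
    static double heapUsedFraction() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        long max = heap.getMax();
        if (max < 0) {
            return -1;
        }
        return (double) heap.getUsed() / max;
    }

    public static void main(String[] args) {
        double used = heapUsedFraction();
        if (used >= 0 && used > 0.9) {
            System.out.println("WARNING: heap above 90%");
        } else {
            System.out.printf("heap usage: %.1f%%%n", used * 100);
        }
    }
}
```

As noted in the answer above, a high reading is only a warning sign, not proof of an impending OutOfMemoryError; the GC may still recover.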
We've got a somewhat grown Spring webapp (on Tomcat 7) that is very slow to shut down (which hurts the performance of our continuous delivery pipeline).
My suspicion is that some bean is blocking (or taking very long) in its @PreDestroy method.
So far I've ruled out a thread (pool) that isn't shut down correctly, by giving distinct names to every pool, thread, and timer and ensuring that they are either daemon threads or are shut down correctly.
Has anybody ever solved a situation like this? Can you give me a hint on how to cope with it?
BTW: killing the Tomcat process is not an option; we really need a clean shutdown for our production system.
Profiling would be the nuclear option. It's probably easier to get a picture of what's happening (especially if it's just blocked threads, since that state is long-lived) using thread dumps. If you take two dumps a few seconds apart and they show the same or similar output for one or more threads, those threads are probably the bottleneck. You can get a thread dump using jstack or "kill -3" (on a sensible operating system).
And if you're on Windows, selecting the Java console window and hitting Ctrl+Pause will dump to that window; just hit Enter to resume execution.
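If you only have programmatic access, a rough equivalent can be produced from inside the JVM with Thread.getAllStackTraces(). This is a minimal sketch; jstack output is richer (it shows held locks, for instance), so prefer it when you can run it.

```java
import java.util.Map;

public class ThreadDumper {
    // Builds a minimal thread dump: thread name, state, and stack frames.
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append(t.getName()).append(" [").append(t.getState()).append("]\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Comparing two dumps taken a few seconds apart, as suggested
        // above, highlights threads stuck in the same place.
        System.out.print(dump());
    }
}
```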
These things obviously require close inspection and access to the code to analyze thoroughly and give good suggestions. That is not always possible, however, and I hope useful tips can be offered based on the information I provide below.
I have a server application that uses a listener thread to listen for incoming data. The incoming data is interpreted into application specific messages and these messages then give rise to events.
Up to that point I don't really have any control over how things are done.
Because this is a legacy application, these events were previously handled by that same listener thread (it was largely a single-threaded application). The events are sent to a black box, and out comes a result that should be written to disk.
To improve throughput, I wanted to employ a thread pool to take care of the events, the idea being that the listener thread could just spawn a new task every time an event is created, and the pool threads would take care of the black-box invocation. Finally, I have a background thread performing the writes to disk.
With just the previous setup plus the background writer, everything works OK, and the throughput is ~1.6 times what it was before.
When I add the thread pool, however, performance degrades. At the start everything seems to run smoothly, but after a while everything becomes very slow, and finally I get OutOfMemoryErrors. The weird thing is that when I print the number of active threads each time a task is added to the pool (along with how many tasks are queued, and so on), it looks as if the thread pool has no problem keeping up with the producer (the listener thread).
Using top -H to check CPU usage, the load is spread quite evenly at the outset, but at the end the worker threads are barely ever active and only the listener thread is. Yet it doesn't seem to be submitting more tasks...
Can anyone hypothesize a reason for these symptoms? Is it likely that something in the legacy code (which I have no control over) just goes bad when multiple threads are added? The out-of-memory issue would suggest some queue somewhere grows too large, but since the thread pool almost never contains queued tasks, it can't be that.
Any ideas are welcome, especially ideas for diagnosing a situation like this more efficiently. How can I get a better profile of what my threads are doing, etc.?
Thanks.
Slowing down and then running out of memory implies a memory leak.
So I would start by using some Java memory analyzer tools to identify if there is a leak and what is being leaked. Sometimes you get lucky and the leaked object is well-known and it becomes pretty clear who is hanging on to things that they should not.
Thank you for the answers. I read up on Java VisualVM and used that as a tool. The results and conclusions are detailed below. Hopefully the pictures will work long enough.
I first ran the program and created some heap dumps, thinking I could just analyze them and see what was taking up all the memory. This would probably have worked, except the dump file got so large that my workstation struggled to even open it. After waiting two hours for one operation, I realized I couldn't go down this path.
So my next option was something I, stupidly enough, hadn't thought of: I could just reduce the number of messages sent to the application. The trend of increasing memory usage should still be there, and the dump file would be smaller and faster to analyze.
It turns out that when sending messages at a slower rate, no out-of-memory issue occurred! A graph of the memory usage can be seen below.
The peaks are results of cumulative memory allocations and the troughs that follow are after the garbage collector has run. Although the amount of memory usage certainly is quite alarming and there are probably issues there, no long term trend of memory leakage can be observed.
I started to incrementally increase the rate of messages sent per second to see where the application hits the wall. The image below shows a very different scenario than the previous one...
Because this happens when the rate of messages is increased, my guess is that freeing up the listener thread lets it accept a lot of messages very quickly, and this causes more and more allocations. The garbage collector can't keep up, and memory usage hits a wall.
There's of course more to this issue but given what I have found out today I have a fairly good idea of where to go from here. Of course, any additional suggestions/comments are welcome.
This question should probably be recategorized as dealing with memory usage rather than thread pools... The thread pool wasn't the problem at all.
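One way to cap those allocations, assuming the listener hands events to a java.util.concurrent pool, is to give the pool a bounded queue with CallerRunsPolicy: when the queue fills up, the listener thread runs the task itself, which slows the producer down instead of buffering without limit. A sketch (the incrementAndGet stands in for the real black-box call):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BackpressurePool {
    // Submits `events` tasks through a pool whose queue holds at most
    // 100 entries; CallerRunsPolicy provides backpressure when full.
    static int process(int events) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger handled = new AtomicInteger();
        for (int i = 0; i < events; i++) {
            // The "listener" producing events; the task stands in
            // for the black-box invocation.
            pool.execute(handled::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // CallerRunsPolicy never rejects, so nothing is dropped.
        return handled.get();
    }

    public static void main(String[] args) {
        System.out.println(process(10_000)); // prints 10000
    }
}
```

With this setup, memory held by pending events is bounded by the queue capacity rather than by how fast the listener can produce.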
I agree with @djna.
The thread pool in Java's concurrency package works: it does not create threads if it does not need them, and you see the expected number of threads. This means that probably something in your legacy code is not ready for multithreading. For example, some code fragment may not be synchronized; as a result, some element is not removed from a collection, or extra elements are stored in one, so memory usage keeps growing.
BTW, I did not understand exactly which part of the application uses the thread pool now. Did you previously have one thread that processed events, and do several threads now do this? Have you perhaps changed the inter-thread communication mechanism? Added queues? This may be yet another direction for your investigation.
Good luck!
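To illustrate the synchronization point: if the legacy code tracked in-flight events in a plain HashSet, concurrent add/remove could corrupt it and removals could get lost, so the set would grow over time. A concurrent set avoids that. This is a self-contained sketch, not your actual code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SafePending {
    // A plain HashSet here could silently lose removals under concurrent
    // access; the ConcurrentHashMap-backed set keeps add/remove balanced.
    static final Set<Integer> pending = ConcurrentHashMap.newKeySet();

    static int runConcurrently(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            pool.execute(() -> {
                pending.add(id);    // event enters the system
                pending.remove(id); // event fully processed
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pending.size(); // 0 when no removal was lost
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(100_000)); // prints 0
    }
}
```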
As mentioned by @djna, it's likely some type of memory leak. My guess would be that you're keeping a reference to the request around somewhere:
In the dispatcher thread that's queuing the requests
In the threads that deal with the requests
In the black box that's handling the requests
In the writer thread that writes to disk.
Since you said everything worked fine before you added the thread pool into the mix, my guess would be that the threads in the pool are keeping a reference to each request somewhere. The idea being that, without the thread pool, you aren't reusing threads, so the information goes away when the thread does.
As recommended by @djna, you can use a Java memory analyzer to help figure out where the data is piling up.