I'm writing an application that should run for many hours (10-100), which I'm monitoring using JMX.
However, after some time, I discover two things:
com.sun.jmx.remote.internal.ArrayNotificationBuffer#1 keeps growing: after 20 hours it's about 10 MB, whereas when I started it was smaller than 1 MB.
More threads like RMI TCP Accept-0 (or any other number) and RMI-TCP-Connection(44)-[IP] are instantiated over time.
I suspect it has something to do with different connections to the application, but currently I'm only connected once, yet some connections still seem to be open.
How can that be? How can I fix this?
I was poking around in the source code for ArrayNotificationBuffer and it does a decent amount of JMX trace logging, so you might want to enable JMX tracing to get a better idea of what's going on.
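The JMX implementation logs through java.util.logging, so one way to surface that tracing is to raise the level on the javax.management loggers; a minimal sketch (the logger names are the standard JDK ones, but verify them against your JDK version):

    import java.util.logging.ConsoleHandler;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Turns up JMX trace logging to the console via java.util.logging.
    public class EnableJmxTrace {

        // Keep strong references so the loggers are not garbage collected.
        private static final Logger[] JMX_LOGGERS = {
                Logger.getLogger("javax.management"),
                Logger.getLogger("javax.management.remote")
        };

        public static void enable() {
            ConsoleHandler handler = new ConsoleHandler();
            handler.setLevel(Level.FINEST);
            for (Logger logger : JMX_LOGGERS) {
                logger.setLevel(Level.FINEST);
                logger.addHandler(handler);
            }
        }
    }

The same effect can usually be achieved with a logging.properties file passed via -Djava.util.logging.config.file, if you'd rather not change code.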
You may find that this known bug is affecting you. The bug report indicates the issue is observed on long-lasting connections. A couple of work-arounds are mentioned there, though a simpler one, if practical for you, is to periodically disconnect and reconnect. The good news is that there appears to be a patch for this in Java 7, although I'm not sure whether it has reached a released build yet.
I would also make sure that, if you are registering JMX notification listeners, they are continually and promptly handling notifications; failure to do so might also cause this symptom.
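One common pattern is to keep handleNotification() cheap and hand the real work off to an executor so the server-side notification buffer gets drained quickly; a rough sketch (class name and processing logic are made up):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import javax.management.Notification;
    import javax.management.NotificationListener;

    // Listener that returns quickly from handleNotification() by delegating to a worker thread.
    public class AsyncNotificationListener implements NotificationListener {

        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        public void handleNotification(Notification notification, Object handback) {
            final Notification n = notification;
            worker.submit(new Runnable() {
                public void run() {
                    process(n); // hypothetical application-specific handling
                }
            });
        }

        private void process(Notification n) {
            // ... do the (possibly slow) work here, off the JMX dispatch thread ...
        }
    }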
Can I rely on the "started" text in its STDOUT? Is there a standard way to do this?
I'm basically trying to find a way to know this without pinging the WebApp and without having a timeout. I'm looking for a reliable way.
If you can't modify the webapp, and it doesn't give you any other notification, then there's going to be no better way than pinging it every second or so to see if it's up. It might feel a bit inelegant, but in reality there's nothing wrong with that approach. I'm assuming here that it's going to start up within a few seconds - if it's going to take on the order of minutes or hours before it's up, then it might be worth combining this with reading the "started" text from stdout (though that approach seems a bit more fragile).
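A minimal polling sketch along those lines, assuming the webapp exposes some URL (the health URL, deadline, and one-second interval below are illustrative):

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Polls a URL once a second until the webapp answers with HTTP 200 or the deadline passes.
    public final class StartupProbe {

        public static void waitUntilUp(String healthUrl, long timeoutMillis) throws Exception {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                try {
                    HttpURLConnection conn = (HttpURLConnection) new URL(healthUrl).openConnection();
                    conn.setConnectTimeout(1000);
                    conn.setReadTimeout(1000);
                    if (conn.getResponseCode() == 200) {
                        return; // the webapp is up
                    }
                } catch (IOException notUpYet) {
                    // connection refused or timed out - keep waiting
                }
                Thread.sleep(1000);
            }
            throw new IllegalStateException("Webapp did not come up within " + timeoutMillis + " ms");
        }

        public static void main(String[] args) throws Exception {
            waitUntilUp("http://localhost:8080/", 60000L); // URL and deadline are examples
        }
    }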
If you could modify it (or build a wrapper around it), and don't like the "keep pinging it" approach, then the only other way would be some kind of inter-process communication (such as over a localhost socket), so that you could connect to it and it could push a message out to you once it had spun up. Worth pointing out, though, that in most scenarios the additional effort of this approach won't be warranted.
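If you did go the push route, the watcher side can be as small as listening on a localhost port for a one-line signal; a sketch under the assumption that the webapp (or its wrapper) connects and writes "STARTED" once initialization finishes (the port and message are made up):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.Scanner;

    // Watcher process: blocks until the webapp connects to localhost:9999 and reports "STARTED".
    public class StartupSignalWatcher {

        public static void main(String[] args) throws IOException {
            ServerSocket server = new ServerSocket(9999); // illustrative port
            try {
                Socket app = server.accept();
                Scanner in = new Scanner(app.getInputStream());
                if (in.hasNextLine() && "STARTED".equals(in.nextLine())) {
                    System.out.println("Webapp reports it is up");
                }
                in.close();
            } finally {
                server.close();
            }
        }
    }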
I run a web server on Tomcat 7 with Java 8. The server performs a lot of I/O operations - mostly DB and HTTP calls - each transaction consumes a generous amount of memory, and it serves around 100 concurrent users at any given time.
After some time - around 10,000 requests, though nothing consistent - the server starts to hang, either not responding or responding with empty 500 responses.
I see some errors in the logs which I'm currently trying to solve, but what bugs me is that I can't figure out what ultimately causes this. The catalina log file does not show a heap space exception, and I took some memory dumps which suggest there is always room to grow and garbage to collect, so I decided it is not a memory problem. Then I took thread dumps; I always see dozens of threads in WAITING, TIMED_WAITING, parking, etc. From what I've read, those threads should be available to handle incoming work.
It's worth mentioning that all the work is done asynchronously, with no blocking operations, and it seems like all the thread pools are available. Moreover, if I stop the traffic to the server and let it rest for a while, the issue still doesn't go away. So I figured it's also not a thread problem.
So...my question is:
Could it still be a memory issue? Could it be a thread or CPU issue? Could it be anything else?
I'm attempting to detect a bottleneck in our servers, and I'm having a hard time deciding where to start. The symptoms on the machine before it crashes are dropped connections (timeouts - what a client sees when a response takes too long, possibly indicating that the server-side code couldn't get a processor allocated and the request couldn't be handled) and out-of-memory issues. The latter error is actually reported by the JVM in an error log, but I have a hard time believing that both RAM and the lack of available CPUs are the bottlenecks at the same time. (On the days of the previous crashes, it has consistently crashed in the manner described above.) We have our own in-house server code, and I wouldn't call it analogous to Apache or any other server I've seen. (Sorry if this makes giving advice that much more difficult.)
I'd like to take some time to create a somewhat controlled test locally. I'm running the server, and I'm writing a program that will request different things from the local server. What's a good way to monitor RAM/CPU? I'm currently using Java's VisualVM, but its monitors stop responding when I hammer the server with some of the tests.
Any ideas out there would be greatly appreciated. As I mentioned, I'm trying to gather as much useful data as possible to help me troubleshoot further. In general, when a bottleneck issue like this arises, what are some general strategies to take? The live servers are all running Windows Server 2008. The version of Java is 7.03. My local box runs Windows 7, also with Java 7.03. I don't want to make too many assumptions, but I think it's reasonable to assume that Server 2008 and Windows 7 are pretty similar (the OS architecture is the same). Aside from that, my local box has hardware identical to our servers.
For Windows, you need to:
1) Establish a "performance baseline"
2) Try to reproduce the bottleneck, and compare the behavior under stress with the baseline
3) Identify the cause of the bottleneck, and correct it.
These links will help:
http://technet.microsoft.com/en-us/library/cc781394%28v=ws.10%29.aspx
http://msdn.microsoft.com/en-us/library/windows/hardware/gg463394.aspx
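On the Java side, since VisualVM's monitors drop out when you hammer the server, a lightweight fallback is to have the JVM log its own heap and CPU figures through the platform MXBeans; a rough sketch (it assumes a HotSpot/Oracle JVM for the com.sun.management extension, and the 5-second interval is arbitrary):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    // Rough in-process monitor: prints heap usage and process CPU load every 5 seconds.
    public class SelfMonitor implements Runnable {

        public void run() {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            while (!Thread.currentThread().isInterrupted()) {
                MemoryUsage heap = mem.getHeapMemoryUsage();
                // getProcessCpuLoad() returns a value in [0,1], or a negative value if unavailable.
                System.out.printf("heap used=%dMB committed=%dMB cpu=%.1f%%%n",
                        heap.getUsed() >> 20, heap.getCommitted() >> 20, os.getProcessCpuLoad() * 100);
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        public static void main(String[] args) {
            new Thread(new SelfMonitor(), "self-monitor").start();
        }
    }

Logging to a file instead of stdout makes it easier to line the numbers up with perfmon counters afterwards.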
You should also look at these TCP/IP registry settings:
Increase the range of ephemeral ports
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters, MaxUserPort
Consider disabling auto-tuning:
http://geekswithblogs.net/cajunmcse/archive/2010/08/20/windows-2008-exchange-and-tcpip-auto-tuning.aspx
Since last month, we've had a problem on our company's server (Win2008ServerStd + IIS7 + CF Enterprise 9.0.1 (hotfix 2)).
I used jConsole to monitor the ColdFusion JVM (1.6.0_24) activity, and here's what I see:
Notice that strange "curve" between 14:10 and 14:15! What is that?
Obviously it's not standard behaviour; when it happens, my applications hang for 30 to 70 seconds!
Do you know what could cause that memory issue? It seems like the GC does not run correctly, or hangs itself.
I don't expect an instant answer; I realize there could be many root problems causing this, but... where can I start investigating?
Using cfstat, perfmon, FusionReactor, or the CF Performance Monitor, take a look at running and queued requests during the problem. What you will likely see is running requests climbing past the simultaneous requests setting (in the CF admin). Then requests will start to queue. Eventually the queue will clear out (if your server is recovering on its own).
This sort of thing can be caused by a number of things: for example, your DB server slowing down or having an issue, a network problem, network ports resyncing, disk I/O problems, etc.
My guess is that you will drive yourself batty trying to figure this out by monitoring your heap. See if you can use one of those monitors to watch for specific scripts that might be the culprit.
The other comment (about indexing agents) is also a possibility. A flurry of indexing can definitely cause this behaviour. If that's the case, you might take a look at the simultaneous request setting; if it is set to the default, you might have enough headroom to increase it.
It could have been a spider creating lots and lots of sessions as it crawled the site, which would eat up memory for a period of time. Once the spider stopped crawling, those sessions would time out and be garbage collected.
I would compare your HTTP server logs with the JVM logs. Look at that time frame and see if there are a lot of requests from a search engine spider (Googlebot, msnbot, etc.).
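A quick-and-dirty way to get that count from the web server's access log - the file path, log format (user agent somewhere on the line), and bot names below are assumptions to adjust for your setup:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Counts access-log lines that mention a known crawler's user agent.
    public class SpiderHits {

        public static void main(String[] args) throws IOException {
            String[] bots = {"Googlebot", "msnbot", "bingbot", "Slurp"};
            Map<String, Integer> counts = new HashMap<String, Integer>();
            BufferedReader reader = new BufferedReader(new FileReader("access.log")); // illustrative path
            String line;
            while ((line = reader.readLine()) != null) {
                for (String bot : bots) {
                    if (line.contains(bot)) {
                        Integer n = counts.get(bot);
                        counts.put(bot, n == null ? 1 : n + 1);
                    }
                }
            }
            reader.close();
            System.out.println(counts);
        }
    }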
Fabio,
I had the same kind of issue a couple of months ago, where I was getting spikes at regular intervals and the server was eating up around 50% of CPU. I wrote up the full story at the URL below,
http://www.isummation.com/blog/strange-coldfusion-issue-jrun-eating-up-to-50-of-cpu/
which may help you (sorry it's so long).
I found that client variables being stored in the registry were causing the issue, and I was able to track it down with the help of VisualVM: I first found the thread causing the issue and then looked into its stack trace to pin down the solution.
The only thing that's really odd, IMO, is the sudden spike to so many threads. Capture thread dumps on a regular basis (jstack etc. are your friends) and then correlate those thread dumps with the spike shown in your monitoring.
The root problem will become more obvious once you understand what all the extra threads are doing. Perhaps it's more threads handling transactions, but it might be something else entirely.
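If scheduling jstack externally is awkward, an in-process alternative is to dump stacks periodically via ThreadMXBean; a minimal sketch (the one-minute interval and stdout destination are arbitrary choices):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.Date;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Prints a thread dump once a minute so spikes can be lined up with monitoring graphs.
    public class PeriodicThreadDump {

        public static void main(String[] args) {
            final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    System.out.println("=== thread dump @ " + new Date() + " ===");
                    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                        System.out.print(info); // ThreadInfo.toString() includes a (truncated) stack trace
                    }
                }
            }, 0, 60, TimeUnit.SECONDS);
        }
    }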
I'm developing a communication library based on NIO's non-blocking SocketChannels, so I can use select to keep my thread's CPU usage low (and get faster reaction times to other events).
SocketChannels are created externally to my thread and added to the list it handles; they are marked as non-blocking and registered with a Selector for READ operations (and WRITE when needed, but that does not happen in my problem).
I have a little Swing application for tests, running locally, that can act as either a client or a server: the client connects to the server and they can send each other messages. Pretty simple, and it works fine, except that the CPU tops out at 100% (50% for each JVM) as soon as the connection is established between client and server.
Running jvisualvm shows me that sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() uses 98% of the application time, counting only 3 method calls!
A forced stack trace shows it's blocking on a read operation on a FilterInputStream, on a Socket.
I'm a little puzzled, as I don't use RMI (though I can understand that NIO and RMI may share the "transport" code). I have seen a few similar questions, but each was specifically about RMI, which mine isn't. The answers I've seen say this ConnectionHandler.run() method is responsible for marshalling/unmarshalling, whereas here I get 100% CPU without any network traffic. I can only infer an active wait on the sockets, but that sounds odd, especially with non-blocking SocketChannels...
Any ideas would be greatly appreciated!
I tracked the CPU use down to select(int timeout), which returns 0 immediately regardless of the timeout value. My understanding of this method was that it would block until a selected operation pops up or the timeout is reached (as stated in the Javadoc).
However, I found this other StackOverflow post showing the same problem: the OP_CONNECT operation has to be cancelled once the connection is accepted.
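For illustration, a minimal client-side selector loop showing that fix - once finishConnect() succeeds, the interest set is switched from OP_CONNECT to OP_READ so select() goes back to blocking (the host/port below are made up):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    // Minimal non-blocking client: connect, then read, clearing OP_CONNECT once connected.
    public class ConnectThenRead {

        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            SocketChannel channel = SocketChannel.open();
            channel.configureBlocking(false);
            boolean connected = channel.connect(new InetSocketAddress("localhost", 9000)); // illustrative address
            channel.register(selector, connected ? SelectionKey.OP_READ : SelectionKey.OP_CONNECT);

            while (true) {
                if (selector.select(1000) == 0) {
                    continue; // timed out, nothing ready
                }
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isConnectable()) {
                        SocketChannel ch = (SocketChannel) key.channel();
                        if (ch.finishConnect()) {
                            // Drop OP_CONNECT from the interest set; leaving it registered is what
                            // makes select() keep waking up immediately and spin the CPU.
                            key.interestOps(SelectionKey.OP_READ);
                        }
                    }
                    if (key.isValid() && key.isReadable()) {
                        // ... read from the channel ...
                    }
                }
            }
        }
    }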
Many thanks to #Alexander and #EJP for the clarification about the OP_WRITE/OP_CONNECT similarities.
Regarding the RMI part, it was probably due to Eclipse run configurations.