I'm using Java and RMI to execute 100k Monte Carlo simulations on a cluster of hundreds of cores.
The approach I'm using is to have a client app that invokes RMI processes and divides the simulations among the available (RMI) processes on the grid.
Once the simulations have run, I have to re-aggregate the results.
The only limit I have is that all this has to happen in less than 500ms.
The process is actually in place, BUT randomly, from time to time, one of the RMI calls takes 200ms longer to execute.
I've added loads of logs and timings all over the place, and I've already ruled out the following as possible causes:
1) Simulations taking extra time
2) Data transfer (it consistently works; the slowdown only shows up sometimes, and only on a subset of the RMI calls)
3) Transferring results back (I can clearly time how long it takes from when the last RMI call returns to the end of the process)
The only thing I cannot measure is WHETHER any of the RMI calls takes extra time to be initialized (and honestly that is the only thing I can imagine). The reason is that -unfortunately- the clocks are not synchronized :(
Is it possible that the RMI remote process gets passivated/detached/collected even though I keep a (Remote) reference to it from the client?
Hope the question is clear enough (I'm pretty much sure it isn't).
Thanks a mil, and do not hesitate to ask more questions if it is not clear enough.
Regards,
Giovanni
Is it possible that the RMI remote process gets passivated/detached/collected even though I keep a (Remote) reference to it from the client?
Unlikely, but possible. The RMI remote process should not be collected (as the RMI FAQ indicates for VM exit conditions). It could, however, be paged to disk if the OS desired.
Do you have a way to rule out GC calls (other than writing a monitor with JVM TI)?
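For example, short of writing a JVM TI agent, you could run with -verbose:gc or snapshot the collector MBeans around each batch of calls. A minimal sketch (the class name is just illustrative):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Capture the cumulative GC count and time; compare a snapshot taken before a
// batch of RMI calls with one taken after. If they differ on a "slow" run, GC
// is a likely suspect.
public final class GcSnapshot {
    public static String capture() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(" count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append("; ");
        }
        return sb.toString();
    }
}
```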
Also, is your code structured in such a way that you send off all calls from your aggregator asynchronously, have the replies append to a list, and aggregate the results when your critical time is up, even if some processors haven't returned results? I'm assuming that each processor is an independent, random event and that it's safe to ignore some results. If not, disregard.
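A rough sketch of what I mean, assuming each worker can be wrapped in a Callable (the names and the 500ms budget are placeholders for your own stubs and limit):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public final class DeadlineAggregator {
    // 'tasks' stands in for your RMI calls, e.g. () -> stub.runSimulations(n).
    static List<double[]> collectWithin(List<Callable<double[]>> tasks, long budgetMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
        CompletionService<double[]> results = new ExecutorCompletionService<double[]>(pool);
        for (Callable<double[]> task : tasks) {
            results.submit(task);                       // fire all calls in parallel
        }
        List<double[]> partials = new ArrayList<double[]>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMs);
        for (int i = 0; i < tasks.size(); i++) {
            long left = Math.max(deadline - System.nanoTime(), 0);
            Future<double[]> done = results.poll(left, TimeUnit.NANOSECONDS);
            if (done == null) {
                break;                                  // budget exhausted: aggregate what we have
            }
            partials.add(done.get());
        }
        pool.shutdownNow();                             // abandon any stragglers
        return partials;
    }
}
```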
I finally tracked down the issue. After making sure that the stub wasn't getting deallocated and that GC wasn't being triggered behind the scenes, I used Wireshark to see whether there was any network issue.
What I found was that, randomly, one of the packets got lost and TCP needed 120ms on our network (41 retransmissions) to correctly retransfer the data.
When we switched to JDK 7, SDP and InfiniBand, we didn't experience the issue anymore.
So basically the answer to my question was... PACKET LOSS!
Thanks to those who replied to the post; it helped me focus on the right path!
Gio
Since last month, we've had a problem on our company's server (Win2008ServerStd + IIS7 + CF Enterprise 9.0.1 (hotfix 2)).
I used jConsole to monitor the Coldfusion JVM (1.6.0_24) activity and here's what I see:
Notice that strange "curve" between 14:10 and 14:15! What is that?
Obviously it's not standard behaviour; when it happens, my applications hang for 30 to 70 seconds!
Do you know what can cause that memory issue? It seems like GC does not run correctly, or hangs itself.
I don't expect an instant answer, and I imagine a lot of root problems could cause this, but... where can I start investigating?
Using cfstat, perfmon, FusionReactor, or the CF performance monitor, take a look at running and queued requests during your problem. What you will likely see is running requests climbing past the simultaneous requests setting (in the CF admin). Then the requests will start to queue. Eventually the queue will clear out (if your server is recovering on its own).
This sort of thing can be caused by a number of things. For example, if your DB server slows down or has an issue, if your network has a problem, or if network ports are resyncing, if your disks have I/O problems etc.
My guess is that you will drive yourself batty trying to figure this out by monitoring your heap. See if you can watch one of the monitors for some specific scripts that might be the culprit.
The other comment (about some indexing agents) is also a possibility. A flurry of indexing can definitely cause this behavior. If that's the case, you might take a look at the simultaneous request settings. If they are set at the default, you might have enough headroom to increase them.
It could have been a spider creating lots and lots of sessions as it crawled the site, which would eat up memory for a period of time. Once the spider stopped crawling, those sessions would time out and be garbage collected.
I would compare your HTTP server logs w/ the JVM logs. Compare that time frame and see if there are a lot of requests from a search engine spider (Googlebot, msnbot, etc).
Fabio,
I had the same kind of issue a couple of months ago, where I was getting spikes at regular intervals and the server was eating up around 50% of CPU usage. I wrote up the full story at the URL below,
http://www.isummation.com/blog/strange-coldfusion-issue-jrun-eating-up-to-50-of-cpu/
which may help you (sorry it's so long).
I found that client variables being stored in the registry were causing the issue, and I was able to catch it with the help of VisualVM: I first found the thread causing the issue and then looked into its stack trace to find the exact solution.
The only thing that's really odd IMO is the sudden spike to having so many threads. Capture a thread dump on a regular basis (jstack, etc., are your friends) and then correlate those thread dumps with your monitoring where it shows the spike.
The root problem will become more obvious once you understand what all the extra threads are doing. Perhaps it's more threads handling transactions, but it might be something else entirely.
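If scripting jstack against the ColdFusion process is awkward, a sketch of an in-process equivalent would be to dump all threads on a schedule and line the timestamps up with the spike (a standalone illustration, not CF-specific):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Print every thread's state and top stack frames every 10 seconds so the
// timestamps can be correlated with the spike in the monitoring graphs.
public final class ThreadDumper {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while (true) {
            System.out.println("=== dump at " + System.currentTimeMillis() + " ===");
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.print(info);   // ThreadInfo.toString() includes state and a partial stack
            }
            Thread.sleep(10000);
        }
    }
}
```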
I'm developing a communication library based on NIO's non-blocking SocketChannels, so I can use select to keep my thread's CPU usage low (and get faster reaction times to other events).
SocketChannels are created externally to my thread and added to the list it handles; I mark them as non-blocking and register them with a Selector for READ operations (and WRITE when needed, but that does not happen in my problem).
I have a little Swing application for tests, running locally, that can act as either a client or a server: the client connects to the server and they can send each other messages. Pretty simple and it works fine, except that the CPU tops 100% (50% for each JVM) as soon as the connection is established between client and server.
Running jvisualvm shows me that sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() uses 98% of the application time, counting only 3 method calls!
A forced stack trace shows it's blocking on the read operation on a FilteredInputStream, on a Socket.
I'm a little puzzled, as I don't use RMI (though I can understand NIO and RMI may share the "transport" code part). I have seen a few similar questions, but each was specifically about RMI, which I'm not using. The answers I've seen say that this ConnectionHandler.run() method is responsible for marshalling/unmarshalling things, whereas here I get 100% CPU without any network traffic. I can only infer an active wait on the sockets, but that sounds odd, especially with non-blocking SocketChannels...
Any idea would be greatly appreciated!
I tracked the CPU use down to select(long timeout), which returns 0 immediately, regardless of the timeout value. My understanding of this method was that it would block until a selected operation pops up or the timeout is reached (as stated in the Javadoc).
However, I found another StackOverflow post showing the same problem: the OP_CONNECT operation has to be cancelled once the connection is established.
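For reference, the fix boils down to something like this in the select loop (a sketch adapted from my code; the method name is mine):

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

public final class ConnectHelper {
    // Called from the select loop for each ready key: once the pending connection
    // completes, replace the OP_CONNECT interest with OP_READ; otherwise select()
    // keeps returning immediately for that key and the loop spins at 100% CPU.
    static void finishConnectIfNeeded(SelectionKey key) throws IOException {
        if (key.isConnectable()) {
            SocketChannel channel = (SocketChannel) key.channel();
            if (channel.finishConnect()) {
                key.interestOps(SelectionKey.OP_READ);
            }
        }
    }
}
```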
Many thanks to @Alexander and @EJP for the clarification about the OP_WRITE/OP_CONNECT similarities.
Regarding the RMI part, it was probably due to Eclipse run configurations.
I'm working on a multi-user Java webapp, where it is possible for clients to use the webapp API to do potentially naughty things, by passing code which will execute on our server in a sandbox.
For example, it is possible for a client to write a tight while(true) loop that impacts the performance of other clients.
Can you guys think of ways to limit the damage caused by these sorts of behaviors to other clients' performance?
We are using Glassfish for our application server.
The halting problem shows that there is no way a computer can reliably identify code that will not terminate.
The only way to do this reliably is to execute the code in a separate JVM, which you then ask the operating system to shut down when it times out. A JVM that does not time out can process more tasks, so you can just reuse it.
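A rough sketch of that approach (the heap limit, timeout and class name are placeholders), launching the sandbox JVM with ProcessBuilder and killing it if it overruns (requires Java 8 for waitFor with a timeout):

```java
import java.util.concurrent.TimeUnit;

public final class SandboxRunner {
    // Run the untrusted task in its own JVM and kill the whole process if it
    // does not finish within the budget; a runaway loop cannot touch the parent.
    static int runSandboxed(String taskMainClass, long timeoutSeconds) throws Exception {
        Process jvm = new ProcessBuilder("java", "-Xmx64m", taskMainClass)
                .inheritIO()
                .start();
        if (!jvm.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            jvm.destroyForcibly();   // hard kill; the OS reclaims everything
            return -1;               // signal a timeout to the caller
        }
        return jvm.exitValue();
    }
}
```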
One more idea would be byte-code instrumentation: before you load the code sent by your client, manipulate it so that it adds a short sleep in every loop and for every method call (or method entry).
This prevents clients from hogging a whole CPU until they are done. Of course, they still block a Thread object (which takes some memory), and the slowdown applies to every client, not only the malicious ones. Maybe make the first few tries free, then scale the waiting time up with each try (and scale it down again if the thread has to wait for other reasons).
Modern app servers use thread pooling for better performance. The problem is that one bad apple can spoil the bunch. What you need is an app server with one thread, or maybe one process, per request. Of course there are going to be trade-offs, but the OS will handle making sure that processing time gets allocated evenly.
NOTE: After researching a little more, what you need is an engine that will create another process per request. If not, a user can cripple your servlet engine by writing servlets with infinite loops and then posting multiple requests. Or he could simply call System.exit in his code and bring everybody down.
You could use a parent thread to launch each request in a separate thread, as suggested already, but then monitor the CPU time used by the threads using the ThreadMXBean class. You could then have the parent thread kill any threads that are misbehaving. This assumes, of course, that you can establish some kind of reasonable criterion for how much CPU time a thread should or should not be using. Maybe the rule could be that a certain initial amount of time plus a certain additional amount per second of wall-clock time is OK?
I would make these client request threads have lower priority than the thread responsible for monitoring them.
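A minimal sketch of that watchdog idea (the one-second poll and the CPU budget are arbitrary; note that interrupt() only works if the client code ever checks for it, which is why the separate-JVM suggestion above is the only truly reliable option):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Watchdog that interrupts any watched thread whose consumed CPU time exceeds
// the budget. Run it at a higher priority than the client request threads.
public final class CpuWatchdog implements Runnable {
    private final ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    private final Set<Thread> watched = ConcurrentHashMap.newKeySet();
    private final long cpuBudgetNanos;

    public CpuWatchdog(long cpuBudgetMillis) {
        this.cpuBudgetNanos = TimeUnit.MILLISECONDS.toNanos(cpuBudgetMillis);
    }

    public void watch(Thread clientThread) {
        watched.add(clientThread);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            for (Thread t : watched) {
                long cpu = mx.getThreadCpuTime(t.getId());   // -1 if the thread has died
                if (cpu > cpuBudgetNanos) {
                    t.interrupt();                           // ask the misbehaving thread to stop
                    watched.remove(t);
                }
            }
            try {
                Thread.sleep(1000);                          // poll once per second
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();          // let the loop exit
            }
        }
    }
}
```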
My Java application's purpose is to provide reference data (basically it loads lots of data from XML files into HashMaps); we then request one such piece of data from a HashMap by ID, and we have multiple such maps for different sets of business data. The problem is that when I execute the same request against the application multiple times, the response times differ: 31ms, 48ms, 72ms, 120ms, 63ms, etc. There is a considerable gap between the minimum and maximum execution time. Ideally I would expect response times like 63ms, 65ms, 61ms, 70ms, 61ms, but in my case the response time for the same request varies hugely. I used an open-source profiler to check whether there was any extra method execution or a memory leak, but as far as I can tell there was no problem. Please let me know what the reasons could be and how I can address this problem.
There could be many causes:
Is your Java application restarted for each run? If not, the garbage collector could be kicking in at an unfortunate time. If it is restarted each time, the JVM startup time could be responsible for the variations.
Is anything else running on that machine?
Is the disk cache "warmed up" in some cases, but not in others? That is, have the files been recently accessed so that they are still in memory?
If this is a networked application, is there any network activity during the measurements?
If there is a remote machine involved (e.g. a database server or a file server), do the above apply to that machine as well?
Use a profiler to find out which piece of code is responsible for the variations in time.
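One quick check before going deeper (a sketch; the Runnable is a stand-in for your reference-data lookup) is to time the same request repeatedly in a single JVM run; if only the first few iterations are slow, the variation is JVM startup and JIT warm-up rather than your code:

```java
import java.util.concurrent.TimeUnit;

public final class TimingHarness {
    // Time the same lookup repeatedly; a slow start followed by stable numbers
    // points at warm-up effects, while random spikes point elsewhere (GC, OS, IO).
    static void measure(Runnable request, int iterations) {
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            request.run();                               // stand-in for the HashMap lookup by ID
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println("iteration " + i + ": " + elapsedMs + " ms");
        }
    }
}
```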
If you are not running a real-time system, you can't be sure it will execute within a certain time.
OSes constantly do other things, mostly housekeeping tasks, and provide other services to the system. This can easily slow down the rest of your system by 50ms.
There can also be time spent waiting for I/O, such as hard disks or network communication.
Besides that, your JVM doesn't make any real-time promises. This means the garbage collector may run. Its effect is very small on a normal application, but it can be large if you create and discard lots of objects (as you might when loading many or large files).
Finally, it could be your algorithm (do you run the same data each time?): with different data, you can get different execution times.
We are doing some Java stress runs (involving network IO). Initially things are all fine and the system responds very fast (average latency in the test is 2ms). But hours later, when I redo the same test, I observe that the performance goes down (20-60ms). It's the same JAR files, the same JVM, and the same LAN over which the stress is running. I don't understand the reason for this behavior.
The LAN is 1 Gbps, and given the stress requirements I'm sure we are not using all of it.
So my questions:
Could it be because of some switches in the LAN?
Does the machine slow down after some time? (The machines were restarted... say about 6 months ago, well before the stress runs started; they are RHEL5, 64-bit quad-core Xeon.)
What is the general way to debug such issues?
A few questions...
How much of the environment is under your control and are you putting any measures in place to ensure it's consistent for each run? i.e. are you sharing the network with other systems, is the machine you're using being used solely for your stress testing?
The way I'd look at this is to start gathering details on what your machine and code are up to. That means using perfmon (Windows) or sar (Unix) to find out what the OS and hardware are doing, and getting a profiler attached to make sure your code is doing the same thing each time and to help pinpoint where the bottleneck is occurring from a code perspective.
Nothing terribly detailed, but something I hope will help get you started.
The general way is "measure everything". This, in particular, might mean:
Ensure the time on all servers is the same (use ntp or something similar);
Measure how long it took to generate the request (what if the request generator has a bug?);
Measure when the request left the client machine(s), or at least how long the I/O took. Sometimes it is enough to know the average time across many requests;
Measure when the request arrived;
Measure how long it took to generate a response;
Measure how long it took to send the response.
You can probably start from the 5th element, as this is (you believe) your critical chain. But it is best to log as much as you can - as according to what you've said yourself, it takes days to produce different results.
If you don't want to modify your code, look for cases where you can sniff data without intervening (e.g. define a servlet filter in your web.xml).
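For instance, a timing filter along these lines (a sketch, registered via web.xml; the class name is illustrative) records per-request latency on the server without touching the application code, which splits a slow run into server-side time versus network/client time:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Logs how long each request spends inside the container.
public final class TimingFilter implements Filter {
    @Override
    public void init(FilterConfig config) {
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long start = System.nanoTime();
        try {
            chain.doFilter(request, response);           // let the request proceed as usual
        } finally {
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("request handled in " + micros + " us");
        }
    }

    @Override
    public void destroy() {
    }
}
```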