Java Random Slowdowns on Mac OS cont'd

Java Random Slowdowns on Mac OS cont'd - java

I asked this question a few weeks ago, but I'm still having the problem and I have some new hints. The original question is here:
Java Random Slowdowns on Mac OS
Basically, I have a java application that splits a job into independent pieces and runs them in separate threads. The threads have no synchronization or shared memory items. The only resources they do share are data files on the hard disk, with each thread having an open file channel.
Most of the time it runs very fast, but occasionally it will run very slow for no apparent reason. If I attach a CPU profiler to it, then it will start running quickly again. If I take a CPU snapshot, it says its spending most of its time in "self time" in a function that doesn't do anything except check a few (unshared unsynchronized) booleans. I don't know how this could be accurate because 1, it makes no sense, and 2, attaching the profiler seems to knock the threads out of whatever mode they're in and fix the problem. Also, regardless of whether it runs fast or slow, it always finishes and gives the same output, and it never dips in total cpu usage (in this case ~1500%), implying that the threads aren't getting blocked.
I have tried different garbage collectors, different sizings the parts of the memory space, writing data output to non-raid drives, and putting all data output in threads separate the main worker threads.
Does anyone have any idea what kind of problem this could be? Could it be the operating system (OS X 10.6.2) ? I have not been able to duplicate it on a windows machine, but I don't have one with a similar hardware configuration.

It's probably a bit late to reply, but I could observe similar slowdowns using Random in Threads, related to a volatile variable used within java.util.Random - see How can assigning a variable result in a serious performance drop while the execution order is (nearly) untouched? for details. If the answer I got is correct (and it sounds pretty reasonable to me), the slowdown might be related to the in-memory-addresses of the volatile variables used within Random (Have a look at the answer of user 'irreputable' to my question, which explains the problem much better than I do here).
In case you're creating the Random-instances within the run-method of your Threads, you could simply try to turn them into object-variables and initialize them within the constructor of your Thread: This would most likely ensure that the volatile fields of your Random instances will end up in 'different areas' in RAM, which do not have to get synchronized between the processor cores.

How do you know it's running slow? How do you know that it runs quicker when CPU profiler is active? If you do the entire run under the profiler does it ever run slow? If you restrict the number of threads to one does it ever run slow?

Actually this is an interesting problem, im curious to know whats the problem.
First, in your previous question, you are saying you split the job between "multiple" processors. Are they physically multiple, like in multiple machines? or a multi core CPU?
Second, im not sure if Snow Leopard has something to do with it, but we know that SL introduced few new features in term of multi-processor machines. So there might be some problem with the VM on the new OS. Try to use another Java version, i know SL uses Java 6 by default. Try to use Java 5.
Third, did you try to make the Thread pool a little smaller, you are talking about 100 threads running at same time. Try to make them 20 or 40 for example. See if it makes difference.
Finally, i would be interested in seeing how you implemented the multi-threading solution. Small parts of the code will be good

Related

How do I begin to optimise my program [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I have a web server program written in java and my boss wants it to run faster.
Iv always been happy if it ran without error so efficiency is new to me.
I tried a profiler but it crashed my computer and turned out to be a dead opensource project.
I have no idea what I am doing except from reading a few questions on here. I see that re factoring code is the best option but Im not sure how to go about that and that i need a profiler to see what code to re factor.
So does anyone know of a free profiler that I can use ? Im using java and eclipse. if possible some instructions or a like to easy instruction would be great.
But what I really want if anyone can give it is a basic introduction to the subject so I can understand enough to go do in depth research on the subject to get the best results.
I am a complete beginner when it comes to optimising code and the subject seems very complex from what I have seen so far, any help with how to get started would be greatly appreciated.
Im new to java as well so saying things like check garbage collection would mean nothing to me, id need a more detailed explanation.
EDIT: the program uses tomcat for the networking. it connects to an SQL database. the main function is a polling loop which checks all attached devices on the network, reads events from them writes the event to the database and the performs the event functions.
I am trying to improve the polling loop. the program is heavily multithreaded and uses a lot of interfaces and proxies so it is hart to see were code goes the farther you get from the polling loop.
I hope this information helps you offer solutions. also I did not build it, I inherited the code.

First of all detect the bottlenecks. There is no point in optimizing a method from 500ms to 400ms when there is a method running for 5 seconds, when it should run for 100ms.
You can try using the VisualVM as a profiler, which is built-in in the JDK.

If you want a free profiler, use VisualVM when comes with Java. It is likely to be enough.
You should ask your boss exact what he would like to go faster. There is no point optimising random pieces of code he/she might not care about. (Its easily done)
You can also log key points in you task/request to determine what it spends the most time doing.

EDIT: the program uses tomcat for the networking. it connects to an
SQL database. the main function is a polling loop which checks all
attached devices on the network, reads events from them writes the
event to the database and the performs the event functions.
I am trying to improve the polling loop. the program is heavily
multithreaded and uses a lot of interfaces and proxies so it is hart
to see were code goes the farther you get from the polling loop
This sounds like you have a heavily I/O bound application. There really isn't much that you can do about that because I/O bound applications aren't inefficiently using the CPU--they're stuck waiting for I/O operations on other devices to complete.
FWIW, this scenario is actually why a lot of big companies are contemplating moving toward cheap, ARM-based solutions. They're wasting a lot of power and resources on powerful x86 CPUs that get underutilized while their code sits there waiting for a remote MySQL or Oracle server to finish doing its thing. With such an application, why throw more CPU than you need?

If your new to java then Optimization sounds like a bad idea. Its very easy to get wrong. Its not trivial to rewrite code and keep all the outputs the same while changing the inner workings.
Possibly have a look at your stored procedures and replace any IN statments with INNER JOIN. Thats a fairly low risk and high reward way of speeding thing up.

Start by identifying the time taken by various steps in your application (use logging to identify). Notice if there is anything unusual.
Step into each of these steps to see if there are any bottlenecks. Identify if something can be cached to save a db call. Identify if there is scope of parallelism by breaking down your tasks into independent units.
Hope you have some unit/ integration tests to ensure you don't accidentally break anything.

Measure (with a profiler - as others suggested, VisualVM is good) and locate the spots where your program spends most of its time.
Analyze the hot spots and try to improve their performance.
Measure again to verify that your changes had the expected effect.
If needed, repeat from step 1.

Start very simple.
Make a list of whats slow from a user perspective.
Try to do high level profiling yourself. Maybe an interceptor that prints the run time for your actions.
Then profile only those actions with Start time = System.currentTime...
This easy way could be a starting point into more advanced profiling and if your lucky it may fix your problems.

Before you start optimizing, you have to understand your workload, and you have to be able to recreate that workload. One easy way to do that is to log all requests, in production, with enough detail that you can recreate the requests in a development environment.
At the same time that you log your load, you can also log the performance of those requests: the time from the start of the request to the end. One way to do that (and, incidentally, to capture the data needed to log the request) is to add a servlet filter into your stack.
Then you can start to think about optimization.
Establish performance goals. Simply saying "make it faster" is pointless. Instead, you need to establish goals such as "all pages should respond within 1.5 seconds, as long as there are less than 100 concurrent users."
Identify the requests that fail your performance goals. Focus on the biggest failure first.
Identify why the request takes so long.
To do #3, you need to be able to recreate load in a development environment. Then you can either use a profiler, or simply add trace-level logging into your application to find out how long each step of the process takes.
There is also a whole field of holistic optimization, of which garbage collection tuning is probably the most important. But again, you need to establish and replicate your workload, otherwise you'll be flailing.

When starting to optimize an application, the main risk is to try to optimize every step, which does often not improve the program efficiency as expected and results in unmaintainable code.
It is likely that 80% of the execution time of your program is caused by a single step, which is itself only 20% of the code base.
The first thing to do is to identify this bottleneck. For example, you can log timestamps (using System.nanoTime and/or System.currentTimeMillis and you favorite logging framework) to do this.
Once the step has been identified, try to write a test class which runs this step, and run it with a profiler. I have good experience with both HPROF (http://java.sun.com/developer/technicalArticles/Programming/HPROF.html) although it might require some time to get familiar with, and Eclipse Test and Performance Tools Platform (http://www.eclipse.org/tptp/). If you have never used a profiler, I recommend you start with Eclipse TPTP.
The execution profile will help you find out in what methods your program spends time. Once you know them, look at the source code, and try to understand why it is slow. It might be because (this list is not exhaustive) :
unnecessary costly operations are performed,
a sub-optimal algorithm is used,
the algorithm generates lots of objects, thus giving a lot of work to the garbage collector (especially true for objects which have a medium to long life expectancy).
If there is no visible defect in the code, then you might consider :
making the algorithm more parallel in order to leverage all your CPUs
buying faster hardware.
Regarding JVM options, the two most important ones for performance are :
-server, in order to use the server VM (enabled by default depending on the hardware) which provides better performance at the price of a slower startup (http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client),
-Xms and -Xmx which define the heap size available on startup, and the maximum amount of memory that the JVM can use. If the JVM is not given enough memory, garbage collection will use a lot of your CPU resources, slowing down your program, however if the JVM already has enough memory, increasing the heap size will not improve performance, and might even cause longer GC pauses. (http://stackoverflow.com/questions/1043817/speed-tradeoff-of-javas-xms-and-xmx-options)
Other parameters usually have lower impact, you can consult them at http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html.

Why is my multithreaded Java program not maxing out all my cores on my machine?

I have a program that starts up and creates an in-memory data model and then creates a (command-line-specified) number of threads to run several string checking algorithms against an input set and that data model. The work is divided amongst the threads along the input set of strings, and then each thread iterates the same in-memory data model instance (which is never updated again, so there are no synchronization issues).
I'm running this on a Windows 2003 64-bit server with 2 quadcore processors, and from looking at Windows task Manager they aren't being maxed-out, (nor are they looking like they are being particularly taxed) when I run with 10 threads. Is this normal behaviour?
It appears that 7 threads all complete a similar amount of work in a similar amount of time, so would you recommend running with 7 threads instead?
Should I run it with more threads?...Although I assume this could be detrimental as the JVM will do more context switching between the threads.
Alternatively, should I run it with fewer threads?
Alternatively, what would be the best tool I could use to measure this?...Would a profiling tool help me out here - indeed, is one of the several profilers better at detecting bottlenecks (assuming I have one here) than the rest?
Note, the server is also running SQL Server 2005 (this may or may not be relevant), but nothing much is happening on that database when I am running my program.
Note also, the threads are only doing string matching, they aren't doing any I/O or database work or anything else they may need to wait on.

My guess would be that your app is bottlenecked on memory access, i.e. your CPU cores spend most of the time waiting for data to be read from main memory. I'm not sure how well profilers can diagnose this kind of problem (the profiler itself could influence the behaviour considerably). You could verify the guess by having your code repeat the operations it does many times on a very small data set.
If this guess is correct, the only thing you can do (other than getting a server with more memory bandwidth) is to try and increase the locality of your memory access to make better use of caches; but depending on the details of the application that may not be possible. Using more threads may in fact lead to worse performance because of cores sharing cache memory.

Without seeing the actual code, it's hard to give proper advice. But do make sure that the threads aren't locking on shared resources, since that would naturally prevent them all from working as efficiently as possible. Also, when you say they aren't doing any io, are they not reading an input or writing an output either? this could also be a bottleneck.
With regards to cpu intensive threads, it is normally not beneficial to run more threads than you have actual cores, but in an uncontrolled environment like this with other big apps running at the same time, you are probably better off simply testing your way to the optimal number of threads.

Doesn't the fact that Go and Java use User space thread mean that you can't really take advantage of multiple core?

We've been talking about threads in my operating system class a lot lately and one question has come to my mind.
Since Go, (and Java) uses User-space thread instead of kernel threads, doesn't that mean that you can't effectively take advantages of multiple cores since the OS only allocates CPU time to the process and not the threads themselves?
This seems to confirm the fact that you can't
Wikipedia also seems to think so

What makes you think Go uses User-space threads?
It doesn't. It uses OS-threads and can take advantage of multiple cores.
You might be puzzled by the fact that by default Go only uses 1 thread to run your program. If you start two goroutines they run in one thread. But if one goroutine blocks for I/O Go creates a second thread and continues to run the other goroutine on the new thread.
If you really want to unlock the full multi-core power just use the GOMAXPROCS() function.
runtime.GOMAXPROCS(4); //somewhere in main
Now your program would use 4 OS-threads (instead of 1) and would be able to fully use a e.g. 4 core system.

Most recent versions of Java to use OS threads, although there is not necessarily a one-to-one mapping with Java threads. Java clearly does work quite nicely across many hardware threads.

I presume that by "user-space threads" you mean (for example) Go's goroutines.
It is true that using goroutines for concurrency is less efficient than designing (by hand and by scientific calculations) a special-purpose algorithm for assigning work units to OS threads.
However: Every Go program is situated in an environment and is designed to solve a particular problem. A new goroutine can be started for each request that the environment is making to the Go program. If the environment is making concurrent requests to the Go program, a Go program using goroutines might be able to run faster than a serial program even if the Go program is using just 1 OS thread. The reason why goroutines might be able to process requests with greater speed (even when using just 1 OS thread) is that the Go program will automatically switch from goroutine A to goroutine B when the part of environment which is associated with A is momentarily unable to respond.
But yes, it is true that using goroutines and automatically assigning them to multiple OS threads is less efficient than designing (by hand and by scientific calculations) a special-purpose algorithm for assigning work units to OS threads.

Java Random Slowdowns on Mac OS

I have a Java program for doing a set of scientific calculations across multiple processors by breaking it into pieces and running each piece in a different thread. The problem is trivially partitionable so there's no contention or communication between the threads. The only common data they access are some shared static caches that don't need to have their access synchronized, and some data files on the hard drive. The threads are also continuously writing to the disk, but to separate files.
My problem is that sometimes when I run the program I get very good speed, and sometimes when I run the exact same thing it runs very slowly. If I see it running slowly and ctrl-C and restart it, it will usually start running fast again. It seems to set itself into either slow mode or fast mode early on in the run and never switches between modes.
I have hooked it up to jconsole and it doesn't seem to be a memory problem. When I have caught it running slowly, I've tried connecting a profiler to it but the profiler won't connect. I've tried running with -Xprof but the dumps between a slow run and fast run don't seem to be much different. I have tried using different garbage collectors and different sizings of the various parts of the memory space, also.
My machine is a mac pro with striped RAID partition. The cpu usage never drops off whether its running slowly or quickly, which you would expect if threads were spending too much time blocking on reads from the disk, so I don't think it could be a disk read problem.
My question is, what types of problems with my code could cause this? Or could this be an OS problem? I haven't been able to duplicate it on a windows a machine, but I don't have a windows machine with a similar RAID setup.

You might have thread that have gone into an endless loop.
Try connecting with VisualVM and use the Thread monitor.
https://visualvm.dev.java.net
You may have to connect before the problem occurs.

I second that you should be doing it with a profiler looking at the threads view - how many threads, what states are they in, etc. It might be an odd race condition happening every now and then. It could also be the case that instrumenting the classes with profiler hooks (which causes slowdown), sortes the race condition out and you will see no slowdown with the profiler attached :/
Please have a look at this post, or rather the answer, where there is Cache contention problem mentioned.
Are you spawning the same umber of threads each time? Is that number less or equal the number of threads available on your platform? That number could be checked or guestimated with a fair accuracy.
Please post any finidngs!

Do you have a tool to measure CPU temperature? The OS might be throttling the CPU to deal with temperature issues.

Is it possible that your program is being paged to disk sometimes? In this case, you will need to look at the memory usage of the operating system as whole, rather than just your program. I know from experience there is a huge difference in runtime performance when memory is being continually paged to the disk and back.
I don't know much about OSX, but in linux the "free" command is useful for this purpose.
Another issue that might cause this slowdown is log files? I've known at least some logging code that slowed down the system incrementally as the log files grew. It's possible that your threads are synchronizing on a log file which is growing in size, then when you restart your program, another log file is used.

Parallelization: What causes Java threads to block other than synchronization & I/O?

Short version is in the title.
Long version:
I am working on a program for scientific optimization using Java. The workload of the program can be divided into parallel and serial phases -- parallel phases meaning that highly parallelizable work is being performed. To speed up the program (it runs for hours/days) I create a number of threads equal to the number of CPU cores on the machine I'm using -- typically 4 or 8 -- and divide the work between them. I then start these threads and join() them before proceeding to a serial phase.
So far so good. What's bothering me is that the CPU utilization and speedup of the parallel phases is nowhere near the "theoretical maximum" -- e.g. if I have 4 cores, I expect to see somewhere between 350-400% "utilization" (as reported by top) but instead it bounces around between 180 and about 310. Using only a single thread, I get 100% CPU utilization.
The only reasons I know of for threads not to run at full speed are:
-blocking due to I/O
-blocking due to synchronization
No I/O whatsoever is going on in my parallel threads, nor any synchronization -- the only data structures shared by the threads are read-only, and are either basic types or (non-concurrent) collections. So I'm looking for other explanations. One possibility would be that several threads are repeatedly blocking for garbage collection, but that would only seem to make sense in a situation with memory pressure, and I am allocating well above the required maximum heap space.
Any suggestions would be appreciated.
Update: Just in case anyone is curious, after some more investigation I tweaked the code for general performance and am seeing better utilization, even though nothing I changed has to do with synchronization. However, some of the changes should have resulted in fewer new heap allocations in particular I got rid of some use of iterators and termporary boxed numbers (The CERN "Colt" library for high-performance Java computing was useful here: it provides collections like IntArrayList, DoubleArrayList etc for basic types.). So I think garbage collection was probably the culprit.

All graphics operations run on a single thread in swing. If they are rendering to the screen they will effectively be contending for access to this thread.
If you are running on Windows, all graphics operations run on a single thread no matter what. Other operating systems have similar limitations.
It's actually fairly difficult to get the proper granularity of threaded workers sometimes, and sometimes it's easy to make them too big or too small, which will typically give you less than 100% usage of all cores.
If you're not rendering much gui, the most likely culprit is that you're contending more than you think for some shared resource. This is easily seen with profiler tools like jprofiler. Some VM's like bea's jrockit can even tell you this straight out of the box.
This is one of those places where you dont want to act on guesswork. Get a profiler!

First of all, GC will not happen only "in situation with memory pressure", but at any time the JVM sees fit (unpredictable, as far as I know).
Second, if your threads allocate memory in the heap (you mention they use Collections so I guess they do assign memory in the heap), you can never be sure if this memory is currently in RAM or on a Virtual Memory page (the OS decides), and thus access to "memory" may generate blocking I/O access!
Finally, as suggested in a prior answer, you may find it useful to check what happens by using a profiler (or even JMX monitoring might give some hints there).
I believe it will be difficult to get further hints on your problem unless you provide more concrete (code) information.

Firstly, I assume you're not doing any other significant work on the box. If you are, that's clearly going to mess with things.
It does sound very odd if you're really not sharing anything. Can you give us more idea of what the code is really doing?
What happens if you run n copies of the program as different Java processes, with each only using a single thread? If that uses each CPU completely, then at least we know that it can't be a problem with the OS. Speaking of the OS, which one is this running on, and which JVM? If you can try different JVMs and different OSes, the results might give you a hint as to what's wrong.

Also an important point: Which Hardware do you use?
E.g. 4-8 Cores could mean you work on one of Suns Niagara CPUs. And despite having 4-8 Cores they have less FPUs. When computing scientific stuff it could happen, that the FPU is the bottleneck.

You try to use the full CPU capability for your calculations but the OS itself uses resources as well. So be aware that the OS will block some of your execution in order to satisfy its needs.

You are doing synchronization at some level.
Perhaps only in the memory allocation system, including garbage collection. While the JVM vendor has worked to keep blocking in these areas to a minimum, they can't reduce it to zero. Perhaps something about your application is pushing at a weak point in this area.
The accepted wisdom is "don't build your own memory reclaiming pool, let the GC work for you". This is true most of the time but not in at least one piece of code I maintain (proven with profiling). Perhaps you need to rework your Object allocation in some major way.

Try the latency analyzer that comes with JRockit Mission Control. It will show you what the CPU is doing when it's not doing anything, if the application is waiting for file I/O, TLA-fetches, object allocations, thread suspension, JVM-locks, gc-pauses etc. You can also see transitions, e.g. when one thread wakes up another. The overhead is negligible, 1% or so.
See this blog for more info. The tool is free to use for development and you can download it here

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.