I have a multithreaded file converter that I'm working on. On Windows, it puts each file that's being converted in its own thread and uses 100% CPU (on all cores) all the time. It's awesome! On Ubuntu, I get 100% on the first core and ~10% on all the rest. The performance is poor and disappointing.
I'm using Threads, all within a SwingWorker so I don't freeze the GUI, and I call Thread.join on all the threads so I can perform a certain task once every thread is complete. I have not changed the code between OSes. Is there a feasible way to fix this?
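For reference, the structure looks roughly like this (a minimal sketch, not my actual code; filesToConvert and convertFile stand in for the real conversion logic):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import javax.swing.SwingWorker;

    public class ConverterWorker extends SwingWorker<Void, Void> {
        private final List<File> filesToConvert; // placeholder: the files queued in the GUI

        public ConverterWorker(List<File> filesToConvert) {
            this.filesToConvert = filesToConvert;
        }

        @Override
        protected Void doInBackground() throws InterruptedException {
            List<Thread> threads = new ArrayList<>();
            for (File f : filesToConvert) {
                Thread t = new Thread(() -> convertFile(f)); // one thread per file
                t.start();
                threads.add(t);
            }
            for (Thread t : threads) {
                t.join(); // block until every conversion has finished
            }
            return null;
        }

        @Override
        protected void done() {
            // runs on the EDT once all conversions are complete
        }

        private void convertFile(File f) {
            // placeholder for the real conversion logic
        }
    }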
It was very dumb and I don't quite understand why, but shortly after I posted this, I transferred all of my files to my Ubuntu partition, and now it's just as fast (if not faster) than on Windows. Not sure why moving files would speed things up. Perhaps my real issue was that, since the files sat on a different file system, my bottleneck was I/O: converting just one file from the NTFS partition took 3x longer than after moving it to the ext4 partition. (And yes, these are all on the same SSD.)
Related
I have a very large set of text files. The task was to calculate the document frequencies (the number of documents that contain a certain term) for all the unique terms in this huge corpus. Simply starting from the first file and calculating everything serially seemed like a dumb thing to do (I admit I did it just to see how disastrous it is).
I realized that if I do this calculation in a Map-Reduce manner, meaning clustering my data into smaller pieces and in the end aggregating the results, I would get the results much faster.
My PC has 4 cores, so I decided to split my data into 3 distinct subsets, feed each subset to a separate thread, wait for all the threads to finish their work, and pass their results to another method that aggregates everything.
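Roughly, the structure looks like this (a simplified sketch of my approach; readTerms is a stand-in for the real tokenizing code):

    import java.util.*;
    import java.util.concurrent.*;

    public class DocFrequency {
        // Per subset: count how many documents contain each term.
        static Map<String, Integer> countSubset(List<String> files) {
            Map<String, Integer> df = new HashMap<>();
            for (String file : files) {
                for (String term : readTerms(file)) {
                    df.merge(term, 1, Integer::sum);
                }
            }
            return df;
        }

        public static void main(String[] args) throws Exception {
            List<String> allFiles = Arrays.asList(/* ... corpus file paths ... */);
            int parts = 3;
            ExecutorService pool = Executors.newFixedThreadPool(parts);
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            int chunk = (allFiles.size() + parts - 1) / parts;
            for (int i = 0; i < allFiles.size(); i += chunk) {
                List<String> subset = allFiles.subList(i, Math.min(i + chunk, allFiles.size()));
                futures.add(pool.submit(() -> countSubset(subset)));
            }
            Map<String, Integer> total = new HashMap<>();
            for (Future<Map<String, Integer>> f : futures) {
                f.get().forEach((term, n) -> total.merge(term, n, Integer::sum)); // aggregate
            }
            pool.shutdown();
        }

        static Set<String> readTerms(String file) {
            return Collections.emptySet(); // stand-in: return the unique terms of one document
        }
    }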
I tested it with a very small data set and it worked fine. Before using the actual data, I tried a larger set so I could study its behaviour better. I started jvisualvm and htop to watch how the CPU and memory were being used. I can see that 3 threads are running and the CPU cores are busy, but the usage of these cores is rarely above 50%. This means that my application is not really using the full power of my PC. Is this related to my code, or is this how it is supposed to be? My expectation was that each thread would use as much of a CPU core as possible.
I use Ubuntu.
Sounds to me like you have an IO-bound application. Your individual threads are spending more time reading the data from the disk than actually processing the information that is read.
You can test this by migrating your program to a system with an SSD to see if the CPU behaviour changes. You can also read in all of the files first and then process them, to see whether that changes the CPU curve during processing time. I suspect it will.
As already stated, you're bottlenecked by something, probably disk IO. Try separating the code that reads from disk from the code that processes the data, and use separate thread pools for each. Afterwards, a good way to quickly scale your thread pools to properly fit your resources is to use one of the Executors factory methods.
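A rough sketch of that separation, assuming a bounded queue that hands raw file contents from a small IO pool to a core-sized CPU pool (process is a placeholder for your real work):

    import java.nio.file.*;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.*;

    public class PipelinedReader {
        public static void main(String[] args) throws Exception {
            List<Path> files = Arrays.asList(/* ... files to process ... */);
            // Bounded handoff: the readers can't race ahead of the processors.
            BlockingQueue<byte[]> handoff = new LinkedBlockingQueue<>(100);

            ExecutorService ioPool = Executors.newFixedThreadPool(2); // few threads: the disk is serial anyway
            ExecutorService cpuPool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());      // sized to the cores

            for (Path file : files) {
                ioPool.submit(() -> {
                    handoff.put(Files.readAllBytes(file)); // read on the IO pool
                    return null;
                });
            }
            for (int i = 0; i < files.size(); i++) {
                cpuPool.submit(() -> {
                    process(handoff.take());               // crunch on the CPU pool
                    return null;
                });
            }
            ioPool.shutdown();
            cpuPool.shutdown();
        }

        static void process(byte[] content) { /* placeholder for the CPU-heavy work */ }
    }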
You are IO-bound for a problem like this on a single machine, not CPU-bound. Are you actively reading the files? Only if you had all the files in memory would you start to saturate the CPU. That is why map-reduce is effective: it scales total IO throughput more than CPU.
You could possibly speed this up quite a bit if you are on Linux by using tmpfs to keep the data in memory instead of on disk.
I asked this question a few weeks ago, but I'm still having the problem and I have some new hints. The original question is here:
Java Random Slowdowns on Mac OS
Basically, I have a java application that splits a job into independent pieces and runs them in separate threads. The threads have no synchronization or shared memory items. The only resources they do share are data files on the hard disk, with each thread having an open file channel.
Most of the time it runs very fast, but occasionally it will run very slowly for no apparent reason. If I attach a CPU profiler to it, it starts running quickly again. If I take a CPU snapshot, it says it's spending most of its time in "self time" in a function that does nothing except check a few (unshared, unsynchronized) booleans. I don't know how this could be accurate, because (1) it makes no sense, and (2) attaching the profiler seems to knock the threads out of whatever mode they're in and fixes the problem. Also, regardless of whether it runs fast or slow, it always finishes and gives the same output, and total CPU usage never dips (in this case ~1500%), implying that the threads aren't getting blocked.
I have tried different garbage collectors, different sizings of the parts of the memory space, writing data output to non-RAID drives, and putting all data output in threads separate from the main worker threads.
Does anyone have any idea what kind of problem this could be? Could it be the operating system (OS X 10.6.2)? I have not been able to duplicate it on a Windows machine, but I don't have one with a similar hardware configuration.
It's probably a bit late to reply, but I observed similar slowdowns using Random in threads, related to a volatile variable used within java.util.Random; see "How can assigning a variable result in a serious performance drop while the execution order is (nearly) untouched?" for details. If the answer I got is correct (and it sounds pretty reasonable to me), the slowdown might be related to the in-memory addresses of the volatile variables used within Random (have a look at the answer by user 'irreputable' to my question, which explains the problem much better than I do here).
In case you're creating the Random instances within the run method of your Threads, you could simply try turning them into object variables and initializing them within the constructor of your Thread: this would most likely ensure that the volatile fields of your Random instances end up in 'different areas' of RAM, which then do not have to be synchronized between the processor cores.
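In code, the suggested change is roughly this (a sketch; the class and field names are arbitrary):

    import java.util.Random;

    public class WorkerThread extends Thread {
        private final Random random;

        public WorkerThread() {
            this.random = new Random(); // allocated per thread, up front, in the constructor
        }

        @Override
        public void run() {
            // use this.random here instead of doing 'new Random()' inside run()
            double x = random.nextDouble();
        }
    }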
How do you know it's running slow? How do you know that it runs quicker when CPU profiler is active? If you do the entire run under the profiler does it ever run slow? If you restrict the number of threads to one does it ever run slow?
Actually this is an interesting problem; I'm curious to know what the problem is.
First, in your previous question you say you split the job between "multiple" processors. Are they physically multiple, as in multiple machines, or a multi-core CPU?
Second, I'm not sure if Snow Leopard has something to do with it, but we know that SL introduced a few new features for multi-processor machines. So there might be some problem with the VM on the new OS. Try another Java version; I know SL uses Java 6 by default, so try Java 5.
Third, did you try making the thread pool a little smaller? You are talking about 100 threads running at the same time. Try 20 or 40, for example, and see if it makes a difference.
Finally, I would be interested in seeing how you implemented the multithreading solution. Small parts of the code would be good.
I am trying to create 100 files using FileOutputStream/BufferedOutputStream.
I can see that CPU utilization is at 100% for 5 to 10 seconds. The directory I am writing to is empty. I am creating the PDF files through iText, each around 1 MB. I am running on Linux.
How can I rewrite the code so that I can minimize the CPU utilization?
Don't guess: profile your application.
If the numbers show that a lot of time is spent in / within write calls, then look at ways to do faster I/O. But if most time is spent in formatting stuff for output (e.g. iText rendering), then that's where you need to focus your efforts.
Is this in a directory which already contains a lot of files? If so, you may well just be seeing the penalty for having a lot of files in a directory - this varies significantly by operating system and file system.
Otherwise, what are you actually doing while you're creating the files? Where does the data come from? Are they big files? One thing you might want to do is try writing to a ByteArrayOutputStream instead - that way you can see how much of the activity is due to the file system and how much is just how you're obtaining/writing the data.
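Something like this sketch, where generatePdf stands in for your iText code:

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStream;

    public class InMemoryTest {
        public static void main(String[] args) throws Exception {
            // Same generation code, but targeting memory: if the CPU still
            // hits 100%, the cost is document rendering, not the file system.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            generatePdf(buffer);               // stand-in for the real iText generation
            byte[] pdf = buffer.toByteArray(); // writing this array out is the pure-IO half
            System.out.println("generated " + pdf.length + " bytes in memory");
        }

        static void generatePdf(OutputStream out) throws Exception {
            // placeholder for the real iText document generation
        }
    }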
It's a long shot guess, but even if you're using buffered streams make sure you're not writing out a single byte at a time.
The single-byte read() and write(int) methods are CPU killers. You should be using read(byte[], int, int) and write(byte[], int, int) for certain.
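For example (a sketch; 8 kB is an arbitrary but typical block size):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class BlockCopy {
        // Copies in 8 kB blocks instead of byte-at-a-time; even over buffered
        // streams, per-byte read()/write(int) calls burn CPU on call overhead.
        static void copy(InputStream in, OutputStream out) throws IOException {
            byte[] block = new byte[8192];
            int n;
            while ((n = in.read(block)) != -1) {
                out.write(block, 0, n);
            }
        }
    }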
A 1 MB file is large enough to use a java.nio FileChannel and see large performance improvements over java.io. Rewrite your code, and measure it against the old stuff. I predict a 2x improvement, at a minimum.
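A sketch of what that rewrite might look like, using FileChannel.transferTo so the copy can be handed off to the OS:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.channels.FileChannel;

    public class ChannelCopy {
        static void copy(String from, String to) throws Exception {
            try (FileChannel src = new FileInputStream(from).getChannel();
                 FileChannel dst = new FileOutputStream(to).getChannel()) {
                long pos = 0, size = src.size();
                while (pos < size) {
                    // a single call may transfer less than requested, so loop
                    pos += src.transferTo(pos, size - pos, dst);
                }
            }
        }
    }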
You're unlikely to be able to reduce the CPU load for your task, especially on a Windows system. Java on Linux does support Asynchronous File I/O, however, this can seriously complicate your code. I suspect you are running on Windows, as File I/O generally takes much more time on Windows than it does on Linux. I've even heard of improvements by running Java in a linux VM on Windows.
Take a look at your Task Manager when the process is running, and turn on Show Kernel Times. The CPU time spent in user space can generally be optimized, but the CPU time in kernel space can usually only be reduced by making more efficient calls.
Update -
JSR 203 specifically addresses the need for asynchronous, multiplexed, scatter/gather file IO:
The multiplexed, non-blocking facility introduced by JSR-51 solved much of that problem for network sockets, but it did not do so for filesystem operations.
Until JSR-203 becomes part of Java, you can get true asynchronous IO with the Apache MINA project on Linux.
Java NIO (1) allows you to do channel-based I/O. This is an improvement in performance, but you're only doing a buffer of data at a time, not true async and multiplexed IO.
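For anyone reading this later: JSR 203 did eventually ship in Java 7 as NIO.2. A minimal sketch of the asynchronous file read it added (assuming a file named data.bin):

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.channels.CompletionHandler;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class AsyncRead {
        public static void main(String[] args) throws Exception {
            AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                    Paths.get("data.bin"), StandardOpenOption.READ);
            ByteBuffer buf = ByteBuffer.allocate(8192);
            // The callback runs once the OS completes the read; the caller
            // is free to do other work in the meantime.
            ch.read(buf, 0, buf, new CompletionHandler<Integer, ByteBuffer>() {
                @Override public void completed(Integer bytesRead, ByteBuffer b) {
                    System.out.println("read " + bytesRead + " bytes");
                }
                @Override public void failed(Throwable t, ByteBuffer b) {
                    t.printStackTrace();
                }
            });
            Thread.sleep(1000); // toy example only: keep the JVM alive for the callback
        }
    }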
I have a Java program for doing a set of scientific calculations across multiple processors by breaking it into pieces and running each piece in a different thread. The problem is trivially partitionable so there's no contention or communication between the threads. The only common data they access are some shared static caches that don't need to have their access synchronized, and some data files on the hard drive. The threads are also continuously writing to the disk, but to separate files.
My problem is that sometimes when I run the program I get very good speed, and sometimes when I run the exact same thing it runs very slowly. If I see it running slowly and ctrl-C and restart it, it will usually start running fast again. It seems to set itself into either slow mode or fast mode early on in the run and never switches between modes.
I have hooked it up to jconsole and it doesn't seem to be a memory problem. When I have caught it running slowly, I've tried connecting a profiler to it but the profiler won't connect. I've tried running with -Xprof but the dumps between a slow run and fast run don't seem to be much different. I have tried using different garbage collectors and different sizings of the various parts of the memory space, also.
My machine is a Mac Pro with a striped RAID partition. CPU usage never drops off whether it's running slowly or quickly; you would expect it to drop if threads were spending too much time blocking on reads from the disk, so I don't think it is a disk read problem.
My question is: what types of problems with my code could cause this? Or could this be an OS problem? I haven't been able to duplicate it on a Windows machine, but I don't have a Windows machine with a similar RAID setup.
You might have a thread that has gone into an endless loop.
Try connecting with VisualVM and use the Thread monitor.
https://visualvm.dev.java.net
You may have to connect before the problem occurs.
I second that you should be looking at this with a profiler's thread view: how many threads there are, what states they are in, etc. It might be an odd race condition happening every now and then. It could also be the case that instrumenting the classes with profiler hooks (which causes a slowdown) sorts the race condition out, and you will see no slowdown with the profiler attached :/
Please have a look at this post, or rather its answer, where a cache contention problem is mentioned.
Are you spawning the same number of threads each time? Is that number less than or equal to the number of hardware threads available on your platform? That number can be checked, or guesstimated with fair accuracy.
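Checking is a one-liner:

    public class CoreCount {
        public static void main(String[] args) {
            // The number of hardware threads the JVM can actually run in parallel.
            System.out.println(Runtime.getRuntime().availableProcessors());
        }
    }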
Please post any findings!
Do you have a tool to measure CPU temperature? The OS might be throttling the CPU to deal with temperature issues.
Is it possible that your program is being paged to disk sometimes? In this case, you will need to look at the memory usage of the operating system as whole, rather than just your program. I know from experience there is a huge difference in runtime performance when memory is being continually paged to the disk and back.
I don't know much about OSX, but in linux the "free" command is useful for this purpose.
Another thing that might cause this slowdown is log files. I've seen logging code that slowed the system down incrementally as the log files grew. It's possible that your threads are synchronizing on a log file which keeps growing in size; then, when you restart your program, a fresh log file is used.
I'm developing a Java application that streams music via HTTP, and one problem I've come up against is that while the app is reading the audio file from disk and sending it to the client it usually maxes out the CPU at 90-100% (which can cause users problems running other apps).
Is it possible to control the thread doing this work to use less CPU, or does this need to be controlled by the OS? Are there any techniques for managing how intensive your application is at present?
I know you can start threads with a high/low priority, but this doesn't seem to have any effect for me in this scenario.
(I can't get my head past "I've asked the computer to do something, so it's obviously going to do it as fast as it can...")
Thanks!
rod.
That task (reading a file from the disk and sending it via HTTP) should not use any significant amount of CPU, especially at the bitrates required for music streaming (unless you're talking about multi-channel uncompressed PCM or something like that, but even then it should be I/O-bound and not use a lot of CPU).
You're probably doing the reading/writing in a very inefficient way. Do you read/write each byte separately or are you using some kind of buffer?
I would check how much buffering you are using. If you read/write one byte at a time, you will consume a lot of CPU. However, if you are reading/writing blocks of, say, 4 kB, it shouldn't use much CPU at all. If your network is the internet, your CPU shouldn't be much over 10% for a single client.
One approximation for the buffer size is the bandwidth * delay. e.g. if you expect users to stream at 500 KB/s and there is a network latency of up to 0.1 sec, then the buffer size should be around 50 KB.
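Putting those two points together, a sketch of a block-wise streaming loop with a buffer sized that way (out stands in for the HTTP response stream in your application):

    import java.io.*;

    public class StreamFile {
        // Streams a file in 50 kB blocks (bandwidth * delay from the estimate
        // above) instead of byte-at-a-time.
        static void stream(File file, OutputStream out) throws IOException {
            byte[] buf = new byte[50 * 1024];
            try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.flush();
            }
        }
    }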
You can lower its priority using methods in Thread (via Thread.currentThread() if necessary).
You can also put delays in its processing loop (Thread.sleep()).
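A sketch combining both suggestions (moreToSend and sendNextBlock are placeholders for your streaming loop):

    public class PoliteStreamer implements Runnable {
        @Override
        public void run() {
            // Ask the scheduler to favor other work over this thread.
            Thread.currentThread().setPriority(Thread.MIN_PRIORITY);
            while (moreToSend()) {       // placeholder loop condition
                sendNextBlock();         // placeholder unit of work
                try {
                    Thread.sleep(5);     // hand back a little CPU on every pass
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        private boolean moreToSend() { return false; } // placeholder stub
        private void sendNextBlock() { }               // placeholder stub
    }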
Other than that, let the O/S take care of it. If your program can use 100% CPU, and nothing else needs the CPU your app might as well use it rather than letting the O/S idle task have it.
It's also true that streaming data should be I/O bound, so you should definitely review what's being done between reading the data and sending it. Are you reading/sending byte by byte, unbuffered, for example?
EDIT: In response to marr75's comment, I am absolutely not advocating that you write poor, inefficient code which wastes CPU resources - there is an article on my web site which clearly conveys what I think about that mindset. Rather, what I am saying is that if your code legitimately needs the CPU, and you've prioritized it to behave nicely if the user wants to do other things, then there is no point at all in artificially delaying the outcome just to avoid pegging the CPU - that only does the user the disservice of making them wait longer for the end result, which they presumably want as quickly as possible.
Do you have one or more of:
Software RAID
Compressed folder
Intrusive virus checker
Loopback file system
I don't think you can lower the priority without losing the functionality (streaming music). Your program gets this much CPU from the OS because it needs it; it's not like the OS is giving CPU time away for no reason or because "it's in the mood for it".
If you think you can do the task without that much CPU utilization, profile your app, find out where the high CPU utilization takes place, and then try to improve your code.
I do think you are doing the streaming in an inefficient way, but I'll also say that streaming CAN be a legitimately CPU-intensive task.
I repeat: don't think about reducing the CPU utilization by lowering the priority of the process or telling the OS "don't give this process so much CPU time". That's the wrong intuition in my eyes. Reduce the CPU utilization by improving the algorithms and the code after profiling.
A good start in profiling java is this article: http://www.ibm.com/developerworks/edu/os-dw-os-ecl-tptp.html
In addition to the information given above: the JVM is free in how it maps Java threads onto OS threads. The Thread in your Java application might run in a separate OS thread, or it might share that OS thread with other Threads. Check the documentation of the JVM you are using for additional information.
Ok, thanks for the advice guys! Looks like I'm just going to have to look into improving the efficiency of the way my app streams (though I'm not sure how far that will go, as I'm basically just reading the file from disk and writing it to the client...).
VisualVM is very easy to use to find out where your CPU time is being spent for Java applications, and it is included in the latest versions of the JDK (named jvisualvm.exe on Windows)
Following up on my "well thought out buffers" comment, a good rule of thumb for TCP buffering:
buffer size = 2 * bandwidth * delay
So if you want to stream 214kbps music (around 27kB/s) and have, let's say 60ms of latency, you're looking at 3.24 kilobytes, and rounding off to a nice 4kB buffer will work very well for you on a wide range of systems.