I'm developing a Java application that streams music via HTTP, and one problem I've come up against is that while the app is reading the audio file from disk and sending it to the client it usually maxes out the CPU at 90-100% (which can cause users problems running other apps).
Is it possible to control the thread doing this work to use less CPU, or does this need to be controlled by the OS? Are there any techniques for managing how intensive your application is at present?
I know you can start threads with a high/low priority, but this doesn't seem to have any effect for me in this scenario.
(I can't get my head past "I've asked the computer to do something, so it's obviously going to do it as fast as it can...")
Thanks!
rod.
That task (reading a file from the disk and sending it via HTTP) should not use any significant amount of CPU, especially at the bitrates required for music streaming (unless you're talking about multi-channel uncompressed PCM or something like that, but even then it should be I/O-bound and not use a lot of CPU).
You're probably doing the reading/writing in a very inefficient way. Do you read/write each byte separately or are you using some kind of buffer?
I would check how much buffering you are using. If you read/write one byte at a time you will consume a lot of CPU. However, if you are reading/writing blocks of say 4 kB it shouldn't use much CPU at all. If your network is the internet your CPU shouldn't be much over 10% of a single client.
One approximation for the buffer size is the bandwidth * delay. e.g. if you expect users to stream at 500 KB/s and there is a network latency of up to 0.1 sec, then the buffer size should be around 50 KB.
You can lower it's priority using methods in Thread (via Thread.currentThread() if necessary).
You can also put delays in it's processing loop (Thread.sleep()).
Other than that, let the O/S take care of it. If your program can use 100% CPU, and nothing else needs the CPU your app might as well use it rather than letting the O/S idle task have it.
It's also true that streaming data should be I/O bound, so you should definitely review what's being done between reading the data and sending it. Are you reading/sending byte by byte, unbuffered, for example?
EDIT: In response to marr75's comment, I am absolutely not advocating that you write poor, inefficient code which wastes CPU resources - There is an article on my web site which clearly conveys what I think about that mind-set. Rather, what I am saying is that if your code legitimately needs the CPU, and you've prioritized it to behave nicely if the user wants to do other things, then there is no point at all in artificially delaying the outcome just to avoid pegging the CPU - that only does the user the disservice of making them wait longer for the end result, which they presumably want as quickly as possible.
Do you have one or more of:
Software RAID
Compressed folder
Intrusive virus checker
Loopback file system
I don't think you can lower the priority without losing the functionality (stream music). Your program gets this much cpu from the OS, because it needs it. It's not like the OS is giving cpu-time away for no reason or because "it's in the mood for it".
If you think, you can do the task without using that much cpu-utilization, you can profile your app and find out, where this high cpu-utilization takes place and then try to improve your code.
I think you are doing the streaming in an inefficient way, but I say streaming CAN be a highly utilizing task.
I repeat, don't think about reducing the cpu-utilization by lowering the priority of the process or telling the OS "Don't give that much cpu-time to this process". That's the whole wrong intuition in my eyes. Reduce the cpu-utilization by improving the algorithms and code after profiling.
A good start in profiling java is this article: http://www.ibm.com/developerworks/edu/os-dw-os-ecl-tptp.html
In addition to the information given above: the JVM is free in how it uses OS threads. The Thread in your Java application might run in a seperate OS thread, or it might share that thread with other Threads. Check the documentation for the JVM you are using for additional information.
Ok, thanks for the advice guys! Looks like I'm just going to have to look into trying to improve the efficiency of the way my app is streaming (though not sure this is going to go far as I'm basically just reading the file from disk and writing it to the client...).
VisualVM is very easy to use to find out where your CPU time is being spent for Java applications, and it is included in the latest versions of the JDK (named jvisualvm.exe on Windows)
Follow up to my "well thought out buffers" comment, a good rule of thumb for TCP buffering,
buffer size = 2 * bandwidth * delay
So if you want to stream 214kbps music (around 27kB/s) and have, let's say 60ms of latency, you're looking at 3.24 kilobytes, and rounding off to a nice 4kB buffer will work very well for you on a wide range of systems.
Related
I have a very large set of text files. The task was to calculate the document frequencies (number of document that contain a certain term) for all the terms (uniquely) inside this huge corpus. Simply starting from the first file and calculating everything in a serialized manner seemed to be a dumb thing to do (I admit I did it just to see how disastrous it is).
I realized that if I do this calculation in a Map-Reduce manner, meaning clustering my data into smaller pieces and in the end aggregating the results, I would get the results much faster.
My PC has 4 cores, so I decided to separate my data into 3 distinct subsets and feeding each subset to a separate thread waiting for all the threads to finish their work and passing their results to a another method to aggregate everything.
I tests it with a very small set of data, worked fined. Before I use the actual data, I tested it with a larger set to I can study its behaviour better. I started jvisualvm and htop to see how the cpu and memory is working. I can see that 3 threads are running and cpu cores are also busy. But the usage of these cores are rarely above 50%. This means that my application is not really using the full power of my PC. is this related to my code, or is this how it is supposed to be. My expectation was that each thread uses as much cpu core resource as possible.
I use Ubuntu.
Sounds to me that you have an IO bound application. You are spending more time in your individual threads reading the data from the disk then you are actually processing the information that is read.
You can test this by migrating your program to another system with a SSD to see if the CPU performance changes. You can also read in all of the files and then process them later to see if that changes the CPU curve during processing time. I suspect it will.
As already stated you're bottle-necked by something probably disk IO. Try separating the code that reads from disk from the code that processes the data, and use separate thread pools for each. Afterwards, a good way to quickly scale your thread pools to properly fit your resources is to use one of the Executors thread pools.
You are IO bound for a problem like this on a single machine, not CPU bound. Are you actively reading the files? Only if you had all the files in-memory would you start to saturate the CPU. That is why map-reduce is effective. It scales the total IO throughput more than CPU.
You can possibly speed up this quite a bit if you are on Linux and use tmpfs for storing the data in memory, instead on disk.
We have been profiling and profiling our application to get reduce latency as much as possible. Our application consists of 3 separate Java processes, all running on the same server, which are passing messages to each other over TCP/IP sockets.
We have reduced processing time in first component to 25 μs, but we see that the TCP/IP socket write (on localhost) to the next component invariably takes about 50 μs. We see one other anomalous behavior, in that the component which is accepting the connection can write faster (i.e. < 50 μs). Right now, all the components are running < 100 μs with the exception of the socket communications.
Not being a TCP/IP expert, I don't know what could be done to speed this up. Would Unix Domain Sockets be faster? MemoryMappedFiles? what other mechanisms could possibly be a faster way to pass the data from one Java Process to another?
UPDATE 6/21/2011
We created 2 benchmark applications, one in Java and one in C++ to benchmark TCP/IP more tightly and to compare. The Java app used NIO (blocking mode), and the C++ used Boost ASIO tcp library. The results were more or less equivalent, with the C++ app about 4 μs faster than Java (but in one of the tests Java beat C++). Also, both versions showed a lot of variability in the time per message.
I think we are agreeing with the basic conclusion that a shared memory implementation is going to be the fastest. (Although we would also like to evaluate the Informatica product, provided it fits the budget.)
If using native libraries via JNI is an option, I'd consider implementing IPC as usual (search for IPC, mmap, shm_open, etc.).
There's a lot of overhead associated with using JNI, but at least it's a little less than the full system calls needed to do anything with sockets or pipes. You'll likely be able to get down to about 3 microseconds one-way latency using a polling shared memory IPC implementation via JNI. (Make sure to use the -Xcomp JVM option or adjust the compilation threshold, too; otherwise your first 10,000 samples will be terrible. It makes a big difference.)
I'm a little surprised that a TCP socket write is taking 50 microseconds - most operating systems optimize TCP loopback to some extent. Solaris actually does a pretty good job of it with something called TCP Fusion. And if there has been any optimization for loopback communication at all, it's usually been for TCP. UDP tends to get neglected - so I wouldn't bother with it in this case. I also wouldn't bother with pipes (stdin/stdout or your own named pipes, etc.), because they're going to be even slower.
And generally, a lot of the latency you're seeing is likely coming from signaling - either waiting on an IO selector like select() in the case of sockets, or waiting on a semaphore, or waiting on something. If you want the lowest latency possible, you'll have to burn a core sitting in a tight loop polling for new data.
Of course, there's always the commercial off-the-shelf route - which I happen to know for a certainty would solve your problem in a hurry - but of course it does cost money. And in the interest of full disclosure: I do work for Informatica on their low-latency messaging software. (And my honest opinion, as an engineer, is that it's pretty fantastic software - certainly worth checking out for this project.)
"The O'Reilly book on NIO (Java NIO, page 84), seems to be vague about
whether the memory mapping stays in memory. Maybe it is just saying
that like other memory, if you run out of physical, this gets swapped
back to disk, but otherwise not?"
Linux. mmap() call allocates pages in OS page cache area (which are periodically get flushed to disk and can be evicted based on Clock-PRO which is approximation of LRU algorithm?) So the answer on your question is - yes. Memory mapped buffer can be evicted (in theory) from memory unless it is mlocke'd (mlock()). This is in theory. In practice, I think it is hardly possible if your system is not swapping In this case, first victims are page buffers.
See my answer to fastest (low latency) method for Inter Process Communication between Java and C/C++ - with memory mapped files (shared memory) java-to-java latency can be reduced to 0.3 microsecond
MemoryMappedFiles is not viable solution for low latency IPC at all - if mapped segment of memory gets updated it is eventually will be synced to disk thus introducing unpredictable delay which measures in milliseconds at least. For low latency one can try combinations of either Shared Memory + message queues (notifications), or shared memory + semaphores. This works on all Unixes especially System V version (not POSIX) but if you run application on Linux you pretty safe with POSIX IPC (most features are available in 2.6 kernel) Yes, you will need JNI to get this done.
UPD: I forgot that this is JVM - JVM IPC and we have already GCs which we can not control fully, so introducing additional several ms pauses due to OS file buffers flash to disk may be acceptable.
Check out https://github.com/pcdv/jocket
It's a low-latency replacement for local Java sockets that uses shared memory.
RTT latency between 2 processes is well below 1us on a modern CPU.
I have a program that starts up and creates an in-memory data model and then creates a (command-line-specified) number of threads to run several string checking algorithms against an input set and that data model. The work is divided amongst the threads along the input set of strings, and then each thread iterates the same in-memory data model instance (which is never updated again, so there are no synchronization issues).
I'm running this on a Windows 2003 64-bit server with 2 quadcore processors, and from looking at Windows task Manager they aren't being maxed-out, (nor are they looking like they are being particularly taxed) when I run with 10 threads. Is this normal behaviour?
It appears that 7 threads all complete a similar amount of work in a similar amount of time, so would you recommend running with 7 threads instead?
Should I run it with more threads?...Although I assume this could be detrimental as the JVM will do more context switching between the threads.
Alternatively, should I run it with fewer threads?
Alternatively, what would be the best tool I could use to measure this?...Would a profiling tool help me out here - indeed, is one of the several profilers better at detecting bottlenecks (assuming I have one here) than the rest?
Note, the server is also running SQL Server 2005 (this may or may not be relevant), but nothing much is happening on that database when I am running my program.
Note also, the threads are only doing string matching, they aren't doing any I/O or database work or anything else they may need to wait on.
My guess would be that your app is bottlenecked on memory access, i.e. your CPU cores spend most of the time waiting for data to be read from main memory. I'm not sure how well profilers can diagnose this kind of problem (the profiler itself could influence the behaviour considerably). You could verify the guess by having your code repeat the operations it does many times on a very small data set.
If this guess is correct, the only thing you can do (other than getting a server with more memory bandwidth) is to try and increase the locality of your memory access to make better use of caches; but depending on the details of the application that may not be possible. Using more threads may in fact lead to worse performance because of cores sharing cache memory.
Without seeing the actual code, it's hard to give proper advice. But do make sure that the threads aren't locking on shared resources, since that would naturally prevent them all from working as efficiently as possible. Also, when you say they aren't doing any io, are they not reading an input or writing an output either? this could also be a bottleneck.
With regards to cpu intensive threads, it is normally not beneficial to run more threads than you have actual cores, but in an uncontrolled environment like this with other big apps running at the same time, you are probably better off simply testing your way to the optimal number of threads.
I am trying to create 100 files using FileOutputStream/BufferedOutputStream.
I can see the CPU utilization is 100% for 5 to 10 sec. The Directory which i am writing is empty. I am creating PDF files thru iText. Each file having round 1 MB. I am running on Linux.
How can i rewrite the code so that i can minimize the CPU utilization?
Don't guess: profile your application.
If the numbers show that a lot of time is spent in / within write calls, then look at ways to do faster I/O. But if most time is spent in formatting stuff for output (e.g. iText rendering), then that's where you need to focus your efforts.
Is this in a directory which already contains a lot of files? If so, you may well just be seeing the penalty for having a lot of files in a directory - this varies significantly by operating system and file system.
Otherwise, what are you actually doing while you're creating the files? Where does the data come from? Are they big files? One thing you might want to do is try writing to a ByteArrayOutputStream instead - that way you can see how much of the activity is due to the file system and how much is just how you're obtaining/writing the data.
It's a long shot guess, but even if you're using buffered streams make sure you're not writing out a single byte at a time.
The .read(int) and .write(int) methods are CPU killers. You should be using .read(byte[]...) and .write(byte[], int, int) for certain.
A 1MB file to write is large enough to use a java.nio FileChannel and see large performance improvements over java.io. Rewrite your code, and measure it agaist the old stuff. I predict a 2x improvement, at a minimum.
You're unlikely to be able to reduce the CPU load for your task, especially on a Windows system. Java on Linux does support Asynchronous File I/O, however, this can seriously complicate your code. I suspect you are running on Windows, as File I/O generally takes much more time on Windows than it does on Linux. I've even heard of improvements by running Java in a linux VM on Windows.
Take a look at your Task Manager when the process is running, and turn on Show Kernel Times. The CPU time spent in user space can generally be optimized, but the CPU time in kernel space can usually only be reduce by make more efficient calls.
Update -
JSR 203 specifically addresses the need for asynchronous, multiplexed, scatter/gather file IO:
The multiplexed, non-blocking facility introduced by JSR-51 solved much of that problem for network sockets, but it did not do so for filesystem operations.
Until JSR-203 becomes part of Java, you can get true asynchronous IO with the Apache MINA project on Linux.
Java NIO (1) allows you to do Channel based I/O. This is an improvement in performance, but your only doing a buffer of data at a time, and not true async & multiplexed IO.
Is Java a suitable alternative to C / C++ for realtime audio processing?
I am considering an app with ~100 (at max) tracks of audio with delay lines (30s # 48khz), filtering (512 point FIR?), and other DSP type operations occurring on each track simultaneously.
The operations would be converted and performed in floating point.
The system would probably be a quad core 3GHz with 4GB RAM, running Ubuntu.
I have seen articles about Java being much faster than it used to be, coming close to C / C++, and now having realtime extensions as well. Is this reality? Does it require hard core coding and tuning to achieve the %50-%100 performance of C some are spec'ing?
I am really looking for a sense if this is possible and a heads up for any gotchas.
For an audio application you often have only very small parts of code where most of the time is spent.
In Java, you can always use the JNI (Java Native interface) and move your computational heavy code into a C-module (or assembly using SSE if you really need the power). So I'd say use Java and get your code working. If it turns out that you don't meet your performance goal use JNI.
90% of the code will most likely be glue code and application stuff anyway. But keep in mind that you loose some of the cross platform features that way. If you can live with that JNI will always leave you the door open for native code performance.
Java is fine for many audio applications. Contrary to some of the other posters, I find Java audio a joy to work with. Compare the API and resources available to you to the horrendous, barely documented mindf*k that is CoreAudio and you'll be a believer. Java audio suffers from some latency issues, though for many apps this is irrelevant, and a lack of codecs. There are also plenty of people who've never bothered to take the time to write good audio playback engines(hint, never close a SourceDataLine, instead write zeros to it), and subsequently blame Java for their problems. From an API point of view, Java audio is very straightforward, very easy to use, and there is lots and lots of guidance over at jsresources.org.
Sure, why not?
The crucial questions (independent of language, this is from queueing theory) are:
what is the maximum throughput you need to handle (you've specified 100 x 48kHz, is that mono or stereo, how many bits equivalent at that frequency?)
can your Java routines keep up with this rate on the average?
what is the maximum permissible latency?
If your program can keep up with the throughput on the average, and you have enough room for latency, then you should be able to use queues for inputs and outputs, and the only parts of the program that are critical for timing are the pieces that put the data into the input queue and take it out of the output queue and send it to a DAC/speaker/whatever.
Delay lines have low computational load, you just need enough memory (+ memory bandwidth)... in fact you should probably just use the input/output queues for it, i.e. start putting data into the input queue immediately, and start taking data out of the output queue 30s later. If it's not there, your program is too slow...).
FIRs are more expensive, that's probably going to be the bottleneck (& what you'd want to optimize) unless you have some other ugly nasty operation in mind.
I think latency will be your major problem - it is quite hard to maintain latency already in C/C++ on modern OSes, and java surely adds to the problem (garbage collector). The general design for "real-time" audio processing is to have your processing threads running at real time scheduling (SCHED_FIFO on linux kernels, equivalent on other OSes), and those threads should never block. This means no system calls, no malloc, no IO of course, etc... Even paging is a problem (getting a page from disk to memory can easily take several ms), so you should lock some pages to be sure they are never swapped out.
You may be able to do those things in Java, but java makes it more complicated, not easier. I would look into a mixed design, where the core would be in C, and the rest (GUI, etc...) would be in java if you want.
One thing I didn't see in your question is whether you need to play out these processed samples or if you're doing something else with them (encoding them into a file, for example). I'd be more worried about the state of Java's sound engine than in how fast the JVM can crunch samples.
I pushed pretty hard on javax.sound.sampled a few years back and came away deeply unimpressed -- it doesn't compare with equivalent frameworks like OpenAL or Mac/iPhone's Core Audio (both of which I've used at a similar level of intensity). javax.sound.sampled requires you to push your samples into an opaque buffer of unknown duration, which makes synchronization nigh impossible. It's also poorly documented (very hard to find examples of streaming indeterminate-length audio over a Line as opposed to the trivial examples of in-memory Clips), has unimplemented methods (DataLine.getLevel()... whose non-implementation isn't even documented), and to top it off, I believe Sun laid off the last JavaSound engineer years ago.
If I had to use a Java engine for sound mixing and output, I'd probably try to use the JOAL bindings to OpenAL as a first choice, since I'd at least know the engine was currently supported and capable of very low-latency. Though I suspect in the long run that Nils is correct and you'll end up using JNI to call the native sound API.
Yes, Java is great for audio applications. You can use Java and access audio layers via Asio and have really low latency (64 samples latency which is next to nothing) on Windows platform. It means you will have lip-sync on video/movie. More latency on Mac as there is no Asio to "shortcut" the combination of OS X and "Java on top", but still OK. Linux also, but I am more ignorant. See soundpimp.com for a practical (and world first) example of Java and Asio working in perfect harmony. Also see the NRK Radio&tv Android app containing a sw mp3 decoder (from Java). You can do most audio things with Java, and then use a native layer if extra time critical.
Check out a library called Jsyn.
http://www.softsynth.com/jsyn/
Why not spend a day and write a simple java application that does minimal processing and validate whether the performance is adaquate.
From http://www.jsresources.org/faq_performance.html#java_slow
Let's collect some ethernal wisdom:
The earth is flat.
and, not to forget: Java is slow.
As several applications prove (see links section), Java is sufficient
to build audio editors, multitrack recording systems and MIDI
processing software. Try it out!