Useful audio data from FFT - java

I'm trying to do a simple music visualization in java. I have two threads set up, one for playing the clip, and another for extracting a chunk of bytes from the clip to process with an FFT. The processed array can then be sent to the JFrame that will handle drawing, and used as a parameter for some sort of visual.
I'm not exactly sure what to do with the data, however. I've just been using a power spectrum for now, which gives me very limited response, and which I realize is too general for what I'm trying to do. I'm open to using any FFT library out there, if there is a specific one that will be especially helpful. But, in general, what can I get from my data after doing an FFT, and how can I use it to show decently accurate results in the visuals?

All FFTs will do pretty much the same thing given the same data. The FFT parameters you can vary are the scale factor, the length of the FFT (longer will give you higher frequency resolution, shorter will give you better time response), and (pre)windowing the data, which will cause less "splatter" or spectral leakage of spectral peaks. You can zero-pad an FFT to interpolate smoother-looking results. You can average the magnitude results of several successive FFTs to reduce the noise floor. You can also use a scaling function such as log scaling (or log-log, i.e. log on both axes) for presenting the FFT magnitude results.
The phase of a complex FFT is usually unimportant for any visualization unless you are doing some type of phase vocoder analysis+resynthesis.
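To make a few of those options concrete, here is a minimal sketch of the windowing, log-scaling and averaging steps. It assumes you already have an FFT routine producing real/imaginary arrays; all names here are illustrative, not from any particular library:

public class SpectrumUtils {
    // Apply a Hann window in place before the FFT to reduce spectral leakage.
    static void hannWindow(double[] frame) {
        int n = frame.length;
        for (int i = 0; i < n; i++) {
            frame[i] *= 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
        }
    }

    // Convert complex FFT output to a log-magnitude (dB) spectrum for display.
    static double[] logMagnitude(double[] re, double[] im) {
        double[] db = new double[re.length / 2]; // keep bins up to Nyquist
        for (int k = 0; k < db.length; k++) {
            double mag = Math.sqrt(re[k] * re[k] + im[k] * im[k]);
            db[k] = 20.0 * Math.log10(mag + 1e-12); // epsilon avoids log(0)
        }
        return db;
    }

    // Average the magnitudes of several successive FFTs to reduce the noise floor.
    static double[] averageSpectra(java.util.List<double[]> spectra) {
        double[] avg = new double[spectra.get(0).length];
        for (double[] s : spectra)
            for (int k = 0; k < avg.length; k++) avg[k] += s[k] / spectra.size();
        return avg;
    }
}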

Related

Level out FFT graph (Processing)

I am trying to make a music visualizer in Processing, not that that part is super important, and I'm using a fast fourier transform through Minim. It's working perfectly (reading the data), but there is a large spike on the left (bass) end. What's the best way to 'level' this out?
My source code is here, if you want to take a look.
Thanks in advance,
-tlf
The spectrum you show looks fairly typical of a complex musical sound where you have a complex section at lower frequencies, but also some clear harmonics emerging from the low frequency mess. And, actually, these harmonics are atypically clear... music in general is complicated. Sometimes, for example, if a flute is playing a single clear note one will get a single nice peak or two, but it's much more common that transients and percussive sounds lead to a very complicated spectrum, especially at low frequencies.
For comparing directly to the video, it seems to me that the video is a bit odd. My guess is that the spectrum they show is either a zoom into a small section of the spectrum far from zero, or just a graphical algorithm that's based on the music but doesn't correspond to an actual spectrum. That is, if you really want something to look very similar to this video, you'll need more than the spectrum, though the spectrum will likely be a good starting point. Here are a few points to note:
1) There is a prominent peak which occasionally appears right above the "N" in the word "anchor". A single dominant peak should be clear in the audio as an approximately pure tone.
2) Occasionally there's another peak that varies temporally with the first, which would normally be a sign that the second peak is a harmonic, but many times this second peak isn't there.
3) A good example of odd behavior is at 2:26. This moment comes just after a little laser sound effect, and then there's basically a quiet hiss. A hiss should be a broad-spectrum sound without peaks, often weighted toward lower frequencies. At 2:26, though, there's just this single large peak above the "N" with nothing else.
It turns out what I had to do was multiply the data by
Math.log(i + 2) / 3
where i is the index of the data being referenced, zero-indexed from the left (bass).
You can see this in context here
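In isolation, applying that weighting might look like the sketch below, where spectrum is a hypothetical array of FFT band magnitudes (e.g. what Minim's forward FFT gives you):

static void levelSpectrum(float[] spectrum) {
    for (int i = 0; i < spectrum.length; i++) {
        // i = 0 is the bass end; log(i + 2) / 3 is about 0.23 there,
        // so the low bins are damped and the higher bins gently boosted.
        spectrum[i] *= Math.log(i + 2) / 3.0;
    }
}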

frequency / pitch detection for dummies

I have found many questions on this site dealing with the concept of pitch detection, but they all deal with this magical FFT with which I am not familiar. I am trying to build an Android application that needs to implement pitch detection, and I have absolutely no understanding of the algorithms that are used to do this.
It can't be that hard can it? There are around 8 billion guitar tuner apps on the android market after all.
Can someone help?
The FFT is not really the best way to implement pitch detection or pitch tracking. One issue is that the loudest frequency is not always the fundamental frequency. Another is that the FFT, by itself, requires a pretty large amount of data and processing to obtain the resolution you need to tune an instrument, so it can appear slow to respond (i.e. high latency). Yet another issue is that the result of an FFT is not exactly intuitive to work with: you get an array of complex numbers and you have to know how to interpret them.
If you really want to use an FFT, here is one approach:
Low-pass your signal. This will help prevent noise and higher harmonics from creating spurious results. Conceivably, you could skip this step and instead weight your results towards the lower values of the FFT. For some instruments with strong fundamental frequencies, this might not be necessary.
Window your signal. Windows should be at least 4096 in size. Larger is better up to a point, because it gives you better frequency resolution. If you go too large, it will end up increasing your computation time and latency. The Hann function is a good choice for your window. http://en.wikipedia.org/wiki/Hann_function
FFT the windowed signal as often as you can. Even overlapping windows are good.
The results of the FFT are complex numbers. Find the magnitude of each complex number using sqrt( real^2 + imag^2 ). The index in the FFT array with the largest magnitude is the index with your peak frequency.
You may want to average multiple FFTs for more consistent results.
How do you calculate the frequency from the index? Well, let's say you've got a window of size N. After you FFT, you will have N complex numbers. If your peak is the nth one, and your sample rate is 44100, then your peak frequency will be near 44100*n/N. Why near? Well, the bins are discrete, so you have an error of up to half a bin: (44100/2)*1/N. For a window size of 4096, this is about 5.4 Hz -- easily audible at A440. You can improve on that by 1. taking phase into account (I've only described how to take magnitude into account), 2. using larger windows (which will increase latency and processing requirements, as the FFT is an N log N algorithm), or 3. using a better algorithm like YIN http://www.ircam.fr/pcm/cheveign/pss/2002_JASA_YIN.pdf
You can skip the windowing step and just break the audio into discrete chunks of however many samples you want to analyze. This is equivalent to using a square window, which works, but you may get more noise in your results.
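To make the whole recipe concrete, here is a rough, self-contained sketch. For clarity it computes the spectrum with a direct O(N^2) DFT rather than a real FFT library, so treat it as illustrative only; every name in it is mine, not from any particular API.

public class PeakFrequency {
    // Returns an estimate of the loudest frequency in `samples`, following
    // the steps above: Hann window, transform, magnitude, argmax.
    static double peakFrequency(double[] samples, double sampleRate) {
        int n = samples.length;
        // 1) Hann window to reduce spectral leakage.
        double[] w = new double[n];
        for (int i = 0; i < n; i++) {
            w[i] = samples[i] * 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
        }
        // 2) Direct DFT of the windowed frame; a real FFT does this in N log N.
        int bestBin = 1;    // skip bin 0: it holds the DC offset, not a pitch
        double bestMag = -1.0;
        for (int k = 1; k < n / 2; k++) { // only up to Nyquist; the rest mirrors
            double re = 0.0, im = 0.0;
            for (int i = 0; i < n; i++) {
                double angle = 2.0 * Math.PI * k * i / n;
                re += w[i] * Math.cos(angle);
                im -= w[i] * Math.sin(angle);
            }
            // 3) Magnitude of the complex bin: sqrt(real^2 + imag^2).
            double mag = Math.sqrt(re * re + im * im);
            if (mag > bestMag) { bestMag = mag; bestBin = k; }
        }
        // 4) Bin index -> frequency: bin k sits near k * sampleRate / N.
        return bestBin * sampleRate / n;
    }
}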
BTW: Many of those tuner apps license code from third parties, such as z-plane and iZotope.
Update: If you want C source code and a full tutorial for the FFT method, I've written one. The code compiles and runs on Mac OS X, and should be convertible to other platforms pretty easily. It's not designed to be the best, but it is designed to be easy to understand.
A Fast Fourier Transform changes a function from the time domain to the frequency domain. So instead of f(t), where f is the signal that you are getting from the microphone and t is the time index of that signal, you get g(θ), where g is the FFT of f and θ is the frequency. Once you have g(θ), you just need to find the θ with the highest amplitude, meaning the "loudest" frequency. That will be the primary pitch of the sound that you are picking up.
As for actually implementing the FFT, if you google "fast fourier transform sample code", you'll get a bunch of examples.
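Once you have the transform, "find the θ with the highest amplitude" is just an argmax over the bin magnitudes; a tiny sketch, where mags is assumed to hold them:

static int peakIndex(double[] mags) {
    int best = 1; // skip bin 0, which holds the DC offset rather than a pitch
    for (int k = 1; k < mags.length / 2; k++) { // first half only; the rest is a mirror image
        if (mags[k] > mags[best]) best = k;
    }
    return best;
}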

Analyzing Sound in a WAV file

I am trying to analyze a movie file by splitting it up into camera shots and then trying to determine which shots are more important than others. One of the factors I am considering in a shot's importance is how loud the volume is during that part of the movie. To do this, I am analyzing the corresponding sound file. I'm having trouble determining how "loud" a shot is because I don't think I fully understand what the data in a WAV file represents.
I read the file into an audio buffer using a method similar to that described in this post.
Having already split the corresponding video file into shots, I am now trying to find which shots are louder than others in the WAV file. I am trying to do this by extracting each sample in the file like this:
double amplitude = (double)((audioData[i] & 0xff) | (audioData[i + 1] << 8));
Some of the other posts I have read seem to indicate that I need to apply a Fast Fourier Transform to this audio data to get the amplitude, which makes me wonder what the values I have extracted actually represent. Is what I'm doing correct? My sound file format is a 16-bit mono PCM with a sampling rate of 22,050 Hz. Should I be doing something with this 22,050 value when I am trying to analyze the volume of the file? Other posts suggest using Root Mean Square to evaluate loudness. Is this required, or just a more accurate way of doing it?
The more I look into this the more confused I get. If anyone could shed some light on my mistakes and misunderstandings, I would greatly appreciate it!
The FFT has nothing to do with volume and everything to do with frequencies. To find out how loud a scene is on average, simply average the sampled values. Depending on whether you get the data as signed or unsigned values in your language, you might have to apply an absolute value function first so that negative amplitudes don't cancel out the positive ones, but that's pretty much it. If you don't get the results you were expecting, that probably has to do with the way you are extracting the individual values in line 20.
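As a minimal sketch of that suggestion, assuming the bytes have already been assembled into signed 16-bit samples (a short[] here):

static double averageLoudness(short[] samples) {
    double sum = 0.0;
    for (short s : samples) {
        sum += Math.abs(s); // absolute value, so negative swings don't cancel positive ones
    }
    return sum / samples.length;
}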
That said, there are a few refinements that might or might not affect your task. Perceived loudness, amplitude and acoustic power are in fact related in non-linear ways, but as long as you are only trying to get a rough estimate of how much is "going on" in the audio signal, I doubt that this is relevant for you. And of course, humans hear different frequencies better or worse; for instance, bats emit ultrasound squeals that would be absolutely deafening to us, but luckily we can't hear them at all. But again, I doubt this is relevant to your task, since frequencies above half the sampling rate (about 22kHz for a 44.1kHz file) are in fact not representable in a simple WAV format.
I don't know the level of accuracy you want, but a simple RMS (and perhaps simple filtering of the signal) is all many similar applications would need.
RMS will be much better than Peak amplitude. Using peak amplitudes is like determining the brightness of an image based on the brightest pixel, rather than averaging.
If you want to filter the signal or weigh it to perceived loudness, then you would need the sample rate for that.
An FFT should not be required unless you want to do complex frequency analysis as well. The ear does not respond linearly to sounds at different frequencies and amplitudes, so if you need that extra degree of accuracy, you could use an FFT to analyze the frequency content and weight each band accordingly.
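For comparison with the averaging sketch above, an RMS version only squares before averaging and takes the square root afterwards:

static double rms(short[] samples) {
    double sumOfSquares = 0.0;
    for (short s : samples) {
        sumOfSquares += (double) s * s; // square each sample
    }
    return Math.sqrt(sumOfSquares / samples.length);
}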

Using jTransforms library on a WAV file?

I am trying to do spectral analysis on a WAV file using the jTransforms library: Official Site
But I have problems with how to convert the WAV file into an acceptable input for an FFT using jTransforms, and with how to display a frequency spectrum after the FFT. I have searched around Google and found that I need to somehow convert the WAV file into a double[] or Complex[], but afterwards, how should I interpret the output?
Sorry I am very new to FFT so this question may sound extremely stupid. Many thanks!
I don't know your library, but I guess it has extensive documentation on how to apply the transforms.
Regarding the interpretation: if you use a complex transform, the magnitude of each output value gives the energy of the corresponding frequency bin, and its angle gives the phase of the sinusoid.
The power spectral density (PSD) can be computed by
fftData * conj(fftData)
which is equal to
abs(fftData)^2
(so multiply each value by its complex conjugate, which gives the squared magnitude).
One thing you might have to consider is rescaling your FFT output. Some algorithms scale the output in proportion to the fftSize, so you will have to multiply the output by 1/fftSize.
And the last thing, in case you are not aware of it: you only have to take half of the FFT output, since the spectrum is symmetric. For real input, bin N-k is the complex conjugate of bin k, so the upper half simply mirrors the lower half. The bin at fftSize/2 corresponds to the Nyquist frequency, which is half your sampling rate and the highest frequency you can analyze.
So if you want to display frequencies up to 22kHz, make sure your audio is sampled at 44kHz or more; a larger fftSize then gives you finer frequency resolution over that range.
There are many pitfalls with the FFT, so be sure to read up on it and understand what you are doing. The mathematics itself is not that important if you just want to use it, so you might skip that part.
EDIT: There is even more. Consider weighting your input data with a tapered window (Gaussian, Hamming, Hann...) to avoid nasty edge effects if you don't feed your whole wav file as input. Otherwise you will get artificial high frequencies in your FFT output which are simply not present in the original.
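Putting those pieces together, here is a hedged sketch using jTransforms' DoubleFFT_1D.realForward. The import path and the exact packing of the output array depend on your jTransforms version, so check its javadoc; this assumes the common layout where bin k's real and imaginary parts land at indices 2k and 2k+1, and that samples holds your decoded (and, per the edit above, windowed) audio:

import org.jtransforms.fft.DoubleFFT_1D;

public class PsdSketch {
    static double[] powerSpectrum(double[] samples) {
        int n = samples.length;               // assumed even
        double[] data = samples.clone();      // realForward transforms in place
        new DoubleFFT_1D(n).realForward(data);
        double[] psd = new double[n / 2];     // keep half; the spectrum is symmetric
        psd[0] = (data[0] * data[0]) / n;     // DC bin, stored unpaired at index 0
        for (int k = 1; k < n / 2; k++) {
            double re = data[2 * k], im = data[2 * k + 1];
            psd[k] = (re * re + im * im) / n; // |X[k]|^2, rescaled by 1/fftSize
        }
        return psd;
    }
}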

Read audio byte data in real time

The goal is to get a simple 2d audio visualizer that is somewhat responsive to the music.
I've got the basics set up, where I have graphics that will respond to some data being fed in. Given a file, I load up an audioInputStream for playback (this works fine), and have that running in a thread. In another thread, I would like to extract byte data at a rate close to the playback (or perhaps faster, to allow for delay in processing that data). I then want to feed that to an FFT process, and feed the resulting data to my graphics object that will use it as a parameter for whatever the visualization is.
I have two questions for this process:
1) How can I get the byte data and process it at a rate that will match the normal playback of a file? Is using an audioInputStream the way to go here?
2) Once I do FFT, what's a good way to get usable data (ie: power spectrum? Somehow filtering out certain frequencies? etc..)
Some considerations about (2) using the FFT to extract "features".
You should calculate short-term FFTs, for example 512 points, whenever there are enough free CPU cycles to do so. For a visualisation it is not necessary to preserve all the information (i.e. to work with overlapping windows); instead, you could calculate a 100ms FFT 5 times per second.
Then you should calculate the logarithmic power spectrum in dB (decibel).
This gives you a pretty good impression about the detailed frequency content of your sound.
Depending on what you would like to visualize, you could for example combine some low-frequency FFT bins (calculate their RMS) to get the "bass" content of your sound, and so on.
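As a sketch of those last two steps (dB conversion and grouping bins into a "bass" band), assuming psd holds a power spectrum like the one computed earlier and arbitrarily treating the first 8 non-DC bins as bass:

static double bassLevelDb(double[] psd) {
    int bassBins = 8;                 // arbitrary choice; tune to taste and FFT size
    double sum = 0.0;
    for (int k = 1; k <= bassBins; k++) {
        sum += psd[k];                // skip bin 0 (DC)
    }
    double meanPower = sum / bassBins;
    return 10.0 * Math.log10(meanPower + 1e-12); // to dB; epsilon avoids log(0)
}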
See this post for details.
