I am trying to analyze a movie file by splitting it up into camera shots and then trying to determine which shots are more important than others. One of the factors I am considering in a shot's importance is how loud the volume is during that part of the movie. To do this, I am analyzing the corresponding sound file. I'm having trouble determining how "loud" a shot is because I don't think I fully understand what the data in a WAV file represents.
I read the file into an audio buffer using a method similar to that described in this post.
Having already split the corresponding video file into shots, I am now trying to find which shots are louder than others in the WAV file. I am trying to do this by extracting each sample in the file like this:
double amplitude = (double)((audioData[i] & 0xff) | (audioData[i + 1] << 8));
Some of the other posts I have read seem to indicate that I need to apply a Fast Fourier Transform to this audio data to get the amplitude, which makes me wonder what the values I have extracted actually represent. Is what I'm doing correct? My sound file format is a 16-bit mono PCM with a sampling rate of 22,050 Hz. Should I be doing something with this 22,050 value when I am trying to analyze the volume of the file? Other posts suggest using Root Mean Square to evaluate loudness. Is this required, or just a more accurate way of doing it?
The more I look into this the more confused I get. If anyone could shed some light on my mistakes and misunderstandings, I would greatly appreciate it!
The FFT has nothing to do with volume and everything to do with frequencies. To find out how loud a scene is on average, simply average the sampled values. Depending on whether you get the data as signed or unsigned values in your language, you might have to apply an absolute function first so that negative amplitudes don't cancel out the positive ones, but that's pretty much it. If you don't get the results you were expecting, the problem is most likely in the way you extract the individual values from the byte buffer.
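As a rough sketch of the averaging idea, assuming the buffer holds 16-bit little-endian signed PCM as in the question: the class and method names here (MeanAmplitude, meanAbsolute) are made up for illustration, and the byte offsets of a shot are assumed to be known already.

```java
final class MeanAmplitude {
    /** Averages |sample| over the bytes in [startByte, endByte), 16-bit little-endian signed PCM. */
    static double meanAbsolute(byte[] audioData, int startByte, int endByte) {
        long sum = 0;
        int count = 0;
        for (int i = startByte; i + 1 < endByte; i += 2) {
            // The low byte is masked to stay unsigned; the high byte carries the sign.
            int sample = (audioData[i] & 0xff) | (audioData[i + 1] << 8);
            sum += Math.abs(sample);
            count++;
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

For the two samples 1000 and -1000 a plain average would be 0, while the absolute average is 1000, which is why the absolute value matters for signed data.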
That said, there are a few refinements that might or might not affect your task. Perceived loudness, amplitude and acoustic power are in fact related in non-linear ways, but as long as you are only trying to get a rough estimate of how much is "going on" in the audio signal, I doubt that this is relevant for you. And of course, humans hear different frequencies better or worse: for instance, bats emit ultrasound squeals that would be absolutely deafening to us, but luckily we can't hear them at all. But again, I doubt this is relevant to your task, since frequencies above half the sampling rate (about 11 kHz for your 22,050 Hz file, or about 22 kHz for a 44.1 kHz file) cannot be represented in PCM WAV data anyway.
I don't know the level of accuracy you want, but a simple RMS (and perhaps simple filtering of the signal) is all many similar applications would need.
RMS will be much better than Peak amplitude. Using peak amplitudes is like determining the brightness of an image based on the brightest pixel, rather than averaging.
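A minimal sketch of the RMS versus peak comparison, assuming the samples have already been decoded into a short[]. The names (Loudness, rmsForShot) are hypothetical; rmsForShot also shows one place where the sample rate comes in, namely converting a shot's start and end times into sample indices.

```java
final class Loudness {
    /** Root mean square of samples[from, to). */
    static double rms(short[] samples, int from, int to) {
        double sumSquares = 0.0;
        for (int i = from; i < to; i++) {
            double s = samples[i];
            sumSquares += s * s;
        }
        int n = to - from;
        return n <= 0 ? 0.0 : Math.sqrt(sumSquares / n);
    }

    /** Largest absolute sample value in samples[from, to). */
    static int peak(short[] samples, int from, int to) {
        int max = 0;
        for (int i = from; i < to; i++) {
            max = Math.max(max, Math.abs((int) samples[i]));
        }
        return max;
    }

    /** RMS of a shot given in seconds; the sample rate maps times to sample indices. */
    static double rmsForShot(short[] samples, double startSec, double endSec, double sampleRate) {
        int from = (int) Math.max(0, Math.min(samples.length, Math.round(startSec * sampleRate)));
        int to = (int) Math.max(from, Math.min(samples.length, Math.round(endSec * sampleRate)));
        return rms(samples, from, to);
    }
}
```

A single click in otherwise silent audio has a high peak but a low RMS, while a steady tone at half that level has a lower peak and a higher RMS, which matches the brightest-pixel analogy.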
If you want to filter the signal or weigh it to perceived loudness, then you would need the sample rate for that.
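To illustrate why filtering needs the sample rate, here is a sketch of a simple one-pole low-pass filter. It is only one possible filter, not necessarily what any given application uses, and the class name and the cutoff parameter are assumptions. The smoothing coefficient is derived from both the cutoff frequency and the sample rate.

```java
final class OnePoleLowPass {
    /** Applies a first-order (RC-style) low-pass filter with the given cutoff in Hz. */
    static double[] filter(double[] x, double cutoffHz, double sampleRate) {
        double dt = 1.0 / sampleRate;               // time between samples
        double rc = 1.0 / (2 * Math.PI * cutoffHz); // time constant for the cutoff
        double alpha = dt / (rc + dt);              // smoothing factor in (0, 1)
        double[] y = new double[x.length];
        double prev = 0.0;
        for (int i = 0; i < x.length; i++) {
            prev += alpha * (x[i] - prev);
            y[i] = prev;
        }
        return y;
    }
}
```

The same cutoff in Hz gives a different alpha at 22,050 Hz than at 44,100 Hz, so the filter cannot be set up correctly without knowing the rate.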
FFT should not be required unless you want to do complex frequency analysis as well. The ear does not respond linearly to sounds at different frequencies and amplitudes, so if you need that extra level of accuracy, you could use an FFT to analyze the frequency content and weight it accordingly.
So I tested downsampling a music track in Java using the javax.sound.sampled API.
First of all, we had the original mp3, which was converted to .wav so that Java's AudioFormat would accept it. Then I used AudioSystem.getAudioInputStream(AudioFormat targetFormat, AudioInputStream sourceStream) to downsample my .wav file.
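For readers who want to see the call in context, the following is a sketch of what such a conversion might look like. It is not the asker's actual code: the class name, the helper withSampleRate and the file arguments are assumptions. Whether a given sample-rate conversion is available depends on the installed Java Sound providers, so the sketch checks AudioSystem.isConversionSupported first.

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;
import java.io.File;
import java.io.IOException;

final class DownsampleSketch {
    /** Copies a PCM format, changing only the sample rate (and matching frame rate). */
    static AudioFormat withSampleRate(AudioFormat source, float newRate) {
        return new AudioFormat(source.getEncoding(), newRate, source.getSampleSizeInBits(),
                source.getChannels(), source.getFrameSize(), newRate, source.isBigEndian());
    }

    /** Reads a WAV file, converts it to newRate and writes the result as a WAV file. */
    static void downsample(File in, File out, float newRate)
            throws IOException, UnsupportedAudioFileException {
        try (AudioInputStream source = AudioSystem.getAudioInputStream(in)) {
            AudioFormat target = withSampleRate(source.getFormat(), newRate);
            if (!AudioSystem.isConversionSupported(target, source.getFormat())) {
                throw new IllegalArgumentException("conversion not supported: " + target);
            }
            try (AudioInputStream converted = AudioSystem.getAudioInputStream(target, source)) {
                AudioSystem.write(converted, AudioFileFormat.Type.WAVE, out);
            }
        }
    }
}
```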
Here you can see the original mp3 file in Audacity:
After converting it by using JLayer and applying Javas Sound API on it, the downsampled track looked like this:
However, by using another program, dBPoweramp, it looked like this:
You can see that the amplitudes of the wave are higher than in the version I downsampled with Java.
Therefore it sounds louder and a bit more like the original .mp3 file, whereas my own file sounds very quiet compared to the original.
Now my questions:
How can I achieve this effect? Is it better to have higher amplitudes, or are they just cut off, as you can see in the picture of the track sampled by dBPoweramp? Why is there any difference anyway?
I'm not entirely sure what you mean by quality here, but it's no surprise whatsoever that the nature of a downsampled signal will be different from the original, as it will have been filtered to remove frequencies above the Nyquist frequency (half the new sample rate).
We would therefore expect the gain of the down-sampled signal to be lower than that of the original. From that perspective, the signal produced by JLayer looks far more plausible than that of dbPoweramp.
The second graph appears to have higher gain than the original, so I suspect that there is make-up gain applied, and possibly dynamic range compression and brick-wall limiting (this signal has periods which appear to have peaks at the limit). Or worse, it's simply clipped.
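One crude way to check the clipping suspicion, sketched here under the assumption of decoded 16-bit samples, is to look for runs of consecutive samples pinned at full scale. The class name and the idea of using run length as an indicator are illustrative only; a brick-wall limiter set just below full scale would not be caught this way.

```java
final class ClipCheck {
    /** Length of the longest run of consecutive samples sitting at 16-bit full scale. */
    static int longestFullScaleRun(short[] samples) {
        int longest = 0;
        int run = 0;
        for (short s : samples) {
            if (s == Short.MAX_VALUE || s == Short.MIN_VALUE) {
                run++;
                longest = Math.max(longest, run);
            } else {
                run = 0;
            }
        }
        return longest;
    }
}
```

An isolated full-scale sample can occur in clean audio, but long runs of them usually mean the waveform tops were flattened.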
And this brings us back to the definition of quality: It's subjective. Lots of commercial music is heavily compressed and brick-wall limited as part of the production process for a whole variety of reasons. One of which is artistic effect. It seems you're getting more of this from dbPoweramp, and this may well be flattering to your taste and the content.
It is probably not a clean conversion. In any objective measurement of system performance (e.g. PSNR), the quality would be lower.
If objective quality is what you're after, much better performance is achieved by rendering the mp3 directly at a lower sample rate rather than decoding to PCM and then downsampling.
As a final word of caution: Audacity is doing some processing on the signal in order to render the time-amplitude graph. I suspect it is showing the max amplitude per point on the x-axis.
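The kind of reduction suspected here can be sketched as follows. This is a guess at the general idea, not Audacity's actual rendering code, and the names are hypothetical: every pixel column shows the largest absolute sample among the samples that fall into it.

```java
final class WaveformColumns {
    /** Reduces samples to one value per display column: the max |sample| in that column. */
    static int[] maxPerColumn(short[] samples, int columns) {
        int[] peaks = new int[columns];
        for (int c = 0; c < columns; c++) {
            // Long arithmetic avoids overflow for long recordings.
            int from = (int) ((long) c * samples.length / columns);
            int to = (int) ((long) (c + 1) * samples.length / columns);
            for (int i = from; i < to; i++) {
                peaks[c] = Math.max(peaks[c], Math.abs((int) samples[i]));
            }
        }
        return peaks;
    }
}
```

Because each column keeps only its maximum, two tracks with very different average levels can look similar if both contain occasional loud samples.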
I am trying to make a music visualizer in Processing, not that that part is super important, and I'm using a fast Fourier transform through Minim. It's working perfectly (reading the data), but there is a large spike on the left (bass) end. What's the best way to 'level' this out?
My source code is here, if you want to take a look.
Thanks in advance,
The spectrum you show looks fairly typical of a complex musical sound where you have a complex section at lower frequencies, but also some clear harmonics emerging from the low frequency mess. And, actually, these harmonics are atypically clear... music in general is complicated. Sometimes, for example, if a flute is playing a single clear note one will get a single nice peak or two, but it's much more common that transients and percussive sounds lead to a very complicated spectrum, especially at low frequencies.
For comparing directly to the video, it seems to me that the video is a bit odd. My guess is that the spectrum they show is either a zoom in a small section of the spectrum far from zero, or that it's just a graphical algorithm that's based off the music but doesn't correspond to an actual spectrum. That is, if you really want something to look very similar to this video, you'll need more than the spectrum, though the spectrum will likely be a good starting point. Here are a few points to note:
1) there is a prominent peak which occasionally appears right above the "N" in the word anchor. A single dominant peak should be clear in the audio as an approximately pure tone.
2) occasionally there's another peak that varies temporally with this peak, which would normally be a sign that the second peak is a harmonic, but many times this second peak isn't there.
3) A good example of odd behavior is at 2:26. This moment just follows a little laser sound effect, and then there's basically a quiet hiss. A hiss should be a broad spectrum sound without peaks, often weighted to lower frequencies. At 2:26, though, there's just this single large peak above the "N" with nothing else.
It turns out what I had to do was multiply the data by
Math.log(i + 2) / 3
where i is the index of the data being referenced, zero-indexed from the left (bass).
You can see this in context here
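Applied to a whole spectrum array, that weighting might look like the sketch below. The class and method names are made up, and the float[] input stands in for whatever per-band values the FFT library returns.

```java
final class BassLeveling {
    /** Scales each band by Math.log(i + 2) / 3, where i is the band index from the bass end. */
    static float[] level(float[] spectrum) {
        float[] out = new float[spectrum.length];
        for (int i = 0; i < spectrum.length; i++) {
            out[i] = (float) (spectrum[i] * Math.log(i + 2) / 3);
        }
        return out;
    }
}
```

The factor is about 0.23 for the first band and grows slowly with the index, passing 1 around band 18, so the bass end is attenuated and the higher bands are gently boosted.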
I'm making a program for Active Noise Control (also called Adaptive instead of Active, or Cancellation instead of Control).
The system is pretty simple.
get sound via mic
turn the sound into data, which I can read(Something like Integer array)
make antiphase of the sound.
turn the data into sound file
Following are my questions:
Can I read sound as an Integer array?
If I can use an Integer array, how can I make the antiphase? Just multiply every value by -1?
Any useful thoughts about my project?
Is there any recommended language other than Java?
I heard that Stack Overflow has many top-class programmers, so I expect critical answers :D
Answering your questions:
(1) When you read sound, a byte array is returned. The bytes can readily be decoded into integers, shorts, floats, whatever. Java supports many common formats, and probably has one that matches your microphone input and speaker output. For example, Java supports 16-bit encoding, stereo, 44100 fps, which is considered the standard for CD-quality. There are several questions already at StackOverflow that show the coding for the decoding and recoding back to bytes.
(2) Yes, just multiply every element of your PCM array by -1. When you add the negated signal to its correctly lined-up counterpart, the result is 0.
(3 & 4) I don't know what the tolerances are for lag time! I think if you simply take the input, decode, multiply by -1, recode, and output, it might be possible to get a very small amount of processing time. I don't know what Java is capable of here, but I bet it will be on the scale of a dozen millis, give or take. How much is enough for cancellation? How far does the sound travel from mike to speaker location? How much time does that allow? (Or am I missing something about how this works? I haven't done this sort of thing before.)
Java is pretty darn fast, and you will be relatively close to the native code level with the reading and writing and simple numeric conversions. The core code (for testing) could probably be written in an afternoon, using the following tutorial examples as a template: Reading/Writing sound files, see code snippets. I'd pay particular attention to the spot where the comment reads "Here do something useful with the audio data that is in the bytes array..." At this point, you would put the code to convert the bytes to PCM sample values, multiply by -1, then convert back to bytes.
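A minimal sketch of that decode, negate, re-encode step, assuming 16-bit little-endian signed PCM (other formats need different decoding). The class name is hypothetical. One detail worth noting: the most negative 16-bit value, -32768, has no positive counterpart, so the sketch clamps it to 32767.

```java
final class Antiphase {
    /** Returns a copy of 16-bit little-endian PCM data with every sample negated. */
    static byte[] invert(byte[] pcm) {
        byte[] out = new byte[pcm.length];
        for (int i = 0; i + 1 < pcm.length; i += 2) {
            int sample = (pcm[i] & 0xff) | (pcm[i + 1] << 8);
            // -(-32768) does not fit in 16 bits, so clamp to the largest positive value.
            int inverted = Math.min(-sample, Short.MAX_VALUE);
            out[i] = (byte) inverted;            // low byte
            out[i + 1] = (byte) (inverted >> 8); // high byte
        }
        return out; // a trailing odd byte, if any, is left as 0
    }
}
```

Adding each original sample to its inverted counterpart gives 0, except for the clamped -32768 case, where the sum is -1.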
If Java doesn't prove fast enough, I assume the next thing to try would be some flavor of C.
I am trying to do spectral analysis on a WAV file using the jTransforms library: Official Site
But I have problems on how to convert the WAV file into an acceptable input for FFT using jTransforms, and how can I display a frequency spectrum after FFT? I have searched around Google and found I need to somehow convert the WAV file into a double[] or Complex[], and afterwards how should I interpret the output?
Sorry I am very new to FFT so this question may sound extremely stupid. Many thanks!
I don't know your library, but I guess they have extensive documentation on how to apply the transforms.
Regarding the interpretation: with a complex transform, each frequency bin is a complex number. Its magnitude, sqrt(re^2 + im^2), is the amplitude of the sinusoid at that bin's frequency, and its angle, atan2(im, re), is the phase.
The power spectral density (PSD) can be computed by
P[k] = |X[k]|^2
which is equal to
P[k] = X[k] * conj(X[k]) = re[k]^2 + im[k]^2
(so multiply each complex bin by its complex conjugate).
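The per-bin quantities can be sketched as below. To keep the example self-contained it uses a naive O(n^2) DFT rather than any particular FFT library, so the array layout (separate real and imaginary arrays) is an assumption and will differ from what a real library such as JTransforms returns.

```java
final class SpectrumBins {
    /** Naive DFT of a real signal; returns {re, im}. A stand-in for a library FFT. */
    static double[][] dft(double[] x) {
        int n = x.length;
        double[] re = new double[n];
        double[] im = new double[n];
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2 * Math.PI * k * t / n;
                re[k] += x[t] * Math.cos(angle);
                im[k] += x[t] * Math.sin(angle);
            }
        }
        return new double[][] {re, im};
    }

    /** Amplitude of a bin (unscaled). */
    static double magnitude(double re, double im) {
        return Math.hypot(re, im);
    }

    /** Phase of a bin in radians. */
    static double phase(double re, double im) {
        return Math.atan2(im, re);
    }

    /** Power of a bin: the bin times its complex conjugate. */
    static double power(double re, double im) {
        return re * re + im * im;
    }
}
```

For an 8-sample cosine with exactly one cycle, the unscaled power at bin 1 is 16, which is (8/2)^2; dividing by the transform size as described in the text brings this back toward the signal's own scale.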
One thing you might have to consider is rescaling your fft output. Some algorithms scale the output proportional to the fftSize. So you will have to multiply the output by 1/fftSize.
And the last thing in case you are not aware of, you only have to take half of the fft output since the spectrum is symmetric.
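Bin 0 (fftData[0]) is the DC component, and the middle bin (fftSize/2) corresponds to the Nyquist frequency, which is half the sampling rate and the highest frequency you can analyze. In general, bin k corresponds to the frequency k * sampleRate / fftSize.
So if you want to display frequencies up to 22 kHz, make sure your sampling rate is at least 44 kHz. The fftSize only determines the frequency resolution, which is sampleRate / fftSize per bin.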
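Under the standard DFT convention, bin k of an fftSize-point transform corresponds to k * sampleRate / fftSize Hz, so the top of the displayable range is set by the sampling rate and the spacing between bins by the transform size. A small sketch with hypothetical helper names:

```java
final class FftBins {
    /** Center frequency in Hz of the given bin. */
    static double binFrequency(int bin, double sampleRate, int fftSize) {
        return bin * sampleRate / fftSize;
    }

    /** Index of the bin closest to the given frequency. */
    static int binForFrequency(double hz, double sampleRate, int fftSize) {
        return (int) Math.round(hz * fftSize / sampleRate);
    }
}
```

With a 44,100 Hz sampling rate and a 1024-point transform, bin 512 sits at 22,050 Hz and the bins are about 43 Hz apart.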
There are many pitfalls with FFT, so be sure you read up on some parts and understand what you are doing there. The mathematics itself are not that important if you just want to use it, so you might skip them.
EDIT: There is even more. Consider weighting your input data with a tapered window (Gaussian, Hamming, Hann...) to avoid nasty edge effects if you don't feed your whole wav file as input. Otherwise you will get artificial high frequencies in your FFT output which are simply not present in the original.
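As an example of such a taper, here is a sketch of the Hann window applied to one frame before it is handed to the FFT. The class name is made up, and the symmetric form with N - 1 in the denominator is one of two common conventions.

```java
final class HannWindow {
    /** Returns a copy of the frame multiplied by a symmetric Hann window. */
    static double[] apply(double[] frame) {
        int n = frame.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            // The weight rises from 0 at the edges to 1 in the middle of the frame.
            double w = n > 1 ? 0.5 * (1 - Math.cos(2 * Math.PI * i / (n - 1))) : 1.0;
            out[i] = frame[i] * w;
        }
        return out;
    }
}
```

Because the frame fades to zero at both ends, the abrupt jump at the frame boundary that causes the artificial high frequencies is removed.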
Is there any way to analyze audio pitches programmatically? For example, I know most players show a graph or bar, and if the song's pitch is high at time t, the bar goes up at time t, something like this. Is there any utility/tool/API to determine a song's pitch so that we can map it to a bar which goes up and down?
Thanks for any help
Naive but robust: transform a modest length segment into Fourier space and find the peaks. Repeat as necessary.
Speed may be an issue, so choose the segment length as a power of 2 so that you can use the Fast Fourier Transform which is, well, fast.
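A self-contained sketch of the naive approach, assuming a mono segment whose length is a power of two. It uses a textbook recursive radix-2 FFT written out for illustration (a library FFT would normally be used instead), and all names are hypothetical. Note that the strongest spectral peak is only a rough proxy for perceived pitch.

```java
final class PitchPeak {
    /** In-place recursive radix-2 FFT; the length must be a power of two. */
    static void fft(double[] re, double[] im) {
        int n = re.length;
        if (n == 1) return;
        if (Integer.bitCount(n) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        int half = n / 2;
        double[] evenRe = new double[half], evenIm = new double[half];
        double[] oddRe = new double[half], oddIm = new double[half];
        for (int i = 0; i < half; i++) {
            evenRe[i] = re[2 * i];
            evenIm[i] = im[2 * i];
            oddRe[i] = re[2 * i + 1];
            oddIm[i] = im[2 * i + 1];
        }
        fft(evenRe, evenIm);
        fft(oddRe, oddIm);
        for (int k = 0; k < half; k++) {
            double angle = -2 * Math.PI * k / n;
            double wr = Math.cos(angle), wi = Math.sin(angle);
            double tr = wr * oddRe[k] - wi * oddIm[k]; // twiddle factor times odd part
            double ti = wr * oddIm[k] + wi * oddRe[k];
            re[k] = evenRe[k] + tr;
            im[k] = evenIm[k] + ti;
            re[k + half] = evenRe[k] - tr;
            im[k + half] = evenIm[k] - ti;
        }
    }

    /** Frequency in Hz of the strongest non-DC bin in the segment. */
    static double dominantFrequency(double[] segment, double sampleRate) {
        if (segment.length < 2) throw new IllegalArgumentException("segment too short");
        double[] re = segment.clone();
        double[] im = new double[re.length];
        fft(re, im);
        int best = 1;
        double bestPower = -1.0;
        for (int k = 1; k <= re.length / 2; k++) { // skip DC, ignore the mirrored half
            double p = re[k] * re[k] + im[k] * im[k];
            if (p > bestPower) {
                bestPower = p;
                best = k;
            }
        }
        return best * sampleRate / re.length;
    }
}
```

Repeating this on successive segments gives one peak frequency per segment, which could drive a bar that moves up and down over time.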
Lots of related stuff on SO already. Try: https://stackoverflow.com/search?q=Fourier+transform
Well, unfortunately I'm not really an expert on audio with the iPhone, but I can point you towards a couple good resources.
Core Audio is probably going to be a big thing in what you want to do: htp://developer.apple.com/iphone/library/documentation/MusicAudio/Conceptual/CoreAudioOverview/Introduction/Introduction.html
As well, the Audio Toolbox may be of some help: htp://developer.apple.com/iphone/library/navigation/Frameworks/Media/AudioToolbox/index.html
If you have a developer account, there are plenty of people on the forums that can help you: htps://devforums.apple.com/community/iphone
You'll have to add in a 't' in the http portion of those URLs, as I cannot post more than one hyperlink (sorry!).
To find the current pitch of a song, you need to learn about the Discrete Time Fourier Transform. To find the tempo, you need autocorrelation.
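As a rough illustration of the autocorrelation idea for tempo: the sketch below assumes a loudness envelope has already been computed at some known rate (for example one RMS value per short block), and looks for the lag at which the envelope best matches a shifted copy of itself. The names, the envelope input and the lag range are all assumptions.

```java
final class TempoLag {
    /** Lag in [minLag, maxLag] with the highest autocorrelation of the envelope. */
    static int bestLag(double[] envelope, int minLag, int maxLag) {
        int best = minLag;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = minLag; lag <= maxLag && lag < envelope.length; lag++) {
            double score = 0.0;
            for (int i = 0; i + lag < envelope.length; i++) {
                score += envelope[i] * envelope[i + lag];
            }
            if (score > bestScore) {
                bestScore = score;
                best = lag;
            }
        }
        return best;
    }

    /** Converts a lag in envelope samples to beats per minute. */
    static double bpm(int lag, double envelopeRate) {
        return 60.0 * envelopeRate / lag;
    }
}
```

An envelope sampled 20 times per second with a pulse every 10 samples gives a best lag of 10, which corresponds to 120 beats per minute. Restricting the lag range matters, because multiples of the true period also correlate well.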
I think what you may be speaking of is a graphic equalizer, which displays the amplitude of different frequency ranges at a given time in an audio signal. It is normally equipped with controls to modify the amplitudes within the given frequency ranges. Here's an example. Is that sort of what you're thinking of?
EDIT: Also, your numerous tags don't really give any indication of what language you might be using here, so I can't really suggest any specific techniques or libraries.