The goal is to get a simple 2D audio visualizer that is somewhat responsive to the music.
I've got the basics set up, where I have graphics that will respond to some data being fed in. Given a file, I load an AudioInputStream for playback (this works fine) and run it in a thread. In another thread, I would like to extract byte data at a rate close to the playback rate (or perhaps faster, to allow for delay in processing that data). I then want to feed that to an FFT process, and feed the resulting data to my graphics object, which will use it as a parameter for whatever the visualization is.
I have two questions for this process:
1) How can I get the byte data and process it at a rate that will match the normal playback of a file? Is using an AudioInputStream the way to go here?
2) Once I do the FFT, what's a good way to get usable data (e.g. a power spectrum? Somehow filtering out certain frequencies? etc.)?
Some considerations about (2), using the FFT to extract "features":
You should calculate short-term FFTs of, for example, 512 points whenever there are enough free CPU cycles to do so. For a visualisation it is not necessary to preserve all the information (i.e. to work with overlapping windows); instead you could calculate a 100 ms FFT five times per second.
Then you should calculate the logarithmic power spectrum in dB (decibel).
This gives you a pretty good impression about the detailed frequency content of your sound.
Depending on what you would like to visualize, you could for example combine some low-frequency FFT bins (calculate their RMS) to get the "bass" content of your sound, and so on.
See this post for details.
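To make that concrete, here is a minimal sketch, not tied to any particular FFT library; the re/im array names and the number of "bass" bins are illustrative assumptions rather than anything prescribed above.

public class SpectrumFeatures {

    // Power spectrum in dB for the first N/2 bins of a complex FFT result.
    public static double[] powerSpectrumDb(double[] re, double[] im) {
        double[] db = new double[re.length / 2];
        for (int k = 0; k < db.length; k++) {
            double power = re[k] * re[k] + im[k] * im[k];
            db[k] = 10.0 * Math.log10(power + 1e-12); // small offset avoids log(0)
        }
        return db;
    }

    // RMS magnitude of bins 1..bassBins (bin 0 is DC); a crude "bass" feature.
    public static double bassRms(double[] re, double[] im, int bassBins) {
        double sumSquares = 0;
        for (int k = 1; k <= bassBins; k++) {
            sumSquares += re[k] * re[k] + im[k] * im[k];
        }
        return Math.sqrt(sumSquares / bassBins);
    }
}

With a 512-point FFT at 44.1 kHz each bin is roughly 86 Hz wide, so the first handful of bins already covers the bass range.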
I am writing an audio stream via WebSocket for a telephony application. Just before the audio starts playing there is a distinct 'click'. Upon further research, I came across the following question on SO:
WebAudio play sound pops at start and end
The accepted answer in the above question says to use the exponentialRampToValueAtTime API to remove the said noise. I am implementing my service in Java and do not have access to that specific API. How do I go about implementing an exponentialRampToValueAtTime-style method to attenuate the noise in Java?
I wrote code to handle the clicking problem for some sound-based applications. I don't know whether the algorithm I came up with is considered robust, but it seems to be working; I'd categorize it as "reinventing the wheel". The algorithm uses a linear ramp, not an exponential one, but it shouldn't be too hard to make the increments exponential instead.
Basic overview:
Obtain the byte data from the line
Convert the byte data to PCM
For the starting 1024 PCM frames (a smaller number may be fine, especially if an exponential ramp is used instead of a linear one), multiply each frame by a sequence that progresses from 0 to 1.
I use the following linear formula and have gotten satisfactory results.
for (int n = 0; n < 1024; n++) {
    pcmValue[n] *= n / 1024f; // float divisor, otherwise integer division always gives 0
}
Convert the PCM back to bytes and ship it.
This only has to be done for the starts or the ends (algorithm in reverse).
For exponential, I'm guessing something like the following might work:
pcmValue[n] *= (Math.pow(2, n / 1024.0) - 1); // again with a floating-point divisor
A function related to decibels might have even better-spaced increments. The better the spacing, the fewer PCM frames are needed to prevent the click.
In AudioCue, there is a need to ensure smooth transitions when a sound receives real-time commands to change volume. I use the same basic idea described above, but with a linear ramp between the two volume levels. Code for handling the increments can be seen at lines 1200, 892 and 1302. Using a linear ramp allows for a one-time calculation followed by simple addition for the individual PCM frames. But as I said, I wrote this from scratch and a better-schooled or more experienced audio engineer will likely have further insights or improvements.
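For completeness, here is a minimal end-to-end sketch of the byte to PCM to ramp to byte steps from the overview above, assuming 16-bit little-endian mono data; the 1024-frame linear ramp matches the example numbers used earlier, while the class and method names are just illustrative.

public class FadeIn {

    private static final int RAMP_FRAMES = 1024;

    // Applies a linear fade-in, in place, to the start of a 16-bit little-endian mono buffer.
    public static void applyLinearFadeIn(byte[] audioBytes) {
        int frames = Math.min(RAMP_FRAMES, audioBytes.length / 2);
        for (int n = 0; n < frames; n++) {
            // bytes -> PCM (signed 16-bit, little-endian)
            int lo = audioBytes[2 * n] & 0xff;
            int hi = audioBytes[2 * n + 1]; // sign-extends automatically
            float pcm = (hi << 8) | lo;

            // linear ramp from 0 toward 1
            pcm *= n / (float) RAMP_FRAMES;

            // PCM -> bytes
            int v = Math.round(pcm);
            audioBytes[2 * n] = (byte) (v & 0xff);
            audioBytes[2 * n + 1] = (byte) ((v >> 8) & 0xff);
        }
    }
}

Running the same loop backwards over the last RAMP_FRAMES frames handles the click at the end.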
On one side, with my Android smartphone, I'm recording an audio stream using AudioRecord.read(). For the recording I'm using the following specs:
SampleRate: 44100 Hz
MonoChannel
PCM-16Bit
size of the array I use for AudioRecord.read(): 100 (short array)
Using this small size allows me to read every 0.5 ms (mean value), so I can use this timestamp later for the multilateration (at least I think so :-) ). Maybe this will be obsolete if I can use cross-correlation to determine the TDoA?! (see below)
On the other side I have three speakers emitting different sounds using the Web Audio API, with the following specs:
freq1: 17500 Hz
freq2: 18500 Hz
freq3: 19500 Hz
signal length: 200 ms, plus a fade-in and fade-out of the gain node of 5 ms each, so 210 ms in total
My goal is to determine the time difference of arrival (TDoA) between the emitted sounds. So in each iteration I read 100 samples from my AudioRecord buffer and then I want to determine the time difference (if I find one of my sounds). So far I've used a simple frequency filter (using an FFT) to determine the TDoA, but this is really inaccurate in the real world.
So far I've found out that I can use cross-correlation to determine the TDoA even better (http://paulbourke.net/miscellaneous/correlate/ and some threads here on SO). Now my problem: at the moment I think I have to correlate the recorded signal (my short array) with a generated reference signal for each of my three sounds above. But I'm struggling to generate this signal. The code found at http://repository.tudelft.nl/view/ir/uuid%3Ab6c16565-cac8-448d-a460-224617a35ae1/ (section B1.1, genTone()) does not really solve my problem, because it generates an array far bigger than my recorded samples, and as far as I know cross-correlation needs two arrays of the same size to work. So how can I generate a suitable sample array?
Another question: is my thinking about how to determine the TDoA correct so far?
Here are some lessons I've learned over the past few days:
I can either use cross-correlation (xcorr) or a frequency-recognition technique to determine the TDoA. The latter is far more imprecise, so I focus on the xcorr.
I can obtain the TDoA by applying the xcorr to my recorded signal and two reference signals. E.g. my recording has a length of 1000 samples. With the xcorr I recognize sound A at sample 500 and sound B at sample 600, so I know they have a time difference of 100 samples (which can be converted to seconds using the sample rate).
Therefore I generate a linear chirp (chirps are better than simple sine waves (see the literature)) using code found on SO. For an easy example, and to check whether my experiment seems to work, I save my recording as well as my generated chirp sounds as .wav files (there are plenty of code examples for how to do this). Then I use MATLAB as an easy way to calculate the xcorr: see here
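For illustration, here is a rough chirp generator of my own (this is not the genTone() code from the linked thesis; the parameters are placeholders you would set to match your 44.1 kHz / ~200 ms setup):

public class ChirpGenerator {

    // Linear chirp sweeping from f0 to f1 Hz over durationMs, as 16-bit PCM samples.
    public static short[] generateChirp(double f0, double f1, int durationMs, int sampleRate) {
        int numSamples = sampleRate * durationMs / 1000;
        double duration = durationMs / 1000.0;
        short[] chirp = new short[numSamples];
        for (int n = 0; n < numSamples; n++) {
            double t = n / (double) sampleRate;
            // instantaneous phase of a linear chirp: 2*pi*(f0*t + (f1 - f0)*t^2 / (2*T))
            double phase = 2 * Math.PI * (f0 * t + (f1 - f0) * t * t / (2 * duration));
            chirp[n] = (short) (0.8 * Short.MAX_VALUE * Math.sin(phase)); // leave some headroom
        }
        return chirp;
    }
}

For example, generateChirp(17000, 18000, 200, 44100) gives an 8,820-sample reference (the exact frequency band is up to you), which can then be zero-padded to the recording length before correlating.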
Another point: "do the inputs of the xcorr have to be the same size?" I'm not quite sure about this part, but I think so. We can achieve this by zero-padding the two signals to the same length (preferably a power of two, so we can use the efficient radix-2 implementation of the FFT) and then using the FFT to calculate the xcorr (see another link from SO).
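To make the idea explicit, here is a brute-force time-domain sketch of the correlation step (the FFT-based route from the linked post computes the same lags much faster for long signals; the names here are illustrative):

public class CrossCorrelation {

    // Returns the lag (in samples) at which the reference best aligns with the recording.
    // Brute-force O(N*M); an FFT-based xcorr gives the same answer more efficiently.
    public static int bestLag(short[] recording, short[] reference) {
        int bestLag = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = 0; lag <= recording.length - reference.length; lag++) {
            double score = 0;
            for (int n = 0; n < reference.length; n++) {
                score += (double) recording[lag + n] * reference[n];
            }
            if (score > bestScore) {
                bestScore = score;
                bestLag = lag;
            }
        }
        return bestLag;
    }
}

The TDoA is then the difference between the best lags found for the two reference signals, divided by the sample rate to get seconds.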
I hope this is correct so far and covers some questions other people may have :-)
So I tested downsampling a music track in Java by using the javax.sound API.
First of all we had the original mp3, which was then converted to .wav so that it would work with Java's AudioFormat. Then I used AudioSystem.getAudioInputStream(AudioFormat targetFormat, AudioInputStream sourceStream) to downsample my .wav file.
Here you can see the original mp3 file in Audacity:
After converting it with JLayer and applying Java's Sound API to it, the downsampled track looked like this:
However, by using another program, dBPoweramp, it looked like this:
You can see that the amplitudes of the wave are higher than in the version I downsampled with Java.
Therefore it sounds louder and a bit more like the original .mp3 file, whereas my own file sounds very quiet compared to the original.
Now my questions:
How can I achieve this effect? Is it better to have higher amplitudes, or are they just cut off, as you can see in the picture downsampled by dBPoweramp? Why is there any difference anyway?
I'm not entirely sure what you mean by quality here, but it's no surprise whatsoever that the nature of a downsampled signal will be different from the original, as it will have been filtered to remove frequencies that would violate the Nyquist criterion at the new sample rate.
We would therefore expect the gain of the downsampled signal to be lower than that of the original. From that perspective, the signal produced by JLayer looks far more plausible than that of dBPoweramp.
The second graph appears to have higher gain than the original, so I suspect that there is make-up gain applied, and possibly dynamic range compression and brick-wall limiting (this signal has periods which appear to have peaks at the limit). Or worse, it's simply clipped.
And this brings us back to the definition of quality: it's subjective. Lots of commercial music is heavily compressed and brick-wall limited as part of the production process for a whole variety of reasons, one of which is artistic effect. It seems you're getting more of this from dBPoweramp, and this may well be flattering to your taste and the content.
It is probably not a clean conversion. In any objective measurement of system performance (e.g. PSNR), the quality would be lower.
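As a hedged sketch of what such an objective comparison might look like on two equal-length, time-aligned PCM buffers (real measurement tools also handle alignment and resampling, which this ignores):

public class ObjectiveQuality {

    // Root-mean-square level of a PCM buffer.
    public static double rms(double[] samples) {
        double sum = 0;
        for (double s : samples) {
            sum += s * s;
        }
        return Math.sqrt(sum / samples.length);
    }

    // PSNR in dB between a reference signal and a processed copy of the same length.
    public static double psnrDb(double[] reference, double[] processed, double peak) {
        double mse = 0;
        for (int i = 0; i < reference.length; i++) {
            double e = reference[i] - processed[i];
            mse += e * e;
        }
        mse /= reference.length;
        return 10.0 * Math.log10(peak * peak / mse);
    }
}

Comparing the RMS levels of the two downsampled versions would already tell you roughly how much make-up gain has been applied.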
If objective quality is what you're after, much better performance is achieved by decoding the mp3 directly at the lower sample rate rather than decoding to PCM and then downsampling.
As a final word of caution: Audacity does some processing on the signal in order to render the time-amplitude graph. I suspect it is showing the maximum amplitude per point on the x-axis.
I am trying to analyze a movie file by splitting it up into camera shots and then trying to determine which shots are more important than others. One of the factors I am considering in a shot's importance is how loud the volume is during that part of the movie. To do this, I am analyzing the corresponding sound file. I'm having trouble determining how "loud" a shot is because I don't think I fully understand what the data in a WAV file represents.
I read the file into an audio buffer using a method similar to that described in this post.
Having already split the corresponding video file into shots, I am now trying to find which shots are louder than others in the WAV file. I am trying to do this by extracting each sample in the file like this:
double amplitude = (double)((audioData[i] & 0xff) | (audioData[i + 1] << 8));
Some of the other posts I have read seem to indicate that I need to apply a Fast Fourier Transform to this audio data to get the amplitude, which makes me wonder what the values I have extracted actually represent. Is what I'm doing correct? My sound file format is a 16-bit mono PCM with a sampling rate of 22,050 Hz. Should I be doing something with this 22,050 value when I am trying to analyze the volume of the file? Other posts suggest using Root Mean Square to evaluate loudness. Is this required, or just a more accurate way of doing it?
The more I look into this the more confused I get. If anyone could shed some light on my mistakes and misunderstandings, I would greatly appreciate it!
The FFT has nothing to do with volume and everything to do with frequencies. To find out how loud a scene is on average, simply average the sampled values. Depending on whether you get the data as signed or unsigned values in your language, you might have to apply an absolute function first so that negative amplitudes don't cancel out the positive ones, but that's pretty much it. If you don't get the results you were expecting that must have to do with the way you are extracting the individual values in line 20.
That said, there are a few refinements that might or might not affect your task. Perceived loudness, amplitude and acoustic power are in fact related in non-linear ways, but as long as you are only trying to get a rough estimate of how much is "going on" in the audio signal, I doubt that this is relevant for you. And of course, humans hear different frequencies better or worse - for instance, bats emit ultrasound squeals that would be absolutely deafening to us, but luckily we can't hear them at all. But again, I doubt this is relevant to your task, since e.g. frequencies above 22 kHz (or was it 44 kHz? not sure which) are in fact not representable in a simple WAV format.
I don't know the level of accuracy you want, but a simple RMS (and perhaps simple filtering of the signal) is all many similar applications would need.
RMS will be much better than peak amplitude. Using peak amplitudes is like determining the brightness of an image from its brightest pixel, rather than by averaging.
If you want to filter the signal or weigh it to perceived loudness, then you would need the sample rate for that.
An FFT should not be required unless you want to do complex frequency analysis as well. The ear does not respond linearly to sounds at different frequencies and amplitudes, so if you need that extra degree of accuracy you could use an FFT to perform a frequency analysis too.
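Pulling the answers together, here is a minimal sketch of an RMS loudness measure over a slice of the question's 16-bit little-endian mono data (the byte-to-sample conversion mirrors the line quoted in the question; the byte range would come from your shot boundaries):

public class ShotLoudness {

    // RMS loudness of 16-bit little-endian mono PCM between two byte offsets.
    public static double rmsOfRange(byte[] audioData, int startByte, int endByte) {
        double sumSquares = 0;
        int frames = 0;
        for (int i = startByte; i + 1 < endByte; i += 2) {
            int sample = (audioData[i] & 0xff) | (audioData[i + 1] << 8); // signed 16-bit LE
            sumSquares += (double) sample * sample;
            frames++;
        }
        return Math.sqrt(sumSquares / frames);
    }
}

At the question's 22,050 Hz, 16-bit mono format there are two bytes per frame, so a shot running from t0 to t1 seconds spans bytes t0 * 22050 * 2 to t1 * 22050 * 2.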
I'm trying to do a simple music visualization in Java. I have two threads set up: one for playing the clip, and another for extracting a chunk of bytes from the clip to process with an FFT. The processed array can then be sent to the JFrame that handles drawing, and used as a parameter for some sort of visual.
I'm not exactly sure what to do with the data, however. I've just been using a power spectrum for now, which gives me a very limited response, and which I realize is too general for what I am trying to do. I'm open to using any FFT library out there if there is a specific one that will be especially helpful. But, in general, what can I get from my data after doing an FFT, and how can I use it to show decently accurate results in the visuals?
All FFTs will do pretty much the same thing given the same data. The FFT parameters you can vary are the scale factor, the length of the FFT (longer gives you higher frequency resolution, shorter gives you better time response), and (pre)windowing the data, which reduces the "splatter", or spectral leakage, of spectral peaks. You can zero-pad an FFT to interpolate smoother-looking results. You can average the magnitude results of several successive FFTs to reduce the noise floor. You can also use a scaling function such as log scaling (or log-log, i.e. log on both axes) for presenting the FFT magnitude results.
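As a rough sketch of two of those knobs: a Hann window applied before the FFT, and a simple magnitude average over successive frames converted to dB (the magnitude arrays are assumed to come from whichever FFT library you pick):

public class SpectrumSmoothing {

    // Applies a Hann window in place before the FFT to reduce spectral leakage.
    public static void hannWindow(double[] frame) {
        for (int n = 0; n < frame.length; n++) {
            frame[n] *= 0.5 * (1 - Math.cos(2 * Math.PI * n / (frame.length - 1)));
        }
    }

    // Averages the magnitude spectra of several frames and converts to dB;
    // averaging lowers the noise floor and makes the visualization steadier.
    public static double[] averagedDb(double[][] magnitudes) {
        double[] avg = new double[magnitudes[0].length];
        for (double[] frame : magnitudes) {
            for (int k = 0; k < avg.length; k++) {
                avg[k] += frame[k] / magnitudes.length;
            }
        }
        double[] db = new double[avg.length];
        for (int k = 0; k < avg.length; k++) {
            db[k] = 20.0 * Math.log10(avg[k] + 1e-12);
        }
        return db;
    }
}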
The phase of a complex FFT is usually unimportant for any visualization unless you are doing some type of phase vocoder analysis+resynthesis.