So I tested downsampling a music track in Java using the javax.sound API.
First of all we had the original mp3, which was then converted to .wav so that Java's AudioFormat could handle it. Then I used AudioSystem.getAudioInputStream(AudioFormat targetFormat, AudioInputStream sourceStream) to downsample my .wav file.
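A minimal sketch of that kind of conversion with the Java Sound API might look like this (the file names and the 22,050 Hz target rate are placeholders, and which conversions actually work depends on the audio service providers installed in the JVM):

import javax.sound.sampled.*;
import java.io.File;

public class Downsample {
    public static void main(String[] args) throws Exception {
        // Open the source .wav (already decoded from mp3, e.g. by JLayer)
        AudioInputStream source = AudioSystem.getAudioInputStream(new File("input.wav"));
        AudioFormat srcFormat = source.getFormat();

        // Same encoding, bit depth and channel count, but a lower sample rate
        AudioFormat targetFormat = new AudioFormat(
                srcFormat.getEncoding(),
                22050f,                        // target sample rate
                srcFormat.getSampleSizeInBits(),
                srcFormat.getChannels(),
                srcFormat.getFrameSize(),
                22050f,                        // target frame rate
                srcFormat.isBigEndian());

        // Let the Java Sound API perform the rate conversion and write the result
        AudioInputStream downsampled = AudioSystem.getAudioInputStream(targetFormat, source);
        AudioSystem.write(downsampled, AudioFileFormat.Type.WAVE, new File("output.wav"));
    }
}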
Here you can see the original mp3 file in Audacity:
After converting it by using JLayer and applying Javas Sound API on it, the downsampled track looked like this:
However, by using another program, dBPoweramp, it looked like this:
You can see that the amplitudes of the wave are higher than in the version I downsampled with Java.
Therefore it sounds louder and a bit more like the original .mp3 file, whereas my own file sounds very quiet compared to the original.
Now my questions:
How can I achieve this effect? Is it better to have higher amplitudes, or are they simply cut off, as you can see in the picture of the file downsampled by dBPoweramp? Why is there any difference at all?
I'm not entirely sure what you mean by quality here, but it's no surprise whatsoever that the nature of a downsampled signal will differ from the original: it will have been filtered to remove frequencies that would violate the Nyquist rate at the new sample frequency.
We would therefore expect the gain of the down-sampled signal to be lower than that of the original. From that perspective, the signal produced by JLayer looks far more plausible than the one from dBPoweramp.
The second graph appears to have higher gain than the original, so I suspect make-up gain has been applied, and possibly dynamic range compression and brick-wall limiting (the signal has periods whose peaks appear to sit right at the limit). Or worse, it's simply clipped.
And this brings us back to the definition of quality: it's subjective. Lots of commercial music is heavily compressed and brick-wall limited as part of the production process for a whole variety of reasons, one of which is artistic effect. It seems you're getting more of this from dBPoweramp, and it may well be flattering to your taste and the content.
It is probably not a clean conversion. In any objective measurement of system performance (e.g. PSNR), the quality would be lower.
If objective quality is what you're after, much better performance is achieved by rendering the mp3 at the lower sample rate directly, rather than decoding to PCM and then downsampling.
As a final word of caution: Audacity does some processing on the signal in order to render the time-amplitude graph. I suspect it shows the maximum amplitude per point on the x-axis.
I am investigating this field to obtain object detection in real time.
Video example:
http://www.youtube.com/watch?v=Bm5qUG-06V8
http://www.youtube.com/watch?v=aYd2kAN0Y20
But how can they extract SIFT keypoints and match them so fast?
SIFT extraction generally takes about a second per frame.
I'm an OpenIMAJ developer and responsible for making the first video.
We're not doing anything particularly fancy to make the matching fast in that video, and the SIFT detection and extraction is carried out on the entirety of every frame. In fact that video was made well before we did any optimisation; the current version of that demo is much smoother. We do also have a version with a hybrid KLT-tracker that works even faster by not having to perform SIFT on every frame.
As suggested by @Mario, the image size does have a big effect on the speed of the extraction, so processing a smaller frame can give a big win. Secondly, in the original description of the difference-of-Gaussian interest point localisation in Lowe's SIFT paper, it was suggested that the input image first be doubled in size to increase the number of features. By not performing this doubling, you also get a big performance boost at the expense of having fewer features to match.
The code is open source (BSD license) and you can get it by following the links at http://www.openimaj.org. As stated in the video description, the image-processing code is pure Java; the only native code is a thin interface to the webcam. Tutorial number 7 in the current tutorial pdf document walks through the process of using SIFT in OpenIMAJ. Disabling the double-sizing can be achieved by doing:
DoGSIFTEngine engine = new DoGSIFTEngine();
// Skip the initial image doubling: fewer features, but much faster extraction
engine.getOptions().setDoubleInitialImage(false);
SIFT can be accelerated in several ways:
If you can afford approximations, you can use a related descriptor called SURF, which is much faster (it uses integral images for most tasks).
You can use parallel implementations, at the CPU level (e.g. OpenCV uses Intel's TBB) or at the GPU level (search for "SIFT GPU" for related code and documentation).
Anyway, none of these is available in Java (AFAIK), so you'll have to use a Java wrapper to OpenCV or work it out yourself.
General and first idea: ask the video uploader(s). We can only guess at what is done or how it is done. It might also help to know what you've done so far (e.g. your video resolution, your processing power, image preparation, etc.).
I haven't used SIFT specifically, but I have done quite a bit of object/motion tracking over the last few years, so the following is more general advice. You might have tried some of these points already, I don't know.
Reduce your image resolution: going from 640x480 to 320x240 cuts your data to 25%. Going down to 160x120 cuts it to 25% of that again (so only 6.25% of the original data is left) without significantly impacting your algorithm (see the downscaling sketch after this list).
In a similar way, it might be useful to reduce the colour depth of your image (not just to 256-level grayscale, but maybe even further, e.g. 64 levels).
Try other methods to make features more obvious or faster to find, e.g. try running an edge detector over your image.
At least the second video mentions a tracking system, so you could try to guess the region where the tracked object should reappear in the next frame (using a simple alpha-beta filter or similar on its coordinates and possibly rotation), then run SIFT only on that sub-area (with some added margin). Only analyse the whole image if you can't find the object again. At around 40 or 50 seconds into the second video they lose the object and need quite some time/tries to find it again.
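To illustrate the resolution-reduction point, here is a minimal sketch of downscaling a captured frame with plain Java AWT before running feature extraction on it (the method name and target size are just placeholders):

import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Downscale a frame (e.g. 640x480 -> 320x240) before feature extraction.
static BufferedImage downscale(BufferedImage src, int targetW, int targetH) {
    BufferedImage dst = new BufferedImage(targetW, targetH, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = dst.createGraphics();
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                       RenderingHints.VALUE_INTERPOLATION_BILINEAR);
    g.drawImage(src, 0, 0, targetW, targetH, null);
    g.dispose();
    return dst;
}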
I am trying to analyze a movie file by splitting it up into camera shots and then trying to determine which shots are more important than others. One of the factors I am considering in a shot's importance is how loud the volume is during that part of the movie. To do this, I am analyzing the corresponding sound file. I'm having trouble determining how "loud" a shot is because I don't think I fully understand what the data in a WAV file represents.
I read the file into an audio buffer using a method similar to that described in this post.
Having already split the corresponding video file into shots, I am now trying to find which shots are louder than others in the WAV file. I am trying to do this by extracting each sample in the file like this:
double amplitude = (double)((audioData[i] & 0xff) | (audioData[i + 1] << 8));
Some of the other posts I have read seem to indicate that I need to apply a Fast Fourier Transform to this audio data to get the amplitude, which makes me wonder what the values I have extracted actually represent. Is what I'm doing correct? My sound file format is 16-bit mono PCM with a sampling rate of 22,050 Hz. Should I be doing something with this 22,050 value when I am trying to analyze the volume of the file? Other posts suggest using root mean square (RMS) to evaluate loudness. Is this required, or just a more accurate way of doing it?
The more I look into this the more confused I get. If anyone could shed some light on my mistakes and misunderstandings, I would greatly appreciate it!
The FFT has nothing to do with volume and everything to do with frequencies. To find out how loud a scene is on average, simply average the sampled values. Depending on whether you get the data as signed or unsigned values in your language, you might have to apply an absolute-value function first so that negative amplitudes don't cancel out the positive ones, but that's pretty much it. If you don't get the results you were expecting, that is most likely down to the way you are extracting the individual values in the sample-extraction line shown above.
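As a rough sketch of that averaging approach, assuming the 16-bit little-endian mono PCM data described in the question and a hypothetical byte array audioData holding one shot's audio, you could compute the mean absolute amplitude (and, for comparison, the RMS) like this:

// Mean absolute amplitude and RMS of 16-bit little-endian mono PCM data.
static double[] loudness(byte[] audioData) {
    double sumAbs = 0, sumSquares = 0;
    int sampleCount = audioData.length / 2;
    for (int i = 0; i < audioData.length - 1; i += 2) {
        // Combine two bytes into one signed 16-bit sample
        int sample = (audioData[i] & 0xff) | (audioData[i + 1] << 8);
        sumAbs += Math.abs(sample);
        sumSquares += (double) sample * sample;
    }
    double meanAbs = sumAbs / sampleCount;
    double rms = Math.sqrt(sumSquares / sampleCount);
    return new double[] { meanAbs, rms };
}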
That said, there are a few refinements that might or might not affect your task. Perceived loudness, amplitude and acoustic power are in fact related in non-linear ways, but as long as you are only trying to get a rough estimate of how much is "going on" in the audio signal, I doubt that this is relevant for you. And of course, humans hear different frequencies better or worse; for instance, bats emit ultrasound squeals that would be absolutely deafening to us, but luckily we can't hear them at all. But again, I doubt this is relevant to your task, since frequencies above half the sample rate (the Nyquist limit) cannot be represented in a simple WAV file anyway.
I don't know the level of accuracy you want, but a simple RMS (and perhaps simple filtering of the signal) is all many similar applications would need.
RMS will be much better than peak amplitude. Using peak amplitude is like judging the brightness of an image by its brightest pixel rather than by averaging.
If you want to filter the signal or weight it for perceived loudness, then you would need the sample rate for that.
An FFT should not be required unless you also want to do frequency analysis. The ear does not respond to sounds at different frequencies and amplitudes linearly, so if you need that extra level of accuracy you could use an FFT to weight the frequency content as well.
I need to break apart a large collection of wav files into smaller segments and convert them into 16 kHz, 16-bit mono wav files. To segment the wav files, I downloaded a WavFile class from the following site: WavFile Class. I tweaked it a bit to allow skipping an arbitrary number of frames. Using that class, I created a WavSegmenter class that reads a source wav file and copies the frames between time x and time y into a new wav file. The start time and end time I can get from a provided XML file, and I can get the frames using sample rate * time. My problem is that I do not know how to convert the sample rate from 44,100 Hz to 16,000 Hz.
Currently, I am looking into Java's Sound API for this. I didn't consult it initially, because I found the guides long, but if it's the best existing option, I am willing to go through it. I would still like to know if there's another way to do it, though. Finally, I would like to know whether I should fully adopt Java's Sound API and drop the WavFile class I am currently using. To me, it looks sound, but I would just like to be sure.
Thank you very much, in advance, for your time.
I believe the hardest part of your task is re-sampling from 44.1 kHz to 16 kHz. It would have been much simpler to downsample to 22.05 kHz or 11.025 kHz, since those are integer divisions of 44.1 kHz! You will need to do some interpolation there.
EDIT: After further review and discussion with the OP, I believe the right choice for this situation is to go with the Java Sound API, because it provides methods for conversion between different sound file formats, including different sampling rates. Sticking with the WavFile API would require implementing the re-sampling yourself, which is quite complicated for a 44.1 kHz to 16 kHz conversion.
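As a hedged sketch of that approach (the file names are placeholders, and whether a given conversion is supported depends on the installed audio service providers):

import javax.sound.sampled.*;
import java.io.File;

public class ConvertTo16k {
    public static void main(String[] args) throws Exception {
        AudioInputStream in = AudioSystem.getAudioInputStream(new File("segment_44100.wav"));
        AudioFormat src = in.getFormat();

        // Keep the channel count and change only the sample rate; mixing down
        // to mono may need to be handled separately if the provider cannot do it.
        AudioFormat target = new AudioFormat(16000f, 16, src.getChannels(), true, false);

        // Throws IllegalArgumentException if no installed provider supports
        // this particular conversion.
        AudioInputStream out = AudioSystem.getAudioInputStream(target, in);
        AudioSystem.write(out, AudioFileFormat.Type.WAVE, new File("segment_16000.wav"));
    }
}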
http://www.jsresources.org/examples/SampleRateConverter.html - I suppose this would help you.
The goal is to get a simple 2d audio visualizer that is somewhat responsive to the music.
I've got the basics set up: I have graphics that respond to some data being fed in. Given a file, I load up an AudioInputStream for playback (this works fine) and have it running in a thread. In another thread, I would like to extract byte data at a rate close to the playback rate (or perhaps faster, to allow for delay in processing that data). I then want to feed that to an FFT process and pass the resulting data to my graphics object, which will use it as a parameter for whatever the visualization is.
I have two questions for this process:
1) How can I get the byte data and process it at a rate that will match the normal playback of the file? Is using an AudioInputStream the way to go here?
2) Once I do the FFT, what's a good way to get usable data (e.g. the power spectrum? Somehow filtering out certain frequencies? etc.)?
Some considerations about (2) using the FFT to extract "features".
You should calculate short-term FFTs of, for example, 512 points whenever there are enough free CPU cycles to do so. For a visualisation it is not necessary to preserve all information (i.e. to work with overlapping windows); instead you could, for example, calculate a 100 ms FFT five times per second.
Then you should calculate the logarithmic power spectrum in dB (decibel).
This gives you a pretty good impression about the detailed frequency content of your sound.
Depending on what you would like to visualize, you could for example combine some low-frequency FFT bins (calculate their RMS) to get the "bass" content of your sound, and so on.
See this post for details.
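As an illustrative sketch of the power-spectrum step (using a deliberately simple, slow DFT so the example stays self-contained; in a real visualizer you would use an FFT library instead):

// Power spectrum of a short window in dB, via a naive O(n^2) DFT.
// Replace the inner loop with a proper FFT library for real-time use.
static double[] powerSpectrumDb(double[] window) {
    int n = window.length;               // e.g. 512 samples
    double[] db = new double[n / 2];     // only the first half is meaningful for real input
    for (int k = 0; k < n / 2; k++) {
        double re = 0, im = 0;
        for (int t = 0; t < n; t++) {
            double angle = 2.0 * Math.PI * k * t / n;
            re += window[t] * Math.cos(angle);
            im -= window[t] * Math.sin(angle);
        }
        double power = (re * re + im * im) / n;
        db[k] = 10.0 * Math.log10(power + 1e-12);  // small offset avoids log of zero
    }
    return db;
}

Combining the first few bins (e.g. by RMS) then gives a rough "bass" value as described above.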
I've been researching this off-and-on for a few months.
I'm looking for a library or working example code to detect the frequency in sound card audio input, or detect presence of a given set of frequencies. I'm leaning towards Java, but the real requirement is that it should be something higher-level/simpler than C, and preferably cross-platform. Linux will be the target platform but I want to leave options open for Mac or possibly even Windows. Python would be acceptable too, and if anyone knows of a language that would make this easier/has better pre-written libraries, I'd be willing to consider it.
Essentially I have a defined set of frequency pairs that will appear in the soundcard audio input, and I need to be able to detect a pair and then... do something, such as record the following audio up to a maximum duration and then perform some action. A potential run could feature, say, 5-10 pairs, defined at runtime rather than compiled in: something like frequency 1 for ~1 second, a maximum delay of ~1 second, then frequency 2 for ~1 second.
I found suggestions to use either an FFT or the Goertzel algorithm, but was unable to find anything beyond the simplest example code, which seemed to give no useful results. I also found some limitations with Java audio, in that I could not sample at a high enough rate to get the resolution I need.
Any suggestions for libraries to use or maybe working code? I'll admit that I'm not the most mathematically inclined, so I've been lost in some of the more technical descriptions of how the algorithms actually work.
If you are aiming at detecting frequency pairs, then your job is very similar to that of a DTMF detector.
Try searching for DTMF on sites like SourceForge; you'll find detectors in many programming languages. The spacing of the DTMF frequency pairs along the spectrum seems even more stringent than your specs, so you should be fine adapting a DTMF detector to your input.
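For the Goertzel algorithm mentioned in the question (the usual building block of DTMF detectors), a minimal sketch in Java might look like this; the method name and how you threshold the result are just placeholders:

// Goertzel: power of one target frequency in a block of PCM samples.
static double goertzelPower(double[] samples, double targetFreq, double sampleRate) {
    double omega = 2.0 * Math.PI * targetFreq / sampleRate;
    double coeff = 2.0 * Math.cos(omega);
    double s1 = 0, s2 = 0;
    for (double x : samples) {
        double s0 = coeff * s1 - s2 + x;   // the Goertzel recurrence
        s2 = s1;
        s1 = s0;
    }
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}

A pair would then count as "present" in a block when the powers at both of its frequencies exceed a threshold relative to the overall signal energy.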
Check out SNDPeek; it's a cross-platform C++ application that extracts all kinds of information from live audio: https://github.com/RobQuistNL/sndpeek