Android library to get pitch from WAV file

I have a list of sampled data from a WAV file. I would like to pass these values into a library and get the frequency of the music played in the file. For now, the WAV file contains a single frequency, and I would like to find a library that is compatible with Android. I understand that I need to use an FFT to get to the frequency domain. Are there any good libraries for that? I found that KissFFT is quite popular, but I am not sure how well it works on Android. Is there an easier, good library that can perform the task I want?
EDIT:
I tried to use JTransforms to get the FFT of the WAV file but always failed to get the correct frequency. Currently, the WAV file contains a 440Hz sine wave, the note A4. However, I got 441Hz as the result. Then I tried to get the frequency of G4 and got 882Hz, which is incorrect. The frequency of G4 is supposed to be 783Hz. Could it be due to not having enough samples? If so, how many samples should I take?
//DFT
DoubleFFT_1D fft = new DoubleFFT_1D(numOfFrames);
double max_fftval = -1;
int max_i = -1;
double[] fftData = new double[numOfFrames * 2];
for (int i = 0; i < numOfFrames; i++) {
// copying audio data to the fft data buffer, imaginary part is 0
fftData[2 * i] = buffer[i];
fftData[2 * i + 1] = 0;
}
fft.complexForward(fftData);
for (int i = 0; i < fftData.length; i += 2) {
// complex numbers -> vectors, so we compute the length of the vector, which is sqrt(realpart^2+imaginarypart^2)
double vlen = Math.sqrt((fftData[i] * fftData[i]) + (fftData[i + 1] * fftData[i + 1]));
//fd.append(Double.toString(vlen));
// fd.append(",");
if (max_fftval < vlen) {
// if this length is bigger than our stored biggest length
max_fftval = vlen;
max_i = i;
}
}
//double dominantFreq = ((double)max_i / fftData.length) * sampleRate;
double dominantFreq = (max_i/2.0) * sampleRate / numOfFrames;
fd.append(Double.toString(dominantFreq));
Can someone help me out?
EDIT2: I managed to fix the problem mentioned above by increasing the number of samples to 100000. However, sometimes I get an overtone as the frequency. Any idea how to fix it? Should I use the Harmonic Product Spectrum or autocorrelation algorithms?
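For the overtone problem, a time-domain autocorrelation estimate is one option. The following is a minimal sketch, not a tested Android implementation: the class name AutocorrelationPitch, its parameters and the search limits minHz/maxHz are assumptions made for illustration. It looks for the lag at which the signal best matches a shifted copy of itself and converts that lag to a frequency.

```java
final class AutocorrelationPitch {

    /**
     * Estimates the fundamental frequency of x by finding the lag with the
     * highest autocorrelation inside the lag range implied by [minHz, maxHz].
     */
    static double estimate(double[] x, double sampleRate, double minHz, double maxHz) {
        int minLag = Math.max(1, (int) Math.floor(sampleRate / maxHz));
        int maxLag = Math.min(x.length - 1, (int) Math.ceil(sampleRate / minHz));
        int bestLag = minLag;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int lag = minLag; lag <= maxLag; lag++) {
            double sum = 0;
            for (int i = 0; i + lag < x.length; i++) {
                sum += x[i] * x[i + lag];
            }
            // Divide by the overlap length so that long lags are not penalised.
            double value = sum / (x.length - lag);
            if (value > bestValue) {
                bestValue = value;
                bestLag = lag;
            }
        }
        return sampleRate / bestLag;
    }

    /** Test tone: a fundamental plus a second harmonic with twice its amplitude. */
    static double[] toneWithOvertone(double fundamentalHz, double sampleRate, int n) {
        double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            double t = i / sampleRate;
            x[i] = Math.sin(2 * Math.PI * fundamentalHz * t)
                 + 2.0 * Math.sin(2 * Math.PI * 2 * fundamentalHz * t);
        }
        return x;
    }
}
```

For a 220Hz tone whose 440Hz overtone is twice as strong, estimate(toneWithOvertone(220, 44100, 32768), 44100, 150, 1000) should land close to 220Hz rather than 440Hz. The result is limited to whole-sample lags, and if minHz is at or below half the true pitch, a lag of two periods can win and produce an octave error.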

I realised my mistake: if I take more samples, the accuracy increases. However, this method is still not complete, as I still have problems obtaining accurate results for piano and voice sounds.
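The reason more samples help is that an N-point FFT can only report multiples of sampleRate / N. The sketch below illustrates this with a naive DFT instead of JTransforms so that it is self-contained. The values 44100Hz and 100 samples are assumptions: the question does not state them, but they are one combination that reproduces the 441Hz and 882Hz readings. The class and method names are hypothetical.

```java
final class BinResolution {

    /** Spacing in Hz between neighbouring bins of an n-point transform. */
    static double binWidthHz(double sampleRate, int n) {
        return sampleRate / n;
    }

    /** Frequency of the strongest bin below Nyquist, using a naive O(n^2) DFT. */
    static double peakFrequency(double[] samples, double sampleRate) {
        int n = samples.length;
        int bestBin = 1;
        double bestPower = -1;
        // Start at 1 to skip DC; stop at n/2 because the upper half mirrors the lower.
        for (int k = 1; k < n / 2; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += samples[t] * Math.cos(angle);
                im -= samples[t] * Math.sin(angle);
            }
            double power = re * re + im * im;
            if (power > bestPower) {
                bestPower = power;
                bestBin = k;
            }
        }
        return bestBin * sampleRate / n;
    }

    /** Pure sine wave of the given frequency. */
    static double[] sine(double freqHz, double sampleRate, int n) {
        double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = Math.sin(2 * Math.PI * freqHz * i / sampleRate);
        }
        return x;
    }
}
```

With 100 samples at 44100Hz the bins are 441Hz apart, so a 440Hz sine is reported as 441Hz and a 783Hz sine as 882Hz. With 4410 samples the bins are 10Hz apart and 440Hz falls exactly on a bin.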

Related

Extracting frequency from wav file

I am trying to extract the frequency from a WAV file, but it looks like something is going wrong.
First I extract the bytes from the file, then apply an FFT to them, and finally find the magnitude.
It seems I am doing something wrong, as the output is not close to the real value.
Below is the code.
try{
File log = new File("files/log.txt");
if(!log.exists()) log.createNewFile();
PrintStream ps = new PrintStream(log);
File f = new File("files/5000.wav");
FileInputStream fis = new FileInputStream(f);
int length = (int)f.length();
length = (int)nearestPow2(length);
double[] ibr = new double[length]; //== real
double[] ibi = new double[length]; //== imaginary
int i = 0;
int l=0;
//fis.skip(44);
byte[] b = new byte[1024];
while((l=fis.read(b))!=-1){
try{
for(int j=0; j<1024; j++){
ibr[i] = b[j];
ibi[i] = 0;
i++;
}
}catch(Exception e){}
}
double[] ftb = FFTBase.fft(ibr, ibi, true);
double[] mag = new double[ftb.length/2];
double mxMag = 0;
long avgMg = 0;
int reqIndex = 512; //== no need to go till end
for(i=1;i<ibi.length; i++){
ibr[i] = ftb[i*2];
ibi[i] = ftb[i*2+1];
mag[i] = Math.sqrt(ibr[i]*ibr[i]+ibi[i]*ibi[i]);
avgMg += mag[i];
if(mag[i]>mxMag) mxMag = mag[i];
ps.println(mag[i]);
}
avgMg = avgMg/ibi.length;
ps.println("MAx===="+mxMag);
ps.println("Average===="+avgMg);
}catch(Exception e){e.printStackTrace();}
When I run this code on a 5 kHz file, these are the values I get:
https://pastebin.com/R3V0QU4G
This is not the complete output, but the rest is similar.
Thanks
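One thing that may be worth checking in the code above: it copies every raw byte of the file, header included, into the real array, so each byte is treated as one sample. If the file is 16-bit little-endian PCM with a plain 44-byte header (both are assumptions, since the question does not state the format), the bytes would need to be combined into samples first. A minimal sketch with hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PcmDecoder {

    /** Size of the canonical PCM WAV header; real files may carry extra chunks. */
    static final int ASSUMED_HEADER_BYTES = 44;

    /** Converts 16-bit little-endian PCM bytes, starting at offset, to doubles in [-1, 1). */
    static double[] decode16BitLittleEndian(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, data.length - offset)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        double[] samples = new double[buf.remaining() / 2];
        for (int i = 0; i < samples.length; i++) {
            // Two bytes form one signed sample; dividing by 32768 scales it to [-1, 1).
            samples[i] = buf.getShort() / 32768.0;
        }
        return samples;
    }
}
```

The resulting array, rather than the raw bytes, would then be the input to the FFT. For stereo files the samples of the two channels alternate and would have to be separated as well.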

Extracting a frequency, or a "pitch", is unfortunately hardly possible by only doing an FFT and searching for the "loudest" frequency, at least if you are trying to extract it from a musical signal.
Also, there are different kinds of tones. A large portion of musical instruments (e.g. a guitar or the human voice) create harmonic sounds, which consist of several frequencies that follow a certain pattern.
But there are also tones that have only one peak or frequency (e.g. whistling).
Additionally, you usually have to deal with noise in the signal that is not tonal at all. This could be background noise, or it could be produced by the instrument itself. Guitars, for instance, have a very large noise portion during the attack phase.
You can use different approaches, meaning different algorithms, to find the pitch of these signals, depending on their type.
If we stay in the frequency domain (FFT) and assume we want to analyze a harmonic sound, there is for example the two-way mismatch algorithm, which uses statistical pattern matching to find harmonics and to guess the fundamental frequency, which is the frequency that our ears perceive as the tone.
An example implementation can be found here: https://github.com/ausmauricio/audio_dsp This repo is part of a complete course on audio signal processing at Coursera; maybe this is helpful.
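To illustrate why the loudest bin is not necessarily the pitch, here is a small sketch of the harmonic product spectrum. Note that this is a simpler frequency-domain technique, not the two-way mismatch algorithm mentioned above. All names are hypothetical, and the naive DFT is only there to keep the example self-contained; a real application would use an FFT library.

```java
final class HarmonicProductSpectrum {

    /** Magnitudes of bins 0 .. n/2 - 1, computed with a naive O(n^2) DFT. */
    static double[] magnitudes(double[] x) {
        int n = x.length;
        double[] mag = new double[n / 2];
        for (int k = 0; k < mag.length; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im -= x[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    /** Index of the largest magnitude, skipping the DC bin. */
    static int strongestBin(double[] mag) {
        int best = 1;
        for (int k = 2; k < mag.length; k++) {
            if (mag[k] > mag[best]) best = k;
        }
        return best;
    }

    /** Bin k that maximises mag[k] * mag[2k] * ... * mag[harmonics * k]. */
    static int fundamentalBin(double[] mag, int harmonics) {
        int best = 1;
        double bestProduct = -1;
        for (int k = 1; k * harmonics < mag.length; k++) {
            double product = 1;
            for (int h = 1; h <= harmonics; h++) {
                product *= mag[h * k];
            }
            if (product > bestProduct) {
                bestProduct = product;
                best = k;
            }
        }
        return best;
    }

    /** Test tone with a weak fundamental and stronger 2nd and 3rd harmonics. */
    static double[] harmonicTone(double f0, double sampleRate, int n) {
        double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            double t = i / sampleRate;
            x[i] = 0.3 * Math.sin(2 * Math.PI * f0 * t)
                 + 1.0 * Math.sin(2 * Math.PI * 2 * f0 * t)
                 + 0.6 * Math.sin(2 * Math.PI * 3 * f0 * t);
        }
        return x;
    }
}
```

For harmonicTone(200, 8000, 800) the bins are 10Hz wide. The strongest bin corresponds to 400Hz, the second harmonic, while fundamentalBin(mag, 3) points at 200Hz. On real recordings with noise and inharmonic partials the result is less clean than on this synthetic tone.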

Train recurrent neural net in deeplearning4j with data that is generated during runtime

I'm new to the deeplearning4j library, but I've got some experience with neural networks in general.
I'm trying to train a recurrent neural network (an LSTM in particular) which is supposed to detect beats in music in realtime. All examples for using recurrent neural nets with deeplearning4j that I've found so far use a reader which reads the training data from a file. As I want to record music in realtime via a microphone, I can't read some pregenerated file, so the data which is fed into the neural network is generated in realtime by my application.
This is the code that I'm using to generate my network:
NeuralNetConfiguration.ListBuilder builder = new NeuralNetConfiguration.Builder()
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT).iterations(1)
.learningRate(0.1)
.rmsDecay(0.95)
.regularization(true)
.l2(0.001)
.weightInit(WeightInit.XAVIER)
.updater(Updater.RMSPROP)
.list();
int nextIn = hiddenLayers.length > 0 ? hiddenLayers[0] : numOutputs;
builder = builder.layer(0, new GravesLSTM.Builder().nIn(numInputs).nOut(nextIn).activation("softsign").build());
for(int i = 0; i < hiddenLayers.length - 1; i++){
nextIn = hiddenLayers[i + 1];
builder = builder.layer(i + 1, new GravesLSTM.Builder().nIn(hiddenLayers[i]).nOut(nextIn).activation("softsign").build());
}
builder = builder.layer(hiddenLayers.length, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT).nIn(nextIn).nOut(numOutputs).activation("softsign").build());
MultiLayerConfiguration conf = builder.backpropType(BackpropType.TruncatedBPTT).tBPTTForwardLength(DEFAULT_RECURRENCE_DEPTH).tBPTTBackwardLength(DEFAULT_RECURRENCE_DEPTH)
.pretrain(false).backprop(true)
.build();
net = new MultiLayerNetwork(conf);
net.init();
In this case I'm using about 700 inputs (which is mostly FFT-data of the recorded audio), 1 output (which is supposed to output a number between 0 [no beat] and 1 [beat]) and my hiddenLayers array consists of the ints {50, 25, 10}.
For getting the output of the network I'm using this code:
double[] output = new double[]{net.rnnTimeStep(Nd4j.create(netInputData)).getDouble(0)};
where netInputData is the data I want to input into the network as a one-dimensional double array.
I'm relatively sure that this code is working fine, since I get some output for an untrained network which looks something like this when I plot it.
However, once I try to train a network (even if I train it just for a short time, which should alter the weights of the network just a little bit, so that the output should be very similar to the untrained network), I get an output which looks like a constant.
This is the code which I'm using to train the network:
for(int timestep = 0; timestep < trainingData.length - DEFAULT_RECURRENCE_DEPTH; timestep++){
INDArray inputDataArray = Nd4j.create(new int[]{1, numInputs, DEFAULT_RECURRENCE_DEPTH},'f');
for(int inputPos = 0; inputPos < trainingData[timestep].length; inputPos++)
for(int inputTimeWindowPos = 0; inputTimeWindowPos < DEFAULT_RECURRENCE_DEPTH; inputTimeWindowPos++)
inputDataArray.putScalar(new int[]{0, inputPos, inputTimeWindowPos}, trainingData[timestep + inputTimeWindowPos][inputPos]);
INDArray desiredOutputDataArray = Nd4j.create(new int[]{1, numOutputs, DEFAULT_RECURRENCE_DEPTH},'f');
for(int outputPos = 0; outputPos < desiredOutputData[timestep].length; outputPos++)
for(int inputTimeWindowPos = 0; inputTimeWindowPos < DEFAULT_RECURRENCE_DEPTH; inputTimeWindowPos++)
desiredOutputDataArray.putScalar(new int[]{0, outputPos, inputTimeWindowPos}, desiredOutputData[timestep + inputTimeWindowPos][outputPos]);
net.fit(new DataSet(inputDataArray, desiredOutputDataArray));
}
Once again, I've got my data for the input and for the desired output as a double array. This time the two arrays are two-dimensional. The first index represents the time (where index 0 is the first audio data of the recorded audio) and the second index represents the input (or respectively the desired output) for this time step.
Given the shown output after training a network, I tend to think that there must be something wrong with my code used for creating the INDArrays from my data. Am I missing some important step for initializing these arrays or did I mess up the order I need to put my data into these arrays?
Thank you for any help in advance.

I'm not sure, but perhaps 99.99% of your training examples are 0, with only an occasional 1 exactly where the beat occurs. This might be too imbalanced to learn. Good luck.
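A quick way to check whether this applies is to count how many time steps in the desired output are actually marked as a beat. The sketch below is plain Java and does not touch the deeplearning4j API; the class name, the method name and the 0.5 threshold in the usage note are assumptions, and desiredOutputData is expected to have the [time][output] layout described in the question.

```java
final class LabelBalance {

    /** Fraction of time steps whose first output is at or above the threshold. */
    static double positiveFraction(double[][] desiredOutputData, double threshold) {
        if (desiredOutputData.length == 0) {
            return 0;
        }
        int positives = 0;
        for (double[] step : desiredOutputData) {
            if (step[0] >= threshold) {
                positives++;
            }
        }
        return (double) positives / desiredOutputData.length;
    }
}
```

If positiveFraction(desiredOutputData, 0.5) comes out very small, a network that always outputs a value near 0 already has a low error, which would be consistent with the constant output seen after training.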

Coefficient Correlation Over a Large Binary Image Data-Set - Slow Performance

I am trying to build an OCR by calculating the correlation coefficient between characters extracted from an image and every character I have pre-stored in a database. My implementation is based on Java, and the pre-stored characters are loaded into an ArrayList when the application starts, i.e.
ArrayList<byte []> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();
// Calculate the coefficient between every extracted character
// and every character in database.
double maxCorr = -1;
for(byte [] extractedCharacter : extractedCharacters)
for(byte [] storedCharacter : storedCharacters)
{
double corr = findCorrelation(extractedCharacter, storedCharacter);
if (corr > maxCorr)
maxCorr = corr;
}
...
...
public double findCorrelation(byte [] extractedCharacter, byte [] storedCharacter)
{
double mag1 = 0, mag2 = 0, corr = 0;
for(int i=0; i < extractedCharacter.length; i++)
{
mag1 += extractedCharacter[i] * extractedCharacter[i];
mag2 += storedCharacter[i] * storedCharacter[i];
corr += extractedCharacter[i] * storedCharacter[i];
} // for
corr /= Math.sqrt(mag1*mag2);
return corr;
}
The number of extractedCharacters is around 100-150 per image, but the database has 15600 stored binary characters. Checking the correlation coefficient between every extracted character and every stored character hurts performance: it takes around 15-20 seconds per image on an Intel i5 CPU.
Is there a way to improve the speed of this program, or can you suggest another way of building this that gives similar results? (The results produced by comparing every character against such a large dataset are quite good.)
Thank you in advance
UPDATE 1
public static void run() {
ArrayList<byte []> storedCharacters, extractedCharacters;
storedCharacters = load_all_characters_from_database();
extractedCharacters = extract_characters_from_image();
// Calculate the coefficient between every extracted character
// and every character in database.
computeNorms(storedCharacters, extractedCharacters);
double maxCorr = -1;
for(int e = 0; e < extractedCharacters.size(); e++)
for(int s = 0; s < storedCharacters.size(); s++)
{
double corr = findCorrelation(storedCharacters.get(s), extractedCharacters.get(e), s, e);
if (corr > maxCorr)
maxCorr = corr;
}
}
private static double[] storedNorms;
private static double[] extractedNorms;
// Correlation between two binary images
public static double findCorrelation(byte[] arr1, byte[] arr2, int strCharIndex, int extCharNo){
final int dotProduct = dotProduct(arr1, arr2);
final double corr = dotProduct * storedNorms[strCharIndex] * extractedNorms[extCharNo];
return corr;
}
public static void computeNorms(ArrayList<byte[]> storedCharacters, ArrayList<byte[]> extractedCharacters) {
storedNorms = computeInvNorms(storedCharacters);
extractedNorms = computeInvNorms(extractedCharacters);
}
private static double[] computeInvNorms(List<byte []> a) {
final double[] result = new double[a.size()];
for (int i=0; i < result.length; ++i)
result[i] = 1 / Math.sqrt(dotProduct(a.get(i), a.get(i)));
return result;
}
private static int dotProduct(byte[] arr1, byte[] arr2) {
int dotProduct = 0;
for(int i = 0; i< arr1.length; i++)
dotProduct += arr1[i] * arr2[i];
return dotProduct;
}

Nowadays, it's hard to find a CPU with a single core (even in mobile devices). As the tasks are nicely separated, you can parallelize the work with only a few lines of code. So I'd go for it, though the gain is limited.
If you really mean cross-correlation, then a transform like the DFT or DCT could help. They surely do for big images, but with your 12x16 images, I'm not sure.
Maybe you mean just a dot product? And maybe you should tell us?
Note that you don't actually need to compute the correlation; most of the time all you need is to find out whether it's bigger than a threshold:
corr = findCorrelation(extractedCharacter, storedCharacter)
..... more code to check if this is the best match ......
This may or may not lead to some optimizations, depending on what the images look like.
Note also that a simple low-level optimization can give you nearly a factor of 4, as in this question of mine. Maybe you really should tell us what you're doing?
UPDATE 1
I guess that due to the computation of three products in the loop, there's enough instruction level parallelism, so a manual loop unrolling like in my above question is not necessary.
However, I see that those three products get computed some 100 * 15600 times, while only one of them depends on both extractedCharacter and storedCharacter. So you can compute
100 + 15600 + 100 * 15600
dot products instead of
3 * 100 * 15600
This way you may get a factor of three pretty easily.
Or not. After this step there's a single sum computed in the relevant step and the problem linked above applies. And so does its solution (unrolling manually).
Factor 5.2
While byte[] is nicely compact, the computation involves widening the bytes to ints, which costs some time, as my benchmark shows. Converting the byte[]s to int[]s before all the correlations get computed saves time. Even better is to make use of the fact that this conversion can be done beforehand for storedCharacters.
Manual loop unrolling twice helps but unrolling more doesn't.
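The two steps described above, converting to int[] once and unrolling the dot product twice, could look roughly like the following sketch. The class and method names are hypothetical, and the actual speed-up depends on the JVM and hardware, so it should be measured rather than assumed. Together with precomputed norms this needs 100 + 15600 + 100 * 15600 = 1,575,700 dot products instead of 3 * 100 * 15600 = 4,680,000.

```java
final class DotProducts {

    /** Widens a byte image to ints once, so the inner loop needs no conversion. */
    static int[] toInts(byte[] a) {
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = a[i];
        }
        return result;
    }

    /** Dot product unrolled twice, with two independent accumulators. */
    static int dotProduct(int[] a, int[] b) {
        int sum0 = 0;
        int sum1 = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            sum0 += a[i] * b[i];
            sum1 += a[i + 1] * b[i + 1];
        }
        // Handle the last element when the length is odd.
        if (i < a.length) {
            sum0 += a[i] * b[i];
        }
        return sum0 + sum1;
    }

    /** 1 / sqrt(a . a), computed once per character instead of once per pair. */
    static double inverseNorm(int[] a) {
        return 1 / Math.sqrt(dotProduct(a, a));
    }
}
```

The correlation of a pair is then dotProduct(x, y) * inverseNorm(x) * inverseNorm(y), where both inverse norms come from arrays filled before the double loop starts. An all-zero image gives an infinite inverse norm and would need separate handling.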

Avoid overmodulation/distortion when applying gain to PCM

I work on an audio recorder (AudioRec on Google Play).
I have the option to adjust the gain within a [-20dB, +20dB] range.
It works pretty well on my phone, but a user with a professional microphone attached to his device complained about the gain: when selecting -20dB, the output is distorted.
See below how I implemented the gain function:
for(int frameIndex=0; frameIndex<numFrames; frameIndex++){
for(int c=0; c<nChannels; c++){
if(rGain != 1){
// gain
long accumulator=0;
for(int b=0; b<bytesPerSample; b++){
accumulator+=((long)(source[byteIndex++]&0xFF))<<(b*8+emptySpace);
}
double sample = ((double)accumulator/(double)Long.MAX_VALUE);
sample *= rGain;
int intValue = (int)((double)sample*(double)Integer.MAX_VALUE);
for(int i=0; i<bytesPerSample; i++){
source[i+byteIndex2]=(byte)(intValue >>> ((i+2)*8) & 0xff);
}
byteIndex2 += bytesPerSample;
}
}//end for(channel)
}//end for(frameIndex)
Maybe I should apply some low/high filter after sample *= rGain;? Something like if(sample < MINIMUM_VALUE || sample > MAXIMUM_VALUE)? In that case, please let me know what these min and max values are...

Simply clipping values above a threshold will most certainly cause distortion. If you can picture a pure sine wave, as you lop the top off it will begin to resemble a square wave.
That said, if you have an input signal and you are multiplying it by a value smaller than one, there is no way that you are introducing any (significant) distortion. You need to look further back in the signal path. Perhaps clipping is occurring at the input.

I would try to simplify your logic. It appears you are using 32-bit wave form but the code is far more complex than needed. This will make it harder to work out how to avoid clipping.
IntBuffer ints = ByteBuffer.wrap(source).order(ByteBuffer.nativeOrder()).asIntBuffer();
for(int i = 0; i < ints.limit(); i++) {
int signal = ints.get(i);
double gained = signal * gain;
if (gained > Integer.MAX_VALUE) {
// do something.
} else if (gained < Integer.MIN_VALUE) {
// do something
}
ints.put(i, (int) gained);
}
A simple approach is to let the values overflow, but as you say this can result in apparent distortion. Just clipping the data could lead to long periods of effective silence.
What you may have to do is an FFT and produce a signal which increases the strength of audible frequencies at the cost of lower frequencies when the gain is too high, i.e. it is the low frequencies which result in the signal being too high or too low, so you can't amplify these as much if you want to stay in bounds.
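As a point of comparison, here is a minimal sketch of applying a gain given in dB to 16-bit little-endian PCM and saturating at the sample limits instead of letting the value wrap around. The sample format is an assumption (the question's code is generic over bytesPerSample), and the class and method names are hypothetical. Saturation is still a form of clipping, so it limits the damage for positive gains rather than removing it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PcmGain {

    /** Converts a gain in decibels to a linear amplitude factor. */
    static double dbToLinear(double db) {
        return Math.pow(10.0, db / 20.0);
    }

    /** Applies the gain in place to 16-bit little-endian PCM, saturating at the limits. */
    static void applyGain16(byte[] pcm, double gainDb) {
        double gain = dbToLinear(gainDb);
        ByteBuffer buf = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN);
        for (int pos = 0; pos + 1 < pcm.length; pos += 2) {
            // getShort reads a signed sample, so no manual sign handling is needed.
            long scaled = Math.round(buf.getShort(pos) * gain);
            if (scaled > Short.MAX_VALUE) {
                scaled = Short.MAX_VALUE;
            } else if (scaled < Short.MIN_VALUE) {
                scaled = Short.MIN_VALUE;
            }
            buf.putShort(pos, (short) scaled);
        }
    }
}
```

With this conversion -20dB is a factor of 0.1 and +20dB a factor of 10. A sample of 10000 becomes 1000 at -20dB and saturates at 32767 at +20dB.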

Mixing and Adding Silence to Audio Android/Java

I have 2 files. One is an mp3 being decoded to PCM into a stream, and the other is a wav being read into PCM as well. The samples are held in a short data type.
Audio stats: 44,100 samples * 16 bits per sample * 2 channels = 1,411,200 bits/sec
I have X seconds of silence that I need to apply to the beginning of the mp3 pcm data and I am doing it like this:
private short[] mp3Buffer = null;
private short[] wavBuffer = null;
private short[] mixedBuffer = null;
double silenceSamples = (audioInfo.rate * padding) * 2;
for (int i = 0; i < minBufferSize; i++){
if (silenceSamples > 0 ){
mp3Buffer[i] = 0; //Add 0 to the buffer as silence
mixedBuffer[i] = (short)((mp3Buffer[i] + stereoWavBuffer[i])/2);
silenceSamples = silenceSamples - 0.5;
}
else
mixedBuffer[i] = (short)((mp3Buffer[i] + stereoWavBuffer[i])/2);
}
The audio is always off. Sometimes it's a second or two too fast, sometimes it's a second or two too slow. I don't think it's a problem with the timing, as I start the AudioRecord (wav) first, then set a start timer -> start MediaPlayer (already prepared) -> end timer, and set the difference as the "padding" variable. I am also skipping the 44kb from the wav header.
Any help would be much appreciated.

I'm assuming you want to align two sources of audio in some way by inserting padding at the start of one of the streams? There are a few things wrong here.
mp3Buffer[i] = 0; //Add 0 to the buffer as silence
This is not adding silence to the beginning; it is just setting the entry at offset [i] in the array to 0. The next line:
mixedBuffer[i] = (short)((mp3Buffer[i] + stereoWavBuffer[i])/2);
Then just overwrites this value.
If you want to align the streams in some way, the best way to go about it is not to insert silence at the beginning of either stream, but to begin mixing in one of the streams at an offset from the other. Also, it would be better to mix them into a 32-bit float and then normalise. Something like:
int silenceSamples = (int) ((audioInfo.rate * padding) * 2);
float[] mixedBuffer = new float[minBufferSize + silenceSamples];
for (int i = 0; i < minBufferSize + silenceSamples; i++){
if (i < silenceSamples )
{
mixedBuffer[i] = (float) stereoWavBuffer[i];
}
else if(i < minBufferSize)
{
mixedBuffer[i] = (float) (stereoWavBuffer[i] + mp3Buffer[i-silenceSamples]);
}
else
{
mixedBuffer[i] = (float) (mp3Buffer[i-silenceSamples]);
}
}
To normalise the data, you need to run through the mixedBuffer and find the largest absolute value using Math.abs(...), and then multiply all the values in the array by 32,767/largestValue. This will give you a buffer where the largest value fits back into a short without clipping. Then iterate through your float array, moving each value back into a short array.
I'm not sure what your minBufferSize is - this will need to be large enough to get all your data mixed.
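The normalisation step described above could be sketched as follows. The class and method names are hypothetical, and the guard for an all-silent buffer is an addition that the description does not mention.

```java
final class MixNormalizer {

    /** Scales the mix so its largest absolute value becomes 32767, then converts to shorts. */
    static short[] normalizeToShorts(float[] mixed) {
        float largest = 0f;
        for (float v : mixed) {
            largest = Math.max(largest, Math.abs(v));
        }
        short[] out = new short[mixed.length];
        if (largest == 0f) {
            return out; // all silence, nothing to scale
        }
        float scale = 32767f / largest;
        for (int i = 0; i < mixed.length; i++) {
            out[i] = (short) Math.round(mixed[i] * scale);
        }
        return out;
    }
}
```

Note that this always rescales to full range, so a quiet mix gets louder as well. If that is not wanted, the scaling can be applied only when largest exceeds 32767.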
