Can I get sound data as array? - java

I'm making a program for Active Noise Control (also called Adaptive Noise Cancellation).
The system is pretty simple:
1. Get sound via the mic.
2. Turn the sound into data I can read (something like an integer array).
3. Make the antiphase of the sound.
4. Turn the data back into a sound file.
Following are my questions:
(1) Can I read sound as an integer array?
(2) If I can use an integer array, how can I make the antiphase? Do I just multiply every value by -1?
(3) Any useful thoughts about my project?
(4) Is there a recommended language other than Java?
I heard that Stack Overflow has many top-class programmers, so I'm expecting a critical answer :D

Answering your questions:
(1) When you read sound, a byte array is returned. The bytes can readily be decoded into integers, shorts, floats, whatever. Java supports many common formats, and probably has one that matches your microphone input and speaker output. For example, Java supports 16-bit encoding, stereo, 44100 fps, which is considered the standard for CD quality. There are several questions already on Stack Overflow that show the code for decoding the bytes and re-encoding them afterwards.
(2) Yes, just multiply every element of your PCM array by -1. When you add the negated signal to its correctly aligned counterpart, the result is 0.
(3 & 4) I don't know what the tolerances are for lag time! I think if you simply take the input, decode, multiply by -1, recode, and output, it might be possible to get a very small amount of processing time. I don't know what Java is capable of here, but I bet it will be on the scale of a dozen millis, give or take. How much is enough for cancellation? How far does the sound travel from mike to speaker location? How much time does that allow? (Or am I missing something about how this works? I haven't done this sort of thing before.)
Java is pretty darn fast, and you will be relatively close to the native code level with the reading, the writing, and the simple numeric conversions. The core code (for testing) could probably be written in an afternoon, using the following tutorial examples as a template: Reading/Writing sound files, see code snippets. I'd pay particular attention to the spot where the comment reads "Here do something useful with the audio data that is in the bytes array..." At this point, you would put the code to convert the bytes to PCM, multiply by -1, then convert back to bytes.
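A minimal sketch of that step, assuming the line was opened with a 16-bit signed little-endian format (the method name and buffer handling are mine, not from the tutorial):

// Invert 16-bit little-endian PCM in place to produce the antiphase.
static void invertInPlace(byte[] buffer, int numBytesRead) {
    for (int i = 0; i + 1 < numBytesRead; i += 2) {
        // Decode two bytes into one signed 16-bit sample.
        int sample = (buffer[i] & 0xFF) | (buffer[i + 1] << 8);
        sample = -sample;                       // multiply by -1
        if (sample > Short.MAX_VALUE) {         // -(-32768) would overflow
            sample = Short.MAX_VALUE;
        }
        buffer[i]     = (byte) sample;          // re-encode low byte
        buffer[i + 1] = (byte) (sample >> 8);   // re-encode high byte
    }
}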
If Java doesn't prove fast enough, I assume the next thing to try would be some flavor of C.

Related

How to implement exponentialRampToValueAtTime in order to remove clicking sound when writing audio via websocket

I am writing an audio stream via websocket for a telephony application. Just before the audio starts playing there is a distinct 'click'. Upon further research, I came across the following question on SO:
WebAudio play sound pops at start and end
The accepted answer in the above question suggests using the exponentialRampToValueAtTime API to remove the said noise. I am implementing my service in Java and do not have access to that specific API. How do I go about implementing an exponentialRampToValueAtTime method to attenuate the noise in Java?
I wrote code to handle the clicking problem for some sound-based applications. IDK that the algorithm I came up with is considered robust, but it seems to be working. I'd categorize it as "reinventing the wheel". The algorithm uses a linear ramp, not exponential. It shouldn't be too hard to make the increments exponential instead.
Basic overview:
Obtain the byte data from the line
convert the byte data to PCM
for the starting 1024 PCM frames (a smaller number may be fine, especially if an exponential ramp is used instead of a linear one), multiply each frame by a sequence that progresses from 0 to 1
I use the following linear formula and have gotten satisfactory results.
for (int n = 0; n < 1024; n++)
    pcmValue[n] *= n / 1024f;   // float division; integer division would always yield 0
convert the PCM back to bytes and ship
This only has to be done at the starts and the ends (for an ending, run the algorithm in reverse).
For exponential, I'm guessing something like the following might work:
pcmValue[n] *= (Math.pow(2, n / 1024.0) - 1);   // again, floating-point division
A function related to decibels might have even better-spaced increments. The better the spacing, the fewer PCM frames are needed to prevent the click.
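Putting the steps together, here is a minimal sketch of the linear version, assuming the frames have already been decoded to floats in the range -1 to 1 (the method name is mine):

// Apply a linear fade-in over the first RAMP frames.
static void fadeIn(float[] pcm) {
    final int RAMP = 1024;               // ramp length discussed above
    int n = Math.min(RAMP, pcm.length);
    for (int i = 0; i < n; i++) {
        pcm[i] *= (float) i / RAMP;      // factor progresses from 0 to 1
    }
}
// A fade-out is the same ramp run backwards over the final frames.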
In AudioCue, there is a need to ensure smooth transitions when a sound receives real-time commands to change volume. I use the same basic idea described above, but with a linear ramp between the two volume levels. Code for handling the increments can be seen at lines 1200, 892 and 1302. Using a linear ramp allows for a one-time calculation followed by simple addition for the individual PCM frames. But as I said, I wrote this from scratch and a better-schooled or more experienced audio engineer will likely have further insights or improvements.

Read and understand contents of AMR file on Android

I want to open an AMR file so I can perform signal processing algorithms on the contents (ex: what is the pitch?). I know you can open these files in a media player, but I want to get the actual contents of the file.
At one point I printed the contents and got a bunch of integers, but have no idea what they mean.
Any help is greatly appreciated. Thanks!
It sounds like you are able to get at the data, but don't know very much at all about the basics of audio signal processing.
The data you are looking at is probably raw bytes that need to be translated into PCM (Pulse Code Modulation). The Java Overview of the Sampled Package talks a bit about the relationship of the bytes to PCM as determined by a specific format.
For example, if the format specifies 16-bit encoding, then two bytes (each being 8 bits) will be concatenated to form a single PCM value that ranges from -32768 to 32767. (Some people work directly with these numbers; others scale them to floats ranging from -1 to 1.)
And if the file is 44100 fps, then there will be 44100 "frames" of data per second, where a frame will most likely be mono or stereo (one or two PCM values per frame).
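As a concrete sketch of that decoding (assuming 16-bit little-endian stereo; the variable names are illustrative):

int i = frameIndex * 4;   // 4 bytes per frame: 2 bytes x 2 channels
short left  = (short) ((data[i]     & 0xFF) | (data[i + 1] << 8));
short right = (short) ((data[i + 2] & 0xFF) | (data[i + 3] << 8));
float leftScaled = left / 32768f;   // optional scaling to the -1..1 range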
The tutorial does get into Java specifics pretty quickly, but at least it gives a basic picture and you will have more terms to use in a search for something more specific to Android.
If you want to go into greater depth or detail, you could consult Steve Smith's The Scientist and Engineer's Guide to Digital Signal Processing. It is a free online book that I've found to be extremely helpful.

Cross Correlation: Android AudioRecord create sample data for TDoA

On one side, with my Android smartphone, I'm recording an audio stream using AudioRecord.read(). For the recording I'm using the following specs:
SampleRate: 44100 Hz
MonoChannel
PCM-16Bit
size of the array I use for AudioRecord.read(): 100 (short array)
Using this small size allows me to read every 0.5 ms (mean value), so I can use this timestamp later for the multilateration (at least I think so :-) ). Maybe this will become obsolete if I can use cross-correlation to determine the TDoA?! (see below)
On the other side I have three speakers emitting different sounds using the WebAudio API, with the following specs:
freq1: 17500 Hz
freq2: 18500 Hz
freq3: 19500 Hz
signal length: 200 ms, plus a fade-in and fade-out of the gain node of 5 ms each, so 210 ms in total
My goal is to determine the time difference of arrival (TDoA) between the emitted sounds. So in each iteration I read 100 samples from my AudioRecord buffer and then try to determine the time difference (if I have found one of my sounds). So far I've used a simple frequency filter (using an FFT) to determine the TDoA, but this is really inaccurate in the real world.
So far I've found out that I can use cross-correlation to determine the TDoA even better (http://paulbourke.net/miscellaneous/correlate/ and some threads here on SO). Now my problem: at the moment I think I have to correlate the recorded signal (my short array) with a generated version of each of my three sounds above. But I'm struggling to generate those signals. Using the code found at http://repository.tudelft.nl/view/ir/uuid%3Ab6c16565-cac8-448d-a460-224617a35ae1/ (section B1.1, genTone()) does not solve my problem, because it generates an array much bigger than my recorded samples. And as far as I know, cross-correlation needs two arrays of the same size to work. So how can I generate a suitable sample array?
Another question: is my thinking on how to determine the TDoA correct so far?
Here are some lessons I've learned the past days:
I can either use cross-correlation (xcorr) or a frequency-recognition technique to determine the TDoA. The latter is far more imprecise, so I focus on the xcorr.
I can obtain the TDoA by applying the xcorr to my recorded signal and two reference signals. E.g. my recording has a length of 1000 samples. With the xcorr I recognize sound A at sample 500 and sound B at sample 600, so I know they have a time difference of 100 samples (which can be converted to seconds using the sample rate).
Therefore I generate a linear chirp (chirps are better than simple sine waves; see the literature) using this code found on SO. For an easy example, and to check whether my experiment works, I save my recording as well as my generated chirp sounds as .wav files (there are plenty of code examples showing how to do this). Then I use MATLAB as an easy way to calculate the xcorr: see here
Another point: "does the input of xcorr have to be the same size?" I'm not quite sure about this part, but I think it has to be done. We can achieve this by zero-padding the two signals to the same length (preferably a power of two, so we can use the efficient radix-2 implementation of the FFT) and then using the FFT to calculate the xcorr (see another link from SO). A naive time-domain version is sketched below for orientation.
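This sketch is brute force only (the method name is mine; for signals of any real length, the FFT route above is the practical choice):

// Slide the reference over the recording; the lag with the largest
// dot product is where the reference best matches the recording.
static int bestLag(float[] recording, float[] reference) {
    int bestLag = 0;
    double best = Double.NEGATIVE_INFINITY;
    for (int lag = 0; lag + reference.length <= recording.length; lag++) {
        double sum = 0;
        for (int j = 0; j < reference.length; j++) {
            sum += recording[lag + j] * reference[j];
        }
        if (sum > best) {
            best = sum;
            bestLag = lag;
        }
    }
    return bestLag;
}
// TDoA in seconds = (bestLag(rec, refA) - bestLag(rec, refB)) / 44100.0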
I hope this is correct so far and covers some questions from other people :-)

Analyzing Sound in a WAV file

I am trying to analyze a movie file by splitting it up into camera shots and then trying to determine which shots are more important than others. One of the factors I am considering in a shot's importance is how loud the volume is during that part of the movie. To do this, I am analyzing the corresponding sound file. I'm having trouble determining how "loud" a shot is because I don't think I fully understand what the data in a WAV file represents.
I read the file into an audio buffer using a method similar to that described in this post.
Having already split the corresponding video file into shots, I am now trying to find which shots are louder than others in the WAV file. I am trying to do this by extracting each sample in the file like this:
double amplitude = (double)((audioData[i] & 0xff) | (audioData[i + 1] << 8));
Some of the other posts I have read seem to indicate that I need to apply a Fast Fourier Transform to this audio data to get the amplitude, which makes me wonder what the values I have extracted actually represent. Is what I'm doing correct? My sound file format is a 16-bit mono PCM with a sampling rate of 22,050 Hz. Should I be doing something with this 22,050 value when I am trying to analyze the volume of the file? Other posts suggest using Root Mean Square to evaluate loudness. Is this required, or just a more accurate way of doing it?
The more I look into this the more confused I get. If anyone could shed some light on my mistakes and misunderstandings, I would greatly appreciate it!
The FFT has nothing to do with volume and everything to do with frequencies. To find out how loud a scene is on average, simply average the sampled values. Depending on whether you get the data as signed or unsigned values in your language, you might have to apply an absolute-value function first so that negative amplitudes don't cancel out the positive ones, but that's pretty much it. If you don't get the results you were expecting, it must have to do with the way you are extracting the individual values in line 20.
That said, there are a few refinements that might or might not affect your task. Perceived loudness, amplitude, and acoustic power are in fact related in non-linear ways, but as long as you are only trying to get a rough estimate of how much is "going on" in the audio signal, I doubt that this is relevant for you. And of course, humans hear different frequencies better or worse - for instance, bats emit ultrasound squeals that would be absolutely deafening to us, but luckily we can't hear them at all. But again, I doubt this is relevant to your task, since frequencies above half the sampling rate (the Nyquist limit - 11.025 kHz for your 22,050 Hz file) are in fact not representable in a simple WAV file anyway.
I don't know the level of accuracy you want, but a simple RMS (and perhaps simple filtering of the signal) is all many similar applications would need.
RMS will be much better than Peak amplitude. Using peak amplitudes is like determining the brightness of an image based on the brightest pixel, rather than averaging.
If you want to filter the signal or weigh it to perceived loudness, then you would need the sample rate for that.
FFT should not be required unless you want to do complex frequency analysis as well. The ear does not respond linearly to sounds at different frequencies and amplitudes, so if you need that extra degree of accuracy you could use the FFT to perform a frequency analysis and weight the signal accordingly.
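As a concrete sketch of the RMS suggestion (assuming the bytes have already been decoded to 16-bit samples; the window boundaries would come from your shot timestamps):

// RMS loudness of one window of PCM samples, returned on a 0..1 scale.
static double rms(short[] samples, int from, int to) {
    double sumOfSquares = 0;
    for (int i = from; i < to; i++) {
        double s = samples[i] / 32768.0;   // scale to -1..1
        sumOfSquares += s * s;
    }
    return Math.sqrt(sumOfSquares / (to - from));
}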

Compressing a byte array in Java and decompressing in C

I currently have the following array in a Java program,
byte[] data = new byte[800];
and I'd like to compress it before sending it to a microcontroller over serial (115200 baud). I would like to then decompress the array on the microcontroller in C. However, I'm not quite sure what the best way to do this is. Performance is an issue since the microcontroller is just an Arduino, so it can't be too memory/CPU intensive. The data is more or less random (edit: I guess it's not really that random, see the edit below), I'd say, since each 16 bits represents an RGB color value.
What would be the best way to compress this data? Any idea how much compression I could possibly get?
edit
Sorry about the lack of info. I need the compression to be lossless, and I only intend to send 800 bytes at a time. My issue is that 800 bytes won't transfer fast enough at the 115200 baud rate I am using. I was hoping I could shrink the size a little to improve speed.
Every two bytes looks like:
0RRRRRGGGGGBBBBB
Where the R, G, and B bits represent the values for the red, green, and blue color channels respectively. Each pair of bytes is then an individual LED on a 20x20 grid. I would imagine that many pairs of bytes will be identical, since I frequently assign the same color codes to multiple LEDs. It may also be the case that the RGB values are often > 15, since I typically use bright colors when I can (however, this might be a moot point since they are not all typically > 15 at once).
If the data is "more or less random" then you won't have much luck compressing it, I'm afraid.
UPDATE
Given the new information, I bet you don't need 32k colours on your LED display. I'd imagine that a 1024- or 256-colour palette might be sufficient. Hence you could get away with a trivial compression scheme (simply map each word through a lookup table, or possibly just discard the LSBs of each component), which would work even for completely uncorrelated pixel values.
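For instance, the discard-LSBs idea might look like the following sketch, mapping each 15-bit 0RRRRRGGGGGBBBBB word down to an 8-bit RGB332-style value and halving the payload (the method name is mine):

static byte toPalette(int word) {
    int r = (word >> 10) & 0x1F;   // 5-bit red
    int g = (word >> 5)  & 0x1F;   // 5-bit green
    int b =  word        & 0x1F;   // 5-bit blue
    // keep 3 bits of red and green and 2 of blue: 3 + 3 + 2 = 8
    return (byte) (((r >> 2) << 5) | ((g >> 2) << 2) | (b >> 3));
}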
Use miniLZO compression. Java version / C version
A really simple compression/decompression algorithm that is practical in tiny embedded environments and is easy to "roll your own" is run length encoding. Basically this means replacing a run of duplicate values with a (count, value) pair. Of course you need a sentinel (magic) value to introduce the pair, and then a mechanism to allow the magic value to appear in normal data (typically an escape sequence can be used for both jobs). In your example it might be best to use 16 bit values (2 bytes).
But naturally it all depends on the data. Data that is sufficiently random is incompressible by definition. You would do best to collect some example data first, then evaluate your compression options.
Edit after extra information posted
Just for fun and to show how easy run length encoding is I have coded up something. I'm afraid I've used C for compression as well, since I'm not a Java guy. To keep things simple I've worked entirely with 16 bit data. An optimization would be to use an 8 bit count in the (count,value) pair. I haven't tried to compile or test this code. See also my comment to your question about the possible benefits of mangling the LED addresses.
#include <stdint.h>   // for uint16_t

#define NBR_16BIT_WORDS 400

// Return number of words written to dst (always
// less than or equal to NBR_16BIT_WORDS)
uint16_t compress( uint16_t *src, uint16_t *dst )
{
    uint16_t *end = src + NBR_16BIT_WORDS;
    uint16_t *dst_begin = dst;
    while( src < end )
    {
        // count how many identical words start here
        uint16_t *temp;
        uint16_t count = 1;
        for( temp = src+1; temp < end; temp++ )
        {
            if( *src == *temp )
                count++;
            else
                break;
        }
        if( count < 3 )
            *dst++ = *src++;   // short runs are stored literally
        else
        {
            // bit 15 marks a run; the pixel data is 0RRRRRGGGGGBBBBB,
            // so the top bit is always free
            *dst++ = (*src) | 0x8000;
            *dst++ = count;
            src += count;      // skip past the whole run
        }
    }
    return (uint16_t)(dst - dst_begin);
}

void decompress( uint16_t *src, uint16_t *dst )
{
    uint16_t *end_src = src + NBR_16BIT_WORDS;
    uint16_t *end_dst = dst + NBR_16BIT_WORDS;
    while( src < end_src && dst < end_dst )
    {
        uint16_t data = *src++;
        if( (data & 0x8000) == 0 )
            *dst++ = data;     // literal word
        else
        {
            // expand a (value, count) run
            data &= 0x7fff;
            uint16_t count = *src++;
            while( dst < end_dst && count-- )
                *dst++ = data;
        }
    }
}
One of the first things to do would be to convert from RGB to YUV, or YCbCr, or something on that order. Having done that, you can usually get away with sub-sampling the U and V (or Cb/Cr) channels to half resolution. This is quite common in most types of images (e.g., JPEG and MPEG both do it, and so do the sensors in most digital cameras).
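The conversion step might look something like the sketch below - a rough integer approximation of the BT.601 coefficients, assuming 8-bit channels (the 5-bit channels in this question would need to be scaled up first):

// RGB -> YCbCr, integer approximation of the BT.601 matrix.
// Subsampling would then keep Cb/Cr for only every other pixel.
static int[] toYCbCr(int r, int g, int b) {
    int y  = ( 77 * r + 150 * g +  29 * b) >> 8;
    int cb = ((-43 * r -  85 * g + 128 * b) >> 8) + 128;
    int cr = ((128 * r - 107 * g -  21 * b) >> 8) + 128;
    return new int[] { y, cb, cr };
}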
Realistically, starting with only 800 bytes of data, most other forms of compression are going to be a waste of time and effort. You're going to have to put in quite a bit of work before you accomplish much (and keeping it reasonably fast on an Arduino won't be trivial either).
Edit: okay, if you're absolutely certain you can't modify the data at all, things get more difficult very quickly. The real question at that point is what kind of input you're dealing with. Others have already mentioned the possibility of something on the order of predictive delta compression -- e.g., based on the preceding pixels, predict what the next one is likely to be, and then encode only the difference between the prediction and the actual value. Getting the most out of that, however, generally requires running the result through some sort of entropy-based algorithm like Shannon-Fano or Huffman coding. Those, unfortunately, aren't usually the fastest to decompress though.
If your data is mostly things like charts or graphs, where you can expect to have large areas of identical pixels, run-length (or run-end) encoding can work pretty well. This has the advantage of being really trivial to decompress as well.
I doubt that LZ-based compression is going to work so well though. LZ-based compression works (in general) by building a dictionary of strings of bytes that have been seen, and when/if the same string of bytes is seen again, transmitting the code assigned to the previous instance instead of re-transmitting the entire string. The problem is that you can't transmit uncompressed bytes -- you start out by sending the code word that represents each byte in the dictionary. In your case, you could use (for example) a 10-bit code word. This means the first time you send any particular character, you need to send it as 10 bits, not just 8. You only start to get some compression when you can build up some longer (two-byte, three-byte, etc.) strings in your dictionary and find a matching string later in the input.
This means LZ-based compression usually gets fairly poor compression for the first couple hundred bytes or so, then about breaks even for a while, and only after it's been running across some input for a while does it really start to compress well. Dealing with only 800 bytes at a time, I'm not at all sure you're ever going to see much compression -- in fact, working in such small blocks, it wouldn't be particularly surprising to see the data expand on a fairly regular basis (especially if it's very random).
The data is more or less random I'd say since it represents an RGB color value for every 16 bits.
What would be the best way to compress this data? Any idea how much compression I could possibly get?
Ideally you could compress 800 bytes of colour data down to one byte if the whole image were the same colour. As Oli Charlesworth mentions, however, the more random the data, the less you can compress it. If your image looks like static on a TV, then indeed, good luck getting any compression out of it.
Definitely consider Oli Charlesworth's answer. On a 20x20 grid, I don't know if you need a full 32k color palette.
Also, in your earlier question, you said you were trying to run this at a 20 ms period (50 Hz). Do you really need that much speed for this display? At 115200 bps, you can transmit ~11520 bytes/sec - call it 10 KB/s for a margin of safety (e.g. your micro might have a delay between bytes; you should do some experiments to see what the 'real' bandwidth is). At 50 Hz, this only allows you about 200 bytes per packet - you're looking for a compression ratio over 75%, which may not be attainable under any circumstances. You seem pretty married to your requirements, but it may be time for an awkward chat.
If you do want to go the compression route, you will probably just have to try several different algorithms with 'real' data, as others have said, and try different encodings. I bet you can find some extra processing time by doing matrix math, etc. in between receiving bytes over the serial link (you'll have about 80 microseconds between bytes) - if you use interrupts to read the serial data instead of polling, you can probably do pretty well by using a double buffer and processing/displaying the previous buffer while reading into the current buffer.
EDIT:
Is it possible to increase the serial port speed beyond 115200? This USB-serial adapter at Amazon says it goes up to 1 Mbps (probably actually 921600 bps). Depending on your hardware and environment, you may have to worry about bad data, but if you increase the speed enough, you could probably add a checksum, and maybe even limited error correction.
I'm not familiar with the Arduino, but I've got an 8-bit FreeScale HCS08 I drive at 1.25 Mbps, although the bus is actually running RS-485, not RS-232 (485 uses differential signaling for better noise performance), and I don't have any problems with noise errors. You might even consider a USB RS-485 adapter, if you can wire that to your Arduino (you'd need conversion hardware to change the 485 signals to the Arduino's levels).
EDIT 2:
You might also consider this USB-SPI/I2C adapter, if you have an available I2C or SPI interface, and you can handle the wiring. It says it can go to 400 kHz I2C or 200 kHz SPI, which is still not quite enough by itself, but you could split the data between the SPI/I2C and the serial link you already have.
LZ77/78 are relatively easy to write: http://en.wikipedia.org/wiki/LZ77_and_LZ78
However, given the small amount of data you're transferring, it's probably not worth compressing it at all.
