Let's say we have taken a mic input (say "hello") and stored it as a WAV file. Then we take the same input, "hello", from the mic again. If the two are identical, we trigger an action. So how do we compare and check the raw data of the two inputs?
Update: Let's suppose we only want the exact word being spoken and are not interested in who said it, since depending on the speaker would prevent the program/software from being user independent. In other words: we need to extract the exact words being spoken from the user's mic input and then check whether they are identical to any of the given predefined commands, which will in turn trigger an action.
So, in other words, we need the following:
extract the exact words spoken by the speaker/user;
compare/check whether the word spoken by the user is identical to any of the predefined, stored words.
So how do we go about this?
Simple comparison of WAV files is not going to work. What you need is some kind of voice print software. But most of the Java speech processing software out there seems to be more focused on speech recognition (figuring out what was said) than on voice prints (who said it).
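If the updated requirement is really "recognize the word, then match it against predefined commands", a speech recognizer is the usual route. Below is a minimal sketch using CMU Sphinx (sphinx4); the library choice, model paths, and command set are assumptions for illustration, not something from the original question.

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.LiveSpeechRecognizer;
    import edu.cmu.sphinx.api.SpeechResult;
    import java.util.Set;

    public class CommandListener {
        public static void main(String[] args) throws Exception {
            // Predefined commands that should trigger an action (placeholder set).
            Set<String> commands = Set.of("hello", "stop", "play");

            Configuration config = new Configuration();
            // Paths of the US English models bundled with sphinx4-data (assumed).
            config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

            LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
            recognizer.startRecognition(true); // true = discard previously buffered audio
            while (true) {
                SpeechResult result = recognizer.getResult();
                String heard = result.getHypothesis().toLowerCase().trim();
                if (commands.contains(heard)) {
                    System.out.println("Trigger action for: " + heard);
                }
            }
        }
    }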
I'm looking for the best (and fastest) way to record a short audio input (like one word) from a mobile microphone and then compare it with a long, real-time audio input (like speech) from the same person, looking for occurrences of the word.
I tried many approaches, like using the typical SpeechRecognizer, but there were many problems; for example, there is actually no way to guarantee that it will give results fast enough or run for many minutes.
VoiceRecognition Android Taking Too Long To React
Long audio speech recognition on Android
I don't really need to recognize which words the person is saying, only to be able to find occurrences with some deviation.
It would be nice if you could give me some suggestions on how to do this.
EDIT: I'm basically looking for a way to control the app with sound input from the user.
Here are a couple of ideas to consider.
(1) First, create a set of common short sounds that are likely to be part of the search. For example, perhaps all phonemes, or something like a set of all consonant-vowel combinations, e.g., bah, bay, beh, bee, etc., and the same with cah, cay, keh, key, etc.
Then, run through the "long" target sample with each, indexing the locations where these phonemes are found.
Now, when the user gives you a word, first compare it to your set of indexed phoneme fragments, and then use the matches to focus your search and test in the long target file.
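As a rough sketch of the index behind idea (1), all you need is a map from each short reference sound to the offsets where it was detected in the long recording; the phoneme detector itself is assumed and not shown:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Index for idea (1): map each short reference sound (e.g. "bah", "bee")
    // to the sample offsets in the long recording where it was detected.
    public class PhonemeIndex {
        private final Map<String, List<Integer>> hits = new HashMap<>();

        public void add(String phoneme, int sampleOffset) {
            hits.computeIfAbsent(phoneme, k -> new ArrayList<>()).add(sampleOffset);
        }

        // Candidate positions in the long file to examine for a query word
        // whose leading phoneme has already been identified.
        public List<Integer> candidates(String leadingPhoneme) {
            return hits.getOrDefault(leadingPhoneme, List.of());
        }
    }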
(2) Break your "long" file up into fragments and sort the fragments. Then compare the input word to the items in the sorted list, using something like a binary search algorithm.
I am currently developing an application where the user can load an .mp3 file and enter a sequence of notes. The goal for the user is to match this sequence of notes with the song in the .mp3 file.
This requires the possibility to play the .mp3 file and the sequence of notes simultaneously. After some research I found out that either the Java Sound API or JFugue can do the job of producing a sequence of notes (MIDI) from the input given by the user. As stated here, JLayer can be used to play mp3 files in Java. (I could also transform the .mp3 to .wav and use another way to play the transformed .wav.)
However, would it be possible to play this .mp3 and sequence of notes together without any problems, or should I first convert them to one single file?
The user should be able to play the .mp3 and his/her sequence of notes at any random timestamp simultaneously. Preferably without any delay so the user can easily adapt a note to match the pitch of the file. It seems that merging them together to one file, before playing them, would be too much overhead when the user is almost constantly changing a note and replaying to check if it matches the pitch.
Thanks in advance!
Java supports playback from multiple threads. All you need to do is run the .mp3 from one thread, and the MIDI-generated notes on another, concurrently running thread.
There used to be a few Linux systems that could only handle output from one audio source at a time. I don't know if this is still an issue.
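A minimal sketch of the two-thread approach, assuming JLayer's javazoom.jl.player.Player for the .mp3 and the standard javax.sound.midi Sequencer for the note sequence (the file names are placeholders):

    import javazoom.jl.player.Player;   // JLayer
    import javax.sound.midi.MidiSystem;
    import javax.sound.midi.Sequence;
    import javax.sound.midi.Sequencer;
    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;

    public class DualPlayback {
        public static void main(String[] args) throws Exception {
            // Thread 1: decode and play the .mp3 with JLayer.
            Thread mp3Thread = new Thread(() -> {
                try (FileInputStream in = new FileInputStream("song.mp3")) { // placeholder
                    new Player(new BufferedInputStream(in)).play();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });

            // The Sequencer plays the user's notes on its own internal thread.
            Sequencer sequencer = MidiSystem.getSequencer();
            sequencer.open();
            Sequence notes = MidiSystem.getSequence(new File("notes.mid")); // placeholder
            sequencer.setSequence(notes);

            mp3Thread.start();
            sequencer.start();
        }
    }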
Another, much more elaborate possibility that would let you do live mixing and output to a single line would be to read the song file using an AudioInputStream, convert the bytes to PCM on the fly (e.g., to floats ranging from -1 to 1) or preload and store the audio as PCM, then add this to the PCM data coming from a do-it-yourself synth, convert the sum back to bytes, and output it via a SourceDataLine.
That is a lot of trouble and you probably don't want to go that route, but if you do, the following is some info to help break down the various steps of one possible realization.
Loading .wav data and converting it into an internal PCM form can be seen in the open-source AudioCue (line 359 loadURL method). And here is an example (free download) of a real-time Java synth I made that runs via keystrokes. One of the voices is a simple organ, which outputs PCM audio data by just adding four sine waves at harmonic frequencies. Making other sounds is possible if you want to get into other forms of synthesis but gets more involved.
(I don't know how to convert data coming from a MIDI-controlled synth, unless maybe a TargetDataLine can be identified and data from it converted to PCM, similar to the conversion used when reading from an AudioInputStream in the AudioCue source example.)
Given two PCM sources, the two can be mixed in real time using addition, converted to bytes, and output via a single SourceDataLine (see line 1387, the convertBufferToAudioBytes method). The SourceDataLine can be kept running indefinitely if you input zeros from the contributors when they are not playing. An SDL spends the vast majority of its time in a blocked state, as audio data processing is much quicker than the rate at which the system consumes it, so it uses very little CPU.
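A bare-bones sketch of just the mix-and-output step described above, assuming both contributors already deliver normalized float PCM in the same format (decoding and synthesis are not shown; the buffers here are stand-ins):

    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.SourceDataLine;

    public class PcmMixSketch {
        // Add two normalized float buffers, clamp, and convert to 16-bit little-endian bytes.
        static byte[] mixToBytes(float[] a, float[] b) {
            byte[] out = new byte[a.length * 2];
            for (int i = 0; i < a.length; i++) {
                float sum = Math.max(-1f, Math.min(1f, a[i] + b[i])); // simple clipping
                short s = (short) (sum * 32767);
                out[2 * i]     = (byte) (s & 0xFF);
                out[2 * i + 1] = (byte) ((s >> 8) & 0xFF);
            }
            return out;
        }

        public static void main(String[] args) throws Exception {
            AudioFormat fmt = new AudioFormat(44100, 16, 1, true, false); // mono, signed, little-endian
            SourceDataLine line = AudioSystem.getSourceDataLine(fmt);
            line.open(fmt);
            line.start();

            float[] songChunk  = new float[1024];   // stand-in for the decoded song PCM
            float[] synthChunk = new float[1024];   // stand-in for the synth PCM
            for (int i = 0; i < synthChunk.length; i++) {
                synthChunk[i] = (float) Math.sin(2 * Math.PI * 440 * i / 44100.0) * 0.5f;
            }

            byte[] frame = mixToBytes(songChunk, synthChunk);
            line.write(frame, 0, frame.length);     // in a real app, keep writing chunks in a loop
            line.drain();
            line.close();
        }
    }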
My application involves scanning through the phone camera and detecting text. The only words my application is concerned with are valid English words.
I have a list of ~354,000 valid English words that I can compare my scanned word with.
Since my application continuously detects text, I need this functionality to be very, very fast. I have applied the Levenshtein Distance technique. For each word, I:
Store the contents of the text file into an ArrayList<String> using Scanner
Calculate the Levenshtein Distance of the word with each of the 354k words (a reference sketch follows this list)
Return the word corresponding to the minimum distance value
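For reference, the standard dynamic-programming form of the distance computed in step 2 looks roughly like this (a generic sketch, not the exact code from my app):

    // Classic O(m*n) dynamic-programming Levenshtein distance, using two rows of storage.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }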
The problem is that it is very, very slow. Without this step, my app manages to OCR more than 20 words in around 70 to 100 milliseconds. When I include this correction routine, my app takes more than a full minute (60,000 ms) for a single word.
I was wondering if this technique is even suitable for my case. If not, what other tested approach should I go with? Any help would be greatly appreciated. I know this is possible, looking at how Android keyboards are able to instantly correct incorrectly typed words.
Other failed endeavors:
Jaro distance (similar).
Android's internal SpellCheckerSession service (doesn't fit my case; receiving the result via a callback is the issue).
My solution that works:
I created a MySQL table and uploaded the list of valid English words into it. It solves all the problems addressed in the question.
Here is my Android Application for reference:
Optical Dictionary & Vocabulary Teacher
I understand the concept of TargetDataLine and SourceDataLine, and I have written a program to list them as well as the Ports and the available Controls for each. For the test program I have an onboard mic, onboard speakers, a line in, a speaker jack, and an audio interface with two inputs and one output. The inputs on the interface are treated as left and right, so I'm not sure how I would differentiate between the two if they act as one stereo input.
I want to be able to select the DataLine I want to use for either recording or playback at runtime. How can I identify and separate inputs and outputs to list them and allow a user to select a specific one to use? And if anyone has any suggestions for handling the interface input as two mono inputs, that would be helpful as well. Thanks in advance.
Converting two mono lines into stereo requires interleaving the left and right, one "sample" at a time. The size of a sample depends on your bit depth; for example, 16-bit encoding consumes two bytes. So, take two bytes from the left, then two bytes from the right, and repeat for the duration of the lines.
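A small sketch of that interleaving, assuming both mono lines deliver 16-bit samples in the same byte order:

    // Interleave two 16-bit mono byte streams into one stereo stream:
    // two bytes from the left channel, then two bytes from the right, repeated.
    static byte[] interleave16Bit(byte[] left, byte[] right) {
        int frames = Math.min(left.length, right.length) / 2; // 2 bytes per 16-bit sample
        byte[] stereo = new byte[frames * 4];
        for (int i = 0; i < frames; i++) {
            stereo[4 * i]     = left[2 * i];
            stereo[4 * i + 1] = left[2 * i + 1];
            stereo[4 * i + 2] = right[2 * i];
            stereo[4 * i + 3] = right[2 * i + 1];
        }
        return stereo;
    }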
There may be prebuilt methods that will help you with this. Check out the section of the Java Sound tutorial on converting audio formats -- it's about the 4th or 5th section of the tutorial, and happens to also be the best written of the bunch, if I remember correctly. (Actual sample code is provided, unlike in much of the rest of this very difficult tutorial.)
I'm not sure how selecting a line or port differs from programming the selection of anything else. You make a list, the user clicks a button associated with an item or selects it from a drop-down, and then you plug it in.
I have a theremin with a menu bar that allows one to select a mixer line. It just populates a radio-button set with the names of the mixers that are found. When you select an item, the listener installs the associated mixer.
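For what it's worth, the list itself can be built with nothing but standard javax.sound.sampled calls, roughly like this (a sketch -- how you present it, radio buttons or a drop-down, is up to you):

    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.Line;
    import javax.sound.sampled.Mixer;
    import javax.sound.sampled.SourceDataLine;
    import javax.sound.sampled.TargetDataLine;

    public class LineLister {
        public static void main(String[] args) {
            for (Mixer.Info info : AudioSystem.getMixerInfo()) {
                Mixer mixer = AudioSystem.getMixer(info);
                boolean hasInput  = mixer.isLineSupported(new Line.Info(TargetDataLine.class));
                boolean hasOutput = mixer.isLineSupported(new Line.Info(SourceDataLine.class));
                if (hasInput || hasOutput) {
                    System.out.printf("%s  [input: %b, output: %b]%n",
                            info.getName(), hasInput, hasOutput);
                }
            }
            // The Mixer.Info the user picks can then be passed to
            // AudioSystem.getTargetDataLine(format, selectedInfo) to open that device.
        }
    }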
I want to write a program that plays an audio file which reads a text aloud.
I want to highlight the current syllable that the audio file is playing in green, and the rest of the current word in red.
What kind of data structure should I use to store the audio file and the information that tells the program when to switch to the next word/syllable?
This is a slightly left-field suggestion, but have you looked at karaoke software? It may not be seen as "serious" enough, but it sounds very similar to what you're doing. For example, Aegisub is a subtitling program that lets you create subtitles in the SSA/ASS format. It has karaoke tools for highlighting the chosen word or part.
It's most commonly used for subtitling anime, but it also works for audio provided you have a suitable player. These are sadly quite rare on the Mac.
The format looks similar to the one proposed by Yuval A:
{\K132}Unmei {\K34}no {\K54}tobira
{\K60}{\K132}yukkuri {\K36}to {\K142}hirakareta
The lengths are durations rather than absolute offsets. This makes it easier to shift the start of the line without recalculating all the offsets. The double entry indicates a pause.
Is there a good reason this needs to be part of your Java program, or is an off the shelf solution possible?
How about a simple data structure that describes which batch of letters makes up the next syllable and the timestamp for switching to that syllable?
Just a quick example:
[0:00] This [0:02] is [0:05] an [0:07] ex- [0:08] am- [0:10] ple
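In Java that could be as simple as a list of (syllable, timestamp) pairs; a sketch, with illustrative names:

    import java.util.List;

    // A cue marks when playback should advance to the next syllable.
    class SyllableCue {
        final String syllable; // e.g. "ex-", "am-", "ple"
        final long   startMs;  // offset into the audio file, in milliseconds
        SyllableCue(String syllable, long startMs) {
            this.syllable = syllable;
            this.startMs = startMs;
        }
    }

    class CueSheet {
        // The example line above, expressed as a list of cues.
        static final List<SyllableCue> EXAMPLE = List.of(
                new SyllableCue("This", 0),
                new SyllableCue("is",   2000),
                new SyllableCue("an",   5000),
                new SyllableCue("ex-",  7000),
                new SyllableCue("am-",  8000),
                new SyllableCue("ple",  10000));
    }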
Highlighting part of a word sounds like you're getting into phonetics, the sounds that make up words. It's going to be really difficult to turn a sound file into something that will "read" a text. Your best bet is to use the text itself to drive a phonetics-based engine, like FreeTTS, which is based on the Java Speech API.
To do this, you're going to have to take the text to be read, split it into its phonetic syllables, and play them: so "syllable" becomes "syl" "la" "ble". Playing would be: highlight "syl", say it, and move to the next one.
This is really "old-skool"; it was done the same way on the original Apple II.
You might want to get familiar with FreeTTS, an open-source tool: http://freetts.sourceforge.net/docs/index.php
You might want to feed only a few words to the TTS engine at a given point in time: highlight them and, once they have been spoken, de-highlight them and move on to the next batch of words.
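For example, a minimal FreeTTS sketch along those lines, assuming the bundled "kevin16" voice is available; the highlighting calls are placeholders for whatever your UI does:

    import com.sun.speech.freetts.Voice;
    import com.sun.speech.freetts.VoiceManager;

    public class BatchSpeaker {
        public static void main(String[] args) {
            Voice voice = VoiceManager.getInstance().getVoice("kevin16"); // bundled FreeTTS voice
            voice.allocate();
            String[] batches = {"This is", "an example", "of batched speech"};
            for (String batch : batches) {
                // highlight(batch);   // placeholder: mark these words in the UI
                voice.speak(batch);    // blocks until the batch has been spoken
                // unhighlight(batch); // placeholder: restore normal styling
            }
            voice.deallocate();
        }
    }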