I'm looking for the best (and fastest) way to record a short audio input (like one word) from the mobile microphone and then compare it with a long real-time audio input (like speech) from the same person, looking for occurrences of that word.
I tried many approaches, like using the typical SpeechRecognizer, but there were many problems; for example, there is actually no way to guarantee that it will give results fast enough, or that it will keep running for many minutes.
VoiceRecognition Android Taking Too Long To React
Long audio speech recognition on Android
I don't really need to recognize which word the person is saying, only to be able to find occurrences with some deviation.
It would be nice if you could give me some suggestions on how to do so.
EDIT: I'm basically looking for a way to control the app with sound input from the user.
Here are a couple ideas to consider.
(1) First, create a set of common short sounds that will likely be part of the search. For example, all phonemes, or something like a set of all consonant-vowel combinations, e.g., bah, bay, beh, bee, etc., and the same with cah, cay, keh, key, etc.
Then, run through the "long" target sample with each, indexing the locations where these phonemes are found.
Now, when the user gives you a word, first compare it to your set of indexed phoneme fragments, and then use the matches to focus your search and test in the long target file.
(2) Break your "long" file up into fragments and sort the fragments. Then compare the input word to the items in the sorted list, using something like a binary search algorithm.
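Either way, both ideas come down to sliding a short template over a longer recording and scoring the similarity at each position. Here is a minimal sketch of that core loop, assuming both buffers are 16-bit PCM at the same sample rate and using a normalized cross-correlation score; the threshold and hop size are placeholders you would have to tune, and a production version would compare spectral features (e.g., MFCCs) rather than raw samples:

import java.util.ArrayList;
import java.util.List;

public class TemplateMatcher {

    // Slides the short template over the long recording and records the
    // offsets where the normalized cross-correlation exceeds a threshold.
    public static List<Integer> findOccurrences(short[] longAudio, short[] template, double threshold) {
        List<Integer> hits = new ArrayList<>();
        int hop = Math.max(1, template.length / 4); // placeholder hop size
        for (int offset = 0; offset + template.length <= longAudio.length; offset += hop) {
            if (correlation(longAudio, offset, template) >= threshold) {
                hits.add(offset);
            }
        }
        return hits;
    }

    // Normalized cross-correlation of the template against one window.
    private static double correlation(short[] audio, int offset, short[] template) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < template.length; i++) {
            double a = audio[offset + i];
            double b = template[i];
            dot += a * b;
            normA += a * a;
            normB += b * b;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}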
Let us say a user is typing text in an EditText. Now, as the user is typing, I want to extract the keywords from that text.
For example, if the user types "I am having a headache", it should extract "headache" as a keyword.
Please let me know how I can do this efficiently in Android.
Update: I do not know what the keywords are. They have to be extracted from the text which the user enters.
First of all, you should define what you will consider keywords:
a. A limited list of words which are the keywords.
Or b. A limited list of words which are not the keywords.
That list can be kept in an ArrayList<String> in your code.
When the user changes the text in the EditText (see EditText.addTextChangedListener(new TextWatcher(){...})), you get the text and split() it into a String[] using the space character as a delimiter. Next, search for each word of the array in your list (option a or b above) to check whether it is or is not there. When you get a hit, you have found a keyword entered by the user.
The resulting keywords can be kept temporarily in another ArrayList<String> for you to use after finishing the scan of the input.
Note: I have proposed an ArrayList to keep the list, considering that it won't be a long one. For more complex scenarios the list can be kept in a HashMap or a TreeMap, along the lines of what #Deepakkaku commented, for the search to be quicker.
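A minimal sketch of option a, with a hardcoded example list (the three keywords are placeholders); a HashSet replaces the ArrayList for constant-time lookup, in line with the note above:

import android.text.Editable;
import android.text.TextWatcher;
import android.widget.EditText;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordScanner {

    // Option a: a limited list of words which ARE the keywords.
    private final Set<String> keywords =
            new HashSet<>(Arrays.asList("headache", "fever", "cough"));

    public void attach(EditText editText) {
        editText.addTextChangedListener(new TextWatcher() {
            @Override public void beforeTextChanged(CharSequence s, int start, int count, int after) {}
            @Override public void onTextChanged(CharSequence s, int start, int before, int count) {}

            @Override
            public void afterTextChanged(Editable s) {
                List<String> found = new ArrayList<>();
                for (String word : s.toString().toLowerCase().split("\\s+")) {
                    if (keywords.contains(word)) {
                        found.add(word); // a keyword entered by the user
                    }
                }
                // use 'found' here, e.g. show suggestions
            }
        });
    }
}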
There can be two approaches to this problem:
Hardcoding the keywords or non-keywords you are interested in. #Juan's answer is the way to go here.
The second option is using some machine-learned model, which I guess is what you are looking at, given your machine-learning tag.
Option 1 requires a set of keywords defined ahead of time, which you say you don't have in your question, so it won't work in such a case. So here's a solution for Option 2.
Create a model.
You have to create a dataset of labeled examples.
You have to define a vocabulary for your entire dataset.
You have to define and train a model. If you have enough data, you can start from scratch. Otherwise, it is recommended to use transfer learning; for example, you can look up NLP models such as word2vec or sentiment-analysis models online and read up on transfer learning. TF Hub makes it easy to do transfer learning.
Once you have trained the model, you have to work out how to convert it so it runs efficiently on Android for inference. You have choices in TensorFlow Lite, Caffe2, etc. If you use TensorFlow, it is recommended that you convert the model to TensorFlow Lite for efficient inference.
You have to build your Android app with the appropriate runtime (TFLite, Caffe2, etc.) and bundle the model in. You can use ML Kit to take care of the download for you if you don't want to bundle it.
Add the hooks to the model in your activity by listening for changes in your EditText and calling the model inference. You likely want the model interpreter to be loaded ahead of the first inference call for efficiency; see the sketch below.
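A rough sketch of that last step with TensorFlow Lite. The model file name, the encode() step, and the size constants are all hypothetical; encode() must reproduce whatever tokenization/vocabulary the model was trained with, and the asset has to be stored uncompressed for the memory-mapping to work:

import android.content.Context;
import android.content.res.AssetFileDescriptor;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import org.tensorflow.lite.Interpreter;

public class KeywordClassifier {

    private static final int INPUT_SIZE = 64;  // placeholder, must match the model
    private static final int NUM_CLASSES = 2;  // placeholder, must match the model

    private final Interpreter interpreter; // loaded once, reused for every call

    public KeywordClassifier(Context context) throws IOException {
        interpreter = new Interpreter(loadModel(context, "keyword_model.tflite")); // hypothetical asset name
    }

    // Memory-maps the bundled model file from the APK's assets.
    private static MappedByteBuffer loadModel(Context context, String name) throws IOException {
        AssetFileDescriptor fd = context.getAssets().openFd(name);
        try (FileInputStream in = new FileInputStream(fd.getFileDescriptor())) {
            return in.getChannel().map(FileChannel.MapMode.READ_ONLY,
                    fd.getStartOffset(), fd.getDeclaredLength());
        }
    }

    // Call this from your TextWatcher; returns one score per class.
    public float[] classify(String text) {
        float[][] input = new float[][] { encode(text) };
        float[][] output = new float[1][NUM_CLASSES];
        interpreter.run(input, output);
        return output[0];
    }

    // Hypothetical: tokenization + vocabulary lookup goes here.
    private float[] encode(String text) {
        return new float[INPUT_SIZE];
    }
}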
My application involves scanning through the phone camera and detecting text. The only words my application is concerned with are valid English words.
I have a list of ~354,000 valid English words that I can compare each scanned word with.
Since my application continuously detects text, I need this functionality to be very, very fast. I have applied the Levenshtein distance technique. For each word, I do the following (sketched in code after this list):
Store the contents of the text file in an ArrayList<String> using Scanner
Calculate the Levenshtein distance of the word to each of the 354k words
Return the word corresponding to the minimum distance value
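For reference, this is roughly what that routine looks like, using the usual two-row dynamic-programming implementation of Levenshtein distance. Even in this optimized form it still does ~354k full distance computations per scanned word, which is why it is so slow; an exact-match check against a HashSet before falling back to the loop would already skip all the work for correctly read words:

import java.util.List;

public class SpellFixer {

    // Returns the dictionary word with the minimum edit distance to 'scanned'.
    public static String closestWord(String scanned, List<String> dictionary) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String candidate : dictionary) {
            int d = levenshtein(scanned, candidate);
            if (d < bestDist) {
                bestDist = d;
                best = candidate;
                if (d == 0) break; // exact match, stop early
            }
        }
        return best;
    }

    // Classic DP over two rows instead of a full matrix.
    private static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}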
The problem is that it is very, very slow. Without this step, my app manages to OCR more than 20 words in around 70 to 100 milliseconds. When I include this fixing routine, my app takes more than a full minute (60,000 ms) for a single word.
I was wondering if this technique is even suitable, given my case. If not, what other proven way should I go with? Any help would be greatly appreciated. I know this is possible, looking at how Android keyboards are able to instantly correct incorrectly typed words.
Other failed endeavors:
Jaro distance (similar).
Android's internal SpellCheckerSession service (doesn't fit my case; receiving the result via a callback is the issue).
My solution that works:
I created a MySQL table and uploaded the list of valid English words to it. It solves all the problems addressed in the question.
Here is my Android Application for reference:
Optical Dictionary & Vocabulary Teacher
Background:
I am developing a program that iterates over all the movies & TV series episodes stored on my computer, rates them (using Rotten Tomatoes) and sorts them in order of rating.
I extract the movie name by removing all the unnecessary text, such as '.avi', '720p', etc., from the file name.
I am using Java.
Problem:
Some folders contain movie files such as:
Episode 301 Rainforest Schmainforest.avi
Episode 302 Spontaneous Combustion.avi
The word 'Episode' and the numbers are valid and common in movie names, so I can't simply remove them. However, it is clear from the repetitive nature of the names that 'Episode' and '3XX' should be removed.
Another folder might be:
720p.S5.E1.cripple fight.avi
720p.S5.E2.towelie.avi
Many arbitrary patterns like these exist in different groups of files, and I need something to recognise these arbitrary patterns so I can extract the keywords. It would be unfeasible to write a regex for each case.
Summary:
Is there a tool or API that I can use to find complex repetitive patterns (it must be able to match sequences of numbers)? [something like a longest common subsequence library]
Well, you could simply take all the filtered names in your dir and do a simple word count. You could give extra weight to words that occur in (roughly) the same spot every time.
In the end you'd have a count and a weight for each word, and you need to decide where to draw the lines. It's probably not every file in the dir (because of, say, images or samples), but if most files share a certain word, it's not "the" or something like that, and maybe it always appears "at the start" or "in the second spot", then you can filter it.
But this wouldn't work for, random example, Friends episodes. They're all called "The One Where...". That would be filtered out in every sane version of your sought-after algorithm.
The bottom line is: I don't think you can, because of the Friends-episode problem. There's just not enough distinction between wanted repetition and unwanted repetition.
The only thing you can do is make a blacklist of stuff you want to filter, as you already seem to do with the avi/720p thing. A rough sketch of the word-count idea follows.
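Here is a minimal sketch of that word count in Java, without the positional weighting; the "appears in at least this fraction of the files" cutoff is an arbitrary placeholder:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class CommonWordFilter {

    // Counts in how many file names each token appears, then strips
    // tokens that occur in at least 'fraction' of the files (e.g. 0.5).
    public static List<String> stripCommonTokens(List<String> fileNames, double fraction) {
        Map<String, Integer> fileCount = new HashMap<>();
        List<String[]> tokenized = new ArrayList<>();
        for (String name : fileNames) {
            String[] tokens = name.toLowerCase().replaceAll("[._]", " ").split("\\s+");
            tokenized.add(tokens);
            for (String t : new HashSet<>(Arrays.asList(tokens))) {
                Integer c = fileCount.get(t);
                fileCount.put(t, c == null ? 1 : c + 1);
            }
        }
        int cutoff = (int) Math.ceil(fileNames.size() * fraction);
        List<String> cleaned = new ArrayList<>();
        for (String[] tokens : tokenized) {
            StringBuilder sb = new StringBuilder();
            for (String t : tokens) {
                if (fileCount.get(t) < cutoff) { // keep only the rarer, distinctive words
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(t);
                }
            }
            cleaned.add(sb.toString());
        }
        return cleaned;
    }
}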
I believe that what you are asking for is not trivial. Pattern extraction, as opposed to mere recognition, is well within the fields of artificial intelligence and knowledge discovery. I have encountered several related libraries for Java, but most need a lot of additional code to define even the simplest task.
Since this is a rather hot research area, you might want to perform a cursory search in Google Scholar, using appropriate keywords.
Disclaimer: before you use any library or algorithm found via the Internet, you should investigate its legal status. Unfortunately quite a few of the algorithms that are developed in active research areas are often encumbered by patents and such...
I have a kind-of answer posted here
http://pastebin.com/Eb0cQyKd
I wanted to remove non-unique parts of file names such as '720dpi', 'Episode', 'xvid', 'ac3' without specifying in advance what they would be. But I wanted to keep information like S01E01. I had created a huge blacklist, but it wasn't convenient because the list kept changing.
The code linked above uses Python (not Java) to remove all non-unique words in the file names.
Basically, it creates a list of all the words used in the file names, and any word which comes up in most of the files goes into a dictionary. Then it iterates through the files and deletes all these dictionary words from them.
The script also does some cleaning: some movies use underscores ('_') or periods ('.') to separate words in the filenames; I convert all of these to spaces.
I have used it a lot recently and it works well.
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, date, comment and section (e.g. Car, Salary, Computer, Food, ...).
The sections are created "on the fly". When the user enters a new section, it is added to the section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hardcoded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting section?
My first thought was to do it with a neural network and learn from the input whenever the user selects another section.
But my problem is, I don't know how to start at all. I tried encog with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem, and it requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse the comments for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80% satisfying feature, while aiming for 90% could cost a week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, you can always go back and improve as needed.
String comment = paramInput.toUpperCase(); // normalize case so "Fuel" and "fuel" match too
if (comment.contains("FUEL")) {
    // do the fuel functionality, e.g. preselect the Car section
}
In a simple app, if you will only ever have some specific sections, you can take the string from the comment, check whether it contains certain keywords, and change the value of the section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and the potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
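A minimal sketch of that idea against a recent Lucene version (package and class names shift between major Lucene releases, and the category keyword lists here are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CategoryGuesser {

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory(); // in-memory index

        // Index one document per category, holding its keyword list.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            writer.addDocument(categoryDoc("Car", "fuel gas petrol service repair tires"));
            writer.addDocument(categoryDoc("Food", "lunch dinner restaurant groceries"));
        }

        // Run the user's description through the index; the top hit is the best guess.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("keywords", analyzer);
            ScoreDoc[] hits = searcher.search(parser.parse(QueryParser.escape("Lunch at Joe's")), 1).scoreDocs;
            if (hits.length > 0) {
                System.out.println("Best guess: " + searcher.doc(hits[0].doc).get("name"));
            }
        }
    }

    private static Document categoryDoc(String name, String keywords) {
        Document doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new TextField("keywords", keywords, Field.Store.YES));
        return doc;
    }
}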
P.S. Neural network inputs and outputs will always be doubles or floats with values between 0 and 1. As for how to implement the string matching, I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch"
a list of the most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a fit for a neural net, but why not go straight for the numerical best fit?
A machine learning algorithm such as an artificial neural network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. "which of the provided pre-trained classes does the input belong to?", not just "does the input represent an X?"), which fits your use case. The problem is that they are supervised learning methods, so you need to provide a list of pairs of keywords and classes (sections) that spans every possible input your users will provide. This is impossible; in practice ANNs are re-trained whenever more data is available, to produce better results and a more accurate decision boundary (i.e. a better representation of the function that maps inputs to outputs). This also assumes that you know all possible classes before you start, and that each of those classes has training input values that you provide.
Another issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There is no higher-level information that describes the word's meaning, which means a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats, with each value representing proximity to a class, or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the sections beforehand, an easy solution to program would be to keep a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> sections, which you will already have in your database, lets you filter out non-keywords (and, or, the, ...). One option is then to find the list of sections each typed keyword belongs to, suggest several sections, and let the user pick one; the feedback you get from user selections would improve the suggestions over time. Another would be to calculate a Bayesian probability, the probability that this word belongs to section X given the previously stored mappings, for all keywords and sections, and either take the modal section or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done with every new addition in a background thread. A rough sketch of the Bayesian option follows.
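A minimal sketch of that Bayesian option, estimating P(section | word) from the stored comment -> section mappings; the stop-word list is just the example from above:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SectionSuggester {

    // counts.get(word).get(section) = how often 'word' appeared in a comment saved under 'section'
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Set<String> nonKeywords =
            new HashSet<>(Arrays.asList("and", "or", "the", "at", "a", "for"));

    // Call whenever the user saves an entry, so suggestions improve over time.
    public void learn(String comment, String section) {
        for (String word : tokenize(comment)) {
            Map<String, Integer> perSection = counts.get(word);
            if (perSection == null) {
                perSection = new HashMap<>();
                counts.put(word, perSection);
            }
            Integer c = perSection.get(section);
            perSection.put(section, c == null ? 1 : c + 1);
        }
    }

    // Scores each section by summing P(section | word) over the typed words
    // and returns the best one, or null if nothing matched.
    public String suggest(String comment) {
        Map<String, Double> scores = new HashMap<>();
        for (String word : tokenize(comment)) {
            Map<String, Integer> perSection = counts.get(word);
            if (perSection == null) continue;
            int total = 0;
            for (int c : perSection.values()) total += c;
            for (Map.Entry<String, Integer> e : perSection.entrySet()) {
                double p = e.getValue() / (double) total; // P(section | word)
                Double prev = scores.get(e.getKey());
                scores.put(e.getKey(), prev == null ? p : prev + p);
            }
        }
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    private List<String> tokenize(String comment) {
        List<String> words = new ArrayList<>();
        for (String w : comment.toLowerCase().split("\\W+")) {
            if (!w.isEmpty() && !nonKeywords.contains(w)) words.add(w);
        }
        return words;
    }
}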
I want to write a program that plays an audio file which reads a text aloud.
I want to highlight the current syllable that the audio file is playing in green, and the rest of the current word in red.
What kind of data structure should I use to store the audio file and the information that tells the program when to switch to the next word/syllable?
This is a slightly left-field suggestion, but have you looked at karaoke software? It may not be seen as "serious" enough, but it sounds very similar to what you're doing. For example, Aegisub is a subtitling program that lets you create subtitles in the SSA/ASS format. It has karaoke tools for highlighting the chosen word or part.
It's most commonly used for subtitling anime, but it also works for audio provided you have a suitable player. These are sadly quite rare on the Mac.
The format looks similar to the one proposed by Yuval A:
{\K132}Unmei {\K34}no {\K54}tobira
{\K60}{\K132}yukkuri {\K36}to {\K142}hirakareta
The lengths are durations rather than absolute offsets. This makes it easier to shift the start of the line without recalculating all the offsets. The double entry indicates a pause.
Is there a good reason this needs to be part of your Java program, or is an off the shelf solution possible?
How about a simple data structure that records which batch of letters makes up the next syllable, together with the timestamp for switching to that syllable?
Just a quick example:
[0:00] This [0:02] is [0:05] an [0:07] ex- [0:08] am- [0:10] ple
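In Java that could be as simple as a list of cue objects, sorted by timestamp; the player keeps an index into the list and advances the highlight whenever the audio clock passes the next cue's start time (the field names are of course just a suggestion):

public class SyllableCue {
    public final String word;      // the whole word, e.g. "example" (shown in red)
    public final String syllable;  // the current syllable, e.g. "am" (shown in green)
    public final long startMillis; // when the audio reaches this syllable, e.g. 8000

    public SyllableCue(String word, String syllable, long startMillis) {
        this.word = word;
        this.syllable = syllable;
        this.startMillis = startMillis;
    }
}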
Highlighting part of a word sounds like you're getting into phonetics, the sounds that make up words. It's going to be really difficult to turn a sound file into something that will "read" a text. Your best bet is to use the text itself to drive a phonetics-based engine, like FreeTTS, which is built on the Java Speech API.
To do this you're going to have to take the text to be read, split it into its phonetic syllables and play each one. So "syllable" becomes "syl" "la" "ble"; playing would be: highlight "syl", say it, and move to the next one.
This is really "old-skool"; it was done the same way on the original Apple II.
You might want to get familiar with FreeTTS, an open-source tool: http://freetts.sourceforge.net/docs/index.php
You might want to feed only a few words to the TTS engine at a given point in time, highlight them, and once those are spoken, de-highlight them and move on to the next batch of words. A minimal sketch of that loop follows.
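This relies on FreeTTS's bundled "kevin16" voice and on Voice.speak() blocking until the word has been spoken; the highlight/unhighlight calls are hypothetical hooks into your own UI:

import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class BatchReader {

    public static void main(String[] args) {
        // "kevin16" ships with the standard FreeTTS distribution.
        Voice voice = VoiceManager.getInstance().getVoice("kevin16");
        voice.allocate();
        String[] words = {"This", "is", "an", "example"};
        for (String word : words) {
            // highlight(word);    // hypothetical UI call: mark the word
            voice.speak(word);     // blocks until the word has been spoken
            // unhighlight(word);  // hypothetical UI call: back to normal
        }
        voice.deallocate();
    }
}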