I need to implement a spell checker in Java. Let me give you an example: for the string "sch aproblm iseasili solved", my output should be "such a problem is easily solved". The maximum length of the string to correct is 64. As you can see, my string can have spaces inserted in the wrong places, or missing entirely, and even misspelled words. I need a little help finding an efficient algorithm for producing the corrected string. Currently I delete all spaces from my string and insert spaces in every possible position, so for the word "hot" (the same applies to a whole sentence) I generate the candidate strings "h o t", "h ot", "ho t" and "hot", which are afterwards corrected word by word using Levenshtein distance. As you can see, I have generated 2^(string.length() - 1) possible strings, so for a string of length 64 it will generate 2^63 candidates, which is far too many. Afterwards I need to process them one by one and select the best one by a set of tie-breaking rules:
- total editing distance (must take the smallest one)
- if several strings have the same editing distance, I have to choose the one with the fewer number of words
- if several strings have the same number of words, I need to choose the one whose words have the maximum total frequency (I have a dictionary of the 8000 most frequent words along with their frequencies)
- and finally, if several strings have the same total frequency, I have to take the lexicographically smallest one.
So basically I generate all possible strings (inserting spaces in all possible positions of the original string), then one by one calculate their total editing distance, number of words, etc., choose the best one, and output the corrected string. I want to know if there is a more efficient way of doing this, one that avoids generating all possible combinations of strings.
EDIT: So I thought I should take another approach on this one. Here is what I have in mind: I take the first letter of my string and extract from the dictionary all the words that begin with that letter. Then I process all of them and extract from my string every possible first word. Staying with my previous example, for the word "hot", generating all combinations gave 4 results, but with my new algorithm I obtain only 2, "hot" and "ho", so it's already an improvement. However, I need a little help creating a recursive or dynamic programming algorithm for this. I need a way to store all possible strings for the first word, then for each of those all possible strings for the second word, and so on, and finally to concatenate all possibilities and add them to an array or something. There will still be a lot of combinations for long strings, but not as many as generating ALL of them. Can someone help me with pseudocode or something, as this is not my strong suit?
EDIT 2: Here is the code where I generate every possible first word from my string: http://pastebin.com/d5AtZcth. I need to somehow extend this to do the same for the rest of the string, combining each first word with each second word and so on, and store all these concatenations in an array or something.
A few tips for you:
try correcting just small parts of the string, not everything at once.
90% of errors (IIRC) are within edit distance 1 of the intended word.
you can use a phonetic index to match words against words that sound alike.
you can assume most typos are adjacent-key (QWERTY) errors (j => k, h => g), and try checking those first.
A few more ideas can be found in this nice article:
http://norvig.com/spell-correct.html
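To make the edit-distance-1 idea concrete, here is a minimal Java sketch in the spirit of the edits1 function from Norvig's article (the dictionary-filtering step around it is assumed): it generates every string one edit away from the input, which you can then intersect with your 8000-word frequency dictionary instead of enumerating all space insertions.

import java.util.HashSet;
import java.util.Set;

public class Edits1 {
    static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";

    // All strings at Levenshtein distance 1 from word:
    // deletions, transpositions, substitutions and insertions.
    static Set<String> edits1(String word) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (!right.isEmpty()) {
                result.add(left + right.substring(1)); // delete one char
                for (char c : LETTERS.toCharArray())
                    result.add(left + c + right.substring(1)); // substitute one char
            }
            if (right.length() > 1) // transpose two adjacent chars
                result.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));
            for (char c : LETTERS.toCharArray())
                result.add(left + c + right); // insert one char
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(edits1("hot").size()); // a couple hundred candidates
    }
}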
I am making a dictionary application, using the Pearson Dictionary API. I need to generate a word so that I can query that word for its definition.
PROBLEM
I know how to generate a random word but I don't know how to generate a meaningful English word.
I tried to solve this problem by requesting a JSON response and checking the results[] array (results[] holds the definitions for the word) in the response. So, if results[].length > 0, then the word is a valid English word.
But the solution above has a serious problem: suppose I want to generate a 5-letter word; there are as many as 26^5 = 11,881,376 different combinations, whereas there are nowhere near that many meaningful 5-letter English words. As the number of letters in the word increases, the number of combinations grows too. Thus, generating a meaningful word can take a very long time.
How can I check whether the generated word is a meaningful English word or not? Isn't there any feasible programmatic way of doing this?
Or is there any other way I could solve this problem?
As far as I can see, you either generate random strings of letters and check whether they're words (which, as you realise, is a very slow, hit-or-miss approach), or you store a list of "known good" words and select randomly from that list.
How big that list needs to be depends on what you're trying to achieve.
According to this page, the OED has around 171,476 main entries, not including variants like plurals (cat, cats), standard inflections (sit, sitting), or words that have multiple classes (e.g. dog can be a noun [the animal] or a verb [to follow persistently], etc.). According to this page, an average adult knows between 20,000 and 35,000 words, so a prudent selection of 50,000 should cover most general-purpose uses.
The answers to this question (now closed) provide a number of sources for word-lists. Examining one of them (originally provided by infochimps.org but available as a simple text-list on github) shows that the average length of 350,000+ words is just under 10 characters. For Linux (and possibly other flavours) /usr/share/dict/words may be a useful place to start.
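As a minimal sketch of the "known good" list approach (the path /usr/share/dict/words is the Linux example above; the 5-letter filter matches the original question):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class RandomWord {
    public static void main(String[] args) throws IOException {
        // Load the list, keeping only 5-letter, purely alphabetic entries.
        List<String> words = Files.readAllLines(Paths.get("/usr/share/dict/words")).stream()
                .map(String::trim)
                .filter(w -> w.length() == 5 && w.chars().allMatch(Character::isLetter))
                .collect(Collectors.toList());

        // Pick one uniformly at random.
        String word = words.get(new Random().nextInt(words.size()));
        System.out.println(word);
    }
}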
There is this beautiful text file containing a large list of English words:
https://github.com/AlexHakman/Java-challenge/blob/master/words.txt
You can then generate 5-letter words based on what's inside this text document. :)
Check the length of each line, or just generate a word and compare it against the file's contents. :)
Instead of generating words at random and spending time verifying them, just store a dictionary of the words you require and use it as a lookup table.
A reasonably complete dictionary for English is about 2 MB compressed, like the one here: http://wordlist.aspell.net/12dicts/
Even for an Android app, unless you're targeting really underpowered devices, it shouldn't be that big.
You can use SQLite to store the data; it may take up a bit more storage, but you get SQL as your query language rather than making up your own.
Since you also need a bit of randomness, each row can include some sort of randomized key that you can query against.
If you really want to limit it to 5 characters, just use a subset of the dictionary. But this approach allows you to use an arbitrary length, or even length ranges (e.g. 2 to 10 characters).
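A sketch of that lookup using JDBC and SQLite's ORDER BY RANDOM() (the table and column names, words(word), and the sqlite-jdbc driver are assumptions; on Android you would use SQLiteDatabase.rawQuery instead):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RandomDictWord {
    public static void main(String[] args) throws Exception {
        // Assumes a table created as: CREATE TABLE words (word TEXT PRIMARY KEY);
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:dict.db");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT word FROM words " +
                 "WHERE length(word) BETWEEN 2 AND 10 " + // arbitrary length range
                 "ORDER BY RANDOM() LIMIT 1")) {
            if (rs.next()) System.out.println(rs.getString("word"));
        }
    }
}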
I have a text file full of words. I want to add each of these words to a hashset. I also have a hashset of words I do not want.
Is it more efficient to:
(A) Add all the words I want to the hashset, then remove all the words in the unwanted hashset at the end.
(B) Check if each word is in the hashset of words I do not want and if it is, ignore it. If it is not then add it to the set of words I do want.
Edit
There are far more words I want than words I do not want.
The answer depends on the shape of your data. If almost every word in the file is unwanted, option B is clearly better: a cheap contains() check lets you skip those words entirely, whereas option A inserts them all only to remove them again at the end. If almost every word is wanted, which is your case per the edit, the two are much closer: option B pays one contains() check per word, while option A pays one bulk removeAll() pass at the end.
Note that a HashSet lookup is O(1) on average regardless of the set's size, so the cost of each check does not grow with the unwanted list; what matters is how much wasted insertion and removal work option A does.
From a purely theoretical view, both are the same in terms of worst-case time complexity, but practically, there can be a measurable difference.
So basically, as with most solutions, the efficiency depends on how you expect your data to be structured.
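For concreteness, a minimal sketch of option B (the file name and the unwanted set contents are placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class FilterWords {
    public static void main(String[] args) throws IOException {
        Set<String> unwanted = new HashSet<>(Set.of("foo", "bar")); // placeholder contents
        Set<String> wanted = new HashSet<>();

        // Option B: one O(1) contains() check per word, no wasted insertions.
        for (String word : Files.readAllLines(Paths.get("words.txt"))) {
            if (!unwanted.contains(word)) {
                wanted.add(word);
            }
        }
        System.out.println(wanted.size());
    }
}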
I'm trying to compare two strings (product names) using some well-known algorithms like Levenshtein distance and the SimMetrics library of string similarity measures (I got the best results with the SmithWatermanGotoh algorithm).
Two strings are:
iPhone 3gs 32 GB black
Apple iPhone 3 gs 16GB black
Levenshtein works pretty badly on the whole string if some words are in a different order (which is expected from how the algorithm works), so I tried to implement word-by-word comparison.
The problem I'm facing is how to detect similar 'words' that are divided by a space character ('3gs' -> '3 gs'; '32 GB' -> '16GB').
My code compares the shorter string (by word count; if equal, by string length) against the longer one. The words are split into an ArrayList<String>. I also join each word with its neighbours in the same string, creating a new ArrayList of combined words.
Here is some rough pseudocode:
foreach word1 in str1
    foreach word2 in str2
        res1 = getLevensteinDist(word1, word2)
    endforeach
    foreach combined in combinedStr2   // adjacent words of str2 joined together
        res2 = getLevensteinDist(word1, combined)
    endforeach
    return getHigherPercent(res1, res2)
endforeach
This works if the words in str2 are split, but I can't figure out how to do the reverse: detect words in str2 that are split in str1.
I hope it's at least somewhat clear what I'm trying to do. Any help is appreciated.
First of all you should preprocess your strings: remove "a, the, as, an" and other common words, numbers, etc. from the input strings, and convert every plural form to the singular form, to unify all the words. Then you can apply a string matching algorithm, or just put the words into a HashMap, or, if there are a lot of them, into a trie, and run your similarity algorithm.
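A rough sketch of that normalization step (the stop-word list and the naive strip-trailing-s plural rule are only illustrative; a real stemmer would do better):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Normalize {
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "as", "is");

    static List<String> normalize(String s) {
        List<String> out = new ArrayList<>();
        for (String w : s.toLowerCase().split("[^a-z0-9]+")) {
            if (w.isEmpty() || STOP_WORDS.contains(w) || w.matches("\\d+")) continue;
            if (w.endsWith("s") && w.length() > 3) // naive singularization
                w = w.substring(0, w.length() - 1);
            out.add(w);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Apple iPhone 3 gs 16GB black"));
        // [apple, iphone, gs, 16gb, black] -- the bare number token is dropped
    }
}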
Have a look at TF-IDF. It is specifically designed to compute similarities between textual features.
http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
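A compact sketch of the idea (here the "corpus" is just the two product names, so the smoothed IDF weights are crude; in real use you would compute IDF over a much larger set of product names):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfSimilarity {
    // Build a tf-idf vector for one document against a small corpus.
    static Map<String, Double> tfidf(String doc, List<String> corpus) {
        Map<String, Double> vec = new HashMap<>();
        for (String t : doc.toLowerCase().split("\\s+"))
            vec.merge(t, 1.0, Double::sum); // raw term frequency
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            long df = corpus.stream()
                    .filter(d -> d.toLowerCase().contains(e.getKey())).count();
            e.setValue(e.getValue() * Math.log(1.0 + (double) corpus.size() / df)); // smoothed idf
        }
        return vec;
    }

    // Cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        for (double v : a.values()) na += v * v;
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("iPhone 3gs 32 GB black", "Apple iPhone 3 gs 16GB black");
        System.out.println(cosine(tfidf(corpus.get(0), corpus), tfidf(corpus.get(1), corpus)));
    }
}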
Try splitting one of the strings into words, then for each word run SmithWaterman and use the scores from SmithWaterman as the similarity measure.
13 years ago I wrote my own implementation of a trigram fuzzy search algorithm,
named the "Wilbur-Khovayko algorithm".
You can download it here: http://olegh.cc.st/wilbur-khovayko.tar.gz
It searches for the N closest terms to the entered search term.
The list of terms is in the file termlist.txt.
N is in the variable lim, in the file findtest.c.
The algorithm is very quick: on an old 200 MHz Sun, it finds the 100 closest terms
among 100,000 entries in about 0.3 seconds.
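The archive is C code, but the trigram idea itself is easy to sketch in Java (the Dice coefficient and the padding convention here are generic choices, not taken from that archive):

import java.util.HashSet;
import java.util.Set;

public class TrigramSimilarity {
    // Character trigrams of a padded, lower-cased string.
    static Set<String> trigrams(String s) {
        String padded = "  " + s.toLowerCase() + " ";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++)
            grams.add(padded.substring(i, i + 3));
        return grams;
    }

    // Dice coefficient: 2|A intersect B| / (|A| + |B|), in [0, 1].
    static double similarity(String a, String b) {
        Set<String> ga = trigrams(a), gb = trigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        System.out.println(similarity("32 GB", "32GB"));  // ~0.55 despite the space difference
        System.out.println(similarity("32 GB", "black")); // 0.0
    }
}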
I have a prepopulated SQLite database imported into the assets folder, and I use it to set the text of my buttons and to compare the user's input with the correct answers in that database. But I have two problems which I don't know how to solve.
For example, I have an answer which is "Michael Jordan" or some other pair of words. If a user enters "Michael Jordan" I'm good to go, but if he enters "Jordan Michael" I'm in trouble: it will pop up a wrong-answer alert. Is there a way to accept these word shuffles?
Also, if I have an answer "Balls" and the user types in "ball", this will be a wrong answer. How can I make sure that all singulars and plurals get accepted?
Fuzzy String Comparison Algorithm
The custom brute force method below provides word swapping and gives you complete control over the vowel/consonant score thresholds, but increases the total number of comparisons.
You will also want to look at libraries such as Apache Lucene, described in this thread: Fuzzy string search library in Java
Custom Fuzzy Comparison Recipe:
Lower Case: All comparisons will be with lower-case text. Either make sure that all words in the reference database are in lower case, or call String.toLowerCase() on each item in the database before comparison. Obviously, preprocessing the list in the database will dramatically increase performance.
Remove Spaces and Punctuation: You must make a function that removes all spaces and other punctuation from any phrase. You should have a separate column in your reference with this information pre-calculated for an increase in performance.
Custom Compare Function: Your String comparison function will compare each character and assign a custom score based on the closeness of the letters, in which the lowest scores indicate the best match. For example, identical characters add zero score. Each mismatched consonant pair adds 2 to the score. Each mismatched vowel adds 1. Mixed mismatches add 3. Normalize the score by the number of characters. Apply a simple threshold to determine acceptable matches. With the scoring above, start with threshold = 0.2, which allows approximately one small mistake per 5 characters (this handles simple misspellings, but not missing characters; see Step 4 below).
Extra or Missing Characters: Loop through each comparison an extra time for each character position: once without the character in that position, and once with an extra character inserted there. Report the smallest score across all the loops. Compare that score against the threshold. Break out of the loop and stop comparing if the score is below the threshold, thus indicating a match. This will catch misspellings such as "colage" for "collage".
Swap Words: After the loop in Step 4, if the score is still above the threshold, loop through each word of the input phrase, swap it with its adjacent neighbour word, and rerun the comparison suite. Obviously, you will have to look at the original raw user phrase to find the word boundaries, rather than the processed phrase of Step 2 with its spaces and punctuation removed. This will satisfy your requirement of allowing "Jordan Michael" to substitute for "Michael Jordan".
For long entries with more than 2 words, this method will incur tens of comparisons per database entry or more, so there is a definite performance hit.
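A sketch of the scoring function from Step 3 (equal-length inputs only; the insert/delete handling of Step 4 and the word swapping of Step 5 are left out for brevity):

public class FuzzyScore {
    static final String VOWELS = "aeiou";

    static boolean isVowel(char c) { return VOWELS.indexOf(c) >= 0; }

    // Lower is better: 0 for identical characters, 1 per vowel/vowel mismatch,
    // 2 per consonant/consonant mismatch, 3 per mixed mismatch; normalized by length.
    static double score(String a, String b) {
        if (a.length() != b.length()) return Double.MAX_VALUE; // Step 4 covers this case
        int total = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i), y = b.charAt(i);
            if (x == y) continue;
            if (isVowel(x) && isVowel(y)) total += 1;
            else if (!isVowel(x) && !isVowel(y)) total += 2;
            else total += 3;
        }
        return (double) total / a.length();
    }

    public static void main(String[] args) {
        System.out.println(score("michaeljordan", "michaeljordan")); // 0.0
        System.out.println(score("michaeljordan", "michaeljordon")); // ~0.08, under the threshold
    }
}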
This is a great question. I think, realistically you need a dictionary of "valid" words. However a dictionary on its own will not solve your problems. You also need a set of heuristics based on your dictionary as to what constitutes a valid entry.
I would be tempted to try tries here, as you can encapsulate a rich text base better than with alternative methods. Tries, in this case, will offer performance comparable to, say, a word dictionary or the like. An additional benefit of using tries is that it is fairly trivial to add new words/phrases to your application. The downside: tries use a fair amount of memory. That said, there are techniques one can use to compact the data.
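If you go the trie route, a bare-bones version looks like this (lower-case ASCII only; production tries add compaction to cut the memory cost mentioned above):

public class Trie {
    private final Trie[] children = new Trie[26];
    private boolean isWord;

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Trie();
            node = node.children[i];
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) return false;
            node = node.children[i];
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        t.insert("ball");
        System.out.println(t.contains("ball"));  // true
        System.out.println(t.contains("balls")); // false: would need plural handling
    }
}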
I've been given a task which I'm a little confused by. Here is the problem statement:
The following program should read a file and store all its tokens in a member variable.
Your task is to write a single method that returns the number of items in tokenMap, the average length (as double value) of the elements in tokenMap, and the number of tokens starting with character "a".
Here tokenMap is an object of type HashMap<String, Integer>.
I do have some idea of how a HashMap works, but what I want to know is whether the key required for the HashMap is a single character or the whole word, i.e. what I should store in tokenMap.
Also, how can I compute the average length?
Looks like you have to use the entire word as the key.
The average length of tokens can be computed by summing the lengths of each token and dividing by the number of tokens.
In Java, you can find the number of tokens in the HashMap by tokenMap.size().
You can write a loop that visits each key of the map like this:
for (String t : tokenMap.keySet()) {
    // t is a token
}
and if you look up String in the Java API docs you will see that it is easy to find the length of a String.
To compute the average length of the items in a hash map, you'll have to iterate over them all, sum their lengths, and divide by the count.
As for your other question about what to use for a key, how are we supposed to know? A hashmap can use practically any* value for a key.
*The value must be hashable, which is defined differently for different languages.
Reading the question closely, it seems that you have to read a file, extract each word, use it as the key, and store the length of each key as the Integer value:
an example line
leads to a HashMap like this
an : 2
example : 7
line : 4
After you've built your map (made of keys mapping to values, or "elements" as the question seemingly calls them), you'll need to run some statistics over it to find
the number of keys (look at HashMap)
the average length of all keys (again, simple enough)
the number beginning with "a" (just look at the String)
Then make a value object containing these values and return it from the method that does the statistics.
I know I've given more information than you require, but someone else may benefit from a little extra help.
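Putting those three statistics together, a sketch of the method might look like this (the TokenStats class is my own name for the value object above, and "elements" is taken to mean the map's keys):

import java.util.HashMap;
import java.util.Map;

public class TokenStatistics {
    // Simple value object holding the three required results.
    static class TokenStats {
        final int count;
        final double averageLength;
        final int startingWithA;
        TokenStats(int count, double averageLength, int startingWithA) {
            this.count = count;
            this.averageLength = averageLength;
            this.startingWithA = startingWithA;
        }
    }

    static TokenStats analyze(Map<String, Integer> tokenMap) {
        int totalLength = 0, startingWithA = 0;
        for (String token : tokenMap.keySet()) {
            totalLength += token.length();
            if (token.startsWith("a")) startingWithA++;
        }
        int count = tokenMap.size();
        double avg = (count == 0) ? 0.0 : (double) totalLength / count;
        return new TokenStats(count, avg, startingWithA);
    }

    public static void main(String[] args) {
        Map<String, Integer> tokenMap = new HashMap<>();
        tokenMap.put("an", 2);
        tokenMap.put("example", 7);
        tokenMap.put("line", 4);
        TokenStats s = analyze(tokenMap);
        System.out.println(s.count + " " + s.averageLength + " " + s.startingWithA);
        // prints: 3 4.333333333333333 1
    }
}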
Guys, there is some confusion. I'm not asking for a solution; I'm just confused about one thing.
For the time being, I'm going to use String as the key type.
The only confusion I have is: once I read the file line by line, should I split it up by words or by characters, so that the key is a single-character String or a String holding a whole word?
If you can go through the problem statement, what do you suggest? That's all I'm asking.
should I split it based upon words or based upon each character
The requirement is to make tokens, so you should split them based on words. Each word becomes a unique String key. It would make sense for the value to be the count of each token.
If the file you are reading has these three lines:
int alpha;
int beta;
float delta;
Then you should have something like
<"int", 2>
<";", 3>
<"alpha", 1>
<"beta", 1>
<"float", 1>
<"delta", 1>
(The semicolon may or may not be considered a token.)
Your average length would be (3 + 1 + 5 + 4 + 5 + 5) / 6 ≈ 3.83, since the elements of the map are the six distinct tokens.
Your number of tokens starting with "a" would be 1 ("alpha").
Look elsewhere on this forum for keySet and you should be good to go.
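For completeness, a sketch of building such a count map from a file (the whitespace-only tokenizer is a simplification: it would keep the semicolons attached, as in "alpha;"):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class TokenCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> tokenMap = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("input.txt"))) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    tokenMap.merge(token, 1, Integer::sum); // increment this token's count
                }
            }
        }
        System.out.println(tokenMap);
    }
}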