I am making a Dictionary Application. I am using Pearson Dictionary API for the same. I need to generate a word so that I could query that word for its definition.
PROBLEM
I know how to generate a random word but I don't know how to generate a meaningful English word.
I tried to solve this problem by requesting a JSON response and checking the results[](results[ ] hold definitions for the word) in the response. So, if results[].lenght > 0 then the word is a valid English word.
But the solution above has its own serious problem: Suppose I want to generate a 5 letter word, there are as many as 26^5 = 11881376different combinations whereas there aren't as many 5 letter meaningful English words. As the letters in the word increases, the number of combinations increases too. Thus, generating a meaningful word can take a very long time.
How can I check if the generated word is a meaningful English word or not? Isn't there any feasible programmatic way of doing this?
OR Is there any other way I could solve this Problem?
As far as I can see, you either generate random strings of letters and check to see if they're words (which, as you realise, is very slow, hit-or-miss approach) or you store a list of "known good" words and select randomly from that list.
How big that list needs to be depends on what you're trying to achieve.
According to this page the OED has around 171,476 main entries, not including variants like plurals (cat, cats), standard variants (sit, sitting), nor words that have multiple classes (e.g. dog can be a noun [the animal] or a verb [to follow persistently] etc.). According to this page an average adult knows between 20,000 and 35,000 words, so a prudent selection of 50,000 should cover most general purpose uses.
The answers to this question (now closed) provide a number of sources for word-lists. Examining one of them (originally provided by infochimps.org but available as a simple text-list on github) shows that the average length of 350,000+ words is just under 10 characters. For Linux (and possibly other flavours) /usr/share/dict/words may be a useful place to start.
There is this beautifull text file containing all english wordS:
https://github.com/AlexHakman/Java-challenge/blob/master/words.txt
You can then generate 5 letter words based on whats inside this text document :)
Get per line the length of the line, or just generate and compare it with the text file :)
Instead of doing it random because you need to spend time verifying just store a dictionary of the words that you would require and have a lookup table for it.
A relatively complete dictionary for English is about 2MBs compressed like the one here http://wordlist.aspell.net/12dicts/
Even for an Android app unless you're targeting really under powered devices it shouldn't be that big.
You can use SQLite to store the data so it may take up a bit more storage but you get SQL as your query language rather than making up your own.
Since you would also need a bit of randomness, each row can add some sort of randomized key that you can further query.
If you really wanted to limit it to 5 characters then just use a subset of the dictionary. But this will allow you to have an arbitrary length even length ranges (e.g. 2 to 10 characters)
Related
I have several lists of Strings already classified like
<string> <tag>
088 9102355 PHONE NUMBER
091 910255 PHONE NUMBER
...
Alfred St STREET
German St STREET
...
RE98754TO IDENTIFIER
AUX9654TO IDENTIFIER
...
service open all day long DESCRIPTION
service open from 8 to 22 DESCRIPTION
...
jhon.smith#email.com EMAIL
jhon.smith#anothermail.com EMAIL
...
www.serviceSite.com URL
...
Florence CITY
...
with a lot of strings per tag and i have to make a java program which, given
a new List of String(supposed all of the same tag), assigns a probability for each tag to the list.
The program has to be completely language independent and all the knowledge has to came from the lists of tagged strings as the one described above.
I think that this problem can be solved with NER approaches (i.e machine learning algorithms like CRF) but those are usually for unstructured text like a chapter from a book, or a paragraph of a web page, and not for list of independent strings.
I Thought to use CRF (i.e Conditional Random Field) because I found a similar approach used in the Karma Data integration Tool as described in this Article, paragraph 3.1
where the "semantic types" are the my tags.
To tackle the program I have downloaded the Stanford Named Entity Recognizer (NER) and played a bit
with it's JAVA API through NERDemo.java finding two problems:
the training file for the CRFClassifier has to have one word per row, therefore I haven't found a way to classify groups of words with a single tag
I don't understand if I have to make one Classifier per tag or a single Classifier for all, because a single string could be classified with n different tags and it is the user that chooses between them. So I'm rather interested in the probability assigned by the classifiers instead of the exact class matching. Furthermore
i haven't any "no Tag" Strings so I don't know how the Classifier behaves without them to assign the probabilities.
Is this the right approach to the problem? Is There a way To use The Stanford NER
or another JAVA API with CRF or other suitable Machine Learning Algoritm to do it?
Update
I managed to train the CRF classifier first with each word classified independently with the tag and each group of words separated by two commas( classified as "no Tag"(0) ), then with the group of words as a single word with underscores replacing spaces but I have very disappointing results in the little test I made. I haven't quite get which features I have to include and which exclude from the ones described in the NERFeatureFactory javadoc considering they can't have anything to do with language.
Update 2
The test results are beginning to make sense, I've divided each string(tagging every Token) from the others with two new Lines, instead of the horrible "two commas labeled with 0", and I've used the Stanford PTBTokenizer instead of the one that I made. Moreover I've tuned the features, turning on the usePrev and useNext features and using suffix/prefix Ngrams up to 6 characters of length and other things.
The training file named training.tsv has this format:
rt05201201010to identifier
1442955884000 identifier
rt100005154602cv identifier
Alfred street
Street street
Robert street
Street street
and theese are the flags in the the propeties file:
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useTags=false
useWordPairs=false
useDisjunctive=true
useSequences=false
usePrevSequences=true
useNextSequences=true
# the next flag can have these values: IO, IOB1, IOB2, IOE1, IOE2, SBIEO
entitySubclassification=IO
printClassifier=HighWeight
cacheNGrams=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
However I found another problem, I managed to train only 39 labels with 100 strings each, though I have like 150 labels with more than 1000 string each, but even so it takes like 5 minutes to train and if I rise these numbers a bit it throws a Java Heap Out of Memory Error.
Is there a way to scale up to those numbers with a single classifier? Is it better to train 150 (or less, maybe one with two or three labels) little classifiers and combine them later? Do I need to train with 1000+ strings each label or can I stop to 100(maybe choosing them quite different from one another)?
The first thing you should be aware of is that (linear chain) CRF taggers are not designed for this purpose. They came as a very nice solution for context-based prediction, i.e. when you have words before and after named entities, and you look for clues in a limited window (e.g. 2 words before / after current word). This is why you had to insert double lines: to delimit sentences. They also provide coherence between tags affected to words, which is indeed a good thing in your case.
A CRF tagger should work, but with an extra cost in learning step which you could be avoided by using simpler (maximum entropy, SVM) but still accurate machine learning methods. In Java, for your task, wouldn't Weka be a better solution? I would also consider BIO tagging as not relevant in your case.
Whatever software / coding you use, it is not surprising that ngrams at character level gives good improvements, but I believe you may add dedicated features. For instance, since morphological clues are important (presence of an "#", upper case or digits characters), you may use codes (see ref [1]) that are a very convenient method to describe strings. You'll also most probably obtain better results by using lists of names (lexicon) that may be triggered as additional features.
[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)
I am trying to generate all possible sentences from given token. It is a transliteration program. I have various possibilities for each token to be transliterated and I want to generate all possible sentences. e.g. if sentence is token1 token2 token3 and supposing token1 can be represented in 3 ways after transliteration, token2 can be represented by 2 ways and token3 can be represented by 4 ways then total possible sentences are 24. I am developed a general tree and then perform depth first traversal to generate all possible sentences. the problem is when sentence become long, the number of possibilities increases and I got "java.lang.OutOfMemoryError: Java heap space" error.
Is there any other way to generate all possible sentences?? At some instances I need to generate millions of sentences. Please Help!!!
You can't generate them all at once like that.
Depending on what you need them for, you should either do whatever that is or write them to a file.
Another thought, that still might not work, would be to not store every possible value but store a set of references/relationships. You can make this much more complex with n-grams and mMrkov chains, or simply have a a set of references, or even just have a list of array indexes.
So besides using storage space as a memory buffer, you can conceptualize instead of foo calling gen for the full set, have gen call foo after each one is generated.
[EDIT: looking back on this, (I was interested to see any other answers) I want to clarify that the function foo is whatever you're using them for and the function gen generates them (just in case it isn't clear, and especially for anyone who's first language isn't english)]
I have a prepopulated sqlite database imported to assets folder and I use it to set some text to my buttons and to compare user's input with my correct answers in that database. But I have two problems which I don't how to solve.
For example I have an answer which is "Michael Jordan" or some other two words. I a user enters Michael Jordan i'm good to go, but if he enter Jordan Michael I'm in trouble. It will popup a wrong answer alert. Is there a way to accept these words shuffles?
Also, if I have an answer "Balls" and user type in "ball" this will be wrong aswer. How to make sure that all singulars and plurals get accepted?
Fuzzy String Comparison Algorithm
The custom brute force method below provides word swapping and gives you complete control over the vowel/consonant score thresholds, but increases the total number of comparisons.
You will also want to check methods such as Apache Lucene described in this thread: Fuzzy string search library in Java
Custom Fuzzy Comparison Recipe:
Lower Case: All comparisons will be with lower-case text. Either make sure that all words in the reference database are in lower case, or use a String.toLower() on each item in the database before comparison. Obviously, preprocessing the list in the database will dramatically increase performance.
Remove Spaces and Punctuation: You must make a function that removes all spaces and other punctuation from any phrase. You should have a separate column in your reference with this information pre-calculated for an increase in performance.
Custom Compare Function: Your String comparison function will compare each character and assign a custom score based on closeness of letters, in which the lowest scores will indicate the best match. For example, identical characters will add zero score. Each mismatched consonant pair will add 2 to the score. Each mismatched vowel will add 1. Mixed mismatches will add 3. Normalize the score by the number of characters. Apply a simple threshold to determine acceptable matches. In the above example, start with threshold=0.2 which will allow approximately one small mistake per 5 characters (this solves simple misspellings, but not missing characters. See Step 4 below).
Extra or Missing Characters: Loop through each comparison an extra time for each character position. Once without the character in that position and once with an extra character in that position. Report the smallest score for all the loops. Compare that score against the threshold. Break out of the loop and stop comparing if the score is below the threshold, thus indicating a match. This will catch misspellings such as "colage" for "collage".
Swap Words: After the loop in Step #4, if the score is still above the threshold, loop through each word of the input phrase and swap with its nearest neighbor adjacent word. and rerun the comparison suite. Obviously, you will have to look at the original raw user phrase to find the word boundaries, rather than the processed phrase without spaces and punctuation of Step #2. This will catch your requirement of allowing "Jordan Michael" to substitute for "Michael Jordan".
For long entries with more than 2 words, this method will incur 10's of comparisons per database entry or more, so there is a definite performance hit.
This is a great question. I think, realistically you need a dictionary of "valid" words. However a dictionary on its own will not solve your problems. You also need a set of heuristics based on your dictionary as to what constitutes a valid entry.
I would be tempted to try "tries" here as you can encapsulate a rich text base better that alternate methods. Tries, in this case will offer comparable performance to say a word dictionary or the likes. The additional benefit of using tries is that it is fairly trivial to add new words/phrases to your application. The downside, tries use a fair amount of memory. That said, there are techniques one can use to compact data.
The following list contains 1 correct word called "disastrous" and other incorrect words which sound like the correct word?
A. disastrus
B. disasstrous
C. desastrous
D. desastrus
E. disastrous
F. disasstrous
Is it possible to automate generation of wrong choices given a correct word, through some kind of java dictionary API?
No, there is nothing related in java API. You can make a simple algorithm which will do the job.
Just make up some rules about letters permutations and doubling and add generated words to the Set until you get enough words.
There are a number of algorithms for matching words by sound - 'soundex' is the one that springs to mind, but I remember uncovering a few when I did some research on this a couple of years ago. I expect the problem you would find is that they take a word and return a value that represents how the word sounds so you can see if two spellings sound similar (so the words in the question should generate similar values); but I expect doing the reverse, i.e. taking the value and generating similar sounding spellings, would be quite hard.
is there a dictionary i can download for java?
i want to have a program that takes a few random letters and sees if they can be rearanged into a real word by checking them against the dictionary
Is there a dictionary i can download
for java?
Others have already answered this... Maybe you weren't simply talking about a dictionary file but about a spellchecker?
I want to have a program that takes a
few random letters and sees if they
can be rearranged into a real word by
checking them against the dictionary
That is different. How fast do you want this to be? How many words in the dictionary and how many words, up to which length, do you want to check?
In case you want a spellchecker (which is not entirely clear from your question), Jazzy is a spellchecker for Java that has links to a lot of dictionaries. It's not bad but the various implementation are horribly inefficient (it's ok for small dictionaries, but it's an amazing waste when you have several hundred thousands of words).
Now if you just want to solve the specific problem you describe, you can:
parse the dictionary file and create a map : (letters in sorted order, set of matching words)
then for any number of random letters: sort them, see if you have an entry in the map (if you do the entry's value contains all the words that you can do with these letters).
abracadabra : (aaaaabbcdrr, (abracadabra))
carthorse : (acehorrst, (carthorse) )
orchestra : (acehorrst, (carthorse,orchestra) )
etc...
Now you take, say, three random letters and get "hsotrerca", you sort them to get "acehorrst" and using that as a key you get all the (valid) anagrams...
This works because what you described is a special (easy) case: all you need is sort your letters and then use an O(1) map lookup.
To come with more complicated spell checkings, where there may be errors, then you need something to come up with "candidates" (words that may be correct but mispelled) [like, say, using the soundex, metaphone or double metaphone algos] and then use things like the Levenhstein Edit-distance algorithm to check candidates versus known good words (or the much more complicated tree made of Levenhstein Edit-distance that Google use for its "find as you type"):
http://en.wikipedia.org/wiki/Levenshtein_distance
As a funny sidenote, optimized dictionary representation can store hundreds and even millions of words in less than 10 bit per word (yup, you've read correctly: less than 10 bits per word) and yet allow very fast lookup.
Dictionaries are usually programming language agnostic. If you try to google it without using the keyword "java", you may get better results. E.g. free dictionary download gives under each dicts.info.
OpenOffice dictionaries are easy to parse line-by-line.
You can read it in memory (remember it's a lot of memory):
List words = IOUtils.readLines(new FileInputStream("dicfile.txt")) (from commons-io)
Thus you get a List of all words. Alternatively you can use the Line Iterator, if you encounter memory prpoblems.
If you are on a unix like OS look in /usr/share/dict.
Here's one:
http://java.sun.com/docs/books/tutorial/collections/interfaces/examples/dictionary.txt
You can use the standard Java file handling to read the word on each line:
http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html
Check out - http://sourceforge.net/projects/test-dictionary/, it might give you some clue
I am not sure if there are any such libraries available for download! But I guess you can definitely digg through sourceforge.net to see if there are any or how people have used dictionaries - http://sourceforge.net/search/?type_of_search=soft&words=java+dictionary