I can work out how to create anagrams of a string but I don't know how I can compare them to a dictionary of real words to check if the anagram is a real word. Is there a class in the Java API that contains the entire English dictionary?
No, but you can get a wordlist from various places. From there, you could read the wordlist file into a list:
import java.io.*;
import java.util.*;

List<String> lines = new ArrayList<>();
try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"))) {
    String line;
    while ((line = in.readLine()) != null) {
        lines.add(line);
    }
}
Finally, check your candidate word with lines.contains(), or sort the list and use Collections.binarySearch() for faster lookups.
One method of determining whether a set of characters is an anagram of a word involves using prime numbers. Assign each letter a prime number, for example, a=2, b=3, c=5, d=7. Now precompute the product of primes for each word in your dictionary. For example, 'add' = 2*7*7 = 98, or 'bad' = 3*2*7 = 42.
Now determining whether a set of letters is an anagram of any word in the dictionary can be done by computing the value of that set of letters. For example, the letters 'abd' = 2*3*7 = 42 = 'bad'. Just check whether the computed value for the letters exists in your precomputed dictionary. For any anagram, you need only do this computation once, versus trying to generate every possible anagram. Note, however, that this method only works well for relatively short words; otherwise you will run into overflow issues and will need BigInteger.
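A minimal Java sketch of the prime-product idea (the class and method names are mine, and the tiny dictionary is just for illustration):

```java
import java.util.*;

public class PrimeAnagram {
    // One prime per letter 'a'..'z' (a=2, b=3, c=5, d=7, ...);
    // equal products mean equal multisets of letters.
    private static final long[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    // Product of the primes for each letter; overflows long for long words.
    static long product(String word) {
        long p = 1;
        for (char c : word.toLowerCase().toCharArray()) {
            p *= PRIMES[c - 'a'];
        }
        return p;
    }

    public static void main(String[] args) {
        // Precompute products for the dictionary, then look up candidate letter sets.
        List<String> dictionary = Arrays.asList("bad", "add", "cat");
        Map<Long, String> byProduct = new HashMap<>();
        for (String w : dictionary) {
            byProduct.put(product(w), w);
        }
        System.out.println(byProduct.get(product("abd"))); // prints "bad"
    }
}
```

Note that words that are anagrams of each other share a product, so a Map<Long, List<String>> would be needed to recover all of them.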
No, you have to use an external library, such as JWNL, which is a wrapper for WordNet -- a machine-readable lexical database organized by meanings, that contains pretty much every English word.
Maybe the English dictionary in jazzy can help you.
There's no such specialized class in the standard Java library, but you can use any implementation of the Set interface and initialize it with words of your choosing, picked from any of the innumerable word lists you can find in many places. Just check carefully that the license for the word list you choose is compatible with your intended application: does it allow commercial use, closed-source apps if that's what you require, and so forth.
I have a text extracted from image using OCR. Some of the words are not correctly recognized in the text as follows:
'DRDER 0F OFF1CE RESTAURAUT, QNE THO...'
As you can see, some characters are optically easy to mistake for others: 1 -> I, O -> D -> Q, H -> W, U -> N, and so on.
Question: Apart from standard algorithms like Levenshtein distance, is there a Java or Python library implementing OCR specific algorithm that can help compare words to a predefined dictionary and give a score, taking into account possible OCR character mixups?
I don't know of anything OCR-specific, but you might be able to make this work with Biopython, because the basic problem of comparing one string to another using a matrix that scores each character's similarity to every other character is very common in bioinformatics. We call it a sequence alignment problem.
Have a look at the pairwise2 module that Biopython provides; you would be able to compare each input word against each dictionary word with pairwise2.align.globaldx, using a dict that has all the pairwise character similarities. There are also functions in there for scoring deleted/inserted characters.
Computing the pairwise character similarities would be something you'd have to do yourself, maybe by rendering each character in your chosen font and comparing the images, or maybe manually by just rating which characters look similar to you. You could also have a look at this other SO answer where characters are broken into classes based on the presence/absence of strokes.
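If you'd rather stay in Java, the same scoring idea can be sketched as an edit distance with a custom substitution cost. The confusion pairs and the 0.3 weight below are illustrative placeholders, not a real OCR model:

```java
import java.util.*;

public class OcrDistance {
    // Illustrative confusion pairs; a real table would come from your
    // font renderings or OCR error statistics.
    private static final Set<String> CONFUSABLE = new HashSet<>(Arrays.asList(
        "0O", "O0", "1I", "I1", "DQ", "QD", "DO", "OD", "HW", "WH", "UN", "NU"
    ));

    static double substitutionCost(char a, char b) {
        if (a == b) return 0.0;
        if (CONFUSABLE.contains("" + a + b)) return 0.3; // cheap: looks alike
        return 1.0;
    }

    // Standard dynamic-programming edit distance, but substitutions between
    // visually similar characters cost less than arbitrary ones.
    static double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                d[i][j] = Math.min(
                    d[i - 1][j - 1] + substitutionCost(s.charAt(i - 1), t.charAt(j - 1)),
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        // "0RDER" scores much closer to "ORDER" than plain Levenshtein would suggest.
        System.out.println(distance("0RDER", "ORDER")); // 0.3
        System.out.println(distance("XRDER", "ORDER")); // 1.0
    }
}
```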
If you want something better than O(input * dictionary), you'd have to switch from brute force comparison to some kind of seed-match-based algorithm. If you assume that you'll always have a 2-character perfect match for example, you can index your dictionary by which words contain each length-2 string, and only compare the input words against the dictionary words that share a length-2 string with them.
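The seed-match indexing could be sketched like this in Java (names are mine; it assumes at least one 2-character run survives the OCR errors intact):

```java
import java.util.*;

public class BigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Index every dictionary word under each of its length-2 substrings.
    BigramIndex(Collection<String> dictionary) {
        for (String word : dictionary) {
            for (int i = 0; i + 2 <= word.length(); i++) {
                index.computeIfAbsent(word.substring(i, i + 2), k -> new HashSet<>())
                     .add(word);
            }
        }
    }

    // Only dictionary words sharing a bigram with the input need the
    // full (expensive) alignment comparison.
    Set<String> candidates(String input) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 <= input.length(); i++) {
            result.addAll(index.getOrDefault(input.substring(i, i + 2),
                                             Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        BigramIndex idx = new BigramIndex(Arrays.asList("order", "office", "one", "two"));
        System.out.println(idx.candidates("0rder")); // contains "order" but not "two"
    }
}
```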
I am studying for an interview and having trouble with this question.
Basically, you have a word with blanks in it, like c_t.
You have a word bank and have to find all the possible words that can be made with the given string. So for in this case, if cat was in the word bank we would return true.
Any help solving this question (ideally an optimal algorithm) would be appreciated.
I think we can start with checking lengths of strings in the word bank and then maybe use a hashmap somehow.
Step 1.) Eliminate all words in the wordbook that don't have the same length as the specified one.
Step 2.) Eliminate all words in the bank that don't have the same starting sequence and ending sequence.
Step 3.) If the specified string is fragmented like c_ter_il_ar, for each word left in the bank check if it contains the isolated sequences at those exact same indexes such as ter and il and eliminate those that don't have it
Step 4.) At this point all the words left in the bank are viable solutions, so return true if the bank is non-empty
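Since each blank matches exactly one character, the elimination steps above boil down to a pattern match. A minimal sketch using regex (names are mine; it assumes the puzzle contains only letters and underscores):

```java
import java.util.*;

public class WordBank {
    // Treat each blank (_) as exactly one unknown character by turning
    // the puzzle into a regex: "c_t" becomes "c.t".
    static boolean hasMatch(String puzzle, Collection<String> bank) {
        String regex = puzzle.replace("_", ".");
        for (String word : bank) {
            if (word.matches(regex)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasMatch("c_t", Arrays.asList("cat", "dog"))); // true
    }
}
```

String.matches anchors the whole word, so the length check from step 1 comes for free.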
It may depend on what your interviewer is looking for... creativity, knowledge of algorithms, mastery of data structures? One off-the-cuff solution would be to substitute underscores for any spaces and use a LIKE clause in a SQL query.
SELECT word FROM dictionary WHERE word LIKE 'c_t'; should return "cat", "cot" and "cut".
If you're being evaluated on your ability to divide and conquer, then you should be able to reason whether it's more work to extract a list of candidate words and evaluate each against your criteria, or to generate a list of candidate words from your criteria and evaluate each against your dictionary.
I am converting single Chinese characters into roman letters (pinyin) using the pinyin4j package in Java. However, this often yields multiple pinyins for one character (the same character has different pronunciations). Say character C1 converts to 2 pinyin forms, p1 and p2, and character C2 converts to 3 pinyin forms, q1, q2, q3.
When I combine C1C2 into a word, it yields 2*3 = 6 combinations. Usually only one of these is a real word. I want to check these combinations against a lexicon text file I built, in which lines starting with \w (a word character) are lexical entries (so, for instance, only p1q2 out of the 6 combinations is found in the lexicon). I'm thinking about reading the lexicon file into a HashSet, but I'm not sure how best to implement the whole process. Any suggestions?
A HashSet seems quite alright. If the lexicon is extra large and you have to be super fast, consider a Trie data structure; there is, however, no Trie implementation in the standard Java library.
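A minimal sketch of the whole process with a HashSet (the lexicon contents here are made up; in practice you would read them from your file, one entry per line):

```java
import java.util.*;

public class PinyinLookup {
    // Build the cartesian product of per-character readings,
    // then keep only the combinations found in the lexicon.
    static List<String> realWords(List<List<String>> readings, Set<String> lexicon) {
        List<String> combos = new ArrayList<>();
        combos.add("");
        for (List<String> options : readings) {
            List<String> next = new ArrayList<>();
            for (String prefix : combos) {
                for (String option : options) {
                    next.add(prefix + option);
                }
            }
            combos = next;
        }
        List<String> found = new ArrayList<>();
        for (String combo : combos) {
            if (lexicon.contains(combo)) {
                found.add(combo);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("p1q2")); // read from your file
        List<List<String>> readings = Arrays.asList(
            Arrays.asList("p1", "p2"),       // readings of C1
            Arrays.asList("q1", "q2", "q3")  // readings of C2
        );
        System.out.println(realWords(readings, lexicon)); // prints [p1q2]
    }
}
```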
I have a list of words (assume they are stored in String[] if you must). I want to filter out words that belong to a broad general category such as Music or Sports.
Is there a ready-made solution for this (even if it's only for a limited set of general categories)?
Or how would you go about doing this?
It is to be done in Java 1.6 and it is an NLP (Natural Language Processing) problem. The input list has random words, and I want to extract from this large list only the words that belong to a given general category (which will be a subset).
Another way of thinking: Given a single word, I want to determine if this word belongs to a category. Something like this:
String word1 = "football"; //the strings will always be single word units
String word2 = "telephone";
boolean b1 = belongsToCategory(Categories.SPORTS, word1); //true
boolean b2 = belongsToCategory(Categories.SPORTS, word2); //false
If you need more info, please ask.
Well, my idea would be to hold a set of words for each category and look the word up in each set.
Of course, this set would get huge and impossible to maintain if you held all the inflected forms for a single word. I'd consider using lemmatization to limit the size of this set.
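A bare-bones sketch of the per-category set idea, mirroring the belongsToCategory signature from the question (the category contents here are illustrative; real sets would come from curated, ideally lemmatized, word lists):

```java
import java.util.*;

public class Categories {
    public enum Category { SPORTS, MUSIC }

    // One set of (lemmatized) words per category; populate from word lists.
    private static final Map<Category, Set<String>> WORDS = new EnumMap<>(Category.class);
    static {
        WORDS.put(Category.SPORTS, new HashSet<>(Arrays.asList("football", "tennis")));
        WORDS.put(Category.MUSIC, new HashSet<>(Arrays.asList("guitar", "piano")));
    }

    static boolean belongsToCategory(Category category, String word) {
        return WORDS.get(category).contains(word.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(belongsToCategory(Category.SPORTS, "football"));  // true
        System.out.println(belongsToCategory(Category.SPORTS, "telephone")); // false
    }
}
```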
You might be interested in checking the following links:
Lemmatization on Wikipedia
Lemmatization java
I am guessing the key of a less-simple simple substitution ciphertext. The rule by which I evaluate the correctness of a key is the number of English words in the putative decryption.
Are there any tools in java that can check the number of english words in a string. For example,
"thefoitedstateswasat"-> 4 words
"thefortedxyzstateswasathat"->5 words.
I loaded a word list into a HashSet and use it as a dictionary. Since I don't know where the inter-word spaces belong in the text, I can't validate words with a simple dictionary lookup.
Thanks.
I gave an answer to a similar question here:
If a word is made up of two valid words
It has some Java-esque pseudocode in it that might be adaptable into something that solves this problem.
Sorry, I'm new and don't have the rep to comment yet.
But wouldn't the code be very slow, since the number of checks and permutations is very large?
I guess you just have to brute-force your way through with nested for loops over the (n-1) possible split points, and then search the dictionary for each substring.
Surely there's a better way to test the accuracy of your key?
But that's not the point, here's what I'd do:
Using "quackdogsomethinggodknowswhat"
I'd have a recursive method: starting at the beginning of the string, call it recursively for every dictionary word the subject string starts with (here "qua" and "quack"), passing along the string with that word removed ("dogsomethinggodknowswhat" for "quack"). Return whichever is greater: 1 plus the greatest value returned by those calls, or 0 plus the result of calling the method on the string starting at index 1 ("uackdogsomethinggodknowswhat").
This would probably work best if you kept your wordlist in a tree of some sort.
If you need some pseudocode, ask!
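Here's a rough Java version of that recursion, memoized on the start index so each position is solved once (the word list and names are mine; a HashSet stands in for the tree-based word list):

```java
import java.util.*;

public class WordCounter {
    // Maximum number of dictionary words recoverable from s, skipping junk letters.
    static int maxWords(String s, Set<String> dictionary) {
        return maxWords(s, 0, dictionary, new HashMap<>());
    }

    private static int maxWords(String s, int start, Set<String> dict,
                                Map<Integer, Integer> memo) {
        if (start >= s.length()) return 0;
        Integer cached = memo.get(start);
        if (cached != null) return cached;
        // Option 1: treat s.charAt(start) as junk and move on.
        int best = maxWords(s, start + 1, dict, memo);
        // Option 2: consume any dictionary word the remainder starts with.
        for (String word : dict) {
            if (s.startsWith(word, start)) {
                best = Math.max(best, 1 + maxWords(s, start + word.length(), dict, memo));
            }
        }
        memo.put(start, best);
        return best;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("the", "states", "was", "at"));
        System.out.println(maxWords("thefoitedstateswasat", dict)); // prints 4
    }
}
```

Looping over the whole dictionary at each position is the crude part; a trie (or a prefix check bounded by the longest word) would cut that down considerably.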