Guess words using a dictionary - Java

I am guessing the key of a less-simple simple substitution ciphertext. The rule I use to evaluate the correctness of a key is the number of English words in the putative decryption.
Are there any tools in Java that can count the number of English words in a string? For example,
"thefoitedstateswasat"-> 4 words
"thefortedxyzstateswasathat"->5 words.
I loaded a word list into a HashSet to use as a dictionary. As I don't know where the inter-word spaces belong in the text, I can't validate words with a simple dictionary lookup.
Thanks.

I gave an answer to a similar question here:
If a word is made up of two valid words
It has some Java-esque pseudocode in it that might be adaptable into something that solves this problem.

Sorry, I'm new and don't have the rep to comment yet.
But wouldn't that code be very slow, since the number of checks and permutations is very large?
I guess you just have to brute-force your way through by using (n-1) nested for loops, and then searching the dictionary for each substring.

Surely there's a better way to test the accuracy of your key?
But that's not the point, here's what I'd do:
Using "quackdogsomethinggodknowswhat"
I'd have a recursive method that starts at the beginning of the string. For every dictionary word the string starts with (in this case "qua" and "quack"), it calls itself on the remainder of the string after that word ("dogsomethinggodknowswhat" for "quack"). It returns whichever is greater: 1 + the greatest value returned by those calls, or 0 + the result of the call for the string starting at index 1 ("uackdogsomethinggodknowswhat").
This would probably work best if you kept your word list in a tree of some sort, such as a trie.
If you need some pseudocode, ask!
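The recursion described above can be sketched as follows. This is a minimal sketch, not the answerer's code: it evaluates the same recurrence bottom-up (from the end of the string) so that each suffix is solved only once, and it uses a plain HashSet instead of a tree. The class name WordCounter, the method maxWords, and the tiny dictionary are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class WordCounter {
    // Hypothetical helper: largest number of non-overlapping dictionary
    // words that can be found in text, reading left to right.
    static int maxWords(String text, Set<String> dictionary) {
        // No candidate substring needs to be longer than the longest word.
        int longest = 0;
        for (String w : dictionary) {
            longest = Math.max(longest, w.length());
        }

        int n = text.length();
        // best[i] = best count for text.substring(i); best[n] stays 0.
        int[] best = new int[n + 1];
        for (int i = n - 1; i >= 0; i--) {
            // Option 1: skip the character at i ("0 + the call at index 1").
            best[i] = best[i + 1];
            // Option 2: take any dictionary word that starts at i.
            int limit = Math.min(n, i + longest);
            for (int end = i + 1; end <= limit; end++) {
                if (dictionary.contains(text.substring(i, end))) {
                    best[i] = Math.max(best[i], 1 + best[end]);
                }
            }
        }
        return best[0];
    }

    public static void main(String[] args) {
        // Assumed toy dictionary; a real word list would be loaded from a file.
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("the", "states", "was", "at"));
        System.out.println(maxWords("thefoitedstateswasat", dictionary));
    }
}
```

With a full English word list, short words such as "a" or "as" will inflate the count, so it may be worth ignoring very short words when scoring a key.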

Related

Rearranging one string to another in Java

I am trying to find whether a part of given string A can be or can not be rearranged to given string B (Boolean output).
Since the algorithm must be at most O(n), to ease it, I used stringA.retainAll(stringB), so now I know string A and string B consist of the same set of characters and now the whole task smells like regex.
And, reading about regex, I might now have two problems (c).
The question is: do I risk O(infinity) by using regex, or is it more efficient to use the Stream API to find whether each character of string A has enough duplicates to cover each character of string B? Not to mention that regex syntax is not easy to read or build.
As of now, I can't use sorting (any comparison sort is at least n*log(n)) or hash sets and the like (as they eliminate duplicates in both strings).
Thank you.
You can use a HashMap<Character,Integer> to count the number of occurrences of each character of the first String. That would take linear time.
Then, for each Character of the second String, find if it's in the HashMap and decrement the counter (if it's still positive). This will also take linear time, and if you manage to decrement the counters for all the characters of the second String, you succeed.
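A minimal sketch of this counting approach, assuming the goal is to decide whether string B can be built from the characters of string A. The class name RearrangeCheck and the method canBuild are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

class RearrangeCheck {
    // True if every character of b can be taken from a, respecting how
    // many times each character occurs. Runs in O(|a| + |b|).
    static boolean canBuild(String a, String b) {
        Map<Character, Integer> counts = new HashMap<>();
        // First pass: count occurrences of each character of a.
        for (char c : a.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        // Second pass: consume one occurrence for each character of b.
        for (char c : b.toCharArray()) {
            Integer remaining = counts.get(c);
            if (remaining == null || remaining == 0) {
                return false; // a has no (more) copies of this character
            }
            counts.put(c, remaining - 1);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canBuild("hello world", "low"));
        System.out.println(canBuild("abc", "aab"));
    }
}
```

If the alphabet is known to be small (for example, lowercase ASCII), an int array indexed by character would do the same job without boxing.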

Fill in the Blank String

I am studying for an interview and having trouble with this question.
Basically, you have a word that has spaces in it like c_t.
You have a word bank and have to find all the possible words that can be made with the given string. So in this case, if cat was in the word bank we would return true.
Any help on solving this question (like an optimal algorithm) would be appreciated.
I think we can start with checking lengths of strings in the word bank and then maybe use a hashmap somehow.
Step 1.) Eliminate all words in the word bank that don't have the same length as the specified one.
Step 2.) Eliminate all words in the bank that don't have the same starting sequence and ending sequence.
Step 3.) If the specified string is fragmented like c_ter_il_ar, for each word left in the bank check if it contains the isolated sequences at those exact same indexes, such as ter and il, and eliminate those that don't have them.
Step 4.) At this point all the words left in the bank are viable solutions, so return true if the bank is non-empty
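The steps above can be folded into a single pass per word: compare lengths, then compare every non-blank position. The following is a minimal sketch under that assumption; BlankMatcher, matches, and candidates are hypothetical names, and '_' is assumed to mark a blank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlankMatcher {
    // True if word has the same length as pattern and agrees with it at
    // every position that is not a blank ('_').
    static boolean matches(String pattern, String word) {
        if (pattern.length() != word.length()) {
            return false; // Step 1: length filter
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            // Steps 2 and 3: fixed letters must match at the same index.
            if (p != '_' && p != word.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Step 4: every word that survives is a viable solution.
    static List<String> candidates(String pattern, List<String> bank) {
        List<String> result = new ArrayList<>();
        for (String word : bank) {
            if (matches(pattern, word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> bank = Arrays.asList("cat", "cot", "cart", "dog", "cut");
        System.out.println(candidates("c_t", bank));
    }
}
```

Returning true then amounts to checking that the candidate list is non-empty.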
It may depend on what your interviewer is looking for... creativity, knowledge of algorithms, mastery of data structures? One off-the-cuff solution would be to substitute underscores for any spaces and use a LIKE clause in a SQL query.
SELECT word FROM dictionary WHERE word LIKE 'c_t'; should return "cat", "cot" and "cut".
If you're being evaluated on your ability to divide and conquer, then you should be able to reason whether it's more work to extract a list of candidate words and evaluate each against your criteria, or to generate a list of candidate words from your criteria and evaluate each against your dictionary.

Finding common phrases in text

In the past I've written code to find common words in a body of text, but I was curious if there is a known way to find common phrases in a body of text? (In java)
Does anyone know how to accomplish something like this without Lucene or nlp? What other tools or solutions are there?
It is difficult to give you an answer without knowing exactly what you want to do. A naive answer to your problem would be to split the text at punctuation marks and use a data structure to store a counter for every sentence in your text, incrementing the counter each time you parse that sentence from the text.
You could use, for example, a priority queue to keep the sentences sorted by their counters. Then you could remove the maximum element n times to get the n most common sentences, or pop sentences until the counter drops below a threshold you choose.
However, if you don't want exact sentences, you'll either have to change what you store in the priority queue or use another algorithm altogether.
Hope this at least helps!
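A minimal sketch of the naive approach described in this answer, assuming that "sentence" means any run of text between punctuation marks and that comparison is case-insensitive. The class name CommonSentences and the method mostCommon are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class CommonSentences {
    // Returns up to n of the most frequent sentences, most frequent first.
    static List<String> mostCommon(String text, int n) {
        // Count each sentence; punctuation marks act as separators.
        Map<String, Integer> counts = new HashMap<>();
        for (String part : text.split("[.!?;,]+")) {
            String sentence = part.trim().toLowerCase();
            if (!sentence.isEmpty()) {
                counts.merge(sentence, 1, Integer::sum);
            }
        }

        // Max-heap on the counter, so poll() yields the most common first.
        PriorityQueue<Map.Entry<String, Integer>> queue = new PriorityQueue<>(
                (x, y) -> Integer.compare(y.getValue(), x.getValue()));
        queue.addAll(counts.entrySet());

        List<String> result = new ArrayList<>();
        while (result.size() < n && !queue.isEmpty()) {
            result.add(queue.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "I am here. You are there. I am here! Go away. "
                + "I am here. You are there.";
        System.out.println(mostCommon(text, 2));
    }
}
```

Sentences with equal counts come out in no particular order, since the comparator only looks at the counter.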
A somewhat indirect algorithm:
One could create a permuted index: for every word in every sentence, store the sentence, then sort the entries on that word, then on the remainder of the sentence after it, and then on everything before it. The before-part is irrelevant.
Then you should be able to count common phrases of two and more words.

Regex unordered matches

This feels like it should be an extremely simple thing to do with regex but I can't quite seem to figure it out.
I would like to write a regex which checks to see if a list of certain words appear in a document, in any order, along with any of a set of other words in any order.
In boolean logic the check would be:
If allOfTheseWords are in this text and atLeastOneOfTheseWords are in this text, return true.
Example
I'm searching for (john and barbara) with (happy or sad).
Order does not matter.
"Happy birthday john from barbara" => VALID
"Happy birthday john" => INVALID
I simply cannot figure out how to get the and part to match in an orderless way, any help would be appreciated!
You don't really want to use a regex for this unless the text is very small, which from your description I doubt.
A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.
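A minimal sketch of the HashSet idea, assuming words are separated by non-word characters and that matching should ignore case. WordPresence and its matches method are hypothetical names, and the search words are expected in lowercase.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WordPresence {
    // True if text contains every word in allOf and at least one word in
    // anyOf, in any order. Search words are assumed to be lowercase.
    static boolean matches(String text, List<String> allOf, List<String> anyOf) {
        // Dump all the words of the text into a set for O(1) lookups.
        Set<String> words =
                new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        if (!words.containsAll(allOf)) {
            return false;
        }
        for (String candidate : anyOf) {
            if (words.contains(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> allOf = Arrays.asList("john", "barbara");
        List<String> anyOf = Arrays.asList("happy", "sad");
        System.out.println(matches("Happy birthday john from barbara", allOf, anyOf));
        System.out.println(matches("Happy birthday john", allOf, anyOf));
    }
}
```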
If you want to do it with regex, I'd try positive lookahead:
// searching for (john and barbara) with (happy or sad)
"^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"
The performance should be comparable to doing a full text search for each of the words in the allOfTheseWords group separately.
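For reference, here is how the lookahead pattern might be used from Java. This is a sketch under two assumptions not stated in the answer: matching is case-insensitive (so that "Happy" counts), and DOTALL is enabled so the words may be on different lines. Note that the backslashes must be doubled inside a Java string literal. The class name LookaheadMatch is hypothetical.

```java
import java.util.regex.Pattern;

class LookaheadMatch {
    // Each (?=...) lookahead asserts that one required word occurs
    // somewhere in the text without consuming any input, so the order of
    // the words does not matter.
    static final Pattern PATTERN = Pattern.compile(
            "^(?=.*\\bjohn\\b)(?=.*\\bbarbara\\b).*\\b(happy|sad)\\b",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean matches(String text) {
        // find() is enough: the pattern is anchored at the start with ^.
        return PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("Happy birthday john from barbara"));
        System.out.println(matches("Happy birthday john"));
    }
}
```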
If you really need a single regex, then it would be very large and very slow due to backtracking. For your particular example of (John AND Barbara) AND (Happy or Sad), it would start like this:
\bJohn\b.*?\bBarbara\b.*?\bHappy\b|\bJohn\b.*?\bBarbara\b.*?\bSad\b|......
You'd ultimately need to put all combinations in the regex. Something like:
JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ
Again backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regexes here.
With your example, this is a regex that may help you :
Regex
(?:happy|sad).*?john.*?barbara|
(?:happy|sad).*?barbara.*?john|
barbara.*?john.*?(?:happy|sad)|
john.*?barbara.*?(?:happy|sad)|
barbara.*?(?:happy|sad).*?john|
john.*?(?:happy|sad).*?barbara
Output
happy birthday john from barbara => Matched
Happy birthday john => Not matched
As mentioned in other responses, a regex may not be well suited here.
It might be possible to do it with regexp, but it would be so complicated that it's better to use some different way (for example using a HashSet, as mentioned in the other answers).
One way to do it with regex would be to calculate all the permutations of the words which you are looking for, and then write a regex which mentions all those permutations. With 2 words there would be 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries), with 3 words there would be 6 permutations, and quite soon the number of permutations would be larger than your input file.
If your data is relatively constant, and you are planning on searching a lot, using Apache Lucene will ensure better performance.
Using information retrieval techniques, you will first index all your documents/sentences, and then search for your words. In your example you would want to search for "+(+john +barbara) +(sad happy)" [or "(john AND barbara) AND (sad OR happy)"].
This approach will consume some time when indexing; however, searching will be much faster than any regex/hashset approach (since you don't need to iterate over all documents...)

Java Anagram Solver

I can work out how to create anagrams of a string but I don't know how I can compare them to a dictionary of real words to check if the anagram is a real word. Is there a class in the Java API that contains the entire English dictionary?
No, but you can get a wordlist from various places. From there, you could read the wordlist file into a list:
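import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// The enclosing method must handle or declare IOException.
List<String> lines = new ArrayList<>();
// try-with-resources closes the reader even if readLine() throws
try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"))) {
    String line;
    while ((line = in.readLine()) != null) {
        lines.add(line);
    }
}
Finally, use lines.contains() to look up your candidate word. Note that contains() on a list is a linear scan; if the word list is sorted, Collections.binarySearch() gives a true binary search.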
One method of determining whether a set of characters is an anagram of a word involves using prime numbers. Assign each letter a prime number, for example, a=2, b=3, c=5, d=7. Now precompute the product of primes for each word in your dictionary. For example, 'add' = 2*7*7 = 98, or 'bad' = 3*2*7 = 42.
Now determining if a set of letters is an anagram of any word in a dictionary can be done by computing the value of the set of letters. For example, the letters 'abd'= 2*3*7 = 42 = 'bad'. Just check whether the computed value for the letters exists in your precomputed dictionary. For any anagram, you need only do this computation once versus trying to generate every possible anagram. Note however this method will only work well for relatively small words, otherwise you will run into overflow issues and need to use BigInteger.
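A minimal sketch of the prime-product idea, assuming lowercase a-z letters mapped to the first 26 primes in alphabetical order (so a=2, b=3, c=5, d=7 as in the example above). It uses BigInteger throughout to sidestep the overflow issue, at some cost in speed. The class name PrimeAnagrams and the methods signature and index are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrimeAnagrams {
    // First 26 primes, one per letter: a=2, b=3, c=5, d=7, ...
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    // Product of the primes of all letters; characters outside a-z are
    // ignored. Two strings are anagrams exactly when their products match.
    static BigInteger signature(String letters) {
        BigInteger product = BigInteger.ONE;
        for (char c : letters.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                product = product.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
            }
        }
        return product;
    }

    // Precompute signature -> words once for the whole dictionary.
    static Map<BigInteger, List<String>> index(List<String> words) {
        Map<BigInteger, List<String>> bySignature = new HashMap<>();
        for (String word : words) {
            bySignature.computeIfAbsent(signature(word), k -> new ArrayList<>())
                       .add(word);
        }
        return bySignature;
    }

    public static void main(String[] args) {
        Map<BigInteger, List<String>> dictionary =
                index(Arrays.asList("bad", "add", "dab"));
        // One lookup replaces generating every permutation of "abd".
        System.out.println(dictionary.get(signature("abd")));
    }
}
```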
No, you have to use an external library, such as JWNL, which is a wrapper for WordNet -- a machine-readable lexical database organized by meanings, that contains pretty much every English word.
Maybe the English dictionary in jazzy can help you.
There's no such specialized class in the standard Java library, but you can use any implementation you like of the Set interface and initialize it by loading it up with words of your choosing, picked from any of the innumerable word lists you can find in many places (just check out carefully that the license for the word list you choose is compatible with your intended application, e.g., does it allow commercial use, closed-source apps if that's what you require, and so forth).
