Fill in the Blank String - java

I am studying for an interview and having trouble with this question.
Basically, you have a word that has spaces in it like c_t.
You have a word bank and have to find all the possible words that can be made with the given string. So in this case, if cat was in the word bank we would return true.
Any help on solving this question (like an optimal algorithm) would be appreciated.
I think we can start with checking lengths of strings in the word bank and then maybe use a hashmap somehow.

Step 1.) Eliminate all words in the word bank that don't have the same length as the specified one.
Step 2.) Eliminate all words in the bank that don't have the same starting sequence and ending sequence.
Step 3.) If the specified string is fragmented like c_ter_il_ar, for each word left in the bank check whether it contains the isolated sequences, such as ter and il, at those exact same indexes, and eliminate the words that don't.
Step 4.) At this point all the words left in the bank are viable solutions, so return true if the bank is non-empty.
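The four steps above collapse into one positional comparison: a word survives if it has the pattern's length and agrees with it at every non-blank index. Here is a minimal sketch, assuming `_` marks a blank as in c_t; the class and method names (`BlankMatcher`, `matches`, `candidates`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlankMatcher {
    // Assumption: '_' marks a blank, as in the question's "c_t".
    static final char BLANK = '_';

    // True if the word has the pattern's length and agrees with it at every non-blank index.
    static boolean matches(String pattern, String word) {
        if (pattern.length() != word.length()) {
            return false; // step 1: length filter
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != BLANK && p != word.charAt(i)) {
                return false; // steps 2 and 3: fixed letters must line up
            }
        }
        return true;
    }

    // Step 4: everything that survives the filter is a viable solution.
    static List<String> candidates(String pattern, List<String> bank) {
        List<String> result = new ArrayList<>();
        for (String word : bank) {
            if (matches(pattern, word)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> bank = Arrays.asList("cat", "cot", "dog", "cart");
        System.out.println(candidates("c_t", bank)); // prints [cat, cot]
    }
}
```

Returning true is then just `!candidates(pattern, bank).isEmpty()`. The scan touches each character of each word at most once.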

It may depend on what your interviewer is looking for... creativity, knowledge of algorithms, mastery of data structures? One off-the-cuff solution would be to substitute underscores for any spaces and use a LIKE clause in a SQL query.
SELECT word FROM dictionary WHERE word LIKE 'c_t'; should return "cat", "cot" and "cut".
If you're being evaluated on your ability to divide and conquer, then you should be able to reason whether it's more work to extract a list of candidate words and evaluate each against your criteria, or to generate a list of candidate words from your criteria and evaluate each against your dictionary.
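To make the second option concrete, here is a rough sketch that generates every fill of the blanks and looks each one up in the dictionary. It assumes lowercase a to z only and `_` as the blank; `BlankFiller`, `fill` and `solve` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BlankFiller {
    // Recursively replaces each '_' with 'a'..'z' and keeps the fills found in the dictionary.
    static void fill(char[] chars, int index, Set<String> dictionary, List<String> found) {
        if (index == chars.length) {
            String candidate = new String(chars);
            if (dictionary.contains(candidate)) {
                found.add(candidate);
            }
            return;
        }
        if (chars[index] != '_') {
            fill(chars, index + 1, dictionary, found); // fixed letter, move on
            return;
        }
        for (char c = 'a'; c <= 'z'; c++) {
            chars[index] = c;
            fill(chars, index + 1, dictionary, found);
        }
        chars[index] = '_'; // restore the blank for the caller
    }

    static List<String> solve(String pattern, Set<String> dictionary) {
        List<String> found = new ArrayList<>();
        fill(pattern.toCharArray(), 0, dictionary, found);
        return found;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("cat", "cot", "cut", "dog"));
        System.out.println(solve("c_t", dictionary)); // prints [cat, cot, cut]
    }
}
```

With k blanks this performs 26^k lookups, so it only beats scanning the word bank when the pattern has very few blanks and the bank is large.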

Related

Word Search: two string arrays in alphabetical order using merge sort

For my class project, we have to go through a Shakespeare sonnet and check whether each word is in the dictionary or not. Now I have two String arrays, both in alphabetical order: one consists of the words from the sonnet and the other consists of the words from the dictionary. I am asked to use merge sort to check whether each word in the sonnet exists in the dictionary. Can anyone give me an idea of how I can implement this? Thanks in advance!
The idea is to:
Sort both of the arrays (with merge sort)
Remove any duplicates
Iterate through both of the sorted arrays simultaneously (can be done using the merging procedure in mergesort) and check if the next word in the sonnet list equals the next word in the dictionary. If it does not, remove it, and mark it as "not in dictionary", if it is, mark it as "in the dictionary", and proceed to the next element in both lists
However, this approach assumes that all of the words in the dictionary are contained in the sonnet. If this is not the case, you would have to remove those words up front.
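Here is a sketch of the simultaneous walk over the two sorted arrays. Note that it is a variant of the idea above, not the exact procedure: instead of removing extra dictionary words up front, it skips them during the scan. It assumes both arrays are already sorted in natural String order; `SortedLookup` and `missingFromDictionary` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

class SortedLookup {
    // Both arrays must already be sorted in the same (natural String) order.
    static List<String> missingFromDictionary(String[] sonnet, String[] dictionary) {
        List<String> missing = new ArrayList<>();
        int i = 0; // position in the sonnet array
        int j = 0; // position in the dictionary array
        while (i < sonnet.length) {
            if (j == dictionary.length) {
                missing.add(sonnet[i++]); // dictionary exhausted, nothing left to match
                continue;
            }
            int cmp = sonnet[i].compareTo(dictionary[j]);
            if (cmp == 0) {
                i++; // found; j stays put so a repeated sonnet word still matches
            } else if (cmp > 0) {
                j++; // dictionary word sorts earlier, skip it
            } else {
                missing.add(sonnet[i++]); // sonnet word sorts earlier, so it cannot appear later
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String[] sonnet = {"art", "compare", "thee", "thou"};
        String[] dictionary = {"a", "art", "compare", "day", "thee"};
        System.out.println(missingFromDictionary(sonnet, dictionary)); // prints [thou]
    }
}
```

Each index only moves forward, so the scan is linear in the combined length of the two arrays.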
Really though; this doesn't sound like a sort problem.
The best approach would be to use a HashMap and put all the dictionary words in that. Then you could iterate through the sonnet, and check for existence in the map.
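A minimal sketch of the hash-based lookup follows. It uses a HashSet rather than a HashMap, since only membership is needed and no value is attached to each word; that swap, the lowercasing, and the names `DictionaryCheck` and `countKnown` are assumptions of this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DictionaryCheck {
    // Counts the sonnet words that appear in the dictionary set.
    // Assumption: dictionary entries are stored in lowercase.
    static int countKnown(String[] sonnetWords, Set<String> dictionary) {
        int known = 0;
        for (String word : sonnetWords) {
            if (dictionary.contains(word.toLowerCase())) {
                known++;
            }
        }
        return known;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("shall", "i", "compare"));
        String[] sonnet = {"Shall", "I", "compare", "thee"};
        System.out.println(countKnown(sonnet, dictionary)); // prints 3
    }
}
```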

How do you find a phrase with Lucene?

I hope the way I worded my question is correct, though I could be mistaken. Basically, I have an index with term vectors, positions, and offsets, and I want to be able to do the following: when I see the word "do", check to see if the next word is "you". If so, treat those two words as one phrase for the purposes of scoring. I'm doing this to avoid splitting up words that are commonly used together anyway. Instead of my list of words sorted by score looking like this,
do
want
you
come
to
I'd like to see something more like this
do you
want
come
to
One workaround would be to index both by word and by phrase, so your scoring list would be:
do you
want
come
to
do
you
If you then applied a boost to your phrases during indexing, you would be closer to your goal. But that depends on whether matching phrases should always rank higher than their individual words.
It might also be worth looking at Boosting Lucene Terms When Building the Index.

Finding common phrases in text

In the past I've written code to find common words in a body of text, but I was curious if there is a known way to find common phrases in a body of text? (In java)
Does anyone know how to accomplish something like this without Lucene or nlp? What other tools or solutions are there?
It is difficult to give you an answer without knowing exactly what you want to do. A naive answer to your problem would be to split the text at punctuation marks, and use a data structure to store a counter for every sentence in your text, incrementing the counter each time you parse that sentence from the text.
You could use, for example, a priority queue to keep the sentences sorted by their counters. Then you could remove the maximum element n times for the n most common sentences, or pop sentences until the counter drops below a threshold you choose.
However, if you don't want exact sentences, either you'll have to change what you store in the priority queue or you would have to use another algorithm altogether.
Hope this at least helps!
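As a rough sketch of that naive approach: split at sentence punctuation, count each sentence in a map, then pull the n largest counts from a priority queue. The splitting regex, the lowercasing, the alphabetical tie-break and the names `SentenceCounter` and `topSentences` are all assumptions of this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class SentenceCounter {
    // Returns up to n of the most frequent sentences, most frequent first.
    static List<String> topSentences(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String raw : text.split("[.!?]+")) {
            String sentence = raw.trim().toLowerCase();
            if (!sentence.isEmpty()) {
                counts.merge(sentence, 1, Integer::sum); // increment this sentence's counter
            }
        }
        // Max-heap on the counter; ties are broken alphabetically so the result is deterministic.
        PriorityQueue<Map.Entry<String, Integer>> queue = new PriorityQueue<>((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        queue.addAll(counts.entrySet());
        List<String> top = new ArrayList<>();
        while (top.size() < n && !queue.isEmpty()) {
            top.add(queue.poll().getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        String text = "I came. I saw. I came! You left? I saw. I came.";
        System.out.println(topSentences(text, 2)); // prints [i came, i saw]
    }
}
```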
A somewhat indirect algorithm:
One could create a permuted index: for every word in every sentence, store the sentence keyed on that word, then sort first on the word, then on the rest of the sentence after it, and only then on the part before it. The before-part is irrelevant here.
Then you should be able to count common phrases of two and more words.
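One possible reading of the permuted index, sketched under assumptions: since the before-part is irrelevant, each entry here is simply the sentence from a given word onward. After sorting word by word, entries that start with the same phrase sit next to each other, so counting runs yields the repeated phrases. `PermutedIndex`, `repeatedPhrases` and the whitespace tokenization are illustrative choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PermutedIndex {
    // Compares two word sequences word by word; a shorter sequence sorts first on a tie.
    static int compareWords(String[] a, String[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // True if both entries start with the same phraseLength words.
    static boolean samePrefix(String[] a, String[] b, int phraseLength) {
        for (int i = 0; i < phraseLength; i++) {
            if (!a[i].equals(b[i])) {
                return false;
            }
        }
        return true;
    }

    // Maps every phrase of phraseLength words that occurs at least twice to its count.
    static Map<String, Integer> repeatedPhrases(List<String> sentences, int phraseLength) {
        List<String[]> entries = new ArrayList<>();
        for (String sentence : sentences) {
            String[] words = sentence.toLowerCase().trim().split("\\s+");
            // One entry per keyword position that still has phraseLength words after it.
            for (int i = 0; i + phraseLength <= words.length; i++) {
                entries.add(Arrays.copyOfRange(words, i, words.length));
            }
        }
        entries.sort(PermutedIndex::compareWords);

        Map<String, Integer> result = new LinkedHashMap<>();
        int runStart = 0;
        for (int i = 1; i <= entries.size(); i++) {
            boolean sameRun = i < entries.size()
                    && samePrefix(entries.get(runStart), entries.get(i), phraseLength);
            if (!sameRun) {
                int runLength = i - runStart;
                if (runLength >= 2) {
                    String[] phrase = Arrays.copyOfRange(entries.get(runStart), 0, phraseLength);
                    result.put(String.join(" ", phrase), runLength);
                }
                runStart = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "do you want to come", "do you want tea", "I want to go");
        System.out.println(repeatedPhrases(sentences, 2)); // prints {do you=2, want to=2, you want=2}
    }
}
```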

Regex unordered matches

This feels like it should be an extremely simple thing to do with regex but I can't quite seem to figure it out.
I would like to write a regex which checks to see if a list of certain words appear in a document, in any order, along with any of a set of other words in any order.
In boolean logic the check would be:
If allOfTheseWords are in this text and atLeastOneOfTheseWords are in this text, return true.
Example
I'm searching for (john and barbara) with (happy or sad).
Order does not matter.
"Happy birthday john from barbara" => VALID
"Happy birthday john" => INVALID
I simply cannot figure out how to get the and part to match in an orderless way, any help would be appreciated!
You don't really want to use a regex for this unless the text is very small, which from your description I doubt.
A simple solution would be to dump all the words into a HashSet, at which point checking to see if a word is present becomes a very quick and easy operation.
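A minimal sketch of the HashSet approach, assuming words are separated by non-word characters and matching is case-insensitive; `WordSetMatcher` and its parameter names are made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class WordSetMatcher {
    // True if the text contains every word in allOf and at least one word in anyOf.
    // Assumption: both sets hold lowercase words.
    static boolean matches(String text, Set<String> allOf, Set<String> anyOf) {
        Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        return words.containsAll(allOf) && !Collections.disjoint(words, anyOf);
    }

    public static void main(String[] args) {
        Set<String> allOf = new HashSet<>(Arrays.asList("john", "barbara"));
        Set<String> anyOf = new HashSet<>(Arrays.asList("happy", "sad"));
        System.out.println(matches("Happy birthday john from barbara", allOf, anyOf)); // prints true
        System.out.println(matches("Happy birthday john", allOf, anyOf));              // prints false
    }
}
```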
If you want to do it with regex, I'd try positive lookahead:
// searching for (john and barbara) with (happy or sad)
"^(?=.*\bjohn\b)(?=.*\bbarbara\b).*\b(happy|sad)\b"
The performance should be comparable to doing a full text search for each of the words in the allOfTheseWords group separately.
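In Java source the backslashes in that pattern have to be doubled. Here is a small sketch of how it might be wired up; the `CASE_INSENSITIVE` and `DOTALL` flags are additions of this example (so that "Happy" matches and text spanning several lines is covered), and `LookaheadMatcher` is a hypothetical name.

```java
import java.util.regex.Pattern;

class LookaheadMatcher {
    // One lookahead per required word, then a normal match for either optional word.
    static final Pattern PATTERN = Pattern.compile(
            "^(?=.*\\bjohn\\b)(?=.*\\bbarbara\\b).*\\b(happy|sad)\\b",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean matches(String text) {
        return PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("Happy birthday john from barbara")); // prints true
        System.out.println(matches("Happy birthday john"));              // prints false
    }
}
```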
If you really need a single regex, then it would be very large and very slow due to backtracking. For your particular example of (John AND Barbara) AND (Happy or Sad), it would start like this:
\bJohn\b.*?\bBarbara\b.*?\bHappy\b|\bJohn\b.*?\bBarbara\b.*?\bSad\b|......
You'd ultimately need to put all combinations in the regex. Something like:
JBH, JBS, JHB, JSB, HJB, SJB, BJH, BJS, BHJ, BSJ, HBJ, SBJ
Again backtracking would be prohibitive, as would the explosion in the number of cases. Stay away from regexes here.
With your example, this is a regex that may help you :
Regex
(?:happy|sad).*?john.*?barbara|
(?:happy|sad).*?barbara.*?john|
barbara.*?john.*?(?:happy|sad)|
john.*?barbara.*?(?:happy|sad)|
barbara.*?(?:happy|sad).*?john|
john.*?(?:happy|sad).*?barbara
Output
happy birthday john from barbara => Matched
Happy birthday john => Not matched
As mentioned in other responses, a regex may not be well suited here.
It might be possible to do it with regexp, but it would be so complicated that it's better to use some different way (for example using a HashSet, as mentioned in the other answers).
One way to do it with regex would be to calculate all the permutations of the words which you are looking for, and then write a regex which mentions all those permutations. With 2 words there would be 2 permutations, as in (.*foo.*bar.*)|(.*bar.*foo.*) (plus word boundaries), with 3 words there would be 6 permutations, and quite soon the number of permutations would be larger than your input file.
If your data is relatively constant, and you are planning on searching a lot, using Apache Lucene will ensure better performance.
Using information retrieval techniques, you will first index all your documents/sentences, and then search for your words. In your example you would want to search for "+(+john +barbara) +(sad happy)" [or "(john AND barbara) AND (sad OR happy)"].
This approach will consume some time when indexing. However, searching will be much faster than any regex/HashSet approach (since you don't need to iterate over all documents...).

guess words using dictionary

I am guessing the key of a less-simple simple substitution ciphertext. The rule I use to evaluate the correctness of the key is the number of English words in the putative decryption.
Are there any tools in Java that can count the number of English words in a string? For example,
"thefoitedstateswasat"-> 4 words
"thefortedxyzstateswasathat"->5 words.
I loaded a word list and am using a HashSet as a dictionary. As I don't know where the inter-word spaces belong in the text, I can't validate words using a simple dictionary lookup.
Thanks.
I gave an answer to a similar question here:
If a word is made up of two valid words
It has some Java-esque pseudocode in it that might be adaptable into something that solves this problem.
Sorry, I'm new and do not have the rep to comment yet.
But wouldn't the code be very slow as the number of checks and permutations is very big?
I guess you just have to brute-force your way through by using (n-1) nested for loops, and then search the dictionary for each substring.
Surely there's a better way to test the accuracy of your key?
But that's not the point, here's what I'd do:
Using "quackdogsomethinggodknowswhat"
I'd have a recursive method that starts at the beginning of the string. For every word with which the subject string starts, in this case "qua" and "quack", I'd call the method recursively on the remainder of the string without that word ("dogsomethinggodknowswhat" for quack). Return whichever is greater: 1 + the greatest value returned out of all your method calls OR 0 + the method call for the string starting at index 1 ("uackdogsomethinggodknowswhat").
This would probably work best if you kept your wordlist in a tree of some sort.
If you need some pseudocode, ask!
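For what it's worth, the recursion described above might look roughly like this in Java. It looks prefixes up in a HashSet instead of a tree, and it memoizes results per start index so each position is solved only once; both are choices of this sketch, and `WordCounter`, `maxWords` and `best` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class WordCounter {
    // Largest number of non-overlapping dictionary words that can be read off the text in order.
    static int maxWords(String text, Set<String> dictionary) {
        int[] memo = new int[text.length() + 1];
        Arrays.fill(memo, -1); // -1 means "not computed yet"
        return best(text, 0, dictionary, memo);
    }

    private static int best(String text, int start, Set<String> dictionary, int[] memo) {
        if (start == text.length()) {
            return 0;
        }
        if (memo[start] >= 0) {
            return memo[start];
        }
        // Option 1: give up on this character and continue from the next index.
        int result = best(text, start + 1, dictionary, memo);
        // Option 2: take any dictionary word that starts here, then solve the remainder.
        for (int end = start + 1; end <= text.length(); end++) {
            if (dictionary.contains(text.substring(start, end))) {
                result = Math.max(result, 1 + best(text, end, dictionary, memo));
            }
        }
        memo[start] = result;
        return result;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("the", "states", "was", "at"));
        System.out.println(maxWords("thefoitedstateswasat", dictionary)); // prints 4
    }
}
```

With a real word list the count will usually be higher than a human reader expects, because short entries such as "a" or "as" match almost everywhere. Restricting the dictionary to longer words is one way to tame that.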
