In the past I've written code to find common words in a body of text, but I was curious whether there is a known way to find common phrases in a body of text (in Java).
Does anyone know how to accomplish something like this without Lucene or NLP libraries? What other tools or solutions are there?
It is difficult to give you an answer without knowing exactly what you want to do. A naive answer to your problem would be to split the text at punctuation marks and use a data structure to store a counter for every sentence in the text, incrementing a sentence's counter each time you parse it.
You could, for example, use a priority queue to keep the sentences sorted by their counters. Then you could remove the maximum element n times to get the n most common sentences, or keep popping sentences while the counter stays above a threshold you choose.
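A minimal sketch of that idea, assuming sentences end at ., ! or ? (the class and method names are mine):

    import java.util.*;

    public class CommonSentences {
        /** Return the n most frequent sentences, splitting the text at punctuation. */
        public static List<String> topSentences(String text, int n) {
            Map<String, Integer> counts = new HashMap<>();
            for (String raw : text.split("[.!?]+")) {
                String sentence = raw.trim().toLowerCase();
                if (!sentence.isEmpty()) counts.merge(sentence, 1, Integer::sum);
            }
            // Priority queue ordered by counter, highest first.
            PriorityQueue<Map.Entry<String, Integer>> pq =
                    new PriorityQueue<>((a, b) -> b.getValue() - a.getValue());
            pq.addAll(counts.entrySet());
            List<String> top = new ArrayList<>();
            for (int i = 0; i < n && !pq.isEmpty(); i++) top.add(pq.poll().getKey());
            return top;
        }
    }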
However, if you don't want exact sentences, you'll either have to change what you store in the priority queue or use another algorithm altogether.
Hope this at least helps!
A somewhat indirect algorithm:
You could build a permuted index: for every word in every sentence, store an entry keyed on that word followed by the rest of the sentence (the part of the sentence before the word is irrelevant for this purpose), and sort the entries. Entries that begin with the same words then end up adjacent.
From that you should be able to count common phrases of two or more words.
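A rough sketch of that idea for two-word phrases (the sample sentences and all names are made up):

    import java.util.*;

    public class PermutedIndex {
        public static void main(String[] args) {
            List<String> sentences = Arrays.asList(
                    "the cat sat on the mat", "the cat ran", "on the mat it sat");
            // One entry per word position: the suffix of the sentence from that word on.
            List<String> suffixes = new ArrayList<>();
            for (String sentence : sentences) {
                String[] words = sentence.toLowerCase().split("\\s+");
                for (int i = 0; i < words.length; i++) {
                    suffixes.add(String.join(" ", Arrays.copyOfRange(words, i, words.length)));
                }
            }
            Collections.sort(suffixes);  // entries starting with the same phrase become adjacent
            // Count two-word phrases by reading the sorted entries.
            Map<String, Integer> counts = new TreeMap<>();
            for (String suffix : suffixes) {
                String[] w = suffix.split("\\s+");
                if (w.length >= 2) counts.merge(w[0] + " " + w[1], 1, Integer::sum);
            }
            counts.forEach((phrase, n) -> { if (n > 1) System.out.println(phrase + ": " + n); });
        }
    }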
I am studying for an interview and having trouble with this question.
Basically, you have a word that has blanks in it, like c_t.
You have a word bank and have to find all the possible words that can be made from the given string. So in this case, if cat was in the word bank, we would return true.
Any help on solving this question (an optimal algorithm, ideally) would be appreciated.
I think we can start with checking lengths of strings in the word bank and then maybe use a hashmap somehow.
Step 1.) Eliminate all words in the word bank that don't have the same length as the specified one.
Step 2.) Eliminate all words in the bank that don't have the same starting sequence and ending sequence.
Step 3.) If the specified string is fragmented, like c_ter_il_ar, then for each word left in the bank, check that it contains the isolated sequences (here, ter and il) at those exact indexes, and eliminate the words that don't.
Step 4.) At this point all the words left in the bank are viable solutions, so return true if the bank is non-empty (a sketch follows below).
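All four steps reduce to a per-index character comparison, as in this sketch (the class and method names are made up):

    import java.util.*;

    public class WordBankMatcher {
        /** True if any word in the bank fits the blanked pattern, e.g. "c_t" or "c_ter_il_ar". */
        public static boolean anyMatch(String pattern, Collection<String> bank) {
            for (String word : bank) {
                if (word.length() != pattern.length()) continue;   // step 1: length filter
                boolean viable = true;
                for (int i = 0; i < pattern.length(); i++) {
                    char c = pattern.charAt(i);
                    // steps 2-3: every non-blank character must match at the same index
                    if (c != '_' && c != word.charAt(i)) { viable = false; break; }
                }
                if (viable) return true;                           // step 4: a viable word survives
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(anyMatch("c_t", Arrays.asList("cat", "dog")));  // true
        }
    }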
It may depend on what your interviewer is looking for... creativity, knowledge of algorithms, mastery of data structures? One off-the-cuff solution would be to substitute underscores for any spaces and use a LIKE clause in a SQL query.
SELECT word FROM dictionary WHERE word LIKE 'c_t'; should return "cat", "cot" and "cut".
If you're being evaluated on your ability to divide and conquer, then you should be able to reason whether it's more work to extract a list of candidate words and evaluate each against your criteria, or to generate a list of candidate words from your criteria and evaluate each against your dictionary.
I hope the way I worded my question is correct, though I could be mistaken. Basically, I have an index with term vectors, positions, and offsets, and I want to be able to do the following: when I see the word "do", check whether the next word is "you". If so, treat those two words as one phrase for the purposes of scoring. I'm doing this to avoid splitting up words that are commonly used together anyway. Instead of my list of words sorted by score looking like this:
do
want
you
come
to
I'd like to see something more like this:
do you
want
come
to
One workaround would be to index both by word and by phrase, so your scoring list would be:
do you
want
come
to
do
you
If you then applied a boost to your phrases during indexing, you would be closer to your goal. But that depends on whether matching phrases should always rank higher than their individual words.
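As a sketch of that two-level tokenization in plain Java (inside Lucene itself, I believe ShingleFilter produces these word-pair tokens), something like this emits both the single words and the two-word phrases for indexing:

    import java.util.*;

    public class PhraseTokenizer {
        /** Emit every word plus every two-word phrase, so both can be indexed and scored. */
        public static List<String> tokens(String text) {
            String[] words = text.toLowerCase().split("\\s+");
            List<String> tokens = new ArrayList<>(Arrays.asList(words));
            for (int i = 0; i + 1 < words.length; i++) {
                tokens.add(words[i] + " " + words[i + 1]);  // the bigram "phrase" token
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokens("do you want to come"));
            // [do, you, want, to, come, do you, you want, want to, to come]
        }
    }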
It might also be worth looking at Boosting Lucene Terms When Building the Index.
Let's say I have about 1000 sentences that I want to offer as suggestions when a user is typing into a field.
I was thinking about running a Lucene in-memory search and then feeding the results into the suggestion set.
The triggers for running the searches would be the space character and exit from the input field.
I intend to use this with GWT, so the client will just be getting the results from the server.
I don't want to do what Google is doing, where they complete each word and then make suggestions on each set of keywords. I just want to check the keywords and make suggestions based on those. Sort of like when I'm typing the title for a question here on Stack Overflow.
Has anyone done something like this before? Is there already a library I could use?
I was working on a similar solution. The paper titled Effective Phrase Prediction was quite helpful for me. You will have to prioritize the suggestions as well.
If you've only got 1000 sentences, you probably don't need a powerful indexer like Lucene. I'm not sure whether you want "complete the sentence" suggestions or "suggest other queries that have the same keywords" suggestions, so here are solutions to both:
Assuming that you want to complete the sentence the user has started typing, you could put all of your strings into a SortedSet and use the tailSet method to get the strings that are "greater" than the input string (the string comparator considers a longer string A that starts with string B to be "greater" than B). Then iterate over the first few entries of the returned set, collecting the strings whose first inputString.length() characters match the input string. You can stop iterating as soon as those characters no longer match.
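A sketch of that approach (the class and method names are mine):

    import java.util.*;

    public class PrefixSuggester {
        private final NavigableSet<String> sentences = new TreeSet<>();

        public PrefixSuggester(Collection<String> corpus) {
            for (String s : corpus) sentences.add(s.toLowerCase());
        }

        /** Return up to maxResults sentences that start with the given prefix. */
        public List<String> suggest(String prefix, int maxResults) {
            String p = prefix.toLowerCase();
            List<String> results = new ArrayList<>();
            // tailSet returns every string >= p in sort order; the strings sharing
            // the prefix come first, so stop at the first non-match.
            for (String candidate : sentences.tailSet(p)) {
                if (!candidate.startsWith(p)) break;
                results.add(candidate);
                if (results.size() == maxResults) break;
            }
            return results;
        }
    }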
If you want keyword suggestions instead of "complete the sentence" suggestions, then the overhead depends on how long your sentences are and how many unique words they contain. If the set of unique words is small enough, you'll be able to get away with a HashMap<String, Set<String>> that maps each keyword to the sentences containing it. You can then handle multiword queries by intersecting the sets.
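And a sketch of that keyword map (again, the names are mine):

    import java.util.*;

    public class KeywordSuggester {
        private final Map<String, Set<String>> index = new HashMap<>();

        public KeywordSuggester(Collection<String> sentences) {
            for (String s : sentences) {
                for (String word : s.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        index.computeIfAbsent(word, k -> new HashSet<>()).add(s);
                    }
                }
            }
        }

        /** Sentences containing every keyword in the query. */
        public Set<String> suggest(String query) {
            Set<String> result = null;
            for (String word : query.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                Set<String> matches = index.getOrDefault(word, Collections.emptySet());
                if (result == null) result = new HashSet<>(matches);
                else result.retainAll(matches);  // intersect with the previous keywords' sets
            }
            return result == null ? Collections.emptySet() : result;
        }
    }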
In both cases, I'd probably convert all strings to lower case first (assuming that's appropriate in your application). I don't think either solution would scale to hundreds of thousands of suggestions, though. Do either of those do what you want? Happy to provide code if you'd like it.
I am guessing the key of a (not so simple) simple-substitution ciphertext. The rule by which I evaluate the correctness of a key is the number of English words in the putative decryption.
Are there any tools in Java that can count the number of English words in a string? For example,
"thefoitedstateswasat" -> 4 words
"thefortedxyzstateswasathat" -> 5 words.
I loaded a word list and am using a HashSet as a dictionary. As I don't know where the inter-word spaces belong in the text, I can't validate words using the dictionary alone.
Thanks.
I gave an answer to a similar question here:
If a word is made up of two valid words
It has some Java-esque pseudocode in it that might be adaptable into something that solves this problem.
Sorry, I'm new and don't have the rep to comment yet.
But wouldn't the code be very slow, since the number of checks and permutations is very large?
I guess you just have to brute-force your way through with nested for loops over the (n-1) possible word breaks, and then search the dictionary for each substring.
Surely there's a better way to test the accuracy of your key?
But that's not the point, here's what I'd do:
Using "quackdogsomethinggodknowswhat"
I'd have a recursive method where starting at the beginning of the string, I'd call a recursive method for all the words with which the subject string starts, in this case "qua", and "quack" with the string not containing the word ("dogsomethinggodknowswhat" for quack). Return whatever is greater: 1 + the greatest value returned out of all your method calls OR 0 + the method call for the string starting at index 1 ("uackdogsomethinggodknowswhat").
This would probably work best if you kept your wordlist in a tree of some sort.
If you need some pseudocode, ask!
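Since you offered: here's roughly that idea in Java (I've added memoization on the remaining string, which addresses the speed concern raised above; the names are made up):

    import java.util.*;

    public class WordCounter {
        private final Set<String> dictionary;
        private final Map<String, Integer> memo = new HashMap<>();

        public WordCounter(Set<String> dictionary) {
            this.dictionary = dictionary;
        }

        /** Maximum number of dictionary words that can be carved out of s, in order. */
        public int maxWords(String s) {
            if (s.isEmpty()) return 0;
            Integer cached = memo.get(s);
            if (cached != null) return cached;
            // Option 1: the first character belongs to no word; skip it.
            int best = maxWords(s.substring(1));
            // Option 2: consume any dictionary word that the string starts with.
            for (int end = 1; end <= s.length(); end++) {
                if (dictionary.contains(s.substring(0, end))) {
                    best = Math.max(best, 1 + maxWords(s.substring(end)));
                }
            }
            memo.put(s, best);
            return best;
        }
    }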
I accidentally answered a question where the original problem involved splitting a sentence into separate words.
The author suggested using BreakIterator to tokenize the input strings, and some people liked this idea.
I just don't get the madness: how can 25 lines of complicated code be better than a simple one-liner with a regexp?
Please explain the pros of using BreakIterator and the real cases where it should be used.
If it's really so cool and proper, then I wonder: do you really use the BreakIterator approach in your projects?
From looking at the code posted in that answer, it looks like BreakIterator takes into consideration the language and locale of the text. Getting that level of support via regex would surely be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?
BreakIterator gives you nice explicit control and iterates cleanly, in a nested way, over each sentence and word. I'm not familiar with exactly what specifying the locale does for you, but I'm sure it's quite helpful sometimes as well.
It didn't strike me as complicated at all: just set up one iterator for the sentence level, another for the word level, and nest the word one inside the sentence one, as in the example below.
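A minimal, self-contained example of that nesting with the standard java.text.BreakIterator API (the sample text is made up):

    import java.text.BreakIterator;
    import java.util.Locale;

    public class BreakIteratorDemo {
        public static void main(String[] args) {
            String text = "He said hello. She waved back!";
            BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.US);
            sentences.setText(text);
            int start = sentences.first();
            for (int end = sentences.next(); end != BreakIterator.DONE;
                     start = end, end = sentences.next()) {
                String sentence = text.substring(start, end);
                System.out.println("Sentence: " + sentence.trim());
                // Nest a word-level iterator inside each sentence.
                BreakIterator words = BreakIterator.getWordInstance(Locale.US);
                words.setText(sentence);
                int wStart = words.first();
                for (int wEnd = words.next(); wEnd != BreakIterator.DONE;
                         wStart = wEnd, wEnd = words.next()) {
                    String word = sentence.substring(wStart, wEnd);
                    if (Character.isLetterOrDigit(word.charAt(0))) {
                        System.out.println("  word: " + word);
                    }
                }
            }
        }
    }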
If the problem changed into something different, the solution you had on the other question might have just been out the window. However, that pattern of iterating through sentences and words can do a lot:
Find the sentence in which some single word occurs the most times, and output it along with that word.
Find the word used the most times throughout the whole string.
Find all words that occur in every sentence.
Find all words that occur a prime number of times in two or more sentences.
The list goes on...