find the longest word made of other words

I am working on a problem, which is to write a program to find the longest word made of other words in a list of words.
EXAMPLE
Input: test, tester, testertest, testing, testingtester
Output: testingtester
I searched and found the following solution. My question is about step 2: why should we break each word in all possible ways? Why not use each word directly as a whole? Any insights would be appreciated.
The solution below does the following:
1. Sort the array by size, putting the longest word at the front.
2. For each word, split it in all possible ways. That is, for “test”, split it into {“t”, “est”}, {“te”, “st”} and {“tes”, “t”}.
3. Then, for each pairing, check if the first half and the second both exist elsewhere in the array.
4. “Short circuit” by returning the first string we find that fits condition #3.
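The four steps above can be sketched in Java roughly as follows. This is a minimal sketch, not the original solution's code: the class name TwoWaySplit and the method longestTwoWordCompound are made up for illustration, and a HashSet stands in for "exists elsewhere in the array". Like the described solution, it only recognises words made of exactly two other words.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TwoWaySplit {
    // Returns the longest word that is the concatenation of exactly two
    // words from the list, or null if there is none.
    static String longestTwoWordCompound(List<String> words) {
        Set<String> dictionary = new HashSet<>(words);
        List<String> sorted = new ArrayList<>(words);
        // Step 1: longest words first.
        sorted.sort(Comparator.comparingInt(String::length).reversed());
        for (String word : sorted) {
            // Step 2: every split point strictly inside the word.
            for (int i = 1; i < word.length(); i++) {
                String left = word.substring(0, i);
                String right = word.substring(i);
                // Step 3: both halves must themselves be words. They are
                // shorter than 'word', so they cannot be 'word' itself.
                if (dictionary.contains(left) && dictionary.contains(right)) {
                    return word; // Step 4: short circuit.
                }
            }
        }
        return null;
    }
}
```

Each half is looked up as a whole word. The splitting only enumerates where the boundary between the two words might be.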

Answering your question indirectly, I believe the following is an efficient way to solve this problem using tries.
Build a trie from all of the words in your list.
Sort the words so that the longest word comes first.
Now, for each word W, start at the top of the trie and begin following the word down the tree one letter at a time using letters from the word you are testing.
Each time a word ends, recursively re-enter the trie from the top making a note that you have "branched". If you run out of letters at the end of the word and have branched, you've found a compound word and, because the words were sorted, this is the longest compound word.
If the letters stop matching at any point, or you run out and are not at the end of the word, just back track to wherever it was that you branched and keep plugging along.
I'm afraid I don't know Java that well, so I'm unable to provide you sample code in that language. I have, however, written out a solution in Python (using a trie implementation from this answer). Hopefully it is clear to you:
#!/usr/bin/env python3
# End of word symbol
_end = '_end_'

# Make a trie out of nested HashMap, UnorderedMap, dict structures
def MakeTrie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def LongestCompoundWord(original_trie, trie, word, level=0):
    # Ran out of letters part-way through a trie word: this path fails
    if not word:
        return False
    first_letter = word[0]
    if first_letter not in trie:
        return False
    if len(word) == 1 and _end in trie[first_letter]:
        return level > 0
    # A word ends here: try re-entering the trie from the top ("branching")
    if _end in trie[first_letter] and LongestCompoundWord(original_trie, original_trie, word[1:], level+1):
        return True
    return LongestCompoundWord(original_trie, trie[first_letter], word[1:], level)

# Words that were in your question
words = ['test', 'testing', 'tester', 'teste', 'testingtester', 'testingtestm', 'testtest', 'testingtest']
trie = MakeTrie(words)
# Sort words in order of decreasing length
words = sorted(words, key=len, reverse=True)
for word in words:
    if LongestCompoundWord(trie, trie, word):
        print("Longest compound word was '{0:}'".format(word))
        break
With the above in mind, the answer to your original question becomes clearer: we do not know ahead of time which combination of prefix words will take us successfully through the tree. Therefore, we need to be prepared to check all possible combinations of prefix words.
Since the algorithm you found does not have an efficient way of knowing what subsets of a word are prefixes, it splits the word at all possible points in word to ensure that all prefixes are generated.

Richard's answer will work well in many cases, but it can take exponential time: this will happen if there are many segments of the string W, each of which can be decomposed in multiple different ways. For example, suppose W is abcabcabcd, and the other words are ab, c, a and bc. Then the first 3 letters of W can be decomposed either as ab|c or as a|bc... and so can the next 3 letters, and the next 3, for 2^3 = 8 possible decompositions of the first 9 letters overall:
a|bc|a|bc|a|bc
a|bc|a|bc|ab|c
a|bc|ab|c|a|bc
a|bc|ab|c|ab|c
ab|c|a|bc|a|bc
ab|c|a|bc|ab|c
ab|c|ab|c|a|bc
ab|c|ab|c|ab|c
All of these partial decompositions necessarily fail in the end, since there is no word in the input that contains W's final letter d -- but his algorithm will explore them all before discovering this. In general, a word consisting of n copies of abc followed by a single d will take O(n*2^n) time.
We can improve this to O(n^2) worst-case time (at the cost of O(n) space) by recording extra information about the decomposability of suffixes of W as we go along -- that is, suffixes of W that we have already discovered we can or cannot match to word sequences. This type of algorithm is called dynamic programming.
The condition we need for some word W to be decomposable is exactly that W begins with some word X from the set of other words, and the suffix of W beginning at position |X|+1 is decomposable. (I'm using 1-based indices here, and I'll denote a substring of a string S beginning at position i and ending at position j by S[i..j].)
Whenever we discover that the suffix of the current word W beginning at some position i is or is not decomposable, we can record this fact and make use of it later to save time. For example, after testing the first 4 decompositions in the 8 listed earlier, we know that the suffix of W beginning at position 4 (i.e., abcabcd) is not decomposable. Then when we try the 5th decomposition, i.e., the first one starting with ab, we first ask the question: Is the rest of W, i.e. the suffix of W beginning at position 3, decomposable? We don't know yet, so we try adding c to get ab|c, and then we ask: Is the rest of W, i.e. the suffix of W beginning at position 4, decomposable? And we find that it has already been found not to be -- so we can immediately conclude that no decomposition of W beginning with ab|c is possible either, instead of having to grind through all 4 possibilities.
Assuming for the moment that the current word W is fixed, what we want to build is a function f(i) that determines whether the suffix of W beginning at position i is decomposable. Pseudo-code for this could look like:
- Build a trie the same way as Richard's solution does.
- Initialise the array KnownDecomposable[] to |W| DUNNO values.

f(i):
    - If i == |W|+1 then return TRUE. (The empty suffix means we're finished.)
    - If KnownDecomposable[i] is TRUE or FALSE, then immediately return it.
    - MAIN BODY BEGINS HERE
    - Walk through Richard's trie from the root, following characters in the
      suffix W[i..|W|]. Whenever we find a trie node at some depth j that
      marks the end of a word in the set:
        - Call f(i+j) to determine whether the rest of W can be decomposed.
        - If it can (i.e. if f(i+j) == TRUE):
            - Set KnownDecomposable[i] = TRUE.
            - Return TRUE.
    - If we make it to this point, then we have considered all other
      words that form a prefix of W[i..|W|], and found that none of
      them yield a suffix that can be decomposed.
    - Set KnownDecomposable[i] = FALSE.
    - Return FALSE.
Calling f(1) then tells us whether W is decomposable.
By the time a call to f(i) returns, KnownDecomposable[i] has been set to a non-DUNNO value (TRUE or FALSE). The main body of the function is only run if KnownDecomposable[i] is DUNNO. Together these facts imply that the main body of the function will only run as many times as there are distinct values i that the function can be called with. There are at most |W|+1 such values, which is O(n), and outside of recursive calls, a call to f(i) takes at most O(n) time to walk through Richard's trie, so overall the time complexity is bounded by O(n^2).
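Since the question is about Java, here is one way the memoised f(i) could look there. Treat it as a sketch under assumptions: the names CompoundChecker, isCompound and longestCompound are invented for this example, indices are 0-based rather than 1-based, a null entry in a Boolean[] plays the role of DUNNO, and the word itself is explicitly excluded as a prefix so that a word does not count as "made of" itself.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompoundChecker {
    // Trie node: children by character, plus an end-of-word flag.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    CompoundChecker(List<String> words) {
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.isWord = true;
        }
    }

    // True if w can be written as two or more words from the trie.
    boolean isCompound(String w) {
        // memo[i]: null = DUNNO, otherwise whether w.substring(i) decomposes.
        Boolean[] memo = new Boolean[w.length() + 1];
        return f(w, 0, memo);
    }

    private boolean f(String w, int i, Boolean[] memo) {
        if (i == w.length()) return true;      // empty suffix: finished
        if (memo[i] != null) return memo[i];   // already known
        Node n = root;
        for (int j = i; j < w.length(); j++) {
            n = n.children.get(w.charAt(j));
            if (n == null) break;              // no word continues this way
            boolean wholeWord = (i == 0 && j == w.length() - 1);
            if (n.isWord && !wholeWord && f(w, j + 1, memo)) {
                memo[i] = true;
                return true;
            }
        }
        memo[i] = false;
        return false;
    }

    // Convenience wrapper: the longest compound word in the list, or null.
    static String longestCompound(List<String> words) {
        CompoundChecker checker = new CompoundChecker(words);
        String best = null;
        for (String w : words) {
            if ((best == null || w.length() > best.length()) && checker.isCompound(w)) {
                best = w;
            }
        }
        return best;
    }
}
```

For the abcabcabcd example, each suffix position is resolved once and then answered from memo, so the eight partial decompositions are not all re-explored.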

I think you are simply confused about which words get split.
After sorting, you consider the words one after the other, by decreasing length. Let us call a "candidate" a word you are trying to decompose.
If the candidate is made of other words, it certainly starts with a word, so you will compare all prefixes of the candidate to all possible words.
During the comparison step, you compare a candidate prefix to the whole words, not to split words.
By the way, the given solution will not work for compounds of three or more words. The fix is as follows:
try every prefix of the candidate and compare it to all words
in case of a match, repeat the search with the suffix.
Example:
testingtester gives the prefixes
t, te, tes, test, testi, testin, testing, testingt, testingte, testingtes and testingteste
Among these, test and testing are words. Then you need to try the corresponding suffixes ingtester and tester.
ingtester gives
i, in, ing, ingt, ingte, ingtes, ingtest and ingteste, none of which are words.
tester is a word and you are done.
IsComposite(InitialCandidate, Candidate):
    For all Prefixes of Candidate:
        if Prefix is in Words:
            Suffix = Candidate - Prefix
            if Suffix == "":
                return Candidate != InitialCandidate
            else if IsComposite(InitialCandidate, Suffix):
                return True
    return False

For all Candidate words by decreasing size:
    if IsComposite(Candidate, Candidate):
        print Candidate
        break

I would probably use recursion here. Start with the longest word and find words it starts with. For any such word remove it from the original word and continue with the remaining part in the same manner.
Pseudo code:
function iscomposed(originalword, wordpart)
    for word in allwords
        if word <> originalword
            if wordpart = word
                return yes
            elseif wordpart starts with word
                if iscomposed(originalword, wordpart - word)
                    return yes
                endif
            endif
        endif
    next
    return no
end

main
    sort allwords by length descending
    for word in allwords
        if iscomposed(word, word) return word
    next
end
Example:
words:
abcdef
abcde
abc
cde
ab
Passes:
1. abcdef starts with abcde. rest = f. 2. no word found that f starts with.
1. abcdef starts with abc. rest = def. 2. no word found that def starts with.
1. abcdef starts with ab. rest = cdef. 2. cdef starts with cde. rest = f. 3. no word found that f starts with.
1. abcde starts with abc. rest = cde. 2. cde itself found. abcde is a composed word.

To find the longest word using recursion:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class FindLongestWord {
    public static void main(String[] args) {
        List<String> input = new ArrayList<>(
                Arrays.asList("cat", "banana", "rat", "dog", "nana", "walk", "walker", "dogcatwalker"));
        List<String> sortedList = input.stream().sorted(Comparator.comparing(String::length).reversed())
                .collect(Collectors.toList());
        boolean isWordFound = false;
        for (String word : sortedList) {
            // Remove the candidate so that it cannot match itself
            input.remove(word);
            if (findPrefix(input, word)) {
                System.out.println("Longest word is : " + word);
                isWordFound = true;
                break;
            }
        }
        if (!isWordFound)
            System.out.println("Longest word not found");
    }

    public static boolean findPrefix(List<String> input, String word) {
        if (word.isEmpty())
            return true;
        for (String candidate : input) {
            // Strip only the matched prefix; String.replace would delete every occurrence
            if (word.startsWith(candidate) && findPrefix(input, word.substring(candidate.length())))
                return true;
        }
        return false;
    }
}

Related

Finding longest concatenated word

I have a dictionary with many words, and I want to find the longest concatenated word (that is, the longest word that is composed entirely of shorter words in the file). I pass the words to the method in descending order of length. How can I check that every character of the word is covered by words from the dictionary?
public boolean tryMatch(String s, List dictionary) {
    String nextWord = new String();
    int contaned = 0;
    // Loop over every word in the dictionary
    for(int i = 1; i < dictionary.size();i++) {
        nextWord = (String) dictionary.get(i);
        if (nextWord == s) {
            nextWord = (String) dictionary.get(i + 1);
        }
        if (s.contains(nextWord)) {
            contaned++;
        }
    }
    if(contaned >1) {
        return true;
    }
    return false;
}

If you have a sorted list of words, finding compound words is easy, but it will only perform well if the words are in a Set.
Let's look at the compound word football, and of course assume that both ball and foot are in the word list.
By definition, any compound word using foot as the first sub-word must start with foot.
So, when iterating the list, remember the current active "stem" words, e.g. when seeing foot, remember it.
Now, when seeing football, you check if the word starts with the stem word. If not, clear the stem word, and make new word the stem word.
If it does, the new word (football) is a candidate for being a compound word. The part after the stem is ball, so we need to check if that is a word, and if so, we found a compound word.
Checking is easy for the simple case, i.e. wordSet.contains(remain).
However, compound words can be made up of more than 2 words, e.g. whatsoever. So after finding that it is a candidate from the stem word what, the remainder is soever.
You can simply try all lengths of that (soever, soeve, soev, soe, so, s), and if one of the shorter ones is a word, you repeat the process on whatever follows it.
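A possible Java rendering of that remainder check is shown below. It is only a sketch: RemainderCheck, isMadeOfWords and isCompound are hypothetical names, and the stem bookkeeping while iterating the sorted list is left out.

```java
import java.util.Set;

class RemainderCheck {
    // True if 'remain' is a word, or a sequence of words, from wordSet.
    static boolean isMadeOfWords(String remain, Set<String> wordSet) {
        if (wordSet.contains(remain)) return true; // the simple case
        // Try shorter and shorter leading pieces: soeve, soev, soe, so, s
        for (int len = remain.length() - 1; len >= 1; len--) {
            if (wordSet.contains(remain.substring(0, len))
                    && isMadeOfWords(remain.substring(len), wordSet)) {
                return true;
            }
        }
        return false;
    }

    // A candidate is compound if it starts with the stem and the part
    // after the stem is made of words.
    static boolean isCompound(String stem, String candidate, Set<String> wordSet) {
        return candidate.length() > stem.length()
                && candidate.startsWith(stem)
                && isMadeOfWords(candidate.substring(stem.length()), wordSet);
    }
}
```

With what, so and ever in the set, the remainder soever fails as a whole, succeeds at so, and then ever is found directly.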

Find words sequence in a document

Using Java (on Android), I am trying to find a fast way to solve this problem:
I have a list of words (around 10 to 30) and a document. The length of the document can vary too, maybe around 2500 to 10000 words. This document is part of a book.
What I want is to find in this document the string (sentence) that contains the highest number of the words in my list. The words in the document have to be in the same order as in my word list. Normally the words should not be far apart in the document: at most 2 or 3 words between consecutive words of my list.
To be clearer, let's take an example with small data.
My word list is :
harm piece work day
my document :
just so, with the greatest care. You must see to it that you pull up
regularly all the baobabs, at the very first moment when they can be
distinguished from the rosebushes which they resemble so closely in
their earliest youth. It is very tedious work," the little prince
added, "but very easy." And one day he said to me: "You ought to
make a beautiful drawing, so that the children where you live can see
exactly how all this is. That would be very useful to them if they
were to travel some day. Sometimes," he added, "there is no harm
in putting off a piece of work until another day. But
when it is a matter of baobabs, that always means a catastrophe. I
knew a planet that was inhabited by a lazy man. He neglected three
little bushes..." So, as the little prince described it to me, I
have made a drawing of that planet. I do not much like to take the
tone of a moralist. But the danger of the baobabs is so little
understood, and such considerable risks would be run by anyone who
might get lost on an asteroid, that for once I am breaking through my
reserve. "Children," I say plainly, "watch out for the baobabs!"
The goal is to find the string "there is no harm in putting off a piece of work until another day" in the document.
For now, the only way I can think of is:
1 - Find the first occurrence of the first word of my list in the document.
2 - Multiply the number of words in my list by 2 or 3 to get the length of the string I have to check in the document (to allow for the maximum number of words between the words of my list).
3 - Search for the other words of my list in this substring (of the length computed in step 2) by splitting and looping.
If not enough of my words occur in this substring (say, fewer than around 50%), then continue searching the document from the next occurrence of the first word in my list.
But I'm afraid this could be very slow, especially because I'm working on a mobile device. So I'm here to gather ideas I may not have thought of, or libraries that could help with this task. I thought about regular expressions too, but I'm not sure they would be a better approach.
gukoff's suggestion
Since my word list ultimately cannot be in a different order than my text, the algorithm becomes simpler. The beginning of gukoff's answer is enough. No need to implement the LIS algorithm or reverse the list.
//Section = input text
//wordsToFind = words to find in text separated by space
private ArrayList<ArrayList<Integer>> test1(String wordsToFind, Section section) {
    //1. Create the index of your words array.
    String[] wordsArray = wordsToFind.split(" ");
    ArrayList<Integer> indexesSentences = new ArrayList<>();
    ArrayList<ArrayList<Integer>> sentenceArrayIndexes = new ArrayList<>();
    ArrayList<Integer> wordsToFindIndexes = new ArrayList<>();
    for(Sentence sentence:section.getSentences()) {
        indexesSentences.clear();
        for(String sentenceWord:sentence.getWords()) {
            wordsToFindIndexes.clear();
            int j = 0;
            for(String word:wordsArray) {
                if(word.equals(sentenceWord)) {
                    wordsToFindIndexes.add(j+1);
                }
                j++;
            }
            //Collections.reverse(wordsToFindIndexes);
            for(int idx:wordsToFindIndexes) {
                indexesSentences.add(idx);
            }
        }
        // Store a copy, because indexesSentences is cleared for the next sentence
        sentenceArrayIndexes.add(new ArrayList<>(indexesSentences));
    }
    return sentenceArrayIndexes;
}

public class Section {
    private ArrayList<Sentence> sentences;

    public Section (String text) {
        sentences = new ArrayList<>();
        // Compare string contents with isEmpty(), not == ""
        if(text == null || text.trim().isEmpty()) {
            throw new IllegalArgumentException("Text not valid");
        }
        String formattedText = text.trim().replaceAll("[^a-zA-Z. ]", "").toLowerCase();
        String[] sentencesArray = formattedText.split("\\.");
        for(String sentenceStr:sentencesArray) {
            if(!sentenceStr.trim().isEmpty()) {
                sentences.add(new Sentence(sentenceStr));
            }
        }
    }

    public ArrayList<Sentence> getSentences() {
        return sentences;
    }

    public void addSentence(Sentence sentence) {
        sentences.add(sentence);
    }
}

So, you have the words to be found and a text, which consists of sentences to be examined.
1. Create the index of your words array.
For example, if words = a dog is not a human:
{
    "a": [1, 5],
    "dog": [2],
    "is": [3],
    "not": [4],
    "human": [6]
}
2. In every sentence replace every word by its index value in descending order. That said, "a" gets replaced by [5, 1], "human" gets replaced by [6] and "tree" gets replaced by [].
For example, the sentence "not a cat is a human" should turn into [4, 5,1, 3, 5,1, 6]
3. Find the Longest increasing subsequence (LIS) in every array. Essentially, LIS would be the longest sub-match of your words array in the sentence.
For example, LIS of [4, 5,1, 3, 5,1, 6] is [1, 3, 5, 6], which maps to the sub-match "a is a human".
But generally, in case the words shouldn't be very far from each other, I suggest finding the LIS using dynamic programming with corresponding modifications.
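The three steps (index, encode, LIS) might be sketched in Java like this. The class and method names (SubMatch, buildIndex, encode, lis) are assumptions made for the example, and the LIS is the plain O(n^2) dynamic programme without the distance restriction mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubMatch {
    // Map each word to its 1-based positions in the word list.
    static Map<String, List<Integer>> buildIndex(String[] words) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            index.computeIfAbsent(words[i], k -> new ArrayList<>()).add(i + 1);
        }
        return index;
    }

    // Replace each sentence word by its positions in descending order;
    // words that are not in the index contribute nothing.
    static List<Integer> encode(String[] sentence, Map<String, List<Integer>> index) {
        List<Integer> out = new ArrayList<>();
        for (String w : sentence) {
            List<Integer> positions =
                    new ArrayList<>(index.getOrDefault(w, Collections.emptyList()));
            Collections.reverse(positions);
            out.addAll(positions);
        }
        return out;
    }

    // O(n^2) longest strictly increasing subsequence, with reconstruction.
    static List<Integer> lis(List<Integer> a) {
        int n = a.size();
        int[] len = new int[n];   // len[i]: best length ending at i
        int[] prev = new int[n];  // prev[i]: previous index in that subsequence
        int best = -1;
        for (int i = 0; i < n; i++) {
            len[i] = 1;
            prev[i] = -1;
            for (int j = 0; j < i; j++) {
                if (a.get(j) < a.get(i) && len[j] + 1 > len[i]) {
                    len[i] = len[j] + 1;
                    prev[i] = j;
                }
            }
            if (best == -1 || len[i] > len[best]) best = i;
        }
        List<Integer> result = new ArrayList<>();
        for (int i = best; i != -1; i = prev[i]) result.add(a.get(i));
        Collections.reverse(result);
        return result;
    }
}
```

Emitting the positions of a repeated word in descending order is what keeps a strictly increasing subsequence from using the same sentence word twice.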

Here is a simple approach which should be good enough given your document size:
1. Make an array (call it words) of size n, where n is the number of words in your document.
Now populate this array such that
words[i] = 0 if no words in your list match this word
words[i] = k if the kth word in your list matches this word (1-based indexing)
Example: If your document is there is no harm in putting off a piece of work until another day. and the word list is work day harm piece (in that order), then your words array will look like this: [0,0,0,3,0,0,0,0,4,0,1,0,0,2]
2. Now you will have an array of integers of size 2000~3000. You can use a variant of the Longest common subsequence problem, or modify your algorithm a little, to find the best match.
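Step 1 could be sketched in Java as below. WordPositions and encode are invented names, and the normalisation (lower-casing and treating every non-letter as a separator) is an assumption added so that "day." matches "day".

```java
import java.util.HashMap;
import java.util.Map;

class WordPositions {
    // Returns words[i] = k if the i-th document word is the k-th list word
    // (1-based), or 0 if it is not in the list.
    static int[] encode(String document, String wordList) {
        Map<String, Integer> rank = new HashMap<>();
        String[] list = wordList.trim().split("\\s+");
        for (int k = 0; k < list.length; k++) {
            // If a list word repeats, its first position wins (an assumption).
            rank.putIfAbsent(list[k], k + 1);
        }
        // Assumed normalisation: lower-case, non-letters become separators.
        String[] docWords = document.toLowerCase()
                .replaceAll("[^a-z ]", " ").trim().split("\\s+");
        int[] words = new int[docWords.length];
        for (int i = 0; i < docWords.length; i++) {
            words[i] = rank.getOrDefault(docWords[i], 0);
        }
        return words;
    }
}
```

The resulting int array is what step 2 would then scan for the best in-order run.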

How to get around recursive stack overflow?

EDIT: Just to clarify, the recursion is required as part of an assignment, so it must be recursive even though I know that's not the best way to do this problem
I made a program that, in part, will search through an extremely large dictionary and compare a given list of words with each word in the dictionary and return a list of words that begin with the same two letters of the user-given word.
This works for small dictionaries but I just discovered that for dictionaries over a certain amount there is a stack limit for the recursions, so I get a stack overflow error.
My idea is to limit each recursion to 1000 recursions, then increment a counter for another 1000 and start again where the recursive method last left off and then end again at 2000, then so on until the end of the dictionary.
Is this the best way to do it? And if so, does anyone have any ideas how? I'm having a really hard time implementing this idea!
(edit: If it's not the best way, does anyone have any ideas of how to do it more effectively?)
Here is the code I have so far, the 1000 recursions idea is barely implemented here because I've deleted some of the code I tried in the past already but honestly it was about as helpful as what I have here.
the call:
for(int i = 0; i < givenWords.size(); i++){
    int thousand = 1000;
    Dictionary.prefix(givenWords.get(i), theDictionary, 0, thousand);
    thousand = thousand + 1000;
}
and the prefix method:
public static void prefix (String origWord, List<String> theDictionary, int wordCounter, int thousand){
    if(wordCounter < thousand){
        // if the words don't match recurse through this same method in order to move on to the next word
        if (wordCounter < theDictionary.size()){
            if ( origWord.charAt(0) != theDictionary.get(wordCounter).charAt(0) || origWord.length() != theDictionary.get(wordCounter).length()){
                prefix(origWord, theDictionary, wordCounter+1, thousand+1);
            }
            // if the words first letter and size match, send the word to prefixLetterChecker to check for the rest of the prefix.
            else{
                prefixLetterChecker(origWord, theDictionary.get(wordCounter), 1);
                prefix(origWord, theDictionary, wordCounter+1, thousand+1);
            }
        }
    }
    else return;
}
edit for clarification:
The dictionary is a sorted large dictionary with only one word per line, lowercase
the "given word" is actually one out of a list, in the program, the user inputs a string between 2-10 characters, letters only no spaces etc. The program creates a list of all possible permutations of this string, then goes through an array of those permutations and for each permutation returns another list of words beginning with the first two letters of the given word.
If as the program is going through it, any letter up to the first two letters doesn't match, the program moves on to the next given word.

This is actually a nice assignment. Let's make some assumptions....
26 letters in the alphabet, all words are in those letters.
no single word is more than.... 1000 or so characters long.
Create a class, call it 'Node', looks something like:
private static class Node {
    Node[] children = new Node[26];
    boolean isWord = false;
}
Now, create a tree using this node class. The root of this tree is:
private final Node root = new Node ();
Then, first word in the dictionary is the word 'a'. We add it to the tree. Note that 'a' is letter 0.
So, we 'recurse' in to the tree:
private static final int indexOf(char c) {
    return c - 'a';
}

// Returns the node reached by following chars[pos..] from 'node',
// creating any missing nodes along the way.
private final Node getNodeForChars(Node node, char[] chars, int pos) {
    if (pos == chars.length) {
        return node;
    }
    Node n = node.children[indexOf(chars[pos])];
    if (n == null) {
        n = new Node();
        node.children[indexOf(chars[pos])] = n;
    }
    return getNodeForChars(n, chars, pos + 1);
}
So, with that, you can simply do:
Node wordNode = getNodeForChars(root, word.toCharArray(), 0);
wordNode.isWord = true;
So, you can create a tree of words..... Now, if you need to find all words starting with a given sequence of letters (the prefix), you can do:
Node wordNode = getNodeForChars(root, prefix.toCharArray(), 0);
Now this node (if its isWord is true) and every descendant of it whose isWord is true are the words with that prefix. You just have to rebuild the letter sequence. You may find it advantageous to store the actual word as part of the Node, instead of the boolean isWord flag. Your call.
The recursion depth will never be more than the longest word. The density of the data is 'fanned out' a lot. There are other ways to set up the Node that may be more (or less) efficient in terms of performance, or space. The idea though, is that you set up your data in a wide tree, and your search is thus very fast, and all the child nodes at any point have the same prefix as the parent (or, rather, the parent is the prefix).
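To make the "rebuild the sequence" step concrete, here is a small self-contained sketch. PrefixTrie, add and wordsWithPrefix are illustrative names rather than part of the answer's code, and it assumes lowercase a-z input as stated in the assumptions above. Only the collection step is recursive here; add uses a loop for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixTrie {
    private static final class Node {
        final Node[] children = new Node[26];
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (n.children[i] == null) n.children[i] = new Node();
            n = n.children[i];
        }
        n.isWord = true;
    }

    // All stored words that start with 'prefix', in alphabetical order.
    List<String> wordsWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.children[c - 'a'];
            if (n == null) return out; // nothing starts with this prefix
        }
        collect(n, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk; the recursion depth is bounded by the longest word,
    // not by the size of the dictionary.
    private void collect(Node n, StringBuilder path, List<String> out) {
        if (n.isWord) out.add(path.toString());
        for (int i = 0; i < 26; i++) {
            if (n.children[i] != null) {
                path.append((char) ('a' + i));
                collect(n.children[i], path, out);
                path.setLength(path.length() - 1); // undo before the next branch
            }
        }
    }
}
```

Because the walk recurses per letter rather than per dictionary entry, the stack overflow from the original approach should not occur for ordinary word lengths.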

Finding the index of a permutation within a string

I just attempted a programming challenge, which I was not able to successfully complete. The specification is to read 2 lines of input from System.in.
1. A list of 1-100 space-separated words, all of the same length and between 1-10 characters.
2. A string up to a million characters in length, which contains a permutation of the above list just once. Return the index of where this permutation begins in the string.
For example, we may have:
dog cat rat
abcratdogcattgh
3
Where 3 is the result (as printed by System.out).
It's legal to have a duplicated word in the list:
dog cat rat cat
abccatratdogzzzzdogcatratcat
16
The code that I produced worked providing that the word that the answer begins with has not occurred previously. In the 2nd example here, my code will fail because dog has already appeared before where the answer begins at index 16.
My theory was to:
Find the index where each word occurs in the string
Extract this substring (as we have a number of known words with a known length, this is possible)
Check that all of the words occur in the substring
If they do, return the index that this substring occurs in the original string
Here is my code (it should be compilable):
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Solution {
    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String line = br.readLine();
        String[] l = line.split(" ");
        String s = br.readLine();
        int wl = l[0].length();
        int len = wl * l.length;
        int sl = s.length();
        for (String word : l) {
            int i = s.indexOf(word);
            int z = i;
            //while (i != -1) {
                int y = i + len;
                if (y <= sl) {
                    String sub = s.substring(i, y);
                    if (containsAllWords(l, sub)) {
                        System.out.println(s.indexOf(sub));
                        System.exit(0);
                    }
                }
                //z+= wl;
                //i = s.indexOf(word, z);
            //}
        }
        System.out.println("-1");
    }

    private static boolean containsAllWords(String[] l, String s) {
        String s2 = s;
        for (String word : l) {
            s2 = s2.replaceFirst(word, "");
        }
        if (s2.equals(""))
            return true;
        return false;
    }
}
I am able to solve my issue and make it pass the 2nd example by un-commenting the while loop. However this has serious performance implications. When we have an input of 100 words at 10 characters each and a string of 1000000 characters, the time taken to complete is just awful.
Given that each case in the test bench has a maximum execution time, the addition of the while loop would cause the test to fail on the basis of not completing the execution in time.
What would be a better way to approach and solve this problem? I feel defeated.

You could concatenate the words and search for the resulting string:
String a = "dog";
String b = "cat";
String c = a + b; // c is "dogcat"
This avoids the problem of dog appearing on its own somewhere earlier.
But it won't work if catdog is a valid value too.

Here is an approach (pseudo code)
stringArray keys(n) = {"cat", "dog", "rat", "roo", ...};
string bigString(1000000);
L = strlen(keys[0]); // since all are same length
int indices(n, 1000000/L); // much too big - but safe if only one word repeated over and over
for each s in keys
    f = -1
    do:
        f = find s in bigString starting at f+1 // use bigString.indexOf(s, f+1)
        write index of f to indices
    until no more found
When you are all done, you will have a series of indices (location of first letter of match). Now comes the tricky part. Since the words are all the same length, we're looking for a sequence of indices that are all spaced the same way, in the 10 different "collections". This is a little bit tedious but it should complete in a finite time. Note that it's faster to do it this way than to keep comparing strings (comparing numbers is faster than making sure a complete string is matched, obviously). I would again break it into two parts - first find "any sequence of 10 matches", then "see if this is a unique permutation".
sIndx = sort(indices(:))
dsIndx = diff(sIndx);
sequence = find {n} * 10 in dsIndx
for each s in sequence
    check if unique permutation
I hope this gets you going.

Perhaps not the best optimized version, but how about following theory to give you some ideas:
1. Count the total length of all the words in the list.
2. Take a random word from the list and find the starting index of its first occurrence.
3. Take a substring extending by the length counted above before and after that index (e.g. if the index is 15 and there are 3 words of 4 letters each, take the substring from 15-8 to 15+11).
4. Make a copy of the word list with the earlier random word removed.
5. Check the [word_length] letters appended/prepended to the match to see if they match a new word on the list.
6. If a word matches the copy of the list, remove it from the copy and move further.
7. If all words are found, break the loop.
8. If not all words are found, find the starting index of the next occurrence of the earlier random word and go back to 3.
Why it would help:
Which word you pick to begin with wouldn't matter, since every word needs to be in the successful match anyway.
You don't have to manually loop through a lot of the characters, unless there are lots of near-complete false matches.
As a supposed match keeps growing, you have fewer words left on the list copy to compare against.
You can also keep track of the furthest index you've gone to, so you can sometimes limit the backwards length of the picked substring (as it cannot overlap with where you've already been, if the occurrences are close to each other).
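A simplified variant of the "match fixed-size chunks against a copy of the list" idea is sketched below in Java. Note that this is not any of the answers' exact methods: instead of growing outward from one random word, it tests every start index and counts chunks with a map. The names PermutationIndex, find and matches are made up, and the code has not been tuned for the million-character case.

```java
import java.util.HashMap;
import java.util.Map;

class PermutationIndex {
    // Index in s where some permutation of 'words' begins, or -1.
    // Assumes at least one word and that all words have the same length.
    static int find(String[] words, String s) {
        int wl = words[0].length();
        int total = wl * words.length;
        // How many copies of each word the window must contain.
        Map<String, Integer> need = new HashMap<>();
        for (String w : words) need.merge(w, 1, Integer::sum);

        for (int start = 0; start + total <= s.length(); start++) {
            if (matches(s, start, wl, words.length, need)) return start;
        }
        return -1;
    }

    // Cut the window into word-sized chunks and count them against 'need'.
    private static boolean matches(String s, int start, int wl, int count,
                                   Map<String, Integer> need) {
        Map<String, Integer> seen = new HashMap<>();
        for (int k = 0; k < count; k++) {
            String chunk = s.substring(start + k * wl, start + (k + 1) * wl);
            int allowed = need.getOrDefault(chunk, 0);
            int used = seen.merge(chunk, 1, Integer::sum);
            if (used > allowed) return false; // unknown word, or one copy too many
        }
        return true; // exactly 'count' chunks, none over its quota
    }
}
```

Most windows fail on their first chunk, so the inner loop usually exits immediately. Duplicated words such as the second cat are handled by the per-word counts.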

How to know whether a string can be segmented into two strings

I was asked in interview following question. I could not figure out how to approach this question. Please guide me.
Question: How to know whether a string can be segmented into two strings - like breadbanana is segmentable into bread and banana, while breadbanan is not. You will be given a dictionary which contains all the valid words.

Build a trie of the words you have in the dictionary, which will make searching faster.
Search the trie by following the letters of your input string. When you've found a word that is in the trie, recursively start again from the position after that word in the input string. If you get to the end of the input string, you've found one possible segmentation. If you get stuck, backtrack and try another word.
EDIT: sorry, I missed the fact that there must be just two words.
In this case, limit the recursion depth to 2.
The pseudocode for 2 words would be:
T = trie of words in the dictionary
for every word in T that can be found going down the trie by choosing the next letter of the input string each time we move to a child:
    p <- length(word)
    if T contains input_string[p:length(input_string)]:
        return true
return false
Assuming you can go down to a child node in the trie in O(1) (ascii indexes of children), you can find all prefixes of the input string in O(n+p), where p is the number of prefixes, and n the length of the input. Upper bound on this is O(n+m), where m is the number of words in dictionary. Checking for containing will take O(w) where w is the length of word, for which the upper bound would be m, so the time complexity of the algorithm is O(nm), since O(n) is distributed in the first phase between all found words.
But because we can't find more than n words in the first phase, the complexity is also limited to O(n^2).
So the search complexity would be O(n*min(n, m))
Before that you need to build the trie which will take O(s), where s is the sum of lengths of words in the dictionary. The upper bound on this is O(n*m), since the maximum length of every word is n.
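The two-word pseudocode above might translate to Java along these lines. This is a sketch with invented names (TwoWordSegmenter, canSplitInTwo), using a HashMap per node rather than the array-indexed children assumed in the complexity analysis.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoWordSegmenter {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    TwoWordSegmenter(List<String> dictionary) {
        for (String w : dictionary) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.isWord = true;
        }
    }

    // True if s.substring(from) is a dictionary word.
    private boolean contains(String s, int from) {
        Node n = root;
        for (int i = from; i < s.length() && n != null; i++) {
            n = n.children.get(s.charAt(i));
        }
        return n != null && n.isWord;
    }

    // Walk down the trie along the input; at every prefix that is a word,
    // check whether the remaining suffix is a word too.
    boolean canSplitInTwo(String input) {
        Node n = root;
        for (int p = 1; p < input.length(); p++) {
            n = n.children.get(input.charAt(p - 1));
            if (n == null) return false; // no dictionary word continues this prefix
            if (n.isWord && contains(input, p)) return true;
        }
        return false;
    }
}
```

The loop stops at p < input.length(), so the second word is never empty and a single dictionary word on its own is not reported as segmentable.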

You go through your dictionary and compare every term, as a substring, with the original term, e.g. "breadbanana". If a dictionary term matches the beginning of the original term, cut it off and compare the dictionary entries with the rest of the original term.
Let me try to explain that in Java:
e.g.
String dictTerm = "bread";
String original = "breadbanana";
// does the first part match?
if (dictTerm.equals(original.substring(0, dictTerm.length()))) {
    // first part matches, get the rest
    String lastPart = original.substring(dictTerm.length());
    String nextDictTerm = "banana";
    if (nextDictTerm.equals(lastPart)) {
        System.out.println("String " + original +
                " contains the dictionary terms " +
                dictTerm + " and " + lastPart);
    }
}

The simplest solution:
Split the string between every pair of consecutive characters and see whether or not both substrings (to the left of the split point and to the right of it) are in the dictionary.

One approach could be:
Put all elements of the dictionary in a set or list.
Now you can use the contains and substring functions to remove the words that match the dictionary. If the string is empty at the end, it can be segmented; otherwise it cannot. You can also keep a count of the removed words.

public boolean canBeSegmented(String s) {
    for (String word : dictionary.getWords()) {
        if (s.contains(word)) {
            int idx = s.indexOf(word);
            // Cut the first occurrence of the word out of the string
            s = s.substring(0, idx) + s.substring(idx + word.length());
        }
    }
    return s.isEmpty();
}
This code checks whether your given String can be fully segmented. It checks whether a word from the dictionary is inside your string and then subtracts it. If you want to produce the segments in the process, you have to order the subtracted segments by their position inside the word.
Just two words makes it easier:
public boolean canBeSegmented(String s) {
    boolean wordDetected = false;
    for (String word : dictionary.getWords()) {
        if (s.contains(word)) {
            int idx = s.indexOf(word);
            s = s.substring(0, idx) + s.substring(idx + word.length());
            if (!wordDetected)
                wordDetected = true;
            else
                return s.isEmpty();
        }
    }
    return false;
}
This code checks for one word, and if there is another word in the String and the String consists of just these two words, it returns true; otherwise false.

This is a mere idea; you can implement it better if you want.
package farzi;

import java.util.ArrayList;

public class StringPossibility {
    public static void main(String[] args) {
        String str = "breadbanana";
        ArrayList<String> dict = new ArrayList<String>();
        dict.add("bread");
        dict.add("banana");
        for(int i=0;i<str.length();i++)
        {
            String word1 = str.substring(0,i);
            String word2 = str.substring(i,str.length());
            System.out.println(word1+"===>>>"+word2);
            if(dict.contains(word1))
            {
                System.out.println("word 1 found : "+word1+" at index "+i);
            }
            if(dict.contains(word2))
            {
                System.out.println("word 2 found : "+ word2+" at index "+i);
            }
        }
    }
}
