I am working on a problem, which is to write a program to find the longest word made of other words in a list of words.
EXAMPLE
Input: test, tester, testertest, testing, testingtester
Output: testingtester
I searched and find the following solution, my question is I am confused in step 2, why we should break each word in all possible ways? Why not use each word directly as a whole? If anyone could give some insights, it will be great.
The solution below does the following:
Sort the array by size, putting the longest word at the front
For each word, split it in all possible ways. That is, for “test”, split it into {“t”, “est”}, {“te”, “st”} and {“tes”, “t”}.
Then, for each pairing, check if the first half and the second both exist elsewhere in the array.
“Short circuit” by returning the first string we find that fits condition #3.
Answering your question indirectly, I believe the following is an efficient way to solve this problem using tries.
Build a trie from all of the words in your string.
Sort the words so that the longest word comes first.
Now, for each word W, start at the top of the trie and begin following the word down the tree one letter at a time using letters from the word you are testing.
Each time a word ends, recursively re-enter the trie from the top making a note that you have "branched". If you run out of letters at the end of the word and have branched, you've found a compound word and, because the words were sorted, this is the longest compound word.
If the letters stop matching at any point, or you run out and are not at the end of the word, just back track to wherever it was that you branched and keep plugging along.
I'm afraid I don't know Java that well, so I'm unable to provide you sample code in that language. I have, however, written out a solution in Python (using a trie implementation from this answer). Hopefully it is clear to you:
#!/usr/bin/env python3
#End of word symbol
_end = '_end_'
#Make a trie out of nested HashMap, UnorderedMap, dict structures
def MakeTrie(words):
root = dict()
for word in words:
current_dict = root
for letter in word:
current_dict = current_dict.setdefault(letter, {})
current_dict[_end] = _end
return root
def LongestCompoundWord(original_trie, trie, word, level=0):
first_letter = word[0]
if not first_letter in trie:
return False
if len(word)==1 and _end in trie[first_letter]:
return level>0
if _end in trie[first_letter] and LongestCompoundWord(original_trie, original_trie, word[1:], level+1):
return True
return LongestCompoundWord(original_trie, trie[first_letter], word[1:], level)
#Words that were in your question
words = ['test','testing','tester','teste', 'testingtester', 'testingtestm', 'testtest','testingtest']
trie = MakeTrie(words)
#Sort words in order of decreasing length
words = sorted(words, key=lambda x: len(x), reverse=True)
for word in words:
if LongestCompoundWord(trie,trie,word):
print("Longest compound word was '{0:}'".format(word))
break
With the above in mind, the answer to your original question becomes clearer: we do not know ahead of time which combination of prefix words will take us successfully through the tree. Therefore, we need to be prepared to check all possible combinations of prefix words.
Since the algorithm you found does not have an efficient way of knowing what subsets of a word are prefixes, it splits the word at all possible points in word to ensure that all prefixes are generated.
Richard's answer will work well in many cases, but it can take exponential time: this will happen if there are many segments of the string W, each of which can be decomposed in multiple different ways. For example, suppose W is abcabcabcd, and the other words are ab, c, a and bc. Then the first 3 letters of W can be decomposed either as ab|c or as a|bc... and so can the next 3 letters, and the next 3, for 2^3 = 8 possible decompositions of the first 9 letters overall:
a|bc|a|bc|a|bc
a|bc|a|bc|ab|c
a|bc|ab|c|a|bc
a|bc|ab|c|ab|c
ab|c|a|bc|a|bc
ab|c|a|bc|ab|c
ab|c|ab|c|a|bc
ab|c|ab|c|ab|c
All of these partial decompositions necessarily fail in the end, since there is no word in the input that contains W's final letter d -- but his algorithm will explore them all before discovering this. In general, a word consisting of n copies of abc followed by a single d will take O(n*2^n) time.
We can improve this to O(n^2) worst-case time (at the cost of O(n) space) by recording extra information about the decomposability of suffixes of W as we go along -- that is, suffixes of W that we have already discovered we can or cannot match to word sequences. This type of algorithm is called dynamic programming.
The condition we need for some word W to be decomposable is exactly that W begins with some word X from the set of other words, and the suffix of W beginning at position |X|+1 is decomposable. (I'm using 1-based indices here, and I'll denote a substring of a string S beginning at position i and ending at position j by S[i..j].)
Whenever we discover that the suffix of the current word W beginning at some position i is or is not decomposable, we can record this fact and make use of it later to save time. For example, after testing the first 4 decompositions in the 8 listed earlier, we know that the suffix of W beginning at position 4 (i.e., abcabcd) is not decomposable. Then when we try the 5th decomposition, i.e., the first one starting with ab, we first ask the question: Is the rest of W, i.e. the suffix of W beginning at position 3, decomposable? We don't know yet, so we try adding c to get ab|c, and then we ask: Is the rest of W, i.e. the suffix of W beginning at position 4, decomposable? And we find that it has already been found not to be -- so we can immediately conclude that no decomposition of W beginning with ab|c is possible either, instead of having to grind through all 4 possibilities.
Assuming for the moment that the current word W is fixed, what we want to build is a function f(i) that determines whether the suffix of W beginning at position i is decomposable. Pseudo-code for this could look like:
- Build a trie the same way as Richard's solution does.
- Initialise the array KnownDecomposable[] to |W| DUNNO values.
f(i):
- If i == |W|+1 then return 1. (The empty suffix means we're finished.)
- If KnownDecomposable[i] is TRUE or FALSE, then immediately return it.
- MAIN BODY BEGINS HERE
- Walk through Richard's trie from the root, following characters in the
suffix W[i..|W|]. Whenever we find a trie node at some depth j that
marks the end of a word in the set:
- Call f(i+j) to determine whether the rest of W can be decomposed.
- If it can (i.e. if f(i+j) == 1):
- Set KnownDecomposable[i] = TRUE.
- Return TRUE.
- If we make it to this point, then we have considered all other
words that form a prefix of W[i..|W|], and found that none of
them yield a suffix that can be decomposed.
- Set KnownDecomposable[i] = FALSE.
- Return FALSE.
Calling f(1) then tells us whether W is decomposable.
By the time a call to f(i) returns, KnownDecomposable[i] has been set to a non-DUNNO value (TRUE or FALSE). The main body of the function is only run if KnownDecomposable[i] is DUNNO. Together these facts imply that the main body of the function will only run as many times as there are distinct values i that the function can be called with. There are at most |W|+1 such values, which is O(n), and outside of recursive calls, a call to f(i) takes at most O(n) time to walk through Richard's trie, so overall the time complexity is bounded by O(n^2).
I guess you are just making a confusion about which words are split.
After sorting, you consider the words one after the other, by decreasing length. Let us call a "candidate" a word you are trying to decompose.
If the candidate is made of other words, it certainly starts with a word, so you will compare all prefixes of the candidate to all possible words.
During the comparison step, you compare a candidate prefix to the whole words, not to split words.
By the way, the given solution will not work for triwords and longer. The fix is as follows:
try every prefix of the candidate and compare it to all words
in case of a match, repeat the search with the suffix.
Example:
testingtester gives the prefixes
t, te, tes, test, testi, testin, testing, testingt, testingte, testingtes and testingteste
Among these, test and testing are words. Then you need to try the corresponding suffixes ingtester and tester.
ingtester gives
i, in, ing, ingt, ingte, ingtes, ingtest and ingteste, none of which are words.
tester is a word and you are done.
IsComposite(InitialCandidate, Candidate):
For all Prefixes of Candidate:
if Prefix is in Words:
Suffix= Candidate - Prefix
if Suffix == "":
return Candidate != InitialCandidate
else:
return IsComposite(InitialCandidate, Suffix)
For all Candidate words by decreasing size:
if IsComposite(Candidate, Candidate):
print Candidate
break
I would probably use recursion here. Start with the longest word and find words it starts with. For any such word remove it from the original word and continue with the remaining part in the same manner.
Pseudo code:
function iscomposed(orininalword, wordpart)
for word in allwords
if word <> orininalword
if wordpart = word
return yes
elseif wordpart starts with word
if iscomposed(orininalword, wordpart - word)
return yes
endif
endif
endif
next
return no
end
main
sort allwords by length descending
for word in allwords
if iscomposed(word, word) return word
next
end
Example:
words:
abcdef
abcde
abc
cde
ab
Passes:
1. abcdef starts with abcde. rest = f. 2. no word f starts with found.
1. abcdef starts with abc. rest = def. 2. no word def starts with found.
1. abcdef starts with ab. rest = cdef. 2. cdef starts with cde. rest = f. 3. no word f starts with found.
1. abcde starts with abc. rest = cde. 2. cde itself found. abcde is a composed word
To find longest world using recursion
class FindLongestWord {
public static void main(String[] args) {
List<String> input = new ArrayList<>(
Arrays.asList("cat", "banana", "rat", "dog", "nana", "walk", "walker", "dogcatwalker"));
List<String> sortedList = input.stream().sorted(Comparator.comparing(String::length).reversed())
.collect(Collectors.toList());
boolean isWordFound = false;
for (String word : sortedList) {
input.remove(word);
if (findPrefix(input, word)) {
System.out.println("Longest word is : " + word);
isWordFound = true;
break;
}
}
if (!isWordFound)
System.out.println("Longest word not found");
}
public static boolean findPrefix(List<String> input, String word) {
boolean output = false;
if (word.isEmpty())
return true;
else {
for (int i = 0; i < input.size(); i++) {
if (word.startsWith(input.get(i))) {
output = findPrefix(input, word.replace(input.get(i), ""));
if (output)
return true;
}
}
}
return output;
}
}
EDIT: Just to clarify, the recursion is required as part of an assignment, so it must be recursive even though I know that's not the best way to do this problem
I made a program that, in part, will search through an extremely large dictionary and compare a given list of words with each word in the dictionary and return a list of words that begin with the same two letters of the user-given word.
This works for small dictionaries but I just discovered that for dictionaries over a certain amount there is a stack limit for the recursions, so I get a stack overflow error.
My idea is to limit each recursion to 1000 recursions, then increment a counter for another 1000 and start again where the recursive method last left off and then end again at 2000, then so on until the end of the dictionary.
Is this the best way to do it? And if so, does anyone have any ideas how? I'm having a really hard time implementing this idea!
(edit: If it's not the best way, does anyone have any ideas of how to do it more effectively?)
Here is the code I have so far, the 1000 recursions idea is barely implemented here because I've deleted some of the code I tried in the past already but honestly it was about as helpful as what I have here.
the call:
for(int i = 0; i < givenWords.size(); i++){
int thousand = 1000;
Dictionary.prefix(givenWords.get(i), theDictionary, 0, thousand);
thousand = thousand + 1000;
}
and the prefix method:
public static void prefix (String origWord, List<String> theDictionary, int wordCounter, int thousand){
if(wordCounter < thousand){
// if the words don't match recurse through this same method in order to move on to the next word
if (wordCounter < theDictionary.size()){
if ( origWord.charAt(0) != theDictionary.get(wordCounter).charAt(0) || origWord.length() != theDictionary.get(wordCounter).length()){
prefix(origWord, theDictionary, wordCounter+1, thousand+1);
}
// if the words first letter and size match, send the word to prefixLetterChecker to check for the rest of the prefix.
else{
prefixLetterChecker(origWord, theDictionary.get(wordCounter), 1);
prefix(origWord, theDictionary, wordCounter+1, thousand+1);
}
}
}
else return;
}
edit for clarification:
The dictionary is a sorted large dictionary with only one word per line, lowercase
the "given word" is actually one out of a list, in the program, the user inputs a string between 2-10 characters, letters only no spaces etc. The program creates a list of all possible permutations of this string, then goes through an array of those permutations and for each permutation returns another list of words beginning with the first two letters of the given word.
If as the program is going through it, any letter up to the first two letters doesn't match, the program moves on to the next given word.
This is actually a nice assignment. Let's make some assumptions....
26 letters in the alphabet, all words are in those letters.
no single word is more than.... 1000 or so characters long.
Create a class, call it 'Node', looks something like:
private static class Node {
Node[] children = new Node[26];
boolean isWord = false;
}
Now, create a tree using this node class. The root of this tree is:
private final Node root = new Node ();
Then, first word in the dictionary is the word 'a'. We add it to the tree. Note that 'a' is letter 0.
So, we 'recurse' in to the tree:
private static final int indexOf(char c) {
return c - 'a';
}
private final Node getNodeForChars(Node node, char[] chars, int pos) {
if (pos == chars.length) {
return this;
}
Node n = children[indexOf(chars[pos])];
if (n == null) {
n = new Node();
children[indexOf(chars[pos])] = n;
}
return getNodeForChars(n, chars, pos + 1);
}
So, with that, you can simply do:
Node wordNode = getNodeForChars(root, word.toCharArray(), 0);
wordNode.isWord = true;
So, you can create a tree of words..... Now, if you need to find all words starting with a given sequence of letters (the prefix), you can do:
Node wordNode = getNodeForChars(root, prefix.toCharArray(), 0);
Now, this node, if isWord is true, and all of its children that are not-null and isWord is true, are words with the prefix. You just have to rebuild the sequence. You may find it advantageous to store the actual word as part of the Node, instead of the boolean isWord flag. Your call.
The recursion depth will never be more than the longest word. The density of the data is 'fanned out' a lot. There are other ways to set up the Node that may be more (or less) efficient in terms of performance, or space. The idea though, is that you set up your data in a wide tree, and your search is thus very fast, and all the child nodes at any point have the same prefix as the parent (or, rather, the parent is the prefix).
I am trying to overwrite the compareTo in Java such that it works as follows. There will be two string arrays containing k strings each. The compareTo method will go through the words in order, comparing the kth element of each array. The arrays will then be sorted thusly. The code I have currently is as follows, but it does not work properly.
I need a return statement outside the for-loop. I'm not sure what this return statement should return, since one of the for-loop return statements will always be reached.
Also, am I using continue correctly here?
public int compareTo(WordNgram wg) {
for (int k = 0; k < (this.myWords).length; k++) {
String temp1 = (this.myWords)[k];
String temp2 = (wg.myWords)[k];
int last = temp1.compareTo(temp2);
if (last == 0) {
continue;
} else {
return last;
}
}
}
You want to compare the two string at the same location:
int last = temp1.compare(temp2);
Java compiler mandates all the end points must have a return statement. In your case you must return 0 at end so when both arrays contain completely equal strings the caller will know they are equal.
You should start listening to your compiler, because after looking at your code for 1 minute, I spotted two undefined states: this.myWords.length is 0 and the two words are equal.
Also, I personally find it very difficult to handle multiple method exit points with all possibilities for input considered and rather insert a single returning statement which makes debugging easier and the results more predictable. In your case for example, I would collect the results of compareTo in a collection if they differ from 0 so that after the for-loop has finished, you could decide at the state of this collection if 0 (empty collection) or the first value in the collection could be returned. I like this more formal approach, because it enforces you to think set-like as in "Give me all comparing results where compareTo results in anything else but 0. If this list is empty, the comparing result is 0, otherwise it is the first element of the list."
I have the following problem and I don't know how to resolve it.
I have a Group class where I have some Nodes and I add more nodes constantly. One of them is named "figure" and I would like to identify this node to remove it.
For example I have an initial Group:
1 line
2 point
3 figure
And then I add more nodes:
1 line
2 point
3 figure
4 line
5 point
I have used this but I have not got it because I can only use it in one situation:
pp.setNodeName("figure");
int numNodes= this._featureNodes.getNumChildren();
if (this._featureNodes.getChild(numNodes-1).getNodeName() == "figure")
{
this._featureNodes.removeChild(numNodes-1);
}
Use equals() for comapring Strings not == operator. equals() compares whether the nodeName has the same String characters. == compares whether two references reference the same object. Therefore, your if block would be as follows:
if (this._featureNodes.getChild(numNodes-1).getNodeName().equals("figure"))
{
this._featureNodes.removeChild(numNodes-1);
}
I get it!!
The solution was to create a "for" loop to read all items of my group and identify the node "figure" like this:
for (int i = 0 ; i< this._featureNodes.getNumChildren(); ++i){
if (_featureNodes.getChild(i).getNodeName().equals("figure"))
{
this._featureNodes.removeChild(i);
}
}
How is the Morse Code representation of a letter determined?
"E" = "."
"T" = "-"
Why is it not alphabetic? As in let "A" = ".", "B" = "-", "C" =".-", etc.
I'm trying to develop an algorithm for a traversing Binary Tree filled with these letter.
My main aim is to search for a letter, like "A", but I don't know what conditions to use that determines when to branch to the right or left node.
EDIT
This is what I tried to do. Here I'm trying to keep a track of the path. But when I try it with a leter like "E", it says the root is empty.
static boolean treeContains( Node root, String item ) {
// Return true if item is one of the items in the binary
// sort tree to which node points. Return false if not.
if ( root == null ) {
// Tree is empty, so it certainly doesn't contain item.
System.out.print("Is null");
return false;
}
else if ( item.equals(root.element) ) {
// Yes, the item has been found in the root node.
return true;
}
else if ( item.compareTo(root.element) < 0 ) {
// If the item occurs, it must be in the left subtree.
// So, return the result of searching the left subtree.
res = res.concat(".");
return treeContains( root.right, item );
}
else {
// If the item occurs, it must be in the right subtree.
// So, return the result of searching the right subtree.
res = res.concat("-");
return treeContains( root.left, item );
}
} // end treeContains()
If you have a Binary Tree with letters in it, then let left be a dot (.) and right be a dash (-). As you traverse the tree, then you know what the binary code for each letter will is, by keeping track of the path.
EDIT
Looking at your code, you're not traversing the tree properly. First, I'm not sure what the variable res is, but I'm betting it's a static, which is not good coding practice.
Your real problem is that your comparison item.compareTo(root.element) < 0 is not a valid comparison for this tree. Instead, you should use a recursive call as your test, treeContains( root.right, item ). Only if this returns true can you then append the dot (.) to your res string. If it returns false, then you can make the recursive call using root.left and append a dash (-).
Personally, I would return a string from this method. The string would be the morse code for the letter so far, or null if the letter is not found. As you return back from the correct tree traversal, build up the correct string (what you're using for res right now).
Something to test for is that you may have to concat to the front of the string instead of the back of the string to get things to come out correct.
The real usefulness of this tree is in decoding the Morse code, converting a dot-dash string to its correct letter. This becomes a simple tree traversal.