Finding longest concatenated word

Finding longest concatenated word - java

I have a dictionary with many words. And i hope search the longest concatenated word (that is, the longest word that is comprised entirely of
shorter words in the file). I give the method a descending word from their length. How can I check that all the symbols have been used from the dictionary?
public boolean tryMatch(String s, List dictionary) {
String nextWord = new String();
int contaned = 0;
//Цикл перебирающий каждое слово словаря
for(int i = 1; i < dictionary.size();i++) {
nextWord = (String) dictionary.get(i);
if (nextWord == s) {
nextWord = (String) dictionary.get(i + 1);
}
if (s.contains(nextWord)) {
contaned++;
}
}
if(contaned >1) {
return true;
}
return false;
}

If you have a sorted list of words, finding compound words is easy, but it will only perform well if the words are in a Set.
Let's look at the compound word football, and of course assume that both ball and foot are in the work list.
By definition, any compound word using foot as the first sub-word must start with foot.
So, when iterating the list, remember the current active "stem" words, e.g. when seeing foot, remember it.
Now, when seeing football, you check if the word starts with the stem word. If not, clear the stem word, and make new word the stem word.
If it does, the new word (football) is a candidate for being a compound word. The part after the stem is ball, so we need to check if that is a word, and if so, we found a compound word.
Checking is easy for simple case, i.e. wordSet.contains(remain).
However, compound words can be made up of more than 2 words, e.g. whatsoever. So after finding that it is a candidate from the stem word what, the remain is soever.
You can simply try all lengths of that (soever, soeve, soev, soe, so, s), and if one of the shorter ones are words, you repeat the process.

Related

Time Complexity of Code for finding longest word inside dictionary

Problem is as follows: You start with a 2 letter word, and you can append letters to the front and back of the word. You have to return the longest word that exists inside a dictionary that you can form by appending letters to the front and back of the 2 letter word, and every new word that you formed must also be inside the dictionary as well
For example:
Start: 'at'
Dict: [hat, chat, chats, rat, rate, orange]
Output: 'chats', because: at -> hat -> chat -> chats
I have the code as follows:
public static String longest(ArrayList<String> input) {
return helper('at', dict);
}
public static String helper(String in, ArrayList<String> dict) {
ArrayList<String> maxes = new ArrayList<String>();
for (char a = 'a'; a < 'z'; a++) {
String front = Character.toString(a) + in;
String back = in + Character.toString(a);
if (dict.contains(front)) {
maxes.add(helper(front, dict));
}
if (dict.contains(back)) {
maxes.add(helper(back, dict));
}
}
if (maxes.size() == 0) {
return in;
}
String word = "";
for (String w : maxes) {
if (w.length() > word.length()) {
word = w;
}
}
return word;
}
I was wondering what the time complexity for this algorithm would be? I can't for the life of me figure it out.

The answer strongly depends on your dictionary (n words with max reachable length L<=n+1) and on your data structure for storing it. Each call to helper (without its recursive calls) is O(n L) with dict being an ArrayList, whereas with a hash table it's O(L) (absent unlikely collisions). (There can be very long unreachable words in the dictionary, but it still costs only O(L) to compare against them because your trial words can't be longer.)
As for the number of calls to helper: this is just a depth-first search on the tree of words related by prepending/appending a letter. As such, it's O(v), where v is the number of vertices visited. The values of v for various input words depends on your dictionary as well: v<=n, of course, and is often much less. As an example: using the 71813 lines in my /usr/share/dict/words that are all ASCII letters (and ignoring case), the most words ever considered is 593 (for "Ar" as in argon).
The worst-case dictionary will have all its words forming a chain "ab", "abc", "abcd", etc.. You visit every word for a total cost of O(v n L)=O(n^3) (O(v L)=O(n^2) with the hash table). Realistic dictionaries will be much faster not only because L is smaller but also because v is; the exact speedup is unfortunately difficult to analyze. It's probably reasonable to assume L is Θ(log(n)); there's no meaningful asymptotic expression for v as a function of n because realistic dictionaries don't have arbitrarily large n.

Looping through an ArrayList with another Arraylist in Java

I have a large array list of sentences and another array list of words.
My program loops through the array list and removes an element from that array list if the sentence contains any of the words from the other.
The sentences array list can be very large and I coded a quick and dirty nested for loop. While this works for when there are not many sentences, in cases where their are, the time it takes to finish this operation is ridiculously long.
for (int i = 0; i < SENTENCES.size(); i++) {
for (int k = 0; k < WORDS.size(); k++) {
if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {
//Do something
}
}
}
Is there a more efficient way of doing this then a nested for loop?

There's a few inefficiencies in your code, but at the end of the day, if you've got to search for sentences containing words then there's no getting away from loops.
That said, there are couple of things to try.
First, make WORDS a HashSet, the contains method will be far quicker than for an ArrayList because it's doing a hash look-up to get the value.
Second, switch the logic about a bit like this:
Iterator<String> sentenceIterator = SENTENCES.iterator();
sentenceLoop:
while (sentenceIterator.hasNext())
{
String sentence = sentenceIterator.next();
for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
{
if (WORDS.contains(word))
{
sentenceIterator.remove();
continue sentenceLoop;
}
}
}
This code (which assumes you're trying to remove sentences that contain certain words) uses Iterators and avoids the string concatenation and parsing logic you had in your original code (replacing it with a single regex) both of which should be quicker.
But bear in mind, as with all things performance you'll need to test these changes to see they improve the situation.

I̶ ̶w̶o̶u̶l̶d̶ ̶s̶a̶y̶ ̶n̶o̶,̶ ̶b̶u̶t̶ what you must change is the way you handle the removal of the data. This is noted by this part of the explanation of your problem:
The sentences array list can be very large (...). While this works for when there are not many sentences, in cases where their are, the time it takes to finish this operation is ridiculously long.
The cause of this is that removal time in ArrayList takes O(N), and since you're doing this inside a loop, then it will take at least O(N^2).
I recommend using LinkedList rather than ArrayList to store the sentences, and use Iterator rather than your naive List#get since it already offers Iterator#remove in time O(1) for LinkedList.
In case you cannot change the design to LinkedList, I recommend storing the sentences that are valid in a new List, and in the end replace the contents of your original List with this new List, thus saving lot of time.
Apart from this big improvement, you can improve the algorithm even more by using a Set to store the words to lookup rather than using another List since the lookup in a Set is O(1).

What you could do is put all your words into a HashSet. This allows you to check if a word is in the set very quickly. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.
HashSet<String> wordSet = new HashSet();
for (String word : WORDS) {
wordSet.add(word);
}
Then it's just a matter of splitting each sentence into the words that make it up, and checking if any of those words are in the set.
for (String sentence : SENTENCES) {
String[] sentenceWords = sentence.split(" "); // You probably want to use a regex here instead of just splitting on a " ", but this is just an example.
for (String word : sentenceWords) {
if (wordSet.contains(word)) {
// The sentence contains one of the special words.
// DO SOMETHING
break;
}
}
}

I will create a set of words from second ArrayList:
Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");
I will then iterate over the set and the first ArrayList and use Contains:
for (String word : listOfWords) {
for(String sentence : Sentences) {
if (sentence.contains(word)) {
// do something
}
}
}
Also, if you are free to use any open source jar, check this out:
searching string in another string

First, your program has a bug: it would not count words at the beginning and at the end of a sentence.
Your current program has runtime complexity of O(s*w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.
If words is relatively small (a few hundred items or so) you could use regex to speed things up considerably: construct a pattern like this, and use it in a loop:
StringBuilder regex = new StringBuilder();
boolean first = true;
// Let's say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
if (!first) {
regex.append('|');
} else {
first = false;
}
regex.append(w);
}
regex.append(")\\b");
// Now regex is "\b(?:quick|brown|fox)\b", i.e. your list of words
// separated by OR signs, enclosed in non-capturing groups
// anchored to word boundaries by '\b's on both sides.
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
if (p.matcher(SENTENCES.get(i)).find()) {
// Do something
}
}
Since regex gets pre-compiled into a structure more suitable for fast searches, your program would run in O(s*max(w)), where s is the length, in characters, of all sentences, and w is the length of the longest word. Given that the number of words in your collection is about 200 or 300, this could give you an order of magnitude decrease in running time.

If you have enough memory you can tokenize SENTENCES and put them in a Set. Then it would be better in performance and also more correct than current implementation.

Well, looking at your code I would suggest two things that will improve the performance from each iteration:
Remove " == true". The contains operation already returns a boolean, so it is enough for the if, comparing it with true adds one extra operation for each iteration that is not needed.
Do not concatenate Strings inside a loop (" " + WORDS.get(k) + " ") as it is a quite expensive operation because + operator creates new objects. Better use a string buffer / builder and clear it after each iteration with stringBuffer.setLength(0);.
Besides that, for this case I do not know any other approach, maybe you can use regular expressions if you can abstract a pattern out of those words you want to remove and have then only one loop.
Hope it helps!

If you concern about the efficiency, I think that the most effective way to do this is to use Aho-Corasick's algorithm. While you have 2 nested loops here and a contains() method (that I think takes at the best length of sentence + length of word time), Aho-Corasick gives you one loop over sentences and for checking of containing words it takes length of sentence, which is length of word times faster (+ a preprocessing time for creation of finite state machine, which is relatively small).

I'll approach this in more theoretical view.. If you don't have memory limitation, you can try to mimic the logic in counting sort
say M1 = sentences.size, M2 = number of word per sentences, and N = word.size
Assume all sentences has the same number of words just for simplicity
your current approach's complexity is O(M1.M2.N)
We can create a mapping of words - position in sentences.
Loop through your arraylist of sentences, and change them into two dimensional jagged array of words. Loop through the new array, create a HashMap where key,value = words, arraylist of word position (say with length X). That's O(2M1.M2.X) = O(M1.M2.X)
Then loop through your words arraylist, access your word hashmap, loop through the list of word position. remove each one. That's O(N.X)
Say you're need to give the result in arraylist of string, we need another loop and concat everything. That's O(M1.M2)
Total complexity is O(M1.M2.X) + O(N.X) + O(M1.M2)
assumming X is way smaller than N, you'll probably get better performance

How to get around recursive stack overflow?

EDIT: Just to clarify, the recursion is required as part of an assignment, so it must be recursive even though I know that's not the best way to do this problem
I made a program that, in part, will search through an extremely large dictionary and compare a given list of words with each word in the dictionary and return a list of words that begin with the same two letters of the user-given word.
This works for small dictionaries but I just discovered that for dictionaries over a certain amount there is a stack limit for the recursions, so I get a stack overflow error.
My idea is to limit each recursion to 1000 recursions, then increment a counter for another 1000 and start again where the recursive method last left off and then end again at 2000, then so on until the end of the dictionary.
Is this the best way to do it? And if so, does anyone have any ideas how? I'm having a really hard time implementing this idea!
(edit: If it's not the best way, does anyone have any ideas of how to do it more effectively?)
Here is the code I have so far, the 1000 recursions idea is barely implemented here because I've deleted some of the code I tried in the past already but honestly it was about as helpful as what I have here.
the call:
for(int i = 0; i < givenWords.size(); i++){
int thousand = 1000;
Dictionary.prefix(givenWords.get(i), theDictionary, 0, thousand);
thousand = thousand + 1000;
}
and the prefix method:
public static void prefix (String origWord, List<String> theDictionary, int wordCounter, int thousand){
if(wordCounter < thousand){
// if the words don't match recurse through this same method in order to move on to the next word
if (wordCounter < theDictionary.size()){
if ( origWord.charAt(0) != theDictionary.get(wordCounter).charAt(0) || origWord.length() != theDictionary.get(wordCounter).length()){
prefix(origWord, theDictionary, wordCounter+1, thousand+1);
}
// if the words first letter and size match, send the word to prefixLetterChecker to check for the rest of the prefix.
else{
prefixLetterChecker(origWord, theDictionary.get(wordCounter), 1);
prefix(origWord, theDictionary, wordCounter+1, thousand+1);
}
}
}
else return;
}
edit for clarification:
The dictionary is a sorted large dictionary with only one word per line, lowercase
the "given word" is actually one out of a list, in the program, the user inputs a string between 2-10 characters, letters only no spaces etc. The program creates a list of all possible permutations of this string, then goes through an array of those permutations and for each permutation returns another list of words beginning with the first two letters of the given word.
If as the program is going through it, any letter up to the first two letters doesn't match, the program moves on to the next given word.

This is actually a nice assignment. Let's make some assumptions....
26 letters in the alphabet, all words are in those letters.
no single word is more than.... 1000 or so characters long.
Create a class, call it 'Node', looks something like:
private static class Node {
Node[] children = new Node[26];
boolean isWord = false;
}
Now, create a tree using this node class. The root of this tree is:
private final Node root = new Node ();
Then, first word in the dictionary is the word 'a'. We add it to the tree. Note that 'a' is letter 0.
So, we 'recurse' in to the tree:
private static final int indexOf(char c) {
return c - 'a';
}
private final Node getNodeForChars(Node node, char[] chars, int pos) {
if (pos == chars.length) {
return this;
}
Node n = children[indexOf(chars[pos])];
if (n == null) {
n = new Node();
children[indexOf(chars[pos])] = n;
}
return getNodeForChars(n, chars, pos + 1);
}
So, with that, you can simply do:
Node wordNode = getNodeForChars(root, word.toCharArray(), 0);
wordNode.isWord = true;
So, you can create a tree of words..... Now, if you need to find all words starting with a given sequence of letters (the prefix), you can do:
Node wordNode = getNodeForChars(root, prefix.toCharArray(), 0);
Now, this node, if isWord is true, and all of its children that are not-null and isWord is true, are words with the prefix. You just have to rebuild the sequence. You may find it advantageous to store the actual word as part of the Node, instead of the boolean isWord flag. Your call.
The recursion depth will never be more than the longest word. The density of the data is 'fanned out' a lot. There are other ways to set up the Node that may be more (or less) efficient in terms of performance, or space. The idea though, is that you set up your data in a wide tree, and your search is thus very fast, and all the child nodes at any point have the same prefix as the parent (or, rather, the parent is the prefix).

Finding the index of a permutation within a string

I just attempted a programming challenge, which I was not able to successfully complete. The specification is to read 2 lines of input from System.in.
A list of 1-100 space separated words, all of the same length and between 1-10 characters.
A string up to a million characters in length, which contains a permutation of the above list just once. Return the index of where this permutation begins in the string.
For example, we may have:
dog cat rat
abcratdogcattgh
3
Where 3 is the result (as printed by System.out).
It's legal to have a duplicated word in the list:
dog cat rat cat
abccatratdogzzzzdogcatratcat
16
The code that I produced worked providing that the word that the answer begins with has not occurred previously. In the 2nd example here, my code will fail because dog has already appeared before where the answer begins at index 16.
My theory was to:
Find the index where each word occurs in the string
Extract this substring (as we have a number of known words with a known length, this is possible)
Check that all of the words occur in the substring
If they do, return the index that this substring occurs in the original string
Here is my code (it should be compilable):
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class Solution {
public static void main(String[] args) throws Exception {
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String line = br.readLine();
String[] l = line.split(" ");
String s = br.readLine();
int wl = l[0].length();
int len = wl * l.length;
int sl = s.length();
for (String word : l) {
int i = s.indexOf(word);
int z = i;
//while (i != -1) {
int y = i + len;
if (y <= sl) {
String sub = s.substring(i, y);
if (containsAllWords(l, sub)) {
System.out.println(s.indexOf(sub));
System.exit(0);
}
}
//z+= wl;
//i = s.indexOf(word, z);
//}
}
System.out.println("-1");
}
private static boolean containsAllWords(String[] l, String s) {
String s2 = s;
for (String word : l) {
s2 = s2.replaceFirst(word, "");
}
if (s2.equals(""))
return true;
return false;
}
}
I am able to solve my issue and make it pass the 2nd example by un-commenting the while loop. However this has serious performance implications. When we have an input of 100 words at 10 characters each and a string of 1000000 characters, the time taken to complete is just awful.
Given that each case in the test bench has a maximum execution time, the addition of the while loop would cause the test to fail on the basis of not completing the execution in time.
What would be a better way to approach and solve this problem? I feel defeated.

If you concatenate the strings together and use the new string to search with.
String a = "dog"
String b = "cat"
String c = a+b; //output of c would be "dogcat"
Like this you would overcome the problem of dog appearing somewhere.
But this wouldn't work if catdog is a valid value too.

Here is an approach (pseudo code)
stringArray keys(n) = {"cat", "dog", "rat", "roo", ...};
string bigString(1000000);
L = strlen(keys[0]); // since all are same length
int indices(n, 1000000/L); // much too big - but safe if only one word repeated over and over
for each s in keys
f = -1
do:
f = find s in bigString starting at f+1 // use bigString.indexOf(s, f+1)
write index of f to indices
until no more found
When you are all done, you will have a series of indices (location of first letter of match). Now comes the tricky part. Since the words are all the same length, we're looking for a sequence of indices that are all spaced the same way, in the 10 different "collections". This is a little bit tedious but it should complete in a finite time. Note that it's faster to do it this way than to keep comparing strings (comparing numbers is faster than making sure a complete string is matched, obviously). I would again break it into two parts - first find "any sequence of 10 matches", then "see if this is a unique permutation".
sIndx = sort(indices(:))
dsIndx = diff(sIndx);
sequence = find {n} * 10 in dsIndx
for each s in sequence
check if unique permutation
I hope this gets you going.

Perhaps not the best optimized version, but how about following theory to give you some ideas:
Count length of all words in row.
Take random word from list and find the starting index of its first
occurence.
Take a substring with length counted above before and after that
index (e.g. if index is 15 and 3 words of 4 letters long, take
substring from 15-8 to 15+11).
Make a copy of the word list with earlier random word removed.
Check the appending/prepending [word_length] letters to see if they
match a new word on the list.
If word matches copy of list, remove it from copy of list and move further
If all words found, break loop.
If not all words found, find starting index of next occurence of
earlier random word and go back to 3.
Why it would help:
Which word you pick to begin with wouldn't matter, since every word
needs to be in the succcessful match anyway.
You don't have to manually loop through a lot of the characters,
unless there are lots of near complete false matches.
As a supposed match keeps growing, you have less words on the list copy left to compare to.
Can also keep track or furthest index you've gone to, so you can
sometimes limit the backwards length of picked substring (as it
cannot overlap to where you've already been, if the occurence are
closeby to each other).

How to know whether a string can be segmented into two strings

I was asked in interview following question. I could not figure out how to approach this question. Please guide me.
Question: How to know whether a string can be segmented into two strings - like breadbanana is segmentable into bread and banana, while breadbanan is not. You will be given a dictionary which contains all the valid words.

Build a trie of the words you have in the dictionary, which will make searching faster.
Search the tree according to the following letters of your input string. When you've found a word, which is in the tree, recursively start from the position after that word in the input string. If you get to the end of the input string, you've found one possible fragmentation. If you got stuck, come back and recursively try another words.
EDIT: sorry, missed the fact, that there must be just two words.
In this case, limit the recursion depth to 2.
The pseudocode for 2 words would be:
T = trie of words in the dictionary
for every word in T, which can be found going down the tree by choosing the next letter of the input string each time we move to the child:
p <- length(word)
if T contains input_string[p:length(intput_string)]:
return true
return false
Assuming you can go down to a child node in the trie in O(1) (ascii indexes of children), you can find all prefixes of the input string in O(n+p), where p is the number of prefixes, and n the length of the input. Upper bound on this is O(n+m), where m is the number of words in dictionary. Checking for containing will take O(w) where w is the length of word, for which the upper bound would be m, so the time complexity of the algorithm is O(nm), since O(n) is distributed in the first phase between all found words.
But because we can't find more than n words in the first phase, the complexity is also limited to O(n^2).
So the search complexity would be O(n*min(n, m))
Before that you need to build the trie which will take O(s), where s is the sum of lengths of words in the dictionary. The upper bound on this is O(n*m), since the maximum length of every word is n.

you go through your dictionary and compare every term as a substring with the original term e.g. "breadbanana". If the first term matches with the first substring, cut the first term out of the original search term and compare the next dictionary entries with the rest of the original term...
let me try to explain that in java:
e.g.
String dictTerm = "bread";
String original = "breadbanana";
// first part matches
if (dictTerm.equals(original.substring(0, dictTerm.length()))) {
// first part matches, get the rest
String lastPart = original.substring(dictTerm.length());
String nextDictTerm = "banana";
if (nextDictTerm.equals(lastPart)) {
System.out.println("String " + original +
" contains the dictionary terms " +
dictTerm + " and " + lastPart);
}
}

The simplest solution:
Split the string between every pair of consecutive characters and see whether or not both substrings (to the left of the split point and to the right of it) are in the dictionary.

One approach could be:
Put all elements of dictionary in some set or list
now you can use contains & substring function to remove words which matches dictionary. if at the end string is null -> string can be segmented else not. You can also take care of count.

public boolean canBeSegmented(String s) {
for (String word : dictionary.getWords()) {
if (s.contains(word) {
String sub = s.subString(0, s.indexOf(word));
s = sub + s.subString(s.indexOf(word)+word.length(), s.length()-1);
}
return s.equals("");
}
}
This code checks if your given String can be fully segmented. It checks if a word from the dictionary is inside your string and then subtracks it. If you want to segment it in the process you have to order the subtracted sementents in the order they are inside the word.
Just two words makes it easier:
public boolean canBeSegmented(String s) {
boolean wordDetected = false;
for (String word : dictionary.getWords()) {
if (s.contains(word) {
String sub = s.subString(0, s.indexOf(word));
s = sub + s.subString(s.indexOf(word)+word.length(), s.length()-1);
if(!wordDetected)
wordDetected = true;
else
return s.equals("");
}
return false;
}
}
This code checks for one Word and if there is another word in the String and just these two words it returns true otherwise false.

this is a mere idea , you can implement it better if you want
package farzi;
import java.util.ArrayList;
public class StringPossibility {
public static void main(String[] args) {
String str = "breadbanana";
ArrayList<String> dict = new ArrayList<String>();
dict.add("bread");
dict.add("banana");
for(int i=0;i<str.length();i++)
{
String word1 = str.substring(0,i);
String word2 = str.substring(i,str.length());
System.out.println(word1+"===>>>"+word2);
if(dict.contains(word1))
{
System.out.println("word 1 found : "+word1+" at index "+i);
}
if(dict.contains(word2))
{
System.out.println("word 2 found : "+ word2+" at index "+i);
}
}
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Finding longest concatenated word - java

Related

Time Complexity of Code for finding longest word inside dictionary

Looping through an ArrayList with another Arraylist in Java

How to get around recursive stack overflow?

Finding the index of a permutation within a string

How to know whether a string can be segmented into two strings

Categories

Resources