Using Java (on Android), I am trying to find a fast way to solve this problem:
I have a list of words (around 10 to 30) and a document. The length of the document can vary too, maybe around 2,500 to 10,000 words. The document is part of a book.
What I want is to find the string (sentence...) in this document that contains the highest number of the words in my list. The words in the document have to be in the same order as in my word list. Normally the words should not be too far from one another in the document, maybe at most 2 or 3 words between each word of my list.
To be clearer, let's take an example with small data.
My word list is:
harm piece work day
My document:
just so, with the greatest care. You must see to it that you pull up
regularly all the baobabs, at the very first moment when they can be
distinguished from the rosebushes which they resemble so closely in
their earliest youth. It is very tedious work," the little prince
added, "but very easy." And one day he said to me: "You ought to
make a beautiful drawing, so that the children where you live can see
exactly how all this is. That would be very useful to them if they
were to travel some day. Sometimes," he added, "there is no harm
in putting off a piece of work until another day. But
when it is a matter of baobabs, that always means a catastrophe. I
knew a planet that was inhabited by a lazy man. He neglected three
little bushes..." So, as the little prince described it to me, I
have made a drawing of that planet. I do not much like to take the
tone of a moralist. But the danger of the baobabs is so little
understood, and such considerable risks would be run by anyone who
might get lost on an asteroid, that for once I am breaking through my
reserve. "Children," I say plainly, "watch out for the baobabs!"
The goal is to find the string "there is no harm in putting off a piece of work until another day" in the document.
For now, the only approach I can think of is:
1 - Find the first occurrence of the first word of my list in the document.
2 - Multiply the number of words in my list by 2 or 3 to get the length of the string I have to check in the document (given the max number of words allowed between the words of my list).
3 - Search for the occurrences of the other words of my list in this document string (of the length computed in step 2) by splitting and looping.
If I consider that not enough of my words occur in this string (maybe around 50%), then continue searching in the document, starting from the next occurrence of the first word of my list.
But I'm afraid this could be very slow, far too slow, especially because I'm working on a mobile device... So I'm here to gather some ideas I maybe didn't think of, or some libraries that could help me with this task. I thought about regular expressions too, but I'm not sure whether they would be a better way.
#gukoff's proposition
Since, as it turns out, my word list can't be in a different order than in my text, the algorithm is simplified. The beginning of #gukoff's answer is enough; there is no need to implement the LIS algorithm or to reverse the index lists.
//Section = input text
//wordsToFind = words to find in the text, separated by spaces
private ArrayList<ArrayList<Integer>> test1(String wordsToFind, Section section) {
    //1. Create the index of your words array.
    String[] wordsArray = wordsToFind.split(" ");
    ArrayList<Integer> indexesSentences = new ArrayList<>();
    ArrayList<ArrayList<Integer>> sentenceArrayIndexes = new ArrayList<>();
    ArrayList<Integer> wordsToFindIndexes = new ArrayList<>();
    for (Sentence sentence : section.getSentences()) {
        indexesSentences.clear();
        for (String sentenceWord : sentence.getWords()) {
            wordsToFindIndexes.clear();
            //Collect the (1-based) positions of this sentence word in the words array.
            int j = 0;
            for (String word : wordsArray) {
                if (word.equals(sentenceWord)) {
                    wordsToFindIndexes.add(j + 1);
                }
                j++;
            }
            //Not needed, since the list order always matches the text order:
            //Collections.reverse(wordsToFindIndexes);
            indexesSentences.addAll(wordsToFindIndexes);
        }
        sentenceArrayIndexes.add(new ArrayList<>(indexesSentences));
    }
    return sentenceArrayIndexes;
}
public class Section {
    private ArrayList<Sentence> sentences;

    public Section(String text) {
        if (text == null || text.trim().isEmpty()) {
            throw new IllegalArgumentException("Text not valid");
        }
        sentences = new ArrayList<>();
        String formattedText = text.trim().replaceAll("[^a-zA-Z. ]", "").toLowerCase();
        String[] sentencesArray = formattedText.split("\\.");
        for (String sentenceStr : sentencesArray) {
            //Note: comparing Strings with == or != tests references, not content.
            if (!sentenceStr.trim().isEmpty()) {
                sentences.add(new Sentence(sentenceStr));
            }
        }
    }

    public ArrayList<Sentence> getSentences() {
        return sentences;
    }

    public void addSentence(Sentence sentence) {
        sentences.add(sentence);
    }
}
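For reference, a hypothetical usage sketch (it assumes a Sentence class exposing getWords(), as test1 uses above):
Section section = new Section(
        "There is no harm in putting off a piece of work until another day."
        + " But when it is a matter of baobabs, that always means a catastrophe.");
ArrayList<ArrayList<Integer>> result = test1("harm piece work day", section);
// result.get(0) == [1, 2, 3, 4]: the first sentence contains all four list
// words, in list order; result.get(1) is empty.
System.out.println(result);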
So, you have the words to be found and a text, which consists of sentences to be examined.
Create the index of your words array.
For example, if words = a dog is not a human:
{
"a": [1, 5],
"dog": [2],
"is": [3],
"not": [4],
"human": [6]
}
In every sentence, replace every word by its index values in descending order. That is, "a" gets replaced by [5, 1], "human" gets replaced by [6], and "tree" gets replaced by [].
For example, the sentence "not a cat is a human" turns into [4, 5, 1, 3, 5, 1, 6].
Find the longest increasing subsequence (LIS) in every such array. Essentially, the LIS is the longest sub-match of your words array in the sentence.
For example, the LIS of [4, 5, 1, 3, 5, 1, 6] is [1, 3, 5, 6], which maps to the sub-match "a is a human".
More generally, if the words shouldn't be very far from each other, I suggest finding the LIS using dynamic programming, with the corresponding modifications.
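A minimal Java sketch of the indexing and replacement steps (variable names are mine, not the answer's):
Map<String, List<Integer>> index = new HashMap<>();
String[] words = "a dog is not a human".split(" ");
for (int i = 0; i < words.length; i++) {
    index.computeIfAbsent(words[i], k -> new ArrayList<>()).add(i + 1); // 1-based
}
// index = {a=[1, 5], dog=[2], is=[3], not=[4], human=[6]}
List<Integer> encoded = new ArrayList<>();
for (String w : "not a cat is a human".split(" ")) {
    List<Integer> positions = index.getOrDefault(w, Collections.emptyList());
    for (int j = positions.size() - 1; j >= 0; j--) { // descending order
        encoded.add(positions.get(j));
    }
}
// encoded = [4, 5, 1, 3, 5, 1, 6]; its LIS, [1, 3, 5, 6], is the longest sub-match.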
Here is a simple approach which should be good enough given your document size:
1. Make an array (call it words) of size n, where n is the number of words in your document. Now populate this array such that:
words[i] = 0 if no word in your list matches the i-th document word
words[i] = k if the k-th word in your list matches it (1-based indexing)
Example: If your document is there is no harm in putting off a piece of work until another day. and the word list is work day harm piece (in that order), then your words array will look like this: [0,0,0,3,0,0,0,0,4,0,1,0,0,2]
2. Now you will have an array of 2000~3000 integers. You can use a variant of the longest common subsequence problem, or modify your algorithm a little, to find the best match.
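A minimal Java sketch of that mapping step (variable names are mine):
String document = "there is no harm in putting off a piece of work until another day";
String[] list = {"work", "day", "harm", "piece"}; // your word list, in order
Map<String, Integer> rank = new HashMap<>();
for (int k = 0; k < list.length; k++) {
    rank.put(list[k], k + 1); // 1-based indexing
}
String[] doc = document.split(" ");
int[] words = new int[doc.length];
for (int i = 0; i < doc.length; i++) {
    words[i] = rank.getOrDefault(doc[i], 0); // 0 = not in the list
}
// words == [0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 1, 0, 0, 2]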
Background of question
I have been developing some code that, firstly, reads a string and creates a file; secondly, splits the string into an array; then gets the indexes of each word in the array; and finally removes the duplicates and prints the result to a different file.
I have already written the code for this; here is a link: https://pastebin.com/gqWH0x0 (there is a menu system as well), but it is rather long, so I have refrained from including it in this question.
The compression method is done via hashmaps, getting indexes of the array and mapping them to the relevant word. Here is an example:
Original: "sea sea see sea see see"
Output: see[2, 4, 5],sea[0, 1, 3],
Question
The next stage is getting the output back into the original state. I am relatively new to Java, so I am not aware of the techniques required. The code should be able to take the output file (shown above) and turn it back into the original.
My current thinking is that you would just rewrite this hashmap (below). Would I be correct in thinking this? I thought I should check with Stack Overflow first!
Map<String, Set<Integer>> seaMap = new HashMap<>(); //new hashmap
for (int seaInt = 0; seaInt < sealist.length; seaInt++) {
    if (seaMap.containsKey(sealist[seaInt])) {
        Set<Integer> index = seaMap.get(sealist[seaInt]);
        index.add(seaInt);
    } else {
        Set<Integer> index = new HashSet<>();
        index.add(seaInt);
        seaMap.put(sealist[seaInt], index);
    }
}
System.out.print("Compressed: ");
seaMap.forEach((seawords, seavalues) -> System.out.print(seawords + seavalues + ","));
System.out.println("\n");
If anyone has any good ideas / answers, please let me know; I am really desperate for a solution!
Link to current code: https://pastebin.com/gqWH0x0K
First you will have to separate the words with their index(es) from your compressed line; using your example:
"see[2, 4, 5],sea[0, 1, 3],"
you obtain the following Strings:
"see[2, 4, 5]" and "sea[0, 1, 3]"
For each, you must read the indexes, e.g. for the first:
2, 4 and 5
Now just write the word into an ArrayList (or array) at the given index.
For the first two steps you can use a regular expression to find each word and its index list. Then use String.split and Integer.parseInt to get all indexes.
Pattern pattern = Pattern.compile("(.*?)\\[(.*?)\\],");
String line = "see[2, 4, 5],sea[0, 1, 3],";
Matcher matcher = pattern.matcher(line);
List<String> result = new ArrayList<>();
while (matcher.find()) {
    String word = matcher.group(1);
    String[] indexes = matcher.group(2).split(", ");
    for (String str : indexes) {
        int index = Integer.parseInt(str);
        while (result.size() <= index) { // grow the list until the index fits
            result.add(null);
        }
        result.set(index, word);         // set the word at the found index
    }
}
// String.join(" ", result) now rebuilds "sea sea see sea see see"
The inner loop above makes sure the result List is big enough before setting the word at each found index.
I am working on a problem, which is to write a program to find the longest word made of other words in a list of words.
EXAMPLE
Input: test, tester, testertest, testing, testingtester
Output: testingtester
I searched and found the following solution. My question concerns step 2: why should we break each word in all possible ways? Why not use each word directly as a whole? If anyone could give some insight, that would be great.
The solution below does the following:
Sort the array by size, putting the longest word at the front
For each word, split it in all possible ways. That is, for “test”, split it into {“t”, “est”}, {“te”, “st”} and {“tes”, “t”} (see the small sketch after this list).
Then, for each pairing, check if the first half and the second both exist elsewhere in the array.
“Short circuit” by returning the first string we find that fits condition #3.
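A tiny Java sketch of that splitting step (this is an illustration, not the solution's own code):
// All two-part splits of "test": {"t","est"}, {"te","st"}, {"tes","t"}
String word = "test";
for (int i = 1; i < word.length(); i++) {
    String left = word.substring(0, i);
    String right = word.substring(i);
    // for each pairing, check whether both halves exist elsewhere in the array
    System.out.println("{\"" + left + "\", \"" + right + "\"}");
}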
Answering your question indirectly, I believe the following is an efficient way to solve this problem using tries.
Build a trie from all of the words in your list.
Sort the words so that the longest word comes first.
Now, for each word W, start at the top of the trie and begin following the word down the tree one letter at a time using letters from the word you are testing.
Each time a word ends, recursively re-enter the trie from the top making a note that you have "branched". If you run out of letters at the end of the word and have branched, you've found a compound word and, because the words were sorted, this is the longest compound word.
If the letters stop matching at any point, or you run out and are not at the end of the word, just backtrack to wherever it was that you branched and keep plugging along.
I'm afraid I don't know Java that well, so I'm unable to provide you with sample code in that language. I have, however, written out a solution in Python (using a trie implementation from this answer). Hopefully it is clear to you:
#!/usr/bin/env python3

#End-of-word symbol
_end = '_end_'

#Make a trie out of nested HashMap, UnorderedMap, dict structures
def MakeTrie(words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict[_end] = _end
    return root

def LongestCompoundWord(original_trie, trie, word, level=0):
    first_letter = word[0]
    if first_letter not in trie:
        return False
    if len(word) == 1 and _end in trie[first_letter]:
        #Reached the end: compound only if we branched at least once
        return level > 0
    if _end in trie[first_letter] and LongestCompoundWord(original_trie, original_trie, word[1:], level + 1):
        return True
    return LongestCompoundWord(original_trie, trie[first_letter], word[1:], level)

#Words that were in your question
words = ['test', 'testing', 'tester', 'teste', 'testingtester', 'testingtestm', 'testtest', 'testingtest']
trie = MakeTrie(words)

#Sort words in order of decreasing length
words = sorted(words, key=lambda x: len(x), reverse=True)
for word in words:
    if LongestCompoundWord(trie, trie, word):
        print("Longest compound word was '{0:}'".format(word))
        break
With the above in mind, the answer to your original question becomes clearer: we do not know ahead of time which combination of prefix words will take us successfully through the trie. Therefore, we need to be prepared to check all possible combinations of prefix words.
Since the algorithm you found does not have an efficient way of knowing which subsets of a word are prefixes, it splits the word at all possible points to ensure that all prefixes are generated.
Richard's answer will work well in many cases, but it can take exponential time: this will happen if there are many segments of the string W, each of which can be decomposed in multiple different ways. For example, suppose W is abcabcabcd, and the other words are ab, c, a and bc. Then the first 3 letters of W can be decomposed either as ab|c or as a|bc... and so can the next 3 letters, and the next 3, for 2^3 = 8 possible decompositions of the first 9 letters overall:
a|bc|a|bc|a|bc
a|bc|a|bc|ab|c
a|bc|ab|c|a|bc
a|bc|ab|c|ab|c
ab|c|a|bc|a|bc
ab|c|a|bc|ab|c
ab|c|ab|c|a|bc
ab|c|ab|c|ab|c
All of these partial decompositions necessarily fail in the end, since there is no word in the input that contains W's final letter d -- but his algorithm will explore them all before discovering this. In general, a word consisting of n copies of abc followed by a single d will take O(n*2^n) time.
We can improve this to O(n^2) worst-case time (at the cost of O(n) space) by recording extra information about the decomposability of suffixes of W as we go along -- that is, suffixes of W that we have already discovered we can or cannot match to word sequences. This type of algorithm is called dynamic programming.
The condition we need for some word W to be decomposable is exactly that W begins with some word X from the set of other words, and the suffix of W beginning at position |X|+1 is decomposable. (I'm using 1-based indices here, and I'll denote a substring of a string S beginning at position i and ending at position j by S[i..j].)
Whenever we discover that the suffix of the current word W beginning at some position i is or is not decomposable, we can record this fact and make use of it later to save time. For example, after testing the first 4 decompositions in the 8 listed earlier, we know that the suffix of W beginning at position 4 (i.e., abcabcd) is not decomposable. Then when we try the 5th decomposition, i.e., the first one starting with ab, we first ask the question: Is the rest of W, i.e. the suffix of W beginning at position 3, decomposable? We don't know yet, so we try adding c to get ab|c, and then we ask: Is the rest of W, i.e. the suffix of W beginning at position 4, decomposable? And we find that it has already been found not to be -- so we can immediately conclude that no decomposition of W beginning with ab|c is possible either, instead of having to grind through all 4 possibilities.
Assuming for the moment that the current word W is fixed, what we want to build is a function f(i) that determines whether the suffix of W beginning at position i is decomposable. Pseudo-code for this could look like:
- Build a trie the same way as Richard's solution does.
- Initialise the array KnownDecomposable[] to |W| DUNNO values.
f(i):
- If i == |W|+1 then return 1. (The empty suffix means we're finished.)
- If KnownDecomposable[i] is TRUE or FALSE, then immediately return it.
- MAIN BODY BEGINS HERE
- Walk through Richard's trie from the root, following characters in the
suffix W[i..|W|]. Whenever we find a trie node at some depth j that
marks the end of a word in the set:
- Call f(i+j) to determine whether the rest of W can be decomposed.
- If it can (i.e. if f(i+j) == 1):
- Set KnownDecomposable[i] = TRUE.
- Return TRUE.
- If we make it to this point, then we have considered all other
words that form a prefix of W[i..|W|], and found that none of
them yield a suffix that can be decomposed.
- Set KnownDecomposable[i] = FALSE.
- Return FALSE.
Calling f(1) then tells us whether W is decomposable.
By the time a call to f(i) returns, KnownDecomposable[i] has been set to a non-DUNNO value (TRUE or FALSE). The main body of the function is only run if KnownDecomposable[i] is DUNNO. Together these facts imply that the main body of the function will only run as many times as there are distinct values i that the function can be called with. There are at most |W|+1 such values, which is O(n), and outside of recursive calls, a call to f(i) takes at most O(n) time to walk through Richard's trie, so overall the time complexity is bounded by O(n^2).
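A condensed Java sketch of f(i). To keep it short, it substitutes a HashSet prefix lookup for the trie walk (my simplification, not the answer's); the KnownDecomposable memoization is as described above. When testing a candidate word W, remove W itself from the word set first.
import java.util.*;

class Decomposition {
    // known[i]: null = DUNNO, otherwise the memoized answer for the suffix W[i..]
    private static Boolean[] known;

    static boolean isDecomposable(String w, Set<String> words) {
        known = new Boolean[w.length() + 1];
        return f(w, 0, words);
    }

    // true iff the suffix of w starting at index i can be decomposed
    private static boolean f(String w, int i, Set<String> words) {
        if (i == w.length()) return true;      // empty suffix: we're finished
        if (known[i] != null) return known[i]; // memoized answer
        for (int j = i + 1; j <= w.length(); j++) {
            if (words.contains(w.substring(i, j)) && f(w, j, words)) {
                return known[i] = true;
            }
        }
        return known[i] = false;
    }

    public static void main(String[] args) {
        // The worst case discussed above now fails fast instead of exponentially.
        System.out.println(isDecomposable("abcabcabcd",
                new HashSet<>(Arrays.asList("ab", "c", "a", "bc")))); // false
    }
}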
I guess you are just confused about which words are split.
After sorting, you consider the words one after the other, by decreasing length. Let us call a "candidate" a word you are trying to decompose.
If the candidate is made of other words, it certainly starts with a word, so you will compare all prefixes of the candidate to all possible words.
During the comparison step, you compare a candidate prefix to the whole words, not to split words.
By the way, the given solution will not work for compounds of three words or more. The fix is as follows:
try every prefix of the candidate and compare it to all words
in case of a match, repeat the search with the suffix.
Example:
testingtester gives the prefixes
t, te, tes, test, testi, testin, testing, testingt, testingte, testingtes, testingtest and testingteste
Among these, test and testing are words. Then you need to try the corresponding suffixes ingtester and tester.
ingtester gives
i, in, ing, ingt, ingte, ingtes, ingtest and ingteste, none of which are words.
tester is a word and you are done.
IsComposite(InitialCandidate, Candidate):
    For all Prefixes of Candidate:
        if Prefix is in Words:
            Suffix = Candidate - Prefix
            if Suffix == "":
                if Candidate != InitialCandidate:
                    return True
            else if IsComposite(InitialCandidate, Suffix):
                return True
    return False

For all Candidate words by decreasing size:
    if IsComposite(Candidate, Candidate):
        print Candidate
        break
I would probably use recursion here. Start with the longest word and find words it starts with. For any such word remove it from the original word and continue with the remaining part in the same manner.
Pseudo code:
function iscomposed(originalword, wordpart)
    for word in allwords
        if word <> originalword
            if wordpart = word
                return yes
            elseif wordpart starts with word
                if iscomposed(originalword, wordpart - word)
                    return yes
                endif
            endif
        endif
    next
    return no
end

main
    sort allwords by length descending
    for word in allwords
        if iscomposed(word, word) return word
    next
end
Example:
words:
abcdef
abcde
abc
cde
ab
Passes:
1. abcdef starts with abcde. rest = f. 2. no word found that f starts with.
1. abcdef starts with abc. rest = def. 2. no word found that def starts with.
1. abcdef starts with ab. rest = cdef. 2. cdef starts with cde. rest = f. 3. no word found that f starts with.
1. abcde starts with abc. rest = cde. 2. cde itself found. abcde is a composed word.
To find the longest word using recursion:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class FindLongestWord {
    public static void main(String[] args) {
        List<String> input = new ArrayList<>(
                Arrays.asList("cat", "banana", "rat", "dog", "nana", "walk", "walker", "dogcatwalker"));
        List<String> sortedList = input.stream().sorted(Comparator.comparing(String::length).reversed())
                .collect(Collectors.toList());
        boolean isWordFound = false;
        for (String word : sortedList) {
            input.remove(word); // a word must not be composed of itself
            if (findPrefix(input, word)) {
                System.out.println("Longest word is : " + word);
                isWordFound = true;
                break;
            }
        }
        if (!isWordFound)
            System.out.println("Longest word not found");
    }

    public static boolean findPrefix(List<String> input, String word) {
        boolean output = false;
        if (word.isEmpty())
            return true;
        else {
            for (int i = 0; i < input.size(); i++) {
                if (word.startsWith(input.get(i))) {
                    // Strip only the matched prefix; replace() would remove
                    // every occurrence of it anywhere in the word.
                    output = findPrefix(input, word.substring(input.get(i).length()));
                    if (output)
                        return true;
                }
            }
        }
        return output;
    }
}
I have a large ArrayList of sentences and another ArrayList of words.
My program loops through the sentence list and removes an element from it if the sentence contains any of the words from the other list.
The sentences list can be very large, and I coded a quick and dirty nested for loop. While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
for (int i = 0; i < SENTENCES.size(); i++) {
    for (int k = 0; k < WORDS.size(); k++) {
        if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {
            //Do something
        }
    }
}
Is there a more efficient way of doing this than a nested for loop?
There are a few inefficiencies in your code, but at the end of the day, if you've got to search for sentences containing words, then there's no getting away from loops.
That said, there are a couple of things to try.
First, make WORDS a HashSet; its contains method will be far quicker than an ArrayList's because it does a hash look-up to get the value.
Second, switch the logic around a bit, like this:
Iterator<String> sentenceIterator = SENTENCES.iterator();
sentenceLoop:
while (sentenceIterator.hasNext())
{
    String sentence = sentenceIterator.next();
    for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
    {
        if (WORDS.contains(word))
        {
            sentenceIterator.remove();
            continue sentenceLoop;
        }
    }
}
This code (which assumes you're trying to remove sentences that contain certain words) uses Iterators and avoids the string concatenation and parsing logic of your original code (replacing it with a single regex), both of which should be quicker.
But bear in mind, as with all things performance, you'll need to test these changes to see whether they improve the situation.
What you must change is the way you handle the removal of the data. This is noted by this part of the explanation of your problem:
The sentences array list can be very large (...). While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
The cause of this is that removal from an ArrayList takes O(N), and since you're doing it inside a loop, the whole thing takes at least O(N^2).
I recommend using a LinkedList rather than an ArrayList to store the sentences, and an Iterator rather than your naive List#get, since it already offers Iterator#remove in O(1) time for a LinkedList.
In case you cannot change the design to LinkedList, I recommend storing the sentences that are valid in a new List, and at the end replacing the contents of your original List with this new List, thus saving a lot of time.
Apart from this big improvement, you can improve the algorithm even more by using a Set to store the words to look up, rather than another List, since lookup in a Set is O(1).
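A minimal sketch combining the Set lookup with the "new List" idea (names are mine):
Set<String> wordSet = new HashSet<>(WORDS);             // O(1) lookups
List<String> kept = new ArrayList<>(SENTENCES.size());
for (String sentence : SENTENCES) {
    boolean containsWord = false;
    for (String token : sentence.split("\\s+")) {
        if (wordSet.contains(token)) {
            containsWord = true;
            break;
        }
    }
    if (!containsWord) {
        kept.add(sentence);
    }
}
SENTENCES.clear();
SENTENCES.addAll(kept); // replace the contents in one pass; no O(N) removals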
What you could do is put all your words into a HashSet. This allows you to check whether a word is in the set very quickly. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.
HashSet<String> wordSet = new HashSet<>();
for (String word : WORDS) {
    wordSet.add(word);
}
Then it's just a matter of splitting each sentence into the words that make it up, and checking if any of those words are in the set.
for (String sentence : SENTENCES) {
    String[] sentenceWords = sentence.split(" "); // You probably want to split on a regex here instead of just " ", but this is just an example.
    for (String word : sentenceWords) {
        if (wordSet.contains(word)) {
            // The sentence contains one of the special words.
            // DO SOMETHING
            break;
        }
    }
}
I would create a set of words from the second ArrayList:
Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");
I would then iterate over the set and the first ArrayList and use contains:
for (String word : listOfWords) {
    for (String sentence : Sentences) {
        if (sentence.contains(word)) {
            // do something
        }
    }
}
Also, if you are free to use any open source jar, check this out:
searching string in another string
First, your program has a bug: it would not count words at the beginning and at the end of a sentence.
Your current program has runtime complexity of O(s*w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.
If the word list is relatively small (a few hundred items or so), you could use a regex to speed things up considerably: construct a pattern like this and use it in a loop:
StringBuilder regex = new StringBuilder();
boolean first = true;
// Let's say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
    if (!first) {
        regex.append('|');
    } else {
        first = false;
    }
    regex.append(w);
}
regex.append(")\\b");
// Now regex is "\b(?:quick|brown|fox)\b", i.e. your list of words
// separated by OR signs, enclosed in a non-capturing group,
// anchored to word boundaries by '\b's on both sides.
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
    if (p.matcher(SENTENCES.get(i)).find()) {
        // Do something
    }
}
Since the regex gets pre-compiled into a structure more suitable for fast searches, your program would run in O(s*max(w)), where s is the length, in characters, of all sentences, and max(w) is the length of the longest word. Given that the number of words in your collection is about 200 or 300, this could give you an order-of-magnitude decrease in running time.
If you have enough memory, you can tokenize SENTENCES and put the tokens in a Set. That would perform better and also be more correct than the current implementation.
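A small sketch of that idea (names are mine): tokenize each sentence once, then every check is a cheap set operation.
List<Set<String>> tokenized = new ArrayList<>();
for (String sentence : SENTENCES) {
    tokenized.add(new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\W+"))));
}
Set<String> wordSet = new HashSet<>(WORDS);
for (int i = 0; i < tokenized.size(); i++) {
    if (!Collections.disjoint(tokenized.get(i), wordSet)) {
        // SENTENCES.get(i) contains at least one of the words
    }
}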
Well, looking at your code, I would suggest two things that will improve the performance of each iteration:
Remove == true. The contains operation already returns a boolean, so it is enough for the if; comparing it with true adds one extra operation per iteration that is not needed.
Do not concatenate Strings inside a loop (" " + WORDS.get(k) + " "), as this is a quite expensive operation because the + operator creates new objects. Better to use a StringBuffer / StringBuilder and clear it after each iteration with stringBuffer.setLength(0);.
Besides that, for this case I do not know any other approach; maybe you can use regular expressions if you can abstract a pattern out of the words you want to remove, and then have only one loop.
Hope it helps!
If you are concerned about efficiency, I think the most effective way to do this is to use the Aho-Corasick algorithm, sketched below. While you have 2 nested loops here plus a contains() method (which, I think, takes at best length-of-sentence + length-of-word time), Aho-Corasick gives you a single loop over sentences, and checking a sentence for contained words takes only length-of-sentence time, which is length-of-word times faster (plus a preprocessing step to build the finite state machine, which is relatively small).
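To make that concrete, here is a compact, hedged Aho-Corasick sketch; the class name and API are mine, not the answer's. It reports substring matches, like contains() in the question, so pad the words with spaces if you need whole-word matching.
import java.util.*;

class AhoCorasick {
    private final List<Map<Character, Integer>> next = new ArrayList<>();
    private final List<Integer> fail = new ArrayList<>();
    private final List<Boolean> terminal = new ArrayList<>();

    AhoCorasick(Collection<String> words) {
        newNode(); // state 0 is the root
        for (String w : words) { // build the trie of the word list
            int cur = 0;
            for (char c : w.toCharArray()) {
                Integer nxt = next.get(cur).get(c);
                if (nxt == null) {
                    nxt = newNode();
                    next.get(cur).put(c, nxt);
                }
                cur = nxt;
            }
            terminal.set(cur, true);
        }
        Deque<Integer> queue = new ArrayDeque<>(next.get(0).values());
        while (!queue.isEmpty()) { // BFS to compute the failure links
            int node = queue.poll();
            for (Map.Entry<Character, Integer> e : next.get(node).entrySet()) {
                char c = e.getKey();
                int child = e.getValue();
                int f = fail.get(node); // longest proper suffix state with a c-transition
                while (f != 0 && !next.get(f).containsKey(c)) f = fail.get(f);
                Integer g = next.get(f).get(c);
                fail.set(child, g == null ? 0 : g);
                if (terminal.get(fail.get(child))) terminal.set(child, true);
                queue.add(child);
            }
        }
    }

    private int newNode() {
        next.add(new HashMap<>());
        fail.add(0);
        terminal.add(false);
        return next.size() - 1;
    }

    // true if the text contains any of the words, in one pass over the text
    boolean matchesAny(String text) {
        int state = 0;
        for (char c : text.toCharArray()) {
            while (state != 0 && !next.get(state).containsKey(c)) state = fail.get(state);
            Integer g = next.get(state).get(c);
            state = g == null ? 0 : g;
            if (terminal.get(state)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick(Arrays.asList(" dog ", " cat "));
        System.out.println(ac.matchesAny("the dog sleeps")); // true
        System.out.println(ac.matchesAny("dogma catalog"));  // false
    }
}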
I'll approach this from a more theoretical view. If you don't have a memory limitation, you can try to mimic the logic of counting sort.
Say M1 = sentences.size, M2 = number of words per sentence, and N = words.size.
Assume all sentences have the same number of words, just for simplicity.
Your current approach's complexity is O(M1.M2.N).
We can create a mapping of words to their positions in the sentences (see the sketch after this answer).
Loop through your ArrayList of sentences and change the sentences into a two-dimensional jagged array of words. Loop through the new array and create a HashMap where key, value = word, ArrayList of word positions (say with length X). That's O(2M1.M2.X) = O(M1.M2.X).
Then loop through your words ArrayList, access your word HashMap, loop through the list of word positions, and remove each one. That's O(N.X).
Say you need to give the result as an ArrayList of Strings; then we need another loop to concatenate everything. That's O(M1.M2).
The total complexity is O(M1.M2.X) + O(N.X) + O(M1.M2);
assuming X is way smaller than N, you'll probably get better performance.
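A rough Java sketch of the word-to-positions mapping described above (names are mine):
// word -> every (sentence index, word position) where it occurs
Map<String, List<int[]>> positions = new HashMap<>();
for (int s = 0; s < SENTENCES.size(); s++) {
    String[] tokens = SENTENCES.get(s).split("\\s+");
    for (int w = 0; w < tokens.length; w++) {
        positions.computeIfAbsent(tokens[w], k -> new ArrayList<>())
                 .add(new int[]{s, w});
    }
}
// Looking up each word in WORDS is now an O(1) map access:
for (String word : WORDS) {
    for (int[] hit : positions.getOrDefault(word, Collections.emptyList())) {
        // hit[0] = sentence index, hit[1] = word position; mark it for removal
    }
}
// One final pass rebuilds the surviving sentences.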
I just attempted a programming challenge, which I was not able to successfully complete. The specification is to read 2 lines of input from System.in.
A list of 1-100 space separated words, all of the same length and between 1-10 characters.
A string up to a million characters in length, which contains a permutation of the above list just once. Return the index of where this permutation begins in the string.
For example, we may have:
dog cat rat
abcratdogcattgh
3
Where 3 is the result (as printed by System.out).
It's legal to have a duplicated word in the list:
dog cat rat cat
abccatratdogzzzzdogcatratcat
16
The code that I produced works provided that the word the answer begins with has not occurred previously. In the 2nd example here, my code will fail because dog has already appeared before the answer begins at index 16.
My theory was to:
Find the index where each word occurs in the string
Extract this substring (as we have a number of known words with a known length, this is possible)
Check that all of the words occur in the substring
If they do, return the index that this substring occurs in the original string
Here is my code (it should be compilable):
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Solution {
    public static void main(String[] args) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String line = br.readLine();
        String[] l = line.split(" ");
        String s = br.readLine();
        int wl = l[0].length();
        int len = wl * l.length;
        int sl = s.length();
        for (String word : l) {
            int i = s.indexOf(word);
            int z = i;
            //while (i != -1) {
            int y = i + len;
            if (y <= sl) {
                String sub = s.substring(i, y);
                if (containsAllWords(l, sub)) {
                    System.out.println(s.indexOf(sub));
                    System.exit(0);
                }
            }
            //z += wl;
            //i = s.indexOf(word, z);
            //}
        }
        System.out.println("-1");
    }

    private static boolean containsAllWords(String[] l, String s) {
        String s2 = s;
        for (String word : l) {
            s2 = s2.replaceFirst(word, "");
        }
        if (s2.equals(""))
            return true;
        return false;
    }
}
I am able to solve my issue and make it pass the 2nd example by un-commenting the while loop. However, this has serious performance implications. When we have an input of 100 words at 10 characters each and a string of 1,000,000 characters, the time taken to complete is just awful.
Given that each case in the test bench has a maximum execution time, the addition of the while loop would cause the test to fail on the basis of not completing the execution in time.
What would be a better way to approach and solve this problem? I feel defeated.
You could concatenate the strings together and use the new string to search with.
String a = "dog";
String b = "cat";
String c = a + b; // c would be "dogcat"
This way you would overcome the problem of dog appearing somewhere else on its own.
But this wouldn't work if catdog were a valid value too.
Here is an approach (pseudo code):
stringArray keys(n) = {"cat", "dog", "rat", "roo", ...};
string bigString(1000000);
L = strlen(keys[0]); // since all are same length
int indices(n, 1000000/L); // much too big - but safe if only one word repeated over and over
for each s in keys
    f = -1
    do:
        f = find s in bigString starting at f+1 // use bigString.indexOf(s, f+1)
        write index of f to indices
    until no more found
When you are all done, you will have a series of indices (location of first letter of match). Now comes the tricky part. Since the words are all the same length, we're looking for a sequence of indices that are all spaced the same way, in the 10 different "collections". This is a little bit tedious but it should complete in a finite time. Note that it's faster to do it this way than to keep comparing strings (comparing numbers is faster than making sure a complete string is matched, obviously). I would again break it into two parts - first find "any sequence of 10 matches", then "see if this is a unique permutation".
sIndx = sort(indices(:))
dsIndx = diff(sIndx);
sequence = find {n} * 10 in dsIndx
for each s in sequence
    check if unique permutation
I hope this gets you going.
Perhaps not the best-optimized version, but how about the following theory to give you some ideas:
1. Count the total length of all the words in a row.
2. Take a random word from the list and find the starting index of its first occurrence.
3. Take a substring of the length counted above before and after that index (e.g. if the index is 15 and there are 3 words of 4 letters each, take the substring from 15-8 to 15+11).
4. Make a copy of the word list with the earlier random word removed.
5. Check the appended/prepended [word_length] letters to see if they match a new word on the list.
6. If a word matches the copy of the list, remove it from the copy and move further.
7. If all words are found, break the loop.
8. If not all words are found, find the starting index of the next occurrence of the earlier random word and go back to 3.
Why it would help:
Which word you pick to begin with wouldn't matter, since every word needs to be in the successful match anyway.
You don't have to manually loop through a lot of the characters, unless there are lots of near-complete false matches.
As a supposed match keeps growing, you have fewer words left on the list copy to compare against.
You can also keep track of the furthest index you've gone to, so you can sometimes limit the backwards length of the picked substring (as it cannot overlap with where you've already been, if the occurrences are close to each other).
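For comparison with the heuristic above, here is a hedged Java sketch of the standard fixed-size sliding-window technique for this exact problem (it relies on all words sharing one length); the class and method names are my own choosing, not from any answer here:
import java.util.*;

class PermutationSearch {
    // Returns the start index of the permutation of words in text, or -1.
    static int find(String[] words, String text) {
        int wordLen = words[0].length();
        int count = words.length;
        Map<String, Integer> need = new HashMap<>();
        for (String w : words) need.merge(w, 1, Integer::sum);

        for (int offset = 0; offset < wordLen; offset++) { // word-aligned lanes
            Map<String, Integer> have = new HashMap<>();
            int matched = 0, start = offset;
            for (int i = offset; i + wordLen <= text.length(); i += wordLen) {
                String w = text.substring(i, i + wordLen);
                if (!need.containsKey(w)) { // unknown word: reset the window
                    have.clear();
                    matched = 0;
                    start = i + wordLen;
                    continue;
                }
                have.merge(w, 1, Integer::sum);
                matched++;
                while (have.get(w) > need.get(w)) { // too many copies: shrink from the left
                    String left = text.substring(start, start + wordLen);
                    have.merge(left, -1, Integer::sum);
                    start += wordLen;
                    matched--;
                }
                if (matched == count) return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(find(new String[]{"dog", "cat", "rat", "cat"},
                                "abccatratdogzzzzdogcatratcat")); // prints 16
    }
}
Each character of the text is examined a constant number of times per offset, so the search runs in roughly O(wordLen * textLength / wordLen) = O(textLength) window steps overall, instead of re-checking whole substrings on every candidate start.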