How would I go about reasonably efficiently finding the shortest possible output given by repeatedly applying replacements to an input sequence? I believe (please correct me if I am wrong) that this is exponential-time in the worst case, but I am not sure due to the second constraint below. The naive method certainly is.
I tried coding the naive method: for every replacement and every valid position, recurse on a copy of the input with that replacement applied, then return the shortest of all the recursive results and the input itself, with a cache on the function to catch equivalent replacement sequences. It is unworkably slow, and I'm pretty sure the problem is the algorithm rather than the implementation.
A couple of things that may (or may not) make a difference:
Token is an enumerated type.
The length of the output of each entry in the map is strictly less than the input of the entry.
I do not need what replacements were done and where, just the resulting sequence.
So, as an example where each character is a token (for simplicity's sake), if I have the replacement map as aaba -> a, aaa -> ab, and aba -> bb, and I apply minimalString('aaaaa'), I want to get 'a'.
The actual method signature is something along the following lines:
List<Token> getMinimalAfterReplacements(List<Token> inputList, Map<List<Token>, List<Token>> replacements) {
?
}
Is there a better method than brute-force? If not, is there, for example, a SAT library or similar that could be harnessed? Is there any preprocessing to the map that could be done to make it faster when called multiple times with different token lists but with the same replacement map?
The code below is a Python version to find the shortest possible reduction. It is non-recursive but not too far from the naive algorithm. In every step it tries all possible single reductions, thus, obtaining a set of strings to reduce for the next step.
One optimization that helps in cases when there are "symbol eating" rules like "aa" -> "a" is to check the next set of strings for duplicates.
Another optimization (not implemented in the code below) would be to process the replacement rules into a finite automaton that finds locations of all possible single reductions with a single pass through the input string. This would not help the exponential nature of the main tree search algorithm, though.
class Replacer:
    def __init__(self, replacements):
        self.replacements = [[tuple(key), tuple(value)] for key, value in replacements.items()]

    def get_possible_replacements(self, input):
        "Return all possible variations where a single replacement was done to the input"
        result = []
        for replace_what, replace_with in self.replacements:
            #print(replace_what, replace_with)
            for p in range(1 + len(input) - len(replace_what)):
                if input[p : p + len(replace_what)] == replace_what:
                    input_copy = list(input[:])
                    input_copy[p : p + len(replace_what)] = replace_with
                    result.append(tuple(input_copy))
        return result

    def get_minimum_sequence_list(self, input):
        "Return the shortest irreducible sequence that can be obtained from the given input"
        irreducible = []
        to_reduce = [tuple(input)]
        to_reduce_new = []
        step = 1
        while to_reduce:
            print("Reduction step", step, ", number of candidates to reduce:", len(to_reduce))
            step += 1
            for current_input in to_reduce:
                reductions = self.get_possible_replacements(current_input)
                if not reductions:
                    irreducible.append(current_input)
                else:
                    to_reduce_new += reductions
            to_reduce = set(to_reduce_new[:]) # This dramatically reduces the tree width by removing duplicates
            to_reduce_new = []
        irreducible_sorted = sorted(set(irreducible), key = lambda x: len(x))
        #print("".join(input), "could be reduced to any of", ["".join(x) for x in irreducible_sorted])
        return irreducible_sorted[0]

    def get_minimum_sequence(self, input):
        return "".join(self.get_minimum_sequence_list(list(input)))

input = "aaaaa"
replacements = {
    "aaba" : "a",
    "aaa" : "ab",
    "aba" : "bb",
}
replacer = Replacer(replacements)
replaced = replacer.get_minimum_sequence(input)
print("The shortest string", input, "could be reduced to is", replaced)
Just a simple idea which might reduce the branching: With rules like
ba -> c
ca -> b
and a string like
aaabaacaa
   ^  ^
you can do two substitutions and their order doesn't matter. This is already sort of covered by memoization, however, there's still a considerable overhead for generating the useless string. So I'd suggest the following rule:
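After a substitution at position p, consider only substitutions at positions q such that
q + length(lhs_of_the_rule) > p
i.e., only substitutions that start at or to the right of the previous one, or that overlap it. A substitution lying entirely to the left of p is independent of the previous one, so that ordering is already explored on another branch.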
As a simple low-level optimization I'd suggest replacing the List<Token> with a String (or an encapsulated byte[] or short[] or whatever). The lower memory footprint should help the cache, and you can index an array by a string element (or two) in order to find out which rules may be applicable for it.
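The position rule from this answer can be sketched roughly as follows. This is only an illustration in Python, not the answerer's code: the function name `shortest_reduction` and the choice to carry the last substitution position alongside each string are assumptions, and tokens are single characters for simplicity. The pruning relies on substitutions that lie entirely to the left of the previous one commuting with it, so the skipped orderings should be reachable on another branch.

```python
def shortest_reduction(text, rules):
    """Depth-first search over (string, last_position) states.

    After a substitution at position p, only substitutions at positions q
    with q + len(lhs) > p are tried (the rule suggested above).
    Assumes every left-hand side is non-empty and longer than its right-hand side.
    """
    rule_list = list(rules.items())
    best = text
    seen = set()
    stack = [(text, 0)]  # position 0 means "no restriction yet"
    while stack:
        current, p = stack.pop()
        if (current, p) in seen:
            continue
        seen.add((current, p))
        if len(current) < len(best):
            best = current
        for lhs, rhs in rule_list:
            # smallest start position that still satisfies q + len(lhs) > p
            start = max(0, p - len(lhs) + 1)
            q = current.find(lhs, start)
            while q != -1:
                reduced = current[:q] + rhs + current[q + len(lhs):]
                stack.append((reduced, q))
                q = current.find(lhs, q + 1)
    return best


rules = {"aaba": "a", "aaa": "ab", "aba": "bb"}
print(shortest_reduction("aaaaa", rules))  # prints: a
```

The search is still exponential in the worst case; the rule only trims branches that differ in the order of independent substitutions.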
The question asked me to return a set containing all possible strings of a given length n made up of "cc" and "ddd".
For example, if the length given was 5, the set would include "ccddd" and "dddcc".
Length 6 would return a set containing "cccccc" and "dddddd",
length 7 would return a set containing "ccdddcc", "dddcccc", and "ccccddd",
and length 12 would return 12 different combinations, and so on.
However, the set returned is empty.
Can you please help?
"Please understand extremely poor coding style"
public static Set<String> set = new HashSet<String>();

public static Set<String> generateset(int n) {
    String s = strings(n, n, "");
    return set; // change this
}

public static String strings(int n, int size, String s) {
    if (n == 3) {
        s = s + ("cc");
        return "";
    }
    if (n == 2) {
        s = s + ("ddd");
        return "";
    }
    if (s.length() == size)
        set.add(s);
    return strings(n - 3, size, s) + strings(n - 2, size, s);
}
I think you'll need to rethink your approach. This is not an easy problem, so if you're extremely new to Java (and not extremely familiar with other programming languages), you may want to try some easier problems involving sets, lists, or other collections, before you tackle something like this.
Assuming you want to try it anyway: recursive problems like this require very clear thinking about how you want to accomplish the task. I think you have a general idea, but it needs to be much clearer. Here's how I would approach the problem:
(1) You want a method that returns a list (or set) of strings of length N. Your recursive method returns a single String, and as far as I can tell, you don't have a clear definition of what the resulting string is. (Clear definitions are very important in programming, but probably even more so when solving a complex recursive problem.)
(2) The strings will either begin with "cc" or "ddd". Thus, to form your resulting list, you need to:
(2a) Find all strings of length N-2. This is where you need a recursive call to get all strings of that length. Go through all strings in that list, and add "cc" to the front of each string.
(2b) Similarly, find all strings of length N-3 with a recursive call; go through all the strings in that list, and add "ddd" to the front.
(2c) The resulting list will be all the strings from steps (2a) and (2b).
(3) You need base cases. If N is 0 or 1, the resulting list will be empty. If N==2, it will have just one string, "cc"; if N==3, it will have just one string, "ddd".
You can use a Set instead of a list if you want, since the order won't matter.
Note that it's a bad idea to use a global list or set to hold the results. When a method is calling itself recursively, and every invocation of the method touches the same list or set, you will go insane trying to get everything to work. It's much easier if you let each recursive invocation hold its own local list with the results. Edit: This needs to be clarified. Using a global (i.e. instance field that is shared by all recursive invocations) collection to hold the final results is OK. But the approach I've outlined above involves a lot of intermediate results--i.e. if you want to find all strings whose length is 8, you will also be finding strings whose length is 6, 5, 4, ...; using a global to hold all of those would be painful.
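Steps (1) to (3) above can be sketched like this. It is written in Python for brevity rather than in the asker's Java, and the function name `cc_ddd_strings` is made up for the illustration. Each call returns its own local set, as recommended above, instead of touching a shared global.

```python
def cc_ddd_strings(n):
    """Return the set of all strings of length n built from "cc" and "ddd"."""
    # Base cases from step (3).
    if n < 2:
        return set()
    if n == 2:
        return {"cc"}
    if n == 3:
        return {"ddd"}
    # Step (2a): every string of length n-2, with "cc" added to the front.
    result = {"cc" + rest for rest in cc_ddd_strings(n - 2)}
    # Step (2b): every string of length n-3, with "ddd" added to the front.
    result |= {"ddd" + rest for rest in cc_ddd_strings(n - 3)}
    # Step (2c): the union of both groups.
    return result


print(sorted(cc_ddd_strings(5)))  # prints: ['ccddd', 'dddcc']
print(len(cc_ddd_strings(12)))    # prints: 12
```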
To see why the set comes back empty, simply follow the logic. Say you execute generateset(5), which calls strings(5,5,""):
First call, strings(5,5,""): (s.length() == size) is false, hence nothing is added to set
Second call, strings(2,5,""): (n == 2) is true, so the method returns immediately and nothing is added to set
Third call, strings(3,5,""): (n == 3) is true, so again nothing is added to set
So set remains unchanged.
I'm looking at finding very short substrings (pattern, needle) in many short lines of text (haystack). However, I'm not quite sure which method to use outside the naive, brute force method.
Background: I'm doing a side project for fun where I receive text messaging chat logs of multiple users (anywhere from 2000-15000 lines of text and 2-50 users), and I want to find all the various pattern matches in the chat logs based on predetermined words that I've come up with. So far I have about 1600 patterns that I'm looking for, but I may look for more.
So for example, I want to find the number of food related words that are used in an average text message log such as "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", "McDonalds". While I gave out English examples, I will actually be using Korean for my program. Each of these designated words will have their own respective score, which I put in a hashmap as key and value separately. I then show the top scorers for food related words as well as the most frequent words used by those users for food words.
My current method is to split each line of text on whitespace and check each individual word from the haystack against every pattern with the contains method (which uses the indexOf method and the naive substring search algorithm):
wordFromInput.contains(wordFromPattern);
To give an example, with 17 users in chat, 13000 lines of text, and the 1600 patterns, I've found that this whole program took 12-13 seconds with this method. And on the Android app that I'm developing, it took 2 minutes and 30 seconds to process, which is far too slow.
Originally, I tried to use a hash map and merely get the pattern instead of searching for it in the ArrayList, but I then realized that this is not possible with a hash table for what I am trying to do with substrings.
I've looked around through Stackoverflow and found a lot of helpful and related questions, such as these two:
1 and 2. I'm somewhat more familiar with the various string algorithms (Boyer Moore, KMP, etc.)
I initially thought then that the naive method would of course be the worst type of algorithm for my case, but having found this question, I've realized that my case (short pattern, short text), might actually be more effective with the naive method. But I wanted to know if there was something that I was neglecting completely.
Here is a snippet of my code though if anyone wants to see my issue more concretely.
While I removed large parts of the code to simplify it, the primary method that I use to actually match substrings is there in the method matchWords().
I know that's really ugly and bad code (5 for loops...), so if there are any suggestions for that, I'm happy to hear it as well.
So to clean it up:
lines of text from chat logs (2000-10,000+), haystack
1600+ patterns, needle(s)
mostly using Korean characters, although some English is included
Brute force naive method is simply too slow, but debating whether there are other alternatives and even if there are, whether they are practical given the nature of short patterns and text.
I just want some input on my thought process, and possibly some general advice. But additionally, I would like some specific suggestion for a particular algorithm or method if that is possible.
You can replace the hashtable with a Trie.
Split the line of text into words using white space to separate words. Then check if the word is in the Trie. If it is in the Trie, update a counter associated with the word. Ideally, the counter would be integrated into the Trie.
This approach is O(C), where C is the number of characters in the text. It's highly unlikely that you can avoid checking each character at least once, so this approach should be as good as you can get, at least in terms of big O.
However, it sounds like you may not want to list all of the possible words you are searching for. In that case, you could simply build a counting Trie from all of the words. If nothing else, that will probably make it easier for any pattern matching algorithm you use, although it might require some modifications to the Trie.
What you're describing sounds like an excellent use case for the Aho-Corasick string-matching algorithm. This algorithm finds all matches of a set of pattern strings inside of a source string and does so in linear time (plus the time to report the matches). If you have a fixed set of strings to search for, you can do linear preprocessing work up front on the patterns to search for all matches very quickly.
There's a Java implementation of Aho-Corasick available here. I haven't tried it out, but it might be a good match.
Hope this helps!
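For readers who want to see the shape of the idea, here is a compact sketch of an Aho-Corasick style matcher in Python. It is not the Java implementation linked above; the function names `build_automaton` and `find_all` and the list-of-dicts layout are assumptions made for the example. It builds a trie of the patterns, adds failure links breadth-first, and then scans the text once.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a set of non-empty patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper states follow
    # the failure links of their parent until a matching edge is found.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
# prints: [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The automaton is built once per pattern set and can then be reused for every chat line, which is where the savings over 1600 separate contains calls would come from.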
I'm pretty sure string.contains is already highly optimized, so replacing it with something else is not going to do you a lot of good.
So the way to go, I suspect, is not to look for each and every bank-word in your chat words, but rather do multiple comparisons at once.
The first way to do it would be to create one huge regular expression that will match all your bank-words. Compile it and hope the regular expression package is efficient enough (chances are - it is). You will have a rather lengthy setup stage (the regex compilation), but matches should be a lot faster.
You can build an index of the words you need to match and count them as you process them. If you can use a HashMap to lookup the patterns for each word, the cost will be O(n * m)
You can use a HashMap for all the possible words, you can then dissect the words later.
e.g. say you need to match red and apple, you can combine the sum of
redapple = 1
applered = 0
red = 10
apple = 15
This means that red is actually 11 (10 + 1), and apple is 16 (15 + 1)
I don't know Korean, so I imagine the strategies used to tinker with Strings in English aren't necessarily possible in the same way in Korean, but perhaps this strategy in pseudocode can be applied with your knowledge of Korean to make it work. (Java is of course still the same, but for example, in Korean is it still highly likely for the letters "ough" to be in succession? Are there even letters for "ough"?) With that being said, hopefully the principle can be applied.
I would use String.toCharArray to create a two-dimensional array (or ArrayList if variable size is needed). Then:
if (first letter of word matches keyword's first letter)//we have a candidate
skip to last letter of the current word //see comment below
if(last letter of word matches keyword's last letter)//strong candidate
iterate backwards to start+1 checking remainder of letters
The reason I suggest skipping to the last letter is that, statistically, a "consonant, vowel" pair for the first two letters of a word is very common, especially in nouns, which will make up a lot of your keywords since any food is a noun (almost all the keyword examples you gave matched that consonant, vowel structure). And since there are only 5 vowels (plus y), the second letter "i" of the keyword "pizza" is inherently likely to show up, yet after that point there is still a good chance that the word turns out not to be a match.
However if you know that the first letter and the last letter match, then you probably have a much stronger candidate and can then iterate in reverse. I think over larger sets of data, this would eliminate candidates much faster than checking letters in order. Basically you'd be letting too many fake candidates past the second iteration, thus increasing your overall conditional operations. It might sound like something small, but in a project like this there's lots of reiterating, so micro-optimizations will accumulate very quickly.
If this approach can be applied in a language that's probably structurally very different from English(I'm speaking from ignorance here though), then I think it might provide some efficiency for you whether you make it happen through iterating a char array or with a scanner, or any other construct.
The trick is to realise that if you can describe the string you are searching for as a regular expression you can also, by definition, describe it with a state machine.
At every character in your message start a state machine for every one of your 1600 patterns and pass the character through it. This sounds scary but believe me most of them will terminate immediately anyway so you aren't really doing a huge amount of work. Bear in mind that a state machine can usually be encoded with a simple switch/case or a ch == s.charAt at each step so they are close to the ultimate in light-weight.
Obviously you know what to do whenever one of your search machines terminates at the end of their search. Any that terminate before full-match can be discarded immediately.
private static class Matcher {
    private final int where;
    private final String s;
    private int i = 0;

    public Matcher ( String s, int where ) {
        this.s = s;
        this.where = where;
    }

    public boolean match(char ch) {
        return s.charAt(i++) == ch;
    }

    public int matched() {
        return i == s.length() ? where: -1;
    }
}

// Words I am looking for.
String[] watchFor = new String[] {"flies", "like", "arrow", "banana", "a"};
// Test string to search.
String test = "Time flies like an arrow, fruit flies like a banana";

public void test() {
    // Use a LinkedList because it is O(1) to remove anywhere.
    List<Matcher> matchers = new LinkedList<> ();
    int pos = 0;
    for ( char c : test.toCharArray()) {
        // Fire off all of the matchers at this point.
        for ( String s : watchFor ) {
            matchers.add(new Matcher(s, pos));
        }
        // Discard all matchers that fail here.
        for ( Iterator<Matcher> i = matchers.iterator(); i.hasNext(); ) {
            Matcher m = i.next();
            // Should it be removed?
            boolean remove = !m.match(c);
            if ( !remove ) {
                // Still matches! Is it complete?
                int matched = m.matched();
                if ( matched >= 0 ) {
                    // Todo - Should use getters.
                    System.out.println(" "+m.s +" found at "+m.where+" active matchers "+matchers.size());
                    // Complete!
                    remove = true;
                }
            }
            // Remove it where necessary.
            if ( remove ) {
                i.remove();
            }
        }
        // Step pos to keep track.
        pos += 1;
    }
}
prints
flies found at 5 active matchers 6
like found at 11 active matchers 6
a found at 16 active matchers 2
a found at 19 active matchers 2
arrow found at 19 active matchers 6
flies found at 32 active matchers 6
like found at 38 active matchers 6
a found at 43 active matchers 2
a found at 46 active matchers 3
a found at 48 active matchers 3
banana found at 45 active matchers 6
a found at 50 active matchers 2
There are several simple optimisations. With some simple pre-processing the most obvious is to use the current character to determine which matchers may be applicable.
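One way to read that last optimisation is to bucket the search words by their first character, so that at each position only the words that could possibly start there are tried. The sketch below is in Python rather than Java and is an interpretation, not the answerer's code; `find_words` is a hypothetical name, positions are start offsets, and every word is assumed to be non-empty.

```python
from collections import defaultdict


def find_words(text, words):
    """Return (word, start_position) for every occurrence of any word."""
    # Pre-processing: group the words by their first character.
    by_first = defaultdict(list)
    for word in words:
        by_first[word[0]].append(word)

    hits = []
    for pos, ch in enumerate(text):
        # Only words starting with the current character are candidates.
        for word in by_first.get(ch, ()):
            if text.startswith(word, pos):
                hits.append((word, pos))
    return hits


watch_for = ["flies", "like", "arrow", "banana", "a"]
print(find_words("Time flies like an arrow", watch_for))
# prints: [('flies', 5), ('like', 11), ('a', 16), ('arrow', 19), ('a', 19)]
```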
This is a pretty broad question, so I won't go into too much detail, but roughly:
Pre-process the haystacks using something like a broad lemmatizer to create "topic word only" versions of the messages by noting which topics all words in it cover. For example, any occurrences of "hamburger", "pizza", "coke", "lunch", "dinner", "restaurant", or "McDonalds" would cause the "topic" word "food" to be collected for that message. Some words may have multiple topics, e.g. "McDonalds" may be in the topics "food" and "business". Most words won't have any topic.
After this process, you'll have haystacks consisting of only "topic" words. Then create a Map<String, Set<Integer>> and populate it with the topic word and the Set of chat message ids that contain it. This is a reverse index from topic word to the chat messages that contain it.
The runtime code to find all documents that contain all n words is then trivial and super fast - near O(#terms):
private Map<String, Set<Integer>> index; // pre-populated

Set<Integer> search(String... topics) {
    Set<Integer> results = null;
    for (String topic : topics) {
        Set<Integer> hits = index.get(topic);
        if (hits == null)
            return Collections.emptySet();
        if (results == null)
            results = new HashSet<Integer>(hits);
        else
            results.retainAll(hits);
        if (results.isEmpty())
            return Collections.emptySet(); // exit early
    }
    return results;
}
This will perform near O(1), and tell you which messages share all search terms. If you just want the number, use the trivial size() of the returned Set.
I have a list of keywords and I want to be able to find if a string contains any of those keywords. Right now the solution I have takes O(n). Is there a quicker way of doing this search without looping through each keywords and doing a comparison/contains?
i.e.
Keywords = "cat", "hat", "mat", "bat", "fat", "sat", "rat", "pat", "foo bar", "foo-bar"
String = "There is a cat in the box."
The result of this is true because "cat" matches one of the words in the 'keywords'
EDIT:
I guess I wasn't as clear when I said O(n). I mean to say O(n) where n=number of keywords.
You can use Boyer-Moore, which involves preprocessing the keywords (the patterns), but you're not going to be able to beat a worst case of O(KN), where K is the sum of the lengths of the keywords, and N is the length of the string. Best case is of course sublinear, but you can't have a worst case sublinear runtime.
Note that the comparisons aren't free. It's not like you can compare two strings in O(1) to see if they're equal, you have to iterate through the characters. Hashing gets you to what you need to compare to in constant time, but doesn't help more than that since two different strings can have the same hash. That's not to say that hashing isn't good, it is, but it does not alter the worst case runtime complexity.
In the end, you need to compare the characters, and Boyer-Moore provides a very good way to do that. Of course if you're using some sort of hash based build, you may be able to rule out certain keywords in amortized constant time, but that doesn't change the fact that in the worst case (and many other cases), you're going to need to compare characters.
Also note that depending on what we assume about the data, and how we construct our indexing structure(s), it's possible to achieve a very good actual runtime. Just because the worst case complexity isn't sublinear doesn't mean that the actual runtime won't be really fast. There is no singleton simple or correct solution, the problem can be approached in myriad ways. There's never a quick and dirty answer to solve all of your problems when it comes to information retrieval.
You could try using contains().
Take the string, e.g. String passed = "there is a cat in the box";
then use a for loop to go through your keywords, assuming keywords is an array.
boolean found = false;
for (int i = 0; i < keywords.length; i++) {
    if (passed.toLowerCase().contains(keywords[i])) {
        found = true; // a keyword matched, no need to check the rest
        break;
    }
}
Whether you go through a loop or check each word individually, I don't think you'll get much better than O(n).
k = # of chars in sentence
n = # of keywords
m = # of words in sentence
You can get O(k + n) time complexity by hashing the words in sentence.
Separating the sentence into words takes O(k). Creating the HashSet also takes O(k). Checking the hash n times takes n*O(1) = O(n), so the overall time complexity is O(k + n).
Edit1: Hashing all n keywords is technically n*O(k/m), where k/m is avg. word length. However, k/m does not scale with the size of the input, so it still gives O(n).
Edit2: FYI, Boyer-Moore will match any substring, not just whole words; e.g. "cat" will match "caterpillar". Also, because it is more general it has a worse running time than a simple word match, O(KN) as @SteveP. has in his answer.
So if you only need word matching, as opposed to substring matching, stick with hashing as above.
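A rough sketch of the hashing approach described in this answer, in Python for brevity. The name `contains_keyword` and the tokenising regex are assumptions; note that a multi-word keyword such as "foo bar" can never match with this scheme, because the sentence is split into single words first.

```python
import re


def contains_keyword(sentence, keywords):
    """True if any keyword appears as a whole word in the sentence."""
    # O(k): split the sentence into lowercase words and hash them.
    words = set(re.findall(r"[\w-]+", sentence.lower()))
    # O(n): one constant-time set lookup per keyword.
    return any(keyword in words for keyword in keywords)


keywords = ["cat", "hat", "mat", "bat", "fat", "sat", "rat", "pat", "foo bar", "foo-bar"]
print(contains_keyword("There is a cat in the box.", keywords))       # prints: True
print(contains_keyword("There is a caterpillar in the box.", keywords))  # prints: False
```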
I'm not sure this runs in O(n),
but a solution to find the element could look like this:
List<String> keywords = new ArrayList<String> (Arrays.asList("cat", "hat", "mat", "bat", "fat", "sat", "rat", "pat", "foo bar", "foo-bar"));
String search= "There is a cat in the box." ;
List<String> searchWords = new ArrayList<String> (Arrays.asList(search.split(" ")));
System.out.println(!Collections.disjoint(keywords,searchWords));
You're probably not going to get any better than O(n), since there's a linear component to this piece - you have to trawl the string in some shape, form, or fashion.
Consider using a Set:
Constant-time to add each element (so O(N) amortized for N elements)
Constant-time look up for existence
public boolean inPhrase(String phrase, String searchWord) {
    Set<String> phraseSet = new HashSet<>();
    // remove the punctuation and split the words on white space.
    for (String s : phrase.replaceAll("[.,?!;\"']", "").split(" ")) {
        phraseSet.add(s);
    }
    return phraseSet.contains(searchWord);
}
Interesting algorithm problem I would like to get the community's opinion on. I want to search a sorted ArrayList<String> and get a boolean result indicating whether any String in the list begins with certain characters.
Ex. Array {"he", "help", "helpless", "hope"}
search character = h
Result: true
search character = he
Result: true
search character = hea
Result: false
Now my first impression was that I should combine binary search with regex but let me know if I am way off. While trie would be the best implementation I need a solution that minimizes heap memory (developing on android) as this array in practicality will contain ~10,000-20,000 entries (words).
I have a db that contains ~200,000 words. I am taking a subset beginning with a set letter (in my example h) which will contain ~20,000 entries and inserting these into an array. I am then performing ~100-1,000 lookups/contains using this subset. The thought in my approach was to increase performance time (instead of db querying) while trying to minimize the hit to memory (array instead of trie tree)
Perhaps a DAWG would optimize lookup however I'm not sure if the size requirements for this structure would be significantly larger than an ArrayList?
If you really want to avoid a trie, this should fit your needs:
NavigableSet<String> tree = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
tree.addAll(Arrays.asList("he", "help", "helpless", "hope"));
String[] queries = {"h", "he", "hea"};
for (String query : queries) {
    // ceiling() returns null when no element is >= query, so guard against it
    String higher = tree.ceiling(query);
    System.out.println(query + ": " + (higher != null && higher.startsWith(query)));
}
prints
h: true
he: true
hea: false
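The same "smallest element not less than the query" idea also works directly on a sorted array with a binary search, which avoids building a tree at all. A minimal sketch in Python, assuming the list is already sorted in plain (case-sensitive) string order; `has_prefix` is a name made up for the example.

```python
from bisect import bisect_left


def has_prefix(sorted_words, prefix):
    """True if any word in the sorted list starts with prefix."""
    # Index of the first word that is >= prefix; if any word has the
    # prefix, the word at this index does.
    i = bisect_left(sorted_words, prefix)
    return i < len(sorted_words) and sorted_words[i].startswith(prefix)


words = ["he", "help", "helpless", "hope"]
for query in ["h", "he", "hea"]:
    print(query, has_prefix(words, query))
# prints:
# h True
# he True
# hea False
```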
You should consider http://en.wikipedia.org/wiki/Skip_list as an option. Many java implementations are readily available
I have a list of strings in Java containing first name of a person with dissimilar spellings (not entirely different). For example, John may be spelled as Jon, Jawn, Jaun etc. How should I retrieve the most appropriate string in this list. If anyone can suggest a method how to use Soundex in this case, it shall be of great help.
You have to use an approximate string matching algorithm. There are several strategies to implement this. Blur is a Trie-based Java implementation of approximate string matching based on the Levenshtein word distance.
Another strategy is the Boyer-Moore approximate string matching algorithm.
The usual approach to this problem with these algorithms and the Levenshtein word distance is to compare the input to the possible outputs and choose the one with the smallest distance to the desired output.
There is a jar file for approximate string matching.
Follow the link and download frej.jar:
http://sourceforge.net/projects/frej/files/
It contains the method
Fuzzy.equals("jon","john");
which returns true for approximate matches of this kind.
Solr can do this, if you use the phonetic filter factory while indexing the text.
It is solr's speciality to search. And search for similar sounding words. However if you just want this, and not want other features offered by solr, then you can use the source available here.
There are lots of theories and methods to estimate the match of 2 strings
Giving a blunt true/false result seems strange since "jon" really doesn't equal "john", it's close but doesn't match
One great academic work implementing quite a few estimation methods is called "SecondString.jar" - site link
Most implemented methods give some score to the match, this score depends on the method used
Example:
Let's define "Edit Distance" as the number of char changes required in str1 to get to str2
in this case "jon" --> "john" requires 1 char addition
naturally for this method a lower score is better
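To make the edit-distance score concrete, here is a small sketch of the standard dynamic-programming (Levenshtein) computation in Python, plus a helper that picks the candidate with the lowest score. It is not taken from SecondString; the names `edit_distance` and `closest` are made up for the example.

```python
def edit_distance(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn a into b (Levenshtein distance)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                current[j - 1] + 1,            # insert cb
                previous[j] + 1,               # delete ca
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def closest(target, candidates):
    """Candidate with the smallest edit distance to target."""
    return min(candidates, key=lambda c: edit_distance(target, c))


print(edit_distance("jon", "john"))             # prints: 1
print(closest("John", ["Jon", "Jawn", "Jaun"]))  # prints: Jon
```

A phonetic code such as Soundex could be used as the scoring function instead; the selection step would stay the same.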
This article provides a detailed explanation and complete code on a Trie-based Java implementation of approximate string matching:
Fast and Easy Levenshtein distance using a Trie.
# The search function returns a list of all words that are within the given
# maximum distance from the target word.
# `trie` is the root node of the trie built in the article; it is not defined here.
def search( word, maxCost ):
    # build first row
    currentRow = list( range( len(word) + 1 ) )
    results = []
    # recursively search each branch of the trie
    for letter in trie.children:
        searchRecursive( trie.children[letter], letter, word, currentRow,
            results, maxCost )
    return results

# This recursive helper is used by the search function above. It assumes that
# the previousRow has been filled in already.
def searchRecursive( node, letter, word, previousRow, results, maxCost ):
    columns = len( word ) + 1
    currentRow = [ previousRow[0] + 1 ]
    # Build one row for the letter, with a column for each letter in the target
    # word, plus one for the empty string at column 0
    for column in range( 1, columns ):
        insertCost = currentRow[column - 1] + 1
        deleteCost = previousRow[column] + 1
        if word[column - 1] != letter:
            replaceCost = previousRow[ column - 1 ] + 1
        else:
            replaceCost = previousRow[ column - 1 ]
        currentRow.append( min( insertCost, deleteCost, replaceCost ) )
    # if the last entry in the row indicates the optimal cost is at most the
    # maximum cost, and there is a word in this trie node, then add it.
    if currentRow[-1] <= maxCost and node.word is not None:
        results.append( (node.word, currentRow[-1] ) )
    # if any entries in the row are at most the maximum cost, then
    # recursively search each branch of the trie
    if min( currentRow ) <= maxCost:
        for letter in node.children:
            searchRecursive( node.children[letter], letter, word, currentRow,
                results, maxCost )