Given a set of 50k strings, I need to find all pairs (s, t), such that s, t and s + t are all contained in this set.
What I've tried
Currently, there's an additional constraint: s.length() >= 4 && t.length() >= 4. This makes it possible to group the strings by length-4 prefixes and, separately, suffixes. Then, for every string composed of length at least 8, I look up the set of candidates for s using the first four characters of composed and the set of candidates for t using its last four characters. This works, but it needs to look at 30M candidate pairs (s, t) to find the 7k results.
This surprisingly high number of candidates comes from the fact that the strings are (mostly German) words from a limited vocabulary, and words often start and end the same way. It's still much better than trying all 2.5G pairs, but much worse than I had hoped.
What I need
As the additional constraint may get dropped and the set will grow, I'm looking for a better algorithm.
The "missing" question
There were complaints about me not asking a question. So the missing question mark is at the end of the next sentence. How can this be done more efficiently, ideally without using the constraint?
Algorithm 1: Test pairs, not singles
One way could be, instead of working from all possible pairs toward all possible composite strings containing those pairs, to work from all possible composite strings and see whether they contain pairs. This changes the problem from n^2 lookups (where n is the number of strings >= 4 characters) to m * n lookups (where m is the average number of split points in a string >= 8 characters, i.e. its length minus 7, and n is now the number of strings >= 8 characters). Here's one implementation of that:
int minWordLength = 4;
int minPairLength = 8;

Set<String> strings = Stream
    .of(
        "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
        "bear", "hug", "bearhug", "cur", "curlique", "curl",
        "down", "downstream", "stream"
    )
    .filter(s -> s.length() >= minWordLength)
    .collect(ImmutableSet.toImmutableSet());

strings
    .stream()
    .filter(s -> s.length() >= minPairLength)
    .flatMap(s -> IntStream
        .rangeClosed(minWordLength, s.length() - minWordLength)
        .mapToObj(splitIndex -> ImmutableList.of(
            s.substring(0, splitIndex),
            s.substring(splitIndex)
        ))
        .filter(pair ->
            strings.contains(pair.get(0))
            && strings.contains(pair.get(1))
        )
    )
    .map(pair ->
        pair.get(0) + pair.get(1) + " = " + pair.get(0) + " + " + pair.get(1)
    )
    .forEach(System.out::println);
Gives the result:
downstream = down + stream
This has average algorithmic complexity of m * n as shown above. Since m is effectively a small constant, that is O(n) in practice; in the worst case it is O(n^2). See hash table for more on the algorithmic complexity.
Explanation
Put all strings four or more characters long into a hash set (which takes average O(1) complexity for search). I used Guava's ImmutableSet for convenience. Use whatever you like.
filter: Restrict to only the items that are eight or more characters in length, representing our candidates for being a composition of two other words in the list.
flatMap: For each candidate, compute all possible pairs of sub-words, ensuring each is at least 4 characters long. Since there can be more than one result, this is in effect a list of lists, so flatten it into a single-deep list.
rangeClosed: Generate all integers representing the number of characters that will be in the first word of the pair we will check.
mapToObj: Use each integer combined with our candidate string to output a list of two items (in production code you'd probably want something more clear like a two-property value class, or an appropriate existing class).
filter: Restrict to only pairs where both are in the list.
map: Pretty up the results a little.
forEach: Output to the console.
Algorithm Choice
This algorithm is tuned to words that are way shorter than the number of items in the list. If the list were very short and the words were very long, then switching back to a composition task instead of a decomposition task would work better. Given that the list is 50,000 strings in size, and German words while long are very unlikely to exceed 50 characters, that is a 1:1000 factor in favor of this algorithm.
If on the other hand, you had 50 strings that were on average 50,000 characters long, a different algorithm would be far more efficient.
Algorithm 2: Sort and keep a candidate list
One algorithm I thought about for a little while was to sort the list, with the knowledge that if a string represents the start of a pair, all candidate strings that could complete it will come immediately after it in order, among the set of items that start with that string. Sorting my tricky data above, and adding some confounders (downer, downs, downregulate), we get:
a
abc
abcdef
bear
bearhug
cur
curl
curlique
def
down ---------\
downer        |
downregulate  | not far away now!
downs         |
downstream ---/
hug
shine
stream
sun
sunshine
Thus if a running set of all items to check were kept, we could find candidate composites in essentially constant time per word, then probe directly into a hash table for the remainder word:
int minWordLength = 4;

Set<String> strings = Stream
    .of(
        "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
        "bear", "hug", "bearhug", "cur", "curlique", "curl",
        "down", "downs", "downer", "downregulate", "downstream", "stream")
    .filter(s -> s.length() >= minWordLength)
    .collect(ImmutableSet.toImmutableSet());

ImmutableList<String> orderedList = strings
    .stream()
    .sorted()
    .collect(ImmutableList.toImmutableList());

List<String> candidates = new ArrayList<>();
List<Map.Entry<String, String>> pairs = new ArrayList<>();

for (String currentString : orderedList) {
    List<String> nextCandidates = new ArrayList<>();
    nextCandidates.add(currentString);
    for (String candidate : candidates) {
        if (currentString.startsWith(candidate)) {
            nextCandidates.add(candidate);
            String remainder = currentString.substring(candidate.length());
            if (remainder.length() >= minWordLength && strings.contains(remainder)) {
                pairs.add(new AbstractMap.SimpleEntry<>(candidate, remainder));
            }
        }
    }
    candidates = nextCandidates;
}
pairs.forEach(System.out::println);
Result:
down=stream
The algorithmic complexity of this one is a little more complicated. The searching part, I think, is O(n) average, with an O(n^2) worst case. The most expensive part might be the sorting, which depends on the algorithm used and the characteristics of the unsorted data. So take this one with a grain of salt, but it has possibility. It seems to me that this is going to be way less expensive than building a Trie out of an enormous data set, because here you only probe the data comprehensively once and don't get any amortization of the Trie's build cost.
Also, this time I chose a Map.Entry to hold the pair. It's completely arbitrary how you do it. Making a custom Pair class or using some existing Java class would be fine.
You can improve Erik’s answer by avoiding most of the sub-String creation using CharBuffer views and altering their position and limit:
Set<CharBuffer> strings = Stream.of(
        "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
        "bear", "hug", "bearhug", "cur", "curlique", "curl",
        "down", "downstream", "stream"
    )
    .filter(s -> s.length() >= 4) // < 4 is irrelevant
    .map(CharBuffer::wrap)
    .collect(Collectors.toSet());

strings
    .stream()
    .filter(s -> s.length() >= 8)
    .map(CharBuffer::wrap) // fresh view, so moving position/limit can't disturb the set elements
    .flatMap(cb -> IntStream.rangeClosed(4, cb.length() - 4)
        // position(i) narrows the view to the suffix; flip() then yields the prefix
        .filter(i -> strings.contains(cb.clear().position(i)) && strings.contains(cb.flip()))
        .mapToObj(i -> cb.clear() + " = " + cb.limit(i) + " + " + cb.clear().position(i))
    )
    .forEach(System.out::println);
This is the same algorithm, hence doesn’t change the time complexity, unless you incorporate the hidden character data copying costs, which would be another factor (times the average string length).
Of course, the differences become significant only if you use a different terminal operation than printing the matches, as printing is quite an expensive operation. Likewise, when the source is a stream over a large file, the I/O will dominate the operation, unless you go in an entirely different direction, like using memory mapping and refactoring this operation to work over ByteBuffers.
A possible solution could be this:
You start with one string as your candidate prefix and a second string as your candidate suffix.
You go through each string. If the string begins with the prefix string, you check whether it ends with the suffix string, and keep going until the end. To save some time, before checking the letters themselves you could make a length check: the composite must be exactly as long as the prefix and suffix combined.
It's pretty much what you made, but with this added length check you might be able to trim off a few comparisons. That's my take on it at least.
Not sure if this is better than your solution but I think it's worth a try.
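Taken literally, the idea is something like the sketch below (hypothetical names; strings is the full set). Note that enumerating both candidate words makes this a triple loop, so the length check has to prune very aggressively for it to pay off:

for (String s : strings) {
    for (String t : strings) {
        int composedLength = s.length() + t.length();
        for (String w : strings) {
            // the cheap length check, before any character comparison
            if (w.length() != composedLength) continue;
            if (w.startsWith(s) && w.endsWith(t)) {
                System.out.println(w + " = " + s + " + " + t);
            }
        }
    }
}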
Build two Tries, one with the candidates in normal order, the other with the words reversed.
Walk the forwards Trie from depth 4 inwards and use the remainder of the leaf to determine the suffix (or something like that) and look it up in the backwards Trie.
I've posted a Trie implementation in the past here https://stackoverflow.com/a/9320920/823393.
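Only a rough sketch of that idea, since the answer leaves the details open (the linked answer has a real Trie implementation; words below is a hypothetical Set<String> holding the input). The forwards Trie enumerates every dictionary prefix of a candidate in a single walk; the backwards Trie, fed reversed words, confirms the remainder:

class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isWord;

    void add(String s) {
        Trie node = this;
        for (char c : s.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isWord = true;
    }

    // Lengths of all dictionary words that are prefixes of s, found in one walk.
    List<Integer> prefixLengths(String s) {
        List<Integer> lengths = new ArrayList<>();
        Trie node = this;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) break;
            if (node.isWord) lengths.add(i + 1);
        }
        return lengths;
    }

    boolean containsWord(String s) {
        Trie node = this;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node != null && node.isWord;
    }
}

Trie forwards = new Trie();
Trie backwards = new Trie();
for (String w : words) {
    forwards.add(w);
    backwards.add(new StringBuilder(w).reverse().toString());
}
for (String w : words) {
    for (int split : forwards.prefixLengths(w)) {
        if (split == w.length()) continue;                 // w itself, not a pair
        if (split < 4 || w.length() - split < 4) continue; // the >= 4 constraint
        String t = w.substring(split);
        if (backwards.containsWord(new StringBuilder(t).reverse().toString())) {
            System.out.println(w + " = " + w.substring(0, split) + " + " + t);
        }
    }
}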
Related
I am trying to write an algorithm that tells me how many pairs I could generate with items coming from multiple set of values. For example I have the following sets:
{1,2,3} {4,5} {6}
From these sets I can generate 11 pairs:
{1,4}, {1,5}, {1,6}, {2,4}, {2,5}, {2,6}, {3,4}, {3,5}, {3,6}, {4,6}, {5,6}
I wrote the following algorithm:
int result = 0;
for (int k = 0; k < numberOfSets; k++) { // map is a list where I store the sizes of my sets
    int size1 = map.get(k);
    for (int l = k + 1; l < numberOfSets; l++) {
        int size2 = map.get(l);
        result += size1 * size2;
    }
}
But as you can see the algorithm is not very scalable. If the number of sets increases the algorithm starts performing very poorly.
Am I missing something? Is there an algorithm that can help me with this? I have been looking at combination and permutation algorithms, but I am not sure that's the right path for this.
Thank you very much in advance
First of all, if the order in the pairs matters, then starting with int l=k+1 in the inner cycle is erroneous: e.g. you are missing {4,1}. If you consider it equal to {1,4}, then the result is correct; otherwise it isn't.
Second, to complicate the matter further, you don't say whether the pairs need to be unique or not. E.g. {1,2}, {2,3}, {4} will generate {2,4} twice. If you need to count it as unique, the result of your code is incorrect, and you will need to keep a Set<Pair<int,int>> to remove the duplicates, which means you will need to scan those sets and actually generate the pairs.
The good news: while you can't do better than O(N^2) just for counting the pairs, even if you have thousands of sets, the millions of integer multiplications/additions are fast enough on today's computers; e.g. Eigen deals quite well with O(N^3) operations for floating-point multiplications (see matrix multiplication).
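If uniqueness does matter, a minimal sketch of the dedup approach just described, assuming the actual sets are available (here, hypothetically, as a List<Set<Integer>> named sets rather than just their sizes):

Set<List<Integer>> uniquePairs = new HashSet<>();
for (int k = 0; k < sets.size(); k++) {
    for (int l = k + 1; l < sets.size(); l++) {
        for (int a : sets.get(k)) {
            for (int b : sets.get(l)) {
                // normalize the order so {2,4} and {4,2} are counted once
                uniquePairs.add(Arrays.asList(Math.min(a, b), Math.max(a, b)));
            }
        }
    }
}
int result = uniquePairs.size();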
Assuming you only care about the number of pairs, and are counting duplicates, then there is a more efficient algorithm:
Keep track of the running number of pairs, and the number of elements encountered so far.
Go over the list from the end to the start.
For each new set, the number of new pairs we can make is the size of the set times the number of elements encountered so far. Add this to the running number of pairs.
Add the size of the new set to the number of elements encountered so far.
The code:
int numberOfPairs = 0;
int elementsEncountered = 0;
for (int k = numberOfSets - 1; k >= 0; k--) {
    int sizeOfCurrentSet = map.get(k);
    int numberOfNewPairs = sizeOfCurrentSet * elementsEncountered;
    numberOfPairs += numberOfNewPairs;
    elementsEncountered += sizeOfCurrentSet;
}
The key point to realize is that when we count the number of new pairs that each set contributes, it doesn't matter from which set we select the second element of the pair. That is, we don't need to keep track of any set which we have already analyzed.
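Since every element pairs exactly once with every element outside its own set, the same count also has a closed form: with T the total number of elements and s_k the size of set k, the answer is (T^2 - sum of the s_k^2) / 2. A sketch, reusing the question's map of sizes:

long total = 0, sumOfSquares = 0;
for (int k = 0; k < numberOfSets; k++) {
    long size = map.get(k);
    total += size;
    sumOfSquares += size * size;
}
long numberOfPairs = (total * total - sumOfSquares) / 2;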
Trying to figure out whether it is possible to find the first index, within one string, of a character that also appears in another string. So for example:
String first = "test";
String second = "123er";
int value = get(first, second);
// method would return 1: 'e' is the first character of "123er" that
// also occurs in "test", and it sits at index 1 of "test"
So I'm trying to accomplish this using parallel streams. I know I can find whether there is a matching character fairly simply, like so:
test.chars().parallel().anyMatch(c -> other.indexOf(c) >= 0);
How would I use this to find the exact index?
If you really care for performance, you should try to avoid the O(n × m) time complexity of iterating over one string for every character of the other. So, first iterate over one string to get a data structure supporting efficient (O(1)) lookup, then iterate over the other utilizing this.
BitSet encountered = new BitSet();
test.chars().forEach(encountered::set);
int index = IntStream.range(0, other.length())
    .filter(ix -> encountered.get(other.charAt(ix)))
    .findFirst().orElse(-1);
If the strings are sufficiently large, the O(n + m) time complexity of this solution will pay off in much shorter execution times. For smaller strings, it's irrelevant anyway.
If you really think the strings are large enough to benefit from parallel processing (which is very unlikely), you can perform both operations in parallel, with small adaptations:
BitSet encountered = CharBuffer.wrap(test).chars().parallel()
.collect(BitSet::new, BitSet::set, BitSet::or);
int index = IntStream.range(0, other.length()).parallel()
.filter(ix -> encountered.get(other.charAt(ix)))
.findFirst().orElse(-1);
The first operation uses the slightly more complicated, parallel compatible collect now and it contains a not-so-obvious change for the Stream creation.
The problem is described in bug report JDK-8071477. Simply said, the stream returned by String.chars() has a poor splitting capability, hence a poor parallel performance. The code above wraps the string in a CharBuffer, whose chars() method returns a different implementation, having the same semantics, but a good parallel performance. This work-around should become obsolete with Java 9.
Alternatively, you could use IntStream.range(0, test.length()).map(test::charAt) to create a stream with a good parallel performance. The second operation already works that way.
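For illustration, here is the first operation rewritten to use that index-based stream; the semantics are the same as the CharBuffer variant above:

BitSet encountered = IntStream.range(0, test.length()).parallel()
    .map(test::charAt)
    .collect(BitSet::new, BitSet::set, BitSet::or);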
But, as said, for this specific task it’s rather unlikely that you ever encounter strings large enough to make parallel processing beneficial.
You can do it by relying on String#indexOf(int ch), keeping only the values >= 0 to drop the characters that don't occur, and then taking the first value.
// Get the index of each character of test in other
// Keep only the non-negative values
// Then return the first match
// Or -1 if we have no match
int result = test.chars()
    .parallel()
    .map(other::indexOf)
    .filter(i -> i >= 0)
    .findFirst()
    .orElse(-1);
System.out.println(result);
Output:
1
NB 1: The result is 1, not 2, because indexes start from 0, not 1.
NB 2: Unless you have a very, very long String, using a parallel Stream in this case should not help much in terms of performance, because the tasks are not complex and creating, starting and synchronizing threads has a very high cost, so you will probably get your result more slowly than with a normal stream.
Upgrading Nicolas' answer here: the min() method enforces consumption of the whole Stream. In such cases it's better to use findFirst(), which stops the whole execution after finding the first matching element rather than computing the minimum of all:
test.chars().parallel()
    .map(other::indexOf)
    .filter(i -> i >= 0)
    .findFirst()
    .ifPresent(System.out::println);
I am trying to find out whether there is a good way to search (count the number of occurrences) and then sort a String array in an efficient way, that is, a way that will work well in embedded systems (32Mb).
Example: I have to count the number of times the characters A, B, C, etc. are used and save that result for later sorting...
I can count using a public int count(String searchDomain, char searchValue) method, but I need the counts for every letter of the alphabet for each string, for instance:
"This is a test string"
A:1,B:0,C:0,D:0,E:1,I:3,F:0,...
"ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC"
A:7,B:0,C:22,G:18
My sorting method needs to be able to answer to things like: sort by number of As, or by Bs;
sort first by As and then sort that subdomain by Bs.
This is not for homework; it's for an application that needs to run on mobile phones. I need this to be efficient: my current implementation is too slow and uses too much memory.
I'd take advantage of Java's (very efficient) built-in sorting capabilities. To start with, define a simple class to contain your string and its metadata:
class Item
{
    // Your string. It's public, so you can get it if you want,
    // but also final, so you can't accidentally change it.
    public final String string;

    // An array of counts, where the offset is the alphabetical position
    // of the letter it's counting. (A = 0, B = 1, C = 2...)
    private final short[] instanceCounts = new short[32];

    public Item(String string)
    {
        this.string = string;
        for (char c : string.toCharArray())
        {
            // Increment the count for this character
            instanceCounts[(byte) c - 65]++;
        }
    }

    public int getCount(char c)
    {
        return instanceCounts[(byte) c - 65];
    }
}
This will hold your String (for searching and display), and set up an array of shorts with the count of the matching characters. (If you're really low on memory and you know your strings never have more than 255 of any one character, you can even change this to an array of bytes.) A short is only 16 bits (2 bytes), so the array itself will only take 64 bytes all together, regardless of how complex your string. If you'd rather pay the performance hit of calculating the counts every time, you can get rid of the array and replace the getCount() method, but you'd be trading a small one-off memory cost for lots of frequently-garbage-collected memory, which is a big performance hit. :)
Now, define the rule you want to sort on using a Comparator. For example, to sort by the number of A's in your string:
class CompareByNumberOfA implements Comparator<Item>
{
    public int compare(Item arg0, Item arg1)
    {
        // Reversed so that items with more 'A's sort first
        return arg1.getCount('A') - arg0.getCount('A');
    }
}
Finally, stick all of your items in an array, and use the built-in (and highly memory-efficient) Arrays methods to sort. For example:

public static void main(String args[])
{
    Item[] items = new Item[5];
    items[0] = new Item("ABC");
    items[1] = new Item("ABCAA");
    items[2] = new Item("ABCAAC");
    items[3] = new Item("ABCAAA");
    items[4] = new Item("ABBABZ");

    // THIS IS THE IMPORTANT PART!
    Arrays.sort(items, new CompareByNumberOfA());

    System.out.println(items[0].string);
    System.out.println(items[1].string);
    System.out.println(items[2].string);
    System.out.println(items[3].string);
    System.out.println(items[4].string);
}
You can define a whole bunch of comparators, and use them how you like.
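For instance, the question's "sort first by As and then by Bs" could be a single chained comparator; a sketch using JDK 8's comparator combinators with the Item class above:

// Most 'A's first; ties broken by most 'B's.
Arrays.sort(items, Comparator
    .comparingInt((Item item) -> item.getCount('A'))
    .thenComparingInt(item -> item.getCount('B'))
    .reversed());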
One of the things to remember about coding with Java is not to get too clever. Compilers do a damn fine job of optimizing for their platform, as long as you take advantage of things they can optimize (like built-in APIs including Arrays.sort).
Often, if you try to get too clever, you'll just optimize yourself right out of an efficient solution. :)
I believe that what you're after is a tree structure, and that in fact the question would be better rewritten as being about a tree structure to index a long continuous string, rather than about "count" or "sort".
I'm not sure if this is a solution or a restatement of the question. Do you want a data structure which is a tree, where the root has e.g. 26 sub-trees, one for strings starting with 'A', the next child for 'B', and so on; then the 'A' child has e.g. 20 children representing "AB", "AC", "AT", etc.; and so on down to children representing e.g. "ABALXYZQ"? Each child would contain an integer field representing the count, i.e. the number of times that sub-string occurs.
class AdamTree {
    char ch;
    List<AdamTree> children;
    int count;
}
If this uses too much memory then you'd be looking at ways of trading off memory for CPU time, but that might be difficult to do...nothing comes to mind.
Sorry I don't have time to write this up in a better way. To minimize space, I would make two m x n (dense) arrays, one of bytes and one of shorts, where:
m is the number of input strings
n is the number of characters in each string; this dimension varies from row to row
the byte array contains the character
the short array contains the count for that character
If counts are guaranteed < 256, you could just use one m x n x 2 byte array.
If the set of characters you are using is dense, i.e., the set of ALL characters used in ANY string is not much larger than the set of characters used in EACH string, you could get rid of the byte array and just use a fixed "n" (above) with a function that maps from character to index. This would be much faster.
This would require 2Q traversals of this array for any query with Q clauses. Hopefully that will be fast enough.
I can assist with PHP/pseudocode and hashmaps or associative arrays.
$hash="";
$string = "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC"
while ( read each $char from $string ) {
if ( isset($hash[$char]) ) {
$hash[$char] = $hash[$char]+1
} else {
$hash[$char]=1
}
}
At the end you'll have an associative array with one key per char found, and in each value the count of the occurrences.
It's PHP-flavored, but the principle should carry over to any language.
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
Have a look at the KMP algorithm; this is a rather common programming problem, and above you will find one of the fastest solutions possible. It is easy to understand and implement.
Count the occurrences with KMP, then either go with a merge sort after insertion or, if you know that the array is already sorted, go with binary search/direct insertion.
Maybe you could use a kind of tree structure, where the depth corresponds to a given letter. Each node in the tree thus corresponds to a letter + a count of occurrences of that letter. If only one string matches this node (and its parent nodes), then it is stored in the node. Otherwise, the node has child nodes for the next letters and the letter count.
This would thus give something like this:
A:        0              1                  3
          |            /   \              /   \
B:        0           0     1            1     3
        /   \      heaven  / \     barracuda  ababab
C:     0     1            0   1
      foo   cow          bar bac
Not sure whether this would cost less than the array-count solution, but at least you wouldn't have to store the counts for all letters for all strings (the tree stops as soon as the letter counts uniquely identify a string).
You could probably optimize it further by cutting long branches without siblings.
You could try the code in Java below
int[] data = new int[256]; // one slot per possible (extended ASCII) character

void processData(String mString) {
    for (int i = 0; i < mString.length(); i++) {
        char c = mString.charAt(i);
        data[c]++;
    }
}

int getCountOfChar(char c) {
    return data[c];
}
It seems there's some confusion on what your requirements and goals are.
If your search results take up too much space, why not "lossily compress" (like music compression) the results? Kind of like a hash function. Then, when you need to retrieve results, your hash indicates a much smaller subset of strings that need to be searched properly with a more lengthy searching algorithm.
If you actually store the String objects, and your strings are actually human readable text, you could try deflating them with java.util.zip after you're done searching and index and all that. If you really want to keep them tiny and you don't receive actual String objects, and you said you only have 26 different letters, you can compress them into groups of 5 bits and store them like that. Use the CharSequence interface for this.
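As a sketch of the deflate idea with java.util.zip (text is a hypothetical stored String, and the buffer sizing here naively assumes the compressed form fits in one pass):

byte[] input = text.getBytes(StandardCharsets.UTF_8);
Deflater deflater = new Deflater();
deflater.setInput(input);
deflater.finish();
byte[] buffer = new byte[input.length + 64]; // generous for already-short text
int compressedLength = deflater.deflate(buffer);
deflater.end();
// store buffer[0..compressedLength) instead of the String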
I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings.
What I'm doing currently is basically taking the total number of edits and dividing it by the size of the larger String. If that value is below some threshold, currently randomly set to 25%, then they are "similar enough".
However, this is totally arbitrary and I don't think is a very good way to calculate similarity. Is there some kind of math equation or probability/statistics approach to taking the Levenshtein Distance data and using it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?
Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to do that. How can I compute this threshold instead of assign it so that I can safely say that 2 Strings are "similar enough"?
UPDATE
I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.
So far, my algorithm (pseudo code) is roughly along the lines of:
/*
 * The input lists represent the Strings I want to test for similarity. The
 * Strings are split apart based on new lines / carriage returns because Java
 * stack traces are not a giant one-line String, rather a multi-line String.
 * So each element in the input lists is a "line" from its stack trace.
 */
calculate similarity (List<String> list1, List<String> list2) {
    length1 = 0;
    length2 = 0;
    levenshteinDistance = 0;

    iterator1 = list1.iterator();
    iterator2 = list2.iterator();

    while (iterator1.hasNext() && iterator2.hasNext()) {
        // skip blank/empty lines because they are not interesting
        str1 = iterator1.next(); length1 += str1.length();
        str2 = iterator2.next(); length2 += str2.length();

        levenshteinDistance += getLevenshteinDistance(str1, str2);
    }

    // handle the rest of the lines from the iterator that has not terminated

    difference = levenshteinDistance / Math.max(length1, length2);
    return difference < 0.25; // <- arbitrary threshold, yuck!
}
How about using cosine similarity? This is a general technique to assess similarity between two texts. It works as follows:
Take all the letters from both Strings and build a table like this:
Letter | String1 | String2
This can be a simple hash table or whatever.
In the letter column put each letter and in the string columns put their frequency inside that string (if a letter does not appear in a string the value is 0).
It is called cosine similarity because you interpret each of the two string columns as vectors, where each component is the number associated to a letter. Next, compute the cosine of the "angle" between the vectors as:
C = (V1 * V2) / (|V1| * |V2|)
The numerator is the dot product, that is, the sum of the products of the corresponding components; the denominator is the product of the magnitudes (Euclidean norms) of the vectors.
The closer C is to 1, the more similar the Strings are.
It may seem complicated but it's just a few lines of code once you understand the idea.
Let's see an example: consider the strings
s1 = aabccdd
s2 = ababcd
The table looks like:
Letter   a   b   c   d
s1       2   1   2   2
s2       2   2   1   1
And thus:
C = (V1 * V2) / (|V1| * |V2|)
  = (2*2 + 1*2 + 2*1 + 2*1) / (sqrt(13) * sqrt(10))
  = 10 / sqrt(130)
  ≈ 0.877
So they are "pretty" similar.
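A minimal Java sketch of the whole computation, assuming lowercase a-z only (illustrative names):

static double cosineSimilarity(String s1, String s2) {
    int[] v1 = new int[26], v2 = new int[26];
    for (char c : s1.toCharArray()) v1[c - 'a']++;
    for (char c : s2.toCharArray()) v2[c - 'a']++;

    double dot = 0, norm1 = 0, norm2 = 0;
    for (int i = 0; i < 26; i++) {
        dot += v1[i] * v2[i];
        norm1 += v1[i] * v1[i];
        norm2 += v2[i] * v2[i];
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
}

cosineSimilarity("aabccdd", "ababcd") returns roughly 0.877, matching the table above.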
Stack traces are in a format amenable to parsing. I would just parse the stack traces using a parsing library and then you can extract whatever semantic content you want to compare.
Similarity algorithms are going to be slower and harder to debug when strings don't compare the way you expect.
Here's my take on this - just a long story to consider and not necessarily an answer to your problem:
I've done something similar in the past where I would try to determine if someone was plagiarizing by simply rearranging sentences while maintaining the same sort of message.
1 "children should play while we eat dinner"
2 "while we eat dinner, the children should play"
3 "we should eat children while we play"
So Levenshtein wouldn't be of much use here, because it is linear and each sentence would be considerably different. The standard difference would pass the test and the student would get away with the crime.
So I broke each sentence up into words and recomposed the sentences as arrays, then compared the arrays to determine whether each word existed in both and where it sat in relation to the previous one. Then each word would check the next one in the array to determine whether there were sequential words, as in my example sentences 1 and 2 above.
So if there were sequential words, I would compose a string of each sequence common to both arrays and then attempt to find differences in the remaining words. The fewer remaining words, the more likely they are just filler to make the text seem less plagiarized.
"while we eat dinner, I think the children should play"
Then "I think" is evaluated and considered filler based on a keyword lexicon - this part is hard to describe here.
This was a complex project that did a lot more than just what I described and not a simple chunk of code I can easily share, but the idea above is not too hard to replicate.
Good luck. I'm interested in what other SO members have to say about your question.
Since the Levenshtein distance is never greater than the length of the longer string, I'd certainly change the denominator from (length1 + length2) to Math.max(length1, length2). This would normalize the metric to be between zero and one.
Now, it's impossible to answer what's "similar enough" for your needs based on the information provided. I personally try to avoid step functions like you have with the 0.25 cutoff, preferring continuous values from a known interval. Perhaps it would be better to feed the continuous "similarity" (or "distance") values into higher-level algorithms instead of transforming those values into binary ones?
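For example, a continuous similarity in [0, 1] instead of a boolean (a sketch of the idea, not a drop-in replacement):

// 1.0 means identical, 0.0 means nothing in common by this metric
double similarity = 1.0 - (double) levenshteinDistance / Math.max(length1, length2);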
How would I find the longest series of ones in this array of binary digits - 100011101100111110011100
In this case the answer should be = 11111
I was thinking of looping through the array and checking every digit: if the digit is a one, add it to a new String; if it's a zero, restart creating a new String but save the previously created one. When done, check the length of every String to see which is the longest. I'm sure there is a simpler solution?
Your algorithm is good, but you do not need to save all the temporary strings - they are all "ones" anyway.
You should simply have two variables, "bestStartPosition" and "bestLength". After you find a sequence of "ones", you compare the length of this sequence with the saved "bestLength" and, if it is longer, overwrite both variables with the new position and length.
After you scanned all array - you will have the position of the longest sequence (in case you need it) and a length (by which you can generate a string of "ones").
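A minimal Java sketch of that single pass, using the two variables described above (bits is a hypothetical name for the input string):

int bestStartPosition = 0, bestLength = 0;
int currentStart = 0, currentLength = 0;
for (int i = 0; i < bits.length(); i++) {
    if (bits.charAt(i) == '1') {
        if (currentLength == 0) currentStart = i; // a new run of ones begins
        currentLength++;
        if (currentLength > bestLength) {
            bestLength = currentLength;
            bestStartPosition = currentStart;
        }
    } else {
        currentLength = 0;
    }
}
String longestRun = bits.substring(bestStartPosition, bestStartPosition + bestLength);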
Java 8 update with O(n) time complexity (and only 1 line):
int maxLength = Arrays.stream(bitStr.split("0+"))
    .mapToInt(String::length)
    .max().orElse(0);
This also automatically handles blank input, returning 0 in that case.
Java 7 compact solution, but O(n log n) time complexity:
Let the java API do all the work for you in just 3 lines:
String bits = "100011101100111110011100";
LinkedList<String> list = new LinkedList<String>(Arrays.asList(bits.split("0+")));
Collections.sort(list);
int maxLength = list.getLast().length(); // 5 for the example given
How this works:
bits.split("0+") breaks up the input into a String[] with each continuous chain of 1's (separated by all zeros - the regex for that is 0+) becoming an element of the array
Arrays.asList() turns the String[] into a List<String>
Create and populate a new LinkedList from the list just created
Use Collections.sort to sort the list. The longest chain of 1's will sort last
Get the length of the last element (the longest) in the list. That is why LinkedList was chosen - it has a getLast() method, which I thought was a nice convenience
For those who think this is "too heavy", with the sample input given it took less than 1ms to execute on my MacBook Pro. Unless your input String is gigabytes long, this code will execute very quickly.
EDITED
Suggested by Max, using Arrays.sort() is very similar and executes in half the time, but still requires 3 lines:
String[] split = bits.split("0+");
Arrays.sort(split);
int maxLength = split[split.length - 1].length();
Here is some pseudocode that should do what you want:
count = 0
longestCount = 0
foreach digit in binaryDigitArray:
    if (digit == 1) count++
    else:
        longestCount = max(count, longestCount)
        count = 0
longestCount = max(count, longestCount)
Easier would be to extract all sequences of 1s, sort them by length, and pick the longest. However, depending on the language used, that would probably just be a shorter version of my suggestion.
Here is some preview code, PHP only; maybe you can rewrite it in your language. It will tell you the max length of the 1's:
$match = preg_split("/0+/", "100011101100111110011100", -1, PREG_SPLIT_NO_EMPTY);
echo max(array_map('strlen', $match));
Result:
5