Count how many palindromes in a string - java

I want to find all the palindromes that can be formed from the substrings of a given string.
Example: input = abbcbbd.
Possible palindromes are a, b, b, c, b, b, d, bb, bcb, bbcbb, bb
Here is the logic I have implemented:
public int palindromeCount(String input) {
    int size = input.length(); // all single characters in the string are treated as palindromes
    int count = size;
    for (int i = 0; i < size; i++) {
        for (int j = i + 2; j <= size; j++) {
            String value = input.substring(i, j);
            String reverse = new StringBuilder(value).reverse().toString();
            if (value.equals(reverse)) {
                count++;
            }
        }
    }
    return count;
}
The time complexity here is high (roughly O(n³), since each of the O(n²) substrings is reversed and compared); how can I improve the performance of this logic?

Here are some things you can think about when optimizing this algorithm:
What are palindromes? A palindrome is a symmetrical string, which means it must have a center pivot! The pivot may be one of the following:
A letter, as in "aba", or
The position between two letters, as in the position between the letters "aa"
That means there are a total of 2n − 1 possible pivots.
Then, you can search outwards from each pivot. Here is an example:
Sample string: "abcba"
First, let's take a pivot at "c":
abcba → c is a palindrome, then widen your search by 1 on each side.
abcba → bcb is a palindrome, then widen your search by 1 on each side.
abcba → abcba is a palindrome, so we know there are at least 3 palindromes in the string.
Continue this with all pivots.
With this approach, the runtime complexity is O(n²).
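For reference, here is a minimal sketch of that pivot-expansion idea in Java, reusing the palindromeCount signature from the question (a sketch, not a drop-in tested implementation):

import java.lang.Math;

public int palindromeCount(String input) {
    int n = input.length();
    int count = 0;
    // 2n - 1 pivots: even values are letters, odd values are gaps between letters.
    for (int center = 0; center < 2 * n - 1; center++) {
        int left = center / 2;          // left end of the initial window
        int right = left + center % 2;  // equals left for a letter pivot, left + 1 for a gap pivot
        // Widen the window while it stays inside the string and remains a palindrome.
        while (left >= 0 && right < n && input.charAt(left) == input.charAt(right)) {
            count++;
            left--;
            right++;
        }
    }
    return count;
}

For abbcbbd this returns 11, which matches the list in the question.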

If you're comfortable busting out some heavyweight data structures, it's possible to do this in time O(n), though I'll admit that this isn't something that will be particularly easy to code up. :-)
We're going to need two tools in order to solve this problem.
Tool One: Generalized Suffix Trees. A generalized suffix tree is a data structure that, intuitively, is a trie containing all suffixes of two strings S and T, but represented in a more space-efficient way.
Tool Two: Lowest Common Ancestor Queries. A lowest common ancestor query structure (or LCA query) is a data structure built around a particular tree. It's designed to efficiently answer queries of the form "given two nodes in the tree, what is their lowest common ancestor?"
Importantly, a generalized suffix tree for two strings of length m can be built in time O(m), and an LCA query can be built in time O(m) such that all queries take time O(1). These are not obvious runtimes; the algorithms and data structures needed here were publishable results when they were first discovered.
Assuming we have these two structures, we can build a third data structure, which is what we'll use to get our final algorithm:
Tool Three: Longest Common Extension Queries. A longest common extension query data structure (or LCE query) is a data structure built around two strings S and T. It supports queries of the following form: given an index i into string S and an index j into string T, what is the length of the longest string that appears starting at index i in S and index j in T?
As an example, take these two strings:
S: OFFENSE
0123456
T: NONSENSE
01234567
If we did an LCE query starting at index 3 in string S and index 4 in string T, the answer would be the string ENSE. On the other hand, if we did an LCE query starting at index 4 in string S and index 0 in string T, we'd get back the string N.
(More strictly, the LCE query structure doesn't actually return the actual string you'd find at both places, but rather its length.)
It's possible to build an LCE data structure for a pair of strings S and T of length m in time O(m) so that each query takes time O(1). The technique for doing so involves building a generalized suffix tree for the two strings, then constructing an LCA data structure on top. The key insight is that the LCE starting at position i in string S and j in string T is given by the lowest common ancestor of suffix i of string S and suffix j of string T in the suffix tree.
The LCE structure is extremely useful for this problem. To see why, let's take your sample string abbcbbd. Now, consider both that string and its reverse, as shown here:
S: abbcbbd
0123456
T: dbbcbba
0123456
Every palindrome in a string takes one of two forms. First, it can be an odd-length palindrome. A palindrome like that has some central character c, plus some "radius" stretching out forwards and backwards from that center. For example, the string bbcbb is an odd-length palindrome with center c and radius bb.
We can count up how many odd-length palindromes there are in a string by using LCE queries. Specifically, build an LCE query structure over both the string and its reverse. Then, for each position within the original string, ask for the LCE of that position in the original string and its corresponding position in the mirrored string. This will give you the longest odd-length palindrome centered at that point. (More specifically, it'll give you the length of the radius plus one, since the character itself will always match at those two points). Once we know the longest odd-length palindrome centered at that position in the string, we can count the number of odd-length palindromes centered at that position in the string: that will be equal to all the ways we can take that longer palindrome and shorten it by cutting off the front and back character.
With this in mind, we can count all the odd-length palindromes in the string as follows:
for i from 0 to length(S) - 1:
    total += LCE(i, length(S) - 1 - i)
The other class of palindromes are even-length palindromes, which don't have a center and instead consist of two equal radii. We can also find these using LCE queries, except instead of looking at some position i and its corresponding position in the reversed string, we'll look at position i in the original string and the place that corresponds to index i - 1 in the reversed string. That can be done here:
for i from 1 to length(S) - 1:
    total += LCE(i, length(S) - i)
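For concreteness, here is a hedged Java sketch of just these two counting loops, with the LCE structure abstracted behind an IntBinaryOperator; the generalized-suffix-tree plus LCA machinery that would actually answer lce(i, j) in O(1) is assumed, not built here:

import java.util.function.IntBinaryOperator;

// lce must return the longest common extension of S at index i and the
// reversed string T at index j in O(1); building it is the hard part above.
static long countPalindromes(String s, IntBinaryOperator lce) {
    int n = s.length();
    long total = 0;
    // Odd-length palindromes: position i in S paired with its mirror n - 1 - i in T.
    for (int i = 0; i < n; i++) {
        total += lce.applyAsInt(i, n - 1 - i);
    }
    // Even-length palindromes: position i in S paired with the mirror of i - 1 in T.
    for (int i = 1; i < n; i++) {
        total += lce.applyAsInt(i, n - i);
    }
    return total;
}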
Overall, this solution:
Constructs an LCE query structure from the original string and the reversed string. Reversing the string takes time O(m) and building the LCE query structure takes time O(m). Total time for this step: O(m).
Makes 2m - 1 total queries to the LCE structure. Each query takes time O(1). Total time for this step: O(m).
I'm fairly certain it's possible to achieve this runtime without using such heavyweight tools, but this at least shows that a linear-time solution exists.
Hope this helps!

Related

Binary splitting an array and retrieving "leaf" arrays

I have a number of entries in an array (FT = [-10.5, 6.5, 7.5, -7.5]) to which I am applying binary splitting, appending the pieces to a result array of arrays (LT = [[-10.5],[6.5, 7.5, -7.5],[6.5,7.5],[-7.5]]). The tree describing the splitting for my example is below:
      [-10.5, 6.5, 7.5, -7.5]
          /             \
   [-10.5]        [6.5, 7.5, -7.5]
                     /        \
              [6.5, 7.5]    [-7.5]
Now from the array LT I want to retrieve only "leaf" arrays (T = [[-10.5],[6.5,7.5],[-7.5]]) given the size of the initial array FT.
How to achieve this (get T) in Java?
I am presenting a way of thinking about your problem. I am not fleshing it out into a full algorithm; I am leaving some parts for you to fill in.
First, if LT is empty, no splitting has occurred. In this case the original FT was the leaf array, and we have no way of telling what it was. The problem cannot be solved.
If LT contains n arrays, then there must exist some m (0 < m < n) so that the first m arrays form the left subtree and the rest form the right subtree. We don’t know m, so we simply try all possible values of m in turn. For each possible m we check whether a solution for this value of m is possible by trying to reconstruct each subtree.
So define an auxiliary method to check if a part of LT can form a subtree and return the leaves if it can.
Your auxiliary method will work like this: If there is only one array, it's a leaf, so return it. If there are two arrays, they cannot form a subtree. If there are three, they form a subtree exactly if the first is the concatenation of the other two. If there are more than three, then again we need to consider all the possibilities of how they are distributed into subtrees. The difference from before is that we know which full array the subtrees come from, namely the frontmost array, so all candidate splits should be checked against it. For starters, if the second array is not a prefix of the first, we cannot have a subtree.
Your algorithm will no doubt get recursive at some point.
Pruning opportunity: In this kind of splitting every non-leaf has exactly two children, so any subtree contains an odd number of arrays (one more leaf than internal nodes). So for a solution to exist n needs to be even and m needs to be odd.
I would consider coding my algorithm using lists rather than arrays because I think it’s more convenient to pass lists of lists rather than arrays of arrays or lists of arrays around.
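In that spirit, here is a rough, untested sketch of the recursive check using lists of lists. The method names (leavesOf, leaves) are mine, none of the pruning above is applied, the given size of FT is not checked, and null is used to signal "no valid subtree":

import java.util.ArrayList;
import java.util.List;

class LeafRecovery {

    // Treats part.get(0) as the root of a subtree and the remaining entries as
    // its descendants; returns the subtree's leaves, or null if no valid subtree.
    static List<List<Double>> leavesOf(List<List<Double>> part) {
        if (part.size() == 1) {
            List<List<Double>> leaves = new ArrayList<>();
            leaves.add(part.get(0));      // a lone array is a leaf
            return leaves;
        }
        if (part.size() == 2) {
            return null;                  // a root with a single child cannot occur
        }
        List<Double> root = part.get(0);
        List<List<Double>> rest = part.subList(1, part.size());
        // Try every way of splitting the descendants into a left and a right subtree.
        for (int m = 1; m < rest.size(); m++) {
            List<List<Double>> leftPart = rest.subList(0, m);
            List<List<Double>> rightPart = rest.subList(m, rest.size());
            // The two subtree roots must concatenate back to the current root.
            List<Double> joined = new ArrayList<>(leftPart.get(0));
            joined.addAll(rightPart.get(0));
            if (!joined.equals(root)) {
                continue;
            }
            List<List<Double>> leftLeaves = leavesOf(leftPart);
            List<List<Double>> rightLeaves = leavesOf(rightPart);
            if (leftLeaves != null && rightLeaves != null) {
                leftLeaves.addAll(rightLeaves);
                return leftLeaves;
            }
        }
        return null;
    }

    // Top level: LT holds every node except FT itself, so try each split of LT
    // into the root's left and right subtrees.
    static List<List<Double>> leaves(List<List<Double>> lt) {
        for (int m = 1; m < lt.size(); m++) {
            List<List<Double>> left = leavesOf(lt.subList(0, m));
            List<List<Double>> right = leavesOf(lt.subList(m, lt.size()));
            if (left != null && right != null) {
                left.addAll(right);
                return left;
            }
        }
        return null; // also covers the empty-LT case discussed above
    }
}

For the example LT this returns [[-10.5], [6.5, 7.5], [-7.5]], which is the desired T.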
Happy further refinement and coding.

Sorting string so that there aren't two same characters on adjacent places [duplicate]

It's a bonus school task for which we didn't receive any teaching yet and I'm not looking for a complete code, but some tips to get going would be pretty cool. Going to post what I've done so far in Java when I get home, but here's something I've done already.
So, we have to write a sorting algorithm which, for example, sorts "AAABBB" to ABABAB. The max input size is 10^6, and it all has to happen in under 1 second. If there's more than one answer, the first one in alphabetical order is the right one. I started by testing different algorithms that sort without the alphabetical-order requirement in mind, just to see how things work out.
First version:
Save the counts into an Integer array where the index is the ASCII code and the value is the number of times that character occurs in the char array.
Then I picked the 2 highest counts and kept alternating those two characters into the new character array until some other count became higher, and then swapped to it. It worked well, but of course the order wasn't right.
Second version:
Followed the same idea, but stopped picking the most frequent character and instead picked the indexes in the order they appeared in my array. It works well until the input is something like CBAYYY. The algorithm sorts it to ABCYYY instead of AYBYCY. Of course I could try to find free spots for those Ys, but at that point it starts to take too long.
An interesting problem, with an interesting tweak. Yes, this is a permutation or rearranging rather than a sort. No, the quoted question is not a duplicate.
Algorithm.
Count the character frequencies.
Output alternating characters from the two lowest in alphabetical order.
As each is exhausted, move to the next.
At some point the highest frequency char will be exactly half the remaining chars. At that point switch to outputting all of that char alternating in turn with the other remaining chars in alphabetical order.
Some care required to avoid off-by-one errors (odd vs even number of input characters). Otherwise, just writing the code and getting it to work right is the challenge.
Note that there is one special case, where the number of characters is odd and the frequency of one character starts at (half plus 1). In this case you need to start with step 4 in the algorithm, outputting all one character alternating with each of the others in turn.
Note also that if one character comprises more than half the input then, apart from this special case, no solution is possible. This situation may be detected in advance by inspecting the frequencies, or during execution when the tail consists of all one character. Detecting this case was not part of the spec.
Since no sort is required the complexity is O(n). Each character is examined twice: once when it is counted and once when it is added to the output. Everything else is amortised.
My idea is the following. With the right implementation it can be almost linear.
First establish a function to check whether a solution is even possible. It should be very fast: something like checking whether the most frequent letter exceeds half of all letters, taking into consideration that it can be placed first.
Then, while there are still letters remaining, take the alphabetically first letter that is not the same as the previous one and still makes a solution possible (see the sketch of the feasibility check below).
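As a sketch of that feasibility check (assuming uppercase A-Z input as in the examples), one precise way to state it is: an arrangement with no two equal adjacent characters exists exactly when the most frequent letter fills at most half of the positions, rounded up:

// Hedged sketch; assumes the input contains only uppercase A-Z.
static boolean arrangementPossible(String s) {
    int[] freq = new int[26];
    for (char c : s.toCharArray()) {
        freq[c - 'A']++;
    }
    int max = 0;
    for (int f : freq) {
        max = Math.max(max, f);
    }
    // The most frequent letter must not exceed ceil(n / 2).
    return max <= (s.length() + 1) / 2;
}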
The correct algorithm would be the following:
Build a histogram of the characters in the input string.
Put the CharacterOccurrences in a PriorityQueue / TreeSet where they're ordered on highest occurrence, lowest alphabetical order
Have an auxiliary variable of type CharacterOccurrence
Loop while the PQ is not empty
Take the head of the PQ and keep it
Add the character of the head to the output
If the auxiliary variable is set => Re-add it to the PQ
Store the kept head in the auxiliary variable with 1 occurrence less unless the occurrence ends up being 0 (then unset it)
if the size of the output == size of the input, it was possible and you have your answer. Else it was impossible.
Complexity is O(N * log(N))
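For illustration, a rough Java sketch of the priority-queue procedure described above; the CharCount class and the buildOutput name are mine, and uppercase A-Z input is assumed as in the examples:

import java.util.PriorityQueue;

class Rearranger {

    static final class CharCount {
        final char c;
        int count;
        CharCount(char c, int count) { this.c = c; this.count = count; }
    }

    // Returns null when the output ends up shorter than the input, i.e. impossible.
    static String buildOutput(String input) {
        // Histogram of the characters (assumes uppercase A-Z).
        int[] freq = new int[26];
        for (char c : input.toCharArray()) {
            freq[c - 'A']++;
        }
        // Ordered on highest occurrence first, then lowest alphabetical order.
        PriorityQueue<CharCount> pq = new PriorityQueue<>(
                (a, b) -> a.count != b.count ? b.count - a.count : a.c - b.c);
        for (int i = 0; i < 26; i++) {
            if (freq[i] > 0) {
                pq.add(new CharCount((char) ('A' + i), freq[i]));
            }
        }
        StringBuilder out = new StringBuilder();
        CharCount held = null;              // the auxiliary variable from the description
        while (!pq.isEmpty()) {
            CharCount head = pq.poll();
            out.append(head.c);
            if (held != null) {
                pq.add(held);               // re-add the previously held character
            }
            head.count--;
            held = head.count > 0 ? head : null;
        }
        return out.length() == input.length() ? out.toString() : null;
    }
}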
Make a bidirectional table of character frequencies: character->count and count->character. Record an Optional<Character> which stores the last character output (or none if there is none). Store the total number of characters.
If (total number of characters - 1) < 2 * (highest character count), use the character with the highest count (otherwise there would be no solution). Fail if it is the same as the last character output.
Otherwise, use the earliest alphabetically that isn't the last character output.
Record the last character output, decrease both the total and used character count.
Loop while we still have characters.
While this question is not quite a duplicate, the part of my answer giving the algorithm for enumerating all permutations with as few adjacent equal letters as possible can readily be adapted to return only the minimum, since its proof of optimality requires that every recursive call yield at least one permutation. The only changes outside of the test code are to try keys in sorted order and to break after the first hit is found. The running time of the code below is polynomial (O(n) if I bothered with better data structures), since unlike its ancestor it does not enumerate all possibilities.
david.pfx's answer hints at the logic: greedily take the least letter that doesn't eliminate all possibilities, but, as he notes, the details are subtle.
from collections import Counter
from itertools import permutations
from operator import itemgetter
from random import randrange
def get_mode(count):
    return max(count.items(), key=itemgetter(1))[0]

def enum2(prefix, x, count, total, mode):
    prefix.append(x)
    count_x = count[x]
    if count_x == 1:
        del count[x]
    else:
        count[x] = count_x - 1
    yield from enum1(prefix, count, total - 1, mode)
    count[x] = count_x
    del prefix[-1]

def enum1(prefix, count, total, mode):
    if total == 0:
        yield tuple(prefix)
        return
    if count[mode] * 2 - 1 >= total and [mode] != prefix[-1:]:
        yield from enum2(prefix, mode, count, total, mode)
    else:
        defect_okay = not prefix or count[prefix[-1]] * 2 > total
        mode = get_mode(count)
        for x in sorted(count.keys()):
            if defect_okay or [x] != prefix[-1:]:
                yield from enum2(prefix, x, count, total, mode)
                break

def enum(seq):
    count = Counter(seq)
    if count:
        yield from enum1([], count, sum(count.values()), get_mode(count))
    else:
        yield ()

def defects(lst):
    return sum(lst[i - 1] == lst[i] for i in range(1, len(lst)))

def test(lst):
    perms = set(permutations(lst))
    opt = min(map(defects, perms))
    slow = min(perm for perm in perms if defects(perm) == opt)
    fast = list(enum(lst))
    assert len(fast) == 1
    fast = min(fast)
    print(lst, fast, slow)
    assert slow == fast

for r in range(10000):
    test([randrange(3) for i in range(randrange(6))])
Start by counting how many of each letter you have in your array:
For example you have 3 A, 2 B, 1 C, 4 Y, 1 Z.
1) Then repeatedly output the lowest letter you are allowed to place.
So you start with:
A
Then you cannot place A again, so you place B:
AB
And so on, until:
ABABACYZ
This works as long as at least 2 kinds of characters remain. But here you will still have 3 Ys left.
2) To place the last characters, go from your first Y towards the beginning and insert a Y at every second position.
So ABAYBYAYCYZ.
3) Then take the subsequence between the Ys, here YBYAYCY, and sort the letters between the Ys:
BAC => ABC
And you arrive at
ABAYAYBYCYZ
which should be the solution of your problem.
To do all this, I think a LinkedList is the best way.
I hope it helps :)

How does LCP help in finding the number of occurrences of a pattern?

I have read that the Longest Common Prefix (LCP) could be used to find the number of occurrences of a pattern in a string.
Specifically, you just need to create the suffix array of the text, sort it, and then instead of doing binary search to find the range so that you can figure out the number of occurrences, you simply compute the LCP for each successive entry in the suffix array.
Although using binary search to find the number of occurrences of a pattern is obvious I can't figure out how the LCP helps find the number of occurrences here.
For example for this suffix array for banana:
LCP   Suffix entry
N/A   a
1     ana
3     anana
0     banana
0     na
2     nana
How the LCP helps find the number of occurrences of a substring like "banana" or "na" is not obvious to me.
Any help figuring out how LCP helps here?
I do not know any way of using the LCP array instead of carrying out a binary search, but I believe what you refer to is the technique described by Udi Manber and Gene Myers in Suffix arrays: a new method for on-line string searches.
(Note: The below explanation has been copied into a Wikipedia article on 9th April 2014, see diff. If you look at the revision history here and on Wikipedia, you'll see that the one here was written first. Please don't insert comments like "taken from Wikipedia" into my answer.)
The idea is this: In order to find the number of occurrences of a given string P (length m) in a text T (length N),
You use binary search against the suffix array of T (just like you suggested)
But you speed it up using the LCP array as auxiliary data structure. More specifically, you generate a special version of the LCP array (I will call it LCP-LR below) and use that.
The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons you need to make, you compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m*log N).
The LCP-LR array helps improve this to O(m+log N), in the following way:
At any point during the binary search algorithm, you consider, as usual, a range (L,...,R) of the suffix array and its central point M, and decide whether you continue your search in the left sub-range (L,...,M) or in the right sub-range (M,...,R).
In order to make the decision, you compare P to the string at M. If P is identical to M, you are done, but if not, you will have compared the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M.
So, in the next step, you consider (M,...,R) and a new central point M' in the middle:
M ...... M' ...... R
|
we know:
lcp(P,M)==k
The trick now is that LCP-LR is precomputed such that a O(1)-lookup tells you the longest common prefix of M and M', lcp(M,M').
You know already (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P,M)=k. Now there are three possibilities:
Case 1: k < lcp(M,M'), i.e. P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R).
Case 2: k > lcp(M,M'), i.e. P has more prefix characters in common with M than M has in common with M'. Consequently, if we were to compare P to M', the common prefix would be smaller than k, and M' would be lexicographically larger than P, so, without actually making the comparison, we continue in the left half (M,...,M').
Case 3: k == lcp(M,M'). So M and M' are both identical with P in the first k characters. To decide whether we continue in the left or right half, it suffices to compare P to M' starting from the (k+1)-th character.
We continue recursively.
The overall effect is that no character of P is compared to any character of the text more than once. The total number of character comparisons is bounded by m, so the total complexity is indeed O(m+log N).
Obviously, the key remaining question is how did we precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array? As you said, the standard LCP array tells you the lcp of consecutive entries only, i.e. lcp(x-1,x) for any x. But M and M' in the description above are not necessarily consecutive entries, so how is that done?
The key to this is to realize that only certain ranges (L,...,R) will ever occur during the binary search: it always starts with (0,...,N) and divides that at the center, then continues either left or right and divides that half again, and so forth. If you think about it: every entry of the suffix array occurs as the central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L...M...R) that can possibly play a role during binary search, and it suffices to precompute lcp(L,M) and lcp(M,R) for those N possible ranges. That is 2*N distinct precomputed values, hence LCP-LR is O(N) in size.
Moreover, there is a straight-forward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array – I'd suggest posting a separate question if you need a detailed description of that.
To sum up:
It is possible to compute LCP-LR in O(N) time and O(2*N)=O(N) space from LCP
Using LCP-LR during binary search helps accelerate the search procedure from O(m*log N) to O(m+log N)
As you suggested, you can use two binary searches to determine the left and right end of the match range for P, and the length of the match range corresponds with the number of occurrences for P.
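To make that last point concrete, here is a small, deliberately naive Java sketch of counting occurrences with two binary searches over a suffix array; it builds the suffix array by plain sorting rather than in O(N), and it does not use the LCP-LR speed-up described above:

import java.util.Arrays;

class SuffixArrayCount {

    static int countOccurrences(String text, String pattern) {
        int n = text.length();
        // Naive O(N^2 log N) suffix array: sort suffix start positions lexicographically.
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));

        int lo = lowerBound(text, sa, pattern);
        int hi = upperBound(text, sa, pattern);
        return hi - lo;   // size of the match range = number of occurrences
    }

    // First index in the suffix array whose suffix is >= pattern.
    private static int lowerBound(String text, Integer[] sa, String p) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (text.substring(sa[mid]).compareTo(p) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // First index in the suffix array whose suffix neither starts with the
    // pattern nor is lexicographically smaller than it.
    private static int upperBound(String text, Integer[] sa, String p) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            String suffix = text.substring(sa[mid]);
            if (suffix.startsWith(p) || suffix.compareTo(p) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}

For text "banana" and pattern "na" this returns 2, matching the suffix array shown in the question.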
The longest common prefix of two suffixes corresponds to the Lowest Common Ancestor (LCA) of those suffixes in a suffix tree. Once you have the LCA node for the pattern, you can count the number of leaves in the subtree below it, which gives you the number of occurrences of the pattern. This is the relationship between the LCP and the LCA.

What is the best way to count and sort a string array

I am trying to find out if there is a good way to search (count the number of occurrences of characters) and then sort a String array in an efficient way, that is, a way that will work well on embedded systems (32 MB).
Example: I have to count the number of times the characters A, B, C, etc. are used and save that result for later sorting...
I can count using a public int count(String searchDomain, char searchValue) method, but each string should report a count for every letter of the alphabet, for instance:
"This is a test string"
A:1,B:0,C:0,D:0,E:1,I:3,F:0,...
"ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC"
A:7,B:0,C:22,G:18
My sorting method needs to be able to answer things like: sort by number of As, then Bs;
sort first by As and then sort that subdomain by Bs.
This is not for homework; it's for an application that needs to run on mobile phones. I need this to be efficient; my current implementation is too slow and uses too much memory.
I'd take advantage of Java's (very efficient) built in sorting capabilities. To start with, define a simple class to contain your string and its metadata:
class Item
{
    // Your string. It's public, so you can get it if you want,
    // but also final, so you can't accidentally change it.
    public final String string;

    // An array of counts, where the offset is the alphabetical position
    // of the letter it's counting. (A = 0, B = 1, C = 2...)
    private final short[] instanceCounts = new short[32];

    public Item(String string)
    {
        this.string = string;
        for (char c : string.toCharArray())
        {
            // Increment the count for this character
            instanceCounts[(byte) c - 65]++;
        }
    }

    public int getCount(char c)
    {
        return instanceCounts[(byte) c - 65];
    }
}
This will hold your String (for searching and display) and set up an array of shorts with the count of each letter. (If you're really low on memory and you know no character occurs more than 255 times in any string, you can even change this to an array of bytes, treating them as unsigned.) A short is only 2 bytes, so the array itself will only take 64 bytes altogether, regardless of how long your string is. If you'd rather pay the performance hit of calculating the counts every time, you can get rid of the array and replace the getCount() method, but what you save in once-off memory you'll pay for in frequently-garbage-collected memory, which is a big performance hit. :)
Now, define the rule you want to search on using a Comparator. For example, to sort by the number of A's in your string:
class CompareByNumberOfA implements Comparator<Item>
{
    public int compare(Item arg0, Item arg1)
    {
        return arg1.getCount('A') - arg0.getCount('A');
    }
}
Finally, stick all of your items in an array, and use the built in (and highly memory efficient) Arrays methods to sort. For example:
public static void main(String args[])
{
    Item[] items = new Item[5];
    items[0] = new Item("ABC");
    items[1] = new Item("ABCAA");
    items[2] = new Item("ABCAAC");
    items[3] = new Item("ABCAAA");
    items[4] = new Item("ABBABZ");

    // THIS IS THE IMPORTANT PART!
    Arrays.sort(items, new CompareByNumberOfA());

    System.out.println(items[0].string);
    System.out.println(items[1].string);
    System.out.println(items[2].string);
    System.out.println(items[3].string);
    System.out.println(items[4].string);
}
You can define a whole bunch of comparators, and use them how you like.
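For instance, assuming Java 8+ (with java.util.Comparator imported), the comparators can be chained to cover the "sort first by As and then by Bs" requirement, reusing the items array from the example above:

Comparator<Item> byAs = Comparator.comparingInt((Item it) -> it.getCount('A'));
Comparator<Item> byBs = Comparator.comparingInt((Item it) -> it.getCount('B'));
// Most As first; ties broken by most Bs first.
Arrays.sort(items, byAs.reversed().thenComparing(byBs.reversed()));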
One of the things to remember about coding with Java is not to get too clever. Compilers do a damn fine job of optimizing for their platform, as long as you take advantage of things they can optimize (like built-in APIs including Arrays.sort).
Often, if you try to get too clever, you'll just optimize yourself right out of an efficient solution. :)
I believe that what you're after is a tree structure, and that in fact the question would be better rewritten as being about a tree structure that indexes one long continuous string rather than about "counting" or "sorting".
I'm not sure if this is a solution or a restatement of the question. Do you want a data structure which is a tree, where the root has e.g. 26 subtrees, one for strings starting with 'A', the next for 'B', and so on; where the 'A' child in turn has e.g. 20 children representing "AB", "AC", "AT", etc.; and so on down to children representing e.g. "ABALXYZQ", with each child containing an integer field for the count, i.e. the number of times that sub-string occurs?
class AdamTree {
    char ch;
    List<AdamTree> children;
    int count;
}
If this uses too much memory then you'd be looking at ways of trading off memory for CPU time, but that might be difficult to do...nothing comes to mind.
Sorry I don't have time to write this up in a better way. To minimize space, I would make two m x n (dense) arrays, one of bytes and one of shorts, where:
m is the number of input strings
n is the number of characters in each string; this dimension varies from row to row
the byte array contains the character
the short array contains the count for that character
If counts are guaranteed < 256, you could just use one m x n x 2 byte array.
If the set of characters you are using is dense, i.e., the set of ALL characters used in ANY string is not much larger than the set of characters used in EACH string, you could get rid of the byte array and just use a fixed "n" (above) with a function that maps from character to index. This would be much faster.
This would require 2Q traversals of this array for any query with Q clauses. Hopefully this will be fast enough.
I can assist with php/pseudo code and hashmaps or associative arrays.
$hash="";
$string = "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC"
while ( read each $char from $string ) {
if ( isset($hash[$char]) ) {
$hash[$char] = $hash[$char]+1
} else {
$hash[$char]=1
}
}
At the end you'll have an associative array with one key per character found, and the value for each key holds the count of its occurrences.
It's not PHP (or any other language for that matter) but the principle should help.
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
Have a look at the KMP algorithm. This is a rather common programming problem, and the link above describes one of the fastest solutions possible; it is easy to understand and implement.
Count the occurrences with KMP, then either go with a merge sort after insertion or, if you know that the array is already sorted, go with binary search / direct insertion.
Maybe you could use a kind of tree structure, where the depth corresponds to a given letter. Each node in the tree thus corresponds to a letter + a count of occurrences of that letter. If only one string matches this node (and its parent nodes), then it is stored in the node. Otherwise, the node has child nodes for the next letters and the letter count.
This would thus give something like this:
A:         0              1                   3  ...
           |             / \                 / \
B:         0            0   1               1   3
          / \       heaven  / \     barracuda   ababab
C:       0   1             0   1
        foo  cow          bar  bac
Not sure this would cost less than the array-count solution, but at least you wouldn't have to store the counts for all letters for all strings (the tree stops as soon as the letter counts uniquely identify a string).
You could probably optimize it by cutting long branches without siblings
You could try the code in Java below:
int[] data = new int[256]; // one slot per possible (extended ASCII) character

void processData(String mString) {
    for (int i = 0; i < mString.length(); i++) {
        char c = mString.charAt(i);
        data[c]++;
    }
}

int getCountOfChar(char c) {
    return data[c];
}
It seems there's some confusion on what your requirements and goals are.
If your search results take up too much space, why not "lossily compress" them (like music compression)? Kind of like a hash function. Then, when you need to retrieve results, the hash indicates a much smaller subset of strings that need to be searched properly with a more thorough searching algorithm.
If you actually store the String objects, and your strings are actually human readable text, you could try deflating them with java.util.zip after you're done searching and index and all that. If you really want to keep them tiny and you don't receive actual String objects, and you said you only have 26 different letters, you can compress them into groups of 5 bits and store them like that. Use the CharSequence interface for this.

Find longest series of ones in a binary digit array

How would I find the longest series of ones in this array of binary digits: 100011101100111110011100?
In this case the answer should be 11111.
I was thinking of looping through the array and checking every digit: if the digit is a one, add it to a new String; if it's a zero, restart with a new String but save the previously created one. When done, check the length of every String to see which is the longest. I'm sure there is a simpler solution?
Your algorithm is good, but you do not need to save all the temporary strings - they are all "ones" anyway.
You should simply have two variables, "bestStartPosition" and "bestLength". After you find a sequence of "ones", you compare the length of this sequence with the saved "bestLength" and, if it is longer, overwrite both variables with the new position and length.
After you have scanned the whole array, you will have the position of the longest sequence (in case you need it) and its length (from which you can generate a string of "ones").
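A rough Java sketch of that two-variable scan (the method name is mine), which also remembers where the best run starts so the substring can be returned directly:

static String longestRun(String bits) {
    int bestStart = 0, bestLength = 0;   // position and length of the best run so far
    int runStart = 0, runLength = 0;     // the run of ones we are currently inside
    for (int i = 0; i < bits.length(); i++) {
        if (bits.charAt(i) == '1') {
            if (runLength == 0) {
                runStart = i;            // a new run begins here
            }
            runLength++;
            if (runLength > bestLength) {
                bestLength = runLength;
                bestStart = runStart;
            }
        } else {
            runLength = 0;
        }
    }
    return bits.substring(bestStart, bestStart + bestLength);
}

For "100011101100111110011100" this returns "11111".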
Java 8 update with O(n) time complexity (and only 1 line):
int maxLength = Arrays.stream(bitStr.split("0+"))
.mapToInt(String::length)
.max().orElse(0);
See live demo.
This also automatically handles blank input, returning 0 in that case.
Java 7 compact solution, but O(n log n) time complexity:
Let the java API do all the work for you in just 3 lines:
String bits = "100011101100111110011100";
LinkedList<String> list = new LinkedList<String>(Arrays.asList(bits.split("0+")));
Collections.sort(list);
int maxLength = list.getLast().length(); // 5 for the example given
How this works:
bits.split("0+") breaks up the input into a String[] with each continuous chain of 1's (separated by all zeros - the regex for that is 0+) becoming an element of the array
Arrays.asList() turns the String[] into a List<String>
Create and populate a new LinkedList from the list just created
Use collections to sort the list. The longest chain of 1's will sort last
Get the length of the last element (the longest) in the list. That is why LinkedList was chosen - it has a getLast() method, which I thought was a nice convenience
For those who think this is "too heavy", with the sample input given it took less than 1ms to execute on my MacBook Pro. Unless your input String is gigabytes long, this code will execute very quickly.
EDITED
Suggested by Max, using Arrays.sort() is very similar and executes in half the time, but still requires 3 lines:
String[] split = bits.split("0+");
Arrays.sort(split);
int maxLength = split[split.length - 1].length();
Here is some pseudocode that should do what you want:
count = 0
longestCount = 0
foreach digit in binaryDigitArray:
if (digit == 1) count++
else:
longestCount = max(count, maxCount)
count = 0
longestCount = max(count, maxCount)
An easier approach would be to extract all sequences of 1s, sort them by length, and pick the longest. However, depending on the language used, that would probably just be a shorter version of my suggestion.
Here is some preview code, PHP only; maybe you can rewrite it in your language.
It will tell you the max length of the 1s:
$match = preg_split("/0+/", "100011101100111110011100", -1, PREG_SPLIT_NO_EMPTY);
echo max(array_map('strlen', $match));
Result:
5
