Count amount of String occurrences and comparisons (KMP)

I'm trying to count the number of pattern occurrences and the number of comparisons needed (called matches in the code below), using the KMP search algorithm.
I've tried the following:
public class KMP {
private String pat;
private int[][] dfa;
private static int match;
private static int count;
public KMP(String pat) {
// Build DFA from pattern.
this.pat = pat;
int M = pat.length();
int R = 256;
dfa = new int[R][M];
dfa[pat.charAt(0)][0] = 1;
for (int X = 0, j = 1; j < M; j++) {
// Compute dfa[][j].
for (int c = 0; c < R; c++) {
dfa[c][j] = dfa[c][X]; // Copy mismatch cases.
}
dfa[pat.charAt(j)][j] = j + 1; // Set match case (once per column, after the copy loop).
X = dfa[pat.charAt(j)][X]; // Update restart state (once per column).
}
}
public int search(String txt) {
// Simulate operation of DFA on txt.
int i, j, N = txt.length(), M = pat.length();
for (i = 0, j = 0; i < N && j < M; i++) {
j = dfa[txt.charAt(i)][j];
}
if (j == M) {
return i - M; // found (hit end of pattern)
} else {
return N; // not found (hit end of text)
}
}
public static void main(String[] args) {
String pat = "babba";
String txt = "aaaaaaaaaaaabbaaababbaaaaababbaaaa";
int lastIndex = 0;
KMP kmp = new KMP(pat);
int offset = kmp.search(txt);
System.out.println("text: " + txt);
System.out.print("pattern: ");
while (lastIndex != txt.length()) {
for (int i = 0; i < offset; i++) {
lastIndex++;
match++;
}
count++;
}
System.out.println(pat);
System.out.println("count: " + count);
System.out.println("match: " + match);
}
}
My code works just fine when I run it like this, but when I change the String txt to something like aaaaaaaaaaaabbaaababbaaaaababbaaaababba, it gives me an unexpected, negative count value (it also takes about 30 seconds to actually run).
I'm trying to find a better solution of counting the occurrences and I'd also like to know what's wrong with my code, since it only works in some cases.

The cause is your loop condition:
while (lastIndex != txt.length())
Your problem string has a length of 39 and an offset of 17.
Each pass of the for loop increments lastIndex by 17.
After the third pass it has the value 51.
That still satisfies the condition (51 != 39), so the loop continues.
It ends only after, probably, several int overflows, which is what produces the negative count value.
Also, you can't count the occurrences like that.
kmp.search() only returns the start position of the first occurrence of the pattern.
For example
String txt = "aaaaaaaaaaaaaaaaababbaaaaaaaaaaaaa";
Your code returns count = 2.
A solution would be to split the string after each search and then search the substring after the pattern.
KMP kmp = new KMP(pat);
int offset = kmp.search(txt);
while (offset != txt.length()) {
count++;
txt = txt.substring(offset+pat.length());
offset = kmp.search(txt);
}
System.out.println("count: " + count);
Edit: The code above only works for non-overlapping occurrences.
txt = txt.substring(offset+pat.length());
needs to be changed to
txt = txt.substring(offset+1);
if occurrences can overlap.
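The substring approach re-scans from scratch after every hit. As an alternative, here is a minimal sketch that counts occurrences (overlapping ones included) and character comparisons in a single pass. It uses the failure-function form of KMP rather than the DFA from the question, and the names KmpCounter, Result, failure and count are made up for this illustration. Treating an empty pattern as "no matches" is also just an assumption.

```java
final class KmpCounter {
    /** Holder for the two numbers the question asks about. */
    static final class Result {
        final int occurrences;
        final long comparisons;
        Result(int occurrences, long comparisons) {
            this.occurrences = occurrences;
            this.comparisons = comparisons;
        }
    }

    /** fail[i] = length of the longest proper prefix of pat[0..i] that is also a suffix of it. */
    static int[] failure(String pat) {
        int[] fail = new int[pat.length()];
        for (int i = 1, k = 0; i < pat.length(); i++) {
            while (k > 0 && pat.charAt(i) != pat.charAt(k)) {
                k = fail[k - 1];
            }
            if (pat.charAt(i) == pat.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    /** Counts matches of pat in txt and the text-vs-pattern character comparisons made. */
    static Result count(String txt, String pat) {
        if (pat.isEmpty()) {
            return new Result(0, 0); // assumption: an empty pattern yields no matches
        }
        int[] fail = failure(pat);
        int occurrences = 0;
        long comparisons = 0;
        int k = 0; // number of pattern characters matched so far
        for (int i = 0; i < txt.length(); i++) {
            while (true) {
                comparisons++; // one comparison of txt[i] against pat[k]
                if (txt.charAt(i) == pat.charAt(k)) {
                    k++;
                    break;
                }
                if (k == 0) {
                    break; // mismatch at the first pattern character: move on in the text
                }
                k = fail[k - 1]; // fall back and retry the same text character
            }
            if (k == pat.length()) {
                occurrences++;
                k = fail[k - 1]; // keep the longest border so overlapping matches are found
            }
        }
        return new Result(occurrences, comparisons);
    }

    public static void main(String[] args) {
        Result r = count("aaaaaaaaaaaabbaaababbaaaaababbaaaababba", "babba");
        System.out.println("count: " + r.occurrences);
        System.out.println("match: " + r.comparisons);
    }
}
```

Only comparisons made during the search are counted here; the ones spent building the failure table would have to be added separately if they matter for your purposes.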

Related

KMP Algorithm for string search?

I found this very challenging coding problem online, which I thought I'd give a try.
The general idea is that, given a string of text T and a pattern P, you find the occurrences of the pattern, sum up their corresponding values, and return the max and min. If you want to read the problem in more detail, please refer to this.
Below is the code I've written. It works for a simple test case, but on multiple, complex test cases it is pretty slow, and I'm not sure where my code needs to be optimized.
Can anyone please help me see where I'm getting the logic wrong?
public class DeterminingDNAHealth {
private DeterminingDNAHealth() {
/*
* Fixme:
* Each DNA contains number of genes
* - some of them are beneficial and increase DNA's total health
* - Each Gene has a health value
* ======
* - Total health of DNA = sum of all health values of beneficial genes
*/
}
int checking(int start, int end, String pattern) {
String[] genesChar = new String[] {
"a",
"b",
"c",
"aa",
"d",
"b"
};
String numbers = "123456";
int total = 0;
for (int i = start; i <= end; i++) {
total += KMPAlgorithm.initiateAlgorithm(pattern, genesChar[i]) * (i + 1);
}
return total;
}
public static void main(String[] args) {
String[] genesChar = new String[] {
"a",
"b",
"c",
"aa",
"d",
"b"
};
Gene[] genes = new Gene[genesChar.length];
for (int i = 0; i < 6; i++) {
genes[i] = new Gene(genesChar[i], i + 1);
}
String[] checking = "15caaab 04xyz 24bcdybc".split(" ");
DeterminingDNAHealth DNA = new DeterminingDNAHealth();
int i, mostHealthiest, mostUnhealthiest;
mostHealthiest = Integer.MIN_VALUE;
mostUnhealthiest = Integer.MAX_VALUE;
for (i = 0; i < checking.length; i++) {
int start = Character.getNumericValue(checking[i].charAt(0));
int end = Character.getNumericValue(checking[i].charAt(1));
String pattern = checking[i].substring(2, checking[i].length());
int check = DNA.checking(start, end, pattern);
if (check > mostHealthiest)
mostHealthiest = check;
else
if (check < mostUnhealthiest)
mostUnhealthiest = check;
}
System.out.println(mostHealthiest + " " + mostUnhealthiest);
// DNA.checking(1,5, "caaab");
}
}
KMPAlgorithm
public class KMPAlgorithm {
KMPAlgorithm() {}
public static int initiateAlgorithm(String text, String pattern) {
// let us generate our LPC table from the pattern
int[] partialMatchTable = partialMatchTable(pattern);
int matchedOccurrences = 0;
// initially we don't have anything matched, so 0
int partialMatchLength = 0;
// we then start to loop through the text, !note, not the pattern. The text that we are testing the pattern on
for (int i = 0; i < text.length(); i++) {
// if there is a mismatch and there's no previous match, then we've hit the base-case, hence break from while{...}
while (partialMatchLength > 0 && text.charAt(i) != pattern.charAt(partialMatchLength)) {
/*
* otherwise, based on the number of chars matched, we decrement it by 1.
* In fact, this is the unique part of this algorithm. It is this part that we plan to skip partialMatchLength
* iterations. So if our partialMatchLength was 5, then we are going to skip (5 - 1) iteration.
*/
partialMatchLength = partialMatchTable[partialMatchLength - 1];
}
// if however we have a char that matches the current text[i]
if (text.charAt(i) == pattern.charAt(partialMatchLength)) {
// then increment position, so hence we check the next char of the pattern against the next char in text
partialMatchLength++;
// we will know that we're at the end of the pattern matching, if the matched length is same as the pattern length
if (partialMatchLength == pattern.length()) {
// to get the starting index of the matched pattern in text, apply this formula (i - (partialMatchLength - 1))
// this line increments when a match string occurs multiple times;
matchedOccurrences++;
// just before when we have a full matched pattern, we want to test for multiple occurrences, so we make
// our match length incomplete, and let it run longer.
partialMatchLength = partialMatchTable[partialMatchLength - 1];
}
}
}
return matchedOccurrences;
}
private static int[] partialMatchTable(String pattern) {
/*
* TODO
* Note:
* => Proper prefix: All the characters in a string, with one or more cut off the end.
* => proper suffix: All the characters in a string, with one or more cut off the beginning.
*
* 1.) Take the pattern and construct a partial match table
*
* To construct partial match table {
* 1. Loop through the String(pattern)
* 2. Create a table of size String(pattern).length
* 3. For each character c[i], get The length of the longest proper prefix in the (sub)pattern
* that matches a proper suffix in the same (sub)pattern
* }
*/
// we will need two incremental variables
int i, j;
// an LSP table also known as “longest suffix-prefix”
int[] LSP = new int[pattern.length()];
// our initial case is that the first element is set to 0
LSP[0] = 0;
// loop through the pattern...
for (i = 1; i < pattern.length(); i++) {
// set our j as previous elements data (not the index)
j = LSP[i - 1];
// we will be comparing previous and current elements data. ei char
char current = pattern.charAt(i), previous = pattern.charAt(j);
// we will have a case when we're somewhere in loop and two chars will not match, and j is not in base case.
while (j > 0 && current != previous)
// we decrement our j
j = LSP[j - 1];
// simply put, if two characters are same, then we update our LSP to say that at that point, we hold the j's value
if (current == previous)
// increment our j
j++;
// update the table
LSP[i] = j;
}
return LSP;
}
}
Source code credit to GitHub

You may try this KMP implementation. It is O(m+n), as KMP is intended to be. It should be a lot faster:
private static int[] failureFunction(char[] pattern) {
int m = pattern.length;
int[] f = new int[pattern.length];
f[0] = 0;
int i = 1;
int j = 0;
while (i < m) {
if (pattern[i] == pattern[j]) {
f[i] = j + 1;
i++;
j++;
} else if (j > 0) {
j = f[j - 1];
} else {
f[i] = 0;
i++;
}
}
return f;
}
private static int kmpMatch(char[] text, char[] pattern) {
int[] f = failureFunction(pattern);
int m = pattern.length;
int n = text.length;
int i = 0;
int j = 0;
while (i < n) {
if (pattern[j] == text[i]) {
if (j == m - 1){
return i - (m - 1);
} else {
i++;
j++;
}
} else if (j > 0) {
j = f[j - 1];
} else {
i++;
}
}
return -1;
}

pattern search in a text by using three methods

I want to write a program for pattern searching in a given text. It reads in a text and then one or more patterns, and for each pattern reports:
the position of the first occurrence in the text, if the pattern is found;
otherwise, the number of comparisons made by each of the methods: brute force, Boyer-Moore heuristics, and KMP.
But I don't know how to write my main class to get this output:
import java.util.HashMap;
import java.util.Map;
public class PatternSearch {
/** Returns the lowest index at which substring pattern begins in text (or else -1).*/
public static int findBrute(char[] text, char[] pattern) {
int n = text.length;
int m = pattern.length;
for (int i=0; i <= n - m; i++) { // try every starting index within text
int k = 0; // k is index into pattern
while (k < m && text[i+k] == pattern[k]) // kth character of pattern matches
k++;
if (k == m) // if we reach the end of the pattern,
return i; // substring text[i..i+m-1] is a match
}
return -1; // search failed
}
/** Returns the lowest index at which substring pattern begins in text (or else -1).*/
public static int findBoyerMoore(char[] text, char[] pattern) {
int n = text.length;
int m = pattern.length;
if (m == 0) return 0; // trivial search for empty string
Map<Character,Integer> last = new HashMap<>( ); // the 'last' map
for (int i=0; i < n; i++)
last.put(text[i], -1); // set -1 as default for all text characters
for (int k=0; k < m; k++)
last.put(pattern[k], k); // rightmost occurrence in pattern is last
// start with the end of the pattern aligned at index m-1 of the text
int i = m-1; // an index into the text
int k = m-1; // an index into the pattern
while (i < n) {
if (text[i] == pattern[k]) { // a matching character
if (k == 0) return i; // entire pattern has been found
i--; // otherwise, examine previous
k--; // characters of text/pattern
} else {
i += m - Math.min(k, 1 + last.get(text[i])); // case analysis for jump step
k = m - 1; // restart at end of pattern
}
}
return -1; // pattern was never found
}
/** Returns the lowest index at which substring pattern begins in text (or else -1).*/
public static int findKMP(char[] text, char[] pattern) {
int n = text.length;
int m = pattern.length;
if (m == 0) return 0; // trivial search for empty string
int[] fail = computeFailKMP(pattern); // computed by private utility
int j = 0; // index into text
int k = 0; // index into pattern
while (j < n) {
if (text[j] == pattern[k]) { // pattern[0..k] matched thus far
if (k == m - 1) return j - m + 1; // match is complete
j++; // otherwise, try to extend match
k++;
} else if (k > 0)
k = fail[k-1]; // reuse suffix of P[0..k-1]
else
j++;
}
return -1; // reached end without match
}
private static int[] computeFailKMP(char[] pattern) {
int m = pattern.length;
int[ ] fail = new int[m]; // by default, all overlaps are zero
int j = 1;
int k = 0;
while (j < m) { // compute fail[j] during this pass, if nonzero
if (pattern[j] == pattern[k]) { // k + 1 characters match thus far
fail[j] = k + 1;
j++;
k++;
} else if (k > 0) // k follows a matching prefix
k = fail[k-1];
else // no match found starting at j
j++;
}
return fail;
}
public static void main(String[] args) {
}
}

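Looks like you are migrating from C/C++ to Java.
Here is how to call the functions from the main method.
// Needs these imports at the top of the file: java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader
public static void main(String[] args) throws IOException {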
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.print("1) Please enter a string :");
String text = br.readLine();
System.out.print("2) Please enter a pattern :");
String pattern = br.readLine();
System.out.println(PatternSearch.findBoyerMoore(text.toCharArray(), pattern.toCharArray()));
System.out.println(PatternSearch.findBrute(text.toCharArray(), pattern.toCharArray()));
System.out.println(PatternSearch.findKMP(text.toCharArray(), pattern.toCharArray()));
}
Since all your methods are static, I called them using ClassName.method().
If your methods were non-static, you would have to create an instance of the class and then call the method on that instance:
PatternSearch instance = new PatternSearch();
instance.findKMP(text.toCharArray(), pattern.toCharArray());
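The question also asks for the number of comparisons, which the methods above do not report. One possible way to get it, sketched here for the brute-force method only, is to bump a counter at every character comparison. The class name CountingSearch and the static field comparisons are hypothetical names for this illustration, and the same idea would have to be applied by hand to findBoyerMoore and findKMP.

```java
final class CountingSearch {
    /** Character comparisons made by the most recent findBrute call (illustrative field). */
    static long comparisons;

    /** Same search as findBrute in the question, but counting text-vs-pattern comparisons. */
    static int findBrute(char[] text, char[] pattern) {
        comparisons = 0;
        int n = text.length;
        int m = pattern.length;
        for (int i = 0; i <= n - m; i++) {      // try every starting index within text
            int k = 0;                          // k is index into pattern
            while (k < m) {
                comparisons++;                  // one test of text[i+k] against pattern[k]
                if (text[i + k] != pattern[k]) {
                    break;
                }
                k++;
            }
            if (k == m) {
                return i;                       // substring text[i..i+m-1] is a match
            }
        }
        return -1;                              // search failed
    }

    public static void main(String[] args) {
        int pos = findBrute("abacaabaccabacabaabb".toCharArray(), "abacab".toCharArray());
        System.out.println("brute force: position " + pos + ", comparisons " + comparisons);
    }
}
```

A static counter keeps the sketch short but is not thread-safe; returning a small result object holding both the position and the count would be a cleaner design.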

UCF HSPT 2016 - Chomp Chomp

I am having a lot of trouble finding an efficient solution to Problem #9 in the UCF HSPT programming competition. The whole pdf can be viewed here, and the problem is called "Chomp Chomp!".
Essentially the problem involves taking 2 "chomps" out of an array, where each chomp is a contiguous run of elements from the array and the 2 chomps must have at least one element between them that's not "chomped." Once the two "chomps" are determined, the sum of all the elements in both "chomps" has to be a multiple of the number given in the input. My solution is essentially a brute-force search that goes through every possible "chomp", and I tried to improve its speed by storing previously calculated sums of chomps.
My code:
import java.util.Arrays;
import java.util.HashMap;
import java.util.Scanner;
public class chomp {
static long[] arr;
public static long sum(int start, int end) {
long ret = 0;
for(int i = start; i < end; i++) {
ret+=arr[i];
}
return ret;
}
public static int sumArray(int[] arr) {
int sum = 0;
for(int i = 0; i < arr.length; i++) {
sum+=arr[i];
}
return sum;
}
public static long numChomps(long[] arr, long divide) {
HashMap<String, Long> map = new HashMap<>();
int k = 1;
long numchomps = 0;
while(true) {
if (k > arr.length-2) break;
for (int i = 0; i < arr.length -2; i++) {
if ((i+k)>arr.length-2) break;
String one = i + "";
String two = (i+k) + "";
String key1 = one + " " + two;
long first = 0;
if(map.containsKey(key1)) {
//System.out.println("Key 1 found!");
first = map.get(key1).longValue();
} else {
first = sum(i, i+k);
map.put(key1, new Long(first));
}
int kk = 1;
while(true){
if (kk > arr.length-2) break;
for (int j = i+k+1; j < arr.length; j++) {
if((j+kk) > arr.length) break;
String o = j + "";
String t = (j+kk) + "";
String key2 = o + " " + t;
long last = 0;
if(map.containsKey(key2)) {
//System.out.println("Key 2 found!");
last = map.get(key2).longValue();
} else {
last = sum(j, j+kk);
map.put(key2, new Long(last));
}
if (((first+last) % divide) == 0) {
numchomps++;
}
}
kk++;
}
}
k++;
}
return numchomps;
}
public static void main(String[] args) {
// TODO Auto-generated method stub
Scanner in = new Scanner(System.in);
int n = Integer.parseInt(in.nextLine());
for(int i = 1; i <= n; i++) {
int length = in.nextInt();
long divide = in.nextLong();
in.nextLine();
arr = new long[length];
for(int j = 0; j < length; j++) {
arr[j] = (in.nextLong());
}
//System.out.println(arr);
in.nextLine();
long blah = numChomps(arr, divide);
System.out.println("Plate #"+i + ": " + blah);
}
}
}
My code gets the right answer, but seems to take too long, especially for large inputs where the size of the array is 1000 or greater. I tried to improve the speed of my algorithm by storing previously calculated sums in a HashMap, but that didn't improve the speed of my program considerably. What can I do to improve the speed so it runs in under 10 seconds?

The first source of inefficiency is the constant recalculation of sums. You should build an auxiliary array of partial sums, long[] partial, of length n + 1, where partial[i] is the sum of the first i elements; then instead of calling sum(i, i + k) you can simply compute partial[i + k] - partial[i].
Now the problem reduces to finding indices i<j<k<m such that
(partial[j] - partial[i] + partial[m] - partial[k]) % divide == 0
or, rearranging terms,
(partial[j] + partial[m]) % divide == (partial[i] + partial[k]) % divide
To find them you may consider an array of triples (s, i, j) where s = (partial[j] - partial[i]) % divide, stable sort it by s, and inspect equal ranges for non-overlapping "chomps".
This approach improves performance from O(n^4) to O(n^2 log n). From there you should be able to improve it to O(n log n).
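The partial-sum array mentioned at the start of this answer can be sketched as follows. This only covers the first step (constant-time range sums), not the sorting of triples; the class and method names are made up for the example, and the array is given n + 1 entries so that partial[0] = 0 stands for the empty prefix.

```java
final class PrefixSums {
    /** partial[i] = arr[0] + ... + arr[i-1]; partial[0] = 0 for the empty prefix. */
    static long[] build(long[] arr) {
        long[] partial = new long[arr.length + 1];
        for (int i = 0; i < arr.length; i++) {
            partial[i + 1] = partial[i] + arr[i];
        }
        return partial;
    }

    /** Sum of arr[start..end-1], the same half-open range as sum(start, end) in the question. */
    static long rangeSum(long[] partial, int start, int end) {
        return partial[end] - partial[start];
    }

    public static void main(String[] args) {
        long[] arr = {3, 1, 4, 1, 5};
        long[] partial = build(arr);
        System.out.println(rangeSum(partial, 1, 4)); // sum of arr[1], arr[2], arr[3]
    }
}
```

Building the array costs O(n) once, after which every chomp sum is a single subtraction, so the HashMap of string keys is no longer needed.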

How to find the longest substring with equal amount of characters efficiently

I have a string that consists of characters A,B,C and D and I am trying to calculate the length of the longest substring that has an equal amount of each one of these characters in any order.
For example ABCDB would return 4, ABCC 0 and ADDBCCBA 8.
My code currently:
public int longestSubstring(String word) {
HashMap<Integer, String> map = new HashMap<Integer, String>();
for (int i = 0; i<word.length()-3; i++) {
map.put(i, word.substring(i, i+4));
}
StringBuilder sb;
int longest = 0;
for (int i = 0; i<map.size(); i++) {
sb = new StringBuilder();
sb.append(map.get(i));
int a = 4;
while (i<map.size()-a) {
sb.append(map.get(i+a));
a+= 4;
}
String substring = sb.toString();
if (equalAmountOfCharacters(substring)) {
int length = substring.length();
if (length > longest)
longest = length;
}
}
return longest;
}
This currently works pretty well if the string length is 10^4 but I'm trying to make it 10^5. Any tips or suggestions would be appreciated.

Let's assume that cnt(c, i) is the number of occurrences of the character c in the prefix of length i.
A substring (low, high] has an equal amount of two characters a and b iff cnt(a, high) - cnt(a, low) = cnt(b, high) - cnt(b, low), or, put another way, cnt(b, high) - cnt(a, high) = cnt(b, low) - cnt(a, low). Thus, each position is described by the value of cnt(b, i) - cnt(a, i). Now we can generalize this to more than two characters: each position is described by a tuple (cnt(a_2, i) - cnt(a_1, i), ..., cnt(a_k, i) - cnt(a_1, i)), where a_1 ... a_k is the alphabet.
We can iterate over the given string and maintain the current tuple. At each step, we update the answer by checking the value of i - first_occurrence(current_tuple), where first_occurrence is a hash table that stores the first occurrence of each tuple seen so far. Do not forget to put a tuple of zeros into the hash map before iterating (it corresponds to the empty prefix).
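A rough Java sketch of this idea for the alphabet A to D might look like the following. The tuple is stored as three differences relative to A and encoded as a String key for the hash map; the class name BalancedSubstring, the helper key, and the decision to throw on any other character are assumptions made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class BalancedSubstring {
    static int longestSubstring(String word) {
        // diff = (cnt(B) - cnt(A), cnt(C) - cnt(A), cnt(D) - cnt(A)) for the current prefix
        int[] diff = new int[3];
        Map<String, Integer> firstOccurrence = new HashMap<>();
        firstOccurrence.put(key(diff), 0); // tuple of zeros for the empty prefix (length 0)
        int longest = 0;
        for (int i = 1; i <= word.length(); i++) { // i is the prefix length
            char c = word.charAt(i - 1);
            if (c == 'A') {
                diff[0]--;
                diff[1]--;
                diff[2]--;
            } else if (c >= 'B' && c <= 'D') {
                diff[c - 'B']++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
            // putIfAbsent returns the earlier prefix length if this tuple was seen before
            Integer first = firstOccurrence.putIfAbsent(key(diff), i);
            if (first != null) {
                longest = Math.max(longest, i - first);
            }
        }
        return longest;
    }

    /** Encodes the tuple as a map key; a dedicated key class would avoid the string building. */
    private static String key(int[] diff) {
        return diff[0] + "," + diff[1] + "," + diff[2];
    }

    public static void main(String[] args) {
        System.out.println(longestSubstring("ABCDB"));
        System.out.println(longestSubstring("ABCC"));
        System.out.println(longestSubstring("ADDBCCBA"));
    }
}
```

Each character is handled with one expected-constant-time map operation, so the whole scan is linear in the string length on average.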

If there were only A's and B's, then you could do something like this.
def longest_balanced(word):
    length = 0
    cumulative_difference = 0
    first_index = {0: -1}
    for index, letter in enumerate(word):
        if letter == 'A':
            cumulative_difference += 1
        elif letter == 'B':
            cumulative_difference -= 1
        else:
            raise ValueError(letter)
        if cumulative_difference in first_index:
            length = max(length, index - first_index[cumulative_difference])
        else:
            first_index[cumulative_difference] = index
    return length
Life is more complicated with all four letters, but the idea is much the same. Instead of keeping just one cumulative difference, for A's versus B's, we keep three, for A's versus B's, A's versus C's, and A's versus D's.

Well, first of all abstain from constructing any strings.
If you don't produce any (or nearly no) garbage, there's no need to collect it, which is a major plus.
Next, use a different data-structure:
I suggest 4 byte-arrays, storing the count of their respective symbol in the 4-span starting at the corresponding string-index.
That should speed it up considerably.

You can count the occurrences of the characters in word. Then, a possible solution could be:
1. If min is the minimum number of occurrences of any character in word, then min is also the maximum possible number of occurrences of each character in the substring we are looking for. In the code below, min is maxCount.
2. We iterate over decreasing values of maxCount. At every step, the string we are searching for will have length maxCount * alphabetSize. We can view this as the size of a sliding window we can slide over word.
3. We slide the window over word, counting the occurrences of the characters in the window. If the window is the substring we are searching for, we return the result. Otherwise, we keep searching.
[FIXED] The code:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount) {
int startIndex = 0;
boolean found = false;
while (startIndex + length <= word.length()) {
int[] count = new int[ALPHABET_SIZE];
for (int i = startIndex; i < startIndex + length; i++) {
char c = word.charAt(i);
int cnt = ++count[c - 'A'];
if (cnt > maxCount) {
break;
}
}
if (equalValues(count, maxCount)) {
found = true;
break;
} else {
startIndex++;
}
}
return found;
}
// Returns true if all values in c are equal to value
private boolean equalValues(int[] count, int value) {
boolean result = true;
for (int i : count) {
if (i != value) {
result = false;
break;
}
}
return result;
}
[MERGED] This is Hollis Waite's solution using cumulative counts, but taking my observations at points 1. and 2. into consideration. This may improve performance for some inputs:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[][] cumulativeCount = new int[ALPHABET_SIZE][];
for (int i = 0; i < ALPHABET_SIZE; i++) {
cumulativeCount[i] = new int[word.length() + 1];
}
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
for (int j = 0; j < ALPHABET_SIZE; j++) {
cumulativeCount[j][i + 1] = count[j];
}
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount, cumulativeCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount, int[][] cumulativeCount) {
int startIndex = 0;
int endIndex = startIndex + length; // exclusive end, so the window holds exactly length characters
while (endIndex <= word.length()) {
boolean found = true; // reset for every window position
for (int i = 0; i < ALPHABET_SIZE; i++) {
if (cumulativeCount[i][endIndex] - cumulativeCount[i][startIndex] != maxCount) {
found = false;
break;
}
}
if (found) {
return true;
}
startIndex++;
endIndex++;
}
return false; // no window of this length is balanced
}

You'll probably want to cache cumulative counts of characters for each index of String -- that's where the real bottleneck is. Haven't thoroughly tested but something like the below should work.
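import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;
public class Test {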
static final int LEN = 4;
static class RandomCharSequence implements CharSequence {
private final Random mRandom = new Random();
private final int mAlphabetLen;
private final int mLen;
private final int mOffset;
RandomCharSequence(int pLen, int pOffset, int pAlphabetLen) {
mAlphabetLen = pAlphabetLen;
mLen = pLen;
mOffset = pOffset;
}
public int length() {return mLen;}
public char charAt(int pIdx) {
mRandom.setSeed(mOffset + pIdx);
return (char) (
'A' +
(mRandom.nextInt() % mAlphabetLen + mAlphabetLen) % mAlphabetLen
);
}
public CharSequence subSequence(int pStart, int pEnd) {
return new RandomCharSequence(pEnd - pStart, pStart, mAlphabetLen);
}
@Override public String toString() {
return (new StringBuilder(this)).toString();
}
}
public static void main(String[] pArgs) {
Stream.of("ABCDB", "ABCC", "ADDBCCBA", "DADDBCCBA").forEach(
pWord -> System.out.println(longestSubstring(pWord))
);
for (int i = 0; ; i++) {
final double len = Math.pow(10, i);
if (len >= Integer.MAX_VALUE) break;
System.out.println("Str len 10^" + i);
for (int alphabetLen = 1; alphabetLen <= LEN; alphabetLen++) {
final Instant start = Instant.now();
final int val = longestSubstring(
new RandomCharSequence((int) len, 0, alphabetLen)
);
System.out.println(
String.format(
" alphabet len %d; result %08d; time %s",
alphabetLen,
val,
formatMillis(ChronoUnit.MILLIS.between(start, Instant.now()))
)
);
}
}
}
static String formatMillis(long millis) {
return String.format(
"%d:%02d:%02d.%03d",
TimeUnit.MILLISECONDS.toHours(millis),
TimeUnit.MILLISECONDS.toMinutes(millis) -
TimeUnit.HOURS.toMinutes(TimeUnit.MILLISECONDS.toHours(millis)),
TimeUnit.MILLISECONDS.toSeconds(millis) -
TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(millis)),
TimeUnit.MILLISECONDS.toMillis(millis) -
TimeUnit.SECONDS.toMillis(TimeUnit.MILLISECONDS.toSeconds(millis))
);
}
static int longestSubstring(CharSequence pWord) {
// create array that stores cumulative char counts at each index of string
// idx 0 = char (A-D); idx 1 = offset
final int[][] cumulativeCnts = new int[LEN][];
for (int i = 0; i < LEN; i++) {
cumulativeCnts[i] = new int[pWord.length() + 1];
}
final int[] cumulativeCnt = new int[LEN];
for (int i = 0; i < pWord.length(); i++) {
cumulativeCnt[pWord.charAt(i) - 'A']++;
for (int j = 0; j < LEN; j++) {
cumulativeCnts[j][i + 1] = cumulativeCnt[j];
}
}
final int maxResult = Arrays.stream(cumulativeCnt).min().orElse(0) * LEN;
if (maxResult == 0) return 0;
int result = 0;
for (int initialOffset = 0; initialOffset < LEN; initialOffset++) {
for (
int start = initialOffset;
start < pWord.length() - result;
start += LEN
) {
endLoop:
for (
int end = start + result + LEN;
end <= pWord.length() && end - start <= maxResult;
end += LEN
) {
final int substrLen = end - start;
final int expectedCharCnt = substrLen / LEN;
for (int i = 0; i < LEN; i++) {
if (
cumulativeCnts[i][end] - cumulativeCnts[i][start] !=
expectedCharCnt
) {
continue endLoop;
}
}
if (substrLen > result) result = substrLen;
}
}
}
return result;
}
}

Suppose there are K possible letters in a string of length N. We could track the balance of letters seen with a vector pos of length K that is updated as follows:
If letter 1 is seen, add (K-1, -1, -1, ...)
If letter 2 is seen, add (-1, K-1, -1, ...)
If letter 3 is seen, add (-1, -1, K-1, ...)
Maintain a hash that maps pos to the first string position where pos is reached. A balanced substring occurs whenever hash[pos] already exists; if x is the current string position, that substring is s[hash[pos]+1 : x+1].
The cost of maintaining the hash is O(log N) so processing the string takes O(N log N). How does this compare with solutions so far? These types of problems tend to have linear solutions but I haven't come across one yet.
Here's some code demonstrating the idea for 3 letters and a run using biased random strings. (Uniform random strings allow for solutions that are around half the string length, which is unwieldy to print).
#!/usr/bin/env python3
import random
from time import time
alphabet = "abc"
DIM = len(alphabet)
def random_string(n):
    # return a random string over choices[] of length n
    # distribution of letters is non-uniform to make matches harder to find
    choices = "aabbc"
    s = ''
    for i in range(n):
        r = random.randint(0, len(choices) - 1)
        s += choices[r]
    return s
def validate(s):
    # verify frequencies of each letter are the same
    f = [0, 0, 0]
    a2f = {alphabet[i] : i for i in range(DIM)}
    for c in s:
        f[a2f[c]] += 1
    assert f[0] == f[1] and f[1] == f[2]
def longest_balanced(s):
    """return length of longest substring of s containing equal
    populations of each letter in alphabet"""
    slen = len(s)
    p = [0 for i in range(DIM)]
    vec = {alphabet[0] : [2, -1, -1],
           alphabet[1] : [-1, 2, -1],
           alphabet[2] : [-1, -1, 2]}
    x = -1
    best = -1
    hist = {str([0, 0, 0]) : -1}
    for c in s:
        x += 1
        p = [p[i] + vec[c][i] for i in range(DIM)]
        pkey = str(p)
        if pkey not in hist:
            hist[pkey] = x
        else:
            span = x - hist[pkey]
            assert span % DIM == 0
            if span > best:
                best = span
                cand = s[hist[pkey] + 1: x + 1]
                print("best so far %d = [%d,%d]: %s" % (best,
                                                        hist[pkey] + 1,
                                                        x + 1,
                                                        cand))
                validate(cand)
    return best if best > -1 else 0
def main():
    #print(longest_balanced("aaabcabcbbcc"))
    t0 = time()
    s = random_string(1000000)
    print("generate time:", time() - t0)
    t1 = time()
    best = longest_balanced(s)
    print("best:", best)
    print("elapsed:", time() - t1)
main()
Sample run on an input of 10^6 letters with an alphabet of 3 letters:
$ ./bal.py
...
best so far 189 = [847894,848083]: aacacbcbabbbcabaabbbaabbbaaaacbcaaaccccbcbcbababaabbccccbbabbacabbbbbcaacacccbbaacbabcbccaabaccabbbbbababbacbaaaacabcbabcbccbabbccaccaabbcabaabccccaacccccbaacaaaccbbcbcabcbcacaabccbacccacca
best: 189
elapsed: 1.43609690666

Big O for this code

The code below is from the TopCoder website. I was trying to figure out its time complexity. There is one for loop and one while loop in the method isRandom, and one for loop in the method diff. I guess the worst-case scenario would be O(n^2). Is that correct?
public class CDPlayer {
private boolean[] used;
public boolean diff(String str, int from, int to) {
Arrays.fill(used, false);
to = Math.min(to, str.length());
for (int i = from; i < to; i++) {
if (used[str.charAt(i) - 'A']) {
return false;
}
used[str.charAt(i) - 'A'] = true;
}
return true;
}
public int isRandom(String[] songlist, int n){
String str = "";
for (int i = 0; i < songlist.length; i++) {
str += songlist[i];
}
used = new boolean[26];
for (int i = 0; i < n; i++) {
if (!diff(str, 0, i)) {
continue;
}
int j = i;
boolean bad = false;
while (j < str.length()) {
if (!diff(str, j, j + n)) {
bad = true;
break;
}
j += n;
}
if (bad) {
continue;
}
return i;
}
return -1;
}
}

I worked out something like O(S) + O(n^2) + O(SS)*O(n^2), where
S = songlist.length and SS = the sum of all song lengths. So the complexity depends on several inputs and can't be expressed in terms of a single value.
P.S. Note that String is an immutable object, so it is better to use StringBuilder.
Before:
String str = "";
for (int i = 0; i < songlist.length; i++) {
str += songlist[i];
}
After:
StringBuilder builder = new StringBuilder();
for (int i = 0; i < songlist.length; i++) {
builder.append(songlist[i]);
}
That way you won't create a new String object on each iteration.

As "n" is not the size of the input, it cannot really be O(n) or O(n^2).
If m is the total length of all strings in songlist, then you are jumping over that string in steps of size n. So the complexity is related to m, not to n. I have not calculated anything in big O for a few decades, but I would assume the complexity is O(m).
