How is the Boyer Moore offset table created (Wiki vcode)?


I've been trying to get my head around the Boyer-Moore algorithm. I am going through the bad-character heuristic code in Java given on Wikipedia.
In theory I understand what the algorithm is doing, but I am not able to wrap my head around the preprocessing tables.
This is the code:
/**
* Returns the index within this string of the first occurrence of the
* specified substring. If it is not a substring, return -1.
*
* There is no Galil because it only generates one match.
*
* @param haystack The string to be scanned
* @param needle The target string to search
* @return The start index of the substring
*/
public static int indexOf(char[] haystack, char[] needle) {
if (needle.length == 0) {
return 0;
}
int charTable[] = makeCharTable(needle);
int offsetTable[] = makeOffsetTable(needle);
for (int i = needle.length - 1, j; i < haystack.length;) {
for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
if (j == 0) {
return i;
}
}
// i += needle.length - j; // For naive method
i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]);
}
return -1;
}
/**
* Makes the jump table based on the mismatched character information.
*/
private static int[] makeCharTable(char[] needle) {
final int ALPHABET_SIZE = Character.MAX_VALUE + 1; // 65536
int[] table = new int[ALPHABET_SIZE];
for (int i = 0; i < table.length; ++i) {
table[i] = needle.length;
}
for (int i = 0; i < needle.length; ++i) {
table[needle[i]] = needle.length - 1 - i;
}
return table;
}
/**
* Makes the jump table based on the scan offset which mismatch occurs.
* (bad character rule).
*/
private static int[] makeOffsetTable(char[] needle) {
int[] table = new int[needle.length];
int lastPrefixPosition = needle.length;
for (int i = needle.length; i > 0; --i) {
if (isPrefix(needle, i)) {
lastPrefixPosition = i;
}
table[needle.length - i] = lastPrefixPosition - i + needle.length;
}
for (int i = 0; i < needle.length - 1; ++i) {
int slen = suffixLength(needle, i);
table[slen] = needle.length - 1 - i + slen;
}
return table;
}
/**
* Is needle[p:end] a prefix of needle?
*/
private static boolean isPrefix(char[] needle, int p) {
for (int i = p, j = 0; i < needle.length; ++i, ++j) {
if (needle[i] != needle[j]) {
return false;
}
}
return true;
}
/**
* Returns the maximum length of the substring ends at p and is a suffix.
* (good suffix rule)
*/
private static int suffixLength(char[] needle, int p) {
int len = 0;
for (int i = p, j = needle.length - 1;
i >= 0 && needle[i] == needle[j]; --i, --j) {
len += 1;
}
return len;
}
I am not able to understand what the preprocessing function makeOffsetTable is doing.
It seems that makeCharTable is all the preprocessing that is needed, but there is a second preprocessing step that I don't understand. Could someone explain what it is doing?
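One way to see what the second table contributes is to print it for a tiny needle. The sketch below copies makeOffsetTable, isPrefix and suffixLength from the code above unchanged; the class name OffsetTableDemo, the needle "ABAB" and the two "Case" comments are my own additions. As far as I can tell, this table implements the good suffix rule (the comment on suffixLength says as much), even though the comment on makeOffsetTable mentions the bad character rule.

```java
import java.util.Arrays;

// Hypothetical demo class; the three helper methods are copied from the question.
final class OffsetTableDemo {

    // table[k] = how far to advance i after k characters matched from the right
    static int[] makeOffsetTable(char[] needle) {
        int[] table = new int[needle.length];
        int lastPrefixPosition = needle.length;
        // Case 1: only a prefix of the needle can line up with the matched suffix
        for (int i = needle.length; i > 0; --i) {
            if (isPrefix(needle, i)) {
                lastPrefixPosition = i;
            }
            table[needle.length - i] = lastPrefixPosition - i + needle.length;
        }
        // Case 2: a suffix of length slen also ends at position i inside the needle
        for (int i = 0; i < needle.length - 1; ++i) {
            int slen = suffixLength(needle, i);
            table[slen] = needle.length - 1 - i + slen;
        }
        return table;
    }

    // Is needle[p:end] a prefix of needle?
    static boolean isPrefix(char[] needle, int p) {
        for (int i = p, j = 0; i < needle.length; ++i, ++j) {
            if (needle[i] != needle[j]) {
                return false;
            }
        }
        return true;
    }

    // Length of the longest substring that ends at p and is also a suffix
    static int suffixLength(char[] needle, int p) {
        int len = 0;
        for (int i = p, j = needle.length - 1;
                i >= 0 && needle[i] == needle[j]; --i, --j) {
            len += 1;
        }
        return len;
    }

    public static void main(String[] args) {
        // prints [1, 5, 4, 5]
        System.out.println(Arrays.toString(makeOffsetTable("ABAB".toCharArray())));
    }
}
```

Reading the output: the entry at index 2 is 4. After "AB" has matched at the end, i sits two positions left of the needle's end, so advancing i by 4 slides the needle right by 2, which lines the leading "AB" up with the text that just matched.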

Related

How do I reduce the code complexity of this task? [closed]

Basically, I have this task where I need to find the median of an array based on the sum of the elements to its left and the sum of the elements to its right.
If an array is:
1 2 3 4 5
The output here should be 4, because the left sum is 6 and the right sum is 5.
This is the code that solves this task:
/**
* This method calculates the smallest difference and returns its index.
*
* @param array The array being targeted by the method.
* @return The index of the smallest difference as an integer.
*/
private static int findSmallestDifference(int[] array) {
if (array == null || array.length == 0) {
return -1;
}
int smallestDifferenceIndex = 0;
int currentDifference = Integer.MAX_VALUE;
for (int i = 0; i < array.length; ++i) {
int currentElement = Math.abs(array[i]);
if (currentElement < currentDifference) {
smallestDifferenceIndex = i;
currentDifference = currentElement;
} else if (currentElement == currentDifference
&& array[i] > 0
&& array[smallestDifferenceIndex] < 0) {
smallestDifferenceIndex = i;
}
}
return smallestDifferenceIndex + 1;
}
/**
* This method calculates the left and the right sum of the array.
*
* @param array The array being targeted by the method.
* @return The median of the array as an integer.
*/
public static int getMedian(int[] array) {
if (array == null || array.length == 0) {
return -1;
}
int[] prefix = Arrays.copyOf(array, array.length);
int[] suffix = Arrays.copyOf(array, array.length);
int[] difference = new int[array.length];
for (int i = 1; i < array.length; i++) {
prefix[i] = prefix[i] + prefix[i - 1];
}
for (int i = array.length - 1; i > 0; i--) {
suffix[i - 1] = suffix[i] + suffix[i - 1];
}
for (int i = 0; i < array.length; i++) {
difference[i] = Math.abs(prefix[i] - suffix[i]);
}
return findSmallestDifference(difference);
}
My question is how can I simplify this solution?

public static int findMedian(int[] arr) {
int[] leftSum = new int[arr.length];
for (int i = 0; i < arr.length; i++)
leftSum[i] = i == 0 ? arr[i] : leftSum[i - 1] + arr[i];
int rightSum = 0;
int minDif = Integer.MAX_VALUE;
int median = 0;
for (int i = arr.length - 2; i - 1 >= 0; i--) {
rightSum += arr[i + 1];
int dif = Math.abs(leftSum[i - 1] - rightSum);
if (dif < minDif) {
median = arr[i];
minDif = dif;
}
}
return median;
}

(Simply name prefix and suffix leftSum and rightSum.)
The differences do not need to be kept in an array; the smallest difference can be tracked while iterating. If the array were sorted, you could have interpolated the smallest difference.
Arrays.copyOf is redundant; maybe it saves a line of code. new int[array.length] and a walking int sum = 0; would do too.
There is a problem in the calculation of the smallest difference:
if (array == null) {
return -1;
}
int smallestDifferenceIndex = -1;
int currentDifference = Integer.MAX_VALUE;
for (int i = 0; i < array.length; ++i) {
int left = i <= 0 ? 0 : prefix[i - 1];
int right = i >= array.length - 1 ? 0 : suffix[i + 1];
int currentElement = Math.abs(left - right);
if (currentElement <= currentDifference) {
smallestDifferenceIndex = i;
currentDifference = currentElement;
}
}
return smallestDifferenceIndex;
By the way, Java, as one successor of C++, favored less nesting of { } and adopted the convention of 4 spaces per indentation level.
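The walking-sum idea above can be taken one step further: with the total known up front, neither a prefix nor a suffix array is needed. This is only a sketch under my own assumptions: the class and method names are hypothetical, it returns a 0-based index, and on ties it keeps the first index (the snippet above uses <= and so keeps the last).

```java
// Hypothetical helper illustrating the single-pass "walking sum" approach.
final class BalancePoint {

    /** Returns the 0-based index whose left and right sums differ least, or -1 for null/empty input. */
    static int balanceIndex(int[] array) {
        if (array == null || array.length == 0) {
            return -1;
        }
        long total = 0;
        for (int value : array) {
            total += value;
        }
        long left = 0;                       // sum of the elements before index i
        long smallest = Long.MAX_VALUE;
        int smallestIndex = -1;
        for (int i = 0; i < array.length; i++) {
            long right = total - left - array[i];   // sum of the elements after index i
            long difference = Math.abs(left - right);
            if (difference < smallest) {            // strict '<' keeps the first index on ties
                smallest = difference;
                smallestIndex = i;
            }
            left += array[i];
        }
        return smallestIndex;
    }

    public static void main(String[] args) {
        int[] sample = {1, 2, 3, 4, 5};
        // prints 3, the index of the element 4 (left sum 6, right sum 5)
        System.out.println(balanceIndex(sample));
    }
}
```

The sums are kept in long so that large int inputs do not overflow while accumulating.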

My solution (returning the index): start at the ends and walk inward, from the side where expected difference is smaller:
private static final int find(int... array) {
var left = 0;
var right = array.length-1;
var sumL = 0;
var sumR = 0;
while (left < right) {
var numL = array[left];
var numR = array[right];
if (abs((sumL+numL)-sumR) < abs(sumL-(sumR+numR))) {
sumL += numL;
left += 1;
} else {
sumR += numR;
right -= 1;
}
}
return right;
}
This is not expected to work if the array contains negative numbers (or zeros?)
EDIT:
If you need the index starting at 1 (not Java-like), just add 1:
. . .
return right + 1; // zero if the array is empty
}
BONUS:
Here is my brute-force solution, used for testing*:
private static int brute(int... array) {
var result = -1;
var min = Integer.MAX_VALUE;
for (var i = 0; i < array.length; i++) {
var l = IntStream.range(0, i).map(j -> array[j]).sum();
var r = IntStream.range(i+1, array.length).map(j -> array[j]).sum();
var d = abs(l - r);
if (d < min) {
result = i;
min = d;
}
}
return result + 1; // zero if array is empty
}
* After the comments, I also used the code from the question to check.

How to find the longest substring with equal amount of characters efficiently

I have a string that consists of characters A,B,C and D and I am trying to calculate the length of the longest substring that has an equal amount of each one of these characters in any order.
For example ABCDB would return 4, ABCC 0 and ADDBCCBA 8.
My code currently:
public int longestSubstring(String word) {
HashMap<Integer, String> map = new HashMap<Integer, String>();
for (int i = 0; i<word.length()-3; i++) {
map.put(i, word.substring(i, i+4));
}
StringBuilder sb;
int longest = 0;
for (int i = 0; i<map.size(); i++) {
sb = new StringBuilder();
sb.append(map.get(i));
int a = 4;
while (i<map.size()-a) {
sb.append(map.get(i+a));
a+= 4;
}
String substring = sb.toString();
if (equalAmountOfCharacters(substring)) {
int length = substring.length();
if (length > longest)
longest = length;
}
}
return longest;
}
This currently works pretty well if the string length is 10^4 but I'm trying to make it 10^5. Any tips or suggestions would be appreciated.

Let's assume that cnt(c, i) is the number of occurrences of the character c in the prefix of length i.
A substring (low, high] has an equal amount of two characters a and b iff cnt(a, high) - cnt(a, low) = cnt(b, high) - cnt(b, low), or, to put it another way, cnt(b, high) - cnt(a, high) = cnt(b, low) - cnt(a, low). Thus, each position is described by the value of cnt(b, i) - cnt(a, i). Now we can generalize this to more than two characters: each position is described by a tuple (cnt(a_2, i) - cnt(a_1, i), ..., cnt(a_k, i) - cnt(a_1, i)), where a_1 ... a_k is the alphabet.
We can iterate over the given string and maintain the current tuple. At each step, we update the answer by checking the value of i - first_occurrence(current_tuple), where first_occurrence is a hash table that stores the first occurrence of each tuple seen so far. Do not forget to put a tuple of zeros into the hash map before iterating (it corresponds to the empty prefix).
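The description above could be sketched in Java roughly as follows, for the four letters A to D. The class name, the method name and the choice of a comma-joined string as the hash key are my own assumptions, not part of the answer; any hashable tuple representation would do, and input containing other characters is not handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "first occurrence of each difference tuple" idea.
final class BalancedSubstring {

    static int longestBalanced(String word) {
        int[] cnt = new int[4];                       // cnt[c] = occurrences of 'A' + c so far
        Map<String, Integer> firstOccurrence = new HashMap<>();
        firstOccurrence.put("0,0,0", 0);              // the empty prefix has the all-zero tuple
        int best = 0;
        for (int i = 0; i < word.length(); i++) {
            cnt[word.charAt(i) - 'A']++;              // assumes only 'A'..'D' occur
            // tuple (cnt(B) - cnt(A), cnt(C) - cnt(A), cnt(D) - cnt(A)) for the prefix of length i + 1
            String key = (cnt[1] - cnt[0]) + "," + (cnt[2] - cnt[0]) + "," + (cnt[3] - cnt[0]);
            Integer first = firstOccurrence.putIfAbsent(key, i + 1);
            if (first != null) {
                // same tuple at prefix lengths first and i + 1: the characters in between are balanced
                best = Math.max(best, i + 1 - first);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestBalanced("ABCDB"));    // prints 4
        System.out.println(longestBalanced("ABCC"));     // prints 0
        System.out.println(longestBalanced("ADDBCCBA")); // prints 8
    }
}
```

Each character costs one hash lookup, so the expected running time is linear in the length of the string.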

If there were only A's and B's, then you could do something like this.
def longest_balanced(word):
    length = 0
    cumulative_difference = 0
    first_index = {0: -1}
    for index, letter in enumerate(word):
        if letter == 'A':
            cumulative_difference += 1
        elif letter == 'B':
            cumulative_difference -= 1
        else:
            raise ValueError(letter)
        if cumulative_difference in first_index:
            length = max(length, index - first_index[cumulative_difference])
        else:
            first_index[cumulative_difference] = index
    return length
Life is more complicated with all four letters, but the idea is much the same. Instead of keeping just one cumulative difference, for A's versus B's, we keep three, for A's versus B's, A's versus C's, and A's versus D's.

Well, first of all abstain from constructing any strings.
If you don't produce any (or nearly no) garbage, there's no need to collect it, which is a major plus.
Next, use a different data-structure:
I suggest 4 byte-arrays, storing the count of their respective symbol in the 4-span starting at the corresponding string-index.
That should speed it up considerably.

You can count the occurrences of the characters in word. Then, a possible solution could be:
1. If min is the minimum number of occurrences of any character in word, then min is also the maximum possible number of occurrences of each character in the substring we are looking for. In the code below, min is maxCount.
2. We iterate over decreasing values of maxCount. At every step, the string we are searching for will have length maxCount * alphabetSize. We can view this as the size of a sliding window we can slide over word.
3. We slide the window over word, counting the occurrences of the characters in the window. If the window is the substring we are searching for, we return the result. Otherwise, we keep searching.
[FIXED] The code:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount) {
int startIndex = 0;
boolean found = false;
while (startIndex + length <= word.length()) {
int[] count = new int[ALPHABET_SIZE];
for (int i = startIndex; i < startIndex + length; i++) {
char c = word.charAt(i);
int cnt = ++count[c - 'A'];
if (cnt > maxCount) {
break;
}
}
if (equalValues(count, maxCount)) {
found = true;
break;
} else {
startIndex++;
}
}
return found;
}
// Returns true if all values in c are equal to value
private boolean equalValues(int[] count, int value) {
boolean result = true;
for (int i : count) {
if (i != value) {
result = false;
break;
}
}
return result;
}
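The findSubstring above recounts every window from scratch. If that turns out to be the bottleneck, the window can be slid incrementally instead, adding the character that enters and removing the one that leaves. A minimal sketch, assuming only the letters A to D occur; the class and method names are hypothetical and this is not the answer's own code:

```java
// Hypothetical sketch: fixed-length window whose counts are updated incrementally.
final class SlidingWindowCheck {
    private static final int ALPHABET_SIZE = 4;

    /** Returns true if some window of the given length contains each letter equally often. */
    static boolean hasBalancedWindow(String word, int length) {
        if (length <= 0 || length % ALPHABET_SIZE != 0 || length > word.length()) {
            return false;
        }
        int target = length / ALPHABET_SIZE;
        int[] count = new int[ALPHABET_SIZE];
        for (int i = 0; i < word.length(); i++) {
            count[word.charAt(i) - 'A']++;                // character entering the window
            if (i >= length) {
                count[word.charAt(i - length) - 'A']--;   // character leaving the window
            }
            if (i >= length - 1 && allEqual(count, target)) {
                return true;
            }
        }
        return false;
    }

    private static boolean allEqual(int[] count, int value) {
        for (int c : count) {
            if (c != value) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasBalancedWindow("ABCDB", 4));    // prints true
        System.out.println(hasBalancedWindow("ADDBCCBA", 4)); // prints false
        System.out.println(hasBalancedWindow("ADDBCCBA", 8)); // prints true
    }
}
```

Each call is then linear in the length of word, independent of the window length.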
[MERGED] This is Hollis Waite's solution using cumulative counts, but taking my observations at points 1. and 2. into consideration. This may improve performance for some inputs:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[][] cumulativeCount = new int[ALPHABET_SIZE][];
for (int i = 0; i < ALPHABET_SIZE; i++) {
cumulativeCount[i] = new int[word.length() + 1];
}
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
for (int j = 0; j < ALPHABET_SIZE; j++) {
cumulativeCount[j][i + 1] = count[j];
}
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount, cumulativeCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount, int[][] cumulativeCount) {
int startIndex = 0;
// endIndex is exclusive: cumulativeCount[i][k] counts the first k characters
int endIndex = startIndex + length;
boolean found = false;
while (endIndex <= word.length()) {
// reset for every window; otherwise a single mismatch would stick for all later windows
found = true;
for (int i = 0; i < ALPHABET_SIZE; i++) {
if (cumulativeCount[i][endIndex] - cumulativeCount[i][startIndex] != maxCount) {
found = false;
break;
}
}
if (found) {
break;
} else {
startIndex++;
endIndex++;
}
}
return found;
}

You'll probably want to cache cumulative counts of characters for each index of String -- that's where the real bottleneck is. Haven't thoroughly tested but something like the below should work.
public class Test {
static final int LEN = 4;
static class RandomCharSequence implements CharSequence {
private final Random mRandom = new Random();
private final int mAlphabetLen;
private final int mLen;
private final int mOffset;
RandomCharSequence(int pLen, int pOffset, int pAlphabetLen) {
mAlphabetLen = pAlphabetLen;
mLen = pLen;
mOffset = pOffset;
}
public int length() {return mLen;}
public char charAt(int pIdx) {
mRandom.setSeed(mOffset + pIdx);
return (char) (
'A' +
(mRandom.nextInt() % mAlphabetLen + mAlphabetLen) % mAlphabetLen
);
}
public CharSequence subSequence(int pStart, int pEnd) {
return new RandomCharSequence(pEnd - pStart, pStart, mAlphabetLen);
}
@Override public String toString() {
return (new StringBuilder(this)).toString();
}
}
public static void main(String[] pArgs) {
Stream.of("ABCDB", "ABCC", "ADDBCCBA", "DADDBCCBA").forEach(
pWord -> System.out.println(longestSubstring(pWord))
);
for (int i = 0; ; i++) {
final double len = Math.pow(10, i);
if (len >= Integer.MAX_VALUE) break;
System.out.println("Str len 10^" + i);
for (int alphabetLen = 1; alphabetLen <= LEN; alphabetLen++) {
final Instant start = Instant.now();
final int val = longestSubstring(
new RandomCharSequence((int) len, 0, alphabetLen)
);
System.out.println(
String.format(
" alphabet len %d; result %08d; time %s",
alphabetLen,
val,
formatMillis(ChronoUnit.MILLIS.between(start, Instant.now()))
)
);
}
}
}
static String formatMillis(long millis) {
return String.format(
"%d:%02d:%02d.%03d",
TimeUnit.MILLISECONDS.toHours(millis),
TimeUnit.MILLISECONDS.toMinutes(millis) -
TimeUnit.HOURS.toMinutes(TimeUnit.MILLISECONDS.toHours(millis)),
TimeUnit.MILLISECONDS.toSeconds(millis) -
TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(millis)),
TimeUnit.MILLISECONDS.toMillis(millis) -
TimeUnit.SECONDS.toMillis(TimeUnit.MILLISECONDS.toSeconds(millis))
);
}
static int longestSubstring(CharSequence pWord) {
// create array that stores cumulative char counts at each index of string
// idx 0 = char (A-D); idx 1 = offset
final int[][] cumulativeCnts = new int[LEN][];
for (int i = 0; i < LEN; i++) {
cumulativeCnts[i] = new int[pWord.length() + 1];
}
final int[] cumulativeCnt = new int[LEN];
for (int i = 0; i < pWord.length(); i++) {
cumulativeCnt[pWord.charAt(i) - 'A']++;
for (int j = 0; j < LEN; j++) {
cumulativeCnts[j][i + 1] = cumulativeCnt[j];
}
}
final int maxResult = Arrays.stream(cumulativeCnt).min().orElse(0) * LEN;
if (maxResult == 0) return 0;
int result = 0;
for (int initialOffset = 0; initialOffset < LEN; initialOffset++) {
for (
int start = initialOffset;
start < pWord.length() - result;
start += LEN
) {
endLoop:
for (
int end = start + result + LEN;
end <= pWord.length() && end - start <= maxResult;
end += LEN
) {
final int substrLen = end - start;
final int expectedCharCnt = substrLen / LEN;
for (int i = 0; i < LEN; i++) {
if (
cumulativeCnts[i][end] - cumulativeCnts[i][start] !=
expectedCharCnt
) {
continue endLoop;
}
}
if (substrLen > result) result = substrLen;
}
}
}
return result;
}
}

Suppose there are K possible letters in a string of length N. We could track the balance of letters seen with a vector pos of length K that is updated as follows:
If letter 1 is seen, add (K-1, -1, -1, ...)
If letter 2 is seen, add (-1, K-1, -1, ...)
If letter 3 is seen, add (-1, -1, K-1, ...)
Maintain a hash that maps pos to the first string position where pos is reached. A balanced substring occurs whenever hash[pos] already exists at the current position i; the substring is then s[hash[pos]+1 : i+1].
The cost of maintaining the hash is O(log N) so processing the string takes O(N log N). How does this compare with solutions so far? These types of problems tend to have linear solutions but I haven't come across one yet.
Here's some code demonstrating the idea for 3 letters and a run using biased random strings. (Uniform random strings allow for solutions that are around half the string length, which is unwieldy to print).
#!/usr/bin/python
import random
from time import time
alphabet = "abc"
DIM = len(alphabet)
def random_string(n):
    # return a random string over choices[] of length n
    # distribution of letters is non-uniform to make matches harder to find
    choices = "aabbc"
    s = ''
    for i in range(n):
        r = random.randint(0, len(choices) - 1)
        s += choices[r]
    return s
def validate(s):
    # verify frequencies of each letter are the same
    f = [0, 0, 0]
    a2f = {alphabet[i] : i for i in range(DIM)}
    for c in s:
        f[a2f[c]] += 1
    assert f[0] == f[1] and f[1] == f[2]
def longest_balanced(s):
    """return length of longest substring of s containing equal
    populations of each letter in alphabet"""
    slen = len(s)
    p = [0 for i in range(DIM)]
    vec = {alphabet[0] : [2, -1, -1],
           alphabet[1] : [-1, 2, -1],
           alphabet[2] : [-1, -1, 2]}
    x = -1
    best = -1
    hist = {str([0, 0, 0]) : -1}
    for c in s:
        x += 1
        p = [p[i] + vec[c][i] for i in range(DIM)]
        pkey = str(p)
        if pkey not in hist:
            hist[pkey] = x
        else:
            span = x - hist[pkey]
            assert span % DIM == 0
            if span > best:
                best = span
                cand = s[hist[pkey] + 1: x + 1]
                print("best so far %d = [%d,%d]: %s" % (best,
                                                        hist[pkey] + 1,
                                                        x + 1,
                                                        cand))
                validate(cand)
    return best if best > -1 else 0
def main():
    # print(longest_balanced("aaabcabcbbcc"))
    t0 = time()
    s = random_string(1000000)
    # %-formatting keeps the output identical under Python 2 and Python 3
    print("generate time: %s" % (time() - t0))
    t1 = time()
    best = longest_balanced(s)
    print("best: %s" % best)
    print("elapsed: %s" % (time() - t1))
main()
Sample run on an input of 10^6 letters with an alphabet of 3 letters:
$ ./bal.py
...
best so far 189 = [847894,848083]: aacacbcbabbbcabaabbbaabbbaaaacbcaaaccccbcbcbababaabbccccbbabbacabbbbbcaacacccbbaacbabcbccaabaccabbbbbababbacbaaaacabcbabcbccbabbccaccaabbcabaabccccaacccccbaacaaaccbbcbcabcbcacaabccbacccacca
best: 189
elapsed: 1.43609690666

Find indexOf a byte array within another byte array

Given a byte array, how can I find within it, the position of a (smaller) byte array?
This documentation looked promising, using ArrayUtils, but if I'm correct it would only let me find an individual byte within the array to be searched.
(I can't see it mattering, but just in case: sometimes the search byte array will be regular ASCII characters, other times it will be control characters or extended ASCII characters. So using String operations would not always be appropriate)
The large array could be between 10 and about 10000 bytes, and the smaller array around 10. In some cases I will have several smaller arrays that I want found within the larger array in a single search. And I will at times want to find the last index of an instance rather than the first.

The simplest way would be to compare each element:
public int indexOf(byte[] outerArray, byte[] smallerArray) {
for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
boolean found = true;
for(int j = 0; j < smallerArray.length; ++j) {
if (outerArray[i+j] != smallerArray[j]) {
found = false;
break;
}
}
if (found) return i;
}
return -1;
}
Some tests:
@Test
public void testIndexOf() {
byte[] outer = {1, 2, 3, 4};
assertEquals(0, indexOf(outer, new byte[]{1, 2}));
assertEquals(1, indexOf(outer, new byte[]{2, 3}));
assertEquals(2, indexOf(outer, new byte[]{3, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}
Since you updated your question: Java Strings are UTF-16 strings; they do not care about the extended ASCII set, so you could use string.indexOf().
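If you go the String route, the indices only carry over when every byte becomes exactly one char. A minimal sketch, assuming ISO-8859-1 decoding (which maps each byte value 0 to 255 to the char with the same value); the class and method names are hypothetical. It also covers the "last index" case from the question via lastIndexOf.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: search byte arrays by viewing them as Latin-1 strings.
final class Latin1Search {

    static int indexOf(byte[] haystack, byte[] needle) {
        return toLatin1(haystack).indexOf(toLatin1(needle));
    }

    static int lastIndexOf(byte[] haystack, byte[] needle) {
        return toLatin1(haystack).lastIndexOf(toLatin1(needle));
    }

    // One byte becomes exactly one char, so string indices equal byte indices
    private static String toLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] outer = {1, 2, (byte) 0xFF, 0, 2, (byte) 0xFF};
        byte[] inner = {2, (byte) 0xFF};
        System.out.println(indexOf(outer, inner));     // prints 1
        System.out.println(lastIndexOf(outer, inner)); // prints 4
    }
}
```

This copies both arrays into strings, so for large inputs or many searches the plain loop above avoids the extra allocation.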

Google's Guava provides a Bytes.indexOf(byte[] array, byte[] target).

The Knuth–Morris–Pratt algorithm is an efficient way to do this: it runs in time linear in the combined length of the two arrays.
StreamSearcher.java is an implementation of it and is part of Twitter's elephant-bird project.
Including the whole library is not recommended, since it is rather sizable for the sake of a single class.
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
/**
* An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
* For more on how the algorithm works, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
public class StreamSearcher
{
private byte[] pattern_;
private int[] borders_;
// An upper bound on pattern length for searching. Results are undefined for longer patterns.
@SuppressWarnings("unused")
public static final int MAX_PATTERN_LENGTH = 1024;
StreamSearcher(byte[] pattern)
{
setPattern(pattern);
}
/**
* Sets a new pattern for this StreamSearcher to use.
*
* @param pattern the pattern the StreamSearcher will look for in future calls to search(...)
*/
public void setPattern(byte[] pattern)
{
pattern_ = Arrays.copyOf(pattern, pattern.length);
borders_ = new int[pattern_.length + 1];
preProcess();
}
/**
* Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
* that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
* byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
* another reasonable default, i.e. leave the stream unchanged.
*
* @return bytes consumed if found, -1 otherwise.
*/
long search(InputStream stream) throws IOException
{
long bytesRead = 0;
int b;
int j = 0;
while ((b = stream.read()) != -1)
{
bytesRead++;
while (j >= 0 && (byte) b != pattern_[j])
{
j = borders_[j];
}
// Move to the next character in the pattern.
++j;
// If we've matched up to the full pattern length, we found it. Return,
// which will automatically save our position in the InputStream at the point immediately
// following the pattern match.
if (j == pattern_.length)
{
return bytesRead;
}
}
// No dice, Note that the stream is now completely consumed.
return -1;
}
/**
* Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
* and aids in implementation of the Knuth-Morris-Pratt string search.
* <p>
* For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
private void preProcess()
{
int i = 0;
int j = -1;
borders_[i] = j;
while (i < pattern_.length)
{
while (j >= 0 && pattern_[i] != pattern_[j])
{
j = borders_[j];
}
borders_[++i] = ++j;
}
}
}

Is this what you are looking for?
public class KPM {
/**
* Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
* @param data
* @param start First index in data
* @param stop Last index in data so that stop-start = length
* @param pattern What is being searched. '*' can be used as wildcard for "ANY character"
* @return
*/
public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
if( data == null || pattern == null) return -1;
int[] failure = computeFailure(pattern);
int j = 0;
for( int i = start; i < stop; i++) {
while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
j = failure[j - 1];
}
if (pattern[j] == '*' || pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}

To save your time in testing:
http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence
gives you code that works if you make computeFailure() static:
public class KPM {
/**
* Search the data byte array for the first occurrence
* of the byte array pattern.
*/
public static int indexOf(byte[] data, byte[] pattern) {
int[] failure = computeFailure(pattern);
int j = 0;
for (int i = 0; i < data.length; i++) {
while (j > 0 && pattern[j] != data[i]) {
j = failure[j - 1];
}
if (pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
Since it is always wise to test the code that you borrow, you may start with:
public class Test {
public static void main(String[] args) {
do_test1();
}
static void do_test1() {
String[] ss = { "",
"\r\n\r\n",
"\n\n",
"\r\n\r\nthis is a test",
"this is a test\r\n\r\n",
"this is a test\r\n\r\nthis si a test",
"this is a test\r\n\r\nthis si a test\r\n\r\n",
"this is a test\n\r\nthis si a test",
"this is a test\r\nthis si a test\r\n\r\n",
"this is a test"
};
for (String s: ss) {
System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
}
}
}

Copied almost verbatim from java.lang.String.indexOf(char[], int, int, char[], int, int, int):
static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
byte first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first)
;
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
;
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}

package org.example;
import java.util.List;
import org.riversun.finbin.BinarySearcher;
public class Sample2 {
public static void main(String[] args) throws Exception {
BinarySearcher bs = new BinarySearcher();
// UTF-8 without BOM
byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");
byte[] searchBytes = "world".getBytes("utf-8");
List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);
System.out.println("indexList=" + indexList);
}
}
This results in:
indexList=[6, 25]
So you can find the indices of a byte[] within another byte[].
Example here on Github at: https://github.com/riversun/finbin

Several (or all?) of the examples posted here failed some unit tests, so I am posting my version along with those tests. All of the unit tests are based on the assumption that Java's String.indexOf() always gives us the right answer!
// The Knuth, Morris, and Pratt string searching algorithm remembers information about
// the past matched characters instead of matching a character with a different pattern
// character over and over again. It can search for a pattern in O(n) time as it never
// re-compares a text symbol that has matched a pattern symbol. But, it does use a partial
// match table to analyze the pattern structure. Construction of a partial match table
// takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n).
public class KMPSearch {
public static int indexOf(byte[] haystack, byte[] needle)
{
// needle is null or empty
if (needle == null || needle.length == 0)
return 0;
// haystack is null, or haystack's length is less than that of needle
if (haystack == null || needle.length > haystack.length)
return -1;
// pre construct failure array for needle pattern
int[] failure = new int[needle.length];
int n = needle.length;
failure[0] = -1;
for (int j = 1; j < n; j++)
{
int i = failure[j - 1];
while ((needle[j] != needle[i + 1]) && i >= 0)
i = failure[i];
if (needle[j] == needle[i + 1])
failure[j] = i + 1;
else
failure[j] = -1;
}
// find match
int i = 0, j = 0;
int haystackLen = haystack.length;
int needleLen = needle.length;
while (i < haystackLen && j < needleLen)
{
if (haystack[i] == needle[j])
{
i++;
j++;
}
else if (j == 0)
i++;
else
j = failure[j - 1] + 1;
}
return ((j == needleLen) ? (i - needleLen) : -1);
}
}
import java.util.Random;
class KMPSearchTest {
private static Random random = new Random();
private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
@Test
public void testEmpty() {
test("", "");
test("", "ab");
}
@Test
public void testOneChar() {
test("a", "a");
test("a", "b");
}
@Test
public void testRepeat() {
test("aaa", "aaaaa");
test("aaa", "abaaba");
test("abab", "abacababc");
test("abab", "babacaba");
}
@Test
public void testPartialRepeat() {
test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac");
test("ababcababdabababcababdaba", "ababcababdabababcababdaba");
}
@Test
public void testRandomly() {
for (int i = 0; i < 1000; i++) {
String pattern = randomPattern();
for (int j = 0; j < 100; j++)
test(pattern, randomText(pattern));
}
}
/* Helper functions */
private static String randomPattern() {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(10) + 1;
for (int i = 0; i < steps; i++) {
if (sb.length() == 0 || random.nextBoolean()) { // Add literal
int len = random.nextInt(5) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
} else { // Repeat prefix
int len = random.nextInt(sb.length()) + 1;
int reps = random.nextInt(3) + 1;
if (sb.length() + len * reps > 1000)
break;
for (int j = 0; j < reps; j++)
sb.append(sb.substring(0, len));
}
}
return sb.toString();
}
private static String randomText(String pattern) {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(100);
for (int i = 0; i < steps && sb.length() < 10000; i++) {
if (random.nextDouble() < 0.7) { // Add prefix of pattern
int len = random.nextInt(pattern.length()) + 1;
sb.append(pattern.substring(0, len));
} else { // Add literal
int len = random.nextInt(30) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
}
}
return sb.toString();
}
private static void test(String pattern, String text) {
try {
assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes()));
} catch (AssertionError e) {
System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'");
}
}
}

Java strings are composed of 16-bit chars, not of 8-bit bytes. A char can hold a byte, so you can always make your byte arrays into strings, and use indexOf: ASCII characters, control characters, and even zero characters will work fine.
Here is a demo:
byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
String bigStr = new String(big, StandardCharsets.UTF_8);
String smallStr = new String(small, StandardCharsets.UTF_8);
System.out.println(bigStr.indexOf(smallStr));
This prints 7.
However, considering that your large array could be up to 10,000 bytes, and the small array is only ten bytes, this solution may not be the most efficient, for two reasons:
It requires copying your big array into an array that is twice as large (same length, but with char instead of byte). This triples your memory requirements.
Java's String search algorithm is not the fastest one available. You may get significantly faster results if you implement one of the more advanced algorithms, for example Knuth–Morris–Pratt. This could potentially bring the execution time down by a factor of up to ten (the length of the small string), and it requires additional memory proportional to the length of the small string, not the big string.
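One caveat about the demo: it only uses bytes below 128. When a byte array contains arbitrary values, decoding it as UTF-8 replaces malformed sequences with a replacement character, so matches can be missed or reported at shifted indices. A minimal sketch of the same idea with ISO-8859-1, which maps every byte to exactly one char, might look like this (the class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

class Latin1IndexOf {
    // Decode both arrays one byte per char, then reuse String.indexOf.
    static int indexOf(byte[] big, byte[] small) {
        String bigStr = new String(big, StandardCharsets.ISO_8859_1);
        String smallStr = new String(small, StandardCharsets.ISO_8859_1);
        return bigStr.indexOf(smallStr);
    }

    public static void main(String[] args) {
        // Includes bytes >= 0x80 that are not valid UTF-8 on their own.
        byte[] big = {1, 2, (byte) 0xC3, (byte) 0xFF, 0, 7, (byte) 0x80, 9};
        byte[] small = {(byte) 0xFF, 0, 7, (byte) 0x80};
        System.out.println(indexOf(big, small)); // prints 3
    }
}
```

Because the mapping is one byte to one char, the index returned by indexOf is also the byte offset in the original array.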

For a little HTTP server I am currently working on, I came up with the following code to find boundaries in a multipart/form-data request. I had hoped to find a better solution here, but I guess I'll stick with it. I think it is about as efficient as it can get (quite fast, and it does not use much RAM). It uses the input bytes as a ring buffer, reads the next byte as soon as the buffer does not match the boundary, and writes the data to the output stream after the first full cycle. Of course, it can be changed to work on byte arrays instead of streams, as asked in the question.
private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
{
try
{
String n = "--"+boundary;
byte[] bc = n.getBytes("UTF-8");
int s = bc.length;
byte[] b = new byte[s];
int p = 0;
long l = 0;
int c;
boolean r;
while ((c = is.read()) != -1)
{
b[p] = (byte) c;
l += 1;
p = (int) (l % s);
if (l>p)
{
r = true;
for (int i = 0; i < s; i++)
{
if (b[(p + i) % s] != bc[i])
{
r = false;
break;
}
}
if (r)
break;
os.write(b[p]);
}
}
os.flush();
return true;
} catch(IOException e) {e.printStackTrace();}
return false;
}
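For the byte-array case asked about in the question, the same compare-at-every-position idea needs no ring buffer. The following is only a sketch under that assumption, not a port of the stream code above; the class name ByteArraySearch is hypothetical.

```java
class ByteArraySearch {
    // Returns the first index at which needle occurs in haystack, or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int start = 0; start <= haystack.length - needle.length; start++) {
            for (int k = 0; k < needle.length; k++) {
                if (haystack[start + k] != needle[k]) {
                    continue outer; // mismatch: try the next start position
                }
            }
            return start; // every byte of needle matched
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] big = {1, 2, 3, 0, 4, 5, 6, 7, 0, 8, 9, 0, 0, 1, 2, 3, 4};
        byte[] small = {7, 0, 8, 9, 0, 0, 1};
        System.out.println(indexOf(big, small)); // prints 7
    }
}
```

This takes time proportional to haystack.length * needle.length in the worst case, which is usually acceptable for a ten-byte needle.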

Recursive method for Pascal's triangle

I have written a method to evaluate a Pascal's triangle of n rows. However when I test the method I receive the error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
Here is the code:
public static int[] PascalTriangle(int n) {
int[] pt = new int[n + 1];
if (n == 0) {
pt[0] = 1;
return pt;
}
int[] ppt = PascalTriangle(n - 1);
pt[0] = pt[n] = 1;
for (int i = 0; i < ppt.length; i++) {
pt[i] = ppt[i - 1] + ppt[i];
}
return pt;
}
Please let me know if you have any ideas for how the code could be edited to fix the problem.

for(int i = 0; i < ppt.length; i++)
{
pt[i] = ppt[i-1] + ppt[i];
In your first iteration, i == 0 and so (i-1) == -1. This is the cause of the error.
You can special-case the boundaries to avoid this or, as the others have suggested, start i at 1 instead of 0.
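Applied to the method from the question, the suggestion could look like the sketch below. Only the loop start differs from the original logic; the class name PascalRow and the main method are added here for illustration, and the method is renamed to lower camel case.

```java
import java.util.Arrays;

class PascalRow {
    // Returns row n of Pascal's triangle (row 0 is {1}).
    static int[] pascalTriangle(int n) {
        int[] pt = new int[n + 1];
        if (n == 0) {
            pt[0] = 1;
            return pt;
        }
        int[] ppt = pascalTriangle(n - 1);
        pt[0] = pt[n] = 1; // both ends of every row are 1
        // Start at 1: pt[0] is already set and ppt[-1] does not exist.
        for (int i = 1; i < ppt.length; i++) {
            pt[i] = ppt[i - 1] + ppt[i];
        }
        return pt;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(pascalTriangle(4))); // prints [1, 4, 6, 4, 1]
    }
}
```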

Here is some code a friend of mine came up with
import java.util.Scanner;
public class Pascal {
public static void main(String[] args) {
Scanner scanner = new Scanner(System.in);
System.out.print("Enter the number of rows to print: ");
int rows = scanner.nextInt();
System.out.println("Pascal Triangle:");
print(rows);
scanner.close();
}
public static void print(int n) {
for (int i = 0; i < n; i++) {
for (int k = 0; k < n - i; k++) {
System.out.print(" "); // print space for triangle like structure
}
for (int j = 0; j <= i; j++) {
System.out.print(pascal(i, j) + " ");
}
System.out.println();
}
}
public static int pascal(int i, int j) {
if (j == 0 || j == i) {
return 1;
} else {
return pascal(i - 1, j - 1) + pascal(i - 1, j);
}
}
}

In this code:
pt[0] = pt[n] = 1;
for(int i = 0; i < ppt.length; i++)
{
pt[i] = ppt[i-1] + ppt[i];
}
the problem is that when i is 0, you're trying to access ppt[i-1] which is ppt[-1]. The thing to notice is that when i is 0, you don't need to execute the statement that sets pt[i], because you already set pt[0] up before the loop! Try initializing i to 1 instead of 0.

Improvement on @Clemson's code using dynamic programming (memoization):
import java.util.ArrayList;
import java.util.List;
class Solution {
int[][] dp ;
public List<List<Integer>> generate(int numRows) {
dp = new int[numRows][numRows];
List<List<Integer>> results = new ArrayList<>();
pascal(results, numRows);
return results;
}
private void pascal(List<List<Integer>> results, int numRows) {
for(int i = 0; i < numRows; i++) {
List<Integer> list = new ArrayList<>();
for(int j = 0; j <= i ; j++) {
list.add(dfs(i, j));
}
results.add(list);
}
}
private int dfs(int i, int j) {
if(j == 0 || i == j) return 1;
if(dp[i][j] != 0) return dp[i][j];
return dp[i][j] = dfs(i - 1, j - 1) + dfs(i - 1, j );
}
}

This isn't a fix for your code, but it is a solution for printing Pascal's triangle using only recursion (no loops) and the combinations formula. All it needs is a main method or demo class to create an instance of the PascalsTriangle class. Hope this helps future Java students.
import javax.swing.JOptionPane;
public class PascalsTriangle {
private StringBuilder str; // StringBuilder to display triangle
/**
* Starts the process of printing the Pascals Triangle
* @param rows Number of rows to print
*/
public PascalsTriangle(int rows) {
str = new StringBuilder();
printTriangle(rows, str);
}
/**
* Uses recursion to function as an "outer loop" and calls
* itself once for each row in triangle. Then displays the result
* @param row The number of the row to generate
* @param str StringBuilder to insert each row into
*/
public static void printTriangle(int row, StringBuilder str) {
// calls itself until row equals -1
if (row >= 0) {
// calls lower function to generate row and inserts the result into front of StringBuilder
str.insert(0, getRow(row, 0) + "\n");
// calls itself with a decremented row number
printTriangle(row - 1, str);
} else {
// when the base case is reached - display the result
JOptionPane.showMessageDialog(null, str);
System.exit(0);
}
}
/**
* Uses recursion to act as the "inner loop" and calculate each number in the given row
* @param rowNumber Number of the row being generated
* @param elementNumber Number of the element within the row (always starts with 0)
* @return String containing full row of numbers or empty string when base case is reached
*/
public static String getRow(int rowNumber, int elementNumber) {
// calls itself until elementNumber is greater than rowNumber
if (elementNumber <= rowNumber) {
// calculates element using combinations formula: n!/r!(n-r)!
int element = fact(rowNumber) / (fact(elementNumber) * (fact(rowNumber - elementNumber)));
// calls itself for each element in row and returns full String
return element + " " + getRow(rowNumber, elementNumber + 1);
} else return "";
}
/**
* Helper function that uses recursion to calculate factorial of given integer
* @param n Number to calculate factorial
* @return Factorial
*/
public static int fact(int n) {
if (n <= 0)
return 1;
else
return n * fact(n - 1);
}
}
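One limitation of the factorial-based formula: fact returns an int, and 13! no longer fits in an int, so rows from 13 onward come out wrong. A possible workaround, still using only recursion, is to build each coefficient from the previous one in the same row. This is a sketch with made-up names (BinomialDemo, binomial), not part of the answer above.

```java
class BinomialDemo {
    // C(n, k) from C(n, k - 1) using C(n, k) = C(n, k - 1) * (n - k + 1) / k.
    // The division is exact because C(n, k - 1) * (n - k + 1) equals k * C(n, k).
    static long binomial(int n, int k) {
        if (k == 0) {
            return 1;
        }
        return binomial(n, k - 1) * (n - k + 1) / k;
    }

    public static void main(String[] args) {
        System.out.println(binomial(4, 2));  // prints 6
        System.out.println(binomial(13, 6)); // prints 1716
    }
}
```

Intermediate values stay close to the size of the result, so long arithmetic covers far more rows than the factorial version can.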

I need to print an asterisk at the end of each column if the column is never visually changed. Advice?

/**
* Fills the mutations array and sends to printMutations
* @param firstString original DNA generation.
*/
public static void mutation(String firstString)
{
final int ROWSINDEX = 26;
final int SPACEUSED = firstString.length();
char[][] mutations = new char[ROWSINDEX][SPACEUSED];
String dnaChars = "AGTC";
for (int i = 0; i < SPACEUSED; i++)
{
mutations[0][i] = firstString.charAt(i);
}
for (int i = 1; i < ROWSINDEX - 1; i++)
{
for (int j = 0; j < SPACEUSED; j++)
{
mutations[i][j] = mutations[i - 1][j];
}
int randomIndex = (int) (Math.random() * (SPACEUSED));
int randomChar = (int) (Math.random() * (dnaChars.length()));
mutations[i][randomIndex] = dnaChars.charAt(randomChar);
}
printMutations(mutations, ROWSINDEX, SPACEUSED);
}
/**
* Prints the 25 generations of mutations and the asterisks.
* @param mutations array that holds the mutated generations
* @param ROWSINDEX integer holding the max amount of rows possible
* @param SPACEUSED integer that holds the number of columns
*/
public static void printMutations(char[][] mutations, int ROWSINDEX, int SPACEUSED)
{
for (int i = 0; i < ROWSINDEX; i++)
{
for (int j = 0; j < SPACEUSED; j++)
{
System.out.print(" " + mutations[i][j]);
}
if (i > 0)
{
char[] a = mutations[i];
char[] a2 = mutations[i - 1];
if (Arrays.equals( a, a2 ) == true)
{
System.out.print("*");
}
}
System.out.println("");
}
}
}
At the end of the output, you should print an asterisk below the column of any letter that did not change during the course of the simulation.
An example run of the program should appear like this:
$ java BeckJ0926
Enter a DNA sequence up to 80 bp: ATTCGGCTA
ATTCGGCTA
ATCCGGCTA
ATCCGTCTA
ATCCGTCTA *
...
ATCCGTCTT
AACCGTCTT
AATCGTCTT
*  ** **
I don't know whether it would be best to set up a boolean array to track whether each column has changed, which is what I was originally trying to do. I cannot use ArrayLists.

You could change the line
mutations[i][randomIndex] = dnaChars.charAt(randomChar);
to
char newChar = dnaChars.charAt(randomChar);
if (mutations[i][randomIndex] == newChar) {
System.out.print("*");
} else {
mutations[i][randomIndex] = newChar;
printMutation(mutations[i]);
}
and change the print function to take one mutation and print it.
private static void printMutation(char[] mutation) {
for (char a : mutation) {
System.out.print(a + " ");
}
}
Is that helpful?
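For the final line of asterisks, the boolean-array idea from the question can be sketched as follows. This is an illustration under assumptions, not the assignment's required structure: the names UnchangedColumns and markerLine are hypothetical, and the method expects only the rows that were actually filled in. A column counts as changed if any generation shows a different letter than the first one, even if it later mutates back.

```java
class UnchangedColumns {
    // Builds a line with '*' under every column that shows the same letter in all rows.
    static String markerLine(char[][] rows) {
        boolean[] changed = new boolean[rows[0].length];
        for (int row = 1; row < rows.length; row++) {
            for (int col = 0; col < changed.length; col++) {
                if (rows[row][col] != rows[0][col]) {
                    changed[col] = true; // differs from the first generation
                }
            }
        }
        StringBuilder sb = new StringBuilder();
        for (boolean c : changed) {
            sb.append(c ? ' ' : '*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] generations = {"ATTCGGCTA", "ATCCGGCTA", "ATCCGTCTA",
                                "ATCCGTCTT", "AACCGTCTT", "AATCGTCTT"};
        char[][] rows = new char[generations.length][];
        for (int i = 0; i < generations.length; i++) {
            rows[i] = generations[i].toCharArray();
        }
        // Stars land under columns 0, 3, 4, 6 and 7.
        System.out.println(markerLine(rows));
    }
}
```

If each letter is printed with a leading space, as in printMutations, the marker line needs the same padding per column to stay aligned.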
