Find indexOf a byte array within another byte array

Find indexOf a byte array within another byte array - java

Given a byte array, how can I find within it, the position of a (smaller) byte array?
This documentation looked promising, using ArrayUtils, but if I'm correct it would only let me find an individual byte within the array to be searched.
(I can't see it mattering, but just in case: sometimes the search byte array will be regular ASCII characters, other times it will be control characters or extended ASCII characters. So using String operations would not always be appropriate)
The large array could be between 10 and about 10000 bytes, and the smaller array around 10. In some cases I will have several smaller arrays that I want found within the larger array in a single search. And I will at times want to find the last index of an instance rather than the first.

The simpelst way would be to compare each element:
public int indexOf(byte[] outerArray, byte[] smallerArray) {
for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) {
boolean found = true;
for(int j = 0; j < smallerArray.length; ++j) {
if (outerArray[i+j] != smallerArray[j]) {
found = false;
break;
}
}
if (found) return i;
}
return -1;
}
Some tests:
#Test
public void testIndexOf() {
byte[] outer = {1, 2, 3, 4};
assertEquals(0, indexOf(outer, new byte[]{1, 2}));
assertEquals(1, indexOf(outer, new byte[]{2, 3}));
assertEquals(2, indexOf(outer, new byte[]{3, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 4}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5}));
assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8}));
}
As you updated your question: Java Strings are UTF-16 Strings, they do not care about the extended ASCII set, so you could use string.indexOf()

Google's Guava provides a Bytes.indexOf(byte[] array, byte[] target).

Using the Knuth–Morris–Pratt algorithm is the most efficient way.
StreamSearcher.java is an implementation of it and is part of Twitter's elephant-bird project.
It is not recommended to include this library since it is rather sizable for using just a single class.
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
/**
* An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
* For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
public class StreamSearcher
{
private byte[] pattern_;
private int[] borders_;
// An upper bound on pattern length for searching. Results are undefined for longer patterns.
#SuppressWarnings("unused")
public static final int MAX_PATTERN_LENGTH = 1024;
StreamSearcher(byte[] pattern)
{
setPattern(pattern);
}
/**
* Sets a new pattern for this StreamSearcher to use.
*
* #param pattern the pattern the StreamSearcher will look for in future calls to search(...)
*/
public void setPattern(byte[] pattern)
{
pattern_ = Arrays.copyOf(pattern, pattern.length);
borders_ = new int[pattern_.length + 1];
preProcess();
}
/**
* Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
* that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
* byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
* another reasonable default, i.e. leave the stream unchanged.
*
* #return bytes consumed if found, -1 otherwise.
*/
long search(InputStream stream) throws IOException
{
long bytesRead = 0;
int b;
int j = 0;
while ((b = stream.read()) != -1)
{
bytesRead++;
while (j >= 0 && (byte) b != pattern_[j])
{
j = borders_[j];
}
// Move to the next character in the pattern.
++j;
// If we've matched up to the full pattern length, we found it. Return,
// which will automatically save our position in the InputStream at the point immediately
// following the pattern match.
if (j == pattern_.length)
{
return bytesRead;
}
}
// No dice, Note that the stream is now completely consumed.
return -1;
}
/**
* Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
* and aids in implementation of the Knuth-Moore-Pratt string search.
* <p>
* For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
*/
private void preProcess()
{
int i = 0;
int j = -1;
borders_[i] = j;
while (i < pattern_.length)
{
while (j >= 0 && pattern_[i] != pattern_[j])
{
j = borders_[j];
}
borders_[++i] = ++j;
}
}
}

Is this what you are looking for?
public class KPM {
/**
* Search the data byte array for the first occurrence of the byte array pattern within given boundaries.
* #param data
* #param start First index in data
* #param stop Last index in data so that stop-start = length
* #param pattern What is being searched. '*' can be used as wildcard for "ANY character"
* #return
*/
public static int indexOf( byte[] data, int start, int stop, byte[] pattern) {
if( data == null || pattern == null) return -1;
int[] failure = computeFailure(pattern);
int j = 0;
for( int i = start; i < stop; i++) {
while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) {
j = failure[j - 1];
}
if (pattern[j] == '*' || pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}

To save your time in testing:
http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence
gives you code that works if you make computeFailure() static:
public class KPM {
/**
* Search the data byte array for the first occurrence
* of the byte array pattern.
*/
public static int indexOf(byte[] data, byte[] pattern) {
int[] failure = computeFailure(pattern);
int j = 0;
for (int i = 0; i < data.length; i++) {
while (j > 0 && pattern[j] != data[i]) {
j = failure[j - 1];
}
if (pattern[j] == data[i]) {
j++;
}
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}
/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private static int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];
int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j>0 && pattern[j] != pattern[i]) {
j = failure[j - 1];
}
if (pattern[j] == pattern[i]) {
j++;
}
failure[i] = j;
}
return failure;
}
}
Since it is always wise to test the code that you borrow, you may start with:
public class Test {
public static void main(String[] args) {
do_test1();
}
static void do_test1() {
String[] ss = { "",
"\r\n\r\n",
"\n\n",
"\r\n\r\nthis is a test",
"this is a test\r\n\r\n",
"this is a test\r\n\r\nthis si a test",
"this is a test\r\n\r\nthis si a test\r\n\r\n",
"this is a test\n\r\nthis si a test",
"this is a test\r\nthis si a test\r\n\r\n",
"this is a test"
};
for (String s: ss) {
System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]");
}
}
}

Copied almost identical from java.lang.String.
indexOf(char[],int,int,char[]int,int,int)
static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) {
if (fromIndex >= sourceCount) {
return (targetCount == 0 ? sourceCount : -1);
}
if (fromIndex < 0) {
fromIndex = 0;
}
if (targetCount == 0) {
return fromIndex;
}
byte first = target[targetOffset];
int max = sourceOffset + (sourceCount - targetCount);
for (int i = sourceOffset + fromIndex; i <= max; i++) {
/* Look for first character. */
if (source[i] != first) {
while (++i <= max && source[i] != first)
;
}
/* Found first character, now look at the rest of v2 */
if (i <= max) {
int j = i + 1;
int end = j + targetCount - 1;
for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++)
;
if (j == end) {
/* Found whole string. */
return i - sourceOffset;
}
}
}
return -1;
}

package org.example;
import java.util.List;
import org.riversun.finbin.BinarySearcher;
public class Sample2 {
public static void main(String[] args) throws Exception {
BinarySearcher bs = new BinarySearcher();
// UTF-8 without BOM
byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8");
byte[] searchBytes = "world".getBytes("utf-8");
List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes);
System.out.println("indexList=" + indexList);
}
}
so it results in
indexList=[6, 25]
So,u can find the index of byte[] in byte[]
Example here on Github at: https://github.com/riversun/finbin

Several (or all?) of the examples posted here failed some Unit tests so I am posting my version along with the aforementioned tests over here. All of the Unit tests are BASED upon the requirement that Java's String.indexOf() always gives us the right answer!
// The Knuth, Morris, and Pratt string searching algorithm remembers information about
// the past matched characters instead of matching a character with a different pattern
// character over and over again. It can search for a pattern in O(n) time as it never
// re-compares a text symbol that has matched a pattern symbol. But, it does use a partial
// match table to analyze the pattern structure. Construction of a partial match table
// takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n).
public class KMPSearch {
public static int indexOf(byte[] haystack, byte[] needle)
{
// needle is null or empty
if (needle == null || needle.length == 0)
return 0;
// haystack is null, or haystack's length is less than that of needle
if (haystack == null || needle.length > haystack.length)
return -1;
// pre construct failure array for needle pattern
int[] failure = new int[needle.length];
int n = needle.length;
failure[0] = -1;
for (int j = 1; j < n; j++)
{
int i = failure[j - 1];
while ((needle[j] != needle[i + 1]) && i >= 0)
i = failure[i];
if (needle[j] == needle[i + 1])
failure[j] = i + 1;
else
failure[j] = -1;
}
// find match
int i = 0, j = 0;
int haystackLen = haystack.length;
int needleLen = needle.length;
while (i < haystackLen && j < needleLen)
{
if (haystack[i] == needle[j])
{
i++;
j++;
}
else if (j == 0)
i++;
else
j = failure[j - 1] + 1;
}
return ((j == needleLen) ? (i - needleLen) : -1);
}
}
import java.util.Random;
class KMPSearchTest {
private static Random random = new Random();
private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
#Test
public void testEmpty() {
test("", "");
test("", "ab");
}
#Test
public void testOneChar() {
test("a", "a");
test("a", "b");
}
#Test
public void testRepeat() {
test("aaa", "aaaaa");
test("aaa", "abaaba");
test("abab", "abacababc");
test("abab", "babacaba");
}
#Test
public void testPartialRepeat() {
test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac");
test("ababcababdabababcababdaba", "ababcababdabababcababdaba");
}
#Test
public void testRandomly() {
for (int i = 0; i < 1000; i++) {
String pattern = randomPattern();
for (int j = 0; j < 100; j++)
test(pattern, randomText(pattern));
}
}
/* Helper functions */
private static String randomPattern() {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(10) + 1;
for (int i = 0; i < steps; i++) {
if (sb.length() == 0 || random.nextBoolean()) { // Add literal
int len = random.nextInt(5) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
} else { // Repeat prefix
int len = random.nextInt(sb.length()) + 1;
int reps = random.nextInt(3) + 1;
if (sb.length() + len * reps > 1000)
break;
for (int j = 0; j < reps; j++)
sb.append(sb.substring(0, len));
}
}
return sb.toString();
}
private static String randomText(String pattern) {
StringBuilder sb = new StringBuilder();
int steps = random.nextInt(100);
for (int i = 0; i < steps && sb.length() < 10000; i++) {
if (random.nextDouble() < 0.7) { // Add prefix of pattern
int len = random.nextInt(pattern.length()) + 1;
sb.append(pattern.substring(0, len));
} else { // Add literal
int len = random.nextInt(30) + 1;
for (int j = 0; j < len; j++)
sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
}
}
return sb.toString();
}
private static void test(String pattern, String text) {
try {
assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes()));
} catch (AssertionError e) {
System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'");
}
}
}

Java strings are composed of 16-bit chars, not of 8-bit bytes. A char can hold a byte, so you can always make your byte arrays into strings, and use indexOf: ASCII characters, control characters, and even zero characters will work fine.
Here is a demo:
byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4};
byte[] small = new byte[] {7,0,8,9,0,0,1};
String bigStr = new String(big, StandardCharsets.UTF_8);
String smallStr = new String(small, StandardCharsets.UTF_8);
System.out.println(bigStr.indexOf(smallStr));
This prints 7.
However, considering that your large array could be up to 10,000 bytes, and the small array is only ten bytes, this solution may not be the most efficient, for two reasons:
It requires copying your big array into an array that is twice as large (same capacity, but with char instead of byte). This triples your memory requirements.
String search algorithm of Java is not the fastest one available. You may get sufficiently faster if you implement one of the advanced algorithms, for example, the Knuth–Morris–Pratt one. This could potentially bring the execution speed down by a factor of up to ten (the length of the small string), and will require additional memory that is proportional to the length of the small string, not the big string.

For a little HTTP server I am currently working on, I came up with the following code to find boundaries in a multipart/form-data request. Hoped to find a better solution here, but i guess I'll stick with it. I think it is as efficent as it can get (quite fast and uses not much ram). It uses the input bytes as ring buffer, reads the next byte as soon as it does not match the boundary and writes the data after the first full cycle into the output stream. Of course can it be changed for byte arrays instead of streams, as asked in the question.
private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary)
{
try
{
String n = "--"+boundary;
byte[] bc = n.getBytes("UTF-8");
int s = bc.length;
byte[] b = new byte[s];
int p = 0;
long l = 0;
int c;
boolean r;
while ((c = is.read()) != -1)
{
b[p] = (byte) c;
l += 1;
p = (int) (l % s);
if (l>p)
{
r = true;
for (int i = 0; i < s; i++)
{
if (b[(p + i) % s] != bc[i])
{
r = false;
break;
}
}
if (r)
break;
os.write(b[p]);
}
}
os.flush();
return true;
} catch(IOException e) {e.printStackTrace();}
return false;
}

Related

How is the Boyer Moore offset table created (Wiki vcode)?

I've been trying to get my head around the boyer moore algorithm . I am going through the Bad Character hurestic code in Java given on wikipedia .
Theoretically I understand what the algorith is doing. But I am not able to wrap my head around the preprocessing table.
This is the code:
/**
* Returns the index within this string of the first occurrence of the
* specified substring. If it is not a substring, return -1.
*
* There is no Galil because it only generates one match.
*
* #param haystack The string to be scanned
* #param needle The target string to search
* #return The start index of the substring
*/
public static int indexOf(char[] haystack, char[] needle) {
if (needle.length == 0) {
return 0;
}
int charTable[] = makeCharTable(needle);
int offsetTable[] = makeOffsetTable(needle);
for (int i = needle.length - 1, j; i < haystack.length;) {
for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
if (j == 0) {
return i;
}
}
// i += needle.length - j; // For naive method
i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]);
}
return -1;
}
/**
* Makes the jump table based on the mismatched character information.
*/
private static int[] makeCharTable(char[] needle) {
final int ALPHABET_SIZE = Character.MAX_VALUE + 1; // 65536
int[] table = new int[ALPHABET_SIZE];
for (int i = 0; i < table.length; ++i) {
table[i] = needle.length;
}
for (int i = 0; i < needle.length; ++i) {
table[needle[i]] = needle.length - 1 - i;
}
return table;
}
/**
* Makes the jump table based on the scan offset which mismatch occurs.
* (bad character rule).
*/
private static int[] makeOffsetTable(char[] needle) {
int[] table = new int[needle.length];
int lastPrefixPosition = needle.length;
for (int i = needle.length; i > 0; --i) {
if (isPrefix(needle, i)) {
lastPrefixPosition = i;
}
table[needle.length - i] = lastPrefixPosition - i + needle.length;
}
for (int i = 0; i < needle.length - 1; ++i) {
int slen = suffixLength(needle, i);
table[slen] = needle.length - 1 - i + slen;
}
return table;
}
/**
* Is needle[p:end] a prefix of needle?
*/
private static boolean isPrefix(char[] needle, int p) {
for (int i = p, j = 0; i < needle.length; ++i, ++j) {
if (needle[i] != needle[j]) {
return false;
}
}
return true;
}
/**
* Returns the maximum length of the substring ends at p and is a suffix.
* (good suffix rule)
*/
private static int suffixLength(char[] needle, int p) {
int len = 0;
for (int i = p, j = needle.length - 1;
i >= 0 && needle[i] == needle[j]; --i, --j) {
len += 1;
}
return len;
}
I am not able to understand what the preprocessing function makeOffsetTable is doing ?
It seems makeCharTable is all the pre-processing that is needed but looks like there is another pre processing step but I dont understand what it is , could someone explain what its doing ?

KMP Algorithm for string search?

I found this very challenging coding problem online which I though I'd give a try.
The general idea is that given string of text T and pattern P, find the occurrences of this pattern, sum up it's corresponding value and return max and min. If you want to read the problem in more details, please refer to this.
However, below is the code I've provided, it works for a simple test case, but when running on multiple and complex test cases its pretty slow, and I'm not sure where my code needs to be optimized.
Can anyone please help where im getting the logic wrong.
public class DeterminingDNAHealth {
private DeterminingDNAHealth() {
/*
* Fixme:
* Each DNA contains number of genes
* - some of them are beneficial and increase DNA's total health
* - Each Gene has a health value
* ======
* - Total health of DNA = sum of all health values of beneficial genes
*/
}
int checking(int start, int end, String pattern) {
String[] genesChar = new String[] {
"a",
"b",
"c",
"aa",
"d",
"b"
};
String numbers = "123456";
int total = 0;
for (int i = start; i <= end; i++) {
total += KMPAlgorithm.initiateAlgorithm(pattern, genesChar[i]) * (i + 1);
}
return total;
}
public static void main(String[] args) {
String[] genesChar = new String[] {
"a",
"b",
"c",
"aa",
"d",
"b"
};
Gene[] genes = new Gene[genesChar.length];
for (int i = 0; i < 6; i++) {
genes[i] = new Gene(genesChar[i], i + 1);
}
String[] checking = "15caaab 04xyz 24bcdybc".split(" ");
DeterminingDNAHealth DNA = new DeterminingDNAHealth();
int i, mostHealthiest, mostUnhealthiest;
mostHealthiest = Integer.MIN_VALUE;
mostUnhealthiest = Integer.MAX_VALUE;
for (i = 0; i < checking.length; i++) {
int start = Character.getNumericValue(checking[i].charAt(0));
int end = Character.getNumericValue(checking[i].charAt(1));
String pattern = checking[i].substring(2, checking[i].length());
int check = DNA.checking(start, end, pattern);
if (check > mostHealthiest)
mostHealthiest = check;
else
if (check < mostUnhealthiest)
mostUnhealthiest = check;
}
System.out.println(mostHealthiest + " " + mostUnhealthiest);
// DNA.checking(1,5, "caaab");
}
}
KMPAlgorithm
public class KMPAlgorithm {
KMPAlgorithm() {}
public static int initiateAlgorithm(String text, String pattern) {
// let us generate our LPC table from the pattern
int[] partialMatchTable = partialMatchTable(pattern);
int matchedOccurrences = 0;
// initially we don't have anything matched, so 0
int partialMatchLength = 0;
// we then start to loop through the text, !note, not the pattern. The text that we are testing the pattern on
for (int i = 0; i < text.length(); i++) {
// if there is a mismatch and there's no previous match, then we've hit the base-case, hence break from while{...}
while (partialMatchLength > 0 && text.charAt(i) != pattern.charAt(partialMatchLength)) {
/*
* otherwise, based on the number of chars matched, we decrement it by 1.
* In fact, this is the unique part of this algorithm. It is this part that we plan to skip partialMatchLength
* iterations. So if our partialMatchLength was 5, then we are going to skip (5 - 1) iteration.
*/
partialMatchLength = partialMatchTable[partialMatchLength - 1];
}
// if however we have a char that matches the current text[i]
if (text.charAt(i) == pattern.charAt(partialMatchLength)) {
// then increment position, so hence we check the next char of the pattern against the next char in text
partialMatchLength++;
// we will know that we're at the end of the pattern matching, if the matched length is same as the pattern length
if (partialMatchLength == pattern.length()) {
// to get the starting index of the matched pattern in text, apply this formula (i - (partialMatchLength - 1))
// this line increments when a match string occurs multiple times;
matchedOccurrences++;
// just before when we have a full matched pattern, we want to test for multiple occurrences, so we make
// our match length incomplete, and let it run longer.
partialMatchLength = partialMatchTable[partialMatchLength - 1];
}
}
}
return matchedOccurrences;
}
private static int[] partialMatchTable(String pattern) {
/*
* TODO
* Note:
* => Proper prefix: All the characters in a string, with one or more cut off the end.
* => proper suffix: All the characters in a string, with one or more cut off the beginning.
*
* 1.) Take the pattern and construct a partial match table
*
* To construct partial match table {
* 1. Loop through the String(pattern)
* 2. Create a table of size String(pattern).length
* 3. For each character c[i], get The length of the longest proper prefix in the (sub)pattern
* that matches a proper suffix in the same (sub)pattern
* }
*/
// we will need two incremental variables
int i, j;
// an LSP table also known as “longest suffix-prefix”
int[] LSP = new int[pattern.length()];
// our initial case is that the first element is set to 0
LSP[0] = 0;
// loop through the pattern...
for (i = 1; i < pattern.length(); i++) {
// set our j as previous elements data (not the index)
j = LSP[i - 1];
// we will be comparing previous and current elements data. ei char
char current = pattern.charAt(i), previous = pattern.charAt(j);
// we will have a case when we're somewhere in loop and two chars will not match, and j is not in base case.
while (j > 0 && current != previous)
// we decrement our j
j = LSP[j - 1];
// simply put, if two characters are same, then we update our LSP to say that at that point, we hold the j's value
if (current == previous)
// increment our j
j++;
// update the table
LSP[i] = j;
}
return LSP;
}
}
Cource code credit to Github

You may try this KMP implementation. It is O(m+n), as KMP is intended to be. It should be a lot faster:
private static int[] failureFunction(char[] pattern) {
int m = pattern.length;
int[] f = new int[pattern.length];
f[0] = 0;
int i = 1;
int j = 0;
while (i < m) {
if (pattern[i] == pattern[j]) {
f[i] = j + 1;
i++;
j++;
} else if (j > 0) {
j = f[j - 1];
} else {
f[i] = 0;
i++;
}
}
return f;
}
private static int kmpMatch(char[] text, char[] pattern) {
int[] f = failureFunction(pattern);
int m = pattern.length;
int n = text.length;
int i = 0;
int j = 0;
while (i < n) {
if (pattern[j] == text[i]) {
if (j == m - 1){
return i - (m - 1);
} else {
i++;
j++;
}
} else if (j > 0) {
j = f[j - 1];
} else {
i++;
}
}
return -1;
}

Algorithm - Lexicographically largest possible magical substring

I am working on this magical sub-string problem.
Magical binary strings are non-empty binary strings if the following two conditions are true:
The number of 0's is equal to the number of 1's.
For every prefix of the binary string, the number of 1's should not be less than the number of 0's.
I got stuck on how to proceed further in my Java program.
Here is my program:
static String findLargest(String str) {
String[] splits = str.split("");
Set<String> set = new LinkedHashSet<String>();
for (int i = 0; i < splits.length; i++) {
if (splits[i].equals("0")) {
continue;
}
int zeros = 0;
int ones = 0;
StringBuilder sb = new StringBuilder("");
for (int j = i; j < splits.length; j++) {
if (splits[j].equals("0")) {
zeros++;
} else {
ones++;
}
sb.append(splits[j]);
if (zeros == ones && ones >= zeros) {
set.add(sb.toString());
}
}
}
set.remove(str);
List<String> list = new ArrayList<String>(set);
System.out.println(list);
return null;
}
Using this program I am able to get the magical sub-strings for the given input String 11011000 as [10, 101100, 1100] in my list variable.
Now from here I am struggling how to remove the invalid entry of 101100 from my list and then use the elements 10, 1100 to swap from my input 11011000 to get the final result as 11100100
Also please guide me if there is any other alternate approach.

If your question is about only eliminating the unwanted "101100" from the result, here is the answer
import java.util.ArrayList;
import java.util.HashMap;
import java.lang.*;
import java.util.Set;
import java.util.*;
public class HelloWorld{
public static void main(String []args){
findLargest("11011000");
}
public static String findLargest(String str) {
String[] splits = str.split("");
Set<String> set = new LinkedHashSet<String>();
for (int i = 0; i < splits.length; i++) {
if (splits[i].equals("0")) {
continue;
}
int zeros = 0;
int ones = 0;
StringBuilder sb = new StringBuilder("");
for (int j = i; j < splits.length; j++) {
if (splits[j].equals("0")) {
zeros++;
} else {
ones++;
}
sb.append(splits[j]);
if (zeros == ones && ones >= zeros) {
set.add(sb.toString());
j = i +1; // RESET THE INDEX ELEMENT TO SKIP THE SUBSTRING FROM CONSIDERATION
break; // BREAK FROM THE LOOP
}
}
}
set.remove(str);
List<String> list = new ArrayList<String>(set);
System.out.println(list);
return null;
}
}

I can provide some points.
First, get all the magical substrings and store them as a pair of start and end index(l, r) in a list;
Second, sort the list based on index l;
Those who can be potentially swapped substring can get from the same index l. look at the example given "11011000"
the list will have (0,7),(1,2),(1,6),(3,6),(4,5)
obviously only potential swap is among (1,2)(1,6)
deal these substrings have same index l will help find potential swapping substrings, sort them to find the maximum order.

class Pair{
int start;
int end;
}
public List<Pair> findmagicalPairs(String binString){
List<Pair> magicPairs = new ArrayList<Pair>();
for(int start=0;start<binString.length()-1;start++){
int ones=0;
int zeros=0;
for(int i=start; i<binString.length();i++){
if(binString.charAt(i) == '1'){
ones++;
} else if(binString.charAt(i)=='0'){
zeros++;
}
if(ones == zeros){ //check if magical
Pair temp=new Pair();
temp.start=start;
temp.end =i;
magicPairs.add(temp);
}
}
}
return magicPairs;
}
public String largestMagical(String binString) {
// Write your code here
List<Pair> allPairs = findmagicalPairs(binString);
String largest=binString;
//check by swapping each pairs
for(int i=0;i<allPairs.size()-1;i++){
for(int j=i+1;j<allPairs.size()-1;j++){
if(allPairs.get(i).end+1 == allPairs.get(j).start){
//consecutive Pair so swap and see largest
int index = allPairs.get(j).start;
String swapped = binString.substring(0,allPairs.get(i).start)+binString.substring(allPairs.get(j).start,allPairs.get(j).end+1)+binString.substring(allPairs.get(i).start,allPairs.get(i).end+1)+binString.substring(allPairs.get(j).end+1);
largest = LargestString(largest, swapped);
} else {
//else ignore
}
}
}
return largest;
}
public String LargestString(String first, String second){
if(first.compareTo(second)>0){
return first;
} else {
return second;
}
}

JavaScript Code!
const largestMagical = (binString) => {
//console.log({ binString });
const len = binString.length;
const height = Array(len + 1).fill(0),
num = { 1: 1, 0: -1 },
marked = Array(len + 1).fill(false),
sameHeights = {};
let i,
j,
result = "";
for (i = 1; i <= len; ++i) {
height[i] = height[i - 1] + num[binString[i - 1]];
}
//console.log({ height });
for (i = 0; i <= len; ++i) {
if (marked[i]) continue;
marked[i] = true;
sameHeights[i] = [i];
for (j = i + 1; j <= len; ++j) {
if (height[j] < height[i]) break;
if (height[j] === height[i]) {
sameHeights[i].push(j);
marked[j] = true;
}
}
}
//console.log({ sameHeights });
for (let k in sameHeights) {
const leng = sameHeights[k].length;
let startId, midId, endId;
for (startId = 0; startId < leng - 2; ++startId) {
for (midId = startId + 1; midId < leng - 1; ++midId) {
for (endId = midId + 1; endId < leng; ++endId) {
const start = sameHeights[k][startId],
mid = sameHeights[k][midId],
end = sameHeights[k][endId];
//console.log({start, mid, end});
const swapped =
binString.substring(0, start) +
binString.substring(mid, end) +
binString.substring(start, mid) +
binString.substring(end, len);
//console.log({swapped});
if (swapped > result) result = swapped;
}
}
}
}
return result;
};
console.log(largestMagical("1010111000"));
console.log(largestMagical("11011000"));

pattern search in a text by using three methods

I want to write a program for pattern searching in a given text which reads in a text and then one or more patterns and gives for each pattern:
if it is found(prints out the position of the first appearance in the text)
if not the number of comparison for of the methods: Brute-force, Boyer-Moore Heuristics, and KMP.
But I don't know how to write my main class to get output:
import java.util.HashMap;
import java.util.Map;
public class PatternSearch {
/** Returns the lowest index at which substring pattern begins in text (or else -1).*/
public static int findBrute(char[] text, char[] pattern) {
int n = text.length;
int m = pattern.length;
for (int i=0; i <= n - m; i++) { // try every starting index within text
int k = 0; // k is index into pattern
while (k < m && text[i+k] == pattern[k]) // kth character of pattern matches
k++;
if (k == m) // if we reach the end of the pattern,
return i; // substring text[i..i+m-1] is a match
}
return -1; // search failed
}
/** Returns the lowest index at which substring pattern begins in text (or else -1).*/
public static int findBoyerMoore(char[] text, char[] pattern) {
int n = text.length;
int m = pattern.length;
if (m == 0) return 0; // trivial search for empty string
Map<Character,Integer> last = new HashMap<>( ); // the 'last' map
for (int i=0; i < n; i++)
last.put(text[i], -1); // set -1 as default for all text characters
for (int k=0; k < m; k++)
last.put(pattern[k], k); // rightmost occurrence in pattern is last
// start with the end of the pattern aligned at index m-1 of the text
int i = m-1; // an index into the text
int k = m-1; // an index into the pattern
while (i < n) {
if (text[i] == pattern[k]) { // a matching character
if (k == 0) return i; // entire pattern has been found
i--; // otherwise, examine previous
k--; // characters of text/pattern
} else {
i += m - Math.min(k, 1 + last.get(text[i])); // case analysis for jump step
k = m - 1; // restart at end of pattern
}
}
return -1; // pattern was never found
}
/** Returns the lowest index at which substring pattern begins in text (or else -1).*/
public static int findKMP(char[] text, char[] pattern) {
int n = text.length;
int m = pattern.length;
if (m == 0) return 0; // trivial search for empty string
int[] fail = computeFailKMP(pattern); // computed by private utility
int j = 0; // index into text
int k = 0; // index into pattern
while (j < n) {
if (text[j] == pattern[k]) { // pattern[0..k] matched thus far
if (k == m - 1) return j - m + 1; // match is complete
j++; // otherwise, try to extend match
k++;
} else if (k > 0)
k = fail[k-1]; // reuse suffix of P[0..k-1]
else
j++;
}
return -1; // reached end without match
}
private static int[] computeFailKMP(char[] pattern) {
int m = pattern.length;
int[ ] fail = new int[m]; // by default, all overlaps are zero
int j = 1;
int k = 0;
while (j < m) { // compute fail[j] during this pass, if nonzero
if (pattern[j] == pattern[k]) { // k + 1 characters match thus far
fail[j] = k + 1;
j++;
k++;
} else if (k > 0) // k follows a matching prefix
k = fail[k-1];
else // no match found starting at j
j++;
}
return fail;
}
public static void main(String[] args) {
}
}

Looks like you are migrating from c\c++ to Java.
Here is how to call functions via main method.
public static void main(String[] args) throws IOException {
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
System.out.print("1) Please enter a string :");
String text = br.readLine();
System.out.print("2) Please enter a pattern :");
String pattern = br.readLine();
System.out.println(PatternSearch.findBoyerMoore(text.toCharArray(), pattern.toCharArray()));
System.out.println(PatternSearch.findBrute(text.toCharArray(), pattern.toCharArray()));
System.out.println(PatternSearch.findKMP(text.toCharArray(), pattern.toCharArray()));
}
Since all your methods are static I called methods by using className.method();
If your methods are non static then you will have to create an instance of the class and then call the method by using the instance you created.
PatternSearch instance = new PatternSearch();
instance.findKMP(text.toCharArray(), pattern.toCharArray());

How to find the longest substring with equal amount of characters efficiently

I have a string that consists of characters A,B,C and D and I am trying to calculate the length of the longest substring that has an equal amount of each one of these characters in any order.
For example ABCDB would return 4, ABCC 0 and ADDBCCBA 8.
My code currently:
public int longestSubstring(String word) {
HashMap<Integer, String> map = new HashMap<Integer, String>();
for (int i = 0; i<word.length()-3; i++) {
map.put(i, word.substring(i, i+4));
}
StringBuilder sb;
int longest = 0;
for (int i = 0; i<map.size(); i++) {
sb = new StringBuilder();
sb.append(map.get(i));
int a = 4;
while (i<map.size()-a) {
sb.append(map.get(i+a));
a+= 4;
}
String substring = sb.toString();
if (equalAmountOfCharacters(substring)) {
int length = substring.length();
if (length > longest)
longest = length;
}
}
return longest;
}
This currently works pretty well if the string length is 10^4 but I'm trying to make it 10^5. Any tips or suggestions would be appreciated.

Let's assume that cnt(c, i) is the number of occurrences of the character c in the prefix of length i.
A substring (low, high] has an equal amount of two characters a and b iff cnt(a, high) - cnt(a, low) = cnt(b, high) - cnt(b, low), or, put it another way, cnt(b, high) - cnt(a, high) = cnt(b, low) - cnt(a, low). Thus, each position is described by a value of cnt(b, i) - cnt(a, i). Now we can generalize it for more that two characters: each position is described by a tuple (cnt(a_2, i) - cnt(a_1, i), ..., cnt(a_k, i) - cnt(a_1, i)), where a_1 ... a_k is the alphabet.
We can iterate over the given string and maintain the current tuple. At each step, we should update the answer by checking the value of i - first_occurrence(current_tuple), where first_occurrence is a hash table that stores the first occurrence of each tuple seen so far. Do not forget to put a tuple of zeros to the hash map before iteration(it corresponds to an empty prefix).

If there were only A's and B's, then you could do something like this.
def longest_balanced(word):
length = 0
cumulative_difference = 0
first_index = {0: -1}
for index, letter in enumerate(word):
if letter == 'A':
cumulative_difference += 1
elif letter == 'B':
cumulative_difference -= 1
else:
raise ValueError(letter)
if cumulative_difference in first_index:
length = max(length, index - first_index[cumulative_difference])
else:
first_index[cumulative_difference] = index
return length
Life is more complicated with all four letters, but the idea is much the same. Instead of keeping just one cumulative difference, for A's versus B's, we keep three, for A's versus B's, A's versus C's, and A's versus D's.

Well, first of all abstain from constructing any strings.
If you don't produce any (or nearly no) garbage, there's no need to collect it, which is a major plus.
Next, use a different data-structure:
I suggest 4 byte-arrays, storing the count of their respective symbol in the 4-span starting at the corresponding string-index.
That should speed it up considerably.

You can count the occurrences of the characters in word. Then, a possible solution could be:
If min is the minimum number of occurrences of any character in word, then min is also the maximum possible number of occurrences of each character in the substring we are looking for. In the code below, min is maxCount.
We iterate over decreasing values of maxCount. At every step, the string we are searching for will have length maxCount * alphabetSize. We can view this as the size of a sliding window we can slide over word.
We slide the window over word, counting the occurrences of the characters in the window. If the window is the substring we are searching for, we return the result. Otherwise, we keep searching.
[FIXED] The code:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount) {
int startIndex = 0;
boolean found = false;
while (startIndex + length <= word.length()) {
int[] count = new int[ALPHABET_SIZE];
for (int i = startIndex; i < startIndex + length; i++) {
char c = word.charAt(i);
int cnt = ++count[c - 'A'];
if (cnt > maxCount) {
break;
}
}
if (equalValues(count, maxCount)) {
found = true;
break;
} else {
startIndex++;
}
}
return found;
}
// Returns true if all values in c are equal to value
private boolean equalValues(int[] count, int value) {
boolean result = true;
for (int i : count) {
if (i != value) {
result = false;
break;
}
}
return result;
}
[MERGED] This is Hollis Waite's solution using cumulative counts, but taking my observations at points 1. and 2. into consideration. This may improve performance for some inputs:
private static final int ALPHABET_SIZE = 4;
public int longestSubstring(String word) {
// count
int[][] cumulativeCount = new int[ALPHABET_SIZE][];
for (int i = 0; i < ALPHABET_SIZE; i++) {
cumulativeCount[i] = new int[word.length() + 1];
}
int[] count = new int[ALPHABET_SIZE];
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
count[c - 'A']++;
for (int j = 0; j < ALPHABET_SIZE; j++) {
cumulativeCount[j][i + 1] = count[j];
}
}
int maxCount = word.length();
for (int i = 0; i < count.length; i++) {
int cnt = count[i];
if (cnt < maxCount) {
maxCount = cnt;
}
}
// iterate over maxCount until found
boolean found = false;
while (maxCount > 0 && !found) {
int substringLength = maxCount * ALPHABET_SIZE;
found = findSubstring(substringLength, word, maxCount, cumulativeCount);
if (!found) {
maxCount--;
}
}
return found ? maxCount * ALPHABET_SIZE : 0;
}
private boolean findSubstring(int length, String word, int maxCount, int[][] cumulativeCount) {
int startIndex = 0;
int endIndex = (startIndex + length) - 1;
boolean found = true;
while (endIndex < word.length()) {
for (int i = 0; i < ALPHABET_SIZE; i++) {
if (cumulativeCount[i][endIndex] - cumulativeCount[i][startIndex] != maxCount) {
found = false;
break;
}
}
if (found) {
break;
} else {
startIndex++;
endIndex++;
}
}
return found;
}

You'll probably want to cache cumulative counts of characters for each index of String -- that's where the real bottleneck is. Haven't thoroughly tested but something like the below should work.
public class Test {
static final int LEN = 4;
static class RandomCharSequence implements CharSequence {
private final Random mRandom = new Random();
private final int mAlphabetLen;
private final int mLen;
private final int mOffset;
RandomCharSequence(int pLen, int pOffset, int pAlphabetLen) {
mAlphabetLen = pAlphabetLen;
mLen = pLen;
mOffset = pOffset;
}
public int length() {return mLen;}
public char charAt(int pIdx) {
mRandom.setSeed(mOffset + pIdx);
return (char) (
'A' +
(mRandom.nextInt() % mAlphabetLen + mAlphabetLen) % mAlphabetLen
);
}
public CharSequence subSequence(int pStart, int pEnd) {
return new RandomCharSequence(pEnd - pStart, pStart, mAlphabetLen);
}
#Override public String toString() {
return (new StringBuilder(this)).toString();
}
}
public static void main(String[] pArgs) {
Stream.of("ABCDB", "ABCC", "ADDBCCBA", "DADDBCCBA").forEach(
pWord -> System.out.println(longestSubstring(pWord))
);
for (int i = 0; ; i++) {
final double len = Math.pow(10, i);
if (len >= Integer.MAX_VALUE) break;
System.out.println("Str len 10^" + i);
for (int alphabetLen = 1; alphabetLen <= LEN; alphabetLen++) {
final Instant start = Instant.now();
final int val = longestSubstring(
new RandomCharSequence((int) len, 0, alphabetLen)
);
System.out.println(
String.format(
" alphabet len %d; result %08d; time %s",
alphabetLen,
val,
formatMillis(ChronoUnit.MILLIS.between(start, Instant.now()))
)
);
}
}
}
static String formatMillis(long millis) {
return String.format(
"%d:%02d:%02d.%03d",
TimeUnit.MILLISECONDS.toHours(millis),
TimeUnit.MILLISECONDS.toMinutes(millis) -
TimeUnit.HOURS.toMinutes(TimeUnit.MILLISECONDS.toHours(millis)),
TimeUnit.MILLISECONDS.toSeconds(millis) -
TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(millis)),
TimeUnit.MILLISECONDS.toMillis(millis) -
TimeUnit.SECONDS.toMillis(TimeUnit.MILLISECONDS.toSeconds(millis))
);
}
static int longestSubstring(CharSequence pWord) {
// create array that stores cumulative char counts at each index of string
// idx 0 = char (A-D); idx 1 = offset
final int[][] cumulativeCnts = new int[LEN][];
for (int i = 0; i < LEN; i++) {
cumulativeCnts[i] = new int[pWord.length() + 1];
}
final int[] cumulativeCnt = new int[LEN];
for (int i = 0; i < pWord.length(); i++) {
cumulativeCnt[pWord.charAt(i) - 'A']++;
for (int j = 0; j < LEN; j++) {
cumulativeCnts[j][i + 1] = cumulativeCnt[j];
}
}
final int maxResult = Arrays.stream(cumulativeCnt).min().orElse(0) * LEN;
if (maxResult == 0) return 0;
int result = 0;
for (int initialOffset = 0; initialOffset < LEN; initialOffset++) {
for (
int start = initialOffset;
start < pWord.length() - result;
start += LEN
) {
endLoop:
for (
int end = start + result + LEN;
end <= pWord.length() && end - start <= maxResult;
end += LEN
) {
final int substrLen = end - start;
final int expectedCharCnt = substrLen / LEN;
for (int i = 0; i < LEN; i++) {
if (
cumulativeCnts[i][end] - cumulativeCnts[i][start] !=
expectedCharCnt
) {
continue endLoop;
}
}
if (substrLen > result) result = substrLen;
}
}
}
return result;
}
}

Suppose there are K possible letters in a string of length N. We could track the balance of letters seen with a vector pos of length K that is updated as follows:
If letter 1 is seen, add (K-1, -1, -1, ...)
If letter 2 is seen, add (-1, K-1, -1, ...)
If letter 3 is seen, add (-1, -1, K-1, ...)
Maintain a hash that maps pos to the first string position where pos is reached. Balanced substrings occur whenever hash[pos] already exists and the substring value is s[hash[pos]:pos].
The cost of maintaining the hash is O(log N) so processing the string takes O(N log N). How does this compare with solutions so far? These types of problems tend to have linear solutions but I haven't come across one yet.
Here's some code demonstrating the idea for 3 letters and a run using biased random strings. (Uniform random strings allow for solutions that are around half the string length, which is unwieldy to print).
#!/usr/bin/python
import random
from time import time
alphabet = "abc"
DIM = len(alphabet)
def random_string(n):
# return a random string over choices[] of length n
# distribution of letters is non-uniform to make matches harder to find
choices = "aabbc"
s = ''
for i in range(n):
r = random.randint(0, len(choices) - 1)
s += choices[r]
return s
def validate(s):
# verify frequencies of each letter are the same
f = [0, 0, 0]
a2f = {alphabet[i] : i for i in range(DIM)}
for c in s:
f[a2f[c]] += 1
assert f[0] == f[1] and f[1] == f[2]
def longest_balanced(s):
"""return length of longest substring of s containing equal
populations of each letter in alphabet"""
slen = len(s)
p = [0 for i in range(DIM)]
vec = {alphabet[0] : [2, -1, -1],
alphabet[1] : [-1, 2, -1],
alphabet[2] : [-1, -1, 2]}
x = -1
best = -1
hist = {str([0, 0, 0]) : -1}
for c in s:
x += 1
p = [p[i] + vec[c][i] for i in range(DIM)]
pkey = str(p)
if pkey not in hist:
hist[pkey] = x
else:
span = x - hist[pkey]
assert span % DIM == 0
if span > best:
best = span
cand = s[hist[pkey] + 1: x + 1]
print("best so far %d = [%d,%d]: %s" % (best,
hist[pkey] + 1,
x + 1,
cand))
validate(cand)
return best if best > -1 else 0
def main():
#print longest_balanced( "aaabcabcbbcc" )
t0 = time()
s = random_string(1000000)
print "generate time:", time() - t0
t1 = time()
best = longest_balanced( s )
print "best:", best
print "elapsed:", time() - t1
main()
Sample run on an input of 10^6 letters with an alphabet of 3 letters:
$ ./bal.py
...
best so far 189 = [847894,848083]: aacacbcbabbbcabaabbbaabbbaaaacbcaaaccccbcbcbababaabbccccbbabbacabbbbbcaacacccbbaacbabcbccaabaccabbbbbababbacbaaaacabcbabcbccbabbccaccaabbcabaabccccaacccccbaacaaaccbbcbcabcbcacaabccbacccacca
best: 189
elapsed: 1.43609690666

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Find indexOf a byte array within another byte array - java

Google's Guava provides a Bytes.indexOf(byte[] array, byte[] target).

Related

How is the Boyer Moore offset table created (Wiki vcode)?

KMP Algorithm for string search?

Algorithm - Lexicographically largest possible magical substring

pattern search in a text by using three methods

How to find the longest substring with equal amount of characters efficiently

Categories

Resources