How to check if a string can be exhausted by regex matches? - java

So the problem is to determine if every character in a string would be included in a match of a particular regex. Or, to state it differently, if the set of all of the character positions that could be included in some match of a particular regex includes all the character positions in the string.
My thought is to do something like this:
boolean matchesAll(String myString, Matcher myMatcher){
    // One flag per character position; all false by default
    boolean matched[] = new boolean[myString.length()];
    for(myMatcher.reset(myString); myMatcher.find();)
        for(int idx = myMatcher.start(); idx < myMatcher.end(); idx++)
            matched[idx] = true;
    boolean allMatched = true;
    for(boolean charMatched : matched)
        allMatched &= charMatched;
    return allMatched;
}
Is there a better way to do this, however?
Also, as I was writing this, it occurred to me that this would not do what I want in cases like
matchesAll("abcabcabc", Pattern.compile("(abc){2}").matcher("")); //returns false
because Matcher only tries to match starting at the end of the last match. I want it to return true, because if you start the matcher at position 3, it could include the third abc in a match.
boolean matchesAll(String myString, Matcher myMatcher){
    boolean matched[] = new boolean[myString.length()];
    // Point the matcher at the string; find(int) only resets its position
    myMatcher.reset(myString);
    // Restart the search one character after the start of the previous match
    for(int idx = 0; idx < myString.length() && myMatcher.find(idx);
            idx = myMatcher.start() + 1) {
        for(int idx2 = myMatcher.start(); idx2 < myMatcher.end(); idx2++)
            matched[idx2] = true;
    }
    boolean allMatched = true;
    for(boolean charMatched : matched)
        allMatched &= charMatched;
    return allMatched;
}
Is there any way to make this code better, faster, or more readable?

I have two answers for you, although I am not sure I understand the question correctly.
1. Call the Pattern.matcher(str2match).matches() method instead of find(). A true return value tells you in one shot whether the entire string is matched.
2. Prepend "^" (beginning of string) to the regex and append "$" (end of string) before calling Pattern.compile(str) on it.
The two solutions can be combined, too. An example class follows: copy it into AllMatch.java, compile it with "javac AllMatch.java" and run it with "java AllMatch" (I assume you have "." in your CLASSPATH). Just pick the solution you find more elegant :) Happy New Year!
import java.util.regex.Pattern;
public class AllMatch {
private Pattern pattern;
public AllMatch (String reStr) {
pattern = Pattern.compile ("^" + reStr + "$");
}
public boolean checkMatch (String s) {
return pattern.matcher(s).matches();
}
public static void main (String[] args) {
int n = args.length;
String rexp2Match = (n > 0) ? args[0] : "(abc)+",
testString = (n > 1) ? args[1] : "abcabcabc",
matchMaker = new AllMatch (rexp2Match)
.checkMatch(testString) ? "" : "un";
System.out.println ("[AllMatch] match " + matchMaker +
"successful");
}
}
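To make the difference between the two calls concrete, here is a minimal sketch (the class and method names are made up for illustration). matches() succeeds only if the regex consumes the whole input, while find() accepts any matching substring, so explicit "^" and "$" anchors only change the outcome for find().

```java
import java.util.regex.Pattern;

class MatchVsFind {
    // True only if the regex consumes the entire input.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches any substring of the input.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("(abc)+", "abcabcabc"));    // true
        System.out.println(wholeMatch("(abc)+", "abcab"));        // false
        System.out.println(partialMatch("(abc)+", "abcab"));      // true
        System.out.println(partialMatch("^(abc)+$", "abcab"));    // false
        System.out.println(wholeMatch("(abc){2}", "abcabcabc"));  // false
    }
}
```

The last line is worth noting: for the question's (abc){2} example a whole-string match fails, so this approach answers "does the regex match the entire string", which is not quite the same as "is every character inside some match".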

This works:
private static boolean fullyCovered(final String input,
final Pattern pattern)
{
// If the string is empty, check that it is matched by the pattern
if (input.isEmpty())
return pattern.matcher(input).find();
final int len = input.length();
// All initialized to false by default
final boolean[] covered = new boolean[len];
final Matcher matcher = pattern.matcher(input);
for (int index = 0; index < len; index++) {
// Try and match at this index:
if (!matcher.find(index)) {
// if there isn't a match, check if this character is already covered;
// if no, it's a failure
if (!covered[index])
return false;
// Otherwise, continue
continue;
}
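// If the match starts at the string index, fill the covered array
if (matcher.start() == index)
setCovered(covered, index, matcher.end());
// A match that starts later cannot cover this index, so if nothing
// has covered it by now, it's a failure
if (!covered[index])
return false;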
}
// We have finished parsing the string: it is fully covered.
return true;
}
private static void setCovered(final boolean[] covered,
final int beginIndex, final int endIndex)
{
for (int i = beginIndex; i < endIndex; i++)
covered[i] = true;
}
It will probably not be any faster to execute, but I surmise it is easier to read ;) Also, .find(int) resets the matcher, so this is safe.
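For comparison, here is a sketch of the same idea without the boolean array. It is only an illustration under a few assumptions: the class name and the coveredUpTo variable are invented here, and, like the code above, it only considers the single match the engine prefers at each start index, not every match the regex could theoretically produce there. It restricts the matcher to the region starting at each index and uses lookingAt(), which only succeeds for a match beginning exactly at the region start.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CoverageSketch {
    // Returns true if every index of input lies inside a match that
    // starts at or before that index.
    static boolean fullyCovered(String input, Pattern pattern) {
        if (input.isEmpty()) {
            return pattern.matcher(input).find();
        }
        int len = input.length();
        Matcher m = pattern.matcher(input)
                .useAnchoringBounds(false)    // ^ and $ refer to the whole input, not the region
                .useTransparentBounds(true);  // lookarounds may see text outside the region
        int coveredUpTo = 0; // all indices below this value are covered
        for (int i = 0; i < len; i++) {
            m.region(i, len);
            if (m.lookingAt()) { // a match that starts exactly at i
                coveredUpTo = Math.max(coveredUpTo, m.end());
            }
            if (coveredUpTo <= i) {
                return false; // no match starting at or before i reaches i
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(fullyCovered("abcabcabc", Pattern.compile("(abc){2}"))); // true
        System.out.println(fullyCovered("xabc", Pattern.compile("abc")));           // false
    }
}
```

Because start indices are visited in increasing order, a single "furthest end seen so far" value is enough to decide whether the current index is covered.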

Related

How can I check whether a string contains a given string, that is not a part of another given string?

For instance, I have this string: hello world, you're full of weird things
I'd like to know whether this string contains a ll that's not a part of the string hell, aka it should return true in this example, because there's a ll in the word full;
If the string was only hello world, it wouldn't match, since the ll in hello is a part of the given hell string
Can you try this:
String mainString = "hello world, you're full of weird things";
String findString = "ll";
String removeString = "hell";
String newString = mainString.Remove(mainString.IndexOf(removeString), removeString.Length);
if (newString.Contains(findString))
{
return true;
}
You could use RegExp and Negative Lookbehind:
public static boolean matches(String str) {
final Pattern pattern = Pattern.compile("(?<!he)ll");
Matcher matcher = pattern.matcher(str);
return matcher.find();
}
Test:
System.out.println(matches("hello world, you're full of weird things")); // true
System.out.println(matches("hello world")); // false
regex101.com
public static boolean checkOccurance(String str, String findStr, String butNotIn){
    int lastIndex = 0;
    int count = 0;
    // Remove every occurrence of the excluded string, then count what is left
    str = str.replace(butNotIn,"");
    while(lastIndex != -1){
        lastIndex = str.indexOf(findStr,lastIndex);
        if(lastIndex != -1){
            count ++;
            lastIndex += findStr.length();
        }
    }
    // At least one remaining occurrence means findStr appears outside butNotIn
    return count > 0;
}
Call the method:
System.out.println( checkOccurance("hello world, you're full of weird things","ll","hell"));
Output:
true
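Why not find the boundary (start and end index) of butNotIn, then use String#indexOf until you have an index that is outside the boundary?
boolean isIn(String string, String word, String butNotIn) {
    int start = string.indexOf(butNotIn);
    // Without the excluded string, any occurrence of word counts
    if (start < 0) return string.contains(word);
    int end = start + butNotIn.length();
    int position = string.indexOf(word);
    // Skip occurrences that lie inside the boundary of butNotIn
    while (position >= start && position + word.length() <= end) {
        position = string.indexOf(word, position + 1);
    }
    return position > -1;
}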
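The indexOf idea can be extended to handle several occurrences of the excluded string. The following is only a sketch with made-up names (OutsideOccurrence, containsOutside, isInside): for each occurrence of the word it checks whether some occurrence of the excluded string fully contains it, and it assumes an empty word should never count as found.

```java
class OutsideOccurrence {
    // True if word occurs somewhere in text without being fully inside
    // an occurrence of butNotIn.
    static boolean containsOutside(String text, String word, String butNotIn) {
        if (word.isEmpty()) {
            return false; // assumption: an empty word is never "found"
        }
        for (int p = text.indexOf(word); p >= 0; p = text.indexOf(word, p + 1)) {
            if (!isInside(text, p, word.length(), butNotIn)) {
                return true;
            }
        }
        return false;
    }

    // True if some occurrence of butNotIn covers text[p, p + wordLen).
    private static boolean isInside(String text, int p, int wordLen, String butNotIn) {
        int slack = butNotIn.length() - wordLen; // how far left butNotIn may start
        for (int q = Math.max(0, p - slack); q <= p; q++) {
            if (text.startsWith(butNotIn, q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsOutside(
                "hello world, you're full of weird things", "ll", "hell")); // true
        System.out.println(containsOutside("hello world", "ll", "hell"));  // false
    }
}
```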

Regex: How to match a string that is not following a #&&, but has to follow a &

I'm trying to match the String &abD&eG
from abCD#&&abCD&abD&eG
The general rules are:
Match a string consisting of alpha that has to follow & but NOT #&&.
But once the string starts as a single & a #&& is still considered as part of the match.
Consecutive &'s will count as a match.
So some simplified sample strings and matches are:
#&&abc&abc
should match: &abc
&abc&abc
should match: &abc&abc
#&&abc&abc#&&abc
should match: &abc#&&abc
#&&abc#&&abc
should match: NO MATCH
#&&abc
should match: NO MATCH
abc#&&
should match: NO MATCH
abc
should match: NO MATCH
&&abc&abc
should match: &&abc&abc
&&abc#&&
should match: &&abc#&&
#&&&&abc
should match: &&abc
&&abc&abc&&&&
should match: &&abc&abc&&&&
&&&
should match: &&&
abc&abc
should match: &abc
I currently have the regex (?<!#&&)(&\p{Alnum}+)+ but it detects the sequence after & regardless of whether or not it is followed by a #&.
How should I modify it so that it will match accordingly to my general rules?
I tried building a regex for this, but because the & is part of both the marker that excludes a match and the characters to be included in the match, it got extra complicated for something that can easily be detected with a simple DFA.
I am leaving the algorithm here in case it is of any use to you. It is implemented in Java, but porting it to other languages shouldn't be a problem.
The match method returns an ArrayList with three values:
"true" if there was a match or "false" otherwise
The position in the string where the match starts, or -1 if there is no match
The matched string.
import java.util.ArrayList;
public class SO47732442 {
private int [] [] states = {
{1,4,0},
{3,2,3},
{3,0,3},
{3,3,3},
{3,3,3}
};
private int state = 0;
private int getCol(char c){
int rtn = 4;
switch(c){
case '#':
rtn = 0; break;
case '&':
rtn = 1; break;
default:
rtn = 2;
}
return rtn;
}
public ArrayList<String> match(String text){
state = 0;
ArrayList<String> rtn = new ArrayList<>();
StringBuilder sb = new StringBuilder();
int start = -1;
boolean match = false;
for(int i=0; i<text.length();i++){
int col = getCol(text.charAt(i));
state = states[state][col];
if(state == 3){
if(!match){
sb.append("&");
start = i;
match = true;
}
sb.append(text.charAt(i));
}
}
rtn.add(match? "true" : "false");
rtn.add(""+start);
rtn.add(sb.toString());
return rtn;
}
/* This is just to test the matches */
public static void main(String[] args){
SO47732442 app = new SO47732442();
ArrayList<String> tests = new ArrayList<>();
tests.add("#&&abc&abc");
tests.add("&abc&abc");
tests.add("#&&abc&abc#&&abc");
tests.add("#&&abc#&&abc");
tests.add("#&&abc");
tests.add("abc#&&");
tests.add("abc");
tests.add("&&abc&abc ");
tests.add("&&abc#&&");
tests.add("#&&&&abc");
tests.add("&&abc&abc&&&&");
tests.add("&&&");
tests.add("abc&abc");
tests.add("abcabc&");
for(String test : tests){
System.out.println("Text: " + test);
ArrayList<String> result = app.match(test);
for(String res : result){
System.out.println(res);
}
System.out.println("");
}
}
}
Can't get a regex to work, but here's a function that passes all your test cases (probably can be cleaned up a little):
public static String getMatch(String string) {
int startIndex = 0;
while (string.indexOf("&", startIndex) > string.indexOf("#&&", startIndex))
{
if (string.indexOf("&", startIndex) < 0) return "";
if (string.indexOf("#&&", startIndex) < 0) return string.substring(string.indexOf("&", startIndex));
startIndex = string.indexOf("#&&", startIndex) + 3;
}
return (string.indexOf("&", startIndex) < 0) ? "" : string.substring(string.indexOf("&", startIndex));
}
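For what it's worth, the sample table in the question can also be reproduced with two negative lookbehinds. This is a sketch rather than a definitive answer, and it rests on two assumptions read off the samples and not stated in the rules: the match always runs to the end of the input, and a & directly after # is always treated as part of a #&& marker. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmpersandMatch {
    // A '&' that is neither directly after '#' nor directly after "#&",
    // followed by the rest of the line.
    private static final Pattern START = Pattern.compile("(?<!#)(?<!#&)&.*");

    // Returns the matched text, or null when there is no match.
    static String match(String text) {
        Matcher m = START.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(match("abCD#&&abCD&abD&eG")); // &abD&eG
        System.out.println(match("#&&abc&abc#&&abc"));   // &abc#&&abc
        System.out.println(match("#&&abc#&&abc"));       // null
    }
}
```

The two lookbehinds reject the first and second & of a #&& marker respectively; once a start is accepted, .* takes everything after it, including any later #&&.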

How to determine whether a string is a subsequence of another string regardless of characters in between?

I am trying to write code that will tell me if one string is a subsequence of another string. The catch is that it does not matter if there are characters in between, and the only characters that matter are 'A', 'T', 'G' and 'C'. For instance:
"TxxAA" is a subsequence of "CTyyGCACA"
"pln" is a subsequence of "oiu"
"TAA" is NOT a subsequence of "TCCCA"
Currently I am doing
private boolean subSequence(DNASequence other) {
other.fix();
boolean valid = false;
String t = other.toString();
data = dataFix(data);
int index = 0;
for (int i = 0; i < data.length(); i++) {
for (int j = 0; j < t.length(); j++) {
if(data.charAt(i) == t.charAt(j)) {
if( j >= index) {
valid = true;
index = j;
t = t.replace(t.charAt(j), '_');
} else {
valid = false;
}
}
}
}
if (data == "" || t == "" ) {
valid = true;
}
return valid;
}
private String dataFix(String data) {
for (int i = 0; i < data.length(); i += 1) {
char ch = data.charAt(i);
if (("ATGC".indexOf(ch) < 0))
data = data.replace(data.charAt(i), ' ');
}
data = data.replaceAll(" ", "").trim();
return data;
}
the fix() and dataFix() methods erase all characters besides "ATGC". As the code iterates through, it is replacing the character in t that matches with data.charAt(i) with a _ so that it does not rematch the same letter (I was having that problem).
Currently, what is happening is that the replace function is replacing every char in the string not just the char at the specific index (which is what it is supposed to do) What is a better way to approach this problem? Where am I going wrong? Thank you.
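On the narrower point of why every matching character gets replaced: String.replace(char, char) replaces all occurrences of the character, not just the one at a given index. A minimal sketch of replacing a single position instead, using StringBuilder.setCharAt (the class and method names here are made up for illustration):

```java
class ReplaceAtIndex {
    // Returns a copy of s with only the character at index replaced.
    static String replaceAt(String s, int index, char replacement) {
        StringBuilder sb = new StringBuilder(s);
        sb.setCharAt(index, replacement);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("TCCCA".replace('C', '_'));   // T___A (every C is replaced)
        System.out.println(replaceAt("TCCCA", 1, '_'));  // T_CCA (only index 1 is replaced)
    }
}
```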
To answer the first question 'What is a better way to approach this problem?', I would recommend using Regular Expressions (or regex). Regular Expressions are a way to express patterns in text.
For this example where you have a search term:
TxxAA
a regex to describe the pattern you are looking for could be:
T.*A.*A
Without going into too much detail the term .* is an expression for any number (zero or more) of any characters. So this regex describes a pattern which is: T; then any characters; A; then any characters; and then A.
Your original question becomes "does a sequence have a sub-sequence with the pattern T.*A.*A?". Java has a regex library built in and you can use the Pattern and Matcher objects to answer this question.
Some sample code as a demonstration:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class DnaMatcher {
static boolean isSearchChar(char c) {
return 'A' == c || 'T' == c || 'G' == c || 'C' == c;
}
static Pattern preparePattern(String searchSequence) {
StringBuilder pattern = new StringBuilder();
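// True until the first search character has been appended, so that
// ".*" is only inserted between characters
boolean first = true;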
for (char c : searchSequence.toCharArray()) {
if (isSearchChar(c)) {
if (first) {
first = false;
} else {
pattern.append(".*");
}
pattern.append(c);
}
}
return Pattern.compile(pattern.toString());
}
static boolean contains(String sequence, String searchSequence) {
Pattern pattern = preparePattern(searchSequence);
Matcher matcher = pattern.matcher(sequence);
return matcher.find();
}
public static void main(String...none) throws Exception {
System.out.println(contains("CTyyGCACA", "TxxAA")); // true
System.out.println(contains("TCCCA", "TAA")); // false
}
}
You can see that the preparePattern method builds the regular expression as discussed.
Understanding that the strings might be very long, a regular expression check might take some time.
static String fix(String s) {
return s.replaceAll("[^ACGT]+", "");
}
static boolean isSubSequence(String sought, String chain) {
sought = fix(sought);
chain = fix(chain);
char[] soughtChars = sought.toCharArray();
char[] chainChars = chain.toCharArray();
int si = 0;
for (int ci = 0; si < soughtChars.length && ci < chainChars.length; ++ci) {
if (chainChars[ci] == soughtChars[si]) {
++si;
}
}
return si >= soughtChars.length;
}
Or
static boolean isSubSequence(String sought, String chain) {
sought = fix(sought);
chain = fix(chain);
int ci = 0;
for (char ch : sought.toCharArray()) {
ci = chain.indexOf(ch, ci);
if (ci < 0) {
return false;
}
++ci;
}
return true;
}
The bigger question seems to be how meaningful such a result is.
I did a comparison with the regex approach:
StringBuilder sb = new StringBuilder(10_000);
Random random = new Random(42);
for (int i = 0; i < 10_1000 - 6; ++i) {
sb.append("ACGT".charAt(random.nextInt(3)));
}
sb.append("TTAGTA");
String s = sb.toString();
String t = "TAGAAG";
{
long t0 = System.nanoTime();
boolean found = contains(s, t);
long t1 = System.nanoTime();
System.out.printf("Found: %s in %d ms%n", found, (t1 - t0) / 1000_000L);
}
{
long t0 = System.nanoTime();
boolean found = isSubSequence(t, s);
long t1 = System.nanoTime();
System.out.printf("Found: %s in %d ms%n", found, (t1 - t0) / 1000_000L);
}
Results
Found: false in 31829 ms --> Regex
Found: false in 5 ms --> indexOf
But: the case is quite artificial: failure on a short string.
It can be done with a (relatively) simple recursion:
/**
* Returns true if s1 is a subsequence of s2, false otherwise
*/
private static boolean isSubSeq(String s1, String s2) {
if ("".equals(s1)) {
return true;
}
String first = s1.substring(0, 1);
s1 = s1.substring(1);
int index = s2.indexOf(first);
if (index == -1) {
return false;
}
s2 = s2.substring(index+1);
return isSubSeq(s1, s2);
}
Algorithm: look for the first index of the first character of s1 in s2, if there is no such index - the answer is false, if there is, we can continue looking (recursively) for the next letter starting at position index+1
EDIT
It seems that you need to sanitize your input to include only the characters: 'A', 'T', 'G', 'C'
It's easy to do (following runs on Java 9, but it's easy to modify to lower versions of Java):
private static String sanitize(String s) {
String result = "";
List<Character> valid = List.of( 'A', 'T', 'G', 'C');
for (char c : s.toCharArray()) {
if (valid.contains(c)) {
result += c;
}
}
return result;
}
Then it's used as follows (example):
public static void main(String[] args) {
String s1 = "TxxAA";
String s2 = "CTyyGCACA";
s1 = sanitize(s1); // you need to sanitize only s1, can you see why?
System.out.println(isSubSeq(s1, s2));
}

Pattern matching interview Q

I was recently in an interview and they asked me the following question:
Write a function to return true if a string matches a pattern, false
otherwise
Pattern: 1 character per item, (a-z), input: space delimited string
This was my solution for the first problem:
static boolean isMatch(String pattern, String input) {
char[] letters = pattern.toCharArray();
String[] split = input.split("\\s+");
if (letters.length != split.length) {
// early return - not possible to match if lengths aren't equal
return false;
}
Map<String, Character> map = new HashMap<>();
// aaaa test test test1 test1
boolean used[] = new boolean[26];
for (int i = 0; i < letters.length; i++) {
Character existing = map.get(split[i]);
if (existing == null) {
// put into map if not found yet
if (used[(int)(letters[i] - 'a')]) {
return false;
}
used[(int)(letters[i] - 'a')] = true;
map.put(split[i], letters[i]);
} else {
// doesn't match - return false
if (existing != letters[i]) {
return false;
}
}
}
return true;
}
public static void main(String[] argv) {
System.out.println(isMatch("aba", "blue green blue"));
System.out.println(isMatch("aba", "blue green green"));
}
The next part of the problem stumped me:
With no delimiters in the input, write the same function.
eg:
isMatch("aba", "bluegreenblue") -> true
isMatch("abc","bluegreenyellow") -> true
isMatch("aba", "t1t2t1") -> true
isMatch("aba", "t1t1t1") -> false
isMatch("aba", "t1t11t1") -> true
isMatch("abab", "t1t2t1t2") -> true
isMatch("abcdefg", "ieqfkvu") -> true
isMatch("abcdefg", "bluegreenredyellowpurplesilvergold") -> true
isMatch("ababac", "bluegreenbluegreenbluewhite") -> true
isMatch("abdefghijklmnopqrstuvwxyz", "zyxwvutsrqponmlkjihgfedcba") -> true
I wrote a bruteforce solution (generating all possible splits of the input string of size letters.length and checking in turn against isMatch) but the interviewer said it wasn't optimal.
I have no idea how to solve this part of the problem, is this even possible or am I missing something?
They were looking for something with a time complexity of O(M x N ^ C), where M is the length of the pattern and N is the length of the input, C is some constant.
Clarifications
I'm not looking for a regex solution, even if it works.
I'm not looking for the naive solution that generates all possible splits and checks them, even with optimization since that'll always be exponential time.
It is possible to optimize a backtracking solution. Instead of generating all splits first and then checking whether each one is valid, we can check validity on the fly. Let's assume that we have already split a prefix (of length p) of the initial string and have matched i characters of the pattern. Now look at character i + 1 of the pattern.
If there is a string in the prefix that corresponds to the i + 1 letter, we should just check that a substring that starts at the position p + 1 is equal to it. If it is, we just proceed to i + 1 and p + the length of this string. Otherwise, we can kill this branch.
If there is no such string, we should try all substrings that start in the position p + 1 and end somewhere after it.
We can also use the following idea to reduce the number of branches in your solution: we can estimate the length of the suffix of the pattern which has not been processed yet (we know the length for the letters that already stand for some strings, and a trivial lower bound on the length of the string for any other letter is 1). This lets us kill a branch if the remaining part of the initial string is too short to match the rest of the pattern.
This solution still has an exponential time complexity, but it can work much faster than generating all splits because invalid solutions can be thrown away much earlier, so the number of reachable states can reduce significantly.
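The steps above can be sketched roughly as follows. This is an illustration, not the interviewer's intended solution: the class and helper names are invented, each pattern letter is assumed to map to a non-empty string, distinct letters are assumed to need distinct strings (as in the "t1t1t1" example), and only the trivial lower bound of one character per remaining letter is used for pruning.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class BacktrackingMatcher {
    static boolean isMatch(String pattern, String input) {
        return solve(pattern, 0, input, 0, new HashMap<>(), new HashSet<>());
    }

    // i: next pattern index, p: next input index,
    // bound: letter -> string chosen so far, used: strings already taken.
    private static boolean solve(String pattern, int i, String input, int p,
                                 Map<Character, String> bound, Set<String> used) {
        if (i == pattern.length()) {
            return p == input.length();
        }
        // Prune: every remaining letter needs at least one character.
        if (input.length() - p < pattern.length() - i) {
            return false;
        }
        char c = pattern.charAt(i);
        String known = bound.get(c);
        if (known != null) {
            // The letter is already bound: the input must continue with its string.
            return input.startsWith(known, p)
                    && solve(pattern, i + 1, input, p + known.length(), bound, used);
        }
        // Unbound letter: try every candidate, leaving one character
        // for each of the letters that follow.
        int maxEnd = input.length() - (pattern.length() - i - 1);
        for (int end = p + 1; end <= maxEnd; end++) {
            String candidate = input.substring(p, end);
            if (used.contains(candidate)) {
                continue; // another letter already stands for this string
            }
            bound.put(c, candidate);
            used.add(candidate);
            if (solve(pattern, i + 1, input, end, bound, used)) {
                return true;
            }
            bound.remove(c); // undo and try a longer candidate
            used.remove(candidate);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isMatch("aba", "bluegreenblue")); // true
        System.out.println(isMatch("aba", "t1t1t1"));        // false
        System.out.println(isMatch("aba", "t1t11t1"));       // true
    }
}
```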
I feel like this is cheating, and I'm not convinced the capture group and reluctant quantifier will do the right thing. Or maybe they're looking to see if you can recognize that, because of how quantifiers work, matching is ambiguous.
boolean matches(String s, String pattern) {
StringBuilder patternBuilder = new StringBuilder();
Map<Character, Integer> backreferences = new HashMap<>();
int nextBackreference = 1;
for (int i = 0; i < pattern.length(); i++) {
char c = pattern.charAt(i);
if (!backreferences.containsKey(c)) {
backreferences.put(c, nextBackreference++);
patternBuilder.append("(.*?)");
} else {
patternBuilder.append('\\').append(backreferences.get(c));
}
}
return s.matches(patternBuilder.toString());
}
You could improve on brute force by first assuming token lengths, and checking that the sum of token lengths equals the length of the test string. That would be quicker than pattern matching each time. Still very slow as number of unique tokens increases however.
UPDATE:
Here is my solution. Based it off of the explanation I made before.
import com.google.common.collect.*;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.commons.math3.util.Combinations;
import java.util.*;
/**
* Created by carlos on 2/14/15.
*/
public class PatternMatcher {
public static boolean isMatch(char[] pattern, String searchString){
return isMatch(pattern, searchString, new TreeMap<Integer, Pair<Integer, Integer>>(), Sets.newHashSet());
}
private static boolean isMatch(char[] pattern, String searchString, Map<Integer, Pair<Integer, Integer>> candidateSolution, Set<String> mappedStrings) {
List<Integer> occurrencesOfCharacterInPattern = getNextUnmappedPatternOccurrences(candidateSolution, pattern);
if(occurrencesOfCharacterInPattern.size() == 0)
return isValidSolution(candidateSolution, searchString, pattern, mappedStrings);
List<Pair<Integer, Integer>> sectionsOfUnmappedStrings = sectionsOfUnmappedStrings(searchString, candidateSolution);
if(sectionsOfUnmappedStrings.size() == 0)
return false;
String firstUnmappedString = substring(searchString, sectionsOfUnmappedStrings.get(0));
for (int substringSize = 1; substringSize <= firstUnmappedString.length(); substringSize++) {
String candidateSubstring = firstUnmappedString.substring(0, substringSize);
if(mappedStrings.contains(candidateSubstring))
continue;
List<Pair<Integer, Integer>> listOfAllOccurrencesOfSubstringInString = Lists.newArrayList();
for (int currentIndex = 0; currentIndex < sectionsOfUnmappedStrings.size(); currentIndex++) {
Pair<Integer,Integer> currentUnmappedSection = sectionsOfUnmappedStrings.get(currentIndex);
List<Pair<Integer, Integer>> occurrencesOfSubstringInString =
findAllInstancesOfSubstringInString(searchString, candidateSubstring,
currentUnmappedSection);
for(Pair<Integer,Integer> possibleAddition:occurrencesOfSubstringInString) {
listOfAllOccurrencesOfSubstringInString.add(possibleAddition);
}
}
if(listOfAllOccurrencesOfSubstringInString.size() < occurrencesOfCharacterInPattern.size())
return false;
Iterator<int []> possibleSolutionIterator =
new Combinations(listOfAllOccurrencesOfSubstringInString.size(),
occurrencesOfCharacterInPattern.size()).iterator();
iteratorLoop:
while(possibleSolutionIterator.hasNext()) {
Set<String> newMappedSets = Sets.newHashSet(mappedStrings);
newMappedSets.add(candidateSubstring);
TreeMap<Integer,Pair<Integer,Integer>> newCandidateSolution = Maps.newTreeMap();
// why doesn't Maps.newTreeMap(candidateSolution) work?
newCandidateSolution.putAll(candidateSolution);
int [] possibleSolutionIndexSet = possibleSolutionIterator.next();
for(int i = 0; i < possibleSolutionIndexSet.length; i++) {
Pair<Integer, Integer> candidatePair = listOfAllOccurrencesOfSubstringInString.get(possibleSolutionIndexSet[i]);
//if(candidateSolution.containsValue(Pair.of(0,1)) && candidateSolution.containsValue(Pair.of(9,10)) && candidateSolution.containsValue(Pair.of(18,19)) && listOfAllOccurrencesOfSubstringInString.size() == 3 && candidateSolution.size() == 3 && possibleSolutionIndexSet[0]==0 && possibleSolutionIndexSet[1] == 2){
if (makesSenseToInsert(newCandidateSolution, occurrencesOfCharacterInPattern.get(i), candidatePair))
newCandidateSolution.put(occurrencesOfCharacterInPattern.get(i), candidatePair);
else
break iteratorLoop;
}
if (isMatch(pattern, searchString, newCandidateSolution,newMappedSets))
return true;
}
}
return false;
}
private static boolean makesSenseToInsert(TreeMap<Integer, Pair<Integer, Integer>> newCandidateSolution, Integer startIndex, Pair<Integer, Integer> candidatePair) {
if(newCandidateSolution.size() == 0)
return true;
if(newCandidateSolution.floorEntry(startIndex).getValue().getRight() > candidatePair.getLeft())
return false;
Map.Entry<Integer, Pair<Integer, Integer>> ceilingEntry = newCandidateSolution.ceilingEntry(startIndex);
if(ceilingEntry !=null)
if(ceilingEntry.getValue().getLeft() < candidatePair.getRight())
return false;
return true;
}
private static boolean isValidSolution( Map<Integer, Pair<Integer, Integer>> candidateSolution,String searchString, char [] pattern, Set<String> mappedStrings){
List<Pair<Integer,Integer>> values = Lists.newArrayList(candidateSolution.values());
return areIntegersConsecutive(Lists.newArrayList(candidateSolution.keySet())) &&
arePairsConsecutive(values) &&
values.get(values.size() - 1).getRight() == searchString.length() &&
patternsAreUnique(pattern,mappedStrings);
}
private static boolean patternsAreUnique(char[] pattern, Set<String> mappedStrings) {
Set<Character> uniquePatterns = Sets.newHashSet();
for(Character character:pattern)
uniquePatterns.add(character);
return uniquePatterns.size() == mappedStrings.size();
}
private static List<Integer> getNextUnmappedPatternOccurrences(Map<Integer, Pair<Integer, Integer>> candidateSolution, char[] searchArray){
List<Integer> allMappedIndexes = Lists.newLinkedList(candidateSolution.keySet());
if(allMappedIndexes.size() == 0){
return occurrencesOfCharacterInArray(searchArray,searchArray[0]);
}
if(allMappedIndexes.size() == searchArray.length){
return Lists.newArrayList();
}
for(int i = 0; i < allMappedIndexes.size()-1; i++){
if(!areIntegersConsecutive(allMappedIndexes.get(i),allMappedIndexes.get(i+1))){
return occurrencesOfCharacterInArray(searchArray,searchArray[i+1]);
}
}
List<Integer> listOfNextUnmappedPattern = Lists.newArrayList();
listOfNextUnmappedPattern.add(allMappedIndexes.size());
return listOfNextUnmappedPattern;
}
private static String substring(String string, Pair<Integer,Integer> bounds){
try{
string.substring(bounds.getLeft(),bounds.getRight());
}catch (StringIndexOutOfBoundsException e){
System.out.println();
}
return string.substring(bounds.getLeft(),bounds.getRight());
}
private static List<Pair<Integer, Integer>> sectionsOfUnmappedStrings(String searchString, Map<Integer, Pair<Integer, Integer>> candidateSolution) {
if(candidateSolution.size() == 0) {
return Lists.newArrayList(Pair.of(0, searchString.length()));
}
List<Pair<Integer, Integer>> sectionsOfUnmappedStrings = Lists.newArrayList();
List<Pair<Integer,Integer>> allMappedPairs = Lists.newLinkedList(candidateSolution.values());
// Dont have to worry about the first index being mapped because of the way the first candidate solution is made
for(int i = 0; i < allMappedPairs.size() - 1; i++){
if(!arePairsConsecutive(allMappedPairs.get(i), allMappedPairs.get(i + 1))){
Pair<Integer,Integer> candidatePair = Pair.of(allMappedPairs.get(i).getRight(), allMappedPairs.get(i + 1).getLeft());
sectionsOfUnmappedStrings.add(candidatePair);
}
}
Pair<Integer,Integer> lastMappedPair = allMappedPairs.get(allMappedPairs.size() - 1);
if(lastMappedPair.getRight() != searchString.length()){
sectionsOfUnmappedStrings.add(Pair.of(lastMappedPair.getRight(),searchString.length()));
}
return sectionsOfUnmappedStrings;
}
public static boolean areIntegersConsecutive(List<Integer> integers){
for(int i = 0; i < integers.size() - 1; i++)
if(!areIntegersConsecutive(integers.get(i),integers.get(i+1)))
return false;
return true;
}
public static boolean areIntegersConsecutive(int left, int right){
return left == (right - 1);
}
public static boolean arePairsConsecutive(List<Pair<Integer,Integer>> pairs){
for(int i = 0; i < pairs.size() - 1; i++)
if(!arePairsConsecutive(pairs.get(i), pairs.get(i + 1)))
return false;
return true;
}
public static boolean arePairsConsecutive(Pair<Integer, Integer> left, Pair<Integer, Integer> right){
return left.getRight() == right.getLeft();
}
public static List<Integer> occurrencesOfCharacterInArray(char[] searchArray, char searchCharacter){
assert(searchArray.length>0);
List<Integer> occurrences = Lists.newLinkedList();
for(int i = 0;i<searchArray.length;i++){
if(searchArray[i] == searchCharacter)
occurrences.add(i);
}
return occurrences;
}
public static List<Pair<Integer,Integer>> findAllInstancesOfSubstringInString(String searchString, String substring, Pair<Integer,Integer> bounds){
String string = substring(searchString,bounds);
assert(StringUtils.isNoneBlank(substring,string));
int lastIndex = 0;
List<Pair<Integer,Integer>> listOfOccurrences = Lists.newLinkedList();
while(lastIndex != -1){
lastIndex = string.indexOf(substring,lastIndex);
if(lastIndex != -1){
int newIndex = lastIndex + substring.length();
listOfOccurrences.add(Pair.of(lastIndex + bounds.getLeft(), newIndex + bounds.getLeft()));
lastIndex = newIndex;
}
}
return listOfOccurrences;
}
}
It works with the cases provided, but is not thoroughly tested. Let me know if there are any mistakes.
ORIGINAL RESPONSE:
Assuming your string you are searching can have arbitrary length tokens (which some of your examples do) then:
You want to start trying to break your string into parts that match the pattern. Looking for contradictions along the way to cut down on your search tree.
When you start processing you're going to select N characters of the beginning of the string. Now, go and see if you can find that substring in the rest of the string. If you can't then it can't possibly be a solution. If you can then your string looks something like this
(N characters)<...>[(N characters)<...>] where each <...> contains 0+ characters and the <...> parts aren't necessarily the same substring. And what's inside of [] could repeat a number of times equal to the number of times (N characters) appears in the string.
Now you have the first letter of your pattern matched. You're not sure if the rest of the pattern matches, but you can basically re-use this algorithm (with modifications) to interrogate the <...> parts of the string.
You would do this for N = 1,2,3,4...
Make sense?
I'll work an example (which doesn't cover all cases, but hopefully illustrates the idea). Note: when I'm referring to substrings in the pattern I'll use single quotes, and when I'm referring to substrings of the string I'll use double quotes.
isMatch("ababac", "bluegreenbluegreenbluewhite")
Ok, 'a' is my first pattern.
for N = 1 i get the string "b"
where is "b" in the search string?
bluegreenbluegreenbluewhite.
Ok, so at this point this string MIGHT match with "b" being the pattern 'a'. Let's see if we can do the same with the pattern 'b'. Logically, 'b' MUST be the entire string "luegreen" (because it's squeezed between two consecutive 'a' patterns), so I check between the 2nd and 3rd 'a'. Yup, it's "luegreen".
Ok, so far I've matched all but the 'c' of my pattern. Easy case: it's the rest of the string. It matches.
This is basically writing a Perl regex parser. ababc = (.+)(.+)(\1)(\2)(.+). So you just have to convert it to a Perl regex
Here's a sample snippet of my code:
// requires: import java.util.Arrays;
public static final boolean isMatch(String patternStr, String input) {
    // Initial check: if all the characters in the pattern string are unique
    // (degenerate case), immediately return true.
    char[] patt = patternStr.toCharArray();
    Arrays.sort(patt);
    boolean uniqueCase = true;
    for (int i = 1; i < patt.length; i++) {
        if (patt[i] == patt[i - 1]) {
            uniqueCase = false;
            break;
        }
    }
    if (uniqueCase) {
        return true;
    }

    String t1 = patternStr;
    String t2 = input;
    if (patternStr.length() == 0 && input.length() == 0) {
        return true;
    } else if (patternStr.length() != 0 && input.length() == 0) {
        return false;
    } else if (patternStr.length() == 0 && input.length() != 0) {
        return false;
    }

    int count = 0;
    StringBuilder sb = new StringBuilder();
    char[] chars = input.toCharArray();
    String match = "";
    // First read: grow the buffer one character at a time until it no longer
    // occurs again in the rest of the input, then drop the last character.
    for (int i = 0; i < chars.length; i++) {
        sb.append(chars[i]);
        count++;
        if (!input.substring(count, input.length()).contains(sb.toString())) {
            match = sb.delete(sb.length() - 1, sb.length()).toString();
            break;
        }
    }
    // Nothing repeats: treat the whole input as the match.
    if (match.length() == 0) {
        match = t2;
    }

    // Based on that match, remove the first pattern character from the
    // pattern string and the matched substring from the input, then recurse.
    t1 = t1.replace(String.valueOf(t1.charAt(0)), "");
    t2 = t2.replace(match, "");
    return isMatch(t1, t2);
}
I basically decided to first parse the pattern string and determine whether any character occurs in it more than once. For example, in "aab" the character "a" is used twice, so both occurrences of "a" must map to the same substring. Otherwise, if no character repeats, as in "abc", it doesn't matter what the input string is: every pattern character is unique, so each one can map to anything (degenerate case).
If there are repeated characters in the pattern string, I begin to check what each one matches. Unfortunately, without knowing the delimiter I can't know how long each substring is. Instead, I read one character at a time into a buffer and check whether the rest of the input still contains the buffered string; I keep adding characters until the buffer can no longer be found in the rest of the input. Once the substring is determined (it is now in the buffer), I simply delete all its occurrences from the input string, delete the pattern character from the pattern string, and recurse.
Apologies if my explanation wasn't very clear; I hope the code makes it clear, though.

Java Regex: How to detect the index of the non-matching char in a complex regex

I'm using regex to control an input and I want to get the exact index of the wrong char.
My regex is:
^[A-Z]{1,4}(/[1-2][0-9][0-9][0-9][0-1][0-9])?
If I type the following input:
DATE/201A08
Then matcher.group() (using the lookingAt() method) will return "DATE" instead of "DATE/201", so I can't tell that the wrong index is 9.
If I read this right, you can't do this using only one regex.
^[A-Z]{1,4}(/[1-2][0-9][0-9][0-9][0-1][0-9])? accepts a String starting with 1 to 4 uppercase letters, followed either by nothing or by / and exactly 6 digits. So it correctly parses your input as "DATE", because that prefix alone is valid according to your regex.
Try to split this into two checks. First, check that the input starts with a valid prefix such as DATE. Then, if there's an actual / part, check that part against the non-optional pattern.
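The two checks could look roughly like this. It is only a sketch: the class name TwoStepCheck and the method isValid are invented for the example, and it assumes the whole input must be consumed (the original regex, used with lookingAt(), would tolerate trailing characters).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: validates the letter prefix first, then the date part.
class TwoStepCheck {
    private static final Pattern PREFIX = Pattern.compile("[A-Z]{1,4}");
    // Same date part as in the question, but no longer optional.
    private static final Pattern DATE_PART =
            Pattern.compile("/[1-2][0-9][0-9][0-9][0-1][0-9]");

    static boolean isValid(String input) {
        // Check 1: the input must start with 1 to 4 uppercase letters.
        Matcher prefix = PREFIX.matcher(input);
        if (!prefix.lookingAt()) {
            return false;
        }
        // Check 2: whatever follows must be empty or a complete date part.
        String rest = input.substring(prefix.end());
        if (rest.isEmpty()) {
            return true;
        }
        return DATE_PART.matcher(rest).matches();
    }
}
```

With this split, "DATE" and "DATE/201908" are accepted while "DATE/201A08" is rejected, instead of being silently cut down to "DATE".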
You want to know whether the entire pattern matched, and when not, how far it matched.
That is where a single regex fails. A regex match must succeed for group() to give results, and if it succeeds on only a part of the input, one does not know whether everything was matched.
The sensible thing to do is split the matching.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProgressiveMatch {
    private final String[] regexParts;
    private String group;

    ProgressiveMatch(String... regexParts) {
        this.regexParts = regexParts;
    }

    // Variant 1: a single lookingAt with every part wrapped as an optional
    // group, (...)?(...)?...; the first null group shows where matching stopped.
    public boolean lookingAt(String text) {
        StringBuilder sb = new StringBuilder();
        sb.append('^');
        for (int i = 0; i < regexParts.length; ++i) {
            String part = regexParts[i];
            sb.append("(");
            sb.append(part);
            sb.append(")?");
        }
        Pattern pattern = Pattern.compile(sb.toString());
        Matcher m = pattern.matcher(text);
        if (m.lookingAt()) {
            boolean all = true;
            group = "";
            for (int i = 1; i <= regexParts.length; ++i) {
                if (m.group(i) == null) {
                    all = false;
                    break;
                }
                group += m.group(i);
            }
            return all;
        }
        group = null;
        return false;
    }

    // Variant 2: lookingAt with multiple patterns, built from the first n
    // parts for decreasing n. It has its own name because Java does not allow
    // two methods with the same signature in one class.
    public boolean lookingAtByPrefix(String text) {
        for (int n = regexParts.length; n > 0; --n) {
            // Match for n parts:
            StringBuilder sb = new StringBuilder();
            sb.append('^');
            for (int i = 0; i < n; ++i) {
                String part = regexParts[i];
                sb.append(part);
            }
            Pattern pattern = Pattern.compile(sb.toString());
            Matcher m = pattern.matcher(text);
            if (m.lookingAt()) {
                group = m.group();
                return n == regexParts.length;
            }
        }
        group = null;
        return false;
    }

    // The longest prefix matched by the last call, or null if nothing matched.
    public String group() {
        return group;
    }

    public static void main(String[] args) {
        // ^[A-Z]{1,4}(/[1-2][0-9][0-9][0-9][0-1][0-9])?
        ProgressiveMatch match = new ProgressiveMatch("[A-Z]{1,4}", "/",
                "[1-2]", "[0-9]", "[0-9]", "[0-9]", "[0-1]", "[0-9]");
        boolean matched = match.lookingAt("DATE/201A08");
        System.out.println("Matched: " + matched);
        System.out.println("Up to: " + match.group());
    }
}
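Since the question asks for the index of the offending character, here is a compact sketch that returns that index directly instead of the matched prefix. It follows the same part-by-part idea but is not part of the class above: the names FirstMismatch and firstMismatchIndex are hypothetical, all parts are treated as required, and the returned index is 0-based.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds where a sequence of regex parts stops matching.
class FirstMismatch {

    // Returns the 0-based index at which matching stops, or -1 if every
    // part matched. Parts are appended one by one and tried with lookingAt.
    static int firstMismatchIndex(String text, String... parts) {
        StringBuilder prefix = new StringBuilder();
        int matchedUpTo = 0;
        for (String part : parts) {
            prefix.append(part);
            Matcher m = Pattern.compile(prefix.toString()).matcher(text);
            if (!m.lookingAt()) {
                return matchedUpTo; // end of the last successful prefix
            }
            matchedUpTo = m.end();
        }
        return -1;
    }

    public static void main(String[] args) {
        int index = firstMismatchIndex("DATE/201A08", "[A-Z]{1,4}", "/",
                "[1-2]", "[0-9]", "[0-9]", "[0-9]", "[0-1]", "[0-9]");
        System.out.println("First mismatch at index: " + index);
    }
}
```

For "DATE/201A08" this reports index 8, the position of the "A" counting from 0 (the 9th character). For an input that is just "DATE" it reports 4, because the "/" part is treated as required here; a caller that wants the date part to stay optional would need to handle that case itself.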
One could make a small DSL in Java, like:
ProgressiveMatch match = ProgressiveMatchBuilder
.range("A", "Z", 1, 4)
.literal("/")
.range("1", "2")
.range("0", "9", 3, 3)
.range("0", "1")
.range("0", "9")
.match();
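ProgressiveMatchBuilder is not shown in the answer, so the following is only a guess at how such a builder might produce the regex parts. The class name RegexPartsBuilder and its build() method are assumptions made for this sketch; it returns the parts as a String[] that could be passed to a constructor taking String... parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical builder: collects regex parts through a fluent interface.
class RegexPartsBuilder {
    private final List<String> parts = new ArrayList<>();

    // One character in the range from-to, exactly once.
    RegexPartsBuilder range(String from, String to) {
        parts.add("[" + from + "-" + to + "]");
        return this;
    }

    // One character class repeated between min and max times.
    RegexPartsBuilder range(String from, String to, int min, int max) {
        String quantifier = (min == max)
                ? "{" + min + "}"
                : "{" + min + "," + max + "}";
        parts.add("[" + from + "-" + to + "]" + quantifier);
        return this;
    }

    // A literal string, quoted so that regex metacharacters are not special.
    RegexPartsBuilder literal(String text) {
        parts.add(Pattern.quote(text));
        return this;
    }

    String[] build() {
        return parts.toArray(new String[0]);
    }
}
```

A chain such as new RegexPartsBuilder().range("A", "Z", 1, 4).literal("/").range("1", "2").range("0", "9", 3, 3).range("0", "1").range("0", "9").build() yields six parts. Because the three middle digits form a single part ([0-9]{3}), a progressive matcher would report progress only up to the start of that part, which is coarser than listing each digit separately.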
