looping through a string and checking if certain conditions are met

I need to write a piece of code which accepts an input string parameter and determines whether exactly 3 question marks exist between every pair of numbers that add up to 10. If so, return true; otherwise return false. Some example test cases are below:
"arrb6???4xxbl5???eee5" => true
"acc?7??sss?3rr1??????5" => true
"5??aaaaaaaaaaaaaaaaaaa?5?5" => false
"9???1???9???1???9" => true
"aa6?9" => false
I already tried to implement it in Java as below, but the result is not as expected:
public static String QuestionsMarks(String str) {
str = str.replaceAll("[a-z]+","");
Pattern pattern = Pattern.compile("([0-9])([?])([?])([0-9])");
Pattern pattern01 = Pattern.compile("([0-9])([?])([?])([0-9])");
Matcher matcher01 = pattern01.matcher(str);
Pattern pattern02 = Pattern.compile("([0-9])([0-9])");
Matcher matcher02 = pattern02.matcher(str);
Matcher matcher = pattern.matcher(str);
if (matcher01.find() || matcher02.find()) {
return "false";
} else if (matcher.find()) {
return "true";
}
return "false";
}

The regular expression to detect overlapping numbers separated with a number of question marks is as follows:
(?=((\d)([^?\d]*\?[^?\d]*)*(\d)))
It uses a positive lookahead (?=) to handle overlapping numbers between adjacent matches of this pattern:
(\d)([^?\d]*\?[^?\d]*)*(\d): two digits separated by any number of question marks with other optional characters.
So, the regular expression yields a stream of matches of the form: Group 1 = the whole span (both digits and everything between them), Group 2 = first digit, Group 4 = second digit. Each match is then validated against the rules: the two digits must add up to 10, and the span must contain exactly 3 question marks.
Example implementation using Stream API:
private static final Pattern DIGS = Pattern.compile("(?=((\\d)([^?\\d]*\\?[^?\\d]*)*(\\d)))");
public static boolean hasTotal10Around3QMarks(String str) {
Supplier<Stream<MatchResult>> ss = () -> DIGS
.matcher(str)
.results()
.filter(r -> r.groupCount() == 4
&& 10 == Integer.parseInt(r.group(2)) + Integer.parseInt(r.group(4))
);
return ss.get().findAny().isPresent() &&
ss.get().allMatch(r -> 3 == r.group(1).chars()
.filter(c -> c == '?')
.count()
);
}
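A stream cannot be consumed twice, so Supplier<Stream<MatchResult>> is used to create a fresh filtered stream for each terminal operation. The findAny() check handles inputs with no pair of digits adding up to 10, such as the test case "aa6?9": allMatch on an empty stream returns true, so without that check such inputs would wrongly yield true.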
Tests:
String[] tests = {
"arrb6???4xxbl5???eee5",
"acc?7??sss?3rr1??????5",
"5??aaaaaaaaaaaaaaaaaaa?5?5",
"9???1???9???1???9",
"aa6?9",
""
};
Arrays.stream(tests)
.forEach(t -> System.out.printf("'%s' => %s%n", t, hasTotal10Around3QMarks(t)));
Output:
'arrb6???4xxbl5???eee5' => true
'acc?7??sss?3rr1??????5' => true
'5??aaaaaaaaaaaaaaaaaaa?5?5' => false
'9???1???9???1???9' => true
'aa6?9' => false
'' => false
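For comparison, the same rules can be sketched without regex as a single pass over the string. This is only an illustration: the class and method names are made up, and it assumes that a "pair" means two neighbouring digits with no other digit between them. Unlike the regex above, it also treats digits separated only by letters (for example "5a5") as a pair, which then fails for having zero question marks.

```java
class QuestionMarks {
    // Hypothetical helper: true if at least one neighbouring digit pair sums to 10
    // and every such pair has exactly 3 question marks between its digits.
    static boolean check(String str) {
        int prevDigit = -1;    // value of the last digit seen, -1 if none yet
        int qMarks = 0;        // question marks seen since that digit
        boolean found = false; // whether any pair summing to 10 was seen
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == '?') {
                qMarks++;
            } else if (c >= '0' && c <= '9') {
                int d = c - '0';
                if (prevDigit >= 0 && prevDigit + d == 10) {
                    if (qMarks != 3) {
                        return false;
                    }
                    found = true;
                }
                prevDigit = d;
                qMarks = 0;
            }
        }
        return found;
    }
}
```

Calling QuestionMarks.check on the five test strings from the question gives true, true, false, true, false.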

Related

Find dash "-" that's not inside round brackets "()" within String

I'm trying to find/determine if a String contains the character "-" that is not enclosed in round brackets "()".
I've tried the regex
[^\(]*-[^\)]*,
but it's not working.
Examples:
100 - 200 mg -> should match because the "-" is not enclosed in round brackets.
100 (+/-) units -> should NOT match

Do you have to use regex? You could try just iterating over the string and keeping track of the scope like so:
public boolean HasScopedDash(String str)
{
int scope = 0;
boolean foundInScope = false;
for (int i = 0; i < str.length(); i++)
{
char c = str.charAt(i);
if (c == '(')
scope++;
else if (c == '-')
foundInScope = scope != 0;
else if (c == ')' && scope > 0)
{
if (foundInScope)
return true;
scope--;
}
}
return false;
}
Edit: As mentioned in the comments, it might be desirable to exclude cases where the dash comes after an opening parenthesis but no closing parenthesis ever follows (e.g. "abc(2-xyz"). The edited code above accounts for this.
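Note that HasScopedDash reports a dash that is enclosed in parentheses, while the question asks for a dash that is not. A minimal sketch of the opposite check with the same depth-counting idea follows. The names are hypothetical, and it assumes that everything after an unclosed "(" counts as enclosed.

```java
class DashOutsideParens {
    // Hypothetical helper: true if some '-' appears at parenthesis depth 0.
    static boolean hasDashOutsideParens(String s) {
        int depth = 0; // number of currently open '(' characters
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            } else if (c == '-' && depth == 0) {
                return true;
            }
        }
        return false;
    }
}
```

With the question's examples, "100 - 200 mg" yields true and "100 (+/-) units" yields false.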

You might not want to check for the brackets at all to make this pass. Instead, you could simply check other boundaries. This expression, for instance, requires a digit and whitespace on both sides of the dash, and it is much easier to modify:
([0-9]\s+[-]\s+[0-9])
It passes your first input and fails the undesired input. You could simply add other chars to its middle char list using logical ORs.

Java supports quantified atomic groups, so this works.
It consumes paired parentheses and their contents, without giving anything back, until it finds a dash -.
This is done via the atomic group construct (?> ).
^(?>(?>\(.*?\))|[^-])*?-
https://www.regexplanet.com/share/index.html?share=yyyyd8n1dar
(click on the Java button, check the find() function column)
Readable
^
(?>
(?> \( .*? \) )
|
[^-]
)*?
-
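A small sketch of how the atomic-group pattern above might be used from Java. The pattern string is copied from the answer (the doubled backslashes are only Java string escaping); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class AtomicDash {
    // Same pattern as above: skip whole "(...)" groups atomically, then look for '-'.
    static final Pattern DASH_OUTSIDE =
            Pattern.compile("^(?>(?>\\(.*?\\))|[^-])*?-");

    // Hypothetical helper: true if find() locates a dash outside parentheses.
    static boolean hasDashOutsideParens(String s) {
        return DASH_OUTSIDE.matcher(s).find();
    }
}
```

For the two inputs in the question, hasDashOutsideParens returns true for "100 - 200 mg" and false for "100 (+/-) units".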

If you don't mind checking the string with two simple regexes instead of one complicated regex, you can try this instead:
public static boolean match(String input) {
Pattern p1 = Pattern.compile("\\-"); // match dash
Pattern p2 = Pattern.compile("\\(.*\\-.*\\)"); // match dash within bracket
Matcher m1 = p1.matcher(input);
Matcher m2 = p2.matcher(input);
if ( m1.find() && !m2.find() ) {
return true;
} else {
return false;
}
}
Test the string
public static void main(String[] args) {
String input1 = "100 - 200 mg";
String input2 = "100 (+/-) units";
System.out.println(input1 + " : " + ( match(input1) ? "match" : "not match") );
System.out.println(input2 + " : " + ( match(input2) ? "match" : "not match") );
}
The output will be
100 - 200 mg : match
100 (+/-) units : not match

Matcher m = Pattern.compile("\\([^()-]*-[^()]*\\)").matcher(s); return !m.find();
https://ideone.com/YXvuem

Salesforce - Apex Class/Trigger to not allow 3 consecutive characters [duplicate]

I need regular expressions to match the below cases.
3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.

I don't think you can (easily) use regex for the first case. The second case is easy though:
Pattern pattern = Pattern.compile("([a-z\\d])\\1\\1", Pattern.CASE_INSENSITIVE);
Since \\1 represents part matched by group 1 this will match any sequence of three identical characters that are either within the range a-z or are digits (\d).
Update
To be clear, you can use regex for the first case. However, the pattern is so laborious and convoluted that you are better off not doing it at all, especially if you want to really cover the whole alphabet. In that case you should probably generate the pattern programmatically, by iterating over the char codes of the Unicode charset (or similar) and generating a grouping for every three consecutive characters. However, you should realize that with such a large decision tree generated for the pattern matcher, matching performance is bound to suffer (roughly O(n), where n is the number of groups, i.e. the size of the Unicode charset minus 2).

I disagree, case 1 is possible to regex, but you have to tell it the sequences to match... which is kind of long and boring:
/(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+/ig
http://regexr.com/3dqln

for the second question:
\\b([a-zA-Z0-9])\\1\\1+\\b
explanation:
\\b : zero-length word boundary
( : start capture group 1
[a-zA-Z0-9] : a letter or a digit
) : end group
\\1 : same character as group 1
\\1+ : same character as group 1 one or more times
\\b : zero-length word boundary

To my knowledge, the first case is indeed not possible. The regex engine doesn't know anything about the order of the natural numbers or the alphabet. But it's at least possible to differentiate between 3 or more numbers and 3 or more letters, for example:
[a-z]{3,}|[A-Z]{3,}|\d{3,}
This matches abcd, ABCDE or 123 but doesn't match ab2d, A5c4 or 12z, for example. According to this, the second case can be correctly given in a shorter version as:
(\w)\1{2,}

3 or more consecutive sequential characters/numbers ex - 123, abc, 789, pqr etc.
Not possible with regular expressions.
3 or more consecutive identical characters/numbers ex - 111, aaa, bbb. 222 etc.
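Use a pattern of (?i)(?:([a-z0-9])\\1{2,})*.
If you want to check the whole string, use Matcher.matches(). To find matches within a string, use Matcher.find(); note that because of the trailing *, the pattern above also matches the empty string, so for find() drop the outer group and the * and use (?i)([a-z0-9])\\1{2,} instead.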
Here's some sample code:
final String ps = "(?i)(?:([a-z0-9])\\1{2,})*";
final String psLong =
"(?i)\t\t\t# Case insensitive flag\n"
+ "(?:\t\t\t\t# Begin non-capturing group\n"
+ " (\t\t\t\t# Begin capturing group\n"
+ " [a-z0-9]\t\t# Match an alpha or digit character\n"
+ " )\t\t\t\t# End capturing group\n"
+ " \\1\t\t\t\t# Back-reference first capturing group\n"
+ " {2,}\t\t\t# Match previous atom 2 or more times\n"
+ ")\t\t\t\t# End non-capturing group\n"
+ "*\t\t\t\t# Match previous atom zero or more characters\n";
System.out.println("***** PATTERN *****\n" + ps + "\n" + psLong
+ "\n");
final Pattern p = Pattern.compile(ps);
for (final String s : new String[] {"aa", "11", "aaa", "111",
"aaaaaaaaa", "111111111", "aaa111bbb222ccc333",
"aaaaaa111111bbb222"})
{
final Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("Success: " + s);
} else {
System.out.println("Fail: " + s);
}
}
And the output is:
***** PATTERN *****
(?i)(?:([a-z0-9])\1{2,})*
(?i) # Case insensitive flag
(?: # Begin non-capturing group
( # Begin capturing group
[a-z0-9] # Match an alpha or digit character
) # End capturing group
\1 # Back-reference first capturing group
{2,} # Match previous atom 2 or more times
) # End non-capturing group
* # Match previous atom zero or more characters
Fail: aa
Fail: 11
Success: aaa
Success: 111
Success: aaaaaaaaa
Success: 111111111
Success: aaa111bbb222ccc333
Success: aaaaaa111111bbb222
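To search for a run inside a longer string rather than validate the whole string, a sketch along these lines may help. It assumes the outer group and trailing * are removed so that an empty match cannot succeed; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class IdenticalRun {
    // One letter or digit followed by at least two more copies of itself.
    static final Pattern RUN =
            Pattern.compile("([a-z0-9])\\1{2,}", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the input contains 3 or more identical characters in a row.
    static boolean hasIdenticalRun(String s) {
        return RUN.matcher(s).find();
    }
}
```

Here hasIdenticalRun("xx111yy") is true, while hasIdenticalRun("aab") and hasIdenticalRun("") are false.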

A regex to match three identical consecutive digits or letters is
"([0-9a-zA-Z])\1\1"

Thanks, all, for helping me.
For the first case (3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.) I used the logic below. Please share your comments on this.
public static boolean validateConsecutiveSeq(String epin) {
char epinCharArray[] = epin.toCharArray();
int asciiCode = 0;
boolean isConSeq = false;
int previousAsciiCode = 0;
int numSeqcount = 0;
for (int i = 0; i < epinCharArray.length; i++) {
asciiCode = epinCharArray[i];
if ((previousAsciiCode + 1) == asciiCode) {
numSeqcount++;
if (numSeqcount >= 2) {
isConSeq = true;
break;
}
} else {
numSeqcount = 0;
}
previousAsciiCode = asciiCode;
}
return isConSeq;
}
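One thing to watch in the code above is that it compares raw character codes, so a run such as "9:;" or "YZ[" also counts as sequential. A possible variant, offered only as a sketch with hypothetical names, restricts runs to digits, lowercase letters, or uppercase letters, and takes the minimum run length as a parameter (assumed to be at least 2).

```java
class SequentialRun {
    // a and b are neighbours with b == a + 1, so checking a >= lo and b <= hi
    // is enough to know both lie inside the range [lo, hi].
    private static boolean bothIn(char a, char b, char lo, char hi) {
        return a >= lo && b <= hi;
    }

    // Hypothetical helper: true if s contains minLen or more ascending sequential
    // characters, all digits, all lowercase, or all uppercase. Assumes minLen >= 2.
    static boolean hasSequentialRun(String s, int minLen) {
        int run = 1; // length of the current ascending run
        for (int i = 1; i < s.length(); i++) {
            char prev = s.charAt(i - 1);
            char cur = s.charAt(i);
            boolean step = cur == prev + 1
                    && (bothIn(prev, cur, '0', '9')
                        || bothIn(prev, cur, 'a', 'z')
                        || bothIn(prev, cur, 'A', 'Z'));
            run = step ? run + 1 : 1;
            if (run >= minLen) {
                return true;
            }
        }
        return false;
    }
}
```

For example, hasSequentialRun("xx123", 3) and hasSequentialRun("pqr", 3) are true, while hasSequentialRun("ab1", 3) and hasSequentialRun("9:;", 3) are false.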

If you have a lower bound (3) and an upper bound, the regex string can be generated as follows:
public class RegexBuilder {
public static void main(String[] args) {
StringBuilder sb = new StringBuilder();
int seqStart = 3;
int seqEnd = 5;
buildRegex(sb, seqStart, seqEnd);
System.out.println(sb);
}
private static void buildRegex(StringBuilder sb, int seqStart, int seqEnd) {
for (int i = seqStart; i <= seqEnd; i++) {
buildRegexCharGroup(sb, i, '0', '9');
buildRegexCharGroup(sb, i, 'A', 'Z');
buildRegexCharGroup(sb, i, 'a', 'z');
buildRegexRepeatedString(sb, i);
}
}
private static void buildRegexCharGroup(StringBuilder sb, int seqLength,
char start, char end) {
for (char c = start; c <= end - seqLength + 1; c++) {
char ch = c;
if (sb.length() > 0) {
sb.append('|');
}
for (int i = 0; i < seqLength; i++) {
sb.append(ch++);
}
}
}
private static void buildRegexRepeatedString(StringBuilder sb, int seqLength) {
sb.append('|');
sb.append("([a-zA-Z\\d])");
for (int i = 1; i < seqLength; i++) {
sb.append("\\1");
}
}
}
Output
012|123|234|345|456|567|678|789|ABC|BCD|CDE|DEF|EFG|FGH|GHI|HIJ|IJK|JKL|KLM|LMN|MNO|NOP|OPQ|PQR|QRS|RST|STU|TUV|UVW|VWX|WXY|XYZ|abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|([a-z\d])\1\1|0123|1234|2345|3456|4567|5678|6789|ABCD|BCDE|CDEF|DEFG|EFGH|FGHI|GHIJ|HIJK|IJKL|JKLM|KLMN|LMNO|MNOP|NOPQ|OPQR|PQRS|QRST|RSTU|STUV|TUVW|UVWX|VWXY|WXYZ|abcd|bcde|cdef|defg|efgh|fghi|ghij|hijk|ijkl|jklm|klmn|lmno|mnop|nopq|opqr|pqrs|qrst|rstu|stuv|tuvw|uvwx|vwxy|wxyz|([a-z\d])\1\1\1|01234|12345|23456|34567|45678|56789|ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP|MNOPQ|NOPQR|OPQRS|PQRST|QRSTU|RSTUV|STUVW|TUVWX|UVWXY|VWXYZ|abcde|bcdef|cdefg|defgh|efghi|fghij|ghijk|hijkl|ijklm|jklmn|klmno|lmnop|mnopq|nopqr|opqrs|pqrst|qrstu|rstuv|stuvw|tuvwx|uvwxy|vwxyz|([a-z\d])\1\1\1\1

All put together:
([a-zA-Z0-9])\1\1+|(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+
3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.
([a-zA-Z0-9])\1\1+
https://regexr.com/4727n
This also works:
(?:(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9)){2,}\d|(?:a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|v(?=w)|w(?=x)|x(?=y)|y(?=z)){2,}[[:alpha:]])|([a-zA-Z0-9])\1\1+
https://regex101.com/r/6fXC9u/1

For the first question, this works if you're OK with less regex:
function containsConsecutiveCharacters(str) {
for (let i = 0; i <= str.length - 3; i++) {
const allThree = str[i] + str[i + 1] + str[i + 2];
const s1 = str.charCodeAt(i);
const s2 = str.charCodeAt(i + 1);
const s3 = str.charCodeAt(i + 2);
// All three must be letters or digits, with char codes rising by exactly 1 at each step.
if (/^[a-zA-Z0-9]+$/.test(allThree) && s2 - s1 === 1 && s3 - s2 === 1) {
return true;
}
}
return false;
}

3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
(?:(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9)){2,}\d|(?:a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|v(?=w)|w(?=x)|x(?=y)|y(?=z)){2,}[\p{Alpha}])
https://regex101.com/r/5IragF/1
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.
([\p{Alnum}])\1{2,}
https://regex101.com/r/VEHoI9/1

For case #2 I got inspired by a sample on regextester and created the following regex to match n identical digits (to check for both numbers and letters replace 0-9 with A-Za-z0-9):
const n = 3
const identicalAlphanumericRegEx = new RegExp("([0-9])" + "\\1".repeat(n - 1))

I was discussing this with a coworker and we think we have a good solution for #1.
To check for abc or bcd or ... or 012 or 123 or even any number of sequential characters, try:
.*(?:(?:a(?=b))|(?:b(?=c))|(?:c(?=d))|(?:d(?=e))|(?:e(?=f))|(?:f(?=g))|(?:g(?=h))|(?:h(?=i))|(?:i(?=j))|(?:j(?=k))|(?:k(?=l))|(?:l(?=m))|(?:m(?=n))|(?:n(?=o))|(?:o(?=p))|(?:p(?=q))|(?:q(?=r))|(?:r(?=s))|(?:s(?=t))|(?:t(?=u))|(?:u(?=v))|(?:v(?=w))|(?:w(?=x))|(?:x(?=y))|(?:y(?=z))|(?:0(?=1))|(?:1(?=2))|(?:2(?=3))|(?:3(?=4))|(?:4(?=5))|(?:5(?=6))|(?:6(?=7))|(?:7(?=8))|(?:8(?=9))){2,}.*
The nice thing about this solution is that if you want more than 3 consecutive characters, you only need to increase the {2,} to one less than the length you want to check for.
The ?: in each group prevents the group from being captured.
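Since the question is about Java, here is a sketch of how such a look-ahead pattern could be generated rather than typed by hand. The class and method names are invented for illustration, the uppercase range is an added assumption, and the surrounding .* is left out because the sketch uses find() instead of matches().

```java
import java.util.regex.Pattern;

class SequentialRegex {
    // Appends alternatives of the form c(?=d) for each neighbouring pair in [from, to].
    private static void appendRange(StringBuilder sb, char from, char to) {
        for (char c = from; c < to; c++) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(c).append("(?=").append((char) (c + 1)).append(')');
        }
    }

    // Hypothetical builder: pattern that finds runLength or more sequential characters.
    static Pattern build(int runLength) {
        StringBuilder alts = new StringBuilder();
        appendRange(alts, 'a', 'z');
        appendRange(alts, 'A', 'Z');
        appendRange(alts, '0', '9');
        // Each alternative consumes one character and peeks at its successor,
        // so runLength characters need runLength - 1 repetitions.
        return Pattern.compile("(?:" + alts + "){" + (runLength - 1) + ",}");
    }
}
```

With build(3), find() succeeds on "xabcx" and "789" but not on "ab1"; with build(4), it succeeds on "abcd" but not on "abc".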

Try this for the first question.
It returns true if it finds 3 characters with consecutive character codes, ascending or descending, in the argument.
function check(val){
for (let i = 0; i <= val.length - 3; i++) {
var s1 = val.charCodeAt(i);
var s2 = val.charCodeAt(i + 1);
var s3 = val.charCodeAt(i + 2);
if (Math.abs(s1 - s2) === 1 && s1 - s2 === s2 - s3) {
return true;
}
}
return false;
}
console.log(check('Sh1ak#ki1r#100'));

Pattern matcher issue with special character in java

I want to validate a string against a pattern. The code below works only when the pattern contains no special characters.
For example, this fails:
Pattern p = Pattern.compile("Dear User, .* is your One Time Password(OTP) for registration.",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("Dear User, 999 is your One Time Password(OTP) for registration.");
if (m.matches()){
System.out.println("truee");
}else{
System.out.println("false");
} // output false
and the code below works fine if I remove ( and ):
Pattern p = Pattern.compile("Dear User, .* is your One Time Password OTP for registration.",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("Dear User, 999 is your One Time Password OTP for registration.");
if (m.matches()){
System.out.println("truee");
}else{
System.out.println("false");
} // output true

Try this:
Pattern p = Pattern.compile("Dear User, .* is your One Time Password\\(OTP\\) for registration\\.",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("Dear User, 999 is your One Time Password(OTP) for registration.");
if (m.matches()){
System.out.println("truee");
}else{
System.out.println("false");
}
(, ) and . have special meaning in regular expressions. You need to escape them if you want them to be matched literally.

In regex, you must always beware of special characters when you need to match them literally.
In this case, you have 3 characters: ( (used to open a group), ) (used to close a group) and . (matches any character).
To match them literally, you need to escape them, or place into a character class [...].
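Another way to avoid escaping each character by hand is Pattern.quote, which turns a whole string into a literal. A minimal sketch follows, assuming the message is split into a fixed prefix, the variable .* part, and a fixed suffix; the class and method names are made up.

```java
import java.util.regex.Pattern;

class OtpTemplate {
    // Literal parts are quoted, so ( ) and . lose their special meaning;
    // only the ".*" in the middle is treated as regex syntax.
    static final Pattern OTP = Pattern.compile(
            Pattern.quote("Dear User, ")
                    + ".*"
                    + Pattern.quote(" is your One Time Password(OTP) for registration."),
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the whole message fits the template.
    static boolean matches(String message) {
        return OTP.matcher(message).matches();
    }
}
```

With this, the message from the question matches, while a message ending in "registration!" instead of "registration." does not.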

package com.ramesh.test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PatternMatcher {
public boolean containsPattern(String input, String pattern) {
if (input == null || input.trim().isEmpty()) {
System.out.println("Incorrect format of string");
return false;
}
// "*" in the caller's pattern is replaced by ".*" in replaceAll()
// ".*" matches any sequence of characters, including the empty string
String inputpattern = pattern.replaceAll("\\*", ".*");
System.out.println("first input" + inputpattern);
Pattern p = Pattern.compile(inputpattern);
Matcher m = p.matcher(input);
boolean b = m.matches();
return b;
}
public boolean containsPatternNot(String input1, String pattern1) {
return (containsPattern(input1, pattern1) == true ? false : true);
}
public static void main(String[] args) {
PatternMatcher m1 = new PatternMatcher ();
boolean a = m1.containsPattern("ma5&*%u&^()k5.r5gh^", "m*u*r*");
System.out.println(a);// returns True
boolean d = m1.containsPattern("mur", "m*u*r*");
System.out.println(d);// returns True
boolean c = m1.containsPatternNot("ma5&^%u&^()k56r5gh^", "m*u*r*");
System.out.println(c);// returns false
boolean e = m1.containsPatternNot("mur", "m*u*r*");
System.out.println(e);// returns false
}
}
Output: true
true
false
false

Remove the intersections of multiple regular expressions?

Pattern[] a =new Pattern[2];
a[0] = Pattern.compile("[$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d");
a[1] = Pattern.compile("Rs[.]?\\s*[\\d,]*[.]?\\d*\\d");
Ex: in Rs.150, the whole string is detected by a[1], and 150 is also detected by a[0].
How can I remove such intersections so that it is detected only by a[1] and not by a[0]?

You can use the | operator inside your regular expression. Then call the method Matcher#group(int) to see which pattern matched your input. This method returns null if the given group did not take part in the match.
Sample code
public static void main(String[] args) {
// Build regexp
final String MONEY_REGEX = "[$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d";
final String RS_REGEX = "Rs[.]?\\s*[\\d,]*[.]?\\d*\\d";
// Separate them with '|' operator and wrap them in two distinct matching groups
final String MONEY_OR_RS = String.format("(%s)|(%s)", MONEY_REGEX, RS_REGEX);
// Prepare some sample inputs
String[] inputs = new String[] { "$100", "Rs.150", "foo" };
Pattern p = Pattern.compile(MONEY_OR_RS);
// Test each inputs
Matcher m = null;
for (String input : inputs) {
if (m == null) {
m = p.matcher(input);
} else {
m.reset(input);
}
if (m.matches()) {
System.out.println(String.format("m.group(1) => %s\nm.group(2) => %s\n", m.group(1), m.group(2)));
} else {
System.out.println(input + " doesn't match regexp.");
}
}
}
Output
m.group(1) => $100
m.group(2) => null
m.group(1) => null
m.group(2) => Rs.150
foo doesn't match regexp.

Use an initial test to switch between expressions. How fast and/or smart this initial test is is up to you.
In this case you could do something like:
if (input.startsWith("Rs.") && a[1].matcher(input).matches()) {
return true;
}
and put it in front of your method that does the testing.
Simply putting the most common regular expressions in front of the array may help as well of course.

Description
Use a negative look-ahead to match the a[1] Rs.150 format while at the same time preventing the a[0] 150 format.
Generic expression: (?! the a[0] regex goes here ) followed by the a[1] expression
Basic regex statement: (?![$£€]?\s*\d*[\.]?[pP]?\d*\d)Rs[.]?\s*[\d,]*[.]?\d*\d
escaped for java: (?![$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d)Rs[.]?\\s*[\\d,]*[.]?\\d*\\d
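Keep in mind that the look-ahead only constrains where the a[1] expression may start; if a[0] is still run on its own over the same text, it will keep finding the 150 inside Rs.150. One possible alternative, sketched here under the assumption that both patterns are applied with find() to a longer text, is to record the spans matched by a[1] first and discard any a[0] match that overlaps one of them. The class and method names are hypothetical, and £ and € are written as Unicode escapes only to keep the source ASCII-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonOverlapping {
    // a[0] and a[1] from the question (\u00A3 is the pound sign, \u20AC the euro sign).
    static final Pattern MONEY =
            Pattern.compile("[$\u00A3\u20AC]?\\s*\\d*[\\.]?[pP]?\\d*\\d");
    static final Pattern RS =
            Pattern.compile("Rs[.]?\\s*[\\d,]*[.]?\\d*\\d");

    // Hypothetical helper: MONEY matches that do not overlap any RS match.
    static List<String> moneyOutsideRs(String text) {
        List<int[]> rsSpans = new ArrayList<>();
        Matcher rs = RS.matcher(text);
        while (rs.find()) {
            rsSpans.add(new int[] {rs.start(), rs.end()});
        }
        List<String> result = new ArrayList<>();
        Matcher money = MONEY.matcher(text);
        while (money.find()) {
            boolean overlaps = false;
            for (int[] span : rsSpans) {
                // Half-open intervals [start, end) overlap when each starts before the other ends.
                if (money.start() < span[1] && money.end() > span[0]) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                result.add(money.group());
            }
        }
        return result;
    }
}
```

For the text "Rs.150 and $20", this returns only "$20".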

Regex look-behind without obvious maximum length in Java

I always thought that a look-behind assertion in Java's regex-API (and many other languages for that matter) must have an obvious length. So, STAR and PLUS quantifiers are not allowed inside look-behinds.
The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:
"[...] Java takes things a step further by
allowing finite repetition. You still
cannot use the star or plus, but you
can use the question mark and the
curly braces with the max parameter
specified. Java recognizes the fact
that finite repetition can be
rewritten as an alternation of strings
with different, but fixed lengths.
Unfortunately, the JDK 1.4 and 1.5
have some bugs when you use
alternation inside lookbehind. These
were fixed in JDK 1.6. [...]"
-- http://www.regular-expressions.info/lookaround.html
Using the curly brackets works as long as the maximum total length of the characters inside the look-behind is less than or equal to Integer.MAX_VALUE. So these regexes are valid:
"(?<=a{0," +(Integer.MAX_VALUE) + "})B"
"(?<=Ca{0," +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"
But these aren't:
"(?<=Ca{0," +(Integer.MAX_VALUE) +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"
However, I don't understand the following:
When I run a test using the * and + quantifier inside a look-behind, all goes well (see output Test 1 and Test 2).
But, when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see output Test 3).
Making the greedy * from Test 3 reluctant has no effect, it still breaks (see Test 4).
Here's the test harness:
public class Main {
private static String testFind(String regex, String input) {
try {
boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
return "testFind : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
} catch(Exception e) {
return "testFind : Invalid -> "+regex+", "+e.getMessage();
}
}
private static String testReplaceAll(String regex, String input) {
try {
String returned = input.replaceAll(regex, "FOO");
return "testReplaceAll : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
} catch(Exception e) {
return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
}
}
private static String testSplit(String regex, String input) {
try {
String[] returned = input.split(regex);
return "testSplit : Valid -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
} catch(Exception e) {
return "testSplit : Invalid -> "+regex+", "+e.getMessage();
}
}
public static void main(String[] args) {
String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
String input = "CaaaaaaaaaaaaaaaBaaaa";
int test = 0;
for(String regex : regexes) {
test++;
System.out.println("********************** Test "+test+" **********************");
System.out.println(" "+testFind(regex, input));
System.out.println(" "+testReplaceAll(regex, input));
System.out.println(" "+testSplit(regex, input));
System.out.println();
}
}
}
The output:
********************** Test 1 **********************
testFind : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
testReplaceAll : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
testSplit : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]
********************** Test 2 **********************
testFind : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
testReplaceAll : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
testSplit : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]
********************** Test 3 **********************
testFind : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
testSplit : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
********************** Test 4 **********************
testFind : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
testSplit : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
My question may be obvious, but I'll still ask it: Can anyone explain to me why Tests 1 and 2 are accepted, while Tests 3 and 4 are rejected? I would have expected all four to be rejected, not half of them to work and half of them to fail.
Thanks.
PS. I'm using: Java version 1.6.0_14

Glancing at the source code for Pattern.java reveals that the '*' and '+' are implemented as instances of Curly (which is the object created for curly operators). So,
a*
is implemented as
a{0,0x7FFFFFFF}
and
a+
is implemented as
a{1,0x7FFFFFFF}
which is why you see exactly the same behaviors for curlies and stars.

It's a bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6695369
Pattern.compile() is always supposed to throw an exception if it can't determine the maximum possible length of a lookbehind match.
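Until the behaviour is consistent, one way to stay on documented ground is to give the look-behind an explicit finite bound instead of * or +. A minimal sketch, where the class name is made up and the bound of 20 is an arbitrary assumption about the longest run of 'a' that needs to be supported:

```java
class BoundedLookBehind {
    // {0,20} gives the look-behind an obvious maximum length of 21 characters.
    static final String REGEX = "(?<=Ca{0,20})B";

    // Hypothetical helper: replaces every B that is preceded by C and up to 20 'a's.
    static String replace(String input) {
        return input.replaceAll(REGEX, "FOO");
    }
}
```

Applied to the input from the test harness, this replaces the B with FOO; a B that is not preceded by "C" plus at most 20 'a' characters is left alone.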
