Java - Regex pattern matching [duplicate] - java

I have an issue with the following example:
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(args[0]);
        Matcher m = p.matcher(args[1]);
        boolean b = false;
        while(b = m.find()) {
            System.out.print(m.start() + m.group());
        }
    }
}
And the command line:
java Regex2 "\d*" ab34ef
Can someone explain why the result is 01234456?
The regex pattern is \d*, which I thought means one or more digits, but the output contains more positions than there are characters in args[1].
Thanks

\d* matches 0 or more digits. So it also matches the empty string at every position where no digits follow, including the position after the last character. It tries index 0 first, then index 1, and so on.
So, for the string ab34ef, it finds the following matches:
Index  Group
0      ""  (before a)
1      ""  (before b)
2      34  (matches more than 0 digits this time)
4      ""  (before e at index 4)
5      ""  (before f)
6      ""  (at the end, after f)
If you use \\d+, then you will get just a single group at 34.
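The table above can be reproduced with a minimal sketch like the following. The class name ZeroLengthDemo and the helper findAll are illustrative names, not part of the original question; each match is recorded as "start:text", similar to what m.start() + m.group() prints.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthDemo {
    // Collects every match as "start:text" (hypothetical helper for illustration).
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \d* also matches the empty string wherever no digit follows.
        System.out.println(findAll("\\d*", "ab34ef")); // [0:, 1:, 2:34, 4:, 5:, 6:]
        // \d+ requires at least one digit, so only "34" is reported.
        System.out.println(findAll("\\d+", "ab34ef")); // [2:34]
    }
}
```

Concatenating the \d* entries without separators gives 0, 1, 234, 4, 5, 6, which is the 01234456 from the question.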


Salesforce - Apex Class/Trigger to not allow 3 consecutive characters [duplicate]

I need regular expressions to match the below cases.
3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.
I don't think you can (easily) use regex for the first case. The second case is easy though:
Pattern pattern = Pattern.compile("([a-z\\d])\\1\\1", Pattern.CASE_INSENSITIVE);
Since \\1 refers to the text matched by group 1, this matches any sequence of three identical characters that are either letters in the range a-z or digits (\d).
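A minimal sketch of how that pattern might be used with Matcher.find(). The class name TripleRepeatDemo and the method hasTripleRepeat are made up for this example. Because of Pattern.CASE_INSENSITIVE, the back-reference is also compared case-insensitively, so a run such as AaA counts as three identical characters.

```java
import java.util.regex.Pattern;

class TripleRepeatDemo {
    // One letter or digit, followed by the same character twice more.
    private static final Pattern TRIPLE =
            Pattern.compile("([a-z\\d])\\1\\1", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the input contains three identical
    // letters or digits in a row anywhere.
    static boolean hasTripleRepeat(String s) {
        return TRIPLE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasTripleRepeat("baaad"));  // true
        System.out.println(hasTripleRepeat("112223")); // true ("222")
        System.out.println(hasTripleRepeat("aab"));    // false
        System.out.println(hasTripleRepeat("a-a-a"));  // false, not adjacent
    }
}
```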
Update
To be clear, you can use regex for the first case. However, the pattern is so laborious and convoluted that you are better off not doing it at all, especially if you really want to cover the whole alphabet. In that case you should probably generate the pattern programmatically, by iterating over the char codes of the Unicode charset or something like that and generating a grouping for every three consecutive characters. However, you should realize that with such a large decision tree generated for the pattern matcher, the matching performance is bound to suffer (O(n), where n is the number of groups, which is the size of the Unicode charset minus 2).
I disagree, case 1 is possible to regex, but you have to tell it the sequences to match... which is kind of long and boring:
/(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+/ig
http://regexr.com/3dqln
For the second question:
\\b([a-zA-Z0-9])\\1\\1+\\b
Explanation:
\\b          : zero-length word boundary
(            : start capture group 1
[a-zA-Z0-9]  : a letter or a digit
)            : end group
\\1          : same character as group 1
\\1+         : same character as group 1, one or more times
\\b          : zero-length word boundary
To my knowledge, the first case is indeed not possible. The regex engine doesn't know anything about the order of the natural numbers or the alphabet. But it's at least possible to differentiate between 3 or more numbers and 3 or more letters, for example:
[a-z]{3,}|[A-Z]{3,}|\d{3,}
This matches abcd, ABCDE or 123 but doesn't match ab2d, A5c4 or 12z, for example. According to this, the second case can be correctly given in a shorter version as:
(\w)\1{2,}
3 or more consecutive sequential characters/numbers ex - 123, abc, 789, pqr etc.
Not possible with regular expressions.
3 or more consecutive identical characters/numbers ex - 111, aaa, bbb. 222 etc.
Use a pattern of (?i)(?:([a-z0-9])\\1{2,})*.
If you want to check the whole string, use Matcher.matches(). To find matches within a string, use Matcher.find().
Here's some sample code:
final String ps = "(?i)(?:([a-z0-9])\\1{2,})*";
final String psLong =
        "(?i)\t\t\t# Case insensitive flag\n"
        + "(?:\t\t\t\t# Begin non-capturing group\n"
        + " (\t\t\t\t# Begin capturing group\n"
        + " [a-z0-9]\t\t# Match an alpha or digit character\n"
        + " )\t\t\t\t# End capturing group\n"
        + " \\1\t\t\t\t# Back-reference first capturing group\n"
        + " {2,}\t\t\t# Match previous atom 2 or more times\n"
        + ")\t\t\t\t# End non-capturing group\n"
        + "*\t\t\t\t# Match previous atom zero or more characters\n";
System.out.println("***** PATTERN *****\n" + ps + "\n" + psLong
        + "\n");
final Pattern p = Pattern.compile(ps);
for (final String s : new String[] {"aa", "11", "aaa", "111",
        "aaaaaaaaa", "111111111", "aaa111bbb222ccc333",
        "aaaaaa111111bbb222"})
{
    final Matcher m = p.matcher(s);
    if (m.matches()) {
        System.out.println("Success: " + s);
    } else {
        System.out.println("Fail: " + s);
    }
}
And the output is:
***** PATTERN *****
(?i)(?:([a-z0-9])\1{2,})*
(?i) # Case insensitive flag
(?: # Begin non-capturing group
( # Begin capturing group
[a-z0-9] # Match an alpha or digit character
) # End capturing group
\1 # Back-reference first capturing group
{2,} # Match previous atom 2 or more times
) # End non-capturing group
* # Match previous atom zero or more characters
Fail: aa
Fail: 11
Success: aaa
Success: 111
Success: aaaaaaaaa
Success: 111111111
Success: aaa111bbb222ccc333
Success: aaaaaa111111bbb222
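The difference between Matcher.matches() and Matcher.find() mentioned in this answer can be sketched as follows. The class and method names are illustrative only. The sketch assumes that, for searching inside a longer string, the outer (?: ... )* wrapper is dropped, since a pattern that can match the empty string would make find() succeed on any input.

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    // A single run of 3 or more identical letters/digits, for use with find().
    static final Pattern RUN = Pattern.compile("(?i)([a-z0-9])\\1{2,}");
    // The pattern from the answer: zero or more such runs, for use with matches().
    static final Pattern RUNS_ONLY = Pattern.compile("(?i)(?:([a-z0-9])\\1{2,})*");

    // True if a run occurs anywhere in the input.
    static boolean containsRun(String s) {
        return RUN.matcher(s).find();
    }

    // True only if the whole input is made of such runs.
    static boolean consistsOfRuns(String s) {
        return RUNS_ONLY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsRun("xxaaayy"));    // true
        System.out.println(consistsOfRuns("xxaaayy")); // false
        System.out.println(consistsOfRuns("aaa111"));  // true
        System.out.println(consistsOfRuns(""));        // true, zero runs
    }
}
```

Note that the starred pattern accepts the empty string under matches(), which may or may not be what a validation rule wants.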
A regex to match three consecutive identical digits or letters is
"([0-9]|[a-zA-Z])\1\1"
Thanks, all, for helping me.
For the first case (3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.) I used the code below. Please share your comments on it.
public static boolean validateConsecutiveSeq(String epin) {
    char[] epinCharArray = epin.toCharArray();
    int asciiCode = 0;
    boolean isConSeq = false;
    int previousAsciiCode = 0;
    int numSeqcount = 0; // number of consecutive +1 steps seen so far
    for (int i = 0; i < epinCharArray.length; i++) {
        asciiCode = epinCharArray[i];
        // Compares raw char codes, so a run that starts outside letters
        // and digits (for example "/01") is also counted.
        if ((previousAsciiCode + 1) == asciiCode) {
            numSeqcount++;
            // Two +1 steps in a row means three sequential characters.
            if (numSeqcount >= 2) {
                isConSeq = true;
                break;
            }
        } else {
            numSeqcount = 0;
        }
        previousAsciiCode = asciiCode;
    }
    return isConSeq;
}
If you have a lower bound (3) and an upper bound for the sequence length, the regex string can be generated as follows:
public class RegexBuilder {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        int seqStart = 3;
        int seqEnd = 5;
        buildRegex(sb, seqStart, seqEnd);
        System.out.println(sb);
    }

    // Appends the alternatives for every sequence length from seqStart to seqEnd.
    private static void buildRegex(StringBuilder sb, int seqStart, int seqEnd) {
        for (int i = seqStart; i <= seqEnd; i++) {
            buildRegexCharGroup(sb, i, '0', '9');
            buildRegexCharGroup(sb, i, 'A', 'Z');
            buildRegexCharGroup(sb, i, 'a', 'z');
            buildRegexRepeatedString(sb, i);
        }
    }

    // Appends every run of seqLength sequential characters within [start, end].
    private static void buildRegexCharGroup(StringBuilder sb, int seqLength,
            char start, char end) {
        for (char c = start; c <= end - seqLength + 1; c++) {
            char ch = c;
            if (sb.length() > 0) {
                sb.append('|');
            }
            for (int i = 0; i < seqLength; i++) {
                sb.append(ch++);
            }
        }
    }

    // Appends an alternative for seqLength identical characters.
    // Note: every call adds a new capturing group but always refers to \1,
    // so the later alternatives are not expected to match on their own.
    private static void buildRegexRepeatedString(StringBuilder sb, int seqLength) {
        sb.append('|');
        sb.append("([a-zA-Z\\d])");
        for (int i = 1; i < seqLength; i++) {
            sb.append("\\1");
        }
    }
}
Output
012|123|234|345|456|567|678|789|ABC|BCD|CDE|DEF|EFG|FGH|GHI|HIJ|IJK|JKL|KLM|LMN|MNO|NOP|OPQ|PQR|QRS|RST|STU|TUV|UVW|VWX|WXY|XYZ|abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|([a-z\d])\1\1|0123|1234|2345|3456|4567|5678|6789|ABCD|BCDE|CDEF|DEFG|EFGH|FGHI|GHIJ|HIJK|IJKL|JKLM|KLMN|LMNO|MNOP|NOPQ|OPQR|PQRS|QRST|RSTU|STUV|TUVW|UVWX|VWXY|WXYZ|abcd|bcde|cdef|defg|efgh|fghi|ghij|hijk|ijkl|jklm|klmn|lmno|mnop|nopq|opqr|pqrs|qrst|rstu|stuv|tuvw|uvwx|vwxy|wxyz|([a-z\d])\1\1\1|01234|12345|23456|34567|45678|56789|ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP|MNOPQ|NOPQR|OPQRS|PQRST|QRSTU|RSTUV|STUVW|TUVWX|UVWXY|VWXYZ|abcde|bcdef|cdefg|defgh|efghi|fghij|ghijk|hijkl|ijklm|jklmn|klmno|lmnop|mnopq|nopqr|opqrs|pqrst|qrstu|rstuv|stuvw|tuvwx|uvwxy|vwxyz|([a-z\d])\1\1\1\1
All put together:
([a-zA-Z0-9])\1\1+|(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+
3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.
([a-zA-Z0-9])\1\1+
https://regexr.com/4727n
This also works:
(?:(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9)){2,}\d|(?:a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|v(?=w)|w(?=x)|x(?=y)|y(?=z)){2,}[[:alpha:]])|([a-zA-Z0-9])\1\1+
https://regex101.com/r/6fXC9u/1
For the first question, this works (for letters) if you're OK with less regex:
function containsConsecutiveCharacters(str) {
    for (let i = 0; i <= str.length - 3; i++) {
        const allthree = str[i] + str[i + 1] + str[i + 2];
        const s1 = str.charCodeAt(i);
        const s2 = str.charCodeAt(i + 1);
        const s3 = str.charCodeAt(i + 2);
        // All three characters must be letters whose codes increase by exactly 1.
        if (/^[a-zA-Z]+$/.test(allthree) && s2 - s1 === 1 && s3 - s2 === 1) {
            return true;
        }
    }
    return false;
}
3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
(?:(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9)){2,}\d|(?:a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|v(?=w)|w(?=x)|x(?=y)|y(?=z)){2,}[\p{Alpha}])
https://regex101.com/r/5IragF/1
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.
([\p{Alnum}])\1{2,}
https://regex101.com/r/VEHoI9/1
For case #2 I got inspired by a sample on regextester and created the following regex to match n identical digits (to check for both numbers and letters replace 0-9 with A-Za-z0-9):
const n = 3
const identicalAlphanumericRegEx = new RegExp("([0-9])" + "\\1".repeat(n - 1))
I was discussing this with a coworker and we think we have a good solution for #1.
To check for abc or bcd or ... or 012 or 123 or even any number of sequential characters, try:
.*((?:a(?=b))|(?:b(?=c))|(?:c(?=d))|(?:d(?=e))|(?:e(?=f))|(?:f(?=g))|(?:g(?=h))|(?:h(?=i))|(?:i(?=j))|(?:j(?=k))|(?:k(?=l))|(?:l(?=m))|(?:m(?=n))|(?:n(?=o))|(?:o(?=p))|(?:p(?=q))|(?:q(?=r))|(?:r(?=s))|(?:s(?=t))|(?:t(?=u))|(?:u(?=v))|(?:v(?=w))|(?:w(?=x))|(?:x(?=y))|(?:y(?=z))|(?:0(?=1))|(?:1(?=2))|(?:2(?=3))|(?:3(?=4))|(?:4(?=5))|(?:5(?=6))|(?:6(?=7))|(?:7(?=8))|(?:8(?=9))){2,}.*
The nice thing about this solution is that if you want to require more than 3 consecutive characters, you only need to increase the {2,} to one less than the length you want to check for.
The ?: in each inner group prevents that group from being captured.
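As a rough sketch of the same lookahead idea in Java, the alternation can be generated instead of typed out. The names SequentialRunDemo, appendSteps, sequentialRun and hasSequentialRun are hypothetical. The sketch assumes lowercase letters and digits only, as in the pattern above, and uses find() instead of the surrounding .* parts.

```java
import java.util.regex.Pattern;

class SequentialRunDemo {
    // Appends "a(?=b)|b(?=c)|...": each character followed by its successor.
    static void appendSteps(StringBuilder sb, char first, char last) {
        for (char c = first; c < last; c++) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(c).append("(?=").append((char) (c + 1)).append(')');
        }
    }

    // A run of minLength sequential characters needs minLength - 1 steps.
    static Pattern sequentialRun(int minLength) {
        StringBuilder steps = new StringBuilder();
        appendSteps(steps, 'a', 'z');
        appendSteps(steps, '0', '9');
        return Pattern.compile("(?:" + steps + "){" + (minLength - 1) + ",}");
    }

    static boolean hasSequentialRun(String s, int minLength) {
        return sequentialRun(minLength).matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSequentialRun("xabcx", 3)); // true
        System.out.println(hasSequentialRun("12389", 3)); // true ("123")
        System.out.println(hasSequentialRun("ab1cd", 3)); // false
        System.out.println(hasSequentialRun("abc", 4));   // false
    }
}
```

Because the last character of a run is only checked by a lookahead, the matched text is one character shorter than the run itself (for "abc" the match is "ab"), which is fine for a yes/no check.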
Try this for the first question.
It returns true if it finds 3 consecutive sequential characters (ascending or descending) in the argument.
function check(val) {
    for (let i = 0; i <= val.length - 3; i++) {
        var s1 = val.charCodeAt(i);
        var s2 = val.charCodeAt(i + 1);
        var s3 = val.charCodeAt(i + 2);
        // Both steps must have the same size, and that size must be exactly 1.
        if (Math.abs(s1 - s2) === 1 && s1 - s2 === s2 - s3) {
            return true;
        }
    }
    return false;
}
console.log(check('Sh1ak#ki1r#100'));

x? quantifier: Why does a non-x give a "zero-length" match?

A quantifier x? means a single or no occurrence of x.
I am posting a test harness for matching the regex with strings for convenience.
I am confused about the regex a? when compared to the string ababaaaab.
The output of the program is:
Enter your regex: a?
Enter your input string to seacrh: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
Enter your regex:
I am confused about the b's.
"The regular expression a? is not specifically looking for the letter
"b"; it's merely looking for the presence (or lack thereof) of the
letter "a". If the quantifier allows for a match of "a" zero times,
anything in the input string that's not an "a" will show up as a
zero-length match."
Reference
Question:
The first line is understandable, and I do understand that the presence of b or any non-a is an absence of a, or 0 occurrences of a, so it should result in a match. But the absence of a (that is, the occurrence of b) is between the indices 1 and 2. So why is the match of the text "" between the index 1 and 1 (in other words, why are we getting a zero-length match here)? From my reasoning, it should be between the indices 1 and 2.
import java.io.InputStreamReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/*
 * Enter your regex: foo
 * Enter input string to search: foo
 * I found the text foo starting at index 0 and ending at index 3.
 * */
public class RegexTestHarness {
    public static void main(String[] args){
        /*Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }*/
        while (true) {
            /*Pattern pattern =
            Pattern.compile(console.readLine("%nEnter your regex: ", null));*/
            System.out.print("\nEnter your regex: ");
            Scanner scanner = new Scanner(new InputStreamReader(System.in));
            Pattern pattern = Pattern.compile(scanner.next());
            System.out.print("\nEnter your input string to seacrh: ");
            Matcher matcher =
                    pattern.matcher(scanner.next());
            boolean found = false;
            while (matcher.find()) {
                /*console.format("I found the text" +
                " \"%s\" starting at " +
                "index %d and ending at index %d.%n",
                matcher.group(),
                matcher.start(),
                matcher.end());*/
                System.out.println("I found the text \"" + matcher.group() + "\" starting at index " + matcher.start() + " and ending at index " + matcher.end() + ".");
                found = true;
            }
            if(!found){
                //console.format("No match found.%n", null);
                System.out.println("No match found.");
            }
        }
    }
}
But the absence of a (that is the occurrence of b) is between the indices 1 and 2. So why is the match of the text "" between the index 1 and 1 (in other words, why are we getting a zero-length match here)
The length of the match is the length of the part of the input string that matched the pattern.
Since there was no "a", only an empty string was matched.
Again, the pattern does not match "a sequence of non-a characters"; it matches a (possibly empty) sequence of "a"s up to a total length of one. In this case, that matched sequence was empty.
But the absence of a (that is the occurrence of b)
The absence of a is not the occurrence of b. The absence of a takes place before the occurrence of b and ends at the occurrence of b.
The position reported is not the position of a character
The key thing to understand is that the regex engine is not giving you the position of a character where it found a match.
It is giving you the starting position where it started the match that was successful. That position is not a character. It is the space between characters. For instance,
Position 0 is the very beginning of the string. That is where the \A or ^ assertions match.
Position 1 is the position between the first and the second characters.
Position 9 is the position after the last b at the end of ababaaaab. That is where the \Z or $ assertions match.
a? is greedy. In other words, the regex engine will proceed as follows:
foreach index
    if next char is "a"
        return "a"
    else
        return ""
    end if
end foreach
If you apply this algorithm to your input string, you'll get the same output as the one you provided.
You could try its non-greedy (or lazy) equivalent: a??. The regex engine would then proceed as follows:
foreach index
    if "" matches here (it always does)
        return ""
    else if next char is "a"
        return "a"
    end if
end foreach
An empty string would thus be found at each index, and no a would be output at all.
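The greedy and lazy behaviour described here can be checked with a small sketch. GreedyVsLazyDemo and findAll are illustrative names; each match is recorded as "start-end:text", using the same input string as the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Collects every match as "start-end:text" (hypothetical helper).
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy: takes an "a" when one is there, otherwise matches "".
        System.out.println(findAll("a?", "ababaaaab"));
        // Lazy: prefers the empty match, so no "a" is ever consumed.
        System.out.println(findAll("a??", "ababaaaab"));
    }
}
```

Both runs report ten matches, one per position from 0 to 9; they differ only in whether the a characters are included in the matched text.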

Understanding regular expression output [duplicate]

I need help to understand the output of the code below. I am unable to figure out the output for System.out.print(m.start() + m.group());. Please can someone explain it to me?
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d*");
        Matcher m = p.matcher("ab34ef");
        boolean b = false;
        while(b = m.find()) {
            System.out.println(m.start() + m.group());
        }
    }
}
Output is:
0
1
234
4
5
6
Note that if I put System.out.println(m.start() );, output is:
0
1
2
4
5
6
Because you have included a * character, your pattern will match empty strings as well. When I change your code as I suggested in the comments, I get the following output:
0 ()
1 ()
2 (34)
4 ()
5 ()
6 ()
So you have a large number of empty matches (one at each location in the string where no digits follow) with the exception of 34, which matches the string of digits. Use \\d+ if you want to match digits without also matching empty strings.
You used this regex - \d* - which basically means zero or more digits. Mind the zero!
So this pattern will match any group of digits, e.g. 34 plus any other position in the string, where the matched sequence will be the empty string.
So, you will have 6 matches, starting at indices 0,1,2,4,5,6. For match starting at index 2, the matched sequence is 34, while for the remaining ones, the match will be the empty string.
If you want to find only digits, you might want to use this pattern: \d+
\d* matches zero or more digits in the expression.
The expression ab34ef has the corresponding indices 012345.
At index 0 the match is empty, so start() prints 0 and group() prints nothing; the same happens at index 1. At index 2 we find digits, so it prints 2 and 34. Next it prints 4 and nothing, and so on.
Another example:
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("123ddc2ab23");
while(matcher.find()) {
    System.out.println("start:" + matcher.start() + " end:" + matcher.end() + " group:" + matcher.group() + ";");
}
which prints:
start:0 end:2 group:12;
start:9 end:11 group:23;
You will find more information in the tutorial

Why empty regex and empty capturing group regex return string length plus one results

How would you explain that empty regex and empty capturing group regex return string length plus one results?
Code
public static void main(String... args) {
    {
        System.out.format("Pattern - empty string\n");
        String input = "abc";
        Pattern pattern = Pattern.compile("");
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String s = matcher.group();
            System.out.format("[%s]: %d / %d\n", s, matcher.start(),
                    matcher.end());
        }
    }
    {
        System.out.format("Pattern - empty capturing group\n");
        String input = "abc";
        Pattern pattern = Pattern.compile("()");
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            String s = matcher.group();
            System.out.format("[%s]: %d / %d\n", s, matcher.start(),
                    matcher.end());
        }
    }
}
Output
Pattern - empty string
[]: 0 / 0
[]: 1 / 1
[]: 2 / 2
[]: 3 / 3
Pattern - empty capturing group
[]: 0 / 0
[]: 1 / 1
[]: 2 / 2
[]: 3 / 3
The regex engine is hardcoded to advance one position upon a zero-length match (otherwise infinite loop). Your regex matches a zero-length substring. There are zero-length substrings between every character (think the "gaps between each character"); in addition, the regex engine considers the start and end of the string valid match positions as well. Because a string of length N contains N+1 gaps between letters (counting the start and end, which the regex engine does), you'll get N+1 matches.
Regex engines consider positions before and after characters, too. You can see this from the fact that they have things like ^ (start of string), $ (end of string) and \b word boundary, which match at certain positions without matching any characters (and therefore between/before/after characters). Therefore we have the N-1 positions between characters that have to be considered, as well as the first and last position (because ^ and $ would match there respectively), which gives you N+1 candidate positions. All of which match for a completely unrestrictive empty pattern.
So here are your matches:
" a b c "
 ^ ^ ^ ^
Which is obviously N+1 for N characters.
You will get the same behavior with other patterns that allow zero-length matches and don't actually find longer ones in your input. For instance, try \d*. It cannot find any digits in your input string, but * will gladly return zero-length matches.
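The N+1 count can be verified with a short sketch. EmptyMatchCountDemo and countMatches are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchCountDemo {
    // Counts how many times find() succeeds for the given regex and input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "abc" has 3 characters, so there are 4 positions: 0, 1, 2 and 3.
        System.out.println(countMatches("", "abc"));     // 4
        System.out.println(countMatches("()", "abc"));   // 4
        System.out.println(countMatches("\\d*", "abc")); // 4
        // Even an empty input has one position.
        System.out.println(countMatches("", ""));        // 1
    }
}
```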
