java.util.regex.Pattern API "puzzle" - java

Does anybody know where in the Pattern API the behaviour of this line of code is described?
System.out.println("000".matches("(0)\\10"));
I think few people can say what it prints until they run it. The API says:
\n Whatever the n-th capturing group matched
It does not say that n must be a single digit. Is it the 10th or the 1st group in my test?

You match the character 0 inside parentheses, then require the previously captured text \1, followed by a literal 0. 000 satisfies that pattern, so matches() returns true and the code prints true.
Since the parser has not seen 10 capturing groups at that point, it interprets \10 as a reference to the first group, \1, followed by the character 0.
A more complex example shows that a back reference with a number greater than 9 also works, provided that many capturing groups exist:
System.out.println(
"01234567891011 01120".matches(
"(0)(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11) \\1\\12\\30"
)
);
This is true because 0 is in the first capturing group (\1) and 11 is in capturing group 12 (\12). Finally, there is no capturing group 30, so \30 is interpreted as the back reference \3 (which captured the character 2) followed by the character 0.

The behaviour in this case is described in the section "Comparison to Perl 5" of the Pattern API:
In Perl, \1 through \9 are always interpreted as back references; a backslash-escaped number greater than 9 is treated as a back reference if at least that many subexpressions exist, otherwise it is interpreted, if possible, as an octal escape. In this class octal escapes must always begin with a zero. In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.
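The digit-dropping rule quoted above can be checked with a small sketch. The class name BackrefDigits and the helper fullMatch are illustrative only and not part of the Pattern API.

```java
import java.util.regex.Pattern;

class BackrefDigits {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Only one group exists when \10 is parsed, so the parser drops the
        // trailing 0: the pattern means "group 1, then a literal 0".
        System.out.println(fullMatch("(0)\\10", "000"));                  // true

        // Ten groups exist before \10, so it refers to group 10 ("j").
        String tenGroups = "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)";
        System.out.println(fullMatch(tenGroups + "\\10", "abcdefghijj")); // true

        // "a0" would only fit the "group 1, then 0" reading, so no match.
        System.out.println(fullMatch(tenGroups + "\\10", "abcdefghija0")); // false
    }
}
```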

Related

Why does backreferencing capturing groups work for multiple digit numbers in Java?

Let's say that you have a string:
String string = "ab #1?AZa$ab #1?AZa$";
You're trying to verify that the tenth character is a non-whitespace character, and that the twentieth character is the same as the tenth. There is corresponding verification for the 1st and 11th, the 2nd and 12th, the 3rd and 13th, etc., each with its own separate requirement, so you have to use 10 capturing groups. I found that the following regex still works to validate the aforementioned string:
string.matches("^([a-z])(\\w)(\\s)(\\W)(\\d)(\\D)([A-Z])([a-zA-Z])([aeiouAEIOU])(\\S)\\1\\2\\3\\4\\5\\6\\7\\8\\9\\10$") //returns true
My question regards the last backreference:
\\10
Shouldn't this be interpreted as "match with the first character" and then "match with 0" (the digit)? I don't see how this is interpreted as "match with the tenth character" without somehow grouping the 1 and 0 together into 10. Puzzlingly, surrounding the 1 and 0 with parentheses does not work.
The behavior for Java is documented in Pattern:
In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.
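If the opposite reading is ever needed (group 1 followed by a literal 0 while ten groups exist), the digits have to be separated from the back reference somehow. The sketch below shows two ways that should work under the rule quoted above; the class name, the constant names and the group name "first" are made up for illustration.

```java
import java.util.regex.Pattern;

class BackrefThenDigit {
    static final String TEN_GROUPS = "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)";

    // Group 1 followed by a literal 0: the non-capturing group ends the
    // back reference before the parser can read the 0 as a second digit.
    static final Pattern WRAPPED = Pattern.compile(TEN_GROUPS + "(?:\\1)0");

    // Same idea with a named group; \k<first> contains no digits to extend.
    static final Pattern NAMED =
        Pattern.compile("(?<first>a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\\k<first>0");

    public static void main(String[] args) {
        System.out.println(WRAPPED.matcher("abcdefghija0").matches()); // true
        System.out.println(NAMED.matcher("abcdefghija0").matches());   // true
        // The text of group 10 ("j") does not satisfy "group 1, then 0".
        System.out.println(WRAPPED.matcher("abcdefghijj").matches());  // false
    }
}
```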

Regex for numbers

I'm trying to create a regex for numbers where 7 should appear at least once and 9 should not appear at all.
/[^9]//d+
I'm not sure how to make it require 7 at least once.
It also fails for the following example:
123459: it accepts the string even though it contains a 9.
However, if my string is 95, it rejects it, which is right.
Code
Method 1
(?=\d*7)(?!\d*9)\d+
Method 2
\b(?=\d*7)[0-8]+\b
Note: This method uses fewer steps (170) than Method 1 (406 steps).
Alternatively, you can replace [0-8] with [^9\D], which says: don't match 9 or \D (any non-digit character).
You can also use \b(?=[^7\D]*7)[0-8]+\b, which brings the number of steps down from 170 to 147.
Method 3
\b[0-8]*7[0-8]*\b
Note: This method uses fewer steps than both methods above, at 139 steps. The only drawback of this regex is that you need to specify the valid characters in multiple places in the pattern.
Results
Input
**VALID**
123456780
7
1237412
**INVALID**
9
12345680
1234567890
12341579
Output
Note: Shown below are strings that match.
123456780
7
1237412
Explanation
Method 1
(?=\d*7) Positive lookahead ensuring what follows is any digit any number of times, followed by 7 literally
(?!\d*9) Negative lookahead ensuring what follows is not any digit any number of times, followed by 9 literally
\d+ Any digit one or more times
Method 2
\b Assert the position as a word boundary
(?=\d*7) Positive lookahead ensuring what follows is any digit any number of times, followed by 7 literally
[0-8]+ Match any character present in the set 0-8
\b Assert the position as a word boundary
Method 3
\b Assert the position as a word boundary
[0-8]* Match any digit (except 9) any number of times
7 Match the digit 7 literally
[0-8]* Match any digit (except 9) any number of times
\b Assert the position as a word boundary
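For anyone who wants to try the three methods from Java, here is a minimal sketch that runs them against the sample inputs listed under Results. It assumes each input is a single candidate number and uses matches(), which requires the whole input to match; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class SevenNoNine {
    // The three patterns from the answer, with backslashes doubled for Java.
    static final Pattern METHOD_1 = Pattern.compile("(?=\\d*7)(?!\\d*9)\\d+");
    static final Pattern METHOD_2 = Pattern.compile("\\b(?=\\d*7)[0-8]+\\b");
    static final Pattern METHOD_3 = Pattern.compile("\\b[0-8]*7[0-8]*\\b");

    // True only if the entire input matches the pattern.
    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "123456780", "7", "1237412",                 // listed as valid
            "9", "12345680", "1234567890", "12341579"    // listed as invalid
        };
        for (String s : inputs) {
            System.out.println(s + " -> "
                + accepts(METHOD_1, s) + " "
                + accepts(METHOD_2, s) + " "
                + accepts(METHOD_3, s));
        }
    }
}
```

All three methods should agree on these inputs: true for the first three strings and false for the rest.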
One way to do it would be to use several lookaheads:
(?=[^7]*7)(?!.*9)^\d+$
See a demo on regex101.com.
Note that you need to double escape the backslashes in Java, so that it becomes:
(?=[^7]*7)(?!.*9)^\\d+$
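As a rough illustration of the escaping, the pattern could be compiled like this. The class name EscapedLookaheads and the method isValid are placeholders chosen for the example.

```java
import java.util.regex.Pattern;

class EscapedLookaheads {
    // The regex \d is written "\\d" inside a Java string literal.
    static final Pattern SEVEN_NO_NINE =
        Pattern.compile("(?=[^7]*7)(?!.*9)^\\d+$");

    static boolean isValid(String s) {
        return SEVEN_NO_NINE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1237412"));  // true: has a 7, no 9
        System.out.println(isValid("12345680")); // false: no 7
        System.out.println(isValid("7912"));     // false: contains a 9
        System.out.println(isValid("12a7"));     // false: not all digits
    }
}
```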
This is a bit complex, but it works for your use case:
(?=.*^[0-68-9]*7[0-68-9]*$)(?=^(?:(?!9).)*$).*$
The first expression matches exactly one occurrence of 7 and accepts only digits; the second expression checks that 9 does not occur.
Try here : https://regex101.com/r/5OHgIr/1
If I understand correctly, you need a regex that accepts all numbers that include at least one 7 and no 9, so try this:
(?:[0-8]*7[0-8]*)+
If you want to find only numbers within normal text, add \s at the start and end of the regex.

Regex in Android Java - How to specify more than 9 backreferences?

I use multiple groups in a regex search and replace many parts of a string. I use $1, $2, etc. in Android Java with String.replaceFirst.
If I use more than nine groups in my regex and reference one of them in replaceFirst, for example $10, it inserts the first back reference followed by a literal 0.
Is there any way I can use a tenth reference? Is there a different way of referencing it?
Here is an example, although I'm actually trying to use more than nine back references; $10 is seen only as $1.
String.replaceFirst("(hello)(.*)(this)","$1middle$2");
TL;DR If you experience that $10 is treated as $1 and a 0, then your regex doesn't have 10 capture groups.
The $ back-references in the replacement value are documented in the javadoc for the appendReplacement method:
The replacement string may contain references to subsequences captured during the previous match: Each occurrence of ${name} or $g will be replaced by the result of evaluating the corresponding group(name) or group(g) respectively. For $g, the first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference. Only the numerals '0' through '9' are considered as potential components of the group reference. If the second group matched the string "foo", for example, then passing the replacement string "$2bar" would cause "foobar" to be appended to the string buffer. A dollar sign ($) may be included as a literal in the replacement string by preceding it with a backslash (\$).
So, let's say we have 11 groups:
System.out.println("ABCDEFGHIJKLMN".replaceFirst("(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)",
"$11$10$9$3$2$1"));
Here we capture the first 11 characters as individual groups, so e.g. group(1) returns "A" and group(11) returns "K". The input string has 14 characters, so the last 3 (LMN) are not replaced. The result is:
KJICBALMN
If we remove capture group 11 from the regex, then $11 is not a legal group reference, and will be interpreted as $1 and the literal 1:
System.out.println("ABCDEFGHIJKLMN".replaceFirst("(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)",
"$11$10$9$3$2$1"));
Prints:
A1JICBAKLMN
So, if you experience that $10 is treated as a $1 back-reference and a literal 0, then your regex doesn't have 10 groups.
You can also name them with (?<name>...) and then reference them with ${name}.
String.replaceFirst("(?<g1>hello)(?<g2>.*)(?<g3>this)","${g1}middle${g2}");
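To make the difference concrete, here is a small sketch with ten groups in which the first group is also named. It compares $10 with two ways of writing "group 1 followed by a literal 0"; the backslash form relies on the rule that a backslash escapes the next character in a replacement string. The class name, the helper replace and the group name "first" are assumptions for the example.

```java
class ReplacementRefs {
    // Ten capturing groups; the first is also reachable by the name "first".
    static final String REGEX = "(?<first>.)(.)(.)(.)(.)(.)(.)(.)(.)(.)";

    // Hypothetical helper: replaces the first ten characters of a fixed input.
    static String replace(String replacement) {
        return "ABCDEFGHIJKLMN".replaceFirst(REGEX, replacement);
    }

    public static void main(String[] args) {
        System.out.println(replace("$10"));       // JKLMN  (group 10)
        System.out.println(replace("${first}0")); // A0KLMN (named group, then 0)
        System.out.println(replace("$1\\0"));     // A0KLMN (group 1, escaped 0)
    }
}
```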

Please explain the output this regex (starts with a positive lookahead)

Pattern p = Pattern.compile("(?=[1-9][0-9]{2})[0-9]*[05]");
Matcher m = p.matcher("101");
while(m.find()){
System.out.println(m.start()+":"+ m.end()+ m.group());
}
Output: 0:210
Please let me know why I am getting output of m.group() as 10 here.
As far as I understand m.group() should return nothing because [05] matches to nothing.
Your Pattern, (?=[1-9][0-9]{2})[0-9]*[05] consists of 2 parts:
(?=[1-9][0-9]{2})
and
[0-9]*[05]
The first part is a zero-width positive lookahead that checks for three digits, the first of which cannot be 0. Your 101 satisfies this.
The second part matches any number of digits followed by a 0 or a 5. This matches the first 2 characters of 101, so the result is 10.
See Java - Pattern for more information.
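The two parts can also be tried separately. The sketch below is only an illustration (the class name and the helper firstMatch are invented); it prints start, end and matched text with a space before the group, which makes the output easier to read than the 0:210 in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadFind {
    // Returns "start:end group" for the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + ":" + m.end() + " " + m.group() : null;
    }

    public static void main(String[] args) {
        // The lookahead alone succeeds at index 0 but consumes nothing.
        System.out.println(firstMatch("(?=[1-9][0-9]{2})", "101"));           // 0:0
        // The second part alone backtracks until it can end on a 0 or 5.
        System.out.println(firstMatch("[0-9]*[05]", "101"));                  // 0:2 10
        // Combined: the lookahead passes, then the second part matches "10".
        System.out.println(firstMatch("(?=[1-9][0-9]{2})[0-9]*[05]", "101")); // 0:2 10
    }
}
```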
What your Regex is looking for is:
[1-9]:
match a single character present in the list below
1-9 a single character in the range between 1 and 9
[0-9]{2}:
match a single character present in the list below
Quantifier: {2} Exactly 2 times
0-9 a single character in the range between 0 and 9
[0-9]*:
match a single character present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
[05]:
match a single character present in the list below
05 a single character in the list 05 literally
For the string "101", this matches the first 2 chars of 101,
so you are printing:
System.out.println(m.start()+":"+ m.end()+ m.group());
where m.start() returns the start index of the previous match (the char at index 0), m.end() returns the offset after the last character matched, and m.group() returns the input subsequence matched by the previous match.
That regex was meant to match a number that's a multiple of 5 and greater than or equal to 100, but it's useless without anchors. It should be:
^(?=[1-9][0-9]{2}$)[0-9]*[05]$
The anchors ensure that both the lookahead and the main part are examining the whole of the string. But the task doesn't require a lookahead anyway; this works just fine:
^[1-9][0-9][05]$
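A quick way to see the effect of the anchors is to run both patterns exactly as written above with find(). This is just a sketch; the class name, constants and sample inputs are my own.

```java
import java.util.regex.Pattern;

class AnchoredMultipleOfFive {
    static final Pattern WITH_LOOKAHEAD =
        Pattern.compile("^(?=[1-9][0-9]{2}$)[0-9]*[05]$");
    static final Pattern PLAIN = Pattern.compile("^[1-9][0-9][05]$");

    // find() is enough here because the anchors pin both ends of the input.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"101", "105", "100", "095", "1005"}) {
            System.out.println(s + " -> " + found(WITH_LOOKAHEAD, s)
                + " " + found(PLAIN, s));
        }
    }
}
```

With the anchors in place, 101 is no longer reported as a match, and the two patterns agree on every sample: only 105 and 100 are accepted.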
As @AlanMoore states, there has to be an alignment.
Assertions are self-contained entities; all they have to do is pass for the engine to advance to the next construct.
Let's see what (?=[1-9][0-9]{2}) matches:
1111111110666
2222222222222222225666
33333333333333333333333330666
So far so good, on to the next construct.
Let's see what [0-9]*[05] matches.
Whatever this matches is the final answer.
1111111110666
2222222222222222225666
33333333333333333333333330666
The lesson is that, to get a coherent result, assertions have to be crafted to coincide with the constructs that come after them.
Here is an example of a constraint that could be applied after the assertion.
The assertion states that it needs three digits and that the first digit must be >= 1.
The constructs after the assertion state that it can be any number of digits, as long as it ends with a 0 or 5.
This last part is distressing since it will match only the 500000
So for sure, you need at least three digits.
That can be done like this:
[0-9]{2,}[05]
This says two things:
There must be at least three digits, but there can be more.
It must end with a 0 or 5.
That's it. Put it all together and it's:
(?=[1-9][0-9]{2})[0-9]{2,}[05]
Of course, this can be condensed to:
[1-9][0-9]+[05]
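To check that the condensed form behaves like the lookahead version, both can be run with find() on one of the sample strings above and on the 101 from the question. This is a sketch only; the class name and the helper firstMatch are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CondensedPattern {
    static final String WITH_ASSERTION = "(?=[1-9][0-9]{2})[0-9]{2,}[05]";
    static final String CONDENSED = "[1-9][0-9]+[05]";

    // Returns the text of the first match, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Both stop at the last 0 or 5 they can reach: 1111111110
        System.out.println(firstMatch(WITH_ASSERTION, "1111111110666"));
        System.out.println(firstMatch(CONDENSED, "1111111110666"));
        // Neither finds a match in 101, so both print null.
        System.out.println(firstMatch(WITH_ASSERTION, "101"));
        System.out.println(firstMatch(CONDENSED, "101"));
    }
}
```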

Understanding `+` in regular expression

I have a regular expression to remove all usernames from tweets. It looks like this:
regexFinder = "(?:\\s|\\A)[#]+([A-Za-z0-9-_]+):";
I'm trying to understand what each component does. So far, I've got:
( Used to begin a “group” element
?: Starts non-capturing group (this means one that will be removed from the final result)
\\s Matches against shorthand characters
| or
\\A Matches at the start of the string and matches a position as opposed to a character
[#] Matches against this symbol (which is used for Twitter usernames)
+ Match the previous followed by
([A-Za-z0-9-_] Match against any capital or small letters, numbers, hyphens or underscores
I'm a bit lost with the last bit though. Could somebody tell me what the +): means? I'm assuming the bracket is ending the group, but I don't get the colon or the plus sign.
If I've made any mistakes in my understanding of the regex please feel free to point it out!
The + actually means "one or more" of whatever it follows.
In this case [#]+ means "one or more # symbols" and [A-Za-z0-9-_]+ means "one or more letters, digits, dashes, or underscores". The + is one of several quantifiers.
The colon at the end is just making sure the match has a colon at the end of the match.
The + sign means "the previous character can be repeated 1 or more times". This is in contrast to the * symbol, which means "the previous character can be repeated 0 or more times". The colon, as far as I can tell, is literal—it matches a literal : in the string.
The plus sign in regular expressions means "one or more occurrences of the previous character or group of characters." Since the second plus sign is within the second set of parentheses, that set of parentheses matches any string made up of at least one lowercase or uppercase letter, digit, hyphen, or underscore.
As for the colon, it has no special meaning at this position in Java's regex syntax, so it simply matches a literal colon.
Well, we shall see..
[#]+ any character of: '#' (1 or more times)
( group and capture to \1:
[A-Za-z0-9-_]+ any character of: (a-z A-Z), (0-9), '-', '_' (1 or more times)
) end of capture group \1
: look for and match ':'
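Here is a hedged sketch of the pattern at work, using it exactly as written in the question (including its # prefix). The sample tweet, the class name and the helper methods are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UsernameFinder {
    static final Pattern FINDER =
        Pattern.compile("(?:\\s|\\A)[#]+([A-Za-z0-9-_]+):");

    // Collects what capture group 1 matched for every occurrence.
    static List<String> names(String tweet) {
        List<String> out = new ArrayList<>();
        Matcher m = FINDER.matcher(tweet);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Removes every occurrence, including the leading whitespace and the colon.
    static String strip(String tweet) {
        return FINDER.matcher(tweet).replaceAll("");
    }

    public static void main(String[] args) {
        String tweet = "#alice: hello #bob_1: hi";
        System.out.println(names(tweet)); // [alice, bob_1]
        System.out.println(strip(tweet)); // " hello hi" (leading space kept)
    }
}
```

The non-capturing group (?:\s|\A) is still part of the overall match, which is why the whitespace before #bob_1: is removed together with the name.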
The following quantifiers are recognized:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
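Each quantifier in the list can be tried on a single letter. This is a minimal sketch with a made-up class name and helper; String.matches requires the whole string to match.

```java
class QuantifierDemo {
    // Hypothetical helper: true if the whole string matches the regex.
    static boolean full(String regex, String s) {
        return s.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(full("a*", ""));         // true:  zero or more
        System.out.println(full("a+", ""));         // false: needs at least one
        System.out.println(full("a+", "aaa"));      // true
        System.out.println(full("a?", "aa"));       // false: at most one
        System.out.println(full("a{3}", "aaa"));    // true:  exactly three
        System.out.println(full("a{3,}", "aa"));    // false: at least three
        System.out.println(full("a{2,3}", "aaaa")); // false: at most three
    }
}
```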
