Regex in Android JAVA - How to specify more than 9 backreferences? - java

I use multiple groups in a regex search and replace many parts of a string. I use $1, $2, etc. in Java on Android with String.replaceFirst.
If I use more than nine groups in my regex and try to reference the tenth in replaceFirst with $10, it inserts the first back-reference and then prints a literal 0.
Is there any way I can use a tenth reference? Is there a different way of referencing it?
Example (with only a few groups; I'm trying to use more than nine back-references, and $10 is read as $1):
String.replaceFirst("(hello)(.*)(this)","$1middle$2");

TL;DR If you experience that $10 is treated as $1 and a 0, then your regex doesn't have 10 capture groups.
The $ back-references in the replacement value are documented in the javadoc for the appendReplacement method:
The replacement string may contain references to subsequences captured during the previous match: Each occurrence of ${name} or $g will be replaced by the result of evaluating the corresponding group(name) or group(g) respectively. For $g, the first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference. Only the numerals '0' through '9' are considered as potential components of the group reference. If the second group matched the string "foo", for example, then passing the replacement string "$2bar" would cause "foobar" to be appended to the string buffer. A dollar sign ($) may be included as a literal in the replacement string by preceding it with a backslash (\$).
So, let's say we have 11 groups:
System.out.println("ABCDEFGHIJKLMN".replaceFirst("(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)",
"$11$10$9$3$2$1"));
Here we capture the first 11 characters as individual groups, so e.g. group(1) returns "A" and group(11) returns "K". The input string has 14 characters, so the last 3 (LMN) are not replaced. The result is:
KJICBALMN
If we remove capture group 11 from the regex, then $11 is not a legal group reference, and will be interpreted as $1 and the literal 1:
System.out.println("ABCDEFGHIJKLMN".replaceFirst("(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)",
"$11$10$9$3$2$1"));
Prints:
A1JICBAKLMN
So, if you experience that $10 is treated as a $1 back-reference and a literal 0, then your regex doesn't have 10 groups.
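One way to check how many capture groups a pattern really has is Matcher.groupCount(). The following is only a sketch of that check; the class and helper names (GroupCountDemo, countGroups, dots) are made up for this illustration and are not part of any API.

```java
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups declared in the regex (group 0 is not counted).
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Hypothetical helper: builds "(.)" repeated n times, e.g. n = 3 gives "(.)(.)(.)".
    static String dots(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("(.)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "ABCDEFGHIJKLMN";
        // Ten groups: $10 is a legal reference to group 10.
        System.out.println(countGroups(dots(10)));                 // 10
        System.out.println(input.replaceFirst(dots(10), "$10$1")); // JAKLMN
        // Nine groups: $10 falls back to $1 followed by a literal 0.
        System.out.println(countGroups(dots(9)));                  // 9
        System.out.println(input.replaceFirst(dots(9), "$10$1"));  // A0AJKLMN
    }
}
```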

You can also name them with (?<name>...) and then reference them with ${name}.
String.replaceFirst("(?<g1>hello)(?<g2>.*)(?<g3>this)","${g1}middle${g2}");
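A runnable version of the named-group idea might look like the sketch below. The class name, method name and sample input are assumptions chosen for illustration, and named groups need a platform whose regex engine supports them (Java 7 or later on the desktop JDK).

```java
class NamedGroupDemo {
    // Replaces the first "hello...this" match with group g1, the word "middle", and group g2.
    static String insertMiddle(String input) {
        return input.replaceFirst("(?<g1>hello)(?<g2>.*)(?<g3>this)", "${g1}middle${g2}");
    }

    public static void main(String[] args) {
        // g1 = "hello", g2 = "-to-", g3 = "this" (g3 is matched but not used in the replacement).
        System.out.println(insertMiddle("hello-to-this")); // hellomiddle-to-
    }
}
```

Because the references are by name, adding more groups to the pattern does not change what ${g1} or ${g2} refer to.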

Related

Java: is "$1" a placeholder? [duplicate]

I was given a Java exercise:
Break up camelCase writing into words, for example the input "camelCaseTest" should give the output "camel Case Test".
I found this solution online, but I don't understand all of it
public static String camelCaseBetter(String input) {
input = input.replaceAll("([A-Z])", " $1");
return input;
}
What does the $1 do? I think it just takes the String that is to be replaced (A-Z) and replaces it with itself (in this case the method also appends a space to break up the words)
I couldn't find a good explanation for $1, so I hope somebody here can explain it or share a link to the right resource which can explain it.
From the documentation of the String class:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string; see Matcher.replaceAll.
From Matcher.replaceAll
The replacement string may contain references to captured subsequences as in the appendReplacement method.
Then the appendReplacement method:
The replacement string may contain references to subsequences captured during the previous match: Each occurrence of ${name} or $g will be replaced by the result of evaluating the corresponding group(name) or group(g) respectively. For $g, the first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference. Only the numerals '0' through '9' are considered as potential components of the group reference. If the second group matched the string "foo", for example, then passing the replacement string "$2bar" would cause "foobar" to be appended to the string buffer. A dollar sign ($) may be included as a literal in the replacement string by preceding it with a backslash (\$).
So, $1 references the first capturing group (whatever matched the pattern within the first parentheses of the regular expression).
([A-Z]) matches any uppercase letter and places it in the first capturing group. The replacement " $1" then replaces each match with a space followed by the matched uppercase letter.
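As a small sketch of the above (the class and method names are invented for this example), the same split can be written with $1 or, without any capturing group, with $0, which refers to the whole match:

```java
class CamelCaseDemo {
    // Uses capturing group 1: each uppercase letter is replaced by a space plus itself.
    static String splitWithGroup(String input) {
        return input.replaceAll("([A-Z])", " $1");
    }

    // Same effect without parentheses: $0 stands for the entire match.
    static String splitWithWholeMatch(String input) {
        return input.replaceAll("[A-Z]", " $0");
    }

    public static void main(String[] args) {
        System.out.println(splitWithGroup("camelCaseTest"));      // camel Case Test
        System.out.println(splitWithWholeMatch("camelCaseTest")); // camel Case Test
    }
}
```

Note that an input starting with an uppercase letter would get a leading space with either variant.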

Why does backreferencing capturing groups work for multiple digit numbers in Java?

Let's say that you have a string:
String string = "ab #1?AZa$ab #1?AZa$";
You're trying to verify that the tenth character is a non-whitespace character, and that the twentieth character is the same as the tenth. Furthermore, there is corresponding verification with the 1st and 11th, the 2nd and 12th, the 3rd and 13th, etc. each with their own separate requirements (the full list is here) so you have to use 10 capturing groups. I found that the following regex still works to validate the aforementioned string:
string.matches("^([a-z])(\\w)(\\s)(\\W)(\\d)(\\D)([A-Z])([a-zA-Z])([aeiouAEIOU])(\\S)\\1\\2\\3\\4\\5\\6\\7\\8\\9\\10$") //returns true
My question regards the last backreference:
\\10
Shouldn't this be interpreted as "match with the first character" and then "match with 0" (the digit)? I don't see how this is interpreted as "match with the tenth character" without somehow grouping the 1 and 0 together into 10. Puzzlingly, surrounding the 1 and 0 with parentheses does not work.
The behavior for Java is documented in Pattern:
In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.
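The rule quoted above can be observed with a short experiment. This is only a sketch; the class name and the sample strings are assumptions made for the illustration.

```java
class BackrefDigitsDemo {
    // Ten single-character capturing groups.
    static final String TEN_GROUPS = "(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)";

    public static void main(String[] args) {
        // Ten groups exist before \10, so \10 means "whatever group 10 matched".
        System.out.println("abcdefghijj".matches(TEN_GROUPS + "\\10"));  // true
        System.out.println("abcdefghija0".matches(TEN_GROUPS + "\\10")); // false
        // Only one group exists, so \10 is read as \1 followed by a literal 0.
        System.out.println("aa0".matches("(.)\\10")); // true
        System.out.println("aa".matches("(.)\\10"));  // false
    }
}
```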

What is the functionality of this regex? [duplicate]

I have recently started learning regex and I am not quite sure how the following regex works:
str.replaceAll("(\\w)(\\w*)", "$2$1ay");
This allows us to do the following:
input string: "Hello World !"
return string: "elloHay orldWay !"
From what I know: \w is supposed to match all word characters including 0-9 and underscore and $ matches stuff at the end of string.
In the replaceAll method, the first parameter can be a regex. It matches all words in the string with the regex and changes them to the second parameter.
In simple cases replaceAll works like this:
String str = "I,am,a,person";
str.replaceAll(",", " "); // returns "I am a person"
It matched all the commas and replaced them with a space.
In your case, the match is a single word character (\w: a letter, digit or underscore), followed by a run of zero or more word characters (\w*).
The () around each part groups it. So you have two groups, the first character and the remaining part. If you use regex101 or some similar website you can see a visualization of this.
Your replacement is $2 (the second group, i.e. the remaining part), followed by $1 (the first character), followed by ay.
Hope this clears it up for you.
Enclosing part of a regular expression in parentheses () makes it a capturing group.
Here you have 2 capturing groups: (\w) captures a single word character, and (\w*) captures zero or more.
$1 and $2 are used to refer to the captured groups, first and second respectively.
Also, replaceAll handles each match (here, each word) individually.
So in this example, for 'Hello', 'H' is the first captured group and 'ello' is the second. The match is replaced by a reordered version, $2$1, which swaps the captured groups.
So '$2$1ay' produces 'elloHay'.
The same for the next word also.
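To see the two groups for each word, a Matcher loop can print them before the replacement is applied. The class and method names below are invented for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PigLatinDemo {
    // Moves the first word character of every word to the end and appends "ay".
    static String toPigLatin(String input) {
        return input.replaceAll("(\\w)(\\w*)", "$2$1ay");
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("(\\w)(\\w*)").matcher("Hello World !");
        while (m.find()) {
            // Prints "H | ello" for the first word, then "W | orld" for the second.
            System.out.println(m.group(1) + " | " + m.group(2));
        }
        // The "!" is not a word character, so it is left untouched.
        System.out.println(toPigLatin("Hello World !")); // elloHay orldWay !
    }
}
```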

Replace multiple capture groups using regexp with java

I have this requirement - for an input string such as the one shown below
8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs
I would like to strip the matched word boundaries (where the matching pair is 8 or & or % etc) and will result in the following
This is reallly a test of repl%acing %mul%tiple matched 9pairs
This list of characters that is used for the pairs can vary e.g. 8,9,%,# etc and only the words matching the start and end with each type will be stripped of those characters, with the same character embedded in the word remaining where it is.
Using Java I can do a pattern as \\b8([^\\s]*)8\\b and replacement as $1, to capture and replace all occurrences of 8...8, but how do I do this for all the types of pairs?
I can provide a pattern such as \\b8([^\\s]*)8\\b|\\b9([^\\s]*)9\\b .. and so on that will match all types of matching pairs (8,9,..), but how do I specify a 'variable' replacement group -
e.g. if the match is 9...9, then the replacement should be $2.
I can of course run it through multiple of these, each replacing a specific type of pair, but I am wondering if there is a more elegant way.
Or is there a completely different way of approaching this problem?
Thanks.
You could use the below regex and then replace the matched characters by the characters present inside the group index 2.
(?<!\S)(\S)(\S+)\1(?=\s|$)
OR
(?<!\S)(\S)(\S*)\1(?=\s|$)
Java regex would be,
(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)", "$2"));
Output:
This is reallly a test of repl%acing %mul%tiple matched 9pairs
Explanation:
(?<!\\S) Negative lookbehind, asserts that the match wouldn't be preceded by a non-space character.
(\\S) Captures the first non-space character and stores it into group index 1.
(\\S+) Captures one or more non-space characters.
\\1 Refers to the character inside first captured group.
(?=\\s|$) And the match must be followed by a space or end of the line anchor.
This makes sure that the first and last characters of each matched word are the same. If so, the whole match is replaced by the characters captured in group index 2.
For this specific case, you could modify the above regex as,
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)([89&#%])(\\S+)\\1(?=\\s|$)", "$2"));
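Since the question says the list of pair characters can vary, one possible approach is to build the first group from a list at run time. The sketch below assumes each delimiter is passed as a separate string; the class name, method name and parameter are hypothetical, and Pattern.quote is used so that characters with a special regex meaning are treated literally.

```java
import java.util.regex.Pattern;

class PairStripDemo {
    // delimiters: hypothetical list of pair markers, e.g. "8", "9", "%".
    static String stripPairs(String input, String... delimiters) {
        StringBuilder alternatives = new StringBuilder();
        for (String d : delimiters) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(d)); // treat the delimiter literally
        }
        // Same shape as the regex above, with group 1 limited to the given delimiters.
        String regex = "(?<!\\S)(" + alternatives + ")(\\S+)\\1(?=\\s|$)";
        return input.replaceAll(regex, "$2");
    }

    public static void main(String[] args) {
        String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
        // Prints: This is reallly a test of repl%acing %mul%tiple matched 9pairs
        System.out.println(stripPairs(s1, "8", "9", "&", "#", "%"));
        // Only the 8...8 pair is stripped here.
        System.out.println(stripPairs(s1, "8"));
    }
}
```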
(?<![a-zA-Z])[8&#%9](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[8&#%9](?![a-zA-Z])
Try this. Replace with $1 (the syntax Java uses in replacement strings; \1 is for engines that use backslash references there). See demo.
https://regex101.com/r/qB0jV1/15
(?<![a-zA-Z])[^a-zA-Z](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[^a-zA-Z](?![a-zA-Z])
Use this if you have many delimiters.

java.util.Pattern API "puzzle"

Does anybody know where in Pattern API the behaviour of this line of code is described
System.out.println("000".matches("(0)\\10"));
I think few people can say what it prints until they run it. API says
\n Whatever the n-th capturing group matched
It does not say that n must be 1 digit. Is it 10-th or 1-th group in my test?
You match the character 0 inside parentheses, then you require the previously matched character (\1) to appear again, followed by a 0 character. 000 satisfies that pattern, so the matches() method returns true and the code prints true.
Since the pattern does not contain 10 capturing groups, \10 is interpreted as the first group, \1, followed by the character 0.
A more complex example shows that a back reference greater than 9 also works, provided that many capturing groups exist:
System.out.println(
"01234567891011 01120".matches(
"(0)(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11) \\1\\12\\30"
)
);
This is true because 0 is in the first capturing group (\1) and 11 is in capturing group \12; finally, there is no capturing group number 30, so \30 is interpreted as the back reference \3 (which is the character 2) followed by the character 0.
The behaviour in this case is described in the section Comparison to Perl 5 of the Pattern api:
In Perl, \1 through \9 are always interpreted as back references; a backslash-escaped number greater than 9 is treated as a back reference if at least that many subexpressions exist, otherwise it is interpreted, if possible, as an octal escape. In this class octal escapes must always begin with a zero. In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.
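The difference between an octal escape and a back reference described in that passage can be tried out as follows. This is a sketch only; the class name and sample strings are assumptions.

```java
class OctalVsBackrefDemo {
    public static void main(String[] args) {
        // \0101 is an octal escape (it starts with a zero): octal 101 is 65, the letter 'A'.
        System.out.println("A".matches("\\0101"));      // true
        // \101 with a single group: \1, then the literal characters "01".
        System.out.println("AA01".matches("(A)\\101")); // true
        System.out.println("AA".matches("(A)\\101"));   // false
    }
}
```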
