Java regex backreference for two digits - java

I am working with a regex and I want to use it on the replaceAll method of the String class in Java.
My regex works fine and groupCount() returns 11. So, when I try to replace my text using backreference pointing to the eleventh group, I am getting the first group with a "1" attached to it, instead of the group eleven.
String regex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)([^<]*<)";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(regex, "$1$2$11");
I am expecting to get the following result:
<span style="font-size:11.0pt">675-441-3144;;;78888464#<o:p></o:p></span>
But the $11 backreference is not returning the 11th group, it is returning the first group with a 1 attached to it, and instead I am getting the following result:
<span style="font-size:11.0pt">675-441-3144>1o:p></o:p></span>
Can someone please tell me how to access the eleventh group of my pattern?
Thanks.

Short Answer
The way you access the eleventh group of a match in the replacement is with $11.
Explanation:
As the corresponding Javadoc* states:
The replacement string may contain references to subsequences captured
during the previous match: Each occurrence of ${name} or $g will be
replaced by the result of evaluating the corresponding group(name) or
group(g) respectively. For $g, the first number after the $ is always
treated as part of the group reference. Subsequent numbers are
incorporated into g if they would form a legal group reference.
So generally speaking, as long as you have at least eleven groups, "$11" will evaluate to group(11). However, if you do not have at least eleven groups, "$11" will evaluate to group(1) + "1".
* This quote is from Matcher#appendReplacement(StringBuffer,String), which is where the chain of relevant citations from String#replaceAll(String,String) leads to.
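To make the rule concrete, here is a minimal sketch using toy patterns (the class and method names are made up for illustration; only String#replaceAll is real API). It shows "$11" with eleven groups, "$11" with only two groups, and a backslash-escaped digit that forces "group 1, then a literal 1".

```java
final class BackrefDigitsDemo {
    // Eleven capturing groups, one letter each: "$11" refers to group 11.
    static String elevenGroups() {
        return "abcdefghijk".replaceAll("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)", "$11");
    }

    // Only two groups: 11 is not a legal group reference,
    // so "$11" is read as $1 followed by a literal "1".
    static String twoGroups() {
        return "ab".replaceAll("(a)(b)", "$11");
    }

    // Escaping the second digit yields group 1 plus a literal "1",
    // even though eleven groups exist.
    static String escapedDigit() {
        return "abcdefghijk".replaceAll("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)", "$1\\1");
    }

    public static void main(String[] args) {
        System.out.println(elevenGroups()); // k
        System.out.println(twoGroups());    // a1
        System.out.println(escapedDigit()); // a1
    }
}
```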
Actual Answer
Your regex does not do what you think it does.
Part 1
The Problem
Let's divide your regex into its three top-level groups. These are groups 1, 2, and 11, respectively.
Group 1:
(>[^<]*?)
Group 2:
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
Group 11:
([^<]*<)
Group 2 is the main body of your regex, and it consists of a top-level alternation over two options. These two options consist of groups 3-8 and 9-10, respectively.
First option:
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})
Second option:
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
Now, given the text string, here is what is going on:
Group 1 executes. It matches the first ">".
Group 2 executes. It evaluates the options of its alternation in order.
The first option of group 2's alternation executes. It matches "675-441-3144".
Group 2's alternation successfully short-circuits upon the match of one of its options.
Group 2 as a whole is now equal to the option that matched, which is "675-441-3144".
The cursor is now positioned immediately after "675-441-3144", which is immediately before ";;;78888464#".
Group 11 executes. It matches everything up through the next "<", which is all of ";;;78888464#<".
Thus, some of the content that you want to be in group 2 is actually in group 11 instead.
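The trace above can be checked with a short sketch that runs the original pattern against the sample text and returns groups 1, 2 and 11 of the first match. The class and method names are invented for this illustration; the pattern and text are the ones from the question, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OriginalGroupsDemo {
    static final String REGEX =
        "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
        + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
        + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})([;,\\.]{1,3}\\d{3,}#?)?)"
        + "([^<]*<)";
    static final String TEXT =
        "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";

    // Returns groups 1, 2 and 11 of the first match.
    static String[] firstMatchGroups() {
        Matcher m = Pattern.compile(REGEX).matcher(TEXT);
        if (!m.find()) {
            throw new IllegalStateException("pattern did not match");
        }
        return new String[] { m.group(1), m.group(2), m.group(11) };
    }

    public static void main(String[] args) {
        String[] g = firstMatchGroups();
        System.out.println("group 1:  " + g[0]); // >
        System.out.println("group 2:  " + g[1]); // 675-441-3144
        System.out.println("group 11: " + g[2]); // ;;;78888464#<
    }
}
```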
The Solution
Do both of the following two things:
Convert the contents of group 2 from
option1|option2
to
option1(option2)?|option2
Change $11 in your replacement pattern to $12.
This will greedily match one or both options, rather than only one option. The modification to the replacement pattern is because we have added a group.
Part 2
The Problem
Now that we have modified the regex, our original "option 2" no longer makes sense. Given our new pattern template option1(option2)?|option2, it will be impossible for group 2 to match "675-441-3144;;;78888464#". This is because our original "option 1" will match all of "675-441-3144" and then stop. Our original "option 2" will then attempt to match ";;;78888464#", but will be unable to because it begins with a mandatory capture group of 6-16 digits, (\d{6,16}), whereas ";;;78888464#" begins with a semicolon.
The Solution
Convert the contents of our original "option 2" from
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?
to
([;,\.]{1,3}\d{3,}#?)?
Part 3
The Problem
We have one final problem to solve. Now that our original "option 2" consists only of a single group with the ? quantifier, it is possible for it to successfully match a zero-length substring. So our pattern template option1(newoption2)?|newoption2 could result in a zero-length match, which does not fulfill the intended purpose of matching phone numbers.
The Solution
Do both of the following:
Convert the contents of our new "option 2" from
([;,.]{1,3}\d{3,}#?)?
to
[;,.]{1,3}\d{3,}#?
Change $12 in our replacement string to $10, since we have now removed one group in two locations.
The Final Solution
Putting everything together, our final solution is as follows.
Search regex:
(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})([;,\.]{1,3}\d{3,}#?)?|[;,\.]{1,3}\d{3,}#?)([^<]*<)
Replacement regex:
$1$2$10
Java:
final String searchRegex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)";
final String replacementRegex = "$1$2$10";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(searchRegex, replacementRegex);
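As a quick sanity check of the final pattern, the sketch below (class and method names are hypothetical) inspects groups 2 and 10 of the first match and confirms that replacing with $1$2$10 reproduces the sample text unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FinalRegexCheck {
    static final String SEARCH_REGEX =
        "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?"
        + "(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?"
        + "((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)"
        + "([^<]*<)";
    static final String TEXT =
        "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";

    // Returns groups 2 and 10 of the first match.
    static String[] phoneAndTail() {
        Matcher m = Pattern.compile(SEARCH_REGEX).matcher(TEXT);
        if (!m.find()) {
            throw new IllegalStateException("pattern did not match");
        }
        return new String[] { m.group(2), m.group(10) };
    }

    public static void main(String[] args) {
        String[] g = phoneAndTail();
        System.out.println("group 2:  " + g[0]); // 675-441-3144;;;78888464#
        System.out.println("group 10: " + g[1]); // <
        // Reassembling $1$2$10 should give back the original text.
        System.out.println(TEXT.replaceAll(SEARCH_REGEX, "$1$2$10").equals(TEXT)); // true
    }
}
```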

Well, after trying to do it with replaceAll without success, I had to implement the replacement method myself:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String parsePhoneNumbers(String html) {
    // Build the phone-number pattern piece by piece (backslashes doubled for Java).
    StringBuilder regex = new StringBuilder(120);
    regex.append("(>[^<]*?)(")
         .append("((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?")
         .append("(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?")
         .append("((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})|(\\d{6,16})")
         .append("([;,\\.]{1,3}\\d{3,}#?)?)")
         .append(")+([^<]*<)");
    StringBuilder mutableHtml = new StringBuilder(html.length());
    Pattern pattern = Pattern.compile(regex.toString());
    Matcher matcher = pattern.matcher(html);
    int start = 0;
    while (matcher.find()) {
        // Copy the text between the previous match and this one unchanged.
        mutableHtml.append(html.substring(start, matcher.start()));
        // Wrap the phone number (group 2) in a tel: link; the last group is the trailing text.
        mutableHtml.append(matcher.group(1)).append("<a href=\"tel:")
                   .append(matcher.group(2)).append("\">").append(matcher.group(2))
                   .append("</a>").append(matcher.group(matcher.groupCount()));
        start = matcher.end();
    }
    // Copy whatever follows the last match.
    mutableHtml.append(html.substring(start));
    return mutableHtml.toString();
}
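For comparison, the same copy-and-replace loop can be written with Matcher#appendReplacement and Matcher#appendTail, which handle the bookkeeping between matches. This is only a sketch: the pattern below is a deliberately simplified, hypothetical stand-in (three groups, one fixed phone format) rather than the full expression above, and the class name is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TelLinkDemo {
    // Hypothetical simplified pattern: text after '>', a NNN-NNN-NNNN number, text up to '<'.
    private static final Pattern PHONE =
        Pattern.compile("(>[^<]*?)(\\d{3}-\\d{3}-\\d{4})([^<]*<)");

    static String linkPhones(String html) {
        Matcher m = PHONE.matcher(html);
        StringBuffer out = new StringBuffer(html.length());
        while (m.find()) {
            // Appends the text before the match, then the expanded replacement.
            m.appendReplacement(out, "$1<a href=\"tel:$2\">$2</a>$3");
        }
        // Appends the remainder after the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkPhones("<p>Call 675-441-3144 now</p>"));
    }
}
```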

Related

How to split a string in two using a pivot a number?

I'm trying to split a string into two parts using regex, but apparently, the regex is greedy so in the first group it adds a little bit more.
Example of string: "This is a phrase 22ext"
Desired result:
Group 0 = "This is a phrase"
Group 1 = "22"
The "ext" is discarded.
I'm using the following Regex (in java):
[^0-9]*([0-9]+).*
It works for Group 1, but in Group 0, it includes "22ext" as well.
How can I avoid it?
Your regex doesn't give the desired output because you didn't put the first part of it in a group, so your regex has only one group [1]. You can fix that by using:
([^0-9]*)([0-9]+).*
And then you can find your two strings in "Group 1" and "Group 2". Note that "Group 0" is the full match.
A better and shorter way is to use the following regex:
(\D*)(\d+)
This matches any non-digit characters in the first group (up to the first digit) and then the following digits in the second group. Note that in Java, \d matches only the ASCII digits 0-9 by default; it matches all Unicode digits only when the Pattern.UNICODE_CHARACTER_CLASS flag is set.
And you can decide whether or not to include the .* at the end.
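A minimal Java sketch of this approach, assuming the text part should be trimmed of its trailing space (the class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PivotSplitDemo {
    private static final Pattern PIVOT = Pattern.compile("(\\D*)(\\d+)");

    // Returns {text before the number, the number}, or null if there is no number.
    static String[] splitAtNumber(String s) {
        Matcher m = PIVOT.matcher(s);
        if (!m.find()) {
            return null;
        }
        // Group 1 keeps the trailing space before the digits, so trim it.
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = splitAtNumber("This is a phrase 22ext");
        System.out.println(parts[0]); // This is a phrase
        System.out.println(parts[1]); // 22
    }
}
```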
References:
Difference between [0-9] and \d.
[1] "Group 0" is the full match for the entire pattern, so you need to use "Group 1" and "Group 2".

Java Regex capture multiple groups with groups containing others

I'm trying to build a regular expression which captures multiple groups, with some of them being contained in others. For instance, let's say I want to capture every 4-grams that follows a 'to' prefix:
input = "I want to run to get back on shape"
expectedOutput = ["run to get back", "get back on shape"]
In that case I would use this regex:
"to((?:[ ][a-zA-Z]+){4})"
But it only captures the first item in expectedOutput (with a space prefix but that's not the point).
This is quite easy to solve without regex, but I'd like to know if it is possible only using regex.
You can use a regex that finds overlapping matches:
String s = "I want to run to get back on shape";
Pattern pattern = Pattern.compile("(?=\\bto\\b((?:\\s*[\\p{L}\\p{M}]+){4}))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1).trim());
}
The regex (?=\bto\b((?:\s*[\p{L}\p{M}]+){4})) checks each location in the string (since it is a zero width assertion) and looks for:
\bto\b - a whole word to
((?:\s*[\p{L}\p{M}]+){4}) - Group 1 capturing 4 occurrences of
\s* zero or more whitespace(s)
[\p{L}\p{M}]+ - one or more letters or diacritics
If you want to allow n-grams of fewer than 4 words, use the greedy bounded quantifier {0,4} (or {1,4} to require at least one word) instead of {4}.
Capturing groups are numbered by the position of their opening parenthesis, counting from left to right. For the expression ((A)(B(C))):
1 ((A)(B(C))) // first group (contains the other three)
2 (A) // second group
3 (B(C)) // third group (contains the fourth group)
4 (C) // fourth group
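The numbering can be confirmed with a small sketch that matches ((A)(B(C))) against "ABC" and collects each group (class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupOrderDemo {
    // Returns groups 1..4 of ((A)(B(C))) matched against "ABC".
    static String[] groups() {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        if (!m.matches()) {
            throw new IllegalStateException("pattern did not match");
        }
        String[] result = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            result[i - 1] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups();
        for (int i = 0; i < g.length; i++) {
            System.out.println((i + 1) + " -> " + g[i]);
        }
        // 1 -> ABC
        // 2 -> A
        // 3 -> BC
        // 4 -> C
    }
}
```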

How to determine substring OR match?

I have a regex that has an | (or) in it and I would like to determine what part of the or matched in the regex:
Possible Inputs:
-- Input 1 --
Stuff here to keep.
-- First --
all of this below
gets
deleted
-- Input 2 --
Stuff here to keep.
-- Second --
all of this below
gets
deleted
I use a regex to match part of the incoming input, and I want to determine which side of the | (or) matched: "-- First --" or "-- Second --".
Pattern PATTERN = Pattern.compile("^(.*?)-+ *(?:First|Second) *-+", Pattern.DOTALL);
Matcher m = PATTERN.matcher(text);
if (m.find()) {
    // How can I tell if the regex matched "First" or "Second"?
}
How can I tell which input was matched (First or Second)?
The regular expression does not contain that information. However, you could use some additional groups to figure it out.
Example pattern: (?:(First)|(Second))
On the string First, the second capture group will not participate in the match (in Java, group(2) returns null), and with Second the first one will be null. A simple inspection of the groups returned to Java will tell you which part of the regex matched.
EDIT: I assumed that First and Second were used as placeholders for the sake of simplicity and actually represent more complex expressions. If you are really looking to find which of two strings was matched, then having a single capture group (like this: (First|Second)) and comparing its content with First will do the job just fine.
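Applied to the pattern from the question, the extra-groups idea might look like the sketch below. The class and method names are hypothetical, and the sample input is abbreviated from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhichAlternativeDemo {
    // Group 1: text to keep; group 2: set only if "First" matched; group 3: only if "Second" matched.
    private static final Pattern PATTERN =
        Pattern.compile("^(.*?)-+ *(?:(First)|(Second)) *-+", Pattern.DOTALL);

    // Returns "First", "Second", or null when the pattern does not match at all.
    static String whichMatched(String text) {
        Matcher m = PATTERN.matcher(text);
        if (!m.find()) {
            return null;
        }
        // A group that did not participate in the match is null.
        return m.group(2) != null ? "First" : "Second";
    }

    public static void main(String[] args) {
        System.out.println(whichMatched("Stuff here to keep.\n-- First --\nall of this below"));  // First
        System.out.println(whichMatched("Stuff here to keep.\n-- Second --\nall of this below")); // Second
    }
}
```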
Because RegExes are stateless there is no way to tell by using only one regex.
The solution is to use two different RegExes and make a case decision.
However, you can use group() which returns the last match as String.
You can test this for .contains("First").
if (m.group().contains("First")) {
    // case 1
} else {
    // case 2
}

Java regexp grouping and + operator (Obtaining multiple values of a group)

I was wondering whether it is possible to obtain all the matches of a group with a + operator in a Java regular expression.
Example code:
public static void main(String[] args) {
    String input = "Start: First match, second match, third match.";
    Pattern p = Pattern.compile("Start:\\s*(([\\w\\s]+),?\\s*)+.");
    Matcher m = p.matcher(input);
    while (m.find()) {
        System.out.println("Regular expression Match: " + m.group(0));
        System.out.println("Group 1: " + m.group(1));
        System.out.println("Group 2: " + m.group(2));
    }
}
OUTPUT:
Regular expression Match: Start: First match, second match, third match.
Group 1: third match
Group 2: third match
Although group 2 matched three times ("First match, " "second match, " "third match") because of the second "+" operator in the regex, only the last match is accessible through m.group(2).
My question is:
Is there a way to access the other matches of group 2 in that expression, or, when a + operator causes a group to match multiple times, can only the last one be accessed?
Thanks.
As mentioned in other answers, you can't match n groups using + like this.
However, if you are looking to solve this problem in Java then using a Scanner to break on the delimiters may help:
String input = "Start: First match, second match, third match.";
// Consume the whitespace around each delimiter so the tokens have no leading spaces.
Pattern p = Pattern.compile("Start:\\s*|\\s*,\\s*");
Scanner s = new Scanner(input).useDelimiter(p);
while (s.hasNext()) {
    System.out.println("Matched: " + s.next());
}
This prints out:
Matched: First match
Matched: second match
Matched: third match.
You asked:
Is there a way to access the other matches of group 2 in that expression, or, when a + operator causes a group to match multiple times, can only the last one be accessed?
The answer is no: if the same group matches some text multiple times, you can only access the last matched text.
There are of course other ways to return multiple matches.
I think this may not be possible with your regular expression.
As per the docs:
The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second
time because of quantification then its previously-captured value, if
any, will be retained if the second evaluation fails. Matching the
string "aba" against the expression (a(b)?)+, for example, leaves
group two set to "b". All captured input is discarded at the beginning
of each match.
Like most other regex flavors, Java doesn't save the intermediate captures of a repeated group. But that feature isn't really as useful as you might think. For example, the .NET flavor provides the CaptureCollection class for that purpose, but you still have to write the code to loop through it. Not that big a deal, but still it's usually easier to use multiple matches, like the other responders suggested. Try it with this regex:
"(?:Start:|\\G,)\\s*([\\w\\s]+)"
\G is a kind of anchor that causes the regex to reject any match that doesn't start exactly where the last match ended. If there was no previous match (i.e., this is the first match attempt), it acts like \A and matches only at the very beginning of the string. That's partly why I placed the , in that part of the regex; I think it's safe to assume the string doesn't start with a comma.
Note that the first group is non-capturing; the part you're looking for will always be in group(1).
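Put into a find() loop, the \G-based regex could be used roughly like this (a sketch; the class and method names are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedItemsDemo {
    // Each match is either the "Start:" prefix or a comma that begins
    // exactly where the previous match ended (\G), followed by one item.
    private static final Pattern ITEM =
        Pattern.compile("(?:Start:|\\G,)\\s*([\\w\\s]+)");

    static List<String> items(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(items("Start: First match, second match, third match."));
        // [First match, second match, third match]
    }
}
```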

Java - Extract strings with Regex

I've this string
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use the regex \\d{2}\\:\\d{2}, I'm only able to extract the first hour, 06:30:
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); // IndexOutOfBoundsException: no group 1
matcher.group(1) throws an exception.
Also I don't know how to extract 1234. This string can change but it always comes after 'XX~ '
Do you have any idea on how to match these strings with regex expressions?
UPDATE
Thanks to Adam's suggestion, I now have this regex that matches my string:
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})");
I get the number and the two hours with matcher.group(1), matcher.group(2) and matcher.group(3).
The matcher.group() function expects a single integer argument: the capturing group index, starting from 1. The index 0 is special and means "the entire match". A capturing group is created using a pair of parentheses "(...)". Anything within the parentheses is captured. Groups are numbered from left to right (again, starting from 1) by their opening parenthesis, which means that groups can be nested. Since there are no parentheses in your regular expression, there can be no group 1.
The javadoc on the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
If you want to build one big regular expression that matches everything all at once (which I believe is what you want), then put a set of capturing parentheses around each of the three things that you want to capture, use Matcher.matches() and then Matcher.group(n), where n is 1, 2 and 3 respectively. Of course Matcher.matches() might also return false, in which case the pattern did not match and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match for digits, end the capturing group, etc. I don't know enough about your exact input format, but here is an example.
Let's say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantity and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if (m.matches()) {
    System.out.println("The quantity is " + m.group(1));
    System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2)". The time assumes that the hour is always 0-padded out to 2 digits. I would give a closer example to what you are looking for, but the description of the possible input is a little vague.
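For the string in the question, the repeated find() approach described above might look like the following sketch. It assumes the number always follows the literal marker "XX~ " and that times are always two digits, a colon, and two digits; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtractFieldsDemo {
    static final String INPUT =
        "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";

    // The digits right after the "XX~ " marker, or null if the marker is absent.
    static String numberAfterMarker(String s) {
        Matcher m = Pattern.compile("XX~ (\\d+)").matcher(s);
        return m.find() ? m.group(1) : null;
    }

    // Every hh:mm occurrence, collected by calling find() until it returns false.
    static List<String> times(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\d{2}:\\d{2}").matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(numberAfterMarker(INPUT)); // 1234
        System.out.println(times(INPUT));             // [06:30, 07:45]
    }
}
```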
