Java regexp grouping and + operator (Obtaining multiple values of a group)

Java regexp grouping and + operator (Obtaining multiple values of a group) - java

I was wondering if is it possible to obtain all the matches of a group with a + operator on a java regular expression.
Example code:
public static void main(String[] args) {
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:\\s*(([\\w\\s]+),?\\s*)+.");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println("Regular expression Match: "+ m.group(0));
System.out.println("Group 1: "+ m.group(1));
System.out.println("Group 2: "+ m.group(2));
}
}
OUTPUT:
Regular expression Match: Start: First match, second match, third match.
Group 1: third match
Group 2: third match
Despite group 2 matched 3 times "First match, " "second match, " "third match" due to the second "+" operator that is on the Regexp we can access just the last one on match.group(2).
My questions is:
¿There exist a way to access the other hits of the group 2 on that expression or when a + operator causes multiple match on a group only the last one can be accesed?.
thanks.

As mentioned in other answers, you can't match n groups using + like this.
However, if you are looking to solve this problem in Java then using a Scanner to break on the delimiters may help:
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:|\\s*,");
Scanner s = new Scanner(input).useDelimiter(p);
while (s.hasNext()) {
System.out.println("Matched: " + s.next());
}
This prints out:
Matched: First match
Matched: second match
Matched: third match.

You asked:
There exist a way to access the other hits of the group 2 on that expression or when a + operator causes multiple match on a group only the last one can be accesed?.
Answer is NO, if same group matches some text multiple times then you can only access last matched text.
There are of course other ways to return multiple matches.

I think this may not be possible with your regular expression.
As per the docs:
The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second
time because of quantification then its previously-captured value, if
any, will be retained if the second evaluation fails. Matching the
string "aba" against the expression (a(b)?)+, for example, leaves
group two set to "b". All captured input is discarded at the beginning
of each match.

Like most other regex flavors, Java doesn't save the intermediate captures of a repeated group. But that feature isn't really as useful as might think. For example, the .NET flavor provides the CaptureCollection class for that purpose, but you still have to write the code to loop through it. Not that big a deal, but still it's usually easier to use multiple matches, like the other responders suggested. Try it with this regex:
"(?:Start:|\\G,)\\s*([\\w\\s]+)"
\G is a kind of anchor that causes the regex to reject any match that doesn't start exactly where the last match ended. If there was no previous match (i.e., this is the first match attempt), it acts like \A and matches only at the very beginning of the string. That's partly why I placed the , in that part of the regex; I think it's safe to assume the string doesn't start with a comma.
Note that the first group is non-capturing; the part you're looking for will always be in 'group(1)`.

Related

why this regex ".*" match against "abcd 1234 abcd" gives two matches?

Why do I get two matches when using the regular expression .* on the string abcd 1234 abcd? See https://regex101.com/r/rV8jfz/1.
From the explanation given by regex101, I can see that the second match happened at position 14-14 and the value matched is null. But why is a second match done? Is there a way that I can avoid the second match?
I understand .* means zero or more of any character, so it's trying to find zero occurrences. But I don't understand why this null match is required.
The problem is when used in any language (e.g. Java), when I do while(matcher.find()) { ... }, this would loop twice while I would want it to loop only once.
I know this could not be a real world match situation, but to understand and explore regex, I see this as a good case to study.
Edit - follwing #terdon response.
I did like to keep the /g option in regex101, i am aware about it. I would like to know the total possible matches.
https://regex101.com/r/EvOoAr/1 -> pattern abcd against string abcd 1234 abcd gives two matches. And i wan't to know this information.
the problem i find is, when dealing this in a language like java -
Ref - https://onecompiler.com/java/3xnax494k
String str = "abcd 1234 abcd";
Pattern p = Pattern.compile(".*");
Matcher matcher = p.matcher(str);
int matchCount=0;
while(matcher.find()) {
matchCount++;
System.out.println("match number: " + matchCount);
System.out.println("matcher.groupCount(): " + matcher.groupCount());
System.out.println("matcher.group(): " + matcher.group());
}
The output is -
match number: 1
matcher.groupCount(): 0 //you can ignore this
matcher.group(): abcd 1234 abcd
match number: 2
matcher.groupCount(): 0
matcher.group(): //this is my concern. The program has to deal with this nothing match some how.
It would be nice for me as a programmer, if the find() did not match against "nothing". I should add additional code in the loop to catch this "nothing" case.
This null problem (in code) will get even worse with this regex case - https://regex101.com/r/5HuJ0R/1 -> [0-9]* against abcd 1234 abcd gives 12 matches.

The reason you get two matches is because you are using the g (global) operator. If you remove that from your regex101 example, you will only get one match.
This happens because the global operator makes the regex engine try to find as many matches on the string as possible. Since the expression .* matches everything, it also matches nothing, i.e. the empty string. Therefore, the first match is the entire string and then the second match is matching the "nothing" that comes after, it is matching an empty string. Removing the g will make it stop at the first match, the entire string, and not try to find others:

Java regex backreference for two digits

I am working with a regex and I want to use it on the replaceAll method of the String class in Java.
My regex works fine and groupCount() returns 11. So, when I try to replace my text using backreference pointing to the eleventh group, I am getting the first group with a "1" attached to it, instead of the group eleven.
String regex = "(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)([^<]*<)";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>":
String replacement = text.replaceAll(regex, $1$2$11");
I am expecting to get the following result:
<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>
But the $11 backreference is not returning the 11th group, it is returning the first group with a 1 attached to it, and instead I am getting the following result:
<span style="font-size:11.0pt">675-441-3144>1o:p></o:p></span>
Can someone please tell me how to access the eleventh group of my pattern?
Thanks.

Short Answer
The way you access the eleventh group of a match in the replacement is with $11.
Explanation:
As the corresponding Javadoc* states:
The replacement string may contain references to subsequences captured
during the previous match: Each occurrence of ${name} or $g will be
replaced by the result of evaluating the corresponding group(name) or
group(g) respectively. For $g, the first number after the $ is always
treated as part of the group reference. Subsequent numbers are
incorporated into g if they would form a legal group reference.
So generally speaking, as long as have at least eleven groups, then "$11" will evaluate to group(11). However, if you do not have at least eleven groups, then "$11" will evaluate to group(1) + "1".
* This quote is from Matcher#appendReplacement(StringBuffer,String), which is where the chain of relevant citations from String#replaceAll(String,String) leads to.
Actual Answer
Your regex does not do what you think it does.
Part 1
The Problem
Let's divide your regex into its three top-level groups. These are groups 1, 2, and 11, respectively.
Group 1:
(>[^<]*?)
Group 2:
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
Group 11:
([^<]*<)
Group 2 is the main body of your regex, and it consists of a top-level alternation over two options. These two options consist of groups 3-8 and 9-10, respectively.
First option:
((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})
Second option:
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?)
Now, given the text string, here is what is going on:
Group 1 executes. It matches the first ">".
Group 2 executes. It evaluates the options of its alternation in order.
The first option of group 2's alternation executes. It matches "675-441-3144".
Group 2's alternation successfully short-circuits upon the match of one of its options.
Group 2 as a whole is now equal to the option that matched, which is "675-441-3144".
The cursor is now positioned immediately after "675-441-3144", which is immediately before ";;;78888464#".
Group 11 executes. It matches everything up through the next "<", which is all of ";;;78888464#<".
Thus, some of the content that you want to be in group 2 is actually in group 11 instead.
The Solution
Do both of the following two things:
Convert the contents of group 2 from
option1|option2
to
option1(option2)?|option2
Change $11 in your replacement pattern to $12.
This will greedy match one or both options, rather than only one option. The modification to the replacement pattern is because we have added a group.
Part 2
The Problem
Now that we have modified the regex, our original "option 2" no longer makes sense. Given our new pattern template option1(option2)?|option2, it will be impossible for group 2 to match "675-441-3144;;;78888464#". This is because our original "option 1" will match all of "675-441-3144" and then stop. Our original "option 2" will then attempt to match ";;;78888464#", but will be unable to because it begins with a mandatory capture group of 6-10 digits: (\d{6,16}), but ";;;78888464#" begins with a semicolon.
The Solution
Convert the contents of our original "option 2" from
(\d{6,16})([;,\.]{1,3}\d{3,}#?)?
to
([;,\.]{1,3}\d{3,}#?)?
Part 3
The Problem
We have one final problem to solve. Now that our original "option 2" consists only of a single group with the ? quantifier, it is possible for it to successfully match a zero-length substring. So our pattern template option1(newoption2)?|newoption2 could result in a zero-length match, which does not fulfill the intended purpose of matching phone numbers.
The Solution
Do both of the following:
Convert the contents of our new "option 2" from
([;,.]{1,3}\d{3,}#?)?
to
[;,.]{1,3}\d{3,}#?
Change $12 in our replacement string to $10, since we have now removed one group in two locations.
The Final Solution
Putting everything together, our final solution is as follows.
Search regex:
(>[^<]*?)((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?((\d{2,6}[\.\- \t\f])+\d{2,6})([;,\.]{1,3}\d{3,}#?)?|[;,\.]{1,3}\d{3,}#?)([^<]*<)
Replacement regex:
$1$2$10
Java:
final String searchRegex = "(>[^<]*?)((\\+?\\d{1,4}[ \\t\\f\\-\\.](\\d[ \\t\\f\\-\\.])?)?(\\(\\d{1,4}([\\s-]\\d{1,4})?\\)[\\.\\- \\t\\f])?((\\d{2,6}[\\.\\- \\t\\f])+\\d{2,6})([;,\\.]{1,3}\\d{3,}#?)?|[;,\\.]{1,3}\\d{3,}#?)([^<]*<)";
final String replacementRegex = "$1$2$10";
String text = "<span style=\"font-size:11.0pt\">675-441-3144;;;78888464#<o:p></o:p></span>";
String replacement = text.replaceAll(searchRegex, replacementRegex);
Proof of correctness

Well, after trying to do it with replaceall without success, I had to implement the replacement method by myself:
public static String parsePhoneNumbers(String html){
StringBuilder regex = new StringBuilder(120);
regex.append("(>[^<]*?)(")
.append("((\+?\d{1,4}[ \t\f\-\.](\d[ \t\f\-\.])?)?")
.append("(\(\d{1,4}([\s-]\d{1,4})?\)[\.\- \t\f])?")
.append("((\d{2,6}[\.\- \t\f])+\d{2,6})|(\d{6,16})")
.append("([;,\.]{1,3}\d{3,}#?)?)")
.append(")+([^<]*<)");
StringBuilder mutableHtml = new StringBuilder(html.length());
Pattern pattern = Pattern.compile(regex.toString());
Matcher matcher = pattern.matcher(html);
int start = 0;
while(matcher.find()){
mutableHtml.append(html.substring(start, matcher.start()));
mutableHtml.append(matcher.group(1)).append("<a href=\"tel:")
.append(matcher.group(2)).append("\">").append(matcher.group(2))
.append("</a>").append(matcher.group(matcher.groupCount()));
start = matcher.end();
}
mutableHtml.append(html.substring(start));
return mutableHtml.toString();
}

Java Regex Capturing Groups

I am trying to understand this code block. In the first one, what is it we are looking for in the expression?
My understanding is that it is any character (0 or more times *) followed by any number between 0 and 9 (one or more times +) followed by any character (0 or more times *).
When this is executed the result is:
Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT300
Found value: 0
Could someone please go through this with me?
What is the advantage of using Capturing groups?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTut3 {
public static void main(String args[]) {
String line = "This order was placed for QT3000! OK?";
String pattern = "(.*)(\\d+)(.*)";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find()) {
System.out.println("Found value: " + m.group(0));
System.out.println("Found value: " + m.group(1));
System.out.println("Found value: " + m.group(2));
} else {
System.out.println("NO MATCH");
}
}
}

The issue you're having is with the type of quantifier. You're using a greedy quantifier in your first group (index 1 - index 0 represents the whole Pattern), which means it'll match as much as it can (and since it's any character, it'll match as many characters as there are in order to fulfill the condition for the next groups).
In short, your 1st group .* matches anything as long as the next group \\d+ can match something (in this case, the last digit).
As per the 3rd group, it will match anything after the last digit.
If you change it to a reluctant quantifier in your 1st group, you'll get the result I suppose you are expecting, that is, the 3000 part.
Note the question mark in the 1st group.
String line = "This order was placed for QT3000! OK?";
Pattern pattern = Pattern.compile("(.*?)(\\d+)(.*)");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
System.out.println("group 1: " + matcher.group(1));
System.out.println("group 2: " + matcher.group(2));
System.out.println("group 3: " + matcher.group(3));
}
Output:
group 1: This order was placed for QT
group 2: 3000
group 3: ! OK?
More info on Java Pattern here.
Finally, the capturing groups are delimited by round brackets, and provide a very useful way to use back-references (amongst other things), once your Pattern is matched to the input.
In Java 6 groups can only be referenced by their order (beware of nested groups and the subtlety of ordering).
In Java 7 it's much easier, as you can use named groups.

This is totally OK.
The first group (m.group(0)) always captures the whole area that is covered by your regular expression. In this case, it's the whole string.
Regular expressions are greedy by default, meaning that the first group captures as much as possible without violating the regex. The (.*)(\\d+) (the first part of your regex) covers the ...QT300 int the first group and the 0 in the second.
You can quickly fix this by making the first group non-greedy: change (.*) to (.*?).
For more info on greedy vs. lazy, check this site.

Your understanding is correct. However, if we walk through:
(.*) will swallow the whole string;
it will need to give back characters so that (\\d+) is satistifed (which is why 0 is captured, and not 3000);
the last (.*) will then capture the rest.
I am not sure what the original intent of the author was, however.

From the doc :
Capturing groups</a> are indexed from left
* to right, starting at one. Group zero denotes the entire pattern, so
* the expression m.group(0) is equivalent to m.group().
So capture group 0 send the whole line.

Java repetitive pattern matching (3)

I am trying to solve a simple Java regex matching problem but still getting conflicting results (following up on this and that question).
More specifically, I am trying to match a repetitive text input, consisting of groups that are delimited by '|' (vertical bar) that may be directly preceded by underscore ('_'), especially if the groups are not empty (i.e., if no two consecutive | delimiters appear in the input).
An example such input is:
Text group 1_|Text group 2_|||Text group 5_|||Text group 8
In addition, I need a way to verify that a match has occurred, in order to avoid applying the processing related to that input to other, totally different inputs that my application also processes, using different regular expressions.
To confirm that a regex works, I am using RegexPal.
After several tests, the closest to what I want are the following two Regular Expressions, suggested in the questions I quoted above:
1. (?:\||^)([^\\|]*)
2. \G([^\|]+?)_?\||\G()\||\G([^\|]*)$
Using either of these, if I run a matcher.find() loop I get:
All the text groups, with the underscore included in the end, from Regex 1
All the text groups apart from the last, with no underscore but 2 empty groups in the end, from Regex 2.
So, apparently Regex 2 is not correct (and RegexPal also does not show it as matching).
I could use Regex 1 and do some post-processing to remove the trailing underscore, although ideally I would like the regex to do that for me.
However, none of the two aforementioned regular expressions returns true for matcher.matches(), whereas matcher.find() is always true even for totally irrelevant input (reasonable, since there will often be at least 1 matching group, even in other text).
I thus have two questions:
Is there a correct (fully working) regex that excludes the trailing underscore?
Is there any way of checking that only the correct regex has matched?
The code used to test Regex 1, is something like
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher = Pattern.compile("(?:\\||^)([^\\\\|]*)").matcher(input);
if (matcher.matches())
{
System.out.println("Input MATCHED: " + input);
while (matcher.find())
{
System.out.println("\t\t" + matcher.group(1));
}
}
else
{
System.out.println("\tInput NOT MATCHED: " + input);
}
Using the above code always results in "NOT MATCHED". Removing the if/else and only using matcher.find() does retrieve all text groups.

Matcher#matches method attempts to match the entire input sequence against the pattern, that is why you are getting the result Input NOT MATCHED. See the documentation here http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#matches
If you want to exclude the trailing underscore you can use this regex (slight modification of what you already have)
(?:\\||^)([^\\\\|_]*)
This would work if you are sure that _ comes just before |.

RegexPal is a JavaScript regex tool. The Java and JavaScript regular expression languages differ. Consider using a Java Regex tool; perhaps this one
This may be close to what you want: (?:([^_\|]+)_{0,1}+\|*)+
Edit: Code added.
In java 6 this prints each group (the find() loop).
public static void main(String[] args)
{
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher;
Pattern pattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)+");
Pattern groupPattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)");
matcher = pattern.matcher(input);
if (matcher.matches())
{
Matcher groupMatcher;
System.out.println("matcher.matches() is true");
int groupCount = matcher.groupCount();
for (int index = 1; index <= groupCount; ++index)
{
System.out.print("group (pattern)[");
System.out.print(index);
System.out.print("]: ");
System.out.println(matcher.group(index));
}
groupMatcher = groupPattern.matcher(input);
while (groupMatcher.find())
{
System.out.print("group (groupPattern):");
System.out.println(groupMatcher.group());
System.out.println(groupMatcher.group(1));
}
}
else
{
System.out.println("No match");
}
}

Java - Extract strings with Regex

I've this string
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use this regex \\d{2}\:\\d{2} I'm only able to extract the first hour 06:30
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); (IndexOutOfBoundException no Group 1)
matcher.group(1) throws an exception.
Also I don't know how to extract 1234. This string can change but it always comes after 'XX~ '
Do you have any idea on how to match these strings with regex expressions?
UPDATE
Thanks to Adam suggestion I've now this regex that match my string
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})";
I match the number, and the 2 hours with matcher.group(1); matcher.group(2); matcher.group(3);

The matcher.group() function expects to take a single integer argument: The capturing group index, starting from 1. The index 0 is special, which means "the entire match". A capturing group is created using a pair of parenthesis "(...)". Anything within the parenthesis is captures. Groups are numbered from left to right (again, starting from 1), by opening parenthesis (which means that groups can overlap). Since there are no parenthesis in your regular expression, there can be no group 1.
The javadoc on the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
If you want to build one big regular expression that matches everything all at once (which I believe is what you want) then around each of the three sets of things that you want to capture, put a set of capturing parenthesis, use Matcher.match() and then Matcher.group(n) where n is 1, 2 and 3 respectively. Of course Matcher.match() might also return false, in which case the pattern did not match, and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match for digits, end the capturing group, etc...I don't know enough about your exact input format, but here is an example.
Lets say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantity and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{1,2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if(m.matches()) {
System.out.println("The quantity is " + m.group(1));
System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2, and the time assumes that hour is always 0-padded out to 2 digits). I would give a closer example to what you are looking for, but the description of the possible input is a little vague.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java regexp grouping and + operator (Obtaining multiple values of a group) - java

Related

why this regex ".*" match against "abcd 1234 abcd" gives two matches?

Java regex backreference for two digits

Java Regex Capturing Groups

Java repetitive pattern matching (3)

Java - Extract strings with Regex

Categories

Resources