Java Regex Capturing Groups - java

I am trying to understand this code block. In the first one, what is it we are looking for in the expression?
My understanding is that it is any character (0 or more times *) followed by any number between 0 and 9 (one or more times +) followed by any character (0 or more times *).
When this is executed the result is:
Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT300
Found value: 0
Could someone please go through this with me?
What is the advantage of using Capturing groups?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTut3 {
public static void main(String args[]) {
String line = "This order was placed for QT3000! OK?";
String pattern = "(.*)(\\d+)(.*)";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find()) {
System.out.println("Found value: " + m.group(0));
System.out.println("Found value: " + m.group(1));
System.out.println("Found value: " + m.group(2));
} else {
System.out.println("NO MATCH");
}
}
}

The issue you're having is with the type of quantifier. You're using a greedy quantifier in your first group (index 1 - index 0 represents the whole Pattern), which means it'll match as much as it can (and since it's any character, it'll match as many characters as there are in order to fulfill the condition for the next groups).
In short, your 1st group .* matches anything as long as the next group \\d+ can match something (in this case, the last digit).
As per the 3rd group, it will match anything after the last digit.
If you change it to a reluctant quantifier in your 1st group, you'll get the result I suppose you are expecting, that is, the 3000 part.
Note the question mark in the 1st group.
String line = "This order was placed for QT3000! OK?";
Pattern pattern = Pattern.compile("(.*?)(\\d+)(.*)");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
System.out.println("group 1: " + matcher.group(1));
System.out.println("group 2: " + matcher.group(2));
System.out.println("group 3: " + matcher.group(3));
}
Output:
group 1: This order was placed for QT
group 2: 3000
group 3: ! OK?
More info on Java Pattern here.
Finally, the capturing groups are delimited by round brackets, and provide a very useful way to use back-references (amongst other things), once your Pattern is matched to the input.
In Java 6 groups can only be referenced by their order (beware of nested groups and the subtlety of ordering).
In Java 7 it's much easier, as you can use named groups.
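As a minimal sketch of the named-group syntax available from Java 7 on, the reluctant pattern above can be written with names instead of indexes. The group names prefix, number and rest are made up for this illustration, as is the class name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupsDemo {
    // Same pattern as the reluctant example, but each group has a (hypothetical) name.
    private static final Pattern PATTERN =
            Pattern.compile("(?<prefix>.*?)(?<number>\\d+)(?<rest>.*)");

    // Returns the text captured by the "number" group, or null when nothing matches.
    static String extractNumber(String line) {
        Matcher matcher = PATTERN.matcher(line);
        return matcher.find() ? matcher.group("number") : null;
    }

    public static void main(String[] args) {
        // prints 3000
        System.out.println(extractNumber("This order was placed for QT3000! OK?"));
    }
}
```

Named groups still have an index, so matcher.group(2) would return the same text here.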

This is totally OK.
Group 0 (m.group(0)) always captures the whole area that is covered by your regular expression. In this case, it's the whole string.
Quantifiers are greedy by default, meaning that the first group captures as much as possible without violating the regex. The (.*)(\\d+) (the first part of your regex) covers the ...QT300 in the first group and the 0 in the second.
You can quickly fix this by making the first group non-greedy: change (.*) to (.*?).
For more info on greedy vs. lazy, check this site.

Your understanding is correct. However, if we walk through:
(.*) will swallow the whole string;
it will need to give back characters so that (\\d+) is satisfied (which is why 0 is captured, and not 3000);
the last (.*) will then capture the rest.
I am not sure what the original intent of the author was, however.

From the docs:
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
So capture group 0 returns the whole match, which here is the whole line.

Related

why this regex ".*" match against "abcd 1234 abcd" gives two matches?

Why do I get two matches when using the regular expression .* on the string abcd 1234 abcd? See https://regex101.com/r/rV8jfz/1.
From the explanation given by regex101, I can see that the second match happened at position 14-14 and the value matched is null. But why is a second match done? Is there a way that I can avoid the second match?
I understand .* means zero or more of any character, so it's trying to find zero occurrences. But I don't understand why this null match is required.
The problem is when used in any language (e.g. Java), when I do while(matcher.find()) { ... }, this would loop twice while I would want it to loop only once.
I know this could not be a real world match situation, but to understand and explore regex, I see this as a good case to study.
Edit - following @terdon's response.
I'd like to keep the /g option in regex101; I am aware of it. I would like to know the total possible matches.
https://regex101.com/r/EvOoAr/1 -> pattern abcd against string abcd 1234 abcd gives two matches. And I want to know this information.
The problem I find is when dealing with this in a language like Java -
Ref - https://onecompiler.com/java/3xnax494k
String str = "abcd 1234 abcd";
Pattern p = Pattern.compile(".*");
Matcher matcher = p.matcher(str);
int matchCount=0;
while(matcher.find()) {
matchCount++;
System.out.println("match number: " + matchCount);
System.out.println("matcher.groupCount(): " + matcher.groupCount());
System.out.println("matcher.group(): " + matcher.group());
}
The output is -
match number: 1
matcher.groupCount(): 0 //you can ignore this
matcher.group(): abcd 1234 abcd
match number: 2
matcher.groupCount(): 0
matcher.group(): //this is my concern. The program has to deal with this nothing match some how.
It would be nice for me as a programmer, if the find() did not match against "nothing". I should add additional code in the loop to catch this "nothing" case.
This null problem (in code) will get even worse with this regex case - https://regex101.com/r/5HuJ0R/1 -> [0-9]* against abcd 1234 abcd gives 12 matches.
The reason you get two matches is because you are using the g (global) operator. If you remove that from your regex101 example, you will only get one match.
This happens because the global operator makes the regex engine try to find as many matches in the string as possible. Since the expression .* matches everything, it also matches nothing, i.e. the empty string. Therefore, the first match is the entire string, and the second match is the "nothing" that comes after it: an empty string. Removing the g will make it stop at the first match, the entire string, and not try to find others.
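On the Java side, a while (matcher.find()) loop behaves like the global flag, so the trailing empty match shows up there too. A minimal sketch of two possible workarounds, assuming zero-length matches are never wanted: skip them inside the loop, or use + instead of * so the pattern cannot match nothing. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonEmptyMatches {
    // Collects every match of the regex in the input, ignoring zero-length matches.
    static List<String> nonEmptyMatches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            if (!matcher.group().isEmpty()) {
                result.add(matcher.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "abcd 1234 abcd";
        System.out.println(nonEmptyMatches(".*", str));     // [abcd 1234 abcd]
        System.out.println(nonEmptyMatches("[0-9]*", str)); // [1234]
        // With + the pattern needs at least one character, so no filtering is required.
        System.out.println(nonEmptyMatches(".+", str));     // [abcd 1234 abcd]
    }
}
```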

Regular expression for UK postcode also matches UUID

I am having problems with the following UK Postcode regex
([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})
It works for UK postcodes as intended e.g.
AB11AB
However, it also seems to match UUIDs as well e.g.
c25d4f64-2336-4a5d-b94c-14dc12xxxa58
Is there anyway to ignore UUIDs from the regular expression ?
Please find example here
https://regex101.com/r/dI6gD9/19
Option 1
Maybe we could just add start and end anchors so that the UUIDs fail, and change the capturing groups to non-capturing ones, if that'd be OK:
^(?:[Gg][Ii][Rr]\s+0[Aa]{2})|(?:(?:([A-Za-z][0-9]{1,2})|(?:(?:[A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(?:(?:[A-Za-z][0-9][A-Za-z])|(?:[A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s*[0-9][A-Za-z]{2})$
The expression can be most likely simplified (e.g., non-capturing groups), I have also added extra spaces, just in case.
DEMO 1
Option 2
Another option would be to add word boundaries, which, I'm guessing, would make it very unlikely to match a UUID in our data; we can also add an i flag:
(?i)(?:\bgir\b\s+\b0a{2}\b)|\b(?:[a-z][0-9]{1,2}|[a-z][a-hj-y][0-9]{1,2}|[a-z][0-9][a-z]|[a-z][a-hj-y][0-9][a-z]?)\s*[0-9][a-z]{2}\b
DEMO 2
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "^(?:[Gg][Ii][Rr]\\s+0[Aa]{2})|(?:(?:([A-Za-z][0-9]{1,2})|(?:(?:[A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(?:(?:[A-Za-z][0-9][A-Za-z])|(?:[A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\\s*[0-9][A-Za-z]{2})$";
final String string = "c25d4f64-2336-4a5d-b94c-14dc12xxxa58\n"
+ "AB11AB";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
Your regex is fine; you just need to anchor it to the start and end of the string. Just add a ^ to the start and a $ to the end of the pattern.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})$
https://regex101.com/r/jwLqLx/1
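One thing worth checking when adding anchors: alternation (|) binds more loosely than ^ and $, so in a pattern of the form ^A|B$ the ^ applies only to A and the $ only to B. Below is a small sketch comparing that form with one where the whole alternation is wrapped in a non-capturing group. CORE is the regex from the question; the class name, method names and the extra test strings are made up for this illustration.

```java
import java.util.regex.Pattern;

public class PostcodeAnchors {
    // The postcode regex from the question, unchanged.
    static final String CORE =
            "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
            + "|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\\s?[0-9][A-Za-z]{2})";

    // ^A|B$ : each anchor belongs to only one alternative.
    static final Pattern LOOSE = Pattern.compile("^" + CORE + "$");
    // ^(?:A|B)$ : both anchors apply to every alternative.
    static final Pattern STRICT = Pattern.compile("^(?:" + CORE + ")$");

    static boolean looseFind(String s) {
        return LOOSE.matcher(s).find();
    }

    static boolean strictFind(String s) {
        return STRICT.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "AB11AB",
            "c25d4f64-2336-4a5d-b94c-14dc12xxxa58",
            "xxxxAB11AB",
            "GIR 0AA trailing"
        };
        for (String s : samples) {
            System.out.println(s + " -> loose=" + looseFind(s) + ", strict=" + strictFind(s));
        }
    }
}
```

Both forms reject the UUID from the question, but only the grouped form rejects inputs that merely start or end with something postcode-like. In Java, calling matcher.matches() on the unanchored pattern should give the same effect as the grouped form.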
You are using the correct regex, that is issued by the UK government.
Below I have added examples of how to use it:
Match full string:
When matching to a full string don't use the global flag, because then it will find the occurrences within a string, rather than testing a string to fully match the regex.
So don't use the global and multi-line flags
Notice the gm part in
/your_regex/gm
Try it in this example on regex101.com, where I have already disabled the global and multi-line flag for you.
Match in log file:
For log files, add word boundaries (\b) around your regex
Notice the \b parts in
/\byour_regex\b/gm
Try it in this example which shows this behaviour in an example log file.
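A rough Java equivalent of the log-file case, as a sketch only: the regex from the question is wrapped in \b(?:...)\b and every match in a line is collected. The log line, class name and method name are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogPostcodeFinder {
    // The postcode regex from the question, unchanged.
    static final String CORE =
            "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
            + "|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\\s?[0-9][A-Za-z]{2})";

    // Same regex, but a match must start and end on a word boundary.
    static final String BOUNDED = "\\b(?:" + CORE + ")\\b";

    // Returns every match of the given regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical log line containing both a UUID and a postcode.
        String line = "user=c25d4f64-2336-4a5d-b94c-14dc12xxxa58 postcode=AB11AB";
        System.out.println(findAll(BOUNDED, line)); // [AB11AB]
    }
}
```

Without the boundaries, the same line also yields the fragment dc12xx from inside the UUID, which is the false positive described in the question.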

Java regexp grouping and + operator (Obtaining multiple values of a group)

I was wondering if it is possible to obtain all the matches of a group with a + operator in a Java regular expression.
Example code:
public static void main(String[] args) {
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:\\s*(([\\w\\s]+),?\\s*)+.");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println("Regular expression Match: "+ m.group(0));
System.out.println("Group 1: "+ m.group(1));
System.out.println("Group 2: "+ m.group(2));
}
}
OUTPUT:
Regular expression Match: Start: First match, second match, third match.
Group 1: third match
Group 2: third match
Although group 2 matched three times ("First match, ", "second match, ", "third match") because of the second + operator in the regex, only the last match is accessible through m.group(2).
My question is:
Is there a way to access the other matches of group 2 in that expression, or is only the last one accessible when a + operator causes a group to match multiple times?
Thanks.
As mentioned in other answers, you can't match n groups using + like this.
However, if you are looking to solve this problem in Java then using a Scanner to break on the delimiters may help:
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:\\s*|\\s*,\\s*"); // also consume the whitespace after "Start:" and after each comma
Scanner s = new Scanner(input).useDelimiter(p);
while (s.hasNext()) {
System.out.println("Matched: " + s.next());
}
This prints out:
Matched: First match
Matched: second match
Matched: third match.
You asked:
Is there a way to access the other matches of group 2 in that expression, or is only the last one accessible when a + operator causes a group to match multiple times?
The answer is no: if the same group matches text multiple times, you can only access the last matched text.
There are of course other ways to return multiple matches.
I think this may not be possible with your regular expression.
As per the docs:
The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second
time because of quantification then its previously-captured value, if
any, will be retained if the second evaluation fails. Matching the
string "aba" against the expression (a(b)?)+, for example, leaves
group two set to "b". All captured input is discarded at the beginning
of each match.
Like most other regex flavors, Java doesn't save the intermediate captures of a repeated group. But that feature isn't really as useful as you might think. For example, the .NET flavor provides the CaptureCollection class for that purpose, but you still have to write the code to loop through it. Not that big a deal, but still it's usually easier to use multiple matches, like the other responders suggested. Try it with this regex:
"(?:Start:|\\G,)\\s*([\\w\\s]+)"
\G is a kind of anchor that causes the regex to reject any match that doesn't start exactly where the last match ended. If there was no previous match (i.e., this is the first match attempt), it acts like \A and matches only at the very beginning of the string. That's partly why I placed the , in that part of the regex; I think it's safe to assume the string doesn't start with a comma.
Note that the first group is non-capturing; the part you're looking for will always be in group(1).
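A short sketch of how the \G regex could be used in a find() loop, with the input string from the question; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Each find() picks up one item: the first after "Start:", the rest after a
    // comma that sits exactly where the previous match ended (\G).
    private static final Pattern ITEM = Pattern.compile("(?:Start:|\\G,)\\s*([\\w\\s]+)");

    static List<String> items(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [First match, second match, third match]
        System.out.println(items("Start: First match, second match, third match."));
    }
}
```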

Java repetitive pattern matching (3)

I am trying to solve a simple Java regex matching problem but still getting conflicting results (following up on this and that question).
More specifically, I am trying to match a repetitive text input, consisting of groups that are delimited by '|' (vertical bar) that may be directly preceded by underscore ('_'), especially if the groups are not empty (i.e., if no two consecutive | delimiters appear in the input).
An example such input is:
Text group 1_|Text group 2_|||Text group 5_|||Text group 8
In addition, I need a way to verify that a match has occurred, in order to avoid applying the processing related to that input to other, totally different inputs that my application also processes, using different regular expressions.
To confirm that a regex works, I am using RegexPal.
After several tests, the closest to what I want are the following two Regular Expressions, suggested in the questions I quoted above:
1. (?:\||^)([^\\|]*)
2. \G([^\|]+?)_?\||\G()\||\G([^\|]*)$
Using either of these, if I run a matcher.find() loop I get:
All the text groups, with the underscore included in the end, from Regex 1
All the text groups apart from the last, with no underscore but 2 empty groups in the end, from Regex 2.
So, apparently Regex 2 is not correct (and RegexPal also does not show it as matching).
I could use Regex 1 and do some post-processing to remove the trailing underscore, although ideally I would like the regex to do that for me.
However, neither of the two aforementioned regular expressions returns true for matcher.matches(), whereas matcher.find() is always true even for totally irrelevant input (reasonable, since there will often be at least 1 matching group, even in other text).
I thus have two questions:
Is there a correct (fully working) regex that excludes the trailing underscore?
Is there any way of checking that only the correct regex has matched?
The code used to test Regex 1, is something like
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher = Pattern.compile("(?:\\||^)([^\\\\|]*)").matcher(input);
if (matcher.matches())
{
System.out.println("Input MATCHED: " + input);
while (matcher.find())
{
System.out.println("\t\t" + matcher.group(1));
}
}
else
{
System.out.println("\tInput NOT MATCHED: " + input);
}
Using the above code always results in "NOT MATCHED". Removing the if/else and only using matcher.find() does retrieve all text groups.
Matcher#matches method attempts to match the entire input sequence against the pattern, that is why you are getting the result Input NOT MATCHED. See the documentation here http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#matches
If you want to exclude the trailing underscore you can use this regex (slight modification of what you already have)
(?:\\||^)([^\\\\|_]*)
This would work if you are sure that _ comes just before |.
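For illustration, here is a sketch of a find() loop over the question's input using the modified regex above. The class and method names are invented for this example, and it assumes, as stated, that _ only ever appears directly before |.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipeGroupsDemo {
    // Group 1 stops at the first '_', '|' or backslash, so the trailing underscore is left out.
    private static final Pattern GROUP = Pattern.compile("(?:\\||^)([^\\\\|_]*)");

    static List<String> groups(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = GROUP.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
        List<String> found = groups(input);
        for (int i = 0; i < found.size(); i++) {
            System.out.println((i + 1) + ": '" + found.get(i) + "'");
        }
    }
}
```

For this input the loop yields eight entries: the four text groups without their underscores, plus an empty string for each of the empty positions 3, 4, 6 and 7.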
RegexPal is a JavaScript regex tool. The Java and JavaScript regular expression languages differ. Consider using a Java Regex tool; perhaps this one
This may be close to what you want: (?:([^_\|]+)_{0,1}+\|*)+
Edit: Code added.
In java 6 this prints each group (the find() loop).
public static void main(String[] args)
{
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher;
Pattern pattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)+");
Pattern groupPattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)");
matcher = pattern.matcher(input);
if (matcher.matches())
{
Matcher groupMatcher;
System.out.println("matcher.matches() is true");
int groupCount = matcher.groupCount();
for (int index = 1; index <= groupCount; ++index)
{
System.out.print("group (pattern)[");
System.out.print(index);
System.out.print("]: ");
System.out.println(matcher.group(index));
}
groupMatcher = groupPattern.matcher(input);
while (groupMatcher.find())
{
System.out.print("group (groupPattern):");
System.out.println(groupMatcher.group());
System.out.println(groupMatcher.group(1));
}
}
else
{
System.out.println("No match");
}
}

Java - Extract strings with Regex

I've this string
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use this regex \\d{2}\:\\d{2} I'm only able to extract the first hour 06:30
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); // throws IndexOutOfBoundsException: No group 1
matcher.group(1) throws an exception.
Also I don't know how to extract 1234. This string can change but it always comes after 'XX~ '
Do you have any idea on how to match these strings with regex expressions?
UPDATE
Thanks to Adam's suggestion, I now have this regex that matches my string
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})");
I match the number, and the 2 hours with matcher.group(1); matcher.group(2); matcher.group(3);
The matcher.group() function expects to take a single integer argument: the capturing group index, starting from 1. The index 0 is special and means "the entire match". A capturing group is created using a pair of parentheses "(...)". Anything within the parentheses is captured. Groups are numbered from left to right (again, starting from 1) by opening parenthesis (which means that groups can be nested). Since there are no parentheses in your regular expression, there can be no group 1.
The javadoc on the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
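A minimal sketch of that find() loop, using the string from the question. The class and method names are made up, and the second helper assumes (as the question states) that the number always follows the literal marker "XX~ ".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractTimesDemo {
    // Returns every HH:mm occurrence, one per successful find().
    static List<String> times(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\d{2}:\\d{2}").matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(0));
        }
        return result;
    }

    // Returns the digits following "XX~ ", or null if the marker is not present.
    static String numberAfterMarker(String input) {
        Matcher matcher = Pattern.compile("XX~ (\\d+)").matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String myString = "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        System.out.println(numberAfterMarker(myString)); // 1234
        System.out.println(times(myString));             // [06:30, 07:45]
    }
}
```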
If you want to build one big regular expression that matches everything all at once (which I believe is what you want), then put a set of capturing parentheses around each of the three things that you want to capture, use Matcher.matches(), and then Matcher.group(n) where n is 1, 2 and 3 respectively. Of course Matcher.matches() might also return false, in which case the pattern did not match the whole input, and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match for digits, end the capturing group, etc...I don't know enough about your exact input format, but here is an example.
Let's say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantity and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if(m.matches()) {
System.out.println("The quantity is " + m.group(1));
System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2)". The time pattern assumes that the hour is always 0-padded to 2 digits. I would give a closer example to what you are looking for, but the description of the possible input is a little vague.
