How to precisely identify & work greedy or reluctant quantifiers? [duplicate] - java

This question already has an answer here:
SCJP6 regex issue
(1 answer)
Closed 7 years ago.
Given:
import java.util.regex.*;
class Regex2 {
    public static void main(String args[]) {
        Pattern p = Pattern.compile(args[0]);
        Matcher m = p.matcher(args[1]);
        boolean b = false;
        while (m.find()) {
            System.out.print(m.start() + m.group());
        }
    }
}
the command line expression is :
java Regex2 "\d*" ab34ef
What is the result?
A. 234
B. 334
C. 2334
D. 0123456
E. 01234456
F. 12334567
G. Compilation fails
The SCJP book explains regex, pattern and matchers so horribly it's unbelievable.
Anyway, I pretty much understand most of the basics and have looked at the Sun/Oracle documentation about greedy and reluctant quantifiers. I understand the concepts but am a bit blurry about a few things:
What exactly is the physical symbol of a "greedy" quantifier? Is it simply a single *, ? or +?
If so, can someone explain in detail how this answer turns out to be E according to the book? When I run it myself I get the answer: 2334!
Here we would be using a greedy quantifier, correct? I thought this would consume the entire string and then backtrack, looking for zero or more digits in a row. By that definition, if greedy, the 'full string' would contain 2 digits in a row and .find() would succeed only once (i.e. m.start = 0, m.group = "ab34ef")!
Thanks for the help guys.

These are the matches of \d* against "ab34ef":
index 0: zero-width;
index 1: zero-width;
index 2: "34";
index 4: zero-width;
index 5: zero-width;
index 6: zero-width.
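This should explain the book's answer, 01234456: each match prints its start index followed by the matched text, and only the match at index 2 has any text. If the quantifier were reluctant (\d*?), the single match at index 2 would be replaced by two zero-width matches: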
index 2: zero-width;
index 3: zero-width;
The reluctant quantifier grabs as little as allowed to make the entire expression match.
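To make the difference concrete, here is a minimal sketch that runs both quantifiers against the same input. The class name GreedyVsReluctant and the helper matches are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
class GreedyVsReluctant {
    // Collects every match as "start:text"; an empty text marks a zero-width match.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: takes as many digits as it can at each position.
        System.out.println(matches("\\d*", "ab34ef"));
        // Reluctant: takes as few digits as it can, so every match is empty.
        System.out.println(matches("\\d*?", "ab34ef"));
    }
}
```

The greedy run should report one match "2:34" among the zero-width ones, while the reluctant run should report a zero-width match at every index from 0 to 6.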

Related

Java, Regex, Nested optional groups

I'm trying to capture nested optional groups in Java but it's not working out.
I'm trying to capture a keyword followed by an interval, where a keyword is anything for now, and an interval is just two dates. The interval may be optional, and the two dates may be optional as well. So, the following are valid matches.
word
word [01/01/1900, ]
word [, 01/01/2000]
word [01/01/1900, 01/01/2000]
I want to capture the keyword and both the dates even if they are null.
This is the Java MWE I've come up with.
import java.util.regex.*;

public class Parser {
    public static void main(String[] args) {
        Parser parser = new Parser();
        String s = "word [01/01/1900, 01/01/2000]";
        parser.parse(s);
    }

    public void parse(String s) {
        String date = "\\d{2}/\\d{2}/\\d{4}";
        String interval = "\\[(" + date + ")?, (" + date + ")?\\]";
        String keyword = "(.+)( " + interval + ")?";
        Pattern p = Pattern.compile(keyword);
        Matcher m = p.matcher(s);
        if (m.matches()) {
            // Group 0 is the whole match; groups 1..groupCount() are the captures.
            for (int i = 0; i <= m.groupCount(); ++i) {
                System.out.println(i + ": " + m.group(i));
            }
        }
    }
}
And this is the output
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
4: null
If interval isn't optional, then it works.
String keyword = "(.+)( "+interval+")";
0: word [01/01/1900, 01/01/2000]
1: word
2: [01/01/1900, 01/01/2000]
3: 01/01/1900
4: 01/01/2000
If interval is a non-capturing group (but still optional), then it doesn't work.
String keyword = "(.+)(?: "+interval+")?";
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
What do I need to do to retrieve back both dates? Thank You.
Edit: Part 2.
Suppose now I want to match repeated keywords, i.e. the regex keyword(, keyword)*. I tried this out, but only the first and the last instances are captured.
For simplicity, suppose I want to match the following a, b, c, d with the regex ([a-z])(?:, ([a-z]))*
However, I can only retrieve back the first and last group.
0: a, b, c, d
1: a
2: d
Why is this so?
Just found out that this cannot be done. Capture group multiple times
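Since a repeated capturing group only keeps its last iteration, one common workaround is to validate the whole list once and then pull out each keyword with find(). The sketch below assumes single-letter keywords as in the simplified example; the class RepeatedKeywords and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for the simplified "a, b, c, d" case.
class RepeatedKeywords {
    private static final Pattern WHOLE_LIST = Pattern.compile("[a-z](?:, [a-z])*");
    private static final Pattern ONE_KEYWORD = Pattern.compile("[a-z]");

    // Returns every keyword, or an empty list if the input is not a valid list.
    static List<String> keywords(String input) {
        List<String> out = new ArrayList<>();
        if (!WHOLE_LIST.matcher(input).matches()) {
            return out;
        }
        // Each find() call yields one keyword, so nothing is overwritten.
        Matcher m = ONE_KEYWORD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(keywords("a, b, c, d"));
    }
}
```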
Change the first part of keyword from (.+) to (.+?).
Without the ?, the + in (.+) is a greedy quantifier, which means it tries to match as much as it can. I don't know all the mechanics of how the regex engine works, but I believe that in your case it first lets (.+) take all N characters remaining in the source. If the whole regex matches with that many characters, it stops there; otherwise it backtracks and tries N-1, N-2, and so on, until the entire regex matches. I also think it works from left to right: since (.+) is the leftmost "part" of the pattern (for some definition of "part"), the engine loops on that part before it loops on any part to its right. Thus, satisfying the greedy (.+) takes precedence over satisfying any part after it.
In your case, since (.+) is followed by an optional part, the regex matcher starts by giving (.+) the entire remainder of the string, and it succeeds, because what is left over, the empty string, is a fine match for an optional substring. That also explains why you get the expected groups when the interval isn't optional: the empty remainder no longer matches, so the engine has to backtrack.
Adding ? makes it a "reluctant" (or "stingy") quantifier, which works in the opposite direction. It starts with the fewest characters allowed (1 for +?), then tries 2, 3, ..., instead of starting with N and going downward. So when it gets up to 4, matching "word", and it finds that the rest of the string matches your optional part, it completes and gives the results you were expecting.
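As a rough check of the suggested change, the sketch below rebuilds the pattern from the question with (.+?) and runs it on the four sample inputs. The class IntervalParser and the returned array layout are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rewrite of the question's parser using a reluctant (.+?).
class IntervalParser {
    private static final String DATE = "\\d{2}/\\d{2}/\\d{4}";
    private static final Pattern KEYWORD = Pattern.compile(
            "(.+?)( \\[(" + DATE + ")?, (" + DATE + ")?\\])?");

    // Returns {keyword, firstDate, secondDate}; absent dates are null.
    // Returns null when the input does not match at all.
    static String[] parse(String s) {
        Matcher m = KEYWORD.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "word",
            "word [01/01/1900, ]",
            "word [, 01/01/2000]",
            "word [01/01/1900, 01/01/2000]"
        };
        for (String s : samples) {
            System.out.println(Arrays.toString(parse(s)));
        }
    }
}
```

Because matches() requires the whole input to match, the reluctant group keeps growing until the optional interval (or the end of the string) fits, which is why the keyword comes out as "word" in every case.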

Understanding regular expression output [duplicate]

I need help to understand the output of the code below. I am unable to figure out the output for System.out.print(m.start() + m.group());. Please can someone explain it to me?
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d*");
        Matcher m = p.matcher("ab34ef");
        boolean b = false;
        while (b = m.find()) {
            System.out.println(m.start() + m.group());
        }
    }
}
Output is:
0
1
234
4
5
6
Note that if I put System.out.println(m.start() );, output is:
0
1
2
4
5
6
Because you have included a * character, your pattern will match empty strings as well. When I change your code as I suggested in the comments, I get the following output:
0 ()
1 ()
2 (34)
4 ()
5 ()
6 ()
So you have a number of empty matches (one at each location in the string) with the exception of 34, which matches the string of digits. Use \\d+ if you want to match digits without also matching empty strings.
You used this regex - \d* - which basically means zero or more digits. Mind the zero!
So this pattern will match any group of digits, e.g. 34 plus any other position in the string, where the matched sequence will be the empty string.
So, you will have 6 matches, starting at indices 0,1,2,4,5,6. For match starting at index 2, the matched sequence is 34, while for the remaining ones, the match will be the empty string.
If you want to find only digits, you might want to use this pattern: \d+
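A small sketch of the comparison between \d* and \d+ follows. The "start (group)" output format is only a guess at what the comment-suggested change looked like, and the class name StarVsPlus is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class comparing "zero or more" with "one or more".
class StarVsPlus {
    // Formats each match as "start (text)" so empty matches stay visible.
    static List<String> describe(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            lines.add(m.start() + " (" + m.group() + ")");
        }
        return lines;
    }

    public static void main(String[] args) {
        // \d* also succeeds where there are no digits at all.
        describe("\\d*", "ab34ef").forEach(System.out::println);
        System.out.println("---");
        // \d+ needs at least one digit, so only "34" is reported.
        describe("\\d+", "ab34ef").forEach(System.out::println);
    }
}
```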
\d* matches zero or more digits in the expression.
The expression is ab34ef, and its characters have the indices 012345.
At index 0 there is no digit, so start() prints 0 and group() prints nothing; the same happens at index 1. At index 2 we find a match, so it prints 2 and 34. Next it prints 4 and nothing, and so on.
Another example:
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("123ddc2ab23");
while (matcher.find()) {
    System.out.println("start:" + matcher.start() + " end:" + matcher.end() + " group:" + matcher.group() + ";");
}
which prints:
start:0 end:2 group:12;
start:9 end:11 group:23;
You will find more information in the tutorial

pattern search using regex in java

public static void main(String args[]) {
    Pattern p = Pattern.compile("ab");     // Case 1
    // Pattern p = Pattern.compile("bab"); // Case 2 (swap with the line above)
    Matcher m = p.matcher("abababa");
    while (m.find()) {
        System.out.print(m.start());
    }
}
When I use Case 1, the output is 024, as expected. But when I use Case 2, the output is 1, whereas I expected 13. Can anyone explain whether there is some special rule in regex that causes this output? If not, why am I getting it?
Help appreciated!
Note: Case 1 and Case 2 are used independently.
The match consumes the input, so the next match is searched for after the end of the previous match.
The position of the "bab" matcher's pointer before each search would be:
|abababa
abab|aba
For Case 2:
It's because, after it finds bab, it doesn't reconsider characters it has already consumed (here, the b at index 3), so you get only 1.
Input: abababa
Search for bab:
it finds a match starting at index 1 and ending at index 3; the next search starts at index 4 (aba).
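If overlapping matches are actually wanted, one possible approach is to restart the search one character after each match's start using Matcher.find(int), which resets the matcher and searches from the given index. The sketch below is only an illustration; the class OverlappingMatches and its overlapping flag are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class contrasting the two search strategies.
class OverlappingMatches {
    static List<Integer> starts(String regex, String input, boolean overlapping) {
        List<Integer> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int from = 0;
        // find(int) throws if the index is past the end, so guard it.
        while (from <= input.length() && m.find(from)) {
            found.add(m.start());
            // Overlapping: retry one character after the last start.
            // Non-overlapping: continue after the end of the last match,
            // but always advance by at least one character.
            from = overlapping ? m.start() + 1 : Math.max(m.end(), m.start() + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(starts("ab", "abababa", false));
        System.out.println(starts("bab", "abababa", false));
        System.out.println(starts("bab", "abababa", true));
    }
}
```

With the non-overlapping strategy "bab" should be found only at index 1, matching the behaviour described above; with the overlapping strategy it should also be found at index 3.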

Regular expression - Greedy quantifier [duplicate]

I am really struggling with this question:
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(args[0]);
        Matcher m = p.matcher(args[1]);
        boolean b = false;
        while (b = m.find()) {
            System.out.print(m.start() + m.group());
        }
    }
}
When the above program is run with the following command:
java Regex2 "\d*" ab34ef
It outputs 01234456. I don't really understand this output. Consider the following indexes for each of the characters:
a b 3 4 e f
^ ^ ^ ^ ^ ^
0 1 2 3 4 5
Shouldn't the output have been 0123445?
I have been reading around and it looks like the RegEx engine will also read the end of the string but I just don't understand. Would appreciate if someone can provide a step by step guide as to how it is getting that result. i.e. how it is finding each of the numbers.
It is helpful to change
System.out.print(m.start() + m.group());
to
System.out.println(m.start() + ": " + m.group());
This way the output is much clearer:
0:
1:
2: 34
4:
5:
6:
You can see that it matched at 6 different positions: at position 2 it matched the string "34", and at every other position it matched an empty string. The empty string also matches at the very end of the input (position 6, just past the last character), which is why you see "6" at the end of your output.
Note that if you run your program like this:
java Regex2 "\d+" ab34ef
it will only output
2: 34
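To see how the run-together output arises, the following sketch mirrors the question's loop but returns the text instead of printing it. The class ConcatenatedOutput and the method run are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class that rebuilds the program's printed output.
class ConcatenatedOutput {
    // Appends start index and matched text for every match, with no separator,
    // exactly like System.out.print(m.start() + m.group()) in the question.
    static String run(String regex, String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.append(m.start()).append(m.group());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Pieces: "0", "1", "2"+"34", "4", "5", "6"
        System.out.println(run("\\d*", "ab34ef"));
        // Only one piece: "2"+"34"
        System.out.println(run("\\d+", "ab34ef"));
    }
}
```

The final "6" comes from the zero-width match at the end of the input, where the start index equals the length of "ab34ef".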

How to group in regex

I have this input string (an OID): 1.2.3.4.5.66.77.88.99.10.52
I want to group the numbers in threes, like this:
Group 1 : 1.2.3
Group 2 : 4.5.66
Group 3 : 77.88.99
Group 4 : 10.52
It should adapt to the input: if it has 30 numbers, it should return 10 groups.
I have tested using this regex : (\d+.\d+.\d+)
But the result is this
Match 1: 1.2.3
Subgroups:
1: 1.2.3
Match 2: 4.5.66
Subgroups:
1: 4.5.66
Match 3: 77.88.99
Subgroups:
1: 77.88.99
But this still misses the last group (10.52).
Can anyone help me with the regex? Thank you.
\d+(?:\.\d+){0,2}
This is basically the same as Al's final regex - ((?:\d+\.){0,2}\d+) - but I think it's clearer this way. And there's no need to put parentheses around the whole regex. Assuming you're using Matcher.find() to get the matches, you can use group() or group(0) instead of group(1) to retrieve the matched text.
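A minimal sketch of using this pattern with Matcher.find() and group() is shown below; the class OidGroups and the method name are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class that splits an OID into chunks of up to three numbers.
class OidGroups {
    // One number, followed by at most two more ".number" parts.
    private static final Pattern CHUNK = Pattern.compile("\\d+(?:\\.\\d+){0,2}");

    static List<String> groups(String oid) {
        List<String> chunks = new ArrayList<>();
        Matcher m = CHUNK.matcher(oid);
        while (m.find()) {
            // group() is the whole match, so no outer parentheses are needed.
            chunks.add(m.group());
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(groups("1.2.3.4.5.66.77.88.99.10.52"));
    }
}
```

The dot between two chunks is skipped automatically, because a match cannot start on a dot.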
If you want to match up to three numbers, you should try:
((?:\d+\.?){1,3})
The {1,3} part matches 1-3 of the preceding item (which is one or more digits followed by an optional literal .). Note that the dot is escaped so that it doesn't match any character.
Edit
Further explanation: The (?: ) part is a grouping that cannot be used for backreferences (tends to be faster), see section 4.3 here for more information. You could, of course, also just use ((\d+\.?){1,3}) if you prefer. For more information on {1,3}, see here under "Limiting Repetition".
Edit (2)
Fixed error pointed out by dtmunir. An alternative way that is a bit more explicit (and doesn't catch the extra "." at the end of the early groups) is:
((?:\d+\.){0,2}\d+)
Al, that will not capture the 52. But this one in fact will:
((?:\d+\.?){1,3})
The only change is adding the question mark after the .
This allows it to accept the last number without having a period after it
Explanation (EDIT):
The \d+ as you can imagine captures consecutive digits.
The \. captures a period
The \.? captures a period, but allows the inner group to not require a period at the end
The (?:\d+\.?) defines "one group" which in your case you want to be 3 numbers.
The {1,3} sets the limits. It requires a minimum of 1 inner group and at most 3 inner groups. These groups may or may not end with a period.
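The practical difference between the two patterns discussed in this thread can be checked with a short sketch like the one below. The class TrailingDotCheck is a made-up name, and both patterns are written without the outer capturing parentheses since group() returns the whole match anyway.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class comparing the two candidate patterns.
class TrailingDotCheck {
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String oid = "1.2.3.4.5.66.77.88.99.10.52";
        // Optional dot after every number: early chunks keep a trailing ".".
        System.out.println(allMatches("(?:\\d+\\.?){1,3}", oid));
        // Dot required only between numbers: no trailing "." in any chunk.
        System.out.println(allMatches("(?:\\d+\\.){0,2}\\d+", oid));
    }
}
```

If the trailing dot matters for later processing, the second form may be the more convenient of the two.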
This is my weird code for doing this without regex :-)
public static String[] getTokens(String s) {
    String[] splitted = s.split("\\.");
    //Personally I hate Double.valueOf but I don't know how to avoid it
    String[] result = new String[Double.valueOf(Math.ceil(Double.valueOf(splitted.length) / 3)).intValue()];
    for (int i = 0, j = 0; j < splitted.length; i++, j+=3) {
        //Weird concat: join up to three consecutive numbers with dots
        result[i] = splitted[j] + ( j+1 < splitted.length ? "." + splitted[j+1] : "" ) + ( j+2 < splitted.length ? "." + splitted[j+2] : "" );
    }
    return result;
}
