Regular expression - Greedy quantifier [duplicate] - java

This question already has an answer here:
SCJP6 regex issue
(1 answer)
Closed 7 years ago.
I am really struggling with this question:
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(args[0]);
        Matcher m = p.matcher(args[1]);
        boolean b = false;
        while(b = m.find()) {
            System.out.print(m.start() + m.group());
        }
    }
}
When the above program is run with the following command:
java Regex2 "\d*" ab34ef
It outputs 01234456. I don't really understand this output. Consider the following indexes for each of the characters:
a b 3 4 e f
^ ^ ^ ^ ^ ^
0 1 2 3 4 5
Shouldn't the output have been 0123445?
I have been reading around, and it looks like the regex engine also examines the end of the string, but I don't understand why. I would appreciate a step-by-step explanation of how it gets that result, i.e. how it finds each of the numbers.

It is helpful to change
System.out.print(m.start() + m.group());
to
System.out.println(m.start() + ": " + m.group());
This way the output is much clearer:
0:
1:
2: 34
4:
5:
6:
You can see that it matched at 6 different positions: at position 2 it matched the string "34", and at every other position it matched an empty string. The empty string also matches at the end of the input, after the last character, which is why you see "6" at the end of your output.
Note that if you run your program like this:
java Regex2 "\d+" ab34ef
it will only output
2: 34
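To compare the two patterns side by side without passing command-line arguments, something like the following sketch can be used. The class name QuantifierDemo and the helper describeMatches are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: returns one "start: "text"" entry per match found.
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + ": \"" + m.group() + "\"");
        }
        return out;
    }

    public static void main(String[] args) {
        // \d* may match the empty string, so it reports a match at several positions.
        System.out.println(describeMatches("\\d*", "ab34ef"));
        // \d+ needs at least one digit, so only the run of digits is reported.
        System.out.println(describeMatches("\\d+", "ab34ef"));
    }
}
```

The first line printed should list six entries (five of them empty strings), the second a single entry for 34 at index 2.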

Related

Java - Regex pattern matching [duplicate]

I have issue with following example:
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(args[0]);
        Matcher m = p.matcher(args[1]);
        boolean b = false;
        while(b = m.find()) {
            System.out.print(m.start() + m.group());
        }
    }
}
And the command line:
java Regex2 "\d*" ab34ef
Can someone explain to me, why the result is: 01234456
The regex pattern is \d*, which I thought means one or more digits, but the output contains more positions than there are characters in args[1].
Thanks.
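\d* matches zero or more digits, so it also matches the empty string. It finds an empty match at every position where no digit follows, including the position after the last character: first at index 0, then at index 1, and so on. At index 2 it greedily consumes 34, so the next search resumes at index 4.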
So, for string ab34ef, it matches following groups:
Index  Group
0      ""   (before a)
1      ""   (before b)
2      34   (matches more than 0 digits this time)
4      ""   (before e at index 4)
5      ""   (before f)
6      ""   (at the end, after f)
If you use \\d+, then you will get just a single match: 34 at index 2.
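The run-together output 01234456 is simply each start index immediately followed by the matched text, because m.start() + m.group() is an int concatenated with a String. A minimal sketch that rebuilds that string (the class OutputBreakdown and the method concatenated are illustrative names, not part of the original program):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutputBreakdown {
    // Hypothetical helper: mimics System.out.print(m.start() + m.group())
    // by appending the start index and then the matched text for each match.
    static String concatenated(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.start());  // index where this match begins
            sb.append(m.group());  // matched text, possibly empty
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(concatenated("\\d*", "ab34ef"));
        System.out.println(concatenated("\\d+", "ab34ef"));
    }
}
```

Reading 01234456 piece by piece: 0, 1, then 2 followed by 34, then 4, 5, 6. With \\d+ only the 2 followed by 34 remains.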

Java regex pattern query [duplicate]

Just a quick question about Java regex patterns! So say if I had a method like..
public void example()
{
    Pattern p = Pattern.compile("\\d*");
    Matcher m = p.matcher("ab34ef");
    boolean b = false;
    while (b = m.find())
    {
        System.out.println(m.start() + " " + m.group());
    }
}
If I ran this I would end up with the following output..
0
1
2 34
4
5
6
I understand how this works apart from how it ends up at 6, I thought it would finish on 5 could someone please explain this to me? Thanks!
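In your string, "ab34ef", there are 7 positions at which a match can start: before each character and after the last one, i.e. the location of each | in the following: "|a|b|3|4|e|f|". The engine attempts a match at these positions rather than at the characters themselves, and \d* can match the empty string at any of them. The final position, after f, has index 6, which is why the output ends at 6. The position between 3 and 4 does not appear in the output because the match 34 starting at index 2 already consumed it.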
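Printing m.end() next to m.start() makes these positions visible. The following is only a sketch; MatchBoundaries and boundaries are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchBoundaries {
    // Hypothetical helper: returns {start, end} for every match (end is exclusive).
    static List<int[]> boundaries(String regex, String input) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        for (int[] span : boundaries("\\d*", "ab34ef")) {
            // An empty match has start == end; the match "34" spans 2..4.
            System.out.println("start=" + span[0] + " end=" + span[1]);
        }
    }
}
```

An empty match shows the same value for start and end. The match for 34 starts at 2 and ends at 4, and the following match starts at 4, so no match ever starts at index 3.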

Understanding regular expression output [duplicate]

I need help to understand the output of the code below. I am unable to figure out the output for System.out.print(m.start() + m.group());. Please can someone explain it to me?
import java.util.regex.*;
class Regex2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d*");
        Matcher m = p.matcher("ab34ef");
        boolean b = false;
        while(b = m.find()) {
            System.out.println(m.start() + m.group());
        }
    }
}
Output is:
0
1
234
4
5
6
Note that if I put System.out.println(m.start() );, output is:
0
1
2
4
5
6
Because you have included a * character, your pattern will match empty strings as well. When I change your code as I suggested in the comments, I get the following output:
0 ()
1 ()
2 (34)
4 ()
5 ()
6 ()
So you have several empty matches (one at each location in the string where no digit follows), with the exception of 34, which matches the string of digits. Use \\d+ if you want to match digits without also matching empty strings.
You used this regex - \d* - which basically means zero or more digits. Mind the zero!
So this pattern will match any group of digits, e.g. 34 plus any other position in the string, where the matched sequence will be the empty string.
So, you will have 6 matches, starting at indices 0,1,2,4,5,6. For match starting at index 2, the matched sequence is 34, while for the remaining ones, the match will be the empty string.
If you want to find only digits, you might want to use this pattern: \d+
\d* matches zero or more digits in the expression.
The expression ab34ef has the corresponding indices 012345.
At index 0 there are no digits, so the match is empty: start() prints 0 and group() prints nothing. The same happens at index 1. At index 2 there is a non-empty match, so it prints 2 and 34. Next it prints 4 and nothing, and so on.
Another example:
Pattern pattern = Pattern.compile("\\d\\d");
Matcher matcher = pattern.matcher("123ddc2ab23");
while(matcher.find()) {
System.out.println("start:" + matcher.start() + " end:" + matcher.end() + " group:" + matcher.group() + ";");
}
which prints:
start:0 end:2 group:12;
start:9 end:11 group:23;
You will find more information in the tutorial

How to precisely identify & work greedy or reluctant quantifiers? [duplicate]

Given:
import java.util.regex.*;
class Regex2 {
    public static void main (String args[]) {
        Pattern p = Pattern.compile(args[0]);
        Matcher m = p.matcher(args[1]);
        boolean b = false;
        while (m.find()) {
            System.out.print(m.start() + m.group());
        }
    }
}
the command line expression is :
java Regex2 "\d*" ab34ef
What is the result?
A. 234
B. 334
C. 2334
D. 0123456
E. 01234456
F. 12334567
G. Compilation fails
The SCJP book explains regex, pattern and matchers so horribly it's unbelievable.
Anyway, I pretty much understand most of the basics and have looked at the Sun/Oracle documentation about greedy and reluctant quantifiers. I understand the concepts but am a bit blurry about a few things:
What exactly is the physical symbol of a "greedy" quantifier? Is it simply a single *, ?, or +?
If so, can someone explain in detail how this answer turns out to be E according to the book? When I run it myself I get the answer: 2334!
Here we would be using a greedy quantifier correct? This would consume the entire string and then backtrack and look for zero or more digits in a row. Thus, if greedy, the 'full string' would contain 2 digits in a row and would execute .find() only once (ie. m.start = 0 , m.group = "ab34ef"), by that definition!
Thanks for the help guys.
These are the matches of \d* against "ab34ef":
index 0: zero-width;
index 1: zero-width;
index 2: "34";
index 4: zero-width;
index 5: zero-width;
index 6: zero-width.
This should explain your output. If the quantifier were reluctant (\d*?), the only difference would be that the single match "34" at index 2 is replaced by two zero-width matches:
index 2: zero-width;
index 3: zero-width;
The reluctant quantifier grabs as little as allowed to make the entire expression match.
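The difference can be checked with a small sketch like the one below. GreedyVsReluctant and run are names invented for this example, and the third pattern, \d*?e, is an extra illustration that is not part of the original question.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: lists every match as start:[text], separated by spaces.
    static String run(String regex, String input) {
        StringJoiner out = new StringJoiner(" ");
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + ":[" + m.group() + "]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Greedy: takes as many digits as it can, so 34 is one match.
        System.out.println(run("\\d*", "ab34ef"));
        // Reluctant: takes as few digits as it can, so every match is empty.
        System.out.println(run("\\d*?", "ab34ef"));
        // Reluctant, but the overall pattern needs a trailing e, so it has to
        // extend over both digits before the match can succeed.
        System.out.println(run("\\d*?e", "ab34ef"));
    }
}
```

The last pattern shows the "as little as allowed" rule: \d*? on its own is satisfied with nothing, but when an e must follow it expands just far enough to reach one.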
