Not getting * quantifier correctly in regex?

Not getting * quantifier correctly in regex? - java

I am new to regex and I'm going through the regex quantifier section. I have a question about the * quantifier. Here is the definition of the * quantifier:
X* - Finds no or several letter X
.* - any character sequence
Based on the above definition, I wrote a small program:
public static void testQuantifier() {
String testStr = "axbx";
System.out.println(testStr.replaceAll("x*", "M"));
//my expected output is MMMM but actual output is MaMMbMM
/*
Logic behind my expected output is:
1. it encounters a which means 0 x is found. It should replace a with M.
2. it encounters x which means 1 x is found. It should replace x with M.
3. it encounters b which means 0 x is found. It should replace b with M.
4. it encounters x which means 1 x is found. It should replace x with M.
so output should be MMMM but why it is MaMMbMM?
*/
System.out.println(testStr.replaceAll(".*", "M"));
//my expected output is M but actual output is MM
/*
Logic behind my expected output is:
It encounters axbx, which is any character sequence, it should
replace complete sequence with M.
So output should be M but why it is MM?
*/
}
UPDATE:-
As per the revised understanding, I expect the output as MaMMbM but not MaMMbMM. So I'm not getting why I get an extra M in the end?
My revised understanding for the first regex is:
1. it encounters a which means 0 x is found. It should replace a with Ma.
2. it encounters x which means 1 x is found. It should replace x with M.
3. it encounters b which means 0 x is found. It should replace b with Mb.
4. it encounters x which means 1 x is found. It should replace x with M.
5. Lastly it encounters end of string at index 4. So it replaces 0x at end of String with M.
(Though I find it strange to consider also the index for end of string)
So the first part is clear now.
Also if somebody can clarify on the second regex, it would be helpful.

This is where you're going wrong:
first it encounters a which means 0 x is found. So it should replace a with M.
No - it means that 0 xs are found and then an a is found. You haven't said that the a should be replaced by M... you've said that any number of xs (including 0) should be replaced by M.
If you want every character to be replaced by M, you should just use .:
System.out.println(testStr.replaceAll(".", "23"));
(I would personally have expected a result of MaMbM - I'm looking into why you get MaMMbMM instead - I suspect it's because there's a sequence of 0 xs between the x and the b, but it still seems a little odd to me.)
EDIT: It becomes a bit clearer if you look at where your pattern matches. Here's code to show that:
Pattern pattern = Pattern.compile("x*");
Matcher matcher = pattern.matcher("axbx");
while (matcher.find()) {
System.out.println(matcher.start() + "-" + matcher.end());
}
Results (bear in mind that the end is exclusive) with a bit of explanation:
0-0 (index 0 = 'a', doesn't match)
1-2 (index 1 = 'x', matches)
2-2 (index 2 = 'b', doesn't match)
3-4 (index 3 = 'x', matches)
4-4 (index 4 is the end of the string)
If you replace each of those matches with "M", you end up with the output you're actually getting.
I think the fundamental problem is that if you've got a pattern which can match (in its entirety) the empty string, you can argue that that pattern occurs an infinite number of times between any two characters in the input. I would probably try to avoid such patterns where possible - make sure that any match has to include at least one character.

a and b are not replaced because they are not matched by your regex. The xes and the empty strings before a non-matching letter or before the end of the string are replaced.
Let's see what happens:
We're at the start of the string. The regex engine tries to match an x but fails, because there is an a here.
The regex engine backtracks because x* also allows zero repetitions of x. We have a match and replace with M.
The regex engine advances past the a and successfully matches x. Replace by M.
The regex engine now tries to match x at the current position (after the previous match), which is right before b. It can't.
But it can backtrack again, matching zero xes here. Replace by M.
The regex engine advances past the b and successfully matches x. Replace by M.
The regex engine now tries to match x at the current position (after the previous match), which is at the end of the string. It can't.
But it can backtrack again, matching zero xes here. Replace by M.
This is implementation-dependent, by the way. In Python, for example, it's
>>> re.sub("x*", "M", "axbx")
'MaMbM'
because there, empty matches for the pattern are replaced only when not adjacent to a previous match.

Related

Regex to validate exact 8 or exact 14 character [duplicate]

Consider the following regular expression, where X is any regex.
X{n}|X{m}
This regex would test for X occurring exactly n or m times.
Is there a regex quantifier that can test for an occurrence X exactly n or m times?

There is no single quantifier that means "exactly m or n times". The way you are doing it is fine.
An alternative is:
X{m}(X{k})?
where m < n and k is the value of n-m.

Here is the complete list of quantifiers (ref. http://www.regular-expressions.info/reference.html):
?, ?? - 0 or 1 occurences (?? is lazy, ? is greedy)
*, *? - any number of occurences
+, +? - at least one occurence
{n} - exactly n occurences
{n,m} - n to m occurences, inclusive
{n,m}? - n to m occurences, lazy
{n,}, {n,}? - at least n occurence
To get "exactly N or M", you need to write the quantified regex twice, unless m,n are special:
X{n,m} if m = n+1
(?:X{n}){1,2} if m = 2n
...

No, there is no such quantifier. But I'd restructure it to /X{m}(X{m-n})?/ to prevent problems in backtracking.

Very old post, but I'd like to contribute sth that might be of help.
I've tried it exactly the way stated in the question and it does work but there's a catch:
The order of the quantities matters. Consider this:
#[a-f0-9]{6}|#[a-f0-9]{3}
This will find all occurences of hex colour codes (they're either 3 or 6 digits long). But when I flip it around like this
#[a-f0-9]{3}|#[a-f0-9]{6}
it will only find the 3 digit ones or the first 3 digits of the 6 digit ones. This does make sense and a Regex pro might spot this right away, but for many this might be a peculiar behaviour. There are some advanced Regex features that might avoid this trap regardless of the order, but not everyone is knee-deep into Regex patterns.

TLDR; (?<=[^x]|^)(x{n}|x{m})(?:[^x]|$)
Looks like you want "x n times" or "x m times", I think a literal translation to regex would be (x{n}|x{m}).
Like this https://regex101.com/r/vH7yL5/1
or, in a case where you can have a sequence of more than m "x"s (assuming m > n), you can add 'following no "x"' and 'followed by no "x", translating to [^x](x{n}|x{m})[^x] but that would assume that there are always a character behind and after you "x"s. As you can see here: https://regex101.com/r/bB2vH2/1
you can change it to (?:[^x]|^)(x{n}|x{m})(?:[^x]|$), translating to "following no 'x' or following line start" and "followed by no 'x' or followed by line end". But still, it won't match two sequences with only one character between them (because the first match would require a character after, and the second a character before) as you can see here: https://regex101.com/r/oC5oJ4/1
Finally, to match the one character distant match, you can add a positive look ahead (?=) on the "no 'x' after" or a positive look behind (?<=) on the "no 'x' before", like this: https://regex101.com/r/mC4uX3/1
(?<=[^x]|^)(x{n}|x{m})(?:[^x]|$)
This way you will match only the exact number of 'x's you want.

Taking a look at Enhardened's answer, they state that their penultimate expression won't match sequences with only one character between them. There is an easy way to fix this without using look ahead/look behind, and that's to replace the start/end character with the boundary character. This lets you match against word boundaries which includes start/end. As such, the appropriate expression should be:
(?:[^x]|\b)(x{n}|x{m})(?:[^x]|\b)
As you can see here: https://regex101.com/r/oC5oJ4/2.

Java Regex Quantifiers in String Split

The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?

The pattern you're using only matches one digit a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit it's going to split on both of them individually. You can easily match more than one digit at a time more than a few different ways, here are a couple:
\d match a digit [0-9]
+? matches between one and unlimited times (lazy)
Or you could just do:
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
Which would likely be the closest to what I would think you would want, although it's unclear.
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \digits as a group between one and unlimited times+) followed up by the greedy qualifier ?. I recommend visiting one of the popular regex sites that allows you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
↳ The solution can be found here

The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So if the matches found are the substrings denoted by the x, where an x under a blank means the match is an empty string:
a 1 2 i j
x x x x x x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.

Please explain the output this regex (starts with a positive lookahead)

Pattern p = Pattern.compile("(?=[1-9][0-9]{2})[0-9]*[05]");
Matcher m = p.matcher("101");
while(m.find()){
System.out.println(m.start()+":"+ m.end()+ m.group());
}
Output------ >> 0:210
Please let me know why I am getting output of m.group() as 10 here.
As far as I understand m.group() should return nothing because [05] matches to nothing.

Your Pattern, (?=[1-9][0-9]{2})[0-9]*[05] consists of 2 parts:
(?=[1-9][0-9]{2})
and
[0-9]*[05]
The first part is a zero-width positive lookahead which searches for a number of length 3, and the first can not be 0. This matches your 101.
The second part searches for any amount of numbers and then a 0 or a 5. This matches the first 2 characters of 101, thus the result is 10.
See Java - Pattern for more information.

What your Regex is looking for is:
[1-9]:
match a single character present in the list below
1-9 a single character in the range between 1 and 9
[0-9]{2}:
match a single character present in the list below
Quantifier: {2} Exactly 2 times
0-9 a single character in the range between 0 and 9
[0-9]*:
match a single character present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
[05]:
match a single character present in the list below
05 a single character in the list 05 literally
for the String "101" this nacht the first 2 chars 101,
so you are printing.out:
System.out.println(**m.start()**+":"+ **m.end()**+ m.group());
where m.start() returns the start index of the previous match(char at
0). where m.end() returns the offset after the last character matched.
and where m.group() returns the input subsequence matched by the previous
match.

That regex was meant to match a number that's a multiple of 5 and greater than or equal to 100, but it's useless without anchors. It should be:
^(?=[1-9][0-9]{2}$)[0-9]*[05]$
The anchors ensure that both the lookahead and the main part are examining the whole of the string. But the task doesn't require a lookahead anyway; this works just fine:
^[1-9][0-9][05]$

As #AlanMoore states, there has to be an alignment.
Assertions are a self contained entity, all they have to do is Pass
to advance to the next construct.
Lets see what (?=[1-9][0-9]{2}) matches;
1111111110666
2222222222222222225666
33333333333333333333333330666
So far so good, on to the next construct.
Lets see what [0-9]*[05] matches.
What ever this matches is the final answer.
1111111110666
2222222222222222225666
33333333333333333333333330666
What to learn is that to get a cohesive answer, assertions have to be crafted to
coincide with constructs that come after them.
Here is an example of a constraint that could be applied
after the assertion.
The assertion state's it needs three digits and the first digit must be >= 1.
The constructs after the assertion state it can be any number of digit's,
as long as it ends with a 0 or 5.
This last part is distressing since it will match only the 500000
So for sure, you need at least three digits.
That can be done like this:
[0-9]{2,}[05]
This says two things
There must be at least three digits, but can be more
It must end with a 0 or 5.
That's it, put it all together, its:
(?=[1-9][0-9]{2})[0-9]{2,}[05]
Of course, this can be condensed to;
[1-9][0-9]+[05]

Why the zero-length character always remains at the end of the source string for java regex pattern a?

Pattern pattern = Pattern.compile("a?");
Matcher matcher = pattern.matcher("a");
while(matcher.find()){
System.out.println(matcher.start()+"["+matcher.group()+"]"+matcher.end());
}
Output :
0[a]1
1[]1
why this gives me two outputs while there is a single characters as the matcher.
I noticed that for this pattern it gives an zero-length always at the end of the source string.
Eg : when source is "abab" it gives
0[a]1
1[]1
2[a]3
3[]3
4[]4

The regex special character ? (question mark) means "match the preceding thing zero or one time".
Since you are matching in a while loop (while (matcher.find()) {...) it finds both matches of the expression - one occurrence of "a" (at position 0, the string "a") and zero occurrences of "a" (at position 1, the empty string at the very end).
So here's what your code snippet is matching (start/end indices are denoted by X/Y):
String: " a b a b "
├─┼─┼─┼─┤
Index: 0 1 2 3 4
Match: ╰┬╯ ╰┬╯ ╰- the empty string 4/4 (zero occurrences of "a").
|| |╰- the empty string 3/3 (zero occurrences of "a").
|| ╰ the string "a" 2/3 (one occurrence of "a").
|╰ the empty string 1/1 (zero occurrences of "a").
╰ the string "a" 0/1 (one occurrence of "a").
It doesn't match at positions 0/0 or 2/2 since the expression is greedy, which means it will try to consider the next character (at positions 0/1, 2/3) as long as it doesn't invalidate the match, which it doesn't so they are skipped. To illustrate, if you were to match the string "bbbb" against the pattern a? then you would get five empty strings, one for each empty string at the beginning, end, and between each character.

Have a look at
http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
It explains your case in detail under the section Zero-Length Matches

a? stands for 0-or-1 occurrances of the character a.
The empty string is matching the 0 occurrence.
The matching is also greedy in you case, so it matches the 1 occurrance first, then the 0 occurrance at the end.
In the abab case, think of it as a[]ba[]b[], where [] denotes the empty occurrance found. The matcher does not find it in the beginning or after the first b, because it can greedily match on a.

Matching the empty space after the last character is not universal.
The Vim editor has this behavior:
Buffer before:
aaaa
~
~
:s/x\?/y/g <- command
Buffer after:
yayaya
~
~
No x occurs in aaaa but the x? (written x\? in Vim by default) allows an empty match. The pattern matches the empty space at the start of the string and between
all the characters, but not past the end.
The exception is if the line is empty. The command will replace a blank line with a single y.
I implemented the Vim-like behavior in my own program:
$ txr -c '#(bind result #(regsub #/x?/ "y" "aaaa"))'
result="yayayaya"
$ txr -c '#(bind result #(regsub #/x?/ "y" ""))'
result="y"
Only because Vim is popular and I can point to that as the reference model if any questions come up. But it's a bit of a hack. The logic has a do .. while loop, which allows an incoming empty string to be processed:
do {
/* regex match, extraction, substitution ... */
position++;
} while (position < length(input))
So if the starting position is zero and the input has length zero, we do the loop once, applying the regex to the empty string. But if we process the last character, position reaches the length and the loop terminates without processing the empty string.
Originally, I had the loop test at the top, so it was behaving like Vim, but not in the empty input case, which would not match regexes that match on empty.
The behavior of the Java class you're using might be implemented like this:
while (position <= length(input)) {
/* process regex */
position++;
}

Strange behavior in regexes

There was a question about regex and trying to answer I found another strange things.
String x = "X";
System.out.println(x.replaceAll("X*", "Y"));
This prints YY. why??
String x = "X";
System.out.println(x.replaceAll("X*?", "Y"));
And this prints YXY
Why reluctant regex doesn't match 'X' character? There is "noting"X"nothing" but why first doesn't match three symbols and matches two and then one instead of three? and second regex matches only "nothing"s and not X?

Let's consider them in turn:
"X".replaceAll("X*", "Y")
There are two matches:
At character position 0, X is matched, and is replaced with Y.
At character position 1, the empty string is matched, and Y gets added to the output.
End result: YY.
"X".replaceAll("X*?", "Y")
There are also two matches:
At character position 0, the empty string is matched, and Y gets added to the output. The character at this position, X, was not consumed by the match, and is therefore copied into the output verbatim.
At character position 1, the empty string is matched, and Y gets added to the output.
End result: YXY.

The * is a tricky 'quantifier' since it means '0 or more'. Thus, it also matches '0 times X' (i.e. an empty string).
I would use
"X".replaceAll("X+", "Y")
which has the expected behaviour.

In your first example you are using a "Greedy" quantifier. This means that the input string is forced to be read entirely before attempting the first match, so the first match tried is the whole input. If the input matches, the matcher goes past the input string and performs the zero-length match at the end of the string hence the two matches you see. The greedy matcher never backs-off to the zero-length match before the character X before the first match attempt was successful.
On the second example you are using a "Reluctant" quantifier which does the opposite of "Greedy". It starts at the beginning and tries to match one character at the time going forward (if it has to). So the zero-length match before the "X" character is matched, matcher moves forward by one (that's why you still see the "X" character in the output) where the next match is now the zero-length match after the "X".
There is a good tutorial here: http://docs.oracle.com/javase/tutorial/essential/regex/quant.html

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Not getting * quantifier correctly in regex? - java

Related

Regex to validate exact 8 or exact 14 character [duplicate]

Java Regex Quantifiers in String Split

Please explain the output this regex (starts with a positive lookahead)

Why the zero-length character always remains at the end of the source string for java regex pattern a?

Strange behavior in regexes

Categories

Resources