Strange behavior in regexes - java

There was a question about regex, and while trying to answer it I found another strange thing.
String x = "X";
System.out.println(x.replaceAll("X*", "Y"));
This prints YY. Why?
String x = "X";
System.out.println(x.replaceAll("X*?", "Y"));
And this prints YXY
Why doesn't the reluctant regex match the 'X' character? The input is "nothing"X"nothing", so why doesn't the first regex match three symbols, matching two and then one instead? And why does the second regex match only the "nothing"s and not the X?

Let's consider them in turn:
"X".replaceAll("X*", "Y")
There are two matches:
At character position 0, X is matched, and is replaced with Y.
At character position 1, the empty string is matched, and Y gets added to the output.
End result: YY.
"X".replaceAll("X*?", "Y")
There are also two matches:
At character position 0, the empty string is matched, and Y gets added to the output. The character at this position, X, was not consumed by the match, and is therefore copied into the output verbatim.
At character position 1, the empty string is matched, and Y gets added to the output.
End result: YXY.
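To see these matches directly, here is a minimal sketch that lists the start and end index of every match that Matcher.find() reports for each pattern. The class name MatchSpans and the helper spans are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Hypothetical helper: collects "start-end" (end exclusive) for every match.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("X*", "X"));   // [0-1, 1-1]  -> "YY"
        System.out.println(spans("X*?", "X"));  // [0-0, 1-1]  -> "YXY"
    }
}
```

A span such as 1-1 is an empty match: it consumes nothing, but replaceAll still inserts the replacement there.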

The * is a tricky 'quantifier' since it means '0 or more'. Thus, it also matches '0 times X' (i.e. an empty string).
I would use
"X".replaceAll("X+", "Y")
which has the expected behaviour.

In your first example you are using a "Greedy" quantifier. This means that the matcher reads as much of the input as it can before attempting the first match, so the first match tried is the whole input. Since the input matches, the matcher moves past it and performs the zero-length match at the end of the string, hence the two matches you see. The greedy matcher never backs off to the zero-length match before the character X, because its first match attempt was already successful.
In the second example you are using a "Reluctant" quantifier, which does the opposite of "Greedy". It starts at the beginning and tries to match one character at a time going forward (if it has to). So the zero-length match before the "X" character is matched, the matcher moves forward by one (that's why you still see the "X" character in the output), and the next match is the zero-length match after the "X".
There is a good tutorial here: http://docs.oracle.com/javase/tutorial/essential/regex/quant.html

Related

Java Regex Quantifiers in String Split

The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?
The pattern you're using only matches one digit a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit, it's going to split on each of them individually. You can easily match more than one digit at a time in more than a few different ways; here are a couple:
\d match a digit [0-9]
+? matches between one and unlimited times (lazy)
Or you could just do:
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
This is probably the closest to what you want, although that's unclear.
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \digits as a group between one and unlimited times+) followed up by the greedy qualifier ?. I recommend visiting one of the popular regex sites that allows you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So the matches found are the substrings denoted by the x's below, where an x under a blank means the match is an empty string:
 a 1 2 i j
x  x xx x x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.
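The correspondence between find() and split() described above can be checked with a short sketch. The class SplitDemo and the helper delimiters are invented names, and the split() results shown assume Java 8 or later, where a zero-width match at the start no longer produces a leading empty string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitDemo {
    // Hypothetical helper: the delimiters that successive find() calls report.
    static List<String> delimiters(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a12ij";
        // Empty match at 0, "1", "2", then empty matches at 3, 4 and 5.
        System.out.println(delimiters("\\d?", s));            // [0-0, 1-2, 2-3, 3-3, 4-4, 5-5]
        System.out.println(Arrays.toString(s.split("\\d?"))); // [a, , , i, j]
        // Requiring at least one digit removes the empty delimiters.
        System.out.println(Arrays.toString(s.split("\\d+"))); // [a, ij]
    }
}
```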

Java regular expression boundary match?

I found the following question in one Java test suite
Pattern p = Pattern.compile("[wow]*");
Matcher m = p.matcher("wow its cool");
boolean b = false;
while (b = m.find()) {
System.out.print(m.start() + " \"" + m.group() + "\" ");
}
where the output seems to be as follows
0 "wow" 3 "" 4 "" 5 "" 6 "" 7 "" 8 "" 9 "oo" 11 "" 12 ""
Up to the last match it is clear: the pattern [wow]* greedily matches 0 or more 'w' and 'o' characters, while for non-matching characters, including spaces, it results in empty strings. However, after the empty match before the last 'l' (11 ""), the following 12 "" is unclear. There is no explanation of this in the test solution, nor was I able to figure it out definitively from the javadoc. My best guess is a boundary character, but I would appreciate it if someone could provide an explanation.
The reason that you see this behavior is that your pattern allows empty matches. In other words, if you pass it an empty string, you would see a single match at position zero:
Pattern p = Pattern.compile("[wow]*"); // One of the two 'w's is redundant, but the engine is OK with it
Matcher m = p.matcher(""); // Passing an empty string results in a valid match that is empty
boolean b = false;
while (b = m.find()) {
System.out.print(m.start() + " \"" + m.group() + "\" ");
}
this would print 0 "" because an empty string is as good a match as any other match for the expression.
Going back to your example, every time the engine finds a match, it resumes the search where that match ended; after an empty match it first advances by a single character, so that it does not find the same empty match again. "Advancing by one" means that the engine considers the "tail" of the string at the next position. This includes the time when the regex engine is at position 11, i.e. at the very last character: here, the "tail" consists of an empty string. This is similar to calling "wow its cool".substring(12): you would get an empty string in that case as well.
The engine considers an empty string a valid input, and tries to match it against your expression, as shown in the example above. This produces a match, which your program properly reports.
[wow]* Matches the first wow string. count = 1
Because of the * (zero or more) after the character class, [wow]* also matches the empty string that exists before any character the pattern cannot match. So it matches the empty string just before the first space. Count = 2.
its is not matched by the above regex, so it matches the empty string that exists before each of its characters. So the count is 2+3=5.
The second space is not matched by the above regex either, so we get an empty string as a match. 5+1=6
c is not matched by the above regex, so it matches the empty string just before the c. 6+1=7
oo is matched by the above regex [wow]*, and this is considered one match. So we get 7+1=8 as the count.
l is not matched. Count = 9
At the last it matches the empty string which exists next to the last character. So now the count is 9+1=10
And finally we all know that the m.start() prints the starting index of the corresponding match.
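The count above can be reproduced with a small sketch that collects m.start() for every match. The class WowMatches and the helper starts are names chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WowMatches {
    // Hypothetical helper: the start index of every match, empty ones included.
    static List<Integer> starts(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> s = starts("[wow]*", "wow its cool");
        System.out.println(s);        // [0, 3, 4, 5, 6, 7, 8, 9, 11, 12]
        System.out.println(s.size()); // 10
    }
}
```

Index 10 is missing because it lies inside the match "oo" (9 to 11), and index 12 is the empty match after the last character.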
The regex is simply matching the pattern against the input, starting at a given offset.
For the last match, the offset of 12 is at the point after the last character of 'cool' - you might think this is the end of the string and therefore cannot be used for matching purposes - but you'd be wrong. For pattern-matching, this is a perfectly valid starting point.
As you state, your regex includes the possibility of zero characters, and indeed this is what is matched after the last character but before the end-of-string marker (usually represented by $ in a regex).
To put it another way, without testing past the end of the last character, it would mean no matches would ever occur relating to the end of the string - but there are many regex constructs that match the end of the string (and you've shown one of them here).
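As a quick check that the offset just past the last character is a legitimate match position, the following sketch asks where the end-of-input constructs $ and \z match. The class EndOfInputMatch and the helper firstMatchStart are illustrative names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndOfInputMatch {
    // Hypothetical helper: start index of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "wow its cool";
        System.out.println(input.length());                // 12
        System.out.println(firstMatchStart("$", input));   // 12
        System.out.println(firstMatchStart("\\z", input)); // 12
        System.out.println(input.substring(12).isEmpty()); // true
    }
}
```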

Not getting * quantifier correctly in regex?

I am new to regex and I'm going through the regex quantifier section. I have a question about the * quantifier. Here is the definition of the * quantifier:
X* - Finds no or several letter X
.* - any character sequence
Based on the above definition, I wrote a small program:
public static void testQuantifier() {
String testStr = "axbx";
System.out.println(testStr.replaceAll("x*", "M"));
//my expected output is MMMM but actual output is MaMMbMM
/*
Logic behind my expected output is:
1. it encounters a which means 0 x is found. It should replace a with M.
2. it encounters x which means 1 x is found. It should replace x with M.
3. it encounters b which means 0 x is found. It should replace b with M.
4. it encounters x which means 1 x is found. It should replace x with M.
so output should be MMMM but why it is MaMMbMM?
*/
System.out.println(testStr.replaceAll(".*", "M"));
//my expected output is M but actual output is MM
/*
Logic behind my expected output is:
It encounters axbx, which is any character sequence, it should
replace complete sequence with M.
So output should be M but why it is MM?
*/
}
UPDATE:-
As per the revised understanding, I expect the output as MaMMbM but not MaMMbMM. So I'm not getting why I get an extra M in the end?
My revised understanding for the first regex is:
1. it encounters a which means 0 x is found. It should replace a with Ma.
2. it encounters x which means 1 x is found. It should replace x with M.
3. it encounters b which means 0 x is found. It should replace b with Mb.
4. it encounters x which means 1 x is found. It should replace x with M.
5. Lastly it encounters end of string at index 4. So it replaces 0x at end of String with M.
(Though I find it strange to consider also the index for end of string)
So the first part is clear now.
Also if somebody can clarify on the second regex, it would be helpful.
This is where you're going wrong:
first it encounters a which means 0 x is found. So it should replace a with M.
No - it means that 0 xs are found and then an a is found. You haven't said that the a should be replaced by M... you've said that any number of xs (including 0) should be replaced by M.
If you want every character to be replaced by M, you should just use .:
System.out.println(testStr.replaceAll(".", "M"));
(I would personally have expected a result of MaMbM - I'm looking into why you get MaMMbMM instead - I suspect it's because there's a sequence of 0 xs between the x and the b, but it still seems a little odd to me.)
EDIT: It becomes a bit clearer if you look at where your pattern matches. Here's code to show that:
Pattern pattern = Pattern.compile("x*");
Matcher matcher = pattern.matcher("axbx");
while (matcher.find()) {
System.out.println(matcher.start() + "-" + matcher.end());
}
Results (bear in mind that the end is exclusive) with a bit of explanation:
0-0 (index 0 = 'a', doesn't match)
1-2 (index 1 = 'x', matches)
2-2 (index 2 = 'b', doesn't match)
3-4 (index 3 = 'x', matches)
4-4 (index 4 is the end of the string)
If you replace each of those matches with "M", you end up with the output you're actually getting.
I think the fundamental problem is that if you've got a pattern which can match (in its entirety) the empty string, you can argue that that pattern occurs an infinite number of times between any two characters in the input. I would probably try to avoid such patterns where possible - make sure that any match has to include at least one character.
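As a small illustration of that advice, the sketch below compares patterns that can match the empty string with variants that must consume at least one character. The class AvoidEmptyMatches and the helper replaceWithM are names made up for this example.

```java
class AvoidEmptyMatches {
    // Hypothetical helper: replaces every match of regex in input with "M".
    static String replaceWithM(String input, String regex) {
        return input.replaceAll(regex, "M");
    }

    public static void main(String[] args) {
        String testStr = "axbx";
        // Patterns that can match the empty string produce extra replacements.
        System.out.println(replaceWithM(testStr, "x*")); // MaMMbMM
        System.out.println(replaceWithM(testStr, ".*")); // MM
        // Requiring at least one character avoids the empty matches.
        System.out.println(replaceWithM(testStr, "x+")); // aMbM
        System.out.println(replaceWithM(testStr, ".+")); // M
    }
}
```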
a and b are not replaced because they are not matched by your regex. The xes and the empty strings before a non-matching letter or before the end of the string are replaced.
Let's see what happens:
We're at the start of the string. The regex engine tries to match an x but fails, because there is an a here.
The regex engine backtracks because x* also allows zero repetitions of x. We have a match and replace with M.
The regex engine advances past the a and successfully matches x. Replace by M.
The regex engine now tries to match x at the current position (after the previous match), which is right before b. It can't.
But it can backtrack again, matching zero xes here. Replace by M.
The regex engine advances past the b and successfully matches x. Replace by M.
The regex engine now tries to match x at the current position (after the previous match), which is at the end of the string. It can't.
But it can backtrack again, matching zero xes here. Replace by M.
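This is implementation-dependent, by the way. In Python versions before 3.7, for example, it's
>>> re.sub("x*", "M", "axbx")
'MaMbM'
because there, empty matches for the pattern are replaced only when not adjacent to a previous match. (Since Python 3.7, empty matches adjacent to a previous non-empty match are replaced as well, so the result is 'MaMMbMM', the same as in Java.)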
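For the Java side, the step-by-step walk above can be mirrored with an explicit find() loop. This is a sketch of the same find-and-append idea using Matcher.appendReplacement and Matcher.appendTail; the class ManualReplaceAll and the method replaceAllManually are illustrative names, and the code is not copied from the JDK source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualReplaceAll {
    // Hypothetical helper: replaces every match, empty ones included.
    static String replaceAllManually(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Copies the text between the previous match and this one,
            // then appends the replacement (quoted so it is taken literally).
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb); // copies whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllManually("x*", "axbx", "M")); // MaMMbMM
        System.out.println("axbx".replaceAll("x*", "M"));          // MaMMbMM
    }
}
```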

Using string.matches to check if last character is vowel

I am new to matches in Java. I want to determine if the last character of a string is a vowel (i.e. aeiou). For example, if the string is abcde, then it is OK. But if it is eaoid, then it is wrong.
str.matches(".*[aeiou]$");
.* matches any character zero or more times
[aeiou] matches one of the characters in the set
$ matches the end of the string.
So "abcde".matches(".*[aeiou]$") == true and "eaoid".matches(".*[aeiou]$") == false
The matches() method in Java must match the whole string in order to return true, so you need to start the regex with .* and finish it with a character class (square brackets around a list of characters), which is the regex way of saying "one of these characters".
If you want to match strings that end in either an upper or a lower case vowel:
str.matches(".*[AEIOUaeiou]");
or even more simply:
str.matches(".*(?i)[aeiou]");
The regex (?i) means "ignore case"
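Here is a short sketch that puts the two approaches side by side for single-line strings: matches(), which has to cover the whole input, and find() with an end anchor, which does not need the leading .* at all. The class EndsWithVowel and its method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

class EndsWithVowel {
    // Compiled once; CASE_INSENSITIVE plays the same role as the inline (?i).
    private static final Pattern LAST_VOWEL =
            Pattern.compile("[aeiou]$", Pattern.CASE_INSENSITIVE);

    // matches() must match the entire string, hence the leading ".*".
    static boolean viaMatches(String s) {
        return s.matches(".*(?i)[aeiou]");
    }

    // find() looks for the pattern anywhere, so the "$" anchor does the work.
    static boolean viaFind(String s) {
        return LAST_VOWEL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(viaMatches("abcde")); // true
        System.out.println(viaMatches("eaoid")); // false
        System.out.println(viaFind("ABCDE"));    // true
        System.out.println(viaFind("eaoid"));    // false
    }
}
```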

Why the zero-length character always remains at the end of the source string for java regex pattern a?

Pattern pattern = Pattern.compile("a?");
Matcher matcher = pattern.matcher("a");
while(matcher.find()){
System.out.println(matcher.start()+"["+matcher.group()+"]"+matcher.end());
}
Output :
0[a]1
1[]1
Why does this give me two outputs when the input is a single character?
I noticed that for this pattern it always gives a zero-length match at the end of the source string.
E.g., when the source is "abab" it gives
0[a]1
1[]1
2[a]3
3[]3
4[]4
The regex special character ? (question mark) means "match the preceding thing zero or one time".
Since you are matching in a while loop (while (matcher.find()) {...) it finds both matches of the expression - one occurrence of "a" (at position 0, the string "a") and zero occurrences of "a" (at position 1, the empty string at the very end).
So here's what your code snippet is matching (start/end indices are denoted by X/Y):
String: " a b a b "
├─┼─┼─┼─┤
Index: 0 1 2 3 4
Match: ╰┬╯ ╰┬╯ ╰- the empty string 4/4 (zero occurrences of "a").
|| |╰- the empty string 3/3 (zero occurrences of "a").
|| ╰ the string "a" 2/3 (one occurrence of "a").
|╰ the empty string 1/1 (zero occurrences of "a").
╰ the string "a" 0/1 (one occurrence of "a").
It doesn't match at positions 0/0 or 2/2 since the expression is greedy, which means it will try to consider the next character (at positions 0/1, 2/3) as long as it doesn't invalidate the match, which it doesn't so they are skipped. To illustrate, if you were to match the string "bbbb" against the pattern a? then you would get five empty strings, one for each empty string at the beginning, end, and between each character.
Have a look at
http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
It explains your case in detail under the section Zero-Length Matches
a? stands for 0-or-1 occurrences of the character a.
The empty string matches the 0-occurrence case.
The matching is also greedy in your case, so it matches the 1 occurrence first, then the 0 occurrence at the end.
In the abab case, think of it as a[]ba[]b[], where [] denotes the empty occurrence found. The matcher does not find it at the beginning or after the first b, because it can greedily match on a.
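To make the role of greediness concrete, this sketch prints the matches in the same start[group]end format as the question, once for the greedy a? and once for the reluctant a??, which prefers the empty match even where an a is available. The class OptionalAMatches and the helper describe are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalAMatches {
    // Hypothetical helper: formats each match as start[group]end.
    static List<String> describe(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "[" + m.group() + "]" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // Greedy: takes the a when there is one, otherwise the empty string.
        System.out.println(describe("a?", "abab"));  // [0[a]1, 1[]1, 2[a]3, 3[]3, 4[]4]
        // Reluctant: always settles for the empty string.
        System.out.println(describe("a??", "abab")); // [0[]0, 1[]1, 2[]2, 3[]3, 4[]4]
        // No a anywhere: five empty matches, one per position.
        System.out.println(describe("a?", "bbbb"));  // [0[]0, 1[]1, 2[]2, 3[]3, 4[]4]
    }
}
```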
Matching the empty space after the last character is not universal.
The Vim editor has this behavior:
Buffer before:
aaaa
~
~
:s/x\?/y/g <- command
Buffer after:
yayayaya
~
~
No x occurs in aaaa but the x? (written x\? in Vim by default) allows an empty match. The pattern matches the empty space at the start of the string and between all the characters, but not past the end.
The exception is if the line is empty. The command will replace a blank line with a single y.
I implemented the Vim-like behavior in my own program:
$ txr -c '#(bind result #(regsub #/x?/ "y" "aaaa"))'
result="yayayaya"
$ txr -c '#(bind result #(regsub #/x?/ "y" ""))'
result="y"
I did this only because Vim is popular and I can point to it as the reference model if any questions come up. But it's a bit of a hack. The logic has a do .. while loop, which allows an incoming empty string to be processed:
do {
/* regex match, extraction, substitution ... */
position++;
} while (position < length(input))
So if the starting position is zero and the input has length zero, we do the loop once, applying the regex to the empty string. But if we process the last character, position reaches the length and the loop terminates without processing the empty string.
Originally, I had the loop test at the top, so it was behaving like Vim, but not in the empty input case, which would not match regexes that match on empty.
The behavior of the Java class you're using might be implemented like this:
while (position <= length(input)) {
/* process regex */
position++;
}
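The two loop shapes can be compared with a rough Java rendering. This is only a sketch under stated assumptions: the class LoopShapes, the method substitute and the flag tryAtEnd are invented here, and the code is not how Vim, TXR or java.util.regex are actually implemented. It anchors a match attempt at each position with Matcher.region and Matcher.lookingAt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LoopShapes {
    // tryAtEnd = false: do { ... } while (position < length)   (Vim-like shape)
    // tryAtEnd = true:  also attempts a match at position == length (Java-like shape)
    static String substitute(String regex, String repl, String input, boolean tryAtEnd) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder out = new StringBuilder();
        int length = input.length();
        int position = 0;
        do {
            m.region(position, length);        // try to match exactly at position
            boolean matched = m.lookingAt();
            if (matched) {
                out.append(repl);
            }
            if (matched && m.end() > position) {
                position = m.end();            // skip the text the match consumed
            } else {
                if (position < length) {
                    out.append(input.charAt(position)); // copy the unmatched character
                }
                position++;
            }
        } while (tryAtEnd ? position <= length : position < length);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(substitute("x?", "y", "aaaa", false)); // yayayaya
        System.out.println(substitute("x?", "y", "", false));     // y
        System.out.println(substitute("x?", "y", "aaaa", true));  // yayayayay
        System.out.println("aaaa".replaceAll("x?", "y"));         // yayayayay
    }
}
```

Because position 0 never exceeds the length, a do .. while with the <= test behaves the same as the while loop with the test at the top.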
