Matcher creates extra group at the end of string - java

I've run into strange behavior of java.util.regex.Matcher.
Consider this example:
Pattern p = Pattern.compile("\\d*");
String s = "a1b";
Matcher m = p.matcher(s);
while(m.find())
{
System.out.println(m.start()+" "+m.end());
}
It produces output:
0 0
1 2
2 2
3 3
I can understand all lines except the last. The matcher reports an extra match (3,3) at the very end of the string.
The javadoc for the start() method only says:
start() Returns the start index of the previous match.
The same case for dot-star pattern:
Pattern p = Pattern.compile(".*");
String s = "a1b";
Matcher m = p.matcher(s);
while(m.find())
{
System.out.println(m.start()+" "+m.end());
}
Output:
0 3
3 3
But if I specify line boundaries:
Pattern p = Pattern.compile("^.*$");
The output will be "right":
0 3
Can someone explain the reason for this behavior?

The pattern "\\d*" matches 0 or more digits. The same goes for ".*": it matches 0 or more occurrences of any character except a newline.
The last match that you get is the empty string at the end of your string, after "b". The empty string satisfies the pattern \\d*. If you change the pattern to \\d+, you'll get the expected result.
Similarly, the pattern .* matches everything from the first character to the last character, so it first matches "a1b". After that the cursor is after b: "a1b|". Now matcher.find() runs again and finds a zero-length string at the cursor, which satisfies the pattern .*, so it reports it as a match.
The reason why it gives expected output with "^.*$" is that the last empty string doesn't satisfy the ^ anchor. It is not at the beginning of the string, so it fails to match.
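The four cases discussed above can be compared side by side with a small sketch. The class name EmptyMatchDemo and the helper spans are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question.
final class EmptyMatchDemo {
    // Collects "start end" for every match that find() reports.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + " " + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("\\d*", "a1b"));  // [0 0, 1 2, 2 2, 3 3]
        System.out.println(spans("\\d+", "a1b"));  // [1 2]
        System.out.println(spans(".*", "a1b"));    // [0 3, 3 3]
        System.out.println(spans("^.*$", "a1b"));  // [0 3]
    }
}
```

Only the patterns that can match the empty string (\\d* and .*) report the extra zero-length match at index 3.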

Related

How to write a regex capture group which matches a character 3 or 4 times before a delimiter?

I'm trying to write a regex that splits elements out according to a delimiter. The regex also needs to ensure there are ideally 4, but at least 3 colons : in each match.
Here's an example string:
"Checkers, etc:Blue::C, Backgammon, I say:Green::Pepsi:P, Chess, misc:White:Coke:Florida:A, :::U"
From this, there should be 4 matches:
Checkers, etc:Blue::C
Backgammon, I say:Green::Pepsi:P
Chess, misc:White:Coke:Florida:A
:::U
Here's what I've tried so far:
([^:]*:[^:]*){3,4}(?:, )
Regex 101 at: https://regex101.com/r/O8iacP/8
I tried setting up a non-capturing group for ,
Then I tried matching a group of any character that's not a :, a :, and any character that's not a : 3 or 4 times.
The code I'm using to iterate over these groups is:
String line = "Checkers, etc:Blue::C, Backgammon, I say::Pepsi:P, Chess:White:Coke:Florida:A, :::U";
String pattern = "([^:]*:[^:]*){3,4}(?:, )";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher matcher = r.matcher(line);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Any help is appreciated!
Edit
Using @Casimir's regex, it's working. I had to change the above code to use group(0) like this:
String line = "Checkers, etc:Blue::C, Backgammon, I say::Pepsi:P, Chess:White:Coke:Florida:A, :::U";
String pattern = "(?![\\s,])(?:[^:]*:){3}\\S*(?![^,])";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher matcher = r.matcher(line);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Now prints:
Checkers, etc:Blue::C
Backgammon, I say::Pepsi:P
Chess:White:Coke:Florida:A
:::U
Thanks again!
I suggest this pattern:
(?![\\s,])(?:[^:]*:){3}\\S*(?![^,])
The negative lookaheads keep the match from including leading or trailing delimiters. The second one in particular forces the match to be followed by the delimiter or by the end of the string (that is, not followed by a character that isn't a comma).
Note that the pattern doesn't have capture groups, so the result is the whole match (or group 0).
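As a rough check, the pattern can be run against the example string from the question (the one with Green and misc in it). The class name ColonGroupsDemo and the method items are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the suggested pattern.
final class ColonGroupsDemo {
    // First lookahead: the match must not start on whitespace or a comma.
    // Second lookahead: the match must be followed by a comma or the end of input.
    static final Pattern ITEM =
            Pattern.compile("(?![\\s,])(?:[^:]*:){3}\\S*(?![^,])");

    static List<String> items(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = ITEM.matcher(line);
        while (m.find()) {
            out.add(m.group()); // no capture groups, so use the whole match
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Checkers, etc:Blue::C, Backgammon, I say:Green::Pepsi:P, "
                + "Chess, misc:White:Coke:Florida:A, :::U";
        for (String item : items(line)) {
            System.out.println(item);
        }
        // Checkers, etc:Blue::C
        // Backgammon, I say:Green::Pepsi:P
        // Chess, misc:White:Coke:Florida:A
        // :::U
    }
}
```

Note that \\S* stops at whitespace, so this sketch assumes the part after the third colon contains no spaces, as in the example data.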
You might use
(?:[^,:]+, )?[^:,]*(?::+[^:,]+)+
(?:[^,:]+, )? Optionally match 1+ any char except a , or : followed by , and space
[^:,]* Match 0+ any char except : or ,
(?: Non Capturing group
:+[^:,]+ Match 1+ : and 1+ times any char except : and ,
)+ Close group and repeat 1+ times
You seem to be making it harder than it needs to be with the lookahead (which won't be satisfied at end-of-line anyway).
([^:]*:){3}[^:,]*:?[^:,]*
Find the first 3 :'s, then start including , in the negative groupings, with an optional 4th :.

find the start of a matched region (java regex)

Suppose the strings I am interested in look like these: num3.a, num4.b, etc.
(but I don't want to match these: foo.num3.a, whatever.num2.b)
I have this regex to match them: Pattern p = Pattern.compile("[^\\.]\\bnum(\\d*)(?=\\.)");
Given this input string : (num3.a)
Matcher m = p.matcher("(num3.a)");
if (m.find())
System.out.println(m.start()); // This would print 0 rather than 1 WHY?
How do I change the code so it prints 1 instead? (because 1 is the index of n, which is the start of my interested pattern)
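If you're interested in num3.a, you should expand your group. The parentheses define a capturing group, which you can use to address a part of your match.
[^\\.]\\b(num\\d*)(?=\\.)
Then you can access the group with
start(1) and end(1). Group 0 is the whole match, which for this pattern also includes the character matched by [^\\.]. When the pattern has no such leading character, as below, start(0) and end(0) work as well: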
Pattern p = Pattern.compile("\\b(num\\d*\\.a)");
String input = "fffffffffffff(num3.a)fffffffffffffffffsdfsdf";
Matcher m = p.matcher(input);
if (m.find())
{
System.out.println(m.start(0));
System.out.println(input.substring(m.start(0), m.end(0)));
}
will output
14
num3.a
The method Matcher#regionStart()
Reports the start index of this matcher's region.
This doesn't indicate the start of the match, only the start of the region that is checked for a match.
Use find() and start() to find the start of a match.
Now that you have changed your pattern: [^\\.] matches any single character that is not a dot. A ( is not a dot, so it is matched and becomes part of the match. The ( is at index 0 in the given String, which is why start() returns 0.
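To make the difference between the start of the whole match and the start of a group concrete, here is a small sketch. The class MatchStartDemo and both method names are invented; the lookbehind variant is an alternative that does not appear in the answers above and is only one possible way to keep the preceding character out of the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class.
final class MatchStartDemo {
    // Returns {start of whole match, start of group 1}, or an empty array if nothing matches.
    static int[] starts(String input) {
        Matcher m = Pattern.compile("[^\\.]\\b(num\\d*)(?=\\.)").matcher(input);
        if (!m.find()) {
            return new int[0];
        }
        return new int[] { m.start(), m.start(1) };
    }

    // Alternative (an assumption, not from the answers): a negative lookbehind
    // checks the preceding character without consuming it.
    static int lookbehindStart(String input) {
        Matcher m = Pattern.compile("(?<!\\.)\\bnum(\\d*)(?=\\.)").matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        int[] s = starts("(num3.a)");
        System.out.println(s[0] + " " + s[1]);             // 0 1
        System.out.println(lookbehindStart("(num3.a)"));   // 1
        System.out.println(lookbehindStart("foo.num3.a")); // -1
    }
}
```

With the lookbehind, start() itself returns the index of n, and an input that begins with num3.a also matches, which the [^\\.] version cannot do because it needs a character in front.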
Pattern p = Pattern.compile("\\,\\d*");
String inpu = "Hotel Class : 5,106936 ";
Matcher m = p.matcher(inpu);
if (m.find())
{
System.out.println(inpu.substring(m.start(0), m.end(0)));
}
The output is ",106936" (the pattern starts at the comma, so the leading 5 is not part of the match).

Regular expression for a string starting with some string

I have a string of this form: (notice)Any_other_string (note that the parentheses are part of the string).
I want to split this string into 2 parts: (notice) and the rest. I do it as follows:
private static final Pattern p1 = Pattern.compile("(^\\(notice\\))([a-z_A-Z1-9])+");
String content = "(notice)Stack Over_Flow 123";
Matcher m = p1.matcher(content);
System.out.println("Printing");
if (m.find()) {
System.out.println(m.group(0));
System.out.println(m.group(1));
}
I expected the result to be (notice) and Stack Over_Flow 123, but instead the result is (notice)Stack and (notice).
I cannot explain this result. Which regex is suitable for my purpose?
Issue 1: group(0) will always return the entire match - this is specified in the javadoc - and the actual capturing groups start from index 1. Simply replace it with the following:
System.out.println(m.group(1));
System.out.println(m.group(2));
Issue 2: You do not take spaces into account, and the range 1-9 in your character class leaves out the digit 0. I suggest using the dot, ., for matching unknown characters, or adding \\s (whitespace) and 0 to your character class. Either of the following regexes should work:
(^\\(notice\\))(.+)
(^\\(notice\\))([A-Za-z0-9_\\s]+)
Note that you need the + inside the capturing group, or it will only find the last character of the second part.
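Both points can be checked with a short sketch. The class NoticeSplitDemo and the method split are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class.
final class NoticeSplitDemo {
    // Returns {group 1, group 2} of the first match, or null if there is none.
    static String[] split(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String content = "(notice)Stack Over_Flow 123";

        // + inside the group: group 2 holds the whole remainder.
        String[] good = split("(^\\(notice\\))(.+)", content);
        System.out.println(good[0] + " | " + good[1]); // (notice) | Stack Over_Flow 123

        // + outside the group: group 2 holds only the last repetition.
        String[] bad = split("(^\\(notice\\))([a-z_A-Z1-9])+", content);
        System.out.println(bad[0] + " | " + bad[1]);   // (notice) | k
    }
}
```

In the second call the repeated group stops at the space after Stack, and because a repeated capturing group keeps only its last iteration, group 2 is just k.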

Why empty regex and empty capturing group regex return string length plus one results

How would you explain that empty regex and empty capturing group regex return string length plus one results?
Code
public static void main(String... args) {
{
System.out.format("Pattern - empty string\n");
String input = "abc";
Pattern pattern = Pattern.compile("");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String s = matcher.group();
System.out.format("[%s]: %d / %d\n", s, matcher.start(),
matcher.end());
}
}
{
System.out.format("Pattern - empty capturing group\n");
String input = "abc";
Pattern pattern = Pattern.compile("()");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String s = matcher.group();
System.out.format("[%s]: %d / %d\n", s, matcher.start(),
matcher.end());
}
}
}
Output
Pattern - empty string
[]: 0 / 0
[]: 1 / 1
[]: 2 / 2
[]: 3 / 3
Pattern - empty capturing group
[]: 0 / 0
[]: 1 / 1
[]: 2 / 2
[]: 3 / 3
The regex engine is hardcoded to advance one position upon a zero-length match (otherwise infinite loop). Your regex matches a zero-length substring. There are zero-length substrings between every character (think the "gaps between each character"); in addition, the regex engine considers the start and end of the string valid match positions as well. Because a string of length N contains N+1 gaps between letters (counting the start and end, which the regex engine does), you'll get N+1 matches.
Regex engines consider positions before and after characters, too. You can see this from the fact that they have things like ^ (start of string), $ (end of string) and \b word boundary, which match at certain positions without matching any characters (and therefore between/before/after characters). Therefore we have the N-1 positions between characters that have to be considered, as well as the first and last position (because ^ and $ would match there respectively), which gives you N+1 candidate positions. All of which match for a completely unrestrictive empty pattern.
So here are your matches:
" a b c "
^ ^ ^ ^
Which is obviously N+1 for N characters.
You will get the same behavior with other patterns that allow zero-length matches and don't actually find longer ones in your input. For instance, try \d*. It cannot find any digits in your input string, but * will gladly return zero-length matches.
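The N+1 count can be confirmed with a minimal sketch. The class ZeroLengthCountDemo and the method countMatches are made-up names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class.
final class ZeroLengthCountDemo {
    // Counts how many times find() succeeds on the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("", "abc"));     // 4
        System.out.println(countMatches("()", "abc"));   // 4
        System.out.println(countMatches("\\d*", "abc")); // 4
        System.out.println(countMatches("", ""));        // 1
    }
}
```

Even the empty input has one position (start and end coincide), so the count is 1 for N = 0.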

Java Matcher. Return several entries from one sequence

For example, I have the following regexp: \d{2} (2 digits). When I use
Matcher matcher = Pattern.compile("\\d{2}").matcher("123");
matcher.find();
String result = matcher.group();
the result variable contains only the first entry, i.e. 12. But I want to get ALL possible entries, i.e. 12 and 23.
How to achieve this?
You'll need the help of a capture group within a positive lookahead:
Matcher m = Pattern.compile("(?=(\\d{2}))").matcher("1234");
while (m.find()) System.out.println(m.group(1));
prints
12
23
34
That's not how regular expression matching works. The matcher starts at the beginning of the string, and each time it finds a match it continues looking from the character following the end of that match - it will not give you overlapping matches.
If you want to find overlapping matches of an arbitrary regular expression without needing to use lookaheads and capturing groups, you can do this by resetting the matcher's "region" after each match:
Matcher matcher = Pattern.compile(theRegex).matcher(str);
// prevent ^ and $ from matching the beginning/end of the region when this is
// smaller than the whole string
matcher.useAnchoringBounds(false);
// allow lookaheads/behinds to look outside the current region
matcher.useTransparentBounds(true);
while(matcher.find()) {
System.out.println(matcher.group());
if(matcher.start() < str.length()) {
// start looking again from the character after the _start_ of the previous
// match, instead of the character following the _end_ of the match
matcher.region(matcher.start() + 1, str.length());
}
}
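The region-based loop above can be wrapped into a self-contained method for experimenting. The class OverlappingMatchesDemo and the method overlapping are hypothetical names; the body follows the snippet above, with theRegex and str turned into parameters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the region-resetting loop.
final class OverlappingMatchesDemo {
    static List<String> overlapping(String regex, String str) {
        List<String> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(str);
        // ^ and $ should refer to the whole input, not to the shrinking region.
        matcher.useAnchoringBounds(false);
        // Lookaheads and lookbehinds may see text outside the region.
        matcher.useTransparentBounds(true);
        while (matcher.find()) {
            out.add(matcher.group());
            if (matcher.start() < str.length()) {
                // Restart one character after the START of the last match.
                matcher.region(matcher.start() + 1, str.length());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("\\d{2}", "123"));  // [12, 23]
        System.out.println(overlapping("\\d{2}", "1234")); // [12, 23, 34]
    }
}
```

Each call to region() resets the matcher but leaves the anchoring and transparency settings in place, so they only need to be set once before the loop.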
Something like this:
^(?=[1-3]{2}$)(?!.*(.).*\1).*$
