Regex capturing group doesn't recognise group(1) despite matches() true - java

I'm writing some simple (I thought) regex in Java to remove an asterisk or ampersand which occurs directly next to some specified punctuation.
This was my original code:
String ptr = "\\s*[\\*&]+\\s*";
String punct1 = "[,;=\\{}\\[\\]\\)]"; //need two because bracket rules different for ptr to left or right
String punct2 = "[,;=\\{}\\[\\]\\(]";
out = out.replaceAll(ptr+"("+punct1+")|("+punct2+")"+ptr,"$1");
Which instead of just removing the "ptr" part of the string, removed the punct too! (i.e. replaced the matched string with an empty string)
I examined further by doing:
String ptrStr = ".*"+ptr+"("+punct1+")"+".*|.*("+punct2+")"+ptr+".*";
Matcher m_ptrStr = Pattern.compile(ptrStr).matcher(out);
and found that:
m_ptrStr.matches() //returns true, but...
m_ptrStr.group(1) //returns null??
I have no idea what I'm doing wrong as I've used this exact method before with far more complicated regex and group(1) has always returned the captured group. There must be something I haven't been able to spot, so.. any ideas?

The problem is that you have an alternation with a capturing group on each side:
(regex1)|(regex2)
The matcher first tries the left alternative; if that fails, it tries the right one.
However, those are still two separate groups, and only one of them will take part in the match. The group that does not take part returns null, which is what happens to you here.
You therefore need to test both groups; since you have a match, at least one of them will be non-null.
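As a minimal sketch of that check (the pattern (\d+)|([a-z]+) and the class name are made up for illustration and are not taken from the question), the following tests both groups and returns whichever one took part in the match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationGroups {
    // Hypothetical pattern: group 1 is left of the |, group 2 is right of it.
    private static final Pattern P = Pattern.compile("(\\d+)|([a-z]+)");

    // Returns the text of whichever group participated, or null if there is no match.
    static String captured(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(captured("123")); // left alternative matched
        System.out.println(captured("abc")); // right alternative matched; group(1) is null here
    }
}
```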

When you have | in your pattern, that means that the matcher is allowed to match one of two patterns. Whichever one it matches, any capture groups for the pattern it matches will return the substrings--but any capture groups for the other pattern will return null, because the other pattern wasn't really matched.
It looks like your pattern is
.*\s*[\*&]+\s*([,;=\{}\[\]\)]).*|.*([,;=\{}\[\]\(])\s*[\*&]+\s*.*
------------- left -------------- ------------- right -------------
If matches() returns true, then either your string matched the "left" pattern, in which case group(1) will be non-null and group(2) will be null; or else it matched the "right" pattern, in which case group(1) will be null and group(2) non-null. [Note: The matcher will not try to find out if both sides are successful matches. That is, if the left side matches, it won't check the right side.]
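Applied to the original replaceAll call, one possible fix is to put both groups in the replacement. This is only a sketch: it reuses the three pattern strings from the question unchanged, the class and method names are invented, and it relies on the fact that in Java a group that took no part in the match contributes nothing to the replacement.

```java
class PointerCleanup {
    // Removes a run of * or & that sits directly next to the listed punctuation.
    static String strip(String out) {
        String ptr = "\\s*[\\*&]+\\s*";
        String punct1 = "[,;=\\{}\\[\\]\\)]"; // punctuation to the right of the pointer
        String punct2 = "[,;=\\{}\\[\\]\\(]"; // punctuation to the left of the pointer
        // Only one of $1 and $2 is set per match; the other expands to nothing.
        return out.replaceAll(ptr + "(" + punct1 + ")|(" + punct2 + ")" + ptr, "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(strip("a *, b")); // a, b
        System.out.println(strip("f(&x)"));  // f(x)
    }
}
```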

Related

X? regex quantifier doesn't work as expected (by me)

Input string:
aaa---foo---ccc---ddd
aaa---bar---ccc---ddd
aaa---------ccc---ddd
Regex: aaa.*(foo|bar)?.*ccc.*(ddd)
This regex never finds the first group (foo|bar). It always returns null for capture group 1.
My question is why, and how I can avoid that.
This is a heavily simplified version of my regex, just for demonstration. It works if I remove the ? quantifier, but the input string may lack this group entirely (aaa---------ccc---ddd), and I still need to determine whether it is foo, bar, or null. But group 1 is always null.
Page with this regex and test strings: http://fiddle.re/45c766
Here's why it doesn't work: When you have .* in a pattern, the matcher's algorithm is to try to match as many characters as it can to make the rest of the pattern work. In this case, if it tries starting with the entire remainder of the string as .* and removing one character until it matches, it finds that (for "aaa---foo---ccc---ddd") it will work to have .* match 9 characters; then (foo|bar)? doesn't match anything, which is OK because it's optional; and the next .* matches 0 characters, and then the rest of the pattern matches. So that's the one it selects.
The reason changing .* to .*?:
aaa.*?(foo|bar)?.*?ccc.*(ddd)
doesn't work is that the matcher does the same thing in reverse. It starts with a 0-character match and then figures out if it can make the pattern work. When it tries this, it will find that it works to make .*? match 0 characters; then (foo|bar)? doesn't match anything; then the second .*? matches 9 characters; then the rest of the pattern matches ccc---ddd. So either way, it won't do what you want.
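A small sketch to reproduce this (the class and helper names are invented for the demonstration): both the greedy and the reluctant pattern match the line, yet group 1 is null in each case, while dropping the ? makes group 1 come back as foo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Returns group 1 of the first match, or the marker "<no match>" if nothing matched.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : "<no match>";
    }

    public static void main(String[] args) {
        String line = "aaa---foo---ccc---ddd";
        System.out.println(group1("aaa.*(foo|bar)?.*ccc.*(ddd)", line));   // null
        System.out.println(group1("aaa.*?(foo|bar)?.*?ccc.*(ddd)", line)); // null
        System.out.println(group1("aaa.*(foo|bar).*ccc.*(ddd)", line));    // foo
    }
}
```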
There are a couple solutions in the answers, both involving lookahead. Here's another solution:
aaa.*(foo|bar).*ccc.*(ddd)|aaa.*ccc.*(ddd)
This basically checks for two patterns, in order; first it checks to see if there's a pattern with foo|bar in it, and if that doesn't match, it will then search for the other possibility, without foo|bar. This will always find foo|bar if it's there.
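Here is a hedged sketch of how that alternation could be used, with invented class and method names. One thing to watch: the ddd capture is group 2 when the first alternative matches and group 3 when the second one does, so both have to be checked.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FooBarAlternation {
    private static final Pattern P =
            Pattern.compile("aaa.*(foo|bar).*ccc.*(ddd)|aaa.*ccc.*(ddd)");

    // Returns "foo" or "bar" if present, otherwise null.
    static String fooOrBar(String line) {
        Matcher m = P.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    // Returns the ddd capture, whichever alternative matched, or null if there is no match.
    static String tail(String line) {
        Matcher m = P.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return m.group(2) != null ? m.group(2) : m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(fooOrBar("aaa---foo---ccc---ddd")); // foo
        System.out.println(fooOrBar("aaa---bar---ccc---ddd")); // bar
        System.out.println(fooOrBar("aaa---------ccc---ddd")); // null
    }
}
```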
All of these solutions involve rather difficult-to-read regexes, though. This is how I might code it:
Pattern pat1 = Pattern.compile("aaa(.*)ccc.*ddd");
Pattern pat2 = Pattern.compile("foo|bar");
Matcher m1 = pat1.matcher(source);
String foobar = null; // stays null if the input does not match or contains neither foo nor bar
if (m1.matches()) {
    // Search only the text between aaa and ccc.
    Matcher m2 = pat2.matcher(m1.group(1));
    if (m2.find()) {
        foobar = m2.group(0);
    }
}
Often, attempting to use one whiz-bang regex to solve a problem results in less-readable (and possibly less-efficient) code than just breaking the problem into parts.
Change your regex to the one below if you want to capture the foo or bar in between.
aaa(?:(?!foo|bar).)*(foo|bar)?.*?ccc.*?(ddd)
Because .* would also consume the foo or bar in between, use (?:(?!foo|bar).)* instead. This sub-pattern matches any character, zero or more times, as long as foo or bar does not start at that position.
DEMO
String s = "aaa---foo---ccc---ddd\n" +
           "aaa---bar---ccc---ddd\n" +
           "aaa---------ccc---ddd";
Pattern regex = Pattern.compile("aaa(?:(?!foo|bar).)*(foo|bar)?.*?ccc.*?(ddd)");
Matcher matcher = regex.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output:
foo
bar
null
Try:
.{3}\-{3}(.{3})\-{3}.{3}\-{3}(.{3})

How to determine substring OR match?

I have a regex that has an | (or) in it and I would like to determine what part of the or matched in the regex:
Possible Inputs:
-- Input 1 --
Stuff here to keep.
-- First --
all of this below
gets
deleted
-- Input 2 --
Stuff here to keep.
-- Second --
all of this below
gets
deleted
Regex to match part of an incoming input source and determine what part of the | (or) was matched? "-- First --" or "-- Second --"
Pattern PATTERN = Pattern.compile("^(.*?)-+ *(?:First|Second) *-+", Pattern.DOTALL);
Matcher m = PATTERN.matcher(text);
if (m.find()) {
    // How can I tell if the regex matched "First" or "Second"?
}
How can I tell which input was matched (First or Second)?
The regular expression does not contain that information. However, you could use some additional groups to figure it out.
Example pattern: (?:(First)|(Second))
On the string First the second capture group will be null, and on Second the first one will be null. A simple inspection of the groups returned to Java will tell you which part of the regex matched.
EDIT: I assumed that First and Second were used as placeholders for the sake of simplicity and actually represent more complex expressions. If you are really looking to find which of two strings was matched, then having a single capture group (like this: (First|Second)) and comparing its content with First will do the job just fine.
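As a rough sketch of this idea applied to the pattern from the question (the class and method names are hypothetical), the non-capturing (?:First|Second) becomes two capturing groups. Since group 1 is already the text to keep, the two new groups are numbers 2 and 3:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhichMarker {
    // Group 1: text to keep, group 2: "First", group 3: "Second".
    private static final Pattern PATTERN =
            Pattern.compile("^(.*?)-+ *(?:(First)|(Second)) *-+", Pattern.DOTALL);

    // Returns "First" or "Second" depending on which alternative matched, or null if neither.
    static String which(String text) {
        Matcher m = PATTERN.matcher(text);
        if (!m.find()) {
            return null;
        }
        return m.group(2) != null ? "First" : "Second";
    }

    public static void main(String[] args) {
        System.out.println(which("Stuff here to keep.\n-- First --\nall of this below"));  // First
        System.out.println(which("Stuff here to keep.\n-- Second --\nall of this below")); // Second
    }
}
```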
Because RegExes are stateless there is no way to tell by using only one regex.
The solution is to use two different RegExes and make a case decision.
However, you can use group(), which returns the text of the most recent match as a String.
You can test this with .contains("First").
if (m.group().contains("First")) {
    // case 1
} else {
    // case 2
}

Java regular expression - Search string by group

Please could someone explain this for me:
We have a regular expression which we use to check if a string matches a specific sequence. The regular expression is shown below:
JPRN(JAPICCTI\d{6})|(JAPICCTI\d{6})
I want to try and understand what this code is trying to achieve:
Pattern matcher = Pattern.compile("JPRN(JAPICCTI\\d{6})|(JAPICCTI\\d{6})");
Matcher m = matcher.matcher("JAPICCTI132323");
if (m.find()) {
    Matcher m2 = matcher.matcher(m.group());
    if (m2.find()) {
        return m2.replaceAll("$1");
    }
}
The string it tries to check (i.e. JAPICCTI132323) does match with the regular expression.
I don't, however, understand why the matching is done twice, i.e. once using the string and again using the "group". What would be the reason for doing this?
Also, what is the purpose of the $1 string?
This is failing because m2.replaceAll("$1") is returning an empty string, but I was expecting it to return JAPICCTI132323. Given that I don't understand what it is doing, I am struggling to understand why the result is an empty string.
Thanks in advance.
The | symbol indicates alternation which means "Match the left group first, if it does not match, try the second group"
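The $1 symbol refers to whatever the first capturing group matched. Here the input JAPICCTI132323 is matched by the right-hand alternative, so group 1 takes no part in the match and $1 expands to an empty string, which is why replaceAll("$1") returns an empty string.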
If you have a number of capture groups: (one\d+)(two\w+\d)(three.*?)
Then you could use $1, $2 and $3 to represent the matched strings.
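You can also name a capture group, like so: (?<name>regexpattern), for example (?<phone>\d{2}\s\d{4}). Java has supported this since Java 7, with ${name} in the replacement string; a Java group name must start with a letter and contain only letters and digits, so a name with a space in it is not accepted.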
You can specify $1$2 as the replacement: in Java, a group that took no part in the match adds nothing to the replacement, while the other group contributes its text.
With this pattern the two groups sit on opposite sides of the alternation, so only one of them can take part in any given match and you never end up with both strings in the replacement.
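A short sketch to check this behaviour, using the pattern from the question. The class and method names are made up, and the third variant, which makes the JPRN prefix optional so that a single group suffices, is an alternative suggestion rather than something from the original code.

```java
class JapicReplace {
    static final String REGEX = "JPRN(JAPICCTI\\d{6})|(JAPICCTI\\d{6})";

    // The replacement used in the question: only group 1.
    static String withGroup1(String s) {
        return s.replaceAll(REGEX, "$1");
    }

    // Both groups: whichever one participated supplies the text.
    static String withBothGroups(String s) {
        return s.replaceAll(REGEX, "$1$2");
    }

    // Alternative pattern: optional non-capturing prefix, so one group is enough.
    static String withOptionalPrefix(String s) {
        return s.replaceAll("(?:JPRN)?(JAPICCTI\\d{6})", "$1");
    }

    public static void main(String[] args) {
        System.out.println("[" + withGroup1("JAPICCTI132323") + "]"); // []
        System.out.println(withBothGroups("JAPICCTI132323"));         // JAPICCTI132323
        System.out.println(withOptionalPrefix("JPRNJAPICCTI132323")); // JAPICCTI132323
    }
}
```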

java regex with preceding and trailing (.*) slow

I noticed that when I match a regular expression like the following one on a text, it is a lot slower than the one without the preceding and trailing (.*) parts. I did the same in Perl and found that there it hardly makes a difference. Is there any way to optimize the original regular expression "(.*)someRegex(.*)" for Java?
// With the preceding and trailing (.*):
Pattern p = Pattern.compile("(.*)someRegex(.*)");
Matcher m = p.matcher("some text");
m.matches();
// Without them:
Pattern p = Pattern.compile("someRegex");
Matcher m = p.matcher("some text");
m.matches();
Edit:
Here is a concrete example:
(.*?)<b>\s*([^<]*)\s*<\/b>(.*)
Your best bet is to skip trying to match the front and end of the string at all. You must do that if you use the matches() method, but you don't if you use the find() method. That's probably what you want instead.
Pattern p = Pattern.compile("<b>\\s*([^<]*)\\s*<\\/b>");
Matcher m = p.matcher("some <b>text</b>");
m.find();
You can use start() and end() to find the indexes within the source string containing the match. You can use group(1) to get the contents of the () capture within the match (i.e., the text inside the bold tag).
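Put together, a minimal sketch might look like this. The class and method names are invented; the pattern and sample input are the ones from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoldFinder {
    private static final Pattern BOLD = Pattern.compile("<b>\\s*([^<]*)\\s*<\\/b>");

    // Returns {start, end} of the first <b>...</b> match, or null if there is none.
    static int[] span(String html) {
        Matcher m = BOLD.matcher(html);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    // Returns the text captured inside the first <b>...</b>, or null if there is none.
    static String inner(String html) {
        Matcher m = BOLD.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "some <b>text</b>";
        int[] s = span(html);
        System.out.println(s[0] + ".." + s[1]); // 5..16
        System.out.println(inner(html));        // text
    }
}
```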
In my experience, using regular expressions to process HTML is very fragile and works well in only the most trivial cases. You might have better luck using a full blown XML parser instead, but if this is one of those trivial cases, have at it.
Original Answer: Here is my original answer sharing why a .* at the beginning of a match will perform so badly.
The problem with using .* at the front is that it will cause lots of backtracking in your match. For example, consider the following:
Pattern p = Pattern.compile("(.*)ab(.*)");
Matcher m = p.matcher("aaabaaa");
m.matches();
The match will proceed like this:
The matcher will attempt to suck the whole string, "aaabaaa", into the first .*, but then tries to match a and fails.
The matcher will back up and match "aaabaa", then tries to match a and succeeds, but tries to match b and fails.
The matcher will back up and match "aaaba", then tries to match a and succeeds, but tries to match b and fails.
The matcher will back up and match "aaab", then tries to match a and succeeds, but tries to match b and fails.
The matcher will back up and match "aaa", then tries to match a and fails.
The matcher will back up and match "aa", then tries to match a and succeeds, tries b and succeeds, and then matches "aaa" to the final .*. Success.
You want to avoid a really broad match toward the beginning of your pattern matches whenever possible. Without knowing your actual problem, it would be very difficult to suggest something better.
Update: Anirudha suggests using (.*?)ab(.*) as a possible fix to avoid backtracking. This will short circuit backtracking to some extent, but at the cost of trying to apply the next match on each try. So now, consider the following:
Pattern p = Pattern.compile("(.*?)ab(.*)");
Matcher m = p.matcher("aaabaaa");
m.matches();
It will proceed like this:
The matcher will attempt to match nothing, "", into the first .*?, tries to match a and succeeds, but fails to match b.
The matcher will attempt to match the first letter, "a", into the first .*?, tries to match a and succeeds, but fails to match b.
The matcher will attempt to match the first two letters, "aa", into the first .*?, tries to match a and succeeds, tries to match b and succeeds, and then slurps up the rest into .*, "aaa". Success.
There aren't any backtracks this time, but we still have a more complicated matching process for each forward move within .*?. This may be a performance gain for a particular match or a loss if iterating through the match forward happens to be slower.
This also changes the way the match will proceed. The .* match is greedy and tries to match as much as possible, whereas .*? is more conservative.
For example, the string "aaabaaabaaa".
The first pattern, (.*)ab(.*) will match "aaabaa" to the first capture and "aaa" to the second.
The second pattern, (.*?)ab(.*) will match "aa" to the first capture and "aaabaaa" to the second.
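The difference can be checked with a few lines of Java; this is just a sketch, with a hypothetical helper that returns the two captures for a given pattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns {group 1, group 2} if the whole input matches, otherwise null.
    static String[] captures(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String input = "aaabaaabaaa";
        System.out.println(String.join(" | ", captures("(.*)ab(.*)", input)));  // aaabaa | aaa
        System.out.println(String.join(" | ", captures("(.*?)ab(.*)", input))); // aa | aaabaaa
    }
}
```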
Instead of doing "(.*)someRegex(.*)", why not just split the string on "someRegex" and get the parts from the resulting array? When someRegex occurs once in the input, this gives you the same parts, and it is simpler and likely faster. Java supports splitting by regex if you need it - http://www.regular-expressions.info/java.html
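A sketch of the split approach, under one assumption that is mine rather than the answer's: passing a limit of 2 so that the input is cut only at the first occurrence. With that limit the two parts correspond to what the reluctant pattern (.*?)ab(.*) would capture. The class and method names are invented.

```java
import java.util.regex.Pattern;

class SplitOnce {
    private static final Pattern SEPARATOR = Pattern.compile("ab");

    // Splits at the first occurrence only; returns a single element if there is none.
    static String[] parts(String input) {
        return SEPARATOR.split(input, 2);
    }

    public static void main(String[] args) {
        String[] p = parts("aaabaaabaaa");
        System.out.println(p[0]); // aa
        System.out.println(p[1]); // aaabaaa
    }
}
```

If the separator does not occur at all, the array has one element (the whole input), so the length should be checked before reading the second part.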
. matches every character (except line terminators, unless DOTALL is set).
Instead of . try limiting your search by using classes like \w or \s.
But I don't guarantee that it would run fast.
It all depends on the amount of text you are matching!

keeping data that regex expression parses

I have a Regex pattern that matches the data I need to parse exactly as I need it. Unfortunately, the split method deletes the desired data and passes the garbage out to me. Normally I would just try another Regex expression doing the opposite, but it's not quite as simple as it sounds. It must be in Java, as this section is part of a much bigger program/package.
Pattern p = Pattern.compile("/^\{\?|\:|\=|\||(\-configurationFile)|(isUsingRESTDescription)|(\restURL)=(\s|\w|\.|\-|\:|\/|\;|\[|\]|\'|\})\r/g");
This is the string I'm parsing (there are carriage returns after each section):
SearchResult::getBleh(): {BLEHID=BLEH blehLastmoddate=1-Jul-11 bleh=BLEH; Beh description=blehbleh BlEh=bleh1231bleh bLeH=bleh-blehbleh 1 media=http://bleh.com/13 Date=22-May-12 name=[]}
String[] items = p.split(input);
The above gives me the opposite of what I want.
You'd think someone would have had this problem. Help would be appreciated :).
Use capture groups. You can read about them in the javadoc for Pattern.
An example:
Pattern p = Pattern.compile("[^/]*/([^/]*)/.*");
Matcher m = p.matcher("foo/bar/input");
if (m.find()) {
    String captured = m.group(1); // This equals "bar"
    String matched = m.group(0); // This equals "foo/bar/input"
}
Any pair of plain parentheses in a Pattern is a capture group (a group opened with (?: is non-capturing). The Matcher numbers the capture groups from left to right, in the order their opening parentheses appear. Group 0 is always the entire matched region.
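For data shaped like the string in the question, a find() loop with two capture groups keeps the matched parts instead of discarding them as split does. This is only a sketch under stated assumptions: the pattern (\w+)=([^\s}]*) is my own guess at the key=value format, it assumes values contain no spaces, and the input is a shortened version of the sample.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueCapture {
    // Hypothetical format: group 1 is the key, group 2 the value (no whitespace, no closing brace).
    private static final Pattern PAIR = Pattern.compile("(\\w+)=([^\\s}]*)");

    // Collects every key=value pair in order of appearance.
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "{BLEHID=BLEH blehLastmoddate=1-Jul-11 media=http://bleh.com/13 name=[]}";
        // Prints each pair on its own line, e.g. BLEHID -> BLEH
        parse(input).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```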
