X? regex quantifier doesn't work as expected (by me) - java

Input string:
aaa---foo---ccc---ddd
aaa---bar---ccc---ddd
aaa---------ccc---ddd
Regex: aaa.*(foo|bar)?.*ccc.*(ddd)
This regex never finds the first group (foo|bar). It always returns null for capture group 1.
My question is why, and how can I avoid that?
It's a very oversimplified version of my regex, just for demonstration. It works if I remove the ? quantifier, but the input string can lack this group entirely (aaa---------ccc---ddd) and I still need to determine whether it is foo, bar or null. But group 1 is always null.
Page with this regex and test strings: http://fiddle.re/45c766

Here's why it doesn't work: When you have .* in a pattern, the matcher's algorithm is to try to match as many characters as it can to make the rest of the pattern work. In this case, if it tries starting with the entire remainder of the string as .* and removing one character until it matches, it finds that (for "aaa---foo---ccc---ddd") it will work to have .* match 9 characters; then (foo|bar)? doesn't match anything, which is OK because it's optional; and the next .* matches 0 characters, and then the rest of the pattern matches. So that's the one it selects.
The reason changing .* to .*?:
aaa.*?(foo|bar)?.*?ccc.*(ddd)
doesn't work is that the matcher does the same thing in reverse. It starts with a 0-character match and then figures out if it can make the pattern work. When it tries this, it will find that it works to make .*? match 0 characters; then (foo|bar)? doesn't match anything; then the second .*? matches 9 characters; then the rest of the pattern matches ccc---ddd. So either way, it won't do what you want.
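To see both behaviours side by side, here is a minimal sketch. The class name OptionalGroupDemo and the helper group1 are made up for illustration; only the two patterns come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class OptionalGroupDemo {
    // Returns group 1 of the first match, or null if that group did not participate.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalStateException("no match for " + regex);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        String input = "aaa---foo---ccc---ddd";
        // Greedy: the first .* swallows "---foo---", so the optional group matches nothing.
        System.out.println(group1("aaa.*(foo|bar)?.*ccc.*(ddd)", input));
        // Lazy: the optional group is tried right after "aaa", fails on "---", and is skipped.
        System.out.println(group1("aaa.*?(foo|bar)?.*?ccc.*(ddd)", input));
    }
}
```

Both calls return null for group 1, even though foo is present in the input.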
There are a couple solutions in the answers, both involving lookahead. Here's another solution:
aaa.*(foo|bar).*ccc.*(ddd)|aaa.*ccc.*(ddd)
This basically checks for two patterns, in order; first it checks to see if there's a pattern with foo|bar in it, and if that doesn't match, it will then search for the other possibility, without foo|bar. This will always find foo|bar if it's there. Note that (ddd) ends up as group 2 in the first alternative and as group 3 in the second, so you have to check both.
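A small sketch of how the two-alternative pattern could be used. The class AlternationDemo and its helper methods are hypothetical names, and I'm assuming each input line is matched on its own with matches().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class AlternationDemo {
    private static final Pattern P =
            Pattern.compile("aaa.*(foo|bar).*ccc.*(ddd)|aaa.*ccc.*(ddd)");

    // Returns "foo" or "bar", or null when the line matches without either word.
    static String fooOrBar(String line) {
        Matcher m = P.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("line does not match: " + line);
        }
        return m.group(1);
    }

    // The trailing (ddd) is group 2 in the first branch and group 3 in the second.
    static String tail(String line) {
        Matcher m = P.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("line does not match: " + line);
        }
        return m.group(2) != null ? m.group(2) : m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(fooOrBar("aaa---foo---ccc---ddd"));
        System.out.println(fooOrBar("aaa---bar---ccc---ddd"));
        System.out.println(fooOrBar("aaa---------ccc---ddd"));
    }
}
```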
All of these solutions involve rather difficult-to-read regexes, though. This is how I might code it:
Pattern pat1 = Pattern.compile("aaa(.*)ccc.*ddd");
Pattern pat2 = Pattern.compile("foo|bar");
Matcher m1 = pat1.matcher(source);
// Stays null if the input doesn't match or contains neither foo nor bar
String foobar = null;
if (m1.matches()) {
    // Search only the text between aaa and ccc
    Matcher m2 = pat2.matcher(m1.group(1));
    if (m2.find()) {
        foobar = m2.group(0);
    }
}
Often, attempting to use one whiz-bang regex to solve a problem results in less-readable (and possibly less-efficient) code than just breaking the problem into parts.

Change your regex to the one below if you want to capture the in-between foo or bar strings.
aaa(?:(?!foo|bar).)*(foo|bar)?.*?ccc.*?(ddd)
Because .* would also eat up the in-between foo or bar, you can use (?:(?!foo|bar).)* instead. This (?:(?!foo|bar).)* matches any character, zero or more times, as long as foo or bar does not start at that position.
String s = "aaa---foo---ccc---ddd\n" +
           "aaa---bar---ccc---ddd\n" +
           "aaa---------ccc---ddd";
Pattern regex = Pattern.compile("aaa(?:(?!foo|bar).)*(foo|bar)?.*?ccc.*?(ddd)");
Matcher matcher = regex.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output:
foo
bar
null

Try:
.{3}\-{3}(.{3})\-{3}.{3}\-{3}(.{3})

Related

Java Regex Look-Behind Doesn't Work

So I am working on regex comparing phone numbers and this is the result:
(?:(?:0{2}|\+)?([1-9][0-9]))? ?([1-9][0-9])? ?([1-9][0-9]{5})
As you can see, there are spaces between the numbers. I want a space to be allowed only when there is some other number before it, so:
"0022 45 432345" - should match
"45 345678" or "560032" - should match
" 324400" - shouldn't match because of the space in the beginning
I've been reading different tutorials about regexes and found out about look-behinds, but even a simple construction like this (just for testing):
Pattern p2 = Pattern.compile("(?<=abc)aa");
Matcher m2 = p2.matcher("abcaa");
doesn't work.
Can you tell me what's wrong?
Another problem: I want a character to be allowed only when it is THE FIRST character in the string; otherwise it shouldn't occur. So the input:
0043 022 234567 should not match, but 022 123450 should match.
I'm stuck right now and would appreciate any help.
This should work just fine. The spaces are moved into the optional groups and are themselves optional. This way, they only match if the group before them is present, but even then they are still optional. No look-behind required.
(?:(?:(?:00|\+)?([1-9][0-9]) ?)?([1-9][0-9]) ?)?([1-9][0-9]{5})
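Here is a quick check of that pattern against the inputs from the question. This is only a sketch: the class name PhoneDemo and the method isPhone are invented for the example, and I'm assuming the whole string has to be a phone number, so it uses matches().

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PhoneDemo {
    // The spaces live inside the optional groups, so a space can only
    // appear directly after a number group that actually matched.
    private static final Pattern PHONE = Pattern.compile(
            "(?:(?:(?:00|\\+)?([1-9][0-9]) ?)?([1-9][0-9]) ?)?([1-9][0-9]{5})");

    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"0022 45 432345", "45 345678", "560032", " 324400"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isPhone(s));
        }
    }
}
```

The first three inputs are accepted; the one with the leading space is rejected.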
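A lookbehind is a zero-length match: it consumes no characters itself.
According to the javadoc, the Matcher.matches method determines whether the whole String matches the pattern. Here it cannot, because the match would have to start at index 0, where there is nothing before the position for (?<=abc) to see.
What you're looking for is something like the Matcher.find and Matcher.group methods: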
final Pattern pattern = Pattern.compile("(?<=abc)aa");
// Same input as in the question
final Matcher matcher = pattern.matcher("abcaa");
final String subMatch;
if (matcher.find()) {
    subMatch = matcher.group();
} else {
    subMatch = "";
}
System.out.println(subMatch);
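To make the difference between the two methods concrete, here is a minimal sketch using the pattern and input from the question. The class LookbehindDemo and its helpers are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class LookbehindDemo {
    private static final Pattern P = Pattern.compile("(?<=abc)aa");

    // matches() requires the pattern to cover the entire input.
    static boolean wholeInputMatches(String s) {
        return P.matcher(s).matches();
    }

    // find() scans for the pattern anywhere; returns the matched text or null.
    static String firstMatch(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("abcaa")); // false: "abc" is not consumed by the lookbehind
        System.out.println(firstMatch("abcaa"));        // aa
    }
}
```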

Regex capturing group doesn't recognise group(1) despite matches() true

I'm writing some simple (I thought) regex in Java to remove an asterisk or ampersand which occurs directly next to some specified punctuation.
This was my original code:
String ptr = "\\s*[\\*&]+\\s*";
String punct1 = "[,;=\\{}\\[\\]\\)]"; //need two because bracket rules different for ptr to left or right
String punct2 = "[,;=\\{}\\[\\]\\(]";
out = out.replaceAll(ptr+"("+punct1+")|("+punct2+")"+ptr,"$1");
Which instead of just removing the "ptr" part of the string, removed the punct too! (i.e. replaced the matched string with an empty string)
I examined further by doing:
String ptrStr = ".*"+ptr+"("+punct1+")"+".*|.*("+punct2+")"+ptr+".*";
Matcher m_ptrStr = Pattern.compile(ptrStr).matcher(out);
and found that:
m_ptrStr.matches() //returns true, but...
m_ptrStr.group(1) //returns null??
I have no idea what I'm doing wrong as I've used this exact method before with far more complicated regex and group(1) has always returned the captured group. There must be something I haven't been able to spot, so.. any ideas?
The problem is that you have an alternation with a capturing group on each side:
(regex1)|(regex2)
The matcher will start and search for a match using the first alternation; if not found, it will try the second alternation.
However, those are still two groups, and only one will match. The one which will not match will return null, and this is what happens to you here.
You therefore need to test both groups; since you have a match, at least one will not be null.
When you have | in your pattern, that means that the matcher is allowed to match one of two patterns. Whichever one it matches, any capture groups for the pattern it matches will return the substrings--but any capture groups for the other pattern will return null, because the other pattern wasn't really matched.
It looks like your pattern is
.*\s*[\*&]+\s*([,;=\{}\[\]\)]).*|.*([,;=\{}\[\]\(])\s*[\*&]+\s*.*
------------- left -------------- ------------- right -------------
If matches() returns true, then either your string matched the "left" pattern, in which case group(1) will be non-null and group(2) will be null; or else it matched the "right" pattern, in which case group(1) will be null and group(2) non-null. [Note: The matcher will not try to find out if both sides are successful matches. That is, if the left side matches, it won't check the right side.]
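Applied to the original replaceAll call, one possible fix is to put both groups into the replacement. This is a sketch, not the asker's actual code: the class PointerCleanupDemo, the method names and the sample input are assumptions. It relies on the fact that Java's replaceAll substitutes an empty string for a group that did not participate in the match.

```java
// Hypothetical demo class; names and sample input are illustrative only.
class PointerCleanupDemo {
    static final String PTR = "\\s*[\\*&]+\\s*";
    static final String PUNCT1 = "[,;=\\{}\\[\\]\\)]"; // punctuation to the right of ptr
    static final String PUNCT2 = "[,;=\\{}\\[\\]\\(]"; // punctuation to the left of ptr
    static final String REGEX = PTR + "(" + PUNCT1 + ")|(" + PUNCT2 + ")" + PTR;

    // Original behaviour: only group 1 is kept, so left-hand punctuation is lost.
    static String stripKeepingGroup1(String in) {
        return in.replaceAll(REGEX, "$1");
    }

    // Only one of the two groups is set per match; the other contributes nothing.
    static String strip(String in) {
        return in.replaceAll(REGEX, "$1$2");
    }

    public static void main(String[] args) {
        String in = "foo(* a, b &);";
        System.out.println(stripKeepingGroup1(in)); // fooa, b);
        System.out.println(strip(in));              // foo(a, b);
    }
}
```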

Java Regex to check "=number", ex "=5455"?

I want to check a string that matches the format "=number", ex "=5455".
As long as the first char is "=" and the rest of the string consists only of digits [0-9] (a dot is not allowed), it should pop up a "correct" message.
if (str.matches("^[=][0-9]+")) {
    Window.alert("correct");
}
So, is this ^[=][0-9]+ the correct one?
If it is not correct, can you provide a correct solution?
If it is correct, can you find a better solution?
I'm no big regex expert and more knowledgeable people than me might correct this answer, but:
I don't think there's a point in using [=] rather than simply = - the [...] block is used to declare multiple choices, why declare a multiple choice of one character?
I don't think you need to use ^ (if your input string contains any character before =, it won't match anyway). I'm unsure as to whether its presence makes your regex faster, slower or has no effect.
In conclusion, I'd use =[0-9]+
That should be correct: it looks for an = sign anchored at the beginning, followed by one or more digits 0-9.
Your regex will work, even though it can be simplified:
.matches() is somewhat misnamed: it succeeds only if the entire input matches the regex, so the beginning-of-input anchor is not needed;
you don't need the character class around the =.
Therefore:
if (str.matches("=[0-9]+")) { ... }
If you want to match a string which only begins with that regex, you have to use a Pattern, a Matcher and .find():
final Pattern p = Pattern.compile("^=[0-9]+");
final Matcher m = p.matcher(str);
if (m.find()) { ... }
And finally, Matcher also has .lookingAt() which anchors the regex only at the beginning of the input.
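The three options can be compared with a short sketch. The class AnchorDemo and the method names are made up for the example; the patterns are the ones from this answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class AnchorDemo {
    // Whole input must be "=" followed by digits.
    static boolean whole(String s) {
        return s.matches("=[0-9]+");
    }

    // Input must begin with "=digits"; anything may follow. Uses find() with ^.
    static boolean prefixWithFind(String s) {
        return Pattern.compile("^=[0-9]+").matcher(s).find();
    }

    // Same check via lookingAt(), which anchors at the start without needing ^.
    static boolean prefixWithLookingAt(String s) {
        return Pattern.compile("=[0-9]+").matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"=5455", "=5455abc", "x=5455"}) {
            System.out.println(s + ": " + whole(s) + " "
                    + prefixWithFind(s) + " " + prefixWithLookingAt(s));
        }
    }
}
```

Only "=5455" passes all three checks; "=5455abc" passes the two prefix checks but not matches(); "x=5455" passes none.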

java regex with preceding and trailing (.*) slow

I noticed that when I match a regular expression like the following one against a text, it is a lot slower than the same expression without the preceding and trailing (.*) parts. I did the same in Perl and found that there it hardly makes a difference. Is there any way to optimize the original regular expression "(.*)someRegex(.*)" for Java?
// Slow variant: with leading and trailing (.*)
Pattern p = Pattern.compile("(.*)someRegex(.*)");
Matcher m = p.matcher("some text");
m.matches();
// Fast variant: without them
Pattern p = Pattern.compile("someRegex");
Matcher m = p.matcher("some text");
m.matches();
Edit:
Here is a concrete example:
(.*?)<b>\s*([^<]*)\s*<\/b>(.*)
Your best bet is to skip trying to match the front and end of the string at all. You must do that if you use the matches() method, but you don't if you use the find() method. That's probably what you want instead.
Pattern p = Pattern.compile("<b>\\s*([^<]*)\\s*<\\/b>");
Matcher m = p.matcher("some <b>text</b>");
m.find();
You can use start() and end() to find the indexes within the source string containing the match. You can use group(1) to find the contents of the () capture within the match (i.e., the text inside the bold tag); group() with no argument returns the whole match.
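Put together, that could look like the following sketch. The class BoldTagDemo and the describe helper are hypothetical; the pattern and input are the ones shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class BoldTagDemo {
    private static final Pattern BOLD = Pattern.compile("<b>\\s*([^<]*)\\s*<\\/b>");

    // Returns "start..end:captured text" for the first bold tag, or null if none is found.
    static String describe(String html) {
        Matcher m = BOLD.matcher(html);
        if (!m.find()) {
            return null;
        }
        // start()/end() delimit the whole match; group(1) is the first capture.
        return m.start() + ".." + m.end() + ":" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(describe("some <b>text</b>")); // 5..16:text
    }
}
```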
In my experience, using regular expressions to process HTML is very fragile and works well in only the most trivial cases. You might have better luck using a full blown XML parser instead, but if this is one of those trivial cases, have at it.
Original Answer: Here is my original answer sharing why a .* at the beginning of a match will perform so badly.
The problem with using .* at the front is that it will cause lots of backtracking in your match. For example, consider the following:
Pattern p = Pattern.compile("(.*)ab(.*)");
Matcher m = p.matcher("aaabaaa");
m.matches();
The match will proceed like this:
The matcher will attempt to suck the whole string, "aaabaaa", into the first .*, but then tries to match a and fails.
The matcher will back up and match "aaabaa", then tries to match a and succeeds, but tries to match b and fails.
The matcher will back up and match "aaaba", then tries to match a and succeeds, but tries to match b and fails.
The matcher will back up and match "aaab", then tries to match a and succeeds, but tries to match b and fails.
The matcher will back up and match "aaa", then tries to match a and fails.
The matcher will back up and match "aa", then tries to match a and succeeds, tries b and succeeds, and then matches "aaa" to the final .*. Success.
You want to avoid a really broad match toward the beginning of your pattern matches whenever possible. Without knowing your actual problem, it would be very difficult to suggest something better.
Update: Anirudha suggests using (.*?)ab(.*) as a possible fix to avoid backtracking. This will short circuit backtracking to some extent, but at the cost of trying to apply the next match on each try. So now, consider the following:
Pattern p = Pattern.compile("(.*?)ab(.*)");
Matcher m = p.matcher("aaabaaa");
m.matches();
It will proceed like this:
The matcher will attempt to match nothing, "", into the first .*?, tries to match a and succeeds, but fails to match b.
The matcher will attempt to match the first letter, "a", into the first .*?, tries to match a and succeeds, but fails to match b.
The matcher will attempt to match the first two letters, "aa", into the first .*?, tries to match a and succeeds, tries to match b and succeeds, and then slurps up the rest into .*, "aaa". Success.
There aren't any backtracks this time, but we still have a more complicated matching process for each forward move within .*?. This may be a performance gain for a particular match or a loss if iterating through the match forward happens to be slower.
This also changes the way the match will proceed. The .* match is greedy and tries to match as much as possible, whereas .*? is more conservative.
For example, the string "aaabaaabaaa".
The first pattern, (.*)ab(.*) will match "aaabaa" to the first capture and "aaa" to the second.
The second pattern, (.*?)ab(.*) will match "aa" to the first capture and "aaabaaa" to the second.
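The difference in captures is easy to check with a short sketch. The class GreedyLazyDemo and the capture helper are invented names for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GreedyLazyDemo {
    // Returns {group 1, group 2} when the whole input matches, otherwise null.
    static String[] capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String input = "aaabaaabaaa";
        String[] greedy = capture("(.*)ab(.*)", input);
        String[] lazy = capture("(.*?)ab(.*)", input);
        // Greedy splits at the last "ab", lazy at the first one.
        System.out.println(greedy[0] + " | " + greedy[1]); // aaabaa | aaa
        System.out.println(lazy[0] + " | " + lazy[1]);     // aa | aaabaaa
    }
}
```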
Instead of doing "(.*)someRegex(.*)", why not just split the string on "someRegex" and get the parts from the resulting array? As long as someRegex occurs only once in the input, this gives you essentially the same parts, and it is much faster and simpler. Java supports splitting by regex if you need it - http://www.regular-expressions.info/java.html
. matches every character.
Instead of ., try limiting your search by using classes like \w or \s.
But I don't guarantee that it will run fast.
It all depends on the amount of text you are matching!

How to find the exact word using a regex in Java?

Consider the following code snippet:
String input = "Print this";
System.out.println(input.matches("\\bthis\\b"));
Output
false
What could be possibly wrong with this approach? If it is wrong, then what is the right solution to find the exact word match?
PS: I have found a variety of similar questions here but none of them provide the solution I am looking for.
Thanks in advance.
When you use the matches() method, it is trying to match the entire input. In your example, the input "Print this" doesn't match the pattern because the word "Print" isn't matched.
So you need to add something to the regex to match the initial part of the string, e.g.
.*\\bthis\\b
And if you want to allow extra text at the end of the line too:
.*\\bthis\\b.*
Alternatively, use a Matcher object and use Matcher.find() to find matches within the input string:
Pattern p = Pattern.compile("\\bthis\\b");
Matcher m = p.matcher("Print this");
m.find();
System.out.println(m.group());
Output:
this
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.
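A loop over find() could look like this minimal sketch. The class WordFinder, the method starts and the sample sentence are assumptions for illustration; the word is quoted with Pattern.quote so that it is treated literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class WordFinder {
    // Returns the start index of every whole-word occurrence of word in text.
    static List<Integer> starts(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        List<Integer> result = new ArrayList<>();
        // Each successful find() resumes scanning after the previous match.
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // "thistle" is skipped because there is no word boundary after "this".
        System.out.println(starts("this and this, not thistle", "this")); // [0, 9]
    }
}
```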
Full example method for matcher:
public static final String REGEX_FIND_WORD = "(?i).*?\\b%s\\b.*?";

public static boolean containsWord(String text, String word) {
    String regex = String.format(REGEX_FIND_WORD, Pattern.quote(word));
    return text.matches(regex);
}
Explanation:
(?i) - ignore case
.*? - allow (optionally) any characters before
\b - word boundary
%s - variable to be replaced by String.format (quoted to avoid regex errors)
\b - word boundary
.*? - allow (optionally) any characters after
For a good explanation, see: http://www.regular-expressions.info/java.html
myString.matches("regex") returns true or false depending whether the
string can be matched entirely by the regular expression. It is
important to remember that String.matches() only returns true if the
entire string can be matched. In other words: "regex" is applied as if
you had written "^regex$" with start and end of string anchors. This
is different from most other regex libraries, where the "quick match
test" method returns true if the regex can be matched anywhere in the
string. If myString is abc then myString.matches("bc") returns false.
bc matches abc, but ^bc$ (which is really being used here) does not.
This writes "true":
String input = "Print this";
System.out.println(input.matches(".*\\bthis\\b"));
You may use groups to find the exact word. Regex API specifies groups by parentheses. For example:
A(B(C))D
This expression gives you three groups, indexed from 0. Group 0 is always the whole match; the other two are the capturing groups defined by the parentheses.
0th group - ABCD
1st group - BC
2nd group - C
So if you need to find some specific word, you can use two methods of the Matcher class: find() to locate the text matched by the regex, and then group(int) to get the String captured by a given group number:
String statement = "Hello, my beautiful world";
Pattern pattern = Pattern.compile("Hello, my (\\w+).*");
Matcher m = pattern.matcher(statement);
m.find();
System.out.println(m.group(1));
The code above prints "beautiful".
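The group numbering for A(B(C))D can be checked the same way. This is just a sketch; the class GroupDemo and the groups helper are hypothetical names. Note that groupCount() does not include group 0, so it returns 2 here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GroupDemo {
    // Returns group 0 (the whole match) followed by every capturing group.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("A(B(C))D", "ABCD");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + " = " + g[i]);
        }
    }
}
```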
Is your searchString going to be a regular expression? If not, simply use String.contains(CharSequence s), keeping in mind that it does a plain substring search and ignores word boundaries.
System.out.println(input.matches(".*\\bthis$"));
This also works. Here the .* matches everything before the word (including the space), and then this must appear as a whole word at the end of the input.
