Java Regex Match Pattern Groups unexpectedly matched [duplicate]

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?

The issue you are facing with (command1|command2|command3).+; is that the + is greedy: .+ consumes as much as it can and backtracks only as far as it must, so the match runs to the last semicolon in the input rather than the first.
To fix this, you will need to make it non-greedy, and to do that you need to add the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies for the * operator. Adding a ? will make it non greedy.
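To make the difference concrete, here is a minimal sketch that runs both patterns against the same input. The class name, the helper firstMatch and the sample input are made up for illustration; only java.util.regex is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns the first match of regex in input, or null if there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "command1 foo; command2 bar;"; // hypothetical sample input

        // Greedy .+ runs to the last semicolon.
        System.out.println(firstMatch("(command1|command2|command3).+;", input));
        // prints: command1 foo; command2 bar;

        // Reluctant .+? stops at the first semicolon.
        System.out.println(firstMatch("(command1|command2|command3).+?;", input));
        // prints: command1 foo;
    }
}
```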

Tell it to find only non-semicolons.
[^;]+
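As a rough sketch of how the negated character class could be used to pull out every command in a string (the class name, method name and sample input are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommandFinder {
    // [^;]+ can never run past a semicolon, so no backtracking over ';' is needed.
    private static final Pattern COMMAND =
            Pattern.compile("(command1|command2|command3)[^;]+;");

    // Collects every command found in the input, in order of appearance.
    public static List<String> findCommands(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = COMMAND.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findCommands("command1 foo; command2 bar;"));
        // prints: [command1 foo;, command2 bar;]
    }
}
```

One difference from .+? worth keeping in mind: [^;] also matches line terminators, whereas . does not unless Pattern.DOTALL is set, so this version lets a command span several lines.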

What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier makes it match as little as possible, instead of as much as possible, which is the default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation

Related

Any suggestions to match and extract the pattern?

I want to match something like this
$(string).not(string).not(string)
The not(string) can repeat zero or more times, after $(string).
Note that the string can be whatever things, except nested not(string).
I used the regular expression (\\$\\((.*)\\))((\\.not\\((.*?)\\))*?)(?!(\\.not)), I think the *? is to non-greedily match any number of sequence of not(string), and use the lookahead to stop the match that is not not(string), so that I can extract only the part that I want.
However, when I tested on the input like
$(string).not(string).not(string).append(string)
group(0) returns the whole string, whereas I only need $(string).not(string).not(string).
Obviously I am still missing or misusing something. Any suggestions?

Try this one (escaped for java):
(\\$\\(string\\)(?:(?:\\.not\\(.*?\\))+))
It should capture just the part that you are after.

If we assume that parenthesis are not nested, you can write something like this:
String p = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";
There is no need to add a lookahead, since the non-capturing group has a greedy quantifier (so the group is repeated as many times as possible).
If what you called string in your question may be a quoted string with parentheses inside, as in Pshemo's example $(string).not(".not(foo)").not(string), you can replace each [^)]* with (?:\\s*\"[^\"]*\"\\s*|[^)]*) to ignore characters inside quoted parts.
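A small sketch of the non-nested pattern applied to the input from the question. The class and method names are hypothetical; the pattern is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NotChainExtractor {
    // $(...) followed by zero or more .not(...), assuming parentheses are not nested.
    private static final Pattern CHAIN =
            Pattern.compile("\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*");

    // Returns the first $(...).not(...)... chain in the input, or null if none is found.
    public static String extract(String input) {
        Matcher m = CHAIN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("$(string).not(string).not(string).append(string)"));
        // prints: $(string).not(string).not(string)
    }
}
```

The greedy * on the group stops on its own as soon as the next segment is not .not(...), which is why the trailing .append(string) is left out.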

From here, "group zero denotes the entire pattern". Use group(1).

(\$\([\w ]+\))(\.not\([\w ]+\))*
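This will also work. It gives you two groups: one containing the part with the $ sign, and one for the ".not" parts. Note that a repeated capturing group keeps only its last repetition, so to capture all of the ".not" strings together, wrap the repetition in an outer group: ((?:\.not\([\w ]+\))*)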
Please note: You might have to add escape characters for java.

Regex why does negative lookahead not work when there are two groups here

when I tried this regex
\"(\S\S+)\"(?!;c)
on this string "MM:";d it comes as matched as I wanted
and on this string "MM:";c it comes as not matched as desired.
But when I add a second group, by moving the semicolon inside that group and making it optional using |
\"(\S\S+)\"(;|)(?!c)
for this string "MM:";c it comes as matched, when I expected it not to match, as before.
I tried this in Java and then in JavaScript using the regex tool Debuggex.
What am I doing wrong?
Note that the | is there so that the semicolon is not required. Also, the c in the examples is just a stand-in for a word, which is why I am using a negative lookahead.
After following Holger's suggestion to use a possessive quantifier,
\"(\S\S+)\";?+(?!c)
it worked.

I believe that the regex will do what it can to find a match. Since your expression says the semicolon is optional, the engine found that it could match the entire expression: if the semicolon is not consumed by the second group, the next character is ';' rather than 'c', so the negative lookahead succeeds. This has to do with the backtracking way that regex works: it keeps trying alternatives until it finds a match...
In other words, the process goes like this:
"MM:" - matched
(;|) - try semicolon? matched
(?!c) - oops - next character is 'c', so the negative lookahead fails. No match. Go back
(;|) - try nothing. We still have ';c' left to match
(?!c) - next character is ';', not 'c', so the negative lookahead succeeds. We have a match
An update (based on your comment). The following may work better. Both lookaheads have to hold at the same position, so they are written one after the other rather than joined with |:
\"(\S\S+)\"(;|)(?!c)(?!;c)

The problem is that you don’t want to make the semicolon optional in the sense of regular expression. An optional semicolon implies that the matcher is allowed to try both, matching with or without it. So even if the semicolon is there the matcher can ignore it creating an empty match for the group letting the lookahead succeed.
But you want to consume the semicolon if it’s there, so it is not allowed to be used to satisfy the negative look-ahead. With Java’s regex engine that’s pretty easy: use ;?+
This is called a “possessive quantifier”. Like with the ? the semicolon doesn’t need to be there but if it’s there it must match and cannot be ignored. So the regex engine has no alternatives any more.
So the entire pattern looks like \"(\S\S+)\";?+(?!c) or \"(\S\S+)\"(;?+)(?!c) if you need the semicolon in a group.
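The difference can be checked with a short sketch like the following. The class and method names are made up for illustration; the two patterns are the ones from the question and from this answer, written as Java string literals.

```java
import java.util.regex.Pattern;

public class PossessiveSemicolon {
    // Optional semicolon via (;|): the engine may back off and match the empty alternative.
    private static final Pattern BACKTRACKING =
            Pattern.compile("\"(\\S\\S+)\"(;|)(?!c)");

    // Possessive ;?+ : once the semicolon is consumed it is never given back.
    private static final Pattern POSSESSIVE =
            Pattern.compile("\"(\\S\\S+)\";?+(?!c)");

    public static boolean findsWithBacktracking(String input) {
        return BACKTRACKING.matcher(input).find();
    }

    public static boolean findsWithPossessive(String input) {
        return POSSESSIVE.matcher(input).find();
    }

    public static void main(String[] args) {
        String followedByC = "\"MM:\";c"; // the text "MM:";c
        String followedByD = "\"MM:\";d"; // the text "MM:";d

        System.out.println(findsWithBacktracking(followedByC)); // true  (unwanted)
        System.out.println(findsWithPossessive(followedByC));   // false (as intended)
        System.out.println(findsWithPossessive(followedByD));   // true
    }
}
```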

Regular Expression Issue in Java

I have searched everywhere and I cannot find what I am doing wrong.
I have this regular expression: ^(\[\[).+(\]\]) that I want to match for this data that starts just at the beginning of the line as shown below (I do not want to match anything but the things starting at the beginning of a line):
[[match this]] [[don't match this]]
{{Link GA|es}}
{{Link FA|ca}}
And for some reason it is not matching anything in Java (or in other regex "testers" such as regexpal.com). By "in Java" I mean with the String.replaceAll(String regex, String replacement) method in the Java String API.
But, if I omit the ^ and just have (\[\[).+(\]\]) it matches fine at the beginning of the line, but also matches inline instances which I do not want.
Can anyone point out what the error is here? Thank you

^ means "start of string", not "start of line", unless you use the Pattern.MULTILINE (or (?m)) option when building the regex. Also, you should be using a lazy quantifier (as pointed out by Dave Newton in his comment).
Finally, don't forget to double the backslashes:
String result = subject.replaceAll("(?m)^\\[\\[.+?\\]\\]", "");
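A minimal sketch of the effect of (?m), assuming the data sits on a line other than the first (the "intro" line, class name and method names are invented for illustration):

```java
public class LineStartReplace {
    // Without (?m), ^ only matches at the very start of the whole string.
    public static String withoutMultiline(String text) {
        return text.replaceAll("^\\[\\[.+?\\]\\]", "");
    }

    // With (?m), ^ also matches right after each line terminator.
    public static String withMultiline(String text) {
        return text.replaceAll("(?m)^\\[\\[.+?\\]\\]", "");
    }

    public static void main(String[] args) {
        String text = "intro\n[[match this]] [[don't match this]]\n{{Link GA|es}}";

        // Unchanged: the string starts with "intro", not "[[".
        System.out.println(withoutMultiline(text).equals(text)); // true

        // Only the [[...]] at the start of the second line is removed.
        System.out.println(withMultiline(text));
    }
}
```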

.+ is greedy, in that it will match everything it can (here, everything up to the last \]\]).
To stop this behaviour, just add a ? to make it non-greedy:
^\[\[.+?\]\]
Will match [[ then look for any characters until it finds the first occurrence of ]]

(\[\[).+(\]\]){1}+
Here {1}+ means "exactly one time".

Match contents surrounded by optional group in Java regex

I'm having trouble wrapping my head around how a particular Java regex should be written. The regex will be used in a sequence, and will match sections ending with /.
The problem is that using a simple split won't work because the text before the / can optionally be surrounded by ~. If it is, then the text inside can match anything - including / and ~. The key here is the ending ~/, which is the only way to escape this 'anything goes' sequence if it begins with ~.
Because the regex pattern will be used in a sequence (i.e. (xxx)+), I can't use ^ or $ for non-greedy matching.
Example matches:
foo/
~foo~/
~foo/~/
~foo~~/
~foo/bar~/
and some that wouldn't match:
foo~//
~foo~/bar~/
~foo/
foo~/ (see edit 2)
Is there any way to do this without being redundant with my regexes? What would be the best way to think about matching this? Java doesn't have a conditional modifier (?) so that complicated things in my head a bit more.
EDIT: After working on this in the meantime, the regex ((?:\~())?)(((?!((?!\2)/|\~/)).)+)\1/ gets close but #6 doesn't match.
EDIT 2: After Steve pointed out that there is ambiguity, it became clear #6 shouldn't match.

I don't think that this is a solvable problem. From your givens, these are all acceptable:
~foo/~/
~foo/
foo~/
So, now, let's consider this combination:
~foo/foo~/
What happens here? We have combined the second example and the third example to create an instance of the first example. How do you suggest a correct splitting? As far as I can tell, there's no way to tell if we should be taking the entire expression as one or two valid expressions. Hence, I don't think it's possible to break it up accurately based on your listed restrictions.

Java Regex Everything Before and Including Match

I need the regex expression to remove any text before a match and including the match
eg. I want to remove "123S" and everything before it, I know I can do this with
string.replaceAll("^.*?(?=[123S])","");
string.replaceAll("123S","");
But really want to do it in a single expression (can't find another example anywhere!)

You can do it with:
string.replaceAll("^.*123S","");
Removing the non-greedy ? makes .* run up to the last occurrence of 123S, so everything before it is removed as well.

You don't need the look ahead:
"abc123Sdef123Sxyz".replaceAll("^.*?123S","");
This removes everything up to and including the first occurrence only, if that is what you need (output is def123Sxyz).
In case you want to replace up to the last 123S, just remove the ? modifier:
"abc123Sdef123Sxyz".replaceAll("^.*123S","");
Output is xyz.

string.replaceAll("^.*?123S", "");
(?= is the "if followed by" pattern, which you don't want, and [123S] isn't even correct: it is a character class, so it will match a single '2', for instance.
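To illustrate why the character class goes wrong, here is a small sketch comparing the two-step approach from the question with the single expression. The class name, method names and the input "a2b123Sdef" are made up for the example.

```java
public class RemoveThroughMarker {
    // The question's two-step approach: [123S] is a character class,
    // so the lookahead is satisfied by any single '1', '2', '3' or 'S'.
    public static String twoStep(String s) {
        return s.replaceAll("^.*?(?=[123S])", "").replaceAll("123S", "");
    }

    // Single expression: remove everything up to and including the first "123S".
    public static String singleStep(String s) {
        return s.replaceAll("^.*?123S", "");
    }

    public static void main(String[] args) {
        String input = "a2b123Sdef"; // hypothetical input with a stray '2' before the marker

        System.out.println(twoStep(input));    // prints: 2bdef
        System.out.println(singleStep(input)); // prints: def
    }
}
```

In the two-step version the first replacement already stops at the stray '2', so "2b" survives; the single expression only stops at the literal text 123S.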

string.replaceAll("^.*?123S","");
More efficient and improves clarity so someone else knows what you're doing.
