Java regex with a positive look behind of a negative look ahead - java

I am trying to extract, from a string like ou=persons,ou=(.*),dc=company,dc=org, the last substring immediately preceded by a comma that is not followed by (.*). For this example, the result should be dc=company,dc=org.
Reading up on regex, this looks like a positive lookbehind (preceded by) containing a negative lookahead.
So I arrived at this regex: (?<=(,(?!.*\Q(.*)\E))).*, but it returns ,dc=company,dc=org, with the comma. I want the same thing without the comma. What am I doing wrong?

The comma appears because the capturing group contains it.
You can make the outer capturing group non-capturing with (?:):
(?<=(?:,(?!.*\Q(.*)\E))).*

It seems I have solved the problem myself by removing the capturing group around the comma and the negative lookahead. That gives the following regex: (?<=,(?!.*\Q(.*)\E)).*.
This is related to the behavior of capturing groups in lookarounds, as explained at http://www.regular-expressions.info/lookaround.html in the section "Lookaround Is Atomic".
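As a quick check of the final pattern, here is a minimal sketch that runs it against the sample string from the question. The class and method names (LastDnSuffix, suffixAfterWildcard) are made up for illustration; only java.util.regex is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastDnSuffix {
    // The lookbehind requires a comma immediately before the match start.
    // The negative lookahead nested inside it is evaluated just after that
    // comma and rejects positions where a literal "(.*)" still appears later.
    private static final Pattern SUFFIX =
            Pattern.compile("(?<=,(?!.*\\Q(.*)\\E)).*");

    // Returns the matched suffix, or null when nothing matches.
    static String suffixAfterWildcard(String dn) {
        Matcher m = SUFFIX.matcher(dn);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(
                suffixAfterWildcard("ou=persons,ou=(.*),dc=company,dc=org"));
    }
}
```

Because the comma sits inside the lookbehind, it is checked but never consumed, so it should not be part of group().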

Related

Java Regex Match Pattern Groups unexpectedly matched [duplicate]

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?
The issue you are facing with this: (command1|command2|command3).+; is that the + is greedy, meaning that it will match as much as possible, up to the last semicolon in the input.
To fix this, you will need to make it non-greedy, and to do that you need to add the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies to the * operator. Adding a ? will make it non-greedy.
Tell it to find only non-semicolons.
[^;]+
What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier will make it match as little as possible, instead of as much as possible, which it does by default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation
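To compare the three suggestions side by side in Java, here is a small sketch. The input string and the names FirstSemicolon and firstMatch are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstSemicolon {
    // Returns the first match of the regex in the input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "command1 foo; command2 bar; done";
        // Greedy: runs on to the last semicolon in the input.
        System.out.println(firstMatch("(command1|command2|command3).+;", input));
        // Lazy: stops at the first semicolon.
        System.out.println(firstMatch("(command1|command2|command3).+?;", input));
        // Negated class: cannot step over a semicolon at all.
        System.out.println(firstMatch("(command1|command2|command3)[^;]+;", input));
    }
}
```

The lazy and negated-class versions give the same result here. The negated class simply never has to backtrack past a semicolon.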

Any suggestions to match and extract the pattern?

I want to match something like this
$(string).not(string).not(string)
The not(string) can repeat zero or more times, after $(string).
Note that string can be anything, except a nested not(string).
I used the regular expression (\\$\\((.*)\\))((\\.not\\((.*?)\\))*?)(?!(\\.not)). My thinking was that *? would non-greedily match any number of not(string) sequences, and that the lookahead would stop the match at anything that is not not(string), so that I could extract only the part I want.
However, when I tested on the input like
$(string).not(string).not(string).append(string)
group(0) returns the whole string, whereas I only need $(string).not(string).not(string).
Obviously I am still missing or misusing something; any suggestions?
Try this one (escaped for java):
(\\$\\(string\\)(?:(?:\\.not\\(.*?\\))+))
It should capture just the part that you are after. You can test it out (unescaped for java though)
If we assume that parentheses are not nested, you can write something like this:
String p = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";
No need to add a lookahead, since the non-capturing group has a greedy quantifier (so the group is repeated as many times as possible).
If what you called string in your question may be a quoted string with parentheses inside, as in Pshemo's example: $(string).not(".not(foo)").not(string), you can replace each [^)]* with (?:\\s*\"[^\"]*\"\\s*|[^)]*) to ignore characters inside quoted parts.
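Here is a minimal sketch of both forms of that pattern, run against the input from the question. The names NotChainExtractor, SIMPLE, ARG, QUOTED and extract are just for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NotChainExtractor {
    // Simple form: assumes parentheses are never nested.
    static final String SIMPLE = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";

    // One argument that may be a double-quoted string, so that a ")"
    // inside the quotes does not end the argument early.
    static final String ARG = "(?:\\s*\"[^\"]*\"\\s*|[^)]*)";
    static final String QUOTED =
            "\\$\\(" + ARG + "\\)(?:\\.not\\(" + ARG + "\\))*";

    // Returns the first match of the regex in the input, or null if none.
    static String extract(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(SIMPLE,
                "$(string).not(string).not(string).append(string)"));
        System.out.println(extract(QUOTED,
                "$(string).not(\".not(foo)\").not(string).append(x)"));
    }
}
```

In both cases the match should stop before .append(...), because the repeated group only accepts .not(...) segments.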
From here, "group zero denotes the entire pattern". Use group(1).
(\$\([\w ]+\))(\.not\([\w ]+\))*
This will also work. It gives you two groups: one consisting of the word with the $ sign, the other capturing a ".not" part (note that a repeated capturing group retains only its last repetition).
Please note: you might have to add escape characters for Java.

Recursive checking a string for repeating pattern?

Sorry if this is a duplicate, but I couldn't find anything close.
I want to check a string recursively for the following pattern:
[a-z0-9][:][a-z0-9][&][a-z0-9][:][a-z0-9]...
Example:
foo:bar&foo:bar1&foo:bar&foo:111&bar:2A2...
Is this possible with a regex, and if so, can anyone show me a regular expression for it?
If there is an efficient Java method for this, that would also be good.
Assuming that you want to match the whole string:
(\w+:\w+(?:&\w+:\w+)*)
Just put the pattern inside a group with a preceding & and then make it repeat zero or more times.
^[a-z0-9]+:[a-z0-9]+(?:&[a-z0-9]+:[a-z0-9]+)*$
Anchors aren't needed if you use the matches method.
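A minimal sketch of that in Java, using Matcher.matches() so the whole string has to fit the pattern without ^ and $. The names PairListCheck and isPairList are hypothetical.

```java
import java.util.regex.Pattern;

final class PairListCheck {
    // One key:value pair, followed by zero or more "&key:value" pairs.
    private static final Pattern PAIRS =
            Pattern.compile("[a-z0-9]+:[a-z0-9]+(?:&[a-z0-9]+:[a-z0-9]+)*");

    // matches() succeeds only if the entire input fits the pattern.
    static boolean isPairList(String s) {
        return PAIRS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPairList("foo:bar&foo:bar1&foo:111"));
        System.out.println(isPairList("foo:bar&"));
        System.out.println(isPairList("foo:bar&bar:2A2"));
    }
}
```

Note that [a-z0-9] is lowercase only, so a value like 2A2 from the question's example would be rejected unless the class is widened (for instance to [a-zA-Z0-9]) or the pattern is compiled with Pattern.CASE_INSENSITIVE.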
If you want to match value:value& as a sole element multiple times,
(([a-z0-9]+:)([a-z0-9]+&))+
NOTE: It won't match value:value&value:, value&value&value:, etc.

Java Regex - Finding specific string within a String

I am trying to match a string that start with the set word "hotel", then a hyphen, then a word of any length, then another hyphen and finally a number of any length.
Edit: Dima gave the solution I needed in the comments of this question! Thanks Dima.
Further edit: elaborating on Dima's answer, adding capturing groups making it easier to retrieve the information entered, and correcting the last bit to only accept digits:
^hotel-(.+)-(\d+)
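For retrieving the two captured parts in Java, something along these lines should work. This is only a sketch: HotelCode and parse are invented names, and matches() is used so that the whole input must fit the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HotelCode {
    // Group 1: the word between the hyphens; group 2: the trailing digits.
    private static final Pattern HOTEL = Pattern.compile("^hotel-(.+)-(\\d+)");

    // Returns {name, number}, or null if the input does not fit the pattern.
    static String[] parse(String s) {
        Matcher m = HOTEL.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("hotel-paris-42");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

Since .+ is greedy and also matches hyphens, an input such as hotel-grand-paris-7 puts grand-paris in group 1. Use [^-]+ instead if the middle part must not contain a hyphen.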
^hotel-(.)*$
(But hotel-something WILL work, according to your initial statement).
So, if you actually want something like:
hotel-XXXXXX-YYYYYYY
Then the regex is :
^hotel-(.)*-(.)*$
Try a regex online tester like http://www.regextester.com/.
If you want to match the start of the input, you use ^.
So if you have ^hotel-\b, that will force hotel to be at the start of the string.
As a note, you can use $ for the end of the string in a similar way.
\bhotel-[^\s-]+-[^\s-]+\b
\b means that there should be a word boundary
[^\s-] means anything but - or whitespace
https://regex101.com/r/mH3vY8/1
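The word-boundary version is handy when the token is embedded in a longer text. Here is a rough sketch that collects every occurrence; HotelTokenFinder, findAll and the sample sentence are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HotelTokenFinder {
    // \b keeps the token from starting or ending in the middle of a word;
    // [^\s-]+ is a run of characters that are neither whitespace nor "-".
    private static final Pattern TOKEN =
            Pattern.compile("\\bhotel-[^\\s-]+-[^\\s-]+\\b");

    // Collects all non-overlapping matches, in order of appearance.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Book hotel-paris-42 or hotel-rome-7 today"));
    }
}
```

Unlike the anchored patterns above, this one accepts any non-whitespace, non-hyphen characters in the last part, not just digits.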

Regex why does negative lookahead not work when there are two groups here

When I tried this regex
\"(\S\S+)\"(?!;c)
on the string "MM:";d, it matched, as I wanted,
and on the string "MM:";c, it did not match, as desired.
But when I add a second group, moving the semicolon inside that group and making it optional using |,
\"(\S\S+)\"(;|)(?!c)
the string "MM:";c matches, when I expected it not to, as before.
I tried this on Java and then on Javascript using Regex tool debuggex:
This link contains a snippet of the above
What am I doing wrong?
Note that the | is there so that the semicolon is not required. Also, the c in these examples is just a stand-in for a word, which is why I am using a negative lookahead.
After following Holger's suggestion to use a possessive quantifier,
\"(\S\S+)\";?+(?!c)
it worked, here is a link to it on RegexPlanet
I believe the regex engine will do whatever it can to find a match. Since your expression says the semicolon is optional, it found that it could match the entire expression: if the semicolon is not consumed by the optional group, it is the semicolon, not the c, that the negative lookahead sees, so the lookahead succeeds. This is a consequence of backtracking: the engine keeps trying alternatives until it finds a match.
In other words, the process goes like this:
"MM:" - matched
(;|) - try semicolon? matched
(?!c) - oops, the next character is c, so the negative lookahead fails. No match. Go back
(;|) - try nothing. We still have ';c' left to match
(?!c) - the next character is ;, not c, so the negative lookahead succeeds. We have a match
An update (based on your comment). The following code may work better:
\"(\S\S+)\"(;|)((?!c)|(?!;c))
The problem is that you don’t want the semicolon to be optional in the regular-expression sense. An optional semicolon implies that the matcher is allowed to try both, matching with or without it. So even if the semicolon is there, the matcher can ignore it, creating an empty match for the group and letting the lookahead succeed.
But you want to consume the semicolon if it’s there, so it is not allowed to be used to satisfy the negative look-ahead. With Java’s regex engine that’s pretty easy: use ;?+
This is called a “possessive quantifier”. Like with the ? the semicolon doesn’t need to be there but if it’s there it must match and cannot be ignored. So the regex engine has no alternatives any more.
So the entire pattern looks like \"(\S\S+)\";?+(?!c) or \"(\S\S+)\"(;?+)(?!c) if you need the semicolon in a group.
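The difference between the two patterns can be seen with a small Java sketch. The class name PossessiveSemicolon and the helper found are made up; the patterns and test strings are the ones from the question.

```java
import java.util.regex.Pattern;

final class PossessiveSemicolon {
    // Alternation version from the question: the empty alternative lets the
    // engine backtrack and leave the semicolon unconsumed.
    static final Pattern OPTIONAL =
            Pattern.compile("\"(\\S\\S+)\"(;|)(?!c)");

    // Possessive version: once ;?+ has consumed the semicolon it does not
    // give it back, so the lookahead is tested after the semicolon.
    static final Pattern POSSESSIVE =
            Pattern.compile("\"(\\S\\S+)\";?+(?!c)");

    // True if the pattern matches anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String withC = "\"MM:\";c";
        String withD = "\"MM:\";d";
        System.out.println(found(OPTIONAL, withC));   // unwanted match
        System.out.println(found(POSSESSIVE, withC)); // rejected
        System.out.println(found(POSSESSIVE, withD)); // still accepted
    }
}
```

An atomic group, (?>;?), would be another way to get the same no-backtracking behavior in engines that lack possessive quantifiers.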
