How to make negative lookahead work with end of line text - java

I have a regex like the following:
.{0,1000}(?!(xa7|para(graf))$)
using Java.
I was expecting that it would cause the following text to fail:
blaparagraf
because paragraf is found at the end

That's because .{0,1000} will match the entire subject, hence it's not followed by xa7 or paragraf (it's followed by $ only).
You want negative lookbehind:
.{0,1000}(?<!xa7|paragraf)$

It is a common mistake to misplace assertions. If you want to use lookahead, the pattern is something like this:
^(?!.*paragraph$).*$
This matches (as seen on rubular.com):
something something para
paragraph something something
But doesn't match:
something paragraph
So the key difference here is that we start looking ahead at the beginning of the string, before we match .* (or .{0,1000} in your case). Of course, what we're looking for isn't simply paragraph$, but rather .*paragraph$.
That said, to check that a string doesn't end with something of finite length, lookbehind (when supported) is the most natural solution:
^.*$(?<!paragraph)
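
To see the difference between the misplaced lookahead and the two fixes, here is a minimal Java sketch. The class name, constant names and the accepts helper are made up for illustration, and the leading-lookahead pattern is adapted to the question's xa7/paragraf alternatives rather than copied from the answer above.

```java
import java.util.regex.Pattern;

class EndOfLineLookaround {
    // Original attempt: the lookahead is only tested after .{0,1000} has consumed the input.
    static final Pattern MISPLACED = Pattern.compile(".{0,1000}(?!(xa7|para(graf))$)");
    // Negative lookbehind checked at the end of the input.
    static final Pattern LOOKBEHIND = Pattern.compile(".{0,1000}(?<!xa7|paragraf)$");
    // Negative lookahead placed at the start, before anything is consumed.
    static final Pattern LOOKAHEAD = Pattern.compile("^(?!.*(?:xa7|paragraf)$).*$");

    // True if the whole string is accepted by the pattern.
    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"blaparagraf", "blaxa7", "blapara"}) {
            System.out.println(s
                    + " misplaced=" + accepts(MISPLACED, s)
                    + " lookbehind=" + accepts(LOOKBEHIND, s)
                    + " lookahead=" + accepts(LOOKAHEAD, s));
        }
    }
}
```

With matches(), the misplaced version accepts every sample, while the other two reject the strings that end in paragraf or xa7.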


Regex not matching when the start or end are empty

Here is my regex as I have inputted it into my java file.
String myRegex = "(?<=[^a-zA-Z0-9])(target)(?=[^a-zA-Z0-9])";
If I have a string as follows:
.target. - it works.
However, if I have a string that JUST says target it does not work. How can I modify the regex so that if there is nothing at the start or the end of the string, it still matches?
EDIT - Examples.
_target - Should succeed!
target_ - Should succeed!
target - Should succeed!
Currently these examples fail with the current regex.
Add "start of input" to your look behind and add "end of input" to your look ahead using a regex alternation (ie | which is a logical "or"):
String myRegex = "(?<=^|[^a-zA-Z0-9])target(?=[^a-zA-Z0-9]|$)";
The problem with your regex is that your look behind required there to be a preceding character that was not a letter/digit.
These look arounds also match start/end of input.
See live demo.
The problem is that there are two negatives in play here. My lookarounds can be negative, and my character classes can be negative. Currently, my lookarounds are positive and my character classes are negated. So it's saying: "Look behind and make sure you find something that is not within these classes". So when there is nothing there, it won't find anything and will fail. The solution was to make my lookarounds negative and my character classes positive. So now it's saying: "Look behind and make sure there ISN'T any of these characters". So if there is nothing there, it won't fail, because the condition is met.
This is the final regex:
String myRegex = "(?<![a-zA-Z0-9])target(?![a-zA-Z0-9])";
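
A quick way to compare the three variants quoted so far is a small harness like the one below. It is only a sketch: the class name, constant names and the found helper are invented here, and the sample strings are the ones from the question plus one counterexample.

```java
import java.util.regex.Pattern;

class TargetBoundary {
    // Original: positive lookarounds that require an actual neighbouring character.
    static final Pattern POSITIVE =
            Pattern.compile("(?<=[^a-zA-Z0-9])(target)(?=[^a-zA-Z0-9])");
    // Positive lookarounds extended with start/end of input alternatives.
    static final Pattern WITH_ANCHORS =
            Pattern.compile("(?<=^|[^a-zA-Z0-9])target(?=[^a-zA-Z0-9]|$)");
    // Negative lookarounds over positive character classes.
    static final Pattern NEGATIVE =
            Pattern.compile("(?<![a-zA-Z0-9])target(?![a-zA-Z0-9])");

    // True if the pattern occurs anywhere in the string.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {".target.", "_target", "target_", "target", "xtargetx"}) {
            System.out.println(s
                    + " positive=" + found(POSITIVE, s)
                    + " anchors=" + found(WITH_ANCHORS, s)
                    + " negative=" + found(NEGATIVE, s));
        }
    }
}
```

The last two patterns should agree on every sample; only the original one rejects target when nothing surrounds it.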
If I'm understanding your question correctly, instead of using the look ahead and look behind, you can just use the ? to indicate that there should be 0 or 1 non-alphabetical or numerical character before and after "target".
([^a-zA-Z0-9])?(target)([^a-zA-Z0-9])?
You should be able to match target by using the * (0 or more) quantifier, which allows zero or more occurrences of the characters you want around it. So:
[_]*(target)[_]*
should match:
_target
target
target_
_target_
Add any element you want to be matched before or after the word to the brackets. Example to match .target. too:
[\._]*(target)[\._]*
This will match the target substring no matter where it appears in the string. If you want the rule to match only at the start of the string, add the ^ anchor to it, like:
^[\._]*(target)[\._]*
This will match the ones mentioned above only if they start the string.

Any suggestions to match and extract the pattern?

I want to match something like this
$(string).not(string).not(string)
The not(string) can repeat zero or more times, after $(string).
Note that the string can be whatever things, except nested not(string).
I used the regular expression (\\$\\((.*)\\))((\\.not\\((.*?)\\))*?)(?!(\\.not)). My thinking was that *? non-greedily matches any number of not(string) sequences, and that the lookahead stops the match at anything that is not not(string), so that I can extract only the part that I want.
However, when I tested on the input like
$(string).not(string).not(string).append(string)
group(0) returns the whole string, whereas I only need $(string).not(string).not(string).
Obviously I am still missing or misusing something. Any suggestions?
Try this one (escaped for java):
(\\$\\(string\\)(?:(?:\\.not\\(.*?\\))+))
It should capture just the part that you are after. You can test it out (unescaped for java though)
If we assume that parentheses are not nested, you can write something like this:
String p = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";
There is no need to add a lookahead, since the non-capturing group has a greedy quantifier (so the group is repeated as many times as possible).
If what you called string in your question may be a quoted string with parentheses inside, as in Pshemo's example $(string).not(".not(foo)").not(string), you can replace each [^)]* with (?:\\s*\"[^\"]*\"\\s*|[^)]*) to ignore characters inside quoted parts.
From here, "group zero denotes the entire pattern". Use group(1).
(\$\([\w ]+\))(\.not\([\w ]+\))*
This will also work, it would give you two groups, One consisting of the word with $ sign, another would give you the set of all ".not" strings.
Please note: You might have to add escape characters for java.
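
For reference, here is a minimal sketch of the non-nested variant in use. The class name and the extract helper are invented for illustration, and it assumes, as that answer does, that the text inside the parentheses never contains a closing parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NotChainExtractor {
    // $(...) followed by zero or more .not(...) calls; [^)]* stops at the first ')'.
    static final Pattern CHAIN =
            Pattern.compile("\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*");

    // Returns the first $(...).not(...) chain in the input, or null if there is none.
    static String extract(String input) {
        Matcher m = CHAIN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("$(string).not(string).not(string).append(string)"));
    }
}
```

Because the greedy group simply stops repeating when .append( does not match .not(, group(0) already ends where the chain ends.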

Regex why does negative lookahead not work when there are two groups here

when I tried this regex
\"(\S\S+)\"(?!;c)
on this string "MM:";d it comes as matched as I wanted
and on this string "MM:";c it comes as not matched as desired.
But when I add a second group, by moving the semicolon inside that group and making it optional using |
\"(\S\S+)\"(;|)(?!c)
for this string "MM:";c it comes as matched when I expected it to not like before.
I tried this on Java and then on Javascript using Regex tool debuggex:
This link contains a snippet of the above
What am I doing wrong?
Note that the | is there so that the semicolon is not required. Also, the c in the examples is just a stand-in for a word, which is why I am using negative lookahead.
After following Holger's suggestion of using possessive quantifiers,
\"(\S\S+)\";?+(?!c)
it worked, here is a link to it on RegexPlanet
I believe that the regex engine will do whatever it can to find a match. Since your expression says the semicolon is optional, it found that it could match the entire expression: if the semicolon is not consumed by the second group, the next character is ; rather than c, so the negative lookahead succeeds. This has to do with the way regex backtracks: it keeps trying alternatives until it finds a match.
In other words, the process goes like this:
"MM:" - matched
(;|) - try semicolon? matched
(?!c) - oops, the next character is c, so the negative lookahead fails. No match. Go back
(;|) - try nothing. We still have ';c' left to match
(?!c) - the next character is ;, not c, so the negative lookahead succeeds. We have a match
An update (based on your comment). The following pattern may work better, because the lookahead itself also covers the case where the semicolon was left unconsumed:
\"(\S\S+)\"(;|)(?!;?c)
The problem is that you don’t want to make the semicolon optional in the regular-expression sense. An optional semicolon implies that the matcher is allowed to try both, matching with or without it. So even if the semicolon is there, the matcher can ignore it, creating an empty match for the group and letting the lookahead succeed.
But you want to consume the semicolon if it’s there, so it is not allowed to be used to satisfy the negative look-ahead. With Java’s regex engine that’s pretty easy: use ;?+
This is called a “possessive quantifier”. Like with the ? the semicolon doesn’t need to be there but if it’s there it must match and cannot be ignored. So the regex engine has no alternatives any more.
So the entire pattern looks like \"(\S\S+)\";?+(?!c) or \"(\S\S+)\"(;?+)(?!c) if you need the semicolon in a group.
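
The effect of the possessive quantifier can be checked with a short Java sketch. The class name, constant names and the found helper are made up for illustration; the two patterns are the ones from the question and from the answer above.

```java
import java.util.regex.Pattern;

class OptionalSemicolon {
    // (;|) can backtrack to the empty alternative, so the lookahead ends up looking at ';'.
    static final Pattern BACKTRACKING = Pattern.compile("\"(\\S\\S+)\"(;|)(?!c)");
    // ;?+ keeps the semicolon once it has matched it, so the lookahead always looks past it.
    static final Pattern POSSESSIVE = Pattern.compile("\"(\\S\\S+)\";?+(?!c)");

    // True if the pattern occurs anywhere in the string.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"\"MM:\";c", "\"MM:\";d", "\"MM:\"c", "\"MM:\"d"}) {
            System.out.println(s
                    + " backtracking=" + found(BACKTRACKING, s)
                    + " possessive=" + found(POSSESSIVE, s));
        }
    }
}
```

Only the possessive version rejects "MM:";c, while both still accept the inputs that end in d.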

Java regex with a positive look behind of a negative look ahead

I am trying to extract from this kind of string ou=persons,ou=(.*),dc=company,dc=org the last string immediately preceded by a comma and not followed by (.*). In this case, this should give dc=company,dc=org.
Looking at regex features, this seems to be a positive lookbehind (preceded by) containing a negative lookahead.
So I have come up with this regex: (?<=(,(?!.*\Q(.*)\E))).*, but it returns ,dc=company,dc=org with the comma. I want the same thing without the comma. What am I doing wrong?
The comma appears because the capturing group contains it.
You can make the outer capturing group non-capturing with (?:...):
(?<=(?:,(?!.*\Q(.*)\E))).*
It seems that I have solved my problem alone, removing the capturing group around the negative look ahead. It gives the following regex: (?<=,(?!.*\Q(.*)\E)).*.
It is linked with the behavior of capturing groups in look arounds as explained here: http://www.regular-expressions.info/lookaround.html in the part Lookaround Is Atomic.
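
As a rough check of the final pattern in Java, something like the following can be used. The class name and the suffix helper are hypothetical; the sketch only demonstrates the sample string from the question, where the first comma not followed by a later (.*) is also the one of interest.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastDnSuffix {
    // Lookbehind for a comma whose remainder contains no literal "(.*)";
    // \Q...\E quotes the "(.*)" so it is matched as plain text.
    static final Pattern SUFFIX = Pattern.compile("(?<=,(?!.*\\Q(.*)\\E)).*");

    // Returns the text after the first comma that has no "(.*)" after it, or null.
    static String suffix(String dn) {
        Matcher m = SUFFIX.matcher(dn);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(suffix("ou=persons,ou=(.*),dc=company,dc=org"));
    }
}
```

Since the lookbehind is zero-width, the comma it checks is never part of group(0).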

How to bound +/* for a regex group?

Say I have the regex:
(CC|NP)*
As written, it creates problems inside lookbehind expressions in Java. How should I write it to avoid those problems?
I thought of re-writing it as:
(CC|NP){1,9}
Testing on regexr, it seems like the upper bound is ignored completely.
In Java, those {} quantifiers seem to work only on non-group regex elements, as in:
\w+\[\S{1,9}\]
Sorry, lookbehind patterns usually have restrictions on the subpattern. See, for example, "Why doesn't finite repetition in lookbehind work in some flavors?", or search the web for "lookbehind pattern restrictions".
You may try to write out all fixed-length variants of the lookbehind pattern as an alternation, but there might be many of them.
You may also simulate the lookbehind by matching the inner pattern normally and capturing your actual target in a group: (?:CC|NP)*(.*)
I'm not sure where you perceive the problem. Quantifiers act on groups just like on any other entity.
So, \w+\[\S{1,9}\] could have been written \w+\[(\S){1,9}\] with the same result.
As far as your example on regexr, nothing is broken there. It matches what it's supposed to.
(PUN|CC|NP){1,3} will greedily try to match any of the alternatives (in left-to-right priority). There will be no breaks in what it matches. It matches 1-3 consecutive occurrences of PUN or CC or NP.
The sample string you provided had a space between the CCs, and since the regex contains no space, the space is not matched. The only thing that matches is a single CC.
If you want to account for a space, it can be added to the grouping like this:
(?:(?:PUN|CC|NP)\s*){1,3}
If you want to allow spaces only between the alternatives, it can be done like this:
(?:PUN|CC|NP)(?:\s*(?:PUN|CC|NP)){0,2}
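
To illustrate that the bounds do apply to groups, here is a small Java sketch comparing the three patterns from this answer. The class name, constant names, the first helper and the sample strings are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRuns {
    // 1 to 3 consecutive tags with nothing between them.
    static final Pattern PLAIN = Pattern.compile("(PUN|CC|NP){1,3}");
    // 1 to 3 tags, each optionally followed by whitespace.
    static final Pattern TRAILING_SPACE = Pattern.compile("(?:(?:PUN|CC|NP)\\s*){1,3}");
    // 1 to 3 tags with optional whitespace only between them.
    static final Pattern INNER_SPACE =
            Pattern.compile("(?:PUN|CC|NP)(?:\\s*(?:PUN|CC|NP)){0,2}");

    // Returns the first match of the pattern in the string, or null if there is none.
    static String first(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String spaced = "CC CC NP PUN";
        System.out.println("[" + first(PLAIN, spaced) + "]");
        System.out.println("[" + first(TRAILING_SPACE, spaced) + "]");
        System.out.println("[" + first(INNER_SPACE, spaced) + "]");
        System.out.println("[" + first(PLAIN, "CCNPPUNCC") + "]");
    }
}
```

In each case the match stops after three tags, so the upper bound is respected; the only difference is whether whitespace is allowed and whether a trailing space is included.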
