Java Regular Expression Negative Look Ahead Finding Wrong Match - java

Assume I have the following string.
create or replace package test as
-- begin null; end;/
end;
/
I want a regular expression that will find the semicolon not preceded by a set of "--" double dashes on the same line. I'm using the following pattern "(?!--.*);" and I'm still getting matches for the two semicolons on the 2nd line.
I feel like I'm missing something about negative look aheads but I can't figure out what.

If you want to match semicolons only on the lines which do not start with --, this regex should do the trick:
^(?!--).*(;)
I only made a few changes from your regex:
Multi-line mode, so we can use ^ and $ and search by line
^ at the beginning to indicate start of a line
.* between the negative lookahead and the semicolon, because otherwise with the first change it would try to match something like ^;, which is wrong
(I also added parentheses around the semicolon so the demo page displays the result more clearly, but this is not necessary and you can change to whatever is most convenient for your program.)
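For reference, here is a minimal Java sketch of how the pattern above could be applied with multi-line mode switched on. The class and method names (CommentAwareSemicolons, semicolonOffsets) are made up for illustration and are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class CommentAwareSemicolons {
    // MULTILINE makes ^ match at the start of every line, not only at the start of the input.
    private static final Pattern SEMICOLON =
            Pattern.compile("^(?!--).*(;)", Pattern.MULTILINE);

    /** Returns the index of each semicolon captured by group 1. */
    static List<Integer> semicolonOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = SEMICOLON.matcher(text);
        while (m.find()) {
            offsets.add(m.start(1));
        }
        return offsets;
    }

    public static void main(String[] args) {
        String sql = "create or replace package test as\n"
                + "-- begin null; end;/\n"
                + "end;\n"
                + "/";
        // Only the semicolon on the "end;" line is reported.
        System.out.println(semicolonOffsets(sql));
    }
}
```

Two limits are worth keeping in mind: the pattern only skips lines that start with --, and because .* is greedy it captures just the last semicolon on each remaining line.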

First of all, what you need is a negative lookbehind (?<!) and not a negative lookahead (?!) since you want to check what's behind your potential match.
Even with that, a negative lookbehind alone won't work here, because Java's regex engine does not support lookbehind of unbounded length: the pattern inside (?<!...) must have an obvious maximum length, so you cannot put something like --.* in it.
That said, wouldn't it be simpler in your case to just split your String by linefeed/carriage return and then remove the lines that start with "--"?
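A rough sketch of that split-and-filter idea might look like the following. The class and method names are invented for the example, and it assumes comment lines start with -- in the first column (no leading whitespace).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this example.
class CommentLineFilter {
    /** Returns the lines that do not start with "--", in their original order. */
    static List<String> withoutCommentLines(String text) {
        List<String> kept = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, ...); available since Java 8.
        for (String line : text.split("\\R")) {
            if (!line.startsWith("--")) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String sql = "create or replace package test as\n"
                + "-- begin null; end;/\n"
                + "end;\n"
                + "/";
        // Prints the three lines that are not comments.
        withoutCommentLines(sql).forEach(System.out::println);
    }
}
```

Once the comment lines are gone, a plain search for ; on the remaining lines is enough.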

The reason "(?!--.*);" isn't working is that the negative lookahead is evaluated at the position just before the ;, where it asserts that the next two characters are not --. That assertion succeeds every time, because the next character is the ; itself (and ; is never --).
In Java, to match a ; that doesn't have -- anywhere before it:
"\\G(((?<!--)[^;])*);"
To see this in action using a replaceAll() call:
String s = "foo; -- begin null; end;";
s = s.replaceAll("\\G(((?<!--)[^;])*);", "$1!");
System.out.println(s);
Output:
foo! -- begin null; end;
This shows that only semicolons that come before a double dash are matched.

Related

Java Regex Match Pattern Groups unexpectedly matched [duplicate]

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?
The issue you are facing with this: (command1|command2|command3).+; is that the + is greedy, meaning that it will match as much as possible, so the match runs all the way to the last semicolon in the input.
To fix this, you will need to make it non-greedy, and to do that you need to add the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies for the * operator. Adding a ? will make it non greedy.
Tell it to find only non-semicolons.
[^;]+
What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier will make it match as little as possible, instead of as much as possible, which is what it does by default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation
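To compare the suggestions above side by side in Java, here is a small sketch. The class name, the helper firstMatch, and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class FirstSemicolon {
    /** Returns the first substring matched by regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "command1 a b; command2 c;";
        String prefix = "(command1|command2|command3)";

        // Greedy: runs to the last semicolon.
        System.out.println(firstMatch(prefix + ".+;", input));    // command1 a b; command2 c;
        // Reluctant: stops at the first semicolon.
        System.out.println(firstMatch(prefix + ".+?;", input));   // command1 a b;
        // Negated character class: cannot cross a semicolon at all.
        System.out.println(firstMatch(prefix + "[^;]+;", input)); // command1 a b;
    }
}
```

The reluctant quantifier and the negated class give the same result here. One difference is that . does not match line breaks by default while [^;] does, so the negated class also handles commands that span several lines.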

Regex not matching when the start or end are empty

Here is my regex as I have inputted it into my java file.
String myRegex = "(?<=[^a-zA-Z0-9])(target)(?=[^a-zA-Z0-9])";
If I have a string as follows:
.target. - it works.
However, if I have a string that JUST says target it does not work. How can I modify the regex so that if there is nothing at the start or the end of the string, it still matches?
EDIT - Examples.
_target - Should succeed!
target_ - Should succeed!
target - Should succeed!
Currently these examples fail with the current regex.
Add "start of input" to your lookbehind and "end of input" to your lookahead using a regex alternation (i.e. |, which is a logical "or"):
String myRegex = "(?<=^|[^a-zA-Z0-9])target(?=[^a-zA-Z0-9]|$)";
The problem with your regex is that your look behind required there to be a preceding character that was not a letter/digit.
These look arounds also match start/end of input.
The problem is that there are two places where negation can happen here: the lookarounds can be negative, and the character classes can be negated. Originally, my lookarounds were positive and my character classes were negated, so the pattern said: "Look behind and make sure you find a character that is not in this class". When there is nothing there, it cannot find such a character and fails. The solution was to make the lookarounds negative and the character classes positive. Now it says: "Look behind and make sure there ISN'T one of these characters". If nothing is there, it doesn't fail, because the condition is met.
This is the final regex:
String myRegex = "(?<![a-zA-Z0-9])target(?![a-zA-Z0-9])";
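A quick way to check that final regex against the examples from the question is sketched below; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class WholeWordTarget {
    // Negative lookarounds: succeed at the start/end of input as well as next to a non-alphanumeric.
    private static final Pattern TARGET =
            Pattern.compile("(?<![a-zA-Z0-9])target(?![a-zA-Z0-9])");

    /** True if "target" occurs without a letter or digit directly before or after it. */
    static boolean containsTarget(String s) {
        return TARGET.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {".target.", "_target", "target_", "target", "targets", "xtarget"}) {
            System.out.println(s + " -> " + containsTarget(s));
        }
    }
}
```

The first four inputs are reported as matches; "targets" and "xtarget" are not, because a letter sits directly next to the word.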
If I'm understanding your question correctly, instead of using the lookahead and lookbehind, you can just use ? to indicate that there should be 0 or 1 non-alphanumeric characters before and after "target".
([^a-zA-Z0-9])?(target)([^a-zA-Z0-9])?
You should be able to match target by using the * (0 or more) quantifier, so that it may be surrounded by 0 or more occurrences of the characters you want. So:
[_]*(target)[_]*
should match:
_target
target
target_
_target_
Add any element you want to be matched before or after the word to the brackets. Example to match .target. too:
[\._]*(target)[\._]*
This will match the target substring no matter where it appears in the string. If you want the rule to match only at the start of the string, add the ^ anchor:
^[\._]*(target)[\._]*
This will match the ones mentioned above only if they start the string.

regex parenthesis OR digit start of line

I'm new to regex and trying to get this statement to pass. What I'm trying to do is check the first character or two from the string.
I want the statement to be true if the string starts with a digit. I also want it to be true if the first character is a "(" and the next character is a digit.
So far I've got:
if (str.matches("^[(?\d")){
return true;
but this doesn't seem to work. ^ for anchoring to start, (? to check for optional parenthesis and then check if it's a digit afterwards. How have i stuffed up?
So 0800, (0800, (09 and 09 should pass, whereas *08, (*0, AB, (AB and *AB should fail.
Thanks
You need to remove the opening square bracket and escape the metacharacters. Also, matches() tells whether or not the entire string matches the given regular expression, so you need to add .* at the end to match the rest of the string. (For the same reason, the ^ anchor is not needed.)
if (str.matches("\\(?\\d.*")) { ... }
This regex should do the trick:
^\d.*|^\(\d.*
In Java, with escaping of backslashes:
if (str.matches("^\\d.*|^\\(\\d.*"))
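As a sanity check, the first suggestion can be run against the examples from the question. This is only a sketch; the class and method names are made up for illustration.

```java
// Hypothetical helper class, named only for this example.
class LeadingDigitCheck {
    /** True if str starts with a digit, or with "(" followed by a digit. */
    static boolean startsWithDigit(String str) {
        // matches() must cover the whole string, hence the trailing .*
        return str.matches("\\(?\\d.*");
    }

    public static void main(String[] args) {
        String[] shouldPass = {"0800", "(0800", "(09", "09"};
        String[] shouldFail = {"*08", "(*0", "AB", "(AB", "*AB"};
        for (String s : shouldPass) {
            System.out.println(s + " -> " + startsWithDigit(s)); // true for each
        }
        for (String s : shouldFail) {
            System.out.println(s + " -> " + startsWithDigit(s)); // false for each
        }
    }
}
```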

Regex why does negative lookahead not work when there are two groups here

when I tried this regex
\"(\S\S+)\"(?!;c)
on this string "MM:";d it comes as matched as I wanted
and on this string "MM:";c it comes as not matched as desired.
But when I add a second group, by moving the semicolon inside that group and making it optional using |
\"(\S\S+)\"(;|)(?!c)
for this string "MM:";c it comes as matched when I expected it to not like before.
I tried this in Java and then in JavaScript using the regex tool Debuggex.
What am I doing wrong?
Note: the | is there so that the semicolon is not required. Also, the c in the examples is just a stand-in for a word, which is why I am using a negative lookahead.
After following Holger's response of using possessive quantifiers,
\"(\S\S+)\";?+(?!c)
it worked.
I believe the regex engine will do whatever it can to find a match. Since your expression makes the semicolon optional, the engine can still match the entire expression: if the group does not consume the semicolon, the next character at that position is ;, not c, so the negative lookahead succeeds. This comes from the backtracking way that regex matching works: it keeps trying alternatives until it finds a match...
In other words, the process goes like this:
"MM:" - matched
(;|) - try the semicolon: matched
(?!c) - oops, the next character is c, so the negative lookahead fails. Go back
(;|) - try the empty alternative. We still have ';c' ahead of us
(?!c) - the next character is ';', not 'c', so the negative lookahead succeeds. We have a match
An update (based on your comment). The following pattern may work better, because its lookahead rejects c whether or not the group consumed the semicolon:
\"(\S\S+)\"(;|)(?!;?c)
The problem is that you don’t want to make the semicolon optional in the regular-expression sense. An optional semicolon implies that the matcher is allowed to try both possibilities, matching with it or without it. So even if the semicolon is there, the matcher can ignore it, creating an empty match for the group and letting the lookahead succeed.
But you want to consume the semicolon if it’s there, so that it cannot be used to satisfy the negative lookahead. With Java’s regex engine that’s pretty easy: use ;?+
This is called a “possessive quantifier”. As with ?, the semicolon doesn’t need to be there, but if it’s there it must be matched and cannot be ignored. So the regex engine has no alternatives to fall back on.
So the entire pattern looks like \"(\S\S+)\";?+(?!c) or \"(\S\S+)\"(;?+)(?!c) if you need the semicolon in a group.
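The difference between the optional group and the possessive quantifier can be seen in a short Java sketch. The class, field, and method names are made up for illustration; the two patterns are the ones from the question and from this answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class OptionalVsPossessive {
    // Original attempt: the group may match ";" or nothing, so the engine can backtrack out of it.
    static final Pattern OPTIONAL = Pattern.compile("\"(\\S\\S+)\"(;|)(?!c)");
    // Possessive ;?+ keeps the semicolon once it has matched it.
    static final Pattern POSSESSIVE = Pattern.compile("\"(\\S\\S+)\";?+(?!c)");

    /** True if the pattern matches anywhere in the input. */
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String withC = "\"MM:\";c";
        String withD = "\"MM:\";d";
        System.out.println(found(OPTIONAL, withC));   // true  (the unwanted match)
        System.out.println(found(POSSESSIVE, withC)); // false
        System.out.println(found(POSSESSIVE, withD)); // true
    }
}
```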

Java Regular Expression for number of exactly 5 digits anywhere in the string

I'm trying to create a regular expression to parse a 5 digit number out of a string no matter where it is but I can't seem to figure out how to get the beginning and end cases.
I've used the pattern as follows \\d{5} but this will grab a subset of a larger number...however when I try to do something like \\D\\d{5}\\D it doesn't work for the end cases. I would appreciate any help here! Thanks!
For a few examples (55555 is what should be extracted):
At the beginning of the string
"55555blahblahblah123456677788"
In the middle of the string
"2345blahblah:55555blahblah"
At the end of the string
"1234567890blahblahblah55555"
Since you are using a language that supports them, use negative lookarounds:
"(?<!\\d)\\d{5}(?!\\d)"
These will assert that your \\d{5} is neither preceded nor followed by a digit. Whether that is due to the edge of the string or a non-digit character does not matter.
Note that these assertions themselves are zero-width matches. So those characters will not actually be included in the match. That is why they are called lookbehind and lookahead. They just check what is there, without actually making it part of the match. This is another disadvantage of using \\D, which would include the non-digit character in your match (or require you to use capturing groups).
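Here is a small sketch of that pattern run against the three examples from the question. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class FiveDigitFinder {
    // Exactly five digits, with no digit directly before or after them.
    private static final Pattern FIVE_DIGITS = Pattern.compile("(?<!\\d)\\d{5}(?!\\d)");

    /** Returns every standalone run of exactly five digits in the input. */
    static List<String> findAll(String input) {
        List<String> results = new ArrayList<>();
        Matcher m = FIVE_DIGITS.matcher(input);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(findAll("55555blahblahblah123456677788")); // [55555]
        System.out.println(findAll("2345blahblah:55555blahblah"));    // [55555]
        System.out.println(findAll("1234567890blahblahblah55555"));   // [55555]
    }
}
```

Because the lookarounds are zero-width, m.group() returns just the five digits, with no extra capturing group needed.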
