Match contents surrounded by optional group in Java regex - java

I'm having trouble wrapping my head around how a particular Java regex should be written. The regex will be used in a sequence, and will match sections ending with /.
The problem is that using a simple split won't work because the text before the / can optionally be surrounded by ~. If it is, then the text inside can match anything - including / and ~. The key here is the ending ~/, which is the only way to escape this 'anything goes' sequence if it begins with ~.
Because the regex pattern will be used in a sequence (i.e. (xxx)+), I can't use ^ or $ for non-greedy matching.
Example matches:
foo/
~foo~/
~foo/~/
~foo~~/
~foo/bar~/
and some that wouldn't match:
foo~//
~foo~/bar~/
~foo/
foo~/ (see edit 2)
Is there any way to do this without being redundant with my regexes? What would be the best way to think about matching this? Java doesn't have a conditional modifier (?) so that complicated things in my head a bit more.
EDIT: After working on this in the meantime, the regex ((?:\~())?)(((?!((?!\2)/|\~/)).)+)\1/ gets close but #6 doesn't match.
EDIT 2: After Steve pointed out that there is ambiguity, it became clear #6 shouldn't match.

I don't think that this is a solvable problem. From your givens, these are all acceptable:
~foo/~/
~foo/
foo~/
So, now, let's consider this combination:
~foo/foo~/
What happens here? We have combined the second example and the third example to create an instance of the first example. How do you suggest a correct splitting? As far as I can tell, there's no way to tell if we should be taking the entire expression as one or two valid expressions. Hence, I don't think it's possible to break it up accurately based on your listed restrictions.

Related

How to exclude previous captured group

Here is my requirement, I want to recognize a valid String definition in compiler design, the string should either start and end with double quote ("hello world"), or start and end with single quote('hello world').
I used (['"]).*\1 to achieve the goal, the \1 here is to reference previous first captured group, namely first single or double quote, as explanation from regex 101,
\1 matches the same text as most recently matched by the 1st capturing group
It works so far so good.
Then I got new requirement, which is to treat an inner single quote in external single quotes as invalid vase, and same to double quotes situation. Which means both 'hello ' world' and "hello " world" are invalid case.
I think the solution should not be hard if we can represent not previous 1st captured group, something like (['"])(?:NOT\1)*\1.
The (?:) here is used as a non capturing group, to make sure \1 represents to first quote always. But the key is how to replace NOT with correct regex symbol. It's not like my previous experience about exclusion, like [^abcd] to exclude abcd, but to exclude the previous capture group and the symbol ^ doesn't work that way.
The most efficient method for this is probably a simple alternation, like already mentioned by #LorenzHetterich in his first comment. Easy to read, a short pattern and it gets the job done.
^(?:"[^"]*"|'[^']*')$
See this demo at regex101
This just alternates between either pairs of quotes without any of the same quote-type inside.
The technique to exclude a capture between certain parts, that you were outlining is known as tempered greedy token. Best to use it if there are no other options available (not for this task).
^(['"])(?:(?!\1).)*\1$
Another demo at regex101
The greedy dot gets tempered by what was captured in the first group and won't skip over.
Similar to this solution but much more efficient:
• Unrolled star alternation solution: ^(['"])[^"']*+(?:(?!\1)['"][^"']*)*\1$ (efficient)
• Explicit greedy alternation solution: ^(['"])(?:[^"']++|(?!\1)["'])*\1$ (a bit slower)
Especially for the latter use of a possessive quantifier is crucial to avoid runaway issues.
Just for having it mentioned, another option is using a negative lookahead to check after capturing the first match if there are not two more ahead. Also not highly efficient but sometimes useful.
^(['"])(?!(?:.*?\1){2}).*
One more demo at regex101
FYI: If the pattern is used with Java matches(), the ^ start and $ end anchors are not needed.

Java Regex Match Pattern Groups unexpectedly matched [duplicate]

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?
The issue you are facing with this: (command1|command2|command3).+; is that the + is greedy, meaning that it will match everything till the last value.
To fix this, you will need to make it non-greedy, and to do that you need to add the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies for the * operator. Adding a ? will make it non greedy.
Tell it to find only non-semicolons.
[^;]+
What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier will make it match as less as possible, instead of as much as possible, which it does by default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation

Speed up regular expression

This is a regex to extract the table name from a SQL statement:
(?:\sFROM\s|\sINTO\s|\sNEXTVAL[\s\W]*|^UPDATE\s|\sJOIN\s)[\s`'"]*([\w\.-_]+)
It matches a token, optionally enclosed in [`'"], preceded by FROM etc. surrounded by whitespace, except for UPDATE which has no leading whitespace.
We execute many regexes, and this is the slowest one, and I'm not sure why. SQL strings can get up to 4k in size, and execution time is at worst 0,35ms on a 2.2GHz i7 MBP.
This is a slow input sample: https://pastebin.com/DnamKDPf
Can we do better? Splitting it up into multiple regexes would be an option, as well if alternation is an issues.
There is a rule of thumb:
Do not let engine make an attempt on matching each single one character if there are some boundaries.
Try the following regex (~2500 steps on the given input string):
(?!FROM|INTO|NEXTVAL|UPDATE|JOIN)\S*\s*|\w+\W*(\w[\w\.-]*)
Live demo
Note: What you need is in the first capturing group.
The final regex according to comments (which is a little bit slower than the previous clean one):
(?!(?:FROM|INTO|NEXTVAL|UPDATE|JOIN)\b)\S*\s*|\b(?:NEXTVAL\W*|\w+\s[\s`'"]*)([\[\]\w\.-]+)
Regex optimisation is a very complex topic and should be done with help of some tools. For example, I like Regex101 which calculates for us number of steps Regex engine had to do to match pattern to payload. For your pattern and given example it prints:
1 match, 22976 steps (~19ms)
First thing which you can always do it is grouping similar parts to one group. For example, FROM, INTO and JOIN look similar, so we can write regex as below:
(?:\s(?:FROM|INTO|JOIN)\s|\sNEXTVAL[\s\W]*|^UPDATE\s)[\s`'"]*([\w\.-_]+)
For above example, Regex101, prints:
1 match, 15891 steps (~13ms)
Try to find some online tools which explain and optimise Regex such as myregextester and calculate how many steps engine needs to do.
Because matches are often near the end, one possibility would be to essentially start at the end and backtrack, rather than start at the beginning and forward-track, something along the lines of
^(?:UPDATE\s|.*(?:\s(?:(?:FROM|INTO|JOIN)\s|NEXTVAL[\s\W]*)))[\s`'\"]*([\w\.-_]+)
https://regex101.com/r/SO7M87/1/ (154 steps)
While this may be much faster when a match exists, it's only a moderate improvement when there's no match, because the pattern must backtrack all the way to the beginning (~9000 steps from ~23k steps)

Matches A but not B in Java Regex?

I have a big document. Lets scale it to
location=State-City-House
location=City-House
So What I want to do is replace all those not starting with State, with some other string. Say "NY". But those starting with State must remain untouched.
So my end result would be
location=State-City-House
location=NY-City-House
1.Obviously I cant use String.replaceAll().
2.Using Pattern.matcher() is tricky since we are using two different patterns where one must be found and one must not be found.
3.Tried a dirty way of replacing "location=State" first with "bocation=State" then replacing the others and then re-replacing.
So, A neat and simple way to do it?
You can definitely use replaceAll with a negative lookahead:
String repl = input.replaceAll( "(?m)^(location=)(?!State)", "$1NY-" );
(?m) sets MULTILINE modifier so that we match anchors ^ and $ in each line
(location=) matches location= and captures the value in group #1
(?!State) is the negative lookahead to fail the match when State appears after the captured group #1 i.e. location=
In replacement we use $1NY- to make it location=NY- at start.
RegEx Demo
If I understand your intention correctly, you don't actually have the string "State" in your input, but varying strings that represent states.
But some of your text lines are missing the state altogether and only have the name of the City and House. Is that correct? In that case, the defining characteristic between the 2 kinds of lines is the number of dashes.
^location=([^-]+)-([^-]+)$
The above regex matches only full lines with only 1 dash.
I might have misunderstood the task. It would be easier if you would post some of the actual input.

How to bound +/* for a regex group?

Say I have the regex:
(CC|NP)*
As such it creates problems in look-before regexes in Java. How shall I write it to avoid those problem?
I thought of re-writing it as:
(CC|NP){1,9}
Testing on regexr it seems like the upperbound is ignored completely.
In Java those quantitiers {} seem to work only on non-group regex elements as in:
\w+\[\S{1,9}\]
Sorry, look behind patterns usually have restrictions on the sub pattern. See f.x. Why doesn't finite repetition in lookbehind work in some flavors?p. Or search for "lookbehind pattern restrictions" on the web.
You may try to write down all fixed length variants of the look behind pattern as alternating pattern. But this might be many...
You may also simulate lookbehind by normally matching the inner pattern and match and group your actual target: (?:CC|NP)*(.*)
I'm not sure of where you percieve the problem. Quantifiers act on groups just like any entity.
So, \w+\[\S{1,9}\] could have been written \w+\[(\S){1,9}\] with the same result.
As far as your example on regexr, nothing is broken there. It matches what it's supposed to.
(PUN|CC|NP){1,3} will greedily try to match any of the alternations (in left-to-right priority). There will be no breaks in what it will match. It matches 1-3 consecutive occurances of PUN or CC or NP.
The sample string you provided had a space between CC's, so since a space does not exist in the regex, it is not matched. The only thing that is matching is a single CC.
If you want to account for a space, it can be added to the grouping like this:
(?:(?:PUN|CC|NP)\s*){1,3}
If you want to only allow spaces between the alternation's, it can be done like this:
(?:PUN|CC|NP)(?:\s*(?:PUN|CC|NP)){0,2}

Categories