Very slow look-behind - java

I'm trying to recover two positions using Java regexes.
The first one is given by the regex:
val r="""(?=(?<=[ ]|^)[^ ]{1,21474836}(?=[ ]|$)(?<=[^A-Z]|^)[A-Z]{1,21474836}(?=[^A-Z]|$))""".r
The second one is given by the regex
val p="""(?<=(?<=[ ]|^)[^ ]{1,21474836}(?=[ ]|$)(?<=[^A-Z]|^)[A-Z]{1,21474836}(?=[^A-Z]|$))""".r
Note that the two expressions are identical, except that the first "=" is replaced by "<=" in the second expression. I am not using nested quantifiers here.
My command to test it is the following:
r.findAllMatchIn("a <b/>"*100) //.... some long string of size 600...
p.findAllMatchIn("a <b/>"*100) //.... some long string of size 600...
The first example is almost instant during execution, whereas the second takes dozens of seconds. If I launch the same examples in a REPL, both are very fast.
Where does that come from? How can I make the second expression faster?
Update: Why this matters
Note that in general, I can have expressions of the type:
[^ ]+[^.]+
and I would like to know when this regular expression can be found on the left of a given position, or when it can end.
If I have the following data with the position below it:
abc145A
0123456
I would like the end of the previous expression to match positions 1, 2, 3, 4, 5 and 6. If I use non-greedy (reluctant) quantifiers, it matches 1, 3 and 5. If I use greedy quantifiers, it matches only 6. This is why I need look-behind assertions, unless someone can show me another way to define operators that find the positions I am looking for.

You aren't using nested quantifiers, but I suspect nested lookbehinds cause a similar problem. I suspect you don't need that outer lookahead/lookbehind at all - how about performing a single regex search using only the inner part of the regexes (common to both), and retrieving both the start position and the end position from each result?
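The single-search idea can be sketched in Java as follows. This is only an illustration under assumptions: the class name `MatchSpans`, the helper `spans`, and the simplified inner pattern are hypothetical stand-ins, not the asker's exact expression. `Matcher.start()` and `Matcher.end()` give both positions of each match from one pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpans {
    // Returns {start, end} for every non-overlapping match of regex in input.
    static List<int[]> spans(String regex, String input) {
        List<int[]> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(new int[] { m.start(), m.end() });
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical simplified inner pattern: a run of non-space characters
        // delimited by spaces or by the boundaries of the string.
        String inner = "(?<=[ ]|^)[^ ]+(?=[ ]|$)";
        for (int[] span : spans(inner, "a <b/>")) {
            System.out.println("start=" + span[0] + " end=" + span[1]);
        }
    }
}
```

Note that `find()` reports non-overlapping matches, so this yields one start/end pair per match rather than every position at which the expression could end.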

Related

Regex for multiple occurrences of specific words

Hello
I'm trying to create a validation rule that checks the regular expression to accept only specific phrases. Regex is based on Java.
Here are examples of correct inputs:
1OR2
2
1 OR 2 OR 15
( 2OR3) AND 1
(12AND13 AND1)OR(4 AND5)
((2AND3 AND 1)OR(4AND5))AND6
but I would be happy even if the regex also accepted something like:
())34AND(4
I have no idea how to create a regex that checks whether the brackets open and close correctly (they can be nested). I assumed this may be impossible with a regex, so I have already implemented proper bracket validation in code (a stack implementation), which runs as a second validation step for the phrase.
All I need the regex to do is to check if there are these specific things inside the phrase:
numbers, round brackets, words AND and OR with multiple occurrences and whitespaces are allowed.
It should NOT accept letters or other characters.
All I managed to create so far is this:
^[0-9 \\(][0-9 \\(\\)]*
also tried adding something like:
\\b(AND|OR)\\b
inside the second pair of brackets but with no luck.
I cannot figure out how to correct it to add OR and AND words.
I used the following and matched all the inputs you gave:
^[^\)][0-9 \( (AND|OR)]*$
I assumed you didn't want to start with ), which is why I included ^[^\)].
In case you weren't aware, I use https://www.regexpal.com to check my regular expressions for code.
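As a hedged alternative sketch: inside a character class, (AND|OR) is read as the individual characters (, A, N, D, |, O, R and ), so it does not require the whole words. Moving the words into an alternation outside the class keeps them intact. The class name `PhraseCheck` and the method name are hypothetical, and bracket balance is deliberately left to the separate stack-based step the question describes.

```java
import java.util.regex.Pattern;

public class PhraseCheck {
    // One allowed piece per iteration: a digit, a parenthesis, whitespace,
    // or one of the whole words AND / OR.
    private static final Pattern ALLOWED =
            Pattern.compile("^(?:[0-9()\\s]|AND|OR)+$");

    // True when the phrase consists only of allowed pieces.
    // This does not check that the parentheses are balanced.
    static boolean hasOnlyAllowedPieces(String phrase) {
        return ALLOWED.matcher(phrase).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasOnlyAllowedPieces("(12AND13 AND1)OR(4 AND5)"));
        System.out.println(hasOnlyAllowedPieces("1 XOR 2"));
    }
}
```

Each iteration consumes a single digit rather than a run of digits, which avoids a nested quantifier and the backtracking cost that comes with it on non-matching input.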
Since you have an arbitrary number of nested elements it's arguably not possible with regex.
For demonstration purposes only, this matches zero or more conjunctions and one set of parenthesis:
^\d+(\s*(?:AND|OR)\s*(\d+|\(?\s*\d+(\s*(?:AND|OR)\s*\d+)\s*\)))*$|^(\d+|\(?\s*\d+(\s*(?:AND|OR)\s*\d+)\s*\))\s*(\s*(?:AND|OR)\s*\d+)*$
That's it. Adding more sets and levels of nested parenthesis leads to exponentially increasing complexity - till it breaks altogether.

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:
(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))
This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.
Stack Overflow encourages self-answered questions, so I decided to create this post to share something I recently discovered.
Indeed! It's possible using forward references:
(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)
Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.
No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.
That's great and all, but I want to match inner groups too!
OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:
(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$)))
Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
So... how the hell does this actually work?
I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:
Component    Description
(?=\()    Make sure '(' follows before doing any hard work.
(?:    Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '(':
(?=    This lookahead deals with finding the next '('.
.*?\((?!.*?\1)    Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again.
(.*\)(?!.*\2).*)    Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
)
Handle ')':
(?=    This lookahead deals with finding the next ')'.
.*?\)(?!.*?\2)    Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*)    Fill \2 with the rest of the string. The above-mentioned bug is not applicable here, so a simple expression is sufficient.
)
.    Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
)+?    Match as few times as possible until a balanced group has been found. This is validated by the following check.
Final validation:
.*?(?=\1)    Match up to and including the last '(' found.
[^(]*(?=\2$)    Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).
Conclusion
So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balanced groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.
I know a lot of you use regular expressions to accomplish and help other users accomplish simpler and more practical tasks, but if there is anyone out there who shares my excitement for pushing the limits of possibility with regular expressions then I'd love to hear from you. If there is interest, I have other similar material to post.
Brief
Input Corrections
First of all, your input is incorrect as there's an extra parenthesis (as shown below)
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
                                ^
Making appropriate modifications to either include or exclude the additional parenthesis, one might end up with one of the following strings:
Extra parenthesis removed
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
                                ^
Additional parenthesis added to match extra closing parenthesis
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^
Regex Capabilities
Second of all, this is really only truly possible in regex flavours that include the recursion capability since any other method will not properly match opening/closing brackets (as seen in the OP's solution, it matches the extra parenthesis from the incorrect input as noted above).
This means that for regex flavours that do not currently support recursion (Java, Python, JavaScript, etc.), recursion (or attempts at mimicking recursion) in regular expressions is not possible.
Input
Considering the original input is actually invalid, we'll use the following inputs to test against.
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
Testing against these inputs should yield the following results:
INVALID (no match)
VALID (match)
VALID (match)
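For flavours without recursion, such as Java, the expected results above can be cross-checked without a regex at all. The following is a minimal sketch of a plain depth counter, not a translation of the recursive patterns below; the class and method names are hypothetical, and it only checks that parentheses are balanced.

```java
public class BalanceCheck {
    // Returns true when every ')' closes an earlier '(' and nothing is left open.
    static boolean isBalanced(String s) {
        int depth = 0;
        for (char c : s.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth < 0) {
                return false; // a ')' with no matching '('
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))",
            "(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))",
            "((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))"
        };
        for (String input : inputs) {
            System.out.println(isBalanced(input) ? "VALID" : "INVALID");
        }
    }
}
```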
Code
There are multiple ways of matching nested groups. The solutions provided below all depend on regex flavours that include recursion capabilities (e.g. PCRE).
Using DEFINE block
(?(DEFINE)
(?<value>[^()\r\n]+)
(?<groupVal>(?&group)|(?&value))
(?<group>(?&value)*\((?&groupVal)\)(?&groupVal)*)
)
^(?&group)$
Note: This regex uses the flags gmx
Without DEFINE block
^(?<group>
(?<value>[^()\r\n]+)*
\((?<groupVal>(?&group)|(?&value))\)
(?&groupVal)*
)$
Note: This regex uses the flags gmx
Without x modifier (one-liner)
^(?<group>(?<value>[^()\r\n]+)*\((?<groupVal>(?&group)|(?&value))\)(?&groupVal)*)$
Without named (groups & references)
^(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)$
Note: This is the shortest possible method that I could come up with.
Explanation
I'll explain the last regex as it's a simplified and minimal example of all the other regular expressions above it.
^ Assert position at the start of the line
(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*) Capture the following into capture group 1
    ([^()\r\n]+)* Capture the following into capture group 2 any number of times
        [^()\r\n]+ Match any character not present in the set ()\r\n one or more times
    \( Match a left/opening parenthesis character ( literally
    ((?1)|(?2)) Capture either of the following into capture group 3
        (?1) Recurse the first subpattern (1)
        (?2) Recurse the second subpattern (2)
    \) Match a right/closing parenthesis character ) literally
    (?3)* Recurse the third subpattern (3) any number of times
$ Assert position at the end of the line

Match contents surrounded by optional group in Java regex

I'm having trouble wrapping my head around how a particular Java regex should be written. The regex will be used in a sequence, and will match sections ending with /.
The problem is that using a simple split won't work because the text before the / can optionally be surrounded by ~. If it is, then the text inside can match anything - including / and ~. The key here is the ending ~/, which is the only way to escape this 'anything goes' sequence if it begins with ~.
Because the regex pattern will be used in a sequence (i.e. (xxx)+), I can't use ^ or $ for non-greedy matching.
Example matches:
foo/
~foo~/
~foo/~/
~foo~~/
~foo/bar~/
and some that wouldn't match:
foo~//
~foo~/bar~/
~foo/
foo~/ (see edit 2)
Is there any way to do this without being redundant with my regexes? What would be the best way to think about matching this? Java doesn't have a conditional modifier (?) so that complicated things in my head a bit more.
EDIT: After working on this in the meantime, the regex ((?:\~())?)(((?!((?!\2)/|\~/)).)+)\1/ gets close but #6 doesn't match.
EDIT 2: After Steve pointed out that there is ambiguity, it became clear #6 shouldn't match.
I don't think that this is a solvable problem. From your givens, these are all acceptable:
~foo/~/
~foo/
foo~/
So, now, let's consider this combination:
~foo/foo~/
What happens here? We have combined the second example and the third example to create an instance of the first example. How do you suggest a correct splitting? As far as I can tell, there's no way to tell if we should be taking the entire expression as one or two valid expressions. Hence, I don't think it's possible to break it up accurately based on your listed restrictions.
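If the rule from EDIT 2 is adopted (an unquoted section may contain neither ~ nor /), the ambiguity goes away and one possible sketch is the following. It assumes a quoted section ends at the first ~/ that follows at least one character; the atomic group (?>...) stops the engine from later stretching a quoted section past that point. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class SectionSequence {
    // Either a quoted section (~ ... ~/, cut at the first "~/" and never
    // re-extended thanks to the atomic group) or a plain section without
    // '~' or '/' followed by '/'.
    private static final Pattern SEQUENCE =
            Pattern.compile("(?:(?>~.+?~/)|[^~/]+/)+");

    // True when the whole input is a sequence of one or more sections.
    static boolean isSequence(String input) {
        return SEQUENCE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSequence("~foo/bar~/"));
        System.out.println(isSequence("~foo~/bar~/"));
    }
}
```

Under these assumptions the combined string ~foo/foo~/ is read as a single quoted section.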

Conditional Regex searches

I'm attempting to create a Regular Expressions code in Java that will have a conditional search term.
What I mean is this: say I have five words: tree, car, dog, cat, bird. I would like the expression to search for these terms, but it is only required to match three of the five, and they can be any three.
I thought perhaps using a back reference ?(3) would work, but it doesn't seem to do the trick.
A standard optional search (?) wouldn't work either because all terms are optional, however the number of matches required is not. Essentially is there a way to create a string that must be 50% (or any percent) correct to provide a match?
Would anyone happen to know or could point me in the right direction?
(I would hopefully like it working client side if possible)
Does it have to be a free-standing regular expression without any further code? A simple loop testing for each word and counting matches should do this perfectly. Pseudocode assuming you want N unique matches (you can also swap the substring test with a regex, doesn't matter how you determine matches as long as you keep the counting of unique matches out of the regex):
static boolean hasNWords(int n, String[] words, String text) {
    int matches = 0; // words from the list found in text so far
    for (String word : words) {
        if (text.contains(word)) matches++;
        if (matches >= n) return true;
    }
    return false;
}
It seems to me that the only way to do this with a regular expression (save mind-blowing uses of obscure regex extensions; not that I have something in mind, I've just been surprised again and again by what modern regex implementations allow) goes like this:
Enumerate all unique (ignoring order or not depending on implementation, see below) permutations of words
For each permutation, build a sub-regex that matches a string containing those words, either by
joining the first three words with .*? (this requires all unique permutations)
using three lookahead assertions like (?=.*word) (this allows dropping word combinations that occurred before in a different order)
Combine all sub-regexes in a giant or.
That's impractical to do by hand, ugly and complex (as in computational complexity, not in programming effort) to do automatically, and inefficient as well as quite hacky either way.
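To make the lookahead variant concrete, here is a rough Java sketch that generates the "giant or" automatically. The class name and both helpers are hypothetical, and it matches the words as plain substrings, so "cart" would count as "car".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class AtLeastNWordsRegex {
    // Collects one chain of lookaheads per k-element combination of words.
    static void combine(String[] words, int start, int k, String prefix, List<String> out) {
        if (k == 0) {
            out.add(prefix);
            return;
        }
        for (int i = start; i <= words.length - k; i++) {
            String lookahead = "(?=.*" + Pattern.quote(words[i]) + ")";
            combine(words, i + 1, k - 1, prefix + lookahead, out);
        }
    }

    // Builds a pattern that succeeds at the start of a text containing
    // at least n different words from the list.
    static Pattern build(String[] words, int n) {
        List<String> alternatives = new ArrayList<>();
        combine(words, 0, n, "", alternatives);
        return Pattern.compile("(?s)^(?:" + String.join("|", alternatives) + ")");
    }

    public static void main(String[] args) {
        String[] words = { "tree", "car", "dog", "cat", "bird" };
        Pattern p = build(words, 3);
        System.out.println(p.matcher("my dog saw a bird in a tree").find());
    }
}
```

For five words and n = 3 this produces ten alternatives; the count grows quickly with longer word lists, which is the inefficiency described above.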
I don't see why you would want to do this with a regex, but if you really need it to be a regex:
/(tree|car|dog|cat|bird)/
Then count the matches you get from that...
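A minimal Java sketch of the counting step, assuming that "3 out of the five" means three different words rather than three hits of the same word. The class and method names are hypothetical, and matching is by substring.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordHits {
    private static final Pattern WORDS = Pattern.compile("(tree|car|dog|cat|bird)");

    // Number of different words from the list that occur in text.
    static int distinctHits(String text) {
        Set<String> seen = new HashSet<>();
        Matcher m = WORDS.matcher(text);
        while (m.find()) {
            seen.add(m.group(1));
        }
        return seen.size();
    }

    // True when at least n different words from the list occur in text.
    static boolean hasAtLeast(String text, int n) {
        return distinctHits(text) >= n;
    }

    public static void main(String[] args) {
        System.out.println(hasAtLeast("the dog chased the cat up a tree", 3));
        System.out.println(hasAtLeast("dog dog dog", 3));
    }
}
```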
(?i)(?s)(.*(tree|car|dog|cat|bird)){3,}?.*
The (?i) is for case insensitive and the (?s) to match new lines with .* also, since you are looking at emails.
The ? at the end is the reluctant quantifier.
I haven't actually tried it.

codingBat repeatEnd using regex

I'm trying to understand regex as much as I can, so I came up with this regex-based solution to codingbat.com repeatEnd:
Given a string and an int N, return a string made of N repetitions of the last N characters of the string. You may assume that N is between 0 and the length of the string, inclusive.
public String repeatEnd(String str, int N) {
return str.replaceAll(
".(?!.{N})(?=.*(?<=(.{N})))|."
.replace("N", Integer.toString(N)),
"$1"
);
}
Explanation on its parts:
.(?!.{N}): asserts that the matched character is one of the last N characters, by making sure that there aren't N characters following it.
(?=.*(?<=(.{N}))): in which case, use a lookahead to first go all the way to the end of the string, then a nested lookbehind to capture the last N characters into \1. Note that this assertion will always be true.
|.: if the first assertion failed (i.e. there are at least N characters ahead) then match the character anyway; \1 would be empty.
In either case, a character is always matched; replace it with \1.
My questions are:
Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
Is there a simpler regex-based solution?
Bonus question
Do repeatBegin (as analogously defined).
I'm honestly having troubles with this one!
Nice one! I don't see a way to significantly improve on that regex, although I would refactor it to avoid the needless use of negative logic:
".(?=.{N})|.(?=.*(?<=(.{N})))"
This way the second alternative is never entered until you reach the final N characters, which I think makes the intent a little clearer.
I've never seen a reference that says it's okay to nest lookarounds, but like Bart, I don't see why it wouldn't be. I sometimes use lookaheads inside lookbehinds to get around limitations on variable-length lookbehind expressions.
EDIT: I just realized I can simplify the regex quite a bit by putting the alternation inside the lookahead:
".(?=.{N}|.*(?<=(.{N})))"
By the way, have you considered using format() to build the regex instead of replace()?
return str.replaceAll(
String.format(".(?=.{%1$d}|.*(?<=(.{%1$d})))", N),
"$1"
);
Whoa, that's some scary regex voodoo there! : )
Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
Yes, that is perfectly valid in most PCRE implementations I know of.
Is there a simpler regex-based solution?
I didn't spend too much time on it, but I don't quickly see how that could be simplified or shortened with a single regex replacement.
Is there a simpler regex-based solution?
It took me a while, but eventually I managed to simplify the regex to:
"(?=.{0,N}$(?<=(.{N}))).|." // repeatEnd
-or-
".(?<=^(?=(.{N})).{0,N})|." // repeatBegin
Like Alan Moore's answer, this removes the negative assertion, but doesn't even replace it with a positive one, so it now only has 2 assertions instead of 3.
I also like the fact that the "else" case is just a simple .. I prefer to put the bulk of my regex into the "working" side of the alternation, and keep the "non-working" side as simple as possible (usually a simple . or .*).
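For reference, the two simplified expressions can be dropped into a small class like the one below. This is a sketch: the class name `RepeatRegex` is made up, and the N placeholder is substituted with replace() as in the question. It relies on replaceAll inserting an empty string for "$1" when group 1 did not take part in the match.

```java
public class RepeatRegex {
    // N repetitions of the last N characters of str.
    static String repeatEnd(String str, int n) {
        String regex = "(?=.{0,N}$(?<=(.{N}))).|."
                .replace("N", Integer.toString(n));
        return str.replaceAll(regex, "$1");
    }

    // N repetitions of the first N characters of str.
    static String repeatBegin(String str, int n) {
        String regex = ".(?<=^(?=(.{N})).{0,N})|."
                .replace("N", Integer.toString(n));
        return str.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        System.out.println(repeatEnd("Hello", 3));
        System.out.println(repeatBegin("Hello", 3));
    }
}
```

In both methods, each of the N characters on the "working" side is replaced by the captured N-character block, and every other character is replaced by nothing.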
