Regular expression to match strings in quotes with double-quotes inside - java

I need to match input in the following format:
The input consists of key=value pairs. The key starts with a slash. The value may be a number or a string in quotes.
The value may optionally contain escaped quotes, that is, a quote followed by a quote (""). Such an escaped quote should be considered part of the value. There is no need to check that escaped quotes are balanced (e.g. closed by another escaped quote).
The regular expression should match the given key=value part of the sequence and should not break for long inputs (e.g. value is 10000 characters).
First I came to this solution:
/(\w+)=(\d+|"(?:""|[^"])+"(?!"))
and it performs reasonably well; however, it fails in Java 6 with a StackOverflowError for long inputs (it crashes regexplanet, for example). I tried to improve it a bit to run faster:
/(\w+)=(\d+|"(?:""|[^"]+)+"(?!"))
but then, if the input does not match, it gets stuck in a seemingly endless backtracking loop trying to match it.
Then I came to this regex:
/(\w+)=(\d+|".+?(?<!")(?:"")*"(?!"))
which is slower, but it seems to solve the task.
Can anyone suggest a better / faster regex?
Sample input:
/mol_type="protein" /transl_table=11 /note="[CDS] (""multi
line)" nn /organism="""Some"" Sequence" nn /organism="Some ""Sequence"""
/translation="MHPSSSRIPHIAVVGVSAIFPGSLDAHGFWRDILSGTDLITDVPSTHWLVE
DYYDPDPSAPDKTYAKRGAFLKDVPFDPLEWGVPPSIVPATDTTQLLALIVAKRVLEDAAQGQFE
SMSRERMSVILGVTSAQELLASMVSRIQRPVWAKALRDLGYPEDEVKRACDKIAGNYVPWQESSF
PGLLGNVVAGRIANRLDLGGTNCVTDAACASSLSAMSMAINELALGQSDLVIAGGCDTMNDAFMY
MCFSKTPALSKSGDCRPFSDKADGTLLGEGIAMVALKRLDDAERDGDRVYAVIRGIGSSSDGRSK
SVYAPVPEGQAKALRRTYAAAGYGPETVELMEAHGTGTKAGDAAEFEGLRAMFDESGREDRQWCA
LGSVKSQIGHTKAAAGAAGLFKAIMALHHKVLPPTIKVDKPNPKLDIEKTAFYLNTQARPWIRPG
DHPRRASVSSFGFGGSNFHVALEEYTGPAPKAWRVRALPAELFLLSADTPAALADRARALAKEAE
VPEILRFLARESVLSFDASRPARLGLCATDEADLRKKLEQVAAHLEARPEQALSAPLVHCASGEA
PGRVAFLFPGQGSQYVGMGADALMTFDPARAAWDAAAGVAIADAPLHEVVFPRPVFSDEDRAAQE
ARLRETRWAQPAIGATSLAHLALLAALGVRAEAFAGHSFGEITALHAAGALSAADLLRVARRRGE
LRTLGQVVDHLRASLPAAGPAASASPAAAASVPKASTAAVPAVASVAAPGAAEVERVVMAVVAET
TGYPAEMLGLQMELESDLGIDSIKRVEILSAVRDRTPGLSEVDASALAQLRTLGQVVDHLRASLP
AASAGPAVAAPAAKAPAVAAPTGVSGATPGAAEVERVVMAVVAETTGYPAEMLGLQMELESDLGI
DSIKRVEILSAVRDRTPGLAEVDASALAQLRTLGQVVDHLRASLGPAAVTAGAAPAEPAEEPAST
PLGRWTLVEEPAPAAGLAMPGLFDAGTLVITGHDAIGPALVAALAARGIAAEYAPAVPRGARGAV
FLGGLRELATADAALAVHREAFLAAQAIAAKPALFVTVQDTGGDFGLAGSDRAWVGGLPGLVKTA
ALEWPEASCRAIDLERAGRSDGELAEAIASELLSGGVELEIGLRADGRRTTPRSVRQDAQPGPLP
LGPSDVVVASGGARGVTAATLIALARASHARFALLGRTALEDEPAACRGADGEAALKAALVKAAT
SAGQRVTPAEIGRSVAKILANREVRATLDAIRAAGGEALYVPVDVNDARAVAAALDGVRGALGPV
TAIVHGAGVLADKLVAEKTVEQFERVFSTKVDGLRALLGATAGDPLKAIVLFSSIAARGGNKGQC
DYAMANEVLNKVAAAEAARRPGCRVKSLGWGPWQGGMVNAALEAHFAQLGVPLIPLAAGAKMLLD
ELCDASGDRGARGQGGAPPGAVELVLGAEPKALAAQGHGGRVALAVRADRATHPYLGDHAINGVP
VVPVVIALEWFARAARACRPDLVVTELRDVRVLRGIKLAAYESGGEVFRVDCREVSNGHGAVLAA
ELRGPQGALHYAATIQMQQPEGRVAPKGPAAPELGPWPAGGELYDGRTLFHGRDFQVIRRLDGVS
RDGIAGTVVGLREAGWVAQPWKTDPAALDGGLQLATLWTQHVLGGAALPMSVGALHTFAEGPSDG
PLRAVVRGQIVARDRTKADIAFVDDRGSLVAELRDVQYVLRPDTARGQA"
/note="primer of Streptococcus pneumoniae
Expected output (from regexhero.net):

In order to fail in a reasonable time you need, indeed, to avoid catastrophic backtracking. This can be done using atomic grouping (?>...):
/(\w+)=(\d+|"(?>(?>""|[^"]+)+)"(?!"))
# (?>(?>""|[^"]+)+)
(?> # throw away the states created by (...)+
(?> # throw away the states created by [^"]+
""|[^"]+
)+
)
Your issue when using (?:""|[^"]+)+ on a string that will never match, is linked to the fact that each time you match a new [^"] character the regex engine can choose to use the inner or outer + quantifier.
This leads to a lot of possibilities for backtracking, and before returning a failure the engine has to try them all.
We know that if we haven't found a match by the time the engine reaches the end, we never will: all we need to do is throw away the backtracking positions to avoid the issue, and that's what atomic grouping is for.
See a DEMO: 24 steps on failure, while preserving the speed on the successful cases (not a real benchmarking tool, but catastrophic backtracking would be pretty easy to spot)
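For readers who want to try the idea outside Java: Python's re module only gained (?>...) in version 3.11, so the sketch below swaps in a different but related idiom, a lookahead that captures the body followed by a backreference that consumes it. A lookahead is not backtracked into once it has succeeded, so it behaves much like the atomic group described above. This is an illustration under that assumption, not the answer's own code; the name PAIR and the sample strings are made up.

```python
import re

# Emulated atomic group: (?=(X))\3 captures X inside a lookahead (group 3),
# then the backreference consumes exactly that text with no way to backtrack into X.
PAIR = re.compile(r'/(\w+)=(\d+|"(?=((?:""|[^"]+)+))\3"(?!"))')

ok = PAIR.search('/note="say ""hi"" now" /n=1')
print(ok.group(1), ok.group(2))

# An unterminated value: the nested quantifier would normally backtrack
# catastrophically here, but the emulated atomic group fails right away.
bad = PAIR.search('/note="' + 'x' * 5000)
print(bad)
```

On Python 3.11 or later the original (?>...) syntax should work directly, which is closer to what the answer shows.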

Your initial regex was already quite good, but it was more complicated than necessary, leading to catastrophic backtracking.
You should use
/(\w+)=(\d+|"(?:""|[^"])*"(?!"))
See it live on regex101.com.
Explanation:
/ # Slash
(\w+) # Identifier --> Group 1
= # Equals sign
( # Group 2:
\d+ # Either a number
| # or
"(?:""|[^"])*" # a quoted string
(?!") # unless another quote follows
) # End of group 2
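A quick way to see the pattern in action is to run it over part of the sample input. The following is a minimal Python sketch rather than the Java code the question is about; the variable names are made up and the sample string is a shortened version of the one in the question.

```python
import re

# Pattern from the answer above; the leading slash is a literal character.
KEY_VALUE = re.compile(r'/(\w+)=(\d+|"(?:""|[^"])*"(?!"))')

sample = '/mol_type="protein" /transl_table=11 /organism="""Some"" Sequence"'
pairs = KEY_VALUE.findall(sample)
for key, value in pairs:
    print(key, value)

# A long unterminated value simply fails to match instead of blowing up.
unterminated = KEY_VALUE.search('/note="' + 'A' * 10000)
print(unterminated)
```

Because the two alternatives inside the group can never match at the same position, there is only one way to consume each character, which is why the failure case stays cheap.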

How about this one:
/(\w+)=("(?:[^"]|"")*"|\d+)
(Note that the / is part of the regex here. Escape it as appropriate for your host language.)
If your regex engine supports it (Java does), make the * possessive:
/(\w+)=("(?:[^"]|"")*+"|\d+)
After some debugging the latter expression can be improved to:
/(\w+)=("(?:""|[^"]*+)*+"|\d++)
Note the double *+)*+ which allows matching contiguous text in one step while not being susceptible to catastrophic backtracking.

Related

Speed up regular expression

This is a regex to extract the table name from a SQL statement:
(?:\sFROM\s|\sINTO\s|\sNEXTVAL[\s\W]*|^UPDATE\s|\sJOIN\s)[\s`'"]*([\w\.-_]+)
It matches a token, optionally enclosed in [`'"], preceded by FROM etc. surrounded by whitespace, except for UPDATE which has no leading whitespace.
We execute many regexes, and this is the slowest one, and I'm not sure why. SQL strings can get up to 4k in size, and execution time is at worst 0.35 ms on a 2.2 GHz i7 MBP.
This is a slow input sample: https://pastebin.com/DnamKDPf
Can we do better? Splitting it up into multiple regexes would be an option as well, if alternation is an issue.
There is a rule of thumb:
Do not let the engine attempt a match at every single character if there are boundaries it can skip to.
Try the following regex (~2500 steps on the given input string):
(?!FROM|INTO|NEXTVAL|UPDATE|JOIN)\S*\s*|\w+\W*(\w[\w\.-]*)
Note: What you need is in the first capturing group.
The final regex according to comments (which is a little bit slower than the previous clean one):
(?!(?:FROM|INTO|NEXTVAL|UPDATE|JOIN)\b)\S*\s*|\b(?:NEXTVAL\W*|\w+\s[\s`'"]*)([\[\]\w\.-]+)
Regex optimisation is a very complex topic and should be done with the help of some tools. For example, I like Regex101, which calculates the number of steps the regex engine had to take to match the pattern against the payload. For your pattern and the given example it prints:
1 match, 22976 steps (~19ms)
The first thing you can always do is group similar parts into one group. For example, FROM, INTO and JOIN look similar, so we can write the regex as below:
(?:\s(?:FROM|INTO|JOIN)\s|\sNEXTVAL[\s\W]*|^UPDATE\s)[\s`'"]*([\w\.-_]+)
For the above example, Regex101 prints:
1 match, 15891 steps (~13ms)
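Before swapping one pattern for the other, it is worth checking that the grouped version still extracts the same table names. The Python sketch below does only that; the SQL statements and the helper name table_name are invented for illustration, and step counts in Python's engine will not match the Regex101 figures.

```python
import re

# Original pattern from the question and the grouped rewrite from this answer.
ORIGINAL = re.compile(
    r'''(?:\sFROM\s|\sINTO\s|\sNEXTVAL[\s\W]*|^UPDATE\s|\sJOIN\s)[\s`'"]*([\w\.-_]+)''')
GROUPED = re.compile(
    r'''(?:\s(?:FROM|INTO|JOIN)\s|\sNEXTVAL[\s\W]*|^UPDATE\s)[\s`'"]*([\w\.-_]+)''')

def table_name(pattern, sql):
    """Return the first captured table name, or None if nothing matches."""
    m = pattern.search(sql)
    return m.group(1) if m else None

statements = [
    "SELECT id FROM `users` WHERE id = 1",
    "INSERT INTO orders (id) VALUES (1)",
    "UPDATE accounts SET balance = 0",
    "SELECT 1",
]
for sql in statements:
    print(table_name(ORIGINAL, sql), table_name(GROUPED, sql))
```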
Try to find some online tools that explain and optimise regexes, such as myregextester, and calculate how many steps the engine needs.
Because matches are often near the end, one possibility would be to essentially start at the end and backtrack, rather than start at the beginning and forward-track, something along the lines of
^(?:UPDATE\s|.*(?:\s(?:(?:FROM|INTO|JOIN)\s|NEXTVAL[\s\W]*)))[\s`'\"]*([\w\.-_]+)
https://regex101.com/r/SO7M87/1/ (154 steps)
While this may be much faster when a match exists, it's only a moderate improvement when there's no match, because the pattern must backtrack all the way to the beginning (~9000 steps from ~23k steps)

inverse match regex AND Space or end of string, AND space or start of string [duplicate]

I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?
Input:
hoho
hihi
haha
hede
Code:
grep "<Regex for 'doesn't contain hede'>" input
Desired output:
hoho
hihi
haha
The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$
The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
/^((?!hede).)*$/s
or use it inline:
/(?s)^((?!hede).)*$/
(where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
/^((?!hede)[\s\S])*$/
Explanation
A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
└──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
index 0 1 2 3 4 5 6 7
where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don't consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
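The position-by-position check described above can be reproduced directly. This is a small Python sketch with made-up variable names; the positions 0 to 8 correspond to the empty strings e1 to e9 in the diagram.

```python
import re

text = "ABhedeCD"
no_hede_ahead = re.compile(r'(?!hede)')

# Try the bare lookahead at every position and record where it fails.
blocked = [i for i in range(len(text) + 1)
           if no_hede_ahead.match(text, i) is None]
print(blocked)  # [2], i.e. e3 in the diagram

whole_line = re.compile(r'^((?!hede).)*$')
print(bool(whole_line.match(text)), bool(whole_line.match("ABCD")))  # False True
```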
Note that the solution to does not start with “hede”:
^(?!hede).*$
is generally much more efficient than the solution to does not contain “hede”:
^((?!hede).)*$
The former checks for “hede” only at the input string’s first position, rather than at every position.
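The difference between the two patterns shows up as soon as "hede" appears somewhere other than the start. A short Python sketch, with an invented word list:

```python
import re

starts_ok = re.compile(r'^(?!hede).*$')       # does not START with "hede"
contains_ok = re.compile(r'^((?!hede).)*$')   # does not CONTAIN "hede"

words = ["hoho", "hede", "ahede", "hedecidedthat"]
not_starting = [w for w in words if starts_ok.match(w)]
not_containing = [w for w in words if contains_ok.match(w)]
print(not_starting)    # ['hoho', 'ahede']
print(not_containing)  # ['hoho']
```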
If you're just using it for grep, you can use grep -v hede to get all lines which do not contain hede.
ETA Oh, rereading the question, grep -v is probably what you meant by "tools options".
Answer:
^((?!hede).)*$
Explanation:
^ the beginning of the string,
( group and capture to \1 (0 or more times (matching the most amount possible)),
(?! look ahead to see if there is not,
hede your string,
) end of look-ahead,
. any character except \n,
)* end of \1 (Note: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
$ before an optional \n, and the end of the string
The given answers are perfectly fine, just an academic point:
Regular expressions in the sense of theoretical computer science ARE NOT ABLE to do it like this. For them it would have to look something like this:
^([^h].*$)|(h([^e].*$|$))|(he([^h].*$|$))|(heh([^e].*$|$))|(hehe.+$)
This only does a FULL match. Doing it for sub-matches would even be more awkward.
If you want the regex test to only fail if the entire string matches, the following will work:
^(?!hede$).*
e.g. -- If you want to allow all values except "foo" (i.e. "foofoo", "barfoo", and "foobar" will pass, but "foo" will fail), use: ^(?!foo$).*
Of course, if you're checking for exact equality, a better general solution in this case is to check for string equality, i.e.
myStr !== 'foo'
You could even put the negation outside the test if you need any regex features (here, case insensitivity and range matching):
!/^[a-f]oo$/i.test(myStr)
The regex solution at the top of this answer may be helpful, however, in situations where a positive regex test is required (perhaps by an API).
FWIW, since regular languages (aka rational languages) are closed under complementation, it's always possible to find a regular expression (aka rational expression) that negates another expression. But not many tools implement this.
Vcsn supports this operator (which it denotes {c}, postfix).
You first define the type of your expressions: labels are letter (lal_char) to pick from a to z for instance (defining the alphabet when working with complementation is, of course, very important), and the "value" computed for each word is just a Boolean: true the word is accepted, false, rejected.
In Python:
In [5]: import vcsn
c = vcsn.context('lal_char(a-z), b')
c
Out[5]: {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} → 𝔹
then you enter your expression:
In [6]: e = c.expression('(hede){c}'); e
Out[6]: (hede)^c
convert this expression to an automaton:
In [7]: a = e.automaton(); a
finally, convert this automaton back to a simple expression.
In [8]: print(a.expression())
\e+h(\e+e(\e+d))+([^h]+h([^e]+e([^d]+d([^e]+e[^]))))[^]*
where + is usually denoted |, \e denotes the empty word, and [^] is usually written . (any character). So, with a bit of rewriting ()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*.
You can see this example here, and try Vcsn online there.
Here's a good explanation of why it's not easy to negate an arbitrary regex. I have to agree with the other answers, though: if this is anything other than a hypothetical question, then a regex is not the right choice here.
With a negative lookahead, a regular expression can match text that does not contain a specific pattern. This is answered and explained by Bart Kiers. Great explanation!
However, with Bart Kiers' answer, the lookahead part will test 1 to 4 characters ahead while matching any single character. We can avoid this and let the lookahead part check out the whole text, ensure there is no 'hede', and then the normal part (.*) can eat the whole text all at one time.
Here is the improved regex:
/^(?!.*?hede).*$/
Note that the lazy quantifier (*?) in the negative lookahead is optional; you can use the greedy quantifier (*) instead, depending on your data: if 'hede' is present and in the first half of the text, the lazy quantifier can be faster; otherwise, the greedy quantifier can be faster. However, if 'hede' is not present, both will be equally slow.
Here is the demo code.
For more information about lookahead, please check out the great article: Mastering Lookahead and Lookbehind.
Also, please check out RegexGen.js, a JavaScript Regular Expression Generator that helps to construct complex regular expressions. With RegexGen.js, you can construct the regex in a more readable way:
var _ = regexGen;
var regex = _(
_.startOfLine(),
_.anything().notContains( // match anything that not contains:
_.anything().lazy(), 'hede' // zero or more chars that followed by 'hede',
// i.e., anything contains 'hede'
),
_.endOfLine()
);
Benchmarks
I decided to evaluate some of the presented options and compare their performance, as well as try some new features.
Benchmarking on .NET Regex Engine: http://regexhero.net/tester/
Benchmark Text:
The first 7 lines should not match, since they contain the searched Expression, while the lower 7 lines should match!
Regex Hero is a real-time online Silverlight Regular Expression Tester.
XRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex HeroRegex HeroRegex HeroRegex HeroRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her Regex Her Regex Her Regex Her Regex Her Regex Her Regex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her is a real-time online Silverlight Regular Expression Tester.Regex Hero
egex Hero egex Hero egex Hero egex Hero egex Hero egex Hero Regex Hero is a real-time online Silverlight Regular Expression Tester.
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRegex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her
egex Hero
egex Hero is a real-time online Silverlight Regular Expression Tester.
Regex Her is a real-time online Silverlight Regular Expression Tester.
Regex Her Regex Her Regex Her Regex Her Regex Her Regex Her is a real-time online Silverlight Regular Expression Tester.
Nobody is a real-time online Silverlight Regular Expression Tester.
Regex Her o egex Hero Regex Hero Reg ex Hero is a real-time online Silverlight Regular Expression Tester.
Results:
Results are Iterations per second as the median of 3 runs - Bigger Number = Better
01: ^((?!Regex Hero).)*$                     3.914   // Accepted Answer
02: ^(?:(?!Regex Hero).)*$                   5.034   // With Non-Capturing group
03: ^(?!.*?Regex Hero).*                     7.356   // Lookahead at the beginning, if not found match everything
04: ^(?>[^R]+|R(?!egex Hero))*$              6.137   // Lookahead only on the right first letter
05: ^(?>(?:.*?Regex Hero)?)^.*$              7.426   // Match the word and check if you're still at linestart
06: ^(?(?=.*?Regex Hero)(?#fail)|.*)$        7.371   // Logic Branch: Find Regex Hero? match nothing, else anything
P1: ^(?(?=.*?Regex Hero)(*FAIL)|(*ACCEPT))   ?????   // Logic Branch in Perl - Quick FAIL
P2: .*?Regex Hero(*COMMIT)(*FAIL)|(*ACCEPT)  ?????   // Direct COMMIT & FAIL in Perl
Since .NET doesn't support action Verbs (*FAIL, etc.) I couldn't test the solutions P1 and P2.
Summary:
The overall most readable and performance-wise fastest solution seems to be 03 with a simple negative lookahead. This is also the fastest solution for JavaScript, since JS does not support the more advanced Regex Features for the other solutions.
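To repeat a comparison like this on another engine, a small timing harness is enough. The Python sketch below is only a rough illustration: it covers patterns 01 to 03 (the others rely on atomic groups or lookahead conditionals, which Python's re does not offer in every version), uses a shortened set of lines, and its timings will differ from the .NET figures above. The names PATTERNS, LINES and matching_lines are made up.

```python
import re
import timeit

PATTERNS = {
    "01": r'^((?!Regex Hero).)*$',
    "02": r'^(?:(?!Regex Hero).)*$',
    "03": r'^(?!.*?Regex Hero).*',
}
LINES = [
    "Regex Hero is a real-time online Silverlight Regular Expression Tester.",
    "Regex Her Regex Her Regex Hero is a real-time online tester.",
    "Regex Her is a real-time online Silverlight Regular Expression Tester.",
    "Nobody is a real-time online Silverlight Regular Expression Tester.",
]

def matching_lines(pattern):
    """Return the lines that the pattern accepts (those without 'Regex Hero')."""
    rx = re.compile(pattern)
    return [line for line in LINES if rx.match(line)]

# First confirm that the candidates agree, then time them.
results = {name: matching_lines(p) for name, p in PATTERNS.items()}
for name, p in PATTERNS.items():
    seconds = timeit.timeit(lambda: matching_lines(p), number=2000)
    print(name, f"{seconds:.3f}s")
```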
Not regex, but I've found it logical and useful to use serial greps with pipe to eliminate noise.
eg. search an apache config file without all the comments-
grep -v '\#' /opt/lampp/etc/httpd.conf # this gives all the non-comment lines
and
grep -v '\#' /opt/lampp/etc/httpd.conf | grep -i dir
The logic of serial grep's is (not a comment) and (matches dir)
Since no one else has given a direct answer to the question that was asked, I'll do it.
The answer is that with POSIX grep, it's impossible to literally satisfy this request:
grep "<Regex for 'doesn't contain hede'>" input
The reason is that with no flags, POSIX grep is only required to work with Basic Regular Expressions (BREs), which are simply not powerful enough for accomplishing that task, because of lack of alternation in subexpressions. The only kind of alternation it supports involves providing multiple regular expressions separated by newlines, and that doesn't cover all regular languages, e.g. there's no finite collection of BREs that matches the same regular language as the extended regular expression (ERE) ^(ab|cd)*$.
However, GNU grep implements extensions that allow it. In particular, \| is the alternation operator in GNU's implementation of BREs. If your regular expression engine supports alternation, parentheses and the Kleene star, and is able to anchor to the beginning and end of the string, that's all you need for this approach. Note however that negative sets [^ ... ] are very convenient in addition to those, because otherwise, you need to replace them with an expression of the form (a|b|c| ... ) that lists every character that is not in the set, which is extremely tedious and overly long, even more so if the whole character set is Unicode.
Thanks to formal language theory, we get to see what such an expression looks like. With GNU grep, the answer would be something like:
grep "^\([^h]\|h\(h\|eh\|edh\)*\([^eh]\|e[^dh]\|ed[^eh]\)\)*\(\|h\(h\|eh\|edh\)*\(\|e\|ed\)\)$" input
(found with Grail and some further optimizations made by hand).
You can also use a tool that implements EREs, like egrep, to get rid of the backslashes, or equivalently, pass the -E flag to POSIX grep (although I was under the impression that the question required avoiding any flags to grep whatsoever):
egrep "^([^h]|h(h|eh|edh)*([^eh]|e[^dh]|ed[^eh]))*(|h(h|eh|edh)*(|e|ed))$" input
Here's a script to test it (note it generates a file testinput.txt in the current directory). Several of the expressions presented in other answers fail this test.
#!/bin/bash
REGEX="^\([^h]\|h\(h\|eh\|edh\)*\([^eh]\|e[^dh]\|ed[^eh]\)\)*\(\|h\(h\|eh\|edh\)*\(\|e\|ed\)\)$"
# First four lines as in OP's testcase.
cat > testinput.txt <<EOF
hoho
hihi
haha
hede
h
he
ah
head
ahead
ahed
aheda
ahede
hhede
hehede
hedhede
hehehehehehedehehe
hedecidedthat
EOF
diff -s -u <(grep -v hede testinput.txt) <(grep "$REGEX" testinput.txt)
In my system it prints:
Files /dev/fd/63 and /dev/fd/62 are identical
as expected.
For those interested in the details, the technique employed is to convert the regular expression that matches the word into a finite automaton, then invert the automaton by changing every acceptance state to non-acceptance and vice versa, and then converting the resulting FA back to a regular expression.
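The inversion step can be sketched in a few lines. The Python code below is a hedged illustration, not the Grail workflow used for this answer: it builds the automaton for "contains hede" by hand, flips the accepting states, and then checks by brute force that the ERE shown earlier accepts exactly the same words. The helper names step, accepts_complement and words are made up.

```python
import re
from itertools import product

WORD = "hede"
# ERE from this answer; Python's re accepts the same syntax for this pattern.
NO_HEDE = re.compile(
    r'^([^h]|h(h|eh|edh)*([^eh]|e[^dh]|ed[^eh]))*(|h(h|eh|edh)*(|e|ed))$')

def step(state, ch):
    """DFA transition: state = length of the longest prefix of WORD matched so far."""
    if state == len(WORD):
        return state                      # absorbing state: WORD already seen
    candidate = WORD[:state] + ch
    while candidate and not WORD.startswith(candidate):
        candidate = candidate[1:]         # fall back to a shorter suffix
    return len(candidate)

def accepts_complement(s):
    """Run the DFA, then invert acceptance: accept iff WORD was never seen."""
    state = 0
    for ch in s:
        state = step(state, ch)
    return state != len(WORD)

def words(alphabet, max_len):
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

mismatches = [w for w in words("hedx", 6)
              if not (bool(NO_HEDE.match(w)) == accepts_complement(w) == (WORD not in w))]
print(len(mismatches))
```

Converting the inverted automaton back into an expression (state elimination) is the part the sketch leaves out; that is what tools like Grail automate.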
As everyone has noted, if your regular expression engine supports negative lookahead, the regular expression is much simpler. For example, with GNU grep:
grep -P '^((?!hede).)*$' input
However, this approach has the disadvantage that it requires a backtracking regular expression engine. This makes it unsuitable in installations that are using secure regular expression engines like RE2, which is one reason to prefer the generated approach in some circumstances.
Using Kendall Hopkins' excellent FormalTheory library, written in PHP, which provides a functionality similar to Grail, and a simplifier written by myself, I've been able to write an online generator of negative regular expressions given an input phrase (only alphanumeric and space characters currently supported, and the length is limited): http://www.formauri.es/personal/pgimeno/misc/non-match-regex/
For hede it outputs:
^([^h]|h(h|e(h|dh))*([^eh]|e([^dh]|d[^eh])))*(h(h|e(h|dh))*(ed?)?)?$
which is equivalent to the above.
With this, you avoid testing a lookahead at each position:
/^(?:[^h]+|h++(?!ede))*+$/
equivalent to (for .net):
^(?>(?:[^h]+|h+(?!ede))*)$
Old answer:
/^(?>[^h]+|h+(?!ede))*$/
Aforementioned (?:(?!hede).)* is great because it can be anchored.
^(?:(?!hede).)*$ # A line without hede
foo(?:(?!hede).)*bar # foo followed by bar, without hede between them
But the following would suffice in this case:
^(?!.*hede) # A line without hede
This simplification is ready to have "AND" clauses added:
^(?!.*hede)(?=.*foo)(?=.*bar) # A line with foo and bar, but without hede
^(?!.*hede)(?=.*foo).*bar # Same
A variant of the top answer that is, in my opinion, more readable:
^(?!.*hede)
Basically, "match at the beginning of the line if and only if it does not have 'hede' in it" - so the requirement translated almost directly into regex.
Of course, it's possible to have multiple failure requirements:
^(?!.*(hede|hodo|hada))
Details: The ^ anchor ensures the regex engine doesn't retry the match at every location in the string, which would match every string.
The ^ anchor at the beginning is meant to represent the beginning of the line. The grep tool matches each line one at a time; in contexts where you're working with a multiline string, you can use the "m" flag:
/^(?!.*hede)/m # JavaScript syntax
or
(?m)^(?!.*hede) # Inline flag
Here's how I'd do it:
^[^h]*(h(?!ede)[^h]*)*$
Accurate and more efficient than the other answers. It implements Friedl's "unrolling-the-loop" efficiency technique and requires much less backtracking.
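One way to gain confidence in the unrolled pattern is to compare it with the tempered-dot version on the test words used elsewhere in this thread. A minimal Python sketch (the variable names are made up):

```python
import re

UNROLLED = re.compile(r'^[^h]*(h(?!ede)[^h]*)*$')   # normal* (special normal*)*
TEMPERED = re.compile(r'^((?!hede).)*$')

words = ["hoho", "hihi", "haha", "hede", "h", "he", "ah", "head", "ahead",
         "ahed", "aheda", "ahede", "hhede", "hehede", "hedhede",
         "hehehehehehedehehe", "hedecidedthat"]

kept_unrolled = [w for w in words if UNROLLED.match(w)]
kept_tempered = [w for w in words if TEMPERED.match(w)]
print(kept_unrolled == kept_tempered)
print(kept_unrolled)
```

The lookahead now runs only after an h, instead of before every character, which is where the reduced work comes from.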
Another option is to add a positive look-ahead that checks whether hede is anywhere in the input line, and then negate that, with an expression similar to:
^(?!(?=.*\bhede\b)).*$
with word boundaries.
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
RegEx Circuit
jex.im visualizes regular expressions:
If you want to negate a whole word while matching character by character, similar to a negated character class:
For example, a string:
<?
$str="aaa bbb4 aaa bbb7";
?>
Do not use:
<?
preg_match('/aaa[^bbb]+?bbb7/s', $str, $matches);
?>
Use:
<?
preg_match('/aaa(?:(?!bbb).)+?bbb7/s', $str, $matches);
?>
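The reason for the advice is that [^bbb] is just the character class [^b]: it rejects any single b, not the word bbb. The Python sketch below makes the difference visible; the second test string is invented to contain a lone b, since on the original string both patterns happen to find the same match.

```python
import re

wrong = re.compile(r'aaa[^bbb]+?bbb7', re.S)        # [^bbb] means "any char except b"
right = re.compile(r'aaa(?:(?!bbb).)+?bbb7', re.S)  # any char not starting "bbb"

original = "aaa bbb4 aaa bbb7"
with_lone_b = "aaa b1 bbb7"

print(wrong.search(original).group(), right.search(original).group())
print(wrong.search(with_lone_b))           # None: the lone b blocks the class
print(right.search(with_lone_b).group())   # aaa b1 bbb7
```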
Notice "(?!bbb)." is neither lookbehind nor lookahead, it's lookcurrent, for example:
"(?=abc)abcde", "(?!abc)abcde"
The OP did not specify or Tag the post to indicate the context (programming language, editor, tool) the Regex will be used within.
For me, I sometimes need to do this while editing a file using Textpad.
Textpad supports some Regex, but does not support lookahead or lookbehind, so it takes a few steps.
If I am looking to retain all lines that Do NOT contain the string hede, I would do it like this:
1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.
Search string:^(.)
Replace string:<##-unique-##>\1
Replace-all
2. Delete all lines that contain the string hede (replacement string is empty):
Search string:<##-unique-##>.*hede.*\n
Replace string:<nothing>
Replace-all
3. At this point, all remaining lines Do NOT contain the string hede. Remove the unique "Tag" from all lines (replacement string is empty):
Search string:<##-unique-##>
Replace string:<nothing>
Replace-all
Now you have the original text with all lines containing the string hede removed.
If I am looking to Do Something Else to only lines that Do NOT contain the string hede, I would do it like this:
1. Search/replace the entire file to add a unique "Tag" to the beginning of each line containing any text.
Search string:^(.)
Replace string:<##-unique-##>\1
Replace-all
2. For all lines that contain the string hede, remove the unique "Tag":
Search string:<##-unique-##>(.*hede)
Replace string:\1
Replace-all
3. At this point, all lines that begin with the unique "Tag", Do NOT contain the string hede. I can now do my Something Else to only those lines.
4. When I am done, I remove the unique "Tag" from all lines (replacement string is empty):
Search string:<##-unique-##>
Replace string:<nothing>
Replace-all
Since the introduction of ruby-2.4.1, we can use the new Absent Operator in Ruby’s Regular Expressions
from the official doc
(?~abc) matches: "", "ab", "aab", "cccc", etc.
It doesn't match: "abc", "aabc", "ccccabc", etc.
Thus, in your case ^(?~hede)$ does the job for you
2.4.1 :016 > ["hoho", "hihi", "haha", "hede"].select{|s| /^(?~hede)$/.match(s)}
=> ["hoho", "hihi", "haha"]
Through PCRE verb (*SKIP)(*F)
^hede$(*SKIP)(*F)|^.*$
This would completely skip the line which contains the exact string hede and match all the remaining lines.
Execution of the parts:
Let us consider the above regex by splitting it into two parts.
Part before the | symbol. Part shouldn't be matched.
^hede$(*SKIP)(*F)
Part after the | symbol. Part should be matched.
^.*$
PART 1
Regex engine will start its execution from the first part.
^hede$(*SKIP)(*F)
Explanation:
^ Asserts that we are at the start.
hede Matches the string hede
$ Asserts that we are at the line end.
So the line that consists of the string hede would be matched. Once the regex engine reaches the following (*SKIP)(*F) verbs (note: (*F) can also be written as (*FAIL)), it skips the matched text and makes the match fail. The |, called the alternation or logical OR operator, placed after the PCRE verbs, in turn lets the second part match at all the remaining boundaries on every line except the one that contains the exact string hede. See the demo here. That is, it tries to match the characters from the remaining string. Now the regex in the second part would be executed.
PART 2
^.*$
Explanation:
^ Asserts that we are at the start. ie, it matches all the line starts except the one in the hede line. See the demo here.
.* In the Multiline mode, . would match any character except newline or carriage return characters. And * would repeat the previous character zero or more times. So .* would match the whole line. See the demo here.
Why use .* instead of .+ ?
Because .* matches a blank line but .+ does not. We want to match all the lines except hede, and the input may also contain blank lines, so you must use .* instead of .+ (.+ repeats the previous character one or more times). See .* matches a blank line here.
$ End of the line anchor is not necessary here.
The TXR Language supports regex negation.
$ txr -c '#(repeat)
#{nothede /~hede/}
#(do (put-line nothede))
#(end)' Input
A more complicated example: match all lines that start with a and end with z, but do not contain the substring hede:
$ txr -c '#(repeat)
#{nothede /a.*z&~.*hede.*/}
#(do (put-line nothede))
#(end)' -
az <- echoed
az
abcz <- echoed
abcz
abhederz <- not echoed; contains hede
ahedez <- not echoed; contains hede
ace <- not echoed; does not end in z
ahedz <- echoed
ahedz
Regex negation is not particularly useful on its own but when you also have intersection, things get interesting, since you have a full set of boolean set operations: you can express "the set which matches this, except for things which match that".
It may be more maintainable to use two regexes in your code: one to do the first match, and then, if it matches, a second regex to check for the outlier cases you wish to block (for example ^.*(hede).*), with appropriate logic in your code.
OK, I admit this is not really an answer to the posted question, and it may also use slightly more processing than a single regex. But for developers who came here looking for a fast emergency fix for an outlier case, this solution should not be overlooked.
The below function will help you get your desired output
<?PHP
function removePrepositions($text){
    // Word-boundary patterns for the words to strip (case-insensitive).
    $prepositions = array('/\bfor\b/i', '/\bthe\b/i');
    // Initialise the result up front so it is always defined.
    $retval = trim($text);
    foreach ($prepositions as $exceptionPhrase) {
        $retval = trim(preg_replace($exceptionPhrase, '', $retval));
    }
    return $retval;
}
?>
I wanted to add another example for if you are trying to match an entire line that contains string X, but does not also contain string Y.
For example, let's say we want to check if our URL / string contains "tasty-treats", so long as it does not also contain "chocolate" anywhere.
This regex pattern would work (works in JavaScript too)
^(?=.*?tasty-treats)((?!chocolate).)*$
(global, multiline flags in example)
Interactive Example: https://regexr.com/53gv4
Matches
(These urls contain "tasty-treats" and also do not contain "chocolate")
example.com/tasty-treats/strawberry-ice-cream
example.com/desserts/tasty-treats/banana-pudding
example.com/tasty-treats-overview
Does Not Match
(These urls contain "chocolate" somewhere - so they won't match even though they contain "tasty-treats")
example.com/tasty-treats/chocolate-cake
example.com/home-cooking/oven-roasted-chicken
example.com/tasty-treats/banana-chocolate-fudge
example.com/desserts/chocolate/tasty-treats
example.com/chocolate/tasty-treats/desserts
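The same check can be run outside the browser. A minimal Python sketch using the URLs listed above (the variable names are made up):

```python
import re

# Must contain "tasty-treats" and must not contain "chocolate".
TREATS_NO_CHOCOLATE = re.compile(r'^(?=.*?tasty-treats)((?!chocolate).)*$')

urls = [
    "example.com/tasty-treats/strawberry-ice-cream",
    "example.com/desserts/tasty-treats/banana-pudding",
    "example.com/tasty-treats-overview",
    "example.com/tasty-treats/chocolate-cake",
    "example.com/home-cooking/oven-roasted-chicken",
    "example.com/tasty-treats/banana-chocolate-fudge",
    "example.com/desserts/chocolate/tasty-treats",
]
matches = [u for u in urls if TREATS_NO_CHOCOLATE.match(u)]
for u in matches:
    print(u)
```

Note that the oven-roasted-chicken URL is rejected by the positive lookahead (no "tasty-treats"), not by the chocolate check.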
As long as you are dealing with lines, simply mark the negative matches and target the rest.
In fact, I use this trick with sed because ^((?!hede).)*$ does not appear to be supported by it.
For the desired output
Mark the negative match: (e.g. lines with hede), using a character not included in the whole text at all. An emoji could probably be a good choice for this purpose.
s/(.*hede)/🔒\1/g
Target the rest (the unmarked strings: e.g. lines without hede). Suppose you want to keep only the target and delete the rest (as you want):
s/^🔒.*//g
For a better understanding
Suppose you want to delete the target:
Mark the negative match: (e.g. lines with hede), using a character not included in the whole text at all. An emoji could probably be a good choice for this purpose.
s/(.*hede)/🔒\1/g
Target the rest (the unmarked strings: e.g. lines without hede). Suppose you want to delete the target:
s/^[^🔒].*//g
Remove the mark:
s/🔒//g
^((?!hede).)*$ is an elegant solution, except since it consumes characters you won't be able to combine it with other criteria. For instance, say you wanted to check for the non-presence of "hede" and the presence of "haha." This solution would work because it won't consume characters:
^(?!.*\bhede\b)(?=.*\bhaha\b)
How to use PCRE's backtracking control verbs to match a line not containing a word
Here's a method that I haven't seen used before:
/.*hede(*COMMIT)^|/
How it works
First, it tries to find "hede" somewhere in the line. If successful, at this point, (*COMMIT) tells the engine to, not only not backtrack in the event of a failure, but also not to attempt any further matching in that case. Then, we try to match something that cannot possibly match (in this case, ^).
If a line does not contain "hede" then the second alternative, an empty subpattern, successfully matches the subject string.
This method is no more efficient than a negative lookahead, but I figured I'd just throw it on here in case someone finds it nifty and finds a use for it for other, more interesting applications.
Simplest thing that I could find would be
[^(hede)]
Tested at https://regex101.com/
You can also add unit-test cases on that site
A simpler solution is to use the not operator !
Your if statement will need to match "contains" and not match "excludes".
var contains = /abc/;
var excludes = /hede/;
if (string.match(contains) && !string.match(excludes)) { // proceed...
}
I believe the designers of RegEx anticipated the use of not operators.
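The same idea carries over to Java. This is a minimal sketch that reuses the abc and hede patterns from the snippet above; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class ContainsButExcludes {
    private static final Pattern CONTAINS = Pattern.compile("abc");
    private static final Pattern EXCLUDES = Pattern.compile("hede");

    // Two simple searches combined with the language's own "not" operator,
    // instead of encoding the negation inside a single regex.
    static boolean accepts(String s) {
        return CONTAINS.matcher(s).find() && !EXCLUDES.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(accepts("xx abc xx"));
        System.out.println(accepts("abc hede"));
    }
}
```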

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:
(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))
This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.
Stack Overflow encourages self-answered questions, so I decided to create this post to share something I recently discovered.
Indeed! It's possible using forward references:
(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)
Proof
Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.
No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.
That's great and all, but I want to match inner groups too!
OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:
(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$)))
Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
So... how the hell does this actually work?
I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:
| Note | Component | Description |
|---|---|---|
| | (?=\() | Make sure '(' follows before doing any hard work. |
| | (?: | Start of group used to iterate through the string, so the following lookaheads match repeatedly. |
| Handle '(' | (?= | This lookahead deals with finding the next '('. |
| | .*?\((?!.*?\1) | Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again |
| | (.*\)(?!.*\2).*) | Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads. |
| | ) | |
| Handle ')' | (?= | This lookahead deals with finding the next ')' |
| | .*?\)(?!.*?\2) | Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before. |
| | (.*) | Fill \2 with the rest of the string. The above-mentioned bug is not applicable here, so a simple expression is sufficient. |
| | ) | |
| | . | Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point. |
| | )+? | Match as few times as possible until a balanced group has been found. This is validated by the following check |
| Final validation | .*?(?=\1) | Match up to and including the last '(' found. |
| | [^(]*(?=\2$) | Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group). |
Conclusion
So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balanced groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.
I know a lot of you use regular expressions to accomplish and help other users accomplish simpler and more practical tasks, but if there is anyone out there who shares my excitement for pushing the limits of possibility with regular expressions then I'd love to hear from you. If there is interest, I have other similar material to post.
Brief
Input Corrections
First of all, your input is incorrect as there's an extra parenthesis (as shown below)
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
                                ^
Making appropriate modifications to either include or exclude the additional parenthesis, one might end up with one of the following strings:
Extra parenthesis removed
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
                                ^
Additional parenthesis added to match extra closing parenthesis
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^
Regex Capabilities
Second of all, this is really only truly possible in regex flavours that include the recursion capability since any other method will not properly match opening/closing brackets (as seen in the OP's solution, it matches the extra parenthesis from the incorrect input as noted above).
This means that for regex flavours that do not currently support recursion (Java, Python, JavaScript, etc.), recursion (or attempts at mimicking recursion) in regular expressions is not possible.
Input
Considering the original input is actually invalid, we'll use the following inputs to test against.
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
Testing against these inputs should yield the following results:
INVALID (no match)
VALID (match)
VALID (match)
Code
There are multiple ways of matching nested groups. The solutions provided below all depend on regex flavours that include recursion capabilities (e.g. PCRE).
Using DEFINE block
(?(DEFINE)
(?<value>[^()\r\n]+)
(?<groupVal>(?&group)|(?&value))
(?<group>(?&value)*\((?&groupVal)\)(?&groupVal)*)
)
^(?&group)$
Note: This regex uses the flags gmx
Without DEFINE block
^(?<group>
(?<value>[^()\r\n]+)*
\((?<groupVal>(?&group)|(?&value))\)
(?&groupVal)*
)$
Note: This regex uses the flags gmx
Without x modifier (one-liner)
^(?<group>(?<value>[^()\r\n]+)*\((?<groupVal>(?&group)|(?&value))\)(?&groupVal)*)$
Without named (groups & references)
^(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)$
Note: This is the shortest possible method that I could come up with.
Explanation
I'll explain the last regex as it's a simplified and minimal example of all the other regular expressions above it.
^ Assert position at the start of the line
(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*) Capture the following into capture group 1
    ([^()\r\n]+)* Capture the following into capture group 2 any number of times
        [^()\r\n]+ Match any character not present in the set ()\r\n one or more times
    \( Match a left/opening parenthesis character ( literally
    ((?1)|(?2)) Capture either of the following into capture group 3
        (?1) Recurse the first subpattern (1)
        (?2) Recurse the second subpattern (2)
    \) Match a right/closing parenthesis character ) literally
    (?3)* Recurse the third subpattern (3) any number of times
$ Assert position at the end of the line

Can I improve performance of this regular expression further

I am trying to fetch thread names from the thread dumps file.
The thread names are usually contained within "double quotes" in the first line of each thread dump.
It may look as simple as follows:
"THREAD1" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]
Or as big as follows:
"[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000]
The regular expression I wrote is a simple one: "(.*)". It captures everything inside double quotes as a group. However, it causes heavy backtracking and thus requires a lot of steps, as can be seen here. Verbally, we can explain this regex as "capture anything that is enclosed inside double quotes as a group".
So I came up with another regex which performs the same task: "([^\"]*)". Verbally, we can describe this regex as "capture any number of non-double-quote characters that are enclosed inside double quotes". I did not find any regex faster than this. It does not perform any backtracking and hence requires the minimum number of steps, as can be seen here.
I told my colleague about this. He came up with yet another one: "(.*?)". I didn't get how it works. It performs considerably less backtracking than the first one but is a bit slower than the second one, as can be seen here.
However
I don't get why the backtracking stops early.
I understand ? is a quantifier which means once or not at all. However, I don't understand how "once or not at all" is being used here.
In fact I am not able to guess how can we describe this regex verbally.
My colleague tried explaining it to me, but I am still not able to understand it completely. Can anyone explain?
Brief explanation and a solution
The "(.*)" regex involves a lot of backtracking because it finds the first " and then grabs the whole string and backtracks looking for the " that is closest to the end of string. Since you have a quoted substring closer to the start, there's more backtracking than with "(.*?)" as this lazy quantifier *? makes the regex engine look for the closest " after the first " found.
The negated character class solution "([^"]*)" is the best of the three because it does not have to grab everything, just all characters other than ". However, to stop any backtracking and make the expression as efficient as possible, you can use possessive quantifiers.
If you need to match strings like " + no quotes here + ", use
"([^"]*+)"
or, if you do not even need to match the trailing quote in this situation:
"([^"]*+)
See regex demo
In fact I am not able to guess how can we describe this regex verbally.
The latter "([^"]*+) regex can be described as
" - find the first " symbol from the left of the string
([^"]*+) - match and capture into Group 1 zero or more symbols other than ", as many as possible, and once the engine finds a double quote, the match is returned immediately, without backtracking.
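A minimal Java sketch of this approach applied to the thread dump header lines from the question. It uses the variant with the trailing quote, "([^"]*+)"; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadNameExtractor {
    // Opening quote, a possessive run of non-quote characters, closing quote.
    private static final Pattern THREAD_NAME = Pattern.compile("\"([^\"]*+)\"");

    // Returns the first double-quoted substring of the line, or null if none.
    static String threadName(String headerLine) {
        Matcher m = THREAD_NAME.matcher(headerLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "\"THREAD1\" daemon prio=10 tid=0x00007ff6a8007000 "
                + "nid=0xd4b6 runnable [0x00007ff7f8aa0000]";
        System.out.println(threadName(line));
    }
}
```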
Quantifiers
More information on quantifiers from Rexegg.com:
A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A*? Zero or more As, as few as needed to allow the overall pattern to match (lazy)
A*+ Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)
As you see, ? is not a separate quantifier, it is a part of another quantifier.
I advise reading more about why Lazy Quantifiers are Expensive and why the Negated Class Solution is really safe and fast for your input string (where you just match a quote followed by non-quotes and then a final quote).
Difference between .*?, .* and [^"]*+ quantifiers
Greedy "(.*)" solution works like this: checks each symbol from left to right looking for ", and once found grabs the whole string up to the end and checks each symbol if it is equal to ". Thus, in your input string, it backtracks 160 times.
Lazy "(.*?)" solution works like this: the engine finds the first " and then advances in the pattern and tries the next token (which is ") against the T in THREAD1. This fails, so the engine backtracks and allows the .*? to expand its match by one item, so that it matches the T. Once again, the engine advances in the pattern. It now tries the " against the H in THREAD1. This fails, so the engine backtracks and allows the .*? to expand and match the H. The process then repeats itself—the engine advances, fails, backtracks, allows the lazy .*? to expand its match by one item, advances, fails and so on. For each character matched by the .*?, the engine has to backtrack. From a computing standpoint, this process of matching one item, advancing, failing, backtracking, expanding is "expensive".
Since the next " is not far, the number of backtrack steps is much fewer than with greedy matching.
The possessive quantifier solution with a negated character class, "([^"]*+)", works like this: the engine finds the leftmost " and then grabs all characters that are not " up to the next ". The negated character class [^"]*+ greedily matches zero or more characters that are not a double quote. Therefore, we are guaranteed that the match will never jump over the first encountered ". This is a more direct and efficient way of matching between some delimiters. Note that in this solution, we can fully trust the * that quantifies the [^"]. Even though it is greedy, there is no risk that [^"] will match too much as it is mutually exclusive with the ". This is the contrast principle from the regex style guide [see source].
Note that the possessive quantifier does not let the regex engine backtrack into the subexpression. Once matched, the symbols between the " characters become one hard block that cannot be "re-sorted" due to some "inconveniences" met by the regex engine, and the engine will be unable to shift any characters out of or into this block of text.
For the current expression, it does not make a big difference though.
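On the sample thread dump lines all three patterns capture the same text, because each line contains only one quoted string. The difference in what they match shows up once a line holds two quoted strings. The sketch below uses a made-up input line and a hypothetical helper to illustrate that.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierComparison {
    // Returns group 1 of the first match of the regex, or null if no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "\"A\" daemon \"B\"";  // made-up line with two quoted strings
        // Greedy: runs to the end of the line, then backtracks to the LAST quote.
        System.out.println(firstGroup("\"(.*)\"", line));
        // Lazy: expands one character at a time and stops at the NEXT quote.
        System.out.println(firstGroup("\"(.*?)\"", line));
        // Possessive negated class: can never cross a quote in the first place.
        System.out.println(firstGroup("\"([^\"]*+)\"", line));
    }
}
```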

Java Matcher slow regex

This is a very simple regex, yet it runs for over 30 seconds on a very short string (i7 3970k @ 3.4 GHz):
Pattern compile = Pattern.compile("^(?=[a-z0-9-]{1,63})([a-z0-9]+[-]{0,1}){1,63}[a-z0-9]{1}$");
Matcher matcher = compile.matcher("test-metareg-rw40lntknahvpseba32cßáàâåäæç.nl");
boolean matches = matcher.matches(); //Takes 30+ seconds
The first part, the (?=...) lookahead, is an assertion that the string consists of at most 63 of these characters.
The second part is an assertion that the string follows the syntax rules, in this case to prevent --'s and to make sure it ends in at least one [a-z0-9].
I tried to guess your intention but it was not easy:
(?=[a-z0-9-]{1,63}) this look-ahead seems intended to require that the next up to 63 characters be lowercase ASCII letters, digits, or hyphens, but in fact it will succeed even if there's only one letter followed by anything. So maybe you meant (?=[a-z0-9-]{1,63}$), to forbid anything else after the up to 63 legal characters.
You seem to want groups of at least one letter or number between the - characters, but you made the - optional, which does not really create a constraint and allows far too many possibilities; that is what creates the overhead of your expression. You can simply say: ([a-z0-9]++-){0,63}[a-z0-9]+. The group within the parentheses requires at least one letter or number followed by a minus; the expression at the end requires at least one letter or number at the end of the input, so it matches the last group, the one without a following -. This last group might also be the only one if your text contains no - at all.
Putting it all together, your regex becomes: (?=[a-z0-9-]{1,63}$)([a-z0-9]++-){0,63}[a-z0-9]+. Note that you don't need a leading ^ or trailing $ if you use the matches method; it already implies that the string bounds must match the expression bounds.
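A small sketch of how the combined expression might be used with matches(); the class and method names are hypothetical. The input from the question is written with Unicode escapes so that the source file stays ASCII.

```java
import java.util.regex.Pattern;

class HyphenatedNameCheck {
    // The combined expression from above; no ^ or $ around it because
    // matches() already requires the whole input to match.
    private static final Pattern NAME =
            Pattern.compile("(?=[a-z0-9-]{1,63}$)([a-z0-9]++-){0,63}[a-z0-9]+");

    static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The string from the question, non-ASCII letters escaped.
        String slowInput = "test-metareg-rw40lntknahvpseba32c"
                + "\u00df\u00e1\u00e0\u00e2\u00e5\u00e4\u00e6\u00e7.nl";
        System.out.println(isValid(slowInput));
        System.out.println(isValid("test-metareg-rw40"));
    }
}
```

The possessive ++ inside the group means a run of letters and digits is never split up again on failure, which is what removes the exponential number of ways to partition the input.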
I hope I got your intention right…
I have fixed this regex replacing it as follows:
^(?=[a-z0-9-]{1,63})([a-z0-9]{0,1}|[-]{0,1}){1,63}[a-z0-9]{1}$
The section ([a-z0-9]+[-]{0,1}){1,63} became: ([a-z0-9]{0,1}|[-]{0,1}){1,63}
If you want to make sure that there is no -- in your string just use negative look ahead (?!.*--).
Also there is no point in writing {1}.
Another thing is if you want to ensure that string has max 63 characters then in your look-ahead you need to add $ at the end (?=[a-z0-9-]{1,63}$).
So maybe ^(?=[a-z0-9-]{1,63}$)(?!.*--)[a-z0-9-]+[a-z0-9]$
I think from what you say, your regex can be simplified to this
Edit - (For posterity) After reading @Holger's post, I am changing this to fix possible catastrophic backtracking and to speed it up; as my benchmarks show, this is possibly the fastest way to do it.
# ^(?=[a-z0-9-]{1,63}$)[a-z0-9]++(?:-[a-z0-9]+)*+$
^ # BOL
(?= [a-z0-9-]{1,63} $ ) # max 1 - 63 of these characters
[a-z0-9]++ (?: - [a-z0-9]+ )*+ # consume the characters in this order
$ # EOL
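In Java, a commented layout like the one above can be kept in the source by compiling with Pattern.COMMENTS, which ignores whitespace in the pattern and treats # as the start of a comment. The sketch below is one way to do that under that assumption; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CommentedNamePattern {
    // Same expression as above, spread over several lines. Every line ends
    // with \n so that each # comment stops at the end of its own line.
    private static final Pattern NAME = Pattern.compile(
            "^                               # beginning of input\n"
          + "(?= [a-z0-9-]{1,63} $ )         # 1 to 63 allowed characters in total\n"
          + "[a-z0-9]++ (?: - [a-z0-9]+ )*+  # runs separated by single hyphens\n"
          + "$                               # end of input\n",
            Pattern.COMMENTS);

    static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("a-b-c"));
        System.out.println(isValid("a--b"));
    }
}
```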
