What is wrong in regexp? - java

I don't understand why this regexp does not work as I expect:
Regexp: ^<prefix>(.*?)(<optTag.*?>)?(.*?)<postfix>$
Test: <prefix>some chars<optTag value>some chars<postfix>
Test result:
Group 1: Empty
Group 2: Empty
Group 3: some chars<optTag value>some chars
I would expect that group 2 = <optTag value>

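A non-greedy wildcard in front of an optional capture group will not do what you want. (.*?) matches as little as possible, so the engine first tries the optional group right after <prefix>, where it cannot match and is skipped; the last (.*?) then swallows everything up to <postfix>. Use this instead: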
^<prefix>([^<]*)(<optTag.*?>)?(.*?)<postfix>$
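To compare the two patterns in Java, here is a minimal sketch. The class and method names (OptTagDemo, groups) are made up for illustration; only java.util.regex is assumed.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptTagDemo {
    // Pattern from the question: both wildcards are lazy.
    static final Pattern LAZY =
            Pattern.compile("^<prefix>(.*?)(<optTag.*?>)?(.*?)<postfix>$");
    // Pattern from the answer: group 1 may not run past a '<'.
    static final Pattern NEGATED =
            Pattern.compile("^<prefix>([^<]*)(<optTag.*?>)?(.*?)<postfix>$");

    // Returns groups 1..3, or null if the input does not match at all.
    // A group that did not take part in the match is reported as null.
    static String[] groups(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String input = "<prefix>some chars<optTag value>some chars<postfix>";
        System.out.println(Arrays.toString(groups(LAZY, input)));
        System.out.println(Arrays.toString(groups(NEGATED, input)));
    }
}
```

Keep in mind that [^<]* stops at the first '<', so this fix only helps when the text before the optional tag contains no '<' of its own.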

Kind of a pain, but you could put a blocking assertion (a negative lookahead) in those (.*?) groups.
^<prefix>((?:(?!<optTag.*?>).)*?)(<optTag.*?>)?((?:(?!<optTag.*?>).)*?)<postfix>$
https://regex101.com/r/6cQlkC/1
Expanded
^
<prefix>
( # (1 start)
(?:
(?! <optTag .*? > )
.
)*?
) # (1 end)
( <optTag .*? > )? # (2)
( # (3 start)
(?:
(?! <optTag .*? > )
.
)*?
) # (3 end)
<postfix>
$
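The expanded form can be used almost verbatim in Java with the Pattern.COMMENTS flag, which ignores whitespace and # comments inside the pattern. This is only a sketch: the class name TemperedOptTag and the helper groups are invented for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemperedOptTag {
    // Same regex as above, laid out with Pattern.COMMENTS so that the
    // whitespace and the trailing "# ..." remarks are ignored by the engine.
    static final Pattern TEMPERED = Pattern.compile(
            "^<prefix>\n"
          + "( (?: (?! <optTag .*? > ) . )*? )   # group 1: text before the tag\n"
          + "( <optTag .*? > )?                  # group 2: the optional tag\n"
          + "( (?: (?! <optTag .*? > ) . )*? )   # group 3: text after the tag\n"
          + "<postfix>$",
            Pattern.COMMENTS);

    // Returns groups 1..3, or null when the input does not match.
    static String[] groups(String input) {
        Matcher m = TEMPERED.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                groups("<prefix>some chars<optTag value>some chars<postfix>")));
        System.out.println(Arrays.toString(groups("<prefix>abc<postfix>")));
    }
}
```

When the tag is absent, group 2 comes back as null and the text ends up in group 3, because the lazy group 1 is tried as empty first.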

You can add word boundaries (\b) to your regular expression to get the required value in group 2.
This regex worked for me (note that group 2 is no longer optional here):
^<prefix>(.*?)(\b<optTag.*>\b)(.*?)<postfix>$
You can read more here.

Related

high-level regular expression with not

Hi regular expression experts,
I have the following text
<[~UNKNOWN:a-z\.]> <[~UNKNOWN:A-Z\-0-9]> <[~UNKNOWN:A-Z\]a-z]
And the following reg expr
\[\~[^\[\~\]]*\]
It works fine for the 1st and 2nd group in the text but not for the 3rd one.
The 1st group is
[~UNKNOWN:a-z\.]
The 2nd is
[~UNKNOWN:A-Z\-0-9]
and the 3rd one is
[~UNKNOWN:A-Z\]a-z]
However the reg exp finds the following text
[~UNKNOWN:A-Z\]
I understand why and I know that I have to add the following rule to the reg exp:
starting with '[' and '~' characters and ending with ']' UNLESS there is a '\' in front of ']'. So I should add a NOT expression but not sure how.
Could anybody please help?
Thanks,
V.
Why not simply:
<([^>]+)>?
Regex Demo
This should work (first line: the full pattern; second line: your pattern, ignoring whitespace; third line: my changes):
\[\~(?:[^\[\~\]]|(?<=\\)\])*(?<!\\)\]
\[\~ [^\[\~\]] * \]
(?: |(?<=\\)\]) (?<!\\)
Your regex:
\[\~ # Literal characters [~
[^ # Character group, NONE of the following:
\[\~\] # [ or ~ or ]
]* # 0 or more of this character group
\] # Followed by ]
Your pattern in words: [~, everything in between, up to the next ], as long as there is no [ or ~ or ] in there.
My pattern, with only the relevant changes explained:
\[\~
(?: # Non capturing group
[^\[\~\]]
| # OR
(?<=\\)\] # ], preceded by \
)*
(?<!\\)\] # ], not preceded by \
In words: Same as yours, plus ] may be contained if it is preceded by \, and the closing ] may not be preceded by \
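For reference, this is how the lookbehind version might be used from Java. It is a sketch with made-up names (EscapedBracketDemo, findTokens); the only real work is doubling the backslashes for the string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedBracketDemo {
    // The pattern from the answer, with every backslash doubled for Java.
    static final Pattern TOKEN = Pattern.compile(
            "\\[\\~(?:[^\\[\\~\\]]|(?<=\\\\)\\])*(?<!\\\\)\\]");

    // Collects every [~...] token, allowing an escaped ']' inside a token.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text =
                "<[~UNKNOWN:a-z\\.]> <[~UNKNOWN:A-Z\\-0-9]> <[~UNKNOWN:A-Z\\]a-z]";
        for (String token : findTokens(text)) {
            System.out.println(token);
        }
    }
}
```

One caveat of the lookbehind approach: it treats any ']' preceded by a backslash as escaped, so a token that ends in an escaped backslash followed by the closing ']' would not be recognised.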

regular expression using [:punct:] function in java

I am using the 'punct' character class to replace special characters in a string, e.g.
' REPLACE (REGEXP_REPLACE (colum1, '[[:punct:]]' ), ' ', '')) AS OUPUT ' as part of an SQL string in Java. But I want one particular special character, '-', not to be replaced. Can you suggest the best way to do this?
According to Character Classes and Bracket Expressions:
‘[:punct:]’
Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
Hence, use
[][!"#$%&'()*+,./:;<=>?@\\^_`{|}~]
Make sure you escape the ' correctly in the string literal.
A shortened expression with ranges will look like
[!-,.-/:-@[-`{-~]
See a regex test here (the - is between , and ., thus you need to use two !-,.-/ ranges in the above expression to exclude the hyphen).
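If the replacement can be done on the Java side instead of inside the SQL string, the same idea can be checked with java.util.regex. This is a sketch under that assumption; the class name PunctExceptHyphen is hypothetical, and the && intersection syntax is specific to Java patterns, so it will not work inside REGEXP_REPLACE.

```java
import java.util.regex.Pattern;

final class PunctExceptHyphen {
    // The range form from the answer; '[' must be escaped inside a Java class.
    static final Pattern RANGES = Pattern.compile("[!-,.-/:-@\\[-`{-~]");
    // Java-only alternative: ASCII punctuation intersected with "not a hyphen".
    static final Pattern INTERSECTION = Pattern.compile("[\\p{Punct}&&[^-]]");

    // Removes all ASCII punctuation except the hyphen.
    static String strip(String s) {
        return RANGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("a-b_c!d (e)"));
    }
}
```

Both patterns should accept exactly the same ASCII characters, so which one to use is mostly a matter of readability.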

Get equations from string

I have a string in following pattern
( var1=:key1:'any_value including space and 'quotes'' AND/OR var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )
I want to get following result from this.
:key1:'any_value including space and 'quotes''
:key2:'any_value...'
:key3:'any_value...'
Could anyone please suggest a pattern/RE for this?
Failed attempts :
First I can split it by AND/OR and again split the further strings on : and so on, but looking for single RE/Pattern which can do this.
You can use this regex with negated pattern to match your data:
":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))"
RegEx Demo
Breakup:
: # match a literal :
[^:]+ # match 1 or more characters that are not :
: # match a literal :
' # match a literal '
.*? # match 0 or more of any characters (non-greedy)
' # match a literal '
(?=\s*(?:AND(?:/OR)?|\))) # lookahead to assert there is AND/OR at the end or closing )
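Applied in Java, the lookahead pattern could be used roughly like this. The class and method names (KeyClauseDemo, extract) are invented for the sketch, and the input is the sample string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyClauseDemo {
    // :key:'value' followed (after optional spaces) by AND, AND/OR or ')'.
    static final Pattern CLAUSE = Pattern.compile(
            ":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))");

    // Returns every :key:'value' clause found in the expression.
    static List<String> extract(String expression) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(expression);
        while (m.find()) {
            clauses.add(m.group());
        }
        return clauses;
    }

    public static void main(String[] args) {
        String input = "( var1=:key1:'any_value including space and 'quotes'' "
                + "AND/OR var2=:key2:'any_value...' "
                + "AND/OR var3=:key3:'any_value...' )";
        for (String clause : extract(input)) {
            System.out.println(clause);
        }
    }
}
```

The lazy .*? keeps extending past inner quotes until a closing quote is followed by AND or ), which is what lets the nested 'quotes' survive.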
I think this would work for your circumstances.
Unless you know how to parse the quotes, there is not much else you could do.
Raw: (?<==)(?:(?!\s*AND/OR).)+
Quoted: "(?<==)(?:(?!\\s*AND/OR).)+"
Expanded:
(?<= = ) # A '=' behind
(?:
(?! \s* AND/OR ) # Not 'AND/OR' in front
.
)+

Sporadic Stack Overflow error in java Matcher

I have some file parser code where I sporadically get stack overflow errors on m.matches() (where m is a Matcher).
I run my app again and it parses the same file with no stack overflow.
It's true my Pattern is a bit complex. It's basically a bunch of optional zero-length positive lookaheads with named groups inside of them, so that I can match a bunch of variable name/value pairs regardless of their order. But I would expect that if some string caused a stack overflow error, it would always cause it, not just sometimes. Any ideas?
A much simplified version of my Pattern
"prefix(?=\\s+user=(?<user>\\S+))?(?=\\s+repo=(?<repo>\\S+))?.*?"
full regex is...
app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?(?=(?:[^"]|"[^"]*")*\s+remote_address=(?<ip>\S+))?(?=(?:[^"]|"[^"]*")*\s+now="(?<time>\S+)\+\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+url="(?<url>\S+)")?(?=(?:[^"]|"[^"]*")*\s+referer="(?<referer>\S+)")?(?=(?:[^"]|"[^"]*")*\s+status=(?<status>\S+))?(?=(?:[^"]|"[^"]*")*\s+elapsed=(?<elapsed>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_method=(?<requestmethod>\S+))?(?=(?:[^"]|"[^"]*")*\s+created_at="(?<createdat>\S+)(?:-|\+)\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+pull_request_id=(?<pullrequestid>\d+))?(?=(?:[^"]|"[^"]*")*\s+at=(?<at>\S+))?(?=(?:[^"]|"[^"]*")*\s+fn=(?<fn>\S+))?(?=(?:[^"]|"[^"]*")*\s+method=(?<method>\S+))?(?=(?:[^"]|"[^"]*")*\s+current_user=(?<user2>\S+))?(?=(?:[^"]|"[^"]*")*\s+content_length=(?<contentlength>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_category=(?<requestcategory>\S+))?(?=(?:[^"]|"[^"]*")*\s+controller=(?<controller>\S+))?(?=(?:[^"]|"[^"]*")*\s+action=(?<action>\S+))?.*?
Top of stack overflow error stack... (it's about 9800 lines long)
Exception: java.lang.StackOverflowError
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
example of line I got error on. (Though I have run it 10 times since and not gotten any error)
app=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=675fa67e-c1de-4bfa-a965-127b928d427a server_id=c31404fc-b7d0-41a1-8017-fc1a6dce8111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.041 url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" worker_request_count=77192 request_category=apiapp=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=89fcb32e-9ab5-47f7-9464-e5f5cff175e8 server_id=1b74880a-5124-4483-adce-111b60dac111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.024 
url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" worker_request_count=76263 request_category=api
interestingly... this line seems to be an error... the log seems to put a line break in the wrong place resulting in two log entries being on a single line followed by a blank line. It's this long line that caused the error... well once anyway... now it runs just fine without stack overflow
There are 2 ways to fix your problem:
1. Parse the input string properly and get the key values from the Map. I strongly recommend this method, since the code will be much cleaner, and we no longer have to watch the limit on the input size.
2. Modify the existing regex to greatly reduce the impact of the implementation flaw which causes StackOverflowError.
Parse the input string
You can parse the input string with the following regex:
\G\s*+(\w++)=([^\s"]++|"[^"]*+")(?:\s++|$)
All quantifiers are made possessive (*+ instead of *, ++ instead of +), since the pattern I wrote doesn't need backtracking.
You can find the basic regex (\w++)=([^\s"]++|"[^"]*+") to match key-value pairs in the middle.
\G makes sure the match starts from where the last match left off. It is used to prevent the engine from "bumping along" when it fails to match.
\s*+ and (?:\s++|$) are for consuming excess spaces. I specify (?:\s++|$) instead of \s*+ to prevent key="value"key=value from being recognized as valid input.
The full example code can be found below:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern KEY_VALUE = Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

public static Map<String, String> parseKeyValue(String kvString) {
    Matcher matcher = KEY_VALUE.matcher(kvString);
    Map<String, String> output = new HashMap<String, String>();
    int lastIndex = -1;
    while (matcher.find()) {
        // group(1) is the key, group(2) the (possibly quoted) value
        output.put(matcher.group(1), matcher.group(2));
        lastIndex = matcher.end();
    }
    // Make sure that we match everything from the input string
    if (lastIndex != kvString.length()) {
        return null;
    }
    return output;
}
You might want to unquote the values, depending on your requirement.
You can also rewrite the function to pass a List of keys you want to extract, and pick them out in the while loop as you go to avoid storing keys that you don't care about.
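A possible shape for that variant, with unquoting included, is sketched below. The names KeyValueFilter, parseSelected and unquote are hypothetical, and a Set is used instead of a List purely so that the lookup is cheap.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyValueFilter {
    private static final Pattern KEY_VALUE = Pattern.compile(
            "\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

    // Parses the whole line but stores only the keys listed in wantedKeys.
    // Returns null if the line is not a clean sequence of key=value pairs.
    static Map<String, String> parseSelected(String kvString, Set<String> wantedKeys) {
        Matcher matcher = KEY_VALUE.matcher(kvString);
        Map<String, String> output = new HashMap<>();
        int lastIndex = -1;
        while (matcher.find()) {
            String key = matcher.group(1);
            if (wantedKeys.contains(key)) {
                output.put(key, unquote(matcher.group(2)));
            }
            lastIndex = matcher.end();
        }
        return lastIndex == kvString.length() ? output : null;
    }

    // Strips one pair of surrounding double quotes, if present.
    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>(Arrays.asList("user", "url", "status"));
        String line = "app=github user=alice repo=org/proj "
                + "url=\"https://example.com/a b\" status=404";
        System.out.println(parseSelected(line, wanted));
    }
}
```

The sample line in main is invented for the example; real log lines may need extra handling, for instance if quotes appear escaped as \" in the raw file.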
Modify the regex
The problem is due to the outer repetition (?:[^"]|"[^"]*")* being implemented with recursion, leading to StackOverflowError when the input string is long enough.
Specifically, in each repetition, it matches either a quoted token, or a single non-quoted character. As a result, the stack grows linearly with the number of non-quoted characters and blows up.
You can replace all instances of (?:[^"]|"[^"]*")* with [^"]*(?:"[^"]*"[^"]*)*. The stack will now grow linearly with the number of quoted tokens, so StackOverflowError will not occur unless you have thousands of quoted tokens in the input string.
Pattern KEY_CAPTURE = Pattern.compile("app=github(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+user=(?<user>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+repo=(?<repo>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+remote_address=(?<ip>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+now=\"(?<time>\\S+)\\+\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+url=\"(?<url>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+referer=\"(?<referer>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+status=(?<status>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+elapsed=(?<elapsed>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_method=(?<requestmethod>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+created_at=\"(?<createdat>\\S+)(?:-|\\+)\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+pull_request_id=(?<pullrequestid>\\d+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+at=(?<at>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+fn=(?<fn>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+method=(?<method>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+current_user=(?<user2>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+content_length=(?<contentlength>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_category=(?<requestcategory>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+controller=(?<controller>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+action=(?<action>\\S+))?");
This follows from the equivalence (A|B)* → A*(BA*)*. Which alternative to use as A or B depends on how often each one repeats: whichever repeats more should be A and the other should be B.
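To try the (A|B)* → A*(BA*)* rewrite in isolation, the sketch below runs both forms of the repeated subpattern against a long unquoted input. The names UnrollDemo and tryMatch are made up, and the input length of 200,000 characters is an arbitrary choice. Whether the alternation form actually overflows depends on the JVM version and thread stack size, so the code only reports what happened.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class UnrollDemo {
    // (A|B)* with A = [^"] and B = "[^"]*"
    static final Pattern ALTERNATION = Pattern.compile("(?:[^\"]|\"[^\"]*\")*");
    // A*(BA*)* : the same language, but the long runs are matched by [^"]*
    static final Pattern UNROLLED = Pattern.compile("[^\"]*(?:\"[^\"]*\"[^\"]*)*");

    // Returns "true"/"false", or "StackOverflowError" if matching blew the stack.
    static String tryMatch(Pattern pattern, String input) {
        try {
            return String.valueOf(pattern.matcher(input).matches());
        } catch (StackOverflowError e) {
            return "StackOverflowError";
        }
    }

    // Builds a long run of unquoted characters followed by one quoted token.
    static String longInput(int unquotedLength) {
        char[] filler = new char[unquotedLength];
        Arrays.fill(filler, 'x');
        return new String(filler) + " \"quoted\" tail";
    }

    public static void main(String[] args) {
        String input = longInput(200_000);
        System.out.println("alternation: " + tryMatch(ALTERNATION, input));
        System.out.println("unrolled:    " + tryMatch(UNROLLED, input));
    }
}
```

On short inputs the two patterns should agree, which is a quick way to sanity-check the rewrite before substituting it into the full regex.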
Deep diving into the implementation
StackOverflowError in Pattern is a known problem, which can happen when your pattern contains a repetition of a non-deterministic [1] capturing/non-capturing group, which is the subpattern (?:[^"]|"[^"]*")* in your case.
[1] This is terminology used in the source code of Pattern, probably intended to indicate that the pattern has fixed length. However, the implementation considers alternation | to be non-deterministic, regardless of the actual pattern.
Greedy or lazy repetition of a non-deterministic capturing/non-capturing group is compiled into the Loop/LazyLoop classes, which implement repetition by recursion. As a result, such a pattern is extremely prone to triggering StackOverflowError, especially when the group contains a branch where only a single character is matched at a time.
On the other hand, deterministic [2] repetition, possessive repetition, and repetition of an independent group (?>...) (a.k.a. atomic group or non-backtracking group) are compiled into the Curly/GroupCurly classes, which process the repetition with a loop in most cases, so there will be no StackOverflowError.
[2] The repeated pattern is a character class, or a fixed-length capturing/non-capturing group without any alternation.
You can see how a fragment of your original regex is compiled below. Take note of the problematic part, which starts with Loop, and compare it to your stack trace.
app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?
BnM. Boyer-Moore (BMP only version) (length=10)
app=github
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=0
Prolog. Loop wrapper
Loop [1889ca51]. Greedy quantifier {0,2147483647}
GroupHead. local=1
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [7e41986c]. Connect branches to sequel.
GroupTail [47e1b36]. local=1, group=0. --[next]--> Loop [1889ca51]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
user=
GroupHead. local=3
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [732c7887]. local=3, group=2. --[next]--> GroupTail [6c9d2223]
GroupTail [6c9d2223]. local=0, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=4
Prolog. Loop wrapper
Loop [402c5f8a]. Greedy quantifier {0,2147483647}
GroupHead. local=5
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [21347df0]. Connect branches to sequel.
GroupTail [7d382897]. local=5, group=0. --[next]--> Loop [402c5f8a]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
repo=
GroupHead. local=7
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [71f111ba]. local=7, group=4. --[next]--> GroupTail [9c304c7]
GroupTail [9c304c7]. local=4, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
LastNode.
Node. Accept match
Final answer:
Move this (?:[^"]|"[^"]*")* functionality into an alternation group with
the others.
Sample: https://ideone.com/YuVcMg
It can't be broken!
A side note: I noticed you said a newline went missing, so the end of one record runs into the next without a separator, like this: request_category=apiapp=github
That is OK, but these regexes will mostly blow right past it when they hit the \S+.
For that reason, it is better to replace \S+ with (?:(?!app=github)\S)+, which is not done in the regex shown under "Regex" further down.
Here is that regex with the change added:
"(?s)app=github(?>\\s+user=(?<user>(?:(?!app=github)\\S)+)|\\s+repo=(?<repo>(?:(?!app=github)\\S)+)|\\s+remote_address=(?<ip>(?:(?!app=github)\\S)+)|\\s+now=\\\\?\"(?<time>(?:(?!app=github)\\S)+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>(?:(?!app=github)\\S)+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>(?:(?!app=github)\\S)+)\\\\?\"|\\s+status=(?<status>(?:(?!app=github)\\S)+)|\\s+elapsed=(?<elapsed>(?:(?!app=github)\\S)+)|\\s+request_method=(?<requestmethod>(?:(?!app=github)\\S)+)|\\s+created_at=\\\\?\"(?<createdat>(?:(?!app=github)\\S)+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>(?:(?!app=github)\\S)+)|\\s+fn=(?<fn>(?:(?!app=github)\\S)+)|\\s+method=(?<method>(?:(?!app=github)\\S)+)|\\s+current_user=(?<user2>(?:(?!app=github)\\S)+)|\\s+content_length=(?<contentlength>(?:(?!app=github)\\S)+)|\\s+request_category=(?<requestcategory>(?:(?!app=github)\\S)+)|\\s+controller=(?<controller>(?:(?!app=github)\\S)+)|\\s+action=(?<action>(?:(?!app=github)\\S)+)|\"[^\"]*\"|(?!app=github).)+"
And a link to that sample using it: https://ideone.com/hdwufO
Regex
Raw:
(?s)app=github(?>\s+user=(?<user>\S+)|\s+repo=(?<repo>\S+)|\s+remote_address=(?<ip>\S+)|\s+now=\\?"(?<time>\S+)\+\d\d:\d\d\\?"|\s+url=\\?"(?<url>\S+)\\?"|\s+referer=\\?"(?<referer>\S+)\\?"|\s+status=(?<status>\S+)|\s+elapsed=(?<elapsed>\S+)|\s+request_method=(?<requestmethod>\S+)|\s+created_at=\\?"(?<createdat>\S+)[-+]\d\d:\d\d\\?"|\s+pull_request_id=(?<pullrequestid>\d+)|\s+at=(?<at>\S+)|\s+fn=(?<fn>\S+)|\s+method=(?<method>\S+)|\s+current_user=(?<user2>\S+)|\s+content_length=(?<contentlength>\S+)|\s+request_category=(?<requestcategory>\S+)|\s+controller=(?<controller>\S+)|\s+action=(?<action>\S+)|"[^"]*"|(?!app=github).)+
Stringed:
"(?s)app=github(?>\\s+user=(?<user>\\S+)|\\s+repo=(?<repo>\\S+)|\\s+remote_address=(?<ip>\\S+)|\\s+now=\\\\?\"(?<time>\\S+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>\\S+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>\\S+)\\\\?\"|\\s+status=(?<status>\\S+)|\\s+elapsed=(?<elapsed>\\S+)|\\s+request_method=(?<requestmethod>\\S+)|\\s+created_at=\\\\?\"(?<createdat>\\S+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>\\S+)|\\s+fn=(?<fn>\\S+)|\\s+method=(?<method>\\S+)|\\s+current_user=(?<user2>\\S+)|\\s+content_length=(?<contentlength>\\S+)|\\s+request_category=(?<requestcategory>\\S+)|\\s+controller=(?<controller>\\S+)|\\s+action=(?<action>\\S+)|\"[^\"]*\"|(?!app=github).)+"
Formatted:
(?s)
app = github
(?>
\s+
user =
(?<user> \S+ ) # (1)
|
\s+ repo =
(?<repo> \S+ ) # (2)
|
\s+ remote_address =
(?<ip> \S+ ) # (3)
|
\s+ now= \\? "
(?<time> \S+ ) # (4)
\+ \d\d : \d\d \\? "
|
\s+ url = \\? "
(?<url> \S+ ) # (5)
\\? "
|
\s+ referer = \\? "
(?<referer> \S+ ) # (6)
\\? "
|
\s+ status =
(?<status> \S+ ) # (7)
|
\s+ elapsed =
(?<elapsed> \S+ ) # (8)
|
\s+ request_method =
(?<requestmethod> \S+ ) # (9)
|
\s+ created_at = \\? "
(?<createdat> \S+ ) # (10)
[-+]
\d\d : \d\d \\? "
|
\s+ pull_request_id =
(?<pullrequestid> \d+ ) # (11)
|
\s+ at=
(?<at> \S+ ) # (12)
|
\s+ fn=
(?<fn> \S+ ) # (13)
|
\s+ method =
(?<method> \S+ ) # (14)
|
\s+ current_user =
(?<user2> \S+ ) # (15)
|
\s+ content_length =
(?<contentlength> \S+ ) # (16)
|
\s+ request_category =
(?<requestcategory> \S+ ) # (17)
|
\s+ controller =
(?<controller> \S+ ) # (18)
|
\s+ action =
(?<action> \S+ ) # (19)
|
" [^"]* " # None of the above, give quotes a chance
|
(?! app = github ) # Failsafe, consume a character, advance by 1
.
)+
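To show the alternation idea on a smaller scale, here is a cut-down sketch that keeps only the user and repo branches plus the two fallback branches. The class name AlternationDemo, the helper extract and the sample log line are all made up for the example; the full pattern above works the same way with more branches.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationDemo {
    // Two named fields, then the quoted-string branch, then the
    // single-character failsafe, all inside one repeated atomic group.
    static final Pattern RECORD = Pattern.compile(
            "(?s)app=github(?>"
          + "\\s+user=(?<user>\\S+)"
          + "|\\s+repo=(?<repo>\\S+)"
          + "|\"[^\"]*\""
          + "|(?!app=github)."
          + ")+");

    // Returns the captured fields of the first record; a missing field maps to null.
    static Map<String, String> extract(String line) {
        Map<String, String> fields = new HashMap<>();
        Matcher m = RECORD.matcher(line);
        if (m.find()) {
            fields.put("user", m.group("user"));
            fields.put("repo", m.group("repo"));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "app=github env=production repo=org/proj "
                + "now=\"2015-09-24T13:44:52+00:00\" user=alice status=404";
        System.out.println(extract(line));
    }
}
```

The fields are picked up in whatever order they appear, because each pass through the group simply takes the first branch that matches at the current position.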

Regex match repeating pattern only after string

let a PropDefinition be a string of the form prop\d+ (true|false)
I have a string like:
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
I'd like to extract the bottom PropDefinitions only after the text 'sat', so the matches should be:
prop0 false
prop1 false
prop2 true
I originally tried using /(prop\d (?:true|false))/s (see example here) but that obviously matches all PropDefinitions and I couldn't make it match repeats only after the sat string
I used rubular as an example above because it was convenient, but I'm really looking for the most language agnostic solution. If it's vital info, I'll most likely be using the regex in a Java application.
str =<<-Q
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
Q
p str[/^sat(.*)/m, 1].scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]
When you have patterns that are very different in nature as in this case (string after sat and selecting the specific patterns), it is usually better to express them in multiple regexes rather than trying to do it with a single regex.
s = <<_
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
_
s.split(/^sat\s+/, 2).last.scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]
\s+[(]+\K(prop\d (?:true|false)(?=[)]))
Live demo
If Ruby can support the \G anchor this is one solution.
It looks nasty, but several things are going on.
1. It only allows a single nest (outer plus many inners)
2. It will not match invalid forms that don't comply with '(prop\d true|false)'
Without condition 2 it would be a lot easier, which is an indicator that a two-regex solution would do the same:
first capture the outer form sat((..)..(..)..),
then globally capture the inner form (prop\d true|false).
Can be done in a single regex, though this is going to be hard to look at, but should work (test case below in Perl).
# (?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))
(?:
(?! \A | sat \s* \( )
\G # Start match from end of last match
| # or,
sat \s* \( # Start form 'sat ('
)
[^()]* # This check section consumes invalid inner '(..)' forms
(?: # since we are looking specifically for '(prop\d true|false)'
\(
(?!
prop \d [ ]
(?: true | false )
\)
)
[^()]*
\)
[^()]*
)* # End section, do optionally many times
\(
( # (1 start), match inner form '(prop\d true|false)'
prop \d [ ]
(?: true | false )
) # (1 end)
\)
(?= # Look ahead for end form '(..)(..))'
(?:
[^()]*
\( [^()]* \)
)*
[^()]*
\)
)
Perl test case
$/ = undef;
$str = <DATA>;
while ($str =~ /(?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))/g)
{
print "'$1'\n";
}
__DATA__
((prop10 true))
sat
((prop3 false)
(asdg)
(propa false)
(prop1 false)
(prop2 true)
)
((prop5 true))
Output >>
'prop3 false'
'prop1 false'
'prop2 true'
Part of the confusion has to do with SingleLine vs MultiLine matching. The patterns below work for me and return all matches in a single execution and without requiring a preliminary operation to split the string.
This one requires SingleLine mode to be specified separately (as in .Net RegExOptions):
(?<=sat.*)(prop\d (?:true|false))
This one specifies SingleLine mode inline which works with many, but not all, RegEx engines:
(?s)(?<=sat.*)(?-s)(prop\d (?:true|false))
You don't need to turn SingleLine mode off via the (?-s) but I think it is clearer in its intent.
The following pattern also toggles SingleLine mode inline, but uses a Negative LookAhead instead of a Positive LookBehind, because it seems (according to regular-expressions.info [be sure to select Ruby and Java from the drop-downs]) that the Ruby engine doesn't support LookBehinds, Positive or Negative, depending on the version, and even then doesn't allow quantifiers in them (also noted by @revo in a comment below). This pattern should work in Java, .Net, most likely Ruby, and others:
(prop\d (?:true|false))(?s)(?!.*sat)(?-s)
/(?<=sat).*?(prop\d (true|false))/m
Match group 1 is what you want. See example.
BUT, I would really recommend splitting the string first. It's much easier.
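Since the question mentions a Java application, the split-first approach might look like the sketch below. It locates the sat line with one regex and then scans only the remainder with a second one; the names PropsAfterSat and extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PropsAfterSat {
    // "sat" at the start of a line, followed by at least one whitespace char.
    static final Pattern SAT_LINE = Pattern.compile("^sat\\s+", Pattern.MULTILINE);
    static final Pattern PROP = Pattern.compile("prop\\d+ (?:true|false)");

    // Returns all PropDefinitions that appear after the first "sat" line.
    static List<String> extract(String text) {
        List<String> props = new ArrayList<>();
        Matcher sat = SAT_LINE.matcher(text);
        if (!sat.find()) {
            return props; // no "sat" marker, nothing to report
        }
        // Restrict the second matcher to the text after the marker.
        Matcher prop = PROP.matcher(text).region(sat.end(), text.length());
        while (prop.find()) {
            props.add(prop.group());
        }
        return props;
    }

    public static void main(String[] args) {
        String text = "((prop5 true))\nsat\n((prop0 false)\n(prop1 false)\n(prop2 true))\n";
        System.out.println(extract(text));
    }
}
```

Using Matcher.region avoids copying the tail of the string, but calling substring first and matching on the result would work just as well.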
