Sporadic Stack Overflow error in java Matcher - java

I have some file parser code where I sporadically get stack overflow errors on m.matches() (where m is a Matcher).
When I run my app again, it parses the same file with no stack overflow.
It's true my Pattern is a bit complex. It's basically a bunch of optional zero-length positive lookaheads with named groups inside them, so that I can match a bunch of variable name/value pairs regardless of their order. But I would expect that if some string caused a stack overflow error, it would always cause it, not just sometimes. Any ideas?
A much simplified version of my Pattern
"prefix(?=\\s+user=(?<user>\\S+))?(?=\\s+repo=(?<repo>\\S+))?.*?"
full regex is...
app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?(?=(?:[^"]|"[^"]*")*\s+remote_address=(?<ip>\S+))?(?=(?:[^"]|"[^"]*")*\s+now="(?<time>\S+)\+\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+url="(?<url>\S+)")?(?=(?:[^"]|"[^"]*")*\s+referer="(?<referer>\S+)")?(?=(?:[^"]|"[^"]*")*\s+status=(?<status>\S+))?(?=(?:[^"]|"[^"]*")*\s+elapsed=(?<elapsed>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_method=(?<requestmethod>\S+))?(?=(?:[^"]|"[^"]*")*\s+created_at="(?<createdat>\S+)(?:-|\+)\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+pull_request_id=(?<pullrequestid>\d+))?(?=(?:[^"]|"[^"]*")*\s+at=(?<at>\S+))?(?=(?:[^"]|"[^"]*")*\s+fn=(?<fn>\S+))?(?=(?:[^"]|"[^"]*")*\s+method=(?<method>\S+))?(?=(?:[^"]|"[^"]*")*\s+current_user=(?<user2>\S+))?(?=(?:[^"]|"[^"]*")*\s+content_length=(?<contentlength>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_category=(?<requestcategory>\S+))?(?=(?:[^"]|"[^"]*")*\s+controller=(?<controller>\S+))?(?=(?:[^"]|"[^"]*")*\s+action=(?<action>\S+))?.*?
Top of stack overflow error stack... (it's about 9800 lines long)
Exception: java.lang.StackOverflowError
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
Example of a line I got the error on (though I have run it 10 times since and not gotten any error):
app=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=675fa67e-c1de-4bfa-a965-127b928d427a server_id=c31404fc-b7d0-41a1-8017-fc1a6dce8111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.041 url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" worker_request_count=77192 request_category=apiapp=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=89fcb32e-9ab5-47f7-9464-e5f5cff175e8 server_id=1b74880a-5124-4483-adce-111b60dac111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.024 
url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" worker_request_count=76263 request_category=api
Interestingly, this line seems to be malformed: the log seems to have put a line break in the wrong place, resulting in two log entries on a single line followed by a blank line. It's this long line that caused the error... well, once anyway. Now it runs just fine without a stack overflow.

There are 2 ways to fix your problem:
1. Parse the input string properly and get the key values from the Map. I strongly recommend using this method, since the code will be much cleaner, and we no longer have to watch the limit on the input size.
2. Modify the existing regex to greatly reduce the impact of the implementation flaw which causes StackOverflowError.
Parse the input string
You can parse the input string with the following regex:
\G\s*+(\w++)=([^\s"]++|"[^"]*+")(?:\s++|$)
All quantifiers are made possessive (*+ instead of *, ++ instead of +), since the pattern I wrote doesn't need backtracking.
You can find the basic regex (\w++)=([^\s"]++|"[^"]*+") to match key-value pairs in the middle.
\G makes sure the match starts from where the last match left off. It is used to prevent the engine from "bumping along" to the next position when it fails to match.
\s*+ and (?:\s++|$) are for consuming excess spaces. I specify (?:\s++|$) instead of \s*+ to prevent key="value"key=value from being recognized as valid input.
The full example code can be found below:
private static final Pattern KEY_VALUE = Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");
public static Map<String, String> parseKeyValue(String kvString) {
Matcher matcher = KEY_VALUE.matcher(kvString);
Map<String, String> output = new HashMap<String, String>();
int lastIndex = -1;
while (matcher.find()) {
output.put(matcher.group(1), matcher.group(2));
lastIndex = matcher.end();
}
// Make sure that we match everything from the input string
if (lastIndex != kvString.length()) {
return null;
}
return output;
}
You might want to unquote the values, depending on your requirement.
You can also rewrite the function to pass a List of keys you want to extract, and pick them out in the while loop as you go to avoid storing keys that you don't care about.
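The two suggestions above (unquoting values and keeping only selected keys) could be sketched as follows. This is a minimal sketch, not a definitive implementation: the class name SelectiveKeyValueParser and the helper unquote are hypothetical, and it assumes values are wrapped in plain double quotes without escape sequences.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SelectiveKeyValueParser {
    // Same key=value pattern as in the answer above.
    private static final Pattern KEY_VALUE =
            Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

    /** Returns only the wanted keys, or null if the input is not fully consumed. */
    static Map<String, String> parseSelected(String kvString, Set<String> wantedKeys) {
        Matcher matcher = KEY_VALUE.matcher(kvString);
        Map<String, String> output = new HashMap<>();
        int lastIndex = -1;
        while (matcher.find()) {
            String key = matcher.group(1);
            // Store a pair only when the caller asked for this key.
            if (wantedKeys.contains(key)) {
                output.put(key, unquote(matcher.group(2)));
            }
            lastIndex = matcher.end();
        }
        // Same completeness check as the original function.
        return lastIndex == kvString.length() ? output : null;
    }

    /** Strips one pair of surrounding double quotes, if present (hypothetical helper). */
    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }
}
```

Every pair is still matched so that the completeness check keeps working; only the storage step is filtered.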
Modify the regex
The problem is due to the outer repetition (?:[^"]|"[^"]*")* being implemented with recursion, leading to StackOverflowError when the input string is long enough.
Specifically, in each repetition, it matches either a quoted token, or a single non-quoted character. As a result, the stack grows linearly with the number of non-quoted characters and blows up.
You can replace all instances of (?:[^"]|"[^"]*")* with [^"]*(?:"[^"]*"[^"]*)*. The stack will now grow linearly with the number of quoted tokens, so StackOverflowError will not occur unless you have thousands of quoted tokens in the input string.
Pattern KEY_CAPTURE = Pattern.compile("app=github(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+user=(?<user>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+repo=(?<repo>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+remote_address=(?<ip>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+now=\"(?<time>\\S+)\\+\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+url=\"(?<url>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+referer=\"(?<referer>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+status=(?<status>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+elapsed=(?<elapsed>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_method=(?<requestmethod>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+created_at=\"(?<createdat>\\S+)(?:-|\\+)\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+pull_request_id=(?<pullrequestid>\\d+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+at=(?<at>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+fn=(?<fn>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+method=(?<method>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+current_user=(?<user2>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+content_length=(?<contentlength>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_category=(?<requestcategory>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+controller=(?<controller>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+action=(?<action>\\S+))?");
This follows from the equivalent expansion of the regex (A|B)* → A*(BA*)*. Which subpattern to use as A and which as B depends on how often each repeats: whichever repeats more should be A, and the other should be B.
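As a quick sanity check of that expansion, the sketch below compares the two forms on a few short inputs and then runs the rewritten form on a long unquoted string. The class name LoopUnrollingDemo and the helper names are hypothetical. The original form is deliberately not run on the long input, because whether it overflows depends on the JVM's stack size.

```java
import java.util.regex.Pattern;

final class LoopUnrollingDemo {
    // Original form: one repetition per unquoted character or quoted token.
    static final Pattern ORIGINAL = Pattern.compile("(?:[^\"]|\"[^\"]*\")*");
    // Rewritten form A*(BA*)* with A = [^"] and B = "[^"]*".
    static final Pattern UNROLLED = Pattern.compile("[^\"]*(?:\"[^\"]*\"[^\"]*)*");

    /** True when both forms agree on whether the whole input matches. */
    static boolean agree(String input) {
        return ORIGINAL.matcher(input).matches() == UNROLLED.matcher(input).matches();
    }

    /** Builds a string of n unquoted characters. */
    static String unquoted(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a b", "a \"b c\" d", "a \"b", "\"\"\"x\""}) {
            System.out.println(agree(s) + " for: " + s);
        }
        // Long unquoted run: handled by the character-class repetition [^"]*.
        System.out.println(UNROLLED.matcher(unquoted(200_000)).matches());
    }
}
```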
Deep diving into the implementation
StackOverflowError in Pattern is a known problem, which can happen when your pattern contains a repetition of a non-deterministic [1] capturing/non-capturing group, which is the subpattern (?:[^"]|"[^"]*")* in your case.
[1] This is terminology used in the source code of Pattern, which is probably intended to indicate whether the pattern has a fixed length. However, the implementation considers alternation | to be non-deterministic, regardless of the actual pattern.
Greedy or lazy repetition of a non-deterministic capturing/non-capturing group is compiled into the Loop/LazyLoop classes, which implement repetition by recursion. As a result, such a pattern is extremely prone to triggering StackOverflowError, especially when the group contains a branch where only a single character is matched at a time.
On the other hand, deterministic [2] repetition, possessive repetition, and repetition of an independent group (?>...) (a.k.a. atomic group or non-backtracking group) are compiled into the Curly/GroupCurly classes, which process the repetition with a loop in most cases, so there will be no StackOverflowError.
[2] The repeated pattern is a character class, or a fixed-length capturing/non-capturing group without any alternation.
You can see how a fragment of your original regex is compiled below. Take note of the problematic part, which starts with Loop, and compare it to your stack trace.
app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?
BnM. Boyer-Moore (BMP only version) (length=10)
app=github
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=0
Prolog. Loop wrapper
Loop [1889ca51]. Greedy quantifier {0,2147483647}
GroupHead. local=1
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [7e41986c]. Connect branches to sequel.
GroupTail [47e1b36]. local=1, group=0. --[next]--> Loop [1889ca51]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
user=
GroupHead. local=3
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [732c7887]. local=3, group=2. --[next]--> GroupTail [6c9d2223]
GroupTail [6c9d2223]. local=0, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=4
Prolog. Loop wrapper
Loop [402c5f8a]. Greedy quantifier {0,2147483647}
GroupHead. local=5
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [21347df0]. Connect branches to sequel.
GroupTail [7d382897]. local=5, group=0. --[next]--> Loop [402c5f8a]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
repo=
GroupHead. local=7
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [71f111ba]. local=7, group=4. --[next]--> GroupTail [9c304c7]
GroupTail [9c304c7]. local=4, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
LastNode.
Node. Accept match

Final answer:
Move this (?:[^"]|"[^"]*")* functionality into an alternation group with
the others.
Sample: https://ideone.com/YuVcMg
It can't be broken!
A side note - I noticed you said you deleted a newline and ended up with
the end of one record being without a separator between the next,
like this request_category=apiapp=github
That is ok, but these regexes will mostly blow by it when it hits the
\S+.
For that reason, it is better to replace \S+ with (?:(?!app=github)\S)+.
That replacement is not made in the regex shown in the "Regex" part further down.
Here is that regex with the replacement added:
"(?s)app=github(?>\\s+user=(?<user>(?:(?!app=github)\\S)+)|\\s+repo=(?<repo>(?:(?!app=github)\\S)+)|\\s+remote_address=(?<ip>(?:(?!app=github)\\S)+)|\\s+now=\\\\?\"(?<time>(?:(?!app=github)\\S)+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>(?:(?!app=github)\\S)+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>(?:(?!app=github)\\S)+)\\\\?\"|\\s+status=(?<status>(?:(?!app=github)\\S)+)|\\s+elapsed=(?<elapsed>(?:(?!app=github)\\S)+)|\\s+request_method=(?<requestmethod>(?:(?!app=github)\\S)+)|\\s+created_at=\\\\?\"(?<createdat>(?:(?!app=github)\\S)+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>(?:(?!app=github)\\S)+)|\\s+fn=(?<fn>(?:(?!app=github)\\S)+)|\\s+method=(?<method>(?:(?!app=github)\\S)+)|\\s+current_user=(?<user2>(?:(?!app=github)\\S)+)|\\s+content_length=(?<contentlength>(?:(?!app=github)\\S)+)|\\s+request_category=(?<requestcategory>(?:(?!app=github)\\S)+)|\\s+controller=(?<controller>(?:(?!app=github)\\S)+)|\\s+action=(?<action>(?:(?!app=github)\\S)+)|\"[^\"]*\"|(?!app=github).)+"
And a link to that sample using it: https://ideone.com/hdwufO
Regex
Raw:
(?s)app=github(?>\s+user=(?<user>\S+)|\s+repo=(?<repo>\S+)|\s+remote_address=(?<ip>\S+)|\s+now=\\?"(?<time>\S+)\+\d\d:\d\d\\?"|\s+url=\\?"(?<url>\S+)\\?"|\s+referer=\\?"(?<referer>\S+)\\?"|\s+status=(?<status>\S+)|\s+elapsed=(?<elapsed>\S+)|\s+request_method=(?<requestmethod>\S+)|\s+created_at=\\?"(?<createdat>\S+)[-+]\d\d:\d\d\\?"|\s+pull_request_id=(?<pullrequestid>\d+)|\s+at=(?<at>\S+)|\s+fn=(?<fn>\S+)|\s+method=(?<method>\S+)|\s+current_user=(?<user2>\S+)|\s+content_length=(?<contentlength>\S+)|\s+request_category=(?<requestcategory>\S+)|\s+controller=(?<controller>\S+)|\s+action=(?<action>\S+)|"[^"]*"|(?!app=github).)+
Stringed:
"(?s)app=github(?>\\s+user=(?<user>\\S+)|\\s+repo=(?<repo>\\S+)|\\s+remote_address=(?<ip>\\S+)|\\s+now=\\\\?\"(?<time>\\S+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>\\S+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>\\S+)\\\\?\"|\\s+status=(?<status>\\S+)|\\s+elapsed=(?<elapsed>\\S+)|\\s+request_method=(?<requestmethod>\\S+)|\\s+created_at=\\\\?\"(?<createdat>\\S+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>\\S+)|\\s+fn=(?<fn>\\S+)|\\s+method=(?<method>\\S+)|\\s+current_user=(?<user2>\\S+)|\\s+content_length=(?<contentlength>\\S+)|\\s+request_category=(?<requestcategory>\\S+)|\\s+controller=(?<controller>\\S+)|\\s+action=(?<action>\\S+)|\"[^\"]*\"|(?!app=github).)+"
Formatted:
(?s)
app = github
(?>
\s+
user =
(?<user> \S+ ) # (1)
|
\s+ repo =
(?<repo> \S+ ) # (2)
|
\s+ remote_address =
(?<ip> \S+ ) # (3)
|
\s+ now= \\? "
(?<time> \S+ ) # (4)
\+ \d\d : \d\d \\? "
|
\s+ url = \\? "
(?<url> \S+ ) # (5)
\\? "
|
\s+ referer = \\? "
(?<referer> \S+ ) # (6)
\\? "
|
\s+ status =
(?<status> \S+ ) # (7)
|
\s+ elapsed =
(?<elapsed> \S+ ) # (8)
|
\s+ request_method =
(?<requestmethod> \S+ ) # (9)
|
\s+ created_at = \\? "
(?<createdat> \S+ ) # (10)
[-+]
\d\d : \d\d \\? "
|
\s+ pull_request_id =
(?<pullrequestid> \d+ ) # (11)
|
\s+ at=
(?<at> \S+ ) # (12)
|
\s+ fn=
(?<fn> \S+ ) # (13)
|
\s+ method =
(?<method> \S+ ) # (14)
|
\s+ current_user =
(?<user2> \S+ ) # (15)
|
\s+ content_length =
(?<contentlength> \S+ ) # (16)
|
\s+ request_category =
(?<requestcategory> \S+ ) # (17)
|
\s+ controller =
(?<controller> \S+ ) # (18)
|
\s+ action =
(?<action> \S+ ) # (19)
|
" [^"]* " # None of the above, give quotes a chance
|
(?! app = github ) # Failsafe, consume a character, advance by 1
.
)+
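To make the structure of this answer's regex easier to follow, here is a reduced sketch with only two of the keys (user and repo); the other keys would be added as further branches of the same shape. The class name AlternationGroupDemo and the method userAndRepo are hypothetical, and the sample input in the test uses plain quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationGroupDemo {
    // Reduced to two keys for readability; the remaining keys follow the same shape.
    static final Pattern RECORD = Pattern.compile(
            "(?s)app=github(?>"
            + "\\s+user=(?<user>\\S+)"
            + "|\\s+repo=(?<repo>\\S+)"
            + "|\"[^\"]*\""          // none of the keys: skip a quoted value as a whole
            + "|(?!app=github)."     // failsafe: consume one character, advance by 1
            + ")+");

    /** Returns {user, repo} for the first record; an entry is null when the key is absent. */
    static String[] userAndRepo(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group("user"), m.group("repo")};
    }
}
```

Because every key branch starts with \s+, a key such as current_user is not mistaken for user: the text before user= in current_user= is not whitespace.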

Related

Get equations from string

I have a string in following pattern
( var1=:key1:'any_value including space and 'quotes'' AND/OR var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )
I want to get following result from this.
:key1:'any_value including space and 'quotes''
:key2:'any_value...'
:key3:'any_value...'
Could anyone please suggest a pattern/RE for this?
Failed attempts:
I could first split it by AND/OR and then split the resulting strings on : and so on, but I am looking for a single RE/pattern that can do this.
You can use this regex with negated pattern to match your data:
":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))"
Breakup:
: # match a literal :
[^:]+ # match 1 or more characters that are not :
: # match a literal :
' # match a literal '
.*? # match 0 or more of any characters (non-greedy)
' # match a literal '
(?=\s*(?:AND(?:/OR)?|\))) # lookahead to assert there is AND/OR at the end or closing )
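A minimal Java sketch of applying this pattern to the string from the question is shown below. The class name KeyedValueExtractor and the method extract are hypothetical; the pattern itself is the one from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyedValueExtractor {
    // :key:'value' followed (after optional spaces) by AND, AND/OR, or a closing parenthesis.
    static final Pattern KEYED_VALUE =
            Pattern.compile(":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))");

    /** Collects every :key:'value' fragment in order of appearance. */
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = KEYED_VALUE.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

The lookahead is what lets the lazy .*? run past the inner quotes in 'quotes'': a closing quote is accepted only when a separator or the closing parenthesis follows it.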
I think this would work for your circumstances.
Unless you know how the quotes are supposed to be parsed, there is not much else you can do.
Raw: (?<==)(?:(?!\s*AND/OR).)+
Quoted: "(?<==)(?:(?!\\s*AND/OR).)+"
Expanded:
(?<= = ) # A '=' behind
(?:
(?! \s* AND/OR ) # Not 'AND/OR' in front
.
)+

Extraction of subsequences that end with point and space by regular expression

Hi,
I want to extract the sub-sentences of this sentence with a regular expression:
it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. learndfefe layout. learn sdffsfsfs. sddsd learn fefe.
I couldn't write a correct regular expression for Pattern.compile.
This is my expression: ([^(\\.\\s)]*)([^.]*\\.)
Actually, I need a way of writing "read everything except \\.\\s".
Sub-sentences:
it learn od fg network layout.
kdsjhuu ddkm networ.12kfdf.
learndfefe layout.
learn sdffsfsfs.
sddsd learn fefe.
Just split your string with regex "\\. "
String[] arr= str.split("\\. ");
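Note that splitting on "\\. " removes the dot from every piece except the last, while the expected output in the question keeps it. A possible variation, sketched below, splits on the whitespace only and uses a lookbehind to require a dot before it. The class name SentenceSplitter is hypothetical, and the lookbehind variant is an assumption about what the asker wants, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;

final class SentenceSplitter {
    /** Splits at whitespace that directly follows a dot, so each piece keeps its dot. */
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=\\.)\\s+"));
    }
}
```

A dot that is not followed by whitespace, as in networ.12kfdf, does not cause a split.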
You can use this pattern with the find method:
Pattern p = Pattern.compile("[^\\s.][^.]*(?:\\.(?!\\s|\\z)[^.]*)*\\.?");
Matcher m = p.matcher(yourText);
while(m.find()) {
System.out.println(m.group(0));
}
Pattern details:
[^\\s.] # all that is not a whitespace (to trim) or a dot
[^.]* # all that is not a dot (zero or more times)
(?: # open a non-capturing group
\\. (?!\\s|\\z) # dot not followed by a whitespace or the end of the string
[^.]* #
)* # close and repeat the group as needed
\\.? # an optional dot (allow to match a sentence at the end
# of the string even if there is no dot)

How to fix this regex (matching dictionary entries)

I am working with a Spanish dictionary that has definitions like the following:
l. a. c. Buitre, alimoche. adj. Persona alelada. (Cornago). GOICOECHEA. // 2. f. Persona torpe, despistada e irreflexiva. // 3. Estar mirando a los abantos. fr. fig. Ser despistado, soñador, no apercibirse de la realidad. Autol. RUIZ. // 4. f. esto es una prueba
Where the following rules apply:
Each definition MAY contain one (and never more than one) of the following categories:
l. a. c.
f.
m.
The category is always at the start of a definition
The first definition starts at the beginning; if there are more definitions, they start with // n. where 'n' is a number (could be more than one digit)
For the example I gave, the following definitions should be parsed:
(Category: l.a.c.) Buitre, alimoche. adj. Persona alelada. (Cornago). GOICOECHEA
(Category: f.) Persona torpe, despistada e irreflexiva.
(No category) Estar mirando a los abantos. fr. fig. Ser despistado, soñador, no apercibirse de la realidad. Autol. RUIZ.
(Category: f.) esto es una prueba
I am trying to make a regex to capture every definition (that is 0 or 1 category + meaning). This is what I have
(?:(m\.|l\. a\. c\.|f\.) )?(.*?) (?:$|(?:\/\/ \d+. (?:(m\.|l\. a\. c\.|f\.) )?(.*?))+)
I am testing it here. This is how I wrote it:
(?:
(m\.|l\. a\. c\.|f\.) <-- First: unnamed group containing the named group
for the category and one space
)?
(.*?) <-- Named group for the meaning
(?: <-- Unnamed group for end of line OR another definition
$ <--- (end of line)
| <--- (OR)
(?:\/\/ \d+. <--- (Definition separator & number)
(?:(m\.|l\. a\. c\.|f\.) )?(.*?) <-- Another definition
)+ <-- There may be more than one definition, so we add '+'
)
I have several problems:
I am not sure why it does not work. It seems like the last capture group (.*?) is not expanding until the next //. How can I fix it?
The group (m\.|l\. a\. c\.|f\.) should be larger (there are more categories). How can I avoid repeating it?
There is some repetition in the regex string that I gave. How can I avoid that?
This is my first non-trivial regex example, so any other comments about style, or improvements in general, are welcome.
My main question is why is my regex not working. (This is just to clarify...)
The problem is that the last capture group is non-greedy.
(?:
(m\.|l\. a\. c\.|f\.)
)?
(.*?)
(?:
$
|
(?:\/\/ \d+.
(?:(m\.|l\. a\. c\.|f\.) )?
(.*?) <-- this is non-greedy.
)
)+
Because of that, it will simply match the empty string. The + at the end of the pattern doesn't do anything because it already matched once, and that's enough to stop.
The fix is simple: Force the pattern to match the entire line. Just add $ at the end.
(?:(m\.|l\. a\. c\.|f\.) )?(.*?) (?:$|(?:\/\/ \d+. (?:(m\.|l\. a\. c\.|f\.) )?(.*?)))+$
EDIT: It's not possible to capture each category and definition with a single regex. If you use a single pattern to match the entire string, each capture group will only contain the text it matched last, so you'll only be able to parse the last definition.
You can use this pattern to match a single definition.
(?:^| \/\/ \d\. )(?:(?P<category>m\.|l\. a\. c\.|f\.) )?(?P<definition>.*?)(?:$|(?= \/\/ \d\.))
Apply it to the string until it no longer finds a match to capture all definitions.
while (matcher.find()){
... do something
}
Detailed explanation of the pattern:
(?:
^ // match start of string
| // OR
\/\/ \d\. // the separator "// " literally, followed by a digit, a dot, and a space
)
(?:
(?P<category> // in the named group "category", capture...
m\.|l\. a\. c\.|f\. // one of "m.", "l. a. c.", "f."
) // and a space
)? // ...if possible.
(?P<definition> // in the named group "definition", capture...
.*? // everything up to...
)
(?:
$ // the end of the string
| // OR
(?= // the start of the next definition. This needs to be enclosed in a lookahead assertion so as not to consume it.
\/\/ \d\.
)
)
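For use with java.util.regex, the pattern needs two adjustments: Java writes named groups as (?<name>...) and not (?P<name>...), and the slashes need no escaping. The sketch below also uses \d+ instead of \d, since the question says the number can have more than one digit. The class name DefinitionParser and the method parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DefinitionParser {
    static final Pattern DEFINITION = Pattern.compile(
            "(?:^| // \\d+\\. )"                                // start of string or separator
            + "(?:(?<category>m\\.|l\\. a\\. c\\.|f\\.) )?"     // optional category and a space
            + "(?<definition>.*?)"                              // everything up to...
            + "(?:$|(?= // \\d+\\.))");                         // ...the end or the next separator

    /** Returns {category, definition} pairs; category is null when absent. */
    static List<String[]> parse(String entry) {
        List<String[]> result = new ArrayList<>();
        Matcher m = DEFINITION.matcher(entry);
        while (m.find()) {
            result.add(new String[] {m.group("category"), m.group("definition")});
        }
        return result;
    }
}
```

Adding more categories only requires extending the single alternation inside the category group, which also addresses the asker's concern about repeating it.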

Regex match repeating pattern only after string

let a PropDefinition be a string of the form prop\d+ (true|false)
I have a string like:
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
I'd like to extract the bottom PropDefinitions only after the text 'sat', so the matches should be:
prop0 false
prop1 false
prop2 true
I originally tried using /(prop\d (?:true|false))/s (see example here) but that obviously matches all PropDefinitions and I couldn't make it match repeats only after the sat string
I used rubular as an example above because it was convenient, but I'm really looking for the most language agnostic solution. If it's vital info, I'll most likely be using the regex in a Java application.
str =<<-Q
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
Q
p str[/^sat(.*)/m, 1].scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]
When you have patterns that are very different in nature as in this case (string after sat and selecting the specific patterns), it is usually better to express them in multiple regexes rather than trying to do it with a single regex.
s = <<_
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
_
s.split(/^sat\s+/, 2).last.scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]
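Since the asker expects to use this in a Java application, the same two-step idea can be sketched in Java: split once at the sat line, then scan the remainder. The class name PropExtractor and the method afterSat are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PropExtractor {
    static final Pattern PROP = Pattern.compile("prop\\d+ (?:true|false)");

    /** Returns every PropDefinition that appears after the line starting with "sat". */
    static List<String> afterSat(String text) {
        List<String> result = new ArrayList<>();
        // Step 1: split once at a line that starts with "sat".
        String[] parts = text.split("(?m)^sat\\s+", 2);
        if (parts.length < 2) {
            return result; // no "sat" line, so nothing qualifies
        }
        // Step 2: scan only the part after "sat".
        Matcher m = PROP.matcher(parts[1]);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```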
\s+[(]+\K(prop\d (?:true|false)(?=[)]))
If Ruby can support the \G anchor this is one solution.
It looks nasty, but several things are going on.
1. It only allows a single nest (outer plus many inners)
2. It will not match invalid forms that don't comply with '(prop\d true|false)'
Without condition 2, it would be a lot easier, which is an indicator that a two-regex
solution would do the same: first capture the outer form sat((..)..(..)..),
then globally capture the inner form (prop\d true|false).
Can be done in a single regex, though this is going to be hard to look at, but should work (test case below in Perl).
# (?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))
(?:
(?! \A | sat \s* \( )
\G # Start match from end of last match
| # or,
sat \s* \( # Start form 'sat ('
)
[^()]* # This check section consumes invalid inner '(..)' forms
(?: # since we are looking specifically for '(prop\d true|false)'
\(
(?!
prop \d [ ]
(?: true | false )
\)
)
[^()]*
\)
[^()]*
)* # End section, do optionally many times
\(
( # (1 start), match inner form '(prop\d true|false)'
prop \d [ ]
(?: true | false )
) # (1 end)
\)
(?= # Look ahead for end form '(..)(..))'
(?:
[^()]*
\( [^()]* \)
)*
[^()]*
\)
)
Perl test case
$/ = undef;
$str = <DATA>;
while ($str =~ /(?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))/g)
{
print "'$1'\n";
}
__DATA__
((prop10 true))
sat
((prop3 false)
(asdg)
(propa false)
(prop1 false)
(prop2 true)
)
((prop5 true))
Output >>
'prop3 false'
'prop1 false'
'prop2 true'
Part of the confusion has to do with SingleLine vs MultiLine matching. The patterns below work for me and return all matches in a single execution and without requiring a preliminary operation to split the string.
This one requires SingleLine mode to be specified separately (as in .Net RegExOptions):
(?<=sat.*)(prop\d (?:true|false))
This one specifies SingleLine mode inline which works with many, but not all, RegEx engines:
(?s)(?<=sat.*)(?-s)(prop\d (?:true|false))
You don't need to turn SingleLine mode off via the (?-s) but I think it is clearer in its intent.
The following pattern also toggles SingleLine mode inline, but uses a Negative LookAhead instead of a Positive LookBehind. According to regular-expressions.info (be sure to select Ruby and Java from the drop-downs), the Ruby engine supports LookBehinds--Positive or Negative--only depending on the version, and even then doesn't allow quantifiers in them (also noted by @revo in a comment below). This pattern should work in Java, .Net, most likely Ruby, and others:
(prop\d (?:true|false))(?s)(?!.*sat)(?-s)
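A minimal Java sketch of this lookahead variant follows; the class name NoSatAhead and the method extract are hypothetical. It keeps the answer's \d, so it only covers single-digit property numbers, and it assumes the text "sat" does not occur anywhere after the last block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NoSatAhead {
    // A PropDefinition that has no "sat" anywhere after it (dot matches newlines inside the lookahead).
    static final Pattern PROP_WITHOUT_SAT_AHEAD =
            Pattern.compile("(prop\\d (?:true|false))(?s)(?!.*sat)(?-s)");

    /** Collects group 1 of every match in order of appearance. */
    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = PROP_WITHOUT_SAT_AHEAD.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }
}
```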
/(?<=sat).*?(prop\d (true|false))/m
Match group 1 is what you want. See example.
BUT, I would really recommend splitting the string first. It's much easier.

Trying to get some useful data from a log file with regex in Java

I'm having trouble with regex, because I can only match some of my goals.
I have a log file and I must match some of the items and write them to another txt file. I wrote some Java code that works on a short example, but when I feed it the whole file, everything gets messed up.
*052511 074217 0065 02242806000 UNKNOWN U G
*052511 074217 0065 4874 02242806000 UNKNOWN U A
*052511 074218 0065 4874 02242806000 UNKNOWN U R
-------- 05/25/11 07:42:17 LINE = 0065 STN = 4874
CALLING NUMBER 02242806000
NAME UNKNOWN
UNKNOWN
BC = SPEECH
00:00:00 INCOMING CALL RINGING 0:02
00:00:11 CALL RELEASED
I have to find these results from the file:
incomming,05/25/11,07:42:17,0065,4874,02242806000,00:00:09,2
In this expression 00:00:09 means [00:00:11-00:00:00]-0:02
For every incoming and outgoing call, I must make the conversion above.
Here is my code
Here is the log file
You could use a regex like:
(?xm:
^-------- \s+ (\S+) \s+ (\S+) \s+ LINE\s*=\s*(\d+) \s+ STN\s*=\s*(\d+)
\s+ CALLING\ NUMBER \s+ (\d+) \s*
(?:^(?:[ \t]+.*)?[\n\r]+)* # eat unwanted part
^(\d\d:\d\d:\d\d) \s+ INCOMING\ CALL \s+ RINGING\ ([\d:]+) \s*
(?:^\d.*[\r\n]+)* # possible stuff
^(\d\d:\d\d:\d\d) \s+ CALL\ RELEASED
)
Use the values of the capturing groups to get your results. You may need to remove the /x related things like comments and spaces.
Perl example at http://ideone.com/qTBFe
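Once the three time values have been captured, the arithmetic described in the question ([released - incoming] - ringing) could be sketched in Java as below. The class name CallDuration and its methods are hypothetical; the sketch assumes the two timestamps are in HH:mm:ss form, the ringing time is in m:ss form, and the call does not wrap past midnight.

```java
import java.time.LocalTime;

final class CallDuration {
    /** Parses a ringing time such as "0:02" (minutes:seconds) into seconds. */
    static int ringingSeconds(String ringing) {
        String[] parts = ringing.split(":");
        return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
    }

    /** Computes (released - incoming) - ringing and formats it as HH:mm:ss. */
    static String talkTime(String incoming, String released, String ringing) {
        int start = LocalTime.parse(incoming).toSecondOfDay();
        int end = LocalTime.parse(released).toSecondOfDay();
        int seconds = (end - start) - ringingSeconds(ringing);
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }
}
```

With the values from the sample record, talkTime("00:00:00", "00:00:11", "0:02") gives 00:00:09, matching the expected output line in the question.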
