Let a PropDefinition be a string of the form prop\d+ (true|false).
I have a string like:
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
I'd like to extract only the PropDefinitions that come after the text 'sat', so the matches should be:
prop0 false
prop1 false
prop2 true
I originally tried /(prop\d (?:true|false))/s (see example here), but that obviously matches all PropDefinitions, and I couldn't make it match only the ones after the sat string.
I used Rubular for the example above because it was convenient, but I'm really looking for the most language-agnostic solution. If it matters, I'll most likely be using the regex in a Java application.
str =<<-Q
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
Q
p str[/^sat(.*)/m, 1].scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]
When you have patterns that are very different in nature as in this case (string after sat and selecting the specific patterns), it is usually better to express them in multiple regexes rather than trying to do it with a single regex.
s = <<_
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
_
s.split(/^sat\s+/, 2).last.scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]
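Since the question mentions Java, the same two-step idea (find the sat line first, then scan only the text after it) might look like the sketch below. The class name PropAfterSat and the method extract are made up for illustration, and the sketch assumes sat sits on a line of its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropAfterSat {
    // Step 1: a line that consists of "sat" only (assumption about the input).
    private static final Pattern SAT_LINE = Pattern.compile("^sat$", Pattern.MULTILINE);
    // Step 2: a single PropDefinition as defined in the question.
    private static final Pattern PROP = Pattern.compile("prop\\d+ (?:true|false)");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher sat = SAT_LINE.matcher(input);
        if (!sat.find()) {
            return result; // no "sat" line, so nothing to extract
        }
        Matcher prop = PROP.matcher(input);
        // Restrict the second matcher to the text after the "sat" line.
        prop.region(sat.end(), input.length());
        while (prop.find()) {
            result.add(prop.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "((prop5 true))\nsat\n((prop0 false)\n(prop1 false)\n(prop2 true))\n";
        System.out.println(extract(input));
    }
}
```

Using Matcher.region avoids copying the tail of the string; calling input.substring(sat.end()) and matching on that would work just as well.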
\s+[(]+\K(prop\d (?:true|false)(?=[)]))
If Ruby supports the \G anchor, this is one solution.
It looks nasty, but several things are going on.
1. It only allows a single nest (outer plus many inners)
2. It will not match invalid forms that don't comply with '(prop\d true|false)'
Without condition 2 it would be a lot easier, which suggests that a two-regex
solution would do the same job: first capture the outer form sat((..)..(..)..),
then globally capture the inner form (prop\d true|false).
It can be done in a single regex. It is going to be hard to look at, but it should work (test case below in Perl).
# (?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))
(?:
(?! \A | sat \s* \( )
\G # Start match from end of last match
| # or,
sat \s* \( # Start form 'sat ('
)
[^()]* # This check section consumes invalid inner '(..)' forms
(?: # since we are looking specifically for '(prop\d true|false)'
\(
(?!
prop \d [ ]
(?: true | false )
\)
)
[^()]*
\)
[^()]*
)* # End section, do optionally many times
\(
( # (1 start), match inner form '(prop\d true|false)'
prop \d [ ]
(?: true | false )
) # (1 end)
\)
(?= # Look ahead for end form '(..)(..))'
(?:
[^()]*
\( [^()]* \)
)*
[^()]*
\)
)
Perl test case
$/ = undef;
$str = <DATA>;
while ($str =~ /(?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))/g)
{
print "'$1'\n";
}
__DATA__
((prop10 true))
sat
((prop3 false)
(asdg)
(propa false)
(prop1 false)
(prop2 true)
)
((prop5 true))
Output >>
'prop3 false'
'prop1 false'
'prop2 true'
Part of the confusion has to do with SingleLine vs MultiLine matching. The patterns below work for me and return all matches in a single execution and without requiring a preliminary operation to split the string.
This one requires SingleLine mode to be specified separately (as in .Net RegExOptions):
(?<=sat.*)(prop\d (?:true|false))
This one specifies SingleLine mode inline which works with many, but not all, RegEx engines:
(?s)(?<=sat.*)(?-s)(prop\d (?:true|false))
You don't need to turn SingleLine mode off via the (?-s) but I think it is clearer in its intent.
The following pattern also toggles SingleLine mode inline, but uses a negative lookahead instead of a positive lookbehind. According to regular-expressions.info (be sure to select Ruby and Java from the drop-downs), whether the Ruby engine supports lookbehind at all, positive or negative, depends on the version, and even where it does, quantifiers are not allowed inside it (also noted by @revo in a comment below). This pattern should work in Java, .Net, most likely Ruby, and others:
(prop\d (?:true|false))(?s)(?!.*sat)(?-s)
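As a rough Java sketch of the negative-lookahead variant: the class name PropLookahead is hypothetical, and prop\d+ is used instead of prop\d so that multi-digit names from the question's definition also match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropLookahead {
    // (?s) turns on DOTALL for the rest of the pattern, so .* inside the
    // lookahead can cross line breaks. A PropDefinition is accepted only
    // when no "sat" occurs anywhere after it.
    private static final Pattern PROP_NOT_BEFORE_SAT =
            Pattern.compile("(prop\\d+ (?:true|false))(?s)(?!.*sat)");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PROP_NOT_BEFORE_SAT.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "((prop5 true))\nsat\n((prop0 false)\n(prop1 false)\n(prop2 true))\n";
        System.out.println(extract(input));
    }
}
```

Two caveats with this approach: the lookahead rescans the rest of the input for every candidate, and any later occurrence of the letters sat (for example inside another word) suppresses the matches before it. If the input contains no sat at all, every PropDefinition matches.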
/(?<=sat).*?(prop\d (true|false))/m
Match group 1 is what you want. See example.
But I would really recommend splitting the string first. It's much easier.
I have a string in the following pattern:
( var1=:key1:'any_value including space and 'quotes'' AND/OR var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )
I want to get the following result from it:
:key1:'any_value including space and 'quotes''
:key2:'any_value...'
:key3:'any_value...'
Could anyone please suggest a pattern/RE for this?
Failed attempts:
I could first split on AND/OR and then split the resulting strings on : and so on, but I'm looking for a single RE/pattern that can do this.
You can use this regex with a negated character class to match your data:
":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))"
Breakup:
: # match a literal :
[^:]+ # match 1 or more characters that are not :
: # match a literal :
' # match a literal '
.*? # match 0 or more of any characters (non-greedy)
' # match a literal '
(?=\s*(?:AND(?:/OR)?|\))) # lookahead to assert there is AND/OR at the end or closing )
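A minimal Java sketch of how that pattern might be applied to the string from the question. The class name QuotedValueFinder and its find method are made up for illustration; the regex itself is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValueFinder {
    // :key:'value' where the closing quote must be followed by AND, AND/OR or ")".
    private static final Pattern KEY_VALUE =
            Pattern.compile(":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))");

    static List<String> find(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = KEY_VALUE.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "( var1=:key1:'any_value including space and 'quotes'' AND/OR "
                + "var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )";
        for (String s : find(input)) {
            System.out.println(s);
        }
    }
}
```

The lookahead is what lets the lazy .*? run past the inner quotes: a quote only ends the value when a separator or the closing parenthesis follows it. A value that itself contains ' AND would still be cut short.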
I think this would work for your circumstances.
Unless you know exactly how the quotes are supposed to be parsed, there is not much else you can do.
Raw: (?<==)(?:(?!\s*AND/OR).)+
Quoted: "(?<==)(?:(?!\\s*AND/OR).)+"
Expanded:
(?<= = ) # A '=' behind
(?:
(?! \s* AND/OR ) # Not 'AND/OR' in front
.
)+
I have some file-parser code where I sporadically get stack overflow errors on m.matches() (where m is a Matcher).
When I run my app again, it parses the same file with no stack overflow.
It's true my Pattern is a bit complex. It's basically a bunch of optional zero-length positive lookaheads with named groups inside them, so that I can match a bunch of variable name/value pairs regardless of their order. But I would expect that if some string causes a stack overflow error, it would always cause it, not just sometimes. Any ideas?
A much-simplified version of my Pattern:
"prefix(?=\\s+user=(?<user>\\S+))?(?=\\s+repo=(?<repo>\\S+))?.*?"
The full regex is:
app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?(?=(?:[^"]|"[^"]*")*\s+remote_address=(?<ip>\S+))?(?=(?:[^"]|"[^"]*")*\s+now="(?<time>\S+)\+\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+url="(?<url>\S+)")?(?=(?:[^"]|"[^"]*")*\s+referer="(?<referer>\S+)")?(?=(?:[^"]|"[^"]*")*\s+status=(?<status>\S+))?(?=(?:[^"]|"[^"]*")*\s+elapsed=(?<elapsed>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_method=(?<requestmethod>\S+))?(?=(?:[^"]|"[^"]*")*\s+created_at="(?<createdat>\S+)(?:-|\+)\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+pull_request_id=(?<pullrequestid>\d+))?(?=(?:[^"]|"[^"]*")*\s+at=(?<at>\S+))?(?=(?:[^"]|"[^"]*")*\s+fn=(?<fn>\S+))?(?=(?:[^"]|"[^"]*")*\s+method=(?<method>\S+))?(?=(?:[^"]|"[^"]*")*\s+current_user=(?<user2>\S+))?(?=(?:[^"]|"[^"]*")*\s+content_length=(?<contentlength>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_category=(?<requestcategory>\S+))?(?=(?:[^"]|"[^"]*")*\s+controller=(?<controller>\S+))?(?=(?:[^"]|"[^"]*")*\s+action=(?<action>\S+))?.*?
Top of the stack overflow trace (it's about 9800 lines long):
Exception: java.lang.StackOverflowError
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
Example of a line I got the error on (though I have run it 10 times since and not gotten any error):
app=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=675fa67e-c1de-4bfa-a965-127b928d427a server_id=c31404fc-b7d0-41a1-8017-fc1a6dce8111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.041 url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" worker_request_count=77192 request_category=apiapp=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=89fcb32e-9ab5-47f7-9464-e5f5cff175e8 server_id=1b74880a-5124-4483-adce-111b60dac111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.024 
url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" worker_request_count=76263 request_category=api
Interestingly, this line itself seems to be an error: the log seems to have put a line break in the wrong place, resulting in two log entries on a single line followed by a blank line. It's this long line that caused the error... well, once anyway. Now it runs just fine without a stack overflow.
There are 2 ways to fix your problem:
1. Parse the input string properly and get the key values from the Map.
   I strongly recommend using this method, since the code will be much cleaner, and we no longer have to watch the limit on the input size.
2. Modify the existing regex to greatly reduce the impact of the implementation flaw which causes StackOverflowError.
Parse the input string
You can parse the input string with the following regex:
\G\s*+(\w++)=([^\s"]++|"[^"]*+")(?:\s++|$)
All quantifiers are made possessive (*+ instead of *, ++ instead of +), since the pattern I wrote doesn't need backtracking.
You can find the basic regex (\w++)=([^\s"]++|"[^"]*+") to match key-value pairs in the middle.
\G is to make sure the match starts from where the last match leaves off. It is used to prevent the engine from "bump-along" when it fails to match.
\s*+ and (?:\s++|$) are for consuming excess spaces. I specify (?:\s++|$) instead of \s*+ to prevent key="value"key=value from being recognized as valid input.
The full example code can be found below:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern KEY_VALUE = Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

// Returns all key-value pairs, or null if some part of the input does not match.
public static Map<String, String> parseKeyValue(String kvString) {
    Matcher matcher = KEY_VALUE.matcher(kvString);
    Map<String, String> output = new HashMap<String, String>();
    int lastIndex = -1;
    while (matcher.find()) {
        // group(1) is the key, group(2) the value (quotes included, if any)
        output.put(matcher.group(1), matcher.group(2));
        lastIndex = matcher.end();
    }
    // Make sure that we match everything from the input string
    if (lastIndex != kvString.length()) {
        return null;
    }
    return output;
}
You might want to unquote the values, depending on your requirement.
You can also rewrite the function to pass a List of keys you want to extract, and pick them out in the while loop as you go to avoid storing keys that you don't care about.
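One possible shape for that variant, as a sketch only: the class name SelectiveKeyValueParser and the wantedKeys parameter are made up here, and unlike the method above this version treats an empty line as valid and returns an empty map for it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectiveKeyValueParser {
    // Same key=value pattern as above, anchored with \G so pairs must be contiguous.
    private static final Pattern KEY_VALUE = Pattern.compile(
            "\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

    // Returns only the pairs whose key is in wantedKeys,
    // or null if the line is not a clean sequence of key=value pairs.
    static Map<String, String> parse(String line, Set<String> wantedKeys) {
        Matcher m = KEY_VALUE.matcher(line);
        Map<String, String> out = new HashMap<>();
        int lastIndex = 0;
        while (m.find()) {
            if (wantedKeys.contains(m.group(1))) {
                out.put(m.group(1), m.group(2));
            }
            lastIndex = m.end();
        }
        return lastIndex == line.length() ? out : null;
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>(Arrays.asList("user", "status"));
        System.out.println(parse("app=github user=alice repo=\"x y\" status=404", wanted));
    }
}
```

Every pair is still scanned, because \G requires each match to start where the previous one ended; only the storage is filtered.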
Modify the regex
The problem is due to the outer repetition (?:[^"]|"[^"]*")* being implemented with recursion, leading to StackOverflowError when the input string is long enough.
Specifically, in each repetition, it matches either a quoted token, or a single non-quoted character. As a result, the stack grows linearly with the number of non-quoted characters and blows up.
You can replace all instances of (?:[^"]|"[^"]*")* with [^"]*(?:"[^"]*"[^"]*)*. The stack will now grow linearly with the number of quoted tokens, so StackOverflowError will not occur unless you have thousands of quoted tokens in the input string.
Pattern KEY_CAPTURE = Pattern.compile("app=github(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+user=(?<user>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+repo=(?<repo>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+remote_address=(?<ip>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+now=\"(?<time>\\S+)\\+\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+url=\"(?<url>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+referer=\"(?<referer>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+status=(?<status>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+elapsed=(?<elapsed>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_method=(?<requestmethod>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+created_at=\"(?<createdat>\\S+)(?:-|\\+)\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+pull_request_id=(?<pullrequestid>\\d+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+at=(?<at>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+fn=(?<fn>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+method=(?<method>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+current_user=(?<user2>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+content_length=(?<contentlength>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_category=(?<requestcategory>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+controller=(?<controller>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+action=(?<action>\\S+))?");
This follows the equivalent expansion of the regex (A|B)* → A*(BA*)*. Which subpattern to use as A and which as B depends on how often each repeats: whichever repeats more should be A, and the other should be B.
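A small sanity check for the rewrite, as a sketch: the class and method names are hypothetical. It compares the original and the unrolled subpattern on a few short strings and then runs only the unrolled form on a long unquoted input. The original form is deliberately not run on the long input, since that is the case that may overflow the stack.

```java
import java.util.regex.Pattern;

class UnrolledLoopCheck {
    // (A|B)* with A = [^"] and B = "[^"]*"
    static final Pattern ORIGINAL = Pattern.compile("(?:[^\"]|\"[^\"]*\")*");
    // A*(BA*)*, the unrolled equivalent
    static final Pattern UNROLLED = Pattern.compile("[^\"]*(?:\"[^\"]*\"[^\"]*)*");

    // True when both forms give the same answer for a full match of s.
    static boolean agree(String s) {
        return ORIGINAL.matcher(s).matches() == UNROLLED.matcher(s).matches();
    }

    // Matches a string of `length` unquoted characters with the unrolled form only.
    static boolean unrolledMatchesLongInput(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append('x');
        }
        return UNROLLED.matcher(sb).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"", "abc", "a \"b c\" d", "\"x\"\"y\"", "a\"b"};
        for (String s : samples) {
            System.out.println("[" + s + "] agree: " + agree(s));
        }
        System.out.println("long input: " + unrolledMatchesLongInput(100000));
    }
}
```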
Deep diving into the implementation
StackOverflowError in Pattern is a known problem. It can happen when your pattern contains a repetition of a non-deterministic [1] capturing/non-capturing group, which is the subpattern (?:[^"]|"[^"]*")* in your case.
[1] This is the terminology used in the source code of Pattern, which is probably intended to be an indicator that the pattern has fixed length. However, the implementation considers alternation | to be non-deterministic, regardless of the actual pattern.
Greedy or lazy repetition of a non-deterministic capturing/non-capturing group is compiled into the Loop/LazyLoop classes, which implement repetition by recursion. As a result, such a pattern is extremely prone to triggering StackOverflowError, especially when the group contains a branch where only a single character is matched at a time.
On the other hand, deterministic [2] repetition, possessive repetition, and repetition of an independent group (?>...) (a.k.a. atomic group or non-backtracking group) are compiled into the Curly/GroupCurly classes, which process the repetition with a loop in most cases, so there will be no StackOverflowError.
[2] The repeated pattern is a character class, or a fixed-length capturing/non-capturing group without any alternation.
You can see how a fragment of your original regex is compiled below. Take note of the problematic part, which starts with Loop, and compare it to your stack trace.
app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?
BnM. Boyer-Moore (BMP only version) (length=10)
app=github
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=0
Prolog. Loop wrapper
Loop [1889ca51]. Greedy quantifier {0,2147483647}
GroupHead. local=1
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [7e41986c]. Connect branches to sequel.
GroupTail [47e1b36]. local=1, group=0. --[next]--> Loop [1889ca51]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
user=
GroupHead. local=3
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [732c7887]. local=3, group=2. --[next]--> GroupTail [6c9d2223]
GroupTail [6c9d2223]. local=0, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=4
Prolog. Loop wrapper
Loop [402c5f8a]. Greedy quantifier {0,2147483647}
GroupHead. local=5
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [21347df0]. Connect branches to sequel.
GroupTail [7d382897]. local=5, group=0. --[next]--> Loop [402c5f8a]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
repo=
GroupHead. local=7
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [71f111ba]. local=7, group=4. --[next]--> GroupTail [9c304c7]
GroupTail [9c304c7]. local=4, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
LastNode.
Node. Accept match
Final answer:
Move this (?:[^"]|"[^"]*")* functionality into an alternation group with
the others.
Sample: https://ideone.com/YuVcMg
It can't be broken!
A side note - I noticed you said a newline went missing, so that the end of one
record runs into the start of the next without a separator,
like this: request_category=apiapp=github
That is OK, but these regexes will mostly blow right past that boundary when they hit the
\S+.
For that reason, it is better to replace \S+ with (?:(?!app=github)\S)+,
which is not done in the regex in the Regex section further down.
Here is that regex with the replacement added:
"(?s)app=github(?>\\s+user=(?<user>(?:(?!app=github)\\S)+)|\\s+repo=(?<repo>(?:(?!app=github)\\S)+)|\\s+remote_address=(?<ip>(?:(?!app=github)\\S)+)|\\s+now=\\\\?\"(?<time>(?:(?!app=github)\\S)+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>(?:(?!app=github)\\S)+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>(?:(?!app=github)\\S)+)\\\\?\"|\\s+status=(?<status>(?:(?!app=github)\\S)+)|\\s+elapsed=(?<elapsed>(?:(?!app=github)\\S)+)|\\s+request_method=(?<requestmethod>(?:(?!app=github)\\S)+)|\\s+created_at=\\\\?\"(?<createdat>(?:(?!app=github)\\S)+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>(?:(?!app=github)\\S)+)|\\s+fn=(?<fn>(?:(?!app=github)\\S)+)|\\s+method=(?<method>(?:(?!app=github)\\S)+)|\\s+current_user=(?<user2>(?:(?!app=github)\\S)+)|\\s+content_length=(?<contentlength>(?:(?!app=github)\\S)+)|\\s+request_category=(?<requestcategory>(?:(?!app=github)\\S)+)|\\s+controller=(?<controller>(?:(?!app=github)\\S)+)|\\s+action=(?<action>(?:(?!app=github)\\S)+)|\"[^\"]*\"|(?!app=github).)+"
And a link to that sample using it: https://ideone.com/hdwufO
Regex
Raw:
(?s)app=github(?>\s+user=(?<user>\S+)|\s+repo=(?<repo>\S+)|\s+remote_address=(?<ip>\S+)|\s+now=\\?"(?<time>\S+)\+\d\d:\d\d\\?"|\s+url=\\?"(?<url>\S+)\\?"|\s+referer=\\?"(?<referer>\S+)\\?"|\s+status=(?<status>\S+)|\s+elapsed=(?<elapsed>\S+)|\s+request_method=(?<requestmethod>\S+)|\s+created_at=\\?"(?<createdat>\S+)[-+]\d\d:\d\d\\?"|\s+pull_request_id=(?<pullrequestid>\d+)|\s+at=(?<at>\S+)|\s+fn=(?<fn>\S+)|\s+method=(?<method>\S+)|\s+current_user=(?<user2>\S+)|\s+content_length=(?<contentlength>\S+)|\s+request_category=(?<requestcategory>\S+)|\s+controller=(?<controller>\S+)|\s+action=(?<action>\S+)|"[^"]*"|(?!app=github).)+
Stringed:
"(?s)app=github(?>\\s+user=(?<user>\\S+)|\\s+repo=(?<repo>\\S+)|\\s+remote_address=(?<ip>\\S+)|\\s+now=\\\\?\"(?<time>\\S+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>\\S+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>\\S+)\\\\?\"|\\s+status=(?<status>\\S+)|\\s+elapsed=(?<elapsed>\\S+)|\\s+request_method=(?<requestmethod>\\S+)|\\s+created_at=\\\\?\"(?<createdat>\\S+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>\\S+)|\\s+fn=(?<fn>\\S+)|\\s+method=(?<method>\\S+)|\\s+current_user=(?<user2>\\S+)|\\s+content_length=(?<contentlength>\\S+)|\\s+request_category=(?<requestcategory>\\S+)|\\s+controller=(?<controller>\\S+)|\\s+action=(?<action>\\S+)|\"[^\"]*\"|(?!app=github).)+"
Formatted:
(?s)
app = github
(?>
\s+
user =
(?<user> \S+ ) # (1)
|
\s+ repo =
(?<repo> \S+ ) # (2)
|
\s+ remote_address =
(?<ip> \S+ ) # (3)
|
\s+ now= \\? "
(?<time> \S+ ) # (4)
\+ \d\d : \d\d \\? "
|
\s+ url = \\? "
(?<url> \S+ ) # (5)
\\? "
|
\s+ referer = \\? "
(?<referer> \S+ ) # (6)
\\? "
|
\s+ status =
(?<status> \S+ ) # (7)
|
\s+ elapsed =
(?<elapsed> \S+ ) # (8)
|
\s+ request_method =
(?<requestmethod> \S+ ) # (9)
|
\s+ created_at = \\? "
(?<createdat> \S+ ) # (10)
[-+]
\d\d : \d\d \\? "
|
\s+ pull_request_id =
(?<pullrequestid> \d+ ) # (11)
|
\s+ at=
(?<at> \S+ ) # (12)
|
\s+ fn=
(?<fn> \S+ ) # (13)
|
\s+ method =
(?<method> \S+ ) # (14)
|
\s+ current_user =
(?<user2> \S+ ) # (15)
|
\s+ content_length =
(?<contentlength> \S+ ) # (16)
|
\s+ request_category =
(?<requestcategory> \S+ ) # (17)
|
\s+ controller =
(?<controller> \S+ ) # (18)
|
\s+ action =
(?<action> \S+ ) # (19)
|
" [^"]* " # None of the above, give quotes a chance
|
(?! app = github ) # Failsafe, consume a character, advance by 1
.
)+
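To see the alternation-group idea at a smaller scale, here is a sketch cut down to two keys. The class name RecordFieldScan and the sample records are made up; the structure (named alternatives, a quoted-string alternative, and a one-character failsafe that refuses to step onto the next app=github) follows the formatted regex above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordFieldScan {
    private static final Pattern RECORD = Pattern.compile(
            "(?s)app=github(?>"
            + "\\s+user=(?<user>\\S+)"     // key we care about
            + "|\\s+repo=(?<repo>\\S+)"    // another key we care about
            + "|\"[^\"]*\""                // skip a quoted value as one unit
            + "|(?!app=github)."           // failsafe: advance one character
            + ")+");

    // Returns one "user=... repo=..." summary per record found in the text.
    static List<String> scan(String text) {
        List<String> rows = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            rows.add("user=" + m.group("user") + " repo=" + m.group("repo"));
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "app=github env=production repo=r1 current_user=x user=alice\n"
                + "app=github user=carol note=\"user=zed\" repo=r2";
        for (String row : scan(text)) {
            System.out.println(row);
        }
    }
}
```

Because each key is just one alternative inside a single loop, the order of the keys in the input does not matter, and a key=value that appears inside a quoted string is swallowed by the quote alternative instead of being captured.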
Hi,
I want to extract the sub-sentences of this sentence with a regular expression:
it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. learndfefe layout. learn sdffsfsfs. sddsd learn fefe.
I couldn't write a correct regular expression for Pattern.compile.
This is my expression: ([^(\\.\\s)]*)([^.]*\\.)
Actually, I need a way to write "read everything except \\.\\s".
Sub-sentences:
it learn od fg network layout.
kdsjhuu ddkm networ.12kfdf.
learndfefe layout.
learn sdffsfsfs.
sddsd learn fefe.
Just split your string with regex "\\. "
String[] arr= str.split("\\. ");
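Note that splitting on "\\. " consumes the period of every piece except the last one. If the periods should stay, as in the expected output, one option is to split on the whitespace that follows a period by using a lookbehind. This is a sketch, and the class name SentenceSplit is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SentenceSplit {
    // Split on whitespace that is preceded by a period; the period itself
    // is only looked at, not consumed, so it stays with its sentence.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=\\.)\\s+"));
    }

    public static void main(String[] args) {
        String text = "it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. "
                + "learndfefe layout. learn sdffsfsfs. sddsd learn fefe.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

A period that is not followed by whitespace, as in networ.12kfdf, does not trigger a split.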
You can use this pattern with the find method:
Pattern p = Pattern.compile("[^\\s.][^.]*(?:\\.(?!\\s|\\z)[^.]*)*\\.?");
Matcher m = p.matcher(yourText);
while(m.find()) {
System.out.println(m.group(0));
}
Pattern details:
[^\\s.] # all that is not a whitespace (to trim) or a dot
[^.]* # all that is not a dot (zero or more times)
(?: # open a non-capturing group
\\. (?!\\s|\\z) # dot not followed by a whitespace or the end of the string
[^.]* #
)* # close and repeat the group as needed
\\.? # an optional dot (allows matching a sentence at the end
# of the string even if there is no dot)
Hello all, I'm trying to parse a pretty well-formed string into its component pieces. The string is very JSON-like, but strictly speaking it's not JSON. The strings are formed like so:
createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
The output should just be chunks of text; nothing special has to be done at this point.
createdAt=Fri Aug 24 09:48:51 EDT 2012
id=238996293417062401
text='Test Test'
source="Region"
entities=[foo, bar]
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
Using the following expression I am able to get most of the fields separated out
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))
This splits on all the commas not in quotes of any type, but I can't seem to make the leap to also skipping commas inside brackets or braces.
Because you want to handle nested parens/brackets, the "right" way to handle them is to tokenize them separately, and keep track of your nesting level. So instead of a single regex, you really need multiple regexes for your different token types.
This is Python, but converting to Java shouldn't be too hard.
import re

# just comma
sep_re = re.compile(r',')
# open paren, bracket, or brace
inc_re = re.compile(r'[\[({]')
# close paren, bracket, or brace
dec_re = re.compile(r'[)\]}]')
# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')

# This class could've been just a generator function, but I couldn't
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
    def __init__(self):
        self.pos = 0

    def _match(self, regex, s):
        m = regex.match(s, self.pos)
        if m:
            self.pos += len(m.group(0))
            self.token = m.group(0)
        else:
            self.token = ''
        return self.token

    def tokenize(self, s):
        field = ''  # the field we're working on
        depth = 0   # how many parens/brackets/braces deep we are
        while self.pos < len(s):
            if not depth and self._match(sep_re, s):
                # In Java, change the "yields" to append to a List, and you'll
                # have something roughly equivalent (but non-lazy).
                yield field
                field = ''
            else:
                if self._match(inc_re, s):
                    depth += 1
                elif self._match(dec_re, s):
                    depth -= 1
                elif self._match(chunk_re, s):
                    pass
                else:
                    # everything else we just consume one character at a time
                    self.token = s[self.pos]
                    self.pos += 1
                field += self.token
        yield field
Usage:
>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']
This implementation takes a few shortcuts:
The string escapes are really lazy: it only supports \" in double quoted strings and \' in single-quoted strings. This is easy to fix.
It only keeps track of nesting level. It does not verify that parens are matched up with parens (rather than brackets). If you care about that you can change depth into some sort of stack and push/pop parens/brackets onto it.
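For the Java side, the same depth-tracking idea could be sketched without any regex at all. This is not a translation of the tokenizer above, just a character-by-character scan under the same simplifying assumptions (only the nesting level is tracked, and quotes only matter outside of which a comma may split). The class name TopLevelSplitter is made up, and braces are counted along with parens and brackets because the sample string in the question uses them.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelSplitter {
    // Splits on commas that are outside quotes and outside (), [] and {}.
    static List<String> split(String s) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int depth = 0;   // current nesting level
        char quote = 0;  // the open quote character, or 0 when outside a string
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (quote != 0) {
                if (c == '\\' && i + 1 < s.length()) {
                    field.append(c);      // keep the backslash
                    c = s.charAt(++i);    // and take the escaped character as-is
                } else if (c == quote) {
                    quote = 0;            // string literal ends
                }
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '(' || c == '[' || c == '{') {
                depth++;
            } else if (c == ')' || c == ']' || c == '}') {
                depth--;
            } else if (c == ',' && depth == 0) {
                fields.add(field.toString().trim());
                field.setLength(0);
                continue;                 // the separator is not part of any field
            }
            field.append(c);
        }
        fields.add(field.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        String s = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, "
                + "text='Test Test', source=\"Region\", entities=[foo, bar], "
                + "user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}";
        for (String f : split(s)) {
            System.out.println(f);
        }
    }
}
```

One limitation to be aware of: an apostrophe in unquoted text (for example it's) would be taken as the start of a string literal.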
Instead of splitting on the comma, you can use the following regular expression to match the chunks that you want.
(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)
Python:
import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)
>> [
('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'),
('id', '238996293417062401'),
('text', "'Test Test'"),
('source', '"Region"'),
('entities', '[foo, bar]'),
('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
]
I've set up grouping so it will separate out the "key" and the "value". It will do the same in Java - See it working in Java here:
http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRj0jzQM/index.html
Regular Expression explained:
(?:^| ) Non-capturing group that matches the beginning of a line, or a space
(.+?) Matches the "key" before the...
= equal sign
(\{.+?\}|\[.+?\]|.+?) Matches either a set of {characters}, [characters], or finally just characters
(?=,|$) Look ahead that matches either a , or the end of a line.
I'm having trouble with a regex, because I can only match some of my targets.
I have a log file, and I must match some of its items and write them to another txt file. I wrote Java code that works on a short example, but when I feed it the whole file, everything gets messed up.
*052511 074217 0065 02242806000 UNKNOWN U G
*052511 074217 0065 4874 02242806000 UNKNOWN U A
*052511 074218 0065 4874 02242806000 UNKNOWN U R
-------- 05/25/11 07:42:17 LINE = 0065 STN = 4874
CALLING NUMBER 02242806000
NAME UNKNOWN
UNKNOWN
BC = SPEECH
00:00:00 INCOMING CALL RINGING 0:02
00:00:11 CALL RELEASED
I have to find these results from the file:
incomming,05/25/11,07:42:17,0065,4874,02242806000,00:00:09,2
In this expression, 00:00:09 means [00:00:11-00:00:00]-0:02.
For every incoming and outgoing call, I must make the conversion above.
You could use a regex like:
(?xm:
^-------- \s+ (\S+) \s+ (\S+) \s+ LINE\s*=\s*(\d+) \s+ STN\s*=\s*(\d+)
\s+ CALLING\ NUMBER \s+ (\d+) \s*
(?:^(?:[ \t]+.*)?[\n\r]+)* # eat unwanted part
^(\d\d:\d\d:\d\d) \s+ INCOMING\ CALL \s+ RINGING\ ([\d:]+) \s*
(?:^\d.*[\r\n]+)* # possible stuff
^(\d\d:\d\d:\d\d) \s+ CALL\ RELEASED
)
Use the values of the capturing groups to build your results. If your regex engine has no /x (extended) mode, remove the comments and the insignificant whitespace from the pattern first.
Perl example at http://ideone.com/qTBFe
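Since the question is about Java, here is a line-by-line sketch of the same extraction, including the duration arithmetic (released time minus start time minus ring time). It is a different technique from the single multi-line regex above: it uses one small pattern per line type. The class name CallLogSketch is made up, only the incoming-call layout shown in the question is handled, and the label is written as incoming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CallLogSketch {
    private static final Pattern HEADER = Pattern.compile(
            "^-{8}\\s+(\\S+)\\s+(\\S+)\\s+LINE\\s*=\\s*(\\d+)\\s+STN\\s*=\\s*(\\d+)");
    private static final Pattern CALLER = Pattern.compile("^\\s*CALLING NUMBER\\s+(\\d+)");
    private static final Pattern RINGING = Pattern.compile(
            "^\\s*(\\d\\d:\\d\\d:\\d\\d)\\s+INCOMING CALL\\s+RINGING\\s+(\\d+):(\\d\\d)");
    private static final Pattern RELEASED = Pattern.compile(
            "^\\s*(\\d\\d:\\d\\d:\\d\\d)\\s+CALL RELEASED");

    // Converts hh:mm:ss to a number of seconds.
    static int toSeconds(String hms) {
        String[] p = hms.split(":");
        return Integer.parseInt(p[0]) * 3600 + Integer.parseInt(p[1]) * 60
                + Integer.parseInt(p[2]);
    }

    // Returns one CSV row per completed incoming call found in the log text.
    static List<String> summarize(String log) {
        List<String> rows = new ArrayList<>();
        String header = null;  // "date,time,line,stn" of the current block
        String caller = null;
        int start = -1;        // seconds at which the call came in
        int ring = 0;          // ring time in seconds
        for (String line : log.split("\\R")) {
            Matcher m;
            if ((m = HEADER.matcher(line)).find()) {
                header = m.group(1) + "," + m.group(2) + "," + m.group(3) + "," + m.group(4);
                caller = null;
                start = -1;
                ring = 0;
            } else if ((m = CALLER.matcher(line)).find()) {
                caller = m.group(1);
            } else if ((m = RINGING.matcher(line)).find()) {
                start = toSeconds(m.group(1));
                ring = Integer.parseInt(m.group(2)) * 60 + Integer.parseInt(m.group(3));
            } else if (header != null && start >= 0
                    && (m = RELEASED.matcher(line)).find()) {
                // talk time = (released - start) - ring time
                int talk = toSeconds(m.group(1)) - start - ring;
                rows.add(String.format("incoming,%s,%s,%02d:%02d:%02d,%d",
                        header, caller, talk / 3600, talk % 3600 / 60, talk % 60, ring));
                header = null;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String log = "-------- 05/25/11 07:42:17 LINE = 0065 STN = 4874\n"
                + "CALLING NUMBER 02242806000\n"
                + "00:00:00 INCOMING CALL RINGING 0:02\n"
                + "00:00:11 CALL RELEASED\n";
        System.out.println(summarize(log));
    }
}
```

Working line by line keeps each pattern short and does not depend on how the lines between the header and the call events are indented.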