To match the following text:
text : SS~B66\88~PRELIMINARY PAGES\M01~HEADING PAGES
It has this format:<code1>~<description1>\<code2>~<description2>\<code3>~<description3>....<codeN>~<descriptionN>
I used this regex: [A-Z0-9 ]+~[A-Z0-9 ]+(?:\\[A-Z0-9 ]+~[A-Z0-9 ]+)+
So:
case 1. SS~B66\88~PRELIMINARY PAGES\M01~HEADING PAGES (Match: OK)
case 2. SS~B66\88~PRELIMINARY PAGES~HEADING PAGES (No Match: OK because I removed the code 'M01')
case 3. SS~B66~PRELIMINARY PAGES\M01~HEADING PAGES (No Match: OK because I removed the code '88')
More examples:
SS~B66\88~MEKLKE\M01~MOIIE
B~A310\0~PRELIM#INARY\00-00~HEADING
My problem is that <code> and <description> can contain any type of character, so I replaced my regex with:
.+~.+(?:\\.+~.+)+
But this new regex also matches case 2 and case 3.
Thank you for your help.
Instead of using [A-Z0-9 ] which would not match all the allowed chars, or .+ which would match too much, you can use a negated character class [^~\\] matching any char except \ and ~ to set the boundaries for the matched parts.
^[^~]+~[^~\\]+(?:\\[^~]+~[^~\\]+)+$
^ Start of string
[^~]+~ Match 1+ chars other than ~, then match ~
[^~\\]+ Match 1+ chars other than ~ and \
(?: Non-capturing group
\\[^~]+~[^~\\]+ Match \, then 1+ chars other than ~, then ~, then 1+ chars other than ~ and \
)+ Close the group and repeat it 1 or more times, so at least one \ is required
$ End of string
Regex demo (the demo adds \n to the negated character classes so that matches do not cross the newlines in the example data)
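As a quick check, the negated-class pattern can be run against the three cases and the two extra examples from the question. This is only a minimal sketch: the class name `CodeDescriptionCheck` and the method `isValid` are made up for illustration. Note that every backslash in the regex is doubled again inside a Java string literal.

```java
import java.util.regex.Pattern;

class CodeDescriptionCheck {
    // Regex: ^[^~]+~[^~\\]+(?:\\[^~]+~[^~\\]+)+$  (escaped for a Java string literal)
    static final Pattern CHAIN =
        Pattern.compile("^[^~]+~[^~\\\\]+(?:\\\\[^~]+~[^~\\\\]+)+$");

    // True when the whole input is a chain of code~description pairs separated by \
    static boolean isValid(String input) {
        return CHAIN.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "SS~B66\\88~PRELIMINARY PAGES\\M01~HEADING PAGES", // case 1
            "SS~B66\\88~PRELIMINARY PAGES~HEADING PAGES",      // case 2: code 'M01' removed
            "SS~B66~PRELIMINARY PAGES\\M01~HEADING PAGES",     // case 3: code '88' removed
            "B~A310\\0~PRELIM#INARY\\00-00~HEADING"
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "  " + s);
        }
    }
}
```

Because `matches()` already requires the whole input to match, the `^` and `$` anchors are redundant here; they are kept so the pattern behaves the same with `find()`.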
I have a string in following pattern
( var1=:key1:'any_value including space and 'quotes'' AND/OR var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )
I want to get following result from this.
:key1:'any_value including space and 'quotes''
:key2:'any_value...'
:key3:'any_value...'
Could anyone please suggest a pattern/regex for this?
Failed attempts:
I could first split the string on AND/OR and then split the resulting strings on : and so on, but I am looking for a single regex/pattern that can do this.
You can use this regex with negated pattern to match your data:
":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))"
Breakdown:
: # match a literal :
[^:]+ # match 1 or more characters that are not :
: # match a literal :
' # match a literal '
.*? # match 0 or more of any characters (non-greedy)
' # match a literal '
(?=\s*(?:AND(?:/OR)?|\))) # lookahead to assert that AND, AND/OR, or a closing ) follows
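A minimal sketch of how this pattern might be used with `Matcher.find()` to collect the three values. The class name `KeyValueExtractor` and the method `extract` are hypothetical, and the input is the sample string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueExtractor {
    // :key:'value' where the closing quote must be followed by AND, AND/OR, or )
    static final Pattern ITEM =
        Pattern.compile(":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))");

    // Collects every :key:'value' occurrence in order of appearance
    static List<String> extract(String input) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            items.add(m.group());
        }
        return items;
    }

    public static void main(String[] args) {
        String input = "( var1=:key1:'any_value including space and 'quotes'' AND/OR "
                     + "var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )";
        for (String item : extract(input)) {
            System.out.println(item);
        }
    }
}
```

The lazy `.*?` keeps extending past inner quotes until it reaches a quote that the lookahead accepts, which is why the nested `'quotes''` stays inside the first result. As written, the lookahead does not accept a bare `OR`.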
I think this would work for your circumstances.
Unless you know how the quotes should be parsed, there is not much else you can do.
Raw: (?<==)(?:(?!\s*AND/OR).)+
Quoted: "(?<==)(?:(?!\\s*AND/OR).)+"
Expanded:
(?<= = ) # A '=' behind
(?:
(?! \s* AND/OR ) # Not 'AND/OR' in front
.
)+
I have some file parser code where I sporadically get stack overflow errors on m.matches() (where m is a Matcher).
When I run my app again, it parses the same file with no stack overflow.
It's true that my Pattern is a bit complex. It's basically a bunch of optional zero-length positive lookaheads with named groups inside them, so that I can match a bunch of variable name/value pairs regardless of their order. But I would expect that if a string causes a stack overflow error, it would always cause it, not just sometimes. Any ideas?
A much simplified version of my Pattern
"prefix(?=\\s+user=(?<user>\\S+))?(?=\\s+repo=(?<repo>\\S+))?.*?"
full regex is...
app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?(?=(?:[^"]|"[^"]*")*\s+remote_address=(?<ip>\S+))?(?=(?:[^"]|"[^"]*")*\s+now="(?<time>\S+)\+\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+url="(?<url>\S+)")?(?=(?:[^"]|"[^"]*")*\s+referer="(?<referer>\S+)")?(?=(?:[^"]|"[^"]*")*\s+status=(?<status>\S+))?(?=(?:[^"]|"[^"]*")*\s+elapsed=(?<elapsed>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_method=(?<requestmethod>\S+))?(?=(?:[^"]|"[^"]*")*\s+created_at="(?<createdat>\S+)(?:-|\+)\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+pull_request_id=(?<pullrequestid>\d+))?(?=(?:[^"]|"[^"]*")*\s+at=(?<at>\S+))?(?=(?:[^"]|"[^"]*")*\s+fn=(?<fn>\S+))?(?=(?:[^"]|"[^"]*")*\s+method=(?<method>\S+))?(?=(?:[^"]|"[^"]*")*\s+current_user=(?<user2>\S+))?(?=(?:[^"]|"[^"]*")*\s+content_length=(?<contentlength>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_category=(?<requestcategory>\S+))?(?=(?:[^"]|"[^"]*")*\s+controller=(?<controller>\S+))?(?=(?:[^"]|"[^"]*")*\s+action=(?<action>\S+))?.*?
Top of stack overflow error stack... (it's about 9800 lines long)
Exception: java.lang.StackOverflowError
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
Example of a line I got the error on (though I have run it 10 times since and not gotten any error):
app=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=675fa67e-c1de-4bfa-a965-127b928d427a server_id=c31404fc-b7d0-41a1-8017-fc1a6dce8111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.041 url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" worker_request_count=77192 request_category=apiapp=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=89fcb32e-9ab5-47f7-9464-e5f5cff175e8 server_id=1b74880a-5124-4483-adce-111b60dac111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.024 
url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" worker_request_count=76263 request_category=api
Interestingly, this line seems to be malformed: the log seems to have put a line break in the wrong place, resulting in two log entries on a single line followed by a blank line. It's this long line that caused the error (once, anyway); now it runs just fine without a stack overflow.
There are 2 ways to fix your problem:
1. Parse the input string properly and get the key values from the Map. I strongly recommend using this method, since the code will be much cleaner, and we no longer have to watch the limit on the input size.
2. Modify the existing regex to greatly reduce the impact of the implementation flaw which causes StackOverflowError.
Parse the input string
You can parse the input string with the following regex:
\G\s*+(\w++)=([^\s"]++|"[^"]*+")(?:\s++|$)
All quantifiers are made possessive (*+ instead of *, ++ instead of +), since the pattern I wrote doesn't need backtracking.
You can find the basic regex (\w++)=([^\s"]++|"[^"]*+") to match key-value pairs in the middle.
\G makes sure that each match starts where the last match left off. It prevents the engine from "bumping along" to a later position when it fails to match.
\s*+ and (?:\s++|$) are for consuming excess spaces. I specify (?:\s++|$) instead of \s*+ to prevent key="value"key=value from being recognized as valid input.
The full example code can be found below:
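import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueParser {
    // \G anchors each match at the end of the previous one; all quantifiers are possessive
    private static final Pattern KEY_VALUE = Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

    // Returns the key/value pairs, or null if some part of the input could not be parsed
    public static Map<String, String> parseKeyValue(String kvString) {
        Matcher matcher = KEY_VALUE.matcher(kvString);
        Map<String, String> output = new HashMap<String, String>();
        int lastIndex = -1;
        while (matcher.find()) {
            output.put(matcher.group(1), matcher.group(2));
            lastIndex = matcher.end();
        }
        // Make sure that we match everything from the input string
        if (lastIndex != kvString.length()) {
            return null;
        }
        return output;
    }
}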
You might want to unquote the values, depending on your requirement.
You can also rewrite the function to pass a List of keys you want to extract, and pick them out in the while loop as you go to avoid storing keys that you don't care about.
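The two suggestions above (unquoting values and keeping only the keys you care about) could look roughly like the sketch below. The class name `KeyValueFilter`, the methods `parseSelected` and `unquote`, and the use of a `Set` for the wanted keys are illustrative choices, not part of the original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueFilter {
    private static final Pattern KEY_VALUE =
        Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

    // Parses the whole line but stores only the wanted keys; returns null on malformed input
    static Map<String, String> parseSelected(String kvString, Set<String> wanted) {
        Matcher matcher = KEY_VALUE.matcher(kvString);
        Map<String, String> output = new LinkedHashMap<>();
        int lastIndex = -1;
        while (matcher.find()) {
            String key = matcher.group(1);
            if (wanted.contains(key)) {
                output.put(key, unquote(matcher.group(2)));
            }
            lastIndex = matcher.end();
        }
        return lastIndex == kvString.length() ? output : null;
    }

    // Strips one pair of surrounding double quotes, if present
    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }
}
```

Every pair is still matched so that the "whole input consumed" check keeps working; only the storing step is filtered.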
Modify the regex
The problem is due to the outer repetition (?:[^"]|"[^"]*")* being implemented with recursion, leading to StackOverflowError when the input string is long enough.
Specifically, in each repetition, it matches either a quoted token, or a single non-quoted character. As a result, the stack grows linearly with the number of non-quoted characters and blows up.
You can replace every instance of (?:[^"]|"[^"]*")* with [^"]*(?:"[^"]*"[^"]*)*. The stack will now grow linearly with the number of quoted tokens, so StackOverflowError will not occur unless you have thousands of quoted tokens in the input string.
Pattern KEY_CAPTURE = Pattern.compile("app=github(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+user=(?<user>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+repo=(?<repo>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+remote_address=(?<ip>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+now=\"(?<time>\\S+)\\+\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+url=\"(?<url>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+referer=\"(?<referer>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+status=(?<status>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+elapsed=(?<elapsed>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_method=(?<requestmethod>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+created_at=\"(?<createdat>\\S+)(?:-|\\+)\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+pull_request_id=(?<pullrequestid>\\d+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+at=(?<at>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+fn=(?<fn>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+method=(?<method>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+current_user=(?<user2>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+content_length=(?<contentlength>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_category=(?<requestcategory>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+controller=(?<controller>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+action=(?<action>\\S+))?");
This follows from the equivalent expansion of the regex (A|B)* → A*(BA*)*. Which subpattern to use as A or B depends on how often each one repeats: whichever repeats more should be A and the other should be B.
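A small sketch that compares the two forms, with A = [^"] and B = "[^"]*". It checks that both give the same verdict on a few short inputs and then runs only the unrolled form on a long line. The class and method names are hypothetical, and the original form is deliberately not run on the long input, since whether it overflows depends on the JVM's stack size.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class UnrolledAlternation {
    // (A|B)* with A = [^"] and B = "[^"]*"
    static final Pattern ORIGINAL = Pattern.compile("(?:[^\"]|\"[^\"]*\")*");
    // A*(BA*)*
    static final Pattern UNROLLED = Pattern.compile("[^\"]*(?:\"[^\"]*\"[^\"]*)*");

    // True when both patterns accept or both reject the whole input
    static boolean agree(String s) {
        return ORIGINAL.matcher(s).matches() == UNROLLED.matcher(s).matches();
    }

    // Builds a string of n 'x' characters
    static String filler(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a b c", "a \"b c\" d", "a \"b", "\"\"\""}) {
            System.out.println(agree(s) + "  [" + s + "]");
        }
        // Long unquoted runs are consumed by [^"]*, so depth depends on quoted tokens only
        String longLine = filler(200_000) + " \"quoted token\" " + filler(200_000);
        System.out.println(UNROLLED.matcher(longLine).matches());
    }
}
```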
Deep diving into the implementation
StackOverflowError in Pattern is a known problem. It can happen when your pattern contains a repetition of a non-deterministic [1] capturing/non-capturing group, which is the subpattern (?:[^"]|"[^"]*")* in your case.
[1] "Deterministic" is a term used in the source code of Pattern, which is probably intended to indicate that the pattern has a fixed length. However, the implementation considers alternation | to be non-deterministic, regardless of the actual pattern.
Greedy or lazy repetition of a non-deterministic capturing/non-capturing group is compiled into Loop/LazyLoop classes, which implement repetition by recursion. As a result, such a pattern is extremely prone to trigger StackOverflowError, especially when the group contains a branch where only a single character is matched at a time.
On the other hand, deterministic [2] repetition, possessive repetition, and repetition of an independent group (?>...) (a.k.a. atomic group or non-backtracking group) are compiled into Curly/GroupCurly classes, which process the repetition with a loop in most cases, so there will be no StackOverflowError.
[2] The repeated pattern is a character class, or a fixed-length capturing/non-capturing group without any alternation.
You can see how a fragment of your original regex is compiled below. Take note of the problematic part, which starts with Loop, and compare it to your stack trace.
app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?
BnM. Boyer-Moore (BMP only version) (length=10)
app=github
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=0
Prolog. Loop wrapper
Loop [1889ca51]. Greedy quantifier {0,2147483647}
GroupHead. local=1
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [7e41986c]. Connect branches to sequel.
GroupTail [47e1b36]. local=1, group=0. --[next]--> Loop [1889ca51]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
user=
GroupHead. local=3
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [732c7887]. local=3, group=2. --[next]--> GroupTail [6c9d2223]
GroupTail [6c9d2223]. local=0, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=4
Prolog. Loop wrapper
Loop [402c5f8a]. Greedy quantifier {0,2147483647}
GroupHead. local=5
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [21347df0]. Connect branches to sequel.
GroupTail [7d382897]. local=5, group=0. --[next]--> Loop [402c5f8a]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
repo=
GroupHead. local=7
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [71f111ba]. local=7, group=4. --[next]--> GroupTail [9c304c7]
GroupTail [9c304c7]. local=4, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
LastNode.
Node. Accept match
Final answer:
Move this (?:[^"]|"[^"]*")* functionality into an alternation group with
the others.
Sample: https://ideone.com/YuVcMg
It can't be broken!
A side note: I noticed you mentioned a misplaced line break that left
the end of one record with no separator before the start of the next,
like this: request_category=apiapp=github
That is OK, but these regexes will mostly run straight past that boundary
when they hit \S+.
For that reason, it is better to replace \S+ with (?:(?!app=github)\S)+,
which is not done in the regex shown in the Regex section below.
Here is a version with that replacement applied:
"(?s)app=github(?>\\s+user=(?<user>(?:(?!app=github)\\S)+)|\\s+repo=(?<repo>(?:(?!app=github)\\S)+)|\\s+remote_address=(?<ip>(?:(?!app=github)\\S)+)|\\s+now=\\\\?\"(?<time>(?:(?!app=github)\\S)+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>(?:(?!app=github)\\S)+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>(?:(?!app=github)\\S)+)\\\\?\"|\\s+status=(?<status>(?:(?!app=github)\\S)+)|\\s+elapsed=(?<elapsed>(?:(?!app=github)\\S)+)|\\s+request_method=(?<requestmethod>(?:(?!app=github)\\S)+)|\\s+created_at=\\\\?\"(?<createdat>(?:(?!app=github)\\S)+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>(?:(?!app=github)\\S)+)|\\s+fn=(?<fn>(?:(?!app=github)\\S)+)|\\s+method=(?<method>(?:(?!app=github)\\S)+)|\\s+current_user=(?<user2>(?:(?!app=github)\\S)+)|\\s+content_length=(?<contentlength>(?:(?!app=github)\\S)+)|\\s+request_category=(?<requestcategory>(?:(?!app=github)\\S)+)|\\s+controller=(?<controller>(?:(?!app=github)\\S)+)|\\s+action=(?<action>(?:(?!app=github)\\S)+)|\"[^\"]*\"|(?!app=github).)+"
And a link to that sample using it: https://ideone.com/hdwufO
Regex
Raw:
(?s)app=github(?>\s+user=(?<user>\S+)|\s+repo=(?<repo>\S+)|\s+remote_address=(?<ip>\S+)|\s+now=\\?"(?<time>\S+)\+\d\d:\d\d\\?"|\s+url=\\?"(?<url>\S+)\\?"|\s+referer=\\?"(?<referer>\S+)\\?"|\s+status=(?<status>\S+)|\s+elapsed=(?<elapsed>\S+)|\s+request_method=(?<requestmethod>\S+)|\s+created_at=\\?"(?<createdat>\S+)[-+]\d\d:\d\d\\?"|\s+pull_request_id=(?<pullrequestid>\d+)|\s+at=(?<at>\S+)|\s+fn=(?<fn>\S+)|\s+method=(?<method>\S+)|\s+current_user=(?<user2>\S+)|\s+content_length=(?<contentlength>\S+)|\s+request_category=(?<requestcategory>\S+)|\s+controller=(?<controller>\S+)|\s+action=(?<action>\S+)|"[^"]*"|(?!app=github).)+
Stringed:
"(?s)app=github(?>\\s+user=(?<user>\\S+)|\\s+repo=(?<repo>\\S+)|\\s+remote_address=(?<ip>\\S+)|\\s+now=\\\\?\"(?<time>\\S+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>\\S+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>\\S+)\\\\?\"|\\s+status=(?<status>\\S+)|\\s+elapsed=(?<elapsed>\\S+)|\\s+request_method=(?<requestmethod>\\S+)|\\s+created_at=\\\\?\"(?<createdat>\\S+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>\\S+)|\\s+fn=(?<fn>\\S+)|\\s+method=(?<method>\\S+)|\\s+current_user=(?<user2>\\S+)|\\s+content_length=(?<contentlength>\\S+)|\\s+request_category=(?<requestcategory>\\S+)|\\s+controller=(?<controller>\\S+)|\\s+action=(?<action>\\S+)|\"[^\"]*\"|(?!app=github).)+"
Formatted:
(?s)
app = github
(?>
\s+
user =
(?<user> \S+ ) # (1)
|
\s+ repo =
(?<repo> \S+ ) # (2)
|
\s+ remote_address =
(?<ip> \S+ ) # (3)
|
\s+ now= \\? "
(?<time> \S+ ) # (4)
\+ \d\d : \d\d \\? "
|
\s+ url = \\? "
(?<url> \S+ ) # (5)
\\? "
|
\s+ referer = \\? "
(?<referer> \S+ ) # (6)
\\? "
|
\s+ status =
(?<status> \S+ ) # (7)
|
\s+ elapsed =
(?<elapsed> \S+ ) # (8)
|
\s+ request_method =
(?<requestmethod> \S+ ) # (9)
|
\s+ created_at = \\? "
(?<createdat> \S+ ) # (10)
[-+]
\d\d : \d\d \\? "
|
\s+ pull_request_id =
(?<pullrequestid> \d+ ) # (11)
|
\s+ at=
(?<at> \S+ ) # (12)
|
\s+ fn=
(?<fn> \S+ ) # (13)
|
\s+ method =
(?<method> \S+ ) # (14)
|
\s+ current_user =
(?<user2> \S+ ) # (15)
|
\s+ content_length =
(?<contentlength> \S+ ) # (16)
|
\s+ request_category =
(?<requestcategory> \S+ ) # (17)
|
\s+ controller =
(?<controller> \S+ ) # (18)
|
\s+ action =
(?<action> \S+ ) # (19)
|
" [^"]* " # None of the above, give quotes a chance
|
(?! app = github ) # Failsafe, consume a character, advance by 1
.
)+
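To see the alternation approach at a smaller scale, here is a cut-down sketch with only three of the named groups (user, repo, status). It assumes the input uses plain double quotes, so the optional \\? parts are left out. The class name `GithubLogFields` and the method `extract` are hypothetical. It relies on Java keeping the last value captured by a group inside a repeated group.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GithubLogFields {
    // Each iteration of the atomic group consumes one known key, one quoted token,
    // or, failing that, a single character that does not start the next record
    static final Pattern FIELDS = Pattern.compile(
          "(?s)app=github(?>"
        + "\\s+user=(?<user>\\S+)"
        + "|\\s+repo=(?<repo>\\S+)"
        + "|\\s+status=(?<status>\\S+)"
        + "|\"[^\"]*\""
        + "|(?!app=github).)+");

    // Returns the fields found in the first record; keys that are absent are omitted
    static Map<String, String> extract(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELDS.matcher(line);
        if (m.find()) {
            for (String name : new String[] {"user", "repo", "status"}) {
                String value = m.group(name);
                if (value != null) {
                    fields.put(name, value);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "app=github env=production controller=\"Api::GitCommits\" "
                    + "current_user=abcdefghijk repo=XYZ-ABCDE/abcdefg-abc status=404 elapsed=0.041";
        System.out.println(extract(line));
    }
}
```

Note that `current_user=` does not trigger the `user` alternative, because that alternative requires whitespace directly before `user=`.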
Hi,
I want to extract the sub-sentences of this sentence with a regular expression:
it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. learndfefe layout. learn sdffsfsfs. sddsd learn fefe.
I couldn't write a correct regular expression for Pattern.compile.
This is my expression: ([^(\\.\\s)]*)([^.]*\\.)
What I actually need is a way to express "read everything except \\.\\s".
Expected sub-sentences:
it learn od fg network layout.
kdsjhuu ddkm networ.12kfdf.
learndfefe layout.
learn sdffsfsfs.
sddsd learn fefe.
Just split your string with regex "\\. "
String[] arr= str.split("\\. ");
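One thing to be aware of: splitting on "\\. " consumes the dot together with the space, so every piece except the last loses its final dot. If the dots should be kept, as in the expected output above, a lookbehind variant is one option. This is a minimal sketch, and the class name `SentenceSplit` is made up.

```java
class SentenceSplit {
    // Splits on whitespace that is preceded by a dot; the dot itself is not consumed
    static String[] split(String text) {
        return text.trim().split("(?<=\\.)\\s+");
    }

    public static void main(String[] args) {
        String str = "it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. "
                   + "learndfefe layout. learn sdffsfsfs. sddsd learn fefe.";
        for (String sentence : split(str)) {
            System.out.println(sentence);
        }
    }
}
```

The dot inside `networ.12kfdf` is not followed by whitespace, so no split happens there.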
You can use this pattern with the find method:
Pattern p = Pattern.compile("[^\\s.][^.]*(?:\\.(?!\\s|\\z)[^.]*)*\\.?");
Matcher m = p.matcher(yourText);
while(m.find()) {
System.out.println(m.group(0));
}
Pattern details:
[^\\s.] # all that is not a whitespace (to trim) or a dot
[^.]* # all that is not a dot (zero or more times)
(?: # open a non-capturing group
\\. (?!\\s|\\z) # dot not followed by a whitespace or the end of the string
[^.]* # all that is not a dot (zero or more times)
)* # close and repeat the group as needed
\\.? # an optional dot (allows matching a sentence at the end
# of the string even if there is no dot)
I'm having trouble with regex, because I can only match some of my targets.
I have a log file and I must match some of the items and write them to another txt file. I wrote Java code that works on a short example, but when I run it on the whole file, everything gets messed up.
*052511 074217 0065 02242806000 UNKNOWN U G
*052511 074217 0065 4874 02242806000 UNKNOWN U A
*052511 074218 0065 4874 02242806000 UNKNOWN U R
-------- 05/25/11 07:42:17 LINE = 0065 STN = 4874
CALLING NUMBER 02242806000
NAME UNKNOWN
UNKNOWN
BC = SPEECH
00:00:00 INCOMING CALL RINGING 0:02
00:00:11 CALL RELEASED
I have to find these results from the file:
incomming,05/25/11,07:42:17,0065,4874,02242806000,00:00:09,2
In this expression 00:00:09 means [00:00:11-00:00:00]-0:02
For every incoming and outgoing call, I must make the conversion above.
You could use a regex like:
(?xm:
^-------- \s+ (\S+) \s+ (\S+) \s+ LINE\s*=\s*(\d+) \s+ STN\s*=\s*(\d+)
\s+ CALLING\ NUMBER \s+ (\d+) \s*
(?:^(?:[ \t]+.*)?[\n\r]+)* # eat unwanted part
^(\d\d:\d\d:\d\d) \s+ INCOMING\ CALL \s+ RINGING\ ([\d:]+) \s*
(?:^\d.*[\r\n]+)* # possible stuff
^(\d\d:\d\d:\d\d) \s+ CALL\ RELEASED
)
Use the values of the capturing groups to get your results. You may need to remove the /x related things like comments and spaces.
Perl example at http://ideone.com/qTBFe
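Since the question asks about Java, here is a rough Java port of the regex with the /x whitespace and comments removed, so only the MULTILINE flag is needed. Several things in it are assumptions: the class and method names are made up, the continuation lines of the log are taken to be indented (which is what the [ \t]+ part of the regex implies), the label is written as "incoming", and the last field of the result is taken to be the ringing time in seconds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CallLogParser {
    static final Pattern CALL = Pattern.compile(
          "^--------\\s+(\\S+)\\s+(\\S+)\\s+LINE\\s*=\\s*(\\d+)\\s+STN\\s*=\\s*(\\d+)"
        + "\\s+CALLING NUMBER\\s+(\\d+)\\s*"
        + "(?:^(?:[ \\t]+.*)?[\\n\\r]+)*"                 // eat unwanted indented lines
        + "^(\\d\\d:\\d\\d:\\d\\d)\\s+INCOMING CALL\\s+RINGING ([\\d:]+)\\s*"
        + "(?:^\\d.*[\\r\\n]+)*"                          // possible other timestamped lines
        + "^(\\d\\d:\\d\\d:\\d\\d)\\s+CALL RELEASED",
        Pattern.MULTILINE);

    // Converts "H:MM:SS" or "M:SS" to seconds
    static int toSeconds(String clock) {
        int total = 0;
        for (String part : clock.split(":")) {
            total = total * 60 + Integer.parseInt(part);
        }
        return total;
    }

    // Returns one CSV line for the first incoming call in the log, or null if none is found
    static String summarize(String log) {
        Matcher m = CALL.matcher(log);
        if (!m.find()) {
            return null;
        }
        int ring = toSeconds(m.group(7));
        // (release time - incoming time) - ringing time, as described in the question
        int talk = toSeconds(m.group(8)) - toSeconds(m.group(6)) - ring;
        String duration = String.format("%02d:%02d:%02d", talk / 3600, talk / 60 % 60, talk % 60);
        return String.join(",", "incoming", m.group(1), m.group(2), m.group(3),
                           m.group(4), m.group(5), duration, String.valueOf(ring));
    }
}
```

To process a whole file, the `if (!m.find())` check would become a `while (m.find())` loop that emits one line per match.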