I want to split an input string on a regex pattern using the Pattern.split(CharSequence) API. The regex uses lookaheads and lookbehinds. It is supposed to split on a delimiter (,) but ignore the delimiter when it is enclosed in double quotes ("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call times out is:
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that lookaround techniques are heavy and can cause timeouts if the string is too long. If I remove the backslashes around [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to match and split.
With the above string, the pattern also fails to detect the first delimiter, at [STIFFENER]","QH20426AD3. But if I remove the backslashes around [\"BOLT,HI-JOK\"] at the end of the string, the regex does detect it.
I am not very experienced with lookarounds in regex. Can someone please give hints on how I can optimize this regex and avoid the timeouts?
Any pointers or article links are appreciated!
If you want to split on a comma only when it is directly followed by a double-quoted string (from an opening to a closing double quote), you can use:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahead
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ to escape any char using the . and again match any chars other than " and \, then match the closing "
) Close lookahead
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
    System.out.println(part);
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I have a dataset of resumes and I want to extract data from each resume.
Here is an example of what I need:
String test= "Worked in Innovision Information System Private Limited as Project Trainee-Content Writing from Date to Date.";
I want to extract the company name, the role (designation), and the dates (from-to).
I'm new to regex, so please correct me if I'm wrong.
The first thing I tried was to extract each of them separately:
String regexStr5="Worked in:? \\w+" ;
String regexStr6 ="as:? ([a-zA-Z ]+)";
and for the date: (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, \d{4}
How can I put them all together in the same regex
and print the company name + role + date?
A literal string match with capture groups would be just fine for the above test string.
Regex: Worked in (.*) as (.*) from (.*) to (.*)\.
Replacement to do: Company Name: \1\nRole (designation): \2\nDate: \3 to \4
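In Java, the same idea could look like the sketch below. It assumes every sentence follows the "Worked in ... as ... from ... to ..." template of the test string; the class name ResumeLineParser and the method parse are made up for illustration. Java replacement strings use $1 rather than \1, so the sketch simply reads the groups with Matcher.group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
class ResumeLineParser {
    // Groups: 1 = company, 2 = role, 3 = start date, 4 = end date.
    // The final dot is escaped so it only matches a literal full stop.
    private static final Pattern LINE =
            Pattern.compile("Worked in (.*) as (.*) from (.*) to (.*)\\.");

    /** Returns {company, role, from, to}, or null if the line does not fit the template. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String test = "Worked in Innovision Information System Private Limited"
                + " as Project Trainee-Content Writing from Date to Date.";
        String[] f = parse(test);
        if (f != null) {
            System.out.println("Company Name: " + f[0]);
            System.out.println("Role (designation): " + f[1]);
            System.out.println("Date: " + f[2] + " to " + f[3]);
        }
    }
}
```

Because (.*) is greedy, a company name or role that itself contains " as ", " from " or " to " would shift the group boundaries, so real resume text probably needs a stricter pattern than this.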
I need a regex to extract the date section from the names of several files.
In particular, I have these two formats:
ATC0200720140828080610.xls
ATC0200720140901080346_UFF_ACC.xls
I use these two regexes to check the file name format:
^ATC02007[0-9]{14}.xls$
^ATC02007[0-9]{14}_UFF_ACC.xls$
But I need a regex to extract a specific section:
constant | yyyyMMddHHmmss | constant
   ^     |       ^        |    ^
ATC02007 | 20140901080346 | _UFF_ACC.xls
Both regexes I'm using match the entire file name, so I can't use them to extract the middle section. What is the right expression?
You are almost there. Just use round brackets to contain the numbers you want.
^ATC02007([0-9]{14})(_UFF_ACC)?\.xls$
The numbers are captured in group 1 ($1). The dot before xls is escaped so that it only matches a literal dot.
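A minimal Java sketch of reading that group, assuming the two file name formats from the question. The class AtcFileName and its methods are hypothetical names, and converting the 14 digits to a LocalDateTime is an optional extra step that is not part of the answer.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class AtcFileName {
    // Group 1 holds the 14-digit yyyyMMddHHmmss section; the dot is escaped.
    private static final Pattern NAME =
            Pattern.compile("^ATC02007([0-9]{14})(_UFF_ACC)?\\.xls$");
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Returns the 14-digit section, or null if the name has an unexpected format. */
    static String extractStamp(String fileName) {
        Matcher m = NAME.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    /** Parses the section as a date-time; throws DateTimeParseException for impossible dates. */
    static LocalDateTime extractDateTime(String fileName) {
        String stamp = extractStamp(fileName);
        return stamp == null ? null : LocalDateTime.parse(stamp, STAMP);
    }

    public static void main(String[] args) {
        System.out.println(extractStamp("ATC0200720140828080610.xls"));
        System.out.println(extractStamp("ATC0200720140901080346_UFF_ACC.xls"));
        System.out.println(extractDateTime("ATC0200720140901080346_UFF_ACC.xls"));
    }
}
```

The regex only checks for 14 digits, so a name such as ATC0200799999999999999.xls still matches; the parsing step is where an invalid date would be rejected.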
You need to use capturing groups.
^(ATC02007)([0-9]{14})((?:[^.]*)?\.xls)$
Group 1 contains the first constant, group 2 contains the date and time, and group 3 contains the third constant.
String s = "ATC0200720140828080610.xls\n" +
           "ATC0200720140901080346_UFF_ACC.xls";
Pattern regex = Pattern.compile("(?m)^(ATC02007)([0-9]{14})((?:[^.]*)?\\.xls)$");
Matcher matcher = regex.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}
Output:
ATC02007
20140828080610
.xls
ATC02007
20140901080346
_UFF_ACC.xls
I have the following strings:
<PAUL SAINT-KARL 1997-05-07>
<BOB DEAN 2001-05-07>
<GUY JEDDY 2007-05-07>
I want a java regex that would match this type of pattern "name and date" and then extract the name and date separately.
I am able to match them separately with the following Java regexes:
1) (\d{4}-\d{2}-\d{2})>
2) <([ A-Z&#;0-9-]*+)
What I'm looking for is one regex that would identify the full text pattern as provided, and then extract the subsections, such as the actual name, and the date.
I'm looking to use Matcher.group() to retrieve the complete match from the target string.
Thanks
Try this:
"<([ A-Z&#;0-9-]*?) (\\d{4}-\\d{2}-\\d{2})>"
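I changed the *+ to *? to make the * match lazily. The character class also allows spaces, digits and hyphens, so the possessive *+ swallows the date as well and never gives it back, which leaves nothing for the date group to match.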
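As a rough sketch of how the combined pattern could be used with Matcher.group(), assuming input lines like the three in the question (the class name NameDateExtractor and the method extract are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class NameDateExtractor {
    // Group 1 = name (lazy, so it stops before the date), group 2 = date.
    private static final Pattern ENTRY =
            Pattern.compile("<([ A-Z&#;0-9-]*?) (\\d{4}-\\d{2}-\\d{2})>");

    /** Returns {full match, name, date}, or null if no entry is found. */
    static String[] extract(String input) {
        Matcher m = ENTRY.matcher(input);
        if (!m.find()) {
            return null;
        }
        // group() with no argument is the complete match, same as group(0).
        return new String[] { m.group(), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "<PAUL SAINT-KARL 1997-05-07>",
            "<BOB DEAN 2001-05-07>",
            "<GUY JEDDY 2007-05-07>"
        };
        for (String line : lines) {
            String[] r = extract(line);
            if (r != null) {
                System.out.println(r[1] + " | " + r[2]);
            }
        }
    }
}
```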
I want to split the string
String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
to
name
employeeno
dob
joindate
I wrote the following Java code for this, but it prints only name; the other matches are not printed.
String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
Pattern pattern = Pattern.compile("\\[.+\\]+?,?\\s*" );
String[] split = pattern.split(fields);
for (String string : split) {
    System.out.println(string);
}
What am I doing wrong here?
Thank you
This part:
\\[.+\\]
matches the first [, the .+ then gobbles up the entire string (if no line breaks are in the string) and then the \\] will match the last ].
You need to make the .+ reluctant by placing a ? after it:
Pattern pattern = Pattern.compile("\\[.+?\\]+?,?\\s*");
And shouldn't \\]+? just be \\] ?
The error is that you are matching greedily. You can change it to a non-greedy match by adding a ? after the .+:
Pattern.compile("\\[.+?\\],?\\s*")
There's an online regular expression tester at http://gskinner.com/RegExr/?2sa45 that will help you a lot when you try to understand regular expressions and how they are applied to a given input.
Would it be better to use negated character classes to match the square brackets? \[(\w+\s)+\w+[^\]]\]
For a good example, see also: How does using a negated character class work internally (without backtracking)?
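The reluctant fix and a negated-character-class variant can be compared with a small sketch. Note that the negated pattern used here, \[[^\]]*\], is a simplified assumption of what such a variant could look like, not the exact expression from the comment; FieldNameSplitter and names are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
class FieldNameSplitter {
    // Reluctant version: .+? stops at the first closing bracket.
    static final Pattern RELUCTANT = Pattern.compile("\\[.+?\\],?\\s*");
    // Negated-class version (assumed variant): [^\]]* cannot run past a ']'.
    static final Pattern NEGATED = Pattern.compile("\\[[^\\]]*\\],?\\s*");

    /** Splits the field list on each "[label], " separator. */
    static String[] names(Pattern separator, String fields) {
        // split drops the trailing empty string left after the last "[...]".
        return separator.split(fields);
    }

    public static void main(String[] args) {
        String fields = "name[Employee Name], employeeno[Employee No], "
                + "dob[Date of Birth], joindate[Date of Joining]";
        // Both print [name, employeeno, dob, joindate]
        System.out.println(Arrays.toString(names(RELUCTANT, fields)));
        System.out.println(Arrays.toString(names(NEGATED, fields)));
    }
}
```

Both patterns give the same result on this input; the negated class just makes it explicit that the match may not cross a closing bracket.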