I want to split an input string on a regex pattern using the Pattern.split(String) API. The regex uses both lookaheads and lookbehinds. It is supposed to split on a delimiter (,) and ignore the delimiter if it is enclosed in double quotes ("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call times out is -
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that lookaround techniques are heavy and can cause timeouts if the string is too long. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect the delimiters and split.
The pattern also does not detect the first delimiter at [STIFFENER]","QH20426AD3 in the above string. But if I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect it.
I am not very experienced with lookarounds in regex. Can someone please give hints on how I can optimize this regex and avoid timeouts?
Any pointers or article links are appreciated!
If you want to split on a comma that is directly followed by a complete double-quoted string (from the opening to the closing double quote), you could use:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahead
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ followed by any char (the . consumes the escaped char) and again any chars other than " and \, then match the closing "
) Close lookahead
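As a side note, the quoted-string part of the lookahead can also be used on its own to match the fields instead of splitting between them. The following is only a minimal sketch under the assumption that every field in the input is wrapped in double quotes; the class name QuotedFieldDemo, the method quotedFields and the shortened sample line are made up for illustration and are not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedFieldDemo {
    // One quoted field: opening quote, runs of chars that are neither " nor \,
    // interleaved with backslash-escaped chars, then the closing quote.
    private static final Pattern FIELD =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    // Collects every quoted field in order of appearance (hypothetical helper).
    static List<String> quotedFields(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.add(m.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened sample: "","a,b [X]","c [\"D,E\"]"
        String line = "\"\",\"a,b [X]\",\"c [\\\"D,E\\\"]\"";
        for (String field : quotedFields(line)) {
            System.out.println(field);
        }
    }
}
```

Because each alternative inside the repeated group has to start with a different character (a backslash, or anything that is not a backslash or quote), the engine has very little to backtrack over, which is likely why this shape avoids the timeout seen with the original pattern.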
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts) {
    System.out.println(part);
}
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I am using an email validation pattern I found at How to validate an email and it works fine except it allows a + in the first part of the email and that isn't allowed in my specs. The original code is
public static final String EMAIL_PATTERN = "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*@"
    + "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
protected boolean isInvalidEmail(String email) {
    pattern = Pattern.compile(EMAIL_PATTERN);
    matcher = pattern.matcher(email);
    return !matcher.matches();
}
I thought I could just remove the + from "^[_A-Za-z0-9-\\+] but I get a PatternSyntaxException: Unclosed character class. Can someone tell me why removing the + uncloses the class? Thanks!
You have to remove the \\+ portion.
In a Java string literal, \\ is an escaped \ character, so the regex engine sees \+, which escapes the + regex operator. Thus \\+ breaks down to \+, which means match the literal + character.
Note: The + regex operator means match one or more of the preceding element.
The reason you get Unclosed character class is that removing only the + leaves the \\ directly in front of the closing square bracket. The regex engine then sees \], which escapes the bracket, so it is treated as a literal character inside the class. Hence, the class no longer has a matching closing square bracket. As Jonny Henly mentions, the solution is to remove the whole \\+ to align with your spec; this answer explains why the class is unclosed.
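To make the difference concrete, here is a small sketch that compiles the broken fragment and the corrected pattern. The class name EmailClassDemo and the helper compiles are hypothetical, and the sketch assumes the local part and the domain are separated by @.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EmailClassDemo {
    // The pattern with the whole "\\+" removed from the first character class.
    // Assumption: '@' separates the local part from the domain.
    static final String EMAIL_PATTERN =
            "^[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@"
            + "[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";

    // Returns true if the regex compiles, false on a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    static boolean isInvalidEmail(String email) {
        return !Pattern.compile(EMAIL_PATTERN).matcher(email).matches();
    }

    public static void main(String[] args) {
        // Only the + removed: the regex is ^[_A-Za-z0-9-\]+ and \] is a literal ].
        System.out.println(compiles("^[_A-Za-z0-9-\\]+"));
        // The whole \\+ removed: the class closes normally.
        System.out.println(compiles("^[_A-Za-z0-9-]+"));
        System.out.println(isInvalidEmail("john+tag@example.com"));
        System.out.println(isInvalidEmail("john.doe@example.com"));
    }
}
```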
I created a regex for extracting PHP exception message fields:
(\w+.*)|\G(?!\A)\s*#\d+\s+(\S+\.php)\((\d+)\):\s(\w+.*)#012|#\d+\s{(\w+)}
Demo Links : https://regex101.com/r/xI6cR0/2
Error Message:
Illegal repetition near index 66
(\w+.*)|\G(?!\A)\s*#\d+\s+(\S+\.php)\((\d+)\):\s(\w+.*)#012|#\d+\s{(\w+)}
                                                                  ^
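You need to escape every \ again as \\ for it to work in a Java string literal, so your \w becomes \\w. You also need to escape { and }, because Java treats an unescaped { as the start of a repetition quantifier. So it would be: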
(\\w+.*)|\\G(?!\\A)\\s*#\\d+\\s+(\\S+\\.php)\\((\\d+)\\):\\s(\\w+.*)#012|#\\d+\\s\\{(\\w+)\\}
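A short sketch of just the last alternative shows the effect of escaping the braces. The class name BraceEscapeDemo, the helpers compiles and frameName, and the sample input "#5 {main}" are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BraceEscapeDemo {
    // Returns true if the regex compiles, false on a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Extracts the word between literal braces, e.g. "main" from "#5 {main}".
    static String frameName(String line) {
        Matcher m = Pattern.compile("#\\d+\\s\\{(\\w+)\\}").matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Unescaped braces: { is read as the start of a quantifier.
        System.out.println(compiles("#\\d+\\s{(\\w+)}"));
        // Escaped braces are matched literally.
        System.out.println(compiles("#\\d+\\s\\{(\\w+)\\}"));
        System.out.println(frameName("#5 {main}"));
    }
}
```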
I have a String in which I am trying to replace the number enclosed by two backslashes. For example: \10\ , I am trying to replace that with 10. I am currently using this regex to do that:
String texter = texthb.replaceAll("\\.+\\", "\\"+String.valueOf(pertotal + initper)+"\\");
This line is giving the following error:
Exception in thread "AWT-EventQueue-0" java.util.regex.PatternSyntaxException: Unexpected internal error near index 4
.+\
I know it is because the regex is wrong. What is the proper way to accomplish this? Thanks in advance.
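Use four backslashes in the Java string literal to match a single backslash character: the compiler turns \\\\ into \\, which the regex engine reads as one literal backslash. The same escaping applies to the replacement string. The reluctant .+? also makes the match stop at the next backslash instead of the last one.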
String texter = texthb.replaceAll("\\\\.+?\\\\", "\\\\"+String.valueOf(pertotal + initper)+"\\\\");
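The following sketch runs that replacement on a small sample. The class name BackslashReplaceDemo, both method names and the sample text are hypothetical; the second method uses Matcher.quoteReplacement as an alternative way to avoid hand-escaping the replacement.

```java
import java.util.regex.Matcher;

class BackslashReplaceDemo {
    // Regex \\.+?\\ : a backslash, one or more chars (reluctant), a backslash.
    static String replaceBetweenBackslashes(String text, int value) {
        return text.replaceAll("\\\\.+?\\\\", "\\\\" + value + "\\\\");
    }

    // Same result, but quoteReplacement escapes the backslashes for us.
    static String replaceWithQuotedReplacement(String text, int value) {
        String replacement = Matcher.quoteReplacement("\\" + value + "\\");
        return text.replaceAll("\\\\.+?\\\\", replacement);
    }

    public static void main(String[] args) {
        String text = "progress \\10\\ done"; // the text: progress \10\ done
        System.out.println(replaceBetweenBackslashes(text, 25));
        System.out.println(replaceWithQuotedReplacement(text, 25));
    }
}
```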
I want to split the string
String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
to
name
employeeno
dob
joindate
I wrote the following Java code for this, but it prints only name; the other matches are not printed.
String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
Pattern pattern = Pattern.compile("\\[.+\\]+?,?\\s*" );
String[] split = pattern.split(fields);
for (String string : split) {
    System.out.println(string);
}
What am I doing wrong here?
Thank you
This part:
\\[.+\\]
matches the first [, the .+ then gobbles up the entire string (if no line breaks are in the string) and then the \\] will match the last ].
You need to make the .+ reluctant by placing a ? after it:
Pattern pattern = Pattern.compile("\\[.+?\\]+?,?\\s*");
And shouldn't \\]+? just be \\] ?
The error is that you are matching greedily. You can change it to a non-greedy match:
Pattern.compile("\\[.+?\\],?\\s*")
                      ^
There's an online regular expression tester at http://gskinner.com/RegExr/?2sa45 that will help you a lot when you try to understand regular expressions and how they are applied to a given input.
Would it be better to use negated character classes to match the square brackets? \[(\w+\s)+\w+[^\]]\]
For a good example, see "How does using a negated character class work internally (without backtracking)?"
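The three variants discussed in this thread can be compared side by side with a small sketch. The class name ReluctantSplitDemo and the helper splitWith are made up for illustration, and the negated-class pattern \[[^\]]*\] is a simplified assumption, not the exact expression from the comment above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class ReluctantSplitDemo {
    static final String FIELDS =
            "name[Employee Name], employeeno[Employee No], "
            + "dob[Date of Birth], joindate[Date of Joining]";

    // Splits the input on the given regex (hypothetical helper).
    static String[] splitWith(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        // Greedy .+ runs from the first [ to the last ], leaving only "name".
        System.out.println(Arrays.toString(splitWith("\\[.+\\]+?,?\\s*", FIELDS)));
        // Reluctant .+? stops at the first ] after each [.
        System.out.println(Arrays.toString(splitWith("\\[.+?\\],?\\s*", FIELDS)));
        // Negated class [^\]]* cannot run past a ] in the first place.
        System.out.println(Arrays.toString(splitWith("\\[[^\\]]*\\],?\\s*", FIELDS)));
    }
}
```

The reluctant and negated-class versions produce the same four names here; the negated class simply states the intent (anything up to the next closing bracket) more directly.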