Regex for multiline string literals produces `StackOverflowError` - java

I want to match strings enclosed in triple quotes (""") that may contain line breaks and that contain no """ substring except at the very beginning and at the very end.
Valid example:
"""foo
bar "baz" blah"""
Invalid example:
"""foo bar """ baz"""
I tried using the following regex (as Java String literal):
"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\""
and it seems to work on short examples. However, on longer inputs, such as a string consisting of a thousand lines of "hello world", it throws a StackOverflowError.
Scala snippet to reproduce the error
import java.util.regex.{Pattern, Matcher}
val text = "\"" * 3 + "hello world \n" * 1000 + "\"" * 3
val p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"")
println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt())
println(p.matcher(text).lookingAt())
(Note: run this locally, since Scastie times out; alternatively, reduce 1000 to a smaller number.)
Java snippet that produces the same error
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class RegexOverflowMain {
    public static void main(String[] args) {
        StringBuilder bldr = new StringBuilder();
        bldr.append("\"\"\"");
        for (int i = 0; i < 1000; i++) {
            bldr.append("hello world \n");
        }
        bldr.append("\"\"\"");
        String text = bldr.toString();
        Pattern p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"");
        System.out.println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt());
        System.out.println(p.matcher(text).lookingAt());
    }
}
Question
Any idea how to make this "stack safe", i.e. can someone find a regex that accepts the same language, but does not produce a StackOverflowError when fed to the Java regex API?
I don't care whether the solution is in Scala or Java (or whatever), as long as the same underlying Java regex library is used.

Solution using a negative look-ahead: match a string that starts with """, ends with """, and whose content does not include """.
As Plain regex: ^"""((?!""")[\s\S])*"""$
As Java escaped regex: "^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$"
[\s\S] matches any character, including line breaks (it is basically . plus line breaks, or . with the single-line flag).
This should be used without the multiline flag, so that ^ and $ match the start and end of the whole string and not the start and end of each line;
otherwise this:
""" ab
"""abc"""
abc """
would match
Also, I used this as a reference for how to exclude the """: "Regular expression to match a line that doesn't contain a word?"
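To make the remark about the multiline flag concrete, here is a minimal sketch that runs the look-ahead regex with and without Pattern.MULTILINE. The class and helper names (TripleQuoteLookahead, found) are made up for illustration, and the inputs are the short examples from this thread; it has not been tried on very long inputs.

```java
import java.util.regex.Pattern;

class TripleQuoteLookahead {
    // The look-ahead regex from the answer, written as a Java string literal.
    static final String REGEX = "^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$";

    // Without MULTILINE, ^ and $ refer to the whole input.
    static final Pattern WHOLE_INPUT = Pattern.compile(REGEX);
    // With MULTILINE, ^ and $ also match at line boundaries.
    static final Pattern PER_LINE = Pattern.compile(REGEX, Pattern.MULTILINE);

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String valid = "\"\"\"foo\nbar \"baz\" blah\"\"\"";
        String invalid = "\"\"\"foo bar \"\"\" baz\"\"\"";
        String threeLines = "\"\"\" ab\n\"\"\"abc\"\"\"\nabc \"\"\"";

        System.out.println(found(WHOLE_INPUT, valid));      // true
        System.out.println(found(WHOLE_INPUT, invalid));    // false
        System.out.println(found(WHOLE_INPUT, threeLines)); // false
        System.out.println(found(PER_LINE, threeLines));    // true: the middle line matches on its own
    }
}
```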

The full answer below optimizes the regex performance, but to prevent stack overflow, as a simple solution, just make the repeating group possessive.
Non-possessive repeating groups with choices need recursive calls to allow backtracking. Making it possessive fixes the problem, so simply add a + after the *:
"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*+\"\"\""
Also note that if you want to match the entire input, you need to call matches(), not lookingAt().
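A minimal sketch of the possessive variant, using the same 1000-line input as the question, together with the difference between lookingAt() and matches(). The class name PossessiveTripleQuote and the helper longLiteral are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class PossessiveTripleQuote {
    // Same pattern as in the question, with the repeating group made possessive (*+).
    static final Pattern P = Pattern.compile(
            "(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*+\"\"\"");

    // Builds a triple-quoted literal containing the given number of lines.
    static String longLiteral(int lines) {
        StringBuilder sb = new StringBuilder("\"\"\"");
        for (int i = 0; i < lines; i++) {
            sb.append("hello world \n");
        }
        return sb.append("\"\"\"").toString();
    }

    public static void main(String[] args) {
        String invalid = "\"\"\"foo bar \"\"\" baz\"\"\"";

        // The long input from the question, checked against the whole input.
        System.out.println(P.matcher(longLiteral(1000)).matches());

        // lookingAt() accepts a matching prefix; matches() requires the entire input.
        System.out.println(P.matcher(invalid).lookingAt()); // true (prefix """foo bar """)
        System.out.println(P.matcher(invalid).matches());   // false
    }
}
```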
Performance boost
Note: A quick performance test showed this to be more than 6 times faster than the regex in the answer by x4rf41.
Instead of matching one of
Not a quote: [^\"]
Exactly one quote: (?:\"[^\"])
Exactly two quotes: (?:\"\"[^\"])
in a loop, first match everything up to a quote. If that is a single- or double-quote, but not a triple-quote, match the 1-2 quotes then everything up to next quote, repeat as needed. Finally match the ending triple-quote.
That matching is definitive, so make the repeats possessive. This also prevents stack overflow in case the input has many embedded quotes.
"{3}          match 3 leading quotes
[^"]*+        match as many non-quotes as possible (if any) {possessive}
(?:           start optional repeating group
   "{1,2}     match 1-2 quotes
   [^"]++     match one or more non-quotes (at least one) {possessive}
)*+           end optional repeating group {possessive}
"{3}          match 3 trailing quotes
Since you don't use ^ or $, there is no need for (?m) (MULTILINE)
As Java string:
"\"{3}[^\"]*+(?:\"{1,2}[^\"]++)*+\"{3}"

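As a rough check of the unrolled, possessive regex from the answer above, the following sketch runs it on the valid and invalid examples from the question and on an input with many embedded quotes. UnrolledTripleQuote, isLiteral and quotedLines are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

class UnrolledTripleQuote {
    // "{3} [^"]*+ (?: "{1,2} [^"]++ )*+ "{3}, as a Java string literal.
    static final Pattern P = Pattern.compile("\"{3}[^\"]*+(?:\"{1,2}[^\"]++)*+\"{3}");

    // True if the entire input is one triple-quoted literal.
    static boolean isLiteral(String s) {
        return P.matcher(s).matches();
    }

    // Builds a literal whose body contains two embedded quotes per line.
    static String quotedLines(int lines) {
        StringBuilder sb = new StringBuilder("\"\"\"");
        for (int i = 0; i < lines; i++) {
            sb.append("say \"hi\" \n");
        }
        return sb.append("\"\"\"").toString();
    }

    public static void main(String[] args) {
        System.out.println(isLiteral("\"\"\"foo\nbar \"baz\" blah\"\"\"")); // true
        System.out.println(isLiteral("\"\"\"foo bar \"\"\" baz\"\"\""));    // false
        System.out.println(isLiteral(quotedLines(5000)));
    }
}
```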
Related

Java Regex with "Joker" characters

I am trying to write a regex that validates an input field.
What I call "joker" chars are '?' and '*'.
Here is my Java regex:
"^$|[^\\*\\s]{2,}|[^\\*\\s]{2,}[\\*\\?]|[^\\*\\s]{2,}[\\?]{1,}[^\\s\\*]*[\\*]{0,1}"
What I'm trying to match is:
Minimum 2 alpha-numeric characters (other than '?' and '*')
The '*' can appear only once, and only at the end of the string
The '?' can appear multiple times
No whitespace at all
So for example :
abcd = OK
?bcd = OK
ab?? = OK
ab* = OK
ab?* = OK
??cd = OK
*ab = NOT OK
??? = NOT OK
ab cd = NOT OK
 abcd = NOT OK (space at the beginning)
I've made the regex a bit complicated and I'm lost; can you help me?
^(?:\?*[a-zA-Z\d]\?*){2,}\*?$
Explanation:
The regex asserts that this pattern must appear twice or more:
\?*[a-zA-Z\d]\?*
which asserts that there must be one character in the class [a-zA-Z\d], with zero or more question marks on either side of it.
Then the regex matches \*?, which means zero or one asterisk, at the end of the string.
Here is an alternative regex that is faster, as revo suggested in the comments:
^(?:\?*[a-zA-Z\d]){2}[a-zA-Z\d?]*\*?$
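For what it's worth, here is a small sketch that runs both regexes from this answer against the sample inputs listed in the question. The class name JokerPatterns and the helper accepts are invented for the example.

```java
import java.util.regex.Pattern;

class JokerPatterns {
    // First regex from the answer.
    static final Pattern SIMPLE =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]\\?*){2,}\\*?$");
    // Alternative regex from the answer.
    static final Pattern FASTER =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]){2}[a-zA-Z\\d?]*\\*?$");

    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "abcd", "?bcd", "ab??", "ab*", "ab?*", "??cd", // expected OK
            "*ab", "???", "ab cd", " abcd"                 // expected NOT OK
        };
        for (String s : samples) {
            System.out.println("[" + s + "] simple=" + accepts(SIMPLE, s)
                    + " faster=" + accepts(FASTER, s));
        }
    }
}
```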
Here you go:
^\?*\w{2,}\?*\*?(?<!\s)$
Both described and demonstrated at Regex101.
^ is the start of the String
\?* indicates any number of initial ? characters (must be escaped)
\w{2,} at least 2 word characters (letters, digits or underscore)
\?* continues with any number of ? characters
\*? and optionally one last * character
(?<!\s) asserts that the last character is not a whitespace character (negative look-behind); the preceding parts of the pattern already admit no whitespace
$ is the end of the String
Another way to solve this problem is the look-ahead mechanism (?=subregex). It is zero-length (it resets the regex cursor to the position it had before executing subregex), so it lets the regex engine run multiple tests on the same text via the construct
(?=condition1)
(?=condition2)
(?=...)
conditionN
Note: the last condition (conditionN) is not placed in (?=...), so that the regex engine moves the cursor past the tested part (to "consume" it) and goes on to test whatever follows. For that to work, conditionN must match precisely the section we want to consume (the earlier conditions have no such limitation; they can match substrings of any length, such as the first few characters).
So now we need to decide what our conditions are.
We want to match only alphanumeric characters, ? and *, where * can appear (optionally) only at the end. We can write that as ^[a-zA-Z0-9?]*[*]?$. This also rejects whitespace characters, because we didn't include them among the accepted characters.
The second requirement is to have "Minimum 2 alpha-numeric characters". It can be written as .*?[a-zA-Z0-9].*?[a-zA-Z0-9] or (?:.*?[a-zA-Z0-9]){2,} (if we prefer shorter regexes). Since that condition tests only part of the text rather than the whole of it, we can place it in a look-ahead.
These conditions seem to cover everything we wanted, so we can combine them into a regex that looks like:
^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$
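To illustrate how the look-ahead condition and the consuming condition work together, this sketch checks them separately and then combined. The names JokerLookahead, hasTwoAlnum, onlyAllowedChars and accepts are assumptions made for the example.

```java
import java.util.regex.Pattern;

class JokerLookahead {
    // condition1 alone: a zero-length test for at least two alphanumerics.
    static final Pattern TWO_ALNUM = Pattern.compile("(?=(?:.*?[a-zA-Z0-9]){2,})");
    // conditionN alone: the part that actually consumes the text.
    static final Pattern ALLOWED = Pattern.compile("[a-zA-Z0-9?]*[*]?");
    // Both combined, as in the answer.
    static final Pattern COMBINED =
            Pattern.compile("^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$");

    static boolean hasTwoAlnum(String s) {
        return TWO_ALNUM.matcher(s).lookingAt(); // look-ahead tested at the start only
    }

    static boolean onlyAllowedChars(String s) {
        return ALLOWED.matcher(s).matches();
    }

    static boolean accepts(String s) {
        return COMBINED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "ab cd" passes the look-ahead but is rejected by the consuming part.
        System.out.println(hasTwoAlnum("ab cd") + " " + onlyAllowedChars("ab cd") + " " + accepts("ab cd"));
        // "???" passes the consuming part but fails the look-ahead.
        System.out.println(hasTwoAlnum("???") + " " + onlyAllowedChars("???") + " " + accepts("???"));
        // "ab?*" satisfies both.
        System.out.println(hasTwoAlnum("ab?*") + " " + onlyAllowedChars("ab?*") + " " + accepts("ab?*"));
    }
}
```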

Java regular expression for number starts with code

I am not a Java developer but I am interfacing with a Java system.
Please help me with a regular expression that detects all numbers starting with 25678 or 25677.
For example, in Rails it would be:
^(25677|25678)
Sample inputs are 256776582036 and 256782405036.
^(25678|25677)
or
^2567[78]
If you use ^(25678|25677)[0-9]*, it guarantees that the remaining characters are all digits and not other characters.
That should do the trick for you: it looks for either prefix and then any digits after it.
In Java the regex would be the same, assuming that the number takes up the entire line. You could further simplify it to
^2567[78]
If you need to match a number anywhere in the string, use the \b anchor (double the backslash if you are making a string literal in Java code).
\b2567[78]
How about if there is a possibility of a + at the beginning of a number?
Add an optional +, like this [+]? or like this \+? (again, double the backslash for inclusion in a string literal).
Note that it is important to know what Java API is used with the regular expression, because some APIs will require the regex to cover the entire string in order to declare it a match.
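A short sketch of why the API matters: Matcher.find() is satisfied by a match anywhere (here pinned to the start by ^), while String.matches() requires the pattern to cover the whole string. The class PrefixApis and its helper methods are hypothetical; the sample numbers are the ones from the question.

```java
import java.util.regex.Pattern;

class PrefixApis {
    static final Pattern PREFIX = Pattern.compile("^2567[78]");
    static final Pattern ANYWHERE = Pattern.compile("\\b2567[78]");

    // find() only needs the pattern to match somewhere; ^ pins it to the start.
    static boolean findPrefix(String s) {
        return PREFIX.matcher(s).find();
    }

    // \b lets the code start at any word boundary inside the string.
    static boolean findAnywhere(String s) {
        return ANYWHERE.matcher(s).find();
    }

    public static void main(String[] args) {
        String number = "256776582036";
        System.out.println(findPrefix(number));                 // true
        System.out.println(number.matches("^2567[78]"));        // false: matches() must cover the whole string
        System.out.println(number.matches("^2567[78]\\d*"));    // true
        System.out.println(findAnywhere("call 256782405036"));  // true
        System.out.println(findAnywhere("x256782405036"));      // false: no word boundary before the 2
    }
}
```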
Try something like:
String number = ...;
if (number.matches("^2567[78].*$")) {
    // yes, it starts with your number
}
The regex ^2567[78].*$ means:
The number starts with 2567, followed by either 7 or 8, and then by any characters.
If you need only digits after, say, 25677, then the regex should be ^2567[78]\\d*$, which means the prefix is followed by zero or more digits.
The regex syntax of Java is pretty close to that of Rails, especially for something this simple. The trick is in using the correct API calls. If you need to do more than one search, it's worthwhile to compile the pattern once and reuse it. Something like this should work (strings is a placeholder for whatever collection of inputs you have):
Pattern p = Pattern.compile("^2567[78]");
for (String s : strings) {
    if (p.matcher(s).find()) {
        // string starts with 25677 or 25678
    } else {
        // string starts with something else
    }
}
If it's a one-shot deal, then you can simplify all this by changing the pattern to cover the entire string:
if (someString.matches("2567[78].*")) {
    // string starts with 25677 or 25678
}
The matches() method tests whether the entire string matches the pattern; hence the leading ^ anchor is unnecessary but the trailing .* is needed.
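If you need to account for an optional leading + (as you indicated in a comment to another answer), just include \+? at the start of the pattern (or after the ^ if that's used). The backslash is needed because a bare + is a quantifier; in a Java string literal it is written "\\+?".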
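Putting the two suggestions together (compile once, allow an optional leading +), a sketch could look like this. CodePrefixFilter and withCode are made-up names, and the last two sample numbers are invented to show rejections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CodePrefixFilter {
    // Compiled once and reused; the + is escaped so it is a literal plus sign.
    static final Pattern PREFIX = Pattern.compile("^\\+?2567[78]");

    // Returns the inputs that start with 25677 or 25678, optionally preceded by +.
    static List<String> withCode(List<String> numbers) {
        List<String> result = new ArrayList<>();
        for (String n : numbers) {
            if (PREFIX.matcher(n).find()) {
                result.add(n);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "256776582036", "+256782405036", "256791234567", "0256776582036");
        System.out.println(withCode(input)); // [256776582036, +256782405036]
    }
}
```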

Optimization of regular expression for custom key-value pairs

I am trying to extract some key-value pairs plus their preceding text from a large file, but the regular expression used runs very slowly, so it needs optimization.
The input consists of fairly short strings with 1 or 2 key-value pairs, like
one two three/1234==five/5678 some other text
or
one two three/1234==five/5678 some other text four/910==five/1112 more text
The (apparently suboptimal) regular expression used is
(.*?)\s*([^ /]+)\s*/\s*([\d]+)\s*==\s*([^ /]+)\s*/\s*([\d]+)\s*
(Spaces may appear in numerous areas within the string, hence the repeated \s* elements.)
Sample code to test the above:
public static void main(String[] args) {
    String text = "one two three/1234==five/5678 some other text";
    text = "one two three/1234==five/5678 some other text four/910==five/1112 more text";
    String regex = "(.*?)\\s*([^ /]+)\\s*/\\s*([\\d]+)\\s*==\\s*([^ /]+)\\s*/\\s*([\\d]+)\\s*";
    Matcher matcher = Pattern.compile(regex).matcher(text);
    int end = 0;
    System.out.println("--------------------------------------------------");
    while (matcher.find()) {
        System.out.println("\"" + matcher.group(1) + "\"");
        System.out.println(matcher.group(2) + " == " + matcher.group(3));
        System.out.println(matcher.group(4) + " == " + matcher.group(5));
        end = matcher.end();
        System.out.println("--------------------------------------------------");
    }
    System.out.println(text.substring(end).trim());
}
The output is the key-value pairs, plus the preceding text (all extracted fields are required). For example, for the longer string, the output is:
--------------------------------------------------
"one two"
three == 1234
five == 5678
--------------------------------------------------
"some other text"
four == 910
five == 1112
--------------------------------------------------
more text
In other words, the matcher.find() method runs for 1 or 2 rounds, depending on whether the string has the short or long form (1 or 2 key-value pairs, respectively).
The problem is that the extraction speed is low and at times, depending on the variation of the input string, the find() method takes a lot of time to complete.
Is there any better form for the regular expression, to significantly speed up processing?
It's never a good idea to put (.*?) at the beginning of a regex.
First, it can be slow. Although in theory non-greedy matches can be handled efficiently (see, for example, Russ Cox's re2 implementation), many regex implementations do not handle non-greedy matches very well, especially in the case where the find operation is going to fail. I don't know whether the Java regex implementation falls into this category or not, but there's no reason to tempt fate.
Second, it's pointless. The semantics of regex searching is that the first possible match will be found, which is identical to the semantics of .*?. To get the capture (.*?), you only need the substring from the end of the previous match (or the beginning of the string) to the beginning of the current match. That's trivial, especially since you're already tracking the end of the previous match.
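A minimal sketch of that idea, assuming the rest of the pattern stays as in the question: the leading (.*?)\s* is dropped, and the preceding text is taken as the substring between the end of the previous match and the start of the current one. KeyValueExtractor and extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueExtractor {
    // The question's pattern without the leading (.*?)\s*; groups are now numbered 1-4.
    static final Pattern PAIR = Pattern.compile(
            "([^ /]+)\\s*/\\s*(\\d+)\\s*==\\s*([^ /]+)\\s*/\\s*(\\d+)\\s*");

    // Returns preceding text and the two key-value pairs per match, then the trailing text.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(text);
        int end = 0;
        while (m.find()) {
            // Text between the previous match (or the start) and this match.
            out.add(text.substring(end, m.start()).trim());
            out.add(m.group(1) + " == " + m.group(2));
            out.add(m.group(3) + " == " + m.group(4));
            end = m.end();
        }
        out.add(text.substring(end).trim());
        return out;
    }

    public static void main(String[] args) {
        String text = "one two three/1234==five/5678 some other text four/910==five/1112 more text";
        for (String line : extract(text)) {
            System.out.println(line);
        }
    }
}
```

For the longer sample string this should produce the same fields as the output shown in the question, just without the separator lines.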
How are you reading the file? If you read the file line-by-line with BufferedReader#readLine() or Scanner#nextLine(), all you need to do is add \G to the beginning of your regex. It acts like \A the first time you apply the regex, anchoring the match to the beginning of the string. If that match succeeds, the next find() will be anchored to the position where the previous match ended. If it doesn't find a match starting right there, it gives up and doesn't look for any more matches in that string.
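Here is a sketch of the \G suggestion applied to the regex from the question, under the assumption that each line is processed on its own. AnchoredFind and precedingTexts are names invented for the example; the method collects group 1 of every match found on a line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredFind {
    // The question's regex with \G prepended: every match must begin at the
    // start of the line or exactly where the previous match ended.
    static final Pattern PAIR = Pattern.compile(
            "\\G(.*?)\\s*([^ /]+)\\s*/\\s*(\\d+)\\s*==\\s*([^ /]+)\\s*/\\s*(\\d+)\\s*");

    // Returns the "preceding text" capture (group 1) of each match on the line.
    static List<String> precedingTexts(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(precedingTexts(
                "one two three/1234==five/5678 some other text four/910==five/1112 more text"));
        // A line without key-value pairs yields no matches.
        System.out.println(precedingTexts("no key-value pairs on this line"));
    }
}
```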
EDIT: I'm assuming each of the sequences you want to match, whether it's one key/value pair or two, is on its own line. If you read the file one line at a time, you can run the code in your question on each line.
As for why your regex is so slow, it's because the regex engine has to make multiple match attempts--possibly hundreds of them--on every non-matching line before it gives up. It isn't smart enough to realize that if the first attempt on a given line fails, no further attempts on that line will do any good. So it bumps forward one position and tries again. And it keeps doing that for the whole line.
If you were only expecting one match per line, I would say to use a start-of-line anchor (^ in MULTILINE mode).

validate string in java

I have a string with data separated by commas like this:
$d4kjvdf,78953626,10.0,103007,0,132103.8945F,
I tried the following regex but it doesn't match the strings I want:
[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,
The $ at the beginning of your data string is not matching the regex. Change the first character class to [$a-zA-Z0-9]. And a couple of the comma separated values contain a literal dot. [$.a-zA-Z0-9] would cover both cases. Also, it's probably a good idea to anchor the regex at the start and end by adding ^ and $ to the beginning and end of the regex respectively. How about this for the full regex:
^[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,$
Update:
You said number of commas is your primary matching criteria. If there should be 6 commas, this would work:
^([^,]+,){6}$
That means: match at least 1 character that is anything but a comma, followed by a comma. And perform the aforementioned match 6 times consecutively. Note: your data must end with a trailing comma as is consistent with your sample data.
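A quick sketch of the six-comma check, using the sample string from the question plus a few made-up counterexamples. SixFields and isValid are illustrative names.

```java
import java.util.regex.Pattern;

class SixFields {
    // Six repetitions of "one or more non-commas followed by a comma".
    static final Pattern SIX = Pattern.compile("^([^,]+,){6}$");

    static boolean isValid(String s) {
        return SIX.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("$d4kjvdf,78953626,10.0,103007,0,132103.8945F,")); // true
        System.out.println(isValid("a,b,c,d,e,"));   // false: only five fields
        System.out.println(isValid("a,,c,d,e,f,"));  // false: empty field
        System.out.println(isValid("a,b,c,d,e,f"));  // false: no trailing comma
    }
}
```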
Well, your regular expression is certainly garbled: there are clearly characters (like $ and .) that your expression won't match, and you don't need to \\ escape commas. Let's first describe our requirements. You seem to be saying a valid string is defined as:
A string consisting of 6 commas, with one or more characters before each one
We can represent that with the following pattern:
(?:[^,]+,){6}
This says match one or more non-commas, followed by a comma - [^,]+, - six times - {6}. The (?:...) notation is a non-capturing group, which lets us match the whole sub-expression six times; without it, the {6} would only apply to the preceding character.
Alternately, we could use normal, capturing groups to let us select each individual section of the matching string:
([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?
Now we can not only match the string, but extract its contents at the same time, e.g.:
String str = "$d4kjvdf,78953626,10.0,103007,0,132103.8945F,";
Pattern regex = Pattern.compile(
    "([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?");
Matcher m = regex.matcher(str);
if (m.matches()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}
This prints:
$d4kjvdf
78953626
10.0
103007
0
132103.8945F

Unescaped "." still matches when used in a negation group

I made, what I believed to be, an error in a regular expression in Java recently but when I test my code I don't get the error I expect.
The expression I created was meant to replace a password in a string that I received from another source. The pattern I used went along the lines of: "password: [^\\s.]*", the idea being that it would match the word "password" the colon, a space, then any characters except for a space or a full-stop (period). I would then replace the instance with "password: XXXXXX" and therefore mask it.
The obvious error should be that I have forgotten to escape the full-stop. In other words, the proper expression should have been "password: [^\\s\\.]*". Thing is, if I don't escape the full-stop the code still works!
Here's some sample code:
import java.util.regex.*;
public class SimpleRegexTest {
    public static void main(String[] args) {
        Pattern simplePattern = Pattern.compile("password: [^\\s.]*");
        Matcher simpleMatcher = simplePattern.matcher("password: newpass. Enjoy.");
        String maskedString = simpleMatcher.replaceAll("password: XXXXXX");
        System.out.println(maskedString);
    }
}
When I run the above code I get the following output:
password: XXXXXX. Enjoy.
Is this a special case, or have I completely missed something?
(edit: changed to "escape the full-stop")
Michael Borgwardt: I couldn't think of another term to describe what I was doing apart from "negation group", sorry for the ambiguity.
Aviator: In this case, no, a space won't be in the password. I didn't make the rules ;-).
(edit: doubled up the slashes in the non-code text so it displays properly, added the ^ which was in the code, but not the text :-/)
Sundar: Fixed the double slashes, SO seems to have its own escape characters.
A period ('.' character) does not need to be escaped inside a character class [] in a regular expression.
From the API:
Note that a different set of metacharacters are in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range forming metacharacter.
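A small sketch showing the point from the API note: inside a character class the dot is a literal, so the escaped and unescaped forms behave the same. DotInCharClass and mask are names made up for this example; the input string is the one from the question.

```java
class DotInCharClass {
    // Masks the password using the given character class for the password body.
    static String mask(String charClass, String input) {
        return input.replaceAll("password: " + charClass + "*", "password: XXXXXX");
    }

    public static void main(String[] args) {
        // Outside a class, . matches any character; inside, it is a literal dot.
        System.out.println("a".matches("."));    // true
        System.out.println("a".matches("[.]"));  // false
        System.out.println(".".matches("[.]"));  // true

        String input = "password: newpass. Enjoy.";
        // Unescaped and escaped dot give the same result.
        System.out.println(mask("[^\\s.]", input));    // password: XXXXXX. Enjoy.
        System.out.println(mask("[^\\s\\.]", input));  // password: XXXXXX. Enjoy.
    }
}
```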
It looks like you got the negation operator mixed up for regex ranges.
In particular, my understanding is that you used the snippet [\s.]* to mean "any characters except for a space or a full-stop (period)." This would in fact be expressed as [^ .]*, using the caret to negate the characters in the set.
I don't know if this was just a typo in your post or what was actually in your code, but the regex as it stands in your question will match the word "password", a colon, a space, then any sequence of backslash characters, "s" characters or periods.
