I've seen people here made comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.
I haven't really used regex in industrial setting, so I'm curious if there are applications where regex is demonstratably just too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.
Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.
What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.
Note: I would very much appreciate answers that demonstrate the following:
a beginner-level regex solution to a non-toy real-life problem that performs horribly
the simple non-regex solution
the expert-level regex rewrite that performs comparably
I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.
The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.
In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).
1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
42,12,13,12,32,13,14,43,56,31,78,10 // match
42,12,13,12,32,14,13,43,56,31,78,10 // don't match
We use a regex containing exactly 11 commas:
".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"
This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)
A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
So we replace the greedy quantifier with the reluctant one:
".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"
This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.
A performant regex replaces the dot by the character class "[^,]":
"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"
(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)
I experimented a bit with the performance of various constructs, and unfortunately I discovered that Java regex doesn't perform what I consider very doable optimizations.
Java regex takes O(N) to match "(?s)^.*+$"
This is very disappointing. It's understandable for ".*" to take O(N), but with the optimization "hints" in the form of anchors (^ and $) and single-line mode Pattern.DOTALL/(?s), even making the repetition possessive (i.e. no backtracking), the regex engine still could not see that this will match every string, and still have to match in O(N).
This pattern isn't very useful, of course, but consider the next problem.
Java regex takes O(N) to match "(?s)^A.*Z$"
Again, I was hoping that the regex engine can see that thanks to the anchors and single-line mode, this is essentially the same as the O(1) non-regex:
s.startsWith("A") && s.endsWith("Z")
Unfortunately, no, this is still O(N). Very disappointing. Still, not very convincing because a nice and simple non-regex alternative exists.
Java regex takes O(N) to match "(?s)^.*[aeiou]{3}$"
This pattern matches strings that ends with 3 lowercase vowels. There is no nice and simple non-regex alternative, but you can still write something non-regex that matches this in O(1), since you only need to check the last 3 characters (for simplicity, we can assume that the string length is at least 3).
I also tried "(?s)^.*$(?<=[aeiou]{3})", in an attempt to tell the regex engine to just ignore everything else, and just check the last 3 characters, but of course this is still O(N) (which follows from the first section above).
In this particular scenario, however, regex can be made useful by combining it with substring. That is, instead of seeing if the whole string matches the pattern, you can manually restrict the pattern to attempt to match only the last 3 characters substring. In general, if you know before hand that the pattern has a finite length maximum match, you can substring the necessary amount of characters from the end of a very long string and regex on just that part.
Test harness
static void testAnchors() {
String pattern = "(?s)^.*[aeiou]{3}$";
for (int N = 1; N < 20; N++) {
String needle = stringLength(1 << N) + "ooo";
System.out.println(N);
boolean b = true;
for (int REPS = 10000; REPS --> 0; ) {
b &=
needle
//.substring(needle.length() - 3) // try with this
.matches(pattern);
}
System.out.println(b);
}
}
The string length in this test grows exponentially. If you run this test, you will find that it starts to really slow down after 10 (i.e. string length 1024). If you uncomment the substring line, however, the entire test will complete in no time (which also confirms that the problem is not because I didn't use Pattern.compile, which would yield a constant improvement at best, but rather because the patttern takes O(N) to match, which is problematic when the asymptotic growth of N is exponential).
Conclusion
It seems that Java regex does little to no optimization based on the pattern. Suffix matching in particular is especially costly, because the regex still needs to go through the entire length of the string.
Thankfully, doing the regex on the chopped suffix using substring (if you know the maximum length of the match) can still allow you to use regex for suffix matching in time independent of the length of the input string.
//update: actually I just realized that this applies to prefix matching too. Java regex matches a O(1) length prefix pattern in O(N). That is, "(?s)^[aeiou]{3}.*$" checks if a string starts with 3 lowercase letters in O(N) when it should be optimizable to O(1).
I thought prefix matching would be more regex-friendly, but I don't think it's possible to come up with a O(1)-runtime pattern to match the above (unless someone can prove me wrong).
Obviously you can do the s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$") "trick", but the pattern itself is still O(N); you've just manually reduced N to a constant by using substring.
So for any kind of finite-length prefix/suffix matching of a really long string, you should preprocess using substring before using regex; otherwise it's O(N) where O(1) suffices.
In my tests, I found the following:
Using java's String.split method (which uses regex) took 2176ms under 1,000,000 iterations.
Using this custom split method took 43ms under 1,000,000 iterations.
Of course, it will only work if your "regex" is completely literal, but in those cases,
it will be much faster.
List<String> array = new ArrayList<String>();
String split = "ab";
String string = "aaabaaabaa";
int sp = 0;
for(int i = 0; i < string.length() - split.length(); i++){
if(string.substring(i, i + split.length()).equals(split)){
//Split point found
array.add(string.substring(sp, i));
sp = i + split.length();
i += split.length();
}
}
if(sp != 0){
array.add(string.substring(sp, string.length()));
}
return array;
So to answer your question, is it theoretically faster? Yes, absolutely, my algorithm is O(n), where n is the length of the string to split. (I'm not sure what regex would be). Is it practically faster? Well, over 1 million iterations, I saved basically 2 seconds. So, it depends on your needs I guess, but I wouldn't worry too much about backporting all code that uses regex to non-regex versions, and in fact, that might be necessary anyways, if the pattern is very complex, a literal split like this won't work. However, if you're splitting on, say, commas, this method will perform much better, though "much better" is subjective here.
Well, not always but sometimes slow, depends on patterns and implementations.
A quick example, 2x slower than normal replace, but I don't think its that slow.
>>> import time,re
>>>
>>> x="abbbcdexfbeczexczczkef111anncdehbzzdezf" * 500000
>>>
>>> start=time.time()
>>> y=x.replace("bc","TEST")
>>> print time.time()-start,"s"
0.350999832153 s
>>>
>>> start=time.time()
>>> y=re.sub("bc","TEST",x)
>>> print time.time()-start,"s"
0.751000165939 s
>>>
Related
I need find the last index of set of characters in a string. Consider the set of characters be x,y,z and string as Vereador Luiz Pauly Home then I need index as 18.
So for finding the index I have created a pattern with DOTALL flag and greedy quantifier as (?s).*(x|y|z). When the pattern is applied to that string(multiline), I can find out index from the start group. The code:
int findIndex(String str){
int index = -1;
Pattern p = Pattern.compile("(?s).*(x|y|z)");
Matcher m = regex.matcher(str);
if(m.find()){
index = m.start(1);
}
return index;
}
As expected it is returning the values correctly, if there is match.
But if there is no match, then it takes too long time (17 minutes for 600000 characters) as it is a Greedy match.
I tried with other quantifiers, but can't get the desired output. So can anyone refer any better regex?
PS: I can also think about traversing the content from last and finding the index.But I hope there is some better way in regex which can do the job quickly.
There are few ways to solve the problem and the best way will depend on the size of the input and the complexity of the pattern:
Reverse the input string and possibly the pattern, this might work for non-complex patterns. Unfortunately java.util.regex doesn't allow to to match the pattern from right to left.
Instead of using a greedy quantifier simply match the pattern and loop Matcher.find() until last occurrence is found.
Use a different regex engine with better performance e.g. RE2/J: linear time regular expression matching in Java.
If option 2 is not efficient enough for your case I'd suggest to try RE2/J:
Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.
If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
Performance issues with the (?s).*(x|y|z) regex come from the fact the .* pattern is the first subpattern that grabs the whole string first, and then backtracking occurs to find x, y or z. If there is no match, or the match is at the start of the string, and the strings is very large, this might take a really long time.
The ([xyz])(?=[^xyz]*$) pattern seems a little bit better: it captures x, y or z and asserts there is no other x, y or z up to the end of the string, but it also is somewhat resource-consuming due to each lookahead check after a match is found.
The fastest regex to get your job done is
^(?:[^xyz]*+([xyz]))+
It matches
^ - start of string
(?:[^xyz]*+([xyz]))+ - 1 or more repetitions of
[^xyz]*+ - any 0 or more chars other than x, y and z matched possessively (no backtracking into the pattern is allowed)
([xyz]) - Group 1: x, y or z.
The Group 1 value and data will belong to the last iteration of the repeated group (as all the preceding data is re-written with each subsequent iteration).
StringBuilder both has a reverse and is a CharSequence, so searching is possible.
Pattern p = Pattern.compile("[xyz]");
StringBuilder sb = new StringBuilder(str).reverse();
Matcher m = p.matcher(sb);
return m.find() ? sb.length() - m.end() : -1;
Unfortunately reversal is costly.
A solution without regex is probably faster.
(BTW surrogate pairs are handled correctly by the reversal.)
I would like to replace all char '-' that between two numbers, or that between number and '.' by char '&'.For example
String input= "2.1(-7-11.3)-12.1*-2.3-.11"
String output= "2.1(-7&11.3)-12.1*-2.3&.11"
I have something like this, but I try to do it easier.
public void preperString(String input) {
input=input.replaceAll(" ","");
input=input.replaceAll(",",".");
input=input.replaceAll("-","&");
input=input.replaceAll("\\(&","\\(-");
input=input.replaceAll("\\[&","\\[-");
input=input.replaceAll("\\+&","\\+-");
input=input.replaceAll("\\*&","\\*-");
input=input.replaceAll("/&","/-");
input=input.replaceAll("\\^&","\\^-");
input=input.replaceAll("&&","&-");
input=input.replaceFirst("^&","-");
for (String s :input.split("[^.\\-\\d]")) {
if (!s.equals(""))
numbers.add(Double.parseDouble(s));
}
You can make it in one shot using groups of regex to solve your problem, you can use this :
String input = "2.1(-7-11.3)-12.1*-2.3-.11";
input = input.replaceAll("([\\d.])-([\\d.])", "$1&$2");
Output
2.1(-7&11.3)-12.1*-2.3&.11
([\\d.])-([\\d.])
// ^------------replace the hyphen(-) that it between
// ^__________^--------two number(\d)
// ^_^______^_^------or between number(\d) and dot(.)
regex demo
Let me guess. You don't really have a use for & here; you're just trying to replace certain minus signs with & so that they won't interfere with the split that you're trying to use to find all the numbers (so that the split doesn't return "-7-11" as one of the array elements, in your original example). Is that correct?
If my guess is right, then the correct answer is: don't use split. It is the wrong tool for the job. The purpose of split is to split up a string by looking for delimiter patterns (such as a sequence of whitespace or a comma); but where the format of the elements between the delimiters doesn't much matter. In your case, though, you are looking for elements of a particular numeric format (it might start with -, and otherwise will have at least one digit and at most one period; I don't know what your exact requirements are). In this case, instead of split, the right way to do this is to create a regular expression for the pattern you want your numbers to have, and then use m.find in a loop (where m is a Matcher) to get all your numbers.
If you need to treat some - characters differently (e.g. in -7-11, where you want the second - to be an operator and not part of -11), then you can make special checks for that in your loop, and skip over the - signs that you know you want to treat as operators.
It's simpler, readers will understand what you're trying to do, and it's less error-prone because all you have to do is make sure your pattern for expressing numbers accurately reflects what you're looking for.
It's common for newer Java programmers to think regexes and split are magic tools that can solve everything. But often the result ends up being too complex (code uses overly complicated regexes, or relies on trickery like having to replace characters with & temporarily). I cannot look at your original code and convince myself that it works right. It's not worth it.
You can use lookahead and lookbehind to match digit or dot:
input.replaceAll("(?<=[\\d\\.])-(?=[\\d\\.])","&")
Have a look on this fiddle.
To clarify, I want to know how to use a regex to match:
ab
aabb
aaabbb
...
I just found out that this works in Perl:
if ($exp =~ /^(a(?1)?b)$/)
To understand this, look at the string as though it grows from the outside-in, not left-right:
ab
a(ab)b
aa(ab)bb
(?1) is a reference to the outer set of parentheses. We need the ? afterwards for the last case (going from outside in), nothing is left and ? means 0 or 1 of the preceding expression (so it essentially acts as our base case).
My questions is: What is the equivalent of (?1) in Java?
In general, regular expressions are limited to regular languages, which means that since they are equivalent to languages acceptable by DFAs (discrete finite automata), they can't count, as generalized counting requires an infinite number of states to represent the infinite number of possible counted values.
Since discrete != infinte, you can't really count, but you can do some limited types of matching like in your (a (something) b) example.
Discusses a few limitations of DFAs (and Regular Languages / Regular Expressions by extension)
http://www.cs.washington.edu/education/courses/cse599/99sp/admin/Slides/Week2/sld012.htm
A better, but more verbose slide that expands on the limitations by going into DFAs in (still a bit high level) detail
http://www.cs.princeton.edu/courses/archive/spr05/cos126/lectures/18.pdf
By the way, the inside-out expansion is a cool DFA trick to basically side-step the actual need to count by using pattern matching that happens to rebuild the mirror image of the string. It looks like counting, but will fall apart as soon as you do more interesting things, like require matching in order (as opposed to mirrored order matching).
You can use the Pattern and Matcher classes to count the number of occurrences of one of your two characters, then enforce this number in your regular expression using the {n} syntax.
import java.util.regex.*;
class Test {
public static void main(String[] args) {
String s = "aaaabbbb";
Pattern pattern = Pattern.compile("a");
Matcher matcher = pattern.matcher(s);
// count all 'a's
int count = 0;
while (matcher.find())
count++;
// now enforce count matches for a and b in your regular expression
String rExp = String.format("a{%d}b{%d}", count, count);
Pattern matchSameCount = Pattern.compile(rExp);
Matcher m2 = matchSameCount.matcher(s);
System.out.println( m2.matches());
// prints true
}
}
Overall it is a little more work, but it's the only way I can think of that actually works at the moment.
That language is not regular (see explanation below). So, you can't match that kind of language with regular expressions.
You are going to need an stack to hold repeated words.
I recommend you reading the following links:
NFA: http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
Explanation of why this language is not regular: Why is {a^nb^n | n >= 0} not regular?
Not being a Java developer, but plenty of experience in Perl, why not just /^a+b+$/ ?
I'm wondering if I've misunderstood your Q!
;-)
I'm attempting to create a Regular Expressions code in Java that will have a conditional search term.
What I mean by this is let's say I have 5 words; tree, car, dog, cat, bird. Now I would like the expression to search for these terms, however is only required to match 3 out of the five, and it could be any of the 5 it chooses to match.
I thought perhaps a using a back reference ?(3) would work but doesn't seem to do the trick.
A standard optional search (?) wouldn't work either because all terms are optional, however the number of matches required is not. Essentially is there a way to create a string that must be 50% (or any percent) correct to provide a match?
Would anyone happen to know or could point me in the right direction?
(I would hopefully like it working client side if possible)
Does it have to be a free-standing regular expression without any further code? A simple loop testing for each word and counting matches should do this perfectly. Pseudocode assuming you want N unique matches (you can also swap the substring test with a regex, doesn't matter how you determine matches as long as you keep the counting of unique matches out of the regex):
bool has_N_words(int n, string[] words, string text) {
int matches = 0;
foreach word in words {
if (word.substringOf(text)) counter++
if (counter >= n) return true
}
return false
}
It seems to me the only (save mind-blowing uses of obscure regex extensions - not that I have something in mind, I've just been surprised again and again what modern regex implementations allow) way to do this with an regular expression goes like this:
Enumerate all unique (ignoring order or not depending on implementation, see below) permutations of words
For each permutation, build a sub-regex that matches a string containing those words, either by
joining the first three words with .*? (this requires all unique permutations)
using three lookahead assertions like (?=.*word) (this allows dropping word combinations that occured before in a different order)
Combine all sub-regexes in a giant or.
That's impractical to do by hand, ugly and complex (as in computational complexity, not in programming effort) to do automatically, and inefficient as well as quite hacky either way.
I don't see why you would want to do this with a regext but if you really need it to be a regex:
/(tree|car|dog|cat|bird)/
Then count the matches you get from that...
(?i)(?s)(.*(tree|car|dog|cat|bird)){3,}?.*
The (?i) is for case insensitive and the (?s) to match new lines with .* also, since you are looking at emails.
The ? at the end is the reluctant quantifier.
I haven't actually tried it.
I'm trying to understand regex as much as I can, so I came up with this regex-based solution to codingbat.com repeatEnd:
Given a string and an int N, return a string made of N repetitions of the last N characters of the string. You may assume that N is between 0 and the length of the string, inclusive.
public String repeatEnd(String str, int N) {
return str.replaceAll(
".(?!.{N})(?=.*(?<=(.{N})))|."
.replace("N", Integer.toString(N)),
"$1"
);
}
Explanation on its parts:
.(?!.{N}): asserts that the matched character is one of the last N characters, by making sure that there aren't N characters following it.
(?=.*(?<=(.{N}))): in which case, use lookforward to first go all the way to the end of the string, then a nested lookbehind to capture the last N characters into \1. Note that this assertion will always be true.
|.: if the first assertion failed (i.e. there are at least N characters ahead) then match the character anyway; \1 would be empty.
In either case, a character is always matched; replace it with \1.
My questions are:
Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
Is there a simpler regex-based solution?
Bonus question
Do repeatBegin (as analogously defined).
I'm honestly having troubles with this one!
Nice one! I don't see a way to significantly improve on that regex, although I would refactor it to avoid the needless use of negative logic:
".(?=.{N})|.(?=.*(?<=(.{N})))"
This way the second alternative is never entered until you reach the final N characters, which I think makes the intent a little clearer.
I've never seen a reference that says it's okay to nest lookarounds, but like Bart, I don't see why it wouldn't be. I sometimes use lookaheads inside lookbehinds to get around limitations on variable-length lookbehind expressions.
EDIT: I just realized I can simplify the regex quite a bit by putting the alternation inside the lookahead:
".(?=.{N}|.*(?<=(.{N})))"
By the way, have you considered using format() to build the regex instead of replace()?
return str.replaceAll(
String.format(".(?=.{%1$d}|.*(?<=(.{%1$d})))", N),
"$1"
);
Whoa, that's some scary regex voodoo there! : )
Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
Yes, that is perfectly valid in most PCRE implementations I know of.
Is there a simpler regex-based solution?
I didn't spend too much time on it, but I don't quickly see how that could be simplified or shortened with a single regex replacement.
Is there a simpler regex-based solution?
It took me a while, but eventually I managed to simplify the regex to:
"(?=.{0,N}$(?<=(.{N}))).|." // repeatEnd
-or-
".(?<=^(?=(.{N})).{0,N})|." // repeatBegin
Like Alan Moore's answer, this removes the negative assertion, but doesn't even replace it with a positive one, so it now only has 2 assertions instead of 3.
I also like the fact that the "else" case is just a simple .. I prefer to put the bulk of my regex into the "working" side of the alternation, and keep the "non-working" side as simple as possible (usually a simple . or .*).