Conditional Regex searches

I'm attempting to write a regular expression in Java that has a conditional search term.
What I mean is: let's say I have 5 words: tree, car, dog, cat, bird. I would like the expression to search for these terms, but it is only required to match 3 of the 5, and it can be any 3 of them.
I thought perhaps using a back reference ?(3) would work, but it doesn't seem to do the trick.
A standard optional search (?) wouldn't work either, because all terms are optional but the number of matches required is not. Essentially, is there a way to create a pattern that must be 50% (or any percent) correct to produce a match?
Would anyone happen to know or could point me in the right direction?
(I would hopefully like it working client side if possible)

Does it have to be a free-standing regular expression without any further code? A simple loop testing for each word and counting matches should do this perfectly. A sketch in Java, assuming you want N unique matches (you can also swap the substring test for a regex; it doesn't matter how you determine matches as long as you keep the counting of unique matches out of the regex):
static boolean hasNWords(int n, String[] words, String text) {
    int matches = 0;
    for (String word : words) {
        // Each word is tested once, so every hit is a distinct word
        // (assuming the words array itself has no duplicates)
        if (text.contains(word)) matches++;
        if (matches >= n) return true;
    }
    return false;
}
It seems to me the only way to do this with a regular expression (save mind-blowing uses of obscure regex extensions; not that I have something in mind, I've just been surprised again and again by what modern regex implementations allow) goes like this:
Enumerate all unique (ignoring order or not depending on implementation, see below) permutations of words
For each permutation, build a sub-regex that matches a string containing those words, either by
joining the first three words with .*? (this requires all unique permutations)
using three lookahead assertions like (?=.*word) (this allows dropping word combinations that occurred before in a different order)
Combine all sub-regexes in a giant or.
That's impractical to do by hand, ugly and complex (as in computational complexity, not in programming effort) to do automatically, and inefficient as well as quite hacky either way.
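The lookahead variant of the construction above can be generated instead of written by hand. What follows is only a minimal sketch under stated assumptions, not a recommendation: the class name AtLeastNWordsRegex and its build method are made up for illustration, and words are matched as plain substrings (so "car" also hits inside "scare").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: one alternative per n-word combination, each alternative
// being a run of lookaheads such as (?=.*tree)(?=.*car)(?=.*dog).
final class AtLeastNWordsRegex {

    static Pattern build(List<String> words, int n) {
        if (n < 1 || n > words.size()) {
            throw new IllegalArgumentException("n must be between 1 and words.size()");
        }
        List<String> alternatives = new ArrayList<>();
        collect(words, n, 0, "", alternatives);
        // (?s) lets . cross line breaks; ^ pins the lookaheads to the start of the text.
        return Pattern.compile("(?s)^(?:" + String.join("|", alternatives) + ")");
    }

    // Enumerates combinations (order ignored) by only picking words to the right of the last pick.
    private static void collect(List<String> words, int remaining, int from,
                                String prefix, List<String> out) {
        if (remaining == 0) {
            out.add(prefix);
            return;
        }
        for (int i = from; i <= words.size() - remaining; i++) {
            String lookahead = "(?=.*" + Pattern.quote(words.get(i)) + ")";
            collect(words, remaining - 1, i + 1, prefix + lookahead, out);
        }
    }

    public static void main(String[] args) {
        Pattern p = build(Arrays.asList("tree", "car", "dog", "cat", "bird"), 3);
        System.out.println(p.matcher("the dog chased the cat up a tree").find());
        System.out.println(p.matcher("a bird and a dog").find());
    }
}
```

For 5 words and n = 3 this produces 10 alternatives; the count grows combinatorially with the word list, which is the complexity problem described above.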

I don't see why you would want to do this with a regex, but if you really need it to be a regex:
/(tree|car|dog|cat|bird)/
Then count the matches you get from that...
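A minimal sketch of "count the matches" in Java, assuming that repeated hits of the same word should count only once and that whole-word, case-insensitive matching is wanted (both are assumptions, as is the class name DistinctWordCounter):

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DistinctWordCounter {
    // \b word boundaries and CASE_INSENSITIVE are assumptions, not part of the answer's pattern.
    private static final Pattern WORDS =
            Pattern.compile("\\b(tree|car|dog|cat|bird)\\b", Pattern.CASE_INSENSITIVE);

    static boolean hasAtLeast(String text, int required) {
        Set<String> seen = new HashSet<>();
        Matcher m = WORDS.matcher(text);
        while (m.find()) {
            // Normalise case so "Dog" and "dog" count as one word
            seen.add(m.group(1).toLowerCase(Locale.ROOT));
            if (seen.size() >= required) {
                return true; // stop scanning as soon as enough distinct words were seen
            }
        }
        return seen.size() >= required;
    }

    public static void main(String[] args) {
        System.out.println(hasAtLeast("tree car bird", 3));
        System.out.println(hasAtLeast("Dog, dog, DOG and a cat", 3));
    }
}
```

The regex only finds candidates; the counting stays in ordinary code, which is what keeps the pattern small.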

(?i)(?s)(.*(tree|car|dog|cat|bird)){3,}?.*
The (?i) is for case insensitive and the (?s) to match new lines with .* also, since you are looking at emails.
The ? at the end is the reluctant quantifier.
I haven't actually tried it.

Related

Regex function to find specific depth in recursive

I have a scenario where I am supposed to use a regex (Java/PCRE) on a line of code to strip off certain function calls and keep only the innermost value, as in the example below:
Input
ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))
Output : Replace Regex
ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)
Here CInt, Math.Truncate, and ObjectToNumber are removed and only their innermost argument is retained in the output, as shown above.
The functions CInt and Math.Truncate keep changing to CStr, Math.Random, etc., so the regex cannot be hardcoded.
I tried a lot of options from Stack Overflow, but most did not work.
It would also be nice if the query were customizable, e.g. so that CInt returns everything the CInt call wraps (find a name, then everything between its first ( and the matching ), skipping over balanced parenthesis pairs in between).

I know it's not pretty, but it's your fault to use raw regex for this :)
@Test
void unwrapCIntCall() {
    String input = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
    String expectedOutput = "ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)";
    String output = input.replaceAll("CInt\\s*\\(\\s*Math\\.Truncate\\s*\\(\\s*ObjectToNumber\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
    assertEquals(expectedOutput, output);
}
Now some explanation; the \\s* parts allow any number of any whitespace character, where they are. In the pattern, I used (.*) in the middle, which means I match anything there, but it's fine*. I used (.*) instead of .* so that particular section gets captured as capturing group $1 (because $0 is always the whole match). The interesting part being captured, I can refer them in the replacement string.
*as long as you don't have multiple of such assignments within one string. Otherwise, you should break up the string into parts which contain only one such assignment and apply this replacement for each of those strings. Or, try (.*?) instead of (.*), it compiles for me - AFAIK that makes the .* match as few characters as possible.
If the methods actually being called vary, then replace their names in the regex with the variations you expect, e.g. replace CInt with (?:CInt|CStr), Math\\.Truncate with Math\\.(?:Truncate|Random), etc. (Using (?: instead of ( makes the group non-capturing, so it won't take up a $1, $2, etc. slot.)
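As a small illustration of non-capturing alternation, here is a sketch that accepts CInt or CStr and Math.Truncate or Math.Random while keeping the innermost argument as $1. In Java the non-capturing group syntax is (?:...). The class name UnwrapCall is hypothetical, and the sketch uses ([^()]*?) for the argument, so it assumes the innermost argument contains no parentheses.

```java
final class UnwrapCall {
    // (?:...) groups alternatives without capturing, so the only numbered group is the argument.
    private static final String WRAPPED =
            "(?:CInt|CStr)\\s*\\(\\s*"
          + "Math\\.(?:Truncate|Random)\\s*\\(\\s*"
          + "ObjectToNumber\\s*\\(\\s*"
          + "([^()]*?)"                      // $1: innermost argument, no nested parentheses
          + "\\s*\\)\\s*\\)\\s*\\)";

    static String unwrap(String line) {
        return line.replaceAll(WRAPPED, "$1");
    }

    public static void main(String[] args) {
        System.out.println(unwrap(
                "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))"));
        System.out.println(unwrap("x = CStr(Math.Random(ObjectToNumber( a.b )))"));
    }
}
```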
If that gets too complicated, then you should really consider whether you want to do it with regex at all, or whether it'd be easier to just write a somewhat longer function with plain string methods like indexOf and substring :)
Bonus; if absolutely everything varies, but the call depth, then you might try this one:
String output = input.replaceAll("[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
Yes, it's definitely a nightmare to read, but as far as I understand, you are after this monster :)
You can use ([^()]*) instead of (.*) to prevent deeper nested expressions. Note, that fine control of depth is a real weakness of everyday regular expressions.
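Since depth control is where regex struggles, the "everything between the first ( and its matching )" request from the question can be sketched with plain string methods instead. This is an illustrative helper only (CallArgument and argumentOf are made-up names); it assumes the function name appears as a call in the text and that no parentheses occur inside string literals.

```java
final class CallArgument {

    // Returns the text between the parentheses of the first call to functionName,
    // or null when the name or a balanced pair of parentheses is not found.
    static String argumentOf(String text, String functionName) {
        int nameAt = text.indexOf(functionName);
        if (nameAt < 0) {
            return null;
        }
        int open = text.indexOf('(', nameAt + functionName.length());
        if (open < 0) {
            return null;
        }
        int depth = 0;
        for (int i = open; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                // Back at depth 0: this ) closes the ( that followed the function name
                return text.substring(open + 1, i);
            }
        }
        return null; // parentheses were not balanced
    }

    public static void main(String[] args) {
        String line = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
        System.out.println(argumentOf(line, "CInt"));
        System.out.println(argumentOf(line, "ObjectToNumber"));
    }
}
```

Counting depth in a loop handles any nesting level, which a fixed regex like the ones above cannot do.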

java String.replaceAll char between two numbers

I would like to replace every '-' that is between two numbers, or between a number and a '.', with '&'. For example:
String input= "2.1(-7-11.3)-12.1*-2.3-.11"
String output= "2.1(-7&11.3)-12.1*-2.3&.11"
I have something like this, but I try to do it easier.
public void preperString(String input) {
    input = input.replaceAll(" ", "");
    input = input.replaceAll(",", ".");
    input = input.replaceAll("-", "&");
    input = input.replaceAll("\\(&", "\\(-");
    input = input.replaceAll("\\[&", "\\[-");
    input = input.replaceAll("\\+&", "\\+-");
    input = input.replaceAll("\\*&", "\\*-");
    input = input.replaceAll("/&", "/-");
    input = input.replaceAll("\\^&", "\\^-");
    input = input.replaceAll("&&", "&-");
    input = input.replaceFirst("^&", "-");
    // numbers is a collection of Double declared elsewhere (not shown)
    for (String s : input.split("[^.\\-\\d]")) {
        if (!s.equals(""))
            numbers.add(Double.parseDouble(s));
    }
}

You can do it in one shot with capturing groups:
String input = "2.1(-7-11.3)-12.1*-2.3-.11";
input = input.replaceAll("([\\d.])-([\\d.])", "$1&$2");
Output
2.1(-7&11.3)-12.1*-2.3&.11
([\\d.])-([\\d.])
// -       the hyphen to replace
// [\\d.]  on each side of it: a digit (\d) or a dot (.), captured as $1 and $2

Let me guess. You don't really have a use for & here; you're just trying to replace certain minus signs with & so that they won't interfere with the split that you're trying to use to find all the numbers (so that the split doesn't return "-7-11" as one of the array elements, in your original example). Is that correct?
If my guess is right, then the correct answer is: don't use split. It is the wrong tool for the job. The purpose of split is to split up a string by looking for delimiter patterns (such as a sequence of whitespace or a comma); but where the format of the elements between the delimiters doesn't much matter. In your case, though, you are looking for elements of a particular numeric format (it might start with -, and otherwise will have at least one digit and at most one period; I don't know what your exact requirements are). In this case, instead of split, the right way to do this is to create a regular expression for the pattern you want your numbers to have, and then use m.find in a loop (where m is a Matcher) to get all your numbers.
If you need to treat some - characters differently (e.g. in -7-11, where you want the second - to be an operator and not part of -11), then you can make special checks for that in your loop, and skip over the - signs that you know you want to treat as operators.
It's simpler, readers will understand what you're trying to do, and it's less error-prone because all you have to do is make sure your pattern for expressing numbers accurately reflects what you're looking for.
It's common for newer Java programmers to think regexes and split are magic tools that can solve everything. But often the result ends up being too complex (code uses overly complicated regexes, or relies on trickery like having to replace characters with & temporarily). I cannot look at your original code and convince myself that it works right. It's not worth it.
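A minimal sketch of the Matcher.find loop suggested above. The number format is an assumption (optional sign, digits with an optional fraction, or a bare fraction like .11), as is the rule that a '-' directly after a digit or dot is an operator rather than a sign. The class name NumberScanner is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberScanner {
    // (?:(?<![\d.])-)?  : a minus sign counts only when it does NOT follow a digit or dot
    // \d+(?:\.\d*)?     : digits with an optional fraction, e.g. 7 or 11.3
    // \.\d+             : a bare fraction, e.g. .11
    private static final Pattern NUMBER =
            Pattern.compile("(?:(?<![\\d.])-)?(?:\\d+(?:\\.\\d*)?|\\.\\d+)");

    static List<Double> numbersIn(String expression) {
        List<Double> numbers = new ArrayList<>();
        Matcher m = NUMBER.matcher(expression);
        while (m.find()) {
            numbers.add(Double.parseDouble(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // prints [2.1, -7.0, 11.3, -12.1, -2.3, 0.11]
        System.out.println(numbersIn("2.1(-7-11.3)-12.1*-2.3-.11"));
    }
}
```

No temporary '&' placeholder is needed: the pattern describes what a number looks like, and everything else is simply skipped.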

You can use lookahead and lookbehind to match digit or dot:
input.replaceAll("(?<=[\\d\\.])-(?=[\\d\\.])","&")
Have a look on this fiddle.
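One practical difference between the lookaround version and the capturing-group version from the other answer: capturing groups consume the characters on both sides of the hyphen, so in a chain like 1-2-3 the middle digit cannot serve as the left neighbour of the second hyphen. Lookarounds consume nothing. A small comparison sketch (the class and method names are hypothetical):

```java
final class HyphenReplace {

    // Capturing groups: the neighbours are part of the match and are put back via $1 and $2.
    static String withGroups(String s) {
        return s.replaceAll("([\\d.])-([\\d.])", "$1&$2");
    }

    // Lookarounds: only the hyphen itself is matched; the neighbours are merely inspected.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<=[\\d.])-(?=[\\d.])", "&");
    }

    public static void main(String[] args) {
        String input = "2.1(-7-11.3)-12.1*-2.3-.11";
        System.out.println(withGroups(input));       // 2.1(-7&11.3)-12.1*-2.3&.11
        System.out.println(withLookarounds(input));  // 2.1(-7&11.3)-12.1*-2.3&.11
        System.out.println(withGroups("1-2-3"));      // 1&2-3
        System.out.println(withLookarounds("1-2-3")); // 1&2&3
    }
}
```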

Does the Java regex library optimize for any characters .*?

I have a wrapper class for matching regular expressions. Obviously, you compile a regular expression into a Pattern like this.
Pattern pattern = Pattern.compile(regex);
But suppose I used a .* to specify any number of characters. So it's basically a wildcard.
Pattern pattern = Pattern.compile(".*");
Does the pattern optimize to always return true and not really calculate anything? Or should I have my wrapper implement that optimization? I am doing this because I could easily process hundreds of thousands of regex operations in a process. If a regex parameter is null I coalesce it to a .*

In your case, I could just use a possessive quantifier to avoid any backtracking:
.*+
The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically.
Here is what Cristian Mocanu writes in his article Optimizing regular expressions in Java about a case similar to .*:
Java regex engine was not able to optimize the expression .*abc.*. I expected it would search for abc in the input string and report a failure very quickly, but it didn't. On the same input string, using String.indexOf("abc") was three times faster then my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as .{100}abc.* the engine will match it more than ten times faster. Why? Because now the mandatory string abc is at a known position inside the string (there should be exactly one hundred characters before it).
Some of the hints on Java regex optimization from the same source:
If the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.
Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression \d{100} is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.
Don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match
If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches().
Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
Beware of alternation. Regular expressions like (X|Y|Z) have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of (abcd|abef) use ab(cd|ef).
Whenever you are using negated character classes to match something other than something else, use possessive quantifiers: instead of [^a]*a use [^a]*+a.
Non-matching strings may cause your code to freeze more often than those that contain a match. Remember to always test your regular expressions using non-matching strings first!
Beware of a known bug #5050507 (when the regex Pattern class throws a StackOverflowError), if you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can also sometimes even increase performance.
Instead of lazy dot matching, use a tempered greedy token (e.g. (?:(?!something).)*) or the unrolling-the-loop technique (got downvoted for it today, no idea why).
Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.
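For the wrapper in the question, one option is to not run the regex engine at all when no regex was supplied, rather than coalescing null to .* and hoping for an optimization. The sketch below is an assumption about what such a wrapper could look like (RegexFilter is a made-up name); it also reuses a single Matcher through reset(), as the hints above suggest. Note that it is not thread-safe, because a Matcher holds mutable state, and that "always true" is slightly broader than .* with matches(), which rejects input containing line breaks unless DOTALL is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexFilter {
    // null means "no regex given": every input is accepted without touching the engine
    private final Matcher matcher;

    RegexFilter(String regex) {
        this.matcher = (regex == null) ? null : Pattern.compile(regex).matcher("");
    }

    boolean matches(String input) {
        if (matcher == null) {
            return true; // O(1) shortcut instead of matching ".*"
        }
        // reset(input) re-targets the existing Matcher instead of allocating a new one
        return matcher.reset(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(new RegexFilter(null).matches("anything\nat all"));
        RegexFilter f = new RegexFilter("a+b");
        System.out.println(f.matches("aaab"));
        System.out.println(f.matches("aab c"));
    }
}
```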

When should we use Pattern and Matcher? [duplicate]

I've seen people here make comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.
I haven't really used regex in an industrial setting, so I'm curious if there are applications where regex is demonstrably just too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.
Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.
What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.
Note: I would very much appreciate answers that demonstrate the following:
a beginner-level regex solution to a non-toy real-life problem that performs horribly
the simple non-regex solution
the expert-level regex rewrite that performs comparably

I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.
The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.
In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).
1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
42,12,13,12,32,13,14,43,56,31,78,10 // match
42,12,13,12,32,14,13,43,56,31,78,10 // don't match
We use a regex containing exactly 11 commas:
".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"
This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)
A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
So we replace the greedy quantifier with the reluctant one:
".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"
This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.
A performant regex replaces the dot by the character class "[^,]":
"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"
(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)
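To make the comparison concrete, here is a sketch that checks the three sample lines with the [^,] pattern and with the split() approach. It demonstrates correctness only and does not reproduce the timings above; the class name SixthFieldCheck and the field-count check in viaSplit are my own additions.

```java
import java.util.regex.Pattern;

final class SixthFieldCheck {
    // Compiled once; [^,]* cannot run past a comma, so there is little to backtrack into.
    private static final Pattern SIXTH_IS_13 = Pattern.compile(
            "[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*");

    static boolean viaRegex(String line) {
        return SIXTH_IS_13.matcher(line).matches();
    }

    static boolean viaSplit(String line) {
        // limit -1 keeps trailing empty fields so the count check is reliable (an added assumption)
        String[] fields = line.split(",", -1);
        return fields.length == 12 && fields[5].equals("13");
    }

    public static void main(String[] args) {
        String[] lines = {
            "1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12",
            "42,12,13,12,32,13,14,43,56,31,78,10",
            "42,12,13,12,32,14,13,43,56,31,78,10"
        };
        for (String line : lines) {
            System.out.println(viaRegex(line) + " " + viaSplit(line) + "  " + line);
        }
    }
}
```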

I experimented a bit with the performance of various constructs, and unfortunately I discovered that Java regex doesn't perform what I consider very doable optimizations.
Java regex takes O(N) to match "(?s)^.*+$"
This is very disappointing. It's understandable for ".*" to take O(N), but with the optimization "hints" in the form of anchors (^ and $) and single-line mode Pattern.DOTALL/(?s), even making the repetition possessive (i.e. no backtracking), the regex engine still could not see that this will match every string, and still have to match in O(N).
This pattern isn't very useful, of course, but consider the next problem.
Java regex takes O(N) to match "(?s)^A.*Z$"
Again, I was hoping that the regex engine can see that thanks to the anchors and single-line mode, this is essentially the same as the O(1) non-regex:
s.startsWith("A") && s.endsWith("Z")
Unfortunately, no, this is still O(N). Very disappointing. Still, not very convincing because a nice and simple non-regex alternative exists.
Java regex takes O(N) to match "(?s)^.*[aeiou]{3}$"
This pattern matches strings that ends with 3 lowercase vowels. There is no nice and simple non-regex alternative, but you can still write something non-regex that matches this in O(1), since you only need to check the last 3 characters (for simplicity, we can assume that the string length is at least 3).
I also tried "(?s)^.*$(?<=[aeiou]{3})", in an attempt to tell the regex engine to just ignore everything else, and just check the last 3 characters, but of course this is still O(N) (which follows from the first section above).
In this particular scenario, however, regex can be made useful by combining it with substring. That is, instead of seeing if the whole string matches the pattern, you can manually restrict the pattern to attempt to match only the last 3 characters substring. In general, if you know before hand that the pattern has a finite length maximum match, you can substring the necessary amount of characters from the end of a very long string and regex on just that part.
Test harness
static void testAnchors() {
    String pattern = "(?s)^.*[aeiou]{3}$";
    for (int N = 1; N < 20; N++) {
        // stringLength(n) is a helper not shown here; presumably it returns a string of length n
        String needle = stringLength(1 << N) + "ooo";
        System.out.println(N);
        boolean b = true;
        for (int REPS = 10000; REPS --> 0; ) {
            b &=
                needle
                //.substring(needle.length() - 3) // try with this
                .matches(pattern);
        }
        System.out.println(b);
    }
}
The string length in this test grows exponentially. If you run this test, you will find that it starts to really slow down after 10 (i.e. string length 1024). If you uncomment the substring line, however, the entire test completes in no time. This also confirms that the problem is not that I didn't use Pattern.compile, which would yield a constant-factor improvement at best, but that the pattern takes O(N) to match, which is problematic when N grows exponentially.
Conclusion
It seems that Java regex does little to no optimization based on the pattern. Suffix matching in particular is especially costly, because the regex still needs to go through the entire length of the string.
Thankfully, doing the regex on the chopped suffix using substring (if you know the maximum length of the match) can still allow you to use regex for suffix matching in time independent of the length of the input string.
//update: actually I just realized that this applies to prefix matching too. Java regex matches an O(1)-length prefix pattern in O(N). That is, "(?s)^[aeiou]{3}.*$" checks whether a string starts with 3 lowercase vowels in O(N) when it should be optimizable to O(1).
I thought prefix matching would be more regex-friendly, but I don't think it's possible to come up with a O(1)-runtime pattern to match the above (unless someone can prove me wrong).
Obviously you can do the s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$") "trick", but the pattern itself is still O(N); you've just manually reduced N to a constant by using substring.
So for any kind of finite-length prefix/suffix matching of a really long string, you should preprocess using substring before using regex; otherwise it's O(N) where O(1) suffices.
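The substring preprocessing described above can be sketched as follows for the three-vowel suffix. The class and method names are made up for illustration, and returning false for strings shorter than 3 characters is my assumption.

```java
import java.util.regex.Pattern;

final class SuffixCheck {
    // No anchors or leading .* needed: matches() must cover the whole 3-character input anyway.
    private static final Pattern THREE_VOWELS = Pattern.compile("[aeiou]{3}");

    static boolean endsWithThreeVowels(String s) {
        if (s.length() < 3) {
            return false; // too short to end in three vowels
        }
        // The regex only ever sees the last three characters, whatever the length of s
        return THREE_VOWELS.matcher(s.substring(s.length() - 3)).matches();
    }

    public static void main(String[] args) {
        System.out.println(endsWithThreeVowels("xyzaei"));
        System.out.println(endsWithThreeVowels("aeix"));
        System.out.println(endsWithThreeVowels("ae"));
    }
}
```

The work per call no longer depends on the length of the input string, which is the point of the trick.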

In my tests, I found the following:
Using java's String.split method (which uses regex) took 2176ms under 1,000,000 iterations.
Using this custom split method took 43ms under 1,000,000 iterations.
Of course, it will only work if your "regex" is completely literal, but in those cases,
it will be much faster.
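// e.g. literalSplit("aaabaaabaa", "ab") gives [aa, aa, aa]; split must be non-empty
static List<String> literalSplit(String string, String split) {
    List<String> array = new ArrayList<String>();
    int sp = 0; // start of the current piece
    // <= so that a delimiter at the very end of the string is found too
    for (int i = 0; i <= string.length() - split.length(); i++) {
        if (string.substring(i, i + split.length()).equals(split)) {
            // Split point found
            array.add(string.substring(sp, i));
            sp = i + split.length();
            i = sp - 1; // the loop's i++ then lands right after the delimiter
        }
    }
    // Trailing piece; this is the whole string when no delimiter was found
    array.add(string.substring(sp));
    return array;
}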
So to answer your question: is it theoretically faster? Yes, absolutely; my algorithm is O(n), where n is the length of the string to split (I'm not sure what regex would be). Is it practically faster? Well, over 1 million iterations I saved basically 2 seconds. So it depends on your needs, I guess, but I wouldn't worry too much about backporting all code that uses regex to non-regex versions. In fact, regex might be necessary anyway: if the pattern is very complex, a literal split like this won't work. However, if you're splitting on, say, commas, this method will perform much better, though "much better" is subjective here.

Well, regex is not always slow, but sometimes it is; it depends on the pattern and the implementation.
A quick example (in Python): about 2x slower than a plain replace, but I don't think it's that slow.
>>> import time,re
>>>
>>> x="abbbcdexfbeczexczczkef111anncdehbzzdezf" * 500000
>>>
>>> start=time.time()
>>> y=x.replace("bc","TEST")
>>> print time.time()-start,"s"
0.350999832153 s
>>>
>>> start=time.time()
>>> y=re.sub("bc","TEST",x)
>>> print time.time()-start,"s"
0.751000165939 s
>>>

codingBat repeatEnd using regex

I'm trying to understand regex as much as I can, so I came up with this regex-based solution to codingbat.com repeatEnd:
Given a string and an int N, return a string made of N repetitions of the last N characters of the string. You may assume that N is between 0 and the length of the string, inclusive.
public String repeatEnd(String str, int N) {
return str.replaceAll(
".(?!.{N})(?=.*(?<=(.{N})))|."
.replace("N", Integer.toString(N)),
"$1"
);
}
Explanation on its parts:
.(?!.{N}): asserts that the matched character is one of the last N characters, by making sure that there aren't N characters following it.
(?=.*(?<=(.{N}))): in which case, use a lookahead to first go all the way to the end of the string, then a nested lookbehind to capture the last N characters into \1. Note that this assertion will always be true.
|.: if the first assertion failed (i.e. there are at least N characters ahead) then match the character anyway; \1 would be empty.
In either case, a character is always matched; replace it with \1.
My questions are:
Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
Is there a simpler regex-based solution?
Bonus question
Do repeatBegin (as analogously defined).
I'm honestly having troubles with this one!

Nice one! I don't see a way to significantly improve on that regex, although I would refactor it to avoid the needless use of negative logic:
".(?=.{N})|.(?=.*(?<=(.{N})))"
This way the second alternative is never entered until you reach the final N characters, which I think makes the intent a little clearer.
I've never seen a reference that says it's okay to nest lookarounds, but like Bart, I don't see why it wouldn't be. I sometimes use lookaheads inside lookbehinds to get around limitations on variable-length lookbehind expressions.
EDIT: I just realized I can simplify the regex quite a bit by putting the alternation inside the lookahead:
".(?=.{N}|.*(?<=(.{N})))"
By the way, have you considered using format() to build the regex instead of replace()?
return str.replaceAll(
String.format(".(?=.{%1$d}|.*(?<=(.{%1$d})))", N),
"$1"
);
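Put together as a runnable sketch (the wrapping class RepeatEnd is only for illustration), the format()-based version can be checked against the CodingBat examples. It relies on Java's replaceAll inserting nothing for a group that did not take part in the match, which is what makes the characters outside the last N disappear.

```java
final class RepeatEnd {

    static String repeatEnd(String str, int n) {
        // %1$d inserts n in both places: ".{n}" in the lookahead and in the lookbehind
        String regex = String.format(".(?=.{%1$d}|.*(?<=(.{%1$d})))", n);
        // Characters with n or more characters after them leave group 1 unset and vanish;
        // each of the last n characters is replaced by the captured n-character suffix.
        return str.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        System.out.println(repeatEnd("Hello", 3)); // llollollo
        System.out.println(repeatEnd("Hello", 2)); // lolo
        System.out.println(repeatEnd("Hello", 1)); // o
    }
}
```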

Whoa, that's some scary regex voodoo there! : )
Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
Yes, that is perfectly valid in most PCRE implementations I know of.
Is there a simpler regex-based solution?
I didn't spend too much time on it, but I don't quickly see how that could be simplified or shortened with a single regex replacement.

Is there a simpler regex-based solution?
It took me a while, but eventually I managed to simplify the regex to:
"(?=.{0,N}$(?<=(.{N}))).|." // repeatEnd
-or-
".(?<=^(?=(.{N})).{0,N})|." // repeatBegin
Like Alan Moore's answer, this removes the negative assertion, but doesn't even replace it with a positive one, so it now only has 2 assertions instead of 3.
I also like the fact that the "else" case is just a simple .. I prefer to put the bulk of my regex into the "working" side of the alternation, and keep the "non-working" side as simple as possible (usually a simple . or .*).
