codingBat repeatEnd using regex - java

I'm trying to understand regex as much as I can, so I came up with this regex-based solution to codingbat.com repeatEnd:
Given a string and an int N, return a string made of N repetitions of the last N characters of the string. You may assume that N is between 0 and the length of the string, inclusive.
public String repeatEnd(String str, int N) {
return str.replaceAll(
".(?!.{N})(?=.*(?<=(.{N})))|."
.replace("N", Integer.toString(N)),
"$1"
);
}
Explanation of its parts:
.(?!.{N}): asserts that the matched character is one of the last N characters, by making sure that there aren't N characters following it.
(?=.*(?<=(.{N}))): in which case, use a lookahead to first go all the way to the end of the string, then a nested lookbehind to capture the last N characters into \1. Note that this assertion will always be true.
|.: if the first assertion failed (i.e. there are at least N characters ahead) then match the character anyway; \1 would be empty.
In either case, a character is always matched; replace it with \1.
My questions are:
Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
Is there a simpler regex-based solution?
Bonus question
Do repeatBegin (as analogously defined).
I'm honestly having troubles with this one!

Nice one! I don't see a way to significantly improve on that regex, although I would refactor it to avoid the needless use of negative logic:
".(?=.{N})|.(?=.*(?<=(.{N})))"
This way the second alternative is never entered until you reach the final N characters, which I think makes the intent a little clearer.
I've never seen a reference that says it's okay to nest lookarounds, but like Bart, I don't see why it wouldn't be. I sometimes use lookaheads inside lookbehinds to get around limitations on variable-length lookbehind expressions.
EDIT: I just realized I can simplify the regex quite a bit by putting the alternation inside the lookahead:
".(?=.{N}|.*(?<=(.{N})))"
By the way, have you considered using format() to build the regex instead of replace()?
return str.replaceAll(
String.format(".(?=.{%1$d}|.*(?<=(.{%1$d})))", N),
"$1"
);

Whoa, that's some scary regex voodoo there! : )
Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
Yes, that is perfectly valid in most PCRE implementations I know of.
Is there a simpler regex-based solution?
I didn't spend too much time on it, but I don't quickly see how that could be simplified or shortened with a single regex replacement.

Is there a simpler regex-based solution?
It took me a while, but eventually I managed to simplify the regex to:
"(?=.{0,N}$(?<=(.{N}))).|." // repeatEnd
-or-
".(?<=^(?=(.{N})).{0,N})|." // repeatBegin
Like Alan Moore's answer, this removes the negative assertion, but doesn't even replace it with a positive one, so it now only has 2 assertions instead of 3.
I also like the fact that the "else" case is just a simple .. I prefer to put the bulk of my regex into the "working" side of the alternation, and keep the "non-working" side as simple as possible (usually a simple . or .*).
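To try both patterns against the CodingBat-style inputs, here is a minimal sketch. The class name RepeatRegex and the main method are only illustrative; the two patterns are the ones above, with N substituted via replace() as in the question. It relies on Java replacing a reference to a group that did not participate in the match with nothing.

```java
class RepeatRegex {
    // Each of the last n characters is replaced by group 1 (the last n characters);
    // every other character matches the plain "." branch, where group 1 is unset.
    static String repeatEnd(String str, int n) {
        String regex = "(?=.{0,N}$(?<=(.{N}))).|.".replace("N", Integer.toString(n));
        return str.replaceAll(regex, "$1");
    }

    // Mirror image: each of the first n characters is replaced by the first n characters.
    static String repeatBegin(String str, int n) {
        String regex = ".(?<=^(?=(.{N})).{0,N})|.".replace("N", Integer.toString(n));
        return str.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        System.out.println(repeatEnd("Hello", 3));   // llollollo
        System.out.println(repeatBegin("Hello", 2)); // HeHe
    }
}
```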


Regex function to find specific depth in recursive

I have the following scenario where I am supposed to use regex (Java/PCRE) on a line of code to strip off certain defined functions and keep only the value passed to them, as in the example below:
Input
ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))
Output : Replace Regex
ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)
Here CInt, Math.Truncate, and ObjectToNumber are removed, and only the innermost argument is retained, as shown above.
The functions CInt and Math.Truncate may change to CStr, Math.Random, etc., so the regex cannot be hardcoded.
I tried a lot of options on stackoverflow but most did not work.
It would also be nice if the query were customizable, e.g. given CInt it returns everything the CInt call wraps (find a name, then everything between its opening ( and the matching ), allowing for balanced parenthesis pairs in between).
I know it's not pretty, but it's your fault to use raw regex for this :)
@Test
void unwrapCIntCall() {
String input = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
String expectedOutput = "ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)";
String output = input.replaceAll("CInt\\s*\\(\\s*Math\\.Truncate\\s*\\(\\s*ObjectToNumber\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
assertEquals(expectedOutput, output);
}
Now some explanation; the \\s* parts allow any number of any whitespace character, where they are. In the pattern, I used (.*) in the middle, which means I match anything there, but it's fine*. I used (.*) instead of .* so that particular section gets captured as capturing group $1 (because $0 is always the whole match). The interesting part being captured, I can refer them in the replacement string.
*as long as you don't have multiple of such assignments within one string. Otherwise, you should break up the string into parts which contain only one such assignment and apply this replacement for each of those strings. Or, try (.*?) instead of (.*), it compiles for me - AFAIK that makes the .* match as few characters as possible.
If the methods actually being called vary, then replace their names in the regex with the variations you expect, e.g. replace CInt with (?:CInt|CStr), Math\\.Truncate with Math\\.(?:Truncate|Random), etc. (Using (?: instead of ( makes the group non-capturing, so it won't take up a $1, $2, etc. slot.)
If that gets too complicated, then you should really think about whether you want to do it with regex at all, or whether it'd be easier to just write a somewhat longer function with plain string methods, like indexOf and substring :)
Bonus; if absolutely everything varies, but the call depth, then you might try this one:
String output = input.replaceAll("[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
Yes, it's definitely a nightmare to read, but as far as I understand, you are after this monster :)
You can use ([^()]*) instead of (.*) to prevent deeper nested expressions. Note, that fine control of depth is a real weakness of everyday regular expressions.
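As an illustration of the plain-string alternative mentioned above (indexOf and substring), here is a minimal sketch that finds a function name and returns everything between its opening parenthesis and the matching closing one, counting nesting depth along the way. The class and method names are hypothetical, and it assumes the first ( after the name belongs to that call and that parentheses do not appear inside string literals.

```java
class ArgumentExtractor {
    // Returns the text between the parentheses of the first call to name,
    // or null if the name is absent or the parentheses are unbalanced.
    static String argumentOf(String text, String name) {
        int start = text.indexOf(name);
        if (start < 0) return null;
        int open = text.indexOf('(', start + name.length());
        if (open < 0) return null;
        int depth = 0;
        for (int i = open; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return text.substring(open + 1, i); // matching close found
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String input = "ArrayNew(1) = adjustalpha(shadowcolor, "
                + "CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
        System.out.println(argumentOf(input, "ObjectToNumber")); // Me.bezierviewshadow.getTag
    }
}
```

Because the depth is tracked with a counter rather than in the pattern, this works for any nesting depth, which is exactly where the regex versions struggle.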

java String.replaceAll char between two numbers

I would like to replace every '-' character that is between two numbers, or between a number and '.', with '&'. For example:
String input= "2.1(-7-11.3)-12.1*-2.3-.11"
String output= "2.1(-7&11.3)-12.1*-2.3&.11"
I have something like this, but I'd like to do it more simply.
public void preperString(String input) {
input=input.replaceAll(" ","");
input=input.replaceAll(",",".");
input=input.replaceAll("-","&");
input=input.replaceAll("\\(&","\\(-");
input=input.replaceAll("\\[&","\\[-");
input=input.replaceAll("\\+&","\\+-");
input=input.replaceAll("\\*&","\\*-");
input=input.replaceAll("/&","/-");
input=input.replaceAll("\\^&","\\^-");
input=input.replaceAll("&&","&-");
input=input.replaceFirst("^&","-");
// numbers is a collection of Double declared elsewhere (not shown)
for (String s : input.split("[^.\\-\\d]")) {
if (!s.equals(""))
numbers.add(Double.parseDouble(s));
}
}
You can do it in one shot using regex capture groups:
String input = "2.1(-7-11.3)-12.1*-2.3-.11";
input = input.replaceAll("([\\d.])-([\\d.])", "$1&$2");
Output
2.1(-7&11.3)-12.1*-2.3&.11
([\\d.])-([\\d.])
// first ([\\d.]): a digit (\d) or a dot (.) before the hyphen, captured as $1
// -: the hyphen to replace
// second ([\\d.]): a digit (\d) or a dot (.) after the hyphen, captured as $2
Let me guess. You don't really have a use for & here; you're just trying to replace certain minus signs with & so that they won't interfere with the split that you're trying to use to find all the numbers (so that the split doesn't return "-7-11" as one of the array elements, in your original example). Is that correct?
If my guess is right, then the correct answer is: don't use split. It is the wrong tool for the job. The purpose of split is to split up a string by looking for delimiter patterns (such as a sequence of whitespace or a comma); but where the format of the elements between the delimiters doesn't much matter. In your case, though, you are looking for elements of a particular numeric format (it might start with -, and otherwise will have at least one digit and at most one period; I don't know what your exact requirements are). In this case, instead of split, the right way to do this is to create a regular expression for the pattern you want your numbers to have, and then use m.find in a loop (where m is a Matcher) to get all your numbers.
If you need to treat some - characters differently (e.g. in -7-11, where you want the second - to be an operator and not part of -11), then you can make special checks for that in your loop, and skip over the - signs that you know you want to treat as operators.
It's simpler, readers will understand what you're trying to do, and it's less error-prone because all you have to do is make sure your pattern for expressing numbers accurately reflects what you're looking for.
It's common for newer Java programmers to think regexes and split are magic tools that can solve everything. But often the result ends up being too complex (code uses overly complicated regexes, or relies on trickery like having to replace characters with & temporarily). I cannot look at your original code and convince myself that it works right. It's not worth it.
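A minimal sketch of that find loop follows. The number format is an assumption (optional sign, digits with an optional fraction, or a bare fraction such as .11), and the rule for operators follows the question's stated requirement: a - directly preceded by a digit or a dot is treated as an operator, not a sign. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberScanner {
    // Optional sign (only when not directly preceded by a digit or a dot),
    // then digits with an optional fraction, or a bare fraction like ".11".
    private static final Pattern NUMBER =
            Pattern.compile("(?:(?<![\\d.])-)?(?:\\d+(?:\\.\\d*)?|\\.\\d+)");

    static List<Double> numbers(String input) {
        List<Double> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            result.add(Double.parseDouble(m.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [2.1, -7.0, 11.3, -12.1, -2.3, 0.11]
        System.out.println(numbers("2.1(-7-11.3)-12.1*-2.3-.11"));
    }
}
```

No temporary & substitution is needed: the pattern itself decides which minus signs belong to a number.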
You can use lookahead and lookbehind to match digit or dot:
input.replaceAll("(?<=[\\d\\.])-(?=[\\d\\.])","&")
Have a look on this fiddle.
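One practical difference between the capture-group version and the lookaround version shows up when hyphens share a neighbouring digit, as in 1-2-3. The group version consumes the digit on each side, so the middle digit cannot also start the next match; the lookarounds consume only the hyphen. A small sketch (class and method names are illustrative) that makes this visible:

```java
class HyphenReplace {
    // Consumes the characters around the hyphen and puts them back via $1 and $2.
    static String withGroups(String s) {
        return s.replaceAll("([\\d.])-([\\d.])", "$1&$2");
    }

    // Only the hyphen itself is matched; the neighbours are merely inspected.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<=[\\d.])-(?=[\\d.])", "&");
    }

    public static void main(String[] args) {
        System.out.println(withGroups("1-2-3"));      // 1&2-3
        System.out.println(withLookarounds("1-2-3")); // 1&2&3
    }
}
```

On the sample input from the question both versions give the same result, because no two qualifying hyphens are separated by a single character there.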

When should we use Pattern and Matcher? [duplicate]

I've seen people here make comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.
I haven't really used regex in an industrial setting, so I'm curious whether there are applications where regex is demonstrably just too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.
Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.
What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.
Note: I would very much appreciate answers that demonstrate the following:
a beginner-level regex solution to a non-toy real-life problem that performs horribly
the simple non-regex solution
the expert-level regex rewrite that performs comparably
I remember a textbook example of a regex gone bad. Be aware that none of the following approaches is recommended for production use! Use a proper CSV parser instead.
The mistake made in this example is quite common: Using a dot where a narrower character class is better suited.
In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).
1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 // don't match
42,12,13,12,32,13,14,43,56,31,78,10 // match
42,12,13,12,32,14,13,43,56,31,78,10 // don't match
We use a regex containing exactly 11 commas:
".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"
This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)
A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
So we replace the greedy quantifier with the reluctant one:
".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"
This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.
A performant regex replaces the dot by the character class "[^,]":
"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"
(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)
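For reference, the character-class regex and the split() approach can be put side by side as below. This is only a correctness sketch (it does not reproduce the timings), and the class and method names are made up for illustration. The pattern is compiled once and reused.

```java
import java.util.regex.Pattern;

class SixthField {
    // Each [^,]* can never run past a comma, so there is very little to backtrack over.
    private static final Pattern SIXTH_IS_13 = Pattern.compile(
            "[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*");

    static boolean byRegex(String line) {
        return SIXTH_IS_13.matcher(line).matches();
    }

    static boolean bySplit(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        return fields.length == 12 && fields[5].equals("13");
    }

    public static void main(String[] args) {
        String line = "42,12,13,12,32,13,14,43,56,31,78,10";
        System.out.println(byRegex(line) + " " + bySplit(line)); // true true
    }
}
```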
I experimented a bit with the performance of various constructs, and unfortunately I discovered that Java regex doesn't perform what I consider very doable optimizations.
Java regex takes O(N) to match "(?s)^.*+$"
This is very disappointing. It's understandable for ".*" to take O(N), but with the optimization "hints" in the form of anchors (^ and $) and single-line mode Pattern.DOTALL/(?s), even making the repetition possessive (i.e. no backtracking), the regex engine still could not see that this will match every string, and still have to match in O(N).
This pattern isn't very useful, of course, but consider the next problem.
Java regex takes O(N) to match "(?s)^A.*Z$"
Again, I was hoping that the regex engine can see that thanks to the anchors and single-line mode, this is essentially the same as the O(1) non-regex:
s.startsWith("A") && s.endsWith("Z")
Unfortunately, no, this is still O(N). Very disappointing. Still, not very convincing because a nice and simple non-regex alternative exists.
Java regex takes O(N) to match "(?s)^.*[aeiou]{3}$"
This pattern matches strings that end with 3 lowercase vowels. There is no nice and simple non-regex alternative, but you can still write something non-regex that matches this in O(1), since you only need to check the last 3 characters (for simplicity, we can assume that the string length is at least 3).
I also tried "(?s)^.*$(?<=[aeiou]{3})", in an attempt to tell the regex engine to just ignore everything else, and just check the last 3 characters, but of course this is still O(N) (which follows from the first section above).
In this particular scenario, however, regex can be made useful by combining it with substring. That is, instead of checking whether the whole string matches the pattern, you can manually restrict the match attempt to the last 3 characters. In general, if you know beforehand that the pattern has a bounded maximum match length, you can take the necessary number of characters from the end of a very long string with substring and apply the regex to just that part.
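A minimal sketch of that substring-then-regex idea, assuming the three-vowel suffix from the example (the class and method names are illustrative):

```java
import java.util.regex.Pattern;

class SuffixCheck {
    private static final Pattern THREE_VOWELS = Pattern.compile("[aeiou]{3}");

    // Only the last 3 characters are handed to the regex engine,
    // so the cost does not grow with the length of s.
    static boolean endsWithThreeVowels(String s) {
        if (s.length() < 3) {
            return false;
        }
        String tail = s.substring(s.length() - 3);
        return THREE_VOWELS.matcher(tail).matches();
    }

    public static void main(String[] args) {
        System.out.println(endsWithThreeVowels("xyzooo")); // true
        System.out.println(endsWithThreeVowels("xyzoob")); // false
    }
}
```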
Test harness
static void testAnchors() {
String pattern = "(?s)^.*[aeiou]{3}$";
for (int N = 1; N < 20; N++) {
// stringLength(n) is a helper (not shown) that returns a string of length n
String needle = stringLength(1 << N) + "ooo";
System.out.println(N);
boolean b = true;
for (int REPS = 10000; REPS --> 0; ) {
b &=
needle
//.substring(needle.length() - 3) // try with this
.matches(pattern);
}
System.out.println(b);
}
}
The string length in this test grows exponentially. If you run this test, you will find that it starts to really slow down after 10 (i.e. string length 1024). If you uncomment the substring line, however, the entire test will complete in no time (which also confirms that the problem is not because I didn't use Pattern.compile, which would yield a constant improvement at best, but rather because the pattern takes O(N) to match, which is problematic when the asymptotic growth of N is exponential).
Conclusion
It seems that Java regex does little to no optimization based on the pattern. Suffix matching in particular is especially costly, because the regex still needs to go through the entire length of the string.
Thankfully, doing the regex on the chopped suffix using substring (if you know the maximum length of the match) can still allow you to use regex for suffix matching in time independent of the length of the input string.
//update: actually I just realized that this applies to prefix matching too. Java regex matches an O(1)-length prefix pattern in O(N). That is, "(?s)^[aeiou]{3}.*$" checks if a string starts with 3 lowercase vowels in O(N) when it should be optimizable to O(1).
I thought prefix matching would be more regex-friendly, but I don't think it's possible to come up with a O(1)-runtime pattern to match the above (unless someone can prove me wrong).
Obviously you can do the s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$") "trick", but the pattern itself is still O(N); you've just manually reduced N to a constant by using substring.
So for any kind of finite-length prefix/suffix matching of a really long string, you should preprocess using substring before using regex; otherwise it's O(N) where O(1) suffices.
In my tests, I found the following:
Using Java's String.split method (which uses regex) took 2176ms for 1,000,000 iterations.
Using this custom split method took 43ms under 1,000,000 iterations.
Of course, it will only work if your "regex" is completely literal, but in those cases,
it will be much faster.
import java.util.ArrayList;
import java.util.List;

// Splits string on a literal, non-empty delimiter (no regex involved).
static List<String> split(String string, String split) {
List<String> array = new ArrayList<String>();
int sp = 0; // start of the current piece
int i = 0;
while (i <= string.length() - split.length()) {
if (string.startsWith(split, i)) {
// Split point found
array.add(string.substring(sp, i));
i += split.length();
sp = i;
} else {
i++;
}
}
// Add the remainder (the whole string if no delimiter was found)
array.add(string.substring(sp));
return array;
}
So, to answer your question: is it theoretically faster? Yes, absolutely: my algorithm is O(n), where n is the length of the string to split (I'm not sure what the regex version would be). Is it practically faster? Well, over 1 million iterations I saved basically 2 seconds, so it depends on your needs. I wouldn't worry too much about porting all code that uses regex to non-regex versions; in fact, regex may be necessary anyway, because if the pattern is very complex, a literal split like this won't work. However, if you're splitting on, say, commas, this method will perform much better, though "much better" is subjective here.
Well, regex isn't always slow, but it sometimes is; it depends on the pattern and the implementation.
A quick example: here it's 2x slower than a plain replace, but I don't think that's very slow.
>>> import time,re
>>>
>>> x="abbbcdexfbeczexczczkef111anncdehbzzdezf" * 500000
>>>
>>> start=time.time()
>>> y=x.replace("bc","TEST")
>>> print time.time()-start,"s"
0.350999832153 s
>>>
>>> start=time.time()
>>> y=re.sub("bc","TEST",x)
>>> print time.time()-start,"s"
0.751000165939 s
>>>

Very slow look-behind

I'm trying to recover two positions using java regex
The first one is given by the regex:
val r="""(?=(?<=[ ]|^)[^ ]{1,21474836}(?=[ ]|$)(?<=[^A-Z]|^)[A-Z]{1,21474836}(?=[^A-Z]|$))""".r
The second one is given by the regex:
val p="""(?<=(?<=[ ]|^)[^ ]{1,21474836}(?=[ ]|$)(?<=[^A-Z]|^)[A-Z]{1,21474836}(?=[^A-Z]|$))""".r
Note that the two expressions are identical, except that the first "=" is replaced by "<=" in the second expression. I am not using nested quantifiers here.
My command to test it is the following:
r.findAllMatchIn("a <b/>"*100) //.... some long string of size 600...
p.findAllMatchIn("a <b/>"*100) //.... some long string of size 600...
The first example is almost instant during execution, whereas the second takes dozens of seconds. If I launch the same examples in a REPL, both are very fast.
Where does that come from? How can I make the second expression faster?
Update: Why this matters
Note that in general, I can have expressions of the type:
[^ ]+[^.]+
and I would like to know when this regular expression can be found on the left of a given position, or when it can end.
If I have the following data with the position below it:
abc145A
0123456
I would like the end of the previous expression to match positions 1, 2, 3, 4, 5 and 6. If I use non-greedy (reluctant) quantifiers, it will match 1, 3 and 5. If I use greedy quantifiers, it matches only 6. This is why I need look-behind assertions. Alternatively, perhaps someone can suggest another way to find the positions I am looking for.
You aren't using nested quantifiers, but I suspect nested lookbehinds cause a similar problem. I suspect you don't need that outer lookahead/lookbehind at all - how about performing a single regex search using only the inner part of the regexes (common to both), and retrieving both the start position and the end position from each result?
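A minimal Java sketch of that suggestion: run one search and read both positions from each match through Matcher.start() and Matcher.end(). The pattern in main is only a stand-in for the inner part of the expressions above, and the class and method names are illustrative. Note that this reports the single match the engine finds at each position, not every possible end position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Returns a {start, end} pair for every match of the pattern in the input.
    static List<int[]> spans(Pattern pattern, String input) {
        List<int[]> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(new int[] { m.start(), m.end() });
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in pattern: runs of uppercase letters.
        for (int[] span : spans(Pattern.compile("[A-Z]+"), "ab CD eF")) {
            System.out.println(span[0] + ".." + span[1]); // 3..5, then 7..8
        }
    }
}
```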

Conditional Regex searches

I'm attempting to create a Regular Expressions code in Java that will have a conditional search term.
What I mean by this is let's say I have 5 words; tree, car, dog, cat, bird. Now I would like the expression to search for these terms, however is only required to match 3 out of the five, and it could be any of the 5 it chooses to match.
I thought perhaps a using a back reference ?(3) would work but doesn't seem to do the trick.
A standard optional search (?) wouldn't work either because all terms are optional, however the number of matches required is not. Essentially is there a way to create a string that must be 50% (or any percent) correct to provide a match?
Would anyone happen to know or could point me in the right direction?
(I would hopefully like it working client side if possible)
Does it have to be a free-standing regular expression without any further code? A simple loop testing for each word and counting matches should do this perfectly. Pseudocode assuming you want N unique matches (you can also swap the substring test with a regex, doesn't matter how you determine matches as long as you keep the counting of unique matches out of the regex):
bool has_N_words(int n, string[] words, string text) {
int matches = 0;
foreach word in words {
if (word.substringOf(text)) matches++
if (matches >= n) return true
}
return false
}
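In Java, the same counting loop might look like the sketch below. It assumes the words in the list are distinct and uses a case-sensitive substring test, so "cat" would also count inside "category"; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.List;

class WordCounter {
    // Returns true as soon as n of the given words occur in text as substrings.
    static boolean hasAtLeast(int n, List<String> words, String text) {
        int matches = 0;
        for (String word : words) {
            if (text.contains(word)) {
                matches++;
            }
            if (matches >= n) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("tree", "car", "dog", "cat", "bird");
        System.out.println(hasAtLeast(3, words, "a dog chased a cat up a tree")); // true
        System.out.println(hasAtLeast(3, words, "a dog and a cat"));              // false
    }
}
```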
It seems to me that the only way to do this with a regular expression (save for mind-blowing uses of obscure regex extensions; not that I have something in mind, I've just been surprised again and again by what modern regex implementations allow) goes like this:
Enumerate all unique (ignoring order or not depending on implementation, see below) permutations of words
For each permutation, build a sub-regex that matches a string containing those words, either by
joining the first three words with .*? (this requires all unique permutations)
using three lookahead assertions like (?=.*word) (this allows dropping word combinations that occurred before in a different order)
Combine all sub-regexes in a giant or.
That's impractical to do by hand, ugly and complex (as in computational complexity, not in programming effort) to do automatically, and inefficient as well as quite hacky either way.
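To make the size of that construction concrete, here is a sketch that generates the lookahead variant automatically: one alternative per unordered triple of words, each consisting of three lookaheads. It assumes the list holds at least three distinct words; the class and method names are illustrative. For five words this already yields ten alternatives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class AnyThreeOf {
    private static String lookahead(String word) {
        return "(?=.*" + Pattern.quote(word) + ")";
    }

    // Builds (?s)(?:<triple 1>|<triple 2>|...).* where every triple is three lookaheads,
    // so the order in which the words appear in the text does not matter.
    static Pattern anyThreeOf(List<String> words) {
        List<String> alternatives = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                for (int k = j + 1; k < words.size(); k++) {
                    alternatives.add(lookahead(words.get(i))
                            + lookahead(words.get(j))
                            + lookahead(words.get(k)));
                }
            }
        }
        return Pattern.compile("(?s)(?:" + String.join("|", alternatives) + ").*");
    }

    public static void main(String[] args) {
        Pattern p = anyThreeOf(Arrays.asList("tree", "car", "dog", "cat", "bird"));
        System.out.println(p.matcher("a dog chased a cat up a tree").matches()); // true
        System.out.println(p.matcher("a dog and a cat").matches());              // false
    }
}
```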
I don't see why you would want to do this with a regex, but if you really need it to be a regex:
/(tree|car|dog|cat|bird)/
Then count the matches you get from that...
(?i)(?s)(.*(tree|car|dog|cat|bird)){3,}?.*
The (?i) is for case insensitive and the (?s) to match new lines with .* also, since you are looking at emails.
The ? at the end is the reluctant quantifier.
I haven't actually tried it.
