This regex line exceeds my understanding "(?=(?:\d{3})++(?!\d))" - java

I am pretty OK with basic regex, but this line of code, used to add thousands separators to large numbers, exceeds my knowledge, and googling it quite a bit did not satisfy my curiosity either. Can one of you please take a minute to explain the following line of code?
someString.replaceAll("(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");
I especially don't understand the regex structure "(?=(?:\d{3})++(?!\d))".
Thanks a lot in advance.

"(?=(?:\d{3})++(?!\d))" is a lookahead assertion.
It means "only match if followed by ((three digits that we don't need to capture) repeated one or more times (and again repeated one or more times) (not followed by a digit))". See this explanation about (?:...) notation. It's called non-capturing group and means you don't need to reference this group after the match.
"(\\G-?\\d{1,3})" is the part that should actually match (but only if the above-described conditions are met).
Edit: I think + must be a special character, otherwise it's just a plus. If it's a special character (and quick search suggests that it is in Java, too), the second one is redundant.
Edit 2: Thanks to Alan Moore, it's now clear. The second + makes the quantifier possessive: if, after consuming as many 3-digit groups as possible, the engine finds that they are still followed by a digit, it gives up immediately instead of stepping back one 3-digit group at a time.

This expression has some advanced stuff in it.
First, the easiest: \d{3} means exactly three digits. These are your thousands.
Then: the ++ is a variant of + (which means one or more), but possessive, which means it will eat all of the thousands and never give any back. I'm not completely sure why this is necessary.
?: means it is a non-capturing group. I think this is just there for performance reasons and could be omitted.
?= is a positive lookahead. I think this means it is only checked whether that group exists, but it will not count towards the matched string, meaning it won't be replaced.
?! is a negative lookahead. I don't quite understand that, but I think it means it must NOT match, which in turn means there cannot be another digit at the end of the matched sequence. This makes sure the first group gets the right digits. E.g. 10000 can only be matched as 10(000) but not 1(000)0, if you see what I mean.
Through the lookaheads, if I understand it correctly (I haven't tested it), only the first group would actually be replaced, as it is the one that matches.
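On the open question of why the ++ is there: a small sketch comparing the possessive pattern with a plain greedy + on a few inputs. The class name, method name and sample inputs are made up for illustration. In these runs both variants give the same output, which suggests the possessive form is an optimization rather than a requirement here: handing back a 3-digit group always leaves a digit right after the lookahead position, so (?!\d) would fail again anyway.

```java
public class PossessiveDemo {
    // Greedy variant: + may hand back 3-digit groups when (?!\d) fails.
    static final String GREEDY = "(\\G-?\\d{1,3})(?=(?:\\d{3})+(?!\\d))";
    // Possessive variant from the question: ++ never hands anything back.
    static final String POSSESSIVE = "(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))";

    // Hypothetical helper: applies the given pattern with the "$1," replacement.
    static String format(String regex, String s) {
        return s.replaceAll(regex, "$1,");
    }

    public static void main(String[] args) {
        // What "possessive" means in isolation:
        System.out.println("12".matches("\\d+\\d"));   // true: \d+ gives back the 2
        System.out.println("12".matches("\\d++\\d"));  // false: \d++ keeps both digits

        // Both variants side by side on a few sample inputs.
        String[] samples = { "1000", "-12345", "1234567", "123" };
        for (String s : samples) {
            System.out.println(format(GREEDY, s) + "  " + format(POSSESSIVE, s));
        }
    }
}
```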

To me, the most interesting part of that regex is the \G. It took me a while to remember what it's for: to prevent adding commas to the fraction part if there is one. If the regex were simply:
(-?\d{1,3})(?=(?:\d{3})++(?!\d))
...this number:
12345.67890
...would end up as:
12,345.67,890
But adding \G to the beginning means a match can only start at the beginning of the string or at the position where the previous match ended. So it doesn't match 345 because of the . following it, and it doesn't match 67 because it would have to skip over some of the string to do so. And so it correctly returns:
12,345.67890
I know this isn't an answer to the question, but I thought it was worth a mention.
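The effect of \G described above can be reproduced with a short sketch. The class and method names are hypothetical; the two patterns and the sample number are the ones from this answer.

```java
public class GAnchorDemo {
    // Without \G: a match may start anywhere, including inside the fraction.
    static String withoutG(String s) {
        return s.replaceAll("(-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");
    }

    // With \G: a match may only start at the beginning of the input
    // or exactly where the previous match ended.
    static String withG(String s) {
        return s.replaceAll("(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");
    }

    public static void main(String[] args) {
        System.out.println(withoutG("12345.67890")); // 12,345.67,890
        System.out.println(withG("12345.67890"));    // 12,345.67890
    }
}
```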

Related

Regex format for a particular Match

I am trying to write a regex for the following format
PA-123456-067_TY
It's always PA, followed by a dash, 6 digits, another dash, then 3 digits, and ends with _TY
Apparently, when I write this regex to match the above format it shows the output correctly
^[^[PA]-]+-(([^-]+)-([^_]+))_([^.]+)
with all the Negation symbols ^
This does not work if I write the regex in the below format without negation symbols
[[PA]-]+-(([-]+)-([_]+))_([.]+)
Can someone explain to me why is this so?
The negation symbol means that the character cannot be anything within the specified class. Your regex is much more complicated than it needs to be and is therefore obfuscating what you really want.
You probably want something like this:
^PA-(\d+)-(\d+)_TY$
... which matches anything that starts with PA-, then includes two groups of numbers separated by a dash, then an underscore and the letters TY. If you want everything after the PA to be what you capture, but separated into the three groups, then it's a little more abstract:
^PA-(.+)-(.+)_(.+)$
This matches:
PA-
a capture group of any characters
a dash
another capture group of any characters
an underscore
all the remaining characters until end-of-line
Character classes [...] are saying match any single character in the list, so your first capture group (([^-]+)-([^_]+)) is looking for anything that isn't a dash any number of times followed by a dash (which is fine) followed by anything that isn't an underscore (again fine). Having the extra set of parentheses around that creates another capture group (probably group 1 as it's the first parentheses reached by the regex engine)... that part is OK but probably makes interpreting the answer less intuitive in this case.
In the re-write however, your first capture group (([-]+)-([_]+)) matches [-]+, which means "one or more dashes" followed by a dash, followed by any number of underscores followed by an underscore. Since your input does not have a dash immediately following PA-, the entire regex fails to find anything.
Putting the PA inside embedded character classes is also making things complicated. The first part of your first one is looking for, well, I'm not actually sure how [^[PA]-]+ is interpreted in practice but I suspect it's something like "not either a P or an A or a dash any number of times". The second one is looking for the opposite, I think. But you don't want any of that, you just want to start without anything other than the actual sequence of characters you care about, which is just PA-.
Update: As per the clarifications in the comments on the original question, knowing you want fixed-size groups of digits, it would look like this:
^PA-(\d{6})-(\d{3})_TY$
That captures PA-, then a 6-digit number, then a dash, then a 3-digit number, then _TY. The six digit number and 3 digit numbers will be in capture groups 1 and 2, respectively.
If the sizes of those numbers could ever change, then replace {x} with + to just capture numbers regardless of max length.
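A minimal Java sketch of the fixed-size version, in case it helps to see the capture groups in use. The class name and the parse helper are made up for illustration; only the pattern comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartNumberDemo {
    private static final Pattern FORMAT =
            Pattern.compile("^PA-(\\d{6})-(\\d{3})_TY$");

    // Hypothetical helper: returns the two digit groups,
    // or null when the input does not fit the format.
    static String[] parse(String input) {
        Matcher m = FORMAT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("PA-123456-067_TY");
        System.out.println(parts[0]); // 123456
        System.out.println(parts[1]); // 067
        System.out.println(parse("PA-12345-067_TY") == null); // true: only 5 digits
    }
}
```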
According to your comment, this would be appropriate: PA-\d{6}-\d{3}_TY
EDIT: If you want to match a whole line, use it with anchors: ^PA-\d{6}-\d{3}_TY$

Regex to capture text with unknown number of repeated groups in between

I'm trying to parse the number that follows "Dining:" in the following text, under SECOND LEVEL. So '666' should be returned.
MAIN LEVEL
Entrance: 11
Dining: 33
SECOND LEVEL
Entrance: 4444
Living: 5555
Dining: 666
THIRD LEVEL
Dining: 999
Kitchen: 000
Family: 33332
If I use something like (?:\bDining:\s)(.*\b) then it captures the first occurrence under MAIN. I'm trying to therefore specify SECOND LEVEL in the regex, followed by a repeating pattern of: new lines, multiple spaces, and then any text, until Dining: is found. This demo illustrates the two problems I encounter. The regex used is: (?:\bSECOND\sLEVEL(\n\s+.*)*Dining:)(.*\b)
A "Catastrophic backtracking" error appears until you delete the very last line containing Laundry: 1. Is this caused by too many matches or something?
Once you delete that line, the regex captures only the last match under OTHER LEVEL .. returning '2' as opposed to the match under SECOND LEVEL.
Sometimes Dining: will not exist under SECOND LEVEL and therefore nothing should be returned.
What is a regex that will only capture the SECOND LEVEL's Dining: number, and if it doesn't exist then returns nothing? Straight up regex preferred, no looping in Java if possible. Thanks
Use a negative lookahead based regex.
"(?m)^\\s*\\bSECOND LEVEL\\n(?:(?!\\n\\n)[\\s\\S])*\\bDining:\\s*(\\d+)"
The best example I know of for catastrophic backtracking from here is (x+x+)+y. That is to say, it cannot work out the correct boundaries for the capture groups containing x because there are too many ways to divide them.
For xxxxy, the engine can apply each of the two inner + once and the outer + twice, or each inner + twice and the outer + once, or one inner + three times, the other once, and the outer + once. As you can see, that gets dangerous!
You had (?:\bSECOND\sLEVEL(\n\s+.*)*Dining:)(.*\b). Note the (\n\s+.*)*
The .* can be a nightmare when combined with the preceding \n\s+ and enclosed in a *. It should be rewritten as (\n\s+[^\s\n][^\n]*)*, which ensures each quantifier ends before the next begins, minimising backtracking.
With this kind of thinking in mind I came up with the following regex to match your string:
(?<=SECOND LEVEL\n)(?:\s+(?:[^\s\n:][^\n:]*):[^\n]*)*\s+Dining:\s*([^\s\n][^\n$]*)
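A rough Java sketch of that last regex in use. The class name, the find helper, and the four-space indentation of the sample lines are assumptions (the question describes each entry as a newline followed by multiple spaces, and the \s+ in the pattern relies on that).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SecondLevelDining {
    private static final Pattern DINING = Pattern.compile(
            "(?<=SECOND LEVEL\\n)(?:\\s+(?:[^\\s\\n:][^\\n:]*):[^\\n]*)*"
            + "\\s+Dining:\\s*([^\\s\\n][^\\n$]*)");

    // Hypothetical helper: returns the Dining value under SECOND LEVEL, or null.
    static String find(String text) {
        Matcher m = DINING.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample from the question; the indentation is assumed.
        String text = String.join("\n",
                "MAIN LEVEL",
                "    Entrance: 11",
                "    Dining: 33",
                "SECOND LEVEL",
                "    Entrance: 4444",
                "    Living: 5555",
                "    Dining: 666",
                "THIRD LEVEL",
                "    Dining: 999",
                "    Kitchen: 000",
                "    Family: 33332");
        System.out.println(find(text)); // 666

        // Without a Dining entry under SECOND LEVEL nothing is found, because
        // the "THIRD LEVEL" line has no colon and stops the repeated group.
        String noDining = text.replace("    Dining: 666\n", "");
        System.out.println(find(noDining)); // null
    }
}
```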

Java Matcher slow regex

This is a very simple regex, yet it runs for over 30 seconds on a very short string (i7 3970K @ 3.4 GHz):
Pattern compile = Pattern.compile("^(?=[a-z0-9-]{1,63})([a-z0-9]+[-]{0,1}){1,63}[a-z0-9]{1}$");
Matcher matcher = compile.matcher("test-metareg-rw40lntknahvpseba32cßáàâåäæç.nl");
boolean matches = matcher.matches(); //Takes 30+ seconds
The first part, the (?=...), asserts that the string contains at most these characters.
The second part asserts that the string follows the syntax, in this case to prevent --'s and to make sure it ends in [a-z0-9].
I tried to guess your intention but it was not easy:
(?=[a-z0-9-]{1,63}) this look-ahead seems intended to require that the next up to 63 characters be lowercase ASCII letters, numbers or -, but in fact it will succeed even if there's only one letter followed by anything. So maybe you meant (?=[a-z0-9-]{1,63}$) to forbid anything else after the legal up to 63 characters.
You seem to want groups of at least one letter or number between the -, but you made the - optional, which doesn't really create a constraint and allows far too many possibilities; that is what created the overhead of your expression. You can simply say: ([a-z0-9]++-){0,63}[a-z0-9]+. The groups within the braces require at least one letter or number and require the minus after that; the expression at the end requires at least one letter or number at the end of the expression but will also match the last group without a following - at the same time. This last group might also be the only one if no - is contained in your text at all.
Putting it all together, your regex becomes: (?=[a-z0-9-]{1,63}$)([a-z0-9]++-){0,63}[a-z0-9]+. Note that you don't need a leading ^ or trailing $ if you use the matches method; it already implies that the string bounds must match the expression bounds.
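To make the two points above concrete, here is a small sketch. The class name, field names and test strings other than the one from the question are made up. It first shows that the look-ahead without $ accepts a single legal character followed by anything, then runs the combined expression through matches().

```java
import java.util.regex.Pattern;

public class LookaheadAnchorDemo {
    // Look-ahead as originally written: satisfied by one legal character.
    static final Pattern UNANCHORED = Pattern.compile("^(?=[a-z0-9-]{1,63})");
    // Look-ahead with $: the whole rest of the input must be 1 to 63 legal characters.
    static final Pattern ANCHORED = Pattern.compile("^(?=[a-z0-9-]{1,63}$)");
    // Combined expression from this answer (no ^ or $ needed with matches()).
    static final Pattern LABEL =
            Pattern.compile("(?=[a-z0-9-]{1,63}$)([a-z0-9]++-){0,63}[a-z0-9]+");

    static boolean isValid(String s) {
        return LABEL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(UNANCHORED.matcher("a!!!").find()); // true
        System.out.println(ANCHORED.matcher("a!!!").find());   // false

        System.out.println(isValid("test-metareg")); // true
        System.out.println(isValid("a--b"));         // false: empty group between dashes
        System.out.println(isValid("ab-"));          // false: must end in a letter or number
        // The input from the question is rejected by the look-ahead.
        System.out.println(isValid("test-metareg-rw40lntknahvpseba32cßáàâåäæç.nl")); // false
    }
}
```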
I hope I got your intention right…
I have fixed this regex replacing it as follows:
^(?=[a-z0-9-]{1,63})([a-z0-9]{0,1}|[-]{0,1}){1,63}[a-z0-9]{1}$
The section ([a-z0-9]+[-]{0,1}){1,63} became: ([a-z0-9]{0,1}|[-]{0,1}){1,63}
If you want to make sure that there is no -- in your string just use negative look ahead (?!.*--).
Also there is no point in writing {1}.
Another thing is if you want to ensure that string has max 63 characters then in your look-ahead you need to add $ at the end (?=[a-z0-9-]{1,63}$).
So maybe ^(?=[a-z0-9-]{1,63}$)(?!.*--)[a-z0-9-]+[a-z0-9]$
I think from what you say, your regex can be simplified to this
Edit - (For posterity) After reading @Holger's post, I am changing this to fix possible catastrophic backtracking and to speed it up; as my benchmarks show, this is possibly the fastest way to do it.
# ^(?=[a-z0-9-]{1,63}$)[a-z0-9]++(?:-[a-z0-9]+)*+$
^ # BOL
(?= [a-z0-9-]{1,63} $ ) # max 1 - 63 of these characters
[a-z0-9]++ (?: - [a-z0-9]+ )*+ # consume the characters in this order
$ # EOL
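The spaced-out, commented layout above can be compiled more or less as written in Java by passing Pattern.COMMENTS, which ignores whitespace in the pattern and treats # as the start of a comment. A sketch under that assumption; the class name, the isValid helper and the sample inputs are illustrative only.

```java
import java.util.regex.Pattern;

public class CommentedRegexDemo {
    // Same expression as above, kept in its commented layout.
    static final Pattern LABEL = Pattern.compile(String.join("\n",
            "^                               # BOL",
            "(?= [a-z0-9-]{1,63} $ )         # 1 to 63 of these characters in total",
            "[a-z0-9]++ (?: - [a-z0-9]+ )*+  # alphanumeric runs separated by single dashes",
            "$                               # EOL"),
            Pattern.COMMENTS);

    static boolean isValid(String s) {
        return LABEL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String sixtyThree = new String(new char[63]).replace('\0', 'a');

        System.out.println(isValid("test-metareg"));   // true
        System.out.println(isValid("a--b"));           // false
        System.out.println(isValid("-abc"));           // false
        System.out.println(isValid("abc-"));           // false
        System.out.println(isValid(sixtyThree));       // true
        System.out.println(isValid(sixtyThree + "a")); // false: 64 characters
    }
}
```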

Regular expression: who's greedier?

My primary concern is with the Java flavor, but I'd also appreciate information regarding others.
Let's say you have a subpattern like this:
(.*)(.*)
Not very useful as is, but let's say these two capture groups (say, \1 and \2) are part of a bigger pattern that matches with backreferences to these groups, etc.
So both are greedy, in that they try to capture as much as possible, only taking less when they have to.
My question is: who's greedier? Does \1 get first priority, giving \2 its share only if it has to?
What about:
(.*)(.*)(.*)
Let's assume that \1 does get first priority. Let's say it got too greedy, and then spit out a character. Who gets it first? Is it always \2 or can it be \3?
Let's assume it's \2 that gets \1's rejection. If this still doesn't work, who spits out now? Does \2 spit to \3, or does \1 spit out another to \2 first?
Bonus question
What happens if you write something like this:
(.*)(.*?)(.*)
Now \2 is reluctant. Does that mean \1 spits out to \3, and \2 only reluctantly accepts \3's rejection?
Example
Maybe it was a mistake for me not to give concrete examples to show how I'm using these patterns, but here's some:
System.out.println(
"OhMyGod=MyMyMyOhGodOhGodOhGod"
.replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><My><God>"
// same pattern, different input string
System.out.println(
"OhMyGod=OhMyGodOhOhOh"
.replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><MyGod><>"
// now \2 is reluctant
System.out.println(
"OhMyGod=OhMyGodOhOhOh"
.replaceAll("^(.*)(.*?)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><><MyGod>"
\1 will have priority, \2 and \3 will always match nothing. \2 will then have priority over \3.
As a general rule, think of it like this: backtracking will only occur to satisfy a match, it will not occur to satisfy greediness, so left is best :)
Explaining backtracking and greediness is too much for me to tackle here; I'd suggest Friedl's Mastering Regular Expressions.
The addition of your concrete examples changes the nature of the question drastically. It still starts out as I described in my first answer, with the first (.*) gobbling up all the characters, and the second and third groups letting it have them, but then it has to match an equals sign.
Obviously there isn't one at the end of the string, so group #1 gives back characters one by one until the = in the regex can match the = in the target. Then the regex engine starts trying to match (\1|\2|\3)+$ and the real fun starts.
Group 1 gives up the d and group 2 (which is still empty) takes it, but the rest of the regex still can't match. Group 1 gives up the o and group 2 matches od, but the rest of the regex still can't match. And so it goes, with the third group getting involved, and the three of them slicing up the input in every way possible until an overall match is achieved. RegexBuddy reports that it takes 13,426 steps to get there.
In the first example, greediness (or lack of it) isn't really a factor; the only way a match can be achieved is if the words Oh, My and God are captured in separate groups, so eventually that's what happens. It doesn't even matter which group captures which word--that's just first come, first served, as I said before.
In the second and third examples it's only necessary to break the prefix into two chunks: Oh and MyGod. Group 2 captures MyGod in the second example because it's next in line and it's greedy, just like in the first example. In the third example, every time group 1 drops a character, group 2 (being reluctant) lets group 3 take it instead, so that's the one that ends up in possession of MyGod.
It's more complicated (and tedious) than that, of course, but I hope this answers your question. And I have to say, that's an interesting target string you chose; if it were possible for a regex engine to have an orgasm, I think these regexes would be the ones to bring it off. :D
Quantifiers aren't really greedy by default, they're just hasty. In your example, the first (.*) will start out by gobbling up everything it can without regard to the needs of the regex as a whole. Only then does it hand control to the next part, and if necessary it will give back some or all of what it just took (i.e., backtrack) so the rest of the regex can do its work.
That isn't necessary in this case because everything else can legally match zero characters. If the quantifiers were really greedy, the three groups would haggle until they had divided the input as evenly as possible; instead, the second and third groups let the first one keep what it took. They'll take it if it's put in front of them, but they won't fight for it. (That would be true even if they had possessive quantifiers, i.e., (.*)(.*+)(.*+).)
Making the second dot-star reluctant doesn't change anything, but switching the first one does. A reluctant quantifier starts out by matching only as much as it has to, then hands off to the next part. So the first group in (.*?)(.*)(.*) starts out by matching nothing, then the second group gobbles everything, and the third group cries "weee weee weee" all the way home.
Here's a bonus question for you: What happens if you make all three of the quantifiers reluctant? (Hint: In Java this is as much an API question as it is a regex question.)
Regular expressions work in sequence: the regex evaluator will only leave a group when it can't find a solution for that group anymore, and will eventually do some backtracking to make the string fit the next group. If you execute this regex, you will get all your characters evaluated in the first group and none in the next ones (the question mark doesn't matter either).
As a simple general rule: leftmost quantifier wins. So as long as the following quantifiers identify purely optional subpatterns (regardless of them being made ungreedy), the first takes all.
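The "leftmost quantifier wins" behaviour described in these answers can be checked with a short sketch. The class name, helper names and the input "abc" are made up for illustration. The last two lines also try the all-reluctant variant from the bonus question, where the result depends on whether matches() (whole input) or find() (first match anywhere) is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedDemo {
    // Formats the three groups of a whole-input match as <g1><g2><g3>.
    static String viaMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? show(m) : "no match";
    }

    // Same, but using find(), which accepts the first match it can get.
    static String viaFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? show(m) : "no match";
    }

    private static String show(Matcher m) {
        return "<" + m.group(1) + "><" + m.group(2) + "><" + m.group(3) + ">";
    }

    public static void main(String[] args) {
        System.out.println(viaMatches("(.*)(.*)(.*)", "abc"));    // <abc><><>
        System.out.println(viaMatches("(.*)(.*?)(.*)", "abc"));   // <abc><><>
        System.out.println(viaMatches("(.*?)(.*)(.*)", "abc"));   // <><abc><>

        // All three reluctant: the API call decides who ends up with the text.
        System.out.println(viaMatches("(.*?)(.*?)(.*?)", "abc")); // <><><abc>
        System.out.println(viaFind("(.*?)(.*?)(.*?)", "abc"));    // <><><>
    }
}
```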

codingBat separateThousands using regex (and unit testing how-to)

This question is a combination of regex practice and unit testing practice.
Regex part
I authored this problem separateThousands for personal practice:
Given a number as a string, introduce commas to separate thousands. The number may contain an optional minus sign, and an optional decimal part. There will not be any superfluous leading zeroes.
Here's my solution:
String separateThousands(String s) {
    return s.replaceAll(
        String.format("(?:%s)|(?:%s)",
            "(?<=\\G\\d{3})(?=\\d)",
            "(?<=^-?\\d{1,3})(?=(?:\\d{3})+(?!\\d))"
        ),
        ","
    );
}
The way it works is that it classifies two types of commas: the first, and the rest. In the above regex, the rest subpattern actually appears before the first. A match will always be zero-length, and replaceAll replaces it with ",".
The rest basically looks behind to see if there was a match followed by 3 digits, and looks ahead to see if there's a digit. It's some sort of a chain reaction mechanism triggered by the previous match.
The first basically looks behind for ^ anchor, followed by an optional minus sign, and between 1 to 3 digits. The rest of the string from that point must match triplets of digits, followed by a nondigit (which could either be $ or \.).
My question for this part is:
Can this regex be simplified?
Can it be optimized further?
Ordering rest before first is deliberate, since first is only needed once
No capturing group
Unit testing part
As I've mentioned, I'm the author of this problem, so I'm also the one responsible for coming up with testcases for them. Here they are:
INPUT, OUTPUT
"1000", "1,000"
"-12345", "-12,345"
"-1234567890.1234567890", "-1,234,567,890.1234567890"
"123.456", "123.456"
".666666", ".666666"
"0", "0"
"123456789", "123,456,789"
"1234.5678", "1,234.5678"
"-55555.55555", "-55,555.55555"
"0.123456789", "0.123456789"
"123456.789", "123,456.789"
I haven't had much experience with industrial-strength unit testing, so I'm wondering if others can comment whether this is a good coverage, whether I've missed anything important, etc (I can always add more tests if there's a scenario I've missed).
This works for me:
return s.replaceAll("(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");
The first time through, \G acts the same as ^, and the lookahead forces \d{1,3} to consume only as many characters as necessary to leave the match position at a three-digit boundary. After that, \d{1,3} consumes the maximum three digits every time, with \G to keep it anchored to the end of the previous match.
As for your unit tests, I would just make it clear in the problem description that the input will always be a valid number, with at most one decimal point.
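For what it's worth, the question's INPUT/OUTPUT table can be turned into a table-driven check against this one-liner. This is a plain-Java sketch without a test framework; the class name and the failures helper are made up, while the cases are copied from the question.

```java
public class SeparateThousandsTest {
    static String separateThousands(String s) {
        return s.replaceAll("(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");
    }

    // { input, expected output }, copied from the question's table.
    static final String[][] CASES = {
        { "1000", "1,000" },
        { "-12345", "-12,345" },
        { "-1234567890.1234567890", "-1,234,567,890.1234567890" },
        { "123.456", "123.456" },
        { ".666666", ".666666" },
        { "0", "0" },
        { "123456789", "123,456,789" },
        { "1234.5678", "1,234.5678" },
        { "-55555.55555", "-55,555.55555" },
        { "0.123456789", "0.123456789" },
        { "123456.789", "123,456.789" },
    };

    // Hypothetical helper: counts the cases whose output differs from the table.
    static int failures() {
        int failed = 0;
        for (String[] c : CASES) {
            String actual = separateThousands(c[0]);
            if (!actual.equals(c[1])) {
                System.out.println("FAIL " + c[0] + ": expected " + c[1] + ", got " + actual);
                failed++;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println(failures() + " failing case(s)");
    }
}
```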
When you state the requirements are you intending for them to be enforced by your method?
The number may contain an optional minus sign, and an optional decimal part. There will not be any superfluous leading zeroes.
If your intent is to have the method detect when those constraints are violated, you will need to write additional unit tests to ensure that contract is being enforced.
What about testing for 1234.5678.91011?
Do you expect your method to return 1,234.5678.91011 or just ignore the whole thing?
It's best to write a test to verify your expectations.
