Regex to capture text with unknown number of repeated groups in between

Regex to capture text with unknown number of repeated groups in between - java

I'm trying to parse the number that follows "Dining:" in the following text, under SECOND LEVEL. So '666' should be returned.
MAIN LEVEL
Entrance: 11
Dining: 33
SECOND LEVEL
Entrance: 4444
Living: 5555
Dining: 666
THIRD LEVEL
Dining: 999
Kitchen: 000
Family: 33332
If I use something like (?:\bDining:\s)(.*\b) then it captures the first occurrence under MAIN. I'm trying to therefore specify SECOND LEVEL in the regex, followed by a repeating pattern of: new lines, multiple spaces, and then any text, until Dining: is found. This demo illustrates the two problems I encounter. The regex used is: (?:\bSECOND\sLEVEL(\n\s+.*)*Dining:)(.*\b)
A "Catastrophic backtracking" error appears until you delete the very last line containing Laundry: 1. Is this caused by too many matches or something?
Once you delete that line, the regex captures only the last match under OTHER LEVEL .. returning '2' as opposed to the match under SECOND LEVEL.
Sometimes Dining: will not exist under SECOND LEVEL and therefore nothing should be returned.
What is a regex that will only capture the SECOND LEVEL's Dining: number, and if it doesn't exist then returns nothing? Straight up regex preferred, no looping in Java if possible. Thanks

Use a negative lookahead based regex.
"(?m)^\\s*\\bSECOND LEVEL\\n(?:(?!\\n\\n)[\\s\\S])*\\bDining:\\s*(\\d+)"
DEMO

The best example I know of for catastrophic backtracking from here is (x+x+)+y. That is to say, it cannot work out the correct boundaries for the capture groups containing x because there are too many ways to divide them.
xxxxy is the first two + once, the third twice, or each of the first twice and the third once, or either of the first thrice, the other once and the last once. As you can see that gets dangerous!
You had (?:\bSECOND\sLEVEL(\n\s+.*)*Dining:)(.*\b) note the (\n\s+.*)*
the .* can be a nightmare when combined with the previous \n\s and enclosed with a *. It should be rewritten (\n\s+[^\s\n][^\n]*)* this ensures each quantifier ends before the next begins, minimising backtracking.
With this kind of thinking in mind I came up with the following regex to match your string:
(?<=SECOND LEVEL\n)(?:\s+(?:[^\s\n:][^\n:]*):[^\n]*)*\s+Dining:\s*([^\s\n][^\n$]*)

Related

Regex format for a particular Match

I am trying to write a regex for the following format
PA-123456-067_TY
It's always PA, followed by a dash, 6 digits, another dash, then 3 digits, and ends with _TY
Apparently, when I write this regex to match the above format it shows the output correctly
^[^[PA]-]+-(([^-]+)-([^_]+))_([^.]+)
with all the Negation symbols ^
This does not work if I write the regex in the below format without negation symbols
[[PA]-]+-(([-]+)-([_]+))_([.]+)
Can someone explain to me why is this so?

The negation symbol means that the character cannot be anything within the specified class. Your regex is much more complicated than it needs to be and is therefore obfuscating what you really want.
You probably want something like this:
^PA-(\d+)-(\d+)_TY$
... which matches anything that starts with PA-, then includes two groups of numbers separated by a dash, then an underscore and the letters TY. If you want everything after the PA to be what you capture, but separated into the three groups, then it's a little more abstract:
^PA-(.+)-(.+)_(.+)$
This matches:
PA-
a capture group of any characters
a dash
another capture group of any characters
an underscore
all the remaining characters until end-of-line
Character classes [...] are saying match any single character in the list, so your first capture group (([^-]+)-([^_]+)) is looking for anything that isn't a dash any number of times followed by a dash (which is fine) followed by anything that isn't an underscore (again fine). Having the extra set of parentheses around that creates another capture group (probably group 1 as it's the first parentheses reached by the regex engine)... that part is OK but probably makes interpreting the answer less intuitive in this case.
In the re-write however, your first capture group (([-]+)-([_]+)) matches [-]+, which means "one or more dashes" followed by a dash, followed by any number of underscores followed by an underscore. Since your input does not have a dash immediately following PA-, the entire regex fails to find anything.
Putting the PA inside embedded character classes is also making things complicated. The first part of your first one is looking for, well, I'm not actually sure how [^[PA]-]+ is interpreted in practice but I suspect it's something like "not either a P or an A or a dash any number of times". The second one is looking for the opposite, I think. But you don't want any of that, you just want to start without anything other than the actual sequence of characters you care about, which is just PA-.
Update: As per the clarifications in the comments on the original question, knowing you want fixed-size groups of digits, it would look like this:
^PA-(\d{6})-(\d{3})_TY$
That captures PA-, then a 6-digit number, then a dash, then a 3-digit number, then _TY. The six digit number and 3 digit numbers will be in capture groups 1 and 2, respectively.
If the sizes of those numbers could ever change, then replace {x} with + to just capture numbers regardless of max length.

according to your comment this would be appropriate PA-\d{6}-\d{3}_TY
EDIT: if you want to match a line use it with anchors: ^PA-\d{6}-\d{3}_TY$

Regex for multiple occurrences of specific words

Hello
I'm trying to create a validation rule that checks the regular expression to accept only specific phrases. Regex is based on Java.
Here are examples of correct inputs:
1OR2
2
1 OR 2 OR 15
( 2OR3) AND 1
(12AND13 AND1)OR(4 AND5)
((2AND3 AND 1)OR(4AND5))AND6
but I would be happy if only the regex could accept anything like :
())34AND(4
I have no idea how to create a regex to check if the brackets open and close correctly(they can be nested). I assumed it can be impossible to check it in regex so the proper validation for the brackets I've already made in the code(stack implementation). In the code I have a second step validation of the phrase.
All I need the regex to do is to check if there are these specific things inside the phrase:
numbers, round brackets, words AND and OR with multiple occurrences and whitespaces are allowed.
It should NOT accept letters or other characters.
All I managed to create so far is this:
^[0-9 \\(][0-9 \\(\\)]*
also tried adding something like:
\\b(AND|OR)\\b
inside the second pair of brackets but with no luck.
I cannot figure out how to correct it to add OR and AND words.

I used the following and matched all the inputs you gave:
^[^\)][0-9 \( (AND|OR)]*$
I assumed you didn't want to start with ), which is why I included ^[^\)].
In case you weren't aware, I use https://www.regexpal.com to check my regular expressions for code.

Since you have an arbitrary number of nested elements it's arguably not possible with regex.
For demonstration purposes only, this matches zero or more conjunctions and one set of parenthesis:
^\d+(\s*(?:AND|OR)\s*(\d+|\(?\s*\d+(\s*(?:AND|OR)\s*\d+)\s*\)))*$|^(\d+|\(?\s*\d+(\s*(?:AND|OR)\s*\d+)\s*\))\s*(\s*(?:AND|OR)\s*\d+)*$
That's it. Adding more sets and levels of nested parenthesis leads to exponentially increasing complexity - till it breaks altogether.
Demo

On Which Line Number Was the Regex Match Found?

I would like to search a .java file using Regular Expressions and I wonder if there is a way to detect one what lines in the file the matches are found.
For example if I look for the match hello with Java regular expressions, will some method tell me that the matches were found on lines 9, 15, and 30?

Possible... with Regex Trickery!
Disclaimer: This is not meant to be a practical solution, but an illustration of a way to use an extension of a terrific regex hack. Moreover, it only works on regex engines that allow capture groups to refer to themselves. For instance, you could use it in Notepad++, as it uses the PCRE engine—but not in Java.
Let's say your file is:
some code
more code
hey, hello!
more code
At the bottom of the file, paste :1:2:3:4:5:6:7, where : is a delimiter not found in the rest of the code, and where the numbers go at least as high as the number of lines.
Then, to get the line of the first hello, you can use:
(?m)(?:(?:^(?:(?!hello).)*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*.*hello(?=[^:]+((?(1)\1)+:(\d+)))
The line number of the first line containing hello will be captured by Group 2.
In the demo, see Group 2 capture in the right pane.
The hack relies on a group referring to itself. In the classic #Qtax trick, this is done with (?>\1?). For diversity, I used a conditional instead.
Explanation
The first part of the regex is a line skipper, which captures an increasing amount of the the line counter at the bottom to Group 1
The second part of the regex matches hello and captures the line number to Group 2
Inside the line skipper, (?:^(?:(?!hello).)*(?:\r?\n)) matches a line that doesn't contain hello.
Still inside the line skipper, the (?=[^:]+((?(1)\1):\d+)) lookahead gets us to the first : with [^:]+ then the outer parentheses in ((?(1)\1):\d+)) capture to Group 1... if Group 1 is set (?(1)\1) then Group 1, then, regardless, a colon and some digits. This ensures that each time the line skipper matches a line, Group 1 expands to a longer portion of :1:2:3:4:5:6:7
The * mataches the line skipper zero or more times
.*hello matches the line with hello
The lookahead (?=[^:]+((?(1)\1)+:(\d+))) is identical to the one in the line skipper, except that this time the digits are captured to Group 2: (\d+)
-
Reference
Qtax trick (recently awarded an additional bounty by #AmalMurali)
Replace a word with the number of the line on which it is found

If you are using a Unix based OS / terminal, you could use sed:
sed -n '/regex/=' file
(from this StackOverflow response)

There are no methods in Java that will do it for you. You must read the file line-by-line and check for a match on each line. You can keep an index of the lines as you read them and do whatever you want with that index when a match is found.

Solution (workaround) M1
just append line numbers to the File, line by line, before you process (regex match) it.
stackoverflow: how to append line numbers to the File
Solution (workaround) M2
count all the newline characters that occur before the match group.
long count_NewLines = Pattern.compile("\\R")
.matcher(content.substring(0, matcher.start()))
.results()
.count() + 1;

This regex line exceeds my understanding "(?=(?:\d{3})++(?!\d))"

i am pretty ok with basic reg-ex. but this line of code used to make the thousand separation in large numbers exceeds my knowledge and googling it quite a bit did also not satisfy my curiosity. can one of u please take a minute to explain to me the following line of code?
someString.replaceAll("(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");
i especially don't understand the regex structure "(?=(?:\d{3})++(?!\d))".
thanks a lot in advance.

"(?=(?:\d{3})++(?!\d))" is a lookahead assertion.
It means "only match if followed by ((three digits that we don't need to capture) repeated one or more times (and again repeated one or more times) (not followed by a digit))". See this explanation about (?:...) notation. It's called non-capturing group and means you don't need to reference this group after the match.
"(\\G-?\\d{1,3})" is the part that should actually match (but only if the above-described conditions are met).
Edit: I think + must be a special character, otherwise it's just a plus. If it's a special character (and quick search suggests that it is in Java, too), the second one is redundant.
Edit 2: Thanks to Alan Moore, it's now clear. The second + means possessive matching, so it means that if after checking as many 3-digit groups as possible it won't find that they're not followed by a non-digit, the engine will immediately give up instead of stepping one 3-digit group back.

this expression has some advanced stuff in it.
first , the easiest: \d{3} means exactly three digits. These are your thousands.
then: the ++ is a variant of + (which means one or more), but possessive, which means it will eat all of the thousands. Im not completely sure why this is necessary.
?:means it is a non-capturing group - i think this is just there for performance reasons and could be omitted.
?=is a positive lookahead - i think this means it is only checked whether that group exists but will not count towards the matched string - meaning it wont be replaced.
?! is a negative lookahead - i dont quite understand that but i think it means it must NOT match, which in turn means there cannot be another digit at the end of the matched sequence. This makes sure the first group gets the right digits. E.g. 10000 can only be matched as 10(000) but not 1(000)0 if you see what i mean.
Through the lookaheads, if i understand it correctly (i havent tested it), only the first group would actually be replaced, as it is the one that matches.

To me, the most interesting part of that regex is the \G. It took me a while to remember what it's for: to prevent adding commas to the fraction part if there is one. If the regex were simply:
(-?\d{1,3})(?=(?:\d{3})++(?!\d))
...this number:
12345.67890
...would end up as:
12,345.67,890
But adding \G to the beginning means a match can only start at the beginning of the string or at the position where the previous match ended. So it doesn't match 345 because of the . following it, and it doesn't match 67 because it would have to skip over some of the string to do so. And so it correctly returns:
12,345.67890
I know this isn't an answer to the question, but I thought it was worth a mention.

Regular expression: who's greedier?

My primary concern is with the Java flavor, but I'd also appreciate information regarding others.
Let's say you have a subpattern like this:
(.*)(.*)
Not very useful as is, but let's say these two capture groups (say, \1 and \2) are part of a bigger pattern that matches with backreferences to these groups, etc.
So both are greedy, in that they try to capture as much as possible, only taking less when they have to.
My question is: who's greedier? Does \1 get first priority, giving \2 its share only if it has to?
What about:
(.*)(.*)(.*)
Let's assume that \1 does get first priority. Let's say it got too greedy, and then spit out a character. Who gets it first? Is it always \2 or can it be \3?
Let's assume it's \2 that gets \1's rejection. If this still doesn't work, who spits out now? Does \2 spit to \3, or does \1 spit out another to \2 first?
Bonus question
What happens if you write something like this:
(.*)(.*?)(.*)
Now \2 is reluctant. Does that mean \1 spits out to \3, and \2 only reluctantly accepts \3's rejection?
Example
Maybe it was a mistake for me not to give concrete examples to show how I'm using these patterns, but here's some:
System.out.println(
"OhMyGod=MyMyMyOhGodOhGodOhGod"
.replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><My><God>"
// same pattern, different input string
System.out.println(
"OhMyGod=OhMyGodOhOhOh"
.replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><MyGod><>"
// now \2 is reluctant
System.out.println(
"OhMyGod=OhMyGodOhOhOh"
.replaceAll("^(.*)(.*?)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><><MyGod>"

\1 will have priority, \2 and \3 will always match nothing. \2 will then have priority over \3.
As a general rule think of it like this, back-tracking will only occur to satisfy a match, it will not occur to satisfy greediness, so left is best :)
explaining back tracking and greediness is to much for me to tackle here, i'd suggest friedl's Mastering Regular Expressions

The addition of your concrete examples changes the nature of the question drastically. It still starts out as I described in my first answer, with the first (.*) gobbling up all the characters, and the second and third groups letting it have them, but then it has to match an equals sign.
Obviously there isn't one at the end of the string, so group #1 gives back characters one by one until the = in the regex can match the = in the target. Then the regex engine starts trying to match (\1|\2|\3)+$ and the real fun starts.
Group 1 gives up the d and group 2 (which is still empty) takes it, but the rest of the regex still can't match. Group 1 gives up the o and group 2 matches od, but the rest of the regex still can't match. And so it goes, with the third group getting involved, and the three of them slicing up the input in every way possible until an overall match is achieved. RegexBuddy reports that it takes 13,426 steps to get there.
In the first example, greediness (or lack of it) isn't really a factor; the only way a match can be achieved is if the words Oh, My and God are captured in separate groups, so eventually that's what happens. It doesn't even matter which group captures which word--that's just first come, first served, as I said before.
In the second and third examples it's only necessary to break the prefix into two chunks: Oh and MyGod. Group 2 captures MyGod in the second example because it's next in line and it's greedy, just like in the first example. In the third example, every time group 1 drops a character, group 2 (being reluctant) lets group 3 take it instead, so that's the one that ends up in possession of MyGod.
It's more complicated (and tedious) than that, of course, but I hope this answers your question. And I have to say, that's an interesting target string you chose; if it were possible for a regex engine to have an orgasm, I think these regexes would be the ones to bring it off. :D

Quantifiers aren't really greedy by default, they're just hasty. In your example, the first (.*) will start out by gobbling up everything it can without regard to the needs of the regex as a whole. Only then does it hand control to the next part, and if necessary it will give back some or all of what it just took (i.e., backtrack) so the rest of the regex can do its work.
That isn't necessary in this case because everything else can legally match zero characters. If the quantifiers were really greedy, the three groups would haggle until they had divided the input as evenly as possible; instead, the second and third groups let the first one keep what it took. They'll take it if it's put in front of them, but they won't fight for it. (That would be true even if they had possessive quantifiers, i.e, (.*)(.*+)(.*+).)
Making the second dot-star reluctant doesn't change anything, but switching the first one does. A reluctant quantifier starts out by matching only as much as it has to, then hands off to the next part. So the first group in (.*?)(.*)(.*) starts out by matching nothing, then the second group gobbles everything, and the third group cries "weee weee weee" all the way home.
Here's a bonus question for you: What happens if you make all three of the quantifiers reluctant? (Hint: In Java this is as much an API question as it is a regex question.)

Regular Expressions work in a sequence, that means the Regex-evaluator will only leave a group when he can't find a solution to that group anymore, and eventually do some backtracking to make the string fit to the next group. If you execute this regex, you will get all your chars evaluated in the first group, none in the next ones (Question-sign doesn't matter either).

As a simple general rule: leftmost quantifier wins. So as long as the following quantifiers identify purely optional subpatterns (regardless of them being made ungreedy), the first takes all.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.