Regex for multiple occurrences of specific words


Hello
I'm trying to create a validation rule that uses a regular expression to accept only specific phrases. The regex flavour is Java's.
Here are examples of correct inputs:
1OR2
2
1 OR 2 OR 15
( 2OR3) AND 1
(12AND13 AND1)OR(4 AND5)
((2AND3 AND 1)OR(4AND5))AND6
but I would be happy if the regex alone also accepted something like:
())34AND(4
I have no idea how to create a regex that checks whether the brackets open and close correctly (they can be nested). I assumed this may be impossible in regex, so I've already implemented proper bracket validation in code (using a stack). That code runs as a second validation step on the phrase.
All I need the regex to do is check that the phrase contains only these things:
numbers, round brackets, the words AND and OR (any number of times), and whitespace.
It should NOT accept letters or other characters.
All I managed to create so far is this:
^[0-9 \\(][0-9 \\(\\)]*
also tried adding something like:
\\b(AND|OR)\\b
inside the second pair of brackets but with no luck.
I cannot figure out how to correct it to add OR and AND words.

I used the following and matched all the inputs you gave:
^[^\)][0-9 \( (AND|OR)]*$
I assumed you didn't want to start with ), which is why I included ^[^\)].
In case you weren't aware, I use https://www.regexpal.com to check my regular expressions for code.
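One caveat about the pattern above: inside a character class, (AND|OR) is read as the individual characters (, A, N, D, |, O, R and ), so inputs such as 1|2 or DAN are accepted as well. A minimal sketch in Java that treats AND and OR as whole words instead, assuming the stack-based bracket check from the question still runs afterwards (the class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

class PhraseTokens {
    // Allowed pieces: a digit, a round bracket, whitespace, or the words AND / OR.
    private static final Pattern ALLOWED =
            Pattern.compile("(?:[0-9()\\s]|AND|OR)+");

    // True when the whole input consists only of allowed pieces.
    // Bracket balance and operator placement are NOT checked here.
    static boolean hasOnlyAllowedTokens(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "1OR2", "((2AND3 AND 1)OR(4AND5))AND6", "())34AND(4", "1|2", "1 XOR 2"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + hasOnlyAllowedTokens(s));
        }
    }
}
```

Because matches() has to cover the whole input, no ^ or $ anchors are needed in the pattern.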

Since you have an arbitrary number of nested elements, it's arguably not possible with regex.
For demonstration purposes only, this matches zero or more conjunctions and one set of parentheses:
^\d+(\s*(?:AND|OR)\s*(\d+|\(?\s*\d+(\s*(?:AND|OR)\s*\d+)\s*\)))*$|^(\d+|\(?\s*\d+(\s*(?:AND|OR)\s*\d+)\s*\))\s*(\s*(?:AND|OR)\s*\d+)*$
That's it. Adding more sets and levels of nested parentheses leads to exponentially increasing complexity, until it breaks altogether.
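For the nesting itself, a plain counter is enough and no regex is needed. A minimal sketch in Java, along the lines of the second validation step mentioned in the question (the names are hypothetical, and only round brackets are considered):

```java
class BracketBalance {
    // Returns true when every '(' has a matching ')' and no ')' comes too early.
    static boolean isBalanced(String input) {
        int depth = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // closing bracket without an opener
                }
            }
        }
        return depth == 0; // anything still open means an unclosed bracket
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("((2AND3 AND 1)OR(4AND5))AND6"));
        System.out.println(isBalanced("())34AND(4"));
    }
}
```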

Related

Is it possible to match nested brackets with a regex without using recursion or balancing groups?

The problem: Match an arbitrarily nested group of brackets in a flavour of regex such as Java's java.util.regex that supports neither recursion nor balancing groups. I.e., match the three outer groups in:
(F(i(r(s)t))) ((S)(e)((c)(o))(n)d) (((((((Third)))))))
This exercise is purely academic, since we all know that regular expressions are not supposed to be used to match these things, just as Q-tips are not supposed to be used to clean ears.
Stack Overflow encourages self-answered questions, so I decided to create this post to share something I recently discovered.

Indeed! It's possible using forward references:
(?=\()(?:(?=.*?\((?!.*?\1)(.*\)(?!.*\2).*))(?=.*?\)(?!.*?\2)(.*)).)+?.*?(?=\1)[^(]*(?=\2$)
Et voila; there it is. That right there matches a full group of nested parentheses from start to end. Two substrings per match are necessarily captured and saved; these are useless to you. Just focus on the results of the main match.
No, there is no limit on depth. No, there are no recursive constructs hidden in there. Just plain ol' lookarounds, with a splash of forward referencing. If your flavour does not support forward references (I'm looking at you, JavaScript), then I'm sorry. I really am. I wish I could help you, but I'm not a freakin' miracle worker.
That's great and all, but I want to match inner groups too!
OK, here's the deal. The reason we were able to match those outer groups is because they are non-overlapping. As soon as the matches we desire begin to overlap, we must tweak our strategy somewhat. We can still inspect the subject for correctly-balanced groups of parentheses. However, instead of outright matching them, we need to save them with a capturing group like so:
(?=\()(?=((?:(?=.*?\((?!.*?\2)(.*\)(?!.*\3).*))(?=.*?\)(?!.*?\3)(.*)).)+?.*?(?=\2)[^(]*(?=\3$)))
Exactly the same as the previous expression, except I've wrapped the bulk of it in a lookahead to avoid consuming characters, added a capturing group, and tweaked the backreference indices so they play nice with their new friend. Now the expression matches at the position just before the next parenthetical group, and the substring of interest is saved as \1.
So... how the hell does this actually work?
I'm glad you asked. The general method is quite simple: iterate through characters one at a time while simultaneously matching the next occurrences of '(' and ')', capturing the rest of the string in each case so as to establish positions from which to resume searching in the next iteration. Let me break it down piece by piece:
Component            Description
(?=\()               Make sure '(' follows before doing any hard work.
(?:                  Start of group used to iterate through the string, so the following lookaheads match repeatedly.
Handle '(':
(?=                  This lookahead deals with finding the next '('.
.*?\((?!.*?\1)       Match up until the next '(' that is not followed by \1. Below, you'll see that \1 is filled with the entire part of the string following the last '(' matched. So (?!.*?\1) ensures we don't match the same '(' again.
(.*\)(?!.*\2).*)     Fill \1 with the rest of the string. At the same time, check that there is at least another occurrence of ')'. This is a PCRE band-aid to overcome a bug with capturing groups in lookaheads.
)                    End of the '(' lookahead.
Handle ')':
(?=                  This lookahead deals with finding the next ')'.
.*?\)(?!.*?\2)       Match up until the next ')' that is not followed by \2. Like the earlier '(' match, this forces matching of a ')' that hasn't been matched before.
(.*)                 Fill \2 with the rest of the string. The above-mentioned bug is not applicable here, so a simple expression is sufficient.
)                    End of the ')' lookahead.
.                    Consume a single character so that the group can continue matching. It is safe to consume a character because neither occurrence of the next '(' or ')' could possibly exist before the new matching point.
)+?                  Match as few times as possible until a balanced group has been found. This is validated by the following check.
Final validation:
.*?(?=\1)            Match up to and including the last '(' found.
[^(]*(?=\2$)         Then match up until the position where the last ')' was found, making sure we don't encounter another '(' along the way (which would imply an unbalanced group).
Conclusion
So, there you have it. A way to match balanced nested structures using forward references coupled with standard (extended) regular expression features - no recursion or balanced groups. It's not efficient, and it certainly isn't pretty, but it is possible. And it's never been done before. That, to me, is quite exciting.
I know a lot of you use regular expressions to accomplish and help other users accomplish simpler and more practical tasks, but if there is anyone out there who shares my excitement for pushing the limits of possibility with regular expressions then I'd love to hear from you. If there is interest, I have other similar material to post.

Brief
Input Corrections
First of all, your input is incorrect as there's an extra parenthesis (as shown below)
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
                                ^
Making appropriate modifications to either include or exclude the additional parenthesis, one might end up with one of the following strings:
Extra parenthesis removed
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
                                ^
Additional parenthesis added to match extra closing parenthesis
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
^
Regex Capabilities
Second of all, this is really only truly possible in regex flavours that include the recursion capability since any other method will not properly match opening/closing brackets (as seen in the OP's solution, it matches the extra parenthesis from the incorrect input as noted above).
This means that for regex flavours that do not currently support recursion (Java, Python, JavaScript, etc.), recursion (or attempts at mimicking recursion) in regular expressions is not possible.
Input
Considering the original input is actually invalid, we'll use the following inputs to test against.
(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))
((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))
Testing against these inputs should yield the following results:
INVALID (no match)
VALID (match)
VALID (match)
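In a flavour without recursion, such as Java's, the same three verdicts can be reproduced without a regex. A rough sketch using a depth counter, which also collects the outermost groups; the class and method names are invented for this example, and an unbalanced input is reported as null:

```java
import java.util.ArrayList;
import java.util.List;

class OuterGroups {
    // Returns the outermost parenthesised groups in order of appearance,
    // or null if the parentheses are unbalanced.
    static List<String> outerGroups(String input) {
        List<String> groups = new ArrayList<>();
        int depth = 0;
        int start = -1; // index of the '(' that opened the current outer group
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return null; // extra closing parenthesis
                }
                if (depth == 0) {
                    groups.add(input.substring(start, i + 1));
                }
            }
        }
        return depth == 0 ? groups : null; // null if something was left open
    }

    public static void main(String[] args) {
        System.out.println(outerGroups("(F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))"));
        System.out.println(outerGroups("(F(i(r(s)t))) ((S)(e)((c)(o))n)d (((((((Third)))))))"));
        System.out.println(outerGroups("((F(i(r(s)t))) ((S)(e)((c)(o))n)d) (((((((Third)))))))"));
    }
}
```

Text outside any group (such as the lone d in the second input) is simply skipped by this sketch.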
Code
There are multiple ways of matching nested groups. The solutions provided below all depend on regex flavours that include recursion capabilities (e.g. PCRE).
Using DEFINE block
(?(DEFINE)
(?<value>[^()\r\n]+)
(?<groupVal>(?&group)|(?&value))
(?<group>(?&value)*\((?&groupVal)\)(?&groupVal)*)
)
^(?&group)$
Note: This regex uses the flags gmx
Without DEFINE block
^(?<group>
(?<value>[^()\r\n]+)*
\((?<groupVal>(?&group)|(?&value))\)
(?&groupVal)*
)$
Note: This regex uses the flags gmx
Without x modifier (one-liner)
^(?<group>(?<value>[^()\r\n]+)*\((?<groupVal>(?&group)|(?&value))\)(?&groupVal)*)$
Without named (groups & references)
^(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*)$
Note: This is the shortest possible method that I could come up with.
Explanation
I'll explain the last regex as it's a simplified and minimal example of all the other regular expressions above it.
^ Assert position at the start of the line
(([^()\r\n]+)*\(((?1)|(?2))\)(?3)*) Capture the following into capture group 1
    ([^()\r\n]+)* Capture the following into capture group 2 any number of times
        [^()\r\n]+ Match any character not present in the set ()\r\n one or more times
    \( Match a left/opening parenthesis character ( literally
    ((?1)|(?2)) Capture either of the following into capture group 3
        (?1) Recurse the first subpattern (1)
        (?2) Recurse the second subpattern (2)
    \) Match a right/closing parenthesis character ) literally
    (?3)* Recurse the third subpattern (3) any number of times
$ Assert position at the end of the line

Regular expressions Groups java

Hi, I am writing a regular expression. I need to get info from this line:
;row:1;field:2;name:3
My regular expression for this line is
.*(row:([0-9]{1,}))?;.*field:([0-9]{1,});.*name:([0-9]{1,});
The problem is that the word row is optional: I can also get the line without the word row, and in that case my regular expression does not work. How can I write it?
Thanks.

If you are just trying to make the word row optional, you can surround it with parentheses and use the question mark to say you want zero or one of the parenthesized contents, as shown below:
(row:)?
This will find zero or one of "row:". I would also recommend adding the ?: operator inside the open parenthesis if you are not trying to create match groups:
(?:row:)?
Adding the ?: operator tells the engine that the parentheses are not a match group and are only used to group the letters. Putting this into your regular expression looks like this:
.*((?:row:)?([0-9]{1,}))?;.*field:([0-9]{1,});.*name:([0-9]{1,});
If you want this applied to "field:" and "name:", then you would get this:
.*((?:row:)?([0-9]{1,}))?;.*(?:field:)?([0-9]{1,});.*(?:name:)?([0-9]{1,});
If you do not want the other parentheses in your expression to create match groups, I would change them too. Also, to shorten the expression a little and make it cleaner, I would change [0-9] to \d, which means any digit, replace {1,} with +, which means "one or more", and remove the unnecessary parentheses, which results in this:
.*(?:(?:row:)?\d+)?;.*(?:field:)?\d+;.*(?:name:)?\d+;
I'm not really sure that this is the final expression that you want, since your question was not very descriptive, but I have shown you how to match zero or one occurrence of "row:" and clean up the expression a little without changing what it matches.
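One thing to watch for: a greedy leading .* followed by an optional group tends to swallow the optional part, so the row number may never end up in a group. A minimal Java sketch that makes the whole row:N; segment optional and reads the three numbers back; it assumes the fields always come in the order row, field, name, and the class and method names are placeholders:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowFieldName {
    // Group 1 (row) is optional; groups 2 (field) and 3 (name) are required.
    private static final Pattern LINE =
            Pattern.compile("(?:row:(\\d+);)?field:(\\d+);name:(\\d+)");

    // Returns {row, field, name}; row is null when the line has no "row:" part.
    // Returns null when the line does not match at all.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse(";row:1;field:2;name:3")));
        System.out.println(Arrays.toString(parse(";field:2;name:3")));
    }
}
```

Using find() instead of matches() avoids the need for any leading or trailing .* in the pattern.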

combine multiple regex to extract sub string from : separated string

I have been stuck for some time developing a single regex to extract a path from any of the following strings:
1. "life:living:fast"
2. "life"
3. ":life"
4. ":life:"
I have these regular expressions to use:
"(.{3,}):", ":(.{3,}):", ":(.{3,})", "(.{3,})"
The first match is all I need, i.e. the desired result for each should be the string located where the word life is. Consider life to be a variable.
But for some reason combining these individual regexes is a pain: if I execute them sequentially I get the word 'life' extracted. However, I am unable to combine them into one.
I appreciate your effort.

If you want the first life with the colons, you can use this:
^:?(?:.{3,}?)(?::|$)
If you prefer the first life without the colons, switch to this:
((?<=^:)|^)([^:]{3,}?)(?=:|$)
How it Works #1: ^:?(?:.{3,}?)(?::|$)
With ^:?, at the beginning of the string, we match an optional colon
(?:.{3,}?) lazily matches three or more chars up to...
(?::|$) a colon or the end of the string
How it Works #2: ((?<=^:)|^)([^:]{3,}?)(?=:|$)
((?<=^:)|^) ensures that we are either positioned at the beginning of the string, or after a colon immediately after the beginning of the string
([^:]{3,}?) lazily matches chars that are not colons...
up to a point where the lookahead (?=:|$) can assert that what follows is a colon or the end of the string.
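In Java, the second pattern could be used roughly as follows. This is only a sketch with invented class and method names; the wanted text is read from group 2, since group 1 is the empty position check:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstSegment {
    // Same pattern as above, with backslash escaping not needed for this one.
    private static final Pattern FIRST =
            Pattern.compile("((?<=^:)|^)([^:]{3,}?)(?=:|$)");

    // Returns the first colon-free segment of three or more characters
    // at the start of the input (optionally after one leading colon), or null.
    static String firstSegment(String input) {
        Matcher m = FIRST.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"life:living:fast", "life", ":life", ":life:"};
        for (String s : inputs) {
            System.out.println(s + " -> " + firstSegment(s));
        }
    }
}
```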

You can use this pattern, since you are looking for the first word:
(?<=^:?)[^:]{3,}
Note that this pattern doesn't check all the string.

Regular Expression - Return all matches as a single match

I'm working with a piece of code that applies a regex to a string and returns the first match. I don't have access to modify the code to return all matches, nor do I have the ability to implement alternative code.
I have the following example target string:
usera,userb,,userc,,userd,usere,userf,
This is a list of comma delimited usernames joined from multiple sources, some of which were blank resulting in two commas in some places. I'm trying to write a regex that will return all of the comma delimited usernames except for specific values.
For example, consider the following expression:
[^,]\w{1,},(?<!(userb|userc|userd),)
This results in three matches:
usera,
usere,
userf,
Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,' ?
If I could write code in any language this would be trivial, but I'm limited to input of only the target string and the pattern, and I need a single match that has all items except for the ones I'm omitting. I'm not sure if this is even possible, everything I've ever done with regex's involves processing multiple items in a match collection.
Here is an example in Regex Coach. This image shows that there are the three matches I want, but my requirement is to have the text in a single match, not three separate matches.
EDIT1:
To clarify this ticket is specifically intended to solve the use case using only regular expression syntax. Solving this problem in code is trivial but solving it using only a regex was the requirement given the fact that the executing code is part of a 3rd party product that I didn't want to reverse engineer, wrap, or replace.

Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,'?
No. Regex matches are consecutive.
A regular expression matches a (sub)string from start to finish. You cannot drop the middle part; that is not how regex engines work. But you can apply the expression again to find another matching substring (incremental search, which is what Regex Coach does). This would result in a match collection.
That being said, you could also just match everything you don't want to keep and remove it, e.g.
,(?=[\s,]+)|(userb|userc|userd)[\s,]*
http://rubular.com/r/LOKOg6IeBa
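Assuming a replace operation is available, which may not hold for the third-party product in the question, the removal approach would look something like this in Java (the class and method names are illustrative only):

```java
class RemoveUsers {
    // Deletes the unwanted user names together with the commas that follow them,
    // plus any comma that is directly followed by whitespace or another comma.
    static String keepOthers(String list) {
        return list.replaceAll(",(?=[\\s,]+)|(userb|userc|userd)[\\s,]*", "");
    }

    public static void main(String[] args) {
        System.out.println(keepOthers("usera,userb,,userc,,userd,usere,userf,"));
    }
}
```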

Validating Regex: Make sure it will match at least two characters

I'm working on an Android app which will allow users to input regex expressions. I'll use the regex to query a database (MOTL), but I don't want to bring down a gigantic list (eg, if a user searches for "." to get every single version of every single MtG card in existence).
The shortest name of any Magic card is "Ow", so I need to allow for a search two characters long (such as, for example, "^Ow$"). I want to be able to validate the regex entered into the text field to ensure that it's going to match at least two characters once I make my query. For example, searching "Ac" is fine, searching "A" is not. "[A-D][e-g]+" is fine, "[F-M]*" is not.
What would be the best way to go about this? I was thinking to iterate through the input, counting +1 for each segment of regex that's guaranteed to match a character (lone characters, character classes, and capture groups, or those followed by + or {n,m}), and +0 for segments which might not match something (characters, character classes and capture groups followed by * or {0,n}).
Would that solution work? Is there any better way? Is this a waste of time, since most users will simply type part or all of the name of a card sitting in front of them?

You might be better off treating the user's input as literal text with a "wildcard" character (e.g. *), so that they don't have to worry about the details of regular expressions and you don't have to solve this problem directly!
For example, your users could enter "force of *" and it would match "Force of Will", "Force of Nature", etc. This way you can just take their input and replace \* with .* and use the regex you build instead of worrying about cleaning up their input.
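A small sketch of that idea in Java, assuming * is the only wildcard and that matching should ignore case; the class and method names are made up, and everything other than * is quoted so regex metacharacters typed by the user are taken literally:

```java
import java.util.regex.Pattern;

class WildcardSearch {
    // Turns user input such as "force of *" into a case-insensitive regex.
    static Pattern toPattern(String userInput) {
        String[] parts = userInput.split("\\*", -1); // -1 keeps trailing empty parts
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' in the input becomes ".*"
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Number of literal (non-wildcard) characters, usable for a minimum-length rule.
    static int literalLength(String userInput) {
        return userInput.replace("*", "").length();
    }

    public static void main(String[] args) {
        Pattern p = toPattern("force of *");
        System.out.println(p.matcher("Force of Will").matches());
        System.out.println(literalLength("force of *"));
    }
}
```

With this design, the two-character minimum from the question reduces to checking that literalLength returns at least 2.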

Here is a very naive brute-force approach: given a regex, compare it against every alphanumeric character, one by one. If it matches any of them, it doesn't meet your minimum length criterion.
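A rough Java sketch of that check, with hypothetical names. It also tries the empty string, which is an addition to the idea above, and it only probes alphanumeric characters, so a pattern that matches a lone punctuation character would slip through:

```java
import java.util.regex.Pattern;

class MinLengthCheck {
    private static final String ALPHANUMERIC =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // True when the regex fully matches the empty string or any single
    // alphanumeric character, i.e. it can match fewer than two characters.
    static boolean canMatchFewerThanTwo(String regex) {
        Pattern p = Pattern.compile(regex); // throws PatternSyntaxException if invalid
        if (p.matcher("").matches()) {
            return true;
        }
        for (char c : ALPHANUMERIC.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {"A", ".", "[F-M]*", "Ac", "[A-D][e-g]+", "^Ow$"};
        for (String s : samples) {
            System.out.println(s + " -> too short: " + canMatchFewerThanTwo(s));
        }
    }
}
```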
