Consider the following regular expression, where X is any regex.
X{n}|X{m}
This regex would test for X occurring exactly n or m times.
Is there a regex quantifier that can test for an occurrence X exactly n or m times?
There is no single quantifier that means "exactly m or n times". The way you are doing it is fine.
An alternative is:
X{m}(X{k})?
where m < n and k is the value of n-m.
Here is the complete list of quantifiers (ref. http://www.regular-expressions.info/reference.html):
?, ?? - 0 or 1 occurences (?? is lazy, ? is greedy)
*, *? - any number of occurences
+, +? - at least one occurence
{n} - exactly n occurences
{n,m} - n to m occurences, inclusive
{n,m}? - n to m occurences, lazy
{n,}, {n,}? - at least n occurence
To get "exactly N or M", you need to write the quantified regex twice, unless m,n are special:
X{n,m} if m = n+1
(?:X{n}){1,2} if m = 2n
...
No, there is no such quantifier. But I'd restructure it to /X{m}(X{m-n})?/ to prevent problems in backtracking.
Very old post, but I'd like to contribute sth that might be of help.
I've tried it exactly the way stated in the question and it does work but there's a catch:
The order of the quantities matters. Consider this:
#[a-f0-9]{6}|#[a-f0-9]{3}
This will find all occurences of hex colour codes (they're either 3 or 6 digits long). But when I flip it around like this
#[a-f0-9]{3}|#[a-f0-9]{6}
it will only find the 3 digit ones or the first 3 digits of the 6 digit ones. This does make sense and a Regex pro might spot this right away, but for many this might be a peculiar behaviour. There are some advanced Regex features that might avoid this trap regardless of the order, but not everyone is knee-deep into Regex patterns.
TLDR; (?<=[^x]|^)(x{n}|x{m})(?:[^x]|$)
Looks like you want "x n times" or "x m times", I think a literal translation to regex would be (x{n}|x{m}).
Like this https://regex101.com/r/vH7yL5/1
or, in a case where you can have a sequence of more than m "x"s (assuming m > n), you can add 'following no "x"' and 'followed by no "x", translating to [^x](x{n}|x{m})[^x] but that would assume that there are always a character behind and after you "x"s. As you can see here: https://regex101.com/r/bB2vH2/1
you can change it to (?:[^x]|^)(x{n}|x{m})(?:[^x]|$), translating to "following no 'x' or following line start" and "followed by no 'x' or followed by line end". But still, it won't match two sequences with only one character between them (because the first match would require a character after, and the second a character before) as you can see here: https://regex101.com/r/oC5oJ4/1
Finally, to match the one character distant match, you can add a positive look ahead (?=) on the "no 'x' after" or a positive look behind (?<=) on the "no 'x' before", like this: https://regex101.com/r/mC4uX3/1
(?<=[^x]|^)(x{n}|x{m})(?:[^x]|$)
This way you will match only the exact number of 'x's you want.
Taking a look at Enhardened's answer, they state that their penultimate expression won't match sequences with only one character between them. There is an easy way to fix this without using look ahead/look behind, and that's to replace the start/end character with the boundary character. This lets you match against word boundaries which includes start/end. As such, the appropriate expression should be:
(?:[^x]|\b)(x{n}|x{m})(?:[^x]|\b)
As you can see here: https://regex101.com/r/oC5oJ4/2.
Related
I maybe miss something, but i'd like to know why this pattern is valid :
Pattern.compile("([0-9]{1,})");
The compile does not throw an exception despite that the occurence is not valid.
Thx a lot
despite that the occurence is not valid
quantifiers can be represented using {n,m} syntax where:
{n} - exactly n times
{n,} - at least n times
{n,m} - at least n but not more than m times
Source: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
(notice that there is no {,m} quantifier representing "not more than m times") because we don't really need one, we can create it via {0,m})
So {1,} is valid and simply means "at least once".
To avoid confusion we can simplify it by replacing {1,} quantifier with more known form + like
Pattern.compile("([0-9]+)");
Most probably you also don't need to create capturing group 1 by surrounding [0-9]+ with parenthesis. We can access what regex matched with help of group 0, so such group 1 is redundant in most cases.
Pattern p = Pattern.compile("(?=[1-9][0-9]{2})[0-9]*[05]");
Matcher m = p.matcher("101");
while(m.find()){
System.out.println(m.start()+":"+ m.end()+ m.group());
}
Output------ >> 0:210
Please let me know why I am getting output of m.group() as 10 here.
As far as I understand m.group() should return nothing because [05] matches to nothing.
Your Pattern, (?=[1-9][0-9]{2})[0-9]*[05] consists of 2 parts:
(?=[1-9][0-9]{2})
and
[0-9]*[05]
The first part is a zero-width positive lookahead which searches for a number of length 3, and the first can not be 0. This matches your 101.
The second part searches for any amount of numbers and then a 0 or a 5. This matches the first 2 characters of 101, thus the result is 10.
See Java - Pattern for more information.
What your Regex is looking for is:
[1-9]:
match a single character present in the list below
1-9 a single character in the range between 1 and 9
[0-9]{2}:
match a single character present in the list below
Quantifier: {2} Exactly 2 times
0-9 a single character in the range between 0 and 9
[0-9]*:
match a single character present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
0-9 a single character in the range between 0 and 9
[05]:
match a single character present in the list below
05 a single character in the list 05 literally
for the String "101" this nacht the first 2 chars 101,
so you are printing.out:
System.out.println(**m.start()**+":"+ **m.end()**+ m.group());
where m.start() returns the start index of the previous match(char at
0). where m.end() returns the offset after the last character matched.
and where m.group() returns the input subsequence matched by the previous
match.
That regex was meant to match a number that's a multiple of 5 and greater than or equal to 100, but it's useless without anchors. It should be:
^(?=[1-9][0-9]{2}$)[0-9]*[05]$
The anchors ensure that both the lookahead and the main part are examining the whole of the string. But the task doesn't require a lookahead anyway; this works just fine:
^[1-9][0-9][05]$
As #AlanMoore states, there has to be an alignment.
Assertions are a self contained entity, all they have to do is Pass
to advance to the next construct.
Lets see what (?=[1-9][0-9]{2}) matches;
1111111110666
2222222222222222225666
33333333333333333333333330666
So far so good, on to the next construct.
Lets see what [0-9]*[05] matches.
What ever this matches is the final answer.
1111111110666
2222222222222222225666
33333333333333333333333330666
What to learn is that to get a cohesive answer, assertions have to be crafted to
coincide with constructs that come after them.
Here is an example of a constraint that could be applied
after the assertion.
The assertion state's it needs three digits and the first digit must be >= 1.
The constructs after the assertion state it can be any number of digit's,
as long as it ends with a 0 or 5.
This last part is distressing since it will match only the 500000
So for sure, you need at least three digits.
That can be done like this:
[0-9]{2,}[05]
This says two things
There must be at least three digits, but can be more
It must end with a 0 or 5.
That's it, put it all together, its:
(?=[1-9][0-9]{2})[0-9]{2,}[05]
Of course, this can be condensed to;
[1-9][0-9]+[05]
I want to know if there is a way to check if a string contains a certain pattern for a regex.
For example:
string.matches("something[0-9]x") would check if the string contains a substring of "something" with any single digit integer following it followed by "x". But lets say if I want to check the same thing, but there is no limit for that int, ie it could be 1000000. Is there like a wildcard for an int that I can use?
Just use modifier + after your character class which match the preceding token one or more time :
string.matches("something[0-9]+x")
Regular expressions work on characters; they have no semantic understanding of those characters. So it doesn't make sense to talk about "integers" here; the best that you can do is to talk about "digits". The number "1" is one digit; "1234" is four.
In a regular expression, you can match one or more of the preceding pattern using "+", so the regex "something[0-9]+x" should do what you want. If you want an upper bound on the number of digits, than you can try something like "something[0-9]{1,5}x"
Yes, simply use *, so in your example string.matches("something[0-9]+x")
It would match a string something followed by any digit from 0 to 9, which have to occur at least one time, so * means zero or more times, while + means it have to occur at least one time but can occur more times if it wants.
If you do [0-9]{n,m} you can specify with m and n in which range it can occur for example:
[0-9]{2,3} will match any digit and it have to occur 2 or 3 times, if you only use one digit in this bracs [0-9]{2} it has to occur at least 2 times.
But at last: simply learn to use google ... there are so many regexp sites with tutorials and stuff.
I've written a regular expression that matches any number of letters with any number of single spaces between the letters. I would like that regular expression to also enforce a minimum and maximum number of characters, but I'm not sure how to do that (or if it's possible).
My regular expression is:
[A-Za-z](\s?[A-Za-z])+
I realized it was only matching two sets of letters surrounding a single space, so I modified it slightly to fix that. The original question is still the same though.
Is there a way to enforce a minimum of three characters and a maximum of 30?
Yes
Just like + means one or more you can use {3,30} to match between 3 and 30
For example [a-z]{3,30} matches between 3 and 30 lowercase alphabet letters
From the documentation of the Pattern class
X{n,m} X, at least n but not more than m times
In your case, matching 3-30 letters followed by spaces could be accomplished with:
([a-zA-Z]\s){3,30}
If you require trailing whitespace, if you don't you can use: (2-29 times letter+space, then letter)
([a-zA-Z]\s){2,29}[a-zA-Z]
If you'd like whitespaces to count as characters you need to divide that number by 2 to get
([a-zA-Z]\s){1,14}[a-zA-Z]
You can add \s? to that last one if the trailing whitespace is optional. These were all tested on RegexPlanet
If you'd like the entire string altogether to be between 3 and 30 characters you can use lookaheads adding (?=^.{3,30}$) at the beginning of the RegExp and removing the other size limitations
All that said, in all honestly I'd probably just test the String's .length property. It's more readable.
This is what you are looking for
^[a-zA-Z](\s?[a-zA-Z]){2,29}$
^ is the start of string
$ is the end of string
(\s?[a-zA-Z]){2,29} would match (\s?[a-zA-Z]) 2 to 29 times..
Actually Benjamin's answer will lead to the complete solution to the OP's question.
Using lookaheads it is possible to restrict the total number of characters AND restrict the match to a set combination of letters and (optional) single spaces.
The regex that solves the entire problem would become
(?=^.{3,30}$)^([A-Za-z][\s]?)+$
This will match AAA, A A and also fail to match AA A since there are two consecutive spaces.
I tested this at http://regexpal.com/ and it does the trick.
You should use
[a-zA-Z ]{20}
[For allowed characters]{for limiting of the number of characters}
From the Pattern javadocs:
Greedy quantifiers:
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
Reluctant quantifiers:
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
The description of what they do is the same...so, what is the difference?
I would really appreciate some examples.
I am coding in Java, but I hear this concept is the same for most modern regex implementations.
A greedy operator always try to "grab" as much of the input as possible, while a reluctant quantifier will match as little of the input as possible and still create a match.
Example:
"The red fox jumped over the red fence"
/(.*)red/ => \1 = "The red fox jumped over the "
/(.*?)red/ => \1 = "The "
"aaa"
/a?a*/ => \1 = "a", \2 = "aa"
/a??a*/ => \1 = "", \2 = "aaa"
"Mr. Doe, John"
/^(?:Mrs?.)?.*\b(.*)$/ => \1 = "John"
/^(?:Mrs?.)?.*?\b(.*)$/ => \1 = "Doe, John"
From this link, where the tutorial author acknowledges the spirit of your question:
At first glance it may appear that
the quantifiers X?, X?? and X?+ do
exactly the same thing, since they all
promise to match "X, once or not at
all". There are subtle implementation
differences which will be explained
near the end of this section.
They go on to put together examples and offer the explanation:
Greedy quantifiers are considered
"greedy" because they force the
matcher to read in, or eat, the entire
input string prior to attempting the
first match. If the first match
attempt (the entire input string)
fails, the matcher backs off the input
string by one character and tries
again, repeating the process until a
match is found or there are no more
characters left to back off from.
Depending on the quantifier used in
the expression, the last thing it will
try matching against is 1 or 0
characters.
The reluctant quantifiers, however,
take the opposite approach: They start
at the beginning of the input string,
then reluctantly eat one character at
a time looking for a match. The last
thing they try is the entire input
string.
And for extra credit, the possessive explanation:
Finally, the possessive quantifiers
always eat the entire input string,
trying once (and only once) for a
match. Unlike the greedy quantifiers,
possessive quantifiers never back off,
even if doing so would allow the
overall match to succeed.
A greedy quantifier will match as much as possible and still get a match
A reluctant quantifier will match the smallest amount possible.
for example given the string
abcdef
the greedy qualifier
ab[a-z]*[a-z] would match abcdef
the reluctant qualifier
ab[a-z]*?[a-z] would match abc
say you have a regex "a\w*b", and use it on "abab"
Greedy matching will match "abab" (it looks for an a, as much occurrences of \w as possible, and a b) and reluctant matching will match just "ab" (as little \w as possible)
There is documentation on how Perl handles these quantifiers perldoc perlre.
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":
*? Match 0 or more times, not greedily
+? Match 1 or more times, not greedily
?? Match 0 or 1 time, not greedily
{n}? Match exactly n times, not greedily
{n,}? Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.
*+ Match 0 or more times and give nothing back
++ Match 1 or more times and give nothing back
?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
For instance,
'aaaa' =~ /a++a/
will never match, as the a++ will gobble up all the a 's in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not help. See the independent subexpression (?>...) for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/