Set minimum and maximum characters in a regular expression - java

I've written a regular expression that matches any number of letters with any number of single spaces between the letters. I would like that regular expression to also enforce a minimum and maximum number of characters, but I'm not sure how to do that (or if it's possible).
My regular expression is:
[A-Za-z](\s?[A-Za-z])+
I realized it was only matching two sets of letters surrounding a single space, so I modified it slightly to fix that. The original question is still the same though.
Is there a way to enforce a minimum of three characters and a maximum of 30?

Yes
Just as + means "one or more", you can use {3,30} to mean "between 3 and 30 repetitions".
For example, [a-z]{3,30} matches between 3 and 30 lowercase letters.
From the documentation of the Pattern class:
X{n,m} X, at least n but not more than m times
In your case, matching 3 to 30 letters, each followed by a whitespace character, could be accomplished with:
([a-zA-Z]\s){3,30}
That version requires trailing whitespace. If you don't want that, you can use the following (2 to 29 times letter+space, then a final letter):
([a-zA-Z]\s){2,29}[a-zA-Z]
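If you'd like the whitespace to count as characters, each repetition consumes two characters, so you need to divide the counts by 2. One to 14 letter+space pairs plus a final letter gives 3 to 29 characters: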
([a-zA-Z]\s){1,14}[a-zA-Z]
You can add \s? to that last one if the trailing whitespace is optional. These were all tested on RegexPlanet.
If you'd like the entire string to be between 3 and 30 characters in total, you can use a lookahead: add (?=^.{3,30}$) at the beginning of the regex and remove the other size limitations.
All that said, in all honesty I'd probably just test the String's length() method. It's more readable.
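As a rough illustration of that last suggestion, here is a minimal Java sketch that keeps the asker's pattern for the shape of the string and checks the size with length(). The class and method names are made up for this example, and it assumes spaces count toward the 3 to 30 limit.

```java
import java.util.regex.Pattern;

final class NameLengthCheck {
    // The asker's pattern: letters with at most one whitespace character between them.
    private static final Pattern LETTERS_WITH_SINGLE_SPACES =
            Pattern.compile("[A-Za-z](\\s?[A-Za-z])+");

    // Hypothetical helper: the size limit is checked in plain Java, the shape by the regex.
    static boolean isValid(String input) {
        int length = input.length();
        return length >= 3 && length <= 30
                && LETTERS_WITH_SINGLE_SPACES.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("A A")); // true: three characters, single space
        System.out.println(isValid("AB"));  // false: shorter than three characters
    }
}
```

Keeping the length check outside the regex means the limits can change without touching the pattern.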

This is what you are looking for
^[a-zA-Z](\s?[a-zA-Z]){2,29}$
^ is the start of the string
$ is the end of the string
(\s?[a-zA-Z]){2,29} matches (\s?[a-zA-Z]) 2 to 29 times, so together with the leading letter the string contains 3 to 30 letters.
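One detail worth seeing in action: the {2,29} quantifier in this pattern counts letters, not total characters, so optional spaces do not use up the budget. The following Java sketch (class and method names are invented for illustration) shows that behaviour.

```java
import java.util.regex.Pattern;

final class LetterCountCheck {
    // One letter, then 2 to 29 more letters, each optionally preceded by one whitespace character.
    private static final Pattern THREE_TO_THIRTY_LETTERS =
            Pattern.compile("^[a-zA-Z](\\s?[a-zA-Z]){2,29}$");

    static boolean isValid(String input) {
        return THREE_TO_THIRTY_LETTERS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abc"));   // true: three letters
        System.out.println(isValid("a b c")); // true: three letters, five characters
        System.out.println(isValid("a b"));   // false: only two letters
    }
}
```

If spaces should count toward the limit, a lookahead or a separate length check is needed in addition to this pattern.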

Actually Benjamin's answer will lead to the complete solution to the OP's question.
Using lookaheads it is possible to restrict the total number of characters AND restrict the match to a set combination of letters and (optional) single spaces.
The regex that solves the entire problem would become
(?=^.{3,30}$)^([A-Za-z][\s]?)+$
This will match AAA and A A, and it will fail to match AA  A (two consecutive spaces between the letters).
I tested this at http://regexpal.com/ and it does the trick.
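For readers who want to try the lookahead version from Java, here is a small sketch. The class and method names are assumptions made for this example; the pattern is the one quoted in this answer.

```java
import java.util.regex.Pattern;

final class LookaheadLengthCheck {
    // The lookahead caps the total length at 3 to 30 characters; the rest allows
    // letters, each optionally followed by one whitespace character.
    private static final Pattern PATTERN =
            Pattern.compile("(?=^.{3,30}$)^([A-Za-z][\\s]?)+$");

    static boolean isValid(String input) {
        return PATTERN.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("AAA"));   // true
        System.out.println(isValid("A A"));   // true
        System.out.println(isValid("AA  A")); // false: two consecutive spaces
    }
}
```

Note that this pattern also accepts one trailing whitespace character (for example "AA "), because every letter may be followed by an optional \s.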

You should use
[a-zA-Z ]{20}
The square brackets list the allowed characters and the braces set the number of characters. {20} means exactly 20, so use {3,30} for a range of 3 to 30.


Regex format for a particular Match

I am trying to write a regex for the following format
PA-123456-067_TY
It's always PA, followed by a dash, 6 digits, another dash, then 3 digits, and ends with _TY
Apparently, when I write this regex to match the above format it shows the output correctly
^[^[PA]-]+-(([^-]+)-([^_]+))_([^.]+)
with all the Negation symbols ^
This does not work if I write the regex in the below format without negation symbols
[[PA]-]+-(([-]+)-([_]+))_([.]+)
Can someone explain to me why is this so?
The negation symbol means that the character cannot be anything within the specified class. Your regex is much more complicated than it needs to be and is therefore obfuscating what you really want.
You probably want something like this:
^PA-(\d+)-(\d+)_TY$
... which matches anything that starts with PA-, then includes two groups of numbers separated by a dash, then an underscore and the letters TY. If you want everything after the PA to be what you capture, but separated into the three groups, then it's a little more abstract:
^PA-(.+)-(.+)_(.+)$
This matches:
PA-
a capture group of any characters
a dash
another capture group of any characters
an underscore
all the remaining characters until end-of-line
A character class [...] means "match any single character in the list", so your first capture group (([^-]+)-([^_]+)) looks for one or more characters that aren't a dash, followed by a dash (which is fine), followed by one or more characters that aren't an underscore (again fine). The extra set of parentheses around that creates another capture group (group 1, since groups are numbered by their opening parenthesis). That part is OK, but it probably makes interpreting the result less intuitive in this case.
In the rewrite, however, your first capture group (([-]+)-([_]+)) matches [-]+, which means "one or more dashes", followed by a dash, followed by one or more underscores, followed by an underscore. Since your input does not have a dash immediately following PA-, the entire regex fails to find anything.
Putting the PA inside nested character classes also complicates things. I'm not actually sure how [^[PA]-]+ is interpreted in practice, but I suspect it's something like "one or more characters that are not a P, an A or a dash". The second one looks for the opposite, I think. But you don't want any of that; you just want to start with the actual sequence of characters you care about, which is PA-.
Update: As per the clarifications in the comments on the original question, knowing you want fixed-size groups of digits, it would look like this:
^PA-(\d{6})-(\d{3})_TY$
That captures PA-, then a 6-digit number, then a dash, then a 3-digit number, then _TY. The six digit number and 3 digit numbers will be in capture groups 1 and 2, respectively.
If the sizes of those numbers could ever change, replace the {6} and {3} quantifiers with + to capture numbers of any length.
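To show how the two capture groups are read back in Java, here is a minimal sketch built on the fixed-width pattern from this answer. The class name, the method name and the choice of returning an empty array on a mismatch are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PaCodeParser {
    private static final Pattern PA_CODE = Pattern.compile("^PA-(\\d{6})-(\\d{3})_TY$");

    // Returns {sixDigits, threeDigits}, or an empty array when the input does not match.
    static String[] parse(String code) {
        Matcher m = PA_CODE.matcher(code);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", parse("PA-123456-067_TY"))); // 123456,067
    }
}
```

The groups are returned as strings so that leading zeros such as the one in 067 are preserved.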
According to your comment, this would be appropriate: PA-\d{6}-\d{3}_TY
EDIT: If you want to match a whole line, use it with anchors: ^PA-\d{6}-\d{3}_TY$

Java Matcher slow regex

This is a very simple regex, yet it runs for over 30 seconds on a very short string (i7 3970K @ 3.4 GHz):
Pattern compile = Pattern.compile("^(?=[a-z0-9-]{1,63})([a-z0-9]+[-]{0,1}){1,63}[a-z0-9]{1}$");
Matcher matcher = compile.matcher("test-metareg-rw40lntknahvpseba32cßáàâåäæç.nl");
boolean matches = matcher.matches(); //Takes 30+ seconds
The first part, the (?=...) look-ahead, is an assertion that the string contains at most 63 of these characters.
The second part is an assertion that the string follows the syntax; in this case it should prevent -- and make sure the string ends in [a-z0-9].
I tried to guess your intention but it was not easy:
(?=[a-z0-9-]{1,63}) this look-ahead seems intended to require that the next (up to 63) characters are lowercase ASCII letters, digits or hyphens, but in fact it succeeds even if there is only one such character followed by anything. So maybe you meant (?=[a-z0-9-]{1,63}$), which forbids anything else after the legal characters.
You seem to want groups of at least one letter or digit between the hyphens, but you made the - optional. That creates no real constraint and allows far too many ways to split the same input, which is what makes your expression so slow. You can simply say: ([a-z0-9]++-){0,63}[a-z0-9]+. Each repetition of the group requires at least one letter or digit followed by a hyphen. The expression at the end requires at least one letter or digit and matches the last group, which has no following -. That last group may also be the only one if your text contains no - at all.
Putting it all together, your regex becomes: (?=[a-z0-9-]{1,63}$)([a-z0-9]++-){0,63}[a-z0-9]+. Note that you don't need a leading ^ or trailing $ if you use the matches method; it already implies that the string bounds must match the expression bounds.
I hope I got your intention right…
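The remark about matches() can be checked with a short Java sketch. It uses the unanchored regex from this answer and contrasts matches(), which must cover the whole input, with find(), which accepts a match anywhere. The class and method names are invented for the example, and the sample input is a shortened ASCII-only variant of the string in the question.

```java
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // The regex from the answer above, deliberately without a leading ^ or trailing $.
    private static final Pattern LABEL =
            Pattern.compile("(?=[a-z0-9-]{1,63}$)([a-z0-9]++-){0,63}[a-z0-9]+");

    // matches() requires the entire input to match the pattern.
    static boolean wholeInputMatches(String s) {
        return LABEL.matcher(s).matches();
    }

    // find() succeeds if any part of the input matches the pattern.
    static boolean containsMatch(String s) {
        return LABEL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("test-metareg"));    // true
        System.out.println(wholeInputMatches("test-metareg.nl")); // false: '.' is not allowed
        System.out.println(containsMatch("test-metareg.nl"));     // true: "nl" matches on its own
    }
}
```

So the anchors only become necessary when the pattern is used with find() or in a tool that searches within the text.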
I have fixed this regex replacing it as follows:
^(?=[a-z0-9-]{1,63})([a-z0-9]{0,1}|[-]{0,1}){1,63}[a-z0-9]{1}$
The section ([a-z0-9]+[-]{0,1}){1,63} became: ([a-z0-9]{0,1}|[-]{0,1}){1,63}
If you want to make sure that there is no -- in your string just use negative look ahead (?!.*--).
Also there is no point in writing {1}.
Another thing is if you want to ensure that string has max 63 characters then in your look-ahead you need to add $ at the end (?=[a-z0-9-]{1,63}$).
So maybe ^(?=[a-z0-9-]{1,63}$)(?!.*--)[a-z0-9-]+[a-z0-9]$
I think, from what you say, your regex can be simplified to this:
Edit - (For posterity) After reading @Holger's post, I am changing this to fix possible catastrophic backtracking and to speed it up. As my benchmarks show, this is possibly the fastest way to do it.
# ^(?=[a-z0-9-]{1,63}$)[a-z0-9]++(?:-[a-z0-9]+)*+$
^ # BOL
(?= [a-z0-9-]{1,63} $ ) # max 1 - 63 of these characters
[a-z0-9]++ (?: - [a-z0-9]+ )*+ # consume the characters in this order
$ # EOL
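A minimal Java sketch of that final pattern follows, with the free-spacing layout collapsed back into one line. The class and method names are hypothetical, and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;

final class HostLabelCheck {
    // Lookahead: 1 to 63 allowed characters up to the end of the input.
    // Body: runs of letters/digits separated by single hyphens, with possessive
    // quantifiers so the engine never retries alternative splits.
    private static final Pattern LABEL =
            Pattern.compile("^(?=[a-z0-9-]{1,63}$)[a-z0-9]++(?:-[a-z0-9]+)*+$");

    static boolean isValid(String label) {
        return LABEL.matcher(label).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("test-metareg"));    // true
        System.out.println(isValid("test--metareg"));   // false: consecutive hyphens
        System.out.println(isValid("test-metareg.nl")); // false: '.' is not allowed
    }
}
```

Because each hyphen must be followed by at least one letter or digit, the pattern also rejects leading and trailing hyphens without any extra assertion.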

How to combine these regex for javascript

Hi, I am trying to use a regex in JS to identify 3 identical consecutive characters (these could be letters, numbers or any non-alphanumeric characters).
This identifies 3 identical consecutive alphabets and numbers : '(([0-9a-zA-Z])\1\1)'
This identifies 3 identical consecutive non alphanumerics : '(([^0-9a-zA-Z])\1\1)'
I am trying to combine both, like this : '(([0-9a-zA-Z])\1\1)|(([^0-9a-zA-Z])\1\1)'
But I am doing something wrong and it's not working (it returns true for '88aa3BBdd99##').
Edit : And to find NO 3 identical characters, this seems to be wrong /(^([0-9a-zA-Z]|[^0-9a-zA-Z])\1\1)/ --> RegEx in JS to find No 3 Identical consecutive characters
thanks
Nohsib
The problem is that capturing groups are numbered from left to right across the whole regex, in the order of their opening parentheses, and backreferences use those numbers. So if you combine the two expressions, the numbers change:
(([0-9a-zA-Z])\2\2)|(([^0-9a-zA-Z])\4\4)
You could also remove the outer parens:
([0-9a-zA-Z])\1\1|([^0-9a-zA-Z])\2\2
Or you could just capture the alternatives in one set of parens together and append one back-reference to the end:
([0-9a-zA-Z]|[^0-9a-zA-Z])\1\1
But since your character classes match all characters anyway you can have that like this as well:
([\s\S])\1\1
And if you activate the DOTALL or SINGLELINE option (the s flag in JavaScript engines that support it), you can use a . instead:
(.)\1\1
It's actually much simpler:
(.)\1\1
The (.) matches any character, and each \1 is a back reference that matches the exact string that was matched by the first capturing group. You should be aware of what the . actually matches and then modify the group (in the parentheses) to fit your exact needs.
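Group numbering works the same way in Java's regex engine, so the renumbered and simplified patterns can be tried there. The sketch below is written in Java rather than JavaScript, and its class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class TripleCharCheck {
    // Combined pattern with renumbered backreferences: groups 2 and 4 hold the repeated character.
    private static final Pattern NUMBERED =
            Pattern.compile("(([0-9a-zA-Z])\\2\\2)|(([^0-9a-zA-Z])\\4\\4)");

    // Simplified pattern: any character followed by two copies of itself.
    private static final Pattern SIMPLE = Pattern.compile("(.)\\1\\1");

    static boolean hasTripleNumbered(String s) {
        return NUMBERED.matcher(s).find();
    }

    static boolean hasTripleSimple(String s) {
        return SIMPLE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasTripleNumbered("88aa3BBdd99##")); // false: no character repeats three times
        System.out.println(hasTripleNumbered("ab###"));         // true
        System.out.println(hasTripleSimple("a111b"));           // true
    }
}
```

In Java, as in JavaScript without the s flag, the . in the simplified pattern does not match line terminators, which only matters if the input can contain repeated newlines.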

Regular Expression of a Specific Word

I want to create a regular expression in java using standard libraries that will accommodate the following sentence:
12 of 128
Obviously the numbers can be anything though... From 1 digit to many
Also, I'm not sure how to accommodate the word "of" but I thought maybe something along the lines of:
[\d\sof\s\d]
This should work for you:
(\d+\s+of\s+\d+)
This will assume that you want to capture the full block of text as "one group", and there can be one-or-more whitespace characters in between each (if only one space, you can change \s+ to just \s).
If you want to capture the numbers separately, you can try:
(\d+)\s+of\s+(\d+)
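Here is a small Java sketch of reading the two numbers out of that pattern. The class and method names are made up, and parsing into long is an assumption: digit runs too long for a long would make Long.parseLong throw a NumberFormatException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OfParser {
    private static final Pattern N_OF_M = Pattern.compile("(\\d+)\\s+of\\s+(\\d+)");

    // Returns {first, second}, or an empty array when the whole input is not "N of M".
    // Assumption: both numbers fit into a long.
    static long[] parse(String text) {
        Matcher m = N_OF_M.matcher(text);
        if (!m.matches()) {
            return new long[0];
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        long[] parts = parse("12 of 128");
        System.out.println(parts[0] + " / " + parts[1]); // 12 / 128
    }
}
```

Using matches() here means the entire string must have this shape; to locate the phrase inside a longer sentence, find() would be the call to use instead.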
You want this:
\d+\sof\s\d+
The relevant change from what you already had is the addition of the two plus signs. Each + means that the preceding \d must match one or more digits.
Sample: http://regexr.com?32cao
This regexp
"\\d+ of \\d+"
will match at least one to any number of digits, followed by string " of " followed by one to any number of digits.

Regular Expression to match one or more digits 1-9, one '|', one or more '*" and zero or more ','

I'm new to regular expressions and I need to find a regular expression that matches one or more digits [1-9] only ONE '|' sign, one or more '*' sign and zero or more ',' sign.
The string should not contain any other characters.
This is what I have:
if (this.ruleString.matches("^[1-9|*,]*$"))
{
    return true;
}
Is it correct?
Thanks,
Vinay
I think you should test separately for each type of symbol rather than write one complex expression.
First, test that there are no invalid symbols: "^[1-9|*,]+$"
Then test for digits with "[1-9]"; it should match at least once.
Then test for "\\|", "\\*" and "," and check the number of matches.
If all tests pass, then your string is valid.
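The separate checks described above could look roughly like this in Java. It is a sketch under the assumption that the order of the symbols does not matter; the class and method names are invented, and the character counting uses String.chars() instead of repeated regex searches.

```java
import java.util.regex.Pattern;

final class RuleStringCheck {
    // Only digits 1-9, '|', '*' and ',' may appear, and the string must not be empty.
    private static final Pattern ALLOWED = Pattern.compile("[1-9|*,]+");

    static boolean isValid(String rule) {
        if (!ALLOWED.matcher(rule).matches()) {
            return false;
        }
        long bars = rule.chars().filter(c -> c == '|').count();
        boolean hasDigit = rule.chars().anyMatch(c -> c >= '1' && c <= '9');
        boolean hasStar = rule.indexOf('*') >= 0;
        // Exactly one '|', at least one digit, at least one '*'; commas are optional.
        return bars == 1 && hasDigit && hasStar;
    }

    public static void main(String[] args) {
        System.out.println(isValid("1|*"));  // true
        System.out.println(isValid("*|1"));  // true: order is not checked here
        System.out.println(isValid("1||*")); // false: two '|' signs
    }
}
```

If the symbols must appear in a fixed order, a single ordered pattern is the simpler choice.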
Nope, try this:
"^[1-9]+\\|\\*+,*$"
Please give us at least 10 strings that you want to accept and 10 that you want to reject, and tell us whether the symbols have to appear in a fixed sequence or whether their order doesn't matter. Then we can write a reliable regex.
For now, all I can offer is:
^[1-9]+\|{1}\*+,*$
This RegEx was tested against these sample strings, accepting them:
56421|*****,,,
2|*********,,,
1|*
7|*,
18|****
123456789|*
12|********,,
1516332|**,,,
111111|*
6|*****,,,,
And it was tested against these sample strings, rejecting them:
10|*,
2***525*|*****,,,
123456,15,22*66*****4|,,,*167
1|2*3,4,5,6*
,*|173,
|*,
||12211
12
1|,*
1233|54|***,,,,
I assume your given order is strict and all conditions apply at the same time.
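For anyone who wants to rerun a few of those samples from Java, here is a minimal sketch of the strict-order pattern, written without the redundant {1}. The class and method names are assumptions for this example.

```java
import java.util.regex.Pattern;

final class StrictRuleCheck {
    // One or more digits 1-9, exactly one '|', one or more '*', then zero or more ','.
    private static final Pattern STRICT = Pattern.compile("^[1-9]+\\|\\*+,*$");

    static boolean isValid(String rule) {
        return STRICT.matcher(rule).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("56421|*****,,,")); // true
        System.out.println(isValid("10|*,"));          // false: 0 is not in [1-9]
        System.out.println(isValid("1|,*"));           // false: ',' before '*'
    }
}
```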
It looks like the pattern you need is
n-n, one or more times, separated by commas
then a bar (|)
then n*n, one or more times, separated by commas.
Here is a regular expression for that.
([1-9]{1}[0-9]*\-[0-9]+){1}
(,[1-9]{1}[0-9]*\-[0-9]+)*
\|
([1-9]{1}[0-9]*\*[0-9]+){1}
(,[1-9]{1}[0-9]*\*[0-9]+)*
But it is complex, and it does not take into account details such as whether, for n-m, you want n to be less than m (I guess you do).
You likely also want the same number of n-m items before the bar as x*y items after the bar.
It depends on whether you want to check the syntax completely or not (I hope you do).
Since this is so complex, it should be done in code rather than with a single regular expression.
This regex should work:
"^[1-9\\|\\*,-]*$"
Assert position at the beginning of the string «^»
Match a single character present in the list below «[1-9\|*,-]»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
A character in the range between “1” and “9” «1-9»
A | character «\|»
A * character «*»
The character “,” «,»
The character “-” «-»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
