Regex for checking > 1 upper, lower, digit, and special char - java

^.*(?=.{15,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&+=]).*$
This is the regex I am currently using. It checks for at least one of each: uppercase, lowercase, digit, and a special character of my choosing. How do I make it check for two of each? I also ask because it seems difficult to write a test case for this, as I do not know whether it only evaluates the first criterion it needs. This is for a password; however, the requirement is that the check be in regex form because of the package we are using.
EDIT
Well as it stands in my haste to validate the expression I forgot to validate my string length. Thanks to Ken and Gumbo on helping me with this.
I do apologize, as regex is not my area.
The password I am using is the string "$$QiouWER1245". The behavior I am seeing at the moment is that it seems to pass or fail at random. Any thoughts on this?
This is the code I am executing:
Pattern pattern = Pattern.compile(regEx);
Matcher match = pattern.matcher(password);
while (match.find()) {
    System.out.println(match.group());
}
From what I see, if it evaluates to true it gives the value of password back to me; otherwise I get an empty string.

Personally, I think a password policy that forces use of all three character classes is not very helpful. You can get the same degree of randomness by letting people make longer passwords. Users will tend to get frustrated and write passwords down if they have to abide by too many password rules (which make the passwords too difficult to remember). I recommend counting bits of entropy and making sure they're greater than 60 (usually requires a 10-14 character password). Entropy per character would depend roughly on the number of characters, the range of character sets they use, and maybe how often they switch between character sets (I would guess that passwords like HEYthere are more predictable than heYThEre).
Another note: do you plan not to count the symbols to the right of the keyboard (period, comma, angle brackets, etc.)?
If you still have to find groups of two characters, why not just repeat each pattern? For example, make (?=.*\d) into (?=.*\d.*\d).
For your test cases, if you are worried that it would only check the first criterion, write a test case that makes sure each of the following passwords fails (because one and only one of the criteria is not met in each case). Just for fun, I reversed the order in which each character set is expected, though it probably won't make a difference unless someone removes or forgets the ?= at some future date:
!@#TESTwithoutnumbers
TESTwithoutsymbols123
&*(testwithoutuppercase456
+_^TESTWITHOUTLOWERCASE3498
I should point out that technically none of these passwords should be acceptable because they use dictionary words, which have about 2 bits of entropy per character instead of something more like 6. However, I realize that it's difficult to write a (maintainable and efficient) regular expression to check for dictionary words.
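A minimal sketch of such a test in Java, assuming the one-of-each pattern from the question with its quantifiers written out and the length lookahead kept. The class and method names are only illustrative. Each of the four passwords breaks exactly one rule, so each has to be rejected:

```java
import java.util.regex.Pattern;

final class OneOfEachCheck {
    // Assumed form of the question's pattern: at least 15 characters, plus one
    // digit, one lowercase letter, one uppercase letter and one listed symbol.
    static final Pattern POLICY = Pattern.compile(
        "^(?=.{15,})(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&+=]).*$");

    static boolean accepts(String password) {
        // matches() tests the whole input and returns a single boolean.
        return POLICY.matcher(password).matches();
    }

    public static void main(String[] args) {
        String[] mustFail = {
            "!@#TESTwithoutnumbers",        // no digit
            "TESTwithoutsymbols123",        // no listed symbol
            "&*(testwithoutuppercase456",   // no uppercase letter
            "+_^TESTWITHOUTLOWERCASE3498"   // no lowercase letter
        };
        for (String p : mustFail) {
            System.out.println(p + " -> " + accepts(p));
        }
    }
}
```

If any of these is accepted, one of the lookaheads is not being applied.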

Try this:
"^(?=(?:\\D*\\d){2})(?=(?:[^a-z]*[a-z]){2})(?=(?:[^A-Z]*[A-Z]){2})(?=(?:[^!@#$%^&*+=]*[!@#$%^&*+=]){2}).{15,}$"
Here non-capturing groups (?:…) are used to group the conditions and repeat them. I’ve also used the complements of each character class for optimization instead of the universal dot (.).
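To see how this behaves from Java, here is a small sketch. The class name and the sample passwords other than "$$QiouWER1245" are my own assumptions. It uses matches(), which gives one boolean for the whole input and is easier to assert on than looping over find():

```java
import java.util.regex.Pattern;

final class TwoOfEachCheck {
    // Two digits, two lowercase, two uppercase, two listed symbols, 15+ chars.
    static final Pattern POLICY = Pattern.compile(
        "^(?=(?:\\D*\\d){2})"
        + "(?=(?:[^a-z]*[a-z]){2})"
        + "(?=(?:[^A-Z]*[A-Z]){2})"
        + "(?=(?:[^!@#$%^&*+=]*[!@#$%^&*+=]){2})"
        + ".{15,}$");

    static boolean accepts(String password) {
        return POLICY.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("$$QiouWER1245"));    // 13 characters
        System.out.println(accepts("$$QiouWER1245xyz")); // 16 characters, two of each
        System.out.println(accepts("$QiouWER12345xyz")); // 16 characters, one symbol
    }
}
```

Note that "$$QiouWER1245" is only 13 characters long, so it is rejected by the length rule no matter what the character classes say.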

If I understand your question correctly, you want at least 15 characters, and to require at least 2 uppercase characters, at least 2 lowercase characters, at least 2 digits, and at least 2 special characters. In that case you could do it like this:
^.*(?=.{15,})(?=.*\d.*\d)(?=.*[a-z].*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[!@#$%^&*+=].*[!@#$%^&*+=]).*$
BTW, your original regex had an extra backslash before the \d

I'm not sure that one big regex is the right way to go here. It already looks far too complicated and will be very difficult to change in the future.
My suggestion is to structure the code in the following way:
check that the string has 2 lower case characters
return failure if not found or continue
check that the string has 2 upper case characters
return failure if not found or continue
etc.
This will also allow you to pass out a return code or errors string specifying why the password was not accepted and the code will be much simpler.
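A rough sketch of that structure, assuming the same rules as in the question (15+ characters, two of each class) and the symbol set from the patterns above. The names are made up for illustration. Unlike the early-return outline, this version collects every failed rule so the caller can report them all at once:

```java
import java.util.ArrayList;
import java.util.List;

final class StepwisePasswordCheck {
    // Assumed symbol set, taken from the patterns discussed in this thread.
    static final String SPECIALS = "!@#$%^&*+=";

    /** Returns the reasons for rejection; an empty list means accepted. */
    static List<String> problems(String password) {
        int lower = 0, upper = 0, digits = 0, specials = 0;
        for (char c : password.toCharArray()) {
            if (c >= 'a' && c <= 'z') lower++;
            else if (c >= 'A' && c <= 'Z') upper++;
            else if (c >= '0' && c <= '9') digits++;
            else if (SPECIALS.indexOf(c) >= 0) specials++;
        }
        List<String> problems = new ArrayList<>();
        if (password.length() < 15) problems.add("fewer than 15 characters");
        if (lower < 2) problems.add("fewer than 2 lowercase letters");
        if (upper < 2) problems.add("fewer than 2 uppercase letters");
        if (digits < 2) problems.add("fewer than 2 digits");
        if (specials < 2) problems.add("fewer than 2 special characters");
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(problems("$$QiouWER1245"));
        System.out.println(problems("$$QiouWER1245xyz"));
    }
}
```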

Related

Regex expression for comma and dash separated text of items

I have a Java web application where I get some input from the user. Once I have this input I need to parse it, and the parsing depends on what kind of input I get. I decided to use Java's Pattern class for some of the predefined user inputs.
So I need the last 2 regex patterns:
a) Enumeration:
input can be - A03,B24.1,A25.7
The simple way would be to check whether there is a comma in there ([^,]+), but that would require a lot of updates to the parsing function, which I would like to avoid. So, in addition to the comma, it should check if it starts with
letter
minimum 3 letters (combined with numbers)
can have one dot in the word
minimum 1 comma (updated it)
b) Mixed
input can be A03,B24.1-B35.5,A25.7
So everything the Enumeration part has, but with the addition that it can contain a dash, at least one.
I've tried multiple online regex generators but didn't get it right. It would be much appreciated if you could help.
Here is what I have for B24.1-B35.5, that is, for a simple range.
"='.{1}\\d{0,2}-.{1}\\d{0,2}'|='.{1}\\d{1,2}.\\d{1,2}-.{1}\\d{1,2}.\\d{1,2}'";
Edit1: Valid and Invalid inputs
for a) Enumeration
A03,B24.1,A25.7 - valid
A03,B24.1 - valid
A03,B24.1-B25.1 - invalid, because in this case (enumeration) it should not contain a dash
A03 - invalid, because there is no comma
for b) Mixed
everything that an enumeration has, with the addition that it can have a dash too.
You can use this regex for (a) Enumeration part as per your rules:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Rules:
Verifies that each segment starts with a letter
Minimum of three letters or numbers [A-Za-z][A-Za-z0-9]{2,}
Optionally followed by a dot . and one or more letters or digits, i.e. (?:\.[A-Za-z0-9]{1,})?
Same thing repeated, and separated by a comma ,. Also must have at least one comma, so using + i.e. (?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
?: to indicate non-capturing group
Using [A-Za-z0-9] instead of \w to avoid underscores
For (b) Mixed, you haven't shared too many valid and invalid cases, but based on my current understanding here's what I have:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:[,-][A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Note that , from previous regex has been replaced with [,-] to allow - as well!
// Will match
A03,B24.1-B35.5,A25.7
A03,B24.1,A25.7
A03,B24.1-B25.1
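For reference, here is how the two patterns could be wired up in Java. This is only a sketch: the class name and the ITEM helper constant are my own, and `+` is used in place of the equivalent `{1,}`:

```java
import java.util.regex.Pattern;

final class CodeListPatterns {
    // One item: a letter, at least two more letters/digits, optional ".suffix".
    static final String ITEM = "[A-Za-z][A-Za-z0-9]{2,}(?:\\.[A-Za-z0-9]+)?";

    // (a) Enumeration: items separated by commas, at least one comma.
    static final Pattern ENUMERATION = Pattern.compile(ITEM + "(?:," + ITEM + ")+");

    // (b) Mixed: items separated by a comma or a dash, at least one separator.
    static final Pattern MIXED = Pattern.compile(ITEM + "(?:[,-]" + ITEM + ")+");

    static boolean isEnumeration(String s) {
        return ENUMERATION.matcher(s).matches();
    }

    static boolean isMixed(String s) {
        return MIXED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEnumeration("A03,B24.1,A25.7"));
        System.out.println(isEnumeration("A03,B24.1-B25.1"));
        System.out.println(isMixed("A03,B24.1-B35.5,A25.7"));
    }
}
```

Because the mixed pattern only swaps `,` for `[,-]`, it also accepts a plain enumeration and a single range with no comma at all.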
Hope this helps!
EDIT: Making sure each group starts with a letter (and not a number)
Thanks to @diginoise and @anubhava for pointing out! Changed [A-Za-z0-9]{3,} to [A-Za-z][A-Za-z0-9]{2,}
As I said in the comments, I would split the input on commas and verify each segment separately. Your domain, ICD-10-CM codes, is very well defined, and I would be very wary of any input that could be invalid yet pass the validation.
Here is my solution:
regex
([A-TV-Z][0-9][A-Z0-9](\.?[A-Z0-9]{0,4})?)
... however I would avoid that.
Since your domain is (most likely) medical software, people's lives (or at least their well-being) are at stake. Not to mention astronomical damages and the lawyers ever chasing ambulances. Therefore avoid the easy solution, and implement the bomb-proof one.
You could use the regex to establish that given code is definitely not valid. However if a code passes your regex it does not mean that it is valid.
bomb proof method
See this example: O09.7, O09.70, O09.71, O09.72, O09.73 are valid entries, but O09.1 is not valid.
Therefore just get all possible codes. According to this gist there are 42784 different codes. Just load them into memory, and any code which is not in the set is not valid. You could compress the list and be clever about the in-memory encoding to occupy less space, but verbatim all the codes are under 300kB on disk, so a few MB at most in memory. That is not a massive price to pay for people not having the left instead of the right kidney removed.
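A minimal sketch of the lookup idea in Java. The five codes hard-coded here are just the ones mentioned above and stand in for the full list, which a real application would load from a file or database. The class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CodeWhitelist {
    // Tiny stand-in for the complete code list (assumption for illustration).
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
        "O09.7", "O09.70", "O09.71", "O09.72", "O09.73"));

    /** True only if every comma-separated segment is a known code. */
    static boolean allKnown(String input) {
        for (String code : input.split(",")) {
            if (!KNOWN.contains(code.trim())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allKnown("O09.70,O09.71"));
        System.out.println(allKnown("O09.70,O09.1"));
    }
}
```

A regex can still run first as a cheap filter for obviously malformed segments, with the set membership test as the final word.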

Regex to match if string *only* contains *all* characters from a character set, plus an optional one

I ran into a wee problem with Java regex. (I must say in advance, I'm not very experienced in either Java or regex.)
I have a string, and a set of three characters. I want to find out if the string is built from only these characters. Additionally (just to make it even more complicated), two of the characters must be in the string, while the third one is optional.
I do have a solution, my question is rather if anyone can offer anything better/nicer/more elegant, because this makes me cry blood when I look at it...
The set-up
The mandatory characters are: | (pipe) and - (dash).
The string in question should be built from a combination of these. They can be in any order, but both have to be in it.
The optional character is: : (colon).
The string can contain colons, but it does not have to. This is the only other character allowed, apart from the above two.
Any other characters are forbidden.
Expected results
Following strings should work/not work:
"------" = false
"||||" = false
"---|---" = true
"|||-|||" = true
"--|-|--|---|||-" = true
...and...
"----:|--|:::|---::|" = true
":::------:::---:---" = false
"|||:|:::::|" = false
"--:::---|:|---G---n" = false
...etc.
The "ugly" solution
Now, I have a solution that seems to work, based on this stackoverflow answer. The reason I'd like a better one will become obvious when you've recovered from seeing this:
if (string.matches("^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$") || string.matches("^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$")) {
//do funny stuff with a meaningless string
} else {
//don't do funny stuff with a meaningless string
}
Breaking it down
The first regex
"^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$"
checks for all three characters
The next one
"^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$"
check for the two mandatory ones only.
...Yea, I know...
But believe me I tried. Nothing else gave the desired result, but allowed through strings without the mandatory characters, etc.
The question is...
Does anyone know how to do it a simpler / more elegant way?
Bonus question: There is one thing I don't quite get in the regexes above (more than one, but this one bugs me the most):
As far as I understand(?) regular expressions, (?\\|)? should mean that the character | is either contained or not (unless I'm very much mistaken), still in the above setup it seems to enforce that character. This of course suits my purpose, but I cannot understand why it works that way.
So if anyone can explain what I'm missing there, that'd be really great. Besides, I suspect this holds the key to a simpler solution (checking for both mandatory and optional characters in one regex would be ideal).
Thank you all for reading (and suffering ) through my question, and even bigger thanks for those who reply. :)
PS
I did try stuff like ^[\\|\\-(?:\\:)?)]$, but that would not enforce all mandatory characters.
Use a lookahead based regex.
^(?=.*\\|)(?=.*-)[-:|]+$
or
^(?=.*\\|)[-:|]*-[-:|]*$
or
^[-:|]*(?:-:*\\||\\|:*-)[-:|]*$
(?=.*\\|) expects at least one pipe.
(?=.*-) expects at least one hyphen.
[-:|]+ any char from the list one or more times.
$ End of the line.
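Here is a quick Java sketch of the first lookahead pattern, run against the strings from the question. The class and method names are only illustrative:

```java
import java.util.regex.Pattern;

final class PipeDashCheck {
    // At least one pipe, at least one hyphen, and nothing but - : | overall.
    static final Pattern ALLOWED = Pattern.compile("^(?=.*\\|)(?=.*-)[-:|]+$");

    static boolean isValid(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "------", "||||", "---|---", "|||-|||",
            "----:|--|:::|---::|", ":::------:::---:---",
            "|||:|:::::|", "--:::---|:|---G---n"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" = " + isValid(s));
        }
    }
}
```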
Here is a simple answer:
(?=.*\|.*-|.*-.*\|)^([-|:]+)$
This says that the string needs to have a '-' followed by '|', or a '|' followed by a '-', via the look-ahead. Then the string only matches the allowed characters.
Demo: http://fiddle.re/1hnu96
Here is one without lookbefore and -hind.
^(?:[-:|]*\\|[-:|]*-[-:|]*|[-:|]*-[-:|]*\\|[-:|]*)$
This doesn't scale, so Avinash's solution is to be preferred - if your regex system has the lookbe*.

password Regexp doesn't work when split up

I cannot get a regexp that checks if password has at least one digit to work.
This has been answered everywhere but all the answers stop working if split up.
For example in this Working Password Validation if I remove:
(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])
from
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{8,}$
in order to check for the presence of a single digit, the whole thing stops working
I'm new with regular expressions, this seems to make sense but it doesn't, show me the light if you can.
I'm not really sure what you mean by the whole thing stops working.
(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])
All the above does is mandate that:
A lower case letter must appear at least once
An upper-case letter must appear at least once
Any of @#$%^&+= must appear at least once
So, there is no reason that taking them out should break anything--they are essentially independent components.
There are myriad ways to check if a String contains a number. How you want to check really depends on your specific requirements. The method used in the presented regex does this through a positive-look ahead: ^(?=.*[0-9])
^ : begins with
.* : matches 0 or more non-newline characters
[0-9]: is a character class that matches the numbers [0,1,2,...,9]
?= is the positive look ahead, which in this case says to match iff there exists at least one number
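To check this in Java, here is a small sketch of the pattern with the three lookaheads removed. The class name and sample inputs are my own. Keep in mind that what is left still forbids whitespace and still requires at least 8 characters, so a short input fails even when it contains a digit:

```java
import java.util.regex.Pattern;

final class DigitOnlyCheck {
    // Remaining rules: at least one digit, no whitespace, 8 or more characters.
    static final Pattern REDUCED =
        Pattern.compile("^(?=.*[0-9])(?=\\S+$).{8,}$");

    static boolean accepts(String s) {
        return REDUCED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("abcdefg1"));  // digit present, 8 characters
        System.out.println(accepts("abcdefgh"));  // no digit
        System.out.println(accepts("abc1"));      // digit present, but too short
        System.out.println(accepts("abcd efg1")); // contains a space
    }
}
```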
Hope that helped. You can start off with Oracle's Tutorial on Regular Expressions. After you digest that, I'm sure you'll be able to find more advanced resources via Google.

Combining (OR) arbitrary regular expressions

tl;dr Is there a way to OR/combine arbitrary regexes into a single regex (for matching, not capturing) in Java?
In my application I receive two lists from the user:
list of regular expressions
list of strings
and I need to output a list of the strings in (2) that were not matched by any of the regular expressions in (1).
I have the obvious naive implementation in place (iterate over all strings in (2); for each string iterate over all patterns in (1); if no pattern match the string add it to the list that will be returned) but I was wondering if it was possible to combine all patterns into a single one and let the regex compiler exploit optimization opportunities.
The obvious way to OR-combine regexes is obviously (regex1)|(regex2)|(regex3)|...|(regexN) but I'm pretty sure this is not the correct thing to do considering that I have no control over the individual regexes (e.g. they could contain all manners of back/forward references). I was therefore wondering if you can suggest a better way to combine arbitrary regexes in java.
note: it's only implied by the above, but I'll make it explicit: I'm only matching against the string - I don't need to use the output of the capturing groups.
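For clarity, the naive implementation described above looks roughly like this. It is a sketch: the names are illustrative, and it assumes "matched" means a whole-string match via matches() rather than find(). Each pattern is compiled once up front:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class UnmatchedFilter {
    /** Returns the inputs that none of the regexes matches in full. */
    static List<String> unmatched(List<String> regexes, List<String> inputs) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        List<String> result = new ArrayList<>();
        for (String input : inputs) {
            boolean hit = false;
            for (Pattern p : patterns) {
                if (p.matcher(input).matches()) {
                    hit = true;
                    break; // one matching pattern is enough
                }
            }
            if (!hit) {
                result.add(input);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(unmatched(
            Arrays.asList("a+", "(b)\\1"),
            Arrays.asList("aaa", "bb", "c")));
    }
}
```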
Some regex engines (e.g. PCRE) have the construct (?|...). It's like a non-capturing group, but has the nice feature that in every alternation groups are counted from the same initial value. This would probably immediately solve your problem. So if switching the language for this task is an option for you, that should do the trick.
[edit: In fact, it will still cause problems with clashing named capturing groups. In fact, the pattern won't even compile, since group names cannot be reused.]
Otherwise you will have to manipulate the input patterns. hyde suggested renumbering the backreferences, but I think there is a simpler option: making all groups named groups. You can assure yourself that the names are unique.
So basically, for every input pattern you create a unique identifier (e.g. increment an ID). Then the trickiest part is finding capturing groups in the pattern. You won't be able to do this with a regex. You will have to parse the pattern yourself. Here are some thoughts on what to look out for if you are simply iterating through the pattern string:
Take note when you enter and leave a character class, because inside character classes parentheses are literal characters.
Maybe the trickiest part: ignore all opening parentheses that are followed by ?:, ?=, ?!, ?<=, ?<!, ?>. In addition there are the option setting parentheses: (?idmsuxU-idmsuxU) or (?idmsux-idmsux:somePatternHere) which also capture nothing (of course there could be any subset of those options and they could be in any order - the - is also optional).
Now you should be left only with opening parentheses that are either a normal capturing group or a named one: (?<name>. The easiest thing might be to treat them all the same - that is, having both a number and a name (where the name equals the number if it was not set). Then you rewrite all of those with something like (?<uniqueIdentifier-md5hashOfName> (the hyphen cannot be actually part of the name, you will just have your incremented number followed by the hash - since the hash is of fixed length there won't be any duplicates; pretty much at least). Make sure to remember which number and name the group originally had.
Whenever you encounter a backslash there are three options:
The next character is a number. You have a numbered backreference. Replace each of those numbers with \k<name>, where name is the new group name you generated for the group.
The next characters are k<...>. Again replace this with the corresponding new name.
The next character is anything else. Skip it. That handles escaping of parentheses and escaping of backslashes at the same time.
I think Java might allow forward references. In that case you need two passes. Take care of renaming all groups first. Then change all the references.
Once you have done this on every input pattern, you can safely combine all of them with |. Any other feature than backreferences should not cause problems with this approach. At least not as long as your patterns are valid. Of course, if you have inputs a(b and c)d then you have a problem. But you will have that always if you don't check that the patterns can be compiled on their own.
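To illustrate what the rewriting buys you, here is a hand-worked Java example for two input patterns, (a)\1 and (b)\1. The rewritten form and the group names p1g1 and p2g1 are written out by hand as an assumption about what such a rewriter would produce. The parser itself is not shown:

```java
import java.util.regex.Pattern;

final class CombinedBackrefs {
    // Naive OR: in the second branch \1 still refers to group 1, i.e. (a),
    // which never participates when the input starts with 'b'.
    static final Pattern NAIVE =
        Pattern.compile("(?:(a)\\1)|(?:(b)\\1)");

    // After renaming: every group has a unique name and each backreference
    // points at its own group through \k<name>.
    static final Pattern RENAMED =
        Pattern.compile("(?:(?<p1g1>a)\\k<p1g1>)|(?:(?<p2g1>b)\\k<p2g1>)");

    public static void main(String[] args) {
        for (String s : new String[] {"aa", "bb"}) {
            System.out.println(s + ": naive=" + NAIVE.matcher(s).matches()
                + ", renamed=" + RENAMED.matcher(s).matches());
        }
    }
}
```

In Java a backreference to a group that has not participated in the match fails, which is why the naive combination silently stops matching "bb".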
I hope this gave you a pointer in the right direction.

Checking for specific strings with regex

I have a list of arbitrary length of Type String, I need to ensure each String element in the list is alphanumerical or numerical with no spaces and special characters such as - \ / _ etc.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc basically no words.
I’m currently using stringInstance.matches("regex") but not too sure on how to write the appropriate expression
if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true;
else return false;
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression which works but I feel that it may be bad in terms of it being unclear or too complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex
WARNING: “Never” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Unicode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters)
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[\p{Alpha}\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASS compilation flag. It switches the POSIX character classes like \p{Alpha} and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can — but only with the right flag.
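A short Java 7+ sketch of the difference, using "élève" written with Unicode escapes so the source encoding does not matter. The class and field names are illustrative. The flag constant in java.util.regex.Pattern is UNICODE_CHARACTER_CLASS, and (?U) is its inline form:

```java
import java.util.regex.Pattern;

final class UnicodeWordCheck {
    // "élève" with precomposed accented letters.
    static final String WORD = "\u00e9l\u00e8ve";

    // Default \w: ASCII letters, digits and underscore only.
    static final Pattern ASCII_W = Pattern.compile("^\\w+$");

    // Same pattern with Unicode-aware shortcuts, via flag and via (?U).
    static final Pattern UNICODE_W =
        Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
    static final Pattern INLINE_FLAG = Pattern.compile("(?U)^\\w+$");

    // Unicode categories work without any flag.
    static final Pattern CATEGORIES = Pattern.compile("^[\\pL\\pN]+$");

    public static void main(String[] args) {
        System.out.println("ascii \\w:   " + ASCII_W.matcher(WORD).matches());
        System.out.println("unicode \\w: " + UNICODE_W.matcher(WORD).matches());
        System.out.println("(?U) \\w:    " + INLINE_FLAG.matcher(WORD).matches());
        System.out.println("\\pL\\pN:     " + CATEGORIES.matcher(WORD).matches());
    }
}
```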
It is possible to rephrase the description of the needed regex as "contains at least one number", so the following would work: /.*[\pN].*/. Or, if you would like to limit your search to letters, numbers, and punctuation, you should use /[\pL\pN\pP]*[\pN][\pL\pN\pP]*/. I've tested it on your examples and it works fine.
You can further refine your regexp by using lazy quantifiers like this /.*?[\pN].*?/. This way it would fail faster if there are no numbers.
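Here is a sketch of the restricted variant in Java, checked against the sample strings from the question. The class and method names are only illustrative. The slashes are just delimiters and are dropped in the Java string:

```java
import java.util.regex.Pattern;

final class ContainsNumberCheck {
    // Letters, numbers and punctuation only, with at least one number.
    static final Pattern RESTRICTED =
        Pattern.compile("[\\pL\\pN\\pP]*\\pN[\\pL\\pN\\pP]*");

    static boolean accepts(String s) {
        return RESTRICTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "J0hn-132ss/sda", "Hdka349040r38yd", "Hd(ersd)3r4y743-2\\d3",
            "123456789", "Hello", "Joe", "King"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Note that \pP covers punctuation only, so symbols such as $ or + (category \pS) would make a string fail with this variant.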
I would like to recommend a great book on regular expressions: Mastering Regular Expressions. It has a great introduction, an in-depth explanation of how regular expressions work, and a chapter on regular expressions in Java.
It looks like you just want to make sure that there are no spaces in the string. If so, you can do this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).
Here is a partial answer, which does 0-9 and special characters OR 0-9.
^([\d]+|[\\/\-_]*)*$
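This can be read as ((1 or more digits) OR (0 or more of the special characters \ / - _)) 0 or more times. It accepts digit-only strings and mixtures of digits and these special characters. Note that, as written, nothing in it forces a digit to be present, so it also accepts the empty string and strings consisting only of special characters.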
I used regex tester to test several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.
