Regex for university emails - java

I am looking to validate email addresses by making sure they have a specific university subdomain, e.g. if the user says they attend Oxford University, I want to check that their email ends in .ox.ac.uk
If I have the '.ox.ac.uk' part stored as a variable, how can I incorporate this with a regex to check the whole email is valid and ends in that variable suffix?
Many thanks!

We are using this email pattern (derived from this regular-expressions.info article):
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$
You should be able to extend it with your required suffix:
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+ox\.ac\.uk$
Note that I replaced the TLD part [a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])? with your required suffix ox\.ac\.uk (\. matches a literal dot only). The leading dot of .ox.ac.uk is already consumed by the \. at the end of the preceding group, so it must not be repeated in the suffix.
Edit: one additional note: if you use String#matches(...) or Matcher#matches() there's no need for the leading ^ and the trailing $, since the entire string would have to match anyways.
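Since the question keeps the suffix in a variable, one way to combine the two is to build the pattern at runtime and escape the suffix with Pattern.quote. The following is a minimal sketch rather than the answerer's exact code: the class and method names are made up for illustration, and the domain part is restructured slightly so that the stored suffix can keep its leading dot.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of the answer above.
class UniversityEmailCheck {
    // Local part, taken from the pattern above.
    private static final String LOCAL =
            "[\\w!#$%&'*+/=?^`{|}~-]+(?:\\.[\\w!#$%&'*+/=?^`{|}~-]+)*";
    // One domain label: alphanumeric, hyphens allowed only in the middle.
    private static final String LABEL =
            "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    // suffix is expected to include its leading dot, e.g. ".ox.ac.uk".
    static boolean hasSuffix(String email, String suffix) {
        Pattern p = Pattern.compile(
                LOCAL + "@" + LABEL + "(?:\\." + LABEL + ")*" + Pattern.quote(suffix),
                Pattern.CASE_INSENSITIVE);
        // matches() requires the whole string to match, so no ^ or $ is needed.
        return p.matcher(email).matches();
    }

    public static void main(String[] args) {
        String suffix = ".ox.ac.uk";
        System.out.println(hasSuffix("jane.doe@cs.ox.ac.uk", suffix));
        System.out.println(hasSuffix("jane.doe@cs.ox.ac.uk.example.com", suffix));
    }
}
```

With this structure at least one label must precede the suffix, so an address directly at ox.ac.uk is rejected; whether that is desired depends on how the university issues its addresses.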

Assuming you are using php.
$ending = '.ox.ac.uk';
if(preg_match('/'.preg_quote($ending).'$/i', $email_address)) //... your code
Further info: the preg_quote() is necessary so that characters get escaped if they have a special meaning. In your case it's the dots.
edit: To check if the whole email is valid, see other questions, it is asked a lot. Just wanted to help with your special case.

Related

Regex expression for comma and dash separated text of items

I have a Java web application that takes some input from the user. Once I have this input I need to parse it, and the parsing depends on what kind of input I get. I decided to use Java's Pattern class for some of the predefined user inputs.
I still need the last two regex patterns:
a) Enumeration:
The input can be, e.g., A03,B24.1,A25.7
The simple way would be to check whether there is a comma in it ([^,]+), but that would lead to a lot of changes in the parsing function, which I would like to avoid. So, in addition to the comma, it should check that each item:
starts with a letter
has a minimum of 3 characters (letters combined with numbers)
can have one dot in the word
and that the input has a minimum of 1 comma (updated it)
b) Mixed
input can be A03,B24.1-B35.5,A25.7
So everything the Enumeration part allows, with the addition that it can contain dashes (at least one).
I've tried multiple online regex generators but didn't get it right. I would much appreciate your help.
Here is what I have for a simple range such as B24.1-B35.5:
"='.{1}\\d{0,2}-.{1}\\d{0,2}'|='.{1}\\d{1,2}.\\d{1,2}-.{1}\\d{1,2}.\\d{1,2}'";
Edit1: Valid and Invalid inputs
for a) Enumeration
A03,B24.1,A25.7 - Valid
A03,B24.1 - Valid
A03,B24.1-B25.1 - Invalid, because an enumeration must not contain a dash
A03 - Invalid, because there is no comma
for b) Mixed
everything that an enumeration allows, with the addition that it can contain dashes too.
You can use this regex for (a) Enumeration part as per your rules:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Rules:
Verifies that each segment starts with a letter
Minimum of three letters or numbers [A-Za-z][A-Za-z0-9]{2,}
Optionally followed by a decimal point . and one or more letters or digits, i.e. (?:\.[A-Za-z0-9]{1,})?
The same thing repeated, separated by a comma. There must also be at least one comma, hence the +, i.e. (?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
?: to indicate non-capturing group
Using [A-Za-z0-9] instead of \w to avoid underscores
Regex101 Demo
For (b) Mixed, you haven't shared too many valid and invalid cases, but based on my current understanding here's what I have:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:[,-][A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Note that , from previous regex has been replaced with [,-] to allow - as well!
Regex101 Demo
// Will match
A03,B24.1-B35.5,A25.7
A03,B24.1,A25.7
A03,B24.1-B25.1
Hope this helps!
EDIT: Making sure each group starts with a letter (and not a number)
Thanks to @diginoise and @anubhava for pointing out! Changed [A-Za-z0-9]{3,} to [A-Za-z][A-Za-z0-9]{2,}
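The two patterns above can be tried from Java roughly as follows. This is only a sketch: the class, constant, and method names are invented for illustration, and matches() is used so that the whole input has to fit the pattern.

```java
import java.util.regex.Pattern;

// Illustrative wrapper around the two regexes from this answer.
class CodeListPatterns {
    // One item: a letter, at least two more letters or digits, then an optional ".xyz" part.
    private static final String ITEM =
            "[A-Za-z][A-Za-z0-9]{2,}(?:\\.[A-Za-z0-9]{1,})?";

    // (a) Enumeration: items separated by commas, at least one comma.
    static final Pattern ENUMERATION = Pattern.compile(ITEM + "(?:," + ITEM + ")+");

    // (b) Mixed: items separated by commas or dashes, at least one separator.
    static final Pattern MIXED = Pattern.compile(ITEM + "(?:[,-]" + ITEM + ")+");

    static boolean isEnumeration(String input) {
        return ENUMERATION.matcher(input).matches();
    }

    static boolean isMixed(String input) {
        return MIXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEnumeration("A03,B24.1,A25.7"));
        System.out.println(isEnumeration("A03,B24.1-B25.1"));
        System.out.println(isMixed("A03,B24.1-B35.5,A25.7"));
    }
}
```

Note that the mixed pattern as written also accepts input that contains only dashes and no comma, such as A03-B24; if a comma is still mandatory in that case, an extra check would be needed.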
As I said in the comments, I would split the input on commas and verify each segment separately. Your domain (ICD-10-CM codes) is very well defined, and I would be very wary of any input that could be invalid yet pass the validation.
Here is my solution:
regex
([A-TV-Z][0-9][A-Z0-9](\.?[A-Z0-9]{0,4})?)
... however I would avoid that.
Since your domain is (most likely) medical software, people's lives (or at least their well-being) are at stake. Not to mention astronomical damages and the lawyers ever-chasing ambulances. Therefore avoid the easy solution, and implement the bomb proof one.
You could use the regex to establish that given code is definitely not valid. However if a code passes your regex it does not mean that it is valid.
bomb proof method
See this example: O09.7, O09.70, O09.71, O09.72, O09.73 are valid entries, but O09.1 is not valid.
Therefore just get all possible codes. According to this gist there are 42784 different codes. Just load them into memory, and any code which is not in the set is not valid. You could compress the list and be clever about the in-memory encoding to occupy less space, but verbatim all codes are under 300kB on disk, so a few MBs max in memory. That is not a massive cost to pay to avoid people having their left kidney removed instead of the right one.
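A minimal sketch of the set lookup described above, assuming the full code list has been loaded elsewhere. The class and method names are hypothetical, and the sample set contains only the five codes named in this answer, not the real list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical validator: a code is accepted only if it is in the known set.
class IcdCodeLookup {
    // Tiny illustrative sample; a real application would load the full list from a file.
    static final Set<String> SAMPLE_CODES = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("O09.7", "O09.70", "O09.71", "O09.72", "O09.73")));

    // Splits the input on commas and dashes and looks every piece up in the set.
    static boolean allKnown(String input, Set<String> knownCodes) {
        // The limit -1 keeps trailing empty strings, so "O09.7," is rejected.
        for (String code : input.split("[,-]", -1)) {
            if (!knownCodes.contains(code.trim())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allKnown("O09.7,O09.70-O09.73", SAMPLE_CODES));
        System.out.println(allKnown("O09.1", SAMPLE_CODES));
    }
}
```

For a range such as O09.70-O09.73 this only checks that both endpoints exist; deciding which codes lie inside the range is a separate question that the lookup does not answer.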

Does space affect the result in regex - java

I'm using regex to define a set of rules that extract specific information from unstructured resumes.
This information is:
Company the applicant worked at or still works at
Role (designation), e.g. software engineer
Date (From-To)
Every applicant writes his or her employment details in his or her own way. However, some resumes share a common style, for example:
2012- 2014.Dean of the Faculty of Engineering Information Technology/
University Name.
So I defined this regex to extract the needed information:
(^[0-9]{4})(-|–|.|_|to) ([0-9]{4})(.*) (of the|at|in) (.*).
and this regex was able to extract the information from the above example
role:Dean
company: Faculty of Engineering Information Technology/University Name.
date from: 2012 to :2014
loyalty: 2 years // this is depend on the extracted date
But I have another sample, from another resume, that is written in the same style:
1996-1997, Lecturer in Computer Science Department, Jerusalem open
university.
It should match, but it didn't until I removed the space in the regex; then it was able to extract the data.
My question is: does a space have an effect in a regex?
And how can I fix this so that it extracts the data from both resumes regardless of the space in the regex rule?
Here is my demo
does the space affect in regex?
You have determined for yourself that it does. Space characters are not regex metacharacters, unless you enable the COMMENTS option in your pattern. Ordinarily, they stand for themselves, just like most other characters.
how I can fix this so that it could extract the data from both resume regarding of the space in the regex rule?
You can apply quantifiers such as ? or * to space characters in your regex, just like you can to any other character or group. So, for example, you might use
(^[0-9]{4})(-|–|.|_|to) *([0-9]{4})(.*) (of the|at|in) (.*).
Do consider also that you might sometimes have to deal with tab characters, too. You can use the escape sequence \s to match any single whitespace character, whether it be a space, a tab, a line terminator, or any other character recognized as whitespace by Java.
You can allow an optional amount of whitespace by using \\s* instead of a literal space. \\s means a whitespace character, and the * means zero or more.
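As a rough illustration of the \\s* suggestion, the sketch below applies the question's pattern, with the space after the separator made optional, to both sample lines. The class and method names are invented, and each sample is assumed to have been joined into a single line first, because . does not match a line break by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; not the asker's actual code.
class ResumeLineParser {
    // The question's pattern with the literal space after the separator replaced by \s*.
    // \u2013 is the en dash from the original alternation.
    static final Pattern LINE = Pattern.compile(
            "(^[0-9]{4})(-|\u2013|.|_|to)\\s*([0-9]{4})(.*) (of the|at|in) (.*).");

    // Returns {fromYear, toYear}, or null when the line does not match.
    static String[] years(String line) {
        Matcher m = LINE.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(3) } : null;
    }

    public static void main(String[] args) {
        String first = "2012- 2014.Dean of the Faculty of Engineering Information Technology/ University Name.";
        String second = "1996-1997, Lecturer in Computer Science Department, Jerusalem open university.";
        System.out.println(String.join(" to ", years(first)));
        System.out.println(String.join(" to ", years(second)));
    }
}
```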

Regex for email id validation

I'm new to regexes. I have an email validation program with the following conditions for a valid email:
@ and . should each be present only once
there should be five characters between @ and .
there should be at least 3 characters before @
@ should always precede the .
I cannot figure out the last part. Any help with a little explanation would be great :)
You should just use a standard regex. I generally use
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
Have a look at http://www.regular-expressions.info/email.html for different examples.
Validating an email using regex is really a tough one and you'll never get satisfied no matter what you use!
Here is a regex based only on the four points in your question, assuming that by "character" you mean [a-zA-Z0-9]:
(?=^[^.@]*@[^.@]*\.[^.@]*$)[a-zA-Z0-9]{3,}@[a-zA-Z0-9]{5}\.
Online Demo
Since you'll use it in Java, write \\ for every \ in your code.
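A short sketch of how this look-ahead pattern might be used from Java, with the backslashes doubled as described. The class and method names are made up for illustration. find() is used instead of matches() because the pattern itself stops at the dot; the look-ahead is what inspects the rest of the string.

```java
import java.util.regex.Pattern;

// Illustrative check for the four rules listed in the question.
class SimpleEmailRules {
    static final Pattern RULES = Pattern.compile(
            "(?=^[^.@]*@[^.@]*\\.[^.@]*$)[a-zA-Z0-9]{3,}@[a-zA-Z0-9]{5}\\.");

    static boolean satisfiesRules(String email) {
        // The look-ahead enforces exactly one @ and one dot, with @ before the dot.
        return RULES.matcher(email).find();
    }

    public static void main(String[] args) {
        System.out.println(satisfiesRules("abc@gmail.com"));
        System.out.println(satisfiesRules("ab@gmail.com"));
    }
}
```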

password Regexp doesn't work when split up

I cannot get a regexp that checks if password has at least one digit to work.
This has been answered everywhere but all the answers stop working if split up.
For example in this Working Password Validation if I remove:
(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])
from
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{8,}$
in order to check for the presence of a single digit, the whole thing stops working
I'm new with regular expressions, this seems to make sense but it doesn't, show me the light if you can.
I'm not really sure what you mean by the whole things stops working.
(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])
All the above does is mandate that:
A lower case letter must appear at least once
An upper-case letter must appear at least once
Any of @#$%^&+= must appear at least once
So, there is no reason that taking them out should break anything--they are essentially independent components.
There are myriad ways to check if a String contains a number. How you want to check really depends on your specific requirements. The method used in the presented regex does this through a positive-look ahead: ^(?=.*[0-9])
^ : anchors the match at the start of the input
.* : matches 0 or more non-newline characters
[0-9]: is a character class that matches any one digit from 0 to 9
?= is the positive look-ahead, which in this case says to match only if at least one digit follows somewhere in the input
Hope that helped. You can start off with Oracle's Tutorial on Regular Expressions. After you digest that, I'm sure you'll be able to find more advanced resources via Google.
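The independence of the look-aheads can be checked with a small Java sketch like the one below (class and constant names are illustrative). It also shows one possible reason a stripped-down pattern appears to stop working, which is an assumption since the question does not show the failing code: a look-ahead consumes no characters, so a pattern consisting only of (?=.*[0-9]) can never satisfy matches(), although find() succeeds.

```java
import java.util.regex.Pattern;

// Illustrative password checks built from pieces of the pattern in the question.
class PasswordChecks {
    // Digit look-ahead, no-whitespace look-ahead and length check; the other three removed.
    static final Pattern DIGIT_RULES = Pattern.compile("^(?=.*[0-9])(?=\\S+$).{8,}$");

    // Only "contains a digit somewhere"; the trailing .* lets matches() consume the input.
    static final Pattern HAS_DIGIT = Pattern.compile("^(?=.*[0-9]).*$");

    // The bare look-ahead on its own; it matches a position, not any characters.
    static final Pattern BARE_LOOKAHEAD = Pattern.compile("(?=.*[0-9])");

    public static void main(String[] args) {
        System.out.println(DIGIT_RULES.matcher("abcdefg1").matches());
        System.out.println(HAS_DIGIT.matcher("abc1").matches());
        System.out.println(BARE_LOOKAHEAD.matcher("abc1").matches());
        System.out.println(BARE_LOOKAHEAD.matcher("abc1").find());
    }
}
```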

Regular expression, excluding .. in suffix of email addy [duplicate]

This is homework, I've been working on it for a while, I've done lots of reading and feel I have gotten pretty familiar with regex for a beginner.
I am trying to find a regular expression for validating/invalidating a list of emails. There are two addresses which are giving me problems, I can't get them both to validate the correct way at the same time. I've gone through a dozen different expressions that work for all the other emails on the list but I can't get those two at the same time.
First, the addresses.
me@example..com - invalid
someone.nothere@1.0.0.127 - valid
The part of my expression which validates the suffix
I originally started with
@.+\\.[[a-z]0-9]+
And had a second pattern for checking some more invalid addresses, and checked the email against both patterns: one checked for validity, the other for invalidity. But my professor said he wanted it all in one expression.
@[[\\w]+\\.[\\w]+]+
or
@[\\w]+\\.[\\w]+
I've tried it written many, many different ways but I'm pretty sure I was just using different syntax to express these two expressions.
I know what I want it to do, I want it to match a character class of "character+"."character+"+
The plus sign meaning at least one. It works for the invalid address when I only allow the character class to repeat one time (and obviously the IP doesn't get matched), but when I allow the character class to repeat itself it matches the second period even though it isn't preceded by a character. I don't understand why.
I've even tried grouping everything with () and putting {1} after the escaped . and changing the \w to a-z and replacing + with {1,}; nothing seems to require the period to be surrounded by characters.
You need a negative look-ahead:
@\w+\.(?!\.)
See http://www.regular-expressions.info/lookaround.html
test in Perl:
Perl> $_ = 'someone.nothere@1.0.0.127'
someone.nothere@1.0.0.127
Perl> print "OK\n" if /\@\w+\.(?!\.)/
OK
1
Perl> $_ = 'me@example..com'
me@example..com
Perl> print "OK\n" if /\@\w+\.(?!\.)/
Perl>
@([\\w]+\\.)+[\\w]+
Matches at least one word character, followed by a '.'. This is repeated at least once, and is then followed by at least one more word character.
I think you want this:
@[\\w]+(\\.[\\w]+)+
This matches a "word" followed by one or more "." "word" sequences. (You can also do the grouping the other way around; e.g. see Dailin's answer.)
The problem with what you were doing before was that you were trying to embed a repeat inside a character class. That doesn't make sense, and there is no syntax that would support it. A character class defines a set of characters and matches against one character. Nothing more.
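The grouped form can be checked against the two problem addresses with a sketch like this. The names are hypothetical, and the trailing $ is an addition made here so that find() cannot succeed on just a prefix of the domain.

```java
import java.util.regex.Pattern;

// Illustrative check of the domain part only.
class DomainDots {
    // "@word" followed by one or more ".word" groups, anchored at the end of the input.
    static final Pattern DOMAIN = Pattern.compile("@\\w+(\\.\\w+)+$");

    static boolean hasValidDomain(String email) {
        return DOMAIN.matcher(email).find();
    }

    public static void main(String[] args) {
        System.out.println(hasValidDomain("someone.nothere@1.0.0.127"));
        System.out.println(hasValidDomain("me@example..com"));
    }
}
```

Because the repeat is applied to a group rather than placed inside a character class, every dot has to be followed by at least one word character.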
The official standard, RFC 2822, describes the syntax that valid email addresses must adhere to. It can be implemented with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
A more practical implementation of RFC 2822 (if we omit the syntax using double quotes and square brackets), which will still match 99.99% of all email addresses in actual use today, is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
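For what it is worth, the practical pattern can be tried on the two addresses from the question with a sketch along these lines. The class name and the split into ATOM and LABEL constants are illustrative choices, and the CASE_INSENSITIVE flag is an assumption added here because the pattern is written in lower case only.

```java
import java.util.regex.Pattern;

// Illustrative wrapper around the practical pattern above.
class PracticalEmailPattern {
    // A run of characters allowed in the local part.
    private static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+";
    // One domain label: alphanumeric, hyphens allowed only in the middle.
    private static final String LABEL = "[a-z0-9](?:[a-z0-9-]*[a-z0-9])?";

    static final Pattern EMAIL = Pattern.compile(
            ATOM + "(?:\\." + ATOM + ")*@(?:" + LABEL + "\\.)+" + LABEL,
            Pattern.CASE_INSENSITIVE);

    static boolean isValid(String email) {
        // matches() requires the entire address to fit the pattern.
        return EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("someone.nothere@1.0.0.127"));
        System.out.println(isValid("me@example..com"));
    }
}
```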
