I'm working on GUI validation...
Please see the problem below...
How can I validate an email with a specific format? It needs at least one digit before the #, at least one digit after it, and at least two letters after the dot.
String EmailFormat = "m#m.co";
Pattern patternEmail = Pattern.compile("\\d{1,}#\\d{1,}.\\d{2,}");
Matcher matcherName = patternEmail.matcher(StudentEmail);
Don't write your own validator. Email has been around for decades and there are many standard libraries which work, address parts of the standard you may not know about, and are well tested by many other developers.
Apache Commons Email Validator is a good example. Even if you use a standard validator, you need to be aware of the limitations and gotchas in validating an email address. The javadocs for Commons EmailValidator state, "This implementation is not guaranteed to catch all possible errors in an email address. For example, an address like nobody#noplace.somedog will pass validator, even though there is no TLD "somedog"". So you can use a good email validator to determine whether an address is well formed, but you will have to do extra work to guarantee that the domain exists, accepts email, and accepts email from that address.
If you require good addresses you will need a secondary mechanism. A confirmation email is a good mechanism. You send a link to the given address and the user must visit that link to verify that email can be sent to that address.
This is the regex pattern for emails:
String pt = "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
You can try it like this
List<String> email = Arrays.asList("xyzl#gmail.com", "#", "sxd");
Predicate<String> validMail = (n) -> n.matches(pt);
email.stream().filter(validMail).forEach((n) -> System.out.println(n));
Here is a description of each part; you can change it according to your needs.
^ # start of the line
[_A-Za-z0-9-\\+]+ # must start with one or more (+) of the characters in the brackets [ ]
( # start of group #1
\\.[_A-Za-z0-9-]+ # followed by a dot "." and one or more (+) of the characters in the brackets [ ]
)* # end of group #1, this group is optional (*)
# # must contain a "#" symbol
[A-Za-z0-9-]+ # followed by one or more (+) of the characters in the brackets [ ]
( # start of group #2 - first level TLD checking
\\.[A-Za-z0-9]+ # followed by a dot "." and one or more (+) of the characters in the brackets [ ]
)* # end of group #2, this group is optional (*)
( # start of group #3 - second level TLD checking
\\.[A-Za-z]{2,} # followed by a dot "." and at least 2 of the letters in the brackets [ ]
) # end of group #3
$ # end of the line
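To see how the pattern described above behaves, here is a minimal sketch that compiles it once and checks a few inputs. The class name EmailPatternDemo and the sample addresses beyond those in the answer are assumptions made for illustration, and the `#` separator simply follows the examples in this thread.

```java
import java.util.regex.Pattern;

final class EmailPatternDemo {
    // Same pattern as in the answer, compiled once so it can be reused.
    private static final Pattern EMAIL = Pattern.compile(
        "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

    // True when the whole candidate string fits the pattern.
    static boolean isValid(String candidate) {
        return EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"xyzl#gmail.com", "first.last#mail.example.org", "#", "sxd", "a#b.c"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which is what String.matches does internally.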
Split email into two parts using # as delimiter:
String email = "some#email.com";
String[] parts = email.split("#"); // parts = [ "some", "email.com" ]
Validate each part separately, using multiple checks if necessary:
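// validate username: must contain at least one digit
// (matches() tests the whole string, so the pattern needs the surrounding .*)
String username = parts[0];
if (username.matches(".*\\d.*")) {
    // ok
}
// validate domain: at least one digit, and 2 to 4 letters after the last dot
String domain = parts[1];
if (domain.matches(".*\\d.*") && domain.matches(".*\\.[a-z]{2,4}")) {
    // ok
}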
Note that this is a very poor email validator and it shouldn't be used standalone.
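The split-and-check steps can be collected into a single method. This is only a sketch under the requirements stated in the question (at least one digit on each side of the `#`, at least two letters after the last dot); the class and method names are hypothetical, and it also rejects input that does not contain exactly one `#`.

```java
final class SplitEmailCheck {
    static boolean looksValid(String email) {
        // The limit -1 keeps trailing empty strings, so "abc#" yields two parts.
        String[] parts = email.split("#", -1);
        if (parts.length != 2) {
            return false; // no '#' at all, or more than one
        }
        String username = parts[0];
        String domain = parts[1];
        boolean userHasDigit = username.matches(".*\\d.*");
        boolean domainHasDigit = domain.matches(".*\\d.*");
        // Assumption: "two letters after the dot" means lowercase letters after the last dot.
        boolean domainEndsWithLetters = domain.matches(".*\\.[a-z]{2,}");
        return userHasDigit && domainHasDigit && domainEndsWithLetters;
    }
}
```

For example, looksValid("m1#m2.co") is true, while looksValid("m#m.co") is false because neither side contains a digit.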
For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms, "name#domain.com" and "Joe Sixpack <name#domain.com>", which can appear in a comma / semicolon delimited string, ignoring white space padding. The problem is that the names can contain delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
.replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
.replace(/<\|>$/,"") // remove trailing delimiter
.split(/\s*<\|>\s*/) // split on delimiter including surround space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
return emailAddressList
.replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
.replaceAll("<\\|>$", "")
.split("\\s*<\\|>\\s*");
}
Since you are not validating, I assume that the email addresses are valid.
Based on this assumption, I look for an email address followed by ; or , because that delimiter marks the end of an entry.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)
This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:^|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:^|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
PCRE Demo
Using Java's replaceAll and split functions (mimicked in javascript below), I would say lock onto what you know ends an item (the ".com"), replace separator characters with a unique temp (a uuid or something like <|>), and then split using your refactored delimiter.
Here is a JavaScript example, but Java's replaceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)
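A Java counterpart of the snippet above might look like the following sketch. It relies on the same assumption as the JavaScript version, namely that every address ends in ".com"; the class and method names are hypothetical.

```java
final class AddressListSplitter {
    static String[] splitAddresses(String addressList) {
        return addressList
            // Lock onto the known item ending (".com", optionally followed by '>')
            // and replace the separators after it with a temporary delimiter.
            .replaceAll("(\\.com>?)[\\s,;]+", "$1<|>")
            // Drop the temporary delimiter left at the very end, if any.
            .replaceAll("<\\|>$", "")
            // Split on the temporary delimiter; split takes a regex, so '|' is escaped.
            .split("<\\|>");
    }
}
```

With the input "a#x.com, Doe, Jane <j#y.com>;b#z.com;" this returns the three items "a#x.com", "Doe, Jane <j#y.com>" and "b#z.com".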
I have a problem with matching groups that contain lookahead expressions. I don't know why this expression doesn't work:
"""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%))((?<=[\w:]\s)(\w+)(?=\s[cr]))"""
When I compile them separately, for example:
"""(?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)"""
I get the correct result
My sample text:
May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection
Expressions have been checked with this tool: http://regex-testdrive.com/en/dotest
My Scala code:
import scala.util.matching.Regex
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = new Regex("""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%))((?<=[\w:]\s)(\w+)(?=\s[cr]))""")
val result = regex.findAllIn(text)
Does anyone know a solution to this problem?
Multiple matching
You may fix the pattern as
^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)|(?<=[\w:]\s)\w+(?=\s[cr])
See the regex demo. The main point is to introduce the | alternation operator to match either of the 2 subpatterns. Note you do not need to put the ^ start of string anchor into a lookbehind, as ^ is already a zero-width assertion. Also, there are too many groupings that you do not seem to use anyway. Finally, to match a literal dot you need to escape it (. -> \.).
To obtain the multiple matches, you may use the following code snippet:
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = """^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)|(?<=[\w:]\s)\w+(?=\s[cr])""".r
val result = regex.findAllIn(text)
result.foreach { x => println(x) }
// => May 5 23:00:01
// UDP
See the Scala online demo.
Note that once a pattern is used with .findAllIn, it is not anchored by default, so you will get all the matches there are in the input string.
Capturing groups
Another approach you may use is matching the whole line while capturing the necessary bits with capturing groups:
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = """^(.*?)\s+\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%.*[\w:]\s+(\w+)\s+[cr].*""".r
val results = text match {
case regex(date, protocol) => Array(date, protocol)
case _ => Array[String]()
}
// Demo printing
results.foreach { m =>
println(m)
}
See another Scala demo. Since match requires a full string match, .* is added at the end of the pattern, and only relevant pairs of unescaped (...) are kept in the pattern. See the regex demo here.
Your matches are not next to each other.
Try this:
"""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)).*((?<=[\w:]\s)(\w+)(?=\s[cr]))"""
I just added the .* between them. It works on the link you sent :)
We have a tokenizer which tokenizes a text file. The logic it follows is quite weird but necessary in our context.
An email such as
xyz.zyx#gmail.com
will result in the following tokens :
xyz
.
zyx
#
gmail
I would like to know how we can recognize the field as an email if we are allowed to use only these tokens. No regex is allowed. We may only use the tokens and their surrounding tokens to figure out whether the field is an email field.
ok.. try some (bad) logic like this...
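static boolean looksLikeEmail(String str) {
    int i = 0, j = 0;
    if (str.contains(".") && str.contains("#"))
    {
        // each assignment needs its own parentheses; otherwise i would be
        // assigned the boolean result of the comparison
        if ((i = str.indexOf(".")) < (j = str.indexOf("#")))
        {
            if (i != 0 && i + 1 != j) // ignore Strings like .# , abc.#
                return true;
        }
    }
    return false;
}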
Logically split an e-mail address into 3 parts:
A user name (or resource name), for this explanation let's call it the user name
The # character.
A host name, consisting of any number of "word dot" sequences + a final top level domain string.
Do a walk like this:
while token can be part of a user name
    fetch next token;
if there are no more -> no e-mail;
check if the next token is #
    if not -> no e-mail
while there are tokens
    while token can be part of a host name subpart (the "word" above)
        fetch next token;
    if there are no more -> might be a valid e-mail address
    check if the next token is a dot
        if not -> might be a valid e-mail address
    set a flag that you found at least one dot
    check if the next token can be part of a host name subpart
        if not -> no valid e-mail address (or maybe you ignore a trailing dot and take what was found so far)
Add further checks if there are more tokens where needed. You also may have to post process the found tokens to ensure a valid e-mail address and you may have to rewind your tokenizer (or cache the fetched tokens) in case you did not find a valid e-mail address and need to feed the same input to some other recognition process.
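The walk above could be sketched in Java as follows. Several things here are assumptions rather than part of the answer: the tokens arrive as a List<String> that holds only the candidate field, a "word" token consists of letters and digits, user-name tokens are words and dots, and any leftover or unexpected token rejects the field. The class and method names are hypothetical.

```java
import java.util.List;

final class TokenWalk {
    // Assumption: a "word" token consists only of letters and digits.
    static boolean isWord(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isLetterOrDigit);
    }

    static boolean looksLikeEmail(List<String> tokens) {
        int i = 0;
        // 1. Consume user-name tokens (assumed here: words and dots).
        while (i < tokens.size() && (isWord(tokens.get(i)) || tokens.get(i).equals("."))) {
            i++;
        }
        if (i == 0 || i == tokens.size()) {
            return false; // no user name, or nothing left for '#'
        }
        if (!isWord(tokens.get(0)) || !isWord(tokens.get(i - 1))) {
            return false; // user name must not start or end with a dot
        }
        // 2. The next token must be '#'.
        if (!tokens.get(i).equals("#")) {
            return false;
        }
        i++;
        // 3. Host name: word (dot word)*, with at least one dot.
        boolean sawDot = false;
        while (i < tokens.size()) {
            if (!isWord(tokens.get(i))) {
                return false;
            }
            i++;
            if (i == tokens.size()) {
                return sawDot; // ended on a word: valid only if a dot was seen
            }
            if (!tokens.get(i).equals(".")) {
                return false;
            }
            sawDot = true;
            i++;
        }
        return false; // nothing after '#', or a trailing dot
    }
}
```

A tokenizer that feeds a longer stream would need to stop at the first non-matching token instead of rejecting, and cache or rewind the consumed tokens as the answer describes.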
Check if a list of tokens is an email:
list contains exactly one token #
index of token # != 0
at least 3 tokens after #
at least 1 . token after #, but not immediately after
starts and ends with character tokens
Additional checks:
no two consecutive . tokens
no special characters
length of character tokens after # is at least 2
total length of all character tokens before # is at least 3
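As a rough illustration, the main checks from the list above (plus the "no two consecutive dots" rule) can be written directly against a token list. This sketch assumes the tokens are given as a List<String> and that a "character token" consists of letters and digits; the remaining additional checks are left out, and the names are hypothetical.

```java
import java.util.List;

final class TokenChecklist {
    // Assumption: a "character token" consists only of letters and digits.
    static boolean isWord(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isLetterOrDigit);
    }

    static boolean isEmail(List<String> tokens) {
        int at = tokens.indexOf("#");
        // Exactly one '#' token, and not at index 0.
        if (at <= 0 || at != tokens.lastIndexOf("#")) {
            return false;
        }
        // At least 3 tokens after '#'.
        List<String> after = tokens.subList(at + 1, tokens.size());
        if (after.size() < 3) {
            return false;
        }
        // At least one '.' after '#', but not immediately after it.
        if (after.indexOf(".") <= 0) {
            return false;
        }
        // Starts and ends with character tokens.
        if (!isWord(tokens.get(0)) || !isWord(tokens.get(tokens.size() - 1))) {
            return false;
        }
        // No two consecutive '.' tokens.
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).equals(".") && tokens.get(i - 1).equals(".")) {
                return false;
            }
        }
        return true;
    }
}
```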
I'm matching URLs against a regular expression, testing if they reflect a "shutdown" command.
Here's a URL that performs a shutdown:
/exec?debug=true&command=shutdown&f=0
Here's another, legitimate but confusing URL that performs shutdown:
/exec?commando=yes&zcommand=34&command=shutdown&p
Now, I must ensure there's only one command=... parameter and it is command=shutdown. Alternatively, I can live with ensuring the first command=... parameter is command=shutdown.
Here's my test for the requested regular expression:
/exec?version=0.4&command=shutdown&out=JSON&zcommand=1
Should match
/exec?version=0.4&command=startup&out=JSON&zcommand=1&commando=shutdown
Should fail to match
/exec?command=shutdown&out=JSON
Should match
/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown
Should fail to match
Here's my baseline - a regular expression that passes the above tests - all but the last one:
^/exec?(.*\&)*command=shutdown(\&.*)*$
The problem is with the occurrence of more than one command=..., where the first one is not shutdown.
I tried using lookbehind:
^/exec?(.*\&)*(?<!(\&|\?)command=.*)command=shutdown(\&.*)*$
But I'm getting:
Look-behind group does not have an obvious maximum length near index 31
I even tried atomic grouping. To no avail. I can't make the following expression NOT match:
/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown
Can anyone help with a regular expression that passes all the tests?
Clarifications
I see I owe you some context.
My task is to configure a Filter that guards the entrance of all our system’s servlets, and verifies there’s an open HTTP session (in other words: that a successful Login has occurred). The filter also allows configuring which URLs do not require login.
Some exceptions are easy: /login does not need login. Calls to localhost do not need login.
But sometimes it gets complicated. Like the shutdown command that cannot require login while other commands can and should (the strange reason for that is out of the scope of my question).
Since it’s a security matter, I can’t allow users to merely append &command=shutdown to a URL and bypass the filter.
So I really need a regular expression, or otherwise I’ll need to redefine the configuration specs.
You would need to do it in multiple steps:
(1) Find match of ^(?=\/exec\?).*?(?<=[?&])command=([^&]+)
(2) Check if match is shutdown
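In Java, the two steps might be sketched like this. The class and method names are hypothetical. Only the first `command=` parameter is inspected, which corresponds to the relaxed requirement in the question ("the first command=... parameter is command=shutdown").

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstCommandCheck {
    // Step 1 pattern: lazily skip ahead to the first "command=" that directly
    // follows '?' or '&', and capture its value.
    private static final Pattern FIRST_COMMAND =
        Pattern.compile("^(?=/exec\\?).*?(?<=[?&])command=([^&]+)");

    static boolean isShutdown(String url) {
        Matcher m = FIRST_COMMAND.matcher(url);
        // Step 2: compare the captured value in plain Java.
        return m.find() && m.group(1).equals("shutdown");
    }
}
```

Because of the lookbehind, parameters such as zcommand or commando are not mistaken for command.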
Ok. I thank you all for your great answers! I tried some of the suggestions, struggled with others, and all in all I have to agree that even if the right regex exists, it looks terrible, non maintainable, and can serve well as a nasty university exercise, but not in a real system configuration.
I also realize that since a Filter is involved here, and the Filter already parses its own URI, it is absolutely ridiculous to glue back all the URI parts into a string and match it against a regular expression. What was I thinking??
I'll therefore redesign the Filter and its configuration.
Thanks a lot, people! I appreciate the help :)
Noam Rotem.
P.S. - why was I getting a userXXXX nick? Very strange...
This tested (and fully commented) regex solution meets all your requirements:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
Pattern re = Pattern.compile(
" # Match URI having command=shutdown query variable value. \n" +
" ^ # Anchor to start of string. \n" +
" (?:[^:/?\\#\\s]+:)? # URI scheme (Optional). \n" +
" (?://[^/?\\#\\s]*)? # URI authority (Optional). \n" +
" [^?\\#\\s]* # URI path. \n" +
" \\? # Literal start of URI query. \n" +
" # Match var=value pairs preceding 'command=xxx'. \n" +
" (?: # Zero or more 'var=values' \n" +
" (?!command=) # only if not-'command=xxx'. \n" +
" [^&\\#\\s]* # Next var=value. \n" +
" & # var=value separator. \n" +
" )* # Zero or more 'var=values' \n" +
" command=shutdown # variable and value to match. \n" +
" # Match var=value pairs following 'command=shutdown'. \n" +
" (?: # Zero or more 'var=values' \n" +
" & # var=value separator. \n" +
" (?!command=) # only if not-'command=xxx'. \n" +
" [^&\\#\\s]* # Next var=value. \n" +
" )* # Zero or more 'var=values' \n" +
" (?:\\#\\S*)? # URI fragment (Optional). \n" +
" $ # Anchor to end of string.",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
String s = "/exec?version=0.4&command=shutdown&out=JSON&zcommand=1";
// Should match
// String s = "/exec?version=0.4&command=startup&out=JSON&zcommand=1&commando=shutdown";
// Should fail to match
// String s = "/exec?command=shutdown&out=JSON";
// Should match
// String s = "/exec?version=0.4&command=admin&out=JSON&zcommand=1&command=shutdown";
// Should fail to match
Matcher m = re.matcher(s);
if (m.find()) {
// Successful match
System.out.print("Match found.\n");
} else {
// Match attempt failed
System.out.print("No match found.\n");
}
}
}
The above regex matches any RFC3986 valid URI having any scheme, authority, path, query or fragment components, but it must have one (and only one) query "command" variable whose value must be exactly, but case insensitively: "shutdown".
A carefully crafted complex regex is perfectly fine (and maintainable) to use when written with proper indentation and commented steps (like shown above). (For more information on using regex to validate a URI, see my article: Regular Expression URI Validation)
If you can live with just accepting the first match, you could use \\Wcommand=([^&]+) and fetch the first group.
Otherwise, you could call Matcher.find twice to test for subsequent matches, and eventually use the first match. Why do you want to do this with a single complex regex?
I am not a Java coder, but try this one (works in Perl) >>
^(?=\/exec\?)(?:[^&]+(?<![?&]command)=[^&]+&)*(?<=[?&])command=shutdown(?:&|$)
To match the first occurrence of command=shutdown use this:
Pattern.compile("^((?!command=).)+command=shutdown.*$");
The results will look like this:
"/exec?version=0.4&command=shutdown&out=JSON&zcommand=1" => true
"/exec?command=shutdown&out=JSON" => true
"/exec?version=0.4&command=startup&out=JSON&zcommand=1&commando=shutdown" => false
"/exec?commando=yes&zcommand=34&command=shutdown&p" => false
If you want to match strings that ONLY contain one 'command=' use this:
Pattern.compile("^((?!command=).)+command=shutdown((?!command=).)+$");
Please note that using "not" qualifiers in regular expressions is not something they are intended for and performance might not be the best.
If this can be done with a single regular expression, and it may well be possible, it will be so complex as to be unreadable, and thus unmaintainable, as the intent of the logic will be lost. Even if it is "documented" it will still be much less obvious to someone who just knows Java.
A much better approach would be to use the URI object to parse the entire thing, domain and all, pull off the query parameters, and then write a simple loop that walks through them and decides based on your business logic what is a shutdown and what isn't. Then it will be simple, self-documenting and probably more efficient (not that that should be a concern).
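A minimal sketch of that approach, using java.net.URI from the standard library. The class and method names are hypothetical, and the business rule is assumed to be the strict one from the question (exactly one command parameter, with the value shutdown). Parameter names and values are compared in their raw form, so percent-encoded input would need decoding first, and URI.create throws IllegalArgumentException for strings that are not valid URIs.

```java
import java.net.URI;

final class ShutdownQueryCheck {
    static boolean isShutdownRequest(String url) {
        // Let the URI class locate the query part instead of a regex.
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return false; // no query string at all
        }
        int commandCount = 0;
        boolean shutdown = false;
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            if (name.equals("command")) {
                commandCount++;
                shutdown = value.equals("shutdown");
            }
        }
        // Assumed rule: exactly one "command" parameter, and it is "shutdown".
        return commandCount == 1 && shutdown;
    }
}
```

Inside a servlet Filter the parameters are already parsed, so the same loop could run over the request's parameter values rather than a re-assembled string.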
Try this:
Pattern p = Pattern.compile(
"^/exec\\?(?:(?:(?!\\1)command=shutdown()|(?!command=)\\w+(?:=[^&]+)?)(?:&|$))+$\\1");
Or a little more readably:
^/exec\?
(?:
(?:
(?!\1)command=shutdown()
|
(?!command=)\w+(?:=[^&]+)?
)
(?:&|$)
)+$
\1
The main body of the regex is an alternation that matches either a shutdown command or a parameter whose name is not command. If it does match a shutdown command, the empty group in that branch "captures" an empty string. It doesn't need to consume anything, because we're only using it as a checkbox, confirming en passant that one of the parameters was a shutdown command.
The negative lookahead - (?!\1) - prevents it from matching two or more shutdown commands. I don't know if that's really necessary, but it's a good opportunity to demonstrate (1) how to negate a "back-assertion", and (2) that a backreference can appear before the group it refers to in certain circumstances (what's known as a forward reference).
When the whole URL has been consumed, the backreference (\1) acts like a zero-width assertion. If one of the parameters was command=shutdown, the backreference will succeed. Otherwise it will fail even though it's only trying to match an empty string, because the group it refers to didn't participate in the match.
But I have to concur with the other responders: when your regexes get this complicated, you should be thinking seriously about switching to a different approach.
EDIT: It works for me. Here's the demo.
I have tried to use the following kind of regex
([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))|(FakeEmail:)|(Email:)|(\1\2)|(\1\3)
(pretend the \1 is the email regex group, and \2 is FakeEmail: and \3 is Email: because I didn't count the parens to figure out the real grouping)
What I am trying to do is say "Find the word email: and if you find it, pick up any email address following the word."
That email regex I got off some other question on stack overflow.
my test string could be something like
"This guy is spamming me from
FakeEmail: fakeemailAdress#someplace.com
but here is is real info:
Email: testemail#someplace.com"
Any tips? Thanks
I'm either quite confused as to what you're trying to do, or your Regex is just very wrong. In particular:
Why do you have Email: at the end, instead of the beginning - to match your example?
Why do you have both your Email: and your \1\2 separated by pipe characters, almost as if they're in fields? This is compiling the pattern as ORs. (Find the email pattern, OR the word "Email:", OR whatever \1\2 will end up meaning as it is out of context here.)
If all you're trying to do is match something like Email: testemail#someplace.com, you don't need any backreferences.
Something like this is probably all you need:
Email:\s+([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
Also, I'd strongly advise against trying to validate an email address so strictly. You may want to read http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx . I'd simplify the pattern to something more along the lines of:
Email:\s+(\S+)*#(\S+\.\S+)
Try:
(Fake)?Email: *([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
And captured group \1 will be empty if it's a real email and contain "Fake" if it's a fake email, while \2 will be the email itself.
Do you actually want to capture it if it's FakeEmail though? If you want to capture all Email but ignore all FakeEmail then do:
\bEmail: *([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
The word boundary prevents the Email bit from matching "FakeEmail".
UPDATE: note your regex only matches lowercase since it's got a-z in the [] everywhere but not [A-Z]. Make sure you feed your regex into the java match function with the ignore case switch. i.e.:
Pattern.compile("(Fake)?Email: .....", Pattern.CASE_INSENSITIVE)
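Here is a small sketch of how the (Fake)?Email: pattern could be used from Java to keep only the real addresses. The class and method names are hypothetical. One detail to be aware of: in Java, an optional group that did not take part in the match is reported as null by Matcher.group, not as an empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LabelledEmailFinder {
    private static final Pattern LABELLED = Pattern.compile(
        "(Fake)?Email: *([_a-z0-9-]+(\\.[_a-z0-9-]+)*#[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))",
        Pattern.CASE_INSENSITIVE);

    // Returns the addresses labelled "Email:" and skips those labelled "FakeEmail:".
    static List<String> realEmails(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = LABELLED.matcher(text);
        while (m.find()) {
            if (m.group(1) == null) { // the "Fake" prefix did not match
                result.add(m.group(2));
            }
        }
        return result;
    }
}
```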
You can use the following code to match all of these kinds of email addresses:
String text = "This guy is spamming me from\n" +
"FakeEmail: fakeemail+Adress#someplace.com\n" +
"fakeEmail: \n" +
"fakeemail#someplace.com\n" +
"but here is is real info:\n" +
"Email: test.email+info#someplace.com\n";
Matcher m = Pattern.compile("(?i)(?s)Email:\\s*([_a-z\\d\\+-]+(\\.[_a-z\\d\\+-]+)*#[a-z\\d-]+(\\.[a-z\\d-]+)*(\\.[a-z]{2,4}))").matcher(text);
while(m.find())
System.out.printf("Email is [%s]%n", m.group(1));
This will match email text:
appearing on a different line from its label (\s* also matches line breaks; (?s) only affects ., which this pattern never uses unescaped)
ignoring case, by using (?i)
containing a period . in the address
containing a plus sign + in the address
OUTPUT from the above code:
Email is [fakeemail+Adress#someplace.com]
Email is [fakeemail#someplace.com]
Email is [test.email+info#someplace.com]