How to tokenize, scan or split this string of email addresses - java

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms, "name#domain.com" and "Joe Sixpack <name#domain.com>", which can appear in a comma- or semicolon-delimited string, ignoring whitespace padding. The problem is that the names can contain delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
.replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
.replace(/<\|>$/,"") // remove trailing delimiter
.split(/\s*<\|>\s*/) // split on delimiter including surround space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
return emailAddressList
.replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
.replaceAll("<\\|>$", "")
.split("\\s*<\\|>\\s*");
}

Since you are not validating, I assume that the email addresses are valid.
Based on this assumption, I look for an email address followed by ; or , and treat that delimiter as the end of the entry.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)
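For comparison, here is a minimal Java sketch of the same idea. The class and method names (AddressMatcher, extract) are made up for illustration. Unlike the JavaScript match above, it reads capture group 1 and trims it, so the trailing delimiter and the surrounding whitespace are not part of the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Simple Java Mail.
final class AddressMatcher {
    // Same pattern as above: lazily consume up to the address marker (#),
    // then up to a dot, then up to the next , or ; delimiter.
    private static final Pattern ENTRY = Pattern.compile("(.*?#.*?\\..*?)[,;]");

    static List<String> extract(String input) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            // group(1) excludes the delimiter; trim() drops padding around the entry
            entries.add(m.group(1).trim());
        }
        return entries;
    }
}
```

Note that an entry is only found when a delimiter follows it, so a final address without a trailing , or ; would be dropped by this pattern.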

This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:$|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:$|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
PCRE Demo

Using Java's replaceAll and split functions (mimicked in JavaScript below), I would lock onto what you know ends an item (the ".com"), replace the separator characters that follow it with a unique temporary token (a UUID or something like <|>), and then split on that token.
Here is a JavaScript example, but Java's replaceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)
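A rough Java translation of the snippet above might look like this. The class and method names are invented for the example, and it carries over the same assumption that every address ends in ".com". The only real difference is that String.split takes a regex, so the | in the temporary token has to be escaped.

```java
// Hypothetical helper mirroring the JavaScript above; assumes every address ends in ".com".
final class DotComSplitter {
    static String[] split(String addressList) {
        return addressList
                // after ".com" (optionally followed by ">"), turn the separator run into a token
                .replaceAll("(\\.com>?)[\\s,;]+", "$1<|>")
                // remove the token left behind by a trailing delimiter
                .replaceAll("<\\|>$", "")
                // split on the literal token; | must be escaped in a regex
                .split("<\\|>");
    }
}
```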

Related

Regular expression: Replace everything before first occurrence

I have the following regular expression that I'm using to remove the dev. part of my URL.
String domain = "dev.mydomain.com";
System.out.println(domain.replaceAll(".*\\.(?=.*\\.)", ""));
Outputs: mydomain.com but this is giving me issues when the domains are in the vein of dev.mydomain.com.pe or dev.mydomain.com.uk in those cases I am getting only the .com.pe and .com.uk parts.
Is there a modifier I can use on my regex to make sure it only takes what is before the first . (dot included)?
Desired output:
dev.mydomain.com -> mydomain.com
stage.mydomain.com.pe -> mydomain.com.pe
test.mydomain.com.uk -> mydomain.com.uk
You may use
^[^.]+\.(?=.*\.)
See the regex demo.
Details
^ - start of string
[^.]+ - 1 or more chars other than dots
\. - a dot
(?=.*\.) - followed by any 0 or more chars other than line break chars, as many as possible, and then a dot.
Java usage example:
String result = domain.replaceFirst("^[^.]+\\.(?=.*\\.)", "");
The following regex will work for you. It finds the first part (if it exists), captures the rest of the string as the 2nd group, and replaces the whole string with that group. .*? is a non-greedy match that stops at the first dot character.
(.*?\.)?(.*\..*)
Regex Demo
sample code:
String domain = "dev.mydomain.com";
System.out.println(domain.replaceAll("(.*?\\.)?(.*\\..*)", "$2"));
domain = "stage.mydomain.com.pe";
System.out.println(domain.replaceAll("(.*?\\.)?(.*\\..*)", "$2"));
domain = "test.mydomain.com.uk";
System.out.println(domain.replaceAll("(.*?\\.)?(.*\\..*)", "$2"));
domain = "mydomain.com";
System.out.println(domain.replaceAll("(.*?\\.)?(.*\\..*)", "$2"));
output:
mydomain.com
mydomain.com.pe
mydomain.com.uk
mydomain.com

Java string.split vs. C# Regex.split - limit to certain number of fields

I am a Java developer, but am working on a C# project. What I need to do is split a String by a delimiter, but limit it to a certain number of fields. In Java, I can do this:
String message = "xx/xx - xxxxxxxxxxxxxxxxxxx - xxxxxxx";
String[] splitMessage = message.split("\\s-", 3);
In this case, it splits on the -, but I also want it to check for a space before the dash and to limit the result to 3 fields. The incoming String has the form ___ - ____________ - _________, where the first field is a date (like 12/31), the second is a message, and the third is a location tied to the message. I limit it to 3 fields so that the array only has 3 elements, because sometimes the message itself contains dashes, like this: 12/31 - Test message - test - Test City, 11111. So my Java code above would split it into this:
0: 12/31
1: Test message - test
2: Test City, 11111
I am trying to achieve something similar in C#, but am not sure how to limit it to a certain number of fields. This is my C# code:
var splitMessage = Regex.Split(Message, " -");
The problem is that without a limit, it splits it into 4 or 5 fields, instead of just the 3. For example, if this were the message: 12/31 - My test - don't use - just a test - Test City, 11111, it would return a string[] with 5 indexes:
0: 12/31
1: My test
2: don't use
3: just a test
4: Test City, 11111
When I want it to return this:
0: 12/31
1: My test - don't use - just a test
2: Test City, 11111
Before you ask, I can't change the incoming String. I have to parse it the same way I did in Java. So is there an equivalent to limiting it to 3 fields? Is there a better way to do it besides using Regex.Split()?
If you want to split based on the first and last instance of -, such that you get exactly three fields (so long as there are at least two dashes in the string), C# does actually have a neat trick for this. C# Regex allows for non-fixed-width lookbehinds. So the following regex:
(?<=^[^-]*)-|-(?=[^-]*$)
(?<= //start lookbehind
^ //look for start of string
[^-]* //followed by any amount of non-dash characters
) //end lookbehind
- //match the dash
| //OR
- //match a dash
(?= //lookahead for
[^-]* //any amount of non-dash characters
$ //then the end of the string
) //end lookahead
Will match the first and last dash, and allow you to split the string the way you want to.
var splitMessage = Regex.Split(Message, "(?<=^[^-]*)-|-(?=[^-]*$)");
Note that this also has no problem splitting into fewer than three groups if there are fewer dashes, but it will not split into more than three.
You can't split like that when the delimiter also appears inside one of the desired groups, except when that group is the last one.
You can however use a custom regex that consume as much as possible in the 2nd group to parse the said input:
var splitMessage = Regex.Match("12/31 - Test message - test - Test City, 11111", "^(.+?) - (.+) - (.+)$")
.Groups
.Cast<Group>()
// skip first group which is the entire match
.Skip(1)
.Select(x => x.Value)
.ToArray();
Given that the first group is "xx/xx", you can also opt to use this regex instead:
"^(../..) - (.+) - (.+)$"
// or, assuming it is a date (verbatim string so the backslashes reach the regex engine)
@"^(\d{2}/\d{2}) - (.+) - (.+)$"
EDIT: Or, you can just split by " - ", and then concatenate everything in the middle together when there are more than 3 parts:
var groups = "12/31 - Test message - test - Test City, 11111".Split(new[] { " - " }, StringSplitOptions.None);
if (groups.Length > 3)
{
groups = new[]
{
groups[0],
string.Join(" - ", groups.Skip(1).Take(groups.Length - 2)),
groups[groups.Length - 1]
};
}
When I have to split a string at certain delimiters including optional spaces, I usually do it this way:
String message = "xx/xx - xxxxxxxxxxxxxxxxxxx - xxxxxxx";
String[] splitMessage = message.split(" *- *", 3);
System.out.println(Arrays.asList(splitMessage));
Outputs: [xx/xx, xxxxxxxxxxxxxxxxxxx, xxxxxxx]
String message = "12/31 - My test - don't use - just a test - Test City; 11111";
String[] splitMessage = message.split(" *- *", 3);
System.out.println(Arrays.asList(splitMessage));
Outputs: [12/31, My test, don't use - just a test - Test City; 11111]
But you seem to want something different:
splitMessage[0] shall contain the first part
splitMessage[1] shall contain the second and third part
splitMessage[2] shall contain the rest
How do you want to tell your computer that the second output element shall contain two parts? I think this is impossible except by splitting the string into all 5 parts and then re-concatenating the parts together as you want.
Maybe it's not clear what result you want. Can you specify the requirement more clearly: What shall happen if the input string contains more than 3 elements?
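If the intended rule is "first field, everything in between, last field", the split-and-rejoin idea mentioned above could be sketched in Java roughly as follows. The class and method names are made up, and the sketch assumes the delimiter is always exactly " - " so that the middle parts can be rejoined without changing the text.

```java
import java.util.Arrays;

// Hypothetical helper; assumes fields are always separated by exactly " - ".
final class ThreeFieldSplitter {
    static String[] splitKeepingMiddle(String message) {
        String[] parts = message.split(" - ");
        if (parts.length <= 3) {
            return parts; // nothing to merge
        }
        // glue everything between the first and last part back together
        String middle = String.join(" - ", Arrays.copyOfRange(parts, 1, parts.length - 1));
        return new String[] { parts[0], middle, parts[parts.length - 1] };
    }
}
```

For the message "12/31 - My test - don't use - just a test - Test City, 11111" this yields the date, the full message with its inner dashes, and the location.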

Trying to create a regexp

I have a string that I want to parse with a Java or Python regexp:
something (\var1 \var2 \var3 $var4 #var5 $var6 *fdsfdsfd #uytuytuyt fdsgfdgfdgf aaabbccc)
The number of variables is unknown, and so are their exact names. The names may or may not start with "\", "$", "*", "#" or "#", and they're delimited by whitespace.
I'd like to parse them separately, that is, in capture groups, if possible. How can I do that? The output I want is a list of:
[\var1 , \var2 , \var3 , $var4 , #var5 , $var6 , *fdsfdsfd , #uytuytuyt , fdsgfdgfdgf , aaabbccc]
I don't need the java or python code, I just need the regexp. My incomplete one is:
something\s\(.+\)
something\s\((.+)\)
This regex captures the string containing all the variables. Split that capture on whitespace, since you are sure the variables are delimited by whitespace.
import re

# raw string so that \s and \( reach the regex engine unchanged
m = re.search(r'something\s\((.+)\)', input_string)
if m:
list_of_vars = m.group(1).split()
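Since the question also mentions Java, the same capture-then-split approach might look like this there. This is only a sketch; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating capture-then-split.
final class VarListParser {
    // Group 1 captures everything between the parentheses.
    private static final Pattern WRAPPER = Pattern.compile("something\\s\\((.+)\\)");

    static List<String> parse(String input) {
        Matcher m = WRAPPER.matcher(input);
        if (!m.find()) {
            return Collections.emptyList();
        }
        // trim() first so that leading whitespace does not produce an empty first element
        return Arrays.asList(m.group(1).trim().split("\\s+"));
    }
}
```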

GUI Email validator in Java

I'm working on GUI validation...
Please see the problem below...
How do I validate an email with a specific format: at least one digit before the #, one digit after it, and at least two letters after the dot?
String EmailFormat = "m#m.co";
Pattern patternEmail = Pattern.compile("\\d{1,}#\\d{1,}.\\d{2,}");
Matcher matcherName = patternEmail.matcher(StudentEmail);
Don't write your own validator. Email has been around for decades and there are many standard libraries which work, address parts of the standard you may not know about, and are well tested by many other developers.
Apache Commons Email Validator is a good example. Even if you use a standard validator you need to be aware of the limitations or gotchas in validating an email address. Here are the javadocs for Commons EmailValidator which state, "This implementation is not guaranteed to catch all possible errors in an email address. For example, an address like nobody#noplace.somedog will pass validator, even though there is no TLD "somedog"". So you can use a good email validator to determine if an address is valid, but you will have to do extra work to guarantee that the domain exists, accepts email, and accepts email from that address.
If you require good addresses you will need a secondary mechanism. A confirmation email is a good mechanism. You send a link to the given address and the user must visit that link to verify that email can be sent to that address.
This is the regex pattern for emails:
String pt = "^[_A-Za-z0-9-\\+]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$";
You can try it like this
List<String> email = Arrays.asList("xyzl#gmail.com", "#", "sxd");
Predicate<String> validMail = (n) -> n.matches(pt);
email.stream().filter(validMail).forEach((n) -> System.out.println(n));
This is the description of the pattern; you can change it according to your needs.
^ # start of the line
[_A-Za-z0-9-\\+]+ # must start with one or more (+) of the characters in the brackets [ ]
( # start of group #1
\\.[_A-Za-z0-9-]+ # followed by a dot "." and one or more (+) of the characters in the brackets [ ]
)* # end of group #1, this group is optional (*)
# # must contain a "#" symbol
[A-Za-z0-9-]+ # followed by one or more (+) of the characters in the brackets [ ]
( # start of group #2 - first level TLD checking
\\.[A-Za-z0-9]+ # followed by a dot "." and one or more (+) of the characters in the brackets [ ]
)* # end of group #2, this group is optional (*)
( # start of group #3 - second level TLD checking
\\.[A-Za-z]{2,} # followed by a dot "." and at least 2 of the characters in the brackets [ ]
) # end of group #3
$ # end of the line
Split email into two parts using # as delimiter:
String email = "some#email.com";
String[] parts = email.split("#"); // parts = [ "some", "email.com" ]
Validate each part separately, using multiple checks if necessary:
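// validate username: must contain at least one digit
// (matches() tests the whole string, so the digit needs .* on both sides)
String username = parts[0];
if (username.matches(".*\\d.*")) {
    // ok
}
// validate domain: must contain a digit and end with a dot plus 2-4 lowercase letters
String domain = parts[1];
if (domain.matches(".*\\d.*") && domain.matches(".*\\.[a-z]{2,4}")) {
    // ok
}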
Note that this is a very poor email validator and it shouldn't be used standalone.

Recognizing email fields without using regular expressions

We have a tokenizer which tokenizes a text file. The logic followed is quite weird but necessary in our context.
An email such as
xyz.zyx#gmail.com
will result in the following tokens :
xyz
.
zyx
#
gmail
I would like to know how we can recognize the field as an email if we are allowed to use only these tokens. No regex is allowed. We are only allowed to use the tokens and their surrounding tokens to figure out whether the field is an email field.
OK, try some (bad) logic like this:
int i = 0, j = 0;
if (str.contains(".") && str.contains("#"))
{
    // parenthesize each assignment so i and j receive the indexes, not the comparison result
    if ((i = str.indexOf(".")) < (j = str.indexOf("#")))
    {
        if (i != 0 && i + 1 != j) // ignore strings like .# , abc.#
            return true;
    }
}
return false;
Logically split an e-mail address into 3 parts:
A user name (or resource name), for this explanation let's call it the user name
The # character.
A host name, consisting of any number of "word dot" sequences + a final top level domain string.
Do a walk like this:
while token can be part of a user name
    fetch next token;
    if there are no more -> no e-mail;
check if the next token is #
    if not -> no e-mail
while there are tokens
    while token can be part of a host name subpart (the "word" above)
        fetch next token;
        if there are no more -> might be a valid e-mail address
    check if the next token is a dot
        if not -> might be a valid e-mail address
    set a flag that you found at least one dot
    check if the next token can be part of a host name subpart
        if not -> no valid e-mail address (or maybe you ignore a trailing dot and take what was found so far)
Add further checks if there are more tokens where needed. You also may have to post process the found tokens to ensure a valid e-mail address and you may have to rewind your tokenizer (or cache the fetched tokens) in case you did not find a valid e-mail address and need to feed the same input to some other recognition process.
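The walk above could be sketched in Java along these lines. Everything here is an assumption made for illustration: the class and method names, the tokens arriving as a List<String>, and the definition of a "word" token as letters and digits only. As a simplification, the sketch requires the whole token list to form the address and requires at least one dot in the host part, rather than stopping early and rewinding.

```java
import java.util.List;

// Hypothetical sketch of the token walk; no regex involved.
final class TokenWalk {
    // Assumption: a "word" token consists only of letters and digits.
    static boolean isWord(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isLetterOrDigit);
    }

    static boolean looksLikeEmail(List<String> tokens) {
        int n = tokens.size();
        int i = 0;
        // User name: consume word and dot tokens.
        while (i < n && (isWord(tokens.get(i)) || ".".equals(tokens.get(i)))) {
            i++;
        }
        if (i == 0 || i == n) {
            return false; // empty user name, or nothing left for the marker
        }
        if (!"#".equals(tokens.get(i))) {
            return false; // the marker must follow the user name
        }
        i++;
        // Host name: word (dot word)*, with at least one dot.
        boolean sawDot = false;
        while (true) {
            if (i == n || !isWord(tokens.get(i))) {
                return false; // a host name subpart is expected here
            }
            i++;
            if (i == n) {
                return sawDot; // input ended right after a word
            }
            if (!".".equals(tokens.get(i))) {
                return false; // simplification: leftover tokens reject the input
            }
            sawDot = true;
            i++;
        }
    }
}
```

A real tokenizer-based recognizer would more likely return how many tokens were consumed, so the caller can rewind or hand the remaining tokens to another recognizer.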
Check if a list of tokens is an email:
list contains exactly one token #
index of token # != 0
at least 3 tokens after #
at least 1 . token after #, but not immediately after
starts and ends with character tokens
Additional checks:
no two consecutive . tokens
no special characters
length of character tokens after # is at least 2
total length of all character tokens before # is at least 3
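The first group of checks could be expressed in Java roughly like this (the additional checks are left out). The names are hypothetical, the tokens are assumed to arrive as a List<String>, and a "character token" is assumed to mean a token made of letters and digits only.

```java
import java.util.List;

// Hypothetical sketch of the basic checklist; no regex involved.
final class TokenChecklist {
    // Assumption: a "character token" consists only of letters and digits.
    static boolean isWord(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isLetterOrDigit);
    }

    static boolean passesBasicChecks(List<String> tokens) {
        int at = tokens.indexOf("#");
        // exactly one # token, and not at index 0
        if (at <= 0 || at != tokens.lastIndexOf("#")) {
            return false;
        }
        List<String> afterAt = tokens.subList(at + 1, tokens.size());
        // at least 3 tokens after #
        if (afterAt.size() < 3) {
            return false;
        }
        // at least one . after #, but not immediately after
        if (".".equals(afterAt.get(0)) || !afterAt.contains(".")) {
            return false;
        }
        // starts and ends with character tokens
        return isWord(tokens.get(0)) && isWord(tokens.get(tokens.size() - 1));
    }
}
```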
