Does space affect the result in regex (Java)

I'm using regex to define a set of rules that extract specific information from unstructured resumes.
This information is:
Company that the applicant worked in or is still working in
Role (designation), e.g. software engineer
Date (From-To)
Every applicant writes his/her employment details in his/her own way. However, some resumes share a common style, for example:
2012- 2014.Dean of the Faculty of Engineering Information Technology/
University Name.
So I defined this regex to extract the needed information.
Here is my regex:
(^[0-9]{4})(-|–|.|_|to) ([0-9]{4})(.*) (of the|at|in) (.*).
This regex was able to extract the information from the above example:
role:Dean
company: Faculty of Engineering Information Technology/University Name.
date from: 2012 to :2014
loyalty: 2 years // this depends on the extracted date
But I have another sample from another resume that has the same style of writing:
1996-1997, Lecturer in Computer Science Department, Jerusalem open
university.
It should match, but it didn't until I removed the space in the regex; then it was able to extract the data.
My question is: does the space affect the regex?
And how can I fix this so that it extracts the data from both resumes regardless of the space in the regex rule?

Does the space affect the regex?
You have determined for yourself that it does. Space characters are not regex metacharacters, unless you enable the COMMENTS option in your pattern. Ordinarily, they stand for themselves, just like most other characters.
How can I fix this so that it extracts the data from both resumes regardless of the space in the regex rule?
You can apply quantifiers such as ? or * to space characters in your regex, just like you can to any other character or group. So, for example, you might use
(^[0-9]{4})(-|–|.|_|to) *([0-9]{4})(.*) (of the|at|in) (.*).
Do consider also that you might sometimes have to deal with tab characters, too. You can use the escape sequence \s to match any single whitespace character, whether it be a space, a tab, a line terminator, or any other character that Java's \s class recognizes as whitespace.

You can allow an optional amount of whitespace by using \\s* instead of a literal space. \\s means a whitespace character, and * means zero or more.
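A minimal Java sketch of the difference, assuming only the date-range prefix of the rule matters here. The class name ResumeDateRange and the helper years are made up for illustration, and the dot in the separator group is escaped so that it matches a literal period rather than any character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResumeDateRange {
    // Literal space after the separator, as in the rule from the question.
    // \u2013 is the en dash.
    static final Pattern STRICT =
            Pattern.compile("^([0-9]{4})(-|\u2013|\\.|_|to) ([0-9]{4})");

    // \\s* accepts zero or more whitespace characters around the separator.
    static final Pattern FLEXIBLE =
            Pattern.compile("^([0-9]{4})\\s*(-|\u2013|\\.|_|to)\\s*([0-9]{4})");

    /** Returns {from, to}, or null when the line does not start with a year range. */
    static String[] years(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(3) } : null;
    }

    public static void main(String[] args) {
        String first = "2012- 2014.Dean of the Faculty of Engineering Information Technology/";
        String second = "1996-1997, Lecturer in Computer Science Department, Jerusalem open";
        for (String line : new String[] { first, second }) {
            System.out.println("strict: " + (years(STRICT, line) != null)
                    + ", flexible: " + (years(FLEXIBLE, line) != null));
        }
    }
}
```

The strict pattern matches only the first sample, because the second has no space after the hyphen; the flexible pattern matches both.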


Regex expression for comma and dash separated text of items

I have a Java web application where I get some input from the user. Once I have this input I have to parse it, and the parsing depends on what kind of input I get. I decided to use Java's Pattern class for some of the predefined user inputs.
So I need the last 2 regex patterns:
a) Enumeration:
input can be - A03,B24.1,A25.7
The simple way would be to check whether there is a comma in there ([^,]+), but that would require a lot of updates to the parsing function, which I would like to avoid. So, in addition to the comma, it should check the following:
starts with a letter
minimum 3 characters (letters combined with numbers)
can have one dot in the word
minimum 1 comma (updated it)
b) Mixed:
input can be A03,B24.1-B35.5,A25.7
So everything the Enumeration part has, with the addition that it can have a dash (minimum one).
I've tried multiple online regex generators but didn't get it right. I would much appreciate your help.
Here is what I have for a simple range such as B24.1-B35.5:
"='.{1}\\d{0,2}-.{1}\\d{0,2}'|='.{1}\\d{1,2}.\\d{1,2}-.{1}\\d{1,2}.\\d{1,2}'";
Edit 1: Valid and invalid inputs
for a) Enumeration
A03,B24.1,A25.7 - Valid
A03,B24.1 - Valid
A03,B24.1-B25.1 - Invalid, because in this case (enumeration) it should not contain a dash
A03 - Invalid, because there is no comma
for b) Mixed
everything that an enumeration has, with the addition that it can have a dash too.
You can use this regex for (a) Enumeration part as per your rules:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Rules:
Verifies that each segment starts with a letter
Minimum of three letters or numbers: [A-Za-z][A-Za-z0-9]{2,}
Optionally followed by a decimal point . and one or more letters or digits, i.e. (?:\.[A-Za-z0-9]{1,})?
Same thing repeated, separated by a comma. There must also be at least one comma, hence the +, i.e. (?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
?: indicates a non-capturing group
Using [A-Za-z0-9] instead of \w to avoid underscores
Regex101 Demo
For (b) Mixed, you haven't shared too many valid and invalid cases, but based on my current understanding here's what I have:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:[,-][A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Note that , from previous regex has been replaced with [,-] to allow - as well!
Regex101 Demo
// Will match
A03,B24.1-B35.5,A25.7
A03,B24.1,A25.7
A03,B24.1-B25.1
Hope this helps!
EDIT: Making sure each group starts with a letter (and not a number)
Thanks to @diginoise and @anubhava for pointing this out! Changed [A-Za-z0-9]{3,} to [A-Za-z][A-Za-z0-9]{2,}
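As a rough sketch, the two patterns above can be exercised from Java like this. The class and method names are invented for illustration, and {1,} is written as the equivalent +. String.matches-style full matching is used, so no ^ or $ anchors are needed.

```java
import java.util.regex.Pattern;

class CodeListPatterns {
    // One item: a letter, at least two more letters/digits, optional ".xyz" part.
    private static final String SEGMENT =
            "[A-Za-z][A-Za-z0-9]{2,}(?:\\.[A-Za-z0-9]+)?";

    // (a) Enumeration: items separated by commas, at least one comma.
    static final Pattern ENUMERATION =
            Pattern.compile(SEGMENT + "(?:," + SEGMENT + ")+");

    // (b) Mixed: items separated by commas or dashes, at least one separator.
    static final Pattern MIXED =
            Pattern.compile(SEGMENT + "(?:[,-]" + SEGMENT + ")+");

    static boolean isEnumeration(String input) {
        return ENUMERATION.matcher(input).matches();
    }

    static boolean isMixed(String input) {
        return MIXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "A03,B24.1,A25.7", "A03,B24.1-B25.1", "A03" };
        for (String s : samples) {
            System.out.println(s + " enumeration=" + isEnumeration(s) + " mixed=" + isMixed(s));
        }
    }
}
```

Note that the mixed pattern, as written in the answer, also accepts comma-only input; it allows a dash but does not require one.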
As I said in the comments, I would chop the input by commas and verify each segment separately. Your domain, ICD-10-CM codes, is very well defined, and I would be very wary of any input which could be invalid yet pass the validation.
Here is my solution:
regex
([A-TV-Z][0-9][A-Z0-9](\.?[A-Z0-9]{0,4})?)
... however I would avoid that.
Since your domain is (most likely) medical software, people's lives (or at least their well-being) are at stake. Not to mention astronomical damages and the lawyers ever-chasing ambulances. Therefore avoid the easy solution, and implement the bomb-proof one.
You could use the regex to establish that given code is definitely not valid. However if a code passes your regex it does not mean that it is valid.
bomb proof method
See this example: O09.7, O09.70, O09.71, O09.72, O09.73 are valid entries, but O09.1 is not valid.
Therefore just get all possible codes. According to this gist there are 42784 different codes. Just load them into memory, and any code which is not in the set is not valid. You could compress the list and be clever about the in-memory encoding to occupy less space, but verbatim all the codes are under 300kB on disk, so a few MB at most in memory. That is not a massive price to pay to keep someone from having the left kidney removed instead of the right one.
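A minimal sketch of that set-based check, under a few assumptions that are not in the answer: the class name IcdCodeValidator is hypothetical, the full code list is loaded elsewhere and passed in, and a range such as B24.1-B35.5 is treated as valid when both of its endpoints are known codes (the codes in between are not checked).

```java
import java.util.HashSet;
import java.util.Set;

class IcdCodeValidator {
    // All known codes, loaded once (for example from the full published list).
    private final Set<String> knownCodes;

    IcdCodeValidator(Set<String> knownCodes) {
        this.knownCodes = new HashSet<>(knownCodes);
    }

    /** True only if every comma-separated item, and every range endpoint, is a known code. */
    boolean allValid(String input) {
        // Limit -1 keeps trailing empty strings, so "A03," is rejected.
        for (String item : input.split(",", -1)) {
            for (String code : item.split("-", -1)) {
                if (!knownCodes.contains(code.trim())) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

With a set containing O09.70, O09.71 and O09.72, the input "O09.70,O09.71" passes while "O09.70,O09.1" is rejected, even though O09.1 looks well formed to a regex.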

Stanford NER - Unable to identify Phone number

I am training my NER model on the entity type Phonenumber, whose part of speech is number. However, when I test on the same data that I trained on, the phone number is not identified by the classifier.
Is that because the part of speech(POS) of phone number is number(CD)?
You might want to use regexner instead for this use case.
Consider this sentence (put it in phone-number-example.txt):
You can reach the office at 555 555-5555.
If you make a regexner rules file like this (note each column is tab separated)
[0-9]{3}\W[0-9]{3}-[0-9]{4} PHONE_NUMBER MISC,NUMBER 1
And run this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping phone_number.rules -file phone-number-example.txt -outputFormat text
It will identify the phone number in the output NER tagging.
One issue to look out for: you will note the tokenizer turns "555 555-5555" into one token. The first column of the rule file is a regex that matches a token. The regexner patterns are a space-separated list of patterns, one for each token you want to NER tag.
So in this example, the rule I made has a "\W" to capture the space. The rule wasn't working when I used "\s", so I think there is an issue with writing regexes for tokens that contain spaces. Typically tokens don't contain spaces, for that matter.
So you might want to work around this by expanding on "\W" and excluding other characters that you don't want, since "\W" just means non-word characters. Also, you can obviously make the pattern I just listed more complicated to capture the various phone number patterns.
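One possible explanation for "\s" failing, offered as an assumption rather than something the answer confirms: if the tokenizer stores the space inside such a token as a non-breaking space (U+00A0), Java's default "\s" class will not match it, while "\W" will. The sketch below only demonstrates that java.util.regex behaviour; the class name is made up and no CoreNLP classes are involved.

```java
import java.util.regex.Pattern;

class TokenSpaceCheck {
    static final Pattern WITH_NON_WORD =
            Pattern.compile("[0-9]{3}\\W[0-9]{3}-[0-9]{4}");
    static final Pattern WITH_WHITESPACE =
            Pattern.compile("[0-9]{3}\\s[0-9]{3}-[0-9]{4}");

    static boolean matches(Pattern pattern, String token) {
        return pattern.matcher(token).matches();
    }

    public static void main(String[] args) {
        String plainSpace = "555 555-5555";
        // Assumed token form: the inner space replaced by a non-breaking space.
        String nonBreaking = "555\u00A0555-5555";

        System.out.println("\\s on plain space:    " + matches(WITH_WHITESPACE, plainSpace));
        System.out.println("\\s on U+00A0:         " + matches(WITH_WHITESPACE, nonBreaking));
        System.out.println("\\W on plain space:    " + matches(WITH_NON_WORD, plainSpace));
        System.out.println("\\W on U+00A0:         " + matches(WITH_NON_WORD, nonBreaking));
    }
}
```

If that assumption holds, a narrower alternative to "\W" would be a character class such as [ \u00A0] in place of the separator.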
More info on RegexNER can be found here:
http://nlp.stanford.edu/software/regexner.html

Search database table with all special characters

I have a project table with a project name column, and the project name may contain any special character, any alphanumeric value, or any combination of numbers, words, and special characters.
Now I need to apply a keyword search to it, and the search term may contain any special character.
So my question is: how can we search for single or multiple special characters in the database?
I am using MySQL 5.0 with the Java Hibernate API.
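This should be possible with some simple sanitization of your query.
e.g. a search for \#(%*#$\ becomes:
SELECT * FROM foo WHERE name LIKE "%\\\\#(\%*#$\\\\%";
Each literal backslash is written four times because it is unescaped twice: once by the string-literal parser and once by LIKE. The \% keeps the inner percent sign literal, while the unescaped % at each end remains a wildcard, so the search ends up matching anything that contains "\#(%*#$\"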
In general anything that's a special character in a string can be escaped via a backslash. This only really becomes tricky if you have a name such as: "\\foo\\bar\\" which to escape properly would become "\\\\foo\\\\bar\\\\"
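A minimal sketch of the sanitization step in Java, assuming the search term is then bound as a query parameter (for example through a PreparedStatement or a Hibernate query parameter) rather than concatenated into the SQL text. With a bound parameter only the LIKE-level escaping is needed, and backslash is MySQL's default LIKE escape character. The class and method names are hypothetical.

```java
class LikeEscaper {
    /** Escapes the LIKE wildcards % and _ and the escape character \ itself. */
    static String escapeLike(String term) {
        StringBuilder sb = new StringBuilder(term.length() + 8);
        for (char c : term.toCharArray()) {
            if (c == '\\' || c == '%' || c == '_') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a "contains" pattern: wildcards outside, escaped term inside. */
    static String containsPattern(String term) {
        return "%" + escapeLike(term) + "%";
    }

    public static void main(String[] args) {
        // The value printed here is what would be bound to: ... WHERE name LIKE ?
        System.out.println(containsPattern("\\#(%*#$\\"));
    }
}
```

Binding the parameter also removes the need to escape quotes by hand, since the driver handles the string-literal layer.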
A side note: please proofread your posts prior to finalizing. It's really depressing and shows a lack of effort when your question's title has spelling errors in it.

Regular expression, excluding .. in suffix of email addy [duplicate]

This is homework, I've been working on it for a while, I've done lots of reading and feel I have gotten pretty familiar with regex for a beginner.
I am trying to find a regular expression for validating/invalidating a list of emails. There are two addresses which are giving me problems, I can't get them both to validate the correct way at the same time. I've gone through a dozen different expressions that work for all the other emails on the list but I can't get those two at the same time.
First, the addresses.
me@example..com - invalid
someone.nothere@1.0.0.127 - valid
The part of my expression which validates the suffix
I originally started with
@.+\\.[[a-z]0-9]+
And had a second pattern for checking some more invalid addresses and checked the email against both patterns, one checked for validity, the other invalidity, but my professor said he wanted it all in one expression.
@[[\\w]+\\.[\\w]+]+
or
@[\\w]+\\.[\\w]+
I've tried it written many, many different ways but I'm pretty sure I was just using different syntax to express these two expressions.
I know what I want it to do, I want it to match a character class of "character+"."character+"+
The plus sign being at least one. It works for the invalid address when I only allow the character class to repeat one time (and obviously the IP doesn't get matched), but when I allow the character class to repeat itself it matches the second period even though it isn't preceded by a character. I don't understand why.
I've even tried grouping everything with () and putting {1} after the escaped . and changing the \w to a-z and replacing + with {1,}; nothing seems to require the period to be surrounded by characters.
You need a negative look-ahead :
@\w+\.(?!\.)
See http://www.regular-expressions.info/lookaround.html
test in Perl :
Perl> $_ = 'someone.nothere@1.0.0.127'
someone.nothere@1.0.0.127
Perl> print "OK\n" if /\@\w+\.(?!\.)/
OK
1
Perl> $_ = 'me@example..com'
me@example..com
Perl> print "OK\n" if /\@\w+\.(?!\.)/
Perl>
@([\\w]+\\.)+[\\w]+
Matches at least one word character, followed by a '.'. This is repeated at least once, and is then followed by at least one more word character.
I think you want this:
@[\\w]+(\\.[\\w]+)+
This matches a "word" followed by one or more "." "word" sequences. (You can also do the grouping the other way around; e.g. see Dailin's answer.)
The problem with what you were doing before was that you were trying to embed a repeat inside a character class. That doesn't make sense, and there is no syntax that would support it. A character class defines a set of characters and matches against one character. Nothing more.
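A small Java check of the "word followed by one or more .word sequences" idea against the two problem addresses, written with the @ separator of a real address. The class name is invented for illustration, and a trailing $ is added here (an assumption beyond the answer) so that the domain part must run to the end of the string.

```java
import java.util.regex.Pattern;

class EmailSuffixCheck {
    // "@", a word, then one or more ".word" groups, up to the end of the input.
    static final Pattern DOMAIN = Pattern.compile("@\\w+(\\.\\w+)+$");

    static boolean hasValidDomain(String email) {
        return DOMAIN.matcher(email).find();
    }

    public static void main(String[] args) {
        System.out.println(hasValidDomain("someone.nothere@1.0.0.127"));
        System.out.println(hasValidDomain("me@example..com"));
    }
}
```

The second address fails because after "example" each "." must be followed by at least one word character, and the doubled period breaks that.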
The official standard, RFC 2822, describes the syntax that valid email addresses must adhere to. It can be implemented with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
A more practical implementation of RFC 2822 (if we omit the syntax using double quotes and square brackets), which will still match 99.99% of all email addresses in actual use today, is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

Regex for university emails

I am looking to validate email addresses by making sure they have a specific university subdomain, e.g. if the user says they attend Oxford University, I want to check that their email ends in .ox.ac.uk
If I have the '.ox.ac.uk' part stored as a variable, how can I incorporate this with a regex to check the whole email is valid and ends in that variable suffix?
Many thanks!
We are using this email pattern (derived from this regular-expressions.info article):
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$
You should be able to extend it with your needed suffix:
^[\w!#$%&'*+/=?^`{|}~-]+(?:\.[\w!#$%&'*+/=?^`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+(?:ox\.ac\.uk)$
Note that I replaced the TLD part [a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])? with your required suffix (?:ox\.ac\.uk), where \. matches a literal dot only. The dot in front of ox is already consumed by the preceding (?:...\.)+ group, so it is not repeated in the suffix.
Edit: one additional note: if you use String#matches(...) or Matcher#matches() there's no need for the leading ^ and the trailing $, since the entire string would have to match anyways.
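A sketch of how the suffix could come from a variable in Java, assuming it is stored with its leading dot (for example ".ox.ac.uk"). Pattern.quote makes the stored suffix match literally, so its dots need no manual escaping. The class name, constants, and method are illustrative only, and the local-part and label sub-patterns follow the pattern quoted above.

```java
import java.util.regex.Pattern;

class UniversityEmailCheck {
    // Local part: allowed characters, with single dots between runs.
    private static final String LOCAL =
            "[\\w!#$%&'*+/=?^`{|}~-]+(?:\\.[\\w!#$%&'*+/=?^`{|}~-]+)*";
    // One domain label: alphanumeric at both ends, hyphens allowed inside.
    private static final String LABEL =
            "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    /** True if the whole address is well formed and its domain ends with the given suffix. */
    static boolean hasSuffix(String email, String suffix) {
        Pattern p = Pattern.compile(
                LOCAL + "@" + LABEL + "(?:\\." + LABEL + ")*" + Pattern.quote(suffix),
                Pattern.CASE_INSENSITIVE);
        // matches() requires the entire input to match, so no ^ or $ is needed.
        return p.matcher(email).matches();
    }

    public static void main(String[] args) {
        String suffix = ".ox.ac.uk";
        System.out.println(hasSuffix("jane.doe@cs.ox.ac.uk", suffix));
        System.out.println(hasSuffix("jane.doe@example.com", suffix));
    }
}
```

Because the suffix starts with a dot, this version requires at least one label in front of it; an address directly at ox.ac.uk would need the suffix stored without the leading dot or a separate rule.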
Assuming you are using php.
$ending = '.ox.ac.uk';
if(preg_match('/'.preg_quote($ending).'$/i', $email_address)) //... your code
Further info: the preg_quote() is necessary so that characters get escaped if they have a special meaning. In your case it's the dots.
edit: To check if the whole email is valid, see other questions, it is asked a lot. Just wanted to help with your special case.
