Stanford NER - Unable to identify Phone number - java

I am training my NER model on the entity type Phonenumber, whose part of speech is number. However, when I test on the same data I trained on, the classifier does not identify the phone number.
Is that because the part of speech (POS) of a phone number is number (CD)?

You might want to use regexner instead for this use case.
Consider this sentence (put it in phone-number-example.txt):
You can reach the office at 555 555-5555.
If you make a regexner rules file like this (note each column is tab separated)
[0-9]{3}\W[0-9]{3}-[0-9]{4} PHONE_NUMBER MISC,NUMBER 1
And run this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping phone_number.rules -file phone-number-example.txt -outputFormat text
It will identify the phone number in the output NER tagging.
One issue to look out for: the tokenizer turns "555 555-5555" into one token. The first column of the rules file is a regex that matches a token; more generally, a regexner pattern is a space-separated list of regexes, one for each token you want to NER tag.
So in this example, the rule I made uses "\W" to capture the space. The rule wasn't working when I used "\s", so I think there is an issue with writing regexes for tokens that contain spaces. Typically tokens don't contain spaces, for that matter.
You might want to work around this by replacing "\W" with something narrower that excludes characters you don't want, since "\W" just means any non-word character. You can also make the pattern more elaborate to capture the various phone number formats.
More info on RegexNER can be found here:
http://nlp.stanford.edu/software/regexner.html
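One possible explanation for the "\s" failure, offered as an assumption rather than something confirmed above: PTBTokenizer has a normalizeSpace option that turns spaces inside tokens into U+00A0 (non-breaking space), and Java's "\s" does not match U+00A0 by default, while "\W" does. The sketch below checks only the regex side of that with plain java.util.regex. The class and method names are made up for illustration, and the token is built by hand rather than produced by CoreNLP.

```java
import java.util.regex.Pattern;

// Illustrative only: the token is hand-built, assuming the tokenizer
// replaced the inner space with U+00A0 (non-breaking space).
final class PhoneTokenCheck {
    static final String TOKEN = "555\u00A0555-5555";

    // Pattern from the rules file: \W matches any non-word character.
    static final Pattern WITH_NON_WORD =
        Pattern.compile("[0-9]{3}\\W[0-9]{3}-[0-9]{4}");

    // Same pattern with \s, which by default covers only space, tab,
    // newline, vertical tab, form feed and carriage return.
    static final Pattern WITH_WHITESPACE =
        Pattern.compile("[0-9]{3}\\s[0-9]{3}-[0-9]{4}");

    // Narrower alternative to \W: a plain space or U+00A0 only.
    static final Pattern WITH_SPACE_CLASS =
        Pattern.compile("[0-9]{3}[ \\u00A0][0-9]{3}-[0-9]{4}");

    static boolean matchesToken(Pattern pattern, String token) {
        return pattern.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println("\\W version matches: " + matchesToken(WITH_NON_WORD, TOKEN));
        System.out.println("\\s version matches: " + matchesToken(WITH_WHITESPACE, TOKEN));
        System.out.println("space-class version matches: " + matchesToken(WITH_SPACE_CLASS, TOKEN));
    }
}
```

If the assumption holds for your tokenizer settings, a class such as [ \u00A0] in the rules file would be a tighter replacement for "\W".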

Related

Match custom pattern in regex multiple times

I am trying to parse a query which I need to modify to replace a specific property and its value with another property and different values. I am struggling to write a regex that will match the specific property and its value that I need.
Here are some examples to illustrate my point. test:property is the property name that we need to match.
Property with a single value: test:property:schema:Person
Property with multiple values (there is no limit on how many values there can be - this example uses 3): test:property:(schema:Person OR schema:Organization OR schema:Place)
Property with a single value in brackets: test:property:(schema:Person)
Property with another property in the query string (i.e. there are other parts of the string that I'm not interested in): test:property:schema:Person test:otherProperty:anotherValue
Also note that other combinations are possible such as other properties being before the property I need to capture, my property having multiple values with another property present in the query.
I want to match on the entire test:property section with each value captured within that match. Given the examples above these are the results I am looking for:
#  Match                                                                 Groups
1  test:property:schema:Person                                           schema:Person
2  test:property:(schema:Person OR schema:Organization OR schema:Place)  schema:Person, schema:Organization, schema:Place
3  test:property:(schema:Person)                                         schema:Person
4  test:property:schema:Person                                           schema:Person
Note: #1 and #4 produce the same output. I wanted to illustrate that the rest of the string should be ignored (I only need to change the test:property key and value).
The pattern of schema:Person is defined as \w+\:\w+, i.e. one or more word characters, followed by a colon, followed by one or more word characters.
If we define the known parts of the string with names I think I can express what I want to match.
schema:Person - <TypeName> - note that the first part, schema in this case, is not fixed and can be different
test:property - <MatchProperty>
<MatchProperty>: // property name (which is known and the same - in the examples this is `test:property`) followed by a colon
( // optional open bracket
<TypeName>
(OR <TypeName>)* // optional additional TypeNames separated by an OR
) // optional close bracket
Every example I've found has had simple alphanumeric characters in the repeating section but my repeating pattern contains the colon which seems to be tripping me up. The closest I've got is this:
(test\:property:(?:\(([\w+\:\w+]+ [OR [\w+\:\w+]+)\))|[\w+\:\w+]+)
Which works okayish when there are no other properties (although the match for example #2 contains the entire property and value as the first group result, and a second group with the property value) but goes crazy when other properties are included.
Also, putting that regex through https://regex101.com/ I know it's not right as the backslash characters in the square brackets are being matched exactly. I started to have a go with capturing and non-capturing groups but got as far as this before giving up!
(?:(\w+\:\w+))(?:(\sOR\s))*(?:(\w+\:\w+))*
This isn't a complete solution if you want pure regex because there are some limitations to regex and Java regex in particular, but the regexes I came up with seem to work.
If you're looking to match the entire sequence, the following regex will work.
test:property:(?:\((\w+:\w+)(?:\sOR\s(\w+:\w+))*\)|(\w+:\w+))
Unfortunately, the repeated capture groups will only capture the last match, so in queries with multiple values (like example 2), groups 1 and 2 will be the first and last values (schema:Person and schema:Place). In queries without parentheses, the value will be in group 3.
If you know the maximum number of values, you could just generate a massive regex that will have enough groups, but this might not be ideal depending on your application.
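To see the "last match only" behaviour of a repeated capture group in Java, here is a minimal sketch using the full-sequence regex above. The class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing what the three capture groups hold after a match.
final class RepeatedGroupDemo {
    static final Pattern FULL = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+)(?:\\sOR\\s(\\w+:\\w+))*\\)|(\\w+:\\w+))");

    // Returns groups 1..3 of the first match, or null if there is no match.
    static String[] groups(String query) {
        Matcher m = FULL.matcher(query);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Group 2 sits inside a repeated group, so only its last iteration survives.
        System.out.println(Arrays.toString(groups(
            "test:property:(schema:Person OR schema:Organization OR schema:Place)")));
        System.out.println(Arrays.toString(groups(
            "test:property:schema:Person test:otherProperty:anotherValue")));
    }
}
```

The middle value (schema:Organization) is matched but never ends up in a group, which is why this regex alone cannot return every value.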
The other regex to find values in groups of arbitrary length uses regex's positive lookbehind to match valid values. You can then generate an array of matches.
(?<=test:property:(?:(?:\((?:\w+:\w+\sOR\s)+)|\(?))\w+:\w+
The issue with this method is that Java lookbehind appears to have some limitations, specifically not allowing unbounded or complex quantifiers. I'm not a Java person so I haven't tried it myself, but it seems like this wouldn't work either. If someone else has another solution, please post another answer!
With this in mind, I would probably suggest going with a combination regex + string parsing method. You can use regex to parse out the value or multiple values (separated by OR), then split the string to get your final values.
To match the entire part inside parentheses, or the single value without parentheses, you can use this regex:
test:property:(?:\((\w+:\w+(?:\sOR\s\w+:\w+)*)\)|(\w+:\w+))
It's still split into two groups where one matches values with parentheses and the other matches values without (to avoid matching unpaired parentheses), but it should be usable.
If you want to play around with these regexes or learn more, here's a regexr: https://regexr.com/65kma
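The combined regex + string parsing approach could look roughly like this in Java. It is a sketch, not a tested production parser; the class and method names are hypothetical, and it only looks at the first test:property occurrence in the query.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex finds the value section, split() separates the values.
final class PropertyValues {
    static final Pattern VALUES = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+(?:\\sOR\\s\\w+:\\w+)*)\\)|(\\w+:\\w+))");

    static List<String> extract(String query) {
        Matcher m = VALUES.matcher(query);
        if (!m.find()) {
            return Collections.emptyList();
        }
        // Group 1 holds the parenthesised list, group 2 the bare single value.
        String raw = m.group(1) != null ? m.group(1) : m.group(2);
        return Arrays.asList(raw.split("\\sOR\\s"));
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "test:property:(schema:Person OR schema:Organization OR schema:Place)"));
        System.out.println(extract(
            "test:otherProperty:anotherValue test:property:(schema:Person)"));
    }
}
```

m.start() and m.end() give the span of the whole test:property section if you need to replace it in the original query.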

Regex expression for comma and dash separated text of items

I have a Java web application where I get some input from the user. Once I have this input I have to parse it, and the parsing depends on what kind of input I get. I decided to use Java's Pattern class for some of the predefined user inputs.
So I need the last 2 regex patterns:
a) Enumeration:
input can be - A03,B24.1,A25.7
The simple way would be to check whether there is a comma in there ([^,]+), but that would end up requiring a lot of updates to the parsing function, which I would like to avoid. So, in addition to the comma, it should check whether it starts with
letter
minimum 3 letters (combined with numbers)
can have one dot in the word
minimum 1 comma (updated it)
b) Mixed
input can be A03,B24.1-B35.5,A25.7
So everything the Enumeration part has, with the addition that it can contain at least one dash.
I've tried multiple online regex generators but didn't get it right. It would be much appreciated if you could help.
Here is what I have for the simple-range case, e.g. B24.1-B35.5:
"='.{1}\\d{0,2}-.{1}\\d{0,2}'|='.{1}\\d{1,2}.\\d{1,2}-.{1}\\d{1,2}.\\d{1,2}'";
Edit 1: Valid and invalid inputs
for a) Enumeration
A03,B24.1,A25.7 Valid
A03,B24.1 Valid
A03,B24.1-B25.1 Invalid, because in this case (enumeration) it should not contain a dash
A03 Invalid, because there is no comma
A03,B24.1 - Valid
A03 Invalid
for b) Mixed
everything that an enumeration has, with the addition that it can contain a dash too.
You can use this regex for (a) Enumeration part as per your rules:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Rules:
Verifies that each segment starts with a letter
Minimum of three letters or digits: [A-Za-z][A-Za-z0-9]{2,}
Optionally followed by a decimal point . and one or more letters or digits, i.e. (?:\.[A-Za-z0-9]{1,})?
The same thing repeated, separated by a comma. There must also be at least one comma, hence the +, i.e. (?:,[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
?: to indicate non-capturing group
Using [A-Za-z0-9] instead of \w to avoid underscores
For (b) Mixed, you haven't shared too many valid and invalid cases, but based on my current understanding here's what I have:
[A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?(?:[,-][A-Za-z][A-Za-z0-9]{2,}(?:\.[A-Za-z0-9]{1,})?)+
Note that , from previous regex has been replaced with [,-] to allow - as well!
// Will match
A03,B24.1-B35.5,A25.7
A03,B24.1,A25.7
A03,B24.1-B25.1
Hope this helps!
EDIT: Making sure each group starts with a letter (and not a number)
Thanks to @diginoise and @anubhava for pointing out! Changed [A-Za-z0-9]{3,} to [A-Za-z][A-Za-z0-9]{2,}
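For use with Java's Pattern class, the two regexes can be wrapped as below. This is a minimal sketch with made-up class and method names; it uses matches(), so the whole input has to fit the pattern, and {1,} is written as the equivalent +.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the (a) Enumeration and (b) Mixed regexes.
final class CodeListPatterns {
    // One item: a letter, at least two more letters/digits, optional ".xyz" part.
    private static final String CODE =
        "[A-Za-z][A-Za-z0-9]{2,}(?:\\.[A-Za-z0-9]+)?";

    // (a) items separated by commas, at least one comma.
    static final Pattern ENUMERATION =
        Pattern.compile(CODE + "(?:," + CODE + ")+");

    // (b) items separated by commas or dashes, at least one separator.
    static final Pattern MIXED =
        Pattern.compile(CODE + "(?:[,-]" + CODE + ")+");

    static boolean isEnumeration(String input) {
        return ENUMERATION.matcher(input).matches();
    }

    static boolean isMixed(String input) {
        return MIXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "A03,B24.1,A25.7", "A03,B24.1-B25.1", "A03" }) {
            System.out.println(s + " enumeration=" + isEnumeration(s) + " mixed=" + isMixed(s));
        }
    }
}
```

Note that the Mixed pattern as written also accepts inputs with no dash at all, so if "mixed" must contain at least one dash you would need an extra check.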
As I said in the comments, I would split the input on commas and verify each segment separately. Your domain, ICD-10-CM codes, is very well defined, and I would be very wary of any input that could be invalid yet pass the validation.
Here is my solution:
regex
([A-TV-Z][0-9][A-Z0-9](\.?[A-Z0-9]{0,4})?)
... however I would avoid that.
Since your domain is (most likely) medical software, people's lives (or at least their well-being) are at stake, not to mention astronomical damages and ambulance-chasing lawyers. Therefore avoid the easy solution and implement the bomb-proof one.
You could use the regex to establish that given code is definitely not valid. However if a code passes your regex it does not mean that it is valid.
bomb proof method
See this example: O09.7, O09.70, O09.71, O09.72, O09.73 are valid entries, but O09.1 is not valid.
Therefore just get all possible codes. According to this gist there are 42784 different codes. Load them into memory, and any code that is not in the set is not valid. You could compress the list and be clever about the in-memory encoding to save space, but verbatim all the codes take under 300 kB on disk, so a few MB at most in memory. That is a small price to pay for people not having the left kidney removed instead of the right one.
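A minimal sketch of the set-lookup idea, assuming the full code list has already been loaded from somewhere (the loading step is left out, and the class and method names are hypothetical). The handful of codes in main is only a stand-in taken from the example above, not a real code table, and dash ranges are not handled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical validator: a code is valid only if it is in the known set.
final class CodeSetValidator {
    private final Set<String> validCodes;

    CodeSetValidator(Collection<String> codes) {
        this.validCodes = new HashSet<>(codes);
    }

    // Splits the input on commas and returns every segment that is not a known code.
    List<String> unknownCodes(String input) {
        List<String> unknown = new ArrayList<>();
        for (String segment : input.split(",")) {
            String code = segment.trim();
            if (!validCodes.contains(code)) {
                unknown.add(code);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        // Stand-in data only; a real application would load the complete list.
        CodeSetValidator validator = new CodeSetValidator(
            Arrays.asList("O09.7", "O09.70", "O09.71", "O09.72", "O09.73"));
        System.out.println(validator.unknownCodes("O09.70,O09.1,O09.73"));
    }
}
```

An empty result means every segment was found in the set; anything else tells the user exactly which entries were rejected.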

Text classifier with word splitting using StanfordNLP classifier

After a quite successful start with StanfordNLP (and the German module), I tried classifying numerical data, which also gave good results.
Finally, I tried to set up a classifier for categorizing text documents (both mails and scanned documents), but this was quite frustrating. I want the classifier to work on a word basis, not with n-grams. My training file has two columns: the first holds the category of the text, the second the text itself, without tabs or line breaks.
The properties file has the following content:
1.splitWordsWithPTBTokenizer=true
1.splitWordsRegexp=false
1.splitWordsTokenizerRegexp=false
1.useSplitWords=true
But when I start training the classifier like this...
ColumnDataClassifier cdc = new ColumnDataClassifier("classifier.properties");
Classifier<String, String> classifier =
cdc.makeClassifier(cdc.readTrainingExamples("data.train"));
...then I get many lines starting with the following hint:
[main] INFO edu.stanford.nlp.classify.ColumnDataClassifier - Warning: regexpTokenize pattern false didn't match on
My questions are:
1) Any idea what is wrong with my properties? I think, my training file is okay.
2) I want to use the words/tokens that I got from CoreNLP with the german model. Is this possible?
Thanks for any answers!
The numbering is correct, you don't have to put 2's in the beginning of the lines, as one other answer states. 1 stands for first data column, not for the first column in general in your training file (which is the category). Options with a 2. in the beginning would be for the second data column, or the third column in general in your training file - which you don't have.
I don't know about using the words/tokens you got from CoreNLP, but it also took me a while to find out how to use word n-grams, so maybe for some people this will be helpful:
# regex for splitting on whitespaces
1.splitWordsRegexp=\\s+
# enable word n-grams, just like character n-grams are used
1.useSplitWordNGrams=true
# range of values of n for your n-grams. (1-grams to 4-grams in this example)
1.minWordNGramLeng=1
1.maxWordNGramLeng=4
# use word 1-grams (just single words as features), obsolete if you're using
# useSplitWordNGrams with minWordNGramLeng=1
1.useSplitWords=true
# use adjacent word 2-grams, obsolete if you're using
# useSplitWordNGrams with minWordNGramLeng<=2 and maxWordNGramLeng>=2
1.useSplitWordPairs=true
# use word 2-grams in every possible combination, not just adjacent words
1.useAllSplitWordPairs=true
# same as the pairs but 3-grams, also not just adjacent words
1.useAllSplitWordTriples=true
For more information, have a look at http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/classify/ColumnDataClassifier.html
You are saying your training file has two columns, first with the category of the text, second with the text itself. Based on this, your properties file is incorrect, because you are adding rules to the first column there.
Modify your properties to be applied to the column where the text is as follows:
2.splitWordsWithPTBTokenizer=true
2.splitWordsRegexp=false
2.splitWordsTokenizerRegexp=false
2.useSplitWords=true
Furthermore, I would suggest working through the Software/Classifier/20 Newsgroups wiki, which shows some practical examples of how to work with the Stanford Classifier and how to set options through the properties file.

Does space affect the result in regex - java

I'm using regex to define a set of rules that extract specific information from unstructured resumes.
This information is:
Company that the applicant worked at or still works at
Role (designation), e.g. software engineer
Date (From-To)
Every applicant writes his/her employment details in his/her own way. However, some resumes share a common style, for example:
2012- 2014.Dean of the Faculty of Engineering Information Technology/
University Name.
So I defined a regex to extract the needed information.
Here is my regex:
(^[0-9]{4})(-|–|.|_|to) ([0-9]{4})(.*) (of the|at|in) (.*).
This regex was able to extract the information from the above example:
role:Dean
company: Faculty of Engineering Information Technology/University Name.
date from: 2012 to :2014
loyalty: 2 years // this depends on the extracted date
But I have another sample from another resume that has the same writing style:
1996-1997, Lecturer in Computer Science Department, Jerusalem open
university.
It should match, but it didn't until I removed the space in the regex; then it was able to extract the data.
My question is: does a space affect a regex?
And how can I fix this so that it extracts the data from both resumes regardless of the space in the regex rule?
Does a space affect a regex?
You have determined for yourself that it does. Space characters are not regex metacharacters, unless you enable the COMMENTS option in your pattern. Ordinarily, they stand for themselves, just like most other characters.
How can I fix this so that it extracts the data from both resumes regardless of the space in the regex rule?
You can apply quantifiers such as ? or * to space characters in your regex, just like you can to any other character or group. So, for example, you might use
(^[0-9]{4})(-|–|.|_|to) *([0-9]{4})(.*) (of the|at|in) (.*).
Do consider also that you might sometimes have to deal with tab characters, too. You can use the escape sequence \s to match any single whitespace character, whether it be a space, a tab, a newline, or any other character Java recognizes as whitespace.
You can allow an optional amount of whitespace by using \\s* instead of a space. \\s means a whitespace character, and * means zero or more.
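As an illustration, the sketch below compares a literal space with \\s* on the date part of the two resume lines. It is a cut-down version of the question's regex, not the full rule: only the two years are captured, the en dash is written as \\u2013, and the unescaped . alternative is dropped because an unescaped dot matches any character. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper comparing a literal space with \s* after the separator.
final class ResumeDateRange {
    // Requires exactly one space after the separator, as in the question.
    static final Pattern STRICT =
        Pattern.compile("^([0-9]{4})(?:-|\\u2013|_|to) ([0-9]{4})");

    // Allows any amount of whitespace (including none) around the separator.
    static final Pattern FLEXIBLE =
        Pattern.compile("^([0-9]{4})\\s*(?:-|\\u2013|_|to)\\s*([0-9]{4})");

    // Returns {from, to} if the line starts with a year range, otherwise null.
    static String[] years(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String first = "2012- 2014.Dean of the Faculty of Engineering Information Technology/";
        String second = "1996-1997, Lecturer in Computer Science Department, Jerusalem open";
        System.out.println("strict, first line matched: " + (years(STRICT, first) != null));
        System.out.println("strict, second line matched: " + (years(STRICT, second) != null));
        System.out.println("flexible, first line matched: " + (years(FLEXIBLE, first) != null));
        System.out.println("flexible, second line matched: " + (years(FLEXIBLE, second) != null));
    }
}
```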

Regular expression, excluding .. in suffix of email addy [duplicate]

This is homework, I've been working on it for a while, I've done lots of reading and feel I have gotten pretty familiar with regex for a beginner.
I am trying to find a regular expression for validating/invalidating a list of emails. There are two addresses which are giving me problems, I can't get them both to validate the correct way at the same time. I've gone through a dozen different expressions that work for all the other emails on the list but I can't get those two at the same time.
First, the addresses.
me@example..com - invalid
someone.nothere@1.0.0.127 - valid
The part of my expression which validates the suffix
I originally started with
@.+\\.[[a-z]0-9]+
And had a second pattern for checking some more invalid addresses and checked the email against both patterns; one checked for validity, the other for invalidity. But my professor said he wanted it all in one expression.
@[[\\w]+\\.[\\w]+]+
or
@[\\w]+\\.[\\w]+
I've tried it written many, many different ways but I'm pretty sure I was just using different syntax to express these two expressions.
I know what I want it to do, I want it to match a character class of "character+"."character+"+
The plus sign being at least one. It works for the invalid class when I only allow the character class to repeat one time (and obviously the IP doesn't get matched), but when I allow the character class to repeat itself it matches the second period even though it isn't preceded by a character. I don't understand why.
I've even tried grouping everything with () and putting {1} after the escaped . and changing the \w to a-z and replacing + with {1,}; nothing seems to require the period to be surrounded by characters.
You need a negative look-ahead:
@\w+\.(?!\.)
See http://www.regular-expressions.info/lookaround.html
Test in Perl:
Perl> $_ = 'someone.nothere@1.0.0.127'
someone.nothere@1.0.0.127
Perl> print "OK\n" if /\@\w+\.(?!\.)/
OK
1
Perl> $_ = 'me@example..com'
me@example..com
Perl> print "OK\n" if /\@\w+\.(?!\.)/
Perl>
@([\\w]+\\.)+[\\w]+
Matches at least one word character, followed by a '.'. This is repeated at least once, and is then followed by at least one more word character.
I think you want this:
@[\\w]+(\\.[\\w]+)+
This matches a "word" followed by one or more "." "word" sequences. (You can also do the grouping the other way around; e.g. see Dailin's answer.)
The problem with what you were doing before is that you were trying to embed a repeat inside a character class. That doesn't make sense, and there is no syntax that would support it. A character class defines a set of characters and matches against one character. Nothing more.
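The difference between the grouped form and the character-class form can be checked with a short Java sketch. It uses the two sample addresses from the question, written with the usual @ separator; the class and method names are invented, and this only tests the domain part, not full email validity.

```java
import java.util.regex.Pattern;

// Hypothetical check of the domain suffix only.
final class DomainDots {
    // Grouped form: a word, then one or more ".word" sequences, up to the end.
    static final Pattern GROUPED = Pattern.compile("@\\w+(\\.\\w+)+$");

    // Character-class form from the question: the inner + and . become plain
    // members of the set, so any run of word characters, '+' and '.' matches.
    static final Pattern CHAR_CLASS = Pattern.compile("@[[\\w]+\\.[\\w]+]+$");

    static boolean found(Pattern pattern, String address) {
        return pattern.matcher(address).find();
    }

    public static void main(String[] args) {
        for (String address : new String[] { "me@example..com", "someone.nothere@1.0.0.127" }) {
            System.out.println(address
                + " grouped=" + found(GROUPED, address)
                + " charClass=" + found(CHAR_CLASS, address));
        }
    }
}
```

The grouped pattern rejects the double dot because every "." must be followed by at least one word character, while the character-class pattern accepts it.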
The official standard, RFC 2822, describes the syntax that valid email addresses must adhere to. It can be implemented with this regular expression:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
A more practical implementation of RFC 2822 (if we omit the syntax using double quotes and square brackets), which will still match 99.99% of all email addresses in actual use today, is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
