How to conduct nested named entity recognition in OpenNLP? - java

I am currently working on a Java web server project that requires the use of Natural Language Processing, specifically Named Entity Recognition (NER).
I have been using OpenNLP for Java, since it is easy to add custom training data, and it works perfectly.
However, I also need to be able to extract entities inside other entities (nested named entity recognition). I tried doing this in OpenNLP, but I got parsing errors, so my guess is that OpenNLP sadly does not support nested entities.
Here is an example of what I need to parse:
Remind me to [START:reminder] give some presents to [START:contact] John [END] and [START:contact] Charlie [END][END].
If this cannot be achieved with OpenNLP, is there any other Java NLP library that can do this? If there are no Java libraries at all, are there any NLP libraries in any other language that can?
Please help. Thanks!

The short answer is:
This cannot be achieved with OpenNLP's NER, which is suitable only for continuous, non-nested entities because it uses a BIO tagging scheme.
I don't know of any library, in any language, capable of doing this.
I think you are stretching the concept of an entity too far: it is habitually associated with persons, places, organizations, gene names, etc.,
but not with the identification of complex structures within text.
For that purpose you need a more elaborate solution that takes the grammatical structure of the sentence into account, which can be obtained with a parser such as the one in OpenNLP, and perhaps combines it with the output of the NER process.
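For illustration, here is a minimal sketch of that combination, assuming OpenNLP 1.8+ with the stock tokenizer and chunking-parser models plus a custom "contact" name-finder model (all the model file names below are placeholders, not something OpenNLP ships): the name finder marks the inner contact entities, and the parse tree gives you the larger constituent that could serve as the outer "reminder" span.

import java.io.FileInputStream;
import java.util.Arrays;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class NestedNerSketch {
    public static void main(String[] args) throws Exception {
        String sentence = "Remind me to give some presents to John and Charlie .";

        // Inner entities: a custom "contact" model (placeholder file name).
        TokenizerME tokenizer =
            new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")));
        NameFinderME contactFinder =
            new NameFinderME(new TokenNameFinderModel(new FileInputStream("en-ner-contact.bin")));

        String[] tokens = tokenizer.tokenize(sentence);
        Span[] contacts = contactFinder.find(tokens);

        // Outer structure: constituency parse of the same sentence.
        Parser parser =
            ParserFactory.create(new ParserModel(new FileInputStream("en-parser-chunking.bin")));
        Parse[] parses = ParserTool.parseLine(sentence, parser, 1);

        // A real solution would walk the parse tree and pick the VP/S node
        // that covers the contact spans as the outer "reminder" entity.
        parses[0].show();
        for (Span c : contacts) {
            System.out.println("contact: "
                + String.join(" ", Arrays.copyOfRange(tokens, c.getStart(), c.getEnd())));
        }
    }
}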

For the purpose of Named Entity Recognition (Java based) I use the following:
Apache UIMA
ClearTK
https://github.com/merishav/cleartk-tutorials
You can train models for your use case; I have already trained NER models for person, place, date of birth, and profession.
ClearTK gives you a wrapper around MalletCRFClassifier.

Use this Python 3 script: https://gist.github.com/ttpro1995/cd8c60cfc72416a02713bb93dff9ae6f
It creates multiple un-nested versions of the nested data for you.
For the input sentence below (the input data must be tokenized first, so there are spaces between the tokens and around the tags):
Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> .
It outputs multiple sentences, one per nesting level:
Remind me to give some presents to John and Charlie .
Remind me to <START:reminder> give some presents to John and Charlie <END> .
Remind me to give some presents to <START:contact> John <END> and <START:contact> Charlie <END> .
Full source code here for quick copy-paste:
import sys

END_TAG = 0
START_TAG = 1
NOT_TAG = -1


def detect_tag(in_token):
    """
    Detect whether a token is a <START:...> tag, an <END> tag, or a plain token.
    :param in_token: a single whitespace-separated token
    :return: START_TAG, END_TAG or NOT_TAG
    """
    if "<START:" in in_token:
        return START_TAG
    elif "<END>" == in_token:
        return END_TAG
    return NOT_TAG


def remove_nest_tag(in_str):
    """
    Un-nest a tagged sentence, e.g. (a Vietnamese example with a LOCATION nested inside an ORGANIZATION)
    <START:ORGANIZATION> Sở Cảnh sát Phòng cháy , chữa cháy ( PCCC ) và cứu nạn , cứu hộ <START:LOCATION> Hà Nội <END> <END>
    :param in_str: a tokenized sentence with (possibly nested) <START:...>/<END> tags
    :return: a list of sentences, one per nesting level
    """
    state = 0
    taglist = []
    tag_dict = dict()
    sentence_token = in_str.split()
    # detect the nesting level of every tag token
    max_nest = 0
    for index, token in enumerate(sentence_token):
        tag = detect_tag(token)
        if tag > 0:  # <START:...> opens a nesting level
            state += 1
            if max_nest < state:
                max_nest = state
            token_info = (index, state, token)
            taglist.append(token_info)
            tag_dict[index] = token_info
        elif tag == 0:  # <END> closes the current level
            token_info = (index, state, token)
            taglist.append(token_info)
            tag_dict[index] = token_info
            state -= 1

    # generate one sentence per nesting level (level 0 = no tags at all)
    generate_sentences = []
    for state in range(max_nest + 1):
        generate_sentence_token = []
        for index, token in enumerate(sentence_token):
            if detect_tag(token) >= 0:  # it is a tag: keep it only on its own level
                token_info = tag_dict[index]
                if token_info[1] == state:
                    generate_sentence_token.append(token)
            elif detect_tag(token) == -1:  # not a tag: always keep it
                generate_sentence_token.append(token)
        sentence = ' '.join(generate_sentence_token)
        generate_sentences.append(sentence)
    # print(taglist)  # debug
    return generate_sentences


def test():
    tstr2 = "Remind me to <START:reminder> give some presents to <START:contact> John <END> and <START:contact> Charlie <END> <END> ."
    result = remove_nest_tag(tstr2)
    print("-----")
    for sentence in result:
        print(sentence)


if __name__ == "__main__":
    # un-nest a dataset for the OpenNLP name finder
    # test()
    if len(sys.argv) > 1:
        inpath = sys.argv[1]
        infile = open(inpath, 'r')
        outfile = open(inpath + ".out", 'w')
        for line in infile:
            sentences = remove_nest_tag(line)
            for sentence in sentences:
                outfile.write(sentence + "\n")
        outfile.close()
    else:
        print("usage: python unnest_data.py input.txt")

Related

Drools rules in xtext format

I am new to Xtext and I want to use it to generate some code for Drools rules. I have the following problem: I don't know how to write the grammar so that I can have that $order in front of an Order(). I would really appreciate it if someone could show me how to handle this example.
This is what I have tried so far:
Model:
    declarations+=Declaration*;

Declaration:
    Rule;

State:
    name=ID
;

Rule:
    'rule' ruleDescription=STRING
    '#specification' specificationDescription=STRING
    'ruleflow-group' ruleflowDescription=STRING
    'when' when=[State|QualifiedName]
    'then' then=[State|QualifiedName];

QualifiedName: ID ('.' ID)*;

DolarSign: ('$' ID)*;
And here is the code for the rule:
rule "apply 10% discount to all items over US$ 100,00 in an order"
#specification "101"
ruleflow-group "All"
when
$order : Order(appliedBefore == null)
Order($name : /customer/name) from $order
$item : OrderItem( value > 100 ) from $order.items
then
System.out.println("10% applied" + $name);
end
You should be able to simply use:
'when $' when=[State|QualifiedName]

Dataflow GCS to BigQuery - How to output multiple rows per input?

Currently I am using the Google-provided gcs-text-to-bigquery template and feeding in a transform function to transform my JSONL file. The JSONL is pretty nested, and I wanted to be able to output multiple rows per row of the newline-delimited JSON by doing some transforms.
For example:
{'state': 'FL', 'metropolitan_counties': [{'name': 'miami dade', 'population': 100000}, {'name': 'county2', 'population': 100000}, …], 'rural_counties': [{'name': 'county1', 'population': 100000}, {'name': 'county2', 'population': 100000}, …], 'total_state_pop': 10000000, …}
There will obviously be more than two counties, and each state will have one of these lines. The output my boss wants is one row per county (state, county type, county name, county population, state population).
When I do the gcs-to-bq text transform, I end up getting only one line per state (so I'll get miami dade county from FL, and then whatever the first county is in my transform for the next state). I read a little bit, and I think this is because of the mapping in the template that expects one output per JSON line. It seems I can do a ParDo (DoFn? not sure what that is), or there is a similar option with beam.Map in Python. There is some business logic in the transforms (right now it's about 25 lines of code, as the JSON has more columns than I showed, but those are pretty simple).
Any suggestions on this? Data is coming in tonight/tomorrow, and there will be hundreds of thousands of rows in a BQ table.
The template I am using is currently in Java, but I can translate it to Python pretty easily, as there are a lot of examples online in Python. I know Python better, and I think it's easier given the different types (sometimes a field can be null); it also seems less daunting, since the examples I saw look simpler. However, I'm open to either.
Solving that in Python is somewhat straightforward. Here's one possibility (not fully tested):
from __future__ import absolute_import
import ast
import os

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service_account.json'

pipeline_args = [
    '--job_name=test'
]
pipeline_options = PipelineOptions(pipeline_args)


def jsonify(element):
    return ast.literal_eval(element)


def unnest(element):
    state = element.get('state')
    state_pop = element.get('total_state_pop')
    if state is None or state_pop is None:
        return
    for type_ in ['metropolitan_counties', 'rural_counties']:
        for e in element.get(type_, []):
            name = e.get('name')
            pop = e.get('population')
            county_type = (
                'Metropolitan' if type_ == 'metropolitan_counties' else 'Rural'
            )
            if name is None or pop is None:
                continue
            yield {
                'State': state,
                'County_Type': county_type,
                'County_Name': name,
                'County_Pop': pop,
                'State_Pop': state_pop
            }


with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromText('gs://url to file')
    schema = 'State:STRING,County_Type:STRING,County_Name:STRING,County_Pop:INTEGER,State_Pop:INTEGER'
    data = (
        lines
        | 'Jsonify' >> beam.Map(jsonify)
        | 'Unnest' >> beam.FlatMap(unnest)
        | 'Write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
            'project_id:dataset_id.table_name', schema=schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )
    )
This will only succeed if you are working with batch data. If you have streaming data, just change beam.io.Write(beam.io.BigQuerySink(...)) to beam.io.WriteToBigQuery.
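If you stay with the Java template instead, the ParDo mentioned in the question is exactly the right tool: a DoFn may call output() any number of times per input element, which is the Java counterpart of beam.FlatMap. A rough sketch under the assumption that each record has already been parsed into a Map (field names follow the schema above; wiring it into the template is left out):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;

import java.util.List;
import java.util.Map;

// Emits one TableRow per county for every input state record.
public class UnnestCountiesFn extends DoFn<Map<String, Object>, TableRow> {

    @ProcessElement
    public void processElement(ProcessContext c) {
        Map<String, Object> state = c.element();
        for (String type : new String[] {"metropolitan_counties", "rural_counties"}) {
            @SuppressWarnings("unchecked")
            List<Map<String, Object>> counties = (List<Map<String, Object>>) state.get(type);
            if (counties == null) {
                continue;
            }
            for (Map<String, Object> county : counties) {
                TableRow row = new TableRow()
                    .set("State", state.get("state"))
                    .set("County_Type", type.startsWith("metro") ? "Metropolitan" : "Rural")
                    .set("County_Name", county.get("name"))
                    .set("County_Pop", county.get("population"))
                    .set("State_Pop", state.get("total_state_pop"));
                c.output(row);  // multiple outputs per input element
            }
        }
    }
}

// usage inside the pipeline: records.apply("Unnest", ParDo.of(new UnnestCountiesFn()))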

Extracting required substring from a result retrieved from Wolfram Alpha with Java

I'm working on a Java program which takes a question from a user, sends it to the Wolfram Alpha API, and then cleans up the result and prints it.
If the user asks the question "Who is the President of the USA?", the result is as follows:
Response: <section><title>Input interpretation</title> <sectioncontents>United States | President</sectioncontents></section><section><title>Result</title><sectioncontents>Barack Obama (from 20/01/2009 to present)</sectioncontents></section><section><title>Basic information</title><sectioncontents>official position | President (44th)..........etc
I would like to extract "Barack Obama (from 20/01/2009 to present)".
I have been able to trim up to "Barack" using the code below:
String clean =response.substring(response.indexOf("Result") + 31 , response.length());
System.out.println("Response: " + clean);
How would I trim the rest of the result?
Well, in case it helps, I came up with this regex:
Result.+?>([^<]+?)<
After finding "Result" it captures the first instance of > and < with at least one character between them.
UPDATE
Below is some sample code that might be helpful:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String response = "Response: <section><title>...";
Pattern pattern = Pattern.compile("Result.+?>([^<]+?)<");
Matcher match = pattern.matcher(response);
String clean = "";
if (match.find()) {
    clean = match.group(1);
}
System.out.println(clean);
The response is essentially XML.
As has been discussed endlessly in many programming fora, regular expressions are not suitable for parsing XML - you should use an XML parser.
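For example, here is a minimal sketch using the JDK's built-in DOM and XPath APIs, assuming the full response is (or can be wrapped into) well-formed XML; since the <section> elements shown have no single root, the sketch wraps the string in one first:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class WolframResult {
    public static void main(String[] args) throws Exception {
        String response = "...";                       // the XML portion of the API response
        String xml = "<root>" + response + "</root>";  // give the sections a single root element

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Select the contents of the section whose title is "Result".
        XPath xpath = XPathFactory.newInstance().newXPath();
        String result = xpath.evaluate("//section[title='Result']/sectioncontents", doc);

        System.out.println(result);  // e.g. Barack Obama (from 20/01/2009 to present)
    }
}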

Java String formatting solution

I have a string description of a company, which is nastily written by different (hand-typing) users. Here is an example (focus on the dots, spaces, first letters, etc.):
XXXX is a Global menagement consulting,Technology services and
outsourcing company, with 257000people serving clients in more than
120 countries.. combining unparalleled experience, comprehensive
capabilities across all industries and business functions,and
extensive research on the worlds most successfull companies, XXXX
collaborates with clients to help them become high-performance
businesses and governments., the company generated net revenues of
US$27.9 Billion for the fiscal year ended 31.07.2012..
Now what I want is to format the string into a somewhat nicer version, like this:
XXXX is a global management consulting, technology services and
outsourcing company, with 257,000 people serving clients in more than
120 countries. Combining unparalleled experience, comprehensive
capabilities across all industries and business functions, and
extensive research on the world’s most successful companies, XXXX
collaborates with clients to help them become high-performance
businesses and governments. The company generated net revenues of
US$27.9 billion for the fiscal year ended Aug. 31, 2012.
My question is: is there any library with predefined methods that could do all the spelling corrections, unneeded space removal, etc.?
So far, I do it by replacing stuff like " ," with ", ", calling toUpperCase() if there is a "///." in front, etc.:
desc = desc.replace(" ", " ");
desc = desc.replace("..", ".");
desc = desc.replace(" .", ".");
desc = desc.replace(" ,", ", ");
desc = desc.replace(".,", ".");
desc = desc.replace(",.", ".");
desc = desc.replace(", .", ".");
desc = desc.replace("*", "");
I'm sure there is a cleaner and better way to do this. Using regex, maybe?
Any solution would be appreciated.
If I were trying to solve your problem, I would probably read the text one char at a time, and format it as you go. For example, in pseudocode...
while (has more chars) {
    char letter = readChar();
    if (letter == ',') {
        // checking for the ',.' combination
        letter = readChar();
        if (letter == '.') {
            // write out a '.' only
            out.print('.');
        }
        else {
            // it wasn't the ',.' combination, so you need to output both characters, whatever they are
            out.print(',');
            out.print(letter);
        }
    }
    else if (another letter you want to filter) {
        // etc.
    }
    else {
        // doesn't match any of the filters, so just output the letter
        out.print(letter);
    }
}
Basically if you read the text 1 char at a time, you can detect any of your chosen formatting problems as you go, and correct them immediately. This provides a performance improvement, as you're only reading over the text string once (not 8 times, like you are currently doing), and allows you to add as many different/complex formatting changes as you want. The downside, however, is that you need to write the logic yourself rather than relying on in-built functions.
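Here is a minimal runnable sketch of that single-pass idea in plain Java, implementing only a few of the rules from the question (extend the conditions as needed):

public class DescriptionCleaner {

    // Cleans a description one character at a time, applying simple rules:
    // collapse runs of spaces, drop '*', turn ".." into ".", turn " ," into ",".
    public static String clean(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char prev = out.length() > 0 ? out.charAt(out.length() - 1) : '\0';
            if (c == '*') {
                continue;                        // drop stray asterisks
            }
            if (c == ' ' && prev == ' ') {
                continue;                        // collapse multiple spaces
            }
            if (c == '.' && prev == '.') {
                continue;                        // ".." -> "."
            }
            if (c == ',' && prev == ' ') {
                out.setLength(out.length() - 1); // " ," -> ","
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("outsourcing company , with clients in more than 120 countries.."));
    }
}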

Extracting packed data using regular expressions

I have data in a database in the format below:
a:19:{s:9:"raceclass";a:5:{i:0;a:1:{i:0;s:7:"250cc B";}i:1;a:1:{i:1;s:6:"OPEN B";}i:2;a:1:{i:2;s:9:"Plus 25 B";}i:3;a:1:{i:3;s:8:"Vet 30 B";}i:4;a:1:{i:4;s:7:"Vintage";}}s:9:"firstname";a:1:{i:0;a:1:{i:0;s:5:"James";}}s:12:"middle_FIELD";a:1:{i:0;a:1:{i:0;s:1:"R";}}s:8:"lastname";a:1:{i:0;a:1:{i:0;s:9:"Slaughter";}}s:5:"email";a:1:{i:0;a:1:{i:0;s:29:"jslaughter#xtrememxseries.com";}}s:8:"address1";a:1:{i:0;a:1:{i:0;s:18:"21 DiMartino Court";}}s:4:"city";a:1:{i:0;a:1:{i:0;s:6:"Walden";}}s:5:"state";a:1:{i:0;a:1:{i:0;s:8:"New York";}}s:3:"zip";a:1:{i:0;a:1:{i:0;s:5:"12586";}}s:7:"country";a:1:{i:0;a:1:{i:0;s:13:"United States";}}s:6:"gender";a:1:{i:0;a:1:{i:0;s:4:"Male";}}s:3:"dob";a:1:{i:0;a:1:{i:0;s:10:"06/04/1974";}}s:5:"phone";a:1:{i:0;a:1:{i:0;s:12:"845-713-4421";}}s:5:"skill";a:1:{i:0;a:1:{i:0;s:12:" AMATEUR (B)";}}s:11:"ridernumber";a:1:{i:0;a:1:{i:0;s:2:"69";}}s:8:"bikemake";a:1:{i:0;a:1:{i:0;s:3:"HON";}}s:8:"enginecc";a:1:{i:0;a:1:{i:0;s:3:"450";}}s:9:"amanumber";a:1:{i:0;a:1:{i:0;s:7:"1094649";}}s:10:"amaexpdate";a:1:{i:0;a:1:{i:0;s:5:"03/12";}}}
How can I write a regular expression to manipulate the above string into the following format?
raceclass - 250cc B, OPEN B, Plus 25 B, Vet30, Vintage
firstname - James
middle_FIELD - R
address1 - 21 DiMartino Court
city - walden
state - New york
zip - 12586
country - United States
gender - Male
dob - 06/04/1974
phone - 845-713-4421
skill - AMATEUR (B)
ridernumber - 69
bikemake - HON
enginecc - 450
amanumber - 1094649
amaexpdate - 03/12
This data isn't suitable for a regular expression. You should use a proper parser with a proper grammar for handling this string. There are several good options for that in Java, such as ANTLR.
Alternatively, if that is not an option, it looks like you only want the things between double quotes. Take a look at the Java class Scanner. You should be able to get something working with that: just walk through the string and look for a ". If found, start gathering text into a buffer. Once you have found another ", ignore characters until you reach the next " or the end of the input text.
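For illustration, here is a minimal sketch of that quote-scanning idea; it only collects the quoted values in order, and pairing each field name with the values that follow it (as in the desired output) is left to the caller:

import java.util.ArrayList;
import java.util.List;

public class QuotedValueExtractor {

    // Collects every substring that appears between double quotes, in order.
    public static List<String> extractQuoted(String data) {
        List<String> values = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '"') {
                if (inQuotes) {
                    values.add(buffer.toString());
                    buffer.setLength(0);
                }
                inQuotes = !inQuotes;
            } else if (inQuotes) {
                buffer.append(c);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String data = "s:9:\"raceclass\";a:5:{i:0;a:1:{i:0;s:7:\"250cc B\";}";
        // prints [raceclass, 250cc B]
        System.out.println(extractQuoted(data));
    }
}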
