How to create a good NER training model in OpenNLP?

I have just started with OpenNLP. I need to create a simple training model to recognize named entities.
Reading the docs here https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this sample text used to train the model:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC ,
was named a director of this British industrial conglomerate .
I have two questions:
Why do I have to put the names of the persons in a text (sentence) context? Why not write one person's name per line, like this:
<START:person> Robert <END>
<START:person> Maria <END>
<START:person> John <END>
How can I also attach extra information to a name?
For example, I would like to store Male/Female for each name.
(I know there are systems that try to infer it from the last letter, like "a" for Female, but I would like to add it myself.)
Thanks.

The answer to your first question is that the algorithm works on the surrounding context (tokens) within a sentence; it is not a simple lookup mechanism. OpenNLP uses maximum entropy, a form of multinomial logistic regression, to build its model. The reason is to reduce "word sense ambiguity" and to find entities in context. For instance, the name April is easily confused with the month of April, and the name May with both the month of May and the verb may.
As for the second part of your first question: you could make a list of known names and use it in a program that scans your sentences and annotates them automatically, to help you create a training set. A list of names alone, without context, will not train the model sufficiently, if at all. In fact, there is an OpenNLP addon called the "modelbuilder addon" designed for this: you give it a file of names, and it uses the names and some of your data (sentences) to train a model. If you are looking for particular names of generally unambiguous entities, you may be better off just using a list and something like regex to find the names rather than NER.
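The list-based annotation idea above can be sketched in plain Java. This is a minimal sketch, not the modelbuilder addon: the class name NameAnnotator and the method annotate are made up for illustration, and it assumes the sentence is already tokenized with single spaces, as in the OpenNLP training format. Consecutive tokens found in the list are merged into one span.

```java
import java.util.Set;

class NameAnnotator {
    /**
     * Wraps runs of tokens found in knownNames in OpenNLP-style person tags.
     * Assumption: the sentence is whitespace-tokenized already.
     */
    static String annotate(String tokenizedSentence, Set<String> knownNames) {
        StringBuilder out = new StringBuilder();
        boolean insideName = false;
        for (String token : tokenizedSentence.trim().split("\\s+")) {
            boolean isName = knownNames.contains(token);
            if (isName && !insideName) {
                out.append("<START:person> ");   // a name span begins
            }
            if (!isName && insideName) {
                out.append("<END> ");            // the previous span ended
            }
            out.append(token).append(' ');
            insideName = isName;
        }
        if (insideName) {
            out.append("<END>");                 // sentence ended inside a name
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        Set<String> names = Set.of("Pierre", "Vinken", "Agnew");
        System.out.println(annotate("Mr . Vinken is chairman of Elsevier N.V. .", names));
    }
}
```

Because this is a pure lookup, it will also tag ambiguous tokens (April the month, for instance), so the generated annotations still need a manual review before they are used for training.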
As for your second question, there are a few options, but in general I don't think NER is a great tool for delineating something like gender, although with enough training sentences you may get decent results. Since NER uses a model based on the surrounding tokens in your training sentences to establish the existence of a named entity, it can't do much to identify gender. You may be better off finding all person names and then looking them up in an index of names that you know are male or female. Also, some names, like Pat, are both male and female, and in most textual data there will be no indication of which it is, for either human or machine. That being said, you could create separate male and female models, or you could create different entity types within the same model, using an annotation like the following (with the entity type names male.person and female.person). I've never tried this, but it might do OK; you'd have to test it on your data.
<START:male.person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mrs . <START:female.person> Maria <END> is chairman of Elsevier N.V. , the Dutch publishing group
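The alternative mentioned above, finding person names first and then consulting an index of known names, could look roughly like this. It is only a sketch: GenderLookup, the Gender enum and the index contents are hypothetical, and it assumes the first token of the entity is the given name.

```java
import java.util.Locale;
import java.util.Map;

class GenderLookup {
    enum Gender { MALE, FEMALE, UNKNOWN }

    /**
     * Looks up the first token of a recognized person name in a
     * caller-supplied index keyed by lower-case given name.
     * Names missing from the index (or ambiguous ones you leave out,
     * such as Pat) come back as UNKNOWN.
     */
    static Gender lookup(String personName, Map<String, Gender> index) {
        String givenName = personName.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
        return index.getOrDefault(givenName, Gender.UNKNOWN);
    }

    public static void main(String[] args) {
        Map<String, Gender> index = Map.of("pierre", Gender.MALE, "maria", Gender.FEMALE);
        System.out.println(lookup("Pierre Vinken", index));
        System.out.println(lookup("Pat Smith", index));
    }
}
```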
NER = Named Entity Recognition
HTH

Related

Compare street name

I use Java, Spring, iBATIS, and an Oracle database.
The database has a Street table with 10 million records; the important column is street_name.
From the GUI, I have to search for companies by street. For example, the street name entered is Schonburgstrasse, but the correct value in the DB is Schönburgstrasse (German).
The only difference is o versus ö, so this SQL will not find the record:
Select * from Street where street_name = 'Schonburgstrasse';
The rules are:
I can't change the database schema.
I can't fetch all 10M records, normalize them one by one, and then compare.
(Normalization means a function that converts, for example, Schönburgstrasse to Schonburgstrasse.)
I have to keep performance in mind.
Thanks for your time.

Try using the Oracle SOUNDEX function, so the query will look like this:
Select * from Street where soundex(street_name) = soundex('Schonburgstrasse');

Oracle Text provides extensive capabilities for handling umlauts etc. In short:
create a fulltext index on your column (using a custom lexer)
search with the contains() operator instead of like
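For reference, the application-side normalization the question describes (ö to o) can be sketched in Java with the standard java.text.Normalizer. The class and method names are made up for illustration. On its own this only normalizes the search input; it does not solve the comparison against un-normalized values in the database, which is what the answers above address.

```java
import java.text.Normalizer;

class DiacriticFolder {
    /**
     * Decomposes accented characters (NFD) and drops the combining marks,
     * so that "ö" becomes "o". Characters without a decomposition,
     * such as "ß", are left unchanged.
     */
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Sch\u00f6nburgstrasse"));
    }
}
```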

Using Stanford NER for extracting Address from a text document?

I was looking at Stanford NER and am thinking of using its Java APIs to extract a postal address from a text document. The document may be any document that has a postal address section, e.g. utility bills or electricity bills.
The approach I have in mind is:
Define postal address as a named entity using LOCATION and other primitive named entities.
Define segmentation and other sub-processes.
I am trying to find an example pipeline for this (what steps are required, in detail). Has anyone done this before? Suggestions welcome.

To be clear: all credit goes to Raj Vardhan (and John Bauer) who had an interaction on the [java-nlp-user] mailing list.
Raj Vardhan wrote about the plan to work on "finding street address in a sentence":
Here is an approach I have thought of:
Find the event-anchor in a sentence
Select outgoing-edges in the SemanticGraph from that event-node with relations such as "prep-in" or "prep-at".
IF the dependent value in the relation has POS tag as NNP
a) Find outgoing-edges from dependent value's node with relations such as "nn"
b) Connect all such nodes in increasing order of occurrence in the sentence.
c) PRINT resulting value as Location where the event occurred
This is obviously with certain assumptions such as direct dependency between the event-anchor and location in a sentence.
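The quoted steps can be illustrated without the Stanford libraries. The sketch below uses hypothetical Token and Edge classes as stand-ins for nodes and edges of a dependency graph; it is not the CoreNLP SemanticGraph API. The relation names are copied from the quoted post, and the event-anchor is assumed to have been found already.

```java
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

class LocationSketch {
    /** Hypothetical stand-in for a parsed token: sentence position, word, POS tag. */
    static final class Token {
        final int index; final String word; final String pos;
        Token(int index, String word, String pos) {
            this.index = index; this.word = word; this.pos = pos;
        }
    }

    /** Hypothetical stand-in for one dependency edge. */
    static final class Edge {
        final Token governor; final Token dependent; final String relation;
        Edge(Token governor, Token dependent, String relation) {
            this.governor = governor; this.dependent = dependent; this.relation = relation;
        }
    }

    static Optional<String> findLocation(Token anchor, List<Edge> edges) {
        for (Edge e : edges) {
            boolean locative = e.relation.equals("prep-in") || e.relation.equals("prep-at");
            if (e.governor == anchor && locative && e.dependent.pos.equals("NNP")) {
                // Collect the head noun and its "nn" modifiers, ordered by position.
                TreeMap<Integer, String> parts = new TreeMap<>();
                parts.put(e.dependent.index, e.dependent.word);
                for (Edge m : edges) {
                    if (m.governor == e.dependent && m.relation.equals("nn")) {
                        parts.put(m.dependent.index, m.dependent.word);
                    }
                }
                return Optional.of(String.join(" ", parts.values()));
            }
        }
        return Optional.empty();   // no direct locative edge from the anchor
    }
}
```

As the quoted post says, this only works when the location hangs directly off the event-anchor; a multi-line postal address on a bill would not be found this way.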
Not sure whether this could help you, but I wanted to mention it just in case. Again, any credit should go to Raj Vardhan (and John Bauer).

Data retrieval / search in text

I am working on a personal project on data retrieval, for my own interest. I have one text file with the following format:
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...
".I 1" indicate the beginning of chunk of text corresponding to doc ID1 and ".I 2" indicates the beginning of chunk of text corresponding to doc ID2.
I did the following:
split the docs and put them in separate files
deleted stopwords (and, or, while, is, are, ...)
stemmed the words to get the root of each (achievement, achieve, achievable, ... all converted to achiv, and so on)
and finally created a TreeMultimap which looks like this:
{key: word} {values: a list of docIDs, each with the frequency of that word in that doc}
aerodynam [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
book [[Doc_00562,6],[Doc_01111,1]]
....
....
result [[Doc_00010,5]]
....
....
zzzz [[Doc_01235,1]]
Now my questions:
Suppose the user wants to know:
which documents contain achieving and book? (idea)
documents which contain achieving and skills but neither book nor video
documents that include Aerodynamic
and some other simple queries like these
(Input) So suppose she enters:
achieving AND book
(achieving AND skills) AND (NOT (book AND video))
Aerodynamic
.....and some other simple queries
(Output)
[Doc_00562,6],[Doc_01121,5],[Doc_01151,3],[Doc_00012,2],[Doc_00001,1]
....
As you can see, there might be:
precedence modifiers (parentheses nested to an unknown depth)
precedence among AND, OR, NOT
and some other interesting challenges and issues
So I would like to run the queries against the TreeMultimap, search the words (keys), and return the values (lists of docs) to the user.
How should I think about this problem and design my solution? What articles or algorithms should I read? Any ideas would be appreciated. (Thanks for reading this long post.)

The collection you are using is the Cranfield test collection, which has about 1,400 documents. For a collection of this size it is fine to keep the inverted list (the data structure you have constructed) in memory with a hash-based or trie-based organization. For realistic collections, which often comprise millions of documents, it becomes difficult to hold the inverted list entirely in memory.
Instead of reinventing the wheel, the practical solution is to use a standard text indexing (and retrieval) framework such as Lucene. This tutorial should help you get started.
The questions you want to answer can be expressed as Boolean queries, in which the operators AND, OR and NOT are specified between the constituent terms. Lucene supports this. Have a look at the API doc here and a related StackOverflow question here.
The Boolean query retrieval algorithm is very simple. The list elements (i.e. the document ids) for each term are stored in sorted order, so that at run time the union or intersection of two lists can be computed in time linear in their sizes, i.e. O(n1+n2). This is very similar to the merge step of mergesort.
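The linear-time merge can be sketched as follows, assuming each postings list holds numeric document ids in ascending order without duplicates. The class name PostingsMerge is made up for illustration. Note that the lists shown in the question are ordered by frequency, so they would have to be re-sorted by document id before this applies.

```java
import java.util.ArrayList;
import java.util.List;

class PostingsMerge {
    /** AND: ids present in both sorted lists, in O(n1 + n2). */
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) {          // same doc in both lists
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {    // advance the list with the smaller id
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    /** OR: ids present in either sorted list, in O(n1 + n2). */
    static List<Integer> union(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) {
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));   // copy whatever remains
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}
```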
You can find more information in this book chapter.
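For a collection small enough to stay in memory, the nested queries from the question could also be evaluated directly against the inverted index. The following is a minimal recursive-descent sketch, not something the answer above prescribes: BooleanQuerySketch is a hypothetical class, the index is simplified to term -> set of doc ids (frequencies dropped), NOT binds tighter than AND, AND tighter than OR, and query terms are assumed to be stemmed by the caller the same way the index keys were.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BooleanQuerySketch {
    private final Map<String, Set<String>> index; // term -> doc ids
    private final Set<String> allDocs;            // universe, needed for NOT
    private List<String> tokens;
    private int pos;

    BooleanQuerySketch(Map<String, Set<String>> index, Set<String> allDocs) {
        this.index = index;
        this.allDocs = allDocs;
    }

    Set<String> evaluate(String query) {
        tokens = new ArrayList<>();
        String spaced = query.replace("(", " ( ").replace(")", " ) ").trim();
        for (String t : spaced.split("\\s+")) tokens.add(t);
        pos = 0;
        Set<String> result = parseOr();
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + tokens.get(pos));
        }
        return result;
    }

    // or := and (OR and)*
    private Set<String> parseOr() {
        Set<String> left = parseAnd();
        while (peekIs("OR")) { pos++; left.addAll(parseAnd()); }
        return left;
    }

    // and := not (AND not)*
    private Set<String> parseAnd() {
        Set<String> left = parseNot();
        while (peekIs("AND")) { pos++; left.retainAll(parseNot()); }
        return left;
    }

    // not := NOT not | '(' or ')' | term
    private Set<String> parseNot() {
        if (peekIs("NOT")) {
            pos++;
            Set<String> result = new TreeSet<>(allDocs);
            result.removeAll(parseNot());
            return result;
        }
        if (peekIs("(")) {
            pos++;
            Set<String> inner = parseOr();   // recursion handles any nesting depth
            if (!peekIs(")")) throw new IllegalArgumentException("Missing )");
            pos++;
            return inner;
        }
        if (pos >= tokens.size()) throw new IllegalArgumentException("Unexpected end of query");
        String term = tokens.get(pos++).toLowerCase(Locale.ROOT);
        // Copy, so that AND/OR never modify the index itself.
        return new TreeSet<>(index.getOrDefault(term, Set.of()));
    }

    private boolean peekIs(String s) {
        return pos < tokens.size() && tokens.get(pos).equals(s);
    }
}
```

Ranking the result by frequency, as in the question's sample output, would be a separate step after the matching set is known.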

Find matching records with least characters from Pattern - Oracle / Java

The web application I am currently working on has file import logic. The logic:
1> reads the records from a file [Excel or txt],
2> shows a non-editable grid of all the imported records [new records are marked as New if they do not exist in the database, and existing records are marked as Update] and
3> dumps the records into the database.
The file contains contacts in the following format (it mirrors the columns in the database, with First_Name and Last_Name as the primary key):
First_Name, Last_Name, AddressLine1, AddressLine2, City, State, Zipcode
The issue we are running into arises when different values are entered in the file for the same entity. For example, someone might type NY for New York while others put in New York. The same applies to first or last names: John Myers and John Myer refer to the same person, but because the record does not match exactly, the system inserts a new record rather than updating the existing one.
For example, take this record from the file (please note that any resemblance of the name and address to real ones is purely coincidental :) ):
John, Myers, 44 Chestnut Hill, Apt 5, Indiana, Indiana, 11111
and the record in the database:
John, Myer, 80 Washington St, Apt 1, Chicago, IL, 3333
the system should have detected the record in the file as an existing record [because the last names are Myers and Myer and the first name matches completely] and updated the address, but instead it inserts a new record.
How can I approach this, so that I find all existing records in the database that the imported record should match?

This is a very difficult problem to solve. If you know the sources of your data, you could attempt to manually rectify the different combinations of data input.
Otherwise, you could try phonetic data cleaning solutions.

One solution I could think of is using regex in Oracle to achieve the functionality up to some extent.
For each column, I would generate a regex from the input string in which the characters from the halfway point onward are optional. For example, for the name "Myers" in the file and "Myer" in the database, the following query would work:
SELECT Last_Name from Contacts WHERE (Last_Name IS NULL OR Regexp_Like(Last_Name, '^Mye?r?s?$'))
I would consider this a partial solution: I would parse the input string and append the zero-or-one operator (?) to each character from half the length to the end of the string, hoping the input string is not too messed up.
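Generating such a pattern from the input could look like this in Java. It is a sketch under assumptions: HalfOptionalRegex and build are made-up names, and input is restricted to ASCII letters and digits so that no regex metacharacter needs escaping.

```java
class HalfOptionalRegex {
    /**
     * Builds an anchored pattern in which every character from the
     * halfway point onward is followed by the zero-or-one operator.
     */
    static String build(String input) {
        if (!input.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("Only letters and digits are supported: " + input);
        }
        int half = input.length() / 2;
        StringBuilder pattern = new StringBuilder("^");
        for (int i = 0; i < input.length(); i++) {
            pattern.append(input.charAt(i));
            if (i >= half) {
                pattern.append('?');   // this character may be missing in the stored value
            }
        }
        return pattern.append('$').toString();
    }

    public static void main(String[] args) {
        System.out.println(build("Myer"));
        System.out.println(build("Myers"));
    }
}
```

For "Myer" this yields ^Mye?r?$ and for "Myers" it yields ^Mye?r?s?$, which matches a stored "Myer". The pattern only tolerates characters missing from the second half of the stored value; a stored value that is longer than the input, or that differs in its first half, is still missed.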
Hoping to find some feedback from others on SO for this "solution".

Searching multiple fields with Lucene

I'm having some trouble with a search I'm trying to implement. I need for a user to be able to enter a search query into a web interface and for the back-end Java to search for the query in a number of fields. An example of this might be best:
Say I have a List containing "Person" objects. Say each object holds two String fields about the person:
FirstName: Jack
Surname: Smith
FirstName: Mary
Surname: Jackson
If a user enters "jack", I need the search to match both objects, the first on FirstName and the second on Surname.
I've been looking at using a MultiFieldQueryParser but can't get the fields set up right. Any help on this or pointing to a good tutorial would be greatly appreciated.

MultiFieldQueryParser is what you want, as you say.
Make sure:
The field names are always used consistently
The same Analyzer is used on both fields, and also on the query parser
You won't find partial words by default, so if you search for jack you won't find jackson. (You can search for jack* in that case.)
Regarding field names, I always set up an enum for them and use e.g. MyFieldEnum.firstname.name() when passing field names to Lucene, so that the compiler catches any spelling mistake. The enum is also a good place to put Javadoc explaining what each field is for, and it shows the complete list of fields you support in your Lucene documents.
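A minimal sketch of that enum approach, using the two fields from the question. The names FieldNames, PersonField and allFieldNames are made up for illustration; the resulting String[] is the kind of field list that MultiFieldQueryParser takes, but check the constructor against your Lucene version.

```java
import java.util.Arrays;

class FieldNames {
    /** Fields stored in each Lucene document for a Person (hypothetical). */
    enum PersonField {
        /** The person's given name. */
        firstname,
        /** The person's family name. */
        surname
    }

    /** Returns every field name, e.g. to hand to a multi-field query parser. */
    static String[] allFieldNames() {
        return Arrays.stream(PersonField.values())
                .map(Enum::name)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Use the same constant when indexing and when searching.
        System.out.println(PersonField.firstname.name());
        System.out.println(Arrays.toString(allFieldNames()));
    }
}
```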
