Using Stanford NER for extracting Address from a text document? - java

I was looking at Stanford NER and thinking of using its Java APIs to extract a postal address from a text document. The document may be any document that has a postal address section, e.g. utility bills or electricity bills.
The approach I have in mind is:
Define postal address as a named entity using LOCATION and other primitive named entities.
Define segmentation and other sub-processes.
I am trying to find an example pipeline for this (which steps are required, in detail). Has anyone done this before? Suggestions welcome.

To be clear: all credit goes to Raj Vardhan (and John Bauer) who had an interaction on the [java-nlp-user] mailing list.
Raj Vardhan wrote about the plan to work on "finding street address in a sentence":
Here is an approach I have thought of:
Find the event-anchor in a sentence
Select outgoing-edges in the SemanticGraph from that event-node with relations such as "prep-in" or "prep-at".
IF the dependent value in the relation has POS tag as NNP
a) Find outgoing-edges from dependent value's node with relations such as "nn"
b) Connect all such nodes in increasing order of occurrence in the sentence.
c) PRINT resulting value as Location where the event occurred
This is obviously with certain assumptions such as direct dependency between the event-anchor and location in a sentence.
Not sure whether this could help you, but I wanted to mention it just in case. Again, any credit should go to Raj Vardhan (and John Bauer).

Related

Semi Natural language Search using Apache Solr

I did some analysis on Apache Solr and it is pretty good at searching data from various sources.
The problem I am facing is how to standardize my search grammar and translate the search text into a Solr query.
I have three types of file/database table to search: Customer, Industry and Unit. The first keyword in the search box should be one of these three. After that, the user can define a fixed set of criteria:
Metrics : 0 or many (ex, exposure, income, revenue, loan_amt etc)
Dimension : 0 or many (Geography, region, etc)
Example:
customer - Returns all customer data from customer core
customer income from Asia - Returns income details of all customers who belong to Asia
customer income revenue from Asia - Returns income and revenue details of all customers who belong to Asia
How can I translate the above natural language search text into a Solr query?
Can I fix the grammar of my search text in Solr, like this:
the first keyword should be customer/industry/unit,
the second key-value would be one or more region/geography
and then the metric values.
I am not looking for a Google-like search but a limited search where the user knows what to search for.
This doesn't seem to be a Solr question, strictly speaking. As a first step, you might want to define a context-free grammar (CFG, type-2 grammar) based on specific production rules for your input. This would give you some solid syntax rules to work from. Based on this, you can then create a parser for the natural language input and map the resulting parse tree to the keyword search in Solr.
To avoid getting sucked into the question answering domain of NLP, which is considered one of the hardest, try to define a fixed syntax for your questions, for instance "X in Y with Z", where X is an entity such as Customer, Y is a geolocation and Z is a filter.
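The fixed-syntax idea from the two answers above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name SearchGrammar, the field name "region", the rule that the first keyword names the Solr core, and the single-word region are all hypothetical and not taken from the question's actual schema. The returned string is not URL-encoded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical translator for inputs of the form:
//   <entity> [metric ...] [from <region>]
class SearchGrammar {
    // Assumed vocabularies; replace with the real entity and metric names.
    static final Set<String> ENTITIES =
            new HashSet<>(Arrays.asList("customer", "industry", "unit"));
    static final Set<String> METRICS =
            new HashSet<>(Arrays.asList("exposure", "income", "revenue", "loan_amt"));

    // Returns "<core>/select?q=...&fl=..." or throws if the input breaks the grammar.
    static String toSolrRequest(String input) {
        String[] tokens = input.trim().split("\\s+");
        if (!ENTITIES.contains(tokens[0].toLowerCase())) {
            throw new IllegalArgumentException("First keyword must be customer, industry or unit");
        }
        String core = tokens[0].toLowerCase();

        // Zero or more metrics follow the entity keyword.
        List<String> metrics = new ArrayList<>();
        int i = 1;
        while (i < tokens.length && METRICS.contains(tokens[i].toLowerCase())) {
            metrics.add(tokens[i].toLowerCase());
            i++;
        }

        // Optional trailing "from <region>" (a single-word region in this sketch).
        String region = null;
        if (i < tokens.length) {
            if (!tokens[i].equalsIgnoreCase("from") || i + 2 != tokens.length) {
                throw new IllegalArgumentException("Expected: from <region>");
            }
            region = tokens[i + 1];
        }

        String q = (region == null) ? "*:*" : "region:\"" + region + "\"";
        String fl = metrics.isEmpty() ? "*" : String.join(",", metrics);
        return core + "/select?q=" + q + "&fl=" + fl;
    }

    public static void main(String[] args) {
        System.out.println(toSolrRequest("customer"));
        System.out.println(toSolrRequest("customer income revenue from Asia"));
    }
}
```

Rejecting anything outside the grammar, rather than guessing, keeps the search limited to what the user is expected to know, as the question asks.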

How to create a good NER training model in OpenNLP?

I have just started with OpenNLP. I need to create a simple training model to recognize named entities.
Reading the doc here https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this simple text used to train the model:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC ,
was named a director of this British industrial conglomerate .
I have two questions:
Why do I have to put the names of the persons in a text (phrase) context? Why not write one person's name per line, like:
<START:person> Robert <END>
<START:person> Maria <END>
<START:person> John <END>
How can I also add extra information to that name?
For example I would like to save the information Male/Female for each name.
(I know there are systems that try to infer it from the last letter, like "a" for Female, but I would like to add it myself.)
Thanks.
The answer to your first question is that the algorithm works on the surrounding context (tokens) within a sentence; it is not just a simple lookup mechanism. OpenNLP uses maximum entropy, which is a form of multinomial logistic regression, to build its model. The reason for this is to reduce word sense ambiguity and find entities in context. For instance, if my name is April, I can easily be confused with the month of April, and if my name is May, I could be confused with the month of May as well as the verb "may". As for the second part of your first question, you could make a list of known names and use it in a program that looks at your sentences and automatically annotates them to help you create a training set. However, a list of names alone, without context, will not train the model sufficiently, or at all. In fact, there is an OpenNLP addon called the "modelbuilder addon" designed for this: you give it a file of names, and it uses the names and some of your data (sentences) to train a model. If you are looking for particular names of generally unambiguous entities, you may be better off just using a list and something like regex to discover names rather than NER.
As for your second question, there are a few options, but in general I don't think NER is a great tool for delineating something like gender, although with enough training sentences you may get decent results. Since NER uses a model based on the surrounding tokens in your training sentences to establish the existence of a named entity, it can't do much in terms of identifying gender. You may be better off finding all person names, then looking them up in an index of names that you know are male or female. Also, some names, like Pat, are both male and female, and in most textual data there will be no indication of which it is, to either human or machine. That being said, you could create separate male and female models, or you could create different entity types within the same model. You could use an annotation like this (using different entity type names of male.person and female.person). I've never tried this, but it might do OK; you'd have to test it on your data.
<START:male.person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mrs . <START:female.person> Maria <END> is chairman of Elsevier N.V. , the Dutch publishing group
NER= Named Entity Recognition
HTH
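The idea of using a list of known names to annotate your own sentences automatically can be sketched as follows. This is a minimal illustration, not the modelbuilder addon: the class and method names are hypothetical, it assumes sentences are already whitespace-tokenized as in the training examples above, and it only handles single-token names (a multi-token name such as "Pierre Vinken" would need span merging).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper that wraps known names in OpenNLP name-finder training markup.
class NameAnnotator {
    // Wraps every token found in knownNames as "<START:type> token <END>".
    static String annotate(String sentence, Set<String> knownNames, String type) {
        StringBuilder out = new StringBuilder();
        for (String token : sentence.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (knownNames.contains(token)) {
                out.append("<START:").append(type).append("> ")
                   .append(token).append(" <END>");
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>(Arrays.asList("Vinken", "Maria"));
        System.out.println(annotate(
                "Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group .",
                names, "person"));
    }
}
```

The output keeps each name inside its sentence, which is the context the model needs. Passing a type such as male.person or female.person, chosen from your own gender index, would produce the two-entity-type annotation described above.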

Solr: The default OR operator returns irrelevant results, when the fields are queried with multiple words

I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
An example search query: keywords:"berlin house john" name:"berlin house john" author:"berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
If the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field. This is an enormous list in which the best matching result is of course in first position, but the following results have very little relevance (perhaps only one field matching) and simply confuse the user.
On the other hand, if I switch the default operator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which of course does not exist.
The search terms come to the application from a search input, in which, the user writes free text - there are no specific language conventions (hashtags or something).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure that all words appear somewhere in the searched fields, I would suggest copying all relevant fields into one field at index time and querying that field instead. To do so, you need to introduce a new field and then use copyField for every source field you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
A similar approach would be to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keywords, and one or more terms must match in name, and one or more terms must match in author.
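Since the search terms come from free-text input, the query above would have to be assembled in code. A minimal sketch, assuming whitespace-separated terms; the class and method names are hypothetical, and only backslashes and double quotes are escaped here, so this is not a complete Solr escaping routine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder for "(f:"t1" OR f:"t2") AND (g:"t1" OR g:"t2")" style queries.
class FieldQueryBuilder {
    static String build(String[] fields, String userInput) {
        String[] terms = userInput.trim().split("\\s+");
        List<String> groups = new ArrayList<>();
        for (String field : fields) {
            // One OR-group per field: at least one term must match in this field.
            List<String> clauses = new ArrayList<>();
            for (String term : terms) {
                String escaped = term.replace("\\", "\\\\").replace("\"", "\\\"");
                clauses.add(field + ":\"" + escaped + "\"");
            }
            groups.add("(" + String.join(" OR ", clauses) + ")");
        }
        // AND the groups together: every field must be matched by some term.
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        String[] fields = {"keywords", "name", "author"};
        System.out.println(build(fields, "berlin house john"));
    }
}
```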
As of Solr 4, defaultOperator is deprecated. Please don't use it.
Also, in my experience defaultOperator behaves the same as an operator specified in the query. I can't say why; it is just my experience.
Please try the query with the param {!q.op=AND}.
I am guessing you use the default query parser; correct me if I am wrong.

Lucene search on a Hibernate List field

I have a Hibernate annotated class TestClass that contains a List<String> field that I am indexing with Lucene. Consider the following example:
"Foo Bar" and "Bar Snafu" are two entries in the List for a particular record. Now, If a user searches on TestClass for "Foo Snafu" then the record will be found, I am guessing because the token Foo and the token Snafu are both tokens in the List<String> for this record.
Is there a way I can prevent this from happening?
The real world example is a Court case that has a List of Plaintiffs and Defendants. Say there are two people being prosecuted on the case, Joe Lewis Bob and Robert Clay Smith. These users are stored in the Court case record in a List of Defendants. This List of defendants is indexed with Lucene. Now if a user searches for either of the two defendants mentioned earlier, the case will be found. But the case will also be found if a user searches for Lewis Smith, or Joe Clay.
Update: It was mentioned in the Lucene IRC channel that I could possibly use a multi-valued field.
Update 2: It was mentioned in the Solr IRC channel that I could use the positionIncrementGap setting in schema.xml to accomplish this with Solr. Apparently if I use a phrase query (with or without slop) then "the increment gap ensures that different values in the same field won't cause an unintended match".
Lucene appends successive additions to the same field in the same document to the end of what it already has in the field.
If you want to treat each member of the List as an entirely separate entity, you should index them in different fields. You could just append the list index to the field name you are already using. I don't have complete information on your needs, of course, but doing something like this is probably the better solution.
If you just want to search for the precise text "Foo Snafu", you can use a PhraseQuery. If you want to be sure your phrasequery doesn't cross from one list item to the next (ie, if you had "Bar Foo" and "Snafu Bar" in the index), you could insert some form of delimiting term between each member when writing to the index.
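To see why a delimiter or position gap between list items stops a phrase from matching across items, here is a small plain-Java simulation. It does not use Lucene: the token positions are assigned by hand to mimic what a position gap between values does, and the class name, method names and gap value of 100 are purely illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy positional index: term -> list of token positions within one field.
class PositionGapDemo {
    // Indexes all list items into one field, leaving `gap` unused positions between items.
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = 0;
        for (String value : values) {
            for (String token : value.toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
                position++;
            }
            position += gap; // hole between consecutive list items
        }
        return postings;
    }

    // Exact phrase match: the terms must occupy consecutive positions.
    static boolean matchesPhrase(Map<String, List<Integer>> postings, String phrase) {
        String[] terms = phrase.toLowerCase().split("\\s+");
        for (int start : postings.getOrDefault(terms[0], new ArrayList<>())) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = postings.getOrDefault(terms[i], new ArrayList<>()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        items.add("Bar Foo");
        items.add("Snafu Bar");
        // Without a gap, "Foo Snafu" matches across the item boundary.
        System.out.println(matchesPhrase(index(items, 0), "Foo Snafu"));
        // With a gap, only phrases inside one item match.
        System.out.println(matchesPhrase(index(items, 100), "Foo Snafu"));
    }
}
```

A delimiting term inserted between members has the same effect as the gap: it pushes the last token of one item and the first token of the next apart, so they are no longer adjacent.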

Matching inexact company names in Java

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, ie. I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?
You could standardize the formats as much as possible in your DB/map & input (i.e. convert to upper/lowercase), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.
You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)
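A minimal sketch of the normalize-then-score approach, under some assumptions: the list of noise words (co, ltd, limited, inc, llc, and), the decision to drop all whitespace so that "A. B." and "AB" compare equal, and the maxDistance threshold are illustrative choices, not something prescribed by the answer. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class CompanyNameMatcher {
    // Lowercases, drops punctuation and common suffix words, then removes whitespace.
    static String normalize(String name) {
        String s = name.toLowerCase().replace("&", " and ");
        s = s.replaceAll("[^a-z0-9 ]", "");
        s = s.replaceAll("\\b(co|ltd|limited|inc|llc|and)\\b", " ");
        return s.replaceAll("\\s+", "");
    }

    // Classic dynamic-programming edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest known name, or null if nothing is within maxDistance.
    static String bestMatch(String input, List<String> known, int maxDistance) {
        String key = normalize(input);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : known) {
            int d = levenshtein(key, normalize(candidate));
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("A. B. Widgets & Co Ltd.", "Acme Gadgets Inc");
        System.out.println(bestMatch("AB Widgets Limited", known, 2));
        System.out.println(bestMatch("Zebra Holdings", known, 2));
    }
}
```

Returning null above the threshold is one way to honor the requirement to avoid false matches: an unmatched name can be sent for manual confirmation instead of being assigned to the nearest company.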
Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages implementing string distance algorithms, I recommend giving it a try as a first step. There are about 20 different algorithms already implemented (including Levenshtein, Jaro-Winkler and Monge-Elkan), and its code is structured well enough that you don't have to understand the whole logic in depth; you can start using it in minutes.
(BTW, I'm not the author of the library, so kudos for its creators.)
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
LCS code
Example usage (guessing a category based on what people entered)
I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
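The original LCS code is not reproduced here, so the following is only a sketch of what LCS scoring with that kind of cleanup might look like. The class name, the noise-word list and the normalization of the score to the range 0..1 (twice the LCS length divided by the combined length) are assumptions for illustration.

```java
class LcsScorer {
    // Ignores case, common company suffixes, spaces and punctuation.
    static String clean(String s) {
        String lower = s.toLowerCase().replaceAll("\\b(co|llc|ltd|inc)\\b", "");
        return lower.replaceAll("[^a-z0-9]", "");
    }

    // Length of the longest common subsequence, by dynamic programming.
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    // Similarity in [0, 1]: 1.0 means the cleaned strings are identical.
    static double score(String a, String b) {
        String x = clean(a);
        String y = clean(b);
        if (x.isEmpty() || y.isEmpty()) {
            return 0.0;
        }
        return (2.0 * lcsLength(x, y)) / (x.length() + y.length());
    }

    public static void main(String[] args) {
        System.out.println(score("A. B. Widgets & Co Ltd.", "A B Widgets"));
        System.out.println(score("A. B. Widgets & Co Ltd.", "Acme Gadgets"));
    }
}
```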
Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.
Your database may support regular expressions (regex). Some Java tutorials are listed below; here is the link to the MySQL documentation (as an example):
http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
You would probably want to store in the database a fairly complex regular expression for each company that encompasses the variations in spelling you might anticipate, or the sub-elements of the company name that you would like to weight as significant.
You can also use the regex library in Java
JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
Using Regular Expressions in Java
http://www.regular-expressions.info/java.html
The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
Regarding the LCS suggestion above: longest common substring should work better here than longest common subsequence, because it requires the matching characters to be adjacent as well as in order.
You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.
