Semi-natural language search using Apache Solr - Java

I did some analysis of Apache Solr, and it's pretty good at searching data from various sources.
The problem I am facing is how to standardize my search grammar and translate the search text into a Solr query.
I have three types of file/database table to search from, namely Customer, Industry and Unit. The first keyword in the search box should be one of the three. After that, the user can specify a fixed set of criteria:
Metrics: 0 or many (e.g. exposure, income, revenue, loan_amt, etc.)
Dimensions: 0 or many (geography, region, etc.)
Example:
customer - returns all customer data from the customer core
customer income from Asia - returns income details for all customers who belong to Asia
customer income revenue from Asia - returns income and revenue details for all customers who belong to Asia
How can I translate the above natural-language search text into a Solr query?
Can I fix the grammar of the text in Solr, like:
the first keyword should be customer/industry/unit,
the second key-value would be one or more region/geography values,
and then the metric values.
I am not looking for a Google-like search, but a limited search where the user knows what to search for.

This doesn't seem to be a Solr question, strictly speaking. As a first step, you might want to define a context-free grammar (CFG, type-2 grammar) based on specific production rules for your input. This would give you some solid syntax rules to work from. Based on this, you can then create a parser for the natural language input and map the resulting parse tree to the keyword search in Solr.
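For illustration, here is a minimal sketch of that mapping step, assuming SolrJ and a hypothetical schema (a type field, a region field, and one stored field per metric); the keyword lists would come from your own grammar definition:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;

public class SearchGrammar {

    private static final List<String> ENTITIES = Arrays.asList("customer", "industry", "unit");
    private static final List<String> METRICS  = Arrays.asList("exposure", "income", "revenue", "loan_amt");

    // Parses "<entity> <metric>* [from <dimension>+]" into a SolrJ query.
    public static SolrQuery parse(String input) {
        String[] tokens = input.trim().toLowerCase().split("\\s+");
        if (tokens.length == 0 || !ENTITIES.contains(tokens[0])) {
            throw new IllegalArgumentException("Query must start with customer/industry/unit");
        }

        SolrQuery q = new SolrQuery("type:" + tokens[0]);  // hypothetical "type" field
        List<String> metrics = new ArrayList<>();
        boolean inDimensions = false;

        for (int i = 1; i < tokens.length; i++) {
            if ("from".equals(tokens[i])) {
                inDimensions = true;           // everything after "from" is a dimension value
            } else if (inDimensions) {
                q.addFilterQuery("region:" + tokens[i]);  // hypothetical "region" field
            } else if (METRICS.contains(tokens[i])) {
                metrics.add(tokens[i]);
            }
        }
        if (!metrics.isEmpty()) {
            q.setFields(metrics.toArray(new String[0])); // return only the requested metric fields
        }
        return q;
    }
}

For example, parse("customer income revenue from asia") yields q=type:customer, fq=region:asia and fl=income,revenue. A real CFG parser would replace the flat token scan, but the mapping to Solr stays the same.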

In order not to get sucked into the question-answering domain of NLP, which is considered the hardest domain of NLP, maybe try to define a syntax for your questions, for instance X in Y with Z, where X can be one of several entities like Customer, Y can be some geolocation, and Z a filter.

Related

Encoding record samples for expectation maximization algorithm

First, I'm a programmer without a data science background, so my working knowledge of statistics is quite limited.
I'm creating an entity matching tool to match records across internal datasets. I want to use the probabilistic matching technique described in these documents*. I have a good understanding of how the technique works and how to apply it, except for the derivation of agreement/disagreement weights using expectation maximization (EM).
Specifically, I'm unclear on how to encode my record pairs into the double[][] format required by the EM implementation I have available: Apache Commons Math's MultivariateNormalMixtureExpectationMaximization.
Here is a concrete example: matching company records.
A company has two fields: name (string) and country (enum), and I want to generate the m and u probabilistic weights using EM. How do I create the double[][] dataset for each field to feed into EM?
In the case of name, it is a string, so there will be approximate agreement/disagreement, using some string similarity method (edit distance, phonetic index, etc.; the details aren't relevant here).
In the case of country, my data is normalized, so agreement will only occur on an exact match. However, certain countries are over- and under-represented, so a record with an under-represented country should have a higher weight than one with an over-represented country.
What exactly do the values in the inner double[] mean/represent?
How many entries/columns should there be?
How do I encode the records into the double[]?
* The documents describing the probabilistic matching technique using EM:
An Overview of Record Linkage Methods
Death registrations to Census linkage project - Methodology and Quality Assessment - M AND U PROBABILITIES
Estimating Match Weights and Thresholds Algorithms for Master Person Index
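Since this question is mainly about the shape of the data, here is a minimal, hedged sketch of one common encoding, assuming each candidate record pair becomes one row whose columns are per-field comparison scores (the numbers below are invented for illustration):

import org.apache.commons.math3.distribution.MixtureMultivariateNormalDistribution;
import org.apache.commons.math3.distribution.fitting.MultivariateNormalMixtureExpectationMaximization;

public class EmEncodingSketch {

    public static void main(String[] args) {
        // One row per record pair, one column per field comparison:
        // column 0 = name similarity in [0,1], column 1 = country agreement (1.0 or 0.0).
        double[][] data = {
            { 0.95, 1.0 },  // probably a true match
            { 0.20, 0.0 },  // probably a non-match
            { 0.88, 1.0 },
            { 0.35, 1.0 },  // country agrees by chance (an over-represented country)
            { 0.10, 0.0 },
            // ... in practice, one row for every candidate record pair
        };

        // Two mixture components: the "match" class and the "non-match" class.
        MultivariateNormalMixtureExpectationMaximization em =
                new MultivariateNormalMixtureExpectationMaximization(data);
        MixtureMultivariateNormalDistribution initial =
                MultivariateNormalMixtureExpectationMaximization.estimate(data, 2);
        em.fit(initial);

        MixtureMultivariateNormalDistribution fitted = em.getFittedModel();
        System.out.println(fitted.getComponents().size() + " components fitted");
    }
}

So the inner double[] is one comparison vector, and the number of columns is the number of fields you compare. Two caveats: this class fits Gaussian mixtures, so a near-constant binary column (like the country flag) can produce a singular covariance matrix on small samples, and classical Fellegi-Sunter m/u estimation is usually run on discrete agreement patterns rather than Gaussian mixtures, so treat this only as a way to get the data into the required shape.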

Using Stanford NER to extract addresses from a text document?

I was looking at Stanford NER and thinking of using its Java APIs to extract postal addresses from a text document. The document may be any document that has a postal address section, e.g. utility bills or electricity bills.
So the approach I am thinking of is:
Define postal address as a named entity, using LOCATION and other primitive named entities.
Define segmentation and other sub-processes.
I am trying to find an example pipeline for this (what steps are required, in detail). Has anyone done this before? Suggestions welcome.
To be clear: all credit goes to Raj Vardhan (and John Bauer) who had an interaction on the [java-nlp-user] mailing list.
Raj Vardhan wrote about the plan to work on "finding street address in a sentence":
Here is an approach I have thought of:
Find the event anchor in a sentence.
Select outgoing edges in the SemanticGraph from that event node with relations such as "prep-in" or "prep-at".
IF the dependent value in the relation has the POS tag NNP:
a) Find outgoing edges from the dependent value's node with relations such as "nn".
b) Connect all such nodes in increasing order of occurrence in the sentence.
c) PRINT the resulting value as the location where the event occurred.
This is obviously based on certain assumptions, such as a direct dependency between the event anchor and the location in a sentence.
Not sure whether this could help you, but I wanted to mention it just in case. Again, any credit should go to Raj Vardhan (and John Bauer).
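If the dependency-graph route feels heavy, a simpler hedged starting point is to run the standard NER annotator and collect contiguous runs of LOCATION-tagged tokens as address candidates. Note that the default NER models won't tag house numbers or street suffixes as LOCATION, so a real postal-address extractor would likely need a custom model or extra rules on top:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class AddressCandidates {

    public static List<String> locationSpans(String text) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(text);
        pipeline.annotate(doc);

        List<String> spans = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String tag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                if ("LOCATION".equals(tag)) {
                    if (current.length() > 0) current.append(' ');
                    current.append(token.word());     // extend the current LOCATION run
                } else if (current.length() > 0) {
                    spans.add(current.toString());    // a contiguous LOCATION run ended
                    current.setLength(0);
                }
            }
        }
        if (current.length() > 0) spans.add(current.toString());
        return spans;
    }
}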

How to improve the performance of the SMO classifier in Weka?

I am using the Weka SMO classifier to classify documents. There are many parameters available for SMO, like the kernel, tolerance, etc. I tested different parameters, but I do not get good results on a large data set.
For more than 90 categories, only 20% of documents get correctly classified.
Can anyone tell me the best set of parameters to get the highest performance out of SMO?
The principal issue here is not classification itself, but rather selecting suitable features. Using raw HTML leads to very large noise, which in turn makes classification results very poor. Thus, to get good results, do the following:
Extract relevant text. Don't just remove HTML tags; get exactly the text describing the item.
Create a dictionary of key words, e.g. cappuccino, latte, white rice, etc.
Use stemming or lemmatization to get each word's base form and avoid counting, for example, "cotton" and "cottons" as 2 different words.
Make feature vectors from the text. Attributes (feature names) should be all the words from your dictionary. Values may be: binary (1 if the word occurs in the text, 0 otherwise), integer (number of occurrences of the word in question in the text), tf-idf (use this one if your texts have very different lengths), and others.
And only after all these steps can you use the classifier.
Most probably the classifier type won't play a big role here: dictionary-based features normally lead to quite exact results regardless of the classification technique in use. You can use SVM (SMO), Naive Bayes, an ANN or even kNN. More sophisticated methods include creating a category hierarchy, where, for example, the category "coffee" is included in the category "drinks", which in turn is part of the category "food".
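To make the feature-vector and classification steps concrete, here is a minimal Weka sketch, assuming a hypothetical documents.arff with one string attribute holding the text and the category as the last (class) attribute; StringToWordVector does the tokenizing, lowercasing and tf-idf weighting described above:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SmoTextSketch {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("documents.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Turn raw text into word features.
        StringToWordVector bow = new StringToWordVector();
        bow.setLowerCaseTokens(true);
        bow.setOutputWordCounts(true);
        bow.setTFTransform(true);   // tf-idf weighting, useful when texts
        bow.setIDFTransform(true);  // have very different lengths
        bow.setInputFormat(data);
        Instances vectors = Filter.useFilter(data, bow);

        // Only now train and evaluate the classifier.
        SMO smo = new SMO();
        Evaluation eval = new Evaluation(vectors);
        eval.crossValidateModel(smo, vectors, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

If the accuracy jumps once the features are cleaned up, tuning SMO's kernel and complexity parameters is the very last step, not the first.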

Is it possible to develop a criteria-based search on strings in C# or Java?

I have a List of strings in C#. It contains paragraph elements read from an MS Word file. For example:
list 0 -> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1 -> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to check my search string against those list elements.
My criterion is: if a list element contains an 85% match or an exact match of the search string, then I want to retrieve that list element.
In our case:
list 0 -> satisfies my search string better.
list 1 -> it also matches some of the text, but I think it falls below my criterion...
How do I do this kind of criteria-based search on strings?
I am still somewhat confused about my problem.
Your ideas and thoughts are welcome.
The keywords are "string distance" and also "paragraph similarity".
You seek to implement a function which expresses, as a scalar, say a percentage as suggested in the question, how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at the character level rather than at the word level, but generally these algorithms convey the idea of what is needed.
Working at the word level, you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of', etc.) and maybe allowing for some form of stemming. The order of the words, or at least their proximity, may also be of import.
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm, you'll need to get an idea of the general parameters of the problem:
How many strings will have to be compared? (on average, maximum)
How many words/tokens do the strings contain? (on average, maximum)
Is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared?
How fancy do we need to get with linguistic features?
Is it possible to pre-process the strings?
Are all the records in a single language?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design time and run time one can apply to this relatively open problem varies greatly, and is typically a compromise between the level of precision desired on one hand, and the run-time resources and overall complexity of the solution which may be acceptable on the other.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the words which match may be a very acceptable solution.
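As a sketch of that TF-IDF idea (in Java; tokenization and lowercasing are assumed to happen elsewhere, and all names here are hypothetical): weight each query word by its IDF and measure how much of that weight the candidate paragraph covers:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TfIdfOverlap {

    // Score in [0,1]: the share of the query's idf "mass" covered by the paragraph.
    public static double score(List<String> queryTokens, List<String> paragraphTokens,
                               Map<String, Double> idf) {
        Set<String> paragraph = new HashSet<>(paragraphTokens);
        double matched = 0, total = 0;
        for (String token : new HashSet<>(queryTokens)) {
            double weight = idf.getOrDefault(token, 1.0); // rare words weigh more
            total += weight;
            if (paragraph.contains(token)) matched += weight;
        }
        return total == 0 ? 0 : matched / total;
    }

    // idf over the paragraph collection: log(N / documentFrequency(word)).
    public static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String token : new HashSet<>(doc))
                df.merge(token, 1, Integer::sum);
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet())
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        return idf;
    }
}

Common words like "the" get an idf near zero and so barely affect the score, which approximates the stop-word handling mentioned above without an explicit stop list.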
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example part-of-speech tagging (say, for the purpose of avoiding false positives such as "saw" as a noun (a tool to cut wood) versus "saw" as the past tense of the verb "to see", or, more likely, to filter out some of the words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction or latent semantic analysis.
You may want to look into Lucene for Java or Lucene.NET for C#. I don't think it'll meet the percentage requirement you want out of the box, but it's a great tool for doing text matching.
You could maybe run a separate query for each word, and then work out the percentage of matching words yourself.
Here's an idea (not a solution by any means, but something to get started with):
// requires: using System; using System.Collections.Generic; using System.Linq;
private IEnumerable<string> SearchList = GetAllItems(); // load your list

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
                                      .Select(a => a.ToLower())
                                      .Distinct()
                                      .ToList();
    foreach (var item in SearchList)
    {
        var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
                              .Select(a => a.ToLower())
                              .Distinct()
                              .ToList();
        var common = wordsInItem.Intersect(wordsInSearchPara).ToList();
        // now that you know the common words, you can compute the match percentage
        double percent = 100.0 * common.Count / wordsInSearchPara.Count;
        // keep `item` when percent >= 85, per the criterion in the question
    }
}

Matching inexact company names in Java

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, i.e. I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?
You could standardize the formats as much as possible in your DB/map and input (e.g. convert everything to upper or lower case), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.
You could then have the user confirm the match, and if they don't like it, give them the option to enter that value into your list of known names (on second thought, that might be too much power to give a user...).
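A minimal sketch of that approach, hand-rolling the edit distance so it's self-contained; the noise-word list and any acceptance threshold are assumptions you would tune against your data:

import java.util.Arrays;
import java.util.List;

public class CompanyNameMatcher {

    private static final List<String> NOISE =
            Arrays.asList("co", "ltd", "limited", "inc", "and", "the");

    // Lowercase, strip punctuation, drop low-information tokens.
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (String token : name.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || NOISE.contains(token)) continue;
            if (sb.length() > 0) sb.append(' ');
            sb.append(token);
        }
        return sb.toString();
    }

    // Plain dynamic-programming edit distance, two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0,1]; compare against a threshold chosen to avoid false matches.
    static double similarity(String a, String b) {
        String na = normalize(a), nb = normalize(b);
        int max = Math.max(na.length(), nb.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(na, nb) / max;
    }
}

With this, similarity("A. B. Widgets & Co Ltd.", "AB Widgets Limited") compares "a b widgets" against "ab widgets" and scores roughly 0.9, while unrelated names fall much lower.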
Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages implementing string distance algorithms, I recommend giving it a try as a first step; there are ~20 different algorithms already implemented (incl. the Levenshtein, Jaro-Winkler, and Monge-Elkan algorithms), and its code is structured well enough that you don't have to understand the whole logic in depth, but can start using it in minutes.
(BTW, I'm not the author of the library, so kudos to its creators.)
You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
LCS code
Example usage (guessing a category based on what people entered)
I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
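Since the linked LCS code isn't reproduced here, a minimal hedged version of the scoring idea, applying longest common subsequence to strings normalized as described (spaces, punctuation and case stripped):

public class LcsScore {

    // Length of the longest common subsequence of a and b.
    static int lcs(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
        return dp[a.length()][b.length()];
    }

    // Score in [0,1] after lowercasing and stripping non-alphanumerics.
    static double score(String a, String b) {
        String na = a.toLowerCase().replaceAll("[^a-z0-9]", "");
        String nb = b.toLowerCase().replaceAll("[^a-z0-9]", "");
        int max = Math.max(na.length(), nb.length());
        return max == 0 ? 1.0 : (double) lcs(na, nb) / max;
    }
}

Dropping "co"/"llc"/"ltd" variants before scoring, as suggested above, is a simple replaceAll per variant on the normalized strings.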
Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.
Your database may support the use of regular expressions (regex); see below for some tutorials in Java. Here's the link to the MySQL documentation (as an example):
http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
You would probably want to store in the database a fairly complex regular expression for each company that encompasses the variations in spelling you might anticipate, or the sub-elements of the company name that you would like to weight as significant.
You can also use the regex library in Java:
JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
Using Regular Expressions in Java
http://www.regular-expressions.info/java.html
The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
To be more precise regarding the LCS suggestion above: longest common substring should work better than longest common subsequence, as the order of characters is important.
You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.
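A minimal hedged sketch of that route, assuming Lucene 8.x (package and directory class names have shifted between major versions); the fuzzy ~ operator supplies the 'near match' behavior:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CompanyIndex {

    public static void main(String[] args) throws IOException, ParseException {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory(); // in-memory index for the demo

        // Index each known company name.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("name", "A. B. Widgets & Co Ltd.", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Query incoming names with per-term fuzziness and rank by score.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("name", analyzer).parse("widgets~ limited~");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(hit.score + "  " + searcher.doc(hit.doc).get("name"));
            }
        }
    }
}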
