Lucene/Solr for approximate (company) name matching - java

I have a question regarding Lucene/Solr.
I am trying to solve a general (company) name matching problem.
Let me present one oversimplified example:
We have two (possibly large) lists of names viz., list_A and list_B.
We want to find the intersection of the two lists, but the names in the two lists may not always exactly match. For each distinct name in list_A, we will want to report one or more best matches from list_B.
I have heard that Lucene/Solr can solve this problem. Can you tell me if this is true? If it is, please point me to some minimal working example(s).
Thanks and regards,
Dibyendu

You could solve this with Lucene, yes, but if you just need to solve this one problem, creating a Lucene index would be a bit of a roundabout way to do it.
I'dd be more inclined to take a simpler approach. You could just find a library for fuzzy comparison between strings, and iterate through your lists and return only those under a certain threshold of similarity as matches.
org.apache.commons.lang3.StringUtils comes to mind, something like:
for (String a : alist) {
for (String b : blist) {
int dist = StringUtils.getLevenshteinDistance(a,b)
if (dist < threshold) {
//b is a good enough match for a, do something with it!
}
}
}
Depending on your intent, other algorithms might be more appropriate (Soundex or Metaphone, for instance)

SOLR can solve your problem. Index the list_B in SOLR. Now do a search for every item in list_A in SOLR, you will get one or more likely match from the list_B.
You need to configure analyzers and filters for the field according to your data set and what kind of similar result you want.

I am trying to do something similar, and I would like to point out to the other commenters that their proposed solutions (like Levenshtein Distance or Soundex) may not be appropriate, if the problem is matching of accurate names, as opposed to mis-spelled names.
For example: I doubt either one is much use for matching
John S W Edward
with
J Samuel Woodhouse Edward
I suppose it is possible, but this is a different class of problem than what they were intended to accomplish.

Related

Check if part of a String exists in HashMap

I have a HashMap of 60k key/value pairs.
I have 100 strings and out of those 100 strings one has a substring which exists in HashMap.
I would have to repeat this process thousand times. Is there is an efficient approach to do this?
Let's say, the hash contains like:
journal of america, rev su arabia, comutational journal, etc..
And the strings like:
published in rev su arabia
the publication event happened in
computationl journal 230:34
The first and third string contains the key/value in the hash and I need to find out those.
Code (not efficient)
private String contains(String candidateLine)
{
Iterator<String> it = journalName.iterator();
while (it.hasNext())
{
String journalName = it.next();
if (candidateLine.contains(journalName))
return journalName;
}
return null;
}
Please suggest.
Given your requirements, the only answer is: wrong design point. You are basically asking how to efficiently support "full text" search capabilities. And for that problem, the answer is: don't do it yourself.
Meaning: forget about re-inventing the wheel here. Instead, pick up an existing solution, such as Lucene (library) or products such as Solr or ElasticSearch ( see here for more information).
You see, most likely we are looking at a "real world" production problem here. So even when you find a clever way to build your own data structure to support your current requirements, chances are high that sooner or later "more" requirements will be coming your way.
Therefore I seriously suggest that clarify the exact problem to solve, and then identify that existing product that best solves the problem. Otherwise you will be fighting uphill battles like forever.

How to check if two Strings are approximately equal?

I'm making a chat responder for a game and i want know if there is a way you can compare two strings and see if they are approximatley equal to each other for example:
if someone typed:
"Strength level?"
it would do a function..
then if someone else typed:
"Str level?"
it would do that same function, but i want it so that if someone made a typo or something like that it would automatically detect what they're trying to type for example:
"Strength tlevel?"
would also make the function get called.
is what I'm asking here something simple or will it require me to make a big giant irritating function to check the Strings?
if you've been baffled by my explanation (Not really one of my strong points) then this is basically what I'm asking.
How can I check if two strings are similar to each other?
See this question and answer: Getting the closest string match
Using some heuristics and the Levenshtein distance algorithm, you can compute the similarity of two strings and take a guess at whether they're equal.
Your only option other than that would be a dictionary of accepted words similar to the one you're looking for.
You can use Levenshtein distance.
I believe you should use one of Edit distance algorithms to solve your problem. Here is for example Levenstein distance algorithm implementation in java. You may use it to compare words in the sentences and if sum of their edit distances would be less than for example 10% of sentence length consider them equals.
Perhaps what you need is a large dictionary for similar words and common spelling mistakes, for which you would use for each word to "translate" to one single entry or key.
This would be useful for custom words, so you could add "str" in the same key as "strength".
However, you could also make a few automated methods, i.e. when your word isn't found in the dictionary, to loop recursively for 1 letter difference (either missing or replaced) and can recurse into deeper levels, i.e. 2 missing letters etc.
I found a few projects that do text to phonemes translations, don't know which one is best
http://mary.dfki.de/
http://www2.eng.cam.ac.uk/~tpl/asp/source/Phoneme.java
http://java.dzone.com/announcements/announcing-phonemic-10
If you want to find similar word beginnings, you can use a stemmer. Stemmers reduce words to a common beginning. The most known algorithm if the Port Stemmer (http://tartarus.org/~martin/PorterStemmer).
Levenshtein, as pointed above, is great, but computational heavy for distances greater than one or two.

Concatenating RowFilter orFilters with andFilter in Java

All the questions pertaining this don't seem to answer the particular question I have.
My problem is this. I have a list of search terms, and for each term I find the edit distance to find possible misspelling of a word.
So for each word separated by a space, I have possible words each word could be.
For example: searching for green chilli might give us "fuzzy" words "green, greene and grain" and "chilli, chill and chilly".
Now I want the RowFilter to search for: "green OR greene OR grain" AND "chilli OR chill OR chilly".
I can't seem to find a way to do this in Java. I've looked all over the place but nothing talks about concatenating the OR and AND filters together in one RowFilter.
Would I have to roll my own solution based on the model? I suppose I can do this, but my method would most probably be naive at first and slow.
Any pointers as to how to roll my own solution for this or better yet, what's the Java way to do this right?
RowFilter.orFilter() and RowFilter.andFilter() seem apropos; each includes examples, and each accepts an arbitrary number of arguments.

searching list of tens or few hundreds short text strings, sorting by relevance

I have a list of people that I'd like to search through. I need to know 'how much' each item matches the string it is being tested against.
The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
Therefore I assumed it would be OK to keep the whole list in memory and do the searching using something Java offers out-of-the-box or using some tiny library that just implements one or two testing algorithms. (In other words without bringing-in any complicated/overkill solution that stores indexes or relies on a database.)
What would be your choice in such case please?
EDIT: Seems like Levenshtein has closest to what I need from what has been adviced. Only that gets easily fooled when the search query is "John" and the names in list are significantly longer.
You should look at various string comparison algorithms and see which one suits your data best. Options are Jaro-Winkler, Smith-Waterman etc. Look up SimMetrics - a F/OSS library that offers a very comprehensive set of string comparison algorithms.
If you are looking for a 'how much' match, you should use Soundex. Here is a Java implementation of this algorithm.
Check out Double Metaphone, an improved soundex from 1990.
http://commons.apache.org/codec/userguide.html
http://svn.apache.org/viewvc/commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java?view=markup
According to me Jaro-Winkler algorithm will suit your requirement best.
Here is a Short summary of Jaro-Winkler Distance Algo
One of the PDF which compares different algorithms --> Link to PDF

Text similarity algorithm

I have two subtitles files.
I need a function that tells whether they represent the same text, or the similar text
Sometimes there are comments like "The wind is blowing... the music is playing" in one file only.
But 80% percent of the contents will be the same. The function must return TRUE (files represent the same text).
And sometimes there are misspellings like 1 instead of l (one - L ) as here:
She 1eft the baggage.
Of course, it means function must return TRUE.
My comments:
The function should return percentage of the similarity of texts - AGREE
"all the people were happy" and "all the people were not happy" - here that'd be considered as a misspelling, so that'd be considered the same text. To be exact, the percentage the function returns will be lower, but high enough to say the phrases are similar
Do consider whether you want to apply Levenshtein on a whole file or just a search string - not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.
Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance
Anything other than a result of zero means the text are not "identical". "Similar" is a measure of how far/near they are. Result is an integer.
For the problem you've described (i.e. compering large strings), you can use Cosine Similarity, which return a number between 0 (completely different) to 1 (identical), base on the term frequency vectors.
You might want to look at several implementations that are described here: Cosine Similarity
You're expecting too much here, it looks like you would have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improve it to provide good results for your input.
Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.
EDIT: The original version of agrep isn't open source, so you might get links to OSS versions from http://en.wikipedia.org/wiki/Agrep
There are many alternatives to the Levenshtein distance. For example the Jaro-Winkler distance.
The choice for such algorithm is depending on the language, type of words, are the words entered by human and many more...
Here you find a helpful implementation of several algorithms within one library
if you are still looking for the solution then go with S-Bert (Sentence Bert) which is light weight algorithm which internally uses cosine similarly.

Categories