Exact Match in SOLR 5.1


I have set up Solr 5.1.0 with a working data import from a MySQL database.
But I want only exact-match results, or results relevant to the exact query.
For example,
Dancers in Mumbai
returns every result that contains both "dancers" and "mumbai", plus results that contain only "dancers" or only "mumbai". I want only the results that contain both "dancers" and "mumbai", and no others.

This is not a complete answer, but it's the direction I'm trying to take with a similar problem. Comments are very welcome.
Step 1:
Implement multiple Solr cores, core 1 is "jobs" (dancers/lawyers/etc), and core 2 is "cities" (mumbai/chennai/etc).
Step 2:
Query each core for exact matches by configuring KeywordTokenizerFactory on the relevant field, so that only exact matches are found. This will give you all the matches across cores (e.g. jobs:dancers and cities:mumbai).
Step 3:
Perform your general query using EDisMax for a user-friendly search (e.g. searching for "dancers in mumbai" across many fields), and use the boost field to boost the jobs/cities found in the earlier query.
I would love to know if there is a better way of doing something this elaborate, but I have not found it yet. Hope it helps.

Using required terms like: +dancers +mumbai
Or a phrase query: "dancers in mumbai"
Would work.
You can also set the default operator for your query to be "AND", using the q.op parameter.
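A minimal sketch of how the two query forms above could be built from raw user input. The class and method names are hypothetical, and the code only produces query strings; it does not call Solr. It also assumes the input contains no query-parser special characters other than quotes and backslashes.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class RequiredTermsQuery {
    // Prefix every whitespace-separated term with '+' so the standard
    // query parser treats each one as mandatory.
    static String requireAll(String userInput) {
        return Arrays.stream(userInput.trim().split("\\s+"))
                .filter(term -> !term.isEmpty())
                .map(term -> "+" + term)
                .collect(Collectors.joining(" "));
    }

    // Wrap the input in double quotes to form a phrase query,
    // escaping embedded backslashes and quotes.
    static String asPhrase(String userInput) {
        String escaped = userInput.trim().replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(requireAll("dancers mumbai"));
        System.out.println(asPhrase("dancers in mumbai"));
    }
}
```

Either string can then be passed as the q parameter of the query.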

Related

How to handle synonyms and stop words when building a fuzzy query with Hibernate Search Query DSL

Using Hibernate Search (5.8.2.Final) Query DSL to Elasticsearch server.
Given a field analyzer that does lowercase, standard stop-words, then a custom synonym with:
company => co
and finally, a custom stop-word:
co
And we've indexed a vendor name: Great Spaulding Company, which boils down to 2 terms in Elasticsearch after synonyms and stop-words: great and spaulding.
I'm trying to build my query so that each term 'must' match, fuzzy or exact, depending on the term length.
I get the results I want except when 1 of the terms happens to be a synonym or stop-word and long enough that my code adds fuzziness to it, like company~1, in which case, it is no longer seen as a synonym or stop-word and my query returns no match, since 'company' was never stored in the first place b/c it becomes 'co' and then removed as a stop word.
Time for some code. It may seem a bit hacky, but I've tried numerous ways and using simpleQueryString with withAndAsDefaultOperator and building my own phrase seems to get me the closest to the results I need (but I'm open to suggestions). I'm doing something like:
// assume passed in search String of "Great Spaulding Company"
String vendorName = "Great Spaulding Company";
List<String> vendorNameTerms = Arrays.asList(vendorName.split(" "));
List<String> qualifiedTerms = Lists.newArrayList();
vendorNameTerms.forEach(term -> {
    int editDistance = getEditDistance(term); // 1..5 = 0, 6..10 = 1, > 10 = 2
    int prefixLength = getPrefixLength(term); // appears of no use with simpleQueryString
    String fuzzyMarker = editDistance > 0 ? "~" + editDistance : "";
    qualifiedTerms.add(String.format("%s%s", term, fuzzyMarker));
});
// join my terms back together with their optional fuzziness marker
String phrase = qualifiedTerms.stream().collect(Collectors.joining(" "));
bool.should(
    qb.simpleQueryString()
        .onField("vendorNames.vendorName")
        .withAndAsDefaultOperator()
        .matching(phrase)
        .createQuery()
);
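For readers who want to reproduce the phrases discussed here, this is a self-contained sketch of the length-based rule from the comment in the snippet above (1..5 characters = 0, 6..10 = 1, more than 10 = 2). The class and method names are hypothetical stand-ins for getEditDistance, and nothing in it talks to Hibernate Search or Elasticsearch.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class FuzzyPhraseBuilder {
    // Edit distance chosen purely from term length:
    // up to 5 chars = 0, 6..10 chars = 1, more than 10 chars = 2.
    static int editDistanceFor(String term) {
        int length = term.length();
        if (length <= 5) {
            return 0;
        }
        if (length <= 10) {
            return 1;
        }
        return 2;
    }

    // Append "~N" to every term whose edit distance is greater than zero.
    static String buildPhrase(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .map(term -> {
                    int distance = editDistanceFor(term);
                    return distance > 0 ? term + "~" + distance : term;
                })
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(buildPhrase("Great Spaulding Company"));
    }
}
```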
As I said above, I'm finding that as long as I don't add any fuzziness to a possible synonym or stop-word, the query finds a match. So these phrases return a match:
"Great Spaulding~1" or "Great Spaulding~1 Co" or "Spaulding Co"
But since my code doesn't know what terms are synonyms or stop-words, it blindly looks at term length and says, oh, 'Company' is greater than 5 characters, I'll make it fuzzy, it builds these sorts of phrases which are NOT returning a match:
"Great Spaulding~1 Company~1" or "Great Company~1"
Why is Elasticsearch not processing Company~1 as a synonym?
Any idea on how I can make this work with simpleQueryString or another DSL query?
How is everyone handling fuzzy searching on text that may contain stopwords?
[Edit] Same issue happens with punctuation that my analyzer would normally remove. I cannot include any punctuation in the fuzzy search string in my query b/c the ES analyzer doesn't seem to treat it as it would non-fuzzy and I don't get a match result.
Example based on above search string: Great Spaulding Company., gets built in my code to the phrase Great Spaulding~1 Company.,~1 and ES doesn't remove the punctuation or recognize the synonym word Company
I'm going to try a hack of calling ES _analyze REST api in order for it to tell me what tokens I should include in the query, although this will add overhead to every query I build. Similar to http://localhost:9200/myEntity/_analyze?analyzer=vendorNameAnalyzer&text=Great Spaulding Company., produces 3 tokens: great, spaulding and company.

Why is Elasticsearch not processing Company~1 as a synonym?
I'm going to guess it's because fuzzy queries are "term-level" queries, which means they operate on exact terms instead of analyzed text. If your term, once analyzed, resolved to multiple tokens, I don't think it would be easy to define an acceptable behavior for fuzzy queries.
There's a more detailed explanation there (I believe it still applies to the Lucene version used in Elasticsearch 5.6).
Any idea on how I can make this work with simpleQueryString or another DSL query?
How is everyone handling fuzzy searching on text that may contain stopwords?
You could try reversing your synonym: use co => company instead of company => co, so that a query such as compayn~1 will match even if "compayn" is not analyzed. But that's not a satisfying solution, of course, since other examples requiring analysis still won't work, such as Company~1.
Below are alternative solutions.
Solution 1: "match" query with fuzziness
This article describes a way to perform fuzzy searches, and in particular explains the difference between several types of fuzzy queries.
Unfortunately, it seems that fuzzy queries in "simple query string" queries are translated into the type of query that does not perform analysis.
However, depending on your requirements, the "match" query may be enough. In order to access all the settings provided by Elasticsearch, you will have to fall back to native query building:
QueryDescriptor query = ElasticsearchQueries.fromJson(
    "{ 'query': {"
        + "'match' : {"
            + "'vendorNames.vendorName': {"
                // Note that using a proper JSON framework would be better here, to avoid problems with quotes in the terms
                + "'query': '" + userProvidedTerms + "',"
                + "'operator': 'and',"
                + "'fuzziness': 'AUTO'"
            + "}"
        + "}"
    + " } }"
);
List<?> result = session.createFullTextQuery( query ).list();
See this page for details about what "AUTO" means in the above example.
Note that until Hibernate Search 6 is released, you can't mix native queries as shown above with the Hibernate Search DSL. Either you use the DSL or native queries, but not both in the same query.
Solution 2: ngrams
In my opinion, your best bet when the queries originate from your users, and those users are not Lucene experts, is to avoid parsing the queries altogether. Query parsing involves (at least in part) text analysis, and text analysis is best left to Lucene/Elasticsearch.
Then all you can do is configure the analyzers.
One way to add "fuzziness" with these tools would be to use an NGram filter. With min_gram = 3 and max_gram = 3, for example:
An indexed string such as "company" would be indexed as ["com", "omp", "mpa", "pan", "any"]
A query such as "compayn", once analyzed, would be translated to (essentially) com OR omp OR mpa OR pay OR ayn
Such a query would potentially match a lot of documents, but when sorting by score, the document for "Great Spaulding Company" would come up to the top, because it matches almost all of the ngrams.
I used parameter values min_gram = 3 and max_gram = 3 for the example, but in a real world application something like min_gram = 3 and max_gram = 5 would work better, since the added, longer ngrams would give a better score to search terms that match a longer part of the indexed terms.
Of course, if you can't sort by score, or if you can't accept too many trailing partial matches in the results, then this solution won't work for you.
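To make the ngram idea concrete, here is a small plain-Java illustration of how trigrams are produced and how many of them a misspelled query shares with an indexed term. This is only a sketch of the principle: the class and method names are made up, and real scoring is done by Lucene/Elasticsearch, not by counting shared grams like this.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NGramOverlap {
    // Slide a window of n characters over the lowercased text.
    static List<String> ngrams(String text, int n) {
        String s = text.toLowerCase();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Count the distinct ngrams that appear in both strings.
    static int sharedGrams(String indexed, String query, int n) {
        Set<String> indexedGrams = new HashSet<>(ngrams(indexed, n));
        Set<String> queryGrams = new HashSet<>(ngrams(query, n));
        queryGrams.retainAll(indexedGrams);
        return queryGrams.size();
    }

    public static void main(String[] args) {
        System.out.println(ngrams("company", 3));
        System.out.println(ngrams("compayn", 3));
        System.out.println(sharedGrams("company", "compayn", 3));
    }
}
```

With these inputs the misspelled query shares "com", "omp" and "mpa" with the indexed term, which is what lets the document still match.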

IN Equivalent Query In Solr and Solrj

I am using Solr 5.0.0. I would like to know the equivalent query for IN in Solr or SolrJ.
If I need to query products of different brands, I can use IN clause. If I have brands like dell, sony, samsung. I need to find the product with these brands using Solr and in Java Solrj.
Now I am using this code in Solrj
qry.addFilterQuery("brand:dell OR brand:sony OR brand:samsung");
I know that I can use OR here, but need to know about IN in Solr. And the performance of OR.

As you can read in Solr's wiki about its query syntax, Solr uses by default a superset of Lucene's query parser. As you can see when reading both documents, something like IN does not exist. But you can get shorter than the example query you presented.
In case that your default operator is OR you can leave it out from the query. In addition you can make use of Field Grouping.
qry.addFilterQuery("brand:(dell sony samsung)");
In case OR is not your default operator or you are not sure about this, you can employ Local Parameters for the filter query so that OR is enforced. Afterwards you can again make use of Field Grouping.
qry.addFilterQuery("{!q.op=OR}brand:(dell sony samsung)");
Keep in mind that you need to surround a phrase with " to keep the words together:
qry.addFilterQuery("{!q.op=OR}brand:(dell sony samsung \"packard bell\")");
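If the list of brands comes from code rather than being typed by hand, the filter string can be assembled with a small helper. The following is a sketch only: the class and method names are hypothetical, it quotes values that contain whitespace, and it does not escape other query-parser special characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class InStyleFilter {
    // Build "{!q.op=OR}field:(v1 v2 \"v 3\")" from a list of values.
    static String anyOf(String field, List<String> values) {
        String grouped = values.stream()
                .map(InStyleFilter::quoteIfNeeded)
                .collect(Collectors.joining(" "));
        return "{!q.op=OR}" + field + ":(" + grouped + ")";
    }

    // Values containing whitespace are wrapped in double quotes so the
    // words stay together as a phrase.
    static String quoteIfNeeded(String value) {
        if (value.chars().anyMatch(Character::isWhitespace)) {
            return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(anyOf("brand", Arrays.asList("dell", "sony", "samsung", "packard bell")));
    }
}
```

The result would then be passed to the same call as above, e.g. qry.addFilterQuery(InStyleFilter.anyOf("brand", brands)).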

Interpret Queries of Lucene

I was wondering if there is any way to interpret Queries of Lucene in simple terms?
For example :
Example # 1:
Input Query - name:John
Output - Interpreted as : Find all entries where attribute "name" is equal "John".
Example # 2:
Input Query - name:John AND phoneNumber:1234
Output - Interpreted as : Find all entries where attribute "name" is equal to "John" and attribute "phoneNumber" is equal to "1234".
Any tutorials in this regard will be helpful,
Thanks

The Lucene documentation does a pretty decent job in explaining basic queries and their interpretation. It seems as though that's all you're looking for; once you get into some of the more advanced query types, it gets hairy, but the documentation should always be your first stop; it's fairly comprehensive.
Edit: Ah, you want automated query explanation. I don't know of any that currently exist; I think you'll have to write your own, but if you're starting with standard QueryParser Syntax, I think the best input for your interpreter would be the output of QueryParser.parse(). That breaks down the free text into Lucene query objects that shouldn't be too difficult to wrap in a utility function that outputs a plain-English string for each one.
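As a rough idea of what such a utility could output, here is a toy interpreter for the two example queries in the question. Note that it does not use Lucene's QueryParser at all: it hand-parses only the field:value AND field:value shape, and all names in it are hypothetical. A real implementation would walk the Query objects returned by the parser instead.

```java
import java.util.ArrayList;
import java.util.List;

class QueryExplainer {
    // Turn "field:value AND field:value" into a plain-English sentence.
    // Anything outside that narrow shape is rejected.
    static String explain(String query) {
        List<String> clauses = new ArrayList<>();
        for (String part : query.trim().split("\\s+AND\\s+")) {
            int colon = part.indexOf(':');
            if (colon <= 0 || colon == part.length() - 1) {
                throw new IllegalArgumentException("Unsupported clause: " + part);
            }
            String field = part.substring(0, colon).trim();
            String value = part.substring(colon + 1).trim();
            clauses.add("attribute \"" + field + "\" is equal to \"" + value + "\"");
        }
        return "Find all entries where " + String.join(" and ", clauses) + ".";
    }

    public static void main(String[] args) {
        System.out.println(explain("name:John"));
        System.out.println(explain("name:John AND phoneNumber:1234"));
    }
}
```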

Find most common words in sql

I have a new problem. I have a database with a column that contains a wide variety of text. Is there any way I can get SQL to tell me which are the 10 most common words used in these fields? As an example:
1 I am coming home a bit late today.
2 Train is running late.
3 What is the train schedule like today?
4 Snow is really bad right now.
And output optimally would be:
is: 3
late : 2
train: 2
today: 2
If it is not possible to do it with SQL, what else would you suggest I look into to get this information?

This might technically be doable in SQL, but it will be painful and very slow when you have more rows in your database.
The problem you are describing is a perfect use case for an indexing engine though, such as Lucene (I used this one as an example since your question first contained the tag 'java' before being edited).

One option is to use a table-valued split function that returns each word as a row, then count the rows per word and sort them by count in descending order.
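The same split, count and sort steps can also be done outside the database once the rows are loaded. The sketch below does this in plain Java rather than SQL; the class name is hypothetical, words are lowercased, and ties are broken alphabetically, which is an arbitrary choice.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class WordFrequency {
    // Split each row into words, count occurrences, and keep the
    // 'limit' most frequent words in descending order of count.
    static Map<String, Long> topWords(List<String> rows, int limit) {
        Map<String, Long> counts = rows.stream()
                .flatMap(row -> Arrays.stream(row.toLowerCase().split("[^\\p{L}\\p{Nd}]+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (first, second) -> first, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "I am coming home a bit late today.",
                "Train is running late.",
                "What is the train schedule like today?",
                "Snow is really bad right now.");
        topWords(rows, 4).forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```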

Matching inexact company names in Java

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, ie. I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?

You could standardize the formats as much as possible in your DB/map & input (i.e. convert to upper/lowercase), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.
You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)
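A minimal sketch of that approach, assuming a simple normalization (lowercase, punctuation collapsed to spaces) followed by the standard dynamic-programming Levenshtein distance. The class and method names are illustrative only, and the normalization rules would need tuning for real company names.

```java
import java.util.Locale;

class NameDistance {
    // Lowercase and replace every run of non-alphanumeric characters
    // with a single space.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        String known = normalize("A. B. Widgets & Co Ltd.");
        String incoming = normalize("A.B. Widgets and Co");
        System.out.println(known + " / " + incoming + " -> " + levenshtein(known, incoming));
    }
}
```

The incoming name would be scored against every known name and the lowest distance taken as the candidate match, ideally with a threshold to avoid false matches.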

Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages implementing string distance algorithms, I recommend giving it a try as a first step. There are ~20 different algorithms already implemented (incl. Levenshtein, Jaro-Winkler, Monge-Elkan, etc.), and its code is structured well enough that you don't have to understand the whole logic in depth; you can start using it in minutes.
(BTW, I'm not the author of the library, so kudos for its creators.)

You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
LCS code
Example usage (guessing a category based on what people entered)

I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
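One possible reading of that suggestion, as a sketch: strip spaces, punctuation and case, drop a hypothetical list of low-value words, then score two names by the length of their longest common subsequence relative to the longer name. The noise-word list and the scoring formula are assumptions, not something the answer specifies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class CompanyLcs {
    // Hypothetical list of low-value words; extend it to suit the data.
    static final Set<String> NOISE = new HashSet<>(Arrays.asList(
            "co", "company", "llc", "ltd", "limited", "inc", "and"));

    // Lowercase, split on anything non-alphanumeric, drop noise words,
    // and join what is left without spaces.
    static String canonical(String name) {
        return Arrays.stream(name.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(word -> !word.isEmpty() && !NOISE.contains(word))
                .collect(Collectors.joining());
    }

    // Length of the longest common subsequence of a and b.
    static int lcsLength(String a, String b) {
        int[][] table = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                table[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? table[i - 1][j - 1] + 1
                        : Math.max(table[i - 1][j], table[i][j - 1]);
            }
        }
        return table[a.length()][b.length()];
    }

    // Score in [0, 1]: LCS length divided by the longer canonical name.
    static double score(String x, String y) {
        String a = canonical(x);
        String b = canonical(y);
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) lcsLength(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(score("A. B. Widgets & Co Ltd.", "AB Widgets Limited"));
    }
}
```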

Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.

Your database may support the use of Regular Expressions (regex) - see below for some tutorials in Java - here's the link to the MySQL documentation (as an example):
http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
You would probably want to store in the database a fairly complex regular expression for each company that encompasses the variations in spelling that you might anticipate - or the sub-elements of the company name that you would like to weight as being significant.
You can also use the regex library in Java
JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html
Using Regular Expressions in Java
http://www.regular-expressions.info/java.html
The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex

You can use an LCS algorithm to score them.
I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.
* LCS code
* Example usage (guessing a category based on what people entered)
To be more precise: rather than Longest Common Subsequence, Longest Common Substring should give better results, as the order of characters is important.

You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.
