Recently I started using Lucene. However, after a few days I noticed that the queries I provide as Strings are converted by Lucene into more general ones.
Example:
MY QUERY: "want to go" (including the quotes, as I'm searching for whole phrases)
QUERY OBJECT created from my query (.toString): text:"want ? go"
NUMBER OF RESULTS for texts:
I want to go out today -> 1 result - correct
I want sdfto go out today -> 1 result - incorrect, should be 0
I wanted to match exactly the phrase "want to go" and not "want whatever go". I noticed that only the words "to" and "a" are replaced with "?".
My question is: why is Lucene changing the queries I provide, and how can I force Lucene to run my queries unchanged?
Moreover, I'm using StandardAnalyzer (for both indexing and querying).
"to" is a stop word, meaning it is not indexed and not searched by some analyzers (including StandardAnalyzer), because it is usually not useful for searching. If you don't want it to be 'stopped' you will need to use a different analyzer (both for indexing and searching), but that will probably give worse results.
You can also remove the word 'to' from the field STOP_WORDS
IMPORTANT: your indexing analyzer and searching analyzer should be consistent, including the STOP_WORDS field!
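For example, a minimal sketch of disabling stop words (assuming Lucene 5+, where StandardAnalyzer takes a CharArraySet directly; older versions also need a Version argument, and the CharArraySet package differs between major versions):

// Sketch: an analyzer with an empty stop set, so "to" is indexed and searched.
Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET);
// Use the same configuration on both sides:
IndexWriterConfig config = new IndexWriterConfig(analyzer); // indexing
QueryParser parser = new QueryParser("text", analyzer);     // searching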
I'm using Hibernate Search (5.8.2.Final) Query DSL against an Elasticsearch server.
Given a field analyzer that does lowercasing, standard stop-words, then a custom synonym with:
company => co
and finally, a custom stop-word:
co
And we've indexed a vendor name: Great Spaulding Company, which boils down to 2 terms in Elasticsearch after synonyms and stop-words: great and spaulding.
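For reference, a sketch of roughly how that chain might be declared with Hibernate Search 5 annotations (the analyzer name and resource file names are placeholders, not taken from the question):

@AnalyzerDef(name = "vendorNameAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = StopFilterFactory.class), // standard stop-words
        @TokenFilterDef(factory = SynonymFilterFactory.class,
            params = @Parameter(name = "synonyms", value = "synonyms.txt")), // company => co
        @TokenFilterDef(factory = StopFilterFactory.class,
            params = @Parameter(name = "words", value = "custom-stopwords.txt")) // co
    })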
I'm trying to build my query so that each term 'must' match, fuzzy or exact, depending on the term length.
I get the results I want, except when one of the terms happens to be a synonym or stop-word and is long enough that my code adds fuzziness to it, like company~1. In that case it is no longer seen as a synonym or stop-word, and my query returns no match, since 'company' was never stored in the first place: it becomes 'co' and is then removed as a stop word.
Time for some code. It may seem a bit hacky, but I've tried numerous ways, and using simpleQueryString with withAndAsDefaultOperator and building my own phrase seems to get me closest to the results I need (but I'm open to suggestions). I'm doing something like:
// assume a passed-in search String of "Great Spaulding Company"
// (imports assumed: java.util.Arrays, java.util.List,
//  java.util.stream.Collectors, com.google.common.collect.Lists)
String vendorName = "Great Spaulding Company";
List<String> vendorNameTerms = Arrays.asList(vendorName.split(" "));
List<String> qualifiedTerms = Lists.newArrayList();
vendorNameTerms.forEach(term -> {
    int editDistance = getEditDistance(term); // 1..5 = 0, 6..10 = 1, > 10 = 2
    int prefixLength = getPrefixLength(term); // appears of no use with simpleQueryString
    String fuzzyMarker = editDistance > 0 ? "~" + editDistance : "";
    qualifiedTerms.add(String.format("%s%s", term, fuzzyMarker));
});
// join my terms back together with their optional fuzziness marker
String phrase = qualifiedTerms.stream().collect(Collectors.joining(" "));
bool.should(
    qb.simpleQueryString()
        .onField("vendorNames.vendorName")
        .withAndAsDefaultOperator()
        .matching(phrase)
        .createQuery()
);
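For completeness, the helpers above simply map term length to an edit distance per the inline comment; a hypothetical version:

// Hypothetical helper: 1..5 chars = edit distance 0, 6..10 = 1, > 10 = 2
private int getEditDistance(String term) {
    if (term.length() <= 5) return 0;
    if (term.length() <= 10) return 1;
    return 2;
}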
As I said above, I'm finding that as long as I don't add any fuzziness to a possible synonym or stop-word, the query finds a match. So these phrases return a match:
"Great Spaulding~1" or "Great Spaulding~1 Co" or "Spaulding Co"
But since my code doesn't know which terms are synonyms or stop-words, it blindly looks at term length ('Company' is longer than 5 characters, so make it fuzzy) and builds these sorts of phrases, which are NOT returning a match:
"Great Spaulding~1 Company~1" or "Great Company~1"
Why is Elasticsearch not processing Company~1 as a synonym?
Any idea how I can make this work with simpleQueryString or another DSL query?
How is everyone handling fuzzy searching on text that may contain stopwords?
[Edit] The same issue happens with punctuation that my analyzer would normally remove. I cannot include any punctuation in the fuzzy search string in my query, because the ES analyzer doesn't treat it as it would a non-fuzzy term, and I don't get a match result.
Example based on the above search string: Great Spaulding Company., gets built by my code into the phrase Great Spaulding~1 Company.,~1, and ES doesn't remove the punctuation or recognize the synonym word Company.
I'm going to try a hack of calling the ES _analyze REST API so that it tells me which tokens I should include in the query, although this will add overhead to every query I build. For example, http://localhost:9200/myEntity/_analyze?analyzer=vendorNameAnalyzer&text=Great Spaulding Company., produces 3 tokens: great, spaulding and company.
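A rough sketch of that hack (the host, index and analyzer names are the ones from the URL above; note that newer Elasticsearch versions expect a JSON body for _analyze instead of query parameters, and exception handling is omitted):

// Ask Elasticsearch which tokens the analyzer produces, then build the
// fuzzy phrase from those tokens instead of the raw user input.
String text = URLEncoder.encode("Great Spaulding Company.,", "UTF-8");
URL url = new URL("http://localhost:9200/myEntity/_analyze"
        + "?analyzer=vendorNameAnalyzer&text=" + text);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
    // The response JSON contains a "tokens" array: great, spaulding, company.
    String json = reader.lines().collect(Collectors.joining());
    System.out.println(json);
}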
Why is Elasticsearch not processing Company~1 as a synonym?
I'm going to guess it's because fuzzy queries are "term-level" queries, which means they operate on exact terms instead of analyzed text. If your term, once analyzed, resolved to multiple tokens, I don't think it would be easy to define an acceptable behavior for a fuzzy query.
There's a more detailed explanation there (I believe it still applies to the Lucene version used in Elasticsearch 5.6).
Any idea on how I can make this work with simpleQueryString or another DSL query?
How is everyone handling fuzzy searching on text that may contain stopwords?
You could try reversing your synonym: use co => company instead of company => co, so that a query such as compayn~1 will match even if "compayn" is not analyzed. But that's not a satisfying solution, of course, since other examples requiring analysis still won't work, such as Company~1.
Below are alternative solutions.
Solution 1: "match" query with fuzziness
This article describes a way to perform fuzzy searches, and in particular explains the difference between several types of fuzzy queries.
Unfortunately it seems that fuzzy queries in "simple query string" queries are translated into the type of query that does not perform analysis.
However, depending on your requirements, the "match" query may be enough. In order to access all the settings provided by Elasticsearch, you will have to fall back to native query building:
QueryDescriptor query = ElasticsearchQueries.fromJson(
"{ 'query': {"
+ "'match' : {"
+ "'vendorNames.vendorName': {"
// Note that using a proper JSON framework would be better here, to avoid problems with quotes in the terms
+ "'query': '" + userProvidedTerms + "',"
+ "'operator': 'and',"
+ "'fuzziness': 'AUTO'"
+ "}"
+ "}"
+ " } }"
);
List<?> result = session.createFullTextQuery( query ).list();
See this page for details about what "AUTO" means in the above example.
Note that until Hibernate Search 6 is released, you can't mix native queries like shown above with the Hibernate Search DSL. Either you use the DSL, or native queries, but not both in the same query.
Solution 2: ngrams
In my opinion, your best bet when the queries originate from your users, and those users are not Lucene experts, is to avoid parsing the queries altogether. Query parsing involves (at least in part) text analysis, and text analysis is best left to Lucene/Elasticsearch.
Then all you can do is configure the analyzers.
One way to add "fuzziness" with these tools would be to use an NGram filter. With min_gram = 3 and max_gram = 3, for example:
An indexed string such as "company" would be indexed as ["com", "omp", "mpa", "pan", "any"]
A query such as "compayn", once analyzed, would be translated to (essentially) com OR omp OR mpa OR pay OR ayn
Such a query would potentially match a lot of documents, but when sorting by score, the document for "Great Spaulding Company" would come up to the top, because it matches almost all of the ngrams.
I used parameter values min_gram = 3 and max_gram = 3 for the example, but in a real world application something like min_gram = 3 and max_gram = 5 would work better, since the added, longer ngrams would give a better score to search terms that match a longer part of the indexed terms.
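As a sketch, such an analyzer could be declared with Hibernate Search 5 annotations roughly like this (Elasticsearch's min_gram/max_gram correspond to Lucene's minGramSize/maxGramSize parameters; the analyzer name is illustrative):

@AnalyzerDef(name = "ngramAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = NGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "3"),
            @Parameter(name = "maxGramSize", value = "5")
        })
    })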
Of course if you can't sort by score, or if you can't accept too many trailing partial matches in the results, then this solution won't work for you.
I was wondering if there is any way to interpret Lucene queries in simple terms?
For example :
Example # 1:
Input Query - name:John
Output - Interpreted as : Find all entries where attribute "name" is equal to "John".
Example # 2:
Input Query - name:John AND phoneNumber:1234
Output - Interpreted as : Find all entries where attribute "name" is equal to "John" and attribute "phoneNumber" is equal to "1234".
Any tutorials in this regard would be helpful.
Thanks
The Lucene documentation does a pretty decent job in explaining basic queries and their interpretation. It seems as though that's all you're looking for; once you get into some of the more advanced query types, it gets hairy, but the documentation should always be your first stop; it's fairly comprehensive.
Edit: Ah, you want automated query explanation. I don't know of any that currently exist; I think you'll have to write your own, but if you're starting with standard QueryParser Syntax, I think the best input for your interpreter would be the output of QueryParser.parse(). That breaks down the free text into Lucene query objects that shouldn't be too difficult to wrap in a utility function that outputs a plain-English string for each one.
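A minimal sketch of that idea, assuming the classic QueryParser and handling only TermQuery and BooleanQuery (anything else falls back to toString(); exception handling omitted):

// Hypothetical utility: describe a parsed Lucene query in plain English.
String explain(Query q) {
    if (q instanceof TermQuery) {
        Term t = ((TermQuery) q).getTerm();
        return "attribute \"" + t.field() + "\" is equal to \"" + t.text() + "\"";
    }
    if (q instanceof BooleanQuery) {
        StringBuilder sb = new StringBuilder();
        for (BooleanClause c : ((BooleanQuery) q).clauses()) {
            if (sb.length() > 0) {
                sb.append(c.getOccur() == BooleanClause.Occur.SHOULD ? " or " : " and ");
            }
            sb.append(explain(c.getQuery()));
        }
        return sb.toString();
    }
    return q.toString(); // fallback for query types this sketch doesn't handle
}

// Usage (note the analyzer lowercases "John" to "john"):
Query q = new QueryParser("name", new StandardAnalyzer())
        .parse("name:John AND phoneNumber:1234");
System.out.println("Find all entries where " + explain(q));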
I use Solr (SolrCloud) to index and search my tweets. There are about 16 million tweets and the index size is approximately 3 GB. The tweets are indexed in real time as they come in, so that real-time search is enabled. Currently I use the lowercase field type for my tweet body field. For a single search term it takes around 7 seconds, and with the addition of each search term the search time increases linearly. 3 GB is the maximum RAM allocated to the Solr process. A sample Solr search query looks like this:
tweet_body:*big* AND tweet_body:*data* AND tweet_tag:big_data
Any suggestions on improving the speed of searching? Currently I run only 1 shard which contains the entire tweet collection.
The query tweet_body:*big* can be expected to perform poorly. Trailing wildcards are easy; leading wildcards can be readily handled with a ReversedWildcardFilterFactory. A wildcard at both ends, however, will have to scan every document, rather than being able to use the index to locate matching documents. Combining the two approaches would only allow you to search:
tweet_body:*big tweet_body:big*
Which is not the same thing. If you really must search for terms with a leading AND trailing wildcard, I would recommend looking into indexing your data as N-grams.
I wasn't previously aware of it, but it seems the lowercase field type is a lowercase-filtered KeywordAnalyzer. This is not what you want. It means the entire field is treated as a single token: good for identification numbers and the like, but not for a body of text you wish to perform a full-text search on.
So yes, you need to change it. text_general is probably appropriate. That will index a correctly tokenized field, and you should be able to perform the query you are looking for with:
tweet_body:big AND tweet_body:data AND tweet_tag:big_data
You will have to reindex, but there is no avoiding that. There is no good, performant way to perform a full text search on a keyword field.
Try using filter queries, as filter queries run in parallel.
I have created an index using some data. Now I am using WildcardQuery to search this data. The indexed documents have a field named Product Code against which I am searching.
Below is the code that I am using for creating the query and searching:
Term productCodeTerm = new Term("Product Code", "*"+searchText+"*");
query = new WildcardQuery(productCodeTerm);
searcher.search(query, 100);
The searchText variable contains the search string that is used to search the index. When searchText is 'jf', I get the following results:
JF32358
JF5215
JF2592
Now, when I try to search using 25, f2, f3, or anything else other than just j, f, or jf, the query has no hits.
I am not able to understand why this is happening. Can someone help me understand why the search is behaving this way?
What analyzer did you use at indexing time? Given your examples, you should make sure that your analyzer (see the sketch after this list):
does lowercasing,
does not remove digits,
does not split at boundaries between letters and digits.
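For example, a sketch using Lucene's CustomAnalyzer (available since Lucene 5; exception handling omitted). A whitespace tokenizer keeps digits and never splits between letters and digits, and the lowercase filter handles casing:

// "JF32358" is indexed as the single token "jf32358", so wildcard
// searches like *25* or *f2* can match inside it.
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("whitespace")
        .addTokenFilter("lowercase")
        .build();

Also remember that WildcardQuery does not analyze its term, so lowercase searchText yourself before building the query.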
The Lucene FAQ page says:
Leading wildcards (e.g. *ook) are not supported by the QueryParser by default. As of Lucene 2.1, they can be enabled by calling QueryParser.setAllowLeadingWildcard( true ). Note that this can be an expensive operation: it requires scanning the list of tokens in the index in its entirety to look for those that match the pattern.
For more information check here.
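A minimal sketch of enabling it (classic QueryParser, using the field and variables from the question; this parser path is an alternative to building the WildcardQuery by hand):

QueryParser parser = new QueryParser("Product Code", analyzer);
parser.setAllowLeadingWildcard(true); // can be expensive, as noted above
Query query = parser.parse("*" + searchText + "*");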
I'm currently indexing webpages using Lucene. The aim is to be able to quickly extract which pages contain a certain expression (usually 1, 2 or 3 words), and which other words (or groups of 1 to 3 of them) are also in the page.
This will be used to build / enrich / alter a thesaurus (fixed vocabulary).
From the articles I found, it seems the problem is to find n-grams (or shingles).
Lucene has a ShingleFilter, a ShingleMatrixFilter, and a ShingleAnalyzerWrapper, which seem related to this task.
From this presentation, I learned that Lucene can also search for terms separated by a fixed number of words (called slop). An example is provided here.
However, I don't clearly understand the difference between those approaches. Are they fundamentally different, or is it a performance / index size choice that you have to make?
What is the difference between ShingleMatrixFilter and ShingleFilter?
Hope a Lucene guru will FIND this question, and answer it ;-)!
The differences between using phrase versus shingle mainly involve performance and scoring.
When using phrase queries (say "foo bar") in the typical case where single words are in the index, the phrase query has to walk the inverted index for "foo" and for "bar" and find the documents that contain both terms, then walk the positions lists within each of those documents to find the places where "foo" appeared right before "bar".
This has some cost to both performance and scoring:
Positions (.prx) must be indexed and searched; this is like an additional "dimension" to the inverted index, which will increase indexing and search times.
Because only individual terms appear in the inverted index, there is no real "phrase IDF" computed (this might not affect you). Instead it is approximated based on the sum of the term IDFs.
On the other hand, if you use shingles, you are also indexing word n-grams. In other words, if you are shingling up to size 2, you will also have terms like "foo bar" in the index. This means such a phrase query will be parsed as a simple TermQuery, without using any positions lists. And since it's now a "real term", the phrase IDF will be exact, because we know exactly in how many documents this "term" exists.
But using shingles has some costs as well:
Increased term dictionary, term index, and postings list sizes, though this might be a fair tradeoff, especially if you disable positions entirely with Field.setIndexOptions.
Some additional cost during the analysis phase of indexing, although ShingleFilter is optimized nicely and is pretty fast.
No obvious way to compute "sloppy phrase queries" or inexact phrase matches, although this can be approximated. E.g., for a phrase of "foo bar baz" with shingles of size 2, you will have two tokens, foo_bar and bar_baz, and you could implement the search via some of Lucene's other queries (like BooleanQuery) for an inexact approximation (see the sketch after this list).
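A sketch of that approximation (assuming Lucene 5+ for BooleanQuery.Builder; the field name is hypothetical):

// Approximate the phrase "foo bar baz" over a size-2 shingle field using
// the two shingle tokens; SHOULD clauses give an inexact, scored match.
BooleanQuery.Builder bq = new BooleanQuery.Builder();
bq.add(new TermQuery(new Term("body_shingles", "foo_bar")), BooleanClause.Occur.SHOULD);
bq.add(new TermQuery(new Term("body_shingles", "bar_baz")), BooleanClause.Occur.SHOULD);
Query inexactPhrase = bq.build();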
In general, indexing word n-grams with things like Shingles or CommonGrams is just a (fairly expert) tradeoff, to reduce the cost of positional queries or to enhance phrase scoring.
But there are real-world use cases for this stuff; a good example is available here:
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2