How to query lucene with "like" operator? [duplicate] - java

This question already has answers here:
Leading wildcard character throws error in Lucene.NET
(3 answers)
Closed 9 years ago.
The wildcard * can only be used at the end of a word, like user*.
I want to query the equivalent of LIKE '%user%'. How can I do that?

The trouble with LIKE queries is that they are expensive in terms of time taken to execute. You can set up QueryParser to allow leading wildcards with the following:
QueryParser.setAllowLeadingWildcard(true)
And this will allow you to do searches like:
*user*
But this will take a long time to execute. Sometimes when people say they want a LIKE query, what they actually want is a fuzzy query. This would allow you to do the following search:
user~
Which would match the terms users and fuser. You can specify the minimum similarity required between the term in your query and the terms you want matched using a float value between 0 and 1. For example, user~0.8 would match fewer terms than user~0.5.
I suggest you also take a look at regex query, which supports regular expression syntax for Lucene searches. It may be closer to what you really need. Perhaps something like:
.*user.*
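If it helps, here is a rough sketch of those three options in Java. This assumes roughly Lucene 5+ and a made-up field name "content"; exact package names and constructors differ between versions, so treat it as an outline rather than version-accurate code.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

public class LikeQueryExamples {
    public static void main(String[] args) throws Exception {
        // Leading + trailing wildcard through the query parser
        // (can be very slow on a large index, as discussed above).
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        parser.setAllowLeadingWildcard(true);
        Query wildcard = parser.parse("*user*");

        // Fuzzy query: matches terms within a small edit distance of "user",
        // e.g. "users" or "fuser".
        Query fuzzy = new FuzzyQuery(new Term("content", "user"));

        // Regexp query: matches terms whose whole text matches the expression.
        Query regexp = new RegexpQuery(new Term("content", ".*user.*"));

        System.out.println(wildcard + "\n" + fuzzy + "\n" + regexp);
    }
}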

Lucene provides the ReverseStringFilter, which allows you to do a leading wildcard search like *user. It works by indexing all terms in reverse order.
But I think there is no way to do something equivalent to LIKE '%user%'.
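A sketch of what that looks like as a custom analyzer (assuming Lucene 5+; the analysis chain is just an example): each token is indexed reversed, so a search for *user becomes the ordinary prefix query resu* against the reversed field.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Indexes every token reversed, e.g. "user" is stored as "resu".
Analyzer reversingAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream reversed = new ReverseStringFilter(source);
        return new TokenStreamComponents(source, reversed);
    }
};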

Since Lucene 2.1 you can use
QueryParser.setAllowLeadingWildcard(true);
but this can kill performance. The Lucene FAQ has some more info on this.

When you think about it, it is not entirely surprising that Lucene's support for wildcarding is (normally) restricted to a wildcard at the end of a word pattern.
Keyword search engines work by creating an inverted index of all words in the corpus, sorted in word order. When you do a normal non-wildcard search, the engine makes use of the fact that index entries are sorted to locate the entry or entries for your word in O(log N) steps, where N is the number of words or entries. For a word pattern with a suffix wildcard, the same thing happens to find the first matching word, and the remaining matches are found by scanning the entries until the fixed part of the pattern no longer matches.
However, for a word pattern with both a wildcard prefix and a wildcard suffix, the engine would have to look at all entries in the index. This would be O(N) ... unless the engine built a whole stack of secondary indexes for matching literal substrings of words (and that would make indexing a whole lot more expensive). For more complex patterns (e.g. regexes) the problem would be even worse for the search engine.
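To make the contrast concrete, here is a toy illustration in plain Java (no Lucene involved; the term list is made up):

import java.util.NavigableSet;
import java.util.TreeSet;

public class TermDictionaryDemo {
    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>();
        terms.add("use");
        terms.add("user");
        terms.add("username");
        terms.add("users");
        terms.add("zebra");

        // Suffix wildcard "user*": seek to the first term >= "user" (O(log N)),
        // then scan forward only while the fixed prefix still matches.
        for (String t : terms.tailSet("user", true)) {
            if (!t.startsWith("user")) break;
            System.out.println("prefix match: " + t);
        }

        // Infix wildcard "*user*": no way to seek, every term must be inspected (O(N)).
        for (String t : terms) {
            if (t.contains("user")) System.out.println("infix match: " + t);
        }
    }
}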

Related

Lucene: searching for string matching a regex

I use Lucene to search for specific patterns using a regular expression. A new use case came up where I need to look up a specific string matching a regex pattern. A good example would be looking up prices in documents: prices can be written in many ways, so just looking for "1256.88" as stored in the database is not enough. The value in the document may have a currency in front of it, behind it, or no currency at all ("EUR 1256,88", "1256,88 EUR" or just "1256,88"). The value may or may not have thousands separators. And of course these variations can be combined. So I want to search for a specific, known price ("1256.88") that is at the same time part of a regex match. An example regex would be
[0-9]{1,10}([.,][0-9]{0,2})?( ?[€$])?
What is the Lucene way of doing this? Is there a way to search with a regex AND an "example"?
Or do I have to search with a regex and then filter out wrong hits manually afterwards? How do I find out which strings triggered the match?
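To make the intent concrete, this is the kind of two-step approach I have in mind, sketched against Lucene 4+ (RegexpQuery, the "body" field, the cut-off of 100 hits, and the surrounding code are all assumptions, not working code):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

static void findKnownPrice(IndexSearcher searcher) throws Exception {
    // Step 1: term-level regex to find price-shaped tokens.
    Query priceShaped = new RegexpQuery(new Term("body", "[0-9]{1,10}([.,][0-9]{0,2})?"));
    TopDocs hits = searcher.search(priceShaped, 100);

    // Step 2: re-check the stored text to see which substrings actually
    // triggered the match and whether they are the known price.
    Pattern exact = Pattern.compile("1[., ]?256[.,]88");
    for (ScoreDoc sd : hits.scoreDocs) {
        String body = searcher.doc(sd.doc).get("body");
        Matcher m = exact.matcher(body);
        while (m.find()) {
            System.out.println("doc " + sd.doc + " matched: " + m.group());
        }
    }
}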

Configuring the tokenisation of the search term in an elasticsearch query

I am doing a general search against elasticsearch (1.7) using a match query against a number of specified fields. This is done in a Java app with one box to enter search terms in. Various search options are allowed (for example, surrounding a phrase with quotes to look for the phrase rather than the component words). This means I am doing full text searches.
All is well except my account refs have forward slashes in them and a search on an account ref produces thousands of results. If I surround the account ref with quotes I get just the result I want. I assume an account ref of AC/1234/A01 is searching for [AC OR 1234 OR A01]. Initially I thought this was a regex issue but I don’t think it is.
I raised a similar question a while ago and one suggestion which I had thought worked was to add "analyzer": "keyword" to the query (in my code
queryStringQueryBuilder.analyzer("keyword")
).
The problem with this is that many of the other fields searched are not keyword and it is stopping a lot of flexible search options working (case sensitivity etc). I assume this has become something along the lines of an exact match in the text search.
I've looked at this the wrong way around for a while now and as I see it I can't fix it in the index or even in the general analyser settings as even if the account ref field is tokenised and analysed perfectly for my requirement the search will still search all the other fields for [AC OR 1234 OR A01].
Is there a way of configuring the search query to not split the account number on forward slashes? I could test ignoring all punctuation if it is possible to split only on whitespace, although I would prefer not to make such a radical change...
So I guess what I am asking is whether there is another built-in analyzer which would still do a full text search but would not split the search term up on punctuation? If not, is this something I could do with a custom analyzer (without applying it to the index itself)?
Thanks.
The simplest way to do it is by replacing / with some character that doesn't cause the word to be split into two tokens, but doesn't interfere with your other terms (_, ., ' should work), or to remove / completely using a mapping char filter. There is a similar example here: https://stackoverflow.com/a/23640832/783043
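A minimal sketch of that first suggestion, done in the Java app before the term reaches queryStringQueryBuilder (the variable names here are made up):

// Keep account refs like AC/1234/A01 together as one token by replacing
// the slashes before the text is analysed; '_' is not a token separator
// for the standard analysis.
String rawSearchTerm = "AC/1234/A01";
String sanitized = rawSearchTerm.replace('/', '_');
// sanitized is what gets passed on to the query string query.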

Lucene Reverse Search

I have a list of keywords in my database, for example: Java Program, PHP program, etc. I index these keywords using Lucene. When I search with a text longer than the keywords (indexed words), how will I get a match? For example, if I search for "My Java Program is better than yours", I would expect a match because I have indexed the keyword "Java Program". How do I do this efficiently using Lucene? If not Lucene, what else can I use for this kind of job?
Please note, I don't want to match on the independent keywords "java" and "program". I want a match on "Java Program" (as one keyword, just as I indexed it).
Thank you.
If you have indexed your keywords with a StandardAnalyzer, then you could query them quite effectively with a query string like this:
My Java Program is better than yours.
Which, unless quoted or something like that, is effectively interpreted as seven term queries (fewer after removing stopwords), so it will match when looking for "java" and when looking for "program".
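Roughly, this is what that interpretation looks like (a sketch assuming Lucene 5+ and a made-up field name "keywords"):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ParseDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("keywords", new StandardAnalyzer());
        Query q = parser.parse("My Java Program is better than yours");
        // Prints an OR of individual term queries, roughly:
        // keywords:my keywords:java keywords:program keywords:better keywords:than keywords:yours
        // so a document indexed from "Java Program" matches on "java" and "program".
        System.out.println(q);
    }
}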

Keyword (OR, AND) search in Lucene

I am using Lucene in my portal (J2EE based) for indexing and search services.
The problem is about the keywords of Lucene. When you use one of them in the search query, you'll get an error.
For example:
searchTerms = "ik OR jij"
This works fine, because it will search for "ik" or "jij"
searchTerms = "ik AND jij"
This works fine, it searches for "ik" and "jij"
But when you search:
searchTerms = "OR"
searchTerms = "AND"
searchTerms = "ik OR"
searchTerms = "OR ik"
Etc., it will fail with an error:
Component Name: STSE_RESULTS Class: org.apache.lucene.queryParser.ParseException Message: Cannot parse 'OR jij': Encountered "OR" at line 1, column 0.
Was expecting one of:
...
It makes sense, because these words are keywords for Lucene; they are reserved and act as operators.
In Dutch, the word "OR" is important because it stands for "Ondernemings Raad" (works council). It is used in many texts and it needs to be found. For example, searching for "or" does work, but it does not return texts matching the term "OR". How can I make it searchable?
How can I escape the keyword "or"? Or How can I tell Lucene to treat "or" as a search term NOT as a keyword.
I suppose you have tried putting the "OR" into double quotes?
If that doesn't work I think you might have to go so far as to change the Lucene source and then recompile the whole thing, as the operator "OR" is buried deep inside the code. Actually, compiling probably isn't even enough: you'll have to change the file QueryParser.jj in the source package that serves as input for JavaCC, then run JavaCC, then recompile the whole thing.
The good news, however, is that there's only one line to change:
| <OR: ("OR" | "||") >
becomes
| <OR: ("||") >
That way, you'll have only "||" as logical OR operator. There is a build.xml that also contains the invocation of JavaCC, but you have to download that tool yourself. I can't try it myself right now, I'm afraid.
This is perhaps a good question for the Lucene developer mailing list, but please let us know if you do that and they come up with a simpler solution ;-)
OR, NOT and AND are reserved keywords. I solved this problem just 2 days ago by lower-casing those 3 words in the user's search term before feeding it into the lucene query parser. Note that if you search and replace for these keywords make sure you use word boundaries (\b) so you don't end up changing words such as ANDROID and ORDER.
I then let the user specify NOT and AND by using - and +, just like Google does.
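In code, that sanitising step is just a few word-boundary replacements (the variable name is made up):

// Lower-case the reserved operators only when they appear as whole words,
// so ANDROID and ORDER are left untouched; + and - remain available for
// users who really want boolean behaviour.
String sanitized = userQuery
        .replaceAll("\\bOR\\b", "or")
        .replaceAll("\\bAND\\b", "and")
        .replaceAll("\\bNOT\\b", "not");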
Escaping OR and AND with double quotes works for me. So try with a Java string like
String query = "field:\"AND\"";
I have read your question many times! =[
Please look at these suggestions.
How is your index stored? The Fields in a Document can be stored as:
1) Stored 2) Tokenized 3) Indexed 4) Vector
It can make a significant difference.
Please use Luke; it can tell you how your indexes are actually stored.
Luke is a must-have if you are working with Lucene, as it gives you a real idea of how indexes are stored. It also offers search. Try it and let us know with your update!
You're probably doing something wrong when you're building the query. I'll second Narayan's suggestion on getting Luke (as posted in the comments) and try running your queries with that. It has been a little while since I used Lucene, but I don't remember ever having issues with OR and AND.
Other than that, you can try escaping the input strings using QueryParser.escape(userQuery)
More On Escaping
You can escape the "OR" when it's a search term, or write your own query parser for a different syntax. Lucene offers an extensive query API in addition to the parser, with which you support your own query syntax quite easily.

string matching algorithms used by lucene

I want to know the string matching algorithms used by Apache Lucene. I have been going through the index file format used by Lucene given here. It seems that Lucene stores all words occurring in the text as-is, with their frequency of occurrence in each document.
But as far as I know, for efficient string matching it would need to preprocess the words occurring in the documents.
Example:
search for "iamrohitbanga is a user of stackoverflow" (use fuzzy matching)
in some documents.
It is possible that there is a document containing the string "rohit banga".
To find that the substrings rohit and banga are present in the search string, it would use some efficient substring matching.
I want to know which algorithm it is, and also, if it does some preprocessing, which function call in the Java API triggers it.
As Yuval explained, in general Lucene is geared at exact matches (by normalizing terms with analyzers at both index and query time).
In the Lucene trunk code (not any released version yet) there is in fact suffix tree usage for inexact matches such as Regex, Wildcard, and Fuzzy.
The way this works is that a Lucene term dictionary itself is really a form of a suffix tree. You can see this in the file formats that you mentioned in a few places:
Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
The term info index gives us "random access" by indexing this tree at certain intervals (every 128th term by default).
So at a low level it is a suffix tree, but at the higher level we exploit these properties (mainly the ones specified in IndexReader.terms) to treat the term dictionary as a deterministic finite state automaton (DFA):
Returns an enumeration of all terms starting at a given term. If the given term does not exist, the enumeration is positioned at the first term greater than the supplied term. The enumeration is ordered by Term.compareTo(). Each term is greater than all that precede it in the enumeration.
Inexact queries such as Regex, Wildcard, and Fuzzy are themselves also defined as DFAs, and the "matching" is simply DFA intersection.
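As a small illustration of that enumeration contract, using the old TermEnum API that the quoted javadoc belongs to (Lucene 3.x; "reader" is assumed to be an open IndexReader and the field name is made up):

// Seek into the sorted term dictionary at "bo" and walk forward in order.
TermEnum termEnum = reader.terms(new Term("content", "bo"));
try {
    do {
        Term t = termEnum.term();
        if (t == null || !"content".equals(t.field())) {
            break;
        }
        System.out.println(t.text()); // e.g. "bone", then "boy", ...
    } while (termEnum.next());
} finally {
    termEnum.close();
}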
The basic design of Lucene uses exact string matches, or defines equivalent strings using an Analyzer. An analyzer breaks text into indexable tokens. During this process, it may collate equivalent strings (e.g. upper and lower case, stemmed strings, remove diacritics etc.)
The resulting tokens are stored in the index as a dictionary plus a posting list of the tokens in documents. Therefore, you can build and use a Lucene index without ever using a string-matching algorithm such as KMP.
However, FuzzyQuery and WildCardQuery use something similar, first searching for matching terms and then using them for the full match. Please see Robert Muir's Blog Post about AutomatonQuery for a new, efficient approach to this problem.
As you pointed out, Lucene stores only the list of terms that occurred in documents. How Lucene extracts these words is up to you. The default Lucene analyzer simply breaks words separated by spaces. You could write your own implementation that, for example, for the source string 'iamrohitbanga' yields 5 tokens: 'iamrohitbanga', 'i', 'am', 'rohit', 'banga'.
Please look at the Lucene API docs for the TokenFilter class.
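A bare-bones skeleton of such a filter (hypothetical: the actual sub-word splitting, e.g. a dictionary lookup, is only indicated by a comment, and as written the filter passes tokens through unchanged):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SubWordFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public SubWordFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Inspect termAtt here and, if it is a concatenation like "iamrohitbanga",
        // buffer the sub-words ("i", "am", "rohit", "banga") and emit them as
        // extra tokens on subsequent calls.
        return true;
    }
}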
