Elasticsearch wildcard query with ASCII folding - Java

I am searching names with a wildcard query. It works fine, but searches that contain accented (non-ASCII) characters do not work well.
For example, when a user searches for "Hélè*", it is not able to find anything.
Note that I have already created an analyzer that applies ASCII folding and lowercasing to the name field.
The same search works fine with query_string. Does that mean the wildcard query does not apply ASCII folding to the search text, while query_string does?
If so, is there any way to get a wildcard search with ASCII folding?
Any help will be greatly appreciated.
Thanks,
Mohsin

Try using a field query with analyze_wildcard set to true.
By default, Elasticsearch doesn't analyze the text in wildcard queries; it only lowercases it for some queries. Because of this, your query searches for all terms that start with hélè, and there are no such terms in your index because the ASCII folding filter removed the accents at index time.
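If analyze_wildcard is not an option, another approach is to fold the search text on the client before building the wildcard query. The following is only a minimal sketch in plain Java: the class and method names are made up for illustration, and java.text.Normalizer merely approximates the ASCII folding filter (it strips combining accents, but letters that do not decompose, such as ø or ł, are left unchanged).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of any Elasticsearch API.
final class WildcardTermFolder {

    // Approximates "asciifolding + lowercase" for accented Latin letters.
    static String fold(String term) {
        // NFD splits an accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // Drop the combining marks, then lowercase like the index-time analyzer.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The argument is the asker's input: H, e-acute, l, e-grave, then '*'.
        System.out.println(fold("H\u00e9l\u00e8*"));
    }
}
```

The wildcard characters * and ? pass through untouched, so the folded string can be used directly as the wildcard pattern.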

In Solr there is a ReversedWildcardFilterFactory, which is used at index time. When it is used, a query that contains a wildcard character is not converted to ASCII; otherwise it is converted and searched using ASCII. You can define it after ASCIIFoldingFilterFactory.
I don't know whether something similar exists in Lucene, but you can write your own FilterFactory by looking at its source code.
You may also find this document useful.

Related

Why does Solr ClientUtils::escapeQueryChars escape spaces

Solr query syntax has some special characters that need to be escaped: +-&|!(){}[]^"~*?:/.
SolrJ provides a utility method, ClientUtils::escapeQueryChars, which escapes more characters, including ; and whitespace.
This caused a bug in my application when the search term contained a space: foo bar was turned into foo\ bar by ClientUtils::escapeQueryChars. My solution is to split the search term, escape each term, and join them with AND or OR.
But it's still a pain to write extra code just to handle spaces.
Is there any special reason that spaces and ; are also escaped by this utility method?
In Solr (and Lucene), characters can have different meanings in the query syntax depending on which query parser you're using (for example standard, dismax, edismax, etc.).
So when and what to escape depends on which query parser you're using and which query you're trying to run. I know this seems too broad as an answer, so I'll add an example to make things clearer.
For example, let's use edismax as the query parser and take a document with a field named tv_display of type string.
If you write:
http://localhost:8983/solr/buybox/select?q=tv_display:Full HD
edismax will convert the query into +tv_display:Full +tv_display:HD.
This way you'll never find the documents where tv_display is exactly Full HD, only the documents where tv_display is Full and/or HD (whether it is "and" or "or" depends on your mm configuration).
ClientUtils::escapeQueryChars will convert Full HD into Full\ HD:
http://localhost:8983/solr/buybox/select?q=tv_display:Full\ HD
Now edismax takes the entire string as a single token, and only this way are all the documents where tv_display is Full HD returned.
In conclusion, ClientUtils::escapeQueryChars escapes every character (spaces and semicolons included) that could be misinterpreted by a query parser.
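To make the two behaviours concrete, here is a small dependency-free sketch. The escape method is a stand-in written for this example and only mimics what ClientUtils::escapeQueryChars is described as doing above; in a real project you would call the SolrJ method itself. The escapeTerms method is the asker's workaround (split, escape each term, join with an operator). The class and method names are made up.

```java
// Illustration only: a stand-in for SolrJ's escaping plus the asker's workaround.
final class SolrTermEscaper {

    // Mimics the described behaviour: query-syntax characters, ';' and
    // whitespace all get a backslash in front of them.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if ("\\+-!():^[]\"{}~*?|&;/".indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // The workaround from the question: treat each whitespace-separated
    // term separately and join the escaped terms with AND or OR.
    static String escapeTerms(String input, String operator) {
        StringBuilder sb = new StringBuilder();
        for (String term : input.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ').append(operator).append(' ');
            }
            sb.append(escape(term));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Full HD"));            // one token for a string field
        System.out.println(escapeTerms("foo bar", "AND")); // separate terms for a text field
    }
}
```

Which of the two you want depends on the field: escaping the space suits a string field that holds the whole value, while splitting suits an analyzed text field.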

Hibernate functions lower and upper not working on Polish special characters

I have a problem with a lower and upper function in JPA (Hibernate). In my application a user should add a new item to the database, but the name should be unique. In order to achieve that, I need to compare user entered string with the strings in the database and ignore case while checking that.
Unfortunately, as I am using the Hibernate function to make all data upper-cased (in order to compare that) everything works fine except for the Polish special characters that remain the same.
This is the code I've used for testing purposes in order to check if it works:
TypedQuery<String> query = em.createQuery("SELECT upper(i.name) FROM Item i", String.class);
for (String name : query.getResultList()) {
    System.out.println(name);
}
And that's what I get:
CZYSTY BANDAż
MAłY CHEMIK
MAłY MECHANIK
SPRZęT
ŚPIWóR
ŚRODEK DEZYNFEKUJąCY
ŚRODEK CZYSZCZąCY
All letters should be upper-cased. In the database, the first letter of the first word is always capitalized. The problem concerns characters such as ą, ę, ż, ź, ó, ł: they should become Ą, Ę, Ż, Ź, Ó, Ł, but Hibernate does not seem to recognize them as the same character differing only in case.
The same thing happens when I use the lower function. Polish characters are not affected at all and remain the same.
I do not know whether this concerns only Polish characters or characters from other languages too.
I would be very grateful for any hint in this matter.
EDIT: I'm using Hibernate 5.2.2 Final with SQLite database and driver Xerial 3.8.11.2.
EDIT2: The same happens if I try to achieve that using native SQL query with Hibernate.
I've already found the solution. It turned out that SQLite doesn't support Unicode collation. Its lower and upper functions and its sorting only handle ASCII Latin characters.
There is an extension (the SQLite ICU Extension) that SQLite must be compiled with in order to use Unicode collation (or other collations), but as far as I'm concerned that is not as simple a solution as I would like. I've decided to change the database to H2, which supports Unicode collation by default without any modifications, and it works like a charm now :)
So it's not Hibernate's fault, but SQLite's. Thank you very much for your help :)
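For anyone who has to stay on SQLite, one possible alternative (not what I did) is to do the case folding in Java rather than in SQL, since Java's own case mapping is Unicode-aware. The sketch below only shows the comparison; the class and method names are hypothetical, and how the names get loaded or stored (for example in an extra normalized column) is left open.

```java
import java.util.Locale;

// Hypothetical helper: case-insensitive name comparison done in the JVM
// instead of relying on the database's upper()/lower().
final class NameComparer {

    // Upper-cases with a fixed locale so the result does not depend on
    // the machine's default locale.
    static String normalize(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    static boolean sameNameIgnoringCase(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // "ma\u0142y" contains the Polish letter l-with-stroke (U+0142).
        System.out.println(normalize("ma\u0142y chemik"));
    }
}
```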

Search database table with all special characters

I have a project table with a project name column, and the name may contain any special characters, alphanumeric values, or any combination of numbers, words, and special characters.
Now I need to add a keyword search on that column, and the search text may also contain any special characters.
So my question is: how can we search for single or multiple special characters in the database?
I am using MySQL 5.0 with the Java Hibernate API.
This should be possible with some simple sanitization of your query.
E.g., a search for \#(%*#$\ becomes:
SELECT * FROM foo WHERE name LIKE "%\\#(\%*#$\\%";
When evaluated, the backslashes act as escapes, so the search ends up matching anything that contains "\#(%*#$\".
In general, anything that's a special character in a string can be escaped with a backslash. This only really becomes tricky if you have a name such as "\\foo\\bar\\", which would become "\\\\foo\\\\bar\\\\" when escaped properly.
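Since the question mentions Java, the sanitization step could look roughly like the sketch below. It assumes MySQL's default LIKE escape character (backslash) and that the resulting pattern is bound as a query parameter (e.g. WHERE name LIKE ?) rather than concatenated into the SQL text, so only the LIKE level of escaping is handled here. It also escapes _, which is a single-character wildcard in LIKE. The class and method names are made up.

```java
// Hypothetical helper for building a "contains" pattern for SQL LIKE.
final class LikeEscaper {

    // Escapes the characters that are special inside a LIKE pattern.
    // The backslash must be handled first so that the backslashes added
    // for % and _ are not escaped a second time.
    static String escapeLike(String raw) {
        return raw.replace("\\", "\\\\")
                  .replace("%", "\\%")
                  .replace("_", "\\_");
    }

    // Wraps the escaped text in wildcards: matches any value containing it.
    static String containsPattern(String raw) {
        return "%" + escapeLike(raw) + "%";
    }

    public static void main(String[] args) {
        // The search text from the example above: \#(%*#$\
        System.out.println(containsPattern("\\#(%*#$\\"));
    }
}
```

Binding the pattern as a parameter also avoids having to double the backslashes again for the SQL string literal.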
A side note: please proofread your posts before finalizing them. It's really depressing and shows a lack of effort when your question's title has spelling errors in it.

Lucene Reverse Search

I have a list of keywords in my database, for example: Java Program, Php program, etc. I index these keywords using Lucene. When I search for a text that is longer than the indexed keywords, how do I get a match? For example, I am searching for "My Java Program is better than yours". I would expect a match because I have indexed the keyword "Java Program". How can I do this efficiently using Lucene? If not Lucene, what else can I use for this kind of job?
Please note, I don't want to match on independent keywords "java" and "program". I want a match on "Java Program" (as one keyword just as I indexed).
Thank you.
If you have indexed your keywords with a StandardAnalyzer, then you could query them quite effectively with a query string like this:
My Java Program is better than yours.
Unless it is quoted or something like that, this is effectively interpreted as 7 queries (fewer after removing stopwords), so it will match when looking for "java" and when looking for "program".
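Note that this matches on the individual words, which the question explicitly wants to avoid. If the keyword list fits in memory, a different technique that does not use Lucene at all is to break the long text into word sequences and look each one up in a set of keyword phrases. The following is a rough sketch of that idea only; the class name and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical in-memory matcher: finds whole keyword phrases inside a longer text.
final class KeywordPhraseMatcher {
    private final Set<String> keywords = new HashSet<>();
    private int maxWords = 1; // longest keyword, in words

    static KeywordPhraseMatcher of(String... phrases) {
        KeywordPhraseMatcher m = new KeywordPhraseMatcher();
        for (String phrase : phrases) {
            List<String> words = tokenize(phrase);
            if (!words.isEmpty()) {
                m.keywords.add(String.join(" ", words));
                m.maxWords = Math.max(m.maxWords, words.size());
            }
        }
        return m;
    }

    // Returns every keyword phrase that occurs as consecutive words in the text.
    List<String> findIn(String text) {
        List<String> words = tokenize(text);
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= words.size(); len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + len - 1));
                if (keywords.contains(phrase.toString())) {
                    hits.add(phrase.toString());
                }
            }
        }
        return hits;
    }

    // Lowercases and splits on anything that is not a letter or digit.
    private static List<String> tokenize(String s) {
        List<String> words = new ArrayList<>();
        for (String w : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }
}
```

With the keywords "Java Program" and "Php program", the text "My Java Program is better than yours" yields the single hit "java program", while a text that only contains "java" and "program" separately yields none.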

Keyword (OR, AND) search in Lucene

I am using Lucene in my portal (J2EE based) for indexing and search services.
The problem is about the keywords of Lucene. When you use one of them in the search query, you'll get an error.
For example:
searchTerms = "ik OR jij"
This works fine, because it will search for "ik" or "jij"
searchTerms = "ik AND jij"
This works fine, it searches for "ik" and "jij"
But when you search:
searchTerms = "OR"
searchTerms = "AND"
searchTerms = "ik OR"
searchTerms = "OR ik"
Etc., it will fail with an error:
Component Name: STSE_RESULTS Class: org.apache.lucene.queryParser.ParseException Message: Cannot parse 'OR jij': Encountered "OR" at line 1, column 0.
Was expecting one of:
...
It makes sense, because these words are Lucene keywords; they are probably reserved and will act as operators.
In Dutch, the word "OR" is important because it stands for "Ondernemings Raad". It is used in many texts, and it needs to be found. For example, "or" does work, but does not return texts matching the term "OR". How can I make it searchable?
How can I escape the keyword "or"? Or how can I tell Lucene to treat "or" as a search term, NOT as a keyword?
I suppose you have tried putting the "OR" into double quotes?
If that doesn't work I think you might have to go so far as to change the Lucene source and then recompile the whole thing, as the operator "OR" is buried deep inside the code. Actually, compiling probably isn't even enough: you'll have to change the file QueryParser.jj in the source package that serves as input for JavaCC, then run JavaCC, then recompile the whole thing.
The good news, however, is that there's only one line to change:
| <OR: ("OR" | "||") >
becomes
| <OR: ("||") >
That way, you'll have only "||" as logical OR operator. There is a build.xml that also contains the invocation of JavaCC, but you have to download that tool yourself. I can't try it myself right now, I'm afraid.
This is perhaps a good question for the Lucene developer mailing list, but please let us know if you do that and they come up with a simpler solution ;-)
OR, NOT and AND are reserved keywords. I solved this problem just 2 days ago by lower-casing those 3 words in the user's search term before feeding it into the lucene query parser. Note that if you search and replace for these keywords make sure you use word boundaries (\b) so you don't end up changing words such as ANDROID and ORDER.
I then let the user specify NOT and AND by using - and +, just like Google does.
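The lower-casing step can be sketched with a regular expression as below. The class and method names are made up for the example, and it assumes your analyzer also lower-cases terms at index time; otherwise a lower-cased "or" will not match an indexed "OR".

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing step applied before the query parser sees the input.
final class ReservedWordNeutralizer {

    // \b keeps the match on whole words, so ANDROID and ORDER are left alone.
    private static final Pattern RESERVED = Pattern.compile("\\b(AND|OR|NOT)\\b");

    static String neutralize(String userInput) {
        Matcher m = RESERVED.matcher(userInput);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Replace the operator with its lower-case form, which the
            // parser treats as an ordinary term.
            m.appendReplacement(sb, m.group(1).toLowerCase(Locale.ROOT));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(neutralize("OR ik"));
        System.out.println(neutralize("ANDROID ORDER AND jij"));
    }
}
```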
Escaping OR and AND with double quotes works for me. So try with a Java string like
String query = "field:\"AND\"";
I have read your question many times! =[
Please look at these suggestions.
How is your index stored?
A document's fields can be stored as:
1) Stored 2) Tokenized 3) Indexed 4) Vector
It can make a significant difference.
Please use Luke; it can tell you how your indexes are actually stored.
Luke is a must-have if you are working with Lucene, as it gives you a real idea of how indexes are stored. It also offers search. Try it and let us know with your update!
You're probably doing something wrong when you're building the query. I'll second Narayan's suggestion on getting Luke (as posted in the comments) and try running your queries with that. It has been a little while since I used Lucene, but I don't remember ever having issues with OR and AND.
Other than that, you can try escaping the input strings using QueryParser.escape(userQuery).
More On Escaping
You can escape the "OR" when it's a search term, or write your own query parser for a different syntax. Lucene offers an extensive query API in addition to the parser, with which you can support your own query syntax quite easily.
