Lucene French search - java

I was using Lucene indexing/searching and ran into a problem. Lucene returns some results for the query "word" and others for "l'word" ("word" is just an example). I need it to return all of those results for either of the two queries.
I tried changing the StandardAnalyzer to a FrenchAnalyzer, but it still doesn't recognize these words as the same.
Changing every "l'word" to "word" in the index and in the search string is also not an option: we need the original string to be displayed in the results.
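For what it's worth, Lucene ships an ElisionFilter for exactly this case (FrenchAnalyzer applies one internally, so re-indexing with it should normally fold "l'word" to "word"), and since analysis only affects the indexed terms, the stored field still returns the original "l'word" for display. A minimal sketch of an elision-aware custom analyzer, assuming Lucene 4.x-style APIs and reusing FrenchAnalyzer's article set:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.ElisionFilter;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, reader);
        // Strips French articles before an apostrophe: "l'word" becomes "word"
        TokenStream stream = new ElisionFilter(tokenizer, FrenchAnalyzer.DEFAULT_ARTICLES);
        stream = new LowerCaseFilter(Version.LUCENE_47, stream);
        return new TokenStreamComponents(tokenizer, stream);
    }
};
The same analyzer must be used at both index and query time, and the index has to be rebuilt for the change to take effect.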

Related

Lucene Search SuggestWords()

I used the Apache Lucene library to write a search method.
public static List<String> suggestWords(String word, Directory directory, String field) {
    // implementation omitted in the question
}
Searching for "Text" returns:
[Text]
Searching for "text" returns:
[Next, Text, Heat, Sent, Test, Texts]
Has anyone here ever worked with this library? I would like to understand why, when I search for Text, I get the right word(s), but when I search for text the first suggested word is Next and not Text. Should I always upper-case the first letter of the word before calling suggestWords?
Thank you !
In the Apache Lucene library, indexed terms (and field names) are case-sensitive. That could explain the difference you see between Text and text.
To avoid the issue, you might normalize the input with String.toLowerCase() (or toUpperCase(), as you said) so the query matches the case of the indexed terms.
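For reference, one plausible shape for such a method, using the SpellChecker from Lucene's spellcheck module. This is a sketch only: the original implementation was omitted, the suggestion count is arbitrary, and it assumes the spell index in 'directory' was built beforehand (e.g. from a LuceneDictionary over 'field') with lower-cased terms:
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;

public static List<String> suggestWords(String word, Directory directory, String field) throws IOException {
    // 'directory' is assumed to hold a spell-check index built earlier,
    // e.g. via spellChecker.indexDictionary(new LuceneDictionary(reader, field), ...)
    SpellChecker spellChecker = new SpellChecker(directory);
    try {
        // Lower-case the input so "Text" and "text" hit the same indexed terms
        return Arrays.asList(spellChecker.suggestSimilar(word.toLowerCase(Locale.ROOT), 6));
    } finally {
        spellChecker.close();
    }
}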

Lucene fuzzy search on entire text

In Lucene, I can use fuzzy search to get 'similar' results.
For example, the following query:
text:awesome~0.8
will find documents containing 80%-similar terms, such as 'awesom'.
My question is: can I use fuzzy search on an entire text (multiple words)?
For example, I want to find texts that are 80% similar to the following:
this is my text with multiple words
Putting a fuzzy clause on each word would not give me the desired results:
text:(+this~0.8 +is~0.8 +my~0.8 +text~0.8 +with~0.8 +multiple~0.8 +words~0.8)
as it would return only those documents which have all the words (or 80%-similar words for each word) specified in the query.
I expect the query to return results where the entire string is 80% similar (even if it is missing an entire word), for example:
this is text with multiple words
Something like this -
text:(+this +is +my +text +with +multiple +words)~0.8
Obviously the above query gives a syntax error, but I need to get results based on similarity over the entire text/phrase.
I am happy to use Java API classes for this purpose as I need to use it in a Java program.
I am not sure that floating-point similarity for fuzzy queries is allowed anymore in Lucene. From Lucene 4.0 onward, FuzzyQuery supports a maximum edit distance of 2.
Let's assume you want an edit distance of 2. You can use the KeywordAnalyzer while indexing your field; this will not tokenize your field values. While searching, you can use a FuzzyQuery whose term contains the full text (see the sketch after the list below).
Limitations of this solution:
The maximum edit distance is 2.
We are assuming that whatever you look up is the full value of that field. For example, if your indexed value is "this is my text", you cannot get the document by searching for "this is ny" (with its deliberate typo), because dropping the word "text" alone already costs more than 2 edits. You can get the document if you query it as "this is ny text".
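A minimal sketch of that approach, assuming Lucene 5.x-style constructors (the field name and sample values are made up):
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory dir = new RAMDirectory();
// KeywordAnalyzer keeps the whole field value as a single token
try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
    Document doc = new Document();
    doc.add(new TextField("text", "this is my text", Field.Store.YES));
    writer.addDocument(doc);
}
try (DirectoryReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    // One whole-string term, allowed to be up to 2 edits away
    FuzzyQuery query = new FuzzyQuery(new Term("text", "this is ny text"), 2);
    TopDocs hits = searcher.search(query, 10);
}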

Proper Analyzer to Use in Lucene for 'case-insensitive, contains' matching

I am using Lucene to create an index of search items in a Java servlet.
The user enters text on a webpage and an ajax request is made to the servlet to get any strings that match the query string. The results are used to populate an autocomplete menu on the webpage.
Currently the Lucene code only sends back matches if the user enters a whole word. I want it to return results even if only one letter matches an item in the index. In other words, how do I get the Lucene code to match the whole input string, regardless of how short the input string is? Do I need to change the Analyzer being used? I am using the StandardAnalyzer:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
Matching on single letters generally defeats the purpose of an inverted text engine, and none of the standard analyzers will do that. If you insist, you can use the NGramTokenizer (http://lucene.apache.org/core/4_8_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html) with min and max set to 1. You will need to build your own Analyzer object, but that's a good idea anyway.
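A sketch of such an analyzer, assuming the Lucene 4.7 API from the question (the single-character gram settings are the point of the example):
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Emit every single character as a token (minGram = maxGram = 1)
        NGramTokenizer tokenizer = new NGramTokenizer(Version.LUCENE_47, reader, 1, 1);
        // Lower-case at both index and query time for case-insensitive matching
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_47, tokenizer);
        return new TokenStreamComponents(tokenizer, stream);
    }
};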
Based on clarification from a comment that the OP wishes to match across whitespace boundaries:
This is not a job for an inverted index. An inverted index works by indexing all the strings that can match. To match an input against all arbitrary-length substrings would require a gigantic index, and would be too slow. You need something else entirely.

Lucene: Mining email addresses, names, and identifiers from an index

I have a lucene index with approx. 1 million documents. From these documents, I want to mine
email addresses
signatures - ( [whitespace]/s/[whitespace]john doe[whitespace] )
specific identifiers from each of the documents (that follow a regex pattern "\s[0-9]{3}[a-zA-Z0-9]{6}\s").
I understand that doing this at index build time, ideally with Solr, would be much easier, but how can one do it from an already-built Lucene index?
I am using Java. For the email address search, I tried .setAllowLeadingWildcard(true) and then searched for @ to find all email addresses, but I actually got zero results. If I search for @ in Luke I get zero results. If I search for @hotmail.com in Luke, I get a bunch of results with valid email addresses such as aaaaa@hotmail.com.
The index was created using StandardAnalyzer. Not sure if it matters, but the text is in UTF-8 I believe.
Any helpful suggestions or pointers are great! Note this is not for a front end, so the query doesn't have to be near-realtime.
Analysis does matter, yes. The StandardAnalyzer will treat whitespace and punctuation, such as @, as places to split the input into tokens. As such, you wouldn't expect to see any of them actually present in the indexed data.
You can use Lucene's regex query (RegexpQuery), particularly for the third case. A PhraseQuery seems appropriate for the second, I think, though I'm more than slightly confused about what you are trying to accomplish there.
Generally, you might want to use a different analyzer for an email field, in order to keep each address as a single token. You should still get reasonable results searching for a particular email address: although the analyzer removes the punctuation, searching for the (usually) three tokens of an email address consecutively as a phrase would be expected to match well. However, a regex search like \w*@\w*\.\w* won't be particularly effective, since the punctuation isn't actually indexed and searchable, and a regex search doesn't span multiple terms in the index. Apart from searching for a known set of email domains, or something of that nature, you would want to re-index using analysis more in line with how you need to search it in order to do what you are asking.
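For the third case, a sketch of a RegexpQuery (the field name is an assumption). Lucene's regexp syntax matches single indexed terms, so the \s boundaries from the original pattern are unnecessary, and the character class is lower-case only because StandardAnalyzer lower-cases terms:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RegexpQuery;

// Matches indexed terms of the form: 3 digits followed by 6 alphanumerics
RegexpQuery idQuery = new RegexpQuery(new Term("content", "[0-9]{3}[a-z0-9]{6}"));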

Lucene Index problems with "-" character

I'm having trouble with a Lucene index which has indexed words that contain "-" characters.
It works for some words that contain "-" but not for all, and I can't find the reason why it's not working.
The field I'm searching in is analyzed and contains the version of the word with and without the "-" character.
I'm using the analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer
Here is an example:
If I search for "gsx-*" I get a result; the indexed field contains
"SUZUKI GSX-R 1000 GSX-R1000 GSXR"
but if I search for "v-*" I get no result. The indexed field of the expected result contains:
"SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM"
If I search for "v-strom" without "*" it works, but if I just search for "v-str", for example, I don't get the result. (There should be a result, because this is for a live search on a webshop.)
So, what's the difference between the two expected results? Why does it work for "gsx-*" but not for "v-*"?
StandardAnalyzer will treat the hyphen as whitespace, I believe. So it turns your query "gsx-*" into "gsx*", and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.
So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or just starting from the two mentioned and composing them with further TokenFilters. A very good explanation is given in the Lucene Analysis package Javadoc.
BTW there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string both in the index and while parsing the query.
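A sketch of such a hand-rolled analyzer, assuming the Lucene 4.x API:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split only on whitespace, so "v-strom" survives as a single token
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_47, reader);
        // Normalize case so "V-Strom" and "v-strom" index and query identically
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_47, tokenizer);
        return new TokenStreamComponents(tokenizer, stream);
    }
};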
ClassicAnalyzer handles '-' as a useful, non-delimiter character. As I understand it, ClassicAnalyzer handles '-' like the pre-3.1 StandardAnalyzer, because ClassicAnalyzer uses ClassicTokenizer, which treats numbers with an embedded '-' as a product code, so the whole thing is tokenized as one term.
When I was at the Regenstrief Institute I noticed this after upgrading Luke: the LOINC standard medical terms (LOINC was initiated by R.I.) are identified by a number followed by a '-' and a check digit, like '1-8' or '2857-1'. My searches for LOINCs like '45963-6' failed using StandardAnalyzer in Luke 3.5.0, but succeeded with ClassicAnalyzer (and this was because we had built the index with Lucene.NET 2.9.2).
(Based on Lucene 4.7) StandardTokenizer splits hyphenated words in two, for example "chat-room" into "chat" and "room", and indexes the two words separately instead of as a single whole word. It is quite common for separate words to be connected with a hyphen: “sport-mad,” “camera-ready,” “quick-thinking,” and so on. A significant number are hyphenated names, such as “Emma-Claire.” When doing a whole-word search or query, users expect to find the word within those hyphenated forms. While there are some cases where the parts really are separate words, which is why Lucene keeps the hyphen out of the default definition.
To make StandardAnalyzer support the hyphen, you have to make changes in StandardTokenizerImpl.java, which is a class generated by JFlex.
Refer to this link for the complete guide.
You have to add the following line in SUPPLEMENTARY.jflex-macro, which is included by the StandardTokenizerImpl.jflex file:
MidLetterSupp = ( [\u002D] )
After making the change, provide the StandardTokenizerImpl.jflex file as input to the JFlex engine and generate; the output will be StandardTokenizerImpl.java.
Then rebuild the index using that regenerated class.
The ClassicAnalyzer is recommended for indexing text containing product codes like 'GSX-R1000'. It will recognize this as a single term and will not split up its parts. But, for example, the text 'Europe/Berlin' will be split up by the ClassicAnalyzer into the words 'Europe' and 'Berlin'. This means if you have a text indexed by the ClassicAnalyzer containing the phrase
Europe/Berlin GSX-R1000
you can search for "europe", "berlin" or "GSX-R1000".
But be careful which analyzer you use for the search. I think the best choice for searching a Lucene index is the KeywordAnalyzer. With the KeywordAnalyzer you can also search for specific fields in a document, and you can build complex queries like:
(processid:4711) (berlin)
This query will find documents containing the word 'berlin', but also documents with a field 'processid' containing the number 4711.
But if you search the index for the phrase "europe/berlin" you will get no result! This is because the KeywordAnalyzer does not change your search phrase, while the phrase 'Europe/Berlin' was split into two separate terms by the ClassicAnalyzer. This means you have to search for 'europe' and 'berlin' separately.
To solve this conflict you can translate a search term, entered by the user, into a search query that fits your needs using the following code:
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
QueryParser parser = new QueryParser("content", new ClassicAnalyzer());
Query result = parser.parse(searchTerm);
searchTerm = result.toString("content");
This code will translate the search phrase
Europe/Berlin
into
europe berlin
which will result in the expected document set.
Note: This will also work for more complex situations. The search term
Europe/Berlin GSX-R1000
will be translated into:
(europe berlin) GSX-R1000
which will search correctly for all phrases in combination using the KeywordAnalyzer.
