Lucene how to switch between case sensitive and case insensitive - java

I want to give the user the option to do a case-sensitive or case-insensitive search.
My idea is to use a case-sensitive analyzer to index the data and then use either the case-sensitive or a case-insensitive analyzer at search time, depending on user input.
So I created my case-sensitive analyzer, and here is a sample of my code:
public final class CaseSensitiveStandardAnalyzer extends StopwordAnalyzerBase {
@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new StopFilter(matchVersion, tok, stopwords);
return new TokenStreamComponents(src, tok) {
@Override
protected void setReader(final Reader reader) throws IOException {
src.setMaxTokenLength(CaseSensitiveStandardAnalyzer.this.maxTokenLength);
super.setReader(reader);
}
};
}
}
For indexing I used this:
Analyzer analyzer = new CaseSensitiveStandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter indexWriter = new IndexWriter(indexDir, config);
indexWriter.addDocument(document);
For searching I used:
Analyzer analyzer;
if (caseSensitive)
analyzer = new CaseSensitiveStandardAnalyzer(Version.LUCENE_46);
else
analyzer = new StandardAnalyzer(Version.LUCENE_46);
QueryParser queryParser = new QueryParser(Version.LUCENE_46, "content", analyzer);
Query query = queryParser.parse(searchString);
//Search
TopDocs results = indexSearcher.search(query, 10000);
ScoreDoc[] hits = results.scoreDocs;
When I tried this, the case-sensitive search worked, but the case-insensitive one didn't.
After more research, I found that indexing with a case-sensitive analyzer and searching with a lower-casing query analyzer will not work: an index built with a case-sensitive analyzer works with case-sensitive queries, and an index built with a case-insensitive analyzer works with case-insensitive queries. Can anyone confirm this?
It seems to me the only reliable way to support both case-sensitive and case-insensitive search is to index twice, once for each case. Is this correct?

"It seems to me the only reliable way to support both case-sensitive and case-insensitive search is to index twice, once for each case. Is this correct?"
That would be a possible solution, but there are more efficient solutions for this use case: https://stackoverflow.com/a/2490441/867816
This might help, too: http://www.hascode.com/2014/07/lucene-by-example-specifying-analyzers-on-a-per-field-basis-and-writing-a-custom-analyzertokenizer/
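For reference, the gist of the linked answer is to index the same text into two fields, one analyzed case-sensitively and one lowercased, and pick the field (plus a matching query analyzer) at search time. Here is a minimal sketch against the Lucene 4.6 API used above, using PerFieldAnalyzerWrapper (org.apache.lucene.analysis.miscellaneous); the field names content_cs/content_ci are my own invention, and CaseSensitiveStandardAnalyzer is the class from the question:
Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("content_cs", new CaseSensitiveStandardAnalyzer(Version.LUCENE_46));
perField.put("content_ci", new StandardAnalyzer(Version.LUCENE_46));
Analyzer indexAnalyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_46), perField);
// Index the same text into both fields.
Document doc = new Document();
doc.add(new TextField("content_cs", text, Field.Store.NO));
doc.add(new TextField("content_ci", text, Field.Store.YES));
IndexWriter indexWriter = new IndexWriter(indexDir, new IndexWriterConfig(Version.LUCENE_46, indexAnalyzer));
indexWriter.addDocument(doc);
indexWriter.close();
// At search time, pick the field and the matching analyzer.
String field = caseSensitive ? "content_cs" : "content_ci";
Analyzer queryAnalyzer = caseSensitive ? new CaseSensitiveStandardAnalyzer(Version.LUCENE_46) : new StandardAnalyzer(Version.LUCENE_46);
QueryParser queryParser = new QueryParser(Version.LUCENE_46, field, queryAnalyzer);
Query query = queryParser.parse(searchString);
This still stores the text twice, but inside a single index, and switching between the two behaviors is just a field name at query time.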

Related

How to create Custom Analyzer in Lucene, with custom stop/common words from file

I'm trying to create a custom analyzer in Lucene 8.3.0 that uses stemming and filters the given text using custom stop words from a file.
To be clear, I don't want to use the default stop word list and add some words to it; I want to filter using only the set of stop words from a stopWords.txt file.
How can I do this?
This is what I have written until now, but I am not sure if it is right
public class MyAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new StandardTokenizer();
TokenStream tokenStream = new StandardFilter(tokenizer);
tokenStream = new LowerCaseFilter(tokenStream);
tokenStream = new StopFilter(tokenStream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
// Adding Porter stemming
tokenStream = new PorterStemFilter(tokenStream);
return new TokenStreamComponents(tokenizer, tokenStream);
}
}
First of all, I am not sure if the structure is correct; for now I am using the stop set from StopAnalyzer just to test it (however, it's not working).
You need to read the file and parse it into a CharArraySet to pass into the filter. StopFilter has some built-in methods you can use to convert a List of Strings to a CharArraySet, like:
...
CharArraySet stopset = StopFilter.makeStopSet(myStopwordList);
tokenStream = new StopFilter(tokenStream, stopset);
...
WordlistLoader is documented as being for internal purposes, so fair warning about relying on this class, but if you don't want to handle parsing your file into a list yourself, you could use it to parse your stopword file into a CharArraySet, something like:
...
CharArraySet stopset = WordlistLoader.getWordSet(myStopfileReader);
tokenStream = new StopFilter(tokenStream, stopset);
...
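For completeness, here is a minimal sketch of the whole analyzer against Lucene 8.x, assuming stopWords.txt is UTF-8 with one word per line. Note that StandardFilter no longer exists in 8.x (it had become a no-op and was removed), so it is left out:
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {
    private final CharArraySet stopSet;

    public MyAnalyzer(Path stopwordFile) throws IOException {
        // Parse the one-word-per-line stop word file into a CharArraySet.
        try (Reader reader = Files.newBufferedReader(stopwordFile, StandardCharsets.UTF_8)) {
            stopSet = WordlistLoader.getWordSet(reader);
        }
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(tokenizer);
        tokenStream = new StopFilter(tokenStream, stopSet); // only the words from the file
        tokenStream = new PorterStemFilter(tokenStream);    // Porter stemming
        return new TokenStreamComponents(tokenizer, tokenStream);
    }
}
Usage would then be something like new MyAnalyzer(Paths.get("stopWords.txt")).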

Unable to identify error in Lucene MoreLikeThis

I need to use Lucene MoreLikeThis to find similar documents given a paragraph of text. I am new to Lucene and followed the code here.
I have already indexed the documents in the directory "C:\Users\lucene_index_files\v2".
I am using "They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP." as the document for which I want to find similar documents.
public class LuceneSearcher2 {
public static void main(String[] args) throws IOException {
LuceneSearcher2 m = new LuceneSearcher2();
System.out.println("1");
m.start();
System.out.println("2");
//m.writerEntries();
m.findSilimar("They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP.");
System.out.println("3");
}
private Directory indexDir;
private StandardAnalyzer analyzer;
private IndexWriterConfig config;
public void start() throws IOException{
//analyzer = new StandardAnalyzer(Version.LUCENE_42);
//config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
analyzer = new StandardAnalyzer();
config = new IndexWriterConfig(analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);
indexDir = new RAMDirectory(); //don't write on disk
//https://stackoverflow.com/questions/36542551/lucene-in-java-method-not-found?rq=1
indexDir = FSDirectory.open(FileSystems.getDefault().getPath("C:\\Users\\lucene_index_files\\v2")); //write on disk
//System.out.println(indexDir);
}
private void findSimilar(String searchForSimilar) throws IOException {
IndexReader reader = DirectoryReader.open(indexDir);
IndexSearcher indexSearcher = new IndexSearcher(reader);
System.out.println("2a");
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setMinTermFreq(0);
mlt.setMinDocFreq(0);
mlt.setFieldNames(new String[]{"title", "content"});
mlt.setAnalyzer(analyzer);
System.out.println("2b");
StringReader sReader = new StringReader(searchForSimilar);
//Query query = mlt.like(sReader, null);
//Throws error - The method like(String, Reader...) in the type MoreLikeThis is not applicable for the arguments (StringReader, null)
Query query = mlt.like("computer");
System.out.println("2c");
System.out.println(query.toString());
TopDocs topDocs = indexSearcher.search(query,10);
for ( ScoreDoc scoreDoc : topDocs.scoreDocs ) {
Document aSimilar = indexSearcher.doc( scoreDoc.doc );
String similarTitle = aSimilar.get("title");
String similarContent = aSimilar.get("content");
System.out.println("====similar finded====");
System.out.println("title: "+ similarTitle);
System.out.println("content: "+ similarContent);
}
System.out.println("2d");
}}
I am unsure as to what is causing the system to not generate any output.
What is your output? I am assuming you're not finding similar documents. The reason could be that the query you are creating is empty.
First of all, to run your code in a meaningful way, this line
Query query = mlt.like(sReader, null);
needs the field name as its first argument; per the compiler error in your comment, the signature is like(String fieldName, Reader... readers), so it should look like this:
Query query = mlt.like("content", sReader);
Now, in order to use MoreLikeThis in Lucene, your stored fields have to store term vectors, i.e. setStoreTermVectors(true) must be set when creating the fields, for instance like this:
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setStoreTermVectors(true);
fieldType.setTokenized(true);
Field contentField = new Field("contents", this.getBlurb(), fieldType);
doc.add(contentField);
Leaving this out could result in an empty query and consequently no results for the query.
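Putting both fixes together, a sketch of what the indexing and query side could look like; the field names follow the question, everything else (variable names, the 10-hit limit) is illustrative:
// Indexing: both fields store term vectors so MoreLikeThis can use them.
FieldType fieldType = new FieldType(TextField.TYPE_STORED);
fieldType.setStoreTermVectors(true);
Document doc = new Document();
doc.add(new Field("title", titleText, fieldType));
doc.add(new Field("content", contentText, fieldType));
indexWriter.addDocument(doc);
// Querying: like(String fieldName, Reader... readers) builds the query
// from free text for the given field.
Query query = mlt.like("content", new StringReader(searchForSimilar));
TopDocs topDocs = indexSearcher.search(query, 10);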

How to extend Lucene's StandardAnalyzer for custom special character treatment?

I'm using Lucene's StandardAnalyzer for a specific index property.
As special characters like àéèäöü do not get indexed as expected, I want to replace these characters:
à -> a
é -> e
è -> e
ä -> ae
ö -> oe
ü -> ue
What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?
I was looking for a way to have the standard parser iterate over all tokens (words) so that I can retrieve them word by word and do the magic there.
Thanks for any hints.
I would propose using MappingCharFilter, which lets you define a map of Strings to be replaced by other Strings, so it fits your requirements perfectly.
Some additional info - https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
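A sketch of how that could look, wrapping the mappings from the question into a custom Analyzer via initReader; the class name and the (minimal) token filter chain are my own choices, and the map below only covers the lowercase variants, so uppercase umlauts would need entries of their own:
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class UmlautFoldingAnalyzer extends Analyzer {
    private static final NormalizeCharMap CHAR_MAP;
    static {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("à", "a");
        builder.add("é", "e");
        builder.add("è", "e");
        builder.add("ä", "ae");
        builder.add("ö", "oe");
        builder.add("ü", "ue");
        CHAR_MAP = builder.build();
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // The char filter rewrites the raw character stream before tokenization.
        return new MappingCharFilter(CHAR_MAP, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer src = new StandardTokenizer();
        TokenStream tok = new LowerCaseFilter(src);
        return new TokenStreamComponents(src, tok);
    }
}
Because the replacement happens before tokenization, both indexing and querying through this analyzer see "muehle" for "mühle", so the two forms match each other.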
You wouldn't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, which you would have to override anyway, so you wouldn't gain much from extending it.
Instead, you can copy the StandardAnalyzer source and modify the createComponents method. For what you are asking, I recommend adding ASCIIFoldingFilter, which attempts to convert UTF characters (such as accented letters) into their ASCII equivalents. So you could create an analyzer something like this:
Analyzer analyzer = new Analyzer() {
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
final StandardTokenizer src = new StandardTokenizer();
src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
TokenStream tok = new StandardFilter(src);
tok = new LowerCaseFilter(tok);
tok = new ASCIIFoldingFilter(tok); /*Adding it before the StopFilter would probably be most helpful.*/
tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
return new TokenStreamComponents(src, tok) {
@Override
protected void setReader(final Reader reader) {
src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
super.setReader(reader);
}
};
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream result = new StandardFilter(in);
result = new LowerCaseFilter(result);
result = new ASCIIFoldingFilter(result);
return result;
}
};

How to give umlauts more weight in lucene?

I have a custom Analyzer for names. I'd like to give similar umlaut-matches more weight. Is that possible?
@Override
protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
VERSION = Version.LUCENE_4_9;
final Tokenizer source = new StandardTokenizer(VERSION, reader);
TokenStream result = new StandardFilter(VERSION, source);
result = new LowerCaseFilter(VERSION, result);
result = new ASCIIFoldingFilter(result);
return new TokenStreamComponents(source, result);
}
Example query:
input: "Zur Mühle"
output (equal scores): "Zur Linde", "Zur Muehle".
Of course I'd like to get "Zur Muehle" as the top result. But how can I tell Lucene to score umlaut matches higher?
One way to do that is to use payloads to boost terms containing umlauts. Please ask for further clarification if you need more details on using payloads.
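To make that concrete, here is a hedged sketch for the 4.x API used in the question. The idea: a filter placed before the ASCIIFoldingFilter (i.e. while the umlauts are still visible) attaches a payload to umlaut-bearing tokens; at query time a PayloadTermQuery (org.apache.lucene.search.payloads) combined with a custom Similarity whose scorePayload decodes the float then ranks those matches higher. The class name and the boost value 2.0f are illustrative:
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Tags tokens that still contain umlauts with a payload encoding a boost.
// Insert into the chain BEFORE the ASCIIFoldingFilter:
//   result = new UmlautPayloadFilter(result);
//   result = new ASCIIFoldingFilter(result);
public final class UmlautPayloadFilter extends TokenFilter {
    private static final BytesRef BOOST = new BytesRef(PayloadHelper.encodeFloat(2.0f));
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public UmlautPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        if (termAtt.toString().matches(".*[äöüÄÖÜ].*")) {
            payloadAtt.setPayload(BOOST); // the term gets folded later, the payload survives
        }
        return true;
    }
}
A document whose original text contained the umlaut form then carries the payload on the folded term, and the custom Similarity can reward exactly those matches.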

how to search a file with lucene

I want to search for a query within a file "fdictionary.txt" containing a list of words (230,000 words) written line by line. Any suggestions as to why this code is not working?
The spell checking part works and gives me a list of suggestions (I limited the length of the list to 1). What I want to do is search that dictionary first and, if the word is already in there, not call spell checking at all. My search function is not working; it does not even give me an error. Here is what I have implemented:
public class SpellCorrection {
public static File indexDir = new File("/../idxDir");
public static void main(String[] args) throws IOException, FileNotFoundException, CorruptIndexException, ParseException {
Directory directory = FSDirectory.open(indexDir);
SpellChecker spell = new SpellChecker(directory);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_20, null);
File dictionary = new File("/../fdictionary00.txt");
spell.indexDictionary(new PlainTextDictionary(dictionary), config, true);
String query = "red"; //kne, console
String correctedQuery = query; //kne, console
if (!search(directory, query)) {
String[] suggestions = spell.suggestSimilar(query, 1);
if (suggestions != null) {correctedQuery=suggestions[0];}
}
System.out.println("The Query was: "+query);
System.out.println("The Corrected Query is: "+correctedQuery);
}
public static boolean search(Directory directory, String queryTerm) throws FileNotFoundException, CorruptIndexException, IOException, ParseException {
boolean isIn = false;
IndexReader indexReader = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
Term term = new Term(queryTerm);
Query termQuery = new TermQuery(term);
TopDocs hits = indexSearcher.search(termQuery, 100);
System.out.println(hits.totalHits);
if (hits.totalHits > 0) {
isIn = true;
}
return isIn;
}
}
Where are you indexing the content from fdictionary00.txt?
You can search using IndexSearcher only when you have an index. If you are new to Lucene, you might want to check some quick tutorials (like http://lucenetutorial.com/lucene-in-5-minutes.html).
You never built the index.
You need to set up the index...
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
You then need to create a document and add each term to the document as an analyzed field:
Document doc = new Document();
doc.add(new Field("name", word, Field.Store.YES, Field.Index.ANALYZED));
Then add the document to the index:
writer.addDocument(doc);
writer.optimize();
Now commit the changes and close the index writer:
writer.commit();
writer.close();
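Tying this back to the question, the missing glue is a loop that reads fdictionary00.txt line by line and indexes each word. A sketch using the same 3.x-style API as above; the field name "name" follows the example:
BufferedReader lines = new BufferedReader(new FileReader("/../fdictionary00.txt"));
String word;
while ((word = lines.readLine()) != null) {
    // One document per dictionary word.
    Document doc = new Document();
    doc.add(new Field("name", word.trim(), Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
}
lines.close();
// then optimize, commit and close as shown above
Note that the TermQuery in the question also needs a field name, e.g. new Term("name", queryTerm), so it looks up the term in the field that was actually indexed.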
You could make your SpellChecker instance available in a service and use spellChecker.exist(word).
Be aware that the SpellChecker will not index words of 2 characters or less. To get around this you can add them to the index after you have created it (add them into the SpellChecker.F_WORD field).
If you want to add words to your live index and make them available for exist(word), then you will need to add them to the SpellChecker.F_WORD field. Of course, because you're not adding them to all the other fields such as gram/start/end etc., your word will not appear as a suggestion for other misspelled words.
In that case you'd also have to add the word to your file, so that when you re-create the index it becomes available as a suggestion. It would be great if the project made SpellChecker.createDocument(...) public/protected rather than private, as that method accomplishes everything involved in adding words.
After all this, you need to call spellChecker.setSpellIndex(directory).
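As a sketch, the main method from the question could then drop the manual search entirely and do something along these lines:
// Ask the SpellChecker's own index first; only fall back to
// suggestions when the word is unknown.
String correctedQuery = query;
if (!spell.exist(query)) {
    String[] suggestions = spell.suggestSimilar(query, 1);
    if (suggestions != null && suggestions.length > 0) {
        correctedQuery = suggestions[0];
    }
}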
