How to give umlauts more weight in lucene? - java

I have a custom Analyzer for names. I'd like to give similar umlaut-matches more weight. Is that possible?
#Override
protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
VERSION = Version.LUCENE_4_9;
final Tokenizer source = new StandardTokenizer(VERSION, reader);
TokenStream result = new StandardFilter(VERSION, source);
result = new LowerCaseFilter(VERSION, result);
result = new ASCIIFoldingFilter(result);
return new TokenStreamComponents(source, result);
}
Example query:
input: "Zur Mühle"
outpt (equal scores): "Zur Linde", "Zur Muehle".
Of course I'd like to get the "Zur Muehle" as top result. But how can I tell lucene to scope umlaut matches more?

One way to do that is use payloads to boost terms containing umlauts. Please ask for further clarification if you need more details on using payloads.

Related

How to create Custom Analyzer in Lucene, with custom stop/common words from file

I m trying to create a custom analyzer in Lucene 8.3.0 that uses stemming and filters the given text using custom stop words from a file.
To be more clear, I don't want to use the default stop words filter and add some words on it, I want to filter using only a set of stop words from a stopWords.txt file.
How can I do this?
This is what I have written until now, but I am not sure if it is right
public class MyAnalyzer extends Analyzer{
//public class MyAnalyzer extends Analyzer {
#Override
protected TokenStreamComponents createComponents(String fieldName) {
// public TokenStream tokenStream(String fieldName, Reader reader) {
Tokenizer tokenizer = new StandardTokenizer();
TokenStream tokenStream = new StandardFilter(tokenizer);
tokenStream = new LowerCaseFilter(tokenStream);
tokenStream = new StopFilter(tokenStream,StopAnalyzer.ENGLISH_STOP_WORDS_SET);
//Adding Porter Stemming filtering
tokenStream = new PorterStemFilter(tokenStream);
//return tokenStream;
return new TokenStreamComponents(tokenizer, tokenStream);
}
}
First of all I am not sure if the structure is correct and for now I am using the StopFilter from StopAnalyzer just to test it (however it's not working).
You need to read the file and parse it to a CharArraySet to pass into the filter. StopFilter has some built in methods you can use to convert a List of Strings to a CharArraySet, like:
...
CharArraySet stopset = StopFilter.makeStopSet(myStopwordList);
tokenStream = new StopFilter(tokenStream, stopset);
...
It's listed as for internal purposes, so fair warning about relying on this class, but if you don't want to have to handle parsing your file to a list, you could use WordListLoader to parse your stopword file into a CharArraySet, something like:
...
CharArraySet stopset = WordlistLoader.getWordSet(myStopfileReader);
tokenStream = new StopFilter(tokenStream, stopset);
...

How to test a Lucene Analyzer?

I'm not getting the expected results from my Analyzer and would like to test the tokenization process.
The answer this question: How to use a Lucene Analyzer to tokenize a String?
List<String> result = new ArrayList<String>();
TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));
try {
while(stream.incrementToken()) {
result.add(stream.getAttribute(TermAttribute.class).term());
}
}
catch(IOException e) {
// not thrown b/c we're using a string reader...
}
return result;
Uses the TermAttribute to extract the tokens from the stream. The problem is that TermAttribute is no longer in Lucene 6.
What has it been replaced by?
What would the equivalent be with Lucene 6.6.0?
I'm pretty sure it was replaced by CharTermAttribute javadoc
The ticket is pretty old, but maybe the code was kept around a bit longer:
https://issues.apache.org/jira/browse/LUCENE-2372

How to extend Lucene's StandardAnalyzer for custom special character treatment?

I'm using Lucene's StandardAnalyzer for a specific index property.
As special characters like àéèäöü do not get indexed as expected, I want to replace these characters:
à -> a
é -> e
è -> e
ä -> ae
ö -> oe
ü -> ue
What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzerclass?
I was looking for a way where the standard parser iterates over all tokens (words) and I can retrieve word by word and do the magic there.
Thanks for any hints.
I would propose to use MappingCharFilter, that will allow to have a map of Strings that will be replaces by Strings, so it will fit your requirements perfectly.
Some additional info - https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
You wouldn't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, and you would have to override that anyway, so you wouldn't get much from extending it anyway.
Instead, you can copy the StandardAnalyzer source, and modify the createComponents method. For what you are asking for, I recommend adding ASCIIFoldingFilter, which will attempt to convert UTF characters (such as accented letters) into ASCII equivalents. So you could create an analyzer something like this:
Analyzer analyzer = new Analyzer() {
#Override
protected TokenStreamComponents createComponents(final String fieldName) {
final StandardTokenizer src = new StandardTokenizer();
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new StandardFilter(src);
tok = new LowerCaseFilter(tok);
tok = new ASCIIFoldingFilter(tok); /*Adding it before the StopFilter would probably be most helpful.*/
tok = new StopFilter(tok, StandardAnalyzer.ENGLISH_STOP_WORDS_SET);
return new TokenStreamComponents(src, tok) {
#Override
protected void setReader(final Reader reader) {
src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
super.setReader(reader);
}
};
}
#Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream result = new StandardFilter(in);
result = new LowerCaseFilter(result);
tok = new ASCIIFoldingFilter(tok);
return result;
}
}

Lucene how to switch between case sensitive and case insensitive

I want to give user the option to do case case sensitive or case insensitive search.
My idea is use a case sensitive analyzer to index the data and then use sensitive or insensitive analyzer to search depending on user input.
So I created my case sensitive analyzer and here is a simple of my code:
public final class CaseSensitiveStandardAnalyzer extends StopwordAnalyzerBase {
#Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new StopFilter(matchVersion, tok, stopwords);
return new TokenStreamComponents(src, tok) {
#Override
protected void setReader(final Reader reader) throws IOException {
src.setMaxTokenLength(CaseSensitiveStandardAnalyzer.this.maxTokenLength);
super.setReader(reader);
}
};
}
For indexing I used this:
Analyzer analyzer = new CaseSensitiveStandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46,analyzer);
IndexWriter indexWriter = new IndexWriter(indexDir,config);
indexWriter.addDocument(document);
For searching I used:
Analyzer analyzer;
if(caseSentive)
analyzer = new CaseSensitiveStandardAnalyzer(Version.LUCENE_46);
else
analyzer = new StandardAnalyzer(Version.LUCENE_46);
QueryParser queryParser = new QueryParser(Version.LUCENE_46,"content", analyzer);
Query query = queryParser.parse(searchString);
//Search
TopDocs results = indexSearcher.search(query,10000);
ScoreDoc[] hits = results.scoreDocs;
When I tired this, the sensitive case worked, but the insensitive case didn't.
After more researching, I found that using a case-sensitive analyzer with a lower-care query will not work. Case-sensitive analyzer indexed work with case-sensitive query and case-insensitive analyzer indexed work with case-insensitive query, can anyone confirm this?
It seems to me the only reliable way to search both case-sensitive and case-insensitive is to index twice, one for each case, is this correct?
It seems to me the only reliable way to search both case-sensitive and case-insensitive is to index twice, one for each case, is this correct?
That would be a possible solution, but there are more optimal solutions for that use case: https://stackoverflow.com/a/2490441/867816
This might help, too: http://www.hascode.com/2014/07/lucene-by-example-specifying-analyzers-on-a-per-field-basis-and-writing-a-custom-analyzertokenizer/

How to get a Token from a Lucene TokenStream?

I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address my question.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29
Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.
Can anyone explain how to get token-like information from a TokenStream?
Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
int startOffset = offsetAttribute.startOffset();
int endOffset = offsetAttribute.endOffset();
String term = termAttribute.term();
}
Edit: The new way
According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
int startOffset = offsetAttribute.startOffset();
int endOffset = offsetAttribute.endOffset();
String term = charTermAttribute.toString();
}
This is how it should be (a clean version of Adam's answer):
TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
System.out.println(cattr.toString());
}
stream.end();
stream.close();
For the latest version of lucene 7.3.1
// Test the tokenizer
Analyzer testAnalyzer = new CJKAnalyzer();
String testText = "Test Tokenizer";
TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
try {
ts.reset(); // Resets this stream to the beginning. (Required)
while (ts.incrementToken()) {
// Use AttributeSource.reflectAsString(boolean)
// for token stream debugging.
System.out.println("token: " + ts.reflectAsString(true));
System.out.println("token start offset: " + offsetAtt.startOffset());
System.out.println(" token end offset: " + offsetAtt.endOffset());
}
ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
ts.close(); // Release resources associated with this stream.
}
Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html
There are two variations in the OP question:
What is "the process to obtain Tokens from a TokenStream"?
"Can anyone explain how to get token-like information from a TokenStream?"
Recent versions of the Lucene documentation for Token say (emphasis added):
NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.
And TokenStream says its API:
... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.
The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way using attributes. Reading through the documentation, the Lucene developers suggest that this change was made, in part, to reduce the number of individual objects created at a time.
But as some people have pointed out in the comments of those answers, they don't directly answer #1: how do you get a Token if you really want/need that type?
With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute just like the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question, they simply didn't show it.
It is important that while this approach will allow you to access Token while you're looping, it is still only a single object no matter how many logical tokens are in the stream. Every call to incrementToken() will change the state of the Token returned from addAttribute; So if your goal is to build a collection of different Token objects to be used outside the loop then you will need to do extra work to make a new Token object as a (deep?) copy.

Categories