I'm using Lucene's StandardAnalyzer for a specific index property.
As special characters like àéèäöü do not get indexed as expected, I want to replace these characters:
à -> a
é -> e
è -> e
ä -> ae
ö -> oe
ü -> ue
What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?
I was looking for a way to have the standard parser iterate over all tokens (words), so I can retrieve each word and do the magic there.
Thanks for any hints.
I would propose using MappingCharFilter, which lets you supply a map of Strings to be replaced by other Strings, so it will fit your requirements perfectly.
Some additional info - https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
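For illustration, a minimal sketch of wiring the mappings above into an analyzer via MappingCharFilter might look like this (this is an assumed setup, not code from the question; the token chain is deliberately simplified rather than the full StandardAnalyzer chain):
NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
builder.add("à", "a");
builder.add("é", "e");
builder.add("è", "e");
builder.add("ä", "ae");
builder.add("ö", "oe");
builder.add("ü", "ue");
final NormalizeCharMap charMap = builder.build();

Analyzer analyzer = new Analyzer() {
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // The char filter rewrites the input before it ever reaches the tokenizer.
        return new MappingCharFilter(charMap, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Simplified chain for illustration; add further filters as needed.
        StandardTokenizer src = new StandardTokenizer();
        TokenStream tok = new LowerCaseFilter(src);
        return new TokenStreamComponents(src, tok);
    }
};
Because the mapping is applied as a char filter, it runs before tokenization, so every downstream filter already sees, for example, "muehle" instead of "mühle".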
You wouldn't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, and you would have to override that in any case, so you wouldn't get much from extending it anyway.
Instead, you can copy the StandardAnalyzer source and modify the createComponents method. For what you are asking for, I recommend adding ASCIIFoldingFilter, which will attempt to convert Unicode characters (such as accented letters) into their ASCII equivalents. So you could create an analyzer something like this:
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        final StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        TokenStream tok = new StandardFilter(src);
        tok = new LowerCaseFilter(tok);
        tok = new ASCIIFoldingFilter(tok); // Adding it before the StopFilter would probably be most helpful.
        tok = new StopFilter(tok, StandardAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) {
                src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream result = new StandardFilter(in);
        result = new LowerCaseFilter(result);
        result = new ASCIIFoldingFilter(result);
        return result;
    }
};
I want to tokenize my text. I use tokenStream from StandardAnalyzer, but by default it applies "toLowerCase".
My code:
ArrayList<String> toTextWord = new ArrayList<>();
Analyzer analyzer = new StandardAnalyzer();

try (TokenStream stream = analyzer.tokenStream("tags", new StringReader(iterStr))) {
    stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        CharTermAttribute token = stream.getAttribute(CharTermAttribute.class);
        System.out.println(token.toString());
        toTextWord.add(token.toString());
    }
} catch (Exception e) {
    e.printStackTrace();
}
How can I use StandardAnalyzer without "toLowerCase"? How can I turn off "toLowerCase" in this StandardAnalyzer?
You cannot turn off toLowerCase directly in the StandardAnalyzer.
You can create a custom analyzer which behaves the same way as the StandardAnalyzer, and then customize it to meet your needs:
Example using org.apache.lucene.analysis.custom.CustomAnalyzer:
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("lowercase")
.addTokenFilter("stop")
.build();
Now you can comment out (or remove) the lowercase token filter:
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("stop")
.build();
Note that if you want to exactly match the default Standard Analyzer, then you should also comment out or remove the stop-word filter, since by default stopwords are not removed from the Standard Analyzer unless you provide an explicit list.
That gives us this:
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.build();
If I use the following input with my custom analyzer:
String iterStr = "Eric the quick brown fox jumps over Freddy the lazy dog, LOL.";
then the output from your code is as follows:
Eric
the
quick
brown
fox
jumps
over
Freddy
the
lazy
dog
LOL
Update
When using the CustomAnalyzer you can use string values to identify the different tokenizer and filter objects - such as "standard" and "lowercase", as used in my examples above.
If you want to avoid using these identifiers, you can use the relevant factory object with the NAME field:
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer(StandardTokenizerFactory.NAME)
.addTokenFilter(LowerCaseFilterFactory.NAME)
.addTokenFilter(StopFilterFactory.NAME)
.build();
I'm trying to create a custom analyzer in Lucene 8.3.0 that uses stemming and filters the given text using custom stop words from a file.
To be more clear, I don't want to use the default stop words filter and add some words on it, I want to filter using only a set of stop words from a stopWords.txt file.
How can I do this?
This is what I have written so far, but I am not sure if it is right:
public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream tokenStream = new StandardFilter(tokenizer);
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new StopFilter(tokenStream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        // Adding Porter stemming filtering
        tokenStream = new PorterStemFilter(tokenStream);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }
}
First of all, I am not sure if the structure is correct, and for now I am using the stop word set from StopAnalyzer just to test it (however, it's not working).
You need to read the file and parse it to a CharArraySet to pass into the filter. StopFilter has some built in methods you can use to convert a List of Strings to a CharArraySet, like:
...
CharArraySet stopset = StopFilter.makeStopSet(myStopwordList);
tokenStream = new StopFilter(tokenStream, stopset);
...
It's documented as being for internal purposes, so fair warning about relying on this class, but if you don't want to handle parsing your file to a list yourself, you could use WordlistLoader to parse your stop word file into a CharArraySet, something like:
...
CharArraySet stopset = WordlistLoader.getWordSet(myStopfileReader);
tokenStream = new StopFilter(tokenStream, stopset);
...
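Putting the two pieces together, a minimal sketch of createComponents might look like the following (assuming a stopWords.txt file with one stop word per line, UTF-8 encoded; the file name and loading strategy are assumptions, not part of the original question):
@Override
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new StandardTokenizer();
    TokenStream tokenStream = new LowerCaseFilter(tokenizer);
    try (Reader stopReader = Files.newBufferedReader(Paths.get("stopWords.txt"), StandardCharsets.UTF_8)) {
        // Parse the file into a CharArraySet and use only those words as stop words.
        CharArraySet stopset = WordlistLoader.getWordSet(stopReader);
        tokenStream = new StopFilter(tokenStream, stopset);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
    tokenStream = new PorterStemFilter(tokenStream);
    return new TokenStreamComponents(tokenizer, tokenStream);
}
In practice you would probably load the stop set once (for example in the analyzer's constructor) rather than re-reading the file on every createComponents call.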
In my Lucene index I store names with special characters (e.g. Savić) in a field like the one described below.
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexed(true);
fieldType.setTokenized(false);

new Field("NAME", "Savić".toLowerCase(), fieldType);
I use a StopwordAnalyzerBase analyzer and Lucene Version.LUCENE_45.
If I search in the field exactly for "savić" it doesn't find it. How to deal with the special characters?
@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
    // these characters are not treated as token separators
    PatternTokenizer src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
    TokenStream tok = new StandardFilter(matchVersion, src);
    tok = new LowerCaseFilter(matchVersion, tok);
    tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
    return new TokenStreamComponents(src, tok) {
        @Override
        protected void setReader(final Reader reader) throws IOException {
            super.setReader(reader);
        }
    };
}
You have a couple of choices:
Try adding an ASCIIFoldingFilter:
src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new ASCIIFoldingFilter(tok);
tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
This takes a fairly simplistic approach, reducing non-ASCII characters such as Ä to their closest ASCII match (A, in this case), where a reasonable ASCII alternative exists. It won't apply any language-specific intelligence to determine the best replacements, though.
For something more linguistically intelligent, there are tools to handle this sort of thing in many of the language-specific packages. The GermanNormalizationFilter would be one example; it does similar things to the ASCIIFoldingFilter, but applies the rules in a way that is appropriate to the German language, such as 'ß' being replaced by 'ss'. You'd use it similarly to the above code:
src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new GermanNormalizationFilter(tok);
tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
I have a custom Analyzer for names. I'd like to give similar umlaut-matches more weight. Is that possible?
@Override
protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
    VERSION = Version.LUCENE_4_9;
    final Tokenizer source = new StandardTokenizer(VERSION, reader);
    TokenStream result = new StandardFilter(VERSION, source);
    result = new LowerCaseFilter(VERSION, result);
    result = new ASCIIFoldingFilter(result);
    return new TokenStreamComponents(source, result);
}
Example query:
input: "Zur Mühle"
output (equal scores): "Zur Linde", "Zur Muehle".
Of course I'd like to get "Zur Muehle" as the top result. But how can I tell Lucene to weight umlaut matches more?
One way to do that is use payloads to boost terms containing umlauts. Please ask for further clarification if you need more details on using payloads.
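As a rough sketch of the payload idea (the filter name and payload value below are made up for illustration, and you would still need a payload-aware query or similarity at search time to turn the payload into a boost), a filter along these lines could be inserted into the chain before the ASCIIFoldingFilter, while the terms still contain their umlauts:
public final class UmlautPayloadFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public UmlautPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        // Mark terms that still contain umlauts so they can be boosted at query time.
        if (term.indexOf('ä') >= 0 || term.indexOf('ö') >= 0 || term.indexOf('ü') >= 0) {
            payloadAtt.setPayload(new BytesRef(new byte[] { 1 }));
        }
        return true;
    }
}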
I am puzzled by the strange behavior of ShingleFilter in Lucene 4.6. What I would like to do is extract all possible bigrams from a sentence. So if the sentence is "this is a dog", I want "this is", "is a", "a dog".
What I see instead is:
"this this is"
"this is is"
"is is a"
"is a a"
"a a dog"
"a dog dog"
So somehow it replicates words and makes 3-grams instead of bigrams.
Here is my Java code:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer,2,2);
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
ShingleFilter theFilter = new ShingleFilter(tokenStream);
theFilter.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
theFilter.reset();
while (theFilter.incrementToken()) {
    System.out.println(charTermAttribute.toString());
}
theFilter.end();
theFilter.close();
Could someone please help me modify it so it actually outputs bigrams and does not replicate words? Thank you!
Natalia
I was able to produce the correct results using this code:
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
TokenStream tokenStream = new StandardFilter(Version.LUCENE_46, source);
ShingleFilter sf = new ShingleFilter(tokenStream);
sf.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = sf.addAttribute(CharTermAttribute.class);
sf.reset();
while (sf.incrementToken()) {
    System.out.println(charTermAttribute.toString());
}
sf.end();
sf.close();
And got, as expected, the 3 bigrams: "this is", "is a", "a dog".
The problem is you are applying two ShingleFilters. ShingleAnalyzerWrapper tacks one onto the StandardAnalyzer, and then you add another one explicitly. Since the ShingleAnalyzerWrapper uses its default behavior of outputting unigrams, you end up with the following tokens off of that first ShingleFilter:
this
this is
is
is a
a
a dog
dog
so when the second filter comes along (this time without unigrams), it simply combines each pair of consecutive tokens among those, leading to the result you are seeing.
So, either eliminate the ShingleAnalyzerWrapper, or the ShingleFilter added later. For instance, this should work:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
TokenStream tokenStream = analyzer.tokenStream("content", reader);
ShingleFilter theFilter = new ShingleFilter(tokenStream);
theFilter.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
theFilter.reset();
while (theFilter.incrementToken()) {
    System.out.println(charTermAttribute.toString());
}
theFilter.end();
theFilter.close();