In my Lucene index I store names with special characters (e.g. Savić) in a field like the one described below.
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexed(true);
fieldType.setTokenized(false);
new Field("NAME", "Savić".toLowerCase(), fieldType);
I use a StopwordAnalyzerBase analyzer and Lucene Version.LUCENE_45.
If I search the field for exactly "savić", it doesn't find it. How do I deal with the special characters? Here is my analyzer's createComponents method:
@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
PatternTokenizer src;
// these characters are not treated as separators
src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
return new TokenStreamComponents(src, tok) {
@Override
protected void setReader(final Reader reader) throws IOException {
super.setReader(reader);
}
};
}
You have a couple of choices:
Try adding an ASCIIFoldingFilter:
src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new ASCIIFoldingFilter(tok);
tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
This takes a fairly simplistic approach: it reduces non-ASCII characters, such as Ä, to their closest ASCII match (A, in this case), provided a reasonable ASCII alternative exists. It does not apply any language-specific intelligence to choose the best replacement, though.
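As a quick sanity check you can run the modified chain over the name and print the emitted tokens; with the ASCIIFoldingFilter in place, "Savić" should come out as savic. A minimal sketch, assuming your analyzer class is called MyAnalyzer and CharTermAttribute is imported from org.apache.lucene.analysis.tokenattributes:

// Print what the analysis chain emits for the problem name (hypothetical MyAnalyzer).
try (TokenStream stream = new MyAnalyzer().tokenStream("NAME", new StringReader("Savić"))) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(term.toString()); // expected: savic
    }
    stream.end();
}

Keep in mind that the same analyzer has to be applied at query time as well; otherwise a query containing the raw ć will never match the folded savic term in the index.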
For something more linguistically intelligent, there are tools for this sort of thing in many of the language-specific packages. The GermanNormalizationFilter is one example; it does similar things to the ASCIIFoldingFilter, but applies rules appropriate to the German language, such as replacing 'ß' with 'ss'. You'd use it similarly to the code above:
src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new GermanNormalizationFilter(tok);
tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
I'm trying to create a custom analyzer in Lucene 8.3.0 that uses stemming and filters the given text using custom stop words from a file.
To be clear, I don't want to use the default stop word set and add some words to it; I want to filter using only the set of stop words from a stopWords.txt file.
How can I do this?
This is what I have written so far, but I am not sure if it is right:
public class MyAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
// public TokenStream tokenStream(String fieldName, Reader reader) {
Tokenizer tokenizer = new StandardTokenizer();
TokenStream tokenStream = new StandardFilter(tokenizer);
tokenStream = new LowerCaseFilter(tokenStream);
tokenStream = new StopFilter(tokenStream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
//Adding Porter Stemming filtering
tokenStream = new PorterStemFilter(tokenStream);
//return tokenStream;
return new TokenStreamComponents(tokenizer, tokenStream);
}
}
First of all, I am not sure if the structure is correct, and for now I am using the StopFilter from StopAnalyzer just to test it (however, it's not working).
You need to read the file and parse it into a CharArraySet to pass to the filter. StopFilter has some built-in methods you can use to convert a List of Strings to a CharArraySet, like:
...
CharArraySet stopset = StopFilter.makeStopSet(myStopwordList);
tokenStream = new StopFilter(tokenStream, stopset);
...
It's marked as being for internal use, so fair warning about relying on this class, but if you don't want to handle parsing your file into a list yourself, you could use WordlistLoader to parse your stopword file into a CharArraySet, something like:
...
CharArraySet stopset = WordlistLoader.getWordSet(myStopfileReader);
tokenStream = new StopFilter(tokenStream, stopset);
...
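Putting it together, a minimal sketch of the whole analyzer might look like this (the UTF-8, one-word-per-line stopword file is an assumption, and stopWords.txt is the file name from your question; note that StandardFilter no longer exists in Lucene 8.x and had long been a no-op anyway, so it is simply dropped here; the WordlistLoader/CharArraySet imports assume the package locations used since Lucene 7):

import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {

    private final CharArraySet stopSet;

    public MyAnalyzer(String stopwordFile) throws IOException {
        // Parse the stopword file (one word per line) into a CharArraySet up front.
        try (Reader reader = Files.newBufferedReader(Paths.get(stopwordFile), StandardCharsets.UTF_8)) {
            stopSet = WordlistLoader.getWordSet(reader);
        }
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        stream = new StopFilter(stream, stopSet);
        stream = new PorterStemFilter(stream);
        return new TokenStreamComponents(tokenizer, stream);
    }
}

You would then construct it as new MyAnalyzer("stopWords.txt").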
I'm using Lucene's StandardAnalyzer for a specific index property.
As special characters like àéèäöü do not get indexed as expected, I want to replace these characters:
à -> a
é -> e
è -> e
ä -> ae
ö -> oe
ü -> ue
What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?
I was looking for a way to have the standard parser iterate over all tokens (words) so that I can process them word by word and do the magic there.
Thanks for any hints.
I would propose using MappingCharFilter, which lets you define a map of Strings to be replaced by other Strings, so it fits your requirements perfectly.
Some additional info - https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
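For example, here is a minimal sketch of an analyzer that applies exactly the mappings you listed via a MappingCharFilter before tokenization (FoldingAnalyzer is just an illustrative name; the LowerCaseFilter import assumes Lucene 7+, in 6.x it lives in org.apache.lucene.analysis.core):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class FoldingAnalyzer extends Analyzer {

    private static final NormalizeCharMap CHAR_MAP;

    static {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("à", "a");
        builder.add("é", "e");
        builder.add("è", "e");
        builder.add("ä", "ae");
        builder.add("ö", "oe");
        builder.add("ü", "ue");
        CHAR_MAP = builder.build();
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // The char filter rewrites the characters before the tokenizer ever sees them.
        return new MappingCharFilter(CHAR_MAP, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, stream);
    }
}

Because the replacement happens in a char filter, the same mapping is applied at both index and query time, as long as this analyzer is used for both.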
You wouldn't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, and you would have to override that anyway, so you wouldn't gain much from extending it.
Instead, you can copy the StandardAnalyzer source and modify the createComponents method. For what you are asking, I recommend adding ASCIIFoldingFilter, which attempts to convert non-ASCII characters (such as accented letters) into their ASCII equivalents. So you could create an analyzer something like this:
Analyzer analyzer = new Analyzer() {
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
final StandardTokenizer src = new StandardTokenizer();
src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
TokenStream tok = new StandardFilter(src);
tok = new LowerCaseFilter(tok);
tok = new ASCIIFoldingFilter(tok); /*Adding it before the StopFilter would probably be most helpful.*/
tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
return new TokenStreamComponents(src, tok) {
@Override
protected void setReader(final Reader reader) {
src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
super.setReader(reader);
}
};
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream result = new StandardFilter(in);
result = new LowerCaseFilter(result);
result = new ASCIIFoldingFilter(result);
return result;
}
};
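Note that ASCIIFoldingFilter folds ü to u rather than ue, so if you need the exact ae/oe/ue mappings from the question, the MappingCharFilter approach above is the better fit. Either way, you can check what the chain emits with a small test; a sketch, assuming the analyzer variable from the code above and CharTermAttribute from org.apache.lucene.analysis.tokenattributes:

// Inspect the tokens the analyzer produces for the accented input.
try (TokenStream ts = analyzer.tokenStream("field", new StringReader("àéèäöü"))) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString()); // with ASCIIFoldingFilter this should print: aeeaou
    }
    ts.end();
}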
I built a custom Lucene analyzer for my domain, but after getting unexpected results I decided to get the TokenStream for that analyzer and manually debug it.
By doing that I discovered that it doesn't seem to filter my stopwords, and I can't understand why.
Here's the text that I'm using as a test (random twitter item, sorry if it doesn't make any sense):
Ke #rosicata il #pareggio nel #recupero del #recupero di #vantaggiato del #Livorno a #Catania 1-1 #finale spegne i #sogni degli #etnei di #raggiungere i #playoff per la #promozione in #serieA #catanialivorno square squareformat iphoneography instagramapp uploaded:by=instagram
and these are my stopwords (stored in a file which is loaded by the analyzer):
square
squareformat
iphoneography
instagramapp
uploaded:by=instagram
Finally, here's the output (one line = one token):
rosicata
pareggio
recupero
recupero
vantaggiato
livorno
catania
finale
finale spegne
spegne
sogni
etnei
raggiungere
playoff
promozione
seriea
seriea catanialivorno
seriea catanialivorno square
catanialivorno
catanialivorno square
catanialivorno square squareformat
square
square squareformat
square squareformat iphoneography
squareformat
squareformat iphoneography
squareformat iphoneography instagramapp
iphoneography
iphoneography instagramapp
instagramapp
instagram
As you can see, the last lines contain what I thought to remove with my filter.
Here's the filter code:
@Override
protected TokenStreamComponents createComponents(String string) {
final StandardTokenizer src = new StandardTokenizer();
src.setMaxTokenLength(DEFAULT_MAX_TOKEN_LENGTH);
TokenStream tokenStream = new StandardFilter(src);
// From StandardAnalyzer
tokenStream = new LowerCaseFilter(tokenStream);
// Custom filters
// Filter emails, uris and numbers
Set<String> stopTypes = new HashSet<>();
stopTypes.add("<URL>");
stopTypes.add("<NUM>");
stopTypes.add("<EMAIL>");
tokenStream = new TypeTokenFilter(tokenStream, stopTypes);
// Non latin removal
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("\\P{InBasic_Latin}"), "", true);
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("^(?=.*\\d).+$"), "", true);
// Remove words containing www
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".*(www).*"), "", true);
// Remove special tags like uploaded:by=instagram
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".+:.+(=.+)?"), "", true);
// Remove words shorter than 3 characters
tokenStream = new LengthFilter(tokenStream, 3, 25);
// Stopwords
tokenStream = new StopFilter(tokenStream, stopwordsCollection);
// N-Grams
tokenStream = new ShingleFilter(tokenStream, 3);
// HACK - ShingleFilter uses fillers like _ for some reason and there's no way to disable it for now, so we replace all tokens containing _ with empty strings
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".*_.*"), "", true);
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("\\b(\\w+)\\s+\\1\\b"), "", true);
// Stopwords
tokenStream = new StopFilter(tokenStream, stopwordsCollection);
// Final trim
tokenStream = new TrimFilter(tokenStream);
// Set CharTerm attribute
tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.addAttribute(FuzzyTermsEnum.LevenshteinAutomataAttribute.class);
return new TokenStreamComponents(src, tokenStream) {
@Override
protected void setReader(final Reader reader) {
src.setMaxTokenLength(DEFAULT_MAX_TOKEN_LENGTH);
super.setReader(reader);
}
};
}
Just to double check, I've placed a breakpoint before the return of this method and stopwordsCollection, which is a CharArraySet, contains the same words that I have in my file (so they're loaded correctly).
My first thought was that the ShingleFilter was messing with stopword removal, but now I've just seen that the output contains square too, which is a single-word stopword.
Can someone help me fix this?
EDIT: just for clarity, I'm also adding the code that I'm using to print the tokens.
TokenStream tokenStream = null;
try {
tokenStream = new MyAnalyzer().tokenStream("text", item.toString());
tokenStream.reset();
// Iterate over the stream to process single words
while (tokenStream.incrementToken()) {
CharTermAttribute charTerm = tokenStream.getAttribute(CharTermAttribute.class);
System.out.println(charTerm.toString());
}
// Perform end-of-stream operations, e.g. set the final offset.
tokenStream.end();
} catch (IOException ex) {
} finally {
try {
// Close the stream to release resources
if (tokenStream != null) {
tokenStream.close();
}
} catch (IOException ex) {
}
}
EDIT 2: it turns out that I had stored the stopwords with trailing spaces, but there's still one last problem.
Current output is:
rosicata
pareggio
recupero
recupero
vantaggiato
livorno
catania
finale
finale spegne
spegne
sogni
etnei
raggiungere
playoff
promozione
seriea
seriea catanialivorno
catanialivorno
instagram
As you can see, the last word is instagram, coming from uploaded:by=instagram. Now, uploaded:by=instagram is a stopword too, and I'm also using a regex-based filter to remove this kind of pattern:
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile(".+:.+(=.+)?"), "", true);
I've also tried moving it to be the first filter, but I'm still getting instagram.
I have a custom Analyzer for names. I'd like to give similar umlaut-matches more weight. Is that possible?
@Override
protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
VERSION = Version.LUCENE_4_9;
final Tokenizer source = new StandardTokenizer(VERSION, reader);
TokenStream result = new StandardFilter(VERSION, source);
result = new LowerCaseFilter(VERSION, result);
result = new ASCIIFoldingFilter(result);
return new TokenStreamComponents(source, result);
}
Example query:
input: "Zur Mühle"
output (equal scores): "Zur Linde", "Zur Muehle".
Of course I'd like to get "Zur Muehle" as the top result. But how can I tell Lucene to score umlaut matches higher?
One way to do that is to use payloads to boost terms containing umlauts. Please ask for further clarification if you need more details on using payloads.
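To sketch just the indexing side (heavily simplified, with made-up names; the search side would additionally need a payload-aware query such as PayloadTermQuery in Lucene 4.x together with a Similarity whose scorePayload reads the stored value), a custom TokenFilter could attach a larger payload to terms that still contain umlauts. It would have to sit before the ASCIIFoldingFilter in your chain so that it can still see them:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter: tags umlaut-bearing terms with a higher payload for later boosting.
public final class UmlautPayloadFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public UmlautPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Assumed boost values: 2.0 for terms containing an umlaut, 1.0 otherwise.
        float boost = termAtt.toString().matches(".*[äöüÄÖÜ].*") ? 2.0f : 1.0f;
        payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(boost)));
        return true;
    }
}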
I am puzzled by the strange behavior of ShingleFilter in Lucene 4.6. What I would like to do is extract all possible bigrams from a sentence. So if the sentence is "this is a dog", I want "this is", "is a", "a dog".
What I see instead is:
"this this is"
"this is is"
"is is a"
"is a a"
"a a dog"
"a dog dog"
So somehow it replicates words and makes 3-grams instead of bigrams.
Here is my Java code:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer,2,2);
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
ShingleFilter theFilter = new ShingleFilter(tokenStream);
theFilter.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
theFilter.reset();
while (theFilter.incrementToken()) {
System.out.println(charTermAttribute.toString());
}
theFilter.end();
theFilter.close();
Could someone please help me modify it so it actually outputs bigrams and does not replicate words? Thank you!
Natalia
I was able to produce correct results using this code:
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
TokenStream tokenStream = new StandardFilter(Version.LUCENE_46, source);
ShingleFilter sf = new ShingleFilter(tokenStream);
sf.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = sf.addAttribute(CharTermAttribute.class);
sf.reset();
while (sf.incrementToken()) {
System.out.println(charTermAttribute.toString());
}
sf.end();
sf.close();
And got, as expected, 3 bigrams: this is, is a, a dog.
The problem is that you are applying two ShingleFilters. ShingleAnalyzerWrapper tacks one onto the StandardAnalyzer, and then you add another one explicitly. Since the ShingleAnalyzerWrapper uses its default behavior of outputting unigrams, you end up with the following tokens off of that first ShingleFilter:
this
this is
is
is a
a
a dog
dog
So when the second filter comes along (this time without unigrams), it simply combines consecutive pairs of those tokens, leading to the result you are seeing.
So, eliminate either the ShingleAnalyzerWrapper or the ShingleFilter added later. For instance, this should work:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
TokenStream tokenStream = analyzer.tokenStream("content", reader);
ShingleFilter theFilter = new ShingleFilter(tokenStream);
theFilter.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
theFilter.reset();
while (theFilter.incrementToken()) {
System.out.println(charTermAttribute.toString());
}
theFilter.end();
theFilter.close();
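Alternatively, keep the ShingleAnalyzerWrapper and configure it not to emit unigrams, then read its token stream directly. A sketch, assuming the long ShingleAnalyzerWrapper constructor that exposes the outputUnigrams flag:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
// minShingleSize = 2, maxShingleSize = 2, default separator and filler token, no unigrams
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(
        analyzer, 2, 2,
        ShingleFilter.DEFAULT_TOKEN_SEPARATOR, false, false,
        ShingleFilter.DEFAULT_FILLER_TOKEN);

TokenStream tokenStream = shingleAnalyzer.tokenStream("content", new StringReader("this is a dog"));
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
    System.out.println(charTermAttribute.toString()); // this is, is a, a dog
}
tokenStream.end();
tokenStream.close();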