Lucene: how can I turn off "toLowerCase" in StandardAnalyzer? - java

I want to tokenize my text. I use tokenStream from StandardAnalyzer, but it applies "toLowerCase" by default.
My code:
ArrayList<String> toTextWord = new ArrayList<>();
Analyzer analyzer = new StandardAnalyzer();
try (TokenStream stream = analyzer.tokenStream("tags", new StringReader(iterStr))) {
    stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        CharTermAttribute token = stream.getAttribute(CharTermAttribute.class);
        System.out.println(token.toString());
        toTextWord.add(token.toString());
    }
} catch (Exception e) {
    e.printStackTrace();
}
How can I use StandardAnalyzer without "toLowerCase"? How can I turn off "toLowerCase" in this StandardAnalyzer?

You cannot turn off toLowerCase directly in the StandardAnalyzer.
You can create a custom analyzer which behaves the same way as the StandardAnalyzer, and then customize it to meet your needs:
Example using org.apache.lucene.analysis.custom.CustomAnalyzer:
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop")
        .build();
Now you can comment out (or remove) the lowercase token filter:
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("stop")
        .build();
Note that if you want to exactly match the default Standard Analyzer, then you should also comment out or remove the stop-word filter, since by default stopwords are not removed from the Standard Analyzer unless you provide an explicit list.
That gives us this:
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .build();
If I use the following input with my custom analyzer:
String iterStr = "Eric the quick brown fox jumps over Freddy the lazy dog, LOL.";
then the output from your code is as follows:
Eric
the
quick
brown
fox
jumps
over
Freddy
the
lazy
dog
LOL
Update
When using the CustomAnalyzer you can use string values to identify the different tokenizer and filter objects - such as "standard" and "lowercase", as used in my examples above.
If you want to avoid using these identifiers, you can use the relevant factory object with the NAME field:
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(StandardTokenizerFactory.NAME)
        .addTokenFilter(LowerCaseFilterFactory.NAME)
        .addTokenFilter(StopFilterFactory.NAME)
        .build();
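For reference, in recent Lucene versions (8.x) those classes live in the following packages; these import locations are my assumption and can shift between releases:
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.StopFilterFactory;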

Related

How to create Custom Analyzer in Lucene, with custom stop/common words from file

I'm trying to create a custom analyzer in Lucene 8.3.0 that uses stemming and filters the given text using custom stop words from a file.
To be more clear, I don't want to use the default stop-words filter and add some words to it; I want to filter using only a set of stop words from a stopWords.txt file.
How can I do this?
This is what I have written so far, but I am not sure if it is right:
public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream tokenStream = new StandardFilter(tokenizer);
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new StopFilter(tokenStream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        // Adding Porter stemming
        tokenStream = new PorterStemFilter(tokenStream);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }
}
First of all, I am not sure if the structure is correct, and for now I am using the stop-word set from StopAnalyzer just to test it (however, it's not working).
You need to read the file and parse it into a CharArraySet to pass into the filter. StopFilter has some built-in methods you can use to convert a List of Strings to a CharArraySet, like:
...
CharArraySet stopset = StopFilter.makeStopSet(myStopwordList);
tokenStream = new StopFilter(tokenStream, stopset);
...
WordlistLoader is listed as being for internal purposes, so fair warning about relying on that class, but if you don't want to handle parsing your file into a list yourself, you can use it to parse your stopword file directly into a CharArraySet, something like:
...
CharArraySet stopset = WordlistLoader.getWordSet(myStopfileReader);
tokenStream = new StopFilter(tokenStream, stopset);
...
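Putting those pieces together, here is a minimal sketch of what MyAnalyzer could look like, loading the stop words once from the stopWords.txt file mentioned in the question (the file path and one-word-per-line format are assumptions, and the imports reflect Lucene 8.x package locations, which can shift between releases; StandardFilter is omitted because it is a deprecated no-op in 8.x):
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.WordlistLoader;

public class MyAnalyzer extends Analyzer {
    private final CharArraySet stopset;

    public MyAnalyzer() {
        // Load the stop-word file once, rather than on every createComponents() call.
        try (Reader reader = Files.newBufferedReader(Paths.get("stopWords.txt"))) {
            stopset = WordlistLoader.getWordSet(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(tokenizer);
        tokenStream = new StopFilter(tokenStream, stopset);  // custom stop words only
        tokenStream = new PorterStemFilter(tokenStream);     // Porter stemming
        return new TokenStreamComponents(tokenizer, tokenStream);
    }
}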

Using Lucene Analyzer Without Indexing - Is My Approach Reasonable?

My objective is to leverage some of Lucene's many tokenizers and filters to transform input text, but without the creation of any indexes.
For example, given this (contrived) input string...
" Someone’s - [texté] goes here, foo . "
...and a Lucene analyzer like this...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();
I want to get the following output:
someone's texte goes here foo
The Java method below does what I want.
But is there a better (i.e. more typical and/or concise) way that I should be doing this?
I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. Feels clunky.
Here is the code:
Lucene 8.3.0 imports:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
My method:
private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();
    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
    StringBuilder sb = new StringBuilder();
    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    return sb.toString().trim();
}
I have been using this set-up for a few weeks without issue, and I have not found a more concise approach. I think the code in the question is OK.
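If you want something marginally tighter: TokenStream implements Closeable, so the explicit try/finally can be collapsed into a try-with-resources block. This is just a sketch of the same logic under that one change, not a different approach:
private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();
    StringBuilder sb = new StringBuilder();
    try (TokenStream ts = analyzer.tokenStream("myField", new StringReader(input))) {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(termAtt).append(' '); // CharTermAttribute is a CharSequence
        }
        ts.end();
    }
    return sb.toString().trim();
}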

How to test a Lucene Analyzer?

I'm not getting the expected results from my Analyzer and would like to test the tokenization process.
The answer to this question: How to use a Lucene Analyzer to tokenize a String?
List<String> result = new ArrayList<String>();
TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));
try {
    while (stream.incrementToken()) {
        result.add(stream.getAttribute(TermAttribute.class).term());
    }
} catch (IOException e) {
    // not thrown b/c we're using a string reader...
}
return result;
It uses the TermAttribute to extract the tokens from the stream. The problem is that TermAttribute is no longer in Lucene 6.
What has it been replaced by?
What would the equivalent be with Lucene 6.6.0?
I'm pretty sure it was replaced by CharTermAttribute (javadoc).
The ticket is pretty old, but maybe the code was kept around a bit longer:
https://issues.apache.org/jira/browse/LUCENE-2372
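For completeness, here is a sketch of what the snippet from the linked question might look like against Lucene 6.6.0, with CharTermAttribute standing in for TermAttribute (an assumption consistent with the deprecation note above); note that reset() must be called before iterating in the modern API:
List<String> result = new ArrayList<>();
try (TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords))) {
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        result.add(termAtt.toString());
    }
    stream.end();
} catch (IOException e) {
    // not thrown b/c we're using a string reader...
}
return result;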

ShingleFilter in Lucene 4.6 adds words to bigrams

I am puzzled by the strange behavior of ShingleFilter in Lucene 4.6. What I would like to do is extract all possible bigrams from a sentence. So if the sentence is "this is a dog", I want "this is", "is a", "a dog".
What I see instead is:
"this this is"
"this is is"
"is is a"
"is a a"
"a a dog"
"a dog dog"
So somehow it replicates words and makes 3-grams instead of bigrams.
Here is my Java code:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(analyzer, 2, 2);
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
TokenStream tokenStream = shingleAnalyzer.tokenStream("content", reader);
ShingleFilter theFilter = new ShingleFilter(tokenStream);
theFilter.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
theFilter.reset();
while (theFilter.incrementToken()) {
    System.out.println(charTermAttribute.toString());
}
theFilter.end();
theFilter.close();
Could someone please help me modify it so it actually outputs bigrams and does not replicate words? Thank you!
Natalia
I was able to produce correct results using this code:
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
TokenStream tokenStream = new StandardFilter(Version.LUCENE_46, source);
ShingleFilter sf = new ShingleFilter(tokenStream);
sf.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = sf.addAttribute(CharTermAttribute.class);
sf.reset();
while (sf.incrementToken()) {
    System.out.println(charTermAttribute.toString());
}
sf.end();
sf.close();
and got the expected 3 bigrams: this is, is a, a dog.
The problem is that you are applying two ShingleFilters. ShingleAnalyzerWrapper tacks one onto the StandardAnalyzer, and then you add another one explicitly. Since the ShingleAnalyzerWrapper uses its default behavior of outputting unigrams, you end up with the following tokens coming out of that first ShingleFilter:
this
this is
is
is a
a
a dog
dog
so when the second filter comes along (this time without unigrams), it simply combines each pair of consecutive tokens among those, leading to the result you are seeing.
So either eliminate the ShingleAnalyzerWrapper or the later ShingleFilter. For instance, this should work:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
String theSentence = "this is a dog";
StringReader reader = new StringReader(theSentence);
TokenStream tokenStream = analyzer.tokenStream("content", reader);
ShingleFilter theFilter = new ShingleFilter(tokenStream);
theFilter.setOutputUnigrams(false);
CharTermAttribute charTermAttribute = theFilter.addAttribute(CharTermAttribute.class);
theFilter.reset();
while (theFilter.incrementToken()) {
    System.out.println(charTermAttribute.toString());
}
theFilter.end();
theFilter.close();
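Run against the same "this is a dog" input, that version should print only the three expected bigrams: this is, is a, a dog.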

How to get a Token from a Lucene TokenStream?

I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address my question.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29
Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.
Can anyone explain how to get token-like information from a TokenStream?
Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}
Edit: The new way
According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

tokenStream.reset();
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}
This is how it should be (a clean version of Adam's answer):
TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(cattr.toString());
}
stream.end();
stream.close();
For the latest version of Lucene, 7.3.1:
// Test the tokenizer
Analyzer testAnalyzer = new CJKAnalyzer();
String testText = "Test Tokenizer";
TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);

try {
    ts.reset(); // Resets this stream to the beginning. (Required)
    while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));
        System.out.println("token start offset: " + offsetAtt.startOffset());
        System.out.println("  token end offset: " + offsetAtt.endOffset());
    }
    ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    ts.close(); // Release resources associated with this stream.
}
Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html
There are two variations in the OP question:
What is "the process to obtain Tokens from a TokenStream"?
"Can anyone explain how to get token-like information from a TokenStream?"
Recent versions of the Lucene documentation for Token say (emphasis added):
NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.
And TokenStream says its API:
... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.
The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way using attributes. Reading through the documentation, the Lucene developers suggest that this change was made, in part, to reduce the number of individual objects created at a time.
But as some people have pointed out in the comments of those answers, they don't directly answer #1: how do you get a Token if you really want/need that type?
With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute just like the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply didn't show it.
It is important to note that while this approach lets you access the Token while you're looping, it is still only a single object no matter how many logical tokens are in the stream. Every call to incrementToken() changes the state of the Token returned from addAttribute, so if your goal is to build a collection of distinct Token objects to be used outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy.
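As a minimal sketch of that copying idea (shown here with CharTermAttribute, though the same caveat applies to Token itself): the attribute instance is reused across incrementToken() calls, so take an immutable snapshot inside the loop if you need the values afterwards.
List<String> terms = new ArrayList<>();
try (TokenStream ts = analyzer.tokenStream("field", new StringReader(text))) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        terms.add(termAtt.toString()); // toString() copies the shared buffer
    }
    ts.end();
}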
