I'm not getting the expected results from my Analyzer and would like to test the tokenization process.
The answer to this question: How to use a Lucene Analyzer to tokenize a String?
List<String> result = new ArrayList<String>();
TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));
try {
    while (stream.incrementToken()) {
        result.add(stream.getAttribute(TermAttribute.class).term());
    }
} catch (IOException e) {
    // not thrown b/c we're using a string reader...
}
return result;
It uses TermAttribute to extract the tokens from the stream. The problem is that TermAttribute is no longer available in Lucene 6.
What has it been replaced by?
What would the equivalent be with Lucene 6.6.0?
I'm pretty sure it was replaced by CharTermAttribute (javadoc).
The ticket is pretty old, but maybe the code was kept around a bit longer:
https://issues.apache.org/jira/browse/LUCENE-2372
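With Lucene 6.6.0 the loop from the question would look roughly like this (a sketch using CharTermAttribute; note that reset() and end() are required by the newer TokenStream API):

List<String> result = new ArrayList<>();
// TokenStream is Closeable, so try-with-resources takes care of close()
try (TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords))) {
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        result.add(termAtt.toString());
    }
    stream.end();
}
return result;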
Related
My objective is to leverage some of Lucene's many tokenizers and filters to transform input text, but without the creation of any indexes.
For example, given this (contrived) input string...
" Someone’s - [texté] goes here, foo . "
...and a Lucene analyzer like this...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();
I want to get the following output:
someone's texte goes here foo
The below Java method does what I want.
But is there a better (i.e. more typical and/or concise) way that I should be doing this?
I am specifically thinking about the way I have used TokenStream and CharTermAttribute, since I have never used them like this before. Feels clunky.
Here is the code:
Lucene 8.3.0 imports:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
My method:
private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
    StringBuilder sb = new StringBuilder();

    try {
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }

    return sb.toString().trim();
}
I have been using this set-up for a few weeks without issue. I have not found a more concise approach. I think the code in the question is OK.
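The only small tidy-up I can think of: since TokenStream implements Closeable, the try/finally can be collapsed into a try-with-resources block. A sketch of the same method with Lucene 8.3.0 (no behavioural change intended):

private String transform(String input) throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the stream even if an exception is thrown
    try (TokenStream ts = analyzer.tokenStream("myField", new StringReader(input))) {
        CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    }
    return sb.toString().trim();
}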
I have a custom Analyzer for names. I'd like to give similar umlaut-matches more weight. Is that possible?
@Override
protected TokenStreamComponents createComponents(String fieldName, java.io.Reader reader) {
    VERSION = Version.LUCENE_4_9;
    final Tokenizer source = new StandardTokenizer(VERSION, reader);
    TokenStream result = new StandardFilter(VERSION, source);
    result = new LowerCaseFilter(VERSION, result);
    result = new ASCIIFoldingFilter(result);
    return new TokenStreamComponents(source, result);
}
Example query:
input: "Zur Mühle"
output (equal scores): "Zur Linde", "Zur Muehle".
Of course I'd like to get "Zur Muehle" as the top result. But how can I tell Lucene to score umlaut matches higher?
One way to do that is to use payloads to boost terms containing umlauts. Please ask for further clarification if you need more details on using payloads.
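For illustration, a rough sketch of what such a filter could look like (a hypothetical UmlautPayloadFilter, not a class shipped with Lucene), placed in the chain before the ASCIIFoldingFilter; at query time you would still need a payload-aware query such as PayloadTermQuery plus a Similarity that turns the payload into a boost:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter: tags tokens that (still) contain umlauts with a payload
// so that payload-aware scoring can boost exact umlaut matches later.
public final class UmlautPayloadFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public UmlautPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Payload 1 for tokens containing umlauts (or ß), 0 otherwise.
        byte marker = termAtt.toString().matches(".*[äöüÄÖÜß].*") ? (byte) 1 : (byte) 0;
        payloadAtt.setPayload(new BytesRef(new byte[] { marker }));
        return true;
    }
}

In createComponents it would be wired in before the folding filter, e.g. result = new UmlautPayloadFilter(result);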
I have an ArrayList with some strings, and, for every string, I remove the stopwords and add it to a new ArrayList.
But NetBeans returns an error:
"AWT-EventQueue-0" java.lang.NoSuchFieldError: LUCENE_47
...
at org.apache.lucene.analysis.standard.StandardTokenizer.init(StandardTokenizer.java:144)
at org.apache.lucene.analysis.standard.StandardTokenizer.<init>(StandardTokenizer.java:132)
The code to send the string is:
for (i = 0; i < listR.size(); i++) {
    str = listR.get(i);
    try {
        str = st.StopWordString(str);
        listSt.add(str);
    } catch (IOException ex) {
        Logger.getLogger(PanelAuto.class.getName()).log(Level.SEVERE, null, ex);
    }
}
And the method I use to remove the stopwords is:
public String StopWordString(String incoming) throws IOException {
    Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_44, new StringReader(incoming));
    final StandardFilter standardFilter = new StandardFilter(Version.LUCENE_44, tokenizer);
    final StopFilter stopFilter = new StopFilter(Version.LUCENE_44, standardFilter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    final CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

    stopFilter.reset();
    while (stopFilter.incrementToken()) {
        final String token = charTermAttribute.toString();
        list.add(token);
    }

    for (i = 0; i < list.size(); i++) {
        str += list.get(i) + " ";
    }
    return str;
}
The error is thrown on this line:
Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_44,new StringReader(incoming));
It looks like you have mismatched versions of the Lucene jars on your classpath. In particular, it appears you have lucene-analyzers-common-4.7.X.jar but an earlier version of lucene-core (perhaps lucene-core-4.4.X.jar?). You have told the tokenizer to emulate an earlier algorithm, but all jars still need to come from the same Lucene release. I believe this is the line that directly triggers the error, and it shows why:
if (matchVersion.onOrAfter(Version.LUCENE_47))
(The Version class lives in lucene-core)
If you've upgraded lucene-core to version 4.7 already, you may have an old jar in your classpath that you need to remove.
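If you are not sure which jar each class is actually loaded from, a quick diagnostic (purely illustrative, not part of the fix) is to print their code-source locations; both should point at jars from the same Lucene release:

// Prints the jar each class was loaded from; a mismatch here confirms the classpath problem.
System.out.println(org.apache.lucene.util.Version.class
        .getProtectionDomain().getCodeSource().getLocation());
System.out.println(org.apache.lucene.analysis.standard.StandardTokenizer.class
        .getProtectionDomain().getCodeSource().getLocation());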
My Lucene Java implementation is eating up too many files. I followed the instructions in the Lucene Wiki about too many open files, but that only helped slow the problem. Here is my code to add objects (PTicket) to the index:
//This gets called when the bean is instantiated
public void initializeIndex() {
    analyzer = new WhitespaceAnalyzer(Version.LUCENE_32);
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
}

public void addAllToIndex(Collection<PTicket> records) {
    IndexWriter indexWriter = null;
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
    try {
        indexWriter = new IndexWriter(directory, config);
        for (PTicket record : records) {
            Document doc = new Document();
            StringBuffer documentText = new StringBuffer();
            doc.add(new Field("_id", record.getIdAsString(), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("_type", record.getType(), Field.Store.YES, Field.Index.ANALYZED));
            for (String key : record.getProps().keySet()) {
                List<String> vals = record.getProps().get(key);
                for (String val : vals) {
                    addToDocument(doc, key, val);
                    documentText.append(val).append(" ");
                }
            }
            addToDocument(doc, DOC_TEXT, documentText.toString());
            indexWriter.addDocument(doc);
        }
        indexWriter.optimize();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        cleanup(indexWriter);
    }
}

private void cleanup(IndexWriter iw) {
    if (iw == null) {
        return;
    }
    try {
        iw.close();
    } catch (IOException ioe) {
        logger.error("Error trying to close index writer");
        logger.error("{}", ioe.getClass().getName());
        logger.error("{}", ioe.getMessage());
    }
}

private void addToDocument(Document doc, String field, String value) {
    doc.add(new Field(field, value, Field.Store.YES, Field.Index.ANALYZED));
}
EDIT TO ADD code for searching
public Set<Object> searchIndex(AthenaSearch search) {
    try {
        Query q = new QueryParser(Version.LUCENE_32, DOC_TEXT, analyzer).parse(query);
        //searcher is actually instantiated in initialization. Lucene recommends this.
        //IndexSearcher searcher = new IndexSearcher(directory, true);
        TopDocs topDocs = searcher.search(q, numResults);
        ScoreDoc[] hits = topDocs.scoreDocs;
        for (int i = start; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            ids.add(d.get("_id"));
        }
        return ids;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
This code is in a web application.
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
2) I've read that raising ulimit will help, but that just seems like a band-aid that won't address the actual problem.
3) Could the problem lie with IndexSearcher?
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
I advise no. There are constructors that will either open an existing writer in the directory containing the index or create a new one. Problem 2 would be solved if you reuse the IndexWriter.
EDIT:
OK, it seems that in Lucene 3.2 all but one of the constructors are deprecated, so reuse of the IndexWriter can be achieved by using the enum IndexWriterConfig.OpenMode with the value CREATE_OR_APPEND.
Also, opening a new writer and closing it on every document add is not efficient; I suggest reusing it. If you want to speed up indexing, set the RAM buffer size via setRAMBufferSizeMB (the default is 16 MB) and tune it by trial and error.
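As a rough sketch (reusing the analyzer and directory fields from the question), the writer would be created once and kept open:

// Create the writer once (e.g. in initializeIndex()) and reuse it for all adds.
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
config.setRAMBufferSizeMB(48.0); // tune by trial and error; the default is 16 MB
IndexWriter indexWriter = new IndexWriter(directory, config);
// ... call indexWriter.addDocument(doc) and indexWriter.commit() as needed ...
// Close the writer only once, when the application shuts down.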
from the docs:
Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open.
Also reuse the IndexSearcher. I cannot see the code for searching, but IndexSearcher is thread-safe and can be used read-only as well.
I also suggest setting the merge factor on the writer; this is not strictly necessary, but it helps limit the creation of inverted index files. Tune it by trial and error as well.
I think we'd need to see your search code to be sure, but I'd suspect that it is a problem with the index searcher. More specifically, make sure that your index reader is being properly closed when you've finished with it.
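For example, a minimal close-the-reader pattern (a sketch assuming Lucene 3.x, with the searcher opened over a Directory):

IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
try {
    // ... run queries with searcher.search(...) ...
} finally {
    searcher.close();
    reader.close(); // releases the underlying index file handles
}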
Good luck,
The scientifically correct answer would be: you can't really tell from this fragment of code.
The more constructive answer would be:
You have to make sure that only one IndexWriter is writing to the index at any given time, and you therefore need some mechanism to ensure that. So my answer depends on what you want to accomplish:
do you want a deeper understanding of Lucene? or...
do you just want to build and use an index?
If your answer is the latter, you probably want to look at projects like Solr, which hides all the index reading and writing.
This question is probably a duplicate of
Too many open files Error on Lucene
I am repeating here my answer for that.
Use the compound index format to reduce the file count. When this flag is set, Lucene will write each segment as a single .cfs file instead of multiple files. This reduces the number of files significantly.
IndexWriter.setUseCompoundFile(true)
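Depending on the exact 3.x release, that setter may no longer exist on IndexWriter and instead live on the merge policy configured through IndexWriterConfig. A hedged sketch of the equivalent setup, assuming an explicit LogByteSizeMergePolicy:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
mergePolicy.setUseCompoundFile(true); // write each segment as a single .cfs file
config.setMergePolicy(mergePolicy);
IndexWriter indexWriter = new IndexWriter(directory, config);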
I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address my question.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29
Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.
Can anyone explain how to get token-like information from a TokenStream?
Yeah, it's a little convoluted (compared to the good ol' way), but this should do it:
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = termAttribute.term();
}
Edit: The new way
According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is more desirable than getAttribute.
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString();
}
This is how it should be (a clean version of Adam's answer):
TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(cattr.toString());
}
stream.end();
stream.close();
For the latest version of Lucene, 7.3.1:
// Test the tokenizer
Analyzer testAnalyzer = new CJKAnalyzer();
String testText = "Test Tokenizer";
TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
try {
    ts.reset(); // Resets this stream to the beginning. (Required)
    while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));
        System.out.println("token start offset: " + offsetAtt.startOffset());
        System.out.println("  token end offset: " + offsetAtt.endOffset());
    }
    ts.end();   // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    ts.close(); // Release resources associated with this stream.
}
Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html
There are two variations in the OP question:
What is "the process to obtain Tokens from a TokenStream"?
"Can anyone explain how to get token-like information from a TokenStream?"
Recent versions of the Lucene documentation for Token say (emphasis added):
NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.
And TokenStream says its API:
... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.
The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way using attributes. Reading through the documentation, the Lucene developers suggest that this change was made, in part, to reduce the number of individual objects created at a time.
But as some people have pointed out in the comments of those answers, they don't directly answer #1: how do you get a Token if you really want/need that type?
With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute just like the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply didn't show it.
It is important to note that while this approach lets you access the Token while you're looping, it is still only a single object no matter how many logical tokens are in the stream. Every call to incrementToken() changes the state of the Token returned from addAttribute, so if your goal is to build a collection of distinct Token objects to be used outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy.
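If an independent per-token snapshot is all that's needed, one option (an illustration, not something the quoted answers show; analyzer and text are placeholders) is AttributeSource.cloneAttributes(), which copies the current attribute values into a new AttributeSource that stays valid after the next incrementToken() call:

List<AttributeSource> tokenCopies = new ArrayList<AttributeSource>();
TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
stream.reset();
while (stream.incrementToken()) {
    // cloneAttributes() copies the current values of all attributes,
    // so each copy can be kept and inspected outside the loop.
    tokenCopies.add(stream.cloneAttributes());
}
stream.end();
stream.close();
// Later, outside the loop:
String firstTerm = tokenCopies.get(0).getAttribute(CharTermAttribute.class).toString();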