I need a Lucene Tokenizer that can do the following: given the string "wines bottle caps", these queries should all succeed
wine
bott
cap
ottl
aps
wine bottl
Here is what I have so far. How might I modify it to work? No query shorter than three characters should match.
public class PorterAnalyzer extends Analyzer {
    private final Version version;

    public PorterAnalyzer(Version version) {
        this.version = version;
    }

    @Override
    @SuppressWarnings("resource")
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(reader);
        TokenStream tok = new StandardFilter(src);
        tok = new LowerCaseFilter(tok);
        tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
        tok = new PorterStemFilter(tok);
        return new TokenStreamComponents(src, tok);
    }
}
I think you are looking for NGramTokenFilter.
Try, for example:
tok = new NGramTokenFilter(tok, 3, 5);
Since no query shorter than three characters should match, the minimum gram size is set to 3 here.
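A minimal sketch of the whole chain, based on your PorterAnalyzer (the gram sizes are illustrative, and placing the n-gram filter after the stemmer is one reasonable choice, not the only one):

@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final StandardTokenizer src = new StandardTokenizer(reader);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
    tok = new PorterStemFilter(tok);
    // Emit all substrings of length 3..5 of each (stemmed) token, so partial
    // queries such as "ottl" or "aps" can match at search time.
    tok = new NGramTokenFilter(tok, 3, 5);
    return new TokenStreamComponents(src, tok);
}

Typically you would index with this n-gram analyzer but analyze queries with a plain analyzer, so that a query term like "ottl" has to match one of the indexed grams exactly.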
Here is the initialization code:
public class Main {
    public void index(String input_path, String index_dir, String separator, String extension, String field, DataHandler handler) {
        Index index = new Index(handler);
        index.initWriter(index_dir, new StandardAnalyzer());
        index.run(input_path, field, extension, separator);
    }

    public List<?> search(String input_path, String index_dir, String separator, String extension, String field, DataHandler handler) {
        Search search = new Search(handler);
        search.initSearcher(index_dir, new StandardAnalyzer());
        return search.runUsingFiles(input_path, field, extension, separator);
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        String lang = "en-US";
        String dType = "data";
        String train = "res/input/" + lang + "/" + dType + "/train/";
        String test = "res/input/" + lang + "/" + dType + "/test/";
        String separator = "\\|";
        String extension = "csv";
        String index_dir = "res/index/" + lang + "." + dType + ".index";
        String output_file = "res/result/" + lang + "." + dType + ".output.json";
        String searched_field = "utterance";

        Main main = new Main();
        DataHandler handler = new DataHandler();
        main.index(train, index_dir, separator, extension, searched_field, handler);
        //List<JSONObject> result = (List<JSONObject>) main.search(test, index_dir, separator, extension, searched_field, handler);
        //handler.writeOutputJson(result, output_file);
    }
}
Here is my Index class:
public class Index {
    private IndexWriter writer;
    private DataHandler handler;

    public Index(DataHandler handler) {
        this.handler = handler;
    }

    public Index() {
        this(new DataHandler());
    }

    public void initWriter(String index_path, Directory store, Analyzer analyzer) {
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        try {
            this.writer = new IndexWriter(store, config);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void initWriter(String index_path, Analyzer analyzer) {
        try {
            initWriter(index_path, FSDirectory.open(Paths.get(index_path)), analyzer);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void initWriter(String index_path) {
        List<String> stopWords = Arrays.asList();
        CharArraySet stopSet = new CharArraySet(stopWords, false);
        initWriter(index_path, new StandardAnalyzer(stopSet));
    }

    @SuppressWarnings("unchecked")
    public void indexDocs(List<?> datas, String field) throws IOException {
        FieldType fieldType = new FieldType();
        FieldType fieldType2 = new FieldType();
        fieldType.setStored(true);
        fieldType.setTokenized(true);
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        fieldType2.setStored(true);
        fieldType2.setTokenized(false);
        fieldType2.setIndexOptions(IndexOptions.DOCS);
        for (int i = 0; i < datas.size(); i++) {
            Map<String, String> temp = (Map<String, String>) datas.get(i);
            Document doc = new Document();
            for (String key : temp.keySet()) {
                if (key.equals(field))
                    continue;
                doc.add(new Field(key, temp.get(key), fieldType2));
            }
            doc.add(new Field(field, temp.get(field), fieldType));
            this.writer.addDocument(doc);
        }
    }

    public void run(String path, String field, String extension, String separator) {
        List<File> files = this.handler.getInputFiles(path, extension);
        List<?> data = this.handler.readDocs(files, separator);
        try {
            System.out.println("start index");
            indexDocs(data, field);
            this.writer.commit();
            this.writer.close();
            System.out.println("done");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void run(String path) {
        run(path, "search_field", "csv", "\t");
    }
}
I made a simple search module using Java and Lucene.
The module consists of two phases: index and search.
In the index phase, it reads CSV files, converts each row to a Document, and adds it to an IndexWriter via IndexWriter.addDocument().
Finally, it calls IndexWriter.commit().
This works well on my local PC (Windows),
but on an Ubuntu PC the IndexWriter.commit() call never finishes.
IndexWriter.flush() does not complete either.
What is the problem?
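If it helps to narrow this down, Lucene can log its internal activity; a diagnostic sketch (setInfoStream is standard IndexWriterConfig API) that could be dropped into initWriter to see whether commit() is stuck flushing, merging, or fsync-ing on the Ubuntu machine:

IndexWriterConfig config = new IndexWriterConfig(analyzer);
// Log IndexWriter internals (flushes, merges, fsyncs) to stdout so you can
// see what commit() is doing when it stalls.
config.setInfoStream(System.out);
this.writer = new IndexWriter(store, config);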
I am using EJB 3.0 + a Jersey RESTful API + Lucene 6.1.
The analyzer is the Jcseg Chinese analyzer.
Code:
@Stateless
public class GoodsSearchBiz implements Serializable {

    @Override
    public List<String> test() {
        Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.SEARCH_MODE);
        JcsegAnalyzer5X jcseg = (JcsegAnalyzer5X) analyzer;
        JcsegTaskConfig config = jcseg.getTaskConfig();
        config.setAppendCJKSyn(true);
        config.setAppendCJKPinyin(true);
        TokenStream stream = null;
        List<String> strList = new ArrayList<>();
        try {
            FSDirectory directory = FSDirectory.open(Paths.get(ResourcesUtils.loadGoodsMarketIndexDir()));
            IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
            iwConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            IndexWriter iwriter = new IndexWriter(directory, iwConfig);
            iwriter.deleteAll();
            String words = "中华人民共和国";
            Document doc = new Document();
            doc.add(new TextField(SearchGoodsVO.FIELD_NAME, words, Field.Store.YES));
            iwriter.addDocument(doc);
            iwriter.commit();
            iwriter.close();
            stream = analyzer.tokenStream(SearchGoodsVO.FIELD_NAME, words);
            stream.reset();
            CharTermAttribute offsetAtt = stream.addAttribute(CharTermAttribute.class);
            while (stream.incrementToken()) {
                strList.add(offsetAtt.toString());
            }
            stream.end();
            if (stream != null) stream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println(strList);
        return strList;
    }
Running it from main gives different results:
    public static void main(String[] args) {
        GoodsSearchBiz goodsSearchBiz = new GoodsSearchBiz();
        goodsSearchBiz.test();
    }
}
/* The API */
@Path("/search")
@Produces(RestMediaType.JSON_HEADER)
@Consumes(RestMediaType.JSON_HEADER)
public class GoodsSearchApi {

    @EJB
    GoodsSearchBiz searchBiz;

    @GET
    @Path("/test")
    public List<String> test() {
        return searchBiz.test();
    }
}
Results:
from Main:
[中华, 中华人民共和国, 华人, 人民, 人民共和国, 共和, 共和国]
Process finished with exit code 0
from API:
09:31:05,433 INFO [stdout] (default task-1) [中, 华, 人, 民, 共, 和, 国]
Why does the same code give different results like this?
You have to let Jcseg load its lexicons.
In your API mode, Jcseg didn't load the lexicon correctly.
Visit https://github.com/lionsoul2014/jcseg for more help if you can read Chinese.
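Before building the analyzer, it can help to verify what the container actually sees, since an EJB container often runs with a different working directory and classpath than your IDE. A diagnostic sketch ("lexicon" is the folder name Jcseg ships with; adjust the path to your deployment):

// Check whether the relative lexicon path resolves inside the application
// server the same way it does in your main().
java.nio.file.Path lexicon = java.nio.file.Paths.get("lexicon");
System.out.println("cwd = " + java.nio.file.Paths.get("").toAbsolutePath());
System.out.println("lexicon exists = " + java.nio.file.Files.isDirectory(lexicon));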
For Lucene 3.6.2 I have the following Analyzer:
public final class StandardAnalyzerV36 extends Analyzer {
    private Analyzer analyzer;

    public StandardAnalyzerV36() {
        analyzer = new StandardAnalyzer(Version.LUCENE_36);
    }

    public StandardAnalyzerV36(Set<?> stopWords) {
        analyzer = new StandardAnalyzer(Version.LUCENE_36, stopWords);
    }

    @Override
    public final TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, new HTMLStripCharFilter(CharReader.get(reader)));
    }

    @Override
    public final TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
        return analyzer.reusableTokenStream(fieldName, reader);
    }
}
Could you please help me port it to Lucene 5.5.0? The Analyzer interface changed in the new version.
UPDATED
I have reimplemented this Analyzer as follows:
public final class StandardAnalyzerV36 extends Analyzer {
    public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final ClassicTokenizer src = new ClassicTokenizer();
        TokenStream tok = new StandardFilter(src);
        tok = new StopFilter(new LowerCaseFilter(tok), STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }
}
but my tests fail on the following call:
tokens = LuceneUtils.tokenizeString(analyzer, "[{(RDBMS)}]");

public static List<String> tokenizeString(Analyzer analyzer, String string) {
    List<String> result = new ArrayList<String>();
    try {
        TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
        stream.reset();
        while (stream.incrementToken()) {
            result.add(stream.getAttribute(CharTermAttribute.class).toString());
        }
    } catch (IOException e) {
        // not thrown b/c we're using a string reader...
        throw new RuntimeException(e);
    }
    return result;
}
with the following exception:
java.lang.IllegalStateException: TokenStream contract violation: close() call missing
at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:315)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:143)
What is wrong with this code?
Finally I got it working. The key change is closing the TokenStream when done (the finally block below): Lucene enforces the reset() / incrementToken() / end() / close() contract, and Analyzer.tokenStream() throws the IllegalStateException above if the previous stream was never closed:
public final class StandardAnalyzerV36 extends Analyzer {
    public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final ClassicTokenizer src = new ClassicTokenizer();
        TokenStream tok = new StandardFilter(src);
        tok = new StopFilter(new LowerCaseFilter(tok), STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }
}

public class LuceneUtils {
    public static List<String> tokenizeString(Analyzer analyzer, String string) {
        List<String> result = new ArrayList<String>();
        TokenStream stream = null;
        try {
            stream = analyzer.tokenStream(null, new StringReader(string));
            stream.reset();
            while (stream.incrementToken()) {
                result.add(stream.getAttribute(CharTermAttribute.class).toString());
            }
        } catch (IOException e) {
            // not thrown b/c we're using a string reader...
            throw new RuntimeException(e);
        } finally {
            IOUtils.closeQuietly(stream);
        }
        return result;
    }
}
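For example (a hedged sketch; the exact output depends on the stop set, but ClassicTokenizer strips the brackets and LowerCaseFilter lowercases the term):

List<String> tokens = LuceneUtils.tokenizeString(new StandardAnalyzerV36(), "[{(RDBMS)}]");
System.out.println(tokens); // expected: [rdbms]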
I need to build a simple search engine that can recognize and stem Romanian words, including those with diacritics. I used RomanianAnalyzer, but it does not stem the same word consistently when it is written with and without diacritics.
Can you help me with code for adding to or modifying an existing Romanian stemmer?
PS: I edited the question to be clearer.
You can copy the RomanianAnalyzer source to create a custom analyzer and add a filter to the analysis chain in the createComponents method. ASCIIFoldingFilter is probably what you are looking for. I would add it at the end of the chain, to be sure you don't confuse the stemmer by removing the diacritics before it runs.
public final class RomanianASCIIAnalyzer extends StopwordAnalyzerBase {
    private final CharArraySet stemExclusionSet;
    public final static String DEFAULT_STOPWORD_FILE = "stopwords.txt";
    private static final String STOPWORDS_COMMENT = "#";

    public static CharArraySet getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT_STOP_SET;
    }

    private static class DefaultSetHolder {
        static final CharArraySet DEFAULT_STOP_SET;
        static {
            try {
                DEFAULT_STOP_SET = loadStopwordSet(false, RomanianAnalyzer.class,
                        DEFAULT_STOPWORD_FILE, STOPWORDS_COMMENT);
            } catch (IOException ex) {
                throw new RuntimeException("Unable to load default stopword set");
            }
        }
    }

    public RomanianASCIIAnalyzer() {
        this(DefaultSetHolder.DEFAULT_STOP_SET);
    }

    public RomanianASCIIAnalyzer(CharArraySet stopwords) {
        this(stopwords, CharArraySet.EMPTY_SET);
    }

    public RomanianASCIIAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionSet) {
        super(stopwords);
        this.stemExclusionSet = CharArraySet.unmodifiableSet(CharArraySet.copy(stemExclusionSet));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopwords);
        if (!stemExclusionSet.isEmpty())
            result = new SetKeywordMarkerFilter(result, stemExclusionSet);
        result = new SnowballFilter(result, new RomanianStemmer());
        // The following line is the addition made to the RomanianAnalyzer source.
        result = new ASCIIFoldingFilter(result);
        return new TokenStreamComponents(source, result);
    }
}
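To check that the folding works, you can tokenize the same word with and without diacritics and compare the terms. A minimal sketch (run from a method that declares throws IOException; the exact stemmed form depends on the Snowball RomanianStemmer):

Analyzer analyzer = new RomanianASCIIAnalyzer();
for (String word : new String[] { "mâncare", "mancare" }) {
    try (TokenStream ts = analyzer.tokenStream("field", word)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // Both inputs should print the same folded, stemmed term.
            System.out.println(word + " -> " + term.toString());
        }
        ts.end();
    }
}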
I would like to find the Lucene analyzer corresponding to the language of a Java locale.
For instance, Locale.ENGLISH would be mapped to org.apache.lucene.analysis.en.EnglishAnalyzer.
Is there an automated mapping somewhere?
This is not available out of the box. Below is the way I do it.
public final class LocaleAwareAnalyzer extends AnalyzerWrapper {
    private static final Logger LOG = LoggerFactory.getLogger(LocaleAwareAnalyzer.class);

    private final Analyzer defaultAnalyzer;
    private final Map<String, Analyzer> perLocaleAnalyzer = perLocaleAnalyzers();

    public LocaleAwareAnalyzer(final Analyzer defaultAnalyzer) {
        this.defaultAnalyzer = Precondition.notNull("defaultAnalyzer", defaultAnalyzer);
    }

    @Override
    protected Analyzer getWrappedAnalyzer(final String fieldName) {
        if (fieldName == null) {
            return defaultAnalyzer;
        }
        final int n = fieldName.indexOf('_');
        if (n >= 0) {
            // Unfortunately CharArrayMap does not offer get(CharSequence, start, end)
            final String locale = fieldName.substring(n + 1);
            final Analyzer a = perLocaleAnalyzer.get(locale);
            if (a != null) {
                return a;
            }
            LOG.warn("No Analyzer for Locale '{}', using default", locale);
        }
        return defaultAnalyzer;
    }

    @Override
    protected TokenStreamComponents wrapComponents(final String fieldName,
            final TokenStreamComponents components) {
        return components;
    }

    private static Map<String, Analyzer> perLocaleAnalyzers() {
        final Map<String, Analyzer> m = new HashMap<>();
        m.put("en", new EnglishAnalyzer(Version.LUCENE_43));
        m.put("es", new SpanishAnalyzer(Version.LUCENE_43));
        m.put("de", new GermanAnalyzer(Version.LUCENE_43));
        m.put("fr", new FrenchAnalyzer(Version.LUCENE_43));
        // ... etc
        return m;
    }
}
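Hypothetical usage: field names carry the locale as a suffix after '_', so a field named "title_en" is analyzed with the EnglishAnalyzer, and any field without a known suffix falls back to the default analyzer.

Analyzer analyzer = new LocaleAwareAnalyzer(new StandardAnalyzer(Version.LUCENE_43));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
// doc.add(new TextField("title_en", "the quick brown fox", Field.Store.YES));
// doc.add(new TextField("title_fr", "le rapide renard brun", Field.Store.YES));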