Lucene Chinese Analyzer Jcseg Using same code gives different results - java

I am using EJB 3.0 + Jersey RESTful API + Lucene 6.1.
The analyzer is the Jcseg Chinese analyzer.
Code:
@Stateless
public class GoodsSearchBiz implements Serializable {
@Override
public List<String> test(){
Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.SEARCH_MODE);
JcsegAnalyzer5X jcseg = (JcsegAnalyzer5X) analyzer;
JcsegTaskConfig config = jcseg.getTaskConfig();
config.setAppendCJKSyn(true);
config.setAppendCJKPinyin(true);
TokenStream stream = null;
List<String> strList = new ArrayList<>();
try {
FSDirectory directory = FSDirectory.open(Paths.get(ResourcesUtils.loadGoodsMarketIndexDir()));
IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
iwConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
IndexWriter iwriter = new IndexWriter(directory, iwConfig);
iwriter.deleteAll();
String words = "中华人民共和国";
Document doc = new Document();
doc.add(new TextField(SearchGoodsVO.FIELD_NAME, words, Field.Store.YES));
iwriter.addDocument(doc);
iwriter.commit();
iwriter.close();
stream = analyzer.tokenStream(SearchGoodsVO.FIELD_NAME, words);
stream.reset();
CharTermAttribute offsetAtt = stream.addAttribute(CharTermAttribute.class);
while (stream.incrementToken()) {
strList.add(offsetAtt.toString());
}
stream.end();
if (stream != null) stream.close();
} catch (Exception e) {
e.printStackTrace();
}
System.out.println(strList);
return strList;
}
Running it from main gives different results:
public static void main(String[] args) {
GoodsSearchBiz goodsSearchBiz = new GoodsSearchBiz();
goodsSearchBiz.test();
}
}
/*The Api*/
@Path("/search")
@Produces(RestMediaType.JSON_HEADER)
@Consumes(RestMediaType.JSON_HEADER)
public class GoodsSearchApi {
@EJB
GoodsSearchBiz searchBiz;
@GET
@Path("/test")
public List<String> test() {
return searchBiz.test();
}
}
Results:
from Main:
[中华, 中华人民共和国, 华人, 人民, 人民共和国, 共和, 共和国]
Process finished with exit code 0
from API:
09:31:05,433 INFO [stdout] (default task-1) [中, 华, 人, 民, 共, 和, 国]
Why does the same code give such different results?

You have to let Jcseg load its lexicons.
In API mode, Jcseg did not load the lexicon correctly.
Visit https://github.com/lionsoul2014/jcseg for more help, if you can read Chinese.
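One way to narrow this down is to check what each runtime can actually see. The sketch below uses only the JDK and does not call Jcseg at all. It assumes, and this is an assumption to verify against the Jcseg documentation for your version, that Jcseg looks for a file named jcseg.properties next to its jar, on the classpath, and in the user home directory. The class name JcsegConfigProbe is made up for illustration.

```java
import java.io.File;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper, not part of Jcseg.
class JcsegConfigProbe {

    // Assumed configuration file name; check the Jcseg docs for your version.
    static final String CONFIG_NAME = "jcseg.properties";

    /** Reports, per location, whether the config file is visible from this JVM. */
    static Map<String, Boolean> probe() {
        Map<String, Boolean> found = new LinkedHashMap<>();
        URL onClasspath = JcsegConfigProbe.class.getClassLoader().getResource(CONFIG_NAME);
        found.put("classpath", onClasspath != null);
        found.put("user.dir", new File(System.getProperty("user.dir"), CONFIG_NAME).isFile());
        found.put("user.home", new File(System.getProperty("user.home"), CONFIG_NAME).isFile());
        return found;
    }

    public static void main(String[] args) {
        // Print the working directory too: it usually differs between an IDE run
        // and an application server.
        System.out.println("user.dir = " + System.getProperty("user.dir"));
        probe().forEach((where, visible) -> System.out.println(where + " -> " + visible));
    }
}
```

Calling probe() once from main and once from the REST endpoint, then comparing the two outputs, should show whether the server deployment is missing a configuration or lexicon location that the IDE run has.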

Related

Lucene IndexWriters in @Singleton @ApplicationScoped bean closes the IndexWriter

I need to index documents while they are being uploaded, into different indexes based on their content, in a Java web application where multiple users may each be uploading multiple documents simultaneously.
I am using Lucene 6.2.1 for indexing.
For this I have created a stateless EJB called IndexingSessionBean, which indexes each document while it is being uploaded.
But since I cannot have multiple IndexWriters open on one index, I have created a @Singleton and @ApplicationScoped bean called CatagoryIndexWriters, which should hold a map of index writers, one for each category of document, and pass them to IndexingSessionBean.
My code is given below.
IndexingSessionBean.java
@Stateless
public class IndexingSessionBean {
@EJB
CatagoryIndexWriters catagoryIndexWriters;
public void indexFile(String documentId, String catId, byte[] fileBytes, boolean isUpdate) {
String content = // get contents of the fileBytes in String
try {
IndexWriter writer = catagoryIndexWriters.getTargetIndexWriter(catId)
Document doc = new Document();
Field documentIdField = new StringField("documentId", documentId, Field.Store.YES);
doc.add(documentIdField);
doc.add(new TextField("contents", content, Field.Store.YES));
if (!isUpdate) {
LOG.log(Level.INFO, "Indexing file with documentId {0}", documentId);
writer.addDocument(doc);
} else {
LOG.log(Level.INFO, "Updating Index for file with documentId {0}", documentId);
writer.updateDocument(new Term("documentId", documentId), doc);
}
}
catch (IOException ex) {
LOG.log(Level.SEVERE, "Unable to index document!", ex);
}
}
}
CatagoryIndexWriters
@Singleton
@ApplicationScoped
@ConcurrencyManagement(BEAN)
public class CatagoryIndexWriters {
@EJB
SystemConfigBean systemConfigBean;
Map<String, IndexWriter> indexWritersMap =new HashMap<String, IndexWriter>();
private double RAMBufferSize = 256.00;
public IndexWriter getCatagoryIndexWriter(String catId){
IndexWriter writer;
writer = indexWritersMap.get(catId);
if (writer != null){
return writer;
}else{
addCatagoryIndexWriterToMap(catId);
return indexWritersMap.get(catId);
}
}
private void createCatagoryIndexPath(String catId){
String indexPath = systemConfigBean.getSearchindexPath();
String catIndexPathString = indexPath+systemConfigBean.SEPARATORCHAR+catId;
Path catIndexPath = new File(catIndexPathString).toPath();
//Check the Catagory Index Folder if there is no index folder create it.
}
private void addCatagoryIndexWriterToMap(String catId){
createCatagoryIndexPath(catId);
String indexPath = systemConfigBean.getSearchindexPath();
String catIndexPathString = indexPath+systemConfigBean.SEPARATORCHAR+catId;
Path catIndexPath = new File(catIndexPathString).toPath();
try {
Directory dir = FSDirectory.open(catIndexPath);
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
iwc.setRAMBufferSizeMB(this.RAMBufferSize);
try (IndexWriter writer = new IndexWriter(dir, iwc)) {
indexWritersMap.put(catId, writer);
}
}
catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
But while adding a document I get the following exception:
Mai 12, 2017 12:54:59 PM org.apache.openejb.core.transaction.EjbTransactionUtil handleSystemException
SCHWERWIEGEND: EjbTransactionUtil.handleSystemException: this IndexWriter is closed
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1307)
at de.zaffar.docloaddoc.beans.IndexingSessionBean.indexFile(IndexingSessionBean.java:257)
I don't know where the close method of IndexWriter is being called from.
Your issue seems to be the line try (IndexWriter writer = new IndexWriter(dir, iwc)). A resource declared this way is closed automatically when the try block ends, i.e. right after you have put it into the map.
try-with-resources is meant for the specific case where the resource is used only within the try block; once the block exits, the resource is closed.
IndexWriter implements AutoCloseable, so it gets closed.
Take it out of the try-with-resources statement, create it with a normal statement, and try again.
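The effect can be reproduced without Lucene. The following is a minimal sketch in which FakeWriter is a made-up stand-in for IndexWriter that only records whether close() was called; the class and method names are illustrative and not taken from the question.

```java
import java.util.HashMap;
import java.util.Map;

class TryWithResourcesDemo {

    /** Hypothetical stand-in for IndexWriter: remembers whether close() ran. */
    static class FakeWriter implements AutoCloseable {
        private boolean closed = false;

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    static final Map<String, FakeWriter> WRITERS = new HashMap<>();

    /** Mirrors the question: the writer is stored in the map, then closed by the try block. */
    static FakeWriter createInsideTryWithResources(String key) {
        try (FakeWriter writer = new FakeWriter()) {
            WRITERS.put(key, writer);
        } // writer.close() is called here, although the map still references it
        return WRITERS.get(key);
    }

    /** Mirrors the suggested fix: a plain statement, so the writer stays open. */
    static FakeWriter createWithPlainStatement(String key) {
        FakeWriter writer = new FakeWriter();
        WRITERS.put(key, writer);
        return writer;
    }

    public static void main(String[] args) {
        System.out.println(createInsideTryWithResources("a").isClosed()); // true
        System.out.println(createWithPlainStatement("b").isClosed());     // false
    }
}
```

With the plain-statement version, nothing closes the writers automatically any more, so the bean that owns the map has to close them itself when the application shuts down.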

How to match exact text in Lucene search?

I'm trying to match the text Config migration from ASA5505 8.2 to ASA5516 in the column TITLE.
My program looks like this:
Directory directory = FSDirectory.open(indexDir);
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_35,new String[] {"TITLE"}, new StandardAnalyzer(Version.LUCENE_35));
IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
queryParser.setPhraseSlop(0);
queryParser.setLowercaseExpandedTerms(true);
Query query = queryParser.parse("TITLE:Config migration from ASA5505 8.2 to ASA5516");
System.out.println(queryStr);
TopDocs topDocs = searcher.search(query,100);
System.out.println(topDocs.totalHits);
ScoreDoc[] hits = topDocs.scoreDocs;
System.out.println(hits.length + " Record(s) Found");
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println("\"Title :\" " +d.get("TITLE") );
}
But it returns:
"Title :" Config migration from ASA5505 8.2 to ASA5516
"Title :" Firewall migration from ASA5585 to ASA5555
"Title :" Firewall migration from ASA5585 to ASA5555
The last two results are not expected. What modification is required to match the exact text Config migration from ASA5505 8.2 to ASA5516?
My indexing code looks like this:
public class Lucene {
public static final String INDEX_DIR = "./Lucene";
private static final String JDBC_DRIVER = "oracle.jdbc.OracleDriver";
private static final String CONNECTION_URL = "jdbc:oracle:thin:xxxxxxx"
private static final String USER_NAME = "localhost";
private static final String PASSWORD = "localhost";
private static final String QUERY = "select * from TITLE_TABLE";
public static void main(String[] args) throws Exception {
File indexDir = new File(INDEX_DIR);
Lucene indexer = new Lucene();
try {
Date start = new Date();
Class.forName(JDBC_DRIVER).newInstance();
Connection conn = DriverManager.getConnection(CONNECTION_URL, USER_NAME, PASSWORD);
SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_35);
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), indexWriterConfig);
System.out.println("Indexing to directory '" + indexDir + "'...");
int indexedDocumentCount = indexer.indexDocs(indexWriter, conn);
indexWriter.close();
System.out.println(indexedDocumentCount + " records have been indexed successfully");
System.out.println("Total Time:" + (new Date().getTime() - start.getTime()) / (1000));
} catch (Exception e) {
e.printStackTrace();
}
}
int indexDocs(IndexWriter writer, Connection conn) throws Exception {
String sql = QUERY;
Statement stmt = conn.createStatement();
stmt.setFetchSize(100000);
ResultSet rs = stmt.executeQuery(sql);
int i = 0;
while (rs.next()) {
System.out.println("Addind Doc No:" + i);
Document d = new Document();
System.out.println(rs.getString("TITLE"));
d.add(new Field("TITLE", rs.getString("TITLE"), Field.Store.YES, Field.Index.ANALYZED));
d.add(new Field("NAME", rs.getString("NAME"), Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(d);
i++;
}
return i;
}
}
PVR is correct that using a phrase query is probably the right solution here, but they missed how to use the PhraseQuery class. You are already using QueryParser, though, so just use the query parser syntax and enclose your search text in quotes:
Query query = queryParser.parse("TITLE:\"Config migration from ASA5505 8.2 to ASA5516\"");
Based on your update, you are using a different analyzer at index-time and query-time. SimpleAnalyzer and StandardAnalyzer don't do the same things. Unless you have a very good reason to do otherwise, you should analyze the same way when indexing and querying.
So, change the analyzer in your indexing code to StandardAnalyzer (or vice-versa, use SimpleAnalyzer when querying), and you should see better results.
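To see why the mismatch matters for this title, here is a rough JDK-only approximation of the two tokenizations. It does not use Lucene: lettersOnly mimics the way SimpleAnalyzer splits at every non-letter and lowercases, while whitespaceSplit is only a crude stand-in for StandardAnalyzer (the real one follows Unicode word-break rules and, in the 3.x line, also drops English stop words by default). Both method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnalyzerMismatchDemo {

    /** Collects every lowercase match of the given token pattern. */
    static List<String> tokens(String text, String tokenRegex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    /** Approximates SimpleAnalyzer: runs of letters only, digits and dots are dropped. */
    static List<String> lettersOnly(String text) {
        return tokens(text, "\\p{L}+");
    }

    /** Crude stand-in for StandardAnalyzer: keeps alphanumeric tokens such as asa5505 and 8.2. */
    static List<String> whitespaceSplit(String text) {
        return tokens(text, "\\S+");
    }

    public static void main(String[] args) {
        String title = "Config migration from ASA5505 8.2 to ASA5516";
        // [config, migration, from, asa, to, asa]
        System.out.println(lettersOnly(title));
        // [config, migration, from, asa5505, 8.2, to, asa5516]
        System.out.println(whitespaceSplit(title));
    }
}
```

The index built with the letters-only rule contains the term asa but never asa5505 or 8.2, so a phrase query analyzed the other way looks for terms that were never indexed.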
Here is what I have written for you, which works:
Use: queryParser.parse("\"Config migration from ASA5505 8.2 to ASA5516\"");
1. To create the index
public static void main(String[] args)
{
IndexWriter writer = getIndexWriter();
Document doc = new Document();
Document doc1 = new Document();
Document doc2 = new Document();
doc.add(new Field("TITLE", "Config migration from ASA5505 8.2 to ASA5516",Field.Store.YES,Field.Index.ANALYZED));
doc1.add(new Field("TITLE", "Firewall migration from ASA5585 to ASA5555",Field.Store.YES,Field.Index.ANALYZED));
doc2.add(new Field("TITLE", "Firewall migration from ASA5585 to ASA5555",Field.Store.YES,Field.Index.ANALYZED));
try
{
writer.addDocument(doc);
writer.addDocument(doc1);
writer.addDocument(doc2);
writer.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static IndexWriter getIndexWriter()
{
IndexWriter indexWriter=null;
try
{
File file=new File("D://index//");
if(!file.exists())
file.mkdir();
IndexWriterConfig conf=new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
Directory directory=FSDirectory.open(file);
indexWriter=new IndexWriter(directory, conf);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return indexWriter;
}
}
2. To search for the string
public static void main(String[] args)
{
IndexReader reader=getIndexReader();
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(Version.LUCENE_34, "TITLE" ,new StandardAnalyzer(Version.LUCENE_34));
Query query;
try
{
query = parser.parse("\"Config migration from ASA5505 8.2 to ASA5516\"");
TopDocs hits = searcher.search(query,3);
ScoreDoc[] document = hits.scoreDocs;
int i=0;
for(i=0;i<document.length;i++)
{
Document doc = searcher.doc(document[i].doc); // look up by the hit's document ID, not the loop index
System.out.println("TITLE=" + doc.get("TITLE"));
}
searcher.close();
}
catch (Exception e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static IndexReader getIndexReader()
{
IndexReader reader=null;
Directory dir;
try
{
dir = FSDirectory.open(new File("D://index//"));
reader=IndexReader.open(dir);
} catch (IOException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
return reader;
}
Try PhraseQuery as follows:
BooleanQuery mainQuery= new BooleanQuery();
String searchTerm="config migration from asa5505 8.2 to asa5516";
String strArray[]= searchTerm.split(" ");
for(int index=0;index<strArray.length;index++)
{
PhraseQuery query1 = new PhraseQuery();
query1.add(new Term("TITLE",strArray[index]));
mainQuery.add(query1,BooleanClause.Occur.MUST);
}
Then execute mainQuery.
Check out this Stack Overflow thread; it may help you use PhraseQuery for exact search.

Lucene Apache doesn't keep my old index

I found this example on the internet:
Indexer.java
public class Indexer {
private IndexWriter writer;
@SuppressWarnings("deprecation")
public Indexer(String indexDirectoryPath) throws IOException {
Directory indexDirectory = FSDirectory.open(new File(indexDirectoryPath));
writer = new IndexWriter(indexDirectory, new StandardAnalyzer(Version.LUCENE_36), true,
IndexWriter.MaxFieldLength.UNLIMITED);
}
public void close() throws CorruptIndexException, IOException {
writer.close();
}
private Document getDocument(File file) throws IOException {
Document document = new Document();
Field contentField = new Field(LuceneConstants.CONTENTS, new FileReader(file));
Field fileNameField = new Field(LuceneConstants.FILE_NAME, file.getName(), Field.Store.YES,
Field.Index.NOT_ANALYZED);
Field filePathField = new Field(LuceneConstants.FILE_PATH, file.getCanonicalPath(), Field.Store.YES,
Field.Index.NOT_ANALYZED);
document.add(contentField);
document.add(fileNameField);
document.add(filePathField);
return document;
}
public void indexFile(File file) throws IOException {
Document document = getDocument(file);
writer.addDocument(document);
}
public int createIndex(String file) throws IOException {
indexFile(new File(file));
return writer.numDocs();
}
}
Searcher.java
public class Searcher {
IndexSearcher indexSearcher;
QueryParser queryParser;
Query query;
@SuppressWarnings("deprecation")
public Searcher(String indexDirectoryPath) throws IOException {
Directory indexDirectory = FSDirectory
.open(new File(indexDirectoryPath));
indexSearcher = new IndexSearcher(indexDirectory);
queryParser = new QueryParser(Version.LUCENE_36,
LuceneConstants.CONTENTS, new StandardAnalyzer(
Version.LUCENE_36));
}
public TopDocs search(String searchQuery) throws IOException,
ParseException {
query = queryParser.parse(QueryParser.escape(searchQuery));
return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
}
public Document getDocument(ScoreDoc scoreDoc)
throws CorruptIndexException, IOException {
return indexSearcher.doc(scoreDoc.doc);
}
public void close() throws IOException {
indexSearcher.close();
}
}
LuceneConstants.java
public class LuceneConstants {
public static final String CONTENTS = "contents";
public static final String FILE_NAME = "filename";
public static final String FILE_PATH = "filepath";
public static final int MAX_SEARCH = 10;
}
This is how I use them:
public static void main(String[] args) throws IOException, ParseException {
{
// First file
Indexer indexer = new Indexer("index");
indexer.createIndex("f1.txt");
indexer.close();
Searcher searcher = new Searcher(Constante.DIR_INDEX.getValor());
TopDocs hits = searcher.search("Art. 1°");
for (ScoreDoc scoreDoc : hits.scoreDocs) {
org.apache.lucene.document.Document doc = searcher.getDocument(scoreDoc);
String nomeArquivo = doc.get(LuceneConstants.FILE_PATH);
System.out.println(nomeArquivo);
}
}
System.out.println("-----");
{
// Second file
Indexer indexer = new Indexer("index");
indexer.createIndex("f2.txt");
indexer.close();
Searcher searcher = new Searcher(Constante.DIR_INDEX.getValor());
TopDocs hits = searcher.search("Art. 1°");
for (ScoreDoc scoreDoc : hits.scoreDocs) {
org.apache.lucene.document.Document doc = searcher.getDocument(scoreDoc);
String nomeArquivo = doc.get(LuceneConstants.FILE_PATH);
System.out.println(nomeArquivo);
}
}
}
It works perfectly fine until the "// Second file" line.
After I index my second file, I can no longer find anything from my first file.
If I create one instance of Indexer, use that same instance to index f1.txt and f2.txt, and then close it, it works the way I want. The problem is that if I close my application, reopen it, and index another file, I lose both f1.txt and f2.txt.
Is there a way to make Lucene always keep the previous index when it indexes a new file?
Looks like you are using an old version of Lucene (3.6 or below), correct?
The third argument to the IndexWriter constructor specifies whether it should create a new index or open an existing one. If set to true, it will overwrite the existing index, if one exists in the given directory. If you want to open an existing index without overwriting it, it should be false:
writer = new IndexWriter(indexDirectory, new StandardAnalyzer(Version.LUCENE_36), false, IndexWriter.MaxFieldLength.UNLIMITED);
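The create flag behaves much like opening an ordinary file in truncate mode versus append mode. The sketch below is only an analogy built on java.nio.file, not Lucene code: a temporary text file stands in for the index directory, and the hypothetical method writeTwice opens it twice in a row, the way the question creates two Indexer instances.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class CreateVsAppendDemo {

    /**
     * Opens the same file twice, once per "document".
     * create = true wipes the previous content on each open (like create = true on the writer),
     * create = false keeps it and adds to the end.
     */
    static List<String> writeTwice(boolean create) {
        try {
            Path file = Files.createTempFile("index-demo", ".txt");
            try {
                for (String name : new String[] {"f1.txt", "f2.txt"}) {
                    byte[] line = (name + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
                    StandardOpenOption mode = create
                            ? StandardOpenOption.TRUNCATE_EXISTING
                            : StandardOpenOption.APPEND;
                    Files.write(file, line, StandardOpenOption.WRITE, mode);
                }
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeTwice(true));  // [f2.txt]
        System.out.println(writeTwice(false)); // [f1.txt, f2.txt]
    }
}
```

One difference from the analogy: passing false to the Lucene constructor expects an index to exist already, so the very first run against an empty directory needs separate handling.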

Create a lucene romanian stemmer in java netbeans

I need to build a simple search engine that can recognize and stem Romanian words, including those with diacritics. I used RomanianAnalyzer, but it does not stem correctly when the same word is written with and without diacritics.
Can you help me with code for adding or modifying a Romanian stemmer?
PS: I edited the question to make it clearer.
You can copy the RomanianAnalyzer source to create a custom analyzer, and add a filter to the analysis chain in the createComponents method. ASCIIFoldingFilter would probably be what you are looking for. I would add it to the end, to be sure that you don't mess up the stemmer when removing the diacritics.
public final class RomanianASCIIAnalyzer extends StopwordAnalyzerBase {
    private final CharArraySet stemExclusionSet;
    public final static String DEFAULT_STOPWORD_FILE = "stopwords.txt";
    private static final String STOPWORDS_COMMENT = "#";

    public static CharArraySet getDefaultStopSet() {
        return DefaultSetHolder.DEFAULT_STOP_SET;
    }

    private static class DefaultSetHolder {
        static final CharArraySet DEFAULT_STOP_SET;
        static {
            try {
                DEFAULT_STOP_SET = loadStopwordSet(false, RomanianAnalyzer.class,
                        DEFAULT_STOPWORD_FILE, STOPWORDS_COMMENT);
            } catch (IOException ex) {
                throw new RuntimeException("Unable to load default stopword set");
            }
        }
    }

    public RomanianASCIIAnalyzer() {
        this(DefaultSetHolder.DEFAULT_STOP_SET);
    }

    public RomanianASCIIAnalyzer(CharArraySet stopwords) {
        this(stopwords, CharArraySet.EMPTY_SET);
    }

    public RomanianASCIIAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionSet) {
        super(stopwords);
        this.stemExclusionSet = CharArraySet.unmodifiableSet(CharArraySet.copy(stemExclusionSet));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new StandardTokenizer();
        TokenStream result = new StandardFilter(source);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopwords);
        if (!stemExclusionSet.isEmpty())
            result = new SetKeywordMarkerFilter(result, stemExclusionSet);
        result = new SnowballFilter(result, new RomanianStemmer());
        // The following line is the addition made to the RomanianAnalyzer source.
        result = new ASCIIFoldingFilter(result);
        return new TokenStreamComponents(source, result);
    }
}

Simple lucene example not working

I am experimenting with Lucene to see how it can help me, and I am unable to get a very simple example working. I am using Lucene 5.1.
My expectation is that when I search, the document ID of the document I added to the index is printed to the console. Instead I get nothing, and no errors (just "Done" printed to the console at the end).
Here is my code:
public static void main(String[] args) throws Exception {
// create structure on file system
IndexWriter writer = createOrGetIndexWriter(LocalDate.now());
writer.close();
// open for writing
writer = createOrGetIndexWriter(LocalDate.now());
Document document = new Document();
document.add(new IntField("test_field", 1, Field.Store.YES));
// write document and close.
writer.addDocument(document);
writer.commit();
writer.close();
// open reader
IndexReader reader = getIndexReader(LocalDate.now());
IndexSearcher indexSearcher = new IndexSearcher(reader);
Query q = new TermQuery(new Term("test_field", "1"));
// callback should be synchronous
indexSearcher.search(q, new SimpleCollector() {
@Override
public void collect(int i) throws IOException {
System.out.println(i);
}
@Override
public boolean needsScores() {
return false;
}
});
System.out.println("Done");
}
public static IndexWriter createOrGetIndexWriter(LocalDate date) throws Exception {
Directory directory = FSDirectory.open(Paths.get(date.toString()));
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
return new IndexWriter(directory, iwc);
}
public static IndexReader getIndexReader(LocalDate date) throws Exception {
return DirectoryReader.open(FSDirectory.open(Paths.get(date.toString())));
}
