I want to search PDF text for a phrase such as "Labor Law". But the results include every file that contains the words "Labor" and "Law". Could anyone help by checking my code below:
EnglishAnalyzer analyzer = new EnglishAnalyzer();
analyzer.setVersion(Version.LATEST);
QueryParser parser = new QueryParser("content", analyzer);
Query query = parser.parse("Labor Law");
Directory indexDirectory = FSDirectory.open(new File(indexLucencePath));
DirectoryReader dirReader = DirectoryReader.open(indexDirectory);
indexSearcher = new IndexSearcher(dirReader);
ScoreDoc[] queryResults = indexSearcher.search(query, numOfResults).scoreDocs;
List<IndexItem> results = new ArrayList<IndexItem>();
for (ScoreDoc scoreDoc : queryResults) {
Document doc = indexSearcher.doc(scoreDoc.doc);
results.add(new IndexItem(doc.get(IndexItem.TITLE), doc.get(IndexItem.CONTENT)));
}
Try one of the following.
Phrase query:
Query query = parser.parse("\"Labor Law\"");
All terms must be present:
Query query = parser.parse("+Labor +Law");
You can also build the query yourself, like this:
BooleanQuery query = new BooleanQuery();
// TermQuery is not analyzed, so use the lowercase form that the analyzer indexed.
TermQuery clause1 = new TermQuery(new Term("content", "labor"));
TermQuery clause2 = new TermQuery(new Term("content", "law"));
query.add(new BooleanClause(clause1, BooleanClause.Occur.MUST));
query.add(new BooleanClause(clause2, BooleanClause.Occur.MUST));
Several analyzers are available; try different ones to see which fits your requirement. See "Comparison of Lucene Analyzers". "Lucene: Multi-word phrases as search terms" may also help.
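As a small illustration of the two query strings suggested above, here is a plain-Java sketch that builds them from user input. The class and method names (QueryStrings, asPhrase, allTermsRequired) are made up for this example and are not part of Lucene; the result is only a string that you would then pass to parser.parse(...).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Lucene class): builds strings in the classic QueryParser syntax.
class QueryStrings {

    // Wraps the input in double quotes so the parser reads it as one phrase.
    // Backslashes and embedded quotes are escaped so they cannot end the phrase early.
    static String asPhrase(String input) {
        String escaped = input.trim().replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Prefixes every whitespace-separated word with '+', marking each one as required.
    // Note: other query-syntax characters in the words are not escaped in this sketch.
    static String allTermsRequired(String input) {
        List<String> parts = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                parts.add("+" + word);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(asPhrase("Labor Law"));          // "Labor Law" (with the quotes)
        System.out.println(allTermsRequired("Labor Law"));  // +Labor +Law
    }
}
```

The phrase form requires the words to be adjacent and in order, while the '+' form only requires that both occur somewhere in the field.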
I am using Lucene 6.6.0 version, and I am indexing my data using StandardAnalyzer.
I am indexing the following values:
a&e networks
a&e
After indexing, when I search for a&e, no results are returned.
This is my sample code:
Directory dir = new RAMDirectory();
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
Document doc = new Document();
doc.add(new TextField("text", "a&e networks", Field.Store.YES));
writer.addDocument(doc);
doc = new Document();
doc.add(new TextField("text", "a&e", Field.Store.YES));
writer.addDocument(doc);
writer.close();
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Query query = new TermQuery(new Term("text", "a&e"));
TopDocs results = searcher.search(query, 5);
final ScoreDoc[] scoreDocs = results.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
System.out.println(scoreDoc.doc + " " + scoreDoc.score + " " + searcher.doc(scoreDoc.doc).get("text"));
}
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore());
I am getting this output:
Hits: 0
Max score:NaN
Even a search for a returns no results in this case.
But if I pass a stop word set to StandardAnalyzer like this
List<String> stopWords = Arrays.asList("&");
CharArraySet stopSet = new CharArraySet(stopWords, false);
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer(stopSet));
and then search for a, I do get results. But even in that case, a search for a&e returns nothing.
Please suggest how to achieve this. My goal is that a search for a&e returns results. Do I need a CustomAnalyzer? If so, please explain what I should add to it.
The & character is probably treated as a word boundary:
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
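a is probably treated as a stop word and removed at index time (it is in the default English stop word list); e may survive as a single-letter token. If & is a word boundary, the literal term a&e is never indexed, so a TermQuery for it finds nothing.
You can try some randomly generated keywords separated by the & character (e.g. adsadaerewfds&eqeqwedasd). After indexing, try searching for the keywords before and after the &. If those keywords are found, either store the values without analyzing them (you can use StringField) or create a custom analyzer.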
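To make the effect concrete, here is a rough plain-Java stand-in for what the analysis chain is suspected of doing. It is not Lucene's StandardTokenizer: the splitting rule (any non-letter, non-digit character) and the short stop word list are assumptions chosen only for this sketch, as are the names AmpersandDemo, analyze and exactTermMatch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified simulation: split on non-alphanumerics, lowercase, drop stop words.
// This only mimics the behaviour described above; it is NOT Lucene's tokenizer.
class AmpersandDemo {

    // Assumed stop words for the demo; a real analyzer's list is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "of"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            String token = piece.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // TermQuery-style lookup: the search term is compared as-is, with no analysis.
    static boolean exactTermMatch(String indexedText, String term) {
        return analyze(indexedText).contains(term);
    }

    public static void main(String[] args) {
        System.out.println(analyze("a&e networks"));       // [e, networks]
        System.out.println(exactTermMatch("a&e", "a&e"));  // false
        System.out.println(exactTermMatch("a&e", "e"));    // true
    }
}
```

Under these assumptions the term a&e never exists in the index, which matches the zero hits in the question. Indexing the value unanalyzed (the StringField suggestion) keeps a&e as one term instead.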
I want to search with a string containing many words and retrieve the documents that match any of them. My indexing method is the following:
Document document = new Document();
document.add(new TextField("termos", text, Field.Store.YES));
document.add(new TextField("docNumber",fileNumber,Field.Store.YES));
// Build the analyzer first, then create the writer config from it.
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("lowercase")
.addTokenFilter("stop")
.addTokenFilter("porterstem")
.addTokenFilter("capitalization")
.build();
config = new IndexWriterConfig(analyzer);
writer = new IndexWriter(indexDirectory, config);
writer.addDocument(document);
writer.commit();
And here is my search method. I don't want to look for the specific phrase, but for any of the words in it. The analyzer used for searching is the same one used for indexing.
Query query = new QueryBuilder(analyzer).createPhraseQuery("termos","THE_PHRASE");
String indexDir = rootProjectFolder + "/indexDir/";
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(1000,1000);
searcher.search(query,collector);
I'm new to Lucene. Can someone help me?
Using createPhraseQuery("termos", "list of words") will precisely try to match the phrase "list of words" with a phrase slop of 0.
If you want to match any term in a list of words, you can use createBooleanQuery:
new QueryBuilder(analyzer).createBooleanQuery("termos", terms, BooleanClause.Occur.SHOULD);
As an alternative, you can use createMinShouldMatchQuery() to require that a fraction of the query terms match, e.g. at least 10 percent of the terms:
new QueryBuilder(analyzer).createMinShouldMatchQuery("termos", terms, 0.1f);
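The difference between "any term" and "at least a fraction of the terms" can be sketched in plain Java over sets of terms. Nothing below is Lucene code: the names ShouldMatchDemo, matchesAny and matchesFraction are hypothetical, and the rounding (truncate fraction times term count, with a floor of one) is an assumption made for this sketch rather than a statement about Lucene's internals.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: models query terms and document terms as plain strings.
class ShouldMatchDemo {

    // Counts how many of the query terms occur in the document's term set.
    static int matchedTerms(Set<String> docTerms, List<String> queryTerms) {
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // "Any of the words": one matching term is enough.
    static boolean matchesAny(Set<String> docTerms, List<String> queryTerms) {
        return matchedTerms(docTerms, queryTerms) >= 1;
    }

    // "At least a fraction of the words". Assumed rounding: truncate, but never below 1.
    static boolean matchesFraction(Set<String> docTerms, List<String> queryTerms, float fraction) {
        int required = Math.max(1, (int) (fraction * queryTerms.size()));
        return matchedTerms(docTerms, queryTerms) >= required;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("labor", "law", "tax", "court");
        Set<String> docA = new HashSet<>(Arrays.asList("labor", "law", "reform"));
        Set<String> docB = new HashSet<>(Arrays.asList("labor", "union"));

        System.out.println(matchesAny(docA, query));             // true
        System.out.println(matchesAny(docB, query));             // true
        System.out.println(matchesFraction(docA, query, 0.5f));  // true  (2 of 4 matched)
        System.out.println(matchesFraction(docB, query, 0.5f));  // false (1 of 4 matched)
    }
}
```

The example uses 0.5f rather than the 0.1f from the answer so that the two modes give visibly different results on a four-word query.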
I have a problem migrating a Lucene field from version 3.0.3 to 5.x. I prepared two JUnit tests (one with 3.0.3 and the other with 5.x) to compare the behavior.
Lucene 3:
analyzer = new StandardAnalyzer(Version.LUCENE_30);
indexWriter = new IndexWriter(dir, analyzer, true, MaxFieldLength.UNLIMITED);
....
Document doc = new Document();
doc.add(new Field("keyword", "another test#foo-bar", Field.Store.YES,
Field.Index.ANALYZED));
indexWriter.addDocument(doc);
indexWriter.commit();
....
indexReader = IndexReader.open(FSDirectory.open(path.toFile()), false);
searcher = new IndexSearcher(indexReader);
QueryParser parser = new QueryParser(Version.LUCENE_30, "keyword", analyzer);
Query query = parser.parse("test");
searcher.search(query, searcher.maxDoc());
TopDocs topDocs = searcher.search(query, searcher.maxDoc());
ScoreDoc[] hits = topDocs.scoreDocs;
doc = indexReader.document(hits[0].doc);
// doc is now NULL <- EXPECTED
assertNull(doc);
The equivalent test with Lucene 5.x (only the changed lines are shown):
analyzer = new StandardAnalyzer();
IndexWriterConfig indexConfig = new IndexWriterConfig(analyzer)
.setCommitOnClose(true).setOpenMode(openMode);
// create the index writer
indexWriter = new IndexWriter(dir, indexConfig);
...
// line old style (Lucene 3)
doc.add(new Field("keyword", "another test#foo-bar", Field.Store.YES,
Field.Index.ANALYZED));
// or with new field types (enable only one line)
doc.add(new TextField("keyword", "another test#foo-bar", Field.Store.YES));
...
Query query = new QueryParser(field, analyzer).parse(field + ":"
+ value);
doc = indexReader.document(hits[0].doc);
// returns a document each time
assertNull(doc); // fails!
I used the migration document at https://lucene.apache.org/core/4_8_0/MIGRATE.html to replace the Field class with the TextField class. But the search works differently.
Question: How can I create the same result with the new Lucene 5.x as before with Lucene 3?
The Lucene 3 analyzer seems to split the input string on spaces only. The Lucene 5 version of the analyzer seems to split on space, '#' and '-'. :/
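The observation above can be reproduced without Lucene. The following plain-Java sketch only contrasts the two splitting behaviours that the tests seem to show; it does not use or describe the actual analyzers, and the names SplitComparison, splitOnWhitespace and splitOnWhitespaceHashHyphen are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Simulation of the two observed behaviours; neither method is a real Lucene analyzer.
class SplitComparison {

    // Behaviour that the Lucene 3 test appears to show: tokens break on whitespace only.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Behaviour that the Lucene 5 test appears to show: '#' and '-' also break tokens.
    static List<String> splitOnWhitespaceHashHyphen(String text) {
        return Arrays.asList(text.trim().split("[\\s#-]+"));
    }

    public static void main(String[] args) {
        String value = "another test#foo-bar";

        List<String> oldStyle = splitOnWhitespace(value);
        List<String> newStyle = splitOnWhitespaceHashHyphen(value);

        System.out.println(oldStyle);                   // [another, test#foo-bar]
        System.out.println(oldStyle.contains("test"));  // false: no hit for "test"
        System.out.println(newStyle);                   // [another, test, foo, bar]
        System.out.println(newStyle.contains("test"));  // true: "test" is now its own token
    }
}
```

If this is what happens, the query for test finds a document in 5.x simply because test has become a separate indexed term.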
I am trying to use WildcardQuery:
IndexSearcher indexSearcher = new IndexSearcher(ireader);
Term term = new Term("phrase", QueryParser.escape(partOfPhrase) + "*");
WildcardQuery wildcardQuery = new WildcardQuery(term);
LOG.debug(partOfPhrase);
Sort sort = new Sort(new SortField("freq", SortField.Type.LONG,true));
ScoreDoc[] hits = indexSearcher.search(wildcardQuery, null, 10, sort).scoreDocs;
But when I enter "san " (without the quotes, with a trailing space), I want to get results like
"san diego", "san antonio", etc. Instead I get not only these but also "sandals" (there must be a space after san) and "juelz santana" (I only want phrases that start with san). How can I fix this?
EDIT
Also, if I enter "san d", I get no results.
One way to solve the problem is to use a different analyzer, one that does not split the query or the document text on spaces.
One such analyzer is KeywordAnalyzer, which treats the whole value as a single token.
Essential part of the test:
Directory dir = new RAMDirectory();
Analyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
Later on, add the needed docs:
Document doc = new Document();
doc.add(new TextField("text", "san diego", Field.Store.YES));
writer.addDocument(doc);
And finally, search as you want:
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Term term = new Term("text", QueryParser.escape("san ") + "*");
WildcardQuery wildcardQuery = new WildcardQuery(term);
My test works properly: it retrieves san diego and san antonio but not sandals. The full test is here - https://github.com/MysterionRise/information-retrieval-adventure/blob/master/src/main/java/org/mystic/lucene/WildcardQueryWithSpace.java
For more information about the analyzer itself - http://lucene.apache.org/core/4_10_2/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html
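Why indexing the whole value as one token changes the result can be shown with a plain-Java sketch. It models a trailing-* wildcard as a simple prefix test, once against whole values and once against whitespace-separated tokens. It is a simulation only, and the names PrefixDemo, wholeValuePrefix and anyTokenPrefix are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation: a "prefix*" wildcard modelled as String.startsWith. Not Lucene code.
class PrefixDemo {

    // Whole value kept as one token (the KeywordAnalyzer idea): the prefix,
    // including any space in it, is tested against the complete value.
    static List<String> wholeValuePrefix(List<String> values, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            if (value.startsWith(prefix)) {
                hits.add(value);
            }
        }
        return hits;
    }

    // Value split on whitespace: the prefix is tested against each token separately,
    // so a space inside the prefix can never match.
    static List<String> anyTokenPrefix(List<String> values, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            for (String token : value.split("\\s+")) {
                if (token.startsWith(prefix)) {
                    hits.add(value);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("san diego", "san antonio", "sandals", "juelz santana");

        System.out.println(wholeValuePrefix(phrases, "san "));   // [san diego, san antonio]
        System.out.println(wholeValuePrefix(phrases, "san d"));  // [san diego]
        System.out.println(anyTokenPrefix(phrases, "san"));      // all four phrases
        System.out.println(anyTokenPrefix(phrases, "san d"));    // []
    }
}
```

The token-based variant reproduces both symptoms from the question: sandals and juelz santana match san, and san d matches nothing.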