Lucene : Search with partial words - java

I am working on integrating Lucene into our application. Searching basically works: for example, when I search for "Upload" and a document contains the text "Upload", the document is found. But when I search for "Uplo", nothing is returned. Any ideas?
Code :
Directory directory = FSDirectory.open(path);
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser queryParser = new QueryParser("contents", new SimpleAnalyzer());
Query query = queryParser.parse(text);
TopDocs topDocs = indexSearcher.search(query, 50);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
    objectIds.add(Integer.valueOf(document.get("id")));
    System.out.println("");
    System.out.println("id " + document.get("id"));
    System.out.println("content " + document.get("contents"));
}
return objectIds;
Thank you.

'Upload' is probably stored as ONE token in your Lucene index, where a token is the smallest unit and cannot be split any further. If you want to match partial words like 'Uplo', it is better to use Lucene n-gram indexing. Note that n-gram indexing increases the space requirements of your inverted index.
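As a rough sketch (assuming Lucene 7.4+, where NGramTokenFilter takes minGram/maxGram/preserveOriginal arguments; the class name is just an example), an index-time analyzer along these lines would let 'Uplo' match 'Upload'. At query time you would analyze the search term without the n-gram filter, so 'Uplo' is looked up as a single token.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

// Splits text into words, lowercases them, then emits 3- to 5-character
// n-grams of each word ("Upload" -> "upl", "uplo", "uploa", "plo", ...).
public class NGramWordAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LetterTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new NGramTokenFilter(result, 3, 5, true); // true = also keep the full word
        return new TokenStreamComponents(source, result);
    }
}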

You can use wildcard searches.
"?" symbol for single character wildcard search and "*" symbol for Multiple character wildcard searches (0 or more characters).
example - "Uplo*"

Change
Query query = queryParser.parse(text);
To
Query query = queryParser.parse("*"+text+"*");
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match that with the single character replaced. For example, to search for "text" or "test" you can use the search:
te?t
Multiple character wildcard searches look for 0 or more characters. For example, to search for test, tests or tester, you can use the search:
test*
You can also use the wildcard searches in the middle of a term.
te*t
Note: By default you cannot use a * or ? symbol as the first character of a search; the classic QueryParser rejects leading wildcards unless you enable them.
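So the "*"+text+"*" change shown above needs that setting switched on. A minimal sketch, reusing the field and analyzer from the question (the escape() call is just a precaution for user input; leading wildcards can be slow on large indexes):
QueryParser queryParser = new QueryParser("contents", new SimpleAnalyzer());
queryParser.setAllowLeadingWildcard(true); // needed for a "*" at the start of a term
Query query = queryParser.parse("*" + QueryParser.escape(text) + "*"); // e.g. "Uplo" -> *Uplo*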

Related

How to deal with identifier fields in Lucene?

I've stumbled upon a problem similar to the one described in this other question: I have a field named something like 'type', which is an identifier, i.e., it is case sensitive and I want to use it for exact searches: no tokenisation, no similarity searches, just a plain "find exactly 'Sport:01'". I might benefit from 'Sport*', but that isn't very important in my case.
I cannot make it work. I thought the right kind of field to store this would be StringField.TYPE_STORED, with DOCS_AND_FREQS_AND_POSITIONS and setOmitNorms(true). However, this way I can't correctly resolve a query like +type:"RockMusic" +title:"a sample title" using the standard analyzer, because, as far as I understand, the analyzer converts the input into lower case (i.e., rockmusic) while the type is stored in its original mixed-case form (hence I cannot resolve it even if I remove the title clause).
I'd like to mix case-insensitive search over title with case-sensitive search over type, since I have cases where type := BRAIN is an acronym and is different from 'Brain'.
So, what's the best way to manage fields and searches like the above? Are there alternatives other than text and string fields?
I'm using Lucene 6.6.0, but this is a general issue, regarding multiple (all?) Lucene versions.
Some code showing details is here (see testIdMixedCaseID*). The real use case is rather more complicated, if you want to give a look, the problem is with the field CC_FIELD, which might be 'BioProc' and nothing can be found in such a case.
Please note that I need to use plain Lucene, not Solr or Elasticsearch.
The following notes are based on Lucene 8.x, not on Lucene 6.6 - so there may be some syntax differences - but I take your point about how any such differences should be coincidental to your question.
Here are some notes, where I will focus on the following aspect of your question:
However, this way I can't correctly resolve a query like: +type:"RockMusic" +title:"a sample title" using the standard analyzer
I think there are 2 parts to this:
Firstly, the query example using "a sample title" will - as you say - not work well with how a standard analyzer works - for the reasons you state.
But, secondly, it is possible to combine the two types of query you want to use, in a way which I believe gets you what you need: An exact match for the type field (e.g. RockMusic) and a more traditional tokenized & case-insensitive result for the title field (a sample title).
Here is how I would do that:
Here is some simple test data:
public static void buildIndex() throws IOException {
    final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    Document doc;
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "another different title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "Rock Music", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);
    }
}
Here is the query code:
public static void doSearch() throws QueryNodeException, ParseException, IOException {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
    IndexSearcher searcher = new IndexSearcher(reader);

    TermQuery typeQuery = new TermQuery(new Term("type", "RockMusic"));

    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("title", analyzer);
    Query titleQuery = parser.parse("A Sample Title");

    Query query = new BooleanQuery.Builder()
            .add(typeQuery, BooleanClause.Occur.MUST)
            .add(titleQuery, BooleanClause.Occur.MUST)
            .build();

    System.out.println("Query: " + query.toString());
    System.out.println();

    TopDocs results = searcher.search(query, 100);
    ScoreDoc[] hits = results.scoreDocs;
    for (ScoreDoc hit : hits) {
        System.out.println("doc = " + hit.doc + "; score = " + hit.score);
        Document doc = searcher.doc(hit.doc);
        System.out.println("Type = " + doc.get("type")
                + "; Title = " + doc.get("title"));
        System.out.println();
    }
}
The output from the above query is as follows:
Query: +type:RockMusic +(title:a title:sample title:title)
doc = 0; score = 0.7016101
Type = RockMusic; Title = a sample title
doc = 1; score = 0.2743341
Type = RockMusic; Title = another different title
As you can see, this query is a little different from the one taken from your question.
But the list of found documents shows that (a) the Rock Music document was not found at all (good - because Rock Music does not match the "type" search term of RockMusic); and (b) the title a sample title got a far higher match score than the another different title document, when searching for A Sample Title.
Additional notes:
This query works by combining a StringField exact search with a more traditional TextField tokenized search - this latter search being processed by the StandardAnalyzer (matching how the data was indexed in the first place).
I am making an assumption about the score ranking being useful to you - but for title searches, I think that is reasonable.
This approach would also apply to your BRAIN vs. brain example, for StringField data.
(I also assume that, for a user interface, a user could select the "RockMusic" type value from a drop-down, and enter the "A Sample Title" search in an input field - but this is getting off-topic, I think).
You could obviously enhance the analyzer to include stop-words, and so on, as needed.
Of course, my examples involve hard-coded data - but it would not take much to generalize this approach to handle dynamically-provided search terms.
Hope that this makes sense - and that I understood the problem correctly.
Going to answer myself...
I discovered what @andrewjames outlines in his excellent analysis by running a number of tests of my own. Essentially, fields like "type" don't play well with the standard analyzer and are best indexed and searched with an analyzer like KeywordAnalyzer, which in practice stores the original value as-is and searches it accordingly.
Most real cases are like my example, i.e., mixed ID-like fields, which need exact matching, plus fields like 'title' or 'description', which best serve user searches with per-token matching, word-based scoring, stop-word elimination, etc.
Because of that, PerFieldAnalyzerWrapper (see also my sample code, linked above) helps a lot: it is a wrapper analyzer that dispatches analysis to field-specific analyzers on a field-name basis.
One thing to add is that it still isn't clear to me which analyzer is used when a query is built without a parser (e.g., using new TermQuery(new Term(fname, fval))), so for now I use a QueryParser.
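For reference, a minimal sketch of such a per-field setup (the field names are just the ones from this discussion); the same wrapper instance would be passed both to the IndexWriterConfig and to the QueryParser so that index-time and query-time analysis stay consistent:
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Exact, case-sensitive matching for "type"; normal tokenized matching for everything else.
Map<String, Analyzer> perField = new HashMap<>();
perField.put("type", new KeywordAnalyzer());
Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);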

JPA Select query not returning results with one letter word

I have a query that, when given a term starting with a one-letter word followed by a space and then another word (e.g. "T Distribution"), does not return results, while "Distribution" alone returns results, including those for "T Distribution". The behaviour is the same for all search terms beginning with a one-letter word followed by a space and another word.
The problem appears when the search term has the pattern "[one-letter][space][letter/word]", for example "o ring".
What could be the reason that the LIKE operator does not work correctly in this case?
Here is my query:
@Cacheable(value = "filteredConcept")
@Query("SELECT NEW sina.backend.data.model.ConceptSummaryVer04(s.id, s.arabicGloss, s.englishGloss, s.example, s.dataSourceId, "
        + "s.synsetFrequnecy, s.arabicWordsCache, s.englishWordsCache, s.superId, s.categoryId, s.dataSourceCacheAr, s.dataSourceCacheEn, "
        + "s.superTypeCasheAr, s.superTypeCasheEn, s.area, s.era, s.rank, s.undiacritizedArabicWordsCache, s.normalizedEnglishWordsCache, "
        + "s.isTranslation, s.isGloss, s.arabicSynonymsCount, s.englishSynonymsCount) FROM Concept s "
        + "where s.undiacritizedArabicWordsCache LIKE %:searchTerm% AND data_source_id != 200 AND data_source_id != 31")
List<ConceptSummaryVer04> findByArabicWordsCacheAndNotConcept(@Param("searchTerm") String searchTerm, Sort sort);
the result of the query on the database itself:
link to screenshot
results on the database are returned no matter the letters case:
link to screenshot
I solved this problem.
It was due to the default configuration of the full-text index in the MySQL database, whose minimum word length setting (ft_min_word_len) excludes such short words.
I changed that and rebuilt the index; after that, one-letter words were returned by the query.
12.9.6 Fine-Tuning MySQL Full-Text Search
Use some quotes:
LIKE '%:searchTerm%';
Set searchTerm = "%your_word%" and use it in the query like this:
... s.undiacritizedArabicWordsCache LIKE :searchTerm ...
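In other words (a sketch assuming the repository method from the question; repository and userInput are just placeholders for your Spring bean and the user's input):
// The JPQL then uses: ... s.undiacritizedArabicWordsCache LIKE :searchTerm ...
String searchTerm = "%" + userInput + "%"; // add the wildcards in Java, not in the JPQL
List<ConceptSummaryVer04> results =
        repository.findByArabicWordsCacheAndNotConcept(searchTerm, sort);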

Elasticsearch won't find any results for a search containing letters

I'm beginning with Elasticsearch in Java.
I can make a query that matches documents having a property containing a text; this property is a string.
The results are strange: when I search for numbers in a string I get some results, but as soon as the query contains a letter, no results are returned.
Here is a summary of the current behavior:
I have 2 documents :
{
    "model": "123",
    "serialNumber": "123"
}
and
{
    "model": "123",
    "serialNumber": "TT123"
}
If I'm searching for "123" I have 2 results => OK
If I'm searching for "TT", I have no results.
I'm using wildcard.
Here is a sample of my code:
BoolQueryBuilder bqb = new BoolQueryBuilder();
bqb.should(new WildcardQueryBuilder("serialNumber", "*TT*"));
/*or bqb.should(new WildcardQueryBuilder("serialNumber", "*123*"));*/
return QueryBuilders.filteredQuery(bqb, null);
Does "*tt*" find 2 results? Wildcard queries are not analyzed, but your analyzed index probably is so the index contains "tt123", which will not match "*TT*" but "*tt*" will.
That said, Wildcards are slow, you should should look into other analyzers, such as ngram to create your index.
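For example (a sketch using the same older Java client API as in the question; userInput stands for whatever the user typed):
// Lowercase the term before building the wildcard query, because the standard
// analyzer has already lowercased the indexed value ("TT123" is stored as "tt123").
String term = userInput.toLowerCase();
BoolQueryBuilder bqb = new BoolQueryBuilder();
bqb.should(new WildcardQueryBuilder("serialNumber", "*" + term + "*"));
return QueryBuilders.filteredQuery(bqb, null);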

Lucene error while parsing Query: Cannot parse '': Encountered "<EOF>" at line 1, column 0

I want to parse some text using Lucene query parser to carry out basic text preprocessing on the texts. I used following lines of code:
Analyzer analyzer = new EnglishAnalyzer();
QueryParser parser = new QueryParser("", analyzer);
String text = "...";
String ret = parser.parse(QueryParser.escape(text)).toString();
But, I am getting an error:
Exception in thread "main" org.apache.lucene.queryparser.classic.ParseException: Cannot parse '': Encountered "<EOF>" at line 1, column 0.
Using QueryParser.escape() removes the special characters. However, it doesn't remove AND, NOT and OR, which are keywords in Lucene's query syntax.
There are two ways to deal with it:
Replace AND, NOT, OR in the query string.
Convert the query string to lower case.
Converting to lower case resolves the issue because only the upper-case AND, NOT, OR are keywords; in lower case they are treated as regular words.
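For example, a minimal sketch of the second option applied to the code from the question (text stands for the input string, as above):
Analyzer analyzer = new EnglishAnalyzer();
QueryParser parser = new QueryParser("", analyzer);
// Escape Lucene syntax characters, then lowercase so AND / OR / NOT are treated
// as ordinary words rather than boolean operators.
String sanitized = QueryParser.escape(text).toLowerCase();
String ret = parser.parse(sanitized).toString();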
For those who face this problem: I realized that my parser threw an exception for the word "NOT" even after escaping, and I had to replace it manually with another word.

Lucene Search returning no results even if it should

I am creating a Lucene search application. I use multiple instances of IndexWriter with different analyzers and a corresponding IndexSearcher for each, but the search results returned are empty even though I know I have indexed the particular word I'm searching for.
Here is my SearchEngine class constructor
this.indexers = new ArrayList<StandardIndexer>();
this.indexers.add(new StandardIndexer(new StandardAnalyzer()));
this.indexers.add(new StandardIndexer(new EnglishStemAnalyzer()));
this.indexers.add(new StandardIndexer(new KeywordAnalyzer()));
this.indexers.add(new StandardIndexer(new EnglishSynonymAnalyzer()));
this.indexers.add(new StandardIndexer(new EnglishSynonymStemAnalyzer()));
this.indexers.add(new StandardIndexer(new EnglishSynonymKeywordAnalyzer()));
this.searchers = new ArrayList<StandardSearcher>();
for (StandardIndexer indexer : this.indexers) {
    this.searchers.add(new StandardSearcher(indexer));
}
StandardIndexer and StandardSearcher are my implementations of the indexer and searcher. As you can see, the indexer instance is used to create the searcher, so the directory and the type of analyzer used are shared between each indexer/searcher pair.
Your question is about a bug in code we cannot see, so I can only answer in general terms:
You have to use (nearly) the same analyzer for indexing and for searching, so that in the end the token (the word after stemming) in the index matches the token in the query - see the sketch after these notes.
You must be sure that a commit has finished before you can search for an indexed document.
Be aware that the standard query parser splits words at whitespace. You cannot search for tokens containing whitespace without extra work (escaping the whitespace, searching with a phrase query, ...).
You can use Luke to inspect your index directory: https://github.com/DmitryKey/luke
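As an illustration of the first two points, here is a minimal, self-contained sketch (not your code, since that isn't shown): the same analyzer is used for indexing and for the query parser, and the reader is only opened after the writer has committed.
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public static void indexThenSearch() throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    Directory dir = FSDirectory.open(Paths.get("demo-index"));
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
        Document doc = new Document();
        doc.add(new TextField("contents", "upload finished", Field.Store.YES));
        writer.addDocument(doc);
        writer.commit(); // make the document visible to readers opened afterwards
    }
    try (IndexReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("contents", analyzer).parse("upload");
        System.out.println("hits: " + searcher.search(query, 10).totalHits);
    }
}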
