I'm beginning with Elasticsearch in Java.
I can make a query that matches documents whose string property contains a given text.
The results are strange: when I search for numbers in a string I get some results, but as soon as the query contains a letter, no results are returned.
Here is a summary of the current behavior:
I have 2 documents :
{
"model": "123",
"serialNumber": "123"
}
and
{
"model": "123",
"serialNumber": "TT123"
}
If I search for "123", I get 2 results => OK.
If I search for "TT", I get no results.
I'm using a wildcard query.
Here is a sample of my code:
// Should-match on serialNumber containing "TT" (or "123")
BoolQueryBuilder bqb = new BoolQueryBuilder();
bqb.should(new WildcardQueryBuilder("serialNumber", "*TT*"));
/* or bqb.should(new WildcardQueryBuilder("serialNumber", "*123*")); */
return QueryBuilders.filteredQuery(bqb, null);
Does "*tt*" find 2 results? Wildcard queries are not analyzed, but your analyzed index probably is so the index contains "tt123", which will not match "*TT*" but "*tt*" will.
That said, Wildcards are slow, you should should look into other analyzers, such as ngram to create your index.
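As a minimal sketch of working around the case mismatch (assuming the field is indexed with the default, lowercasing analyzer; the helper name is illustrative), the wildcard pattern can be lowercased before the query is built:
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.WildcardQueryBuilder;

public class ContainsQueryHelper {
    // Lowercase the user input so the (unanalyzed) wildcard pattern matches the
    // lowercased tokens that the default analyzer writes into the index.
    public static BoolQueryBuilder contains(String field, String userInput) {
        String pattern = "*" + userInput.toLowerCase() + "*"; // "TT" -> "*tt*"
        return QueryBuilders.boolQuery()
                .should(new WildcardQueryBuilder(field, pattern));
    }
}
With that helper, contains("serialNumber", "TT") would match "TT123", which is stored as the token "tt123".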
Related
I have a query that, when given a word that starts with a one-letter word followed by a space character and then another word (e.g. "T Distribution"), does not return results, while "Distribution" alone returns results, including those for "T Distribution". The behavior is the same for all search terms beginning with a one-letter word followed by a space character and then another word.
The problem appears when the search term has this pattern:
"[one-letter][space][letter/word]", for example: "o ring".
What could be the problem that makes the LIKE operator not work correctly in this case?
Here is my query:
@Cacheable(value = "filteredConcept")
@Query("SELECT NEW sina.backend.data.model.ConceptSummaryVer04(s.id, s.arabicGloss, s.englishGloss, s.example, s.dataSourceId, "
    + "s.synsetFrequnecy, s.arabicWordsCache, s.englishWordsCache, s.superId, s.categoryId, s.dataSourceCacheAr, s.dataSourceCacheEn, "
    + "s.superTypeCasheAr, s.superTypeCasheEn, s.area, s.era, s.rank, s.undiacritizedArabicWordsCache, s.normalizedEnglishWordsCache, "
    + "s.isTranslation, s.isGloss, s.arabicSynonymsCount, s.englishSynonymsCount) FROM Concept s "
    + "where s.undiacritizedArabicWordsCache LIKE %:searchTerm% AND data_source_id != 200 AND data_source_id != 31")
List<ConceptSummaryVer04> findByArabicWordsCacheAndNotConcept(@Param("searchTerm") String searchTerm, Sort sort);
The result of the query on the database itself:
link to screenshot
Results on the database are returned regardless of letter case:
link to screenshot
I solved this problem.
It was due to the default configuration of the full-text index in the MySQL database, which is by default set to 2 (ft_min_word_len = 2).
I changed that and rebuilt the index. After that, one-letter words were returned by the query.
See 12.9.6 Fine-Tuning MySQL Full-Text Search in the MySQL manual.
Use some quotes:
LIKE '%:searchTerm%';
Set searchTerm = "%your_word%" and use it in the query like this:
... s.undiacritizedArabicWordsCache LIKE :searchTerm ...
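For example (a sketch; conceptRepository and the pre-built sort are illustrative, assuming the Spring Data repository method shown above):
// Wrap the raw term in SQL wildcards before binding it, so the JPQL can use a
// plain "LIKE :searchTerm" instead of "LIKE %:searchTerm%".
String rawTerm = "o ring";                    // user input
String likePattern = "%" + rawTerm + "%";     // "%o ring%"

// sort: a pre-built org.springframework.data.domain.Sort
List<ConceptSummaryVer04> results =
        conceptRepository.findByArabicWordsCacheAndNotConcept(likePattern, sort);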
I have the following documents stored in my Elasticsearch index (my_index):
{
"name": "111666"
},
{
"name": "111A666"
},
{
"name": "111B666"
}
and I want to be able to query these documents using both the exact value of the name field and a character-trimmed version of that value.
Examples
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "111666"
}
}
}
}
should return all of the (3) documents mentioned above.
On the other hand:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "111a666"
}
}
}
}
should return just one document (the one whose name field exactly matches the provided value).
I haven't found a way to configure the settings of my_index to support such functionality (custom search/index analyzers, etc.).
I should mention that I am using Elasticsearch's Java API (QueryBuilders) to implement the above-mentioned queries, so I thought of doing it the Java way.
Logic
1) Check if the provided query string contains a letter
2) If yes (e.g. 111A666), then search for 111A666 using a standard search analyzer
3) If not (e.g. 111666), then use a custom search analyzer that trims the characters of the `name` field
Questions
1) Is it possible to implement this by somehow configuring how the data is stored/indexed in Elasticsearch?
2) If not, is it possible to conditionally change the analyzer of a field at runtime? (using Java)
You can easily use any built-in analyzer or any custom analyzer to map your document in Elasticsearch. More information on analyzers is here.
The "term" query searches for an exact match. You can find more information about exact matching here (Finding Exact Values).
But you cannot change an index once it is created. If you want to change an index, you have to create a new index and migrate all your data to the new one.
Your question is about applying different logic in the analyzer at index time and at query time.
The solution for your Q1 is to generate two tokens at index time (111a666 -> [111a666, 111666]) but only one token at query time (111a666 -> 111a666 and 111666 -> 111666).
In my humble opinion, you would have to build a new analyzer like
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern_replace-tokenfilter.html that supports "preserve_original" the way https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-capture-tokenfilter.html does.
Or you could use two fields (one with the original value and one without letters) and search over both.
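A rough sketch of that two-field idea with the Java QueryBuilders (the sub-field name name.trimmed is an assumption: it would need a mapping whose index analyzer strips letters but whose search analyzer does not, so that "111a666" is not trimmed at query time):
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// "111666"  -> matches all three documents through the letter-stripped sub-field.
// "111a666" -> matches only the document whose original value is "111A666".
public static BoolQueryBuilder nameQuery(String input) {
    return QueryBuilders.boolQuery()
            .should(QueryBuilders.matchQuery("name", input))
            .should(QueryBuilders.matchQuery("name.trimmed", input));
}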
I am working on integrating Lucene into our application. Lucene currently works: for example, when I search for "Upload" and a document contains the text "Upload", it is found; but when I search for "Uplo", it doesn't work. Any ideas?
Code :
// Open the index and create a searcher
Directory directory = FSDirectory.open(path);
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Parse the search text against the "contents" field
QueryParser queryParser = new QueryParser("contents", new SimpleAnalyzer());
Query query = queryParser.parse(text);
// Fetch the top 50 hits
TopDocs topDocs = indexSearcher.search(query, 50);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
objectIds.add(Integer.valueOf(document.get("id")));
System.out.println("");
System.out.println("id " + document.get("id"));
System.out.println("content " + document.get("contents"));
}
return objectIds;
Thank you.
'Upload' is probably ONE token in your Lucene index, a token being the smallest unit that cannot be split further. If you want to match partial words like 'Uplo', it is better to go for Lucene n-gram indexing. Note that if you go for n-gram indexing you will have higher space requirements for your inverted index.
You can use wildcard searches.
The "?" symbol is for a single-character wildcard search and the "*" symbol is for a multiple-character wildcard search (0 or more characters).
Example: "Uplo*"
Change
Query query = queryParser.parse(text);
To
Query query = queryParser.parse("*"+text+"*");
Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries).
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match that with the single character replaced. For example, to search for "text" or "test" you can use the search:
te?t
Multiple character wildcard searches look for 0 or more characters. For example, to search for test, tests or tester, you can use the search:
test*
You can also use the wildcard searches in the middle of a term.
te*t
Note: You cannot use a * or ? symbol as the first character of a search.
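For reference, the same kinds of searches can also be built programmatically instead of through the query parser (the field name "contents" is borrowed from the question's code; the patterns are the examples above):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

// "te?t"  matches terms like "text" or "test" (single-character wildcard)
Query singleChar = new WildcardQuery(new Term("contents", "te?t"));
// "test*" matches "test", "tests", "tester", ... (multi-character wildcard)
Query multiChar = new WildcardQuery(new Term("contents", "test*"));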
I am using MongoDB with Java and my documents look like:
{
"_id": ObjectId("abcd1234rf54"),
"createdDate": "12/11/15",
"type": 1,
"nameIdentity": [
{"name": "a"},
{"name": "b"},
{"name": "c"}
]
}
Where nameIdentity is an array of name documents. I am trying to query on name and find the index of the matched element.
For example, my query is Document resultDocument = mongoDatabase.getCollection(test).find(new Document("nameIdentity.name", "b")).first();.
When this query is executed it gives me the result/matched document. But what I also want is the index of the match, i.e. at what position in the array the match occurred. Is this possible with this approach, or is there some other way to do it? Any suggestions are highly appreciated.
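One way to recover that index is simply to scan the embedded array of the returned document on the client side; a minimal sketch with the Java driver (the collection name and searched value are taken from the question, null-handling is omitted):
import java.util.List;
import org.bson.Document;

Document result = mongoDatabase.getCollection("test")
        .find(new Document("nameIdentity.name", "b")).first();

// Walk the embedded array and remember the position of the matching element.
int matchIndex = -1;
List<Document> names = (List<Document>) result.get("nameIdentity");
for (int i = 0; i < names.size(); i++) {
    if ("b".equals(names.get(i).getString("name"))) {
        matchIndex = i; // 1 for the sample document above
        break;
    }
}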
Maybe I'm really missing something.
I have indexed a bunch of key/value pairs in Lucene (v4.1 if it matters). Say I have
key1=value1 and key2=value2, e.g. as read from a properties file.
They get indexed both as specific fields and into a catchall "ALL" field, e.g.
new Field("key1", "value1", aFieldTypeMimickingKeywords);
new Field("key2", "value2", aFieldTypeMimickingKeywords);
new Field("ALL", "key1=value1", aFieldTypeMimickingKeywords);
new Field("ALL", "key2=value2", aFieldTypeMimickingKeywords);
// then get added to the Document of course...
I can then do a wildcard search, using
new WildcardQuery(new Term("ALL", "*alue1"));
and it will find the hit.
But it would be nice to get more info, like "what was the complete value (e.g. "key1=value1") that goes with that hit?".
The best I can figure out is to get the Document, then get the list of IndexableFields, then loop over all of them and see if field.stringValue().contains("alue1"). (I can look at the data structures in the debugger and all the info is there.)
This seems completely insane because isn't that what Lucene just did? Shouldn't the hit information return some of the Fields?
Is Lucene missing what seems like "obvious" functionality? Google and staring at the APIs haven't revealed anything straightforward, but I feel like I must be searching for the wrong things.
You might want to try with IndexSearcher.explain() method. Once you get the ID of the matching document, prepare a query for each field (using the same search keywords) and invoke Explanation.isMatch() for each query: the ones that yield true will give you the matched field. Example:
for (String field : fields) {
    Query query = new WildcardQuery(new Term(field, "*alue1"));
    Explanation ex = searcher.explain(query, docID);
    if (ex.isMatch()) {
        // the wildcard query matched this field
    }
}