ElasticSearch - define custom letter order for sorting

ElasticSearch - define custom letter order for sorting - java

I'm using ElasticSearch 2.4.2 (via HibernateSearch 5.7.1.Final from Java).
I have a problem with string sorting.
The language of my application has diacritics, which have a specific alphabetic
ordering. For example Ł goes directly after L, Ó goes after O, etc.
So you are supposed to sort the strings like this:
Dla
Dła
Doa
Dóa
Dza
Eza
ElasticSearch sorts by typical letters first, and moves all strange
letters to at the end:
Dla
Doa
Dza
Dła
Dóa
Eza
Can I add a custom letter ordering for ElasticSearch?
Maybe there are some plugins for this?
Do I need to write my own plugin? How do I start?
I found a plugin for Polish language for ElasticSearch,
but as I understand it is for analysing, and analysing is not a solution
in my case, because it will ignore diacritics and leave words with L and Ł mixed:
Dla
Dłb
Dlc
This would sometimes be acceptable, but is not acceptable in my specific usecase.
I will be grateful for any remarks on this.

I've never used it, but there is a plugin that could fit your needs: the ICU collation plugin.
You will have to use the icu_collation token filter, which will turns the tokens into collation keys. For that reason you will need to use a separate #Field (e.g. myField_sort) in Hibernate Search.
You can assign a specific analyzer to your field with #Field(name = "myField_sort", analyzer = #Analyzer(definition = "myCollationAnalyzer")), and define this analyzer (type, parameters) with something like that on one of your entities:
#Entity
#Indexed
#AnalyzerDef(
name = "myCollationAnalyzer",
filters = {
#TokenFilterDef(
name = "polish_collation",
factory = ElasticsearchTokenFilterFactory.class,
params = {
#Parameter(name = "type", value = "'icu_collation'"),
#Parameter(name = "language", value = "'pl'")
}
)
}
)
public class MyEntity {
See the documentation for more information: https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#_custom_analyzers
It's admittedly a bit clumsy right now, but analyzer configuration will get a bit cleaner in the next Hibernate Search version with normalizers and analyzer definition providers.
Note: as usual, your field will need to be declared as sortable (#SortableField(forField = "myField_sort")).

Related

Hibernate search and elastic search analysers for exact search

I have StandardAnalyser above the field in entity
#Field(name = "myField", index = Index.YES, analyze = Analyze.YES, analyzer = #Analyzer(impl = StandardAnalyzer.class)),
But I wonder how to then make it possible to:
1) Search by words like: I like bananas -> ["I", "like", "bananas"] (which my analyser currently allows me to do)
2) Search by exact input: "I like bananas" -> "I like bananas" (which StandardAnalyser does not allow and the fitting change would be to impl = KeywordAnalyser)
Should I change my analyser or maybe in java code, based on input (if it starts and ends with double quotes) change the way of searching?
Regards

Maybe you want a phrase query? A phrase query looks for a sequence of tokens, instead of just one token.
Query luceneQuery = queryBuilder.phrase()
.onField("myField")
.sentence("I like bananas")
.createQuery();
See https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#_phrase_queries
Phrase queries are still sort of "fuzzy", however, since the analyzer is still involved: case will be ignored, etc. It's just about looking for a sequence of (analyzed) tokens.
If you really need exact search, including case sensitivity, you can simply declare two fields:
#Field(name = "myField", index = Index.YES, analyze = Analyze.YES, analyzer = #Analyzer(impl = StandardAnalyzer.class))
#Field(name = "myField_exact", index = Index.YES, analyze = Analyze.NO)
Then you can target either myField or myField_exact at query time, depending on your needs.
Of course, you will need to reindex your data before myField_exact becomes available.

Specifying keyword type on String field

I started using hibernate-search-elasticsearch(5.8.2) because it seemed easy to integrate it maintains elasticsearch indices up to date without writing any code. It's a cool lib, but I'm starting to think that it has a very small set of the elasticsearch functionalities implemented. I'm executing a query with a painless script filter which needs to access a String field, which type is 'text' in the index mapping and this is not possible without enabling field data. But I'm not very keen on enabling it as it consumes a lot of heap memory. Here's what elasticsearch team suggests to do in my case:
Fielddata documentation
Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
Unfortunately I can't find a way to do it with the hibernate-search annotations. Can someone tell me if this is possible or I have to migrate to the vanilla elasticsearch lib and not using any wrappers?

With the current version of Hibernate Search, you need to create a different field for that (e.g. you can't have different flavors of the same field). Note that that's what Elasticsearch is doing under the hood anyway.
#Field(analyzer = "your-text-analyzer") // your default full text search field with the default name
#Field(name="myPropertyAggregation", index = Index.NO, normalizer = "keyword")
#SortableField(forField = "myPropertyAggregation")
private String myProperty;
It should create an unanalyzed field with doc values. You then need to refer to the myPropertyAggregation field for your aggregations.
Note that we will expose much more Elasticsearch features in the API in the future Search 6. In Search 5, the APIs are designed with Lucene in mind and we couldn't break them.

How to retrieve the Field that "hit" in Lucene

Maybe I'm really missing something.
I have indexed a bunch of key/value pairs in Lucene (v4.1 if it matters). Say I have
key1=value1 and key2=value2, e.g. as read from a properties file.
They get indexed both as specific fields and into a catchall "ALL" field, e.g.
new Field("key1", "value1", aFieldTypeMimickingKeywords);
new Field("key2", "value2", aFieldTypeMimickingKeywords);
new Field("ALL", "key1=value1", aFieldTypeMimickingKeywords);
new Field("ALL", "key2=value2", aFieldTypeMimickingKeywords);
// then get added to the Document of course...
I can then do a wildcard search, using
new WildcardQuery(new Term("ALL", "*alue1"));
and it will find the hit.
But, it would be nice to get more info, like "what was complete value (e.g. "key1=value1") that goes with that hit?".
The best I can figure out it to get the Document, then get the list of IndexableFields, then loop over all of them and see if the field.stringValue().contains("alue1"). (I can look at the data structures in the debugger and all the info is there)
This seems completely insane cause isn't that what Lucene just did? Shouldn't the Hit information return some of the Fields?
Is Lucene missing what seems like "obvious" functionality? Google and starting at the APIs hasn't revealed anything straightforward, but I feel like I must be searching on the wrong stuff.

You might want to try with IndexSearcher.explain() method. Once you get the ID of the matching document, prepare a query for each field (using the same search keywords) and invoke Explanation.isMatch() for each query: the ones that yield true will give you the matched field. Example:
for (String field: fields){
Query query = new WildcardQuery(new Term(field, "*alue1"));
Explanation ex = searcher.explain(query, docID);
if (ex.isMatch()){
//Your query matched field
}
}

hbase: querying for specific value with dynamically created qualifier

Hy,
Hbase allows a column family to have different qualifiers in different rows. In my case a column family has the following specification
abc[cnt] # where cnt is an integer that can be any positive integer
what I want to achieve is to get all the data from a different column family, only if the value of the described qualifier (in a different column family) matches.
for narrowing the Scan down I just add those two families I need for the query. but that is as far as I could get for now.
I already achieved the same behaviour with a SingleColumnValueFilter, but then the qualifier was known in advance. but for this one the qualifier can be abc1, abc2 ... there would be too many options, thus too many SingleColumnValueFilter's.
Then I tried using the ValueFilter, but this filter only returns those columns that match the value, thus the wrong column family.
Can you think of any way to achieve my goal, querying for a value within a dynamically created qualifier in a column family and returning the contents of the column family and another column family (as specified when creating the Scan)? preferably only querying once.
Thanks in advance for any input.
UPDATE: (for clarification as discussed in the comments)
in a more graphical way, a row may have the following:
colfam1:aaa
colfam1:aab
colfam1:aac
colfam2:abc1
colfam2:abc2
whereas I want to get all of the family colfam1 if any value of colfam2 has e.g. the value x, with regard to the fact that colfam2:abc[cnt] is dynamically created with cnt being any positive integer

I see two approaches for this: client-side filtering or server-side filtering.
Client-side filtering is more straightforward. The Scan adds only the two families "colfam1" and "colfam2". Then, for each Result you get from scanner.next(), you must filter according to the qualifiers in "colfam2".
byte[] queryValue = Bytes.toBytes("x");
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("colfam1");
scan.addFamily(Bytes.toBytes("colfam2");
ResultScanner scanner = myTable.getScanner(scan);
Result res;
while((res = scanner.next()) != null) {
NavigableMap<byte[],byte[]> colfam2 = res.getFamilyMap(Bytes.toBytes("colfam2"));
boolean foundQueryValue = false;
SearchForQueryValue: while(!colfam2.isEmpty()) {
Entry<byte[], byte[]> cell = colfam2.pollFirstEntry();
if( Bytes.equals(cell.getValue(), queryValue) ) {
foundQueryValue = true;
break SearchForQueryValue;
}
}
if(foundQueryValue) {
NavigableMap<byte[],byte[]> colfam1 = res.getFamilyMap(Bytes.toBytes("colfam1"));
LinkedList<KeyValue> listKV = new LinkedList<KeyValue>();
while(!colfam1.isEmpty()) {
Entry<byte[], byte[]> cell = colfam1.pollFirstEntry();
listKV.add(new KeyValue(res.getRow(), Bytes.toBytes("colfam1"), cell.getKey(), cell.getValue());
}
Result filteredResult = new Result(listKV);
}
}
(This code was not tested)
And then finally filteredResult is what you want. This approach is not elegant and might also give you performance issues if you have a lot of data in those families. If "colfam1" has a lot of data, you don't want to transfer it to the client if it will end up not being used if value "x" is not in a qualifier of "colfam2".
Server-side filtering. This requires you to implement your own Filter class. I believe you cannot use the provided filter types to do this. Implementing your own Filter takes some work, you also need to compile it as a .jar and make it available to all RegionServers. But then, it helps you to avoid sending loads of data of "colfam1" in vain.
It is too much work for me to show you how to custom implement a Filter, so I recommend reading a good book (HBase: The Definitive Guide for example). However, the Filter code will look pretty much like the client-side filtering I showed you, so that's half of the work done.

Lucene wildcard matching fails on chemical notations(?)

Using Hibernate Search Annotations (mostly just #Field(index = Index.TOKENIZED)) I've indexed a number of fields related to a persisted class of mine called Compound. I've setup text search over all the indexed fields, using the MultiFieldQueryParser, which has so far worked fine.
Among the fields indexed and searchable is a field called compoundName, with sample values:
3-Hydroxyflavone
6,4'-Dihydroxyflavone
When I search for either of these values in full the related Compound instances are returned. However problems occur when I use the partial name and introduce wildcards:
searching for 3-Hydroxyflav* still gives the correct hit, but
searching for 6,4'-Dihydroxyflav* fails to find anything.
Now as I'm quite new to Lucene / Hibernate-search, I'm not quite sure where to look at this point.. I think it might have something to do with the ' present in the second query, but I don't know how to proceed.. Should I look into Tokenizers / Analyzers / QueryParsers or something else entirely?
Or can anyone tell me how I can get the second wildcard search to match, preferably without breaking the MultiField-search behavior?
I'm using Hibernate-Search 3.1.0.GA & Lucene-core 2.9.3.
Some relevant code bits to illustrate my current approach:
Relevant parts of the indexed Compound class:
#Entity
#Indexed
#Data
#EqualsAndHashCode(callSuper = false, of = { "inchikey" })
public class Compound extends DomainObject {
#NaturalId
#NotEmpty
#Length(max = 30)
#Field(index = Index.TOKENIZED)
private String inchikey;
#ManyToOne
#IndexedEmbedded
private ChemicalClass chemicalClass;
#Field(index = Index.TOKENIZED)
private String commonName;
...
}
How I currently search over the indexed fields:
String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser =
new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new StandardAnalyzer(Version.LUCENE_29));
FullTextSession fullTextSession = Search.getFullTextSession(getSession());
FullTextQuery fullTextQuery =
fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();

Use WhitespaceAnalyzer instead of StandardAnalyzer. It will just split at whitespace, and not at commas, hyphens etc. (It will not lowercase them though, so you will need to build your own chain of whitespace + lowercase, assuming you want your search to be case-insensitive). If you need to do things differently for different fields, you can use a PerFieldAnalyzer.
You can't just set it to un-tokenized, because that will interpret your entire body of text as one token.

I think your problem is a combination of analyzer and query language problems. It is hard to say what exactly causes the problem. To find this out I recommend you inspect you index using the Lucene index tool Luke.
Since in your Hibernate Search configuration you are not using a custom analyzer the default - StandardAnalyzer - is used. This would be consistent with the fact that you use StandardAnalyzer in the constructor of MultiFieldQueryParser (always use the same analyzer for indexing and searching!). What I am not so sure of is how "6,4'-Dihydroxyflavone" gets tokenized by StandardAnalyzer. That the first thing you have to find out. For example the javadoc says:
Splits words at hyphens, unless
there's a number in the token, in
which case the whole token is
interpreted as a product number and is
not split.
It might be that you need to write your own analyzer which tokenizes your chemical names the way you need it for your use cases.
Next the query parser. Make sure you understand the query syntax - Lucene query syntax. Some characters have special meaning, for example a '-'. It could be that your query is parsed the wrong way.
Either way, first step os to find out how your chemical names get tokenized. Hope that helps.

I wrote my own analyzer:
import java.util.Set;
import java.util.regex.Pattern;
import org.apache.lucene.index.memory.PatternAnalyzer;
import org.apache.lucene.util.Version;
public class ChemicalNameAnalyzer extends PatternAnalyzer {
private static Version version = Version.LUCENE_29;
private static Pattern pattern = compilePattern();
private static boolean toLowerCase = true;
private static Set stopWords = null;
public ChemicalNameAnalyzer(){
super(version, pattern, toLowerCase, stopWords);
}
public static Pattern compilePattern() {
StringBuilder sb = new StringBuilder();
sb.append("(-{0,1}\\(-{0,1})");//Matches an optional dash followed by an opening round bracket followed by an optional dash
sb.append("|");//"OR" (regex alternation)
sb.append("(-{0,1}\\)-{0,1})");
sb.append("|");//"OR" (regex alternation)
sb.append("((?<=([a-zA-Z]{2,}))-(?=([^a-zA-Z])))");//Matches a dash ("-") preceded by two or more letters and succeeded by a non-letter
return Pattern.compile(sb.toString());
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.