java : Case insensitive search in Elasticsearch - java

I'm trying to find documents in the index regardless of whether their field values are lowercase or uppercase. This is the index structure I designed with a custom analyzer; I'm new to analyzers and might have it wrong. This is how it looks:
POST arempris/emptagnames
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "emptags": {
      "properties": {
        "employeeid": {
          "type": "integer"
        },
        "tagName": {
          "type": "text",
          "fielddata": true,
          "analyzer": "lowercase_keyword"
        }
      }
    }
  }
}
In the Java back end, I'm using a BoolQueryBuilder to find tag names by employee id first. This is what I've coded to fetch the values:
BoolQueryBuilder query = new BoolQueryBuilder();
query.must(new WildcardQueryBuilder("tagName", "*June*"));
query.must(new TermQueryBuilder("employeeid", 358));

SearchResponse response12 = esclient.prepareSearch(index)
        .setTypes("emptagnames")
        .setQuery(query)
        .execute().actionGet();

SearchHit[] hits2 = response12.getHits().getHits();
System.out.println(hits2.length);
for (SearchHit hit : hits2) {
    Map<String, Object> map = hit.getSource();
    System.out.println((String) map.get("tagName"));
}
It works fine when I specify the tag to search as "june" in lowercase, but when I pass "June" to the WildcardQueryBuilder with an uppercase letter, I get no match.
Let me know where I've made a mistake. I'd greatly appreciate your help; thanks in advance.

There are two types of queries in Elasticsearch:
Term-level queries -> the exact term is searched. https://www.elastic.co/guide/en/elasticsearch/reference/current/term-level-queries.html
Full-text queries -> the query term is analyzed first and then searched. https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
The rules for full-text queries are:
First, Elasticsearch looks for a search_analyzer on the query.
If none is specified, it uses the field's index-time analyzer for searching.
So in your case you need to change the query to this:
BoolQueryBuilder query = new BoolQueryBuilder();
query.must(new QueryStringQueryBuilder("tagName:*June*"));
query.must(new TermQueryBuilder("employeeid", 358));

SearchResponse response12 = esclient.prepareSearch(index)
        .setTypes("emptagnames")
        .setQuery(query)
        .execute().actionGet();

SearchHit[] hits2 = response12.getHits().getHits();
System.out.println(hits2.length);
for (SearchHit hit : hits2) {
    Map<String, Object> map = hit.getSource();
    System.out.println((String) map.get("tagName"));
}
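To see concretely why the full-text version matches while the raw wildcard does not, here is a plain-Java sketch of what the keyword tokenizer plus lowercase filter do at index and search time (an illustration of the analysis behavior, not Elasticsearch's actual implementation):

```java
import java.util.Locale;

public class LowercaseKeywordSketch {

    // What a keyword tokenizer + lowercase filter does: one token, lowercased.
    public static String analyze(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Term-level wildcard: the pattern is NOT analyzed, but the indexed token was.
    public static boolean wildcardMatches(String indexedValue, String pattern) {
        String token = analyze(indexedValue);
        String regex = pattern.replace("*", ".*");
        return token.matches(regex);
    }

    // Full-text style: the pattern runs through the same analyzer first.
    public static boolean analyzedWildcardMatches(String indexedValue, String pattern) {
        return wildcardMatches(indexedValue, analyze(pattern));
    }

    public static void main(String[] args) {
        System.out.println(wildcardMatches("June report", "*June*"));          // false
        System.out.println(analyzedWildcardMatches("June report", "*June*"));  // true
    }
}
```

With the term-level wildcard query, only the already-lowercased pattern "*june*" can match the indexed token; analyzing the query term first removes that pitfall.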

Related

Elasticsearch 7.13 - elastic search response with old data after update api

We are using Elasticsearch 7.13.
We do periodic updates to the index using upsert.
The sequence of operations:
Create a new index with dynamic mapping; all strings are mapped as text:
"dynamic_templates": [
  {
    "strings_as_keywords": {
      "match_mapping_type": "string",
      "mapping": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "search_term_analyzer",
        "copy_to": "_all",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
]
Upsert a bulk of documents with the code attached below (I don't have a REST equivalent).
Search on a specific field:
localhost:9200/mdsearch-vitaly123/_search
{
  "query": {
    "match": {
      "fullyQualifiedName": "value_test"
    }
  }
}
I get 1 result.
Upsert again, now with "fullyQualifiedName": "value_test1234" (as in step 2).
Search again as in step 3.
Now I get 2 results: one doc with "fullyQualifiedName": "value_test" and another with "fullyQualifiedName": "value_test1234".
Snippet of the upsert (step 2) below:
@Override
public List<BulkItemStatus> updateDocumentBulk(String indexName, List<JsonObject> indexDocuments) throws MDSearchIndexerException {
    BulkRequest request = new BulkRequest().setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
    ofNullable(indexDocuments).orElseThrow(NullPointerException::new)
            .forEach(x -> {
                var id = x.get("_id").getAsString();
                x.remove("_id");
                request.add(new UpdateRequest(indexName, id)
                        .docAsUpsert(true)
                        .doc(x.toString(), XContentType.JSON)
                        .retryOnConflict(3)
                );
            });
    BulkResponse bulk = elasticsearchRestClient.bulk(request, RequestOptions.DEFAULT);
    return stream(bulk.getItems())
            .map(r -> new BulkItemStatus(r.getId(), isSuccess(r), r.getFailureMessage()))
            .collect(Collectors.toList());
}
I can search by the updated properties, but searches retrieve both the updated documents and the previous ones.
How can I solve it? Maybe by limiting the version number to 1 somehow?
I set setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE) but it didn't help.
P.S. - both the old and the updated data are retrieved.
Suggestions?
Regards,
What is happening is that the following line must yield null:
var id = x.get("_id").getAsString();
In other words, there is no _id field in the JSON documents you pass in indexDocuments. You are not allowed to have fields with an initial underscore character in the source documents; if there were one, you'd get the following error:
Field [_id] is a metadata field and cannot be added inside a document. Use the index API request parameters.
Hence, your update request cannot update any document (since there's no ID to identify the document to update) and will simply insert a new one (which is what docAsUpsert does), which is why you're seeing two different documents.
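If missing _id values are indeed the culprit, it can help to fail fast while building the bulk request instead of silently upserting new documents. A minimal sketch, using a plain Map in place of Gson's JsonObject; the requireId helper is hypothetical, not part of the Elasticsearch client:

```java
import java.util.Map;

public class UpsertIdGuard {

    // Fail fast when a document has no usable _id, instead of letting the
    // bulk request quietly create a duplicate document.
    public static String requireId(Map<String, Object> doc) {
        Object id = doc.get("_id");
        if (id == null || id.toString().isEmpty()) {
            throw new IllegalArgumentException("document is missing its _id: " + doc);
        }
        return id.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> ok = Map.of("_id", "42", "fullyQualifiedName", "value_test");
        System.out.println(requireId(ok)); // 42
        try {
            requireId(Map.of("fullyQualifiedName", "value_test"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Calling a check like this before `request.add(...)` surfaces the bad input at build time rather than as a surprise extra document in the index.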

Partial query with Where statement using Elasticssearch 7 java api

I am using the following code to search. It works fine, but it only returns results when a complete word matches. I want results for a partial query (a minimum of 3 matching characters within a complete word). There should also be another check: I have a field campus in my document, with values like campus: "Bradford", campus: "Oxford", campus: "Harvard", etc. I want my query to return documents whose campus is Bradford or Oxford and where Nel appears anywhere in the rest of the document.
RestHighLevelClient client;
QueryBuilder matchQueryBuilder = QueryBuilders.queryStringQuery("Nel");
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
sourceBuilder.query(matchQueryBuilder);
SearchRequest searchRequest = new SearchRequest("index_name");
searchRequest.source(sourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
Mapped to SQL, it is as if we used where campus='Bradford' OR campus='Oxford'.
In the document I have "Nelson Mandela II".
Currently it works if I use Nelson as the query, but I need it to work with the query Nel.
There are basically two possible ways to achieve the use case you are looking for.
Solution 1: Using wildcard query
Assuming that you have two fields
name of type text
campus of type text
Below is how your java code would be:
private static void wildcardQuery(RestHighLevelClient client, SearchSourceBuilder sourceBuilder)
        throws IOException {
    System.out.println("-----------------------------------------------------");
    System.out.println("Wildcard Query");
    MatchQueryBuilder campusClause_1 = QueryBuilders.matchQuery("campus", "oxford");
    MatchQueryBuilder campusClause_2 = QueryBuilders.matchQuery("campus", "bradford");
    // Using a wildcard query
    WildcardQueryBuilder nameClause = QueryBuilders.wildcardQuery("name", "nel*");
    // Main query
    BoolQueryBuilder query = QueryBuilders.boolQuery()
            .must(nameClause)
            .should(campusClause_1)
            .should(campusClause_2)
            .minimumShouldMatch(1);
    sourceBuilder.query(query);
    SearchRequest searchRequest = new SearchRequest();
    // Specify your index name in the parameter below
    searchRequest.indices("my_wildcard_index");
    searchRequest.source(sourceBuilder);
    SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    System.out.println(searchResponse.getHits().getTotalHits());
    System.out.println("-----------------------------------------------------");
}
Note that if the above fields were of keyword type and you needed a case-sensitive exact match, you'd use the code below:
TermQueryBuilder campusClause_2 = QueryBuilders.termQuery("campus", "Bradford");
Solution 2. Using Edge Ngram tokenizer (Preferred Solution)
For this you would need to make use of Edge Ngram tokenizer.
Below is how your mapping would be:
Mapping:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": "lowercase",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "campus": {
        "type": "text"
      }
    }
  }
}
Sample Documents:
PUT my_index/_doc/1
{
  "name": "Nelson Mandela",
  "campus": "Bradford"
}

PUT my_index/_doc/2
{
  "name": "Nel Chaz",
  "campus": "Oxford"
}
Query DSL
POST my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "nel"
          }
        }
      ],
      "should": [
        {
          "match": {
            "campus": "bradford"
          }
        },
        {
          "match": {
            "campus": "oxford"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
Java Code:
private static void boolMatchQuery(RestHighLevelClient client, SearchSourceBuilder sourceBuilder)
        throws IOException {
    System.out.println("-----------------------------------------------------");
    System.out.println("Bool Query");
    MatchQueryBuilder campusClause_1 = QueryBuilders.matchQuery("campus", "oxford");
    MatchQueryBuilder campusClause_2 = QueryBuilders.matchQuery("campus", "bradford");
    // A plain old match query suffices here
    MatchQueryBuilder nameClause = QueryBuilders.matchQuery("name", "nel");
    BoolQueryBuilder query = QueryBuilders.boolQuery()
            .must(nameClause)
            .should(campusClause_1)
            .should(campusClause_2)
            .minimumShouldMatch(1);
    sourceBuilder.query(query);
    SearchRequest searchRequest = new SearchRequest();
    searchRequest.indices("my_index");
    searchRequest.source(sourceBuilder);
    SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    System.out.println(searchResponse.getHits().getTotalHits());
}
Note how I've just used a match query for the name field. I'd suggest you read a bit about what analysis, analyzers, tokenizers, and the edge n-gram tokenizer are.
In the console you should see the total hits for the document.
Similarly, you can also use other query types, e.g. a term query in the above solutions, if you need an exact match on a keyword field.
Updated answer:
Personally I do not recommend Solution 1, as it wastes a lot of computational power for a single field, let alone multiple fields.
For multi-field substring matches, the best way is to use a concept called copy_to and then apply the Edge N-gram tokenizer to that field.
So what does this Edge N-gram tokenizer really do? Put simply, based on min_gram and max_gram it breaks your tokens down, e.g. Zeppelin into Zep, Zepp, Zeppe, Zeppel, Zeppeli, Zeppelin, and inserts these values into that field's inverted index. Now if you execute even a very simple match query, it will return that document because the inverted index contains the substring.
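To make that concrete, here is what an edge n-gram tokenizer emits for a single token, sketched in plain Java (an illustration of the concept, not Elasticsearch's implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class EdgeNgramSketch {

    // Emit all prefixes of the token between minGram and maxGram characters,
    // which is what ends up in the inverted index for that field.
    public static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        String t = token.toLowerCase(Locale.ROOT); // lowercase filter first
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, t.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(t.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // [zep, zepp, zeppe, zeppel, zeppeli, zeppelin]
        System.out.println(edgeNgrams("Zeppelin", 3, 10));
        // A match query for "nel" then hits any document whose token list contains "nel".
        System.out.println(edgeNgrams("Nelson", 3, 10).contains("nel")); // true
    }
}
```

Because "nel" is itself one of the indexed grams of "Nelson", a plain match query finds the document without any wildcard scanning.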
And about copy_to field:
The copy_to parameter allows you to copy the values of multiple fields
into a group field, which can then be queried as a single field.
Using copy_to, we get the below mapping for the two fields campus and name.
Mapping:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": "lowercase",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "copy_to": "search_string" <---- Note this
      },
      "campus": {
        "type": "text",
        "copy_to": "search_string" <---- Note this
      },
      "search_string": {
        "type": "text",
        "analyzer": "my_analyzer" <---- Note this
      }
    }
  }
}
Notice in the above mapping how I've applied the edge n-gram analyzer only to search_string. Note that this consumes extra disk space, so you may want to take a step back and make sure you don't use this analyzer for all fields; again, it depends on your use case.
Sample Document:
POST my_index/_doc/1
{
  "campus": "Cambridge University",
  "name": "Ramanujan"
}
Search Query:
POST my_index/_search
{
  "query": {
    "match": {
      "search_string": "ram"
    }
  }
}
And that would give you the Java Code as simple as below:
private static void boolMatchQuery(RestHighLevelClient client, SearchSourceBuilder sourceBuilder)
        throws IOException {
    System.out.println("-----------------------------------------------------");
    System.out.println("Bool Query");
    MatchQueryBuilder searchClause = QueryBuilders.matchQuery("search_string", "ram");
    // Feel free to add multiple clauses
    BoolQueryBuilder query = QueryBuilders.boolQuery()
            .must(searchClause);
    sourceBuilder.query(query);
    SearchRequest searchRequest = new SearchRequest();
    searchRequest.indices("my_index");
    searchRequest.source(sourceBuilder);
    SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    System.out.println(searchResponse.getHits().getTotalHits());
}
Hope that helps!

How to add Bucket Sort to Query Aggregation

I have an Elasticsearch query that works well (curl); it is my first query.
First I filter by organization (multitenancy), then group by customer, and finally sum the amount of the sales, but I only want the 3 best customers.
My question is: how do I build the aggregation with AggregationBuilders to get the "bucket_sort" statement? I already get the sales grouped by customer with the Java API.
The Elasticsearch query is:
curl -X POST 'http://localhost:9200/sales/sale/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "aggs": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "organization_id": "15"
              }
            }
          ]
        }
      },
      "aggs": {
        "by_customer": {
          "terms": {
            "field": "customer_id"
          },
          "aggs": {
            "sum_total": {
              "sum": {
                "field": "amount"
              }
            },
            "total_total_sort": {
              "bucket_sort": {
                "sort": [
                  { "sum_total": { "order": "desc" } }
                ],
                "size": 3
              }
            }
          }
        }
      }
    }
  }
}'
My Java Code:
@Test
public void queryBestCustomers() throws UnknownHostException {
    Client client = Query.client();
    AggregationBuilder sum = AggregationBuilders.sum("sum_total").field("amount");
    AggregationBuilder groupBy = AggregationBuilders.terms("by_customer").field("customer_id").subAggregation(sum);
    AggregationBuilder aggregation = AggregationBuilders
            .filters("filtered",
                    new FiltersAggregator.KeyedFilter("must", QueryBuilders.termQuery("organization_id", "15")))
            .subAggregation(groupBy);
    SearchRequestBuilder requestBuilder = client.prepareSearch("sales")
            .setTypes("sale")
            .addAggregation(aggregation);
    SearchResponse response = requestBuilder.execute().actionGet();
}
I hope I got your question right.
Try adding "order" to your groupBy agg:
AggregationBuilder groupBy = AggregationBuilders.terms("by_customer").field("customer_id").subAggregation(sum).order(Terms.Order.aggregation("sum_total", false));
One more thing: if you want the top 3 clients, then .size(3) should be set on the groupBy agg as well, not on the sort, like this:
AggregationBuilder groupBy = AggregationBuilders.terms("by_customer").field("customer_id").subAggregation(sum).order(Terms.Order.aggregation("sum_total", false)).size(3);
As another answer mentioned, "order" does work for your use case.
However there are other use cases where one may want to use bucket_sort. For example if someone wanted to page through the aggregation buckets.
As bucket_sort is a pipeline aggregation you cannot use the AggregationBuilders to instantiate it. Instead you'll need to use the PipelineAggregatorBuilders.
You can read more information about the bucket sort/pipeline aggregation here.
The .from(50) in the following code is an example of how you can page through the buckets; it makes the result start from bucket 50, if applicable. Not including from is equivalent to .from(0):
BucketSortPipelineAggregationBuilder paging = PipelineAggregatorBuilders.bucketSort(
"paging", List.of(new FieldSortBuilder("sum_total").order(SortOrder.DESC))).from(50).size(10);
AggregationBuilders.terms("by_customer").field("customer_id").subAggregation(sum).subAggregation(paging);
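Conceptually, what bucket_sort with from/size does is sort the parent aggregation's buckets by the given key and slice the result. A plain-Java sketch of that semantics (the customer ids and sums below are invented for illustration):

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BucketSortSketch {

    // Sort buckets by their sum_total descending, then apply from/size,
    // mirroring what the bucket_sort pipeline aggregation does.
    public static List<String> topCustomers(Map<String, Double> sumByCustomer,
                                            int from, int size) {
        return sumByCustomer.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .skip(from)
                .limit(size)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> sums = new LinkedHashMap<>();
        sums.put("cust-1", 120.0);
        sums.put("cust-2", 900.0);
        sums.put("cust-3", 430.0);
        sums.put("cust-4", 55.0);
        System.out.println(topCustomers(sums, 0, 3)); // [cust-2, cust-3, cust-1]
        System.out.println(topCustomers(sums, 3, 3)); // [cust-4]
    }
}
```

Page two is just the same sorted bucket list with a different offset, which is exactly why from/size on bucket_sort enables paging where the terms agg's own size cannot.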

Mongodb Morphia aggregation

I'm having trouble creating an aggregation in Morphia; the documentation is really not clear. This is the original query:
db.collection('events').aggregate([
  {
    $match: {
      "identifier": {
        $in: [userId1, userId2]
      },
      $or: [
        {
          "info.name": "messageType",
          "info.value": "Push",
          "timestamp": {
            $gte: new Date("2015-04-27T19:53:13.912Z"),
            $lte: new Date("2015-08-27T19:53:13.912Z")
          }
        }
      ]
    }
  },
  {
    $unwind: "$info"
  },
  {
    $match: {
      $or: [
        {
          "info.name": "messageType",
          "info.value": "Push"
        }
      ]
    }
  }
]);
The only example in their docs uses out, and there are some examples here, but I couldn't make them work.
I didn't even make it past the first match. Here's what I have:
ArrayList<String> ids = new ArrayList<>();
ids.add("199941");
ids.add("199951");
Query<Event> q = ads.getQueryFactory().createQuery(ads);
q.and(q.criteria("identifier").in(ids));
AggregationPipeline pipeline = ads.createAggregation(Event.class).match(q);
Iterator<Event> iterator = pipeline.aggregate(Event.class);
Some help or guidance and how to start with the query or how it works will be great.
You need to create the query for the match() pipeline by breaking your code down into manageable pieces that are easy to follow. So let's start with the query that matches the identifier field; you've done great so far. We then need to combine it with the $or part of the query.
Carrying on from where you left off, create the full query as:
Query<Event> q = ads.getQueryFactory().createQuery(ads);
Criteria[] arrayA = {
    q.criteria("info.name").equal("messageType"),
    q.criteria("info.value").equal("Push"),
    q.criteria("timestamp").greaterThanOrEq(start),
    q.criteria("timestamp").lessThanOrEq(end)
};
Criteria[] arrayB = {
    q.criteria("info.name").equal("messageType"),
    q.criteria("info.value").equal("Push")
};
q.and(
    q.criteria("identifier").in(ids),
    q.or(arrayA)
);

Query<Event> query = ads.getQueryFactory().createQuery(ads);
query.or(arrayB);

AggregationPipeline pipeline = ads.createAggregation(Event.class)
    .match(q)
    .unwind("info")
    .match(query);
Iterator<Event> iterator = pipeline.aggregate(Event.class);
The above is untested but should guide you somewhere closer to home, so make the necessary adjustments where appropriate. For some references, the following SO questions may give you some pointers:
Complex AND-OR query in Morphia
Morphia query with or operator
and of course the AggregationTest.java Github page
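As a sanity check on the boolean logic the first $match expresses (identifier in the id list AND the $or block with the date range), the same predicate can be written in plain Java. The field names come from the query above; this method is only an illustration of the logic, not Morphia code:

```java
import java.time.Instant;
import java.util.List;

public class MatchPredicateSketch {

    static final Instant START = Instant.parse("2015-04-27T19:53:13.912Z");
    static final Instant END = Instant.parse("2015-08-27T19:53:13.912Z");

    // identifier must be in ids, AND the $or block must hold:
    // info.name = "messageType" AND info.value = "Push" AND START <= timestamp <= END
    public static boolean matches(String identifier, String infoName, String infoValue,
                                  Instant timestamp, List<String> ids) {
        boolean orBlock = "messageType".equals(infoName)
                && "Push".equals(infoValue)
                && !timestamp.isBefore(START)
                && !timestamp.isAfter(END);
        return ids.contains(identifier) && orBlock;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("199941", "199951");
        Instant june = Instant.parse("2015-06-01T00:00:00Z");
        System.out.println(matches("199941", "messageType", "Push", june, ids));  // true
        System.out.println(matches("199941", "messageType", "Email", june, ids)); // false
    }
}
```

If a document you expect back fails this predicate, the problem is in the criteria, not in the Morphia plumbing.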

How to write elasticsearch query aggregation in java?

This is my code in Marvel Sense:
GET /sweet/cake/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "code": "18"
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "id"
      }
    }
  }
}
And I want to write it in Java, but I don't know how.
You can find some examples in the official documentation for the Java client.
But in your case, you need to create one bool/must query using QueryBuilders and one terms aggregation using AggregationBuilders. It goes like this:
// build the query
BoolQueryBuilder query = QueryBuilders.boolQuery()
        .must(QueryBuilders.termQuery("code", "18"));

// build the terms aggregation
AggregationBuilder stateAgg = AggregationBuilders.terms("group_by_state")
        .field("id");

SearchResponse resp = client.prepareSearch("sweet")
        .setTypes("cake")
        .setQuery(query)
        .setSize(0)
        .addAggregation(stateAgg)
        .execute()
        .actionGet();
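Conceptually, the group_by_state terms aggregation returns one bucket per distinct id value with a document count, much like a SQL GROUP BY. A plain-Java sketch of that semantics (the sample values are invented):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermsAggSketch {

    // One bucket (key -> doc_count) per distinct term, like a SQL GROUP BY.
    public static Map<String, Long> termsAgg(List<String> fieldValues) {
        Map<String, Long> buckets = new LinkedHashMap<>();
        for (String v : fieldValues) {
            buckets.merge(v, 1L, Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(termsAgg(List.of("7", "7", "9", "7", "9"))); // {7=3, 9=2}
    }
}
```

With setSize(0) in the search request above, only these buckets come back and no hits, which is usually what you want for a pure aggregation query.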
