ElasticSearch range query in a paragraph - java

I have a field called Description which is a text field and has data like:
This is a good thing for versions before 3.2 but bad for 3.5 and later
I want to run a range query on this kind of text. I know that for a field containing only dates, numbers (such as an age), or even string IDs, we can use a query like:
{
"query": {
"range" : {
"age" : {
"gte" : 10,
"lte" : 20,
"boost" : 2.0
}
}
}
}
But I have a mixed field like the one mentioned above, and I need to perform a range query on it. Also, I cannot change the index structure. I can only run queries or do some post-processing after retrieving the results. Does anyone have an idea how to run this type of query, or how to reach my goal by post-processing the results? I am using Java.

I hope I have understood what you are looking for.
I've managed to create a simple working example.
Mappings
Using char_group tokenizer:
The char_group tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the pattern tokenizer is not acceptable.
(Source: Char Group Tokenizer documentation)
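To see which tokens this tokenizer is expected to emit for the sample sentence, its splitting rule can be approximated in plain Java. This is only a sketch of the rule quoted above (split on letters and whitespace, keep everything else); the class and method names are made up for illustration and this is not the Elasticsearch implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: approximates a char_group tokenizer configured with
// tokenize_on_chars = ["letter", "whitespace"].
class CharGroupSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c) || Character.isWhitespace(c)) {
                // A separator character ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Digits, dots and other symbols are kept as token content.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "This is a good thing for versions before 3.2 but bad for 3.5 and later";
        System.out.println(tokenize(sample)); // prints [3.2, 3.5]
    }
}
```

Only the version-like fragments survive as terms, which is why the text.digit sub-field can be targeted by a range query.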
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"letter",
"whitespace"
]
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"fields": {
"digit": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
Post a few documents
PUT my_index/_doc/1
{
"text": "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
PUT my_index/_doc/2
{
"text": "This is a good thing for versions before 5 but bad for 6 and later"
}
Search Query
GET my_index/_search
{
"query": {
"range": {
"text.digit": {
"gte": 3.2,
"lte": 3.5
}
}
}
}
Results
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"text" : "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
}
]
}
Another Search Query
GET my_index/_search
{
"query": {
"range": {
"text.digit": {
"gt": 3.5
}
}
}
}
Results
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"text" : "This is a good thing for versions before 5 but bad for 6 and later"
}
}
]
}
Analyze Query
Adjust the following request until it produces the tokens you need.
It already works for your example:
This is a good thing for versions before 3.2 but bad for 3.5 and later
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"letter",
"whitespace"
]
},
"text": "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
Hope this helps
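One caveat: a range query on a text field compares terms as strings, not as numbers, so a term such as "10" sorts before "3.5". Since the question also allows post-processing in Java, a numeric check can be done on the retrieved text instead. The following is a minimal sketch under the assumption that versions can be read as plain decimal numbers (so "3.10" would be treated as 3.1); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper: keeps a description only if it
// mentions at least one number inside the requested numeric range.
class VersionRangeFilter {
    // Matches integers and simple decimals such as "5" or "3.2".
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    static List<Double> extractNumbers(String text) {
        List<Double> numbers = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            numbers.add(Double.parseDouble(m.group()));
        }
        return numbers;
    }

    static boolean anyInRange(String text, double gte, double lte) {
        for (double n : extractNumbers(text)) {
            if (n >= gte && n <= lte) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc1 = "This is a good thing for versions before 3.2 but bad for 3.5 and later";
        String doc2 = "This is a good thing for versions before 5 but bad for 6 and later";
        System.out.println(anyInRange(doc1, 3.2, 3.5)); // true
        System.out.println(anyInRange(doc2, 3.2, 3.5)); // false
        // String comparison, as used for terms, orders "10" before "3.5":
        System.out.println("10".compareTo("3.5") < 0);  // true
    }
}
```

In practice this would be applied to the Description value of each hit after the search returns.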


ElasticSearch - fuzzy search java api results are not proper

I have indexed sample documents in Elasticsearch and am trying to search them with a fuzzy query. However, I get no results when I search through the Java fuzzy query API.
Here is my mapping script:
PUT productcatalog
{
"settings": {
"analysis": {
"analyzer": {
"attr_analyzer": {
"type": "custom",
"tokenizer": "letter",
"char_filter": [
"html_strip"
],
"filter": ["lowercase", "asciifolding", "stemmer_minimal_english"]
}
},
"filter" : {
"stemmer_minimal_english" : {
"type" : "stemmer",
"name" : "minimal_english"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"values": {
"type": "text",
"analyzer": "attr_analyzer"
},
"catalog_type": {
"type": "text"
},
"catalog_id":{
"type": "long"
}
}
}
}
}
Here is my sample data:
PUT productcatalog/doc/1
{
"catalog_id" : "343",
"catalog_type" : "series",
"values" : "Activa Rooftop, valves, VG3000, VG3000FS, butterfly, ball"
}
PUT productcatalog/doc/2
{
"catalog_id" : "12717",
"catalog_type" : "product",
"values" : "Activa Rooftop, valves"
}
Here is my search script:
GET productcatalog/_search
{
"query": {
"match" : {
"values" : {
"query" : " activa rooftop VG3000",
"operator" : "and",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
}
}
I get the following results for the above query:
{
"took": 239,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.970927,
"hits": [
{
"_index": "productcatalog",
"_type": "doc",
"_id": "1",
"_score": 0.970927,
"_source": {
"catalog_id": "343",
"catalog_type": "series",
"values": "Activa Rooftop, valves, VG3000, VG3000FS, butterfly, ball"
}
}
]
}
}
But if I use the Java API below for the same fuzzy search, I get no results at all.
Here is my Java API query for the fuzzy search:
QueryBuilder qb = QueryBuilders.boolQuery()
.must(QueryBuilders.fuzzyQuery("values", keyword).boost(1.0f).prefixLength(0).maxExpansions(100));
Update 1
I have also tried the query below:
QueryBuilder qb = QueryBuilders.matchQuery(QueryBuilders.fuzzyQuery("values", keyword).boost(1.0f).prefixLength(0).maxExpansions(100));
But I am not able to pass a QueryBuilder into matchQuery. The compiler reports: The method matchQuery(String, Object) in the type QueryBuilders is not applicable for the arguments (FuzzyQueryBuilder)
The Java query you posted is not a match query. It is a fuzzy query wrapped in a bool must clause. You should use QueryBuilders.matchQuery("values", keyword) instead of boolQuery().must(QueryBuilders.fuzzyQuery(...)), and set the fuzziness on the match query itself with .fuzziness(...).
Update 1:
A fuzzy query is a term-level query, so the keyword is not analyzed, while a match query is a full-text query that analyzes the keyword first.
Also, don't forget that the default operator of a match query is OR, so you should change it to AND (.operator(Operator.AND)), as in your DSL query.
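The difference between the two query types can be illustrated without Elasticsearch at all. The sketch below is a simplification under stated assumptions: the indexed terms are a hand-written guess at what attr_analyzer might produce, plain Levenshtein distance stands in for the fuzzy matching, and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of term-level vs. full-text fuzzy matching.
class FuzzyTermVsMatch {
    // Classic dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Term-level: the whole keyword is compared, unanalyzed, with each indexed term.
    static boolean termLevel(String keyword, List<String> terms, int maxEdits) {
        return terms.stream().anyMatch(t -> editDistance(keyword, t) <= maxEdits);
    }

    // Full-text with AND: the keyword is split into lowercase letter-only words,
    // and every word must be within maxEdits of some indexed term.
    static boolean fullText(String keyword, List<String> terms, int maxEdits) {
        return Arrays.stream(keyword.split("[^\\p{L}]+"))
                .filter(w -> !w.isEmpty())
                .map(w -> w.toLowerCase(Locale.ROOT))
                .allMatch(w -> termLevel(w, terms, maxEdits));
    }

    public static void main(String[] args) {
        List<String> indexed = List.of("activa", "rooftop", "vg"); // assumed terms
        String keyword = " activa rooftop VG3000";
        System.out.println(termLevel(keyword, indexed, 2)); // false
        System.out.println(fullText(keyword, indexed, 2));  // true
    }
}
```

The whole phrase is many edits away from any single indexed term, which is consistent with the fuzzy query returning nothing while the match query finds document 1.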

Unable to get data from elastic search SearchResponse

I am trying to retrieve the data from the SearchResponse class with the code below:
SearchHits searchHits = searchResponse.getHits();
for (SearchHit searchHit : searchHits) {
SearchHitField title = searchHit.field("title");
System.out.println(title.getValue().toString());
}
But I get a NullPointerException in the title.getValue() call. The "title" field is definitely there, and I can verify that by printing the search response, which gives the following output:
{
"took" : 13,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "myIndex",
"_type" : "myTye",
"_id" : "5c849b0f-d72d-4cc9-9b8c-e1201f888f94",
"_score" : 2.4181843,
"_source":{"esId":"100200153", "title":"Book 1"}
}
}
I know that I can retrieve the data with searchHit.getSource() but I am wondering why the above solution isn't working as well.
I think you have to specify .fields(fields) in the request to be able to access the fields part.
For example, if you have a query like this:
{
"query": {
"match_all": {}
}
}
you get in the hits section of the result some fields (_id, _type..., _source).
But, if you have something like this:
{
"query": {
"match_all": {}
},
"fields": ["my_field"]
}
you get back a different result:
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_malformed",
"_type": "test",
"_id": "1",
"_score": 1,
"fields": {
"my_field": [
"whatever"
]
}
},
...
Notice that each hit now has a fields section, which contains the field specified in the search request.
It looks like you are almost there. On each hit, instead of getting the title, get the _source object, then the title field from that source object.
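The shape of the response explains the exception. In the sketch below, plain Java maps stand in for a parsed hit (this is not the Elasticsearch client API, and the class and method names are hypothetical): a hit that only carries _source has no fields section, so a lookup there yields null.

```java
import java.util.Map;

// Hypothetical model of one search hit as nested maps.
class HitAccess {
    // Looks up a value under hit["fields"]; null when the section is missing.
    static Object fromFields(Map<String, Object> hit, String name) {
        Object fields = hit.get("fields");
        return fields instanceof Map ? ((Map<?, ?>) fields).get(name) : null;
    }

    // Looks up a value under hit["_source"]; null when the section is missing.
    static Object fromSource(Map<String, Object> hit, String name) {
        Object source = hit.get("_source");
        return source instanceof Map ? ((Map<?, ?>) source).get(name) : null;
    }

    public static void main(String[] args) {
        Map<String, Object> hit = Map.of(
                "_id", "5c849b0f-d72d-4cc9-9b8c-e1201f888f94",
                "_source", Map.of("esId", "100200153", "title", "Book 1"));
        System.out.println(fromFields(hit, "title")); // null
        System.out.println(fromSource(hit, "title")); // Book 1
    }
}
```

Calling a method on the null returned by the fields lookup is what raises the NullPointerException in the question.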

Format date in elasticsearch query (during retrieval)

I have an Elasticsearch index with a field "aDate" (and a lot of other fields) with the following mapping:
"aDate" : {
"type" : "date",
"format" : "date_optional_time"
}
When I query for a document, I get a result like:
"aDate" : 1421179734000,
I know this is the epoch value, the internal Java/Elasticsearch date representation, but I want a result like:
"aDate" : "2015-01-13T20:08:54",
I played around with scripting:
{
"query":{
"match_all":{
}
},
"script_fields":{
"aDate":{
"script":"if (!_source.aDate?.equals('null')) new java.text.SimpleDateFormat('yyyy-MM-dd\\'T\\'HH:mm:ss').format(new java.util.Date(_source.aDate));"
}
}
}
but it gives strange results (the script basically works, but aDate is the only field returned and _source is missing). The result looks like this:
"hits": [{
"_index": "idx1",
"_type": "type2",
"_id": "8770",
"_score": 1.0,
"fields": {
"aDate": ["2015-01-12T17:15:47"]
}
},
I would prefer a solution without scripting if possible.
When you run a query in Elasticsearch you can request it to return the raw data, for example specifying fields:
curl -XGET http://localhost:9200/myindex/date-test/_search?pretty -d '
{
"fields" : "aDate",
"query":{
"match_all":{
}
}
}'
Will give you the date in the format that you originally stored it:
{
"_index" : "myindex",
"_type" : "date-test",
"_id" : "AUrlWNTAk1DYhbTcL2xO",
"_score" : 1.0,
"fields" : {
"aDate" : [ "2015-01-13T20:08:56" ]
}
}, {
"_index" : "myindex",
"_type" : "date-test",
"_id" : "AUrlQnFgk1DYhbTcL2xM",
"_score" : 1.0,
"fields" : {
"aDate" : [ 1421179734000 ]
}
It's not possible to change the date format unless you use a script.
curl -XGET http://localhost:9200/myindex/date-test/_search?pretty -d '
{
"query":{
"match_all":{ }
},
"script_fields":{
"aDate":{
"script":"use( groovy.time.TimeCategory ) { new Date( doc[\"aDate\"].value ) }"
}
}
}'
Will return:
{
"_index" : "myindex",
"_type" : "date-test",
"_id" : "AUrlWNTAk1DYhbTcL2xO",
"_score" : 1.0,
"fields" : {
"aDate" : [ "2015-01-13T20:08:56.000Z" ]
}
}, {
"_index" : "myindex",
"_type" : "date-test",
"_id" : "AUrlQnFgk1DYhbTcL2xM",
"_score" : 1.0,
"fields" : {
"aDate" : [ "2015-01-13T20:08:54.000Z" ]
}
}
To apply a format, append it as follows:
"script":"use( groovy.time.TimeCategory ){ new Date( doc[\"aDate\"].value ).format(\"yyyy-MM-dd\") }"
will return "aDate" : [ "2015-01-13" ]
To display the T, you'll need to use quotes but replace them with the Unicode equivalent:
"script":"use( groovy.time.TimeCategory ){ new Date( doc[\"aDate\"].value ).format(\"yyyy-MM-dd\u0027T\u0027HH:mm:ss\") }"
returns "aDate" : [ "2015-01-13T20:08:54" ]
To return script_fields and source
Use _source in your query to specify the fields you want to return:
curl -XGET http://localhost:9200/myindex/date-test/_search?pretty -d '
{ "_source" : "name",
"query":{
"match_all":{ }
},
"script_fields":{
"aDate":{
"script":"use( groovy.time.TimeCategory ) { new Date( doc[\"aDate\"].value ) }"
}
}
}'
Will return my name field:
"_source":{"name":"Terry"},
"fields" : {
"aDate" : [ "2015-01-13T20:08:56.000Z" ]
}
Using asterisk will return all fields, e.g.: "_source" : "*",
"_source":{"name":"Terry","aDate":1421179736000},
"fields" : {
"aDate" : [ "2015-01-13T20:08:56.000Z" ]
}
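If receiving the raw epoch value is acceptable, the formatting can also be done on the client after retrieval, which avoids scripting on the server. A minimal Java sketch, assuming the value is epoch milliseconds and that the desired output is in UTC; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical client-side helper that formats an epoch-millisecond value.
class EpochToIso {
    // Same pattern as requested in the question, rendered in UTC.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss").withZone(ZoneOffset.UTC);

    static String format(long epochMillis) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(format(1421179734000L)); // prints 2015-01-13T20:08:54
    }
}
```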
Since 5.0.0, Elasticsearch uses Painless as its default scripting language.
Try this (works in 6.3.2):
"script":"doc['aDate'].value.toString('yyyy-MM-dd HH:mm:ss')"
As LabOctoCat mentioned, Olly Cruickshank's answer no longer works in Elasticsearch 2.2. I changed the script to:
"script":"new Date(doc['time'].value)"
You can format the date according to this.
With scripting, the value is only computed when the document is retrieved. This is expensive, and it keeps you from using any date-related search functions in Elasticsearch.
You should map the field as an Elasticsearch "date" before inserting data. It looks like a Java Date object will do.
Thanks @Archon for your suggestion. I used your answer as a guide to remove the time element from a datetime field in Elasticsearch:
{
"aggs": {
"grp_by_date": {
"terms": {
"size": 200,
"script": "doc['TransactionReconciliationsCreated'].value.toString('yyyy-MM-dd')"
}
}
}
}
If you use Elasticsearch 7 and want to display a datetime in a specific time zone, you can request it like this:
"query": {
"bool": {
"filter": [
{
"term": {
"client": {
"value": "iOS",
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"script_fields": {
"time": {
"script": "ZonedDateTime input = doc['time'].value; input = input.withZoneSameInstant(ZoneId.of('Asia/Shanghai')); String output = input.format(DateTimeFormatter.ISO_ZONED_DATE_TIME); return output"
}
},
"_source": true,
This returns:
{
...
"_source" : {
...
"time" : 1632903354213
...
},
"fields" : {
"time" : [
"2021-09-29T16:15:54.213+08:00[Asia/Shanghai]"
]
}
},
...
}
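The Painless script above uses the same java.time classes that are available in ordinary Java, so the conversion can be tried out locally. A small sketch using the sample value from the response above; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical local equivalent of the time zone conversion in the script.
class ZoneConversion {
    static String inZone(long epochMillis, String zoneId) {
        ZonedDateTime zoned = Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zoneId));
        return zoned.format(DateTimeFormatter.ISO_ZONED_DATE_TIME);
    }

    public static void main(String[] args) {
        // prints 2021-09-29T16:15:54.213+08:00[Asia/Shanghai]
        System.out.println(inZone(1632903354213L, "Asia/Shanghai"));
    }
}
```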

Elasticsearch tokenizer and filter to split the given data

I am struggling to split the data into the output I expect, and I have not been able to get it. I have tried all the filters and tokenizers.
I have updated the settings in Elasticsearch as given below.
{
"settings": {
"analysis": {
"filter": {
"filter_word_delimiter": {
"preserve_original": "true",
"type": "word_delimiter"
}
},
"analyzer": {
"en_us": {
"tokenizer": "keyword",
"filter": [ "filter_word_delimiter","lowercase" ]
}
}
}
}
}
Executed Queries
curl -XGET "XX.XX.XX.XX:9200/keyword/_analyze?pretty=1&analyzer=en_us" -d 'DataGridControl'
Hits value
{
"tokens" : [ {
"token" : "datagridcontrol",
"start_offset" : 0,
"end_offset" : 16,
"type" : "word",
"position" : 1
}, {
"token" : "data",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 1
}, {
"token" : "grid",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 2
}, {
"token" : "control",
"start_offset" : 9,
"end_offset" : 16,
"type" : "word",
"position" : 3
} ]
}
Expected result:
DataGridControl
DataGrid
DataControl
Data
grid
control
What type of tokenizer and filter should I add to the index settings?
Any help?
Try this:
{
"settings": {
"analysis": {
"filter": {
"filter_word_delimiter": {
"type": "word_delimiter"
},
"custom_shingle": {
"type": "shingle",
"token_separator":"",
"max_shingle_size":3
}
},
"analyzer": {
"en_us": {
"tokenizer": "keyword",
"filter": [
"filter_word_delimiter",
"custom_shingle",
"lowercase"
]
}
}
}
}
}
and let me know if it gets you any closer.
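As a rough illustration of what the word delimiter plus shingle combination is meant to do, the idea can be approximated in plain Java: split on case changes, then join up to three adjacent parts with an empty separator. This is a sketch with hypothetical names, not the Elasticsearch filters themselves, so the real token order and positions may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical approximation of word_delimiter + shingle + lowercase.
class ShingleSketch {
    // Split on lower-to-upper case transitions: "DataGridControl" -> [Data, Grid, Control].
    static List<String> splitOnCase(String input) {
        return Arrays.asList(input.split("(?<=[a-z])(?=[A-Z])"));
    }

    // Emit every run of 1..maxSize adjacent parts, joined without a separator.
    static List<String> shingles(List<String> parts, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < parts.size(); start++) {
            StringBuilder joined = new StringBuilder();
            for (int size = 1; size <= maxSize && start + size <= parts.size(); size++) {
                joined.append(parts.get(start + size - 1));
                out.add(joined.toString().toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [data, datagrid, datagridcontrol, grid, gridcontrol, control]
        System.out.println(shingles(splitOnCase("DataGridControl"), 3));
    }
}
```

Because shingles only join adjacent tokens, a combination such as DataControl from the expected list is not produced this way, while gridcontrol appears in addition.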

Elasticsearch combining queries with Boolean query

I'm trying to combine multiple queries in Elasticsearch using a boolean query, but the result is not what I'm expecting. For example:
If I have the following documents (among others):
DOC 1:
{
"name":"Iphone 5",
"product_suggestions":{
"input":[
"iphone 5",
"apple"
]
},
"description":"Iphone 5 - The almost last version",
"brand":"Apple",
"brand_facet":"Apple",
"state_id":"2",
"user_state_description":"Almost New",
"product_type_id":"1",
"current_price":350,
"finish_date":"2014/06/20 14:12",
"finish_date_ms":1403273520
}
DOC 2:
{
"name":"Apple II Lisa",
"product_suggestions":{
"input":[
"apple ii lisa",
"apple"
]
},
"description":"Make a offer and I Apple II Lisa!!",
"brand":"Apple",
"brand_facet":"Apple",
"state_id":"2",
"user_state_description":"Used",
"product_type_id":"1",
"current_price":150,
"finish_date":"2014/06/15 16:12",
"finish_date_ms":1402848720
}
DOC 3:
{
"name":"Iphone 5s",
"product_suggestions":{
"input":[
"iphone 5s",
"apple"
]
},
"description":"Iphone 5s 32Gb like new with a few scratches bla bla bla",
"brand":"Apple",
"brand_facet":"Apple",
"state_id":"1",
"user_state_description":"New",
"product_type_id":"2",
"current_price":510.1,
"finish_date":"2014/06/10 14:12",
"finish_date_ms":1402409520
}
DOC 4:
{
"name":"Iphone 4s",
"product_suggestions":{
"input":[
"iphone 4s",
"apple"
]
},
"description":"Iphone 4s 16Gb Mint conditions and unlocked to all network",
"brand":"Apple",
"brand_facet":"Apple",
"state_id":"1",
"user_state_description":"Almost New",
"product_type_id":"2",
"current_price":385,
"finish_date":"2014/06/12 16:12",
"finish_date_ms":1402589520
}
And if I run the following query (get all documents and facets matching the keyword "Apple" whose finish_date_ms is greater than 1402869581):
{
"from" : 1,
"size" : 20,
"query" : {
"bool" : {
"must" : {
"query_string" : {
"query" : "apple",
"default_operator" : "and",
"analyze_wildcard" : true
}
},
"must_not" : {
"range" : {
"finish_date_ms" : {
"from" : null,
"to" : 1402869581,
"include_lower" : true,
"include_upper" : false
}
}
}
}
},
"facets" : {
"brand" : {
"terms" : {
"field" : "brand_facet",
"size" : 10
}
},
"product_type_id" : {
"terms" : {
"field" : "product_type_id",
"size" : 10
}
},
"state_id" : {
"terms" : {
"field" : "state_id",
"size" : 10
}
}
}
}
This returns:
{
"took":5,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":1,
"max_score":0.18392482,
"hits":[
]
},
"facets":{
"brand":{
"_type":"terms",
"missing":0,
"total":1,
"other":0,
"terms":[
{
"term":"Apple",
"count":1
}
]
},
"product_type_id":{
"_type":"terms",
"missing":0,
"total":1,
"other":0,
"terms":[
{
"term":1,
"count":1
}
]
},
"state_id":{
"_type":"terms",
"missing":0,
"total":1,
"other":0,
"terms":[
{
"term":2,
"count":1
}
]
}
}
}
It should return only DOC 1. If I remove the range query, it returns all the documents that contain the word Apple. If I remove the "term" query, no document is returned, so I presume the problem is in the range query.
Can anyone point me in the right direction with this?
One other important thing: this whole query is to be implemented in Java (if that helps).
Thanks!
(Sorry for this huge post.)
I found my mistake (a newbie mistake, to be honest).
The problem was not in the range query but at the beginning of the JSON: the from field is set to 1, but the result contains only one record, so it should be 0!
Thanks for everything!!
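The effect of from can be reproduced with an ordinary list, since it is a zero-based offset into the matching hits. A minimal sketch with hypothetical names, assuming from and size are non-negative:

```java
import java.util.List;

// Hypothetical illustration of from/size paging over a list of hits.
class Paging {
    // Returns the slice starting at offset 'from' with at most 'size' elements.
    static <T> List<T> page(List<T> hits, int from, int size) {
        if (from >= hits.size()) {
            return List.of(); // the offset skips past every hit
        }
        return hits.subList(from, Math.min(hits.size(), from + size));
    }

    public static void main(String[] args) {
        List<String> matching = List.of("DOC 1");
        System.out.println(page(matching, 1, 20)); // prints []
        System.out.println(page(matching, 0, 20)); // prints [DOC 1]
    }
}
```

This matches the response in the question: total is 1, but the hits array is empty because the single hit was skipped.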
