java : search for substring in elasticsearch - java

I'm trying to look for substrings in the elasticsearch, but what I've come to known and what I've coded doesn't exactly look for a substring like the way I want.
Here's what I've coded :
BoolQueryBuilder query = new BoolQueryBuilder();
query.must(new QueryStringQueryBuilder("tagName : *"+tagName+"*"));
SearchResponse response = esclient.prepareSearch(index).setTypes(type)
.setQuery(query)
.execute().actionGet();
SearchHit[] hits = response.getHits().getHits();
for (SearchHit hit : hits) {
Map map = hit.getSource();
list.add((String) map.get("tagName"));
}
list = list.stream().distinct().collect(Collectors.toList());
for(int i = 0; i < list.size(); i++) {;
jsonArrayBuilder.add((String) list.get(i));
}
What I'm trying to implement is to look even if part of the given tagname matches with anything should be listed.
But in case, for ex : if I'm looking for a tag named "social_security_number" and I type "social security" then I would like it to be listed.
But what's actually happening is if I miss the underscore, it's not getting listed.
Is it possible to be done? Should I modify this code to search that way?
Here is my index structure :
POST arempris/emptagnames
{
"mappings" : {
"emptags":{
"properties": {
"employeeid": {
"type":"integer"
},
"tagName": {
"type": "text",
"fielddata": true,
"analyzer": "lowercase_keyword",
"search_analyzer": "lowercase_keyword"
}
}
}
}
}
Would greatly appreciate for your help and thanks a lot in advance.

The analyzer that you have set does not tokenize anything, so the space is important. Specifying a custom analyzer that will split on whitespaces and underscores and anything you might find useful is a good solution. The below will work, but check really carefully what the analyzer does and visit the documentation for every part you don't understand.
PUT stackoverflow
{
"settings": {
"analysis": {
"analyzer": {
"customanalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"standard",
"generatewordparts"
]
}
},
"filter": {
"generatewordparts": {
"type": "word_delimiter",
"split_on_numerics": false,
"split_on_case_change": false,
"generate_word_parts": true,
"generate_number_parts": false,
"stem_english_possessive": false,
"catenate_all": false
}
}
}
},
"mappings": {
"emptags": {
"properties": {
"employeeid": {
"type": "integer"
},
"tagName": {
"type": "text",
"fielddata": true,
"analyzer": "customanalyzer",
"search_analyzer": "customanalyzer"
}
}
}
}
}
GET stackoverflow/emptags/1
{
"employeeid": 1,
"tagName": "social_security_number"
}
GET stackoverflow/_analyze
{
"analyzer" : "customanalyzer",
"text" : "social_security_number123"
}
GET stackoverflow/_search
{
"query": {
"query_string": {
"default_field": "tagName",
"query": "*curi*"
}
}
}
Another solution would be to normalize your input and replace any symbol that you want to treat as a whitespace (e.g. underscore) with a whitespace.
Read here for more
http://nocf-www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalizers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html

Related

how to search case condition in elastic search

If there are age and name fields
When I look up here, if the name is A, the age is 30 or more, and the others are 20 or more. In this way, I want to give different conditions depending on the field value.
Does es provide this function? I would like to know which keywords to use.
You may or may not be able to tell us how to use it with QueryBuilders provided by Spring.
Thank you.
select * from people
where (name = 'a' and age >=30) or age >=20
This site can convert sql to esdsl
try this
{
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"match_phrase": {
"name": {
"query": "a"
}
}
},
{
"range": {
"age": {
"from": "30"
}
}
}
]
}
},
{
"range": {
"age": {
"from": "20"
}
}
}
]
}
},
"from": 0,
"size": 1
}

How to count particular pattern in a string?

we have some requests which have a lot of experiments. I just want to count the no experiments. If it's greater than some number then I will block those requests
{
"context": {
"requestId": "",
"locale": "",
"deviceId": "",
"currency": "",
"memberId": 0,
"cmsOrigin":,
"experiments": {
"forceByVariant":,
"forceByExperiment": [
{
"id": "test",
"variant": "A"
}
]
}
}
In this request, I just want to check how many id and variant inside the forceByExperiment. I have tried to do using regular expression but not able to do it. Anyone do it before similar thing.
I just split the string with variant and count them. Not sure good idea, but the end goal is to figure out that the request have a lot of experiments.
Using the circe library and Scala, here is an easy solution:
import io.circe._, io.circe.parser._
val jsonString = """{
"context": {
"requestId": "",
"locale": "",
"deviceId": "",
"currency": "",
"memberId": 0,
"cmsOrigin": "foo",
"experiments": {
"forceByVariant": [],
"forceByExperiment": [
{
"id": "test",
"variant": "A"
}
]
}
}
}"""
val parseResult = parse(jsonString)
val nElems = for {
json <- parse(jsonString)
array <- json.hcursor.downField("context").downField("experiments").downField("forceByExperiment").as[Seq[Json]]
} yield array.length
println(nElems) // Right(1)
If you really want to use regex, and if there is no id field in your json structure, you can use the following expression "id": "(\w+)" and count the number of match.
Example: https://regex101.com/r/GCeByw/1/

ElasticSearch range query in a paragraph

I have a field called Description which is a text field and has data like:
This is a good thing for versions before 3.2 but bad for 3.5 and later
I want to run range query on this type of text. I know that for a field containing only Dates/Age(Numbers) or even String Ids, we can use queries like
{
"query": {
"range" : {
"age" : {
"gte" : 10,
"lte" : 20,
"boost" : 2.0
}
}
}
}
But i have a mixed field like mentioned above and I need to perform range query on that. Also, i cannot change the index structure. I can only perform queries or do some post processing after retrieving results. So anyone has any idea how to run this type of query, or even obtain my goal after getting results in the post processing? I am using Java.
I hope i fully understand what you are looking for.
I've managed to create a simple working example.
Mappings
Using char_group tokenizer:
The char_group tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the pattern tokenizer is not acceptable.
Char Group Tokenizer
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"letter",
"whitespace"
]
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"fields": {
"digit": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
Post a few documents
PUT my_index/_doc/1
{
"text": "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
PUT my_index/_doc/2
{
"text": "This is a good thing for versions before 5 but bad for 6 and later"
}
Search Query
GET my_index/_search
{
"query": {
"range": {
"text.digit": {
"gte": 3.2,
"lte": 3.5
}
}
}
}
Results
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"text" : "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
}
]
}
Another Search Query
GET my_index/_search
{
"query": {
"range": {
"text.digit": {
"gt": 3.5
}
}
}
}
Results
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"text" : "This is a good thing for versions before 5 but bad for 6 and later"
}
}
]
}
Analyze Query
Play with the following query till you get the desired results.
It is already compatible to your example.
This is a good thing for versions before 3.2 but bad for 3.5 and later
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"letter",
"whitespace"
]
},
"text": "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
Hope this helps

ElasticSearch - fuzzy search java api results are not proper

I have indexed sample documents in elasticsearch and trying to search using fuzzy query. But am not getting any results when am search by using Java fuzzy query api.
Please find my below mapping script :
PUT productcatalog
{
"settings": {
"analysis": {
"analyzer": {
"attr_analyzer": {
"type": "custom",
"tokenizer": "letter",
"char_filter": [
"html_strip"
],
"filter": ["lowercase", "asciifolding", "stemmer_minimal_english"]
}
},
"filter" : {
"stemmer_minimal_english" : {
"type" : "stemmer",
"name" : "minimal_english"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"values": {
"type": "text",
"analyzer": "attr_analyzer"
},
"catalog_type": {
"type": "text"
},
"catalog_id":{
"type": "long"
}
}
}
}
}
Please find my sample data.
PUT productcatalog/doc/1
{
"catalog_id" : "343",
"catalog_type" : "series",
"values" : "Activa Rooftop, valves, VG3000, VG3000FS, butterfly, ball"
}
PUT productcatalog/doc/2
{
"catalog_id" : "12717",
"catalog_type" : "product",
"values" : "Activa Rooftop, valves"
}
Please find my search script :
GET productcatalog/_search
{
"query": {
"match" : {
"values" : {
"query" : " activa rooftop VG3000",
"operator" : "and",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
}
}
Am getting the below results for the above query :
{
"took": 239,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.970927,
"hits": [
{
"_index": "productcatalog",
"_type": "doc",
"_id": "1",
"_score": 0.970927,
"_source": {
"catalog_id": "343",
"catalog_type": "series",
"values": "Activa Rooftop, valves, VG3000, VG3000FS, butterfly, ball"
}
}
]
}
}
But if i use the below Java API for the same fuzzy search am not getting any results out of it.
Please find my below Java API query for fuzzy search :
QueryBuilder qb = QueryBuilders.boolQuery()
.must(QueryBuilders.fuzzyQuery("values", keyword).boost(1.0f).prefixLength(0).maxExpansions(100));
Update 1
I have tried with the below query
QueryBuilder qb = QueryBuilders.matchQuery(QueryBuilders.fuzzyQuery("values", keyword).boost(1.0f).prefixLength(0).maxExpansions(100));
But am not able to pass QueryBuilders inside matchQuery. Am getting this suggestion while am writing this query The method matchQuery(String, Object) in the type QueryBuilders is not applicable for the arguments (FuzzyQueryBuilder)
The mentioned java query is not a match query. It's a must query. you should use matchQuery instead of boolQuery().must(QueryBuilders.fuzzyQuery())
Update 1:
fuzzy query is a term query while match query is a full text query.
Also don't forget that in match query the default Operator is or operator which you should change it to and like your dsl query.

Elasticsearch lowercase tokenizer quirk?

I am testing mapping for url-s in elasticsearch.
I want to be able to search for entry both by domain name with tld (e.g. example.com)
and without tld (e.g example) and for full domain document to be returned
(like, http://example.com and www.example.com and similar)
I PUT this mapping to ES - in Sense:
PUT /en_docs
{
"mappings": {
"url": {
"properties": {
"content": {
"type": "string",
"analyzer" : "urlzer"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"urlzer": {
"type": "custom",
"tokenizer": "lowercase",
"filter": [ "stopwords_filter" ]
}
},
"filter" : {
"stopwords_filter" : {
"type" : "stop",
"stopwords" : ["http", "https", "ftp", "www"]
}
}
}
}
}
Now, when I index an url document, e.g
POST /en_docs/url
{
"content": "http://example.com"
}
I can get it by searching example.com but just example doesnt return anything.
The lowercase tokenizer im using in my analyzer, as docs say, and as direct testing of my analyzer shows, gives example and com tokens, but when I do the search for indexed document, example returns nothing:
GET /en_docs/url/_search?q=example
gets no results, but if the query is example.com, result is returned.
What am I missing?

Categories