I am testing a mapping for URLs in Elasticsearch.
I want to be able to find an entry both by domain name with the TLD (e.g. example.com)
and without the TLD (e.g. example), and have the full URL document returned
(e.g. http://example.com, www.example.com, and similar).
I PUT this mapping to Elasticsearch in Sense:
PUT /en_docs
{
  "mappings": {
    "url": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "urlzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "urlzer": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": [ "stopwords_filter" ]
        }
      },
      "filter": {
        "stopwords_filter": {
          "type": "stop",
          "stopwords": [ "http", "https", "ftp", "www" ]
        }
      }
    }
  }
}
Now, when I index a URL document, e.g.
POST /en_docs/url
{
  "content": "http://example.com"
}
I can find it by searching for example.com, but searching for just example doesn't return anything.
The lowercase tokenizer I'm using in my analyzer, as the docs say and as direct testing of my analyzer shows, produces the tokens example and com. Yet when I search for the indexed document, example returns nothing:
GET /en_docs/url/_search?q=example
gets no results, but if the query is example.com, the result is returned.
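(For reference, the direct analyzer test mentioned above can be reproduced with something like the following; on the 1.x/2.x releases that Sense targets, the _analyze parameters go in the query string. It should return the tokens example and com.)
GET /en_docs/_analyze?analyzer=urlzer&text=www.example.com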
What am I missing?
I have a field called Description which is a text field and has data like:
This is a good thing for versions before 3.2 but bad for 3.5 and later
I want to run a range query on this kind of text. I know that for a field containing only dates or ages (numbers), or even string IDs, we can use queries like
{
  "query": {
    "range": {
      "age": {
        "gte": 10,
        "lte": 20,
        "boost": 2.0
      }
    }
  }
}
But I have a mixed field like the one above and need to perform a range query on it. Also, I cannot change the index structure; I can only run queries or do some post-processing on the retrieved results. Does anyone have an idea how to run this type of query, or how to achieve my goal by post-processing the results? I am using Java.
I hope I fully understand what you are looking for.
I've managed to create a simple working example.
Mappings
Using the char_group tokenizer:
The char_group tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the pattern tokenizer is not acceptable.
Char Group Tokenizer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "letter",
            "whitespace"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "digit": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
  }
}
Post a few documents
PUT my_index/_doc/1
{
  "text": "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
PUT my_index/_doc/2
{
  "text": "This is a good thing for versions before 5 but bad for 6 and later"
}
Search Query
GET my_index/_search
{
  "query": {
    "range": {
      "text.digit": {
        "gte": 3.2,
        "lte": 3.5
      }
    }
  }
}
Results
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"text" : "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
}
]
}
Another Search Query
GET my_index/_search
{
  "query": {
    "range": {
      "text.digit": {
        "gt": 3.5
      }
    }
  }
}
Results
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"text" : "This is a good thing for versions before 5 but bad for 6 and later"
}
}
]
}
Analyze Query
Play with the following query until you get the desired results.
It already works with your example:
This is a good thing for versions before 3.2 but bad for 3.5 and later
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "letter",
      "whitespace"
    ]
  },
  "text": "This is a good thing for versions before 3.2 but bad for 3.5 and later"
}
Since every letter and every whitespace character acts as a delimiter here, the only tokens emitted for the sample sentence are the numeric fragments 3.2 and 3.5, which is exactly what the range query on text.digit compares against.
Hope this helps
I am trying to search for the document below using a match_phrase query in Kibana, but I am not getting a response.
Please find below the document, which is available in Elasticsearch:
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2910,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "documents",
        "_type": "doc",
        "_id": "DmLD22MBFTg0XFZppYt8",
        "_score": 1.0,
        "_source": {
          "doct_country": "DE",
          "filename": "series_Accessories_v1_de-DE.pdf"
        }
      }
    ]
  }
}
Please find below the query which I am using to search for the above document:
GET documents/_search
{
  "query": {
    "match_phrase": {
      "message": "Accessories_v1_de-DE.pdf"
    }
  }
}
For the above query I am getting this response:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
There are two issues. First, in your query you presumably mean to use the filename field rather than message, which is not present in your example document:
GET documents/_search
{
  "query": {
    "match_phrase": {
      "filename": "Accessories_v1_de-DE.pdf"
    }
  }
}
Second, you need Elasticsearch to know that the filename field should be indexed with _ treated as a split point. This does not happen by default. One way to achieve it is to define your mapping as follows:
PUT /documents
{
  "mappings": {
    "document": {
      "properties": {
        "filename": { "type": "text", "analyzer": "simple" }
      }
    }
  }
}
The simple analyzer splits on any non-letter, so _ and digits are treated as split points. Depending on your application, you may need finer-grained control over tokenization; see the documentation.
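To verify the tokenization, an _analyze request along these lines (request-body syntax of recent Elasticsearch versions) shows how the simple analyzer breaks up the filename; it should return the tokens series, accessories, v, de, de and pdf, which is why the corrected match_phrase query can find the document:
POST _analyze
{
  "analyzer": "simple",
  "text": "series_Accessories_v1_de-DE.pdf"
}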
I have someone putting JSON objects into Elasticsearch for which I do not know any fields. I would like to search all the fields for a given value using a matchQuery.
I understand that the _all field is deprecated, and copy_to doesn't work because I don't know what fields are available beforehand. Is there a way to accomplish this without knowing which fields to search?
Yes, you can achieve this using a custom _all field (which I called my_all) and a dynamic template for your index. Basically, the idea is to have a generic mapping for all fields with a copy_to setting pointing to the my_all field. I've also added store: true for the my_all field, but only to show you that it works; in practice you won't need it.
So let's go and create the index:
PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "all_fields": {
            "match": "*",
            "mapping": {
              "copy_to": "my_all"
            }
          }
        }
      ],
      "properties": {
        "my_all": {
          "type": "text",
          "store": true
        }
      }
    }
  }
}
Then index a document:
PUT my_index/_doc/1
{
  "test": "the cat drinks milk",
  "age": 10,
  "alive": true,
  "date": "2018-03-21T10:00:00.123Z",
  "val": ["data", "data2", "data3"]
}
Finally, we can search using the my_all field and also display its content (because we stored it) in addition to the _source of the document:
GET my_index/_search?q=my_all:cat&_source=true&stored_fields=my_all
And the result is shown below:
{
  "_index": "my_index",
  "_type": "_doc",
  "_id": "1",
  "_score": 0.2876821,
  "_source": {
    "test": "the cat drinks milk",
    "age": 10,
    "alive": true,
    "date": "2018-03-21T10:00:00.123Z",
    "val": [
      "data",
      "data2",
      "data3"
    ]
  },
  "fields": {
    "my_all": [
      "the cat drinks milk",
      "10",
      "true",
      "2018-03-21T10:00:00.123Z",
      "data",
      "data2",
      "data3"
    ]
  }
}
So as long as you can create the index and its mapping up front, you'll be able to search whatever people send to it.
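If you prefer the request-body form over the URI search, a roughly equivalent match query (the matchQuery mentioned in the question) would be:
GET my_index/_search
{
  "query": {
    "match": {
      "my_all": "cat"
    }
  }
}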
I'm trying to search for substrings in Elasticsearch, but what I've come to know and what I've coded doesn't search for substrings quite the way I want.
Here's what I've coded:
BoolQueryBuilder query = new BoolQueryBuilder();
// Wildcard query: match any tagName containing the given string
query.must(new QueryStringQueryBuilder("tagName : *" + tagName + "*"));
SearchResponse response = esclient.prepareSearch(index).setTypes(type)
    .setQuery(query)
    .execute().actionGet();
SearchHit[] hits = response.getHits().getHits();
for (SearchHit hit : hits) {
    Map<String, Object> map = hit.getSource();
    list.add((String) map.get("tagName"));
}
// Remove duplicates before building the JSON array
list = list.stream().distinct().collect(Collectors.toList());
for (int i = 0; i < list.size(); i++) {
    jsonArrayBuilder.add(list.get(i));
}
What I'm trying to implement: if any part of the given tag name matches, the tag should be listed.
For example, if I'm looking for a tag named "social_security_number" and I type "social security", I would like it to be listed.
But what actually happens is that if I leave out the underscore, it's not listed.
Can this be done? How should I modify this code to search that way?
Here is my index structure:
POST arempris/emptagnames
{
  "mappings": {
    "emptags": {
      "properties": {
        "employeeid": {
          "type": "integer"
        },
        "tagName": {
          "type": "text",
          "fielddata": true,
          "analyzer": "lowercase_keyword",
          "search_analyzer": "lowercase_keyword"
        }
      }
    }
  }
}
I would greatly appreciate your help. Thanks a lot in advance.
The analyzer you have set does not tokenize anything, so the space matters. Specifying a custom analyzer that splits on whitespace, underscores, and anything else you find useful is a good solution. The configuration below will work, but check carefully what the analyzer does and visit the documentation for every part you don't understand.
PUT stackoverflow
{
  "settings": {
    "analysis": {
      "analyzer": {
        "customanalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "standard",
            "generatewordparts"
          ]
        }
      },
      "filter": {
        "generatewordparts": {
          "type": "word_delimiter",
          "split_on_numerics": false,
          "split_on_case_change": false,
          "generate_word_parts": true,
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "catenate_all": false
        }
      }
    }
  },
  "mappings": {
    "emptags": {
      "properties": {
        "employeeid": {
          "type": "integer"
        },
        "tagName": {
          "type": "text",
          "fielddata": true,
          "analyzer": "customanalyzer",
          "search_analyzer": "customanalyzer"
        }
      }
    }
  }
}
PUT stackoverflow/emptags/1
{
  "employeeid": 1,
  "tagName": "social_security_number"
}
GET stackoverflow/_analyze
{
  "analyzer": "customanalyzer",
  "text": "social_security_number123"
}
This should produce tokens like social, security, and number123 (the underscores become split points, while split_on_numerics: false keeps number123 together), so a wildcard query such as *curi* can match the security token:
GET stackoverflow/_search
{
  "query": {
    "query_string": {
      "default_field": "tagName",
      "query": "*curi*"
    }
  }
}
Another solution would be to normalize your input and replace any symbol you want to treat as whitespace (e.g. the underscore) with an actual whitespace character, as sketched below.
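For instance, a mapping char_filter can do the replacement before tokenization. This is only a sketch: the index name normalized_index and the analyzer/filter names are made up for illustration, and \\u0020 is the escaped space character:
PUT normalized_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": [ "_ => \\u0020" ]
        }
      },
      "analyzer": {
        "normalizing_analyzer": {
          "type": "custom",
          "char_filter": [ "underscore_to_space" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}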
Read here for more:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalizers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html
I am working with Elasticsearch v1.1.1 and I have run into a problem with search queries. I want to know how to solve the obstacle below.
Here is my mapping
{
  "token": {
    "type": "string"
  }
}
The data in the indexed record is:
{
  "token": "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd"
}
My search term is
4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd
I want an exact match on the record. Which query do I need to use to get an exact match? Can it be done with
QueryBuilders.queryString() ?
I checked queryString() and concluded that it is not useful for exact matching.
Please suggest an approach.
You can put quotes around the string to do an exact match:
QueryBuilders.queryString("\"4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd\"");
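In request-body form this is roughly equivalent to the following sketch (not taken from your setup):
{
  "query": {
    "query_string": {
      "query": "\"4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd\""
    }
  }
}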
If you don't want partial matches on the above string, index an untokenized version of the value and search on that. In your mapping add:
"token": {
"type": "multi_field",
"fields": {
"untouched": {
"type": "string",
"index": "not_analyzed"
}
}
}
Then search:
{
  "query": {
    "match": {
      "token.untouched": "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd"
    }
  }
}
Alternatively, change the mapping so Elasticsearch doesn't touch your data while indexing:
{
  "token": {
    "type": "string",
    "index": "not_analyzed"
  }
}
And then run a TermQuery from Java like this:
QueryBuilders.termQuery("token", "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd");
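For reference, the equivalent request-body query (a sketch, assuming an index named my_index) would be:
GET /my_index/_search
{
  "query": {
    "term": {
      "token": "4r5etgg-kogignjj-jdjuty687-ofijfjfhf-kdjudyhd"
    }
  }
}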
That should give you your exact match.