Create and merge indexes using multiple analyzers in Elasticsearch - java

So, I have two filters defined in my config JSON file. Now, I want to apply these filters one at a time and then combine the result.
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 20
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 1,
"max_shingle_size": 2
}
},
Example:
"best mac laptop" -> "best", "mac", "laptop", "best mac", "mac laptop", "bes", "best", "best ", "best m", "best ma", "best mac", ...
As in the example above, I want to index the data with the shingle filter, index the original data with the autocomplete filter, and then combine both results in a single document's index. Is this possible? Is there any way?

So, after looking hard into the Spring Data Elasticsearch docs, I'm now able to index the same field using two different analyzers.
import lombok.Getter;
import lombok.Setter;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.*;

@Document(indexName = "course-doc")
@Setting(settingPath = "es-config/autocomplete.json")
@Getter
@Setter
public class Course {

    @Id
    long id;

    @MultiField(
            mainField = @Field(type = FieldType.Text, analyzer = "autocomplete_index", searchAnalyzer = "autocomplete_search"),
            otherFields = {@InnerField(suffix = "search", type = FieldType.Text, analyzer = "search_index", searchAnalyzer = "autocomplete_search")})
    String name;
}
autocomplete.json
{
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 1,
"max_shingle_size": 10
}
},
"analyzer": {
"autocomplete_search": {
"type": "custom",
"tokenizer": "standard",
"filter": [ "lowercase" ]
},
"autocomplete_index": {
"type": "custom",
"tokenizer": "standard",
"filter": [ "lowercase", "stop" , "autocomplete_filter" ]
},
"search_index": {
"type": "custom",
"tokenizer": "standard",
"filter": [ "lowercase" , "shingle_filter" ]
},
"standard-analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [ "lowercase", "stop" ]
}
}
}
}
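At query time, both views of the field can then be combined in a single request. Below is a minimal sketch of such a query, assuming Spring Data Elasticsearch 4.x and an injected ElasticsearchRestTemplate (both assumptions, not part of the original setup):
import org.elasticsearch.index.query.QueryBuilders;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;
import org.springframework.data.elasticsearch.core.SearchHits;
import org.springframework.data.elasticsearch.core.query.NativeSearchQuery;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;

public class CourseSearch {

    private final ElasticsearchRestTemplate template;

    public CourseSearch(ElasticsearchRestTemplate template) {
        this.template = template;
    }

    // Searches the edge-ngram view ("name") and the shingle view
    // ("name.search") of the same value in one multi_match query.
    public SearchHits<Course> suggest(String text) {
        NativeSearchQuery query = new NativeSearchQueryBuilder()
                .withQuery(QueryBuilders.multiMatchQuery(text, "name", "name.search"))
                .build();
        return template.search(query, Course.class);
    }
}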

Related

QueryExceededMemoryLimitNoDiskUseAllowed with MongoDB Atlas

I've developed a Java application that uses an Atlas MongoDB serverless database.
This application performs an aggregation query with the following steps:
$match
$project
$addFields
$sort
$facet
$project
When I perform a query that returns a lot of results, I get this exception: QueryExceededMemoryLimitNoDiskUseAllowed.
I've tried modifying my code to add allowDiskUse: true to the aggregation, but it didn't resolve the exception.
I tried to replicate my aggregation pipeline in the Atlas console and found that everything works fine until the $facet step, which returns:
Reason: PlanExecutor error during aggregation :: caused by :: Sort exceeded memory limit of 33554432 bytes, but did not opt in to external sorting.
This is my $facet step:
{$facet: {
paginatedResults: [{ $skip: 0 }, { $limit: 50 }],
totalCount: [
{
$count: 'count'
}
]
}
}
As you can see, I'm using it to paginate my query results.
Any suggestions for avoiding this problem?
I was thinking about making two different queries, one for the results and one for the total count, but I'm not sure that's the best solution.
EDIT: added query
db.vendor_search.aggregate(
{$match: {
$or: [
{'searchKeys.value': {$regex: "vendor"}},
{'searchKeys.value': {$regex: "test"}},
{'searchKeys.valueClean': {$regex: "vendor"}},
{'searchKeys.valueClean': {$regex: "test"}},
],
buyerId: 7
}},
{$project: {
companyId: 1,
buyerId: 1,
companyName: 1,
legalForm: 1,
country: 1,
supplhiCompanyCode: 1,
vat: 1,
erpCode: 1,
visibility: 1,
businessStatus: 1,
city: 1,
logo: 1,
location: {$concat : ["$country.value",'$city']},
searchKeys: {
"$filter": {
"input": "$searchKeys",
"cond": {
"$or": [
{$regexMatch: {input: "$$this.value",regex: "vendor"}},
{$regexMatch: {input: "$$this.value",regex: "test"}}
{$regexMatch: {input: "$$this.valueClean",regex: "vendor"}},
{$regexMatch: {input: "$$this.valueClean",regex: "test"}}
]
}
}
}
}},
{$addFields: {
searchMatching: {
$reduce: {
input: "$searchKeys.type",
initialValue: [],
in: {
$concatArrays: [
"$$value",
{$cond: [{$in: ["$$this", "$$value"]},[],["$$this"]]}
]
}
}
},
'sort.supplhiId': { $toLower: "$supplhiCompanyCode" },
'sort.companyName': { $toLower: "$companyName" },
'sort.location': { $toLower: {$concat : ["$country.value"," ","$city"]}},
'sort.vat': { $toLower: "$vat" },
'sort.companyStatus': { $toLower: "$businessStatus" },
'sort.erpCode': { $toLower: "$erpCode" }
}},
{$sort: {"sort.companyName": 1}},
{$facet: {
paginatedResults: [{ $skip: 0 }, { $limit: 50 }],
totalCount: [
{
$count: 'count'
}
]
}
},
{$project: {paginatedResults:1, 'totalCount': {$first : '$totalCount.count'}}}
)
EDIT: Added model
{
"buyerId": 1,
"companyId": 869048,
"address": "FP8R+52H",
"businessStatus": "AC",
"city": "Chiffa",
"companyName": "Test Algeria 25 agosto",
"country": {
"lookupId": 78,
"code": "DZA",
"value": "Algeria"
},
"erpCode": null,
"legalForm": "Ltd.",
"logo": "fc4d821a-e814-49e4-96d1-f32421fdaa6d_1.jpg",
"searchKeys": [
{
"type": "contact",
"value": "pebiw81522#xitudy.com",
"valueClean": "pebiw81522xitudycom"
},
{
"type": "company_registration_number",
"value": "112211331144",
"valueClean": "112211331144"
},
{
"type": "vendor_name",
"value": "test algeria 25 agosto ltd.",
"valueClean": "test algeria 25 agosto ltd"
},
{
"type": "contact",
"value": "tredicisf2#ottobre2022.com",
"valueClean": "tredicisf2ottobre2022com"
},
{
"type": "contact",
"value": "ty#s.com",
"valueClean": "tyscom"
},
{
"type": "contact",
"value": "info#x.com",
"valueClean": "infoxcom"
},
{
"type": "tin",
"value": "00112341675",
"valueClean": "00112341675"
},
{
"type": "contact",
"value": "hatikog381#rxcay.com",
"valueClean": "hatikog381rxcaycom"
},
{
"type": "supplhi_id",
"value": "100059410",
"valueClean": "100059410"
},
{
"type": "contact",
"value": "tredici#ottobre2022.com",
"valueClean": "trediciottobre2022com"
},
{
"type": "country_key",
"value": "00112341675",
"valueClean": "00112341675"
},
{
"type": "vat",
"value": "00112341675",
"valueClean": "00112341675"
},
{
"type": "address",
"value": "fp8r+52h",
"valueClean": "fp8r52h"
},
{
"type": "city",
"value": "chiffa",
"valueClean": "chiffa"
},
{
"type": "contact",
"value": "prova#supplhi.com",
"valueClean": "provasupplhicom"
},
{
"type": "contact",
"value": "saraxo2669#dmonies.com",
"valueClean": "saraxo2669dmoniescom"
}
],
"supplhiCompanyCode": "100059410",
"vat": "00112341675",
"visibility": true
}
In Atlas M0 free clusters and M2/M5 shared clusters, the in-memory sort limit is 32 MB (ref); this limit seems to apply to serverless instances as well.
For a mongod without this restriction, you can usually increase the limit from 32 MB to, for example, 320 MB as follows:
db.adminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes: 335544320})
You can check the current value with:
db.runCommand( { getParameter : 1, "internalQueryExecMaxBlockingSortBytes" : 1 } )
But it is best to optimize your queries so they don't hit this limit. If you post your full query and indexes (db.collection.getIndexes()), perhaps there is a better way ...
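If the tier caps blocking sorts no matter what, the two-query approach mentioned in the question is a reasonable fallback. Here is a minimal sketch with the MongoDB sync Java driver; the connection string, database name, and the omission of the $project/$addFields stages are simplifications, not part of the original code:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class VendorSearchExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb+srv://<cluster-uri>")) {
            MongoCollection<Document> vendors =
                    client.getDatabase("mydb").getCollection("vendor_search");

            // Query 1: total count only -- no sort stage, so no sort memory limit.
            long totalCount = vendors.countDocuments(Filters.eq("buyerId", 7));

            // Query 2: one page of results. In the real pipeline the lowercased
            // sort keys are computed in $addFields before the $sort stage.
            // allowDiskUse lets the sort spill to disk on tiers that permit it.
            List<Document> page = vendors.aggregate(List.of(
                            Aggregates.match(Filters.eq("buyerId", 7)),
                            Aggregates.sort(Sorts.ascending("companyName")),
                            Aggregates.skip(0),
                            Aggregates.limit(50)))
                    .allowDiskUse(true)
                    .into(new ArrayList<>());

            System.out.printf("showing %d of %d results%n", page.size(), totalCount);
        }
    }
}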

how to index each feature elasticsearch featurecollection

I have a typical FeatureCollection JSON file. When I import this file via Kibana, I can see that the index's doc count equals the number of features in the FeatureCollection, so I can run a query and get a specific feature.
But when I try to index this file via Java using the mapping below, the file is saved but the doc count is 1, so I can't run a geospatial query to get a specific feature.
{
"mappings": {
"dynamic": true,
"properties": {
"checksum": {
"type": "keyword"
},
"geometry": {
"properties": {
"coordinates": {
"type": "geo_shape"
}
}
}
}
}
}
This is the file:
{
"hashCode": 1708148999,
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"id": "station_exits.fid--3e26eb90_1774de53429_-64b5",
"geometry": {
"type": "Point",
"coordinates": [38.5752041, 54.8366001, 75.25849601491346]
},
"properties": {
"gid": 1,
"obj_id": "004528ca-3b1f-4210-b10e-afab2d268144",
"prefect": "SAO",
"district": "Timit",
"obj_name": "exits",
"line": "xxxxx",
"status": "xxxx",
"link": "/view/main"
}
}
]
}
Can anyone help me create the right mapping?
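For context, Kibana's file importer produces one document per feature because it splits the features array and indexes each feature separately; posting the whole FeatureCollection as one request body always yields a single document, whatever the mapping. Below is a minimal sketch of per-feature indexing with the Java high-level REST client (7.x) and Jackson; the index name, file name, and client setup are assumptions:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.File;

public class FeatureIndexer {

    // Indexes each GeoJSON feature as its own document, so the doc count
    // matches the number of features, as with the Kibana importer.
    public static void indexFeatures(RestHighLevelClient client) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode collection = mapper.readTree(new File("station_exits.json"));

        BulkRequest bulk = new BulkRequest();
        for (JsonNode feature : collection.get("features")) {
            bulk.add(new IndexRequest("stations")
                    .source(mapper.writeValueAsString(feature), XContentType.JSON));
        }
        client.bulk(bulk, RequestOptions.DEFAULT);
    }
}
Note also that a geo_shape field consumes the whole GeoJSON geometry object (type plus coordinates), so the geo_shape mapping typically belongs on geometry itself rather than on geometry.coordinates.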

Elasticsearch nested sort - mismatch between document and nested object used for sorting

I've been developing a new search API with AWS Elasticsearch (version 6.2) as the backend.
Right now, I'm trying to support "sort" options for the API.
My mapping is as follows (unrelated fields not included):
{
"properties": {
"id": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"description": {
"type": "text"
},
"materialDefinitionProperties": {
"type": "nested",
"properties": {
"id": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
},
"analyzer": "case_sensitive_analyzer"
},
"value" : {
"type": "nested",
"properties": {
"valueString": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
}
}
I'm attempting to let users sort by property value (path: materialDefinitionProperties.value.valueString.raw).
Note that it's inside two levels of nested objects (materialDefinitionProperties and materialDefinitionProperties.value are both nested objects).
To sort the results by the value of property with ID "PART NUMBER", my request for sorting is:
{
"fieldName": "materialDefinitionProperties.value.valueString.raw",
"nestedSort": {
"path": "materialDefinitionProperties",
"filter": {
"fieldName": "materialDefinitionProperties.id",
"value": "PART NUMBER",
"slop": 0,
"boost": 1
},
"nestedSort": {
"path": "materialDefinitionProperties.value"
}
},
"order": "ASC"
}
However, when I examined the response, the "sort" field did not match the document's property value:
{
"_index": "material-definition-index-v2",
"_type": "default",
"_id": "development_LITL4ZCNE",
"_source": {
"id": "LITL4ZCNE",
"description": [
"CPU, Intel, Cascade Lake, 8259CL, 24C, 210W, B1 Prod"
],
"materialDefinitionProperties": [
{
"id": "PART NUMBER",
"description": [],
"value": [
{
"valueString": "202-001193-001",
"isOriginal": true
}
]
}
]
},
"sort": [
"100-000018"
]
},
The document's PART NUMBER property is "202-001193-001", but the "sort" field says "100-000018", which is the part number of another document.
It seems that there's a mismatch between the master document and the nested object used for sorting.
This request worked well when there was only a small number of documents in the cluster, but once I backfilled the cluster with ~1 million records, the symptom appeared. I've also tried creating a new ES cluster, but the results are the same.
Sorting by other non-nested attributes worked well.
Did I misunderstand the concept of nested objects, or misuse the nested sort feature?
Any ideas appreciated!
This is a bug in Elasticsearch. Upgrading to 6.4.0 fixed the issue.
Issue tracker: https://github.com/elastic/elasticsearch/pull/32204
Release note: https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-6.4.0.html
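For reference, the native form of that sort request through the Elasticsearch 6.x Java API looks roughly like the sketch below; the paths and the PART NUMBER filter come from the question, while the match query on the id field and the surrounding class are assumptions:
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.NestedSortBuilder;
import org.elasticsearch.search.sort.SortBuilders;
import org.elasticsearch.search.sort.SortOrder;

public class NestedSortExample {

    // Two-level nested sort: filter on the outer nested level
    // (materialDefinitionProperties.id == "PART NUMBER"), then descend into
    // the inner nested level to pick the sort value.
    public static SearchSourceBuilder partNumberSort() {
        return new SearchSourceBuilder()
                .sort(SortBuilders
                        .fieldSort("materialDefinitionProperties.value.valueString.raw")
                        .order(SortOrder.ASC)
                        .setNestedSort(new NestedSortBuilder("materialDefinitionProperties")
                                .setFilter(QueryBuilders.matchQuery(
                                        "materialDefinitionProperties.id", "PART NUMBER"))
                                .setNestedSort(new NestedSortBuilder(
                                        "materialDefinitionProperties.value"))));
    }
}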

ElasticSearch - JavaApi searching not happening without (*) in my input query

I'm fetching documents from Elasticsearch using the Java API. I have the following code value in my Elasticsearch documents and am trying to search it with the following patterns.
code : MS-VMA1615-0D
Input : *VMA1615-0* -- I get the result (MS-VMA1615-0D).
Input : MS-VMA1615-0D -- I get the result (MS-VMA1615-0D).
Input : *VMA1615-0 -- I get the result (MS-VMA1615-0D).
Input : *VMA*-0* -- I get the result (MS-VMA1615-0D).
But if I give input like the below, I get no results.
Input : VMA1615 -- I get no results.
I expect it to return the code MS-VMA1615-0D.
Below is the Java code I'm using:
private final String INDEX = "products";
private final String TYPE = "doc";
SearchRequest searchRequest = new SearchRequest(INDEX);
searchRequest.types(TYPE);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryStringQueryBuilder qsQueryBuilder = new QueryStringQueryBuilder(code);
qsQueryBuilder.defaultField("code");
searchSourceBuilder.query(qsQueryBuilder);
searchSourceBuilder.size(50);
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = null;
try {
searchResponse = SearchEngineClient.getInstance().search(searchRequest);
} catch (IOException e) {
e.printStackTrace();
}
Item item = null;
SearchHit[] searchHits = searchResponse.getHits().getHits();
Here are my mapping details:
PUT products
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"code": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
}
To do what you're looking for, you may have to change the tokenizer you're using. Currently you are using the whitespace tokenizer, which must be replaced with the pattern tokenizer.
So your new mapping should look like the one below:
PUT products
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "pattern",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"code": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
}
So after changing your mapping, a query for VMA1615 will return MS-VMA1615-0D.
This works because the pattern tokenizer splits the string "MS-VMA1615-0D" into "MS", "VMA1615" and "0D", so whenever your query contains any of those tokens, it will return the result.
POST _analyze
{
"tokenizer": "pattern",
"text": "MS-VMA1615-0D"
}
will return:
{
"tokens": [
{
"token": "MS",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "VMA1615",
"start_offset": 3,
"end_offset": 10,
"type": "word",
"position": 1
},
{
"token": "0D",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 2
}
]
}
Based on your comment:
That is not how Elasticsearch works. Elasticsearch stores terms and their corresponding documents in an inverted index data structure, and by default the terms produced from full text are split on whitespace, i.e. the text "Hi there I am a technocrat" is split up as ["Hi", "there", "I", "am", "a", "technocrat"]. This implies that the terms which get stored depend on how the text is tokenized. After indexing, if you query for "technocrat" in the above example, you get the result because the inverted index has that term associated with the document. So in your case, "VMA" is not stored as a term.
To store it as a term, use the mapping below:
PUT products
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "my_pattern_tokenizer",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
},
"tokenizer": {
"my_pattern_tokenizer": {
"type": "pattern",
"pattern": "-|\\d"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"code": {
"type": "text",
"analyzer": "custom_analyzer"
}
}
}
}
}
So to check:
POST products/_analyze
{
"tokenizer": "my_pattern_tokenizer",
"text": "MS-VMA1615-0D"
}
will produce:
{
"tokens": [
{
"token": "MS",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "VMA",
"start_offset": 3,
"end_offset": 6,
"type": "word",
"position": 1
},
{
"token": "D",
"start_offset": 12,
"end_offset": 13,
"type": "word",
"position": 2
}
]
}
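With either of these mappings in place, the Java side no longer needs wildcards: the same analyzer runs at search time, so a plain match query finds the token. A small sketch, reusing the client setup from the question:
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// ...
// "VMA1615" is analyzed with the same pattern tokenizer at search time,
// so it matches the stored token and returns MS-VMA1615-0D.
SearchRequest searchRequest = new SearchRequest("products");
searchRequest.source(new SearchSourceBuilder()
        .query(QueryBuilders.matchQuery("code", "VMA1615"))
        .size(50));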

generating POJOs from JSON Schema for non-object types

I am trying to generate POJOs from the JSON Schema of XBMC.
I'm doing this with jsonschema2pojo.
However, nothing gets generated. It doesn't even give me an error.
This is a reduced sample JSON schema I am trying to generate from:
{
"description": "JSON-RPC API of XBMC",
"id": "http://xbmc.org/jsonrpc/ServiceDescription.json",
"methods": {
"Addons.ExecuteAddon": {
"description": "Executes the given addon with the given parameters (if possible)",
"params": [
{
"name": "addonid",
"required": true,
"type": "string"
},
{
"default": "",
"name": "params",
"type": [
{
"additionalProperties": {
"default": "",
"type": "string"
},
"type": "object"
},
{
"items": {
"type": "string"
},
"type": "array"
},
{
"description": "URL path (must start with / or ?",
"type": "string"
}
]
},
{
"default": false,
"name": "wait",
"type": "boolean"
}
],
"returns": {
"type": "string"
},
"type": "method"
}
},
"notifications": {
"Application.OnVolumeChanged": {
"description": "The volume of the application has changed.",
"params": [
{
"name": "sender",
"required": true,
"type": "string"
},
{
"name": "data",
"properties": {
"muted": {
"required": true,
"type": "boolean"
},
"volume": {
"maximum": 100,
"minimum": 0,
"required": true,
"type": "integer"
}
},
"required": true,
"type": "object"
}
],
"returns": null,
"type": "notification"
}
},
"types": {
"Addon.Content": {
"default": "unknown",
"enums": [
"unknown",
"video",
"audio",
"image",
"executable"
],
"id": "Addon.Content",
"type": "string"
}
},
"version": "6.14.3"
}
I must admit that my knowledge of JSON Schema is very limited; maybe it is just a simple mistake of mine. But can anyone help me generate Java objects from such a JSON Schema?
JSON Schema doesn't support methods. A JSON schema defines a JSON data structure; it is not meant to define your methods. The most important attribute in a JSON schema is properties.
A JSON schema is good for generating POJO data models, but not business logic. You can learn JSON Schema from those examples.
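To illustrate that point, jsonschema2pojo does generate a class once it is given an object schema with properties, for example the data params object of the Application.OnVolumeChanged notification above. A minimal sketch of its programmatic API follows; the class name, package, and output directory are made up, and the embedded schema is extracted from the post:
import com.sun.codemodel.JCodeModel;
import org.jsonschema2pojo.SchemaMapper;

import java.io.File;

public class GeneratePojos {
    public static void main(String[] args) throws Exception {
        // Object schema with "properties": the shape jsonschema2pojo turns
        // into a POJO, unlike the "methods" section of the service document.
        String schema = "{"
                + "\"type\": \"object\","
                + "\"properties\": {"
                + "\"muted\": {\"type\": \"boolean\"},"
                + "\"volume\": {\"type\": \"integer\", \"minimum\": 0, \"maximum\": 100}"
                + "}}";

        JCodeModel codeModel = new JCodeModel();
        new SchemaMapper().generate(codeModel, "VolumeChangedData", "org.xbmc.generated", schema);

        File outputDir = new File("generated-sources");
        outputDir.mkdirs();
        codeModel.build(outputDir);
    }
}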
