Embedded document index for date is not used - java

I'm new to MongoDB and I have a collection with this type of document:
{
"_id" : {
"coordinate" : {
"latitude" : 532144,
"longitude" : -33333
},
"margin" : "N"
},
"prices" : [
{
"type" : "GAS_95",
"price" : 1370,
"date" : ISODate("2014-05-03T18:39:13.635Z")
},
{
"type" : "DIESEL_A",
"price" : 1299,
"date" : ISODate("2014-05-03T18:39:13.635Z")
},
{
"type" : "DIESEL_A_NEW",
"price" : 1350,
"date" : ISODate("2014-05-03T18:39:13.635Z")
},
{
"type" : "GAS_98",
"price" : 1470,
"date" : ISODate("2014-05-03T18:39:13.635Z")
}
]
}
I need to retrieve the prices for a specific date, so I run this query:
db.gasStation.aggregate(
{ "$unwind" : "$prices"},
{ "$match" : {
"_id" : {
"coordinate" : {
"latitude" : 532144 ,
"longitude" : -33333} ,
"margin" : "N"
} ,
"prices.date" : {
"$gte" : ISODate("2014-05-02T23:00:00.000Z") ,
"$lte" : ISODate("2014-05-03T22:59:59.999Z")
}
}
});
All works fine and I retrieve the documents, but I presume that my query can be improved, so I tried to create an index on _id and prices.date:
db.gasStation.ensureIndex( {
"_id" : 1,
"prices.date" : 1
} )
After that I checked whether the index is being used by running the query with the explain option, but it is not using any index:
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"plan" : {
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false,
"allPlans" : [
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"scanAndOrder" : false
}
]
}
}
},
{
"$unwind" : "$prices"
},
{
"$match" : {
"_id" : {
"coordinate" : {
"latitude" : 532144,
"longitude" : -33333
},
"margin" : "N"
},
"prices.date" : {
"$gte" : ISODate("2014-05-02T23:00:00Z"),
"$lte" : ISODate("2014-05-03T22:59:59.999Z")
}
}
}
],
"ok" : 1
}
Is there any reason my query is not able to use the index? I read in the MongoDB documentation that the only pipeline stage that cannot use indexes is $group, but I'm not using that stage.

Try re-arranging your aggregation pipeline operators. For instance, this query:
db.gasStation.aggregate([
{ "$match" : {
"_id" : {
"coordinate" : {
"latitude" : 532144 ,
"longitude" : -33333} ,
"margin" : "N"
}
}},
{ "$unwind" : "$prices"},
{ "$match" : {
"prices.date" : {
"$gte" : ISODate("2014-05-02T23:00:00.000Z") ,
"$lte" : ISODate("2014-05-03T22:59:59.999Z")
}
}}
], {explain:true});
produces this output, which does show some index usage now:
{
"stages" : [
{
"$cursor" : {
"query" : {
"_id" : {
"coordinate" : {
"latitude" : 532144,
"longitude" : -33333
},
"margin" : "N"
}
},
"plan" : {
"cursor" : "IDCursor",
"indexBounds" : {
"_id" : [
[
{
"coordinate" : {
"latitude" : 532144,
"longitude" : -33333
},
"margin" : "N"
},
{
"coordinate" : {
"latitude" : 532144,
"longitude" : -33333
},
"margin" : "N"
}
]
]
}
}
}
},
{
"$unwind" : "$prices"
},
{
"$match" : {
"prices.date" : {
"$gte" : ISODate("2014-05-02T23:00:00Z"),
"$lte" : ISODate("2014-05-03T22:59:59.999Z")
}
}
}
],
"ok" : 1
The point is to get pipeline operators like $match and $sort up front at the beginning of the pipeline, so they can use indexes to limit how much data is read and passed on into the rest of the aggregation. There is more you can do with the above example to improve performance, but this should give you a good idea of how to approach it.
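Since the question is tagged java, here is a minimal sketch of the same reordered pipeline using the MongoDB Java driver's Aggregates helpers. The class name, connection string, and database name are illustrative assumptions, not part of the original post:
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Aggregates.unwind;
import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.gte;
import static com.mongodb.client.model.Filters.lte;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import java.time.Instant;
import java.util.Arrays;
import java.util.Date;
import org.bson.Document;

public class PricesByDate {
    public static void main(String[] args) {
        MongoCollection<Document> collection = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test").getCollection("gasStation");

        // The _id is matched as a whole embedded document, exactly as in the shell query.
        Document id = new Document("coordinate",
                new Document("latitude", 532144).append("longitude", -33333))
                .append("margin", "N");

        collection.aggregate(Arrays.asList(
                match(eq("_id", id)),   // first stage, so it can be served by an index
                unwind("$prices"),
                match(and(
                        gte("prices.date", Date.from(Instant.parse("2014-05-02T23:00:00.000Z"))),
                        lte("prices.date", Date.from(Instant.parse("2014-05-03T22:59:59.999Z")))))
        )).forEach(doc -> System.out.println(doc.toJson()));
    }
}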

I'm going to quote the docs on this:
The $match and $sort pipeline operators can take advantage of an index
when they occur at the beginning of the pipeline.
source: http://docs.mongodb.org/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes
You don't have a $match or $sort at the beginning of the pipeline; you have the $unwind operation. Thus, indexes are useless here.
Edit - detailed explanation:
Still, it is possible to move part of the matching condition to the beginning of the pipeline so that an index will be used.
db.gasStation.aggregate([
{ "$match" : {
"_id" : {
"coordinate" : {
"latitude" : 532144 ,
"longitude" : -33333} ,
"margin" : "N"
}
}},
{ "$project": { "prices" : 1, "_id" : 0 } },
{ "$unwind" : "$prices"},
{ "$match" : {
"prices.date" : {
"$gte" : ISODate("2014-05-02T23:00:00.000Z") ,
"$lte" : ISODate("2014-05-03T22:59:59.999Z")
}
}}
],{explain:true});
However, in this case the following index is unnecessary:
{"_id":1, "prices.date":1}
Why? Because the $match at the beginning of the pipeline only filters by the _id. In MongoDB a document's _id is automatically indexed, and that's the index that will be used in this case.
Also, you can further optimize your query by removing unnecessary fields using the $project operator. If you don't need a field, remove it as soon as possible.
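For reference, here is a hedged sketch of that $project stage with the Java driver's Projections helpers, assuming the modern com.mongodb.client API (which the original post does not show):
import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Projections.excludeId;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;

import org.bson.conversions.Bson;

// Equivalent of { "$project": { "prices" : 1, "_id" : 0 } }:
// keep only the prices array and drop _id before the $unwind stage runs.
Bson projectPrices = project(fields(include("prices"), excludeId()));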

Related

Elastic search term query not working on a specific field

I'm new to Elasticsearch.
This is how the index looks:
{
"scresults-000001" : {
"aliases" : {
"scresults" : { }
},
"mappings" : {
"properties" : {
"callType" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"code" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"data" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"esdtValues" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"gasLimit" : {
"type" : "long"
},
... and more fields ...
I'm trying to create a search query in Java that looks like this:
{
"bool" : {
"filter" : [
{
"term" : {
"sender" : {
"value" : "sendervalue",
"boost" : 1.0
}
}
},
{
"term" : {
"data" : {
"value" : "YWRkTGlxdWlkaXR5UHJveHlAMDAwMDAwMDAwMDAwMDAwMDA1MDBlYmQzMDRjMmYzNGE2YjNmNmE1N2MxMzNhYjdiOGM2ZjgxZGM0MDE1NTQ4M0A3ZjE1YjEwODdmMjUwNzQ4QDBjMDU0YjcwNDhlMmY5NTE1ZWE3YWU=",
"boost" : 1.0
}
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
If I run this query I get 0 hits. If I replace the field "data" with another field, it works. I don't understand what's different.
Here is how I actually create the query in Java + Spring Boot:
QueryBuilder boolQuery = QueryBuilders.boolQuery()
.filter(QueryBuilders.termQuery("sender", "sendervalue"))
.filter(QueryBuilders.termQuery("data",
"YWRkTGlxdWlkaXR5UHJveHlAMDAwMDAwMDAwMDAwMDAwMDA1MDBlYmQzMDRjMmYzNGE2YjNmNmE1N2MxMzNhYjdiOGM2ZjgxZGM0MDE1NTQ4M0A3ZjE1YjEwODdmMjUwNzQ4QDBjMDU0YjcwNDhlMmY5NTE1ZWE3YWU="));
Query searchQuery = new NativeSearchQueryBuilder()
.withFilter(boolQuery)
.build();
SearchHits<ScResults> articles = elasticsearchTemplate.search(searchQuery, ScResults.class);
Since you're trying to do an exact match on a string with a term query, you need to do it on the data.keyword field, which is not analyzed. The data field itself is a text field, analyzed by the standard analyzer: not only are all letters lowercased, but the = sign at the end also gets stripped off, so there's no way this can match (unless you use a match query on the data field, but then you'd no longer be doing an exact match).
POST _analyze
{
"analyzer": "standard",
"text": "YWRkTGlxdWlkaXR5UHJveHlAMDAwMDAwMDAwMDAwMDAwMDA1MDBlYmQzMDRjMmYzNGE2YjNmNmE1N2MxMzNhYjdiOGM2ZjgxZGM0MDE1NTQ4M0A3ZjE1YjEwODdmMjUwNzQ4QDBjMDU0YjcwNDhlMmY5NTE1ZWE3YWU="
}
Results:
{
"tokens" : [
{
"token" : "ywrktglxdwlkaxr5uhjvehlamdawmdawmdawmdawmdawmda1mdblymqzmdrjmmyznge2yjnmnme1n2mxmznhyjdiogm2zjgxzgm0mde1ntq4m0a3zje1yjewoddmmjuwnzq4qdbjmdu0yjcwndhlmmy5nte1zwe3ywu",
"start_offset" : 0,
"end_offset" : 163,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
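Given the analysis above, the fix on the Java side should be to target the non-analyzed keyword sub-field. This is a sketch based on the asker's own snippet; the base64 value is abbreviated here and should be the full string from the question:
// Term queries need the non-analyzed .keyword sub-field for exact matches.
QueryBuilder boolQuery = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termQuery("sender", "sendervalue"))
        .filter(QueryBuilders.termQuery("data.keyword",
                "YWRkTGlxdWlkaXR5UHJveHlA...")); // full base64 value from the question

Query searchQuery = new NativeSearchQueryBuilder()
        .withFilter(boolQuery)
        .build();
SearchHits<ScResults> articles = elasticsearchTemplate.search(searchQuery, ScResults.class);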

What is meant by processedWithError in the report task manager?

I already ingested the file into Druid, and it reports the ingestion as a success. However, when I checked the ingestion report, all rows were processed with errors, yet the datasource is displayed in the "Datasource" tab.
I have tried reducing the input from 20M rows down to only 20 rows. Here is my configuration file:
"type" : "index",
"spec" : {
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "/home/data/Salutica",
"filter" : "outDashboard2RawV3.csv"
}
},
"dataSchema" : {
"dataSource": "DaTRUE2_Dashboard_V3",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "WEEK",
"queryGranularity" : "none",
"intervals" : ["2017-05-08/2019-05-17"],
"rollup" : false
},
"parser" : {
"type" : "string",
"parseSpec": {
"format" : "csv",
"timestampSpec" : {
"column" : "Date_Time",
"format" : "auto"
},
"columns" : [
"Main_ID","Parameter_ID","Date_Time","Serial_Number","Status","Station_ID",
"Station_Type","Parameter_Name","Failed_Date_Time","Failed_Measurement",
"Database_Name","Date_Time_Year","Date_Time_Month",
"Date_Time_Day","Date_Time_Hour","Date_Time_Weekday","Status_New"
],
"dimensionsSpec" : {
"dimensions" : [
"Date_Time","Serial_Number","Status","Station_ID",
"Station_Type","Parameter_Name","Failed_Date_Time",
"Failed_Measurement","Database_Name","Status_New",
{
"name" : "Main_ID",
"type" : "long"
},
{
"name" : "Parameter_ID",
"type" : "long"
},
{
"name" : "Date_Time_Year",
"type" : "long"
},
{
"name" : "Date_Time_Month",
"type" : "long"
},
{
"name" : "Date_Time_Day",
"type" : "long"
},
{
"name" : "Date_Time_Hour",
"type" : "long"
},
{
"name" : "Date_Time_Weekday",
"type" : "long"
}
]
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
}
]
},
"tuningConfig" : {
"type" : "index",
"partitionsSpec" : {
"type" : "hashed",
"targetPartitionSize" : 5000000
},
"jobProperties" : {}
}
}
}
Report:
{"ingestionStatsAndErrors":{"taskId":"index_DaTRUE2_Dashboard_V3_2019-09-10T01:16:47.113Z","payload":{"ingestionState":"COMPLETED","unparseableEvents":{},"rowStats":{"determinePartitions":{"processed":0,"processedWithError":0,"thrownAway":0,"unparseable":0},"buildSegments":{"processed":0,"processedWithError":20606701,"thrownAway":0,"unparseable":1}},"errorMsg":null},"type":"ingestionStatsAndErrors"}}
I'm expecting this:
{"processed":20606701,"processedWithError":0,"thrownAway":0,"unparseable":1}},"errorMsg":null},"type":"ingestionStatsAndErrors"}}
instead of this:
{"processed":0,"processedWithError":20606701,"thrownAway":0,"unparseable":1}},"errorMsg":null},"type":"ingestionStatsAndErrors"}}
Below is my input data from the CSV:
"Main_ID","Parameter_ID","Date_Time","Serial_Number","Status","Station_ID","Station_Type","Parameter_Name","Failed_Date_Time","Failed_Measurement","Database_Name","Date_Time_Year","Date_Time_Month","Date_Time_Day","Date_Time_Hour","Date_Time_Weekday","Status_New"
1,3,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","1.8V","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"
1,4,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","1.35V","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"
1,5,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","Isc_VChrg","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"
1,6,"2018-10-05 15:00:55","1840SDF00038","Passed","ST1","BLTBoard","Isc_VBAT","","","DaTRUE2Left",2018,10,5,15,"Friday","Passed"

Elasticsearch geo search strange behavior

A few days ago I ran into strange behavior of geo search in Elasticsearch.
I use AWS-managed ES 5.5, accessed over the REST interface.
Assume we have 200k objects with location info represented as a point only. I use geo search to find the points within multiple polygons. They are shown in the image below; the coordinates were extracted from the final request to ES.
The request is built using the official Java high-level REST client. The request query is attached below.
I want to search for all objects within at least one polygon.
Here is the query (real field names and values were replaced by stubs, except location and locationPoint.coordinates):
{
"size" : 20,
"query" : {
"constant_score" : {
"filter" : {
"bool" : {
"must" : [
{
"terms" : {
"field1" : [
"a",
"b",
"c",
"d",
"e",
"f"
],
"boost" : 1.0
}
},
{
"term" : {
"field2" : {
"value" : "q",
"boost" : 1.0
}
}
},
{
"range" : {
"field3" : {
"from" : "10",
"to" : null,
"include_lower" : true,
"include_upper" : true,
"boost" : 1.0
}
}
},
{
"range" : {
"field4" : {
"from" : "10",
"to" : null,
"include_lower" : true,
"include_upper" : true,
"boost" : 1.0
}
}
},
{
"geo_shape" : {
"location" : {
"shape" : {
"type" : "geometrycollection",
"geometries" : [
{
"type" : "multipolygon",
"orientation" : "right",
"coordinates" : [
[
// coords here
]
]
},
{
"type" : "polygon",
"orientation" : "right",
"coordinates" : [
[
// coords here
]
]
},
{
"type" : "polygon",
"orientation" : "right",
"coordinates" : [
[
// coords here
]
]
},
{
"type" : "polygon",
"orientation" : "right",
"coordinates" : [
[
// coords here
]
]
}
]
},
"relation" : "intersects"
},
"ignore_unmapped" : false,
"boost" : 1.0
}
}
]
}
},
"boost" : 1.0
}
},
"_source" : {
"includes" : [
"field1",
"field2",
"field3",
"field4",
"field8"
],
"excludes" : [ ]
},
"sort" : [
{
"field1" : {
"order" : "desc"
}
}
],
"aggregations" : {
"agg1" : {
"terms" : {
"field" : "field1",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg2" : {
"terms" : {
"field" : "field2",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg3" : {
"terms" : {
"field" : "field3",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg4" : {
"terms" : {
"field" : "field4",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg5" : {
"terms" : {
"field" : "field5",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg6" : {
"terms" : {
"field" : "field6",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg7" : {
"terms" : {
"field" : "field7",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"agg8" : {
"terms" : {
"field" : "field8",
"size" : 10000,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
},
"map_center" : {
"geo_centroid" : {
"field" : "locationPoint.coordinates"
}
},
"map_bound" : {
"geo_bounds" : {
"field" : "locationPoint.coordinates",
"wrap_longitude" : true
}
}
}
}
Note that the field location is mapped as geo_shape and the field locationPoint.coordinates is mapped as geo_point.
Now to the problem. Below are the results (hit counts) of several requests; only the polygons change between them.
# Polygons Hits count
1) 1,2,3,4 5565
2) 1 4897
3) 3,4 75
4) 2 9
5) 1,3,4 5543
6) 1,2 5466
7) 2,3,4 84
So if I add the results for polygon 1 to those for polygons 2, 3, 4, I do not obtain the number from the full request.
For example, #1 != #2 + #7, and #1 != #5 + #4, but #7 == #4 + #3.
I cannot tell whether this is an issue with the request, expected behavior, or even a bug in ES.
Can anyone help me to understand the logic of such ES behavior or point to the solution?
Thanks!
After a short conversation with an Elasticsearch team member, we ended up at AWS.
The build hashes of AWS ES and pure ES are not equal, so ES has been modified by the AWS team and we do not know the exact changes. Some of those changes might affect the search in the posted question.
I need to reproduce this behavior on a pure ES cluster before we can continue the conversation.

how to add values in a single field in mongodb

I have the following document in MongoDB:
{
"_id" : "1999",
"Email" : "mail#example.com",
"FirstName" : "personFirstNmae",
"LastName" : "personLastName",
"UserStatus" : "INACTIVE",
"FollowingItems" : [
{
"FollowingItemUuid" : "g12345",
"FollowingItemUuidType" : "GALLERY"
}
]
}
I want to achieve this:
{
"_id" : "1999",
"Email" : "mail#example.com",
"FirstName" : "personFirstNmae",
"LastName" : "personLastName",
"UserStatus" : "INACTIVE",
"FollowingItems" : [
{
"FollowingItemUuid" : "g12345",
"FollowingItemUuidType" : "GALLERY"
},
{
"FollowingItemUuid" : "M121",
"FollowingItemUuidType" : "MUSEUM"
}
]
}
Here is my code:
val q=QueryBuilder.start("_id").is("1999")
var update=collection.update(q.get,new BasicDBObject("$set",new BasicDBObject("FollowingItems.$.FollowingItemUuid","M121").append("FollowingItems.$.FollowingItemUuidType","MUSEUM")))
but it throws the following exception:
com.mongodb.WriteConcernException: { "serverUsed" : "Localhost:27017" , "ok" : 1 , "n" : 0 , "updatedExisting" : false , "err" : "cannot use the part (FollowingItems of FollowingItems.FollowingItemUuid) to traverse the element ({FollowingItems: [ { FollowingItemUuid: \"g12345\", FollowingItemUuidType: \"GALLERY\" } ]})" , "code" : 16837}
at com.mongodb.CommandResult.getWriteException(CommandResult.java:90)
at com.mongodb.CommandResult.getException(CommandResult.java:79)
at com.mongodb.DBCollectionImpl.translateBulkWriteException(DBCollectionImpl.java:316)
at com.mongodb.DBCollectionImpl.update(DBCollectionImpl.java:274)
at com.mongodb.casbah.MongoCollectionBase$class.update(MongoCollection.scala:882)
at com.mongodb.casbah.MongoCollection.update(MongoCollection.scala:1162)
Please guide me: how can I achieve my desired result, and what am I doing wrong?
You need to use the $push operator for this. This is the MongoDB shell command:
db.data.update({
"_id": "1999"
}, {
"$push": {
"FollowingItems": {
"FollowingItemUuid": "M121",
"FollowingItemUuidType": "MUSEUM"
}
}
})
And this is the equivalent QueryBuilder syntax (note that $push must target the FollowingItems array itself, not a positional $ path):
val q=QueryBuilder.start("_id").is("1999")
var update=collection.update(q.get,new BasicDBObject("$push",new BasicDBObject("FollowingItems",new BasicDBObject("FollowingItemUuid","M121").append("FollowingItemUuidType","MUSEUM"))))
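If you are on the modern MongoDB Java driver rather than Casbah, the same update can be expressed with the Updates.push helper. This is a sketch; the class name, connection string, and database name are assumptions:
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.push;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class PushFollowingItem {
    public static void main(String[] args) {
        MongoCollection<Document> collection = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test").getCollection("data");

        // Append a new embedded document to the FollowingItems array.
        Document newItem = new Document("FollowingItemUuid", "M121")
                .append("FollowingItemUuidType", "MUSEUM");

        collection.updateOne(eq("_id", "1999"), push("FollowingItems", newItem));
    }
}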

How to retrieve a document by its own sub document or array?

I have this document structure:
{
"_id" : "4e76fd1e927e1c9127d1d2e8",
"name" : "***",
"embedPhoneList" : [
{
"type" : "家庭",
"number" : "00000000000"
},
{
"type" : "手机",
"number" : "00000000000"
}
],
"embedAddrList" : [
{
"type" : "家庭",
"addr" : "山东省诸城市***"
},
{
"type" : "工作",
"addr" : "深圳市南山区***"
}
],
"embedEmailList" : [
{
"email" : "********#gmail.com"
},
{
"email" : "********#gmail.com"
},
{
"email" : "********#gmail.com"
},
{
"email" : "********#gmail.com"
}
]
}
What I want to do is find the document by its sub-document, such as an email in the embedEmailList field.
Or, if I have a structure like this:
{
"_id" : "4e76fd1e927e1c9127d1d2e8",
"name" : "***",
"embedEmailList" : [
"123#gmail.com" ,
"********#gmail.com" ,
]
}
the embedEmailList field is an array; how do I find whether it contains 123@gmail.com?
Thanks.
To search for a specific value in an array, MongoDB supports this syntax:
db.your_collection.find({embedEmailList : "foo@bar.com"});
See the MongoDB documentation on querying arrays for more information.
To search for a value inside an embedded object, it supports this syntax:
db.your_collection.find({"embedEmailList.email" : "foo@bar.com"});
See the MongoDB documentation on querying embedded documents for more information.
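For completeness, here is a sketch of the same two queries with the MongoDB Java driver; the class name, connection string, and database name are assumptions:
import static com.mongodb.client.model.Filters.eq;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class FindByEmail {
    public static void main(String[] args) {
        MongoCollection<Document> collection = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test").getCollection("your_collection");

        // Match a plain value inside the embedEmailList array:
        for (Document doc : collection.find(eq("embedEmailList", "123@gmail.com"))) {
            System.out.println(doc.toJson());
        }

        // Match a field of an embedded document inside the array (dot notation):
        for (Document doc : collection.find(eq("embedEmailList.email", "123@gmail.com"))) {
            System.out.println(doc.toJson());
        }
    }
}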
