Mongo and Java: Create indexes for aggregation framework

Mongo and Java: Create indexes for aggregation framework - java

Situation: I have collection with huge amount of documents after map reduce(aggregation). Documents in the collection looks like this:
/* 0 */
{
"_id" : {
"appId" : ObjectId("1"),
"timestamp" : ISODate("2014-04-12T00:00:00.000Z"),
"name" : "GameApp",
"user" : "test#mail.com",
"type" : "game"
},
"value" : {
"count" : 2
}
}
/* 1 */
{
"_id" : {
"appId" : ObjectId("2"),
"timestamp" : ISODate("2014-04-29T00:00:00.000Z"),
"name" : "ScannerApp",
"user" : "newUser#company.com",
"type" : "game"
},
"value" : {
"count" : 5
}
}
...
And I searching inside this collection with aggregation framework:
db.myCollection.aggregate([match, project, group, sort, skip, limit]); // aggregation can return result on Daily or Monthly time base depends of user search criteria, with pagination etc...
Possible search criteria:
1. {appId, timestamp, name, user, type}
2. {appId, timestamp}
3. {name, user}
I'm getting correct result, exactly what I need. But from optimisation point of view I have doubts about indexing.
Questions:
Is it possible to create indexes for such collection?
How I can create indexes for such object with complex _id field?
How I can do analog of db.collection.find().explain() to verify which index used?
And is good idea to index such collection or its my performance paranoia?
Answer summarisation:
MongoDB creates index by _id field automatically but that is useless in a case of complex _id field like in an example. For field like: _id: {name: "", timestamp: ""} you must use index like that: *.ensureIndex({"_id.name": 1, "_id.timestamp": 1}) only after that your collection will be indexed in proper way by _id field.
For tracking how your indexes works with Mongo Aggregation you can not use db.myCollection.aggregate().explain() and proper way of doing that is:
db.runCommand({
aggregate: "collection_name",
pipeline: [match, proj, group, sort, skip, limit],
explain: true
})
My testing on local computer sows that such indexing seems to be good idea. But this is require more testing with big collections.

First, indexes 1 and 3 are probably worth investigating. As for explain, you can pass explain as an option to your pipeline. You can find docs here and an example here

Related

optimize mongo query to get max date in a very short time

I'm using the query bellow to get max date (field named extractionDate) in a collection called KPI, and since I'm only interested in the field extractionDate:
#Override
public Mono<DBObject> getLastExtractionDate(MatchOperation matchOperation,ProjectionOperation projectionOperation) {
return Mono.from(mongoTemplate.aggregate(
newAggregation(
matchOperation,
projectionOperation,
group().max(EXTRACTION_DATE).as("result"),
project().andExclude("_id")
),
"kpi",
DBObject.class
));
}
And as you see above, I need to filter the result firstly using the match operation (matchOperation) after that, I'm doing a projection operation to extract only the max of field "extractionDate" and rename it as result.
But this query cost a lot of time (sometimes more than 20 seconds) because I have a huge amount of data, I already added an index on the field extractionDate but I did not gain a lot, so I'm looking for a way to mast it fast as max as possible.
update:
Number of documents we have in the collection kpi: 42.8m documents
The query that being executed:
Streaming aggregation: [{ "$match" : { "type" : { "$in" : ["INACTIVE_SITE", "DEVICE_NOT_BILLED", "NOT_REPLYING_POLLING", "MISSING_KEY_TECH_INFO", "MISSING_SITE", "ACTIVE_CIRCUITS_INACTIVE_RESOURCES", "INCONSISTENT_STATUS_VALUES"]}}}, { "$project" : { "extractionDate" : 1, "_id" : 0}}, { "$group" : { "_id" : null, "result" : { "$max" : "$extractionDate"}}}, { "$project" : { "_id" : 0}}] in collection kpi
explain plan:
Example of a document in the collection KPI:
And finally the indexes that already exist on this collection :

Index tuning will depend more on the properties in the $match expression. You should be able to run the query in mongosh with and get an explain plan to determine if your query is scanning the collection.
Other things to consider is the size of the collection versus the working set of the server.
Perhaps update your question with the $match expression, and the explain plan and perhaps the current set of index definitions and we can refine the indexing strategy.
Finally, "huge" is rather subjective? Are you querying millions or billions or documents, and what is the average document size?
Update:
Given that you're filtering on only one field, and aggregating on one field, you'll find the best result will be an index
{ "type":1,"extractionDate":1}
That index should cover your query -- because the $in will mean that a scan will be selected but a scan over a small index is significantly better than over the whole collection of documents.
NB. The existing index extractionDate_1_customer.irType_1 will not be any help for this query.

I was able to optimize the request thanks to previous answers using this approach:
#Override
public Mono<DBObject> getLastExtractionDate(MatchOperation matchOperation,ProjectionOperation projectionOperation) {
return Mono.from(mongoTemplate.aggregate(
newAggregation(
matchOperation,
sort(Sort.Direction.DESC,EXTRACTION_DATE),
limit(1),
projectionOperation
),
"kpi",
DBObject.class
));
}
Also I had to create a compound index on extractionDate and type (the field I had in matchOperation) like bellow:

mongodb java driver pullByFilter

I have document schema such as
{
"_id" : 18,
"name" : "Verdell Sowinski",
"scores" : [
{
"type" : "exam",
"score" : 62.12870233109035
},
{
"type" : "quiz",
"score" : 84.74586220889356
},
{
"type" : "homework",
"score" : 81.58947824932574
},
{
"type" : "homework",
"score" : 69.09840625499065
}
]
}
I have a solution using pull that copes with removing a single element at a time but saw
I want to get a general solution that would cope with irregular schema where there would be between one and many elements to the array and I would like to remove all elements based on a condition.
I'm using mongodb driver 3.2.2 and saw this pullByFilter which sounded good
Creates an update that removes from an array all elements that match the given filter.
I tried this
Bson filter = and(eq("type", "homework"), lt("score", highest));
Bson u = Updates.pullByFilter(filter);
UpdateResult ur = collection.updateOne(studentDoc, u);
Unsurprisingly, this did not have any effect since I wasn't specifying the array scores
I get an error
The positional operator did not find the match needed from the query. Unexpanded update: scores.$.type
when I change the filter to be
Bson filter = and(eq("scores.$.type", "homework"), lt("scores.$.score", highest));
Is there a one step solution to this problem?
There seems very little info on this particular method I can find. This question may relate to How to Update Multiple Array Elements in mongodb

After some more "thinking" (and a little trial and error), I found the correct Filters method to wrap my basic filter. I think I was focusing on array operators too much.
I'll not post it here in case of flaming.
Clue: think "matches..." (as in regex pattern matching) when dealing with Filters helper methods ;)

How to count distinct values of a reference collection in mongo

Having a list of books that points to a list of authors, I want to display a tree, having in each node the author name and the number of books he wrote. Initially, I have embedded the authors[] array directly into books collection, and this worked like a charm, using the magic of aggregation framework. However, later on, I realise that it would be nice to have some additional information attached to each author (e.g. it's picture, biographical data, birth date, etc). For the first solution, this is bad because:
it duplicates the data (not a big deal, and yes, I know that mongo's purpose is to encapsulate full objects, but let's ignore that for now);
whenever an additional property is created or updated on the old records won't benefit from this change, unless I specifically query for some unique old property and update all the book authors with the new/updated values.
The next thing was to use the second collection, called authors, and each books document is referencing a list of author ids, like this:
{
"_id" : ObjectId("58ed2a254374473fced950c1"),
"authors" : [
"58ed2a254d74s73fced950c1",
"58ed2a234374473fce3950c1"
],
"title" : "Book title"
....
}
For getting the author details, I have two options:
make an additional query to get the data from the author collection;
use DBRefs.
Questions:
Using DBRefs automatically loads the authors data into the book object, similar to what JPA #MannyToOne does for instance?
Is it possible to get the number of written books for each author, without having to query for each author's book count? When the authors were embedded, I was able to aggregate the distinct author name's and also the number of book documents that he was present on. Is such query possible between two collections?
What would be your recommendation for implementing this behaviour? (I am using Spring Data)

You can try the below query in the spring mongo application.
UnwindOperation unwindAuthorIds = Aggregation.unwind("authorsIds", true);
LookupOperation lookupAuthor = Aggregation.lookup("authors_collection", "authorsIds", "_id", "ref");
UnwindOperation unwindRefs = Aggregation.unwind("ref", true);
GroupOperation groupByAuthor = Aggregation.group("ref.authorName").count().as("count");
Aggregation aggregation = Aggregation.newAggregation(unwindAuthorIds, lookupAuthor, unwindRefs, groupByAuthor);
List<BasicDBObject> results = mongoOperations.aggregate(aggregation, "book_collection", BasicDBObject.class).getMappedResults();

Following #Veeram's suggestion, I was able to write this query:
db.book_collection.aggregate([
{
$unwind: "$authorsIds"
},
{
$lookup: {
from: "authors_collection",
localField: "authorsIds",
foreignField: "_id",
as: "ref"
}
},
{$group: {_id: "$ref.authorName", count: {$sum: 1}}}
])
which returns something like this:
{
"_id" : [
"Paulo Coelho"
],
"count" : 1
}
/* 2 */
{
"_id" : [
"Jules Verne"
],
"count" : 2
}
This is exactly what I needed, and it sounds about right. I only need to do an additional query now to get the books with no author set.

mongoDB: $inc of a nonexistent document in an array

I was not able to write a code, which would be able to increment a non-existent value in an array.
Let's consider a following structure in a mongo collection. (This is not the actual structure we use, but it maintains the issue)
{
"_id" : ObjectId("527400e43ca8e0f79c2ce52c"),
"content" : "Blotted Science",
"tags_with_ratings" : [
{
"ratings" : {
"0" : 6154,
"1" : 4974
},
"tag_name" : "math_core"
},
{
"ratings" : {
"0" : 154,
"1" : 474,
},
"tag_name" : "progressive_metal"
}
]
}
Example issue: We want to add to this document into the tags_with_ratings attribute an incrementation of a rating of a tag, which is not yet added in the array. For example we would want to increment a "0" value for a tag_name "dubstep".
So the expected behaviour would be, that mongo would upsert a document like this into the "tags_with_ratings" attribute:
{
"ratings" : {
"0" : 1
},
"tag_name" : "dubstep"
}
At the moment, we need to have one read operation, which checks if the nested document for the tag is there. If it's not, we pull the array tags_with_ratings out, create a new one, re-add the values from the previous one and add the new nested document in there. Shouldn't we be able to do this with one upsert operation, without having the expensive read happen?
The incrementation of the values takes up 90% of the process and more than half of it is consumed by reading, because we are unable to use $inc capability of creating an attribute, if it is non-existent in the array.

You cannot achieve what you want with one step using this schema.
You could do it however if you used tag_name as the key name instead of using ratings there, but then you may have a different issue when querying.
If the tag_name value was the field name (replacing ratings) you'd have {"dubstep":{"0":1}} instead of { "ratings" : {"0" : 1},"tag_name" : "dubstep"} which you can update dynamically the way you want to. Just keep in mind that this schema will make it more difficult to query - you have to know what the ratings are in advance to be able to query by keyname.

Ensuring index for nested repeating entities

I need to enforce unique constraint on a nested document, for example:
urlEntities: [
{ "url" : "http://t.co/ujBNNRWb0y" , "display_url" : "bit.ly/11JyiVp" , "expanded_url" :
"http://bit.ly/11JyiVp"} ,
{ "url" : "http://t.co/DeL6RiP8KR" , "display_url" : "ow.ly/i/2HC9x" ,
"expanded_url" : "http://ow.ly/i/2HC9x"}
]
url, display_url, and expaned_url need to be unique. How to issue ensureIndex command for this condition in MongoDB?
Also, is it a good design to have nested documents like this or should I move them to a separate collection and refer them from here inside urlEntities? I'm new to MongoDB, any best practices suggestion would be much helpful.
Full Scenario:
Say if I have a document as below in the db which has millions of data:
{ "_id" : { "$oid" : "51f72afa3893686e0c406e19"} , "user" : "test" , "urlEntities" : [ { "url" : "http://t.co/64HBcYmn9g" , "display_url" : "ow.ly/nqlkP" , "expanded_url" : "http://ow.ly/nqlkP"}] , "count" : 0}
When I get another document with similar urlEntities object, I need to update user and count fields only. First I thought of enforcing unique constraint on urlEntities fields and then handle exception and then go for an update, else if I check for each entry whether it exists before inserting, it will have significant impact on the performance. So, how can I enforce uniqueness in urlEntities? I tried
{"urlEntities.display_url":1,"urlEntities.expanded_url":1},{unique:true}
But still I'm able to insert the same document twice without exceptions.

Uniqueness is only enforced per document. You can not prevent the following (simplified from your example):
db.collection.ensureIndex( { 'urlEntities.url' : 1 } );
db.col.insert( {
_id: 42,
urlEntities: [
{
"url" : "http://t.co/ujBNNRWb0y"
},
{
"url" : "http://t.co/ujBNNRWb0y"
}
]
});
Similarily, you will have the same problem with a compound unique key for nested documents.
What you can do is the following:
db.collection.insert( {
_id: 43,
title: "This is an example",
} );
db.collection.update(
{ _id: 43 },
{
'$addToSet': {
urlEntities: {
"url" : "http://t.co/ujBNNRWb0y" ,
"display_url" : "bit.ly/11JyiVp" ,
"expanded_url" : "http://bit.ly/11JyiVp"
}
}
}
);
Now you have the document with _id 43 with one urlEntities document. If you run the same update query again, it will not add a new array element, because the full combination of url, display_url and expanded_url already exists.
Also, have a look at the $addToSet query operator's examples: http://docs.mongodb.org/manual/reference/operator/addToSet/

for indexes on nested documents read this.
regarding the second part (nested documents best practices) - it really depends on your business logic and queries. if those nested documents don't make sense as first class entities, meaning you won't be searching for them directly but only in the context of their parent document then having them nested make sense. otherwise you should consider extracting them out.
i think that there isn't absolute answer to your question. read the chapter about indexing... it helped me a lot.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.