I want to store key-value pairs in a database where the key is a list of integers or a set of integers.
My use case has the below steps:
I will get a list of integers.
I will need to check whether that list of integers (as a key) is already present in the DB.
If it is present, I will pick up the value from the DB.
If the list (or set) of integers is not already in the DB, there are certain computations I need to do; if it is there, I just want to return the value and avoid the computations.
I am thinking of keeping the data in a key-value store, but I want the key to be specifically a list or set of integers.
I have thought about the below options.
Option A
Generate a unique hash for the list of integers and store that as the key in a key/value store.
Problem:
I may get hash collisions, which would break my use case. I believe there is no way to generate a hash that is unique 100% of the time.
This will not work.
If there were a way to generate a hash that is unique 100% of the time, that would be the best approach.
Option B
Create an immutable class wrapping the list or set of integers and store that as the key in my key-value store.
Please share any feasible ways to achieve this.
You don’t need to do anything special:
Map<List<Integer>, String> keyValueStore = new HashMap<>();
List<Integer> key = Arrays.asList(1, 2, 3);
keyValueStore.put(key, "foo");
All JDK collections implement sensible equals() and hashCode() methods that are based solely on the contents of the collection.
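Since the whole point is to skip the computation when the key is already present, Map.computeIfAbsent expresses the check-then-compute step in a single call. A minimal sketch (expensiveComputation is a hypothetical stand-in for your own logic):

List<Integer> key = Arrays.asList(1, 2, 3);

// Runs expensiveComputation only when the key is absent;
// otherwise the stored value is returned untouched.
String value = keyValueStore.computeIfAbsent(key, k -> expensiveComputation(k));

One caveat: never mutate a list after it has been used as a key. Its hashCode() changes with its contents, so the entry can no longer be found; storing an unmodifiable copy (for example List.copyOf(key) on Java 10+) guards against that.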
Thank you. I would like to share some more findings.
Further to what I mentioned in my earlier post, I now tried the following.
I added the below documents in MongoDB:
db.products.insertMany([
  {
    mapping: [1, 2, 3],
    hashKey: 'ABC123',
    date: Date()
  },
  {
    mapping: [4, 5],
    hashKey: 'ABC45',
    date: Date()
  },
  {
    mapping: [6, 7, 8],
    hashKey: 'ABC678',
    date: Date()
  },
  {
    mapping: [9, 10, 11],
    hashKey: 'ABC91011',
    date: Date()
  },
  {
    mapping: [1, 9, 10],
    hashKey: 'ABC1910',
    date: Date()
  },
  {
    mapping: [1, 3, 4],
    hashKey: 'ABC134',
    date: Date()
  },
  {
    mapping: [4, 5, 6],
    hashKey: 'ABC456',
    date: Date()
  }
]);
When I now try to find a mapping, I get the expected results:
> db.products.find({ mapping: [4,5]}).pretty();
{
    "_id" : ObjectId("5d4640281be52eaf11b25dfc"),
    "mapping" : [
        4,
        5
    ],
    "hashKey" : "ABC45",
    "date" : "Sat Aug 03 2019 19:17:12 GMT-0700 (PDT)"
}
The above gives the right result, as the mapping [4,5] (insertion order retained) is present in the DB.
> db.products.find({ mapping: [5,4]}).pretty();
The above gives no result, as expected, since the mapping [5,4] is not present in the DB; the insertion order is retained.
So it seems the "mapping" as a List is working as expected.
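A note on set semantics: since [5,4] does not match [4,5], this lookup has list semantics. If set semantics are wanted instead (order-insensitive, as the original question allowed), one option is to sort a copy of the incoming list before both inserting and querying, so every permutation normalizes to the same key. A small sketch, where incoming stands for the list you receive:

// Normalize so that [5, 4] and [4, 5] produce the same stored key.
List<Integer> normalized = new ArrayList<>(incoming);
Collections.sort(normalized);
// Use "normalized" both when storing the mapping and when querying for it.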
I used Spring Data to read from a MongoDB instance running locally.
The format of the document is:
{
    "_id" : 1,
    "hashKey" : "ABC123",
    "mapping" : [
        1,
        2,
        3
    ],
    "_class" : "com.spring.mongodb.document.Mappings"
}
I inserted 1.7 million records into the DB using org.springframework.boot.CommandLineRunner.
Then a query similar to my last example:
db.mappings.find({ mapping: [1,2,3]})
takes 1.05 seconds on average to find the mapping among the 1.7M records.
Please share any suggestions for making it faster, and how fast I can expect it to run.
I have not measured create, update, and delete performance yet.
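One hedged suggestion: without an index on mapping, every lookup has to scan all 1.7M documents. A multikey index on the array field lets the server locate candidate documents through the index before comparing whole arrays, which usually turns such exact-match lookups into millisecond operations. Through Spring Data it could look like this (assuming the collection is named mappings, as in the query above):

// Assumption: collection "mappings", exact-match queries on the "mapping" array.
mongoTemplate.indexOps("mappings")
             .ensureIndex(new Index().on("mapping", Sort.Direction.ASC));

Running the find with explain afterwards should then show an index scan (IXSCAN) rather than a full collection scan (COLLSCAN).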
Related
I'm using the query below to get the max date (a field named extractionDate) in a collection called KPI, and I'm only interested in the field extractionDate:
@Override
public Mono<DBObject> getLastExtractionDate(MatchOperation matchOperation, ProjectionOperation projectionOperation) {
    return Mono.from(mongoTemplate.aggregate(
            newAggregation(
                    matchOperation,
                    projectionOperation,
                    group().max(EXTRACTION_DATE).as("result"),
                    project().andExclude("_id")
            ),
            "kpi",
            DBObject.class
    ));
}
As you see above, I first filter the result using the match operation (matchOperation); after that, a projection keeps only the field "extractionDate", and the group stage takes its max, renamed to "result".
But this query costs a lot of time (sometimes more than 20 seconds) because I have a huge amount of data. I already added an index on the field extractionDate but did not gain much, so I'm looking for a way to make it as fast as possible.
Update:
Number of documents in the collection kpi: 42.8M documents.
The query being executed:
Streaming aggregation: [{ "$match" : { "type" : { "$in" : ["INACTIVE_SITE", "DEVICE_NOT_BILLED", "NOT_REPLYING_POLLING", "MISSING_KEY_TECH_INFO", "MISSING_SITE", "ACTIVE_CIRCUITS_INACTIVE_RESOURCES", "INCONSISTENT_STATUS_VALUES"]}}}, { "$project" : { "extractionDate" : 1, "_id" : 0}}, { "$group" : { "_id" : null, "result" : { "$max" : "$extractionDate"}}}, { "$project" : { "_id" : 0}}] in collection kpi
The explain plan, an example of a document in the collection KPI, and the indexes that already exist on this collection were attached as screenshots (not reproduced here).
Index tuning will depend more on the properties in the $match expression. You should be able to run the query in mongosh with explain enabled and get a plan that shows whether your query is scanning the whole collection.
Another thing to consider is the size of the collection versus the working set of the server.
Perhaps update your question with the $match expression, the explain plan, and the current set of index definitions, and we can refine the indexing strategy.
Finally, "huge" is rather subjective. Are you querying millions or billions of documents, and what is the average document size?
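For reference, a sketch of fetching that explain plan programmatically through the same reactive template (pipelineStages is a hypothetical List<Document> holding the stages the aggregation above builds; this mirrors running the aggregate command with explain: true in the shell):

// A sketch, not the definitive API: run the aggregate command with explain: true
// so the server returns the plan instead of the results.
Document explainCommand = new Document("aggregate", "kpi")
        .append("pipeline", pipelineStages)
        .append("explain", true);
Mono<Document> plan = mongoTemplate.executeCommand(explainCommand);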
Update:
Given that you're filtering on only one field and aggregating on one field, you'll get the best result from an index:
{ "type":1,"extractionDate":1}
That index should cover your query: the $in means an index scan will be selected, but a scan over a small index is significantly better than one over the whole collection of documents.
NB. The existing index extractionDate_1_customer.irType_1 will not be any help for this query.
Thanks to the previous answers, I was able to optimize the request using this approach:
@Override
public Mono<DBObject> getLastExtractionDate(MatchOperation matchOperation, ProjectionOperation projectionOperation) {
    return Mono.from(mongoTemplate.aggregate(
            newAggregation(
                    matchOperation,
                    sort(Sort.Direction.DESC, EXTRACTION_DATE),
                    limit(1),
                    projectionOperation
            ),
            "kpi",
            DBObject.class
    ));
}
Also, I had to create a compound index on extractionDate and type (the field I had in matchOperation).
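The index definition itself was shown as a screenshot; a hedged reconstruction, based on that description and on the index suggested in the answer above:

// Hedged reconstruction of the compound index described above:
// the equality field (type) first, then the field being sorted (extractionDate).
mongoTemplate.indexOps("kpi").ensureIndex(
        new Index().on("type", Sort.Direction.ASC)
                   .on("extractionDate", Sort.Direction.ASC));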
Having a list of books that point to a list of authors, I want to display a tree with each node showing the author name and the number of books they wrote. Initially, I embedded the authors[] array directly into the books collection, and this worked like a charm, using the magic of the aggregation framework. However, later on, I realised that it would be nice to have some additional information attached to each author (e.g. their picture, biographical data, birth date, etc.). For the first solution, this is bad because:
it duplicates the data (not a big deal, and yes, I know that mongo's purpose is to encapsulate full objects, but let's ignore that for now);
whenever an additional property is created or updated, the old records won't benefit from the change unless I specifically query for some unique old property and update all the book authors with the new/updated values.
The next approach was to use a second collection, called authors, with each book document referencing a list of author ids, like this:
{
    "_id" : ObjectId("58ed2a254374473fced950c1"),
    "authors" : [
        "58ed2a254d74s73fced950c1",
        "58ed2a234374473fce3950c1"
    ],
    "title" : "Book title"
    ....
}
For getting the author details, I have two options:
make an additional query to get the data from the author collection;
use DBRefs.
Questions:
Does using DBRefs automatically load the author data into the book object, similar to what JPA's @ManyToOne does, for instance?
Is it possible to get the number of books written by each author without having to query for each author's book count? When the authors were embedded, I was able to aggregate the distinct author names along with the number of book documents each appeared in. Is such a query possible across two collections?
What would be your recommendation for implementing this behaviour? (I am using Spring Data)
You can try the below query in the Spring Mongo application:
// Unwind the author ids; "true" preserves books whose array is null or empty
UnwindOperation unwindAuthorIds = Aggregation.unwind("authorsIds", true);
// Join each author id against the authors collection into the "ref" field
LookupOperation lookupAuthor = Aggregation.lookup("authors_collection", "authorsIds", "_id", "ref");
// Unwind the joined author documents
UnwindOperation unwindRefs = Aggregation.unwind("ref", true);
// Group by author name and count the books per author
GroupOperation groupByAuthor = Aggregation.group("ref.authorName").count().as("count");

Aggregation aggregation = Aggregation.newAggregation(unwindAuthorIds, lookupAuthor, unwindRefs, groupByAuthor);
List<BasicDBObject> results = mongoOperations.aggregate(aggregation, "book_collection", BasicDBObject.class).getMappedResults();
Following @Veeram's suggestion, I was able to write this query:
db.book_collection.aggregate([
    {
        $unwind: "$authorsIds"
    },
    {
        $lookup: {
            from: "authors_collection",
            localField: "authorsIds",
            foreignField: "_id",
            as: "ref"
        }
    },
    { $group: { _id: "$ref.authorName", count: { $sum: 1 } } }
])
which returns something like this:
{
    "_id" : [
        "Paulo Coelho"
    ],
    "count" : 1
}
/* 2 */
{
    "_id" : [
        "Jules Verne"
    ],
    "count" : 2
}
This is exactly what I needed, and it sounds about right. I only need to do an additional query now to get the books with no author set.
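That extra query might be avoidable: a plain $unwind drops documents whose array is missing, but the Spring pipeline above already passes true as the second argument to Aggregation.unwind, which maps to $unwind's preserveNullAndEmptyArrays option. With that flag, books without authorsIds survive the unwind and come out of the group stage under a null key, so the "no author" count arrives in the same run:

// The boolean argument is preserveNullAndEmptyArrays: books with a missing or
// empty authorsIds array are kept and end up grouped under a null author name.
UnwindOperation unwindAuthorIds = Aggregation.unwind("authorsIds", true);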
Situation: I have a collection with a huge number of documents produced by map-reduce (aggregation). Documents in the collection look like this:
/* 0 */
{
    "_id" : {
        "appId" : ObjectId("1"),
        "timestamp" : ISODate("2014-04-12T00:00:00.000Z"),
        "name" : "GameApp",
        "user" : "test@mail.com",
        "type" : "game"
    },
    "value" : {
        "count" : 2
    }
}
/* 1 */
{
    "_id" : {
        "appId" : ObjectId("2"),
        "timestamp" : ISODate("2014-04-29T00:00:00.000Z"),
        "name" : "ScannerApp",
        "user" : "newUser@company.com",
        "type" : "game"
    },
    "value" : {
        "count" : 5
    }
}
...
And I search inside this collection with the aggregation framework:
db.myCollection.aggregate([match, project, group, sort, skip, limit]); // the aggregation can return results on a daily or monthly time base depending on the user's search criteria, with pagination etc.
Possible search criteria:
1. {appId, timestamp, name, user, type}
2. {appId, timestamp}
3. {name, user}
I'm getting the correct result, exactly what I need. But from an optimisation point of view, I have doubts about indexing.
Questions:
Is it possible to create indexes for such a collection?
How can I create indexes for such an object with a complex _id field?
How can I do the analog of db.collection.find().explain() to verify which index is used?
And is it a good idea to index such a collection, or is this just performance paranoia?
Answer summarisation:
MongoDB creates an index on the _id field automatically, but that is useless in the case of a complex _id field like the one in the example. For a field like _id: {name: "", timestamp: ""} you must use an index such as db.myCollection.ensureIndex({"_id.name": 1, "_id.timestamp": 1}); only after that will your collection be properly indexed on the sub-fields of _id.
For tracking how your indexes work with the aggregation framework, you cannot use db.myCollection.aggregate().explain(); the proper way of doing it is:
db.runCommand({
    aggregate: "collection_name",
    pipeline: [match, proj, group, sort, skip, limit],
    explain: true
})
My testing on a local computer shows that such indexing seems to be a good idea, but this requires more testing with big collections.
First, indexes 1 and 3 are probably worth investigating. As for explain, you can pass explain as an option to your pipeline. You can find docs here and an example here
I use MongoTemplate from Spring to access a MongoDB.
final Query query = new Query(Criteria.where("_id").exists(true));
query.with(new Sort(Direction.ASC, "FIRSTNAME", "LASTNAME", "EMAIL"));
if (count > 0) {
    query.limit(count);
}
query.skip(start);
query.fields().include("FIRSTNAME");
query.fields().include("LASTNAME");
query.fields().include("EMAIL");
return mongoTemplate.find(query, User.class, "users");
I generated 400,000 records in my MongoDB.
When asking for the first 25 users without using the above-written sort line, I get the result within less than 50 milliseconds.
With the sort, it takes over 4 seconds.
I then created indexes for FIRSTNAME, LASTNAME, and EMAIL (single indexes, not combined ones):
mongoTemplate.indexOps("users").ensureIndex(new Index("FIRSTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("LASTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("EMAIL", Order.ASCENDING));
After creating these indexes, the query still takes over 4 seconds.
What was my mistake?
-- edit
MongoDB writes this on the console...
Thu Jul 04 10:10:11.442 [conn50] query mydb.users query: { query: { _id: { $exists: true } }, orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } } ntoreturn:25 ntoskip:0 nscanned:382424 scanAndOrder:1 keyUpdates:0 numYields: 2 locks(micros) r:6903475 nreturned:25 reslen:3669 4097ms
You have to create a single compound index on FIRSTNAME, LASTNAME, and EMAIL, in this order, all of them ascending.
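Through MongoTemplate, that compound index could be created along these lines (a sketch: newer Spring Data versions chain .on(...) with Sort.Direction, while older ones use the Index(String, Order) constructor shown in the question):

// One compound index whose key order and directions match the intended sort:
// FIRSTNAME, LASTNAME, EMAIL, all ascending.
mongoTemplate.indexOps("users").ensureIndex(
        new Index().on("FIRSTNAME", Sort.Direction.ASC)
                   .on("LASTNAME", Sort.Direction.ASC)
                   .on("EMAIL", Sort.Direction.ASC));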
Thu Jul 04 10:10:11.442 [conn50] query mydb.users query:
{ query: { _id: { $exists: true } }, orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } }
ntoreturn:25 ntoskip:0 nscanned:382424 scanAndOrder:1 keyUpdates:0 numYields: 2
locks(micros) r:6903475 nreturned:25 reslen:3669 4097ms
Possible bad signs:
Your scanAndOrder is true (scanAndOrder:1); correct me if I am wrong.
It has to return 25 documents (ntoreturn:25), but it is scanning 382,424 documents (nscanned:382424).
For indexed queries, nscanned is the number of index keys in the range that Mongo scanned, and nscannedObjects is the number of documents it looked at to get to the final result. nscannedObjects includes at least all the documents returned, even if Mongo could tell just by looking at the index that the document was definitely a match. Thus, you can see that nscanned >= nscannedObjects >= n always.
Context of Question:
Case 1: When asking for the first 25 users without using the above-written sort line, I get the result within less than 50 milliseconds.
Case 2: With sort it lasts over 4 seconds.
query.with(new Sort(Direction.ASC, "FIRSTNAME", "LASTNAME", "EMAIL"));
As there is no usable index in this case, it is doing what is described here:
This means MongoDB had to batch up all the results in memory, sort them, and then return them. Infelicities abound. First, it costs RAM and CPU on the server. Also, instead of streaming my results in batches, Mongo just dumps them all onto the network at once, taxing the RAM on my app servers. And finally, Mongo enforces a 32MB limit on data it will sort in memory.
Case 3: created indexes for FIRSTNAME, LASTNAME, EMAIL. Single indexes, not combined ones
I guess it is still not fetching data from the index. You have to tune your indexes according to the sort order:
Sort Fields (ascending / descending only matters if there are multiple sort fields)
Add sort fields to the index in the same order and direction as your query's sort
For more details, check this
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
Possible Answer:
In the query, the orderby { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } sort order is different from the order you specified in:
mongoTemplate.indexOps("users").ensureIndex(new Index("FIRSTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("LASTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("EMAIL", Order.ASCENDING));
I guess Spring API might not be retaining order:
https://jira.springsource.org/browse/DATAMONGO-177
When I try to sort on multiple fields the order of the fields is not maintained. The Sort class is using a HashMap instead of a LinkedHashMap so the order they are returned is not guaranteed.
Could you mention your Spring jar version?
Hope this answers your question.
Correct me where you feel I might be wrong, as I am a little rusty.
I am trying to do a range search on some numbers in MongoDB.
I have the following two records. The important fields are the last two, value and type. I want to be able to get back records whose value is within some range.
{ "_id" : ObjectId("4eace8570364cc13b7cfa59b"), "sub_id" : -1630181078, "sub_uri" : "http://datetest.com/datetest", "pro_id" : -1630181078, "pro_uri" : "http://datetest.com/datetest", "type" : "number", "value" : "8969.0" }
{ "_id" : ObjectId("4eacef7303642d3d1adcbdaf"), "sub_id" : -1630181078, "sub_uri" : "http://datetest.com/datetest", "pro_id" : -1630181078, "pro_uri" : "http://datetest.com/datetest", "type" : "number", "value" : "3423.0" }
When I do this query:
> db.triples.find({"value":{$gte: 908}});
I expect to get the two records above, but neither of them is displayed.
So I do this query:
> db.triples.find({"value":{$gte: "908"}});
And again neither of the two expected records is returned (although in this case some other record containing a date is returned).
I can see that there are quotation marks around the numbers. Does this mean they are being stored as strings, and therefore the numeric search doesn't work? I very explicitly saved them as a Double, hence the ".0" that's appearing.
Or could there be some other reason that the find(... query ...) command isn't working as expected?
EDIT:
Insertion code looks like this (exception handling removed):
t.getObject().getValue() returns a String, and this is working fine. I then use it to instantiate a Double, which is what I was hoping would get saved to MongoDB and allow numeric range searches.
// Get the collection and store the value as a Double so numeric range queries work
DBCollection triples = db.getCollection("triples");
triple.put("value", new Double(t.getObject().getValue()));
// Upsert the triple (query and update are the same document)
triples.update(triple, triple, true, false);
You're right: they are saved as strings, and $gte will presumably use lexicographical order. Which MongoDB driver are you using? How exactly do you insert those records?
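If some documents were inserted before the switch to Double, they would still hold string values, and such mixed types would explain both the quoted numbers and the failed range query. A hedged, one-off migration sketch using the same legacy driver API as the question (BSON $type 2 is the string type):

// One-off fix-up: find documents whose "value" is a BSON string
// and rewrite it as a double.
DBCollection triples = db.getCollection("triples");
DBCursor cursor = triples.find(new BasicDBObject("value", new BasicDBObject("$type", 2)));
while (cursor.hasNext()) {
    DBObject doc = cursor.next();
    double numeric = Double.parseDouble((String) doc.get("value"));
    triples.update(new BasicDBObject("_id", doc.get("_id")),
                   new BasicDBObject("$set", new BasicDBObject("value", numeric)));
}

After that, db.triples.find({"value": {$gte: 908}}) should match numerically.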