I use MongoTemplate from Spring to access a MongoDB.
// Match every document (every document has an _id), sort, and page the results
final Query query = new Query(Criteria.where("_id").exists(true));
query.with(new Sort(Direction.ASC, "FIRSTNAME", "LASTNAME", "EMAIL"));
if (count > 0) {
    query.limit(count);
}
query.skip(start);
// Return only the fields we actually need
query.fields().include("FIRSTNAME");
query.fields().include("LASTNAME");
query.fields().include("EMAIL");
return mongoTemplate.find(query, User.class, "users");
I generated 400,000 records in my MongoDB.
When asking for the first 25 users without the sort line above, I get the result in less than 50 milliseconds.
With the sort it takes over 4 seconds.
I then created indexes for FIRSTNAME, LASTNAME, and EMAIL (single indexes, not a combined one):
mongoTemplate.indexOps("users").ensureIndex(new Index("FIRSTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("LASTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("EMAIL", Order.ASCENDING));
After creating these indexes, the query still takes over 4 seconds.
What was my mistake?
-- edit
MongoDB writes this to the console:
Thu Jul 04 10:10:11.442 [conn50] query mydb.users query: { query: { _id: { $exists: true } }, orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } } ntoreturn:25 ntoskip:0 nscanned:382424 scanAndOrder:1 keyUpdates:0 numYields: 2 locks(micros) r:6903475 nreturned:25 reslen:3669 4097ms
You have to create a compound index for FIRSTNAME, LASTNAME, and EMAIL, in this order and all of them using ascending order.
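In Spring Data MongoDB that would be one index built with the fields chained in that order; a minimal sketch (illustrative only, since the exact Index/Direction API differs between Spring Data MongoDB versions):
// Sketch: a single compound index whose key order and direction match the sort.
// Chaining on() keeps the field order, unlike creating three separate single-field indexes.
mongoTemplate.indexOps("users").ensureIndex(
        new Index()
                .on("FIRSTNAME", Sort.Direction.ASC)
                .on("LASTNAME", Sort.Direction.ASC)
                .on("EMAIL", Sort.Direction.ASC));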
Thu Jul 04 10:10:11.442 [conn50] query mydb.users query:
{ query: { _id: { $exists: true } }, orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } }
ntoreturn:25 ntoskip:0 nscanned:382424 scanAndOrder:1 keyUpdates:0 numYields: 2
locks(micros) r:6903475 nreturned:25 reslen:3669 4097ms
Possible bad signs:
scanAndOrder is true (scanAndOrder:1), correct me if I am wrong.
It only needs to return 25 documents (ntoreturn:25), but it is scanning 382,424 documents (nscanned:382424).
For indexed queries, nscanned is the number of index keys in the range that Mongo scanned, and nscannedObjects is the number of documents it looked at to get to the final result. nscannedObjects includes at least all the documents returned, even if Mongo could tell just by looking at the index that the document was definitely a match. Thus, you can see that nscanned >= nscannedObjects >= n always.
Context of Question:
Case 1: When asking for the first 25 users without using the sort line above, I get the result in less than 50 milliseconds.
Case 2: With sort it lasts over 4 seconds.
query.with(new Sort(Direction.ASC, "FIRSTNAME", "LASTNAME", "EMAIL"));
Since in this case there is no index to support the sort, MongoDB does what is described here:
This means MongoDB had to batch up all the results in memory, sort them, and then return them. Infelicities abound. First, it costs RAM and CPU on the server. Also, instead of streaming my results in batches, Mongo just dumps them all onto the network at once, taxing the RAM on my app servers. And finally, Mongo enforces a 32MB limit on data it will sort in memory.
Case 3: created indexes for FIRSTNAME, LASTNAME, EMAIL. Single indexes, not combined ones
I guess it is still not fetching data from the index. You have to tune your indexes according to the sort order:
Sort fields (ascending / descending only matters if there are multiple sort fields)
Add the sort fields to the index in the same order and direction as your query's sort
For more details, check this
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
Possible Answer:
In the query, the sort order orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } is different from the order you specified in:
mongoTemplate.indexOps("users").ensureIndex(new Index("FIRSTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("LASTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("EMAIL", Order.ASCENDING));
I guess the Spring API might not be retaining the order:
https://jira.springsource.org/browse/DATAMONGO-177
When I try to sort on multiple fields the order of the fields is not maintained. The Sort class is using a HashMap instead of a LinkedHashMap so the order they are returned is not guaranteed.
Could you mention your Spring jar version?
Hope this answers your question.
Correct me where you feel I might be wrong, as I am a little rusty.
I'm using the query below to get the max date (a field named extractionDate) in a collection called KPI, and I'm only interested in the field extractionDate:
@Override
public Mono<DBObject> getLastExtractionDate(MatchOperation matchOperation, ProjectionOperation projectionOperation) {
    return Mono.from(mongoTemplate.aggregate(
            newAggregation(
                    matchOperation,        // filter the documents first
                    projectionOperation,   // keep only extractionDate
                    group().max(EXTRACTION_DATE).as("result"),
                    project().andExclude("_id")
            ),
            "kpi",
            DBObject.class
    ));
}
As you see above, I first filter the results using the match operation (matchOperation); after that I project only the field "extractionDate", take its max, and rename it to result.
But this query takes a lot of time (sometimes more than 20 seconds) because I have a huge amount of data. I already added an index on the field extractionDate but did not gain much, so I'm looking for a way to make it as fast as possible.
update:
Number of documents we have in the collection kpi: 42.8m documents
The query being executed:
Streaming aggregation: [{ "$match" : { "type" : { "$in" : ["INACTIVE_SITE", "DEVICE_NOT_BILLED", "NOT_REPLYING_POLLING", "MISSING_KEY_TECH_INFO", "MISSING_SITE", "ACTIVE_CIRCUITS_INACTIVE_RESOURCES", "INCONSISTENT_STATUS_VALUES"]}}}, { "$project" : { "extractionDate" : 1, "_id" : 0}}, { "$group" : { "_id" : null, "result" : { "$max" : "$extractionDate"}}}, { "$project" : { "_id" : 0}}] in collection kpi
explain plan:
Example of a document in the collection KPI:
And finally, the indexes that already exist on this collection:
Index tuning will depend mostly on the properties in the $match expression. You should be able to run the query in mongosh and get an explain plan to determine whether your query is scanning the whole collection.
Another thing to consider is the size of the collection versus the working set of the server.
Perhaps update your question with the $match expression, the explain plan, and the current set of index definitions, and we can refine the indexing strategy.
Finally, "huge" is rather subjective? Are you querying millions or billions or documents, and what is the average document size?
Update:
Given that you're filtering on only one field, and aggregating on one field, you'll find the best result will be an index
{ "type":1,"extractionDate":1}
That index should cover your query: the $in means an index scan will be selected, but a scan over a small index is significantly better than a scan over the whole collection of documents.
NB. The existing index extractionDate_1_customer.irType_1 will not be any help for this query.
Thanks to the previous answers, I was able to optimize the request using this approach:
@Override
public Mono<DBObject> getLastExtractionDate(MatchOperation matchOperation, ProjectionOperation projectionOperation) {
    return Mono.from(mongoTemplate.aggregate(
            newAggregation(
                    matchOperation,
                    // sort descending and keep only the newest document instead of grouping
                    sort(Sort.Direction.DESC, EXTRACTION_DATE),
                    limit(1),
                    projectionOperation
            ),
            "kpi",
            DBObject.class
    ));
}
I also had to create a compound index on extractionDate and type (the field I had in matchOperation), like below:
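A rough sketch of that index (illustrative only; the field order follows the suggested { type: 1, extractionDate: 1 }, and with a reactive template the returned Mono must be subscribed for the index to be built):
// Sketch: compound index covering the $match field plus the date being maximised.
mongoTemplate.indexOps("kpi").ensureIndex(
        new Index()
                .on("type", Sort.Direction.ASC)
                .on("extractionDate", Sort.Direction.ASC));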
I'm trying to run a select statement using JdbcTemplate.
select statement:
SELECT currency, SUM(amount) AS total
FROM table_name
WHERE user_id IN (:userIdList)
GROUP BY currency
DB Table has three columns:
user_id
currency
amount
Example table data:
user_id   currency   amount
1         EUR        9000
2         EUR        1000
3         USD        124
When I run this code:
namedParamJDBCTemplate.query(query,
        new MapSqlParameterSource("userIdList", userIdList),  // name must match the :userIdList placeholder
        new ResultSetExtractor<Map>() {
            @Override
            public Map extractData(ResultSet resultSet) throws SQLException, DataAccessException {
                HashMap<String, Object> mapRet = new HashMap<String, Object>();
                while (resultSet.next()) {
                    mapRet.put(resultSet.getString("currency"), resultSet.getString("total"));
                }
                return mapRet;
            }
        });
I'm getting the result set as a map, but the result of the amount looks like this :
EUR -> 10000.0E0
USD -> 124.0E0
When I run the same query directly in the DB (not via code), the result is fine, without the '.0E0'.
How can I get only EUR -> 10000 and USD -> 124, without the '.0E0'?
The E0 is the exponent of the number, I think. So 124.0E0 stands for 124.0 multiplied by ten raised to the power of 0 (written 124 x 10^0). Anything raised to the power of 0 is 1, so you've got 124 x 1, which, of course, is the right value.
(If it was, e. g., 124.5E3, this would mean 124500.)
This notation is used more commonly to work with large numbers, because 5436.7E20 is much more readable than 543670000000000000000000.
Without knowing your database background, I can only suppose that this notation arises from the conversion of the numeric field to a string (in resultSet.getString("total")). Therefore, you should ask yourself whether you really need the result as a string (or could just use .getFloat or similar, also changing your HashMap value type). If you do need a string, you still have some possibilities:
Convert the value to a string later → e. g. String.valueOf(resultSet.getFloat("total"))
Truncate the .0E0 → e. g. resultSet.getString("total").replace(".0E0", "") (Attention, of course this won't work if, for some reason, you get another suffix like .5E3; it will also cut off any positions after the decimal point)
Perhaps find a database, JDBC or driver setting that suppresses the E-Notation.
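To illustrate the first option, here is a small sketch of how the put line inside extractData could look (assuming the total column maps to BigDecimal, which is typical for SUM() results):
// Read the aggregate as a number rather than a string, then render it without E-notation.
java.math.BigDecimal total = resultSet.getBigDecimal("total");
mapRet.put(resultSet.getString("currency"),
        total.stripTrailingZeros().toPlainString());   // e.g. "10000" instead of "10000.0E0"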
I want to store key-value pairs in a database where the key is a list of Integers or a set of Integers.
My use case has the following steps:
I will get a list of integers
I will need to check if that list of integers (as a key) is already present in the DB
If this is present, I will need to pick up the value from the DB
There are certain computations that I need to do if the list of integers (or set of integers) is not already in the DB; if it is there, I just want to take the stored value and avoid the computations.
I am thinking of keeping the data in a key value store but I want the key to be specifically a list or set of integers.
I have thought about below options
Option A
Generate a unique hash for the list of integers and store that as key in key/value store
Problem:
I could get hash collisions, which would break my use case. I believe there is no way to generate a hash that is unique 100% of the time.
This will not work.
If there were a way to generate a hash that is unique 100% of the time, then that would be the best approach.
Option B
Create an immutable class with List of integers or Set of integers and store that as a key for my key value store.
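For what it's worth, a minimal sketch of such an immutable key class (the class name and details are illustrative only, not something I have built yet):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable key wrapping the integer list; equals/hashCode delegate to the copied list.
final class IntListKey {
    private final List<Integer> values;

    IntListKey(List<Integer> values) {
        this.values = Collections.unmodifiableList(new ArrayList<>(values)); // defensive copy
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof IntListKey && values.equals(((IntListKey) o).values);
    }

    @Override
    public int hashCode() {
        return values.hashCode();
    }
}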
Please share any feasible ways to achieve this.
You don’t need to do anything special:
Map<List<Integer>, String> keyValueStore = new HashMap<>();
List<Integer> key = Arrays.asList(1, 2, 3);
keyValueStore.put(key, "foo");
All JDK collections implement sensible equals() and hashCode() that is based solely on the contents of the list.
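A quick check of that behaviour: a different List instance with the same contents finds the same entry, and element order matters.
// Same contents, different instance: List.equals()/hashCode() are defined by the elements in order.
System.out.println(keyValueStore.get(Arrays.asList(1, 2, 3)));   // prints "foo"
System.out.println(keyValueStore.get(Arrays.asList(3, 2, 1)));   // prints "null" (different order, different key)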
Thank you. I would like to share some more findings.
Following on from my earlier post, I tried the steps below.
I added the documents below to MongoDB:
db.products.insertMany([
{
mapping: [1, 2,3],
hashKey:'ABC123',
date: Date()
},
{
mapping: [4, 5],
hashKey:'ABC45' ,
date: Date()
},
{
mapping: [6, 7,8],
hashKey:'ABC678' ,
date: Date()
},
{
mapping: [9, 10,11],
hashKey:'ABC91011',
date: Date()
},
{
mapping: [1, 9,10],
hashKey:'ABC1910',
date: Date()
},
{
mapping: [1, 3,4],
hashKey:'ABC134',
date: Date()
},
{
mapping: [4, 5,6],
hashKey:'ABC456',
date: Date()
}
]);
When I now try to find a mapping, I get the expected results:
> db.products.find({ mapping: [4,5]}).pretty();
{
"_id" : ObjectId("5d4640281be52eaf11b25dfc"),
"mapping" : [
4,
5
],
"hashKey" : "ABC45",
"date" : "Sat Aug 03 2019 19:17:12 GMT-0700 (PDT)"
}
The above gives the right result, as the mapping [4,5] (insertion order retained) is present in the DB.
> db.products.find({ mapping: [5,4]}).pretty();
The above gives no result, as expected, since the mapping [5,4] is not present in the DB; the insertion order is retained.
So it seems that "mapping" as a List is working as expected.
I used Spring Data to read from a MongoDB instance running locally.
The format of the document is
{
"_id" : 1,
"hashKey" : "ABC123",
"mapping" : [
1,
2,
3
],
"_class" : "com.spring.mongodb.document.Mappings"
}
I inserted 1.7 million records into the DB using org.springframework.boot.CommandLineRunner.
Then a query similar to my last example:
db.mappings.find({ mapping: [1,2,3]})
takes on average 1.05 seconds to find the mapping among the 1.7M records.
Please share any suggestions for making it faster, and let me know how fast I can expect it to run.
I am not yet sure about create, update, and delete performance.
I am trying to pre-split an HBase table. One of the HBaseAdmin Java APIs creates an HBase table as a function of a start key, an end key, and the number of regions. Here's the HBaseAdmin method I use: void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
Is there any recommendation on choosing the start key and end key based on the dataset?
My approach: let's say we have 100 records in the dataset. I want the data divided into approximately 10 regions so each will have roughly 10 records. To find the start key I would run scan '/mytable', {LIMIT => 10} and pick the last row key as my start key, and then run scan '/mytable', {LIMIT => 90} and pick the last row key as my end key.
Does this approach to finding the start key and end key look OK, or is there a better practice?
EDIT
I tried the following approaches to pre-split an empty table. All three didn't work the way I used them. I think I will need to salt the key to get an even distribution.
PS: I am only displaying some of the region info.
1)
byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);
This gives regions with boundaries like:
{
"startkey":"-INFINITY",
"endkey":"11111111",
"numberofrows":3628951,
},
{
"startkey":"11111111",
"endkey":"22222222",
},
{
"startkey":"22222222",
"endkey":"33333333",
},
{
"startkey":"33333333",
"endkey":"44444444",
},
{
"startkey":"88888888",
"endkey":"99999999",
},
{
"startkey":"99999999",
"endkey":"aaaaaaaa",
},
{
"startkey":"aaaaaaaa",
"endkey":"bbbbbbbb",
},
{
"startkey":"eeeeeeee",
"endkey":"INFINITY",
}
This is useless, as my row keys are of a composite form like 'deptId|month|roleId|regionId' and don't fit into the above boundaries.
2)
byte[][] splits = new RegionSplitter.UniformSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);
This has the same issue:
{
"startkey":"-INFINITY",
"endkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\x99",
}
{
"startkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\
"endkey":"33333332",
}
{
"startkey":"33333332",
"endkey":"L\\xCC\\xCC\\xCC\\xCC\\xCC\\xCC\\xCB",
}
{
"startkey":"\\xE6ffffffa",
"endkey":"INFINITY",
}
3) I tried supplying a start key and end key and got the following useless regions.
hBaseAdmin.createTable(tabledescriptor, Bytes.toBytes("04120|200808|805|1999"),
Bytes.toBytes("01253|201501|805|1999"), 10);
{
"startkey":"-INFINITY",
"endkey":"04120|200808|805|1999",
}
{
"startkey":"04120|200808|805|1999",
"endkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
}
{
"startkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
"endkey":"000ptq<200wp6\\xBC805|1999",
}
{
"startkey":"001\\x11\\x15\\x13\\x1C201\\x15\\x902\\x5C805|1999",
"endkey":"01253|201501|805|1999",
}
{
"startkey":"01253|201501|805|1999",
"endkey":"INFINITY",
}
First question: In my experience with HBase, I am not aware of any hard rule for choosing the number of regions or the start and end keys.
But the underlying principle is:
With your row-key design, data should be distributed across the regions and not hotspotted
(see 36.1. Hotspotting).
However, if you define a fixed number of regions, say 10 as you mentioned, there may not be 10 after a heavy data load: once a region reaches a certain size limit, it will split again.
For the way you are creating the table, the HBaseAdmin documentation says: "Creates a new table with the specified number of regions. The start key specified will become the end key of the first region of the table, and the end key specified will become the start key of the last region of the table (the first region has a null start key and the last region has a null end key)."
Moreover, I prefer creating the table through a script with pre-splits, say 0-10, and designing the row key so that it is salted and each write lands on one of the region servers, to avoid hotspotting,
like the sketch below.
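This is a hypothetical example only: the one-character salt prefix and the 10 buckets are assumptions, and row keys would then be written as salt + "|" + originalKey.
// Pre-split into 10 regions on salt prefixes "1".."9"; the first region implicitly
// covers everything below "1" and the last everything from "9" upward.
byte[][] saltSplits = new byte[9][];
for (int i = 1; i <= 9; i++) {
    saltSplits[i - 1] = Bytes.toBytes(String.valueOf(i));
}
hBaseAdmin.createTable(tabledescriptor, saltSplits);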
EDIT: If you want to implement your own region split,
you can provide your own implementation of org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm and override
public byte[][] split(int numberOfSplits)
Second question:
My understanding:
You want to find the start row key and end row key of the data inserted into a specific table. Below are some ways to do that.
If you want to find the start and end row keys, scan the '.meta' table to see each region's start and end row key.
You can also open the UI at http://hbasemaster:60010 to see how the row keys are spread across the regions; the start and end row keys for each region are shown there.
To see how your keys are organized after pre-splitting the table and inserting into HBase, use FirstKeyOnlyFilter,
for example: scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'
which displays all of your 100 row keys.
If you have a lot of data (not 100 rows as you mentioned) and want to dump all the row keys, you can run the following from outside the HBase shell:
echo "scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'" | hbase shell > rowkeys.txt
I have a database with 300,000 rows, and I need to filter some rows with an algorithm.
protected boolean validateMatch(DbMatch m) throws MatchException, NotSupportedSportException{
// expensive part
List<DbMatch> hh = sd.getMatches(DateService.beforeDay(m.getStart()), m.getHt(), m.getCountry(),m.getSportID());
List<DbMatch> ah = sd.getMatches(DateService.beforeDay(m.getStart()), m.getAt(), m.getCountry(),m.getSportID());
....
My Hibernate DAO function that loads data from MySQL is called twice for every match in the initial list.
public List<DbMatch> getMatches(Date before,String team, String country,int sportID) throws NotSupportedSportException{
//Match_soccer where date between :start and :end
Criteria criteria = session.createCriteria(DbMatch.class);
criteria.add(Restrictions.le("start",before));
criteria.add(Restrictions.disjunction()
.add(Restrictions.eq("ht", team))
.add(Restrictions.eq("at", team)));
criteria.add(Restrictions.eq("country",country));
criteria.add(Restrictions.eq("sportID",sportID));
criteria.addOrder(Order.desc("start") );
return criteria.list();
}
Here is an example of how I try to filter the data:
List<DbMatch> filter(List<DbMatch> mSet) throws MatchException, NotSupportedSportException {
    List<DbMatch> filtered = new ArrayList<>();
    for (DbMatch m : mSet) {
        if (validateMatch(m)) {
            filtered.add(m);
        }
    }
    return filtered;
}
(1) I tried different criteria settings and timed the function with a stopwatch. When I call filter(matches) with 1,000 matches, my program takes 3 min 21 s 659 ms.
(2) When I remove criteria.addOrder(Order.desc("start")); the program filters them in 3 min 12 s 811 ms.
(3) But if I remove criteria.addOrder(Order.desc("start")); and add criteria.setMaxResults(1); the result is 22 s 311 ms.
Using the last configuration I can filter all 300,000 records in about 22.3 s x 300 = ~6,700 s (just under 2 hours), but with the first configuration I would have to wait about 200 s x 300 = ~60,000 s (roughly 17 hours).
If I want to use the criteria without the ORDER BY and with the limit, I must be sure that my table is sorted by date in the database, because it is important to get the latest match.
All the data is stored in the matches table.
Table indexes:
Table, Non_unique, Key_name, Seq_in_index, Column_name, Collation, Cardinality, Sub_part, Packed, Null, Index_type, Comment, Index_comment
matches, 0, PRIMARY, 1, mid, A, 220712, , , , BTREE, ,
matches, 0, UK_kcenwf4m58fssuccpknl1v25v, 1, beid, A, 220712, , , YES, BTREE, ,
UPDATED
After adding ALTER TABLE matches ADD INDEX (sportID, country); the program time decreased to 15 s per 1,000 matches. But if I don't use ORDER BY and add the limit, I only need to wait 4 s per 1,000 matches.
What should I do in this situation to improve the program's execution speed?
Your first order of business is to figure out how long each component takes to process the request.
Find the SQL query generated by the ORM, run it manually in MySQL Workbench, and see how long it takes (non-cached). You can also ask it to EXPLAIN the index usage.
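If you are not sure what SQL Hibernate emits, one option is to turn on SQL logging and copy the statement into Workbench; a sketch (the property names are standard Hibernate settings, the rest of the bootstrap is assumed):
// Make Hibernate print the SQL it generates so it can be profiled with EXPLAIN in Workbench.
org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
cfg.setProperty("hibernate.show_sql", "true");    // print each generated statement
cfg.setProperty("hibernate.format_sql", "true");  // pretty-print it for readability
// ...build the SessionFactory from cfg as usual, run the filter once, and copy the logged SQL.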
If the query is fast enough, then it's your Java code that's taking the time, and you need to optimize your algorithm. You can use JConsole to dig further into that.
Once you identify which component is taking longer, post your analysis here and we can make suggestions accordingly.