Querying DynamoDB - java

I've got a DynamoDB table with an alphanumeric string as a hash key (e.g. "d4ed6962-3ec2-4312-a480-96ecbb48c9da"). I need to query the table based on another field in the table, i.e. I need my query to select all the keys such that my field x is between date x and date y.
I know I need a condition on the hash key and another on a range key, but I am struggling to compose a hash key condition that does not bind my query to specific IDs.
I thought I could get away with a redundant condition based on the ID being NOT_NULL, but when I use it I get the error:
Query key condition not supported
Below are the conditions I am using; any idea how to achieve this?
Condition hashKeyCondition = new Condition()
        .withComparisonOperator(ComparisonOperator.NOT_NULL.toString());

Condition rangeCondition = new Condition()
        .withComparisonOperator(ComparisonOperator.BETWEEN.toString())
        .withAttributeValueList(
                new AttributeValue().withS(dateFormatter.print(lastScanTime())),
                new AttributeValue().withS(dateFormatter.print(currentScanTime)));

Map<String, Condition> keyConditions = new HashMap<String, Condition>();
keyConditions.put("userId", hashKeyCondition);
keyConditions.put("lastAccesTime", rangeCondition);
Thanks in advance to everyone helping.

In DynamoDB you can get items with three APIs:
- Scan (flexible but expensive),
- Query (less flexible: you have to specify a hash key, but less expensive),
- GetItem (by hash key and, if your table has one, range key).
The only way to achieve what you want is to either:
- Use Scan, and be slow or expensive.
- Use another table (B) as an index to the previous one (A), like:
B.HASH = 'VALUES'
B.RANGE = userId
B.lastAccesTime = lastAccesTime (with a secondary index)
You now have to maintain that index on writes, but you can use it with the Query operation to get your userIds. Query B: hash = 'VALUES', lastAccesTime between x and y, select userId.
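For reference, such a query against table B might look roughly like this with the low-level SDK used in the question (the index name is a placeholder, the "hash" and "userId" attribute names are assumed, and client is an AmazonDynamoDB client):
Map<String, Condition> keyConditions = new HashMap<String, Condition>();

// hash key of B is always the constant 'VALUES'
keyConditions.put("hash", new Condition()
        .withComparisonOperator(ComparisonOperator.EQ.toString())
        .withAttributeValueList(new AttributeValue().withS("VALUES")));

// range key of the secondary index is lastAccesTime
keyConditions.put("lastAccesTime", new Condition()
        .withComparisonOperator(ComparisonOperator.BETWEEN.toString())
        .withAttributeValueList(
                new AttributeValue().withS(dateFormatter.print(lastScanTime())),
                new AttributeValue().withS(dateFormatter.print(currentScanTime))));

QueryRequest queryRequest = new QueryRequest()
        .withTableName("B")
        .withIndexName("lastAccesTime-index") // placeholder name for the secondary index
        .withKeyConditions(keyConditions);

QueryResult result = client.query(queryRequest);
for (Map<String, AttributeValue> item : result.getItems()) {
    String userId = item.get("userId").getS(); // B.RANGE holds the userId
}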
Hope this helps.

The NOT_NULL comparison operator is not valid for the hash key condition. The only valid operator for the hash key condition in a query is EQ. More information can be found here:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
What this means is that a query will not work, at least not with your table as it is currently constructed. You can either use a Scan operation or create a separate table that stores the data by date (hash) and user ID (range).
Good luck!

I ended up scanning the table and applying a filter.
Thanks to everyone taking time for helping out!

You could add a Global Secondary Index with, for example, the year and month of your date as its hash key and the full date as its range key; then you could query any date range within a given month. This will help you avoid an expensive full scan.
E.g.
Global Secondary Index:
Hash key: month_and_year, for example '2014 March'
Range key: full_date
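Assuming a GSI set up along those lines, a query against it could look something like this with the Document API (the table handle, index name and values are placeholders):
Index index = table.getIndex("month_and_year-full_date-index"); // hypothetical index name

QuerySpec spec = new QuerySpec()
        .withKeyConditionExpression("month_and_year = :m AND full_date BETWEEN :from AND :to")
        .withValueMap(new ValueMap()
                .withString(":m", "2014 March")
                .withString(":from", "2014-03-10")
                .withString(":to", "2014-03-20"));

for (Item item : index.query(spec)) {
    System.out.println(item.toJSONPretty());
}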
Hope it helps!

You need to create a GSI if you want to query on anything other than the partition key. Scan is very expensive in terms of both cost and performance.

Related

DynamoDB QueryResultPage still returning results on bogus exclusive start key

tldr; - When using a bogus LastEvaluatedKey with DynamoDB queries for pagination, it still returns results in some cases.
I am implementing pagination for a fairly straightforward CRUD repository.
The implementation is based on:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html and
Pagination with DynamoDBMapper Java AWS SDK
I have a DynamoDB table and this query is running on a Global Secondary Index of that table.
Pagination is working fine: I have 5000 records, I query and receive a set of 500 results and a LastEvaluatedKey, and using this key I get the next set of 500 results, etc.
This key is made up of:
partition key "instanceId" which is always the same on subsequent page requests.
range key "id" which is what changes on the next page request
Now I wrote a test to make sure that if a bogus LastEvaluatedKey is provided I should get zero results.
What is the actual behavior:
If I provide something like id = "rrrrrrrrrrrrr" I get zero results, as expected.
If I provide something like id = "aaaaaaaaaaaaa" I get 500 results!
What's worth noting is that the "id" values are UUID strings, so the letter 'r' will not occur anywhere in any id.
My LastEvaluatedKey is made up like so (instanceId is the same for subsequent page queries):
var startKeyMap = new HashMap<String, AttributeValue>();
var idValue = new AttributeValue();
idValue.setS(startKey);
startKeyMap.put("id", idValue);
var instanceIdValue = new AttributeValue();
instanceIdValue.setS(instanceId);
startKeyMap.put("instanceId", instanceIdValue);
queryExpression.setExclusiveStartKey(startKeyMap);
I suspect what is happening is that because "id" is a sort key (in the GSI), the results are returned for anything greater than my bogus "aaaaaaaaaaaaaa". For "rrrrrrrrrrrr" it doesn't work because no keys would sort greater than 'rrrrrrrrrrrr'.
I would expect DDB to match exactly the exclusive start key, and return the next set of results from there but it seems like it is simply matching whatever comes close and returning whatever keys come after.
I also found:
DynamoDB Global Secondary Index with Exclusive Start Key
In there, the solution is to set the primary and range keys of both the table and the index as the ExclusiveStartKey. In my case, however, both are present; they are just reversed:
On the table the id is primary, instanceId is secondary. On the GSI, the reverse is true.
Can someone explain what is happening or what I'm doing wrong?
Working as designed...
ExclusiveStartKey just means to start with a key greater than whatever value you've passed in.
Exclusive, meaning 'greater than', as opposed to inclusive, which would be 'greater than or equal'.
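In other words, the only ExclusiveStartKey that acts as a precise "resume from here" token is the LastEvaluatedKey DynamoDB itself returned. A rough pagination loop with DynamoDBMapper might look like this (the Record mapper class, index name and mapper instance are hypothetical):
Map<String, AttributeValue> lastEvaluatedKey = null;
do {
    DynamoDBQueryExpression<Record> queryExpression = new DynamoDBQueryExpression<Record>()
            .withIndexName("instanceId-id-index")      // the GSI being queried
            .withConsistentRead(false)                 // GSIs only support eventual consistency
            .withKeyConditionExpression("instanceId = :v")
            .withExpressionAttributeValues(
                    Collections.singletonMap(":v", new AttributeValue().withS(instanceId)))
            .withLimit(500)
            .withExclusiveStartKey(lastEvaluatedKey);  // null on the first page

    QueryResultPage<Record> page = mapper.queryPage(Record.class, queryExpression);
    // process page.getResults() ...
    lastEvaluatedKey = page.getLastEvaluatedKey();     // pass back exactly what was returned
} while (lastEvaluatedKey != null);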

How to select items in date range in DynamoDB

How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec()
            .withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":from", stringStart)
                    .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work, I'm getting less results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, then create indexes to support less common access patterns and improve querying flexibility. Reads that supply the full (or, although not as fast, a partial) primary key, i.e. queries, are fast, so DynamoDB is at its best when the table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single Scan request can return at most 1 MB of data and is capped by your provisioned read capacity; when a scan hits either limit, you have to continue with follow-up requests that run sequentially, meaning a scan of a large table can take multiple round trips.
On top of that, scans consume read capacity based on the size of every item examined, no matter how much (or little) data is returned. If you only request a small number of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still paid to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to retrieve data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) would be to query the base table (or an index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
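With the same Document API objects as your scan example, such a query might look roughly like this (assuming GameName is the partition key and CreatedAt the sort key of whatever table or index gamesScoresTable refers to):
QuerySpec querySpec = new QuerySpec()
        .withKeyConditionExpression("GameName = :game AND CreatedAt BETWEEN :from AND :to")
        .withValueMap(new ValueMap()
                .withString(":game", "PacMan")
                .withString(":from", stringStart)
                .withString(":to", stringEnd));

for (Item item : gamesScoresTable.query(querySpec)) {
    System.out.println(item.toJSONPretty());
}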
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's data structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalizing data also supports more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strongly consistent reads, so if you need that, you'll need to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to add an LSI. Another difference between the two is the primary key: a GSI can have a different partition key and sort key from the base table, while an LSI can only differ in its sort key. More about indexes.

How to query DynamoDB with multiple Indexes?

I have a table in DynamoDB with a primary key and two global secondary indexes. For example, the table structure is as follows.
// Primary Keys
id -> PK
name -> SK
// Global Secondary Index 1
status_one -> S1PK
status_one_time -> S1SK
// Global Secondary Index 2
status_two -> S2PK
status_two_time -> S2SK
So what I need to know is how to use multiple keys in withKeyConditionExpression.
I will need to filter data by the following scenarios,
S1PK = :v1 and SK = :v4 and S2PK = :v2 and S2SK <= :v3
S2PK = :v1 and S2SK >= :v2 and S1SK <= :v2
So how can I do that? If I put the above expressions into withKeyConditionExpression it throws errors. Is there a way to query the table with the primary key and the secondary indexes all at once? What am I doing wrong here? I'd really appreciate it if anybody can help. Thanks in advance.
You can't. DynamoDB doesn't work like that.
A query can only access the table or a single index.
You could use Scan(), but realize that's going to read every record in the table (and you'll be charged for that) and simply throw away the ones that don't match. Great way to use up your provisioned capacity.
Also, DDB will only read 1MB at a time, so you'll likely need to call it in a loop.
If this is a common access pattern, you'll need to rethink your keys. Or rethink the use of DDB (by itself). A common pattern is to replicate the data to Elasticsearch for better search functionality.
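To illustrate, the closest DynamoDB gets to your first scenario is a query against a single index, with the remaining conditions pushed into a FilterExpression (applied after the key lookup, so you still pay for the filtered-out items). The table name, index name and values below are placeholders, and the non-key attributes must be projected into that index:
Map<String, AttributeValue> values = new HashMap<>();
values.put(":v1", new AttributeValue().withS("ACTIVE"));
values.put(":v2", new AttributeValue().withS("PENDING"));
values.put(":v3", new AttributeValue().withS("2021-01-01T00:00:00Z"));
values.put(":v4", new AttributeValue().withS("some-name"));

QueryRequest request = new QueryRequest()
        .withTableName("MyTable")
        .withIndexName("S1-index") // GSI 1: status_one / status_one_time
        .withKeyConditionExpression("status_one = :v1")
        .withFilterExpression("#n = :v4 AND status_two = :v2 AND status_two_time <= :v3")
        .withExpressionAttributeNames(Collections.singletonMap("#n", "name")) // "name" is a reserved word
        .withExpressionAttributeValues(values);

QueryResult result = client.query(request);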

How to use begins_with in DynamoDBMapper BatchLoad

I am trying to perform a batch get operation on DynamoDB using DynamoDBMapper.batchLoad() on a table with a composite primary key, where I know the set of hash key values but not the range key values. The only thing I know about the range key values is the character sequence they start with; for example, if that sequence is "test", the range key value will be something like "test1243".
To solve this problem DynamoDB supports a begins_with clause, but only on the Query operation. How can I use the same begins_with clause in a BatchGet operation?
You can only use the begins_with operator with queries. When you call GetItem or BatchGetItem you must specify the whole primary key (partition key + sort key if present) of the items you wish to retrieve so the begins_with operator is not useful.
You should just run queries in parallel, one for each of the hash keys you need to get the records for.
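For example, with the Document API that could look roughly like this (the table handle, the userId/recordId attribute names and the list of known hash keys are placeholders):
List<Item> allItems = hashKeys.parallelStream()
        .flatMap(hashKey -> {
            QuerySpec spec = new QuerySpec()
                    .withKeyConditionExpression("userId = :h AND begins_with(recordId, :prefix)")
                    .withValueMap(new ValueMap()
                            .withString(":h", hashKey)
                            .withString(":prefix", "test"));
            List<Item> pageItems = new ArrayList<>();
            table.query(spec).forEach(pageItems::add);   // iteration paginates under the hood
            return pageItems.stream();
        })
        .collect(Collectors.toList());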

How do I query for partition keys that contain a specific substring in DynamoDB?

I have a partition key that is made up of two strings, e.g. userId:userName (1234:John, 4567:Mark, etc.). I want to query for all the records whose partition key matches the substring defined by userName, e.g. find all the records that contain "Mark" in the partition key. How do I do this using the DynamoDB APIs in Java?
Hopefully this is not something that you have to do frequently.
DynamoDB does not support querying by partial hash key. You would have to use a table scan to iterate over all the elements in the table and compare each one for matches.
This is highly inefficient, and if you find yourself depending on this type of behavior then you should revisit your choice of hash key and your overall design choices.
For the sake of completeness, the code you're looking for is along the following lines if you're using the Document API:
// dynamo returns results in chunks - you'll need this to get the next one
Map<String, AttributeValue> lastKeyEvaluated = null;
do {
    ScanRequest scanRequest = new ScanRequest()
            .withTableName("YourTableNameHere")
            .withExclusiveStartKey(lastKeyEvaluated);

    ScanResult result = client.scan(scanRequest);
    for (Map<String, AttributeValue> item : result.getItems()) {
        // for each item in the result set, examine the partition key
        // to determine if it's a match
        String key = item.get("YourPartitionKeyAttributeNameHere").getS();
        if (key.endsWith(":Mark")) {
            System.out.println("Found an item that matches *:Mark:\n" + item);
        }
    }
    lastKeyEvaluated = result.getLastEvaluatedKey();
} while (lastKeyEvaluated != null);
But before you implement something like this in your application, consider choosing a different partition key strategy, creating a secondary index for your table, or both - especially if you need to make this type of query often!
As a side note, I'm curious: what benefit do you get from including both the user id and the user name in the partition key? The user id would, presumably, already be unique, so why add the user name?
You can't do this as you've described in a cost efficient manner. You'll need to scan the table, which is expensive and time consuming.
Revisit your choice of key so you are always running queries against full key values instead of substrings.
You might want to consider using a range key - when including a range key, queries can be efficiently run against either just the hash key (returning potentially multiple values), or the combination of hash key/range key (which must be unique).
In this example, if you're always querying on either userId:userName or userName (but not userId by itself), then using userName as hash key and userId as range key is a simple and efficient solution.
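With that layout, fetching every record for a user name becomes a plain query instead of a scan; a minimal sketch with the Document API (the usersTable handle and attribute names are assumed):
QuerySpec spec = new QuerySpec()
        .withKeyConditionExpression("userName = :name")
        .withValueMap(new ValueMap().withString(":name", "Mark"));

for (Item item : usersTable.query(spec)) {
    System.out.println(item.toJSONPretty()); // every userId stored under the name "Mark"
}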
