Given my DynamoDB has a column of 'BlockNumber', how do I write the Java QuerySpec to find the MAX block number in the DB? (It is configured as a GSI.)
Typically, your GSI would have a partition key and a sort key, just like a regular DynamoDB table. You would issue a query against a known partition key with ScanIndexForward=false and Limit=1, so it returns a single item: the one with the matching partition key and the maximum sort key value. When ScanIndexForward is false, DynamoDB reads the items in reverse order by sort key value.
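With the Document API this corresponds roughly to `new QuerySpec().withHashKey(...).withScanIndexForward(false).withMaxResultSize(1)` run against `table.getIndex(...)`. The effect can be sketched with plain Java collections - here a NavigableMap stands in for one GSI partition sorted by BlockNumber (the map and its contents are invented for illustration):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

class MaxBlockSketch {
    // One GSI partition: items sorted ascending by the BlockNumber sort key.
    // Reading in reverse order (ScanIndexForward=false) and stopping after
    // one item (Limit=1) yields the maximum sort key value.
    static long maxBlockNumber(NavigableMap<Long, String> partition) {
        // descendingMap() reverses the sort order, like ScanIndexForward=false;
        // firstEntry() of the reversed view is the one item Limit=1 returns.
        return partition.descendingMap().firstEntry().getKey();
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> partition = new TreeMap<>();
        partition.put(100L, "block-100");
        partition.put(250L, "block-250");
        partition.put(175L, "block-175");
        System.out.println("max block number: " + maxBlockNumber(partition));
    }
}
```

Note that this only finds the maximum within one partition of the GSI; a global maximum across partitions needs one query per partition key value.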
If the data is immutable, the best option is to keep a separate record that holds aggregate values. Whenever you add an item that may change the max value, you update the aggregate record. The cleanest way to do this is to use DynamoDB Streams to perform the updates to the aggregate record. See also: Using Global Secondary Indexes for Materialized Aggregation Queries.
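The aggregate-record idea can be sketched as a tiny stream-handler: each newly inserted item conditionally bumps a stored maximum (the record and attribute names here are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

class MaxAggregateSketch {
    // Stands in for the single aggregate record, e.g. an item with a
    // well-known key like "AGGREGATES" holding attribute "MaxBlockNumber".
    final Map<String, Long> aggregateRecord = new HashMap<>();

    // What a DynamoDB-Streams consumer would do for each inserted item:
    // update the aggregate only when the new value exceeds the stored max.
    void onInsert(long blockNumber) {
        aggregateRecord.merge("MaxBlockNumber", blockNumber, Math::max);
    }

    long currentMax() {
        return aggregateRecord.getOrDefault("MaxBlockNumber", 0L);
    }
}
```

Reading the maximum is then a single GetItem on the aggregate record instead of a query or scan.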
Related
How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec()
            .withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":from", stringStart)
                    .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, then create indexes to support less-used access patterns and improve querying flexibility. Reads are fastest when you provide the full primary key (and still fast, though not as fast, with a partial key), i.e. when using queries, so DynamoDB performs best when table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they will cost you in both performance and dollars. A single scan request can retrieve a maximum of 1 MB of data from a single partition and is capped at that partition's read capacity. When a scan hits either limit, consecutive scan requests happen sequentially, meaning a scan of a large table can take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the items scanned, no matter how much (or little) data is returned. If you request only a few attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still pay to read the entire table.
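Sticking with standard DynamoDB capacity accounting (one read capacity unit covers a strongly consistent read of up to 4 KB, half that for eventually consistent reads), the "you pay for what is scanned, not what is returned" point can be sketched as:

```java
class ScanCostSketch {
    // Standard DynamoDB accounting: capacity is metered on the cumulative
    // size of the items scanned, rounded up in 4 KB units; items removed
    // by a FilterExpression are still paid for.
    static long consumedRcus(long bytesScanned, boolean stronglyConsistent) {
        long units = (bytesScanned + 4096 - 1) / 4096; // round up to 4 KB
        return stronglyConsistent ? units : (units + 1) / 2;
    }

    public static void main(String[] args) {
        // Scanning 1 MB costs 256 strongly consistent RCUs,
        // even if the filter discards 90% of the items.
        System.out.println(consumedRcus(1_048_576, true));
    }
}
```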
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
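A parallel scan divides the key space into TotalSegments logical segments and runs one Scan worker per Segment. The shape of that, with an in-memory list standing in for the table and parallelStream() standing in for the workers, is roughly:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ParallelScanSketch {
    // Each segment index plays the role of the Segment parameter of a
    // Scan request; all segments together cover the whole table once.
    static List<String> scanAll(List<String> table, int totalSegments) {
        return IntStream.range(0, totalSegments).parallel()
                .mapToObj(segment -> {
                    int from = segment * table.size() / totalSegments;
                    int to = (segment + 1) * table.size() / totalSegments;
                    return table.subList(from, to);
                })
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}
```

Parallelism reduces wall-clock time but not consumed capacity: every item is still read once.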
Let's now look at: How can I select all items, based on some criteria?
The ideal way to accomplish retrieving data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) would be to query the base table (or index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
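The semantics of that key condition - equality on the partition key plus BETWEEN on the sort key - can be sketched in plain Java (the record shape here is invented for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

class KeyConditionSketch {
    record GameScore(String gameName, String createdAt) {}

    // Mirrors: GameName = :game AND createdAt BETWEEN :from AND :to.
    // ISO-8601 timestamps compare correctly as plain strings, which is
    // why they work well as DynamoDB sort keys.
    static List<GameScore> query(List<GameScore> data,
                                 String game, String from, String to) {
        return data.stream()
                .filter(s -> s.gameName().equals(game))
                .filter(s -> s.createdAt().compareTo(from) >= 0
                          && s.createdAt().compareTo(to) <= 0)
                .collect(Collectors.toList());
    }
}
```

Unlike the FilterExpression scan in the question, a key condition like this restricts what DynamoDB reads, not just what it returns.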
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's data structure does not fit some of your application's access patterns, you might need to. In DynamoDB, however, denormalizing your data can also support more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strongly consistent reads, so if you need those, you'll need a Local Secondary Index (LSI). However, LSIs can only be created together with the table, so if you've already created your base table, you won't be able to add one. Another difference between the two is the primary key: a GSI can have a different partition key and sort key from the base table, while an LSI can differ only in its sort key. More about indexes.
I am trying to perform a batch get operation on DynamoDB using DynamoDBMapper.batchLoad() on a table with a composite primary key, where I know the set of HashKey values but not the RangeKey values. For the RangeKey values, the only information I have is the character sequence they start with; for example, if the sequence is "test", the RangeKey value will be something like "test1243".
DynamoDB supports a begins_with clause for this, but only on the Query operation. How can I use the same begins_with clause in a BatchGet operation?
You can only use the begins_with operator with queries. When you call GetItem or BatchGetItem you must specify the whole primary key (partition key, plus sort key if present) of the items you wish to retrieve, so the begins_with operator is not available there.
You should just run queries in parallel, one for each of the hash keys you need to get the records for.
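That fan-out can be sketched like this: one begins_with-style query per hash key, run concurrently. The in-memory map stands in for the table, and parallelStream() stands in for issuing the per-key Query calls in parallel:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BatchBeginsWithSketch {
    // One "query" per hash key, each applying begins_with on the range key.
    static Map<String, List<String>> fetchAll(Map<String, List<String>> table,
                                              List<String> hashKeys,
                                              String rangeKeyPrefix) {
        return hashKeys.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        hk -> hk,
                        hk -> table.getOrDefault(hk, List.of()).stream()
                                .filter(rk -> rk.startsWith(rangeKeyPrefix))
                                .collect(Collectors.toList())));
    }
}
```

In the real thing, each task would issue a Query with a key condition like `#hk = :hk AND begins_with(#rk, :prefix)` and collect the pages of results.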
In my Dataflow pipeline, I'll have two PCollection<TableRow>s that have been read from BigQuery tables. I plan to merge those two PCollections into one with a Flatten.
Since BigQuery is append-only, the goal is to write-truncate the second table in BigQuery with the new PCollection.
I've read through the documentation and it's the middle steps I'm confused about. With my new PCollection the plan is to use a Comparator DoFn to look at the max last update date and returning the given row. I'm unsure if I should be using a filter transform or if I should be doing a Group by key and then using a filter?
All PCollection<TableRow>s will contain the same values, i.e. a string, an integer, and a timestamp. When it comes to key-value pairs, most of the documentation on Cloud Dataflow uses just simple strings. Is it possible to have a key-value pair whose value is the entire row of the PCollection<TableRow>?
The rows would look similar to:
customerID, customerName, lastUpdateDate
0001, customerOne, 2016-06-01 00:00:00
0001, customerOne, 2016-06-11 00:00:00
In the example above, I would want to filter the PCollection to return just the second row to a PCollection that would be written to BigQuery. Also, is it possible to apply these ParDos on the third PCollection without creating a fourth?
You've asked a few questions. I have tried to answer them in isolation, but I may have misunderstood the whole scenario. If you provided some example code, it might help to clarify.
With my new PCollection the plan is to use a Comparator DoFn to look at the max last update date and returning the given row. I'm unsure if I should be using a filter transform or if I should be doing a Group by key and then using a filter?
Based on your description, it seems that you want to take a PCollection of elements and for each customerID (the key) find the most recent update to that customer's record. You can use the provided transforms to accomplish this via Top.largestPerKey(1, timestampComparator) where you set up your timestampComparator to look only at the timestamp.
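What Top.largestPerKey(1, timestampComparator) does can be sketched with plain Java collections: group by customerID, then keep only the row with the greatest lastUpdateDate. (The Row type here is invented for illustration; in the pipeline it would be a KV<String, TableRow>.)

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class LatestPerKeySketch {
    record Row(String customerId, String customerName, String lastUpdateDate) {}

    // Comparator looking only at the timestamp, like the timestampComparator
    // passed to Top.largestPerKey(1, ...).
    static final Comparator<Row> BY_TIMESTAMP =
            Comparator.comparing(Row::lastUpdateDate);

    // Per customerID, keep the single row with the maximum timestamp.
    static Map<String, Row> latestPerCustomer(List<Row> rows) {
        return rows.stream().collect(Collectors.toMap(
                Row::customerId,
                r -> r,
                (a, b) -> BY_TIMESTAMP.compare(a, b) >= 0 ? a : b));
    }
}
```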
Is it possible to have a key value pair that is the entire row of the PCollection?
A KV<K, V> can have any type for the key (K) and value (V). If you want to group by key, then the coder for the keys needs to be deterministic. TableRowJsonCoder is not deterministic, because it may contain arbitrary objects. But it sounds like you want to have the customerID for the key and the entire TableRow for the value.
is it possible to apply these ParDos on the third PCollection without creating a fourth?
When you apply a PTransform to a PCollection, it results in a new PCollection. There is no way around that, and you don't need to try to minimize the number of PCollections in your pipeline.
A PCollection is a conceptual object; it does not have intrinsic cost. Your pipeline is going to be heavily optimized so that many intermediate PCollections - especially those in a sequence of ParDo transforms - will never be materialized anyhow.
I have a partition key that is made up of two strings, e.g. userId:userName - such as 1234:John or 4567:Mark. I want to query for all the records that match the substring defined by userName, e.g. find all the records that contain "Mark" in the partition key. How do I do this using the DynamoDB APIs in Java?
Hopefully this is not something that you have to do frequently.
DynamoDB does not support querying by partial hash-key. You would have to use a table scan to iterate over all elements in the table and compare each one for matches.
This is highly inefficient and if you find yourself depending on this type of behavior then you have to revisit your choice of hash-key and your over-all design choices.
For the sake of completeness, the code you're looking for is along the following lines, using the low-level client API:
// DynamoDB returns results in chunks - you'll need this to get the next one
Map<String, AttributeValue> lastKeyEvaluated = null;
do {
    ScanRequest scanRequest = new ScanRequest()
            .withTableName("YourTableNameHere")
            .withExclusiveStartKey(lastKeyEvaluated);

    ScanResult result = client.scan(scanRequest);
    for (Map<String, AttributeValue> item : result.getItems()) {
        // for each item in the result set, examine the partition key
        // to determine if it's a match
        String key = item.get("YourPartitionKeyAttributeNameHere").getS();
        if (key.endsWith(":Mark"))
            System.out.println("Found an item that matches *:Mark:\n" + item);
    }
    lastKeyEvaluated = result.getLastEvaluatedKey();
} while (lastKeyEvaluated != null);
But before you implement something like this in your application consider choosing a different partition key strategy, or creating a secondary index for your table, or both - if you need to make this type of query often!
As a side note, I'm curious, what benefit do you get by including both user id and user name in the partition key? The user id would, presumably, be unique for you so why the user name?
You can't do this as you've described in a cost efficient manner. You'll need to scan the table, which is expensive and time consuming.
Revisit your choice of key so you are always running queries against full key values instead of substrings.
You might want to consider using a range key - when including a range key, queries can be efficiently run against either just the hash key (returning potentially multiple values), or the combination of hash key/range key (which must be unique).
In this example, if you're always querying on either userId:userName or userName (but not userId by itself), then using userName as hash key and userId as range key is a simple and efficient solution.
I've got a DynamoDB table with an alphanumeric string as its hash key (e.g. "d4ed6962-3ec2-4312-a480-96ecbb48c9da"). I need to query the table based on another field, so I need my query to select all the keys such that my field x is between date x and date y.
I know I need a condition on the hash key and another on a range key; however, I'm struggling to compose a hash key condition that does not bind my query to specific IDs.
I thought I could get away with a redundant condition based on the ID being NOT_NULL, but when I use it I get the error:
Query key condition not supported
Below are the conditions I am using; any idea how to achieve this goal?
Condition hashKeyCondition = new Condition()
        .withComparisonOperator(ComparisonOperator.NOT_NULL.toString());

Condition rangeCondition = new Condition()
        .withComparisonOperator(ComparisonOperator.BETWEEN.toString())
        .withAttributeValueList(
                new AttributeValue().withS(dateFormatter.print(lastScanTime())),
                new AttributeValue().withS(dateFormatter.print(currentScanTime)));

Map<String, Condition> keyConditions = new HashMap<String, Condition>();
keyConditions.put("userId", hashKeyCondition);
keyConditions.put("lastAccesTime", rangeCondition);
Thanks in advance to everyone helping.
In DynamoDB you can get items with 3 APIs:
- Scan (flexible, but expensive)
- Query (less flexible - you have to specify a hash key - but less expensive)
- GetItem (by hash key and, if your table has one, range key)
The only way to achieve what you want is to either:
- Use Scan, and be slow or expensive.
- Use another table (B) as an index to the previous one (A), like:
  B.HASH = 'VALUES'
  B.RANGE = userId
  B.lastAccesTime = lastAccesTime (with a secondary index)
Now you have to maintain that index on writes, but you can use it with the Query operation to get your userIds. Query B: hash = 'VALUES', lastAccesTime between x and y, select userId.
Hope this helps.
The NOT_NULL comparison operator is not valid for the hash key condition. The only valid operator for the Hash key condition on a query is EQ. More information can be found here:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
What this means is that a query will not work, at least not with your table as currently constructed. You can either use a Scan operation, or you can create a separate table that stores the data by date (hash) and user ID (range).
Good luck!
I ended up scanning the table and enforcing a filter.
Thanks to everyone taking time for helping out!
You could add a Global Secondary Index with, for example, the year and month of your date as its hash key; the range key for that index would be your full date. Then you could query any date range within a given month. This helps you avoid an expensive full scan.
E.g.
Global Secondary Index:
Hash key: month_and_year for example '2014 March'
Range key: full_date
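Computing that month_and_year bucket, and the matching full_date range key, can be sketched like this (the exact format strings are illustrative choices):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class MonthBucketSketch {
    // GSI hash key: a coarse year-and-month bucket ('2014 March'), so all
    // items from one month share one queryable partition.
    static String monthAndYear(LocalDateTime date) {
        return date.format(DateTimeFormatter.ofPattern("yyyy MMMM", Locale.ENGLISH));
    }

    // GSI range key: the full ISO-8601 timestamp, so BETWEEN conditions on
    // it select any sub-range inside the month (ISO strings sort correctly).
    static String fullDate(LocalDateTime date) {
        return date.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }
}
```

A query for part of March 2014 would then use month_and_year = '2014 March' with full_date BETWEEN the two bounds; a range spanning multiple months needs one query per month bucket.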
Hope it helps!
You need to create a GSI if you want to query on anything other than the partition key. A Scan is very expensive in terms of both cost and performance.