How to query DynamoDB with multiple Indexes? - java

I have a table in DynamoDB with a primary key and two global secondary indexes. For example, the table structure is as follows.
// Primary Keys
id -> PK
name -> SK
// Global Secondary Index 1
status_one -> S1PK
status_one_time -> S1SK
// Global Secondary Index 2
status_two -> S2PK
status_two_time -> S2SK
What I need to know is how to use multiple keys in withKeyConditionExpression.
I need to filter data in scenarios like the following:
S1PK = :v1 and SK = :v4 and S2PK = :v2 and S2SK <= :v3
S2PK = :v1 and S2SK >= :v2 and S1SK <= :v2
So how can I do that? If I put the above expressions into withKeyConditionExpression it throws errors. Is there a way to query the table with the primary key and secondary indexes all at once? What am I doing wrong here? I'd really appreciate it if anybody can help me. Thanks in advance.

You can't. DynamoDB doesn't work like that.
A query can only access the table or a single index.
You could use Scan(), but realize that it's going to read every record in the table (and you'll be charged for that) and simply throw away the ones that don't match the filter. It's a great way to use up your provisioned capacity.
Also, DDB will only read 1MB at a time, so you'll likely need to call it in a loop.
If this is a common access pattern, you'll need to rethink your keys. Or rethink the use of DDB (by itself). A common pattern is to have the data duplicated to Elasticsearch for better search functionality.
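If you do end up using Scan, the filter expression can reference any combination of attributes, since filtering happens after the read. Below is a minimal sketch of a paginated Scan with the AWS SDK for Java v1 low-level client; the table name, client setup, and placeholder values are assumptions, not taken from the question.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

import java.util.HashMap;
import java.util.Map;

public class MultiStatusScan {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":v1", new AttributeValue().withS("ACTIVE"));
        values.put(":v2", new AttributeValue().withS("PENDING"));
        values.put(":v3", new AttributeValue().withS("2020-01-01T00:00:00Z"));

        Map<String, AttributeValue> lastKey = null;
        do {
            // A filter expression may reference any attributes, but the whole
            // table is still read (and billed) before the filter is applied.
            ScanRequest request = new ScanRequest()
                    .withTableName("MyTable")
                    .withFilterExpression(
                            "status_one = :v1 AND status_two = :v2 AND status_two_time <= :v3")
                    .withExpressionAttributeValues(values)
                    .withExclusiveStartKey(lastKey);

            ScanResult result = client.scan(request);
            result.getItems().forEach(item -> System.out.println(item));

            // Scan returns at most 1MB per call; keep paging until no key is returned.
            lastKey = result.getLastEvaluatedKey();
        } while (lastKey != null);
    }
}

Note that every page still consumes read capacity for everything it scans, regardless of how many items survive the filter.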

Related

How to select items in date range in DynamoDB

How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec()
            .withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":from", stringStart)
                    .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, and then create indexes to support less common access patterns and improve querying flexibility. Reads are fastest when you provide the full primary key (and still fast with a partial key), i.e. when you query, so DynamoDB performs best when the table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan retrieves all items in your table, and unless all your items live within one partition, it is the only way to do so. Scans can also be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan can only retrieve a maximum of 1MB of data, of a single partition, and is capped at that partition's read capacity. When a scan tops out at either limitation, consecutive scans will happen sequentially. Meaning a scan on a large table could take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the items read, no matter how much (or little) data is returned. If you only request a small number of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still paid to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to accomplish retrieving data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) would be to query the base table (or index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
If it is not the sort key, you might have the answer to your second question.
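As a minimal sketch, such a query could look like the following with the Document API (the same API as the ScanSpec in the question), assuming GameName is the partition key and CreatedAt the sort key of the table being queried; the table name and values are placeholders.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class GameScoresByDate {
    public static void main(String[] args) {
        DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Table gameScores = dynamoDB.getTable("GameScores");

        // The partition key uses equality; the sort key can use BETWEEN.
        QuerySpec spec = new QuerySpec()
                .withKeyConditionExpression("GameName = :game AND CreatedAt BETWEEN :from AND :to")
                .withValueMap(new ValueMap()
                        .withString(":game", "PacMan")
                        .withString(":from", "2020-01-01T00:00:00.000Z")
                        .withString(":to", "2020-01-31T23:59:59.999Z"));

        ItemCollection<QueryOutcome> items = gameScores.query(spec);
        for (Item item : items) {
            System.out.println(item.toJSONPretty());
        }
    }
}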
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's data structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalizing your data can also support more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strong read consistency, so if you need that, you'll need to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to create an LSI. Another difference between the two is the primary key: a GSI can have a different partition key and sort key than the base table, while an LSI can only differ in its sort key. More about indexes.
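If a GSI does turn out to be the right fit, it can be defined when the table is created (or added later). Here's a rough sketch using the SDK v1 low-level client; the table, index, and attribute names are hypothetical.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class CreateTableWithGsi {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Index keyed by game name, sorted by creation date.
        GlobalSecondaryIndex byDate = new GlobalSecondaryIndex()
                .withIndexName("GameName-CreatedAt-index")
                .withKeySchema(
                        new KeySchemaElement("GameName", KeyType.HASH),
                        new KeySchemaElement("CreatedAt", KeyType.RANGE))
                .withProjection(new Projection().withProjectionType(ProjectionType.ALL))
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L));

        CreateTableRequest request = new CreateTableRequest()
                .withTableName("GameScores")
                .withKeySchema(new KeySchemaElement("ScoreId", KeyType.HASH))
                .withAttributeDefinitions(
                        new AttributeDefinition("ScoreId", ScalarAttributeType.S),
                        new AttributeDefinition("GameName", ScalarAttributeType.S),
                        new AttributeDefinition("CreatedAt", ScalarAttributeType.S))
                .withGlobalSecondaryIndexes(byDate)
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L));

        client.createTable(request);
    }
}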

Compare very large tables in java

I am not able to find any satisfying solution so asking here.
I need to compare the data of two large tables (~50M records) with the same schema definition in Java.
I cannot use an ORDER BY clause when getting the ResultSet object, and the records might not be in the same order in both tables.
Can anyone tell me what the right way to do this would be?
You could extract the data of the first DB table into a text file and create a while loop over the ResultSet for the second table. As you iterate through the ResultSet, do a search/verify against the text file. This solution works if memory is a concern for you.
If not, just use a HashMap to hold the data for the first table, then loop over the second table's ResultSet and look up each record in the HashMap.
This really depends on what you mean by 'compare'. Are you trying to see if they both contain the exact same data? Find rows in one not in the other? Find rows with the same primary keys that have differing values?
Also, why do you have to do this in Java? Regardless of what exactly you are trying to do, it's probably easier to do with SQL.
In Java, you'll want to create a class that represents the primary key for the tables, and a second class that represents the rest of the data, which also includes the primary key class. If you only have a single column as the primary key, this is easier.
We'll call P the primary key class, and D the rest.
Map<P, D> map = new HashMap<>();
Select all of the rows from the first table, and insert them into the hash map.
Query all of the rows in the second table.
For each row, create a P object.
Use that to see what data was in the first table with the same Key.
Now you know if both tables contained the same row, and you can compare the non-key values from both.
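As a rough sketch of those steps using plain JDBC, with a single-column key for simplicity (hypothetical table and column names; with a composite key you'd use a small key class with equals/hashCode as described above):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class TableComparer {

    public static void compare(Connection conn) throws Exception {
        // Load table A into a map keyed by primary key.
        Map<Long, String> tableA = new HashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, payload FROM table_a")) {
            while (rs.next()) {
                tableA.put(rs.getLong("id"), rs.getString("payload"));
            }
        }

        // Stream table B and look each row up in the map.
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, payload FROM table_b")) {
            while (rs.next()) {
                long id = rs.getLong("id");
                String valueB = rs.getString("payload");
                boolean inA = tableA.containsKey(id);
                String valueA = tableA.remove(id);   // leftovers afterwards = only in A
                if (!inA) {
                    System.out.println("Only in B: " + id);
                } else if (!Objects.equals(valueA, valueB)) {
                    System.out.println("Differs:   " + id);
                }
            }
        }

        // Whatever is still in the map exists only in table A.
        tableA.keySet().forEach(id -> System.out.println("Only in A: " + id));
    }
}

Keep in mind that holding ~50M keys and values in a HashMap needs a lot of heap, which is why the first answer suggests the file-based variant when memory is a concern.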
Like I said, this is much much easier to do in straight SQL.
You basically do a full outer join between the two tables. How exactly that join looks depends on exactly what you are trying to do.

Querying DynamoDB

I've got a DynamoDB table with an alphanumeric string as its hash key (e.g. "d4ed6962-3ec2-4312-a480-96ecbb48c9da"). I need to query the table based on another field in the table, so I need my query to select all the keys such that my field x is between date x and date y.
I know I need a condition on the hash key and another on a range key; however, I'm struggling to compose a hash key condition that does not bind my query to specific IDs.
I thought I could get away with a redundant condition based on the ID being NOT_NULL, but when I use it I get the error:
Query key condition not supported
Below are the conditions I am using; any idea how to achieve this goal?
Condition hashKeyCondition = new Condition()
        .withComparisonOperator(ComparisonOperator.NOT_NULL.toString());

Condition rangeCondition = new Condition()
        .withComparisonOperator(ComparisonOperator.BETWEEN.toString())
        .withAttributeValueList(
                new AttributeValue().withS(dateFormatter.print(lastScanTime())),
                new AttributeValue().withS(dateFormatter.print(currentScanTime)));

Map<String, Condition> keyConditions = new HashMap<String, Condition>();
keyConditions.put("userId", hashKeyCondition);
keyConditions.put("lastAccesTime", rangeCondition);
Thanks in advance to everyone helping.
In DynamoDB you can get items with three APIs:
Scan (flexible but expensive)
Query (less flexible: you have to specify a hash key, but less expensive)
GetItem (by hash key and, if your table has one, range key)
The only way to achieve what you want is to either:
Use Scan, which will be slow and/or expensive.
Use another table (B) as an index to the first one (A), like:
B.HASH = 'VALUES'
B.RANGE = userid
B.lastAccesTime = lastAccesTime (with a secondary index)
Now you have to maintain that index on writes, but you can use it with the Query operation to get your userIds. Query B: hash = 'VALUES', lastAccesTime between x and y, select userId.
Hope this helps.
The NOT_NULL comparison operator is not valid for the hash key condition. The only valid operator for the Hash key condition on a query is EQ. More information can be found here:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
And what this means is that a query will not work, at least as your table is currently constructed. You can either use a Scan operation or you can create a separate table that stores the data by Date (hash) and User ID (range).
Good luck!
I ended up scanning the table and enforcing a filter.
Thanks to everyone taking time for helping out!
You could add a Global Secondary Index with, for example, the year and month of your date as its hash key; the range key for that index would be the full date. Then you could query any date range within a certain month. This will help you avoid an expensive full scan.
E.g.
Global Secondary Index:
Hash key: month_and_year for example '2014 March'
Range key: full_date
Hope it helps!
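As a minimal sketch, querying such an index with the Document API could look like this; the table, index, and attribute names are hypothetical and follow the layout above.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Index;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class QueryByMonth {
    public static void main(String[] args) {
        DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
        Index index = dynamoDB.getTable("AccessLog").getIndex("month_and_year-full_date-index");

        // The hash key narrows the query to one month; the range key handles the date window.
        QuerySpec spec = new QuerySpec()
                .withKeyConditionExpression("month_and_year = :m AND full_date BETWEEN :from AND :to")
                .withValueMap(new ValueMap()
                        .withString(":m", "2014 March")
                        .withString(":from", "2014-03-10")
                        .withString(":to", "2014-03-20"));

        for (Item item : index.query(spec)) {
            System.out.println(item.toJSON());
        }
    }
}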
You need to create a GSI if you want to query on an attribute other than the partition key. Scan is very expensive in terms of both cost and performance.

Efficient solution for grouping same values in a large dataset

At my job I was to develop and implement a solution for the following problem:
Given a dataset of 30M records, extract (key, value) tuples from a particular dataset field, group them by key and value, and store the number of identical values for each key. Write the top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in the form of serialized XML.
I came up with the solution like this (using Spring-Batch):
Batch job steps:
Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon getting some fixed number of tuples, dump them to disk. Each tuple goes to a chunk file under a directory named after its key, so all values for a given key are stored in one directory. Within one file the values are stored sorted.
Step 2. Iterate over all the key directories and merge their chunk files into one, grouping identical values. Since the values are stored sorted, it is trivial to merge them in O(n * log k), where 'n' is the number of values in a chunk file and 'k' is the initial number of chunks.
Step 3. For each merged file (in other words, for each key), sequentially read its values, using a PriorityQueue to maintain the top 5000 values without loading all the values into memory. Write the queue's content to the database, as sketched below.
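Here is a sketch of the step 3 idea: a bounded min-heap keeps the 5000 most frequent values for one key without holding everything in memory. The merged-file format (one "value<TAB>count" pair per line) is an assumption for illustration only.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopValues {

    record ValueCount(String value, long count) {}

    static List<ValueCount> topN(Path mergedFile, int n) throws IOException {
        // Min-heap ordered by count: the head is always the weakest entry.
        PriorityQueue<ValueCount> heap =
                new PriorityQueue<>(Comparator.comparingLong(ValueCount::count));

        try (BufferedReader reader = Files.newBufferedReader(mergedFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");            // "value<TAB>count"
                heap.offer(new ValueCount(parts[0], Long.parseLong(parts[1])));
                if (heap.size() > n) {
                    heap.poll();                              // drop the least frequent
                }
            }
        }

        List<ValueCount> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong(ValueCount::count).reversed());
        return result;                                        // ready to write to the DB
    }
}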
I spent about a week on this task, mainly because I had not worked with Spring-Batch previously and because I tried to emphasize scalability, which requires a careful implementation of the multi-threading part.
The problem is that my manager considers this task way too easy to spend that much time on.
So the question is: do you know a more efficient solution, or maybe a less efficient one that would be easier to implement? And how much time would you need to implement my solution?
I am aware of MapReduce-like frameworks, but I can't use them because the application is supposed to run on a simple PC with 3 cores and 1GB of Java heap.
Thank you in advance!
UPD: I think I did not state my question clearly. Let me ask it another way:
Given the problem, and being the project manager or at least the task reviewer, would you accept my solution? And how much time would you dedicate to this task?
Are you sure this approach is faster than doing a pre-scan of the XML file to extract all keys, and then parsing the XML file over and over, once per key? You are doing a lot of file management in this solution, which is definitely not free.
As you have three cores, you could process three keys at the same time (as long as the file system can handle the load).
Your solution seems reasonable and efficient; however, I'd probably use SQL.
While parsing the Key/Value pairs I'd insert/update into a SQL table.
I'd then query the table for the top records.
Here's an example using only T-SQL (SQL Server 2008, but the concept should work in most any modern RDBMS).
The SQL between the /* START */ and /* END */ comments is what you'd need to execute in your code.
BEGIN
    -- database table
    DECLARE @tbl TABLE (
        k INT   -- key
      , v INT   -- value
      , c INT   -- count
      , UNIQUE CLUSTERED (k, v)
    )

    -- insertion loop (for testing)
    DECLARE @x INT
    SET @x = 0
    SET NOCOUNT OFF
    WHILE (@x < 1000000)
    BEGIN
        SET @x = @x + 1
        DECLARE @k INT
        DECLARE @v INT
        SET @k = CAST(RAND() * 10 AS INT)
        SET @v = CAST(RAND() * 100 AS INT)

        -- the INSERT / UPDATE code
        /* START this is the sql you'd run for each row */
        UPDATE @tbl SET c = c + 1 WHERE k = @k AND v = @v
        IF @@ROWCOUNT = 0
            INSERT INTO @tbl VALUES (@k, @v, 1)
        /* END */
    END
    SET NOCOUNT ON

    -- final select
    DECLARE @topN INT
    SET @topN = 50
    /* START this is the sql you'd run once at the end */
    SELECT
        a.k
      , a.v
    FROM (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
          , k
          , v
        FROM @tbl
    ) a
    WHERE a.rid < @topN
    /* END */
END
Gee, it doesn't seem like much work to try the old fashioned way of just doing it in-memory.
I would just try doing it first; then, if you run out of memory, try one key per run (as per @Storstamp's answer).
If using the "simple" solution is not an option due to the size of the data, my next choice would be to use an SQL database. However, as most of these require quite much memory (and coming down to a crawl when heavily overloaded in RAM), maybe you should redirect your search into something like a NoSQL database such as MongoDB that can be quite efficient even when mostly disk-based. (Which your environment basically requires, having only 1GB of heap available).
The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of all indexes, sorting it), and may probably do it a bit more efficient than your solution, due to the fact that all data may be sorted and indexed already when inserted, removing the extra steps of sorting the lines in the /chunk- files, merging them etc.
You will end up with a solution that is probably much easier to administrate, and it will also allow you to set up different kind of queries, instead of being optimized only for this specific case.
As a project manager I would not oppose your current solution. It is already fast and solves the problem. As an architect, however, I would object, because the solution is a bit hard to maintain and does not use proven technologies that do partially the same thing as what you have coded on your own. It is hard to beat the tree and hash implementations of modern databases.

insert data into a lot of columns in java JDBC

I have a table with 50 columns and I want to insert all items in a HashMap variable into it (HashMap keys and table column names are the same).
How can I do that without writing 50 lines of code?
Get the key set of the HashMap. Iterate that key set to build a String containing your INSERT statement. Use the resulting String to create a PreparedStatement. Then iterate that key set again, in the same order, to set the parameters using the Objects you retrieve from the HashMap.
You might have to write a few extra lines of special-case code if any of your values are of a Class that the JDBC driver isn't sure how to map.
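A sketch of that approach (hypothetical table name and connection handling; only the fixed column order and the setObject loop matter here):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapInserter {

    public static int insert(Connection conn, String table, Map<String, Object> row)
            throws SQLException {
        // Fix the column order once so the SQL text and the parameters line up.
        List<String> columns = new ArrayList<>(row.keySet());

        String sql = "INSERT INTO " + table + " ("
                + String.join(", ", columns)
                + ") VALUES ("
                + columns.stream().map(c -> "?").collect(Collectors.joining(", "))
                + ")";

        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int index = 1;
            for (String column : columns) {
                // setObject lets the driver map common Java types; special-case
                // anything the driver can't map on its own.
                ps.setObject(index++, row.get(column));
            }
            return ps.executeUpdate();
        }
    }
}

Since the SQL text is built from the map's keys, make sure those keys come from your own code rather than from untrusted input.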
I'd suggest you bite the bullet and simply write a method that does the dirty work for you, containing 50 lines of parameter-setting code. This isn't so bad, and you only have to write it once. I hope you aren't that lazy ;-)
And by the way, isn't 50 columns in a table a bit much? Perhaps a normalization process could help lower the complexity of your database and the code that manipulates it.
Another way to go is to use an ORM like Hibernate, or a more lightweight approach like Spring JDBC template.
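For the Spring JDBC route, SimpleJdbcInsert does roughly the same thing with very little code; the table name and DataSource wiring below are assumptions.

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.simple.SimpleJdbcInsert;

public class SpringMapInserter {

    private final SimpleJdbcInsert insert;

    public SpringMapInserter(DataSource dataSource) {
        // Column metadata is read from the table itself, so the map keys
        // just need to match the column names.
        this.insert = new SimpleJdbcInsert(dataSource).withTableName("my_table");
    }

    public int save(Map<String, Object> row) {
        return insert.execute(row);
    }
}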
Call map.keySet() to get the names of all the columns.
Create an INSERT statement by iterating the key set.
The column is from an item (a key) in the key set.
The data is from map.get(key).
