Let’s say I have an album table where the partition key is author and the sort key is album. Each item also has price, startDate and endDate attributes. Let’s say I want to find all the albums where “author=a”, “album=b”, “startDate < c”, “endDate > d” and “price is between e and f”, sorted by price. Is the most efficient way to do that to query on the partition key and sort key, then filter the results on conditions c, d, e and f, and then sort by price? Can a secondary index help here? (It seems one secondary index can only be used to query on one or two non-key attributes, but my use case requires < and > operations on multiple non-key attributes and then sorting.)
Thanks!
I am working through a similar schema design process.
The short answer is it will depend on exactly how much data you have that falls into the various categories, as well as on the exact QUERIES you hope to run against that data.
The main thing to remember is that you can only ever QUERY based on your Sort Key (where you know the Partition Key) but you ALSO have to maintain uniqueness in order to not overwrite needed data.
A good way to visualize this in your case would be as follows:
Each Artist is Unique (Artist seems to me like a good Partition Key)
Each Artist can have Multiple Albums, making this a good Sort Key (in cases where you will search for an Album for a known Artist)
In the above case your Sort Key is being combined with your Partition Key to create your composite primary key per the following answer (which is worth a read!) to allow you to write a query where you know the artist but only PART of the title.
I.e. here artist = "Pink Floyd", QUERY where string album contains "Moon" (note that contains is only valid in a FilterExpression; within the KeyConditionExpression itself, the sort key supports operators like begins_with and BETWEEN).
That would match "Pink Floyd" Dark Side of the Moon. A sketch of such a query follows below.
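For illustration, a minimal sketch of such a query with the AWS SDK v1 Document API (the same API used in the code further down this page); the table name "Albums" and the attribute names "artist" / "album" are my assumptions, not something from the original post:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

public class AlbumQuery {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        Table albums = new DynamoDB(client).getTable("Albums");

        // The key condition pins the partition; contains() on the sort key is
        // only allowed as a filter, applied AFTER the partition's items are read.
        QuerySpec spec = new QuerySpec()
                .withKeyConditionExpression("artist = :a")
                .withFilterExpression("contains(album, :frag)")
                .withValueMap(new ValueMap()
                        .withString(":a", "Pink Floyd")
                        .withString(":frag", "Moon"));

        for (Item item : albums.query(spec)) {
            System.out.println(item.getString("album"));
        }
    }
}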
That being said you would only ever have one "Price" for Pink Floyd - Dark Side of the Moon since the Partition Key and Sort Key combine to handle uniqueness. You would overwrite the existing object when you updated the entry with your second price.
So the real question is, what are the best Sort Keys for my use case?
To answer that you need to know what your most frequent QUERIES will be before you build the system.
Price Based Queries?
In your question you mentioned Price attributes in a case where you appear to know the artist and album.
“author=a”, “album=b”, “startDate < c”, “endDate > d” and “price is between e and f”, sorted by price
To me, in this case you probably DO NOT know the Artist, or if you do, you probably do not know the Album, since you are likely looking to write a Query that returns albums from multiple artists, or at least multiple Albums from the same artist.
HOWEVER
That may not be the case if you are creating a database that contains multiple entries (say from multiple vendors selling the same artist / album at different prices). In that case I would say the easiest way is to store only ONE entry for an Artist-Album (partition key) at a given price (sort key), but you would lose all other entries that match that same price for that Artist-Album.
Multiple Queries MAY require Multiple Tables
I had a similar use case and ended up needing to create multiple tables in order to handle my queries. Data is passed / processed from one table and spit out into another one using a Lambda that is triggered on insertion (sketched below). I then send some queries to one table and other queries to the initial table.
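A rough sketch of that pattern, not production code: it assumes DynamoDB Streams is enabled on the source table, a hypothetical second table named "AlbumsByPrice", and the aws-lambda-java-events 2.x library (where stream records reuse the SDK's AttributeValue type):

import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;

public class FanOutHandler implements RequestHandler<DynamodbEvent, Void> {
    private final AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbEvent.DynamodbStreamRecord record : event.getRecords()) {
            if (!"INSERT".equals(record.getEventName())) {
                continue; // only fan out newly inserted items
            }
            // Copy the new item into the second, differently-keyed table.
            // Assumes the image already contains that table's key attributes.
            Map<String, AttributeValue> newImage = record.getDynamodb().getNewImage();
            client.putItem("AlbumsByPrice", newImage);
        }
        return null;
    }
}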
Related
How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec()
            .withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":from", stringStart)
                    .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, and then create indexes to support lesser-used access patterns and improve querying flexibility. Reads are fastest when you provide the full primary key (and still fast, though less so, with a partial key), i.e. when using queries, so DynamoDB is at its best when the table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning, and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan retrieves every item in your table and, unless all of your items live in a single partition, it is the only way to do so. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan can only retrieve a maximum of 1MB of data, of a single partition, and is capped at that partition's read capacity. When a scan tops out at either limitation, consecutive scans happen sequentially, meaning a scan on a large table could take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the item, no matter how much (or little) data is returned. If you only request a small amount of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still paid to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
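For illustration, a sketch of a parallel scan using the same Document API as the question's code (reusing the asker's gamesScoresTable; in practice each segment would run on its own thread or worker):

// Needs com.amazonaws.services.dynamodbv2.document.Item in addition to the
// classes already used in the question's snippet.
int totalSegments = 4;
for (int segment = 0; segment < totalSegments; segment++) {
    // Each (segment, totalSegments) pair scans a disjoint slice of the table.
    ScanSpec spec = new ScanSpec()
            .withTotalSegments(totalSegments)
            .withSegment(segment);
    for (Item item : gamesScoresTable.scan(spec)) {
        // process item ...
    }
}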
Let's now look at: How can I select all items, based on some criteria?
The ideal way to accomplish retrieving data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) would be to query the base table (or index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
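For illustration, here is a sketch of how the asker's scan could become such a query, assuming (hypothetically) that GameName is the table's partition key and createdAt its sort key, and reusing the question's gamesScoresTable:

// Needs: com.amazonaws.services.dynamodbv2.document.spec.QuerySpec and
// com.amazonaws.services.dynamodbv2.document.{Item, ItemCollection, QueryOutcome}
public void getItemsByDate(String gameName, Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");

    QuerySpec spec = new QuerySpec()
            .withKeyConditionExpression("GameName = :g AND createdAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":g", gameName)
                    .withString(":from", df.format(start))
                    .withString(":to", df.format(end)));

    // Unlike the scan, this reads only the one partition, in sort-key order.
    ItemCollection<QueryOutcome> items = gamesScoresTable.query(spec);
    for (Item item : items) {
        System.out.println(item.toJSONPretty());
    }
}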
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table data structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalization of data also supports more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strong read consistency, so if you need that, you'll need to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to create an LSI, since LSIs can only be defined at table creation time. Another difference between the two is the primary key: a GSI can have a different partition and sort key from the base table, while an LSI can only differ in sort key. More about indexes.
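For completeness, a sketch of querying a GSI through the same Document API; the index name "UserId-createdAt-index" and its keys (partition key UserId, sort key createdAt) are purely hypothetical:

// Needs com.amazonaws.services.dynamodbv2.document.Index.
Index byUser = gamesScoresTable.getIndex("UserId-createdAt-index");
ItemCollection<QueryOutcome> items = byUser.query(new QuerySpec()
        .withKeyConditionExpression("UserId = :u AND createdAt BETWEEN :from AND :to")
        .withValueMap(new ValueMap()
                .withString(":u", "user-123")
                .withString(":from", stringStart)
                .withString(":to", stringEnd)));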
Just investigating the abilities of Google App Engine and very interested in its Search API. I love how you can define different fields to be automatically tokenised and sort the search results in different ways.
My question is: can you have the results sorted in a way such that certain fields get more priority than others?
Example:
A document with two fields, title and body. It would be good if search queries that matched titles were ranked more highly than queries that matched the body.
Is this possible?
Cheers
Unfortunately, it's not possible at the moment. From the documentation:
By default, search returns its results by descending rank. Also by default, the Search API sets the rank of each document to seconds since Jan 1st 2011. This results in the freshest documents being returned first. However, if you don’t need documents to be sorted by the time they were added, you can use rank for other purposes. Suppose you have a real estate application. What customers want most is sorting by price. For an efficient default sort, you could set the rank to the house price.

If you need multiple sort orders such as price low-to-high and price high-to-low, you can create a separate index for each order. One index would have rank = price and the other rank = MAXINT - price (since rank must be positive).
In your use case, you can retrieve documents that have a match in their title in one query, and then retrieve documents with a match in their body in a second query. Obviously, you can specify different rules (or even a set of rules), e.g.:
if the first query returns more than X results, do not do the second query
retrieve the first 20 documents by title, and if the date of the last document is less than A, retrieve the first 10 documents by body
retrieve the best 15 documents by title and add the best 5 documents by body
and so on. The rules, of course, depend on your domain and the way you try to prioritize (rank) the documents.
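As a sketch of the last rule above (best 15 by title plus best 5 by body) using the Java Search API; the index name "documents" and the field names title / body are assumptions:

import java.util.LinkedHashMap;
import java.util.Map;

import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.Query;
import com.google.appengine.api.search.QueryOptions;
import com.google.appengine.api.search.ScoredDocument;
import com.google.appengine.api.search.SearchServiceFactory;

public class PrioritizedSearch {
    public static Map<String, ScoredDocument> search(String term) {
        Index index = SearchServiceFactory.getSearchService()
                .getIndex(IndexSpec.newBuilder().setName("documents").build());

        // Title matches first, body matches second; the LinkedHashMap keeps
        // that order and de-duplicates documents that match in both fields.
        Map<String, ScoredDocument> merged = new LinkedHashMap<>();
        for (ScoredDocument doc : index.search(query("title:" + term, 15))) {
            merged.put(doc.getId(), doc);
        }
        for (ScoredDocument doc : index.search(query("body:" + term, 5))) {
            merged.putIfAbsent(doc.getId(), doc);
        }
        return merged;
    }

    private static Query query(String queryString, int limit) {
        return Query.newBuilder()
                .setOptions(QueryOptions.newBuilder().setLimit(limit).build())
                .build(queryString);
    }
}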
I have some Objects (currently 20 000 +) that have the following Attributes
Object
----------
String name
int rating
I want to create an ELO rating for all these Objects. This implies the following:
To adjust the ELO of 2 Objects matched against each other I need to find those Objects in the list by name.
To Display the List I need to get every Object ordered by its rating.
The whole program will be implemented in Java, but I think it's independent of the programming language.
Currently I am unsure which data model to choose for this Project. A friend advised me to use a 2-4 tree to insert the Objects ordered by name so I can change the rating of each object fast.
However the Objects are printed in order of their rating rather than by name and I don't want to sort so many Objects every time I output the list.
As I am pretty new to data structures: What is the usual way to solve this problem?
Creating another tree ordered by rating?
Having a list of ratings and each rating linked to each Object currently having that rating?
Edit: ELO rating is a mapping from the set of objects to the integers. Each object only gets one rating but a rating can have multiple Objects associated with it.
Creating another tree ordered by rating? Having a list of ratings and each rating linked to each Object currently having that rating?
Well, this is one way to do it, but it will also take a lot of space, since you have 20K+ entries.
The best way I can think of is:
Use a data structure like a multimap with key = name and value = rating.
This way, every time you insert a new object into the multimap, it will take O(log N) time.
To find all ratings with the same name, use equal_range, which is also an O(log N) operation.
Hope this helps!
A HashMap with name as the key will give you O(1) performance when matching the elements; keep a TreeSet with a rating Comparator and you'll have the items in order. Although you'll need to reinsert an element if its rating changes.
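A minimal sketch of that idea (class and field names are illustrative); the comparator breaks rating ties by name so that distinct players with equal ratings can coexist in the TreeSet:

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class Rankings {
    static final class Player {
        final String name;
        int rating;
        Player(String name, int rating) { this.name = name; this.rating = rating; }
    }

    private final Map<String, Player> byName = new HashMap<>();
    private final TreeSet<Player> byRating = new TreeSet<>(
            Comparator.comparingInt((Player p) -> p.rating)
                      .thenComparing(p -> p.name));

    public void add(Player p) {
        byName.put(p.name, p);
        byRating.add(p);
    }

    // O(1) lookup, O(log n) remove + reinsert, as the answer describes.
    public void updateRating(String name, int newRating) {
        Player p = byName.get(name); // assumes the player exists
        byRating.remove(p);          // must remove BEFORE mutating the sort key
        p.rating = newRating;
        byRating.add(p);
    }

    public Iterable<Player> inRatingOrder() {
        return byRating;             // ascending; use descendingSet() for leaderboards
    }
}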
I am modelling a Cassandra schema to get a bit more familiar with the subject, and was wondering what the best practice is regarding creating indexes.
For example:
create table emailtogroup(email text, groupid int, primary key(email));
select * from emailtogroup where email='joop';
create index on emailtogroup(groupid);
select * from emailtogroup where groupid=2 ;
Or i can create a entire new table:
create table grouptoemail(groupid int, email text, primary key(groupid, email));
select * from grouptoemail where groupid=2;
They both do the job.
I would expect creating a new table to be faster, because groupid then becomes the partition key. But I'm not sure what "magic" happens when creating an index, and whether this magic has a downside.
In my opinion, your first approach is correct.
create table emailtogroup(email text, groupid int, primary key(email));
because 1) in your case email is more or less unique, a good candidate for the primary key, and 2) multiple emails can belong to the same group, a good candidate for a secondary index. Please refer to this post - Cassandra: choosing a Partition Key
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible.
The second form of table creation is useful for range scans. For example if you have a use case like
i) List all the email groups which the user has joined from 1st Jan 2010 to 1st Jan 2013.
In that case you may have to design a table like
create table grouptoemail(email text, ts timestamp, groupid int, primary key(email, ts));
In this case all the email groups which the user joined will be clustered together on disk (stored contiguously).
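For illustration, that range query issued through the DataStax Java driver (3.x API); the contact point, keyspace name and dates are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class GroupsByDate {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        // Clustering on ts makes this a contiguous read within one partition.
        ResultSet rs = session.execute(
                "SELECT groupid FROM grouptoemail WHERE email = ? AND ts >= ? AND ts <= ?",
                "joop",
                java.sql.Timestamp.valueOf("2010-01-01 00:00:00"),
                java.sql.Timestamp.valueOf("2013-01-01 00:00:00"));

        for (Row row : rs) {
            System.out.println(row.getInt("groupid"));
        }
        cluster.close();
    }
}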
It depends on the cardinality of groupid. The Cassandra docs:
When not to use an index

Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.

Naturally, there is no support for counter columns, in which every value is distinct.

Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
So basically, if you are going to be dealing with a large dataset, and groupid won't return a lot of rows, a secondary index may not be the best idea.
Week #4 of DataStax Academy's Java Development with Apache Cassandra class talks about how to model these problems efficiently. Check that out if you get a chance.
I have 2 collections in a mongodb database.
example:
employee(collection)
_id
name
gender
homelocation (double[] indexed as geodata)
companies_worked_in (reference, list of companies)
companies(collection)
_id
name
...
Now I need to query all companies whose name starts with "wha" and that have/had employees living near (13.444519, 52.512878).
How do I do that without it taking too long?
With SQL it would've been a simple join (without the geospatial search of course... :( )
You can issue 2 queries (the queries I wrote are in JavaScript).
First query extracts all companies whose name starts with wha.
db.companies.find({name: {$regex: "^wha"}}, {_id: 1})
Second query can be like
db.employees.find({homelocation: {$near: [x,y]}, companies_worked_in: {$in: [result_from_above_query]} }, {companies_worked_in: 1})
Now simply filter companies_worked_in and keep only those companies whose name starts with wha. I know it seems like the first query is useless in this case, but a lot of records will be filtered out by the $in query.
You might have to write some intermediate code between these two queries (a sketch is below). I know this is not a single-query solution, but it is one possible way to go, and performance is also good depending on which fields you index. In this case, creating an index on name (companies collection), plus a geo-index on homelocation and an index on companies_worked_in (employee collection), would help you gain performance.
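If it helps, here is a sketch of that intermediate code using the MongoDB Java driver; the connection string and database name are placeholders, and homelocation is assumed to have a legacy 2d index, per the question:

import java.util.ArrayList;
import java.util.List;

import org.bson.Document;
import org.bson.types.ObjectId;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.in;
import static com.mongodb.client.model.Filters.near;
import static com.mongodb.client.model.Filters.regex;
import static com.mongodb.client.model.Projections.include;

public class CompanySearch {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost");
        MongoDatabase db = client.getDatabase("test");
        MongoCollection<Document> companies = db.getCollection("companies");
        MongoCollection<Document> employees = db.getCollection("employee");

        // Query 1: ids of companies whose name starts with "wha".
        List<ObjectId> ids = new ArrayList<>();
        for (Document c : companies.find(regex("name", "^wha")).projection(include("_id"))) {
            ids.add(c.getObjectId("_id"));
        }

        // Query 2: employees near the point, restricted to those companies.
        for (Document e : employees.find(and(
                        near("homelocation", 13.444519, 52.512878, null, null),
                        in("companies_worked_in", ids)))
                .projection(include("companies_worked_in"))) {
            System.out.println(e.get("companies_worked_in"));
        }
        client.close();
    }
}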
P.S.
I doubt you could create a compound index over homelocation and companies_worked_in, since both are arrays; you would likely have to index on only one of these fields.
Suggestion
Store the company name in the employee collection as well. That way you can avoid the first query.