I am modelling a Cassandra schema to get a bit more familiar with the subject, and I was wondering what the best practice is regarding creating indexes.
For example:
create table emailtogroup(email text, groupid int, primary key(email));
select * from emailtogroup where email='joop';
create index on emailtogroup(groupid);
select * from emailtogroup where groupid=2 ;
Or I can create an entirely new table:
create table grouptoemail(groupid int, email text, primary key(groupid, email));
select * from grouptoemail where groupid=2;
They both do the job.
I would expect creating a new table to be faster, because groupid then becomes the partition key. But I'm not sure what "magic" happens when creating an index, and whether that magic has a downside.
In my opinion, your first approach is correct:
create table emailtogroup(email text, groupid int, primary key(email));
because 1) in your case email is essentially unique, which makes it a good candidate for the primary key, and 2) multiple emails can belong to the same group, which makes groupid a good candidate for a secondary index. Please refer to this post - Cassandra: choosing a Partition Key
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible.
The second form of table creation is useful for range scans. For example if you have a use case like
i) List all the email groups which the user has joined from 1st Jan 2010 to 1st Jan 2013.
In that case you may have to design a table like
create table grouptoemail(email text, ts timestamp, groupid int, primary key(email, ts));
In this case all the email groups the user has joined will be clustered on disk (stored together on disk).
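For example (the dates here are just illustrative), a query against that table stays within one partition and range-scans the clustering column:
select groupid from grouptoemail where email='joop' and ts >= '2010-01-01' and ts <= '2013-01-01';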
It depends on the cardinality of groupid. The Cassandra docs say:
When not to use an index
Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.
Naturally, there is no support for counter columns, in which every value is distinct.
Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
So basically, if you are going to be dealing with a large dataset, and groupid won't return a lot of rows, a secondary index may not be the best idea.
Week #4 of DataStax Academy's Java Development with Apache Cassandra class talks about how to model these problems efficiently. Check that out if you get a chance.
Related
How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec();
    scanSpec.withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":from", stringStart)
                    .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work, I'm getting less results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScore's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, and then create indexes to support less frequently used access patterns and improve querying flexibility. Reads are very fast when you provide the full (or, although not as fast, partial) primary key, i.e. when you use queries, so DynamoDB works best when the table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan request can retrieve a maximum of 1MB of data from a single partition and is capped at that partition's read capacity. When a scan hits either limit, the follow-up requests happen sequentially, meaning a scan on a large table can take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the item, no matter how much (or little) data is returned. If you only request a small amount of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still paid to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to accomplish retrieving data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) would be to query the base table (or an index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
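As a rough sketch (assuming GameName is the partition key and CreatedAt the sort key of the table or index you are querying, and reusing the formatted date strings and the gamesScoresTable object from your scan code), the Document API query could look like:
// QuerySpec and QueryOutcome live in the same com.amazonaws.services.dynamodbv2.document
// packages as the ScanSpec you are already using.
QuerySpec querySpec = new QuerySpec()
        .withKeyConditionExpression("GameName = :game AND CreatedAt BETWEEN :from AND :to")
        .withValueMap(new ValueMap()
                .withString(":game", "PacMan")   // hypothetical partition key value
                .withString(":from", stringStart)
                .withString(":to", stringEnd));

ItemCollection<QueryOutcome> items = gamesScoresTable.query(querySpec);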
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's data structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalizing your data can also support more access patterns. I would recommend watching this video on how to structure your data.
Moving on to: Do I need to create a GSI?
GSIs do not support strongly consistent reads, so if you need those, you'll need to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to add an LSI. Another difference between the two is the primary key: a GSI can have a different partition key and sort key from the base table, while an LSI can only differ in its sort key. More about indexes.
Consider a table creation:
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table (
    my_id varchar,
    my_date varchar,
    enum_one varchar,
    enum_two varchar,
    PRIMARY KEY (my_id, my_date, enum_one, enum_two)
);
Columns enum_one and enum_two have a fixed number of values (6 and ~20). Should I include the enum columns in the primary key or not?
Consider a situation where I have many rows with one enum_one value and only a few with the other values.
How does Cassandra handle this situation - does it balance the load, or do most requests go to one node?
Cassandra load balances based on the partition key, so if you included the enum columns in the partition key, then it would have an effect on the load balancing.
In your example, you are using my_id as the partition key. If your reads and writes tend to have different values for my_id, then this should keep your data balanced.
If your reads and writes tend to mostly use just a few values for my_id (i.e. if my_id has low cardinality), then the data will not be well load balanced across the Cassandra nodes. If that is the case, then including the enum fields would increase the cardinality of the partition key and result in a more evenly balanced data load.
The flip side of this is that using a different partition key may impact what types of queries you can do efficiently. It is efficient to query for data within a single partition, so if you included the enum columns in the partition key, then you would have to query for each value of the enum columns in separate queries instead of a single query.
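For example, a hypothetical variation that pulls enum_one into the partition key would spread the data further, but every read then has to name the enum value it wants (or list them with IN):
CREATE TABLE my_table_by_enum (
    my_id varchar,
    my_date varchar,
    enum_one varchar,
    enum_two varchar,
    PRIMARY KEY ((my_id, enum_one), my_date, enum_two)
);

-- one query per enum_one value ('some-id', 'A', 'B' are placeholders):
SELECT * FROM my_table_by_enum WHERE my_id = 'some-id' AND enum_one = 'A';
SELECT * FROM my_table_by_enum WHERE my_id = 'some-id' AND enum_one = 'B';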
My use case is like this: I am inserting 10 million rows in a table described as follows:
keyval bigint, rangef bigint, arrayval blob, PRIMARY KEY (rangef, keyval)
and input data is like follows -
keyval - some timestamp
rangef - a random number
arrayval - a byte array
I am making my primary key a composite key because, after inserting the 10 million rows, I want to perform range scans on keyval. keyval contains a timestamp, and my query will be something like: give me all the rows between this time and that time. Hence, to perform this kind of SELECT query, I have a composite primary key.
Now, the ingestion performance was very good and satisfactory. But when I ran the query described above, the performance was very low: when I asked for all the rows between t1 and t1 + 3 minutes, almost 500k records were returned in 160 seconds.
My query is like this
Statement s = QueryBuilder.select().all().from(keySpace, tableName)
        .allowFiltering()
        .where(QueryBuilder.gte("keyval", 1411516800))
        .and(QueryBuilder.lte("keyval", 1411516980));
s.setFetchSize(10000);

ResultSet rs = sess.execute(s);
for (Row row : rs)
{
    count++;
}
System.out.println("Batch2 count = " + count);
I am using the default partitioner, i.e. the Murmur3 partitioner.
My cluster configuration is -
No. of nodes - 4
No. of seed nodes - 1
No. of disks - 6
MAX_HEAP_SIZE for each node = 8G
Rest configuration is default.
How can I improve my range scan performance?
You are actually performing a full table scan and not a range scan. This is one of the slowest queries possible for Cassandra and is usually only used by analytics workloads. If your queries require ALLOW FILTERING for an OLTP workload, something is most likely wrong. Basically, Cassandra has been designed with the knowledge that queries which require accessing the entire dataset will not scale, so a great deal of effort is made to make it simple to partition data and to quickly access data within a partition.
To fix this you need to rethink your data model and think about how you can restrict the data to queries on a single partition.
RussS is correct: your problems are caused both by the use of ALLOW FILTERING and by not limiting your query to a single partition.
How can I improve my range scan performance?
By limiting your query with a value for your partitioning key.
PRIMARY KEY (rangef, keyval)
If the above is indeed correct, then rangef is your partitioning key. Alter your query to first restrict for a specific value of rangef (the "single partition", as RussS suggested). Then your current range query on your clustering key keyval should work.
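A sketch of that restricted query with the same QueryBuilder API (the rangef value here is just a placeholder):
Statement s = QueryBuilder.select().all().from(keySpace, tableName)
        .where(QueryBuilder.eq("rangef", 123L))   // single partition, no ALLOW FILTERING needed
        .and(QueryBuilder.gte("keyval", 1411516800))
        .and(QueryBuilder.lte("keyval", 1411516980));
s.setFetchSize(10000);
ResultSet rs = sess.execute(s);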
Now, that query may not return anything useful to you. Or you might have to iterate through many rangef values on the application side, and that could be cumbersome. This is where you need to re-evaluate your data model and come up with an appropriate key to partition your data by.
I made a secondary index on keyval, and my query performance improved: from 160 seconds it dropped to 40 seconds. So does indexing the keyval field make sense?
The problem with relying on secondary indexes is that they may seem fast at first but get slow over time. Especially with a high-cardinality column like a timestamp (keyval), a secondary index query has to go out to each node and ultimately scan a large number of rows to get a small number of results. It's always better to duplicate your data in a new query table than to rely on a secondary index query.
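A sketch of such a query table (the table name and bucketing scheme here are hypothetical): partition by a coarse time bucket derived from the timestamp, so a time-range query hits one or a few partitions instead of a secondary index:
CREATE TABLE data_by_hour (
    hour_bucket bigint,   -- keyval truncated to the hour, computed by the application on insert
    keyval bigint,
    rangef bigint,
    arrayval blob,
    PRIMARY KEY (hour_bucket, keyval)
);

SELECT * FROM data_by_hour
WHERE hour_bucket = 1411516800
  AND keyval >= 1411516800 AND keyval <= 1411516980;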
In my PostgreSQL database, I have a bigint and its base36 conversion.
I have URLs containing the short_id in base36 as well as the decimal version, but I will query according to the URL, so by the base36 value.
Can the base36 short_id be my primary key for better performance?
At least in PostgreSQL, the PRIMARY KEY isn't about performance. It's about correctness and data structure. The query planner doesn't care about the PRIMARY KEY, only about the UNIQUE NOT NULL index that's created by defining the PRIMARY KEY constraint.
You can define whatever indexes you like. Want another unique index? Just create one.
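For example (the table and column names here are hypothetical):
-- keep the surrogate bigint as the PRIMARY KEY, and add a unique index
-- on the base36 form used in the URLs:
CREATE UNIQUE INDEX short_links_short_id_b36_key ON short_links (short_id_b36);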
If the base36 column is guaranteed unique then yes, it is a candidate for a primary key. Whether it's the best choice, and whether it's actually any faster than whatever you're currently doing, is somewhat app dependent.
Note that extra indexes aren't free - they do incur a cost for inserts and updates. So don't go crazy creating multiple indexes on every column for write-heavy tables.
BTW, some other database systems do have stronger performance implications for the PRIMARY KEY. In particular, on DB systems that use index-organized tables (where the main table is in a b-tree structure) the choice of the clustering key - usually also the primary key - is a big thing for performance.
In PostgreSQL every table is just a heap, so it's not relevant.
I'm looking to create a table for users and tracking their objectives. The objectives themselves would be on the order of 100s, if not 1000s, and would be maintained in their own table, but it wouldn't know who completed them - it would only define what objectives are available.
Objective:
ID | Name | Notes |
----+---------+---------+
| | |
Now, in the Java environment, the users will have a java.util.BitSet for the objectives. So I can go
/* in class User */
boolean hasCompletedObjective(int objectiveNum) {
    if (objectiveNum < 0 || objectiveNum > objectives.length())
        throw new IllegalArgumentException("Objective " + objectiveNum + " is invalid. Use a constant from class Objective.");
    return objectives.get(objectiveNum);
}
I know internally, the BitSet uses a long[] to do its storage. What would be the best way to represent this in my Derby database? I'd prefer to keep it in columns on the AppUser table if at all possible, because they really are elements of the user.
Derby does not support arrays (to my knowledge), and while I'm not sure what the column limit is, something seems wrong with having 1000 columns, especially since I know I will not be querying the database with things like
SELECT *
FROM AppUser
WHERE AppUser.ObjectiveXYZ
What are my options, both for storing it, and marshaling it into the BitSet?
Are there viable alternatives to java.util.BitSet?
Is there a flaw in the general approach? I'm open to ideas!
Thanks!
*EDIT: If at all possible, I would like the ability to add more objectives with only a data modification, not a table modification. But again, I'm open to ideas!
[puts on fake moustache]
Store the bitset as a BLOB. Start by simply serializing it; then, if you want more space-efficiency, try pushing the results through a DeflaterOutputStream on their way to the database. For better space- and time-efficiency, try the bitmap compression method used in FastBit, which breaks the bitset into 31-bit chunks, run-length encodes the all-zero chunks, and packs the literal and run chunks into 32-bit words along with a discriminator bit.
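A minimal sketch of that first step, using helper method names of my own choosing and the Java 7 BitSet.toByteArray()/valueOf() pair for the raw bytes:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.BitSet;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// BitSet -> compressed bytes, ready for PreparedStatement.setBytes() on a BLOB column.
static byte[] toCompressedBlob(BitSet objectives) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DeflaterOutputStream deflater = new DeflaterOutputStream(buffer)) {
        deflater.write(objectives.toByteArray());
    }
    return buffer.toByteArray();
}

// Compressed bytes (e.g. from ResultSet.getBytes()) -> BitSet.
static BitSet fromCompressedBlob(byte[] blobBytes) throws IOException {
    try (InflaterInputStream inflater =
             new InflaterInputStream(new ByteArrayInputStream(blobBytes))) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = inflater.read(chunk)) != -1) {
            raw.write(chunk, 0, n);
        }
        return BitSet.valueOf(raw.toByteArray());
    }
}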
If you know you'll only look at the objective bitset while the ResultSet that brought it from the database is still open, write a new bitset class that wraps the Blob interface and implements get on top of getBytes. This avoids having to read the whole BLOB into memory to check a few specific bits, and at least avoids having to allocate a separate buffer for the bitset if you do want to look at all the values. Note that making this work with a compressed bitset will take substantial ingenuity.
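A bare-bones sketch of that wrapper, assuming the BLOB holds the uncompressed BitSet.toByteArray() layout (bit n lives in byte n/8; JDBC Blob positions are 1-based):
import java.sql.Blob;
import java.sql.SQLException;

// Read-only bitset view over a BLOB column, fetching one byte per lookup.
final class BlobBitSet {
    private final Blob blob;

    BlobBitSet(Blob blob) {
        this.blob = blob;
    }

    boolean get(int bitIndex) throws SQLException {
        long byteIndex = bitIndex / 8;
        if (byteIndex >= blob.length()) {
            return false;   // past the end of the stored bytes: bit is clear
        }
        byte b = blob.getBytes(byteIndex + 1, 1)[0];
        return ((b >> (bitIndex & 7)) & 1) != 0;
    }
}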
Be aware that this approach gives you no referential integrity, and no ability to query on the user-objective relationship, little flexibility for different uses of the data in future, and is exactly the kind of thing that Don Knuth warned you about.
The orthodox way to do this does not involve bitsets at all. You have a table for users, a table for objectives, and a join table, indicating which objectives a user has. Something like:
create table users (
id integer primary key,
name varchar(100) not null
);
create table objectives (
id integer primary key,
name varchar(100) not null
);
create table user_objective (
user_id integer not null references users,
objective_id integer not null references objectives,
primary key (user_id, objective_id)
);
Whenever a user has an objective, you put a row in the join table indicating the fact.
If you want to get the results into a bitset for a user, do an outer join of the user onto the objectives table via the join table, so that you get a row back for every objective, with a single column containing, say, a 1 for each joined objective or a 0 if there was no join.
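That join could look like this in Derby (using the column names from the tables above; the parameter is the user's id):
select o.id,
       case when uo.user_id is null then 0 else 1 end as has_objective
from objectives o
left outer join user_objective uo
  on uo.objective_id = o.id
 and uo.user_id = ?
order by o.id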
The orthodox approach would also be to use a Set on the Java side, rather than a bitset. That maps very nicely onto the join table. Have you considered doing it this way?
If you're worried about memory consumption, a set will use about one pointer per objective a user actually has; the bitset will use a bit per possible objective. Most JVMs have 32-bit pointers (only old or huge-heaped 64-bit JVMs have 64-bit pointers), so if each user has on average less than 1/32nd of the possible objectives, the set will use less memory. There are some groovy data structures which will be able to store this information more compactly than either of those structures, but let's leave that to another question.