Prerequisite: I need to fetch all matching results (more than 40K rows) in a single query.
Requirement: Two tables, product and product_category. I am trying to fetch all products with a matching category from the product_category table.
Table Structure :
CREATE TABLE catalog.product (
    product_id string PRIMARY KEY index using plain,
    name string,
    sku string
) clustered by (product_id) into 4 shards;
create table catalog.product_category (
    category_id string primary key index using plain,
    product_id string primary key index using plain,
    INDEX product_category_index using plain(product_id, category_id),
    active boolean,
    parent_category_id integer,
    updated_at timestamp
);
Join Query :
select p.product_id
from catalog.product_category pc
join catalog.product p on p.product_id = pc.product_id
limit 40000;
Tried multiple things: indexing product_id (both as integer and string), etc.
Result: Returning the ~35K results takes more than 90 seconds every time.
Problem : What can I do to optimize the query response time?
Some other information:
- CPU cores: 4
- Tried with one or multiple nodes
- Default sharding
- Total number of products: 35K; product_category also has only 35K entries.
Use case: I am trying to use CrateDB as a persistent cache, but with the current query response time we can't really do that, so we would have to move to an in-memory store like Redis or Memcached. The reason for choosing CrateDB was the ability to query persistent data.
Joining on STRING columns is much more expensive than joining on NUMERIC types (an equality check on string values costs far more than on numeric values). If you don't have a special reason for using strings, I'd recommend changing them to a NUMERIC type (e.g. INTEGER, LONG, ...); this should improve the query speed by a significant factor.
By the way, index using plain is the default index setting for all columns, so you can simply leave it out.
Also, the compound index product_category_index will not help the join query; it only matters if you filter on those columns in a WHERE clause.
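A sketch of what the simplified schema could look like with numeric join keys (assuming the ids can in fact be represented as integers; the explicit plain indexes and the compound index are left out per the notes above):

CREATE TABLE catalog.product (
    product_id integer PRIMARY KEY,
    name string,
    sku string
) clustered by (product_id) into 4 shards;

CREATE TABLE catalog.product_category (
    category_id integer PRIMARY KEY,
    product_id integer PRIMARY KEY,
    active boolean,
    parent_category_id integer,
    updated_at timestamp
);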
Updated
Another improvement is adding an ORDER BY p.product_id, pc.product_id clause; with it, the join algorithm can stop as soon as it has produced enough rows for the applied LIMIT.
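A sketch of the adjusted query:

select p.product_id
from catalog.product_category pc
join catalog.product p on p.product_id = pc.product_id
order by p.product_id, pc.product_id
limit 40000;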
Related
I'm trying to make a blog system of sorts and I ran into a slight problem.
Simply put, there are 3 columns in my article table:
id SERIAL,
category VARCHAR FK,
category_id INT
id column is obviously the PK and it is used as a global identifier for all articles.
category column is well .. category.
category_id is used as a UNIQUE ID within a category so currently there is a UNIQUE(category, category_id) constraint in place.
However, I also want for category_id to auto-increment.
I want it so that every time I execute a query like
INSERT INTO article(category) VALUES ('stackoverflow');
I want the category_id column to be filled in automatically according to the latest category_id of the 'stackoverflow' category.
Achieving this in my application code is quite easy: I just select the latest number and insert that value plus one, but that involves two separate queries.
I am looking for a SQL solution that can do all this in one query.
This has been asked many times and the general idea is bound to fail in a multi-user environment - and a blog system sounds like exactly such a case.
So the best answer is: Don't. Consider a different approach.
Drop the column category_id completely from your table - it does not store any information the other two columns (id, category) wouldn't store already.
Your id is a serial column and already auto-increments in a reliable fashion.
Auto increment SQL function
If you need some kind of category_id without gaps per category, generate it on the fly with row_number():
Serial numbers per group of rows for compound key
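A minimal sketch of that approach against the article table from the question (column names as defined there):

SELECT id, category,
       row_number() OVER (PARTITION BY category ORDER BY id) AS category_id
FROM   article;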
Concept
There are several ways to approach this. The first one that comes to my mind:
Assign a value for category_id column inside a trigger executed for each row, by overwriting the input value from INSERT statement.
Action
Here's the SQL Fiddle to see the code in action
For a simple test, I'm creating an article table holding categories and their IDs, which should be unique within each category. I have omitted constraint creation; it's not relevant to the point.
create table article ( id serial, category varchar, category_id int );
Inserting some values for two distinct categories, using the generate_series() function so that sequential category_id values are already in place.
insert into article(category, category_id)
select 'stackoverflow', i from generate_series(1,1) i
union all
select 'stackexchange', i from generate_series(1,3) i
Creating a trigger function that selects MAX(category_id) + 1 for the category of the row being inserted and overwrites the supplied value right before the actual INSERT happens (a BEFORE INSERT trigger takes care of that).
CREATE OR REPLACE FUNCTION category_increment()
RETURNS trigger
LANGUAGE plpgsql
AS
$$
DECLARE
    v_category_inc int := 0;
BEGIN
    -- next id within the category of the row being inserted
    SELECT MAX(category_id) + 1 INTO v_category_inc FROM article WHERE category = NEW.category;
    IF v_category_inc IS NULL THEN
        NEW.category_id := 1;   -- first row of this category
    ELSE
        NEW.category_id := v_category_inc;
    END IF;
    RETURN NEW;
END;
$$;
Using the function as a trigger.
CREATE TRIGGER trg_category_increment
BEFORE INSERT ON article
FOR EACH ROW EXECUTE PROCEDURE category_increment();
Inserting some more values (now that the trigger is in place) for already existing categories and a non-existing one.
INSERT INTO article(category) VALUES
('stackoverflow'),
('stackexchange'),
('nonexisting');
Query used to select data:
select category, category_id from article order by 1, 2;
Result for initial inserts:
category       category_id
stackexchange  1
stackexchange  2
stackexchange  3
stackoverflow  1
Result after final inserts:
category       category_id
nonexisting    1
stackexchange  1
stackexchange  2
stackexchange  3
stackexchange  4
stackoverflow  1
stackoverflow  2
Postgresql uses sequences to achieve this; it's a different approach from what you are used to in MySQL. Take a look at http://www.postgresql.org/docs/current/static/sql-createsequence.html for complete reference.
Basically you create a sequence (a database object) by:
CREATE SEQUENCE serials;
And then when you want to add to your table you will have:
INSERT INTO mytable (name, id) VALUES ('The Name', NEXTVAL('serials'));
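You can also attach the sequence to the column as a default, so the id is filled in automatically on insert. A minimal sketch (mytable and its columns are just placeholders):

CREATE TABLE mytable (
    id   integer DEFAULT nextval('serials'),
    name text
);

INSERT INTO mytable (name) VALUES ('The Name');  -- id is assigned from the sequence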
Consider a table creation:
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(my_id varchar,
my_date varchar,
enum_one varchar,
enum_two varchar,
PRIMARY KEY (my_id, my_date, enum_one, enum_two)
);
Columns enum_one and enum_two have a fixed number of values (6 and ~20, respectively). Should I include the enum columns in the primary key or not?
Consider a situation where I have many rows with one enum_one value and only a few with the other values.
How does Cassandra handle this situation: does it balance the load, or do most requests go to one node?
Cassandra load balances based on the partition key, so if you included the enum columns in the partition key, then it would have an effect on the load balancing.
In your example, you are using my_id as the partition key. If your reads and writes tend to have different values for my_id, then this should keep your data balanced.
If your reads and writes tend to mostly use just a few values for my_id (i.e. if my_id has low cardinality), then the data will not be well load balanced across the Cassandra nodes. If that is the case, then including the enum fields would increase the cardinality of the partition key and result in a more evenly balanced data load.
The flip side of this is that using a different partition key may impact what types of queries you can do efficiently. It is efficient to query for data within a single partition, so if you included the enum columns in the partition key, then you would have to query for each value of the enum columns in separate queries instead of a single query.
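For illustration, a sketch of what including enum_one in a composite partition key could look like, and the per-value query it then forces (the values in the SELECT are placeholders):

CREATE TABLE my_table (
    my_id    varchar,
    my_date  varchar,
    enum_one varchar,
    enum_two varchar,
    PRIMARY KEY ((my_id, enum_one), my_date, enum_two)
);

-- reading everything for one my_id now takes one query per enum_one value
SELECT * FROM my_table WHERE my_id = 'x' AND enum_one = 'A';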
My use case is like this: I am inserting 10 million rows in a table described as follows:
keyval bigint, rangef bigint, arrayval blob, PRIMARY KEY (rangef, keyval)
and the input data is as follows:
keyval - some timestamp
rangef - a random number
arrayval - a byte array
I am using a composite primary key because, after inserting the 10 million rows, I want to perform range scans on keyval. Since keyval contains a timestamp, my queries will be of the form "give me all rows between this time and that time", hence the composite primary key.
Ingestion performance was very good and satisfactory, but the query described above was very slow: asking for all rows between t1 and t1 + 3 minutes returned almost 500k records in 160 seconds.
My query is like this
// assumes an existing com.datastax.driver.core.Session named sess
import com.datastax.driver.core.*;
import com.datastax.driver.core.querybuilder.QueryBuilder;

Statement s = QueryBuilder.select().all()
        .from(keySpace, tableName)
        .allowFiltering()
        .where(QueryBuilder.gte("keyval", 1411516800))
        .and(QueryBuilder.lte("keyval", 1411516980));
s.setFetchSize(10000);
ResultSet rs = sess.execute(s);
int count = 0;
for (Row row : rs) {
    count++;
}
System.out.println("Batch2 count = " + count);
I am using the default partitioner, i.e. the Murmur3 partitioner.
My cluster configuration is -
No. of nodes - 4
No. of seed nodes - 1
No. of disks - 6
MAX_HEAP_SIZE for each node = 8G
The rest of the configuration is default.
How I can improve my range scan performance?
You are actually performing a full table scan and not a range scan. This is one of the slowest queries possible in Cassandra and is usually only used by analytics workloads. If your queries ever require ALLOW FILTERING for an OLTP workload, something is most likely wrong. Basically, Cassandra has been designed with the knowledge that queries which require accessing the entire dataset will not scale, so a great deal of effort is made to make it simple to partition data and quickly access it within a partition.
To fix this you need to rethink your data model and think about how you can restrict the data to queries on a single partition.
RussS is correct that your problems are caused both by the use of ALLOW FILTERING and by not limiting your query to a single partition.
How I can improve my range scan performance?
By limiting your query with a value for your partitioning key.
PRIMARY KEY (rangef, keyval)
If the above is indeed correct, then rangef is your partitioning key. Alter your query to first restrict for a specific value of rangef (the "single partition", as RussS suggested). Then your current range query on your clustering key keyval should work.
Now, that query may not return anything useful to you. Or you might have to iterate through many rangef values on the application side, and that could be cumbersome. This is where you need to re-evaluate your data model and come up with an appropriate key to partition your data by.
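For illustration, a sketch of the restricted query in CQL (the keyspace/table names and the rangef value are placeholders; the keyval bounds are taken from the question):

SELECT * FROM mykeyspace.mytable
WHERE rangef = 1234
  AND keyval >= 1411516800
  AND keyval <= 1411516980;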
I created a secondary index on keyval, and my query performance improved: from 160 seconds it dropped to 40 seconds. So does indexing the keyval field make sense?
The problem with relying on secondary indexes is that they may seem fast at first but get slow over time. Especially with a high-cardinality column like a timestamp (keyval), a secondary index query has to go out to each node and ultimately scan a large number of rows to get a small number of results. It's always better to duplicate your data in a new query table than to rely on a secondary index query.
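For illustration, a hypothetical query table that buckets rows by a coarse time window (here the hour derived from keyval; the bucketing scheme is an assumption) so a range scan stays inside one or a few partitions:

CREATE TABLE data_by_hour (
    hour_bucket bigint,   -- e.g. keyval / 3600, computed by the application
    keyval      bigint,
    rangef      bigint,
    arrayval    blob,
    PRIMARY KEY (hour_bucket, keyval, rangef)
);

SELECT * FROM data_by_hour
WHERE hour_bucket = 392088
  AND keyval >= 1411516800 AND keyval <= 1411516980;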
I am modelling a Cassandra schema to get a bit more familiar with the subject and was wondering what the best practice is regarding creating indexes.
For example:
create table emailtogroup(email text, groupid int, primary key(email));
select * from emailtogroup where email='joop';
create index on emailtogroup(groupid);
select * from emailtogroup where groupid=2 ;
Or I can create an entirely new table:
create table grouptoemail(groupid int, email text, primary key(groupid, email));
select * from grouptoemail where groupid=2;
They both do the job.
I would expect the new table to be faster because groupid becomes the partition key, but I'm not sure what "magic" happens when creating an index and whether this magic has a downside.
In my opinion, your first approach is correct.
create table emailtogroup(email text, groupid int, primary key(email));
because 1) in your case email is essentially unique, a good candidate for the primary key, and 2) multiple emails can belong to the same group, a good candidate for a secondary index. Please refer to this post: Cassandra: choosing a Partition Key
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible.
The second form of table creation is useful for range scans. For example, if you have a use case like:
i) List all the email groups which the user has joined from 1st Jan 2010 to 1st Jan 2013.
In that case you may have to design a table like
create table grouptoemail(email text, ts timestamp, groupid int, primary key(email, ts));
In this case, all the email groups which the user joined will be clustered (stored together) on disk.
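A sketch of the corresponding range query (the email value is taken from the question):

SELECT groupid, ts FROM grouptoemail
WHERE email = 'joop'
  AND ts >= '2010-01-01' AND ts <= '2013-01-01';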
It depends on the cardinality of groupid. From the Cassandra docs:
When not to use an index
Do not use an index to query a huge volume of records for a small
number of results. For example, if you create an index on a
high-cardinality column, which has many distinct values, a query
between the fields will incur many seeks for very few results. In the
table with a billion users, looking up users by their email address (a
value that is typically unique for each user) instead of by their
state, is likely to be very inefficient. It would probably be more
efficient to manually maintain the table as a form of an index instead
of using the Cassandra built-in index. For columns containing unique
data, it is sometimes fine performance-wise to use an index for
convenience, as long as the query volume to the table having an
indexed column is moderate and not under constant load.
Naturally, there is no support for counter columns, in which every
value is distinct.
Conversely, creating an index on an extremely low-cardinality column,
such as a boolean column, does not make sense. Each value in the index
becomes a single row in the index, resulting in a huge row for all the
false values, for example. Indexing a multitude of indexed columns
having foo = true and foo = false is not useful.
So basically, if you are going to be dealing with a large dataset, and groupid won't return a lot of rows, a secondary index may not be the best idea.
Week #4 of DataStax Academy's Java Development with Apache Cassandra class talks about how to model these problems efficiently. Check it out if you get a chance.
I am designing a Cassandra column family schema for the use case below, and I am not sure what the best way to design it is. I will be using CQL and the DataStax Java driver for this.
Below is my use case and the sample schema that I have designed for now -
SCHEMA_ID  RECORD_NAME  SCHEMA_VALUE       TIMESTAMP
1          ABC          some value         t1
2          ABC          some_other_value   t2
3          DEF          some value again   t3
4          DEF          some other value   t4
5          GHI          some new value     t5
6          IOP          some values again  t6
Now, what I am looking for from the above table is something like this:
The first time my application runs, I will ask for everything from the above table.
Then every 5 or 10 minutes, a background thread will check this table and ask only for what has changed (the full row if anything in that row changed); that is why I have a timestamp column here.
But I am not sure how to design the query pattern so that both of these use cases are satisfied easily, and what the proper way of designing the table would be. I am thinking of using SCHEMA_ID as the primary key.
I will be using CQL and the DataStax Java driver for this.
Update:
If I use something like this, is there any problem with this approach?
CREATE TABLE TEST (SCHEMA_ID TEXT, RECORD_NAME TEXT, SCHEMA_VALUE TEXT, LAST_MODIFIED_DATE TIMESTAMP, PRIMARY KEY (SCHEMA_ID));
INSERT INTO TEST (SCHEMA_ID, RECORD_NAME, SCHEMA_VALUE, LAST_MODIFIED_DATE) VALUES ('1', 't26', 'SOME_VALUE', 1382655211694);
Because, in my use case, I don't want anybody to insert the same SCHEMA_ID twice; SCHEMA_ID should be unique whenever we insert a new row into this table. So with your example (#omnibear), it might be possible for somebody to insert the same SCHEMA_ID twice? Am I correct?
Also, regarding the extra type column you introduced, that type column could be record_name in my example.
Regarding 1)
Cassandra is built for heavy writes and lots of data spread over multiple nodes. Retrieving ALL data from this kind of set-up is daring, since it might involve huge amounts that have to be handled by one client. A better approach would be to use pagination, which is natively supported in 2.0.
Regarding 2)
The point is that partition keys only support EQ or IN queries. For LT or GT (< / >) you use clustering (column) keys. So if it makes sense to group your entries by some ID like "type", you can use it as your partition key and a timeuuid as a clustering key. This allows you to query for all entries newer than X, like so:
create table test
(type int, SCHEMA_ID int, RECORD_NAME text,
SCHEMA_VALUE text, TIMESTAMP timeuuid,
primary key (type, timestamp));
select * from test where type IN (0,1,2,3) and timestamp > 58e0a7d7-eebc-11d8-9669-0800200c9a66;
Update:
You asked:
somebody can insert same SCHEMA_ID twice? Am I correct?
Yes, you can always make an insert with an existing primary key. The values at that primary key will be updated. Therefore, to preserve uniqueness, a UUID is often used in the primary key, for instance, timeuuid. It is a unique value containing a timestamp and the MAC address of the client. There is excellent documentation on this topic.
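For illustration, with the test table defined above, CQL's now() function generates such a timeuuid on the server at insert time (the inserted values are placeholders):

INSERT INTO test (type, SCHEMA_ID, RECORD_NAME, SCHEMA_VALUE, TIMESTAMP)
VALUES (0, 1, 'ABC', 'some value', now());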
General advice:
Write down your queries first, then design your model. (Use case!)
Your queries define your data model which in turn is primarily defined by your primary keys.
So, in your case, I'd just adapt my schema above, like so:
CREATE TABLE TEST (SCHEMA_ID TEXT, RECORD_NAME TEXT, SCHEMA_VALUE TEXT,
LAST_MODIFIED_DATE TIMEUUID, PRIMARY KEY (RECORD_NAME, LAST_MODIFIED_DATE));
Which allows this query:
select * from test where RECORD_NAME IN ('componentA', 'componentB')
and LAST_MODIFIED_DATE > 1688f180-4141-11e3-aa6e-0800200c9a66;
The timeuuid corresponds to Wednesday, October 30, 2013 8:55:55 AM GMT, so you would fetch everything after that.
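If you only have a plain timestamp to compare against, CQL's minTimeuuid()/maxTimeuuid() functions can build the bound for you. A sketch against the schema above:

select * from test where RECORD_NAME IN ('componentA', 'componentB')
and LAST_MODIFIED_DATE > maxTimeuuid('2013-10-30 08:55:55+0000');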