My use case is like this: I am inserting 10 million rows in a table described as follows:
keyval bigint, rangef bigint, arrayval blob, PRIMARY KEY (rangef, keyval)
and input data is like follows -
keyval - some timestamp
rangef - a random number
arrayval - a byte array
I chose a composite primary key because, after inserting 10 million rows, I want to perform a range scan on keyval. Since keyval contains a timestamp, my query will be of the form "give me all the rows between this time and that time". To support that kind of SELECT query, I made my primary key composite.
Ingestion performance was very good and satisfactory. But when I ran the query described above, performance was very poor: asking for all the rows between t1 and t1 + 3 minutes returned almost 500k records in 160 seconds.
My query is like this:
Statement s = QueryBuilder.select().all().from(keySpace, tableName)
        .allowFiltering()
        .where(QueryBuilder.gte("keyval", 1411516800))
        .and(QueryBuilder.lte("keyval", 1411516980));
s.setFetchSize(10000);
ResultSet rs = sess.execute(s);
int count = 0;
for (Row row : rs)
{
    count++;
}
System.out.println("Batch2 count = " + count);
I am using the default partitioner, Murmur3Partitioner.
My cluster configuration is -
No. of nodes - 4
No. of seed nodes - 1
No. of disks - 6
MAX_HEAP_SIZE for each node = 8G
The rest of the configuration is default.
How can I improve my range scan performance?
You are actually performing a full table scan, not a range scan. This is one of the slowest queries possible in Cassandra and is usually only used by analytics workloads. If any of your queries require ALLOW FILTERING for an OLTP workload, something is most likely wrong. Cassandra was designed with the knowledge that queries which access the entire dataset will not scale, so a great deal of effort has gone into making it simple to partition data and access it quickly within a partition.
To fix this you need to rethink your data model and think about how you can restrict the data to queries on a single partition.
RussS is correct that your problems are caused both by the use of ALLOW FILTERING and that you are not limiting your query to a single partition.
How can I improve my range scan performance?
By limiting your query with a value for your partitioning key.
PRIMARY KEY (rangef, keyval)
If the above is indeed correct, then rangef is your partitioning key. Alter your query to first restrict for a specific value of rangef (the "single partition", as RussS suggested). Then your current range query on your clustering key keyval should work.
Now, that query may not return anything useful to you. Or you might have to iterate through many rangef values on the application side, and that could be cumbersome. This is where you need to re-evaluate your data model and come up with an appropriate key to partition your data by.
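For illustration, a remodel along those lines might bucket the timestamps so that each range query names a single, bounded partition. This is only a sketch; the table name, bucket granularity, and literal values below are made up, not from the question:

```sql
-- Hypothetical sketch: partition by a coarse time bucket instead of a random
-- number, so each time-range query goes to one known partition.
CREATE TABLE data_by_bucket (
    time_bucket bigint,   -- e.g. keyval truncated to the hour
    keyval      bigint,   -- full timestamp; clustering key, so range-scannable
    rangef      bigint,
    arrayval    blob,
    PRIMARY KEY (time_bucket, keyval)
);

-- The range query now restricts the partition key and needs no ALLOW FILTERING:
SELECT * FROM data_by_bucket
WHERE time_bucket = 1411516800
  AND keyval >= 1411516800 AND keyval <= 1411516980;
```

A query spanning several hours would iterate over the corresponding buckets on the application side.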
I created a secondary index on keyval, and my query performance improved: from 160 seconds, it dropped to 40 seconds. So does indexing the keyval field make sense?
The problem with relying on secondary indexes is that they may seem fast at first but get slow over time. Especially with a high-cardinality column like a timestamp (keyval), a secondary index query has to go out to each node and ultimately scan a large number of rows to return a small number of results. It is always better to duplicate your data in a new query table than to rely on a secondary index query.
How can I select all items within a given date range?
SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date
I want to make a query like this. Do I need to create a global secondary index or not?
I've tried this
public void getItemsByDate(Date start, Date end) {
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    String stringStart = df.format(start);
    String stringEnd = df.format(end);

    ScanSpec scanSpec = new ScanSpec()
            .withFilterExpression("CreatedAt BETWEEN :from AND :to")
            .withValueMap(new ValueMap()
                    .withString(":from", stringStart)
                    .withString(":to", stringEnd));

    ItemCollection<ScanOutcome> items = gamesScoresTable.scan(scanSpec);
}
But it doesn't work; I'm getting fewer results than expected.
I can answer your questions, but to suggest any real solution, I would need to see the general shape of your data, as well as what your GameScores table's primary key is.
TLDR;
Set up your table so that you can retrieve data with queries rather than scans and filters, then create indexes to support less-used access patterns and improve querying flexibility. Because reads are so fast when you provide the full primary key (or, although not as fast, a partial one), i.e. when using queries, DynamoDB works best when the table structure is driven by the application's access patterns.
When designing your tables, keep in mind NoSQL design best practices, as well as best practices for querying and scanning and it will pay dividends in the long run.
Explanations
Question 1
How can I select all items within a given date range?
To answer this, I'd like to break that question down a little more. Let's start with: How can I select all items?
This, you have already accomplished. A scan is a great way to retrieve all items in your table, and unless you have all your items within one partition, it is the only way to retrieve all the items in your table. Scans can be helpful when you have to access data by unknown keys.
Scans, however, have limitations, and as your table grows in size they'll cost you in both performance and dollars. A single scan can only retrieve a maximum of 1MB of data from a single partition and is capped at that partition's read capacity. When a scan hits either limitation, consecutive scans happen sequentially, meaning a scan of a large table could take multiple round trips.
On top of that, with scans you consume read capacity based on the size of the item, no matter how much (or little) data is returned. If you only request a small amount of attributes in your ProjectionExpression, and your FilterExpression eliminates 90% of the items in your table, you still paid to read the entire table.
You can optimize performance of scans using Parallel Scans, but if you require an entire table scan for an access pattern that happens frequently for your application, you should consider restructuring your table. More about scans.
Let's now look at: How can I select all items, based on some criteria?
The ideal way to retrieve data based on some criteria (in your case SELECT * FROM GameScores where createdAt >= start_date && createdAt <= end_date) would be to query the base table (or an index). To do so, per the documentation:
You must provide the name of the partition key attribute and a single value for that attribute. Query returns all items with that partition key value.
Like the documentation says, querying a partition will return all of its values. If your GameScores table has a partition key of GameName, then a query for GameName = PacMan will return all Items with that partition key. Other GameName partitions, however, will not be captured in this query.
If you need more depth in your query:
Optionally, you can provide a sort key attribute and use a comparison operator to refine the search results.
Here's a list of all the possible comparison operators you can use with your sort key. This is where you can leverage a between comparison operator in the KeyConditionExpression of your query operation. Something like: GameName = PacMan AND createdAt BETWEEN time1 AND time2 will work, if createdAt is the sort key of the table or index that you are querying.
If it is not the sort key, you might have the answer to your second question.
Question 2
Do I need to create a Global Secondary Index?
Let's start with: Do I need to create an index?
If your base table's data structure does not fit some of your application's access patterns, you might need to. However, in DynamoDB, denormalizing your data can also support more access patterns. I would recommend watching this video on how to structure your data.
Moving onto: Do I need to create a GSI?
GSIs do not support strong read consistency, so if you need that, you'll have to go with a Local Secondary Index (LSI). However, if you've already created your base table, you won't be able to add an LSI. Another difference between the two is the primary key: a GSI can have a different partition key and sort key from the base table, while an LSI can only differ in its sort key. More about indexes.
Pre-requirement: I need to find all the matching results in one query - more than 40K results.
Requirement: two tables - product and product_category. I am trying to fetch all the products with a matching category from the product_category table.
Table Structure :
CREATE TABLE catalog.product (
    product_id string PRIMARY KEY INDEX USING plain,
    name string,
    sku string
) CLUSTERED BY (product_id) INTO 4 shards;

CREATE TABLE catalog.product_category (
    category_id string PRIMARY KEY INDEX USING plain,
    product_id string PRIMARY KEY INDEX USING plain,
    INDEX product_category_index USING plain(product_id, category_id),
    active boolean,
    parent_category_id integer,
    updated_at timestamp
);
Join Query :
select p.product_id from catalog.product_category pc join catalog.product p on p.product_id=pc.product_id limit 40000;
I tried multiple things - indexing product_id (both as integer and string), etc.
Result : it takes more than 90 seconds every time to return 35K results.
Problem : What can I do to optimize the query response time?
Some Other Information :
- CPU Core -4
- Tried with one or multiple nodes
- Default Sharding
- Total number of products - 35K, and product_category has 35K entries as well.
Use case : I am trying to use CrateDB as a persistent cache, but with the given query response time we can't really do that, so we would have to move to an in-memory database like Redis or Memcached. The reason for choosing CrateDB was the ability to query persistent data.
Joining on STRING-type columns is very expensive compared to joining on NUMERIC types (an equality check on string values costs much more than one on numeric values). If you don't have a special reason for using strings, I'd recommend changing them to a NUMERIC type (e.g. INTEGER, LONG, ...); this would improve the query speed severalfold.
By the way, index using plain is the default index setting for all columns, so you can just leave it out.
Also, the compound index product_category_index will not help this join query; it only matters if you filter on its columns in a WHERE clause.
Updated
Another improvement you can make is adding an ORDER BY p.product_id, pc.product_id clause. With it, the join algorithm can stop as soon as it reaches your LIMIT.
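Putting the two suggestions together (and assuming product_id has already been migrated to a NUMERIC type), the query might look like this:

```sql
SELECT p.product_id
FROM catalog.product_category pc
JOIN catalog.product p ON p.product_id = pc.product_id
ORDER BY p.product_id, pc.product_id
LIMIT 40000;
```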
We have a large set of data (bulk data) that needs to be checked if the record is existing in the database.
We are using SQL Server 2012/JPA/Hibernate/Spring.
What would be an efficient or recommended way to check if a record exists in the database?
Our entity ProductCodes has the following fields:
private Integer productCodeId // this is the PK
private Integer refCode1 // ref code 1-5 has a unique constraint
private Integer refCode2
private Integer refCode3
private Integer refCode4
private Integer refCode5
... other fields
The service that we are creating will be given a file where each line is a combination of refCode1-5.
The task of the service is to check and report all lines in the file that are already existing in the database.
We are looking at approaching this in two ways.
Approach 1: the usual approach.
Loop through each line and call the DAO to query whether the refCode1-5 combination exists in the DB.
// pseudocode
for each line in the file:
    call the DAO, passing refCode1-5 to the query:
    select * from ProductCodes where refCode1=? and refCode2=? and refCode3=? and refCode4=? and refCode5=?
Given a large list of lines to check, this might be inefficient, since we would invoke the DAO once per line; a file of, say, 1000 lines means 1000 round trips to the DB.
Approach 2: query all records in the DB.
We will query all records in the DB,
create a hash map with the concatenated refCode1-5 as keys, and
loop through each line in the file, validating it against the hash map.
We think this is more efficient in terms of DB round trips, since it avoids one query per line. However, if the DB table has, for example, 5000 records, then Hibernate/JPA will create 5000 entities in memory and could overwhelm the application.
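If holding all the codes in memory is acceptable, approach 2 can be sketched in plain Java as below. The class name and sample data are hypothetical; in practice the set would be filled from a projection query that selects only the five refCode columns, so no full entities are materialized:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExistingCodeCheck {
    // Build one lookup key from the five ref codes; the delimiter avoids
    // ambiguous concatenations such as (1, 23) vs (12, 3).
    static String key(int r1, int r2, int r3, int r4, int r5) {
        return r1 + "|" + r2 + "|" + r3 + "|" + r4 + "|" + r5;
    }

    public static void main(String[] args) {
        // Stand-in for "select refCode1..refCode5 from ProductCodes".
        Set<String> existing = new HashSet<>();
        existing.add(key(1, 2, 3, 4, 5));
        existing.add(key(9, 8, 7, 6, 5));

        // Each file line is a refCode1-5 combination; report the duplicates.
        List<int[]> fileLines = Arrays.asList(
                new int[]{1, 2, 3, 4, 5},   // already in the DB
                new int[]{0, 0, 0, 0, 1});  // new combination

        for (int[] l : fileLines) {
            boolean exists = existing.contains(key(l[0], l[1], l[2], l[3], l[4]));
            System.out.println(Arrays.toString(l) + " exists=" + exists);
        }
    }
}
```

Selecting only the refCode columns keeps the per-row memory footprint to a short string rather than a full entity, which softens the memory concern raised above.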
We are leaning toward the first approach, since refCode1-5 has a unique constraint and the query will benefit from its implicit index.
But is there a better way of approaching this problem aside from the first approach?
Try something like a batch select statement for, say, 100 refCodes instead of doing a single select for each refCode.
Construct a query like
select <whatever you want> from <table> where ref_code in (.....)
Construct the select projection in a way that gives you not just what you want but also the ref_code details. Then, in code, you can do a count or a multi-threaded scan of the result set if the DB returned fewer refCodes than the number of codes you put into the query.
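The batching idea can be sketched like this. The table and column names echo the question, while the batch size of 100 and the single-column IN filter are illustrative; the result set would then be re-checked in code against the full refCode1-5 tuples, as described above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class BatchSelectBuilder {
    // Split the full list of codes into batches of at most `size` elements.
    static List<List<String>> batches(List<String> codes, int size) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < codes.size(); i += size) {
            out.add(codes.subList(i, Math.min(i + size, codes.size())));
        }
        return out;
    }

    // Build an IN-clause query with n bind placeholders, so the real code can
    // bind the batch values as parameters instead of concatenating them.
    static String inQuery(int n) {
        StringJoiner in = new StringJoiner(", ", "(", ")");
        for (int i = 0; i < n; i++) {
            in.add("?");
        }
        return "select refCode1, refCode2, refCode3, refCode4, refCode5"
                + " from ProductCodes where refCode1 in " + in;
    }

    public static void main(String[] args) {
        List<String> codes = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            codes.add(String.valueOf(i));
        }
        System.out.println(batches(codes, 100).size()); // 3 batches for 250 codes
        System.out.println(inQuery(3));
    }
}
```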
You can try using the concat operator.
select <your cols> from <your table> where concat(refCode1, refCode2, refCode3, refCode4, refCode5) IN (<set of concatenations from your file>);
I think this will be quite efficient, and it may be worth trying to see whether pre-sorting the lines and tuning the number of concatenations taken each time brings you any benefit. (Consider putting a delimiter between the codes so that different combinations cannot concatenate to the same string.)
I would suggest creating a temp table in your application where all the records from the file are initially stored with a batch save, and then running a query that joins the new temp table and the ProductCodes table to achieve whatever filtering you like. This way you are not locking the ProductCodes table many times to check individual rows, since SQL Server takes locks on SELECT statements as well.
I am modelling a Cassandra schema to get a bit more familiar with the subject, and I was wondering what the best practice is regarding creating indexes.
For example:
create table emailtogroup(email text, groupid int, primary key(email));
select * from emailtogroup where email='joop';
create index on emailtogroup(groupid);
select * from emailtogroup where groupid=2 ;
Or I can create an entirely new table:
create table grouptoemail(groupid int, email text, primary key(groupid, email));
select * from grouptoemail where groupid=2;
They both do the job.
I would expect creating a new table to be faster, because groupid then becomes the partition key, but I'm not sure what "magic" happens when an index is created, and whether that magic has a downside.
In my opinion, your first approach is correct:
create table emailtogroup(email text, groupid int, primary key(email));
because 1) in your case email is more or less unique, a good candidate for the primary key, and 2) multiple emails can belong to the same group, a good candidate for a secondary index. Please refer to this post - Cassandra: choosing a Partition Key
The partitioning key is used to distribute data across different nodes, and if you want your nodes to be balanced (i.e. well distributed data across each node) then you want your partitioning key to be as random as possible.
The second form of table creation is useful for range scans. For example if you have a use case like
i) List all the email groups which the user has joined from 1st Jan 2010 to 1st Jan 2013.
In that case you may have to design a table like
create table grouptoemail(email text, ts timestamp, groupid int, primary key(email, ts));
In this case all the email groups which the user joined will be clustered (i.e. stored together) on disk.
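With that table, the example use case becomes a single-partition range query on the clustering column, something like this (the date literals are illustrative):

```sql
SELECT groupid FROM grouptoemail
WHERE email = 'joop'
  AND ts >= '2010-01-01' AND ts < '2013-01-01';
```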
It depends on the cardinality of groupid. From the Cassandra docs:
When not to use an index

Do not use an index to query a huge volume of records for a small number of results. For example, if you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion users, looking up users by their email address (a value that is typically unique for each user) instead of by their state, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.

Naturally, there is no support for counter columns, in which every value is distinct.

Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.
So basically, if you are going to be dealing with a large dataset, and groupid won't return a lot of rows, a secondary index may not be the best idea.
Week #4 of DataStax Academy's Java Development with Apache Cassandra class talks about how to model these problems efficiently. Check it out if you get a chance.
At my job I was to develop and implement a solution for the following problem:
Given a dataset of 30M records, extract (key, value) tuples from a particular dataset field, group them by key and value, storing the number of identical values for each key. Write the top 5000 most frequent values for each key to a database. Each dataset row contains up to 100 (key, value) tuples in the form of serialized XML.
I came up with a solution like this (using Spring Batch):
Batch job steps:
Step 1. Iterate over the dataset rows and extract (key, value) tuples. Upon collecting some fixed number of tuples, dump them to disk as a chunk file in a directory named after the key, so that all values for a given key are stored in one directory. Within one file, values are stored sorted.
Step 2. Iterate over all the per-key directories and merge each one's chunk files into a single file, grouping identical values. Since the values are stored sorted, merging them is trivial, with O(n log k) complexity, where n is the number of values in a chunk file and k is the initial number of chunks.
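Step 2's k-way merge can be sketched with a priority queue holding one cursor per chunk. In-memory lists stand in for the sorted chunk files here, and all names are made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class ChunkMerge {
    // Merge k sorted chunks in O(n log k): the heap holds one cursor per chunk,
    // ordered by the value the cursor currently points at.
    static List<Integer> merge(List<List<Integer>> chunks) {
        // cursor = int[]{chunkIndex, positionInChunk}
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) ->
                Integer.compare(chunks.get(a[0]).get(a[1]),
                                chunks.get(b[0]).get(b[1])));
        for (int c = 0; c < chunks.size(); c++) {
            if (!chunks.get(c).isEmpty()) heap.add(new int[]{c, 0});
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cur = heap.poll();
            merged.add(chunks.get(cur[0]).get(cur[1]));
            if (cur[1] + 1 < chunks.get(cur[0]).size()) {
                heap.add(new int[]{cur[0], cur[1] + 1}); // advance the cursor
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> chunks = Arrays.asList(
                Arrays.asList(1, 4, 7), Arrays.asList(2, 5), Arrays.asList(3, 6, 8));
        System.out.println(merge(chunks)); // [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```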
Step 3. For each merged file (in other words, for each key), read its values sequentially, using a PriorityQueue to maintain the top 5000 values without loading all of the values into memory. Write the queue's contents to the database.
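Step 3's bounded top-N selection can be sketched over plain integers standing in for per-value counts (the real code would order (value, count) pairs by count). The min-heap never holds more than N elements, so memory stays O(N) regardless of input size:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopN {
    // Keep only the n largest elements seen so far; the min-heap root is the
    // smallest retained element and is evicted whenever the heap overflows.
    static List<Integer> topN(Iterable<Integer> counts, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int c : counts) {
            heap.add(c);
            if (heap.size() > n) heap.poll(); // drop the current minimum
        }
        List<Integer> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder()); // largest first
        return out;
    }

    public static void main(String[] args) {
        List<Integer> counts = List.of(5, 1, 9, 3, 7, 2, 8);
        System.out.println(topN(counts, 3)); // [9, 8, 7]
    }
}
```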
I spent about a week on this task, mainly because I had not worked with Spring Batch before and because I tried to emphasize scalability, which requires a careful implementation of the multi-threading part.
The problem is that my manager considers this task far too easy to have spent that much time on.
So the question is: do you know a more efficient solution, or perhaps a less efficient one that would be easier to implement? And how much time would you need to implement my solution?
I am aware of MapReduce-like frameworks, but I can't use them because the application is supposed to run on a simple PC with 3 cores and 1GB of Java heap.
Thank you in advance!
UPD: I think I did not state my question clearly. Let me ask it another way:
Given the problem, and being the project manager or at least the task reviewer, would you accept my solution? And how much time would you dedicate to this task?
Are you sure this approach is faster than doing a pre-scan of the XML file to extract all keys, and then parsing the XML file over and over for each key? You are doing a lot of file-management tasks in this solution, which is definitely not free.
Since you have three cores, you could parse three keys at the same time (as long as the file system can handle the load).
Your solution seems reasonable and efficient; however, I'd probably use SQL.
While parsing the Key/Value pairs I'd insert/update into a SQL table.
I'd then query the table for the top records.
Here's an example using only T-SQL (SQL Server 2008, but the concept should work in almost any modern RDBMS).
The SQL between /* START */ and /* END */ comprises the statements you would need to execute in your code.
BEGIN
-- database table
DECLARE @tbl TABLE (
k INT -- key
, v INT -- value
, c INT -- count
, UNIQUE CLUSTERED (k, v)
)
-- insertion loop (for testing)
DECLARE @x INT
SET @x = 0
SET NOCOUNT ON
WHILE (@x < 1000000)
BEGIN
--
SET @x = @x + 1
DECLARE @k INT
DECLARE @v INT
SET @k = CAST(RAND() * 10 as INT)
SET @v = CAST(RAND() * 100 as INT)
-- the INSERT / UPDATE code
/* START this is the sql you'd run for each row */
UPDATE @tbl SET c = c + 1 WHERE k = @k AND v = @v
IF @@ROWCOUNT = 0
INSERT INTO @tbl VALUES (@k, @v, 1)
/* END */
--
END
SET NOCOUNT OFF
-- final select
DECLARE @topN INT
SET @topN = 50
/* START this is the sql you'd run once at the end */
SELECT
a.k
, a.v
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY k ORDER BY k ASC, c DESC) [rid]
, k
, v
FROM @tbl
) a
WHERE a.rid <= @topN
/* END */
END
Gee, it doesn't seem like much work to try the old-fashioned way of just doing it in memory.
I would try just doing it first and then, if you run out of memory, try one key per run (as per @Storstamp's answer).
If using the "simple" solution is not an option due to the size of the data, my next choice would be an SQL database. However, as most of these require quite a lot of memory (and slow to a crawl when RAM is heavily overcommitted), maybe you should redirect your search to something like a NoSQL database such as MongoDB, which can be quite efficient even when mostly disk-based (which your environment basically requires, having only 1GB of heap available).
The NoSQL database will do all the basic bookkeeping for you (storing the data, keeping track of indexes, sorting), and will probably do it a bit more efficiently than your solution, because all data can be sorted and indexed on insertion, removing the extra steps of sorting the lines in the chunk files, merging them, etc.
You will end up with a solution that is probably much easier to administrate, and it will also allow you to set up different kind of queries, instead of being optimized only for this specific case.
As a project manager I would not oppose your current solution. It is already fast and solves the problem. As an architect, however, I would object, because the solution is a bit hard to maintain and does not use proven technologies that do, in part, the same thing you have coded on your own. It is hard to beat the tree and hash implementations of modern databases.