Is there any other way to implement counters in Cassandra?
I have the following table structure:
CREATE TABLE userlog (
term text,
ts timestamp,
year int,
month int,
day int,
hour int,
weekofyear int,
dayofyear int,
count counter,
PRIMARY KEY (term, ts, year, month, day, hour, weekofyear, dayofyear)
);
But because of the counter column I have to put all the other columns in the primary key, which is creating problems for my application.
So, is there any other way I can avoid doing this (preferably using Java)?
You can avoid counters in Cassandra altogether by using an analytics engine such as Spark. The idea is to only store events in Cassandra and either periodically trigger Spark or continuously run Spark as a background job that reads the events and creates aggregates such as counts. Those aggregate results can be written back into Cassandra in separate tables (e.g. userlog_by_month, userlog_by_week, ...).
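For illustration, here is a rough sketch of that pattern, assuming the Spark Cassandra Connector's Java API (CassandraJavaUtil) and an illustrative keyspace ks; it counts events per term and day, and the result could then be saved into a roll-up table:

// Sketch only: aggregate raw userlog events with Spark instead of using
// Cassandra counters. Keyspace/table names ("ks", "userlog") are assumptions.
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Map;

public class UserlogAggregation {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("userlog-aggregation")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the raw events and count them per (term, year-month-day) bucket.
        JavaPairRDD<String, Long> countsPerDay = CassandraJavaUtil
                .javaFunctions(sc)
                .cassandraTable("ks", "userlog")          // rows come back as CassandraRow
                .mapToPair(row -> new Tuple2<>(
                        row.getString("term") + ":" + row.getInt("year")
                                + "-" + row.getInt("month") + "-" + row.getInt("day"),
                        1L))
                .reduceByKey(Long::sum);

        // Collect the aggregates here; in a real job they would be written back
        // into a roll-up table such as userlog_by_day.
        Map<String, Long> result = countsPerDay.collectAsMap();
        result.forEach((bucket, count) -> System.out.println(bucket + " -> " + count));

        sc.stop();
    }
}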
Usually you would put the counter column in a separate table from the data table. In that way you can use whatever key you find convenient to access the counters.
The downside is you need to update two tables rather than just one, but this is unavoidable due to the way counters are implemented.
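A minimal sketch of that layout (keyspace and table names are illustrative), with the event detail and the counter kept in separate tables and both written on every event:

// Sketch: keep the raw event rows and the counter in separate tables so the
// counter table can use whatever key is convenient. Names are illustrative.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class UserlogWithCounterTable {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect("ks");

            // Data table: full event detail, keyed only by term and event time.
            session.execute("CREATE TABLE IF NOT EXISTS userlog ("
                    + " term text, ts timestamp, year int, month int, day int,"
                    + " hour int, weekofyear int, dayofyear int,"
                    + " PRIMARY KEY (term, ts))");

            // Counter table: one counter per term and hour bucket.
            session.execute("CREATE TABLE IF NOT EXISTS userlog_count_by_hour ("
                    + " term text, year int, month int, day int, hour int,"
                    + " count counter,"
                    + " PRIMARY KEY (term, year, month, day, hour))");

            // On every event: write the detail row and bump the counter.
            session.execute("INSERT INTO userlog"
                    + " (term, ts, year, month, day, hour, weekofyear, dayofyear)"
                    + " VALUES ('foo', '2016-01-25 14:00:00+0000', 2016, 1, 25, 14, 4, 25)");
            session.execute("UPDATE userlog_count_by_hour SET count = count + 1"
                    + " WHERE term = 'foo' AND year = 2016 AND month = 1"
                    + " AND day = 25 AND hour = 14");
        }
    }
}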
I have two tables with the following model:
CREATE TABLE IF NOT EXISTS INV (
CODE TEXT,
PRODUCT_CODE TEXT,
LOCATION_NUMBER TEXT,
QUANTITY DECIMAL,
CHECK_INDICATOR BOOLEAN,
VERSION BIGINT,
PRIMARY KEY ((LOCATION_NUMBER, PRODUCT_CODE)));
CREATE TABLE IF NOT EXISTS LOOK_INV (
LOCATION_NUMBER TEXT,
CHECK_INDICATOR BOOLEAN,
PRODUCT_CODE TEXT,
CHECK_INDICATOR_DDTM TIMESTAMP,
PRIMARY KEY ((LOCATION_NUMBER), CHECK_INDICATOR, PRODUCT_CODE))
WITH CLUSTERING ORDER BY (CHECK_INDICATOR ASC, PRODUCT_CODE ASC);
I have a business operation where I need to update CHECK_INDICATOR in both tables and QUANTITY in the INV table.
As CHECK_INDICATOR is part of the key in the LOOK_INV table, I need to delete the row first and insert a new row.
Below are the three operations I need to perform in a batch fashion (either all are executed successfully or none is executed):
Delete row from LOOK_INV table.
Insert row in LOOK_INV table.
Update QUANTITY and CHECK_INDICATOR in INV table.
As the INV table is accessed by multiple threads, I need to make sure, before updating an INV row, that it has not been changed since it was last read.
I am using an LWT to update the INV table via the VERSION column, and a batch operation for the deletion and insertion in the LOOK_INV table. I would like to put all three operations in one batch, but since an LWT cannot be combined in a batch with updates to a different partition, I have to execute them as described above.
The problem with this approach is that in some scenarios the batch executes successfully but the update of the INV table results in a timeout exception, and the data becomes inconsistent between the two tables.
Is there any feature provided by Cassandra to handle this type of scenario elegantly?
Caution with Lightweight Transactions (LWT)
Lightweight Transactions are currently considered a Cassandra anti-pattern because of the performance issues you are suffering.
Here is a bit of context to explain.
Cassandra does not use RDBMS ACID transactions with rollback or locking mechanisms. It does not provide locking because of a fundamental constraint on all kinds of distributed data stores known as the CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
Because of this, Cassandra is not good for atomic operations and you should not use Cassandra for this purpose.
It does provide lightweight transactions, which can replace locking in some cases. But because the Paxos protocol (the basis for LWT) involves a series of actions that occur between nodes, there will be multiple round trips between the node that proposes a LWT and the other replicas that are part of the transaction.
This has an adverse impact on performance and is one reason for the WriteTimeoutException error. In this situation you can't know if the LWT operation has been applied, so you need to retry it in order to fall back to a stable state. Because LWTs are so expensive, the driver will not automatically retry them for you.
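As a sketch (using the INV table from the question; keyspace, helper names and the retry policy are assumptions), the application can check wasApplied() on the result and treat a CAS write timeout as an unknown outcome that must be re-read and retried:

// Sketch: optimistic update of INV guarded by the VERSION column.
// On a CAS write timeout the outcome is unknown, so re-read and retry.
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

import java.math.BigDecimal;

public class InvOptimisticUpdate {

    static boolean updateQuantity(Session session, String location, String product,
                                  BigDecimal newQuantity, boolean checkIndicator) {
        PreparedStatement read = session.prepare(
                "SELECT version FROM inv WHERE location_number = ? AND product_code = ?");
        PreparedStatement update = session.prepare(
                "UPDATE inv SET quantity = ?, check_indicator = ?, version = ?"
                        + " WHERE location_number = ? AND product_code = ? IF version = ?");

        for (int attempt = 0; attempt < 3; attempt++) {
            Row current = session.execute(read.bind(location, product)).one();
            if (current == null) {
                return false;                      // row no longer exists
            }
            long version = current.getLong("version");
            try {
                ResultSet rs = session.execute(update.bind(
                        newQuantity, checkIndicator, version + 1,
                        location, product, version));
                if (rs.wasApplied()) {
                    return true;                   // LWT condition held, update applied
                }
                // Condition failed: someone else changed the row; loop and retry.
            } catch (WriteTimeoutException e) {
                // Outcome unknown for a CAS timeout; loop, re-read, and retry.
            }
        }
        return false;
    }
}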
LWT comes with big performance penalties if used frequently, and some clients see big timeout issues due to using LWTs.
Lightweight transactions are generally a bad idea and should be used infrequently.
If you do require ACID properties on part of your workload but still need it to scale, consider shifting that part of your load to CockroachDB.
In summary, if you do need ACID transactions it is generally a lot easier to bring a second technology in.
We have a table that will contain a huge amount of time series data; we will probably have to store several entries per millisecond in that table. To fulfill these requirements the table looks like:
CREATE TABLE statistic (
name text,
id uuid,
start timestamp,
other_data ...,
PRIMARY KEY (name, start, id)
) WITH CLUSTERING ORDER BY (start DESC);
As you can see, the table has two clustering keys: start stores the time when the data arrives, and id is there to prevent data from being overwritten when it arrives at the same time.
This is OK; we can make range queries like
SELECT * FROM statistic WHERE name ='foo' AND start >= 1453730078182
AND start <= 1453730078251;
But we also need the capability to have additional search parameters in the query like
SELECT * FROM statistic WHERE name = 'foo'
AND start >= 1453730078182 AND start <= 1453730078251 AND other_data = 'bar';
This does not work of course because other_data is not part of the primary key. If we add it to the primary key, we get the following error
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY column "other_data" cannot be restricted (preceding column "start" is restricted by a non-EQ relation)"
That is also OK, that is not the way Cassandra works (I think).
Our approach to solving the problem is to select the needed (time series) data with the above-mentioned (first) range query and afterwards filter the data in our Java application, i.e. we go through the list and drop all the data we don't need. A single entry does not contain much data, but in the worst case we may be talking about several million rows.
Now I have two questions:
Is that the right approach to solve the problem?
Is Cassandra capable of handling that amount of data?
This does not work of course because other_data is not part of the primary key. If we add it to the primary key, we get the following error
This is a sweet spot for a secondary index on the column other_data. In your case this index will scale because you always provide the partition key (name), so Cassandra will not hit all nodes in the cluster.
With a secondary index on other_data, your second SELECT statement will be possible.
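A minimal sketch of that approach through the Java driver (the index name and keyspace are illustrative; depending on the Cassandra version the combined query may also need ALLOW FILTERING):

// Sketch: a secondary index on other_data makes the combined query possible,
// as long as the partition key (name) is always restricted as well.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class StatisticSecondaryIndex {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect("ks");   // keyspace name is an assumption

            session.execute("CREATE INDEX IF NOT EXISTS statistic_other_data_idx"
                    + " ON statistic (other_data)");

            // Range on start plus equality on the indexed column, within one partition.
            // (Some Cassandra versions require appending ALLOW FILTERING here.)
            ResultSet rs = session.execute(
                    "SELECT * FROM statistic WHERE name = 'foo'"
                            + " AND start >= 1453730078182 AND start <= 1453730078251"
                            + " AND other_data = 'bar'");
            rs.forEach(row -> System.out.println(row));
        }
    }
}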
Now there is another issue with your data model, which is the partition size. Indeed, if you are inserting several entries per millisecond per name, this will not scale, because the partition for each name will grow very fast.
If the inserts are distributed over different partition keys (different name values), then it's fine.
I would like to use server-side data selection and filtering with the Cassandra Spark connector. We have many sensors that send values every second, and we are interested in aggregating these data by month, day, hour, etc.
I have proposed the following data model:
CREATE TABLE project1(
year int,
month int,
load_balancer int,
day int,
hour int,
estimation_time timestamp,
sensor_id int,
value double,
...
PRIMARY KEY ((year, month, load_balancer), day, hour, estimation_time, sensor_id)
);
Then we wanted to aggregate the data for December 2014 with load_balancer IN (0,1,2,3), i.e. 4 different partitions.
We are using the Cassandra Spark connector version 1.1.1, and we used a combineByKey operation to get all values mean-aggregated by hour.
For 4,341,390 tuples, Spark takes 11 minutes to return the result.
Now the issue is that we are using 5 nodes, yet Spark uses only one worker to execute the task.
Could you please suggest an update to the query or data model in order to enhance the performance?
Spark Cassandra Connector has this feature; it is SPARKC-25. You can create an arbitrary RDD with key values and then use it as a source of keys to fetch data from the Cassandra table. In other words, you join an arbitrary RDD with a Cassandra RDD. In your case, that arbitrary RDD would include 4 tuples with the different load_balancer values. Look at the documentation for more info. SCC 1.2 has been released recently and it is probably compatible with Spark 1.1 (it is designed for Spark 1.2, though).
I want to store different kinds of counters for my user.
Platform: Java
E.g. I have identified:
currentNumRecords
currentNumSteps
currentNumFlowsInterval1440
currentNumFlowsInterval720
currentNumFlowsInterval240
currentNumFlowsInterval60
currentNumFlowsInterval30
etc.
Each of the counters above needs to be reset at the beginning of each month for each user. The value of each counter can be unpredictably high with peaks etc. (I mean that a lot of things are counted, so I want to think about a scalable solution).
Now my question is what approach to take:
a) Should I have separate columns for each counter on the user table and do things like UPDATE user SET counterColumn = counterColumn + 1?
b) Put all the values in some kind of JSON/XML and store it in a single column? (In this case I always have to update all values at once.)
The disadvantage I see is row locking on the user table every time a single counter is incremented.
c) Have a separate counter table with 3 columns (userid, name, counter) and do one INSERT for each count, plus a background job doing aggregates which are written to the user table? In this case, would it be OK to store the aggregated counters as JSON inside a column in the user table?
d) Do everything in MySQL, or also use another technology? I also thought about using another solution for storing counters and only keeping the aggregates in MySQL. E.g. I have experimented with Apache Cassandra's distributed counters. My concern is about transactions, which Cassandra does not have.
I need the counters to be exact because they are used for billing, thus I don't know if Cassandra is a good fit here, although the scalability of Cassandra seems tempting.
What about Redis for storing the counters + writing the aggregates in MySQL? Does Redis have features which help me here? Or should I just store everything in a simple Java HashMap in memory, have an aggregation background thread, and not use another technology?
In summary I am concerned about:
reduce row locking
have exact counters (transactions?)
Thanks for your ideas :)
You're sort of saying contradictory things.
The number of counts can be huge or at least unpredictable per user.
To me this means they must be uniform, like an array. It is not possible to have an unbounded number of heterogeneous data, unless you have an unbounded amount of code and an unbounded number of developer hours to expend.
If they are uniform, they should be flattened into a table user_counter where each row is of the form (user_id, counter_name, counter_value). However, you will need to think carefully about what sort of indices you will need, etc. Resetting them all to zero or some default value at the beginning of the month is one SQL query.
Basically, (c). Options (a) and (b) are most absurd, and MySQL is still a suitable technology for this.
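A minimal sketch of option (c) in Java against MySQL (table and column names are illustrative): each increment is a single atomic upsert, and the monthly reset is one statement:

// Sketch: a flattened user_counter table in MySQL. Each increment is a single
// atomic upsert, and the monthly reset is one UPDATE. Names are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class UserCounterDao {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/app", "app", "secret")) {

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS user_counter ("
                        + " user_id BIGINT NOT NULL,"
                        + " counter_name VARCHAR(64) NOT NULL,"
                        + " counter_value BIGINT NOT NULL DEFAULT 0,"
                        + " PRIMARY KEY (user_id, counter_name))");
            }

            // Increment one counter atomically (insert the row if it is missing).
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO user_counter (user_id, counter_name, counter_value)"
                            + " VALUES (?, ?, 1)"
                            + " ON DUPLICATE KEY UPDATE counter_value = counter_value + 1")) {
                ps.setLong(1, 42L);
                ps.setString(2, "currentNumRecords");
                ps.executeUpdate();
            }

            // Monthly reset: one statement for all users and counters.
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("UPDATE user_counter SET counter_value = 0");
            }
        }
    }
}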
Your requirement is not that unusual. In general this is statistical data bound to a session/user/etc. that is written frequently.
The first thing is to split things up, if not already done. Keep a mostly read-only database and collect this statistical data separately, i.e. a separate user table for the normal properties.
The statistical data could be held in an in-memory table. You could also use means other than a database, such as a message queue or session attributes.
If we have a sequence to generate unique ID fields for a table, which of the 2 approaches is more efficient:
Create a trigger on insert, to populate the ID field by fetching nextval from sequence.
Calling nextval on the sequence in the application layer before inserting the object (or tuple) in the db.
EDIT: The application performs a mass upload. So assume thousands or a few millions of rows to be inserted each time the app runs. Would triggers from #1 be more efficient than calling the sequence within the app as mentioned in #2?
Since you are inserting a large number of rows, the most efficient approach would be to include the sequence.nextval as part of the SQL statement itself, i.e.
INSERT INTO table_name( table_id, <<other columns>> )
VALUES( sequence_name.nextval, <<bind variables>> )
or
INSERT INTO table_name( table_id, <<other columns>> )
SELECT sequence_name.nextval, <<other values>>
FROM some_other_table
If you use a trigger, you will force a context shift from the SQL engine to the PL/SQL engine (and back again) for every row you insert. If you get the nextval separately, you'll force an additional round-trip to the database server for every row. Neither of these are particularly costly if you do them once or twice. If you do them millions of times, though, the milliseconds add up to real time.
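As a sketch in Java with hypothetical table, column and sequence names, a mass upload can reference the sequence inside the INSERT itself and batch the rows, so there is neither a trigger context switch nor a separate nextval round trip per row:

// Sketch: mass insert where the sequence is referenced inside the INSERT itself,
// so there is no trigger context switch and no extra round trip per row.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BulkInsertWithSequence {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/ORCL", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO table_name (table_id, payload)"
                             + " VALUES (sequence_name.nextval, ?)")) {

            conn.setAutoCommit(false);
            for (int i = 0; i < 100_000; i++) {
                ps.setString(1, "row-" + i);
                ps.addBatch();
                if (i % 1_000 == 0) {
                    ps.executeBatch();    // send rows to the server in chunks
                }
            }
            ps.executeBatch();
            conn.commit();
        }
    }
}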
If you're only concerned about performance, on Oracle it'll generally be a bit faster to populate the ID with a sequence in your INSERT statement, rather than use a trigger, as triggers add a bit of overhead.
However (as Justin Cave says), the performance difference will probably be insignificant unless you're inserting millions of rows at a time. Test it to see.
What is a key? One or more fields that uniquely identify a record; it should be final and never change over the life of the application.
I make a distinction between technical and business keys. Technical keys are defined on the database and are generated (sequence, UUID, etc.); business keys are defined by your domain model.
That's why I suggest:
always generate technical PKs with a sequence/trigger on the database
never use this PK field in your application (tip: mark getId()/setId() as @Deprecated)
define business fields which uniquely identify your entity and use these in equals/hashCode methods
I'd say if you already use Hibernate, then let it control how the IDs are created with @SequenceGenerator and @GeneratedValue. It will be more transparent, and Hibernate can reserve IDs for itself, so it might be more efficient than doing it by hand or from a trigger.
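A sketch combining both suggestions (entity, column and sequence names are hypothetical): the technical PK is generated from a database sequence via JPA/Hibernate annotations, while equals/hashCode rely on the business key:

// Sketch: technical PK generated from a database sequence by Hibernate/JPA;
// equals/hashCode are based on the business key (username), not the PK.
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;
import java.util.Objects;

@Entity
public class UserAccount {

    @Id
    @SequenceGenerator(name = "user_seq", sequenceName = "USER_ACCOUNT_SEQ", allocationSize = 50)
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "user_seq")
    private Long id;                // technical key: not used by application logic

    private String username;       // business key: unique and stable

    protected UserAccount() { }    // required by JPA

    public UserAccount(String username) {
        this.username = username;
    }

    /** Technical key; avoid relying on it in application code. */
    @Deprecated
    public Long getId() {
        return id;
    }

    public String getUsername() {
        return username;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserAccount)) return false;
        return Objects.equals(username, ((UserAccount) o).username);
    }

    @Override
    public int hashCode() {
        return Objects.hash(username);
    }
}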