How to execute a batch statement and LWT as a transaction in Cassandra - Java

I have two tables with the below model:
CREATE TABLE IF NOT EXISTS INV (
    CODE TEXT,
    PRODUCT_CODE TEXT,
    LOCATION_NUMBER TEXT,
    QUANTITY DECIMAL,
    CHECK_INDICATOR BOOLEAN,
    VERSION BIGINT,
    PRIMARY KEY ((LOCATION_NUMBER, PRODUCT_CODE)));

CREATE TABLE IF NOT EXISTS LOOK_INV (
    LOCATION_NUMBER TEXT,
    CHECK_INDICATOR BOOLEAN,
    PRODUCT_CODE TEXT,
    CHECK_INDICATOR_DDTM TIMESTAMP,
    PRIMARY KEY ((LOCATION_NUMBER), CHECK_INDICATOR, PRODUCT_CODE))
WITH CLUSTERING ORDER BY (CHECK_INDICATOR ASC, PRODUCT_CODE ASC);
I have a business operation where I need to update CHECK_INDICATOR in both tables and QUANTITY in the INV table.
As CHECK_INDICATOR is part of the key in the LOOK_INV table, I need to delete the existing row first and insert a new one.
Below are the three operations I need to perform atomically (either all are executed successfully or none are):
Delete the row from the LOOK_INV table.
Insert a new row into the LOOK_INV table.
Update QUANTITY and CHECK_INDICATOR in the INV table.
As the INV table is accessed by multiple threads, I need to make sure, before updating an INV row, that it has not been changed since it was last read.
I am using an LWT to update the INV table, conditioned on the VERSION column, and a batch operation for the deletion and insertion in the LOOK_INV table. I want to put all three operations in one batch, but since an LWT cannot be combined with the other statements in a batch, I have to execute them in the aforesaid fashion.
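For illustration, a simplified sketch of this approach (assuming the DataStax Java driver 3.x; variable and method names are made up):

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class InventoryWriter {

    private final Session session;

    public InventoryWriter(Session session) {
        this.session = session;
    }

    // Step 1: logged batch for the two LOOK_INV mutations.
    public void moveLookInvRow(String location, String product, boolean newIndicator) {
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        // The old row is assumed to carry the opposite indicator value.
        batch.add(new SimpleStatement(
                "DELETE FROM look_inv WHERE location_number = ? AND check_indicator = ? AND product_code = ?",
                location, !newIndicator, product));
        batch.add(new SimpleStatement(
                "INSERT INTO look_inv (location_number, check_indicator, product_code, check_indicator_ddtm) "
                        + "VALUES (?, ?, ?, toTimestamp(now()))",
                location, newIndicator, product));
        session.execute(batch);
    }

    // Step 2: separate LWT on INV, guarded by the VERSION column.
    public boolean updateInv(String location, String product,
                             java.math.BigDecimal quantity, boolean indicator, long readVersion) {
        ResultSet rs = session.execute(new SimpleStatement(
                "UPDATE inv SET quantity = ?, check_indicator = ?, version = ? "
                        + "WHERE location_number = ? AND product_code = ? IF version = ?",
                quantity, indicator, readVersion + 1, location, product, readVersion));
        return rs.wasApplied();   // false if the row changed since it was read
    }
}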
The problem with this approach is that in some scenarios the batch executes successfully but the update of the INV table fails with a timeout exception, and the data becomes inconsistent between the two tables.
Is there any feature provided by Cassandra to handle this type of scenario elegantly?

Caution with Lightweight Transactions (LWT)
Heavy use of lightweight transactions is currently considered a Cassandra anti-pattern, precisely because of the kind of performance issues you are suffering.
Here is a bit of context to explain.
Cassandra does not use RDBMS ACID transactions with rollback or locking mechanisms. It does not provide locking because of a fundamental constraint on all distributed data stores known as the CAP theorem. It states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
Because of this, Cassandra is not well suited to atomic multi-table operations, and you should not use it for that purpose.
It does provide lightweight transactions, which can replace locking in some cases. But because the Paxos protocol (the basis for LWT) involves a series of actions between nodes, there are multiple round trips between the node that proposes an LWT and the other replicas that are part of the transaction.
This has an adverse impact on performance and is one reason for the WriteTimeoutException error. In this situation you can't know whether the LWT operation has been applied, so you need to retry it in order to fall back to a stable state. Because LWTs are so expensive, the driver will not automatically retry them for you.
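For example, a minimal sketch of that re-read-and-retry handling, assuming the DataStax Java driver 3.x and the INV/VERSION schema from the question:

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class CasRetry {

    // Runs the conditional (IF version = ?) update once, and retries a single time
    // if a CAS write timeout leaves the outcome unknown.
    static boolean updateWithRetry(Session session, Statement conditionalUpdate,
                                   String location, String product, long expectedVersion) {
        try {
            return session.execute(conditionalUpdate).wasApplied();
        } catch (WriteTimeoutException e) {
            // The outcome is unknown: re-read the row to see whether the version moved.
            Row row = session.execute(
                    "SELECT version FROM inv WHERE location_number = ? AND product_code = ?",
                    location, product).one();
            if (row != null && row.getLong("version") == expectedVersion) {
                // The CAS has not taken effect anywhere, so it is safe to retry it once.
                return session.execute(conditionalUpdate).wasApplied();
            }
            // The version moved: either the timed-out write landed after all or another
            // client changed the row. The caller should re-read and redo the operation.
            return false;
        }
    }
}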
LWTs come with big performance penalties if used frequently, and we see some clients with serious timeout issues due to using them.
Lightweight transactions are generally a bad idea and should be used infrequently.
If you do require ACID properties on part of your workload but still need it to scale, consider shifting that part of your load to CockroachDB.
In summary, if you do need ACID transactions it is generally a lot easier to bring in a second technology.

Related

How does a database sequence manage a race condition?

I am writing an application which will be deployed on n nodes. The application's entity classes use the SEQUENCE generation strategy to generate the primary keys. Since there will be bulk inserts, we will be specifying an allocation size as well.
The concern is that when the application is deployed on n nodes and two nodes simultaneously request the next value from the defined sequence in the database:
Wouldn't there be a race condition?
Or does the sequence also have some lightweight locking mechanism to serve the requests sequentially, as happens with the IDENTITY strategy?
Or is a sequence not the right solution to this problem?
Kindly help. Thank you!
Think of a sequence as a table with one column storing an integer that represents the current id. Each time you insert a new entry, the following operations happen in a transaction:
The current value from SEQUENCE table is read
That value is assigned as ID to the new entry
The value from SEQUENCE is incremented
To answer your questions
The concurrency issues are addressed by the database.
Since inserts happen in a transaction (both simple and bulk inserts), the consistency on ID generation is enforced by the database engine via transactions (by the isolation level of the transaction to be more precise). Make sure your database engine supports transactions.
Sequence is the right solution, assuming your database engine supports transactions.
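For illustration, a minimal JPA sketch of the setup the question describes; the entity, generator, and sequence names are made up:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class Invoice {

    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "invoice_seq")
    @SequenceGenerator(name = "invoice_seq", sequenceName = "INVOICE_ID_SEQ", allocationSize = 50)
    private Long id;

    // With allocationSize = 50, each node fetches the sequence once per 50 inserts and
    // hands out ids from its own block; the database guarantees the blocks never overlap,
    // so concurrent nodes cannot generate the same key.
}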

Prevent violation of a UNIQUE constraint with Hibernate

I have a table like (id INTEGER, sometext VARCHAR(255), ....) with id as the primary key and a UNIQUE constraint on sometext. It gets used in a web server, where a request needs to find the id corresponding to a given sometext if it exists, otherwise a new row gets inserted.
This is the only operation on this table: there are no updates and no other operations. Its sole purpose is to persistently number the encountered values of sometext, which means I can't drop the id and use sometext as the PK.
I do the following:
First, I consult my own cache in order to avoid any DB access. Nearly always, this works and I'm done.
Otherwise, I use Hibernate Criteria to find the row by sometext. Usually, this works and again, I'm done.
Otherwise, I need to insert a new row.
This works fine, except when there are two overlapping requests with the same sometext. Then a ConstraintViolationException results. I'd need something like INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE (MySQL syntax) or MERGE (Firebird syntax).
I wonder what the options are?
AFAIK Hibernate merge works on the PK only, so it's inappropriate. I guess a native query might or might not help, as it may or may not be committed by the time the second INSERT takes place.
Just let the database handle the concurrency. Start a secondary transaction purely for inserting the new row. If it fails with a ConstraintViolationException, just roll that transaction back and read the new row.
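A minimal Hibernate sketch of that pattern; the SomeText entity, its fields, and its accessors are made up to match the table from the question:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.exception.ConstraintViolationException;

public class SomeTextDao {

    private final SessionFactory sessionFactory;

    public SomeTextDao(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // Returns the id for the given text, inserting it in its own short transaction if needed.
    public long idFor(String someText) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            SomeText entity = new SomeText(someText);
            session.save(entity);
            tx.commit();
            return entity.getId();
        } catch (ConstraintViolationException e) {
            // A concurrent request inserted the same text first; drop this transaction.
            if (tx.getStatus().canRollback()) {
                tx.rollback();
            }
        } finally {
            session.close();
        }
        // The row exists now; read its id in a fresh session.
        Session reader = sessionFactory.openSession();
        try {
            return reader.createQuery(
                    "select t.id from SomeText t where t.someText = :v", Long.class)
                    .setParameter("v", someText)
                    .uniqueResult();
        } finally {
            reader.close();
        }
    }

    @Entity
    public static class SomeText {
        @Id @GeneratedValue private Long id;
        @Column(unique = true) private String someText;

        SomeText() { }
        SomeText(String someText) { this.someText = someText; }
        Long getId() { return id; }
    }
}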
Not sure this scales well if the likelihood of a duplicate is high; it is a lot of extra work if some percentage of transactions (depending on the database) have to fail the insert and then reselect.
A secondary transaction minimizes the length of time the transaction that adds the new text takes. Depending on how the database handles it, the thread-1 transaction might cause the thread-2 select/insert to hang until the thread-1 transaction is committed or rolled back. Overall database design might also affect transaction throughput.
I don't necessarily question why sometext can't be a PK, but I do wonder why you need to break it out at all. Of course, with large volumes you might save substantial space if the sometext values are large; it almost seems like you're trying to emulate a Lucene index to give you a complete list of text values.

In MySQL, is it advisable to rely on uniqueness constraints or to manually check if the row is already present?

I have a table:
userId | subject
with uniqueness constraint on both combined.
Now I am writing thousands of rows to this table every few minutes. The data stream comes from a queue and may contain repeats. I have to make sure, however, that there is only one unique combination of userId, subject in the table.
Currently I rely on MySQL's uniqueness constraint, which throws an exception.
Another approach is to run a SELECT count(*) query to check whether the row is already present, and skip the insert if need be.
Since I want to write on average 4 rows per second, which approach is advisable?
Programming language: Java
EDIT:
Just in case I am not clear, the question here is whether relying on MySQL to throw an exception is better, or whether running a SELECT query before the INSERT operation is better, in terms of performance.
I thought a SELECT query is less CPU/IO intensive than an INSERT query. If I run too many INSERTs, wouldn't that create many locks?
MySQL is ACID and employs transactional locking, so relying on its uniqueness constraints is very standard. Note that you can do this either via PRIMARY KEY or UNIQUE KEY (but favour the former if you can).
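If you would rather not handle the exception in application code, MySQL can also drop duplicates for you with INSERT IGNORE (or resolve them with INSERT ... ON DUPLICATE KEY UPDATE), still relying on the same constraint. A minimal JDBC sketch, assuming a hypothetical user_subject table with a UNIQUE (user_id, subject) key:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class SubscriptionWriter {

    // Inserts (userId, subject) pairs, silently skipping combinations that already exist.
    static void insertBatch(Connection conn, List<String[]> rows) throws SQLException {
        // INSERT IGNORE leans on the UNIQUE (user_id, subject) constraint:
        // duplicates are dropped by the server instead of raising an exception.
        String sql = "INSERT IGNORE INTO user_subject (user_id, subject) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();   // ~4 rows/second is easily handled in one batch per poll
        }
    }
}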
A unique constraint is enforced against the complete committed dataset.
Several databases allow you to set the "transaction isolation level".
userId | subject
A      | 1
B      | 2
----------------
A      | 2
A      | 3
The two rows above the line are committed; every connection can read them. The two rows below the line are currently being written within your transaction. Within this connection, all four rows are visible.
If another thread / connection / transaction tries to store A-2, there will be an exception in one of the two transactions (the first one can commit its transaction, the second one can't).
Other isolation levels may fail earlier, but it is not possible to violate the unique-key constraint.

Best way to fetch data from a single database table with multiple threads?

We have a system where we collect data every second on user activity on multiple web sites. We dump that data into a database X (say MS SQL Server). We now need to fetch data from this single table in database X and insert it into database Y (say MySQL).
We want to fetch time-based data from database X through multiple threads so that we fetch as fast as we can. Once the data is fetched and stored in database Y, we will delete it from database X.
Are there any best practices for this sort of design? Any specific things to take care of in table design, like sharding or something? Are there any other things we need to take care of to make sure we fetch the data as fast as we can from threads running on multiple machines?
Thanks in advance!
Ravi
If you are moving data from one database to another, you will not gain any advantages by having multiple threads doing the work. It will only increase contention.
If both databases are of the same type, you should be looking into the vendor's specific tools for replication. These will basically always outperform homegrown solutions.
If the databases are different (vendors), you have to decide upon an efficient mechanism for
identifying new/updated/deleted rows (Triggers, range based queries, full dumps)
transporting the data (unload to file & FTP, pull/push from a program)
loading the data on the other database (import, bulk insert)
Without more details, it's impossible to be more specific than that.
Oh, and the two most important considerations that will influence your choice are:
What is the expected data volume?
Longest acceptable delay between row creation in source DB and availability in Target DB
I would test (by measurement) your assumption that multiple slurper threads will speed things up. Without more specifics in your question, it looks like you want to do an ETL (extract, transform, load) process with your database; these are pretty efficient when you let database-specific technology handle them, especially if you're interested in aggregation etc.
There are two levels of concern in your issue:
The transaction between these two databases:
This is important because you delete data from the source database. You must ensure that data is only removed from X once it has been stored into Y successfully. On the other side, you must ensure that the deletion of data from X succeeds, to prevent re-inserting the same data into Y.
The performance of transferring data:
If database X receives incoming data continuously, i.e. it is an online database, it is not good practice to simply collect all the data, store it into Y, and delete it in one go. Instead, plan a batch size: the program starts a transaction for each batch, and runs repeatedly until the number of rows in X is below the batch size.
In both databases, you should add a table to record the batch being processed.
There are three states in processing.
INIT - The start of batch, this value should be synchronized between two databases
COPIED - In database Y, the insertion of data and the update of this status should be in one transaction.
FINISH - In database X, the deletion of data and the update of this status should be in one transaction.
When the program runs, it first checks for batches in the 'INIT' or 'COPIED' state and resumes processing them.
If X has an "INIT" record and Y doesn't, insert the same INIT record into Y, then perform the insertion into Y.
If a record in Y is "COPIED" and the one in X is "INIT", update the state of X to "COPIED", then perform the deletion from X.
If a record in X is "FINISH" and the corresponding record in Y is "COPIED", update the state of Y to "FINISH".
In conclusion, processing the data in batches gives you a chance to optimize the transfer between the two databases. The batch size dominates the efficiency of the transfer and depends on two factors: how heavily the databases are used concurrently by other operations, and the tuning parameters of your databases. In most situations, the write throughput of Y is likely to be the bottleneck.
Threads are not the way to go. The database(s) is the bottleneck here. Multiple threads will only increase contention. Even if 10 processes are jamming data into SQL Server, a single thread (rather than many) can pull it out faster. There is absolutely no doubt about that.
The SELECT itself can cause locks in the main table, reducing the throughput of the INSERTs, so I would "get in and get out" as fast as possible. If it were me, I would:
SELECT the rows based on a range query (date, recno, whatever), dump them into a file, and close the result set (cursor).
DELETE the rows based on the same range query.
Then process the dump. If possible, the dump format should be amenable to bulk-load into MySQL.
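A minimal JDBC sketch of that dump-then-delete slice; the activity table, its columns, and the assumption that the upper time bound is already safely in the past are all illustrative:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class ActivityDumper {

    // Dumps one time slice of activity rows to a TSV file, then deletes the same slice.
    // The 'to' bound should already be in the past so the slice is closed before it is read.
    static Path dumpSlice(Connection sqlServer, Timestamp from, Timestamp to)
            throws SQLException, IOException {
        Path dump = Files.createTempFile("activity-", ".tsv");
        // Get in and get out: stream the slice to a file and close the cursor quickly.
        try (PreparedStatement ps = sqlServer.prepareStatement(
                     "SELECT user_id, url, hit_time FROM activity WHERE hit_time >= ? AND hit_time < ?");
             BufferedWriter out = Files.newBufferedWriter(dump)) {
            ps.setTimestamp(1, from);
            ps.setTimestamp(2, to);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.write(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getTimestamp(3) + "\n");
                }
            }
        }
        // Same range predicate for the delete, so nothing outside the dumped slice is touched.
        try (PreparedStatement del = sqlServer.prepareStatement(
                     "DELETE FROM activity WHERE hit_time >= ? AND hit_time < ?")) {
            del.setTimestamp(1, from);
            del.setTimestamp(2, to);
            del.executeUpdate();
        }
        return dump;   // load this file into MySQL, e.g. with LOAD DATA INFILE
    }
}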
I don't want to beat up your architecture, but overall the design sounds problematic. SELECTing and DELETEing rows from a table undergoing a high INSERTion rate is going to create huge locking issues. I would be looking at "double-buffering" the data in the SQL Server.
For example, every minute the inserts switch between two tables. For example, in the first minute INSERTs go into TABLE_1, but when the minute rolls over they start INSERTing into TABLE_2, the next minute back to TABLE_1, and so forth. While INSERTS are going into TABLE_2, SELECT everything from TABLE_1 and dump it into MySQL (as efficiently as possible), then TRUNCATE the table (deleting all rows with zero penalty). This way, there is never lock-contention between the readers and writers.
Coordinating the rollover point between TABLE_1 and TABLE_2 is the tricky part. But it can be done automatically through a clever use of SQL Server partitioned views.

Insert a lot of data into a database in very small inserts

So I have a database where a lot of data is being inserted from a Java application. Usually I insert into table1 and get the last id, then insert into table2 and get the last id from there, and finally insert into table3 and get that id as well to work with it within the application. I insert around 1000-2000 rows of data every 10-15 minutes.
Using a lot of small inserts and selects on a production web server is not really good, because it sometimes bogs down the server.
My question is: is there a way to insert data into table1, table2, and table3 without such a huge number of selects and inserts? Is there a SQL-fu technique I'm missing?
Since you're probably relying on auto_increment primary keys, you have to do the inserts one at a time, at least for table1 and table2, because MySQL won't give you more than the very last key generated.
You should never have to select. You can get the last inserted id from the Statement using the getGeneratedKeys() method. See an example showing this in the MySQL manual for the Connector/J:
http://dev.mysql.com/doc/refman/5.1/en/connector-j-usagenotes-basic.html#connector-j-examples-autoincrement-getgeneratedkeys
Other recommendations:
Use multi-row INSERT syntax for table3.
Use ALTER TABLE DISABLE KEYS while you're importing, and re-enable them when you're finished.
Use explicit transactions. I.e. begin a transaction before your data-loading routine, and commit at the end. I'd probably also commit after every 1000 rows of table1.
Use prepared statements.
Unfortunately, you can't use the fastest method for bulk load of data, LOAD DATA INFILE, because that doesn't allow you to get the generated id values per row.
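For illustration, a minimal JDBC sketch that combines getGeneratedKeys(), prepared statements, and an explicit transaction; table and column names are made up:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ThreeTableInserter {

    // Inserts one table1/table2/table3 chain in a single transaction without any SELECTs.
    static long insertChain(Connection conn, String v1, String v2, String v3) throws SQLException {
        conn.setAutoCommit(false);                       // explicit transaction around the chain
        try (PreparedStatement p1 = conn.prepareStatement(
                     "INSERT INTO table1 (value) VALUES (?)", Statement.RETURN_GENERATED_KEYS);
             PreparedStatement p2 = conn.prepareStatement(
                     "INSERT INTO table2 (table1_id, value) VALUES (?, ?)", Statement.RETURN_GENERATED_KEYS);
             PreparedStatement p3 = conn.prepareStatement(
                     "INSERT INTO table3 (table2_id, value) VALUES (?, ?)", Statement.RETURN_GENERATED_KEYS)) {

            p1.setString(1, v1);
            p1.executeUpdate();
            long id1 = readKey(p1);                      // no SELECT LAST_INSERT_ID() round trip

            p2.setLong(1, id1);
            p2.setString(2, v2);
            p2.executeUpdate();
            long id2 = readKey(p2);

            p3.setLong(1, id2);
            p3.setString(2, v3);
            p3.executeUpdate();
            long id3 = readKey(p3);

            conn.commit();
            return id3;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }

    private static long readKey(PreparedStatement ps) throws SQLException {
        try (ResultSet keys = ps.getGeneratedKeys()) {
            keys.next();
            return keys.getLong(1);
        }
    }
}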
There's a lot to talk about here:
It's likely that network latency is killing you if each of those INSERTs is another network roundtrip. Try batching your requests so they only require a single roundtrip for the entire transaction.
Speaking of transactions, you don't mention them. If all three of those INSERTs need to be a single unit of work you'd better be handling transactions properly. If you don't know how, better research them.
Try caching requests if they're reused a lot. The fastest roundtrip is the one you don't make.
You could redesign your database so that the primary key is not a database-generated, auto-incremented value, but rather a client-generated UUID. Then you could generate all the keys for every record upfront and batch the inserts however you like.
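A minimal sketch of that idea with client-generated UUID keys and JDBC batching (table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class UuidBatchInserter {

    // Generates all keys up front so the three inserts can be batched without
    // waiting for the database to hand back generated ids.
    static void insertChain(Connection conn, String v1, String v2, String v3) throws SQLException {
        String id1 = UUID.randomUUID().toString();
        String id2 = UUID.randomUUID().toString();
        String id3 = UUID.randomUUID().toString();

        conn.setAutoCommit(false);
        try (PreparedStatement p1 = conn.prepareStatement("INSERT INTO table1 (id, value) VALUES (?, ?)");
             PreparedStatement p2 = conn.prepareStatement("INSERT INTO table2 (id, table1_id, value) VALUES (?, ?, ?)");
             PreparedStatement p3 = conn.prepareStatement("INSERT INTO table3 (id, table2_id, value) VALUES (?, ?, ?)")) {

            p1.setString(1, id1); p1.setString(2, v1); p1.addBatch();
            p2.setString(1, id2); p2.setString(2, id1); p2.setString(3, v2); p2.addBatch();
            p3.setString(1, id3); p3.setString(2, id2); p3.setString(3, v3); p3.addBatch();

            // Many chains can be accumulated before flushing; executing parents before
            // children keeps any foreign keys satisfied.
            p1.executeBatch();
            p2.executeBatch();
            p3.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}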
