Cassandra - transaction support - Java

I am working through Apache Cassandra, trying out sample data insertion, retrieval, etc.
The documentation is very limited.
I am interested in knowing:
Can we completely replace a relational DB like MySQL/Oracle with Cassandra?
Does Cassandra support rollback/commit?
Do Cassandra clients (Thrift/Hector) support fetching associated objects (objects where we save one super column's key in another super column family)?
This will help me a lot to proceed further.
Thank you in advance.

Short answer: No.
By design, Cassandra values availability and partition tolerance over consistency. Basically, it's not possible to get acceptable latency while maintaining all three qualities: one of them has to be sacrificed. This is the CAP theorem.
The amount of consistency is configurable in Cassandra via consistency levels, but there are no rollback semantics: even if a first write succeeds, there's no guarantee you'll be able to roll your changes back.
If you want to build an application with transactions or locks on top of Cassandra, you probably want to look at ZooKeeper, which can be used to provide distributed synchronization.
You might've already guessed this, but Cassandra doesn't have foreign keys or anything like that; it has to be handled manually. I'm not that familiar with Hector, but a higher-level client might be able to do this semi-automatically.
Whether you can easily replace an RDBMS with Cassandra depends on your specific use case. Based on your questions, it might be hard to do so in yours.

In version 2.x you can combine CQL statements in a logged batch, which is atomic: either all of the statements succeed or none of them do. You can also read about lightweight transactions.
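For illustration, here is a rough sketch of both features using the DataStax Java driver (2.x/3.x API); the keyspace, table, and column names are made up:

import com.datastax.driver.core.*;

public class BatchAndLwtExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace"); // hypothetical keyspace

        java.util.UUID userId = java.util.UUID.randomUUID();
        String email = "user@example.com";

        // Logged batch: all statements are applied atomically, or none are.
        PreparedStatement insertUser = session.prepare(
                "INSERT INTO users (id, email) VALUES (?, ?)");
        PreparedStatement insertByEmail = session.prepare(
                "INSERT INTO users_by_email (email, id) VALUES (?, ?)");
        BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
        batch.add(insertUser.bind(userId, email));
        batch.add(insertByEmail.bind(email, userId));
        session.execute(batch);

        // Lightweight transaction: a Paxos-backed compare-and-set.
        ResultSet rs = session.execute(
                "INSERT INTO users_by_email (email, id) VALUES (?, ?) IF NOT EXISTS",
                email, userId);
        System.out.println("applied: " + rs.wasApplied()); // false if the row existed

        cluster.close();
    }
}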
Beyond that, there are several persistence managers for Cassandra, such as Achilles and Kundera; with them you can achieve foreign-key-like behavior at the client level.

If ZooKeeper is able to handle transactions with Oracle-level quality, then it's a done deal. Relations and relational integrity are no problem to implement on top of ANY database: a foreign key is just another data field. ACID/transactions are the key issue.

Instead of commit and rollback, you must use a batch.
A batch is applied atomically: either all the records, across multiple tables, are written, or none of them are.
For example (DataStax C# driver; stringCommand and values are placeholders):
var batch = new BatchStatement();
var prepared = session.Prepare(stringCommand);
batch.Add(prepared.Bind(values)); // bind the prepared statement before adding it
var result = session.ExecuteAsync(batch);

Of course you can, but it completely depends on your use case. If you don't pick the right DB for your use case, you'll have to take care of a lot of things on your own. For example, an RDBMS doesn't give you geographic distribution out of the box; you'd need to find a way to do it yourself. With Cassandra, you lose some ACID properties under some conditions, and you need to handle those on the application side.
Yes, but only for certain limited use cases. You can use batches: they give you an all-or-nothing rollback of sorts, but you lack isolation. I am not sure whether this feature exists in OSS Cassandra; see the documentation for more info.
I don't understand what you mean by a super column. If you're asking whether you can store an id from one table in another table's columns and look it up, then yes, you can do that.
Overall, Cassandra is not ACID compliant, but some features, like batches and lightweight transactions, help you be ACID compliant under certain conditions.

Related

Is there a way to ensure uniqueness of a field without relying on the database

Without relying on the database, is there a way to ensure a field (let's say a User's emailAddress) is unique?
Some commonly attempted approaches that fail:
Check first whether the emailAddress exists (by querying the DB), and if not, create the user. Obviously, in the window between check and act, some other thread can create a user with the same email. Hence this solution is no good.
Apply a language-level lock on the method responsible for creating the user. This fails because we need redundant service instances for performance reasons, and the lock only covers a single JVM.
Use an event store (like an Akka actor's mailbox), the event being an AddUser message. But since the actor's behavior is asynchronous, the requestor (sender) can't be notified that user creation with a unique email was successful. Moreover, how do two requests (with the same email) know they contain a unique email? This may get complicated.
The database, being the single source of data that every thread and every service instance writes to, is the natural place to implement the unique constraint. But this holds true for relational databases.
Then what about NoSQL databases? Some do allow for a unique constraint, but it's not their native behavior (or maybe for some it is).
So, if we're not using the database to implement uniqueness of a field, what could the options be?
I think your question is more generic - "how do I ensure a database write action succeeded, and how do I handle cases where it didn't?". Uniqueness is just one failure mode - you may be attempting to insert a value that's too big, or of the wrong data type, or that doesn't match a foreign key constraint.
Relational databases solve this through being ACID-compliant, and throwing errors for the client to deal with when a transaction fails.
You want (some of) the benefits of ACID without the relational database. That's a fairly big topic of conversation. The obvious way to solve this is to introduce the concept of "transaction" in your application layer. For instance, in your case, you might send a "create account(emailAddress, name, ...)" message, and have the application listen for either an "accountCreated" or "accountCreationFailed" response.
The recipient of that message is responsible for writing to the database, and you have a couple of options there. One is to lock that thread, so only one process can write to the database at any time; that's not super scalable. The other mechanism I've used is introducing status flags: you write the account data to the database with a "draft" flag, then check for your constraints (including uniqueness), and set the "draft" flag to "validated" if the constraints are met (i.e. there is no other record with the same email address), and to "failed" if they are not.
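A rough sketch of that status-flag flow in Java; the accounts DAO, the Account class, and the conflict rule are hypothetical stand-ins, not a real API:

// Everything here is a hypothetical sketch: Account, Status, and the
// accounts DAO stand in for your own persistence layer.
String createAccount(String emailAddress, String name) {
    long id = accounts.insert(new Account(emailAddress, name, Status.DRAFT));
    // Conflict check: any other record with this email that is already
    // VALIDATED, or a DRAFT with a smaller id (a deterministic tie-break,
    // so two racing drafts don't both fail).
    boolean conflict = accounts.hasConflictingEmail(emailAddress, id);
    accounts.updateStatus(id, conflict ? Status.FAILED : Status.VALIDATED);
    return conflict ? "accountCreationFailed" : "accountCreated";
}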
To check for uniqueness you need to store the "state" of the program. For safety, you need to be able to apply changes to that state transactionally.
You can use database transactions. A few of the NoSQL databases support transactions too, for example Redis and MongoDB; you have to check each vendor separately to see how they support them. In this setup, each client connects to the database and it handles all of the details for you. Also, depending on your use case, you should be careful about the isolation level configuration.
If durability is not a concern, then you can use in-memory databases that support transactions.
Whichever state store you choose, it should support transactions. There are several ways to implement transactions and achieve consistency; many relational databases like PostgreSQL achieve this by implementing the MVCC algorithm. In a distributed environment you have to look at distributed transaction protocols such as 2PC, Paxos, etc.
Normally everybody relies on available datastore solutions unless there is an unusual or very specific requirement in the project.
A final note: the communication pattern is not related to the underlying problem here. For example, in the Actor case you mentioned, at the end of the day each actor has to query the state to find whether an email exists or not. If your state store supports serializability, then there is no problem and conflicts will not happen (communicating the error to the client is another issue). Suppose you are using PostgreSQL: when an insert/update query is issued, it is wrapped in a transaction, and the underlying MVCC algorithm takes care of everything. In an advanced, distributed environment you can use data stores that support distributed transactions, like CockroachDB.
If you want to dive deep, you can research these keywords: ACID, isolation levels, atomicity, serializability, CAP theorem, 2PC, MVCC, distributed transactions, distributed locks, ...
NoSQL databases provide different, weaker, guarantees than relational databases. Generally, the tradeoff is you give up ACID guarantees in exchange for increased scalability in the dimensions that matter for your application.
It's possible to provide some kind of uniqueness guarantee, but subject to certain tradeoffs. With NoSQL, there are always tradeoffs.
If your NoSQL store supports optimistic concurrency control, maybe this approach will work:
Store a separate document that contains the set of all emailAddress values across all documents in your NoSQL table. There is a single instance of this document at any given time.
Each time you want to save a document containing an emailAddress, first confirm that the address is unique:
Perform the following actions, protected by optimistic locking; you can retry on the backend if this fails due to a concurrent update (see the sketch after this list):
Read this "all emails" document.
Confirm the email isn't present.
If it's not present, add the email address to the "all emails" document.
Save it.
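A minimal sketch of that retry loop, assuming a hypothetical store that returns each document together with a version number and offers a conditional, compare-and-swap style save:

// All names here are hypothetical; they stand in for whatever
// optimistic-concurrency primitives your NoSQL store provides.
void reserveEmail(String email) {
    while (true) {
        VersionedDoc<Set<String>> allEmails = store.get("all-emails");
        if (allEmails.value().contains(email)) {
            throw new DuplicateEmailException(email);
        }
        Set<String> updated = new HashSet<>(allEmails.value());
        updated.add(email);
        // Succeeds only if nobody saved a newer version in the meantime;
        // otherwise loop and re-check against the fresh document.
        if (store.saveIfVersionMatches("all-emails", updated, allEmails.version())) {
            return;
        }
    }
}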
You've now traded one problem (the lack of unique constraints) for another (the inability to synchronise updates across your original document and this new "all emails" document). This may or may not be acceptable; it depends on the guarantees your application needs to provide.
E.g. maybe you can accept that an email gets added to "all emails", that saving the related document to your other "table" subsequently fails, and that the email address is then no longer usable. You could clean this up with a batch job somehow. Not sure.
The index of emails could also be stored in some other service (e.g. a persistent cache). The same problem exists: you need to keep the index and your document store in sync somehow.
There's no easy solution. For a detailed overview of the relevant concepts, I'd recommend Designing Data-Intensive Applications by Martin Kleppmann.

Write-through cache with Redis

I have been pondering how to reliably implement a write-through caching mechanism for storing realtime data.
Basically what we need is this:
Save data to Redis -> Save to database (underlying)
Read data from Redis <- Read from database in case unavailable in cache
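In code, the intended flow would look roughly like this (a sketch using Jedis for Redis; saveToDb/loadFromDb are hypothetical MongoDB wrappers):

import redis.clients.jedis.Jedis;

class WriteThroughCache {
    private final Jedis redis = new Jedis("localhost");

    void write(String key, String value) {
        redis.set(key, value);  // 1) save to Redis
        saveToDb(key, value);   // 2) write through to the underlying database
        // Note: nothing makes these two steps atomic -- that gap is
        // exactly the problem described below.
    }

    String read(String key) {
        String cached = redis.get(key);
        if (cached != null) return cached;
        String fromDb = loadFromDb(key);            // cache miss: hit the database
        if (fromDb != null) redis.set(key, fromDb); // repopulate the cache
        return fromDb;
    }

    private void saveToDb(String key, String value) { /* MongoDB upsert */ }
    private String loadFromDb(String key) { /* MongoDB find */ return null; }
}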
The resources online to help in the implementation of this caching strategy seem scarce.
The problem is:
1) No built-in transaction possibility between Redis and the database (Mongo in my case).
2) No transactions mean that writes to the underlying database are unreliable.
The most straightforward way I can see to implement this is by using a broker like Kafka and putting messages on a persistent queue to be processed later.
Kafka would then be the entity responsible for reliable processing.
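For illustration, enqueueing each write with the plain Kafka Java producer might look like this (the topic name is made up; a separate consumer would persist each record to Mongo):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EnqueueWrite {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait until the write is replicated before acking

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A consumer elsewhere reads "pending-writes" and persists to Mongo.
            producer.send(new ProducerRecord<>("pending-writes",
                    "user:42", "{\"name\":\"Bob\"}"));
        }
    }
}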
Another way would be a custom implementation: a scheduler that checks the Redis database for dirty records. On first thought there seem to be some tradeoffs to this approach, and I would rather not go down that road if possible.
I am looking for other options on how this could be implemented, or whether this is in fact the most viable approach.
So the better approach is, as you mentioned above, to use Kafka with a consumer that stores the data to Mongo. But read up on its delivery guarantees: as I remember, exactly-once is guaranteed only in Kafka Streams (between two topics); in your case you get an at-least-once guarantee, so your database writes should be idempotent. Also, turn AOF on in Redis so you don't lose data. And don't forget that with this setup the DB is only eventually consistent, with all the consequences that brings.
On review, I will use MongoDB as a single datastore, without Redis at all.
Premature optimization is evil, I guess.
Anyhow, I can add additional architecture later, after benchmarking.
Refactoring towards a cache shouldn't be too hard.
Scaling is an additional concern, so I shouldn't be bothered with it during development right now.
Accepted @Ipave's answer; going with a single datastore for the moment.

When to use Hibernate vs. simple ResultSets for a small application

I just started working on upgrading a small component in a distributed Java application. The main application is a rather complicated applet/servlet combo running on JBoss, and it extensively uses Hibernate for its data access. The component I am working on, however, is a very straightforward data-importing service.
Basically the workflow is
Listen for a network event
Parse the data packet, extract a set of identifiers
Map the identifier set to a primary key in our database
Parse the rest of the packet and insert items in a related table using the foreign key found in step 3
Repeat
In the previous version of this component, a Hibernate-based DAL was used, but it is no longer usable for a variety of reasons (in particular, it is EOL), so I am in charge of replacing the data access layer for this component.
So on the one hand I think I should use Hibernate, because that's what the rest of the application does; on the other hand, I think I should just use regular java.sql.* classes, because my requirements are really straightforward and aren't expected to change any time soon.
So my question is (and I understand it is subjective): at what point is the added complexity of an ORM tool (in terms of configuration, dependencies, ...) worth it?
UPDATE
Due to the way the DataAccessLayer for the main application was written (weird dependencies), I cannot easily reuse it; I would have to implement my own.
Why is the Spring-Hibernate combination used? Because with plain JDBC, even a simple operation takes a lot of steps: getting a connection, creating a statement, handling the ResultSet, and wrapping each step in exception handling.
But with Spring and Hibernate, all you have to write is:
public PostProfiles findPostProfilesById(long id) {
    List list = getHibernateTemplate().find("from PostProfiles where id = ?", id);
    return (PostProfiles) list.get(0);
}
Everything else is taken care of by the framework. I hope this solves your dilemma.
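For comparison, roughly the same lookup in plain JDBC might look like this (java.sql.* imports and a DataSource field assumed; the table and column names are made up):

public PostProfiles findPostProfilesById(long id) throws SQLException {
    String sql = "SELECT id, name FROM post_profiles WHERE id = ?";
    try (Connection con = dataSource.getConnection();
         PreparedStatement ps = con.prepareStatement(sql)) {
        ps.setLong(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            if (!rs.next()) return null;
            return new PostProfiles(rs.getLong("id"), rs.getString("name"));
        }
    }
}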
I think the answer really depends on your skill set. It would probably take a similar amount of time to craft a simple solution involving a handful of tables either way (Hibernate or raw JDBC) if you are comfortable with both techniques.
As I am pretty comfortable with Hibernate, I'd just choose it, since I prefer working at a higher level and not worrying about the things Hibernate handles for me. Yes, it has its own glitches, but especially for simple data models it does the job, and does it well.
The only few reasons why I would choose plain JDBC would be:
uber-complicated, maximum-optimized SQL that is performance critical;
Hibernate being stupid and not capable of expressing what I want.
And especially since you say you are already managing other entities with Hibernate, why not keep your code in the same style everywhere?
I think you are better off using the JDBC API. From what you describe, the two operations (select the foreign key from one table, insert into the related table) can easily be executed with a simple stored procedure call.
The advantage of this technique is that you can manage transactions/exceptions within your stored procedure.
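Sketched with JDBC's CallableStatement (the procedure name and parameters are invented for illustration):

// java.sql.* imports assumed; "insert_packet_items" is a hypothetical
// stored procedure that maps the identifiers to the foreign key and
// inserts the related rows inside one transaction.
try (Connection con = dataSource.getConnection();
     CallableStatement cs = con.prepareCall("{call insert_packet_items(?, ?)}")) {
    cs.setString(1, identifierSet);   // identifiers parsed from the packet
    cs.setString(2, packetPayload);   // the rest of the packet
    cs.execute();
}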

Restrict Postgres access from Java clients by using a Java program on a server

Perhaps this question is not very clear, but I didn't find better words for the heading, which briefly describes the problem I want to deal with.
I want to restrict access to Postgres from a Java desktop application.
The background:
Suppose you have two apps running, and the first application has to do some complex calculations on the basis of data in the DB. To nail down the immutability of the data in the DB, I'd like to lock the DB against insert, update, and delete operations. On the client side, I think it's impossible to handle this behaviour satisfactorily. So I thought about using a little Java app on the server side that works like a proxy: it passes CRUD (Create, Read, Update, Delete) operations through until it receives a command to lock. After a lock, it rejects all CUD operations until it receives an unlock command from the locking client or a timeout is reached.
Questions:
What do you think about this approach?
Is it possible to lock a Database while using such an approach?
Would you prefer Java SE or Java EE for the server-side Java app?
Thanks in advance.
Why not use transactions in your operations? The database has features to maintain data integrity itself, rather than resorting to a brute operation such as a total-database lock.
This locking mechanism you describe sounds like it would be a pain for the users. Are the users initiating the lock, or is the software itself? If it's the users, you can expect some problems when Bob hits lock and then goes to lunch for 2 hours, forgetting to unlock the database first...
Indeed... there are a few proper ways to deal with this problem.
Just lock the tables in your code. PostgreSQL has commands for locking entire tables (e.g. LOCK TABLE my_table IN EXCLUSIVE MODE) that you could run from your client application.
Pick a transaction isolation level that doesn't have the problem of reading data that was committed after your txn started (BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ).
Of these, by far the most efficient is to use repeatable read as your isolation level. Postgres supports this quite efficiently, and it will give you a consistent view of the data without such heavy locking of the db.
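With JDBC, pinning the calculation's transaction to that isolation level looks like this (java.sql.* imports assumed; con comes from your DataSource):

Connection con = dataSource.getConnection();
try {
    con.setAutoCommit(false);
    con.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
    // ... run all of the calculation's queries on this connection;
    // they all see one consistent snapshot of the data.
    con.commit();
} finally {
    con.close();
}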
Yeah, I thought about transactions, but in this case I can't use them. I'm sorry I didn't mention that explicitly. So assume the following simple case:
A calculation closes one area of responsibility. After the calculation, a new one is opened and new inserts are dedicated to it. But during the calculation process, inserts, updates, and deletes on the data of the (currently calculated) area of responsibility are not allowed. Moreover, deletes are strictly prohibited, because the data has to be archived.
So IMO the use of transactions doesn't fit this requirement. Or did I miss something?
PS: (off topic) @jsight: I recently read that internally Postgres maps "repeatable read" to "serializable", so using "repeatable read" gets you more restriction than you would perhaps expect.

When can/should you go whole hog with the ORM approach?

It seems to me that introducing an ORM tool is supposed to make your architecture cleaner, but for efficiency I've found myself bypassing it and iterating over a JDBC ResultSet on occasion. This leads to an uncoordinated tangle of artifacts instead of a cleaner architecture.
Is this because I'm applying the tool in an invalid context, or is it deeper than that?
When can/should you go whole hog with the ORM approach?
Any insight would be greatly appreciated.
A little background:
In my environment I have about 50 client computers and 1 reasonably powerful SQL Server.
I have a desktop application in which all 50 clients are accessing the data at all times.
The project's Data Model has gone through a number of reorganizations for various reasons including clarity, efficiency, etc.
My Data Model's history
JDBC calls directly
DAO + POJOs without relations between the POJOs (basically wrapping the JDBC).
Added relations between POJOs, implementing lazy loading but just hiding the inter-DAO calls.
Jumped onto the Hibernate bandwagon after seeing how "simple" it made data access (it made inter POJO relations trivial) and because it could decrease the number of round trips to the database when working with many related entities.
Since it was a desktop application, keeping Sessions open long-term was a nightmare, and it ended up causing a whole lot of issues.
Stepped back to a partial DAO/Hibernate approach that allows me to make direct JDBC calls behind the DAO curtain while at the same time using Hibernate.
Hibernate makes more sense when your application works on object graphs that are persisted in the RDBMS. If instead your application logic works on a 2-D matrix of data, fetching it via direct JDBC works better. Although Hibernate is written on top of JDBC, it has capabilities that might be non-trivial to implement in JDBC yourself. For example:
Say the user views a row in the UI and changes some of the values, and you want to fire an update query for only those columns that actually changed.
To avoid deadlocks, you need to maintain a global order for the SQL statements in a transaction. Getting this right in JDBC might not be easy.
Easily setting up optimistic locking (see the sketch after this list). When you use JDBC, you need to remember to include this in every update query.
Batch updates, lazy materialization of collections etc might also be non-trivial to implement in JDBC.
(I say "might be non-trivial" because it can of course be done, and you might be a super hacker :))
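As an example of the optimistic-locking point above: with JPA annotations, Hibernate needs only a version column (the entity is illustrative):

import javax.persistence.*;

@Entity
public class Item {
    @Id @GeneratedValue
    private Long id;

    private String name;

    // Hibernate increments this on every update and appends
    // "WHERE version = ?" to the UPDATE statement; a stale row
    // fails with an optimistic-locking exception.
    @Version
    private int version;
}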
Hibernate also lets you fire your own SQL queries, in case you need to.
Hope this helps you decide.
PS: Keeping the Session open on a remote desktop client and running into trouble is really not Hibernate's fault; you would run into the same issue if you kept a plain JDBC Connection to the DB open for that long.
