There doesn't seem to be any direct way to know the number of affected rows in Cassandra for UPDATE and DELETE statements.
For example if I have a query like this:
DELETE FROM xyztable WHERE PKEY IN (1,2,3,4,5,6);
Now, of course, since I've passed 6 keys, it is obvious that 6 rows will be affected.
But, as in the RDBMS world, is there any way to know the number of affected rows for UPDATE/DELETE statements with the DataStax driver?
I've read here that Cassandra gives no feedback on write operations.
Beyond that, I could not find any other discussion of this topic through Google.
If that's not possible, can I be sure that with the type of query given above, it will either delete all the rows or fail to delete all of them?
In the eventually consistent world you can look at these operations as saving a delete request and, depending on the requested consistency level, waiting for confirmation from several nodes that the request has been accepted. The request is then delivered to the other replicas asynchronously.
Since there is no dependency on anything like foreign keys, nothing should stop the data from being deleted once the request has been successfully accepted by the cluster.
However, there are a lot of ifs. For example, deleting data at consistency level ONE, successfully accepted by one node and followed by an immediate hard failure of that node, may result in the loss of that delete if it was not replicated before the failure.
Another example: during the deletion, one node was down and stayed down for a significant amount of time, longer than gc_grace_seconds, i.e., longer than it takes for the tombstones to be removed along with the deleted data. If that node is then recovered, all the data that was deleted from the rest of the cluster, but not from this node, will suddenly be brought back to the cluster.
So in order to avoid these situations, and to consider operations successful and final, a Cassandra admin needs to implement some measures, including regular repair jobs (to make sure all nodes are up to date). Applications also need to decide what matters more: faster writes at consistency level ONE at the expense of possible data loss, or lower performance at higher consistency levels with less possibility of data loss.
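To make the consistency-level trade-off concrete, here is a minimal sketch using the DataStax Java driver 4.x (the keyspace name and a locally reachable cluster are assumptions for illustration). Note that a successful execute() only tells you the coordinator accepted the write at the requested level, nothing about rows:

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class DeleteWithConsistency {
    public static void main(String[] args) {
        // Assumes a cluster reachable with default contact points and a "demo" keyspace.
        try (CqlSession session = CqlSession.builder().withKeyspace("demo").build()) {
            // Wait for acknowledgement from a quorum of replicas instead of just one.
            SimpleStatement delete = SimpleStatement
                    .newInstance("DELETE FROM xyztable WHERE pkey IN (1,2,3,4,5,6)")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(delete);
            // Success here means the write was accepted at QUORUM; there is no
            // "affected rows" count in the response.
        }
    }
}
```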
There is no way to do this in Cassandra because the model for writes, deletes, and updates in Cassandra is basically the same. In all of those cases a cell is added to the table which has either the new information or information about the delete. This is done without any inspection of the current DB state.
Without checking the rest of the replicas and doing a full merge on the row, there is no way to tell whether any given operation will actually affect the current read state of the database.
This leads to the oft-cited anti-pattern of "reading before a write." In Cassandra you are meant to write as fast as possible, and if you need history, use a data structure that preserves a log of modifications rather than just the current state.
There is one option for doing queries like this, using the CAS syntax ("IF some condition, then apply the write"), but this is a very expensive operation compared to a normal write and should be used sparingly.
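For completeness, this is roughly what the CAS (lightweight transaction) route looks like with the DataStax Java driver 4.x; the keyspace/table names are illustrative, and the condition has to be evaluated per row (IN lists aren't allowed), which is part of why it's expensive:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class ConditionalDelete {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("demo").build()) {
            // Lightweight transaction: only applied if the row actually exists.
            ResultSet rs = session.execute(
                    "DELETE FROM xyztable WHERE pkey = 1 IF EXISTS");
            // wasApplied() reflects the [applied] column of the LWT result,
            // i.e. whether the condition held. This costs a Paxos round.
            System.out.println("Row existed and was deleted: " + rs.wasApplied());
        }
    }
}
```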
Below are my assumptions/questions. Please point out anything that is wrong in my understanding.
By reading the documentation, I understood that:
ZooKeeper writes go to the leader and are replicated to the followers. A read request can be served by a follower itself, and hence reads can be stale.
Why can't we use ZooKeeper as a cache system?
As write requests are always made to (or redirected to) the leader, node creation is consistent. When two clients send a write request for the same node name, one of them will ALWAYS get an error (NodeExistsException).
If the above is true, can we use ZooKeeper to keep track of duplicate requests by creating a znode with the requestId? (See the sketch after this list.)
For generating a sequence number in a distributed system, we can use sequential node creation.
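For reference, the duplicate-detection and sequence-number ideas map to the standard ZooKeeper Java client roughly as below; the connection string, paths, and request ID are made up, and the parent znodes are assumed to already exist:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDedupSketch {
    public static void main(String[] args) throws Exception {
        // Parent znodes /requests and /seq are assumed to already exist.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});

        // Duplicate detection: only one of two concurrent creators of the same
        // path succeeds; the other gets NodeExistsException.
        try {
            zk.create("/requests/req-42", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            System.out.println("First time we see req-42, process it");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("req-42 was already seen, skip it");
        }

        // Sequence numbers: the server appends a monotonically increasing suffix.
        String seq = zk.create("/seq/item-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created " + seq);  // e.g. /seq/item-0000000007

        zk.close();
    }
}
```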
Based on what information is available in the question and the comments, it appears that the basic question is:
In a stateless multi server architecture, how best to prevent data duplication, here the data is "has this refund been processed?"
This qualifies as "primarily opinion based". There are multiple ways to do this and no one way is the best. You can do it with MySQL and you can do it with Zookeeper.
Now comes pure opinion and speculation:
To process a refund, there must be some database somewhere, so why not just check against it? The duplicate-request scenario you are preparing against seems like a rare occurrence; it won't be happening hundreds of times per second. If so, this scenario does not warrant a high-performance implementation. A simple database lookup should be fine.
Your workload seems to have a 1:1 read:write ratio. Every time a refund is processed, you check whether it has already been processed, and if not, you process it and record it. Now, ZooKeeper itself says it works best with something like a 10:1 read:write ratio. While there is no such metric available for MySQL, it does not need to make certain* guarantees that ZooKeeper makes for writes, so I would expect it to be better for write-intensive loads. (* Guarantees like sequentiality, broadcast, consensus, etc.)
Just a nitpick, but your data is a linear list of hundreds (thousands? millions?) of transaction ids. This is exactly what MySQL (or any database) and its primary key are built for. ZooKeeper is made for more complex/powerful hierarchical data, which you do not need.
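To illustrate that last point: with a plain table whose primary key is the request id, duplicate detection is just an insert that either succeeds or fails. A hedged JDBC sketch (the table, column, URL, and credentials are invented; assumes MySQL Connector/J on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class RefundDedup {
    // Assumes: CREATE TABLE processed_refunds (request_id VARCHAR(64) PRIMARY KEY)
    static boolean markProcessed(Connection conn, String requestId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO processed_refunds (request_id) VALUES (?)")) {
            ps.setString(1, requestId);
            ps.executeUpdate();
            return true;                       // first time we see this request
        } catch (SQLIntegrityConstraintViolationException e) {
            return false;                      // duplicate request, already processed
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/refunds", "user", "password")) {
            if (markProcessed(conn, "req-42")) {
                // ... actually process the refund here ...
            }
        }
    }
}
```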
I'm no expert in databases, so what I know about queries is that they are the way to read from or write to a database.
With eventual consistency, a read may return stale data.
On a write, the first data node is updated, but the other nodes need some time to be updated.
With strong consistency, a read is blocked until the data has been brought up to its latest version (I'm really not sure about this part, so please correct me if I got it wrong).
On a write, all read operations are blocked until the data node has been updated to its latest version.
So, if I write data with eventual consistency and then try an ancestor query to get that data, will I get the latest version?
If I use an ancestor query to update, will all eventually consistent read operations get the latest version?
Update:
I think transactions exist so that if there are multiple modification requests for the same data, one will succeed and the others will fail. After that, the modified data will take some time to be replicated to all data centers, so a transaction succeeding does not mean that all read queries will return the latest version (correct me if I'm wrong).
If you use what you call an "ancestor query", you're working in a transaction: either the transaction terminates successfully, in which case all subsequent reads will get the values as updated by the transaction, or else the transaction fails, in which case none of the changes made by the transaction will be seen (this all-or-nothing property is often referred to as a transaction being "atomic"). In particular, you do get strong consistency this way, not just eventual consistency.
The cost can be large, in terms of performance and scalability. In particular, an application should not update an entity group (any and all entities descending from a common ancestor) more than once a second, which can be a very constraining limit for a highly scalable application.
The online docs include a large variety of tips, tricks and advice on how to deal with this -- you could start at https://cloud.google.com/datastore/docs/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/ and continue with the "additional resources" this article lists at the end.
One simple idea that often suffices is that (differently from queries) getting a specific entity from its key is strongly consistent without needing transactions, and memcache is also strongly consistent; writing a modified entity gives you its new key, so you can stash that key into memcache and have other parts of your code fetch the modified entity from that key, rather than relying on queries. This has limits, of course, because memcache doesn't give you unbounded space -- but it's a useful idea to keep in mind, nevertheless, in many practical cases.
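A minimal sketch of that last idea, using the (older) App Engine Datastore and Memcache Java APIs; the entity kind, property name, and memcache key are invented for illustration:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class StrongReadSketch {
    private static final DatastoreService datastore =
            DatastoreServiceFactory.getDatastoreService();
    private static final MemcacheService memcache =
            MemcacheServiceFactory.getMemcacheService();

    // Write path: put the entity, then stash its key for strongly consistent reads.
    static Key saveComment(String text) {
        Entity comment = new Entity("Comment");    // no ancestor: queries stay eventual
        comment.setProperty("text", text);
        Key key = datastore.put(comment);          // put() returns the entity's key
        memcache.put("latest-comment", key);       // memcache reads are strongly consistent
        return key;
    }

    // Read path: fetch by key instead of running a query, so we see the latest write.
    static Entity latestComment() throws EntityNotFoundException {
        Key key = (Key) memcache.get("latest-comment");
        return key == null ? null : datastore.get(key);  // get-by-key is strongly consistent
    }
}
```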
With GAE, the only way to be consistent is to use a transaction; within a transaction you can update and then query the latest update, but it's slower.
For me, using ancestors just composes the primary key, and that's all.
There are a lot of different tutorials across the internet about pagination with JDBC/iterating over huge result set.
So, basically there are a number of approaches I've found so far:
Vendor specific sql
Scrollable result set (?)
Holding a plain result set in memory and mapping the rows only when necessary (using fetchSize)
The result set fetch size, either set explicitly, or by default equal to the statement fetch size that was passed to it, determines the number of rows that are retrieved in any subsequent trips to the database for that result set. This includes any trips that are still required to complete the original query, as well as any refetching of data into the result set. Data can be refetched, either explicitly or implicitly, to update a scroll-sensitive or scroll-insensitive/updatable result set.
Cursor (?)
Custom seek-method paging as implemented by jOOQ
Sorry for mixing all of these up, but I need someone to clear this up for me.
I have a simple task where service consumer asks for results with a pageNumber and pageSize. Looks like I have two options:
Use vendor specific sql
Hold the connection/statement/result set in the memory and rely on jdbc fetchSize
In the latter case I use rxjava-jdbc, and if you look at the producer implementation, it holds the result set; all you then do is call request(long n) and another n rows are processed. Of course everything is hidden under the Observable sugar of RxJava. What I don't like about this approach is that you have to hold the ResultSet between different service calls and have to clean up that ResultSet if the client forgets to exhaust or close it. (Note: ResultSet here is the Java ResultSet class, not the actual data.)
So, what is recommended way of doing pagination? Is vendor specific sql considered slow compared to holding the connection?
I am using Oracle. A scrollable ResultSet is not recommended for huge result sets, as it caches the whole result set data on the client side (proof).
Keeping resources open for an indefinite time is a bad thing in general. The database will, for example, create a cursor for you to obtain the fetched rows. That cursor and other resources will be kept open until you close the result set. The more queries you do in parallel the more resources will be occupied and at some point the database will reject further requests due to an exhausted resource pool (e.g. there is a limited number of cursors, that can be opened at a time).
Hibernate, for example, uses vendor specific SQL to fetch a "page" and I would do it just like that.
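As a concrete example of the vendor-specific approach, Oracle 12c and later support OFFSET/FETCH directly in SQL. A hedged JDBC sketch (table and column names are invented); each page is a new, short-lived query, so no cursor is held between service calls:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class PageDao {
    // Returns one page of names; pageNumber is zero-based.
    static List<String> fetchPage(Connection conn, int pageNumber, int pageSize)
            throws Exception {
        String sql = "SELECT name FROM customers ORDER BY id "
                   + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";   // Oracle 12c+ syntax
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageNumber * pageSize);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                List<String> page = new ArrayList<>();
                while (rs.next()) {
                    page.add(rs.getString("name"));
                }
                return page;
            }
        }
    }
}
```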
There are many approaches because there are many different use cases.
Do you actually expect users to fetch every page of the result set? Or are they more likely to fetch the first page or two and try something else if the data they're interested in isn't there. If you are Google, for example, you can be pretty confident that people will look at results from the first page, a small number will look at results from the second page, and a tiny fraction of results will come from the third page. It makes perfect sense in that case to use vendor-specific code to request a page of data and only run that for the next page when the user asks for it. If you expect the user to fetch the last page of the result, on the other hand, running a separate query for each page is going to be more expensive than running a single query and doing multiple fetches.
How long do users need to keep the queries open? How many concurrent users? If you're building an internal application that dozens of users will have access to and you expect users to keep cursors open for a few minutes, that might be reasonable. If you are trying to build an application that will have thousands of users that will be paging through a result over a span of hours, keeping resources allocated is a bad idea. If your users are really machines that are going to fetch data and process it in a loop as quickly as possible, a single ResultSet with multiple fetches makes far more sense.
How important is it that no row is missed, every row is seen exactly once, and the results across pages are consistent? Multiple fetches from a single cursor guarantee that every row in the result is seen exactly once. Separate paginated queries might not: new data could have been added or removed between queries being executed, your sort might not be fully deterministic, etc.
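For the machine-consumer case described above (one caller draining the whole result), a single statement with a fetch size lets the driver pull rows in batches over one open cursor. A sketch under assumed table/column names:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamAll {
    static void processAll(Connection conn) throws Exception {
        String sql = "SELECT id, payload FROM events ORDER BY id";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setFetchSize(500);                 // rows per round trip, not a limit
            // (some drivers, e.g. PostgreSQL, also need autocommit off to stream)
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {               // driver refetches batches transparently
                    handle(rs.getLong("id"), rs.getString("payload"));
                }
            }                                     // cursor released here
        }
    }

    static void handle(long id, String payload) {
        // ... process one row ...
    }
}
```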
A scrollable ResultSet caches the result on the client side, which requires memory resources. But PostgreSQL, for example, does this by default and nobody complains. Some databases simply use the client's memory to hold the whole result set. In most cases the database would otherwise have to process much more data to re-evaluate the query.
Also, you usually have many more clients than database instances.
Also note that query re-execution using ROWNUM, as implemented by Hibernate, does not guarantee correct (consistent) results if data is modified between executions and the default isolation level is used.
It really depends on the use case. Changing Oracle's init parameters for the maximum number of connections and open cursors requires a database restart.
So scrollable result sets and cursors can only be used when you can predict the number of (concurrent) users.
I've just heard from a colleague that deleting rows in a relational DB is pretty dangerous (regarding indexing and cascading actions).
He said that one solution for allowing deletions is to have a "deprecated" field for each entity and instead set that field to true in order to mark the row as "deleted".
Of course, that requires all your queries to filter on "deprecated" == false (which is pretty cumbersome).
My questions are:
Is he right? If so, what exactly is dangerous about deleting?
Is his solution a good practice?
Are any alternatives to this solution available?
Thanks.
This question has multiple layers. In general it is a good idea to mark rows as deleted instead of actually deleting them.
There are a few major benefits:
The data is recoverable. You can provide an undelete to users.
The update is faster than the delete.
In a publicly facing app none of the publicly interactable code has a true delete, making it much more difficult to use that code for inappropriate purposes (sql injection, etc.)
If you ever want to report on your data, you can.
There are of course caveats and best practices:
This does not apply to lookup tables with easy to recreate data.
You need to consider culling. In our databases we cull deleted records into archival reporting tables. This keeps the primary tables fast, but allows us to report on data related to "deleted" items.
Your culling performance impact (at largish scale) will be similar to a backup and have similar considerations. Run them off hours if you want to archive them all at once, or periodically via cron if you want to just take X number per hour.
NEVER use the deleted data in your live data. In other words it is not a status flag! It is gone. I've made this mistake before and undoing it was painful.
If there is a very high percentage of deletes in a table ask yourself if keeping the data is actually important. You might adjust your culling process to not archive and to instead just run the actual delete.
This approach will last for a really really long time unless your dataset is massive and deletions are massive. Some architecture astronaut will ask you about what is going to happen when you archive 1 billion rows.... when you get to that point you are either hugely successful and can find another way, or you've screwed something else up so completely your archive tasks won't matter any more relative to the other issues you have.
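A sketch of the flag-then-cull idea, issued as MySQL-flavored SQL through JDBC; the table names, the 30-day window, and the archive table mirroring the live one are all assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SoftDeleteCulling {
    // "Delete" from the application's point of view: just flip the flag.
    static void softDelete(Connection conn, long orderId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE orders SET deleted = 1, deleted_at = NOW() WHERE id = ?")) {
            ps.setLong(1, orderId);
            ps.executeUpdate();
        }
    }

    // Off-hours culling job: move old soft-deleted rows into an archive table.
    static void cullDeleted(Connection conn) throws Exception {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(
                "INSERT INTO orders_archive SELECT * FROM orders "
              + "WHERE deleted = 1 AND deleted_at < NOW() - INTERVAL 30 DAY");
            st.executeUpdate(
                "DELETE FROM orders WHERE deleted = 1 "
              + "AND deleted_at < NOW() - INTERVAL 30 DAY");
            conn.commit();                        // archive and delete atomically
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}
```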
If you have your schema well structured and use transactions where needed, deletions are perfectly safe, and with deletion you will get far better performance than with the approach your friend suggests.
Inserting a new element can get as tricky as deleting one. I wonder what hacky approach your friend would suggest to overcome that.
CRUD operations have been around for a long while now, and the creators of relational databases have done a pretty good job of optimizing them. Any attempt to outsmart decades of gradual improvement with such a hack will most probably fail.
Applying the solution your friend suggests may result in a huge database with only a small fraction of non-deleted elements. That way your queries will become slower too.
Now, having said all that, I would like to support the other side a little. There are cases where the solution your friend suggests may be the only option. You can't change your schema every time some query turns out to be slow. Also, as others suggest in their answers, if you use the "mark as deleted" approach, deleted data will be recoverable (which may or may not be good, as also mentioned in other answers).
Dangerous? Will the server or data center blow up?
I think your colleague is indulging in some hyperbole.
You need not cascade updates or deletes if you don't wish to, but it can be easier than having to clean up manually. It's a choice that you make when you create your schema.
Marking rows as deleted using a flag is another way to go, but it's just another choice. You'll have to work harder to find all the bad rows and run a batch job to remove them.
If you have retention requirements, it's more typical to partition the schema and move older records off into a warehouse for historical analysis and reporting. In that case you wouldn't be deleting anything, just moving rows out after a set period of time.
Yes, he is right. Databases (indexes, specifically) are optimized for insertion, and deletion can be painfully slow. Even setting an indexed field to null can cause the same trouble. I see cascading as a lesser issue because the DB should never be configured to do dangerous cascading automatically.
Yes, flagging a record as "inactive", "deleted", "deprecated" (your choice) is standard and preferred practice to resolve a deletion-related performance issue.
But, to qualify the above, it only applies to transactional (as opposed to archival) tables, and then only to those specific tables which contain a huge number of rows (millions and more). Do not ham-handedly apply a "best practice" across the board.
Another approach is to simply not have a transactional table with millions of rows. Move the data to an archival table before it grows to such proportions.
The problem with DELETEs in relational databases is that they are irreversible. You delete data and it's gone; there is no way to restore it (except by rolling back to an earlier backup, of course). Combined with the SQL syntax, which is based on the principle "take everything I don't explicitly exclude", this can easily lead to unintentional loss of data due to user error or bugs.
Just marking data as deleted but not actually deleting it has the advantage that deleted data can be easily restored. But keep in mind that the marked-as-deleted pattern also has disadvantages:
As you said, programming gets a bit more complicated, because you have to remember that every SELECT must now include a WHERE deleted = false.
When you frequently delete data, your database will accumulate a lot of cruft. This will cause it to grow which impacts performance and uses unnecessary drive space.
When your users are forced to delete data due to privacy regulations and they assume that pressing the "delete" button really deletes it, this practice might inadvertently cause them to violate those regulations.
We develop and operate a blogging application in which user data is scattered across many tables:
- Blog
- Article
- Comment
- Message
- Trackback
- 50 other tables.
Users are able to close their account, and their account/contents must disappear from the site right away.
For legal/contractual reasons, we must also be able to undelete their account/content for a given duration, and also make that data available to judicial authorities for another duration.
Over the years and different applications, we used different approaches:
"deleted" flag everywhere : Each table has a "deleted" column, which is updated when data is deleted/restored. Very nasty because it slows down every list generation queries, creates a lot of updates upon deletion/restore. Also, it does not handle the two stage deletion described above. In fact we never used this one, but it's worth dis-advising it :)
"Multi table": For each table, we create a second table with the same schema plus two extra fields (dateDeleted, reason). The extra fields are used to know if the data is still accessible for restoration, when to delete it, and why/how it was deleted in the first place. This version is just a bit better than the previous version, but can be very nasty performance wise too when tables are growing. Also, you have to change the schema of some tables (ie: remove UNIQUE constraints) which makes the system harder to understand/upgrade for new developers, administrators ... and mentally healthy people in general.
"Multi DB": Same approach as before, but we move data on a different database cluster, which allows to browse those data without impacting the "end users" db. Also, for this app, the uniqueness constraint is done at the java level, so all the schemas are the same. Lastly, the double data retention constraint is done by having a dedicated DB for each constraint, which makes things easiers.
I have to admit that none of those approaches satisfies me, even if they can work up to a certain amount of data. I have also imagined that we could just delete some key rows in the DB, and let the rest inconsistent (and scheduled for a more controlled deletion job), but it scares me ...
Do you know other ways of doing the same thing, keeping the same level of features (we could align the two durations to simplify the problem) ? I'm not looking a solution for my existing apps, but would like to improve the next ones.
Any input will be highly appreciated !
It seems that every asset (blog, comment, ...) belongs to a user. I would give the user table an "active" column which is 0 or 1. Then each query for the different assets checks whether the user is active. Try to optimize this lookup with indexes or something like that. In my opinion it's the cleanest way. After this, you can implement a job which runs a cascading delete on users who have been disabled for longer than x days.
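A rough sketch of that suggestion in MySQL-flavored SQL through JDBC (table and column names are invented, the 30-day window stands in for "x days", and the child tables are assumed to declare FOREIGN KEY ... ON DELETE CASCADE so the periodic job only has to touch the users table):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class AccountCleanup {
    // Normal reads filter on the owner's "active" flag via a join.
    static final String VISIBLE_ARTICLES =
        "SELECT a.* FROM article a JOIN users u ON u.id = a.user_id "
      + "WHERE u.active = 1";

    // Scheduled job: permanently remove users disabled for more than 30 days.
    // The dependent rows disappear through ON DELETE CASCADE foreign keys.
    static void purgeDisabledUsers(Connection conn) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "DELETE FROM users WHERE active = 0 "
              + "AND disabled_at < NOW() - INTERVAL 30 DAY")) {
            ps.executeUpdate();
        }
    }
}
```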