Relational databases - to delete or not to delete? [duplicate]

This question already has answers here:
Never delete entries? Good idea? Usual?
(10 answers)
Closed 9 years ago.
I've just heard from a colleague that deleting rows in a relational DB is pretty dangerous (regarding indexing and cascading actions).
He said that one solution for allowing deletions is to have a "deprecated" field for each entity and set that field to true in order to mark the row as "deleted".
Of course, that requires every query to filter on "deprecated" == false (which is pretty cumbersome).
My questions are:
Is he right? If so, what exactly is dangerous about deleting?
Is his solution good practice?
Are there any alternatives to this solution?
Thanks.

This question has multiple layers. In general it is a good idea to mark rows as deleted instead of actually deleting them.
There are a few major benefits:
The data is recoverable. You can provide an undelete to users.
The update is faster than the delete.
In a publicly facing app none of the publicly interactable code has a true delete, making it much more difficult to use that code for inappropriate purposes (sql injection, etc.)
If you ever want to report on your deleted data, you can.
There are of course caveats and best practices:
This does not apply to lookup tables with easy to recreate data.
You need to consider culling. In our databases we cull deleted records into archival reporting tables. This keeps the primary tables fast, but allows us to report on data related to "deleted" items.
Your culling performance impact (at largish scale) will be similar to a backup and have similar considerations. Run them off hours if you want to archive them all at once, or periodically via cron if you want to just take X number per hour.
NEVER use the deleted data in your live data. In other words it is not a status flag! It is gone. I've made this mistake before and undoing it was painful.
If there is a very high percentage of deletes in a table ask yourself if keeping the data is actually important. You might adjust your culling process to not archive and to instead just run the actual delete.
This approach will last for a really really long time unless your dataset is massive and deletions are massive. Some architecture astronaut will ask you about what is going to happen when you archive 1 billion rows.... when you get to that point you are either hugely successful and can find another way, or you've screwed something else up so completely your archive tasks won't matter any more relative to the other issues you have.
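As a rough illustration of the mark-then-cull pattern described above, here is a minimal in-memory sketch. It is not the answerer's implementation: the class SoftDeleteTable, its Row type and the batch size parameter are all hypothetical names, and a real system would express the same steps as SQL against the primary and archive tables.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// In-memory stand-in for a primary table with a "deleted" marker plus an archive table.
// All names are hypothetical; a real system would run equivalent SQL statements.
class SoftDeleteTable {
    static final class Row {
        final long id;
        final String payload;
        boolean deleted; // marker only: live code must treat a marked row as gone

        Row(long id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    private final List<Row> live = new ArrayList<>();
    private final List<Row> archive = new ArrayList<>();

    void insert(long id, String payload) {
        live.add(new Row(id, payload));
    }

    // "Delete" only sets the marker, so the row stays recoverable.
    boolean softDelete(long id) {
        for (Row r : live) {
            if (r.id == id && !r.deleted) {
                r.deleted = true;
                return true;
            }
        }
        return false;
    }

    // Undelete works only while the row has not been culled yet.
    boolean undelete(long id) {
        for (Row r : live) {
            if (r.id == id && r.deleted) {
                r.deleted = false;
                return true;
            }
        }
        return false;
    }

    // Every read path filters on the marker.
    List<Row> findActive() {
        List<Row> result = new ArrayList<>();
        for (Row r : live) {
            if (!r.deleted) {
                result.add(r);
            }
        }
        return result;
    }

    // Culling job: move at most batchSize marked rows into the archive.
    int cull(int batchSize) {
        int moved = 0;
        Iterator<Row> it = live.iterator();
        while (it.hasNext() && moved < batchSize) {
            Row r = it.next();
            if (r.deleted) {
                archive.add(r);
                it.remove();
                moved++;
            }
        }
        return moved;
    }

    int liveSize() { return live.size(); }
    int archiveSize() { return archive.size(); }
}
```

Calling cull with a small batch size from a scheduled task corresponds to the "X number per hour" variant; a single large call corresponds to the off-hours variant.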

If you have your schema well structured and use transactions where needed, deletions are perfectly safe, and by deleting you will get far better performance than with the approach your friend suggests.
Inserting a new element may get as tricky as deleting one. I wonder what hacky approach your friend would suggest to overcome that.
CRUD operations have been around for a long while now, and the creators of relational databases have done a pretty good job of optimizing them. Any attempt to outsmart decades of gradual improvement with such a hack will most probably fail.
Applying the solution your friend suggests may result in a huge database in which only a small fraction of the elements are non-deleted. That will make your queries slower too.
Having said all that, I would like to support the other side a little. There are cases when the solution your friend suggests may be the only option. You can't change your schema every time some query turns out to be slow. Also, as others suggest in their answers, if you use the "mark as deleted" approach the deleted data will be recoverable (which may or may not be a good thing, as also mentioned in other answers).

Dangerous? Will the server or data center blow up?
I think your colleague is indulging in some hyperbole.
You need not cascade updates or deletes if you don't wish to, but it can be easier than having to clean up manually. It's a choice that you make when you create your schema.
Marking rows as deleted using a flag is another way to go, but it's just another choice. You'll have to work harder to find all the bad rows and run a batch job to remove them.
If you have retention requirements, it's more typical to partition the schema and move older records off into a warehouse for historical analysis and reporting. In that case you wouldn't be deleting anything, just moving records out after a set period of time.

Yes, he is right. Databases (indexes, specifically) are optimized for insertion, and deletion can be painfully slow. Even setting an indexed field to null can cause the same trouble. I see cascading as a lesser issue because the db should never be configured to do dangerous cascading automatically.
Yes, flagging a record as "inactive", "deleted", "deprecated" (your choice) is standard and preferred practice to resolve a deletion-related performance issue.
But, to qualify the above, it only applies to transactional (as opposed to archival) tables, and then only to those specific tables which contain a huge number of rows (millions and more). Do not ham-handedly apply a "best practice" across the board.
Another approach is to simply not have a transactional table with millions of rows. Move the data to an archival table before it grows to such proportions.

The problem with DELETEs in relational databases is that they are irreversible. You delete data and it's gone. There is no way to restore it (except restoring an earlier backup, of course). Combined with the SQL syntax, which is based on the principle "take everything I don't explicitly exclude", this can easily lead to unintentional loss of data due to user error or bugs.
Just marking data as deleted but not actually deleting it has the advantage that deleted data can be easily restored. But keep in mind that the marked-as-deleted pattern also has disadvantages:
As you said, programming gets a bit more complicated, because you have to remember that every SELECT must now include a WHERE deleted = false.
When you frequently delete data, your database will accumulate a lot of cruft. This will cause it to grow, which hurts performance and wastes drive space.
When your users are required to delete data by privacy regulations and they assume that pressing the "delete" button really deletes it, this practice might inadvertently cause them to violate those regulations.

Related

SQL query performance, archive vs status change

Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my searches correctly.
My question is:
I have a couple of tables which will hold anywhere between 1,000 and 100,000 lines per year. I'm trying to figure out whether and how I should handle archiving the data. I'm not very experienced with databases, but below are a few methods I've come up with, and I'm unsure which is the better practice, taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method because when we want to search for old data, our application will need to search into a different database and it'll be a hassle for me to add in more code for this.
Method 2
Archive the data into a separate database for data older than 2-3 years.
And use a status on the lines to improve performance (see Method 3). This is something I'm leaning towards as an 'optimal' solution, where the code is not too complex but my DB also stays relatively clean.
Method 3
Just have a status for each line (e.g. A = active, R = archived) to possibly improve query performance, using something like "select * from table where status = 'A'" to reduce the number of lines to look through.

100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
[1] Unless you have thousands of columns and/or columns that contain large binary blobs, which doesn't seem to be the case here.

As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
For your particular use-case, I would further suggest having views on the data only for the active data. This would only read one partition, so the performance should be pretty much the same as reading a table with only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.
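To make the partitioning suggestion more concrete, here is a small sketch that generates PostgreSQL DDL for yearly range partitions and a view over recent data. It assumes PostgreSQL's declarative partitioning (version 10 or later) and a parent table that was already created with PARTITION BY RANGE on a date column; the helper class YearlyPartitionDdl and the table and column names in the usage note are hypothetical.

```java
// Sketch of DDL generation for yearly range partitions in PostgreSQL.
// Assumes the parent table was created with: ... PARTITION BY RANGE (<date column>)
class YearlyPartitionDdl {
    // One partition per calendar year; the upper bound of a range partition is exclusive.
    static String partitionFor(String parentTable, int year) {
        return "CREATE TABLE " + parentTable + "_" + year
                + " PARTITION OF " + parentTable
                + " FOR VALUES FROM ('" + year + "-01-01')"
                + " TO ('" + (year + 1) + "-01-01');";
    }

    // A view restricted by the partitioning key, so queries against it
    // can be limited to the partitions from firstYear onward.
    static String recentView(String parentTable, String dateColumn, int firstYear) {
        return "CREATE VIEW " + parentTable + "_recent AS"
                + " SELECT * FROM " + parentTable
                + " WHERE " + dateColumn + " >= DATE '" + firstYear + "-01-01';";
    }
}
```

For a hypothetical table orders partitioned on created_at, partitionFor("orders", 2023) yields the statement for the 2023 partition and recentView("orders", "created_at", 2023) yields a view that only touches data from 2023 onward. Partitioning by an active flag instead would use list partitioning, which this sketch does not cover.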

How to know affected rows in Cassandra(CQL)?

There doesn't seem to be any direct way to know affected rows in cassandra for update, and delete statements.
For example if I have a query like this:
DELETE FROM xyztable WHERE PKEY IN (1,2,3,4,5,6);
Now, of course, since I've passed 6 keys, it is obvious that 6 rows will be affected.
But, like in RDBMS world, is there any way to know affected rows in update/delete statements in datastax-driver?
I've read cassandra gives no feedback on write operations here.
Except that I could not see any other discussion on this topic through google.
If that's not possible, can I be sure that with the type of query given above, it will either delete all or fail to delete all?

In the eventually consistent world you can look at these operations as if it was saving a delete request, and depending on the requested consistency level, waiting for a confirmation from several nodes that this request has been accepted. Then the request is delivered to the other nodes asynchronously.
Since there is no dependency on anything like foreign keys, then nothing should stop data from being deleted if the request was successfully accepted by the cluster.
However, there are a lot of ifs. For example, deleting data with a consistency level one, successfully accepted by one node, followed by an immediate node hard failure may result in the loss of that delete if it was not replicated before the failure.
Another example: during the deletion, one node was down and stayed down for a significant amount of time, longer than gc_grace_seconds, i.e., longer than the period after which tombstones are removed together with the deleted data. If this node then recovers, all the data that was deleted from the rest of the cluster, but not from this node, will suddenly be brought back to the cluster.
So in order to avoid these situations, and consider operations successful and final, a cassandra admin needs to implement some measures, including regular repair jobs (to make sure all nodes are up to date). Also applications need to decide what is better - faster performance with consistency level one at the expense of possible data loss, vs lower performance with higher consistency levels but with less possibility of data loss.

There is no way to do this in Cassandra because the model for writes, deletes, and updates in Cassandra is basically the same. In all of those cases a cell is added to the table which has either the new information or information about the delete. This is done without any inspection of the current DB state.
Without checking the rest of the replicas and doing a full merge on the row, there is no way to tell if any operation will actually affect the current read state of the database.
This leads to the oft-cited anti-pattern of "reading before a write". In Cassandra you are meant to write as fast as possible, and if you need history, use a data structure which preserves a log of modifications rather than just the current state.
There is one option for doing queries like this: the CAS syntax of IF value THEN do other thing. But this is a very expensive operation compared to a normal write and should be used sparingly.
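The point that writes and deletes follow the same path can be illustrated with a toy model. This is a deliberately simplified sketch, not Cassandra's storage engine: the class AppendOnlyStore is hypothetical, it ignores replication entirely, and ties between equal timestamps are resolved by insertion order only because that is convenient here.

```java
import java.util.ArrayList;
import java.util.List;

// Toy last-write-wins store: every write or delete appends a cell without reading old state.
class AppendOnlyStore {
    static final class Cell {
        final String key;
        final String value; // null marks a tombstone
        final long timestamp;

        Cell(String key, String value, long timestamp) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final List<Cell> log = new ArrayList<>();

    void write(String key, String value, long timestamp) {
        log.add(new Cell(key, value, timestamp));
    }

    // A delete is just another appended cell; the store never checks whether the key existed,
    // so it has nothing to report as an "affected rows" count.
    void delete(String key, long timestamp) {
        log.add(new Cell(key, null, timestamp));
    }

    // Reads merge all cells for the key; the cell with the newest timestamp wins.
    String read(String key) {
        Cell newest = null;
        for (Cell c : log) {
            if (c.key.equals(key) && (newest == null || c.timestamp >= newest.timestamp)) {
                newest = c;
            }
        }
        return newest == null ? null : newest.value;
    }

    int cellCount() {
        return log.size();
    }
}
```

In this model, deleting a key that never existed is accepted exactly like deleting an existing one, which is why the only way to know what a delete changed would be to read first.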

SQL Joins vs Java code?

I have a query like this
SELECT Folder.name
FROM FolderTable, ValidFolder, ValidFolderGroup, ValidUser,
     ValidLocation, ValidDepartment
WHERE ValidUser.LocationCode *= ValidLocation.LocationCode
  AND ValidUser.DepartmentCode *= ValidDepartment.DepartmentCode
  AND Folder.IssueUser = ValidUser.UserId
  AND ValidFolder.FolderType = Folder.FolderType
  AND ValidFolderGroup.FolderGroupCode = ValidFolder.FolderGroupCode
  AND ValidFolderGroup.GroupTypeCode = 13
  AND (ValidUser.UserId = 'User' OR ValidUser.ManagerId = 'User')
  AND ValidFolderGroup.GroupTypeCode = 13
  AND Folder.IssueUser = 'User'
All the tables whose names start with Valid are cached tables, so their data is already available in the application.
If someone is using jOOQ or Hibernate, which is the better option:
Use the query as written above, with all the joins?
Or use Java code to fulfil the requirement instead of joins, since with Hibernate or jOOQ there are already Java classes for the tables and the Valid tables already hold all the data?

Okay, you're probably not going to like this answer, but the best way to do this is not to keep Valid "cached".
The best solution in my opinion would be to use jOOQ (if you prefer DSL) or Hibernate (if you prefer OR mapping) and query the Database every time, and consistently use the DAO pattern.
The jOOQ and Hibernate guys are almost certainly better at SQL than you are. We've used jOOQ and Hibernate in really large enterprise projects, and they both perform exceptionally. Particularly with a good connection pool like BoneCP. If after you've got that setup running, and running well, but still think you may have performance issues, you can always add a cache (like EhCache) afterwards.
Ultimately tho', I'm making a lot of assumptions about your software, namely that
There are more people than you working on it, and
It has to be maintained.
If neither of these assumptions is true, then you can safely disregard this answer.

General answer:
Modern databases are incredibly good at optimising your query and choosing the best possible execution plan for you. Given your outer join notation using *=, you're obviously using SQL Server, so that's a pretty good database.
Even if you already have much of the "Valid" data in your application memory, chances are that your database also already has the same data in a buffer cache and thus the database doesn't need to hit the disk again for the various joins in your query.
In fact, depending on the nature of your data, the database might even assess that some of your joins are unneeded (if you have the right meta data, like constraints).
Specific answer:
In your particular case, it looks as though you can indeed strip most of your query yourself and query only the Folder table using search criteria from your application's "Valid" cache. I'm saying that it looks like it, because I don't fully understand the business logic behind those joins and whether they're all modelling 1:1 relationships, or whether removing them will change the semantics of the query.
So, technically, it's possible that you can remove the joins, but if you want to stay on the safe side, just keep things as they are as you migrate to jOOQ or Hibernate.
Alternative 3:
Of course, instead of tampering with this query, you might even be able to remove this query and fetch the Folder.name property already in your previous queries when you load the "Valid" content into memory.

Ever heard of views? Look into them, you'll be amazed.
Apart from that, it's impossible to say what you should do, there's no "best" and you provide way too little information to even make an educated guess about your specific requirements.
But, I'd not hard code things like database IDs in a query that ends up inside any program, far too prone to cause problems in the (near) future.

Hibernate Session.flush() efficiency problems

Sorry in advance if someone has already answered this specific question but I have yet to find an answer to my problem so here goes.
I am working on an application (no I cannot give the code as it is for a job so I'm sorry about that one) which uses DAO's and Hibernate and POJO's and all that stuff for communicating and writing to the database. This works well for the application assuming I don't have a ton of data to check when I call Session.flush(). That being said, there is a page where a user can add any number of items to a product and there is one particular case where there are something along the lines of 25 items. Each item has about 8 fields a piece that are all stored in the database. When I call the flush it does save everything to the database but it takes FOREVER to complete. The three lines I am calling are:
merge(myObject);
Session.flush();
Session.refresh(myObject);
I have tried a number of different combinations of things to fix this problem and a number of different solutions, so coming back and saying "Don't use flush()" isn't much help, as saveOrUpdate() and the other Hibernate session methods don't seem to work. The only solution I can think of is to scrap the entire project (the code we got was inherited and poorly written, to say the least) or tell the user community to suck it up.
It is my understanding from Hibernate API that if you want to write the data to the database it runs a check on every item, if there is a difference it creates a queue of update queries, then runs the queries. It seems as though this data is being updated every time because the "DATE_CREATED" column in my database is different even if the other values are unchanged.
What I was wondering is if there was another way to prevent such a large committing of data or a way of excluding that particular column from the "check" hibernate does so I don't have to commit all 25 items if I only made a change to 1?
Thanks in advance.
Mike

Well, you really cannot avoid the dirty checking in hibernate unless you use a StatelessSession. Of course, you lose a lot of features (lazy-load etc.) with that, but it's up to you to make this decision.
Another option: I would definitely try to use dynamic-update=true in your entity. Like:
@Entity(dynamicUpdate = true) // org.hibernate.annotations.Entity; newer Hibernate versions use @DynamicUpdate instead
class MyClass
Using that, Hibernate will update the modified columns only. In small tables, with few columns, it's not so effective, but in your case maybe it can help make the whole process faster as you cannot avoid dirty checking with a regular Hibernate Session. Updating a few columns instead of the whole object is always better, right?
This post talks more about dynamic-update attribute.
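The idea behind dirty checking combined with a dynamic update can be sketched without Hibernate. The following is an illustration only and not Hibernate's implementation: the class DynamicUpdateSketch, the map-based entity state and the id column are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.StringJoiner;

// Illustration of snapshot-based dirty checking; this is not Hibernate code.
class DynamicUpdateSketch {
    // Returns the columns whose current value differs from the snapshot taken at load time.
    static Map<String, Object> changedColumns(Map<String, Object> snapshot,
                                              Map<String, Object> current) {
        Map<String, Object> changed = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : current.entrySet()) {
            if (!Objects.equals(snapshot.get(e.getKey()), e.getValue())) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }

    // Builds an UPDATE that sets only the changed columns; returns null when nothing changed,
    // in which case no statement needs to be sent at all.
    static String updateSql(String table, Map<String, Object> changed) {
        if (changed.isEmpty()) {
            return null;
        }
        StringJoiner sets = new StringJoiner(", ");
        for (String column : changed.keySet()) {
            sets.add(column + " = ?");
        }
        return "UPDATE " + table + " SET " + sets + " WHERE id = ?";
    }
}
```

Note that the comparison still runs for every loaded object on each flush; what a dynamic update saves is the width of the statement, not the check itself. If a column such as DATE_CREATED is reassigned on every save, it will show up as changed in a comparison like this one.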

What I was wondering is if there was another way to prevent such a
large committing of data or a way of excluding that particular column
from the "check" hibernate does so I don't have to commit all 25 items
if I only made a change to 1?
I would profile the application to ensure that the dirty checking on flush is actually the problem. If you find that this is indeed the case you can use evict to manage the session size.
session.update(myObject);
session.flush();
session.evict(myObject);

How to efficiently unpublish all data from a particular user on a blogging application?

We develop and operate a blogging application in which user data is scattered across many tables:
- Blog
- Article
- Comment
- Message
- Trackback
- 50 other tables.
Users are able to close their account, and their account/contents must disappear from the site right away.
For legal/contractual reasons, we also must be able to undelete their account/content for a given duration, and also to make that data available to judicial authorities for another duration.
Over the years and different applications, we used different approaches:
"deleted" flag everywhere: Each table has a "deleted" column, which is updated when data is deleted/restored. Very nasty, because it slows down every list generation query and creates a lot of updates upon deletion/restore. Also, it does not handle the two-stage deletion described above. In fact we never used this one, but it's worth advising against it :)
"Multi table": For each table, we create a second table with the same schema plus two extra fields (dateDeleted, reason). The extra fields are used to know if the data is still accessible for restoration, when to delete it, and why/how it was deleted in the first place. This version is just a bit better than the previous one, but can be very nasty performance-wise too when tables are growing. Also, you have to change the schema of some tables (i.e. remove UNIQUE constraints), which makes the system harder to understand/upgrade for new developers, administrators ... and mentally healthy people in general.
"Multi DB": Same approach as before, but we move the data to a different database cluster, which allows browsing that data without impacting the "end users" db. Also, for this app, the uniqueness constraint is enforced at the Java level, so all the schemas are the same. Lastly, the double data retention constraint is handled by having a dedicated DB for each constraint, which makes things easier.
I have to admit that none of those approaches satisfies me, even if they can work up to a certain amount of data. I have also imagined that we could just delete some key rows in the DB and leave the rest inconsistent (and scheduled for a more controlled deletion job), but it scares me ...
Do you know other ways of doing the same thing, keeping the same level of features (we could align the two durations to simplify the problem)? I'm not looking for a solution for my existing apps, but would like to improve the next ones.
Any input will be highly appreciated !

It seems that every asset (blog, comment, ...) relies on the user. I would give the user table an "active" column which is 0 or 1. Then you make every query for the different assets also check whether the user is active. Try to optimize this lookup with indexes or something like that. In my opinion it's the cleanest way. After this you can implement a job which runs a cascading delete on users disabled for longer than x days.
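A minimal in-memory sketch of that idea follows, under stated assumptions: the class UserVisibility and its methods are hypothetical, the single flag is modelled as a nullable disabledAt timestamp so the purge job can tell how long an account has been closed, and a real implementation would use a join on the user table plus a scheduled DELETE with cascading foreign keys.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Visibility is decided by one flag on the owner, not by a flag on every asset row.
class UserVisibility {
    static final class User {
        final long id;
        Instant disabledAt; // null while the account is active

        User(long id) { this.id = id; }

        boolean isActive() { return disabledAt == null; }
    }

    static final class Asset {
        final long ownerId;
        final String title;

        Asset(long ownerId, String title) {
            this.ownerId = ownerId;
            this.title = title;
        }
    }

    private final Map<Long, User> users = new HashMap<>();
    private final List<Asset> assets = new ArrayList<>();

    void addUser(long id) { users.put(id, new User(id)); }

    void addAsset(long ownerId, String title) { assets.add(new Asset(ownerId, title)); }

    // Closing an account is a single update; no asset rows are touched.
    void closeAccount(long userId, Instant now) {
        User u = users.get(userId);
        if (u != null && u.isActive()) {
            u.disabledAt = now;
        }
    }

    // Undelete is possible until the purge job has removed the user.
    boolean reopenAccount(long userId) {
        User u = users.get(userId);
        if (u == null || u.isActive()) {
            return false;
        }
        u.disabledAt = null;
        return true;
    }

    // Equivalent of joining each asset query against the user table on the active flag.
    List<String> visibleTitles() {
        List<String> titles = new ArrayList<>();
        for (Asset a : assets) {
            User owner = users.get(a.ownerId);
            if (owner != null && owner.isActive()) {
                titles.add(a.title);
            }
        }
        return titles;
    }

    // Purge job: removes users disabled for longer than the retention period, with their assets.
    int purge(Instant now, Duration retention) {
        int removed = 0;
        Iterator<User> it = users.values().iterator();
        while (it.hasNext()) {
            User u = it.next();
            if (!u.isActive() && Duration.between(u.disabledAt, now).compareTo(retention) > 0) {
                final long id = u.id;
                assets.removeIf(a -> a.ownerId == id);
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

This covers only one retention period. The second duration from the question (availability to authorities after the undelete window) would need an additional state or an archive step, which the sketch leaves out.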
