Ensuring data integrity when database is accessed by multiple java clients - java

I have one MySql database instance with an Account table which maintains a Balance field. I have multiple Java applications each using a Jdbc to connect to the database that can potentially increase or decrease the value of the Balance field. How do I ensure that the Balance value is read, calculated and updated and that this process happens in isolation, and is 'aware' of any other Java processes that might be in the middle of doing the same thing?

The simple answer is to use transactions:
http://dev.mysql.com/doc/refman/5.0/en/commit.html
However, in the case you describe, I much prefer not to store the balance of an account as a column in a table, but to calculate it by summing the value of the transactions related to that account. It's far less sensitive to the integrity issues you raise, and you're less likely to run into obscure locking scenarios.
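For example, a minimal JDBC sketch of deriving the balance on demand (the transactions table and its account_id/amount columns are illustrative names, not from the original schema):
// Derive the balance by summing the account's transactions instead of
// storing it. Table/column names are assumptions for illustration.
try (PreparedStatement ps = connection.prepareStatement(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions WHERE account_id = ?")) {
    ps.setLong(1, accountId);
    try (ResultSet rs = ps.executeQuery()) {
        rs.next();
        BigDecimal balance = rs.getBigDecimal(1);
    }
}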

One simple approach is JDBC's transaction management. See the java.sql.Connection.setAutoCommit() documentation. It enables you to explicitly disable automatic statement commits:
Connection c = /* retrieve connection */;
c.setAutoCommit(false);
c.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE); // choose a level depending on your requirements
try (Statement stmt = c.createStatement()) {
    stmt.executeQuery(/* ... */);  // statements run on a Statement, not the Connection itself
    stmt.executeUpdate(/* ... */);
    c.commit();
} catch (SQLException e) {
    c.rollback();
}
In a real-world scenario you must commit or roll back the transaction in a catch/finally block, as above; otherwise you may leave the transaction open and end up with lock waits or deadlocks in your database.
Edit: If your Java applications are end-user clients, there is always a risk that users connect directly to the database (e.g. using Access), bypassing your transaction management logic. That's one reason we began to place application servers in between. A solution might also be to implement a stored procedure, so that the clients do not interact with the tables at all.

If you are using the InnoDB engine, then you can use MySQL's record-level locking (e.g. SELECT ... FOR UPDATE) to lock the specific account record against updates from other clients.
UPDATE: Alternatively, you can use the application-level locks described here.
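A minimal sketch of the record-level locking approach from JDBC, assuming an account table with id and balance columns (illustrative names); the FOR UPDATE row lock is held until the transaction ends:
c.setAutoCommit(false);
try (PreparedStatement lock = c.prepareStatement(
         "SELECT balance FROM account WHERE id = ? FOR UPDATE");
     PreparedStatement update = c.prepareStatement(
         "UPDATE account SET balance = ? WHERE id = ?")) {
    lock.setLong(1, accountId);
    ResultSet rs = lock.executeQuery();
    rs.next();
    BigDecimal newBalance = rs.getBigDecimal(1).add(delta); // read + calculate
    update.setBigDecimal(1, newBalance);
    update.setLong(2, accountId);
    update.executeUpdate();
    c.commit(); // releases the row lock
} catch (SQLException e) {
    c.rollback();
}
Any other client issuing the same SELECT ... FOR UPDATE blocks until the commit, so the read-calculate-update sequence is effectively serialized per account.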

Related

Is there a way to ensure uniqueness of a field without relying on database

Without relying on the database, is there a way to ensure a field (let's say a User's emailAddress) is unique?
Some common attempts that fail:
Check first if the emailAddress exists (by querying the DB) and if not then create the user. Obviously, in the check-then-act window some other thread can create a user with the same email, hence this solution is no good.
Apply a language-level lock on the method responsible for creating the user. This solution fails because we need redundancy of the service for performance reasons, and the lock works only within a single JVM.
Use an event store (like an Akka actor's mailbox), the event being an AddUser message. But since the actor behavior is asynchronous, the requestor (sender) can't be notified that user creation with a unique email was successful. Moreover, how do two requests (with the same email) know they contain a unique email? This may get complicated.
The database, being the single source of data that every thread and every service instance writes to, is the natural place to implement the unique constraint. But that holds for relational databases.
Then what about NoSQL databases? Some do allow a unique constraint, but it's not their native behavior (or maybe it is).
So, without using the database to implement uniqueness of a field, what could the options be?
I think your question is more generic - "how do I ensure a database write action succeeded, and how do I handle cases where it didn't?". Uniqueness is just one failure mode - you may be attempting to insert a value that's too big, or of the wrong data type, or that doesn't match a foreign key constraint.
Relational databases solve this through being ACID-compliant, and throwing errors for the client to deal with when a transaction fails.
You want (some of) the benefits of ACID without the relational database. That's a fairly big topic of conversation. The obvious way to solve this is to introduce the concept of a "transaction" in your application layer. For instance, in your case, you might send a "create account(emailAddress, name, ...)" message, and have the application listen for either an "accountCreated" or "accountCreationFailed" response.
The recipient of that message is responsible for writing to the database; you have a couple of options. One is to lock that thread (so only one process can write to the database at any time); that's not super scalable. The other mechanism I've used is introducing status flags - you write the account data to the database with a "draft" flag, then check for your constraints (including uniqueness), and set the "draft" flag to "validated" if the constraints are met (i.e. there is no other record with the same email address), and "failed" if they are not.
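A rough sketch of that status-flag flow, shown with JDBC for concreteness; the accounts(email, status) schema is invented, and note the caveat in the comments:
// Sketch of the draft/validated flag idea. The check-and-promote step must
// itself run serialized (e.g. a single validator process, per the answer's
// single-writer option) or under a serializable transaction to be race-free.
void createAccount(Connection c, String email) throws SQLException {
    c.setAutoCommit(false);
    try {
        try (PreparedStatement insert = c.prepareStatement(
                "INSERT INTO accounts (email, status) VALUES (?, 'draft')")) {
            insert.setString(1, email);
            insert.executeUpdate();
        }
        String newStatus;
        try (PreparedStatement check = c.prepareStatement(
                "SELECT COUNT(*) FROM accounts WHERE email = ? AND status = 'validated'")) {
            check.setString(1, email);
            ResultSet rs = check.executeQuery();
            rs.next();
            // promote only if no validated record with this email exists
            newStatus = rs.getInt(1) == 0 ? "validated" : "failed";
        }
        try (PreparedStatement promote = c.prepareStatement(
                "UPDATE accounts SET status = ? WHERE email = ? AND status = 'draft'")) {
            promote.setString(1, newStatus);
            promote.setString(2, email);
            promote.executeUpdate();
        }
        c.commit();
    } catch (SQLException e) {
        c.rollback();
        throw e;
    }
}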
To check for uniqueness you need to store the "state" of the program. For safety you need to be able to apply changes to the state transactionally.
You can use database transactions. A few of the NoSQL databases support transactions too, for example Redis and MongoDB; you have to check each vendor separately to see how it supports transactions. In this setup, each client will connect to the database and it will handle all of the details for you. Also, depending on your use case, you should be careful about the isolation level configuration.
If durability is not a concern, then you can use in-memory databases that support transactions.
Whichever state store you choose, it should support transactions. There are several ways to implement transactions and achieve consistency. Many relational databases like PostgreSQL achieve this by implementing the MVCC algorithm. In a distributed environment you have to look for distributed transactions such as 2PC, Paxos, etc.
Normally everybody relies on available datastore solutions unless there is a weird or specific requirement for the project.
Final note: the communication pattern is not related to the underlying problem here. For example, in the Actor case you mentioned, at the end of the day each actor has to query the state to find out whether an email exists or not. If your state store supports serializability then there is no problem and conflicts will not happen (communicating the error to the client is another issue). Suppose that you are using PostgreSQL: when an insert/update query is issued, it is wrapped in a transaction and the underlying MVCC algorithm will take care of everything. In an advanced and distributed environment you can use data stores that support distributed transactions, like CockroachDB.
If you want to dive deep you can research these keywords: ACID, isolation levels, atomicity, serializability, CAP theorem, 2PC, MVCC, distributed transactions, distributed locks, ...
NoSQL databases provide different, weaker, guarantees than relational databases. Generally, the tradeoff is you give up ACID guarantees in exchange for increased scalability in the dimensions that matter for your application.
It's possible to provide some kind of uniqueness guarantee, but subject to certain tradeoffs. With NoSQL, there are always tradeoffs.
If your NoSQL store supports optimistic concurrency control, maybe this approach will work:
Store a separate document that contains the set of all emailAddress values across all documents in your NoSQL table. There is a single instance of this document at any given time.
Each time you want to save a document containing emailAddress, first confirm email address uniqueness:
Perform the following actions, protected by optimistic locking, and retry on the backend if the write fails due to a concurrent update:
Read this "all emails" document.
Confirm the email isn't present.
If not present, add the email address to the "all emails" document.
Save it.
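A sketch of that loop in Java; the DocStore interface and VersionedDoc record are hypothetical, standing in for whatever read-with-version / compare-and-set primitive your NoSQL client actually exposes:
import java.util.HashSet;
import java.util.Set;

// Hypothetical versioned-document API -- not a real library.
interface DocStore {
    VersionedDoc get(String key);
    boolean putIfVersion(String key, Set<String> emails, long expectedVersion);
}
record VersionedDoc(Set<String> emails, long version) {}

class EmailReserver {
    static boolean reserveEmail(DocStore store, String email) {
        while (true) {
            VersionedDoc doc = store.get("all-emails");
            if (doc.emails().contains(email)) {
                return false; // already taken
            }
            Set<String> updated = new HashSet<>(doc.emails());
            updated.add(email);
            // Compare-and-set: succeeds only if nobody else updated the
            // document since we read it.
            if (store.putIfVersion("all-emails", updated, doc.version())) {
                return true;
            }
            // Lost the race to a concurrent writer; re-read and retry.
        }
    }
}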
You've now traded one problem (the lack of unique constraints) for another (the inability to synchronise updates across your original document and this new "all emails" document). This may or may not be acceptable; it depends on the guarantees that your application needs to provide.
e.g. maybe you can accept that an email is added to "all emails", that saving the related document to your other "table" subsequently fails, and that the email address then can't be used again. You could clean this up with a batch job somehow. Not sure.
The index of emails could be stored in some other service (e.g. a persistent cache). The same problem exists, you need to keep the index and your document store in sync somehow.
There's no easy solution. For a detailed overview of the relevant concepts, I'd recommend Designing Data-Intensive Applications by Martin Kleppmann.

Where we need to set the Hibernate session to a ThreadLocal object

ThreadLocal<Session> tl = new ThreadLocal<Session>();
tl.set(session);
to get the session,
Employee emp = (Employee) tl.get().get(Employee.class, 1); // tl.get() already returns a Session; only the entity cast is needed
If our application is web-based, the web container creates a separate thread for each request.
If all these requests concurrently use the same single Session object, we may get
unwanted results in our database operations.
To overcome this, it is said to be good practice to set our session on a ThreadLocal object,
which does not allow concurrent usage of the session. But if that is correct, I would expect the application's performance to be very poor.
What is a good approach in the above scenario?
If I'm on the wrong track, in which situations do we need to go for ThreadLocal?
I'm new to Hibernate, so please excuse me if this type of question is silly.
Thanks in advance.
Putting the Hibernate Session in ThreadLocal is unlikely to achieve the isolation between requests that you want. Surely you create a new Session for each request using a SessionFactory backed by a connection pooling implementation of DataSource, which means that the local reference to the Session is on the stack anyway. Changing that local reference to a member variable only complicates the code, imho.
Anyhow, ensuring isolation within a single container doesn't address the actual problem - how is data accessed efficiently while maintaining consistency within a multi-threaded environment.
There are two parts to the problem you mention - the first is that a database connection is an expensive resource, the second that you need to ensure some level of data consistency between threads/requests.
The general approach to the resource problem is to use a database connection pool (which I'd guess you're already doing). As each request is processed, connections are obtained from the pool and returned when finished but importantly the connections in the pool are maintained beyond the lifetime of a request thus avoiding the cost of creating a connection each time it is needed.
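For example, a minimal pool setup (sketched here with HikariCP; any pooling DataSource works the same way conceptually, and the URL and credentials are placeholders):
import java.sql.Connection;
import javax.sql.DataSource;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://localhost:3306/mydb"); // placeholder URL
config.setUsername("app");
config.setPassword("secret");
config.setMaximumPoolSize(10); // pooled connections outlive individual requests
DataSource dataSource = new HikariDataSource(config);

// Per request: borrow a pooled connection; close() returns it to the pool
// instead of tearing the physical connection down.
try (Connection c = dataSource.getConnection()) {
    // ... process the request ...
}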
The consistency problem is a little trickier and there's no one size fits all model. What you need to be doing is thinking about what level of consistency you need - questions like does it matter if data is read at the same time it's being written, do updates absolutely have to be atomic, etc.
Once you know the answer to these questions there two places you need to look at consistency - in the database and in the code.
With the database you need to look at database-level locks and create a scheme suitable for your application by applying the appropriate isolation levels.
With the code, things are a little more complicated. Data is often loaded and displayed for a period of time before updates are written back - no problem if there's a single user, but in a multi-user system it's possible that updates are made based on stale data or that multiple updates occur simultaneously. It may be acceptable to have a policy of last update wins, in which case it's simple, but if not you'll need to be using version numbers or old/new comparisons to ensure integrity at the time the updates are applied.
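For the version-number approach, Hibernate/JPA has built-in optimistic locking via @Version; a minimal sketch (the entity and its fields are invented for illustration):
import java.math.BigDecimal;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Account {
    @Id
    private Long id;

    private BigDecimal balance;

    // Hibernate increments this column on every update and throws an
    // optimistic-lock exception if the row changed since it was read,
    // preventing silent "last update wins" overwrites of stale data.
    @Version
    private int version;
}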
I am not sure if you are compelled to use ThreadLocal, but using ThreadLocal to store the session object is definitely not a good idea, especially when you are using Hibernate along with Spring.
A typical scheme for using Hibernate with Spring is:
Inject the sessionFactory into your DAO. I assume that you already have a sessionFactory configured and backed by a pooled datasource.
Now in your DAO class, a session can be accessed as follows.
Session session = sessionFactory.getCurrentSession();
Here is a link to a related article.
Please note that this example is specific to the Hibernate 3.x APIs. This approach takes care of session creation/closure/thread-safety aspects internally, and it's neat too.
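A minimal sketch of such a DAO, assuming the sessionFactory bean is wired up in your Spring context and transactions are demarcated with Spring (the class and method names are illustrative):
import org.hibernate.SessionFactory;
import org.springframework.transaction.annotation.Transactional;

public class EmployeeDao {
    private SessionFactory sessionFactory; // injected by Spring

    public void setSessionFactory(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    @Transactional
    public Employee find(int id) {
        // getCurrentSession() returns the session bound to the current
        // transaction; Spring/Hibernate open and close it for you.
        return (Employee) sessionFactory.getCurrentSession().get(Employee.class, id);
    }
}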

Using Java locks for database concurrency

I have the following scenario.
I have two tables. One stores multiple values that are counters for transactions. Through a Java application, a value from the first table is read, incremented and written to the second table, and the new value is also written back to the first table. Obviously there is potential for this to go wrong as it's a multi-user system.
My solution, in Java, to the issue is to provide locks that have to, well, should be acquired before any action can be taken on either table. These locks, ReentrantLocks, are static and there is one for each column in Table 1 as the values are completely independent of each other.
Is this a recommended approach?
Cheers.
No. Use implicit database locks [1] for database concurrency. Relational databases support transactions, which are a vital part of ACID: use them.
Java-centric locks will not work cross-VM and as such will not help in multi-User/Server environments.
[1] Databases are smart enough to acquire/release locks to ensure consistency and isolation, and may even use "lock free" implementations such as MVCC. There are rare occasions when explicit database locks must be requested, but this is an advanced use-case.
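For the counter scenario above, the simplest fix is often to let the database perform the read-modify-write atomically inside one transaction (table and column names are invented for illustration):
// Atomically increment the counter and read the new value in one
// transaction; no Java-side lock is needed, and it works across VMs.
c.setAutoCommit(false);
try (PreparedStatement inc = c.prepareStatement(
         "UPDATE counters SET value = value + 1 WHERE name = ?");
     PreparedStatement read = c.prepareStatement(
         "SELECT value FROM counters WHERE name = ?")) {
    inc.setString(1, "txn_counter");
    inc.executeUpdate();           // takes a row lock until commit
    read.setString(1, "txn_counter");
    ResultSet rs = read.executeQuery();
    rs.next();
    long newValue = rs.getLong(1); // safe: we still hold the row lock
    // ... write newValue to the second table in the same transaction ...
    c.commit();
} catch (SQLException e) {
    c.rollback();
}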
Whilst agreeing with some of the sentiments of @pst's answer, I would say this depends slightly.
If the sequence of events is, and probably always will be, essentially "SQL oriented", then you may as well do the locking at the database level (and indeed, probably implicitly via the use of transactions).
However if there is, or you are planning to build in, significant data manipulation logic within your app tier (either generally or in the case of this specific operation), then locking at the app level may be more appropriate. (In reality, you will probably still run your SQL in transactions so that you're actually locking at both levels.)
I don't think the issue of multiple VMs is necessarily a compelling issue on its own for relying on DB-level locking. If you have multiple server apps accessing the database, you will in any case want to establish a well-defined protocol for which data is accessed concurrently under what circumstances. And in a system of moderate complexity, you will in any case want to build in a system of running periodic sanity checks on the data. (Even if your server apps are perfectly behaved 100% of the time, will back end tech support never ever ever have to run some miscellaneous SQL on the database outside your app...?)

Unit testing DDL statements that need to be in a transaction

I am working on an application that uses Oracle's built in authentication mechanisms to manage user accounts and passwords. The application also uses row level security. Basically every user that registers through the application gets an Oracle username and password instead of the typical entry in a "USERS" table. The users also receive labels on certain tables. This type of functionality requires that the execution of DML and DDL statements be combined in many instances, but this poses a problem because the DDL statements perform implicit commits. If an error occurs after a DDL statement has executed, the transaction management will not roll everything back. For example, when a new user registers with the system the following might take place:
Start transaction
Insert person details into a table. (i.e. first name, last name, etc.) -DML
Create an Oracle account (create user testuser identified by password;) - DDL, implicit commit. Transaction ends.
New transaction begins.
Perform more DML statements (inserts, updates, etc.).
Error occurs, transaction only rolls back to step 4.
I understand that the above logic is working as designed, but I'm finding it difficult to unit test this type of functionality and manage it in data access layer. I have had the database go down or errors occur during the unit tests that caused the test schema to be contaminated with test data that should have been rolled back. It's easy enough to wipe the test schema when this happens, but I'm worried about database failures in a production environment. I'm looking for strategies to manage this.
This is a Java/Spring application. Spring is providing the transaction management.
First off I have to say: bad idea doing it this way. For two reasons:
Connections are based on user. That means you largely lose the benefits of connection pooling. It also doesn't scale terribly well. If you have 10,000 users on at once, you're going to be continually opening and closing hard connections (rather than soft connection pools); and
As you've discovered, creating and removing users is DDL not DML and thus you lose "transactionality".
Not sure why you've chosen to do it this way, but I would strongly recommend you implement users at the application and not the database layer.
As for how to solve your problem, basically you can't. Same as if you were creating a table or an index in the middle of your sequence.
You should use Oracle proxy authentication in combination with row level security.
Read this: http://www.oracle.com/technology/pub/articles/dikmans-toplink-security.html
I'll disagree with some of the previous comments and say that there are a lot of advantages to using the built-in Oracle account security. If you have to augment this with some sort of shadow table of users with additional information, how about wrapping the Oracle account creation in a separate package that is declared PRAGMA AUTONOMOUS_TRANSACTION and returns a success/failure status to the package that is doing the insert into the shadow table? I believe this would isolate the Oracle account creation from the transaction.
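A sketch of what calling such a package from JDBC might look like; the user_mgmt.create_db_user procedure is hypothetical and would be declared with PRAGMA AUTONOMOUS_TRANSACTION on the Oracle side:
// Calls a hypothetical PL/SQL procedure that creates the Oracle account in
// an autonomous transaction, so its implicit DDL commit cannot end the
// caller's transaction on the shadow table.
try (CallableStatement cs = connection.prepareCall(
         "{ call user_mgmt.create_db_user(?, ?, ?) }")) {
    cs.setString(1, username);
    cs.setString(2, password);
    cs.registerOutParameter(3, java.sql.Types.INTEGER); // success/failure status
    cs.execute();
    int status = cs.getInt(3);
    // roll back the shadow-table insert ourselves if status indicates failure
}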

Restrict Postgres access from Java clients by using a Java program on a server

Perhaps this question is not very clear, but I didn't find better words for the heading that describe the problem I'd like to deal with briefly.
I want to restrict access from a Java desktop application to Postgres.
The background:
Suppose you have two apps running, and the first application has to do some complex calculations on the basis of data in the DB. To nail down the immutability of the data in the DB, I'd like to lock the DB against insert, update and delete operations. On the client side I think it's impossible to handle this behaviour satisfactorily, so I thought about using a little Java app on the server side which works like a proxy: it passes CRUD (Create Read Update Delete) operations through until it gets a command to lock. After a lock, it rejects all CUD operations until it gets an unlock command from the locking client or a timeout is reached.
Questions:
What do you think about this approach?
Is it possible to lock a Database while using such an approach?
Would you prefer Java SE or Java EE as server-side java app?
Thanks in advance.
Why not use transactions in your operations? The database has features to maintain data integrity itself, rather than resorting to a brute operation such as a total-database lock.
This locking mechanism you describe sounds like it would be a pain for the users. Are the users initiating the lock or is the software itself? If it's the users, you can expect some problems when Bob hits lock and then goes to lunch for 2 hours, forgetting to unlock the database first...
Indeed... there are a few proper ways to deal with this problem.
Just lock the tables in your code. PostgreSQL has commands (LOCK TABLE) for locking entire tables that you could run from your client application.
Pick a transaction isolation level that doesn't have the problem of reading data that was committed after your txn started (BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ).
Of these, by far the most efficient is to use repeatable read as your isolation level. Postgres supports this quite efficiently, and it will give you a consistent view of the data without such heavy locking of the db.
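For example, taking a consistent snapshot for the calculation from JDBC, equivalent to BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ (the query text is a placeholder):
c.setAutoCommit(false);
c.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
try (Statement stmt = c.createStatement();
     ResultSet rs = stmt.executeQuery("SELECT /* calculation inputs */ ...")) {
    // Every read in this transaction sees the same snapshot of the data,
    // no matter what other clients commit in the meantime.
    // ... run the complex calculation here ...
} finally {
    c.commit(); // ends the snapshot; writers were never blocked
}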
Yeah, I thought about transactions, but in this case I can't use them. I'm sorry I didn't mention it exactly. So assume the following simple case:
A calculation closes one area of responsibility. After the calculation, a new one is opened and new inserts are dedicated to it. But during the calculation process, an insert, update or delete is not allowed on the data of the (currently calculated) area of responsibility. Moreover, a delete is strictly prohibited because the data has to be archived.
So IMO the use of transactions doesn't fit this requirement. Or did I miss something?
PS (off topic) @jsight: I recently read that internally Postgres maps "repeatable read" to "serializable", so using "repeatable read" gets you more restriction than you would perhaps expect.
