Reliably tracking changes made by Hibernate - java

I'm using a PostUpdateEventListener registered via
registry.appendListeners(EventType.POST_COMMIT_UPDATE, listener)
and a few other listeners in order to track changes made by Hibernate. This works well; however, I see a problem:
Let's say, for tracking some amount by id, I simply execute
amountByIdConcurrentMap.put(id, amount);
on every POST_COMMIT_UPDATE (let's ignore other operations). The problem is that this call happens some time after the commit. So with two commits writing the same entity shortly one after the other, I can receive the events in the wrong order, ending up with the older amount stored.
Is this really possible or are the operations synchronized somehow?
Is there a way how to prevent or at least detect such situation?

Two questions first, and then a proposal.
Are you sure that you need this optimization? Why not fetch the amount, as it is written to the database, by querying the database when you need it? What is your reason for working with a cache?
How do you make sure that the calculation of the amount before writing it to the database is properly synchronized, so that multiple threads (or possibly nodes) do not use old data to calculate the amount and thereby overwrite the result of a later calculation?
Assuming you handle question 2 correctly, you then have two options:
Pessimistic locking: immediately before the commit you can exclusively update your cache without concurrency issues.
Optimistic locking: in that case you have a kind of timestamp or counter in your database record that you can also put into the cache together with the amount. You can use this value to find out which value is more recent, as in the sketch below.
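For illustration, a minimal sketch of that versioned-cache idea, assuming the listener has access to the entity's id, amount, and a version value (for example a Hibernate @Version field); the class and field names are made up for the example:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical holder pairing the cached amount with the entity's version.
record VersionedAmount(long version, long amount) {}

class AmountCache {
    private final ConcurrentMap<Long, VersionedAmount> amountById = new ConcurrentHashMap<>();

    // Called from the POST_COMMIT_UPDATE listener; keeps whichever value carries
    // the higher version, so a late-arriving older event cannot overwrite a newer amount.
    void update(long id, long version, long amount) {
        amountById.merge(id, new VersionedAmount(version, amount),
                (oldVal, newVal) -> newVal.version() > oldVal.version() ? newVal : oldVal);
    }
}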

No, there are no ordering guarantees, so you'll have to take care to ensure proper synchronization manually.
If the real problem you are solving is caching of entity state and if it is suitable to use second-level cache for the entity in question, then you would get everything out of the box by enabling the L2 cache.
Otherwise, instead of updating the map from the update listeners directly, you could submit tasks to an Executor or messaging system that would asynchronously start a new transaction and select for update the amount for the given id from the database. Then update the map in the same transaction while holding the corresponding row lock in the db, so that map updates for the same id are done serially.
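A rough sketch of that approach, assuming a JPA EntityManagerFactory is available and the amount lives on an entity here called Account (the entity and field names are illustrative, not from the question):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.LockModeType;

class AmountRefresher {
    private final EntityManagerFactory emf;   // assumed to be configured elsewhere
    private final Map<Long, Long> amountByIdConcurrentMap = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    AmountRefresher(EntityManagerFactory emf) {
        this.emf = emf;
    }

    // Called from the update listener instead of touching the map directly.
    void onPostCommitUpdate(Long id) {
        executor.submit(() -> {
            EntityManager em = emf.createEntityManager();
            try {
                em.getTransaction().begin();
                // PESSIMISTIC_WRITE issues a SELECT ... FOR UPDATE, so map updates
                // for the same id are serialized by the database row lock.
                Account account = em.find(Account.class, id, LockModeType.PESSIMISTIC_WRITE);
                if (account != null) {
                    amountByIdConcurrentMap.put(id, account.getAmount());
                }
                em.getTransaction().commit();
            } finally {
                em.close();
            }
        });
    }
}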

Related

How to properly implement Optimistic Locking at the application layer?

I am a little confused as to why Optimistic Locking is actually safe. If I am checking the version at the time of retrieval with the version at the time of update, it seems like I can still have two requests enter the update block if the OS issues an interrupt and swaps the processes before the commit actually occurs. For example:
long latestVersion = vehicle.getVersion();
if (vehicle.getVersion() == latestVersion) {
    // update record in database
} else {
    // don't update record
}
In this example, I am trying to manually use optimistic locking in a Java application without using JPA/Hibernate. However, it seems like two requests can enter the if block at the same time. Can you please help me understand how to do this properly? For context, I am also using the Java Design Patterns website as an example.
Well... that's the optimistic part. The optimism is that it is safe. If you have to be certain it's safe, then that's not optimistic.
The example you show definitely is susceptible to a race condition. Not only because of thread scheduling, but also due to transaction isolation level.
A simple read in MySQL, in the default transaction isolation level of REPEATABLE READ, will read the data that was committed at the time your transaction started.
Whereas updating data will act on the data that is committed at the time of the update. If some other concurrent session has updated the row in the database in the meantime, and committed it, then your update will "see" the latest committed row, not the row viewed by your get method.
The way to avoid the race condition is to not be optimistic. Instead, force exclusive access to the record. Doveryai, no proveryai ("trust, but verify").
If you only have one app instance, you might use a critical section for this.
If you have multiple app instances, a critical section cannot coordinate across instances, so you need to coordinate in the database. You can do this by using pessimistic locking: either read the record using a locking read query, or use MySQL's user-defined locks.
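For example, a locking read with plain JDBC against MySQL could look roughly like this; the vehicle table and its mileage/version columns are assumptions for the sake of the example:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class VehicleUpdater {
    // Locks the row, applies the change, and commits; no other transaction can
    // update (or lock-read) the row between the SELECT and the COMMIT.
    void updateWithRowLock(Connection conn, long vehicleId, int newMileage) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT version FROM vehicle WHERE id = ? FOR UPDATE")) {
            select.setLong(1, vehicleId);
            try (ResultSet rs = select.executeQuery()) {
                if (!rs.next()) {
                    conn.rollback();
                    return;
                }
            }
        }
        try (PreparedStatement update = conn.prepareStatement(
                "UPDATE vehicle SET mileage = ?, version = version + 1 WHERE id = ?")) {
            update.setInt(1, newMileage);
            update.setLong(2, vehicleId);
            update.executeUpdate();
        }
        conn.commit();
    }
}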

How to tell other threads that the task corresponding to this record (in DB) was taken by me?

Right now, I am thinking of implementing multi-threading to take tasks corresponding to records in the DB tables. The tasks will be ordered by created date. Now I am stuck on handling the case where, once one task (record) has been taken, other threads should skip it and pick up the next one.
Is there any way to do this? Many thanks in advance.
One solution is to make a synchronized pickATask() method so that free threads can only pick a task through this method.
This will force the other free threads to wait their turn.
public synchronized NeedTask pickATask() {
    return task;
}
Depending on how big your data insertion is, you can either use shared global variables (e.g. a synchronized collection) or use a table in the database itself to record values like (string TASK, boolean taken, boolean finished, int owner_PID).
By using the database to check the status you tend to get faster code at large scale, but if you do not have too many threads, or this code will run just once, the synchronized global-variable approach may be the better solution.
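If you go the database-table route, one common pattern is to claim a task with a single atomic UPDATE and check the affected row count; the tasks table and its columns below are assumptions based on the description above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class TaskClaimer {
    // Returns true only for the one thread/process whose UPDATE actually flipped
    // the taken flag; everyone else sees 0 affected rows and moves on to the next task.
    boolean claimTask(Connection conn, long taskId, int ownerPid) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE tasks SET taken = TRUE, owner_pid = ? WHERE id = ? AND taken = FALSE")) {
            ps.setInt(1, ownerPid);
            ps.setLong(2, taskId);
            return ps.executeUpdate() == 1;
        }
    }
}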
In my opinion, if you create multiple threads to read from the DB, every thread is involved in I/O and in some kind of serialization while reading rows from the same table. To my mind this is not scalable and also has a performance impact.
My solution would be to have one producer thread that reads rows in batches, creates tasks, and submits them for execution (to a thread pool of workers that do the actual work). Now we have two modules that can be scaled independently. On the producer side, if required, we can create multiple threads where every thread reads some partition of the data; for example, thread 1 reads rows 0-100 and thread 2 reads rows 101-200.
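A minimal sketch of that producer/worker split; fetchBatch() and process() stand in for the application's own row reading and task logic:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class TaskDispatcher {
    private final ExecutorService workers = Executors.newFixedThreadPool(8);

    // Single producer: reads rows in batches and hands them to the worker pool.
    void run() {
        int offset = 0;
        List<TaskRow> batch;
        while (!(batch = fetchBatch(offset, 100)).isEmpty()) {   // e.g. rows 0-99, then 100-199, ...
            for (TaskRow row : batch) {
                workers.submit(() -> process(row));
            }
            offset += batch.size();
        }
        workers.shutdown();
    }

    private List<TaskRow> fetchBatch(int offset, int limit) { /* read a page of rows from the DB */ return new ArrayList<>(); }
    private void process(TaskRow row) { /* the actual work for one row */ }
}

class TaskRow { long id; }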
It depends on how you manage your communication between Java and the DB. Are you using direct JDBC calls, Hibernate, Spring Data, or any other ORM framework? In case you use just JDBC, you can manage this whole issue at the DB level: you will need to configure your DB to lock the record upon writing, i.e. once a record was selected for update, no one can read it until the update is finished.
In case you use some ORM framework (such as Hibernate), the framework allows you to manage concurrency issues. See optimistic and pessimistic locking. Pessimistic locking does approximately what is described above: once the record is being updated, no one can read it until the update is finished. The optimistic variant uses a versioning mechanism: multiple threads can try to update the record, but only the first one succeeds, and the rest get an exception saying that they are now working with stale data and should read the record again. The versioning mechanism is to add a version column, usually a number or sometimes a timestamp. Each thread reads the record, and upon update it checks whether the version in the DB is still the same. If so, no one else has updated the record, and upon the update the version is changed (incremented, or the current timestamp is set). If the version changed, then someone else already updated the record since it was read, so this thread has a stale record and should not be allowed to update it. Optimistic locking shows better performance in environments where reads heavily outnumber writes.
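For reference, a minimal JPA sketch of the versioning mechanism described above; the entity and field names are illustrative:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

// On flush the provider issues UPDATE ... SET version = version + 1
// WHERE id = ? AND version = ?, and throws OptimisticLockException if the
// row was changed by another transaction in the meantime.
@Entity
public class TaskRecord {
    @Id
    private Long id;

    @Version
    private long version;   // managed by the JPA provider, never set it manually

    private String payload;
}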

datastore: deleting entities outside transactions

I'm unable to find documentation that fully explains entities being deleted from the datastore (I'm using JDO deletePersistent) without being in a transaction. I can afford losing data accuracy during parallel updates when not using transactions, for the sake of performance and avoiding contention.
But how can I make sure, when my code is running on different machines at the same time, that a delete operation will not be overridden by a later update/put based on a previous read of that entity on another machine? I'm letting the PersistenceManager take care of implicit updates to attached objects.
EDIT:
Trying to update that entity after deletePersistent will result in an exception, but only when updating the exact same copy that was passed to deletePersistent. If it was a different copy on another machine, would that be treated as updating a deleted entity (not valid), or as an insert/update that puts the entity back?
This is taken from GAE Documentation:
Using Transactions
A transaction is a set of Datastore operations on one or more entities. Each transaction is guaranteed to be atomic, which means that transactions are never partially applied. Either all of the operations in the transaction are applied, or none of them are applied.
An operation may fail when:
Too many users try to modify an entity group simultaneously.
The application reaches a resource limit.
The Datastore encounters an internal error.
Since transactions are guaranteed to be atomic, an ATOMIC operation like a single delete operation will always work inside or outside a transaction.
The answer is yes: even after the object was deleted, if it was read before the delete and the update was committed after the delete was committed, it will be put back, because (as @Nick Johnson commented) inserts and updates are the same operation. I tested this using a 20-second thread sleep after getting the object for update, allowing the object to be deleted and then put back.

How to know when updates to the Google AppEngine HRD datastore are complete?

I have a long-running job that updates thousands of entity groups. I want to kick off a 2nd job afterwards that will have to assume all of those items have been updated. Since there are so many entity groups, I can't do it in a transaction, so I've just scheduled the 2nd job to run 15 minutes after the 1st completes, using task queues.
Is there a better way?
Is it even safe to assume that 15 minutes gives a promise that the datastore is in sync with my previous calls?
I am using high replication.
In the Google I/O videos about HRD, they give a list of ways to deal with eventual consistency. One of them was to "accept it". Some updates (like Twitter posts) don't need to be consistent with the next read. But they also said something like "hey, we're only talking milliseconds to a couple of seconds before they are consistent". Is that time frame documented anywhere else? Is it safe to assume that waiting 1 minute after a write before reading again will mean all my previous writes are there in the read?
The mention of that is at the 39:30 mark in this video http://www.youtube.com/watch?feature=player_embedded&v=xO015C3R6dw
I don't think there is any built-in way to determine if the updates are done. I would recommend adding a lastUpdated field to your entities and updating it with your first job, then have the 2nd job check the timestamp on the entities it is about to update before running... kind of a hack, but it should work.
Interested to see if anybody has a better solution. Kinda hope they do ;-)
This is automatic as long as you are getting entities without changing the consistency to Eventual. The HRD puts data to a majority of relevant datastore servers before returning. If you are calling the asynchronous version of put, you'll need to call get on all the Future objects before you can be sure it's completed.
If however you are querying for the items in the first job, there's no way to be sure that the index has been updated.
So for example...
If you are updating a property on every entity (but not creating any entities) and then retrieving all entities of that kind, you can do a keys-only query followed by a batch get (which is approximately as fast/cheap as doing a normal query) and be sure that you have all updates applied.
On the other hand, if you're adding new entities or updating a property in the first process that the second process queries, there's no way to be sure.
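A sketch of the keys-only query followed by a batch get, using the low-level Datastore API; the "Item" kind is illustrative:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ItemLoader {
    // The index-backed query may lag behind recent writes, but the batch get by key
    // is strongly consistent, so the returned entities carry the latest properties.
    Map<Key, Entity> loadAllItems() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        List<Key> keys = new ArrayList<>();
        for (Entity e : ds.prepare(new Query("Item").setKeysOnly()).asIterable()) {
            keys.add(e.getKey());
        }
        return ds.get(keys);
    }
}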
I did find this statement:
With eventual consistency, more than 99.9% of your writes are available for queries within a few seconds.
at the bottom of this page:
http://code.google.com/appengine/docs/java/datastore/hr/overview.html
So, for my application, a 0.1% chance of it not being there on the next read is probably OK. However, I do plan to redesign my schema to make use of ancestor queries.

JPA persistence using multiple threads

I have a problem when I try to persist objects using multiple threads.
Details :
Suppose I have an object PaymentOrder which has a list of PaymentGroup (one-to-many relationship), and PaymentGroup contains a list of CreditTransfer (again a one-to-many relationship).
Since the number of CreditTransfers is huge (in lakhs), I have grouped them by PaymentGroup (based on some business logic)
and I am creating WORKER threads (one thread for each PaymentGroup) to form the PaymentOrder objects and commit them to the database.
The problem is that each worker thread creates its own PaymentOrder (which contains a unique set of PaymentGroups).
The primary keys for all the entities are auto-generated.
So there are three tables: 1. PAYMENT_ORDER_MASTER, 2. PAYMENT_GROUPS, 3. CREDIT_TRANSFERS, all mapped by one-to-many relationships.
Because of that, when the second thread tries to persist its group in the database, the framework tries to persist the same PaymentOrder that the previous thread already committed, and the transaction fails due to some other unique field constraint (the checksum of PaymentOrder).
Ideally it must be 1..n..m (PaymentOrder -> PaymentGroup -> CreditTransfer).
What I need to achieve is: if there is no entry of PaymentOrder in the database, make an entry; if it is there, don't make an entry in PAYMENT_ORDER_MASTER, but only in PAYMENT_GROUPS and CREDIT_TRANSFERS.
How can I overcome this problem while maintaining the split-master-payment-order-using-groups logic and multiple threads?
You've got options.
1) Primitive but simple, catch the key violation error at the end and retry your insert without the parents. Assuming your parents are truly unique, you know that another thread just did the parents...proceed with children. This may perform poorly compared to other options, but maybe you get the pop you need. If you had a high % parents with one child, it would work nicely.
2) Change your read consistency level. It's vendor specific, but you can sometimes read uncommitted transactions. This would help you see the other threads' work prior to commit. It isn't foolproof, you still have to do #1 as well, since another thread can sneak in after the read. But it might improve your throughput, at a cost of more complexity. Could be impossible, based on RDBMS (or maybe it can happen but only at DB level, messing up other apps!)
3) Implement a work queue with a single-threaded consumer. If the main expensive work of the program is before the persistence level, you can have your threads "insert" their data into a work queue, where the keys aren't enforced. Then have a single thread pull from the work queue and persist (see the sketch after this list). The work queue can be in memory, in another table, or in a vendor-specific place (WebLogic Queue, Oracle AQ, etc). If the main work of the program is before the persistence, you parallelize THAT and go back to a single thread on the inserts. You can even have your consumer work in "batch insert" mode. Sweeeeeeeet.
4) Relax your constraints. Who cares really if there are two parents for the same child holding identical information? I'm just asking. If you don't later need super fast updates on the parent info, and you can change your reading programs to understand it, it can work nicely. It won't get you an "A" in DB design class, but if it works.....
5) Implement a goofy lock table. I hate this solution, but it does work: have your thread record, as its first transaction (and commit), that it is working on parent "x" so nobody else can. Typically leads to the same problem (and others, like cleaning up the records later), but it can work when child inserts are slow and a single-row insert is fast. You'll still have collisions, but fewer.
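A minimal sketch of option 3 with an in-memory queue and a single consumer; PaymentGroup stands in for whatever unit of work the producer threads prepare, and persistBatch() for the actual insert code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class PersistencePipeline {
    private final BlockingQueue<PaymentGroup> queue = new LinkedBlockingQueue<>();

    // Called by many producer threads; no key or constraint checks happen here.
    void enqueue(PaymentGroup group) throws InterruptedException {
        queue.put(group);
    }

    // Runs in a single consumer thread, so parent/child inserts never race.
    void consume() throws InterruptedException {
        List<PaymentGroup> batch = new ArrayList<>();
        while (true) {
            batch.add(queue.take());        // block until at least one item is available
            queue.drainTo(batch, 99);       // then grab up to 99 more for a "batch insert"
            persistBatch(batch);
            batch.clear();
        }
    }

    private void persistBatch(List<PaymentGroup> batch) { /* a single EntityManager/Session does the inserts */ }
}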
Hibernate sessions are not thread-safe, and neither are the JDBC connections that underlie Hibernate. Consider multithreading your business logic instead, so that each thread uses its own Hibernate session and JDBC connection. By using a thread pool you can further improve your code by adding the ability to throttle the number of simultaneous threads.
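For example, a sketch where each submitted task opens its own Session from the shared (thread-safe) SessionFactory; PaymentGroup is the entity from the question, and the fixed pool size provides the throttling:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

class GroupPersister {
    private final SessionFactory sessionFactory;                          // thread-safe, shared
    private final ExecutorService pool = Executors.newFixedThreadPool(4); // throttles concurrency

    GroupPersister(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    void persistAll(List<PaymentGroup> groups) {
        for (PaymentGroup group : groups) {
            pool.submit(() -> {
                Session session = sessionFactory.openSession();   // one Session per task, never shared
                Transaction tx = session.beginTransaction();
                try {
                    session.persist(group);
                    tx.commit();
                } catch (RuntimeException e) {
                    tx.rollback();
                    throw e;
                } finally {
                    session.close();
                }
            });
        }
        pool.shutdown();
    }
}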
