Locking a JPA entity hierarchy being updated massively from MDBs in a cluster - java

I need some help with the following situation: imagine a hierarchy of Job entities that have a progress attribute. Some jobs consist of multiple subjobs, making a job tree. The progress of these composite jobs is calculated from their subjobs. The leaf jobs' progress values are updated periodically, and then the whole tree's progress is recalculated bottom-up. Progress updates arrive as JMS messages: on receipt, the job is fetched from the database via JPA, its progress is modified, and a recursive recalculation is started.
How should I deal with locking if this runs in a cluster? I would like to avoid the situation where two subjobs are both updated from 0% to 100% but the parent job ends up at 50% instead of 100%, because one update sees the sibling values as 0%/100% and the other sees them as 100%/0%.
My first thought was using synchronization on the job objects. But this is not OK, because multiple runtime objects may represent the same database record.
Could you suggest a good, efficient way to handle this situation?
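For illustration, here is a minimal sketch of one possible approach (not something the post prescribes): take database-level pessimistic locks while walking up the tree, so that two cluster nodes can never recalculate the same parent concurrently. The entity layout and method names are assumptions.

```java
// Sketch: pessimistic row locks while recalculating bottom-up.
import java.util.List;
import javax.persistence.*;

@Entity
class Job {
    @Id private Long id;
    private int progress;                              // 0..100
    @ManyToOne private Job parent;
    @OneToMany(mappedBy = "parent") private List<Job> children;
    @Version private long version;                     // extra safety net against lost updates

    int getProgress() { return progress; }
    void setProgress(int p) { progress = p; }
    Job getParent() { return parent; }
    List<Job> getChildren() { return children; }
}

class ProgressUpdater {

    @PersistenceContext
    private EntityManager em;

    /** Invoked per JMS message, inside the MDB's container transaction. */
    public void updateProgress(long jobId, int newProgress) {
        // PESSIMISTIC_WRITE maps to SELECT ... FOR UPDATE on most databases,
        // so the row lock is honored by every node in the cluster.
        Job job = em.find(Job.class, jobId, LockModeType.PESSIMISTIC_WRITE);
        job.setProgress(newProgress);
        recalculateUpward(job.getParent());
    }

    private void recalculateUpward(Job parent) {
        if (parent == null) return;
        em.lock(parent, LockModeType.PESSIMISTIC_WRITE);   // lock before reading children
        int sum = 0;
        for (Job child : parent.getChildren()) {
            sum += child.getProgress();
        }
        parent.setProgress(sum / parent.getChildren().size());
        recalculateUpward(parent.getParent());
    }
}
```

Because locks are always acquired bottom-up (child before parent), two concurrent updates cannot deadlock; the later one simply blocks on the row lock until the earlier transaction commits, and then sees the committed 100%.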

Related

Quartz scheduler maintenance and performance overheads

We are currently evaluating quartz-scheduler for use in our project. For our use case we only need a trigger that fires once at some point in the future; it need not be a repeating or cron trigger.
So in my POC, I'm creating a new one-time simple trigger whenever the business event occurs. I can see that in a clustered environment (using the JDBC store of quartz) the triggers are balanced/distributed among multiple nodes.
The desired behaviour is observed in the POC, but from a performance standpoint, how expensive will it be to create a new one-time trigger for each event once we run at scale? From my understanding, one bottleneck would be the database bloating with triggers; a possible solution is a background task that cleans up old triggers.
I am interested in hearing about experiences and pain points in maintaining a scheduler with our design, and any input on improving the design.
You can safely use one-time triggers; they will be automatically removed by Quartz after they have fired. What happens is that Quartz checks all triggers and determines whether they are going to fire at some point in the future. If they are not, Quartz simply removes them from the store because it makes no sense to keep them.
A somewhat similar principle applies to jobs. If a job has no associated triggers, Quartz automatically removes it from the store unless the job has the durability flag set to true.
So in your case, you will probably want to register a bunch of durable jobs; your app will then create one-time triggers for these jobs on an as-needed basis. The jobs will remain in the store and the triggers will be automatically cleaned up once they are done.
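A minimal sketch of that setup with the Quartz 2.x fluent API (the job class, names, and groups are made up for illustration):

```java
// Register a durable job once, then fire it via throwaway one-time triggers.
import java.util.Date;
import org.quartz.*;

public class OneTimeTriggerSetup {

    public static class MyBusinessJob implements Job {
        @Override
        public void execute(JobExecutionContext ctx) {
            // business logic goes here
        }
    }

    /** Run once at startup: the job survives in the JDBC store with no triggers. */
    public static void registerDurableJob(Scheduler scheduler) throws SchedulerException {
        JobDetail job = JobBuilder.newJob(MyBusinessJob.class)
            .withIdentity("businessJob", "poc")    // assumed identity
            .storeDurably()                        // keep the job even when triggerless
            .build();
        scheduler.addJob(job, true);               // true = replace if already registered
    }

    /** Called whenever the business event occurs. */
    public static void fireOnceAt(Scheduler scheduler, Date when) throws SchedulerException {
        Trigger trigger = TriggerBuilder.newTrigger()
            .forJob("businessJob", "poc")
            .startAt(when)                         // fires once; Quartz then deletes it
            .build();
        scheduler.scheduleJob(trigger);
    }
}
```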

Distributed, synchronized batch processing

In our current Java project, we need to batch process a huge set of records. Once this processing is done, it must start over and process all records again. The processing must be parallelized as well as distributed among multiple nodes.
The records themselves are stored in a database. Using an id range (e.g. 1-10000) to identify a batch would be sufficient.
From a high level perspective, I see the following steps:
A sub task processes one batch of records.
A master task checks if any sub task is still running. If not, create one sub task for each batch of records.
We use MongoDB quite heavily and thought of persisting the sub tasks in it. Each node can then pick up sub tasks that are not done yet, do the processing, and mark the record as done. Once there are no undone sub tasks, the master task creates all the sub tasks again. This would probably work, but we are looking for a solution where we don't have to do the heavy synchronization work ourselves.
Could this be a possible use-case for akka?
Can akka-persistence be used to synchronize the processing among different nodes?
Are there any other Java/JVM frameworks suited for this job?
Your question is way too broad for SO's format. Please read this guide in the future before asking, and don't ask your group members to vote your question up just to inflate what is obviously an ill-posed question ( ͡° ͜ʖ ͡°).
Anyways:
1) Yes, you can implement your requirements in Akka. In particular, since you mentioned multiple nodes, you are looking at the akka-cluster module (for inter-node communication), and you might also need akka-cluster-sharding (in case you want to keep all the data in memory, partitioned across nodes, during processing).
2) No, I would strongly recommend against that. While you could technically force your problem into akka-persistence to synchronize the tasks, the goal of akka-persistence is simply to make an actor's state persistent. Akka itself in its basic form is enough to handle all your synchronization issues: simply have a master actor create a worker for every subtask and monitor its completion (see the sketch after this list).
3) Yes. Note that the answer to this question is always yes no matter which job.
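A minimal sketch of that master/worker shape with Akka classic actors (Java API); the Batch/BatchDone message types and the batch list are assumptions, and supervision, clustering, and error handling are left out:

```java
import akka.actor.AbstractActor;
import akka.actor.Props;
import java.util.List;

class Batch {
    final long fromId, toId;                   // e.g. an id range such as 1-10000
    Batch(long fromId, long toId) { this.fromId = fromId; this.toId = toId; }
}

class BatchDone { }

class Worker extends AbstractActor {
    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(Batch.class, b -> {
                // process records in [b.fromId, b.toId] here
                getSender().tell(new BatchDone(), getSelf());
                getContext().stop(getSelf());  // workers are short-lived
            })
            .build();
    }
}

class Master extends AbstractActor {
    private final List<Batch> batches;
    private int pending;

    Master(List<Batch> batches) { this.batches = batches; }

    static Props props(List<Batch> batches) {
        return Props.create(Master.class, () -> new Master(batches));
    }

    @Override
    public void preStart() { dispatchAll(); }

    private void dispatchAll() {
        pending = batches.size();
        for (Batch b : batches) {              // one worker per batch
            getContext().actorOf(Props.create(Worker.class)).tell(b, getSelf());
        }
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(BatchDone.class, done -> {
                if (--pending == 0) dispatchAll(); // all done: start the next full pass
            })
            .build();
    }
}
```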

How to store a single instance object in the AppEngine datastore

I need to create and store a single instance of an object in the AppEngine datastore (there will never need to be more than one object).
It is the last run time for a cron job that I am scheduling.
The idea is that the cron job will only pick up rows that have been created/updated since its last run, and will update the last-run time after it has completed.
What is the best way to do this, considering concurrency issues as well - in case a previous job has not finished running?
If I understand your question correctly, it sounds like you could just create a 'job bookkeeping' entity that records whether a job is currently running, along with any necessary state about what you are processing with the job.
Then, access that bookkeeping entity using a transaction, so that only one process can do a read + update on it at a time. That will let you safely check whether another job is still running before starting a new job.
(The datastore is non-relational, so I am guessing that by 'rows' you mean entities of some Kind that you need to process? Your bookkeeping entity could store some state about which of these entities you'd processed so far, which would let you query for new ones to process.)
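A minimal sketch of that read-check-update transaction with the low-level datastore API; the kind name, key name, and property names are made up for illustration:

```java
import com.google.appengine.api.datastore.*;

public class JobGate {
    private static final Key KEY =
        KeyFactory.createKey("JobBookkeeping", "singleton");

    /** Returns true if we acquired the "running" flag, false if a job is already running. */
    public static boolean tryStartJob(DatastoreService ds) {
        Transaction txn = ds.beginTransaction();
        try {
            Entity e;
            try {
                e = ds.get(txn, KEY);
            } catch (EntityNotFoundException notFound) {
                e = new Entity(KEY);               // first run: create the bookkeeping entity
                e.setProperty("running", false);
                e.setProperty("lastRun", null);
            }
            if (Boolean.TRUE.equals(e.getProperty("running"))) {
                txn.rollback();
                return false;                      // a previous job has not finished yet
            }
            e.setProperty("running", true);
            ds.put(txn, e);
            txn.commit();                          // fails if another process committed first
            return true;
        } finally {
            if (txn.isActive()) txn.rollback();
        }
    }
}
```

If two processes race, at most one commit succeeds; the loser gets a ConcurrentModificationException from commit() and can simply back off.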

quartz clustering : scheduler actions visible on all nodes

I'm having an issue; maybe you can help me.
Basically, I would like to know:
whether quartz clustering can have its triggers changed dynamically (i.e. the same config on all servers, but at a given point in time I want to change the cron expression ON A SINGLE SERVER and see this change propagated to ALL servers).
generally, whether changes on a single server are propagated to all other servers (for example, if I stop a particular scheduler on a single node, whether all nodes stop the scheduler).
Unless you're going for the TerracottaJobStore, you are probably clustering through the database. The way it works is that scheduling data, such as Triggers and JobDetails, is saved to the database, and all Scheduler nodes synchronize on that persisted data. Therefore, a change to that data on one node is reflected on all nodes.
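For example, with the Quartz 2.x API you can reschedule on any single node, and because the trigger lives in the shared JDBC store, every node picks up the new cron expression; the trigger name and group here are assumptions:

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.TriggerKey;

public class Reschedule {
    public static void changeCron(Scheduler scheduler, String newCron) throws Exception {
        TriggerKey key = TriggerKey.triggerKey("progressTrigger", "jobs"); // assumed name
        Trigger newTrigger = TriggerBuilder.newTrigger()
            .withIdentity(key)
            .forJob(scheduler.getTrigger(key).getJobKey())  // keep the same job
            .withSchedule(CronScheduleBuilder.cronSchedule(newCron))
            .build();
        scheduler.rescheduleJob(key, newTrigger);           // persisted; visible to all nodes
    }
}
```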
OTOH, stopping / starting / standby etc. are management operations (as opposed to Triggers and JobDetails). Management state is considered node-specific and does not propagate to other nodes. According to this post, it might in the future...

How to know when updates to the Google AppEngine HRD datastore are complete?

I have a long-running job that updates thousands of entity groups. I want to kick off a second job afterwards that has to assume all of those items have been updated. Since there are so many entity groups, I can't do it in a transaction, so I've just scheduled the second job to run 15 minutes after the first completes, using task queues.
Is there a better way?
Is it even safe to assume that 15 minutes guarantees the datastore is in sync with my previous calls?
I am using high replication.
In the Google I/O videos about HRD, they give a list of ways to deal with eventual consistency. One of them was to "accept it". Some updates (like Twitter posts) don't need to be consistent with the next read. But they also said something like "hey, we're only talking milliseconds to a couple of seconds before they are consistent". Is that time frame documented anywhere else? Is it safe to assume that waiting 1 minute after a write before reading again means all my previous writes will show up in the read?
The mention of that is at the 39:30 mark in this video http://www.youtube.com/watch?feature=player_embedded&v=xO015C3R6dw
I don't think there is any built-in way to determine whether the updates are done. I would recommend adding a lastUpdated field to your entities and updating it in your first job, then having the second job check the timestamp on the entities before running... kind of a hack, but it should work.
Interested to see if anybody has a better solution. Kinda hope they do ;-)
This is automatic as long as you are getting entities without changing the read consistency to Eventual. The HRD writes data to a majority of the relevant datastore servers before returning. If you are calling the asynchronous version of put, you'll need to call get on all the Future objects before you can be sure the writes have completed.
If, however, you are querying for the items in the first job, there's no way to be sure that the index has been updated.
So for example...
If you are updating a property on every entity (but not creating any entities) and then retrieving all entities of that kind, you can do a keys-only query followed by a batch get (which is approximately as fast/cheap as doing a normal query) and be sure that you see all the updates applied.
On the other hand, if you're adding new entities or updating a property in the first process that the second process queries, there's no way to be sure.
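A minimal sketch of that keys-only-query-plus-batch-get pattern from the first case, using the low-level datastore API (the kind name "Record" is an assumption):

```java
import com.google.appengine.api.datastore.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class StronglyConsistentRead {
    public static Map<Key, Entity> fetchAll(DatastoreService ds) {
        // The (possibly stale) index gives us keys only...
        Query q = new Query("Record").setKeysOnly();
        List<Key> keys = new ArrayList<>();
        for (Entity e : ds.prepare(q).asIterable()) {
            keys.add(e.getKey());
        }
        // ...but the batch get by key is strongly consistent, so every
        // property update from the first job is guaranteed to be visible.
        return ds.get(keys);
    }
}
```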
I did find this statement:
With eventual consistency, more than 99.9% of your writes are available for queries within a few seconds.
at the bottom of this page:
http://code.google.com/appengine/docs/java/datastore/hr/overview.html
So, for my application, a 0.1% chance of it not being there on the next read is probably OK. However, I do plan to redesign my schema to make use of ancestor queries.