From this article we can learn that Spring-Batch holds the Job's status in some SQL repository.
And from this article we can learn that the location of the JobRepository can be configured - can be in-memory and can be remote DB.
So if we need to scale a batch job, should we run several different Spring-batch JARs, all configured to use the same shared DB in order to keep them synchronized?
Is this the right pattern / architecture?
Yes, this is the way to go. The problem that might happen when you launch the same job from different physical nodes is that you can create the same job instance twice. In this case, Spring Batch will not know which instance to pick up when restarting a failed execution. A shared job repository acts as a safeguard to prevent this kind of concurrency issues.
The job repository achieves this synchronization thanks to the transactional capabilities of the underlying database. The IsolationLevelForCreate can be set to an aggressive value (SERIALIZABLE is the default) in order to avoid the aforementioned issue.
Related
Context
We have a Spring boot application (an API used by an angular frontend).
It is running on a docker container.
It is using a single instance of a PostgreSQL database.
Our application had some load problems so we asked us to scale it.
We told us to run our API on several docker containers for that.
We have several questions / problems dealing with code synchronization over multiple docker instances executing our code.
Problem 1
We have some #Scheduled jobs integrated and deployed with our API code.
We don't want these scheduled jobs to be executed by all container instances, but only one.
I think we can simply handle this by disabling jobs on other containers through environment variables with the "-" value to disable the Spring scheduled cron.
https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/scheduling/annotation/Scheduled.html#CRON_DISABLED
Does it sounds right?
Problem 2
The other problem is that we use Spring's #Lock annotation on some repository methods.
public interface IncrementRepository extends JpaRepository<IncrementEntity, UUID> {
#Lock(LockModeType.PESSIMISTIC_FORCE_INCREMENT)
Optional<IncrementEntity> findByAnnee(String pAnneeAA);
#Lock(LockModeType.PESSIMISTIC_WRITE)
IncrementEntity save(IncrementEntity pIncrementEntity);
}
This is critical for us to have a lock on that as we get / compute an increment used to act as a unique identifier of some of our data.
If I correctly understood this locking mechanism :
if a process execute this code, the Spring JPA #Transaction will acquire a lock on the IncrementEntity (lock the database table).
when another process tries do do the same thing before the first lock has been released by the first transaction, it should have a PessimisticLockException and the second transaction will rollback
this is managed by Spring at application level, NOT directly at database level (right??)
So what will happen if we're running our code on several containers ?
app running in container 1 sets a lock
app running in container 2 execute the same code and tries to set the same lock while the first one has not been released yet
each Spring application running in different containers will probably acquire the lock without problems as they don't share the same information?
Please tell me if I correctly understood how it works, and if we will effectively have a problem running such code on several docker containers.
I guess that solution would be to set a lock directly on the database table, as we have only one instance of it?
Is there a way to easily set / release the lock at database level using Spring JPA code ?
Or perhaps I misunderstood and setting a lock using Spring's #Lock annotation sets a real database lock ?
In that case, perhaps we don't have any problem at all, as the lock is correctly set on the database itself, shared by all containers instances??
Problem 3
To avoid having too much exceptions and reject some requests trying to acquire a lock at the same time, we also added a synchronized block around the above code.
String numIncrement;
synchronized (this.mutex) {
try {
numIncrement = this.incrementService.getIncrement(var);
} catch (Exception e) {
// rethrow custom technical exception
}
}
This way concurrent requests should be delayed and queued, which is better for our users experience.
I guess that we will also have problems here as docker instances doesn't share the same JVM, so synchronization can work only in the scope of the container itself... right?
Conclusion
For all these problems, please tell me if you have some solutions to workaround / adapt our code so it can be compatible with app scaling.
Following a set of tests I can confirm these points about my original question
Problem 1
We can disable a Spring CRON with the - value
#Scheduled(cron = "-")
Problem 2
The Spring's JPa #Lock annotation sets a lock on the database itself. It is not managed by Spring software.
So when duplicating containers, if the Spring app in the first container sets a lock, the database is locked and when the second app in another container tries to get data it has the PessimisticLockException.
Problem 3
Synchronized code using the synchronized JAVA keyword is obviously managed by JVM, so there is no code mutual exclusion between containers.
We get the below error when using spring batch .
org.springframework.dao.OptimisticLockingFailureException: Attempt to update step execution id=8827 with wrong version (1), where current version is 2
What I observed from different forums was that we were using org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean which is not thread safe and not adivsable to be used in production.
We do not want to persist the meta data of the jobs or use in memory database - Is there any other alternative to MapJobRepositoryFactoryBean ?
Thanks
Lives
According to this post on the spring forums the MapJobRepositoryFactoryBean is not generally intended for production use. I guess I would ask why wouldn't you want the metadata persisted to a database? It provides tremendous value, not to mention giving you the ability to use the spring batch admin console.
Has "spring batch" ability to limit run jobs without manual check job status? Job can be different or instances of one job. Need something like configurable property.
No. The only way is to manually check via JobExplorer interface or directly query jobs metadata tables
I have a scenario where the unit of work is defined as:
Update table T1 in database server S1
Update table T2 in database server S2
And I want the above unit of work to happen either completely or none at all (as the case with any database transaction). How can I do this? I searched extensively and found this post close to what I am expecting but this seems to be very specific to Hibernate.
I am using Spring, iBatis and Tomcat (6.x) as the container.
It really depends on how robust a solution you need. The minimal level of reliability on such a thing is XA transactions. To use that, you need a database and JDBC driver that supports it for starters, then you could configure Spring to use it (here is an outline).
If XA isn't robust enough for you (XA has failure scenarios, such as if something goes wrong in the second phase of commits, such as a hardware failure) then what you really need to do is put all the data in one database and then have a separate process propagate it. So the data may be inconsistent, but it is recoverable.
Edit: What I mean is that put the whole of the data into one database. Either the first database, or a different database for this purpose. This database would essentially become a queue from which the final data view is fed. The write to that database (assuming a decent database product) will be complete, or fail completely. Then, a separate thread would poll that database and distribute any missing data to the other databases. So if the process should fail, when that thread starts up again it will continue the distribution process. The data may not exist in every place you want it to right away, but nothing would get lost.
You want a distributed transaction manager. I like using Atomikos which can be run within a JVM.
I have an application - more like a utility - that sits in a corner and updates two different databases periodically.
It is a little standalone app that has been built with a Spring Application Context. The context has two Hibernate Session Factories configured in it, in turn using Commons DBCP data sources configured in Spring.
Currently there is no transaction management, but I would like to add some. The update to one database depends on a successful update to the other.
The app does not sit in a Java EE container - it is bootstrapped by a static launcher class called from a shell script. The launcher class instantiates the Application Context and then invokes a method on one of its beans.
What is the 'best' way to put transactionality around the database updates?
I will leave the definition of 'best' to you, but I think it should be some function of 'easy to set up', 'easy to configure', 'inexpensive', and 'easy to package and redistribute'. Naturally FOSS would be good.
The best way to distribute transactions over more than one database is: Don't.
Some people will point you to XA but XA (or Two Phase Commit) is a lie (or marketese).
Imagine: After the first phase have told the XA manager that it can send the final commit, the network connection to one of the databases fails. Now what? Timeout? That would leave the other database corrupt. Rollback? Two problems: You can't roll back a commit and how do you know what happened to the second database? Maybe the network connection failed after it successfully committed the data and only the "success" message was lost?
The best way is to copy the data in a single place. Use a scheme which allows you to abort the copy and continue it at any time (for example, ignore data which you already have or order the select by ID and request only records > MAX(ID) of your copy). Protect this with a transaction. This is not a problem since you're only reading data from the source, so when the transaction fails for any reason, you can ignore the source database. Therefore, this is a plain old single source transaction.
After you have copied the data, process it locally.
Setup a transaction manager in your context. Spring docs have examples, and it is very simple. Then when you want to execute a transaction:
try {
TransactionTemplate tt = new TransactionTemplate(txManager);
tt.execute(new TransactionCallbackWithoutResult(){
protected void doInTransactionWithoutResult(
TransactionStatus status) {
updateDb1();
updateDb2();
}
} catch (TransactionException ex) {
// handle
}
For more examples, and information perhaps look at this:
XA transactions using Spring
When you say "two different databases", do you mean different database servers, or two different schemas within the same DB server?
If the former, then if you want full transactionality, then you need the XA transaction API, which provides full two-phase commit. But more importantly, you also need a transaction coordinator/monitor which manages transaction propagation between the different database systems. This is part of JavaEE spec, and a pretty rarefied part of it at that. The TX coordinator itself is a complex piece of software. Your application software (via Spring, if you so wish) talks to the coordinator.
If, however, you just mean two databases within the same DB server, then vanilla JDBC transactions should work just fine, just perform your operations against both databases within a single transaction.
In this case you would need a Transaction Monitor (server supporting XA protocol) and make sure your databases supports XA also. Most (all?) J2EE servers comes with Transaction Monitor built in. If your code is running not in J2EE server then there are bunch of standalone alternatives - Atomicos, Bitronix, etc.
You could try Spring ChainedTransactionManager - http://docs.spring.io/spring-data/commons/docs/1.6.2.RELEASE/api/org/springframework/data/transaction/ChainedTransactionManager.html that supports distributed db transaction. This could be a better alternative to XA