How does Spring Batch manage transactions (with possibly multiple datasources)? - java

I would like some information about the data flow in a Spring Batch processing but fail to find what I am looking for on the Internet (despite some useful questions on this site).
I am trying to establish standards for using Spring Batch in our company, and we are wondering how Spring Batch behaves when several processors in a step update data on different data sources.
This question focuses on a chunked process but feel free to provide information on other modes.
From what I have seen (please correct me if I am wrong), when a line is read, it follows the whole flow (reader, processors, writer) before the next is read (as opposed to a silo-processing where reader would process all lines, send them to the processor, and so on).
In my case, several processors read data (in different databases) and update it in the process, and finally the writer inserts data into yet another DB. For now, the JobRepository is not linked to a database, but it would be an independent one, making things a bit more complex still.
This model cannot be changed since the data belongs to several business areas.
How is the transaction managed in this case? Is the data committed only once the full chunk is processed? And then, is there a 2-phase commit management? How is it ensured? What development or configuration should be made in order to ensure the consistency of data?
More generally, what would your recommendations be in a similar case?

Spring Batch uses the Spring core transaction management, with most of the transaction semantics arranged around a chunk of items, as described in section 5.1 of the Spring Batch docs.
The transaction behaviour of the readers and writers depends on exactly what they are (e.g. file system, database, JMS queue), but if the resource is configured to support transactions then it will be enlisted by Spring automatically. The same goes for XA: if you make the resource endpoint XA compliant, it will take part in two-phase commits.
Getting back to the chunk transaction: Spring Batch sets up a transaction per chunk, so if you set the commit interval to 5 on a given tasklet, it opens a new transaction (covering all resources managed by the transaction manager), reads and processes 5 items, writes them, commits, and then repeats for the next chunk.
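To make the per-chunk transaction boundary concrete, here is a minimal plain-Java sketch. It is not Spring Batch's actual implementation: ChunkLoopSketch and CountingTx are hypothetical names, and the "transaction" is just a counter. It only illustrates that one transaction wraps commit-interval items and that the writer receives the whole chunk at once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Plain-Java illustration of chunk-oriented processing; not Spring Batch's real code. */
class ChunkLoopSketch {

    /** Hypothetical stand-in for a transaction manager: it only counts calls. */
    static class CountingTx {
        int begun;
        int committed;
        int rolledBack;
        void begin() { begun++; }
        void commit() { committed++; }
        void rollback() { rolledBack++; }
    }

    static <I, O> void run(Iterator<I> reader, Function<I, O> processor,
                           Consumer<List<O>> writer, int commitInterval, CountingTx tx) {
        while (reader.hasNext()) {
            tx.begin();                              // one transaction per chunk
            try {
                List<O> chunk = new ArrayList<>();
                // Read and process items one at a time until the chunk is full.
                while (chunk.size() < commitInterval && reader.hasNext()) {
                    chunk.add(processor.apply(reader.next()));
                }
                writer.accept(chunk);                // the writer gets the whole chunk at once
                tx.commit();
            } catch (RuntimeException e) {
                tx.rollback();                       // the whole chunk is undone together
                throw e;
            }
        }
    }
}
```

With 7 items and a commit interval of 5, this opens and commits two transactions: one for 5 items and one for the remaining 2.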
But all of this is set up around reading from a single data source, does that meet your requirement? I'm not sure spring batch can manage a transaction where it reads data from multiple sources and writes the processor result into another database within a single transaction. (In fact I can't think of anything that could do that...)

Related

Spring integration FileTailingMessageProducer: Remember current line when restarting

We are using the Spring integration FileTailingMessageProducer (Apache Commons) for remotely tailing files and sending messages to rabbitmq.
Obviously, when the Java process that contains the file tailer is restarted, the information about which lines have already been processed is lost. We would like to be able to restart the process and continue tailing at the last line we had previously processed.
I guess we will have to keep this state either in a file on the host or a small database. The information stored in this file or db will probably be a simple map mapping file ids (file names will not suffice, since files may be rotated) to line numbers:
file ids -> line number
I am thinking about subclassing the ApacheCommonsFileTailingMessageProducer.
The java process will need to continually update this file or db. Is there a method for updating this file when the JVM exits?
Has anyone done this before? Are there any recommendations on how to proceed?
Spring Integration has an abstraction, MetadataStore - it's a simple key/value abstraction, so it would be perfect for this use case.
There are several implementations. The PropertiesPersistingMetadataStore persists to a properties file and, by default, only persists on an ApplicationContext close() (destroy()).
It implements Flushable so it can be flush()ed more often.
The other implementations (Redis, MongoDB, Gemfire) don't need flushing because the data is written immediately.
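As a rough illustration of the idea (a key/value map of file id to line number, persisted to a properties file and flushed on demand or at JVM exit), here is a plain-Java sketch. TailOffsetStore is a hypothetical class written for this example; it is not the Spring Integration MetadataStore API, and the real implementations should be preferred.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical file-id -> line-number store backed by a properties file. */
class TailOffsetStore {
    private final Path file;
    private final Properties props = new Properties();

    TailOffsetStore(Path file) {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);                      // restore offsets from the previous run
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    synchronized void put(String fileId, long lineNumber) {
        props.setProperty(fileId, Long.toString(lineNumber));
    }

    /** Returns the stored line number, or 0 if the file id is unknown. */
    synchronized long get(String fileId) {
        return Long.parseLong(props.getProperty(fileId, "0"));
    }

    /** Writes the current state to disk; call it as often as you can afford. */
    synchronized void flush() {
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "tail offsets");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Flushes on normal JVM exit; a hard kill (e.g. kill -9) will not run the hook. */
    void flushOnShutdown() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::flush));
    }
}
```

Because a shutdown hook does not run on a hard kill, flushing periodically as well as at exit limits how many lines get reprocessed after a crash.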
A subclass would work, the file tailer is a simple bean and can be declared as a <bean/> - there's no other "magic" done by the XML parser.
But, if you'd be interested in contributing it to the framework, consider adding the code to the adapter directly. Ideally, it would go in the superclass (FileTailingMessageProducerSupport) but I don't think we will have the ability to look at the file creation timestamp in the OSDelegatingFileTailingMessageProducer because we just get the line data streamed to us.
In any case, please open a JIRA Issue for this feature.

Transaction Management in EJB

Recently I was asked a question that left me thinking, and I would like to get the community's views on it.
I have a CustomerEJB which has say a createCustomer method. My EJB is exposed as a web service and hence createCustomer is one of its operations.
When a request hits createCustomer, 2 operations need to be performed:
1. An INSERT SQL query into the database which may be adding certain data into db that came in input request
2. Creation of a text file, say .txt in the file system.
Now the question is I want to couple these two tasks into a transaction. If any one task fails, I rollback the other task as well.
Without relying on any hot technologies like Spring/Hibernate, what approach can I follow for transaction management?
My thoughts:
1. I can use JTA, demarcate the transaction boundaries and perform commit and rollback accordingly. JDBC can be used for the SQL task
2. I can use DAOs
Inviting your kind suggestions/comments
You would need to wrap the file creation in an XA-capable JCA connector (I'm not sure whether there's a ready-made one out there; a quick Google search only found this fsconnector, which doesn't support transactions yet), use an XA driver for your DB transaction (most DBs will be able to handle this), and then wrap your EJB in an XA transaction (this should be straightforward).
As long as both resources can handle the XA transactions, you'll get the benefit of 2-phase commits, which is what you're after.
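For readers unfamiliar with two-phase commit, the following plain-Java sketch shows the basic protocol shape: every participant votes in a prepare phase, and only a unanimous yes leads to commit. TwoPhaseParticipant, LoggingParticipant and TwoPhaseCoordinatorSketch are hypothetical names for illustration; a real JTA transaction manager also handles timeouts, durable logging and crash recovery, none of which appear here.

```java
import java.util.List;

/** Hypothetical participant contract, loosely modelled on XA's prepare/commit/rollback. */
interface TwoPhaseParticipant {
    boolean prepare();   // phase 1: vote; true means "I promise I can commit"
    void commit();       // phase 2 after a unanimous yes
    void rollback();     // phase 2 after any no
}

/** Test double that records every call, standing in for a database or a file resource. */
class LoggingParticipant implements TwoPhaseParticipant {
    private final String name;
    private final boolean vote;
    private final List<String> log;

    LoggingParticipant(String name, boolean vote, List<String> log) {
        this.name = name;
        this.vote = vote;
        this.log = log;
    }

    public boolean prepare() { log.add(name + ":prepare"); return vote; }
    public void commit() { log.add(name + ":commit"); }
    public void rollback() { log.add(name + ":rollback"); }
}

class TwoPhaseCoordinatorSketch {
    /** Returns true if all participants committed, false if the work was rolled back. */
    static boolean run(List<? extends TwoPhaseParticipant> participants) {
        // Phase 1: collect votes.
        for (TwoPhaseParticipant p : participants) {
            if (!p.prepare()) {
                // A single "no" aborts the whole unit of work for everyone.
                for (TwoPhaseParticipant q : participants) {
                    q.rollback();
                }
                return false;
            }
        }
        // Phase 2: everyone voted yes, so tell everyone to commit.
        for (TwoPhaseParticipant p : participants) {
            p.commit();
        }
        return true;
    }
}
```

In the createCustomer scenario, the database and the file connector would each play the role of one participant.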

How to design a global distributed transaction (non-database)? Can JTA be used for non-DB transactions?

I think this is a fairly common question: how do I put my business logic in a global transaction in a distributed systems environment? For example, I have a TaskA containing a couple of subtasks:
TaskA {subtask1, subtask2, subtask3 ... }
each of these subtasks may execute on local machine or a remote one, I hope TaskA executes in an atomic manner(success or fail) by means of transaction. Every subtask has a rollback function, once TaskA thinks the operation fails(because one of subtask fails), it calls each rollback function of subtasks. Otherwise TaskA commits the whole transaction.
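The behaviour described above (run the subtasks, and if one fails call the rollback function of the others) can be sketched in plain Java as follows. The names CompensableSubtask, RecordingSubtask and CompensatingTaskSketch are hypothetical, and undoing in reverse order of completion is an assumption; the description does not specify an order. Remote execution and context propagation, which are the hard parts of the question, are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical subtask contract: do the work, and know how to undo it. */
interface CompensableSubtask {
    void execute();
    void rollback();
}

/** Test double that records calls and can be told to fail. */
class RecordingSubtask implements CompensableSubtask {
    private final String name;
    private final boolean fails;
    private final List<String> log;

    RecordingSubtask(String name, boolean fails, List<String> log) {
        this.name = name;
        this.fails = fails;
        this.log = log;
    }

    public void execute() {
        log.add(name + ":execute");
        if (fails) {
            throw new IllegalStateException(name + " failed");
        }
    }

    public void rollback() { log.add(name + ":rollback"); }
}

class CompensatingTaskSketch {
    /** Returns true if every subtask succeeded; otherwise undoes the completed ones. */
    static boolean run(List<? extends CompensableSubtask> subtasks) {
        Deque<CompensableSubtask> completed = new ArrayDeque<>();
        for (CompensableSubtask subtask : subtasks) {
            try {
                subtask.execute();
                completed.push(subtask);             // most recently completed first
            } catch (RuntimeException e) {
                // Assumption: undo in reverse order of completion.
                while (!completed.isEmpty()) {
                    completed.pop().rollback();
                }
                return false;
            }
        }
        return true;
    }
}
```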
To do this, I follow the "Audit Trail" transaction pattern and keep a record for each subtask, so TaskA can know the results of the subtasks and then decide whether to roll back or commit. This sounds simple; however, the hard part is how to associate each subtask with the global transaction.
When TaskA begins, it starts a global transaction about which subtask knows nothing. To make subtask aware of it, I have to pass the transaction context to every invocation of subtask. This is really dreadful! My subtask may either execute in a new thread or execute in remote by sending a message through AMQP broker, it's hard to consolidate the way of context propagation.
I did some research like "Transaction Patterns - A Collection of Four Transaction Related Patterns", "Checked Transactions in an Asynchronous Message Passing Environment", none of these solve my problem. They either don't have practical example or don't solve the context propagation issue.
I wonder how people solve this? as this sort of transaction must be common in enterprise software.
Is X/Open XA the only solution for this? Can JTA help here? (I haven't looked into JTA, as most material on it relates to database transactions, and I am using Spring; I don't want to involve another Java EE application server in my software.)
Can some expert share some thoughts with me? thank you.
Conclusion
Arjan and Martin gave out really good answers, thank you.
In the end I didn't go this way. After more research, I chose another pattern, "CheckPoint".
Looking at my requirement, I found that what I wanted from the "Audit Trail" transaction pattern was to know how far the operation had proceeded, so that if it failed I could restart it at the failed spot after reloading some context. This is not actually a transaction, since it does not roll back the other successful steps after a failure. This is the essence of the CheckPoint pattern.
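A minimal plain-Java sketch of that checkpoint idea follows. CheckpointSketch is a hypothetical name, and the checkpoint is held in memory here; a real system would persist it (file, database) so that it survives a restart.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class CheckpointSketch {
    /**
     * Runs the steps starting at the saved checkpoint, which holds the index of
     * the next step to run. Returns true once every step has completed.
     */
    static boolean resume(List<Runnable> steps, AtomicInteger checkpoint) {
        for (int i = checkpoint.get(); i < steps.size(); i++) {
            try {
                steps.get(i).run();
            } catch (RuntimeException e) {
                return false;                 // earlier steps stay done; nothing is rolled back
            }
            checkpoint.set(i + 1);            // record progress only after the step succeeded
        }
        return true;
    }
}
```

Calling resume again after a failure re-runs only the failed step and the steps after it, which is the difference from a transaction.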
However, studying distributed transactions taught me a lot of interesting things. Apart from what Arjan and Martin have mentioned, I also suggest that people digging into this area take a look at CORBA, which is a well-known protocol for distributed systems.
You're right, you need two-phase commit support thanks to a XA transaction manager provided by the JTA API.
As far as I know, Spring does not provide a JTA transaction manager implementation itself. The JtaTransactionManager only delegates to an existing implementation, usually provided by a Java EE server.
So you will have to plug a JTA implementation into Spring to get the job done. Here are some proposals:
JOTM
JBossTS based on Arjuna.
Atomikos
Then you will have to implement your resource manager to support two-phase commit. In the Java EE world, this consists of a resource adapter packaged as a RAR archive. Depending on the type of resource, the following are worth reading up on or implementing:
XAResource interface
JCA Java Connector Architecture mainly if a remote connection is involved
JTS Transaction Service if transaction propagation between nodes is required
As examples, I recommend looking at implementations for the classic "transactions with files" issue:
the transactional file manager from JBoss Transactions
XADisk project
If you want to write your own transactional resource, you indeed need to implement an XAResource and let it join an ongoing transaction, handle prepare and commit requests from the transaction manager etc.
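To give a feel for what implementing an XAResource involves, here is a deliberately simplified sketch of an in-memory map that buffers writes per transaction and applies them on commit. The interface javax.transaction.xa.XAResource and its methods are the standard ones shipped with the JDK; the class InMemoryXaMap and its put/get methods are hypothetical. It ignores the flags, keeps nothing durable, and keys its pending work by Xid object identity, so it is an illustration only and not something a real transaction manager could rely on for recovery.

```java
import java.util.HashMap;
import java.util.Map;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

/** Hypothetical in-memory resource: writes are buffered per transaction branch. */
class InMemoryXaMap implements XAResource {
    private final Map<String, String> committed = new HashMap<>();
    // Keyed by Xid; assumes the caller reuses the same Xid object (or one with proper equals/hashCode).
    private final Map<Xid, Map<String, String>> pending = new HashMap<>();
    private Xid current;

    void put(String key, String value) {
        if (current == null) {
            throw new IllegalStateException("not associated with a transaction");
        }
        pending.get(current).put(key, value);
    }

    String get(String key) { return committed.get(key); }

    @Override public void start(Xid xid, int flags) {
        pending.computeIfAbsent(xid, x -> new HashMap<>());
        current = xid;                               // work from now on belongs to this branch
    }

    @Override public void end(Xid xid, int flags) { current = null; }

    @Override public int prepare(Xid xid) throws XAException {
        Map<String, String> work = pending.get(xid);
        if (work == null) {
            throw new XAException(XAException.XAER_NOTA);
        }
        // A real resource must persist 'work' durably here so it can still commit after a crash.
        if (work.isEmpty()) {
            pending.remove(xid);
            return XA_RDONLY;                        // nothing to commit for this branch
        }
        return XA_OK;
    }

    @Override public void commit(Xid xid, boolean onePhase) throws XAException {
        Map<String, String> work = pending.remove(xid);
        if (work == null) {
            throw new XAException(XAException.XAER_NOTA);
        }
        committed.putAll(work);                      // make the buffered writes visible
    }

    @Override public void rollback(Xid xid) { pending.remove(xid); }
    @Override public void forget(Xid xid) { pending.remove(xid); }
    @Override public Xid[] recover(int flag) { return new Xid[0]; } // nothing survives a crash here
    @Override public boolean isSameRM(XAResource other) { return other == this; }
    @Override public int getTransactionTimeout() { return 0; }
    @Override public boolean setTransactionTimeout(int seconds) { return false; }
}
```

In a JTA environment the transaction manager makes these calls after the resource has been enlisted in the ongoing transaction; application code does not call them directly.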
Datasources are the most well known transactional resources, but as mentioned they are not the only ones. You already mentioned JMS providers. A variety of Caching solutions (e.g. Infinispan) are also transactional resources.
Implementing XAResources and working with the lower level part of the JTA API and the even lower level JTS (Java Transaction Service) is not a task for the faint of heart. The API can be archaic and the entire process is only scarcely documented.
The reason is that regular enterprise applications creating their own transactional resources is extremely rare. The entire point of being transactional is to do an action that is atomic for an external observer.
To be observable in the overwhelming majority of cases means the effects of the action are present in a database. Nearly each and every datasource is already transactional, so that use case is fully covered.
Another observable effect is whether a message has been sent or not, and that too is fully covered by the existing messaging solutions.
Finally, updating a (cluster-wide) in memory map is yet another observable effect, which too is covered by the major cache providers.
There's a remnant of demand for transactional effects when operating with external enterprise information systems (EIS), and as a rule of thumb the vendors of such systems provide transaction aware connectors.
The sliver of use cases that remain is so small that apparently nobody ever really bothered to write much about it. There are some blogs out there that cover some of the basics, but much will be left to your own experimentation.
Do try to verify for yourself if you really absolutely need to go down this road, and if your need can't be met by one of the existing transactional resources.
You could do the following (similar to your checkpoint strategy):
1. TaskA executes in a (local) JTA transaction and tries to reserve the necessary resources before it delegates to your subtasks
2. Subtask invocations are done via JMS/XA: if step 1 fails, no subtask will ever be triggered and if step 1 commits then the JMS invocations will be received at each subtask
3. Subtasks (re)try to process their invocation message with a JMS max redelivery limit set (see your JMS vendor docs on how to do this)
4. Configure a "dead letter queue" for any failures in 3.
This works, assuming that:
-retrying in step 3 makes sense
-there is a bit of human intervention needed for messages in the dead letter queue
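The retry and dead letter behaviour from the steps above can be sketched in plain Java as follows. This is only a simulation of what a JMS broker does for you: RedeliverySketch and drain are hypothetical names, and no JMS API is used.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

class RedeliverySketch {
    /**
     * Delivers each queued message to the subtask, retrying up to maxDeliveries
     * times. Messages that never succeed are returned as the "dead letters".
     */
    static <M> List<M> drain(Deque<M> queue, Consumer<M> subtask, int maxDeliveries) {
        List<M> deadLetters = new ArrayList<>();
        while (!queue.isEmpty()) {
            M message = queue.poll();
            boolean done = false;
            for (int attempt = 1; attempt <= maxDeliveries && !done; attempt++) {
                try {
                    subtask.accept(message);
                    done = true;                     // processing committed; message is consumed
                } catch (RuntimeException e) {
                    // Processing rolled back; a broker would redeliver the message.
                }
            }
            if (!done) {
                deadLetters.add(message);            // left for human intervention
            }
        }
        return deadLetters;
    }
}
```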
If this is not acceptable then there is also TCC: http://www.atomikos.com/Main/DownloadWhitepapers?article=TccForRestApi.pdf - this can be seen as a "reservation" pattern for REST services.
Hope this helps
Guy
http://www.atomikos.com

Handling transactions spanning across database servers

I have a scenario where the unit of work is defined as:
Update table T1 in database server S1
Update table T2 in database server S2
And I want the above unit of work to happen either completely or none at all (as the case with any database transaction). How can I do this? I searched extensively and found this post close to what I am expecting but this seems to be very specific to Hibernate.
I am using Spring, iBatis and Tomcat (6.x) as the container.
It really depends on how robust a solution you need. The minimal level of reliability on such a thing is XA transactions. To use that, you need a database and JDBC driver that supports it for starters, then you could configure Spring to use it (here is an outline).
If XA isn't robust enough for you (XA has failure scenarios, such as if something goes wrong in the second phase of commits, such as a hardware failure) then what you really need to do is put all the data in one database and then have a separate process propagate it. So the data may be inconsistent, but it is recoverable.
Edit: What I mean is that put the whole of the data into one database. Either the first database, or a different database for this purpose. This database would essentially become a queue from which the final data view is fed. The write to that database (assuming a decent database product) will be complete, or fail completely. Then, a separate thread would poll that database and distribute any missing data to the other databases. So if the process should fail, when that thread starts up again it will continue the distribution process. The data may not exist in every place you want it to right away, but nothing would get lost.
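The staging approach from the edit above can be sketched in plain Java like this. StagingQueueSketch and its methods are hypothetical, and the in-memory list stands in for the single staging database; each polling pass delivers whatever is still missing, so a failed pass is simply picked up by the next one.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical staging table: each row remembers which targets already received it. */
class StagingQueueSketch {
    static class Row {
        final long id;
        final String payload;
        final Set<String> deliveredTo = new HashSet<>();

        Row(long id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    /** The only write the business operation performs: one database, one local transaction. */
    void stage(long id, String payload) {
        rows.add(new Row(id, payload));
    }

    /** One polling pass over the staged rows; returns the number of deliveries made. */
    int distribute(Map<String, Consumer<Row>> targets) {
        int delivered = 0;
        for (Row row : rows) {
            for (Map.Entry<String, Consumer<Row>> target : targets.entrySet()) {
                if (row.deliveredTo.contains(target.getKey())) {
                    continue;                              // already propagated
                }
                try {
                    target.getValue().accept(row);         // write to the target database
                    row.deliveredTo.add(target.getKey());  // mark only after success
                    delivered++;
                } catch (RuntimeException e) {
                    // Leave the row unmarked; the next pass retries it.
                }
            }
        }
        return delivered;
    }
}
```

Since a crash between the target write and the marking would cause a redelivery, the target writes should be idempotent.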
You want a distributed transaction manager. I like using Atomikos which can be run within a JVM.

What is the 'best' way to do distributed transactions across multiple databases using Spring and Hibernate

I have an application - more like a utility - that sits in a corner and updates two different databases periodically.
It is a little standalone app that has been built with a Spring Application Context. The context has two Hibernate Session Factories configured in it, in turn using Commons DBCP data sources configured in Spring.
Currently there is no transaction management, but I would like to add some. The update to one database depends on a successful update to the other.
The app does not sit in a Java EE container - it is bootstrapped by a static launcher class called from a shell script. The launcher class instantiates the Application Context and then invokes a method on one of its beans.
What is the 'best' way to put transactionality around the database updates?
I will leave the definition of 'best' to you, but I think it should be some function of 'easy to set up', 'easy to configure', 'inexpensive', and 'easy to package and redistribute'. Naturally FOSS would be good.
The best way to distribute transactions over more than one database is: Don't.
Some people will point you to XA but XA (or Two Phase Commit) is a lie (or marketese).
Imagine: After the first phase has told the XA manager that it can send the final commit, the network connection to one of the databases fails. Now what? Timeout? That would leave the other database corrupt. Rollback? Two problems: You can't roll back a commit and how do you know what happened to the second database? Maybe the network connection failed after it successfully committed the data and only the "success" message was lost?
The best way is to copy the data in a single place. Use a scheme which allows you to abort the copy and continue it at any time (for example, ignore data which you already have or order the select by ID and request only records > MAX(ID) of your copy). Protect this with a transaction. This is not a problem since you're only reading data from the source, so when the transaction fails for any reason, you can ignore the source database. Therefore, this is a plain old single source transaction.
After you have copied the data, process it locally.
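The "request only records > MAX(ID) of your copy" scheme can be sketched in plain Java as follows, with sorted maps standing in for the source table and the local copy. ResumableCopySketch is a hypothetical name, and the sketch assumes ids only ever increase and that existing rows are never updated.

```java
import java.util.Map;
import java.util.NavigableMap;

class ResumableCopySketch {
    /**
     * Copies every source row whose id is greater than the highest id already
     * present in the local copy. Returns the number of rows copied.
     */
    static int copyNewRows(NavigableMap<Long, String> source, NavigableMap<Long, String> local) {
        // Equivalent of: SELECT ... WHERE id > (SELECT MAX(id) FROM local) ORDER BY id
        Map<Long, String> newer = local.isEmpty()
                ? source
                : source.tailMap(local.lastKey(), false);
        int copied = 0;
        for (Map.Entry<Long, String> row : newer.entrySet()) {
            local.put(row.getKey(), row.getValue());
            copied++;
        }
        return copied;
    }
}
```

If a run is aborted halfway, the next run continues from whatever made it into the local copy.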
Set up a transaction manager in your context. The Spring docs have examples, and it is very simple. Then when you want to execute a transaction:
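import org.springframework.transaction.TransactionException;
import org.springframework.transaction.TransactionStatus;
import org.springframework.transaction.support.TransactionCallbackWithoutResult;
import org.springframework.transaction.support.TransactionTemplate;

try {
    // txManager is the PlatformTransactionManager configured in your context
    TransactionTemplate tt = new TransactionTemplate(txManager);
    tt.execute(new TransactionCallbackWithoutResult() {
        @Override
        protected void doInTransactionWithoutResult(TransactionStatus status) {
            // both updates run inside the same transaction
            updateDb1();
            updateDb2();
        }
    });
} catch (TransactionException ex) {
    // handle
}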
For more examples, and information perhaps look at this:
XA transactions using Spring
When you say "two different databases", do you mean different database servers, or two different schemas within the same DB server?
If the former, then if you want full transactionality, then you need the XA transaction API, which provides full two-phase commit. But more importantly, you also need a transaction coordinator/monitor which manages transaction propagation between the different database systems. This is part of JavaEE spec, and a pretty rarefied part of it at that. The TX coordinator itself is a complex piece of software. Your application software (via Spring, if you so wish) talks to the coordinator.
If, however, you just mean two databases within the same DB server, then vanilla JDBC transactions should work just fine, just perform your operations against both databases within a single transaction.
In this case you would need a Transaction Monitor (a server supporting the XA protocol), and you must make sure your databases support XA as well. Most (all?) J2EE servers come with a Transaction Monitor built in. If your code is not running in a J2EE server, there are a bunch of standalone alternatives - Atomikos, Bitronix, etc.
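You could try Spring Data's ChainedTransactionManager - http://docs.spring.io/spring-data/commons/docs/1.6.2.RELEASE/api/org/springframework/data/transaction/ChainedTransactionManager.html - which coordinates several transaction managers, starting transactions in the configured order and committing them in reverse order. It is a best-effort approach rather than a true two-phase commit, but it could be a lighter-weight alternative to XA.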
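To show what chaining transaction managers does and does not guarantee, here is a plain-Java simulation of the commit ordering described in the ChainedTransactionManager Javadoc (commit in reverse order; once one commit fails, the remaining transactions are rolled back). LocalTx, RecordingTx and ChainedCommitSketch are hypothetical names; this is not the Spring Data class itself.

```java
import java.util.List;

/** Hypothetical local transaction, standing in for one database's transaction. */
interface LocalTx {
    void commit();
    void rollback();
}

/** Test double that records the outcome and can be told to fail on commit. */
class RecordingTx implements LocalTx {
    private final String name;
    private final boolean failOnCommit;
    private final List<String> log;

    RecordingTx(String name, boolean failOnCommit, List<String> log) {
        this.name = name;
        this.failOnCommit = failOnCommit;
        this.log = log;
    }

    public void commit() {
        if (failOnCommit) {
            log.add(name + ":commit failed");
            throw new IllegalStateException(name + " commit failed");
        }
        log.add(name + ":committed");
    }

    public void rollback() { log.add(name + ":rolled back"); }
}

class ChainedCommitSketch {
    /** Commits in reverse order of the list; returns true only if every commit succeeded. */
    static boolean commitAll(List<? extends LocalTx> started) {
        boolean failed = false;
        for (int i = started.size() - 1; i >= 0; i--) {
            LocalTx tx = started.get(i);
            if (failed) {
                tx.rollback();                 // not yet committed, so it can still be undone
                continue;
            }
            try {
                tx.commit();
            } catch (RuntimeException e) {
                failed = true;                 // transactions committed earlier stay committed
            }
        }
        return !failed;
    }
}
```

If the transaction committed last is the one that fails, the others are already committed and the databases diverge. That is the window a real two-phase commit is meant to close, and it is why the transaction most likely to fail is best placed last in the list so that it commits first.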
