I am revamping a data loader that reads flat files and does a batch insert with JdbcTemplate for every 500 items. I am using a Java fixed thread pool executor that submits tasks; each task reads one file and performs the batch updates. For example, while reading the first file, if it fails during the 3rd batch insert, all the previous batch inserts for this file need to be rolled back. The task should then continue with the next file and create a new transaction for its inserts. I need code that can do this. Currently I am using TransactionTemplate, wrapping the batch-insert code inside doInTransactionWithoutResult, and calling TransactionStatus.setRollbackOnly in the catch block when an exception occurs. But I need code that creates a new transaction for the next file regardless of whether the last file failed or succeeded. Does setting the propagation to REQUIRES_NEW solve it?
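For reference, here is a rough sketch of what I mean (readBatches and the INSERT statement are just placeholders), with all of a file's inserts wrapped in a single TransactionTemplate.execute call and the propagation set to REQUIRES_NEW:

import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.TransactionDefinition;
import org.springframework.transaction.support.TransactionTemplate;

public class FileLoader {

    private final JdbcTemplate jdbcTemplate;
    private final TransactionTemplate txTemplate;
    private final ExecutorService executor;

    public FileLoader(JdbcTemplate jdbcTemplate, PlatformTransactionManager txManager, ExecutorService executor) {
        this.jdbcTemplate = jdbcTemplate;
        this.txTemplate = new TransactionTemplate(txManager);
        // REQUIRES_NEW guarantees a fresh transaction per execute() call, even if an
        // outer transaction happens to be active on the worker thread.
        this.txTemplate.setPropagationBehavior(TransactionDefinition.PROPAGATION_REQUIRES_NEW);
        this.executor = executor;
    }

    public void load(List<Path> files) {
        for (Path file : files) {
            executor.submit(() -> loadSingleFile(file));
        }
    }

    private void loadSingleFile(Path file) {
        try {
            txTemplate.execute(status -> {
                // All 500-item batches of this file share one transaction, so a failure
                // in the 3rd batch rolls back the first two as well.
                for (List<Object[]> batch : readBatches(file, 500)) {   // readBatches is a hypothetical helper
                    jdbcTemplate.batchUpdate("INSERT INTO items (col1, col2) VALUES (?, ?)", batch);
                }
                return null;
            });
        } catch (RuntimeException ex) {
            // This file's transaction has rolled back; the next file still gets its
            // own, independent transaction.
            System.err.println("Failed to load " + file + ": " + ex.getMessage());
        }
    }

    private List<List<Object[]>> readBatches(Path file, int batchSize) {
        throw new UnsupportedOperationException("parsing omitted in this sketch");
    }
}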
As Sean commented, you should not reinvent the whole thing; use Spring Batch instead.
Spring Batch will allow you to:
partition the execution (e.g. using a thread pool executor)
map records in the file(s) to objects
set the right commit interval, where it commits a "chunk" of processed records and rolls back if any of them are "wrong"
specify which errors are skippable or retryable
and much more
And it is already there => coded, tested and awesome.
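As a rough sketch (Spring Batch 4-style builder factories; the file layout, SQL, and record type below are just placeholders), a chunk-oriented step with a commit interval of 500 that skips unparsable lines could look like this:

import javax.sql.DataSource;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
@EnableBatchProcessing
public class LoadJobConfig {

    // Placeholder record type with col1/col2 properties.
    public static class Item {
        private String col1;
        private String col2;
        public String getCol1() { return col1; }
        public void setCol1(String col1) { this.col1 = col1; }
        public String getCol2() { return col2; }
        public void setCol2(String col2) { this.col2 = col2; }
    }

    @Bean
    public Step loadStep(StepBuilderFactory steps, DataSource dataSource) {
        return steps.get("loadStep")
                .<Item, Item>chunk(500)                       // commit interval: one transaction per 500 items
                .reader(new FlatFileItemReaderBuilder<Item>()
                        .name("itemReader")
                        .resource(new FileSystemResource("data/items.csv"))   // placeholder path
                        .delimited()
                        .names(new String[] {"col1", "col2"})
                        .targetType(Item.class)
                        .build())
                .writer(new JdbcBatchItemWriterBuilder<Item>()
                        .dataSource(dataSource)
                        .sql("INSERT INTO items (col1, col2) VALUES (:col1, :col2)")
                        .beanMapped()
                        .build())
                .faultTolerant()
                .skip(FlatFileParseException.class)           // bad lines are skipped instead of failing the job
                .skipLimit(100)
                .build();
    }

    @Bean
    public Job loadJob(JobBuilderFactory jobs, Step loadStep) {
        return jobs.get("loadJob").start(loadStep).build();
    }
}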
I am currently trying to investigate the total number of transactions in my Spring Batch job.
As I can see, a StepExecution has a property commit_count that tells how many transactions were committed by the step.
I have a job that consists of two steps:
1. Read a file and map the content to Java objects (= should be non-transactional, right?)
2. This step uses a Tasklet, which is only called once, so only one transaction is going to be created for it, according to my understanding. Basically, the step does some specific processing of the created objects and persists them to the database afterwards (= should be transactional, right?)
After the execution of my job, I can see that both steps have a commit_count of 1.
But I expected only the second step to have a commit_count of 1. The other one should have a commit_count of 0, right?
I know that, next to the business transactions, Spring does some transactional work of its own in order to persist the job and step execution metadata and so on. But I have read on the internet that this is technically not wrapped in a transaction, and thus I don't expect it to be included in the commit_count of a step, right?
In order to see the total number of committed transactions, I also tried to configure logging.level.org.springframework.transaction.interceptor: TRACE, but this just logs lots of the following statements:
Completing transaction for [org.springframework.batch.core.repository.support.SimpleJobRepository.update]
Getting transaction for [org.springframework.batch.core.repository.support.SimpleJobRepository.updateExecutionContext]
Completing transaction sounds to me like a transaction has been committed. I am seeing lots of statements like this in the logs, so does this mean my batch job is creating lots of transactions? I actually expected the batch job metadata updates to be enclosed in a single transaction.
Can someone please explain this? Thanks in advance!
A commit is made after every chunk. The chunk size of a Tasklet (non-chunk-based) step is always 1. That is the reason why the commit count is 1 for a Tasklet (non-chunk-based) step.
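In other words, for a chunk-oriented step the commit count grows with the number of chunks, while a Tasklet step commits exactly once. A rough sketch (builder-factory style; the reader and writer beans are assumed to exist elsewhere):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class CommitCountExample {

    // Chunk-oriented step: reading 1000 records with a commit interval of 100
    // produces roughly one commit per chunk, so a commit_count of about 10.
    @Bean
    public Step chunkStep(StepBuilderFactory steps,
                          ItemReader<String> reader, ItemWriter<String> writer) {
        return steps.get("chunkStep")
                .<String, String>chunk(100)
                .reader(reader)
                .writer(writer)
                .build();
    }

    // Tasklet step: the tasklet runs once inside a single transaction,
    // so its commit_count is always 1, even if it only reads data.
    @Bean
    public Step taskletStep(StepBuilderFactory steps) {
        return steps.get("taskletStep")
                .tasklet((contribution, chunkContext) -> RepeatStatus.FINISHED)
                .build();
    }
}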
1. Cron job starts.
2. Create Entity1 and save it to the DB.
3. Fetch the transaction entities from the DB.
4. Use these transactions as transactionIds:
for (Transaction id : transactionIds) {
    a. create Entity2 and save it to the DB
    b. fetch the paymentEntity from the DB
    c. response = POST request (REST API call)
    d. update Entity2 with the response
}
5. Update Entity1.
Problem statement: I am getting 5000+ transactions from the DB in transactionIds via the cron job, and they need to be processed as described above. With this approach, while the previous loop is still running, the next 5000+ transactions arrive, because the cron job runs every 2 minutes.
I have looked at multiple solutions (.parallelStream() with ForkJoinPool, ListenableFuture), but I am unable to decide which is the best way to scale the above code. Can I use Spring Batch for this? If yes, how? Which of the steps above would go into the reader, processor, and writer?
One way to approach this problem is to use Kafka for consuming the messages. You can increase the number of pods (hopefully you are using microservices), and each pod can be part of a consumer group. This effectively removes the loop in your code, and consumers can be scaled on demand to handle any load.
Another advantage of a message-based approach is that you get multiple delivery modes (at least once, at most once, etc.), and there are a lot of open-source libraries available to view the stats of a topic (the lag between consumption and production of messages).
If this is not possible,
The REST call should not happen for every transaction; you'll need to post the transactions as a batch. API calls are always expensive, so fewer round trips will make a huge difference in the time taken to complete the loop.
Instead of updating the DB inside the loop before and after each API call, you can collect the entities and persist them once after the loop:
repository.saveAll(yourEntityCollection) // only one DB call after the loop, and it can be batched
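As a rough sketch of that refactoring (Entity2, its repository, and the batch REST client below are placeholders for your own types and beans):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import org.springframework.data.repository.CrudRepository;

public class TransactionBatchProcessor {

    // Placeholder entity; your real Entity2 would carry JPA annotations etc.
    static class Entity2 {
        private final Long transactionId;
        private String response;

        Entity2(Long transactionId) { this.transactionId = transactionId; }
        void setResponse(String response) { this.response = response; }
    }

    interface Entity2Repository extends CrudRepository<Entity2, Long> { }

    private final Entity2Repository repository;
    private final Function<List<Entity2>, List<String>> batchRestCall;   // posts all entities in one request

    TransactionBatchProcessor(Entity2Repository repository,
                              Function<List<Entity2>, List<String>> batchRestCall) {
        this.repository = repository;
        this.batchRestCall = batchRestCall;
    }

    void process(List<Long> transactionIds) {
        // 1. Build all entities in memory instead of saving each one inside the loop.
        List<Entity2> entities = new ArrayList<>();
        for (Long id : transactionIds) {
            entities.add(new Entity2(id));
        }

        // 2. One batched REST call instead of one round trip per transaction.
        List<String> responses = batchRestCall.apply(entities);
        for (int i = 0; i < entities.size(); i++) {
            entities.get(i).setResponse(responses.get(i));
        }

        // 3. A single saveAll after the loop; the JPA provider can batch the
        //    resulting inserts (e.g. via hibernate.jdbc.batch_size).
        repository.saveAll(entities);
    }
}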
I suggest you move to a producer-consumer strategy in the near future.
I am working on an application with a high number of DML operations, due to which the 'log file sync' wait event is observed. We are using the Ebean framework for querying the Oracle database. I was looking for a way to reduce the number of commits. Is it advisable to use JDBC batching via the batch size attribute for transactional calls?
Is it advisable to use JDBC batching via the batch size attribute for transactional calls?
Assuming a transaction is inserting, updating or deleting more than one bean/row, then, in short, yes.
The caveat is that, in terms of application code, the actual execution of the DML can occur later: statements are flushed when the batch size is reached, at commit time, etc. This means statements can execute later than where they appear in the application code (for example, at commit time).
This typically only matters when the application code wants to handle exceptions such as constraint violations, missing foreign keys or unique constraint violations and then continue the transaction. In that case we might need to add an explicit transaction.flush() to the application code to ensure the statements have actually been executed against the database.
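For example, a minimal sketch using Ebean's programmatic transaction API (the batch size and bean types here are arbitrary):

import io.ebean.DB;
import io.ebean.Transaction;
import java.util.List;

public class EbeanBatchSave {

    public void saveAll(List<?> beans) {
        try (Transaction txn = DB.beginTransaction()) {
            txn.setBatchMode(true);   // use JDBC batch for this transaction
            txn.setBatchSize(100);    // flush the batch every 100 statements

            for (Object bean : beans) {
                DB.save(bean);        // buffered; executed at flush/commit time
            }

            // Optional: force execution now so constraint violations surface
            // here rather than at commit time.
            txn.flush();

            txn.commit();             // a single commit for the whole batch
        }
    }
}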
I am using Java's ExecutorService and I want to understand the design perspective.
If something goes wrong in one of the batches, what would be the best approach to handle it?
I am creating a fixed thread pool as:
ExecutorService pool = Executors.newFixedThreadPool(10);
Also, I am using invokeAll() to invoke all the Callables, which returns Future objects.
Here is my scenario:
I have 1000 records coming from an XML file that I want to save into the DB.
I split them into 10 batches, each containing 100 records.
The batches started processing (say batch1, batch2, batch3, ..., batch10), and let's say one of the batches (batch7) hit an error for a particular record while parsing it from the XML, so that record could not be saved into the DB.
So my question is: how can I handle this situation?
How can I get/store the failed batch information (batch7 above)?
I mean, if there is an error in any of the batches, should I stop all other batches?
Or where can I store the information about the failed batch, and how can I pick it up for further processing once the error is corrected?
The handler that contains the logic to process the records should have a variable that stores the batch number.
The handler should ideally have finite retry logic for a limited set of database errors.
Once the retry count is exhausted, it warrants human intervention: the handler should exit, throwing an exception along with the batch number. The executor should then ideally call shutdown(). If your logic demands stopping the process immediately, you should call shutdownNow() instead. Ideally your design should be resilient to such failures and let the other batches continue their work even if one fails. Hope this helps.
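A rough sketch of that idea (the record type and saveToDb are placeholders for your own parsing and persistence code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchRunner {

    static class BatchResult {
        final int batchNumber;
        final boolean success;
        final Exception error;

        BatchResult(int batchNumber, boolean success, Exception error) {
            this.batchNumber = batchNumber;
            this.success = success;
            this.error = error;
        }
    }

    static class BatchTask implements Callable<BatchResult> {
        private static final int MAX_RETRIES = 3;
        private final int batchNumber;
        private final List<String> records;

        BatchTask(int batchNumber, List<String> records) {
            this.batchNumber = batchNumber;
            this.records = records;
        }

        @Override
        public BatchResult call() {
            for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                try {
                    saveToDb(records);                        // placeholder persistence call
                    return new BatchResult(batchNumber, true, null);
                } catch (Exception e) {
                    if (attempt == MAX_RETRIES) {
                        // Retries exhausted: report the batch number so it can be
                        // stored and reprocessed once the data is corrected.
                        return new BatchResult(batchNumber, false, e);
                    }
                }
            }
            throw new IllegalStateException("unreachable");
        }

        private void saveToDb(List<String> batch) {
            // real JDBC/JPA code goes here
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);

        List<Callable<BatchResult>> tasks = new ArrayList<>();
        // build the 10 batches of 100 records each (parsing omitted in this sketch)

        List<Future<BatchResult>> futures = pool.invokeAll(tasks);
        for (Future<BatchResult> future : futures) {
            try {
                BatchResult result = future.get();
                if (!result.success) {
                    // e.g. write result.batchNumber to a "failed_batches" table or file
                    System.err.println("Batch " + result.batchNumber + " failed: " + result.error);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        pool.shutdown();   // other batches keep running; shutdownNow() would abort them
    }
}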
You should use CompletableFuture to do this.
Use CompletableFuture.runAsync(...) to start a process asynchronously; it returns a future. On this future, you can use the thenAccept(...) or thenRun(...) methods to do something when the process is complete.
There is also a method, exceptionally(...), to do something when an exception is thrown.
By default it uses a default executor (the common ForkJoinPool) to run the task asynchronously, but you can pass your own executor if necessary.
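A small sketch of that pattern (processBatch is a placeholder for your actual batch logic):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncBatchExample {

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(10);

        CompletableFuture<Void> future = CompletableFuture
                .runAsync(() -> processBatch(7), pool)        // run the batch on your own pool
                .thenRun(() -> System.out.println("batch 7 done"))
                .exceptionally(ex -> {                        // invoked if the batch processing failed
                    System.err.println("batch 7 failed: " + ex.getMessage());
                    return null;                              // swallow or record the failure here
                });

        future.join();                                        // wait for completion in this demo
        pool.shutdown();
    }

    private static void processBatch(int batchNumber) {
        // hypothetical: parse 100 records and save them to the DB
    }
}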
So my question is: how can I handle this situation?
It all depends on your requirements.
How can I get/store the failed batch information (batch7 above)?
You can store it in either a file or a database.
I mean, if there is an error in any of the batches, should I stop all other batches?
This depends on your business use case. If you have a requirement to stop batch processing even on a single batch failure, you have to stop the subsequent batches. Otherwise you can continue with the next set of batches.
Or where can I store the information about the failed batch, and how can I pick it up for further processing once the error is corrected?
This also depends on your requirements and design. You may have to inform the source about the problematic XML file so that they can correct it and send it back to you. Once you receive the new copy, you have to push the new file for processing. That can be manual or automated, depending on your design.
So I have a simple batch job with just one step, which consists of a MongoItemReader to read objects from MongoDB, a custom item processor (which for now just sets an 'isProcessed' boolean flag to true), and a MongoItemWriter.
The thing is, I want to be able to back up my jobs to a DB whenever they fail (for cases like server downtime), so I have implemented Mongo documents that basically store the JobExecution, StepExecution, JobInstance, and ExecutionContext objects. They seem to recreate their respective objects correctly, since I am able to use them to restart a job (after adding them to the job repository), but the jobs restart from the very beginning when I want them to start from where they left off.
So I'm wondering what I'm missing. Where exactly does a failed job store data about when/where it failed? I thought the readCount, readSkipCount, processSkipCount, etc. variables would have something to do with it, but those are already included in my StepExecution document (along with everything else the StepExecution class has a 'get' method for). I then thought maybe it was the execution context, but that was empty for both the job and its one step.
When a job is restarted, stateful components (those that implement ItemStream) receive the step's ExecutionContext during the open call, allowing them to reset their state based on the last run. It is then up to the component to reset that state, as well as to maintain it during the calls to ItemStream#update as processing occurs.
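For illustration, a minimal sketch of such a stateful component (the key name and the in-memory item source are made up):

import java.util.List;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemStreamReader;

// A reader that keeps its position in the ExecutionContext so a restarted
// step can resume where the failed execution left off.
public class ResumableListReader implements ItemStreamReader<String> {

    private static final String INDEX_KEY = "resumable.reader.index";  // placeholder key name

    private final List<String> items;
    private int currentIndex;

    public ResumableListReader(List<String> items) {
        this.items = items;
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        // On restart, the step's last persisted ExecutionContext is passed in here.
        currentIndex = executionContext.containsKey(INDEX_KEY)
                ? executionContext.getInt(INDEX_KEY)
                : 0;
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        // Called before each chunk commit; the framework persists this context,
        // so the stored value reflects the last successfully committed position.
        executionContext.putInt(INDEX_KEY, currentIndex);
    }

    @Override
    public void close() throws ItemStreamException {
        // nothing to release in this sketch
    }

    @Override
    public String read() {
        return currentIndex < items.size() ? items.get(currentIndex++) : null;
    }
}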
So, based on the above, a failed job doesn't persist its state when it fails... it is actually persisting it all along as it succeeds. That way, when it does fail, things can be rolled back to the last successful point. Which leads me to...
Mongo isn't transactional. Are you sure the state is being persisted correctly? We don't have a Mongo-based job repository for this reason...