Spring Batch Invocation - java

We need to allow users to import a huge catalog into the application. How do we achieve this with Spring Batch, given that the Job is a singleton in Spring Batch? How do we tweak it so that we can invoke the same job any number of times with thread safety? We are fine with synchronous processing and are not looking for async. I appreciate your input.

Even though the job configuration is a singleton, each job instance is created from the job configuration as a new object by the job launcher, so you should have no problems with concurrency.

It sounds like multiple updates are going to be happening in an unsafe way to your database. E.g. if you have table 1 row 1 being updated by Job1 and another user kicks off Job2 there's no guarantee what values you'll end up with in row 1. I wouldn't be concerned about thread safety so much as row level concurrency safety. Typically if you only want a single import to run at a time the solution is not something like Spring, but a database specific import tool.
UPDATE:
See this SO answer for how to customize Spring Batch to allow only one job to run at a time. Note that this has nothing to do with thread safety. This is not how Spring Batch is typically used, which is why it isn't listed as a normal use case in their docs.
Spring batch restrict single instance of job only

Related

Executorservice exception handling in java

I am using the ExecutorService feature of Java and want to understand the design perspective.
If something goes wrong in one of the batches, what is the best approach to handle it?
I am creating a fixed thread pool as follows:
ExecutorService pool = Executors.newFixedThreadPool(10);
I am also using invokeAll() to invoke all the Callables, which returns Future objects.
Here is my scenario:
I have 1000 records coming from an XML file that I want to save into the DB.
I created 10 batches, each containing 100 records.
The batches start processing (say batch1, batch2, batch3 ... batch10), and let's say one batch (batch7) hits an error on a particular record while parsing it from the XML, so it cannot be saved into the DB.
So my question is: how can I handle this situation?
How can I get/store the failed batch information (batch7 above)?
I mean, if there is an error in any batch, should I stop all the other batches?
Or where can I store information about the failed batch, and how can I pick it up for further processing once the error is corrected?
The handler that contains the record-processing logic should have a variable that stores the batch number.
The handler should ideally have finite retry logic for a small set of database errors.
Once the retry count is exhausted, the failure warrants human intervention, and the handler should exit by throwing an exception that carries the batch number. The executor should then normally call shutdown(). If your logic demands that processing stop immediately, call shutdownNow() instead. Ideally your design should be resilient to such failures and let the other batches continue their work even if one fails. Hope that helps.
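A minimal sketch of that advice, using only java.util.concurrent. The names BatchRunner, BatchWork and BatchFailedException are made up for illustration, and the work passed in stands in for the real "parse 100 records and save them" step. Each Callable knows its batch number, retries a bounded number of times, and throws an exception carrying the batch number when it gives up; the caller collects the failed batch numbers while the other batches run to completion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchRunner {

    /** Hypothetical exception type: carries the batch number out of the worker. */
    static class BatchFailedException extends Exception {
        final int batchNumber;

        BatchFailedException(int batchNumber, Throwable cause) {
            super("Batch " + batchNumber + " failed", cause);
            this.batchNumber = batchNumber;
        }
    }

    /** Hypothetical stand-in for "parse the records of one batch and save them to the DB". */
    interface BatchWork {
        void process(int batchNumber) throws Exception;
    }

    /** Runs all batches, lets the healthy ones finish, and returns the numbers of the failed ones. */
    static List<Integer> runBatches(int batchCount, int maxAttempts, BatchWork work) {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 1; i <= batchCount; i++) {
                final int batchNumber = i;
                tasks.add(() -> {
                    Exception last = null;
                    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                        try {
                            work.process(batchNumber);
                            return batchNumber;
                        } catch (Exception e) {
                            last = e; // retry until the attempts are exhausted
                        }
                    }
                    throw new BatchFailedException(batchNumber, last);
                });
            }

            List<Integer> failed = new ArrayList<>();
            // invokeAll blocks until every task has either returned or thrown
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                try {
                    future.get();
                } catch (ExecutionException e) {
                    if (e.getCause() instanceof BatchFailedException) {
                        failed.add(((BatchFailedException) e.getCause()).batchNumber);
                    } else {
                        throw new IllegalStateException(e.getCause());
                    }
                }
            }
            return failed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for batches", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Simulate the scenario from the question: batch 7 contains a bad record.
        List<Integer> failed = runBatches(10, 3, n -> {
            if (n == 7) {
                throw new IllegalStateException("bad record in batch " + n);
            }
        });
        System.out.println("Failed batches: " + failed);
    }
}
```

The returned list is what you would persist (file or table) so the failed batches can be re-run after the data is corrected. If a single failure must stop everything, you would call shutdownNow() instead of letting invokeAll finish.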
You could use CompletableFuture for this.
Use CompletableFuture.runAsync() to start a task asynchronously; it returns a future. On this future, you can use the thenAccept(..) or thenRun(..) methods to do something when the task is complete.
There is also a method, exceptionally(..), to do something when an exception is thrown.
By default, it runs the task on the common ForkJoinPool, but you can pass your own executor if necessary.
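Roughly, that could look like the sketch below. The class name AsyncBatch, the submit helper and the failed queue are hypothetical; the Runnable stands in for the real batch processing. thenRun fires only on success, and exceptionally records the batch number when the work throws.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncBatch {

    /** Starts one batch on the given executor and records its number in 'failed' if it throws. */
    static CompletableFuture<Void> submit(int batchNumber, Runnable work,
                                          Executor pool, Queue<Integer> failed) {
        return CompletableFuture
                .runAsync(work, pool)
                .thenRun(() -> System.out.println("batch " + batchNumber + " saved"))
                .exceptionally(ex -> {
                    failed.add(batchNumber); // remember which batch to re-process later
                    return null;
                });
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        Queue<Integer> failed = new ConcurrentLinkedQueue<>();
        try {
            CompletableFuture<?>[] all = new CompletableFuture<?>[3];
            for (int i = 0; i < all.length; i++) {
                final int batchNumber = i + 1;
                all[i] = submit(batchNumber, () -> {
                    if (batchNumber == 2) {
                        throw new IllegalStateException("bad record");
                    }
                }, pool, failed);
            }
            // Wait for every batch; failures were already absorbed by exceptionally(..)
            CompletableFuture.allOf(all).join();
        } finally {
            pool.shutdown();
        }
        System.out.println("Failed batches: " + failed);
    }
}
```

Because exceptionally(..) turns the failure into a normal completion, allOf(..).join() does not throw here; the failed queue is the only record of what went wrong.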
So my question is: how can I handle this situation?
It all depends on your requirements.
How can I get/store the failed batch information (batch7 above)?
You can store it either in a file or in a database.
I mean, if there is an error in any batch, should I stop all the other batches?
This depends on your business use case. If you are required to stop batch processing after even a single batch failure, you have to stop the remaining batches. Otherwise you can continue with the next set of batches.
Or where can I store information about the failed batch, and how can I pick it up for further processing once the error is corrected?
This also depends on your requirements and design. You may have to inform the source about the problematic XML file so that they can correct it and send it back to you. Once you receive the new copy, you have to submit the new file for processing. That step can be manual or automated, depending on your design.

Multi Threading with datastax java driver 2.0

My data model is based on time series (it inserts feeds from various sources into Cassandra CFs). Can anyone suggest how to do inserts with multi-threading? Is executing a query with the executeAsync method similar to multi-threading? Is there any property in cassandra.yaml that I need to set to achieve multi-threading, or are there any other prerequisites?
The driver is safe for multi-threaded use. What you will typically do is build your Cluster and get a Session instance during application startup, and then share the Session among all threads.
How you handle multi-threading is specific to your code. I don't know SQS either, but I imagine you'd either have multiple consumers that poll from the queue and process the messages themselves, or maybe dispatch the messages to a pool of workers.
Regarding executeAsync, the returned ResultSetFuture implements Guava's ListenableFuture, so you can register a success callback with addListener. But you'll have to provide an Executor to run that callback on (I don't recommend MoreExecutors#sameThreadExecutor as mentioned in the Javadoc, because your callback would end up running on one of the driver's I/O threads).
As mentioned by Carlo, a simple approach is to use the synchronous execute, and have your worker block until it gets a response from Cassandra, and then acknowledge the message.
executeAsync() sends the statement without blocking the calling thread and immediately returns control to the caller; the returned Future<ResultSet> will hold your result. When working with this approach you won't know whether an exception occurred until you inspect the Future.
In Cassandra you don't have to set anything. Just keep the number of threads in your application under control and initialize the Java driver properly, providing a PoolingOptions object that matches your needs.
HTH, Carlo
If you are executing queries in a multi-threaded environment, make sure you wait for each executeAsync(statement) call to complete.
session.executeAsync(statement) returns immediately; at that point there is no guarantee that the query was valid or that it succeeded. So if you're using a thread pool, always use
ResultSetFuture future = session.executeAsync(statement);
future.getUninterruptibly();
This blocks until the query has completed (and throws if it failed), so the number of in-flight requests, and the memory they hold, stays bounded by the size of your thread pool.

How to store a single instance object in the AppEngine datastore

I need to create and store a single instance of an object in the AppEngine datastore (there will never need to be more than one object).
It is the last run time for a cron job that I am scheduling.
The idea is that the cron job will only pick up those rows that have been created/updated since its last run, and will update the last run time after it has completed.
What is the best way to do this considering concurrency issues as well - in case a previous job has not finished running?
If I understand your question correctly, it sounds like you could just create a 'job bookkeeping' entity that records whether a job is currently running, along with any necessary state about what you are processing with the job.
Then, access that bookkeeping entity using a transaction, so that only one process can do a read + update on it at a time. That will let you safely check whether another job is still running before starting a new job.
(The datastore is non-relational, so I am guessing with your mention of 'rows', you instead mean entities of some Kind that you need to process? Your bookkeeping entity could store some state about which of these entities you'd processed so far, that would let you query for new ones to process).
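To make the bookkeeping logic concrete, here is a plain-Java sketch with no App Engine API in it. JobBookkeeping is a hypothetical in-memory stand-in for the bookkeeping entity, and the synchronized methods play the role that the datastore transaction would play in a real deployment (a real cron job running on several instances needs the transaction, not synchronized).

```java
import java.time.Instant;
import java.util.Optional;

class JobBookkeeping {
    // In a real application these two fields would live on the single bookkeeping entity.
    private boolean running = false;
    private Instant lastRun = Instant.EPOCH;

    /** Returns the previous run time if the job may start, or empty if another run is still active. */
    synchronized Optional<Instant> tryStart() {
        if (running) {
            return Optional.empty();
        }
        running = true;
        return Optional.of(lastRun);
    }

    /** Stores the start time of the run that just completed and releases the running flag. */
    synchronized void finish(Instant runStartedAt) {
        lastRun = runStartedAt;
        running = false;
    }

    public static void main(String[] args) {
        JobBookkeeping book = new JobBookkeeping();

        Instant startedAt = Instant.now();
        Optional<Instant> since = book.tryStart();
        System.out.println("First run may start: " + since.isPresent());

        // A second trigger fires while the first run is still busy.
        System.out.println("Overlapping run may start: " + book.tryStart().isPresent());

        // ... query entities updated after since.get() and process them ...
        book.finish(startedAt);
        System.out.println("Next run may start: " + book.tryStart().isPresent());
    }
}
```

Storing the time the run started, rather than the time it finished, is a choice made in this sketch so that entities updated while the job was running are picked up by the next run.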

Creating Quartz Scheduler instances at run time

I am working on an application where we have hundreds of jobs that need to be scheduled for execution.
Here is my sample quartz.properties file:
org.quartz.scheduler.instanceName=QuartzScheduler
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.threadPool.threadCount=7
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.MSSQLDelegate
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.dataSource = myDS
org.quartz.dataSource.myDS.driver=com.mysql.jdbc.Driver
org.quartz.dataSource.myDS.URL=jdbc:mysql://localhost:3306/quartz
org.quartz.dataSource.myDS.user=root
org.quartz.dataSource.myDS.password=root
org.quartz.dataSource.myDS.maxConnections=5
Though this is working fine, we are planning to separate the jobs into different groups so that they are easier to maintain.
Groups will be unique, and we want a new scheduler instance to be created whenever a user (Admin) creates a new group; all jobs within that group should then be handled by that scheduler instance.
This means that if the Admin creates a new group, say NewProductNotification, then we should be able to create a scheduler instance with the same name, NewProductNotification, and all jobs that are part of the NewProductNotification group should be handled by the NewProductNotification scheduler instance.
How is this possible, and how can we store this information in the database so that the next time the server starts Quartz knows about all the scheduler instances? Or do we need to add the information about each new instance to the properties file?
As the properties file above shows, we are using the JDBC job store to handle everything through the database.
I don't think dynamically creating schedulers is a good design approach in Quartz. You can share the same database tables among multiple schedulers (job details and triggers have the scheduler name as part of their primary key), but a Scheduler is fairly heavyweight.
Can you explain why you really need separate schedulers? Maybe you can simply use job groups and trigger groups (you are in fact already using the term group) to distinguish jobs/groups? You can also use a different priority for each trigger.
As a side note:
I'm using JobStoreCMT and I'm seeing deadlocks, what can I do?
Make sure you have at least number-of-threads-in-thread-pool + 2 connections in your datasources.
Your configuration has 7 threads but only 5 connections. Reverse the two values (5 threads, 7 connections) and it will satisfy that rule:
org.quartz.threadPool.threadCount=7
org.quartz.dataSource.myDS.maxConnections=5
From: I'm using JobStoreCMT and I'm seeing deadlocks, what can I do?
Dynamically creating schedules is very much possible. You would need to create objects of JobDetail and Trigger and pass to the SchedulerFactoryBean object. It will take care of the rest.

Two threads reading from the same table: how do I make both threads not read the same set of data from the TASKS table

I have a task thread running in each of two separate instances of Tomcat.
The task threads concurrently read the TASKS table (using a SELECT with a certain WHERE condition) and then do some processing.
The issue is that sometimes both threads pick the same task, so the task is executed twice.
My question is: how do I make sure both threads do not read the same set of data from the TASKS table?
It is just because your DAO function (the code that accesses the database) is not synchronized. Make it synchronized and I think your problem will be solved.
If the TASKS table you mention is a database table then I would use Transaction isolation.
As a suggestion: within a transaction, set an attribute of the TASKS table to some uniquely identifiable value if it is not already set. Commit the transaction. If all is OK, then the task has been selected by the thread.
I haven't come across this use case, so treat my suggestion with caution.
I think you should look at how an enterprise job scheduler, for example Quartz, handles this.
For your use case there is a better tool for the job - and that's messaging. You are persisting items that need to be worked on, and then attempting to synchronise access between workers. There are a number of issues that you would need to resolve in making this work - in general updating a table and selecting from it should not be mixed (it locks), so storing state there doesn't work; neither would synchronization in your Java code, as that wouldn't survive a server restart.
Using the JMS API with a message broker like ActiveMQ, you would publish a message to a queue. This message would contain the details of the task to be executed. The message broker would persist this somewhere (either in its own message store, or a database). Worker threads would then subscribe to the queue on the message broker, and each message would only be handed off to one of them. This is quite a powerful model, as you can have hundreds of message consumers all acting on tasks so it scales nicely. You can also make this as resilient as it needs to be, so tasks can survive both Tomcat and broker restarts.
Whether the database can provide graceful management of this will depend largely on whether it is using strict two-phase locking (S2PL) or multi-version concurrency control (MVCC) techniques to manage concurrency. Under MVCC reads don't block writes, and vice versa, so it is very possible to manage this with relatively simple logic. Under S2PL you would spend too much time blocking for the database to be a good mechanism for managing this, so you would probably want to look at external mechanisms. Of course, an external mechanism can work regardless of the database, it's just not really necessary with MVCC.
Databases using MVCC are PostgreSQL, Oracle, MS SQL Server (in certain configurations), InnoDB (except at the SERIALIZABLE isolation level), and probably many others. (These are the ones I know of off-hand.)
I didn't pick up any clues in the question as to which database product you are using, but if it is PostgreSQL you might want to consider using advisory locks. http://www.postgresql.org/docs/current/interactive/explicit-locking.html#ADVISORY-LOCKS I suspect many of the other products have some similar mechanism.
I think you need to have a column where you keep the last modified date of each row. Your threads can then read the same set of data, limited by the same modified date.
Edit:
I did not see "not to read".
In this case you need another table, TaskExecutor (taskId, executorId). When a thread runs a task, it puts a row into TaskExecutor; when another thread starts, it just checks whether the task is already being executed (Select ... from TaskExecutor where taskId = ...).
You also need to take care of the isolation level for transactions.
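The check-then-insert on TaskExecutor only works if it is atomic: both workers can see "no row yet" and then both insert. Below is an in-memory sketch of the claim step. It is an analogy, not database code: the class TaskClaims is hypothetical, the map stands in for the TaskExecutor table with a unique key on taskId, and putIfAbsent stands in for an insert that fails on a duplicate key. Two separate Tomcat instances would of course need the database (or a broker) to provide this atomicity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TaskClaims {
    // Stand-in for the TaskExecutor(taskId, executorId) table with a unique key on taskId.
    private final Map<Long, String> executorByTask = new ConcurrentHashMap<>();

    /** Returns true only for the first executor that claims the task; later claims are rejected. */
    boolean tryClaim(long taskId, String executorId) {
        return executorByTask.putIfAbsent(taskId, executorId) == null;
    }

    /** Lets two workers compete for the same tasks and returns how many tasks each one processed. */
    static int[] race(int taskCount) {
        TaskClaims claims = new TaskClaims();
        int[] processed = new int[2];
        Thread[] workers = new Thread[2];
        for (int w = 0; w < workers.length; w++) {
            final int id = w;
            workers[w] = new Thread(() -> {
                for (long task = 1; task <= taskCount; task++) {
                    if (claims.tryClaim(task, "tomcat-" + id)) {
                        processed[id]++; // only the winner of the claim runs the task
                    }
                }
            });
            workers[w].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        int[] processed = race(100);
        // The split between the two workers varies from run to run; the total does not.
        System.out.println("Total tasks processed: " + (processed[0] + processed[1]));
    }
}
```

However the 100 tasks are split between the two workers, each task is processed exactly once, which is the property the question asks for.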
