Spring Batch - Clustered Environment - Failover mechanism - Java

Question: What is the failover strategy that Spring Batch supports best? The focus has to be on resource usage and the failover mechanism. Any suggestions?
Use case: Spring Batch has to read a file (which will be put on the server by another application) and process it.
The environment is clustered, so multiple server instances could trigger the batch jobs and try to read the same file on arrival.
My thoughts: Polling can be done to check for the arrival of the file and then call the Spring Batch job. Since the environment is clustered, we could use an active/passive strategy for the polling. Other strategies such as round-robin or time slicing could also be used.
Pardon me if I am not clear; I can explain anything that is unclear.

As I understand from here:
http://static.springsource.org/spring-batch/reference/html/scalability.html
the better approach would be to have just one poller and then distribute the job across the cluster through one of the mechanisms provided by Spring Batch (I think the one named Remote Chunking is the best choice here).
I don't think you should worry about the clustering strategy, as this is handled either by Spring Batch or by other cluster distribution mechanisms.
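For illustration, here is a minimal sketch of the single-poller idea, assuming Spring's @Scheduled support is enabled; the file path and job bean name are hypothetical:

    import java.io.File;

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobParameters;
    import org.springframework.batch.core.JobParametersBuilder;
    import org.springframework.batch.core.launch.JobLauncher;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Component
    public class FileArrivalPoller {

        @Autowired
        private JobLauncher jobLauncher;

        @Autowired
        private Job fileProcessingJob; // hypothetical job bean

        // In an active/passive setup this would run only on the active node;
        // the election mechanism itself is not shown here.
        @Scheduled(fixedDelay = 30000)
        public void poll() throws Exception {
            File file = new File("/shared/inbox/input.dat"); // hypothetical path
            if (file.exists()) {
                JobParameters params = new JobParametersBuilder()
                        .addString("input.file", file.getAbsolutePath())
                        .addLong("file.timestamp", file.lastModified())
                        .toJobParameters();
                jobLauncher.run(fileProcessingJob, params);
            }
        }
    }

Using the file's path and timestamp as identifying job parameters also gives you some idempotency for free: Spring Batch refuses to re-run a job instance that already completed with the same parameters.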

Related

Centralized batch job management in Java

Currently, I have an idea for building a centralized batch job management system (I temporarily call it the batch service).
We own a microservice system, and the batch jobs are scattered across the services (including Oracle's batch jobs), so I intend to set up a batch job management system.
But there is one problem: in a microservice system there are many databases, so I want the manipulation of data to be done by the other services, while the batch service only handles settings, scheduling, status checks, logs, start, stop, and retry.
My idea is to use a message broker (Kafka, RabbitMQ, ...) to pass job requests from the batch service to the other services, roughly as sketched below. But I have not yet found a solution for stopping jobs or saving job logs on the batch service.
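Roughly what I imagine, as a sketch (assuming Kafka via Spring Kafka; the topic name and payload format are made up by me):

    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.stereotype.Service;

    @Service
    public class JobCommandPublisher {

        private final KafkaTemplate<String, String> kafkaTemplate;

        public JobCommandPublisher(KafkaTemplate<String, String> kafkaTemplate) {
            this.kafkaTemplate = kafkaTemplate;
        }

        // The batch service only publishes a command; the service that owns
        // the data consumes it and runs the actual job against its own DB.
        public void requestStart(String jobName) {
            kafkaTemplate.send("batch.job.commands", jobName,
                    "{\"job\":\"" + jobName + "\",\"action\":\"START\"}");
        }
    }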
Is this idea feasible, and if so, can you give me some advice on deployment technologies? (We are deploying using Spring Boot at the moment.)
Thanks for taking the time to read ^^.

Should I use the Spring Data Flow server for my new Spring Batch jobs?

I have a requirement to create around 10 Spring Batch jobs, each consisting of a reader and a writer. All readers read data from one Oracle DB and write into a different Oracle DB (the source and destination servers are different). The jobs are implemented using Spring Boot, and all 10+ jobs would be packaged into a single JAR file. So far, so good.
Now the client also wants a UI to monitor job status and act as a job organizer. I went through the Spring Data Flow Server documentation for the UI requirement, but I'm not sure whether it will serve the purpose, or whether there is another option available for monitoring job status and stopping and starting jobs from the UI whenever required.
Also, how could I separate the 10+ jobs inside a single JAR in the Spring Data Flow Server, if it's the only option for a UI?
Thanks in advance.
I don't have the reputation to add a comment, so I am posting an answer here, although I know this is not the ideal way to share a reference link as an answer.
This might help you:
spring-batch-job-monitoring-with-angular-front-end-real-time-progress-bar
Observability of Spring Batch jobs comes from the metadata the framework persists in a relational database: job instances, executions, timestamps, read counts, write counts, and so on.
You have different ways to exploit this data: a SQL client, JMX, the Spring Batch API (JobExplorer, JobOperator), or Spring Batch Admin (deprecated in favor of the Cloud Data Flow server).
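For example, a minimal sketch of reading execution status through the JobExplorer API (the job name is a placeholder):

    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.JobInstance;
    import org.springframework.batch.core.explore.JobExplorer;

    public class JobStatusReader {

        private final JobExplorer jobExplorer;

        public JobStatusReader(JobExplorer jobExplorer) {
            this.jobExplorer = jobExplorer;
        }

        // Prints the status of the executions of the five most recent
        // instances of the given job, straight from the metadata tables.
        public void printRecentExecutions(String jobName) {
            for (JobInstance instance : jobExplorer.getJobInstances(jobName, 0, 5)) {
                for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
                    System.out.printf("%s #%d: %s (started %s)%n",
                            jobName, execution.getId(),
                            execution.getStatus(), execution.getStartTime());
                }
            }
        }
    }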
Data Flow is an orchestrator that lets you execute data pipelines with streams and tasks (finite, short-lived, monitored services). For your jobs, you could imagine wrapping each job in a task and creating a multi-task pipeline. Data Flow gives you the status of each execution.
You can also expose your monitoring data by pushing it as metrics into an InfluxDB instance, for example.

WebSphere Application Server scheduler or Java scheduler for inserts

I am working on an application deployed on WebSphere Application Server 8.0. This application inserts records into one table and obtains its data source via JNDI lookup.
I need to create a batch job that will read data from the above table and insert it into another table continuously, at a fixed interval. It will be deployed on the same WAS server and use the same JNDI lookup for the data source.
I read on the internet that WebSphere Application Server scheduling is an option, implemented using EJBs and session beans.
I also read about the JDK's ScheduledThreadPoolExecutor. I could create a WAR with a ScheduledThreadPoolExecutor implementation and deploy it on WAS for this.
I tried to find the differences between these two in terms of usage, complexity, performance, and maintainability, but could not find a clear comparison.
Please help me decide which approach is better for creating the scheduler for the insert batch jobs, and why. And if the WAS scheduler is better, please provide a link for creating and deploying it.
Thanks!
Some major differences between the WAS Scheduler and Java SE's ScheduledThreadPoolExecutor are that the WAS Scheduler is transactional (task execution can roll back or commit), persistent (tasks are stored in a database), and can coordinate across members of a cluster (such that tasks can be scheduled from any member but only run on one member).
ScheduledThreadPoolExecutor is a much lighter-weight approach because it has none of these capabilities and does all of its scheduling within a single JVM. Task executions neither roll back nor retry, and they are not kept externally in a database in case the server goes down.
It should be noted that WebSphere Application Server also has the CommonJ TimerManager (and AlarmManager via WorkManager), which are more similar to what you get with ScheduledThreadPoolExecutor, if that is what you want. In that case, the application server still manages the threads and ensures that the context of the scheduling thread is available on the thread of execution.
Hope this helps with your decision.
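To make the difference concrete, here is a minimal sketch of the ScheduledThreadPoolExecutor approach with a JNDI data source; the JNDI name and SQL are placeholders:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import javax.naming.InitialContext;
    import javax.sql.DataSource;

    public class InsertScheduler {

        public void start() throws Exception {
            // Same data source the application already uses via JNDI
            // (the JNDI name is a placeholder).
            DataSource ds = (DataSource) new InitialContext()
                    .lookup("jdbc/MyDataSource");

            ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
            executor.scheduleAtFixedRate(() -> {
                try (Connection con = ds.getConnection();
                     PreparedStatement ps = con.prepareStatement(
                             "INSERT INTO target_table SELECT * FROM source_table")) {
                    ps.executeUpdate(); // placeholder SQL
                } catch (Exception e) {
                    e.printStackTrace(); // a failed run is simply lost: no retry
                }
            }, 0, 15, TimeUnit.MINUTES);
        }
    }

Note that nothing here is transactional, persistent, or cluster-aware: if the server goes down, or if two cluster members deploy the same WAR, nothing coordinates or recovers these executions. That is exactly the gap the WAS Scheduler fills.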

Scheduled Jobs in Spring using Akka

I am trying to determine the best way to handle long-running batch jobs in Spring MVC. I came across Akka in my searching as a non-blocking framework for async processing, which is preferred because I don't want the batch processing to eat up all the threads from the thread pool.
Essentially, I will have a job that needs to run on some set schedule, which will go out and call various web services, process the data, and persist it.
I have seen some code examples of using Akka with Spring, but I've never seen it used with a cron-type scheduler; it always seems to use a fixed time period.
I'm not sure this is even the best approach to handling large-scale batch processing within Spring. Any suggestions or links to good Akka/Spring resources are welcome.
I would suggest you look into the Spring Integration and Spring Batch projects. The first one allows you to configure chains of services using EIP (Enterprise Integration Patterns). We used it in our project to fetch files from FTP, deserialize and process them, import them into a DB, send emails if required, etc., all on a schedule. The second one is more straightforward and basically provides a framework for working on rows of data. Both are configurable with Quartz and integrate nicely into a Spring MVC project.
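For the cron requirement specifically, plain Spring scheduling already supports cron expressions (no Akka needed); a minimal sketch, with the job body as a placeholder:

    import org.springframework.context.annotation.Configuration;
    import org.springframework.scheduling.annotation.EnableScheduling;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Configuration
    @EnableScheduling
    class SchedulingConfig {
    }

    @Component
    class NightlyImportJob {

        // A real cron expression, not just a fixed period:
        // every day at 02:30.
        @Scheduled(cron = "0 30 2 * * *")
        public void run() {
            // call the web services, process the data, persist it (placeholder)
            System.out.println("Running nightly import...");
        }
    }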

When should I use the JDBC Persistence Adapter in ActiveMQ?

Reading the ActiveMQ documentation (we are using the 5.3 release), I find a section about the possibility of using a JDBC persistence adapter with ActiveMQ.
What are the benefits? Does it provide any gain in performance or reliability? When should I use it?
In my opinion, you would use JDBC persistence if you wanted to have a failover broker and you could not use the file system. The JDBC persistence was significantly slower (during our tests) than journaling to the file system. For a single broker, the journaled file system is best.
If you are running two brokers in an active/passive failover, the two brokers must have access to the same journal / data store so that the passive broker can detect and take over if the primary fails. If you are using the journaled file system, then the files must be on a shared network drive of some sort, using NFS, WinShare, iSCSI, etc. This usually requires a higher-end NAS device if you want to eliminate the file share as a single point of failure.
The other option is that you can point both brokers to the database, which most applications already have access to. The tradeoff is usually simplicity at the price of performance, as the journaled JDBC persistence was slower in our tests.
We run ActiveMQ in an active/passive broker pair with journaled persistence via an NFS mount to a dedicated NAS device, and it works very well for us. We are able to process over 600 msgs/sec through our system with no issues.
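For illustration, a minimal sketch of pointing a broker at a JDBC store using the embedded BrokerService API (the data source details are placeholders; the same thing is usually configured in activemq.xml):

    import org.apache.activemq.broker.BrokerService;
    import org.apache.activemq.store.jdbc.JDBCPersistenceAdapter;
    import org.apache.commons.dbcp.BasicDataSource;

    public class JdbcBrokerExample {

        public static void main(String[] args) throws Exception {
            // Shared database that both the active and passive broker point at
            // (driver, URL, and credentials are placeholders).
            BasicDataSource ds = new BasicDataSource();
            ds.setDriverClassName("oracle.jdbc.OracleDriver");
            ds.setUrl("jdbc:oracle:thin:@dbhost:1521:AMQ");
            ds.setUsername("activemq");
            ds.setPassword("secret");

            JDBCPersistenceAdapter adapter = new JDBCPersistenceAdapter();
            adapter.setDataSource(ds);

            BrokerService broker = new BrokerService();
            broker.setBrokerName("jdbcBroker");
            broker.setPersistenceAdapter(adapter);
            broker.start();
        }
    }

With both brokers configured this way against the same database, the first one to acquire the database lock becomes the active broker and the other blocks until it can take over.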
The use of journaled JDBC seems better than plain JDBC persistence, since journaling is much faster than JDBC persistence alone. It is also better than journaling alone because you have an additional backup of the messages in the DB. Journaled JDBC has the further advantage that the same data in the journal is later persisted to the DB, where developers can access it when needed!
However, when you are using a master/slave ActiveMQ topology with journaled JDBC, you might end up losing messages, since there may be messages in the journal that have not yet made it into the DB!
If you have a redelivery plugin policy in place and use a master/slave setup, the scheduler is used for the redelivery.
As of today, the scheduler can only be set up on a file database, not on JDBC. If you do not pay attention to that, you will take all messages that are in redelivery out of the HA scenario and make them local to the broker.
https://issues.apache.org/jira/browse/AMQ-5238 is an issue in the Apache issue tracker asking for a JDBC persistence adapter for the scheduler DB. You can vote for it to make it happen.
Actually, even in the top AMQ HA solution, LevelDB + ZooKeeper, the scheduler is taken out of the game and is documented to create issues (http://activemq.apache.org/replicated-leveldb-store.html, at the end of the page).
In a JDBC scenario, it can therefore be considered unsafe and unsupported, or at least not clearly documented, how to set up the datastore for the redelivery policy.
