I'm trying to wrap my head around Spring Batch, and while many tutorials show great examples of code, I feel like I'm missing how the "Spring Batch engine" works.
Scenario 1 - On user creation, create user at external service.
Web request
CreateLocalUser()
launch job CreateExternalUser()
CreateExternalUser() can fail because of many reasons, so we want to be able to retry and log errors, which Spring Batch can do for us. Also it's a decoupled process that has nothing to do with the creation of our local user.
Where does the job run? Will it run in the same thread as the web request, which means the end user will have to wait for the job to finish before getting http status 200?
Imagine I have a Web server and a Batch server. I want all jobs to run on the Batch server, but the jobs themselves can be initiated from the Web server. Can Spring Batch do this? Do I need some kind of queue that I can write to from the Web server and consume from the Batch server, where the actual job will begin?
Scenario 2 - Process lines in huge file, start new job for each line
Read lines in a huge file (1,000,000 lines)
Start new job for each line using input parameters from the file.
Processing the 1,000,000 lines is quick, and the 1,000,000 new jobs will more or less be started at the same time. Where do these run? Do they run async to the initial job? Will my server be able to handle running all of these more or less at the same time?
Additional question:
Is it possible to query jobs based on a job input parameter? E.g. in Scenario 1, I want to show the CreateExternalUser job status/error when viewing my local user with ID 1234 on my web page. The CreateExternalUser job has the input parameter userId: 1234.
You have a few questions here, so let's go through them one at a time:
Where does the job run? Will it run in the same thread as the web request, which means the end user will have to wait for the job to finish before getting http status 200?
That depends on your configuration. If you use the defaults, then yes. The job would run in the same thread and the user would be forced to wait until the job completes in order to get the 200. This obviously isn't a good idea...
Which is why Spring Batch's SimpleJobLauncher allows you to inject a TaskExecutor. By configuring your JobLauncher to use an async TaskExecutor implementation (ThreadPoolTaskExecutor for example), the job would be executed in a different thread, allowing the controller's processing to complete.
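As a sketch, that configuration might look like this (assumes Spring Batch 4.x; the bean name and pool size are illustrative):

```java
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.launch.support.SimpleJobLauncher;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class AsyncLauncherConfig {

    @Bean
    public JobLauncher asyncJobLauncher(JobRepository jobRepository) throws Exception {
        SimpleJobLauncher launcher = new SimpleJobLauncher();
        launcher.setJobRepository(jobRepository);

        // Any async TaskExecutor works; ThreadPoolTaskExecutor bounds the concurrency.
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        executor.initialize();
        launcher.setTaskExecutor(executor);

        launcher.afterPropertiesSet();
        return launcher;
    }
}
```

With this launcher, `jobLauncher.run(createExternalUserJob, params)` returns immediately with a `JobExecution` in `STARTING` state, so the controller can send the 200 without waiting for the job to finish.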
Obviously this is all within a single JVM, which brings us to your next question.
I want all jobs to run on the Batch server, but the jobs themselves can be initiated from the Web server. Can Spring Batch do this? Do I need some kind of queue that I can write to from the Web server and consume from the Batch server, where the actual job will begin?
Spring Batch contains a module called Spring Batch Integration. This module provides various capabilities including using messages to launch Spring Batch Jobs. You can use this to have a remote "batch" server that you can communicate with from the web server. The communication mechanism is Spring Integration channels so any messaging option backed by SI would be supported (JMS, AMQP, REST, etc).
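A rough sketch of both sides, assuming the Spring Integration 5 Java DSL (the channel names, job name, and payload type are illustrative; the `outbound`/`inbound` channels would be bridged by your JMS or AMQP adapters):

```java
// Web-server side: publish the userId to a channel backed by JMS/AMQP.
@Bean
public IntegrationFlow jobRequestFlow() {
    return IntegrationFlows.from("userCreatedChannel")
            .channel("outboundJobRequests") // bridged to the batch server via JMS/AMQP
            .get();
}

// Batch-server side: build a JobLaunchRequest from the incoming payload
// and hand it to the JobLaunchingGateway, which runs the job locally.
@Bean
public IntegrationFlow jobLaunchFlow(JobLauncher jobLauncher, Job createExternalUserJob) {
    return IntegrationFlows.from("inboundJobRequests")
            .<Long, JobLaunchRequest>transform(userId -> new JobLaunchRequest(
                    createExternalUserJob,
                    new JobParametersBuilder().addLong("userId", userId).toJobParameters()))
            .handle(new JobLaunchingGateway(jobLauncher))
            .get();
}
```

`JobLaunchRequest` and `JobLaunchingGateway` come from the spring-batch-integration module; the gateway replies with the resulting `JobExecution`, which you can route back to the web server if you want status feedback.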
Scenario 2 - Process lines in huge file, start new job for each line
This scenario makes me think you're going down the wrong path for your design. Can you post a new question that elaborates on this use case?
Additional question: Is it possible to query Jobs based on a job input parameter
Job parameters are used to identify JobInstances and are fundamental to job identification. Because of this, yes, you can identify individual job runs based on the parameters.
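For example, a lookup by parameter might look like this with `JobExplorer` (assumes Spring Batch 4.x; the job name, parameter key, and page size come from the scenario and are illustrative):

```java
import java.util.List;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.explore.JobExplorer;

public class UserJobLookup {

    private final JobExplorer jobExplorer;

    public UserJobLookup(JobExplorer jobExplorer) {
        this.jobExplorer = jobExplorer;
    }

    /** Finds the execution of createExternalUserJob launched for the given user. */
    public JobExecution findCreateExternalUserExecution(long userId) {
        List<JobInstance> instances =
                jobExplorer.getJobInstances("createExternalUserJob", 0, 100);
        for (JobInstance instance : instances) {
            for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
                Long param = execution.getJobParameters().getLong("userId");
                if (param != null && param == userId) {
                    // For the web page: execution.getStatus(),
                    // execution.getAllFailureExceptions()
                    return execution;
                }
            }
        }
        return null;
    }
}
```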
Related
Currently, I have an idea about building a centralized batch job management system (I'll temporarily call it the batch service).
We own a microservice system, and the batch jobs are scattered across the services (including Oracle batch jobs), so I intend to set up a batch job management system.
But there is one problem: in a microservice system there are many databases, so I want the manipulation of data to be done by the other services, while the batch service only does the following: configuration, scheduling, checking status/state, logging, start, stop, and retry.
My idea is to use a message broker (Kafka, RabbitMQ, ...) to pass job requests from the batch service to the other services. But I haven't come up with a solution for stopping jobs or saving their logs on the batch service.
Is this idea feasible, and if so, can you give me some advice on deployment technologies? (We are deploying using Spring Boot at the moment.)
Thanks for taking the time to read ^^.
Application: an ETL application written in Java/Spring Batch/Oracle, running on Spring Boot
Problem statement:
There are several long-running SQL queries in a Java job; query execution is the last step of the job.
Sometimes a query gets stuck at the DB level and the DB doesn't respond; eventually the query is killed and the job has to be started again.
Now I want to implement (for both Spring jobs and non-Spring jobs):
a way to start the job at the failure point, not from the top.
High-availability architecture so that users don't have to wait if the application server stops responding. Something similar to a hot-hot architecture (any Spring-based solution would be great).
a way to start the job at the failure point, not from the top.
Spring Batch provides that by default, given that you use a persistent job repository. If you restart a failed job instance, Spring Batch will restart from the last save point in the last failed step (unless you configure your steps to be restartable even if they completed successfully).
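A minimal sketch, assuming a persistent JobRepository and an illustrative job and identifying parameter: launching again with *identical* JobParameters targets the same JobInstance, so Spring Batch restarts it rather than creating a new run (inside a method that declares the `JobLauncher.run` exceptions):

```java
JobParameters params = new JobParametersBuilder()
        .addString("run.date", "2023-01-15") // illustrative identifying parameter
        .toJobParameters();

JobExecution first = jobLauncher.run(etlJob, params);

if (first.getStatus() == BatchStatus.FAILED) {
    // Same parameters => same JobInstance => a restart, resuming from the
    // last save point and skipping steps that already completed successfully.
    JobExecution restarted = jobLauncher.run(etlJob, params);
}
```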
High availability architecture so that users don't have to wait if the application server stops responding.
If your database server is not responsive, neither Spring Batch nor any other tool can improve things here. What you can do is apply a timeout to your step and stop it if the timeout is exceeded. You can find a complete code example here: Restart step (or job) after timeout occours
I have a requirement to create around 10 Spring Batch jobs, each consisting of a reader and a writer. Each reader reads data from one Oracle DB and writes into a different Oracle DB (the source and destination servers are different). The jobs are implemented using Spring Boot, and all 10+ jobs will be packaged into a single JAR file. So far, fine.
Now the client also wants a UI to monitor job status and act as a job organizer. I went through the Spring Cloud Data Flow Server documentation for the UI requirement, but I'm not sure whether it'll serve the purpose, or whether there is any other option for monitoring job status and stopping and starting jobs from the UI whenever required.
Also, how could I separate the 10+ jobs inside a single JAR in the Spring Cloud Data Flow Server, if it's the only option for a UI?
Thanks in advance.
I don't have the reputation to add a comment, so I am posting an answer here, although I know this is not the ideal way to share a reference link.
This might help you:
spring-batch-job-monitoring-with-angular-front-end-real-time-progress-bar
Observability of Spring Batch jobs is provided by the data the framework persists in a relational database: instances, executions, timestamps, read counts, write counts, and so on.
You have different ways to exploit these data: a SQL client, JMX, the Spring Batch API (JobExplorer, JobOperator), or Spring Batch Admin (deprecated in favor of the Spring Cloud Data Flow Server).
Data Flow is an orchestrator that lets you execute data pipelines with streams and tasks (finite, short-lived, monitored services). For your jobs, you could wrap each job in a task and create a multi-task pipeline. Data Flow gives you the status of each execution.
You can also expose your monitoring data by pushing it as metrics into an InfluxDB, for instance.
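As a small illustration of the API route, reading those persisted data with an injected `JobExplorer` might look like this (the job name and page size are hypothetical):

```java
// Walk the last 10 instances of a job and print per-step read/write counts,
// all pulled from the job repository's metadata tables.
for (JobInstance instance : jobExplorer.getJobInstances("importJob", 0, 10)) {
    for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
        System.out.printf("%s #%d -> %s%n",
                instance.getJobName(), execution.getId(), execution.getStatus());
        for (StepExecution step : execution.getStepExecutions()) {
            System.out.printf("  step %s: read=%d written=%d%n",
                    step.getStepName(), step.getReadCount(), step.getWriteCount());
        }
    }
}
```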
We have a requirement to run many async background processes which access DBs, Kafka queues, etc. As of now, we are using Spring Batch with Tomcat (exploded WAR) for this. However, we are facing certain issues which I'm unable to solve using Spring Batch. I have been thinking of other frameworks to use, but couldn't find any that solves all my problems.
It would be great to know if there exists a framework which solves the following problems:
Since Spring Batch runs inside one Tomcat container (one Java process), any small update to any job/step requires restarting the Tomcat server. This hard-stops all running jobs, resulting in incomplete/stale data.
WHAT I WANT: Bundle all the JARs and run each job as a separate process. The framework should store the PID and should be able to manage (stop/force-kill) the job on demand. This way, when we want to update a JAR, the existing process won't be hindered (though we should still be able to stop the existing process from the UI), and no other job (running or not) will be touched.
I have looked at hot-update of JARs in Tomcat, but I'm skeptical whether to use such a mechanism in production.
Sub-question: Will OSGi integrate with Spring Batch? If so, is it possible to run each job as a separate container with all JARs embedded in it?
Spring Batch doesn't have a master-slave architecture.
WHAT I WANT: There should be a master, where the list of jobs is specified. There should be slave machines (workers), which are specified to the master in a configuration file. There should be a scheduler in the master which, when a job needs to start, assigns it to a slave (possibly load-balanced, but not necessarily), and the slave should update the DB. The master should be able to send data to and receive data from the slaves (start/stop/kill any job, give me an update of running jobs, etc.) so that it can be displayed on a UI.
This way, in case I have a high load, I should be able to just add machines into the cluster and modify the master configuration file and the load should get balanced right away.
Spring Batch doesn't have an in-built alerting mechanism in case of job stall/failure.
WHAT I WANT: I should be able to set up alerts for jobs in case of failure. If necessary, a job should have a timeout, where it should be able to notify the user (probably via email) or force-stop the job when it crosses a specified threshold.
Maybe Vert.x can do the trick.
Since Spring Batch runs inside one Tomcat container (one Java process), any small update to any job/step requires restarting the Tomcat server. This hard-stops all running jobs, resulting in incomplete/stale data.
Vert.x allows you to build microservices. Each Vert.x instance is able to communicate with other instances. If you stop one, the others can still work (provided they are not dependent on it; e.g., if you stop the master, the slaves will fail).
Vert.x is not an application server.
There's no monolithic Vert.x instance into which you deploy applications.
You just run your apps wherever you want to.
Spring Batch doesn't have a master-slave architecture
Since Vert.x is event driven, you can easily create a master-slave architecture. For example, handle the HTTP requests in one Vert.x instance and dispatch them among several other instances depending on the nature of the request.
Spring Batch doesn't have an in-built alerting mechanism in case of job stall/failure.
In vertx, you can set a timeout for each message and handle failure.
Sending with timeouts
When sending a message with a reply handler you can specify a timeout in the DeliveryOptions.
If a reply is not received within that time, the reply handler will be called with a failure.
The default timeout is 30 seconds.
Send Failures
Message sends can fail for other reasons, including:
There are no handlers available to send the message to
The recipient has explicitly failed the message using fail
In all cases the reply handler will be called with the specific failure.
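A minimal sketch with the Vert.x 3 API (the address, payload, and timeout value are illustrative):

```java
// Ask a worker to process a job, but give up after 5 seconds
// instead of the default 30.
DeliveryOptions options = new DeliveryOptions().setSendTimeout(5000);

vertx.eventBus().send("jobs.process", jobPayload, options, reply -> {
    if (reply.succeeded()) {
        System.out.println("job acknowledged: " + reply.result().body());
    } else {
        // Fires on timeout, when no handler is registered for the address,
        // or when the recipient explicitly calls fail(...) — the alerting
        // hook the question asks for.
        System.err.println("job failed: " + reply.cause().getMessage());
    }
});
```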
EDIT: There are other frameworks for doing microservices in Java. Dropwizard is one of them, but I can't say much more about it.
Currently I have a Java app (and a half-ported Python version) that runs in the background with a queue of jobs (currently read out of a MySQL database), handling thread sleeping/waking to share resources based on job priority and running time. A front-end PHP script posts jobs to the database, which the system polls at a fixed interval.
This approach is somewhat inefficient (though nicer than the locking issues of using a job file), but I can't help wondering whether there is some way to simplify it.
My thought was for the Java app (and/or Python app) to set up an HTTP service (Jetty?) with a web interface that pushes jobs directly to the queue, cutting out the middleman. Apache is serving other PHP sites, so this would have to run in tandem.
I'm really after some other input, as I'd prefer it to be an always-running background service. Having cron execute jobs was painful: some jobs run for 20+ hours, so adding a new one meant each new PHP (no threading) or Java call had to check whether a service with outstanding jobs was already running to add it to, instead of starting a new service. I'd also like a very simple web interface without too much resource waste.
Thanks for your input.
Deploy a JSP using Tomcat (or similar) that allows the user to post job requests to a job-scheduler web service from a web page. On the backend, use Quartz Scheduler to manage your jobs and have your web service add jobs to the Quartz queue.
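A minimal sketch of the Quartz side (Quartz 2 API; the job class, identities, and job data are illustrative):

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

// The unit of work Quartz will run.
public class ReportJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        int userId = context.getMergedJobDataMap().getInt("userId");
        // ... long-running work for this user ...
    }
}

// In the web service handler: enqueue the job.
Scheduler scheduler = new StdSchedulerFactory().getScheduler();
scheduler.start();

JobDetail job = JobBuilder.newJob(ReportJob.class)
        .withIdentity("report-1234", "user-jobs")
        .usingJobData("userId", 1234)
        .build();

Trigger trigger = TriggerBuilder.newTrigger()
        .startNow() // or a cron/calendar schedule for recurring jobs
        .build();

scheduler.scheduleJob(job, trigger);
```

Because Quartz keeps its own job store (optionally backed by a database), this removes the polling middleman: the web layer hands jobs straight to the scheduler.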