Scalable spring batch job on kubernetes - java

I am developing an ETL batch application using Spring Batch. My ETL process takes data from a pagination-based REST API and loads it into Google BigQuery. I would like to deploy this batch application in a Kubernetes cluster and exploit pod scalability. I understand Spring Batch supports both horizontal and vertical scaling. I have a few questions:
1) How do I deploy this ETL app on Kubernetes so that it creates pods on demand, using remote chunking / remote partitioning?
2) I am assuming there would be a main master pod and several slave pods provisioned based on load. Is that correct?
3) There is also a Kubernetes batch API available. Should I use the Kubernetes batch API or the Spring Cloud features? Which option is the better one?

I have used Spring Boot with Spring Batch and Spring Cloud Task to do something similar to what you want to do. Maybe it will help you.
The way it works is like this: I have a manager app that deploys a pod on Kubernetes with my master application. The master application does some work and then starts the remote partitioning, deploying several other pods with "workers".
Trying to answer your questions:
1) You can create a Docker image of an application that has a Spring Batch job. Let's call it the master application.
The application that deploys the master application can use a TaskLauncher or an AppDeployer from Spring Cloud Deployer Kubernetes; a sketch of this is shown after this list.
2) Correct. In this case you could use remote partitioning. Each partition would be another Docker image with a job. This would be your worker.
An example of remote partitioning can be found here.
3) In my case I used Spring Batch and managed to do everything I needed. The only problem I have now is with upscaling and downscaling my cluster. Since my workers are stateful, I'm experiencing some problems when instances are removed from the cluster. If you don't need to upscale or downscale your cluster, you are good to go.
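For reference, here is a minimal sketch of what the "deployer" side could look like with a TaskLauncher from Spring Cloud Deployer Kubernetes. The class name, image tag, and application property are illustrative, not taken from the original project:

```java
import java.util.Map;

import org.springframework.cloud.deployer.resource.docker.DockerResource;
import org.springframework.cloud.deployer.spi.core.AppDefinition;
import org.springframework.cloud.deployer.spi.core.AppDeploymentRequest;
import org.springframework.cloud.deployer.spi.task.TaskLauncher;

public class MasterLauncher {

    // On Kubernetes this would typically be a KubernetesTaskLauncher bean.
    private final TaskLauncher taskLauncher;

    public MasterLauncher(TaskLauncher taskLauncher) {
        this.taskLauncher = taskLauncher;
    }

    public String launchMaster() {
        // Name and application properties of the app to launch (illustrative values).
        AppDefinition definition = new AppDefinition(
                "etl-master",
                Map.of("spring.batch.job.enabled", "true"));

        // The Docker image that contains the Spring Batch master job.
        AppDeploymentRequest request = new AppDeploymentRequest(
                definition,
                new DockerResource("myregistry/etl-master:latest"));

        // Returns a launch id; on Kubernetes a pod is created for the task.
        return taskLauncher.launch(request);
    }
}
```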

Related

How to pair embedded Hazelcast instances running inside multiple Docker containers of the same image

I have a Spring Boot application that uses embedded Hazelcast. The app relies heavily on data held in the cache. Initially I ran only one Docker container. Now, to scale the response, I have started three Docker containers of the same image, and each container runs its own embedded Hazelcast. So when I trigger a refresh of the cache, only the container that serves the request has the latest data; the other two will not have the latest data until they serve a refresh request themselves.
Problem:
Because all three containers run on their own, each with its own Hazelcast, the Hazelcast instances of the three containers are not in sync. I need to sync all the Hazelcast instances running inside the containers, so that a single refresh updates the cache data on all three.
How can I do this?
Edit: I am using Docker Swarm. On one VM I have two containers, and on another I have one container of the same image.
I found that it could be achieved through <public-address-ip>, but I have not tried it yet.
You need to make your Hazelcast instances form one cluster. How to do that depends on the environment you're running in. Check the Hazelcast Reference Manual's Discovery Mechanisms section for details.
If you run in Docker Swarm, you should use the Docker Swarm Discovery SPI plugin.
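As a starting point, here is a minimal sketch (Hazelcast 4 style API) of forming one cluster with an explicit TCP/IP member list and a public address, along the lines of the <public-address-ip> idea mentioned above. The addresses and cluster name are illustrative; on Docker Swarm the Discovery SPI plugin would replace the hardcoded member list:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class CacheClusterConfig {

    public static HazelcastInstance newMember() {
        Config config = new Config();
        config.setClusterName("app-cache"); // every container must use the same name

        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false); // multicast rarely works across Docker hosts
        join.getTcpIpConfig()
                .setEnabled(true)
                .addMember("10.0.0.11")  // illustrative addresses of the
                .addMember("10.0.0.12")  // VMs/containers running the app
                .addMember("10.0.0.13");

        // The address other members should use to reach this container
        // (the <public-address-ip> from the question; illustrative value).
        config.getNetworkConfig().setPublicAddress("10.0.0.11");

        return Hazelcast.newHazelcastInstance(config);
    }
}
```

Once the three instances have joined one cluster, a refresh that writes through a distributed map (IMap) is visible to all three containers.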

Spring Cloud Data Flow - Spring Batch Remote Chunking

How do you do Spring Batch remote chunking within Spring Cloud Data Flow Server?
My understanding is that Spring Batch remote partitioning can be done within Spring Cloud Data Flow Server using the DeployerPartitionHandler.
But how do we implement remote chunking inside SCDF?
There is nothing special about running a remote chunking job on SCDF. All you need to do is run both the master and the workers as Task applications.
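For illustration, a manager (master) step for remote chunking can be configured with Spring Batch Integration roughly like this; the step name, chunk size, reader, and channel beans are illustrative, and in a real deployment the request/reply channels would be bound to middleware such as RabbitMQ or Kafka:

```java
import java.util.List;

import org.springframework.batch.core.step.tasklet.TaskletStep;
import org.springframework.batch.integration.chunk.RemoteChunkingManagerStepBuilderFactory;
import org.springframework.batch.integration.config.annotation.EnableBatchIntegration;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.integration.channel.QueueChannel;

@Configuration
@EnableBatchIntegration // exposes the remote chunking step builder factories
public class ManagerStepConfig {

    private final RemoteChunkingManagerStepBuilderFactory factory;

    public ManagerStepConfig(RemoteChunkingManagerStepBuilderFactory factory) {
        this.factory = factory;
    }

    @Bean
    public DirectChannel requests() {
        return new DirectChannel(); // outbound: chunks sent to the workers
    }

    @Bean
    public QueueChannel replies() {
        return new QueueChannel(); // inbound: acknowledgements from the workers
    }

    @Bean
    public TaskletStep managerStep() {
        return factory.get("managerStep")
                .chunk(100)
                .reader(new ListItemReader<>(List.of("a", "b", "c"))) // illustrative reader
                .outputChannel(requests())
                .inputChannel(replies())
                .build();
    }
}
```

The worker side is built with the matching RemoteChunkingWorkerBuilder, and both applications are then registered and launched as Tasks in SCDF.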

Testing a Spring application on multiple JVMs

I have coded a Spring MVC / Hibernate application with RabbitMQ as a messaging server and a MySQL DB. I have also used Hazelcast, an in-memory distributed cache, to centralize the state of the application, moving the local Tomcat session to a centralized session and implementing distributed locks.
The app is currently hosted on a single Tomcat server on my local system.
I want to test my application in a multi-JVM environment, i.e. the app running on multiple Tomcat servers.
What would be the best approach to test the app?
A few things that come to my mind:
A. Install and configure a load balancer and set up a Tomcat cluster on my local system. This, I believe, is tedious and requires much effort.
B. Host the application on a PaaS like OpenShift or Cloud Foundry, but I am not sure whether I would be able to test the application on several nodes.
C. Any other way to simulate a clustered environment on my local Windows system?
I would suggest you first understand your application's requirements: for the real production/live environment, are you going to use Infrastructure as a Service or a PaaS?
If Infrastructure as a Service:
I would suggest creating a local cluster environment and using the Tomcat/Spring sticky-session concept. Persist the sessions in a Hazelcast or Redis server installed on a different node, and configure a load balancer in front of the Tomcat nodes. 2-3 VMs would be suitable for testing purposes; a sketch of the session setup follows this answer.
If the requirement is a PaaS:
Don't think about a local environment. Test directly on OpenShift or an AWS free account; you will be able to test on the PaaS if everything is set up correctly.
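For the session part of the IaaS option, a minimal sketch with Spring Session backed by Hazelcast could look like this (assuming the spring-session-hazelcast dependency is present; a real setup would also configure the cluster join and the session map):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.session.hazelcast.config.annotation.web.http.EnableHazelcastHttpSession;

@Configuration
@EnableHazelcastHttpSession // replaces the local Tomcat session with a Hazelcast-backed one
public class SessionConfig {

    @Bean
    public HazelcastInstance hazelcastInstance() {
        // Default config; the join mechanism (multicast, TCP/IP member list, ...)
        // decides how the 2-3 test VMs find each other.
        return Hazelcast.newHazelcastInstance(new Config());
    }
}
```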

How to start slaves on different machines in spring remote partitioning strategy

I am using Spring Batch local partitioning to process my job. In local partitioning, multiple slaves are created in the same instance, i.e. within the same JVM. How is remote partitioning different from local partitioning? What I am assuming is that in remote partitioning, each slave is executed on a different machine. Is my understanding correct? If so, how do I start the slaves on different machines without using Cloud Foundry? I have seen Michael Minella's talk on remote partitioning (https://www.youtube.com/watch?v=CYTj5YT7CZU). I am curious to know how remote partitioning works without Cloud Foundry. How can I start slaves on different machines?
While that video uses CloudFoundry, the premise of how it works applies off CloudFoundry as well. In that video I launch multiple JVM processes (web apps in that case). Some are configured as slaves, so they listen for work. The other is configured as a master, and that's the one I use to do the actual launching of the job.
Off of CloudFoundry, this would be no different than deploying WAR files onto Tomcat instances on multiple servers. You could also use Spring Boot to package executable jar files that run your Spring applications in a web container. In fact, the code for that video (which is available on Github here: https://github.com/mminella/Spring-Batch-Talk-2.0) can be used in the same way it was on CF. The only change you'd need to make is to not use the CF specific connection factories and use traditional configuration for your services.
In the end, the deployment model is the same off CloudFoundry or on. You launch multiple JVM processes on multiple machines (connected by middleware of your choice) and Spring Batch handles the rest.
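To make the wiring concrete, here is a sketch of the master-side partition handler from spring-batch-integration. The step name, grid size, and channel beans are illustrative, and the request/reply channels would be bridged to whatever middleware (e.g. RabbitMQ) connects the machines:

```java
import org.springframework.batch.core.partition.PartitionHandler;
import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.integration.channel.QueueChannel;
import org.springframework.integration.core.MessagingTemplate;

@Configuration
public class PartitionMasterConfig {

    @Bean
    public DirectChannel partitionRequests() {
        return new DirectChannel(); // bridged to middleware so remote slaves receive work
    }

    @Bean
    public QueueChannel partitionReplies() {
        return new QueueChannel(); // slaves send their StepExecution results back here
    }

    @Bean
    public PartitionHandler partitionHandler() {
        MessagingTemplate template = new MessagingTemplate();
        template.setDefaultChannel(partitionRequests());

        MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
        handler.setStepName("slaveStep"); // the step each remote slave executes
        handler.setGridSize(4);           // how many partitions to fan out
        handler.setMessagingOperations(template);
        handler.setReplyChannel(partitionReplies());
        return handler;
    }
}
```

The master step references this handler together with a Partitioner; each slave JVM listens on the request destination and runs "slaveStep" for the partition it receives.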

spring boot application in cluster

I am developing a Spring Boot application, and Spring Boot creates a .jar file for the application.
I want to cluster this particular application across different servers. Let's say I build the jar file and run the project; it should then run in cluster mode across a number of defined servers and be able to serve end-user needs.
My jar would reside on only one server, but it would be clustered across a number of servers. When an end user calls a web service of my Spring Boot app, he should never know which server it is served from.
The reason behind clustering is that if any of the servers goes down in the future, the end user will still be able to access the web services from another server. But I don't know how to make it clustered.
Can anyone please give me some insight on this?
If you want to have it clustered, you just run your Spring Boot application on multiple servers (of course, the JAR must be present on those servers, otherwise you can't run it). You would then place a load balancer in front of the application servers to distribute the load.
If all the services you are going to expose are stateless, you only need a load balancer in front of your nodes, e.g. Apache or nginx. If your services are stateful (they store any state: sessions, data in a DB), you have to use a distributed cache or an in-memory data grid:
for sessions, you can use the Spring Session project, which can use Redis to store the sessions (see the sketch after this list).
for data stored in the DB, you need to cluster the DB itself, and you can use a distributed cache like Hazelcast above your DB layer.
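For the session case, a minimal sketch with Spring Session backed by Redis (assuming the spring-session-data-redis dependency; the default connection factory points at localhost:6379 and is illustrative):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;

@Configuration
@EnableRedisHttpSession // stores HTTP sessions in Redis instead of per-node memory
public class RedisSessionConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        // Connects to localhost:6379 by default; point it at your shared
        // Redis server when running on multiple nodes.
        return new LettuceConnectionFactory();
    }
}
```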
Look into Spring Cloud; it uses Netflix open-source software, along with Amazon's, to create 12-factor apps for microservices.
Ideally you would need a load balancer and a service registry to help you run multiple instances of Spring Boot. I believe you have to add a dependency called Eureka; a sketch follows the link below.
Check the link below:
Spring Cloud
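As a minimal sketch, registering each instance with a Eureka server needs little more than the eureka-client starter on the classpath plus an application name; everything here is illustrative:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

// Assumes the spring-cloud-starter-netflix-eureka-client dependency and a
// property such as spring.application.name=my-service; each running jar then
// registers as one instance of that service, and a load balancer spreads
// calls across the registered instances.
@SpringBootApplication
@EnableDiscoveryClient
public class ClusteredApplication {

    public static void main(String[] args) {
        SpringApplication.run(ClusteredApplication.class, args);
    }
}
```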
You can deploy it on Cloud Foundry and use the autoscale function to increase the number of application instances.
