My PHP web server receives requests and needs to launch a Java program that runs between 30 seconds and 5 minutes, or even longer. That long-running process needs to be distributed across the available servers on my LAN.
What I need:
A job queue (that's done in a DB)
A DB watch: get notified of new or completed jobs (to start the next job in the queue)
A way to start a Java process on a remote, available computer.
It seems this needs to be a DB watch, since I have to evaluate which remote computer is available, and a DB stored procedure wouldn't accomplish that easily.
What is the best, or at least a good, way to achieve this in an OS-independent way using Java?
I guess I could use a file watcher and manage the queue in a folder, but that seems prehistoric.
Thanks
I would use a JMS queue. You add tasks/messages to a queue, and the next available process takes a task, performs it, and sends back any result on another queue or topic. This gives you transparent load balancing, lets you restart tasks if a process fails, and requires no polling.
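A minimal sketch of that worker side, assuming a JMS 2.0 provider; the JNDI name, queue names, and the runJavaProgram helper are placeholders for illustration, not part of any real setup here:

    import javax.jms.ConnectionFactory;
    import javax.jms.JMSConsumer;
    import javax.jms.JMSContext;
    import javax.jms.Queue;
    import javax.naming.InitialContext;

    public class TaskWorker {
        public static void main(String[] args) throws Exception {
            // Assumed JNDI name; your broker's client library provides the factory.
            ConnectionFactory factory = (ConnectionFactory)
                    new InitialContext().lookup("jms/ConnectionFactory");
            try (JMSContext ctx = factory.createContext()) {
                Queue tasks = ctx.createQueue("job.requests");   // assumed queue names
                Queue results = ctx.createQueue("job.results");
                JMSConsumer consumer = ctx.createConsumer(tasks);
                while (true) {
                    // Blocks until the broker hands this worker the next task;
                    // idle workers on other machines compete for the same queue.
                    String jobSpec = consumer.receiveBody(String.class);
                    String result = runJavaProgram(jobSpec);     // the 30 s - 5 min job
                    ctx.createProducer().send(results, result);
                }
            }
        }

        private static String runJavaProgram(String jobSpec) {
            // placeholder for launching the long-running Java process
            return "done: " + jobSpec;
        }
    }

The PHP side only has to enqueue a message describing the job; whichever LAN machine is free picks it up.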
Related
My team maintains an application (written in Java) which processes long-running batch jobs. These jobs need to run in a defined sequence, so the application starts a socket server on a pre-defined port to accept job execution requests. It keeps the socket open until the job completes (with success or failure). This way the job scheduler knows when one job ends, and upon successful completion it triggers the next job in the pre-defined sequence. If a job fails, the scheduler sends out an alert.
This is a setup we have had for over a decade. We have some jobs which run for a few minutes and others which take a couple of hours (depending on the volume) to complete. The setup has worked without any issues.
Now we need to move this application to a container (Red Hat OpenShift Container Platform), and the infra policy in place allows only the default HTTPS port to be exposed. The scheduler sits outside OCP and cannot access any port other than the default HTTPS port.
In theory, we could use HTTPS, set the client timeout to a very large value, and try to mimic the current TCP socket setup. But would this be reliable enough, given that HTTP is designed to serve short-lived requests?
There isn't a reliable way to keep a connection alive for a long period over the internet, because of the nodes (routers, load balancers, proxies, NAT gateways, etc.) that may sit between your client and server. They might drop the connection mid-stream under load, some of them will happily ignore your HTTP keep-alive requests, and others have an internal maximum connection duration that will kill long-running TCP connections. You may find it works for you today, but there is no guarantee it will work tomorrow.
So you'll probably need to submit the job as a short-lived request and check the status via other means:
Push-based strategy: send a webhook URL as part of the job submission and have the server call it (possibly with retries) on job completion to notify interested parties.
Pull-based strategy: have the server return a job ID on submission, then have the client check periodically. Given the range of your job durations, you may want to implement this with some form of exponential backoff up to a certain limit: for example, first check after waiting 2 seconds, then wait 4 seconds before the next check, then 8 seconds, and so on, up to the maximum you are happy to wait between checks. That way you find out about short job completions sooner without polling too frequently for long jobs (a sketch of this loop follows below).
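A rough sketch of that pull-based loop using Java 11's HttpClient; the /jobs/{id} endpoint and the COMPLETED/FAILED status strings are assumptions for illustration, not an actual API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class JobPoller {
        public static String awaitJob(String baseUrl, String jobId) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            long delayMs = 2_000;           // first check after 2 seconds
            final long maxDelayMs = 60_000; // cap the backoff at 1 minute
            while (true) {
                HttpRequest req = HttpRequest.newBuilder(
                        URI.create(baseUrl + "/jobs/" + jobId)).GET().build();
                String status = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
                if (status.equals("COMPLETED") || status.equals("FAILED")) {
                    return status;
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs); // 2 s, 4 s, 8 s, ... up to the cap
            }
        }
    }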
When you worked with sockets and the TCP protocol, you were in control of how long to keep connections open. With HTTP you only control logical connections, not physical ones. The actual connections are managed by the OS, and IT people can usually configure all of those timeouts. By default, even when you close a logical connection, the real connection is not closed immediately, in anticipation of further communication; it is closed by the OS and is not controlled by your code. And even if it does close, your next request transparently opens a new one, so it doesn't really matter whether it was closed or not; it should be transparent to your code. So, in short, I assume you can move to HTTP/HTTPS with no problems, but you will have to test and see. Regarding other options for server-to-client communication, see my answer to this question: How to continues send data from backend to frontend when something changes
We have had bad experiences with long-standing HTTP/HTTPS connections. We used to schedule short jobs (only a couple of minutes) via HTTP and wait for them to finish before sending a response. This worked fine until the jobs got longer (hours) and some network infrastructure closed the inactive connections. We ended up only submitting the request via HTTP, getting an immediate response, and then implementing polling to wait for the result. At the time, the migration was pretty quick for us, but since then we have migrated even further to "webhooks", i.e. allowing the processor of the job to signal its state back to the server at a known webhook address.
IMHO, you should upgrade your scheduler to use a REST API. A WebSocket isn't effective in this scenario; the connection would be inactive most of the time.
The jobs can be short-lived or long-running. So, when a long-running job fails in the middle, how does the restart happen? Does it start from the beginning again?
In a similar scenario, we had a database keep track of the job's progress (the number of records successfully processed), so jobs could resume after a failure (sketched below). With such a design, another web service can monitor the status of the job by looking at the database, so the main process is not impacted by constant polling from clients.
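A sketch of that checkpointing idea with plain JDBC; the job_progress table and its columns are made up for illustration:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class JobCheckpoint {
        // Assumed schema: job_progress(job_id BIGINT PRIMARY KEY, records_processed BIGINT)

        /** Where to resume from after a restart (0 if the job has no row yet). */
        static long loadCheckpoint(Connection con, long jobId) throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT records_processed FROM job_progress WHERE job_id = ?")) {
                ps.setLong(1, jobId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getLong(1) : 0L;
                }
            }
        }

        /** Called after each successful batch; a monitoring service can read the same row. */
        static void saveCheckpoint(Connection con, long jobId, long recordsProcessed) throws Exception {
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE job_progress SET records_processed = ? WHERE job_id = ?")) {
                ps.setLong(1, recordsProcessed);
                ps.setLong(2, jobId);
                ps.executeUpdate();
            }
        }
    }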
We had a microservice approach for one of our systems, using Kafka as an event bus.
We had some latency problems and experimented with replacing the Kafka topics with a bunch of Chronicle queues. When running locally on a developer machine the results were amazing: one of our most expensive workflows was processing ten to thirty times faster.
Given the initial good results, we decided to take the experiment further and deploy our proof of concept in AWS, which is where our system runs. Our microservices run in Docker containers across a bunch of EC2s.
We created an EFS volume and mounted it on each Docker container. We verified that the volume was accessible from each microservice and that the right read/write permissions were granted.
Now the problem:
MS1 receives a message (an API call), does some processing, and emits an event to a Chronicle queue. We can see on the EFS file system that the queue file is touched. MS2 is supposed to consume that event and do some further processing, but this is not happening. Eventually restarting MS2 will trigger the message processing, but not always. Easy to imagine the disappointment.
The question:
Is our EFS approach wrong? If so, what would be the way to go?
Thank you in advance for your inputs.
Chronicle Queue cannot work on a network file system like EFS, as discussed in this previous question and also documented here: https://github.com/OpenHFT/Chronicle-Queue/#usage
To communicate between hosts you need Chronicle Queue Enterprise, which supports TCP/IP replication.
Please also note the documentation on running with Docker.
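For reference, the supported single-host usage looks roughly like this: writer and reader on the same machine, with the queue directory on a local disk (the path below is just an example):

    import net.openhft.chronicle.queue.ChronicleQueue;
    import net.openhft.chronicle.queue.ExcerptAppender;
    import net.openhft.chronicle.queue.ExcerptTailer;

    public class LocalQueueExample {
        public static void main(String[] args) {
            // The directory must be on a local file system, not EFS/NFS.
            try (ChronicleQueue queue = ChronicleQueue.singleBuilder("/var/data/events").build()) {
                ExcerptAppender appender = queue.acquireAppender();
                appender.writeText("order-created");

                ExcerptTailer tailer = queue.createTailer();
                System.out.println(tailer.readText()); // prints "order-created"
            }
        }
    }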
My Java EE application is a backend service for mobile clients, so the clients must register with the backend service. There is a lot of database processing and there are different kinds of jobs in the registration flow. To improve performance, I am planning to create a job pool: for example, when a client registers with the backend service, its jobs are pushed to the pool until the pool is full. Once the pool is full, the jobs are processed... Is there any suitable way to implement this idea?
thanks,
What is the reason for waiting until you accumulate a big block instead of quickly processing small chunks? Performance-wise the latter is almost always better, not even speaking of transactions and such. Plus, your clients wait longer than necessary.
If you really want to do it, I'd store all incoming requests in a list, the database, or a queue, whichever you prefer and depending on whether it needs to be persistent, and have a periodic job check for new ones and process them, if needed only once a certain threshold is exceeded (a sketch follows below).
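A sketch of that idea with an in-memory queue and a periodic flush; RegistrationJob and the threshold/interval values are placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class RegistrationBatcher {
        private static final int THRESHOLD = 100;   // flush when this many jobs are waiting
        private final BlockingQueue<RegistrationJob> pending = new LinkedBlockingQueue<>();
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        public void start() {
            // also flush on a timer so jobs never wait too long when traffic is low
            scheduler.scheduleAtFixedRate(this::flush, 10, 10, TimeUnit.SECONDS);
        }

        public void submit(RegistrationJob job) {
            pending.add(job);
            if (pending.size() >= THRESHOLD) {
                flush();
            }
        }

        private synchronized void flush() {
            List<RegistrationJob> batch = new ArrayList<>();
            pending.drainTo(batch);
            if (!batch.isEmpty()) {
                process(batch); // your database work, ideally in one transaction
            }
        }

        private void process(List<RegistrationJob> batch) { /* ... */ }

        public static class RegistrationJob { /* client registration data */ }
    }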
I am building an application that uses the Quartz scheduler, triggered every 30 minutes. Since we have clustered servers, once the application is deployed the job will run twice, which we don't want to happen. Therefore we decided on a socket approach, which requires a port and an IP. My question is: how would I implement the socket approach so that only one instance runs in the whole clustered environment? Any suggestion, help, or example code will be highly appreciated.
You could use JMS or a similar approach, where you publish the task to be done on a queue. Given the nature of a queue, where only one consumer can consume a given message, you can be sure that only one of the instances will pick up the task and run it.
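A minimal sketch of that competing-consumers idea: every clustered instance runs the same listener on the same queue, and the broker delivers each task message to exactly one of them. The JNDI name, queue name, and runTask method are assumptions for illustration:

    import javax.jms.ConnectionFactory;
    import javax.jms.JMSContext;
    import javax.jms.Queue;
    import javax.jms.TextMessage;
    import javax.naming.InitialContext;

    public class ScheduledTaskConsumer {
        public static void main(String[] args) throws Exception {
            // The same code runs on every node in the cluster.
            ConnectionFactory factory = (ConnectionFactory)
                    new InitialContext().lookup("jms/ConnectionFactory"); // assumed JNDI name
            JMSContext ctx = factory.createContext();
            Queue tasks = ctx.createQueue("scheduled.tasks");             // assumed queue name

            // The broker gives each message to only one of the competing consumers,
            // so a task published every 30 minutes runs on exactly one instance.
            ctx.createConsumer(tasks).setMessageListener(message -> {
                try {
                    String payload = ((TextMessage) message).getText();
                    runTask(payload);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }

        private static void runTask(String payload) { /* the work previously done by the Quartz job */ }
    }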
So I'm working through a bit of a problem, and some advice would be nice. First, a little background; please excuse the length.
I am working on a management system that queries network devices via the TL1 protocol. For those unfamiliar with the protocol, the short answer is that it is a "human readable" language that communicates over a text-based IO stream.
I am using Spring and JSch to open a port to the remote NE (network element), log in, run the command, then close the connection. There are two ways to get into the remote NEs: either directly (via the SSH gateway) if the element has a TCP/IP address (many are OSI only), or through an EMS (management system) of some type, using what is called a "northbound interface".
Either way, the procedure is the same.
Use JSch to open a port to the NE or EMS.
Send the login command for the NE, e.g. "act-user<tid>:<username>:UniqueId::<password>;"
Send the command, e.g. "rtrv-alm-all:<tid>:ALL:uniqueid::,,,,;"
Retrieve and process the results. The results of the above, for example, might look something like this:
RTRV-ALM-ALL:foo:ALL:uniqueid;
CMPSW205 02-01-11 18:33:05
M uniqueid COMPLD
"01-01-06:MJ,BOARDOUT-ALM,SA,01-10,12-53-58,,:\"OPA_C__LRX:BOARD EXTRACTED\","
;
The ";" is important because it signals the end of the response (a read loop for this is sketched after these steps).
Lastly, log out and close the port.
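A sketch of that read loop, assuming (as in the sample above) that the terminating ";" arrives on a line of its own:

    import java.io.BufferedReader;
    import java.io.IOException;

    public class Tl1Reader {
        /** Reads one TL1 response, accumulating lines until the lone ";" terminator. */
        static String readResponse(BufferedReader in) throws IOException {
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
                if (line.trim().equals(";")) {  // end-of-response marker
                    break;
                }
            }
            return response.toString();
        }
    }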
With Spring I have been using the ThreadPoolTaskExecutor quite effectively to do this.
Until this issue came up ...
With one particular EMS platform (Hitachi) I ran into a roadblock with my approach. This EMS handles as many as 80 nodes. You connect to the port, issue a command to log in to the EMS, then run commands directed at the various NEs. Same procedure as before, but here is the problem...
After you log in to the EMS, the next command, no matter what it is, takes up to 10 minutes to complete. Until that happens, all other commands are blocked. After this initial wait, all other commands work quickly. There appears to be no way to defeat this behaviour (my suspicion is that some NE auto-discovery happens during this period).
Now the thrust of my question...
So my next approach for this platform would be to connect to the EMS, log in, keep the connection open, and just pass commands to the various NEs. That would mean a 10-minute delay after the (web-based) application first loads, but it would be fine after that point.
The problem I have is how best to do this. Having a single text-based IO stream for passing all of this through looks like a large bottleneck, and multiple users will be using the application, so how do I handle multiple commands and responses over this single stream? I can open a few streams (maybe up to 6) to this EMS, but that also complicates sorting out what goes where.
Any advice on direction would be appreciated.
Look at using one process per EMS so that communication with each EMS is separated. This will at least ensure that communications with the other EMSs are unaffected by the problems with this one.
You're going to have to build some sort of command-queuing system so that commands sent to the Hitachi EMS don't block the user interface until they are completed. Either that, or you're going to have to put a 10-minute delay into the client software before it can be used, or a 10-minute delay into the part of the interface that handles the Hitachi.
Perhaps it would be a good policy to bring up the connection and immediately send some sort of ping or station-keeping idle command: something benign whose response you don't care about, or that gives no response, but that triggers the 10-minute delay and gets it over with. Your users can become familiar with this delay and at least start the application before getting their coffee.
If you can somehow isolate the Hitachi from the other EMSs in the application's design, you can ensure that the 10-minute delay only exists while interfacing with the Hitachi. You can connect, issue a dummy command, and put the Hitachi in some sort of "connecting" state where commands cannot be used until the result comes in, then change the status to ready so the user can interact with it.
One other approach would be to develop some sort of middleware component, if you haven't already. If the clients are all web-based, you could run a communications piece on the web server which takes connections from the clients and pipes them through a single component that talks to all of the EMSs. When this piece starts up, it can connect to each EMS and send an initial ping command to start the 10-minute timer. Once that is complete, it can send keepalive messages every so often (again, some sort of dummy command) to keep the socket alive so it never has to reconnect and go through the 10-minute wait again. When a user brings up the website, they communicate with this middleware piece, which forwards requests to the appropriate EMS and relays the responses back to the client, all over the already-open connection.
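A sketch of that per-EMS queuing idea: one single-threaded executor per EMS serializes commands over its one open connection, so the Hitachi's slow first command only queues up Hitachi work while the other EMSs keep flowing. EmsConnection and its send method are hypothetical placeholders for your JSch/TL1 plumbing:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class EmsCommandGateway {
        /** One worker thread (and one open connection) per EMS. */
        private final Map<String, ExecutorService> workers = new ConcurrentHashMap<>();
        private final Map<String, EmsConnection> connections = new ConcurrentHashMap<>();

        /** Queues a TL1 command for the given EMS; callers block on the Future only when they need the response. */
        public Future<String> submit(String emsId, String tl1Command) {
            ExecutorService worker =
                    workers.computeIfAbsent(emsId, id -> Executors.newSingleThreadExecutor());
            return worker.submit(() -> connectionFor(emsId).send(tl1Command));
        }

        private EmsConnection connectionFor(String emsId) {
            // Opens (and logs in to) the EMS on first use; the first submitted command
            // absorbs the Hitachi's long auto-discovery delay.
            return connections.computeIfAbsent(emsId, EmsConnection::new);
        }

        /** Hypothetical wrapper around the JSch session and TL1 IO stream. */
        static class EmsConnection {
            EmsConnection(String emsId) { /* open session, send act-user login */ }
            String send(String tl1Command) { /* write command, read until ";" */ return ""; }
        }
    }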