Detecting when an EC2 instance is about to shut down (Java SDK)

I posted this on the AWS support forums but haven't received a response, so hoping you guys have an idea...
We have an auto-scaling group which boots up or terminates instances based on current load. What I'd like to be able to do is detect, on my current EC2 instance, that it's about to be shut down, so I can finish my work first.
To describe the situation in more detail: we have an auto-scaling group, and each instance reads content from a single SQS queue. Each instance runs multiple threads, each reading from the same queue and processing the data as needed.
I need to know when the instance is about to shut down, so I can stop new threads from reading data and block the shutdown until the remaining data has finished processing.
I'm not sure how I can do this in the Java SDK, and I'm worried my instances will be terminated without my data being processed correctly.
Thanks
Lee

When it wants to scale down, AWS Auto Scaling will terminate your EC2 instances without warning.
There's no way to allow any queue workers to drain before terminating.
If your workers are processing messages transactionally, and you're not deleting messages from SQS until after they have been successfully processed, then this shouldn't be a problem. The processing will stop when the instance is terminated, and the transaction will not commit. The message won't be deleted from the SQS queue, and can be picked up and processed by another worker later on.
The only kind of draining behavior it supports is HTTP connection draining from an ELB: "If connection draining is enabled for your load balancer, Auto Scaling waits for the in-flight requests to complete or for the maximum timeout to expire, whichever comes first, before terminating instances".
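The delete-only-after-processing pattern described above can be sketched with the AWS SDK for Java (v1 here; the queue URL and the process/store steps are hypothetical placeholders):

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SqsWorker {

    // Hypothetical queue URL -- replace with your own.
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue";

    // Pure processing step, separated out so it can be tested in isolation.
    static String process(String body) {
        return body.trim().toUpperCase();
    }

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        while (true) {
            ReceiveMessageRequest req = new ReceiveMessageRequest(QUEUE_URL)
                    .withMaxNumberOfMessages(10)
                    .withWaitTimeSeconds(20);      // long polling
            for (Message m : sqs.receiveMessage(req).getMessages()) {
                String result = process(m.getBody());
                store(result);                     // your side effect goes here
                // Delete ONLY after successful processing. If the instance is
                // terminated before this line runs, the message becomes visible
                // again after the visibility timeout and another worker retries it.
                sqs.deleteMessage(QUEUE_URL, m.getReceiptHandle());
            }
        }
    }

    private static void store(String result) {
        System.out.println("processed: " + result);
    }
}
```

Make sure the queue's visibility timeout is longer than the worst-case processing time per message, otherwise a slow-but-alive worker and a replacement worker can process the same message concurrently.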

Related

Can we reliably keep HTTP/S connection open for a long time?

My team maintains an application (written in Java) which processes long-running batch jobs. These jobs need to be run in a defined sequence, so the application starts a socket server on a pre-defined port to accept job execution requests. It keeps the socket open until the job completes (with success or failure). This way the job scheduler knows when one job ends and, upon successful completion, triggers the next job in the pre-defined sequence. If the job fails, the scheduler sends out an alert.
This is a setup we have had for over a decade. We have some jobs which run for a few minutes and others which take a couple of hours (depending on the volume) to complete. The setup has worked without any issues.
Now we need to move this application to a container (Red Hat OpenShift Container Platform), and the infra policy in place allows only the default HTTPS port to be exposed. The scheduler sits outside OCP and cannot access any port other than the default HTTPS port.
In theory, we could use HTTPS, set the client timeout to a very large duration, and try to mimic the current setup with a TCP socket. But would this setup be reliable enough, given that the HTTP protocol is designed to serve short-lived requests?
There isn't a reliable way to keep a connection alive for a long period over the internet, because of the nodes (routers, load balancers, proxies, NAT gateways, etc.) that may sit between your client and server. They might drop the connection mid-stream under load, some of them will happily ignore your HTTP keep-alive request, and others have an internal maximum connection duration that kills long-running TCP connections. You may find it works for you today, but there is no guarantee it will work for you tomorrow.
So you'll probably need to submit the job as a short lived request and check the status via other means:
Push-based strategy: send a webhook URL as part of the job submission and have the server call it (possibly with retries) on job completion to notify interested parties.
Pull-based strategy: have the server return a job ID on submission, then have the client check the status periodically. Given the range of your job durations, you may want to implement this with some form of exponential backoff up to a certain limit; for example, first check after waiting 2 seconds, then wait 4 seconds before the next check, then 8 seconds, and so on, up to the maximum time you are happy to wait between checks. That way you find out about short job completions sooner without checking too frequently for long jobs.
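A minimal sketch of that pull-based strategy in Java; fetchStatus is a hypothetical stand-in for the real HTTPS status call:

```java
import java.util.concurrent.TimeUnit;

public class JobPoller {

    // Delay before the n-th check (n starting at 1): 2s, 4s, 8s, ... capped.
    static long delaySeconds(int attempt, long capSeconds) {
        long d = 1L << Math.min(attempt, 62);   // 2^attempt, overflow-safe
        return Math.min(d, capSeconds);
    }

    // Hypothetical status check -- replace with your HTTPS call to the server.
    static String fetchStatus(String jobId) {
        return "RUNNING";
    }

    public static void pollUntilDone(String jobId) throws InterruptedException {
        long capSeconds = 300; // never wait more than 5 minutes between checks
        for (int attempt = 1; ; attempt++) {
            TimeUnit.SECONDS.sleep(delaySeconds(attempt, capSeconds));
            String status = fetchStatus(jobId);
            if (!"RUNNING".equals(status)) {
                return; // job finished (success or failure); act on the status
            }
        }
    }
}
```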
When you worked with sockets and TCP directly, you controlled how long connections stayed open. With HTTP you only control logical connections, not physical ones. The actual connections are managed by the OS, and IT staff can usually configure all the related timeouts. By default, even when you close a logical connection, the real connection is not closed immediately, in anticipation of the next communication; it is eventually closed by the OS, outside your code's control. And even if it has been closed, your next request transparently opens a new one, so it doesn't really matter whether it was closed; it should be transparent to your code. In short, I assume you can move to HTTP/HTTPS without problems, but you will have to test and see. For other options for server-to-client communication, you can look at my answer to this question: How to continuously send data from backend to frontend when something changes
We have had bad experiences with long-standing HTTP/HTTPS connections. We used to schedule short jobs (only a couple of minutes) via HTTP and wait for them to finish before sending a response. This worked fine until the jobs got longer (hours) and some network infrastructure closed the inactive connections. We ended up only submitting the request via HTTP, getting an immediate response, and then implementing polling to wait for the result. At the time, the migration was pretty quick for us, but since then we have migrated even further to use "webhooks", i.e. allowing the processor of the job to signal its state back to the server at a known webhook address.
IMHO, you should upgrade your scheduler to a REST API server. WebSocket isn't effective in this scenario; the connection will be inactive most of the time.
The jobs can be short-lived or long-running. So, when a long-running job fails in the middle, how does the restart happen? Does it start from the beginning again?
In a similar scenario, we had a database to keep track of the progress of each job (the number of records successfully processed), so jobs could resume after a failure. With such a design, another web service can monitor the status of a job by looking at the database, and the main process is not impacted by constant polling from the client.
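The progress-table design described above can be sketched like this; ProgressStore is a hypothetical abstraction over the database table:

```java
import java.util.List;

public class ResumableJob {

    // Skip the records a previous run already processed; pure and testable.
    static <T> List<T> remaining(List<T> all, long recordsDone) {
        int done = (int) Math.min(recordsDone, all.size());
        return all.subList(done, all.size());
    }

    // Backed by a table like job_progress(job_id, records_done) in practice.
    interface ProgressStore {
        long load(String jobId);
        void save(String jobId, long recordsDone);
    }

    static void run(String jobId, List<String> records, ProgressStore store) {
        long done = store.load(jobId);              // resume point after a failure
        for (String record : remaining(records, done)) {
            process(record);
            store.save(jobId, ++done);              // visible to a monitoring service
        }
    }

    private static void process(String record) {
        // the actual batch work goes here
    }
}
```

Persisting progress after every record is the simplest correct choice; batching the saves trades a little replay on restart for less write load.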

SQS Java listener cluster nodes create duplicate data during a memory heap issue

We have an external system that pushes data to SQS, and the consumer is another Java service that listens, creates the requests, distributes the tasks, and then acknowledges (deletes the message from) SQS.
But a heap memory issue occurred at one point, and from the logs we can see that the Java consumer read the same queue message multiple times, threw a memory exception each time, and yet all of the attempts appear to have been saved later.
Since the message was never acknowledged, it remained available to the consumer, and the DB commits only happened after a certain while.
What approach can we take to avoid this kind of issue?
Exception handling, changing the SQS retry mechanism, some kind of existence-check validation...?
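One common mitigation for this kind of at-least-once redelivery is to make the consumer idempotent, keyed on the SQS message ID, so a redelivered message is detected before it is saved again. A minimal in-memory sketch; a real system would use a database unique constraint or conditional insert instead of in-process state:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {

    // IDs already handled. In production this would be a DB unique
    // constraint (insert fails on duplicate), not an in-memory set,
    // so it survives restarts and works across cluster nodes.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a given SQS messageId is offered;
    // redeliveries of the same message return false and should be skipped.
    boolean firstDelivery(String messageId) {
        return seen.add(messageId);
    }
}
```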

Access native KafkaConsumer in Camel RoutePolicy to change polling behaviour

I "monitor" the number of consecutive failures in my Camel processing pipeline with a Camel RoutePolicy.
When a threshold of failures is reached, I want to pause the processing for a configured amount of time because it probably means that the data from another system is not yet ready and therefore every message fails.
Since the source of my pipeline is a Kafka topic, I should not just stop the whole route because the broker would assume my consumer died and rebalance.
The best way to "pause" topic consumption seems to be to pause the KafkaConsumer (the native one, not Camel's). That way the consumer continues to poll the broker, but it does not fetch any messages. Exactly what I need.
But can I access the native KafkaConsumer from the RoutePolicy context to call its pause and resume methods?
The spring-kafka listener containers expose these methods, it would be nice to use them from Camel too.
This is not yet supported; the two methods must first be added to the camel-kafka consumer.
There is also an existing issue for it: https://issues.apache.org/jira/browse/CAMEL-15106
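Until camel-kafka exposes those methods, the behaviour can be sketched against the plain Kafka client. The broker address, topic name, and the failure threshold of 5 below are all assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PausingConsumer {

    // Pure decision helper, testable in isolation.
    static boolean shouldPause(int consecutiveFailures, int threshold) {
        return consecutiveFailures >= threshold;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "pipeline");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        int failures = 0;
        long pausedUntil = 0;
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            while (true) {
                // poll() keeps heartbeating even while partitions are paused,
                // so the broker does not consider the consumer dead and rebalance.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (pausedUntil > 0 && System.currentTimeMillis() >= pausedUntil) {
                    consumer.resume(consumer.paused());
                    pausedUntil = 0;
                    failures = 0;
                }
                for (ConsumerRecord<String, String> r : records) {
                    try {
                        process(r.value());
                        failures = 0;
                    } catch (Exception e) {
                        failures++;
                        if (shouldPause(failures, 5)) {
                            consumer.pause(consumer.assignment()); // stop fetching
                            pausedUntil = System.currentTimeMillis() + 60_000;
                        }
                    }
                }
            }
        }
    }

    private static void process(String value) {
        // pipeline step goes here
    }
}
```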

ActiveMQ stops receiving messages after servers left idle for few hours

I've been browsing the forums for the last few days and tried almost everything I could find, but without any luck.
The situation is: inside our Java web application we have ActiveMQ 5.7 (I know it's very old; eventually we will upgrade to a newer version, but for some reasons that's not possible right now). We have only one broker and multiple consumers.
When I start the servers (I have tried with 2, 3, 4 and more servers) everything is OK. The servers are communicating with each other, and QUEUE messages are consumed instantly. But when I leave the servers idle (for example, to finally catch some sleep ;) ) that is no longer the case. Messages are stuck in the database and are not being consumed. The only way to have them delivered is to restart the server.
Part of my configuration (we keep it in properties file, it's the actual state, however I have tried many different combinations):
BrokerServiceURI=broker:(tcp://0.0.0.0:{0})/{1}?persistent=true&useJmx=false&populateJMSXUserID=false&useShutdownHook=false&deleteAllMessagesOnStartup=false&enableStatistics=true
ConnectionFactoryURI=failover://({0})?initialReconnectDelay=100&timeout=6000
ConnectionFactoryServerURI=tcp://{0}:{1}?keepAlive=true&soTimeout=100&wireFormat.cacheEnabled=false&wireFormat.tightEncodingEnabled=false&wireFormat.maxInactivityDuration=0
BrokerService.startAsync=true
BrokerService.networkConnectorStartAsync=true
BrokerService.keepDurableSubsActive=false
Do you have a clue?
I can't actually tell you the reason from the description above, but I can list a few checks that are fresh in my mind. Please confirm whether the following are valid for you.
Can you check the consumer connections?
Are the consumer sessions still active?
If all the consumer connections are up, then check the thread dump to see whether the active consumer threads (I'm assuming you created consumer threads; correct me if I'm wrong) are in RUNNING or WAITING state. (This happened to me once: all the consumers were active, but another thread in the server was holding a lock on Logger while posting a message to Slack, so the consumers were stuck in WAITING state.)
Check the dispatch queue size for each consumer, check each consumer's prefetch limit, and compare the dispatch queue size with the prefetch limit.
Are you assigning a JMSXGroupID to each message?
Can you tell a little more about your consumer/producer/broker configurations?

JMS queue is full

My Java EE application sends JMS messages to a queue continuously, but sometimes the JMS consumer application stops receiving them. This causes the JMS queue to grow very large, even full, which brings down the server.
My server is JBoss or WebSphere. Do the application servers provide a strategy to remove "timed-out" JMS messages?
What is the strategy for handling a large JMS queue? Thanks!
With any asynchronous messaging you must deal with the "fast producer/slow consumer" problem. There are a number of ways to deal with this.
1. Add consumers. With WebSphere MQ you can trigger a queue based on depth. Some shops use this to add new consumer instances as queue depth grows; then, as queue depth begins to decline, the extra consumers die off. In this way, consumers can be made to scale automatically to accommodate changing loads. Other brokers generally have similar functionality.
2. Make the queue and underlying file system really large. This method attempts to absorb peaks in workload entirely in the queue, which is, after all, what queuing was designed to do in the first place. The problem is that it doesn't scale well, and you must allocate disk that will be almost empty 99% of the time.
3. Expire old messages. If the messages have an expiry set, then you can cause them to be cleaned up. Some JMS brokers will do this automatically, while on others you may need to browse the queue in order to cause the expired messages to be deleted. The problem with this is that not all messages lose their business value and become eligible for expiry. Most fire-and-forget messages (audit logs, etc.) fall into this category.
4. Throttle back the producer. When the queue fills, nothing can put new messages to it. In WebSphere MQ the producing application then receives a return code indicating that the queue is full. If the application distinguishes between fatal and transient errors, it can stop and retry.
The key to successfully implementing any of these is that your system be allowed to provide "soft" errors that the application will respond to. For example, many shops will raise the MAXDEPTH parameter of a queue the first time they get a QFULL condition. If the queue depth exceeds the size of the underlying file system the result is that instead of a "soft" error that impacts a single queue the file system fills and the entire node is affected. You are MUCH better off tuning the system so that the queue hits MAXDEPTH well before the file system fills but then also instrumenting the app or other processes to react to the full queue in some way.
But no matter what else you do, option #4 above is mandatory. No matter how much disk you allocate, how many consumer instances you deploy, or how quickly you expire messages, there is always a possibility that your consumers won't keep up with message production. When this happens, your producer app should throttle back, or raise an alarm and stop, or do anything other than hang or die. Asynchronous messaging is only asynchronous up to the point that you run out of space to queue messages. After that, your apps are synchronous and must gracefully handle that situation, even if that means a (graceful) shutdown.
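Option #4 (throttling the producer) might look like the sketch below in JMS. Note that whether a full queue surfaces as a ResourceAllocationException is broker-specific (ActiveMQ behaves this way when sendFailIfNoSpace is enabled), so treat the exception type and the retry bounds as assumptions:

```java
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageProducer;
import javax.jms.ResourceAllocationException;

public class ThrottlingSender {

    // Exponential backoff: 1s, 2s, 4s, ... capped. Pure and testable.
    static long backoffMillis(int attempt, long capMillis) {
        long d = 1000L << Math.min(attempt, 20);
        return Math.min(d, capMillis);
    }

    static void sendWithRetry(MessageProducer producer, Message msg, int maxAttempts)
            throws JMSException, InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                producer.send(msg);
                return;
            } catch (ResourceAllocationException full) {
                // The broker rejected the put because the queue is full:
                // a transient "soft" error, so back off and retry.
                if (attempt + 1 >= maxAttempts) {
                    throw full; // give up and raise an alarm upstream rather than hang
                }
                Thread.sleep(backoffMillis(attempt, 60_000));
            }
        }
    }
}
```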
Sure!
http://download.oracle.com/docs/cd/E17802_01/products/products/jms/javadoc-102a/index.html
Message#getJMSExpiration() is the relevant header, but note that per the JMS spec the provider sets it at send time; in practice you set message expiry by calling MessageProducer#setTimeToLive(long) before sending.
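Concretely, per the JMS spec the provider stamps JMSExpiration at send time as the send timestamp plus the producer's time-to-live (0 meaning "never expires"). A minimal sketch; the helper simply mirrors the spec's expiration arithmetic:

```java
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

public class ExpiringProducer {

    // Mirrors the JMS rule: expiration = send time + TTL, with TTL 0
    // meaning the message never expires (JMSExpiration stays 0).
    static long expirationFor(long sendTimeMillis, long ttlMillis) {
        return ttlMillis == 0 ? 0 : sendTimeMillis + ttlMillis;
    }

    static void sendExpiring(Session session, Queue queue, String text) throws JMSException {
        MessageProducer producer = session.createProducer(queue);
        producer.setTimeToLive(60_000); // messages expire one minute after sending
        producer.send(session.createTextMessage(text));
    }
}
```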
