Are Cassandra heartbeat requests costly? - java

I am using the default heartbeat interval of 30 seconds. Has anybody experienced any load issues using that default interval?

It depends on the number of reads your app does in a given window. If you are sure that your app is going to keep hitting the database with queries, you can go with a longer interval or even disable the heartbeat by setting it to 0. If you have an on-and-off load pattern (for example, requests for two minutes and then nothing for the next two minutes), the heartbeat is critical to keep the connection alive; otherwise, establishing a new connection would be costly.
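For reference, a minimal sketch of raising (or disabling) the interval with the DataStax Java driver 3.x; the contact point and the 60-second value below are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PoolingOptions;

    public class HeartbeatConfig {
        public static void main(String[] args) {
            // Raise the heartbeat interval (default is 30s); setting it to 0 disables heartbeats.
            PoolingOptions poolingOptions = new PoolingOptions()
                    .setHeartbeatIntervalSeconds(60);

            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")        // placeholder contact point
                    .withPoolingOptions(poolingOptions)
                    .build();

            cluster.connect();   // sessions created from this Cluster use the interval above
        }
    }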

Related

Design session expiration service

I have an integration with a video stream provider. The flow is as follows: the user requests a stream URL, we request it from the stream provider on the user's behalf and return it to the user, and then we need to prolong the stream id (session) every 10 seconds. To minimize interaction with the client, and because of the slow network, we want to do this session prolongation on behalf of the user. So the user will trigger roughly one request every 2-5 minutes, while the server triggers session prolongation requests every 10 seconds.
The question is about the possible design of such a service. I have not found a better solution than simply iterating over all available session keys periodically and calling the prolongation service.
But this approach has disadvantages: when the user count gets really big it could slow down processing, and it is hard to scale.
Do you have ideas about how to overcome this, or can you propose a better solution?
I would write the keep-alive as a single self-contained piece of code that calls the keep-alive endpoint every x seconds for y amount of time before ending itself, where x, y and the keep-alive endpoint are startup parameters.
Each time the user triggers a request, kick one of these off in the background. How you package that depends on your deployment environment and how you intend to manage scaling out (background thread, new process, serverless function, etc.).
You may need to maintain some state info in a cache for management purposes (don't start a new one if one is already running, hung process states, etc.).
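A minimal sketch of such a self-contained keep-alive in Java; the prolong() call, the class name, and the scheduling details are assumptions for illustration:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    public class SessionKeepAlive {

        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Prolong sessionId every intervalSeconds (x), stop after totalSeconds (y).
        public void start(String sessionId, long intervalSeconds, long totalSeconds) {
            ScheduledFuture<?> task = scheduler.scheduleAtFixedRate(
                    () -> prolong(sessionId), intervalSeconds, intervalSeconds, TimeUnit.SECONDS);

            // One-shot timer that ends the keep-alive after the configured lifetime.
            scheduler.schedule(() -> {
                task.cancel(false);
                scheduler.shutdown();
            }, totalSeconds, TimeUnit.SECONDS);
        }

        private void prolong(String sessionId) {
            // placeholder: call the stream provider's prolongation endpoint here
        }
    }

A cache entry keyed by session id, as suggested above, can then guard against kicking off a second keep-alive for a session that already has one running.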

On the Kafka Java consumer client, is there a way to monitor health status as opposed to simply no-data?

I have a typical Kafka consumer/producer app that is polling all the time for data. Sometimes there might be no data for hours, but sometimes there could be thousands of messages per second. Because of this, the application is built so it's always polling, with a 500ms poll timeout.
However, I've noticed that sometimes, if the Kafka cluster goes down, the consumer client, once started, won't throw an exception; it will simply time out after 500ms and keep returning empty ConsumerRecords<K,V>. So, as far as the application is concerned, there is no data to consume, when in reality the whole Kafka cluster could be unreachable, but the app itself has no idea.
I checked the docs, and I couldn't find a way to validate consumer health, other than maybe closing the connection and subscribing to the topic every single time, but I really don't want to do that in a long-running application.
What's the best way to validate that the consumer is active and healthy while polling, ideally from the same thread/client object, so that the app can distinguish between no data and an unreachable Kafka cluster?
I am sure this is not the best way to achieve what you are looking for.
But one simple way which I implemented in my application is to maintain a static counter, emptyRecordSetReceived, and increment it whenever a poll operation returns an empty record set.
This counter was emitted to Graphite at a periodic interval (say every minute) with the help of the metric registry in the application.
Now let's say you know the maximum time frame for which no message is expected to be available for this application to consume, for example 6 hours. Given that you are polling every 500 milliseconds, you know that if no message is received for 6 hours, the counter would increase by
2 polls per second * 60 seconds * 60 minutes * 6 hours = 43200.
We placed an alerting check based on this counter value reported to Graphite. This metric gave me a decent idea of whether it was a genuine problem on the application side or whether something was down on the broker or producer side.
This is just the naive way I had solved this use case to some extent. I would love to hear how it is actually done without maintaining these counters.
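A rough sketch of that counter idea; the broker address, topic name, and processing path are placeholders, and the Graphite reporter wiring is omitted (the counter is just registered in a Dropwizard MetricRegistry):

    import com.codahale.metrics.Counter;
    import com.codahale.metrics.MetricRegistry;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class EmptyPollMonitor {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
            props.put("group.id", "monitor-demo");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            MetricRegistry registry = new MetricRegistry();
            Counter emptyPolls = registry.counter("consumer.emptyRecordSetReceived");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));   // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    if (records.isEmpty()) {
                        emptyPolls.inc();   // at 2 polls/s, ~43200 increments is roughly 6 hours of silence
                    } else {
                        // hand the records to the normal processing path here
                    }
                }
            }
        }
    }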

Timeouts for multiple clients in Java

I have one server and multiple clients. At some period, each client sends an alive packet to the server. (At the moment, the server doesn't respond to alive packets.) The period may differ from device to device and is configurable at runtime, for both the server and the clients. I want to generate an alert when one or more clients doesn't send the alive packet (misses one packet, or two in a row, etc.). This aliveness information is used by other parts of the application, so the quicker the notice the better. I came up with some ideas but I couldn't choose between them.
Create a task that compares every client's last alive-packet timestamp with the current time and generates the alert(s). Run this task at some period, which should be smaller than the minimum client period.
Actually that seems better to me; however, this way I unnecessarily check some clients (e.g. if client periods range from 1 to 5 minutes, the task should run at least every minute, so checking clients with periods above 2 minutes is redundant). Also, if the minimum client period decreases, I have to decrease the task's period as well.
Create a task for each client that compares that client's last alive-packet timestamp with the current time, then sleeps for that client's period.
In this way, if the number of clients gets very high, there will be dozens of tasks. Since they will sleep most of the time, I still doubt this is more elegant.
Is there any idiom or pattern for this kind of situation? I think a watchdog-style implementation would suit it well, but I haven't seen something like that in Java.
Approach 2 is not very useful, as it is a vague idea to write 100 tasks for 100 clients.
Approach 1 can be optimized if you use the average client period instead of the minimum.
It depends on your needs.
Is it critical if the alert is generated a few seconds later (or earlier) than it should be?
If not, then maybe it's worth grouping clients with similar heartbeat intervals and running the check against a group of clients rather than a single client. This would decrease the number of tasks (100 -> 10) and increase the number of clients handled by a single task (1 -> 10).
The first approach is fine.
The only thing I would suggest is to create an independent service to do this check. If you run this task as a thread inside your server, it won't be very manageable: imagine your control thread breaks or is killed, how would you notice? So build an independent OS service, another Java program, that checks the last alive timestamps periodically.
That way you can easily modify and restart your service and see its logs separately. Depending on its importance, you may even build a "watchdog of the watchdog" service too.
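For reference, a minimal in-process sketch of approach 1; the two-missed-packets rule and the alert hook are placeholder assumptions, and the same scan could just as well live in the separate watchdog process suggested above:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class AliveWatchdog {

        private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
        private final Map<String, Long> periodMillis = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Called whenever an alive packet arrives from a client.
        public void onAlivePacket(String clientId, long clientPeriodMillis) {
            lastSeenMillis.put(clientId, System.currentTimeMillis());
            periodMillis.put(clientId, clientPeriodMillis);
        }

        // Run the scan at (roughly) the smallest client period.
        public void start(long checkIntervalMillis) {
            scheduler.scheduleAtFixedRate(this::checkAll,
                    checkIntervalMillis, checkIntervalMillis, TimeUnit.MILLISECONDS);
        }

        private void checkAll() {
            long now = System.currentTimeMillis();
            lastSeenMillis.forEach((clientId, lastSeen) -> {
                Long period = periodMillis.get(clientId);
                if (period != null && now - lastSeen > 2 * period) {   // two packets missed in a row
                    alert(clientId);
                }
            });
        }

        private void alert(String clientId) {
            // placeholder: notify the rest of the application
        }
    }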

RESTful: What is the difference between ClientProperties.CONNECT_TIMEOUT and ClientProperties.READ_TIMEOUT in Jersey?

For setting up timeouts while making REST calls we should specify both of these parameters, but I'm not sure why both are needed and exactly what different purposes they serve. Also, what happens if we set only one of them, or both with different values?
CONNECT_TIMEOUT is the amount of time it will wait to establish the connection to the host. Once connected, READ_TIMEOUT is the amount of time allowed for the server to respond with all of the content in a given request.
How you set either one will depend on your requirements, but they can be different values. CONNECT_TIMEOUT should not require a large value, because it is only the time required to set up a socket connection with the server. 30 seconds should be ample time - frankly, if the connection is not established within 10 seconds it is too long, and the server is likely hosed, or at least overloaded.
READ_TIMEOUT - this could be longer, especially if you know that the action/resource you requested takes a long time to process. You might set this as high as 60 seconds, or even several minutes. Again, this depends on how critical it is that you wait for confirmation that the process completed, and you'll weigh this against how quickly your system needs to respond on its end. If your client times out while waiting for the process to complete, that doesn't necessarily mean that the process stopped, it may keep on running until it is finished on the server (or at least, until it reaches the server's timeout).
If these calls are directly driving an interface, then you may want much lower times, as your users may not have the patience for such a delay. If it is called in a background or batch process, then longer times may be acceptable. This is up to you.
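A minimal sketch of setting both properties on a Jersey 2.x client; the URL and the timeout values are placeholders, and both properties take milliseconds:

    import javax.ws.rs.client.Client;
    import javax.ws.rs.client.ClientBuilder;
    import javax.ws.rs.core.Response;
    import org.glassfish.jersey.client.ClientProperties;

    public class TimeoutExample {

        public static void main(String[] args) {
            Client client = ClientBuilder.newClient()
                    .property(ClientProperties.CONNECT_TIMEOUT, 10_000)   // time allowed to establish the connection
                    .property(ClientProperties.READ_TIMEOUT, 60_000);     // time allowed to wait for the response data

            Response response = client.target("https://example.com/api/resource")   // placeholder URL
                    .request()
                    .get();
            System.out.println(response.getStatus());
        }
    }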

Why does GAE spawn new instances even though I have set min idle=1 and pending latency=max?

I have a low/sporadic-load application and the latency caused by starting new instances (around 10s) far exceeds the time needed to process my requests, which typically complete in less than 500ms.
So in order to avoid the latency spikes caused by the spawning of new instances ("loading requests"), I made the following two settings:
set min idle instances = max idle instances = 1, to ensure that there is always one instance running (one instance is enough to handle my traffic); and
set the pending latency to 15s, so that GAE waits for up to 15s for the one resident instance to become free rather than start a new one.
Billing is activated. However, GAE still starts new instances, resulting in unacceptable latency. Why is that?
In the logs I can see that my requests always return in less than 500ms; there is no way that a request would be queued for up to 15s.
What can I do about this? Any help much appreciated.
Update: my solution was to set up a cron job which issues a request every 5 minutes, to always have a dynamic instance running. As it turned out (see answer below), idle instances are reserved for crazy load spikes, not the low-load scenario that I'm in 99% of the time.
As #koma says, App Engine will create a dynamic instance to keep the number of idle instances constant, but not only will it create a new one, it will also use it immediately instead of using the idle one, on average. Even if you have a bunch of idle instances, App Engine will in fact still prefer spinning up dynamic ones when a single request comes in, and will "save" the idle ones in case of crazy traffic spikes.
This is highly counter-intuitive, because you'd expect it to use the instances that are already idling around to serve requests and spin up dynamic ones for future requests, but that's not how it works.
If you set min idle instances = 1, it will definitely spawn another instance at the first request, because there is no longer any idle instance (it is busy processing the first request!).
And since a new instance has been started, it might as well process some requests and no longer be idle?
see also Google App Engine Instances keep quickly shutting down
