How to dynamically apply a scheduled Kafka consumer based on topic?

I'm struggling to build a Kafka consumer that can schedule each record for execution at a future time.
To summarize: I have a large data set (a .csv file) whose records have two columns, timestamp and value. I want to process the values according to their timestamps. The first record has to be consumed immediately; each following record should be processed after a delay of 'current record timestamp - previous record timestamp' (the difference is small, only a few seconds, so the result is in milliseconds), and so on.
So basically I can't find a way to implement a Kafka consumer that takes each record by timestamp and applies that exact delay. I only need to simulate these values, and they have to be inserted into the DB according to that delay for things to work properly.
I've tried working with threads and executors, but for big data that is not a suitable approach.
I also tried creating dynamic topics on the producer side based on timestamp, subscribing to them, and then processing them through a queue. It didn't work.
I expect Kafka to consume each record with a delay based on its timestamp.

"I expect the kafka to consume each record with the delay based on timestamp"
If you need a specific delay between messages, Kafka is not the right tool.
When you send messages to Kafka, in most scenarios they travel over the network, which can add its own unpredictable delay. Kafka runs as a separate process, and nobody can guarantee when that process will be ready to receive the next message: the OS could suspend it, a GC pause could start, and so on. This adds further delay that nobody can predict.
Also, Kafka is not designed around the time at which a message was received. It focuses on message ordering, low latency, and high throughput, not on timing.
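If the goal is only to replay the file with its original spacing, the delays can be computed in the application that reads the CSV, independently of Kafka. The following is a minimal sketch, assuming the timestamps are epoch milliseconds in ascending order; the class and method names are hypothetical, and Thread.sleep is itself only approximately precise.

```java
import java.util.ArrayList;
import java.util.List;

final class ReplayDelays {
    /** Returns the wait before each record: 0 for the first, then current - previous. */
    static List<Long> delaysMillis(List<Long> timestampsMillis) {
        List<Long> delays = new ArrayList<>();
        Long previous = null;
        for (Long current : timestampsMillis) {
            // Clamp to 0 so an out-of-order timestamp never yields a negative delay.
            delays.add(previous == null ? 0L : Math.max(0L, current - previous));
            previous = current;
        }
        return delays;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical sample timestamps standing in for the CSV column.
        List<Long> timestamps = List.of(1_000L, 3_000L, 3_500L);
        for (long delay : delaysMillis(timestamps)) {
            Thread.sleep(delay); // wait out the gap, then insert the record into the DB
            System.out.println("processed after waiting " + delay + " ms");
        }
    }
}
```

For a large file the same subtraction can be done while streaming line by line, so only the previous timestamp has to be kept in memory.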

Related

How to control the number of messages that being emitted by Apache Kafka per a specific time?

I am new to Apache Kafka. I am trying to configure it so that it accepts as many messages from the producer as possible but delivers only a configured number of messages to the consumer per time period.
In other words, how do I configure Apache Kafka to send only, for example, 50 messages per 30 seconds to the consumer regardless of how many messages are waiting, and then in the next 30 seconds deliver another 50 of the cached messages, and so on?

If you have control over the consumer
You could use the max.poll.records property to limit the maximum number of records returned per poll() call. Then you only need to ensure that poll() is called once every 30 seconds.
In general, you can look up all available configuration properties in the Kafka consumer configuration documentation.
If you cannot control consumer
Then the only option is to write messages at the rate you need: at most 50 messages per 30 seconds. There are no configuration options for this. Only your application logic can achieve it.
Update: how to ensure poll() is called at the right interval
The simplest way is to:
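while (true) {
    // poll() needs a timeout (java.time.Duration); at most max.poll.records are returned
    consumer.poll(Duration.ofMillis(500));
    // .. do your stuff
    Thread.sleep(30000);
}
You can refine this by measuring the processing time (i.e. from the poll() call up to Thread.sleep()) and sleeping only for the remainder, so that one cycle never takes more than 30 seconds in total.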
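The time-compensated variant can be sketched as follows. The poll-and-process step is passed in as a Runnable so the example stays free of Kafka dependencies; in a real consumer it would wrap the poll() call and the processing. PacedLoop and remainingSleepMillis are hypothetical names, not part of any Kafka API.

```java
final class PacedLoop {
    /** Time left in the cycle after the work took elapsedMillis; never negative. */
    static long remainingSleepMillis(long cycleMillis, long elapsedMillis) {
        return Math.max(0L, cycleMillis - elapsedMillis);
    }

    /** Runs pollAndProcess once per cycle, for the given number of cycles. */
    static void run(Runnable pollAndProcess, long cycleMillis, int cycles)
            throws InterruptedException {
        for (int i = 0; i < cycles; i++) {
            long start = System.nanoTime();
            pollAndProcess.run(); // e.g. poll the consumer and handle the records
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            // Sleep only for what is left of the cycle, not for the full cycle.
            Thread.sleep(remainingSleepMillis(cycleMillis, elapsedMillis));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A 1-second cycle keeps the demo short; use 30_000 for the case discussed here.
        run(() -> System.out.println("polled"), 1_000, 3);
    }
}
```

If the sleep between polls ever approaches the consumer's max.poll.interval.ms, the consumer is considered failed and its partitions are reassigned, so the cycle length should stay well below that setting.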

The problem is that the producer doesn't really send messages to the consumer. There is a persistent Kafka topic in between, where the producer places its messages, and the producer doesn't care whether there is any consumer on the other side. The same holds from the consumer's perspective: it just subscribes to data from the topic and doesn't care whether there is a producer on the other side. So thinking about back-pressure from the consumer down to the producer, when there is messaging middleware in between, is the wrong direction.
On the other hand, it is not clear how the consumed messages may impact your third-party service. The point is that a Kafka consumer is single-threaded per partition, so all messages from one partition are (and must be) processed one by one in the same thread. This way you cannot send more than one message at a time to your service: the next one can be sent only after the previous one has been answered. So think about it: how could your consumer application even exceed the rate limit?
However, if you have enough partitions and high concurrency on the consumer side, you really may end up with several requests to your service in parallel from different threads. For this case I would suggest taking a look at the Rate Limiter pattern. This library provides a good implementation: https://resilience4j.readme.io/docs/ratelimiter. It is much better to keep messages in the topic than to try to limit the producer somehow.
To conclude: even if the consumer side is not your project, it is better to discuss with that team how to improve their consumer. You did your part well: the producer sends messages to the Kafka topic. What else can you do here?
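To illustrate the Rate Limiter pattern itself, here is a minimal hand-rolled fixed-window limiter. It is only a sketch of the idea and is not the resilience4j API; the class name, constructor parameters, and injected clock are assumptions made for the example.

```java
import java.util.function.LongSupplier;

final class FixedWindowRateLimiter {
    private final int permitsPerWindow;
    private final long windowMillis;
    private final LongSupplier clockMillis; // injected so the limiter can be tested
    private long windowStart;
    private int used;

    FixedWindowRateLimiter(int permitsPerWindow, long windowMillis, LongSupplier clockMillis) {
        this.permitsPerWindow = permitsPerWindow;
        this.windowMillis = windowMillis;
        this.clockMillis = clockMillis;
        this.windowStart = clockMillis.getAsLong();
    }

    /** Returns true if the caller may proceed now, false if the window is exhausted. */
    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        if (now - windowStart >= windowMillis) { // a new window has begun
            windowStart = now;
            used = 0;
        }
        if (used < permitsPerWindow) {
            used++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FixedWindowRateLimiter limiter =
                new FixedWindowRateLimiter(2, 1_000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            System.out.println("request " + i + " allowed: " + limiter.tryAcquire());
        }
    }
}
```

Consumer threads would call tryAcquire() before each request to the downstream service and wait or retry when it returns false. The method is synchronized so that one limiter instance can be shared across those threads.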

Interesting use case, and I'm not sure why you need it, but here are two possible solutions: 1. To protect the cluster, you could use quotas, which limit bandwidth throughput rather than the number of messages: https://kafka.apache.org/documentation/#design_quotas . 2. If you need an exact number of messages per time frame, you could put a buffering service (rate limiter) in between, which consumes from the original topic, pauses, and publishes messages to the topic the end consumer reads. The rate limiter could consume the next 50 messages and then pause until the minute passes. This will increase the space used on your cluster because of the duplicated messages. You also need to be careful about how you pause the consumer: heartbeats need to be sent, otherwise your consumer will rebalance continuously, i.e. you can't just sleep until the next minute. This is obviously for the case where you can't control the end consumer.

On the Kafka Java consumer client, is there a way to monitor health status as opposed to simply no-data?

I have a typical kafka consumer/producer app that is polling all the time for data. Sometimes, there might be no data for hours, but sometimes there could be thousands of messages per second. Because of this, the application is built so it's always polling, with a 500ms duration timeout.
However, I've noticed that sometimes, if the Kafka cluster goes down, the consumer client, once started, won't throw an exception. It simply times out at 500ms and keeps returning empty ConsumerRecords<K,V>. So, as far as the application is concerned, there is no data to consume, when in reality the whole Kafka cluster could be unreachable and the app has no idea.
I checked the docs, and I couldn't find a way to validate consumer health, other than maybe closing the connection and subscribing to the topic every single time, but I really don't want to do that on a long-running application.
What's the best way to validate that the consumer is active and healthy while polling, ideally from the same thread/client object, so that the app can distinguish between no data and an unreachable kafka cluster situation?

I am sure this is not the best way to achieve what you are looking for.
But one simple approach I implemented in my application is to maintain a static counter, emptyRecordSetReceived. Whenever the poll operation returns an empty record set, I increment this counter.
The counter was emitted to Graphite at a periodic interval (say every minute) with the help of the application's metric registry.
Now let's say you know the maximum time frame during which no message will be available for this application to consume. For example, say 6 hours. Given that you are polling every 500 milliseconds, you know that if no message arrives for 6 hours, the counter will increase by
2 polls per second * 60 seconds * 60 minutes * 6 hours = 43200.
We placed an alerting check on this counter value as reported to Graphite. The metric gave me a decent idea of whether there was a genuine problem in the application or whether something was down on the broker or producer side.
This is just the naive way I solved this use case to some extent. I would love to hear how it is actually done without maintaining such counters.
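The counter idea can be sketched in a few lines. This is a minimal illustration, not the code from the application described above: the class name is hypothetical, and resetting the counter on a non-empty poll is an assumption added so that the value means "consecutive empty polls".

```java
import java.util.concurrent.atomic.AtomicLong;

final class EmptyPollMonitor {
    private final AtomicLong consecutiveEmptyPolls = new AtomicLong();
    private final long threshold;

    EmptyPollMonitor(long pollTimeoutMillis, long maxQuietMillis) {
        // Number of back-to-back empty polls that fit into the longest expected quiet period.
        this.threshold = maxQuietMillis / pollTimeoutMillis;
    }

    /** Call after every poll with the number of records it returned. */
    void recordPoll(int recordCount) {
        if (recordCount == 0) {
            consecutiveEmptyPolls.incrementAndGet();
        } else {
            consecutiveEmptyPolls.set(0); // assumption: any data resets the streak
        }
    }

    long threshold() {
        return threshold;
    }

    /** True once more empty polls were seen than the quiet period can explain. */
    boolean suspicious() {
        return consecutiveEmptyPolls.get() > threshold;
    }

    public static void main(String[] args) {
        EmptyPollMonitor monitor = new EmptyPollMonitor(500, 6L * 60 * 60 * 1000);
        monitor.recordPoll(0);
        System.out.println("threshold=" + monitor.threshold()
                + " suspicious=" + monitor.suspicious());
    }
}
```

The value of suspicious() (or the raw counter) is what would be exported to the metrics system and alerted on.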

Trigger Lambda by number of SQS messages

I have an SQS queue that will receive a huge number of messages. The messages keep coming to the queue.
I have a use case where, if the number of messages in the queue reaches X (such as 1,000), the system needs to trigger an event to process 1,000 messages at a time.
The system will produce the triggers in chunks, each trigger carrying a thousand messages.
For example, if we have 2300 messages in a queue, we expect 3 triggers to a lambda function, the first 2 triggers corresponding to 1,000 messages each, and the last one containing 300 messages.
I've been researching and see that a CloudWatch alarm can hook into the SQS metric "NumberOfMessagesReceived" and send to SNS. But I don't know how to configure a series of alarms, one for each 1,000 messages.
Please advise whether AWS supports this use case or what customization we can make to achieve it.

So after going through some clarifications in the comments section with the OP, here's my answer (combined with @ChrisPollard's comment):
Achieving what you want with SQS is impossible, because every batch can only contain up to 10 messages. Since you need to process 1000 messages at once, this is definitely a no-go.
@ChrisPollard suggested creating a new record in DynamoDB every time a new file is pushed to S3. This is a very good approach. Increment the partition key by 1 every time and trigger a lambda through DynamoDB Streams. In your function, run a check against your partition key and, if it equals 1000, run a query against your DynamoDB table filtering the last 1000 updated items (you'll need a Global Secondary Index on your CreatedAt field). Map these items (or use Projections) to create a very minimal JSON that contains only the necessary information. Something like:
[
{
"key": "my-amazing-key",
"bucket": "my-super-cool-bucket"
},
...
]
A JSON like this is only 87 bytes long (if you leave out the square brackets, because they won't be repeated, you're left with 83 bytes). If you round it up to 100 bytes per item, you can still successfully send all 1000 items as one event to SQS, as it will only be around 100KB of data.
Then have one Lambda function subscribe to your SQS queue and finally concatenate the one thousand files.
Things to keep in mind:
Make sure you really create the createdAt field in DynamoDB. By the time the count hits one thousand, new items could have been inserted, so this way you make sure you are reading the 1000 items that you expected.
In your Lambda check, just test batchId % 1000 == 0. This way you don't need to delete anything, saving DynamoDB operations.
Watch out for the execution time of your Lambda. Concatenating 1000 files at once may take a while, so I'd run a couple of tests and put 1 minute of overhead on top. I.e., if it usually takes 5 minutes, set your function's timeout to 6 minutes.
If you have new info to share I am happy to edit my answer.

You can add alarms at 1k, 2k, 3k, etc...but that seems clunky.
Is there a reason you're letting the messages batch up? You can make this trigger event-based (when a queue message is added fire my lambda) and get rid of the complications of batching them.

I handled a very similar situation recently. Process A puts objects in an S3 bucket, and every time it does so it puts a message in SQS with the key and bucket details. I have a lambda that is triggered every hour, but it can be any trigger, like your CloudWatch alarm. Here is what you can do on every trigger:
Read the messages from the queue. SQS allows you to read only 10 messages at a time, so every time you read messages, keep adding them to a list in your lambda. You also get a receipt handle for every message, which you can use to delete it. Repeat this process until you have read all 1000 messages in your queue. Now you can perform whatever operations are required on your list and feed it to process B in a number of different ways, such as a file in S3 and/or a new queue that process B can read from.
Alternate approach to reading messages: SQS allows you to read only 10 messages at a time, but you can send the optional parameter 'VisibilityTimeout':60, which hides the received messages from the queue for 60 seconds. You can then read repeatedly until you don't see any more messages in the queue, all while adding them to a list in the lambda for processing. This can be tricky, since you have to try out different visibility timeout values based on how long it takes to read 1000 messages. Once you know you have read all the messages, you can simply take the receipt handles and delete all of them. You can also purge the queue, but then you may delete messages that came in during this process and were never read.
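The read-in-tens loop can be sketched without the AWS SDK by passing the receive call in as a function. In this sketch `receive` stands in for an SQS ReceiveMessage call and takes the maximum number of messages to return; QueueDrainer, drain, and the simulated in-memory queue are hypothetical, and deleting messages by receipt handle is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntFunction;

final class QueueDrainer {
    static final int MAX_PER_RECEIVE = 10; // SQS returns at most 10 messages per receive call

    /** Calls receive repeatedly until target messages are collected or the queue is empty. */
    static List<String> drain(IntFunction<List<String>> receive, int target) {
        List<String> collected = new ArrayList<>();
        while (collected.size() < target) {
            int want = Math.min(MAX_PER_RECEIVE, target - collected.size());
            List<String> batch = receive.apply(want);
            if (batch.isEmpty()) {
                break; // nothing left in the queue
            }
            collected.addAll(batch);
        }
        return collected;
    }

    public static void main(String[] args) {
        // Simulated queue holding 2300 messages.
        Deque<String> queue = new ArrayDeque<>();
        for (int i = 0; i < 2300; i++) {
            queue.add("msg-" + i);
        }
        IntFunction<List<String>> receive = max -> {
            List<String> out = new ArrayList<>();
            while (out.size() < max && !queue.isEmpty()) {
                out.add(queue.poll());
            }
            return out;
        };
        List<String> chunk;
        while (!(chunk = drain(receive, 1000)).isEmpty()) {
            System.out.println("chunk of " + chunk.size());
        }
    }
}
```

With 2300 queued messages this yields chunks of 1000, 1000, and 300, matching the example in the question.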

Unable to consume all new Kinesis messages

I'm converting a Java application from Kafka to Kinesis. This application runs forever. It sleeps for 30 seconds, then wakes up, runs some HBase queries, consumes and processes any new Kafka messages, then sleeps again.
This works fine in Kafka - that's exactly what the default Consumer does. However this is not the case in Kinesis. Consuming from the KCL requires the KCL consumer to be running at all times, which doesn't work for my needs. I need to be able to consume all new messages as required with a single method call.
The official documentation for the Kinesis Java API says:
You retrieve records from the stream on a per-shard basis. For each shard, and for each batch of records that you retrieve from that shard, you need to obtain a shard iterator.
and
If no records are returned, that means no data records are currently available from this shard at the sequence number referenced by the shard iterator. When this situation occurs, your application should wait for an amount of time
But I don't care about shards! I just want to get all messages since I last consumed, in one method call. And what if my app dies and needs to restart; how will it know where to resume?
Current code:
GetRecordsRequest getRecordsRequest = new GetRecordsRequest();
getRecordsRequest.setShardIterator(TRIM_HORIZON);
getRecordsRequest.setLimit(25);
GetRecordsResult result = client.getRecords(getRecordsRequest);
// Put the result into record list. The result can be empty.
records = result.getRecords();
EDIT
To be clearer, with Kafka I can run:
ConsumerRecords<String, String> records = this.consumer.poll(0);
to get all unconsumed messages. If my app dies and restarts, there's no problem, offsets are taken care of and I'll resume where I left off.
How do I do this in Kinesis?

To answer your question, you can use StockTradeRecordProcessor, which has an option to reset the stats, which in turn lets you consume only the new messages. Refer here to find the implementation of StockTradeRecordProcessor.
Note, however, that this method uses 60-second intervals for reporting and checkpointing, not the 30 seconds your application requires.

Kafka - Delayed Queue implementation using high level consumer

Want to implement a delayed consumer using the high level consumer api
main idea:
produce messages by key (each msg contains creation timestamp) this makes sure that each partition has ordered messages by produced time.
auto.commit.enable=false (will explicitly commit after each message process)
consume a message
check message timestamp and check if enough time has passed
process message (this operation will never fail)
commit 1 offset
while (it.hasNext()) {
  val msg = it.next().message()
  // checks timestamp in msg to see delay period exceeded
  while (!delayedPeriodPassed(msg)) {
    waitSomeTime() // Thread.sleep or something....
  }
  // certain that the msg was delayed and can now be handled
  Try { process(msg) } // the msg process will never fail the consumer
  consumer.commitOffsets // commit each msg
}
some concerns about this implementation:
committing each offset might slow ZK down
can consumer.commitOffsets throw an exception? if yes, I will consume the same message twice (can solve with idempotent messages)
problem with waiting a long time without committing the offset: for example, if the delay period is 24 hours, the consumer will get the next message from the iterator, sleep for 24 hours, then process and commit (ZK session timeout?)
how can the ZK session be kept alive without committing new offsets? (setting a high zookeeper.session.timeout.ms can result in a dead consumer going unrecognised)
any other problems I'm missing?
Thanks!

One way to go about this would be to use a different topic where you push all messages that are to be delayed. If all delayed messages should be processed after the same time delay, this will be fairly straightforward:
while(it.hasNext()) {
  val message = it.next().message()
  if(shouldBeDelayed(message)) {
    val delay = 24 hours
    val delayTo = getCurrentTime() + delay
    putMessageOnDelayedQueue(message, delay, delayTo)
  }
  else {
    process(message)
  }
  consumer.commitOffset()
}
All regular messages will now be processed as soon as possible, while those that need a delay get put on another topic.
The nice thing is that we know that the message at the head of the delayed topic is the one that should be processed first since its delayTo value will be the smallest. Therefore we can set up another consumer that reads the head message, checks if the timestamp is in the past and if so processes the message and commits the offset. If not it does not commit the offset and instead just sleeps until that time:
while(it.hasNext()) {
  val delayedMessage = it.peek().message()
  if(delayedMessage.delayTo < getCurrentTime()) {
    val readMessage = it.next().message()
    process(readMessage.originalMessage)
    consumer.commitOffset()
  } else {
    delayProcessingUntil(delayedMessage.delayTo)
  }
}
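The head-of-queue check can be illustrated in plain Java with an in-memory deque standing in for the delayed topic. This is only a sketch of the logic under that assumption: DelayedQueueWorker, Delayed, and processDue are hypothetical names, and removing an element from the deque plays the role of committing the offset.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

final class DelayedQueueWorker {
    /** A message plus the earliest time (epoch millis) at which it may be processed. */
    static final class Delayed {
        final String payload;
        final long delayTo;

        Delayed(String payload, long delayTo) {
            this.payload = payload;
            this.delayTo = delayTo;
        }
    }

    /**
     * Processes every due message at the head of the queue.
     * Returns how long to sleep until the next head is due, or -1 if the queue is empty.
     */
    static long processDue(Deque<Delayed> queue, long nowMillis, Consumer<String> handler) {
        while (!queue.isEmpty()) {
            Delayed head = queue.peek();
            if (head.delayTo > nowMillis) {
                // Head is not due yet; with a uniform delay nothing behind it is due either.
                return head.delayTo - nowMillis;
            }
            handler.accept(queue.poll().payload); // due: process, then drop it ("commit")
        }
        return -1;
    }

    public static void main(String[] args) {
        Deque<Delayed> queue = new ArrayDeque<>();
        long now = System.currentTimeMillis();
        queue.add(new Delayed("already due", now - 1_000));
        queue.add(new Delayed("due later", now + 60_000));
        long sleepFor = processDue(queue, now, msg -> System.out.println("processing " + msg));
        System.out.println("sleep for " + sleepFor + " ms before checking again");
    }
}
```

Returning the remaining wait lets the caller decide how to sleep, for example in short slices so that the consumer keeps its session alive.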
In case there are different delay times, you could partition the topic by delay (e.g. 24 hours, 12 hours, 6 hours). If the delay time is more dynamic than that, it becomes a bit more complex. You could solve it by introducing two delay topics. Read all messages off delay topic A and process all the messages whose delayTo value is in the past. Among the others, find the one with the closest delayTo and put them all on topic B. Sleep until the closest one should be processed and then do it all in reverse, i.e. process messages from topic B and put the ones that shouldn't yet be processed back on topic A.
To answer your specific questions (some have been addressed in the comments to your question)
Commit each offset might slow ZK down
You could consider switching to storing the offset in Kafka (a feature available from 0.8.2, check out offsets.storage property in consumer config)
Can consumer.commitOffsets throw an exception? if yes, I will consume the same message twice (can solve with idempotent messages)
I believe it can, if it is not able to communicate with the offset storage for instance. Using idempotent messages solves this problem though, as you say.
Problem waiting long time without committing the offset, for example delay period is 24 hours, will get next from iterator, sleep for 24 hours, process and commit (ZK session timeout?)
This won't be a problem with the above outlined solution unless the processing of the message itself takes more than the session timeout.
How can the ZK session be kept alive without committing new offsets? (setting a high zookeeper.session.timeout.ms can result in a dead consumer going unrecognized)
Again with the above you shouldn't need to set a long session timeout.
Any other problems I'm missing?
There always are ;)

Use Tibco EMS or another JMS queue. They have retry delay built in. Kafka may not be the right design choice for what you are doing.

I would suggest another route in your case.
It doesn't make sense to handle the waiting time in the consumer's main thread. That is an anti-pattern for how queues are used. Conceptually, you need to process messages as fast as possible and keep the queue lightly loaded.
Instead, I would use a scheduler that schedules a job for each message you need to delay. This way you can drain the queue and create asynchronous jobs that will be triggered at predefined points in time.
The downside of this technique is that it is sensitive to the state of the JVM that holds the scheduled jobs in memory. If that JVM fails, you lose the scheduled jobs and you don't know whether a given task was executed.
There are scheduler implementations, though, that can be configured to run in a clustered environment, keeping you safe from JVM crashes.
Take a look at this Java scheduling framework: http://www.quartz-scheduler.org/
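As a rough in-JVM illustration of the scheduling idea, the JDK's ScheduledExecutorService can be used. This sketch is not Quartz and has exactly the weakness described above (jobs are lost if the JVM dies); DelayedJobScheduler and its methods are hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class DelayedJobScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    /** Remaining wait for a message created at createdAtMillis that must be delayed by delayMillis. */
    static long remainingDelayMillis(long createdAtMillis, long delayMillis, long nowMillis) {
        return Math.max(0L, createdAtMillis + delayMillis - nowMillis);
    }

    /** Schedules the job to run once the message's delay has elapsed. */
    ScheduledFuture<?> schedule(Runnable job, long createdAtMillis, long delayMillis) {
        long wait = remainingDelayMillis(createdAtMillis, delayMillis, System.currentTimeMillis());
        return executor.schedule(job, wait, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        executor.shutdown();
    }

    public static void main(String[] args) throws Exception {
        DelayedJobScheduler scheduler = new DelayedJobScheduler();
        // The consumer thread would call schedule(...) per message and move on immediately.
        ScheduledFuture<?> future = scheduler.schedule(
                () -> System.out.println("delayed message processed"),
                System.currentTimeMillis(), 200);
        future.get(); // wait here only so the demo does not exit early
        scheduler.shutdown();
    }
}
```

Offsets committed before a scheduled job has run are the risky part of this design, which is why a persistent or clustered scheduler is suggested for anything that must survive a restart.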

We had the same issue during one of our tasks. Although it was eventually solved without delayed queues, the best approach we found while exploring solutions was to use the pause and resume functionality provided by the KafkaConsumer API. This approach and its motivation are described well here: https://medium.com/naukri-engineering/retry-mechanism-and-delay-queues-in-apache-kafka-528a6524f722

A keyed list processed on a schedule, or its Redis alternative, may be the best approach.
