How to reprocess a batched Kafka Stream


I want to batch messages based on the timestamp the message was created. Furthermore, I want to batch these messages in fixed time windows (of 1 minute). Only after the passing of the window, the batch should be pushed downstream.
For this to work, the Processor API seems more or less fitting (a la KStream batch process windows):
public void process(String key, SensorData sensorData) {
    // epochTime is rounded to our preferred time window (1 minute)
    long epochTime = sensorData.epochTime;
    ArrayList<SensorData> data = messageStore.get(epochTime);
    if (data == null) {
        data = new ArrayList<SensorData>();
    }
    data.add(sensorData);
    // write the batch back under the same key it was read with
    messageStore.put(epochTime, data);
    this.context.commit();
}
@Override
public void init(ProcessorContext context) {
    this.context = context;
    this.context.schedule(60000); // equal to 1 minute
}
@Override
public void punctuate(long streamTime) {
    // the store is keyed by the rounded epoch time, hence Long keys
    KeyValueIterator<Long, ArrayList<SensorData>> it = messageStore.all();
    while (it.hasNext()) {
        KeyValue<Long, ArrayList<SensorData>> entry = it.next();
        this.context.forward(entry.key, entry.value);
    }
    it.close(); // store iterators must be closed after use
    //reset messageStore
}
However, this approach has one major downside: we don't use Kafka Streams windows.
Out-of-order messages are not considered.
When operating in real time, the punctuation schedule should be equal to the desired batch time window. If we set it too short, the batch will be forwarded and downstream computation will start too quickly. If we set it too long, and punctuation is triggered when a batch window is not finished yet, we have the same problem.
Also, replaying historic data whilst keeping the punctuation schedule (1 minute) will trigger the first computation only after 1 minute. If so, that will blow up the state store and also feels wrong.
Taking these points into consideration, I should use Kafka Streams windows. But this is only possible in the Kafka Streams DSL...
Any thoughts on this would be awesome.

You can mix and match the DSL and the Processor API by using process(), transform(), or transformValues() within the DSL (there are other SO questions about this already, so I do not elaborate further). Thus, you can use the regular window construct in combination with a custom (downstream) operator to hold the result back (and deduplicate). Some deduplication will already happen automatically within your window operator (as of Kafka 0.10.1; see http://docs.confluent.io/current/streams/developer-guide.html#memory-management), but if you want to have exactly one result, the cache will not do it for you.
About punctuate: it is triggered based on progress (i.e., stream time) and not based on wall-clock time, so if you reprocess old data, it will be called exactly the same number of times as in your original run (just in quicker succession in wall-clock terms, because you process older data faster). There are also some SO questions about this if you want more details.
However, a general consideration: why do you need exactly one result? If you do stream processing, you might want to build your downstream consumer application so that it can handle updates to your result. This is the inherent design of Kafka: using changelogs.
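To illustrate the point about handling updates, here is a minimal sketch of a downstream view that treats every record as an update (upsert) to a per-window result rather than as a final answer. It is plain Java with no Kafka dependency; the class name WindowedResultView and its methods are hypothetical and only stand in for whatever your consumer application does.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical downstream view: keeps only the latest result per (key, window). */
class WindowedResultView {
    // Latest known value per "key@windowStartMs"; later updates overwrite earlier ones.
    private final Map<String, Long> latest = new HashMap<>();

    /** Applies one changelog record; replaying the same record again is harmless. */
    void apply(String key, long windowStartMs, long value) {
        latest.put(key + "@" + windowStartMs, value);
    }

    /** Returns the current result for a window, or null if no record was seen for it. */
    Long get(String key, long windowStartMs) {
        return latest.get(key + "@" + windowStartMs);
    }
}
```

Because each update simply replaces the previous value for the same window, it does not matter how many intermediate results the window operator emits or whether some of them are replayed.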

Related

How to create streaming Beam pipeline that is triggered once and only once in a fixed interval

I need to create an Apache Beam (Java) streaming job that should start once (and only once) every 60 seconds.
I got it working correctly using DirectRunner by using GenerateSequence, Window, and Combine.
However when I run it on Google Dataflow, sometimes it is triggered more than once within the 60 seconds window. I am guessing it has something to do with delays and out of order messages.
Pipeline pipeline = Pipeline.create(options);
pipeline
    // Generate a tick every 15 seconds
    .apply("Ticker", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(15)))
    // Just to check if individual ticks are being generated once every 15 seconds
    .apply(ParDo.of(new DoFn<Long, Long>() {
        @ProcessElement
        public void processElement(@Element Long tick, OutputReceiver<Long> out) {
            ZonedDateTime currentInstant = Instant.now().atZone(ZoneId.of("Asia/Jakarta"));
            LOG.warn("-" + tick + "-" + currentInstant.toString());
            // pass the tick through unchanged
            out.output(tick);
        }
    }))
    // 60 second window
    .apply("Window", Window.<Long>into(FixedWindows.of(Duration.standardSeconds(60))))
    // Emit once per 60 seconds
    .apply("Combine window into one", Combine.globally(Count.<Long>combineFn()).withoutDefaults())
    .apply("START", ParDo.of(new DoFn<Long, ZonedDateTime>() {
        @ProcessElement
        public void processElement(@Element Long count, OutputReceiver<ZonedDateTime> out) {
            ZonedDateTime currentInstant = Instant.now().atZone(ZoneId.of("Asia/Jakarta"));
            // LOG just to check
            // This log is sometimes printed more than once within 60 seconds
            LOG.warn("x" + count + "-" + currentInstant.toString());
            out.output(currentInstant);
        }
    }));
It works most of the time, except once every 5 or 10 minutes at random I see two outputs in the same minute. How do I ensure "START" above runs once every 60 seconds? Thanks.

Short answer: you can't currently. The Beam model is focused on event-time processing and correct handling of late data.
Workaround: you can define a processing-time timer, but you will have to deal with outputs and handling of the timer and late data manually, see this or this.
More details:
Windows and triggers in Beam are usually defined in event time, not in processing time. This way if you have late data coming after you already emitted the results for a window, late data still ends up in the correct window and results can be re-calculated for that window. Beam model allows you to express that logic and most of its functionality is tailored for that.
This also means that usually there is no requirement for a Beam pipeline to emit results at some specific real-world time, e.g. it doesn't make sense to say things like "aggregate the events that belong to some window based on the data in the events themselves, and then output that window every minute". The Beam runner aggregates the data for the window, possibly waits for late data, and then emits results as soon as it deems right. The condition under which the data is ready to be emitted is specified by a trigger. But that is all it is: a condition under which the window data is ready to be emitted; it doesn't actually force the runner to emit it. So the runner can emit it at any point in time after the trigger condition is met and the results are going to be correct, i.e. if more events have arrived since the trigger condition was met, only the ones that belong to a concrete window will be processed in that window.
Event-time windowing doesn't work with processing-time triggering, and there are no convenient primitives (triggers/windows) in Beam to deal with processing time in the presence of late data. In this model, if you use a trigger that only fires once, you lose the late data, and you still don't have a way to define a robust processing-time trigger. To build something like that you have to be able to specify things like the real-life point in time from which to start measuring the processing time, and you will have to deal with differing processing times and delays that can happen across a large fleet of worker machines. This just is not part of Beam at the moment.
There are efforts in Beam community that will enable this use case, e.g. sink triggers and retractions that will allow you to define your pipeline in event-time space but remove the need for complex event-time triggers. The results could be either immediately updated/recalculated and emitted, or the trigger can be specified at a sink like "I want the output table to be updated every minute". And the results will be updated and recalculated for late data automatically without your involvement. These efforts are far from completion though at this point, so your best bet currently is either using one of the existing triggers or manually handling everything with timers.
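The event-time behaviour described above can be illustrated with a small simulation. This is not the Beam API and not a diagnosis of the Dataflow job in the question; it is a hypothetical plain-Java sketch (the class EventTimeWindowSim and its methods are made up, and timestamps are assumed to be non-negative) showing how an element that arrives late is still assigned to its original window, so that window produces a second, updated result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical simulation of event-time fixed windows; not the Beam API. */
class EventTimeWindowSim {
    private final long windowSizeMs;
    private final Map<Long, Long> counts = new TreeMap<>();      // window start -> element count
    private final Map<Long, Long> lastEmitted = new TreeMap<>(); // window start -> count already emitted

    EventTimeWindowSim(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    /** Assigns the element to a window by its event timestamp, not by its arrival time. */
    void add(long eventTimeMs) {
        long windowStart = eventTimeMs - (eventTimeMs % windowSizeMs);
        counts.merge(windowStart, 1L, Long::sum);
    }

    /** Emits "windowStart=count" for every window whose count changed since the last firing. */
    List<String> fire() {
        List<String> panes = new ArrayList<>();
        for (Map.Entry<Long, Long> e : counts.entrySet()) {
            if (!e.getValue().equals(lastEmitted.get(e.getKey()))) {
                panes.add(e.getKey() + "=" + e.getValue());
                lastEmitted.put(e.getKey(), e.getValue());
            }
        }
        return panes;
    }
}
```

If three ticks with timestamps in the first minute are added and fire() is called, the first window is emitted once. If a fourth tick from that same minute then arrives after a tick from the second minute, the next fire() emits the first window again with the corrected count, alongside the second window.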

Kafka Streams windowing aggregation batching

I have Kafka Streams processing in my application:
myStream
.mapValues(customTransformer::transform)
.groupByKey(Serialized.with(new Serdes.StringSerde(), new SomeCustomSerde()))
.windowedBy(TimeWindows.of(10000L).advanceBy(10000L))
.aggregate(CustomCollectorObject::new,
(key, value, aggregate) -> aggregate.collect(value),
Materialized.<String, CustomCollectorObject, WindowStore<Bytes, byte[]>>as("some_store_name")
.withValueSerde(new CustomCollectorSerde()))
.toStream()
.foreach((k, v) -> /* do something very important */);
Expected behavior: incoming messages are grouped by key and aggregated into a CustomCollectorObject within some time interval. CustomCollectorObject is just a class with a List inside. Every 10 seconds, in foreach, I do something very important with my aggregated data. What is very important: I expect foreach to be called every 10 seconds!
Actual behavior: I can see that the processing in my foreach is called less often, approximately every 30-35 seconds, but that doesn't matter much. What is very important is that I receive 3-4 messages at once.
The question is: how can I achieve the expected behavior? I need my data to be processed at runtime without any delays.
I've tried to set cache.max.bytes.buffering: 0, but in this case windowing doesn't work at all.

Kafka Streams has a different execution model and provides different semantics, i.e., your expectations don't match what Kafka Streams does. There are multiple similar questions already:
How to send final kafka-streams aggregation result of a time windowed KTable?
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/blog/streams-tables-two-sides-same-coin
Also note that the community is currently working on a new operator called suppress() that will be able to provide the semantics you want: https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
For now, you would need to add a transform() with a state store and use punctuations to get the semantics you want (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor).
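The hold-back logic such a transform() would need can be sketched independently of Kafka. The following is a hypothetical plain-Java stand-in, not Kafka Streams code: FinalResultBuffer plays the role of the state store, update() is what the transformer would do per incoming aggregate, and punctuate() is what a punctuation callback would do. It assumes a single key and ignores late updates to windows that were already released; a real implementation would key the store by (key, window) and decide how long to wait for late records.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical hold-back buffer: keeps the latest aggregate per window until the window has closed. */
class FinalResultBuffer {
    private final long windowSizeMs;
    private final Map<Long, String> pending = new TreeMap<>(); // window start -> latest aggregate
    private long streamTimeMs = Long.MIN_VALUE;                 // highest record timestamp seen so far

    FinalResultBuffer(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    /** Called per update; overwrites earlier updates for the same window and advances stream time. */
    void update(long windowStartMs, String aggregate, long recordTimestampMs) {
        pending.put(windowStartMs, aggregate);
        streamTimeMs = Math.max(streamTimeMs, recordTimestampMs);
    }

    /** Returns and removes the results of all windows that ended at or before stream time. */
    List<String> punctuate() {
        List<String> closed = new ArrayList<>();
        Iterator<Map.Entry<Long, String>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, String> e = it.next();
            if (e.getKey() + windowSizeMs <= streamTimeMs) {
                closed.add(e.getValue());
                it.remove();
            }
        }
        return closed;
    }
}
```

With a 10-second window, two updates for window 0 produce no output until a record with a timestamp of 10 seconds or later has been seen; the next punctuate() then returns only the last aggregate for window 0.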

Kafka Stream count on time window not reporting zero values

I'm using a Kafka streams to calculate how many events occurred in last 3 minutes using a hopping time window:
public class ViewCountAggregator {
void buildStream(KStreamBuilder builder) {
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
KStream<String, String> views = builder.stream(stringSerde, stringSerde, "streams-view-count-input");
KStream<String, Long> viewCount = views
.groupBy((key, value) -> value)
.count(TimeWindows.of(TimeUnit.MINUTES.toMillis(3)).advanceBy(TimeUnit.MINUTES.toMillis(1)))
.toStream()
.map((key, value) -> new KeyValue<>(key.key(), value));
viewCount.to(stringSerde, longSerde, "streams-view-count-output");
}
public static void main(String[] args) throws Exception {
// some not so important initialization code
...
}
}
When running a consumer and pushing some messages to an input topic it receives following updates as the time passes:
single 1
single 1
single 1
five 1
five 4
five 5
five 4
five 1
Which is almost correct, but it never receives updates for:
single 0
five 0
Without it my consumer that updates a counter will never set it back to zero when there are no events for a longer period of time. I'm expecting consumed messages to look like this:
single 1
single 1
single 1
single 0
five 1
five 4
five 5
five 4
five 1
five 0
Is there some configuration option / argument I'm missing that would help me achieving such behavior?

Which is almost correct, but it never receives updates for:
First, the computed output is correct.
Second, why is it correct:
If you apply a windowed aggregate, only those windows that have actual content are created (all other systems I am familiar with would produce the same output). Thus, if for some key there is no data for a time period longer than the window size, no window is instantiated, and thus there is also no count at all.
The reason to not instantiate windows if there is no content is quite simple: the processor cannot know all keys. In your example, you have two keys, but maybe later on there might come up a third key. Would you expect to get <thirdKey,0> from the beginning on? Also, as data streams are infinite in nature, keys might go away and never reappear. If you remember all seen keys, and emit <key,0> if there is no data for a key that disappeared, would you emit <key,0> for ever?
I don't want to say that your expected result/semantics does not make sense. It's just a very specific use case of yours and not applicable in general. Hence, stream processors don't implement it.
Third: What can you do?
There are multiple options:
(1) Your consumer can keep track of which keys it has seen and, using the embedded record timestamps, figure out if a key is "missing" and then set the counter to zero for this key (for this, it might also help to remove the map step and preserve the Windowed<K> type for the key, so that the consumer gets the information about which window a record belongs to).
(2) Add a stateful #transform() step in your Streams application that does the same thing as described in (1). For this, it might be helpful to register a punctuation callback.
Approach (2) should make it easier to track keys, as you can attach a state store to your transform step and thus don't need to deal with state (and failure/recovery) in your downstream consumer.
However, the tricky part for both approaches is still to decide when a key is missing, i.e., how long you wait until you produce <key,0>. Note that data might arrive late (aka out of order), and even if you did emit <key,0>, a late-arriving record might produce a <key,1> message after your code emitted the <key,0> record. But maybe this is not really an issue for your case, as it seems you use the latest window only anyway.
Last but not least, one more comment: it seems that you are using only the latest count and that newer windows overwrite older windows in your downstream consumer. Thus, it might be worth exploring "Interactive Queries" to tap into the state of your count operator directly instead of consuming the topic and updating some other state. This might allow you to redesign and simplify your downstream application significantly. Check out the docs and a very good blog post about Interactive Queries for more details.
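The key-tracking idea behind approaches (1) and (2) can be sketched in plain Java. This is a hypothetical helper, not a Kafka Streams API: MissingKeyTracker stands in for the consumer-side bookkeeping (or for the state store plus punctuation in a transform step), and the timeout is an assumption you would have to choose yourself, for example the window size.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker that reports keys that have gone quiet, so a zero can be emitted for them. */
class MissingKeyTracker {
    private final long timeoutMs;
    private final Map<String, Long> lastSeenMs = new HashMap<>(); // key -> timestamp of latest record

    MissingKeyTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Records that a count update for this key was seen at the given record timestamp. */
    void observe(String key, long timestampMs) {
        lastSeenMs.merge(key, timestampMs, Long::max);
    }

    /** Returns keys with no update for at least timeoutMs and forgets them, so each zero is emitted once. */
    List<String> expire(long nowMs) {
        List<String> zeros = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeenMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() >= timeoutMs) {
                zeros.add(e.getKey());
                it.remove();
            }
        }
        return zeros;
    }
}
```

Forgetting a key once its zero has been emitted addresses the "would you emit <key,0> forever?" concern: the key is only tracked again when new data for it shows up.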

Kafka - Delayed Queue implementation using high level consumer

Want to implement a delayed consumer using the high level consumer api
main idea:
produce messages by key (each msg contains creation timestamp) this makes sure that each partition has ordered messages by produced time.
auto.commit.enable=false (will explicitly commit after each message process)
consume a message
check message timestamp and check if enough time has passed
process message (this operation will never fail)
commit 1 offset
while (it.hasNext()) {
val msg = it.next().message()
//checks timestamp in msg to see delay period exceeded
while (!delayedPeriodPassed(msg)) {
waitSomeTime() //Thread.sleep or something....
}
//certain that the msg was delayed and can now be handled
Try { process(msg) } //the msg process will never fail the consumer
consumer.commitOffsets //commit each msg
}
some concerns about this implementation:
commit each offset might slow ZK down
can consumer.commitOffsets throw an exception? if yes i will consume the same message twice (can solve with idempotent messages)
problem waiting long time without committing the offset, for example delay period is 24 hours, will get next from iterator, sleep for 24 hours, process and commit (ZK session timeout ?)
how can the ZK session be kept alive without committing new offsets? (setting a high zookeeper.session.timeout.ms can result in a dead consumer going unrecognised)
any other problems im missing?
Thanks!

One way to go about this would be to use a different topic where you push all messages that are to be delayed. If all delayed messages should be processed after the same time delay, this will be fairly straightforward:
while(it.hasNext()) {
val message = it.next().message()
if(shouldBeDelayed(message)) {
val delay = 24 hours
val delayTo = getCurrentTime() + delay
putMessageOnDelayedQueue(message, delay, delayTo)
}
else {
process(message)
}
consumer.commitOffset()
}
All regular messages will now be processed as soon as possible, while those that need a delay get put on another topic.
The nice thing is that we know that the message at the head of the delayed topic is the one that should be processed first since its delayTo value will be the smallest. Therefore we can set up another consumer that reads the head message, checks if the timestamp is in the past and if so processes the message and commits the offset. If not it does not commit the offset and instead just sleeps until that time:
while(it.hasNext()) {
val delayedMessage = it.peek().message()
if(delayedMessage.delayTo < getCurrentTime()) {
val readMessage = it.next().message
process(readMessage.originalMessage)
consumer.commitOffset()
} else {
delayProcessingUntil(delayedMessage.delayTo)
}
}
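The head-of-topic check above can be tried out in memory. The following is a hypothetical plain-Java stand-in for the delayed topic and its consumer, not Kafka client code: DelayedTopicSim, pollDue and millisUntilNext are made-up names, and the sketch assumes that messages are appended in delayTo order, as they are when every message gets the same delay.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-memory stand-in for the delayed topic; messages are appended in delayTo order. */
class DelayedTopicSim {
    static final class Delayed {
        final String payload;
        final long delayToMs;

        Delayed(String payload, long delayToMs) {
            this.payload = payload;
            this.delayToMs = delayToMs;
        }
    }

    private final Deque<Delayed> topic = new ArrayDeque<>();

    void append(String payload, long delayToMs) {
        topic.addLast(new Delayed(payload, delayToMs));
    }

    /** Removes and returns every head message that is due at nowMs; stops at the first one that is not. */
    List<String> pollDue(long nowMs) {
        List<String> due = new ArrayList<>();
        while (!topic.isEmpty() && topic.peekFirst().delayToMs <= nowMs) {
            due.add(topic.pollFirst().payload);
        }
        return due;
    }

    /** Milliseconds to sleep before the head message becomes due, or -1 if the topic is empty. */
    long millisUntilNext(long nowMs) {
        return topic.isEmpty() ? -1L : Math.max(0L, topic.peekFirst().delayToMs - nowMs);
    }
}
```

Because only the head is inspected, the consumer never has to read past a message that is not yet due, which is what lets it sleep for exactly the remaining time.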
In case there are different delay times, you could partition the topic on the delay (e.g. 24 hours, 12 hours, 6 hours). If the delay time is more dynamic than that, it becomes a bit more complex. You could solve it by introducing two delay topics. Read all messages off delay topic A and process all the messages whose delayTo values are in the past. Among the others, find the one with the closest delayTo and then put them all on topic B. Sleep until the closest one should be processed and then do it all in reverse, i.e. process messages from topic B and put the ones that shouldn't yet be processed back on topic A.
To answer your specific questions (some have been addressed in the comments to your question)
Commit each offset might slow ZK down
You could consider switching to storing the offset in Kafka (a feature available from 0.8.2, check out offsets.storage property in consumer config)
Can consumer.commitOffsets throw an exception? if yes, I will consume the same message twice (can solve with idempotent messages)
I believe it can, if it is not able to communicate with the offset storage for instance. Using idempotent messages solves this problem though, as you say.
Problem waiting long time without committing the offset, for example delay period is 24 hours, will get next from iterator, sleep for 24 hours, process and commit (ZK session timeout?)
This won't be a problem with the above outlined solution unless the processing of the message itself takes more than the session timeout.
How can the ZK session be kept alive without committing new offsets? (setting a high zookeeper.session.timeout.ms can result in a dead consumer going unrecognized)
Again with the above you shouldn't need to set a long session timeout.
Any other problems I'm missing?
There always are ;)

Use Tibco EMS or other JMS queues. They have retry delay built in. Kafka may not be the right design choice for what you are doing.

I would suggest another route in your cases.
It doesn't make sense to handle the waiting time in the main thread of the consumer. This is an anti-pattern in how queues are used. Conceptually, you need to process the messages as fast as possible and keep the queue at a low load factor.
Instead, I would use a scheduler that schedules a job for each message you need to delay. This way you can process the queue and create asynchronous jobs that will be triggered at predefined points in time.
The downside of this technique is that it is sensitive to the status of the JVM that holds the scheduled jobs in memory. If that JVM fails, you lose the scheduled jobs and you don't know whether a task was executed or not.
There are scheduler implementations, though, that can be configured to run in a cluster environment, thus keeping you safe from JVM crashes.
Take a look at this java scheduling framework: http://www.quartz-scheduler.org/

We had the same issue during one of our tasks. Although it was eventually solved without using delayed queues, the best approach we found while exploring solutions was to use the pause and resume functionality provided by the KafkaConsumer API. This approach and its motivation are described here: https://medium.com/naukri-engineering/retry-mechanism-and-delay-queues-in-apache-kafka-528a6524f722

A keyed list on a schedule, or its Redis alternative, may be the best approach.

Can you programmatically alter a queue's "dead letter" handling in a Java embedded broker?

Background
At a high level, I have a Java application in which certain events should trigger a certain action to be taken for the current user. However, the events may be very frequent, and the action is always the same. So when the first event happens, I would like to schedule the action for some point in the near future (e.g. 5 minutes). During that window of time, subsequent events should take no action, because the application sees that there's already an action scheduled. Once the scheduled action executes, we're back to Step 1 and the next event starts the cycle over again.
My thought is to implement this filtering and throttling mechanism by embedding an in-memory ActiveMQ instance within the application itself (I don't care about queue persistence).
I believe that JMS 2.0 supports this concept of delayed delivery, with delayed messages sitting in a "staging queue" until it's time for delivery to the real destination. However, I also believe that ActiveMQ does not yet support the JMS 2.0 spec... so I'm thinking about mimicking the same behavior with time-to-live (TTL) values and Dead Letter Queue (DLQ) handling.
Basically, my message producer code would put messages on a dummy staging queue from which no consumers ever pull anything. Messages would be placed with a 5-minute TTL value, and upon expiration ActiveMQ would dump them into a DLQ. That's the queue from which my message consumers would actually consume the messages.
Question
I don't think I want to actually consume from the "default" DLQ, because I have no idea what other internal things ActiveMQ might dump there that are completely unrelated to my application code. So I think it would be best for my dummy staging queue to have its own custom DLQ. I've only seen one page of ActiveMQ documentation which discusses DLQ config, and it only addresses XML config files for a standalone ActiveMQ installation (not an in-memory broker embedded within an app).
Is it possible to programmatically configure a custom DLQ at runtime for a queue in an embedded ActiveMQ instance?
I'd also be interested to hear alternative suggestions if you think I'm on the wrong track. I'm much more familiar with JMS than AMQP, so I don't know if this is much easier with Qpid or some other Java-embeddable AMQP broker. Whatever Apache Camel actually is (!), I believe it's supposed to excel at this sort of thing, but that learning curve might be gross overkill for this use case.

Although you're worried that Camel might be gross overkill for this use case, I think that ActiveMQ is already gross overkill for the use case you've described.
You're looking to schedule something to happen 5 minutes after an event happens, and for it to consume only the first event and ignore all the ones between the first one and when the 5 minutes are up, right? Why not just schedule your processing method for 5 minutes from now via ScheduledExecutorService or your favorite scheduling mechanism, and save the event in a HashMap<User, Event> member variable. If any more events come in for this user before the processing method fires, you'll just see that you already have an event stored and not store the new one, so you'll ignore all but the first. At the end of your processing method, delete the event for this user from your HashMap, and the next event to come in will be stored and scheduled.
Running ActiveMQ just to get this behavior seems like way more than you need. Or if not, can you explain why?
EDIT:
If you do go down this path, don't use the message TTL to expire your messages; just have the (one and only) consumer read them into memory and use the in-memory solution described above to only process (at most) one batch every 5 minutes. Either have a single queue with message selectors, or use dynamic queues, one per user. You don't need the DLQ to implement the delay, and even if you could get it to do that, it won't give you the functionality of batching everything so you only run once per 5 minutes. This isn't a path you want to go down, even if you figure out how.

A simple solution is keeping track of the pending actions in a concurrent structure and use a ScheduledExecutorService to execute them:
private static final Object RUNNING = new Object();
private final ConcurrentMap<UserId, Object> pendingActions =
        new ConcurrentHashMap<>();
private ScheduledExecutorService ses = Executors.newScheduledThreadPool(10);
public void takeAction(final UserId id) {
    Object running = pendingActions.putIfAbsent(id, RUNNING); // atomic
    if (running == null) { // no pending action for this user
        ses.schedule(new Runnable() {
            @Override
            public void run() {
                try {
                    doWork();
                } finally {
                    // always clear the marker, even if doWork() throws,
                    // so the next event can schedule a new action
                    pendingActions.remove(id);
                }
            }
        }, 5, TimeUnit.MINUTES);
    }
}

With Camel this could easily be achieved with an Aggregator component and the parameter completionInterval, so that every five minutes you can check whether the list of aggregated messages is empty; if it is not, fire a message to the route responsible for your user action and empty the list. You do not need to maintain the whole list of exchanges, just the state (user action planned or not).
