Kafka Streams count on time window not reporting zero values - java

I'm using Kafka Streams to calculate how many events occurred in the last 3 minutes using a hopping time window:
public class ViewCountAggregator {

    void buildStream(KStreamBuilder builder) {
        final Serde<String> stringSerde = Serdes.String();
        final Serde<Long> longSerde = Serdes.Long();

        KStream<String, String> views = builder.stream(stringSerde, stringSerde, "streams-view-count-input");
        KStream<String, Long> viewCount = views
                .groupBy((key, value) -> value)
                .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(3)).advanceBy(TimeUnit.MINUTES.toMillis(1)))
                .toStream()
                .map((key, value) -> new KeyValue<>(key.key(), value));

        viewCount.to(stringSerde, longSerde, "streams-view-count-output");
    }

    public static void main(String[] args) throws Exception {
        // some not so important initialization code
        ...
    }
}
When I run a consumer and push some messages to the input topic, it receives the following updates as time passes:
single 1
single 1
single 1
five 1
five 4
five 5
five 4
five 1
This is almost correct, but it never receives updates for:
single 0
five 0
Without those, my consumer that updates a counter will never set it back to zero when there are no events for a longer period of time. I expect the consumed messages to look like this:
single 1
single 1
single 1
single 0
five 1
five 4
five 5
five 4
five 1
five 0
Is there some configuration option / argument I'm missing that would help me achieve this behavior?

This is almost correct, but it never receives updates for:
First, the computed output is correct.
Second, here is why it is correct:
If you apply a windowed aggregate, only windows that actually have content are created (all other systems I am familiar with would produce the same output). Thus, if there is no data for some key for a time period longer than the window size, no window is instantiated and there is no count at all.
The reason for not instantiating windows when there is no content is quite simple: the processor cannot know all keys. In your example, you have two keys, but maybe a third key shows up later on. Would you expect to get <thirdKey,0> from the beginning? Also, since data streams are infinite in nature, keys might go away and never reappear. If you remember all seen keys and emit <key,0> when there is no data for a key that disappeared, would you emit <key,0> forever?
I don't want to say that your expected result/semantics don't make sense. They are just a very specific use case of yours and not applicable in general, which is why stream processors don't implement it.
Third: What can you do?
There are multiple options:
1. Your consumer can keep track of which keys it has seen and, using the embedded record timestamps, figure out when a key is "missing", then set the counter to zero for that key. (For this, it might also help to remove the map step and preserve the Windowed<K> key type, so the consumer knows which window a record belongs to.)
2. Add a stateful transform() step in your Streams application that does the same thing as described in (1). For this, it might be helpful to register a punctuation callback; a sketch follows below.
Approach (2) should make it easier to track keys, because you can attach a state store to your transform step and thus don't need to deal with state (and failure/recovery) in your downstream consumer.
However, the tricky part for both approaches is still deciding when a key counts as "missing", i.e., how long you wait before producing <key,0>. Note that data might arrive late (out of order), so even after you emit <key,0>, a late-arriving record might produce a <key,1> message after your code emitted the <key,0> record. But maybe this is not really an issue in your case, since it seems you only use the latest window anyway.
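A minimal sketch of approach (2), assuming a newer Streams API (2.x era, i.e. with the Punctuator-based schedule() call) and a key-value store registered under the hypothetical name "seen-keys-store". This is an illustration of the pattern, not a drop-in solution; the exact Transformer/schedule() signatures vary between versions:

import java.time.Duration;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch only: emits <key,0> for keys that have been silent for longer than the window size.
public class ZeroEmitter implements Transformer<String, Long, KeyValue<String, Long>> {

    private ProcessorContext context;
    private KeyValueStore<String, Long> lastSeen; // key -> timestamp of the last update

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.lastSeen = (KeyValueStore<String, Long>) context.getStateStore("seen-keys-store");
        // Once per minute (wall-clock time), check whether any key went silent.
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, Long> it = lastSeen.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    // Compares wall-clock "now" against the last seen record timestamp;
                    // adjust this to whatever notion of "missing" fits your use case.
                    if (timestamp - entry.value > TimeUnit.MINUTES.toMillis(3)) {
                        context.forward(entry.key, 0L); // no data for longer than the window size
                        lastSeen.delete(entry.key);     // forget the key so we emit zero only once
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, Long> transform(String key, Long count) {
        lastSeen.put(key, context.timestamp()); // remember when we last saw this key
        return KeyValue.pair(key, count);       // pass the real count through unchanged
    }

    @Override
    public void close() { }
}

You would wire it in between the map and to steps, e.g. viewCount.transform(ZeroEmitter::new, "seen-keys-store"), after registering a key-value store with that name on the builder (the exact registration call depends on your Streams version).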
Last but not least, one more comment: it seems that you use only the latest count, and newer windows overwrite older ones in your downstream consumer. Thus, it might be worth exploring Interactive Queries to tap into the state of your count operator directly, instead of consuming the topic and updating some other state. This could let you redesign and simplify your downstream application significantly. Check out the docs and a very good blog post about Interactive Queries for more details.
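For illustration, a rough sketch of the Interactive Queries idea, assuming the count is materialized under a store name of your choosing (the name "view-counts" below is made up) and that streams is the running KafkaStreams instance:

import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

// Sketch: query the windowed count store directly instead of consuming the output topic.
long latestCount(KafkaStreams streams, String key) {
    ReadOnlyWindowStore<String, Long> counts =
            streams.store("view-counts", QueryableStoreTypes.<String, Long>windowStore());

    long now = System.currentTimeMillis();
    long threeMinutesAgo = now - TimeUnit.MINUTES.toMillis(3);

    // Windows for this key whose start time falls within the last three minutes;
    // an empty iterator means "no recent events", i.e. the count you would otherwise report as 0.
    long latest = 0L;
    try (WindowStoreIterator<Long> it = counts.fetch(key, threeMinutesAgo, now)) {
        while (it.hasNext()) {
            latest = it.next().value; // keep the count of the most recent window
        }
    }
    return latest;
}

With this, the downstream application no longer needs to consume streams-view-count-output at all: a key with no recent windows simply yields 0.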

Related

How to pool Akka substreams

I am trying to maintain a long-running Akka stream that fans out into substreams (let's say to conflate/throttle/write-to-DB records with a particular ID).
Because the stream should be kept alive for a long time, at some point it will run out of available substreams (and I'd like to free unused memory anyway).
How can I clean up the 'idle' substreams? (The docs point to idleTimeout and recoverWithRetries, but they do not seem to actually free a substream. Am I not using them properly? I can see that recoverWithRetries is called at the right time, but the (MAX_SUBSTREAMS + 1)-th key that arrives later still fails with "Cannot open a new substream as there are too many substreams open".)
How do I handle the case where there is no substream to clean? (Can I, and how do I, slow down the upstream?)
This post says that
groupBy removes inputs for subflows that have already been closed
This is not what I want; I need the substream to simply be re-created in that case. Also, I cannot find any mention of this behaviour in the docs.
In the end, what I need is to fan out a stream into a pool of substreams. If all substreams are in use, slow down the upstream. If a substream does not receive any new record for x seconds, emit, clear it, and move it back to the pool.
Flow.of(Record.class)
    .groupBy(MAX_SUBSTREAMS, Record::getKey)
    .via(conflateThenThrottleThenCommitRecord)
    .idleTimeout(Duration.of(2, ChronoUnit.SECONDS))
    .recoverWithRetries(1, new PFBuilder()
        .matchAny(ex -> Source.empty())
        .build())
    .mergeSubstreams();

How to create streaming Beam pipeline that is triggered once and only once in a fixed interval

I need to create an Apache Beam (Java) streaming job that should start once (and only once) every 60 seconds.
I got it working correctly with the DirectRunner by using GenerateSequence, Window, and Combine.
However, when I run it on Google Dataflow, it is sometimes triggered more than once within the 60-second window. I am guessing it has something to do with delays and out-of-order messages.
Pipeline pipeline = Pipeline.create(options);

pipeline
        // Generate a tick every 15 seconds
        .apply("Ticker", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(15)))
        // Just to check that individual ticks are being generated once every 15 seconds
        .apply(ParDo.of(new DoFn<Long, Long>() {
            @ProcessElement
            public void processElement(@Element Long tick, OutputReceiver<Long> out) {
                ZonedDateTime currentInstant = Instant.now().atZone(ZoneId.of("Asia/Jakarta"));
                LOG.warn("-" + tick + "-" + currentInstant.toString());
                out.output(tick);
            }
        }))
        // 60-second window
        .apply("Window", Window.<Long>into(FixedWindows.of(Duration.standardSeconds(60))))
        // Emit once per 60 seconds
        .apply("Combine window into one", Combine.globally(Count.<Long>combineFn()).withoutDefaults())
        .apply("START", ParDo.of(new DoFn<Long, ZonedDateTime>() {
            @ProcessElement
            public void processElement(@Element Long count, OutputReceiver<ZonedDateTime> out) {
                ZonedDateTime currentInstant = Instant.now().atZone(ZoneId.of("Asia/Jakarta"));
                // LOG just to check
                // This log is sometimes printed more than once within 60 seconds
                LOG.warn("x" + count + "-" + currentInstant.toString());
                out.output(currentInstant);
            }
        }));
It works most of the time, except that once every 5 or 10 minutes, at random, I see two outputs in the same minute. How do I ensure that "START" above runs once every 60 seconds? Thanks.
Short answer: you can't currently; the Beam model is focused on event-time processing and correct handling of late data.
Workaround: you can define a processing-time timer, but you will have to handle the outputs, the timer, and late data manually (a rough sketch follows below); see this or this.
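To make the workaround concrete, here is a rough sketch (my own illustration, not a tested recipe) of a stateful DoFn with a processing-time timer; it assumes the input has been keyed (state and timers require a KV input), and names such as "flush" and "count" are placeholders:

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Sketch: count elements per key and flush the count once, 60 seconds of
// processing time after the first element since the previous flush.
class FlushOnProcessingTimeFn extends DoFn<KV<String, Long>, Long> {

    @StateId("count")
    private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

    @TimerId("flush")
    private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

    @ProcessElement
    public void processElement(
            @Element KV<String, Long> element,
            @StateId("count") ValueState<Long> count,
            @TimerId("flush") Timer flush) {
        Long current = count.read();
        if (current == null) {
            // First element since the last flush: arm a flush 60s of processing time from now.
            flush.offset(Duration.standardSeconds(60)).setRelative();
            current = 0L;
        }
        count.write(current + 1);
    }

    @OnTimer("flush")
    public void onFlush(OnTimerContext context, @StateId("count") ValueState<Long> count) {
        Long current = count.read();
        context.output(current == null ? 0L : current);
        count.clear(); // the next element will arm the timer again
    }
}

Note that this still fires per key and per window, and late data arriving after a flush simply starts a new count, so you still have to decide how to reconcile those outputs downstream.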
More details:
Windows and triggers in Beam are usually defined in event time, not in processing time. This way, if late data arrives after you have already emitted the results for a window, that late data still ends up in the correct window and the results can be re-calculated for that window. The Beam model allows you to express this logic, and most of its functionality is tailored for it.
This also means that there is usually no requirement for a Beam pipeline to emit results at a specific real-world time; e.g., it doesn't make sense to say things like "aggregate the events that belong to some window based on the data in the events themselves, and then output that window every minute". The Beam runner aggregates the data for the window, possibly waits for late data, and then emits the results as soon as it deems right. The condition under which the data is ready to be emitted is specified by a trigger. But that's just that, a condition for when the window data is ready to be emitted; it doesn't actually force the runner to emit it. So the runner can emit at any point in time after the trigger condition is met, and the results will still be correct, i.e., if more events have arrived since the trigger condition was met, only the ones that belong to a concrete window will be processed in that window.
Event-time windowing doesn't work with processing-time triggering, and there are no convenient primitives (triggers/windows) in Beam for dealing with processing time in the presence of late data. In this model, if you use a trigger that fires only once, you lose the late data, and you still don't have a way to define a robust processing-time trigger. To build something like that, you would have to be able to specify things like the real-life point in time from which to start measuring processing time, and you would have to deal with the differences in processing time and the delays that can occur across a large fleet of worker machines. This just is not part of Beam at the moment.
There are efforts in the Beam community that will enable this use case, e.g. sink triggers and retractions, which will allow you to define your pipeline in event-time space while removing the need for complex event-time triggers. The results could either be immediately updated/recalculated and emitted, or the trigger could be specified at a sink, like "I want the output table to be updated every minute", and the results would be updated and recalculated for late data automatically without your involvement. These efforts are far from completion at this point, though, so your best bet currently is either to use one of the existing triggers or to handle everything manually with timers.

Kafka KStreams Issue in Aggregation with Time Window

I have an issue with KStreams aggregation and windows. I want to aggregate records into a list of records that have the same key, as long as they fall inside a time window.
I have chosen SessionWindows because I have to work with a moving window inside a session: let's say record A arrives at 10:00:00; then every other record with the same key that arrives inside the 10-second window (until 10:00:10) will fall into the same session, bearing in mind that if one arrives at 10:00:03, the window moves out to 10:00:13 (+10s). That gives us a moving window of +10s from the last record received for a given key.
Now the problem: I want to obtain the last aggregated result. I have used .suppress() to indicate that I don't want any intermediate results, just the final one when the window closes. This is not working as expected: while it doesn't send any intermediate aggregated results, when the time window ends I don't get any result at all. I have noticed that in order to receive it I need to publish another message to the topic, which in my case is impossible.
Reading about .suppress(), I have come to the conclusion that it may not be the way to achieve what I want, which is why my question is: how can I force the window to close and send the latest aggregated result?
@StreamListener(ExtractContractBinding.RECEIVE_PAGE)
@SendTo(ExtractCommunicationBinding.AGGREGATED_PAGES)
public KStream<String, List<Records>> aggregatePages(KStream<?, Record> input) {
    return input.map(this::getRecord)
        .groupBy(keyOfElement)
        .windowedBy(SessionWindows.with(Duration.ofSeconds(10L)).grace(Duration.ofSeconds(10L)))
        .aggregate(...do stuff...)
        .suppress(Suppressed.untilWindowCloses(unbounded()))
        .toStream()
        .map(this::createAggregatedResult);
}
In short, the reason this happens is that in Kafka Streams, and in most other stream processing engines that compute aggregations, time is based on event time.
https://kafka.apache.org/0101/documentation/streams#streams_time
In other words, the window cannot close until a new message arrives past your time window plus the grace period that accounts for late-arriving messages.
Moreover, based on some unit tests I have been writing recently, I am inclined to believe that the second message needs to land in the same partition as the previous message for event time to move forward. In practice, when you run in production and presumably process hundreds of messages per second, this becomes unnoticeable.
Let me also add that you can implement a custom timestamp extractor, which gives you fine-grained control over which time window a particular message lands in; a minimal sketch follows below.
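For completeness, a minimal sketch of such an extractor (my illustration; Record#getEventTime() is a hypothetical accessor on your payload, and how you register the extractor depends on whether you configure Kafka Streams directly or through the Spring Cloud Stream binder):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Sketch: take the event time from the payload instead of the broker/producer timestamp,
// so you control which session window a message lands in.
public class PayloadTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        Object value = record.value();
        if (value instanceof Record) {
            return ((Record) value).getEventTime(); // hypothetical accessor for the embedded event time
        }
        // Fall back to the timestamp Kafka already assigned to the record.
        return record.timestamp();
    }
}

It is typically wired in via the default.timestamp.extractor configuration property.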
how can I force the window to close and send the latest aggregated calculated result?
To finally answer your question: it is not possible to force the time window to close without producing an extra message to the source topic.

Kafka Streams windowing aggregation batching

I have Kafka Streams processing in my application:
myStream
    .mapValues(customTransformer::transform)
    .groupByKey(Serialized.with(new Serdes.StringSerde(), new SomeCustomSerde()))
    .windowedBy(TimeWindows.of(10000L).advanceBy(10000L))
    .aggregate(CustomCollectorObject::new,
        (key, value, aggregate) -> aggregate.collect(value),
        Materialized.<String, CustomCollectorObject, WindowStore<Bytes, byte[]>>as("some_store_name")
            .withValueSerde(new CustomCollectorSerde()))
    .toStream()
    .foreach((k, v) -> { /* do something very important */ });
Expected behavior: incoming messages are grouped by key and aggregated in a CustomCollectorObject within some time interval. CustomCollectorObject is just a class with a List inside. Every 10 seconds, in foreach, I do something very important with my aggregated data. Crucially, I expect foreach to be called every 10 seconds!
Actual behavior: I can see that the processing in my foreach is called more rarely, approximately every 30-35 seconds; the exact interval doesn't matter much. What does matter is that I receive 3-4 messages at once.
The question is: how can I achieve the expected behavior? I need my data to be processed at runtime without any delays.
I've tried to set cache.max.bytes.buffering: 0 but in this case windowing doesn't work at all.
Kafka Streams has a different execution model and provides different semantics, i.e., your expectations don't match what Kafka Streams does. There are already multiple similar questions:
How to send final kafka-streams aggregation result of a time windowed KTable?
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/blog/streams-tables-two-sides-same-coin
Also note that the community is currently working on a new operator called suppress() that will be able to provide the semantics you want: https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
For now, you would need to add a transform() with a state store and use punctuations to get the semantics you want (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor); a rough sketch follows below.
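To illustrate the suggestion (this is my sketch, not an official recipe): a transform() step attached after the aggregation can park the latest update per key in a state store and forward it only when a stream-time punctuation fires. The names HoldAndFlushTransformer and "flush-store" are made up, and the exact schedule()/Transformer signatures vary between Streams versions:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch: hold back intermediate window updates and emit only the newest one per key
// every 10 seconds of stream time.
public class HoldAndFlushTransformer
        implements Transformer<String, CustomCollectorObject, KeyValue<String, CustomCollectorObject>> {

    private KeyValueStore<String, CustomCollectorObject> buffer;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        buffer = (KeyValueStore<String, CustomCollectorObject>) context.getStateStore("flush-store");
        context.schedule(Duration.ofSeconds(10), PunctuationType.STREAM_TIME, timestamp -> {
            try (KeyValueIterator<String, CustomCollectorObject> it = buffer.all()) {
                while (it.hasNext()) {
                    KeyValue<String, CustomCollectorObject> entry = it.next();
                    context.forward(entry.key, entry.value); // latest aggregate for this key
                    buffer.delete(entry.key);
                }
            }
        });
    }

    @Override
    public KeyValue<String, CustomCollectorObject> transform(String key, CustomCollectorObject value) {
        buffer.put(key, value); // overwrite: keep only the newest update per key
        return null;            // emit nothing here; the punctuation does the forwarding
    }

    @Override
    public void close() { }
}

The store has to be registered on the builder and named in the transform() call, e.g. builder.addStateStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("flush-store"), Serdes.String(), new CustomCollectorSerde())), and then .toStream().map((k, v) -> KeyValue.pair(k.key(), v)).transform(HoldAndFlushTransformer::new, "flush-store").foreach(...).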

How to reprocess a batched Kafka Stream

I want to batch messages based on the timestamp at which each message was created. Furthermore, I want to batch these messages in fixed time windows (of 1 minute). Only after the window has passed should the batch be pushed downstream.
For this to work, the Processor API seems more or less fitting (a la KStream batch process windows):
public void process(String key, SensorData sensorData) {
    // epochTime is rounded to our preferred time window (1 minute)
    long epochTime = sensorData.epochTime;
    // the state store is keyed by the (stringified) window start time
    String windowKey = String.valueOf(epochTime);

    ArrayList<SensorData> data = messageStore.get(windowKey);
    if (data == null) {
        data = new ArrayList<SensorData>();
    }
    data.add(sensorData);
    messageStore.put(windowKey, data);
    this.context.commit();
}

@Override
public void init(ProcessorContext context) {
    this.context = context;
    this.context.schedule(60000); // equal to 1 minute
}

@Override
public void punctuate(long streamTime) {
    KeyValueIterator<String, ArrayList<SensorData>> it = messageStore.all();
    while (it.hasNext()) {
        KeyValue<String, ArrayList<SensorData>> entry = it.next();
        this.context.forward(entry.key, entry.value);
    }
    // reset messageStore
}
However, this approach has one major downside: we don't use Kafka Streams windows. As a result:
out-of-order messages are not considered.
When operating in real time, the punctuation schedule must equal the desired batch time window. If it is set too short, the batch is forwarded early and the downstream computation starts too soon. If it is set too long, and punctuation is triggered while a batch window is still open, we have the same problem.
Also, replaying historic data while keeping the punctuation schedule (1 minute) will trigger the first computation only after 1 minute. That would blow up the state store and also feels wrong.
Taking these points into consideration, I should use Kafka Streams windows. But this is only possible in the Kafka Streams DSL...
Any thoughts on this would be awesome.
You can mix and match the DSL and the Processor API using process(), transform(), or transformValues() within the DSL (there are other SO questions about this already, so I won't elaborate further). Thus, you can use the regular window construct in combination with a custom (downstream) operator to hold the result back (and deduplicate). Some deduplication will already happen automatically within your window operator (as of Kafka 0.10.1; see http://docs.confluent.io/current/streams/developer-guide.html#memory-management), but if you want exactly one result, the cache will not do it for you.
About punctuate: it is triggered based on progress (i.e., stream time) and not based on wall-clock time, so if you reprocess old data it will be called exactly the same number of times as in your original run (just closer together in wall-clock terms, since you process the old data faster). There are also some SO questions about this if you want more details.
However, one general consideration: why do you need exactly one result? If you are doing stream processing, you might want to build your downstream consumer application so that it can handle updates to your results. This is the inherent design of Kafka: using changelogs.
