Kafka Streams windowing aggregation batching - java

I have Kafka Streams processing in my application:
myStream
    .mapValues(customTransformer::transform)
    .groupByKey(Serialized.with(new Serdes.StringSerde(), new SomeCustomSerde()))
    .windowedBy(TimeWindows.of(10000L).advanceBy(10000L))
    .aggregate(CustomCollectorObject::new,
        (key, value, aggregate) -> aggregate.collect(value),
        Materialized.<String, CustomCollectorObject, WindowStore<Bytes, byte[]>>as("some_store_name")
            .withValueSerde(new CustomCollectorSerde()))
    .toStream()
    .foreach((k, v) -> /* do something very important */);
Expected behavior: incoming messages are grouped by key and aggregated into a CustomCollectorObject within some time interval. CustomCollectorObject is just a class with a List inside. Every 10 seconds, in foreach, I do something very important with my aggregated data. Crucially, I expect foreach to be called every 10 seconds!
Actual behavior: I can see that processing in my foreach is invoked less often, approximately every 30-35 seconds (the exact interval doesn't matter much). What matters is that I receive 3-4 messages at once.
The question is: how can I achieve the expected behavior? I need my data to be processed at runtime, without delays.
I've tried setting cache.max.bytes.buffering: 0, but in this case windowing doesn't work at all.

Kafka Streams has a different execution model and provides different semantics, i.e., your expectations don't match what Kafka Streams does. There are multiple similar questions already:
How to send final kafka-streams aggregation result of a time windowed KTable?
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/blog/streams-tables-two-sides-same-coin
Also note that the community is currently working on a new operator called suppress() that will be able to provide the semantics you want: https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
For now, you would need to add a transform() with a state store and use punctuation to get the semantics you want (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor)
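For illustration, here is a minimal sketch of that approach, assuming Kafka Streams 1.x (to match the Serialized API you use). The store name "emit-buffer", the placeholder type MappedValue (whatever customTransformer::transform returns) and the emit logic are assumptions you would adapt:
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class EmitEveryTenSeconds
        implements Transformer<String, MappedValue, KeyValue<String, CustomCollectorObject>> {

    private ProcessorContext context;
    private KeyValueStore<String, CustomCollectorObject> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, CustomCollectorObject>) context.getStateStore("emit-buffer");

        // wall-clock punctuation: fires every 10 seconds, independent of incoming traffic
        context.schedule(10_000L, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, CustomCollectorObject> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, CustomCollectorObject> entry = it.next();
                    context.forward(entry.key, entry.value); // emit the current aggregate
                    store.delete(entry.key);                 // start a fresh 10-second interval
                }
            }
        });
    }

    @Override
    public KeyValue<String, CustomCollectorObject> transform(String key, MappedValue value) {
        // only update the buffered aggregate here; all emitting happens in the punctuation
        CustomCollectorObject agg = store.get(key);
        if (agg == null) {
            agg = new CustomCollectorObject();
        }
        store.put(key, agg.collect(value));
        return null;
    }

    // only needed on Streams versions that still declare this deprecated method
    @Override
    public KeyValue<String, CustomCollectorObject> punctuate(long timestamp) {
        return null;
    }

    @Override
    public void close() { }
}
The "emit-buffer" store would be created via Stores.keyValueStoreBuilder(...), registered with StreamsBuilder#addStateStore(...), and the transformer wired in with .transform(EmitEveryTenSeconds::new, "emit-buffer") in place of the windowed aggregation.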

Related

Polling items from DynamoDB

AWS newbie here.
I have a DynamoDB table and 2+ nodes of Java apps reading/writing from/to it. My use case is as follows: the app should fetch N items every X seconds based on a timestamp, process them, then remove them from the DB. Because the app may scale, other nodes might be reading from the DB at the same time, and I want to avoid processing the same items multiple times.
The question is: is there any way to implement something like a poll() method that fetches an item and immediately removes it (an atomic operation), as if the table were a queue? As far as I checked, the delete-item methods that DynamoDBMapper offers do not return the removed item's data.
Consistency is a weak spot of DDB, but that's the price to pay for its scalability.
You said it yourself, you're looking for a queue, so why not use one?
I suggest:
Create a lambda that:
Reads the items
Publishes them to an SQS FIFO queue with message deduplication
Deletes the items from the DB
Create an EventBridge schedule to run the Lambda every n minutes
Have your nodes poll that queue instead of DDB
For this to work you have to consider a few things regarding timings:
DDB will typically be consistent in under a second, but this isn't guaranteed.
SQS deduplication only works for 5 minutes.
EventBridge only supports minute level granularity, not seconds.
So you can run your Lambda as frequently as once a minute, but you can run your nodes as frequently (or infrequently) as you like.
If you run your Lambda less frequently than every 5 minutes then there is technically a chance of processing an item twice, but this is very unlikely to ever happen (technically this could still happen anyway if DDB took >10 minutes to be consistent, but again, extremely unlikely to ever happen).
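For illustration, a rough sketch of such a Lambda with the AWS SDK for Java v2; the table name, queue URL and key attribute "id" are placeholders, and a Query on a timestamp-keyed index would usually be preferable to the Scan used here for brevity:
import java.util.Collections;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DeleteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class PollToQueueLambda {

    private final DynamoDbClient ddb = DynamoDbClient.create();
    private final SqsClient sqs = SqsClient.create();

    private static final String TABLE = "items-table";                         // placeholder
    private static final String QUEUE_URL = "https://sqs.../items-queue.fifo"; // placeholder

    // invoked on an EventBridge schedule (e.g. every minute)
    public void handleRequest() {
        // 1. read a batch of items
        for (Map<String, AttributeValue> item : ddb.scan(ScanRequest.builder()
                .tableName(TABLE)
                .limit(25)
                .build()).items()) {

            String id = item.get("id").s();

            // 2. publish to the FIFO queue; the deduplication id keeps the same item
            //    from being enqueued twice within SQS's 5-minute deduplication window
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(QUEUE_URL)
                    .messageGroupId("items")
                    .messageDeduplicationId(id)
                    .messageBody(item.toString())
                    .build());

            // 3. delete the item so later runs don't pick it up again
            ddb.deleteItem(DeleteItemRequest.builder()
                    .tableName(TABLE)
                    .key(Collections.singletonMap("id", AttributeValue.builder().s(id).build()))
                    .build());
        }
    }
}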
My understanding is that you want to read and delete an item in an atomic manner; however, that is not possible with DynamoDB.
What is possible, however, is deleting the item and being returned its value, which is more like a delete-then-read. As you correctly pointed out, the Mapper client does not support ReturnValues, but the low-level clients do.
Key keyToDelete = new Key().withHashKeyElement(new AttributeValue("214141"));
DeleteItemRequest dir = new DeleteItemRequest()
        .withTableName("ABC")
        .withKey(keyToDelete)
        .withReturnValues("ALL_OLD");
More info here: DeleteItemRequest
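For reference, the same delete-then-read with the current low-level client (AWS SDK for Java v2) might look roughly like this; table name "ABC" and a string key attribute "id" are assumed:
import java.util.Collections;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DeleteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.DeleteItemResponse;
import software.amazon.awssdk.services.dynamodb.model.ReturnValue;

DynamoDbClient ddb = DynamoDbClient.create();

DeleteItemResponse response = ddb.deleteItem(DeleteItemRequest.builder()
        .tableName("ABC")
        .key(Collections.singletonMap("id", AttributeValue.builder().s("214141").build()))
        .returnValues(ReturnValue.ALL_OLD) // return the deleted item's attributes
        .build());

// the deleted item's attributes; empty if no item existed under that key
Map<String, AttributeValue> deleted = response.attributes();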

Is Kafka Stream really Real Time?

I'm using Kafka Stream API to test some functionality.
I have a stream like:
KStream<String, UnifiedData> stream = builder.stream("topic", Consumed.with(Serdes.String(), new JsonSerde<>(Data.class)));
stream.groupBy((key, value) -> value.getMetadata().getId())
    .windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(1000)))
    .count()
    .toStream()
    .map((key, value) -> {
        System.out.println(value);
        return KeyValue.pair(key.toString(), value);
    });
I found 2 strange behaviours while producing some data to my topic:
First: I don't get an output for each message produced. For example, if I produce 20 messages with no delay, I just get a single 20 as output and not something like 1 2 3....
Second: there is about 20 seconds of delay between the time I produce my message and the time the System.out.println(value) prints the result in my console.
So, do you think this behaviour is normal? Or might I have a configuration problem with my Kafka?
I'm using Kafka 1.0.1, Kafka Stream 1.0.1, Java 8 and Spring-Boot
By default, Kafka Streams uses a cache to "deduplicate" consecutive outputs from an aggregation, in order to reduce the downstream load.
You can disable caching globally by setting cache.max.bytes.buffering=0 in your KafkaStreams config. As an alternative, it's also possible to disable the cache per store individually, by passing a Materialized parameter into the aggregation operator.
Furthermore, all caches are flushed on commit, and the default commit interval is 30 seconds. Thus, it makes sense that you see output after 30 seconds. If you disable caching, the commit interval will no longer have any impact on this behavior.
For more details see: https://kafka.apache.org/documentation/streams/developer-guide/memory-mgmt.html
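To make that concrete, a minimal sketch of both options, assuming your grouping key is a String and using a placeholder store name "count-store":
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

Properties props = new Properties();
// option 1: disable caching globally
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
// optionally also commit (and thus flush) more often than the 30-second default
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);

// option 2: disable caching only for this aggregation's store
stream.groupBy((key, value) -> value.getMetadata().getId())
    .windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(1000)))
    .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("count-store")
            .withCachingDisabled())
    .toStream();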

Kafka Stream count on time window not reporting zero values

I'm using Kafka Streams to calculate how many events occurred in the last 3 minutes using a hopping time window:
public class ViewCountAggregator {

    void buildStream(KStreamBuilder builder) {
        final Serde<String> stringSerde = Serdes.String();
        final Serde<Long> longSerde = Serdes.Long();

        KStream<String, String> views = builder.stream(stringSerde, stringSerde, "streams-view-count-input");
        KStream<String, Long> viewCount = views
            .groupBy((key, value) -> value)
            .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(3)).advanceBy(TimeUnit.MINUTES.toMillis(1)))
            .toStream()
            .map((key, value) -> new KeyValue<>(key.key(), value));

        viewCount.to(stringSerde, longSerde, "streams-view-count-output");
    }

    public static void main(String[] args) throws Exception {
        // some not so important initialization code
        ...
    }
}
When I run a consumer and push some messages to the input topic, it receives the following updates as time passes:
single 1
single 1
single 1
five 1
five 4
five 5
five 4
five 1
Which is almost correct, but it never receives updates for:
single 0
five 0
Without it my consumer that updates a counter will never set it back to zero when there are no events for a longer period of time. I'm expecting consumed messages to look like this:
single 1
single 1
single 1
single 0
five 1
five 4
five 5
five 4
five 1
five 0
Is there some configuration option / argument I'm missing that would help me achieve such behavior?
Which is almost correct, but it never receives updates for:
First, the computed output is correct.
Second, why is it correct:
If you apply a windowed aggregate, only those windows that have actual content are created (all other systems I am familiar with produce the same output). Thus, if for some key there is no data for a time period longer than the window size, no window is instantiated, and thus there is no count at all.
The reason not to instantiate windows when there is no content is quite simple: the processor cannot know all keys. In your example, you have two keys, but maybe later on a third key might come up. Would you expect to get <thirdKey,0> from the beginning on? Also, as data streams are infinite in nature, keys might go away and never reappear. If you remember all seen keys and emit <key,0> when there is no data for a key that disappeared, would you emit <key,0> forever?
I don't want to say that your expected result/semantics doesn't make sense. It's just a very specific use case of yours and not applicable in general. Hence, stream processors don't implement it.
Third: What can you do?
There are multiple options:
Your consumer can keep track of which keys it has seen and, using the embedded record timestamps, figure out when a key is "missing", then set the counter to zero for this key (for this, it might also help to remove the map step and preserve the Windowed<K> type for the key, so that the consumer knows which window a record belongs to)
Add a stateful transform() step in your Streams application that does the same thing as described in (1). For this, it might be helpful to register a punctuation callback.
Approach (2) should make it easier to track keys, as you can attach a state store to your transform step and thus don't need to deal with state (and failure/recovery) in your downstream consumer.
However, the tricky part for both approaches is still deciding when a key is missing, i.e., how long you wait until you produce <key,0>. Note that data might arrive late (aka out-of-order), and even if you did emit <key,0>, a late-arriving record might produce a <key,1> message after your code emitted the <key,0> record. But maybe this is not really an issue in your case, as it seems you use the latest window only anyway.
Last but not least, one more comment: it seems that you are using only the latest count and that newer windows overwrite older windows in your downstream consumer. Thus, it might be worth exploring "Interactive Queries" to tap into the state of your count operator directly, instead of consuming the topic and updating some other state. This might allow you to redesign and simplify your downstream application significantly. Check out the docs and a very good blog post about Interactive Queries for more details.
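For example, a rough Interactive Queries sketch; this assumes the count is given a queryable store name, e.g. count(TimeWindows.of(...).advanceBy(...), "view-count-store"), and that streams is your running KafkaStreams instance:
import java.util.concurrent.TimeUnit;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

ReadOnlyWindowStore<String, Long> store =
        streams.store("view-count-store", QueryableStoreTypes.windowStore());

long now = System.currentTimeMillis();
// all windows for key "five" that started within the last 3 minutes; a missing
// window simply yields no entry, which your code can interpret as a count of 0
try (WindowStoreIterator<Long> it = store.fetch("five", now - TimeUnit.MINUTES.toMillis(3), now)) {
    while (it.hasNext()) {
        KeyValue<Long, Long> entry = it.next(); // key = window start timestamp, value = count
        System.out.println(entry.key + " -> " + entry.value);
    }
}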

Emitting values from ktable and modifying it

I'm trying to solve the following problem with kafka.
There is a topic; let's call it src-topic. I receive records from this topic from time to time. I would like to store those values in a KTable and emit the values stored in the KTable every 10 seconds to dst-topic. When I emit a value from this KTable for the first time, I want to append 1 to the record I emit. Every subsequent time, I would like to append 0 to the emitted record.
I'm looking for a correct and preferably idiomatic solution to this issue.
One of the solutions I see is to emit a record with 1 appended when I ingest from src-topic, and then store in the KTable the record with 0 appended. Another thread would then read from this KTable and emit the records regularly. The problem with this approach is that it has a race condition.
Any advice will be appreciated.
There is no straightforward way to do this. Note, a KTable is a changelog stream (it might have a table state internally -- not all KTables do have a state -- but that's an implementation detail).
Thus, a KTable is a stream, and you cannot flush a stream... And because the state (if there is any) is internal, you cannot flush the state either.
You can only access the state via Interactive Queries, which also allow you to do a range scan. However, this will not emit anything downstream, but rather gives the data to the "non-Streams part" of your application.
I think, you will need to use low-level Processor API to get the result you want.
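For illustration, a rough low-level Processor sketch of that idea, assuming Kafka Streams 1.0+ for the punctuation API; the store names, the String value type and the "append 1/0" convention are assumptions to adapt:
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class PeriodicEmitProcessor extends AbstractProcessor<String, String> {

    private KeyValueStore<String, String> latest;    // latest value per key (the "table")
    private KeyValueStore<String, Boolean> emitted;  // has this key been emitted before?

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        latest = (KeyValueStore<String, String>) context.getStateStore("latest-store");
        emitted = (KeyValueStore<String, Boolean>) context.getStateStore("emitted-store");

        // every 10 seconds of wall-clock time, emit all stored values downstream
        context.schedule(10_000L, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, String> it = latest.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    boolean first = emitted.get(entry.key) == null;
                    // append "1" on the first emission of a key, "0" afterwards
                    context.forward(entry.key, entry.value + (first ? "1" : "0"));
                    emitted.put(entry.key, Boolean.TRUE);
                }
            }
        });
    }

    @Override
    public void process(String key, String value) {
        // only update the table state; all emitting happens in the punctuation
        latest.put(key, value);
    }
}
The two stores would be created with Stores.keyValueStoreBuilder(...) and wired together with a source on src-topic and a sink on dst-topic via Topology#addSource/addStateStore/addProcessor/addSink (or the TopologyBuilder equivalents on older releases). Since processing and punctuation run on the same stream thread, this avoids the race condition of the two-thread approach.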

Kafka KTable - shared aggregation across machines

Assume that I have a topic with numerous partitions. I'm writing K/V data to it and want to aggregate that data by key in tumbling windows.
Assume that I've launched as many worker instances as I have partitions and each worker instance is running on a separate machine.
How would I go about ensuring that the resultant aggregations include all values for each key? I.e., I don't want each worker instance to have some subset of the values.
Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?
How would I go about ensuring that the resultant aggregations include all values for each key? I.e., I don't want each worker instance to have some subset of the values.
In general, Kafka Streams ensures that all values for the same key will be processed by the same (and only one) stream task, which also means only one application instance (what you described as "worker instance") will process the values for that key. Note that an app instance may run 1+ stream tasks, but these tasks are isolated.
This behavior is achieved through the partitioning of the data, and Kafka Streams ensures that a partition is always processed by the same and only one stream task. The logical link to keys/values is that, in Kafka and Kafka Streams, a key is always sent to the same partition (there is a gotcha here, but I'm not sure whether it makes sense to go into details for the scope of this question), hence one particular partition -- among possible many partitions -- contains all the values for the same key.
In some situations, such as when joining two streams A and B, you must ensure that the aggregation operates on the same key, so that data from both streams is co-located in the same stream task -- which, again, is all about ensuring that the relevant input stream partitions, and thus the matching keys (from A and B, respectively), are made available to the same stream task. A typical method you'd use here is selectKey(). Once that is done, Kafka Streams ensures that, for joining the two streams A and B as well as for creating the joined output stream, all values for the same key will be processed by the same stream task and thus the same application instance.
Example:
Stream A has key userId with value { georegion }.
Stream B has key georegion with value { continent, description }.
Joining two streams only works (as of Kafka 0.10.0) when both streams use the same key. In this example, this means that you must re-key (and thus re-partition) stream A so that the resulting key is changed from userId to georegion. Otherwise, as of Kafka 0.10, you can't join A and B because data is not co-located in the stream task that is responsible for actually performing the join.
In this example, you could re-key/re-partition stream A via:
// Kafka 0.10.0.x (latest stable release as of Sep 2016)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId)).through("rekeyed-topic")
// Upcoming versions of Kafka (not released yet)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId))
The through() call is only required in Kafka 0.10.0 to actually trigger re-partitioning; later versions of Kafka will do this automatically for you (this upcoming functionality is already completed and available in Kafka trunk).
Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?
In general, no. The behavior above is achieved through partitioning, not through state stores.
Sometimes state stores are involved because of the operations you have defined for a stream, which might explain why you were asking this question. For example, a windowing operation will require state to be managed, and thus a state store will be created behind the scenes. But your actual question -- "insuring that the resultant aggregations include all values for each key" -- has nothing to do with state stores, it's about the partitioning behavior.
With worker instance, I assume you mean a Kafka Streams application instance, right? (Because there is no master/worker pattern in Kafka Streams -- it's a library and not a framework -- we do not use the term "worker".)
If you want to co-locate data per key, you need to partition the data by key. Thus, either your data is partitioned by key by your external producer when it gets written into the topic in the first place, or you explicitly set a new key within your Kafka Streams application (using, for example, selectKey() or map()) and re-distribute the data via a call to through().
(The explicit call to through() will not be necessary in future releases, i.e., 0.10.1+: Kafka Streams will re-distribute records automatically if necessary.)
If messages/records should be partitioned by key, the key must not be null. You can also change the partitioning scheme via the producer configuration partitioner.class (see https://kafka.apache.org/documentation.html#producerconfigs).
Partitioning is completely independent from StateStores, even if StateStores are usually used on top of partitioned data.
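To illustrate the last point, a minimal producer-side sketch of overriding the partitioning scheme; MyCustomPartitioner is a hypothetical class implementing org.apache.kafka.clients.producer.Partitioner:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// route records with a non-null key via your own partitioning logic
producerProps.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, MyCustomPartitioner.class.getName());

Producer<String, String> producer = new KafkaProducer<>(producerProps);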
