Kafka KTable - shared aggregation across machines - java

Assume that I have a topic with numerous partitions. Im writing K/V data in there and want to aggregate said data in Tumbling Windows by keys.
Assume that I've launched as many worker instances as I have partitions and each worker instance is running on a separate machine.
How would I go about insuring that the resultant aggregations include all values for each key? IE I don't want each worker instance to have some subset of the values.
Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?

How would I go about insuring that the resultant aggregations include all values for each key? IE I don't want each worker instance to have some subset of the values.
In general, Kafka Streams ensures that all values for the same key will be processed by the same (and only one) stream task, which also means only one application instance (what you described as "worker instance") will process the values for that key. Note that an app instance may run 1+ stream tasks, but these tasks are isolated.
This behavior is achieved through the partitioning of the data, and Kafka Streams ensures that a partition is always processed by the same and only one stream task. The logical link to keys/values is that, in Kafka and Kafka Streams, a key is always sent to the same partition (there is a gotcha here, but I'm not sure whether it makes sense to go into details for the scope of this question), hence one particular partition -- among possible many partitions -- contains all the values for the same key.
In some situations, such as when joining two streams A and B, you must ensure though that the aggregation will operate on the same key to ensure that data from both streams are co-located in the same stream task -- which, again, is all about ensuring that the relevant input stream partitions and thus matching the keys (from A and B, respectively) are made available in the same stream task. A typical method you'd use here is selectKey(). Once that is done, Kafka Streams ensures that, for joining the two streams A and B as well as for creating the joined output stream, all values for the same key will be processed by the same stream task and thus the same application instance.
Example:
Stream A has key userId with value { georegion }.
Stream B has key georegion with value { continent, description }.
Joining two streams only works (as of Kafka 0.10.0) when both streams use the same key. In this example, this means that you must re-key (and thus re-partition) stream A so that the resulting key is changed from userId to georegion. Otherwise, as of Kafka 0.10, you can't join A and B because data is not co-located in the stream task that is responsible for actually performing the join.
In this example, you could re-key/re-partition stream A via:
// Kafka 0.10.0.x (latest stable release as of Sep 2016)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId)).through("rekeyed-topic")
// Upcoming versions of Kafka (not released yet)
A.map((userId, georegion) -> KeyValue.pair(georegion, userId))
The through() call is only required in Kafka 0.10.0 to actually trigger re-partitioning, and later versions of Kafka will do these automatically for you (this upcoming functionality is already completed and available in Kafka trunk).
Is this something that a StateStore would be used for? Does Kafka manage this on its own or do I need to come up with a method?
In general, no. The behavior above is achieved through partitioning, not through state stores.
Sometimes state stores are involved because of the operations you have defined for a stream, which might explain why you were asking this question. For example, a windowing operation will require state to be managed, and thus a state store will be created behind the scenes. But your actual question -- "insuring that the resultant aggregations include all values for each key" -- has nothing to do with state stores, it's about the partitioning behavior.

With worker instance, I assume you mean a Kafka Streams application instance, right? (Because there is no master/worker pattern in Kafka Streams -- it's a library and not a framework -- we do not use the term "worker".)
If you want to co-locate data per key, you need to partition the data by key. Thus, either your data is partitioned by key by your external producer when data gets written into a topic from the beginning on. Or you explicitly set a new key within Kafka Streams application (using for example selectKey() or map()) and re-distributed via a call to through().
(The explicit call to through() will not be necessary in future releases, ie, 0.10.1 and Kafka Streams will re-distribute records automatically if necessary.)
If messages/record should be partitioned, the key must not be null. You can also change the partitioning schema via producer configuration partitioner.class (see https://kafka.apache.org/documentation.html#producerconfigs).
Partitioning is completely independent from StateStores, even if StateStores are usually used on top of partitioned data.

Related

Is it possible to restore a Kafka Streams state store after a restart without using changelog topics?

We have 2 compacted topics, each containing terabytes of data, which we want to join using Spring Cloud Stream and Kafka Streams. The (simplified) code looks like this:
#Bean
public BiConsumer<KTable<String, LeftEvent>, KTable<String, RightEvent>> processEvents() {
return ((leftEvents, rightEvents) -> {
leftEvents.join(rightEvents, this::merge)
.toStream()
.foreach(this::process);
});
}
The problem with this approach is that using KTables as input parameters results in the creation of changelog topics which essentially duplicate the source topics since, as mentioned above, both of these topics are already compacted. To avoid duplicating terabytes of data in Kafka, our first attempt was to use KStreams as inputs instead, and to transform them into KTables as follows:
stream.toTable(
Materialized
.<K, V, KeyValueStore<Bytes, byte[]>>as(stateStoreName)
.withLoggingDisabled()
);
thereby disabling logging and hence dispensing with the changelog topics, which in our context seem useless.
However, the following scenario now no longer works:
Generate a LeftEvent with key k1
Restart the application
Generate a RightEvent with key k1
The events are no longer joined, although the join works fine if the application is not restarted in-between (i.e. step 1, then 3).
When the application restarts, we would have expected the state stores to be reconstructed from the souce topics in the absence of changelog topics, but this is apparently not the case. In some occasions, we observed that rocksDB files (located in /tmp/kafka-streams/...) were used to retrieve data consumed prior to the restart, however we cannot assume that these files will still be available after a restart since we are working in a containerized environment.
Is there a way to support restarts (and achieve fault tolerance) without having to use changelog topics, which in our case duplicate the input topics? If not, we might have to reconsider our use of Kafka Streams...
You want to enable optimization of the Kafka Streams: https://docs.confluent.io/platform/current/streams/developer-guide/optimizing-streams.html#optimization-details (#1 is what you are looking for).
Currently, there are two optimizations that Kafka Streams performs when enabled:
The source KTable re-uses the source topic as the changelog topic.
When possible, Kafka Streams collapses multiple repartition topics into a single repartition topic.
A key thing to point out, since I have made this mistake myself, is do not forget to send the configuration to both the build() and to the construction of KStreams (optimization, as indicated here from the link provided) is done in the build.
// tell Kafka Streams to optimize the topology
config.setProperty(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
// Since we've configured Streams to use optimizations, the topology is optimized during the build.
// And because optimizations are enabled, the resulting topology will no longer need to perform
// three explicit repartitioning steps, but only one.
final Topology topology = builder.build(config);
final KafkaStreams streams = new KafkaStreams(topology, config);
Now optimization is enabled for all of the Topology, so if you keep that in mind that the #2 optimization is also performed.

Apache Flink: Ordered timestamps with parallelism

I have a datastream in which the order of the events is important. The time characteristic is set to EventTime as the incoming records have a timestamp within them.
In order to guarantee the ordering, I set the parallelism for the program to 1. Could that become a problem, performance wise, when my program gets more complex?
If I understand correctly, I need to assign watermarks to my events, if I want to keep the stream ordered by timestamp. This is quite simple. But I'm reading that even that doesn't guarantee order? Later on, I want to do stateful computations over that stream. So, for that I use a FlatMap function, which needs the stream to be keyed. But if I key the stream, the order is lost again. AFAIK this is because of different stream partitions, which are "caused" by parallelism.
I have two questions:
Do I need parallelism? What factors do I need to consider here?
How would I achieve "ordered parallelism" with what I described above?
Several points to consider:
Setting the parallelism to 1 for the entire job will prevent scaling your application, which will affect performance. Whether this actually matters depends on your application requirements, but it would certainly be limitation, and could be a problem.
If the aggregates you've mentioned are meant to be computed globally across all the event records then operating in parallel will require doing some pre-aggregation in parallel. But in this case you will then have to reduce the parallelism to 1 in the later stages of your job graph in order to produce the ultimate (global) results.
If on the other hand these aggregates are to be computed independently for each value of some key, then it makes sense to consider keying the stream and to use that partitioning as the basis for operating in parallel.
All of the operations you mention require some state, whether computing max, min, averages, or uptime and downtime. For example, you can't compute the maximum without remembering the maximum encountered so far.
If I understand correctly how Flink's NiFi source connector works, then if the source is operating in parallel, keying the stream will result in out-of-order events.
However, none of the operations you've mentioned require that the data be delivered in-order. Computing uptime (and downtime) on an out-of-order stream will require some buffering -- these operations will need to wait for out-of-order data to arrive before they can produce results -- but that's certainly doable. That's exactly what watermarks are for; they define how long to wait for out-of-order data. You can use an event-time timer in a ProcessFunction to arrange for an onTimer callback to be called when all earlier events have been processed.
You could always sort the keyed stream. Here's an example.
The uptime/downtime calculation should be easy to do with Flink's CEP library (which sorts its input, btw).
UPDATE:
It is true that after applying a ProcessFunction to a keyed stream the stream is no longer keyed. But in this case you could safely use reinterpretAsKeyedStream to inform Flink that the stream is still keyed.
As for CEP, this library uses state on your behalf, making it easier to develop applications that need to react to patterns.

Kafka Streams windowing aggregation batching

I have Kafka Streams processing in my application:
myStream
.mapValues(customTransformer::transform)
.groupByKey(Serialized.with(new Serdes.StringSerde(), new SomeCustomSerde()))
.windowedBy(TimeWindows.of(10000L).advanceBy(10000L))
.aggregate(CustomCollectorObject::new,
(key, value, aggregate) -> aggregate.collect(value),
Materialized.<String, CustomCollectorObject, WindowStore<Bytes, byte[]>>as("some_store_name")
.withValueSerde(new CustomCollectorSerde()))
.toStream()
.foreach((k, v) -> /* do something very important */);
Expected behavior: incoming messages are grouped by key and within some time interval are aggregated in CustomCollectorObject. CustomCollectorObject is just a class with a List inside. After every 10 seconds in foreach I'm doing something very important with my aggregated data. What is very important I expect that foreach is called every 10 seconds!
Actual behavior: I can see that processing in my foreach is called rarer, approx every 30-35 seconds, it doesn't matter much. What is very important, I receive 3-4 messages at once.
The question is: how can I reach the expected behavior? I need to my data was processed at runtime without any delays.
I've tried to set cache.max.bytes.buffering: 0 but in this case windowing doesn't work at all.
Kafka Streams has a different execution model and provides different semantics, ie, your expectation don't match what Kafka Streams does. There are multiple similar questions already:
How to send final kafka-streams aggregation result of a time windowed KTable?
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/blog/streams-tables-two-sides-same-coin
Also note, that the community is currently working on a new operator called suppress() that will be able to provide the semantics you want: https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
For now, you would need to add a transform() with a state store, and use punctuations to get the semantics you want (c.f. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor)

Should I care about dynamodb stream shards if I process stream events by lambda?

The dynamodb documentation says that there are shards and they are needed to be iterated first, then for each shard it is needed to get number of records.
The documentation also says:
(If you use the DynamoDB Streams Kinesis Adapter, this is handled for you: Your application will process the shards and stream records in the correct order, and automatically handle new or expired shards, as well as shards that split while the application is running. For more information, see Using the DynamoDB Streams Kinesis Adapter to Process Stream Records.)
Ok, But I use lambda not kinesis (ot they relates to each other?) and if a lambda function is attached to dynamodb stream should I care about shards ot not? Or I should just write labda code and expect that aws environment pass just some records to that lambda?
When using Lambda to consume a DynamoDB Stream the work of polling the API and keeping track of shards is all handled for you automatically. If your table has multiple shards then multiple Lambda functions will be invoked. From your prospective as a developer you just have to write the code for your Lambda function and the rest is taken care for you.
In-order processing is still guaranteed by DynamoDB streams so with a single shard will have only one instance of your Lambda function will be invoked at a time. However, with multiple shards you may see multiple instances of your Lambda function running at the same time. This fan-out is transparent and may cause issues or lead to surprising behaviors if you are not aware of it while coding your Lambda function.
For a deeper explanation of how this works I'd recommend the YouTube video AWS re:Invent 2016: Real-time Data Processing Using AWS Lambda (SVR301). While the focus is mostly on Kinesis Streams the same concepts for consuming DynamoDB Streams apply as the technology is nearly identical.
We use DynamoDB to process close to billion of records everyday and autoexpire those records and send to streams.
Everything is taken care by AWS and we don't need to do anything, except configuring streams (what type of image you want) and adding triggers.
The only fine tuning we did is,
When you get more data, we just increased the batch size to process faster and reduce the overhead on the number of calls to Lambda.
If you are using any external process to iterate over the stream, you might need to do the same.
Reference:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
Hope it helps.

Partition aware hazelcast queue

Is there anyway we can perform a partition aware operation on Hazelcast distributed queue?
So for example, I would have multiple consumer nodes on a queue, and would expect 'similar' type of messages to be processed by the same node everytime. By similar type, I mean some business key for the message.
Currently we are having a distributed streaming data processing ecosystem, by consuming messages from local entry listener on a IMap. A particular object model property is set as the key, so we know the models are distributed in partitions key wise. The processing logic can thus be executed locally and without using a distributed lock (or any lock at all as per design contract). I would expect the similar behaviour using a distributed blocking queue instead.
Is that feasible? Using Hazelcast 3.3.3
You know that the queue (currently) is not a partitioned data-structure? So there is a single member in the cluster responsible for storing all data in that queue (and another member for the backup).
You can control where a queue is stored, e.g. if you have 2 queues and want them to be stored in the same partition, use
foo#somekey
bar#somekey
In this case both queues are stored in the 'somekey' partition.

Categories