Is Kafka Streams really real time?

I'm using the Kafka Streams API to test some functionality.
I have a stream like:
KStream<String, UnifiedData> stream = builder.stream("topic", Consumed.with(Serdes.String(), new JsonSerde<>(Data.class)));
stream.groupBy((key, value) -> value.getMetadata().getId())
      .windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(1000)))
      .count()
      .toStream()
      .map((key, value) -> {
          System.out.println(value);
          return KeyValue.pair(key.toString(), value);
      });
I found 2 strange behaviours while producing some data to my topic:
First: I don't get an output for each message produced. For example, if I produce 20 messages with no delay, I will just get a single 20 as output and not something like 1 2 3...
Second: There is about 20 seconds of delay between the time I produce my message and the time the System.out.println(value) prints the result in my console.
So, do you think this behaviour is totally normal? Or might I have a configuration problem with my Kafka?
I'm using Kafka 1.0.1, Kafka Streams 1.0.1, Java 8 and Spring Boot.

By default, Kafka Streams uses a cache to "deduplicate" consecutive outputs of an aggregation and reduce the downstream load.
You can disable caching globally by setting cache.max.bytes.buffering=0 in your KafkaStreams config. As an alternative, it's also possible to disable the cache per store individually, by passing a Materialized parameter into the aggregation operator.
Furthermore, all caches are flushed on commit and the default commit interval is 30 seconds. Thus, it makes sense that you see output after 30 seconds. If you disable caching, commit interval will not have any impact on the behavior any longer.
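For example, a minimal sketch of both options (the store name "event-counts" and the exact property values are illustrative, not taken from your setup; with Spring Boot you may prefer to set the equivalent properties in your application configuration):
// Option 1: disable caching globally (and optionally commit more often than the 30 s default)
Properties props = new Properties();
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);

// Option 2: disable the cache only for this store, via Materialized
// (assumes getMetadata().getId() returns a String key)
stream.groupBy((key, value) -> value.getMetadata().getId())
      .windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(1000)))
      .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("event-counts")
              .withCachingDisabled())
      .toStream();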
For more details see: https://kafka.apache.org/documentation/streams/developer-guide/memory-mgmt.html

Related

Is it possible to restore a Kafka Streams state store after a restart without using changelog topics?

We have 2 compacted topics, each containing terabytes of data, which we want to join using Spring Cloud Stream and Kafka Streams. The (simplified) code looks like this:
@Bean
public BiConsumer<KTable<String, LeftEvent>, KTable<String, RightEvent>> processEvents() {
    return (leftEvents, rightEvents) -> {
        leftEvents.join(rightEvents, this::merge)
                  .toStream()
                  .foreach(this::process);
    };
}
The problem with this approach is that using KTables as input parameters results in the creation of changelog topics which essentially duplicate the source topics since, as mentioned above, both of these topics are already compacted. To avoid duplicating terabytes of data in Kafka, our first attempt was to use KStreams as inputs instead, and to transform them into KTables as follows:
stream.toTable(
    Materialized
        .<K, V, KeyValueStore<Bytes, byte[]>>as(stateStoreName)
        .withLoggingDisabled()
);
thereby disabling logging and hence dispensing with the changelog topics, which in our context seem useless.
However, the following scenario now no longer works:
Generate a LeftEvent with key k1
Restart the application
Generate a RightEvent with key k1
The events are no longer joined, although the join works fine if the application is not restarted in-between (i.e. step 1, then 3).
When the application restarts, we would have expected the state stores to be reconstructed from the source topics in the absence of changelog topics, but this is apparently not the case. On some occasions, we observed that RocksDB files (located in /tmp/kafka-streams/...) were used to retrieve data consumed prior to the restart, but we cannot assume that these files will still be available after a restart since we are working in a containerized environment.
Is there a way to support restarts (and achieve fault tolerance) without having to use changelog topics, which in our case duplicate the input topics? If not, we might have to reconsider our use of Kafka Streams...
You want to enable topology optimization in Kafka Streams: https://docs.confluent.io/platform/current/streams/developer-guide/optimizing-streams.html#optimization-details (#1 is what you are looking for).
Currently, there are two optimizations that Kafka Streams performs when enabled:
The source KTable re-uses the source topic as the changelog topic.
When possible, Kafka Streams collapses multiple repartition topics into a single repartition topic.
A key thing to point out, since I have made this mistake myself: do not forget to pass the configuration both to build() and to the KafkaStreams constructor; the optimization (as indicated in the link provided) is performed during build().
// tell Kafka Streams to optimize the topology
config.setProperty(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
// Since we've configured Streams to use optimizations, the topology is optimized during the build.
// And because optimizations are enabled, the resulting topology will no longer need to perform
// three explicit repartitioning steps, but only one.
final Topology topology = builder.build(config);
final KafkaStreams streams = new KafkaStreams(topology, config);
Now optimization is enabled for the whole topology, so keep in mind that optimization #2 is also performed.
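If you want to double-check that the optimization actually kicked in, you can print the description of the built topology and compare it with the non-optimized version (and/or check which internal topics get created on the broker); this is just a sanity check, not required:
// Describe the final (optimized) topology: sources, processors, sinks and state stores
final Topology topology = builder.build(config);
System.out.println(topology.describe());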

Kafka Streams windowing aggregation batching

I have Kafka Streams processing in my application:
myStream
    .mapValues(customTransformer::transform)
    .groupByKey(Serialized.with(new Serdes.StringSerde(), new SomeCustomSerde()))
    .windowedBy(TimeWindows.of(10000L).advanceBy(10000L))
    .aggregate(CustomCollectorObject::new,
        (key, value, aggregate) -> aggregate.collect(value),
        Materialized.<String, CustomCollectorObject, WindowStore<Bytes, byte[]>>as("some_store_name")
            .withValueSerde(new CustomCollectorSerde()))
    .toStream()
    .foreach((k, v) -> /* do something very important */);
Expected behavior: incoming messages are grouped by key and aggregated in a CustomCollectorObject within some time interval. CustomCollectorObject is just a class with a List inside. Every 10 seconds, in foreach, I'm doing something very important with my aggregated data. What is very important: I expect foreach to be called every 10 seconds!
Actual behavior: I can see that the processing in my foreach is called more rarely, approx. every 30-35 seconds; the exact delay doesn't matter much. What is very important: I receive 3-4 messages at once.
The question is: how can I reach the expected behavior? I need my data to be processed at runtime without any delays.
I've tried to set cache.max.bytes.buffering: 0 but in this case windowing doesn't work at all.
Kafka Streams has a different execution model and provides different semantics, i.e., your expectations don't match what Kafka Streams does. There are multiple similar questions already:
How to send final kafka-streams aggregation result of a time windowed KTable?
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/blog/streams-tables-two-sides-same-coin
Also note, that the community is currently working on a new operator called suppress() that will be able to provide the semantics you want: https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
For now, you would need to add a transform() with a state store, and use punctuations to get the semantics you want (c.f. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor)
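One possible shape of that, as a rough sketch only: the store name "batch-store" and the MappedValue placeholder type (whatever your customTransformer::transform returns) are mine, builder is assumed to be the StreamsBuilder that myStream was created from, I assume collect() returns the updated aggregate as in your aggregate() call, and on pre-2.0 clients you may additionally have to override the deprecated Transformer#punctuate(long) and return null.
// Register a store for the transformer ("batch-store" is an illustrative name)
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("batch-store"),
        new Serdes.StringSerde(),
        new CustomCollectorSerde()));

myStream
    .mapValues(customTransformer::transform)
    .transform(() -> new Transformer<String, MappedValue, KeyValue<String, CustomCollectorObject>>() {
        private KeyValueStore<String, CustomCollectorObject> store;

        @SuppressWarnings("unchecked")
        @Override
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, CustomCollectorObject>) context.getStateStore("batch-store");
            // Flush everything collected so far every 10 seconds of wall-clock time
            // (use PunctuationType.STREAM_TIME if you want event-time driven emission instead).
            context.schedule(10_000L, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
                try (KeyValueIterator<String, CustomCollectorObject> it = store.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, CustomCollectorObject> entry = it.next();
                        context.forward(entry.key, entry.value);
                        store.delete(entry.key);
                    }
                }
            });
        }

        @Override
        public KeyValue<String, CustomCollectorObject> transform(String key, MappedValue value) {
            CustomCollectorObject agg = store.get(key);
            if (agg == null) {
                agg = new CustomCollectorObject();
            }
            store.put(key, agg.collect(value));   // assumes collect() returns the updated aggregate
            return null;                          // nothing emitted here; the punctuation does the emitting
        }

        @Override
        public void close() { }
    }, "batch-store")
    .foreach((k, v) -> { /* do something very important */ });
This collects values per key and flushes the whole store every 10 seconds, which is not exactly a hopping window, but it gives you output at a fixed cadence regardless of caching or the commit interval.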

Limit the throughput of a Reactor Flux reading a Mongodb collection

I am using Spring 5, in detail the Reactor project, to read information from a huge Mongo collection to a Kafka topic. Unfortunately, the production of Kafka messages is much faster than the program that consumes them. So, I need to implement some backpressure mechanism.
Suppose I want a throughput of 100 messages every second. Googling a little, I decided to combine the buffer(int maxSize) method with a Flux that emits a tick at a predefined interval, zipping the two together.
// Create a clock that emits an event every second
final Flux<Long> clock = Flux.interval(Duration.ofMillis(1000L));

// Create a buffered producer
final Flux<ProducerRecord<String, Data>> outbound =
    repository.findAll()
              .map(this::buildData)
              .map(this::createKafkaMessage)
              .buffer(100)
              // Limiting the emission in time interval
              .zipWith(clock, (msgs, tick) -> msgs)
              .flatMap(Flux::fromIterable);

// Subscribe a Kafka sender
kafkaSender.createOutbound()
           .send(outbound)
           .then()
           .block();
Is there a smarter way to do this? I mean, it seems to me a little bit complex (the zip part, overall).
Yes, you can use the delayElements(Duration.ofSeconds(1)) operator directly, without the need to zipWith a clock. Reactor is continuously improving and new operators keep arriving, so it pays to stay up to date. Hope this was helpful!
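For reference, a sketch of how that could look with the same pipeline as in the question (no clock Flux, no zipWith):
// One buffer of up to 100 messages is emitted at most once per second
final Flux<ProducerRecord<String, Data>> outbound =
    repository.findAll()
              .map(this::buildData)
              .map(this::createKafkaMessage)
              .buffer(100)
              .delayElements(Duration.ofSeconds(1))
              .flatMap(Flux::fromIterable);

kafkaSender.createOutbound()
           .send(outbound)
           .then()
           .block();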

Kafka Stream count on time window not reporting zero values

I'm using Kafka Streams to calculate how many events occurred in the last 3 minutes using a hopping time window:
public class ViewCountAggregator {

    void buildStream(KStreamBuilder builder) {
        final Serde<String> stringSerde = Serdes.String();
        final Serde<Long> longSerde = Serdes.Long();

        KStream<String, String> views = builder.stream(stringSerde, stringSerde, "streams-view-count-input");
        KStream<String, Long> viewCount = views
            .groupBy((key, value) -> value)
            .count(TimeWindows.of(TimeUnit.MINUTES.toMillis(3)).advanceBy(TimeUnit.MINUTES.toMillis(1)))
            .toStream()
            .map((key, value) -> new KeyValue<>(key.key(), value));

        viewCount.to(stringSerde, longSerde, "streams-view-count-output");
    }

    public static void main(String[] args) throws Exception {
        // some not so important initialization code
        ...
    }
}
When running a consumer and pushing some messages to the input topic, it receives the following updates as time passes:
single 1
single 1
single 1
five 1
five 4
five 5
five 4
five 1
Which is almost correct, but it never receives updates for:
single 0
five 0
Without it my consumer that updates a counter will never set it back to zero when there are no events for a longer period of time. I'm expecting consumed messages to look like this:
single 1
single 1
single 1
single 0
five 1
five 4
five 5
five 4
five 1
five 0
Is there some configuration option / argument I'm missing that would help me achieve such behavior?
Which is almost correct, but it never receives updates for:
First, the computed output is correct.
Second, why is it correct:
If you apply a windowed aggregate, only those windows that actually have content are created (all other systems I am familiar with would produce the same output). Thus, if for some key there is no data for a time period longer than the window size, no window is instantiated and thus there is no count at all.
The reason for not instantiating windows if there is no content is quite simple: the processor cannot know all keys. In your example, you have two keys, but maybe later on a third key might come up. Would you expect to get <thirdKey,0> from the beginning on? Also, as data streams are infinite in nature, keys might go away and never reappear. If you remember all seen keys, and emit <key,0> if there is no data for a key that disappeared, would you emit <key,0> forever?
I don't want to say that your expected result/semantics do not make sense. It's just a very specific use case of yours and not applicable in general. Hence, stream processors don't implement it.
Third: What can you do?
There are multiple options:
Your consumer can keep track of which keys it has seen and, using the embedded record timestamps, figure out if a key is "missing", then set the counter to zero for this key (for this, it might also help to remove the map step and preserve the Windowed<K> type for the key, so that the consumer knows which window a record belongs to)
Add a stateful transform() step in your Streams application that does the same thing as described in (1). For this, it might be helpful to register a punctuation callback.
Approach (2) should make it easier to track keys, as you can attach a state store to your transform step and thus don't need to deal with state (and failure/recovery) in your downstream consumer.
However, the tricky part for both approaches is still deciding when a key is missing, i.e., how long do you wait until you produce <key,0>? Note that data might arrive late (aka out-of-order), and even if you did emit <key,0>, a late arriving record might produce a <key,1> message after your code emitted the <key,0> record. But maybe this is not really an issue for your case, as it seems you use the latest window only anyway.
Last but not least, one more comment: it seems that you are using only the latest count and that newer windows overwrite older windows in your downstream consumer. Thus, it might be worth exploring "Interactive Queries" to tap into the state of your count operator directly, instead of consuming the topic and updating some other state. This might allow you to redesign and simplify your downstream application significantly. Check out the docs and a very good blog post about Interactive Queries for more details.
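For illustration, querying the windowed count store directly could look roughly like this; it assumes you give the count an explicit store name (e.g. count(..., "view-counts")) and that you have access to the running KafkaStreams instance (the names here are illustrative):
// Look up the windowed count store by name from the running KafkaStreams instance
ReadOnlyWindowStore<String, Long> counts =
    streams.store("view-counts", QueryableStoreTypes.windowStore());

long now = System.currentTimeMillis();
long from = now - TimeUnit.MINUTES.toMillis(3);

// Fetch all windows for key "five" that started within the last 3 minutes;
// an empty iterator means there was no event for that key, i.e. the count is effectively 0
try (WindowStoreIterator<Long> it = counts.fetch("five", from, now)) {
    while (it.hasNext()) {
        KeyValue<Long, Long> entry = it.next();   // entry.key is the window start timestamp
        System.out.println("window " + entry.key + " -> count " + entry.value);
    }
}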

Kafka KStreams - processing timeouts

I am attempting to use <KStream>.process() with a TimeWindows.of("name", 30000) to batch up some KTable values and send them on. It seems that 30 seconds exceeds the consumer timeout interval after which Kafka considers said consumer to be defunct and releases the partition.
I've tried upping the frequency of poll and commit interval to avoid this:
config.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "5000");
config.put(StreamsConfig.POLL_MS_CONFIG, "5000");
Unfortunately these errors are still occurring:
(lots of these)
ERROR o.a.k.s.p.internals.RecordCollector - Error sending record to topic kafka_test1-write_aggregate2-changelog
org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for kafka_test1-write_aggregate2-changelog-0
Followed by these:
INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator 12.34.56.7:9092 (id: 2147483547 rack: null) dead for group kafka_test1
WARN o.a.k.s.p.internals.StreamThread - Failed to commit StreamTask #0_0 in thread [StreamThread-1]:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:578)
Clearly I need to be sending heartbeats back to the server more often. How?
My topology is:
KStreamBuilder kStreamBuilder = new KStreamBuilder();
KStream<String, String> lines = kStreamBuilder.stream(TOPIC);

KTable<Windowed<String>, String> kt = lines.aggregateByKey(
    new DBAggregateInit(),
    new DBAggregate(),
    TimeWindows.of("write_aggregate2", 30000));

DBProcessorSupplier dbProcessorSupplier = new DBProcessorSupplier();
kt.toStream().process(dbProcessorSupplier);

KafkaStreams kafkaStreams = new KafkaStreams(kStreamBuilder, streamsConfig);
kafkaStreams.start();
The KTable is grouping values by key every 30 seconds. In Processor.init() I call context.schedule(30000).
DBProcessorSupplier provides an instance of DBProcessor. This is an implementation of AbstractProcessor where all the overrides have been provided. All they do is LOG so I know when each is being hit.
It's a pretty simple topology but it's clear I'm missing a step somewhere.
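In outline, the supplier/processor pair has this shape (trimmed-down sketch only; the real classes just log, as described above):
public class DBProcessorSupplier implements ProcessorSupplier<Windowed<String>, String> {

    private static final Logger LOG = LoggerFactory.getLogger(DBProcessorSupplier.class);

    @Override
    public Processor<Windowed<String>, String> get() {
        return new AbstractProcessor<Windowed<String>, String>() {

            @Override
            public void init(ProcessorContext context) {
                super.init(context);
                context.schedule(30000);   // request punctuate() every 30 seconds
                LOG.info("init");
            }

            @Override
            public void process(Windowed<String> key, String value) {
                LOG.info("process {} -> {}", key, value);
            }

            @Override
            public void punctuate(long timestamp) {
                LOG.info("punctuate at {}", timestamp);   // this is the call that rarely fires
            }

            @Override
            public void close() {
                LOG.info("close");
            }
        };
    }
}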
Edit:
I get that I can adjust this on the server side but I'm hoping there is a client-side solution. I like the notion of partitions being made available pretty quickly when a client exits / dies.
Edit:
In an attempt to simplify the problem I removed the aggregation step from the graph. It's now just consumer->processor(). (If I send the consumer directly to .print() it works very quickly, so I know it's OK). (Similarly, if I output the aggregation (KTable) via .print() it seems OK too).
What I found was that the .process(), which should be calling .punctuate() every 30 seconds, is actually blocking for variable lengths of time and outputting somewhat randomly (if at all).
Main program
Debug output
Processor Supplier
Processor
Further:
I set the log level to 'debug' and reran. I'm seeing lots of messages:
DEBUG o.a.k.s.p.internals.StreamTask - Start processing one record [ConsumerRecord <info>
but a breakpoint in the .punctuate() function isn't getting hit. So it's doing lots of work but not giving me a chance to use it.
A few clarifications:
StreamsConfig.COMMIT_INTERVAL_MS_CONFIG is a lower bound on the commit interval, i.e., after a commit, the next commit happens no earlier than this amount of time later. Basically, Kafka Streams tries to commit ASAP after this time has passed, but there is no guarantee whatsoever how long it will actually take until the next commit.
StreamsConfig.POLL_MS_CONFIG is used for the internal KafkaConsumer#poll() call, to specify the maximum blocking time of the poll() call.
Thus, both values are not helpful to heartbeat more often.
Kafka Streams follows a "depth-first" strategy when processing records. This means that after a poll(), all operators of the topology are executed for each record. Let's assume you have three consecutive maps; then all three maps will be called for the first record before the next/second record gets processed.
Thus, the next poll() call will be made after all records of the first poll() have been fully processed. If you want to heartbeat more often, you need to make sure that a single poll() call fetches fewer records, such that processing all records takes less time and the next poll() will be triggered earlier.
To get this done, you can set configuration parameters for the KafkaConsumer via StreamsConfig (see https://kafka.apache.org/documentation.html#consumerconfigs); a concrete example follows the two parameters below:
streamConfig.put(ConsumerConfig.XXX, VALUE);
max.poll.records: if you decrease this value, fewer records will be polled
session.timeout.ms: if you increase this value, there is more time for processing data (adding this for completeness because it is actually a client setting and not a server/broker side configuration -- even if you are aware of this solution and do not like it :))
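Putting that together for this case, a minimal sketch (the values are arbitrary starting points to tune; this goes into the same config object you pass to KafkaStreams):
// Fetch fewer records per poll() so each batch is processed (and committed) faster
streamsConfig.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
// Give each batch more time before the group coordinator considers the consumer dead
streamsConfig.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "60000");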
EDIT
As of Kafka 0.10.1 it is possible (and recommended) to prefix consumer and producer configs within the streams config. This avoids parameter conflicts, as some parameter names are used for both consumer and producer and cannot be distinguished otherwise (and would be applied to consumer and producer at the same time).
To prefix a parameter you can use StreamsConfig#consumerPrefix() or StreamsConfig#producerPrefix(), respectively. For example:
streamsConfig.put(StreamsConfig.consumerPrefix(ConsumerConfig.PARAMETER), VALUE);
One more thing to add: the scenario described in this question is a known issue, and there is already KIP-62 which introduces a background thread for KafkaConsumer that sends heartbeats, thus decoupling heartbeats from poll() calls. Kafka Streams will leverage this new feature in upcoming releases.
