flink tumbling window is not triggered (no watermark strategy) - java

Problem statement: stream events from a Kafka source. The event payloads are strings. Parse them into Documents and bulk-insert them into the DB every 5 seconds, based on event time.
The map() functions are executed, but program control never reaches apply(), so the bulk insert never happens. I tried both keyed and non-keyed windows; neither works, and no error is thrown.
flink version: 1.15.0
Below is the code for my main method. How should I fix this?
public static void main(String[] args) throws Exception {
    final Logger logger = LoggerFactory.getLogger(Main.class);
    final StreamExecutionEnvironment streamExecutionEnv = StreamExecutionEnvironment.getExecutionEnvironment();
    streamExecutionEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    KafkaConfig kafkaConfig = Utils.getAppConfig().getKafkaConfig();
    logger.info("main() Loading kafka config :: {}", kafkaConfig);

    KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
            .setBootstrapServers(kafkaConfig.getBootstrapServers())
            .setTopics(kafkaConfig.getTopics())
            .setGroupId(kafkaConfig.getConsumerGroupId())
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
    logger.info("main() Configured kafka source :: {}", kafkaSource);

    DataStreamSource<String> dataStream = streamExecutionEnv.fromSource(kafkaSource,
            WatermarkStrategy.noWatermarks(), "mySource");
    logger.info("main() Configured kafka dataStream :: {}", dataStream);

    DataStream<Document> dataStream1 = dataStream.map(new DocumentMapperFunction());
    DataStream<InsertOneModel<Document>> dataStream2 = dataStream1.map(new InsertOneModelMapperFunction());

    DataStream<Object> dataStream3 = dataStream2
            .windowAll(TumblingEventTimeWindows.of(Time.seconds(5), Time.seconds(0)))
            /*.keyBy(insertOneModel -> insertOneModel.getDocument().get("ackSubSystem"))
            .window(TumblingEventTimeWindows.of(Time.seconds(5)))*/
            .apply(new BulkInsertToDB2())
            .setParallelism(1);

    logger.info("main() before streamExecutionEnv execution");
    dataStream3.print();
    streamExecutionEnv.execute();
}

Use TumblingProcessingTimeWindows instead of event-time windows.
As David has mentioned, TumblingEventTimeWindows requires a watermark strategy.
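For example, the windowing step from the question could look like this (a minimal sketch reusing the names from the question; processing-time windows fire on wall-clock time, so no watermarks are needed):

// import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
DataStream<Object> dataStream3 = dataStream2
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .apply(new BulkInsertToDB2())
        .setParallelism(1);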

Event time windows require a watermark strategy. Without one, the windows are never triggered.
Furthermore, even with forMonotonousTimestamps, a given window will not be triggered until Flink has processed at least one event belonging to the following window from every Kafka partition. (If there are idle (or empty) Kafka partitions, you should use withIdleness to withdraw those partitions from the overall watermark calculations.)
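If you want to keep event-time windows, attach a watermark strategy at the source rather than noWatermarks(). A minimal sketch (the 20-second out-of-orderness bound and the 1-minute idleness timeout are example values, not taken from the question; without a withTimestampAssigner, the Kafka record timestamps are used as event time):

// import java.time.Duration;
// import org.apache.flink.api.common.eventtime.WatermarkStrategy;
WatermarkStrategy<String> watermarkStrategy = WatermarkStrategy
        .<String>forBoundedOutOfOrderness(Duration.ofSeconds(20)) // example bound for out-of-order events
        .withIdleness(Duration.ofMinutes(1));                     // keep idle/empty partitions from stalling the watermark

DataStreamSource<String> dataStream = streamExecutionEnv.fromSource(kafkaSource,
        watermarkStrategy, "mySource");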

Related

Kafka Streams application producing topic with same message

I am facing an issue with my Kafka streams application, where messages are being processed multiple times and the result topic is constantly receiving messages. This issue is only present in production and not in my local environment. Can you help me determine the root cause of this problem, based on the transformer code?
@Override
public KeyValue<String, UserClicks> transform(final String user, final Long clicks) {
    UserClicks userClicks = tempStore.get(user);
    if (userClicks != null) {
        userClicks.clicks += clicks;
    } else {
        final String region = regionStore.get(user).value();
        userClicks = new UserClicks(user, region, clicks);
    }
    if (userClicks.clicks < CLICKS_THRESHOLD) {
        tempStore.put(user, userClicks);
    } else {
        tempStore.delete(user);
    }
    return KeyValue.pair(user, userClicks);
}
When I remove the KStore from the transformer, everything seems to work fine.
Usually this problem occurs because Kafka can't save its state and keeps re-reading the same batch of messages. A KStore persists its state to a changelog topic by producing messages; if that producer can't produce for some reason, the new offset can never be committed.
To resolve the issue, change the minimum number of in-sync replicas to 1 or set the replication factor to 2. By default, Kafka Streams creates its internal topics with a replication factor of 1.
An easy way to configure this is through Conduktor: just go to the topic config and change the min.insync.replicas property.
It can also be done through the Kafka CLI by running this command:
kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name configured-topic --add-config min.insync.replicas=1
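If you prefer to fix it from the application side instead, you can raise the replication factor Kafka Streams uses for the internal topics it creates (a sketch; the value 3 is only an example, must not exceed the number of brokers, and only affects internal topics created after the change):

// import java.util.Properties;
// import org.apache.kafka.streams.StreamsConfig;
Properties props = new Properties();
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3); // replication factor for internal changelog/repartition topics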

Kafka read the whole topic in cmd JAVA app

I need to use Kafka Streams in a Java application that runs as a cronjob and reads the whole topic each time. Unfortunately, for some reason it commits the offset, and on the next run it reads only from the last committed offset. I have tried various ways, but unfortunately without success. My settings are as follows:
streamsConfiguration.put(APPLICATION_ID_CONFIG, "app_id");
streamsConfiguration.put(AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsConfiguration.put(ENABLE_AUTO_COMMIT_CONFIG, "false");
And I read the topic with the following code:
Consumed<String, String> with = Consumed.with(Serdes.String(), Serdes.String());
with.withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST);
final var stream = builder.stream("topic", with);
stream.foreach((key, value) -> {
    log.info("Key= {}, value= {}", key, value);
});
final var kafkaStreams = new KafkaStreams(builder.build(), kafkaStreamProperties);
kafkaStreams.cleanUp();
kafkaStreams.start();
But still, it reads from the latest offset.
Kafka Streams commits offsets regularly, so after you run the application the first time and shut it down, the next time you start it up Kafka Streams picks up at the last committed offset. That's standard Kafka consumer behavior. AUTO_OFFSET_RESET_CONFIG only applies when the consumer group has no committed offset; only in that case does it decide where to start reading.
So if you want to read from the beginning on the next startup, you can either use the application reset tool or change the application.id. If you supply the properties for the Kafka Streams application externally, you could automate generating a unique name each time.
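A minimal sketch of the second option, generating a throwaway application.id for each run (the random suffix is purely illustrative; since no committed offsets exist for the new id, AUTO_OFFSET_RESET_CONFIG = "earliest" takes effect):

// import java.util.UUID;
streamsConfiguration.put(APPLICATION_ID_CONFIG, "app_id-" + UUID.randomUUID());

Keep in mind that each run then leaves behind its own consumer group and internal topics, which need to be cleaned up eventually.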

Camel Application re-processing Kinesis records

The problem:
I have three instances of a java application running in Kubernetes. My application uses Apache Camel to read from a Kinesis stream. I'm currently observing two related issues:
Each of the three running instances of my application is processing the records coming into the stream, when I only want each record to be processed once (I want three up and running for scaling purposes). I was hoping that while one instance is processing record A, a second could be picking up record B, etc.
Every time my application is re-deployed in Kubernetes, each instance starts every record all over again (in other words, it has no idea where it left off or which records have previously been processed).
After 5 minutes, the shard iterator that my application uses to poll Kinesis times out. I know that this is normal behavior, but what I don't understand is why my application is not grabbing a new iterator (the error is visible in DataDog).
What I've tried:
First off, I believe that this issue is caused by inconsistent shard iterator ids, and kinesis consumer ids across the three instances of my application, and across deploys. However, I have been unable to locate where these values are set in code, and how I could go about setting them. Of course, there may also be a better solution altogether. I have found very little documentation on Kinesis/Kubernetes/Camel working together, and so very little outside sources have been helpful.
The documentation on AWS Kinesis :: Apache Camel is very limited, but what I have tried is playing around with the iterator type and building a custom client configuration.
Let me know if you need any additional information, thanks.
Configuring the client:
main.bind("kinesisClient", AmazonKinesisClientBuilder.defaultClient());
// ...
inputUri = String.format("aws-kinesis://%s?amazonKinesisClient=#kinesisClient", rawKinesisName);
main.configure().addRoutesBuilder(new RawDataRoute(inputUri, inputTransform));
My route:
public class RawDataRoute extends RouteBuilder {
    private static final Logger LOG = new Logger(RawDataRoute.class, true);

    private String rawDataStreamUri;
    private Expression transform;

    public RawDataRoute(final String rawDataStreamUri, final Expression transform) {
        this.rawDataStreamUri = rawDataStreamUri;
        this.transform = transform;
    }

    @Override
    public void configure() {
        // TODO add error handling
        from(rawDataStreamUri)
                .routeId("raw_data_stream")
                .transform(transform)
                .to("direct:main_input_stream");
    }
}

AWS Kinesis Client Java: Setting up TRIM_HORIZON Position in Stream does not work

I am running a test system that spawns a Kinesis producer which starts writing messages, e.g.: 1 through 100 to a stream with two shards.
During that cycle a consumer starts to read the messages from the stream. I noticed that the consumer only reads the LATEST messages that come into the stream after it's running. So for example, it starts reading at message 43. I tried modifying the Worker.class to use the TRIM_HORIZON Policy but this doesn't seem to be working.
KinesisClientLibConfiguration c = new KinesisClientLibConfiguration("MediaPlan", "randeepstream",
        DefaultAWSCredentialsProviderChain.getInstance(),
        "consumer1")
        .withInitialPositionInStream(InitialPositionInStream.TRIM_HORIZON);

final Worker w = new Worker.Builder()
        .recordProcessorFactory(rpf)
        .config(kinesisConfig)
        .build();
new Thread(() -> w.run()).start();
My consumer's processor is setup as:
public class ConsumerRecordProcessorImpl implements IRecordProcessor {
    public void initialize(InitializationInput initializationInput) {
        log.info("Setting up consumer with shard {} starting at {}", initializationInput.getShardId(),
                initializationInput.getExtendedSequenceNumber());
    }

    public void processRecords(ProcessRecordsInput processRecordsInput) {
        ...
    }
}
I would expect to see a message like:
Setting up consumer with shard shardId-000000000000 starting at TRIM_HORIZON 0
but instead I get:
Setting up consumer with shard shardId-000000000000 starting at LATEST 0
How do I get my consumer to stop reading the latest and read all unprocessed messages?
Here is an example using the amazon-kinesis-client library v2.
You will have to use a Scheduler (software.amazon.kinesis.coordinator), which reads records in the background, and provide a retrieval config to this scheduler as follows:
RetrievalConfig retrievalConfig = setRetrievalConfig();
Scheduler scheduler = new Scheduler(
        configsBuilder.checkpointConfig(),
        configsBuilder.coordinatorConfig(),
        configsBuilder.leaseManagementConfig(),
        configsBuilder.lifecycleConfig(),
        configsBuilder.metricsConfig(),
        configsBuilder.processorConfig(),
        retrievalConfig);

private RetrievalConfig setRetrievalConfig() {
    InitialPositionInStreamExtended initialPositionInStreamExtended =
            InitialPositionInStreamExtended.newInitialPosition(InitialPositionInStream.TRIM_HORIZON);
    RetrievalConfig retrievalConfig = configsBuilder.retrievalConfig()
            .retrievalSpecificConfig(new PollingConfig(streamName, kinesisClient));
    retrievalConfig.initialPositionInStreamExtended(initialPositionInStreamExtended);
    return retrievalConfig;
}
Notice the InitialPositionInStream.TRIM_HORIZON: this tells the scheduler to start consuming records from the last known position. So even if the consumer is down while the producer keeps running, all the records produced during the consumer's downtime will be consumed.
NOTE: configsBuilder is an instance of ConfigsBuilder (software.amazon.kinesis.common).
UPDATE: the initialPositionInStream position won't be updated unless you call the checkpoint() API after processing the data you received from Kinesis.
Once you call the checkpointer, the latest processed position is recorded in DynamoDB, and the KCL will continue processing records from that position.
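For reference, a sketch of that checkpoint call inside a KCL v2 ShardRecordProcessor implementation (the record handling is a placeholder; ProcessRecordsInput here is software.amazon.kinesis.lifecycle.events.ProcessRecordsInput and the exceptions come from software.amazon.kinesis.exceptions):

// Inside an implementation of software.amazon.kinesis.processor.ShardRecordProcessor:
@Override
public void processRecords(ProcessRecordsInput processRecordsInput) {
    processRecordsInput.records().forEach(record -> {
        // ... process the record ...
    });
    try {
        // Persist the latest position in the KCL lease table (DynamoDB) so a
        // restart resumes from here instead of the initial position.
        processRecordsInput.checkpointer().checkpoint();
    } catch (ShutdownException | InvalidStateException e) {
        // log and decide whether to retry or give up
    }
}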

Apache Flink 1.2 - Timestamp and Watermarks assignment for window aggregation

I'm trying to build a data stream processing system and I want to aggregate the data sent in the last minute from my sensors. The sensors send data to a Kafka server in the sensor topic and it is consumed by Flink.
I'm using a Python generator with the KafkaPython library and I send data in JSON format. The JSON contains a field sent with a timestamp. That value is generated in Python every 10 seconds using int(datetime.now().timestamp()), which I know returns a Unix timestamp in seconds.
The problem is that the system prints nothing! What am I doing wrong?
// set up the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// setting topic and processing the stream from Sensor
DataStream<Sensor> sensorStream = env.addSource(new FlinkKafkaConsumer010<>("sensor", new SimpleStringSchema(), properties))
        .flatMap(new ParseSensor()) // parsing into a Sensor object
        .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Sensor>() {
            @Override
            public long extractAscendingTimestamp(Sensor element) {
                return element.getSent() * 1000;
            }
        });

sensorStream.keyBy(meanSelector).window(TumblingEventTimeWindows.of(Time.minutes(1))).apply(new WindowMean(dataAggregation)).print();
During my attempts to make this work, I found the method .timeWindow() instead of .window(), and that worked! To be more precise, I wrote .timeWindow(Time.minutes(1)).
N.B.: even though Flink ran for 5 minutes, the window was printed only once!
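In other words, the windowing line from the question became (based on the description above; in Flink 1.2, timeWindow() picks event or processing time from the environment's time characteristic):

sensorStream.keyBy(meanSelector).timeWindow(Time.minutes(1)).apply(new WindowMean(dataAggregation)).print();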
