I'm trying to build a data stream processing system and I want to aggregate the data sent in the last minute by my sensors. The sensors send data to a Kafka server on the sensor topic, and it is consumed by Flink.
I'm using a Python generator with the KafkaPython library and I send the data in JSON format. The JSON contains a field named sent holding a timestamp. This field is generated in Python every 10 seconds using int(datetime.now().timestamp()), which I know returns a Unix timestamp in seconds.
The problem is that the system prints nothing! What am I doing wrong?
// set up the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// setting the topic and processing the stream from Sensor
DataStream<Sensor> sensorStream = env
        .addSource(new FlinkKafkaConsumer010<>("sensor", new SimpleStringSchema(), properties))
        .flatMap(new ParseSensor()) // parsing into a Sensor object
        .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Sensor>() {
            @Override
            public long extractAscendingTimestamp(Sensor element) {
                return element.getSent() * 1000; // seconds -> milliseconds
            }
        });

sensorStream
        .keyBy(meanSelector)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .apply(new WindowMean(dataAggregation))
        .print();
During my attempts to make this work, I found the method .timeWindow() instead of .window(), and it worked! To be more precise, I wrote .timeWindow(Time.minutes(1)).
N.B.: even though Flink ran for 5 minutes, the window was printed only once!
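For reference, a minimal sketch of that variant; with the event-time characteristic set, timeWindow(size) is documented as a shortcut for window(TumblingEventTimeWindows.of(size)), so in principle both forms should behave the same:
// The shortcut form that printed output; with TimeCharacteristic.EventTime set,
// this should be equivalent to window(TumblingEventTimeWindows.of(Time.minutes(1))).
sensorStream
        .keyBy(meanSelector)
        .timeWindow(Time.minutes(1))
        .apply(new WindowMean(dataAggregation))
        .print();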
I need to use Kafka Streams in a Java application that runs as a cron job and reads the whole topic each time. Unfortunately, for some reason, it commits the offsets, and on the next run it reads from the last committed offset. I have tried various ways, but unfortunately without success. My settings are as follows:
streamsConfiguration.put(APPLICATION_ID_CONFIG, "app_id");
streamsConfiguration.put(AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsConfiguration.put(ENABLE_AUTO_COMMIT_CONFIG, "false");
And I read the topic with the following code:
Consumed<String, String> with = Consumed.with(Serdes.String(), Serdes.String());
with.withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST);
final var stream = builder.stream("topic", with);
stream.foreach((key, value) -> {
log.info("Key= {}, value= {}", key, value);
});
final var kafkaStreams = new KafkaStreams(builder.build(), kafkaStreamProperties);
kafkaStreams.cleanUp();
kafkaStreams.start();
But still, it reads from the latest offset.
Kafka Streams commits offsets regularly, so after you run the application the first time and shut it down, the next time you start it up Kafka Streams will pick up at the last committed offset. That's standard Kafka behavior. AUTO_OFFSET_RESET_CONFIG only applies when a consumer doesn't find a committed offset; only then does it fall back to that config to decide where to start.
So if you want to reset it to read from the beginning the next time on startup, you can either use the application reset tool or change the application.id. If you get the properties for the Kafka Streams application externally, you could automate generating a unique name each time.
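A minimal sketch of that idea, assuming the base application id comes from your external configuration (externalProps and the suffix scheme here are just illustrative):
// Give each cron run a fresh application.id so Kafka Streams starts with no
// committed offsets and falls back to auto.offset.reset = earliest.
String baseAppId = externalProps.getProperty("application.id"); // hypothetical external config
String uniqueAppId = baseAppId + "-" + System.currentTimeMillis();

streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, uniqueAppId);
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");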
Problem statement: stream events from a Kafka source. The event payloads are strings. Parse them into Documents and batch-insert them into the DB every 5 seconds, based on event time.
The map() functions are executed, but program control never reaches apply(), so the bulk insert does not happen. I tried both keyed and non-keyed windows; neither works, and no error is thrown.
Flink version: 1.15.0
Below is the code for my main method. How should I fix this?
public static void main(String[] args) throws Exception {
    final Logger logger = LoggerFactory.getLogger(Main.class);
    final StreamExecutionEnvironment streamExecutionEnv = StreamExecutionEnvironment.getExecutionEnvironment();
    streamExecutionEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    KafkaConfig kafkaConfig = Utils.getAppConfig().getKafkaConfig();
    logger.info("main() Loading kafka config :: {}", kafkaConfig);

    KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
            .setBootstrapServers(kafkaConfig.getBootstrapServers())
            .setTopics(kafkaConfig.getTopics())
            .setGroupId(kafkaConfig.getConsumerGroupId())
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
    logger.info("main() Configured kafka source :: {}", kafkaSource);

    DataStreamSource<String> dataStream = streamExecutionEnv.fromSource(kafkaSource,
            WatermarkStrategy.noWatermarks(), "mySource");
    logger.info("main() Configured kafka dataStream :: {}", dataStream);

    DataStream<Document> dataStream1 = dataStream.map(new DocumentMapperFunction());
    DataStream<InsertOneModel<Document>> dataStream2 = dataStream1.map(new InsertOneModelMapperFunction());

    DataStream<Object> dataStream3 = dataStream2
            .windowAll(TumblingEventTimeWindows.of(Time.seconds(5), Time.seconds(0)))
            /*.keyBy(insertOneModel -> insertOneModel.getDocument().get("ackSubSystem"))
            .window(TumblingEventTimeWindows.of(Time.seconds(5)))*/
            .apply(new BulkInsertToDB2())
            .setParallelism(1);

    logger.info("main() before streamExecutionEnv execution");
    dataStream3.print();
    streamExecutionEnv.execute();
}
Use TumblingProcessingTimeWindows instead of event time windows.
As David has mentioned, TumblingEventTimeWindows requires a watermark strategy.
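A minimal sketch of the processing-time alternative, keeping everything else from the question and only swapping the window assigner on dataStream2:
// Processing-time windows fire on wall-clock time, so they trigger even with
// WatermarkStrategy.noWatermarks() on the source.
DataStream<Object> dataStream3 = dataStream2
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .apply(new BulkInsertToDB2())
        .setParallelism(1);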
Event time windows require a watermark strategy. Without one, the windows are never triggered.
Furthermore, even with forMonotonousTimestamps, a given window will not be triggered until Flink has processed at least one event belonging to the following window from every Kafka partition. (If there are idle (or empty) Kafka partitions, you should use withIdleness to withdraw those partitions from the overall watermark calculations.)
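For example, a minimal sketch assuming the Kafka record timestamps carry the event time; this replaces WatermarkStrategy.noWatermarks() in the question's fromSource call:
// Monotonic watermarks derived from the Kafka record timestamps; partitions that
// go idle for a minute are withdrawn from the watermark calculation.
WatermarkStrategy<String> watermarkStrategy = WatermarkStrategy
        .<String>forMonotonousTimestamps()
        .withIdleness(Duration.ofMinutes(1));

DataStreamSource<String> dataStream = streamExecutionEnv.fromSource(
        kafkaSource, watermarkStrategy, "mySource");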
We are creating a POC to read database CDC and push it to external systems.
Each source table's CDC is sent to its respective topic in Avro format (with Kafka Schema Registry and a Kafka server).
We are writing Java code to consume the Avro messages, deserialize them using an Avro Serde, join them, and then send them to different topics so they can be consumed by external systems.
We have a limitation, though: we cannot produce messages to the source table topics to send/receive new content/changes. So the only way to test the join code is to read messages from the beginning of every source topic each time we run the application (until we are confident the code works and can start receiving live data again).
In the KafkaConsumer object we have the option to use the seekToBeginning method to force reading from the beginning in Java code, which works. However, there is no such option when we try to stream a topic using a KStream object and force it to read from the beginning. What are the alternatives here?
We tried to reset the offsets using kafka-consumer-groups with --reset-offsets --to-earliest, but that only sets the offset to the nearest available one. When we try to reset the offset manually to "0" with the --to-offset parameter, we get the warning below and the offset is not set to "0". My understanding is that setting it to 0 should read messages from the beginning; correct me if I am wrong.
"WARN New offset (0) is lower than earliest offset for topic partition"
Sample code below
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVER);
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID);
properties.put("schema.registry.url", SCHEMA_REGISTRY_URL);
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, APPLICATION_ID);

StreamsBuilder builder = new StreamsBuilder();

// nothing is returned here when some offset has already been committed
KStream myStream = builder.stream("my-topic-in-avro-schema", Consumed.with(myKeySerde, myValueSerde));

KafkaStreams streams = new KafkaStreams(builder.build(), properties);
streams.start();
One way to do this would be to generate a random ConsumerGroup every time you start the stream application. Something like:
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID + currentTimestamp);
That way, the stream will start reading from "earliest" as you have set it already in auto.offset.reset.
By the way, you are setting the properties for group.id twice in your code...
This may help someone who is facing the same issue: replace the application ID and group ID with a unique identifier, e.g. using UUID.randomUUID().toString(), in the configuration properties. It should then fetch the messages from the beginning.
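A minimal sketch of that suggestion, reusing the APPLICATION_ID and GROUP_ID constants from the question (note that Kafka Streams derives the consumer group.id from application.id, so the fresh application.id is what actually matters):
// A fresh id per run means no committed offsets exist, so the
// auto.offset.reset = earliest setting takes effect.
String uniqueId = UUID.randomUUID().toString();
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, APPLICATION_ID + "-" + uniqueId);
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID + "-" + uniqueId);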
I use Spark Streaming to get data from a queue in MQ via a custom receiver.
The JavaStreamingContext batch duration is 10 seconds.
There is one task defined for the input from the queue.
In the event timeline in the Spark UI, I see a job being submitted in each 10-second interval even when there is no data from the receiver.
Is this normal behavior, or how can I stop jobs from being submitted when there is no data?
JavaDStream<String> customReceiverStream = ssc.receiverStream(new JavaCustomReceiver(host, port));
JavaDStream<String> words = customReceiverStream.flatMap(new FlatMapFunction<String, String>() { ... });
words.print();
ssc.start();
ssc.awaitTermination();
As a workaround, you can use Livy to submit the Spark jobs (use Java code instead of CLI commands).
The Livy job would constantly check a database that holds an indicator of whether data is flowing in or not. As soon as the data flow stops, change the indicator in the database, and this results in the Spark job being killed via Livy. (Use Livy sessions.)
I have a Spark Streaming app written in Java, using Spark 2.1. I am using KafkaUtils.createDirectStream to read messages from Kafka, with a Kryo encoder/decoder for the Kafka messages. I specified this in the Kafka properties: key.deserializer, value.deserializer, key.serializer, value.serializer.
When Spark pulls the messages in a micro-batch, they are successfully decoded using the Kryo decoder. However, I noticed that the Spark executor creates a new instance of the Kryo decoder for each message read from Kafka. I checked this by putting logs inside the decoder constructor.
This seems weird to me. Shouldn't the same decoder instance be used across messages and batches?
Code where I am reading from Kafka:
JavaInputDStream<ConsumerRecord<String, Class1>> consumerRecords = KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, Class1>Subscribe(topics, kafkaParams));
JavaPairDStream<String, Class1> converted = consumerRecords.mapToPair(consRecord -> {
return new Tuple2<String, Class1>(consRecord.key(), consRecord.value());
});
If we want to see how Spark fetches data from Kafka internally, we'll need to look at KafkaRDD.compute, a method implemented for every RDD that tells the framework how to, well, compute that RDD:
override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
  val part = thePart.asInstanceOf[KafkaRDDPartition]
  assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
  if (part.fromOffset == part.untilOffset) {
    logInfo(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
      s"skipping ${part.topic} ${part.partition}")
    Iterator.empty
  } else {
    new KafkaRDDIterator(part, context)
  }
}
What's important here is the else clause, which creates a KafkaRDDIterator. This internally has:
val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
  .newInstance(kc.config.props)
  .asInstanceOf[Decoder[K]]
val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
  .newInstance(kc.config.props)
  .asInstanceOf[Decoder[V]]
Which, as you see, creates an instance of both the key decoder and the value decoder via reflection for each underlying partition. This means the decoders aren't being created per message but per Kafka partition.
Why is it implemented this way? I don't know. I'm assuming it's because a key and value decoder should have a negligible performance hit compared to all the other allocations happening inside Spark.
If you've profiled your app and found this to be an allocation hot-path, you could open an issue. Otherwise, I wouldn't worry about it.
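If you do want to verify the per-partition behaviour in your own app, here is a hedged sketch of a counting deserializer; the class name and the pass-through payload are illustrative, not taken from the question:
// Counts constructor calls so you can compare the number of decoder instances
// with the number of Kafka partitions rather than the number of messages.
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.common.serialization.Deserializer;

public class CountingKryoDeserializer implements Deserializer<Object> {

    private static final AtomicLong INSTANCES = new AtomicLong();

    public CountingKryoDeserializer() {
        // Logged once per instantiation.
        System.out.println("decoder instance #" + INSTANCES.incrementAndGet());
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // no-op
    }

    @Override
    public Object deserialize(String topic, byte[] data) {
        // ... actual Kryo deserialization elided; pass the raw bytes through ...
        return data;
    }

    @Override
    public void close() {
        // no-op
    }
}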