I am currently using Apache Flink 1.13.2 with Java for my streaming application. I am using a keyed process function with no window function. I have implemented a watermark strategy and set the autoWatermarkInterval config per the documentation, but my watermark is not advancing.
I have double-checked this via the Flink web UI and by printing the current watermark in my EventProcessor KeyedProcessFunction, but the watermark is always -9223372036854775808 (the lowest possible watermark).
env.getConfig().setAutoWatermarkInterval(1000);
WatermarkStrategy<EventPayload> watermarkStrategy = WatermarkStrategy
.<EventPayload>forMonotonousTimestamps()
.withTimestampAssigner((event, timestamp) -> event.getTimestamp());
DataStream<EventPayload> deserialized = input
.assignTimestampsAndWatermarks(watermarkStrategy)
.flatMap(new Deserializer());
DataStream<EnrichedEventPayload> resultStream =
AsyncDataStream.orderedWait(deserialized, new Enrichment(), 5, TimeUnit.SECONDS, 100);
DataStream<Session> eventsStream = resultStream
.filter(EnrichedEventPayload::getIsEnriched)
.keyBy(EnrichedEventPayload::getId)
.process(new EventProcessor());
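For reference, this is roughly how I am reading the watermark inside EventProcessor (the key type here is just a placeholder for whatever getId() returns):
public class EventProcessor extends KeyedProcessFunction<Long, EnrichedEventPayload, Session> {
    @Override
    public void processElement(EnrichedEventPayload value, Context ctx, Collector<Session> out) throws Exception {
        // This always prints Long.MIN_VALUE (-9223372036854775808)
        System.out.println("current watermark: " + ctx.timerService().currentWatermark());
    }
}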
I even tried adding the WatermarkStrategy after the keyBy (adjusting the types to match), but still no luck.
DataStream<Session> eventsStream = resultStream
.filter(EnrichedEventPayload::getIsEnriched)
.keyBy(EnrichedEventPayload::getId)
.assignTimestampsAndWatermarks(watermarkStrategy)
.process(new EventProcessor());
I have also tried using my own class implementing WatermarkStrategy and set breakpoints in the onEvent function to confirm that new watermarks were being emitted, but the watermark still did not advance (and the associated timers did not fire).
Any help would be greatly appreciated!
This will happen if one of the parallel instances of the watermark strategy is idle (i.e., if no events are flowing through it). Using the withIdleness(...) option on the watermark strategy is one way to solve this.
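For example, a minimal sketch based on the strategy from the question (the one-minute timeout is just an illustration, and Duration is java.time.Duration):
WatermarkStrategy<EventPayload> watermarkStrategy = WatermarkStrategy
        .<EventPayload>forMonotonousTimestamps()
        .withTimestampAssigner((event, timestamp) -> event.getTimestamp())
        // mark a partition/split as idle after 1 minute without events so it
        // no longer holds back the overall watermark
        .withIdleness(Duration.ofMinutes(1));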
While implementing a business rule for my project, I need to reduce the number of events produced by the stream application, to save resources and keep the processor as fast as possible.
I found out that Kafka Streams offers the ability to suppress intermediate events based on either their RecordTime or WindowEndTime. My code using suppress:
KTable<Long, ProductWithMatchRecord> productWithCompetitorMatchKTable = competitorProductMatchWithLinkInfo.groupBy(
(linkMatchProductRecordId, linkMatchWithProduct) -> KeyValue.pair(linkMatchWithProduct.linkMatch().tikiProductId(), linkMatchWithProduct),
Grouped.with(longPayloadJsonSerde, linkMatchWithProductJSONSerde).withName("group-match-record-by-product-id")
).aggregate(
ProductWithMatchRecord::new,
(tikiProductId, linkMatchWithProduct, aggregate) -> aggregate.addLinkMatch(linkMatchWithProduct),
(tikiProductId, linkMatchWithProduct, aggregate) -> aggregate.removeLinkMatch(linkMatchWithProduct),
Named.as("aggregate-match-record-by-product-id"),
Materialized
.<Long, ProductWithMatchRecord, KeyValueStore<Bytes, byte[]>>as("match-record-by-product-id")
.withKeySerde(longPayloadJsonSerde)
.withValueSerde(productWithMatchRecordJSONSerde)
)
.suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(10), null));
Basically, it is just a KTable that takes its input from other KTables (aggregation, join, ...) and then applies suppress.
The problem is that I expect, for one event of a given key, if no further event for this key arrives in the next 10 seconds, the corresponding record in productWithCompetitorMatchKTable will be produced.
However, after 10 seconds (or more), no event for the given key is emitted until I produce another event for that key.
Please help me fix the problem, or point me to documentation where I can learn more about the suppress feature of Kafka Streams.
I have tried to debug the code and change many configurations of the Suppressed.untilTimeLimit call, but it was not working as I expected.
You need new events to trigger the "time check".
Have a look at "punctuate".
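A rough sketch of that approach with the Processor API (this is not your topology; the processor class, the store name, and the key/value wiring below are assumptions): buffer the latest value per key in a state store and emit everything every 10 seconds of wall-clock time, so output no longer depends on new input arriving.
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class PeriodicEmitProcessor implements Processor<Long, ProductWithMatchRecord, Long, ProductWithMatchRecord> {

    private ProcessorContext<Long, ProductWithMatchRecord> context;
    private KeyValueStore<Long, ProductWithMatchRecord> buffer;

    @Override
    public void init(ProcessorContext<Long, ProductWithMatchRecord> context) {
        this.context = context;
        this.buffer = context.getStateStore("pending-product-matches"); // hypothetical store name
        // The punctuator fires on wall-clock time, independent of traffic on the input topic.
        context.schedule(Duration.ofSeconds(10), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<Long, ProductWithMatchRecord> it = buffer.all()) {
                while (it.hasNext()) {
                    KeyValue<Long, ProductWithMatchRecord> entry = it.next();
                    context.forward(new Record<>(entry.key, entry.value, timestamp));
                    buffer.delete(entry.key);
                }
            }
        });
    }

    @Override
    public void process(Record<Long, ProductWithMatchRecord> record) {
        buffer.put(record.key(), record.value()); // hold the latest value until the next punctuation
    }
}
The state store would still have to be registered and connected to the processor (e.g. via StreamsBuilder#addStateStore and KStream#process).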
My goal for each data stream is:
filter different messages
have different event-time session window gaps
consume from a topic and produce to another topic
A fan-out -> fan-in like DAG.
var fanoutStreamOne = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreamTwo = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreamThree = new StreamComponents(/*filter, flatmap, etc*/);
var fanoutStreams = Set.of(fanoutStreamOne, fanoutStreamTwo, fanoutStreamThree);
var source = new FlinkKafkaConsumer<>(...);
var sink = new FlinkKafkaProducer<>(...);
// creates streams from same source to same sink (Using union())
new streamingJob(source, sink, fanoutStreams).execute();
I am just curious if this affects recovery/checkpoints or performance of the Flink application.
Has anyone had success with this implementation?
And should I have the watermark strategy up front before filtering?
Thanks in advance!
Okay, I think the different session gaps are not possible. I tried it a year ago with Flink 1.7 and couldn't do it; the watermark is global to the application.
For the other points: if you are using Kafka, you can read from several topics using a regex and get the topic name via the appropriate deserialization schema (here).
To filter the messages, I think you can use filter functions with side output streams :) (here)
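A rough sketch of the side-output splitting (the Event type and the isTypeA() predicate are placeholders for your message type and filter logic):
final OutputTag<Event> typeBTag = new OutputTag<Event>("type-b") {};

SingleOutputStreamOperator<Event> mainStream = source
        .process(new ProcessFunction<Event, Event>() {
            @Override
            public void processElement(Event value, Context ctx, Collector<Event> out) {
                if (value.isTypeA()) {
                    out.collect(value);            // main output
                } else {
                    ctx.output(typeBTag, value);   // side output
                }
            }
        });

DataStream<Event> typeBStream = mainStream.getSideOutput(typeBTag);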
I am making a file transfer system using Akka. I've been looking at the documentation for a while. The current state of progress is that Actor2 receives the file sent by Actor1 and writes it to Actor2's local file system (Actor1 = sender, Actor2 = receiver).
But I couldn't find a way to know, in real time, how many bytes have been received while writing.
I tested it, and it turns out that with the runWith API the file can be written locally, and with the runForeach API I can see how many bytes are delivered in real time. However, if both are created at the same time, the file cannot be written.
Here's my simple source. Please give me some advice.
public static Behavior<Command> create() {
return Behaviors.setup(context -> {
context.getLog().info("Registering myself with receptionist");
context.getSystem().receptionist().tell(Receptionist.register(RECEIVER_SERVICE_KEY, context.getSelf().narrow()));
Materializer mat = Materializer.createMaterializer(context);
return Behaviors.receive(Command.class)
.onMessage(TransferFile.class, command -> {
command.sourceRef.getSource().runWith(FileIO.toPath(Paths.get("test.pptx")), mat);
//command.replyTo.tell(new FileTransfered("filename", 1024));
command.sourceRef.getSource().runForeach(f -> System.out.println(f.size()), mat);
return Behaviors.same();
}).build();
});
}
Use a BroadcastHub to allow multiple consumers of your Source:
Source<ByteString, NotUsed> fileSource = command.sourceRef.getSource();
RunnableGraph<Source<ByteString, NotUsed>> runnableGraph =
fileSource.toMat(BroadcastHub.of(ByteString.class, 256), Keep.right());
// adjust the buffer size (256) as needed
Source<ByteString, NotUsed> fromFileSource = runnableGraph.run(mat);
fromFileSource.runWith(FileIO.toPath(Paths.get("test.pptx")), mat);
fromFileSource.runForeach(f -> System.out.println(f.size()), mat);
BroadcastHub, as suggested by Jeffrey, allows a single running stream to be connected to multiple other streams that are started and stopped over time.
Having a stream that dynamically connects to others requires quite a lot of extra hoops internally, so if you don't need that, it is better to avoid the overhead.
If your use case is rather that you want to consume a single source with two sinks, that is better done with source.alsoTo(sink1).to(sink2).
alsoTo in the flow API is backed by the Broadcast operator, but using that directly requires that you use the Graph DSL.
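A rough sketch using the types from the question (imports for Done, IOResult and CompletionStage are assumed):
Source<ByteString, NotUsed> fileSource = command.sourceRef.getSource();

Sink<ByteString, CompletionStage<IOResult>> fileSink = FileIO.toPath(Paths.get("test.pptx"));
Sink<ByteString, CompletionStage<Done>> progressSink =
        Sink.foreach(chunk -> System.out.println(chunk.size()));

fileSource
        .alsoTo(progressSink)    // side branch: print the size of every chunk
        .runWith(fileSink, mat); // main branch: write the file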
Given some code using streams to process a large number of items, what's the best way to instrument the various steps for logging and performance/profiling?
Actual example:
ReactiveSeq.fromStream(pairs)
.filter(this::satisfiesThreshold)
.filter(this::satisfiesPersistConditions)
.map((pair) -> convertToResult(pair, jobId))
.flatMap(Option::toJavaStream)
.grouped(CHUNK_SIZE)
.forEach((chunk) ->
{
repository.save(chunk);
incrementAndReport();
});
reportProcessingTime();
Logging progress is important so I can trigger progress events in another thread that update a user interface.
Tracking the performance characteristics of the filtering and mapping steps in this stream is desirable to see where optimizations can be made to speed it up.
I see three options:
put logging/profiling code in each function
use peek around each step without actually using the value (rough sketch below)
some sort of annotation based or AOP solution (no idea what)
Which is the best? Any ideas on what #3 would look like? Is there another solution?
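For example, I imagine option 2 would look roughly like this (the AtomicLong counters and the log call are placeholders, not part of the real code):
AtomicLong passedThreshold = new AtomicLong();
AtomicLong persistCandidates = new AtomicLong();

ReactiveSeq.fromStream(pairs)
        .filter(this::satisfiesThreshold)
        .peek(p -> passedThreshold.incrementAndGet())
        .filter(this::satisfiesPersistConditions)
        .peek(p -> persistCandidates.incrementAndGet())
        .map(pair -> convertToResult(pair, jobId))
        .flatMap(Option::toJavaStream)
        .grouped(CHUNK_SIZE)
        .forEach(chunk -> {
            repository.save(chunk);
            incrementAndReport();
        });

log.info("passed threshold: {}, persist candidates: {}",
        passedThreshold.get(), persistCandidates.get());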
You have a couple of options here (if I have understood correctly):
We can make use of the elapsed operator to track the elapsed time between element emissions e.g.
ReactiveSeq.fromStream(Stream.of(1,2))
.filter(this::include)
.elapsed()
.map(this::logAndUnwrap)
Long[] filterTimeTakenMillis = new Long[maxSize];
int filterIndex = 0;
private <T> T logAndUnwrap(Tuple2<T, Long> t) {
//capture the elapsed time (t.v2) and then unwrap the tuple
filterTimeTakenMillis[filterIndex++]=t.v2;
return t.v1;
}
This will only work on cyclops-react Streams.
We can make use of the AOP-like functionality in FluentFunctions
e.g.
ReactiveSeq.fromStream(Stream.of(1,2))
.filter(this::include)
.elapsed()
.map(this::logAndUnwrap)
.map(FluentFunctions.of(this::convertToResult)
.around(a->{
SimpleTimer timer = new SimpleTimer();
String r = a.proceed();
mapTimeTakenNanos[mapIndex++]=timer.getElapsedNanos();
return r;
}));
This will also work on vanilla Java 8 Streams.
I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I wish to do something more than a word count on a text file/stream/Kafka queue, which is about all the docs cover.
I wish to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
List<MyThing> myThingsList = new ArrayList<>();
MyCalculationCode myCalc = new MyCalculationCode();
rdd.foreachPartition(partition -> {
while (partition.hasNext()) {
Tuple2<String, byte[]> keyAndMessage = partition.next();
MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
myThingsList.add(aSingleMyThing);
}
});
List<MyResult> results = myCalc.doTheStuff(myThingsList);
//other code here to write results to file
});
When debugging I see that inside the while (partition.hasNext()) loop, myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called, there are no results because myThingsList is a different instance of the List.
I'd like a solution to this problem, but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation, but a section/paragraph or, better still, a link to JavaDoc that does not provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first. Is that really what you want to do? It means you're effectively dropping parts of the data; perhaps you're looking for combineByKey, which would allow you to return a JavaPairDStream<String, List<byte[]>>.
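A rough sketch of what that combineByKey could look like (the partition count is arbitrary here):
// Collect all byte[] values per key into a List instead of discarding them.
JavaPairDStream<String, List<byte[]>> groupedByKey = kafkaStream.combineByKey(
        value -> {
            List<byte[]> list = new ArrayList<>();
            list.add(value);
            return list;
        },
        (list, value) -> { list.add(value); return list; },     // merge a value into a combiner
        (left, right) -> { left.addAll(right); return left; },  // merge two combiners
        new HashPartitioner(4));                                 // arbitrary partition count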
Regarding parsing of your protobuf, looks to me like you don't want foreachRDD, you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)