Kafka Streams - How to use the Suppress function?

While implementing a business rule for my project, I need to reduce the number of events produced by the stream application to save resources and make the processor as fast as possible.
I found that Kafka offers the ability to suppress intermediate events based on either their record time or window end time. My code using suppress:
KTable<Long, ProductWithMatchRecord> productWithCompetitorMatchKTable = competitorProductMatchWithLinkInfo.groupBy(
(linkMatchProductRecordId, linkMatchWithProduct) -> KeyValue.pair(linkMatchWithProduct.linkMatch().tikiProductId(), linkMatchWithProduct),
Grouped.with(longPayloadJsonSerde, linkMatchWithProductJSONSerde).withName("group-match-record-by-product-id")
).aggregate(
ProductWithMatchRecord::new,
(tikiProductId, linkMatchWithProduct, aggregate) -> aggregate.addLinkMatch(linkMatchWithProduct),
(tikiProductId, linkMatchWithProduct, aggregate) -> aggregate.removeLinkMatch(linkMatchWithProduct),
Named.as("aggregate-match-record-by-product-id"),
Materialized
.<Long, ProductWithMatchRecord, KeyValueStore<Bytes, byte[]>>as("match-record-by-product-id")
.withKeySerde(longPayloadJsonSerde)
.withValueSerde(productWithMatchRecordJSONSerde)
)
.suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(10), null));
Basically, it is just a KTable that takes input from other KTables (aggregations, joins, ...) and then applies suppress.
The problem: for an event with a given key, I expect that if no further event arrives for this key within the next 10 seconds, the corresponding record in productWithCompetitorMatchKTable will be emitted.
However, after 10 seconds (or more), no event for the given key is emitted until I produce another event for this key.
Please help me fix the problem, or point me to documentation where I can learn more about the suppress feature of Kafka Streams.
I have tried to debug the code and change many configurations of the Suppressed.untilTimeLimit function, but it was not working as I expected.

You need new events to trigger the "time check": suppress() is driven by stream time, which only advances when new records arrive, so a key is not flushed on wall-clock time alone.
Have a look at "punctuate", which lets you schedule emissions on wall-clock time.
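For illustration, here is a minimal sketch (not from the original post) of how a Processor API punctuator could flush buffered values on wall-clock time even when no new events arrive for a key. The store name "pending-match-records" is hypothetical, and the store must be created and connected to the processor in the topology.
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
public class EmitOnWallClockProcessor implements Processor<Long, ProductWithMatchRecord, Long, ProductWithMatchRecord> {
    private ProcessorContext<Long, ProductWithMatchRecord> context;
    private KeyValueStore<Long, ProductWithMatchRecord> buffer;
    @Override
    public void init(ProcessorContext<Long, ProductWithMatchRecord> context) {
        this.context = context;
        this.buffer = context.getStateStore("pending-match-records"); // hypothetical store name
        // Fire every 10 seconds on wall-clock time, independent of incoming traffic.
        context.schedule(Duration.ofSeconds(10), PunctuationType.WALL_CLOCK_TIME, this::flush);
    }
    @Override
    public void process(Record<Long, ProductWithMatchRecord> record) {
        buffer.put(record.key(), record.value()); // keep only the latest value per key
    }
    private void flush(long timestamp) {
        List<Long> flushedKeys = new ArrayList<>();
        try (KeyValueIterator<Long, ProductWithMatchRecord> it = buffer.all()) {
            while (it.hasNext()) {
                KeyValue<Long, ProductWithMatchRecord> entry = it.next();
                context.forward(new Record<>(entry.key, entry.value, timestamp));
                flushedKeys.add(entry.key);
            }
        }
        flushedKeys.forEach(buffer::delete); // emit each buffered value at most once
    }
}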

Related

Why does my watermark not advance in my Apache Flink keyed stream?

I am currently using Apache Flink 1.13.2 with Java for my streaming application. I am using a keyed function with no window function. I have implemented a watermark strategy and the autoWatermarkInterval config per the documentation, but my watermark is not advancing.
I have double-checked this by using the Flink web UI and printing the current watermark in my EventProcessor KeyedProcessFunction, but the watermark is constantly stuck at -9223372036854775808 (Long.MIN_VALUE, the lowest possible watermark).
env.getConfig().setAutoWatermarkInterval(1000);
WatermarkStrategy<EventPayload> watermarkStrategy = WatermarkStrategy
.<EventPayload>forMonotonousTimestamps()
.withTimestampAssigner((event, timestamp) -> event.getTimestamp());
DataStream<EventPayload> deserialized = input
.assignTimestampsAndWatermarks(watermarkStrategy)
.flatMap(new Deserializer());
DataStream<EnrichedEventPayload> resultStream =
AsyncDataStream.orderedWait(deserialized, new Enrichment(), 5, TimeUnit.SECONDS, 100);
DataStream<Session> eventsStream = resultStream
.filter(EnrichedEventPayload::getIsEnriched)
.keyBy(EnrichedEventPayload::getId)
.process(new EventProcessor());
I even tried adding the WatermarkStrategy to the stream that uses keyBy (adjusting the types to match), but still no luck.
DataStream<Session> eventsStream = resultStream
.filter(EnrichedEventPayload::getIsEnriched)
.keyBy(EnrichedEventPayload::getId)
.assignTimestampsAndWatermarks(watermarkStrategy)
.process(new EventProcessor());
I have also tried using my own class implementing WatermarkStrategy and setting breakpoints in the onEvent function to ensure the new watermark was being emitted, but it still did not advance (and any associated timers did not fire).
Any help would be greatly appreciated!
This will happen if one of the parallel instances of the watermark strategy is idle (i.e., if there are no events flowing through it). Using the withIdleness(...) option on the watermark strategy would be one way to solve this.
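For reference, a minimal sketch of the strategy from the question with idleness handling added (the 30-second timeout is just an example value):
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
WatermarkStrategy<EventPayload> watermarkStrategy = WatermarkStrategy
    .<EventPayload>forMonotonousTimestamps()
    .withTimestampAssigner((event, timestamp) -> event.getTimestamp())
    // Declare a partition idle after 30s without events so it no longer
    // holds back the combined watermark of downstream operators.
    .withIdleness(Duration.ofSeconds(30));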

Spring Integration: start JPA polling only when all the results of the last poll have been processed

I have a following flow that I would like to implement using Spring Integration Java DSL:
Poll a table in a database every 2 hours, which returns the ids of documents that need to be processed
For each id, process a document through an HTTP gateway
Store a response in a database
I have working Java code that does exactly these steps. An additional requirement that I'm struggling with is that polling for the next round of documents shouldn't happen until all the documents from the last poll have been processed and stored in the database.
Is there any pattern in Spring Integration that I could use for this additional requirement?
Here is simplified code - it will get more complex, and I'll split the processing of the documents (HTTP outbound and persisting) into separate classes / flows:
return IntegrationFlows.from(Jpa.inboundAdapter(this.targetEntityManagerFactory)
.entityClass(ProcessingMetadata.class)
.jpaQuery("select max(p.modifiedDate) from ProcessingMetadata p " +
"where p.status = com.test.ProcessingStatus.PROCESSED")
.maxResults(1)
.expectSingleResult(true),
e -> e.poller(Pollers.fixedDelay(Duration.ofSeconds(10))))
.handle(Jpa.retrievingGateway(this.sourceEntityManagerFactory)
.entityClass(DocumentHeader.class)
.jpaQuery("from DocumentHeader d where d.modified > :modified")
.parameterExpression("modified", "payload"))
.handle(Http.outboundGateway(uri)
.httpMethod(HttpMethod.POST)
.expectedResponseType(String.class))
.handle(Jpa.outboundAdapter(this.targetEntityManagerFactory)
.entityClass(ProcessingMetadata.class)
.persistMode(PersistMode.PERSIST),
e -> e.transactional(true))
.get();
UPDATE
Following Artem's suggestion, I'm trying to implement it using a SimpleActiveIdleMessageSourceAdvice
class WaitUntilCompleted extends SimpleActiveIdleMessageSourceAdvice {
public WaitUntilCompleted(DynamicPeriodicTrigger trigger) {
super(trigger);
}
@Override
public boolean beforeReceive(MessageSource<?> source) {
return false;
}
}
If I understand it correctly, the above code would stop polling. Now I have no idea how to attach this Advice to the Jpa.inboundAdapter... It doesn't seem to have a suitable method (neither an Advice nor a Handler spec). Am I missing something obvious here? I've tried attaching the Advice to the Jpa.retrievingGateway, but it doesn't change the flow at all.
UPDATE 2
Check this question for a complete solution: Spring Integration: how to unit test an advice
I answered a similar question today: How to poll from a queue 1 message at a time after downstream flow is completed in Spring Integration.
You could also use a trick at the database level so that new records in the table are not visible while others are locked. Or you can do an UPDATE at the end of the flow, so that your SELECT won't see the appropriate records until they have been updated accordingly.
In any case, any of the approaches I suggest for that question can be applied here as well.
You can also consider relying on the SimpleActiveIdleMessageSourceAdvice, since your solution is already based on a MessageSource implementation.
UPDATE
For your use case it would probably be better to extend that SimpleActiveIdleMessageSourceAdvice and override its beforeReceive() to check some state indicating whether you are able to read more data or not. The idlePollPeriod and activePollPeriod could be the same value: it doesn't look like it makes sense to change them in between, since you go into the idle state right after reading the next set of data.
The state to check might really be a simple AtomicBoolean bean which you flip after you have processed the current set of documents. That could happen after an aggregator, or at whatever other point fits your solution (see the sketch after this answer).
UPDATE 2
To use a WaitUntilCompleted for your Jpa.inboundAdapter you should have a configuration like this:
IntegrationFlows.from(Jpa.inboundAdapter(this.targetEntityManagerFactory)
.entityClass(ProcessingMetadata.class)
.jpaQuery("select max(p.modifiedDate) from ProcessingMetadata p " +
"where p.status = com.test.ProcessingStatus.PROCESSED")
.maxResults(1)
.expectSingleResult(true),
e -> e.poller(Pollers.fixedDelay(Duration.ofSeconds(10)).advice(waitUntilCompleted())))
Pay attention to the .advice(waitUntilCompleted()), which is part of the poller configuration and points to your advice bean.
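As a rough sketch (assumed names, not part of the original answer), the advice combined with the AtomicBoolean state described above could look like this:
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.integration.aop.SimpleActiveIdleMessageSourceAdvice;
import org.springframework.integration.core.MessageSource;
import org.springframework.integration.util.DynamicPeriodicTrigger;
class WaitUntilCompleted extends SimpleActiveIdleMessageSourceAdvice {
    private final AtomicBoolean readyToPoll = new AtomicBoolean(true);
    WaitUntilCompleted(DynamicPeriodicTrigger trigger) {
        super(trigger);
    }
    @Override
    public boolean beforeReceive(MessageSource<?> source) {
        // Poll only if the previous batch has been fully processed;
        // clear the flag so no further polls happen until batchCompleted() is called.
        return readyToPoll.compareAndSet(true, false);
    }
    public void batchCompleted() {
        // Call this at the end of the flow (e.g. after the JPA outbound adapter)
        // to allow the next poll.
        readyToPoll.set(true);
    }
}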

Why Do My Processing Time Window Triggers Fire But Event Time Ones Won't?

I'm struggling to get event-time-based triggers to fire for my Apache Beam pipeline, but I do seem to be able to trigger window firing with processing time.
My pipeline is fairly basic:
I receive batches of data points from Pub/Sub which include millisecond-level timestamps; each message is read in with a timestamp slightly earlier than the earliest batched data point. Data is batched to reduce client-side workload & Pub/Sub expenses.
I extract second-level timestamps & apply timestamps to the individual data points
I window the data for processing & avoid using global window.
I group the data by second for later categorizations by second of streaming data.
I eventually use sliding windows on the categorized seconds to conditionally emit one of two messages to pubsub once per second.
My problem appears to be in step 3.
I'm trying to use the same windowing strategy in step 3 that I'll eventually be using in step 5 to run a sliding-average calculation on the categorized seconds.
I've tried messing with the withTimestampCombiner(TimestampCombiner.EARLIEST) option, but that does not seem to address it.
I've read about the .withEarlyFirings method used with event time, but that seems like it would mimic my existing workaround. I would ideally be able to rely on the watermark passing the end of the window & include late triggering.
// De-Batching The Pubsub Message
static public class UnpackDataPoints extends DoFn<String,String>{
@ProcessElement
public void processElement(#Element String c, OutputReceiver<String> out) {
JsonArray packedData = new JsonParser().parse(c).getAsJsonArray();
DateTimeFormatter dtf = DateTimeFormat.forPattern("EEE dd MMM YYYY HH:mm:ss:SSS zzz");
for (JsonElement acDataPoint: packedData){
String hereData = acDataPoint.toString();
DateTime date = dtf.parseDateTime(acDataPoint.getAsJsonObject().get("Timestamp").getAsString());
Instant eventTimeStamp = date.toInstant();
out.outputWithTimestamp(hereData,eventTimeStamp);
}
}
}
// Extracting The Second
static public class ExtractTimeStamp extends DoFn<String,KV<String,String>> {
@ProcessElement
public void processElement(ProcessContext ctx ,#Element String c, OutputReceiver<KV<String,String>> out) {
JsonObject accDataObject = new JsonParser().parse(c).getAsJsonObject();
String milliString = accDataObject.get("Timestamp").getAsString();
String secondString = StringUtils.left(milliString,24);
accDataObject.addProperty("noMiliTimeStamp", secondString);
String updatedAccData = accDataObject.toString();
KV<String,String> outputKV = KV.of(secondString,updatedAccData);
out.output(outputKV);
}
}
// The Pipeline & Windowing
Pipeline pipeline = Pipeline.create(options);
PCollection<String> dataPoints = pipeline
.apply("Read from Pubsub", PubsubIO.readStrings()
.fromTopic("projects/????/topics/???")
.withTimestampAttribute("messageTimestamp"))
.apply("Extract Individual Data Points",ParDo.of(new UnpackDataPoints()));
/// This is the event time window that doesn't fire for some reason
/*
PCollection<String> windowedDataPoints = dataPoints.apply(
Window.<String>into(SlidingWindows.of(Duration.standardSeconds(5)).every(Duration.standardSeconds(1)))
// .triggering(AfterWatermark.pastEndOfWindow())
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(TWO_MINUTES))
//.triggering(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(2)))
.discardingFiredPanes()
.withTimestampCombiner(TimestampCombiner.EARLIEST)
.withAllowedLateness(Duration.standardSeconds(1)));
*/
///// Temporary Work Around, this does fire but data is out of order
PCollection<String> windowedDataPoints = dataPoints.apply(
Window.<String>into(FixedWindows.of(Duration.standardMinutes(120)))
.triggering(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5)))
.discardingFiredPanes()
.withTimestampCombiner(TimestampCombiner.EARLIEST)
.withAllowedLateness(Duration.standardSeconds(1)));
PCollection<KV<String, String>> TimeStamped = windowedDataPoints
.apply( "Pulling Out The Second For Aggregates", ParDo.of(new ExtractTimeStamp()));
PCollection<KV<String, Iterable<String>>> TimeStampedGrouped = TimeStamped.apply("Group By Key",GroupByKey.create());
PCollection<KV<String, Iterable<String>>> testing = TimeStampedGrouped.apply("testingIsh", ParDo.of(new LogKVIterable()));
When I use the first windowing strategy (the commented-out one), my pipeline runs indefinitely while receiving data & the LogKVIterable ParDo never returns anything; when I use the processing-time workaround, LogKVIterable does fire & log to the console.
This really looks like the timestamp that you're adding to your data may be wrong / corrupt. I would encourage you to verify the following:
The timestamp in your elements is correctly being added. Add some logging in the transforms before/after, and test that code extensively.
The Data Freshness and System Lag metrics in your pipeline are progressing as you expect. If Data Freshness is not moving as expected, that is a strong indicator that your timestamp is not appropriately set.
Triggering on processing time is different from triggering on event time. In processing time, there is no such thing as late data. In event time, handling late data is the real challenge. Late data in event-time processing is handled by using watermarks and triggers. For a great guide on this, I recommend checking out these two articles by Googler Tyler Akidau: a, b.
Since in Processing Time windowing, there is no such thing as late data, it makes sense that your Processing Time Apache Beam pipeline works without any problems.
Meanwhile in Event Time windowing, late data can occur, and your windowing and triggering should handle those scenarios, by good design.
Most likely your Event time processing pipeline code does not trigger due to bad configuration! It is not possible for me to reproduce your issue since your watermark (for a Pub/Sub source) is determined heuristically. Though I recommend that you debug your code by:
First - increase allowedLateness, for example to 1 hour. If this works, great! If not, see Second.
Second - comment out withEarlyFirings. If this works, great! If not, uncomment and see Third.
Third - use fixed-time windows instead of sliding-time windows.
Continue debugging until you are able to isolate the issue.
:) :)
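For reference, here is a sketch of how the event-time trigger from the commented-out block could be chained once the issue is isolated (same Beam imports as the pipeline above, plus AfterPane; the early-firing delay and allowed lateness are example values to start debugging with):
PCollection<String> eventTimeWindowed = dataPoints.apply(
    Window.<String>into(
            SlidingWindows.of(Duration.standardSeconds(5)).every(Duration.standardSeconds(1)))
        .triggering(
            AfterWatermark.pastEndOfWindow()
                // speculative early panes while waiting for the watermark
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30)))
                // one pane per late element within the allowed lateness
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .discardingFiredPanes()
        .withTimestampCombiner(TimestampCombiner.EARLIEST)
        .withAllowedLateness(Duration.standardHours(1))); // start generous, then tighten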

How to instrument streams and track progress? (vanilla Java 8 or cyclops-react reactive streams)

Given some code using streams to process a large number of items, what's the best way to instrument the various steps for logging and performance/profiling?
Actual example:
ReactiveSeq.fromStream(pairs)
.filter(this::satisfiesThreshold)
.filter(this::satisfiesPersistConditions)
.map((pair) -> convertToResult(pair, jobId))
.flatMap(Option::toJavaStream)
.grouped(CHUNK_SIZE)
.forEach((chunk) ->
{
repository.save(chunk);
incrementAndReport();
});
reportProcessingTime();
Logging progress is important so I can trigger progress events in another thread that update a user interface.
Tracking the performance characteristics of the filtering and mapping steps in this stream is desirable to see where optimizations can be made to speed it up.
I see three options:
1. put logging/profiling code in each function
2. use peek around each step without actually using the value
3. some sort of annotation-based or AOP solution (no idea what)
Which is the best? Any ideas on what #3 would look like? Is there another solution?
You have a couple of options here (if I have understood correctly):
We can make use of the elapsed operator to track the elapsed time between element emissions e.g.
ReactiveSeq.fromStream(Stream.of(1,2))
.filter(this::include)
.elapsed()
.map(this::logAndUnwrap)
Long[] filterTimeTakenMillis = new Long[maxSize];
int filterIndex = 0;
private <T> T logAndUnwrap(Tuple2<T, Long> t) {
//capture the elapsed time (t.v2) and then unwrap the tuple
filterTimeTakenMillis[filterIndex++]=t.v2;
return t.v1;
}
This will only work on cyclops-react Streams.
We can make use of the AOP-like functionality in FluentFunctions
e.g.
ReactiveSeq.fromStream(Stream.of(1,2))
.filter(this::include)
.elapsed()
.map(this::logAndUnwrap)
.map(FluentFunctions.of(this::convertToResult)
.around(a->{
SimpleTimer timer = new SimpleTimer();
String r = a.proceed();
mapTimeTakenNanos[mapIndex++]=timer.getElapsedNanos();
return r;
}));
This will also work on vanilla Java 8 Streams.
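For comparison, option 2 from the question (peek) can be sketched with plain Java 8 streams; the counter names here are illustrative:
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class PeekInstrumentation {
    public static void main(String[] args) {
        AtomicLong passedFilter = new AtomicLong();
        AtomicLong mapped = new AtomicLong();
        List<Integer> result = Stream.of(1, 2, 3, 4, 5)
                .filter(i -> i % 2 == 1)
                .peek(i -> passedFilter.incrementAndGet())  // observe elements surviving the filter
                .map(i -> i * 10)
                .peek(i -> mapped.incrementAndGet())        // observe elements after the map step
                .collect(Collectors.toList());
        System.out.printf("passed filter=%d, mapped=%d, collected=%d%n",
                passedFilter.get(), mapped.get(), result.size());
    }
}
The trade-off is that peek mixes instrumentation into the pipeline definition, whereas the FluentFunctions approach above keeps the timing logic wrapped around the function itself.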

Can Spark Streaming do Anything Other Than Word Count?

I'm trying to get to grips with Spark Streaming but I'm having difficulty. Despite reading the documentation and analysing the examples, I want to do something more than a word count on a text file/stream/Kafka queue, which is the only thing the docs seem to cover.
I want to listen to an incoming Kafka message stream, group messages by key and then process them. The code below is a simplified version of the process: get the stream of messages from Kafka, reduce by key to group messages by message key, then process them.
JavaPairDStream<String, byte[]> groupByKeyList = kafkaStream.reduceByKey((bytes, bytes2) -> bytes);
groupByKeyList.foreachRDD(rdd -> {
List<MyThing> myThingsList = new ArrayList<>();
MyCalculationCode myCalc = new MyCalculationCode();
rdd.foreachPartition(partition -> {
while (partition.hasNext()) {
Tuple2<String, byte[]> keyAndMessage = partition.next();
MyThing aSingleMyThing = MyThing.parseFrom(keyAndMessage._2); //parse from protobuffer format
myThingsList.add(aSingleMyThing);
}
});
List<MyResult> results = myCalc.doTheStuff(myThingsList);
//other code here to write results to file
});
When debugging I see that inside the while (partition.hasNext()) loop, myThingsList has a different memory address than the List<MyThing> myThingsList declared in the outer foreachRDD.
When List<MyResult> results = myCalc.doTheStuff(myThingsList); is called, there are no results because myThingsList is a different instance of the List.
I'd like a solution to this problem, but would prefer a reference to documentation to help me understand why this is not working (as anticipated) and how I can solve it for myself (I don't mean a link to the single page of Spark documentation, but a specific section/paragraph or, preferably, a link to JavaDoc that does not provide Scala examples with non-functional commented code).
The reason you're seeing different list addresses is that Spark doesn't execute foreachPartition locally on the driver; it has to serialize the function and send it over to the Executor handling the processing of the partition. You have to remember that although working with the code feels like everything runs in a single location, the calculation is actually distributed.
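For illustration, a minimal sketch (an assumption, not the answer's preferred approach) that makes the original foreachRDD work by parsing on the executors and then collecting the results to the driver, instead of mutating a driver-side list inside foreachPartition:
groupByKeyList.foreachRDD(rdd -> {
    List<MyThing> myThingsList = rdd
        .map(keyAndMessage -> MyThing.parseFrom(keyAndMessage._2)) // runs on the executors
        .collect();                                                // gathers results to the driver
    List<MyResult> results = new MyCalculationCode().doTheStuff(myThingsList);
    // other code here to write results to file
});
Note that collect() pulls the whole batch into driver memory, so it is only viable for small per-batch volumes.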
The first problem I see with your code has to do with your reduceByKey, which takes two byte arrays and returns the first; is that really what you want to do? It means you're effectively dropping parts of the data. Perhaps you're looking for combineByKey, which will allow you to return a JavaPairDStream<String, List<byte[]>>.
Regarding the parsing of your protobuf, it looks to me like you don't want foreachRDD; you need an additional map to parse the data:
kafkaStream
.combineByKey(/* implement logic */)
.flatMap(x -> x._2)
.map(proto -> MyThing.parseFrom(proto))
.map(myThing -> myCalc.doStuff(myThing))
.foreachRDD(/* After all the processing, do stuff with result */)
