I'd like to join data coming in from two Kafka topics ("left" and "right").
Matching records are to be joined using an ID, but if a "left" or a "right" record is missing, the other one should be passed downstream after a certain timeout. Therefore I have chosen to use the coGroup function.
This works, but there is one problem: if no further messages arrive, there is always at least one record that stays in an internal buffer for good. It only gets pushed out when new messages arrive; otherwise it is stuck.
The expected behaviour is that all records should be pushed out after the configured idle timeout has been reached.
Some information which might be relevant:
Flink 1.14.4
The Flink parallelism is set to 8, so is the number of partitions in both Kafka topics.
Flink checkpointing is enabled
Event-time processing is to be used
Lombok is used, so val is like final var
Some code snippets:
Relevant join settings
public static final int AUTO_WATERMARK_INTERVAL_MS = 500;
public static final Duration SOURCE_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(4000);
public static final Duration SOURCE_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Duration TRANSFORMATION_MAX_OUT_OF_ORDERNESS = Duration.ofMillis(5000);
public static final Duration TRANSFORMATION_IDLE_TIMEOUT = Duration.ofMillis(1000);
public static final Time JOIN_WINDOW_SIZE = Time.milliseconds(1500);
Create KafkaSource
private static KafkaSource<JoinRecord> createKafkaSource(Config config, String topic) {
val properties = KafkaConfigUtils.createConsumerConfig(config);
val deserializationSchema = new KafkaRecordDeserializationSchema<JoinRecord>() {
@Override
public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<JoinRecord> out) {
val m = JsonUtils.deserialize(record.value(), JoinRecord.class);
val copy = m.toBuilder()
.partition(record.partition())
.build();
out.collect(copy);
}
@Override
public TypeInformation<JoinRecord> getProducedType() {
return TypeInformation.of(JoinRecord.class);
}
};
return KafkaSource.<JoinRecord>builder()
.setProperties(properties)
.setBootstrapServers(config.kafkaBootstrapServers)
.setTopics(topic)
.setGroupId(config.kafkaInputGroupIdPrefix + "-" + String.join("_", topic))
.setDeserializer(deserializationSchema)
.setStartingOffsets(OffsetsInitializer.latest())
.build();
}
Create DataStreamSource
Then the DataStreamSource is built on top of the KafkaSource:
Configure "max out of orderness"
Configure "idleness"
Extract timestamp from record, to be used for event time processing
private static DataStreamSource<JoinRecord> createLeftSource(Config config,
StreamExecutionEnvironment env) {
val leftKafkaSource = createLeftKafkaSource(config);
val leftWms = WatermarkStrategy
.<JoinRecord>forBoundedOutOfOrderness(SOURCE_MAX_OUT_OF_ORDERNESS)
.withIdleness(SOURCE_IDLE_TIMEOUT)
.withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);
return env.fromSource(leftKafkaSource, leftWms, "left-kafka-source");
}
Use keyBy
The keyed sources are created on top of the DataStreamSource instances like this:
Again configure "out of orderness" and "idleness"
Again extract timestamp
val leftWms = WatermarkStrategy
.<JoinRecord>forBoundedOutOfOrderness(TRANSFORMATION_MAX_OUT_OF_ORDERNESS)
.withIdleness(TRANSFORMATION_IDLE_TIMEOUT)
.withTimestampAssigner((joinRecord, __) -> {
if (VERBOSE_JOIN)
log.info("Left : " + joinRecord);
return joinRecord.timestamp.toEpochSecond() * 1000L;
});
val leftKeyedSource = leftSource
.keyBy(jr -> jr.id)
.assignTimestampsAndWatermarks(leftWms)
.name("left-keyed-source");
Join using coGroup
The join then combines the left and the right keyed sources:
val joinedStream = leftKeyedSource
.coGroup(rightKeyedSource)
.where(left -> left.id)
.equalTo(right -> right.id)
.window(TumblingEventTimeWindows.of(JOIN_WINDOW_SIZE))
.apply(new CoGroupFunction<JoinRecord, JoinRecord, JoinRecord>() {
@Override
public void coGroup(Iterable<JoinRecord> leftRecords,
Iterable<JoinRecord> rightRecords,
Collector<JoinRecord> out) {
// Transform
val result = ...;
out.collect(result);
}
});
Write stream to console
The resulting joinedStream is written to the console:
val consoleSink = new PrintSinkFunction<JoinRecord>();
joinedStream.addSink(consoleSink);
How can I configure this join operation, so that all records are pushed downstream after the configured idle timeout?
If it can't be done this way: Is there another option?
This is the expected behavior. withIdleness doesn't try to handle the case where all streams are idle. It only helps in cases where there are still events flowing from at least one source partition/shard/split.
To get the behavior you desire (in the context of a continuous streaming job), you'll have to implement a custom watermark strategy that advances the watermark based on a processing-time timer; one existing implementation of this idea uses the legacy watermark API.
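For illustration only, here is a rough sketch of that idea using the current WatermarkGenerator API rather than the legacy one. It assumes event time and the wall clock are roughly aligned, reuses the question's JoinRecord type and constants, and the class name is made up:
import java.time.Duration;
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
// Emits bounded-out-of-orderness watermarks, but falls back to the wall clock once
// no events have been seen for idleTimeout, so that open event-time windows can
// still fire when all inputs go quiet.
public class ProcessingTimeAdvancingWatermarks implements WatermarkGenerator<JoinRecord> {
    private final long maxOutOfOrdernessMs;
    private final long idleTimeoutMs;
    private long maxSeenTimestamp = Long.MIN_VALUE;
    private long lastEventProcessingTime = System.currentTimeMillis();

    public ProcessingTimeAdvancingWatermarks(Duration maxOutOfOrderness, Duration idleTimeout) {
        this.maxOutOfOrdernessMs = maxOutOfOrderness.toMillis();
        this.idleTimeoutMs = idleTimeout.toMillis();
    }

    @Override
    public void onEvent(JoinRecord event, long eventTimestamp, WatermarkOutput output) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, eventTimestamp);
        lastEventProcessingTime = System.currentTimeMillis();
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        long eventTimeWatermark = (maxSeenTimestamp == Long.MIN_VALUE)
                ? Long.MIN_VALUE
                : maxSeenTimestamp - maxOutOfOrdernessMs - 1;
        long now = System.currentTimeMillis();
        if (now - lastEventProcessingTime > idleTimeoutMs) {
            // No input for a while: let the wall clock push the watermark forward so
            // that buffered records are eventually flushed downstream.
            output.emitWatermark(new Watermark(Math.max(eventTimeWatermark, now - maxOutOfOrdernessMs - 1)));
        } else if (eventTimeWatermark != Long.MIN_VALUE) {
            // Normal bounded-out-of-orderness behaviour while events keep arriving.
            output.emitWatermark(new Watermark(eventTimeWatermark));
        }
    }
}
It would replace forBoundedOutOfOrderness/withIdleness in the question's watermark strategies, for example:
val leftWms = WatermarkStrategy
    .<JoinRecord>forGenerator(ctx -> new ProcessingTimeAdvancingWatermarks(
        SOURCE_MAX_OUT_OF_ORDERNESS, SOURCE_IDLE_TIMEOUT))
    .withTimestampAssigner((joinRecord, __) -> joinRecord.timestamp.toEpochSecond() * 1000L);
Since onPeriodicEmit is driven by the auto-watermark interval (500 ms in the question), the watermark keeps advancing even when no records arrive, so the stuck window eventually fires.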
On the other hand, if the job is complete and you just want to drain the final results before shutting it down, you can use the --drain option when you stop the job. Or, if you use bounded sources, this will happen automatically.
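If you go the bounded-source route with the KafkaSource used in the question, a minimal sketch (assuming Flink 1.14's setBounded option on the KafkaSource builder, and reusing the names from createKafkaSource; the chosen offsets are just an example) could look like this:
return KafkaSource.<JoinRecord>builder()
    .setProperties(properties)
    .setBootstrapServers(config.kafkaBootstrapServers)
    .setTopics(topic)
    .setGroupId(config.kafkaInputGroupIdPrefix + "-" + topic)
    .setDeserializer(deserializationSchema)
    .setStartingOffsets(OffsetsInitializer.earliest())
    // Stop at the offsets that are current when the job starts; once the bounded
    // source finishes, a final watermark is emitted and all pending windows fire.
    .setBounded(OffsetsInitializer.latest())
    .build();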
I'm still new to the Flink CEP library and I don't yet understand the pattern detection behavior.
Consider the example below: I have a Flink app that consumes data from a Kafka topic; data is produced periodically, and I want to use a Flink CEP pattern to detect when a value is bigger than a given threshold.
The code is below:
public class CEPJob{
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(),
properties);
consumer.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
DataStream<String> stream = env.addSource(consumer);
// Process incoming data.
DataStream<Stock> inputEventStream = stream.map(new MapFunction<String, Stock>() {
private static final long serialVersionUID = -491668877013085114L;
@Override
public Stock map(String value) {
String[] data = value.split(":");
System.out.println("Date: " + data[0] + ", Adj Close: " + data[1]);
Stock stock = new Stock(data[0], Double.parseDouble(data[1]));
return stock;
}
});
// Create the pattern
Pattern<Stock, ?> myPattern = Pattern.<Stock>begin("first").where(new SimpleCondition<Stock>() {
private static final long serialVersionUID = -6301755149429716724L;
@Override
public boolean filter(Stock value) throws Exception {
return (value.getAdj_Close() > 140.0);
}
});
// Create a pattern stream from our warning pattern
PatternStream<Stock> myPatternStream = CEP.pattern(inputEventStream, myPattern);
// Generate alert for each matched pattern
DataStream<Stock> warnings = myPatternStream.select((Map<String, List<Stock>> pattern) -> {
Stock first = pattern.get("first").get(0);
return first;
});
warnings.print();
env.execute("CEP job");
}
}
What happens when I run the job is that pattern detection doesn't happen in real time: the warning for the pattern detected in the current record is only output after a second record is produced, so the log output looks delayed. I don't understand how to make it output the warning at the moment it detects the pattern, without waiting for the next record. Thank you :)
Data coming from Kafka is in string format ("date:value"), and a record is produced every 5 seconds.
Java version: 1.8, Scala version: 2.11.12, Flink version: 1.12.2, Kafka version: 2.3.0
The solution I found is to send a fake record (a null object, for example) to the Kafka topic every time I produce a real value, and on the Flink side (in the pattern declaration) I test whether the received record is fake or not.
It seems like FlinkCEP always waits for the upcoming event before it outputs the warning. This is because the job runs in event time: a match is only emitted once the watermark passes the matched event's timestamp, and with forMonotonousTimestamps the watermark only advances when the next record arrives.
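If strict event-time ordering is not actually needed for this pattern, a possible alternative to the fake-record workaround (assuming Flink 1.12's PatternStream#inProcessingTime()) is to evaluate the pattern in processing time, so a match is emitted as soon as the matching record is processed instead of waiting for the watermark, i.e. the next record, to pass it:
// Sketch only: evaluate the same pattern in processing time.
PatternStream<Stock> myPatternStream =
    CEP.pattern(inputEventStream, myPattern).inProcessingTime();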
I am trying to broadcast from a source to 2 sinks in Java. I got stuck in between; any pointer will be helpful.
public static void main(String[] args) {
ActorSystem system = ActorSystem.create("GraphBasics");
ActorMaterializer materializer = ActorMaterializer.create(system);
final Source<Integer, NotUsed> source = Source.range(1, 1000);
Sink<Integer,CompletionStage<Done>> firstSink = Sink.foreach(x -> System.out.println("first sink "+x));
Sink<Integer,CompletionStage<Done>> secondsink = Sink.foreach(x -> System.out.println("second sink "+x));
RunnableGraph.fromGraph(
GraphDSL.create(
b -> {
UniformFanOutShape<Integer, Integer> bcast = b.add(Broadcast.create(2));
b.from(b.add(source)).viaFanOut(bcast).to(b.add(firstSink)).to(b.add(secondsink));
return ClosedShape.getInstance();
}))
.run(materializer);
}
I am not that familiar with the Java API for akka-stream graphs, so I used the official doc. There are 2 errors in your snippet:
When you add the source to the graph builder, you need to get an Outlet from it. So instead of b.from(b.add(source)) it should be something like b.from(b.add(source).out()), according to the official doc.
You can't just call two .to methods in a row, because .to expects something with a Sink shape, which is a dead end. Instead you need to attach the 2nd sink to the bcast directly, like this:
(...).viaFanOut(bcast).to(b.add(firstSink));
b.from(bcast).to(b.add(secondSink));
All in all, the code should look like this:
ActorSystem system = ActorSystem.create("GraphBasics");
ActorMaterializer materializer = ActorMaterializer.create(system);
final Source<Integer, NotUsed> source = Source.range(1, 1000);
Sink<Integer, CompletionStage<Done>> firstSink = Sink.foreach(x -> System.out.println("first sink " + x));
Sink<Integer, CompletionStage<Done>> secondSink = Sink.foreach(x -> System.out.println("second sink " + x));
RunnableGraph.fromGraph(
GraphDSL.create(b -> {
UniformFanOutShape<Integer, Integer> bcast = b.add(Broadcast.create(2));
b.from(b.add(source).out()).viaFanOut(bcast).to(b.add(firstSink));
b.from(bcast).to(b.add(secondSink));
return ClosedShape.getInstance();
}
)
).run(materializer);
Final note: I would think twice about whether it makes sense to use the graph API. If your case is as simple as this one (just 2 sinks), you might want to just use alsoTo or alsoToMat. They give you the possibility to attach multiple sinks to the flow without the need to use graphs.
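For example, a minimal sketch with alsoTo, reusing the source, sinks and materializer defined above (note that alsoTo discards the first sink's materialized value; alsoToMat would keep it):
// Broadcast to firstSink as a side effect, then run the stream into secondSink.
source.alsoTo(firstSink)
      .runWith(secondSink, materializer);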
I want to create a time-based rule that is being triggered every 5 minutes, and Drools documentation states that:
Conversely, when the Drools engine runs in passive mode (i.e. using fireAllRules instead of fireUntilHalt), by default it doesn't fire consequences of timed rules unless fireAllRules is invoked again. However, it is possible to change this default behavior by configuring the KieSession with a TimedRuleExecutionOption, as shown in the following example:
KieSessionConfiguration ksconf = KieServices.Factory.get().newKieSessionConfiguration();
ksconf.setOption( TimedRuleExecutionOption.YES );
KieSession ksession = kbase.newKieSession(ksconf, null);
However, I am not accessing the KieSession object directly because I am using the Java REST API to send requests to a Drools project deployed on KieExecution Server like so (example taken directly from the Drools documentation):
public class MyConfigurationObject {
private static final String URL = "http://localhost:8080/kie-server/services/rest/server";
private static final String USER = "baAdmin";
private static final String PASSWORD = "password#1";
private static final MarshallingFormat FORMAT = MarshallingFormat.JSON;
private static KieServicesConfiguration conf;
private static KieServicesClient kieServicesClient;
public static void initializeKieServerClient() {
conf = KieServicesFactory.newRestConfiguration(URL, USER, PASSWORD);
conf.setMarshallingFormat(FORMAT);
kieServicesClient = KieServicesFactory.newKieServicesClient(conf);
}
public void executeCommands() {
String containerId = "hello";
System.out.println("== Sending commands to the server ==");
RuleServicesClient rulesClient = kieServicesClient.getServicesClient(RuleServicesClient.class);
KieCommands commandsFactory = KieServices.Factory.get().getCommands();
Command<?> insert = commandsFactory.newInsert("Some String OBJ");
Command<?> fireAllRules = commandsFactory.newFireAllRules();
Command<?> batchCommand = commandsFactory.newBatchExecution(Arrays.asList(insert, fireAllRules));
ServiceResponse<ExecutionResults> executeResponse = rulesClient.executeCommandsWithResults(containerId, batchCommand);
if(executeResponse.getType() == ResponseType.SUCCESS) {
System.out.println("Commands executed with success! Response: ");
System.out.println(executeResponse.getResult());
} else {
System.out.println("Error executing rules. Message: ");
System.out.println(executeResponse.getMsg());
}
}
}
So I'm a bit confused as to how I can pass this TimedRuleExecutionOption to the session.
I've already found a workaround by sending a FireAllRules command periodically but I'd like to know if I can configure this session option so that I don't have to add periodical triggering for every timed event I want to create.
Also, I've tried using FireUntilHalt instead of FireAllRules, but to my understanding that command blocks the execution thread on the server and I have to send a HaltCommand at some point, all of which I would like to avoid since I have a multi-threaded client that sends events to the server.
pass "-Ddrools.timedRuleExecution=true" while starting server instance where kie-server.war is deployed.
You can use the Drools cron timer. It acts as a timer and invokes the rule based on the cron expression. Example to execute a rule every 5 minutes:
rule "Send SMS every 5 minutes"
timer (cron: 0 0/5 * * * ?)
when
$a : Event( )
then
end
You can find the explanation here.
I am creating an app in Flink to:
Read Messages from a topic
Do some simple process on it
Write Result to a different topic
My code does work; however, it does not run in parallel.
How do I do that?
It seems my code runs on only one thread/block.
On the Flink Web Dashboard:
The app goes to running status
But there is only one block shown in the overview subtasks
And Bytes Received / Sent and Records Received / Sent are always zero (no update)
Here is my code; please assist me in learning how to split my app so that it can run in parallel, and tell me whether I am writing the app correctly.
public class SimpleApp {
public static void main(String[] args) throws Exception {
// create execution environment INPUT
StreamExecutionEnvironment env_in =
StreamExecutionEnvironment.getExecutionEnvironment();
// event time characteristic
env_in.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// production Ready (Does NOT Work if greater than 1)
env_in.setParallelism(Integer.parseInt(args[0].toString()));
// configure kafka consumer
Properties properties = new Properties();
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("auto.offset.reset", "earliest");
// create a kafka consumer
final DataStream<String> consumer = env_in
.addSource(new FlinkKafkaConsumer09<>("test", new
SimpleStringSchema(), properties));
// filter data
SingleOutputStreamOperator<String> result = consumer.filter(new
FilterFunction<String>(){
@Override
public boolean filter(String s) throws Exception {
return s.substring(0, 2).contentEquals("PS");
}
});
// Process Data
// Transform String Records to JSON Objects
SingleOutputStreamOperator<JSONObject> data = result.map(new
MapFunction<String, JSONObject>()
{
@Override
public JSONObject map(String value) throws Exception
{
JSONObject jsnobj = new JSONObject();
if(value.substring(0, 2).contentEquals("PS"))
{
// 1. Raw Data
jsnobj.put("Raw_Data", value.substring(0, value.length()-6));
// 2. Comment
int first_index_comment = value.indexOf("$");
int last_index_comment = value.lastIndexOf("$") + 1;
// - set comment
String comment =
value.substring(first_index_comment, last_index_comment);
comment = comment.substring(0, comment.length()-6);
jsnobj.put("Comment", comment);
}
else {
jsnobj.put("INVALID", value);
}
return jsnobj;
}
});
// Write JSON to Kafka Topic
data.addSink(new FlinkKafkaProducer09<JSONObject>("localhost:9092",
"FilteredData",
new SimpleJsonSchema()));
env_in.execute();
}
}
My code does work, but it seems to run only on a single thread (one block shown in the web interface; no data is passed, hence the bytes sent/received are not updated).
How do I make it run in parallel?
To run your job in parallel you can do 2 things:
Increase the parallelism of your job at the env level - i.e. do something like
StreamExecutionEnvironment env_in =
StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(4);
But this would only increase parallelism on the Flink side after it reads the data, so if the source is producing data faster, it might not be fully utilized.
To fully parallelize your job, set up multiple partitions for your Kafka topic, ideally matching the amount of parallelism you want for your Flink job. So you might want to do something like the following when you are creating your Kafka topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181
--replication-factor 3 --partitions 4 --topic test
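As a closing note, and only as a sketch based on the question's code: parallelism can also be set per operator, for example giving the Kafka source the same parallelism as the topic's partitions while keeping a different default for the rest of the job:
// Illustrative only: the source parallelism of 4 matches the partition count above.
DataStream<String> consumer = env_in
    .addSource(new FlinkKafkaConsumer09<>("test", new SimpleStringSchema(), properties))
    .setParallelism(4);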
I already looked at Spark streaming not remembering previous state, but it doesn't help.
Also looked at http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing, but I can't find JavaStreamingContextFactory, although I am using Spark Streaming 2.11 v2.0.1.
My code works fine, but when I restart it, it won't remember the last checkpoint...
Function0<JavaStreamingContext> scFunction = new Function0<JavaStreamingContext>() {
@Override
public JavaStreamingContext call() throws Exception {
//Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.milliseconds(SPARK_DURATION));
//checkpointDir = "hdfs://user:pw#192.168.1.50:54310/spark/checkpoint";
ssc.sparkContext().setCheckpointDir(checkpointDir);
StorageLevel.MEMORY_AND_DISK();
return ssc;
}
};
JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, scFunction);
Currently, data is streaming from Kafka and I am performing some transformations and actions.
JavaPairDStream<Integer, Long> responseCodeCountDStream = logObject.transformToPair
(MainApplication::responseCodeCount);
JavaPairDStream<Integer, Long> cumulativeResponseCodeCountDStream = responseCodeCountDStream.updateStateByKey
(COMPUTE_RUNNING_SUM);
cumulativeResponseCodeCountDStream.foreachRDD(rdd -> {
rdd.checkpoint();
LOG.warn("Response code counts: " + rdd.take(100));
});
Could somebody point me in the right direction if I am missing something?
Also, I can see that the checkpoint is being saved in HDFS. But why won't it read from it?