Apache Flink join different DataStreams on specific key - java

I have two DataStreams: the first one, DataStream<String> source, receives records from a message broker; the second one, SingleOutputStreamOperator<Event> events, is the result of mapping the source into Event.class.
Some of my use cases need SingleOutputStreamOperator<Event> events and others need DataStream<String> source. In one of the use cases that works on DataStream<String> source, I end up with a SingleOutputStreamOperator<String> after applying some filters. To avoid mapping the source into Event.class a second time (that work is already done), I need to look up each record of that filtered SingleOutputStreamOperator<String> in SingleOutputStreamOperator<Event> events and then apply another map to produce a SingleOutputStreamOperator<EventOutDto> as output.
This is the idea, as an example:
DataStream<String> source = env.readFrom(source);
SingleOutputStreamOperator<Event> events = source.map(s -> mapper.readValue(s, Event.class));

public void filterAndJoin(DataStream<String> source, SingleOutputStreamOperator<Event> events) {
    SingleOutputStreamOperator<String> filtered = source.filter(new FilterFunction());
    SingleOutputStreamOperator<EventOutDto> result = (this will be the result of searching each record
        of the filtered stream in the events stream, where the ids must match, returning the event if found)
        .map(event -> new EventOutDto(event));
    result.addSink(new RichSinkFunction());
}
I have this code:
filtered.join(events)
    .where(k -> {
        JsonNode tree = mapper.readTree(k);
        String id = "";
        if (tree.get("Id") != null) {
            id = tree.get("Id").asText();
        }
        return id;
    })
    .equalTo(e -> e.Id)
    .window(TumblingEventTimeWindows.of(Time.seconds(1)))
    .apply(new JoinFunction<String, Event, EventOutDto>() {
        @Override
        public EventOutDto join(String s, Event event) throws Exception {
            return new EventOutDto(event);
        }
    })
    .addSink(new SinkFunction());
The above code runs without errors and the ids do match, so the where(id).equalTo(id) pairing should work, but the process never reaches the apply function.
Observation: watermarks are assigned with the same timestamp.
Questions:
Any idea why?
Have I explained myself clearly?

I solved the join by doing this:
SingleOutputStreamOperator<ObjectDTO> triggers = candidates
    .keyBy(new KeySelector())
    .intervalJoin(keyedStream.keyBy(e -> e.Id))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessFunctionOne())
    .keyBy(k -> k.otherId)
    .process(new ProcessFunctionTwo());

Related

KTable with list of values

I would like to create a KTable<String, List<TimePeriod>>. When the app receives a Record<String, TimePeriod>, it should add the timePeriod to the corresponding list if the key already exists, and otherwise create a new list with this timePeriod.
KTable<String, List<TimePeriod>> kTable = streamsBuilder
    .stream(topologyConfig.exemptionsTopic(), Consumed.with(Serdes.String(), new TimePeriodSerde()))
    .groupByKey()
    .aggregate(
        ArrayList::new, /* initializer */
        (key, timePeriod, timePeriodList) -> { /* aggregator */
            timePeriodList.add(timePeriod);
            return timePeriodList;
        },
        Materialized.<String, List<TimePeriod>, KeyValueStore<Bytes, byte[]>>as("storeName")
            .withValueSerde(Serdes.ListSerde(ArrayList.class, new TimePeriodSerde())));
I'm not quite sure that this is proper code.
Later I would like to join a stream of <String, Product> with this KTable. Assume that the key in both the KStream and the KTable is productName.
I would like to join a product with a time period if product.getProductionDate() lies in any of these timePeriods. So, something like this...
ValueJoiner<Product, List<TimePeriod>, JoinedProduct> joiner = (product, periodList) -> {
    for (TimePeriod period : periodList) {
        if (product.isInRange(period)) {
            return new JoinedProduct(product, period);
        }
    }
    return new JoinedProduct(product, null);
};

streamsBuilder.stream(topologyConfig.productsTopic(), Consumed.with(Serdes.String(), new ProductSerde()))
    .leftJoin(kTable, joiner)
    .to(...);
When I try to execute this, there is an error while building the project, but IntelliJ IDEA doesn't underline anything.
I lean more towards there being a mistake in how the KTable is created.
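For what it's worth, here is a minimal sketch of the same aggregation with the serdes spelled out explicitly on both groupByKey and Materialized, which tends to surface (or remove) the generics problems around aggregate() that the IDE sometimes misses; the topic name and TimePeriodSerde come from the question, everything else is an assumption:

// Sketch: same aggregation with explicit serdes everywhere. Names come from the question;
// Serdes.ListSerde requires Kafka 2.8+.
Serde<List<TimePeriod>> listSerde = Serdes.ListSerde(ArrayList.class, new TimePeriodSerde());

KTable<String, List<TimePeriod>> exemptions = streamsBuilder
        .stream(topologyConfig.exemptionsTopic(),
                Consumed.with(Serdes.String(), new TimePeriodSerde()))
        .groupByKey(Grouped.with(Serdes.String(), new TimePeriodSerde()))
        .aggregate(
                ArrayList::new,
                (key, timePeriod, list) -> { list.add(timePeriod); return list; },
                Materialized.<String, List<TimePeriod>, KeyValueStore<Bytes, byte[]>>as("storeName")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(listSerde));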

Java - Spring Boot - Reactive Redis Stream (TEXT_EVENT_STREAM_VALUE)

I want to write an endpoint which always shows the newest messages of a redis stream (reactive).
The entities look like this {'key' : 'some_key', 'status' : 'some_string'}.
So I would like to have the following result:
The page is called; its content would, for instance, display one entity:
{'key' : 'abc', 'status' : 'status_A'}
the page is not closed
Then a new entity is added to the stream
XADD mystream * key abc status statusB
Now I would like to see each item of the stream, without refreshing the tab:
{'key' : 'abc', 'status' : 'status_A'}
{'key' : 'abc', 'status' : 'status_B'}
When I try to mock this behavior it works and I get the expected output.
#GetMapping(value="/light/live/mock", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
#ResponseBody
public Flux<Light> liveLightMock() {
List<Light> test = Arrays.asList(new Light("key", "on") , new Light("key", "off"),
new Light("key", "on") , new Light("key", "off"),
new Light("key", "on") , new Light("key", "off"),
new Light("key", "on") , new Light("key", "off"),
new Light("key", "on") , new Light("key", "off"));
return Flux.fromIterable(test).delayElements(Duration.ofMillis(500));
}
The individual elements of the list are displayed one after another with a 500 ms delay between items.
However, when I try to access Redis instead of the mocked variant, it no longer works. I am testing the partial functions one after another: first the save function (1) must work; if the save function works, displaying old records without reactive features must work (2); and last but not least, if both work, I still need to get the reactive part going.
Maybe you can help me get the reactive part working. I have been working on it for days without any improvement.
Thanks :)
Test 1) - Saving Function (Short Version)
It looks like it's working.
#GetMapping(value="/light/create", produces = MediaType.APPLICATION_JSON_VALUE)
#ResponseBody
public Flux<Light> createTestLight() {
String status = (++statusIdx % 2 == 0) ? "on" : "off";
Light light = new Light(Consts.LIGHT_ID, status);
return LightRepository.save(light).flux();
}
#Override
public Mono<Light> save(Light light) {
Map<String, String> lightMap = new HashMap<>();
lightMap.put("key", light.getKey());
lightMap.put("status", light.getStatus());
return operations.opsForStream(redisSerializationContext)
.add("mystream", lightMap)
.map(__ -> light);
}
Test 2) - Loading/Reading Function (Short Version)
It seems to be working, but not reactively: I added a new entity while a web view was open; the view showed all existing items but did not update once I added new items. After reloading I saw every item.
How can I get getLights to return something that works with TEXT_EVENT_STREAM_VALUE and subscribes to the stream?
@Override
public Flux<Object> getLights() {
    ReadOffset readOffset = ReadOffset.from("0");
    StreamOffset<String> offset = StreamOffset.fromStart("mystream"); // fromStart or latest
    Function<? super MapRecord<String, Object, Object>, ? extends Publisher<?>> mapFunc = entries -> {
        Map<Object, Object> kvp = entries.getValue();
        String key = (String) kvp.get("key");
        String status = (String) kvp.get("status");
        Light light = new Light(key, status);
        return Flux.just(light);
    };
    return operations.opsForStream()
        .read(offset)
        .flatMap(mapFunc);
}

@GetMapping(value = "/light/live", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
@ResponseBody
public Flux<Object> lightLive() {
    return LightRepository.getLights();
}
Test 1) - Saving Function (Long Version)
The Endpoint & Saving Functions are part of Diffrent Classes.
String status = (++statusIdx % 2 == 0) ? "on" : "off"; flip flops the status from on to off, to on, to off, ...
#GetMapping(value="/light/create", produces = MediaType.APPLICATION_JSON_VALUE)
#ResponseBody
public Flux<Light> createTestLight() {
String status = (++statusIdx % 2 == 0) ? "on" : "off";
Light light = new Light(Consts.LIGHT_ID, status);
return LightRepository.save(light).flux();
}
#Override
public Mono<Light> save(Light light) {
Map<String, String> lightMap = new HashMap<>();
lightMap.put("key", light.getKey());
lightMap.put("status", light.getStatus());
return operations.opsForStream(redisSerializationContext)
.add("mystream", lightMap)
.map(__ -> light);
}
To validate the functions I:
Deleted the stream, to empty it:
127.0.0.1:6379> del mystream
(integer) 1
127.0.0.1:6379> XLEN mystream
(integer) 0
Called the creation endpoint /light/create twice.
I expected the stream to now have two items, one with status = on and one with status = off:
127.0.0.1:6379> XLEN mystream
(integer) 2
127.0.0.1:6379> xread STREAMS mystream 0-0
1) 1) "mystream"
2) 1) 1) "1610456865517-0"
2) 1) "key"
2) "light_1"
3) "status"
4) "off"
2) 1) "1610456866708-0"
2) 1) "key"
2) "light_1"
3) "status"
4) "on"
It looks like the saving part is working.
Test 2) - Loading/Reading Function (Long Version)
It seems to be working, but not reactively: I add a new entity and the page does not update its values until I reload.
@Override
public Flux<Object> getLights() {
    ReadOffset readOffset = ReadOffset.from("0");
    StreamOffset<String> offset = StreamOffset.fromStart("mystream"); // fromStart or latest
    Function<? super MapRecord<String, Object, Object>, ? extends Publisher<?>> mapFunc = entries -> {
        Map<Object, Object> kvp = entries.getValue();
        String key = (String) kvp.get("key");
        String status = (String) kvp.get("status");
        Light light = new Light(key, status);
        return Flux.just(light);
    };
    return operations.opsForStream()
        .read(offset)
        .flatMap(mapFunc);
}

@GetMapping(value = "/light/live", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
@ResponseBody
public Flux<Object> lightLive() {
    return LightRepository.getLights();
}
1. Call /light/live: I should see N entries. If I can see entries, the normal (non-reactive) display is working.
2. Call /light/create twice: the live view should gain 2 entries, i.e. show N+2 entries.
3. Wait one minute, just to be safe.
4. The view should now show N+2 entries if the reactive part is working.
5. Refresh the view from step 1 (/light/live): it should still show the same amount if the reactive part works.
Displaying the information works (1), and the adding part of (2) worked (checked via the terminal), but (4) didn't work; ergo the display is working, but it is not reactive.
After I refreshed the browser (5) I got the expected N+2 entries, so (2) worked as well.
There's a misconception here: reading from Redis reactively does not mean you have subscribed to new events.
Reactive will not give you live updates; it will call Redis once and display whatever is there at that moment. So even if you wait for a day or two, nothing is going to change in the UI/console; you will still see N entries.
You need to either use Redis PUB/SUB or call Redis repeatedly to get the latest updates.
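If you go the PUB/SUB route, a minimal sketch of a reactive subscription with Spring Data Redis could look like the following; the channel name and the container wiring are assumptions, not part of the code above:

// Sketch only: subscribe to a Redis channel reactively; published messages are pushed
// into the Flux as they arrive. Channel name and connectionFactory are assumptions.
ReactiveRedisMessageListenerContainer container =
        new ReactiveRedisMessageListenerContainer(connectionFactory);

Flux<String> liveMessages = container
        .receive(ChannelTopic.of("lights"))
        .map(ReactiveSubscription.Message::getMessage);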
EDIT:
A working solution:
private List<Light> reactiveReadToList() {
    log.info("reactiveReadToList");
    return read().collectList().block();
}

private Flux<Light> read() {
    StreamOffset<Object> offset = StreamOffset.fromStart("mystream");
    return redisTemplate
        .opsForStream()
        .read(offset)
        .flatMap(e -> {
            Map<Object, Object> kvp = e.getValue();
            String key = (String) kvp.get("key");
            String id = (String) kvp.get("id");
            String status = (String) kvp.get("status");
            Light light = new Light(id, key, status);
            log.info("{}", light);
            return Flux.just(light);
        });
}
A reader that reads data from Redis on demand using the reactive template and sends it to the client as demand arrives, tracking an offset; it sends one event at a time, though we could send all of them at once.
@RequiredArgsConstructor
class DataReader {
    @NonNull FluxSink<Light> sink;
    private List<Light> readLights = null;
    private int currentOffset = 0;

    void register() {
        readLights = reactiveReadToList();
        sink.onRequest(e -> {
            long demand = sink.requestedFromDownstream();
            for (int i = 0; i < demand && currentOffset < readLights.size(); i++, currentOffset++) {
                sink.next(readLights.get(currentOffset));
            }
            if (currentOffset == readLights.size()) {
                readLights = reactiveReadToList();
                currentOffset = 0;
            }
        });
    }
}
A method that uses DataReader to generate the Flux:
public Flux<Light> getLights() {
    return Flux.create(e -> new DataReader(e).register());
}
Now we've added an onRequest handler on the sink to handle client demand; it reads data from the Redis stream as required and sends it to the client.
This looks quite CPU intensive, so we should probably delay the calls when there are no new events, for example by sleeping (or scheduling the re-read) inside the register method when the stream has no new elements.
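As a rough illustration of that last point, here is one way to poll the stream with a delay and forward only records newer than the last id already seen, which keeps the SSE connection streaming new entries; the template, entity and field names mirror the code above and are otherwise assumptions, not part of the original answer:

// Sketch only: poll every second and emit only records newer than the last seen id.
// redisTemplate, Light and the field names are assumptions based on the code above.
public Flux<Light> pollLights() {
    AtomicReference<String> lastSeenId = new AtomicReference<>("0");
    return Flux.interval(Duration.ofSeconds(1))
        .concatMap(tick -> redisTemplate.opsForStream()
            .read(StreamOffset.create("mystream", ReadOffset.from(lastSeenId.get())))
            .doOnNext(record -> lastSeenId.set(record.getId().getValue()))
            .map(record -> {
                Map<Object, Object> kvp = record.getValue();
                return new Light((String) kvp.get("key"), (String) kvp.get("status"));
            }));
}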

Side input in global window as slowly changing cache questions

Context:
We have some schema files in Cloud Storage. In our Dataflow job, we need to refer to these schema files to transform our data. These schema files change on a daily/weekly basis. Our data source is PubSub and we window PubSub messages into fixed windows of 1 minute. The schema files we need fit well into memory; they are about 90 MB in total.
What I have tried:
Referring to this doc from Apache Beam, we created a side input that writes into a global window with a GenerateSequence like so:
// Creates a side input that refreshes the schema every minute
PCollectionView<Map<String, byte[]>> dataBlobView =
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
        .apply(Window.<Long>into(new GlobalWindows()).triggering(
                Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, Map<String, byte[]>>() {
            @ProcessElement
            public void processElement(ProcessContext ctx) throws Exception {
                byte[] avroSchemaBlob = getAvroSchema();
                byte[] fileDescriptorSetBlob = getFileDescriptorSet();
                byte[] depsBlob = getFileDescriptorDeps();
                Map<String, byte[]> dataBlobs = ImmutableMap.of(
                    "version", Longs.toByteArray(ctx.element().byteValue()),
                    "avroSchemaBlob", avroSchemaBlob,
                    "fileDescriptorSetBlob", fileDescriptorSetBlob,
                    "depsBlob", depsBlob);
                ctx.output(dataBlobs);
            }
        }))
        .apply(View.asSingleton());
"getAvroSchema", "getFileDescriptorSet" and "getFileDescriptorDeps" read files as byte[] from Cloud Storage.
However, this approach failed with the exception:
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: PCollection with more than one element accessed as a singleton view.
I then tried writing my own Combine Globally function like so:
static class GetLatestVersion implements SerializableFunction<Iterable<Map<String, byte[]>>, Map<String, byte[]>> {
    @Override
    public Map<String, byte[]> apply(Iterable<Map<String, byte[]>> versions) {
        Map<String, byte[]> result = Maps.newHashMap();
        Long maxVersion = Long.MIN_VALUE;
        for (Map<String, byte[]> version : versions) {
            Long currentVersion = Longs.fromByteArray(version.get("version"));
            logger.info("Side input version: " + currentVersion);
            if (currentVersion > maxVersion) {
                result = version;
                maxVersion = currentVersion;
            }
        }
        return result;
    }
}
But it still triggers the same exception.
I then came across this and this Beam email archive thread, and it seems like what's suggested in the Beam doc does not work: I have to use a MultiMap to avoid the exception I ran into above. With a MultiMap, I will also have to iterate through the values and apply my own logic to pick the value I want (the latest).
My questions:
Why do I still get the exception "PCollection with more than one element accessed as a singleton view" even after I globally combine everything into 1 result?
If I go with the MultiMap approach, wouldn't the job eventually run out of memory? Every day we would basically add another 90 MB (the size of our data blob) to the MultiMap, unless Dataflow has some smart MultiMap implementation behind the scenes.
What is the recommended way to do this?
Thanks
Use .apply(View.asMap()) instead of .apply(View.asSingleton()).
This is the full example:
PCollectionView<Map<String, byte[]>> dataBlobView =
    pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
        .apply(Window.<Long>into(new GlobalWindows()).triggering(
                Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(ParDo.of(new DoFn<Long, KV<String, byte[]>>() {
            @ProcessElement
            public void processElement(ProcessContext ctx) throws Exception {
                byte[] avroSchemaBlob = getAvroSchema();
                byte[] fileDescriptorSetBlob = getFileDescriptorSet();
                byte[] depsBlob = getFileDescriptorDeps();
                ctx.output(KV.of("version", Longs.toByteArray(ctx.element().byteValue())));
                ctx.output(KV.of("avroSchemaBlob", avroSchemaBlob));
                ctx.output(KV.of("fileDescriptorSetBlob", fileDescriptorSetBlob));
                ctx.output(KV.of("depsBlob", depsBlob));
            }
        }))
        .apply(View.asMap());
You can then use the map from the side input as described in the documentation.
Apache Beam version 2.34.0
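For completeness, a minimal sketch of how the map side input could then be consumed in the main pipeline; the DoFn below is illustrative (mainInput and the pass-through output are assumptions), only the view and the blob keys come from the code above:

// Sketch: reading the schema blobs from the map side input inside a DoFn.
mainInput.apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext ctx) {
            Map<String, byte[]> blobs = ctx.sideInput(dataBlobView);
            byte[] avroSchemaBlob = blobs.get("avroSchemaBlob");
            // ... transform ctx.element() using the schema ...
            ctx.output(ctx.element());
        }
    }).withSideInputs(dataBlobView));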

Extracting Timestamp from producer message

I really need help!
I can't extract the timestamp of a message sent by a producer. In my project I work with JSON: I have a class in which I define the keys and one in which I define the values of the message that I send via a producer to a "Raw" topic. I have two other classes that do the same thing for the output message that my consumer will read from the topic called "Tdt". In the main class KafkaStreams.java I define the stream and map the keys and values. After starting Kafka locally, I start a producer that writes a message to the "raw" topic with keys and values, and then in another shell the consumer starts reading the output message from the "tdt" topic. How do I get the event timestamp? I need to know the timestamp at which the message was sent by the producer. Do I need a TimestampExtractor?
Here is my main class KafkaStreams (my application works fine, I just need the timestamp):
#Bean("app1StreamTopology")
public KStream<LibAssIbanRawKey, LibAssIbanRawValue> kStream() throws ParseException {
JsonSerde<Dwsitspr4JoinValue> Dwsitspr4JoinValueSerde = new JsonSerde<>(Dwsitspr4JoinValue.class);
KStream<LibAssIbanRawKey, LibAssIbanRawValue> stream = defaultKafkaStreamsBuilder.stream(inputTopic);
stream.peek((k,v) -> logger.info("Debug3 Chiave descrizione -> ({})",v.getCATRAPP()));
GlobalKTable<Integer, Dwsitspr4JoinValue> categoriaRapporto = defaultKafkaStreamsBuilder
.globalTable(temptiptopicname,
Consumed.with(Serdes.Integer(), Dwsitspr4JoinValueSerde)
// .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST)
);
logger.info("Debug3 Chiave descrizione -> ({})",categoriaRapporto.toString()) ;
stream.peek((k,v) -> logger.info("Debug4 Chiave descrizione -> ({})",v.getCATRAPP()) );
stream
.join(categoriaRapporto, (k, v) -> v.getCATRAPP(), (valueStream, valueGlobalKtable) -> {
// Value mapping
LibAssIbanTdtValue newValue = new LibAssIbanTdtValue();
newValue.setDescrizioneRidottaCodiceCategoriaDelRapporto(valueGlobalKtable.getDescrizioneRidotta());
newValue.setDescrizioneEstesaCodiceCategoriaDelRapporto(valueGlobalKtable.getDescrizioneEstesa());
newValue.setIdentificativo(valueStream.getAUD_CCID());
.
.
.//Other Value Mapped
.
.
.map((key, value) -> {
// Key mapping
LibAssIbanTdtKey newKey = new LibAssIbanTdtKey();
newKey.setData(dtf.format(localDate));
newKey.setIdentificatoreUnivocoDellaRigaDiTabella(key.getTABROWID());
return KeyValue.pair(newKey, value);
}).to(outputTopic, Produced.with(new JsonSerde<>(LibAssIbanTdtKey.class), new JsonSerde<>(LibAssIbanTdtValue.class)));
return stream;
}
}
Yes, you need a TimestampExtractor.
public class YourTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> consumerRecord, long l) {
        // do whatever you want with the timestamp available via consumerRecord.timestamp()
        ...
        // return the timestamp you want to use (here the default)
        return consumerRecord.timestamp();
    }
}
You'll need to tell Kafka Streams which extractor to use via the StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG property.
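For example, something along these lines when building the streams configuration (the application id and bootstrap servers shown here are placeholders, not values from the question):

// Sketch: register the custom extractor in the Kafka Streams configuration.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
        YourTimestampExtractor.class);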

Using ElasticSearch's script_upsert to create a document

According to the official documentation (Update API - Upserts), one can use scripted_upsert in order to handle an update (for an existing document) or an insert (for a new document) from within the script. The thing is, they never show how the script should look to do that, and the Java Update API documentation doesn't have any information on how scriptedUpsert is used.
This is the code I'm using:
// My function to build and use the upsert
public void scriptedUpsert(String key, String parent, String scriptSource, Map<String, ? extends Object> parameters) {
    Script script = new Script(scriptSource, ScriptType.INLINE, null, parameters);
    UpdateRequest request = new UpdateRequest(index, type, key);
    request.scriptedUpsert(true);
    request.script(script);
    if (parent != null) {
        request.parent(parent);
    }
    this.bulkProcessor.add(request);
}

// A test call to validate the function
String scriptSource = "if (!ctx._source.hasProperty(\"numbers\")) {ctx._source.numbers=[]}";
Map<String, List<Integer>> parameters = new HashMap<>();
List<Integer> numbers = new LinkedList<>();
numbers.add(100);
parameters.put("numbers", numbers);
bulk.scriptedUpsert("testUser", null, scriptSource, parameters);
And I'm getting the following exception when the "testUser" document doesn't exist:
DocumentMissingException[[user][testUser]: document missing
How can I make scriptedUpsert work from the Java code?
This is how a scripted_upsert command (and its script) should look:
POST /sessions/session/1/_update
{
  "scripted_upsert": true,
  "script": {
    "inline": "if (ctx.op == \"create\") ctx._source.numbers = newNumbers; else ctx._source.numbers += updatedNumbers",
    "params": {
      "newNumbers": [1, 2, 3],
      "updatedNumbers": [55]
    }
  },
  "upsert": {}
}
If you call the above command and the index doesn't exist, it will create it, together with the newNumbers values in the new document. If you call the exact same command again, the numbers values will become 1, 2, 3, 55.
And in your case you are missing the "upsert": {} part.
As Andrei suggested, I was missing the upsert part. Changing the function to:
public void scriptedUpsert(String key, String parent, String scriptSource, Map<String, ? extends Object> parameters) {
    Script script = new Script(scriptSource, ScriptType.INLINE, null, parameters);
    UpdateRequest request = new UpdateRequest(index, type, key);
    request.scriptedUpsert(true);
    request.script(script);
    request.upsert("{}"); // <--- The change
    if (parent != null) {
        request.parent(parent);
    }
    this.bulkProcessor.add(request);
}
fixed it.
