@StreamListener("input")
@SendTo("output")
public KStream<?, MyObject> process(KStream<Object, IncomingObject> input) {
    KTable table = input.flatMapValues(value -> this.getMylogic(value));
    return table.toStream();
}
I am trying to convert a KStream to a KTable and then back to a KStream, but I am getting the error "cannot convert from KStream to KTable". The value is JSON, for example:
{
  "name": "test",
  "address": {
    "localAddress": "myaddress",
    "businessAddress": "testAddress"
  }
}
In the mylogic method I take only the address, which I want to send to another topic. How can I fix this, and how can I use aggregation here as well?
You need to apply an aggregation function, such as count(), to get a KTable as a result. Otherwise there is no way to do KStream -> KTable -> KStream.
What you need, and can do, is KStream.groupByKey().count() (for example) -> KTable -> KStream. So the results of the aggregation are published as a KStream, which can in turn be written to a Kafka topic.
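A minimal sketch of that shape, reusing the question's annotations; the re-keying by the extracted address, the use of count(), and the assumption that getMylogic returns an Iterable of address strings are illustrative choices, not taken from the original code:

@StreamListener("input")
@SendTo("output")
public KStream<String, Long> process(KStream<Object, IncomingObject> input) {
    // flatMapValues returns a KStream; only a grouped aggregation yields a KTable
    KTable<String, Long> counts = input
            .flatMapValues(value -> this.getMylogic(value)) // assumed to return Iterable<String> of addresses
            .groupBy((key, address) -> address)             // re-key by the extracted address
            .count();                                       // the aggregation produces the KTable
    return counts.toStream();                               // back to a KStream for the output binding
}

Any other aggregation (aggregate, reduce) follows the same pattern: group first, aggregate into a KTable, then call toStream().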
I have three Kafka Streams for three different topics. To each of these streams I apply a filter operation so that I keep only records that contain a certain field, which is the message type.
Right now I have three streams that are the result of applying the filter operation, which we'll call Stream A, Stream B and Stream C.
I join the messages from each stream by key: first Stream A with Stream B, and then the result (Stream A + Stream B) with Stream C. (Messages on Stream C usually show up much later than those on A and B because the producer takes longer to produce them.)
What I want to know is how I can trigger an event (to signal an error) when one of the three required messages fails to appear within a defined time window. For example, say I want to join messages from Streams A and B and the message from Stream A appears but the one from Stream B does not, or both of them appear but the Stream C message does not. How can I signal that a message from a certain stream failed to appear within the given time window?
Below is the code I have, which works for the best-case scenario in which all messages appear within the time window.
// Filter the streams to get only records corresponding to reports of type FULL REPORT
KStream<String, String> filteredStreamA = StreamA.filter((reportId, jsonPayload) -> filterRecordsByReportType(jsonPayload));
KStream<String, String> filteredStreamB = StreamB.filter((reportId, jsonPayload) -> filterRecordsByReportType(jsonPayload));
KStream<String, String> filteredStreamC = StreamC.filter((reportId, jsonPayload) -> filterRecordsByReportType(jsonPayload));

// Inner join of Streams A and B by key within a 30-second window
KStream<String, String> joinedAandBStream = filteredStreamA
        .join(filteredStreamB, A_B_joiner, JoinWindows.of(Duration.ofSeconds(30)));

// Inner join of Streams (A+B) and C within a 60-second window
KStream<String, String> joinedAandBandC = joinedAandBStream
        .join(filteredStreamC, A_B_C_joiner, JoinWindows.of(Duration.ofSeconds(60)));
I need to aggregate a stream that is a join of two other streams. To do this I specify a window of one day, but I need to use the value stored in the JSON of the message as the timestamp. Is it realistic to specify your own timestamp for the stream?
//Record of stream1: {"a_id": 1, "b_id": 2}
//Record of stream2: {"b_id": 2, "timestamp": ...}
KStream<Long, JsonNode> aStream = builder
        .stream(aTopic, Consumed.with(Serdes.String(), jsonSerde))
        .selectKey((k, v) -> v.get("b_id").asLong());
KStream<Long, JsonNode> bStream = builder
        .stream(bTopic, Consumed.with(Serdes.String(), jsonSerde))
        .selectKey((k, v) -> v.get("b_id").asLong());

aStream.join(bStream,
        (JsonNode v1, JsonNode v2) -> JsonUtils.addFieldIntoJsonNode(v1, v2.get("timestamp"), "timestamp"),
        JoinWindows.of(Duration.ofHours(1)),
        StreamJoined.with(Serdes.Long(), jsonSerde, jsonSerde))
        .{some aggregation with windowing by that "timestamp" field}
I tried to use a timestamp extractor, but I can only specify it when reading a stream, which does not fit here, because the join window would then behave differently for the two streams.
What can be done in this case?
You can write your own Processor or Transformer and use the ProcessorContext within it. If your Kafka Streams version is sufficiently recent, you will find the method ProcessorContext.<K,V> forward(K key, V value, To to). The To class allows you to specify the timestamp to be used; the simplest call is To.all().withTimestamp(123456789L).
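A minimal sketch of that approach, assuming the joined stream from the question (Long keys, JsonNode values) and a "timestamp" field in the value; the class name and the way it is hooked in are illustrative:

import com.fasterxml.jackson.databind.JsonNode;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;

public class PayloadTimestampTransformer implements Transformer<Long, JsonNode, KeyValue<Long, JsonNode>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<Long, JsonNode> transform(Long key, JsonNode value) {
        long payloadTs = value.get("timestamp").asLong();
        // forward with the payload timestamp; downstream windowing then uses it
        context.forward(key, value, To.all().withTimestamp(payloadTs));
        return null; // everything was already forwarded explicitly above
    }

    @Override
    public void close() { }
}

It would be hooked in right after the join, e.g. joinedStream.transform(PayloadTimestampTransformer::new), before the windowed aggregation.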
You can use a custom timestamp extractor that you can either set globally for all input topics via default.timestamp.extractor config or pass on a per-topic basis via Consumed.with(...).withTimestampExtractor(...).
Cf https://docs.confluent.io/platform/current/streams/developer-guide/config-streams.html#default-timestamp-extractor
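A sketch of the per-topic variant, under the assumption that the deserialized value is a JsonNode carrying the "timestamp" field from the question; the class name and the fallback behaviour are illustrative:

import com.fasterxml.jackson.databind.JsonNode;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class JsonFieldTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof JsonNode && ((JsonNode) value).has("timestamp")) {
            return ((JsonNode) value).get("timestamp").asLong();
        }
        return partitionTime; // fall back if the field is missing
    }
}

It would be registered per topic, e.g. builder.stream(bTopic, Consumed.with(Serdes.String(), jsonSerde).withTimestampExtractor(new JsonFieldTimestampExtractor())).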
I have a KStream<String, Event> that should be windowed and aggregated, but the aggregation results in an out-of-memory error:
java.lang.OutOfMemoryError: Java heap space
The KStream DSL code is as follows:
TimeWindows timeWindows = TimeWindows.of(Duration.ofDays(1)).advanceBy(Duration.ofMillis(1));
Initializer<History> historyInitializer = History::new;
Aggregator<String, Event, History> historyAggregator = (key, value, aggregate) -> {
    aggregate.key = value.uuid;
    aggregate.addHistoryEventWindow(value);
    return aggregate;
};

KTable<String, History> historyWindowed = eventStreamRaw
        .filter((key, value) -> value != null)
        .groupByKey(Grouped.with(Serdes.String(), this.eventSerde))
        // segment our messages into 1-day windows
        .windowedBy(timeWindows)
        .aggregate(historyInitializer, historyAggregator, Named.as("name"), Materialized.with(Serdes.String(), this.historySerde))
        .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
        .groupBy(
                (key, value) -> new KeyValue<String, History>(
                        value.key + "|+|" + key.window().start() + "|+|" + key.window().end(), value),
                Grouped.with(Serdes.String(), this.historySerde))
        .aggregate(History::new, (key, value, aggValue) -> value, (key, value, aggValue) -> value,
                Materialized.with(Serdes.String(), this.historySerde));
Reading some articles (for example "Kafka Streams Window By & RocksDB Tuning") I noticed that I may have to configure the Materialized store with a retention of one day plus one millisecond. But trying to add that does not work for me:
final Materialized<String, History, WindowStore<Bytes, byte[]>> store =
        Materialized.<String, History, WindowStore<Bytes, byte[]>>as("eventstore")
                .withKeySerde(Serdes.String())
                .withValueSerde(this.historySerde)
                .withRetention(Duration.ofDays(1).plus(Duration.ofMillis(1)));

KTable<String, History> historyWindowed = eventStreamRaw
        ...
        .aggregate(historyInitializer, historyAggregator, Named.as("name"), store)
The Java compiler throws the following error:
The method
aggregate(Initializer<VR>, Aggregator<? super String,? super Event,VR>, Named, Materialized<String,VR,WindowStore<Bytes,byte[]>>)
in the type TimeWindowedKStream<String,Event> is not applicable for the arguments
(Initializer<History>, Aggregator<String,Event,History>, Named, Materialized<String,History,WindowStore<Bytes,byte[]>>)
To be honest, I don't get it. The parameters are correct; the VR type is 'History'.
So, do you know what I'm missing?
The idea of this windowed KTable is to have a state that holds all events for one "thing" for one day. Say a new alert is produced: I want to attach all events of that "thing" from the last day to the alert. I would then do a leftJoin from the Alert KStream to the History KTable. Would that be the best way to add historical data to a Kafka event? Is there a way to just "look up" the last x days of the Event KStream? I've checked a KStream(Alert)-KStream(Event) leftJoin, but that would produce an output for every new Event record, so from my point of view that is not practicable.
Many thanks for your help. I hope it's just a simple fix. Highly appreciated!
Looking at the post "Kafka Streams App - count and sum aggregate", I realized I had imported the wrong "Bytes" class. So, be sure to import the class org.apache.kafka.common.utils.Bytes.
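For reference, the imports that make the declaration above resolve to the intended aggregate overload; any other class named Bytes (or java.lang.Byte) in that position produces the "not applicable for the arguments" error shown above:

// The WindowStore generics in Materialized<K, V, WindowStore<Bytes, byte[]>> refer to
// Kafka's own Bytes wrapper, not java.lang.Byte or a Bytes class from another library.
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.WindowStore;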
But, maybe you have a better idea to enrich a Kafka message from one stream with historical data from another stream related by a (foreign) key.
Thanks guys.
I have a question about how to update JavaRDD values.
I have a JavaRDD<CostedEventMessage> with message objects containing information about which partition of a Kafka topic they should be written to.
I'm trying to change the partitionId field of these objects using the following code:
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
where the repartitionEvent logic is:
private CostedEventMessage repartitionEvent(CostedEventMessage costedEventMessage, int numPartitions) {
    costedEventMessage.setPartitionId(1);
    return costedEventMessage;
}
But the modification does not happen.
Could you please advise why, and how to correctly modify values in a JavaRDD?
Spark is lazy, so from the code you pasted above it is not clear whether you actually performed any action on the JavaRDD (such as collect or foreach), or how you came to the conclusion that the data was not changed.
For example, if you assumed that by running the following code:
List<CostedEventMessage> messagesLst = ...;
JavaRDD<CostedEventMessage> rddToKafka = javaSparkContext.parallelize(messagesLst);
rddToKafka = rddToKafka.map(event -> repartitionEvent(event, numPartitions));
each element in messagesLst would have its partitionId set to 1, you are wrong.
That would only hold true if you also performed an action, for example:
messagesLst = rddToKafka.collect();
For more details, refer to the Spark documentation.
I've written a stream that takes in messages and sends out a table of the keys that have appeared. If something appears, it shows a count of 1. This is a simplified version of my production code, meant to demonstrate the bug. In a live run, one message is sent out for each message received.
However, when I run it in a unit test using ProcessorTopologyTestDriver, I get different behavior. If a key that has already been seen before is received, I get an extra message.
If I send messages with keys "key1", then "key2", then "key1", I get the following output:
key1 - 1
key2 - 1
key1 - 0
key1 - 1
For some reason, it decrements the value before adding it back in. This only happens when using ProcessorTopologyTestDriver. Is this expected? Is there a workaround? Or is this a bug?
Here's my topology:
final StreamsBuilder builder = new StreamsBuilder();
KGroupedTable<String, String> groupedTable = builder
        .table(applicationConfig.sourceTopic(), Consumed.with(Serdes.String(), Serdes.String()))
        .groupBy((key, value) -> KeyValue.pair(key, value), Serialized.with(Serdes.String(), Serdes.String()));
KTable<String, Long> countTable = groupedTable.count();
KStream<String, Long> countTableAsStream = countTable.toStream();
countTableAsStream.to(applicationConfig.outputTopic(), Produced.with(Serdes.String(), Serdes.Long()));
Here's my unit test code:
TopologyWithGroupedTable top = new TopologyWithGroupedTable(appConfig, map);
Topology topology = top.get();
ProcessorTopologyTestDriver driver = new ProcessorTopologyTestDriver(config, topology);
driver.process(inputTopic, "key1", "theval", Serdes.String().serializer(), Serdes.String().serializer());
driver.process(inputTopic, "key2", "theval", Serdes.String().serializer(), Serdes.String().serializer());
driver.process(inputTopic, "key1", "theval", Serdes.String().serializer(), Serdes.String().serializer());
ProducerRecord<String, Long> outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key1", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value());
outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key2", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value());
outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key1", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value()); //this fails, I get 0. If I pull another message, it shows key1 with a count of 1
Here's a repo of the full code:
https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/
Stream topology: https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/src/main/java/com/nick/kstreams/TopologyWithGroupedTable.java
Test code: https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/src/test/java/com/nick/kstreams/TopologyWithGroupedTableTests.java
It's not a bug, but behavior by design (cf. explanation below).
The difference in behavior is due to KTable state store caching (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html). When you run the unit test, the cache is flushed after each record, while in your production run this is not the case. If you disable caching in your production run, I assume it will behave the same as in your unit test.
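A sketch of how caching could be disabled in the production configuration (the property is a standard StreamsConfig constant; a size of zero turns the record cache off entirely):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// a cache size of 0 disables the KTable record cache, so every update is forwarded downstream
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);

Note that with the cache disabled every intermediate update is emitted, which increases downstream traffic.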
Side remark: ProcessorTopologyTestDriver is an internal class and not part of the public API, so there is no compatibility guarantee. You should use the official unit-test packages instead: https://docs.confluent.io/current/streams/developer-guide/test-streams.html
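A sketch of the same test using the official kafka-streams-test-utils API (the TestInputTopic/TestOutputTopic methods shown require Kafka 2.4+; topology, inputTopic, and outputTopic are the objects from the question):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.TopologyTestDriver;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-test");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
    TestInputTopic<String, String> input = driver.createInputTopic(
            inputTopic, Serdes.String().serializer(), Serdes.String().serializer());
    TestOutputTopic<String, Long> output = driver.createOutputTopic(
            outputTopic, Serdes.String().deserializer(), Serdes.Long().deserializer());

    input.pipeInput("key1", "theval");
    input.pipeInput("key2", "theval");
    input.pipeInput("key1", "theval");

    // drains every record the topology has emitted so far, including the subtract/add pair
    output.readKeyValuesToList().forEach(System.out::println);
}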
Why do you see two records:
In your code you are using KTable#groupBy(), and in your specific use case you don't change the key. In general, however, the key may change (depending on the value of the input KTable). Thus, if the input KTable is updated, the downstream aggregation needs to remove/subtract the old key-value pair from the aggregation result and add the new key-value pair to it. In general the keys of the old and new pairs are different, so two records must be generated, because the subtraction and addition could happen on different instances, as different keys might be hashed differently. Does this make sense?
Thus, for each update of the input KTable, two updates to the result KTable, usually on two different key-value pairs, need to be computed. For your specific case, in which the key does not change, Kafka Streams does the same thing (there is no check/optimization to "merge" the two operations into one when the key is actually the same).