I need to aggregate a stream that is a join of two other streams. To do this, I specify a 1-day window, but I need to use the value stored in the message's JSON as the timestamp. Is it possible to specify your own timestamp for the stream?
//Record of stream1: {"a_id": 1, "b_id": 2}
//Record of stream2: {"b_id": 2, "timestamp": ...}
KStream<Long, JsonNode> aStream = builder
        .stream(aTopic, Consumed.with(Serdes.String(), jsonSerde))
        .selectKey((k, v) -> v.get("b_id").asLong());
KStream<Long, JsonNode> bStream = builder
        .stream(bTopic, Consumed.with(Serdes.String(), jsonSerde))
        .selectKey((k, v) -> v.get("b_id").asLong());
aStream.join(bStream,
        (JsonNode v1, JsonNode v2) -> JsonUtils.addFieldIntoJsonNode(v1, v2.get("timestamp"), "timestamp"),
        JoinWindows.of(Duration.ofHours(1)),
        StreamJoined.with(Serdes.Long(), jsonSerde, jsonSerde))
    .{some aggregation with windowing by that "timestamp" field}
I tried using a timestamp extractor, but I can only specify it when reading a stream, which does not fit my case, because the join window would then be different in the two streams.
What can be done in this case?
You can write your own Processor or Transformer and use the ProcessorContext within it. If your Kafka Streams version is sufficiently recent, you will find the method <K, V> void forward(K key, V value, To to) on ProcessorContext. The To class allows you to specify the timestamp to be used; the simplest call is To.all().withTimestamp(123456789L). A sketch follows.
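For illustration, here is a minimal sketch of such a Transformer that re-stamps each joined record with the timestamp embedded in its JSON value. It assumes the older Transformer/ProcessorContext API (To is available since Kafka 2.0) and the JsonNode values and "timestamp" field from the question; the class name is made up.

import com.fasterxml.jackson.databind.JsonNode;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;

public class EmbeddedTimestampTransformer implements Transformer<Long, JsonNode, KeyValue<Long, JsonNode>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<Long, JsonNode> transform(Long key, JsonNode value) {
        // Forward with the timestamp taken from the payload instead of the record timestamp.
        long ts = value.get("timestamp").asLong();
        context.forward(key, value, To.all().withTimestamp(ts));
        return null; // already forwarded manually
    }

    @Override
    public void close() { }
}

The joined stream can then be piped through it before the windowed aggregation, e.g. joined.transform(EmbeddedTimestampTransformer::new).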
You can use a custom timestamp extractor that you can either set globally for all input topics via default.timestamp.extractor config or pass on a per-topic basis via Consumed.with(...).withTimestampExtractor(...).
Cf https://docs.confluent.io/platform/current/streams/developer-guide/config-streams.html#default-timestamp-extractor
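As a hedged sketch of the extractor approach above (the "timestamp" field name and the fallback behaviour are assumptions), a custom extractor reading the embedded JSON timestamp could look like this:

import com.fasterxml.jackson.databind.JsonNode;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class JsonFieldTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof JsonNode && ((JsonNode) value).has("timestamp")) {
            // Use the timestamp embedded in the JSON payload.
            return ((JsonNode) value).get("timestamp").asLong();
        }
        // Fall back to the record's own timestamp if the field is missing.
        return record.timestamp();
    }
}

It can be wired in per topic via Consumed.with(Serdes.String(), jsonSerde).withTimestampExtractor(new JsonFieldTimestampExtractor()).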
I have a KStream<String, Event> that should be windowed (windowedBy) and aggregated, but the aggregation results in an out-of-memory error:
java.lang.OutOfMemoryError: Java heap space
The KStream DSL is as follows:
TimeWindows timeWindows = TimeWindows.of(Duration.ofDays(1)).advanceBy(Duration.ofMillis(1));
Initializer<History> historyInitializer = History::new;
Aggregator<String, Event, History> historyAggregator = (key, value, aggregate) -> {
    aggregate.key = value.uuid;
    aggregate.addHistoryEventWindow(value);
    return aggregate;
};
KTable<String, History> historyWindowed = eventStreamRaw
        .filter((key, value) -> value != null)
        .groupByKey(Grouped.with(Serdes.String(), this.eventSerde))
        // segment our messages into 1-day windows
        .windowedBy(timeWindows)
        .aggregate(historyInitializer, historyAggregator, Named.as("name"),
                Materialized.with(Serdes.String(), this.historySerde))
        .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
        .groupBy(
                (key, value) -> new KeyValue<String, History>(
                        value.key + "|+|" + key.window().start() + "|+|" + key.window().end(), value),
                Grouped.with(Serdes.String(), this.historySerde))
        .aggregate(History::new, (key, value, aggValue) -> value, (key, value, aggValue) -> value,
                Materialized.with(Serdes.String(), this.historySerde));
Reading some articles (for example "Kafka Streams Window By & RocksDB Tuning"), I noticed that I may have to configure the "Materialized" store with a retention of one day plus one millisecond.
But trying to add that doesn't work for me:
final Materialized<String, History, WindowStore<Bytes, byte[]>> store =
        Materialized.<String, History, WindowStore<Bytes, byte[]>>as("eventstore")
                .withKeySerde(Serdes.String())
                .withValueSerde(this.historySerde)
                .withRetention(Duration.ofDays(1).plus(Duration.ofMillis(1)));
KTable<String, History> historyWindowed = eventStreamRaw
        ...
        .aggregate(historyInitializer, historyAggregator, Named.as("name"), store)
The Java compiler throws the following error:
The method
aggregate(Initializer<VR>, Aggregator<? super String,? super Event,VR>, Named, Materialized<String,VR,WindowStore<Bytes,byte[]>>)
in the type TimeWindowedKStream<String,Event> is not applicable for the arguments
(Initializer<History>, Aggregator<String,Event,History>, Named, Materialized<String,History,WindowStore<Bytes,byte[]>>)
To be honest, I don't get it. The parameters are correct; the VR type is 'History'.
So, do you know what I'm missing?
The idea of this windowedBy KTable is to have a state that holds all events for one "thing" for one day. Let's say a new alert is produced: I want to attach all events of a "thing" for one day to the alert. I would then do a leftJoin from the KStream Alert to the KTable History. Would that be the best way to add historical data to a Kafka event? Is there a way to just "look up" the last x days of the KStream Events? I've checked the KStream Alert to KStream Event leftJoin, but that would produce an output for every new KStream Event, so from my point of view that is not practicable.
Many thanks for your help. I hope it's just a simple fix. Highly appreciated!
Looking at the post "Kafka Streams App - count and sum aggregate", I found that I had imported the wrong "Bytes" class. So, be sure to import "org.apache.kafka.common.utils.Bytes", as in the sketch below.
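For reference, a minimal hedged sketch of the imports that make the store declaration above compile (only the Bytes import was identified as the culprit; the rest mirrors the question):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes; // the Bytes type expected by WindowStore<Bytes, byte[]>
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.WindowStore;

// With this import in place, the declaration from the question compiles
// and can be passed as the fourth argument of aggregate(...).
final Materialized<String, History, WindowStore<Bytes, byte[]>> store =
        Materialized.<String, History, WindowStore<Bytes, byte[]>>as("eventstore")
                .withKeySerde(Serdes.String())
                .withValueSerde(this.historySerde)
                .withRetention(Duration.ofDays(1).plus(Duration.ofMillis(1)));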
But, maybe you have a better idea to enrich a Kafka message from one stream with historical data from another stream related by a (foreign) key.
Thanks guys.
Is there an equivalent of Hazelcast's IMap.values(Predicate) for Infinispan? And possibly a non-blocking (async) one?
Thanks.
It depends on what you are trying to do. Infinispan extends Java's Stream functionality, so you can use the Stream interface to get the filtered values.
Examples
// filter by key
cache.values().stream()
        .filterKeys(/* set with keys */)
        .forEach(/* do something with the value */); // or collect()
// filter by key and value
cache.getAdvancedCache().cacheEntrySet().stream()
        .filter(entry -> /* check key and value using entry.getKey() or entry.getValue() */)
        .map(StreamMarshalling.entryToValueFunction()) // extract the value only
        .forEach(/* do something with the value */); // or collect()
Infinispan documentation about streams here.
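As a hedged, self-contained sketch of the IMap.values(Predicate)-style lookup described above (the Person type and the age predicate are made up for illustration; for clustered caches the lambda would need to be serializable and would run on the owning nodes):

import java.util.List;
import java.util.stream.Collectors;
import org.infinispan.Cache;

public class ValuesByPredicate {

    // Hypothetical value type used only for this sketch.
    public static class Person {
        public final int age;
        public Person(int age) { this.age = age; }
    }

    // Rough equivalent of Hazelcast's IMap.values(agePredicate):
    // filter the entries and keep only the matching values.
    public static List<Person> adults(Cache<String, Person> cache) {
        return cache.entrySet().stream()
                .filter(e -> e.getValue().age >= 18)
                .map(e -> e.getValue())
                .collect(Collectors.toList());
    }
}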
My input is a Kafka stream with only one value, which is comma-separated. It looks like this:
"id,country,timestamp"
I have already split the dataset so that I have something like the following structured stream:
Dataset<Row> words = df
        .selectExpr("CAST (value AS STRING)")
        .as(Encoders.STRING())
        .withColumn("id", split(col("value"), ",").getItem(0))
        .withColumn("country", split(col("value"), ",").getItem(1))
        .withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with three columns, and I want to use the entries of each row in a custom function, e.g.:
Dataset<String> names = words.map(row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(id, country, timestamp);
    String name = person.getName();
    return name;
});
In the end I want to write out a comma-separated String again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema for the result.
You can either convert to an RDD and use map, or use the DataFrame map with an encoder.
Another option is to use Spark SQL with user-defined functions; you can read up on those.
If your use case is really as simple as you are showing, something like:
val nameRdd = words.rdd.map(x => f(x))
seems to be all you need.
If you still want a DataFrame, you can use something like:
val schema = StructType(Seq[StructField](StructField(dataType = StringType, name = s"name")))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. A DataFrame is just a Dataset[Row].
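For completeness, a hedged Java sketch of the "DataFrame map with an encoder" option mentioned above, assuming the three-column words Dataset from the question; the concatenation stands in for whatever custom logic (e.g. the Person lookup) you need:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

Dataset<String> names = words.map((MapFunction<Row, String>) row -> {
    String id = row.getAs("id");
    String country = row.getAs("country");
    String timestamp = row.getAs("timestamp");
    // Apply any custom logic here; returning a comma-separated string, as asked.
    return id + "," + country + "," + timestamp;
}, Encoders.STRING());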
If you have a custom function that cannot be expressed by composing functions in the existing Spark API [1], then you can either drop down to the RDD level (as @Ilya suggested) or use a UDF [2].
Typically I'll try to use the Spark API functions on a DataFrame whenever possible, as they are generally the best optimized.
If that's not possible, I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case, you need to pass multiple columns to your UDF; you can pass them as a comma-separated list, e.g. squared(col("col_a"), col("col_b")). A Java sketch with your three columns follows.
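A hedged sketch of a multi-column UDF for the question's three columns, assuming the SparkSession spark and the words Dataset from the question; the UDF name "combine_row" and its body are made up for illustration:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF that takes the three string columns and returns one string.
spark.udf().register("combine_row",
        (UDF3<String, String, String, String>) (id, country, timestamp) ->
                id + "," + country + "," + timestamp,
        DataTypes.StringType);

// Apply it by passing the columns as a comma-separated argument list.
Dataset<Row> result = words.select(
        callUDF("combine_row", col("id"), col("country"), col("timestamp")).as("value"));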
Since you are writing your UDF in Scala or Java it should be pretty efficient, but keep in mind that if you use Python, there will in general be extra latency due to data movement between the JVM and Python.
[1] https://spark.apache.org/docs/latest/api/scala/index.html#package
[2] https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
I've written a stream that takes in messages and sends out a table of the keys that have appeared. If something appears, it will show a count of 1. This is a simplified version of my production code in order to demonstrate the bug. In a live run, a message is sent out for each message received.
However, when I run it in a unit test using ProcessorTopologyTestDriver, I get a different behavior. If a key that has already been seen before is received, I get an extra message.
If I send messages with keys "key1", then "key2", then "key1", I get the following output.
key1 - 1
key2 - 1
key1 - 0
key1 - 1
For some reason, it decrements the value before adding it back in. This only happens when using ProcessorTopologyTestDriver. Is this expected? Is there a workaround? Or is this a bug?
Here's my topology:
final StreamsBuilder builder = new StreamsBuilder();
KGroupedTable<String, String> groupedTable = builder
        .table(applicationConfig.sourceTopic(), Consumed.with(Serdes.String(), Serdes.String()))
        .groupBy((key, value) -> KeyValue.pair(key, value), Serialized.with(Serdes.String(), Serdes.String()));
KTable<String, Long> countTable = groupedTable.count();
KStream<String, Long> countTableAsStream = countTable.toStream();
countTableAsStream.to(applicationConfig.outputTopic(), Produced.with(Serdes.String(), Serdes.Long()));
Here's my unit test code:
TopologyWithGroupedTable top = new TopologyWithGroupedTable(appConfig, map);
Topology topology = top.get();
ProcessorTopologyTestDriver driver = new ProcessorTopologyTestDriver(config, topology);
driver.process(inputTopic, "key1", "theval", Serdes.String().serializer(), Serdes.String().serializer());
driver.process(inputTopic, "key2", "theval", Serdes.String().serializer(), Serdes.String().serializer());
driver.process(inputTopic, "key1", "theval", Serdes.String().serializer(), Serdes.String().serializer());
ProducerRecord<String, Long> outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key1", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value());
outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key2", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value());
outputRecord = driver.readOutput(outputTopic, keyDeserializer, valueDeserializer);
assertEquals("key1", outputRecord.key());
assertEquals(Long.valueOf(1L), outputRecord.value()); //this fails, I get 0. If I pull another message, it shows key1 with a count of 1
Here's a repo of the full code:
https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/
Stream topology: https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/src/main/java/com/nick/kstreams/TopologyWithGroupedTable.java
Test code: https://bitbucket.org/nsinha/testtopologywithgroupedtable/src/master/src/test/java/com/nick/kstreams/TopologyWithGroupedTableTests.java
It's not a bug, but behavior by design (cf. the explanation below).
The difference in behavior is due to KTable state store caching (cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html). When you run the unit test, the cache is flushed after each record, while in your production run this is not the case. If you disable caching in your production run (for example via the config sketched below), I assume it behaves the same as in your unit test.
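For example, a minimal sketch of disabling the record cache in the application config (the values are illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// A cache size of 0 disables KTable record caching, so every update is forwarded downstream.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
// Optionally, a shorter commit interval also makes updates visible sooner.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);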
Side remark: ProcessorTopologyTestDriver is an internal class and not part of the public API. Thus, there is no compatibility guarantee. You should use the official unit-test packages instead (a sketch follows): https://docs.confluent.io/current/streams/developer-guide/test-streams.html
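A hedged sketch of the same test against the public test utils (kafka-streams-test-utils, 2.4+ API); the topology, topic names, and serdes mirror the question, the config values are placeholders:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.TopologyTestDriver;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-test");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
    TestInputTopic<String, String> input = driver.createInputTopic(
            inputTopic, Serdes.String().serializer(), Serdes.String().serializer());
    TestOutputTopic<String, Long> output = driver.createOutputTopic(
            outputTopic, Serdes.String().deserializer(), Serdes.Long().deserializer());

    input.pipeInput("key1", "theval");
    input.pipeInput("key2", "theval");
    input.pipeInput("key1", "theval");

    // The test driver flushes the cache per record, so the repeated key
    // yields the subtract (0) and add (1) pair described below.
    System.out.println(output.readKeyValuesToList());
}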
Why do you see two records:
In your code, you are using KTable#groupBy(), and in your specific use case you don't change the key. In general, however, the key might be changed (depending on the value of the input KTable). Thus, if the input KTable is updated, the downstream aggregation needs to remove/subtract the old key-value pair from the aggregation result and add the new key-value pair to it. In general, the keys of the old and new pair are different, so two records must be generated, because the subtraction and addition could happen on different instances, as different keys might be hashed differently. Does this make sense?
Thus, for each update of the input KTable, two updates to the result KTable, usually on two different key-value pairs, need to be computed. For your specific case, in which the key does not change, Kafka Streams does the same thing (there is no check/optimization that "merges" both operations into one if the key is actually the same).
@StreamListener("input")
@SendTo("output")
public KStream<?, MyObject> process(KStream<Object, IncomingObject> input) {
    KTable table = input.flatMapValues(value -> this.getMylogic(value));
    return table.toStream();
}
I am trying to convert a KStream to a KTable and then back to a KStream, but I am getting "cannot convert from KStream to KTable".
The value is JSON. Kindly help; also, how can I use aggregation here?
{
  "name": "test",
  "address": {
    "localAddress": "myaddress",
    "businessAddress": "testAddress"
  }
}
In the getMylogic method I am taking only the address to send to another topic.
Kindly help.
You need to apply an aggregation function, such as count(), to get a KTable as a result. Otherwise there is no way to do KStream -> KTable -> KStream.
What you need, and can do, is KStream.groupByKey().count() (for example) -> KTable -> KStream. So basically the results of the aggregation are published to a KStream, which can also be written to a Kafka topic; a minimal sketch follows.
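A hedged sketch of that route, adapted to the question's processor; the extractAddress helper, the incomingObjectSerde, and the re-keying by address are assumptions for illustration, and the @StreamListener/@SendTo style mirrors the question:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.messaging.handler.annotation.SendTo;

@StreamListener("input")
@SendTo("output")
public KStream<String, Long> process(KStream<Object, IncomingObject> input) {
    KTable<String, Long> counts = input
            // Re-key by the address field (extractAddress is a hypothetical helper).
            .map((key, value) -> KeyValue.pair(extractAddress(value), value))
            // incomingObjectSerde is an assumed Serde<IncomingObject>.
            .groupByKey(Grouped.with(Serdes.String(), incomingObjectSerde))
            // The aggregation turns the KStream into a KTable.
            .count();
    // Back to a KStream for the output topic.
    return counts.toStream();
}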