I have a Kafka Streams application with an input topic named "input", on which records arrive as JSON logs like this one:
{"CreationTime":"2018-02-12T12:32:31","UserId":"abc#gmail.com","Operation":"upload","Workload":"Drive"}
I am building a stream from the topic:
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source_user_activity = builder.stream("input");
Next I want to group by "UserId" and compute a count for each user.
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source_user_activity = builder.stream("input");
final KTable<String, Long> wordCounts = source_user_activity
.flatMap((key, value) -> {
List<KeyValue<String, String>> result = new LinkedList<>();
JSONObject valueObject = new JSONObject(value);
result.add(KeyValue.pair((valueObject.get("UserId").toString()), valueObject.toString()));
return result;
})
.groupByKey()
.count();
wordCounts.toStream().to("output",Produced.with(stringSerde, longSerde));
wordCounts.print();
Next I am consuming records from the output topic using the console consumer. I am not seeing any readable text; it's just something like this:
However wordCounts.print() shows this:
[KSTREAM-AGGREGATE-0000000003]: abc#gmail.com, (1<-null)
What am I doing wrong here? Thanks.
The value is encoded as a long (you are using the Long serde for the value), while the console consumer uses StringDeserializer by default, and thus it cannot correctly deserialize the value.
You need to specify LongDeserializer for the value via a command-line argument to the console consumer.
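For example, with the console consumer shipped with Kafka (adjust the bootstrap server and topic name to your setup):

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic output \
    --from-beginning \
    --property print.key=true \
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer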
I'm getting a compilation error when trying to write to a new topic from a KStream object.
The new topic should contain a key and a value which is a list.
Here is my code :
Serde<String> stringSerde = Serdes.String();
// Cannot create a Serde of an ArrayList
Serde<String> arrayListSerde = Serdes.String();
KStream<String, String> stats = builder.stream("parking-rows-stats");
KStream<String, List<String>> parkingCountAndTypeStream = stats
.selectKey((key, jsonRecordString) -> extractParkingName(jsonRecordString))
.map((key, value) -> new KeyValue<>(key, extractPlaceTypeAndNumber(value)));
//Compilation error in this line
parkingCountAndTypeStream.to("parking-typeandnbfreeplaces-updates", Produced.with(stringSerde,arrayListSerde));
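One possible workaround (not from the original post) is to build a Serde<List<String>> yourself, for example with Jackson, and pass it to Produced.with. A minimal sketch, assuming a kafka-clients version where Serializer/Deserializer have default configure/close methods (so they can be written as lambdas) and Jackson on the classpath; SerializationException here is org.apache.kafka.common.errors.SerializationException:

ObjectMapper mapper = new ObjectMapper();

// Serialize the list as a JSON array of strings.
Serializer<List<String>> listSerializer = (topic, list) -> {
    try {
        return list == null ? null : mapper.writeValueAsBytes(list);
    } catch (JsonProcessingException e) {
        throw new SerializationException(e);
    }
};

// Deserialize the JSON array back into a List<String>.
Deserializer<List<String>> listDeserializer = (topic, bytes) -> {
    try {
        return bytes == null ? null : mapper.readValue(bytes, new TypeReference<List<String>>() {});
    } catch (IOException e) {
        throw new SerializationException(e);
    }
};

Serde<List<String>> listSerde = Serdes.serdeFrom(listSerializer, listDeserializer);

parkingCountAndTypeStream.to("parking-typeandnbfreeplaces-updates",
        Produced.with(stringSerde, listSerde));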
I am new to Kafka/Kafka Streams. I am using the latest Kafka, kafka-streams and kafka-clients libraries with OpenJDK 11. My producer is producing JSON objects (where the key is the name) that look like:
{"Name":"John", "amount":123, "time":"2019-10-03T05:24:52"}
Producer code for better understanding:
public static ProducerRecord<String, String> newRandomTransaction(String name) {
// creates an empty json {}
ObjectNode transaction = JsonNodeFactory.instance.objectNode();
Integer amount = ThreadLocalRandom.current().nextInt(0, 100);
// Instant.now() is to get the current time
Instant now = Instant.now();
// we write the data to the json document
transaction.put("name", name);
transaction.put("amount", amount);
transaction.put("time", now.toString());
return new ProducerRecord<>("bank-transactions", name, transaction.toString());
}
Now I am trying to write my application that consumes the transactions and computes the total money in each person's balance.
(FYI: I am reusing old code and trying to make it work.)
I used groupByKey as the topic already has the right key, and then aggregate to compute the total balance, which is where I am struggling.
The application at this moment (the commented-out part is the old code that I am trying to make work in the lines that follow):
public class BankBalanceExactlyOnceApp {
private static ObjectMapper mapper = new ObjectMapper();
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "bank-balance-application");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
// we disable the cache to demonstrate all the "steps" involved in the transformation - not recommended in prod
config.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, "0");
// Exactly once processing!!
config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
// json Serde
final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);
StreamsBuilder builder = new StreamsBuilder();
KStream<String, JsonNode> bankTransactions =
builder.stream( "bank-transactions", Materialized.with(Serdes.String(), jsonSerde);
// create the initial json object for balances
ObjectNode initialBalance = JsonNodeFactory.instance.objectNode();
initialBalance.put("count", 0);
initialBalance.put("balance", 0);
initialBalance.put("time", Instant.ofEpochMilli(0L).toString());
/*KTable<String, JsonNode> bankBalance = bankTransactions
.groupByKey(Serdes.String(), jsonSerde)
.aggregate(
() -> initialBalance,
(key, transaction, balance) -> newBalance(transaction, balance),
jsonSerde,
"bank-balance-agg"
);*/
KTable<String, JsonNode> bankBalance = bankTransactions
.groupByKey(Serialized.with(Serdes.String(), jsonSerde))
.aggregate(
() -> initialBalance,
(key, transaction, balance) -> {
//String t = transaction.toString();
newBalance(transaction, balance);
},
Materialized.with(Serdes.String(), jsonSerde),
"bank-balance-agg"
);
bankBalance.toStream().to("bank-balance-exactly-once", Produced.with(Serdes.String(), jsonSerde));
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.cleanUp();
streams.start();
// print the topology
System.out.println(streams.toString());
// shutdown hook to correctly close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
private static JsonNode newBalance(JsonNode transaction, JsonNode balance) {
// create a new balance json object
ObjectNode newBalance = JsonNodeFactory.instance.objectNode();
newBalance.put("count", balance.get("count").asInt() + 1);
newBalance.put("balance", balance.get("balance").asInt() + transaction.get("amount").asInt());
Long balanceEpoch = Instant.parse(balance.get("time").asText()).toEpochMilli();
Long transactionEpoch = Instant.parse(transaction.get("time").asText()).toEpochMilli();
Instant newBalanceInstant = Instant.ofEpochMilli(Math.max(balanceEpoch, transactionEpoch));
newBalance.put("time", newBalanceInstant.toString());
return newBalance;
}
}
The issue is: when I try to call newBalance(transaction, balance) in the line:
aggregate(
() -> initialBalance,
(key, transaction, balance) -> newBalance(transaction, balance),
jsonSerde,
"bank-balance-agg"
)
I see the compiler error with the message:
newBalance(JsonNode, JsonNode) can not be applied to (<lambda parameter>,<lambda parameter>)
I tried reading it as a String and changing the parameter type from JsonNode to Object, but could not fix it.
May I get any suggestions on how to fix it?
KGroupedStream in Kafka Streams 2.3 doesn't have a method with the following signature:
<VR> KTable<K, VR> aggregate(final Initializer<VR> initializer,
final Aggregator<? super K, ? super V, VR> aggregator,
final Materialized<K, VR, KeyValueStore<Bytes, byte[]>> materialized,
String aggregateName);
There are two overloaded aggregate methods:
<VR> KTable<K, VR> aggregate(final Initializer<VR> initializer,
final Aggregator<? super K, ? super V, VR> aggregator);
<VR> KTable<K, VR> aggregate(final Initializer<VR> initializer,
final Aggregator<? super K, ? super V, VR> aggregator,
final Materialized<K, VR, KeyValueStore<Bytes, byte[]>> materialized);
You should use the second one, and your code should look something like this:
KTable<String, JsonNode> bankBalance = input
.groupByKey(Grouped.with(Serdes.String(), jsonSerde))
.aggregate(
() -> initialBalance,
(key, transaction, balance) -> newBalance(transaction, balance),
Materialized.with(Serdes.String(), jsonSerde)
);
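Note also that in your non-commented attempt the aggregator lambda has a block body without a return statement, which is a compile error on its own; if you keep the block form, it needs an explicit return:

(key, transaction, balance) -> {
    return newBalance(transaction, balance);
}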
I'm trying to query the state store to get the data in a 5-minute window. For that I'm using a tumbling window, and I have added a REST endpoint to query the data.
I have stream A, which consumes data from topic1, performs some transformations, and outputs a key-value pair to topic2.
Now in stream B I'm doing a tumbling window operation on the topic2 data. When I run the code and query via REST, I see empty data in my browser, although I can see the data flowing in the state store.
What I've observed is: when, instead of topic2 getting its data from stream A, I use a producer class to inject the data into topic2 directly, I am able to query the data from the browser. But when topic2 gets its data from stream A, I get empty data.
Here is my stream A code :
public static void main(String[] args) {
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("topic1");
KStream<String, String> output = source
.map((k,v)->
{
Map<String, Object> Fields = new LinkedHashMap<>();
Fields.put("FNAME","ABC");
Fields.put("LNAME","XYZ");
Map<String, Object> nFields = new LinkedHashMap<>();
nFields.put("ADDRESS1","HY");
nFields.put("ADDRESS2","BA");
nFields.put("addF",Fields);
Map<String, Object> eve = new LinkedHashMap<>();
eve.put("nFields", nFields);
Map<String, Object> fevent = new LinkedHashMap<>();
fevent.put("eve", eve);
LinkedHashMap<String, Object> newMap = new LinkedHashMap<>(fevent);
return new KeyValue<>("JAY1234",newMap.toString());
});
output.to("topic2");
}
Here is my stream B code (where the tumbling window operation happens):
public static void main(String[] args) {
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> eventStream = builder.stream("topic2");
eventStream.groupByKey()
.windowedBy(TimeWindows.of(300000))
.reduce((v1, v2) -> v1 + ";" + v2, Materialized.as("TumblingWindowPoc"));
final Topology topology = builder.build();
KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
}
REST code:
@GET
@Path("/{storeName}/{key}")
@Produces(MediaType.APPLICATION_JSON)
public List<KeyValue<String, String>> windowedByKey(@PathParam("storeName") final String storeName,
                                                    @PathParam("key") final String key) {
final ReadOnlyWindowStore<String, String> store = streams.store(storeName,
QueryableStoreTypes.<String, String>windowStore());
if (store == null) {
throw new NotFoundException(); }
long timeTo = System.currentTimeMillis();
long timeFrom = timeTo - 30000;
final WindowStoreIterator<String> results = store.fetch(key, timeFrom, timeTo);
final List<KeyValue<String,String>> windowResults = new ArrayList<>();
while (results.hasNext()) {
final KeyValue<Long, String> next = results.next();
windowResults.add(new KeyValue<String,String>(key + "#" + next.key, next.value));
}
return windowResults;
}
And this is what my key-value data looks like:
JAY1234 {eve = {nFields = {ADDRESS1 = HY,ADDRESS2 = BA,Fields = {FNAME = ABC,LNAME = XYZ,}}}}
I should be able to get the data when querying using REST. Any help is greatly appreciated.
Thanks!
To fetch a window, timeFrom should be at or before the window start. So if you want the data for the last 30 seconds, you can subtract the window duration when fetching, e.g. timeFrom = timeTo - 30000 - 300000, and then filter the required events out of the whole window data.
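A sketch of the adjusted fetch, based on the REST snippet above (the 300000 ms window size comes from stream B; how you filter the results afterwards is up to you):

long windowSizeMs = 300000;                       // tumbling window size used in stream B
long timeTo = System.currentTimeMillis();
long timeFrom = timeTo - 30000 - windowSizeMs;    // reach back far enough to cover the window start

final WindowStoreIterator<String> results = store.fetch(key, timeFrom, timeTo);
final List<KeyValue<String, String>> windowResults = new ArrayList<>();
while (results.hasNext()) {
    final KeyValue<Long, String> next = results.next();   // next.key is the window start timestamp
    // filter here if you only want certain windows or events
    windowResults.add(new KeyValue<>(key + "#" + next.key, next.value));
}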
I am currently trying to use a KStream-to-KTable join to enrich a Kafka topic. For my proof of concept, I currently have a KStream with about 600,000 records which all have the same key, and a KTable created from a topic with one key-value pair, where the key in the KTable topic matches the key of the 600,000 records in the topic the KStream is created from.
When I use a left join (via the code below), the ValueJoiner receives a NULL table value for all of the records.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe-json-parse-" + System.currentTimeMillis());
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "xxx.xx.xx.xxx:9092");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, "org.apache.kafka.streams.processor.WallclockTimestampExtractor");
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 5);
final StreamsBuilder builder = new StreamsBuilder();
// Build a Kafka Stream from the Netcool Input Topic
KStream<String, String> source = builder.stream("output-100k");
// Join the KStream to the KTable
KStream<String, String> enriched_output = source
.leftJoin(netcool_enrichment, (orig_msg, description) -> {
String new_msg = jsonEnricher(orig_msg, description);
if (description != null) {
System.out.println("\n[DEBUG] Enriched Input Orig: " + orig_msg);
System.out.println("[DEBUG] Enriched Input Desc: " + description);
System.out.println("[DEBUG] Enriched Output: " + new_msg);
}
return new_msg;
});
Here is a sample output record (using a forEach loop) from the source KStream:
[KSTREAM] Key: ismlogs
[KSTREAM] Value: {"severity":"debug","ingested_timestamp":"2018-07-18T19:32:47.227Z","#timestamp":"2018-06-28T23:36:31.000Z","offset":482,"#metadata":{"beat":"filebeat","topic":"input-100k","type":"doc","version":"6.2.2"},"beat":{"hostname":"abc.dec.com","name":"abc.dec.com","version":"6.2.2"},"source":"/root/100k-raw.txt","message":"Thu Jun 28 23:36:31 2018 Debug: Checking status of file /ism/profiles/active/test.xml","key":"ismlogs","tags":["ismlogs"]}
I have tried converting the KTable back to a KStream and using a forEach loop over the converted stream to verify that the records are actually there in the KTable.
KTable<String, String> enrichment = builder.table("enrichment");
KStream<String, String> ktable_debug = enrichment.toStream();
ktable_debug.foreach(new ForeachAction<String, String>() {
public void apply(String key, String value) {
System.out.println("[KTABLE] Key: " + key);
System.out.println("[KTABLE] Value: " + value);
}
});
The code above outputs:
[KTABLE] Key: "ismlogs"
[KTABLE] Value: "ISM Logs"
According to your console messages, the keys are different, and therefore they won't join:
[KSTREAM] Key: ismlogs
[KTABLE] Key: "ismlogs"
In the case of the KTable, the key is actually "ismlogs" with the double-quotes.
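A minimal sketch of one workaround (not from the original answer): re-key the stream so it matches the JSON-quoted key in the table before joining. The cleaner fix is to produce the enrichment topic with a plain, unquoted string key in the first place.

// Hypothetical fix: add the literal quotes to the stream key so it matches the table key.
// Note: selectKey changes the key, so Streams will repartition the stream before the join.
KStream<String, String> rekeyedSource = source.selectKey((key, value) -> "\"" + key + "\"");

KStream<String, String> enriched_output = rekeyedSource
        .leftJoin(netcool_enrichment, (orig_msg, description) -> jsonEnricher(orig_msg, description));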
I'm trying to use SessionWindows in my aggregation function in Kafka (0.11) but cannot comprehend why I get errors.
Here is my code snippet:
// defining some values:
public static final Integer SESSION_TIMEOUT_MS = 6000000;
public static final String INTOPIC = "input";
public static final String HOST = "host";
// setting up serdes:
final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);
// some more code to build up the streams
KStreamBuilder builder = new KStreamBuilder();
KStream<String, JsonNode> dataStream = builder.stream(Serdes.String(), jsonSerde, INTOPIC);
// constructing the initialMessage ObjectNode:
ObjectNode initialMessage = JsonNodeFactory.instance.objectNode();
initialMessage.put("count", 0);
initialMessage.put("endTime", "");
// transforming data to KGroupedStream<String,JsonNode>
KGroupedStream<String, JsonNode> data = dataStream
        .map((key, value) -> new KeyValue<>(value.get(HOST).asText(), value))
        .groupByKey(Serdes.String(), jsonSerde);
// finally aggregate the data using SessionWindows
KTable<Windowed<String>, JsonNode> aggregatedData = data.aggregate(
() -> initialMessage,
(key, incomingMessage, initialMessage) -> countData(incomingMessage, initialMessage),
SessionWindows.with(SESSION_TIMEOUT_MS),
jsonSerde,
"aggregated-data");
private static JsonNode countData(JsonNode incomingMessage, JsonNode initialMessage){
// some dataprocessing
}
When I change
KTable<Windowed<String>,JsonNode>
to
KTable<String, JsonNode>
and remove
SessionWindows.with(SESSION_TIMEOUT_MS)
from the aggregate function, everything is ok.
If I don't, Eclipse tells me for the line
KTable<Windowed<String>, JsonNode> aggregatedData = data.aggregate( [...])
The method aggregate(Initializer, Aggregator, Windows, Serde, String) in the type KGroupedStream is not applicable for the arguments (() -> {}, ( key, incomingMessage, initialMessage) -> {}, SessionWindows, Serde, String)
and for the line
() -> initialMessage
Type mismatch: cannot convert from ObjectNode to VR
and:
(key, incomingMessage, initialMessage) -> countData(incomingMessage, initialMessage),
The method countData(JsonNode, JsonNode) in the type DataWindowed is not applicable for the arguments (JsonNode, VR)
I really don't see where the types get lost!
Any hint would be great!
Thx :D
I really needed to implement a Merger:
Merger<? super String, JsonNode> tmpMerger = new MergerClass<String, JsonNode>();
and add it to the aggregate function:
KTable<Windowed<String>, JsonNode> aggregatedData = data.aggregate(
() -> initialMessage,
(key, incomingMessage, initialMessage) -> countData(incomingMessage, initialMessage),
tmpMerger,
SessionWindows.with(SESSION_TIMEOUT_MS),
jsonSerde,
"aggregated-data");