Many-to-One records Kafka Streams - java

I would like to turn many records into one per message. I have tried several approaches, including custom reducers and aggregators, but they all still emit one output record per input record. For example, I would like to convert many strings into just one string. If my stream contains messages with the same key but different values, "the", "sky", "is", "blue", then I would like to output a single concatenation of them to a new topic: "the,sky,is,blue,". What I instead get is four messages: "the,", "the,sky,", "the,sky,is,", "the,sky,is,blue,". When I send a second message to the topic, the aggregation concatenates onto the previous one and I eventually receive "the,sky,is,blue,the,sky,is,blue,".
I also tried using a custom StoreBuilder and changing a number of settings to see if that would make a difference.
Map<String, String> changelogConfig = new HashMap<>();
changelogConfig.put("message.down.conversion.enable", "true");
changelogConfig.put("flush.messages", "0");
changelogConfig.put("flush.ms", "0");

StoreBuilder<KeyValueStore<String, String>> aggStoreSupplier = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("AggStore"),
        Serdes.String(),
        Serdes.String())
    .withLoggingEnabled(changelogConfig);
KStream<String, String> results = source // single messages get processed and eventually produce the string results I need to concatenate
    .groupByKey() // this KGroupedStream has the N records that were sent in the message
    .reduce(new Reducer<String>() {
        @Override
        public String apply(String aggValue, String value) {
            return value + "," + aggValue;
        }
    }, Materialized.as("AggStore"))
    .toStream();
results.to("results", Produced.with(Serdes.String(), Serdes.String()));
final Topology topology = builder.build(); // to describe topology
System.out.println(topology.describe()); // to print description

final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);

// attach shutdown handler to catch Ctrl-C
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
    @Override
    public void run() {
        streams.close();
        latch.countDown();
    }
});

try {
    streams.cleanUp();
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}
System.exit(0);
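One way to get a single output record per key (assuming the goal is one final concatenation once input for that key stops arriving) is to window the aggregation and suppress intermediate updates. The sketch below only illustrates that idea; the inactivity gap and the combination of SessionWindows with suppress() (Kafka Streams 2.1+) are assumptions, not code from the question:
KStream<String, String> finalResults = source
    .groupByKey()
    .windowedBy(SessionWindows.with(Duration.ofSeconds(10)).grace(Duration.ZERO))
    .reduce((aggValue, value) -> aggValue + "," + value)
    // emit only the final result per session window instead of every intermediate update
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .map((windowedKey, value) -> KeyValue.pair(windowedKey.key(), value));
finalResults.to("results", Produced.with(Serdes.String(), Serdes.String()));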

Related

ProcessorContext schedule only executed once

I'm using Apache Kafka Streams and I added a transform to my stream:
final StreamsBuilder streamsBuilder = new StreamsBuilder();

final StoreBuilder<KeyValueStore<String, byte[]>> correlationStore =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(STORE_NAME),
        Serdes.String(),
        Serdes.ByteArray());

streamsBuilder.addStateStore(correlationStore);

streamsBuilder.stream(topicName, inputConsumed)
    .peek(InboundPendingMessageStreamer::logEntries)
    .transform(() -> new CleanerTransformer<String, byte[], KeyValue<String, byte[]>>(Duration.ofMillis(5000), STORE_NAME), STORE_NAME)
    .toTable();
I'm having difficulty understanding the CleanerTransformer class that I created, where in the init method I set up a schedule with a scanFrequency and a PunctuationType.
@Override
public void init(ProcessorContext context) {
    this.stateStore = context.getStateStore(purgeStoreName);
    context.schedule(scanFrequency, PunctuationType.STREAM_TIME, timestamp -> {
        try (final KeyValueIterator<K, byte[]> all = stateStore.all()) {
            while (all.hasNext()) {
                final var headers = context.headers();
                final KeyValue<K, byte[]> record = all.next();
            }
        }
    });
}
When I add an event to the stream, I see the message from the schedule callback, but it is only executed once.
My understanding was that it should be executed at every interval configured by scanFrequency.
Any idea what I'm doing wrong here?
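A side note on the behavior described here: STREAM_TIME punctuations only fire when stream time advances, and stream time only advances when new records arrive. With a single test event, stream time never moves forward again, so the callback runs at most once. A minimal sketch of the same init method using wall-clock time instead, assuming the intent is a fixed real-time schedule:
@Override
public void init(ProcessorContext context) {
    this.stateStore = context.getStateStore(purgeStoreName);
    // WALL_CLOCK_TIME fires on a real-time schedule even when no new records arrive
    context.schedule(scanFrequency, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
        try (final KeyValueIterator<K, byte[]> all = stateStore.all()) {
            while (all.hasNext()) {
                final KeyValue<K, byte[]> entry = all.next();
                // inspect or purge entries here
            }
        }
    });
}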

KafkaConsumer with Multithreading

I created the KafkaConsumer below, which takes topicName, partitionNo, beginOffset and endOffset as parameters. With this logic I can only process one partition at a time, because KafkaConsumer is not thread safe. If I want to cover all 20 partitions it takes a long time. So how can I implement the KafkaConsumer with multiple threads so that I can search all partitions at the same time?
"I have a topic with 20 partitions that holds employee data. From a UI search screen I pass an employee number and birth date, and I want to search all 20 partitions to find whether a particular employee's data is there or not. If it matches, I want to put it in a separate list and download it as a file."
public List<String> searchMessages(String topicName, int partitionNo, long beginOffset, long endOffset) {
    List<String> filteredMessages = new ArrayList<>();
    TopicPartition tp = new TopicPartition(topicName, partitionNo);
    Properties clusterOneProps = kafkaConsumerConfig.getConsumerProperties();
    KafkaConsumer<String, Object> consumer = new KafkaConsumer<>(clusterOneProps);
    try {
        consumer.subscribe(Collections.singletonList(topicName), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // TODO Auto-generated method stub
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                consumer.seek(tp, beginOffset);
            }
        });
        Thread.sleep(100);
        boolean flag = true;
        System.out.println("search started...... from offset " + beginOffset);
        while (flag) {
            ConsumerRecords<String, Object> crs = consumer.poll(Duration.ofMillis(100L));
            for (ConsumerRecord<String, Object> record : crs) {
                // search criteria
                if (record.value().toString().contains("01111") && record.value().toString().contains("2021-11-06")) {
                    System.out.println("found at offset " + record.offset());
                    filteredMessages.add(record.value().toString());
                }
                if (record.offset() == endOffset) {
                    flag = false;
                    break;
                }
            }
        }
        System.out.println("done");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        consumer.close();
    }
    return filteredMessages;
}
You need to use the Kafka Parallel Consumer library. Check the library here, and this blog post.
It's possible to 'simulate' parallel consumption with the normal consumer (by having multiple consumers), but you have to hand-roll a good amount of code. This blog post explains that approach (a rough sketch follows below), but I recommend using the Parallel Consumer.
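A rough sketch of the hand-rolled approach, under these assumptions: one KafkaConsumer per thread (KafkaConsumer is not thread safe), manual partition assignment via assign()/seek() instead of subscribe(), and hypothetical beginOffsets/endOffsets maps plus a searchPartition helper that reuses the poll loop from the question:
ExecutorService pool = Executors.newFixedThreadPool(20);
List<Future<List<String>>> futures = new ArrayList<>();
for (int partition = 0; partition < 20; partition++) {
    final int p = partition;
    futures.add(pool.submit(() -> {
        // each task owns its own consumer; a KafkaConsumer must never be shared across threads
        try (KafkaConsumer<String, Object> consumer =
                     new KafkaConsumer<>(kafkaConsumerConfig.getConsumerProperties())) {
            TopicPartition tp = new TopicPartition(topicName, p);
            consumer.assign(Collections.singletonList(tp)); // no group rebalancing involved
            consumer.seek(tp, beginOffsets.get(p));
            return searchPartition(consumer, tp, endOffsets.get(p)); // hypothetical: same poll loop as above
        }
    }));
}
List<String> allMatches = new ArrayList<>();
for (Future<List<String>> f : futures) {
    allMatches.addAll(f.get()); // blocks until each partition search completes
}
pool.shutdown();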

Manually acknowledge Kafka Event A consuming after producing event B

I have a case where I have to consume event A, do some processing, and then produce event B. My concern is what happens if the processing crashes and the application cannot produce B after it has already consumed A. My approach is to acknowledge only after successfully publishing B. Is that correct, or should I implement another solution for this case?
@KafkaListener(
    id = TOPIC_ID,
    topics = TOPIC_ID,
    groupId = GROUP_ID,
    containerFactory = LISTENER_CONTAINER_FACTORY
)
public void listen(List<Message<A>> messages, Acknowledgment acknowledgment) {
    try {
        final AEvent aEvent = messages.stream()
                .filter(message -> null != message.getPayload())
                .map(Message::getPayload)
                .findFirst()
                .get();
        processDao.doSomeProcessing() // returns a Mono<Example> by calling an external API
                .subscribe(
                    response -> {
                        ProducerRecord<String, BEvent> BEventRecord = new ProducerRecord<>(TOPIC_ID, null, BEvent);
                        ListenableFuture<SendResult<String, BEvent>> future = kafkaProducerTemplate.send(buildBEvent());
                        future.addCallback(new ListenableFutureCallback<SendResult<String, BEvent>>() {
                            @Override
                            public void onSuccess(SendResult<String, BEvent> BEventSendResult) {
                                // TODO: do when event published successfully
                            }
                            @Override
                            public void onFailure(Throwable exception) {
                                exception.printStackTrace();
                                throw new ExampleException();
                            }
                        });
                    },
                    error -> {
                        error.printStackTrace();
                        throw new ExampleException();
                    }
                );
        acknowledgment.acknowledge(); // ??
    } catch (ExampleException exception) {
        exception.printStackTrace();
    }
}
You can't manage Kafka "acknowledgments" when using async code such as Reactor.
Kafka does not manage discrete acks for each record in a topic/partition, just the last committed offset for the partition.
If you process two records asynchronously, you will have a race as to which offset will be committed first.
You need to perform the sends on the listener container thread to maintain proper ordering.
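A minimal sketch of what that suggests, reusing the question's names (doSomeProcessing, buildBEvent, kafkaProducerTemplate) and assuming blocking is acceptable: block on the Mono and on the send future so everything stays on the listener container thread, and acknowledge only after the send succeeds:
public void listen(List<Message<A>> messages, Acknowledgment acknowledgment) {
    // blocking keeps the processing and the send on the listener container thread
    Example response = processDao.doSomeProcessing().block();
    try {
        kafkaProducerTemplate.send(buildBEvent()).get(30, TimeUnit.SECONDS); // wait for the broker to ack B
        acknowledgment.acknowledge(); // commit the offset of A only after B is published
    } catch (Exception e) {
        // not acknowledging (and rethrowing) means A is redelivered via the container's error handling
        throw new RuntimeException(e);
    }
}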

Process and check event using kafka-streams during some period

I have a KStream eventsStream, which gets its data from the topic "events".
There are two types of events; their keys look like this:
1. {user_id = X, event_id = 1} {..value, including time_event...}
2. {user_id = X, event_id = 2} {..value, including time_event...}
I need to move events with event_id = 1 to the topic "results" if, within 10 minutes, the same user has not produced an event with event_id = 2.
For example:
1. First case: we get {user_id = 100, event_id = 1} {.. time_event = xxxx ...} and no event {user_id = 100, event_id = 2} {.. time_event = xxxx + 10 minutes...} within 10 minutes, so we write it to the results topic.
2. Second case: we get {user_id = 100, event_id = 1} {.. time_event = xxxx ...} and an event {user_id = 100, event_id = 2} {.. time_event = xxxx + 5 minutes...} within 10 minutes, so we do not write it to the results topic.
How can this behavior be implemented in Java code using kafka-streams?
My code:
public class ResultStream {
    public static KafkaStreams newStream() {
        Properties properties = Config.getProperties("ResultStream");
        Serde<String> stringSerde = Serdes.String();

        StreamsBuilder builder = new StreamsBuilder();
        StoreBuilder<KeyValueStore<String, String>> store =
                Stores.keyValueStoreBuilder(
                        Stores.inMemoryKeyValueStore("inmemory"),
                        stringSerde,
                        stringSerde
                );
        builder.addStateStore(store);

        KStream<String, String> resourceEventStream = builder.stream(EVENTS.topicName(), Consumed.with(stringSerde, stringSerde));
        resourceEventStream.print(Printed.toSysOut());

        resourceEventStream.process(() -> new CashProcessor("inmemory"), "inmemory");
        resourceEventStream.process(() -> new FilterProcessor("inmemory", resourceEventStream), "inmemory");

        Topology topology = builder.build();
        return new KafkaStreams(topology, properties);
    }
}
public class FilterProcessor implements Processor {
    private ProcessorContext context;
    private String eventStoreName;
    private KeyValueStore<String, String> eventStore;
    private KStream<String, String> stream;

    public FilterProcessor(String eventStoreName, KStream<String, String> stream) {
        this.eventStoreName = eventStoreName;
        this.stream = stream;
    }

    @Override
    public void init(ProcessorContext processorContext) {
        this.context = processorContext;
        eventStore = (KeyValueStore) processorContext.getStateStore(eventStoreName);
    }

    @Override
    public void process(Object key, Object value) {
        this.context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            System.out.println("Scheduler is working");
            stream.filter((k, v) -> {
                JsonObject events = new Gson().fromJson(k, JsonObject.class);
                if (***condition***) {
                    return true;
                }
                return false;
            }).to("results");
        });
    }

    @Override
    public void close() {
    }
}
CashProcessor's role is only to put events into the local store, and to delete the record with event_id = 1 for a user when an event_id = 2 arrives for the same user.
FilterProcessor should filter events using the local store every minute, but I can't get this processing invoked correctly (the way I'm doing it now)...
I really need help.
Why do you pass KStream into your processor? That is not how the DSL works.
As you "connect" your processors via resourceEventStream.process() already, your FilterProcessor#process(key, value) method will be called for each record in the stream automatically -- however, a KStream#process() is a terminal operation and thus does not allow you to send any data downstream. Instead, you might want to use transform() (that is basically the same as process() plus an output KStream).
To actually forward data downstream from your punctuation, you should use context.forward() with the ProcessorContext that is provided via the init() method.
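A rough sketch of that suggestion (the ExpiryTransformer class name and the extractTimeEvent helper are hypothetical): a single transform() that stores event_id = 1 records, deletes them when the matching event_id = 2 arrives, and forwards entries older than 10 minutes from a punctuation via context.forward():
public class ExpiryTransformer implements Transformer<String, String, KeyValue<String, String>> {
    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("inmemory");
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    long storedAt = extractTimeEvent(entry.value); // hypothetical helper parsing time_event
                    if (timestamp - storedAt >= Duration.ofMinutes(10).toMillis()) {
                        context.forward(entry.key, entry.value); // goes downstream, e.g. to the "results" sink
                        store.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        // store event_id = 1 entries; delete the stored entry when event_id = 2 arrives for the same user;
        // nothing is emitted synchronously, only from the punctuation above
        return null;
    }

    @Override
    public void close() {
    }
}
The stream would then be wired with something like resourceEventStream.transform(() -> new ExpiryTransformer(), "inmemory").to("results", Produced.with(stringSerde, stringSerde)) instead of the two process() calls.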

Can't seem to transform KStream<A,B> to KTable<X,Y>

This is my first attempt at using a KTable. I have a Kafka stream that contains Avro-serialized objects of type A, B, and this works fine: I can write a consumer that consumes them just fine, or a simple KStream that simply counts records.
The B object has a field containing a country code. I'd like to supply that code to a KTable so it can count the number of records that contain a particular country code. To do so I'm trying to convert the stream into a stream of X, Y (or really: country-code, count). Eventually I look at the contents of the table and extract an array of KV pairs.
The code I have (included below) always errors out with the following (see the line starting with 'Caused by'):
2018-07-26 13:42:48.688 [com.findology.tools.controller.TestEventGeneratorController-16d7cd06-4742-402e-a679-898b9ef78c41-StreamThread-1; AssignedStreamsTasks] ERROR -- stream-thread [com.findology.tools.controller.TestEventGeneratorController-16d7cd06-4742-402e-a679-898b9ef78c41-StreamThread-1] Failed to process stream task 0_0 due to the following error:
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=com.findology.model.traffic.CpaTrackingCallback, partition=0, offset=962649
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:240)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:922)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:802)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (key: org.apache.kafka.common.serialization.ByteArraySerializer / value: org.apache.kafka.common.serialization.ByteArraySerializer) is not compatible to the actual key or value type (key type: java.lang.Integer / value type: java.lang.Integer). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:92)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.kstream.internals.KStreamFilter$KStreamFilterProcessor.process(KStreamFilter.java:43)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:211)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:124)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.kstream.internals.KStreamTransform$KStreamTransformProcessor.process(KStreamTransform.java:59)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:211)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:124)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:80)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:224)
... 6 more
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to [B
at org.apache.kafka.common.serialization.ByteArraySerializer.serialize(ByteArraySerializer.java:21)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:146)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:94)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:87)
... 19 more
And here is the code I'm using. I've omitted certain classes for brevity. Note that I'm not using the Confluent KafkaAvro classes.
private synchronized void createStreamProcessor2() {
    if (streams == null) {
        try {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, getClass().getName());
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            StreamsConfig config = new StreamsConfig(props);

            StreamsBuilder builder = new StreamsBuilder();
            Map<String, Object> serdeProps = new HashMap<>();
            serdeProps.put("schema.registry.url", schemaRegistryURL);
            AvroSerde<CpaTrackingCallback> cpaTrackingCallbackAvroSerde = new AvroSerde<>(schemaRegistryURL);
            cpaTrackingCallbackAvroSerde.configure(serdeProps, false);

            // This is the key to telling kafka the specific Serde instance to use
            // to deserialize the Avro encoded value
            KStream<Long, CpaTrackingCallback> stream = builder.stream(CpaTrackingCallback.class.getName(),
                    Consumed.with(Serdes.Long(), cpaTrackingCallbackAvroSerde));

            // provide a way to convert CpaTrackingCallback info into just country codes
            // (Long, CpaTrackingCallback) -> (countryCode:Integer, placeHolder:Long)
            TransformerSupplier<Long, CpaTrackingCallback, KeyValue<Integer, Long>> transformer = new TransformerSupplier<Long, CpaTrackingCallback, KeyValue<Integer, Long>>() {
                @Override
                public Transformer<Long, CpaTrackingCallback, KeyValue<Integer, Long>> get() {
                    return new Transformer<Long, CpaTrackingCallback, KeyValue<Integer, Long>>() {
                        @Override
                        public void init(ProcessorContext context) {
                            // Not doing Punctuate so no need to store context
                        }

                        @Override
                        public KeyValue<Integer, Long> transform(Long key, CpaTrackingCallback value) {
                            return new KeyValue(value.getCountryCode(), 1);
                        }

                        @Override
                        public KeyValue<Integer, Long> punctuate(long timestamp) {
                            return null;
                        }

                        @Override
                        public void close() {
                        }
                    };
                }
            };

            KTable<Integer, Long> countryCounts = stream.transform(transformer).groupByKey() //
                    .count(Materialized.as("country-counts"));

            streams = new KafkaStreams(builder.build(), config);
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
            streams.cleanUp();
            streams.start();

            try {
                countryCountsView = waitUntilStoreIsQueryable("country-counts", QueryableStoreTypes.keyValueStore(),
                        streams);
            }
            catch (InterruptedException e) {
                log.warn("Interrupted while waiting for query store to become available", e);
            }
        }
        catch (Exception e) {
            log.error(e);
        }
    }
}
The bare groupByKey() method on KStream uses the default serializer/deserializer (which you haven't set). Use the method groupByKey(Serialized<K,V> serialized), as in:
.groupByKey(Serialized.with(Serdes.Integer(), Serdes.Long()))
Also note that what you do in your custom TransformerSupplier can be done simply with a KStream.map call.
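Putting both suggestions together, a sketch of what the relevant part of the topology could look like (Serialized matches the Kafka Streams version used in the question; newer releases use Grouped.with instead):
KTable<Integer, Long> countryCounts = stream
    .map((key, value) -> KeyValue.pair(value.getCountryCode(), 1L)) // replaces the TransformerSupplier
    .groupByKey(Serialized.with(Serdes.Integer(), Serdes.Long()))   // explicit serdes for the repartition topic
    .count(Materialized.as("country-counts"));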
