Way to read data from Kafka headers in Apache Flink - java

I have a project where I am consuming data from Kafka. Apparently, there are a couple of fields that will be included in the message headers that I will need to read as well for each message. Is there currently a way to do this in Flink?
Thanks!

@Jicaar, Kafka has actually had the Header notion since version 0.11.0.0: https://issues.apache.org/jira/browse/KAFKA-4208
The problem is that flink-connector-kafka-0.11_2.11, which ships with flink-1.4.0 and supposedly supports kafka-0.11.0.0, simply ignores message headers when reading from Kafka.
So unfortunately there is no way to read those headers unless you implement your own KafkaConsumer in Flink.
I'm also interested in reading Kafka message headers and hope the Flink team will add support for this.

I faced similar issue and found a way to do this in Flink 1.8. Here is what I wrote:
FlinkKafkaConsumer<ObjectNode> consumer = new FlinkKafkaConsumer<>("topic", new JSONKeyValueDeserializationSchema(true) {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public ObjectNode deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        ObjectNode result = super.deserialize(record);
        if (record.headers() != null) {
            // Collect every header into a map, converting its byte[] value to a JsonNode
            Map<String, JsonNode> headers = StreamSupport.stream(record.headers().spliterator(), false)
                    .collect(Collectors.toMap(
                            h -> h.key(),
                            h -> mapper.convertValue(new String(h.value()), JsonNode.class)));
            result.set("headers", mapper.convertValue(headers, JsonNode.class));
        }
        return result;
    }
}, kafkaProps);
Hope this helps!
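For completeness, a minimal sketch of plugging that consumer into a job; the execution environment setup and names here are assumptions, not part of the original answer:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Each element is the usual JSONKeyValueDeserializationSchema output (key/value/metadata)
// plus the extra "headers" field attached in the overridden deserialize() above.
DataStream<ObjectNode> withHeaders = env.addSource(consumer);
withHeaders.print();
env.execute("read-kafka-headers");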

Here's the code for newer versions of Flink, using the KafkaSource API.
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers(ParameterConfig.parameters.getRequired(ParameterConstant.KAFKA_ADDRESS))
        .setTopics(ParameterConfig.parameters.getRequired(ParameterConstant.KAFKA_SOURCE_TOPICS))
        .setGroupId(ParameterConfig.parameters.getRequired(ParameterConstant.KAFKA_SOURCE_GROUPID))
        .setStartingOffsets(OffsetsInitializer.latest())
        .setDeserializer(new KafkaRecordDeserializationSchema<String>() {
            @Override
            public void deserialize(ConsumerRecord<byte[], byte[]> consumerRecord, Collector<String> collector) {
                try {
                    // Collect all headers into a map and emit them as a JSON string
                    Map<String, String> headers = StreamSupport
                            .stream(consumerRecord.headers().spliterator(), false)
                            .collect(Collectors.toMap(Header::key, h -> new String(h.value())));
                    collector.collect(new JSONObject(headers).toString());
                } catch (Exception e) {
                    log.error("Headers not found in Kafka stream for consumer record: {}", consumerRecord, e);
                }
            }

            @Override
            public TypeInformation<String> getProducedType() {
                return TypeInformation.of(new TypeHint<>() {});
            }
        })
        .build();
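A minimal sketch of attaching that source to a job, assuming Flink 1.12+ where KafkaSource and env.fromSource are available; the job and source names are placeholders:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Each element is the JSON string of headers produced by the deserializer above
DataStream<String> headerStream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-headers-source");
headerStream.print();
env.execute("kafka-headers-job");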

Related

Apache Kafka - Implementing a KTable and producing event using CloudEvent

I have an implementation related to KTable that uses CloudEvents to produce events, but for some unknown reason the event produced from the KTable is not formatted as a CloudEvent. The implementation is as below:
public void initKafkaStream() {
    StreamsBuilder streamsBuilder = new StreamsBuilder();
    PojoCloudEventDataMapper<TicketEvent> ticketEventMapper = PojoCloudEventDataMapper.from(objectMapper, TicketEvent.class);
    KStream<String, CloudEvent> rawTicketStream = streamsBuilder.stream(rawTicketEvent, Consumed.with(Serdes.String(), cloudEventSerde));

    rawTicketStream
            .mapValues(e -> convertToPojo(e, ticketEventMapper))
            .filter((k, v) -> v != null)
            .groupByKey()
            .aggregate(
                    AggregatedTicketEvent::new,
                    (key, val, agg) -> doAggregation(agg, val),
                    Materialized
                            .<String, AggregatedTicketEvent, KeyValueStore<Bytes, byte[]>>as("aggregatedTicket")
                            .withValueSerde(aggregatedTicketEventSerde)
                            .withLoggingDisabled()
            )
            .mapValues(result -> {
                try {
                    return CloudEventBuilder.v1()
                            .withId(UUID.randomUUID().toString())
                            .withType("ticket_update")
                            .withSource(sourceTemplate.expand(result.getCurrent().getId()))
                            .withTime(result.getMeta().getOccurredAt())
                            .withData(objectMapper.writeValueAsBytes(result))
                            .withDataContentType("application/json")
                            .build();
                } catch (JsonProcessingException e) {
                    throw new RuntimeException(e);
                }
            })
            .toStream()
            .to(aggregatedTicketEvent, Produced.with(Serdes.String(), cloudEventSerde));

    streams = new KafkaStreams(streamsBuilder.build(streamsConfig), streamsConfig);
    streams.setUncaughtExceptionHandler(ex -> StreamThreadExceptionResponse.REPLACE_THREAD);
    streams.start();
}
Has anyone had such an issue?
Thanks in advance
The issue was that the configuration in props was being overwritten by Kafka Streams in the serializer/deserializer, which by default sets the format to Encoding.BINARY. With binary encoding, the CloudEvents format is present only in the headers instead of in the payload. To make sure the serializers have the correct configuration, I configured the CloudEventSerializer and CloudEventDeserializer explicitly. In this case, the Serdes.serdeFrom() will look like this:
Map<String, Object> ceSerializerConfigs = new HashMap<>();
ceSerializerConfigs.put(ENCODING_CONFIG, Encoding.STRUCTURED);
ceSerializerConfigs.put(EVENT_FORMAT_CONFIG, JsonFormat.CONTENT_TYPE);
CloudEventSerializer serializer = new CloudEventSerializer();
serializer.configure(ceSerializerConfigs, false);
CloudEventDeserializer deserializer = new CloudEventDeserializer();
deserializer.configure(ceSerializerConfigs, false);
this.cloudEventSerde = Serdes.serdeFrom(serializer, deserializer);
In order to get the CloudEvents format in the JSON payload, we have to use Encoding.STRUCTURED with the JSON content type; this puts the full event in the payload instead of only in the headers.
Hope this will help someone who's struggling with this issue!
Best,

Many-to-One records Kafka Streams

I would like to turn many records into one per message. I tried many things like custom reducers and aggregators, but they all still send records back out one-to-one. For example, I would like to convert many strings into just one string. If my stream contains messages with the same key but different values, "the", "sky", "is", "blue", then I would like to output one concatenation of them to a new topic: "the,sky,is,blue,". What I am instead getting is 4 messages: "the,", "the,sky,", "the,sky,is,", "the,sky,is,blue,". When I send a second message to the Kafka consumer, it concatenates onto the previous aggregation and I eventually receive "the,sky,is,blue,the,sky,is,blue,".
I also tried using a custom StoreBuilder and changing a lot of the settings to see if that would do anything.
Map<String, String> changelogConfig = new HashMap<>();
changelogConfig.put("message.down.conversion.enable", "true");
changelogConfig.put("flush.messages", "0");
changelogConfig.put("flush.ms", "0");

StoreBuilder<KeyValueStore<String, String>> aggStoreSupplier = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("AggStore"),
        Serdes.String(),
        Serdes.String())
        .withLoggingEnabled(changelogConfig);

// single messages get processed and eventually I get these string results I need to concatenate
KStream<String, String> results = source
        .groupByKey() // this KGroupedStream has the N records that were sent in the message
        .reduce(new Reducer<String>() {
            @Override
            public String apply(String aggValue, String value) {
                return value + "," + aggValue;
            }
        }, Materialized.as("AggStore"))
        .toStream();
results.to("results", Produced.with(Serdes.String(), Serdes.String()));

final Topology topology = builder.build(); // to describe the topology
System.out.println(topology.describe());  // print the description

final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);

// attach shutdown handler to catch Ctrl-C
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
    @Override
    public void run() {
        streams.close();
        latch.countDown();
    }
});

try {
    streams.cleanUp();
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}
System.exit(0);
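For reference, one known way to damp those per-record intermediate updates (not discussed in this thread) is KTable.suppress, available since Kafka Streams 2.1. A rough sketch reusing the store and serdes from above, with imports java.time.Duration and org.apache.kafka.streams.kstream.Suppressed assumed:
KStream<String, String> results = source
        .groupByKey()
        .reduce((aggValue, value) -> value + "," + aggValue, Materialized.as("AggStore"))
        // Buffer updates per key and emit at most one result every 10 seconds,
        // instead of one downstream update per incoming record.
        .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(10), Suppressed.BufferConfig.unbounded()))
        .toStream();
results.to("results", Produced.with(Serdes.String(), Serdes.String()));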

How to skip corrupt (non-serializable) messages in Spring Kafka Consumer?

This question is for Spring Kafka, related to Apache Kafka with High Level Consumer: Skip corrupted messages
Is there a way to configure Spring Kafka consumer to skip a record that cannot be read/processed (is corrupt)?
I am seeing a situation where the consumer gets stuck on the same record if it cannot be deserialized. This is the error the consumer throws.
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Can not construct instance of java.time.LocalDate: no long/Long-argument constructor/factory method to deserialize from Number value
The consumer polls the topic and just keeps printing the same error in a loop until the program is killed.
In a @KafkaListener that has the following consumer factory configurations,
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);
You need ErrorHandlingDeserializer: https://docs.spring.io/spring-kafka/docs/2.2.0.RELEASE/reference/html/_reference.html#error-handling-deserializer
If you can't move to that 2.2 version, consider implementing your own and returning null for those records which can't be deserialized properly.
The source code is here: https://github.com/spring-projects/spring-kafka/blob/master/spring-kafka/src/main/java/org/springframework/kafka/support/serializer/ErrorHandlingDeserializer2.java
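If you can move to 2.2+, a minimal sketch of wiring it into the consumer factory might look like this; the property names follow the 2.2 ErrorHandlingDeserializer2 API and the JSON delegate mirrors the configuration in the question:
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
// Wrap the real deserializer; records that fail deserialization arrive with a null value
// (and the exception in a header) instead of blocking the poll loop.
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer2.class);
props.put(ErrorHandlingDeserializer2.VALUE_DESERIALIZER_CLASS, JsonDeserializer.class.getName());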
In case you are using an older version of Kafka, set the following consumer factory configurations in a @KafkaListener.
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, CustomDeserializer.class);
Here is the code for CustomDeserializer:
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CustomDeserializer implements Deserializer<Object>
{
    @Override
    public void configure( Map<String, ?> configs, boolean isKey )
    {
    }

    @Override
    public Object deserialize( String topic, byte[] data )
    {
        ObjectMapper mapper = new ObjectMapper();
        Object object = null;
        try
        {
            object = mapper.readValue(data, Object.class);
        }
        catch ( Exception exception )
        {
            System.out.println("Error in deserializing bytes " + exception);
        }
        return object;
    }

    @Override
    public void close()
    {
    }
}
Since I want my code to be generic enough to read any kind of JSON, I deserialize to Object.class:
object = mapper.readValue(data, Object.class);
And since we catch the exception here, the record won't be retried once it has been read.

Can't seem to transform KStream<A,B> to KTable<X,Y>

This is my first attempt at trying to use a KTable. I have a Kafka Stream that contains Avro serialized objects of type A,B. And this works fine. I can write a Consumer that consumes just fine or a simple KStream that simply counts records.
The B object has a field containing a country code. I'd like to supply that code to a KTable so it can count the number of records that contain a particular country code. To do so I'm trying to convert the stream into a stream of X,Y (or really: country-code, count). Eventually I look at the contents of the table and extract an array of KV pairs.
The code I have (included) always errors out with the following (see the line with 'Caused by'):
2018-07-26 13:42:48.688 [com.findology.tools.controller.TestEventGeneratorController-16d7cd06-4742-402e-a679-898b9ef78c41-StreamThread-1; AssignedStreamsTasks] ERROR -- stream-thread [com.findology.tools.controller.TestEventGeneratorController-16d7cd06-4742-402e-a679-898b9ef78c41-StreamThread-1] Failed to process stream task 0_0 due to the following error:
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=com.findology.model.traffic.CpaTrackingCallback, partition=0, offset=962649
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:240)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:922)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:802)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:749)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:719)
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (key: org.apache.kafka.common.serialization.ByteArraySerializer / value: org.apache.kafka.common.serialization.ByteArraySerializer) is not compatible to the actual key or value type (key type: java.lang.Integer / value type: java.lang.Integer). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:92)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.kstream.internals.KStreamFilter$KStreamFilterProcessor.process(KStreamFilter.java:43)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:211)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:124)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.kstream.internals.KStreamTransform$KStreamTransformProcessor.process(KStreamTransform.java:59)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:211)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:124)
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.forward(AbstractProcessorContext.java:174)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:80)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:224)
... 6 more
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to [B
at org.apache.kafka.common.serialization.ByteArraySerializer.serialize(ByteArraySerializer.java:21)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:146)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:94)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:87)
... 19 more
And here is the code I'm using. I've omitted certain classes for brevity. Note that I'm not using the Confluent KafkaAvro classes.
private synchronized void createStreamProcessor2() {
    if (streams == null) {
        try {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, getClass().getName());
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            StreamsConfig config = new StreamsConfig(props);
            StreamsBuilder builder = new StreamsBuilder();

            Map<String, Object> serdeProps = new HashMap<>();
            serdeProps.put("schema.registry.url", schemaRegistryURL);

            AvroSerde<CpaTrackingCallback> cpaTrackingCallbackAvroSerde = new AvroSerde<>(schemaRegistryURL);
            cpaTrackingCallbackAvroSerde.configure(serdeProps, false);

            // This is the key to telling Kafka the specific Serde instance to use
            // to deserialize the Avro-encoded value
            KStream<Long, CpaTrackingCallback> stream = builder.stream(CpaTrackingCallback.class.getName(),
                    Consumed.with(Serdes.Long(), cpaTrackingCallbackAvroSerde));

            // provide a way to convert CpaTrackingCallback info into just country codes
            // (Long, CpaTrackingCallback) -> (countryCode:Integer, placeHolder:Long)
            TransformerSupplier<Long, CpaTrackingCallback, KeyValue<Integer, Long>> transformer = new TransformerSupplier<Long, CpaTrackingCallback, KeyValue<Integer, Long>>() {
                @Override
                public Transformer<Long, CpaTrackingCallback, KeyValue<Integer, Long>> get() {
                    return new Transformer<Long, CpaTrackingCallback, KeyValue<Integer, Long>>() {
                        @Override
                        public void init(ProcessorContext context) {
                            // Not doing punctuate, so no need to store the context
                        }

                        @Override
                        public KeyValue<Integer, Long> transform(Long key, CpaTrackingCallback value) {
                            return new KeyValue<>(value.getCountryCode(), 1L);
                        }

                        @Override
                        public KeyValue<Integer, Long> punctuate(long timestamp) {
                            return null;
                        }

                        @Override
                        public void close() {
                        }
                    };
                }
            };

            KTable<Integer, Long> countryCounts = stream.transform(transformer).groupByKey()
                    .count(Materialized.as("country-counts"));

            streams = new KafkaStreams(builder.build(), config);
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
            streams.cleanUp();
            streams.start();

            try {
                countryCountsView = waitUntilStoreIsQueryable("country-counts", QueryableStoreTypes.keyValueStore(),
                        streams);
            }
            catch (InterruptedException e) {
                log.warn("Interrupted while waiting for query store to become available", e);
            }
        }
        catch (Exception e) {
            log.error(e);
        }
    }
}
The bare groupByKey() method on KStream uses the default serializer/deserializer (which you haven't set). Use the method groupByKey(Serialized<K,V> serialized), as in:
.groupByKey(Serialized.with(Serdes.Integer(), Serdes.Long()))
Also note that what you do in your custom TransformerSupplier can be done with a simple KStream.map call, as sketched below.
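Putting both points together, a rough sketch of the map-based version, keeping the store name and serdes from this thread:
KTable<Integer, Long> countryCounts = stream
        .map((key, value) -> KeyValue.pair(value.getCountryCode(), 1L))
        .groupByKey(Serialized.with(Serdes.Integer(), Serdes.Long()))
        .count(Materialized.as("country-counts"));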

Apache Flink integration with Elasticsearch

I am trying to integrate Flink with Elasticsearch 2.1.1. I am using the Maven dependency
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch2_2.10</artifactId>
<version>1.1-SNAPSHOT</version>
</dependency>
and here's the Java code where I am reading the events from a Kafka queue (which works fine). Somehow, though, the events are not getting posted to Elasticsearch and there is no error either. In the code below, if I change any of the settings related to the port, hostname, cluster name, or index name of Elasticsearch, I immediately see an error, but currently it shows no error and no new documents get created in Elasticsearch.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// parse user parameters
ParameterTool parameterTool = ParameterTool.fromArgs(args);

DataStream<String> messageStream = env.addSource(new FlinkKafkaConsumer082<>(parameterTool.getRequired("topic"), new SimpleStringSchema(), parameterTool.getProperties()));
messageStream.print();

Map<String, String> config = new HashMap<>();
config.put(ElasticsearchSink.CONFIG_KEY_BULK_FLUSH_MAX_ACTIONS, "1");
config.put(ElasticsearchSink.CONFIG_KEY_BULK_FLUSH_INTERVAL_MS, "1");
config.put("cluster.name", "FlinkDemo");

List<InetSocketAddress> transports = new ArrayList<>();
transports.add(new InetSocketAddress(InetAddress.getByName("localhost"), 9300));

messageStream.addSink(new ElasticsearchSink<String>(config, transports, new TestElasticsearchSinkFunction()));
env.execute();
}

private static class TestElasticsearchSinkFunction implements ElasticsearchSinkFunction<String> {
    private static final long serialVersionUID = 1L;

    public IndexRequest createIndexRequest(String element) {
        Map<String, Object> json = new HashMap<>();
        json.put("data", element);
        return Requests.indexRequest()
                .index("flink")
                .id("hash" + element)
                .source(json);
    }

    @Override
    public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }
}
I was indeed running it on the local machine and debugging it as well, but the one thing I was missing was properly configuring logging, since most Elasticsearch issues are reported in "log.warn" statements. The issue was an exception inside "BulkRequestHandler.java" in the elasticsearch-2.2.1 client API, which was throwing the error "org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: type is missing;". I had created the index but not a type, which I find pretty strange, as it should primarily be concerned with the index and create the type by default.
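For illustration, the createIndexRequest from the question only needs an explicit type to get past that validation; the type name used here is just a placeholder:
public IndexRequest createIndexRequest(String element) {
    Map<String, Object> json = new HashMap<>();
    json.put("data", element);
    // Supplying a type explicitly avoids the "type is missing" validation error
    // on the Elasticsearch 2.x TransportClient ("flink-log" is a placeholder type name)
    return Requests.indexRequest()
            .index("flink")
            .type("flink-log")
            .id("hash" + element)
            .source(json);
}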
I have found a very good example of the Flink & Elasticsearch connector.
First, the Maven dependency:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch2_2.10</artifactId>
<version>1.1-SNAPSHOT</version>
</dependency>
Second, the example Java code:
public static void writeElastic(DataStream<String> input) {
    Map<String, String> config = new HashMap<>();

    // This instructs the sink to emit after every element, otherwise they would be buffered
    config.put("bulk.flush.max.actions", "1");
    config.put("cluster.name", "es_keira");

    try {
        // Add elasticsearch hosts on startup
        List<InetSocketAddress> transports = new ArrayList<>();
        transports.add(new InetSocketAddress("127.0.0.1", 9300)); // port is 9300 not 9200 for ES TransportClient

        ElasticsearchSinkFunction<String> indexLog = new ElasticsearchSinkFunction<String>() {
            public IndexRequest createIndexRequest(String element) {
                String[] logContent = element.trim().split("\t");
                Map<String, String> esJson = new HashMap<>();
                esJson.put("IP", logContent[0]);
                esJson.put("info", logContent[1]);

                return Requests
                        .indexRequest()
                        .index("viper-test")
                        .type("viper-log")
                        .source(esJson);
            }

            @Override
            public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
                indexer.add(createIndexRequest(element));
            }
        };

        ElasticsearchSink<String> esSink = new ElasticsearchSink<>(config, transports, indexLog);
        input.addSink(esSink);
    } catch (Exception e) {
        System.out.println(e);
    }
}