Read messages from Kafka topic between a range of offsets

Read messages from Kafka topic between a range of offsets - java

I am looking for a way to consume some set of messages from my Kafka topic with specific offset range (assume my partition has offset from 200 - 300, I want to consume the messages from offset 250-270).
I am using below code where I can specify the initial offset, but it would consume all the messages from 250 to till end. Is there any way/attributes available to set the end offset to consume the messages till that point.
#KafkaListener(id = "KafkaListener",
topics = "${kafka.topic.name}",
containerFactory = "kafkaManualAckListenerContainerFactory",
errorHandler = "${kafka.error.handler}",
topicPartitions = #TopicPartition(topic = "${kafka.topic.name}",
partitionOffsets = {
#PartitionOffset(partition = "0", initialOffset = "250"),
#PartitionOffset(partition = "1", initialOffset = "250")
}))

You can use seek() in order to force the consumer to start consuming from a specific offset and then poll() until you reach the target end offset.
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the
same partition more than once, the latest offset will be used on the
next poll(). Note that you may lose data if this API is arbitrarily
used in the middle of consumption, to reset the fetch offsets
For example, let's assume you want to start from offset 200:
TopicPartition tp = new TopicPartition("myTopic", 0);
Long startOffset = 200L
Long endOffset = 300L
List<TopicPartition> topics = Arrays.asList(tp);
consumer.assign(topics);
consumer.seek(topicPartition, startOffset);
now you just need to keep poll()ing until endOffset is reached:
boolean run = true;
while (run) {
ConsumerRecords<String, String> records = consumer.poll(1000);
for (ConsumerRecord<String, String> record : records) {
// Do whatever you want to do with `record`
// Check if end offset has been reached
if (record.offset() == endOffset) {
run = false;
break;
}
}
}

KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<String, String>(properties);
boolean keepOnReading = true;
// offset to read the data from.
long offsetToReadFrom = 250L;
// seek is mostly used to replay data or fetch a specific message
// seek
kafkaConsumer.seek(partitionToReadFrom, offsetToReadFrom);
while(keepOnReading) {
ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
numberOfMessagesRead ++;
logger.info("Key: "+record.key() + ", Value: " + record.value());
logger.info("Partition: " + record.partition() + ", Offset: " + record.offset());
if(record.offset() == 270L) {
keepOnReading = false;
break;
}
}
}
I hope this helps you !!

Related

get the unprocessed message count in spring kafka

we are migrating to Kafka, I need to create a monitoring POC service that will periodically check the unprocessed message count in the Kafka queue and based on the count take some action. but this service must not read or process the message, designated consumers will do that, with every cron this service just needs the count of unprocessed messages present in the queue.
so far I have done this, from multiple examples
public void stats() throws ExecutionException, InterruptedException {
Map<String, Object> props = new HashMap<>();
// list of host:port pairs used for establishing the initial connections to the Kafka cluster
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
try (final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
consumer.subscribe(Arrays.asList(topicName));
while (true) {
Thread.sleep(1000);
ConsumerRecords<String, String> records = consumer.poll(1000);
if (!records.isEmpty()) {
System.out.println("records is not empty = " + records.count() + " " + records);
}
for (ConsumerRecord<String, String> record : records) {
System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
Set<TopicPartition> partitions = consumer.assignment();
//consumer.seekToBeginning(partitions);
Map<TopicPartition, Long> offsets = consumer.endOffsets(partitions);
for (TopicPartition partition : offsets.keySet()) {
OffsetAndMetadata commitOffset = consumer.committed(new TopicPartition(partition.topic(), partition.partition()));
Long lag = commitOffset == null ? offsets.get(partition) : offsets.get(partition) - commitOffset.offset();
System.out.println("lag = " + lag);
System.out.printf("partition %s is at %d\n", partition.topic(), offsets.get(partition));
}
}
}
}
}
the code is working fine some times and some times gives wrong output, please let me know

Don't subscribe to the topic; just create a consumer with the same group to get the endOffsets.
See this answer for an example.

Multiple Apache Flink windows validations

I'm just getting started on stream processing using Apache Flink, the thing is that I'm receiving a stream of Json that look like this:
{
token_id: “tok_afgtryuo”,
ip_address: “128.123.45.1“,
device_fingerprint: “abcghift”,
card_hash: “hgtyuigash”,
“bin_number”: “424242”,
“last4”: “4242”,
“name”: “Seu Jorge”
}
And was asked if i could fulfill the following business rules:
Decline if number of tokens > 5 for this IP in last 10 seconds
Decline if number of tokens > 15 for this IP in last minute
Decline if number of tokens > 60 for this IP in last hour
I made 2 classes, main class when I'm making an instance to call the Window function with different parameters to avoid duplicate code:
Main.java
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//This DataStream Would be Converting the Json to a Token Object
DataStream<Token> baseStream =
env.addSource(new SocketTextStreamFunction("localhost",
9999,
"\n",
1))
.map(new MapTokens());
// 1- First rule Decline if number of tokens > 5 for this IP in last 10 seconds
DataStreamSink<String> response1 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.seconds(10),
5, "seconds").print();
//2 -Decline if number of tokens > 15 for this IP in last minute
DataStreamSink<String> response2 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.minutes(1),
62, "minutes").print();
//3- Decline if number of tokens > 60 for this IP in last hour
DataStreamSink<String> response3 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.hours(1),
60, "Hours").print();
env.execute("Job2");
}
And another class where I'm doing all the logic for rules, I'm counting the times where an IP address appears and, if it is more than the allowed number in the time window I'm returning a message with some information:
Rulemaker.java
public class RuleMaker {
public DataStream<String> getStreamKeyCount(DataStream<Token> stream,
String tokenProp,
Time time,
Integer maxPetitions,
String ruleType){
return
stream
.flatMap(new FlatMapFunction<Token, Tuple3<String, Integer, String>>() {
#Override
public void flatMap(Token token, Collector<Tuple3<String, Integer, String>> collector) throws Exception {
String tokenSelection = "";
switch (tokenProp)
{
case "ip":
tokenSelection = token.getIpAddress();
break;
case "device":
tokenSelection = token.getDeviceFingerprint();
break;
case "cardHash":
tokenSelection = token.getCardHash();
break;
}
collector.collect(new Tuple3<>(tokenSelection, 1, token.get_tokenId()));
}
})
.keyBy(0)
.timeWindow(time)
.process(new MyProcessWindowFunction(maxPetitions, ruleType));
}
//Class to process the elements from the window
private class MyProcessWindowFunction extends ProcessWindowFunction<
Tuple3<String, Integer, String>,
String,
Tuple,
TimeWindow
> {
private Integer _maxPetitions;
private String _ruleType;
public MyProcessWindowFunction(Integer maxPetitions, String ruleType) {
this._maxPetitions = maxPetitions;
this._ruleType = ruleType;
}
#Override
public void process(Tuple tuple, Context context, Iterable<Tuple3<String, Integer, String>> iterable, Collector<String> out) throws Exception {
Integer counter = 0;
for (Tuple3<String, Integer, String> element : iterable) {
counter += element.f1++;
if(counter > _maxPetitions){
out.collect("El elemeto ha sido declinado: " + element.f2 + " Num elements: " + counter + " rule type: " + _ruleType + " token: " + element.f0 );
counter = 0;
}
}
}
}
}
So far, i think this code is working but I'm a begginer on Apache Flink, and I'll appreciate a lot if you could tell me if it's something wrong about the way I'm trying to work with this and point me to the right direction.
Thanks a lot.

General approach looks very good, although I would have thought that Table API would be powerful enough to help you (more concise) which supports Json out of the box.
If you want to stick to DataStream API, in getStreamKeyCount, the switch around tokenProp should be replaced by passing a key extractor to getStreamKeyCount to have only one place to add new rules.
public DataStream<String> getStreamKeyCount(DataStream<Token> stream,
KeySelector<Token, String> keyExtractor,
Time time,
Integer maxPetitions,
String ruleType){
return stream
.map(token -> new Tuple3<>(keyExtractor.getKey(token), 1, token.get_tokenId()))
.keyBy(0)
.timeWindow(time)
.process(new MyProcessWindowFunction(maxPetitions, ruleType));
}
Then the invocation becomes
DataStreamSink<String> response2 = ruleMaker.getStreamKeyCount(baseStream,
Token::getIpAddress, Time.minutes(1), 62, "minutes");

Commit Offsets to Kafka on Spark Executors

I am getting events from Kafka, enriching/filtering/transforming them on Spark and then storing them in ES. I am committing back the offsets to Kafka
I have two questions/problems:
(1) My current Spark job is VERY slow
I have 50 partitions for a topic and 20 executors. Each executor has 2 cores and 4g of memory each. My driver has 8g of memory. I am consuming 1000 events/partition/second and my batch interval is 10 seconds. This means, I am consuming 500000 events in 10 seconds
My ES cluster is as follows:
20 shards / index
3 master instances c5.xlarge.elasticsearch
12 instances m4.xlarge.elasticsearch
disk / node = 1024 GB so 12 TB in total
And I am getting huge scheduling and processing delays
(2) How can I commit offsets on executors?
Currently, I enrich/transform/filter my events on executors and then send everything to ES using BulkRequest. It's a synchronous process. If I get positive feedback, I send the offset list to driver. If not, I send back an empty list. On the driver, I commit offsets to Kafka. I believe, there should be a way, where I can commit offsets on executors but I don't know how to pass kafka Stream to executors:
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges, this::onComplete);
This is the code for committing offsets to Kafka which requires Kafka Stream
Here is my overall code:
kafkaStream.foreachRDD( // kafka topic
rdd -> { // runs on driver
rdd.cache();
String batchIdentifier =
Long.toHexString(Double.doubleToLongBits(Math.random()));
LOGGER.info("## [" + batchIdentifier + "] Starting batch ...");
Instant batchStart = Instant.now();
List<OffsetRange> offsetsToCommit =
rdd.mapPartitionsWithIndex( // kafka partition
(index, eventsIterator) -> { // runs on worker
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
LOGGER.info(
"## Consuming " + offsetRanges[index].count() + " events" + " partition: " + index
);
if (!eventsIterator.hasNext()) {
return Collections.emptyIterator();
}
// get single ES documents
List<SingleEventBaseDocument> eventList = getSingleEventBaseDocuments(eventsIterator);
// build request wrappers
List<InsertRequestWrapper> requestWrapperList = getRequestsToInsert(eventList, offsetRanges[index]);
LOGGER.info(
"## Processed " + offsetRanges[index].count() + " events" + " partition: " + index + " list size: " + eventList.size()
);
BulkResponse bulkItemResponses = elasticSearchRepository.addElasticSearchDocumentsSync(requestWrapperList);
if (!bulkItemResponses.hasFailures()) {
return Arrays.asList(offsetRanges).iterator();
}
elasticSearchRepository.close();
return Collections.emptyIterator();
},
true
).collect();
LOGGER.info(
"## [" + batchIdentifier + "] Collected all offsets in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
);
OffsetRange[] offsets = new OffsetRange[offsetsToCommit.size()];
for (int i = 0; i < offsets.length ; i++) {
offsets[i] = offsetsToCommit.get(i);
}
try {
offsetManagementMapper.commit(offsets);
} catch (Exception e) {
// ignore
}
LOGGER.info(
"## [" + batchIdentifier + "] Finished batch of " + offsetsToCommit.size() + " messages " +
"in " + (Instant.now().toEpochMilli() - batchStart.toEpochMilli()) + "ms"
);
rdd.unpersist();
});

You can move the offset logic above the rdd loop ... I am using below template for better offset handling and performance
JavaInputDStream<ConsumerRecord<String, String>> kafkaStream = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
kafkaStream.foreachRDD( kafkaStreamRDD -> {
//fetch kafka offsets for manually commiting it later
OffsetRange[] offsetRanges = ((HasOffsetRanges) kafkaStreamRDD.rdd()).offsetRanges();
//filter unwanted data
kafkaStreamRDD.filter(
new Function<ConsumerRecord<String, String>, Boolean>() {
#Override
public Boolean call(ConsumerRecord<String, String> kafkaRecord) throws Exception {
if(kafkaRecord!=null) {
if(!StringUtils.isAnyBlank(kafkaRecord.key() , kafkaRecord.value())) {
return Boolean.TRUE;
}
}
return Boolean.FALSE;
}
}).foreachPartition( kafkaRecords -> {
// init connections here
while(kafkaRecords.hasNext()) {
ConsumerRecord<String, String> kafkaConsumerRecord = kafkaRecords.next();
// work here
}
});
//commit offsets
((CanCommitOffsets) kafkaStream.inputDStream()).commitAsync(offsetRanges);
});

Retrieve last n messages of Kafka consumer from a particular topic

kafka version : 0.9.0.1
If n = 20,
I have to get last 20 messages of a topic.
I tried with
kafkaConsumer.seekToBeginning();
But it retrieves all the messages. I need to get only the last 20 messages.
This topic may have hundreds of thousands of records
public List<JSONObject> consumeMessages(String kafkaTopicName) {
KafkaConsumer<String, String> kafkaConsumer = null;
boolean flag = true;
List<JSONObject> messagesFromKafka = new ArrayList<>();
int recordCount = 0;
int i = 0;
int maxMessagesToReturn = 20;
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "project.group.id");
props.put("max.partition.fetch.bytes", "1048576000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaConsumer = new KafkaConsumer<>(props);
kafkaConsumer.subscribe(Arrays.asList(kafkaTopicName));
TopicPartition topicPartition = new TopicPartition(kafkaTopicName, 0);
LOGGER.info("Subscribed to topic " + kafkaConsumer.listTopics());
while (flag) {
// will consume all the messages and store in records
ConsumerRecords<String, String> records = kafkaConsumer.poll(1000);
kafkaConsumer.seekToBeginning(topicPartition);
// getting total records count
recordCount = records.count();
LOGGER.info("recordCount " + recordCount);
for (ConsumerRecord<String, String> record : records) {
if(record.value() != null) {
if (i >= recordCount - maxMessagesToReturn) {
// adding last 20 messages to messagesFromKafka
LOGGER.info("kafkaMessage "+record.value());
messagesFromKafka.add(new JSONObject(record.value()));
}
i++;
}
}
if (recordCount > 0) {
flag = false;
}
}
kafkaConsumer.close();
return messagesFromKafka;
}

You can use kafkaConsumer.seekToEnd(Collection<TopicPartition> partitions) to seek to the last offset of the given partition(s). As per the documentation:
"Seek to the last offset for each of the given partitions. This function evaluates lazily, seeking to the final offset in all partitions only when poll(Duration) or position(TopicPartition) are called. If no partitions are provided, seek to the final offset for all of the currently assigned partitions."
Then you can retrieve the position of a particular partition using position(TopicPartition partition).
Then you can reduce 20 from it, and use kafkaConsumer.seek(TopicPartition partition, long offset) to get to the most recent 20 messages.
Simply,
kafkaConsumer.seekToEnd(partitionList);
long endPosition = kafkaConsumer.position(topicPartiton);
long recentMessagesStartPosition = endPosition - maxMessagesToReturn;
kafkaConsumer.seek(topicPartition, recentMessagesStartPosition);
Now you can retrieve the most recent 20 messages using poll()
This is the simple logic, but if you have multiple partitions, you have to consider those cases as well. I did not try this, but hope you'll get the concept.

How to write Kafka Consumer Client in java to consume the messages from multiple brokers?

I was looking for java client (Kafka Consumer) to consume the messages from multiple brokers. please advice
Below is the code written to publish the messages to multiple brokers using simple partitioner.
Topic is created with replication factor "2" and partition "3".
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster)
{
List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
int numPartitions = partitions.size();
logger.info("Number of Partitions " + numPartitions);
if (keyBytes == null)
{
int nextValue = counter.getAndIncrement();
List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
if (availablePartitions.size() > 0)
{
int part = toPositive(nextValue) % availablePartitions.size();
int selectedPartition = availablePartitions.get(part).partition();
logger.info("Selected partition is " + selectedPartition);
return selectedPartition;
}
else
{
// no partitions are available, give a non-available partition
return toPositive(nextValue) % numPartitions;
}
}
else
{
// hash the keyBytes to choose a partition
return toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}
}
public void publishMessage(String message , String topic)
{
Producer<String, String> producer = null;
try
{
producer = new KafkaProducer<>(producerConfigs());
logger.info("Topic to publish the message --" + this.topic);
for(int i =0 ; i < 10 ; i++)
{
producer.send(new ProducerRecord<String, String>(this.topic, message));
logger.info("Message Published Successfully");
}
}
catch(Exception e)
{
logger.error("Exception Occured " + e.getMessage()) ;
}
finally
{
producer.close();
}
}
public Map<String, Object> producerConfigs()
{
loadPropertyFile();
Map<String, Object> propsMap = new HashMap<>();
propsMap.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
propsMap.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
propsMap.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
propsMap.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, SimplePartitioner.class);
propsMap.put(ProducerConfig.ACKS_CONFIG, "1");
return propsMap;
}
public Map<String, Object> consumerConfigs() {
Map<String, Object> propsMap = new HashMap<>();
System.out.println("properties.getBootstrap()" + properties.getBootstrap());
propsMap.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, properties.getBootstrap());
propsMap.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
propsMap.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, properties.getAutocommit());
propsMap.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, properties.getTimeout());
propsMap.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
propsMap.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
propsMap.put(ConsumerConfig.GROUP_ID_CONFIG, properties.getGroupid());
propsMap.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, properties.getAutooffset());
return propsMap;
}
#KafkaListener(id = "ID1", topics = "${config.topic}", group = "${config.groupid}")
public void listen(ConsumerRecord<?, ?> record)
{
logger.info("Message Consumed " + record);
logger.info("Partition From which Record is Received " + record.partition());
this.message = record.value().toString();
}
bootstrap.servers = [localhost:9092, localhost:9093, localhost:9094]

If you use a regular Java consumer, it will automatically read from multiple brokers. There is no special code you need to write. Just subscribe to the topic(s) you want to consumer and the consumer will connect to the corresponding brokers automatically. You only provide a "single entry point" broker -- the client figures out all other broker of the cluster automatically.

Number of Kafka broker nodes in cluster has nothing to do with consumer logic. Nodes in cluster only used for fault tolerance and bootstrap process. You placing messaging in different partitions of topic based on some custom logic it also not going to effect consumer logic. Even If you have single consumer than that consumer will consume messages from all partitions of Topic subscribed. I request you to check your code with Kafka cluster with single broker node...

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Read messages from Kafka topic between a range of offsets - java

Related

get the unprocessed message count in spring kafka

Multiple Apache Flink windows validations

Commit Offsets to Kafka on Spark Executors

Retrieve last n messages of Kafka consumer from a particular topic

How to write Kafka Consumer Client in java to consume the messages from multiple brokers?

Categories

Resources