How to choose a Key and Offset for a Kafka Producer - java

I'm following the example linked here. While working through the code, I came up with two questions:
Are the key and the offset the same thing?
According to Google,
Offset: A Kafka topic receives messages across a distributed set of
partitions where they are stored. Each partition maintains the
messages it has received in a sequential order where they are
identified by an offset, also known as a position.
The two seem very similar to me: an offset identifies a unique message within a partition, and producers send records to a partition based on the record's key.
What is the best way to choose the key and offset for a producer?
For instance, the example I linked above uses the timestamp as both the key and the offset.
Is this always the best recommendation?
class IRCMessageListener extends IRCEventAdapter {
    @Override
    public void onPrivmsg(String channel, IRCUser u, String msg) {
        IRCMessage event = new IRCMessage(channel, u, msg);
        //FIXME kafka round robin default partitioner seems to always publish to partition 0 only (?)
        long ts = event.getInt64("timestamp");
        Map<String, ?> srcOffset = Collections.singletonMap(TIMESTAMP_FIELD, ts);
        Map<String, ?> srcPartition = Collections.singletonMap(CHANNEL_FIELD, channel);
        SourceRecord record = new SourceRecord(srcPartition, srcOffset, topic, KEY_SCHEMA, ts, IRCMessage.SCHEMA, event);
        queue.offer(record);
    }
}
I'm asking because I'm trying to create a custom Kafka connector that reads from a third-party WebSocket API. The API sends a real-time stream of messages for a given key value, so I thought of using that key as both my partition key and my offset. I want to make sure that reasoning is right.

A key is optional metadata that can be sent with a Kafka message, and by default it is used to route the message to a specific partition. For example, if you send a message m with key k to a topic mytopic that has p partitions, then m goes to partition Hash(k) % p of mytopic. The key has no connection to the offset of a partition whatsoever. Offsets are used by consumers to keep track of the position of the last message read in a partition. In your case, if the timestamp values are fairly evenly distributed, using the timestamp as the key is fine; otherwise you might cause partition imbalance.
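To make the distinction concrete, here is a minimal, dependency-free sketch. It is a toy model, not Kafka code: the class name KeyVersusOffsetSketch is made up, and String.hashCode() is only a stand-in for the hash that Kafka's default partitioner applies to the serialized key. It shows that the key selects the partition, while the offset is just the position the log assigns on append.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a topic: each partition is an append-only list, and a record's
// offset is simply its index in that list. This is not Kafka code.
class KeyVersusOffsetSketch {
    private final List<List<String>> partitions = new ArrayList<>();

    KeyVersusOffsetSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Stand-in for Hash(k) % p. String.hashCode() is an assumption made to
    // keep the sketch dependency-free; it is not the hash Kafka uses.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends the value and returns {partition, offset}. The offset is assigned
    // by the log on append; the sender never chooses it.
    long[] send(String key, String value) {
        int partition = partitionFor(key);
        List<String> log = partitions.get(partition);
        log.add(value);
        return new long[] {partition, log.size() - 1};
    }

    public static void main(String[] args) {
        KeyVersusOffsetSketch topic = new KeyVersusOffsetSketch(3);
        for (String value : new String[] {"first", "second", "third"}) {
            long[] position = topic.send("channel-a", value);
            System.out.println("partition=" + position[0] + " offset=" + position[1]);
        }
    }
}
```

Sending three values with the same key lands them in one partition with increasing offsets, which is why a key cannot double as an offset.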

These are some basic differences:
Offset: maintained by Kafka to keep track of the records consumed, to avoid losing or duplicating records while consuming.
Key: specific to the input event. If none is provided, it defaults to null. It is useful, for example, when writing records to HDFS with the default partitioner using Kafka Connect. Every message can have its own key, or many messages can share the same key.


How to consume messages within specific timestamps using reactor kafka

I need to consume messages from a Kafka topic within a specific time range. It is easy enough to determine the starting offset with the help of partition.seekToTimestamp(startTimeInMillis) like the code below:
ReceiverOptions<ByteBuffer, ByteBuffer> options =
receiverOptions
.subscription(topicConfig.getTopics())
.pollTimeout(Duration.ofMillis(topicConfig.getPollWaitTimeoutMs()))
.addAssignListener(
partitions -> {
partitions.forEach(partition -> partition.seekToTimestamp(startTimeInMillis));
});
but how do I stop consuming once the messages start exceeding the end timestamp?
I can use the below code:
KafkaReceiver.create(options)
.receive()
.takeWhile(record -> record.timestamp() < endTimeInMillis)
.map(this::handleConsumerRecord);
The problem is that takeWhile stops the whole stream as soon as one record reaches the end timestamp. That record only shows that its own partition is past the range; other partitions in the topic might still have messages below the end timestamp.
How do I ensure that I have consumed all the messages across partitions within the end timestamp?
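One possible direction, sketched here without any Kafka or Reactor dependency, is to track completion per partition instead of stopping at the first record past the end timestamp. TimeRangeTracker and its methods are hypothetical names, and the sketch assumes that timestamps increase within a partition, which the takeWhile version above also assumes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers which partitions have already produced a
// record at or past the end timestamp.
class TimeRangeTracker {
    private final Set<Integer> assigned;
    private final Set<Integer> finished = new HashSet<>();
    private final long endTimeInMillis;

    TimeRangeTracker(Set<Integer> assignedPartitions, long endTimeInMillis) {
        this.assigned = new HashSet<>(assignedPartitions);
        this.endTimeInMillis = endTimeInMillis;
    }

    // Returns true if the record lies before the end timestamp and should be
    // handled. A record at or past the end marks its partition as finished.
    boolean accept(int partition, long timestamp) {
        if (timestamp >= endTimeInMillis) {
            finished.add(partition);
            return false;
        }
        return true;
    }

    // True once every assigned partition has produced a record past the end.
    boolean allPartitionsFinished() {
        return finished.containsAll(assigned);
    }
}
```

In the pipeline, accept(...) could be called for each record with record.partition() and record.timestamp(), and the stream stopped once allPartitionsFinished() returns true. A partition that never receives a record past the end timestamp would never be marked finished, so a real solution would probably also need to compare positions against the partition end offsets.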

How to make a partition of a topic in Kafka by user_id?

I'm building a web app backend with SpringBoot and I have to use Kafka for sending messages.
I want to have a topic, for example "testTopic", and produce messages to it from different users, so that the messages can later be sent to different machines.
Say user A sends a message to his machine and user B sends a message to his machine.
How can I differentiate who has sent which message and to which machine it should arrive?
I've read about Kafka topic partitions, but I don't know if I'm using them correctly in my code.
Here I'm building my topic
@Bean
public NewTopic kafkaExampleTopic() {
    return TopicBuilder.name("TestTopic").partitions(1).build();
}
Here I'm sending data to that topic
@Bean
CommandLineRunner commandLineRunner(KafkaTemplate<String, String> kafkaTemplate) {
    return args -> {
        // send(topic, key, value): the record key is what the partitioner hashes.
        // Wrapping a Message in String.valueOf(...) would only send its toString() as the value, with no key.
        kafkaTemplate.send("TestTopic", "1", "Hello kafka testTopic uno con key 1");
        kafkaTemplate.send("TestTopic", "2", "Hello kafka testTopic uno con key 2");
    };
}
And this is my listener
@KafkaListener(topics = "TestTopic", groupId = "exampleGroupId")
public void listenWithHeaders(
        @Payload String message,
        @Header(KafkaHeaders.RECEIVED_PARTITION_ID) int partition) {
    System.out.println(
        "Received Message: " + message
        + " from partition: " + partition);
}
Thank you so much guys!
Topic partitions need to be decided ahead of time.
For example, if you have numeric ids, you could define a topic with ten partitions, then create your own Partitioner class that routes every id to a partition based on its leading digit (ids 1, 10, 15, etc. all go to partition 1). If you use hexadecimal values (such as a UUID), maybe a topic with 16 partitions (a-f, 0-9). For lowercase alphanumeric ids, 36, and so on.
By default, Kafka's DefaultPartitioner computes a Murmur2 hash of the key, modulo the number of topic partitions. With that, it's possible that ids 5 and 7, for example, end up in the same partition. Depending on your consumer's needs, that might not be what you want.
Consumers are what run on the different machines. The partitions shouldn't matter to them, except that consumers of the same group cannot be assigned the same partitions (if you only have one partition, only one consumer per group can read it).
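The leading-digit rule described above can be sketched as a plain function. This is only the routing logic, with a made-up class name; a real implementation would put the same logic inside a class implementing org.apache.kafka.clients.producer.Partitioner and register it through the producer's partitioner.class setting.

```java
// Dependency-free sketch of the routing rule only; not a Kafka Partitioner.
// Assumes a topic with exactly ten partitions, numbered 0 to 9.
class LeadingDigitPartitioning {
    static int partitionFor(String numericId) {
        if (numericId == null || numericId.isEmpty()) {
            throw new IllegalArgumentException("id must not be empty");
        }
        char first = numericId.charAt(0);
        if (first < '0' || first > '9') {
            throw new IllegalArgumentException("id must start with a digit: " + numericId);
        }
        // The leading digit is used directly as the partition number.
        return first - '0';
    }

    public static void main(String[] args) {
        for (String id : new String[] {"1", "10", "15", "5", "7"}) {
            System.out.println("id " + id + " -> partition " + partitionFor(id));
        }
    }
}
```

Unlike a hash-based partitioner, this rule makes the placement predictable (ids 5 and 7 can never share a partition), at the cost of uneven load if the ids are not uniformly distributed over leading digits.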

Kafka committed and last offsets using admin API

I am querying a Kafka broker using the admin client API to get the committed offsets of CONSUMER_GROUP using the code below:
Map<TopicPartition, OffsetAndMetadata> offsets =
admin.listConsumerGroupOffsets(CONSUMER_GROUP)
.partitionsToOffsetAndMetadata().get();
The above code triggers a query to the special __consumer_offsets topic to get the committed offsets for each topic partition that CONSUMER_GROUP is responsible for.
On the other hand, I am using the code below to retrieve the latest (end) offsets for each topic partition of CONSUMER_GROUP:
Map<TopicPartition, OffsetSpec> requestLatestOffsets = new HashMap<>();
for (TopicPartition tp : offsets.keySet()) {
    requestLatestOffsets.put(tp, OffsetSpec.latest());
}
Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latestOffsets =
    admin.listOffsets(requestLatestOffsets).all().get();
for (Map.Entry<TopicPartition, OffsetAndMetadata> e : offsets.entrySet()) {
    long latestOffset = latestOffsets.get(e.getKey()).offset();
}
My question is whether the committed and latest offsets are therefore queried from two different topics: the committed offsets from the __consumer_offsets topic, and the latest (end) offsets from the actual topic(s) that CONSUMER_GROUP consumes.
(1) Is the above description of requesting committed and latest offsets accurate?
(2) Is it possible to query the __consumer_offsets topic directly?
Thank you.
Yes, your understanding is correct. Committed offsets are stored in the __consumer_offsets topic, while you need to query the specific partitions to get their end offsets.
Yes, __consumer_offsets is a regular topic, and you can consume it directly if you want. It's typically much easier to retrieve the data via the provided API, but if you're interested in its contents you can consume it. Check the console formatters if you want to see how to deserialize the data.
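A common reason to fetch both values is to compute consumer lag per partition. The question does not say that this is the goal, so treat it as an assumption. The sketch below uses plain "topic-partition" strings in place of TopicPartition so that it runs without the Kafka client, and OffsetLagSketch is a made-up name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Lag per partition = latest (end) offset minus committed offset.
class OffsetLagSketch {
    // Keys are "topic-partition" strings standing in for TopicPartition.
    static Map<String, Long> lag(Map<String, Long> committed, Map<String, Long> latest) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : committed.entrySet()) {
            Long end = latest.get(e.getKey());
            if (end != null) {
                // Clamp at zero in case the two values were read at slightly different times.
                result.put(e.getKey(), Math.max(0L, end - e.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> committed = new LinkedHashMap<>();
        committed.put("orders-0", 40L);
        Map<String, Long> latest = new LinkedHashMap<>();
        latest.put("orders-0", 55L);
        System.out.println(lag(committed, latest));
    }
}
```

With the admin client, the two input maps would be built from the values returned by listConsumerGroupOffsets and listOffsets shown in the question.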

Kafka only subscribe to latest message?

Sometimes (seemingly at random) Kafka sends old messages. I only want the latest message for each key, so that a newer message overwrites older ones with the same key. Currently it looks like I have multiple messages with the same key that don't get compacted.
I use this setting in the topic:
cleanup.policy=compact
I'm using Java/Kotlin and Apache Kafka 1.1.1 client.
Properties(8).apply {
val jaasTemplate = "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"%s\" password=\"%s\";"
val jaasCfg = String.format(jaasTemplate, Configuration.kafkaUsername, Configuration.kafkaPassword)
put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
BOOTSTRAP_SERVERS)
put(ConsumerConfig.GROUP_ID_CONFIG,
"ApiKafkaKotlinConsumer${Configuration.kafkaGroupId}")
put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
StringDeserializer::class.java.name)
put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
StringDeserializer::class.java.name)
put("security.protocol", "SASL_SSL")
put("sasl.mechanism", "SCRAM-SHA-256")
put("sasl.jaas.config", jaasCfg)
put("max.poll.records", 100)
put("receive.buffer.bytes", 1000000)
}
Have I missed some settings?
If you want to have only one value for each key, you have to use the KTable<K,V> abstraction from Kafka Streams: StreamsBuilder::table(final String topic). The topic used here should have its cleanup policy set to compact.
If you use KafkaConsumer, you just pull data from the brokers. It doesn't give you any mechanism that performs deduplication. Depending on whether compaction has been performed or not, you can get anywhere from one to n messages for the same key.
Regarding compaction
Compaction doesn't mean that all old values for the same key are removed immediately. When an old message for a key is removed depends on several properties. The most important are:
log.cleaner.min.cleanable.ratio
The minimum ratio of dirty log to total log for a log to be eligible for cleaning
log.cleaner.min.compaction.lag.ms
The minimum time a message will remain uncompacted in the log. Only applicable for logs that are being compacted.
log.cleaner.enable
Enable the log cleaner process to run on the server. Should be enabled if using any topics with a cleanup.policy=compact including the internal offsets topic. If disabled those topics will not be compacted and continually grow in size.
You can find more detail about compaction at https://kafka.apache.org/documentation/#compaction
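If you stay with a plain consumer, the deduplication has to happen on your side. A minimal sketch of the idea, which is conceptually what a table view of the topic gives you: replay the records in log order and keep only the last value seen for each key. The class name and the {key, value} array representation are assumptions made to keep the example free of Kafka dependencies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Keeps only the most recent value per key, the way a reader of a
// not-yet-compacted topic has to if it wants "latest only" semantics.
class LatestPerKeySketch {
    // Each record is a {key, value} pair, given in log order.
    static Map<String, String> latestPerKey(List<String[]> recordsInLogOrder) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : recordsInLogOrder) {
            if (record[1] == null) {
                // A null value is a tombstone in a compacted topic: the key is deleted.
                latest.remove(record[0]);
            } else {
                // A later record for the same key overwrites the earlier one.
                latest.put(record[0], record[1]);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        log.add(new String[] {"user-1", "old"});
        log.add(new String[] {"user-2", "only"});
        log.add(new String[] {"user-1", "new"});
        System.out.println(latestPerKey(log));
    }
}
```

The result is the same whether or not the broker has already compacted the log, which is the point: the consumer no longer depends on when the log cleaner runs.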

How to dynamically route message to queues in RabbitMQ

I want to develop a solution that can dynamically route messages to different queues (more than 10000 queues). This is what I have so far:
An exchange with type set to topic, so that I can route messages to different queues based on routing keys.
10000 queues, each bound with a routing key of the form #.%numberOfQueue%.#, where %numberOfQueue% is a simple numeric value for that queue (it might be changed to something more meaningful).
A producer that publishes messages with a routing key like 5.10.15.105.10000, which means the message should be routed to the queues with keys 5, 10, 15, 105 and 10000, since it conforms to the patterns of those queues.
This is how it looks with the Java client API:
String exchangeName = "rabbit.test.exchange";
String exchangeType = "topic";
boolean exchangeDurable = true;
boolean queueDurable = true;
boolean queueExclusive = false;
boolean queueAutoDelete = false;
Map<String, Object> queueArguments = null;
for (int i = 0; i < numberOfQueues; i++) {
String queueNameIterated = "rabbit.test" + i + ".queue";
channel.exchangeDeclare(exchangeName, exchangeType, exchangeDurable);
channel.queueDeclare(queueNameIterated, queueDurable, queueExclusive, queueAutoDelete, queueArguments);
String routingKey = "#." + i + ".#";
channel.queueBind(queueNameIterated, exchangeName, routingKey);
}
This is how the routing key is generated for a message addressed to all queues from 0 to 9998:
private String generateRoutingKey() {
StringBuilder keyBuilder = new StringBuilder();
for (int i = 0; i < numberOfQueues - 2; i++) {
keyBuilder.append(i);
keyBuilder.append('.');
}
String result = keyBuilder.append(numberOfQueues - 2).toString();
LOGGER.info("generated key: {}", result);
return result;
}
Seems good. The problem is that I can't use such a long routing key with the channel.basicPublish() method:
Exception in thread "main" java.lang.IllegalArgumentException: Short string too long; utf-8 encoded length = 48884, max = 255.
at com.rabbitmq.client.impl.ValueWriter.writeShortstr(ValueWriter.java:50)
at com.rabbitmq.client.impl.MethodArgumentWriter.writeShortstr(MethodArgumentWriter.java:74)
at com.rabbitmq.client.impl.AMQImpl$Basic$Publish.writeArgumentsTo(AMQImpl.java:2319)
at com.rabbitmq.client.impl.Method.toFrame(Method.java:85)
at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:104)
at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:396)
at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:372)
at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:690)
at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:672)
at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:662)
at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.basicPublish(AutorecoveringChannel.java:192)
I have these requirements:
Dynamically choose, from the producer, which queues a message is published to. It might be just one queue, all queues, or 1000 queues.
I have more than 10000 different queues, and it might be necessary to publish the same message to all of them.
So the questions are:
Can I use such a long key? If so, how?
Maybe I can achieve the same goal with a different configuration of the exchange or queues?
Maybe there is some hash function that can effectively distinguish the destinations and collapse them into 255 symbols? If so, it should provide a way to deal with different publishings (for example, how to send only to the queues numbered 555 and 8989?).
Maybe there is some different key strategy that could be used that way?
How else can I achieve my requirements?
I started using RabbitMQ just a short time ago, but I hope I can help you nonetheless. There can be as many words in the routing key as you like, up to the limit of 255 bytes (as also described in e.g. RabbitMQ Tutorial 5 - Topics). Thus, the topic exchange does not seem to be appropriate for your use case.
Perhaps you can use a headers exchange in this case? According to the concept description:
A headers exchange is designed for routing on multiple attributes that are more easily expressed as message headers than a routing key. Headers exchanges ignore the routing key attribute. Instead, the attributes used for routing are taken from the headers attribute. A message is considered matching if the value of the header equals the value specified upon binding.
See here and here for an example. As I said, I just started with RabbitMQ, so I don't know for sure whether this could be an option for you. If I have time later, I will try to construct a simple example for you.
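Until then, here is a rough in-memory model of the idea, with no RabbitMQ client involved. It assumes a hypothetical naming scheme in which queue number i is bound on a header named "q" + i with the value "true", and a message lists its target queues as headers. With the Java client, the binding side would be passed as the arguments map of queueBind and the message side would go into the headers of the message properties.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory model of headers-based matching; it only imitates the rule
// "a message matches if its header value equals the value given at binding".
class HeadersRoutingSketch {
    // queue name -> header name that the queue is bound on (hypothetical scheme)
    private final Map<String, String> bindings = new LinkedHashMap<>();

    void bind(String queueName, String headerName) {
        bindings.put(queueName, headerName);
    }

    // A queue receives the message when its bound header is present with value "true".
    List<String> route(Map<String, Object> messageHeaders) {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, String> binding : bindings.entrySet()) {
            if ("true".equals(messageHeaders.get(binding.getValue()))) {
                targets.add(binding.getKey());
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        HeadersRoutingSketch exchange = new HeadersRoutingSketch();
        for (int i = 0; i < 10000; i++) {
            exchange.bind("rabbit.test" + i + ".queue", "q" + i);
        }
        Map<String, Object> headers = new LinkedHashMap<>();
        headers.put("q555", "true");
        headers.put("q8989", "true");
        System.out.println(exchange.route(headers));
    }
}
```

This covers the "send only to queues 555 and 8989" case without touching the routing key. Whether it stays practical when a single message targets thousands of queues, and therefore carries thousands of headers, is something I have not measured.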
