I was wondering whether there is Java client code for a Kafka consumer that lets me read data via a push notification / blocking read, instead of the current poll:
final KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList("test"));
new Thread() {
    @Override
    public void run() {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100); // poll
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n",
                        record.offset(), record.key(), record.value());
                callback.onMessage(record.value());
            }
        }
    }
}.start();
If I understand your question correctly, you want the data to be pushed to the consumer when it becomes available, instead of having the consumer be responsible for checking for new data and pulling it.
On https://kafka.apache.org/08/design.html they discuss push vs. pull and the choice made in Kafka: the producer pushes messages to the broker and the consumer pulls from the broker. They also describe what was done to mitigate the downsides of a pull-based approach. If you require a push-based pub/sub messaging system, you may want to look at Scribe or Flume, which are also mentioned in that link :)
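That said, if you want push-style semantics in your own code, a common workaround is to hide the poll loop behind a callback interface. Here is a minimal sketch, assuming a client version that supports poll(Duration) (older clients use poll(long)); the MessageListener interface and the daemon-thread wrapper are made up for illustration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical callback interface; invoked for every record the poll loop receives.
interface MessageListener {
    void onMessage(String value);
}

public class PushStyleConsumer {
    public static void start(Properties props, String topic, MessageListener listener) {
        Thread worker = new Thread(() -> {
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList(topic));
                while (!Thread.currentThread().isInterrupted()) {
                    // Still a pull under the hood; the callback only hides it from the caller.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        listener.onMessage(record.value());
                    }
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }
}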
I am using kafka_2.12 version 2.3.0, where I publish data into a Kafka topic with a partition and a key. I need a way to consume a particular message from the topic using the key and partition combination, so that I don't have to consume all the messages and iterate to find the correct one.
Right now I am only able to do this:
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props)
consumer.subscribe(Collections.singletonList("topic"))
ConsumerRecords<String, String> records = consumer.poll(100)
def data = records.findAll {
    it -> it.key().equals(key)
}
You can't "get messages by key" from Kafka.
One solution, if practical, would be to have as many partitions as keys and to always route messages for a key to the same partition (see the sketch after the snippet below).
Message Key as Partition
kafkaConsumer.assign(topicPartitions);
kafkaConsumer.seekToBeginning(topicPartitions);

// Pull records from Kafka; keep polling until we get nothing back
final List<ConsumerRecord<byte[], byte[]>> allRecords = new ArrayList<>();
ConsumerRecords<byte[], byte[]> records;
do {
    // Grab records from Kafka
    records = kafkaConsumer.poll(2000L);
    logger.info("Found {} records in kafka", records.count());

    // Add to our array list
    records.forEach(allRecords::add);
} while (!records.isEmpty());
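If you do route every message for a key to one partition, a hedged sketch of reading just that partition could look like this. It reuses the murmur2 hash from org.apache.kafka.common.utils.Utils, which is what the default Java producer partitioner applies to the serialized key bytes; topic, key, and the consumer instance are placeholders, and the client-side key filter is still required because Kafka has no server-side key lookup:

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.utils.Utils;

public class KeyedLookup {
    // Reads only the partition the default producer partitioner would pick for `key`,
    // then filters by key on the client side.
    public static void readByKey(KafkaConsumer<String, String> consumer, String topic, String key) {
        int numPartitions = consumer.partitionsFor(topic).size();
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8); // StringSerializer writes UTF-8 bytes
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

        TopicPartition tp = new TopicPartition(topic, partition);
        consumer.assign(Collections.singletonList(tp));
        consumer.seekToBeginning(Collections.singletonList(tp));

        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
        for (ConsumerRecord<String, String> record : records) {
            if (key.equals(record.key())) {
                System.out.println(record.value());
            }
        }
    }
}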
Access messages of a Topic using Topic Name only
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(<Topic Name>, <Topic Name>));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
    }
}
There are two ways to consume topics/partitions:
KafkaConsumer.assign() : Document link
KafkaConsumer.subscribe() : Document link
So, you can't get messages by key.
If you don't plan to expand the number of partitions, consider using the assign() method, because all messages with a specific key go to the same partition.
How to use:
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(properties);
TopicPartition partition = new TopicPartition("some-topic", 0);
consumer.assign(Arrays.asList(partition));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        if (record.key().equals(key)) {
            String data = record.value(); // record matching the key
        }
    }
}
I am continuously sending data in Avro format to a topic called "SD_RTL". The data is generated by a Confluent datagen that conforms to a custom Avro schema.
When I use kafka-avro-console-consumer on this topic, I see my data correctly. Here's an example of one correctly received tuple:
{"_id":1276215,"serialno":"0","timestamp":416481,"locationid":"Location_0","gpscoords":{"latitude":-2.9789479087622794,"longitude":-4.344459940322691},"data":{"tag1":0,"tag2":1}}
The problem appears when I try to consume this data through a Java app: I get the error "unknown magic byte".
I am using code inspired by the second snippet under the Serializer section on Confluent's website.
This is my code:
//consumer properties
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "group1");
//string inputs and outputs
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroDeserializer");
props.put("schema.registry.url", "localhost:8081");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

//subscribe to topic
String topic = "SD_RTL";
final Consumer<String, SensorsPayload> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(topic));

try {
    while (true) {
        ConsumerRecords<String, SensorsPayload> records = consumer.poll(100);
        for (ConsumerRecord<String, SensorsPayload> record : records) {
            System.out.printf("offset = %d, key = %s, value = %s \n", record.offset(), record.key(), record.value());
        }
    }
} finally {
    consumer.close();
}
My code and Confluent's code are slightly different. For example, Confluent uses the line:
final Consumer<String, GenericRecord> consumer = new KafkaConsumer<String, String>(props);
whereas if I write the type parameters as <String, String> instead of <String, SensorsPayload>, IntelliJ complains about incompatible types. I'm not sure whether that is related to my issue.
I've generated my SensorsPayload class automatically from an Avro schema through the avro-maven-plugin.
Why does my Consumer app generate an "unknown magic byte" error when kafka's avro console consumer does not?
I have a simple program with which I'm trying to receive data from Kafka. When I start a Kafka producer and send data, for example "Hello", I get this when I print the message: (null, Hello). I don't know why this null appears. Is there any way to avoid it? I think it comes from the first element of Tuple2<String, String>, but I only want to print the second element. Also, when I print using System.out.println("inside map " + message);, no message appears. Does anyone know why? Thanks.
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("org.kakfa.spark.ConsumerData").setMaster("local[4]");
    // Substitute 127.0.0.1 with the actual address of your Spark Master (or use "local" to run in local mode)
    sparkConf.set("spark.cassandra.connection.host", "127.0.0.1");

    // Create the context with a 2-second batch size
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));

    Map<String, Integer> topicMap = new HashMap<>();
    String[] topics = KafkaProperties.TOPIC.split(",");
    for (String topic : topics) {
        topicMap.put(topic, KafkaProperties.NUM_THREADS);
    }

    /* connection to cassandra */
    CassandraConnector connector = CassandraConnector.apply(sparkConf);
    System.out.println("+++++++++++ cassandra connector created ++++++++++++++++++++++++++++");

    /* Receive kafka inputs */
    JavaPairReceiverInputDStream<String, String> messages =
            KafkaUtils.createStream(jssc, KafkaProperties.ZOOKEEPER, KafkaProperties.GROUP_CONSUMER, topicMap);
    System.out.println("+++++++++++++ streaming-kafka connection done +++++++++++++++++++++++++++");

    JavaDStream<String> lines = messages.map(
            new Function<Tuple2<String, String>, String>() {
                public String call(Tuple2<String, String> message) {
                    System.out.println("inside map " + message);
                    return message._2();
                }
            }
    );

    messages.print();

    jssc.start();
    jssc.awaitTermination();
}
Q1) Null values:
Messages in Kafka are keyed, which means they all have a (key, value) structure.
You see (null, Hello) because the producer published a (null, "Hello") record to the topic.
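For reference, that null key comes straight from how the record was produced. A minimal producer sketch, assuming a local broker and a topic named "test" (both placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NullKeyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key supplied, so the consumer side sees (null, "Hello").
            producer.send(new ProducerRecord<>("test", "Hello"));
            // With an explicit key the same consumer would see ("greeting", "Hello") instead.
            producer.send(new ProducerRecord<>("test", "greeting", "Hello"));
        }
    }
}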
If you want to omit the key in your processing, map the original DStream to remove it: kafkaDStream.map(new Function<Tuple2<String, String>, String>() {...})
Q2) System.out.println("inside map " + message); does not print. A couple of classic reasons:
Transformations are applied on the executors, so when running in a cluster that output appears in the executor logs, not on the driver.
Operations are lazy, and DStreams need to be materialized for operations to be applied.
In this specific case, the JavaDStream<String> lines is never materialized, i.e. never used in an output operation, therefore the map is never executed (see the sketch below).
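A minimal fix, reusing the lines and jssc variables from the question, is to add an output operation on lines; any output operation (print, foreachRDD, saveAsTextFiles, ...) would do:

// Materialize `lines` with an output operation; without this, the map above never runs
// and the "inside map" statement is never reached.
lines.print();

jssc.start();
jssc.awaitTermination();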
I'm using KafkaConsumer to consume messages from a Kafka server (topics).
It works fine for topics created before the consumer code is started.
But the problem is that it does not work for topics created dynamically (I mean after the consumer code has started), even though the API says it supports dynamic topic creation. Here is the link for your reference:
Kafka version used : 0.9.0.1
https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
Here is the Java code:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "false");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
Pattern r = Pattern.compile("siddu(\\d)*");
consumer.subscribe(r, new HandleRebalance());

try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                System.out.println(partition.partition() + ": " + record.offset() + ": " + record.value());
            }
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}
NOTE: My topic names match the regular expression.
If I restart the consumer, it starts reading the messages pushed to the topic.
Any help is really appreciated.
There was an answer to this in the apache kafka mail archives. I am copying it below:
The consumer supports a configuration option "metadata.max.age.ms"
which basically controls how often topic metadata is fetched. By
default, this is set fairly high (5 minutes), which means it will take
up to 5 minutes to discover new topics matching your regular
expression. You can set this lower to discover topics quicker.
So in your props you can:
props.put("metadata.max.age.ms", 5000);
This will cause your consumer to find out about new topics every 5 seconds.
You can hook into Zookeeper. Check out the sample code. In essence, you will create a watcher on the Zookeeper node /brokers/topics. When new children are added here, it's a new Topic being added, and your watcher will get triggered.
Note that the difference between this and the other answer is that this one is a trigger while the other is polling: this will be as close to real-time as possible, whereas the other will, at best, be within whatever your polling interval is.
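A minimal sketch of such a watcher, assuming the plain ZooKeeper Java client and the default /brokers/topics path where Kafka registers topics (the connection string and timeout are placeholders):

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class TopicWatcher {
    public static void main(String[] args) throws Exception {
        // Session-level watcher does nothing; we only care about the child watch below.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        Watcher topicWatcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Event.EventType.NodeChildrenChanged) {
                    try {
                        // ZooKeeper watches fire once, so re-register while reading the new topic list.
                        List<String> topics = zk.getChildren("/brokers/topics", this);
                        System.out.println("Topic list changed: " + topics);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        };

        // Register the initial watch on the topics znode.
        zk.getChildren("/brokers/topics", topicWatcher);
        Thread.sleep(Long.MAX_VALUE);
    }
}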
Here is the solution that worked for me, using the KafkaConsumer API. Here is the Java code for it.
private static Consumer<String, String> createConsumer(String topic) {
    final Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "KafkaExampleConsumer");
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    // Create the consumer using props; the generic types match the String deserializers above.
    final Consumer<String, String> consumer = new KafkaConsumer<>(props);

    // Subscribe to the topic.
    consumer.subscribe(Collections.singletonList(topic));
    return consumer;
}

public static void runConsumer(String topic) throws InterruptedException {
    final Consumer<String, String> consumer = createConsumer(topic);

    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("hiiiii offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
    }

    consumer.commitAsync();
    consumer.close();
    //System.out.println("DONE");
}
Using this, we can consume messages from dynamically created topics.
Use the subscribe method of KafkaConsumer that takes a Pattern as the argument for the set of topics to get data from (a usage sketch follows the javadoc below):
/**
 * Subscribe to all topics matching the specified pattern to get dynamically assigned partitions.
 * The pattern matching will be done periodically against all topics existing at the time of check.
 * This can be controlled through the {@code metadata.max.age.ms} configuration: by lowering
 * the max metadata age, the consumer will refresh metadata more often and check for matching topics.
 *
 * See {@link #subscribe(Collection, ConsumerRebalanceListener)} for details on the
 * use of the {@link ConsumerRebalanceListener}. Generally rebalances are triggered when there
 * is a change to the topics matching the provided pattern and when consumer group
 * membership changes. Group rebalances only take place during an active call to {@link #poll(Duration)}.
 *
 * @param pattern  Pattern to subscribe to
 * @param listener Non-null listener instance to get notifications on partition assignment/revocation
 *                 for the subscribed topics
 * @throws IllegalArgumentException If pattern or listener is null
 * @throws IllegalStateException    If {@code subscribe()} is called previously with topics, or assign is called
 *                                  previously (without a subsequent call to {@link #unsubscribe()}), or if not
 *                                  configured with at least one partition assignment strategy
 */
@Override
public void subscribe(Pattern pattern, ConsumerRebalanceListener listener) {
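Tying this back to the question, a minimal usage sketch that combines pattern subscription with a lower metadata.max.age.ms so newly created matching topics are picked up quickly; it assumes a client version where poll(Duration) is available (on 0.9.0.1 use poll(long)), and the broker address and pattern are taken from the question:

import java.time.Duration;
import java.util.Collection;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PatternConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Refresh topic metadata every 5 seconds so newly created topics matching the pattern
        // are discovered without restarting the consumer.
        props.put("metadata.max.age.ms", "5000");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Pattern.compile("siddu\\d*"), new ConsumerRebalanceListener() {
            @Override public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
            @Override public void onPartitionsAssigned(Collection<TopicPartition> partitions) { }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.topic() + ": " + record.value());
            }
        }
    }
}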
JavaPairReceiverInputDStream<String, byte[]> messages = KafkaUtils.createStream(...);
JavaPairDStream<String, byte[]> filteredMessages = filterValidMessages(messages);
JavaDStream<String> useCase1 = calculateUseCase1(filteredMessages);
JavaDStream<String> useCase2 = calculateUseCase2(filteredMessages);
JavaDStream<String> useCase3 = calculateUseCase3(filteredMessages);
JavaDStream<String> useCase4 = calculateUseCase4(filteredMessages);
...
I retrieve messages from Kafka, filter them, and then use the same messages for multiple use cases. Here useCase1 to useCase4 are independent of each other and could be computed in parallel. However, when I look at the logs, I see that the calculations happen sequentially. How can I make them run in parallel? Any suggestion would be helpful.
Try creating a Kafka topic for each of your 4 use cases, then create 4 different Kafka DStreams, as sketched below.
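A hedged sketch of that idea, reusing the jssc context from the question; the topic names, group ids, and ZooKeeper address are made up, and String values are shown instead of byte[] for brevity:

// One receiver-based input stream per use-case topic, each in its own consumer group.
// Each receiver occupies a core, so the application needs enough cores (e.g. local[8]).
JavaPairReceiverInputDStream<String, String> useCase1Stream =
        KafkaUtils.createStream(jssc, "localhost:2181", "group-usecase1", Collections.singletonMap("usecase1-topic", 1));
JavaPairReceiverInputDStream<String, String> useCase2Stream =
        KafkaUtils.createStream(jssc, "localhost:2181", "group-usecase2", Collections.singletonMap("usecase2-topic", 1));
// ...and likewise for usecase3-topic and usecase4-topic.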
I moved all the code inside a for loop, iterated over the number of partitions in the Kafka topic, and I see an improvement.
for (int i = 0; i < numOfPartitions; i++) {
    JavaPairReceiverInputDStream<String, byte[]> messages =
            KafkaUtils.createStream(...);
    JavaPairDStream<String, byte[]> filteredMessages =
            filterValidMessages(messages);

    JavaDStream<String> useCase1 = calculateUseCase1(filteredMessages);
    JavaDStream<String> useCase2 = calculateUseCase2(filteredMessages);
    JavaDStream<String> useCase3 = calculateUseCase3(filteredMessages);
    JavaDStream<String> useCase4 = calculateUseCase4(filteredMessages);
}
Reference : http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/