Kafka Consumer Properties to read from the maximum offset - java

I have written a Java Kafka consumer. I want to make sure that, once the consumer is started, it only reads messages sent by the producer from that point onwards, i.e. it should not read any messages that were already sent to Kafka before it started. Can anyone explain how to ensure this?
Here is a snippet of the properties I use:
Properties properties = new Properties();
properties.put("zookeeper.connect", zookeeperHost);
properties.put("group.id", group);
properties.put("auto.offset.reset","largest");
ConsumerConfig consumerConfig = new ConsumerConfig(properties);
consumerConnector = Consumer.createJavaConsumerConnector(consumerConfig);
UPDATE Sept14:
I am using the following properties, but it seems the consumer still sometimes reads from the beginning. Can someone tell me what is wrong now?
I am using Kafka Version 0.8.2
properties.put("auto.offset.reset","largest");
properties.put("auto.commit.enable","false");

Based on the answers, it seems the correct way to set the consumer properties is as follows:
properties.put("auto.offset.reset","largest");
properties.put("auto.commit.enable","false");
This ensures the consumer reads from the maximum (latest) offset.
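For anyone on the newer clients, here is a minimal sketch of the same idea with the modern KafkaConsumer API; the broker address, topic, and group name are placeholders, not taken from the question:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
// a fresh group id means there are no committed offsets, so auto.offset.reset applies
props.put("group.id", "my-group-" + System.currentTimeMillis());
props.put("enable.auto.commit", "false");
props.put("auto.offset.reset", "latest"); // "latest" is the new-consumer equivalent of "largest"
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));
// Alternatively, once partitions are assigned you can call consumer.seekToEnd(...)
// to jump explicitly to the end of each assigned partition.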


org.apache.kafka.common.errors.TimeoutException: Topic not present in metadata after 60000 ms

I'm getting the error:
org.apache.kafka.common.errors.TimeoutException: Topic testtopic2 not present in metadata after 60000 ms.
When trying to produce to the topic in my local Kafka instance on Windows using Java. Note that the topic testtopic2 exists and I'm able to produce messages to it using the Windows console producer just fine.
Below the code that I'm using:
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
public class Kafka_Producer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, 0);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<String, String>(props);
        TestCallback callback = new TestCallback();
        for (long i = 0; i < 100; i++) {
            ProducerRecord<String, String> data = new ProducerRecord<String, String>(
                    "testtopic2", "key-" + i, "message-" + i);
            producer.send(data, callback);
        }
        producer.close();
    }

    private static class TestCallback implements Callback {
        @Override
        public void onCompletion(RecordMetadata recordMetadata, Exception e) {
            if (e != null) {
                System.out.println("Error while producing message to topic :" + recordMetadata);
                e.printStackTrace();
            } else {
                String message = String.format("sent message to topic:%s partition:%s offset:%s",
                        recordMetadata.topic(), recordMetadata.partition(), recordMetadata.offset());
                System.out.println(message);
            }
        }
    }
}
Pom dependency:
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.6.0</version>
</dependency>
Output of list and describe:
I was having this same problem today. I'm a newbie at Kafka and was simply trying to get a sample Java producer and consumer running. I was able to get the consumer working, but kept getting the same "topic not present in metadata" error as you, with the producer.
Finally, out of desperation, I added some code to my producer to dump the topics. When I did this, I got runtime errors because of missing classes in the packages jackson-databind and jackson-core. After adding them, I no longer got the "topic not present" error. I removed the topic-dumping code I had temporarily added, and it still worked.
This error can also appear when the destination Kafka instance has "died" or the URL to it is wrong.
In such a case, the thread that sends a message to Kafka will block for max.block.ms, which defaults to exactly 60000 ms.
You can check whether it is caused by the above property by passing a different value:
Properties props = new Properties();
...(among others)
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 30000); // 30 seconds, or any other value of your choice
If the TimeoutException is thrown after your specified time, then you should check whether your URL to Kafka is correct and whether the Kafka instance is alive.
It might also be caused by a nonexistent partition.
For example, if you have a single partition [0] and your producer tries to send to partition [1], you'll get the same error. The topic in this case exists, but not the partition.
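To illustrate (the topic name and partition number here are made up): a record explicitly addressed to a partition that doesn't exist never finds that partition in the metadata and fails with the same TimeoutException.
// "one-partition-topic" is assumed to have only partition 0;
// addressing partition 1 reproduces the "not present in metadata" timeout
ProducerRecord<String, String> bad =
        new ProducerRecord<>("one-partition-topic", 1, "some-key", "some-value");
producer.send(bad, callback);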
First off I want to say thanks to Bobb Dobbs for his answer; I was also struggling with this for a while today. I just want to add that the only dependency I had to add is jackson-databind. This is the only dependency I have in my project, besides kafka-clients.
Update: I've learned a bit more about what's going on. kafka-clients sets the scope of its jackson-databind dependency as "provided," which means it expects it to be provided at runtime by the JDK or a container. See this article for more details on the provided maven scope.
This scope is used to mark dependencies that should be provided at runtime by JDK or a container, hence the name.
A good use case for this scope would be a web application deployed in some container, where the container already provides some libraries itself.
I'm not sure of the exact reasoning for setting its scope to provided, except that maybe this is a library people normally want to provide themselves in order to keep it on the latest version for security fixes, etc.
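If you want to try the same fix, the dependency looks roughly like this (the version is only an example; pick one compatible with your kafka-clients):
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.10.5.1</version>
</dependency>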
I also had a similar issue while trying this in my local environment on my MacBook. It was quite frustrating, and I tried a few approaches:
Stopped Zookeeper, Stopped Kafka, restarted ZK and Kafka. (Didn't help)
Stopped ZK. Deleted ZK data directory. Deleted Kafka logs.dirs and restarted Kafka (Didn't help)
Restarted my macbook - This did the trick.
I have used Kafka in production for more than 3 years and didn't face this problem on a cluster; it happened only in my local environment. However, restarting fixed it for me.
I saw this issue when someone on my team had changed the value for the spring.kafka.security.protocol config (we are using Spring on my project). Previously it had been "SSL" in our config, but it was updated to be PLAINTEXT. In higher environments where we connect to a cluster that uses SSL, we saw the error OP ran into.
Why we saw this error as opposed to an SSL error or authentication error is beyond me, but if you run into this error it may be worth double checking your client authentication configs to your Kafka cluster.
This error is only a surface symptom; it may be triggered by the following underlying conditions.
The first and most common situation is that your Kafka producer config is wrong; check whether the BOOTSTRAP_SERVERS_CONFIG property points to the correct server address.
In a Docker environment, you might check your port mapping.
Check whether the firewall has opened port 9092 on the server where the broker is located.
If your broker runs with SSL, check your producer config for SSL_TRUSTSTORE_LOCATION_CONFIG, SECURITY_PROTOCOL_CONFIG, and SSL_TRUSTSTORE_TYPE_CONFIG.
Some brokers are configured with both SSL and PLAINTEXT listeners; make sure you are using the port you need.
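As a rough sketch of the SSL-related producer settings mentioned above (host, port, truststore path, and password are placeholders):
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9093"); // the SSL listener, not the PLAINTEXT one
props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/path/to/client.truststore.jks");
props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");
props.put(SslConfigs.SSL_TRUSTSTORE_TYPE_CONFIG, "JKS");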
I created a topic with a single partition and then tried to populate the topic as if it had 10 partitions, and I got this issue.
I deleted the topic using the kafka-topics.sh script but didn't wait long enough for the cleanup to finish before I started populating the topic again. When I looked at the topic metadata, it had one partition, and I got exactly the same issue as mentioned in the first part of this answer.
Note that this could also happen because the versions of kafka-clients and Spring are not compatible.
More info in the "Kafka Client Compatibility" matrix at https://spring.io/projects/spring-kafka
kafka-topics --bootstrap-server 127.0.0.1:9092 --topic my_first --create --partitions 3
First create the topic in Kafka using the above command;
here my_first is the topic name.
You may want to check your producer properties for metadata.max.idle.ms
The producer caches topic metadata for as long as the above configured value. Any changes to the metadata on the broker side will not be visible on the client (producer) immediately. Restarting the producer should, however, make it read the metadata at startup.
Update: check the default values here: https://kafka.apache.org/documentation.html#metadata.max.idle.ms
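For example, to shorten how long idle-topic metadata is kept around (the 30-second value is arbitrary, just for illustration):
// default is 5 minutes; after this much idle time the topic's metadata is dropped
// and the next send forces a fresh metadata fetch
props.put("metadata.max.idle.ms", "30000");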
I was facing the same issue.
It can happen when your bootstrap or registry URL is wrong or unreachable.
In case you came here with the same error while setting up integration tests using Testcontainers: this can happen because of the port used by Kafka inside the container versus the port exposed outside. So make sure that the started bootstrap server port is correctly mapped to the exposed port that you are using in your tests.
In my case I just replaced the properties file entries after the Kafka container started:
KafkaContainer kafka = new KafkaContainer(...);
kafka.start();
String brokers = kafka.getBootstrapServers();

TestPropertySourceUtils.addInlinedPropertiesToEnvironment(context,
        "spring.kafka.bootstrap-servers=" + brokers,
        "spring.kafka.producer.bootstrap-servers=" + brokers,
        "spring.kafka.consumer.bootstrap-servers=" + brokers
);
Spent quite some time before I figured it out, hope this helps someone.
Add the two dependencies in the pom: kafka-streams and spring-kafka.
Then in application.yml (or application.properties):
spring:
  kafka:
    bootstrap-servers: <service_url/bootstrap_server_url>
    producer:
      bootstrap-servers: <service_url/bootstrap_server_url>
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.apache.kafka.common.serialization.StringSerializer
    consumer:
      group-id: <your_consumer_id>
On your @SpringBootApplication class, add another annotation: @EnableKafka.
This will make it work without any errors.
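For example, the application class would look roughly like this (the class name is arbitrary):
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.EnableKafka;

@SpringBootApplication
@EnableKafka
public class MyKafkaApplication {
    public static void main(String[] args) {
        SpringApplication.run(MyKafkaApplication.class, args);
    }
}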
I was having the same problem, and it was because of a wrong config. Here's my producer configuration that worked. Replace the ${} placeholders with your own config, and don't forget to set all the properties:
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, ${servers});
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put("enable.auto.commit", "false");
props.put("auto.offset.reset", "earliest");
props.put("security.protocol", "SASL_PLAINTEXT");
props.put("basic.auth.credentials.source", "USER_INFO");
props.put("basic.auth.user.info", "${user}:${pass}");
props.put("sasl.kerberos.service.name", "kafka");
props.put("auto.register.schemas", "false");
props.put("schema.registry.url", "${https://your_url}");
props.put("schema.registry.ssl.truststore.location", "client_truststore.jks");
props.put("schema.registry.ssl.truststore.password", "${password}");
KafkaProducer producer = new KafkaProducer(props);
ClassEvent event = getEventObjectData();
ProducerRecord<String, ClassEvent> record = new ProducerRecord<String, ClassEvent>(args[0], event);
Execution from cluster:
java -Djava.security.auth.login.config=${jaas.conf} -cp ${your-producer-example.jar} ${your.package.class.ClassName} ${topic}
Hope it helps

Is there a way to read messages from the beginning every time using Kafka Streams (not via KafkaConsumer) in Java?

We are creating a POC to read database CDC and push it to external systems.
Each source table's CDC records are sent to its respective topic in Avro format (using Kafka Schema Registry and a Kafka server).
We are writing Java code to consume the Avro messages, deserialize them using an Avro Serde, join them, and then send the results to different topics so they can be consumed by external systems.
We have a limitation, though: we cannot produce new messages to the source table topics to generate new content/changes. So the only way to develop the join code is to read messages from the beginning of every source topic each time we run the application (until we are confident the code works and can start receiving live data again).
With a KafkaConsumer we have the option of using the seekToBeginning method to force reading from the beginning in Java code, which works. However, there is no such option when we stream a topic using a KStream object. What are the alternatives here?
We tried to reset the offset using kafka-consumer-groups --reset-offsets with --to-earliest, but that only sets the offset to the nearest available one. When we try to reset the offset manually to "0" with the --to-offset parameter, we get the warning below and the offset is not set to "0". My understanding is that setting it to 0 should read messages from the beginning; correct me if I am wrong.
"WARN New offset (0) is lower than earliest offset for topic partition"
Sample code below
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVER);
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID);
properties.put("schema.registry.url", SCHEMA_REGISTRY_URL);
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, APPLICATION_ID);
StreamsBuilder builder = new StreamsBuilder();
// nothing is returned here when an offset has already been committed for the group
KStream myStream = builder.stream("my-topic-in-avro-schema", Consumed.with(myKeySerde, myValueSerde));
KafkaStreams streams = new KafkaStreams(builder.build(),properties);
streams.start();
One way to do this would be to generate a random ConsumerGroup every time you start the stream application. Something like:
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID + currentTimestamp);
That way, the stream will start reading from "earliest" as you have set it already in auto.offset.reset.
By the way, you are setting the properties for group.id twice in your code...
This will help someone who is also facing the same issue: replace the application ID and group ID with a unique identifier, e.g. UUID.randomUUID().toString(), in the configuration properties. It should then fetch the messages from the beginning.
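A minimal sketch of that approach, using the property names from the question's code (note that in Kafka Streams the application.id is also used as the consumer group.id, so it is enough to randomize that):
String runId = UUID.randomUUID().toString(); // fresh id on every run, so no committed offsets exist
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-join-app-" + runId);
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // applies because the group is new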

Kafka producer is losing messages when broker is down

Given the following scenario:
I bring up zookeeper and a single kafka broker on my local and create "test" topic as described in the kafka quickstart: https://kafka.apache.org/quickstart
Then, I run a simple Java program that produces a message to the "test" topic every second. After some time I bring down my local Kafka broker and see that the producer continues producing messages; it doesn't throw any exception. Finally, I bring the Kafka broker up again; the producer is able to reconnect and continues producing messages, but all the messages that were produced during the broker's downtime are lost. The producer doesn't replay them when it detects the broker is healthy again.
How can I prevent this? I want the Kafka producer to replay those messages when it detects the broker is back online. Here is my producer config:
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("linger.ms", 0);
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());
The Kafka producer library has a retry mechanism built in, but it is turned off by default. Change the retries producer config to a value bigger than 0 (the default) to turn it on. You should also experiment with retry.backoff.ms and request.timeout.ms in order to customise producer retries.
Example Kafka Producer config with enabled retries:
retries=2147483647 //Integer.MAX_VALUE
retry.backoff.ms=1000
request.timeout.ms=305000 //5 minutes
max.block.ms=2147483647 //Integer.MAX_VALUE
You can find more information about those properties in Apache Kafka documentation.
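The same settings expressed in the Java producer configuration (values copied from the example above):
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1000);
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 305000);
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, Integer.MAX_VALUE);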
Since you're running just one broker, I'm afraid you won't be able to store messages when your broker is down.
However, it is strange that you don't get any exception/warning/errors when you bring your broker down.
I would expect a "Failed to update metadata" or "expiring messages" error, because when the producer sends messages to the broker(s) listed in the bootstrap.servers property, it first checks with ZooKeeper for the active controller (or leader) and partitions. So in your case, since you're running Kafka in stand-alone mode, when the broker is down the producer should fail to get the leader information and error out.
Could you please check what the following properties are set to:
request.timeout.ms
max.block.ms
and play around with these values (reducing them, maybe) and check the results?
One more option you might want to try out is to send messages to Kafka in a synchronous fashion (blocking send() method until the messages are received) and here's a code snippet that might help (taken from this documentation reference):
If you want to simulate a simple blocking call you can call the get() method immediately:
byte[] key = "key".getBytes();
byte[] value = "value".getBytes();
ProducerRecord<byte[],byte[]> record = new ProducerRecord<byte[],byte[]>("my-topic", key, value);
producer.send(record).get();
In this case, kafka should throw an exception if the messages are not sent successfully for any reason.
I hope this helps.

How can I send large messages with Kafka (over 15MB)?

I send String-messages to Kafka V. 0.8 with the Java Producer API.
If the message size is about 15 MB I get a MessageSizeTooLargeException.
I have tried to set message.max.bytes to 40 MB, but I still get the exception. Small messages work without problems.
(The exception appears in the producer; I don't have a consumer in this application.)
What can I do to get rid of this exception?
My example producer config
private ProducerConfig kafkaConfig() {
    Properties props = new Properties();
    props.put("metadata.broker.list", BROKERS);
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    props.put("request.required.acks", "1");
    props.put("message.max.bytes", "" + 1024 * 1024 * 40);
    return new ProducerConfig(props);
}
Error-Log:
4709 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 214 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
4869 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 217 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5035 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 220 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5198 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 223 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5305 [main] ERROR kafka.producer.async.DefaultEventHandler - Failed to send requests for topics datasift with correlation ids in [213,224]
kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
at kafka.producer.async.DefaultEventHandler.handle(Unknown Source)
at kafka.producer.Producer.send(Unknown Source)
at kafka.javaapi.producer.Producer.send(Unknown Source)
You need to adjust three (or four) properties:
Consumer side: fetch.message.max.bytes - this will determine the largest size of a message that can be fetched by the consumer.
Broker side: replica.fetch.max.bytes - this will allow for the replicas in the brokers to send messages within the cluster and make sure the messages are replicated correctly. If this is too small, then the message will never be replicated, and therefore, the consumer will never see the message because the message will never be committed (fully replicated).
Broker side: message.max.bytes - this is the largest size of the message that can be received by the broker from a producer.
Broker side (per topic): max.message.bytes - this is the largest size of the message the broker will allow to be appended to the topic. This size is validated pre-compression. (Defaults to broker's message.max.bytes.)
I found out the hard way about number 2 - you don't get ANY exceptions, messages, or warnings from Kafka, so be sure to consider this when you are sending large messages.
Minor changes required for Kafka 0.10 and the new consumer compared to laughing_man's answer:
Broker: No changes, you still need to increase the properties message.max.bytes and replica.fetch.max.bytes. message.max.bytes has to be equal to or smaller(*) than replica.fetch.max.bytes.
Producer: Increase max.request.size to send the larger message.
Consumer: Increase max.partition.fetch.bytes to receive larger messages.
(*) Read the comments to learn more about message.max.bytes<=replica.fetch.max.bytes
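In code, with the new clients, this boils down to something like the following (the 15 MB value is just an example):
// producer side
producerProps.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 15728640);

// consumer side
consumerProps.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 15728640);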
The answer from @laughing_man is quite accurate. But still, I want to add a recommendation which I learned from Kafka expert Stephane Maarek. We actively applied this solution in our live systems.
Kafka isn't meant to handle large messages.
Your API should use cloud storage (for example, AWS S3) and simply push a reference to S3 to Kafka, or any other message broker. You'll need to find a place to save your data, whether that's a network drive or something else entirely, but it shouldn't be a message broker.
If you don't want to proceed with the recommended and reliable solution above:
The default message max size is 1 MB (the setting in your brokers is called message.max.bytes). If you really need to, you could increase that size, and make sure to also increase the network buffers for your producers and consumers.
And if you decide to split your message, make sure each part has the exact same key so that it gets pushed to the same partition, and include a "part id" in the message content so that your consumer can fully reconstruct the message.
If the message is text-based, try compressing the data; that may reduce the size, but not magically.
Again, you have to use an external system to store that data and just push an external reference to Kafka. That is a very common and widely accepted architecture, and the one you should go with.
Keep in mind that Kafka works best when messages are huge in number, not in size.
Source: https://www.quora.com/How-do-I-send-Large-messages-80-MB-in-Kafka
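A sketch of that reference-passing ("claim check") pattern; the blobStore client, bucket, and topic names below are hypothetical placeholders for whatever storage API you use:
// 1. upload the large payload to external storage (S3, network drive, ...)
String bucket = "my-data-bucket";
String objectKey = "payloads/" + UUID.randomUUID();
blobStore.upload(bucket, objectKey, largePayloadBytes); // hypothetical storage client

// 2. publish only a small reference to Kafka
String reference = "s3://" + bucket + "/" + objectKey;
producer.send(new ProducerRecord<>("my-topic", recordKey, reference));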
The idea is to have the same message size supported from the Kafka producer to the Kafka broker and then by the Kafka consumer, i.e.
Kafka producer --> Kafka Broker --> Kafka Consumer
Suppose the requirement is to send 15 MB messages; then the producer, the broker, and the consumer, all three, need to be in sync.
Kafka Producer sends 15 MB --> Kafka Broker Allows/Stores 15 MB --> Kafka Consumer receives 15 MB
The setting therefore should be:
a) on Broker:
message.max.bytes=15728640
replica.fetch.max.bytes=15728640
b) on Consumer:
fetch.message.max.bytes=15728640
You need to override the following properties:
Broker configs ($KAFKA_HOME/config/server.properties)
replica.fetch.max.bytes
message.max.bytes
Consumer configs ($KAFKA_HOME/config/consumer.properties)
This step didn't work for me; I added the property to the consumer app instead and it worked fine.
fetch.message.max.bytes
Restart the server.
look at this documentation for more info:
http://kafka.apache.org/08/configuration.html
I think most of the answers here are somewhat outdated or not entirely complete.
To build on the answer of Sacha Vetter (with the update for Kafka 0.10), I'd like to provide some additional information and links to the official documentation.
Producer Configuration:
max.request.size (Link) has to be increased for files bigger than 1 MB, otherwise they are rejected
Broker/Topic configuration:
message.max.bytes (Link) may be set if one would like to increase the message size at the broker level. But, from the documentation: "This can be set per topic with the topic level max.message.bytes config."
max.message.bytes (Link) may be increased if only one topic should be able to accept larger files. The broker configuration does not need to be changed.
I'd always prefer a topic-scoped configuration, because I can configure the topic myself as a client of the Kafka cluster (e.g. with the admin client); I may not have any influence on the broker configuration itself.
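For example, raising max.message.bytes for a single topic from the admin client could look roughly like this (topic name and size are placeholders; the classes come from org.apache.kafka.clients.admin and org.apache.kafka.common.config, and the call should run in a method that declares throws Exception):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
try (Admin admin = Admin.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-large-topic");
    AlterConfigOp raiseLimit = new AlterConfigOp(
            new ConfigEntry("max.message.bytes", "15728640"), AlterConfigOp.OpType.SET);
    admin.incrementalAlterConfigs(
            Collections.singletonMap(topic, Collections.singletonList(raiseLimit)))
         .all().get();
}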
In the answers from above, some more configurations are mentioned as necessary:
replica.fetch.max.bytes (Link) (Broker config)
From the documentation: "This is not an absolute maximum, if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that progress can be made."
max.partition.fetch.bytes (Link) (Consumer config)
From the documentation: "Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress."
fetch.max.bytes (Link) (Consumer config; not mentioned above, but same category)
From the documentation: "Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress."
Conclusion: The configurations related to fetching messages do not have to be changed in order to process messages larger than the default values of those configurations (I tested this in a small setup). In that case the consumer may simply always get batches of size 1. However, two of the configurations from the first block do have to be set, as mentioned in the answers above.
This clarification is not about performance and is not a recommendation to set or not set these configurations. The best values have to be evaluated individually depending on the planned throughput and data structures.
One key thing to remember is that the message.max.bytes attribute must be kept in sync with the consumer's fetch.message.max.bytes property. The fetch size must be at least as large as the maximum message size; otherwise producers could send messages larger than the consumer can consume/fetch. It might be worth taking a look at it.
Which version of Kafka are you using? Also, provide some more details of the trace you are getting. Is there something like "... payload size of xxxx larger than 1000000" coming up in the log?
For people using landoop kafka:
You can pass the config values in the environment variables like:
docker run -d --rm -p 2181:2181 -p 3030:3030 -p 8081-8083:8081-8083 \
  -p 9581-9585:9581-9585 -p 9092:9092 \
  -e KAFKA_TOPIC_MAX_MESSAGE_BYTES=15728640 \
  -e KAFKA_REPLICA_FETCH_MAX_BYTES=15728640 \
  landoop/fast-data-dev:latest
This sets topic.max.message.bytes and replica.fetch.max.bytes on the broker.
And if you're using rdkafka then pass the message.max.bytes in the producer config like:
const producer = new Kafka.Producer({
'metadata.broker.list': 'localhost:9092',
'message.max.bytes': '15728640',
'dr_cb': true
});
Similarly, for the consumer,
const kafkaConf = {
"group.id": "librd-test",
"fetch.message.max.bytes":"15728640",
... .. }
Here is how I achieved successfully sending data up to 100mb using kafka-python==2.0.2:
Consumer:
consumer = KafkaConsumer(
...
max_partition_fetch_bytes=max_bytes,
fetch_max_bytes=max_bytes,
)
Producer (See final solution at the end):
producer = KafkaProducer(
...
max_request_size=KafkaSettings.MAX_BYTES,
)
Then:
producer.send(topic, value=data).get()
After sending data like this, the following exception appeared:
MessageSizeTooLargeError: The message is n bytes when serialized which is larger than the total memory buffer you have configured with the buffer_memory configuration.
Finally, I increased buffer_memory (default 32 MB) and the message got through to the other end.
producer = KafkaProducer(
...
max_request_size=KafkaSettings.MAX_BYTES,
buffer_memory=KafkaSettings.MAX_BYTES * 3,
)

KafkaSpout is not receiving anything from Kafka

I am trying to rig up a Kafka-Storm "Hello World" system. I have Kafka installed and running; when I send data with the Kafka producer, I can read it with the Kafka console consumer.
I took the Chapter 02 example from the "Getting Started With Storm" O'Reilly book and modified it to use KafkaSpout instead of a regular spout.
When I run the application, with data already pending in Kafka, nextTuple of the KafkaSpout doesn't get any messages - it goes in, tries to iterate over an empty managers list under the coordinator, and exits.
My environment is a fairly old Cloudera VM, with Storm 0.9 and Kafka-Storm-0.9(the latest), and Kafka 2.9.2-0.7.0.
This is how I defined the SpoutConfig and the topology:
String zookeepers = "localhost:2181";
SpoutConfig spoutConfig = new SpoutConfig(new SpoutConfig.ZkHosts(zookeepers, "/brokers"),
"gtest",
"/kafka", // zookeeper root path for offset storing
"KafkaSpout");
spoutConfig.forceStartOffsetTime(-1);
KafkaSpoutTester kafkaSpout = new KafkaSpoutTester(spoutConfig);
//Topology definition
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-reader", kafkaSpout, 1);
builder.setBolt("word-normalizer", new WordNormalizer())
.shuffleGrouping("word-reader");
builder.setBolt("word-counter", new WordCounter(),1)
.fieldsGrouping("word-normalizer", new Fields("word"));
//Configuration
Config conf = new Config();
conf.put("wordsFile", args[0]);
conf.setDebug(false);
//Topology run
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
cluster = new LocalCluster();
cluster.submitTopology("Getting-Started-Toplogie", conf, builder.createTopology());
Can someone please help me figure out why I am not receiving anything?
Thanks,
G.
If you've already consumed the messages, the spout is not supposed to read any more unless your producer produces new messages. That is because of the forceStartOffsetTime call with -1 in your code.
From the storm-contrib documentation:
Another very useful config in the spout is the ability to force the spout to rewind to a previous offset. You do forceStartOffsetTime on the spout config, like so:
spoutConfig.forceStartOffsetTime(-2);
It will choose the latest offset written around that timestamp to start consuming. You can force the spout to always start from the latest offset by passing in -1, and you can force it to start from the earliest offset by passing in -2.
What does your producer look like? A snippet would be useful. You can replace -1 with -2 and see if you receive anything; if your producer is fine, then you should be able to consume.
SpoutConfig spoutConf = new SpoutConfig(...);
spoutConf.startOffsetTime = kafka.api.OffsetRequest.LatestTime();
SpoutConfig spoutConfig = new SpoutConfig(new SpoutConfig.ZkHosts(zookeepers, "/brokers"),
"gtest", // name of topic used by producer & consumer
"/kafka", // zookeeper root path for offset storing
"KafkaSpout");
You are using the "gtest" topic for receiving the data. Make sure that you are sending data to this topic with your producer.
And in the bolt, print the tuple like this:
public void execute(Tuple tuple, BasicOutputCollector collector) {
System.out.println(tuple);
}
It should print the pending data in kafka.
I went through some grief getting storm and Kafka integrated. These are both fast moving and relatively young projects, so it can be hard getting working examples to jump start your development.
To help other developers (and hopefully get others contributing useful examples that I can use as well), I started a github project to house code snippets related to Storm/Kafka (and Esper) development.
You are welcome to check it out here >
https://github.com/buildlackey/cep
(click on the storm+kafka directory for a sample program that should get you up and running).
