Apache Storm Kafka Spout Lag Issue - java

I am building a Java Spring application using Storm 1.1.2 and Kafka 0.11 to be launched in a Docker container.
Everything in my topology works as planned but under a high load from Kafka, the Kafka lag increases more and more over time.
My KafkaSpoutConfig:
KafkaSpoutConfig<String, String> spoutConf =
    KafkaSpoutConfig.builder("kafkaContainerName:9092", "myTopic")
        .setProp(ConsumerConfig.GROUP_ID_CONFIG, "myGroup")
        .setProp(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, MyObjectDeserializer.class)
        .build();
Then my topology is as follows
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("stormKafkaSpout", new KafkaSpout<String,String>(spoutConf), 25);
builder.setBolt("routerBolt", new RouterBolt(),25).shuffleGrouping("stormKafkaSpout");
Config conf = new Config();
conf.setNumWorkers(10);
conf.put(Config.STORM_ZOOKEEPER_SERVERS, ImmutableList.of("zookeeper"));
conf.put(Config.STORM_ZOOKEEPER_PORT, 2181);
conf.put(Config.NIMBUS_SEEDS, ImmutableList.of("nimbus"));
conf.put(Config.NIMBUS_THRIFT_PORT, 6627);
System.setProperty("storm.jar", "/opt/storm.jar");
StormSubmitter.submitTopology("topologyId", conf, builder.createTopology());
The RouterBolt (which extends BaseRichBolt) does one very simple switch statement and then uses a local KafkaProducer object to send a new message to another topic. Like I said, everything compiles and the topology runs as expected, but under high load (3000 messages/s) the Kafka lag just piles up, which translates to low throughput for the topology.
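For reference, a minimal sketch of what such a RouterBolt might look like; the field names, routing keys, and target topics are assumptions, since the actual bolt isn't shown, and the value is treated as a plain String for simplicity (the real bolt uses a custom-deserialized object):
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class RouterBolt extends BaseRichBolt {
    private transient KafkaProducer<String, String> producer;
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // One producer per executor, created in prepare() because KafkaProducer is not serializable.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafkaContainerName:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producer = new KafkaProducer<>(props);
    }

    @Override
    public void execute(Tuple tuple) {
        // The storm-kafka-client spout emits "topic", "partition", "offset", "key", "value" by default.
        String key = tuple.getStringByField("key");
        String value = tuple.getStringByField("value");
        String targetTopic;
        switch (key == null ? "" : key) {              // hypothetical routing rule
            case "A":  targetTopic = "topicA"; break;
            case "B":  targetTopic = "topicB"; break;
            default:   targetTopic = "topicDefault"; break;
        }
        // send() is asynchronous; blocking on send(...).get() per tuple would cap throughput.
        producer.send(new ProducerRecord<>(targetTopic, key, value));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Nothing emitted downstream; the output goes straight to Kafka.
    }

    @Override
    public void cleanup() {
        if (producer != null) {
            producer.close();
        }
    }
}
The relevant point is that the producer is created once per executor and send() is fire-and-forget, so the bolt itself stays cheap, which matches the low execute/process latencies reported in the Storm UI below.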
I've tried disabling acking with
conf.setNumAckers(0);
and
conf.put(Config.TOPOLOGY_ACKER_EXECUTORS, 0);
but I guess it's not an acking issue.
I've seen on the Storm UI that the RouterBolt has an execution latency of 1.2 ms and a process latency of 0.03 ms under the high load, which leads me to believe the spout is the bottleneck. Also, the parallelism hint is 25 because there are 25 partitions of "myTopic". Thanks!

You may be affected by https://issues.apache.org/jira/browse/STORM-3102, which causes the spout to do a pretty expensive call on every emit. Please try upgrading to one of the fixed versions.
Edit: The fix isn't actually released yet. You can still try it out by building the spout from source, e.g. from https://github.com/apache/storm/tree/1.1.x-branch to get a 1.1.4 snapshot.

Related

Apache Storm - KafkaSpout not consuming messages from Kafka Topic

I am trying to integrate Kafka into a Storm topology using the code below, but unfortunately the KafkaSpout is not consuming messages from the Kafka topic. In the Storm UI, the emitted count stays at 0 forever.
String bootStrapServer = "10.20.10.238:9092";
String topic = "test.topic";
KafkaSpoutConfig.Builder spoutConfigBuilder = KafkaSpoutConfig.builder(bootStrapServer,topic);
spoutConfigBuilder.setProp(ConsumerConfig.RECEIVE_BUFFER_CONFIG,100*1024*1024);
spoutConfigBuilder.setProp(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG,100*1024*1024);
spoutConfigBuilder.setProcessingGuarantee(KafkaSpoutConfig.ProcessingGuarantee.AT_LEAST_ONCE);
Boolean readFromStart = true;
if(readFromStart) {
spoutConfigBuilder.setFirstPollOffsetStrategy(FirstPollOffsetStrategy.EARLIEST);
}
else {
spoutConfigBuilder.setFirstPollOffsetStrategy(FirstPollOffsetStrategy.LATEST);
}
KafkaSpout spout = new KafkaSpout(spoutConfigBuilder.build());
builder.setSpout("kafkaSpout", spout, 1);
// And a Bolt to see messages
builder.setBolt("fcBolt", new FcBolt(), 1).setNumTasks(1).shuffleGrouping("kafkaSpout");
But when I try to consume from the CLI, I can see all of the produced messages on the topic with the command below:
bin/kafka-console-consumer.sh --topic test.topic --from-beginning --bootstrap-server 10.20.10.238:9092
Picked up _JAVA_OPTIONS: -Xmx128000m
test
test
test1
....
Versions:
Storm : 2.2.0
Kafka : 2.13_2.6.0
With older versions it worked fine! I must have missed something in the newer version.
Any help appreciated. Thanks in advance!
Hard to know with what you have, so consider showing the rest of your code too.
But from what you do have, it does not appear that you are actually producing any events.
If you are trying to consume Kafka events in your spout for further processing, make sure you are actually subscribed to a topic that has events being produced to it; in that case you will not see the event output through the console consumer, since you are consuming them in Storm, not producing them.
If you are trying to produce Kafka events to the test topic through Storm and then consume them through the console consumer, make sure you are actually producing events in Storm.
Hope that puts you on the right path. I would suggest going over the base concepts of Kafka here: Kafka Introduction
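For the "make sure you are actually producing events" part, a minimal standalone producer is enough to verify the topic is receiving data. A sketch using the plain kafka-clients producer API (broker address and topic taken from the question):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TestTopicProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.20.10.238:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Blocking on get() makes failures (e.g. wrong broker address) visible immediately.
                producer.send(new ProducerRecord<>("test.topic", "key-" + i, "message-" + i)).get();
            }
        }
        // If these messages show up in the console consumer but the spout's emitted count stays 0,
        // the problem is on the Storm side rather than on the Kafka side.
    }
}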

Apache Storm Trident and Kafka Spout Integration

I am unable to find good documentation for correctly integrating Kafka with Apache Storm Trident. I tried to look into related questions previously posted here, but found no sufficient information.
I would like to connect Trident with Kafka as an OpaqueTridentKafkaSpout. Here is the sample code which is currently working:
GlobalPartitionInformation globalPartitionInformation = new GlobalPartitionInformation(properties.getProperty("topic", "mytopic"));
Broker brokerForPartition0 = new Broker("IP1",9092);
Broker brokerForPartition1 = new Broker("IP2", 9092);
Broker brokerForPartition2 = new Broker("IP3:9092");
globalPartitionInformation.addPartition(0, brokerForPartition0);//mapping from partition 0 to brokerForPartition0
globalPartitionInformation.addPartition(1, brokerForPartition1);//mapping from partition 1 to brokerForPartition1
globalPartitionInformation.addPartition(2, brokerForPartition2);//mapping from partition 2 to brokerForPartition2
StaticHosts staticHosts = new StaticHosts(globalPartitionInformation);
TridentKafkaConfig tridentKafkaConfig = new TridentKafkaConfig(staticHosts, properties.getProperty("topic", "mytopic"));
tridentKafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
OpaqueTridentKafkaSpout kafkaSpout = new OpaqueTridentKafkaSpout(tridentKafkaConfig);
With this I am able to generate streams for my topology as shown in the code below
TridentTopology topology = new TridentTopology();
Stream analyticsStream = topology.newStream("spout", kafkaSpout).parallelismHint(Integer.valueOf(properties.getProperty("spout", "6")));
Though I have provided a parallelism hint matching my partitions, only 1 executor of the Kafka spout is running, so I am unable to scale it well.
Can anyone please guide me towards better ways of integrating Apache Storm Trident (2.0.0) with Apache Kafka (1.0), each running as a 3-node cluster?
Also, as soon as it finishes reading from Kafka, I am getting these logs constantly
2018-04-09 14:17:34.119 o.a.s.k.KafkaUtils Thread-15-spout-spout-executor[79 79] [INFO] Metrics Tick: Not enough data to calculate spout lag. 2018-04-09 14:17:34.129 o.a.s.k.KafkaUtils Thread-21-spout-spout-executor[88 88] [INFO] Metrics Tick: Not enough data to calculate spout lag.
And in the Storm UI, I can see acks for the messages above. Any suggestions for suppressing these metrics ticks?
If you are on Storm 2.0.0 anyway, I think you should switch to the storm-kafka-client Trident spout. The storm-kafka module is only intended to support older Kafka versions, since the underlying Kafka API (SimpleConsumer) is being removed. The new module supports Kafka from 0.10.0.0 and forward.
You can find an example Trident topology for the new spout here https://github.com/apache/storm/blob/master/examples/storm-kafka-client-examples/src/main/java/org/apache/storm/kafka/trident/TridentKafkaClientTopologyNamedTopics.java.
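Roughly, the wiring for the new Trident spout looks like the sketch below. This is an approximation: the exact config builder class differs between 2.x releases (the linked example is the authoritative reference for your version), so treat the class names and settings as assumptions.
// Sketch only: storm-kafka-client Trident spout wiring; check the linked example for the exact API of your release.
KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
        .builder("IP1:9092,IP2:9092,IP3:9092", "mytopic")
        .setProp(ConsumerConfig.GROUP_ID_CONFIG, "tridentGroup")
        .setFirstPollOffsetStrategy(FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
        .build();

TridentTopology topology = new TridentTopology();
Stream analyticsStream = topology
        .newStream("kafkaSpout", new KafkaTridentSpoutOpaque<>(spoutConfig))
        .parallelismHint(3); // one executor per Kafka partition is the useful maximum; extra executors sit idle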

Kafka 1.0 Streaming API: message consumption from partitions get delayed

Recently, I switched our streaming app from Spark Streaming 2.1 to the new Kafka Streams API (1.0), with Kafka broker 0.11.0.0.
I have implemented my own Processor class, and in the process method I just print the message content.
I have a Kafka cluster of 3 machines, and the topic I am hooking on has 300 partitions.
I ran the streaming app with 100 threads, on a machine with 32 GB of RAM and 8 cores.
My problem is that in some cases I get the messages as soon as they reach the Kafka topic/partition, and in other cases I get them only 10-15 minutes after they reached the topic. I don't know why!
I used the below command line to track the lag on the kafka topic for the group.id for the streaming app.
./bin/kafka-run-class.sh kafka.admin.ConsumerGroupCommand --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092 --new-consumer --describe --group kf_streaming_gp_id
but unfortunately it does not consistently give accurate results, or sometimes gives no result at all. Does anybody know why?
Is there something I missed in the streaming app so that I can consistently read messages as soon as they reach the partitions?
Are there any consumer properties that fix such a problem?
My Kafka Streams app structure is as below:
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "kf_streaming_gp_id");
config.put(StreamsConfig.CLIENT_ID_CONFIG, "kf_streaming_gp_id");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092,kafka2:9092,kafka3:9092");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, DocumentSerde.class);
config.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, CustomTimeExtractor.class);
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 100);
StreamsBuilder builder = new StreamsBuilder(); // declaration implied by builder.build() below
KStream<String, Document> topicStreams = builder.stream(sourceTopic);
topicStreams.process(() -> new DocumentProcessor(appName, environment, dimensions, vector, sinkTopic));
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
I figured out what was the problem in my case.
It turned out that some threads were stuck doing highly CPU-intensive work, which stopped other threads from consuming messages; that is why I saw such bursts. When I removed this CPU-intensive logic, everything was super fast, and messages got to the streaming job as soon as they reached the Kafka topic.
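As a side note on the lag-tracking part of the question: if ConsumerGroupCommand is unreliable, committed offsets can also be compared against end offsets programmatically. A sketch using only the plain consumer API available in kafka-clients 1.0; the group id is taken from the question, and sourceTopic stands in for the actual input topic name:
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LagCheck {
    public static void main(String[] args) {
        String sourceTopic = args[0]; // the streaming app's input topic

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092,kafka2:9092,kafka3:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "kf_streaming_gp_id");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // This consumer never subscribes or polls, so it does not join the group or trigger a rebalance.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor(sourceTopic).stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long committedOffset = committed == null ? 0L : committed.offset();
                System.out.println(tp + " lag=" + (endOffsets.get(tp) - committedOffset));
            }
        }
    }
}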

Apache Storm Bolt task is not receiving message after some time

We have a Storm topology in which we have configured one spout and two bolts. The spout queries data from the DB continuously and sends the tuples to the first bolt for some processing. The first bolt does some processing and sends tuples to the second bolt, which calls a third-party web service and sends the data. What is happening is that after some time the last bolt stops getting any tuples, and if we restart the topology it works fine again. Only the last bolt has the problem; the spout and the first bolt are running fine. I am not using the acking framework, and I have configured only one worker in this case.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("messageListenrSpout", new MessageListenerSpout(), 1);
builder.setBolt("processorBolt", new ProcessorBolt(), 20).shuffleGrouping("messageListenrSpout");
builder.setBolt("notifierBolt", new NotifierBolt(),40).shuffleGrouping("processorBolt");
Config conf = new Config();
conf.put(Config.TOPOLOGY_SLEEP_SPOUT_WAIT_STRATEGY_TIME_MS, 10000);
//conf.setMessageTimeoutSecs(600);
conf.setDebug(true);
StormSubmitter.submitTopology(TOPOLOGY, conf, builder.createTopology());
It's quite likely that you're having problems with a backlog of tuples causing timeouts. Try increasing the parallelism hint for the 2nd bolt, since it sounds like that one's process time is much longer than that of the first bolt (that's why there would be a backlog into the 2nd bolt). If you're running this topology on a cluster, look at the Storm UI to see the specifics.
Guys, when I was debugging my topology, I found that if, say, the spout is sending messages fast but a bolt is processing them slowly, the messages queue up in the LMAX Disruptor queue. The spout task then waits for that queue to empty. If you take a thread dump, you will find threads in the TIMED_WAITING state. So we need to configure the topology in such a way that its inflow and outflow stay balanced, as in the sketch below.
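Concretely, both suggestions (more executors for the slow bolt, and bounding how many tuples the spout keeps in flight) translate to something like this sketch, building on the topology from the question; the numbers are placeholders to tune, and note that max spout pending only throttles the spout when tuples are anchored and acking is enabled:
// Hypothetical adjustments, reusing the classes and names from the question.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("messageListenrSpout", new MessageListenerSpout(), 1);
builder.setBolt("processorBolt", new ProcessorBolt(), 20).shuffleGrouping("messageListenrSpout");
// Give the slow web-service bolt more executors so tuples do not pile up in its queue.
builder.setBolt("notifierBolt", new NotifierBolt(), 100).shuffleGrouping("processorBolt");

Config conf = new Config();
// Flow control: with acking enabled, the spout is allowed at most this many in-flight tuples.
conf.setNumAckers(1);
conf.setMaxSpoutPending(500);
conf.setMessageTimeoutSecs(600); // give the slow external calls time before tuples are replayed
StormSubmitter.submitTopology(TOPOLOGY, conf, builder.createTopology());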

KafkaSpout is not receiving anything from Kafka

I am trying to rig up a Kafka-Storm "Hello World" system. I have Kafka installed and running, and when I send data with the Kafka producer I can read it with the Kafka console consumer.
I took the Chapter 02 example from the "Getting Started With Storm" O'Reilly book, and modified it to use KafkaSpout instead of a regular spout.
When I run the application, with data already pending in kafka, nextTuple of the KafkaSpout doesn't get any messages - it goes in, tries to iterate over an empty managers list under the coordinator, and exits.
My environment is a fairly old Cloudera VM, with Storm 0.9 and Kafka-Storm-0.9 (the latest), and Kafka 2.9.2-0.7.0.
This is how I defined the SpoutConfig and the topology:
String zookeepers = "localhost:2181";
SpoutConfig spoutConfig = new SpoutConfig(new SpoutConfig.ZkHosts(zookeepers, "/brokers"),
"gtest",
"/kafka", // zookeeper root path for offset storing
"KafkaSpout");
spoutConfig.forceStartOffsetTime(-1);
KafkaSpoutTester kafkaSpout = new KafkaSpoutTester(spoutConfig);
//Topology definition
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-reader", kafkaSpout, 1);
builder.setBolt("word-normalizer", new WordNormalizer())
.shuffleGrouping("word-reader");
builder.setBolt("word-counter", new WordCounter(),1)
.fieldsGrouping("word-normalizer", new Fields("word"));
//Configuration
Config conf = new Config();
conf.put("wordsFile", args[0]);
conf.setDebug(false);
//Topology run
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
cluster = new LocalCluster();
cluster.submitTopology("Getting-Started-Toplogie", conf, builder.createTopology());
Can someone please help me figure out why I am not receiving anything?
Thanks,
G.
If you've already consumed the messages, the spout is not supposed to read any more unless your producer produces new messages. That is because of the forceStartOffsetTime call with -1 in your code.
From the storm-contrib documentation:
Another very useful config in the spout is the ability to force the spout to rewind to a previous offset. You do forceStartOffsetTime on the spout config, like so:
spoutConfig.forceStartOffsetTime(-2);
It will choose the latest offset written around that timestamp to start consuming. You can force the spout to always start from the latest offset by passing in -1, and you can force it to start from the earliest offset by passing in -2.
What does your producer look like? It would be useful to have a snippet. You can replace -1 with -2 and see if you receive anything; if your producer is fine, then you should be able to consume.
SpoutConfig spoutConf = new SpoutConfig(...)
spoutConf.startOffsetTime = kafka.api.OffsetRequest.LatestTime();
SpoutConfig spoutConfig = new SpoutConfig(new SpoutConfig.ZkHosts(zookeepers, "/brokers"),
"gtest", // name of topic used by producer & consumer
"/kafka", // zookeeper root path for offset storing
"KafkaSpout");
You are using "gtest" topic for receiving the data. Make sure that you are sending data from this topic by producer.
And in the bolt, print that tuple like that
public void execute(Tuple tuple, BasicOutputCollector collector) {
System.out.println(tuple);
}
It should print the data pending in Kafka.
I went through some grief getting storm and Kafka integrated. These are both fast moving and relatively young projects, so it can be hard getting working examples to jump start your development.
To help other developers (and hopefully get others contributing useful examples that I can use as well), I started a github project to house code snippets related to Storm/Kafka (and Esper) development.
You are welcome to check it out here:
https://github.com/buildlackey/cep
(click on the storm+kafka directory for a sample program that should get you up and running).
