We have a Storm topology configured with one spout and two bolts. The spout continuously queries data from a DB and sends tuples to the first bolt. The first bolt does some processing and sends tuples to the second bolt, which calls a third-party web service and sends the data. After some time, the last bolt stops receiving tuples; if we restart the topology, it works fine again. Only the last bolt has this problem. The spout and the first bolt keep running fine. I am not using the acking framework, and I have configured only one worker in this case.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("messageListenrSpout", new MessageListenerSpout(), 1);
builder.setBolt("processorBolt", new ProcessorBolt(), 20).shuffleGrouping("messageListenrSpout");
builder.setBolt("notifierBolt", new NotifierBolt(),40).shuffleGrouping("processorBolt");
Config conf = new Config();
conf.put(Config.TOPOLOGY_SLEEP_SPOUT_WAIT_STRATEGY_TIME_MS, 10000);
//conf.setMessageTimeoutSecs(600);
conf.setDebug(true);
StormSubmitter.submitTopology(TOPOLOGY, conf, builder.createTopology());
It's quite likely that you're having problems with a backlog of tuples causing timeouts. Try increasing the parallelism hint for the 2nd bolt since it sounds like that one's process time is much longer than that of the first bolt (that's why there would be a backlog into the 2nd bolt). If you're running this topology on the cluster look at the Storm UI to see the specifics.
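As a rough way to pick a parallelism hint, you can compare how fast tuples arrive at the bolt with how long one call takes. This is only a back-of-the-envelope sketch: the class name, the rate, and the latency below are hypothetical, not measurements from the topology in the question.

```java
// Back-of-the-envelope sizing for a slow bolt (all figures are hypothetical).
class ParallelismEstimate {

    // Minimum number of executors needed so that, on average, the bolt keeps up:
    // arrival rate (tuples/s) times service time (ms), converted to seconds, rounded up.
    static int minExecutors(double tuplesPerSecond, double serviceTimeMs) {
        double busyExecutors = tuplesPerSecond * serviceTimeMs / 1000.0;
        return (int) Math.ceil(busyExecutors);
    }

    public static void main(String[] args) {
        // Assumed: 500 tuples/s reach the bolt and each web-service call takes 200 ms.
        System.out.println(minExecutors(500, 200)); // prints 100
    }
}
```

If the estimate is well above the current hint, a backlog in front of that bolt is to be expected. The Storm UI figures are the real source for the two inputs.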
When I was debugging my topology, I found that if the spout emits messages quickly but the bolt processes them slowly, the messages queue up in the LMAX Disruptor queue, and the spout task then waits for that queue to empty. If you take a thread dump, you will find threads in the TIMED_WAITING state. So we need to configure the topology so that its inflow and outflow stay balanced.
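The effect can be reproduced in plain Java with a bounded queue. This is only an analogy, assuming a fixed-capacity queue and no consumer; it is not Storm's Disruptor queue, and the class and method names are made up for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Plain-Java analogy of a fast producer feeding a slow consumer through a
// bounded queue. This is not Storm's Disruptor queue, only the same idea.
class BoundedQueueSketch {

    // Tries to enqueue `count` items, waiting up to `waitMs` for each one.
    // Returns how many items were accepted before the queue stayed full.
    static int produce(BlockingQueue<Integer> queue, int count, long waitMs) {
        int accepted = 0;
        try {
            for (int i = 0; i < count; i++) {
                // offer() parks the calling thread (TIMED_WAITING) while the queue is full.
                if (!queue.offer(i, waitMs, TimeUnit.MILLISECONDS)) {
                    break;
                }
                accepted++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return accepted;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4);
        // Nothing drains the queue here, so only 4 of the 10 items fit.
        System.out.println("accepted=" + produce(queue, 10, 50));
        System.out.println("remaining capacity=" + queue.remainingCapacity());
    }
}
```

A thread dump taken while `produce` is waiting shows the producer thread parked in a timed wait, which matches the symptom described above.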
Related
I have a Java web application with a scheduled background job which sends a Kafka message.
I am using the default producer configuration and the job is very simple.
I am not assigning a producer ID, so on each execution a new KafkaProducer is created with a new dynamic ID.
When testing, I noticed that the first execution of my job (after my application is up) always takes around 2 s, but only a few ms in subsequent executions.
I do not have any application cache, and I create a new instance of KafkaProducer on each execution.
Is there an explanation for this? Is there any static cache in the Kafka producer API? (I am always sending to the same topic and not specifying the partition in my producer record.)
I have a requirement to poll a Hazelcast (client mode) queue with a retry option (10 attempts) on exception. I was expecting Camel polling and processing to be multi-threaded, but it wasn't. While a message is being retried after an exception, any new messages on the queue pile up and are picked up for processing only after the first one completes. Is there any option for parallel processing (concurrent consumers)? I have added concurrentConsumers and poolSize as query parameters, but that didn't work.
What I have tried is:
fromF("hazelcast-queue://FOO?concurrentConsumers=5&hazelcastInstance=#hazelcastInstance&poolSize=10&queueConsumerMode=Poll").to("direct:testPoll");
from("direct:testPoll")
.log(LoggingLevel.DEBUG,":::>:Camel[${routeId}] consumes")
.onException(Exception.class)
.maximumRedeliveries(maxAttempt)
.delayPattern(delayPattern)
.maximumRedeliveryDelay(maxDelay)
.handled(true)
.logExhausted(false)
.end()
.bean("processTestPoll").log(INFO,"${body}").end();
Error:
There are 1 parameters that couldn't be set on the endpoint. Check the uri if the parameters are spelt correctly and that they are properties of the endpoint. Unknown parameters=[{concurrentConsumers=10}]
Your help will be really appreciated. Thanks in advance.
What you are trying to achieve can be done with SEDA in two different ways:
Generic Way
You can send your messages to a SEDA endpoint and consume them concurrently as follows:
fromF("hazelcast-%sFOO?hazelcastInstance=#hazelcastInstance&queueConsumerMode=Poll",
HazelcastConstants.QUEUE_PREFIX)
.to("seda:process");
from("seda:process?concurrentConsumers=5")
.log("Processing: ${threadName} ${body}");
In the previous example, the Hazelcast Queue FOO is polled by one thread that puts the messages into the SEDA process and the SEDA process is consumed concurrently by 5 threads.
More details about concurrent consumers with the SEDA component
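For intuition, the same hand-off can be sketched without Camel: one thread stages messages in an in-memory queue and a fixed pool of consumers drains it. This is a plain-Java analogy under that assumption, not how the SEDA component is implemented, and `SedaSketch` and `consumeConcurrently` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Plain-Java analogy of "one poller, N concurrent consumers".
class SedaSketch {

    // Stages all messages in an in-memory queue from the calling thread and
    // lets `consumers` worker threads process them. Returns the distinct
    // messages that were processed.
    static Set<String> consumeConcurrently(List<String> messages, int consumers) {
        BlockingQueue<String> staging = new LinkedBlockingQueue<>();
        Set<String> processed = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(messages.size());
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String message = staging.take(); // blocks until work arrives
                        processed.add(message);          // stand-in for real processing
                        done.countDown();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();  // exit quietly on shutdown
                }
            });
        }
        staging.addAll(messages); // the single "poller" hands messages over
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow(); // interrupts the workers blocked in take()
        }
        return processed;
    }

    public static void main(String[] args) {
        Set<String> out = consumeConcurrently(Arrays.asList("a", "b", "c", "d"), 5);
        System.out.println("processed " + out.size() + " messages");
    }
}
```

Because a slow or retrying message only occupies one worker, the other workers keep draining the queue, which is the behaviour asked for in the question.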
Specific Way
As you proposed in your deleted answer, you can also implement it directly using the specific SEDA endpoint for Hazelcast as follows:
fromF("hazelcast-%sFOO?hazelcastInstance=#hazelcastInstance&concurrentConsumers=5",
HazelcastConstants.SEDA_PREFIX)
.log("Processing: ${threadName} ${body}");
In the previous example, the Hazelcast Queue FOO is consumed concurrently by 5 threads.
More details about the Hazelcast SEDA endpoint.
Hi, I am creating a topology using Apache Storm in which my spout collects data from a Kafka topic and sends it to a bolt.
I do some validation on the tuple and emit a stream again for another bolt.
Now the issue is that my second bolt, which consumes the stream of the first bolt, overrides the method prepare(Map<String, Object> map, TopologyContext topologyContext, OutputCollector outputCollector),
which is being executed repeatedly, roughly every 2 seconds.
Code for topology is
topologyBuilder.setBolt("abc",new ValidationBolt()).shuffleGrouping(configurations.SPOUT_ID);
topologyBuilder.setBolt("TEST",new TestBolt()).shuffleGrouping("abc",Utils.VALIDATED_STREAM);
Code for First bolt "abc" is
@Override
public void execute(Tuple tuple) {
String document = String.valueOf(tuple.getValue(4));
if (Utils.isJSONValid(document)) {
outputCollector.emit(Utils.VALIDATED_STREAM,new Values(document));
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declareStream(Utils.VALIDATED_STREAM,new Fields("document"));
}
While I was searching I found
The prepare method is called when the bolt is initialised and is
similar to the open method in spout. It is called only once for the bolt.
It gets the configuration for the bolt and also the context of the bolt.
The collector is used to emit or output the tuples from this bolt.
Link to public gist for log
Storm topology log
Your log shows you are using LocalCluster. It is a testing/demo tool; don't use it for production workloads. Instead, set up a real distributed cluster.
Regarding what is happening:
When you run topologies in a LocalCluster, Storm simulates a real cluster by just running all the components (Nimbus, Supervisors and workers) as threads in a single JVM. Your log shows these lines:
20:14:12.451 [SLOT_1027] INFO o.a.s.ProcessSimulator - Begin killing process 2ea97301-24c9-4c1a-bcba-61008693971a
20:14:12.451 [SLOT_1027] INFO o.a.s.d.w.Worker - Shutting down worker smart-transactional-data-1-1566571315 72bbf510-c342-4385-9599-0821a2dee94e 1027
20:14:15.518 [SLOT_1027] INFO o.a.s.d.s.Slot - STATE running msInState: 33328 topo:smart-transactional-data-1-1566571315 worker:2ea97301-24c9-4c1a-bcba-61008693971a -> kill-blob-update msInState: 3001 topo:smart-transactional-data-1-1566571315 worker:2ea97301-24c9-4c1a-bcba-61008693971a
20:14:15.540 [SLOT_1027] INFO o.a.s.d.w.Worker - Launching worker for smart-transactional-data-1-1566571315
The LocalCluster is shutting down one of the simulated workers, because one of the blobs (e.g. topology jar, topology configuration, other types of shared files, see more at https://storm.apache.org/releases/2.0.0/distcache-blobstore.html) in the blobstore changed. Normally when this happens in a real cluster, the worker JVM will be killed, the blob will be updated and the worker will restart. Since you are using LocalCluster, it just kills the worker thread and restarts it. This is why you are seeing multiple invocations of prepare.
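The "once per bolt instance, but again after every worker restart" behaviour can be modelled with a toy lifecycle. This is a sketch only: `ToyBolt` and `launchWorker` are hypothetical names, not Storm classes, and the model assumes that each worker start builds a fresh bolt instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of a bolt lifecycle; ToyBolt and launchWorker are hypothetical
// names, not Storm classes.
class PrepareLifecycleSketch {

    // Counts prepare() calls across all instances in this JVM.
    static final AtomicInteger PREPARE_CALLS = new AtomicInteger();

    static class ToyBolt {
        int prepareCallsOnThisInstance = 0;

        void prepare() {
            prepareCallsOnThisInstance++;
            PREPARE_CALLS.incrementAndGet();
        }
    }

    // A worker (re)start builds a fresh bolt instance and prepares it once.
    static ToyBolt launchWorker() {
        ToyBolt bolt = new ToyBolt();
        bolt.prepare();
        return bolt;
    }

    public static void main(String[] args) {
        ToyBolt first = launchWorker();
        ToyBolt afterRestart = launchWorker(); // simulated kill-and-restart
        System.out.println("per instance: " + first.prepareCallsOnThisInstance
                + ", " + afterRestart.prepareCallsOnThisInstance);
        System.out.println("in this JVM: " + PREPARE_CALLS.get());
    }
}
```

Each instance sees exactly one prepare call, but a log that covers the whole JVM shows one call per restart, which is what the LocalCluster log looks like.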
I am building a Java Spring application using Storm 1.1.2 and Kafka 0.11 to be launched in a Docker container.
Everything in my topology works as planned but under a high load from Kafka, the Kafka lag increases more and more over time.
My KafkaSpoutConfig:
KafkaSpoutConfig<String,String> spoutConf =
KafkaSpoutConfig.builder("kafkaContainerName:9092", "myTopic")
.setProp(ConsumerConfig.GROUP_ID_CONFIG, "myGroup")
.setProp(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, MyObjectDeserializer.class)
.build();
Then my topology is as follows
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("stormKafkaSpout", new KafkaSpout<String,String>(spoutConf), 25);
builder.setBolt("routerBolt", new RouterBolt(),25).shuffleGrouping("stormKafkaSpout");
Config conf = new Config();
conf.setNumWorkers(10);
conf.put(Config.STORM_ZOOKEEPER_SERVERS, ImmutableList.of("zookeeper"));
conf.put(Config.STORM_ZOOKEEPER_PORT, 2181);
conf.put(Config.NIMBUS_SEEDS, ImmutableList.of("nimbus"));
conf.put(Config.NIMBUS_THRIFT_PORT, 6627);
System.setProperty("storm.jar", "/opt/storm.jar");
StormSubmitter.submitTopology("topologyId", conf, builder.createTopology());
The RouterBolt (which extends BaseRichBolt) does one very simple switch statement and then uses a local KafkaProducer object to send a new message to another topic. Like I said, everything compiles and the topology runs as expected but under a high load (3000 messages/s), the Kafka lag just piles up equating to low throughput for the topology.
I've tried disabling acking with
conf.setNumAckers(0);
and
conf.put(Config.TOPOLOGY_ACKER_EXECUTORS, 0);
but I guess it's not an acking issue.
I've seen on the Storm UI that the RouterBolt has an execution latency of 1.2 ms and a process latency of 0.03 ms under the high load, which leads me to believe the spout is the bottleneck. Also, the parallelism hint is 25 because there are 25 partitions of "myTopic". Thanks!
You may be affected by https://issues.apache.org/jira/browse/STORM-3102, which causes the spout to do a pretty expensive call on every emit. Please try upgrading to one of the fixed versions.
Edit: The fix isn't actually released yet. You might still want to try out the fix by building the spout from source using e.g. https://github.com/apache/storm/tree/1.1.x-branch to build a 1.1.4 snapshot.
I am trying to rig up a Kafka-Storm "Hello World" system. I have Kafka installed and running; when I send data with the Kafka producer, I can read it with the Kafka console consumer.
I took the Chapter 02 example from the "Getting Started With Storm" O'Reilly book, and modified it to use KafkaSpout instead of a regular spout.
When I run the application, with data already pending in kafka, nextTuple of the KafkaSpout doesn't get any messages - it goes in, tries to iterate over an empty managers list under the coordinator, and exits.
My environment is a fairly old Cloudera VM, with Storm 0.9, Kafka-Storm-0.9 (the latest), and Kafka 2.9.2-0.7.0.
This is how I defined the SpoutConfig and the topology:
String zookeepers = "localhost:2181";
SpoutConfig spoutConfig = new SpoutConfig(new SpoutConfig.ZkHosts(zookeepers, "/brokers"),
"gtest",
"/kafka", // zookeeper root path for offset storing
"KafkaSpout");
spoutConfig.forceStartOffsetTime(-1);
KafkaSpoutTester kafkaSpout = new KafkaSpoutTester(spoutConfig);
//Topology definition
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-reader", kafkaSpout, 1);
builder.setBolt("word-normalizer", new WordNormalizer())
.shuffleGrouping("word-reader");
builder.setBolt("word-counter", new WordCounter(),1)
.fieldsGrouping("word-normalizer", new Fields("word"));
//Configuration
Config conf = new Config();
conf.put("wordsFile", args[0]);
conf.setDebug(false);
//Topology run
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
cluster = new LocalCluster();
cluster.submitTopology("Getting-Started-Toplogie", conf, builder.createTopology());
Can someone please help me figure out why I am not receiving anything?
Thanks,
G.
If you've already consumed the message, the spout is not supposed to read anything more unless your producer produces new messages. This is because of the forceStartOffsetTime call with -1 in your code.
From the storm-contrib documentation:
Another very useful config in the spout is the ability to force the spout to rewind to a previous offset. You do forceStartOffsetTime on the spout config, like so:
spoutConfig.forceStartOffsetTime(-2);
It will choose the latest offset written around that timestamp to start consuming. You can force the spout to always start from the latest offset by passing in -1, and you can force it to start from the earliest offset by passing in -2.
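To make the difference between -1 and -2 concrete, here is a small model of how the start position falls out of the two special values. It is a sketch, not the spout's actual code: the class and method names are hypothetical, `latestOffset` is assumed to be the position of the next message to be written, and real timestamp lookups are not modelled.

```java
// Toy model of the two special start-offset values described above.
// Names are hypothetical; this is not the KafkaSpout implementation.
class StartOffsetSketch {
    static final long LATEST_TIME = -1L;   // start after everything already in the log
    static final long EARLIEST_TIME = -2L; // start from the oldest retained message

    // Returns the offset the consumer would start reading from.
    static long resolveStartOffset(long startOffsetTime, long earliestOffset, long latestOffset) {
        if (startOffsetTime == LATEST_TIME) {
            return latestOffset;
        }
        if (startOffsetTime == EARLIEST_TIME) {
            return earliestOffset;
        }
        throw new IllegalArgumentException("timestamp lookup is not modelled in this sketch");
    }

    // Number of already-written messages the consumer will see on startup.
    static long pendingMessages(long startOffsetTime, long earliestOffset, long latestOffset) {
        return latestOffset - resolveStartOffset(startOffsetTime, earliestOffset, latestOffset);
    }

    public static void main(String[] args) {
        // Assumed log state: offsets 0..41 already written, so the next offset is 42.
        System.out.println(pendingMessages(LATEST_TIME, 0, 42));   // prints 0
        System.out.println(pendingMessages(EARLIEST_TIME, 0, 42)); // prints 42
    }
}
```

With -1, data that was already in the topic when the topology started is skipped, so nextTuple has nothing to emit until the producer writes something new.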
What does your producer look like? It would be useful to have a snippet. You can replace -1 with -2 and see if you receive anything; if your producer is fine, you should be able to consume.
SpoutConfig spoutConf = new SpoutConfig(...)
spoutConf.startOffsetTime = kafka.api.OffsetRequest.LatestTime();
SpoutConfig spoutConfig = new SpoutConfig(new SpoutConfig.ZkHosts(zookeepers, "/brokers"),
"gtest", // name of topic used by producer & consumer
"/kafka", // zookeeper root path for offset storing
"KafkaSpout");
You are using the "gtest" topic for receiving the data. Make sure that your producer is sending data to this topic.
And in the bolt, print the tuple like this:
public void execute(Tuple tuple, BasicOutputCollector collector) {
System.out.println(tuple);
}
It should print the pending data in kafka.
I went through some grief getting storm and Kafka integrated. These are both fast moving and relatively young projects, so it can be hard getting working examples to jump start your development.
To help other developers (and hopefully get others contributing useful examples that I can use as well), I started a github project to house code snippets related to Storm/Kafka (and Esper) development.
You are welcome to check it out here >
https://github.com/buildlackey/cep
(click on the storm+kafka directory for a sample program that should get you up and running).