prepare method executing multiple times - java

Hi, I am creating a topology using Apache Storm in which my spout is collecting data from a Kafka topic and sending it to a bolt.
I am doing some validation over the tuple and emitting a stream again for another bolt.
Now the issue is that my second bolt, which consumes the stream of the first bolt, has an overridden method prepare(Map<String, Object> map, TopologyContext topologyContext, OutputCollector outputCollector)
which is executing roughly every 2 seconds.
The code for the topology is:
topologyBuilder.setBolt("abc",new ValidationBolt()).shuffleGrouping(configurations.SPOUT_ID);
topologyBuilder.setBolt("TEST",new TestBolt()).shuffleGrouping("abc",Utils.VALIDATED_STREAM);
The code for the first bolt "abc" is:
@Override
public void execute(Tuple tuple) {
    String document = String.valueOf(tuple.getValue(4));
    if (Utils.isJSONValid(document)) {
        outputCollector.emit(Utils.VALIDATED_STREAM, new Values(document));
    }
}

@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declareStream(Utils.VALIDATED_STREAM, new Fields("document"));
}
While I was searching, I found:
The prepare method is called when the bolt is initialised and is
similar to the open method in spout. It is called only once for the bolt.
It gets the configuration for the bolt and also the context of the bolt.
The collector is used to emit or output the tuples from this bolt.
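A quick way to confirm how often prepare really runs is to log each call; the following is a diagnostic sketch, not the actual TestBolt code (which is not shown above):
@Override
public void prepare(Map<String, Object> map, TopologyContext topologyContext, OutputCollector outputCollector) {
    this.outputCollector = outputCollector;
    // Diagnostic only: shows which task/thread prepare runs on and how often.
    System.out.println("prepare called for task " + topologyContext.getThisTaskId()
            + " on thread " + Thread.currentThread().getName());
}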
Link to public gist for log
Storm topology log

Your log shows you are using LocalCluster. It is a testing/demo tool; don't use it for production workloads. Instead, set up a real distributed cluster.
Regarding what is happening:
When you run topologies in a LocalCluster, Storm simulates a real cluster by just running all the components (Nimbus, Supervisors and workers) as threads in a single JVM. Your log shows these lines:
20:14:12.451 [SLOT_1027] INFO o.a.s.ProcessSimulator - Begin killing process 2ea97301-24c9-4c1a-bcba-61008693971a
20:14:12.451 [SLOT_1027] INFO o.a.s.d.w.Worker - Shutting down worker smart-transactional-data-1-1566571315 72bbf510-c342-4385-9599-0821a2dee94e 1027
20:14:15.518 [SLOT_1027] INFO o.a.s.d.s.Slot - STATE running msInState: 33328 topo:smart-transactional-data-1-1566571315 worker:2ea97301-24c9-4c1a-bcba-61008693971a -> kill-blob-update msInState: 3001 topo:smart-transactional-data-1-1566571315 worker:2ea97301-24c9-4c1a-bcba-61008693971a
20:14:15.540 [SLOT_1027] INFO o.a.s.d.w.Worker - Launching worker for smart-transactional-data-1-1566571315
The LocalCluster is shutting down one of the simulated workers because one of the blobs (e.g. the topology jar, the topology configuration, or other types of shared files; see https://storm.apache.org/releases/2.0.0/distcache-blobstore.html) in the blobstore changed. Normally, when this happens in a real cluster, the worker JVM is killed, the blob is updated and the worker restarts. Since you are using LocalCluster, it just kills the worker thread and restarts it. This is why you are seeing multiple invocations of prepare.
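If you want to rule LocalCluster out, a minimal sketch of submitting the same topology to a real cluster looks like this (assuming a deployed Storm cluster and reusing the builder from the question; the topology name is taken from your log):
Config conf = new Config();
conf.setNumWorkers(2); // real worker JVMs instead of simulated worker threads
// StormSubmitter deploys through Nimbus; workers are only restarted on real
// failures or blob updates, not as part of a local simulation.
StormSubmitter.submitTopology("smart-transactional-data", conf, topologyBuilder.createTopology());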

Related

Locking Mechanism if pod crashes while processing mongodb record

We have a Java/Spring application which runs on EKS pods, and we have records stored in a MongoDB collection.
STATUS: READY, STARTED, COMPLETED
The application needs to pick the records which are in READY status and update their status to STARTED. Once the processing of a record is completed, the status is updated to COMPLETED.
Once a record is STARTED, it may take a few hours to complete; until then, other pods (other instances of the same app) should not pick this record. If some exception occurs, the app changes the status back to READY so that other pods (or the same pod) can pick the READY record for processing.
Requirement: if the pod crashes while the record is processing (STARTED), i.e. before the status has been changed to READY/COMPLETED, another pod should be able to pick this record and start processing it again.
We have some solutions in mind but are trying to find the best one. Please help me with some of the best approaches.
You can use a shutdown hook from Spring:
@Component
public class Bean1 {

    @PreDestroy
    public void destroy() {
        // handle the database change here, e.g. flip STARTED records back to READY
        System.out.println("Status changed to ready");
    }
}
Beyond that, that kind of job could run better in a messaging architecture, using SQS for example. Instead of using the status in the database to handle and orchestrate the task, you can publish the messages that need to be consumed (the ones that would have been in READY state) to an SQS queue and have a pool of workers consuming from it. If something crashes or the pod running one of these workers is reclaimed, the message goes back to SQS and can be consumed by another pod.
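A minimal sketch of such a worker, assuming the AWS SDK for Java v2 and a hypothetical queue URL (the visibility timeout is what makes an unacknowledged message reappear after a crash):
import java.util.List;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class RecordWorker {

    // Hypothetical queue URL, for illustration only.
    private static final String QUEUE_URL =
            "https://sqs.eu-west-1.amazonaws.com/123456789012/records-to-process";

    public static void main(String[] args) {
        SqsClient sqs = SqsClient.create();
        while (true) {
            ReceiveMessageRequest receive = ReceiveMessageRequest.builder()
                    .queueUrl(QUEUE_URL)
                    .maxNumberOfMessages(1)
                    .waitTimeSeconds(20)          // long polling
                    .visibilityTimeout(6 * 3600)  // hide the message while this pod works on it
                    .build();
            List<Message> messages = sqs.receiveMessage(receive).messages();
            for (Message message : messages) {
                process(message.body()); // the long-running work
                // Delete only after successful processing; if the pod crashes first,
                // the message becomes visible again and another pod picks it up.
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(QUEUE_URL)
                        .receiptHandle(message.receiptHandle())
                        .build());
            }
        }
    }

    private static void process(String body) {
        // business logic for one record
    }
}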

Kafka producer slow first execution/message

I have a Java web application with a scheduled background job which sends a Kafka message.
I am using the default producer configuration and the job is very simple.
I am not assigning a producer id, so on each execution a new KafkaProducer is created with a new dynamic id.
When testing I noticed that the first execution of my job (after my application is up) always takes around 2 s, but subsequent executions take only a few ms.
I am not using any application cache and I create a new instance of KafkaProducer on each execution.
Any explanation for this, please? Is there any static cache in the Kafka producer API? (I am always sending to the same topic and not specifying the partition in my producer record.)
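For what it's worth, the first send on a fresh KafkaProducer has to establish the broker connections and fetch topic metadata before any record goes out, so a common pattern is to create the producer once and reuse it across job runs instead of per execution. A rough sketch, with the broker address and topic name as placeholders:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class JobKafkaSender {

    // One long-lived, thread-safe producer for the whole application,
    // instead of a new KafkaProducer per job execution.
    private static final Producer<String, String> PRODUCER = createProducer();

    private static Producer<String, String> createProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }

    // Called by the scheduled job; only the very first call pays the
    // connection and metadata cost, later calls reuse what is cached.
    public void runJob(String payload) {
        PRODUCER.send(new ProducerRecord<>("my-topic", payload)); // placeholder topic
    }
}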

@Incoming not running multithreaded on Quarkus application connecting to RabbitMQ

Some background
We are running a fairly simple application that handles subscriptions and are running into the limits of the external service. The solution is to introduce a queue and throttle the consumers of this queue to optimize the throughput.
For this we are using a Quarkus (2.7.5.Final) implementation with the quarkus-smallrye-reactive-messaging-rabbitmq connector provided by quarkus.io.
Simplified implementation
rabbitmq-host=localhost
rabbitmq-port=5672
rabbitmq-username=guest
rabbitmq-password=guest
mp.messaging.incoming.subscriptions-in.connector=smallrye-rabbitmq
mp.messaging.incoming.subscriptions-in.queue.name=subscriptions
@Incoming("subscriptions-in")
public CompletionStage<Void> consume(Message<JsonObject> message) {
    try {
        Thread.sleep(1000);
        return message.ack();
    } catch (Exception e) {
        return message.nack(e);
    }
}
The problem
This only uses one worker thread, so the jobs are handled one by one. Ideally this application would pick up as many jobs as there are worker threads available (in parallel). How can I make this work?
I tried
@Incoming("subscriptions-in")
@Blocking
Didn't change anything
@Incoming("subscriptions-in")
@NonBlocking
Didn't change anything
@Incoming("subscriptions-in")
@Blocking(ordered = false)
This made it split off into different worker threads, but "detached" the jobs from the queue, so none of the messages got ack'd or nack'd.
@Incoming("subscriptions-in-1")
..
@Incoming("subscriptions-in-2")
..
@Incoming("subscriptions-in-3")
These different channels seem to all work on the same worker thread (which is picked on startup)
The only way I currently see is to slim down the application so it runs one consumer thread each and just run 50 of them in parallel in Kubernetes. This feels wrong, and I can't believe there is no way to multithread at least some of the consuming.
Question
I am hopeful that I am missing a simple solution or am missing the concept of this RabbitMQ connector.
Is there any way to get the @Incoming consumption to run in parallel?
Or is there a way in this Java implementation to increase the prefetch count? If so, I can multithread them myself.
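One pattern that may be worth trying is to do the slow work on your own executor and only ack/nack when it finishes. This is a sketch under the assumption that the connector keeps delivering further messages while earlier ones are still unacknowledged (check the connector's prefetch/outstanding-message settings); the channel name is the one from the question, the pool size is arbitrary:
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.enterprise.context.ApplicationScoped;
import io.vertx.core.json.JsonObject;
import org.eclipse.microprofile.reactive.messaging.Incoming;
import org.eclipse.microprofile.reactive.messaging.Message;

@ApplicationScoped
public class SubscriptionConsumer {

    // Hypothetical fixed-size pool; size it to the concurrency you want.
    private final ExecutorService pool = Executors.newFixedThreadPool(10);

    @Incoming("subscriptions-in")
    public CompletionStage<Void> consume(Message<JsonObject> message) {
        return CompletableFuture
                .runAsync(() -> process(message.getPayload()), pool)
                .thenCompose(ignored -> message.ack())
                .exceptionally(throwable -> {
                    message.nack(throwable);
                    return null;
                });
    }

    private void process(JsonObject payload) {
        // stand-in for the slow external call from the question
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}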

Stop submitting jobs when there is no data

I use Spark Streaming to get data from a queue in MQ via a custom receiver.
The JavaStreamingContext batch duration is 10 seconds,
and there is one task defined for the input from the queue.
In the event timeline in the Spark UI, I see a job getting submitted in each 10 s interval even when there is no data from the receiver.
Is this the normal behavior, or how can I stop jobs from getting submitted when there is no data?
JavaDStream<String> customReceiverStream = ssc.receiverStream(new JavaCustomReceiver(host, port));
JavaDStream<String> words = customReceiverStream.flatMap(new FlatMapFunction<String, String>() { ... });
words.print();
ssc.start();
ssc.awaitTermination();
As a workaround,
you can use Livy to submit the Spark jobs (using Java code instead of CLI commands).
The Livy job would constantly check a database that has an indicator of whether data is flowing in or not. As soon as the data flow stops, change the indicator in the database, and this will result in the Spark job being killed through Livy. (Use Livy sessions.)
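A rough sketch of the "kill it from outside" part, assuming Livy's REST API (DELETE /sessions/{id}) and a hypothetical Livy host and session id:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivyWatchdog {

    // Hypothetical Livy endpoint, for illustration only.
    private static final String LIVY_URL = "http://livy-host:8998";

    private final HttpClient http = HttpClient.newHttpClient();

    // Called from the polling loop that checks the database indicator:
    // once the data flow stops, delete the Livy session running the streaming job.
    public void killSessionIfIdle(int sessionId, boolean dataFlowing) throws Exception {
        if (!dataFlowing) {
            HttpRequest delete = HttpRequest
                    .newBuilder(URI.create(LIVY_URL + "/sessions/" + sessionId))
                    .DELETE()
                    .build();
            HttpResponse<String> response = http.send(delete, HttpResponse.BodyHandlers.ofString());
            System.out.println("Livy responded with status " + response.statusCode());
        }
    }
}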

Apache Storm Bolt task is not receiving message after some time

We have a Storm topology in which we configured one spout and two bolts. The spout queries data from a DB continuously and sends the tuples to the first bolt for some processing. The first bolt does some processing and sends the tuples to the second bolt, which calls a third-party web service and sends the data. What is happening is that after some time the last bolt does not receive any tuples, and if we restart the topology it works fine. Only the last bolt has the problem; the spout and the first bolt are running fine. I am not using the acking framework, and I have configured only one worker in this case.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("messageListenrSpout", new MessageListenerSpout(), 1);
builder.setBolt("processorBolt", new ProcessorBolt(), 20).shuffleGrouping("messageListenrSpout");
builder.setBolt("notifierBolt", new NotifierBolt(),40).shuffleGrouping("processorBolt");
Config conf = new Config();
conf.put(Config.TOPOLOGY_SLEEP_SPOUT_WAIT_STRATEGY_TIME_MS, 10000);
//conf.setMessageTimeoutSecs(600);
conf.setDebug(true);
StormSubmitter.submitTopology(TOPOLOGY, conf, builder.createTopology());
It's quite likely that you're having problems with a backlog of tuples causing timeouts. Try increasing the parallelism hint for the second bolt, since it sounds like its processing time is much longer than that of the first bolt (which is why a backlog would build up in front of the second bolt). If you're running this topology on a cluster, look at the Storm UI to see the specifics.
When I was debugging my topology, I found that if, say, the spout is sending messages fast but a bolt is processing them slowly, the messages queue up in the LMAX Disruptor queue and the spout task waits for it to drain. If you take a thread dump, you will find threads in the TIMED_WAITING state. So the topology needs to be configured in such a way that its inflow and outflow stay balanced.
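A sketch of how the two suggestions could be combined, reusing the topology from the question (the numbers are illustrative, and note that topology.max.spout.pending only takes effect when the spout emits tuples with message IDs, i.e. with acking enabled):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("messageListenrSpout", new MessageListenerSpout(), 1);
builder.setBolt("processorBolt", new ProcessorBolt(), 20).shuffleGrouping("messageListenrSpout");
// Give the slow, web-service-calling bolt more executors so it can keep up.
builder.setBolt("notifierBolt", new NotifierBolt(), 60).shuffleGrouping("processorBolt");

Config conf = new Config();
// Throttle the spout: at most this many tuples in flight per spout task.
// Has no effect unless tuples are emitted with message IDs (acking enabled).
conf.setMaxSpoutPending(500);
conf.setNumWorkers(2);
StormSubmitter.submitTopology(TOPOLOGY, conf, builder.createTopology());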
