Storm-Kafka multiple spouts, how to share the load?


I am trying to share work among multiple spouts. I receive one tuple/message at a time from an external source, and I want to run multiple instances of a spout in order to share the load and improve throughput.
I can do this with a single spout, but I want to spread the load across several. I cannot work out the logic for doing so, since the offset of the messages is not known until a given spout finishes consuming its part (i.e. based on the configured buffer size).
Can anyone shed some light on how to work out the logic/algorithm?
Thanks in advance for your time.
Update in response to answers:
I now use multiple partitions on Kafka (5).
This is the code used:
builder.setSpout("spout", new KafkaSpout(cfg), 5);
I tested by flooding each partition with 800 MB of data, and it took ~22 sec to finish reading.
I then ran the same code with parallelism_hint = 1,
i.e. builder.setSpout("spout", new KafkaSpout(cfg), 1);
This time it took ~23 sec, only slightly longer. Why?
According to the Storm docs, the setSpout() declaration is as follows:
public SpoutDeclarer setSpout(java.lang.String id,
IRichSpout spout,
java.lang.Number parallelism_hint)
where,
parallelism_hint - is the number of tasks that should be assigned to execute this spout. Each task will run on a thread in a process somewhere around the cluster.

I came across a discussion on storm-user that covers something similar.
Read "Relationship between Spout parallelism and number of kafka partitions".
Two things to note when using the Kafka spout for Storm:
The maximum parallelism you can have on a KafkaSpout is the number of partitions.
We can split the load across multiple Kafka topics and have a separate spout instance for each, i.e. each spout handles a separate topic.
So if Kafka is configured with 1 partition per host and there are 2 hosts, then even if we set the spout parallelism to 10, the maximum value that is respected is 2, which is the number of partitions.
How to specify the number of partitions in the Kafka spout?
List<HostPort> hosts = new ArrayList<HostPort>();
hosts.add(new HostPort("localhost", 9092));
SpoutConfig objConfig = new SpoutConfig(new KafkaConfig.StaticHosts(hosts, 4), "spoutCaliber", "/kafkastorm", "discovery");
As you can see, brokers are added with hosts.add, and the number of partitions is specified as 4 in the new KafkaConfig.StaticHosts(hosts, 4) snippet.
How to specify the parallelism hint in the Kafka spout?
builder.setSpout("spout", spout, 4);
You specify it when adding your spout to the topology with the setSpout method. Here 4 is the parallelism hint.
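To make the "parallelism is capped by the partition count" point concrete, here is a minimal sketch in plain Java. It assumes a simple round-robin mapping of partitions to spout tasks; PartitionAssignment and its methods are hypothetical names for illustration only, and the real storm-kafka assignment may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of storm-kafka. It only illustrates why
// spout tasks beyond the partition count have nothing to read.
class PartitionAssignment {

    // Assumed round-robin mapping: partition p is owned by task (p % taskCount).
    static List<List<Integer>> assign(int partitionCount, int taskCount) {
        List<List<Integer>> perTask = new ArrayList<>();
        for (int t = 0; t < taskCount; t++) {
            perTask.add(new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            perTask.get(p % taskCount).add(p);
        }
        return perTask;
    }

    // Number of tasks that own at least one partition.
    static int busyTasks(int partitionCount, int taskCount) {
        int busy = 0;
        for (List<Integer> owned : assign(partitionCount, taskCount)) {
            if (!owned.isEmpty()) {
                busy++;
            }
        }
        return busy;
    }

    public static void main(String[] args) {
        // 2 partitions, parallelism hint 10: only 2 tasks get any work.
        System.out.println(busyTasks(2, 10));
    }
}
```

With 2 partitions and a parallelism hint of 10, eight of the ten lists returned by assign stay empty, which matches the behaviour described above.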
More links that might help
Understanding-the-parallelism-of-a-Storm-topology
what-is-the-task-in-twitter-storm-parallelism
Disclaimer:
I am new to both Storm and Java, so please edit/add wherever required.

Related

Unexpected backlog size in Pulsar

I'm using Pulsar for communication between services and I'm experiencing flakiness in a quite simple test of producers and consumers.
In a JUnit 4 test, I spin up (my own wrappers around) a ZooKeeper server, a BookKeeper bookie, and a PulsarService; the configurations should be quite standard.
The test can be summarized in the following steps:
1. build a producer;
2. build a consumer (say, a reader of a Pulsar topic);
3. check the message backlog (using precise backlog);
   this is done by getting the current subscription via PulsarAdmin#topics#getStats#subscriptions
   I expect it to be 0, as nothing was sent on the topic, but sometimes it is 1, but this seems another problem...
4. build a new producer and synchronously send a message onto the topic;
5. build a new consumer and read the messages on the topic;
   I expect a backlog of one message, and I actually read one
6. build a new producer and synchronously send four messages;
7. fetch again the messages, using the messageID read at step 5 as start message ID;
   I expect a backlog of four messages here, and most of the time this value is correct, but running the test about ten times I consistently get 2 or 5
I tried debugging the test, but I cannot figure out where those values come from; did I misunderstand something?

Things you can try if not already done:
Ask for precise backlog measurement. By default, it's only estimated as getting the precise measurement is a costlier operation. Use admin.topics().getStats(topic, true) for this. (See https://github.com/apache/pulsar/blob/724523f3051def9577d6bd27697866c99f4a7b0e/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L862)
Deactivate batching on the producer side. The number returned in msgBacklog is the number of entries, so multiple messages batched into a single entry count as 1. See the relevant issue: https://github.com/apache/pulsar/issues/7623. This can explain why you see a value of 2 for msgBacklog if the 4 messages were grouped into batches. Beware that deactivating batching can have a huge impact on performance.
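The difference between counting messages and counting entries can be illustrated with a small plain-Java sketch. It assumes, as a simplification, that messages are grouped into batches of a fixed maximum size; real batching is driven by time and size limits. BacklogEstimate is a hypothetical name and is not part of the Pulsar API.

```java
// Hypothetical illustration, NOT Pulsar code: counts how many entries a
// backlog would report if messages were grouped into fixed-size batches.
class BacklogEstimate {

    // Entries produced when `messages` messages are packed into batches
    // holding at most `maxPerBatch` messages each (ceiling division).
    static int entries(int messages, int maxPerBatch) {
        if (maxPerBatch < 1) {
            throw new IllegalArgumentException("maxPerBatch must be >= 1");
        }
        return (messages + maxPerBatch - 1) / maxPerBatch;
    }

    public static void main(String[] args) {
        System.out.println(entries(4, 1)); // batching disabled: one entry per message
        System.out.println(entries(4, 2)); // two messages per entry
    }
}
```

Under this assumption, four messages only produce a backlog of four when each one ends up in its own entry.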

Force kafka consumer to poll partition with highest lag

I have a setup where several KafkaConsumers each handle a number of partitions on a single topic. They are statically assigned the partitions, in a way that ensures that each consumer has an equal number of partitions to handle. The record key is also chosen so that we have equal distribution of messages over all partitions.
At times of heavy load, we often see a small number of partitions build up a considerable lag (thousands of messages/several minutes worth), while other partitions that are getting the same load and are consumed by the same consumer manage to keep the lag down to a few hundred messages / couple of seconds.
It looks like the consumer is fetching records as fast as it can, going around most of the partitions, but now and then there is one partition that gets left out for a long time. Ideally, I'd like to see the lag spread out more evenly across the partitions.
I've been reading about KafkaConsumer poll behaviour and configuration for a while now, and so far I think there are two options to work around this:
Build something custom that can monitor the lag per partition, and use KafkaConsumer.pause() and .resume() to essentially force the KafkaConsumer to read from the partitions with the most lag
Restrict our KafkaConsumer to only ever subscribe to one TopicPartition, and work with multiple instances of KafkaConsumer.
Neither of these options seems like the proper way to handle this. Configuration also doesn't seem to have the answer:
max.partition.fetch.bytes only specifies the max fetch size for a single partition, it doesn't guarantee that the next fetch will be from another partition.
max.poll.interval.ms only works for consumer groups and not on a per-partition basis.
Am I missing a way to encourage the KafkaConsumer to switch partition more often? Or a way to implement a preference for the partitions with the highest lag?
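The first workaround listed above (monitor lag per partition, then pause and resume) comes down to a selection step that could be sketched as follows. This is plain Java with partitions represented as integer ids; LagSelector is a hypothetical helper, and how the lag values are measured is left out. In a real consumer the two resulting sets would be mapped to TopicPartition objects and handed to KafkaConsumer.pause() and KafkaConsumer.resume().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, NOT part of the Kafka client library.
class LagSelector {

    // Returns the ids of the `keep` partitions with the highest lag.
    static Set<Integer> mostLagging(Map<Integer, Long> lagByPartition, int keep) {
        List<Map.Entry<Integer, Long>> entries = new ArrayList<>(lagByPartition.entrySet());
        // Sort by lag, largest first.
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        Set<Integer> result = new HashSet<>();
        for (int i = 0; i < Math.min(keep, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    // Every partition that is not among the most lagging ones would be paused.
    static Set<Integer> toPause(Map<Integer, Long> lagByPartition, int keep) {
        Set<Integer> pause = new HashSet<>(lagByPartition.keySet());
        pause.removeAll(mostLagging(lagByPartition, keep));
        return pause;
    }
}
```

Pausing the low-lag partitions for a few poll cycles gives the lagging ones the whole fetch budget; the partitions would need to be resumed again afterwards so they do not fall behind in turn.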

I am not sure whether this is still relevant to you, or whether it exactly meets your needs, but you could try a lag-aware assignor. Such an assignor assigns partitions to consumers so that the lag is distributed evenly among them. Here is a well-written implementation of a lag-based assignor that I have used:
https://github.com/grantneale/kafka-lag-based-assignor
All you need to do is configure your consumer to use this assignor, with the statement below:
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, LagBasedPartitionAssignor.class.getName());

Queueing tasks via JMS

I would like to ask the community for feedback on a strategy I have been considering to resolve some performance issues in my project.
The context:
We have an important process that performs 4 steps:
1. An entity's status changes and the change is persisted.
2. If 1 ends OK, the entity is exported to a CSV file.
3. If 2 ends OK, the entity is exported to another CSV, this one with far more info.
4. If 3 ends OK, the last CSV is sent by mail.
Steps 1 and 2 are linked and they are critical.
Steps 3 and 4 are not critical. We don't even care whether they end successfully.
Performance of 1-2 is fine, but 3-4 are insanely slow in some scenarios, mostly because of step 3.
If we execute all the steps in sequence, step 3 sometimes causes a timeout. The client gets no response about steps 1 and 2 (the important ones) and the user doesn't know what's going on.
This made me think of JMS queues, in order to delegate the last 2 steps to another app/process and decouple the notification from the business logic. The second export and the mailing would be processed when possible, and probably in parallel. I could also split the work into 2 queues: exports and mail notifications.
Our webapp runs in a WebLogic 11 cluster, so I could use its implementation.
What do you think about the strategy? Is the WebLogic JMS implementation any good? Should I look at another implementation, such as ActiveMQ or RabbitMQ?
I have also been thinking about implementing a ticketing system with spring-tasks.
At this point I should mention spring-batch. Its usage is limited for us: we already have many jobs focused on important data consolidation processes, and the time window for scheduling more jobs is limited. There is also the impact of trying to process all items in bulk at once.
Maybe we could use it if we found a way to exploit spring-batch's multithreading, but we haven't yet found a way to fit our requirements into such a strategy.
Thank you in advance, and excuse my English. I promise to keep working hard on it :-).

One problem to consider is data integrity. If step n fails, does step n-1 need to be reversed? Are there any ordering dependencies that you need to be aware of? And are you writing to the same CSV or to different ones? If the same, you might have contention issues.
Now, back to the original problem. I would consider Java executors, using 4 fixed-sized pools and move the task through the pools as successes occur:
Submit step 1 to pool 1, getting a Future back, which will be used to check for completion.
When step 1 completes, you submit step 2 to pool 2.
When step 2 completes, you now can return a result to the caller. The call to this point has been waiting (likely with a timeout so it doesn't hang around forever) but now the critical tasks are done.
After returning to the client, submit step 3 to pool 3.
When step 3 completes, submit step 4 to pool 4.
The pools themselves, while fixed sized, could be larger for pool 1/2 to get maximum throughput (and to get back to your client as quickly as possible) and pool 3/4 could be smaller but still large enough to get the work done.
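The hand-off between pools described above could be sketched roughly as follows. The pool sizes, the 5-second timeouts, and the class and method names (StepPipeline, handle, shutdown) are assumptions made for illustration; each step body is a placeholder that only records that it ran.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the four-pool hand-off; the step bodies are placeholders.
class StepPipeline {
    // Larger pools for the critical steps, smaller ones for the slow ones (sizes assumed).
    private final ExecutorService pool1 = Executors.newFixedThreadPool(4);
    private final ExecutorService pool2 = Executors.newFixedThreadPool(4);
    private final ExecutorService pool3 = Executors.newFixedThreadPool(2);
    private final ExecutorService pool4 = Executors.newFixedThreadPool(2);

    // Records which steps ran, in order; stands in for the real work.
    final List<String> log = new CopyOnWriteArrayList<>();

    // Waits for steps 1 and 2, then returns; steps 3 and 4 continue in the background.
    String handle(String entityId) throws Exception {
        pool1.submit(() -> { log.add("1:status " + entityId); }).get(5, TimeUnit.SECONDS);
        pool2.submit(() -> { log.add("2:csv " + entityId); }).get(5, TimeUnit.SECONDS);
        // Not awaited: a slow or failing step 3 can no longer time out the caller.
        pool3.submit(() -> {
            log.add("3:big-csv " + entityId);
            pool4.submit(() -> { log.add("4:mail " + entityId); });
        });
        return "OK " + entityId;
    }

    // Drains pool 3 before closing pool 4, so a running step 3 can still hand off step 4.
    void shutdown() throws InterruptedException {
        pool1.shutdown();
        pool2.shutdown();
        pool3.shutdown();
        pool3.awaitTermination(5, TimeUnit.SECONDS);
        pool4.shutdown();
        pool4.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

If step 1 or step 2 throws, get() rethrows the failure as an ExecutionException and the non-critical steps are never submitted, which matches the "if n ends OK" chaining in the question.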
You could do something similar with JMS, but the issues are similar: you need to have multiple listeners or multiple threads per listener so that you can process at an appropriate speed. You could do steps 1/2 synchronously without a pool, but then you don't get some of the thread management that executors give you. You still need to "schedule" steps 3/4 by putting them on the JMS queue and still have listeners to process them.
The ability to recover from a server going down is key here, but Executors/ExecutorService has no persistence, so I'd definitely be looking at JMS (and then I'd be queuing absolutely everything up, even the first 2 steps), but depending on your use case it might be overkill.

Yes, an event-driven approach in which a message bus does the integration sounds good. It is asynchronous, so you will not get timeouts. Of course you will need to use a Topic. WLS has some memory issues when there are too many messages in the server, so a separate server might work better for separation of concerns and resources.

How to set TOPOLOGY_MAX_SPOUT_PENDING parameter

In my topology, I read trigger messages from a Kafka queue. On receiving the trigger message, I need to emit around 4096 messages to a bolt. In the bolt, after some processing it will publish to another Kafka queue (another topology will consume this later).
I'm trying to set the TOPOLOGY_MAX_SPOUT_PENDING parameter to throttle the number of messages going to the bolt, but I see it is having no effect. Is it because I'm emitting all the tuples in one nextTuple() call? If so, what would be the workaround?

If you are reading from Kafka, you should use the KafkaSpout that comes packaged with Storm. Don't try to implement your own spout; trust me, I use the KafkaSpout in production and it works very smoothly. Each Kafka message generates exactly one tuple.
And as you can see on this nice page from the manual, you can set the topology.max.spout.pending like this:
Config conf = new Config();
conf.setMaxSpoutPending(5000);
StormSubmitter.submitTopology("mytopology", conf, topology);
The topology.max.spout.pending is set per spout: if you have four spouts, the maximum number of incomplete tuples inside your topology equals the number of spouts * topology.max.spout.pending.
Another tip: use the Storm UI to check whether topology.max.spout.pending was set properly.
Remember that topology.max.spout.pending is only the number of tuples not yet fully processed inside the topology; the topology will never stop consuming messages from Kafka, at least on a production system... If you want to consume batches of 4096, you need to implement caching logic in your bolts, or use something other than Storm (something micro-batch oriented).

For TOPOLOGY_MAX_SPOUT_PENDING to take effect, you need to enable the fault-tolerance mechanism (i.e., assign message IDs in spouts, and anchor and ack in bolts). Furthermore, if you emit more than one tuple per call to Spout.nextTuple(), TOPOLOGY_MAX_SPOUT_PENDING will not work as expected.
It is actually bad practice, for several other reasons, to emit more than a single tuple per Spout.nextTuple() call (see "Why should I not loop or block in Spout.nextTuple()" for more details).
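One way to follow the "one tuple per nextTuple() call" advice when a single trigger fans out into thousands of messages is to buffer the expanded messages and hand out one per call. The sketch below is plain Java and deliberately does not use the Storm API: BufferedEmitter, onTrigger and the emitted list are hypothetical stand-ins for a spout and its collector.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a spout; NOT the Storm API.
class BufferedEmitter {
    private final Queue<String> buffer = new ArrayDeque<>();

    // Stands in for collector.emit(...); a real spout would attach a message ID
    // to each emit so that the max-spout-pending limit can count it.
    final List<String> emitted = new ArrayList<>();

    // Expands a trigger into `fanOut` buffered messages instead of emitting them all at once.
    void onTrigger(String triggerId, int fanOut) {
        for (int i = 0; i < fanOut; i++) {
            buffer.add(triggerId + "-" + i);
        }
    }

    // Emits at most one buffered message per call; returns false when the buffer is empty.
    boolean nextTuple() {
        String next = buffer.poll();
        if (next == null) {
            return false;
        }
        emitted.add(next);
        return true;
    }
}
```

Because each call emits a single message, the framework gets a chance to stop calling nextTuple() once the pending limit is reached, which is what makes the throttling effective.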

How to achieve maximum concurrency for a distributed application using a database as medium of communication

I have an application which is similar to classic producer consumer problem. Just wanted to check out all the possible implementations to achieve it. The problem is-
Process A: inserts a row into the table in database (producers)
Process B: reads M rows from the table, deletes the read M rows after processing.
Tasks in process B:
1. Read M rows
2. Process these rows
3. Delete these rows
N1 instances of process A and
N2 instances of process B run concurrently.
Each instance runs on a different box.
Some requirements:
If a process p1 is reading rows (0, M-1), process p2 should not wait for p1 to release its lock on these rows; instead it should read rows (M, 2M-1).

I bet there are better ways of doing parallel processing than using a DB as the exchanger between producer and consumer. Why not queues? Have you checked the tools/frameworks designed for Map/Reduce? Hadoop, GridGain and JPPF can all do this.

A similar concept is used in the ConcurrentHashMap of Java 1.5.
A list of the rows that are currently being processed should be maintained separately. When a process needs to interact with the DB, it should check whether those rows are being processed by another process. If so, it should wait on that condition; otherwise it can proceed. Maintaining indexes might help in such a case.

I think that if this application is implemented as described, it effectively uses a hand-made queue. I believe that JMS is a much better fit in this case. There are a lot of JMS implementations available, and most of them are open source.
In your case process A should insert tasks into the queue. Process B should block on receive(), get N messages and then process them. You probably have reasons for fetching tasks from your queue in bulk, but if you change the implementation to a JMS-based one you probably do not need this at all, so you can just listen to the queue and process each message immediately. The implementation becomes almost trivial, very flexible and scalable. You can run as many processes A and B as you want and distribute them among separate boxes.
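The blocking-receive pattern described above can be sketched with an in-process java.util.concurrent.BlockingQueue standing in for a JMS queue; a real deployment across separate boxes would need an actual broker. TaskQueueDemo and takeBatch are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process stand-in for a JMS queue; illustrative only.
class TaskQueueDemo {

    // Consumer side (process B): blocks until at least one task is available,
    // then takes up to maxBatch - 1 more without waiting.
    static List<String> takeBatch(BlockingQueue<String> queue, int maxBatch)
            throws InterruptedException {
        List<String> batch = new ArrayList<>();
        batch.add(queue.take());
        queue.drainTo(batch, maxBatch - 1);
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Producer side (process A): each insert becomes one queued task.
        for (int i = 0; i < 5; i++) {
            queue.put("task-" + i);
        }
        System.out.println(takeBatch(queue, 3));
        System.out.println(takeBatch(queue, 3));
    }
}
```

Each task is handed to exactly one consumer, so a second consumer never waits on rows held by the first; it simply receives the next tasks in the queue.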

You may also want to take a look into Amazon Elastic Map Reduce
http://aws.amazon.com/elasticmapreduce/
