Choosing an optimal multithreading model for the new Kafka consumer API - Java

Let me first describe my use-case.
I have topics [T1 ... Tn] to which the Kafka consumer(s) need to subscribe. For each topic, all the data passing through it is logically similar. Let's assume data in different topics has no correlation. But once consumed, all the data, irrespective of its topic, receives the same treatment: it is fed to Elasticsearch using the BulkProcessor API. ES is set up as a multi-node cluster.
The Kafka consumer Javadoc mentions two different multithreading approaches. I'm leaning towards the first one, the One Consumer Per Thread model. Assuming p partitions per topic, I'll have p consumer threads for each topic, so n·p threads in total. If I attach an independent BulkProcessor to each of these threads, I can choose to control the committed position manually, which saves me from data loss in case a BulkProcessor fails. But the downside is that the number of BulkProcessors might become too high, and that might slow down Elasticsearch ingestion.
The other approach I'm considering is to have only one thread per topic, so each thread listens to p partitions and writes to one BulkProcessor. In that case I have to use auto-commit for offsets, and I might lose data on a BulkProcessor failure.
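To make the first approach concrete, here is a minimal sketch of one such consumer thread, committing offsets only after Elasticsearch has accepted the batch. EsBulkWriter is a hypothetical stand-in for a wrapper around the BulkProcessor; the consumer calls are the standard 0.9.x API:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical stand-in for a wrapper around the Elasticsearch BulkProcessor.
interface EsBulkWriter {
    void add(String doc);
    void flush();  // assumed to block until ES acknowledges the batch
}

public class TopicWorker implements Runnable {
    private final KafkaConsumer<String, String> consumer;
    private final EsBulkWriter writer;

    public TopicWorker(String topic, Properties props, EsBulkWriter writer) {
        props.put("enable.auto.commit", "false");  // commit manually, only after ES accepts the batch
        this.consumer = new KafkaConsumer<>(props);
        this.consumer.subscribe(Collections.singletonList(topic));
        this.writer = writer;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    writer.add(record.value());  // buffer into the bulk request
                }
                writer.flush();         // block until Elasticsearch acknowledges
                consumer.commitSync();  // only now advance the committed position
            }
        } finally {
            consumer.close();
        }
    }
}

With n topics of p partitions each, you would start n·p of these, each with its own writer.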
I'd like to know which approach is better, or is there a 3rd approach, better than both of these?
Kafka v0.9.0.x and ES v2.3.x

Advantage of "forcing" the partition

I have a topic theimportanttopic with three partitions.
Is there an advantage to "forcing" the partition assignment?
For instance, I have three consumers in one group. I want consumer1 to always and only consume from partition-0, consumer2 to always and only consume from partition-1 and consumer3 to always and only consume from partition-2.
One consumer should not touch any partition other than the one it was assigned, at any point in time.
A drawback I can think of is that when one of the consumers goes down, no one is consuming from the partition.
Let's suppose a fancy self-healing architecture is in place that can bring any lost consumer back very efficiently.
Would it be an advantage, knowing there won't be any partition reassignment cost to the healthy consumers? The healthy consumers can focus on their own partition, etc.
Are there any other pros and cons?
https://docs.spring.io/spring-kafka/reference/html/#tip-assign-all-parts
https://docs.spring.io/spring-kafka/reference/html/#manual-assignment
It seems the API allow possibility of forcing the partition, I was wondering if this use case was one of the purposes of this design.
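For reference, the mechanism underneath (which the Spring manual-assignment docs linked above build on) is KafkaConsumer.assign(). Here is a minimal sketch that pins one consumer to partition 0 of the question's topic; the broker address and deserializers are illustrative:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PinnedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // illustrative
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // assign() replaces subscribe(): no group coordination, no rebalancing,
        // this consumer reads partition 0 and nothing else.
        consumer.assign(Collections.singletonList(new TopicPartition("theimportanttopic", 0)));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.value());
            }
        }
    }
}

Because assign() bypasses group coordination, nothing will ever rebalance this partition away, which is exactly the trade-off discussed in the answer below.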
How do you know what the "number" of each consumer is? Based on your last questions, you're either using Kubernetes or setting concurrency in Spring Kafka. In either case, pods/threads of the same application rebalance across partitions, so you cannot scale them and pin them to specific partitions without extra external locking logic.
In my opinion, all consumer instances should be equally able to handle any partition.
Plus, as you pointed out, there's downtime if one stops.
But the use case is to exactly match the producer: you've produced data with some custom partitioner logic, so you need specific consumers to read only that subset of the data.
Also, assignment doesn't use consumer groups, so while there would be no rebalancing, it makes it impossible to monitor lag using tools like Burrow or the consumer-groups CLI. Lag would need to be gathered directly from the consumer metrics themselves.

Scaling pattern matched Kafka consumers

I have a scenario where there are multiple Kafka topics (single partition each) and a single consumer group to consume the records. I use a single pattern matched consumer in the consumer group that matches all the topics and hence consumes all the records in all the topics.
I now want to scale this up and have multiple consumers (in the same consumer group) listening to all the topics. However, this does not seem to be working as all the records are getting consumed only by the first consumer in the group, rendering other consumers in the group useless. Also, I am running consumers as separate threads using an ExecutorService.
How can I achieve this?
Below is my code:
Pattern pattern = Pattern.compile(topicPattern);
consumer.subscribe(pattern);
The pattern sent in the code above is such that it matches the names of all the topics, e.g. if topic names are sample_topic_1, sample_topic_2, etc., we match them with sample_topic_.*$.
The approach you describe should work with the code you posted. However, this might be a case where there's not enough data for more than one consumer. Or maybe the data comes in "bursts" that are small enough to fit in a single batch.
Even though load in Kafka is theoretically distributed across all consumers of the same consumer group, in practice, if there is only data for one "batch", then the first consumer could grab all the data and there will be nothing left for anybody else. This means either:
You're not sending enough data for it to be distributed across all consumers (try sending lots more data to validate this), or
You have a weird configuration where your configured batches are gigantic, and/or the linger.ms property is configured very high, or
A combination of the two above.
I suggest trying to send more data first, and seeing if that fixes the issue. If not, scale back to only one consumer and validate it's still working. Then add one more consumer to that consumer group, and see if the behavior changes.
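To make that concrete, here's a minimal sketch of the scaled-up setup: several consumers sharing one group id, each on its own thread, all subscribed with the same pattern. Broker address, group id, pattern and thread count are illustrative; the two-argument subscribe(Pattern, ConsumerRebalanceListener) overload is used because it exists across client versions:

import java.util.Collection;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PatternConsumerGroup {
    public static void main(String[] args) {
        int numConsumers = 3;  // illustrative; each single-partition topic feeds at most one of these
        ExecutorService pool = Executors.newFixedThreadPool(numConsumers);
        for (int i = 0; i < numConsumers; i++) {
            pool.submit(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");
                props.put("group.id", "sample-group");  // same group for every thread
                props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                consumer.subscribe(Pattern.compile("sample_topic_.*"), new ConsumerRebalanceListener() {
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) { }
                });
                while (!Thread.currentThread().isInterrupted()) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("thread=%s topic=%s value=%s%n",
                                Thread.currentThread().getName(), record.topic(), record.value());
                    }
                }
                consumer.close();
            });
        }
    }
}

If the topics really do carry enough data, records from different topics should now show up on different threads.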

How to run hundreds of Kafka consumers on the same machine?

In the Kafka docs, it is mentioned that consumers are not thread-safe. To avoid this problem, I read that it is a good idea to run one consumer per Java process. How can this be achieved?
The number of consumers is not defined, but can change according to need.
Thanks,
Alessio
You're right that the documentation specifies that Kafka consumers are not thread-safe. However, it also says that you should run consumers on separate threads, not processes. That's quite different. See here for an answer with more specifics, geared towards Java/JVM:
https://stackoverflow.com/a/15795159/236528
In general, you can have as many consumers as you want on a Kafka topic. Some of these might share a group id, in which case, all the partitions for that topic will be distributed across all the consumers active at any point in time.
There's much more detail on the Javadoc for the Kafka Consumer, linked at the bottom of this answer, but I copied the two thread/consumer models suggested by the documentation below.
1. One Consumer Per Thread
A simple option is to give each thread its own consumer instance. Here are the pros and cons of this approach:
PRO: It is the easiest to implement
PRO: It is often the fastest as no inter-thread co-ordination is needed
PRO: It makes in-order processing on a per-partition basis very easy to implement (each thread just processes messages in the order it receives them).
CON: More consumers means more TCP connections to the cluster (one per thread). In general Kafka handles connections very efficiently so this is generally a small cost.
CON: Multiple consumers means more requests being sent to the server and slightly less batching of data which can cause some drop in I/O throughput.
CON: The number of total threads across all processes will be limited by the total number of partitions.
2. Decouple Consumption and Processing
Another alternative is to have one or more consumer threads that do all data consumption and hand off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing. This option likewise has pros and cons:
PRO: This option allows independently scaling the number of consumers and processors. This makes it possible to have a single consumer that feeds many processor threads, avoiding any limitation on partitions.
CON: Guaranteeing order across the processors requires particular care, as the threads will execute independently; an earlier chunk of data may actually be processed after a later chunk of data just due to the luck of thread execution timing. For processing that has no ordering requirements this is not a problem.
CON: Manually committing the position becomes harder, as it requires that all threads co-ordinate to ensure that processing is complete for that partition.
There are many possible variations on this approach. For example, each processor thread can have its own queue, and the consumer threads can hash into these queues using the TopicPartition to ensure in-order consumption and simplify commit.
In my experience, option #1 is the best for starting out, and you can upgrade to option #2 only if you really need it. Option #2 is the only way to extract the maximum performance from the Kafka consumer, but its implementation is more complex. So, give option #1 a try first, and see if it's good enough for your specific use case.
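For reference, here is a minimal sketch of option #2's handoff: one consumer thread feeding a blocking queue drained by a worker pool. Offset commits, shutdown and error handling are omitted; broker address, group id, topic name and pool size are illustrative:

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledPipeline {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<ConsumerRecord<String, String>> queue = new ArrayBlockingQueue<>(10000);

        // Processor pool: sized independently of the partition count.
        int numWorkers = 8;
        ExecutorService processors = Executors.newFixedThreadPool(numWorkers);
        for (int i = 0; i < numWorkers; i++) {
            processors.submit(() -> {
                try {
                    while (true) {
                        process(queue.take());  // blocks until work arrives
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single consumer thread: all it does is poll and enqueue.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "pipeline-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("some-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                queue.put(record);  // back-pressure: blocks when the workers fall behind
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // application-specific work goes here
    }
}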
The full Javadoc is available at this link:
https://kafka.apache.org/23/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

Is Kafka the right solution for messages with dependencies?

We have messages which are dependent. For example, say we have 4 messages: M1, M2, M1_update1 (should be processed only after M1 is processed), and M3 (should be processed only after M1 and M2 are processed).
In this example, only M1 and M2 can be processed in parallel; the others have to be sequential. I know messages in one partition of a Kafka topic are processed sequentially. But how do I know that M1 and M2 have been processed, and that now is the time to push the M1_update1 and M3 messages to the topic? Is Kafka the right choice for this kind of use-case? Any insight is appreciated!
Kafka is used as a pub-sub messaging system which is highly scalable and fault-tolerant.
I believe using Kafka alone when your messages are interdependent could be a bad choice. The processing you require is condition-based; you probably need a routing engine such as Apache Camel or Drools to achieve the end result.
You're basically describing a message queue that guarantees ordering. Kafka, by design, only guarantees ordering within a partition, not across a whole topic, except in the case you mention, where the topic has a single partition. In that case, though, you're not taking full advantage of Kafka's ability to maximize throughput by parallelizing data across partitions.
As far as messages being dependent on each other, that would require a logic layer that core Kafka itself doesn't provide. If I understand it correctly, and the processing happens after the message is consumed from Kafka, you would need some sort of notification on the consumer end, which would receive and process M1 and M2 and somehow notify the producer on the other side that it's now OK to send M1_update1 and M3. This is definitely outside the scope of what core Kafka provides. You could still use Kafka to build something like this, but there are probably other solutions that would work better for you.
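That said, there is one ordering guarantee Kafka does give that covers part of this use-case: records sent with the same key land in the same partition and are consumed in send order. Here is a minimal sketch (broker address, topic name and payloads are illustrative); it handles M1 arriving before M1_update1, but not M3's dependency on both M1 and M2:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DependentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Same key "M1" => same partition => consumers see M1 before M1_update1.
        producer.send(new ProducerRecord<>("events", "M1", "M1 payload"));
        producer.send(new ProducerRecord<>("events", "M1", "M1_update1 payload"));
        producer.close();
    }
}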

Apache Kafka Message broadcasting

I am studying Apache Kafka and have some confusion. Please help me to understand the following scenario.
I have a topic with 5 partitions and 5 brokers in a Kafka cluster. I am maintaining my message order in partition 1 (say P1). I want to broadcast the messages of P1 to 10 consumers.
So my question is: how do these 10 consumers interact with topic partition P1?
This is probably not how you want to use Kafka.
Unless you're being explicit with how you set your keys, you can't really control which partition your messages end up in when producing to a topic. Partitions in Kafka are designed to be more like low-level plumbing, something that exists, but you don't usually have to interact with. On the consumer side, you will be assigned partitions based on how many consumers are active for a particular consumer group at any one time.
One way to get around this is to define a topic to have only a single partition, in which case, of course, all messages will go to that partition. This is not ideal, since Kafka won't be able to parallelize data ingestion or serving, but it is possible.
So, having said that, let's assume that you did manage to put all your messages in partition 1 of a specific topic. When you fire up a consumer of that topic with consumer group id of consumer1, it will be assigned all the partitions for that topic, since that consumer is the only active one for that particular group id. If there is only one partition for that topic, like explained above, then that consumer will get all the data. If you then fire up a second consumer with the same group id, Kafka will notice there's a second consumer for that specific group id, but since there's only one partition, it can't assign any partitions to it, so that consumer will never get any data.
On the other hand, if you fire up a third consumer with a different consumer group id, say consumer2, that consumer will now get all the data, and it won't interfere at all with consumer1 message consumption, since Kafka keeps track of their consuming offsets separately. Kafka keeps track of which offset each particular ConsumerGroupId is at on each partition, so it won't get confused if one of them starts consuming slowly or stops for a while and restarts consuming later that day.
Much more detailed information on how Kafka works here: https://kafka.apache.org/documentation/#gettingStarted
And more information on how to use the Kafka consumer at this link:
https://kafka.apache.org/20/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
@mjuarez's answer is absolutely correct; just for brevity I would reduce it to the following:
Don't try and read only from a single partition because it's a low level construct and it somewhat undermines the parallelism of Kafka. You're much better off just creating more topics if you need finer separation of data.
I would also add that most of the time a consumer needn't know which partition a message came from, in the same way that I don't eat a sandwich differently depending on which store it came from.
@mjuarez is actually not correct, and I am not sure why his answer is being incorrectly confirmed by the OP. You can absolutely tell Kafka explicitly which partition a producer record pertains to, using the following constructor:
ProducerRecord(
java.lang.String topic,
java.lang.Integer partition, // <--------- !!!
java.lang.Long timestamp,
K key,
V value)
https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/producer/ProducerRecord.html#ProducerRecord-java.lang.String-java.lang.Integer-java.lang.Long-K-V-
So most of what was said after that becomes irrelevant.
Now to address the OP's question directly: you want to accomplish a broadcast. To have a message sent once and read more than once, you would have to have a different consumer group for each reader.
And that use case is an absolutely valid Kafka usage paradigm.
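For illustration, here is a minimal sketch of that broadcast pattern (broker address, topic and group names are illustrative). Only the group.id differs between the two readers, and each independently receives every record:

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BroadcastReader {
    static KafkaConsumer<String, String> newConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);  // distinct group => its own full copy of the stream
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("broadcast-topic"));
        return consumer;
    }

    public static void main(String[] args) {
        // Each reader would normally live in its own thread or process; both
        // independently receive every record because their group ids differ.
        KafkaConsumer<String, String> reader1 = newConsumer("group-1");
        KafkaConsumer<String, String> reader2 = newConsumer("group-2");
        while (true) {
            for (KafkaConsumer<String, String> reader : Arrays.asList(reader1, reader2)) {
                ConsumerRecords<String, String> records = reader.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }
}

To broadcast P1's messages to 10 consumers as in the question, you would use 10 distinct group ids.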
You can accomplish that using RabbitMQ too:
https://www.rabbitmq.com/tutorials/tutorial-three-java.html
... but the way it is done is not ideal because multiple out-of-process queues are involved.
