Batch consumer camel kafka

Batch consumer camel kafka - java

I am unable to read in batch with the kafka camel consumer, despite following an example posted here. Are there changes I need to make to my producer, or is the problem most likely with my consumer configuration?
The application in question utilizes the kafka camel component to ingest messages from a rest endpoint, validate them, and place them on a topic. I then have a separate service that consumes them from the topic and persists them in a time-series database.
The messages were being produced and consumed one at a time, but the database expects the messages to be consumed and committed in batch for optimal performance. Without touching the producer, I tried adjusting the consumer to match the example in the answer to this question:
How to transactionally poll Kafka from Camel?
I wasn't sure how the messages would appear, so for now I'm just logging them:
from(kafkaReadingConsumerEndpoint).routeId("rawReadingsConsumer").process(exchange -> {
// simple approach to generating errors
String body = exchange.getIn().getBody(String.class);
if (body.startsWith("error")) {
throw new RuntimeException("can't handle the message");
}
log.info("BODY:{}", body);
}).process(kafkaOffsetManager);
But the messages still appear to be coming across one at a time with no batch read.
My consumer config is this:
kafka:
host: myhost
port: myport
consumer:
seekTo: beginning
maxPartitionFetchBytes: 55000
maxPollRecords: 50
consumerCount: 1
autoOffsetReset: earliest
autoCommitEnable: false
allowManualCommit: true
breakOnFirstError: true
Does my config need work, or are there changes I need to make to the producer to have this work correctly?

At the lowest layer, the KafkaConsumer#poll method is going to return an Iterator<ConsumerRecord>; there's no way around that.
I don't have in-depth experience with Camel, but in order to get a "batch" of records, you'll need some intermediate collection to "queue" the data that you want to eventually send downstream to some "collection consumer" process. Then you will need some "switch" processor that says "wait, process this batch" or "continue filling this batch".
As far as databases go, that process is exactly what Kafka Connect JDBC Sink does with batch.size config.

We solved a similar requirement by using the Aggregation [1] capability provided by Camel
A rough code snippet
#Override
public void configure() throws Exception {
// 1. Define your Aggregation Strat
AggregationStrategy agg = AggregationStrategies.flexible(String.class)
.accumulateInCollection(ArrayList.class)
.pick(body());
from("kafka:your-topic?and-other-params")
// 2. Define your Aggregation Strat Params
.aggregate(constant(true), agg)
.completionInterval(1000)
.completionSize(100)
.parallelProcessing(true)
// 3. Generate bulk insert statement
.process(exchange -> {
List<String> body = (List<String>) exchange.getIn().getBody();
String query = generateBulkInsertQueryStatement("target-table", body);
exchange.getMessage().setBody(query);
})
.to("jdbc:dataSource");
}
There are a variety of strategies that you can implement, but we chose this particular one because it allows you to create a List of strings for the message contents that we need to ingest into the db. [2]
We set a variety of different params such as completionInterval & completionSize. The most important one for us was to set parallellProcessing(true) [3] ; without that our performance wasn't nearly getting the required throughput.
Once the aggregation has either collected 100 messages or 1000 ms has passed, then the processor generates a bulk insert statement, which then gets sent to the db.
[1] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html
[2] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html#_aggregating_into_a_list
[3] https://camel.apache.org/components/3.18.x/eips/aggregate-eip.html#_worker_pools

Related

Spring Cloud Stream - notice and handle errors in broker

I am fairly new to developing distributed applications with messaging, and to Spring Cloud Stream in particular. I am currently wondering about best practices on how to deal with errors on the broker side.
In our application, we need to both consume and produce messages from/to multiple sources/destinations like this:
Consumer side
For consuming, we have defined multiple #Beans of type java.util.function.Consumer. The configuration for those looks like this:
spring.cloud.stream.bindings.consumeA-in-0.destination=inputA
spring.cloud.stream.bindings.consumeA-in-0.group=$Default
spring.cloud.stream.bindings.consumeB-in-0.destination=inputB
spring.cloud.stream.bindings.consumeB-in-0.group=$Default
This part works quite well - wenn starting the application, the exchanges "inputA" and "inputB" as well as the queues "inputA.$Default" and "inputB.$Default" with corresponding binding are automatically created in RabbitMQ.
Also, in case of an error (e.g. a queue is suddenly not available), the application gets notified immediately with a QueuesNotAvailableException and continuously tries to re-establish the connection.
My only question here is: Is there some way to handle this exception in code? Or, what are best practices to deal with failures like this on broker side?
Producer side
This one is more problematic. Producing messages is triggered by some internal logic, we cannot use function #Beans here. Instead, we currently rely on StreamBridge to send messages. The problem is that this approach does not trigger creation of exchanges and queues on startup. So when our code calls streamBridge.send("outputA", message), the message is sent (result is true), but it just disappears into the void since RabbitMQ automatically drops unroutable messages.
I found that with this configuration, I can at least get RabbitMQ to create exchanges and queues as soon as the first message is sent:
spring.cloud.stream.source=produceA;produceB
spring.cloud.stream.default.producer.requiredGroups=$Default
spring.cloud.stream.bindings.produceA-out-0.destination=outputA
spring.cloud.stream.bindings.produceB-out-0.destination=outputB
I need to use streamBridge.send("produceA-out-0", message) in code to make it work, which is not too great since it means having explicit configuration hardcoded, but at least it works.
I also tried to implement the producer in a Reactor style as desribed in this answer, but in this case the exchange/queue also is not created on application startup and the sent message just disappears even though the return status of the sending method is "OK".
Failures on the broker side are not registered at all with this approach - when I simulate one e.g. by deleting the queue or the exchange, it is not registered by the application. Only when another message is sent, I get in the logs:
ERROR 21804 --- [127.0.0.1:32404] o.s.a.r.c.CachingConnectionFactory : Shutdown Signal: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no exchange 'produceA-out-0' in vhost '/', class-id=60, method-id=40)
But still, the result of StreamBridge#send was true in this case. But we need to know that sending did actually fail at this point (we persist the state of the sent object using this boolean return value). Is there any way to accomplish that?
Any other suggestions on how to make this producer scenario more robust? Best practices?
EDIT
I found an interesting solution to the producer problem using correlations:
...
CorrelationData correlation = new CorrelationData(UUID.randomUUID().toString());
messageHeaderAccessor.setHeader(AmqpHeaders.PUBLISH_CONFIRM_CORRELATION, correlation);
Message<String> message = MessageBuilder.createMessage(payload, messageHeaderAccessor.getMessageHeaders());
boolean sent = streamBridge.send(channel, message);
try {
final CorrelationData.Confirm confirm = correlation.getFuture().get(30, TimeUnit.SECONDS);
if (correlation.getReturned() == null && confirm.isAck()) {
// success logic
} else {
// failed logic
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
// failed logic
} catch (ExecutionException | TimeoutException e) {
// failed logic
}
using these additional configurations:
spring.cloud.stream.rabbit.default.producer.useConfirmHeader=true
spring.rabbitmq.publisher-confirm-type=correlated
spring.rabbitmq.publisher-returns=true
This seems to work quite well, although I'm still clueless about the return value of StreamBridge#send, it is always true and I cannot find information in which cases it would be false. But the rest is fine, I can get information on issues with the exchange or the queue from the correlation or the confirm.
But this solution is very much focused on RabbitMQ, which causes two problems:
our application should be able to connect to different brokers (e.g. Azure Service Bus)
in tests we use Kafka binder and I don't know how to configure the application context to make it work in this case, too
Any help would be appreciated.

On the consumer side, you can listen for an event such as the ListenerContainerConsumerFailedEvent.
https://docs.spring.io/spring-amqp/docs/current/reference/html/#consumer-events
On the producer side, producers only know about exchanges, not any queues bound to them; hence the requiredGroups property which causes the queue to be bound.
You only need spring.cloud.stream.default.producer.requiredGroups=$Default - you can send to arbitrary destinations using the StreamBridge and the infrastructure will be created.
#SpringBootApplication
public class So70769305Application {
public static void main(String[] args) {
SpringApplication.run(So70769305Application.class, args);
}
#Bean
ApplicationRunner runner(StreamBridge bridge) {
return args -> bridge.send("foo", "test");
}
}
spring.cloud.stream.default.producer.requiredGroups=$Default

Spark Stream new Job after stream start

I have a situation where I am trying to stream using spark streaming from kafka. The stream is a direct stream. I am able to create a stream and then start streaming, also able to get any updates (if any) on kafka via the streaming.
The issue comes in when i have a new request to stream a new topic. Since SparkStreaming context can be only 1 per jvm, I cannot create a new stream for every new request.
The way I figured out is
Once a DStream is created and spark streaming is already in progress, just attach a new stream to it. This does not seem to work, the createDStream (for a new topic2) does not return a stream and further processing is stopped. The streaming keep on continuing on the first request (say topic1).
Second, I thought to stop the stream, create DStream and then start streaming again. I cannot use the same streaming context (it throws an excpection that jobs cannot be added after streaming has been stopped), and if I create a new stream for new topic (topic2), the old stream topic (topic1) is lost and it streams only the new one.
Here is the code, have a look
JavaStreamingContext javaStreamingContext;
if(null == javaStreamingContext) {
javaStreamingContext = JavaStreamingContext(sparkContext, Durations.seconds(duration));
} else {
StreamingContextState streamingContextState = javaStreamingContext.getState();
if(streamingContextState == StreamingContextState.STOPPED) {
javaStreamingContext = JavaStreamingContext(sparkContext, Durations.seconds(duration));
}
}
Collection<String> topics = Arrays.asList(getTopicName(schemaName));
SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());
KafkaUtils.createDirectStream(javaStreamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
.map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
.foreachRDD(impl);
if (javaStreamingContext.getState() == StreamingContextState.ACTIVE) {
javaStreamingContext.start();
javaStreamingContext.awaitTermination();
}
Don't worry about SparkVoidFunctionImpl, this is a custom class with is the implementation of VoidFunction.
The above is approach 1, where i do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming object, it tries to create a dstream. The issue is the DStream object is never returned.
KafkaUtils.createDirectStream(javaStreamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
This does not return a dstream, the control just terminates without an error.The steps further are not executed.
I have tried many things and read multiple article, but I belive this is a very common production level issue. Any streaming done is to be done on multiple different topics and each of them is handled differently.
Please help

The thing is spark master sends out code to workers and although the data is streaming, underlying code and variable values remain static unless job is restarted.
Few options I could think:
Spark Job server: Every time you want to subscribe/stream from a different topic instead of touching already running job, start a new job. From your API body you can supply the parameters or topic name. If you want to stop streaming from a specific topic, just stop respective job. It will give you a lot of flexibility and control on resources.
[Theoritical] Topic Filter: Subscribe all topics you think you will want, when records are pulled for a duration, filter out records based on a LIST of topics. Manipulate this list of topics through API to increase or decrease your scope of topics, it could be a broadcast variable as well. This is just an idea, I have not tried this option at all.
Another work around is to relay your Topic-2 data to Topic-1 using a microservice whenever you need it & stop if you don't want to.

Safe way to use batch listener

I am trying to use spring-kafka 1.3.x (1.3.3 and 1.3.4). What is not clear is whether there is a safe way to consume messages in batch without skipping a message (or set of messages) when an exception occurs eg network outage. My preference is also to leverage the container capabilities as much as possible to remain in Spring framework rather than trying to create a custom framework for dealing with this challenge.
I am setting the following properties onto a ConcurrentMessageListenerContainer :
.setAckOnError(false);
.setAckMode(AckMode.MANUAL);
I am also setting the following kafka specific consumer properties:
enable.auto.commit=false
auto.offset.reset=earliest
If I set a RetryTemplate, I get a class cast exception since it only works for non-batch consumers. Documentation states retry is not available for batch so this may be OK.
I then setup a consumer such as this one:
```java
#KafkaListener(containerFactory = "conatinerFactory",
groupId = "myGroup",
topics = "myTopic")
public void onMessage(#Payload List<Entries> batchedData,
#Header(required = false,
value = KafkaHeaders.OFFSET) List<Long> offsets,
Acknowledgment ack) {
log.info("Working on: {}" + offsets);
int x = 1;
if(x == 1) {
log.info("Failure on: {}" + offsets);
throw new RuntimeException("mock failure");
}
// do nothing else for now
// unreachable code
ack.acknowledge();
}
```
When I send a message into the system to mock the exception above then the only visible action to me is that the listener reports the exception.
When I send another (new) message into the system, the container consumes the new message. The old message is skipped since the offset is advanced to the next offset.
Since I have asked the container not to acknowledge (directly or indirectly) and since there is no other properties that I can see to notify the container not to advance, then I am confused why the container does advance.
What I noticed is that for a similar consideration, what is being recommended is to upgrade to 2.1.x and use the container stop capability that was added into the ContainerAware ErrorHandler there.
But what if you are trapped in 1.3.x for the time being, is there a way or missing property that can be used to ensure the container does not advance to the next message or batch of messages?
I can see an option to create a custom framework around the consumer in order to achieve the desired effect. But are there other options, simpler, and more spring friendly.
Thoughts?

From #garyrussell (spring-kafka github project)
The offset has not been committed but the broker won't send the data again. You have to re-seek the topics/partitions.
2.1 provides the SeekToCurrentBatchErrorHandler which will re-seek automatically for you.
2.0 Added consumer-aware listeners, giving you access to the consumer (for seeking) in the listener.
With 1.3.x you have to implement ConsumerSeekAware and perform the seeks yourself (in the listener after catching the exception). Save off the ConsumerSeekCallback in a ThreadLocal.
You will need to add the partitions to your method signature; then seek to the lowest offset in the list for each partition.

Zipkin, using existing libraries to handle tracing in microservices connected with Apache Kafka

I would like to implement tracing in my microservices architecture. I am using Apache Kafka as message broker and I am not using Spring Framework. Tracing is a new concept for me. At first I wanted to create my own implementation, but now I would like to use existing libraries. Brave looks like the one I will want to use. I would like to know if there are some guides, examples or docs on how to do this. Documentation on Github page is minimal, and I find it hard to start using Brave. Or maybe there is better library with proper documentation, that is easier to use. I will be looking at Apache HTrace because it looks promising. Some getting started guides will be nice.

There are a bunch of ways to answer this, but I'll answer it from the "one-way" perspective. The short answer though, is I think you have to roll your own right now!
While Kafka can be used in many ways, it can be used as a transport for unidirectional single producer single consumer messages. This action is similar to normal one-way RPC, where you have a request, but no response.
In Zipkin, an RPC span is usually request-response. For example, you see timing of the client sending to the server, and also the way back to the client. One-way is where you leave out the other side. The span starts with a "cs" (client send) and ends with a "sr" (server received).
Mapping this to Kafka, you would mark client sent when you produce the message and server received when the consumer receives it.
The trick to Kafka is that there is no nice place to stuff the trace context. That's because unlike a lot of messaging systems, there are no headers in a Kafka message. Without a trace context, you don't know which trace (or span for that matter) you are completing!
The "hack" approach is to stuff trace identifiers as the message key. A less hacky way would be to coordinate a body wrapper which you can nest the trace context into.
Here's an example of the former:
https://gist.github.com/adriancole/76d94054b77e3be338bd75424ca8ba30

I meet the same problem too.Here is my solution, a less hacky way as above said.
ServerSpan serverSpan = brave.serverSpanThreadBinder().getCurrentServerSpan();
TraceHeader traceHeader = convert(serverSpan);
//in kafka producer,user KafkaTemplete to send
String wrapMsg = "wrap traceHeader with originMsg ";
kafkaTemplate.send(topic, wrapMsg).get(10, TimeUnit.SECONDS);// use synchronization
//then in kafka consumer
ConsumerRecords<String, String> records = consumer.poll(5000);
// for loop
for (ConsumerRecord<String, String> record : records) {
String topic = record.topic();
int partition = record.partition();
long offset = record.offset();
String val = record.value();
//parse val to json
Object warpObj = JSON.parseObject(val);
TraceHeader traceHeader = warpObj.getTraceHeader();
//then you can do something like this
MyRequest myRequest = new MyRequest(traceHeader, "/esb/consumer", "POST");
brave.serverRequestInterceptor().handle(new HttpServerRequestAdapter(new MyHttpServerRequest(myRequest), new DefaultSpanNameProvider()));
//then some httprequest within brave-apache-http-interceptors
//http.post(url,content)
}
you must implements MyHttpServerRequest and MyRequest.It is easy,you just return something a span need,such as uri,header,method.
This is a rough and ugly code example,just offer an idea.

How can I send large messages with Kafka (over 15MB)?

I send String-messages to Kafka V. 0.8 with the Java Producer API.
If the message size is about 15 MB I get a MessageSizeTooLargeException.
I have tried to set message.max.bytesto 40 MB, but I still get the exception. Small messages worked without problems.
(The exception appear in the producer, I don't have a consumer in this application.)
What can I do to get rid of this exception?
My example producer config
private ProducerConfig kafkaConfig() {
Properties props = new Properties();
props.put("metadata.broker.list", BROKERS);
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("request.required.acks", "1");
props.put("message.max.bytes", "" + 1024 * 1024 * 40);
return new ProducerConfig(props);
}
Error-Log:
4709 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 214 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
4869 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 217 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5035 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 220 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5198 [main] WARN kafka.producer.async.DefaultEventHandler - Produce request with correlation id 223 failed due to [datasift,0]: kafka.common.MessageSizeTooLargeException
5305 [main] ERROR kafka.producer.async.DefaultEventHandler - Failed to send requests for topics datasift with correlation ids in [213,224]
kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
at kafka.producer.async.DefaultEventHandler.handle(Unknown Source)
at kafka.producer.Producer.send(Unknown Source)
at kafka.javaapi.producer.Producer.send(Unknown Source)

You need to adjust three (or four) properties:
Consumer side:fetch.message.max.bytes - this will determine the largest size of a message that can be fetched by the consumer.
Broker side: replica.fetch.max.bytes - this will allow for the replicas in the brokers to send messages within the cluster and make sure the messages are replicated correctly. If this is too small, then the message will never be replicated, and therefore, the consumer will never see the message because the message will never be committed (fully replicated).
Broker side: message.max.bytes - this is the largest size of the message that can be received by the broker from a producer.
Broker side (per topic): max.message.bytes - this is the largest size of the message the broker will allow to be appended to the topic. This size is validated pre-compression. (Defaults to broker's message.max.bytes.)
I found out the hard way about number 2 - you don't get ANY exceptions, messages, or warnings from Kafka, so be sure to consider this when you are sending large messages.

Minor changes required for Kafka 0.10 and the new consumer compared to laughing_man's answer:
Broker: No changes, you still need to increase properties message.max.bytes and replica.fetch.max.bytes. message.max.bytes has to be equal or smaller(*) than replica.fetch.max.bytes.
Producer: Increase max.request.size to send the larger message.
Consumer: Increase max.partition.fetch.bytes to receive larger messages.
(*) Read the comments to learn more about message.max.bytes<=replica.fetch.max.bytes

The answer from #laughing_man is quite accurate. But still, I wanted to give a recommendation which I learned from Kafka expert Stephane Maarek. We actively applied this solution in our live systems.
Kafka isn’t meant to handle large messages.
Your API should use cloud storage (for example, AWS S3) and simply push a reference to S3 to Kafka or any other message broker. You'll need to find a place to save your data, whether it can be a network drive or something else entirely, but it shouldn't be a message broker.
If you don't want to proceed with the recommended and reliable solution above,
The message max size is 1MB (the setting in your brokers is called message.max.bytes) Apache Kafka. If you really needed it badly, you could increase that size and make sure to increase the network buffers for your producers and consumers.
And if you really care about splitting your message, make sure each message split has the exact same key so that it gets pushed to the same partition, and your message content should report a “part id” so that your consumer can fully reconstruct the message.
If the message is text-based try to compress the data, which may reduce the data size, but not magically.
Again, you have to use an external system to store that data and just push an external reference to Kafka. That is a very common architecture and one you should go with and widely accepted.
Keep that in mind Kafka works best only if the messages are huge in amount but not in size.
Source: https://www.quora.com/How-do-I-send-Large-messages-80-MB-in-Kafka

The idea is to have equal size of message being sent from Kafka Producer to Kafka Broker and then received by Kafka Consumer i.e.
Kafka producer --> Kafka Broker --> Kafka Consumer
Suppose if the requirement is to send 15MB of message, then the Producer, the Broker and the Consumer, all three, needs to be in sync.
Kafka Producer sends 15 MB --> Kafka Broker Allows/Stores 15 MB --> Kafka Consumer receives 15 MB
The setting therefore should be:
a) on Broker:
message.max.bytes=15728640
replica.fetch.max.bytes=15728640
b) on Consumer:
fetch.message.max.bytes=15728640

You need to override the following properties:
Broker Configs($KAFKA_HOME/config/server.properties)
replica.fetch.max.bytes
message.max.bytes
Consumer Configs($KAFKA_HOME/config/consumer.properties)
This step didn't work for me. I add it to the consumer app and it was working fine
fetch.message.max.bytes
Restart the server.
look at this documentation for more info:
http://kafka.apache.org/08/configuration.html

I think, most of the answers here are kind of outdated or not entirely complete.
To refer on the answer of Sacha Vetter (with the update for Kafka 0.10), I'd like to provide some additional Information and links to the official documentation.
Producer Configuration:
max.request.size (Link) has to be increased for files bigger than 1 MB, otherwise they are rejected
Broker/Topic configuration:
message.max.bytes (Link) may be set, if one like to increase the message size on broker level. But, from the documentation: "This can be set per topic with the topic level max.message.bytes config."
max.message.bytes (Link) may be increased, if only one topic should be able to accept lager files. The broker configuration must not be changed.
I'd always prefer a topic-restricted configuration, due to the fact, that I can configure the topic by myself as a client for the Kafka cluster (e.g. with the admin client). I may not have any influence on the broker configuration itself.
In the answers from above, some more configurations are mentioned as necessary:
replica.fetch.max.bytes (Link) (Broker config)
From the documentation: "This is not an absolute maximum, if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that progress can be made."
max.partition.fetch.bytes (Link) (Consumer config)
From the documentation: "Records are fetched in batches by the consumer. If the first record batch in the first non-empty partition of the fetch is larger than this limit, the batch will still be returned to ensure that the consumer can make progress."
fetch.max.bytes (Link) (Consumer config; not mentioned above, but same category)
From the documentation: "Records are fetched in batches by the consumer, and if the first record batch in the first non-empty partition of the fetch is larger than this value, the record batch will still be returned to ensure that the consumer can make progress."
Conclusion: The configurations regarding fetching messages are not necessary to change for processing messages, lager than the default values of these configuration (had this tested in a small setup). Probably, the consumer may always get batches of size 1. However, two of the configurations from the first block has to be set, as mentioned in the answers before.
This clarification should not tell anything about performance and should not be a recommendation to set or not to set these configuration. The best values has to be evaluated individually depending on the concrete planned throughput and data structure.

One key thing to remember that message.max.bytes attribute must be in sync with the consumer's fetch.message.max.bytes property. the fetch size must be at least as large as the maximum message size otherwise there could be situation where producers can send messages larger than the consumer can consume/fetch. It might worth taking a look at it.
Which version of Kafka you are using? Also provide some more details trace that you are getting. is there some thing like ... payload size of xxxx larger
than 1000000 coming up in the log?

For people using landoop kafka:
You can pass the config values in the environment variables like:
docker run -d --rm -p 2181:2181 -p 3030:3030 -p 8081-8083:8081-8083 -p 9581-9585:9581-9585 -p 9092:9092
-e KAFKA_TOPIC_MAX_MESSAGE_BYTES=15728640 -e KAFKA_REPLICA_FETCH_MAX_BYTES=15728640 landoop/fast-data-dev:latest `
This sets topic.max.message.bytes and replica.fetch.max.bytes on the broker.
And if you're using rdkafka then pass the message.max.bytes in the producer config like:
const producer = new Kafka.Producer({
'metadata.broker.list': 'localhost:9092',
'message.max.bytes': '15728640',
'dr_cb': true
});
Similarly, for the consumer,
const kafkaConf = {
"group.id": "librd-test",
"fetch.message.max.bytes":"15728640",
... .. }

Here is how I achieved successfully sending data up to 100mb using kafka-python==2.0.2:
Broker:
consumer = KafkaConsumer(
...
max_partition_fetch_bytes=max_bytes,
fetch_max_bytes=max_bytes,
)
Producer (See final solution at the end):
producer = KafkaProducer(
...
max_request_size=KafkaSettings.MAX_BYTES,
)
Then:
producer.send(topic, value=data).get()
After sending data like this, the following exception appeared:
MessageSizeTooLargeError: The message is n bytes when serialized which is larger than the total memory buffer you have configured with the buffer_memory configuration.
Finally I increased buffer_memory (default 32mb) to receive the message on the other end.
producer = KafkaProducer(
...
max_request_size=KafkaSettings.MAX_BYTES,
buffer_memory=KafkaSettings.MAX_BYTES * 3,
)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.