Big GC pause caused by Kafka producer in microservice - java

One of my web services sends a log message (size: about 10 KB) to Kafka (version 0.10.2.1) for each online request, and I found that the KafkaProducer consumes a lot of memory, which causes long GC pauses.
There is only one Kafka producer in my service, as officially recommended.
I am just wondering if anyone has suggestions on how to send messages to Kafka without any impact on the online services?

Sounds like the producer isn't able to keep up with the rate at which your service is generating logs. (This is necessarily speculation, since you've given minimal details about your setup.)
Have you benchmarked your Kafka cluster? Is it able to sustain the kind of load you generate?
Another avenue would be to decouple your Kafka producer from your actual service. Since you're dealing with log messages, your application can simply write the logs to disk, and you can have a separate process reading these log files and sending them to Kafka. This way, producing messages won't impact your main service.
You can even have the Kafka producer running on a different VM/container entirely, and read the logs via something like an NFS mount.
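As a rough sketch of that decoupled shipper (the broker address, log path, and topic name here are assumptions; a real shipper would also need to handle log rotation and persist its read offset, which this skips):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class LogShipper {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = Files.newBufferedReader(
                     Paths.get("/var/log/myservice/app.log"))) { // assumed log path
            while (true) {
                String line = reader.readLine();
                if (line == null) {          // end of file: wait for new lines
                    Thread.sleep(500);
                    continue;
                }
                producer.send(new ProducerRecord<>("service-logs", line)); // assumed topic
            }
        }
    }
}
```

Since this process runs separately, its producer buffers and GC behavior no longer affect the request-serving JVM at all.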

Related

Kafka consumer group not rebalancing when increasing partitions

I have a situation where in my dev environment, my Kafka consumer groups will rebalance and distribute partitions to consumer instances just fine after increasing the partition count of a subscribed topic.
However, when we deploy our product into its Kubernetes environment, we aren't seeing the consumer groups rebalance after increasing the partition count of the topic. Kafka recognizes the increase, which can be seen in the server logs or by describing the topic from the command line. However, the consumer groups won't rebalance and recognize the new partitions. From my local testing, Kafka respects metadata.max.age.ms (default 5 minutes), but in Kubernetes the group never rebalances.
I don't know if it affects anything, but we're using static membership.
The consumers are written in Java and use the standard Kafka Java library. No messages are flowing through Kafka, and adding messages doesn't help. I don't see anything special in the server or consumer configurations that differs from my dev environment. Is anyone aware of any configuration that may affect this behavior?
**Update**
The issue was only occurring for a new topic. At first, the consumer application was starting before the producer application (which is intended to create the topic), so the consumer was auto-creating the topic. In this scenario, the topic defaulted to 1 partition. When the producer application started, it updated the partition count per its configuration. After that update, we never saw a rebalance.
Next we tried disabling consumer auto topic creation to address this. This prevented the consumer application from auto-creating the topic on subscription. Yet even after the topic was created by the producer app, the consumer group never rebalanced, so the consumer application would sit idle.
According to the documentation I've found, and testing in my dev environment, both of these situations should trigger a rebalance. For whatever reason, we don't see that happen in our deployments. My temporary workaround is just to ensure that the topic is already created before allowing my consumers to subscribe. I don't like it, but it works for now. I suspect that the different behavior I'm seeing is due to my dev environment running a single Kafka broker versus the Kubernetes deployments running a cluster, but that's just a guess.
Kafka clients refresh topic metadata only every 5 minutes by default, so they will not detect partition changes immediately, as you've noticed. The deployment method of your app shouldn't matter, as long as network requests are properly reaching the broker.
Also, check your partition assignment strategy to see whether it's using sticky assignment. This will depend on what version of the client you're using, as the defaults changed around 2.7, I think.
"No messages are flowing through Kafka"
If there's no data on the new partitions, there's no real need to rebalance in order to consume from them.
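For illustration, both knobs can be set explicitly in the consumer config. This is a minimal sketch: the broker address and group id are placeholders, and CooperativeStickyAssignor is just one example assignor (it requires a reasonably recent client):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.util.Properties;

public class ConsumerSetup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Refresh topic metadata every 60s instead of the 5-minute default,
        // so newly added partitions are noticed sooner.
        props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "60000");

        // Pin the assignor explicitly rather than relying on version-dependent defaults.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // ... subscribe and poll as usual
    }
}
```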

Kafka log compaction callback

I'm using Kafka log compaction, and I was wondering if there is any callback function that I, as a consumer, can have invoked when the Kafka broker performs log compaction of my topic.
So far I cannot see any callback for this, so I was wondering what the standard strategy is to detect that log compaction has taken place.
Regards
The client itself has no communication with the broker for such events. In the past, we used Splunk to capture the compaction events from the LogCleaner process logs, and we could generate webhook events based on that if we needed it for any reason (we only used it for administrative debugging; clients never needed it).

How to share chronicle queue between multiple micro services in AWS

We had a micro service approach for one of our systems using Kafka as an event bus.
We had some latency problems and experimented with replacing Kafka topics with a bunch of Chronicle queues. When running locally on a developer machine, the results were amazing: one of our most expensive workflows was processed ten to thirty times faster.
Given the initial good results, we decided to take the experiment further and deploy our proof of concept in AWS, which is where our system runs. Our microservices run in Docker containers across a number of EC2 instances.
We created an EFS volume and mounted it on each Docker container. We verified the volume was accessible from each microservice and that the right read/write permissions were granted.
Now the problem:
MS1 receives a message (an API call), does some processing, and emits an event to a Chronicle queue. We can see on the EFS file system that the Chronicle queue file is touched. MS2 is supposed to consume that event and do some further processing. This is not happening. Eventually restarting MS2 will trigger the message processing, but this is not always the case. Easy to imagine the disappointment.
The question:
Is our EFS approach wrong? If yes what would be the way to go?
Thank you in advance for your inputs.
Chronicle Queue cannot work on a Network File System like EFS, as discussed in this previous question and also documented here: https://github.com/OpenHFT/Chronicle-Queue/#usage
To communicate between hosts you need Chronicle Queue Enterprise which supports TCP/IP replication.
Please also note the documentation for running with Docker.

Fail safe mechanism for Kafka

I am working on an application that writes to a Kafka queue which is read by another application. When I am unable to send a message to Kafka due to a network or other issue, I need to write the messages generated during the Kafka downtime somewhere else, e.g. Oracle or the local file system, so that I don't lose them. The problem with Oracle or another DB is that it too can go down. Are there any recommendations on how I could achieve fail-safety during Kafka downtime?
The number of messages generated is approximately 20-25 million per day. For messages stored during downtime, I am planning to have a batch job resend them to the destination application once the target application is up again.
Thank you
You can push those messages into a cloud-based messaging service like SQS. It supports around 3,000 messages per second.
There is also a connector that allows you to push the messages back into Kafka directly, with no other headaches.
If you can't export the data out of your local network, then a cluster of RabbitMQ instances may help, although it wouldn't be a plug-and-play solution.
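Whatever fallback store you pick, the hook point is the producer's send callback. Here is a minimal sketch of the local-file variant the question mentions (the spool path is a placeholder, and the batch job that replays the spool file is out of scope):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FailSafeSender {
    private final KafkaProducer<String, String> producer;
    private final Path spool = Paths.get("/var/spool/myapp/kafka-fallback.log"); // placeholder

    public FailSafeSender(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void send(String topic, String value) {
        // The callback fires after the producer's own retries are exhausted.
        producer.send(new ProducerRecord<>(topic, value), (metadata, exception) -> {
            if (exception != null) {
                spool(value); // Kafka unreachable: keep the message locally for replay
            }
        });
    }

    private synchronized void spool(String value) {
        try {
            Files.write(spool,
                    (value + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            // last resort: alert/log; at this point the message may be lost
        }
    }
}
```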

is there a java pattern for a process to constantly run to poll or listen for messages off a queue and process them?

Planning on moving a lot of our single-threaded, synchronous batch jobs to a more distributed architecture with workers. The thought is to have a master process read records off the database and send them to a queue, then have multiple workers read off the queue to process the records in parallel.
Is there any well-known Java pattern for a simple CLI/batch job that constantly runs to poll/listen for messages on queues? I would like to use that for all the workers. Or is there a better way to do this? Should the listener/worker be deployed in an app container, or can it be just a standalone program?
Thanks
Edit: also to note, I'm not looking to use JavaEE/JMS, but rather hosted solutions like SQS, a hosted RabbitMQ, or IronMQ.
If you're using a JavaEE application server (and if not, you should), you don't have to program that logic by hand since the application server does it for you.
You then implement and deploy a message driven bean that listens to a queue and processes the message received. The application server will manage a connection pool to listen to queue messages and create a thread with an instance of your message driven bean which will receive the message and be able to process it.
The messages will be processed concurrently since the application server will have a connection pool and a thread pool available to listen to the queue.
All JavaEE-featured application servers, like IBM WebSphere or JBoss, have configurations available in their admin consoles to create message queue listeners depending on the message queue implementation, and then to bind these listeners to your message-driven bean.
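For illustration, a minimal message-driven bean could look like this sketch (the queue JNDI name jms/workQueue is an assumption; the application server supplies the listener threads and delivery loop):

```java
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

// The container instantiates this bean and invokes onMessage for each
// message arriving on the configured queue.
@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType",
                              propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destinationLookup",
                              propertyValue = "jms/workQueue") // assumed JNDI name
})
public class RecordWorkerBean implements MessageListener {
    @Override
    public void onMessage(Message message) {
        try {
            if (message instanceof TextMessage) {
                String record = ((TextMessage) message).getText();
                // process the record here
            }
        } catch (Exception e) {
            // rethrowing rolls back delivery so the message is redelivered
            throw new RuntimeException(e);
        }
    }
}
```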
I don't know a lot about this, and I may not really be answering your question, but I tried something a few months ago that might interest you for dealing with message queues.
You can have a look at this: http://www.rabbitmq.com/getstarted.html
It seems the Work Queues tutorial could fit your requirements.
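For reference, the worker side of that tutorial boils down to a standalone program like the sketch below (the queue name work_queue and the broker host are assumptions), which matches the standalone poll/listen worker you described:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker host
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.queueDeclare("work_queue", true, false, false, null);
        channel.basicQos(1); // deliver one unacked message at a time per worker

        DeliverCallback deliverCallback = (consumerTag, delivery) -> {
            String record = new String(delivery.getBody(), StandardCharsets.UTF_8);
            // process the record here
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        // autoAck=false, so a message is requeued if a worker dies mid-processing
        channel.basicConsume("work_queue", false, deliverCallback, consumerTag -> {});
    }
}
```

Run several copies of this program and the broker round-robins records across them, which gives you the parallel workers without an app container.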
