Poor performance of log4j2 in combination with Kafka - java

I'm having performance issues using log4j2 (2.5) in combination with Kafka (0.10.1.0). When I enable Kafka in the log4j2.xml file, my application slows down to a crawl, whilst only outputting around 200KB/s of events to the Kafka broker. This is orders of magnitude lower than that what Kafka is supposed to achieve (https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines).
Here is the relevant part of my log4j2.xml config file:
<Kafka name="KafkaAll" topic="all">
<PatternLayout pattern="%date %message" />
<Property name="bootstrap.servers">localhost:9092</Property>
<Property name="buffer.memory">67108864</Property>
<Property name="batch.size">8196</Property>
<Property name="acks">1</Property>
</Kafka>
After running some tests I've been able to pin down the problem and found out that the ProducerPerformance test shipped with Kafka does achieve decent performance. Its performance is around 5MB/s, with similar sized messages of 100 bytes. After extensive testing, I found out that the difference lies not in the configuration, but in the way the calls are implemented. The log4j2 KafkaAppender uses the KafkaManager class to write log to Kafka:
public void send(final byte[] msg) throws ExecutionException, InterruptedException, TimeoutException {
if (producer != null) {
producer.send(new ProducerRecord<byte[], byte[]>(topic, msg)).get(timeoutMillis, TimeUnit.MILLISECONDS);
}
}
The performance problem is caused by the call to the "get" method, which blocks until the send has completed. Funny enough, there is a log4j appender included with Kafka that does take this issue into account:
Future<RecordMetadata> response = producer.send(new ProducerRecord<byte[], byte[]>(topic, message.getBytes()));
if (syncSend) {
try {
response.get();
} catch (InterruptedException ex) {
throw new RuntimeException(ex);
} catch (ExecutionException ex) {
throw new RuntimeException(ex);
}
}
In other words, when syncSend is set to false, the call to send returns immediately. This "syncSend" property is however nowhere to be found in the log4j2 implementation of the KafkaAppender.
I've tried making calls asynchronous through other means, such as using the AsyncAppender shipped with log4j2 and setting the acks property to 0 instead of 1. However, none of the settings achieve the performance gain by not waiting for the send to complete. I've also tried to use the log4j appender shipped with Kafka, but I did not manage to get it to work with log4j2 (and I want to stick with log4j2).
So, finally, I have decided to fork the KafkaAppender shipped with log4j2 and remove the blocking part of the call. This works, but of course I'd rather be using of-the-shelf packages.
Is there anyone who also ran into this problem? How did you solve the issue? Is there an easier way without changing code?

The KafkaAppender now has a new attribute "syncSend" that can be set to support asynchronous sending, significantly improving throughput to Kafka. Thank you for your support!

Related

Spring Cloud Stream - notice and handle errors in broker

I am fairly new to developing distributed applications with messaging, and to Spring Cloud Stream in particular. I am currently wondering about best practices on how to deal with errors on the broker side.
In our application, we need to both consume and produce messages from/to multiple sources/destinations like this:
Consumer side
For consuming, we have defined multiple #Beans of type java.util.function.Consumer. The configuration for those looks like this:
spring.cloud.stream.bindings.consumeA-in-0.destination=inputA
spring.cloud.stream.bindings.consumeA-in-0.group=$Default
spring.cloud.stream.bindings.consumeB-in-0.destination=inputB
spring.cloud.stream.bindings.consumeB-in-0.group=$Default
This part works quite well - wenn starting the application, the exchanges "inputA" and "inputB" as well as the queues "inputA.$Default" and "inputB.$Default" with corresponding binding are automatically created in RabbitMQ.
Also, in case of an error (e.g. a queue is suddenly not available), the application gets notified immediately with a QueuesNotAvailableException and continuously tries to re-establish the connection.
My only question here is: Is there some way to handle this exception in code? Or, what are best practices to deal with failures like this on broker side?
Producer side
This one is more problematic. Producing messages is triggered by some internal logic, we cannot use function #Beans here. Instead, we currently rely on StreamBridge to send messages. The problem is that this approach does not trigger creation of exchanges and queues on startup. So when our code calls streamBridge.send("outputA", message), the message is sent (result is true), but it just disappears into the void since RabbitMQ automatically drops unroutable messages.
I found that with this configuration, I can at least get RabbitMQ to create exchanges and queues as soon as the first message is sent:
spring.cloud.stream.source=produceA;produceB
spring.cloud.stream.default.producer.requiredGroups=$Default
spring.cloud.stream.bindings.produceA-out-0.destination=outputA
spring.cloud.stream.bindings.produceB-out-0.destination=outputB
I need to use streamBridge.send("produceA-out-0", message) in code to make it work, which is not too great since it means having explicit configuration hardcoded, but at least it works.
I also tried to implement the producer in a Reactor style as desribed in this answer, but in this case the exchange/queue also is not created on application startup and the sent message just disappears even though the return status of the sending method is "OK".
Failures on the broker side are not registered at all with this approach - when I simulate one e.g. by deleting the queue or the exchange, it is not registered by the application. Only when another message is sent, I get in the logs:
ERROR 21804 --- [127.0.0.1:32404] o.s.a.r.c.CachingConnectionFactory : Shutdown Signal: channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - no exchange 'produceA-out-0' in vhost '/', class-id=60, method-id=40)
But still, the result of StreamBridge#send was true in this case. But we need to know that sending did actually fail at this point (we persist the state of the sent object using this boolean return value). Is there any way to accomplish that?
Any other suggestions on how to make this producer scenario more robust? Best practices?
EDIT
I found an interesting solution to the producer problem using correlations:
...
CorrelationData correlation = new CorrelationData(UUID.randomUUID().toString());
messageHeaderAccessor.setHeader(AmqpHeaders.PUBLISH_CONFIRM_CORRELATION, correlation);
Message<String> message = MessageBuilder.createMessage(payload, messageHeaderAccessor.getMessageHeaders());
boolean sent = streamBridge.send(channel, message);
try {
final CorrelationData.Confirm confirm = correlation.getFuture().get(30, TimeUnit.SECONDS);
if (correlation.getReturned() == null && confirm.isAck()) {
// success logic
} else {
// failed logic
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
// failed logic
} catch (ExecutionException | TimeoutException e) {
// failed logic
}
using these additional configurations:
spring.cloud.stream.rabbit.default.producer.useConfirmHeader=true
spring.rabbitmq.publisher-confirm-type=correlated
spring.rabbitmq.publisher-returns=true
This seems to work quite well, although I'm still clueless about the return value of StreamBridge#send, it is always true and I cannot find information in which cases it would be false. But the rest is fine, I can get information on issues with the exchange or the queue from the correlation or the confirm.
But this solution is very much focused on RabbitMQ, which causes two problems:
our application should be able to connect to different brokers (e.g. Azure Service Bus)
in tests we use Kafka binder and I don't know how to configure the application context to make it work in this case, too
Any help would be appreciated.
On the consumer side, you can listen for an event such as the ListenerContainerConsumerFailedEvent.
https://docs.spring.io/spring-amqp/docs/current/reference/html/#consumer-events
On the producer side, producers only know about exchanges, not any queues bound to them; hence the requiredGroups property which causes the queue to be bound.
You only need spring.cloud.stream.default.producer.requiredGroups=$Default - you can send to arbitrary destinations using the StreamBridge and the infrastructure will be created.
#SpringBootApplication
public class So70769305Application {
public static void main(String[] args) {
SpringApplication.run(So70769305Application.class, args);
}
#Bean
ApplicationRunner runner(StreamBridge bridge) {
return args -> bridge.send("foo", "test");
}
}
spring.cloud.stream.default.producer.requiredGroups=$Default

Safe way to use batch listener

I am trying to use spring-kafka 1.3.x (1.3.3 and 1.3.4). What is not clear is whether there is a safe way to consume messages in batch without skipping a message (or set of messages) when an exception occurs eg network outage. My preference is also to leverage the container capabilities as much as possible to remain in Spring framework rather than trying to create a custom framework for dealing with this challenge.
I am setting the following properties onto a ConcurrentMessageListenerContainer :
.setAckOnError(false);
.setAckMode(AckMode.MANUAL);
I am also setting the following kafka specific consumer properties:
enable.auto.commit=false
auto.offset.reset=earliest
If I set a RetryTemplate, I get a class cast exception since it only works for non-batch consumers. Documentation states retry is not available for batch so this may be OK.
I then setup a consumer such as this one:
```java
#KafkaListener(containerFactory = "conatinerFactory",
groupId = "myGroup",
topics = "myTopic")
public void onMessage(#Payload List<Entries> batchedData,
#Header(required = false,
value = KafkaHeaders.OFFSET) List<Long> offsets,
Acknowledgment ack) {
log.info("Working on: {}" + offsets);
int x = 1;
if(x == 1) {
log.info("Failure on: {}" + offsets);
throw new RuntimeException("mock failure");
}
// do nothing else for now
// unreachable code
ack.acknowledge();
}
```
When I send a message into the system to mock the exception above then the only visible action to me is that the listener reports the exception.
When I send another (new) message into the system, the container consumes the new message. The old message is skipped since the offset is advanced to the next offset.
Since I have asked the container not to acknowledge (directly or indirectly) and since there is no other properties that I can see to notify the container not to advance, then I am confused why the container does advance.
What I noticed is that for a similar consideration, what is being recommended is to upgrade to 2.1.x and use the container stop capability that was added into the ContainerAware ErrorHandler there.
But what if you are trapped in 1.3.x for the time being, is there a way or missing property that can be used to ensure the container does not advance to the next message or batch of messages?
I can see an option to create a custom framework around the consumer in order to achieve the desired effect. But are there other options, simpler, and more spring friendly.
Thoughts?
From #garyrussell (spring-kafka github project)
The offset has not been committed but the broker won't send the data again. You have to re-seek the topics/partitions.
2.1 provides the SeekToCurrentBatchErrorHandler which will re-seek automatically for you.
2.0 Added consumer-aware listeners, giving you access to the consumer (for seeking) in the listener.
With 1.3.x you have to implement ConsumerSeekAware and perform the seeks yourself (in the listener after catching the exception). Save off the ConsumerSeekCallback in a ThreadLocal.
You will need to add the partitions to your method signature; then seek to the lowest offset in the list for each partition.

How do I stop a Camel route when JVM gets to a certain heap size?

I am using Apache Camel to connect to various endpoints, including JMS topics, and write to a database. Sometimes the database connection fails (for whatever reason, database issue, network blip, etc) and the messages from the topic subscriber start backing up. At a certain point, there are so many messages backed up waiting to be written to the database that the application throws an out of memory error. So far I understand all that.
The problem I have is the following: When the application is frantically trying to garbage collect before eventually giving up and accepting that it is out of memory, the application stops working, but is still alive. This means that the topic subscriber is still seen as active by the JMS provider, but not reading anything off the topic, so the provider starts queueing up the messages. Eventually the provider falls over also when the maximum depth runs out.
How can I configure my application to either disconnect when reaching a certain heap usage, or kill itself completely much much faster when running out of memory? I believe there are some JVM parameters that allow the application to kill itself much quicker when running out of memory, but I am wondering if that is the best solution or whether there is another way?
First of all I think you should use a JDBC connection pool that is capable of refreshing failed connections. So you do not run into the described scenario in the first place. At least not if the DB/network issue is short lived.
Next I'd protect the message broker by applying producer flow control (at least thats how it is called in ActiveMQ). I.e. prevent message producers from submitting more messages if a certain memory threshold has been breached. If the thresholds are set correctly, then that will prevent your message broker from falling over.
As for your original question: I'd use JMX to monitor the VM. If some metric, e.g. memory, breaches a threshold then you can suspend or shut down the route or the whole Camel context via the MBeans Camel exposes.
You can control (start/stop and suspend/resume) Camel routes using the Camel context methods .stop(), .start(), .suspend() and .resume().
You can spin a separate thread that monitors the current VM memory and stops the required route when a certain condition is met.
new Thread() {
#Override
public void run() {
while(true) {
long free = Runtime.getRuntime().freeMemory();
boolean routeRunning = camelContext.isRouteStarted("yourRoute");
if (free < threshold && routeRunning) {
camelContext.stopRoute("yourRoute");
} else if (free > threshold && !routeRunning) {
camelContext.startRoute("yourRoute");
}
// Check every 10 seconds
Thread.sleep(10000);
}
}
}
As commented in the other answer, relying on this is not particularly robust, but at least a little more robust than getting an OutOfMemoryException. Note that you need to .stop() the route, .suspend() does not deallocate resources, which means the connection with the queue provider is still open and the service looks like it is open for business.
You can also stop the route as part of the error handling of the route itself (this is possibly more robust but would require manual intervention to restart the route once the error is cleared, or a scheduled route that periodically checks if the error condition still exists and restart the route if it is gone). The thing to keep in mind is that you cannot stop a route from the same thread that is servicing the route at the time so you need to spin a separate thread that does the stopping. For example:
route("sample").from("jms://myqueue")
// Handle SQL Exceptions by shutting down the route
.onException(SQLException.class)
.process(new Processor() {
// This processor spawns a new thread that stops the current route
Thread stop;
#Override
public void process(final Exchange exchange) throws Exception {
if (stop == null) {
stop = new Thread() {
#Override
public void run() {
try {
// Stop the current route
exchange.getContext().stopRoute("sample");
} catch (Exception e) {}
}
};
}
// start the thread in background
stop.start();
}
})
.end()
// Standard route processors go here
.to(...);

Which Appenders should be used in distributed system? How to configure them?

I am trying to add logging component to distributed system. It is written in AspectJ to avoid chaining current source-code. I use socket appender to send logs, but I'd like to try something more effective.
I've heard I should use JMSAppender and AsyncAppender, but I failed to configure it. Should I create Receiver which gathers logs and pass them to database and to GUI (I use ChainSaw)?
I tried to follow turorial1 and tutorial2 , but they aren't clear enough.
Edit:
In a small demo I've prepared I sent 6 logs for a request (simulation of 3 components)
[2012-08-08 15:40:28,957] [request1344433228957] [Component_A] [start]
[2012-08-08 15:40:32,050] [request1344433228957] [Component_B] [start]
[2012-08-08 15:40:32,113] [request1344433228957] [Component_C] [start]
[2012-08-08 15:40:32,113] [request1344433228957] [Component_C] [end - throwing]
[2012-08-08 15:40:32,144] [request1344433228957] [Component_B] [end]
[2012-08-08 15:40:32,175] [request1344433228957] [Component_A] [end]
Using socket Appender. So my log4j.properties is:
log4j.rootLogger=DEBUG, server
log4j.appender.server=org.apache.log4j.net.SocketAppender
log4j.appender.server.Port=4712
log4j.appender.server.RemoteHost=localhost
log4j.appender.server.ReconnectionDelay=1000
so I run
>java -classpath log4j-1.2.17.jar org.apache.log4j.net.SimpleSocketServer 4712 log4j-server.properties
with configuration
log4j.rootLogger=DEBUG, CA, FA
#
log4j.appender.CA=org.apache.log4j.ConsoleAppender
log4j.appender.CA.layout=org.apache.log4j.PatternLayout
log4j.appender.CA.layout.ConversionPattern=[%d] [%t] [%c] [%m]%n
#
log4j.appender.FA=org.apache.log4j.FileAppender
log4j.appender.FA.File=report.log
log4j.appender.FA.layout=org.apache.log4j.PatternLayout
log4j.appender.FA.layout.ConversionPattern=[%d] [%t] [%c] [%m]%n
Then I send my logs from file to Chainsaw:
It is absolutely basic, but I want to learn how to do it better. First of all, I'd like to send logs asynchronously. Then create very simple Receiver, which e.g. can pass logs to a file.
I tried to follow tutorials I listed above, but I failed. So question is: could you provide some example configuration? Example of Receiver.java and log4.properties files?
I would use NFS or CDFS and mount a drive on all the machines. Have each application instance write to a different file. You will be able to find all the logs in one directory (or drive) no matter how many machines you use.
I wouldn't use NFS or CDFS over a global WAN with a high latency e.g. > 50 ms round trip. In this cause I have used JMS (but I didn't use log4j)
My two cents.. Whatever you do, make sure that you use asynchronous mechanism to deliver your logs to the receiver, otherwise it will eventually stall your apps. Another point, to deliver logs reliably you should consider a fail over mechanism built into the appender itself - receivers may go offline for short or long time, if you care for the logs, the fail over is definitely required. We have built similar system you describe (sorry for the add), but if you like you can use our appender (look in downloads), it's free and has the sources. There is also a video tutorial. It has fail over and flexible asynchronous mechanism plus a backup fall back.
How many appenders should you use? One appender per jvm will do all right. Config files should probably be per jvm, not sure how you intend to implement the receiver, in any case the appenders need to find your receiver which is usually host port pair at least. Regarding the database, my experience is very sour with RDBMS (we are moving to nosql) but if you don't go above couple of hundred million records, most commercial databases will do with some effort. Not a simple task I must say, took us couple of years to build commercial quality system you just drawn with few skinny rectangles :)
Finally I've found how to configure it. I put 2 files into src folder.
jndi.properties
topic.logTopic=logTopic
and log4j-jms.properties
log4j.rootLogger=INFO, stdout, jms
## Be sure that ActiveMQ messages are not logged to 'jms' appender
log4j.logger.org.apache.activemq=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=
## Configure 'jms' appender. You'll also need jndi.properties file in order to make it work
log4j.appender.jms=org.apache.log4j.net.JMSAppender
log4j.appender.jms.InitialContextFactoryName=org.apache.activemq.jndi.ActiveMQInitialContextFactory
log4j.appender.jms.ProviderURL=tcp://localhost:61616
log4j.appender.jms.TopicBindingName=logTopic
log4j.appender.jms.TopicConnectionFactoryBindingName=ConnectionFactory
Then I run my program with VM argument
-Dlog4j.configuration=log4j-jms.properties
and receive logs in class Receiver.java
public class Receiver implements MessageListener {
PrintWriter pw = new PrintWriter("result.log");
Connection conn;
Session sess;
MessageConsumer consumer;
public Receiver() throws Exception {
ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
Connection conn = factory.createConnection();
Session sess = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
conn.start();
MessageConsumer consumer = sess.createConsumer(sess.createTopic("logTopic"));
consumer.setMessageListener(this);
}
public static void main(String[] args) throws Exception {
new Receiver();
}
public void onMessage(Message message) {
try {
LoggingEvent event = (LoggingEvent) ((ActiveMQObjectMessage) message).getObject();
DateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
String nowAsString = df.format(new Date(event.getTimeStamp()));
pw.println("["+ nowAsString + "]" +
" [" + event.getThreadName()+"]" +
" ["+ event.getLoggerName() + "]" +
" ["+ event.getMessage()+"]");
pw.flush();
} catch (Exception e) {
e.printStackTrace();
}
}
}
I'd recommend syslog and the built in syslog appender. Use TCP for reliable logging (+Asyc appender maybe) or UDP for fire-and-forget logging.
I have a rsyslog config if you need.

Is this a realistic expectation of a distributed mechanism?

I've been evaluating ActiveMQ as a candidate message broker. I've written some test code to try and get an understanding of ActiveMQ's performance limitations.
I can produce a failure state in the broker by sending messages as fast as possible like this:
try {
while(true) {
byte[] payload = new byte[(int) (Math.random() * 16384)];
BytesMessage message = session.createBytesMessage();
message.writeBytes(payload);
producer.send(message);
} catch (JMSException ex) { ... }
I was surprised that the line
producer.send(message);
blocks when the broker enters a failed state. I was hoping that some exception would be thrown, so there would be some indication that the broker has failed.
I realize that my test code is spamming the broker, and I expect the broker to fail. However, I would prefer that the broker failed "loudly" as opposed to simply blocking.
Is this an unrealistic expectation?
Update:
Uri's answer references an ActiveMQ bug report that was filed in March. The bug description includes a proposal that sounds like what I'm looking for: "if the request on the transport had a timeout (this is to catch failure scenarios, so something that's not expected to reasonably happen), things would have errored out rather than building waiting threads."
However, after 8 months the bug is currently unassigned with a single vote. So I guess the question still stands, is this something ActiveMQ should (will?) implement?
You are testing the 'slow consumer' and producer flowcontrol issue all message brokers have to deal with. Do you wanna fail producers, block them or spool to disk?
Basically the out of the box default in ActiveMQ is to block producers. But you can configure message cursors to spool to disk.
BTW you've not said if you are using queues/topics or persistent/non-persistent; if you are using non persistent topics there are other strategies you can use for discarding messages etc.
Apprently there's a known issue, not sure if it's been fixed:
https://issues.apache.org/activemq/browse/AMQ-1625
Not sure about ActiveMQ config, but other JMS providers have various configuration options - so you maybe able to get ActiveMQ to do as you wish in that situation.
I know Fiorano has options to specify whether providers block or not in this situation.

Categories