We have a WebLogic server running several apps. Some of those apps use an ActiveMQ instance which is configured to use the WebLogic XA transaction manager.
About three minutes after startup, the JVM triggers an OutOfMemoryError. A heap dump shows that about 85% of all memory is occupied by a LinkedList that contains org.apache.activemq.command.XATransactionId instances. The list is a GC root object and we are not sure what is holding on to it.
What could cause this?
We had exactly the same issue on WebLogic 12c with activemq-ra: XATransactionId object instances were created continuously, overloading the server.
After more than two weeks of debugging, we found that the problem was caused by the WebLogic transaction manager trying to recover pending ActiveMQ transactions by calling the recover() method, which returns the IDs of transactions that appear incomplete and need to be recovered. WebLogic's call to this method always returned the same non-zero count n, and each call created n new XATransactionId instances.
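To illustrate the mechanism, here is roughly what such a recovery pass looks like at the XAResource level. This is a sketch, not WebLogic internals; only recover(), commit() and rollback() are the standard javax.transaction.xa API:

import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

void recoverPending(XAResource activemqResource) throws XAException {
    // Ask the resource (ActiveMQ here) for its in-doubt transaction branches.
    Xid[] pending = activemqResource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
    // Each returned Xid is materialized as an XATransactionId on the ActiveMQ side.
    // If the same n ids come back on every pass and are never resolved,
    // those objects accumulate until the heap fills up.
    for (Xid xid : pending) {
        activemqResource.rollback(xid); // or commit(xid), depending on the TM's decision
    }
}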
After some investigation, we found that WebLogic stores its transaction logs (TLOGs) in the filesystem by default, and that they can instead be persisted in a database. We suspected a problem with the file-based TLOGs and switched them to the DB, and it worked! Our server has now run for more than two weeks without a restart, and memory is stable because no XATransactionId instances are created apart from the necessary amount. ;)
I hope this helps; let us know if it works for you.
Good luck!
To be honest it sounds like you're getting a ton of JMS messages and either not consuming them or, if you are, your consumer is not acknowledging the messages if they are not in auto acknowledge mode.
Check your JMS queue backlog. There may be a queue with a high backlog that the server is trying to read, and those messages may have been corrupted by some crash.
The best option is to delete the backlog in the JMS queue, or back the messages up into some other queue.
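If deleting the backlog is acceptable, one way is to purge the queue through ActiveMQ's JMX MBean. A minimal sketch, assuming the 5.8+ object-name layout and a local JMX connector (broker name, URL and queue name are illustrative):

import javax.management.MBeanServerConnection;
import javax.management.MBeanServerInvocationHandler;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import org.apache.activemq.broker.jmx.QueueViewMBean;

public class PurgeBacklog {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Object-name format differs across ActiveMQ versions; this is the 5.8+ layout.
            ObjectName name = new ObjectName(
                    "org.apache.activemq:type=Broker,brokerName=localhost,destinationType=Queue,destinationName=MY.QUEUE");
            QueueViewMBean queue = MBeanServerInvocationHandler.newProxyInstance(mbs, name, QueueViewMBean.class, true);
            queue.purge(); // drops every message in the backlog
        } finally {
            connector.close();
        }
    }
}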
I have created a Spring Boot web application and deployed its WAR to a Tomcat container.
The application connects to MongoDB using async connections, via the mongodb-driver-async library.
At startup everything works fine, but as soon as load increases, it shows the following exception on DB connections:
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:75)
at org.springframework.web.context.request.async.WebAsyncManager$5.run(WebAsyncManager.java:392)
at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:143)
at org.apache.catalina.core.AsyncListenerWrapper.fireOnTimeout(AsyncListenerWrapper.java:44)
at org.apache.catalina.core.AsyncContextImpl.timeout(AsyncContextImpl.java:131)
at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:157)
I am using the following software versions:
Spring Boot: 1.5.4.RELEASE
Tomcat (installed as a standalone binary): apache-tomcat-8.5.37
MongoDB: v3.4.10
mongodb-driver-async: 3.4.2
As soon as I restart the Tomcat service, everything starts working fine again.
What could be the root cause of this issue?
P.S.: I am using DeferredResult and CompletableFuture to create the async REST API, as sketched below.
I have also tried setting spring.mvc.async.request-timeout in the application and configuring asyncTimeout in Tomcat, but I still get the same error.
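For reference, this is roughly the pattern I am using (simplified; the controller and loadFromMongo are illustrative placeholders, not my actual code):

import java.util.concurrent.CompletableFuture;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.context.request.async.DeferredResult;

@RestController
public class ItemController {

    @GetMapping("/items")
    public DeferredResult<String> getItems() {
        DeferredResult<String> result = new DeferredResult<>();
        CompletableFuture.supplyAsync(this::loadFromMongo)
            .whenComplete((value, ex) -> {
                if (ex != null) {
                    result.setErrorResult(ex); // without this, a failed future leaves the request open until the async timeout
                } else {
                    result.setResult(value);
                }
            });
        return result;
    }

    private String loadFromMongo() {
        return "items"; // placeholder for the real async driver call
    }
}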
It's probably obvious that Spring is timing out your requests and throwing AsyncRequestTimeoutException, which sends a 503 back to your client.
Now the question is, why is this happening? There are two possibilities.
These are legitimate timeouts. You mentioned that you only see the exceptions when the load on your server increases. So possibly your server just can't handle that load and its performance has degraded to the point where some requests can't complete before Spring times them out.
The timeouts are caused by your server failing to send a response to an asynchronous request due to a programming error, leaving the request open until Spring eventually times it out. It's easy for this to happen if your server doesn't handle exceptions well. If your server is synchronous, it's okay to be a little sloppy with exception handling because unhandled exceptions will propagate up to the server framework, which will send a response back to the client. But if you fail to handle an exception in some asynchronous code, that exception will be caught elsewhere (probably in some thread pool management code), and there's no way for that code to know that there's an asynchronous request waiting on the result of the operation that threw the exception.
It's hard to figure out what might be happening without knowing more about your application. But there are some things you could investigate.
First, try looking for resource exhaustion.
Is the garbage collector running all the time?
Are all CPUs pegged at 100%?
Is the OS swapping heavily?
If the database server is on a separate machine, is that machine showing signs of resource exhaustion?
How many connections are open to the database? If there is a connection pool, is it maxed out?
How many threads are running? If there are thread pools in the server, are they maxed out?
If something is at its limit, then it may be the bottleneck causing your requests to time out.
Try setting spring.mvc.async.request-timeout to -1 and see what happens. Do you now get responses for every request, only slowly, or do some requests seem to hang forever? If it's the latter, that strongly suggests that there's a bug in your server that's causing it to lose track of requests and fail to send responses. (If setting spring.mvc.async.request-timeout appears to have no effect, then the next thing you should investigate is whether the mechanism you're using for setting the configuration actually works.)
A strategy that I've found useful in these cases is to generate a unique ID for each request and write the ID along with some contextual information every time the server either makes an asynchronous call or receives a response from an asynchronous call, and at various checkpoints within asynchronous handlers. If requests go missing, you can use the log information to figure out the request IDs and what the server was last doing with that request.
A similar strategy is to save each request ID into a map in which the value is an object that tracks when the request was started and what your server last did with that request. (In this case your server is updating this map at each checkpoint rather than, or in addition to, writing to the log.) You can set up a filter to generate the request IDs and maintain the map. If your filter sees the server send a 5xx response, you can log the last action for that request from the map.
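A minimal sketch of that filter idea, assuming a plain servlet filter and an in-memory map (all names are illustrative; for fully async requests you would also register an AsyncListener to catch the final outcome, since doFilter can return before the response is sent):

import java.io.IOException;
import java.time.Instant;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class RequestTrackingFilter implements Filter {

    // requestId -> last checkpoint recorded for that request
    private final ConcurrentMap<String, String> lastAction = new ConcurrentHashMap<>();

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String requestId = UUID.randomUUID().toString();
        req.setAttribute("requestId", requestId);
        checkpoint(requestId, "request started");
        try {
            chain.doFilter(req, res);
        } finally {
            int status = ((HttpServletResponse) res).getStatus();
            if (status >= 500) {
                // Log what the server last did with this request before the 5xx
                System.err.println("Request " + requestId + " -> " + status
                        + "; last action: " + lastAction.get(requestId));
            }
            lastAction.remove(requestId);
        }
    }

    // Async handlers call this at each checkpoint, e.g. checkpoint(id, "mongo query sent")
    public void checkpoint(String requestId, String action) {
        lastAction.put(requestId, action + " at " + Instant.now());
    }

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void destroy() { }
}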
Hope this helps!
Asynchronous tasks are arranged in a queue (pool) and processed in parallel depending on the number of threads allocated. Not all asynchronous tasks execute at the same time; some of them are queued. In such a system, getting AsyncRequestTimeoutException is normal behaviour.
If you are filling up the queues with asynchronous tasks that are unable to execute under the pressure, increasing the timeout will only delay the problem. You should focus instead on the problem itself:
Reduce the execution time of the asynchronous tasks (through various optimizations). This will relax the pooling of async tasks. It obviously requires coding.
Increase the number of CPUs allocated in order to run the parallel tasks more efficiently.
Increase the number of threads servicing the executor of the driver.
The Mongo async driver uses AsynchronousSocketChannel, or Netty if Netty is found on the classpath. To increase the number of worker threads servicing the async communication, configure the stream factory along these lines (assuming the 3.4.x async driver API, with io.netty.channel.nio.NioEventLoopGroup and io.netty.buffer.ByteBufAllocator):

EventLoopGroup eventLoopGroup = new NioEventLoopGroup(nThreads);
MongoClientSettings settings = MongoClientSettings.builder()
    .streamFactoryFactory(new NettyStreamFactoryFactory(eventLoopGroup, ByteBufAllocator.DEFAULT))
    .build();

The nThreads argument to the NioEventLoopGroup constructor sets the number of threads servicing your async communication.
Read more about Netty configuration here https://mongodb.github.io/mongo-java-driver/3.2/driver-async/reference/connecting/connection-settings/
I've been browsing the forums for the last few days and tried almost everything I could find, but without any luck.
The situation is: inside our Java web application we have ActiveMQ 5.7 (I know it's very old; we will eventually upgrade to a newer version, but for some reasons that's not possible right now). We have only one broker and multiple consumers.
When I start the servers (I have tried with 2, 3, 4 and more of them) everything is OK. The servers communicate with each other and QUEUE messages are consumed instantly. But when I leave the servers idle (for example, to finally catch some sleep ;) ) it is no longer the case. Messages get stuck in the database and are not being consumed. The only way to have them delivered is to restart the server.
Part of my configuration (we keep it in a properties file; this is the actual state, though I have tried many different combinations):
BrokerServiceURI=broker:(tcp://0.0.0.0:{0})/{1}?persistent=true&useJmx=false&populateJMSXUserID=false&useShutdownHook=false&deleteAllMessagesOnStartup=false&enableStatistics=true
ConnectionFactoryURI=failover://({0})?initialReconnectDelay=100&timeout=6000
ConnectionFactoryServerURI=tcp://{0}:{1}?keepAlive=true&soTimeout=100&wireFormat.cacheEnabled=false&wireFormat.tightEncodingEnabled=false&wireFormat.maxInactivityDuration=0
BrokerService.startAsync=true
BrokerService.networkConnectorStartAsync=true
BrokerService.keepDurableSubsActive=false
Do you have a clue?
I can't tell the exact reason from the description above, but I can list a few checks that are fresh in my mind. Please confirm whether the following are valid for you:
Can you check the consumer connections?
Are the consumer sessions still active?
If all the consumer connections are up, then check a thread dump to see whether the active consumer threads (I'm assuming you created consumer threads; correct me if I'm wrong) are in RUNNING or WAITING state. (This happened to me: all the consumers were active, but another thread in the server was holding a lock on the Logger while posting a message to Slack, so the consumers were stuck in WAITING state.)
Check the dispatch queue size for each consumer, check each consumer's prefetch, and then compare the dispatch queue size with the prefetch; see the sketch after this list.
Is there a JMSXGroupID you are allotting to each message?
Can you tell a little more about your consumer/producer/broker configurations?
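On the prefetch point above, a hedged sketch of lowering the queue prefetch on the connection factory, so that a slow or stuck consumer does not hold a large dispatch backlog (the values and broker URL are illustrative):

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.ActiveMQPrefetchPolicy;

ActiveMQConnectionFactory createFactory(String brokerUrl) {
    ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(brokerUrl);
    ActiveMQPrefetchPolicy prefetchPolicy = new ActiveMQPrefetchPolicy();
    prefetchPolicy.setQueuePrefetch(10); // default is 1000; lower it so one consumer doesn't hoard messages
    factory.setPrefetchPolicy(prefetchPolicy);
    return factory;
}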
In my application, I've noticed that HornetQ 2.4.1 has been piling up message journal files, sometimes into the thousands. I'm using HornetQ via JMS queues and we're on WildFly 8.2. Normally, when starting the server instance, HornetQ has 3 messaging journals and a lock file.
The piling up of message journal files has caused issues when restarting the server; we'll see a log line that states:
HQ221014: 54% loaded
When the files are removed, the server loads just fine. I've experimented some, and it appears as though the messages in these files have already been processed, but I'm not sure why they continue to pile up over time.
Edit 1: I've found this link that indicates we're not acknowledging messages. However, we create the session in auto-acknowledge mode: connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
I'll continue looking for a solution.
I've come to find out that this is caused by failures of the afterDelivery() call (for one reason or another; I currently believe it has something to do with server load or network hangs). I'm addressing it by not hitting that queue so often. It's not elegant, but it serves my purpose.
See following HornetQ messages I found in the logs:
HQ152006: Unable to call after delivery
javax.transaction.RollbackException: ARJUNA016053: Could not commit transaction.
    at org.jboss.as.ejb3.inflow.MessageEndpointInvocationHandler.afterDelivery(MessageEndpointInvocationHandler.java:87)
HQ222144: Queue could not finish waiting executors. Try increasing the thread pool size
HQ222172: Queue jms.queue.myQueue was busy for more than 10,000 milliseconds. There are possibly consumers hanging on a network operation
I'm developing scheduled services.
The application is developed using JDK 1.6, Spring Framework 2.5.6 and Quartz 1.8.4 to schedule jobs.
I've two clustered servers with WebLogic Server 10.3.5.
Sometimes the Quartz scheduling seems to go crazy. Analyzing the conditions under which it occurs, there seems to be a clock "desynchronization" greater than a second between the clustered servers. However, this desynchronization is not always due to the system time of the servers; sometimes, even if the machines' clocks are synchronized, there is a small "delay" introduced by the JVM.
Has anyone encountered the same problem? Is there a way to solve it?
Thanks in advance
When using a JDBC-JobStore on Oracle with version 2.2.1, I experienced the same problem.
In my case, I was running Quartz on a single node. However, I noticed the database machine was not time synchronized with the node running Quartz.
I activated ntpd on both the database machine and the machine running Quartz, and the problem went away after a few minutes.
The issue most often happens because of time desynchronization between cluster nodes.
However, it may also be caused by an unstable connection between the application and the DB. Such connection problems may be caused by network issues (if the application server and DB server are on different machines) or by performance problems (the DB server processing requests very slowly for some reason).
In such cases, the chances of this issue appearing can be reduced by increasing the org.quartz.jobStore.clusterCheckinInterval value.
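For reference, the relevant quartz.properties entries look like this (the 20000 ms value is only an example to tune for your environment):

org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000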
I am using Quartz 2.2.1 and I notice a strange behavior whenever a cluster recovery occurs.
For instance, even though the machines are synchronized with the ntpdate service, I get this message on cluster instance recovery:
org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances "This scheduler instance () is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior".
Here it says that the solution is: "Synchronize the time on all cluster nodes and then restart the cluster. The messages should no longer appear in the log."
As every machine is synchronized, maybe this "delay" is introduced by the JVM? I don't know... :(
This issue is nearly always attributable to clock skew. Even if you think you have NTPd set up properly, a couple of things can still happen:
We thought we had NTPd working (and it was configured properly), but on AWS the firewalls were blocking the NTP port: UDP 123. Again, that's UDP, not TCP.
If you don't sync often enough you will accumulate clock skew. The accuracy of the timers on many motherboards is notoriously wonky, so over time (days) you suddenly start getting these Quartz errors. Once the skew passes 5 minutes you also get a host of security errors, Kerberos for example.
So the moral of this story is: sync with NTPd, but do it often, and verify that it is actually working.
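For example, you can check on each node that NTP peers are reachable and the offset stays small with:

ntpq -p

The offset column is in milliseconds; anything drifting toward whole seconds is enough to trip Quartz's cluster check-in logic.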
I faced the same issue. First you should check the logs and the time sync for your cluster.
The marker is messages like these in the logs:
08-02-2018 17:13:49.926 [QuartzScheduler_schedulerService-pc6061518092456074_ClusterManager] INFO o.s.s.quartz.LocalDataSourceJobStore - ClusterManager: detected 1 failed or restarted instances.
08-02-2018 17:14:06.137 [QuartzScheduler_schedulerService-pc6061518092765988_ClusterManager] WARN o.s.s.quartz.LocalDataSourceJobStore - This scheduler instance (pc6061518092765988) is still active but was recovered by another instance in the cluster.
When the first node observes that the second node has been absent for more than org.quartz.jobStore.clusterCheckinInterval, it unregisters the second node from the cluster and removes all its triggers.
Take a look at the synchronization algorithm: org.quartz.impl.jdbcjobstore.JobStoreSupport.ClusterManager#run
This can happen when the 'check in' takes a long time.
My solution is to override org.quartz.impl.jdbcjobstore.JobStoreSupport#calcFailedIfAfter. The hardcoded value 7500L looks like a grace period; I replaced it with a parameter (see the sketch below).
Note: if you are using SchedulerFactoryBean, be careful when registering a new JobStoreSupport subclass; Spring forcibly registers its own store, org.springframework.scheduling.quartz.LocalDataSourceJobStore.
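A simplified sketch of that override, subclassing the plain JDBC store (JobStoreTX here). The base-class formula is paraphrased from Quartz 2.2.1, so treat the exact expression as illustrative:

import org.quartz.impl.jdbcjobstore.JobStoreTX;
import org.quartz.impl.jdbcjobstore.SchedulerStateRecord;

public class GracePeriodJobStore extends JobStoreTX {

    private long gracePeriod = 7500L; // default mirrors the original hardcoded constant

    public void setGracePeriod(long gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    @Override
    protected long calcFailedIfAfter(SchedulerStateRecord rec) {
        // Same shape as the base implementation, but with the grace period
        // configurable instead of hardcoded.
        return rec.getCheckinTimestamp()
                + Math.max(rec.getCheckinInterval(), getClusterCheckinInterval())
                + gracePeriod;
    }
}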
I have a production system that uses ActiveMQ (5.3.2) to send messages from server A to server B. A few weeks ago, the system inexplicably started taking 10+ seconds to send a message. After a reboot of the producer, the system worked fine.
After investigation, I'm pretty sure this is due to producer flow control (I have a fairly standard ActiveMQ setup). The day before this happened, my consumer software had been acting erratically (for other reasons) and had even stopped accepting connections for a while, so I'm guessing that triggered it. (It does puzzle me that the requests were still being throttled a day later.)
Question: how can I confirm that the requests were being throttled? I took a heap dump of the server; is there data in memory I can look for?
Edit: I've found the following:
WireFormatNegotiator.tcpNoDelayEnabled=false for one of three WireFormatNegotiator instances in memory. I'm trying to figure out what sets this.
And second (and more important), is there a way I can use JMX to tell if the messages are being throttled? I'd like to set up a Nagios alert to let me know if this happens in the future. What property should I check for with JMX?
You can configure your producer client to throw javax.jms.ResourceAllocationException exceptions, which can then be detected/logged, etc. Just set one of the following in the broker's systemUsage configuration:
<systemUsage>
    <systemUsage sendFailIfNoSpaceAfterTimeout="3000">
        ...
    </systemUsage>
</systemUsage>

...OR...

<systemUsage>
    <systemUsage sendFailIfNoSpace="true">
        ...
    </systemUsage>
</systemUsage>
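With either setting, a throttled send surfaces as an exception on the producer instead of blocking. A minimal sketch of the producer-side handling (the logging is illustrative; wire it to whatever your Nagios alerting scrapes):

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageProducer;
import javax.jms.ResourceAllocationException;

void sendWithThrottleDetection(MessageProducer producer, Message message) throws JMSException {
    try {
        producer.send(message);
    } catch (ResourceAllocationException e) {
        // The broker hit its memory/store limit: producer flow control kicked in.
        System.err.println("Send rejected by broker (likely flow control): " + e.getMessage());
        throw e;
    }
}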