Strange behavior of quartz in cluster configuration - java

I'm developing scheduled services.
The application is developed using JDK 1.6, Spring Framework 2.5.6 and Quartz 1.8.4 to schedule jobs.
I've two clustered servers with WebLogic Server 10.3.5.
Sometimes the Quartz scheduling seems to go haywire. Analyzing the conditions under which this occurs, there appears to be a clock "desynchronization" of more than a second between the clustered servers. However, this desynchronization is not always due to the system time of the servers: sometimes, even when the machine clocks are synchronized, there seems to be a small "delay" introduced by the JVM.
Has anyone encountered the same problem? Is there a way to solve it?
Thanks in advance

When using a JDBC-JobStore on Oracle with version 2.2.1, I experienced the same problem.
In my case, I was running Quartz on a single node. However, I noticed the database machine was not time synchronized with the node running Quartz.
I activated ntpd on both the database machine and the machine running Quartz, and the problem went away after a few minutes.
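If you want to double-check that kind of skew from the application side, a quick sketch like this can help (the JDBC URL and credentials are placeholders, and the query uses Oracle's SYSTIMESTAMP, so adjust it for your database and make sure the driver is on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ClockSkewCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details -- replace with your own.
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
            try {
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery("SELECT SYSTIMESTAMP FROM DUAL");
                rs.next();
                long dbTime = rs.getTimestamp(1).getTime();
                // A large difference here points to clock skew between the JVM and the DB host.
                System.out.println("Approximate JVM/DB clock skew: "
                        + (System.currentTimeMillis() - dbTime) + " ms");
            } finally {
                conn.close();
            }
        }
    }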

The issue most often happens because of time de-synchronisation between cluster nodes.
However, it may also be caused by an unstable connection between the application and the DB. Such connection problems may be caused by network issues (if the application server and the DB server are on different machines) or by performance issues (the DB server processes requests very slowly for some reason).
In that case, the chances of hitting this issue can be reduced by increasing the org.quartz.jobStore.clusterCheckinInterval value.
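For reference, this is roughly where that setting lives in quartz.properties for a clustered JDBC-JobStore (the values below are examples only, not recommendations):

    # Clustered JDBC-JobStore -- example values only
    org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
    org.quartz.jobStore.isClustered = true
    # How often (in ms) each node checks in with the cluster. Raising it makes the
    # cluster more tolerant of slow DB round-trips, at the cost of slower detection
    # of failed nodes.
    org.quartz.jobStore.clusterCheckinInterval = 20000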

I am using Quartz 2.2.1 and I notice strange behavior whenever a cluster recovery occurs.
For instance, even though the machines have been synchronized with the ntpdate service, I get this message on cluster instance recovery:
org.quartz.impl.jdbcjobstore.JobStoreSupport findFailedInstances "This scheduler instance () is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior".
The documentation says the solution is: "Synchronize the time on all cluster nodes and then restart the cluster. The messages should no longer appear in the log."
Since every machine is synchronized, maybe this "delay" is introduced by the JVM? I don't know... :(

This issue is nearly always attributable to clock skew. Even if you think you have NTPd set up properly, a couple of things can still go wrong:
We thought we had NTPd working (and it was configured properly), but on AWS the firewalls were blocking the NTP port: UDP 123. Again, that's UDP, not TCP.
If you don't sync often enough you will accumulate clock skew. The accuracy of the timers on many motherboards is notoriously wonky, so over time (days) these Quartz errors suddenly appear. Let the skew grow past 5 minutes and you also get security errors, from Kerberos for example.
So the moral of this story is: sync with NTPd, do it often, and verify that it is actually working.

I faced the same issue. First, check the logs and the time synchronization of your cluster.
The telltale marker is messages like these in the logs:
08-02-2018 17:13:49.926 [QuartzScheduler_schedulerService-pc6061518092456074_ClusterManager] INFO o.s.s.quartz.LocalDataSourceJobStore - ClusterManager: detected 1 failed or restarted instances.
08-02-2018 17:14:06.137 [QuartzScheduler_schedulerService-pc6061518092765988_ClusterManager] WARN o.s.s.quartz.LocalDataSourceJobStore - This scheduler instance (pc6061518092765988) is still active but was recovered by another instance in the cluster.
When the first node observes that the second node has been absent for longer than org.quartz.jobStore.clusterCheckinInterval, it unregisters the second node from the cluster and removes all of its triggers.
Take a look at the synchronization algorithm: org.quartz.impl.jdbcjobstore.JobStoreSupport.ClusterManager#run
This may happen when the 'check in' takes a long time.
My solution is to override org.quartz.impl.jdbcjobstore.JobStoreSupport#calcFailedIfAfter. The hardcoded value '7500L' appears to be the grace period; I replaced it with a configurable parameter.
Note: if you are using SchedulerFactoryBean, be careful when registering a new JobStoreSupport subclass; Spring forcibly registers its own store, org.springframework.scheduling.quartz.LocalDataSourceJobStore.
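A minimal sketch of that idea, assuming Quartz 2.2.x where JobStoreSupport#calcFailedIfAfter(SchedulerStateRecord) is protected. It extends Spring's LocalDataSourceJobStore so the data-source wiring is kept, and simply adds an extra, configurable grace period on top of the stock calculation (the extraGracePeriod name and default are mine, not a Quartz setting):

    import org.quartz.impl.jdbcjobstore.SchedulerStateRecord;
    import org.springframework.scheduling.quartz.LocalDataSourceJobStore;

    public class GracePeriodJobStore extends LocalDataSourceJobStore {

        // Extra tolerance (ms) added on top of Quartz's hard-coded 7500L grace period.
        private long extraGracePeriod = 15000L;

        public void setExtraGracePeriod(long extraGracePeriod) {
            this.extraGracePeriod = extraGracePeriod;
        }

        @Override
        protected long calcFailedIfAfter(SchedulerStateRecord rec) {
            // Keep the stock calculation and push the "failed if after" moment further out,
            // so a slow check-in is less likely to trigger a spurious cluster recovery.
            return super.calcFailedIfAfter(rec) + extraGracePeriod;
        }
    }

As noted above, SchedulerFactoryBean forces its own LocalDataSourceJobStore when a DataSource is set, so whether the scheduler actually picks up such a subclass depends on how the job store class is wired in your setup.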

Related

Using Apache Ignite some expired data remains in memory with TTL enabled and after thread execution

Issue
Create an Ignite instance (with client mode set to false) and put some data (10k entries/values) into it with a very short expiration time (~20s) and TTL enabled.
Each time the cleanup thread runs it should remove all the expired entries, but after a few runs it stops removing all of them: some expired entries stay in memory and are not removed by the thread's execution.
That means we end up with expired data in memory, which is something we want to avoid.
Can you please confirm whether this is a real issue or just a misuse/misconfiguration in my setup?
Thanks for your feedback.
Test
I've tried three different setups: full local mode (embedded server) on macOS, a remote server using one node in Docker, and a remote cluster using 3 nodes in Kubernetes.
To reproduce
Git repo: https://github.com/panes/ignite-sample
Run MyIgniteLoadRunnerTest.run() to reproduce the issue described above.
(Global setup: writing 10k entries of 64 octets each with a TTL of 10s.)
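For context, a cache along these lines can be sketched with Ignite's standard cache API (the cache name and node startup here are illustrative; the actual reproducer is in the repo linked above):

    import java.util.concurrent.TimeUnit;

    import javax.cache.expiry.CreatedExpiryPolicy;
    import javax.cache.expiry.Duration;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class TtlRepro {
        public static void main(String[] args) {
            Ignite ignite = Ignition.start(); // embedded server node

            CacheConfiguration<Integer, byte[]> cfg = new CacheConfiguration<Integer, byte[]>("ttlCache");
            cfg.setEagerTtl(true); // expired entries should be evicted by the background cleanup thread
            cfg.setExpiryPolicyFactory(
                    CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.SECONDS, 10)));

            IgniteCache<Integer, byte[]> cache = ignite.getOrCreateCache(cfg);

            byte[] payload = new byte[64]; // ~64-octet values, as in the reproducer
            for (int i = 0; i < 10000; i++) {
                cache.put(i, payload);
            }
        }
    }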
It seems to be a known issue; here's the link to track it: https://issues.apache.org/jira/browse/IGNITE-11438. The fix is to be included in the Ignite 2.8 release. As far as I know, it has already been released as part of GridGain Community Edition.

redis - acquiring connection time with PHP clients

My use case is a bunch of isolated calls that at some point interact with redis.
The problem is that I'm seeing a very long wait time for acquiring a connection, having tried both Predis and Credis in my LAN environment. Across 1-3 client threads, the time it takes for my PHP scripts to connect to Redis and select a database ranges from 18ms to 700ms!
Normally, I'd use a connection pool or cache a connection and use it across all my threads, but I don't think this can be done in PHP over different scripts.
Is there anything I can do to speed this up?
Apparently, Predis needs the persistent flag set (https://github.com/nrk/predis/wiki/Connection-Parameters) and also FPM, which was frustrating to set up on both Windows and Linux, not to mention testing it before switching to FPM on our live setup.
I've switched to Phpredis (https://github.com/phpredis/phpredis), which is a PHP module/extension, and all is good now. The connection times have dropped dramatically using $redis->pconnect() and are consistent across multiple scripts/threads.
Caveat: it IS a little different from Predis in terms of error handling (it fails when instantiating the object rather than when running the first call, and it returns false instead of null for nonexistent values), so watch out for that if switching from Predis.

OutOfMemoryError due to a huge number of ActiveMQ XATransactionId objects

We have a Weblogic server running several apps. Some of those apps use an ActiveMQ instance which is configured to use the Weblogic XA transaction manager.
About 3 minutes after startup, the JVM triggers an OutOfMemoryError. A heap dump shows that about 85% of all memory is occupied by a LinkedList containing org.apache.activemq.command.XATransactionId instances. The list is a root object and we are not sure what holds on to it.
What could cause this?
We had exactly the same issue on WebLogic 12c with activemq-ra. XATransactionId instances were created continuously, causing server overload.
After more than 2 weeks of debugging, we found that the problem was caused by the WebLogic Transaction Manager trying to recover pending ActiveMQ transactions by calling the recover() method, which returns the IDs of transactions that appear not to be completed and need to be recovered. WebLogic's call to this method always returned the same non-zero number n, and that caused the creation of n XATransactionId instances on each pass.
After some investigation, we found that WebLogic stores its transaction logs (TLOGs) in the filesystem by default, and that this can be changed so they are persisted in the DB. We suspected a problem with the TLOGs being on the filesystem, switched them to the DB, and it worked! Our server has now been running for more than 2 weeks without a restart and memory is stable, because no XATransactionId instances are created apart from the necessary amount ;)
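For context, the recovery handshake mentioned above works roughly like this at the JTA level (a hedged sketch using the standard XAResource API, not WebLogic's internal code): the transaction manager asks the resource for its in-doubt transaction IDs and then resolves each one based on its TLOG.

    import javax.transaction.xa.XAException;
    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;

    public class RecoveryIllustration {

        // If recover() keeps returning the same non-empty array, n XATransactionId-like
        // objects are re-created on every recovery pass, which matches what we observed.
        static void recoverInDoubt(XAResource resource) throws XAException {
            Xid[] inDoubt = resource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
            for (Xid xid : inDoubt) {
                // A real transaction manager decides commit vs. rollback from its TLOG;
                // rollback is shown here purely for illustration.
                resource.rollback(xid);
            }
        }
    }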
I hope this helps; let us know if it works for you.
Good luck!
To be honest, it sounds like you're receiving a ton of JMS messages and either not consuming them or, if you are, your consumer is not acknowledging them (when it is not in auto-acknowledge mode).
Check your JMS queue backlog. There may be a queue with a high backlog that the server is trying to read, and those messages may have been corrupted by some crash.
The best option is to delete the backlog in the JMS queue or back it up to another queue.
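If you want to check a backlog programmatically, a simple sketch with the plain JMS API could look like this (broker URL and queue name are placeholders):

    import java.util.Enumeration;

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Queue;
    import javax.jms.QueueBrowser;
    import javax.jms.Session;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class BacklogCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder broker URL and queue name.
            ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = cf.createConnection();
            connection.start();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                Queue queue = session.createQueue("SOME.QUEUE");
                QueueBrowser browser = session.createBrowser(queue);
                int pending = 0;
                for (Enumeration<?> e = browser.getEnumeration(); e.hasMoreElements(); e.nextElement()) {
                    pending++;
                }
                System.out.println("Pending messages on SOME.QUEUE: " + pending);
            } finally {
                connection.close();
            }
        }
    }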

JBoss Mysql Increasing Connections

We have an application which runs inside the JBoss EJB container. This application makes connections to MySQL and runs stored procedures on it. We have observed that, after a certain point in time, JBoss stops responding to web requests to the web applications hosted on it. After investigating, we found the following issues.
The number of socket connections from JBoss keeps increasing, and once it goes above a thousand we observe that JBoss stops working completely because of the per-process limit on the number of socket connections (i.e. 1024). We have cross-checked the code, and we believe it only opens socket connections to MySQL, so either that is the problem or something else is doing this; we can't find the actual cause. We have tried netstat and lsof on Linux; any other suggestions for finding the root cause of the connection issue would be a great help.
We also checked MySQL's SHOW PROCESSLIST, but it shows only 8 to 10 active connections at any point in time. So no luck there.
There is another interesting thing: we had reduced the connection timeout in our application from 86400 seconds to 30 seconds, and we reduced the wait timeout of the MySQL database to 50 seconds, so there is a gap of 20 seconds. We have repeatedly cross-checked the database for any issues with this, but it hardly seems to matter. Any suggestions on this would also be helpful. We plan to reduce the difference to 5 seconds.
Update: we have subsequently changed the connection timeout from 30 to 170 seconds, and the MySQL wait timeout to 180 seconds.
We have tried making changes according to the JBoss forums, which say to enable a debug=true attribute on the cached connection manager tag. We tried this, but when there are open transactions it causes them to be dropped, which wreaks havoc in our application, so we reverted the change and are running as before, with the application still on the verge of disaster. We are still clueless; JBoss seems to be at the core of our issues, and still no solution :(

JDBC requests to Oracle 11g failing to be committed although apparently succeeding

We have an older web-based application (Java with Spring 2.5.4 framework) running on a GlassFish 3.1 (build 43) server. This application was recently (a few weeks ago) re-directed to use an Oracle 11g (11.2.0.3.0) database and ojdbc6.jar/orai18n.jar (up from Oracle 10g 10.2.0.3.0 and ojdbc14.jar) -- using a JDBC Thin connection. The application is using org.apache.commons.dbcp.BasicDataSource version 1.2.2 for connections and the database requests are handled either through Spring jdbcTemplate (via the JdbcDaoSupport abstract class) or Spring's PlatformTransactionManager.
This morning we noticed that application users were able to enter information, modify it, and later retrieve and print that data through the application, but that there had been no committed updates for the last 24 hours. This application currently has only a few users each day, and they were apparently sharing the same connection, which had been kept open by the connection pool during the last day; as a result their uncommitted updates were visible through the application, but not through other connections to the database. When the connection was closed, the uncommitted updates were lost.
Examining the server logs showed no errors from the time of the last committed changes to the database through to the times of the printed reports the next morning. In addition, even if some of the changes had (somehow) been made with the JDBC connection set to auto-commit false, specific commits were made for some of those updates as part of a transaction; the surrounding try/catch block should have executed either the "transactionManager.commit(transactionStatus);" or the "transactionManager.rollback(transactionStatus);" call, and those calls must have completed without error. It looks as though the commit was returning successfully, but no commit actually occurred.
Restarting the GlassFish domain and the application restored the normal operation with the various updates being committed as they are entered.
My question is has anyone seen or heard about anything like this occurring and, if so, what could have caused it?
Thank you for any ideas here -- we are at a loss.
Some new information:
Examination of our Oracle 11g server showed that, near the time we believe the commits stopped, there were four operations blocked on some other operation that we were not able to fully identify, but which was probably an update.
Examination of the GlassFish server logs showed that the pattern of worker threads changed after this estimated start time: fewer and fewer threads appeared in the log, until only one thread continued to be used for several hours.
The problem occurred again about one week later and was caught after about 1/2 hour. At this time, there were two worker threads in operation.
The problem occurred due to a combination of two things. The first was a method that set up a Spring transaction but had an exit path that bypassed both the transactionManager.commit() and the transactionManager.rollback() calls (as well as the several SQL requests making up the transaction). Although this was admittedly incorrect coding, in the past this transaction was closed anyway and therefore had no effect on subsequent usage.
The solution was to ensure that the transaction was not started if there was nothing to be done, or, more generally, to double-check that all transactions, once started, are completed.
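A minimal sketch of the corrected pattern with Spring's PlatformTransactionManager (hasWorkToDo and performUpdates are hypothetical placeholders for the application logic): every path either never starts the transaction or finishes it with a commit or rollback, so the pooled connection is never handed back with uncommitted work pending.

    import org.springframework.transaction.PlatformTransactionManager;
    import org.springframework.transaction.TransactionStatus;
    import org.springframework.transaction.support.DefaultTransactionDefinition;

    public class SafeTransactionExample {

        private final PlatformTransactionManager transactionManager;

        public SafeTransactionExample(PlatformTransactionManager transactionManager) {
            this.transactionManager = transactionManager;
        }

        public void updateIfNeeded() {
            if (!hasWorkToDo()) {
                return; // exit BEFORE the transaction is started, not after
            }
            TransactionStatus status =
                    transactionManager.getTransaction(new DefaultTransactionDefinition());
            try {
                performUpdates(); // the SQL requests making up the transaction
                transactionManager.commit(status);
            } catch (RuntimeException e) {
                transactionManager.rollback(status);
                throw e;
            }
        }

        // Hypothetical application hooks.
        private boolean hasWorkToDo() { return true; }
        private void performUpdates() { /* ... */ }
    }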
I am not certain exactly how or why this problem began presenting itself, so the following is partly conjecture. Apparently, upgrading to Oracle 11g and/or switching to the ojdbc6.jar driver altered the earlier behavior of the incorrect code, so that the transaction was not terminated and the connection's auto-commit was left false. (It could also be due to some other change that we have not identified, since the special case above happens rarely, but it does happen.) The corresponding JDBC connection appears to be bound to a specific GlassFish worker thread (I will call this a 'bad' thread below, as opposed to the normally behaving 'good' threads). Whenever this 'bad' thread handles an application request (for this particular application), changes are left uncommitted and selects return dirty data. As time goes on, when a change is requested on a 'good' thread whose JDBC connection touches data that already has an uncommitted change from the 'bad' thread, the new request hangs and that worker thread hangs with it. Eventually all but the 'bad' worker thread are hung, and everything seems to work correctly from the application's viewpoint, but nothing is ever committed.
Again, the solution was to correct the bad code.
