We have an Ignite setup (apache-ignite-2.13.0-1, Zulu Java 11.0.13, RHEL 8.6) with 3 server nodes and ~20 clients joining the topology as client nodes. The client application also connects via JDBC. The application is from a 3rd-party vendor, so I don't know what they are doing internally.
For some time now, one of the 3 servers has always been logging a huge number of these warnings:
[12:40:41,446][WARNING][tcp-disco-ip-finder-cleaner-#7-#62][TcpDiscoverySpi] Failed to ping node [nodeId=null]. Reached the timeout 60000ms. Cause: Connection refused (Connection refused)
It did not always do this; Ignite and the application were updated multiple times, and at some point these warnings started showing up.
I don't understand what this means. All the nodes I see in the topology with ignitevisor have a nodeId set, but here it is null. All server nodes and clients have full connectivity between each other on all high ports. All expected nodes are shown in the topology.
So what is this node with nodeId=null? How can I find more about where that comes from?
Regards,
Sven
Wrapping it up,
the message was introduced in 2.11 to provide additional logging around communication and networking.
The warning itself just means that a node might not be accessible from the current one, i.e. we can't ping that node. That is normal in many cases and you can ignore this warning.
The implementation seems to be somewhat flawed: we'd like the message to be written only the first time instead of producing a bunch of duplicate messages. Also, this type of logging used to be at DEBUG level, whereas now it has become more severe, WARN, for no reason.
There is an open ticket for an improvement.
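If the noise is a problem until that improvement lands, one possible workaround, assuming the node is started from your own Java code and uses Ignite's default java.util.logging backend (which the [WARNING]-style log lines above suggest), is to raise the level of the TcpDiscoverySpi logger category. A minimal sketch, with the category name assumed from the [TcpDiscoverySpi] tag in the log line:

import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietDiscoveryPingWarnings {
    // Keep a static reference so the JUL logger (and its level) is not garbage-collected.
    private static final Logger DISCOVERY_LOG =
            Logger.getLogger("org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi");

    // Call this in the same JVM that runs the Ignite node.
    public static void apply() {
        DISCOVERY_LOG.setLevel(Level.SEVERE); // suppress the WARNING-level ping messages
    }
}

If the nodes are started via ignite.sh instead, the equivalent change would go into the JUL configuration file rather than code.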
Related
We have a Weblogic server running several apps. Some of those apps use an ActiveMQ instance which is configured to use the Weblogic XA transaction manager.
About 3 minutes after startup, the JVM triggers an OutOfMemoryError. A heap dump shows that about 85% of all memory is occupied by a LinkedList that contains org.apache.activemq.command.XATransactionId instances. The list is a root object and we are not sure who needs it.
What could cause this?
We had exactly the same issue on Weblogic 12c and activemq-ra: XATransactionId object instances were created continuously, causing server overload.
After more than 2 weeks of debugging, we found that the problem was caused by the WebLogic Transaction Manager trying to recover some pending activemq transactions by calling the method recover(), which returns the ids of transactions that appear not to be completed and have to be recovered. Weblogic's call to this method always returned the same non-zero number n, and that caused the creation of n instances of XATransactionId.
After some investigation, we found that Weblogic stores its transaction logs (TLOG) in the filesystem by default, and that this can be changed so they are persisted in the database. We suspected a problem with the TLOGs being in the filesystem, changed them to the database, and it worked! Our server has now been running for more than 2 weeks without any restart and memory is stable, because no XATransactionId instances are created apart from the necessary amount. ;)
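For context, this is roughly the standard XA recovery scan involved; it is only an illustrative sketch (the method name and the way the XAConnection is obtained are mine, not WebLogic's or the vendor's code), but it shows the recover() call whose persistently non-empty result kept producing XATransactionId instances:

import javax.jms.XAConnection;
import javax.jms.XASession;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

public class RecoveryScanProbe {
    // Lists the in-doubt XA branches the broker reports: the same ids the
    // transaction manager turns into XATransactionId instances during recovery.
    public static void listPendingXids(XAConnection connection) throws Exception {
        XASession session = connection.createXASession();
        try {
            XAResource resource = session.getXAResource();
            // One full recovery scan: start and end the scan in a single call.
            Xid[] pending = resource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
            System.out.println("In-doubt XA branches reported by the broker: " + pending.length);
        } finally {
            session.close();
        }
    }
}

If such a probe keeps reporting the same non-zero count, the broker (or the TLOG on the other side) still believes those branches are pending, which matches the behavior described above.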
I hope this helps; keep us informed whether it works for you.
Good luck !
To be honest, it sounds like you're getting a ton of JMS messages and either not consuming them or, if you are consuming them, your consumer is not acknowledging them (which it must do explicitly when the session is not in auto-acknowledge mode).
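As a minimal sketch of what that looks like with a plain JMS consumer (the queue name and timeout are placeholders): if the session is in CLIENT_ACKNOWLEDGE mode and acknowledge() is never called, the broker keeps the delivered messages, which shows up as an ever-growing backlog.

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;

public class AckExample {
    public static void consumeOne(Connection connection, String queueName) throws Exception {
        // Non-transacted session where the client must acknowledge explicitly.
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        try {
            MessageConsumer consumer = session.createConsumer(session.createQueue(queueName));
            Message message = consumer.receive(5_000); // wait up to 5 seconds
            if (message != null) {
                // ... process the message ...
                message.acknowledge(); // without this, the message stays unacknowledged on the broker
            }
        } finally {
            session.close();
        }
    }
}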
Check your JMS queue backlog. There may be a queue with a high backlog that the server is trying to read. These messages may have been corrupted by some crash.
The best option is to delete the backlog in the JMS queue, or to back it up into some other queue.
I have a web service application using Cassandra 2.0 and Datastax java driver 2.0.2. I sometimes get the stacktrace below when trying to write to or read from the database, especially if the application has been sitting idle for a while (like overnight). This error usually goes away when I retry; however, sometimes it persists and I have to restart the web app to get rid of it.
I wonder if this is some sort of "stale connection" issue. However, the Datastax java driver documentation indicates it is supposed to keep the connection alive.
A Google search on the error message gave only two (!) hits, and they are related. This is the answer in one of the Google results:
Sylvain Lebresne (Apr 2): You're running into https://datastax-oss.atlassian.net/browse/JAVA-250. We'll fix it soon hopefully (I have some half-finished patch that I need to finish), but currently, if you restart a whole cluster without doing queries during the restart, it can sometimes happen that you'll get this before the cluster properly reconnects. In the meantime and as a workaround, you can always make sure to run a few trivial queries while you're doing the cluster restart to avoid it.
However, this does not look like my scenario because we are not restarting the cluster at all. Does anyone have any insight into this error?
Stacktrace:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ec2-54-197-xxx-xxx.compute-1.amazonaws.com/54.197.xxx.xxx:9042 (com.datastax.driver.core.ConnectionException: [ec2-54-197-xxx-xxx.compute-1.amazonaws.com/54.197.xxx.xxx:9042] Write attempt on defunct connection))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:172)
at com.datastax.driver.core.SessionManager.execute(SessionManager.java:92)
I have what I believe is the exact same issue (Write attempt on defunct connection) on my development machine intermittently.
It seems to happen when my dev machine goes to sleep while the server is up. Obviously there's no power management in the AWS cluster you're running, but it gives you a hint - the key is that something is breaking your control connection or intermittently preventing network connectivity between your hosts.
You should see the reconnection thread in your logs:
21:34:51.616 [Reconnection-1] ERROR c.d.driver.core.ControlConnection - [Control connection] Cannot connect to any host, scheduling retry in 2000 milliseconds
The next request after this will always succeed in my experience.
TL;DR - check for networking issues or any intermittent shutdown of servers that could break the control connection. The driver should do a better job of re-establishing broken control connections; it sounds like they're working on that in JAVA-250.
This is shameful, but we know there are some ActiveMQ connection leaks. The code is old and has many twists and turns that make finding the leaky flow very hard.
We fire many short-lived jobs from a batch machine. We know that not all code paths close the ActiveMQ connection properly. When a connection is not closed but the job terminates, ActiveMQ holds that connection for some amount of time. Ultimately, some critical applications are impacted because ActiveMQ's maximum connection limit is exceeded.
Is it possible to set a connection name or other identifying information so that an improperly closed connection will show up in ActiveMQ's log files? That would tell us which job's logs need to be examined. The sheer number of jobs makes it very hard to find out exactly which job caused the problem; however, once we know the job, we can deduce enough information from its logs to find and fix the connection leaks.
Right now all we see is the IP address the connection originated from, and since all the jobs originate from the same machine, that is not enough to find out who caused the problem.
If you add jms.clientID=something into your connection URL and turn on DEBUG logging in your conf/log4j.properties, you will get the client id in your debug log on AMQ. You could then write something to analyze your log and find the AMQ ID for a given clientID and match the logs that way.
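A minimal sketch of the job side, assuming the standard ActiveMQ client library (the broker host and the batch-&lt;jobName&gt; naming scheme are placeholders of mine):

import javax.jms.Connection;
import org.apache.activemq.ActiveMQConnectionFactory;

public class NamedBatchConnection {
    // Tags the connection with a per-job client id so it can be matched in the broker's DEBUG log.
    // Client ids must be unique per active connection, so append something unique
    // (e.g. a timestamp or PID) if jobs with the same name can overlap.
    public static Connection open(String jobName) throws Exception {
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                "tcp://broker-host:61616?jms.clientID=batch-" + jobName); // broker-host is a placeholder
        return factory.createConnection();
    }
}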
If your process is truly exiting, though, the connection should be going away at that point (i.e. you can't keep the connection alive if there's no process to service it).
If you are running on Linux, you can run netstat -anp | grep 61616 (or whatever your AMQ port is) to see which PIDs still have connections to AMQ, and then a ps to see what those processes are.
How do you tell the Datastax Java Cassandra driver to time-out when it attempts to connect to your cluster?
I'm particularly interested in the case when the hosts are reachable but the Cassandra ports are blocked or the Cassandra daemons are not running. I'm writing a command-line client that ought to exit and report a suitable error message if it cannot connect within a reasonable time. At present it seems that the driver will wait forever for a contact point to respond, if the contact point is reachable.
That is, I want Cluster.build() to throw a NoHostAvailableException if the driver can not communicate with the Cassandra daemon of any of the contact points within a given maximum time.
Creating my own RetryPolicy won't work: that is for retrying queries, and I want the timeout to apply before we are ready to run queries.
Creating my own ReconnectionPolicy initially looked promising, but the contract of that interface gives no means of indicating "consider this node to be dead forever more".
That is, I want Cluster.build() to throw a NoHostAvailableException if the driver can not communicate with the Cassandra daemon of any of the contact points within a given maximum time.
This is supposed to be the case. The driver will try to connect to each of the contact points and throw an exception if it fails to connect to any. You can control the maximum time the driver will try connecting (to each node) through SocketOptions.setConnectTimeoutMillis() (the default is 5 seconds).
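For the driver version in the question (2.0.x), that looks roughly like the sketch below; the 2-second value is just an example:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

public class FastFailClusterFactory {
    public static Cluster build(String contactPoint) {
        // Per-host connect timeout: lower it from the 5 s default so a blocked
        // port or dead daemon fails fast when the driver first connects.
        SocketOptions socketOptions = new SocketOptions()
                .setConnectTimeoutMillis(2_000);
        return Cluster.builder()
                .addContactPoint(contactPoint)
                .withSocketOptions(socketOptions)
                .build();
    }
}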
My experience is that Cluster.build() does throw an exception if no node can be connected to, but if your experience differs, you might want to report it as a bug (a bit more detail on how you reproduce it would help).
That being said:
The timeout above is per host. So if you pass a list of 100 contact points, you could in theory have to wait 500 seconds (by default) before getting the NoHostAvailableException. But there is no real point in providing that many contact points, and in practice, if Cassandra is not running on the node tried, the connection attempt will usually fail right away (you won't wait the timeout).
There is currently no real query timeout on the driver side. This means that if the driver does connect to a node (i.e. some process is listening on that port and accepts the connection) but gets no answer to its initial messages, then it can indeed hang forever. That should probably be fixed, and I encourage you to open a ticket for it on https://datastax-oss.atlassian.net/browse/JAVA. However, this doesn't seem to be the case you are describing, since if "Cassandra ports are blocked or the Cassandra daemons are not running" then the driver shouldn't be able to connect in the first place.
I'm facing a DatabaseLessLeasing issue. Ours is a middleware application; we don't have any database and our application runs on WebLogic Server. We have 2 servers in one cluster. Both servers are up and running, but we use only one server to do the processing. When the primary server fails, the whole server and its services migrate to the secondary server. This works fine.
But we had one issue at the end of last year when our secondary server's hardware went down and the secondary server was not available; we got the issue below. When we went to Oracle, they suggested adding one more server, or a highly available database to hold the cluster leasing information that indicates which server is the master. As of now we don't have that option, as adding a new server means a budget issue and the client is not ready for it.
Our WebLogic cluster configuration is:
one cluster with 2 managed servers
cluster messaging mode is Multicast
Migration Basis is Consensus
load algorithm is Round Robin
This is the log I found:
Critical Health BEA-310006 Critical Subsystem DatabaseLessLeasing has failed. Setting server state to FAILED. Reason: Server is not in the majority cluster partition
Critical WebLogicServer BEA-000385 Server health failed. Reason: health of critical service 'DatabaseLessLeasing' failed
Notice WebLogicServer BEA-000365 Server state changed to FAILED
Note: I remember one thing; the server was not down when this happened. Both servers were running, but all of a sudden the server tried to restart and was unable to. The restart failed; I saw that the status was showing failedToRestart and the application went down.
Can anyone please help me with this issue?
Thank you
Consensus leasing requires a majority of servers to continue functioning. Any time there is a network partition, the servers in the majority partition will continue to run, while those in the minority partition will fail: they can neither contact the cluster leader nor elect a new one, since they do not have a majority of the servers. If the partition results in an equal division of servers, the partition that contains the cluster leader will survive while the other one will fail.
Because of this behavior, if automatic server migration is enabled, the servers are required to contact the cluster leader and renew their leases periodically. Servers shut themselves down if they are unable to renew their leases, and the failed servers are then automatically migrated to the machines in the majority partition.
The server which got partitioned (and not part of majority cluster) will get into FAILED state. This behavior is put in place to avoid split-brain scenarios where there are two partitions of a cluster and both think they are the real cluster. When a cluster gets segmented, the largest segment will survive and the smaller segment will shut itself down. When servers cannot reach the cluster master, they determine if they are in the larger partition or not. If they are in the larger partition, they will elect a new cluster master. If not, they will all shut down when their lease expires. Two-node clusters are problematic in this case. When a cluster gets partitioned, which partition is the largest? When the cluster master goes down in a two-node cluster, the remaining server has no way of knowing if it is in the majority or not. In that case, if the remaining server is the cluster master, it will continue to run. If it is not the master, it will shut down.
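Purely as an illustration of the arithmetic described above (this is not WebLogic's code, just a sketch of the quorum rule), the survival decision boils down to something like:

public class MajorityPartitionCheck {
    // A partition survives if it holds a strict majority of the cluster, or
    // exactly half of it while also holding the cluster leader.
    public static boolean partitionSurvives(int nodesInPartition, int clusterSize,
                                            boolean holdsClusterLeader) {
        if (2 * nodesInPartition > clusterSize) {
            return true;                 // strict majority always survives
        }
        if (2 * nodesInPartition == clusterSize) {
            return holdsClusterLeader;   // even split: the leader's side survives
        }
        return false;                    // minority partition shuts itself down
    }
}

With clusterSize = 2, a lone surviving server that is not the cluster leader never reaches a majority, which is exactly why a two-node consensus-leasing cluster is fragile.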
Usually this error shows up when there are only 2 managed servers in one cluster.
To solve this kind of issue, create another server; since the cluster has only 2 nodes, either server will fall out of the majority cluster partition if it loses connectivity or drops cluster broadcast messages, and in that scenario there are no other servers left in the cluster.
For consensus leasing, it is always recommended to create a cluster with at least 3 nodes; that way you can ensure some stability.
In that scenario, even if one server falls out of the cluster, the other two still function correctly as they remain in the majority cluster partition. The third one will rejoin the cluster or will eventually be restarted.
In a scenario where you have only 2 servers as part of the cluster, one falling out from the cluster will result in both the servers being restarted, as they are not a part of the majority cluster partition; this would ultimately result in a very unstable environment.
Another possible scenario is a communication issue between the managed servers; look out for messages like "lost .* message(s)" (in the case of unicast it is something like "Lost 2 unicast message(s)."). This may be caused by temporary network issues.
Make sure that the node manager for the secondary node in the clustered migration configuration is up and running.