I'm facing a DatabaseLessLeasing issue. Ours is a middleware application: we don't have any database, and our application runs on WebLogic Server. We have 2 servers in one cluster. Both servers are up and running, but we use only one server to do the processing. When the primary server fails, the whole server and its services migrate to the secondary server. This is working fine.
But at the end of last year we had an issue: the secondary server's hardware was down and the secondary server was not available, and we got the error below. When we went to Oracle, they suggested adding one more server, or using a highly available database to hold the cluster leasing information that identifies the master server. As of now we don't have either option, as adding a new server means a budget issue and the client is not ready for it.
Our WebLogic cluster configuration is:
one cluster with 2 managed servers
cluster messaging mode is Multicast
Migration Basis is Consensus
load algorithm is Round Robin
This is the log I found:
Critical Health BEA-310006 Critical Subsystem DatabaseLessLeasing has failed. Setting server state to FAILED. Reason: Server is not in the majority cluster partition
Critical WebLogicServer BEA-000385 Server health failed. Reason: health of critical service 'DatabaseLessLeasing' failed
Notice WebLogicServer BEA-000365 Server state changed to FAILED
**Note:** I remember one thing: the server was not down when this happened. Both servers were running, but all of a sudden one server tried to restart and was unable to; the restart failed. I saw the status showing as failedToRestart and the application went down.
Can anyone please help me with this issue?
Thank you
Consensus leasing requires a majority of servers to continue functioning. Any time there is a network partition, the servers in the majority partition will continue to run, while those in the minority partition will fail, since they can neither contact the cluster leader nor elect a new one without a majority of the servers. If the partition results in an equal division of servers, the partition that contains the cluster leader will survive while the other one will fail.
Because of this, if automatic server migration is enabled, the servers must periodically contact the cluster leader to renew their leases. Servers shut themselves down if they are unable to renew their leases, and the failed servers are then automatically migrated to the machines in the majority partition.
The server that got partitioned (and is not part of the majority cluster) will go into the FAILED state. This behavior is in place to avoid split-brain scenarios, where there are two partitions of a cluster and both think they are the real cluster. When a cluster gets segmented, the largest segment will survive and the smaller segment will shut itself down. When servers cannot reach the cluster master, they determine whether they are in the larger partition or not. If they are in the larger partition, they will elect a new cluster master; if not, they will all shut down when their leases expire.

Two-node clusters are problematic in this case. When a cluster gets partitioned, which partition is the largest? When the cluster master goes down in a two-node cluster, the remaining server has no way of knowing if it is in the majority or not. In that case, if the remaining server is the cluster master, it will continue to run; if it is not the master, it will shut down.
Usually this error shows up when there are only 2 managed servers in one cluster.
To solve this kind of issue, create another server; since the cluster has only 2 nodes, any server that loses connectivity or drops cluster broadcast messages will fall out of the majority cluster partition, because there are no other servers in the cluster.
For consensus leasing, it is always recommended to create a cluster with at least 3 nodes; that way you can ensure some stability.
In that scenario, even if one server falls out of the cluster, the other two still function correctly, as they remain in the majority cluster partition. The third one will rejoin the cluster or will eventually be restarted.
In a scenario where you have only 2 servers in the cluster, one falling out of the cluster results in both servers being restarted, as neither is part of a majority cluster partition; this ultimately results in a very unstable environment.
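To make the majority arithmetic concrete, here is a small hypothetical sketch (plain Java, not WebLogic code, and it ignores the cluster-leader tie-break mentioned above): a partition survives only if it holds a strict majority of the configured servers, which the lone survivor of a 2-node cluster can never do, while 2 survivors of a 3-node cluster still can.

```java
public class MajorityCheck {
    // A partition keeps running only if it holds a strict majority of the cluster.
    static boolean hasMajority(int aliveInPartition, int clusterSize) {
        return aliveInPartition > clusterSize / 2;
    }

    public static void main(String[] args) {
        // 2-node cluster: the lone survivor is not a majority, so it goes to FAILED.
        System.out.println("1 of 2 -> " + hasMajority(1, 2)); // false
        // 3-node cluster: two survivors still form a majority and keep running.
        System.out.println("2 of 3 -> " + hasMajority(2, 3)); // true
    }
}
```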
Another possible scenario is a communication issue between the managed servers; look out for messages like "lost .* message(s)" (with unicast it looks something like "Lost 2 unicast message(s)."). This can be caused by temporary network issues.
Also make sure that the node manager for the secondary node in the clustered migration configuration is up and running.
Related
When restarting one Cassandra seed node, all its client connections are rebalanced to the other nodes as expected. However, when the node is up again, the incoming connections do not return to their previous level. This causes a ~10% performance impact, since the other nodes are used slightly more. Restarting the client applications solves the issue.
Is it possible to have the clients rebalance automatically after some time (e.g. 1 h) without restarting them?
I am using 4 seed nodes in my application with the latest Java driver, com.datastax.oss:java-driver-bom:4.15.0, and Cassandra 4.0.7. The load balancing policy used is DcInferringLoadBalancingPolicy.
Here is an example of the restart of one node, shown in the graph.
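For reference, a minimal sketch of how the session is created with that policy (contact points are placeholders; in driver 4.x the policy is normally selected through the driver configuration rather than hard-coded):

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import java.net.InetSocketAddress;

public class SessionFactory {
    public static CqlSession create() {
        // Select DcInferringLoadBalancingPolicy via the driver configuration.
        DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withString(DefaultDriverOption.LOAD_BALANCING_POLICY_CLASS,
                        "DcInferringLoadBalancingPolicy")
                .build();
        return CqlSession.builder()
                .withConfigLoader(loader)
                // Placeholder addresses for the 4 seed nodes mentioned above.
                .addContactPoint(new InetSocketAddress("seed1", 9042))
                .addContactPoint(new InetSocketAddress("seed2", 9042))
                .addContactPoint(new InetSocketAddress("seed3", 9042))
                .addContactPoint(new InetSocketAddress("seed4", 9042))
                .build();
    }
}
```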
We have an Ignite setup (apache-ignite-2.13.0-1, Zulu Java 11.0.13, RHEL 8.6) with 3 server nodes and ~20 clients joining the topology as client nodes. The client application additionally connects via JDBC. The application is from a 3rd-party vendor, so I don't know what they are doing internally.
For some time now we have seen that one of the 3 servers always logs a huge number of these warnings:
[12:40:41,446][WARNING][tcp-disco-ip-finder-cleaner-#7-#62][TcpDiscoverySpi] Failed to ping node [nodeId=null]. Reached the timeout 60000ms. Cause: Connection refused (Connection refused)
It did not always do this; Ignite and the application have been updated multiple times, and at some point these warnings started showing up.
I don't understand what this means. All the nodes I see in the topology with ignitevisor have a nodeId set, but here it is null. All server nodes and clients have full connectivity between each other on all high ports. All expected nodes are shown in the topology.
So what is this node with nodeId=null? How can I find out more about where it comes from?
Regards,
Sven
Wrapping it up,
the message was introduced in Ignite 2.11 in order to provide additional logging for communication and networking.
The warning itself just means that a node might not be accessible from the current one, i.e. we can't ping that node. That is normal in many cases and you can ignore this warning.
The implementation seems to be quite incorrect: the message should be written only the first time instead of producing a bunch of duplicates. Also, that type of logging information used to be at DEBUG level, whereas now it has become more severe (WARN) for no reason.
There is an open ticket for an improvement.
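Until that lands, if the noise is a problem, one possible workaround (a sketch, assuming Ignite's default java.util.logging-based logger is in use; with log4j2 you would set the equivalent logger level in its XML config) is to raise the level of the discovery SPI category so the WARNING-level ping messages are suppressed:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietDiscovery {
    // Keep a strong reference so the JUL logger (and its level) is not garbage collected.
    private static final Logger DISCOVERY_LOG =
            Logger.getLogger("org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi");

    public static void quietTcpDiscoveryWarnings() {
        // Suppress WARNING-level "Failed to ping node" noise; SEVERE messages still appear.
        DISCOVERY_LOG.setLevel(Level.SEVERE);
    }
}
```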
I am going to implement a Kafka cluster consisting of 3 machines: one for ZooKeeper and the other 2 as brokers. I have about 6 consumer machines and about a hundred producers.
Now, if one of the brokers fails, data loss is avoided thanks to the replication feature. But what if ZooKeeper fails and the same machine cannot be started again? I have several questions:
I noticed that even after the ZooKeeper failure, producers continued to push messages to the designated broker, but the messages could no longer be retrieved by consumers because the consumers got unregistered. So in this case, is the data lost permanently?
How can the ZooKeeper IP be changed in the broker config at run time? Will the brokers have to be shut down to change the ZooKeeper IP?
Even if a new ZooKeeper machine is somehow brought into the cluster, would the previous data be lost?
Running only one instance of ZooKeeper is not fault-tolerant and the behavior cannot be predicted. According to the HBase reference, you should set up an ensemble with at least 3 servers.
Have a look at the official documentation page: ZooKeeper clustered setup.
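With an ensemble, brokers and clients list every ensemble member in their connect string, so losing a single ZooKeeper machine does not require changing any IP at run time. A minimal sketch of such a connection using the plain ZooKeeper Java client (hostnames zk1/zk2/zk3 are placeholders):

```java
import java.io.IOException;
import org.apache.zookeeper.ZooKeeper;

public class ZkEnsembleCheck {
    public static void main(String[] args) throws IOException, InterruptedException {
        // All three ensemble members are listed; the client fails over between them,
        // and the ensemble stays available as long as a majority (2 of 3) is running.
        String connect = "zk1:2181,zk2:2181,zk3:2181";
        ZooKeeper zk = new ZooKeeper(connect, 30_000,
                event -> System.out.println("ZooKeeper event: " + event.getState()));
        Thread.sleep(2_000); // give the client a moment to establish the session
        System.out.println("Session state: " + zk.getState());
        zk.close();
    }
}
```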
There is a cluster of Tomcats; each Tomcat node generates "tasks" which can be performed by any other node. I'd prefer a task to be performed by the node that created it.
I thought it would be a good idea to use an embedded broker for each Tomcat and configure them as a store-and-forward network. The problem is that a node can go down, and its tasks/messages should then be performed by another Tomcat instead of waiting for the current one to come back up.
On the other hand, when using a master/slave cluster, how do I prioritize the node that sent the message?
How do I configure this in ActiveMQ?
The priority of a local consumer should be the default. From the ActiveMQ docs:
ActiveMQ uses Consumer Priority so that local JMS consumers are always
higher priority than remote brokers in a store and forward network.
However, you will not really achieve what you want. If one Tomcat node goes down, so does the embedded ActiveMQ (and any messages still attached to that instance). A message will not automatically get copied to all other brokers.
But you also ask about a master/slave cluster. Do you intend to have a network of brokers OR a master/slave setup? Or do you intend to have a combination?
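For reference, a minimal sketch of an embedded broker that joins a store-and-forward network of brokers (broker names and hostnames are made up; each Tomcat would run its own instance and point the network connector at its peers):

```java
import org.apache.activemq.broker.BrokerService;

public class EmbeddedBroker {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.setBrokerName("tomcat-node-1");      // this node's broker name (placeholder)
        broker.setPersistent(true);                 // keep messages across broker restarts
        broker.addConnector("tcp://0.0.0.0:61616"); // transport for local and remote consumers
        // Store-and-forward link to the other Tomcat's embedded broker (placeholder host).
        // Local consumers still get priority over this network bridge by default.
        broker.addNetworkConnector("static:(tcp://tomcat-node-2:61616)");
        broker.start();
    }
}
```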
I have a central load balancing server and several application servers running on Apache Tomcat. The load balancing server receives requests and forwards them to the application servers in round-robin fashion. If one of these application servers goes down, the load balancing server should stop forwarding requests to it.
My current solution for this is to ping the application servers every few minutes and if I don't receive a response, remove them from a list of available servers. Is there a better way to monitor the status of these servers? Should I ping more often or should the application servers constantly inform the load balancing server?
Execute a null transaction on it regularly. Pinging really isn't enough: it only exercises the TCP/IP stack, and I have seen operating systems in states where TCP/IP was up but no applications were, and not even parts of the OS above the stack itself. Executing a transaction exercises everything. Include the database in the null transaction.
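As a rough illustration (not a drop-in solution), a check along these lines could replace the ping; /health is a hypothetical endpoint in the application that runs a trivial database query such as SELECT 1 before returning 200, so the whole stack is exercised:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HealthCheck {
    // Returns true if the application server answers the "null transaction" request.
    static boolean isAlive(String baseUrl) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(baseUrl + "/health").openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false; // any failure marks the server as unavailable
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder hosts for the Tomcat application servers.
        String[] servers = {"http://app1:8080", "http://app2:8080"};
        while (true) {
            for (String s : servers) {
                System.out.println(s + (isAlive(s) ? " is up" : " is down"));
            }
            Thread.sleep(10_000); // check every 10 seconds rather than every few minutes
        }
    }
}
```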
First, ensure your server is protected against DDoS attacks. Then, depending on your application's average connection time, adjust the keep-alive time.
You should also read about the prefork MPM; I think it will give you the best solution.