Apache Ignite: Caches unusable after reconnecting to Ignite servers - java

I am using Apache Ignite as a distributed cache and I am running into some fundamental robustness issues. If our Ignite servers reboot for any reason, all of our Ignite clients appear to break, even after the Ignite servers come back online.
This is the error the clients see when interacting with caches after the servers reboot and the clients reconnect:
Caused by: org.apache.ignite.internal.processors.cache.CacheStoppedException: Failed to perform cache operation (cache is stopped): <redacted>
My expectation is that the Ignite clients would reconnect to the Ignite servers and continue working once the servers are online. From what I've read thick clients should do this, but I don't see this happening. Why is the cache still considered to be stopped?
We are using Ignite 2.7.6 with Kubernetes IP finder.

Looks like you are using a stale cache proxy.
If you are using an in-memory cluster and created a cache dynamically from a client, that cache will disappear when the cluster restarts.
The following code, executed from a client against an in-memory cluster, will generate an exception when the cluster restarts if the cache in question is not part of a server config but was created dynamically from the client.
Ignition.setClientMode(true);
Ignite ignite = Ignition.start();

IgniteCache cache = ignite.getOrCreateCache("mycache"); // dynamically created cache

int counter = 0;
while (true) {
    try {
        cache.put(counter, counter);
        System.out.println("added counter: " + counter);
        counter++;
    } catch (Exception e) {
        e.printStackTrace();
    }
}
generates
java.lang.IllegalStateException: class org.apache.ignite.internal.processors.cache.CacheStoppedException: Failed to perform cache operation (cache is stopped): mycache
at org.apache.ignite.internal.processors.cache.GridCacheGateway.enter(GridCacheGateway.java:164)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.onEnter(GatewayProtectedCacheProxy.java:1555)
You need to watch for disconnect events/exceptions
see: https://ignite.apache.org/docs/latest/clustering/connect-client-nodes
IgniteCache cache = ignite.getOrCreateCache(cachecfg);

try {
    cache.put(1, "value");
} catch (CacheException e) {
    if (e.getCause() instanceof IgniteClientDisconnectedException) {
        IgniteClientDisconnectedException cause =
                (IgniteClientDisconnectedException) e.getCause();
        cause.reconnectFuture().get(); // Wait until the client is reconnected.
        // proceed
    }
}
If this is a persistent cluster consisting of multiple baseline nodes, you should wait until the cluster activates.
https://ignite.apache.org/docs/latest/clustering/baseline-topology
while (!ignite.cluster().active()) {
    System.out.println("Waiting for activation");
    Thread.sleep(5000);
}
After reconnecting you may need to reinitialize your cache proxy:
cache = ignite.getOrCreateCache(cachecfg);

Related

java - [Apache Curator] How to properly close curator

I'm trying to implement fallback logic for my ZooKeeper connection using Apache Curator. Basically, I have two sets of connection strings, and if I receive a LOST state on my state listener I try to reconnect my Curator client using the other set of connection strings. I could simply put all machines in the same connection string, but I want to connect to the fallback only when all machines of the default cluster are offline.
The problem is that I can't close the previous Curator client when I try to change to the fallback cluster; I keep receiving the log message saying that Curator is trying to reconnect, even after I connect to the fallback set of ZooKeeper servers. Below you can find a code example of what I'm trying to do:
final ConnectionStateListener listener = (client1, state) -> {
    if (state == ConnectionState.LOST) {
        reconnect();
    }
};
And the reconnect method (will change the lastHost to the fallback cluster):
if (client != null) {
    client.close();
}
...
client = CuratorFrameworkFactory.newClient(
        lastHost,
        sessionTimeout,
        connectionTimeout,
        retryPolicy);
...
client.start();
I can successfully connect to the new set of connection strings (fallback), but the problem is that the previous client keeps trying to connect to the previous connection strings.
Looking at the close() method I saw that Curator only closes things if the state of the client is STARTED; I think that's why Curator keeps trying to connect to the previous cluster.
Is there a way to close() the Curator client when it does not have the STARTED state?
If not, is there another way to implement this logic (fallback ZooKeeper servers) with Curator?
Thanks.

How to reconnect a HazelcastClient to HazelcastServer after server restart

I'm having a problem using Hazelcast in an architecture based on microservices and Spring Boot.
One of the applications acts as the Hazelcast server and the others are clients of it.
However, if I have to update the application that is the Hazelcast server, the client applications drop their connection to the server, and when I bring up the new version of the server these client applications do not reconnect.
Is there any way of configuring the HazelcastClient to poll the server so that it reconnects as soon as the server comes back?
My client is like below:
@Bean
open fun hazelcastInstance(): HazelcastInstance? {
    return try {
        val clientConfig = ClientConfig()
        HazelcastClient.newHazelcastClient(clientConfig)
    } catch (e: Exception) {
        log.error("Could not connect to hazelcast server, server up without cache")
        null
    }
}
and I receive "com.hazelcast.client.HazelcastClientNotActiveException: Client is shutdown" if my server goes down.
I would be grateful if you could help me.
The Connection Attempt Limit and Connection Attempt Period configuration elements control the client's reconnection behaviour. The client will retry up to ClientNetworkConfig.connectionAttemptLimit times to reconnect to the cluster, and Connection Attempt Period is the duration in milliseconds between those connection attempts. Here is an example of how to configure them:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getNetworkConfig().setConnectionAttemptLimit(5);
clientConfig.getNetworkConfig().setConnectionAttemptPeriod(5000);
Starting with Hazelcast 3.9, you can use configuration element reconnect-mode to configure how the client will reconnect to the cluster after a disconnection. It has three options (OFF, ON or ASYNC). The option OFF disables the reconnection. ON enables reconnection in a blocking manner where all the waiting invocations will be blocked until a cluster connection is established or failed. The option ASYNC enables reconnection in a non-blocking manner where all the waiting invocations will receive a HazelcastClientOfflineException. Its default value is ON. You can see a configuration example below:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getConnectionStrategyConfig()
.setReconnectMode(ClientConnectionStrategyConfig.ReconnectMode.ON);

Hazelcast - Client mode - How to recover after cluster failure?

We are using hazelcast distributed lock and cache functions in our products. Usage of distributed locking is vitally important for our business logic.
Currently we are using the embedded mode (each application node is also a Hazelcast cluster member). We are going to switch to client-server mode.
The problem we have noticed with client-server is that, once the cluster is down for a period, after several attempts the clients are destroyed and any objects (maps, sets, etc.) that were retrieved from that client are no longer usable.
Also, the client instance does not recover even after the Hazelcast cluster comes back up (we receive HazelcastInstanceNotActiveException).
I know that this issue has been addressed several times and ended up as being a feature request:
issue1
issue2
issue3
My question: what should be the strategy to recover the client? Currently we are planning to enqueue a task in the client process, as shown below; based on a condition it will try to restart the client instance.
We check whether the client is running or not via a clientInstance.getLifecycleService().isRunning() check.
Here is the task code:
private class ClientModeHazelcastInstanceReconnectorTask implements Runnable {

    @Override
    public void run() {
        try {
            HazelCastService hazelcastService = HazelCastService.getInstance();
            HazelcastInstance clientInstance = hazelcastService.getHazelcastInstance();
            boolean running = clientInstance.getLifecycleService().isRunning();
            if (!running) {
                logger.info("Current clientInstance is NOT running. Trying to start hazelcastInstance from ClientModeHazelcastInstanceReconnectorTask...");
                hazelcastService.startHazelcastInstance(HazelcastOperationMode.CLIENT);
            }
        } catch (Exception ex) {
            logger.error("Error occured in ClientModeHazelcastInstanceReconnectorTask !!!", ex);
        }
    }
}
Is this approach suitable? I also tried to listen for lifecycle events, but could not make it work via events.
Regards
In Hazelcast 3.9 we changed the way connection and reconnection work in clients. You can read about the new behavior in the docs: http://docs.hazelcast.org/docs/3.9.1/manual/html-single/index.html#configuring-client-connection-strategy
I hope this helps.
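For reference, a minimal sketch of what that 3.9+ connection strategy configuration might look like in code (assuming the ClientConnectionStrategyConfig API from com.hazelcast.client.config; the ASYNC reconnect mode and synchronous start chosen here are illustrative assumptions, not a recommendation):
ClientConfig clientConfig = new ClientConfig();
ClientConnectionStrategyConfig strategyConfig = clientConfig.getConnectionStrategyConfig();
strategyConfig.setAsyncStart(false); // block at startup until the first cluster connection is made (illustrative)
strategyConfig.setReconnectMode(ClientConnectionStrategyConfig.ReconnectMode.ASYNC); // keep reconnecting without blocking invocations
HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);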
In Hazelcast 3.10 you may increase connection attempt limit from 2 (by default) to maximum:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getNetworkConfig().setConnectionAttemptLimit(Integer.MAX_VALUE);

The current connections count keeps increasing in my Elasticache Redis node

I am using Jedis in a Tomcat web app to connect to an Elasticache Redis node. The app is used by hundreds of users during the day. I am not sure if this is normal or not, but whenever I check the current connections count with CloudWatch metrics, I see the current connections increasing without ever falling back down.
This is my Jedis pool configuration:
public static JedisPool getPool() {
    if (pool == null) {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMinIdle(5);
        config.setMaxIdle(35);
        config.setMaxTotal(1500);
        config.setMaxWaitMillis(3000);
        config.setTestOnBorrow(true);
        config.setTestWhileIdle(true);
        pool = new JedisPool(config, PropertiesManager.getInstance().getRedisServer());
    }
    return pool;
}
and this is how I always use the pool connections to execute redis commands:
Jedis jedis = JedisUtil.getPool().getResource();
try {
    // Redis commands
} catch (JedisException e) {
    e.printStackTrace();
    throw e;
} finally {
    if (jedis != null) JedisUtil.getPool().returnResource(jedis);
}
With this configuration, the count is currently over 200. Am I missing something that is supposed to discard or kill unused connections? I set maxIdle to 35 and expected the count to fall back to 35 when traffic is very low, but this never happens.
We had the same problem. After investigating a bit further, we came across this (from the official Redis documentation - http://redis.io/topics/clients):
By default recent versions of Redis don't close the connection with the client if the client is idle for many seconds: the connection will remain open forever.
By default, AWS provides a timeout value of 0. Therefore, any connection that has been initialised with your Redis instance will be kept by Redis even if the connection initialised by your client is down.
Create a new cache parameter group in AWS with a timeout different from 0 and then you should be good.
In the cache parameter group you can edit timeout. It defaults to 0, which leaves idle connections open in Redis. If you set it to 100, Redis will remove connections that have been idle for 100 seconds.
You can check the pool size using JMX. Activating the idle evictor thread is a good idea. You can do so by setting the timeBetweenEvictionRunsMillis parameter on the JedisPoolConfig.
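As a rough sketch of that suggestion (assuming the commons-pool settings exposed by JedisPoolConfig; the interval and idle-time values below are illustrative, not recommendations):
JedisPoolConfig config = new JedisPoolConfig();
config.setMaxIdle(35);
config.setTestWhileIdle(true);                  // validate idle connections during eviction runs
config.setTimeBetweenEvictionRunsMillis(30000); // run the idle evictor every 30 seconds
config.setMinEvictableIdleTimeMillis(60000);    // evict connections that have been idle for 60+ seconds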
If you don't use transactions (EXEC) or blocking operations (BLPOP, BRPOP), you could stick to one connection if connection count is a concern for you. The Lettuce client is thread-safe when sharing one connection.
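Purely for illustration, sharing a single Lettuce connection could look roughly like this (the host/port and the io.lettuce.core 5.x API usage are assumptions on my part):
RedisClient client = RedisClient.create("redis://my-redis-host:6379");
StatefulRedisConnection<String, String> connection = client.connect();
RedisCommands<String, String> commands = connection.sync(); // one shared, thread-safe connection
commands.set("key", "value");
commands.get("key");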

Websphere MQ using JMS, closed connections stuck on the MQ

I have a simple JMS application deployed on OC4J under an AIX server. In my application I'm listening to some queues and sending to other queues on a WebSphere MQ deployed under an AS400 server.
The problem is that my connections to these queues are terminated/closed when they stay idle for some time, with the error MQJMS1016 (this is not the problem). When that happens I attempt to recover the connection and it works; however, the old connection is stuck at the MQ and will not terminate until it is terminated manually.
The recovery code goes as follows:
public void recover() {
    cleanup();
    init();
}

public void cleanup() {
    if (session != null) {
        try {
            session.close();
        } catch (JMSException e) {
        }
    }
    if (connection != null) {
        try {
            connection.close();
        } catch (JMSException e) {
        }
    }
}

public void init() {
    // typical initialization of the connection, session and queue...
}
The MQJMS1016 is an internal error and indicates that the connection loss is due to something wrong with the code or WMQ itself. Tuning the channels will help but you really need to get to the problem of why the app is spewing orphaned connections fast enough to exhaust all available channels.
The first thing I'd want to do is check the versions of WMQ and of the WMQ client that are running. If this is new development, be sure you are using the WMQ v7 client because v6 is end-of-life as of Sept 2011. The v7 client works with v6 QMgrs until you are able to upgrade those as well. Once you get to a v7 client and QMgr, there are quite a few channel tuning and reconnection options available to you.
The WMQ v7 client download is here: http://bit.ly/bXM0q3
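As one example of those reconnection options, the v7 JMS classes let you request automatic client reconnect when building the connection factory programmatically (a sketch, assuming the com.ibm.msg.client.jms and com.ibm.msg.client.wmq.WMQConstants APIs; the 600-second timeout is an illustrative value):
JmsFactoryFactory ff = JmsFactoryFactory.getInstance(WMQConstants.WMQ_PROVIDER);
JmsConnectionFactory cf = ff.createConnectionFactory();
// Ask the client to transparently reconnect to the queue manager after a connection loss...
cf.setIntProperty(WMQConstants.WMQ_CLIENT_RECONNECT_OPTIONS, WMQConstants.WMQ_CLIENT_RECONNECT);
// ...and stop trying after 600 seconds (illustrative).
cf.setIntProperty(WMQConstants.WMQ_CLIENT_RECONNECT_TIMEOUT, 600);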
Also, note that the reconnect logic in the code above does not sleep between attempts. If a client throws connection requests at a high rate of speed, it can overload the WMQ listener and execute a very effective DOS attack. Recommended to sleep a few seconds between attempts.
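For example, the recover() method from the question could be adapted along these lines (a sketch only; the retry loop and 5-second delay are my assumptions):
public void recover() {
    cleanup();
    while (true) {
        try {
            init();
            return; // reconnected successfully
        } catch (Exception e) {
            e.printStackTrace();
            try {
                Thread.sleep(5000); // back off before the next attempt so the listener is not flooded
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}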
Finally, please, PLEASE print the linked exceptions in your JMSException catch blocks. If you have a problem with a JMS transport provider, the JMS Linked Exception will contain any low-level error info. In the case of WMQ it contains the Reason Code such as 2035 MQRC_AUTHORIZATION_ERROR or 2033 MQRC_NO_MSG_AVAILABLE. Here's an example:
try {
    // ... code that might throw a JMSException ...
} catch (JMSException je) {
    System.err.println("caught " + je);
    Exception e = je.getLinkedException();
    if (e != null) {
        System.err.println("linked exception: " + e);
    } else {
        System.err.println("No linked exception found.");
    }
}
If you get an error at 2am some night, your WMQ administrator will thank you for the linked exceptions.
Since the orphaned connections (stuck connections on the MQ side) did not affect message processing (i.e. they did not consume messages), we left things as they were until the maximum number of connections allowed on the MQ was reached.
At that point the recovery did not work anymore, and the MQ administrator had to clean up the orphaned connections manually. However, the good news is that searching for this particular problem led to an issue reported on the IBM support site:
check here
