Tell Datastax Java Cassandra driver to timeout cluster connection

How do you tell the Datastax Java Cassandra driver to time-out when it attempts to connect to your cluster?
I'm particularly interested in the case when the hosts are reachable but the Cassandra ports are blocked or the Cassandra daemons are not running. I'm writing a command-line client that ought to exit and report a suitable error message if it cannot connect within a reasonable time. At present it seems that the driver will wait forever for a contact point to respond, as long as the contact point is reachable.
That is, I want Cluster.build() to throw a NoHostAvailableException if the driver cannot communicate with the Cassandra daemon of any of the contact points within a given maximum time.
Creating my own RetryPolicy won't work: that is for retrying queries, and I want the timeout to apply before we are ready to run queries.
Creating my own ReconnectionPolicy initially looked promising, but the contract of that interface gives no means of indicating "consider this node dead forevermore".

That is, I want Cluster.build() to throw a NoHostAvailableException if the driver cannot communicate with the Cassandra daemon of any of the contact points within a given maximum time.
This is supposed to be the case. The driver will try to connect to each of the contact points and throw an exception if it fails to connect to any. You can control the maximum time the driver will spend trying to connect (to each node) through SocketOptions.setConnectTimeoutMillis() (the default is 5 seconds).
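For illustration, a minimal sketch of lowering that connect timeout when building the Cluster (the contact point and the 2-second value here are placeholders, not recommendations):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

// Fail faster per contact point: 2 seconds instead of the default 5.
SocketOptions socketOptions = new SocketOptions()
        .setConnectTimeoutMillis(2000);

Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")        // placeholder contact point
        .withSocketOptions(socketOptions)
        .build();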
My experience is that Cluster.build() does throw an exception if no node can be connected to, but if your experience differs, you might want to report it as a bug (a bit more detail on how you reproduce this would help).
That being said:
The timeout above is per host. So if you pass a list of 100 contact points, you could in theory have to wait 500 seconds (by default) before getting the NoHostAvailableException. But there is no real point in providing that many contact points, and in practice, if Cassandra is not running on the node tried, the connection attempt will usually fail right away (you won't wait out the timeout).
There is currently no real query timeout on the driver side. This means that if the driver does connect to a node (i.e. some process is listening on that port and accepts the connection) but gets no answer to its initial messages, it can indeed hang forever. That should probably be fixed, and I encourage you to open a ticket for it on https://datastax-oss.atlassian.net/browse/JAVA. However, this doesn't seem to be the case you are describing, since if "Cassandra ports are blocked or the Cassandra daemons are not running" then the driver shouldn't be able to connect in the first place.

Related

Failed to ping node [nodeId=null]

We have an Ignite setup (apache-ignite-2.13.0-1, Zulu Java 11.0.13, RHEL 8.6) with 3 server nodes and ~20 clients joining the topology as client nodes. The client application additionally connects via JDBC. The application is from a 3rd-party vendor, so I don't know what they are doing internally.
For some time now we have seen that one of the 3 servers always logs a huge number of these warnings:
[12:40:41,446][WARNING][tcp-disco-ip-finder-cleaner-#7-#62][TcpDiscoverySpi] Failed to ping node [nodeId=null]. Reached the timeout 60000ms. Cause: Connection refused (Connection refused)
It did not always do this; Ignite and the application were updated multiple times, and at some point the warning started showing up.
I don't understand what this means. All the nodes I see in the topology with ignitevisor have a nodeId set, but here it is null. All server nodes and clients have full connectivity between each other on all high ports. All expected nodes are shown in the topology.
So what is this node with nodeId=null? How can I find more about where that comes from?
Wrapping it up,
the message was introduced in 2.11 in order to provide additional logging around communication and networking.
The warning itself just means that a node might not be accessible from the current one, i.e. we can't ping that node. That is normal in many cases and you can ignore this warning.
The implementation seems rather flawed: the message should be written only the first time instead of producing a bunch of duplicates. Also, this type of logging used to be at DEBUG level, whereas now it has become more severe - WARN - for no good reason.
There is an open ticket for an improvement.

PostgreSQL JDBC Connection issue

We have a PostgreSQL 9.6 instance on an Ubuntu 18.04 machine. When we restart the Java services deployed in a Kubernetes cluster, the already existing idle connections don't get removed and the service creates new connections on each restart. Because of this we have hit the connection limit many times and have to terminate connections manually every time. The same service versions are deployed on other instances, but we are not seeing this scenario on those servers.
I have some questions regarding this:
Can it be a PostgreSQL configuration issue? However, I didn't find any timeout-related setting difference between the 2 instances (one is working fine and the other isn't).
If this is a Java service issue, then what should I check?
If it's neither a PostgreSQL issue nor a Java issue, then what should I look into?
If the client process dies without closing the database connection properly, it takes a while (2 hours by default) for the server to notice that the connection is dead.
The mechanism for that is provided by TCP and is called keepalive: after a certain idle time, the operating system starts sending keepalive packets. After a certain number of such packets without response, the TCP connection is closed, and the database backend process will die.
To make PostgreSQL detect dead connections faster, set the tcp_keepalives_idle parameter in postgresql.conf to less than 7200 seconds.
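For example, a rough sketch of the relevant keepalive settings in postgresql.conf (the values below are illustrative only, not recommendations):

# send the first keepalive probe after 5 minutes of idle time
tcp_keepalives_idle = 300
# then probe every 30 seconds
tcp_keepalives_interval = 30
# drop the connection after 3 unanswered probes
tcp_keepalives_count = 3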

How to increase timeout setting in Gremlin Java Remote client?

We are getting the exception below while performing a load test on our application, which uses Gremlin Java.
How do we solve this issue?
Exception:
java.lang.IllegalStateException: org.apache.tinkerpop.gremlin.process.remote.RemoteConnectionException: java.lang.RuntimeException: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timed out while waiting for an available host - check the client configuration and connectivity to the server if this message persists
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.promise(RemoteStep.java:98)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.processNextStart(RemoteStep.java:65)
at org.apache.tinkerpop.gremlin.process.traversal.step.ut
"Timed out while waiting for an available host" - This is most certainly a connectivity issue between your client and DB. There are numerous answers around debugging connectivity to Neptune, please try them out. To begin with, can you try the following from your client machine?
telnet <db-endpoint> <db-port>
You would most likely see that it's waiting to establish the connection, which confirms this hypothesis.
In general, establishing a connection to the server is fairly quick. The only timeout that you need to worry about is the query timeout, and Neptune has a parameter group entry for that.
https://docs.aws.amazon.com/neptune/latest/userguide/parameters.html
I faced the same error. Neptune does not log the error stack trace. For me, the TimeoutException appeared when CPU usage went above 60 percent; the CPU went that high because of the many connections being made to the DB.
Gremlin is based on websockets, and multiple requests can be multiplexed over the same channel. Adding maxInProcessPerConnection and maxSimultaneousUsagePerConnection helped me reduce the error rate to 0 percent. These parameters set the number of requests that can be multiplexed within one connection. In my case around 50 workers concurrently read/write.
I observed that for my use case setting the values to 32 led to minimum CPU usage.
Below are the Cluster properties I am settling on for now.
By default, Cluster keeps a pool of at most 8 websocket connections if nothing is specified. I was getting the TimeoutException when maxPoolSize was set to 100.
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.ser.Serializers;

// uri and port are the connection details for your Gremlin server
Cluster cluster = Cluster.build()
        .addContactPoint(uri)
        .port(port)
        .serializer(Serializers.GRAPHBINARY_V1D0)
        .maxInProcessPerConnection(32)          // requests multiplexed per connection
        .maxSimultaneousUsagePerConnection(32)
        .maxContentLength(10000000)
        .maxWaitForConnection(10)
        .create();

Cassandra Datastax connection pooling monitor/metrics

My team is moving from the Astyanax driver (which is deprecated soon, if not already) to the Datastax 3.0 driver.
Our code implements Astyanax's ConnectionPoolMonitor class and we capture about 22 different metrics on our connection pool usage.
I am trying to find an equivalent way to do this with Datastax driver. But all I could find is this:
https://datastax.github.io/java-driver/manual/pooling/#monitoring-and-tuning-the-pool
Basically, the example above shows how you can run a background thread that continuously polls Session.State. This seems rather awkward. Astyanax does callbacks to the classes that implement ConnectionPoolMonitor.
And the amount of info exposed in Session.State is rather limited: connected hosts, inflight queries, open connections, and trashed connections.
Is there a better option out there that I haven't found somehow? How can I capture metrics such as these:
count of pool exhausted, connection timeout, socket timeout, and no-hosts-available events
count of connections created, closed, borrowed, returned, and creation errors
count of hosts added, removed, down, and reactivated/reconnected
count of unknown error, bad request, interrupted, and transport error exceptions
Try cluster.getMetrics() and read this Java doc: http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/Metrics.html
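For illustration, a rough sketch of reading a few of those values from the driver's Metrics object (method names as I recall them from driver 3.x; double-check against the Javadoc above, and cluster is assumed to be your existing Cluster instance):

import com.codahale.metrics.MetricRegistry;
import com.datastax.driver.core.Metrics;

Metrics metrics = cluster.getMetrics();   // cluster: your existing Cluster instance

// Gauges maintained by the driver
int openConnections = metrics.getOpenConnections().getValue();
int connectedHosts = metrics.getConnectedToHosts().getValue();

// Error counters (read/write timeouts, unavailables, connection errors, ...)
long readTimeouts = metrics.getErrorMetrics().getReadTimeouts().getCount();
long connectionErrors = metrics.getErrorMetrics().getConnectionErrors().getCount();

// Or plug the underlying Dropwizard registry into your own reporting setup
MetricRegistry registry = metrics.getRegistry();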

Getting "Write attempt on defunct connection" Error From Datastax Cassandra Java Driver

I have a web service application using Cassandra 2.0 and Datastax java driver 2.0.2. I sometimes get the stacktrace below when trying to write to/read from database, especially if the application has been sitting there for a while (like overnight). This error usually goes away when I retry, however, sometimes it persists and I have to restart the web app to get rid of the error.
I wonder if this is some sort of "stale connection" issue. However, the Datastax Java driver documentation indicates it is supposed to keep the connection alive.
I did a Google search on the error message and only two (!) hits came up. They are related. This is the answer in one of the Google results:
Sylvain Lebresne, Apr 2: You're running into
https://datastax-oss.atlassian.net/browse/JAVA-250. We'll fix it soon
hopefully (I have some half-finished patch that I need to finish), but
currently, if you restart a whole cluster without doing queries during
the restart, it can sometimes happen that you'll get this before the
cluster properly reconnect. In the meantime and as a workaround, you
can always make sure to run a few trivial queries while you're doing
the cluster restart to avoid it.
However this does not look like my scenario because we are not restarting the cluster at all. I wonder if anyone has some insights about this error?
Stacktrace:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ec2-54-197-xxx-xxx.compute-1.amazonaws.com/54.197.xxx.xxx:9042 (com.datastax.driver.core.ConnectionException: [ec2-54-197-xxx-xxx.compute-1.amazonaws.com/54.197.xxx.xxx:9042] Write attempt on defunct connection))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:172)
at com.datastax.driver.core.SessionManager.execute(SessionManager.java:92)
I have what I believe is the exact same issue (Write attempt on defunct connection) on my development machine intermittently.
It seems to happen when my dev machine goes to sleep while the server is up. Obviously there's no power management in the AWS cluster you're running, but it gives you a hint - the key is that something is breaking your control connection or intermittently preventing network connectivity between your hosts.
You should see the reconnection thread in your logs:
21:34:51.616 [Reconnection-1] ERROR c.d.driver.core.ControlConnection - [Control connection] Cannot connect to any host, scheduling retry in 2000 milliseconds
The next request after this will always succeed in my experience.
TL;DR - check for networking issues or any intermittent shutdown of servers that could break the control connection. The driver should do a better job of re-establishing broken control connections; it sounds like they're working on that in JAVA-250.
