How to increase timeout setting in Gremlin Java Remote client? - java

We are getting the exception below while performing a load test on our application, which uses the Gremlin Java client.
How do we solve this issue?
Exception:
java.lang.IllegalStateException: org.apache.tinkerpop.gremlin.process.remote.RemoteConnectionException: java.lang.RuntimeException: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timed out while waiting for an available host - check the client configuration and connectivity to the server if this message persists
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.promise(RemoteStep.java:98)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.processNextStart(RemoteStep.java:65)
at org.apache.tinkerpop.gremlin.process.traversal.step.ut

"Timed out while waiting for an available host" - This is most certainly a connectivity issue between your client and DB. There are numerous answers around debugging connectivity to Neptune, please try them out. To begin with, can you try the following from your client machine?
telnet <db-endpoint> <db-port>
You would most likely see that it's waiting to establish the connection, which would confirm this hypothesis.
In general, establishing a connection to the server is fairly quick. The only timeout that you need to worry about is query timeout, and Neptune has a parameter group entry for that.
https://docs.aws.amazon.com/neptune/latest/userguide/parameters.html
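If the connection itself is healthy and the slowness is actually on the query side, the evaluation timeout can also be raised per request from the Java client. A minimal sketch, assuming TinkerPop 3.4+ and an already-connected remote GraphTraversalSource g (the value is illustrative, not a recommendation):
import org.apache.tinkerpop.gremlin.driver.Tokens;
// 'g' is assumed to be a remote GraphTraversalSource; raise the server-side
// evaluation timeout (in ms) for this one traversal only.
long count = g.with(Tokens.ARGS_EVAL_TIMEOUT, 180_000L)
              .V().count().next();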

I faced the same error. Neptune does not log the error stack trace. For me, the TimeoutException appeared when CPU utilization went above 60 percent, and the CPU went that high because of the number of connections being made to the DB.
Gremlin is based on WebSockets, and multiple requests can be multiplexed over the same channel. Setting maxInProcessPerConnection and maxSimultaneousUsagePerConnection helped me reduce the error rate to 0 percent. These parameters control how many requests are multiplexed within one connection. In my case around 50 workers read/write concurrently.
I observed that for my use case, setting the values to 32 led to the lowest CPU usage.
Below are the Cluster properties I am settling on for now.
By default, if not specified, the Cluster keeps a pool of at most 8 WebSocket connections. I was getting a TimeoutException when maxPoolSize was set to 100.
Cluster cluster = Cluster.build()
        .addContactPoint(uri)
        .port(port)
        .serializer(Serializers.GRAPHBINARY_V1D0)
        .maxInProcessPerConnection(32)            // requests multiplexed per connection
        .maxSimultaneousUsagePerConnection(32)
        .maxContentLength(10000000)
        .maxWaitForConnection(10)
        .create();
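For completeness, a hedged sketch (TinkerPop 3.4+ class names; verify against your driver version) of wiring this Cluster to a remote traversal source. The key point is to build the Cluster and GraphTraversalSource once and share them across worker threads; creating a new Cluster per request defeats the pooling configured above.
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

// 'cluster' is the instance built above; reuse one GraphTraversalSource app-wide
GraphTraversalSource g = AnonymousTraversalSource.traversal()
        .withRemote(DriverRemoteConnection.using(cluster));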


Failed to connect to Tomcat server on ec2 instance

UPDATE:
My goal is to learn what factors could overwhelm my little Tomcat server, and when an exception happens, what I can do to resolve or remediate it without switching my server to a better machine. This is not a real app in a production environment, just my own experiment (besides changes on the server side, I may also do something on the client side).
Both my client and server are very simple: the server only checks the URL format and sends a 201 code if it is correct. Each request sent from my client only includes a simple JSON body. There is no database involved. The two machines (t2.micro) only run the client and the server, respectively.
My client uses OkHttpClient. To avoid timeout exceptions, I already set the timeouts to 1,000,000 milliseconds via setConnectTimeout, setReadTimeout, and setWriteTimeout. I also went to $CATALINA/conf/server.xml on my server and set connectionTimeout="-1" (infinite).
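Roughly, the client timeout configuration looks like this (sketched here with the OkHttp 3.x/4.x Builder API; the setters named above are the OkHttp 2.x equivalents):
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

// Very long timeouts to avoid client-side timeout exceptions during the stress test
OkHttpClient client = new OkHttpClient.Builder()
        .connectTimeout(1_000_000, TimeUnit.MILLISECONDS)
        .readTimeout(1_000_000, TimeUnit.MILLISECONDS)
        .writeTimeout(1_000_000, TimeUnit.MILLISECONDS)
        .build();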
ORIGINAL POST:
I'm trying to stress my server by having a client launch 3000+ threads that send HTTP requests to it. My client and server reside on separate EC2 instances.
Initially, I encountered some timeout issues, but after I set the connect, read, and write timeouts to a bigger value, that exception was resolved. However, with the same setup, I'm now getting a java.net.ConnectException: Failed to connect to my_host_ip:8080 exception, and I do not know its root cause. I'm new to multithreading and distributed systems; can anyone give me some insight into this exception?
Below are some screenshots from my EC2 instances:
1. Client:
2. Server:
Having gone through a similar exercise in the past, I can say that there is no definitive answer to the problem of scaling.
Here are some general troubleshooting steps that may lead to more specific information. I would suggest running tests, tweaking a few parameters in each, and measuring the changes in CPU, logs, etc.
Please provide the value you have set for the timeout. Increasing the timeout could cause your server (or client) to run out of threads quickly (because each thread stays busy for longer). Question the need to increase the timeout: is there any processing that slows your server down?
Check the application logs, JVM usage, and memory usage on the client and server. There will be some hints there.
Your client seems to hit 99%+ CPU and then come down. This implies that there could be a problem on the client side in that it maxes out during the test. You might want to resize your client so it can do more.
Look at open file handles. The limit should be sufficiently high.
Tomcat has a limit on the number of threads it uses to handle load. You can check this in server.xml and, if required, raise it to handle more (see the Connector sketch after this list). That said, CPU doesn't actually max out on the server side, so this is unlikely to be the problem.
If you use a database, check its performance. Also check the JDBC connection settings; there is thread and timeout configuration at the JDBC level as well.
Is response compression set up on Tomcat? It will give much better throughput on the server, especially if the data sent back per request is more than a few KB.
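For reference, the thread limit mentioned above lives on the Connector element in server.xml; a rough sketch with illustrative values only (not a recommendation):
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="400"
           acceptCount="200"
           connectionTimeout="20000" />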
--------Update----------
Based on the update to the question, a few more thoughts.
Since the application is fairly simple, the approach to stressing the server should be to start low and increase the load in increments whilst monitoring various things (CPU, memory, JVM usage, file handle count, network I/O).
The increments of load should be spread over several runs.
Start with something as low as 100 parallel threads.
Record as much information as you can after each run and if the server holds up well, increase load.
Suggested increments 100, 200, 500, 1000, 1500, 2000, 2500, 3000.
At some level you will see that the server can no longer take it. That would be your breaking point.
As you increase the load and monitor, you will likely discover patterns that suggest tuning specific parameters. Each tuning attempt should then be tested against the same level of multithreading; any improvement will be obvious from the monitoring.
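A rough sketch of what such a ramp could look like on the client side, assuming a hypothetical sendRequest() helper that wraps one HTTP POST from the load client (structure and values are illustrative only):
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RampLoadTest {
    // Concurrency levels from the suggested increments above
    static final int[] LEVELS = {100, 200, 500, 1000, 1500, 2000, 2500, 3000};

    public static void main(String[] args) throws InterruptedException {
        for (int level : LEVELS) {
            ExecutorService pool = Executors.newFixedThreadPool(level);
            CountDownLatch done = new CountDownLatch(level);
            long start = System.nanoTime();
            for (int i = 0; i < level; i++) {
                pool.submit(() -> {
                    try {
                        sendRequest();      // hypothetical: one HTTP POST, expect a 201
                    } finally {
                        done.countDown();
                    }
                });
            }
            done.await();
            pool.shutdown();
            System.out.printf("level=%d took=%d ms%n", level,
                    (System.nanoTime() - start) / 1_000_000);
            Thread.sleep(5_000);            // let CPU/memory/file-handle readings settle
        }
    }

    static void sendRequest() {
        // hypothetical placeholder for the OkHttp POST used in the test
    }
}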

PostgreSQL JDBC Connection issue

We have a PostgreSQL 9.6 instance on an Ubuntu 18.04 machine. When we restart the Java services deployed in a Kubernetes cluster, the already existing idle connections don't get removed and the services create new connections on each restart. Because of this, we have hit the connection limit many times and have to terminate connections manually every time. The same service versions are deployed on other instances, but we don't see this behavior on those servers.
I have some questions regarding this
Can it be a PostgreSQL configuration issue? I didn't find any timeout-related setting differences between the two instances (one is working fine and the other isn't).
If this is a Java service issue, what should I check?
If it's neither a PostgreSQL issue nor a Java issue, what should I look into?
If the client process dies without closing the database connection properly, it takes a while (2 hours by default) for the server to notice that the connection is dead.
The mechanism for that is provided by TCP and is called keepalive: after a certain idle time, the operating system starts sending keepalive packets. After a certain number of such packets without response, the TCP connection is closed, and the database backend process will die.
To make PostgreSQL detect dead connections faster, set the tcp_keepalives_idle parameter in postgresql.conf to less than 7200 seconds.
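For illustration, the relevant postgresql.conf entries look roughly like this (values are examples only, not recommendations):
tcp_keepalives_idle = 300        # seconds of idle time before the first keepalive probe
tcp_keepalives_interval = 30     # seconds between probes
tcp_keepalives_count = 3         # unanswered probes before the connection is considered dead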

Using Spring REST template, either creating too many connections or slow

I have a RESTful service that works very fast. I am testing it on localhost. The client is using Spring REST template. I started by using a naive approach:
RestTemplate restTemplate = new RestTemplate(Collections.singletonList(new GsonHttpMessageConverter()));
Result result = restTemplate.postForObject(url, payload, Result.class);
When I make a lot of these requests, I am getting the following exception:
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on POST request for "http://localhost:8080/myservice":No buffer space available (maximum connections reached?): connect; nested exception is java.net.SocketException: No buffer space available (maximum connections reached?): connect
This is caused by connections not being closed and hanging in TIME_WAIT state. The exception starts happening when the ephemeral ports are exhausted. Then the execution waits for the ports to be free again. I am seeing peak performance with long breaks. The rate I am getting is almost what I need, but of course, these TIME_WAIT connections are not good. Tested both on Linux (Ubuntu 14) and Windows (7), similar results at different times due to different ranges of the ports.
To fix this, I tried using an HttpClient built with HttpClientBuilder from the Apache HttpComponents library.
RestTemplate restTemplate = new RestTemplate(Collections.singletonList(new GsonHttpMessageConverter()));
HttpClient httpClient = HttpClientBuilder.create()
.setMaxConnTotal(TOTAL)
.setMaxConnPerRoute(PER_ROUTE)
.build();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory(httpClient));
Result result = restTemplate.postForObject(url, payload, Result.class);
With this client, I see no exceptions. The client is now using only a very limited number of ephemeral ports. But whatever settings I use (TOTAL and PER_ROUTE), I can't get the performance I need.
Using the netstat command, I see that not many connections are actually open to the server. I tried setting the numbers to several thousand, but it seems the client never uses that many.
Is there anything I can do to improve the performance, without opening too many connections?
UPDATE: I've tried setting the number of total and per-route connections to 5000 and 2500, but it still looks like the client is not creating more than a hundred (judging from netstat -n | wc -l). The REST service is implemented using JAX-RS and running on Jetty.
UPDATE2: I have now tuned the server with some memory settings and I am getting really good throughput. The naive approach is still a bit faster, but I think it's just a little overhead of the pooling on client side.
Actually Spring Boot is not leaking connections. What you're seeing here is standard behavior of the Linux kernel (and every major OS). All sockets that are closed from the machine go to a TIME_WAIT state for some duration of time. This is to prevent the next socket that uses that ephemeral port from receiving packets that were actually intended for the previous socket on that port. The difference you're seeing between the two is a result of the connection pooling approaches each one takes.
More specifically, RestTemplate does not use connection pooling by default. This means every rest call opens a new local ephemeral port and a new connection to the server. If your service is very fast, it will blow through its available local port range in no time at all. With the Apache HttpClient, you are taking advantage of connection pooling. This will prevent your application from seeing the problem that you described. However, given that your service is able to respond faster than the Linux kernel is taking sockets out of TIME_WAIT, connection pooling will make your client slower no matter what you do (if it didn't slow anything down - then you'd run out of local ephemeral ports again).
While it's possible to enable TCP port reuse in the Linux kernel, it can get dangerous (packets can get delayed, and ephemeral ports could receive stray packets they don't understand, which could cause all kinds of problems). The solution here is to use connection pooling as you have in the second example, with sufficiently high numbers to get close to the performance you're looking for.
To help you tune your connection pool, you'll want to tweak the maxConnPerRoute and maxConnTotal parameters. maxConnPerRoute limits the number of connections that will be made to a single IP:port pair, and maxConnTotal limits the total number of connections that will ever be opened. In your case, since it appears all requests are made to the same location, you could set them to the same (high) value.
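Roughly what that could look like, with 1000 as a purely illustrative value to be tuned against your own measurements:
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

// Pool sized high, with per-route equal to total since every request hits one host:port
HttpClient httpClient = HttpClientBuilder.create()
        .setMaxConnTotal(1000)
        .setMaxConnPerRoute(1000)
        .build();
RestTemplate restTemplate = new RestTemplate(
        new HttpComponentsClientHttpRequestFactory(httpClient));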

Cloud SQL database seems to always shutdown even with 'Always On' activation

I am benchmarking Google Cloud SQL. I started with the default "On Demand" activation policy, since I work on a test platform with very few hits, to save a few dollars.
It takes about 20-30 s to connect for the very first queries (which I think is caused by the database being started). After that, performance is great.
Now I have switched to the "Always On" activation policy. I was expecting the exact same response times on the very first requests to my website. BUT: just like with the "On Demand" policy, it takes about 30 s to reconnect to the database. The time is spent in the connection pool trying to reconnect to the database, so I am sure it is Cloud SQL time.
I suspect the "Always On" policy does absolutely nothing (except maybe cost more; I haven't checked yet), and I get the feeling that the database continues to be shut down. Maybe it changes the timeout policy slightly?
I found out this thread :
First connect from Prestashop to Google Cloud SQL always fails
So apparently there are still timeouts, but we can change them depending on the billing plan?
This is very unclear to me.
So here are my questions:
What is the timeout of a SQL instance for "Always ON" policy with "Per Use" billing?
What is the timeout of a SQL instance for "Always ON" policy with "Package" billing?
Is there a way I can manually set my own timeouts? After all, I am the one who pays... If I want my instance to keep running, that is my problem.
EDIT
I am sure it is a connection problem because I previously had a 3-second timeout on web requests. With this timeout set, all my requests threw the following exception:
Caused by: java.sql.SQLException: Cannot get a connection, general error
at org.apache.tomcat.dbcp.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:130)
at org.apache.tomcat.dbcp.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1412)
at org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource.getConnection(AbstractRoutingDataSource.java:148)
at org.hibernate.ejb.connection.InjectedDataSourceConnectionProvider.getConnection(InjectedDataSourceConnectionProvider.java:71)
at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:446)
... 27 more
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at org.apache.tomcat.dbcp.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:582)
at org.apache.tomcat.dbcp.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:439)
at org.apache.tomcat.dbcp.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:360)
at org.apache.tomcat.dbcp.dbcp2.PoolingDataSource.getConnection(PoolingDataSource.java:118)
... 31 more
With the help of Cloud SQL support, the problem apparently came from my GCE server "forgetting" TCP connections after 10 minutes. Since it was a test server, the connections from it to the Cloud SQL instance were long-lived, unused connections.
So to work around this, I used TCP keepalive, as advised by Google:
sudo bash -c 'echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time'
Information about this configuration can be found here: https://cloud.google.com/sql/docs/gce-access (section 6).
Don't forget to restart any application that connects to the Cloud SQL instance after applying this setting.
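Note that writing to /proc only lasts until the next reboot; the usual way to make it persistent (standard Linux, nothing Cloud SQL specific) is an entry in /etc/sysctl.conf or a file under /etc/sysctl.d/, applied with sudo sysctl -p:
net.ipv4.tcp_keepalive_time = 60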

Getting "Write attempt on defunct connection" Error From Datastax Cassandra Java Driver

I have a web service application using Cassandra 2.0 and the Datastax Java driver 2.0.2. I sometimes get the stack trace below when trying to write to or read from the database, especially if the application has been sitting idle for a while (like overnight). This error usually goes away when I retry; however, sometimes it persists and I have to restart the web app to get rid of it.
I wonder if this is some sort of "stale connection" issue. However, the Datastax java driver documentation indicates it is supposed to keep the connection alive.
I did a Google search on the error message and only two (!) hits came up; they are related. This is the answer in one of the results:
Sylvain Lebresne Apr 2 You're running into
https://datastax-oss.atlassian.net/browse/JAVA-250. We'll fix it soon
hopefully (I have some half-finished patch that I need to finish), but
currently, if you restart a whole cluster without doing queries during
the restart, it can sometimes happen that you'll get this before the
cluster properly reconnect. In the meantime and as a workaround, you
can always make sure to run a few trivial queries while you're doing
the cluster restart to avoid it.
However this does not look like my scenario because we are not restarting the cluster at all. I wonder if anyone has some insights about this error?
Stacktrace:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ec2-54-197-xxx-xxx.compute-1.amazonaws.com/54.197.xxx.xxx:9042 (com.datastax.driver.core.ConnectionException: [ec2-54-197-xxx-xxx.compute-1.amazonaws.com/54.197.xxx.xxx:9042] Write attempt on defunct connection))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:172)
at com.datastax.driver.core.SessionManager.execute(SessionManager.java:92)
I have what I believe is the exact same issue (Write attempt on defunct connection) on my development machine intermittently.
It seems to happen when my dev machine goes to sleep while the server is up. Obviously there's no power management in the AWS cluster you're running, but it gives you a hint - the key is that something is breaking your control connection or intermittently preventing network connectivity between your hosts.
You should see the reconnection thread in your logs:
21:34:51.616 [Reconnection-1] ERROR c.d.driver.core.ControlConnection - [Control connection] Cannot connect to any host, scheduling retry in 2000 milliseconds
The next request after this will always succeed in my experience.
TL;DR - check for networking issues or any intermittent shutdown of servers that could break the control connection. The driver should do a better job of re-establishing broken control connections; it sounds like they're working on it in JAVA-250.
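As a stop-gap while that issue is open, one thing worth trying (sketched against the 2.x driver API; verify these options exist in your exact driver version) is enabling TCP keepalive on the driver's sockets and using a short reconnection delay, so broken connections are noticed and re-established sooner:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.ConstantReconnectionPolicy;

// Illustrative values only: keepalive helps the OS notice dead connections,
// and a 1-second constant reconnection delay retries broken hosts quickly
Cluster cluster = Cluster.builder()
        .addContactPoint("ec2-54-197-xxx-xxx.compute-1.amazonaws.com")
        .withSocketOptions(new SocketOptions().setKeepAlive(true))
        .withReconnectionPolicy(new ConstantReconnectionPolicy(1000L))
        .build();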
