Lettuce Redis client: Differentiating between connectTimeout in SocketOptions and defaultTimeout in RedisClusterClient - java

My application uses the Lettuce Redis client to connect to AWS ElastiCache. I am trying to follow this guide to increase my service's resiliency. One of the suggested points concerns the socket timeout:
Ensure that the socket timeout of the client is set to at least one second (vs. the typical “none” default in several clients). Setting the timeout too low can lead to numerous timeouts when the server load is high. Setting it too high can result in your application taking a long time to detect connection issues.
Here is pseudo code for how I am creating the connection:
RedisClusterClient redisClusterClient = RedisClusterClient.create(clientResources, redisUrl);

// Topology refresh and periodic refresh
ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(true)
        .enableAllAdaptiveRefreshTriggers()
        .build();

// Update cluster topology periodically
redisClusterClient.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .build());

StatefulRedisClusterConnection<byte[], byte[]> connection = redisClusterClient.connect(new ByteArrayCodec());
I was going through the Lettuce docs and saw that there are two timeout options available for this (sketched below):
Use connectTimeout field in SocketOptions
Use defaultTimeout field in RedisClusterClient
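For context, here is roughly how each of the two options is wired up with the Lettuce 5.x builder API. This is only a sketch; the one-second values are illustrative, and the variable names reuse the code above:

import java.time.Duration;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.cluster.ClusterClientOptions;

// Option 1: connectTimeout governs how long Lettuce waits to establish the TCP connection to a node
SocketOptions socketOptions = SocketOptions.builder()
        .connectTimeout(Duration.ofSeconds(1))
        .build();
redisClusterClient.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .socketOptions(socketOptions)
        .build());

// Option 2: defaultTimeout governs how long synchronous command execution waits once a connection exists
redisClusterClient.setDefaultTimeout(Duration.ofSeconds(1));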
I would really appreciate it if someone could help me understand the difference between the two and which one works better for my use case.
EDIT: Here is what I have tried so far:
I tried using both SocketOptions and defaultTimeout() one at a time and ran some tests.
Here is what I did:
Test Case 1
Set connectTimeout in SocketOptions to 1s and updated the redisClusterClient object using the setOptions() method.
Used LitmusChaos to add latency of >1s to the calls made to AWS ElastiCache.
Used the ElastiCache failover API to bring down one of the nodes in the Redis cluster.
Test Case 2
Set defaultTimeout on the redisClusterClient to 1s.
Used LitmusChaos to add latency of >1s to the calls made to AWS ElastiCache.
Used the ElastiCache failover API to bring down one of the nodes in the Redis cluster.
Observation (For both TCs):
The Lettuce logs indicated that it was not able to connect to the node that was brought down (this was expected, as AWS was still in the process of replacing it).
Once the Redis node was back up in AWS ElastiCache, the Lettuce logs showed that it successfully reconnected to that node (this was unexpected, as I was still adding latency to the calls made to ElastiCache).
Am I missing some config here?

Related

Azure eventhub Kafka org.apache.kafka.common.errors.TimeoutException for some of the records

I have an ArrayList containing 80 to 100 records. I am trying to stream and send each individual record (the POJO, not the entire list) to a Kafka topic (Event Hub). A cron job is scheduled to run every hour to send these records to the Event Hub.
I am able to see messages being sent to the Event Hub, but after 3 to 4 successful runs I get the following exception (some messages are sent and several fail with the exception below):
Expiring 14 record(s) for eventhubname: 30125 ms has passed since batch creation plus linger time
Following is the producer config used:
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ACKS_CONFIG, "1");
props.put(ProducerConfig.RETRIES_CONFIG, "3");
Message retention period - 7
Partitions - 6
Using Spring Kafka (2.2.3) to send the events
The Kafka send is written in a method marked as @Async:
@Async
protected void send() {
    kafkaTemplate.send(record);
}
Expected - no exception to be thrown from Kafka
Actual - org.apache.kafka.common.errors.TimeoutException is thrown
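Since the send is asynchronous, the TimeoutException surfaces through the future returned by the template rather than from the @Async method itself. A minimal sketch of observing the failure, assuming Spring Kafka 2.2's ListenableFuture return type and an SLF4J `log` field (both assumptions, not part of the original code):

@Async
protected void send() {
    // send() returns a ListenableFuture; attach a callback to see which records fail and why
    kafkaTemplate.send(record).addCallback(
            result -> log.debug("Record sent successfully"),
            ex -> log.error("Failed to send record", ex));
}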
Prakash - we have seen a number of issues where spiky producer patterns see batch timeout.
The problem here is that the producer has two TCP connections that can go idle for > 4 mins - at that point, Azure load balancers close out the idle connections. The Kafka client is unaware that the connections have been closed so it attempts to send a batch on a dead connection, which times out, at which point retry kicks in.
Set connections.max.idle.ms to < 4mins – this allows Kafka client’s network client layer to gracefully handle connection close for the producer’s message-sending TCP connection
Set metadata.max.age.ms to < 4mins – this is effectively a keep-alive for the producer metadata TCP connection
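A sketch of how those two settings could be added to the producer properties from the question (the 180000 ms value is only an example; anything under the 4-minute idle limit should do):

props.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, "180000"); // close idle connections after 3 minutes, before the load balancer does
props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, "180000");        // refresh metadata every 3 minutes, acting as a keep-alive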
Feel free to reach out to the EH product team on Github, we are fairly good about responding to issues - https://github.com/Azure/azure-event-hubs-for-kafka
This exception indicates you are queueing records at a faster rate than they can be sent. Once a record is added to a batch, there is a time limit for sending that batch to ensure it is sent within a specified duration. This is controlled by the producer configuration parameter request.timeout.ms. If the batch has been queued longer than the timeout limit, the exception is thrown, and records in that batch are removed from the send queue.
Please check the following for a similar issue; it might help:
Kafka producer TimeoutException: Expiring 1 record(s)
You can also check this link:
when-does-the-apache-kafka-client-throw-a-batch-expired-exception/34794261#34794261 for more details about the batch expired exception.
Also implement a proper retry policy.
Note that this does not account for any network issues on the sender side. With network issues you will not be able to send to either hub.
Hope it helps.

How to check AWS SQS Connection idle time

Here is the thing,
I'm creating an SQS connection. I'm using the same connection to create consumers to listen to two different queues (Q1, Q2).
Enabling and disabling a queue is handled by the admin user of the application through a UI.
So whenever I disable the Q1 consumer, I shouldn't close the connection; the connection should be closed only when both the Q1 and Q2 consumers are disabled. I can't afford to write complex code to check whether both consumers are disabled.
Is there a way to check the idle time of an open SQSConnection?
or
I would like to know the cost of keeping an SQSConnection open all the time
or
How about opening two different connections?
Here is how I'm creating the connection:
SQSConnectionFactory connectionFactory = new SQSConnectionFactory(
        new ProviderConfiguration(),
        AmazonSQSClientBuilder.standard()
                .withRegion(sqsRegion)
                .withCredentials(_getCredentialsProvider(awsSecretKey, awsAccessKey)));
_connection = connectionFactory.createConnection();
The entire question, here, seems premised on the unfortunate name SQSConnectionFactory, which isn't what this really is. A more accurate name might have been something like SQSConfiguredClientFactory.
None of the createConnection methods set up the physical connection to SQS:
https://github.com/awslabs/amazon-sqs-java-messaging-lib/blob/master/src/main/java/com/amazon/sqs/javamessaging/SQSConnectionFactory.java
...because SQS doesn't actually use established/continuous "connections."
The service API interactions take place over HTTPS, with transient connections being created, kept alive, and destroyed as other methods (e.g. receiveMessage(queueUrl)) need them.
So with regard to your questions: 1. connections are not left "open" in any meaningful/relevant sense, so there is nothing to check; 2. the only cost comes from actually using the connections to send/receive/delete messages; and 3. this seems unnecessary for the reasons indicated above.
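Given that, the simplest approach may be to keep the single connection and just open and close per-queue sessions/consumers as the admin enables or disables them. A rough sketch using the standard JMS API that the SQS messaging library implements (the queue names and acknowledge mode below are placeholders):

import javax.jms.MessageConsumer;
import javax.jms.Session;

// One long-lived connection; per-queue sessions and consumers are closed independently
Session q1Session = _connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
MessageConsumer q1Consumer = q1Session.createConsumer(q1Session.createQueue("Q1"));

Session q2Session = _connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
MessageConsumer q2Consumer = q2Session.createConsumer(q2Session.createQueue("Q2"));

// Disabling Q1 only requires closing its consumer and session; the connection stays open for Q2
q1Consumer.close();
q1Session.close();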

Hazelcast : Tuning properties for a node having temporary network glitch in a cluster

We have an embedded Hazelcast cluster with 10 AWS instances. The Hazelcast version is 3.7.3. Right now we have the following settings for Hazelcast:
hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=150
hazelcast.heartbeat.interval.seconds=1
hazelcast.operation.call.timeout.millis=5000
hazelcast.merge.first.run.delay.seconds=60
Apart from the above settings, all other property values are defaults.
Recently one of the nodes was unreachable for a few minutes or so, and some operations slowed down while getting things from the cache. We have a backup for each map, so if entries were not available from one partition, Hazelcast should have responded from the backup, but it seems everything slowed down because of the single unreachable node.
Following is the exception that we saw in the Hazelcast logs:
[3.7.2] PartitionIteratingOperation invocation failed to complete due
to operation-heartbeat-timeout. Current time: 2017-05-30 16:12:52.442.
Total elapsed time: 10825 ms. Last operation heartbeat: never. Last
operation heartbeat from member: 2017-05-30 16:12:42.166.
Invocation{op=com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation{serviceName='hz:impl:mapService',
identityHash=1798676695, partitionId=-1, replicaIndex=0, callId=0,
invocationTime=1496160761670 (2017-05-30 16:12:41.670),
waitTimeout=-1, callTimeout=5000,
operationFactory=com.hazelcast.map.impl.operation.MapGetAllOperationFactory@2afbcab7}, tryCount=10, tryPauseMillis=300, invokeCount=1,
callTimeoutMillis=5000, firstInvocationTimeMs=1496160761617,
firstInvocationTime='2017-05-30 16:12:41.617', lastHeartbeatMillis=0,
lastHeartbeatTime='1970-01-01 00:00:00.000',
target=[172.18.84.36]:9123, pendingResponse={VOID},
backupsAcksExpected=0, backupsAcksReceived=0,
connection=Connection[id=12, /172.18.64.219:9123->/172.18.84.36:48180,
endpoint=[172.18.84.36]:9123, alive=true, type=MEMBER]}
Can someone suggest the correct settings for Hazelcast so that one node being temporarily unreachable doesn't slow down the whole cluster?
The operation call timeout should not be set to a low value; it is probably best to leave it at the default. Some internal mechanisms, such as the heartbeat, rely on the call timeout.
Based on the reference manual (version 3.11.7), I would recommend reading about split-brain syndrome.
Maybe you should configure a quorum to fall back on in case a node fails to communicate.
Also, from experience, I would recommend getting the reference manual specific to your version. Even if a default is supposed to be 5, I found that a specific version may recommend other values.
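A minimal sketch of configuring these properties and a quorum programmatically with the Hazelcast 3.x API; the quorum name, the size of 6, and the map name "myCache" are illustrative assumptions, not values from the question:

import com.hazelcast.config.Config;
import com.hazelcast.config.QuorumConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

Config config = new Config();
// Leave hazelcast.operation.call.timeout.millis at its default; tune failure detection via the heartbeat instead
config.setProperty("hazelcast.max.no.heartbeat.seconds", "60");

// Reject map operations unless at least 6 of the 10 members are present (illustrative threshold)
QuorumConfig quorumConfig = new QuorumConfig("atLeastSixMembers", true, 6);
config.addQuorumConfig(quorumConfig);
config.getMapConfig("myCache").setQuorumName("atLeastSixMembers");

HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);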

AWS ElasticSearch 2.3 Java HTTP bulk API

I'm attempting to use the bulk HTTP API in Java on AWS Elasticsearch 2.3.
When I use a REST client for the bulk load, I get the following error:
504 GATEWAY_TIMEOUT
When I run it as a Lambda in Java, for HTTP POSTs, I get:
{
"errorMessage": "2017-01-09T19:05:32.925Z 8e8164a7-d69e-11e6-8954-f3ac8e70b5be Task timed out after 15.00 seconds"
}
Through testing I noticed the bulk API doesn't work with these settings:
"number_of_shards" : 5,
"number_of_replicas" : 5
When shards and replicas are set to 1, I can do a bulk load no problem.
I have tried using this setting to allow for my bulk load as well:
"refresh_interval" : -1
but so far it has made no impact at all. In the Java Lambda, I load my data as an InputStream from an S3 location.
What are my options at this point for Java HTTP?
Is there anything else in index settings I could try?
Is there anything else in AWS access policy I could try?
Thank you for your time.
Edit 1:
I have also tried these params: _bulk?action.write_consistency=one&refresh but so far it makes no difference.
Edit 2:
Here is what made my bulk load work - setting the consistency param (I did NOT need to set refresh_interval):
URIBuilder uriBuilder = new URIBuilder(myuri);
uriBuilder = uriBuilder.addParameter("consistency", "one");
HttpPost post = new HttpPost(uriBuilder.build());
HttpEntity entity = new InputStreamEntity(myInputStream);
post.setEntity(entity);
From my experience, this issue can occur when your index replication settings cannot be satisfied by your cluster. This happens either during a network partition, or if you simply set a replication requirement that cannot be satisfied by your physical cluster.
In my case, this happens when I apply my production settings (number_of_replicas: 3) to my development cluster (which is a single-node cluster).
Your two solutions (setting the replicas to 1, or setting the consistency to one) resolve this issue because they allow Elasticsearch to continue the bulk index without waiting for additional replicas to come online.
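For reference, lowering the replica count is a one-line settings update and can be done with the same Apache HttpClient style used in the question. In this sketch, the endpoint URL, index name, and httpClient instance are placeholders:

// Reduce replicas so bulk indexing no longer waits for copies the cluster cannot host
HttpPut settingsUpdate = new HttpPut("https://my-es-endpoint/myindex/_settings");
settingsUpdate.setEntity(new StringEntity(
        "{ \"index\": { \"number_of_replicas\": 1 } }",
        ContentType.APPLICATION_JSON));
httpClient.execute(settingsUpdate);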
Elasticsearch probably could have a more intuitive message on failure; maybe they do in Elasticsearch 5.
Setting your cluster to a single

Read timeout on a web service, but the operation still completes?

I have a Java web service client running on Linux (using Axis 1.4) that invokes a series of web services operations performed against a Windows server. There are times that some transactional operations fail with this Exception:
java.net.SocketTimeoutException: Read timed out
However, the operation on the server completes (even though the client gets no useful response). Is this a bug in either the web service server or client? Or is it expected to happen on a TCP socket?
This is the expected behavior, rather than a bug. The operation behind the web service doesn't know anything about your read timing out so continues processing the operation.
You could increase the timeout of the connection - if you are manually manipulating the socket itself, the socket.connect() method can take a timeout (in milliseconds). A zero should avoid your side timing out - see the API docs.
If the operation is going to take a long time in each case, you may want to look at making this asynchronous - a first request submits the operations, then a second request to get back the results, possibly with some polling to see when the results are ready.
If you think the operation should be completing in this time, do you have access to the server to see why it is taking so long?
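Since the client here is Axis 1.4 specifically, the read timeout can also typically be raised directly on the generated stub. A sketch, where MyService, MyServiceLocator, and the port getter are placeholders for the generated client classes:

// Axis 1.4 generated stubs extend org.apache.axis.client.Stub, which exposes a per-call timeout in milliseconds
MyService port = new MyServiceLocator().getMyServicePort();
((org.apache.axis.client.Stub) port).setTimeout(120000); // 0 typically disables the timeout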
I had a similar issue. We have a JAX-WS SOAP web service running on JBoss EAP 6 (or JBoss 7). The default HTTP socket timeout is 60 seconds unless overridden on the server or by the client. To fix this issue I changed our Java client to something like this. I had to use 3 different combinations of properties here.
This combination seems to work as a standalone Java client or as a web service client running as part of another application on another web server.
// Build the web service client
String edxWsUrl = "http://www.example.com/service?wsdl";
URL wsUrl = new URL(edxWsUrl);
EdxWebServiceImplService edxService = new EdxWebServiceImplService(wsUrl);
EdxWebServiceImpl edxServicePort = edxService.getEdxWebServiceImplPort();

// Set timeouts on the client
BindingProvider edxWebserviceBindingProvider = (BindingProvider) edxServicePort;
edxWebserviceBindingProvider.getRequestContext().put("com.sun.xml.internal.ws.request.timeout", connectionTimeoutInMilliSeconds);
edxWebserviceBindingProvider.getRequestContext().put("com.sun.xml.internal.ws.connect.timeout", connectionTimeoutInMilliSeconds);
edxWebserviceBindingProvider.getRequestContext().put("com.sun.xml.ws.request.timeout", connectionTimeoutInMilliSeconds);
edxWebserviceBindingProvider.getRequestContext().put("com.sun.xml.ws.connect.timeout", connectionTimeoutInMilliSeconds);
edxWebserviceBindingProvider.getRequestContext().put("javax.xml.ws.client.receiveTimeout", connectionTimeoutInMilliSeconds);
edxWebserviceBindingProvider.getRequestContext().put("javax.xml.ws.client.connectionTimeout", connectionTimeoutInMilliSeconds);
