I have a long-running application that uses the Azure Event Hubs SDK (5.1.0) and continually publishes data to an Azure event hub. The service threw the exception below after a few days. What could be the cause of this, and how can we overcome it?
Stack Trace:
Exception in thread "SendTimeout-timer" reactor.core.Exceptions$BubblingException: com.azure.core.amqp.exception.AmqpException: Entity(abc): Send operation timed out, errorContext[NAMESPACE: abc-eventhub.servicebus.windows.net, PATH: abc-metrics, REFERENCE_ID: 70288acf171a4614ab6dcfe2884ee9ec_G2S2, LINK_CREDIT: 210]
at reactor.core.Exceptions.bubble(Exceptions.java:173)
at reactor.core.publisher.Operators.onErrorDropped(Operators.java:612)
at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.onError(FluxTimeout.java:203)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondError(MonoFlatMap.java:185)
at reactor.core.publisher.MonoFlatMap$FlatMapInner.onError(MonoFlatMap.java:251)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onError(FluxHide.java:132)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:185)
at com.azure.core.amqp.implementation.ReactorSender$SendTimeout.run(ReactorSender.java:565)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
Caused by: com.azure.core.amqp.exception.AmqpException: Entity(abc-metrics): Send operation timed out, errorContext[NAMESPACE: abc-eventhub.servicebus.windows.net, PATH: abc-metrics, REFERENCE_ID: 70288acf171a4614ab6dcfe2884ee9ec_G2S2, LINK_CREDIT: 210]
at com.azure.core.amqp.implementation.ReactorSender$SendTimeout.run(ReactorSender.java:562)
... 2 more
I'm using the Azure Event Hubs Java SDK 5.1.0.
According to the official documentation:
A TimeoutException indicates that a user-initiated operation is taking
longer than the operation timeout.
For Event Hubs, the timeout is specified either as part of the
connection string, or through ServiceBusConnectionStringBuilder. The
error message itself might vary, but it always contains the timeout
value specified for the current operation.
Common causes
There are two common causes for this error: incorrect
configuration, or a transient service error.
Incorrect configuration
The operation timeout might be too small for the operational condition. The default value for the operation timeout in the client SDK is 60 seconds. Check to see if your code has the value set to something too small. The condition of the network and CPU usage can affect the time it takes for a particular operation to complete, so the operation timeout should not be set to a small value.
Transient service error
Sometimes the Event Hubs service can experience delays in processing requests; for example, during periods of high traffic. In such cases, you can retry your operation after a delay, until the operation is successful. If the same operation still fails after multiple attempts, visit the Azure service status site to see if there are any known service outages.
If you are consistently seeing this error, I would suggest reaching out to Azure support for a deeper look.
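If you want to rule out the configuration side first, the send (try) timeout can be raised through the retry options on the client builder. A minimal sketch, assuming azure-messaging-eventhubs 5.x; the connection string placeholder and durations are illustrative:

import java.time.Duration;
import com.azure.core.amqp.AmqpRetryMode;
import com.azure.core.amqp.AmqpRetryOptions;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerAsyncClient;

AmqpRetryOptions retryOptions = new AmqpRetryOptions()
        .setTryTimeout(Duration.ofSeconds(120))   // default try timeout is 60 seconds
        .setMaxRetries(5)
        .setMode(AmqpRetryMode.EXPONENTIAL);

EventHubProducerAsyncClient producer = new EventHubClientBuilder()
        .connectionString("<namespace-connection-string>", "<event-hub-name>")
        .retry(retryOptions)
        .buildAsyncProducerClient();

This only buys headroom for transient slowness; if sends regularly take longer than 60 seconds, the root cause is usually on the network or service side.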
Related
I receive a lot of events to process from RabbitMQ; they are forwarded to service1 for processing, and after some processing of the data there is an internal call to a microservice, service2. However, I frequently get java.net.SocketTimeoutException: timeout when I call service2, so as a first trial I increased the timeout limit from 2 s to 10 s. That did reduce the timeout exceptions, but a lot of them still remain.
The second change I made was to remove the deprecated Spring retry method and replace it with retryWhen, with backoff and a jitter factor, as shown below:
.retryWhen(Retry.backoff(ServiceUtils.NUM_RETRIES, Duration.ofSeconds(2)).jitter(0.50)
    .onRetryExhaustedThrow((retryBackoffSpec, retrySignal) -> {
        throw new ServiceException(
            ErrorBo.builder()
                .message("Service failed to process after max retries")
                .build());
    }))
.onErrorResume(error -> {
    // return and print the error only if all the retries have been exhausted
    log.error(error.getMessage() + ". Error occurred while generating pdf");
    return Mono.error(ServiceUtils
        .returnServiceException(ServiceErrorCodes.SERVICE_FAILURE,
            String.format("Service failed to process after max retries, failed to generate PDF")));
})
);
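For context, here is roughly how such client-side timeouts would be raised, assuming the call to service2 goes through OkHttp (which typically reports read timeouts as java.net.SocketTimeoutException: timeout); the client and values here are illustrative rather than my exact configuration:

import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;

OkHttpClient client = new OkHttpClient.Builder()
        .connectTimeout(10, TimeUnit.SECONDS)   // time allowed to establish the connection
        .readTimeout(10, TimeUnit.SECONDS)      // gap allowed between response bytes; usually the source of the "timeout" message
        .writeTimeout(10, TimeUnit.SECONDS)
        .callTimeout(30, TimeUnit.SECONDS)      // overall budget for the whole call
        .build();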
So my questions are:
Some service calls succeed and some fail. Does this mean there is still a bottleneck in processing the requests somewhere, perhaps on the server side, so that not all requests get processed?
Do I still need to increase the timeout limit further, if possible?
How do I make sure that java.net.SocketTimeoutException: timeout does not occur at all?
This issue started only recently, and there have been no changes to ports or any other connection-level settings. Still, what should I check to make sure the connection-level settings are correct? Could someone please guide me on this?
Thanks in advance.
I am working on a legacy project that uses Java 8, Spring, HikariCP, and MySQL. The microservice's methods are triggered by a Kafka topic and start a reporting operation. Almost all of the triggered methods contain the construct below, and some of them use the same pattern again inside their blocks:
new ForkJoinPool().submit(() -> { users.parallelStream().forEach(user ->
The application creates 8-9k threads, all of which try to get or create a record. The database cannot handle this many requests and starts throwing exceptions, and Zabbix sends mails about heap memory usage above 90%:
Caused by: java.sql.SQLTransientConnectionException: HikariPool-2 -
Connection is not available, request timed out after 30000ms.
When I check the database, the max_connections variable is 600, but this is not enough.
I want to set a limit on the thread count at the application level.
I tried setting these parameters, but the thread count does not decrease:
SPRING_TASK_EXECUTION_POOL_QUEUE-CAPACITY, SPRING_TASK_EXECUTION_POOL_MAX-SIZE, -Djava.util.concurrent.ForkJoinPool.common.parallelism
Is there any property to solve this problem?
I changed every new ForkJoinPool() to ForkJoinPool.commonPool() and used the -Djava.util.concurrent.ForkJoinPool.common.parallelism parameter to control thread creation; after that, my problem was fixed.
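A minimal sketch of the change, where processUser stands in for the existing per-user work and the parallelism value is illustrative:

import java.util.concurrent.ForkJoinPool;

// JVM flag, e.g. -Djava.util.concurrent.ForkJoinPool.common.parallelism=16
// Before: new ForkJoinPool().submit(() -> users.parallelStream().forEach(...))
ForkJoinPool.commonPool().submit(() ->
        users.parallelStream().forEach(user -> processUser(user))
).join();  // wait for completion; all submissions now share one bounded pool

Because every submission shares the common pool, the total number of worker threads is capped by the parallelism setting instead of growing with each new ForkJoinPool() instance.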
I am running a batch job in AWS which consumes messages from an SQS queue and writes them to a Kafka topic using Akka. I've created an SqsAsyncClient with the following parameters:
private static SqsAsyncClient getSqsAsyncClient(final Config configuration, final String awsRegion) {
    var asyncHttpClientBuilder = NettyNioAsyncHttpClient.builder()
            .maxConcurrency(100)
            .maxPendingConnectionAcquires(10_000)
            .connectionMaxIdleTime(Duration.ofSeconds(60))
            .connectionTimeout(Duration.ofSeconds(30))
            .connectionAcquisitionTimeout(Duration.ofSeconds(30))
            .readTimeout(Duration.ofSeconds(30));
    return SqsAsyncClient.builder()
            .region(Region.of(awsRegion))
            .httpClientBuilder(asyncHttpClientBuilder)
            .endpointOverride(URI.create("https://sqs.us-east-1.amazonaws.com/000000000000"))
            .build();
}
private static SqsSourceSettings getSqsSourceSettings(final Config configuration) {
    final SqsSourceSettings sqsSourceSettings = SqsSourceSettings.create().withCloseOnEmptyReceive(false);
    if (configuration.hasPath(ConfigPaths.SqsSource.MAX_BATCH_SIZE)) {
        sqsSourceSettings.withMaxBatchSize(10);
    }
    if (configuration.hasPath(ConfigPaths.SqsSource.MAX_BUFFER_SIZE)) {
        sqsSourceSettings.withMaxBufferSize(1000);
    }
    if (configuration.hasPath(ConfigPaths.SqsSource.WAIT_TIME_SECS)) {
        sqsSourceSettings.withWaitTime(Duration.of(20, SECONDS));
    }
    return sqsSourceSettings;
}
But, whilst running my batch job I get the following AWS SDK exception:
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
The exception still seems to occur even after I try tweaking the parameters mentioned here:
Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate. Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout. If the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests
Has anyone run into this issue before?
I encountered the same issue, and I ended up firing 100 async batch requests, then waiting for those 100 to clear before firing the next 100, and so on, roughly as sketched below.
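This assumes the AWS SDK v2 async client, where each call returns a CompletableFuture; sendMessageBatch and pendingRequests are just placeholders for whatever calls you are actually making:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import software.amazon.awssdk.services.sqs.SqsAsyncClient;
import software.amazon.awssdk.services.sqs.model.SendMessageBatchRequest;

List<CompletableFuture<?>> window = new ArrayList<>();
for (SendMessageBatchRequest request : pendingRequests) {
    window.add(sqsAsyncClient.sendMessageBatch(request));
    if (window.size() == 100) {  // match the client's maxConcurrency so the pool is never oversubscribed
        CompletableFuture.allOf(window.toArray(new CompletableFuture[0])).join();
        window.clear();
    }
}
CompletableFuture.allOf(window.toArray(new CompletableFuture[0])).join();  // drain the last partial window

Keeping the number of in-flight requests at or below maxConcurrency means no request waits longer than connectionAcquisitionTimeout for a pooled connection.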
At medium to high load (test and production), when using the Vert.x Redis client, I get the following warning after a few hundred requests.
2019-11-22 11:30:02.320 [vert.x-eventloop-thread-1] WARN io.vertx.redis.client.impl.RedisClient - No handler waiting for message: [null, 400992, <data from redis>]
As a result, the handler supplied to the Redis call (see below) does not get called and the incoming request times out.
Handler<AsyncResult<String>> handler = res -> {
    // success handler
};
redis.get(key, res -> {
    handler.handle(res);
});
The real issue is that once the "No handler ..." warning comes up, the Redis client becomes useless, because all further calls to Redis made via the client fail with the same warning and the handlers never get called. I have an exception handler set on the client to attempt reconnection, but I do not see any reconnection being attempted.
How can one recover from this problem? Any workarounds to alleviate the severity would also be great.
I'm on vertx-core and vertx-redis-client 3.8.1.
The upcoming 4.0 release has addressed this issue, and a release should be happening soon; how soon, I can't really tell.
The problem is that we can't easily backport the fix from the master branch to the 3.8 branch, because a major refactoring has happened on the client and the codebases are very different.
The new code uses a connection pool and has been tested for concurrent access (which is where the issue you're seeing comes from). Under load, requests are routed across all event loops, and the queue that maintains the state between in-flight requests (requests sent to Redis) and waiting handlers could get out of sync under very specific conditions.
So I'd first try to see whether you can already start moving your code to 4.0. You can have a try with the 4.0.0-milestone3 version, but to be completely safe, do a run with the latest master, which has more fixes in this area.
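For reference, a minimal sketch of what 4.0-style usage looks like with the pooled client; it assumes an existing Vertx instance (vertx), the connection string, pool sizes, and key are illustrative, and the exact API may still shift before the final release:

import io.vertx.redis.client.Redis;
import io.vertx.redis.client.RedisAPI;
import io.vertx.redis.client.RedisOptions;

Redis client = Redis.createClient(vertx, new RedisOptions()
        .setConnectionString("redis://localhost:6379")
        .setMaxPoolSize(8)              // pooled connections shared across event loops
        .setMaxWaitingHandlers(2048));  // bounded queue of commands waiting for a connection

RedisAPI redis = RedisAPI.api(client);
redis.get("mykey", res -> {
    if (res.succeeded()) {
        // use res.result()
    } else {
        // failures surface here instead of as "No handler waiting for message"
    }
});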
Sometimes I get the following exception while checking whether the client is subscribed to a topic.
Checking condition: Client.isSubscribed(topic)
Exception: com.pushtechnology.diffusion.multiplexer.MultiplexerBlockedException
This exception manifests in logs/Server.log as PUSH-000503, with the description:
A blocking operation failed because the multiplexers failed to process it within {} milliseconds
The default value for that timeout is 30s, which is a lifetime for a multiplexer to wait. The manual says the following: "This indicates that the server is severely overloaded or deadlocked".
On another note, version v5.5 is unsupported, and you are advised to upgrade before attempting a reproduction.