How to limit the number of threads across all concurrent methods at once - java

I am working on a legacy project which uses Java 8, Spring, HikariCP, and MySQL. The microservices' methods are triggered by a Kafka topic and start a reporting operation. Almost all of the triggered methods contain the following construct, and some of them repeat it inside their bodies:
new ForkJoinPool().submit(() -> { users.parallelStream().forEach(user ->
The application creates 8-9k threads, all of which try to get or create a record. The database cannot handle these requests and starts throwing exceptions, and Zabbix sends mails about heap memory usage above 90%:
Caused by: java.sql.SQLTransientConnectionException: HikariPool-2 -
Connection is not available, request timed out after 30000ms.
When I check the database, max_connections is set to 600, but this is not enough.
I want to limit the thread count at the application level.
I tried setting these parameters, but the number of threads does not decrease:
SPRING_TASK_EXECUTION_POOL_QUEUE-CAPACITY , SPRING_TASK_EXECUTION_POOL_MAX-SIZE, -Djava.util.concurrent.ForkJoinPool.common.parallelism
Is there any property to solve this problem?

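I changed every new ForkJoinPool() to ForkJoinPool.commonPool() and used -Djava.util.concurrent.ForkJoinPool.common.parallelism to control thread creation. Each new ForkJoinPool() creates its own separate pool, which that property does not affect, so the setting only took effect once everything ran on the common pool. That fixed my problem.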
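If moving everything onto the common pool is not desirable, a similar cap can be had from one shared pool with an explicit parallelism. The following is only a sketch, not the original project's code: the class name SharedReportPool, the parallelism of 8, and the integer user IDs are assumptions. It also relies on the commonly observed, but not formally specified, behaviour that a parallel stream started from inside a ForkJoinPool task runs on that pool's workers.

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicInteger;

public final class SharedReportPool {
    // One pool for the whole application instead of a new pool per call.
    // The parallelism of 8 is an assumed value; size it to what the database can serve.
    public static final ForkJoinPool POOL = new ForkJoinPool(8);

    // Processes every user ID on the shared pool and returns how many were handled.
    public static int processAll(List<Integer> userIds) {
        AtomicInteger processed = new AtomicInteger();
        POOL.submit(() ->
                userIds.parallelStream().forEach(id -> {
                    // Placeholder for the real "get or create a record" work.
                    processed.incrementAndGet();
                })
        ).join(); // wait for the whole batch to finish
        return processed.get();
    }
}
```

Because every caller shares POOL, the number of worker threads stays near the configured parallelism no matter how many Kafka-triggered methods run at once.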

Related

Unable to execute HTTP request: Acquire operation took longer than the configured maximum time

I am running a batch job in AWS which consumes messages from an SQS queue and writes them to a Kafka topic using Akka. I've created an SqsAsyncClient with the following parameters:
private static SqsAsyncClient getSqsAsyncClient(final Config configuration, final String awsRegion) {
var asyncHttpClientBuilder = NettyNioAsyncHttpClient.builder()
.maxConcurrency(100)
.maxPendingConnectionAcquires(10_000)
.connectionMaxIdleTime(Duration.ofSeconds(60))
.connectionTimeout(Duration.ofSeconds(30))
.connectionAcquisitionTimeout(Duration.ofSeconds(30))
.readTimeout(Duration.ofSeconds(30));
return SqsAsyncClient.builder()
.region(Region.of(awsRegion))
.httpClientBuilder(asyncHttpClientBuilder)
.endpointOverride(URI.create("https://sqs.us-east-1.amazonaws.com/000000000000")).build();
}
private static SqsSourceSettings getSqsSourceSettings(final Config configuration) {
final SqsSourceSettings sqsSourceSettings = SqsSourceSettings.create().withCloseOnEmptyReceive(false);
if (configuration.hasPath(ConfigPaths.SqsSource.MAX_BATCH_SIZE)) {
sqsSourceSettings.withMaxBatchSize(10);
}
if (configuration.hasPath(ConfigPaths.SqsSource.MAX_BUFFER_SIZE)) {
sqsSourceSettings.withMaxBufferSize(1000);
}
if (configuration.hasPath(ConfigPaths.SqsSource.WAIT_TIME_SECS)) {
sqsSourceSettings.withWaitTime(Duration.of(20, SECONDS));
}
return sqsSourceSettings;
}
But, whilst running my batch job I get the following AWS SDK exception:
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
The exception still seems to occur even after I try tweaking the parameters mentioned here:
Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate. Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout. If the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests
Has anyone run into this issue before?
I encountered the same issue. I ended up firing 100 async batch requests, waiting for those 100 to complete, then firing the next 100, and so on.
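The "fire 100, wait, fire the next 100" approach can be sketched with plain CompletableFuture. This is a minimal illustration, not the answerer's actual code: the class BatchedSender, the method sendInWaves, and the send function are hypothetical names, and the function stands in for whatever async SDK call returns a CompletableFuture.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public final class BatchedSender {
    // Sends requests in waves of at most waveSize, waiting for each wave
    // to complete before starting the next one.
    public static <T, R> List<R> sendInWaves(List<T> requests, int waveSize,
                                             Function<T, CompletableFuture<R>> send) {
        List<R> results = new ArrayList<>();
        for (int start = 0; start < requests.size(); start += waveSize) {
            int end = Math.min(start + waveSize, requests.size());
            List<CompletableFuture<R>> wave = new ArrayList<>();
            for (T request : requests.subList(start, end)) {
                wave.add(send.apply(request)); // fire without blocking
            }
            // Block until every request in this wave has finished.
            CompletableFuture.allOf(wave.toArray(new CompletableFuture[0])).join();
            for (CompletableFuture<R> future : wave) {
                results.add(future.join());
            }
        }
        return results;
    }
}
```

Keeping the wave size at or below the client's maxConcurrency means no request should have to queue for a connection for long, at the cost of idle time while the slowest request in each wave finishes.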

Send operation timed out when producing events to Azure eventhubs

I have a long-running application that uses the Azure Event Hubs SDK (5.1.0) to continually publish data to an Azure event hub. After a few days, the service threw the exception below. What could cause this, and how can we overcome it?
Stack Trace:
Exception in thread "SendTimeout-timer" reactor.core.Exceptions$BubblingException: com.azure.core.amqp.exception.AmqpException: Entity(abc): Send operation timed out, errorContext[NAMESPACE: abc-eventhub.servicebus.windows.net, PATH: abc-metrics, REFERENCE_ID: 70288acf171a4614ab6dcfe2884ee9ec_G2S2, LINK_CREDIT: 210]
at reactor.core.Exceptions.bubble(Exceptions.java:173)
at reactor.core.publisher.Operators.onErrorDropped(Operators.java:612)
at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.onError(FluxTimeout.java:203)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.secondError(MonoFlatMap.java:185)
at reactor.core.publisher.MonoFlatMap$FlatMapInner.onError(MonoFlatMap.java:251)
at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onError(FluxHide.java:132)
at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:185)
at com.azure.core.amqp.implementation.ReactorSender$SendTimeout.run(ReactorSender.java:565)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
Caused by: com.azure.core.amqp.exception.AmqpException: Entity(abc-metrics): Send operation timed out, errorContext[NAMESPACE: abc-eventhub.servicebus.windows.net, PATH: abc-metrics, REFERENCE_ID: 70288acf171a4614ab6dcfe2884ee9ec_G2S2, LINK_CREDIT: 210]
at com.azure.core.amqp.implementation.ReactorSender$SendTimeout.run(ReactorSender.java:562)
... 2 more
I'm using Azure eventhub Java SDK 5.1.0
According to official documentation:
A TimeoutException indicates that a user-initiated operation is taking longer than the operation timeout.
For Event Hubs, the timeout is specified either as part of the connection string, or through ServiceBusConnectionStringBuilder. The error message itself might vary, but it always contains the timeout value specified for the current operation.
Common causes
There are two common causes for this error: incorrect configuration, or a transient service error.
Incorrect configuration: The operation timeout might be too small for the operational condition. The default value for the operation timeout in the client SDK is 60 seconds. Check to see if your code has the value set to something too small. The condition of the network and CPU usage can affect the time it takes for a particular operation to complete, so the operation timeout should not be set to a small value.
Transient service error: Sometimes the Event Hubs service can experience delays in processing requests; for example, during periods of high traffic. In such cases, you can retry your operation after a delay, until the operation is successful. If the same operation still fails after multiple attempts, visit the Azure service status site to see if there are any known service outages.
If you see this error consistently, I would suggest contacting Azure support for a deeper look.
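The "retry after a delay" advice for transient errors can be sketched in plain Java. This is a generic helper, not part of the Event Hubs SDK: the class RetryWithDelay, the doubling back-off, and the parameter names are assumptions, and the Supplier stands in for the actual send call.

```java
import java.util.function.Supplier;

public final class RetryWithDelay {
    // Runs the operation up to maxAttempts times, sleeping between failed
    // attempts and doubling the delay each time. Rethrows the last failure.
    public static <T> T retry(Supplier<T> operation, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting to retry", ie);
                    }
                    delay *= 2; // simple exponential back-off
                }
            }
        }
        throw last;
    }
}
```

A bounded attempt count matters here: if the operation still fails after several tries, the failure is probably not transient and should surface rather than loop forever.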

How to recover client from "No handler waiting for message" warning?

At medium to high load (test and production), when using the Vert.x Redis client, I get the following warning after a few hundred requests.
2019-11-22 11:30:02.320 [vert.x-eventloop-thread-1] WARN io.vertx.redis.client.impl.RedisClient - No handler waiting for message: [null, 400992, <data from redis>]
As a result, the handler supplied to the Redis call (see below) does not get called and the incoming request times out.
Handler<AsyncResult<String>> handler = res -> {
// success handler
};
redis.get(key, res -> {
handler.handle(res);
});
The real issue is that once the "No handler ..." warning comes up, the Redis client becomes useless: all further calls to Redis made via the client fail with the same warning, so the handler is never called. I have an exception handler set on the client to attempt reconnection, but I do not see any reconnections being attempted.
How can one recover from this problem? Any workarounds to alleviate the severity would also be great.
I'm on vertx-core and vertx-redis-client 3.8.1.
The upcoming 4.0 release addresses this issue and should be out soon, though I can't say exactly when.
The problem is that we can't easily backport the fix from the master branch to the 3.8 branch, because the client has undergone a major refactoring and the codebases are very different.
The new code uses a connection pool and has been tested for concurrent access, which is where the issue you're seeing comes from. Under load, requests are routed across all event loops, and the queue that tracks in-flight requests (requests sent to Redis) against their waiting handlers could get out of sync under very specific conditions.
So I'd first check whether you can start moving your code to 4.0. You can try the 4.0.0-milestone3 version, but to be on the safe side, run against the latest master, which has more issues fixed in this area.

Connection timeouts with HikariCP

I have a Spring Boot (v2.0.8) application which makes use of a HikariCP (v2.7.9) Pool (connecting to MariaDB) configured with:
minimumIdle: 1
maximumPoolSize: 10
leakDetectionThreshold: 30000
The issue is that our production component, once every few weeks, repeatedly throws SQLTransientConnectionException "Connection is not available, request timed out after 30000ms...". It never recovers from this and keeps throwing the exception, so a restart of the component is required.
From looking at the HikariPool source code, it would seem that this is happening because every time it is calling connectionBag.borrow(timeout, MILLISECONDS) the poolEntry is null and hence throws the timeout Exception. For it to be null, the connection pool must have no free entries i.e. all PoolEntry in the sharedList are marked IN_USE.
I am not sure why the component would not recover from this since eventually I would expect a PoolEntry to be marked NOT_IN_USE and this would break the repeated Exceptions.
Possible scenarios I can think of:
1. All entries are IN_USE and the DB goes down temporarily. I would expect Exceptions to be thrown for the in-flight queries. Perhaps at this point the PoolEntry status is never reset and therefore is stuck at IN_USE. In this case I would have thought that if an Exception is thrown, the status is changed so that the connection can be cleared from the pool. Can anyone confirm if this is the case?
2. A flood of REST requests is made to the component, which in turn requires DB queries to be executed. This fills the connection pool, and therefore subsequent requests time out waiting for previous requests to complete. This makes sense; however, I would expect the component to recover once the requests complete, which it does not.
Does anyone have an idea of what might be the issue here? I have tried configuring the various timeouts that are in the Hikari documentation but have had no luck diagnosing / resolving this issue. Any help would be appreciated.
Thanks!
Scenario 2 is most likely what is happening. I ran into the same issue when using it with cloud dataflow and receiving a large amount of connection requests. The only solution I found was to play with the config to find a combination that worked for my use case.
I'll leave you my code that works for 50-100 requests per second and wish you luck.
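private static DataSource pool;
final HikariConfig config = new HikariConfig();
config.setMinimumIdle(5);            // keep at least 5 idle connections
config.setMaximumPoolSize(50);       // at most 50 connections in total
config.setConnectionTimeout(10000);  // wait up to 10 s for a connection
config.setIdleTimeout(600000);       // retire idle connections after 10 min
config.setMaxLifetime(1800000);      // recycle connections after 30 min
config.setJdbcUrl(JDBC_URL);
config.setUsername(JDBC_USER);
config.setPassword(JDBC_PASS);
pool = new HikariDataSource(config);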
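To see why scenario 2 produces timeouts, and why the pool recovers only when connections are actually handed back, it can help to model the pool as a counting semaphore. The following is a deliberately simplified model, not HikariCP's implementation: the class TinyPoolModel and its methods are hypothetical, and a permit stands in for a PoolEntry that is not IN_USE.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public final class TinyPoolModel {
    private final Semaphore permits;

    public TinyPoolModel(int maximumPoolSize) {
        // Each permit models one connection that is currently free.
        this.permits = new Semaphore(maximumPoolSize, true);
    }

    // Returns false when no connection frees up within the timeout,
    // which is the situation reported as "Connection is not available".
    public boolean borrow(long timeoutMillis) {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Models closing a connection so that it returns to the pool.
    public void giveBack() {
        permits.release();
    }
}
```

In this model, a borrower that never calls giveBack() starves every later caller permanently. That matches the symptom of a component that never recovers, which is the kind of situation leakDetectionThreshold is meant to help diagnose.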

MDB new threads are calling onMessage while previous thread not finished

In JBoss EAP 6 I've got a long-running MDB thread listening to a JMS queue. It receives a TextMessage containing the DB key of the work it should process (in a loop).
During its execution I noticed that new threads spawn new MDB instances, leading to inconsistencies. I want to prevent that, either programmatically or through configuration, without hurting performance; for instance, by checking in onMessage whether work is already ongoing. I can't change the DB model.
Since I'm running in a single VM I'm on the verge (last resort) of using a static Set that stores the DB key. (I'm a bit under time pressure to fix this).
The problem was caused by the fact that I forgot to specify the transaction timeout, so the default timeout kicked in.
The problem was solved by adding the transaction timeout:
@ActivationConfigProperty(propertyName = "transactionTimeout", propertyValue = "10800")
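For completeness, the last-resort "static Set of DB keys" guard mentioned in the question could look roughly like the sketch below. It assumes a single JVM, as the question states, and Java 8 or later for ConcurrentHashMap.newKeySet(). The class InProgressKeys and its method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public final class InProgressKeys {
    // Thread-safe set of DB keys whose work is currently being processed.
    private static final Set<String> ACTIVE = ConcurrentHashMap.newKeySet();

    // Returns true if the caller may start work on this key,
    // false if another thread is already processing it.
    public static boolean tryStart(String dbKey) {
        return ACTIVE.add(dbKey);
    }

    // Must be called in a finally block once processing ends.
    public static void finish(String dbKey) {
        ACTIVE.remove(dbKey);
    }
}
```

In onMessage, a call to tryStart(key) would gate the work and finish(key) would go in a finally block. This only masks redelivery, though, so setting the transaction timeout remains the actual fix.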
