Getting reactor.pool.PoolShutdownException during save in database - java

Service is using org.springframework.r2dbc.core.DatabaseClient with reactor-pool and r2dbc-mysql driver.
I'm doing inserts into the database every 5-10 seconds (50-100 insert statements), and randomly, after every 2-3 hours, I get reactor.pool.PoolShutdownException: Pool has been shut down. What might be the reason for this behavior?
Dependency versions:
r2dbc-pool: 0.8.8.RELEASE
r2dbc-mysql: 0.8.2
spring-r2dbc: 5.3.15
Stacktrace:
org.springframework.dao.DataAccessResourceFailureException: Failed to obtain R2DBC Connection; nested exception is reactor.pool.PoolShutdownException: Pool has been shut down
at org.springframework.r2dbc.connection.ConnectionFactoryUtils.lambda$getConnection$0(ConnectionFactoryUtils.java:88)
at reactor.core.publisher.Mono.lambda$onErrorMap$31(Mono.java:3733)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:94)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:106)
at reactor.core.publisher.FluxRetry$RetrySubscriber.onError(FluxRetry.java:95)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onError(MonoFlatMap.java:172)
at reactor.pool.AbstractPool$Borrower.fail(AbstractPool.java:477)
at reactor.pool.SimpleDequePool.doAcquire(SimpleDequePool.java:264)
at reactor.pool.AbstractPool$Borrower.request(AbstractPool.java:432)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:110)
at reactor.pool.SimpleDequePool$QueueBorrowerMono.subscribe(SimpleDequePool.java:676)
at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
at reactor.core.publisher.FluxRetry$RetrySubscriber.resubscribe(FluxRetry.java:117)
at reactor.core.publisher.MonoRetry.subscribeOrReturn(MonoRetry.java:50)
at reactor.core.publisher.Mono.subscribe(Mono.java:4385)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onError(MonoFlatMap.java:172)
at reactor.core.publisher.FluxMap$MapSubscriber.onError(FluxMap.java:132)
at reactor.core.publisher.Operators.error(Operators.java:198)
at reactor.core.publisher.MonoError.subscribe(MonoError.java:53)
at reactor.core.publisher.MonoDeferContextual.subscribe(MonoDeferContextual.java:55)
at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
at reactor.core.publisher.FluxUsingWhen.subscribe(FluxUsingWhen.java:104)
at reactor.core.publisher.Flux.subscribe(Flux.java:8469)
at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:255)
at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
at reactor.core.publisher.MonoZip$ZipCoordinator.signal(MonoZip.java:251)
at reactor.core.publisher.MonoZip$ZipInner.onNext(MonoZip.java:336)
at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1816)
at reactor.core.publisher.MonoCompletionStage.lambda$subscribe$0(MonoCompletionStage.java:83)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.uniWhenCompleteStage(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.whenComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.whenComplete(Unknown Source)
at reactor.core.publisher.MonoCompletionStage.subscribe(MonoCompletionStage.java:58)
at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
at reactor.core.publisher.MonoZip.subscribe(MonoZip.java:128)
at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
at reactor.core.publisher.MonoZip.subscribe(MonoZip.java:128)
at reactor.core.publisher.Mono.subscribe(Mono.java:4400)
at reactor.core.publisher.MonoSubscribeOn$SubscribeOnSubscriber.run(MonoSubscribeOn.java:126)
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)

Basically, this happens when too many connection acquisitions are pending. For example: your connection pool size is 100 but you are trying to run 500 parallel inserts, so 400 of them sit in pending status. In this situation, reactor-pool disposes the connection pool. To avoid the issue, I control the number of parallel executions.
Actually, I see two ways to handle this case:
(The right way, in my opinion) Control the flow of incoming messages by specifying the concurrency parameter on the operator (concurrency <= pool size):
flux
    //some intermediate operators
    .flatMap({ databaseOperation(it) }, poolSize)
In this case, you won't have more parallel executions than your connection pool can afford.
Using the delayUntil operator, which delays elements until there are unused connections (you can use the connection pool metrics to check for that). I wouldn't recommend this approach because you can end up with an out-of-memory error if you are not controlling back-pressure, or you will have to drop items if the buffer overflows.
A method that delays messages until there are available connections:
import io.r2dbc.pool.ConnectionPool
import reactor.core.publisher.Flux
import reactor.core.publisher.Mono
import reactor.core.publisher.SynchronousSink
import reactor.util.retry.Retry
import java.time.Duration
import java.util.function.Supplier

fun <T> Flux<T>.delayUntilHasAvailableConnections(pool: ConnectionPool): Flux<T> {
    val hasAvailableConnections = Supplier {
        val metrics = pool.metrics.get()
        metrics.pendingAcquireSize() <= metrics.maxAllocatedSize
    }
    val connectionsExistsMono = Mono.fromSupplier(hasAvailableConnections)
    val hasConnectionMono = connectionsExistsMono
        .handle { hasConnections, sink: SynchronousSink<Boolean> ->
            if (hasConnections) {
                sink.next(hasConnections)
            } else {
                sink.error(RuntimeException("No Connections"))
            }
        }
        // retry the check every 5 seconds until connections become available
        .retryWhen(Retry.fixedDelay(Long.MAX_VALUE, Duration.ofSeconds(5)))
    return delayUntil { hasConnectionMono }
}
Usage:
flux
    //some intermediate operators
    .delayUntilHasAvailableConnections(connectionPool)
    .flatMap { databaseOperation(it) }

As Vladlen pointed out, reactor-pool is able to reject new acquisition attempts when the queue of pending connection acquisitions grows too large. In a Spring application using spring-r2dbc, this functionality is disabled by default and all acquisition attempts get queued.
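If you create the pool yourself, r2dbc-pool at least lets you cap how long an acquisition may wait instead of queueing indefinitely. A minimal sketch, assuming connectionFactory is your r2dbc-mysql ConnectionFactory; the helper name and the sizes/timeouts below are illustrative only:
import io.r2dbc.pool.ConnectionPool;
import io.r2dbc.pool.ConnectionPoolConfiguration;
import io.r2dbc.spi.ConnectionFactory;
import java.time.Duration;

// Illustrative helper: wraps an existing ConnectionFactory in a bounded pool.
static ConnectionPool boundedPool(ConnectionFactory connectionFactory) {
    ConnectionPoolConfiguration configuration = ConnectionPoolConfiguration.builder(connectionFactory)
            .initialSize(10)
            .maxSize(50)                           // hard upper bound on open connections
            .maxAcquireTime(Duration.ofSeconds(5)) // fail the acquisition instead of waiting indefinitely
            .build();
    return new ConnectionPool(configuration);
}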
Nevertheless, I got the same exception in my Spring application. In my case it was a more unusual condition, but in case someone else stumbles upon this: the Spring Actuator health check also checks connectivity to the database. If you have your application deployed to Kubernetes, the request to the /actuator/health endpoint may not return in time when there is a large queue in front of the DB pool. This causes the readiness/liveness probes of Kubernetes to fail with "Connection Timed out", making Kubernetes think the application is unhealthy (which is, in a sense, true). In the end, Kubernetes kills the application and all existing connections in the DB pool get terminated with the above exception.
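A complementary mitigation, if you can live with the probes not reflecting DB connectivity, is to keep the database check out of the health endpoint the probes call. A hedged sketch, assuming Spring Boot registers the R2DBC check under its usual "r2dbc" health key:
# application.properties (assumption: the R2DBC health contributor key is "r2dbc")
management.health.r2dbc.enabled=false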
My solution was to manually limit the load, similar to what Vladlen pointed out:
import kotlinx.coroutines.launch
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

// allow at most dbPoolSize DB operations to run concurrently
val coroutineLimit = Semaphore(dbPoolSize)
workItems.forEach {
    launch {
        coroutineLimit.withPermit {
            // My DB operation
        }
    }
}

Related

How to solve "Socket read timed out" when using hikari connection pool

I am developing an application using the Play Framework (version 2.8.0) and Java (version 1.8) with an Oracle database (version 12c).
There are only zero or one hits to the database in a day, and I am getting the error below:
java.sql.SQLRecoverableException: IO Error: Socket read timed out
at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:919)
at oracle.jdbc.driver.PhysicalConnection.close(PhysicalConnection.java:2005)
at com.zaxxer.hikari.pool.PoolBase.quietlyCloseConnection(PoolBase.java:138)
at com.zaxxer.hikari.pool.HikariPool.lambda$closeConnection$1(HikariPool.java:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Socket read timed out
at oracle.net.nt.TimeoutSocketChannel.read(TimeoutSocketChannel.java:174)
at oracle.net.ns.NIOHeader.readHeaderBuffer(NIOHeader.java:82)
at oracle.net.ns.NIOPacket.readFromSocketChannel(NIOPacket.java:139)
at oracle.net.ns.NIOPacket.readFromSocketChannel(NIOPacket.java:101)
at oracle.net.ns.NIONSDataChannel.readDataFromSocketChannel(NIONSDataChannel.java:80)
at oracle.jdbc.driver.T4CMAREngineNIO.prepareForReading(T4CMAREngineNIO.java:98)
at oracle.jdbc.driver.T4CMAREngineNIO.unmarshalUB1(T4CMAREngineNIO.java:534)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:485)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:252)
at oracle.jdbc.driver.T4C7Ocommoncall.doOLOGOFF(T4C7Ocommoncall.java:62)
at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:908)
... 6 common frames omitted
db {
  default {
    driver=oracle.jdbc.OracleDriver
    url="jdbc:oracle:thin:@XXX.XXX.XXX.XX:XXXX/XXXXXXX"
    username="XXXXXXXXX"
    password="XXXXXXXXX"
    hikaricp {
      dataSource {
        cachePrepStmts = true
        prepStmtCacheSize = 250
        prepStmtCacheSqlLimit = 2048
      }
    }
  }
}
It seems to be caused by an inactive database connection. How can I solve this?
Please let me know if any other information is required.
You can enable TCP keepalive for JDBC, either by setting the driver's keepalive directive or by adding "(ENABLE=BROKEN)" to the connection descriptor.
Usually Cisco/Juniper equipment cuts off a TCP connection when it has been inactive for more than an hour, while the Linux kernel only starts sending keepalive probes after two hours (tcp_keepalive_time). So if you decide to turn TCP keepalive on, you will also need root access to lower this kernel tunable (to 10-15 minutes).
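For reference, a hedged example of the long connection-descriptor form with (ENABLE=BROKEN); the host, port and service name are placeholders:
url="jdbc:oracle:thin:@(DESCRIPTION=(ENABLE=BROKEN)(ADDRESS=(PROTOCOL=TCP)(HOST=db-host)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=my_service)))"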
Moreover, HikariCP should not keep any connection open for longer than 30 minutes by default.
So if your firewall, the Linux kernel and HikariCP all use default settings, this error should not occur in your system.
See HikariCP official documentation
maxLifetime:
This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. On a connection-by-connection basis, minor negative attenuation is applied to avoid mass-extinction in the pool. We strongly recommend setting this value, and it should be several seconds shorter than any database or infrastructure imposed connection time limit. A value of 0 indicates no maximum lifetime (infinite lifetime), subject of course to the idleTimeout setting. The minimum allowed value is 30000ms (30 seconds). Default: 1800000 (30 minutes)
I have added the below configuration for HikariCP in the configuration file and it is working fine.
## Database Connection Pool
play.db.pool = hikaricp
play.db.prototype.hikaricp.connectionTimeout=120000
play.db.prototype.hikaricp.idleTimeout=15000
play.db.prototype.hikaricp.leakDetectionThreshold=120000
play.db.prototype.hikaricp.validationTimeout=10000
play.db.prototype.hikaricp.maxLifetime=120000

Lettuce RedisCache throws java.util.concurrent.RejectedExecutionException Thread limit exceeded replacing blocked worker

I am using Spring Boot with Redis-Cache, with Lettuce default configuration, and receiving the following RejectedExecutionException after the server has been up for a few minutes:
org.springframework.data.redis.RedisSystemException: Unknown redis exception; nested exception is java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker
at org.springframework.data.redis.FallbackExceptionTranslationStrategy.getFallback(FallbackExceptionTranslationStrategy.java:53)
at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:43)
at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:257)
at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.convertLettuceAccessException(LettuceStringCommands.java:718)
at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.get(LettuceStringCommands.java:63)
at org.springframework.data.redis.connection.DefaultedRedisConnection.get(DefaultedRedisConnection.java:210)
at org.springframework.data.redis.cache.DefaultRedisCacheWriter.lambda$get$1(DefaultRedisCacheWriter.java:109)
at org.springframework.data.redis.cache.DefaultRedisCacheWriter.execute(DefaultRedisCacheWriter.java:242)
at org.springframework.data.redis.cache.DefaultRedisCacheWriter.get(DefaultRedisCacheWriter.java:109)
at org.springframework.data.redis.cache.RedisCache.lookup(RedisCache.java:82)
at org.springframework.cache.support.AbstractValueAdaptingCache.get(AbstractValueAdaptingCache.java:58)
at org.springframework.cache.interceptor.AbstractCacheInvoker.doGet(AbstractCacheInvoker.java:73)
at org.springframework.cache.interceptor.CacheAspectSupport.findInCaches(CacheAspectSupport.java:525)
at org.springframework.cache.interceptor.CacheAspectSupport.findCachedItem(CacheAspectSupport.java:490)
at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:372)
at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:316)
at org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:61)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688)
at my.controller.MyController.lambda$myFunc$0(MyController.java:60)
at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker
at java.util.concurrent.ForkJoinPool.tryCompensate(ForkJoinPool.java:2011)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3310)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at io.lettuce.core.protocol.AsyncCommand.await(AsyncCommand.java:81)
at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:112)
at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:62)
at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
at com.sun.proxy.$Proxy168.get(Unknown Source)
at org.springframework.data.redis.connection.lettuce.LettuceStringCommands.get(LettuceStringCommands.java:61)
... 21 common frames omitted
It seems that Lettuce uses the common ForkJoinPool, which DeferredResult uses as well, and all the requests and connections are choking the pool (please correct me if I'm wrong). What is the recommended approach? Should I move Lettuce to use a different pool? If so, how? Please let me know if there is any other configuration or information that I can provide.
Lettuce uses Netty's EventLoops as its threading infrastructure.
What happens here is that your task is executed on a ForkJoin pool. Lettuce uses CompletableFutures to return a handle for async result processing, and the synchronous API calls CompletableFuture.get(timeout, TimeUnit) to await command completion. Calling a blocking method on a ForkJoin pool involves the ManagedBlocker, which may switch to a different thread that can proceed with work.
If too many threads are awaiting command completion, you eventually end up with a RejectedExecutionException.
I suggest using a different execution pool for the lambda that you're invoking.
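A minimal sketch of that suggestion, assuming the cache lookup is currently fired with CompletableFuture on the common pool; cacheableLookup and the pool size are placeholders:
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Dedicated pool for blocking cache calls, so the common ForkJoinPool is not tied up
// waiting on Redis responses.
private final ExecutorService cacheExecutor = Executors.newFixedThreadPool(16);

public CompletableFuture<String> lookupAsync(String key) {
    // cacheableLookup(key) stands in for the @Cacheable / RedisCache-backed call
    return CompletableFuture.supplyAsync(() -> cacheableLookup(key), cacheExecutor);
}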

trying to find db connection leak in my code, using Spring / JPA / Hikari

I've got a problem with a Spring web application that periodically runs into an error fetching a connection from my connection pool. Eventually in the logs I see entries like:
Caused by: javax.persistence.PersistenceException: org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
The only way I've found to recover once it hits this point is to restart Tomcat.
I think the most likely explanation is that I have some code somewhere that is not properly cleaning up its connection: not returning it to Hikari, leaving something open so Spring can't clean it up, etc.
To troubleshoot I've set my hikari config leakDetectionThreshold to 5000ms and enabled logging. After that, I see log entries like
2018-04-24 19:53:56 WARN ProxyLeakTask:87 - Connection leak detection triggered for org.postgresql.jdbc.PgConnection@664ec666, stack trace follows
java.lang.Exception: Apparent connection leak detected
at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122)
at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:35)
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:99)
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:129)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.connection(StatementPreparerImpl.java:47)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$5.doPrepare(StatementPreparerImpl.java:146)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement(StatementPreparerImpl.java:172)
at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.prepareQueryStatement(StatementPreparerImpl.java:148)
at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1940)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1909)
at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:1887)
at org.hibernate.loader.Loader.doQuery(Loader.java:932)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:349)
at org.hibernate.loader.Loader.doList(Loader.java:2615)
at org.hibernate.loader.Loader.doList(Loader.java:2598)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2430)
at org.hibernate.loader.Loader.list(Loader.java:2425)
at org.hibernate.loader.custom.CustomLoader.list(CustomLoader.java:335)
at org.hibernate.internal.SessionImpl.listCustomQuery(SessionImpl.java:2129)
at org.hibernate.internal.AbstractSharedSessionContract.list(AbstractSharedSessionContract.java:981)
at org.hibernate.query.internal.NativeQueryImpl.doList(NativeQueryImpl.java:147)
at org.hibernate.query.internal.AbstractProducedQuery.list(AbstractProducedQuery.java:1398)
at org.hibernate.query.internal.AbstractProducedQuery.getSingleResult(AbstractProducedQuery.java:1444)
at sun.reflect.GeneratedMethodAccessor191.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.orm.jpa.SharedEntityManagerCreator$DeferredQueryInvocationHandler.invoke(SharedEntityManagerCreator.java:379)
at com.sun.proxy.$Proxy163.getSingleResult(Unknown Source)
at com.mycompany.web.jpa.util.DBHelper.getPagedMappedDbResults(DBHelper.java:76)
at com.mycompany.web.jpa.repository.TaskRepositoryImpl.findTaskDetailsByStepIdAndIdIn(TaskRepositoryImpl.java:245)
......
So it is detecting a possible leak. It could be a false positive, I suppose? But this is also the only class in my app that does database access outside of the standard service/repository pattern often used in Spring apps, so it seems like a likely culprit, and it's my best lead at the moment.
Anyway, the last piece of non-library code I see in the trace (i.e. stuff I wrote, so most likely to be the cause of the leak!) is my DBHelper::getPagedMappedDbResults method, the relevant bit of which is included here:
Query q = entityManager.createNativeQuery(countQueryText);
setQueryParameters(q, parameters);
long numActualResults = 0;
try {
    numActualResults = ((Number) q.getSingleResult()).longValue(); // line 76
} catch (Exception e) {
    System.out.println("just in case: " + e);
}
So basically I create a Query object from my EntityManager instance, set some parameters, and run it to get some results.
Is there something I need to be doing with a Query object when I'm done with it? q.cleanup()? I don't see anything like this from reading the docs, but am I not doing good housekeeping on this resource?
The entityManager itself is created from an @Autowired annotation. My understanding is that if I didn't "new" it to instantiate it and instead let the Spring framework autowire it, then Spring will do whatever cleanup is necessary. Is that right? Or do I need to be doing some cleanup after I use the entityManager?
Version details:
Tomcat 8 / Java 8
Spring 5.0.0.RELEASE
Spring Data Kay-RELEASE
Hibernate 5.2.3.Final
Hikari 2.4.5
Any advice or suggestions would be greatly appreciated, thanks!
What is the query? Is it heavy? Maybe you have a deadlock here? The connection management looks fine: you do not acquire the connection explicitly, so there is no need to release it. The query might be long-running, so Hibernate is not able to complete it and release the connection.
Also, you can check the number of open connections on the DB side and do some analysis there as well.
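For example, since the leak trace shows a PgConnection, a quick way to look at the open connections on the PostgreSQL side (the database name is a placeholder) is:
SELECT pid, state, query_start, state_change, query
FROM pg_stat_activity
WHERE datname = 'my_database';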

spring boot redis operation throw broken pipe error

We use Redis in a Spring Boot project. After running for a period of time, a Redis operation MAY throw a broken pipe error, but sometimes it succeeds. Restarting the service resolves the problem, but that's not a good solution.
I can't tell why it happens. It seems that some Redis connections in the pool become unusable but aren't closed and evicted from the pool.
My questions are:
What are the possible reasons for the broken pipe error?
If there isn't a Redis operation for a long period, will the idle connections in the pool become unusable?
Will a connection be closed and evicted from the pool when a broken pipe error happens?
pom.xml:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
application.yml:
spring:
  redis:
    database: 0
    host: ${REDIS_HOST:127.0.0.1}
    password: ${REDIS_PASSWORD:password}
    port: ${REDIS_PORT:6379}
    timeout: ${REDIS_TIMEOUT:1000}
    pool:
      max-active: ${REDIS_MAX_ACTIVE:100}
      max-wait: ${REDIS_MAX_WAIT:500}
      max-idle: ${REDIS_MAX_IDLE:20}
      min-idle: ${REDIS_MIN_IDLE:5}
error message:
org.springframework.data.redis.RedisConnectionFailureException: java.net.SocketException: Broken pipe (Write failed); nested exception is redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Broken pipe (Write failed)
at org.springframework.data.redis.connection.jedis.JedisExceptionConverter.convert(JedisExceptionConverter.java:67) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.connection.jedis.JedisExceptionConverter.convert(JedisExceptionConverter.java:41) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:37) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:37) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.connection.jedis.JedisConnection.convertJedisAccessException(JedisConnection.java:212) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.connection.jedis.JedisConnection.hSet(JedisConnection.java:2810) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.core.DefaultHashOperations$9.doInRedis(DefaultHashOperations.java:173) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:204) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:166) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:88) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
at org.springframework.data.redis.core.DefaultHashOperations.put(DefaultHashOperations.java:170) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
Caused by: redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Broken pipe (Write failed)
at redis.clients.jedis.Connection.flush(Connection.java:291) ~[jedis-2.8.2.jar!/:na]
at redis.clients.jedis.Connection.getIntegerReply(Connection.java:220) ~[jedis-2.8.2.jar!/:na]
at redis.clients.jedis.BinaryJedis.hset(BinaryJedis.java:749) ~[jedis-2.8.2.jar!/:na]
at org.springframework.data.redis.connection.jedis.JedisConnection.hSet(JedisConnection.java:2808) ~[spring-data-redis-1.7.6.RELEASE.jar!/:na]
... 115 common frames omitted
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method) ~[na:1.8.0_111]
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) ~[na:1.8.0_111]
at java.net.SocketOutputStream.write(SocketOutputStream.java:153) ~[na:1.8.0_111]
at redis.clients.util.RedisOutputStream.flushBuffer(RedisOutputStream.java:52) ~[jedis-2.8.2.jar!/:na]
at redis.clients.util.RedisOutputStream.flush(RedisOutputStream.java:216) ~[jedis-2.8.2.jar!/:na]
at redis.clients.jedis.Connection.flush(Connection.java:288) ~[jedis-2.8.2.jar!/:na]
... 118 common frames omitted
Answering my own question:
Why does the broken pipe error happen?
TransactionSynchronizationManager holds the RedisConnection in the thread and won't close it or return it to the pool; see RedisTemplate.java and RedisConnectionUtils.java. After restarting the Redis server, an operation on the RedisConnection held in the thread will throw a broken pipe error.
How to resolve it?
Add a try/catch around every Redis operation; if the error happens, unbind the connection from the thread, get a new connection from the pool, and execute the Redis operation again.
private static final ExceptionTranslationStrategy EXCEPTION_TRANSLATION =
        new FallbackExceptionTranslationStrategy(JedisConverters.exceptionConverter());

public Object req(RedisRequest req) {
    try {
        return req.request();
    } catch (Exception ex) {
        if (ex instanceof NullPointerException) {
            throw ex;
        }
        DataAccessException exception = EXCEPTION_TRANSLATION.translate(ex);
        if (exception instanceof RedisConnectionFailureException) {
            // unbind the stale connection from the thread; 'factory' is the RedisConnectionFactory in use
            RedisConnectionUtils.unbindConnection(factory);
            /** retry again */
            return req.request();
        } else {
            throw ex;
        }
    }
}
This can happen for any number of reasons; one of them is using a long-living connection (e.g. connecting to Redis on application start and then using that connection over and over).
Some of the things to do are:
Reconnect if the connection is broken (needs some try/catch magic to prevent errors from propagating to your application logic), or, better, use the pool validation options (see the sketch after this list):
TestOnBorrow - sends a PING request when you ask for the resource.
TestOnReturn - sends a PING when you return a resource to the pool.
TestWhileIdle - sends periodic PINGs from idle resources in the pool.
Connect at the moment you need the connection and disconnect afterwards.
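A minimal Jedis-side sketch of those validation options (the helper name and the numbers are illustrative, and exact setter names can vary slightly across Jedis / commons-pool versions):
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

// Illustrative helper: builds a pool that PINGs connections so broken sockets get evicted.
static JedisPool validatedPool(String host, int port) {
    JedisPoolConfig poolConfig = new JedisPoolConfig();
    poolConfig.setTestOnBorrow(true);   // PING before handing a connection out
    poolConfig.setTestOnReturn(true);   // PING when a connection is returned to the pool
    poolConfig.setTestWhileIdle(true);  // PING idle connections from the evictor thread
    poolConfig.setTimeBetweenEvictionRunsMillis(30_000); // illustrative values
    poolConfig.setNumTestsPerEvictionRun(3);
    return new JedisPool(poolConfig, host, port);
}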
Regarding
if there isn't redis operation for a long period, will the idle connection in the pool become unusable?
maxIdle means that at any given time the system allows at most 'maxIdle' connections to be idle; the rest will be constantly checked, closed and returned to the pool.
I don't know of a reason why an idle connection would become unusable, but in any case this can be avoided by using the approaches mentioned above.

Jedis Exception java.net.ConnectException: Address already in use

I have a Redis server and I made a separate RedisManager class for managing the Jedis connections. The code for RedisManager is as follows:
package RedisServerPackage;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class RedisManager {

    private static final RedisManager instance = new RedisManager();
    private static final JedisPoolConfig poolConfig = new JedisPoolConfig();
    private static JedisPool pool = null;

    private RedisManager() {}

    public final static RedisManager getInstance() {
        if (pool == null) {
            poolConfig.setMaxTotal(-1);
            pool = new JedisPool(poolConfig, "localhost");
        }
        return instance;
    }

    public void release() {
        pool.destroy();
    }

    public Jedis getJedis() {
        return pool.getResource();
    }

    public void returnJedis(Jedis jedis) {
        pool.returnResource(jedis);
    }
}
Now I run my code with about 1000 clients hitting the server and performing certain operations using the Pub/Sub model. I have monitored redis-server and found that at any given time a maximum of 45 clients were active and the maximum number of blocked clients was around 39. After running the client code for about 5 minutes or so, I get the exception
redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
at redis.clients.util.Pool.getResource(Pool.java:50)
at redis.clients.jedis.JedisPool.getResource(JedisPool.java:88)
at RedisServerPackage.RedisManager.getJedis(RedisManager.java:31)
at RedisServerPackage.RedisQueue.dequeue(RedisQueue.java:45)
at RedisServerPackage.QueueProcessor.run(QueueProcessor.java:22)
at java.lang.Thread.run(Thread.java:745)
Caused by: redis.clients.jedis.exceptions.JedisConnectionException: java.net.ConnectException: Address already in use
at redis.clients.jedis.Connection.connect(Connection.java:148)
at redis.clients.jedis.BinaryClient.connect(BinaryClient.java:75)
at redis.clients.jedis.BinaryJedis.connect(BinaryJedis.java:1572)
at redis.clients.jedis.JedisFactory.makeObject(JedisFactory.java:69)
at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
at redis.clients.util.Pool.getResource(Pool.java:48)
... 5 more
Caused by: java.net.ConnectException: Address already in use
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at redis.clients.jedis.Connection.connect(Connection.java:142)
... 12 more
I am not able to find out what is causing this exception. Also, I am reusing the Jedis instances. Example code:
public void JedisExample(String temporaryString) {
    Jedis jedis = manager.getJedis();
    try {
        // Some code here
    } catch (Exception e) {
        System.out.println(e);
    } finally {
        manager.returnJedis(jedis);
        // manager is an instance of the RedisManager class shown above.
    }
}
I had this exception happening intermittently on macOS when trying to load test my server app.
It turns out the problem was related to the fact that macOS only has 16K ephemeral ports available, and they are not released until the socket's TIME_WAIT has passed. The default timeout for TIME_WAIT is 15 seconds.
You can check yours via
sysctl net.inet.tcp.msl
To fix it temporarily to allow load testing, I used
sudo sysctl -w net.inet.tcp.msl=1000
this reduced TIME_WAIT to 1 second, allowing connections to be created and released faster, which in turn enabled me to get Tomcat to convert REST requests to Redis PUBSUB messages at a rate of about 4000 qps, with 0 errors after 4 hours of bombardment under 16 concurrent Siege threads. Before, about 1% of requests would error out with the exception above.
The author of the question did not state the OS, but I hope this answer might help someone else running into a similar situation, because this entry comes up on top when searching for this exception with Jedis. Basically, check your TIME_WAIT when load testing, regardless of OS.
UPDATE
Warning: do not do this in production! Ideally, increase the value back to 15 seconds after the load-testing round on your workstation. Decreasing TIME_WAIT can be dangerous, because sockets become available again faster after closing, and delayed packets might arrive at a newly opened connection, causing unpredictable errors or even compromising security. Read up on TCP/IP and TIME_WAIT before you decide to follow the instructions above, or consult your networking engineer.
