Alternatives to sleep when checking if Netty server is up? - java

I am starting Netty with a rest interface. I get this exception: RESTEASY004655: Unable to invoke request
at org.jboss.resteasy.client.jaxrs.engines.ApacheHttpClient4Engine.invoke(
at org.jboss.resteasy.client.jaxrs.internal.ClientInvocation.invoke(
at org.jboss.resteasy.client.jaxrs.internal.proxy.ClientInvoker.invoke(
at org.jboss.resteasy.client.jaxrs.internal.proxy.ClientProxy.invoke(
at com.sun.proxy.$ Source)
at com.openet.atf.agent.manage.Master.startSlaves(
at com.openet.acceptance.runner.AcceptanceRunner.main(
Caused by: org.apache.http.conn.HttpHostConnectException: Connection to http://ovm1:8889 refused
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(
at org.apache.http.impl.client.DefaultRequestDirector.execute(
at org.apache.http.impl.client.AbstractHttpClient.execute(
at org.apache.http.impl.client.AbstractHttpClient.execute(
at org.jboss.resteasy.client.jaxrs.engines.ApacheHttpClient4Engine.invoke(
... 7 more
Caused by: Connection refused
I worked around this by calling Thread.sleep(5000); before the first request. I am looking for a better alternative, because a fixed sleep assumes the server always needs exactly 5 seconds to start.

A common approach when success may come at any time in the future is the back-off pattern, typically implemented by doubling the wait time on every iteration.
Something like this:
long wait = 50; // ms
boolean connected = false;
while (!connected) {
    connected = <code to check connection>
    if (!connected) {
        Thread.sleep(wait); // pause before the next attempt
        wait *= 2;          // back off: double the pause each time
    }
}

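You can keep sleeping in a loop until the connection is established.
boolean up = false;
while (!up) {
    try {
        // Try to connect
        up = true;
    } catch (Exception e) {
        // Not up yet: pause briefly (for example 100 ms) and retry
        Thread.sleep(100);
    }
}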
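Both answers leave the connection check open. A minimal sketch of one way to fill it in, assuming a plain TCP connect is an acceptable readiness signal, combines the doubling wait with an overall deadline so the caller cannot loop forever. The class name PortWaiter, the 500 ms connect timeout and the 1 second cap on the pause are illustrative assumptions, not part of the original answers.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortWaiter {
    /**
     * Polls host:port until a TCP connection succeeds or timeoutMillis elapses.
     * The pause between attempts starts at 50 ms and doubles, capped at 1 s.
     * Returns true if the port accepted a connection, false on timeout.
     */
    static boolean waitForPort(String host, int port, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long wait = 50; // ms
        while (true) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 500);
                return true; // something is listening
            } catch (IOException e) {
                // Connection refused or timed out: the server is not up yet.
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false;
            }
            Thread.sleep(Math.min(wait, remainingMillis));
            wait = Math.min(wait * 2, 1_000);
        }
    }
}
```

With the host and port from the stack trace, the call would look like PortWaiter.waitForPort("ovm1", 8889, 30_000). An open port only shows that the socket is bound; if the REST layer needs longer to initialise, the same loop could issue a real request instead of a bare connect.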


JedisCluster configurations and how it maintains the pool of connections

I have recently started using JedisCluster for my application. There is little to no documentation or examples for it. I tested a use case and the results are not what I expected.
public class test {
private static JedisCluster setConnection(HashSet<HostAndPort> IP) {
JedisCluster jediscluster = new JedisCluster(IP, 30000, 3,
new GenericObjectPoolConfig() {{
return jediscluster;
public static int getIdleconn(Map<String, JedisPool> nodes){
int i = 0;
for (String k : nodes.keySet()) {
return i;
public static void main(String[] args) {
HashSet IP = new HashSet<HostAndPort>() {
add(new HostAndPort("host1", port1));
add(new HostAndPort("host2", port2));
JedisCluster cluster = setConnection(IP);
cluster.set("Dummy", "0");
cluster.set("Dummy1", "0");
cluster.set("Dummy3", "0");
try {
} catch (InterruptedException e) {
The output for this snippet is:
1. I have set the timeout to 30000 in JedisCluster(IP, 30000, 3, new GenericObjectPoolConfig() ...). I believe this is the connection timeout, meaning idle connections are closed after 30 seconds. That doesn't seem to be happening, though: after sleeping for 60 seconds, the number of idle connections is still 3. What am I doing or understanding wrong here? I want the pool to close a connection if it has not been used for more than 30 seconds.
2. setMinIdle(1): does this mean that, regardless of the connection timeout, the pool will always maintain one connection?
3. I prefer availability over throughput for my app. What should the value of setMaxWaitMillis be if the connection timeout is 30 seconds?
4. Though rare, the app fails with redis.clients.jedis.exceptions.JedisNoReachableClusterNodeException: No reachable node in cluster. I think this is connected to 1. How can I prevent this?
1. The 30000 (30 seconds) here is the socket timeout: the timeout for a single socket (read) operation. It is not related to closing idle connections. Closing idle connections is controlled by GenericObjectPoolConfig, so check the parameters there (such as setMinEvictableIdleTimeMillis and setTimeBetweenEvictionRunsMillis).
2. Yes (mostly).
3. setMaxWaitMillis is the timeout for getting a connection object from the connection pool. It is not related to the 30 seconds and will not really solve anything for you in terms of availability.
4. Keep your cluster nodes available. There have been changes in Jedis related to this, so you can try a recent version (4.x, or even better 4.2.x).

AsyncIO operation not writing to output directory in Apache Flink

I'm very new to Flink, and I'm trying out the Async I/O operator by following the doc from here. I have a text file containing a bunch of integers. I create a stream on the file, make an async HTTP request for each line, and finally store the results in an output file. I created a FastAPI REST endpoint that handles a simple GET request. In the Flink code, I'm using the Java async-http-client library to wrap the HTTP call in an async request. The problem is that when I run the Flink code, it always times out.
My input file looks something like:
The FastAPI code goes something like this:
import time
from random import random
from fastapi import FastAPI
app = FastAPI()

# Route path inferred from the URL requested by the Flink job below.
@app.get("/temperatures/{temperature}")
async def read_temperature(temperature: int):
    if temperature <= 0:
        return {"category": "insanely cold"}
    elif temperature <= 15:
        return {"category": "cold"}
    elif temperature <= 25:
        return {"category": "moderate"}
    elif temperature <= 35:
        return {"category": "moderately hot"}
    elif temperature <= 45:
        return {"category": "hot"}
    return {"category": "insanely hot"}
And finally, this is my Flink code:
import ...
public class AsyncHttpRequest extends RichAsyncFunction<String, Tuple2<String, String>> {
private transient AsyncHttpClient client;
public void open(Configuration parameters) {
client = asyncHttpClient();
public void close() throws Exception {
public void asyncInvoke(String key, final ResultFuture<Tuple2<String, String>> resultFuture) throws Exception {
// issue the asynchronous request, receive a future for result
String getURL = String.format("http://localhost:8000/temperatures/%s", key);
final Future<Response> result = client.executeRequest(get(getURL).build());
// set the callback to be executed once the request by the client is complete
// the callback simply forwards the result to the result future
CompletableFuture.supplyAsync(() -> {
try {
JSONObject responseJson = new JSONObject(result.get().getResponseBody());
return responseJson.getString("category");
} catch (InterruptedException | ExecutionException e) {
return null;
}).thenAccept((String httpResult) -> {
resultFuture.complete(Collections.singleton(new Tuple2<>(key, httpResult)));
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> stream =
DataStream<Tuple2<String, String>> resultStream =
stream, new AsyncHttpRequest(), 60, TimeUnit.SECONDS, 10);
final StreamingFileSink<Tuple2<String, String>> sink =
new Path("file:///Users/me/http_output"),
new SimpleStringEncoder<Tuple2<String, String>>("UTF-8"))
.withRolloverInterval(TimeUnit.MINUTES.toMillis(1))
.withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
.withMaxPartSize(1024 * 1024)
env.execute("Async Http job");
I get the following stack trace when running the Flink job:
Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(
at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$2(
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(
at java.base/java.util.concurrent.CompletableFuture.postComplete(
at java.base/java.util.concurrent.CompletableFuture.complete(
at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(
at java.base/java.util.concurrent.CompletableFuture.postComplete(
at java.base/java.util.concurrent.CompletableFuture.complete(
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(
at akka.dispatch.OnComplete.internal(Future.scala:264)
at akka.dispatch.OnComplete.internal(Future.scala:261)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573)
at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
at scala.concurrent.Future.$anonfun$andThen$1(Future.scala:532)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
at akka.dispatch.BatchingExecutor$
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(
at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(
at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(
at jdk.internal.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(
at java.base/java.lang.reflect.Method.invoke(
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
... 4 more
Caused by: java.lang.Exception: Could not complete the stream element: Record # (undef) : 9.
at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator$ResultHandler.completeExceptionally(
at org.apache.flink.streaming.api.functions.async.AsyncFunction.timeout(
at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.lambda$processElement$0(
at org.apache.flink.streaming.runtime.tasks.StreamTask.invokeProcessingTimeCallback(
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$null$17(
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield(
at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.waitInFlightInputsFinished(
at org.apache.flink.streaming.api.operators.async.AsyncWaitOperator.endInput(
at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.endOperatorInput(
at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.lambda$close$0(
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(
at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(
at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(
at org.apache.flink.streaming.runtime.tasks.OperatorChain.closeOperators(
at org.apache.flink.streaming.runtime.tasks.StreamTask.afterInvoke(
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(
at org.apache.flink.runtime.taskmanager.Task.doRun(
at java.base/
Caused by: java.util.concurrent.TimeoutException: Async function call has timed out.
... 20 more
Printing the InterruptedException/ExecutionException shows the following error:
java.util.concurrent.ExecutionException: executor not accepting a task
at java.base/java.util.concurrent.CompletableFuture.reportGet(
at java.base/java.util.concurrent.CompletableFuture.get(
at org.asynchttpclient.netty.NettyResponseFuture.get(
at java.base/java.util.concurrent.CompletableFuture$
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(
at java.base/java.util.concurrent.ForkJoinTask.doExec(
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(
at java.base/java.util.concurrent.ForkJoinPool.scan(
at java.base/java.util.concurrent.ForkJoinPool.runWorker(
at java.base/
Caused by: executor not accepting a task
at org.asynchttpclient.netty.SimpleChannelFutureListener.operationComplete(
at org.asynchttpclient.netty.SimpleChannelFutureListener.operationComplete(
at io.netty.util.concurrent.DefaultPromise.notifyListener0(
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(
at io.netty.util.concurrent.DefaultPromise.notifyListeners(
at io.netty.util.concurrent.DefaultPromise.setValue0(
at io.netty.util.concurrent.DefaultPromise.setFailure0(
at io.netty.util.concurrent.DefaultPromise.setFailure(
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(
at io.netty.bootstrap.Bootstrap.access$000(
at io.netty.bootstrap.Bootstrap$1.operationComplete(
at io.netty.bootstrap.Bootstrap$1.operationComplete(
at io.netty.util.concurrent.DefaultPromise.notifyListener0(
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(
at io.netty.util.concurrent.DefaultPromise.notifyListeners(
at io.netty.util.concurrent.DefaultPromise.setValue0(
at io.netty.util.concurrent.DefaultPromise.setSuccess0(
at io.netty.util.concurrent.DefaultPromise.trySuccess(
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(
at io.netty.util.concurrent.SingleThreadEventExecutor$
at io.netty.util.internal.ThreadExecutorMap$
at java.base/
Caused by: java.lang.IllegalStateException: executor not accepting a task
at io.netty.resolver.AddressResolverGroup.getResolver(
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(
... 21 more
I'm not really sure why the async function is timing out, because I can see in the console that the FastAPI endpoint is being queried. The endpoint itself is also working, as all my Postman requests go through without problems. Any help in resolving the core issue is greatly appreciated.
I'm on macOS Big Sur and using the following third-party libs:
implementation 'org.json:json:20201115'
implementation 'org.apache.httpcomponents:httpclient:4.5.13'
implementation 'org.asynchttpclient:async-http-client:2.12.2'
implementation 'org.apache.flink:flink-core:1.12.2'
implementation 'org.apache.flink:flink-streaming-java_2.12:1.12.2'
implementation 'org.apache.flink:flink-clients_2.12:1.12.2'
Update 1: If I reduce the capacity to 1 (from 1000), I don't get any error, but the output is still empty.
Update 2: Following the suggestion made by @DavidAnderson, I've enabled checkpointing. Now I'm not seeing the timeout error and my job is not terminated abruptly, which is good news. But the output folder is still empty. I've updated my Flink code to reflect the checkpointing changes.
A common issue with Async I/O is that every concurrent-request limit in the execution stack has to be sized appropriately. Some of these limits are implicit: for example, if you don't supply your own thread pool to CompletableFuture.supplyAsync(), it uses the shared commonPool, whose parallelism defaults to one less than the number of available processors.
IIRC the rule I use is AsyncDataStream.unorderedWait(capacity) <= HTTP client capacity <= executor capacity. The HTTP client capacity is often limited by both its connection pool size and the number of connections per host.
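The implicit commonPool limit can be avoided by passing an executor as the second argument to supplyAsync. The following is a minimal sketch under stated assumptions: the class name DedicatedExecutorSketch is hypothetical, blockingCall is a stand-in for the blocking result.get() in the question, and the pool is sized to the number of in-flight requests (10, matching the capacity passed to the async operator in the question). It does not use Flink or the HTTP client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

final class DedicatedExecutorSketch {
    /** Runs `requests` simulated blocking calls on a pool sized to match them. */
    static List<String> runAll(int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(requests);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                final int key = i;
                // The second argument keeps this work off ForkJoinPool.commonPool().
                futures.add(CompletableFuture.supplyAsync(() -> blockingCall(key), pool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in for a blocking wait such as Future.get() on an HTTP response. */
    private static String blockingCall(int key) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "result-" + key;
    }

    public static void main(String[] args) {
        System.out.println("commonPool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        System.out.println(runAll(10));
    }
}
```

In a RichAsyncFunction the pool would presumably be created in open() and shut down in close(), next to the HTTP client, so that its lifetime matches the operator's.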

ScheduledThreadPoolExecutor stops executing with caught exceptions

I have the following class
public class MaintanceTools {
public static final ScheduledThreadPoolExecutor THREADSUPERVISER;
private static final int ALLOWEDIDLESECONDS = 1*20;
static {
THREADSUPERVISER = new ScheduledThreadPoolExecutor(10);
public static void launchThreadsSupervising() {
THREADSUPERVISER.scheduleWithFixedDelay(() -> {
System.out.println("maintance launched " +;
ACTIVECONNECTIONS.forEach((connection) -> {
try {
if ( !connection.isDataConnected() &&
} catch (Throwable e) { }
System.out.println("maintance finished " +;
}, 0, 20, TimeUnit.SECONDS);
It iterates over all FTP connections (I am writing an FTP server), checks whether each connection is idle and has not been transmitting data for some time, and closes the connection if so.
The problem is that the task never runs again after some exceptions are thrown in the thread being interrupted. I know that the docs say:
If any execution of the task encounters an exception, subsequent executions are suppressed. Otherwise, the task will only terminate via cancellation or termination of the executor.
I do get an exception, but it is caught and does not propagate out of the function that throws it.
This function throws AsynchronousCloseException because it hangs on; and when the connection is closed, the exception is thrown and caught.
The question is how to make THREADSUPERVISER keep working regardless of any exceptions that are thrown and handled.
Debug output:
maintance launched 2017-08-30T14:03:05.504Z // launched and finished as expected
maintance finished 2017-08-30T14:03:05.566Z
output: FTPConnection id: 176 220 Service ready.
output: FTPConnection id: 190 226 File stored 135 bytes.
closing data socket: FTP connection 190, /0:0:0:0:0:0:0:1:1409
maintance launched 2017-08-30T14:03:25.581Z // launched and finished as expected
maintance finished 2017-08-30T14:03:25.581Z
async exception error reading. // got exception
maintance launched 2017-08-30T14:03:45.596Z // launched, but not finished and never run again
output: FTPConnection id: 176 221 Timeout exceeded, closing control and data connection.
closing data socket: FTP connection 176, /0:0:0:0:0:0:0:1:1407
As it turns out, the problem was in
I had a ConcurrentModificationException. Solution in worked perfectly
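The try/catch in the question sits inside the forEach lambda, so an exception thrown by the iteration itself (such as ConcurrentModificationException) escapes the task and silences the scheduler. A minimal sketch of a defensive setup, assuming the connection list can be swapped for a concurrent collection: guard the whole task body, and iterate over a CopyOnWriteArrayList. The class name GuardedPeriodicTask and the string "connections" are hypothetical stand-ins, not the poster's actual fix.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class GuardedPeriodicTask {
    /** Wraps a task so that nothing it throws can reach the scheduler. */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                // Log and swallow: rethrowing would suppress all future runs.
                System.err.println("periodic task failed: " + t);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // CopyOnWriteArrayList iterates over a snapshot, so removing elements
        // during forEach does not throw ConcurrentModificationException.
        List<String> connections =
                new CopyOnWriteArrayList<>(Arrays.asList("a", "b", "c"));
        AtomicInteger runs = new AtomicInteger();
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(guarded(() -> {
            runs.incrementAndGet();
            connections.forEach(c -> {
                if (c.equals("b")) {
                    connections.remove(c);
                }
            });
            throw new IllegalStateException("simulated failure");
        }), 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(300);
        scheduler.shutdownNow();
        System.out.println("runs: " + runs.get() + ", remaining: " + connections);
    }
}
```

Swallowing Throwable also hides serious errors, so in real code the catch block should at least log the stack trace; an empty catch, as in the question, makes this kind of failure invisible.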

ORA-12518, TNS:listener could not hand off client connection comes from a loop with heavy memory access

I have a loop that does heavy, memory-intensive reads from Oracle.
int firstResult = 0;
int maxResult = 500;
int targetTotal = 8000; // more or less
int phase = 1;
for (int i = 0; i<= targetTotal; i += maxResult) {
try {
Session session = .... init hibernate session ...
// Start Transaction
List<Accounts> importableInvAcList = ...getting list using session and firstResult-maxResult...
List<ContractData> dataList = new ArrayList<>();
List<ErrorData> errorDataList = new ArrayList<>();
for (Accounts account : importableInvAcList) {
... Converting 500 Accounts object to ContractData object ...
... along with 5 more database call using existing session ...
.. On converting The object we generate thousands of ErrorData...
dataList.add(.. converted account to Contract data ..);
errorDataList.add(.. generated error data ..);
}; // 500 data; // 10,000-5,000 data
... Commit Transaction ...
} catch (Exception e) {
The exception is thrown in the second phase (the second loop iteration). Sometimes it is thrown in the third or fifth phase instead.
I also checked the Runtime Memory.
Runtime runtime = Runtime.getRuntime();
long total = runtime.totalMemory();
long free = runtime.freeMemory();
long used = total - free;
long max = runtime.maxMemory();
In the second phase, for example, the status was:
Used: 1022 MB, Free: 313 MB, Total Allocated: 1335 MB
Stack Trace is here...
org.hibernate.exception.GenericJDBCException: Cannot open connection
at org.hibernate.exception.SQLStateConverter.handledNonSpecificException(
at org.hibernate.exception.SQLStateConverter.convert(
at org.hibernate.exception.JDBCExceptionHelper.convert(
at org.hibernate.exception.JDBCExceptionHelper.convert(
at org.hibernate.jdbc.ConnectionManager.openConnection(
at org.hibernate.jdbc.ConnectionManager.getConnection(
at org.hibernate.jdbc.JDBCContext.connection(
at org.hibernate.transaction.JDBCTransaction.begin(
at org.hibernate.impl.SessionImpl.beginTransaction(
at ibbl.remote.tx.TxSessionImpl.beginTx(
at ibbl.remote.tx.TxController.initPersistence(
Caused by: java.sql.SQLException: Listener refused the connection with the following error:
ORA-12518, TNS:listener could not hand off client connection
Note that this process runs in a Thread, and there are 3 similar Threads running at a time.
Why does this exception appear after the loop has been running for a while?
there are 3 similar Thread running at a time.
If your code creates a total of 3 Threads, then, optimally, you need only 3 Oracle Connections. Create all of them before any Thread is created. Create the Threads, assign each Thread a Connection, then start the Threads.
Chances are good, though, that your code might be way too aggressively consuming resources on whatever machine is hosting it. Even if you eliminate the ORA-12518, the RDBMS server may "go south". By "go south", I mean if your application is consuming too many resources the machine hosting it or the machine hosting the RDBMS server may "panic" or something equally dreadful.
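The "one connection per thread, opened up front" idea can be sketched without any database at all. In the sketch below the type parameter C stands in for whatever connection or session type the application uses; the class name OneConnectionPerThread and the open and processBatch callbacks are hypothetical, and how Hibernate would be made to reuse one connection across batches is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

final class OneConnectionPerThread {
    /**
     * Opens exactly `workers` connections before any thread starts, hands one
     * to each thread, and closes them once every thread has finished.
     */
    static <C extends AutoCloseable> void run(int workers, int batchesPerWorker,
            Supplier<C> open, BiConsumer<C, Integer> processBatch) throws Exception {
        List<C> connections = new ArrayList<>();
        try {
            for (int w = 0; w < workers; w++) {
                connections.add(open.get());
            }
            List<Thread> threads = new ArrayList<>();
            for (C connection : connections) {
                threads.add(new Thread(() -> {
                    for (int batch = 0; batch < batchesPerWorker; batch++) {
                        // The same connection is reused for every batch.
                        processBatch.accept(connection, batch);
                    }
                }));
            }
            for (Thread thread : threads) {
                thread.start();
            }
            for (Thread thread : threads) {
                thread.join();
            }
        } finally {
            for (C connection : connections) {
                connection.close();
            }
        }
    }
}
```

With 3 threads and roughly 16 batches of 500 rows each, this asks the listener for 3 connections in total instead of a new one per batch per thread.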

SSH Server Identification never received - Handshake Deadlock [SSHJ]

We're having some trouble trying to implement a Pool of SftpConnections for our application.
We're currently using SSHJ (Schmizz) as the transport library, and facing an issue we simply cannot simulate in our development environment (but the error keeps showing randomly in production, sometimes after three days, sometimes after just 10 minutes).
The problem is, when trying to send a file via SFTP, the thread gets locked in the init method from schmizz' TransportImpl class:
public void init(String remoteHost, int remotePort, InputStream in, OutputStream out)
throws TransportException {
connInfo = new ConnInfo(remoteHost, remotePort, in, out);
try {
if (config.isWaitForServerIdentBeforeSendingClientIdent()) {
} else {
}"Server identity string: {}", serverID);
} catch (IOException e) {
throw new TransportException(e);
isWaitForServerIdentBeforeSendingClientIdent is false for us, so the client (we) sends its identification first, as the logs show:
"Client identity String: blabla"
Then it is receiveServerIdent's turn:
private void receiveServerIdent() throws IOException
final Buffer.PlainBuffer buf = new Buffer.PlainBuffer();
while ((serverID = readIdentification(buf)).isEmpty()) {
int b =;
if (b == -1)
throw new TransportException("Server closed connection during identification exchange");
buf.putByte((byte) b);
The thread never gets control back, because the server never replies with its identity. The code seems to be stuck in this while loop. No timeouts or SSH exceptions are thrown; my client just keeps waiting forever, and the thread stays blocked.
This is the readIdentification method's impl:
private String readIdentification(Buffer.PlainBuffer buffer)
throws IOException {
String ident = new IdentificationStringParser(buffer, loggerFactory).parseIdentificationString();
if (ident.isEmpty()) {
return ident;
if (!ident.startsWith("SSH-2.0-") && !ident.startsWith("SSH-1.99-"))
throw new TransportException(DisconnectReason.PROTOCOL_VERSION_NOT_SUPPORTED,
"Server does not support SSHv2, identified as: " + ident);
return ident;
It seems that ConnInfo's input stream never gets any data to read, as if the server had closed the connection (even though, as said earlier, no exception is thrown).
I've tried to simulate this error by saturating the negotiation, closing sockets while connecting, and using conntrack to kill established connections while the handshake is in progress, but with no luck at all, so any help would be HIGHLY appreciated.
: )
I bet the following code causes the problem:
String ident = new IdentificationStringParser(buffer, loggerFactory).parseIdentificationString();
if (ident.isEmpty()) {
return ident;
If IdentificationStringParser.parseIdentificationString() returns an empty string, that empty string is returned to the caller. The caller then keeps looping on while ((serverID = readIdentification(buf)).isEmpty()), since the string is always empty. The only way to break out of the loop is for the call in int b =; to return -1, but if the server keeps sending (or resending) data, that condition is never met.
If this is the case, I would add some kind of artificial way to detect it, such as:
private String readIdentification(Buffer.PlainBuffer buffer, AtomicInteger numberOfAttempts)
        throws IOException {
    String ident = new IdentificationStringParser(buffer, loggerFactory).parseIdentificationString();
    // Count each empty read here so the limit below can actually be reached.
    if (ident.isEmpty() && numberOfAttempts.incrementAndGet() < 1000) { // 1000
        return ident;
    } else if (numberOfAttempts.intValue() >= 1000) {
        throw new TransportException("Too many attempts to read the server ident");
    }
    if (!ident.startsWith("SSH-2.0-") && !ident.startsWith("SSH-1.99-"))
        throw new TransportException(DisconnectReason.PROTOCOL_VERSION_NOT_SUPPORTED,
                "Server does not support SSHv2, identified as: " + ident);
    return ident;
}
This way you would at least confirm that this is the case, and you can then dig further into why .parseIdentificationString() returns an empty string.
Faced a similar issue where we would see:
INFO [net.schmizz.sshj.transport.TransportImpl : pool-6-thread-2] - Client identity string: blablabla
INFO [net.schmizz.sshj.transport.TransportImpl : pool-6-thread-2] - Server identity string: blablabla
But on some occasions, there was no server response.
Our service would typically wake up and transfer several files simultaneously, one file per connection / thread.
The issue was in the sshd server config: we increased MaxStartups from its default value of 10
(we noticed that the problems started shortly after batch sizes increased to above 10).
Default in /etc/ssh/sshd_config:
MaxStartups 10:30:100
Changed to:
MaxStartups 30:30:100
Specifies the maximum number of concurrent unauthenticated connections to the SSH daemon. Additional connections will be dropped until authentication succeeds or the LoginGraceTime expires for a connection. The default is 10:30:100. Alternatively, random early drop can be enabled by specifying the three colon separated values start:rate:full (e.g. "10:30:60"). sshd will refuse connection attempts with a probability of rate/100 (30%) if there are currently start (10) unauthenticated connections. The probability increases linearly and all connection attempts are refused if the number of unauthenticated connections reaches full (60).
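The random early drop rule quoted above can be written out as a small function. This is a sketch of the rule as the text describes it (refusal chance of rate percent at start, rising linearly to 100 percent at full), using integer arithmetic; the class and method names are hypothetical and the code is not taken from sshd.

```java
final class MaxStartupsSketch {
    /**
     * Percentage chance (0..100) that a new unauthenticated connection is
     * refused under "MaxStartups start:rate:full", given the number of
     * unauthenticated connections that are currently open.
     */
    static int dropPercent(int current, int start, int rate, int full) {
        if (current < start) {
            return 0;   // below the threshold nothing is dropped
        }
        if (current >= full) {
            return 100; // at or above "full" everything is refused
        }
        // Linear interpolation from `rate` at `start` up to 100 at `full`.
        return rate + (100 - rate) * (current - start) / (full - start);
    }
}
```

With the default 10:30:100, ten pending handshakes already give the next attempt a 30 percent chance of being refused, which fits the observation that problems began once batches grew beyond 10. With 30:30:100 the same load is below the threshold.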
If you cannot control the server, you might have to find a way to limit your concurrent connection attempts in your client code instead.
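One way to limit concurrent connection attempts on the client side is a counting semaphore around the connect-and-authenticate step. The sketch below is library-agnostic: ConnectLimiter is a hypothetical helper, and the Callable passed to connect would contain the actual SSHJ connect and authentication calls, which are not shown here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

final class ConnectLimiter {
    private final Semaphore permits;

    ConnectLimiter(int maxConcurrentHandshakes) {
        this.permits = new Semaphore(maxConcurrentHandshakes);
    }

    /** Runs connectAndAuthenticate while holding one of the limited permits. */
    <T> T connect(Callable<T> connectAndAuthenticate) throws Exception {
        permits.acquire();
        try {
            return connectAndAuthenticate.call();
        } finally {
            permits.release();
        }
    }
}
```

Choosing a permit count below the server's MaxStartups start value (for example 5 against the default of 10) should keep the client out of the random-drop range. Since the setting only counts unauthenticated connections, the permit can be released as soon as authentication completes, and the file transfers themselves can still run in parallel.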
