Spring Boot WebSocket Broker does not send CONNECTED frame - java

I recently faced an issue, unfortunately in a production environment, where my Spring Boot server stopped replying to CONNECT frames (i.e. it did not send the CONNECTED frame). It started happening occasionally, but later none of the CONNECT requests sent by the browser were answered.
After some investigation, I found that the inboundChannel queue was holding many requests at that time. I believe this was the reason.
On the console I could see the following log:
2022-06-01 18:22:59,943 INFO Thread-id-74- springframework.web.socket.config.WebSocketMessageBrokerStats: WebSocketSession[130 current WS(129)-HttpStream(1)-HttpPoll(0), 225 total, 0 closed abnormally (0 connect failure, 0 send limit, 2 transport error)], stompSubProtocol[processed CONNECT(220)-CONNECTED(132)-DISCONNECT(0)], stompBrokerRelay[null], inboundChannel[pool size = 2, active threads = 2, queued tasks = 10774, completed tasks = 31806], outboundChannel[pool size = 2, active threads = 0, queued tasks = 0, completed tasks = 570895], sockJsScheduler[pool size = 1, active threads = 1, queued tasks = 134, completed tasks = 1985]
I am wondering what might be the cause of the issue: what can cause tasks to queue up in the inboundChannel?
Here is my current STOMP config in my Angular application:
const config: StompJS.StompConfig = {
  brokerURL: this.serverUrl,
  connectHeaders: {
    ccid: this.cookieService.get('ccid'),
    username: `${this.globalContext.get('me')['username']}`,
  },
  debug: (str) => {
    this.loggerService.log(this.sessionId, ' | ', str);
  },
  webSocketFactory: () => {
    return new SockJS(this.serverUrl);
  },
  logRawCommunication: true,
  reconnectDelay: 3000,
  heartbeatIncoming: 100,
  heartbeatOutgoing: 100,
  discardWebsocketOnCommFailure: true,
  connectionTimeout: 4000
};

I think I finally found the solution. The problem was the number of queued tasks on the inbound channel, as can be seen in the log from the question: inboundChannel[pool size = 2, active threads = 2, queued tasks = 10774, completed tasks = 31806].
I was shocked to see that only 2 threads were allocated to the task even though I was running on an 8-core machine. So I checked the code for the TaskExecutor and found this:
this.taskExecutor.setCorePoolSize(Runtime.getRuntime().availableProcessors() * 2);
According to this, my corePoolSize should have been around 8*2=16. I figured out that there is a bug in Runtime.getRuntime().availableProcessors() because of which it does not return the correct value in Java 8; it has been fixed in newer versions. Hence, I decided to set the value manually:
@Override
public void configureClientInboundChannel(ChannelRegistration registration) {
    logger.debug("Configuring task executor for Client Inbound Channel");
    // Only override the default when a positive value has been configured
    if (inboundCoreThreads != null && inboundCoreThreads > 0) {
        registration.taskExecutor().corePoolSize(inboundCoreThreads);
    }
}
The next question was why tasks were getting queued, so we looked at a thread dump and found that most of the threads were stuck in the WAITING state because of the cache limit. We therefore raised the cacheLimit from 1024 to 4096:
@Override
public void configureMessageBroker(MessageBrokerRegistry config) {
    config.setCacheLimit(messageBrokerCacheLimit);
}
Of course, inboundCoreThreads and messageBrokerCacheLimit are just variable names; you have to supply your own values for them.
After this, everything seems to be working just fine. Thank you @Ilya Lapitan for the help.
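The queueing seen in the inboundChannel stats can be reproduced with a plain JDK executor. This is only a sketch, assuming the channel's executor is backed by an unbounded queue (the default for Spring's ThreadPoolTaskExecutor); the class name InboundQueueDemo and the numbers are made up for illustration. A ThreadPoolExecutor adds threads beyond the core size only when its queue rejects a task, so with an unbounded queue the pool never grows past corePoolSize and slow handlers make the queue grow instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class InboundQueueDemo {
    /** Submits `tasks` blocking jobs and returns {poolSize, queuedTasks}. */
    static int[] run(int corePoolSize, int maxPoolSize, int tasks) {
        // Unbounded queue: maxPoolSize is effectively ignored
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                corePoolSize, maxPoolSize, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    try {
                        release.await(); // simulate a handler that is stuck waiting
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return new int[] {executor.getPoolSize(), executor.getQueue().size()};
        } finally {
            release.countDown(); // let the blocked tasks finish
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] stats = run(2, 16, 100);
        // prints "pool size = 2, queued tasks = 98"
        System.out.println("pool size = " + stats[0] + ", queued tasks = " + stats[1]);
    }
}
```

Raising the core pool size helps only as long as the handlers themselves are not blocked, which is why the thread dump was the next thing to check.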

Related

Aerospike java client failure on scanAll

I am using the following method to truncate data from Aerospike namespace.set.bins:
// Setting LUT
val calendar = Calendar.getInstance()
calendar.setTimeInMillis(startTime + 1262304000000L) // uses CITRUSLEAF_EPOCH - see https://discuss.aerospike.com/t/how-to-use-view-and-calulate-last-update-time-lut-for-the-truncate-command/4330
logger.info(s"truncate($startTime = ${calendar.getTime}, durableDelete = $durableDelete) on ${config.toRecoverMap}")
// Define Scan and Write Policies
val writePolicy = new WritePolicy()
val scanPolicy = new ScanPolicy()
writePolicy.durableDelete = durableDelete
scanPolicy.filterExp = Exp.build(Exp.le(Exp.lastUpdate(), Exp.`val`(calendar)))
// Scan all records such as LUT <= startTime
config.toRecoverMap.flatMap { case (namespace, mapOfSetsToBins) =>
for ((set, bins) <- mapOfSetsToBins) yield {
val recordCount = new AtomicInteger(0)
client.scanAll(scanPolicy, namespace, set, new ScanCallback() {
override def scanCallback(key: Key, record: Record): Unit = {
val requiresNullify = bins.filter(record.bins.containsKey(_)).distinct // Instead of making bulk requests which maybe not be needed and load AS
if (requiresNullify.nonEmpty) {
client.put(writePolicy, key, requiresNullify.map(Bin.asNull): _*)
logger.debug(s"${recordCount.incrementAndGet()}: (${requiresNullify.mkString(",")}) Bins of Record: $record with $key are set to NULL")
}
}
})
logger.info(s"Totally $recordCount records affected during the truncate operation on $namespace.$set.$bins")
recordCount.get
}
}
}
This fails on:
...
2021-08-08 16:51:30,551 [Aerospike-6] DEBUG c.d.a.c.r.services.AerospikeService.scanCallback(55) - 33950: (IsActive) Bins of Record: (gen:3),(exp:0),(bins:(IsActive:0)) with test-recovery-set-multi-1:null:95001b26e70dbb35e1487802ebbc857eceb92246 are set to NULL
for reason:
Error -11,6,0,30000,0,5: Max retries exceeded: 5
com.aerospike.client.AerospikeException: Error -11,6,0,30000,0,5: Max retries exceeded: 5
at com.aerospike.client.query.PartitionTracker.isComplete(PartitionTracker.java:282)
at com.aerospike.client.command.ScanExecutor.scanPartitions(ScanExecutor.java:70)
at com.aerospike.client.AerospikeClient.scanAll(AerospikeClient.java:1519)
at com.aerospike.connect.reloader.services.AerospikeService.$anonfun$truncate$3(AerospikeService.scala:50)
at com.aerospike.connect.reloader.services.AerospikeService.$anonfun$truncate$3$adapted(AerospikeService.scala:48)
at scala.collection.Iterator$$anon$9.next(Iterator.scala:575)
at scala.collection.immutable.List.prependedAll(List.scala:153)
at scala.collection.immutable.List$.from(List.scala:651)
at scala.collection.immutable.List$.from(List.scala:648)
at scala.collection.IterableFactory$Delegate.from(Factory.scala:288)
at scala.collection.immutable.Iterable$.from(Iterable.scala:35)
at scala.collection.immutable.Iterable$.from(Iterable.scala:32)
at scala.collection.IterableOps$WithFilter.map(Iterable.scala:884)
at com.aerospike.connect.reloader.services.AerospikeService.$anonfun$truncate$1(AerospikeService.scala:48)
at scala.collection.StrictOptimizedIterableOps.flatMap(StrictOptimizedIterableOps.scala:117)
at scala.collection.StrictOptimizedIterableOps.flatMap$(StrictOptimizedIterableOps.scala:104)
at scala.collection.immutable.Map$Map1.flatMap(Map.scala:241)
at com.aerospike.connect.reloader.services.AerospikeService.truncate(AerospikeService.scala:47)
at com.aerospike.connect.reloader.tests.services.AerospikeServiceSpec.$anonfun$new$2(AerospikeServiceSpec.scala:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.wordspec.AnyWordSpecLike$$anon$3.apply(AnyWordSpecLike.scala:1077)
at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
at com.aerospike.connect.reloader.tests.services.AerospikeServiceSpec.withFixture(AerospikeServiceSpec.scala:13)
at org.scalatest.wordspec.AnyWordSpecLike.invokeWithFixture$1(AnyWordSpecLike.scala:1075)
at org.scalatest.wordspec.AnyWordSpecLike.$anonfun$runTest$1(AnyWordSpecLike.scala:1087)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.wordspec.AnyWordSpecLike.runTest(AnyWordSpecLike.scala:1087)
at org.scalatest.wordspec.AnyWordSpecLike.runTest$(AnyWordSpecLike.scala:1069)
at com.aerospike.connect.reloader.tests.services.AerospikeServiceSpec.runTest(AerospikeServiceSpec.scala:13)
at org.scalatest.wordspec.AnyWordSpecLike.$anonfun$runTests$1(AnyWordSpecLike.scala:1146)
at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
at scala.collection.immutable.List.foreach(List.scala:333)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:390)
at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:427)
at scala.collection.immutable.List.foreach(List.scala:333)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
at org.scalatest.wordspec.AnyWordSpecLike.runTests(AnyWordSpecLike.scala:1146)
at org.scalatest.wordspec.AnyWordSpecLike.runTests$(AnyWordSpecLike.scala:1145)
at com.aerospike.connect.reloader.tests.services.AerospikeServiceSpec.runTests(AerospikeServiceSpec.scala:13)
at org.scalatest.Suite.run(Suite.scala:1112)
at org.scalatest.Suite.run$(Suite.scala:1094)
at com.aerospike.connect.reloader.tests.services.AerospikeServiceSpec.org$scalatest$BeforeAndAfterAll$$super$run(AerospikeServiceSpec.scala:13)
at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at com.aerospike.connect.reloader.tests.services.AerospikeServiceSpec.org$scalatest$wordspec$AnyWordSpecLike$$super$run(AerospikeServiceSpec.scala:13)
at org.scalatest.wordspec.AnyWordSpecLike.$anonfun$run$1(AnyWordSpecLike.scala:1191)
at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
at org.scalatest.wordspec.AnyWordSpecLike.run(AnyWordSpecLike.scala:1191)
at org.scalatest.wordspec.AnyWordSpecLike.run$(AnyWordSpecLike.scala:1189)
at com.aerospike.connect.reloader.tests.services.AerospikeServiceSpec.run(AerospikeServiceSpec.scala:13)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1320)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1314)
at scala.collection.immutable.List.foreach(List.scala:333)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1314)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:993)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:971)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1480)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:971)
at org.scalatest.tools.Runner$.run(Runner.scala:798)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:38)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:25)
Any ideas why it's happening?
LUT method:
def calculateCurrentLUT(): Long = {
logger.info("calculateCurrentLUTs() Triggered")
val policy = new WritePolicy()
policy.setTimeout(config.operationTimeoutInMillis)
val key = new Key(config.toRecover.head.namespace, AerospikeConfiguration.dummySetName, AerospikeConfiguration.dummyKey)
client.put(policy, key, new Bin(AerospikeConfiguration.dummyBin, "Used by the Recovery process to calculate current machine startTime"))
client.execute(policy, key, AerospikeConfiguration.packageName, "getLUT").asInstanceOf[Long]
}
with:
def registerUDFs(): RegisterTask = {
logger.info(s"registerUDFs() Triggered")
val policy = new WritePolicy()
policy.setTimeout(config.operationTimeoutInMillis)
client.registerUdfString(policy, """
|function getLUT(r)
| return record.last_update_time(r)
|end
|""", AerospikeConfiguration.packageName + ".lua", Language.LUA)
}
AerospikeException: Error -11,6,0,30000,0,5: Max retries exceeded: 5 breaks down as follows:
-11 is the error code: the maximum retry attempts on this operation exceeded the specified value.
6 is the number of iterations (the original attempt plus maxRetries), and you specified maxRetries of 5.
The remaining values are your connection settings: 0 is connectTimeout (the wait to create the initial socket; 0 is the default), 30000 (30 s) is the time after which an idle socket is closed, 0 is the total timeout for this scan operation (0 means don't time out, which is correct for scans), and 5 is the number of retries.
It looks like the server is not responding to the client's scan call within 30 seconds, so the client closes the idle socket and retries, and after 5 retries it throws an exception. Something is obviously wrong, so check the server log for more clues. For example, are you using a server version that supports Expressions for scans? Second, I would check the computation of your LUT comparison expression. If the filter expression evaluates to false, the scan just returns an EOF on completion with no matching records, but if the socket times out before that, the scan goes into a retry.
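To make that reading of the message easier to apply to other failures, here is a small JDK-only sketch that splits the numeric prefix into named fields. The class name AerospikeErrorFields and the field labels are chosen for this example and follow the interpretation given in the answer above; they are not taken from the client library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AerospikeErrorFields {
    // Field order as read in the answer above; the names are labels chosen for this sketch
    private static final String[] NAMES = {
        "resultCode", "iteration", "connectTimeout", "socketTimeout", "totalTimeout", "maxRetries"
    };

    /** Extracts the comma-separated numbers between "Error " and the following ':'. */
    static Map<String, Integer> parse(String message) {
        int start = message.indexOf("Error ");
        int end = message.indexOf(':', Math.max(start, 0));
        if (start < 0 || end < 0) {
            throw new IllegalArgumentException("Unexpected message format: " + message);
        }
        String[] parts = message.substring(start + "Error ".length(), end).split(",");
        Map<String, Integer> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length && i < NAMES.length; i++) {
            fields.put(NAMES[i], Integer.parseInt(parts[i].trim()));
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints {resultCode=-11, iteration=6, connectTimeout=0, socketTimeout=30000, totalTimeout=0, maxRetries=5}
        System.out.println(parse("Error -11,6,0,30000,0,5: Max retries exceeded: 5"));
    }
}
```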

How to validate that PoolingHttpClientConnectionManager is applied on Jersey client

Below is the code snippet I am using for jersey client connection pooling.
ClientConfig clientConfig = new ClientConfig();
clientConfig.property(ClientProperties.CONNECT_TIMEOUT, defaultConnectTimeout);
clientConfig.property(ClientProperties.READ_TIMEOUT, defaultReadTimeout);
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(50);
cm.setDefaultMaxPerRoute(5);
clientConfig.property(ApacheClientProperties.CONNECTION_MANAGER, cm);
clientConfig.connectorProvider(new ApacheConnectorProvider());
How can I validate that my client is using connection pooling? Is the poolStats.getAvailable() count a valid way of making sure? In my case this available count is 1 when I tested the client.
Yes, the count can be 1, but to confirm you can try the following steps.
First, add a thread that keeps running in the background and prints the current state of the pool stats at some interval, say every 60 seconds. You can use the logic below. Make sure the background thread refers to the same PoolingHttpClientConnectionManager instance that the client uses.
Then, repeatedly call the code that invokes the external service through this Jersey client (for example, in a for loop).
You should see the logged values change between runs of your thread, which would confirm that the Jersey client is actually using the pooled configuration.
Logic:
PoolStats poolStats = cm.getTotalStats();
Set<HttpRoute> routes = cm.getRoutes();
if (CollectionUtils.isNotEmpty(routes)) {
    for (HttpRoute route : routes) {
        // Use the same connection manager instance (cm) that the client was configured with
        PoolStats routeStats = cm.getStats(route);
        // getAvailable() is the number of idle pooled connections; getLeased() is the number in use
        log.info("Pool Stats for Route - Host = {}, Available (idle) = {}, Leased = {}, Pending = {}, Max = {}",
                route.getTargetHost(), routeStats.getAvailable(), routeStats.getLeased(),
                routeStats.getPending(), routeStats.getMax());
    }
}
log.info("Pool Stats - Available (idle) = {}, Leased = {}, Pending = {}, Max = {}",
        poolStats.getAvailable(), poolStats.getLeased(), poolStats.getPending(), poolStats.getMax());
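The background reporter described in this answer could be sketched roughly as follows, using only the JDK. PoolStatsReporter, the snapshot supplier and the sink are hypothetical names for this example; in a real application the supplier would presumably build its string from the connection manager's stats and the sink would be a logger.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

class PoolStatsReporter {
    /** Periodically passes a stats snapshot to the sink on a daemon thread. */
    static ScheduledExecutorService start(Supplier<String> snapshot, Consumer<String> sink,
                                          long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "pool-stats-reporter");
            t.setDaemon(true); // do not keep the JVM alive just for reporting
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                sink.accept(snapshot.get());
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all further runs, so swallow and report it
                sink.accept("stats snapshot failed: " + e);
            }
        }, 0, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in supplier; a real one would read the connection manager's stats
        ScheduledExecutorService s = start(() -> "leased=0 available=1", System.out::println,
                100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        s.shutdownNow();
    }
}
```

With the code from the question, the supplier could be something like `() -> cm.getTotalStats().toString()` with a 60 second period, provided `cm` is the same instance that was passed to the ClientConfig.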

How to vertically scale Vert.x without Verticles?

According to the Vert.x docs, deploying using Verticles is optional. If this is the case, how can I deploy, say, an HTTP server onto multiple event loops? Here's what I tried (I also read the API docs and couldn't find anything):
Vertx vertx = Vertx.vertx(new VertxOptions().setEventLoopPoolSize(10));
HttpServerOptions options = new HttpServerOptions().setLogActivity(true);
for (int i = 0; i < 10; i++) {
    vertx.createHttpServer(options).requestHandler(request -> {
        request.response().end("Hello world");
    }).listen(8081);
}
This appears to create 10 HTTP servers on the first event loop but I'm hoping for 1 server per event loop.
Here's what I see in my logs - all eventloop-thread-0:
08:42:46.667 [vert.x-eventloop-thread-0] DEBUG
io.netty.handler.logging.LoggingHandler - [id: 0x0c651def,
L:/0:0:0:0:0:0:0:1:8081 - R:/0:0:0:0:0:0:0:1:50978] READ: 78B
08:42:46.805 [vert.x-eventloop-thread-0] DEBUG
io.netty.handler.logging.LoggingHandler - [id: 0xe050d078,
L:/0:0:0:0:0:0:0:1:8081 - R:/0:0:0:0:0:0:0:1:51000] READ: 78B
08:42:47.400 [vert.x-eventloop-thread-0] DEBUG
io.netty.handler.logging.LoggingHandler - [id: 0x22b626b8,
L:/0:0:0:0:0:0:0:1:8081 - R:/0:0:0:0:0:0:0:1:51002] READ: 78B
"Optional" doesn't mean "you can, getting the same benefits". "Optional" simply means "you can".
Vert.x has the notion of thread affinity. HTTP Server created from the same thread will always be assigned to the same event loop. Otherwise you'll get nasty thread-safety problems.
You can compare the example code from above with the following code:
Vertx vertx = Vertx.vertx();
HttpServerOptions options = new HttpServerOptions().setLogActivity(true);
// Spawn multiple threads, so EventLoops won't be bound to main
ExecutorService tp = Executors.newWorkStealingPool(10);
CountDownLatch l = new CountDownLatch(1);
for (int i = 0; i < 10; i++) {
    tp.execute(() -> {
        vertx.createHttpServer(options).requestHandler(request -> {
            System.out.println(Thread.currentThread().getName());
            // Slow the response somewhat
            vertx.setTimer(1000, (h) -> {
                request.response().end("Hello world");
            });
        }).listen(8081);
    });
}
// Just wait here
l.await();
Output is something like:
vert.x-eventloop-thread-0
vert.x-eventloop-thread-1
vert.x-eventloop-thread-2
vert.x-eventloop-thread-0
That's because each event loop thread now is bound to a separate executing thread.

Why doesn't this thread pool execute HTTP requests simultaneously?

I wrote a few lines of code which will send 50 HTTP GET requests to a service running on my machine. The service will always sleep 1 second and return a HTTP status code 200 with an empty body. As expected the code runs for about 50 seconds.
To speed things up a little I tried to create an ExecutorService with 4 threads so I could always send 4 requests at the same time to my service. I expected the code to run for about 13 seconds.
final List<String> urls = new ArrayList<>();
for (int i = 0; i < 50; i++)
urls.add("http://localhost:5000/test/" + i);
final RestTemplate restTemplate = new RestTemplate();
final List<Callable<String>> tasks = urls
.stream()
.map(u -> (Callable<String>) () -> {
System.out.println(LocalDateTime.now() + " - " + Thread.currentThread().getName() + ": " + u);
return restTemplate.getForObject(u, String.class);
}).collect(Collectors.toList());
final ExecutorService executorService = Executors.newFixedThreadPool(4);
final long start = System.currentTimeMillis();
try {
final List<Future<String>> futures = executorService.invokeAll(tasks);
final List<String> results = futures.stream().map(f -> {
try {
return f.get();
} catch (InterruptedException | ExecutionException e) {
throw new IllegalStateException(e);
}
}).collect(Collectors.toList());
System.out.println(results);
} finally {
executorService.shutdown();
executorService.awaitTermination(10, TimeUnit.SECONDS);
}
final long elapsed = System.currentTimeMillis() - start;
System.out.println("Took " + elapsed + " ms...");
However, if you look at the seconds in the debug output, it seems that the first 4 requests are executed simultaneously but all other requests are executed one after another:
2018-10-21T17:42:16.160 - pool-1-thread-3: http://localhost:5000/test/2
2018-10-21T17:42:16.160 - pool-1-thread-1: http://localhost:5000/test/0
2018-10-21T17:42:16.160 - pool-1-thread-2: http://localhost:5000/test/1
2018-10-21T17:42:16.159 - pool-1-thread-4: http://localhost:5000/test/3
2018-10-21T17:42:17.233 - pool-1-thread-3: http://localhost:5000/test/4
2018-10-21T17:42:18.232 - pool-1-thread-2: http://localhost:5000/test/5
2018-10-21T17:42:19.237 - pool-1-thread-4: http://localhost:5000/test/6
2018-10-21T17:42:20.241 - pool-1-thread-1: http://localhost:5000/test/7
...
Took 50310 ms...
So for debugging purposes I changed the HTTP request to a sleep call:
// return restTemplate.getForObject(u, String.class);
TimeUnit.SECONDS.sleep(1);
return "";
And now the code works as expected:
...
Took 13068 ms...
So my question is why does the code with the sleep call work as expected and the code with the HTTP request doesn't? And how can I get it to behave in the way I expected?
From the information given, this is the most probable root cause:
The requests you make are sent in parallel, but the HTTP server that fulfils them handles one request at a time.
So when you start making requests, the executor service fires them concurrently, and you see the first 4 at the same time.
But the HTTP server can only respond to one request at a time, i.e. one per second.
When the first request is fulfilled, the executor service picks the next request and fires it, and this goes on until the last request.
At any time 4 requests are blocked at the HTTP server, which serves them serially, one after the other.
As a proof of concept for this theory, you can test against a service (for example a messaging queue) that can receive concurrently from 4 channels. That should reduce the time.
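One way to check this theory without any external service is a small JDK-only experiment: start a local server whose handler sleeps, and vary how many requests the server may process at once. The sketch below uses the JDK's built-in com.sun.net.httpserver.HttpServer and HttpURLConnection rather than the RestTemplate and service from the question; the class name SerialServerDemo and all numbers are made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SerialServerDemo {
    /**
     * Starts a local HTTP server whose handler sleeps for sleepMs, sends `requests`
     * GETs from a 4-thread client pool, and returns the elapsed time in milliseconds.
     */
    static long run(int serverThreads, int requests, long sleepMs) {
        ExecutorService serverPool = Executors.newFixedThreadPool(serverThreads);
        ExecutorService clientPool = Executors.newFixedThreadPool(4);
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.setExecutor(serverPool); // how many requests the server handles at once
            server.createContext("/test", exchange -> {
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                exchange.sendResponseHeaders(200, -1); // 200 with an empty body
                exchange.close();
            });
            server.start();
            int port = server.getAddress().getPort();

            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                String url = "http://127.0.0.1:" + port + "/test/" + i;
                tasks.add(() -> {
                    HttpURLConnection c = (HttpURLConnection) URI.create(url).toURL().openConnection();
                    try {
                        return c.getResponseCode();
                    } finally {
                        c.disconnect();
                    }
                });
            }
            long start = System.nanoTime();
            for (Future<Integer> f : clientPool.invokeAll(tasks)) {
                f.get(); // wait for every response
            }
            return (System.nanoTime() - start) / 1_000_000;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            clientPool.shutdown();
            if (server != null) {
                server.stop(0);
            }
            serverPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("1 server thread:  " + run(1, 8, 100) + " ms");
        System.out.println("4 server threads: " + run(4, 8, 100) + " ms");
    }
}
```

If the theory holds, the run with one server thread should take roughly requests × sleep time regardless of the client pool size, while the run with four server threads should take about a quarter of that.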

Vert.x performance drop when starting with -cluster option

I'm wondering if anyone has experienced the same problem.
We have a Vert.x application whose ultimate purpose is to insert 600 million rows into a Cassandra cluster. We are testing the speed of Vert.x in combination with Cassandra by running tests with smaller amounts.
If we run the fat jar (built with the Shade plugin) without the -cluster option, we are able to insert 10 million records in about a minute. When we add the -cluster option (eventually we will run the Vert.x application in a cluster), it takes about 5 minutes to insert 10 million records.
Does anyone know why?
We know that the Hazelcast config will create some overhead, but never thought it would be 5 times slower. This implies we will need 5 EC2 instances in cluster to get the same result when using 1 EC2 without the cluster option.
As mentioned, everything runs on EC2 instances:
2 Cassandra servers on t2.small
1 Vert.x server on t2.2xlarge
You are actually running into corner cases of the Vert.x Hazelcast cluster manager.
First of all, you are using a worker verticle to send your messages (30000001). Under the hood Hazelcast is blocking, and version 3.3.3 does not take that into account when you send a message from a worker. We recently added this fix https://github.com/vert-x3/issues/issues/75 (not present in 3.4.0.Beta1 but present in 3.4.0-SNAPSHOTS) that will improve this case.
Second, when you send all your messages at the same time, you run into another corner case that prevents the Hazelcast cluster manager from using a cache of the cluster topology. This topology cache is usually updated after the first message has been sent, and sending all the messages in one shot prevents the cache from being used (short explanation: HazelcastAsyncMultiMap#getInProgressCount will be > 0 and prevents the cache from being used), hence you pay the penalty of an expensive lookup (which is what the cache is there to avoid).
If I use Bertjan's reproducer with 3.4.0-SNAPSHOT + Hazelcast and the following change: send a message to the destination, wait for the reply, and upon reply send all the messages, then I get a lot of improvement.
Without clustering : 5852 ms
With clustering with HZ 3.3.3 : 16745 ms
With clustering with HZ 3.4.0-SNAPSHOT + initial message : 8609 ms
I also believe you should not use a worker verticle to send that many messages; instead, send them in batches from an event loop verticle. Perhaps you should explain your use case and we can think about the best way to solve it.
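The batching idea could look roughly like this. It is a JDK-only sketch, not Vert.x code: BatchedSender, the send callback and the executor are stand-ins chosen for this example. In a Vert.x application the callback would presumably wrap vertx.eventBus().send(...) and the executor would hand the next chunk back to the event loop (for example via vertx.runOnContext), so other events can be processed between chunks; both mappings are assumptions.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

class BatchedSender {
    /**
     * Sends ids from..to (inclusive) in chunks of batchSize. After each chunk the next
     * one is scheduled on `loop` instead of being sent in the same call.
     */
    static void sendBatch(int from, int to, int batchSize,
                          IntConsumer send, Executor loop, Runnable done) {
        int end = Math.min(to, from + batchSize - 1);
        for (int i = from; i <= end; i++) {
            send.accept(i);
        }
        if (end < to) {
            // Yield: let the executor decide when the next chunk runs
            loop.execute(() -> sendBatch(end + 1, to, batchSize, send, loop, done));
        } else {
            done.run();
        }
    }

    public static void main(String[] args) {
        AtomicInteger sent = new AtomicInteger();
        // Runnable::run executes each chunk immediately; a real event loop would interleave other work
        sendBatch(1, 1000, 100, i -> sent.incrementAndGet(), Runnable::run,
                () -> System.out.println("sent " + sent.get() + " messages")); // prints "sent 1000 messages"
    }
}
```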
When you enable clustering (of any kind) in an application, you make the application more resilient to failures, but you also add a performance penalty.
For example your current flow (without clustering) is something like:
client ->
vert.x app ->
in-memory, same-process event bus (negligible) ->
handler -> cassandra
<- vert.x app
<- client
Once you enable clustering:
client ->
vert.x app ->
serialize request ->
network request cluster member ->
deserialize request ->
handler -> cassandra
<- serialize response
<- network reply
<- deserialize response
<- vert.x app
<- client
As you can see, many encode/decode operations are required, plus several network calls, and all of this gets added to your total request time.
To achieve the best performance you need to take advantage of locality: the closer you are to your data store, the faster things usually are.
Just to add the code of the project. I guess that would help.
Sender verticle:
public class ProviderVerticle extends AbstractVerticle {
    @Override
    public void start() throws Exception {
        IntStream.range(1, 30000001).parallel().forEach(i -> {
            vertx.eventBus().send("clustertest1", Json.encode(new TestCluster1(i, "abc", LocalDateTime.now())));
        });
    }
    @Override
    public void stop() throws Exception {
        super.stop();
    }
}
And the inserter verticle
public class ReceiverVerticle extends AbstractVerticle {
private int messagesReceived = 1;
private Session cassandraSession;
@Override
public void start() throws Exception {
PoolingOptions poolingOptions = new PoolingOptions()
.setCoreConnectionsPerHost(HostDistance.LOCAL, 2)
.setMaxConnectionsPerHost(HostDistance.LOCAL, 3)
.setCoreConnectionsPerHost(HostDistance.REMOTE, 1)
.setMaxConnectionsPerHost(HostDistance.REMOTE, 3)
.setMaxRequestsPerConnection(HostDistance.LOCAL, 20)
.setMaxQueueSize(32768)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 20);
Cluster cluster = Cluster.builder()
.withPoolingOptions(poolingOptions)
.addContactPoints(ClusterSetup.SEEDS)
.build();
System.out.println("Connecting session");
cassandraSession = cluster.connect("kiespees");
System.out.println("Session connected:\n\tcluster [" + cassandraSession.getCluster().getClusterName() + "]");
System.out.println("Connected hosts: ");
cassandraSession.getState().getConnectedHosts().forEach(host -> System.out.println(host.getAddress()));
PreparedStatement prepared = cassandraSession.prepare(
"insert into clustertest1 (id, value, created) " +
"values (:id, :value, :created)");
PreparedStatement preparedTimer = cassandraSession.prepare(
"insert into timer (name, created_on, amount) " +
"values (:name, :createdOn, :amount)");
BoundStatement timerStart = preparedTimer.bind()
.setString("name", "clusterteststart")
.setInt("amount", 0)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStart);
EventBus bus = vertx.eventBus();
System.out.println("Bus info: " + bus.toString());
MessageConsumer<String> cons = bus.consumer("clustertest1");
System.out.println("Consumer info: " + cons.address());
System.out.println("Waiting for messages");
cons.handler(message -> {
TestCluster1 tc = Json.decodeValue(message.body(), TestCluster1.class);
if (messagesReceived % 100000 == 0)
System.out.println("Message received: " + messagesReceived);
BoundStatement boundRecord = prepared.bind()
.setInt("id", tc.getId())
.setString("value", tc.getValue())
.setTimestamp("created", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(boundRecord);
if (messagesReceived % 100000 == 0) {
BoundStatement timerStop = preparedTimer.bind()
.setString("name", "clusterteststop")
.setInt("amount", messagesReceived)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStop);
}
messagesReceived++;
//message.reply("OK");
});
}
@Override
public void stop() throws Exception {
super.stop();
cassandraSession.close();
}
}
