How to increase WebSocket throughput - java

I need to pull data from many clients connecting to a Java server through a WebSocket.
There are many WebSocket implementations, and I picked Vert.x.
I made a simple demo where I listen for JSON text frames, parse them with Jackson, and send a response back. The JSON parser does not significantly affect throughput.
I am getting an overall rate of 2.5k messages per second with 2 or 10 clients.
Then I tried buffering: clients do not wait for every single response but send a batch of messages (30k to 90k) after a confirmation from the server; the rate increased to 8k per second.
I see that the Java process has a CPU bottleneck: one core is at 100%.
Meanwhile, the Node.js client's CPU consumption is only 5%.
Even one client causes the server to consume almost a whole core.
Do you think it's worth trying other WebSocket implementations like Jetty?
Is there a way to scale Vert.x across multiple cores?
Update: after I changed the log level from debug to info, I get 70k per second. The debug level causes Vert.x to print messages for every frame.
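For context, here is a minimal sketch of the kind of echo handler described above; it is an assumption of the setup (Vert.x 4 API plus Jackson), not the asker's actual code:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import io.vertx.core.AbstractVerticle;

    public class WsEchoVerticle extends AbstractVerticle {
        private final ObjectMapper mapper = new ObjectMapper();

        @Override
        public void start() {
            vertx.createHttpServer()
                 .webSocketHandler(ws -> ws.textMessageHandler(msg -> {
                     try {
                         mapper.readTree(msg);                   // parse the incoming JSON frame
                         ws.writeTextMessage("{\"ok\":true}");   // send a small response back
                     } catch (Exception e) {
                         ws.writeTextMessage("{\"ok\":false}");
                     }
                 }))
                 .listen(8080);
        }
    }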

It is possible to specify the number of verticle (thread) instances, e.g. by configuring DeploymentOptions: http://vertx.io/docs/vertx-core/java/#_specifying_number_of_verticle_instances
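A minimal sketch of that approach, assuming the Vert.x 4 API (the verticle class name refers to the sketch in the question and is illustrative):

    import io.vertx.core.DeploymentOptions;
    import io.vertx.core.Vertx;

    public class Main {
        public static void main(String[] args) {
            Vertx vertx = Vertx.vertx();
            // one verticle instance per core; each instance gets its own event loop
            DeploymentOptions options = new DeploymentOptions()
                    .setInstances(Runtime.getRuntime().availableProcessors());
            vertx.deployVerticle("com.example.WsEchoVerticle", options);
        }
    }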
You were able to create more than 60k connections on a single machine, so I assume the average lifetime of a connection was less than a second. Is that what you expect in production? To compare with other solutions, you can try running https://github.com/smallnest/C1000K-Servers

Something doesn't sound right; that is very low performance. It sounds like Vert.x is not configured properly for parallelization. Are you using only one verticle (thread)?
The Kaazing Gateway is one of the first WS implementations. By default it uses multiple cores and is further configurable with threads, etc. We have users running it for massive IoT deployments, so your issue is not the JVM.
In case you're interested, here's the github repo: https://github.com/kaazing/gateway
Full disclosure: I work for Kaazing

Related

Tactics to explore performance degradation of a Java-based web app over time

I am working on an enterprise Java application that already has a lot of tools/frameworks in it, such as Struts, JAX-RS, and Spring MVC. It contains UIs and REST endpoints bundled together in a .war file.
The project is evolving; we are getting rid of older tools and aiming to end up with only Spring MVC/WebFlux.
The application performs searches over millions of XML/JSON records, and the search engine was recently switched from MarkLogic to Elasticsearch.
What we have noticed is that in production, under fairly light usage (up to 1.7k requests per minute on 2-4 application nodes), response times on some of the endpoints increase over time.
Elasticsearch has room to grow and shows no signs of heavy load.
So currently we have to restart/replace production instances every week or two, once the average response time exceeds 3 seconds instead of the usual 200-300 ms.
I tried to get CPU and heap flame graphs using async-profiler, but the load profile changes on every measurement (we have a bunch of features in use), so I cannot really compare how the graphs change over time.
Can you advise some tactics/approaches for finding the problematic place in the code?
Found the issue. It is related to connection pooling in the HTTP client.
What we noticed is that over time the number of active Tomcat threads grew together with the response times:
On that graph you can also see that the server was restarted on May 9th.
I was able to get heap and thread dumps before the server restarted, and after some digging I found an interesting repeated pattern in the thread dump:
Thread xxx
at sun.misc.Unsafe.park(ZJ)V (Native Method)
at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V (LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()V (AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(Ljava/lang/Object;Ljava/lang/Object;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/Future;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:377)
at org.apache.http.pool.AbstractConnPool.access$200(Lorg/apache/http/pool/AbstractConnPool;Ljava/lang/Object;Ljava/lang/Object;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/Future;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:67)
at org.apache.http.pool.AbstractConnPool$2.get(JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:243)
at org.apache.http.pool.AbstractConnPool$2.get(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object; (AbstractConnPool.java:191)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(Ljava/util/concurrent/Future;JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/HttpClientConnection; (PoolingHttpClientConnectionManager.java:282)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/HttpClientConnection; (PoolingHttpClientConnectionManager.java:269)
at org.apache.http.impl.execchain.MainClientExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (MainClientExec.java:191)
at org.apache.http.impl.execchain.ProtocolExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (RedirectExec.java:111)
at org.apache.http.impl.client.InternalHttpClient.doExecute(Lorg/apache/http/HttpHost;Lorg/apache/http/HttpRequest;Lorg/apache/http/protocol/HttpContext;)Lorg/apache/http/client/methods/CloseableHttpResponse; (InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(Lorg/apache/http/client/methods/HttpUriRequest;Lorg/apache/http/protocol/HttpContext;)Lorg/apache/http/client/methods/CloseableHttpResponse; (CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(Lorg/apache/http/client/methods/HttpUriRequest;)Lorg/apache/http/client/methods/CloseableHttpResponse; (CloseableHttpClient.java:108)
at io.searchbox.client.http.JestHttpClient.executeRequest(Lorg/apache/http/client/methods/HttpUriRequest;)Lorg/apache/http/client/methods/CloseableHttpResponse; (JestHttpClient.java:136)
at io.searchbox.client.http.JestHttpClient.execute(Lio/searchbox/action/Action;Lorg/apache/http/client/config/RequestConfig;)Lio/searchbox/client/JestResult; (JestHttpClient.java:70)
at io.searchbox.client.http.JestHttpClient.execute(Lio/searchbox/action/Action;)Lio/searchbox/client/JestResult; (JestHttpClient.java:63)
...
In our case we are using the Jest library to talk to Elasticsearch.
Internally it uses Apache HttpClient and Apache HttpAsyncClient.
As you can see from the thread dump, this thread was waiting for an available connection in the HTTP client's connection pool, and there were more threads with exactly the same stack.
What I also discovered is that we had maxTotal (maximum total number of connections) set to 20 and defaultMaxPerRoute (maximum connections per route) set to 2.
From the HttpClient documentation: by default the pool allows only 20 concurrent connections in total and two concurrent connections per unique route. The two-connection limit is due to the requirements of the HTTP specification; however, in practical terms this can often be too restrictive.
See the connection pools description.
So the fix was to increase those values to 50 and 40 respectively.
I would still prefer to have these parameters unbounded and let them grow with usage, but for now we will stick with these values.
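For reference, here is a sketch (not the project's actual code) of where those two limits live when configuring the underlying Apache HttpClient pool directly:

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    public class EsHttpClientConfig {
        public static CloseableHttpClient build() {
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(50);            // was 20: total connections across all routes
            cm.setDefaultMaxPerRoute(40);  // was 2: connections per route (i.e. per ES node)
            return HttpClients.custom()
                    .setConnectionManager(cm)
                    .build();
        }
    }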

gRPC server response latency

First off, has anyone done a performance comparison of throughput/latency between a gRPC client-server implementation and a WebSocket + protobuf client-server implementation? Or at least something similar?
Toward this goal, I am trying out the example Java helloworld gRPC client-server and comparing the response latency with a similar WebSocket client-server. Currently I am running both client and server on my local machine.
The WebSocket client-server has a simple while loop on the server side. For the gRPC server, I notice that it uses an asynchronous execution model. I suspect it creates a new thread for each client request, resulting in additional processing overhead. For instance, the WebSocket response latency I am measuring is on the order of 6-7 ms, while the gRPC example shows a latency of about 600-700 ms, even accounting for protobuf overhead.
To do a fairer comparison, is there a way to run the gRPC server synchronously? I want to eliminate the overhead of thread creation/dispatch and other internal overhead introduced by the asynchronous handling.
I understand that gRPC involves protobuf overhead that is not present in my WebSocket example. However, I can account for that by measuring the overhead introduced by protobuf processing.
Also, if I cannot run the gRPC server synchronously, can I at least measure the thread dispatch/asynchronous processing overhead?
I am relatively new to Java, so pardon my ignorance.
Benchmarking in Java is easy to get wrong. You need many seconds' worth of warm-up for the multiple levels of JIT compilation to kick in, and you also need time for the heap size to level off. In a simplistic one-shot benchmark it's easy to see that the code that runs last is fastest (independent of what that code is), due to class loading. 600 ms is an insanely large number for gRPC latency; we see around 300 µs median latency on Google Compute Engine between two machines, with TLS. I expect you have no warm-up, so you are counting the time it takes for Java to load gRPC and are measuring Java running gRPC in its interpreter.
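As one way to get the warm-up right, here is a minimal sketch using JMH (my suggestion, not something the answer prescribes); the benchmark body is a placeholder for the actual client call:

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Measurement;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Warmup;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @Warmup(iterations = 5, time = 5)      // several seconds of warm-up so the JIT compiles the hot path
    @Measurement(iterations = 5, time = 5) // only these iterations are reported
    @Fork(1)
    public class RpcLatencyBenchmark {

        @Benchmark
        public String callService() {
            // replace with the real call, e.g. a blocking stub invocation
            return "placeholder";
        }
    }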
There is no synchronous version of the gRPC server, and even if there were, it would still run on a separate thread by default. grpc-java uses a cached thread pool, so after an initial request gRPC should be able to reuse a thread for calling the service.
The cost of jumping between threads is generally low, although it can add tail latency. In some in-process NOOP benchmarks we see RPC completion in 8 µs using the extra threads and 4 µs without. If you really want though, you can use serverBuilder.directExecutor() to avoid the thread hop. Note that most services would get slower with that option and have really poor tail latency, as service processing can delay I/O.
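For illustration, a sketch of a server bootstrap with that option, assuming the generated helloworld classes from the grpc-java examples are on the classpath:

    import io.grpc.Server;
    import io.grpc.ServerBuilder;
    import io.grpc.examples.helloworld.GreeterGrpc;
    import io.grpc.examples.helloworld.HelloReply;
    import io.grpc.examples.helloworld.HelloRequest;
    import io.grpc.stub.StreamObserver;

    public class DirectExecutorServer {

        // Same trivial service as the helloworld example, kept minimal for latency testing.
        static class GreeterImpl extends GreeterGrpc.GreeterImplBase {
            @Override
            public void sayHello(HelloRequest req, StreamObserver<HelloReply> observer) {
                observer.onNext(HelloReply.newBuilder().setMessage("Hello " + req.getName()).build());
                observer.onCompleted();
            }
        }

        public static void main(String[] args) throws Exception {
            Server server = ServerBuilder.forPort(50051)
                    .directExecutor()              // handle calls on the transport thread, no thread hop
                    .addService(new GreeterImpl())
                    .build()
                    .start();
            server.awaitTermination();
        }
    }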
Regarding whether the gRPC server can be run synchronously to eliminate the thread creation/dispatch overhead of the asynchronous handling:
You can create a synchronous client. Generally the asynchronous approach is much faster (tested in Scala), as you can use all the resources you have in a non-blocking way. I would create a test of how many requests from how many clients the server can handle per second. You can then limit the incoming requests per client to make sure your service will not crash. Asynchronous is also a better fit for HTTP/2, which provides multiplexing.
For benchmarking I can recommend Metrics. You can expose the metrics via a log or an HTTP endpoint.
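For the measurement itself, a minimal sketch assuming Dropwizard Metrics (the answer only says "Metrics"); the timed body is a placeholder, not a real RPC call:

    import java.util.concurrent.TimeUnit;
    import com.codahale.metrics.ConsoleReporter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    public class RpcMetrics {
        public static void main(String[] args) throws Exception {
            MetricRegistry registry = new MetricRegistry();
            Timer rpcTimer = registry.timer("rpc.latency");

            // print counts, rates, and latency percentiles to the console every 10 seconds
            ConsoleReporter.forRegistry(registry)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build()
                    .start(10, TimeUnit.SECONDS);

            for (int i = 0; i < 10_000; i++) {
                try (Timer.Context ignored = rpcTimer.time()) {
                    Thread.sleep(1); // placeholder for the client stub call being measured
                }
            }
        }
    }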

How does a RESTful WS work with multiple clients at the same time?

I am new to the world of RESTful web services and have a question about how they work.
Context:
I am developing a RESTful WS that will be under high load; at any given time there can be, say, up to 10 clients sending multiple requests. All the requests are sent to port 80.
I am developing the WS with Jersey (Java) and deploying it on a Tomcat web server.
Question:
Let's say 5 clients send requests at the same time, each sending 2 requests to port 80. Will they be handled in FIFO order? Can we have some sort of multithreading if, say, we don't care about the order?
It all depends on which server you use and how it is configured. The standard configuration (you have to work hard to make it non-standard) is to have multiple threads. In other words, the server usually creates or reuses a thread for each new request, and it is almost certain that requests will be processed in parallel.
You can actually see this inside your running code by using java.lang.Thread.currentThread(): print the name of the current thread along with the REST request and you will see.
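For example, a tiny Jersey/JAX-RS resource sketch (the path is illustrative) that echoes the worker thread name, so you can watch concurrent requests land on different threads:

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    @Path("/whoami")
    public class ThreadInfoResource {

        @GET
        @Produces(MediaType.TEXT_PLAIN)
        public String threadName() {
            // each concurrent request is typically served by a different pool thread
            return "Handled by: " + Thread.currentThread().getName();
        }
    }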
To answer your question: a thread is fetched from the thread pool to serve every request you send. The server does not care about the order; the request that arrives first is served first.
More about the servers:
I suggest you put Nginx or Apache in front as a reverse proxy for better performance; a thread is fetched from the thread pool to serve each request. To improve performance you can increase the thread pool size, but too many threads will in turn reduce performance, because the frequency of context switching between threads increases. You don't want a very large thread pool.
If you are using Apache + Tomcat you basically have the same situation as with Tomcat alone, but Apache is better suited than Tomcat to act as the front-end web server. In practice, companies use Apache as a reverse proxy that dispatches requests to Tomcat.
Apache and Tomcat are thread-based servers, and their performance degrades when there are too many requests. If you have to handle a lot of requests, you can use Nginx.
Nginx is an event-based server: it uses a queue to store requests and dispatches them in FIFO order. It can handle many requests with far fewer threads, so its performance stays more stable even with a larger number of requests. However, with an extremely large number of requests Nginx will also be overwhelmed, as its event loop has no room for extra requests.
Companies deal with that situation using distributed-system concepts, for example load balancers, but for your question that is going too far. Look at articles comparing Nginx and Apache to get a better idea of both.

Service using Jest is blocking on a thread pool, why?

I have a Java + Spring app that queries Elasticsearch using the Jest client (a poor choice, because it is poorly documented). Elasticsearch has response times of about 8-20 ms with 150 concurrent connections, but my app goes up to 900-1500 ms. A quick look at VisualVM tells me that processor usage is below 10%, and profiling tells me that 98% of the time the app is simply waiting on the following method:
org.apache.http.pool.PoolEntryFuture.await()
which is part of Apache HttpCore, a dependency of Jest. I am not limited by the number of threads that can run on Tomcat (the maximum is 200, and VisualVM says the maximum number of threads during the experiment was 174), so it is not waiting for free threads.
I think the latency increase is excessive, and I suspect that Jest is using an internal pool that does not have enough capacity to cope with all the requests, but I don't know.
Thoughts?
Regarding the suspicion that Jest is using an internal pool that does not have enough capacity to cope with all the requests:
Poking around the source quickly, I see that you should be able to inject a ClientConfig into the Jest client factory.
ClientConfig has the following setters, which appear to affect the internal Apache HttpClient connection manager:
clientConfig.maxTotalConnection(...);
clientConfig.defaultMaxTotalConnectionPerRoute(...);
clientConfig.maxTotalConnectionPerRoute(...);
Maybe tweaking some of those will give you more connections? Take a look at the JestClientFactory source to see what it is doing. We have definitely had to tweak those values in the past when making a large number of connections to the same server with HttpClient.
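A sketch of how that wiring might look, assuming the HttpClientConfig builder used by recent Jest versions (the method names mirror the setters quoted above but may differ between versions; the URL and limits are illustrative):

    import io.searchbox.client.JestClient;
    import io.searchbox.client.JestClientFactory;
    import io.searchbox.client.config.HttpClientConfig;

    public class JestClientSetup {
        public static JestClient build() {
            JestClientFactory factory = new JestClientFactory();
            factory.setHttpClientConfig(new HttpClientConfig.Builder("http://localhost:9200")
                    .multiThreaded(true)
                    .maxTotalConnection(50)                 // overall pool size
                    .defaultMaxTotalConnectionPerRoute(40)  // per-route limit (per ES host)
                    .build());
            return factory.getObject();
        }
    }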
I would test this with just one connection and see what the average response time is. With just one client thread you should have more than enough threads and other resources. Most likely the process is waiting on an external resource, such as a database or a network service.

JBoss unable to handle more than 3000 requests

I created a web service, both client and server, and wanted to do performance testing. I used JMeter with a sample test plan to drive it. Up to 3000 requests, JBoss processed them all, but beyond 3000 some requests were not processed (in the sense of "Can't open connection: Connection refused"). Where do I have to make changes to handle more than 10000 requests at the same time? Is it a JBoss issue or a system throughput limit?
JMeter config: 300 threads, 1 s ramp-up and 10 loops.
System (server config): Windows 7, 4 GB RAM
Where do I have to make changes to handle more than 10000 requests at the same time?
Ten thousand concurrent requests in Tomcat (which I believe is used inside JBoss) is quite a lot. In the typical setup (with the blocking I/O connector) you need one thread per HTTP connection, which is way too much for an ordinary JVM. On a 64-bit server machine one thread needs about 1 MiB of stack (check the -Xss parameter), and you only have 4 GiB.
Moreover, the number of context switches will kill your performance. You would need hundreds of cores to handle all these connections effectively. And if your requests are I/O- or database-bound, you will see a bottleneck elsewhere.
That being said, you need a different approach: either try non-blocking I/O or asynchronous servlets (available since Servlet 3.0; a sketch follows the list below), or scale out. By default Tomcat handles 100-200 concurrent connections (a reasonable default) and queues a similar number; everything above that is rejected, which is probably what you are experiencing.
See also
Advanced IO and Tomcat
Asynchronous Support in Servlet 3.0
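A sketch of the Servlet 3.0 asynchronous approach mentioned above (the servlet path and timing are illustrative):

    import javax.servlet.AsyncContext;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    @WebServlet(urlPatterns = "/slow", asyncSupported = true)
    public class AsyncDemoServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
            AsyncContext ctx = req.startAsync();    // detach from the container request thread
            ctx.start(() -> {                       // runs on a container-managed async thread
                try {
                    Thread.sleep(200);              // stand-in for slow I/O or a DB call
                    ctx.getResponse().getWriter().write("done");
                } catch (Exception e) {
                    // real code would log this
                } finally {
                    ctx.complete();                 // finish the response, release resources
                }
            });
        }
    }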
There are two common problems I can think of.
First, if you run JBoss on Linux as a normal user, you can run into 'Too many open files' if you did not edit the limits.conf file. See https://community.jboss.org/thread/155699. Each open socket counts as an 'open file' on Linux, so the OS could be refusing your connections because of this.
Second, the maximum thread pool size for incoming connections is 200 by default. This limits the number of concurrent requests, i.e. requests that are in progress at the same time. If JMeter is running 300 threads, the JBoss connector thread pool should be larger. In JBoss 6 you can find this in jboss-web.sar/server.xml; look for 'maxThreads' on the HTTP connector element: http://docs.jboss.org/jbossweb/latest/config/http.html.
200 is the recommended maximum for a single-core CPU. Above that, context switches start to add too much overhead, as Tomasz says. So for production use, only increase it to about 400 on a dual core, 800 on a quad core, etc.
