How to handle many slow connections - Java

The throughput of our web application seems to be limited by slow connections. In load tests, we easily achieve about 5000 requests/s. But in practice, we max out at about 1000 requests/s. The server isn't really under serious load, neither IO nor CPU wise. The same applies to the database. The main difference seems to be that most worker threads are slowed down by clients that cannot accept the response fast enough (often responses are several MB in size).
We hardly have any static resources; the problem is about dynamically generated content. It's implemented with the Spring Framework, but I think it wouldn't be different for any other servlet-based implementation.
So what are our options for improving throughput? Is there some sort of caching available that would quickly absorb the response, free up the worker threads and then asynchronously deliver it to the client at their speed?
We'd rather not increase the number of processing threads as they keep a database connection open for most of their processing. We're really looking for a solution where a small number of worker threads can work at full speed.

I would suggest using standard techniques such as gzip compression for responses.
The second is to use asynchronous processing in Spring MVC. See Making a Controller Method Asynchronous to learn more about this.
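As a minimal sketch of that idea (the controller path, service class and report method are placeholders, not from the original answer): returning a Callable makes Spring MVC run the body on a separate task-executor thread and release the servlet container's request thread in the meantime.

```java
import java.util.concurrent.Callable;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ReportController {

    // Hypothetical service standing in for the expensive, DB-backed generation step.
    private final ReportService reportService = new ReportService();

    // Returning a Callable tells Spring MVC to execute the body on a
    // task-executor thread and free the container thread while it runs.
    @GetMapping("/report")
    public Callable<String> report() {
        return () -> reportService.generateLargeReport();
    }

    static class ReportService {
        String generateLargeReport() {
            return "...several MB of generated content...";
        }
    }
}
```

A StreamingResponseBody return type works similarly when the large body is produced incrementally rather than built up front.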

Related

Non-blocking vs blocking Java server with JDBC calls

Our gRPC service needs to handle 1000 QPS, and each request requires a sequence of operations, including one that reads data from the DB using JDBC. Handling a single request takes at most 50 ms.
Our application can be written in two ways:
Option 1 - Classic one blocking thread per request: we can create a large thread pool (~200) and simply assign one thread per request and have that thread block while it waits for the DB.
Option 2 - Having each request handled in a truly non-blocking fashion: this would require a non-blocking MySQL client, which I don't know exists, but for now let's assume it does.
My understanding is that the non-blocking approach has these pros and cons:
Pro: It allows us to reduce the number of threads required, and as such reduce the memory footprint.
Pro: It saves some overhead on the OS, since it doesn't need to give CPU time to threads waiting for IO.
Con: For a large application (where each task subscribes a callback to the previous task), it requires splitting a single request across multiple threads, creating a different kind of overhead. And potentially, if the same request gets executed on multiple physical cores, it adds overhead because the data might not be available in the L1/L2 core caches.
Question 1: Even though non-blocking applications seem to be the new cool thing, my understanding is that for an application that isn't memory bound and where creating more threads isn't a problem, it's not clear that writing a non-blocking application is actually more CPU efficient than writing a blocking one. Is there any reason to believe otherwise?
Question 2: My understanding is also that if we use JDBC, the connection is actually blocking, and even if we make the rest of our application non-blocking, we lose all the benefit because of the JDBC client; in that case, Option 1 is most likely better?
For question 1, you are correct -- non-blocking is not inherently better (and with the arrival of Virtual Threads, it's about to become a lot worse in comparison to good old thread-per-request). At best, you could look at the tools you are working with and do some performance testing with a small scale example. But frankly, that is down to the tool, not the strategy (at least, until Virtual Threads get here).
For question 2, I would strongly encourage you to choose the solution that works best with your tool/framework. Staying within your ecosystem will allow you to make more flexible moves when the time comes to optimize.
But all things equal, I would strongly encourage you to stick with thread-per-request, since you are working with Java. Ignoring Virtual Threads, thread-per-request allows you to work with and manage simple, blocking, synchronous code. You don't have to deal with callbacks or tracing the logic through confusing and piecemeal logs. Simply make a thread per request, let it block where it does, and then let your scheduler handle which thread should have the CPU core at any given time.
Pro: It saves some overhead on the OS, since it doesn't need to give CPU time to threads waiting for IO.
It’s not just the CPU time for waiting threads, but also the overhead of switching between threads competing for the CPU. As you have more threads, more of them will be in a runnable state, and the CPU time must be spread between them; each context switch has to save and restore thread state, which adds its own cost.
Con: For a large application (where each task subscribes a callback to the previous task), it requires splitting a single request across multiple threads, creating a different kind of overhead. And potentially, if the same request gets executed on multiple physical cores, it adds overhead because the data might not be available in the L1/L2 core caches.
This also happens with the “classic” approach since blocking calls will cause the CPU to switch to a different thread, and, as stated before, the CPU will even have to switch between runnable threads to share the CPU time as their number increases.
Question 1: […] for an application that isn't memory bound and where creating more threads isn't a problem
In the current state of Java, creating more threads is always going to be a problem at some point. With the thread-per-request model, it depends how many requests you have in parallel. 1000, probably ok, 10000… maybe not.
it's not clear that writing a non-blocking application is actually more CPU efficient than writing a blocking one. Is there any reason to believe otherwise?
It is not just a question of efficiency, but also scalability. For the performance itself, this would require proper load testing. You may also want to check Is non-blocking I/O really faster than multi-threaded blocking I/O? How?
Question 2: My understanding is also that if we use JDBC, the connection is actually blocking, and even if we make the rest of our application non-blocking, we lose all the benefit because of the JDBC client; in that case, Option 1 is most likely better?
JDBC is indeed a synchronous API. Oracle was working on ADBA as an asynchronous equivalent, but they discontinued it, considering that Project Loom will make it irrelevant. R2DBC provides an alternative which supports MySQL. Spring even supports reactive transactions.
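As a rough illustration of what the reactive alternative looks like, here is a small sketch using Spring's DatabaseClient over R2DBC; the repository name, table and column are made up for the example.

```java
import org.springframework.r2dbc.core.DatabaseClient;
import reactor.core.publisher.Flux;

public class UserRepository {

    private final DatabaseClient client;

    public UserRepository(DatabaseClient client) {
        this.client = client;
    }

    // The query does not block a thread; rows arrive as a reactive stream.
    public Flux<String> findUserNames() {
        return client.sql("SELECT name FROM users")
                     .map((row, metadata) -> row.get("name", String.class))
                     .all();
    }
}
```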

Optimise a Spring Boot backend for high CPU/memory workloads

I am running a Spring Boot application that acts as a backend for a frontend Javascript application. The frontend is served to the client as a static resource and the backend serves API requests coming from it. The application is initially designed to run on-premise but should be built in a way that allows easy porting to a cloud-native solution.
I expect the backend to do some heavy lifting ETL work which will be heavy on the memory and CPU side. At the same time, it won't need to scale to serve many concurrent requests - it only really needs to serve requests that kick-off and manage the jobs, which will be invoked by a single user who's interfacing with it.
What are some parameters that I could tweak to fine-tune for this type of deployment?
Current thinking:
Reduce server.tomcat.max-threads to a single digit to minimize the footprint of the request thread pool, as I don't expect to handle more than one or two requests concurrently
Do the same for the database connection pool
Fine-tune Xms and Xmx when launching the JAR
I would appreciate any other insights on how to make sure the Java application can use as much of the system's resources as it needs, as well as Spring Boot-specific parameters that I could tweak. Thank you.
If you have long-running background tasks, I would offload the work to a thread pool and set the maximum number of threads to the number of CPUs in your system. Also set a maximum capacity on the executor's queue so you don't overload it with too much pending work.
Offloading to a different thread will make sure that the threads of the container remain available and you don't end up with a completely unresponsive system.
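A minimal sketch of such an executor; the queue capacity of 50 and the keep-alive time are arbitrary example values.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class EtlExecutorConfig {

    public static ThreadPoolExecutor etlExecutor() {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Core and max pool sizes both equal the CPU count, as suggested above.
        // The bounded queue caps the amount of pending work; submissions beyond
        // that are rejected instead of piling up without limit.
        return new ThreadPoolExecutor(
                cpus, cpus,
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(50),
                new ThreadPoolExecutor.AbortPolicy());
    }
}
```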
Your suggestions for the maximum heap size and connection pool are valid.
Generally, the optimization techniques would be similar to those for any Java-based application, and since yours is an ETL kind of processing, throughput is more important than individual request latency.
If the task is typical ETL, then generally the transformation is more CPU intensive, while the extract and load phases are IO intensive. So my first suggestion would be to profile your application to understand the typical CPU/memory usage pattern. If it spends more time in IO, you can live with many more processing threads than CPU cores, so you can tune that. Also remember that if the memory pressure is too high, considerable CPU is spent by the garbage collector as well, so the maximum heap setting is important too. You can enable GC logging in your application and use a tool like this to analyze GC patterns easily. You can experiment with the ParallelGC algorithm, which is better suited for high throughput at the cost of slightly higher pause times.
Typically, if you have large data to interact with in the extract and load phases, you tend to use streaming/buffering techniques to reduce your memory footprint. An example is using cursors when dealing with large result sets from the database. Similarly, if the load phase involves interaction with large files, you can use parallel processing. But all of these would need experimentation to validate.
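A sketch of the cursor/streaming idea for the extract phase; the SQL, column name and fetch size are placeholders, and exact streaming behaviour depends on the JDBC driver (MySQL, for instance, has its own streaming convention).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class StreamingExtract {

    public void extract(DataSource dataSource) throws Exception {
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement("SELECT payload FROM source_table")) {
            // Ask the driver to fetch rows in batches instead of loading the
            // whole result set into memory at once.
            ps.setFetchSize(1_000);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    transformAndLoad(rs.getString("payload"));
                }
            }
        }
    }

    private void transformAndLoad(String row) {
        // transformation + load step for a single row would go here
    }
}
```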
On a side note, there is another project in the Spring ecosystem - Spring Batch - which uses many such optimization techniques suited to batch processing for ETL types of jobs. You can read about some of the techniques here.

Reactive Webflux for toggle server - is it beneficial?

We have a need to implement a simple toggle server (rest application) that would take the toggle name and return if it is enabled or disabled. We are expecting a load of 10s of thousands of requests per day.
Does Spring (reactive) WebFlux make sense here?
My understanding is that reactive REST APIs are useful when there is a possibility of idle time on the HTTP thread - meaning the thread is waiting for some job to be done and can't proceed until it receives a response, such as from DB reads or REST calls to other services.
Our use case is just to return the toggle value (probably from some cache) that is being queried. Will a reactive REST service be useful in our case? Does it provide any advantages over a simple Spring Boot application?
I'm coming from a background of "traditional" spring/spring-mvc apps development experience, and these days I'm also starting to learn spring webflux and based on the data provided in the question here are my observations (disclaimer: since I'm a beginner in this area as I said, take this answer with a grain of salt):
WebFlux is less "straight forward" to implement compared to the traditional application: the maintenance cost is higher, the debugging is harder, etc.
WebFlux will shine if your operations are I/O bound. If you're going to read the data from an in-memory cache, that is not an I/O-bound operation. I also understand that the nature of "toggle" data is that it doesn't change much but gets read frequently, so keeping it in some in-memory cache indeed makes sense here - unless you build something huge that won't fit in memory at all, but that is a different story.
WebFlux + Netty will let you serve thousands of requests simultaneously. Tomcat, having a traditional "thread per request" model, still allows 200 threads plus 100 requests in the queue by default; if you exceed these values it will fail, whereas Netty will "survive". Based on the data presented in the question, I don't see that you'll benefit from Netty here.
Tens of thousands of requests per day seems like something any kind of server can handle easily - Tomcat, Jetty, whatever - you don't need "high-load" machinery here.
As I've mentioned in item 3, WebFlux is good at simultaneous request handling, but you probably won't gain any performance improvement over the traditional approach; it's not about speed, it's about better resource utilization.
If you're going to read the data from the database and you do want to go with WebFlux, make sure you have reactive drivers for your database - when you run the flow, you should be "reactive" all the way; blocking on DB access doesn't make sense.
So, bottom line: if I were you, I would start with a regular server and consider moving to a reactive stack later (probably this "later" will never come, as long as the expectations specified in the question don't change).
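To make the "regular server" option concrete, a plain Spring MVC toggle endpoint backed by an in-memory map could look roughly like this; the path, toggle names and the "unknown means disabled" default are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ToggleController {

    // Example toggle values kept in memory; in practice they would be loaded
    // from configuration or a backing store.
    private final Map<String, Boolean> toggles = new ConcurrentHashMap<>(
            Map.of("new-checkout", true, "beta-search", false));

    @GetMapping("/toggles/{name}")
    public boolean isEnabled(@PathVariable String name) {
        // Unknown toggles are treated as disabled in this sketch.
        return toggles.getOrDefault(name, false);
    }
}
```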
Indeed, it aims to minimize thread idling and get more performance by using fewer threads than a traditional multithreading approach, where a thread per request is used - or, in reality, a pool of worker threads to prevent too many threads from being created.
If you're only getting tens of thousands of requests per day, and your use case is as simple as that, it doesn't sound like you need to plan anything special for it. A regular webapp will perform just fine.

Java server - multiple clients handling - is using threads optimal?

I am developing a server-client communication based system and I am trying to determine the optimal way to handle multiple clients. What is important: I really don't want to use any third-party libraries.
In many places on the Internet I have seen this solved by creating a separate thread for each connection, but I don't think that is the best way if I assume there will be a huge number of connections (maybe I'm wrong). So, the solution I'm thinking of is:
Creating a queue of events and having them handled by workers - a defined pool of threads (with a constant number n of workers). This solution seems like it could be slow, but I cannot imagine how big the difference would be when handling a huge number of clients.
I've also been thinking about load balancing by running multiple instances of the server (on different physical machines), but that is only a nice add-on to any solution, not the solution itself.
I am aware that Java is not really async-friendly, but maybe I lack some knowledge and there is a nice solution. I'll be grateful for any suggestions.
Additional info:
I assume a really big number of connections
Every connection will last for a long time (days, maybe weeks)
Program will need to send some data to specified client quite frequently
Each client will send data to the server about once every 3 seconds
To avoid discussion (as SO is not the place for it), the options are:
One client - one thread
Many clients - constant number of threads and events pool
Any async-like solution, that I'm not aware of
Anything else?
I'd suggest starting off with the simple architecture of one thread per connection. Modern JVMs on sufficiently sized systems can support thousands of threads. You might be pleasantly surprised at how well even this simple scheme works. If you need 300k connections, though, I doubt that one thread per connection will work. (But I've been wrong before.) You might have to fiddle with the thread stack size and OS resource limits.
A queueing system will help decouple the connections from the threads handling the work, but it will add to the amount of work done per message received from each client. This will also add to latency (but I'm not sure how important that is). If you have 300k connections, you'll probably want to have a pool of threads reading from the connections, and you'll also want to have more than one queue through which the work flows. If you have 300k clients sending data once every 3 seconds, that's 100k ops/sec, which is a lot of data to shove through a single queue. That might turn into a bottleneck.
Another approach probably worth investigating is to have a pool of worker threads, but instead of each worker reading data from a queue written by connection reader threads, have each worker handle a bunch of sockets directly. Use a NIO Selector to have each thread wait on multiple sockets. Say, have 100 threads each handling 3,000 sockets. Or perhaps have 1,000 threads each handling 300 sockets. This depends on the amount of data and the amount of work necessary to process each incoming message. You'll have to experiment. This will probably be considerably simpler than using asynchronous I/O.
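A rough sketch of one such worker, assuming each worker owns its own Selector and its share of the sockets; an acceptor thread would hand connections to it, and error handling and message processing are heavily simplified.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// One worker owning a Selector and a subset of the sockets. In the scheme
// described above you would start, say, 100 of these and spread accepted
// connections across them.
public class SelectorWorker implements Runnable {

    private final Selector selector;

    public SelectorWorker() throws IOException {
        this.selector = Selector.open();
    }

    // Called by an acceptor thread to hand a new connection to this worker.
    // (A production version would queue the registration and perform it on the
    // worker thread after wakeup(), to avoid contending with select().)
    public void adopt(SocketChannel channel) throws IOException {
        channel.configureBlocking(false);
        selector.wakeup();
        channel.register(selector, SelectionKey.OP_READ);
    }

    @Override
    public void run() {
        ByteBuffer buffer = ByteBuffer.allocate(4096);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                selector.select();                      // wait until some socket is readable
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    SocketChannel channel = (SocketChannel) key.channel();
                    buffer.clear();
                    int read = channel.read(buffer);
                    if (read == -1) {                   // client closed the connection
                        key.cancel();
                        channel.close();
                    } else if (read > 0) {
                        buffer.flip();
                        handleMessage(channel, buffer); // application-specific processing
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void handleMessage(SocketChannel from, ByteBuffer data) {
        // decode and process one incoming message here
    }
}
```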
Java 7 has true asynchronous IO under the NIO package, I've heard. I don't know much about it other than that it's difficult to work with.
Basic IO in Java is blocking. This means using a fixed number of threads to support many clients is likely not possible with basic IO, as you could have all threads tied up in blocking calls reading from clients who aren't sending data.
I suggest you look into asynchronous IO with Grizzly/Netty, if you change your mind about third-party libraries.
If you haven't changed your mind, look into NIO yourself.

Java thread per connection model vs NIO

Is the non-blocking Java NIO still slower than your standard thread per connection asynchronous socket?
In addition, if you were to use threads per connection, would you just create new threads or would you use a very large thread pool?
I'm writing an MMORPG server in Java that should be able to scale to 10,000 clients easily given powerful enough hardware, although the maximum number of clients is 24,000 (which I believe is impossible to reach with the thread-per-connection model because of a 15,000-thread limit in Java).
From a three-year-old article, I've heard that blocking IO with a thread-per-connection model was still 25% faster than NIO (namely, this document: http://www.mailinator.com/tymaPaulMultithreaded.pdf), but is the same still achievable today? Java has changed a lot since then, and I've heard the results were questionable when compared to real-life scenarios because the VM used was not Sun Java.
Also, because it is an MMORPG server with many concurrent users interacting with each other, will the use of synchronization and thread-safety practices decrease performance to the point where a single-threaded NIO selector serving 10,000 clients will be faster? (All the work doesn't necessarily have to be processed on the thread with the selector; it can be processed on worker threads, like how MINA/Netty works.)
Thanks!
NIO benefits should be taken with a grain of salt.
In an HTTP server, most connections are keep-alive connections; they are idle most of the time. It would be a waste of resources to pre-allocate a thread for each.
For an MMORPG, things are very different. I guess connections are constantly busy receiving instructions from users and sending the latest system state back to them. A thread is needed most of the time for a connection.
If you use NIO, you'll have to constantly re-assign threads to connections. It may be an inferior solution compared to the simple fixed thread-per-connection solution.
The default thread stack size is pretty large (1/4 MB?); it's the major reason why there can only be a limited number of threads. Try reducing it and see if your system can support more.
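For example (a small sketch, not tied to any framework): the four-argument Thread constructor accepts a stack-size hint, and the default can also be lowered globally with the -Xss startup flag; the 256 KB figure here is just an illustrative value.

```java
public class SmallStackThreads {
    public static void main(String[] args) {
        Runnable handler = () -> {
            // read from / write to one connection here
        };
        // The last argument is a stack-size hint in bytes; the JVM may round
        // it or ignore it entirely on some platforms.
        Thread connectionThread = new Thread(null, handler, "conn-1", 256 * 1024);
        connectionThread.start();
    }
}
```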
However, if your game is indeed very "busy", it's your CPU that you need to worry about the most. NIO or not, it's really hard to handle thousands of hyperactive gamers on one machine.
There are actually 3 solutions:
Multiple threads
One thread and NIO
Both solutions 1 and 2 at the same time
The best thing to do for performance is to have a small, limited number of threads and multiplex network events onto these threads with NIO as new messages come in over the network.
Using NIO with one thread is a bad idea for a few reasons:
If you have multiple CPUs or cores, you will be idling resources since you can only use one core at a time if you only have one thread.
If you have to block for some reason (maybe to do a disk access), your CPU is idle when it could be handling another connection while you wait for the disk.
One thread per connection is a bad idea because it doesn't scale. Let's say you have:
10 000 connections
2 CPUs with 2 cores each
only 100 threads will be blocked at any given time
Then you can work out that you only need 104 threads. Any more and you're wasting resources managing extra threads that you don't need. There is a lot of bookkeeping under the hood needed to manage 10 000 threads. This will slow you down.
This is why you combine the two solutions. Also, make sure your VM is using the fastest system calls. Every OS has its own unique system calls for high performance network IO. Make sure your VM is using the latest and greatest. I believe this is epoll() in Linux.
In addition, if you were to use threads per connection, would you just create new threads or would you use a very large thread pool?
It depends how much time you want to spend optimizing. The quickest solution is to create resources like threads and strings when needed, then let the garbage collector reclaim them when you're done with them. You can get a performance boost by having a pool of resources: instead of creating a new object, you ask the pool for one and return it to the pool when you're done. This adds the complexity of concurrency control. It can be further optimized with advanced concurrency techniques like non-blocking algorithms; new versions of the Java API have a few of these for you. You could spend the rest of your life doing these optimizations on just one program. What the best solution is for your specific application is probably a question that deserves its own post.
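As a tiny sketch of the pooling idea (the pooled type, capacity and factory are arbitrary): borrow an instance if one is free, create one otherwise, and hand it back when done.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

public class SimplePool<T> {

    private final ArrayBlockingQueue<T> pool;
    private final Supplier<T> factory;

    public SimplePool(int capacity, Supplier<T> factory) {
        this.pool = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
    }

    public T borrow() {
        T item = pool.poll();                 // non-blocking: null if the pool is empty
        return item != null ? item : factory.get();
    }

    public void release(T item) {
        pool.offer(item);                     // silently dropped if the pool is already full
    }
}
```

For instance, `new SimplePool<>(100, () -> java.nio.ByteBuffer.allocateDirect(4096))` would give you a pool of reusable buffers.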
If you are willing to spend any amount of money on powerful enough hardware, why limit yourself to one server? Google doesn't use one server; they don't even use one datacenter of servers.
A common misconception is that NIO allows non-blocking IO and is therefore the only model worth benchmarking. If you benchmark blocking NIO, you can get it 30% faster than old IO - i.e. if you use the same threading model and compare just the IO models.
For a sophisticated game, you are far more likely to run out of CPU before you hit 10K connections. Again it is simpler to have a solution which scales horizontally. Then you don't need to worry about how many connections you can get.
How many users can reasonably interact? 24? In which case you have 1,000 independent groups interacting. You won't have that many cores in one server.
How much money per user are you intending to spend on server(s)? You can buy a 12-core server with 64 GB of memory for less than £5000. If you place 2,500 users on this server, you have spent £2 per user.
EDIT: I have a reference http://vanillajava.blogspot.com/2010/07/java-nio-is-faster-than-java-io-for.html which is mine. ;) I had this reviewed by someone who is a GURU of Java Networking and it broadly agreed with what he had found.
If you have busy connections, meaning they constantly send you data and you send data back, you may use non-blocking IO in conjunction with Akka.
Akka is an open-source toolkit and runtime simplifying the construction of concurrent and distributed applications on the JVM. Akka supports multiple programming models for concurrency, but it emphasizes actor-based concurrency, with inspiration drawn from Erlang. Language bindings exist for both Java and Scala.
Akka's logic is non-blocking, so it's perfect for asynchronous programming. Using Akka actors you may remove thread overhead.
But if your socket streams block more often, I suggest using blocking IO in conjunction with Quasar.
Quasar is an open-source library for simple, lightweight JVM concurrency, which implements true lightweight threads (AKA fibers) on the JVM. Quasar fibers behave just like plain Java threads, except they have virtually no memory and task-switching overhead, so that you can easily spawn hundreds of thousands of fibers – or even millions – in a single JVM. Quasar also provides channels for inter-fiber communications modeled after those offered by the Go language, complete with channel selectors. It also contains a full implementation of the actor model, closely modeled after Erlang.
Quasar's logic is blocking, so you may spawn, say, 24,000 fibers waiting on different connections. One of the positive points about Quasar is that fibers can interact with plain threads very easily. Also, Quasar has integrations with popular libraries, such as the Apache HTTP client, JDBC or Jersey, so you can get the benefits of fibers in many aspects of your project.
You may see a good comparison between these two frameworks here.
As most of you are saying that the server is bound to be CPU-limited before 10k concurrent users are reached, I suppose it is better for me to use a threaded blocking (N)IO approach, considering that for this particular MMORPG, getting several packets per second from each player is not uncommon and might bog down a selector if one were used.
Peter raised an interesting point that blocking NIO is faster than the old libraries while irreputable mentioned that for a busy MMORPG server, it would be better to use threads because of how many instructions are received per player. I wouldn't count on too many players going idle on this game, so it shouldn't be a problem for me to have a bunch of non-running threads. I've come to realize that synchronization is still required even when using a framework based on NIO because they use several worker threads running at the same time to process packets received from clients. Context switching may prove to be expensive, but I'll give this solution a try. It's relatively easy to refactor my code so that I could use a NIO framework if I find there is a bottleneck.
I believe my question has been answered. I'll just wait a little bit more in order to receive even more insight from more people. Thank you for all your answers!
EDIT: I've finally chosen my course of action. I actually was indecisive and decided to use JBoss Netty and allow the user to switch between either oio or nio using the classes
org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;
org.jboss.netty.channel.socket.oio.OioServerSocketChannelFactory;
Quite nice that Netty supports both!
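For reference, a rough sketch of that switch with the Netty 3.x API; the port, flag handling and pipeline contents are placeholders rather than the asker's actual code.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.Executors;
import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.channel.socket.ServerSocketChannelFactory;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;
import org.jboss.netty.channel.socket.oio.OioServerSocketChannelFactory;

public class GameServer {

    public static void main(String[] args) {
        boolean useNio = args.length > 0 && args[0].equals("nio");

        // The same bootstrap code works with either channel factory,
        // so the oio/nio choice can be a simple runtime flag.
        ServerSocketChannelFactory factory = useNio
                ? new NioServerSocketChannelFactory(
                        Executors.newCachedThreadPool(), Executors.newCachedThreadPool())
                : new OioServerSocketChannelFactory(
                        Executors.newCachedThreadPool(), Executors.newCachedThreadPool());

        ServerBootstrap bootstrap = new ServerBootstrap(factory);
        bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
            public ChannelPipeline getPipeline() {
                return Channels.pipeline(/* add your decoder/encoder/handler here */);
            }
        });
        bootstrap.bind(new InetSocketAddress(8080));
    }
}
```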
You might get some inspiration from the former Sun-sponsored project, now named Red Dwarf.
The old website at http://www.reddwarfserver.org/ is down.
Github to the rescue: https://github.com/reddwarf-nextgen/reddwarf
If you do client-side network calls, most likely you just need plain socket IO.
If you are creating server-side technologies, then NIO would help you separate the network IO part from the fulfillment/processing work.
IO threads are configured as 1 or 2 for network IO. Worker threads handle the actual processing (ranging from 1 to N, based on machine capabilities).
