Trying to understand the best possible Java ThreadPoolTaskExecutor configuration I can define when moving to OkHttpClient, latency-wise. Currently our definition is the following:
<property name="corePoolSize" value="#{ T(java.lang.Math).max(32,numCpu) * 2 }" />
<property name="maxPoolSize" value="#{ T(java.lang.Math).max(32,numCpu) * 8 }" />
<property name="queueCapacity" value="200"/>
That is, the maximal queue capacity (at which a new thread will be created) is 200, the minimal thread count is max(32, numCpu) * 2 and the maximal thread count is max(32, numCpu) * 8. In our case numCpu can vary from 16 to 24 (though if hyper-threading is taken into account, that number should be doubled, right?).
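For reference, a rough Java-config equivalent of that XML (just a sketch; here I read numCpu from Runtime.getRuntime().availableProcessors(), whereas in our real setup it is injected as a property):

import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class HttpClientExecutorConfig {
    public ThreadPoolTaskExecutor httpClientExecutor() {
        int numCpu = Runtime.getRuntime().availableProcessors(); // injected property in our real config
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(Math.max(32, numCpu) * 2);  // threads kept alive even when idle
        executor.setMaxPoolSize(Math.max(32, numCpu) * 8);   // only reached once the queue is full
        executor.setQueueCapacity(200);                       // tasks queued before extra threads are created
        executor.initialize();
        return executor;
    }
}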
But when you think about it, I am not sure that the number of threads here should be connected to the CPU count at all. These are the sending/receiving threads of an HTTP client, not business-logic threads. So perhaps CPU count shouldn't even be a factor here.
Any opinions/advice?
It sounds to me like your thread pool is being used to concurrently make lots of HTTP connections, meaning your performance is limited not by CPU usage but by I/O (and potentially memory too). The "optimal" number of threads is going to be limited by a number of other factors ...
1. Link speed between your client and endpoints.
Let's say your client is connected to a 1Gbps link but somewhere down the line, all of your endpoints can only serve you data at 1Mbps. To max out your local bandwidth, you would need to run 1000 connections concurrently to fully utilize your 1Gbps link, meaning your thread pool needs to run 1000 threads. But this could be problematic as well because of another issue ...
2. Memory usage per thread is non-zero, even if the threads aren't doing anything intensive.
The default amount of stack space allocated to a Java thread varies by vendor, but it's on the order of 1 MB. This doesn't sound like a whole lot, but if you need to run thousands of threads to keep that many client connections active at a time, you will need to allocate gigabytes of RAM for the stack space alone. You can adjust the stack space allocated per thread using the -Xss[size] VM argument, but this is global to the VM, so shrinking the stack size may cause problems in other areas of your program, depending on what you are doing.
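If shrinking the global stack size is too risky, one hedge is to request a smaller stack only for the HTTP worker threads via the four-argument Thread constructor. A minimal sketch (the 256 KB figure is an assumption; note that the Javadoc says the stackSize value is a hint that some VMs are free to ignore):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class SmallStackThreadFactory implements ThreadFactory {
    private static final long STACK_SIZE_BYTES = 256 * 1024; // assumption: enough for I/O-only threads
    private final AtomicInteger counter = new AtomicInteger();

    @Override
    public Thread newThread(Runnable r) {
        // null thread group, explicit stack size; the VM may ignore the hint
        return new Thread(null, r, "http-io-" + counter.incrementAndGet(), STACK_SIZE_BYTES);
    }

    public static ExecutorService newPool(int threads) {
        return Executors.newFixedThreadPool(threads, new SmallStackThreadFactory());
    }
}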
3. Average HTTP request size.
Sometimes, it's going to boil down to how much data you expect to transfer per POST/GET call. Recall that each TCP connection requires an initial handshake before any data can be sent. If the amount of data you expect to transmit over the life of an HTTP call is very small, you may not be able to keep thousands of connections running concurrently, even if you have thousands of threads at your disposal. If the amount is very large, it may only take a few concurrent connections to max out the total bandwidth available to your client.
Finally ...
You may not be able to predict the link speed of every connection if all of your endpoints are running out there on the web. I think the best you can do is benchmark the performance of different configurations, while considering each of these factors, and choose the configuration that seems to give the best performance in your typical operating environment. It will likely be somewhere between N and 1000, where N is the number of cores you run, but nailing that number down to something specific will take a little bit of elbow grease :)
Related
I have two different types of server and client working at the moment, and I am trying to decide which one would be better for an MMO server, or at least a small MMO-like server with at least 100 players at a time.
My first server uses a thread-per-connection model and sends objects over the socket using ObjectOutputStream.
My second server uses Java NIO to run all the connections on a single thread, using select to loop through them. This server also uses ObjectOutputStream to send data.
My question: which would be the better approach for an MMO server? And if the single-threaded model is better, how would sending an object over the socket channel be affected? Might it not be read all the way, so that I never get the full object?
Each object being sent just contains, for example, an int and 2 floats for the position and the player id.
I will relate this question to why MMOs use UDP over TCP: UDP promises fast delivery, whereas TCP promises guaranteed delivery.
A similar analogy can be applied to a single-threaded vs. a multi-threaded model. Regardless of which you choose, your overall CPU cycles remain the same, i.e. the server can process only so much information per second.
Let's see what happens in each of these scenarios.
1. Single-Threaded Model:
In this, your own implementation or the underlying library will end up creating a pipeline where the requests start queuing. If you are at min load, the queue will remain virtually empty and execution will be real-time, however a lot of CPU may be wasted. At max load, there will be a long queue-up and execution will have latency with increasing load, however delivery will be guaranteed and CPU utilization will be optimum. Typically a slow client will slow everybody else down.
2. Multi-Threaded Model:
In this, depending on how your own implementation or the underlying library implements multi-threading, parallel execution of requests will start happening. The catch with MT is that it's easy to get fooled: for example, java.util.concurrent.ThreadPoolExecutor doesn't actually create threads beyond its core pool size unless its work queue fills up, so with a large or unbounded queue you get no extra parallelism. Once parallel processing starts happening, at min load your execution will be super fast, CPU utilization will be optimum and game performance will be great. However, at max load your RAM usage will be high while CPU utilization will still be optimum. Typically you'll need to use thread interrupts to avoid a slow client hogging all the threads, which will mean glitchy performance for that slow client. Additionally, as you start exhausting your thread pool and resources, tasks will either get queued or simply dropped, leading to glitchy performance.
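To make that ThreadPoolExecutor point concrete, here is a minimal sketch with illustrative numbers: extra threads beyond the core size are created only once the bounded queue is full, whereas an unbounded queue keeps the pool at its core size forever.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class GamePoolFactory {
    // Illustrative numbers: 4 core threads, up to 64 threads, 30s keep-alive for the extras.
    // Threads beyond the core 4 are created only when the 10-slot queue is full;
    // with an unbounded queue the pool would never grow past 4 threads, i.e. no extra parallelism.
    public static ThreadPoolExecutor newGamePool() {
        return new ThreadPoolExecutor(
                4, 64, 30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(10),
                new ThreadPoolExecutor.CallerRunsPolicy()); // when even the max is saturated, run on the caller instead of dropping
    }
}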
In gaming, performance matters more than stability, so there is no question that you should use MT wherever you can; however, tuning your thread parameters to complement your server resources will decide whether it's a boon or a complete bane.
This is my first post. I am a database administrator looking to use Play Framework version 2.4. I have read the Play 2 documentation and still have a few questions, since I am very new to it. I have a messaging system that will need to handle loads of up to 50,000 blocking threads per second. If I am correct, the maximum number of threads available in Play is as follows:
parallelism-factor * availableProcessors
where the parallelism-factor is the number of threads that can be used per core? I have seen that most examples have this number at 1.0; what is wrong with going for 100 or so? I have the parallelism-factor set at 10.0 right now, and I have 150 CPU cores, so that means I have a maximum of 1,500 threads available. If that is the case and I have to process up to 50,000 blocking requests per second, the system would be very slow, right? So the only way to scale would be to add more cores, since all the requests are blocking?
'50,000 blocking requests per second' doesn't necessarily mean that you need 50,000 threads to handle them. You don't need a thread for each database call.
To do a very simple calculation: suppose each database call takes 0.1 seconds (an arbitrary number, since I have no clue how long your calls take), and each of those 50,000 requests per second leads to a single blocking database call. Then, by Little's law, your system has 50,000 * 0.1 = 5,000 database calls in flight at any moment. So if you have 10 CPUs you'd need 500 threads per CPU to handle them, or if you have 250 CPUs you'd need 20 threads per CPU. But this is only under ideal circumstances, where those requests don't do anything other than block and wait.
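A back-of-the-envelope version of that calculation (same illustrative numbers, nothing measured):

public class ThreadEstimate {
    public static void main(String[] args) {
        double requestsPerSecond = 50_000;  // offered load
        double callLatencySeconds = 0.1;    // assumed duration of each blocking DB call
        int cpus = 250;                     // or 10, etc.

        // Little's law: calls in flight = arrival rate * time per call
        double concurrentCalls = requestsPerSecond * callLatencySeconds;  // 5000
        double threadsPerCpu = concurrentCalls / cpus;                    // 20 for 250 CPUs, 500 for 10 CPUs

        System.out.printf("in flight: %.0f, threads per CPU: %.0f%n", concurrentCalls, threadsPerCpu);
    }
}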
Play uses Akka for its thread management and concurrency. The advantage is that you no longer have to deal with the hassles of concurrency in your application code.
Akka calculates the maximum number of threads as available CPUs * parallelism-factor, and you can additionally bound it with parallelism-min and parallelism-max. So if you have 250 CPUs and your parallelism-factor is 20, you have at most 5,000 threads available at once, which might or might not be enough to handle your requests.
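In other words, the effective pool size is roughly the following (a sketch of the documented formula, not Akka's actual code):

public class DispatcherSize {
    // ceil(availableProcessors * parallelism-factor), clamped between parallelism-min and parallelism-max
    static int poolSize(int cpus, double factor, int min, int max) {
        int raw = (int) Math.ceil(cpus * factor);
        return Math.min(Math.max(raw, min), max);
    }

    public static void main(String[] args) {
        System.out.println(poolSize(250, 20.0, 8, 10_000)); // prints 5000
    }
}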
So to come back to your question: it's difficult to say. It depends on the time your database calls take and on how heavily you use your CPUs for other calculations. I think there is no other way but to try it out and do some performance measuring. But in general it's better to have fewer threads, since it takes a lot of resources to create a thread. And I'd guess a parallelism-factor of 20 is a good starting point in your case for 250 CPUs.
I also found this documentation for akka-concurrency-test, which itself has a good list of sources.
Play was created for asynchronous programming; that is the reason the default parallelism-factor is 1.0, which is optimal when you run a lot of small, fast, non-blocking operations.
The question is what you mean by "50,000 blocking threads per second". Blocking can mean different things; the most widespread example of blocking is access to an RDBMS. I am pretty sure that in that case your system can handle 50,000 blocking DB accesses like a charm.
The example in the Play thread pools documentation says it is fine to set the parallelism-factor to 300 for an application that uses blocking database calls, so 150 * 300 = 45,000, which is almost your number.
(The specifics for this question are for a mod for Minecraft. In general, the question deals with resizing a threadpool based on system load and CPU availability).
I am coming from an Objective C background, and Apple's libdispatch (Grand Central Dispatch) for thread scheduling.
The immediate concern I have is trying to reduce the size of the thread pool when a CMS tenured collection is running. The program in question (Minecraft) only works well with CMS collections. A much less immediate concern, but still "of interest", is reducing the thread pool size when other programs are demanding significant CPU (specifically, either a screen recorder or a Twitch stream).
In Java, I have just found out about (deep breath):
Executors, which provide access to thread pools (both fixed size, and adjustable size), with cached thread existence (to avoid the overhead of constantly re-creating new threads, or to avoid the worry of coding threads to pause and resume based on workload),
Executor (no s), which is the generic interface for saying "Now it is time to execute this runnable()",
ExecutorService, the sub-interface of Executor that adds lifecycle management (shutdown) and task submission returning Futures,
ThreadPoolExecutor, which is what actually manages the thread pool, and has the ability to say "This is the maximum number of threads to use".
Under normal operation, about 5 times a second, there will be 50 high-priority and 400 low-priority operations submitted to the thread pool per user on the server. This is for high-powered machines.
What I want to do is:
Work with less-powerful machines. So, if a computer only has 2 cores, and the main program (two primary threads, plus some minor assistant threads) is already maxing out the CPU, these background tasks will be competing with the main program and the garbage collector. In this case, I don't want to reduce the number of background threads (it will probably stay at 2), but I do want to reduce how much work I schedule. So this is just "How do I detect when the work-load is going up?". I suspect that this is just a case of watching the size of the work queue I get when I use Executors.newCachedThreadPool().
But the first problem: I can't find anything to return the size of the work queue! ThreadPoolExecutor can return the queue, and I can ask that for a size, but newCachedThreadPool() only returns an ExecutorService, which doesn't let me query for size (or rather, I don't see how to); see the sketch below these points.
If I have "enough cores", I want to tell the pool to use more threads. Ideally, enough to keep CPU usage near max. Most of the tasks that I want to run are CPU bound (disk I/O will be the exception, not the rule; concurrency blocking will also be rare). But I don't want to heavily over-schedule threads. How do I determine "enough threads" without going way over the available cores?
If, for example, screen recording (or streaming) activates, CPU core usage by other programs will go up, and then I want to reduce the number of threads; as the number of threads go down, and queue backlog goes up, I can reduce the amount of tasks I add to the queue. But I have no idea how to detect this.
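For reference, here is a minimal sketch of the kind of monitoring I have in mind, assuming I construct the ThreadPoolExecutor myself instead of going through Executors.newCachedThreadPool(), so that I keep the concrete type and can query it:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class MonitorablePool {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Fixed-size pool with an unbounded queue, built directly so we keep the
        // ThreadPoolExecutor reference (the ExecutorService interface exposes no queue).
        // Note: a cached pool uses a SynchronousQueue whose size is always ~0,
        // so a queued (fixed) pool is what makes backlog measurable at all.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                cores, cores, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        // ... submit work with pool.execute(...) ...

        int backlog = pool.getQueue().size();  // tasks waiting, i.e. "work-load going up"
        int busy = pool.getActiveCount();      // threads currently running tasks
        System.out.println("backlog=" + backlog + " busy=" + busy);
    }
}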
I think that the best advice I / we can give is to not try to "micro-manage" the number of threads in the thread pools. Set it to a sensible size that is proportional to the number of physical cores ... and leave it. By all means provide some static tuning parameters (e.g. in config files), but don't try to make the system tune itself dynamically. (IMO, the chances that dynamic tuning will work better than static tuning are ... pretty slim.)
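A minimal sketch of that approach (the property name and the default multiplier of 2 are just placeholders, not anything Minecraft- or library-specific):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StaticPoolSizing {
    public static ExecutorService newWorkerPool() {
        int cores = Runtime.getRuntime().availableProcessors();
        // Static tuning knob, e.g. -Dworker.threads.per.core=1 on low-end machines.
        int perCore = Integer.getInteger("worker.threads.per.core", 2);
        return Executors.newFixedThreadPool(Math.max(1, cores * perCore));
    }
}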
For "real-time" streaming, your success is going to depend on the overall load and the scheduler's ability to prioritize more than the number of threads. However it is a fact that standard Java SE on a standard OS is not suited to hard real-time, so your real-time performance is liable to deteriorate badly if you push the envelope.
My application is a "thread-per-request" web server with a thread pool of M threads. All processing of a single request runs in the same thread.
Suppose I am running the application in a computer with N cores. I would like to configure M to limit the CPU usage: e.g. up to 50% of all CPUs.
If the processing were entirely CPU-bound then I would set M to N/2. However the processing does some IO.
I can run the application with different M and use top -H, ps -L, jstat, etc. to monitor it.
How would you suggest I estimate M?
To have a CPU usage of 50% does not necessarily mean that the number of threads needs to be N_Cores / 2. When dealing with I/O, the CPU wastes many cycles waiting for data to arrive from devices.
So you need a tool to measure real CPU usage, and through experiments you can increase the number of threads until real CPU usage reaches 50%.
perf for Linux is such a tool. This question addresses the problem. Also be sure to collect statistics system-wide: perf record -a.
You are interested in your CPU issuing and executing as many instructions per cycle (IPC) as possible. Modern servers can execute up to 4 IPC for intense compute-bound workloads. You want to get as close to that as possible for good CPU utilization, and that means you need to increase the thread count. Of course, if there is a lot of I/O that won't be possible, because of the many context switches, which bring penalties (cache flushing, kernel code, etc.).
So the final answer would be: just increase the thread count until real CPU usage reaches 50%.
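As a starting point before you begin measuring, there is a well-known sizing heuristic (from Goetz's Java Concurrency in Practice): threads = cores * target utilization * (1 + wait time / compute time). A sketch with made-up numbers:

public class PoolSizeHeuristic {
    static int threads(int cores, double targetUtilization, double waitTime, double computeTime) {
        return (int) Math.ceil(cores * targetUtilization * (1 + waitTime / computeTime));
    }

    public static void main(String[] args) {
        // Example: 16 cores, aim for 50% CPU, and each request waits on I/O
        // four times as long as it computes -> 16 * 0.5 * (1 + 4) = 40 threads.
        System.out.println(threads(16, 0.5, 4.0, 1.0));
    }
}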
It depends entirely on your particular software and hardware.
Hardware is important: if a thread blocks writing to a slow disk, it will take a long time to wake up (and consume CPU again), but if the disk is really fast the block will only switch CPU context and the thread will be running again almost immediately.
I think you can only experiment with different parameters, or have the app monitor its own CPU use and adjust the pool dynamically.
I rented a little Tomcat server on dailyRazor to provide an HTTP GET service for an Android app via JSP. The maximum Java heap is "Max memory: 92.81 MB".
The default Tomcat setting for maxThreads was 25. As the number of users of my service grew, I was getting lots of refused connections / timeouts at prime time (which I think was because the thread pool was too small). That's why I increased maxThreads to 250. That night, the server crashed, showing me multiple java.lang.OutOfMemoryErrors. 250 seems to be a bit too heavy for the little heap :p I temporarily reduced maxThreads to 50, which seems to be fine, as I don't get any more errors.
As I don't know much about Tomcat, I want to ask for a good way to find the right number for maxThreads. I thought about looking at the maximum memory usage of one thread, so that maxThreads = maxMemory / memoryOfOneThread. Is there a better solution?
The amount of memory used per thread depends on what you do in the thread. So it depends on your software.
It's not that easy to calculate.
But 92 MB for a Tomcat is very tight. I would look for a way to tackle that.
The answer is that there is no good way, AFAIK. Or more precisely, nothing that is significantly better than "try it and see".
But my real reason for answering is to point out some flaws in your understanding of your problem, and in your expectations.
Threads don't use heap memory. Or at least not directly.
The majority of a thread's memory usage is the thread stack. The thread stack size is tunable, and has a platform-specific default that can be as much as 1 MB.
However ... the thread stack is NOT allocated in the Java heap. It is allocated in off-heap memory.
If your system crashed due to an OOME when you increased the number of stacks, then the heap usage was due to the code running on the threads, not the threads themselves. (OK, you probably knew that.)
If you have a problem with requests being dropped etcetera, then increasing the number of threads is not normally the solution. In fact, it is quite likely to make throughput worse.
The thing is, for a thread to execute it has to be assigned to a processor, i.e. a "core". Your Tomcat server may have 2 or 4 cores (I guess). Say it has 4: that means only 4 threads can be executing at any time. With 25 threads, when all 25 are active each gets roughly 1/6th of a core's worth of CPU time. With 250 threads active (if that ever happened) each would get roughly 1/60th of a core. In the meantime:
Each of those requests will have created a bunch of objects, that can't be garbage collected until the request is finished.
Each of those requests could be competing for database cycles, disc I/O bandwidth, network bandwidth, and so on.
Depending on the nature of the application, there could be increased contention over data structures / database rows, resulting in increased thread context switch overheads.
There will be a sweet spot where the number of threads gives you the best request throughput. If you go significantly past that, throughput (requests completed per second) will drop off, and the average time to process each request will increase. Go too far and the system will grind to a halt. (OOMEs are one way this can happen. Another is requests taking so long that the clients time out and stop waiting for the responses.)
Basically, your application (when optimally tuned) will achieve a certain average throughput rate. If your actual request rate exceeds that, then the best strategy is to process the ones that you can, and drop the rest quickly. Dropping requests may seem bad, but it is better than the alternative which is taking so long to respond that the requestor times out the request and launches a new one.
You should add stack size to your formula. It depends on the JVM; e.g. for HotSpot the default stack size is about 512 KB per thread on a 32-bit JVM. BTW, with as little memory as you describe and with 250 threads, stack size may well be the actual problem. Note that you can change it with the -Xss option.
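A rough version of that estimate, treated purely as a sanity check rather than a precise sizing rule (all the figures are assumptions you would have to replace with your own measurements):

public class MaxThreadsEstimate {
    public static void main(String[] args) {
        // Assumed figures -- measure your own!
        long stackPerThread = 512 * 1024;       // -Xss, off-heap (the 32-bit HotSpot default mentioned above)
        long heapPerRequest = 300 * 1024;       // live objects a single in-flight request keeps on the heap
        long heapBudget = 60L * 1024 * 1024;    // heap you can spare for request data out of ~92 MB
        long offHeapBudget = 64L * 1024 * 1024; // memory left for thread stacks outside the heap

        long heapLimit = heapBudget / heapPerRequest;     // threads the heap can sustain
        long stackLimit = offHeapBudget / stackPerThread; // threads the stacks can sustain
        System.out.println("maxThreads ~ " + Math.min(heapLimit, stackLimit));
    }
}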