Performance of Multi-Threaded Client Network Application - java

I have implemented a client application that sends requests to a server. The way it works can be described very simply: I specify a number of threads, and each of these threads repeatedly sends a request to the server and waits for the answer.
I have plotted the total throughput of the client for various numbers of threads. The exact number of virtual clients is not important; I am interested in the maximal, saturated performance at the very right of the graph.
I am surprised because I did not expect the performance to scale with the number of threads. Indeed, most of the processor time is spent in blocking I/O in Java (blocking sockets), as the client-server communication has a 1 ms latency, and the client is running on an 8-core machine.
I have looked for solutions online; this answer on Quora seems to imply that the waiting time for blocking I/O can be scheduled for use by other tasks. Is it true, specifically for Java blocking sockets? In that case, why don't I get linear scaling with the number of threads?
If this is important: I am running this application in the cloud. Also, this is part of a larger application, but I have identified this component as the bottleneck of the whole setup.

I have looked for solutions online; this answer on Quora seems to imply that the waiting time for blocking I/O can be scheduled for use by other tasks. Is it true, specifically for Java blocking sockets?
Regular Java threads map to OS-level threads one-to-one, so yes, it's true of Java, and in fact of every other language, unless the runtime uses green threads or non-blocking I/O. While one thread is blocked on a socket, the OS simply schedules another thread onto the CPU.
In that case, why don't I get linear scaling with the number of threads?
Think about what you're doing from the perspective of the CPU. The CPU performs a costly context switch and allows some thread to run. That thread uses the CPU for a very short duration to prepare a network call, and then it blocks for a long time (milliseconds are quite a lot for a CPU working at 3GHz).
So each thread is doing only a tiny bit of work before another context switch is required. That means that a lot of the CPU's time is wasted on context switches instead of doing useful work.
Contrast that with a thread that's doing a CPU-bound task. The context switch takes the same time. But when a CPU-bound task is allowed to run, it manages to utilize the CPU for a long time, making the context-switch cheaper by comparison. This increases the overall CPU utilization.
So on one hand, you see higher rates with every new thread because you're essentially performing more concurrent I/O operations. On the other hand, every new thread adds a cost, so the marginal benefit of each additional thread is a bit smaller every time. If you keep adding threads, eventually the rate will start to fall with each new thread.
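To see the shape of that curve, here is a minimal, self-contained sketch, not the asker's code: Thread.sleep(1) stands in for the real 1 ms socket round trip, and the thread counts and 2-second window are arbitrary assumptions. Throughput should rise roughly linearly at first, then flatten as scheduling overhead catches up:

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.LongAdder;

    public class ThroughputSketch {
        public static void main(String[] args) throws Exception {
            for (int threads : new int[]{1, 2, 4, 8, 16, 32, 64}) {
                System.out.printf("%3d threads -> %,d requests/s%n",
                        threads, measure(threads));
            }
        }

        // Runs 'threads' workers for 2 seconds; each worker simulates a
        // blocking request/response cycle with ~1 ms of "network" latency.
        static long measure(int threads) throws InterruptedException {
            LongAdder completed = new LongAdder();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.execute(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        try {
                            Thread.sleep(1);    // stands in for the socket round trip
                        } catch (InterruptedException e) {
                            return;             // pool is shutting down
                        }
                        completed.increment();  // one "request" completed
                    }
                });
            }
            Thread.sleep(2_000);                // measurement window
            pool.shutdownNow();
            pool.awaitTermination(1, TimeUnit.SECONDS);
            return completed.sum() / 2;         // requests per second
        }
    }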

Related

Thread per connection vs one thread for all connections in java

I have two different types of server and client working at the moment, and I am trying to decide which one would be better for an MMO server, or at least a small MMO-like server with at least 100 players at a time.
My first server uses a thread-per-connection model and sends objects over the socket using ObjectOutputStream.
My second server uses Java NIO with only one thread for all the connections, using select to loop through them. This server also uses ObjectOutputStream to send data.
My question: which would be the better approach for an MMO server? And if the single-threaded model is better, how would sending an object over the socket channel be affected? Could it be read only part of the way, so that I don't get the full object?
Each object being sent just contains, for example, an int and 2 floats for the position and the player id.
I will relate this question to why MMOs use UDP over TCP: UDP promises fast delivery, whereas TCP promises guaranteed delivery.
A similar analogy can be applied to a single-threaded vs a multi-threaded model. Regardless of which you choose, your overall CPU cycles remain the same, i.e. the server can process only so much information per second.
Let's see what happens in each of these scenarios:
1. Single-Threaded Model:
In this model, your own implementation or the underlying library will end up creating a pipeline where requests start queuing. At minimum load, the queue will remain virtually empty and execution will be real-time; however, a lot of CPU may be wasted. At maximum load, there will be a long queue-up and execution will incur latency that grows with load; however, delivery will be guaranteed and CPU utilization will be optimal. Typically, a slow client will slow everybody else down.
2. Multi-Threaded Model:
In this model, depending on how your own implementation or the underlying library implements multi-threading, parallel execution of requests will start happening. The catch with MT is that it's easy to get fooled. For example, java.util.concurrent.ThreadPoolExecutor doesn't actually do any parallel processing beyond its core size unless you set the queue size to a low value (see the sketch below). Once parallel processing starts happening, at minimum load your execution will be super fast, CPU utilization will be optimal, and game performance will be great. However, at maximum load your RAM usage will be high while CPU utilization stays optimal. Typically you'll need to use thread interrupts to avoid a slow client hogging all the threads, which will mean glitchy performance for the slow client. Additionally, as you start exhausting your thread pool and resources, threads will either get queued or simply dropped, leading to glitchy performance.
In gaming, performance matters more than stability, hence there is no question that you should use MT wherever you can. However, tuning your thread parameters to complement your server resources will decide whether it's a boon or a complete bane.
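A sketch of the ThreadPoolExecutor pitfall mentioned above (the pool sizes and queue capacities here are arbitrary illustrations): with an unbounded queue the pool never grows past its core size, because extra tasks are queued rather than given new threads; a small bounded queue forces the pool to spawn threads up to the maximum.

    import java.util.concurrent.*;

    public class PoolGrowthSketch {
        public static void main(String[] args) {
            // Unbounded queue: the pool stays at 2 threads forever; the
            // "maximum" of 16 is never reached because tasks just queue up.
            ThreadPoolExecutor neverGrows = new ThreadPoolExecutor(
                    2, 16, 60, TimeUnit.SECONDS,
                    new LinkedBlockingQueue<>());

            // Small bounded queue: once 8 tasks are waiting, new submissions
            // spawn extra threads (up to 16) instead of queueing.
            ThreadPoolExecutor actuallyGrows = new ThreadPoolExecutor(
                    2, 16, 60, TimeUnit.SECONDS,
                    new ArrayBlockingQueue<>(8),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            neverGrows.shutdown();
            actuallyGrows.shutdown();
        }
    }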

Java: I want to adjust the size of a thread pool. How can I detect either CMS collection starts, or other system-wide factors affecting available CPU?

(The specifics for this question are for a mod for Minecraft. In general, the question deals with resizing a threadpool based on system load and CPU availability).
I am coming from an Objective C background, and Apple's libdispatch (Grand Central Dispatch) for thread scheduling.
The immediate concern I have is trying to reduce the size of the threadpool when a CMS tenured collection is running. The program in question (Minecraft) only works well with CMS collections. A much less immediate concern, but still of interest, is reducing the threadpool size when other programs are demanding significant CPU (specifically, either a screen recorder or a Twitch stream).
In Java, I have just found out about (deep breath):
Executors, which provides factory methods for thread pools (both fixed-size and adjustable), with cached thread reuse (to avoid the overhead of constantly re-creating new threads, or the worry of coding threads to pause and resume based on workload),
Executor (no s), which is the generic interface for saying "Now it is time to execute this Runnable",
ExecutorService, which extends Executor and manages the thread pool,
ThreadPoolExecutor, which is what actually manages the thread pool, and has the ability to say "This is the maximum number of threads to use".
Under normal operation, about 5 times a second, there will be 50 high-priority and 400 low-priority operations submitted to the thread pool per user on the server. This is for high-powered machines.
What I want to do is:
Work with less-powerful machines. So, if a computer only has 2 cores, and the main program (two primary threads, plus some minor assistant threads) is already maxing out the CPU, these background tasks will be competing with the main program and the garbage collector. In this case, I don't want to reduce the number of background threads (it will probably stay at 2), but I do want to reduce how much work I schedule. So this is just "How do I detect when the work-load is going up?" I suspect that this is just a case of watching the size of the work queue I use when calling Executors.newCachedThreadPool().
But the first problem: I can't find anything to return the size of the work queue! ThreadPoolExecutor() can return the queue, and I can ask that for a size, but newCachedThreadPool() only returns an ExecutorService, which doesn't let me query for size (or rather, I don't see how to).
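One workaround for this particular problem (it relies on a JDK implementation detail, not on the ExecutorService contract): the Executors factory methods return a ThreadPoolExecutor under the hood, so a cast exposes the queue. Note that a cached pool is backed by a SynchronousQueue, whose size() is always 0, so getActiveCount() is the more useful load signal there:

    import java.util.concurrent.*;

    public class QueueSizeSketch {
        public static void main(String[] args) {
            ExecutorService service = Executors.newCachedThreadPool();
            // JDK implementation detail: the returned service is a
            // ThreadPoolExecutor, so casting exposes queue and pool stats.
            ThreadPoolExecutor pool = (ThreadPoolExecutor) service;
            System.out.println("queued: " + pool.getQueue().size());  // always 0 for cached pools
            System.out.println("active: " + pool.getActiveCount());   // threads currently working
            pool.shutdown();
        }
    }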
If I have "enough cores", I want to tell the pool to use more threads. Ideally, enough to keep CPU usage near max. Most of the tasks that I want to run are CPU bound (disk I/O will be the exception, not the rule; concurrency blocking will also be rare). But I don't want to heavily over-schedule threads. How do I determine "enough threads" without going way over the available cores?
If, for example, screen recording (or streaming) activates, CPU core usage by other programs will go up, and then I want to reduce the number of threads; as the number of threads go down, and queue backlog goes up, I can reduce the amount of tasks I add to the queue. But I have no idea how to detect this.
I think that the best advice I / we can give is to not try to "micro-manage" the number of threads in the thread pools. Set it to a sensible size that is proportional to the number of physical cores ... and leave it. By all means provide some static tuning parameters (e.g. in config files), but don't try to make the system tune itself dynamically. (IMO, the chances that dynamic tuning will work better than static tuning are ... pretty slim.)
For "real-time" streaming, your success is going to depend on the overall load and the scheduler's ability to prioritize more than the number of threads. However it is a fact that standard Java SE on a standard OS is not suited to hard real-time, so your real-time performance is liable to deteriorate badly if you push the envelope.

What is the relationship between number of CPU cores and number of threads in an app in java?

I'm new to Java multi-threaded programming. The question that has come to my mind is: how many threads can I run, given the number of my CPU cores? And if I run more threads than there are cores, will it be an overhead for the machine to run the app? For example, when we have a server machine whose server software runs 2 threads (main thread + developer thread), will it be an overhead for the server when more simultaneous clients make socket connections to it, or not?
Thanks.
The number of threads a system can execute simultaneously is (of course) identical to the number of cores in the system.
The number of threads that can exist on the system is limited by the available memory (each thread requires a stack and a structure used by the OS to manage the thread), and possibly by how many threads the OS allows (this depends on the OS architecture: some OSes may use a fixed-size table, and once it's full no more threads can be created).
Commonly, today's computers can handle hundreds to thousands of threads.
The reason more threads are used than cores exist in the system is that most threads will inevitably spend much of their time waiting for some event (for example, a word processor waiting for the user to type on the keyboard). The OS ensures that threads waiting in such a manner do not consume CPU time.
The idea behind it is: don't let your CPU sleep, but don't load it so much that it wastes most of its time switching threads.
It's helpful to check "Tuning the pool size" in IBM's paper.
The idea is that it depends on the nature of the task. If it's all in-memory computation tasks, you can use N+1 threads (N being the number of cores, including hyper-threading).
Or
you profile the application and find out the waiting time (WT) and service time (ST) for a typical request; then approximately N * (1 + WT/ST) is the optimal number of threads, assuming 100% utilization of the CPU.
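A quick sketch of that formula in code (the WT and ST values are made-up placeholders; they have to come from profiling your own application):

    public class PoolSizeFormula {
        public static void main(String[] args) {
            int n = Runtime.getRuntime().availableProcessors();
            double waitMs = 50.0;     // placeholder: measured average wait time
            double serviceMs = 5.0;   // placeholder: measured average service time
            int optimalThreads = (int) (n * (1 + waitMs / serviceMs));
            // e.g. with 8 cores and WT/ST = 10, this gives 8 * 11 = 88 threads
            System.out.println("optimal threads ~= " + optimalThreads);
        }
    }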
That depends on what the threads are doing. The CPU is only able to do X things at once, where X is the number of cores it has. That means X threads at most can be active at any one time - however the other threads can wait their turn and the CPU will process them at appropriate moments.
You should also consider that a lot of the time threads are waiting for a response, or waiting for data to load, or a network message to arrive, etc so are not actually trying to do anything. These idle/waiting threads have very little load on the system.
Don't worry about having a higher number of threads than CPU cores; that is actually not in your hands, but in the OS's.
Assuming the JVM maps your java threads over OS threads (which is fairly normal these days), it depends on the thread management your OS does. There you rely on how smart the kernel implementation is to get performance out of your cores.
What you must keep in mind is that your design must be sustainable. For example, application servers are built on a threadpool full of worker threads. Those threads are awakened in order to serve requests. Do you want a thread for each request? Then you will surely have a problem: requests can arrive at the server in the thousands, and that could be a problem for the kernel to manage. Instead, the threadpool size should be limited (between 1 and X, and easily changed even at run time), threads should get work from a concurrent queue (Java gives you some excellent classes for that), and each one should attend to requests sequentially. A sketch of this layout follows.
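A minimal sketch of that bounded worker-pool layout (the pool sizes and queue capacity are illustrative assumptions, and resize assumes core <= max):

    import java.util.concurrent.*;

    public class RequestPool {
        // Requests wait in a bounded concurrent queue ...
        private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(10_000);
        // ... and a limited set of workers drains it sequentially.
        private final ThreadPoolExecutor workers =
                new ThreadPoolExecutor(4, 16, 60, TimeUnit.SECONDS, queue);

        public void submit(Runnable request) {
            workers.execute(request);   // queued, or rejected when saturated
        }

        // "easily changed even at run time": resize the pool on the fly,
        // ordering the calls so core never exceeds max mid-update.
        public synchronized void resize(int core, int max) {
            if (max >= workers.getCorePoolSize()) {
                workers.setMaximumPoolSize(max);  // grow the ceiling first
                workers.setCorePoolSize(core);
            } else {
                workers.setCorePoolSize(core);    // shrink the core first
                workers.setMaximumPoolSize(max);
            }
        }
    }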
I hope this is of help.
Having fewer threads than CPUs can mean you are not using all the CPUs in your system. Having more threads might improve throughput if CPU is your bottleneck.
Having more threads than CPUs does introduce an overhead, and if CPU is your bottleneck this can hurt performance. However, if network I/O is your bottleneck, this overhead is a price worth paying, as it usually allows you to handle many more connections, e.g. you can have 1000 TCP connections, each with its own thread.
There doesn't have to be any relation. A computer can have any number of cores; a process can have any number of threads.
There are several different reasons that processes utilize threading, including:
Programming abstraction. Dividing up work and assigning each division to a unit of execution (a thread) is a natural approach to many problems. Programming patterns that utilize this approach include the reactor, thread-per-connection, and thread pool patterns. Some, however, view threads as an anti-pattern. The inimitable Alan Cox summed this up well with the quote, "threads are for people who can't program state machines."
Blocking I/O. Without threads, blocking I/O halts the whole process. This can be detrimental to both throughput and latency. In a multithreaded process, individual threads may block, waiting on I/O, while other threads make forward progress. Blocking I/O via threads is thus an alternative to asynchronous & non-blocking I/O.
Memory savings. Threads provide an efficient way to share memory yet utilize multiple units of execution. In this manner they are an alternative to multiple processes.
Parallelism. In machines with multiple processors, threads provide an efficient way to achieve true parallelism. As each thread receives its own virtualized processor and is an independently schedulable entity, multiple threads may run on multiple processors at the same time, improving a system's throughput. To the extent that threads are used to achieve parallelism—that is, there are no more threads than processors—the "threads are for people who can't program state machines" quote does not apply.
The first three bullets utilize threads with no relationship to cores. If you are using threads as a programming abstraction to handle UI elements, for example, you'll have one thread per UI element (or whatever) regardless of whether you have 1 core or 12. Similarly, if you were using threads to perform blocking I/O, you'd scale your thread count with your I/O capacity, not your processing power.
The fourth bullet, however, does relate threads to cores. If the goal of threading is parallelism, then the number of threads should scale linearly with the number of cores. For example, if you double the number of cores in a system, then you would double the number of threads in your application. This is true for cores in the logical sense—that is, including SMT.
When threading is used to achieve parallelism—and this is both a common and the best use of threading—you will often have, say, one or two threads per core. Oftentimes, applications are written so as to dynamically size thread pools off the number of available cores. A single thread per core is ideal, but applications often use a larger multiplier, such as two threads per core, due to bugs and inefficiencies in their code, such as operations that block when none should.
Best performance will be when the number of cores (NOC) equals the number of threads (NOT), because if NOT > NOC the processor has to context-switch, or the OS will try to do that work, which is an expensive enough operation. But you have to understand that it is impossible to have NOC = NOT on web servers, because you can't predict how many clients will be connected at the same time. Take a look at the load-balancing concept to solve this issue in the best way.

Why increasing number of threads is useless?

I've been looking for a way to increase the speed of processing messages received from a rabbitmq queue. The only way I've found is to have more than one thread doing the same thing: receiving and processing. This gave me some profit. After I created 4 threads, the speed quadrupled. As I have an 8-core processor, I decided to increase the number of threads to 8. But this gave no performance increase. YourKit shows that only 50% of the CPU is used. Somebody may say that my app is lightweight, and so it is, but I can say that it can't do more work than it does, no matter how much more I give it to do. Why doesn't this work?
There are many different issues that can constrain the maximum speed of some application on a given system. For example, it can be limited by memory bandwidth, by Amdahl's Law effects (time needed for non-parallel code, including synchronized blocks), I/O bandwidth, and cache space.
If you want further improvement you need to do some measurements and profiling to find where the time is going, and then work on that.
The short (and not particularly helpful) answer is "overheads and bottlenecks".
For instance:
Creating threads in Java is relatively expensive. If the amount of work done by a thread isn't large, the overhead of creating the thread can out-weigh the benefits.
Context switching between threads is relatively expensive, especially when you take account of memory-related overheads such as cache misses, TLB misses. (These overheads actually hit when a native thread is assigned to a core. If the OS can somehow keep a native thread on a single core continuously (i.e. with no other threads on the same core), then it can use spinlocking ... and avoid the context switch. But the more Java threads you have, the less likely it is that the OS can do this.)
The threads may be spending a large proportion of their time waiting for I/O to complete. The I/O system's throughput or the speed / latency of some external service can be a bottleneck.
You may have contention over data structures; e.g. threads requiring exclusive access to safely read or update (say) a shared Map. If threads regularly need to wait for others to release locks, then you have a bottleneck.
Your computation may be dominated by the costs of "feeding" the threads. For example, if there is a single master thread that hands out "work" to worker threads, then the master thread's activities could be the bottleneck; i.e. it may not be able to provide enough work to keep the workers busy.
Since your tags imply that you are using a message queue, it is possible that that is the bottleneck, especially if the messages are big or the "work" done on each one is relatively small.
(Using a separate message queue service is liable to increase context switches, add I/O latency, add protocol overheads and so on. It's not an automatic route to performance improvement for small-scale systems.)
It is also possible that you have "hyperthreaded" cores rather than real cores, or that the operating system is stopping your JVM from using all the cores.
If CPU or waiting for IO is your bottle neck, adding independent threads can make a big difference.
If a shared resource is your bottleneck, e.g. your L3 cache, your network adapter, or your kernel, adding threads won't help, because CPU is not the problem. In fact it can often make things worse by adding overhead.
my app is lightweight
In which case CPU is unlikely to be your issue and you are doing well to see a speed up with more than 1 CPU. Most likely you are speeding up CPU used by RabbitMQ. Ideally it should be more efficient and this shouldn't really help much. IMHO, more efficient messaging solutions don't gain much by multiple CPUs as they will not be bottlenecked on CPU.
One way or another, you're only using 4 cores. There's a lot that can stop you from doubling your performance by doubling your threads, but from your 4-thread success you've gotten past all that. I'm guessing there's a bug in your code to set off 8 threads and it's only firing up 4. (Even with hyperthreading, you're going to get some improvement. Even with every possible problem, you're going to get some improvement.) Otherwise, I'll go with T.J. Crowder and Stephen C: I don't think you really have 8 cores.
I'd try using different numbers of threads: 3, 5, 6. See what changes. I think you'll stumble on the problem soon enough.
To be fair to Java: if you write thread-safe code and avoid bottlenecks, it handles threads really well, as you've noticed going from one thread to 4. I've always found the overhead costs to be trivial.
Your application does not have linear speedup, and therefore it does not have good scalability.
To keep increasing the number of threads, you need to ensure the amount of data being handled grows accordingly. For a fixed amount of data, increasing the number of threads (and/or cores) will hit diminishing returns at some point, since the overhead of creating threads will outweigh the threads' compute time.
Make sure to look up the following:
Amdahl's law
Gustafson's law
Gustafson's law is a great counterpoint to Amdahl's law, so I highly recommend understanding that article.
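For reference, the standard forms of the two laws, where p is the parallelizable fraction of the work and n is the number of processors (Amdahl assumes a fixed problem size; Gustafson assumes the problem grows with n):

    S_{\text{Amdahl}}(n) = \frac{1}{(1 - p) + \frac{p}{n}}
    \qquad
    S_{\text{Gustafson}}(n) = (1 - p) + n \, p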

Java threading synchronization for I/O bound and CPU-heavy operations

The task: I need to process multiple I/O streams (HTTP downloads) with some CPU-heavy operation. Ideally, I would like to have full bandwidth and 100% CPU used. Of course, the heavy CPU processing is slower than the internet download. Unprocessed data could be cached to disk. Are there any existing Executors in ASF or other components providing this functionality? If not, what's the best way to achieve this? I'm thinking of having 2 thread pools, one for Internet-To-Disk and the other for Disk-To-CPU-To-Disk operations.
EDITED:
I'll clarify my question:
2 thread pools (Internet-To-Disk and Disk-To-CPU-To-Disk) is the producer/consumer approach itself. The question was HOW to make sure I've selected the right number of threads for producers and consumers. The same code will run simultaneously on different boxes and architectures, with different numbers of cores and different bandwidth. How can I make sure I've chosen the right number of threads so that 100% of bandwidth and 100% of CPU are consumed?
Assuming that CPU processing is going to be the main bottleneck of your system, the number of threads for CPU processing should be, at the least, set to the number of CPUs or cores available.
The I/O part is probably not going to use much CPU at all, but you may want to allocate a fixed pool of a few threads (equal to, or less than, the number of cores) to prevent excess thread context switching for simultaneous I/O streams.
You may also set the number of threads for CPU processing to a number slightly bigger than the number of cores, if your CPU processing threads do not always use 100% of CPU from start to finish. For example, if they may do some I/O or access some shared resource in the middle of processing.
But as with any system, the ideal number of threads will greatly depend on the nature of your program. You can use tools like VisualVM (bundled with the JDK) to analyse how threads are utilised in your program, and try different thread-setting variations.
You can use producer-consumer for this purpose. Use as many producers and consumers as are needed to fulfil the demand.
If your CPU stage is more intensive than the download time, why not just download the data as you are able to process it. That way you can have multiple Internet-To-CPU-To-Disk processes. By skipping a stage it may be faster, and it will certainly be simpler.
I'd go for a producer-consumer architecture : one thread pool to process the data (managed by an ExecutorService), and one or more threads to download the data from the internet.
The data to be processed would be put into a bounded blocking queue (e.g. LinkedBlockingQueue), so that the downloading threads would only fetch data when required (that is, when a computing thread is able to process new data). Plus, this structure guarantees thread safety and memory publication.
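A minimal sketch of that layout, assuming placeholder download() and process() methods (not a real API) and illustrative pool and queue sizes. The bounded queue is what couples the two stages: downloaders block when processors fall behind, which throttles downloads to the CPU's pace automatically.

    import java.util.concurrent.*;

    public class PipelineSketch {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            // Bounded hand-off between the downloading and computing stages.
            BlockingQueue<byte[]> chunks = new LinkedBlockingQueue<>(64);

            ExecutorService downloaders = Executors.newFixedThreadPool(2);
            ExecutorService processors = Executors.newFixedThreadPool(cores);

            for (int i = 0; i < 2; i++) {
                downloaders.execute(() -> {
                    try {
                        while (true) {
                            chunks.put(download());  // blocks when the queue is full
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            for (int i = 0; i < cores; i++) {
                processors.execute(() -> {
                    try {
                        while (true) {
                            process(chunks.take());  // blocks when the queue is empty
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Runs until interrupted; a real version would add shutdown handling.
        }

        static byte[] download() { return new byte[8192]; }  // placeholder
        static void process(byte[] data) { /* CPU-heavy work here */ }
    }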
