I have two different types of server and client working at the moment, and I am trying to decide which one would be better for an MMO server, or at least a small MMO-like server with at least 100 players at a time.
My first server uses a thread-per-connection model and sends objects over the socket using ObjectOutputStream.
My second server uses Java NIO, with a single thread for all connections and a select loop to service them. This server also uses ObjectOutputStream to send data.
My question: which would be the better approach for an MMO server? And if the single-threaded model is better, how would sending an object over the socket channel be affected? Could a read return before the full object has arrived?
Each object sent over the wire just contains, for example, an int and two floats for the player ID and position.
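On the partial-read worry: yes, with a non-blocking SocketChannel a single read() can deliver only part of a message, so the usual remedy is to frame messages (here by their fixed size) and keep filling a buffer until a full frame has arrived. Below is a minimal sketch under that assumption; PositionReader and handle() are invented names, and it deliberately avoids ObjectOutputStream, which does not combine well with non-blocking channels.

import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Sketch only: fixed-size frames of int playerId + two floats = 12 bytes.
// A read() on a non-blocking channel may deliver any number of those bytes,
// so we keep filling the same buffer until the frame is complete.
public class PositionReader {
    private final ByteBuffer frame = ByteBuffer.allocate(12);

    /** Returns true once a complete frame has been read and handled. */
    public boolean readFrame(SocketChannel channel) throws IOException {
        if (channel.read(frame) == -1) {
            throw new EOFException("peer closed the connection");
        }
        if (frame.hasRemaining()) {
            return false; // partial frame; try again on the next select()
        }
        frame.flip();
        int playerId = frame.getInt();
        float x = frame.getFloat();
        float y = frame.getFloat();
        handle(playerId, x, y);
        frame.clear(); // ready for the next frame
        return true;
    }

    private void handle(int playerId, float x, float y) {
        // game logic goes here
    }
}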
I will relate this question to the reason MMOs use UDP over TCP: UDP promises fast delivery, whereas TCP promises guaranteed delivery.
A similar trade-off applies to the single-threaded vs. multi-threaded choice. Regardless of which you pick, your overall CPU cycles remain the same, i.e. the server can process only so much information per second.
Let's see what happens in each of these scenarios.
1. Single-Threaded Model:
In this model, your own implementation or the underlying library ends up creating a pipeline in which requests queue up. At minimum load the queue stays virtually empty and execution is effectively real-time, though much of the CPU may go to waste. At maximum load the queue grows long and latency rises with increasing load, but delivery is still guaranteed and CPU utilization is optimal. Typically, a slow client will slow everybody else down.
2. Multi-Threaded Model:
In this model, depending on how your own implementation or the underlying library implements multi-threading, requests start executing in parallel. The catch with MT is that it's easy to get fooled. For example, java.util.concurrent.ThreadPoolExecutor doesn't grow beyond its core pool size, and hence doesn't add parallelism, unless its work queue is bounded and fills up (see the sketch below). Once parallel processing starts happening: at minimum load your execution is very fast, CPU utilization is optimal, and game performance is great. At maximum load your RAM usage is high, while CPU utilization remains optimal. Typically you'll need thread interrupts to stop a slow client from hogging all the threads, which means glitchy performance for that slow client. Additionally, as you exhaust your thread pool and resources, tasks either get queued or dropped, again leading to glitchy performance.
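To make the ThreadPoolExecutor point concrete, here is a minimal sketch (the pool sizes are arbitrary). With an unbounded queue the pool never grows past its core size, so extra tasks just wait behind the core threads; a bounded queue that fills up is what triggers creation of threads up to the maximum.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolDemo {
    public static void main(String[] args) {
        // Unbounded queue: the pool never grows past corePoolSize (2 here);
        // submitted tasks simply queue up behind the two core threads.
        ThreadPoolExecutor queued = new ThreadPoolExecutor(
                2, 16, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        // Small bounded queue: once 4 tasks are waiting, the pool spawns
        // threads up to the maximum of 16, which is what actually buys
        // parallelism under load.
        ThreadPoolExecutor parallel = new ThreadPoolExecutor(
                2, 16, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(4));

        queued.shutdown();
        parallel.shutdown();
    }
}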
In gaming, performance matters more than stability, so there is no question that you should use MT wherever you can. However, tuning your thread parameters to complement your server resources will decide whether it's a boon or a complete bane.
Related
I have implemented a client application that sends requests to a server. The way it works can be described very simply: I specify a number of threads, and each of these threads repeatedly sends requests to the server and waits for the answer.
I have plotted the total throughput of the client for various numbers of threads. The number of virtual clients is not important; I am interested in the maximal, saturated performance at the very right of the graph.
I am surprised, because I did not expect the performance to scale with the number of threads. Indeed, most of the processor time is spent blocked on I/O in Java (blocking sockets), as the client-server communication has a 1 ms latency, and the client is running on an 8-core machine.
I have looked for solutions online; this answer on Quora seems to imply that the waiting time of blocking I/O can be used for other tasks. Is this true, specifically for Java blocking sockets? In that case, why don't I get linear scaling with the number of threads?
If it matters, I am running this application in the cloud. Also, this is part of a larger application, but I have identified this component as the bottleneck of the whole setup.
I have looked for solutions online; this answer on Quora seems to imply that the waiting time of blocking I/O can be used for other tasks. Is this true, specifically for Java blocking sockets?
Regular Java threads map to OS-level threads one-to-one, so yes, it's true for Java, and in fact for every other language, unless it uses green threads or non-blocking I/O. While one thread is blocked on a socket, the OS simply schedules other threads onto the CPU.
In that case, why don't I get linear scaling with the number of threads?
Think about what you're doing from the perspective of the CPU. The CPU performs a costly context switch to let some thread run. That thread uses the CPU for a very short time to prepare a network call, and then it blocks for a long time (milliseconds are an eternity for a CPU running at 3 GHz).
So each thread is doing only a tiny bit of work before another context switch is required. That means that a lot of the CPU's time is wasted on context switches instead of doing useful work.
Contrast that with a thread that's doing a CPU-bound task. The context switch takes the same time. But when a CPU-bound task is allowed to run, it manages to utilize the CPU for a long time, making the context-switch cheaper by comparison. This increases the overall CPU utilization.
So on the one hand, you see higher rates with every new thread because you're essentially performing more concurrent I/O operations. On the other hand, every new thread adds a cost, so the marginal benefit of each additional thread shrinks each time. If you keep adding threads, you'll eventually reach a point where the rate falls with each new thread.
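If you want to see this curve for yourself, a throwaway benchmark along these lines works; sleep(1) is only a rough stand-in for the 1 ms round trip (real socket I/O costs the kernel more than a sleep does, so the real fall-off comes sooner), and all names here are invented.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Throwaway benchmark: requests/second for a growing number of threads.
public class ScalingDemo {
    public static void main(String[] args) throws InterruptedException {
        for (int threads = 1; threads <= 64; threads *= 2) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long start = System.nanoTime();
            for (int i = 0; i < 10_000; i++) {
                pool.execute(() -> {
                    try {
                        Thread.sleep(1); // stand-in for the blocking call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " threads: "
                    + (10_000 * 1000L / elapsedMs) + " req/s");
        }
    }
}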
I am new to Java. I'm a C++ programmer and have been studying Java for two months now.
Sorry for my poor English.
I have a question about whether the Akka actor model needs a memory pool or an object pool. I think that if I send messages from one actor to other actors, I have to allocate heap memory each time (new String, new BigInteger, and so on), and at some point the garbage collector will start up (I'm not sure when) and make my application slow.
So I searched for a way to make a memory pool and failed (Java does not support memory pools). I could make an object pool, but in other projects I did not find anybody using an object pool with actors (nor on the Akka homepage).
Are there any documents about this topic on the Akka homepage? Please give me the link or tell me the solution to my question.
Thanks.
If, as is likely, you end up using Akka across multiple computers, messages are serialized on the wire and sent to the other instance. This means that a purely local memory pool won't suffice.
While it's technically possible to write a custom JSerializer implementation (see the Akka serialization docs) that stores local messages in a memory pool after deserializing them, that feels like overkill for most applications (and it's easy to mess up and actually worsen performance with lookup times in the map).
Yes, when the GC kicks in, the app will lag a bit under heavy load. But in 95% of scenarios, especially with a performant framework like Akka, GC will not be your bottleneck: I/O will.
I'm not saying you shouldn't do it. I'm saying that before you take on such a non-trivial task, you should measure the impact of GC on your app at runtime with tools like Kamon or other Akka-specialized monitoring solutions, and go for it only once you are sure it's worth it.
Using an ArrayBlockingQueue to hold a pool of your objects should help.
Here is some example code.
To create a pool and insert an instance of a pooled object into it:
BlockingQueue<YOURCLASS> queue = new ArrayBlockingQueue<>(256);
// Adjust 256 to your desired count; an ArrayBlockingQueue's capacity
// cannot be changed once it is created.
queue.put(YOUROBJ); // do this wherever your code instantiates the pool
and later, where you need it (in the actor that receives the message):
YOURCLASS instanceName = queue.take(); // blocks until an object is available
You might have to write some code around this to create and manage the pool, but this is the gist of it. A minimal sketch of such a wrapper follows.
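As a minimal sketch of that managing code, you could wrap the queue in a small generic pool; the class and method names below are invented for illustration.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

public class ObjectPool<T> {
    private final BlockingQueue<T> pool;

    public ObjectPool(int size, Supplier<T> factory) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(factory.get()); // pre-fill the pool
        }
    }

    public T borrow() throws InterruptedException {
        return pool.take(); // blocks until an object is free
    }

    public void giveBack(T obj) {
        pool.offer(obj); // silently drops the object if the pool is full
    }
}

Usage would then look like: ObjectPool<YOURCLASS> pool = new ObjectPool<>(256, YOURCLASS::new);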
One can use object pooling to minimise the long tail of latency (at the cost of the median in a multithreaded environment). Consider using appropriate queues, e.g. from JCTools, Disruptor, or Agrona (a sketch on a JCTools queue follows below). Don't forget the rules of engagement for exchanging mutable state between multiple threads via the stored objects - https://youtu.be/nhYIEqt-jvY (the best content I was able to find).
Again, don't expect to improve throughput using such slightly dangerous techniques. You will lose L1-L3 cache efficiency and pay for memory barriers.
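For illustration, here is the same pooling idea sketched on a lock-free JCTools queue (this assumes JCTools is on the classpath; the class name and buffer sizes are invented). Unlike ArrayBlockingQueue, poll() returns null instead of blocking, so the caller decides whether to spin, yield, or just allocate.

import java.nio.ByteBuffer;
import org.jctools.queues.MpmcArrayQueue;

public class BufferPool {
    private final MpmcArrayQueue<ByteBuffer> pool = new MpmcArrayQueue<>(1024);

    public ByteBuffer borrow() {
        ByteBuffer buf = pool.poll();
        return buf != null ? buf : ByteBuffer.allocate(4096); // pool empty: allocate
    }

    public void giveBack(ByteBuffer buf) {
        buf.clear();
        pool.offer(buf); // silently dropped if the pool is already full
    }
}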
A bit of a tangent (to get a sense of low-latency technology):
One may consider a GC implementation with lower latency if you want to stick with Akka, or a custom reactive model where the object pool is used by a single thread, or where memory is copied over, e.g. the Disruptor's approach.
Another alternative is using memory regions (the way the Erlang VM works). It creates garbage, but in a form that is easy for the GC to handle!
If you go for very-low-latency I/O and latency is your biggest enemy: forget legacy TCP (vs. RDMA over InfiniBand), switches (vs. switchless), and accessing disk via OS calls and a file system (use RDMA); avoid interrupts shared with your core, unpinned threads (pin them, spinning on input, to real CPU cores rather than virtual/hyper-threaded ones), inter-NUMA communication, and sending messages one by one instead of hardware multicast (or better, an optical switch) for multiple consumers; and don't forget to turn on Epsilon GC for the JVM ;)
(The specifics of this question concern a mod for Minecraft. In general, the question deals with resizing a thread pool based on system load and CPU availability.)
I am coming from an Objective C background, and Apple's libdispatch (Grand Central Dispatch) for thread scheduling.
The immediate concern I have is trying to reduce the size of the thread pool when a CMS tenured collection is running. The program in question (Minecraft) only works well with CMS collections. A much less immediate, but still "of interest", concern is reducing the thread pool size when other programs are demanding significant CPU (specifically, either a screen recorder or a Twitch stream).
In Java, I have just found out about (deep breath):
Executors, which provides factory methods for thread pools (both fixed-size and resizable), with cached threads (to avoid the overhead of constantly re-creating new threads, or the worry of coding threads to pause and resume based on workload),
Executor (no s), which is the generic interface for saying "now it is time to execute this Runnable",
ExecutorService, the sub-interface of Executor that adds task submission and lifecycle management,
ThreadPoolExecutor, which actually manages the thread pool and has the ability to say "this is the maximum number of threads to use".
Under normal operation, about 5 times a second, 50 high-priority and 400 low-priority operations will be submitted to the thread pool per user on the server. This is for high-powered machines.
What I want to do is:
Work with less powerful machines. So, if a computer has only 2 cores and the main program (two primary threads plus some minor assistant threads) is already maxing out the CPU, these background tasks will be competing with the main program and the garbage collector. In this case, I don't want to reduce the number of background threads (it will probably stay at 2), but I do want to reduce how much work I schedule. So this is just "how do I detect when the workload is going up?". I suspect that this is just a case of watching the size of the work queue I get from Executors.newCachedThreadPool().
But here is the first problem: I can't find anything that returns the size of the work queue! ThreadPoolExecutor can return the queue, and I can ask that for its size, but newCachedThreadPool() only returns an ExecutorService, which doesn't let me query the size (or rather, I don't see how to).
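One workaround, sketched here on the assumption that you accept an unchecked cast: the factory methods declare ExecutorService, but the object they return is a ThreadPoolExecutor, so casting exposes getQueue() and getActiveCount(). Note the caveat about newCachedThreadPool() in the comments.

import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class QueuePeek {
    public static void main(String[] args) {
        // The factory declares ExecutorService, but the object underneath is
        // a ThreadPoolExecutor; the cast exposes getQueue() and friends.
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(8);
        System.out.println("backlog: " + pool.getQueue().size());
        System.out.println("active:  " + pool.getActiveCount());
        // Caveat: newCachedThreadPool() is built on a SynchronousQueue, which
        // never holds elements, so its queue size is always 0. To watch a
        // backlog you need a pool with a real work queue, like the one above.
        pool.shutdown();
    }
}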
If I have "enough cores", I want to tell the pool to use more threads. Ideally, enough to keep CPU usage near max. Most of the tasks that I want to run are CPU bound (disk I/O will be the exception, not the rule; concurrency blocking will also be rare). But I don't want to heavily over-schedule threads. How do I determine "enough threads" without going way over the available cores?
If, for example, screen recording (or streaming) activates, CPU usage by other programs will go up, and then I want to reduce the number of threads; as the number of threads goes down and the queue backlog grows, I can reduce the number of tasks I add to the queue. But I have no idea how to detect this.
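For detecting the core count and outside CPU pressure, the standard library does offer a starting point. A rough sketch (the sizing heuristic at the end is my own invention, not a recommendation):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // One-minute system load average. It covers ALL processes, so it
        // rises when a recorder or streamer kicks in. Returns -1.0 where
        // unsupported (e.g. some Windows JVMs).
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();

        // Crude heuristic: leave one thread per core that is not already busy.
        int spare = Math.max(1, cores - (int) Math.round(Math.max(0, load)));
        System.out.println("cores=" + cores + ", load=" + load + ", spare=" + spare);

        // A ThreadPoolExecutor can then be resized at runtime via
        // setCorePoolSize(spare) and setMaximumPoolSize(spare).
    }
}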
I think the best advice I / we can give is to not try to "micro-manage" the number of threads in the thread pools. Set it to a sensible size proportional to the number of physical cores ... and leave it. By all means provide some static tuning parameters (e.g. in config files), but don't try to make the system tune itself dynamically. (IMO, the chances that dynamic tuning will work better than static tuning are ... pretty slim.)
For "real-time" streaming, your success is going to depend on the overall load and the scheduler's ability to prioritize, more than on the number of threads. However, it is a fact that standard Java SE on a standard OS is not suited to hard real-time work, so your real-time performance is liable to deteriorate badly if you push the envelope.
My program needs to send data to multiple (about 50) "client" stations. Important bits of data must be sent over TCP to ensure arrival. The connections are mostly fixed and are not expected to change during a single period of activity of the program.
What do you think would be the best architecture for this? I've heard that creating a new thread per connection is generally not recommended, but does this recommendation hold when the connections are not expected to change? Scalability would be nice to have but is not much of a concern, as the number of client stations is not expected to grow.
The program is written in Java if it matters.
Thanks,
Alex
If scalability, throughput and memory usage are not a concern, then using 50 threads is an adequate solution. It has the advantage of being simple, and simplicity is a good thing.
If you want to be able to scale, or you are concerned about memory usage (N threads implies N thread stacks) then you need to consider an architecture using NIO selectors. However, the best architecture probably depends on things like:
the amount of work that needs to be performed for each client station,
whether the work is evenly spread (on average),
whether the work involves other I/O, access to shared data structures, etc and
how close the aggregate work is to saturating a single processor.
50 threads is fine, go for it. It hardly matters. Anything over 200 threads, start to worry.
I'd use a thread pool anyway. Depending on your thread pool configuration it will create as many threads as you need, and this solution is more scalable: it will be fine not only for 50 but also for 5000 clients. A minimal sketch follows.
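A minimal sketch of that pooled variant (the port number and handler are placeholders): one accept loop, with each connection handed to a fixed pool. With ~50 long-lived clients a pool of 50 behaves like thread-per-connection, and you can shrink the pool later to cap resource usage.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PooledServer {
    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(50);
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket client = server.accept();
                pool.execute(() -> handle(client)); // handled by a pool thread
            }
        }
    }

    private static void handle(Socket client) {
        // per-client protocol goes here
    }
}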
Why don't you limit the number of threads by using something like a connection pool?
The task: I need to process multiple I/O streams (HTTP downloads) with some CPU-heavy operation. Ideally I would like to use full bandwidth and 100% of the CPU. Of course, the heavy CPU processing is slower than the internet download. Unprocessed data can be cached to disk. Are there any existing Executors in ASF or other components providing this functionality? If not, what's the best way to achieve this? I am thinking of having two thread pools, one for Internet-To-Disk and another for Disk-To-CPU-To-Disk operations.
EDITED:
I'll clarify my question:
Two thread pools (Internet-To-Disk and Disk-To-CPU-To-Disk) is itself a producer/consumer approach. The question was HOW to make sure I've selected the right number of threads for producers and consumers. The same code will run simultaneously on different boxes and architectures with different numbers of cores and different bandwidth. How can I make sure I've chosen the right number of threads so that 100% of bandwidth and 100% of CPU are consumed?
Assuming that CPU processing is going to be the main bottleneck of your system, the number of threads for CPU processing should be, at the least, set to the number of CPUs or cores available.
The I/O part is probably not going to use much CPU at all, but you may want to allocate a fixed pool of a few threads (equal to or less than the number of cores) to prevent excessive thread context switching between simultaneous I/O streams.
You may also set the number of threads for CPU processing to a number slightly bigger than the number of cores, if your CPU processing threads do not always use 100% of CPU from start to finish. For example, if they may do some I/O or access some shared resource in the middle of processing.
But as with any system, the ideal number of threads greatly depends on the nature of your program. You can use tools like VisualVM (bundled with the JDK) to analyse how threads are utilised in your program, and try different thread-setting variations.
You can use a producer-consumer design for this purpose: use as many producers and consumers as needed to fulfill the demand.
If your CPU stage is more intensive than the download, why not just download the data as you are able to process it? That way you can have multiple Internet-To-CPU-To-Disk processes. By skipping a stage it may be faster, and it will certainly be simpler.
I'd go for a producer-consumer architecture: one thread pool to process the data (managed by an ExecutorService), and one or more threads to download the data from the internet.
The data to be processed would be put into a bounded blocking queue (e.g. LinkedBlockingQueue), so that the downloading threads fetch data only when required (that is, when a computing thread is able to process new data). Plus, this structure guarantees thread safety and memory publication. A rough sketch follows.
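A rough sketch of that architecture; fetchNext() and process() are stand-ins for the download and CPU-heavy steps. The bounded queue blocks the downloaders when the CPU side falls behind, so the two sides throttle each other without manual tuning.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class Pipeline {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        BlockingQueue<byte[]> chunks = new LinkedBlockingQueue<>(64);

        ExecutorService downloaders = Executors.newFixedThreadPool(2);
        ExecutorService workers = Executors.newFixedThreadPool(cores);

        for (int i = 0; i < 2; i++) {
            downloaders.execute(() -> {
                try {
                    while (true) {
                        chunks.put(fetchNext()); // blocks while the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (int i = 0; i < cores; i++) {
            workers.execute(() -> {
                try {
                    while (true) {
                        process(chunks.take()); // blocks while the queue is empty
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    private static byte[] fetchNext() { return new byte[0]; } // stub download
    private static void process(byte[] data) { }              // stub CPU work
}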