I know there are many open source server programs that leverage java.nio's non-blocking I/O, such as Mina. Many implementations use multiple selectors and multi-threading to handle selected events. It seems like a perfect design.
Is it? What is the bottleneck for an NIO-based server? It seems like there wouldn't be any?
Is there any need to control the number of connections? How would one do so?
With traditional blocking I/O, each connection must be handled by one or more dedicated threads. As the number of connections grows, so does the number of required threads. This model works reasonably well with connection counts in the hundreds or low thousands, but it doesn't scale well past that.
Multiplexing and non-blocking I/O invert the model, allowing one thread to service many different connections. It does so by selecting the active connections and only performing I/O when it's guaranteed the sockets are ready.
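For concreteness, here is a minimal sketch of that single-threaded selector loop, written as a toy echo server. The port number is arbitrary, and error handling, partial-write handling, and per-connection state are omitted for brevity:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class EchoSelectorLoop {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));    // arbitrary port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                        // block until at least one channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {             // new connection: register it for reads
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {        // socket has data: read and echo it back
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {
                        client.close();
                    } else {
                        buffer.flip();
                        client.write(buffer);         // a real server would handle partial writes
                    }
                }
            }
        }
    }
}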
This is a much more scalable solution because now you're not having hordes of mostly-inactive threads sitting around twiddling their thumbs. Instead you have one or a few very active threads shuttling between all of the sockets. So what's the bottleneck here?
An NIO-based server is still limited by its CPU. As the number of sockets and the amount of I/O each one does grow, the CPU gets busier and busier.
The multiplexing threads need to service the sockets as quickly as possible. They can only work with one at a time. If the server isn't careful, there might be too much work going on in these threads. When that happens it can take some careful, perhaps difficult programming to move the work off-thread.
If the incoming data can't be processed immediately, it may be prudent to copy it off to a separate memory buffer so it's not sitting in the operating system's queue. That copying takes both time and additional memory.
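As a hedged sketch of that idea (the worker pool, buffer size, and process() hook are my own placeholders, not a fixed API): drain the socket on the selector thread, take a private copy of the bytes, and hand the heavy work to a pool.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ExecutorService;

class OffThreadReader {
    private final ByteBuffer readBuffer = ByteBuffer.allocate(64 * 1024); // reused; selector is single-threaded
    private final ExecutorService workers;

    OffThreadReader(ExecutorService workers) {
        this.workers = workers;
    }

    void onReadable(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        readBuffer.clear();
        int n = channel.read(readBuffer);              // empties the OS receive queue promptly
        if (n > 0) {
            readBuffer.flip();
            ByteBuffer copy = ByteBuffer.allocate(n);
            copy.put(readBuffer).flip();               // the copy costs time and extra memory
            workers.submit(() -> process(copy));       // selector thread stays responsive
        } else if (n == -1) {
            channel.close();
        }
    }

    private void process(ByteBuffer data) {
        // application-specific parsing/handling runs here, on a worker thread
    }
}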
Programs can't have an infinite number of file descriptors / kernel handles open. Every socket has associated read and write buffers in the OS, so you'll eventually run into the operating system's limits.
Obviously, you're still limited by your hardware and network infrastructure. A server is limited by its NIC, by the bandwidth and latency of other network hops, etc.
This is a very generic answer. To really answer this question you need to examine the particular server program in question. Every program is different. There may be bottlenecks in any particular program that aren't Java NIO's "fault", so to speak.
What is the bottleneck for an NIO-based server?
The network, memory, CPU, all the usual things.
It seems like there wouldn't be any?
Why?
Is there any need to control the number of connections?
Not really.
How would one do so?
Count them in and out, and deregister OP_ACCEPT while you're at the maximum.
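A minimal sketch of that bookkeeping, assuming a single selector thread; the class name, the maximum, and where you call it from are all up to you:

import java.nio.channels.SelectionKey;

class ConnectionLimiter {
    private final SelectionKey serverKey;    // the ServerSocketChannel's registration
    private final int maxConnections;
    private int open;

    ConnectionLimiter(SelectionKey serverKey, int maxConnections) {
        this.serverKey = serverKey;
        this.maxConnections = maxConnections;
    }

    void connectionOpened() {                // call after each successful accept()
        if (++open >= maxConnections) {
            serverKey.interestOps(serverKey.interestOps() & ~SelectionKey.OP_ACCEPT);
        }
    }

    void connectionClosed() {                // call whenever a connection goes away
        if (open-- == maxConnections) {      // just dropped back below the limit
            serverKey.interestOps(serverKey.interestOps() | SelectionKey.OP_ACCEPT);
        }
    }
}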
I have an instance of JedisCluster shared between N threads that perform set operations.
When I run with 64 threads, the throughput of set operations is only slightly increased (compared to running using 8 threads).
How to configure the JedisCluster instance using the GenericObjectPoolConfig so that I can maximize throughput as I increase the thread count?
I have tried
GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
poolConfig.setMaxTotal(64);
jedisCluster = new JedisCluster(jedisClusterNodes, poolConfig);
believing this could increase the number of JedisCluster connections to the cluster and so boost throughput.
However, I observed a minimal effect.
When talking about performance, we need to dig into the details a bit before I can actually answer your question.
A naive approach suggests: The more Threads (concurrency), the higher the throughput.
That statement is not wrong, but it is also not the whole truth. Concurrency and the resulting performance are not (always) linearly related, because there is so much going on behind the scenes. Turning sequential processing into concurrent processing might let you get roughly twice the work done in the same time as sequential execution. That example assumes you run a multi-core machine that is not occupied by anything else and has enough bandwidth for the required work (I/O, network, memory). If you scale the example from two threads to eight, but your machine has only four physical cores, weird things start to happen.
First of all, the processor now has to schedule two threads per core, so each thread effectively runs as if its core were shared sequentially, and the process, the OS, and the processor all carry the extra overhead of having twice as many threads as cores. Orchestrating those threads comes at a cost that has to be paid at least in memory allocation and CPU time. If the workload requires heavy I/O, processing might be limited by your I/O bandwidth, and running things concurrently may still increase throughput because the CPU is mostly waiting for the I/O to come back with data to process. In that scenario, 4 threads might be blocked on I/O while the other 4 are doing some work. The same applies to memory and the other resources your application uses. There is actually much more to this (context switching, branch prediction, L1/L2/L3 caching, locking, and so on), enough to fill a 500-page book, so let's stay at a basic level.
Resource sharing and certain limitations lead to different scalability profiles. Some are linear up to a certain concurrency level, some hit a ceiling where adding more concurrency yields the same throughput, and some have a knee where adding concurrency actually makes it slower, because of $reasons.
Now, we can analyze how Redis, Redis Cluster, and concurrency are related.
Redis is a network service and requires network I/O. That might be obvious, but we need to add it to our considerations: a Redis server shares its network path with the other things running on the same host and with everything else using the same switches, routers, hubs, etc. The same applies to the client, even if you told everybody else not to run anything while you're testing.
The next thing is that Redis uses a single-threaded processing model for user commands (I won't dig into background I/O, lazy memory freeing, and asynchronous replication here). So you could assume that Redis uses one CPU core for its work, although in fact it can use slightly more than that. If multiple clients send commands at the same time, Redis processes the commands sequentially, in order of arrival (except for blocking operations, but let's leave those out of this post). If you run N Redis instances on one machine, where N is also the number of CPU cores, you can easily run into a sharing scenario again, and that is something you want to avoid.
You have one or more clients talking to your Redis server(s), and the number of clients involved in the test has an effect. Running 64 threads on an 8-core machine is probably not the best idea, since only 8 cores can execute work at a time (let's leave hyper-threading and the like out of this, to avoid confusion). Requesting more than 8 threads causes time-sharing effects. That said, running somewhat more threads than CPU cores is not a bad idea for Redis and other networked services, because there is always some overhead/lag coming from I/O (the network). You need to send packets from Java (through the JVM, the OS, the network adapter, routers) to Redis (routers, network, yadda yadda yadda), Redis has to process the commands, and the response has to travel all the way back. That usually takes some time.
The client itself (assuming concurrency within one JVM) locks certain resources for synchronization. In particular, obtaining a connection (whether reusing an existing one from the pool or creating a new one) is a locking scenario; you already found the link to the pool config. While one thread holds the lock on a resource, no other thread can access it.
Knowing the basics, we can dig into how to measure throughput using jedis and Redis Cluster:
Congestion on the Redis Cluster can be an issue. If all client threads talk to the same cluster node, the other cluster nodes sit idle and you have effectively measured how one node behaves, not how the cluster behaves. Solution: create an even workload (Level: Hard!)
Congestion on the client: Running 64 threads on an 8-core machine (that is just my assumption here, so please don't beat me up if I'm wrong) is not the best idea. Raising the number of client threads slightly above the number of cluster nodes (assuming an even workload per node) and slightly above the number of CPU cores can improve performance, but having 8x as many threads as CPU cores is overkill and just adds scheduling overhead at every level. In general, performance engineering is about finding the best ratio between work, overhead, bandwidth limitations, and concurrency, and finding the best number of threads is a field of its own in computer science.
Running the test from multiple systems, which together run the total number of threads, might be closer to a production environment than running it from one system. Distributed performance testing is a master class (Level: Very hard!). The trick is to monitor every resource used by your test, making sure nothing is overloaded, or to find the tipping point where you identify the limit of a particular resource. Monitoring the client and the server is just the easy part.
Since I do not know your setup (number of Redis Cluster nodes, distribution of cluster nodes across servers, load on the Redis servers, the client, and the network during the test caused by anything other than your test), it is impossible to say what the cause is.
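To make the thread-count discussion concrete, here is a rough benchmark sketch that sweeps the thread count against a single shared JedisCluster. The seed node address, durations, and key space are made up, and it uses the same constructor style as the question (Jedis 2.x); treat the numbers it prints as relative, not absolute:

import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

import java.util.Collections;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class SetThroughputBenchmark {
    public static void main(String[] args) throws Exception {
        HostAndPort seed = new HostAndPort("127.0.0.1", 7000);    // hypothetical seed node
        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
        poolConfig.setMaxTotal(64);                               // pooled connections per cluster node
        JedisCluster cluster = new JedisCluster(Collections.singleton(seed), poolConfig);

        for (int threads : new int[]{8, 16, 32, 64}) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            AtomicLong ops = new AtomicLong();
            long end = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    while (System.nanoTime() < end) {
                        cluster.set("key:" + ThreadLocalRandom.current().nextInt(100_000), "value");
                        ops.incrementAndGet();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(15, TimeUnit.SECONDS);
            System.out.printf("%d threads: %d SETs in 10 s%n", threads, ops.get());
        }
        cluster.close();
    }
}

Watching where the throughput flattens between steps tells you which of the scalability profiles above (linear, ceiling, knee) you are actually on.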
I am writing a server side application using Java.
The server holds a number of users of the system. For each user, I want to synchronize their disk space with a remote network storage. Because the synchronizations are independent, I am thinking of doing them concurrently.
I am thinking of creating one thread for each user and letting the synchronization tasks fire at the same time.
But the system can have tens of thousands of users. That means creating tens of thousands of threads and firing them at the same time. I am not sure if this is something the JVM can handle.
Even if it can handle this, will it be memory efficient? Each thread has its own stack, and this could be a big memory hit.
Please let me know your opinion.
Many thanks.
You could look at a fixed-size thread pool, which gives you a pool of threads to execute your tasks. This would give the benefit of multithreading with a sensible limit.
Check out Executors.newFixedThreadPool()
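A hedged sketch of what that could look like for the per-user sync; the User type, its sync() method, and the pool size of 32 are placeholders:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncScheduler {
    public static void syncAll(List<User> users) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(32);  // bounded, regardless of user count
        for (User user : users) {
            pool.submit(user::sync);                  // queued; runs when a pool thread is free
        }
        pool.shutdown();                              // no new tasks; queued syncs still finish
        pool.awaitTermination(1, TimeUnit.HOURS);     // arbitrary upper bound for the whole run
    }
}

interface User {
    void sync();                                      // placeholder for the real synchronization work
}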
You should look into Non-blocking IO.
Here is a "random" article about it, courtesy of Google:
http://www.developer.com/java/article.php/3837316/Non-Blocking-IO-Made-Possible-in-Java.htm
Personally I wouldn't have tens of thousands of users on a single machine. You won't be able to do much per user with this many users active. You should be able to afford more than one machine.
You can have this many threads in Java, but as you say it is not efficient. You can use an NIO library to manage multiple connections with each thread.
Libraries like
http://mina.apache.org/
http://www.jboss.org/netty
are suitable.
Also interesting http://code.google.com/p/nfs-rpc/
We have been profiling and profiling our application to reduce latency as much as possible. Our application consists of 3 separate Java processes, all running on the same server, which pass messages to each other over TCP/IP sockets.
We have reduced processing time in first component to 25 μs, but we see that the TCP/IP socket write (on localhost) to the next component invariably takes about 50 μs. We see one other anomalous behavior, in that the component which is accepting the connection can write faster (i.e. < 50 μs). Right now, all the components are running < 100 μs with the exception of the socket communications.
Not being a TCP/IP expert, I don't know what could be done to speed this up. Would Unix Domain Sockets be faster? MemoryMappedFiles? What other mechanisms could be a faster way to pass the data from one Java process to another?
UPDATE 6/21/2011
We created 2 benchmark applications, one in Java and one in C++ to benchmark TCP/IP more tightly and to compare. The Java app used NIO (blocking mode), and the C++ used Boost ASIO tcp library. The results were more or less equivalent, with the C++ app about 4 μs faster than Java (but in one of the tests Java beat C++). Also, both versions showed a lot of variability in the time per message.
I think we are agreeing with the basic conclusion that a shared memory implementation is going to be the fastest. (Although we would also like to evaluate the Informatica product, provided it fits the budget.)
If using native libraries via JNI is an option, I'd consider implementing IPC as usual (search for IPC, mmap, shm_open, etc.).
There's a lot of overhead associated with using JNI, but at least it's a little less than the full system calls needed to do anything with sockets or pipes. You'll likely be able to get down to about 3 microseconds one-way latency using a polling shared memory IPC implementation via JNI. (Make sure to use the -Xcomp JVM option or adjust the compilation threshold, too; otherwise your first 10,000 samples will be terrible. It makes a big difference.)
I'm a little surprised that a TCP socket write is taking 50 microseconds - most operating systems optimize TCP loopback to some extent. Solaris actually does a pretty good job of it with something called TCP Fusion. And if there has been any optimization for loopback communication at all, it's usually been for TCP. UDP tends to get neglected - so I wouldn't bother with it in this case. I also wouldn't bother with pipes (stdin/stdout or your own named pipes, etc.), because they're going to be even slower.
And generally, a lot of the latency you're seeing is likely coming from signaling - either waiting on an IO selector like select() in the case of sockets, or waiting on a semaphore, or waiting on something. If you want the lowest latency possible, you'll have to burn a core sitting in a tight loop polling for new data.
Of course, there's always the commercial off-the-shelf route - which I happen to know for a certainty would solve your problem in a hurry - but of course it does cost money. And in the interest of full disclosure: I do work for Informatica on their low-latency messaging software. (And my honest opinion, as an engineer, is that it's pretty fantastic software - certainly worth checking out for this project.)
"The O'Reilly book on NIO (Java NIO, page 84), seems to be vague about
whether the memory mapping stays in memory. Maybe it is just saying
that like other memory, if you run out of physical, this gets swapped
back to disk, but otherwise not?"
On Linux, the mmap() call allocates pages in the OS page cache (which periodically get flushed to disk and can be evicted based on Clock-Pro, an approximation of the LRU algorithm). So the answer to your question is yes: a memory-mapped buffer can, in theory, be evicted from memory unless it is locked with mlock(). That is the theory. In practice, I think it is hardly possible as long as your system is not swapping; in that case, the first victims are the page buffers.
See my answer to "fastest (low latency) method for Inter Process Communication between Java and C/C++": with memory-mapped files (shared memory), Java-to-Java latency can be reduced to 0.3 microseconds.
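For the curious, here is a rough, pure-Java illustration of the busy-polling shared-memory idea: two JVMs map the same file and bounce a sequence number back and forth. The file path and iteration count are made up, it needs Java 9+ for the VarHandle view (which provides volatile reads/writes over the mapped buffer), and it deliberately burns a core on each side; a tuned JNI or Unsafe implementation will differ in detail.

import java.io.RandomAccessFile;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedPingPong {
    private static final VarHandle LONGS =
            MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.nativeOrder());
    private static final int ROUND_TRIPS = 1_000_000;

    public static void main(String[] args) throws Exception {
        boolean ping = "ping".equals(args[0]);        // run one side with "ping", the other with "pong"
        try (RandomAccessFile file = new RandomAccessFile("/tmp/ipc-demo.dat", "rw");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer shared = channel.map(FileChannel.MapMode.READ_WRITE, 0, 8);
            long value = 0;
            long start = System.nanoTime();
            for (int i = 0; i < ROUND_TRIPS; i++) {
                if (ping) {
                    LONGS.setVolatile(shared, 0, ++value);              // publish the next sequence number
                    while ((long) LONGS.getVolatile(shared, 0) == value) {
                        Thread.onSpinWait();                            // burn the core until the reply lands
                    }
                    value = (long) LONGS.getVolatile(shared, 0);
                } else {
                    while ((long) LONGS.getVolatile(shared, 0) <= value) {
                        Thread.onSpinWait();                            // wait for the next ping
                    }
                    value = (long) LONGS.getVolatile(shared, 0);
                    LONGS.setVolatile(shared, 0, ++value);              // reply with value + 1
                }
            }
            if (ping) {
                System.out.printf("avg round trip: %d ns%n",
                        (System.nanoTime() - start) / ROUND_TRIPS);
            }
        }
    }
}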
MemoryMappedFiles are not a viable solution for low-latency IPC at all: if a mapped segment of memory gets updated, it will eventually be synced to disk, introducing an unpredictable delay measured in milliseconds at least. For low latency one can try combinations of shared memory plus message queues (notifications), or shared memory plus semaphores. This works on all Unixes, especially the System V flavor (not POSIX), but if you run the application on Linux you are pretty safe with POSIX IPC (most features are available in the 2.6 kernel). Yes, you will need JNI to get this done.
UPD: I forgot that this is JVM-to-JVM IPC, and we already have GCs that we cannot fully control, so introducing an additional pause of several milliseconds while the OS flushes file buffers to disk may be acceptable.
Check out https://github.com/pcdv/jocket
It's a low-latency replacement for local Java sockets that uses shared memory.
RTT latency between 2 processes is well below 1us on a modern CPU.
I am working on a BitTorrent client. The easiest way for me to communicate with the peers is to spawn a new thread for each one of them. But if the user wants to keep connections with a large number of peers, that may cause me to spawn a lot of threads.
Another solution I thought of is to have one thread iterate through the peer objects and run each of them for a period.
I checked other libraries, mostly in Ruby (mine is in Java), and they spawn one thread for each new peer. Do you think spawning one thread per peer will degrade performance if the user sets the number of connections to a high number like 100 or 200?
It shouldn't be a problem unless you're running thousands of threads. I'd look into a compromise, using a threadpool. You can detect the number of CPUs at runtime and decide how many threads to spin up based on that, and then hand out work to the threadpool as it comes along.
You can avoid the problem altogether by using Non-blocking IO (java.nio.*).
I'd recommend using an Executor to keep the number of threads pooled.
Executors.newFixedThreadPool(numberOfThreads);
With this, you can basically add "tasks" to the pool and they will complete as soon as threads become available. This way, you're not exhausting all of the end user's machine's threads and you still get a lot done at the same time. If you set it to something like 16, you'd be pretty safe, though you could always allow the user to change this number if they wanted to.
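For illustration, a small sketch along those lines; Peer and its methods are made-up placeholders, and the pool size is whatever the user configured:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PeerManager {
    private final ExecutorService pool;

    public PeerManager(int maxThreads) {
        this.pool = Executors.newFixedThreadPool(maxThreads);   // e.g. 16, user-configurable
    }

    public void connectAll(List<Peer> peers) {
        for (Peer peer : peers) {
            pool.submit(() -> {
                peer.handshake();
                peer.exchangePieces();     // blocking I/O is fine; only maxThreads run at once
            });
        }
    }

    public void shutdown() {
        pool.shutdownNow();                // interrupt peer tasks when the client exits
    }
}

interface Peer {
    void handshake();
    void exchangePieces();
}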
No.....
Once I had this very same doubt and created a .net app (4 years ago) with 400 threads....
Provided they don't do a lot of work, with a decent machine you should be fine...
A few hundred threads is not a problem for most workstation-class machines, and is simpler to code.
However, if you are interested in pursuing your idea, you can use the non-blocking IO features provided by Java's NIO packages. Jean-Francois Arcand's blog contains a lot of good tips learned from creating the Grizzly connector for Glassfish.
Well, in 32-bit Windows for example there is actually a maximum number of native threads you can create: roughly the 2 GB of user address space divided by the thread stack size (with the 2 MB default that's on the order of a thousand threads), or something like that. So with too many connections you simply might run out of virtual memory address space.
I think a compromise might work: use a thread pool with, say, 10 threads (depending on the machine) and distribute the connections evenly among them. Inside each thread, loop through the peers assigned to it, and limit the maximum number of connections.
Use a thread pool and you should be safe with a fairly large pool size (100 or so). CPU will not be a problem since you are IO bound with this type of application.
You can easily make the pools size configurable and put in a reasonable maximum, just to prevent memory related issues with all the threads. Of course that should only occur if all the threads are actually being used.
I'm currently translating an API from C# to Java which has a network component.
The C# version seems to keep the input and output streams and the socket open for as long as its classes are in use.
Is this correct?
Bearing in mind that the application is sending commands and receiving events based on user input, is it more sensible to open a new socket stream for each "message"?
I'm maintaining a ServerSocket for listening to the server throwing events but I'm not so sure that maintaining a Socket and output stream for outbound comms is such a good idea.
I'm not really used to Socket programming. As with many developers I usually work at the application layer when I need to do networking and not at the socket layer, and it's been 5 or 6 years since I did this stuff at university.
Cheers for the help. I guess this is more asking for advice than for a definitive answer.
There is a trade off between the cost of keeping the connections open and the cost of creating those connections.
Creating connections costs time and bandwidth. You have to do the 3-way TCP handshake, launch a new server thread, ...
Keeping connections open costs mainly memory and connections. Network connections are a resource limited by the OS. If you have too many clients connected, you might run out of available connections. It will cost memory as you will have one thread open for each connection, with its associated state.
The right balance will differ based on the usage you expect. If you have a lot of clients connecting for short periods of time, it's probably going to be more efficient to close the connections. If you have few clients connecting for long periods of time, you should probably keep the connections open.
If you've only got a single socket on the client and the server, you should keep it open for as long as possible.
If your application and the server it talks to are close, network-wise, it MAY be sensible to close the connection, but if they're distant, network-wise, you are probably better off letting the socket live for the duration.
Guillaume mentioned the 3-way handshake and that basically means that opening a socket will take a minimum of 3 times the shortest packet transit time. That can be approximated by "half the ping round-trip" and can easily reach 60-100 ms for long distances. If you end up with an additional 300 ms wait, for each command, will that impact the user experience?
Personally, I would leave the socket open, it's easier and doesn't cost time for every instance of "need to send something", the relative cost is small (one file descriptor, a bit of memory for the data structures in user-space and some extra storage in the kernel).
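As a sketch of the "leave it open" approach, here is an illustrative client that connects once and reuses the same socket and streams for every command; the line-oriented protocol and the method names are assumptions:

import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class PersistentClient implements Closeable {
    private final Socket socket;
    private final BufferedWriter out;
    private final BufferedReader in;

    public PersistentClient(String host, int port) throws IOException {
        socket = new Socket(host, port);
        socket.setTcpNoDelay(true);          // small commands: don't wait for Nagle coalescing
        out = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8));
        in = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
    }

    public String send(String command) throws IOException {
        out.write(command);
        out.newLine();
        out.flush();
        return in.readLine();                // assumes one response line per command
    }

    @Override
    public void close() throws IOException {
        socket.close();                      // closes the streams as well
    }
}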
It depends on how frequently you expect the user to type in commands. If commands are quite infrequent, you could perhaps close the sockets. If they are frequent, creating sockets repeatedly can be an expensive operation.
Now having said that, how expensive, in terms of machine resources, is it to have a socket connection open for infrequent data? Why exactly do you think that "maintaining a Socket and output stream for outbound comms is not such a good idea" (even though it seems the right thing to do)? On the other hand, this is different for file streams if you expect that other processes might want to use the same file. Closing the file stream quickly in this case would be the way to go.
How likely is it that you are going to run out of the many TCP connections you can create, which other processes making outbound connections might want to use? Or do you expect to have a large number of clients connecting to your server at a time?
You can also look at DatagramSocket and DatagramPacket. The advantage is lower overhead; the disadvantage is that you give up the reliability and ordering that a regular Socket's overhead buys you.
I suggest you look at using an existing messaging solution like ActiveMQ or Netty. This will handle a lot of the issues you may find with messaging.
I am coming in a bit late, but I didn't see anyone suggest this.
I think it would be wise to consider pooling your connections (whether raw Sockets or TCP client connections): being able to keep a couple of connections open and quickly reuse them in your code base is optimal for performance.
In fact, the Roslyn compiler uses this pooling technique extensively in a lot of places.
https://github.com/dotnet/roslyn/search?l=C%23&q=pooled&type=&utf8=%E2%9C%93