I have a Java application which uses a few MulticastSocket instances to listen
to a few UDP multicast feeds. Each such socket is handled by a dedicated thread.
The thread reads each datagram, parses its content, and writes to the log (log4j) the packet's sequence id (a long) and the timestamp at which the datagram was received.
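Each receiver thread looks roughly like this (a simplified sketch - the group, port, buffer sizes and payload layout here are placeholders, and I'm assuming the sequence id sits in the first 8 bytes of the payload):

    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;
    import java.nio.ByteBuffer;
    import org.apache.log4j.Logger;

    public class FeedReceiver implements Runnable {
        private static final Logger LOG = Logger.getLogger(FeedReceiver.class);
        private final String group; // placeholder multicast group address
        private final int port;     // placeholder port

        FeedReceiver(String group, int port) { this.group = group; this.port = port; }

        @Override
        public void run() {
            try (MulticastSocket socket = new MulticastSocket(port)) {
                socket.setReceiveBufferSize(4 * 1024 * 1024);  // the enlarged socket read buffer
                socket.joinGroup(InetAddress.getByName(group));
                byte[] buf = new byte[1500];
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                while (!Thread.currentThread().isInterrupted()) {
                    socket.receive(packet);                    // blocks until a datagram arrives
                    long receivedAt = System.currentTimeMillis();
                    // assumption: the sequence id is the first long of the payload
                    long seqId = ByteBuffer.wrap(packet.getData(), 0, packet.getLength()).getLong();
                    LOG.info("seq=" + seqId + " ts=" + receivedAt);
                }
            } catch (Exception e) {
                LOG.error("receiver failed", e);
            }
        }
    }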
When I run 2 instances of the same application on a Windows Server 2008 R2 machine
with 2 * 6 cores and compare the 2 logs created by the 2 applications,
I notice that quite frequently the timing of the packets isn't the same.
Most packets are received by the 2 apps at the same time (to the millisecond), but frequently
there's a difference of about 1-7 ms between the reception times of the same packet
in the 2 apps.
I tried allocating more buffers on the NIC, and also made the socket read buffer bigger.
In addition I tried minimizing GC runs, and I also run with -verbose:gc and can see
that GC pauses and the problematic timing differences do not occur at the same time.
This allows me to assume that my problem isn't GC related.
No packet drops were observed, and a bandwidth problem is unlikely.
Ideas / Opinions are welcome.
Thanks.
By default the Windows timer interrupt frequency is 100 Hz (1 tick per 10 ms). This means the OS cannot guarantee that Java threads will be woken up with higher precision.
Here's an excerpt from a prominent James Holmes article about timing in Java - it could be your case:
for Windows users, particularly on dual-core or multi-processor systems (and it seems most commonly on x64 AMD systems) if you see erratic timing behaviour either in Java, or other applications (games, multi-media presentations) on your system, then try adding the /usepmtimer switch in your boot.ini file.
PS: by no means am I credible in the field of Windows performance optimization. Also, starting from Windows 2008 HPET is supported, but how it is related to the timer interrupt frequency is a mystery to me.
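If you want to see what granularity the millisecond clock actually delivers on a given box, a quick (and unscientific) probe is to spin until System.currentTimeMillis() changes and print the step size; on a default Windows configuration the step is often in the 10-16 ms range unless some process has raised the timer frequency:

    public class ClockGranularity {
        public static void main(String[] args) {
            // Spin until currentTimeMillis() ticks over and report the observed step size.
            for (int i = 0; i < 10; i++) {
                long start = System.currentTimeMillis();
                long next;
                while ((next = System.currentTimeMillis()) == start) {
                    // busy-wait for the next clock tick
                }
                System.out.println("observed step: " + (next - start) + " ms");
            }
        }
    }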
7 ms is a very good result for a 6-core machine, and the drift in Java will be much higher than that if the garbage collector kicks in.
Don't forget that the Java runtime has its own overhead as well.
Related
I have a big question about tuning Linux for Java performance, so I'll start with my case.
I have an application running a number of threads that communicate with each other. My typical workflow is:
1) Some consumer thread synchronizes on a common object lock and calls wait() on it.
2) Some producer thread waits via a Selector for data from the network.
2.1) The producer receives the data and forms an object with the received timestamp (microsecond precision).
2.2) The producer puts this packet in an exchange map and calls notifyAll on the common lock.
3) The consumer thread wakes up and reads the produced object.
3.1) The consumer creates a new object and writes into it the time difference in microseconds between the received timestamp and the current timestamp. This is how I monitor reaction time (sketched below).
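Stripped down, the hand-off looks roughly like this (the Selector, the exchange map and the real payload are simplified away):

    import java.util.concurrent.TimeUnit;

    public class HandoffSketch {
        private final Object lock = new Object();
        private long receivedNanos;   // set by the producer
        private boolean ready;

        // Producer side: called when data arrives from the network.
        void onPacket() {
            synchronized (lock) {
                receivedNanos = System.nanoTime();  // the "received" timestamp
                ready = true;
                lock.notifyAll();                   // wake the consumer(s)
            }
        }

        // Consumer side: waits for a packet and returns the reaction time in microseconds.
        long awaitPacketMicros() throws InterruptedException {
            synchronized (lock) {
                while (!ready) {
                    lock.wait();
                }
                ready = false;
                return TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - receivedNanos);
            }
        }
    }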
And this reaction time is the whole issue.
When I test my application on my own machine I usually get a reaction time of about 200-400 microseconds, but when I monitor it on my production Linux machine I get numbers from 2000 to 4000 microseconds!
Right now I'm running Ubuntu 16.04 as my production OS with Oracle JDK 8u111. I have a physical server with 2 Xeon processors. I run only the usual OS daemons and my app on this server, so there are plenty of resources compared to my dev notebook.
I run my Java app as a jar file with these flags:
sudo chrt -r 77 java -server -XX:+UseNUMA -d64 -Xmx1500m -XX:NewSize=1000m -XX:+UseG1GC -jar ...
I use sudo chrt to raise the scheduling priority since I thought that was the issue, but it didn't help.
I tuned the BIOS for maximum performance and turned off C-states.
What else can I tune for faster reaction times and fewer context switches?
No, there is no single echo 1 > /proc/sys/unlock_concurrent_magic option on Linux to globally improve concurrent performance. If such an option existed, it would simply be enabled by default.
In fact, despite Linux being tunable in general, you aren't going to find many tunables that have a big effect specifically on raw concurrency. The ones that might help are often only incidentally related to concurrency - i.e., something is enabled (let's say THP) which slows down your particular load, and since at least part of the slowness occurs under a lock, the whole concurrent throughput is affected.
Java concurrency vs the OS
My experience, however, is that Java applications are very rarely affected directly by OS-level concurrency behavior. In fact, most of Java's concurrency is implemented efficiently without using OS features, and will behave the same across OSes. The primary places where the JVM touches the OS as it relates to concurrency are thread creation, destruction, and waiting/blocking. Recent Linux kernels have very good implementations of all three, so it would be unusual to run into a bottleneck there (indeed, a well-tuned application should not be doing a ton of thread creation and should also seek to minimize blocking).
So I find it very likely that the performance difference is due to other differences: in hardware, in application configuration, in the applied load, or something else.
Characterize your performance
Here's what I'd try first to characterize the performance discrepancy between your development host and the production system.
At the top level, the difference is going to be either because the production stack is actually slower for the load in question, or because the local test isn't an accurate reflection of the production load.
One quick test you can do to distinguish the cases is to run whatever local test you are running to get 200-400us response times on an unloaded production server. If the server is still getting response times that are 10x worse, then you know your test is probably reasonable, and the difference is really in the production stack.
At that point, the problem could still be in OS, in the software configuration, in the hardware, etc. So you should try to bisect the differences between the production stack and your local host - set any tunable parameters to the same value, investigate any application-specific configuration differences, try to characterize any hardware differences.
One big gotcha is that often production servers are in multi-socket configurations, which may increase the cost of contention by an order of magnitude, since cross-socket communication (generally 100+ cycles) is required - whereas development boxes are generally multi-core but single-socket, so contention overhead is contained to the shared L3 (generally ~30 cycles).
On the other hand, you might find that your local test performs just fine on the production server as well, so the real issue is that your test doesn't represent the true production load. You should then make an effort to characterize the true production load so you can replicate it and then tune it locally. How to tune it locally could of course fill a book or two (or require a very highly paid contractor or two), so you'd have to come back with a narrower question to get useful help here.
"Big Iron" vs your laptop
It is a common fallacy that "big iron" is going to be faster at everything than your puny laptop. In fact, quite the opposite is true for many operations, especially when you measure operation latency (as opposed to total throughput).
For example, latency to memory on server parts is often 50% higher than on client parts, even comparing single-socket systems. John McCalpin reports a main-memory latency of 54.6 ns for a client Sandy Bridge part and 79 ns for the corresponding server part. It is well known that the path to memory and the memory controller design for servers trade off latency for throughput, reliability, and the ability to support more cores and total DRAM1.
In particular, you mention that your production server has "2 Xeon Processors", which I take to mean a dual-socket system. Once you introduce a second socket, you change the mechanics of synchronization entirely. On a single-socket system, when separate threads contend, at worst you are sending cache lines and coherency traffic through the shared L3, which has a latency of 30-40 cycles.
On a system with more than one socket, however, concurrency traffic generally has to flow over the QPI links between sockets, which have a latency on the order of a DRAM access, perhaps 80 ns (i.e., 240 cycles on a 3 GHz box). So you can have nearly an order-of-magnitude slowdown from the hardware architecture alone.
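One way to see the socket effect directly is a simple cache-line ping-pong between two threads: run it once pinned to a single socket and once spread across sockets (for example with taskset) and compare the round-trip times. A rough sketch, not a rigorous benchmark:

    import java.util.concurrent.atomic.AtomicLong;

    public class PingPong {
        private static final long ROUNDS = 1_000_000;
        private static final AtomicLong turn = new AtomicLong(0); // shared cache line acting as a token

        public static void main(String[] args) throws InterruptedException {
            Thread pong = new Thread(() -> {
                for (long i = 0; i < ROUNDS; i++) {
                    while (!turn.compareAndSet(1, 0)) { } // spin until the main thread hands us the token
                }
            });
            pong.start();

            long start = System.nanoTime();
            for (long i = 0; i < ROUNDS; i++) {
                while (!turn.compareAndSet(0, 1)) { }     // hand the token back to the other thread
            }
            pong.join();
            System.out.println("avg round trip: " + (System.nanoTime() - start) / ROUNDS + " ns");
        }
    }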
Furthermore, notifyAll-type scenarios such as the workflow you describe often get much worse with more cores and more threads. E.g., with more cores you are less likely to have two communicating threads running on the same hyperthread (which dramatically speeds up inter-thread coordination, but is otherwise undesirable), and the total contention and coherency traffic may scale up in proportion to the number of cores (e.g., because a cache line has to ping-pong around to every core when you wake up threads).
So it's often the case that a heavily contended (often badly designed) algorithm performs much worse on "big iron" than on a single-socket consumer system.
1 E.g., through buffering, which adds latency, but increases the host's maximum RAM capacity.
Ubuntu 12.04 LTS
java -version
java version "1.6.0_38"
Java(TM) SE Runtime Environment (build 1.6.0_38-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)
4 core CPU - some Dell server hardware
10 threads from time to time run a "heavy" job for several minutes. At other times they do nothing.
1 thread is supposed to wake up every 5 (or so) seconds and send a quick ping over the network to another process. This works nicely as long as the other 10 threads are doing nothing, but when the other 10 threads are running a "heavy" job it never (or very rarely) gets to run and send its ping.
I could understand this if the "heavy" job were CPU intensive. But during such a "heavy" job, top reports something like 50-100% IO-wait but only around 1% CPU usage. Profiling shows that by far most of the time spent by the 10 threads is spent (waiting, I guess) in some NIO call. It all adds up, and is kind of expected, because much of the heaviness of the job is reading files from disk.
What I do not understand is that during such a "heavy" job, the 1 thread doing pings does not get to run. How can that be explained when top shows 1% CPU usage and it seems (from profiling and top) that the 10 threads spend most of their time waiting for IO? Isn't the 1 ping thread supposed to get execution time when the other threads are waiting for IO?
Java thread priority is equal on all 11 threads.
Spreading a few yields here and there in the 10 threads seems to solve (or reduce) the problem, but I simply do not understand why the ping thread does not get to run without the yields, when the other threads do not do much but wait for IO.
ADDITIONAL INFO 05.03.2014
I have reproduced the problem in a simpler setup - though still not a very simple one (you will have to find out how to install an Apache ZooKeeper server, but it is fairly simple - I can provide info later).
Find Eclipse Kepler project here (maven - build by "mvn package"): https://dl.dropboxusercontent.com/u/25718039/io-test.zip
Find binary here: https://dl.dropboxusercontent.com/u/25718039/io-test-1.0-SNAPSHOT-jar-with-dependencies.jar
Start an Apache ZooKeeper 3.4.5 server (on port 2181) on a machine. On another, separate machine (this is where I have Ubuntu 12.04 LTS etc. as described above), run the binary as follows (first create a folder io-test-files - 50 GB of space is needed):
nohup java -cp io-test-1.0-SNAPSHOT-jar-with-dependencies.jar dk.designware.io_test.ZKIOTest ./io-test-files 10 1024 5000000 IP-of-ZooKeeper-server:2181 > ./io-test-files/stdouterr.txt 2>&1 &
First it creates 10 5 GB files (50 GB is way more than the machine's RAM, so the OS file cache does not help much), then it starts a ZooKeeper client (which is supposed to keep its connection with the ZooKeeper server up by sending pings/heartbeats regularly), and then it starts 10 threads doing random access into the 10 files, creating a lot of disk IO but essentially no use of the CPU. I see that the ZooKeeper client eventually loses its connection (the "Zk state" prints in stdouterr.txt stop saying "CONNECTED"), and that is basically what I do not understand. The ZooKeeper client thread only wants to send a tiny heartbeat every few seconds, and only if it is not able to do that within a period of 20 seconds will it lose its connection. I would expect it to have easy access to the CPU, because all the other threads are basically just waiting for disk IO.
During the test I see the following using "top"
Very high "Load average". Above 10 which I do not understand, because there are basically only 10 threads doing something. I also thought that "Load average" only counted threads that actually wanted to do real stuff on the CPU (not including waiting of IO), but according to http://en.wikipedia.org/wiki/Load_%28computing%29 Linux also counts "uninterruptible sleeps" including threads waiting of IO. But I really do not hope/think that it will prevent other threads that have real stuff to do, from getting their hands on the CPU
Very high %wa, but almost no %sy and %us on the CPU(s)
Here is the output from one of my runs: https://dl.dropboxusercontent.com/u/25718039/io-test-output.txt
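For what it's worth, the essence of the test can be sketched without ZooKeeper: a handful of threads doing random reads over large files, plus one thread that only wants to run every few seconds and reports how late it actually was (file paths and sizes below are placeholders, the big files must exist beforehand, and Java 8 syntax is used for brevity):

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.Random;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class IoStarvationSketch {
        public static void main(String[] args) {
            final int ioThreads = 10;
            final long fileSize = 5L * 1024 * 1024 * 1024; // assumes pre-created 5 GB files

            // "Ping" thread: wakes every 5 s and reports how long it actually had to wait.
            ScheduledExecutorService pinger = Executors.newSingleThreadScheduledExecutor();
            final long[] last = { System.nanoTime() };
            pinger.scheduleAtFixedRate(() -> {
                long now = System.nanoTime();
                System.out.println("ping, " + TimeUnit.NANOSECONDS.toMillis(now - last[0]) + " ms since last");
                last[0] = now;
            }, 5, 5, TimeUnit.SECONDS);

            // IO threads: endless random 1 KB reads scattered over big files (placeholder paths).
            for (int t = 0; t < ioThreads; t++) {
                final int id = t;
                new Thread(() -> {
                    try (FileChannel ch = new RandomAccessFile("./io-test-files/file-" + id, "r").getChannel()) {
                        ByteBuffer buf = ByteBuffer.allocate(1024);
                        Random rnd = new Random();
                        while (true) {
                            buf.clear();
                            long pos = (long) (rnd.nextDouble() * (fileSize - buf.capacity()));
                            ch.read(buf, pos); // mostly sits in uninterruptible disk sleep
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }, "io-" + id).start();
            }
        }
    }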
I've got a Java app running on Ubuntu. The app listens on a socket for incoming connections and creates a new thread to process each connection. The app receives incoming data on each connection, processes the data, and sends the processed data back to the client. Simple enough.
With only one instance of the application running and up to 70 simultaneous threads, the app will push CPU usage to over 150% and has trouble keeping up with the incoming data. This is running on a 24-core Dell system.
Now, if I create 3 instances of my application and split the incoming data across the 3 instances on the same machine, the maximum overall CPU on that machine may only reach 25%.
The question is why one instance of the application would use 6 times the amount of CPU that 3 instances on the same machine, each processing one third of the data, use.
I'm not a Linux guy, but can anyone recommend a tool to monitor system resources to try and figure out where the bottleneck is occurring? Any clues as to why 3 instances processing the same total amount of data as 1 instance would use so much less overall system CPU?
In general this should not be the case. Maybe you are reading the CPU usage wrong. Try the top, htop, ps, and vmstat commands to see what's going on.
I can imagine one reason for such behaviour: resource contention. If you have some sort of lock or busy loop which manifests itself only with one instance (max connections, or max threads), then your system might not parallelize processing optimally and ends up waiting on resources. I suggest connecting something like jconsole to your Java processes and seeing what's happening.
As a general recommendation, check how many threads are available per JVM and whether you are using them correctly. Maybe you don't have enough memory allocated to the JVM, so it's garbage collecting too often. If you use database operations, check for bottlenecks there too. Profile, find the place where the application spends most of its time, and compare 1 vs 3 instances in terms of the percentage of time spent in that function.
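If attaching jconsole is awkward, a rough in-process alternative is to enable thread contention monitoring via ThreadMXBean and dump per-thread blocked/waited statistics while the app is under load. A sketch (the call would have to be added to your own code, or driven remotely over JMX):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public final class ContentionSnapshot {
        // Call this from inside the application while it is under load.
        // Times accumulate from the moment monitoring is enabled, so call it once early and again later.
        public static void dump() {
            ThreadMXBean tmx = ManagementFactory.getThreadMXBean();
            if (tmx.isThreadContentionMonitoringSupported()) {
                tmx.setThreadContentionMonitoringEnabled(true); // needed for blocked/waited times
            }
            for (ThreadInfo info : tmx.dumpAllThreads(false, false)) {
                System.out.printf("%-40s blocked=%d (%d ms) waited=%d (%d ms)%n",
                        info.getThreadName(),
                        info.getBlockedCount(), info.getBlockedTime(),
                        info.getWaitedCount(), info.getWaitedTime());
            }
        }
    }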
We have been profiling and profiling our application to reduce latency as much as possible. Our application consists of 3 separate Java processes, all running on the same server, which pass messages to each other over TCP/IP sockets.
We have reduced the processing time in the first component to 25 μs, but we see that the TCP/IP socket write (on localhost) to the next component invariably takes about 50 μs. We also see one other anomalous behaviour: the component which accepts the connection can write faster (i.e. < 50 μs). Right now, all the components are running at < 100 μs with the exception of the socket communications.
Not being a TCP/IP expert, I don't know what could be done to speed this up. Would Unix domain sockets be faster? MemoryMappedFiles? What other mechanisms could be a faster way to pass the data from one Java process to another?
UPDATE 6/21/2011
We created 2 benchmark applications, one in Java and one in C++, to benchmark TCP/IP more tightly and to compare. The Java app used NIO (blocking mode), and the C++ app used the Boost ASIO TCP library. The results were more or less equivalent, with the C++ app about 4 μs faster than Java (though in one of the tests Java beat C++). Also, both versions showed a lot of variability in the time per message.
I think we are agreeing with the basic conclusion that a shared memory implementation is going to be the fastest. (Although we would also like to evaluate the Informatica product, provided it fits the budget.)
If using native libraries via JNI is an option, I'd consider implementing IPC as usual (search for IPC, mmap, shm_open, etc.).
There's a lot of overhead associated with using JNI, but at least it's a little less than the full system calls needed to do anything with sockets or pipes. You'll likely be able to get down to about 3 microseconds one-way latency using a polling shared memory IPC implementation via JNI. (Make sure to use the -Xcomp JVM option or adjust the compilation threshold, too; otherwise your first 10,000 samples will be terrible. It makes a big difference.)
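A pure-Java cousin of that idea - not the JNI route described above, just an illustration of the polling shared-memory pattern - is to map a file on tmpfs (/dev/shm, so nothing is ever written back to a disk) and busy-poll a sequence counter. The file name and layout here are made up, and production code would use properly ordered (volatile-style) accesses rather than plain buffer reads:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class ShmPoller {
        // Hypothetical layout: bytes 0-7 hold a sequence number, bytes 8-15 the payload.
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile("/dev/shm/ipc-demo", "rw");
            MappedByteBuffer buf = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            long seen = buf.getLong(0);
            while (true) {                     // burn a core: no syscalls on the hot path
                long seq = buf.getLong(0);     // writes from the producer process become visible here
                if (seq != seen) {
                    seen = seq;
                    long payload = buf.getLong(8);
                    System.out.println("got payload " + payload);
                }
            }
        }
    }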
I'm a little surprised that a TCP socket write is taking 50 microseconds - most operating systems optimize TCP loopback to some extent. Solaris actually does a pretty good job of it with something called TCP Fusion. And if there has been any optimization for loopback communication at all, it's usually been for TCP. UDP tends to get neglected - so I wouldn't bother with it in this case. I also wouldn't bother with pipes (stdin/stdout or your own named pipes, etc.), because they're going to be even slower.
And generally, a lot of the latency you're seeing is likely coming from signaling - either waiting on an IO selector like select() in the case of sockets, or waiting on a semaphore, or waiting on something. If you want the lowest latency possible, you'll have to burn a core sitting in a tight loop polling for new data.
Of course, there's always the commercial off-the-shelf route - which I happen to know for a certainty would solve your problem in a hurry - but of course it does cost money. And in the interest of full disclosure: I do work for Informatica on their low-latency messaging software. (And my honest opinion, as an engineer, is that it's pretty fantastic software - certainly worth checking out for this project.)
"The O'Reilly book on NIO (Java NIO, page 84), seems to be vague about
whether the memory mapping stays in memory. Maybe it is just saying
that like other memory, if you run out of physical, this gets swapped
back to disk, but otherwise not?"
Linux. An mmap() call allocates pages in the OS page cache area (which periodically get flushed to disk and can be evicted based on Clock-PRO, an approximation of the LRU algorithm). So the answer to your question is yes - a memory-mapped buffer can (in theory) be evicted from memory unless it is mlock'ed (mlock()). That is in theory; in practice I think it is hardly possible if your system is not swapping, and in that case the first victims are page buffers.
See my answer to fastest (low latency) method for Inter Process Communication between Java and C/C++ - with memory-mapped files (shared memory) Java-to-Java latency can be reduced to 0.3 microseconds.
MemoryMappedFiles is not a viable solution for low-latency IPC at all: if the mapped segment of memory gets updated, it will eventually be synced to disk, introducing an unpredictable delay measured in milliseconds at least. For low latency one can try a combination of shared memory + message queues (notifications), or shared memory + semaphores. This works on all Unixes, especially the System V flavor (not POSIX), but if you run the application on Linux you are pretty safe with POSIX IPC (most features are available in the 2.6 kernel). Yes, you will need JNI to get this done.
UPD: I forgot that this is JVM-to-JVM IPC, and we already have GCs which we cannot fully control, so introducing an additional few-millisecond pause due to the OS flushing file buffers to disk may be acceptable.
Check out https://github.com/pcdv/jocket
It's a low-latency replacement for local Java sockets that uses shared memory.
RTT latency between 2 processes is well below 1us on a modern CPU.
i.e. Time A = voltage hits the NIC; Time B = Selector from Java NIO package is able to select socket channel for I/O.
Use SO_TIMESTAMP and find a NIC that actually supports timestamps, ideally with better than millisecond resolution. Then you have a chance, provided you can get Java to read the incoming cmsg ancillary data.
Without good hardware support, the packets are going to be tagged by the kernel, most likely with a low-resolution, unstable timer.
(edit #1) Example code in C requiring 2.6.30 or newer kernel I think:
http://www.mjmwired.net/kernel/Documentation/networking/timestamping/timestamping.c
(edit #2) Example code to determine kernel to user-space latency in C:
http://vilimpoc.org/research/ku-latency/
(edit #3) I recommend following the J-OWAMP project, which depends on high-resolution timers and packet-latency testing. The OWAMP team has been pushing the Linux kernel team for better SO_TIMESTAMP support.
http://www.av.it.pt/jowamp/
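If all you can get at is plain Java, the closest approximation to "Time B" is to stamp the packet the instant select() returns; that measures kernel-to-application delivery only and says nothing about when the signal hit the NIC. A sketch (the port is a placeholder):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;

    public class RxStamp {
        public static void main(String[] args) throws Exception {
            DatagramChannel ch = DatagramChannel.open();
            ch.configureBlocking(false);
            ch.socket().bind(new InetSocketAddress(9999)); // placeholder port
            Selector sel = Selector.open();
            ch.register(sel, SelectionKey.OP_READ);

            ByteBuffer buf = ByteBuffer.allocateDirect(2048);
            while (true) {
                sel.select();                    // returns when the socket is readable
                long tB = System.nanoTime();     // "Time B": as early as Java can stamp it
                for (SelectionKey key : sel.selectedKeys()) {
                    if (key.isReadable()) {
                        buf.clear();
                        ch.receive(buf);
                        System.out.println("datagram stamped at " + tB + " ns");
                    }
                }
                sel.selectedKeys().clear();
            }
        }
    }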
You'll need to use something like tcpdump and then correlate timestamps between your application logs and the "sniffer" logs to determine this; it's not possible from the JVM alone.