Trying to solve an issue where there is a significant amount of latency on outgoing messages that seems to be related to the socket flush behaviour. I've been taking packet captures of outgoing FIX messages from a quickfixj initiator to an acceptor.
To summarise the environment, the Java initiator makes a socket connection to a server socket on another server. Both servers are running Red Hat Enterprise Linux 5.10. The MSS from a netstat on the interfaces is 0. The MTU of the NICs is 1500 in all cases (infinite, I believe, for the loopback interface). On the application side the messages are encoded into a byte array by quickfixj and written to the socket. The socket is configured with TCP_NODELAY enabled.
I am almost sure I can eliminate the application as the cause of the latency, as when the acceptor (the ServerSocket) is run on the same server as the Initiator using the loopback interface, there is no sender latency. This is an example of some packet capture entries using the loopback interface:
"No.","Time","Source","Destination","Protocol","Length","SendingTime (52)","MsgSeqNum (34)","Destination Port","Info","RelativeTime","Delta","Push"
"0.001606","10:23:29.223638","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5360","6082","MarketDataSnapshotFullRefresh","0.001606","0.000029","Set"
"0.001800","10:23:29.223832","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5361","6082","MarketDataSnapshotFullRefresh","0.001800","0.000157","Set"
"0.001823","10:23:29.223855","127.0.0.1","127.0.0.1","FIX","1224","20150527-09:23:29.223","5362","6082","MarketDataSnapshotFullRefresh","0.001823","0.000023","Set"
"0.002105","10:23:29.224137","127.0.0.1","127.0.0.1","FIX","825","20150527-09:23:29.223","5363","6082","MarketDataSnapshotFullRefresh","0.002105","0.000282","Set"
"0.002256","10:23:29.224288","127.0.0.1","127.0.0.1","FIX","2851","20150527-09:23:29.224,20150527-09:23:29.224,20150527-09:23:29.224","5364,5365,5366","6082","MarketDataSnapshotFullRefresh","0.002256","0.000014","Set"
"0.002327","10:23:29.224359","127.0.0.1","127.0.0.1","FIX","825","20150527-09:23:29.224","5367","6082","MarketDataSnapshotFullRefresh","0.002327","0.000071","Set"
"0.287124","10:23:29.509156","127.0.0.1","127.0.0.1","FIX","1079","20150527-09:23:29.508","5368","6082","MarketDataSnapshotFullRefresh","0.287124","0.284785","Set"
The main things of interest there are: 1/ regardless of the packet length (the biggest here is 2851), the PUSH flag is set on each packet; and 2/ the latency I'm measuring here is the difference between the "SendingTime" set on the message before it's encoded and the packet capture time "Time". The packet capture is being done on the same server as the Initiator that is sending the data. For a packet capture of 10,000 packets there is no great difference between "SendingTime" and "Time" when using loopback. For this reason I think I can eliminate the application as the cause of the sending latency.
When the acceptor is moved to another server on the LAN, the sending latency starts to get worse on packets that are greater than the MTU size. This is a snippet of a capture:
"No.","Time","Source","Destination","Protocol","Length","SendingTime (52)","MsgSeqNum (34)","Destination Port","Info","RelativeTime","Delta","Push"
"68.603270","10:35:18.820635","10.XX.33.115","10.XX.33.112","FIX","1223","20150527-09:35:18.820","842","6082","MarketDataSnapshotFullRefresh","68.603270","0.000183","Set"
"68.603510","10:35:18.820875","10.XX.33.115","10.XX.33.112","FIX","1223","20150527-09:35:18.820","843","6082","MarketDataSnapshotFullRefresh","68.603510","0.000240","Set"
"68.638293","10:35:18.855658","10.XX.33.115","10.XX.33.112","FIX","1514","20150527-09:35:18.821","844","6082","MarketDataSnapshotFullRefresh","68.638293","0.000340","Not set"
"68.638344","10:35:18.855709","10.XX.33.115","10.XX.33.112","FIX","1514","20150527-09:35:18.821","845","6082","MarketDataSnapshotFullRefresh","68.638344","0.000051","Not set"
What's significant here is when the packets are smaller than the MSS (derived from the MTU) then the PUSH flag is set and there is no sender latency. This would be expected as disabling Nagle's algorithm will be causing a PUSH to be set on these smaller packets. When the packet size is bigger than the MSS - a packet size of 1514 in this case - the difference between the time the packet is captured and the SendingTime has jumped to 35ms.
It doesn't seem likely that this 35ms latency is caused by the application encoding the messages, as large packet size messages were sent in <1ms on the loopback interface. The capture also takes place on the sender side, so it doesn't seem that MTU segmentation can be the cause either. The most likely reason seems to me that because there is no PUSH flag set - as the packet is larger than the MSS - the socket and/or TCP stack at the OS level does not decide to flush it until 35ms later. The test acceptor on the other server is not a slow consumer and is on the same LAN, so ACKs are timely.
Can anyone give any pointers as to what could cause this socket sending latency for > MSS packets? Against a real counterparty in the US this sender latency reaches as high as 300ms. I thought if a packet size was greater than the MSS then it would be sent immediately regardless of previous ACKs (as long as the socket buffer size was not exceeded). Netstat generally shows 0 socket queue and window sizes, and the issue seems to occur on all > MSS packets, even from startup. This looks like the socket is deciding not to flush immediately for some reason, but I'm unsure what factors could cause that.
Edit: As pointed out by EJP, there is no flush in Linux. The socket send puts the data in the Linux kernel's network buffers, as I understand it. And it seems that for these non-push packets, the kernel is waiting for the ACK of the previous packet before it delivers them. This isn't what I'd expect; in TCP I'd expect packets to keep being delivered until the socket buffers filled up.
This is not a comprehensive answer as TCP behaviour will differ depending on a lot of factors. But in this case, this was the reason for the problem we faced.
The congestion window, in the TCP congestion control implementation, allows an increasing number of packets to be sent without an acknowledgement as long as it doesn't detect signs of congestion, i.e. retransmissions. Generally speaking, when these occur, the congestion algorithm will reset the congestion window, limiting the packets that can be sent before an ACK is received. This manifests itself in the sender latency we witnessed, as packets were held in the kernel buffer awaiting acknowledgements for prior packets. There are no TCP_NODELAY, TCP_CORK etc. type instructions that will override the congestion control behaviour in this regard.
In our case this was made worse by a long round trip time to the other venue. However, as it was a dedicated line with very little packet loss per day, it was not retransmissions that were the cause of the congestion control kicking in. In the end it appears to have been solved by disabling the following flag in linux. This would also cause the congestion window to be reset, but through detecting idleness rather than packet loss:
"tcp_slow_start_after_idle - BOOLEAN
If set, provide RFC2861 behavior and time out the congestion
window after an idle period. An idle period is defined at
the current RTO. If unset, the congestion window will not
be timed out after an idle period.
Default: 1"
(Note if you face these issues it is also possible to investigate other forms of congestion control algorithm than the ones your kernel might be currently set up for).
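To get a feel for why a reset congestion window shows up as sender-side latency, here is a rough sketch that counts how many round trips a burst needs under a deliberately simplified slow-start model. The initial window of 3 segments, the doubling per round trip, and the 80 ms RTT are illustrative assumptions, not values measured in the captures above; a real stack also accounts for delayed ACKs, the receiver window, and the congestion control algorithm in use.

```java
/** Simplified slow-start model: cwnd starts at an initial window and doubles every round trip. */
final class SlowStartSketch {

    /** Number of round trips needed before all segments of a burst have left the sender. */
    static int roundsToSend(int segments, int initialWindow) {
        if (initialWindow <= 0) {
            throw new IllegalArgumentException("initialWindow must be positive");
        }
        int rounds = 0;
        long sent = 0;
        long cwnd = initialWindow;
        while (sent < segments) {
            sent += cwnd; // one window's worth of segments goes out per round trip
            cwnd *= 2;    // assumption: every segment is ACKed, so the window doubles
            rounds++;
        }
        return rounds;
    }

    /** Time the last segment of the burst waits in the kernel buffer before it can be sent. */
    static long queueingDelayMillis(int segments, int initialWindow, long rttMillis) {
        int rounds = roundsToSend(segments, initialWindow);
        return Math.max(0, rounds - 1) * rttMillis;
    }

    public static void main(String[] args) {
        // Hypothetical burst: 10 segments, initial window of 3, 80 ms round trip.
        System.out.println("round trips: " + roundsToSend(10, 3));
        System.out.println("delay of last segment (ms): " + queueingDelayMillis(10, 3, 80));
    }
}
```

In this model the delay scales with the round trip time, which is consistent with the latency being far worse against a distant counterparty than on the LAN.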
I am writing a program that transfers files over the network using TCP sockets.
Now I noticed that when I send a packet of, for example, 1024 bytes, it arrives split on the other side.
By "split" I mean I get some packets as if they were a part of a whole packet.
I tried reducing the packet size, and the algorithm worked only when the packets were extremely small (about 30 bytes each), so the file transferred very slowly.
Is there anything I can do in order to prevent the splitting?
SOLVED: I switched the connection to UDP, and since UDP preserves packet boundaries it worked.
There is no such thing in TCP. TCP is a stream: what you write is what you get at the other end. This does not mean you get it in the same pieces it was written in; TCP may split or merge writes in order to do the job as efficiently as possible. You can send an 8-megabyte buffer in one write and TCP may break it into 10, 100 or 1000 packets. What you need to know is that at the other end you will get exactly 8 megabytes, no more, no less. To do a file transfer effectively, you need to tell the receiver how many bytes you are going to send. The receiver may read it in one chunk or in 100 chunks, but it must keep track of how much data it has read and how many bytes remain.
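A minimal sketch of that bookkeeping, assuming the sender announces the file length first and the receiver loops until it has consumed exactly that many bytes. The class name, method names, and the 8-byte length header are hypothetical choices for illustration, not part of any existing protocol.

```java
import java.io.*;

final class FileTransferSketch {

    /** Sender side: announce the length, then write the bytes in one go. */
    static void send(byte[] fileBytes, OutputStream rawOut) throws IOException {
        DataOutputStream out = new DataOutputStream(rawOut);
        out.writeLong(fileBytes.length); // assumed 8-byte length header
        out.write(fileBytes);
        out.flush();
    }

    /** Receiver side: read the length, then keep reading until every byte has arrived. */
    static byte[] receive(InputStream rawIn) throws IOException {
        DataInputStream in = new DataInputStream(rawIn);
        long expected = in.readLong();
        if (expected < 0 || expected > Integer.MAX_VALUE) {
            throw new IOException("unsupported length: " + expected);
        }
        byte[] data = new byte[(int) expected];
        int received = 0;
        while (received < data.length) {
            // read() may return fewer bytes than requested; that is normal for a stream
            int n = in.read(data, received, data.length - received);
            if (n < 0) {
                throw new EOFException("stream ended after " + received + " of " + expected + " bytes");
            }
            received += n;
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for socket.getOutputStream() / getInputStream().
        byte[] file = new byte[100_000];
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        send(file, wire);
        byte[] copy = receive(new ByteArrayInputStream(wire.toByteArray()));
        System.out.println("received " + copy.length + " bytes");
    }
}
```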
Because TCP is stream oriented, it does not convey 'packet boundaries' the way UDP and SCTP do.
So you must add information about packet boundaries to the TCP payload, if it is not there already. There are several ways to do it:
You can use a length field for indicating how many bytes the following packet contains.
Or there could be a reserved symbol for separating different packets.
Either way, the receiver must keep reading from the TCP input stream until a complete packet has been received.
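The reserved-symbol approach could be sketched as follows, assuming a newline is the separator and that message bodies never contain one. The class and method names are made up for this example; for the length-field approach, DataOutputStream.writeInt and DataInputStream.readFully do the equivalent job.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

final class DelimiterFramingSketch {
    static final int DELIMITER = '\n'; // assumed reserved symbol

    /** Writes one message followed by the delimiter. */
    static void writeMessage(OutputStream out, String message) throws IOException {
        if (message.indexOf(DELIMITER) >= 0) {
            throw new IllegalArgumentException("message must not contain the delimiter");
        }
        out.write(message.getBytes(StandardCharsets.UTF_8));
        out.write(DELIMITER);
    }

    /** Reads until the next delimiter; returns null on a clean end of stream. */
    static String readMessage(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int b;
        // Byte-at-a-time for clarity; wrap the socket stream in a BufferedInputStream in practice.
        while ((b = in.read()) != -1) {
            if (b == DELIMITER) {
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
            buffer.write(b);
        }
        if (buffer.size() > 0) {
            throw new EOFException("stream ended in the middle of a message");
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        writeMessage(wire, "first");
        writeMessage(wire, "second");
        InputStream in = new ByteArrayInputStream(wire.toByteArray());
        for (String m = readMessage(in); m != null; m = readMessage(in)) {
            System.out.println(m);
        }
    }
}
```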
You can control the TCP maximum segment size in some socket implementations. If you set it low enough, you can make the segment fit inside a single packet. The BSD Sockets API, which influenced almost every other implementation, has a setsockopt() function that lets you set various options on the socket. One of them, TCP_MAXSEG, controls the maximum segment size.
Unfortunately for you, the standard Java Socket class doesn't support this particular option.
I have implemented a system similar to BitTorrent, and I would like to know at what size I should set the packets of each chunk. I was not able to find how BitTorrent does it, what size packets they use. I currently use 100 kilobyte packets, is that a lot?
TCP breaks data into packets automatically. You don't have to worry about the size of network packets.
The size of a TCP packet is constrained by the MTU (maximum transmission unit) of the network, typically around 1500 bytes. If you were making a game or a multimedia program where low latency is important you might have to keep in mind that data is sent in packets, but for a file transfer program it doesn't matter.
There is no such thing as a TCP packet. It's a byte stream. Under the hood it is broken into segments, in a way that is entirely out of your control, and further under the hood those segments are wrapped in IP packets, ditto.
Just write as much as you like in each write, the more the better.
I'm sending very large (64000 bytes) datagrams. I realize that the MTU is much smaller than 64000 bytes (a typical value is around 1500 bytes, from my reading), but I would suspect that one of two things would happen - either no datagrams would make it through (everything greater than 1500 bytes would get silently dropped or cause an error/exception to be thrown) or the 64000 byte datagrams would get chunked into about 43 1500 byte messages and transmitted transparently.
Over a long run (2000+ 64000 byte datagrams), about 1% (which seems abnormally high for even a LAN) of the datagrams get dropped. I might expect this over a network, where datagrams can arrive out of order, get dropped, filtered, and so on. However, I did not expect this when running on localhost.
What is causing the inability to send/receive data locally? I realize UDP is unreliable, but I didn't expect it to be so unreliable on localhost. I'm wondering if it's just a timing issue since both the sending and receiving components are on the same machine.
For completeness, I've included the code to send/receive datagrams.
Sending:
DatagramSocket socket = new DatagramSocket(senderPort);
int valueToSend = 0;
while (valueToSend < valuesToSend || valuesToSend == -1) {
    byte[] intBytes = intToBytes(valueToSend);
    // this makes sure that the data is put into an array of the size we want to send
    byte[] buffer = new byte[bufferSize - 4];
    byte[] bytesToSend = concatAll(intBytes, buffer);
    System.out.println("Sending " + valueToSend + " as " + bytesToSend.length + " bytes");
    DatagramPacket packet = new DatagramPacket(bytesToSend,
            bufferSize, receiverAddress, receiverPort);
    socket.send(packet);
    Thread.sleep(delay);
    valueToSend++;
}
Receiving:
DatagramSocket socket = new DatagramSocket(receiverPort);
// counters used below (declared here so the snippet is complete)
int expectedValue = 0;
int receivedDatagrams = 0;
int droppedDatagrams = 0;
int totalDatagrams = 0;
while (true) {
    DatagramPacket packet = new DatagramPacket(
            new byte[bufferSize], bufferSize);
    System.out.println("Waiting for datagram...");
    socket.receive(packet);
    int receivedValue = bytesToInt(packet.getData(), 0);
    System.out.println("Received: " + receivedValue
            + ". Expected: " + expectedValue);
    if (receivedValue == expectedValue) {
        receivedDatagrams++;
        totalDatagrams++;
    }
    else {
        droppedDatagrams++;
        totalDatagrams++;
    }
    expectedValue = receivedValue + 1;
    System.out.println("Expected Datagrams: " + totalDatagrams);
    System.out.println("Received Datagrams: " + receivedDatagrams);
    System.out.println("Dropped Datagrams: " + droppedDatagrams);
    System.out.println("Received: "
            + ((double) receivedDatagrams / totalDatagrams));
    System.out.println("Dropped: "
            + ((double) droppedDatagrams / totalDatagrams));
    System.out.println();
}
Overview
What is causing the inability to send/receive data locally?
Mostly buffer space. Imagine sending a constant 10MB/second while only able to consume 5MB/second. The operating system and network stack can't keep up, so packets are dropped. (This differs from TCP, which provides flow control and re-transmission to handle such a situation.)
Even when data is consumed without overflowing buffers, there might be small time slices where data cannot be consumed, so the system will drop packets. (Such as during garbage collection, or when the OS task switches to a higher-priority process momentarily, and so forth.)
This applies to all devices in the network stack. A non-local network, an Ethernet switch, router, hub, and other hardware will also drop packets when queues are full. Sending a 10MB/s stream through a 100MB/s Ethernet switch while someone else tries to cram 100MB/s through the same physical line will cause dropped packets.
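As a back-of-the-envelope illustration of how quickly a buffer fills, here is a small sketch that plugs in the 10MB/s versus 5MB/s example. The 128 KiB buffer and the 64000-byte datagram size are assumptions chosen to match figures mentioned elsewhere in this thread, and the calculation ignores per-packet bookkeeping overhead in the kernel, so real headroom is somewhat smaller.

```java
final class BufferHeadroomSketch {

    /** Milliseconds until a buffer overflows when data arrives faster than it is consumed. */
    static double millisUntilOverflow(long bufferBytes, double inBytesPerSec, double outBytesPerSec) {
        double net = inBytesPerSec - outBytesPerSec;
        if (net <= 0) {
            return Double.POSITIVE_INFINITY; // the consumer keeps up, so the buffer never fills
        }
        return bufferBytes / net * 1000.0;
    }

    /** How many whole datagrams of a given size fit into the buffer at once. */
    static long datagramsThatFit(long bufferBytes, int datagramBytes) {
        return bufferBytes / datagramBytes;
    }

    public static void main(String[] args) {
        long buffer = 128 * 1024; // assumed default-sized socket buffer
        System.out.println("ms until overflow: " + millisUntilOverflow(buffer, 10_000_000, 5_000_000));
        System.out.println("64000-byte datagrams that fit: " + datagramsThatFit(buffer, 64_000));
    }
}
```

With these numbers the buffer is full after roughly 26 ms and holds only two large datagrams, so even a brief pause in the receiver is enough to lose data.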
Increase both the socket's buffer size and the operating system's socket buffer size.
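On the application side, a Java receiver could request a larger buffer roughly like this. It is only a sketch: the 8 MB figure is an arbitrary example, the class and method names are hypothetical, and the value passed to setReceiveBufferSize is a hint that the operating system may cap (on Linux, at net.core.rmem_max), which is why the code reads the granted size back.

```java
import java.net.DatagramSocket;
import java.net.SocketException;

final class SocketBufferSketch {

    /** Requests a receive buffer size and returns what the OS actually granted. */
    static int requestReceiveBuffer(DatagramSocket socket, int requestedBytes) throws SocketException {
        socket.setReceiveBufferSize(requestedBytes); // a hint; the OS may grant less
        return socket.getReceiveBufferSize();
    }

    public static void main(String[] args) throws Exception {
        // Port 0 lets the OS pick a free port; a real receiver would use its own port.
        try (DatagramSocket socket = new DatagramSocket(0)) {
            int requested = 8 * 1024 * 1024;
            int granted = requestReceiveBuffer(socket, requested);
            System.out.println("requested " + requested + " bytes, granted " + granted + " bytes");
        }
    }
}
```

If the granted size is much smaller than the requested one, the operating system limit needs raising first.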
Linux
The default socket buffer size is typically 128k or less, which leaves very little room for pausing the data processing.
sysctl
Use sysctl to increase the transmit (write memory [wmem]) and receive (read memory [rmem]) buffers:
net.core.wmem_max
net.core.wmem_default
net.core.rmem_max
net.core.rmem_default
For example, to bump the value to 8 megabytes:
sysctl -w net.core.rmem_max=8388608
To make the setting persist, update /etc/sysctl.conf as well, such as:
net.core.rmem_max=8388608
An in-depth article on tuning the network stack dives into far more details, touching on multiple levels of how packets are received and processed in Linux from the kernel's network driver through ring buffers all the way to C's recv call. The article describes additional settings and files to monitor when diagnosing network issues. (See below.)
Before making any of the following tweaks, be sure to understand how they affect the network stack. There is a real possibility of rendering your network unusable. Choose numbers appropriate for your system, network configuration, and expected traffic load:
net.core.rmem_max=8388608
net.core.rmem_default=8388608
net.core.wmem_max=8388608
net.core.wmem_default=8388608
net.ipv4.udp_mem='262144 327680 434274'
net.ipv4.udp_rmem_min=16384
net.ipv4.udp_wmem_min=16384
net.core.netdev_budget=600
net.ipv4.ip_early_demux=0
net.core.netdev_max_backlog=3000
ethtool
Additionally, ethtool is useful to query or change network settings. For example, if ${DEVICE} is eth0 (use ip address or ifconfig to determine your network device name), then it may be possible to increase the RX and TX buffers using:
ethtool -G ${DEVICE} rx 4096
ethtool -G ${DEVICE} tx 4096
iptables
By default, netfilter tracks the connection state of packets (connection tracking), which consumes CPU time, albeit minimal. For example, you can disable connection tracking of UDP packets on port 6004 using:
iptables -t raw -I PREROUTING 1 -p udp --dport 6004 -j NOTRACK
iptables -I INPUT 1 -p udp --dport 6004 -j ACCEPT
Your particular port and protocol will vary.
Monitoring
Several files contain information about what is happening to network packets at various stages of sending and receiving. In the following list ${IRQ} is the interrupt request number and ${DEVICE} is the network device:
/proc/cpuinfo - shows number of CPUs available (helpful for IRQ-balancing)
/proc/irq/${IRQ}/smp_affinity - shows IRQ affinity
/proc/net/dev - contains general packet statistics
/sys/class/net/${DEVICE}/queues/QUEUE/rps_cpus - relates to Receive Packet Steering (RPS)
/proc/softirqs - shows per-CPU softirq counts (such as NET_RX and NET_TX)
/proc/net/softnet_stat - for packet statistics, such as drops, time squeezes, CPU collisions, etc.
/proc/sys/net/core/flow_limit_cpu_bitmap - shows packet flow (can help diagnose drops between large and small flows)
/proc/net/snmp
/proc/net/udp
Summary
Buffer space is the most likely culprit for dropped packets. There are numerous buffers strewn throughout the network stack, each having its own impact on sending and receiving packets. Network drivers, operating systems, kernel settings, and other factors can affect packet drops. There is no silver bullet.
Further Reading
https://github.com/leandromoreira/linux-network-performance-parameters
http://man7.org/linux/man-pages/man7/udp.7.html
http://www.ethernetresearch.com/geekzone/linux-networking-commands-to-debug-ipudptcp-packet-loss/
UDP packet scheduling may be handled by multiple threads at the OS level. That would explain why you receive them out of order even on 127.0.0.1.
Your expectations, as expressed in your question and in numerous comments to other answers, are wrong. All the following can happen even in the absence of routers and cables.
1) If you send a packet to any receiver and there is no room in its socket receive buffer it will get dropped.
2) If you send a UDP datagram larger than the path MTU it will get fragmented into smaller packets, which are subject to (1).
3) If all the packets of a datagram don't arrive, the datagram will never get delivered.
4) The TCP/IP stack has no obligation to deliver packets or UDP datagrams in order.
UDP packets are not guaranteed to reach their destination whereas TCP is!
I don't know what makes you expect a percentage less than 1% of dropped packets for UDP.
That being said, RFC 1122 (see section 3.3.2) only requires every host to be able to accept and reassemble datagrams of up to 576 bytes. Larger UDP datagrams may be transmitted, but once they exceed the path MTU they will be split into multiple IP fragments to be reassembled at the receiving end point.
I would imagine that a reason contributing to the high rate of dropped packets you're seeing is that if one IP packet that was part of a large UDP datagram is lost, the whole UDP datagram will be lost. And you're counting UDP datagrams - not IP packets.
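To put rough numbers on this, the sketch below estimates how many IP fragments a UDP datagram needs and how per-fragment loss compounds into datagram loss. It assumes IPv4 without options, an MTU of 1500 as in the question's reasoning (the loopback interface usually has a much larger MTU), and independent losses; the class and method names are hypothetical.

```java
final class FragmentationSketch {
    static final int IPV4_HEADER = 20; // no IP options assumed
    static final int UDP_HEADER = 8;

    /** Number of IP fragments needed to carry one UDP datagram with the given payload size. */
    static int fragmentsFor(int udpPayloadBytes, int mtu) {
        int ipPayload = udpPayloadBytes + UDP_HEADER;
        // Every fragment except the last must carry a multiple of 8 bytes of data.
        int perFragment = ((mtu - IPV4_HEADER) / 8) * 8;
        return (ipPayload + perFragment - 1) / perFragment; // ceiling division
    }

    /** Probability that a datagram is lost, assuming each fragment is lost independently. */
    static double datagramLossProbability(int fragments, double fragmentLossProbability) {
        return 1.0 - Math.pow(1.0 - fragmentLossProbability, fragments);
    }

    public static void main(String[] args) {
        int fragments = fragmentsFor(64_000, 1500);
        System.out.println("fragments per 64000-byte datagram: " + fragments);
        // Hypothetical per-fragment loss rate of 0.025%.
        System.out.println("datagram loss probability: " + datagramLossProbability(fragments, 0.00025));
    }
}
```

Because losing any one fragment loses the whole datagram, the datagram loss rate ends up many times higher than the per-fragment loss rate.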
I want to attempt to calculate how much data (bytes) I send/receive over the network. I send/receive both TCP and UDP packets, so I need to be able to calculate the size of these packets including their respective headers. I looked at this question: Size of empty UDP and TCP packet, and it lists the minimum size of the header, but is that liable to change? Should I just add the number of bytes I send in the packet plus the size of the minimum header? Also, I know at some point (n bytes) the data would be too big to fit in just one packet.
One other thing, the client is a mobile device, so it may receive over cellular or wifi. I am not sure if there is a difference in the packet size between the two, but I would probably just want to assume whichever is larger.
So my questions are, assuming the data is n bytes long:
1) How big would the TCP packet be, assuming it all fits in one packet?
2) How big would the UDP packet be, assuming it all fits in one packet?
3) Is there an easy way to determine the number of bytes it would take to overrun one packet? For both TCP and UDP.
Let's assume we're only talking about ethernet and IPv4
Look at your interface MTU, which has already subtracted the size of the ethernet headers on the OSes I can remember (linux and FreeBSD)
Subtract 20 bytes for a normal IP header (no IP options)
Subtract 20 bytes for a normal TCP header
Or
Subtract 8 bytes for a UDP header
That is how much data you can pack into one IPv4 packet. So, if your TCP data is n bytes long, your total ethernet payload is (n + 20 + 20); your ethernet payload for UDP is (n + 20 + 8).
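The arithmetic above can be written down as a short sketch, assuming IPv4 with no IP options and a TCP header with no TCP options (real TCP segments often carry options such as timestamps, which add to the header). The class and method names are made up for illustration.

```java
final class PacketSizeSketch {
    static final int IPV4_HEADER = 20; // no IP options assumed
    static final int TCP_HEADER = 20;  // no TCP options assumed
    static final int UDP_HEADER = 8;

    /** Ethernet payload size for n bytes of TCP data in a single packet. */
    static int tcpPacketSize(int n) {
        return n + IPV4_HEADER + TCP_HEADER;
    }

    /** Ethernet payload size for n bytes of UDP data in a single packet. */
    static int udpPacketSize(int n) {
        return n + IPV4_HEADER + UDP_HEADER;
    }

    /** Largest amount of TCP data that still fits in one packet for the given MTU. */
    static int maxTcpPayload(int mtu) {
        return mtu - IPV4_HEADER - TCP_HEADER;
    }

    /** Largest amount of UDP data that still fits in one packet for the given MTU. */
    static int maxUdpPayload(int mtu) {
        return mtu - IPV4_HEADER - UDP_HEADER;
    }

    public static void main(String[] args) {
        int mtu = 1500; // typical ethernet MTU
        System.out.println("max TCP payload: " + maxTcpPayload(mtu));
        System.out.println("max UDP payload: " + maxUdpPayload(mtu));
        System.out.println("100 bytes over TCP: " + tcpPacketSize(100));
        System.out.println("100 bytes over UDP: " + udpPacketSize(100));
    }
}
```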
EDIT FOR QUESTIONS
RE: MTU
Your interface MTU is the largest ethernet payload that your drivers will let you encapsulate onto the wire. I subtract because we're assuming we start from the MTU and work up the encapsulation chain (i.e. eth -> ip -> tcp|udp); you can't send TCP or UDP without an IP header, so that must be accounted for as well.
RE: Calculating application overhead
Theoretical calculations about the overhead your application will generate are fine, but I suggest lab testing if you want meaningful numbers. Usage factors like average data transfer per client session, client hit rate per minute and concurrent clients can make a difference in some (unusual) cases.
It is sadly not possible to determine this completely. Packets might be split, reassembled etc. by network hardware all along the path to the receiver, so the exact number of bytes on the wire cannot be calculated with certainty.
Ethernet defines a maximum payload of 1500 bytes per frame, which leaves 1460 bytes for data once the IP and TCP headers are subtracted. Jumbo frames of up to 9k bytes are usually only supported locally; when such a packet reaches the WAN, it will be fragmented.