Account for the Bandwidth-Delay Product with Larger Network Buffers

This is post #14 in my December 2013 series about Linux Virtual Machine Performance Tuning. For more, please see the tag “Linux VM Performance Tuning.”

At times we can refer to certain network concepts with a “pipe” analogy, where bandwidth is the diameter of the pipe and latency is the length of the pipe. A garden hose has a certain limited amount of “bandwidth” because it has a small interior diameter, and has a lot of latency because it is so long. Water you put in one end of the hose takes a while to come out the other end. A house in the USA likely has a main sewer pipe with 4 inches of “bandwidth” so it can move more, uh, “data” simultaneously, but the latency, or time it takes to traverse the length of the pipe, might be the same as the garden hose.

It’s the same on computer networks. Network links have sizes, like 100 Mbps, 1 Gbps, 10 Gbps, 40 Gbps, etc., which correspond to how much data can move at once. Those network links also have a length to them, measured in milliseconds. Over distance that length is governed mostly by the speed of light (whoa, physics!), plus the time it takes for intermediate routers and repeaters to help the signal along. Using the “ping” utility you can see how long it takes:

$ ping drsite
PING drsite (10.10.10.10) 56(84) bytes of data.
64 bytes from drsite (10.10.10.10): icmp_seq=1 ttl=58 time=8.97 ms
64 bytes from drsite (10.10.10.10): icmp_seq=2 ttl=58 time=8.98 ms

In a sewer pipe we can figure out how much water can be in the pipe all at once by calculating the volume of the interior. A 4″ inner diameter pipe 50 feet long holds about 33 gallons of water. The same idea applies to long network links, like those used for DR replication. We call it the Bandwidth-Delay Product, or BDP. In order to get the most from our network links we need to keep those pipes full of data. BDP is important because it tells us how much data that is, and gives us guidance for how big to make our OS network buffers to make that happen.

We start by figuring out what our BDP is. I have a 1 Gbps link to my DR site, and ping says it’s 8.98 milliseconds there and back. With BDP we use the “round trip time” (RTT) because TCP has to acknowledge the receipt of the data on the other side, too, and that acknowledgement has to travel all the way back to us. A ping, or ICMP echo, measures exactly that round trip, so my RTT here is 8.98 ms.

First, we convert my link speed to bytes, since that’s what an OS uses for buffer sizing. Keep in mind that networking counts 1000 bits as a kilobit, unlike storage, where 1024 bytes make a kilobyte.

(1,000,000,000 bits/second) / 8 bits per byte = 125,000,000 bytes/second.

Second, we convert the round trip time to seconds:

8.98 ms / (1000 ms/second) = 0.00898 seconds

Third, we multiply the two:

125,000,000 bytes/second * 0.00898 seconds = 1,122,500 bytes.
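If you’d rather let the machine do the math, here’s a quick shell sketch of the same calculation; speed_bps and rtt_ms are just placeholders for my link speed and round trip time, so substitute your own:

$ awk -v speed_bps=1000000000 -v rtt_ms=8.98 'BEGIN { printf "%.0f bytes\n", (speed_bps/8) * (rtt_ms/1000) }'
1122500 bytes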

My link to the DR site can “hold” that much data at once. Now we consider our operating system, in my case Red Hat Enterprise Linux 6. Recent Linux kernels (2.4.27+ and 2.6.7+) have enabled network stack autotuning, which is wonderful. It removes the need for a lot of the old-style tuning we had to do. You can check to see if it is enabled with:

$ cat /proc/sys/net/ipv4/tcp_moderate_rcvbuf
1

The “1” means that it is. So let’s check the per-connection defaults in /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem:

$ cat /proc/sys/net/ipv4/tcp_rmem
4096 87380 4194304
$ cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 4194304

These arrays are the minimum, default initial, and maximum buffer sizes the autotuning can use, in bytes. Note that 4,194,304 is greater than my 1,122,500, so we’re good there. You can certainly raise the maximums, but unless your BDP is bigger than that it won’t change anything.
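For example, if this same link were 10 Gbps the BDP would be about 11 MB, larger than the 4 MB maximum, and you’d want to raise the third value. A sketch of what that might look like in /etc/sysctl.conf, using 16 MB as an arbitrary ceiling rather than any magic number:

net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216

The minimum and default values can stay where they are; only the maximum needs to grow for the autotuning to take advantage of it.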

We have one more buffer to check, and that’s the maximum buffer size for an application:

$ cat /proc/sys/net/core/rmem_max
124928
$ cat /proc/sys/net/core/wmem_max
124928

This basically says that, despite our autotuning maximums, an application that requests its own buffer size only has access to about 125 KB. We probably want to change this so an application that requests more can have it. Applications that set their buffers explicitly this way (via setsockopt) disable autotuning for that connection, but that might be okay since the application probably knows what it’s doing. We’ll set it to the autotuning max by adding the following to /etc/sysctl.conf (and running “sysctl -p” to reload the config):

net.core.wmem_max = 4194304
net.core.rmem_max = 4194304
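Once that’s in place and you’ve run “sysctl -p”, a quick check that the new values took effect:

$ cat /proc/sys/net/core/rmem_max
4194304
$ cat /proc/sys/net/core/wmem_max
4194304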

Last, we should make sure that TCP timestamps, window scaling, and selective acknowledgement are also on:

$ cat /proc/sys/net/ipv4/tcp_timestamps
1
$ cat /proc/sys/net/ipv4/tcp_window_scaling
1
$ cat /proc/sys/net/ipv4/tcp_sack
1

Timestamps help the autotuning mechanisms determine proper buffer sizing, though security professionals have pointed out that they can let attackers learn things like system uptime. As with most security decisions, disabling them comes at a price. Window scaling lets TCP advertise receive windows larger than 64 KB, which the autotuning algorithms need in order to adapt to large amounts of incoming data. Last, TCP selective acknowledgement means that when packets go missing the receiving side can tell the sender exactly which ones arrived (“I got packets 1-4000, but 2001 is missing.”), so only the lost data is retransmitted rather than everything after it, which saves bandwidth.
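If any of those come back as 0 you can turn them back on in /etc/sysctl.conf the same way as before; on a recent distribution all three should already be enabled, so consider this a sketch for completeness:

net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1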

I’ve adjusted these tunables for years, and in my case I actually had to remove some of my tuning once autoconfiguration came along because I was defeating it with old settings. You might be in that position, too. In most cases I’d expect people to be reasonably happy with the recent OS defaults inside a traditional data center, but as 40 & 100 Gbps Ethernet comes along, or you start replicating data between your Amazon AWS instances in Tokyo and Virginia, USA, this will be something to think about.

I want to acknowledge the Pittsburgh Supercomputing Center’s work in this area. While this post draws from about 18 years of notes I maintain, many of those notes likely originated from their timeless TCP tuning web page. They keep it relatively up to date for many different OSes, and also include discussion of other topics like congestion control algorithms.

Last, if you want an interesting story about distance, latency, the speed of light, and applications you might check out Trey Harris’ account of “The Case of the 500 Mile Email.”

Comments on this entry are closed.

  • Ping time is already a round trip time.
    Don’t multiply by 2.

    • Good catch, I adjusted the ping times to make it work. :)