## Sunday, February 5, 2012

### Calculating TCP RTO...

I was reading RFC6298 on RTO calculation and decided to try to see within Linux kernel how and where it is calculated. Basically, RTO, or Retransmittion Timeout, determines how long TCP waits for acknowledgment (ACK) of transmitted segment. If the acknowledgment isn't received within this time it is deemed lost. Actually, ACK could be lost too, but there is no way for sender to differentiate between those two cases, as illustrated by the following figure:

Thus I'll treat them equally and always refer to segment loss.

The important part of calculating RTO is to determine how long it takes for a segment to go to the receiver and for ACK to come back from receiver to sender. This is a Round Trip Time, or RTT. In some ideal world (and very static for that matter) this value would be constant and would never change. And RTO would be easy to determine, it is equal to RTT, maybe slightly slightly larger, but nevertheless the two would be almost equal.

But we are not living in an ideal word and this process is complicated by the fact that network conditions constantly change. Not only that, but receiver has also certain freedom to chose when it will return ACK, though this time has upper bound of 500ms. So, RTT is never used directly as RTO, some additional calculations are used. The key is to estimate as good as possible the true value of RTT that will be experienced by the next segment to be transmitted and in the same time avoid abrupt changes resulting in transient conditions, and not to react too slow on network condition changes.This is illustrated with the following figure:

In order to achieve that, two new variables are introduced, smoothed RTT, or short SRTT, and RTT variance, or RTTVAR. Those two variables are updated, whenever we have a new RTT measurement, like this (taken from the RFC6298):
RTTVAR <- (1 - beta) * RTTVAR + beta * |SRTT - R'|
SRTT <- (1 - alpha) * SRTT + alpha * R'
alpha and beta are parameters that determine how fast we forget the past. If this parameter is too small new measurements will have little influence on our current understanding of expected RTT and we will slowly react to changes. If, on the other hand, alpha approaches 1 then the past will not influence our current estimation of RTT and it might happen that a single RTT was huge for whatever reason and that suddenly we have wrong estimation. Not only that, but we could have erratic behavior of SRTT. So, alpha and beta parameters have to be carefully selected. The values recommended by RFC are alpha=1/8 and beta=1/4.

Finally, RTO is calculated like this:
RTO <- SRTT + max (G, K*RTTVAR)
Constant K is set to 4, and G is a clock granularity in seconds, i.e. if you get timer interrupt each second, then G is 1 second. The max function is used so that, e.g. you don't get 400ms for K*RTTVAR and try to wait for that long while your clock has resolution of 1s. In that case, 1s will prevail and will be selected as a variance.

Initial values

Still, there is a question of initial values, i.e. what to do when first SYN segment is sent? In that case RFC specifies you have to set RTO to 1 second, which is actually lower than specified in the previous RFC that mandated minimum value of 3 seconds. When first acknowledgment returns its RTT value is stored into SRTT and variance is set to RTT/2. Then, RTO is calculated as usual.

Some complications

There are some additional rules that are required by RFC. First, if calculated RTO is less than 1s, then it has to be rounded to 1second. It is a minimul RTO value allowed by RFC.

Next, in case some segment is retransmitted, then when acknowledgement arrives it is not taken into calculation of SRTT and RTTVAR. This is called Karn's algorithm, even though it is not algorithm but more a rule. The reason for it is that it is impossible to know if this is acknowledgement for a first transmission, or for retransmission and thus we could skew SRTT. This ambiguity is illustrated with the following figure:

But, there is possibility for TCP to negotiate timestamp option on a certain connection and in that case, the previous ambiguity is resolved so each ACK can be used to calculate SRTT and RTTVAR.

Implementation within Linux kernel

Now, let us see how and where is this implemented within the Linux kernel. I'm using the latest stable kernel at the time this post was written and that is 3.2.4. If you are looking at some later or earlier kernel, then the things might be more or less different.

The call graph of the relevant functions is shown in the following figure:

The main function to calculate RTO is tcp_valid_rtt_meas() which updates RTT estimation and sets new RTO for future segments that will be sent. It is called by two functions, tcp_ack_saw_tstamp() which processes ACK that has embedded timestamp option, or tcp_ack_no_tstamp() which processes ACK without timestamp option. In both cases, tcp_valid_rtt_meas() is called with socket structure that determines to which connection this measurement belongs to and also measured RTT value.

But before describing functions that do calculations, first we are going to describe used data structures. Actually, we'll describe only those elements that are used to calculate RTO.

Data structures

The main data structure passed between all the functions we describe is struct sock. This is a fairly large structure used to describe network layer data about every possible type of socket. Every socket from any higher layer has this structure placed at the beginning. The following figure illustrates this:

In our case, the higher layer structure we are interested in is TCP. The structure used to describe TCP sockets is struct tcp_sock. So, when our functions get struct sock as argument, they use use tcp_sk inline function to convert (cast!) it into struct tcp_sock data structure. Note that, if you think a little about it, this tcp_sk inline function actually is a no-op after compilation! It's only purpose is casting which is high-level thing, not something important for assembly/machine code.

Anyway, in struct tcp_sock there is a set of variables used for RTO calculation:
/* RTT measurement */
u32     srtt;      /* smoothed round trip time << 3        */
u32     mdev;      /* medium deviation                     */
u32     mdev_max;  /* maximal mdev for the last rtt period */
u32     rttvar;    /* smoothed mdev_max                    */
u32     rtt_seq;   /* sequence number to update rttvar     */

In there we note two expected variables, srtt and rttvar, but also several other ones. Also what is important to realize is that srtt var contains number of seconds shifted to left by three bits, i.e. multiplied by 8. So, in order to store 1 second in srtt you'll have to write there number 8. In other words, every value counts as 125ms, so if you want to store 0.5s you'll write there number 4, i.e. 4*125ms = 500ms. Similarly, mdev is counted in units of 250ms, i.e. its value is four time smaller and to store there 1s you need to write number 4 (4*250ms = 1s).

We'll see later that this way of storing data, along with the requirements for calculating RTO specified in RFC, allow for very efficient (if somewhat obscured) code to determine srtt and rttvar.

As indicated in comments embedded in code, there are additional variables that allow RTTVAR to be updated every RTT time units. The mechanism to achieve that is the following. At each ACK, mdev is calculated and if this mdev is higher than the current highest one (mdev_max) then it is stored into mdev_max field. When RTT time units passes, mdev_max is used to update RTTVAR. To know when RTT time units passed, the field rtt_seq is used. Namely, the sequence number within ACK received is checked against rtt_seq, and if it is larger than rtt_seq than RTTVAR update is triggered and in the same time rtt_seq is set to sequence number that will be used in the next outgoing segment (snd_nxt, also part of the tcp_sock structure).

Functions

Now that we have described relevant data structures, let us turn our attention to the functions themselves. We'll start from the end, i.e. from the calculation of RTO.

tcp_set_rto()

This is the function that calculates current RTO, even though it actually does this via a call to inline function __tcp_set_rto() . RTO is calculated as follows:
(tp->srtt >> 3) + tp->rttvar
Which is actually the same as in RFC apart from right shift. The right shift is necessary because srtt is expressed in 8 times smaller units and has to be normalized before being added to rttvar. rttvar, on the other hand, is coded "normally", i.e. number one means 1 second.

The function tcp_set_rto() also makes sure that RTO isn't larger than TCP_RTO_MAX, which is set to 120 seconds (i.e. 120*HZ).

tcp_rtt_estimator()

This function, in addition to struct sock parameter that holds data for TCP connection whose SRTT and RTTVAR variables should be updated, also receives measured RTT value for the received acknowledgment, i.e.
static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt);
Update process is done in a highly optimized way, which makes things a bit obscure at the first sight. This is the consequence of the fact that srtt and mdev are counted in units of 125ms and 250ms, respectively (as explained in the subsection about data structures).

So, the first thing that's done within this function is the following (slightly modified for explanation):
mrtt -= (tp->srtt >> 3);
Now, mrtt holds value R' - SRTT (using notation from RFC). To calculate new SRTT the following line, immediatelly after the previous one, is used:
srtt += mrtt;
And that's it for SRTT! But, it is easier to understand this calculation if we write it as follows:
mrtt' = mrtt - srtt/8
srtt = srtt + mrtt' = srtt + mrtt - srtt/8 = 7/8 * srtt + mrtt
which is actually the equation given in RFC. Again, srtt is 8 times larger so I can normalize it to show that the value will be correct:
8*rsrtt = 7/8 * 8 * rsrtt + mrtt (divide by 8 yields):
rsrtt = 7/8 rsrtt + mrtt/8
I'm using rsrtt to denote real srtt.

It also has to update mdev, which is done with the following line (note that mrtt isn't changed while the previous calculations were performed, i.e. it is still R' - SRTT):
mrtt -= (tp->mdev >> 2);
tp->mdev += mrtt;
again, I slightly simplified the code. The simplification is related to the fact that I'm assuming mrtt is positive after substraction that changes it into R' - SRTT. Again, the trick is used that mdev is stored in 4 time smaller units.

When mdev is calculated, it is checked against mdev_max and if it is larger, then mdev_max is updated to a new value:
if (tp->mdev > tp->mdev_max) {
tp->mdev_max = tp->mdev;
if (tp->mdev_max > tp->rttvar)
tp->rttvar = tp->mdev_max;
}

One more thing is done too, if mdev_max is larger then rttvar then rttvar is also immediately updated with a new value. Note a trick here. RFC requires RTTVAR to be multiplied by a constant K which is set to be 4! This is accomplished with assigning mdev_max (which is already multiplied by 4) directly to rttvar!

What's left to do is to check if RTT time units has passed from last (regular) update to RTTVAR, and if it is then it's time to update it again. This is done within the following code fragment:

if (after(tp->snd_una, tp->rtt_seq)) {
if (tp->mdev_max < tp->rttvar)
tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
tp->rtt_seq = tp->snd_nxt;
tp->mdev_max = tcp_rto_min(sk);
}
As we said when we were talking about data structures, indicator that RTT units has passed is signaled with sequence number within ACK being after saved value in snd_una field. You might wonder why simple less-then operator isn't used? The reason is that sequence numbers might wrap around so it is necessary to take that into account!

Note that if rtt_var is larger than mdev_max nothing happens, i.e. this code only decreases the value of rttvar! But, if it is smaller, then rttvar is adjusted by the following quantity (as per RFC):
rttvar - (rttvar - mdev) / 4 = 3/4 * rttvar + mdev/4
Again, we have some trickery with different scales of rttvar and mdev. You can understand it as follows: New rttvar consists of 3/4 old rttvar. rttvar itself is multiplied by 4 (i.e. by constant K from RFC). RFC also specifies that mdev must participate with 1/4 (i.e. factor beta). Additionaly, mdev is already 4 times larger, and thus, it is already pre-multiplied by constant K! Thus, it can be added without further multiplications.

One more thing left to explain is the new value of mdev_max and the function tcp_rto_min that is called to provide it. This function hides some complexity, but in the simplest possible case it will return constant TCP_RTO_MIN which has value 200ms (HZ/5). In more general case, the ip command from iproute package allows RTT and RTTVAR to be specified per destination so this function checks if it is specified and if it is then returns given value.

The special case for this function is the initial state, i.e. when the connection is opened. In that case argument mrtt will be zero, and also srtt will be zero. If mrtt is zero, assumed value is 1, and note that it is the initial RTT defined by RFC. srtt being zero triggers it's initialization:
tp->srtt = mdev << 3;
tp->mdev = mdev << 1;
Basically, transcodes mdev into appropriate units and stores the value into srtt (i.e. mdev is 1, so into srtt has to be stored 8). At first, it might seem that mdev is calculated wrongly, but according to RFC, it's initial value has to be half of the initial RTT. This translates into:
mdev<<2 / 2 = mdev*4/2 = mdev*2 = mdev <<1
So, mdev's initial value is correct!

And that's it. I intend also to describe TCP congestion control within Linux kernel in some future post.

Linze said...

Beautiful Article!
Waiting for Cogestion Article!

any date? :)

Stjepan Groš (sgros) said...

Glad to hear you liked the post. As for sequel, I have to admit I forgot on it. But, I'll find some time to write it, hopefully before New Year.

Ewoud said...

Thx for this!

Tim Watson said...

Nice work, this is great!