Everything about nothing: code


Wednesday, January 13, 2016

Connections in NetworkManager

Connections, as defined and used by NM, are very close to PvDs. The goal of this post is to analyse the data structures and functions for connections within NetworkManager so that a plan for integrating PvDs into NM can be developed. That plan is covered in a separate post.

A definition of connection in NetworkManager can be found in the comment within the libnm-core/nm-connection.c file:

An #NMConnection describes all the settings and configuration values that are necessary to configure network devices for operation on a specific network. Connections are the fundamental operating object for NetworkManager; no device is connected without a #NMConnection, or disconnected without having been connected with a #NMConnection.

Each #NMConnection contains a list of #NMSetting objects usually referenced by name (using nm_connection_get_setting_by_name()) or by type (with nm_connection_get_setting()). The settings describe the actual parameters with which the network devices are configured, including device-specific parameters (MTU, SSID, APN, channel, rate, etc) and IP-level parameters (addresses, routes, addressing methods, etc).

In the following text we'll see how connections are implemented in the NM code, how they are initialized and how they are accessed over the DBus. Note that there are some parts related to connections that are specific for a system/distribution on which NM is running. In that case I concentrate on how things are done on Fedora (and very likely all RHEL derivatives).

Class and interface hierarchy

The base for all connection objects in NetworkManager is the interface defined in the files libnm-core/nm-connection.[ch]. The interface is implemented by the following classes:

Class NMSettingsConnectionClass, defined in the files src/settings/nm-settings-connection.[ch]. This class is used by the NetworkManager daemon and it is exported via the DBus interface.

Class NMRemoteConnectionClass, defined in the files libnm/nm-remote-connection.[ch]. It is used by the clients in the clients subdirectory.

Class NMSimpleConnectionClass, defined in the files libnm-core/nm-simple-connection.[ch]. This is the object passed via DBus, so it is used for communicating connections between NetworkManager and its clients.

Accessing individual connection data stored in NetworkManager

Each connection, active or not, known to the NetworkManager is exposed through DBus on path /org/freedesktop/NetworkManager/Settings/%u where %u is a sequence number assigned to each connection. Interface implemented by each connection is org.freedesktop.NetworkManager.Settings.Connection. The interface is described in introspection/nm-settings-connection.xml file.

To invoke a method defined in interface on the given object you can use dbus-send command line tool like this:

dbus-send --print-reply --system \
--dest=org.freedesktop.NetworkManager \
/org/freedesktop/NetworkManager/Settings/0 \
org.freedesktop.NetworkManager.Settings.Connection.GetSettings

In this particular case we are invoking GetSettings method on object /org/freedesktop/NetworkManager/Settings/0 which will return us its configuration parameters. Note that invoking this particular method is easy since there are no arguments to the method.

The class that represents connection, and that is answering to DBus messages, is declared in src/settings/nm-settings-connection.[ch] files. This class implements interface NM_TYPE_CONNECTION and also subclasses NM_TYPE_EXPORTED_OBJECT class. The NM_TYPE_EXPORTED_OBJECT class has all the methods necessary to expose the object on DBus.

To see what functions are called when DBus methods are called take a look at the end of the source file nm-settings-connection.c. There, you'll find the following code:

nm_exported_object_class_add_interface (NM_EXPORTED_OBJECT_CLASS (class),
NMDBUS_TYPE_SETTINGS_CONNECTION_SKELETON,
"Update", impl_settings_connection_update,
"UpdateUnsaved", impl_settings_connection_update_unsaved,
"Delete", impl_settings_connection_delete,
"GetSettings", impl_settings_connection_get_settings,
"GetSecrets", impl_settings_connection_get_secrets,
"ClearSecrets", impl_settings_connection_clear_secrets,
"Save", impl_settings_connection_save,
NULL);

What this code does is bind DBus methods to the functions that should be called. So when we called the GetSettings method, the call ended up in the function impl_settings_connection_get_settings().

The first step done when processing GetSettings method is authorization check. After authorization check has succeeded, the return message is constructed in get_settings_auth_cb() method.

Accessing and manipulating all connections in NetworkManager

NetworkManager exposes the interface org.freedesktop.NetworkManager.Settings on the object /org/freedesktop/NetworkManager/Settings that, among other things, allows the user to retrieve the list of all connections known to NetworkManager. To get all connections you could use the following dbus-send command:

dbus-send --print-reply --type=method_call --system \
--dest=org.freedesktop.NetworkManager \
/org/freedesktop/NetworkManager/Settings \
org.freedesktop.DBus.Properties.Get \
string:org.freedesktop.NetworkManager.Settings \
string:"Connections"

This would get you something like the following output:

variant array [
object path "/org/freedesktop/NetworkManager/Settings/10"
object path "/org/freedesktop/NetworkManager/Settings/11"
object path "/org/freedesktop/NetworkManager/Settings/12"
object path "/org/freedesktop/NetworkManager/Settings/13"
object path "/org/freedesktop/NetworkManager/Settings/14"
object path "/org/freedesktop/NetworkManager/Settings/15"
object path "/org/freedesktop/NetworkManager/Settings/0"
object path "/org/freedesktop/NetworkManager/Settings/1"
object path "/org/freedesktop/NetworkManager/Settings/2"
object path "/org/freedesktop/NetworkManager/Settings/3"
object path "/org/freedesktop/NetworkManager/Settings/4"
object path "/org/freedesktop/NetworkManager/Settings/5"
object path "/org/freedesktop/NetworkManager/Settings/6"
object path "/org/freedesktop/NetworkManager/Settings/7"
object path "/org/freedesktop/NetworkManager/Settings/8"
object path "/org/freedesktop/NetworkManager/Settings/9"
object path "/org/freedesktop/NetworkManager/Settings/16"
]

The exact output will depend on your particular setup and usage.

The given interface and property are implemented by the class NMSettingsClass (defined in src/settings/nm-settings.[ch]). This class implements the interface NM_TYPE_CONNECTION_PROVIDER (defined in src/nm-connection-provider.[ch]). There is only one object of this class in NetworkManager and it is instantiated when NetworkManager is starting.

Looking at the end of the file src/settings/nm-settings.c you can see the registration of the functions to be called when DBus messages are received. The DBus interface of this module is defined in the introspection/nm-settings.xml file. Here is the relevant code that binds DBus methods to the functions that implement them:

nm_exported_object_class_add_interface (
NM_EXPORTED_OBJECT_CLASS (class),
NMDBUS_TYPE_SETTINGS_SKELETON,
"ListConnections", impl_settings_list_connections,
"GetConnectionByUuid", impl_settings_get_connection_by_uuid,
"AddConnection", impl_settings_add_connection,
"AddConnectionUnsaved", impl_settings_add_connection_unsaved,
"LoadConnections", impl_settings_load_connections,
"ReloadConnections", impl_settings_reload_connections,
"SaveHostname", impl_settings_save_hostname,
NULL);

So, a call to the ListConnections method ends up in the function impl_settings_list_connections(). Here, we'll emphasize one more method, LoadConnections. This DBus method, implemented in impl_settings_load_connections(), loads all connections defined in the system. We'll now take a look at that method.

Initializing connections

All network connections are loaded and initialized from two sources: system dependent network configuration and VPN configuration scripts.

System dependent network configuration

There are several types of distributions with different network configuration mechanisms. Since that part is obviously system dependent, NetworkManager has a plugin system that isolates the majority of NetworkManager code from those system dependent parts. Plugins can be found in the directory src/settings/plugins. Additionally, all the plugins are based on the src/settings/plugin.[ch] base class. In the case of Fedora Linux (as well as RHEL, CentOS and other derivatives) network configuration is recorded in scripts in the directory /etc/sysconfig/network-scripts, and the plugin that handles those configuration files is stored in the directory src/settings/plugins/ifcfg-rh.

The initialization of connections stored in the directory /etc/sysconfig/network-scripts is done when NetworkManager bootstraps and instantiates the object NMSettings of type NMSettingsClass. This is performed in the function nm_manager_start() in the file src/nm-manager.c. There, the method nm_settings_start() is called on the NMSettings object, which in turn first initializes all plugins (as, by default, found in the directory /usr/lib64/NetworkManager/). It then calls the private method load_connections() to actually load all connections. Note that in the directory /usr/lib64/NetworkManager/ there are plugins of other types too, but only plugins that have the prefix libnm-settings-plugin- are loaded. Which plugins should be loaded can be defined in three places (lowest to highest priority):

Compile time defaults, as given to configure.sh script, or, by default for RHEL type systems "ifcfg-rh,ibft" plugins.
In configuration file /etc/NetworkManager/NetworkManager.conf.
As specified in the command line.

Method load_connections() iterates over every defined plugin and asks each plugin for all registered connections it knows about by calling the method get_connections() within the specific plugin. For Red Hat type distributions the plugin that handles all connections is src/settings/plugins/ifcfg-rh/plugin.c, and in that file the function get_connections() is called. Now, if called for the first time, this function will in turn call read_connections() within the same plugin/file, which will read all available connections. Basically, it opens the directory /etc/sysconfig/network-scripts and builds a list of all files in the directory. Then, it tries to open each file, and only those that were parsed properly are left as valid connections. Each connection found is stored in an object NMIfcfgConnection of type NMIfcfgConnectionClass. These objects are defined in the files src/settings/plugins/ifcfg-rh/nm-ifcfg-connection.[ch].

When all the connections have been loaded, read_connections() returns a list of all known connections to the plugin. The function load_connections() then, for each connection reported by each plugin, calls the claim_connection() function in src/settings/nm-settings.c. This function, among other tasks, exports the connection via DBus in the form described above, in the section Accessing individual connection data stored in NetworkManager.

VPN configuration scripts

Details about VPN connections are stored in the /etc/NetworkManager/system-connections directory, one subdirectory per VPN. Those files are read by src/vpn-manager/nm-vpn-manager.c when the VPN manager object is initialized. The VPN manager also monitors changes in the VPN configuration directory and acts appropriately.

Properties of a connection

Each connection has a set of properties attached to it in the form of key-value pairs.

Activating a connection

A connection is activated by calling the ActivateConnection DBus method. This method is implemented in NetworkManager's main class/object, NMManager. This class/object is a singleton whose implementation is in the src/nm-manager.[ch] files. Looking at the code that binds DBus methods to the functions that implement them, we can see that ActivateConnection is implemented by the function impl_manager_activate_connection(). The ActivateConnection method, and its implementation function, accept several parameters:

Connection that should be activated identified by its connection path (e.g. "/org/freedesktop/NetworkManager/Settings/2").
Device on which connection should be activated identified by its path (e.g. "/org/freedesktop/NetworkManager/Devices/2").
Specific object?

Some of the input arguments can be left unspecified. To make them unspecified in a DBus call they should be set to "/", which will be translated into a NULL pointer in the impl_manager_activate_connection() function. Of all combinations of parameters (with respect to being NULL or non-NULL) the following ones are allowed:

When connection path is not specified device must be given. In that case all the connections for that device will be retrieved and the best one will be selected. "The best one" is defined as the most recently used one.
If connection path is specified, then device might or might not be specified. In case it is not specified, the best device for the given connection will be selected. To determine "the best device", first the list of all devices is retrieved and then the status of each device is checked (it must be managed, available, and compatible with the requested connection). Note that "compatible with the requested connection" means, for example, that you can not start a wireless connection on a wired device.
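The argument rules above can be summarized in a few lines of C. This is only an illustrative sketch: the helper names normalize_path() and activation_args_ok() are made up for this example and are not functions from the NetworkManager source, and the real code performs many more checks (authorization, device state, and so on).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a DBus object path of "/" means "not specified"
 * and is treated as NULL, as described in the text. */
static const char *normalize_path(const char *path)
{
    if (path == NULL || strcmp(path, "/") == 0)
        return NULL;
    return path;
}

/* Hypothetical helper: returns 1 if the combination of connection and
 * device paths is one of the allowed ones, 0 otherwise. */
static int activation_args_ok(const char *connection_path,
                              const char *device_path)
{
    const char *conn = normalize_path(connection_path);
    const char *dev = normalize_path(device_path);

    /* Without a connection the device is mandatory, since the best
     * connection is then picked among those valid for that device. */
    if (conn == NULL)
        return dev != NULL;

    /* With a connection given, the device is optional. */
    return 1;
}
```

For example, passing "/" for both paths is rejected, while passing "/" for only one of them is accepted.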

There are "software only", or virtual, connections. Those are checked in the function nm_connection_is_virtual() which is implemented in the file libnm-core/nm-connection.c. When this post was written, the following connections were defined as virtual, or software only:

Bond
Team
Bridge
VLAN
TUN
IPtunnel
MACVLAN
VXLAN

Finally, there are also VPN connections that also don't have associated devices.

When all the checks are performed, devices and connections are found, then an object of type NMActiveConnection is created in the function _new_active_connection(). Here, in case VPN connection is started, VPN establishment is initiated and you can read more about that process in another post.

Saturday, January 2, 2016

Processing RA messages in NetworkManager

The goal of this post is to analyze the processing of RA messages through the NetworkManager code, starting with the initial reception and going all the way through to the assignment of parameters to the device through which the RA was received. As a special case we will also take a look at what happens when an RS is sent, i.e. what is different in comparison to unsolicited RAs. But first, we'll take a look at the relevant code organization and initialization process.

Code to process RAs and initialization phase

For receiving RAs and sending RSes NetworkManager uses the libndp library. This library is used in the class NM_TYPE_LNDP_RDISC (defined in the file src/rdisc/nm-lndp-rdisc.c), which is a platform specific class tailored for the Linux OS. This class inherits from the class NM_TYPE_RDISC (defined in the file src/rdisc/nm-rdisc.c), which is a platform independent base class. It contains functionality that is platform independent, so theoretically NetworkManager can be more easily ported to, e.g., FreeBSD.

To create a new object of the type NM_TYPE_LNDP_RDISC it is necessary to call the function nm_lndp_rdisc_new(). This is, for example, done by the class NM_TYPE_DEVICE in the function addrconf6_start() for each device for which IPv6 configuration is started.

Now, if NetworkManager will use RAs or not depends on the configuration setting for IPv6 that the user defines. If you go to configuration dialog for some network interface there is a setting for IPv6 configuration which might be ON or OFF. In case it is OFF, no IPv6 configuration will be done. If IPv6 configuration is enabled (switch placed in ON state) then the specific configuration methods should be selected. The selected option is checked in the function src/devices/nm-device.c:act_stage3_ip6_config_start(), where, depending on the option selected, a specific initialization is started:

Automatic (method NM_SETTING_IP6_CONFIG_METHOD_AUTO). Start full IPv6 configuration by calling the src/devices/nm-device.c:addrconf6_start() function.

Automatic, DHCP only (method NM_SETTING_IP6_CONFIG_METHOD_DHCP). Only configuration based on the DHCP parameters received will be done. This type of configuration is initiated by calling the function src/devices/nm-device.c:dhcp_start().

Manual (method NM_SETTING_IP6_CONFIG_METHOD_MANUAL). Manual configuration using the parameters specified in the configuration dialog and nothing else. The configuration of this type is initiated by calling the function nm_ip6_config_new(), which returns the appropriate IPv6 configuration object.

Link-local Only (method NM_SETTING_IP6_CONFIG_METHOD_LINK_LOCAL). Initiate only a link-local address configuration by calling the function src/devices/nm-device.c:linklocal6_start().

Since in this post we are concerned with RA processing, we are obviously interested only in the Automatic configuration type, the one that calls the addrconf6_start() function. This function, in turn, calls the function src/devices/nm-device.c:linklocal6_start() to ensure that link local configuration is present. It might happen that the link local address isn't configured yet, so RA configuration must wait, or that link local configuration is already present. In either case, once link local configuration is present RA processing can start. RA processing is kicked off by calling src/devices/nm-device.c:addrconf6_start_with_link_ready(), which in turn calls src/rdisc/nm-rdisc.c:nm_rdisc_start() to kick off RA configuration.

nm_rdisc_start() is called with a pointer to an object of the NM_TYPE_LNDP_RDISC class (defined in src/rdisc/nm-lndp-rdisc.c). Note that a method (nm_rdisc_start()) from the base class (NM_TYPE_RDISC, defined in src/rdisc/nm-rdisc.c) is called with a pointer to a subclass of NM_TYPE_RDISC! The method in the base class does the following:

Checks that there is a subclass which defined the virtual method start() (g_assert (klass->start)).
Initializes timeout for the configuration process. If timeout fires, then rdisc_ra_timeout_cb() will be called that emits NM_RDISC_RA_TIMEOUT signal.
Invokes the method start() from the subclass. The subclass is, as already said, NM_TYPE_LNDP_RDISC, and the given method registers the callback src/rdisc/nm-lndp-rdisc.c:receive_ra() with the libndp library. The callback is called by the libndp library whenever an RA is received.
Starts the solicit process by invoking the solicit() method. This method schedules an RS to be sent after a certain amount of time (variable next) by the send_rs() method. This method, actually, invokes the send_rs() method from the subclass (src/rdisc/nm-lndp-rdisc.c:send_rs()), which sends the RS using the libndp library. Note that the number of RSes sent is bounded, and after a certain number of them have been sent the process is stopped under the assumption that there is no IPv6 capable router on the network.
After an RA has been received and processed, the application of configuration parameters is done in the src/devices/nm-device.c:rdisc_config_changed() method. This callback is achieved by registering for the NM_RDISC_CONFIG_CHANGED signal, which is emitted by the src/rdisc/nm-rdisc.c class whenever IPv6 configuration changes.

So, in conclusion, when link local configuration is finished, RA processing is started. The RA processing consists of waiting for an RA in src/rdisc/nm-lndp-rdisc.c:receive_ra(). If an RA doesn't arrive in a certain amount of time, then an RS is sent in the function src/rdisc/nm-lndp-rdisc.c:send_rs().

RA processing

When RA is received it is processed by the function src/rdisc/nm-lndp-rdisc.c:receive_ra(). The following configuration options are processed from RA by the given function:

DHCP level.
Default gateway.
Addresses and routes.
DNS information (RDNSS option).
DNS search list (DNSSL option).
Hop limit.
MTU.

All the options that were parsed are stored in (or removed from) private attributes of the base object (NMRDisc, defined in src/rdisc/nm-rdisc.h).

Finally, the method src/rdisc/nm-rdisc.c:nm_rdisc_ra_received() is called to cancel all the timeouts. It will also emit the signal NM_RDISC_CONFIG_CHANGED, which will trigger the application of the received configuration parameters to a networking device.

Processing RS/RA

The RS/RA processing differs only by the fact that RS is sent after certain amount of time has passed and RA wasn't received, as described in the Code to process RAs and initialization phase section. After RS is sent, the RA processing is the same as it would be without RS being sent.

Applying IPv6 configuration data

Application of the received IPv6 configuration data is done in the method src/devices/nm-device.c:rdisc_config_changed(). IPv6 configuration is stored in the IPv6 configuration object NM_TYPE_IP6_CONFIG defined in src/nm-ip6-config.c.

Note that this isn't the real application of configuration data, but only that the configuration data is stored in the appropriate object.

The function that really applies configuration data is src/devices/nm-device.c:ip6_config_merge_and_apply().

Sunday, February 5, 2012

Calculating TCP RTO...

I was reading RFC6298 on RTO calculation and decided to try to see within the Linux kernel how and where it is calculated. Basically, RTO, or Retransmission Timeout, determines how long TCP waits for an acknowledgment (ACK) of a transmitted segment. If the acknowledgment isn't received within this time, the segment is deemed lost. Actually, the ACK could be lost too, but there is no way for the sender to differentiate between those two cases, as illustrated by the following figure:

Thus I'll treat them equally and always refer to segment loss.

The important part of calculating RTO is to determine how long it takes for a segment to go to the receiver and for the ACK to come back from the receiver to the sender. This is the Round Trip Time, or RTT. In some ideal world (and a very static one for that matter) this value would be constant and would never change. And RTO would be easy to determine: it would be equal to RTT, maybe slightly larger, but nevertheless the two would be almost equal.

But we are not living in an ideal world, and this process is complicated by the fact that network conditions constantly change. Not only that, but the receiver also has a certain freedom to choose when it will return an ACK, though this delay has an upper bound of 500ms. So, RTT is never used directly as RTO; some additional calculations are used. The key is to estimate as well as possible the true value of RTT that will be experienced by the next segment to be transmitted, while at the same time avoiding abrupt changes caused by transient conditions and not reacting too slowly to changes in network conditions. This is illustrated with the following figure:

In order to achieve that, two new variables are introduced, smoothed RTT, or short SRTT, and RTT variance, or RTTVAR. Those two variables are updated, whenever we have a new RTT measurement, like this (taken from the RFC6298):

RTTVAR <- (1 - beta) * RTTVAR + beta * |SRTT - R'|
SRTT <- (1 - alpha) * SRTT + alpha * R'

alpha and beta are parameters that determine how fast we forget the past. If alpha is too small, new measurements will have little influence on our current understanding of the expected RTT and we will react slowly to changes. If, on the other hand, alpha approaches 1, then the past will not influence our current estimation of RTT, and it might happen that a single RTT was huge for whatever reason and that we suddenly have a wrong estimation. Not only that, but we could have erratic behavior of SRTT. So, the alpha and beta parameters have to be carefully selected. The values recommended by the RFC are alpha=1/8 and beta=1/4.

Finally, RTO is calculated like this:

RTO <- SRTT + max (G, K*RTTVAR)

Constant K is set to 4, and G is a clock granularity in seconds, i.e. if you get timer interrupt each second, then G is 1 second. The max function is used so that, e.g. you don't get 400ms for K*RTTVAR and try to wait for that long while your clock has resolution of 1s. In that case, 1s will prevail and will be selected as a variance.

Initial values

Still, there is the question of initial values, i.e. what to do when the first SYN segment is sent? In that case the RFC specifies that you have to set RTO to 1 second, which is actually lower than in the previous RFC, which specified an initial value of 3 seconds. When the first acknowledgment returns, its RTT value is stored into SRTT and the variance is set to RTT/2. Then, RTO is calculated as usual.

Some complications

There are some additional rules that are required by the RFC. First, if the calculated RTO is less than 1s, then it has to be rounded up to 1 second. It is the minimum RTO value allowed by the RFC.
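To make the rules described so far concrete, here is a minimal floating-point sketch of the RFC 6298 calculation: the initial RTO of 1 second, the handling of the first measurement, the regular SRTT/RTTVAR update and the 1 second lower bound. The struct and function names are invented for this illustration, all times are in seconds, and this is not how the Linux kernel implements it (the kernel uses scaled integers, as described later in the post).

```c
/* Hypothetical per-connection state; all times are in seconds. */
struct rto_state {
    double srtt;       /* smoothed RTT (SRTT) */
    double rttvar;     /* RTT variance (RTTVAR) */
    double rto;        /* current retransmission timeout */
    int    has_sample; /* 0 until the first RTT measurement arrives */
};

#define RTO_ALPHA (1.0 / 8.0) /* values recommended by the RFC */
#define RTO_BETA  (1.0 / 4.0)
#define RTO_K     4.0
#define RTO_MIN   1.0         /* minimum RTO allowed by the RFC */

/* Before any measurement is available the RTO is 1 second. */
static void rto_init(struct rto_state *s)
{
    s->srtt = 0.0;
    s->rttvar = 0.0;
    s->rto = 1.0;
    s->has_sample = 0;
}

/* Feed a new RTT measurement r; g is the clock granularity. */
static void rto_on_measurement(struct rto_state *s, double r, double g)
{
    if (!s->has_sample) {
        /* First measurement: SRTT = R, RTTVAR = R/2. */
        s->srtt = r;
        s->rttvar = r / 2.0;
        s->has_sample = 1;
    } else {
        /* RTTVAR is updated first because it uses the old SRTT. */
        double diff = s->srtt - r;
        if (diff < 0.0)
            diff = -diff;
        s->rttvar = (1.0 - RTO_BETA) * s->rttvar + RTO_BETA * diff;
        s->srtt = (1.0 - RTO_ALPHA) * s->srtt + RTO_ALPHA * r;
    }

    /* RTO = SRTT + max(G, K * RTTVAR), but never below 1 second. */
    double var_term = RTO_K * s->rttvar;
    if (var_term < g)
        var_term = g;
    s->rto = s->srtt + var_term;
    if (s->rto < RTO_MIN)
        s->rto = RTO_MIN;
}
```

With a first measurement of 2 seconds this gives SRTT = 2, RTTVAR = 1 and RTO = 6 seconds; a second identical measurement lowers RTTVAR to 0.75 and RTO to 5 seconds.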

Next, in case some segment is retransmitted, then when acknowledgement arrives it is not taken into calculation of SRTT and RTTVAR. This is called Karn's algorithm, even though it is not algorithm but more a rule. The reason for it is that it is impossible to know if this is acknowledgement for a first transmission, or for retransmission and thus we could skew SRTT. This ambiguity is illustrated with the following figure:

But, there is possibility for TCP to negotiate timestamp option on a certain connection and in that case, the previous ambiguity is resolved so each ACK can be used to calculate SRTT and RTTVAR.

Implementation within Linux kernel

Now, let us see how and where is this implemented within the Linux kernel. I'm using the latest stable kernel at the time this post was written and that is 3.2.4. If you are looking at some later or earlier kernel, then the things might be more or less different.

The call graph of the relevant functions is shown in the following figure:

The main function to calculate RTO is tcp_valid_rtt_meas() which updates RTT estimation and sets new RTO for future segments that will be sent. It is called by two functions, tcp_ack_saw_tstamp() which processes ACK that has embedded timestamp option, or tcp_ack_no_tstamp() which processes ACK without timestamp option. In both cases, tcp_valid_rtt_meas() is called with socket structure that determines to which connection this measurement belongs to and also measured RTT value.

But before describing functions that do calculations, first we are going to describe used data structures. Actually, we'll describe only those elements that are used to calculate RTO.

Data structures

The main data structure passed between all the functions we describe is struct sock. This is a fairly large structure used to describe network layer data about every possible type of socket. Every socket from any higher layer has this structure placed at the beginning. The following figure illustrates this:

In our case, the higher layer structure we are interested in is TCP. The structure used to describe TCP sockets is struct tcp_sock. So, when our functions get struct sock as an argument, they use the tcp_sk inline function to convert (cast!) it into the struct tcp_sock data structure. Note that, if you think a little about it, this tcp_sk inline function actually is a no-op after compilation! Its only purpose is casting, which is a high-level thing, not something important for assembly/machine code.

Anyway, in struct tcp_sock there is a set of variables used for RTO calculation:

/* RTT measurement */
        u32     srtt;      /* smoothed round trip time << 3        */
        u32     mdev;      /* medium deviation                     */
        u32     mdev_max;  /* maximal mdev for the last rtt period */
        u32     rttvar;    /* smoothed mdev_max                    */
        u32     rtt_seq;   /* sequence number to update rttvar     */

In there we note two expected variables, srtt and rttvar, but also several other ones. What is also important to realize is that the srtt variable contains the number of seconds shifted to the left by three bits, i.e. multiplied by 8. So, in order to store 1 second in srtt you'll have to write the number 8 there. In other words, every unit counts as 125ms, so if you want to store 0.5s you'll write the number 4 there, i.e. 4*125ms = 500ms. Similarly, mdev is counted in units of 250ms, i.e. it is stored multiplied by 4, so to store 1s there you need to write the number 4 (4*250ms = 1s). (Strictly speaking the kernel counts time in jiffies rather than seconds; seconds are used here to keep the arithmetic easy to follow.)

We'll see later that this way of storing data, along with the requirements for calculating RTO specified in RFC, allow for very efficient (if somewhat obscured) code to determine srtt and rttvar.

As indicated in the comments embedded in the code, there are additional variables that allow RTTVAR to be updated every RTT time units. The mechanism to achieve that is the following. At each ACK, mdev is calculated, and if this mdev is higher than the current highest one (mdev_max) then it is stored into the mdev_max field. When RTT time units have passed, mdev_max is used to update RTTVAR. To know when RTT time units have passed, the field rtt_seq is used. Namely, the sequence number within the received ACK is checked against rtt_seq, and if it is larger than rtt_seq then the RTTVAR update is triggered and at the same time rtt_seq is set to the sequence number that will be used in the next outgoing segment (snd_nxt, also part of the tcp_sock structure).

Functions

Now that we have described relevant data structures, let us turn our attention to the functions themselves. We'll start from the end, i.e. from the calculation of RTO.

tcp_set_rto()

This is the function that calculates the current RTO, even though it actually does this via a call to the inline function __tcp_set_rto(). RTO is calculated as follows:

(tp->srtt >> 3) + tp->rttvar

This is actually the same as in the RFC apart from the right shift. The right shift is necessary because srtt is stored multiplied by 8 and has to be normalized before being added to rttvar. rttvar, on the other hand, needs no further scaling: as we'll see below, it is derived from mdev_max, which is already multiplied by 4, i.e. by the constant K from the RFC.

The function tcp_set_rto() also makes sure that RTO isn't larger than TCP_RTO_MAX, which is set to 120 seconds (i.e. 120*HZ).
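As a rough sketch of this calculation, the following C fragment computes the RTO from the two stored fields and applies the upper bound. It is a simplified illustration, not the kernel code: the struct, the function name sketch_rto() and the constant SKETCH_RTO_MAX are made up, and time is assumed to be counted in milliseconds rather than in jiffies.

```c
#include <stdint.h>

/* Hypothetical cut-down version of the RTT fields of struct tcp_sock. */
struct rtt_fields {
    uint32_t srtt;   /* smoothed RTT, stored multiplied by 8 */
    uint32_t rttvar; /* deviation term, added to the RTO as stored */
};

/* 120 seconds expressed in milliseconds, mirroring TCP_RTO_MAX. */
#define SKETCH_RTO_MAX 120000u

static uint32_t sketch_rto(const struct rtt_fields *f)
{
    /* Undo the scaling of srtt, then add the deviation term. */
    uint32_t rto = (f->srtt >> 3) + f->rttvar;

    /* Never let the timeout grow beyond the maximum. */
    return rto > SKETCH_RTO_MAX ? SKETCH_RTO_MAX : rto;
}
```

For example, a smoothed RTT of 100ms is stored as srtt = 800, and together with rttvar = 200 this yields an RTO of 300ms.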

tcp_rtt_estimator()

This function, in addition to struct sock parameter that holds data for TCP connection whose SRTT and RTTVAR variables should be updated, also receives measured RTT value for the received acknowledgment, i.e.

static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt);

Update process is done in a highly optimized way, which makes things a bit obscure at the first sight. This is the consequence of the fact that srtt and mdev are counted in units of 125ms and 250ms, respectively (as explained in the subsection about data structures).

So, the first thing that's done within this function is the following (slightly modified for explanation):

mrtt -= (tp->srtt >> 3);

Now, mrtt holds the value R' - SRTT (using the notation from the RFC). To calculate the new SRTT the following line, immediately after the previous one, is used:

srtt += mrtt;

And that's it for SRTT! But, it is easier to understand this calculation if we write it as follows:

mrtt' = mrtt - srtt/8
srtt = srtt + mrtt' = srtt + mrtt - srtt/8 = 7/8 * srtt + mrtt

which is actually the equation given in RFC. Again, srtt is 8 times larger so I can normalize it to show that the value will be correct:

8*rsrtt = 7/8 * 8 * rsrtt + mrtt (divide by 8 yields):
rsrtt = 7/8 rsrtt + mrtt/8

I'm using rsrtt to denote real srtt.
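The derivation above can be checked with a small stand-alone function. This is a sketch written for this explanation, not kernel code: the name srtt8_update() is made up, srtt8 stands for the stored value (the real SRTT multiplied by 8) and mrtt is a new measurement in plain, unscaled units.

```c
#include <stdint.h>

/* Returns the new stored value of srtt (i.e. 8 * SRTT) after a new
 * measurement mrtt, using the same two steps as described above. */
static uint32_t srtt8_update(uint32_t srtt8, uint32_t mrtt)
{
    /* Step 1: R' - SRTT; the shift converts srtt8 to plain units. */
    int32_t delta = (int32_t)mrtt - (int32_t)(srtt8 >> 3);

    /* Step 2: adding the difference to the scaled value gives
     * 8 * (7/8 * SRTT + 1/8 * R'). */
    return (uint32_t)((int32_t)srtt8 + delta);
}
```

Starting from SRTT = 100 (stored as 800), a measurement of 180 gives a stored value of 880, i.e. SRTT = 110, which matches 7/8 * 100 + 180/8.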

It also has to update mdev, which is done with the following line (note that mrtt isn't changed while the previous calculations were performed, i.e. it is still R' - SRTT):

mrtt -= (tp->mdev >> 2);
tp->mdev += mrtt;

Again, I slightly simplified the code. The simplification is related to the fact that I'm assuming mrtt is positive after the subtraction that changes it into R' - SRTT. Again, the trick is used that mdev is stored in units that are 4 times smaller.

When mdev is calculated, it is checked against mdev_max and if it is larger, then mdev_max is updated to a new value:

if (tp->mdev > tp->mdev_max) {
    tp->mdev_max = tp->mdev;
    if (tp->mdev_max > tp->rttvar)
        tp->rttvar = tp->mdev_max;
}

One more thing is done too: if mdev_max is larger than rttvar, then rttvar is also immediately updated with the new value. Note a trick here. The RFC requires RTTVAR to be multiplied by a constant K which is set to 4! This is accomplished by assigning mdev_max (which is already multiplied by 4) directly to rttvar!

What's left to do is to check if RTT time units have passed since the last (regular) update to RTTVAR, and if they have, then it's time to update it again. This is done within the following code fragment:

if (after(tp->snd_una, tp->rtt_seq)) {
    if (tp->mdev_max < tp->rttvar)
        tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
    tp->rtt_seq = tp->snd_nxt;
    tp->mdev_max = tcp_rto_min(sk);
}

As we said when we were talking about data structures, the fact that RTT time units have passed is signaled by the sequence number acknowledged by the ACK (snd_una) being after the value saved in the rtt_seq field. You might wonder why a simple less-than operator isn't used. The reason is that sequence numbers might wrap around, so it is necessary to take that into account!
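To illustrate why a plain comparison is not enough, here is a small sketch of a wrap-around safe comparison of 32-bit sequence numbers. The function name seq_after() is made up for this example; it relies on the same signed-difference idea that the kernel's after() macro is built on.

```c
#include <stdint.h>

/* Returns 1 if sequence number a comes after b, taking wrap-around
 * of the 32-bit sequence space into account, and 0 otherwise. */
static int seq_after(uint32_t a, uint32_t b)
{
    /* The unsigned subtraction wraps modulo 2^32; interpreting the
     * result as signed tells us on which side of b the value a lies. */
    return (int32_t)(b - a) < 0;
}
```

For numbers that are close together this behaves like an ordinary comparison, but a small sequence number such as 5 is also correctly reported as being after 0xFFFFFFF0, which is the case right after the sequence space wraps.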

Note that if mdev_max is not smaller than rttvar nothing happens, i.e. this code only decreases the value of rttvar! But if mdev_max is smaller, then rttvar is adjusted to the following value (as per RFC):

rttvar - (rttvar - mdev_max) / 4 = 3/4 * rttvar + mdev_max/4

Again, we have some trickery with the different scales of rttvar and mdev. You can understand it as follows: the new rttvar consists of 3/4 of the old rttvar. rttvar itself is multiplied by 4 (i.e. by the constant K from the RFC). The RFC also specifies that mdev must participate with 1/4 (i.e. the factor beta). Additionally, mdev is already 4 times larger, and thus it is already pre-multiplied by the constant K! Thus, it can be added without further multiplications.
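The periodic adjustment of rttvar can be isolated into a tiny function to see the numbers. This is a sketch for illustration only: the name rttvar_decay() is made up, and both arguments are assumed to be in the same scaled units as the kernel fields.

```c
#include <stdint.h>

/* Returns the new rttvar after one regular update: rttvar moves a
 * quarter of the way down towards mdev_max, and is never increased. */
static uint32_t rttvar_decay(uint32_t rttvar, uint32_t mdev_max)
{
    if (mdev_max < rttvar)
        rttvar -= (rttvar - mdev_max) >> 2;
    return rttvar;
}
```

With rttvar = 400 and mdev_max = 200 the result is 350, i.e. 3/4 * 400 + 200/4, while with mdev_max = 500 the value stays at 400.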

One more thing left to explain is the new value of mdev_max and the function tcp_rto_min that is called to provide it. This function hides some complexity, but in the simplest possible case it will return the constant TCP_RTO_MIN, which has the value 200ms (HZ/5). In the more general case, the ip command from the iproute package allows the minimum RTO (rto_min) to be specified per route, so this function checks if it is specified and, if it is, returns the given value.

The special case for this function is the initial state, i.e. when the connection is opened. In that case the argument mrtt will be zero, and srtt will be zero too. If mrtt is zero, the assumed value is 1, and note that it is the initial RTT defined by the RFC. srtt being zero triggers its initialization:

tp->srtt = mrtt << 3;
tp->mdev = mrtt << 1;

Basically, this transcodes mrtt into the appropriate units and stores the value into srtt (i.e. mrtt is 1, so 8 has to be stored into srtt). At first, it might seem that mdev is calculated wrongly, but according to the RFC, its initial value has to be half of the initial RTT. Since mdev is stored multiplied by 4, this translates into:

(mrtt << 2) / 2 = mrtt*4/2 = mrtt*2 = mrtt << 1

So, mdev's initial value is correct!

And that's it. I intend also to describe TCP congestion control within Linux kernel in some future post.
