
Wednesday, June 29, 2016

Quick note about Fedora 24 and VMWare Workstation 12.1

I just updated Fedora 24 from the updates-testing repository, and that pulled in Linux kernel 4.6. As usual, VMWare Workstation needed some patching in order to work. Luckily, I quickly found a fix on the VMWare forums. Note that at the end of the thread there is a script you can use to automatically patch the necessary files. But be careful, I didn't try it!

Anyway, after patching, run:

vmware-modconfig --console --install-all

and that should be it!

Just as a side note, it turns out the same info is on the Arch Wiki page devoted to VMWare. That page is full of information, so bookmark it and check there whenever you have a problem with VMWare.

Edit 20160707: The script mentioned above had some errors in it. Here is the fixed version.

Friday, February 12, 2016

Few notes about network namespaces in Linux

For some time I've been working with network namespaces as implemented in the Linux kernel. Here I'll collect notes about the implementation, behavior and usage, and anything else I learn while using network namespaces.

Kernel API for NETNS


The kernel offers two system calls that allow management of network namespaces. The first one, unshare(2), creates a new network namespace. Actually, this system call can create other types of namespaces too, but here we are interested only in network namespaces. So, to create a new network namespace you call the function like this:
#include <sched.h>
...
    unshare(CLONE_NEWNET);
...
And that would define a new network namespace.

There are two ways other processes can now use that network namespace. The first approach is for the process that created the new network namespace to fork other processes; each forked process shares and inherits the parent process's network namespace. The same is true across exec.
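For illustration, here is a minimal sketch of this first approach (error handling trimmed; appropriate privileges are required). In a freshly created network namespace the child should see only a loopback interface, and it is down:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    /* Create a new network namespace for this process. */
    if (unshare(CLONE_NEWNET) == -1) {
        perror("unshare");
        return 1;
    }
    /* The forked child inherits the new network namespace. */
    if (fork() == 0) {
        execlp("ip", "ip", "link", "show", (char *)NULL);
        perror("execlp");
        return 1;
    }
    wait(NULL);
    return 0;
}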

The second system call the kernel offers is setns(2). To use it, you need a file descriptor that refers to the network namespace you want to use. There are two ways to obtain such a file descriptor.

The first approach requires knowing a process that currently lives in the desired network namespace. Let's say the PID of that process is $PID. To obtain a file descriptor, simply open the file /proc/$PID/ns/net and pass the resulting descriptor to the setns(2) system call to switch network namespaces. This approach always works.
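A minimal sketch of this approach, assuming the target PID is given as the first command line argument (error handling trimmed):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char path[64];
    int fd;

    if (argc < 2)
        return 1;
    /* Open the namespace file of the target process... */
    snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);
    if ((fd = open(path, O_RDONLY)) == -1) {
        perror("open");
        return 1;
    }
    /* ...and switch the calling thread into that network namespace. */
    if (setns(fd, CLONE_NEWNET) == -1) {
        perror("setns");
        return 1;
    }
    close(fd);
    /* Every socket created from now on lives in the target namespace. */
    return 0;
}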

The second approach works only for iproute2-compatible tools. Namely, when creating a new network namespace, the ip command creates a file in the /var/run/netns directory and bind mounts the new network namespace onto it. So, if you know the name of the network namespace you want to access (let's say the name is NAME), you obtain a file descriptor simply by open(2)-ing the related file, i.e. /var/run/netns/NAME.
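For illustration (the namespace name blue is made up):

# ip netns add blue                # creates /var/run/netns/blue
# ls /var/run/netns
blue
# ip netns exec blue ip link show  # run a command inside the namespace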

Note that there is no system call for removing an existing network namespace. Each network namespace exists as long as at least one process uses it, or a mount point pins it.

Two remarks to end this section. First, there is no system call that allows one process to move some other process into another network namespace! And second, you need appropriate privileges (CAP_SYS_ADMIN) to use the mentioned system calls, i.e. regular user processes can't switch namespaces.

Socket API behavior


The next question is how the Socket API behaves when network namespaces are used, and things here are quite interesting.

First, each socket you create is bound to whatever network namespace was active at the time the socket was created. That means you can make one network namespace active (say NS1), create a socket, and then immediately make another network namespace active (NS2). The socket stays bound to NS1 no matter which network namespace is active, and it can be used normally. In other words, when doing some operation on the socket (bind, connect, anything), you don't need to activate the socket's own network namespace first!
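Here is a sketch of what that means in practice; fd_ns1 and fd_ns2 are hypothetical file descriptors referring to NS1 and NS2, obtained as described in the previous section:

...
/* NS1 is active: the new socket is permanently bound to NS1. */
setns(fd_ns1, CLONE_NEWNET);
s = socket(AF_INET, SOCK_STREAM, 0);

/* Switch the thread to NS2... */
setns(fd_ns2, CLONE_NEWNET);

/* ...but the socket still operates inside NS1. */
connect(s, (struct sockaddr *)&addr, sizeof(addr));
...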

Also note that the active network namespace is a per-thread setting: if you set a certain network namespace in one thread, this has no impact on the other threads in the process.

Command line tools


There are two command line tools available for manipulating network namespaces. The first is nsenter(1), which isn't specific to networking; it allows one to start a process within a predefined namespace. The second tool is the ip command from the iproute2 package. It allows management of network namespaces and also allows network interfaces to be moved between namespaces.
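For example (the namespace name blue and PID 1234 are made up):

# ip netns exec blue bash                 # start a shell inside namespace blue
# nsenter --net=/var/run/netns/blue bash  # the same effect with nsenter
# nsenter -t 1234 -n ip addr              # use the network namespace of PID 1234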

NETLINK behavior

To move a device from one network namespace to another, NETLINK must be used. I found references to /sys files somewhere, but at least on my system they don't appear to exist.

One interesting fact is that the interface ID is global across all network namespaces (except for the loopback interface), i.e. if you create an interface in one network namespace and it gets ID N, then move it to another network namespace, it keeps ID N.
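The ip command drives this NETLINK operation for you; a made-up example with a veth pair:

# ip link add veth0 type veth peer name veth1
# ip link show veth1                      # note the interface ID
# ip link set veth1 netns blue            # move veth1 into namespace blue
# ip netns exec blue ip link show veth1   # the same ID inside the namespace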

TBD.

Wednesday, March 25, 2015

VMWare Workstation 11 and Linux kernel 3.19

Well, I thought that starting with kernel 3.18 there would be no more need for manual patching to make VMWare Workstation 11.0 work again (11.1 didn't work either). But I was wrong. After updating, vmnet compilation ended with errors and I had to search for a solution. I found it on the ArchWiki pages. Now, it has happened to me before that I just pointed to a page with a solution, and that page changed so the solution disappeared. To avoid this, here is step by step what you have to do. First, download the patch. You don't need to be root to execute this command:
$ curl http://pastie.org/pastes/9934018/download -o /tmp/vmnet-3.19.patch
Now, switch to root and execute the following commands:
# cd /usr/lib/vmware/modules/source
# tar -xf vmnet.tar
# patch -p0 -i /tmp/vmnet-3.19.patch
# mv vmnet.tar vmnet.tar.SAVED
# tar -cf vmnet.tar vmnet-only
# rm -r vmnet-only
# vmware-modconfig --console --install-all
And that should be it.

Sunday, November 9, 2014

Fedora 20 update to kernel 3.17.2-200 and VMWare Workstation

When I updated to VMWare Workstation 10.0.5 at the end of January 2015, things broke again. Returning to this post, I found that the link in it now points to something that has changed, with no patch and no instructions on what to do. So I had to google again, and this time I've placed the complete instructions in this post so that next time I don't have to google once more.

It turns out that a single line has to be changed in the vmnet module in order for VMWare to be runnable again. So, here are the steps you have to follow to patch the file:
  1. Create a temporary directory, e.g. /tmp/vmware, and change into it.
  2. Create a file named vmnet.patch and put into it the following content:

    diff -ur vmnet-only.a/netif.c vmnet-only/netif.c
    --- vmnet-only.a/netif.c    2014-10-10 03:23:08.585920012 +0300
    +++ vmnet-only/netif.c  2014-10-10 03:23:09.245920008 +0300
    @@ -149,7 +149,7 @@
        memcpy(deviceName, devName, sizeof deviceName);
        NULL_TERMINATE_STRING(deviceName);
    
    -   dev = alloc_netdev(sizeof *netIf, deviceName, VNetNetIfSetup);
    +   dev = alloc_netdev(sizeof *netIf, deviceName, NET_NAME_UNKNOWN, VNetNetIfSetup);
        if (!dev) {
           retval = -ENOMEM;
           goto out;
    

  3. Unpack /usr/lib/vmware/modules/source/vmnet.tar in the current directory (/tmp/vmware):

    tar xf /usr/lib/vmware/modules/source/vmnet.tar
    

  4. Patch the module:

    cd vmnet-only; patch -p1 < ../vmnet.patch; cd ..
    

  5. Make a copy of old, unpatched, archive:

    mv /usr/lib/vmware/modules/source/vmnet.tar /usr/lib/vmware/modules/source/vmnet.tar.SAVED
    

  6. Create a new archive:

    tar cf /usr/lib/vmware/modules/source/vmnet.tar vmnet-only
    

  7. Start vmware configuration process:

    vmware-modconfig --console --install-all
    
Hopefully, that should be it.

Old instructions (not valid any more!)

Well, here we go again. After a recent update which brought kernel 3.17 to Fedora 20, VMWare Workstation 10.0.4 had problems with kernel modules. Luckily, after some short googling I found a solution, and that solution works. There are two things that might confuse you though:
  1. After the cd command and before the for loop you have to switch to the root account (indicated by the prompt changing from $ to #).
  2. The substring kernel-version in the patch command should be replaced with the string "3.17". That is the name you gave to the file when executing the curl command at the beginning of the process.
Anyway, that's it.

Sunday, February 23, 2014

Kernel upgrade to 3.13.3-201 and VMWare Workstation...

I just started VMWare for the first time after the upgrade and restart into kernel 3.13.3-201, and it didn't work. Well, I'm already used to it. Anyway, the fix was very quick: I found a patch in a post on VMWare forums, downloaded it and applied it. The application process is a bit more manual than with the previous patches. You need to (the commands are summarized after the list):

  1. Switch to root user.
  2. Go to the /usr/lib/vmware/modules/source directory
  3. Unpack vmnet.tar using tar command.
  4. Enter the newly created vmnet-only directory.
  5. Apply the patch (patch -p1 < path_and_name_of_unpacked_patch_file). The patch shouldn't make any noise; only the name of the patched file is displayed.
  6. Go one level up (exit vmnet-only directory)
  7. Rename existing vmnet.tar into something else, just in case.
  8. Create new vmnet.tar (tar cf vmnet.tar vmnet-only)
  9. Start vmware
And that's it...
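For reference, a sketch of the above steps as commands (assuming the unpacked patch was saved as /tmp/vmnet-3.13.patch, a made-up path):

# cd /usr/lib/vmware/modules/source
# tar -xf vmnet.tar
# cd vmnet-only
# patch -p1 < /tmp/vmnet-3.13.patch
# cd ..
# mv vmnet.tar vmnet.tar.SAVED
# tar -cf vmnet.tar vmnet-only
# vmware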

Tuesday, August 27, 2013

VMWare Workstation and kernel 3.10 (again)

A change in the kernel version has again broken VMWare Workstation. It would definitely be best for VMWare to integrate their modules into the kernel; that way, maintenance of those modules would be tied to the kernel and a lot of people would have a lot less annoyance. But that's not the case, and so such problems happen. In this case, the solution is again relatively easy. Just run the following commands, as is, and everything should work:
cd /tmp
curl -O http://pkgbuild.com/git/aur-mirror.git/plain/vmware-patch/vmblock-9.0.2-5.0.2-3.10.patch
curl -O http://pkgbuild.com/git/aur-mirror.git/plain/vmware-patch/vmnet-9.0.2-5.0.2-3.10.patch
cd /usr/lib/vmware/modules/source
tar -xvf vmblock.tar
tar -xvf vmnet.tar
patch -p0 -i /tmp/vmblock-9.0.2-5.0.2-3.10.patch
patch -p0 -i /tmp/vmnet-9.0.2-5.0.2-3.10.patch
tar -cf vmblock.tar vmblock-only
tar -cf vmnet.tar vmnet-only
rm -rf vmblock-only
rm -rf vmnet-only
vmware-modconfig --console --install-all
The commands were taken from this link. I tried this with VMWare Workstation 9.0.1 and kernel 3.10.9 and it worked flawlessly.

Monday, August 5, 2013

TCP client self connect...

This is so cool and unexpected, and yet nothing out of spec, that I had to reblog it. Namely, if you run the following snippet of Bourne shell code:
while true
do
   telnet 127.0.0.1 50000
done
You'll constantly receive the message 'Connection refused', but at some point the connection will be established and whatever you type will be echoed back:
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
test1
test1
test2
test2
Note that you didn't start any server and there is no process listening on port 50000 on localhost, and yet it connected! Looking at the output of the netstat command, we see that there really is an established connection:
$ netstat -tn | grep 50000
tcp    0   0 127.0.0.1:50000  127.0.0.1:50000  ESTABLISHED
and if we monitor traffic using tcpdump, we observe a three-way handshake:
21:31:02.327307 IP 127.0.0.1.50000 > 127.0.0.1.50000: Flags [S], seq 2707282816, win 43690, options [mss 65495,sackOK,TS val 41197287 ecr 0,nop,wscale 7], length 0
21:31:02.327318 IP 127.0.0.1.50000 > 127.0.0.1.50000: Flags [S.], seq 2707282816, ack 2707282817, win 43690, options [mss 65495,sackOK,TS val 41197287 ecr 41197287,nop,wscale 7], length 0
21:31:02.327324 IP 127.0.0.1.50000 > 127.0.0.1.50000: Flags [.], ack 1, win 342, options [nop,nop,TS val 41197287 ecr 41197287], length 0
What happened? In short, the client connected to itself. :) A bit longer explanation follows...

Let's start with the fact that when a client (in this case the telnet application) creates a socket and tries to connect to a server, the kernel assigns it a source port number. This is because each TCP connection is uniquely identified by the 4-tuple:
(source IP, source port, destination IP, destination port)
Of those, three parameters are predetermined, i.e. source IP, destination IP and destination port; what's left is the source port, which has to be assigned somehow, and applications usually leave that to the kernel, which takes it from the range of ephemeral ports. An application can choose the source port itself using the bind(2) system call, but that is very rarely done. Now, in what range do these ephemeral ports live? They are high ports, and you can look into the /proc file system to see the specific values for your Linux machine, e.g.:
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
In this case, ephemeral ports are taken between 32768 and 61000.

Now, back to our example with the telnet application. When telnet is started, the kernel selects some free port from the given range of ephemeral ports and tries to connect to localhost (destination IP 127.0.0.1), port 50000. Since usually no process listens on ephemeral ports, an RST response is sent back and the telnet client reports Connection refused. This exchange can be seen on the network using the tcpdump tool:
# tcpdump -nni lo port 50000
21:31:02.326447 IP 127.0.0.1.49999 > 127.0.0.1.50000: Flags [S], seq 1951433652, win 43690, options [mss 65495,sackOK,TS val 41197286 ecr 0,nop,wscale 7], length 0
21:31:02.326455 IP 127.0.0.1.50000 > 127.0.0.1.49999: Flags [R.], seq 0, ack 387395547, win 0, length 0
It is interesting to note that Linux chooses ephemeral ports sequentially, not randomly. This allows easy guessing of the ports, and might be a security problem, but further investigation is necessary to confirm this.

Anyway, over many unsuccessful connection attempts, in one iteration the telnet client is assigned source port 50000, and the SYN request is sent to port 50000, i.e. to itself. So it establishes a connection with itself! This is fully in accordance with the TCP specification, which supports a so-called simultaneous open, illustrated in Figure 8 of RFC793 (note that there is an erratum to this example in RFC1122).
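In fact, you don't have to wait for the kernel to pick the right ephemeral port: you can trigger a self connect deterministically by binding the client socket to the destination address before connecting. A minimal sketch (Linux, error handling kept short):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>

int main(void)
{
    struct sockaddr_in a;
    char buf[16];
    int s = socket(AF_INET, SOCK_STREAM, 0);

    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;
    a.sin_port = htons(50000);
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* Bind to 127.0.0.1:50000 and connect to the very same address:
       the kernel goes through a simultaneous open with itself. */
    if (bind(s, (struct sockaddr *)&a, sizeof(a)) == -1) { perror("bind"); return 1; }
    if (connect(s, (struct sockaddr *)&a, sizeof(a)) == -1) { perror("connect"); return 1; }

    /* Everything we write comes straight back to us. */
    write(s, "hello\n", 6);
    printf("read %zd bytes back\n", read(s, buf, sizeof(buf)));
    close(s);
    return 0;
}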

Yet the example from RFC793 assumes two independent endpoints trying to connect at the same time, while in our case there is only one side, so there is a small deviation from the prescribed behavior. Let's take a look. Here is the TCP state machine, taken from the Wikipedia page about TCP:

When the telnet client starts, source port 50000 is assigned and a state machine is instantiated, immediately initialized into the CLOSED state. Then telnet tries to connect to a server, which means a SYN is sent and the TCP state machine of the source port goes into the SYN SENT state. Now this same source port, i.e. state machine, receives that SYN and consequently goes into the SYN RECEIVED state (the arrow from right to left marked SYN/SYN+ACK). While transitioning to the new state, a SYN+ACK is emitted, which is again received by the state machine. Here we come to a bit of a mystery: how does the state machine transition to the ESTABLISHED state, and when is an ACK emitted to finish the three-way handshake?

To answer that, we'll have to dig a bit into the kernel's source code. First, note that there is an explicit case for self connect, which is also commented. This case is triggered in the TCP_SYN_SENT state. The socket is then placed into the TCP_SYN_RECV state and a SYN+ACK is sent back. This SYN+ACK is immediately looped back and processed in the function tcp_rcv_state_process(). In that function, tcp_validate_incoming() is called. That function, finally, after a few checks, calls tcp_send_challenge_ack(), which sends the ACK. The state of the TCP connection (i.e. the socket) is changed to ESTABLISHED in tcp_rcv_state_process(), within the part that processes the ACK flag. And that concludes the description of what actually happens and what is seen on the network.

The self connect scenario described in this post is quite specific and requires specific preconditions. First, obviously, you need to (ab)use ephemeral ports for listening servers so that your clients try to connect to ephemeral ports. Next, the client and the server have to run on the same IP address, otherwise the client will not be able to self connect. Finally, this can only happen during the initial handshake phase; if you find some client using an ephemeral port and try to connect to it, you'll be refused. So the conclusion is: don't use ephemeral ports for servers! Otherwise, you risk very interesting behavior that is nondeterministic and hard to debug.

Monday, March 4, 2013

Fedora 18 and update to kernel 3.8.1

Today I updated Fedora 18 and, as a consequence, the kernel was also updated, to version 3.8.1. Up until now, the only thing I had to do after each upgrade was to symlink the version.h file (see this post, section Virtualization). But now the VMCI module didn't compile either. Luckily, some people had the same problem while kernel 3.8 was in RC status, and they solved it successfully. :) I tried it, and it worked flawlessly.

You need to download the patch and then execute the following commands (as the root user):
cd /usr/lib/vmware/modules/source
cp vmci.tar vmci.tar.SAVED
tar xf vmci.tar
cd vmci-only
patch -p1 < path_to_downloaded_patch_file
cd ..
tar cf vmci.tar vmci-only/
rm -rf vmci-only
Be careful with the last rm command. :) Also, the cp command is only a precaution: if something goes wrong, you still have a copy of the old vmci.tar archive.

Anyway, just for the completeness here is what you should do to fix missing version.h file:
cd /usr/src/kernels/3.8.1-201.fc18.x86_64/include/linux
ln -sf /usr/src/kernels/3.8.1-201.fc18.x86_64/include/generated/uapi/linux/version.h .
And that's it. An all-in-one patch that streamlines this whole procedure will probably appear soon.

Thursday, December 27, 2012

UDP Lite...

Many people know about TCP and UDP, at least those who work in the field of networking or are learning computer networks in some course. But the truth is that there are others too, e.g. SCTP, DCCP and UDP Lite, and all of them are actually implemented in the Linux kernel. What I'm going to do is describe each of them in the following few posts and give examples of their use. In this post I'm going to talk about UDP Lite. I'll assume that you know UDP and also that you know how to use the socket API to write a UDP application.

UDP Lite is specified in RFC3828: The Lightweight User Datagram Protocol (UDP-Lite). The basic motivation for the introduction of a new variant of UDP is that certain applications (primarily multimedia ones) want to receive packets even if they are damaged, because the codecs they use can recover from and mask errors. UDP itself has a checksum field that covers the whole packet, and if there is an error in the packet, it is silently dropped. It should be noted that this checksum is actually quite weak and doesn't catch a lot of errors, but it is nevertheless problematic for such applications. So, UDP Lite changes standard UDP behavior in that it allows only part of the packet to be covered by the checksum. And because it is now a different protocol, a new protocol ID is assigned to it, i.e. 136.

So, how do you use UDP Lite in your applications? It's actually very easy. First, when creating the socket you have to specify that you want UDP Lite, not (default) UDP:
s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDPLITE);
Next, you need to define what part of the packet will be protected by the checksum. This is achieved with socket options, i.e. the setsockopt(2) system call. Here is a function that sets how many octets of the packet have to be protected:
void setoption(int sockfd, int option, int value)
{
    if (setsockopt(sockfd, IPPROTO_UDPLITE, option,
            (void *)&value, sizeof(value)) == -1) {
        perror("setsockopt");
        exit(1);
    }
}
It receives the socket handle (sockfd) created with the socket function, the option to set (option), and the option's value (value). There are two options, UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV. UDPLITE_SEND_CSCOV sets the number of protected octets for outgoing packets, and UDPLITE_RECV_CSCOV sets how many octets at least have to be protected in inbound packets for them to be passed to the application.

You can also obtain values using the following function:
int getoption(int sockfd, int option)
{
    int cov;
    socklen_t len = sizeof(int);

    if (getsockopt(sockfd, IPPROTO_UDPLITE, option,
            (void *)&cov, &len) == -1) {
        perror("getsockopt");
        exit(1);
    }
    return cov;
}
This function accepts the socket (sockfd) and the option it should retrieve (i.e. UDPLITE_SEND_CSCOV or UDPLITE_RECV_CSCOV), and returns the option's value. Note that the two constants, UDPLITE_SEND_CSCOV and UDPLITE_RECV_CSCOV, should be explicitly defined in your source because it is possible that glibc doesn't (yet) define them.
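For example, something along these lines (the numeric values below match the definitions in linux/udp.h):

#ifndef UDPLITE_SEND_CSCOV
#define UDPLITE_SEND_CSCOV 10
#endif
#ifndef UDPLITE_RECV_CSCOV
#define UDPLITE_RECV_CSCOV 11
#endif
...
/* Checksum covers only the first 20 octets of each outgoing datagram. */
setoption(s, UDPLITE_SEND_CSCOV, 20);
/* Drop inbound datagrams with fewer than 20 covered octets. */
setoption(s, UDPLITE_RECV_CSCOV, 20);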

I wrote fully functional client and server applications you can download and test. To compile them you don't need any special options, so that should be easy. The only change you'll probably need is the IP address that the client sends packets to. This is the constant SERVER_IPADDR, which contains the server's IP address hex encoded. For example, the IP address 127.0.0.1 is 0x7f000001.

Finally, I have to say that UDP Lite will probably have problems traversing NATs. For example, I tried it on my ADSL connection and it didn't pass through the NAT. What I did was simply start the client with the IP address of one of my servers on the Internet, and on that server I sniffed packets. Nothing came to the server. This will probably be a big problem for the adoption of UDP Lite, but time will tell...

You can read more about this subject on the Wikipedia page and in the Linux manual page udplite(7).

Tuesday, December 25, 2012

Controlling which congestion control algorithm is used in Linux

The Linux kernel has a quite advanced networking stack, and that's also true for congestion control. It is a very advanced implementation whose primary characteristics are modular structure and flexibility: all the specific congestion control algorithms are separated into loadable modules, and many of them are available in the mainline kernel tree.
The default, system-wide congestion control algorithm is CUBIC. You can check that by inspecting the content of the file /proc/sys/net/ipv4/tcp_congestion_control:
$ cat /proc/sys/net/ipv4/tcp_congestion_control 
cubic
So, to change the system-wide default you only have to write the name of a congestion control algorithm to the same file. For example, to change it to reno you would do it this way:
# echo reno > /proc/sys/net/ipv4/tcp_congestion_control
# cat /proc/sys/net/ipv4/tcp_congestion_control
reno
Note that to change the value you have to be the root user. As root you can specify any available congestion control algorithm you wish. If the algorithm you specify isn't loaded into the kernel, it will be loaded automatically via the standard kernel module mechanism. To see which congestion control algorithms are currently loaded, look into the content of the file /proc/sys/net/ipv4/tcp_available_congestion_control:
$ cat /proc/sys/net/ipv4/tcp_available_congestion_control
vegas lp reno cubic
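You can also load additional algorithms by hand; for example, to load and activate TCP Vegas:
# modprobe tcp_vegas
# echo vegas > /proc/sys/net/ipv4/tcp_congestion_control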
It is also possible to change the congestion control algorithm on a per-socket basis using the setsockopt(2) system call. Here is the essential part of the code to do that:
...
int s, ns;
socklen_t optlen;
char optval[TCP_CA_NAME_MAX];
...
s = socket(AF_INET, SOCK_STREAM, 0);
...
ns = accept(s, ...);
...
strcpy(optval, "reno");
optlen = strlen(optval);
if (setsockopt(ns, IPPROTO_TCP, TCP_CONGESTION, optval, optlen) < 0) {
    perror("setsockopt");
    return 1;
}
In this fragment we are setting the congestion control algorithm to reno. Note that the constant TCP_CA_NAME_MAX (value 16) isn't defined in the system include files, so it has to be explicitly defined in your sources.

When you are setting the congestion control algorithm this way, you should be aware of a few things:
  1. You can change the congestion control algorithm as an ordinary user.
  2. If you are not the root user, you are only allowed to use the congestion control algorithms listed in the file /proc/sys/net/ipv4/tcp_allowed_congestion_control. For all others you'll receive an error.
  3. No congestion control algorithm is bound to the socket until it is in the connected state.
You can also obtain the current congestion control algorithm using the following snippet of code:
optlen = TCP_CA_NAME_MAX;
if (getsockopt(ns, IPPROTO_TCP, TCP_CONGESTION, optval, &optlen) < 0) {
    perror("getsockopt");
    return 1;
}
Here you can download code you can compile and run. To compile it, just run gcc on it without any special options. This code starts a server (it listens on port 10000). Connect to it using telnet (telnet localhost 10000) in another terminal, and the moment you do, you'll see that the example code printed the default congestion control algorithm and then changed it to reno. It will then close the connection.

Instead of a conclusion, I'll warn you that this congestion control algorithm manipulation isn't portable to other systems; if you use it in your code, you are bound to the Linux kernel.

Tuesday, June 19, 2012

VMWare Workstation on Fedora 17...

Today, when I started VMWare Workstation, it notified me that there is a free security update. Since it is advisable to update whenever there are security issues, I approved it, but of course after the update I had a problem starting VMWare Workstation. Since there are constantly problems with VMWare and Fedora, I finally decided to track everything I had to do in this post. In other words, this post will be updated whenever I have to do something to VMWare Workstation to get it to work.

OK, as I said, on a fully updated Fedora 17 (kernel-3.4.2-4.fc17.x86_64), when a new VMWare Workstation 8.0.4 is installed (or updated), it cannot configure itself because of errors in the kernel modules. As I wrote this, I wasn't able to find a patch that fixes everything in one step, but I managed to combine two fixes that allowed VMWare to start. First, I had to apply the patch for 8.0.3 that fixes the vmnet.tar file. To do that, I also had to modify the script distributed with the patch a bit, so that it accepts the fact that I'm using a newer version of VMWare (i.e. 8.0.4 instead of 8.0.3). Then, I had to apply a small fix for vmblock.tar that I wrote about in another post. Finally, I tried to start VMWare Workstation, but it failed again?! After a bit of poking, I realized that there were dangling processes/modules, so I had to kill all VMWare processes and remove the modules, and then the start succeeded. Of course, I could also have restarted the laptop, but because of the number of open windows I have, that wasn't an option. :)

When I had done all that, I found out that there is a patch that does all this, i.e. those two steps combined, but it is for VMWare Workstation 8.0.2, which means you still have to poke a bit in the script that applies the patch. And finally, this blog seems to be a good place to look when you have problems with VMWare (and VirtualBox) and newer versions of the kernel.

Sunday, February 5, 2012

Calculating TCP RTO...

I was reading RFC6298 on RTO calculation and decided to see how and where it is calculated within the Linux kernel. Basically, RTO, or Retransmission Timeout, determines how long TCP waits for the acknowledgment (ACK) of a transmitted segment. If the acknowledgment isn't received within this time, the segment is deemed lost. Actually, the ACK could be lost too, but there is no way for the sender to differentiate between those two cases, as illustrated by the following figure:


Thus I'll treat them equally and always refer to segment loss.

The important part of calculating RTO is determining how long it takes for a segment to reach the receiver and for the ACK to come back from the receiver to the sender. This is the Round Trip Time, or RTT. In some ideal (and very static) world this value would be constant and would never change, and RTO would be easy to determine: it would be equal to RTT, maybe slightly larger, but nevertheless the two would be almost equal.

But we are not living in an ideal world, and this process is complicated by the fact that network conditions constantly change. Not only that, but the receiver also has a certain freedom to choose when it will return the ACK, though this time has an upper bound of 500 ms. So RTT is never used directly as RTO; some additional calculations are used. The key is to estimate as well as possible the true RTT that the next transmitted segment will experience, while at the same time avoiding abrupt changes caused by transient conditions, yet not reacting too slowly to changes in network conditions. This is illustrated in the following figure:



In order to achieve that, two new variables are introduced, smoothed RTT, or short SRTT, and RTT variance, or RTTVAR. Those two variables are updated, whenever we have a new RTT measurement, like this (taken from the RFC6298):
RTTVAR <- (1 - beta) * RTTVAR + beta * |SRTT - R'|
SRTT <- (1 - alpha) * SRTT + alpha * R'
alpha and beta are parameters that determine how fast we forget the past. If alpha is too small, new measurements will have little influence on our current understanding of the expected RTT and we will react slowly to changes. If, on the other hand, alpha approaches 1, then the past will hardly influence our current estimate of RTT, and it might happen that a single RTT that was huge for whatever reason suddenly gives us a wrong estimate. Not only that, but SRTT could also behave erratically. So, the alpha and beta parameters have to be selected carefully. The values recommended by the RFC are alpha=1/8 and beta=1/4.

Finally, RTO is calculated like this:
RTO <- SRTT + max (G, K*RTTVAR)
The constant K is set to 4, and G is the clock granularity in seconds, i.e. if you get a timer interrupt each second, then G is 1 second. The max function is used so that, for example, you don't get 400 ms for K*RTTVAR and try to wait for that long while your clock has a resolution of 1 s. In that case, 1 s prevails and is selected as the variance term.
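As a quick sanity check of the formula: with SRTT = 100 ms, RTTVAR = 10 ms and granularity G = 1 ms, RTO = 100 + max(1, 4*10) = 140 ms. With a coarse clock of G = 1 s, the same measurements give RTO = 100 ms + 1 s = 1.1 s, because the granularity term wins over K*RTTVAR.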

Initial values

Still, there is the question of initial values, i.e. what to do when the first SYN segment is sent? In that case, the RFC specifies that you have to set RTO to 1 second, which is actually lower than in the previous RFC, which mandated a minimum value of 3 seconds. When the first acknowledgment returns, its RTT value is stored into SRTT and the variance is set to RTT/2. From then on, RTO is calculated as usual.

Some complications

There are some additional rules required by the RFC. First, if the calculated RTO is less than 1 s, it has to be rounded up to 1 second. That is the minimum RTO value allowed by the RFC.

Next, in case some segment is retransmitted, the acknowledgment that arrives for it is not taken into the calculation of SRTT and RTTVAR. This is called Karn's algorithm, even though it is not an algorithm but rather a rule. The reason for it is that it is impossible to know whether the acknowledgment is for the first transmission or for the retransmission, and thus SRTT could be skewed. This ambiguity is illustrated in the following figure:


But TCP can negotiate the timestamp option on a connection, and in that case the previous ambiguity is resolved, so each ACK can be used to calculate SRTT and RTTVAR.

Implementation within Linux kernel

Now, let us see how and where this is implemented within the Linux kernel. I'm using the latest stable kernel at the time this post was written, which is 3.2.4. If you are looking at a later or earlier kernel, things might be more or less different.

The call graph of the relevant functions is shown in the following figure:


The main function for calculating RTO is tcp_valid_rtt_meas(), which updates the RTT estimate and sets a new RTO for future segments. It is called by two functions: tcp_ack_saw_tstamp(), which processes ACKs that carry the timestamp option, and tcp_ack_no_tstamp(), which processes ACKs without it. In both cases, tcp_valid_rtt_meas() is called with the socket structure that determines which connection the measurement belongs to, and with the measured RTT value.

But before describing the functions that do the calculations, we are first going to describe the data structures used. Actually, we'll describe only those elements that are used to calculate RTO.

Data structures

The main data structure passed between all the functions we describe is struct sock. This is a fairly large structure used to describe network layer data about every possible type of socket. Every socket from any higher layer has this structure placed at the beginning. The following figure illustrates this:


In our case, the higher layer structure we are interested in is TCP. The structure used to describe TCP sockets is struct tcp_sock. So, when our functions get struct sock as an argument, they use the tcp_sk inline function to convert (cast!) it into the struct tcp_sock data structure. Note that, if you think about it a little, this tcp_sk inline function is actually a no-op after compilation! Its only purpose is casting, which is a high-level language concept, not something that matters for assembly/machine code.

Anyway, in struct tcp_sock there is a set of variables used for RTO calculation:
/* RTT measurement */
        u32     srtt;      /* smoothed round trip time << 3        */
        u32     mdev;      /* medium deviation                     */
        u32     mdev_max;  /* maximal mdev for the last rtt period */
        u32     rttvar;    /* smoothed mdev_max                    */
        u32     rtt_seq;   /* sequence number to update rttvar     */

In there we note the two expected variables, srtt and rttvar, but also several others. What is also important to realize is that the srtt variable contains the number of seconds shifted left by three bits, i.e. multiplied by 8. So, in order to store 1 second in srtt, you have to write the number 8 there. In other words, every unit counts as 125 ms, so if you want to store 0.5 s you write the number 4, i.e. 4*125 ms = 500 ms. Similarly, mdev is counted in units of 250 ms, i.e. its value is four times smaller, and to store 1 s there you need to write the number 4 (4*250 ms = 1 s).

We'll see later that this way of storing data, along with the requirements for calculating RTO specified in the RFC, allows for very efficient (if somewhat obscure) code for determining srtt and rttvar.

As indicated by the comments embedded in the code, there are additional variables that allow RTTVAR to be updated once every RTT time units. The mechanism to achieve that is the following. At each ACK, mdev is calculated, and if it is higher than the current maximum (mdev_max), it is stored into the mdev_max field. When RTT time units have passed, mdev_max is used to update RTTVAR. To know when RTT time units have passed, the rtt_seq field is used. Namely, the sequence number within the received ACK is checked against rtt_seq, and if it is larger, then an RTTVAR update is triggered and at the same time rtt_seq is set to the sequence number that will be used in the next outgoing segment (snd_nxt, also part of the tcp_sock structure).

Functions 

Now that we have described the relevant data structures, let us turn our attention to the functions themselves. We'll start from the end, i.e. from the calculation of RTO.

tcp_set_rto()

This is the function that calculates the current RTO, though it actually does so via a call to the inline function __tcp_set_rto(). RTO is calculated as follows:
(tp->srtt >> 3) + tp->rttvar
This is actually the same as in the RFC, apart from the right shift. The right shift is necessary because srtt is expressed in units 8 times smaller and has to be normalized before being added to rttvar. rttvar, on the other hand, is coded "normally", i.e. the number one means 1 second.

The function tcp_set_rto() also makes sure that RTO isn't larger than TCP_RTO_MAX, which is set to 120 seconds (i.e. 120*HZ).

tcp_rtt_estimator()

This function, in addition to the struct sock parameter that holds the data of the TCP connection whose SRTT and RTTVAR should be updated, also receives the measured RTT value for the received acknowledgment, i.e.
static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt);
The update process is done in a highly optimized way, which makes things a bit obscure at first sight. This is a consequence of the fact that srtt and mdev are counted in units of 125 ms and 250 ms, respectively (as explained in the subsection about data structures).

So, the first thing done within this function is the following (slightly modified for explanation):
mrtt -= (tp->srtt >> 3);
Now mrtt holds the value R' - SRTT (using the notation from the RFC). To calculate the new SRTT, the following line, immediately after the previous one, is used:
srtt += mrtt;
And that's it for SRTT! It is easier to understand this calculation if we write it as follows:
mrtt' = mrtt - srtt/8
srtt = srtt + mrtt' = srtt + mrtt - srtt/8 = 7/8 * srtt + mrtt
which is actually the equation given in the RFC. Again, srtt is 8 times larger, so I can normalize it to show that the value is correct:
8*rsrtt = 7/8 * 8 * rsrtt + mrtt (divide by 8 yields):
rsrtt = 7/8 rsrtt + mrtt/8
I'm using rsrtt to denote the real srtt.

The function also has to update mdev, which is done with the following lines (note that mrtt wasn't changed by the previous calculations, i.e. it still holds R' - SRTT):
mrtt -= (tp->mdev >> 2);
tp->mdev += mrtt;
Again, I slightly simplified the code. The simplification is that I'm assuming mrtt is positive after the subtraction that changes it into R' - SRTT. And again, the trick is used that mdev is stored in units 4 times smaller.

When mdev has been calculated, it is checked against mdev_max, and if it is larger, mdev_max is updated to the new value:
if (tp->mdev > tp->mdev_max) {
    tp->mdev_max = tp->mdev;
    if (tp->mdev_max > tp->rttvar)
        tp->rttvar = tp->mdev_max;
}

One more thing is done here: if mdev_max becomes larger than rttvar, then rttvar is also immediately updated with the new value. Note the trick here. The RFC requires RTTVAR to be multiplied by the constant K, which is set to 4! This is accomplished by assigning mdev_max (which is already multiplied by 4) directly to rttvar!

What's left to do is to check whether RTT time units have passed since the last (regular) RTTVAR update, and if so, to update it again. This is done within the following code fragment:

if (after(tp->snd_una, tp->rtt_seq)) {
    if (tp->mdev_max < tp->rttvar)
        tp->rttvar -= (tp->rttvar - tp->mdev_max) >> 2;
    tp->rtt_seq = tp->snd_nxt;
    tp->mdev_max = tcp_rto_min(sk);
}
As we said when talking about data structures, the indicator that RTT time units have passed is the acknowledged sequence number (snd_una) being after the value saved in the rtt_seq field. You might wonder why a simple less-than operator isn't used? The reason is that sequence numbers can wrap around, and the after() macro takes that into account!

Note that if rttvar is smaller than or equal to mdev_max, nothing happens here, i.e. this code only decreases the value of rttvar! (Increases were already handled above, when mdev_max was assigned to rttvar.) If rttvar is larger, then it is adjusted by the following quantity (as per the RFC):
rttvar - (rttvar - mdev) / 4 = 3/4 * rttvar + mdev/4
Again, we have some trickery with the different scales of rttvar and mdev. You can understand it as follows: the new rttvar consists of 3/4 of the old rttvar. rttvar itself is multiplied by 4 (i.e. by the constant K from the RFC). The RFC also specifies that mdev must participate with a factor of 1/4 (i.e. the factor beta). Additionally, mdev is already 4 times larger, and thus already pre-multiplied by the constant K! Thus, it can be added without further multiplication.

One more thing left to explain is the new value of mdev_max and the function tcp_rto_min that is called to provide it. This function hides some complexity, but in the simplest case it returns the constant TCP_RTO_MIN, which has the value 200 ms (HZ/5). In the more general case, the ip command from the iproute package allows RTT and RTTVAR to be specified per destination, so this function checks whether such a value is specified and, if it is, returns it.

The special case for this function is the initial state, i.e. when the connection is opened. In that case the argument mrtt will be zero, and srtt will be zero as well. If mrtt is zero, the assumed value is 1; note that this is the initial RTT defined by the RFC. srtt being zero triggers its initialization:
tp->srtt = mdev << 3;
tp->mdev = mdev << 1;
Basically, this transcodes mdev into the appropriate units and stores the value into srtt (i.e. mdev is 1, so 8 has to be stored into srtt). At first it might seem that mdev is calculated wrongly, but according to the RFC its initial value has to be half of the initial RTT. This translates into:
mdev<<2 / 2 = mdev*4/2 = mdev*2 = mdev <<1
So, mdev's initial value is correct!

And that's it. I also intend to describe TCP congestion control within the Linux kernel in some future post.
