Friday, January 9, 2015

Getting free disk space in Linux

While working on a script to have full Zimbra backups as many days in the past as possible, I was trying to automatically remove old backups based on the free space value. Basically, the idea was to remove directory by directory until free space reached some threshold. To find out free space on a disk is easy, use df(1) command. Basically, it looks like this:
$ df -k /
Filesystem 1K-blocks     Used Available Use% Mounted on
/dev/sda1   56267084 39311864  16938836  70% /
The problem is that it is necessary to use some postprocessing in order to obtain desired value, i.e. 5th or 5th column. cut(1) command, in this case, is a bit problematic because in general you can not expect that the output is so nicely formatted, nor it is fixed. For example, based on the width of the widest device node in the first column, it is automatically resized. That in turn means number of whitespaces varies, and you end up being forced to use something else than cut(1). Probably, the most appropriate tool is awk(1), since awk(1) can properly parse fields separated with variable number of whitespaces. In addition, you need to get rid of first line. That can be done using head(1)/tail(1), but it is more efficient to use awk(1) itself. So, you end up with the following construct:
$ df -k / | awk 'NR==2 {print $4}'
But, for some reason, I wasn't satisfied with the given solution because I thought I'm using too complex tools for something that should be simpler than that. So, I started to search is there some other way to obtain free space of some partition. It turned out that stat(1) command is able to do that, but it's rarely used for that purpose. It is used to find out data about files, or directories, but not file systems. Yet, there is an option, -f, that tells stat(1) we are querying file system, and also there is an option --format which accepts format sequences in a style of date(1) command. So, to get the free space on root file system you can use it as follows:
$ stat -f --format "%f" /
stat(1) command without --format option prints all the data about file system it can find out:
$ stat -f /
  File: "/"
    ID: b8a4e1f0a2aefb22 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 14066771   Free: 4238805    Available: 4234709
Inodes: Total: 3588096    Free: 2151591
This makes it in some way analogous to df(1) command. But, we are getting values in blocks, instead of kilobytes! You can get block size using %S format sequence, but that's it. So, some additional trickery is needed. One solution is to output arithmetic expression and evaluate it using bc(1) command, like this:
$ stat -f --format "%f * %S" / | bc
Alternatively, it is also possible to use shell's arithmetic evaluation like this:
$ echo $((`stat -f --format "%f * %S" /`))17362145280
But, in both cases we are starting two process. In a first case the processes are stat(1) and bc(1), and in the second case it is a new subshell (for backtick) and stat(1). Note that this is the same as the solution with awk(1). But in case of awk(1) we are starting two more complex tools of which one, df(1), is more targeted to display value to a user than to be used in scripts. One additional advantage of a method using awk(1) might be portability, i.e. I'm df(1)/awk(1) combination is probably more common than stat(1)/bc(1) combination.

Anyway, the difference probably isn't so big with respect to performance, but obviously there is another way to do it, and it was interesting to pursue an alternative. 

Monday, December 15, 2014

Incremental backup with rsync and cp with low disk demand

I just created very simple incremental backup solution for my Zimbra installation which efficiently uses disk space. The idea is simple. First, using rsync, I'm making a copy of existing Zimbra installation to another disk:
rsync --delete --delete-excluded -a \
        --exclude zimbra/data/amavisd/ \
        --exclude zimbra/data/clamav/ \
        --exclude zimbra/data/tmp \
        --exclude zimbra/data/mailboxd/ \
        --exclude zimbra/log \
        --exclude zimbra/zmstat \
        /opt/zimbra ${DSTDIR}/
I excluded from synchronization some directories that are not necessary for restoring Zimbra. Then, using cp I'm creating copy of this directory but which only consists of hard links to original files, the content isn't copied:
cd ${DSTDIR}
cp -al zimbra zimbra.`date +%Y%m%d%H%M`
Note the option -l that tells cp to hard link files instead of making a new copy. Also, note that the copy created is named so that it contains timestamp when it was created. Here is the content of the directory:
$ ls -l ${DSTDIR}
total 16
drwx------ 7 root   root    4096 Pro  9 15:31 zimbra
drwx------ 7 root   root    4096 Pro  9 15:31 zimbra.201412131551
drwx------ 7 root   root    4096 Pro  9 15:31 zimbra.201412140326
drwx------ 7 root   root    4096 Pro  9 15:31 zimbra.201412150325
Next time rsync runs, it will delete files that don't exist any more, and when it copies changed files it will create a new copy, and then remove the old one. Removing the old one means unlinking which in essence leaves the old version saved in the directory made by cp. This way you'll allocate space only for new and changed files, while the old ones will share disk space.

This system uses only the space it needs. Now, it is interesting to note du's command behavior in case of hard links. Here is an example:
# du -sh zimbra*
132G      zimbra
3.4G      zimbra.201412131551
3.2G      zimbra.201412140326
114M      zimbra.201412150325
# du -sh zimbra.201412131551
132G      zimbra.201412131551
# du -sh zimbra.201412150325
132G      zimbra.201412150325
In the first case it tells us how much space is used by main directory, zimbra, and then it tells us the difference in usage of the other directories, e.g. zimbra is using 132G and zimbra.201412131551 uses 3.4G more/differently. But, when we give specific directory to du command, then it tells us how much this directory is by itself, so we see that all the files in zimbra.201412131551 indeed use 132G.

And that's basically it. These two commands (rsync and cp) are placed in a script with some additional boilerplate code and everything is run from cron.

Thursday, December 11, 2014

How to determine if some blob is encrypted or not

If you ever wondered how hard it is to differentiate between encrypted file and some regular binary file, then wonder no more, because the answer is: very easy, at least in principle. Namely, by looking at the distribution of octets in a file you can know if it is encrypted or not. The point is that after encryption the file must look like a random sequence of bytes. So, every byte, from 0 to 255, will occur almost the same number of times in a file. On the other hand, text files, images, and other files will have some bytes occurring more frequently than the others. For example, in text files space (0x20) occurs most frequently. So, the procedure is very easy, just count how many times each octet occurs in the file and then look at the differences. You can do it by hand(!), write your own application, or use some existing tool.

Of course, looking at the distribution isn't so easy, so, for a quick check it is possible to use entropy as a measure of randomness. Basically, low entropy means that the file is not encrypted, while high entropy means it is. Entropy, in case of file that consists of stream of octets, can be in a range [0,8], with 0 being no entropy at all, and 8 being the maximum entropy. I found a relatively good explanation of entropy that also has links to the code to calculate entropy in a file. Now, in case you don't have a tool for calculating entropy at a hand, some compression tool will be useful, too. Namely, random files can not be compressed. And that is the reason why compression is always done before encryption, because encrypted data looks random and is thus incompressible.

Let me demonstrate all this on an example. Take, for example, this file. That is a file linked to by the blog post I mentioned previously in which entropy is explained. It is a simple C file you can download on your machine and compile it into executable called entropy1:
$ gcc -Wall -o entropy1 entropy1.c -lm
So, let us see what is the entropy of the C file itself:
$ ./entropy1 entropy1.c
4.95 bits per byte
it is relatively low entropy, meaning, not much information content is there with respect to the size. Ok, let's encrypt it now. We'll do this using openssl tool:
$ openssl enc -aes-128-cbc \
        -in entropy1.c -out entropy1.c.aes \
        -K 000102030405060708090a0b0c0d0e0f \
        -iv 000102030405060708090a0b0c0d0e0f
The given command encrypts input file (entropy1.c) using 128-bit key 000102030405060708090a0b0c0d0e0f in AES-CBC mode and with initialization vector 000102030405060708090a0b0c0d0e0f. The output is written to a file with the name entropy1.c.aes. Let us see what is the entropy now:
$ ./entropy1 entropy1.c.aes
7.86 bits per byte
That's very high entropy, and in line with what we've said about encrypted files having high entropy. Lets check how compressible is the original file, and encrypted one:
$ zip entropy1.c
  adding: entropy1.c (deflated 48%)
$ zip entropy1.c.aes
  adding: entropy1.c.aes (stored 0%)
As can be seen, encrypted file isn't compressible, while plain text is. What about entropy of those compressed files:
$ ./entropy1
7.53 bits per byte
$ ./entropy1
7.79 bits per byte
As you can see, they both have high entropy. This actually means that entropy can not be used to differentiate between compressed files and encrypted ones. And now, there is a problem, how to differentiate between encrypted and compressed files? In that case statistical tests of randomness have to be performed. Good tool for that purpose is ent. Basically, that tool performs two tests: Chi Square and Monte Carlo pi approximation. I found a good blog post about using that tool. At the end, there is a list of rule of thumb rules that can be used to determine if the file is encrypted or compressed:
  • Large deviations in the chi square distribution, or large percentages of error in the Monte Carlo approximation are sure signs of compression.
  • Very accurate pi calculations (< .01% error) are sure signs of encryption.
  • Lower chi values (< 300) with higher pi error (> .03%) are indicative of compression.
  • Higher chi values (> 300) with lower pi errors (< .03%) are indicative of encryption.
Take those numbers only indicatively, because we'll see different values in the example later. But nevertheless, they are a good hint on what to look at. Also, let me explain from where did he get the value 300 for Chi square. If you watched linked video in that blog post, you'll know that there are two important parameters when calculating Chi Squared test, number of degrees of freedom and a critical value. Namely, in our case we have 256 values and that translates into 255 degrees of freedom. Next, if we select p=0.05, i.e. we want to determine if the stream of bytes is random with 95% of certainty, then looking into some table we obtain critical value 293.24, rounded it is 300. When Chi square is below that value, then we accept null hypothesis, i.e. the data is random, otherwise we reject null hypothesis, i.e. the data isn't random.

Here is the output from a given tool for encrypted file:
$ ./ent entropy1.c.aes Entropy = 7.858737 bits per byte.
Optimum compression would reduce the sizeof this 1376 byte file by 1 percent.
Chi square distribution for 1376 samples is 253.40, and randomlywould exceed this value 51.66 percent of the times.
Arithmetic mean value of data bytes is 129.5407 (127.5 = random).Monte Carlo value for Pi is 3.091703057 (error 1.59 percent).Serial correlation coefficient is 0.040458 (totally uncorrelated = 0.0).
Then, here is for just compressed file:
$ ./ent Entropy = 7.530339 bits per byte.
Optimum compression would reduce the sizeof this 883 byte file by 5 percent.
Chi square distribution for 883 samples is 1203.56, and randomlywould exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 111.6512 (127.5 = random).Monte Carlo value for Pi is 3.374149660 (error 7.40 percent).Serial correlation coefficient is 0.183855 (totally uncorrelated = 0.0).
And finally, for encrypted and then compressed file:
$ ./ent
Entropy = 7.788071 bits per byte.
Optimum compression would reduce the size
of this 1554 byte file by 2 percent.
Chi square distribution for 1554 samples is 733.20, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 121.0187 (127.5 = random).
Monte Carlo value for Pi is 3.166023166 (error 0.78 percent).
Serial correlation coefficient is 0.162305 (totally uncorrelated = 0.0).
First, note that the error in calculating Pi is higher in case of the compressed file (7.40 vs 0.78/1.59). Next, Chi square is in the first case less than 300 which indicates encryption (i.e. the data is random), while in the last two cases it is bigger than 300 meaning the data is not so random!

For the end, here is a link to interesting blog post describing file type identification based on the byte distribution and its application for reversing XOR encryption.

About Me

scientist, consultant, security specialist, networking guy, system administrator, philosopher ;)