Monday, December 15, 2014

Incremental backup with rsync and cp with low disk demand

I just created a very simple incremental backup solution for my Zimbra installation that uses disk space efficiently. The idea is simple. First, using rsync, I make a copy of the existing Zimbra installation on another disk:
rsync --delete --delete-excluded -a \
        --exclude zimbra/data/amavisd/ \
        --exclude zimbra/data/clamav/ \
        --exclude zimbra/data/tmp \
        --exclude zimbra/data/mailboxd/imap-inactive-session-cache.data \
        --exclude zimbra/log \
        --exclude zimbra/zmstat \
        /opt/zimbra ${DSTDIR}/
I excluded from synchronization some directories that are not necessary for restoring Zimbra. Then, using cp, I create a copy of this directory that consists only of hard links to the original files; the content itself isn't copied:
cd "${DSTDIR}"
cp -al zimbra "zimbra.$(date +%Y%m%d%H%M)"
Note the option -l, which tells cp to hard link files instead of copying them. Also note that the name of the copy contains the timestamp of its creation. Here is the content of the directory:
$ ls -l ${DSTDIR}
total 16
drwx------ 7 root   root    4096 Pro  9 15:31 zimbra
drwx------ 7 root   root    4096 Pro  9 15:31 zimbra.201412131551
drwx------ 7 root   root    4096 Pro  9 15:31 zimbra.201412140326
drwx------ 7 root   root    4096 Pro  9 15:31 zimbra.201412150325
The next time rsync runs, it deletes files that no longer exist in the source, and for every changed file it writes a new copy and then removes the old one. Removing the old one only unlinks it from the zimbra directory, so the old version stays saved in the directories made by cp. (This relies on rsync's default behavior of replacing changed files; with the --inplace option it would overwrite the data shared with the snapshots.) This way you allocate space only for new and changed files, while unchanged files share disk space.
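To see why the old version survives, here is a minimal Python sketch of one backup cycle on a single file. It only mimics the two steps: os.link stands in for cp -al, and write-then-rename stands in for rsync's default way of updating a file. The file names and the helper snapshot_survives_replace are made up for the illustration.

```python
import os
import tempfile

def snapshot_survives_replace(workdir):
    """Run one hard-link/replace cycle; return (shared_inode, snapshot_text, live_text)."""
    live = os.path.join(workdir, "live.txt")      # hypothetical file in "zimbra"
    snap = os.path.join(workdir, "snapshot.txt")  # hypothetical file in "zimbra.<timestamp>"

    with open(live, "w") as f:
        f.write("version 1")

    # Like `cp -al`: a second name for the same inode, no data is copied.
    os.link(live, snap)
    shared_inode = os.stat(live).st_ino == os.stat(snap).st_ino

    # Like rsync's default update: write a new file, then rename it over the old name.
    tmp = live + ".tmp"
    with open(tmp, "w") as f:
        f.write("version 2")
    os.replace(tmp, live)

    with open(snap) as f:
        snap_text = f.read()
    with open(live) as f:
        live_text = f.read()
    return shared_inode, snap_text, live_text

with tempfile.TemporaryDirectory() as d:
    shared, snap_text, live_text = snapshot_survives_replace(d)

print(shared, snap_text, live_text)
```

The snapshot name keeps pointing at the old inode, so it still reads "version 1" while the live name now reads "version 2".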

This system uses only the space it needs. It is interesting to note how the du command behaves in the presence of hard links. Here is an example:
# du -sh zimbra*
132G      zimbra
3.4G      zimbra.201412131551
3.2G      zimbra.201412140326
114M      zimbra.201412150325
# du -sh zimbra.201412131551
132G      zimbra.201412131551
# du -sh zimbra.201412150325
132G      zimbra.201412150325
In the first case du counts every hard-linked file only once, in the first directory in which it encounters it. So it tells us how much space is used by the main directory, zimbra, and then only what each of the other directories adds on top of what was already counted, e.g. zimbra uses 132G and zimbra.201412131551 holds another 3.4G of files not counted before it. But when we give du one specific directory, it tells us how big that directory is by itself, so we see that all the files in zimbra.201412131551 indeed take up 132G.
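The counting rule can be sketched in a few lines of Python. This is only an approximation of what du does: the function du_like is a made-up name, and it sums file sizes (st_size) rather than allocated disk blocks and ignores the space taken by the directories themselves.

```python
import os
import tempfile

def du_like(paths):
    """Return {path: bytes}, charging each inode only to the first path that contains it."""
    seen = set()
    totals = {}
    for root in paths:
        total = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                st = os.lstat(os.path.join(dirpath, name))
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue  # hard link to a file already counted earlier
                seen.add(key)
                total += st.st_size
        totals[root] = total
    return totals

with tempfile.TemporaryDirectory() as d:
    main = os.path.join(d, "zimbra")
    snap = os.path.join(d, "zimbra.snapshot")
    os.mkdir(main)
    os.mkdir(snap)
    with open(os.path.join(main, "mail"), "wb") as f:
        f.write(b"x" * 1000)
    # Unchanged file: the snapshot shares it with the main directory.
    os.link(os.path.join(main, "mail"), os.path.join(snap, "mail"))
    # Old version that exists only in the snapshot.
    with open(os.path.join(snap, "old"), "wb") as f:
        f.write(b"y" * 10)

    together = du_like([main, snap])
    both_sizes = (together[main], together[snap])
    alone_size = du_like([snap])[snap]

print(both_sizes, alone_size)
```

Listed together, the snapshot is charged only for the 10 bytes that are unique to it; on its own, it is charged for everything it contains.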

And that's basically it. These two commands (rsync and cp) are placed in a script with some additional boilerplate code and everything is run from cron.

Thursday, December 11, 2014

How to determine if some blob is encrypted or not

If you ever wondered how hard it is to differentiate between an encrypted file and some regular binary file, then wonder no more, because the answer is: very easy, at least in principle. Namely, by looking at the distribution of octets in a file you can tell whether it is encrypted or not. The point is that after encryption the file must look like a random sequence of bytes, so every byte value, from 0 to 255, will occur almost the same number of times in the file. On the other hand, text files, images, and other files will have some bytes occurring more frequently than others. For example, in text files the space (0x20) occurs most frequently. So the procedure is very easy: just count how many times each octet occurs in the file and then look at the differences. You can do it by hand(!), write your own application, or use some existing tool.
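Writing your own application for the counting step takes only a few lines. Here is a minimal Python sketch; the function names are made up for the illustration, and os.urandom merely stands in for encrypted data.

```python
import os
from collections import Counter

def byte_histogram(data):
    """Return a list of 256 counts, one per possible byte value."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]

def most_common_byte(data):
    """Return the byte value that occurs most often in data."""
    hist = byte_histogram(data)
    return max(range(256), key=hist.__getitem__)

text = b"the quick brown fox jumps over the lazy dog"
random_like = os.urandom(65536)  # stand-in for an encrypted file

print(hex(most_common_byte(text)))          # the space dominates plain text
hist = byte_histogram(random_like)
print(min(hist), max(hist))                 # counts cluster around 65536 / 256 = 256
```

To check a real file, read it with open(path, "rb").read() and pass the result to byte_histogram.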

Of course, eyeballing the distribution isn't so easy, so for a quick check it is possible to use entropy as a measure of randomness. Basically, low entropy means that the file is not encrypted, while high entropy suggests that it is. For a file that consists of a stream of octets, entropy lies in the range [0,8] bits per byte, with 0 being no entropy at all and 8 being the maximum entropy. I found a relatively good explanation of entropy that also has links to code that calculates the entropy of a file. Now, in case you don't have a tool for calculating entropy at hand, a compression tool will be useful, too. Namely, random files can not be compressed. That is also the reason why compression is always done before encryption: encrypted data looks random and is thus incompressible.
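Both quick checks fit in a short Python sketch. This is not the C program used below, just the same Shannon entropy formula, H = -sum(p_i * log2(p_i)) over the observed byte frequencies, plus a compression ratio from the standard zlib module. The function names are made up, and os.urandom stands in for encrypted data.

```python
import math
import os
import zlib
from collections import Counter

def entropy_bits_per_byte(data):
    """Shannon entropy of the byte frequencies, in the range [0, 8]."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def compressed_ratio(data):
    """Compressed size divided by original size; values near or above 1 mean incompressible."""
    return len(zlib.compress(data, 9)) / len(data)

constant = b"a" * 1024
uniform = bytes(range(256)) * 4      # every byte value equally often
random_like = os.urandom(4096)       # stand-in for an encrypted file

print(entropy_bits_per_byte(constant))      # no entropy at all
print(entropy_bits_per_byte(uniform))       # maximum entropy
print(entropy_bits_per_byte(random_like))   # close to 8
print(compressed_ratio(constant), compressed_ratio(random_like))
```

Note that the uniform sample reaches the maximum entropy although it is perfectly predictable: this measure looks only at byte frequencies, not at their order.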

Let me demonstrate all this with an example. Take, for example, this file. It is the file linked from the blog post I mentioned previously, in which entropy is explained. It is a simple C file that you can download to your machine and compile into an executable called entropy1:
$ gcc -Wall -o entropy1 entropy1.c -lm
So, let us see what the entropy of the C file itself is:
$ ./entropy1 entropy1.c
4.95 bits per byte
This is relatively low entropy, meaning that there is not much information content with respect to the size. Ok, let's encrypt it now. We'll do this using the openssl tool:
$ openssl enc -aes-128-cbc \
        -in entropy1.c -out entropy1.c.aes \
        -K 000102030405060708090a0b0c0d0e0f \
        -iv 000102030405060708090a0b0c0d0e0f
The given command encrypts the input file (entropy1.c) in AES-CBC mode using the 128-bit key 000102030405060708090a0b0c0d0e0f and the initialization vector 000102030405060708090a0b0c0d0e0f. The output is written to a file named entropy1.c.aes. Let us see what the entropy is now:
$ ./entropy1 entropy1.c.aes
7.86 bits per byte
That's very high entropy, in line with what we've said about encrypted files. Let's check how compressible the original file and the encrypted one are:
$ zip entropy1.zip entropy1.c
  adding: entropy1.c (deflated 48%)
$ zip entropy2.zip entropy1.c.aes
  adding: entropy1.c.aes (stored 0%)
As can be seen, the encrypted file isn't compressible, while the plain text is. What about the entropy of those compressed files:
$ ./entropy1 entropy1.zip
7.53 bits per byte
$ ./entropy1 entropy2.zip
7.79 bits per byte
As you can see, they both have high entropy. This means that entropy can not be used to differentiate between compressed files and encrypted ones. So how do we tell them apart? In that case statistical tests of randomness have to be performed. A good tool for that purpose is ent. Among other things, that tool performs two tests: Chi Square and Monte Carlo pi approximation. I found a good blog post about using that tool. At its end, there is a list of rules of thumb that can be used to determine if a file is encrypted or compressed:
  • Large deviations in the chi square distribution, or large percentages of error in the Monte Carlo approximation are sure signs of compression.
  • Very accurate pi calculations (< .01% error) are sure signs of encryption.
  • Lower chi values (< 300) with higher pi error (> .03%) are indicative of compression.
  • Higher chi values (> 300) with lower pi errors (< .03%) are indicative of encryption.
Take those numbers as indicative only, because we'll see different values in the example later. Nevertheless, they are a good hint on what to look at. Also, let me explain where he got the value 300 for Chi square. If you watched the video linked in that blog post, you'll know that there are two important parameters in the Chi Squared test: the number of degrees of freedom and a critical value. In our case we have 256 possible byte values, which translates into 255 degrees of freedom. Next, if we select p=0.05, i.e. we want to determine whether the stream of bytes is random with 95% certainty, then looking into a table we obtain the critical value 293.24, or 300 when rounded. When Chi square is below that value we accept the null hypothesis, i.e. the data is random; otherwise we reject the null hypothesis, i.e. the data isn't random.
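The Chi square statistic itself is easy to compute by hand. The following Python sketch uses the textbook formula, the sum of (observed - expected)^2 / expected over all 256 byte values, with the expected count being the file length divided by 256. It is a sketch under that assumption, not a copy of ent's code, and the names chi_square_bytes, looks_random and CRITICAL_VALUE are made up for the illustration.

```python
from collections import Counter

# Table value for 255 degrees of freedom and p = 0.05, as quoted above.
CRITICAL_VALUE = 293.24

def chi_square_bytes(data):
    """Chi square statistic of the byte counts against a uniform distribution."""
    if not data:
        raise ValueError("cannot test an empty byte string")
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))

def looks_random(data):
    """True when the statistic stays below the critical value (null hypothesis kept)."""
    return chi_square_bytes(data) < CRITICAL_VALUE

uniform = bytes(range(256)) * 4   # every value exactly as often as expected
skewed = b"a" * 256               # a single value repeated

print(chi_square_bytes(uniform), looks_random(uniform))
print(chi_square_bytes(skewed), looks_random(skewed))
```

For the skewed sample the expected count per value is 1, so the statistic is 255^2 for the letter a plus 1 for each of the 255 missing values, far above the critical value.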

Here is the output of that tool for the encrypted file:
$ ./ent entropy1.c.aes
Entropy = 7.858737 bits per byte.
Optimum compression would reduce the size
of this 1376 byte file by 1 percent.
Chi square distribution for 1376 samples is 253.40, and randomly
would exceed this value 51.66 percent of the times.
Arithmetic mean value of data bytes is 129.5407 (127.5 = random).
Monte Carlo value for Pi is 3.091703057 (error 1.59 percent).
Serial correlation coefficient is 0.040458 (totally uncorrelated = 0.0).
Then, here it is for the file that was only compressed:
$ ./ent entropy1.zip
Entropy = 7.530339 bits per byte.
Optimum compression would reduce the size
of this 883 byte file by 5 percent.
Chi square distribution for 883 samples is 1203.56, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 111.6512 (127.5 = random).
Monte Carlo value for Pi is 3.374149660 (error 7.40 percent).
Serial correlation coefficient is 0.183855 (totally uncorrelated = 0.0).
And finally, for the encrypted and then compressed file:
$ ./ent entropy2.zip
Entropy = 7.788071 bits per byte.
Optimum compression would reduce the size
of this 1554 byte file by 2 percent.
Chi square distribution for 1554 samples is 733.20, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 121.0187 (127.5 = random).
Monte Carlo value for Pi is 3.166023166 (error 0.78 percent).
Serial correlation coefficient is 0.162305 (totally uncorrelated = 0.0).
First, note that the error in calculating Pi is highest for the compressed file (7.40 percent, versus 1.59 for the encrypted file and 0.78 for the encrypted and then compressed one). Next, Chi square is below 300 in the first case, which indicates encryption (i.e. the data is random), while in the last two cases it is above 300, meaning the data is not so random!
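The Pi test can be sketched in Python as well. The sketch follows the description in ent's documentation: each successive group of six bytes is read as two 24-bit coordinates, and a point counts as a hit when it falls inside the circle inscribed in the square. The byte order and the exact comparison used here are my assumptions, and the function names are made up.

```python
import math

def monte_carlo_pi(data):
    """Estimate Pi from data, using 6 bytes (two 24-bit coordinates) per point."""
    radius = 256 ** 3 - 1                 # largest 24-bit coordinate
    points = hits = 0
    for i in range(0, len(data) - 5, 6):  # only complete 6-byte groups
        x = int.from_bytes(data[i:i + 3], "big")
        y = int.from_bytes(data[i + 3:i + 6], "big")
        points += 1
        if x * x + y * y <= radius * radius:
            hits += 1
    if points == 0:
        raise ValueError("need at least 6 bytes")
    return 4.0 * hits / points

def pi_error_percent(data):
    """Relative error of the estimate against math.pi, in percent."""
    return 100.0 * abs(monte_carlo_pi(data) - math.pi) / math.pi

print(monte_carlo_pi(bytes(12)))        # all points at the origin, all hits
print(monte_carlo_pi(b"\xff" * 12))     # all points in the far corner, no hits
```

With six bytes per point, the 1376-byte file above yields only 229 points, and the reported value 3.091703057 equals 4 * 177 / 229. One point more or less inside the circle moves the estimate by about 0.017, so for files this small the Pi error is a very coarse indicator.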

Finally, here is a link to an interesting blog post describing file type identification based on byte distribution and its application to reversing XOR encryption.

Thursday, December 4, 2014

Lenovo W540 and Fedora 21

At the end of November 2014 I got a new laptop, a Lenovo W540, and I immediately started taking notes about my impressions of this machine. I'm a long-time user of Lenovo's W series, and I think those notebooks are very good machines, albeit a bit more expensive and so probably not for the average user. Anyway, this post has been in the making for some time now, and I'll update it in due course. In it I'll write about my impressions, as well as about installing and using Fedora 21 on this machine. Note that when I started writing this post Fedora 21 was in beta, so everything I say here might change in the final release (in case I don't update the post).

First, let me start with the positive observations about the machine:
  • The new machine is thinner than the W530.
  • The power adapter seems to be smaller than the one for the W530.
  • It is very easy to access the RAM slots and the hard disk slot in case you want to upgrade the RAM and/or put another disk in.
Well, true, that's a very short list. So, here are some negative ones:
  • They again changed the connector for the power adapter. In other words, none of the adapters you have so far (and I have quite a lot!) will work with this machine.
  • There is no latch that holds the lid closed.
  • At first, I thought that there were no LEDs that show you the state of the laptop when the lid is closed. This is important because otherwise I don't know if the laptop is in sleep mode or not when the lid is closed. But later I realised that there is one: it is the dot over the letter i in the ThinkPad logo in the lower right corner of the lid (looking from above).
  • There is a numeric keypad that, honestly, I don't need. The space for it was gained by dropping the speakers on both sides of the keyboard that the W530 had. Later I realised even more how cumbersome this part of the keyboard is. Namely, I hold this machine on my lap a lot of the time, and because of the numeric keypad I can not keep the laptop centered there.
  • There are no more separate buttons on the touchpad; the touchpad itself is a button. But I managed to get used to it by getting rid of the reflex to click with the other hand.
  • The function keys are overloaded with additional functionality. For example, the F1 key is now mute, and it even has an LED indicator! Furthermore, all the function keys now default to the alternative functionality that you obtained on previous models using the Fn key. Now it is the opposite: you get the regular functionality of those keys by pressing the Fn key in the lower left corner! This is weird! Only later did I find out that there is a small LED on the Fn key, and if you press Fn+Esc it turns on, meaning that the keys are now function keys, F1, F2, etc.
To be honest, I don't like those changes and probably it'll take some time until I get used to them.

Installing OS

Ok, now about the Fedora 21 installation. First, I switched the laptop to UEFI boot exclusively. I don't know if this is good or bad, but in the end I did it. Note that there is also a hybrid mode, in which both the old BIOS and the new UEFI are tried during boot (whichever one manages to boot the machine), but I didn't use it. Anyway, since I had removed the CD-ROM drive, I had to boot the machine some other way to install Fedora 21. First, I tried PXE boot, but had no luck with UEFI. Note that I did manage to network boot the machine using the old BIOS. This means that I had properly configured everything for network boot with the old BIOS, but not with UEFI. Since I wanted UEFI boot, I gave up on this option.

Since I managed to obtain a USB stick, I decided to go that route. First, I dd'ed the efidisk.img file to it. That booted the laptop, but it couldn't find anything to install from, let alone start the installation. So I downloaded the live Fedora 21 Workstation image and dd'ed that to the USB stick. That worked.

For some strange reason, I decided to use the Btrfs filesystem. Actually, the reason is that I can have separate root and home partitions that use the same pool of free space. That way it can't happen that I am low on free space on one partition while having a lot of it on another. But I didn't notice that encryption is selected for the whole volume, and not for a specific file system, i.e. mount point. Since no reasonable person installs an OS these days without encryption, I repeated the installation several times until I got past that problem.

While working with the new OS, what frustrated me a lot was the touchpad. It didn't register a click when I tried to left click, it scrolled randomly, and I couldn't find the middle click or the right click. Another problem appeared when I pressed the mouse button and then scrolled while keeping it pressed; this is also somewhat problematic on this touchpad.

Here are some additional links I found:
That's it for now. Stay tuned for more...

