Of course, looking at the whole distribution isn't so easy, so for a quick check you can use entropy as a measure of randomness. Roughly speaking, low entropy means the file is not encrypted, while high entropy means it is. For a file that consists of a stream of octets, entropy lies in the range [0, 8] bits per byte, with 0 being no entropy at all and 8 being the maximum. I found a relatively good explanation of entropy that also links to code for calculating the entropy of a file. In case you don't have a tool for calculating entropy at hand, a compression tool is useful, too: random data cannot be compressed. That is also the reason why compression is always done before encryption; encrypted data looks random and is thus incompressible.

Let me demonstrate all this with an example. Take, for example, this file, the one linked from the blog post I mentioned previously in which entropy is explained. It is a simple C file you can download to your machine and compile into an executable called entropy1:

$ gcc -Wall -o entropy1 entropy1.c -lm

So, let us see what the entropy of the C file itself is:

$ ./entropy1 entropy1.c
4.95 bits per byte

That is relatively low entropy, meaning there is not much information content relative to the file's size. OK, let's encrypt it now using the openssl tool:

$ openssl enc -aes-128-cbc \
    -in entropy1.c -out entropy1.c.aes \
    -K 000102030405060708090a0b0c0d0e0f \
    -iv 000102030405060708090a0b0c0d0e0f

The given command encrypts the input file (entropy1.c) in AES-CBC mode using the 128-bit key 000102030405060708090a0b0c0d0e0f and the initialization vector 000102030405060708090a0b0c0d0e0f, and writes the output to entropy1.c.aes. Let us see what the entropy is now:

$ ./entropy1 entropy1.c.aes
7.86 bits per byte

That is very high entropy, in line with what we said about encrypted files having high entropy. Let's check how compressible the original file and the encrypted one are:

$ zip entropy1.zip entropy1.c

adding: entropy1.c (deflated 48%)

$ zip entropy2.zip entropy1.c.aes

adding: entropy1.c.aes (stored 0%)

As can be seen, the encrypted file isn't compressible, while the plain text is. What about the entropy of those compressed files?

$ ./entropy1 entropy1.zip

7.53 bits per byte

$ ./entropy1 entropy2.zip

7.79 bits per byte

As you can see, both have high entropy, which means entropy alone cannot be used to differentiate between compressed and encrypted files. So how do we tell them apart? In that case statistical tests of randomness have to be performed. A good tool for that purpose is ent, which performs two main tests: Chi square and a Monte Carlo pi approximation. I found a good blog post about using that tool; at its end there is a list of rules of thumb that can be used to determine whether a file is encrypted or compressed:

- Large deviations in the chi square distribution, or large percentages of error in the Monte Carlo approximation are sure signs of compression.
- Very accurate pi calculations (< .01% error) are sure signs of encryption.
- Lower chi values (< 300) with higher pi error (> .03%) are indicative of compression.
- Higher chi values (> 300) with lower pi errors (< .03%) are indicative of encryption.

You may wonder where the value *300* for Chi square comes from. If you watched the video linked in that blog post, you'll know that there are two important parameters when performing a Chi square test: the number of *degrees of freedom* and a *critical value*. Namely, in our case there are 256 possible byte values, which translates into 255 degrees of freedom. Next, if we select p=0.05, i.e. we want to determine with 95% certainty whether the stream of bytes is random, then looking into a table of critical values we obtain 293.24, which rounds to roughly 300. When the Chi square statistic is below that value we accept the null hypothesis, i.e. the data is random; otherwise we reject the null hypothesis, i.e. the data isn't random.

Here is the output of that tool for the encrypted file:

$ ./ent entropy1.c.aes
Entropy = 7.858737 bits per byte.
Optimum compression would reduce the size
of this 1376 byte file by 1 percent.
Chi square distribution for 1376 samples is 253.40, and randomly
would exceed this value 51.66 percent of the times.
Arithmetic mean value of data bytes is 129.5407 (127.5 = random).
Monte Carlo value for Pi is 3.091703057 (error 1.59 percent).
Serial correlation coefficient is 0.040458 (totally uncorrelated = 0.0).

Then, here is the output for the compressed file:

$ ./ent entropy1.zip
Entropy = 7.530339 bits per byte.
Optimum compression would reduce the size
of this 883 byte file by 5 percent.
Chi square distribution for 883 samples is 1203.56, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 111.6512 (127.5 = random).
Monte Carlo value for Pi is 3.374149660 (error 7.40 percent).
Serial correlation coefficient is 0.183855 (totally uncorrelated = 0.0).

And finally, for the file that was encrypted and then compressed:

$ ./ent entropy2.zip
Entropy = 7.788071 bits per byte.
Optimum compression would reduce the size
of this 1554 byte file by 2 percent.
Chi square distribution for 1554 samples is 733.20, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 121.0187 (127.5 = random).
Monte Carlo value for Pi is 3.166023166 (error 0.78 percent).
Serial correlation coefficient is 0.162305 (totally uncorrelated = 0.0).

First, note that the error in the Pi calculation is highest for the compressed file (7.40% versus 1.59% and 0.78%). Next, Chi square in the first case is less than 300, which indicates encryption (i.e. the data looks random), while in the last two cases it is larger than 300, meaning the data is not so random!

To finish, here is a link to an interesting blog post describing file type identification based on the byte distribution, and its application to reversing XOR encryption.
