Of course, looking at the whole distribution isn't so easy, so for a quick check you can use entropy as a measure of randomness. Roughly speaking, low entropy means the file is not encrypted, while high entropy means it is. For a file that consists of a stream of octets, entropy lies in the range [0, 8] bits per byte, with 0 being no entropy at all and 8 being the maximum. I found a relatively good explanation of entropy that also links to code for calculating the entropy of a file. In case you don't have a tool for calculating entropy at hand, a compression tool is useful, too: random data cannot be compressed. That is also the reason why compression is always done before encryption; encrypted data looks random and is thus incompressible.

Let me demonstrate all this with an example. Take, for example, this file, the one linked from the blog post I mentioned previously in which entropy is explained. It is a simple C file you can download to your machine and compile into an executable called entropy1:

$ gcc -Wall -o entropy1 entropy1.c -lm

So, let us see what the entropy of the C file itself is:

$ ./entropy1 entropy1.c
4.95 bits per byte

That is relatively low entropy, meaning there is not much information content relative to the file's size. OK, let's encrypt it now using the openssl tool:

$ openssl enc -aes-128-cbc \
    -in entropy1.c -out entropy1.c.aes \
    -K 000102030405060708090a0b0c0d0e0f \
    -iv 000102030405060708090a0b0c0d0e0f

The given command encrypts the input file (entropy1.c) in AES-CBC mode using the 128-bit key 000102030405060708090a0b0c0d0e0f and the initialization vector 000102030405060708090a0b0c0d0e0f, and writes the output to entropy1.c.aes. Let us see what the entropy is now:

$ ./entropy1 entropy1.c.aes
7.86 bits per byte

That is very high entropy, in line with what we said about encrypted files having high entropy. Let's check how compressible the original file and the encrypted one are:

$ zip entropy1.zip entropy1.c

adding: entropy1.c (deflated 48%)

$ zip entropy2.zip entropy1.c.aes

adding: entropy1.c.aes (stored 0%)

As can be seen, the encrypted file isn't compressible, while the plain text is. What about the entropy of those compressed files?

$ ./entropy1 entropy1.zip

7.53 bits per byte

$ ./entropy1 entropy2.zip

7.79 bits per byte

As you can see, both have high entropy, which means entropy alone cannot be used to differentiate between compressed and encrypted files. So how do we tell them apart? In that case statistical tests of randomness have to be performed. A good tool for that purpose is ent, which performs two main tests: Chi square and a Monte Carlo pi approximation. I found a good blog post about using that tool; at its end there is a list of rules of thumb that can be used to determine whether a file is encrypted or compressed:

- Large deviations in the chi square distribution, or large percentages of error in the Monte Carlo approximation are sure signs of compression.
- Very accurate pi calculations (< .01% error) are sure signs of encryption.
- Lower chi values (< 300) with higher pi error (> .03%) are indicative of compression.
- Higher chi values (> 300) with lower pi errors (< .03%) are indicative of encryption.

You may wonder where the value *300* for Chi square comes from. If you watched the video linked in that blog post, you'll know that there are two important parameters when performing a Chi square test: the number of *degrees of freedom* and a *critical value*. Namely, in our case there are 256 possible byte values, which translates into 255 degrees of freedom. Next, if we select p=0.05, i.e. we want to determine with 95% certainty whether the stream of bytes is random, then looking into a table of critical values we obtain 293.24, which rounds to roughly 300. When the Chi square statistic is below that value we accept the null hypothesis, i.e. the data is random; otherwise we reject the null hypothesis, i.e. the data isn't random.

Here is the output of that tool for the encrypted file:

$ ./ent entropy1.c.aes
Entropy = 7.858737 bits per byte.
Optimum compression would reduce the size
of this 1376 byte file by 1 percent.
Chi square distribution for 1376 samples is 253.40, and randomly
would exceed this value 51.66 percent of the times.
Arithmetic mean value of data bytes is 129.5407 (127.5 = random).
Monte Carlo value for Pi is 3.091703057 (error 1.59 percent).
Serial correlation coefficient is 0.040458 (totally uncorrelated = 0.0).

Then, here is the output for the compressed file:

$ ./ent entropy1.zip
Entropy = 7.530339 bits per byte.
Optimum compression would reduce the size
of this 883 byte file by 5 percent.
Chi square distribution for 883 samples is 1203.56, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 111.6512 (127.5 = random).
Monte Carlo value for Pi is 3.374149660 (error 7.40 percent).
Serial correlation coefficient is 0.183855 (totally uncorrelated = 0.0).

And finally, for the file that was encrypted and then compressed:

$ ./ent entropy2.zip
Entropy = 7.788071 bits per byte.
Optimum compression would reduce the size
of this 1554 byte file by 2 percent.
Chi square distribution for 1554 samples is 733.20, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 121.0187 (127.5 = random).
Monte Carlo value for Pi is 3.166023166 (error 0.78 percent).
Serial correlation coefficient is 0.162305 (totally uncorrelated = 0.0).

First, note that the error in the Pi calculation is highest for the compressed file (7.40% versus 1.59% and 0.78%). Next, Chi square in the first case is less than 300, which indicates encryption (i.e. the data looks random), while in the last two cases it is larger than 300, meaning the data is not so random!

To finish, here is a link to an interesting blog post describing file type identification based on the byte distribution, and its application to reversing XOR encryption.
