If you ever wondered how hard it is to differentiate between encrypted file and some regular binary file, then wonder no more, because the answer is: very easy, at least in principle. Namely, by looking at the distribution of octets in a file you can know if it is encrypted or not. The point is that after encryption the file must look like a random sequence of bytes. So, every byte, from 0 to 255, will occur almost the same number of times in a file. On the other hand, text files, images, and other files will have some bytes occurring more frequently than the others. For example, in text files space (0x20) occurs most frequently. So, the procedure is very easy, just count how many times each octet occurs in the file and then look at the differences. You can do it by hand(!), write your own application, or use some existing tool.
Of course, looking at the distribution isn't so easy, so, for a quick check it is possible to use entropy as a measure of randomness. Basically, low entropy means that the file is not encrypted, while high entropy means it is. Entropy, in case of file that consists of stream of octets, can be in a range [0,8], with 0 being no entropy at all, and 8 being the maximum entropy. I found a relatively good
explanation of entropy that also has links to the code to calculate entropy in a file. Now, in case you don't have a tool for calculating entropy at a hand, some compression tool will be useful, too. Namely, random files can not be compressed. And that is the reason why compression is always done before encryption, because encrypted data looks random and is thus incompressible.
Let me demonstrate all this on an example. Take, for example,
this file. That is a file linked to by
the blog post I mentioned previously in which entropy is explained. It is a simple C file you can download on your machine and compile it into executable called entropy1:
$ gcc -Wall -o entropy1 entropy1.c -lm
So, let us see what is the entropy of the C file itself:
$ ./entropy1 entropy1.c
4.95 bits per byte
it is relatively low entropy, meaning, not much information content is there with respect to the size. Ok, let's encrypt it now. We'll do this using openssl tool:
$ openssl enc -aes-128-cbc \
-in entropy1.c -out entropy1.c.aes \
-K 000102030405060708090a0b0c0d0e0f \
-iv 000102030405060708090a0b0c0d0e0f
The given command encrypts input file (
entropy1.c) using 128-bit key
000102030405060708090a0b0c0d0e0f in AES-CBC mode and with initialization vector
000102030405060708090a0b0c0d0e0f. The output is written to a file with the name
entropy1.c.aes. Let us see what is the entropy now:
$ ./entropy1 entropy1.c.aes
7.86 bits per byte
That's very high entropy, and in line with what we've said about encrypted files having high entropy. Lets check how compressible is the original file, and encrypted one:
$ zip entropy1.zip entropy1.c
adding: entropy1.c (deflated 48%)
$ zip entropy2.zip entropy1.c.aes
adding: entropy1.c.aes (stored 0%)
As can be seen, encrypted file isn't compressible, while plain text is. What about entropy of those compressed files:
$ ./entropy1 entropy1.zip
7.53 bits per byte
$ ./entropy1 entropy2.zip
7.79 bits per byte
As you can see, they both have high entropy. This actually means that entropy can not be used to differentiate between compressed files and encrypted ones. And now, there is a problem, how to differentiate between encrypted and compressed files? In that case statistical tests of randomness have to be performed. Good tool for that purpose is
ent. Basically, that tool performs two tests: Chi Square and Monte Carlo pi approximation. I found a
good blog post about using that tool. At the end, there is a list of rule of thumb rules that can be used to determine if the file is encrypted or compressed:
- Large deviations in the chi square distribution, or large percentages of error in the Monte Carlo approximation are sure signs of compression.
- Very accurate pi calculations (< .01% error) are sure signs of encryption.
- Lower chi values (< 300) with higher pi error (> .03%) are indicative of compression.
- Higher chi values (> 300) with lower pi errors (< .03%) are indicative of encryption.
Take those numbers only indicatively, because we'll see different values in the example later. But nevertheless, they are a good hint on what to look at. Also, let me explain from where did he get the value
300 for Chi square. If you watched linked video in that blog post, you'll know that there are two important parameters when calculating Chi Squared test, number of
degrees of freedom and a
critical value. Namely, in our case we have 256 values and that translates into 255 degrees of freedom. Next, if we select p=0.05, i.e. we want to determine if the stream of bytes is random with 95% of certainty, then looking into
some table we obtain critical value 293.24, rounded it is 300. When Chi square is below that value, then we accept null hypothesis, i.e. the data is random, otherwise we reject null hypothesis, i.e. the data isn't random.
Here is the output from a given tool for encrypted file:
$ ./ent entropy1.c.aes Entropy = 7.858737 bits per byte.
Optimum compression would reduce the sizeof this 1376 byte file by 1 percent.
Chi square distribution for 1376 samples is 253.40, and randomlywould exceed this value 51.66 percent of the times.
Arithmetic mean value of data bytes is 129.5407 (127.5 = random).Monte Carlo value for Pi is 3.091703057 (error 1.59 percent).Serial correlation coefficient is 0.040458 (totally uncorrelated = 0.0).
Then, here is for just compressed file:
$ ./ent entropy1.zip Entropy = 7.530339 bits per byte.
Optimum compression would reduce the sizeof this 883 byte file by 5 percent.
Chi square distribution for 883 samples is 1203.56, and randomlywould exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 111.6512 (127.5 = random).Monte Carlo value for Pi is 3.374149660 (error 7.40 percent).Serial correlation coefficient is 0.183855 (totally uncorrelated = 0.0).
And finally, for encrypted and then compressed file:
$ ./ent entropy2.zip
Entropy = 7.788071 bits per byte.
Optimum compression would reduce the size
of this 1554 byte file by 2 percent.
Chi square distribution for 1554 samples is 733.20, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 121.0187 (127.5 = random).
Monte Carlo value for Pi is 3.166023166 (error 0.78 percent).
Serial correlation coefficient is 0.162305 (totally uncorrelated = 0.0).
First, note that the error in calculating Pi is higher in case of the compressed file (7.40 vs 0.78/1.59). Next, Chi square is in the first case less than 300 which indicates encryption (i.e. the data is random), while in the last two cases it is bigger than 300 meaning the data is not so random!
For the end, here is a link to
interesting blog post describing file type identification based on the byte distribution and its application for reversing XOR encryption.