Everything about nothing: Research paper: "Before We Knew It..."

The paper I'll comment on in this post was presented at the ACM Conference on Computer and Communications Security, held on Oct. 16-18, 2012. The paper tries to answer the following question: How long, on average, does a zero-day attack last before it is publicly disclosed? This is one of those questions that seem so obvious once you see them, but for some strange reason they didn't occur to you. And what's more, no one else had tried to tackle it! At the same time, this is a very important question from a security defense perspective!

Anyway, having an idea is one thing, realizing it is quite another. And in this paper, the authors did both very well! In short, it is an excellent paper with a lot of information to digest, so I strongly recommend that anyone in the security field study it carefully. I'll put here some notes on what I found novel and/or interesting while reading it. Note that someone else may find other parts of the paper interesting or novel, so this post is definitely not a replacement for reading the paper yourself. Also, if you search a bit on the Internet, you'll find that others have covered this paper too.

Contributions

The contributions of this paper are:

Analysis of the dynamics and characteristics of zero-day attacks, i.e. how long it takes before zero-day attacks are discovered, how many hosts are targeted, etc.
A method to detect zero-day attacks by correlating anti-virus signatures of malicious code that exploits certain vulnerabilities with a database of binary file downloads across 11 million hosts on the Internet.
Analysis of the impact of vulnerability disclosure on the number of attacks and their variants. In other words, what happens when a new vulnerability is disclosed, and how exactly that affects the number and variety of attacks.

Findings and implications

The key finding of this research is that zero-day attacks are discovered, on average, 312 days after they first appeared. In one case it took 30 months to discover the vulnerability that was being exploited. The next finding is that zero-day attacks, by themselves, are quite targeted. There are of course exceptions, but the majority of them hit only a few hosts. Finally, after a vulnerability is disclosed there is a surge in both the number of new exploit variants and the number of attacks. The number of attacks can be up to five orders of magnitude higher after disclosure than before.
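To get a feeling for what "five orders of magnitude" means, here is a tiny sketch that computes the growth between two attack counts. The counts are made up by me for illustration and are not taken from the paper.

```python
import math


def orders_of_magnitude(before: int, after: int) -> float:
    """Return log10(after / before): by how many orders of magnitude `after` exceeds `before`."""
    if before <= 0 or after <= 0:
        raise ValueError("both counts must be positive")
    return math.log10(after / before)


# Made-up counts, for illustration only (not from the paper).
attacks_before_disclosure = 3
attacks_after_disclosure = 300_000

growth = orders_of_magnitude(attacks_before_disclosure, attacks_after_disclosure)
print(f"attacks grew by about {growth:.1f} orders of magnitude")
```

In other words, a handful of targeted attacks before disclosure would turn into hundreds of thousands afterwards.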

During their study, the authors found 11 previously unknown zero-day attacks. But be careful: this isn't a statement that they found vulnerabilities not previously known. It means that the vulnerabilities were known, but until this research it wasn't known that they had been used in zero-day attacks.

So, here is my interpretation of the implications of these findings. They suggest that at any given time there are at least a dozen exploits in the wild that no one is aware of. So, if you are a high-profile company, you are in serious trouble. Of course, as usual, whether you are being attacked, or will be, depends on many things. Next, when a vulnerability is disclosed and no patch is available yet, you have to be very careful because that is exactly when the surge of attacks happens.

Data and Code used for research

In experimental research, data is of utmost importance. If you have data, you can experiment on it; otherwise, you can only theorize. Furthermore, the more data you have, the better the experiments you can do. Personally, I believe that access to data is one of the main things that sets productive researchers apart. So, it is interesting to see which data sources the authors used, and whether this data is available to others.

According to the paper, the authors used the following sources:

Worldwide Intelligence Network Environment (WINE).
Open Source Vulnerability Database (OSVDB) - public database with information about vulnerabilities going back to 1998. The authors use it to obtain discovery, disclosure and exploit release dates.
Symantec Threat Explorer - public site with data about the latest threats, risks and vulnerabilities. It also has historical data available, which was the primary reason the authors used it in this research.
Symantec data set with dynamic analysis results for malware samples.

From the WINE database the authors use the binary reputation and anti-virus telemetry data sets. Anti-virus telemetry records when malware was detected on a user's host, i.e. when the anti-virus product matched a signature that identifies some malware. The authors had at their disposal 225 million detections recorded on 9 million hosts in the period from December 2009 until August 2011. Each record provides the detection time, the threat label, the hash of the detected malicious file and the country where the machine resides. Binary reputation is a database of hashes of all binary files downloaded on hosts that participate in WINE. Whether or not what you download is malicious, the hash is calculated and sent to WINE. The authors report that this collection started in February 2008, and by the time the research concluded, in March 2012, there were 32 billion reports for about 300 million distinct binary files, downloaded on 11 million hosts. Each report contains the download time, the hash and the source URL of the download.
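To make the two record types easier to picture, here is a minimal sketch of how they could be modeled. The class and field names are my own assumptions based on the description above, not the actual WINE schema, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class AvDetection:
    """One anti-virus telemetry record (hypothetical field names)."""
    detected_at: datetime
    threat_label: str
    file_hash: str
    country: str


@dataclass(frozen=True)
class BinaryReport:
    """One binary reputation report (hypothetical field names)."""
    downloaded_at: datetime
    file_hash: str
    source_url: str


def first_seen(reports: Iterable[BinaryReport], file_hash: str) -> Optional[datetime]:
    """Earliest download time recorded for file_hash, or None if it was never reported."""
    times = [r.downloaded_at for r in reports if r.file_hash == file_hash]
    return min(times) if times else None


# Invented sample data, for illustration only.
sample_reports = [
    BinaryReport(datetime(2010, 5, 3, 12, 0), "hash-a", "http://example.com/a.exe"),
    BinaryReport(datetime(2010, 2, 1, 8, 30), "hash-a", "http://example.org/a.exe"),
    BinaryReport(datetime(2011, 1, 9, 17, 45), "hash-b", "http://example.net/b.exe"),
]
earliest_a = first_seen(sample_reports, "hash-a")
print(earliest_a)
```

The first_seen lookup is the interesting part: the earliest report for a hash is what later gets compared with the disclosure date.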

The authors restricted their research to Windows vulnerabilities, which, I believe, is because WINE probably collects data mainly from Windows hosts. If Linux or other operating systems are covered at all, there's probably a lot less data available for them. Also, as the authors themselves state, Windows is the primary platform for attacks so, again, much more data is available.

Method

The idea underlying the method used by the authors is simple: look at known vulnerabilities and try to determine when they were first exploited. If this exploit time is earlier than the disclosure date, then it's a zero-day exploit.

In order to realize this idea the authors go through several steps:

First, they search for candidate vulnerabilities to analyze. They do that by querying OSVDB and the Microsoft and Adobe security bulletins. The key information the authors get from those sources is the CVE identifiers of the vulnerabilities and their disclosure dates. Here we get the following set:
cve_id_i = {T_discovery, T_disclosure, T_exploit_release, T_patch_release}
Then, the authors query Symantec's Threat Explorer using the CVE numbers. Threat Explorer provides Symantec's IDs of the threats (i.e. malicious code) that exploited the given vulnerabilities. These IDs are used to connect vulnerabilities with Symantec's other databases. Note: I was unable to find the virus ID in Threat Explorer! Did I miss something? Anyway, after this step we have the following set:
Z_i = {virus_id_i, cve_id_i}
The authors now use the anti-virus telemetry data to link each virus_id with exploits, that is, with binary files that contain exploits. Whenever an anti-virus product detects a virus, it records which virus was detected (virus_id) and also the hash of the file in which it was detected. This hash is important and is used in the fifth step.
E_ij = {virus_id_i, file_hash_id_j}
This is an optional step. The problem the authors face is that lately more and more attacks don't rely on executable files but are embedded in various data files. The anti-virus telemetry database has signatures for those files, but the binary reputation database has no data that would show when they first appeared. So, the assumption is that after a non-binary file is downloaded and compromises a machine, some binary file is downloaded next. In this step, the authors try to identify that downloaded binary file, which can then be tracked in the binary reputation database. To implement this idea, the authors looked at dynamic analysis data (part of Symantec's Threat Explorer database), which provides information about what was downloaded after a successful attack. The authors note that a download after the attack isn't necessarily a consequence of the attack, which is why this step is optional! Note: I was unable to find this data in Threat Explorer!
Now that the authors have the hash of the malicious file, they go to the binary reputation database and look up the earliest time this file was seen on the Internet. The time is important because if it is earlier than the disclosure time, a zero-day attack has been detected!
Because the authors are also interested in attack intensity, they collect additional information about each occurrence of the malicious file in the binary reputation data. This is used to analyze what happens after disclosure, i.e. how many attacks there are and how many variants.
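The steps above can be sketched in a few lines of Python. This is only my simplified reading of the method, not the authors' implementation: the optional fourth step is left out, all identifiers and dates are invented, and the function name find_zero_days is hypothetical.

```python
from datetime import date

# Step 1: CVE identifier -> public disclosure date (invented values).
disclosure = {
    "CVE-EXAMPLE-1": date(2010, 6, 1),
    "CVE-EXAMPLE-2": date(2011, 3, 15),
}

# Step 2: Z_i, virus_id -> cve_id (invented values).
virus_to_cve = {"V1": "CVE-EXAMPLE-1", "V2": "CVE-EXAMPLE-2"}

# Step 3: E_ij, (virus_id, file_hash) pairs from anti-virus telemetry.
av_detections = [("V1", "aaa"), ("V1", "bbb"), ("V2", "ccc")]

# Step 5 input: (file_hash, download date) pairs from binary reputation.
binary_reports = [
    ("aaa", date(2010, 1, 10)),
    ("bbb", date(2010, 7, 2)),
    ("aaa", date(2010, 6, 20)),
    ("ccc", date(2011, 4, 1)),
]


def find_zero_days(disclosure, virus_to_cve, av_detections, binary_reports):
    """Return {cve_id: days the exploit was seen before disclosure} for zero-day cases."""
    # Earliest download time of every file hash.
    first_seen = {}
    for file_hash, seen in binary_reports:
        if file_hash not in first_seen or seen < first_seen[file_hash]:
            first_seen[file_hash] = seen

    # Earliest sighting of any exploit file linked to each CVE.
    earliest = {}
    for virus_id, file_hash in av_detections:
        cve = virus_to_cve.get(virus_id)
        seen = first_seen.get(file_hash)
        if cve is None or seen is None:
            continue
        if cve not in earliest or seen < earliest[cve]:
            earliest[cve] = seen

    # Zero-day: the exploit was seen before the vulnerability was disclosed.
    return {
        cve: (disclosure[cve] - seen).days
        for cve, seen in earliest.items()
        if cve in disclosure and seen < disclosure[cve]
    }


zero_days = find_zero_days(disclosure, virus_to_cve, av_detections, binary_reports)
print(zero_days)
```

With this toy data only the first CVE comes out as a zero-day, because one of its exploit files was downloaded months before the disclosure date, while the second CVE's file shows up only after disclosure.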

(Potential) problems

Only vulnerabilities that have a CVE assigned were analyzed.

Only Symantec customers that opted in to data collection are covered. Hosts without AV software, or with software from other vendors, are not part of the study.

There is a possibility that the duration of zero-day attacks is underestimated: attacks first observed close to the beginning of the data collection period may have actually appeared even before the recording started.

Lately, more and more attacks are embedded in non-executable files, e.g. pdf, xlsx, doc, etc., which might skew the results. The authors state that, starting from late 2011, the binary reputation database also records hashes of non-executable files, but this was too late for this study to take into account.

Symantec's Internet Threat Report lists 31 zero-day vulnerabilities; of those, the method devised and used by the authors found only 7. The authors analyzed why the remaining 24 were not found, and the reasons are:

Web attacks are not covered by this research, but they account for a significant number of zero-day attacks. Binary reputation data and anti-virus telemetry monitor only host-based attacks.
Polymorphic malware makes the hashes of the binaries completely different, so it is very hard to correlate them with the cve_ids they exploit.
Then, there are the non-executable exploits already mentioned.
Zero-day exploits are used in targeted attacks, and it is likely that the targets aren't participating in the binary reputation database.
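The underestimation problem mentioned above (attacks first seen right at the start of data collection) can be illustrated with a small check. This is my own sketch, not something the authors describe: the exact start day and the 30-day margin are assumptions, and only the month (February 2008) comes from the paper.

```python
from datetime import date, timedelta

# Binary reputation collection started in February 2008; the exact day is my assumption.
COLLECTION_START = date(2008, 2, 1)


def possibly_censored(first_seen: date, margin_days: int = 30) -> bool:
    """True if the first sighting is so close to the start of collection
    that the attack may have begun before any data was recorded."""
    return first_seen - COLLECTION_START <= timedelta(days=margin_days)


# Invented first-sighting dates, for illustration only.
print(possibly_censored(date(2008, 2, 10)))
print(possibly_censored(date(2009, 1, 1)))
```

For any attack flagged this way, the measured duration is only a lower bound on how long the zero-day attack really lasted.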

Some interesting side info

First off, the following graph (source in dot language) neatly defines the life cycle of zero-day vulnerabilities, i.e. what might happen from the day they are discovered until the day they are remediated:

Actually, this graph shows the potential life cycle of each unknown vulnerability in the code. The state Vulnerability is instantiated the moment a programmer makes a mistake in the code. Note that the graph applies to a vulnerability in a specific code base, not to a particular instance of that code base running at some user who, for example, forgot to patch their software. The moment a patch that fixes the vulnerability is created, the instance of the graph for that vulnerability goes into the state Remediation, no matter how much unpatched code remains on the Internet.

The testing branch means that the software vendor found the problem and, based on that, produced a patch. The exploit branch, on the other hand, means that a blackhat found the vulnerability and started a zero-day attack. Note that blackhats also analyze patches to figure out which vulnerability was corrected and whether it's possible to exploit it. After public dissemination of information about a vulnerability, AV vendors can release signatures to protect their customers. Note that AV vendors actually track exploits, not vulnerabilities! Public dissemination also allows vendors to react and create a patch.

The previous graph shows the states of a vulnerability, which can generate many different sequences. One of them, in which blackhats discover the vulnerability first, consists of the following steps:

Vulnerability introduced (t_v)
Exploit released in the wild (t_e)
Vulnerability discovered by the vendor (t_d)
Vulnerability disclosed publicly (t_0)
Anti-virus signatures released (t_s)
Patch released (t_p)
Patch deployment completed (t_a)

Based on these time points we can say that a zero-day attack is any attack that exploits some vulnerability and that happens before time t_0 (i.e. t_e < t_0).
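The definition above translates directly into code. Here is a minimal sketch that keeps only the two time points the definition needs, t_e and t_0. The class name VulnTimeline and the dates are my own inventions, used just for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class VulnTimeline:
    """Two points on a vulnerability's timeline; None means 'not known'."""
    t_e: Optional[date]  # exploit first seen in the wild
    t_0: Optional[date]  # vulnerability disclosed publicly

    def is_zero_day(self) -> bool:
        """A zero-day attack happened if the exploit predates disclosure (t_e < t_0)."""
        return self.t_e is not None and self.t_0 is not None and self.t_e < self.t_0

    def zero_day_days(self) -> Optional[int]:
        """Number of days between t_e and t_0, or None if this was not a zero-day."""
        if not self.is_zero_day():
            return None
        return (self.t_0 - self.t_e).days


# Invented dates, for illustration only.
early_exploit = VulnTimeline(t_e=date(2011, 3, 1), t_0=date(2011, 3, 31))
late_exploit = VulnTimeline(t_e=date(2011, 5, 1), t_0=date(2011, 3, 31))

print(early_exploit.is_zero_day(), early_exploit.zero_day_days())
print(late_exploit.is_zero_day(), late_exploit.zero_day_days())
```

The second example is the more common case, where the exploit appears only after disclosure and so does not count as a zero-day attack.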

Attackers, when they find new vulnerabilities, try to maximize their benefit. This means that they wait for the right moment to use a vulnerability in a zero-day attack. Of course, if it turns out that someone else has discovered the vulnerability too, they have to use it as soon as possible.

Vulnerability markets

Vulnerability markets are very interesting, and I have already seen papers that try to analyze this black market. There is info about prices, some investigative data, and also papers that deal with insider stuff. I'll have to investigate this more deeply in the future. This topic is interesting because it shows how easy (or hard) it is for someone to get a zero-day vulnerability and attack a target!

Literature

What follows is a copy of the references from the paper, along with links to on-line versions and some of my quick comments.

Adobe Systems Incorporated. Security bulletins and advisories, 2012.
R. Anderson and T. Moore. The economics of information security. In Science, vol. 314, no. 5799, 2006. (pdf)
W. A. Arbaugh, W. L. Fithen, and J. McHugh. Windows of vulnerability: A case study analysis. IEEE Computer, 33(12), December 2000.

The authors observed the number of intrusions happening during each phase of the vulnerability life cycle. They showed that even after vendors provide patches, there is still a significant number of successful attacks, meaning that many users don't patch their machines.

A. Arora, R. Krishnan, A. Nandkumar, R. Telang, and Y. Yang. Impact of vulnerability disclosure and patch availability - an empirical analysis. In Workshop on the Economics of Information Security (WEIS 2004), 2004.
S. Beattie, S. Arnold, C. Cowan, P. Wagle, and C. Wright. Timing the application of security patches for optimal uptime. In Large Installation System Administration Conference, pages 233–242, Philadelphia, PA, Nov 2002.

The authors of this study find that 10% of patches have problems of their own.

J. Bollinger. Economies of disclosure. In SIGCAS Comput. Soc., 2004.
D. Brumley, P. Poosankam, D. X. Song, and J. Zheng. Automatic patch-based exploit generation is possible: Techniques and implications. In IEEE Symposium on Security and Privacy, pages 143–157, Oakland, CA, May 2008.

This paper analyzes the possibility of automatically generating an exploit, given an application and a patch that corrects some unknown vulnerability in the unpatched application. Even though the method isn't fully general and has problems of its own, it shows that it is possible, to some extent, to generate exploits automatically. This means that security analysis, which is based on worst-case behavior, should treat this as a real possibility.

H. C. H. Cavusoglu and S. Raghunathan. Emerging issues in responsible vulnerability disclosure. In Workshop on Information Technology and Systems, 2004.
D. H. P. Chau, C. Nachenberg, J. Wilhelm, A. Wright, and C. Faloutsos. Polonium : Tera-scale graph mining for malware detection. In SIAM International Conference on Data Mining (SDM), Mesa, AZ, April 2011.
CVE. A dictionary of publicly known information security vulnerabilities and exposures, 2012.
N. Falliere, L. O’Murchu, and E. Chien. W32.stuxnet dossier, February 2011.
S. Frei. Security Econometrics: The Dynamics of (In)Security. PhD thesis, ETH Zurich, 2009.

In this work, the author analyzes publicly available vulnerability and exploit databases and derives various statistics from them. For example, he concludes that on disclosure day, 15% of vulnerabilities have exploits available. He also compares how fast Microsoft and Apple react to new vulnerabilities and shows that there are significant differences between them. Still, both have unpatched vulnerabilities even 180 days after disclosure. Frei also looked at who discovered the vulnerabilities and determined that between 2000 and 2007, 10% of them were discovered through programs that pay whitehats.

This PhD thesis doesn't seem to be freely available on-line; you have to buy it.

S. Frei. End-Point Security Failures, Insight gained from Secunia PSI scans. Predict Workshop, February 2011.

I couldn't find this reference on the Internet. It is supposed to talk about the failure of patch management, i.e. many vulnerabilities don't have patches at the time of their disclosure. Furthermore, users have to take care of 14 update mechanisms on a machine. All this means that there is a big problem that needs to be solved.

Google Inc. Pwnium: rewards for exploits, February 2012.

Google pays whitehats for the discovery of each vulnerability that allows compromise of its browser.

A. Greenberg. Shopping for zero-days: A price list for hackers’ secret software exploits. Forbes, 23 March 2012.

An article that gives a glimpse of the black market that trades in vulnerabilities. It includes prices of vulnerabilities for certain products.

A. Lelli. The Trojan.Hydraq incident: Analysis of the Aurora 0-day exploit, 25 January 2010.
R. McMillan. RSA spearphish attack may have hit US defense organizations. PC World, 8 September 2011.
M. A. McQueen, T. A. McQueen, W. F. Boyer, and M. R. Chaffin. Empirical estimates and observations of 0day vulnerabilities. In Hawaii International Conference on System Sciences, 2009.

The authors try to estimate the real number of zero-day exploits that existed in the past.

Microsoft. Microsoft security bulletins, 2012.
C. Miller. The legitimate vulnerability market: Inside the secretive world of 0-day exploit sales. In Workshop on the Economics of Information Security, Pittsburgh, PA, June 2007.
OSVDB. The open source vulnerability database, 2012.

Public database that aggregates all available sources of information about vulnerabilities that have been disclosed since 1998.

A. Ozment and S. E. Schechter. Milk or wine: does software security improve with age? In 15th conference on USENIX Security Symposium, 2006.
P. Porras, H. Saidi, and V. Yegneswaran. An analysis of Conficker's logic and rendezvous points, 2009.
Qualys, Inc. The laws of vulnerabilities 2.0, July 2009.
T. Dumitraș and D. Shou. Toward a standard benchmark for computer security research: The Worldwide Intelligence Network Environment (WINE). In EuroSys BADGERS Workshop, Salzburg, Austria, Apr 2011.

WINE is developed by Symantec Research Labs with the aim of sharing field data with the research community. The data is collected from about 11 million hosts on the Internet that have Symantec products installed and whose users agreed to participate in the data collection process.

Because of privacy concerns, in order to access this data you have to sign an NDA and be prepared to visit a Symantec location, as the data is not sent outside of Symantec. Details can be found on this Web page (scroll down to the How to participate section).

E. Rescorla. Is finding security holes a good idea? In IEEE Security and Privacy, 2005.
U. Rivner. Anatomy of an attack, 1 April 2011, Retrieved on 19 April 2012.
SANS Institute. Top cyber security risks - zero-day vulnerability trends, 2009.
B. Schneier. Cryptogram Newsletter - Full disclosure and the window of exposure, September 2000.

First definition and discussion of the term window of exposure.

B. Schneier. Locks and full disclosure. In IEEE Security and Privacy, 2003.
B. Schneier. The nonsecurity of secrecy. In Commun. ACM, 2004.
M. Shahzad, M. Z. Shafiq, and A. X. Liu. A large scale exploratory analysis of software vulnerability life cycles. In Proceedings of the 2012 International Conference on Software Engineering, 2012.

A study similar to the one done by S. Frei [13], but on a larger set of data.

Symantec Corporation. Symantec global Internet security threat report, volume 13, April 2008.
Symantec Corporation. Symantec global Internet security threat report, volume 14, April 2009.
Symantec Corporation. Symantec global Internet security threat report, volume 15, April 2010.
Symantec Corporation. Symantec Internet security threat report, volume 16, April 2011.
Symantec Corporation. Symantec Internet security threat report, volume 17, April 2012.

This report is referenced because it provides the information that from 2008 until 2011 there were 43 identified zero-day exploits. Some of the zero-day exploits discovered in this research are not counted among those 43, while the research itself didn't find all of the known zero-day exploits.

Symantec Corporation. Symantec threat explorer, 2012.
Symantec.cloud. February 2011 intelligence report, 2011.
(See here for a full list of available reports)
N. Weaver and D. Ellis. Reflections on Witty: Analyzing the attacker. ;login: The USENIX Magazine, 29(3):34–37, June 2004.


Sunday, October 28, 2012
