Monday, August 26, 2013

Risk Management: Out with the Old, In with the New!

In this post I'm going to attempt to explain why I think many existing methods of assessing and managing risk in information security (a.k.a. "the Old") are going the wrong direction and describe what I think is a better direction (a.k.a. "the New").

While the House of Cards metaphor is crude, it gets across the idea of interdependence
between risk factors, in contrast to the "risk bricks" of the old methods.

Here's my main message:

Existing methods that treat risk as a pile of autonomous "risk bricks" are the wrong direction for risk management. ("Little 'r' risk")

A better method is to measure and estimate risk as an interdependent system of factors, roughly analogous to a House of Cards. ("Big 'R' Risk")

I call the first "Little 'r' risk" because it attempts to analyze risk at the most granular micro level. I call the second "Big 'R' Risk" because the focus is on risk estimation at an organization level (e.g. business unit), and then to estimate the causal factors that have the most influence on that aggregate risk. With some over-simplification, we can say that Little 'r' risk is bottom-up while Big 'R' Risk is top-down. (In practice, Big 'R' Risk is more "middle-out".)

This new method isn't my idea alone. It comes from many smart folks who have been working on Operational Risk for many years, mainly in Financial Services. For a more complete description of the new approach, I strongly recommend the following tutorial document by the Society of Actuaries: A New Approach for Managing Operational Risk.

For readability and to keep an already-long post from being even longer, I'm going to talk in broad generalities and skip over many details. Also, I'm not going to explain and evaluate each of the existing methods. Finally, I'm not going to argue point-by-point against all the folks who assert that probabilistic risk analysis is futile, worthless, or even harmful.

The Old: Little 'r' risk management

Once upon a time, people thought it would be a good idea to assess and manage information risk by enumerating individual "risks", which were conceived as a set of bad events that might happen as a result of some specific combinations of threats, vulnerabilities, and consequences. Sometimes they used quantitative probability estimates and other times they used the language of probability but without quantification -- using ordinal scales instead.

Roughly speaking, what all these methods have in common is that they attempt to estimate an organization's total risk (Big 'R') by summing all the Little 'r' risks (as events) using some variant of this formula:

risk = likelihood × severity (1)

However, the sum of all Little 'r' risks is rarely computed. Instead, the list of Little 'r' risks is typically used for prioritization and triage decisions, mostly involving which vulnerabilities to patch, or which controls to implement to address vulnerabilities.

Often the Little 'r' risks are mapped onto a 3x3 matrix, with likelihood on one axis, severity on the other, and each cell labeled Low, Medium, or High Risk.

Notice the implied multiplication of likelihood and severity (consequences), which means "Probable, Minor" risks and "Improbable, Major" risks are evaluated the same -- both are "Medium Risk".
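The conflation is easy to see if you write the ordinal arithmetic out. Below is a minimal sketch of one common 3x3 scheme; the rank values and the Low/Medium/High cutoffs are my illustrative assumptions, not from any particular standard:

```python
# Illustrative 3x3 ordinal risk matrix (labels and cutoffs are assumptions).
LIKELIHOOD = {"Improbable": 1, "Possible": 2, "Probable": 3}
SEVERITY = {"Minor": 1, "Moderate": 2, "Major": 3}

def ordinal_risk(likelihood, severity):
    """Multiply ordinal ranks -- common practice, but mathematically dubious."""
    score = LIKELIHOOD[likelihood] * SEVERITY[severity]
    if score <= 2:
        return "Low"
    elif score <= 4:
        return "Medium"
    return "High"

# Two very different risk profiles collapse into the same cell:
print(ordinal_risk("Probable", "Minor"))     # frequent nuisance events -> "Medium"
print(ordinal_risk("Improbable", "Major"))   # rare catastrophic events -> "Medium"
```

A frequent-but-trivial risk and a rare-but-catastrophic risk both land on score 3, even though they call for completely different treatments.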

Some people go further than equation (1), above, and estimate the components that determine "likelihood" and "severity" of particular events. Here's an example of such a formula:

risk = threat × vulnerability × impact (2)

It turns out that this formula is only useful as a mnemonic and not as a probabilistic risk formula, because it has serious flaws -- e.g. "probability of a threat", Pr(threat), isn't well defined by itself, and so on. But even so, it has continued to appear in many presentations and documents, including those by "thought leaders" and top policy makers, often with the lead-in: "Of course, everyone knows that the formula for risk is...". And thus the cultural and institutional support for Little 'r' risk management continues.

What's Wrong with Little 'r' risk

As a practical fact, the Little 'r' risk approach hasn't helped us improve information security and reduce risk. Very few success stories have been made public, but plenty of stories have been told "off the record" or over a beer about the frustrations and shortcomings of the approach. Many frustrations arise from a lack of data or a lack of organizational support. People complain about gaming the system and giving assessments that management or auditors want to hear. Many people also find flaws in the formulas and scales. But my argument is not based on these factors. Instead, I believe there are seven major flaws that, together, lead me to assert that it is the wrong direction:

Assumed independence -- the most serious flaw in the Little 'r' risk approach is the assumption that all of these "risk bricks" are independent of each other and can be analyzed separately.

Assuming that consequences (impact) are tied to threat/vulnerability pairs -- severity of impact is mostly not related to the specifics of the threats and vulnerabilities exploited. Instead, it's related to the threat actor's motives and capabilities, and also to the posture of the organization regarding detection, response, recovery, and resilience. For details, see this paper on breach impact estimation.

Diverts attention away from broad causal factors, esp. systemic root causes -- the primary use case of Little 'r' risk analysis focuses on patching or controlling the high-priority risks, treated independently. But this focus on individual "risk bricks" can divert attention from causal factors that are much more important, namely those that drive or determine many vulnerabilities, or those that are most likely to cause a "mild" loss event to become a "severe" loss event. This includes many non-technical causal factors such as hiring practices, contracting practices, broken business processes or business rules, incentives, and culture.

Conflates high frequency/low severity risks with low frequency/high severity risks -- (see 3x3 matrix above). This is a consequence of multiplying "likelihood" and "severity" for each "risk brick", especially if ordinal values are used. (Multiplying ordinal values is pure nonsense, but people do it, and some people even defend it.)

Lack of credible aggregation -- As I mentioned above, there is an implicit assumption that summing the Little 'r' risks will result in the aggregate Big 'R' Risk for an organization. But almost no one does this, maybe because no one trusts such a summation.

Not actionable outside of the prioritization decision -- For example, it's almost never used to guide investment decisions, make vs. buy decisions, insource vs. outsource decisions, IT or business architecture decisions, or to manage incentives for people or organizations.

Often not feasible -- For any organization above medium size, it's not feasible to enumerate all vulnerabilities in all assets, and all threats that might attack those vulnerabilities. Even if you could, you'd be leaving out those vulnerabilities that are not yet known to defenders but might be known to attackers. This includes various "employee error" conditions. Also, if you take the task of quantitative risk estimation seriously, it can be a very time-consuming and labor-intensive activity. Thus, even the most committed organizations don't do it more than once per year, and less often if they can get away with it.

Reactions in Professional Communities

(This is my personal view on reactions. Other people may have different views or opinions.)

One reaction has been outright rejection, often with considerable hostility and animosity. There's a sizable army of nay-sayers who find various faults with Little 'r' risk management and thereby claim that this proves that all forms of probabilistic risk analysis are futile, worthless, or even harmful. These nay-sayers have included many thought leaders and luminaries within the information security and "hacker" communities, and thus have persuaded many people who might be on the fence.

Another reaction has been to develop variant methods that avoid many or most of the problems listed above, including those related to ordinal scales and the flaws in the risk = threat × vulnerability × impact formula. Perhaps the most prominent method is FAIR, a proprietary risk analysis method sold by CXOware. Among other things, it quantifies threat, vulnerability, and control strength in terms of probabilities of successful attack. Because it's proprietary, only the people who have been trained by CXOware really know what's included in it and how it works. There might even be some FAIR extensions into what I'm calling Big 'R' risk, but I'm not familiar with those details or how successful people have been in applying it. (Maybe some FAIR experts can add more in a comment.)

Within the community of professionals working on security metrics and risk analysis, there has been recent work to develop and apply probabilistic models to subsets of the risk problem -- e.g. to the likelihood that various types of vulnerabilities will be attacked, etc. However, this has not yet led to any changes in how Little 'r' risk is assessed or estimated in most organizations.

But the majority reaction among advocates of Little 'r' risk has been to carry on as though these short-comings and flaws didn't exist. Thus, in official circles, Little 'r' risk remains the accepted wisdom regarding quantitative risk management in information security.

The New: Big 'R' Risk Management

Recently, people outside of information security developed a different set of methods that, together, I'll call "Big 'R' Risk Management". These folks were trained and experienced in Actuarial Science and financial risk analysis (e.g. Basel II Loss Distribution Approach, among others). Thus, their starting place was addressing the question: "how do we measure operational risk for the organization as a whole?" In Financial Services, this has direct consequences on capital reserves and, therefore, on return on capital. It has also been used by some firms in risk transfer via insurance or other instruments. In the 90s and early 2000s, the focus was on estimating the aggregate probability distributions for frequency and severity of loss events, but without much regard for what might be causing those loss events or what an organization might do to reduce operational risk. But in the last 10 years increasing emphasis has been placed on causal modeling, predictive modeling, and other methods that connect the aggregate risk measure to what an organization can do to mitigate operational risk.

Companies outside of Financial Services have been mostly oblivious to all this, and many today would assume that such methods don't apply to them. That's unfortunate, because the methods are general and can be applied, with some modifications, to managing information security risk wherever that risk is significant at a business unit or enterprise level.

For more detail and references, see the Society of Actuaries tutorial mentioned above. In outline, the Big 'R' approach involves the following steps:

Estimate the probability distribution of total security costs for a business unit or enterprise, starting first by defining "total costs" and by measuring what the organization is spending today. (See this post for more details and references)

Derive measures from the full distribution, not just the central tendency ('expectation'), the standard deviation, or the 90th quantile. This avoids the conflation problem (number 4 above).

Perform analysis to determine which causal factors or conditions are most significant drivers of the probability distribution of total costs. Causal factors can differ across the distribution, so what is "significant" depends on the decision and context.

Use the analysis to support the most important decisions, which might include prioritizing controls but would also include many others listed under number 6), above.

When mature, use the analysis to design and implement incentive systems, including possibly risk budgets for departments.
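The first two steps above can be sketched as a tiny Monte Carlo in the spirit of the Loss Distribution Approach mentioned earlier. Everything here is an illustrative assumption: the Poisson frequency / lognormal severity choice and all parameter values are placeholders, not calibrated estimates for any real organization.

```python
import math
import random

def poisson(rng, lam):
    """Sample from a Poisson distribution (Knuth's method; fine for small lambda)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_annual_losses(freq_mean=4.0, sev_mu=10.0, sev_sigma=1.5,
                           years=100_000, seed=42):
    """Sketch of a Loss Distribution Approach: each simulated year's total
    loss is the sum of a Poisson number of lognormal event severities.
    All parameter values are hypothetical placeholders."""
    rng = random.Random(seed)
    totals = []
    for _ in range(years):
        n_events = poisson(rng, freq_mean)
        totals.append(sum(rng.lognormvariate(sev_mu, sev_sigma)
                          for _ in range(n_events)))
    totals.sort()
    return totals

losses = simulate_annual_losses()
mean_loss = sum(losses) / len(losses)
q90 = losses[int(0.90 * len(losses))]
q99 = losses[int(0.99 * len(losses))]
print(f"mean annual loss: {mean_loss:,.0f}")
print(f"90th percentile:  {q90:,.0f}")
print(f"99th percentile:  {q99:,.0f}")   # tail measures, not just the mean
```

Note that the 99th percentile, not the mean, is the kind of measure that drives capital-reserve and insurance decisions, which is why the full distribution matters.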

Comparing Big 'R' with Little 'r' - A Toy Example

I'll describe a toy example to show how the two approaches would quantify risk in dollars.

Let's say a business unit has 10 servers. Nine are publicly exposed but not connected to each other (i.e. not visible or reachable). The tenth is a central server that’s only accessible from the 9 public servers (neglecting admin access, etc.). Let’s say that the public servers perform similar functions, and that they all depend on having the central server functioning to do their work. Also, assume that the servers are heterogeneous in configuration, so having a vulnerability on one doesn’t necessarily imply the same or similar vuln will exist on others.

For simplicity, let’s say that the business only cares about three classes of impact events:

Class 1 -- breach events that require only IT remediation (i.e. no data compromised and no function interrupted)

Class 2 -- breach events in which confidential data is compromised or exfiltrated by a malicious threat agent

Class 3 -- breach events that disable the business unit's ability to provide public services, which then require unbudgeted response and recovery costs for both the business unit and its customers.

Class 1 impacts could be caused by attacks on any vulnerability in any of the ten servers. This could include a wide range of unintentional errors, random malware or spyware infections, etc. In Big 'R' analysis, this can be modeled as a causal network with two giant OR gates. The first OR gate is the collection of threats (i.e. attacks or actions by threat agents). We might call this OR gate a "most opportunistic" function because whichever threats are most opportunistic are most likely to find and exploit any vulnerabilities. The second OR gate is the collection of all vulnerabilities (known and unknown) on all servers. We might call this OR gate a "weakest link" function because it's only as strong (resistant to attack) as its weakest link(s) -- i.e. the vulns that are most easily found and exploited.

Therefore, to estimate the probable cost of Class 1 impacts under the Big 'R' risk method, you need to be able to identify and estimate which threats (and threat agents) are most prolific and opportunistic, and also what the “weakest links” are. In contrast, the Little 'r' risk method would attempt to estimate risk of Vulnerability #6827 in isolation — using "annualized loss expectancy" (ALE) or similar method. But this method won't work if Vulnerability #6827 is just one of ten vulnerabilities that all qualify as “weakest links”.
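The two-OR-gate structure for Class 1 events can be sketched in a few lines. The probabilities below are made up purely for illustration, and multiplying the two gates together assumes the gates are independent -- a simplifying assumption, not part of the toy example itself:

```python
def or_gate(probs):
    """Probability that at least one of several independent events occurs."""
    p_none = 1.0
    for p in probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

# Hypothetical per-period probabilities (illustrative only).
# Gate 1: "most opportunistic" -- chance each threat class is active against us.
p_threat_active = [0.9, 0.6, 0.3]      # e.g. mass malware, scanners, drive-bys
# Gate 2: "weakest link" -- chance each server exposes an exploitable vuln.
p_vuln_exposed = [0.05] * 10           # ten heterogeneous servers

# Assumes the two gates are independent of each other.
p_class1 = or_gate(p_threat_active) * or_gate(p_vuln_exposed)
print(f"P(Class 1 event) = {p_class1:.3f}")
```

Notice that no single vulnerability's "risk" appears anywhere: what matters is the combined chance that *some* opportunistic threat meets *some* weak link, which is exactly the quantity the Little 'r' method never computes.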

Similar analysis applies to Class 2 and Class 3 breaches, but the structure of the causal trees will be different. For Class 2, the only threats that matter are those from malicious threat agents whose goals and capabilities give them the ability to compromise and exfiltrate data. The vulnerabilities that matter are only those that can serve as viable "stepping stones" for an exfiltration attack, which is very sensitive to the context -- the IT architecture, the details of configuration, and also the degree of compromise of the other servers.

For Class 3, there's a very small number of attacks that can result in this severe an impact. Either all nine public servers have to be disabled or DoSed, or the central server has to be disabled. Furthermore, this has to be done in a way that's not easily recoverable, and maybe not easily diagnosable. Generally, there are very few threat agents who have the goals and capabilities to carry out such an attack. The most likely threat agent will be "internal/human error". While many vulnerabilities might be on the causal chain in such attacks, it makes no sense to value each and every vulnerability for its contribution to a Class 3 event, any more than it makes sense to value a toe or toenail for a human athlete's overall performance.

To be clear, I'm not arguing against any and all bottom-up analysis. I'm arguing against one flavor of the Little 'r' method, namely quantifying risk in dollars at the lowest level of granularity (usually for each and every vulnerability) and then somehow summing those to estimate the aggregate risk.

To understand causal factors in Big 'R' Risk Analysis, methods from other fields can be used. For example, Reliability Engineering has good theory and models for causal networks, causal trees, compromise chains, etc. including some generic structures of causal links:

“Weakest link” -- parallel OR structure where compromise of any link results in system breach

“Total effort” -- parallel AND structure where system compromise only occurs when ALL links are compromised

“Best link or effort” -- series structure where defense strength is the maximum of the strengths of the links

“Catalysts” -- one or more links provide enabling support for other (main) links, and compromise of the supporting links increases the probability of compromise for the main links
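The structures above can be sketched as simple probability combinators, assuming independent links. This is a minimal illustration of my own devising, not code from any reliability engineering library, and the catalyst model in particular (a hypothetical `boost` multiplier) is a crude stand-in for richer causal modeling:

```python
def weakest_link(link_probs):
    """Parallel OR: the system is breached if ANY link is compromised."""
    p_safe = 1.0
    for p in link_probs:
        p_safe *= (1.0 - p)
    return 1.0 - p_safe

def total_effort(link_probs):
    """Parallel AND: the system is breached only if ALL links are compromised."""
    p_all = 1.0
    for p in link_probs:
        p_all *= p
    return p_all

def catalyst(p_main, p_catalyst, boost):
    """Crude catalyst model: if the supporting link is compromised, the main
    link's compromise probability is multiplied by a hypothetical boost factor."""
    boosted = min(1.0, p_main * boost)
    return p_main * (1.0 - p_catalyst) + boosted * p_catalyst

links = [0.1, 0.2, 0.05]
print(weakest_link(links))   # breach needs just one weak link -> much higher
print(total_effort(links))   # breach needs every link -> much lower
```

Even with identical per-link numbers, the two structures give answers that differ by orders of magnitude, which is why the causal structure, not just the list of vulnerabilities, drives Big 'R' Risk.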

Essentially, adding these causal structures to our risk analysis takes us to a “meso-level” rather than the isolated micro-level analysis of individual vulnerabilities and individual threats.

I hope this example and the brief discussion makes vivid and clear the differences between the two approaches and the advantages of Big 'R' risk over Little 'r' risk.

But...Big 'R' Risk Is Very New

Big 'R' Risk is in the pioneer stage. Outside of Financial Services, it's only being implemented by organizations with visionary leaders and propensity to experiment. Therefore, I can't produce success stories. Of course, these will be necessary to promote mainstream adoption, but if all of us wait for someone else to produce those success stories, then nothing will happen. Some of us need to be the pioneering innovators.

There's a lot to be worked out regarding how to best start it (i.e. where to focus, how to structure pilots, etc.) and how to mature it. But that's what pioneers do -- they work out that stuff.

Reactions in Professional Communities

There's been very little reaction so far to Big 'R' Risk. Very few companies outside of Financial Services have implemented it, or if they have they are only just starting.

Almost none of the nay-sayers mentioned above know anything about Big 'R' Risk. A few have reacted negatively by conflating it with financial engineering and also the Great Recession (e.g. proof in their eyes that financial approaches to risk management are doomed).

No policy makers of importance know anything about Big 'R' Risk. None of the keynote speakers at major conferences know anything about it.

Conclusion

I know that the nay-sayers and harsh critics won't be convinced by what I've written here. But I hope that I've been successful in reaching some portion of people who currently think that Little 'r' risk is all there is, and maybe think that we should continue going down that path.

I know that Big 'R' Risk isn't ready for mainstream adoption, if only because we don't yet have success stories to validate it and give it credibility. But I believe that if we only pay attention to the trailing edge of maturity, we will perpetuate our losing position in the adversarial arms race that is information security today.

4 comments:

Really enjoyed your post, Russ. You've done a great job of capturing the challenges associated with "r". As for FAIR, you are correct that to-date the focus has been on "r". In that context it's been used very effectively for prioritization decisions and for developing business cases for additional resources. It's also been shown to be very effective at dealing with those all-too-frequent occasions when someone (maybe an auditor or 3rd party security "pro") comes to the table with a "high risk" finding that you know in the depths of your soul doesn't truly represent high risk. In those situations, being able to systematically and logically step through an analysis often settles disagreements.

As for "R", our next generation application (ThoroughFAIR) is focused on that exact line of thinking. It also begins to integrate a framework I've been working on for analyzing systemic causative factors. I've submitted a proposal to present on this framework at next year's RSA conference (fingers crossed). In the meantime, I've posted a high-level description of the framework on the CXOWARE blog (cxoware.com/groundhogs-day). It's taken from an early draft of the book I'm co-authoring on FAIR with Jack Freund.