“Beware of geeks bearing formulas”

Those are the words of Warren Buffett, who warned of the coming credit crisis. Buffett—one of the very few—had little faith in the “complicated, computer-driven models that many financial giants rely on to minimize risk.”

Reader Dan Hughes reminds us of this article in today’s Wall Street Journal, which looks at why AIG did so miserably.

AIG built a lot of models which attempted to quantify the risk and uncertainty in their financial instruments. They, like many other firms, tried to verify how well these models did, but they only did so on the very data that was used to build the models.

Now, if you are a regular reader of this blog, you will know that we often talk about how easy it is to build a model to fit any set of data. In fact, with today’s computing power, doing so is only a matter of investing a small amount of time.

But while a model fitting the data that was used to build it is a necessary condition for that model to work in reality, it is not a sufficient one. Any model must also be tried on data that was not used—in any way—to build it.
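A small sketch makes the point. The data here are invented, and the "model" is deliberately silly: a 1-nearest-neighbor predictor that simply memorizes its training points, so its fit to the data used to build it is perfect while telling us nothing about its worth.

```python
import random

random.seed(1)

# True relationship: y = 2x + noise.  The "model" memorizes every training
# point (1-nearest-neighbor), so its fit to the training data is perfect.
train = [(i / 10, 2 * (i / 10) + random.gauss(0, 1)) for i in range(100)]
test  = [(i / 10 + 0.05, 2 * (i / 10 + 0.05) + random.gauss(0, 1)) for i in range(100)]

def predict(x, memory):
    # Predict the y of the closest memorized x.
    return min(memory, key=lambda p: abs(p[0] - x))[1]

def mse(points, memory):
    return sum((predict(x, memory) - y) ** 2 for x, y in points) / len(points)

print("error on the data used to build it:", mse(train, train))  # exactly 0
print("error on data it has never seen:  ", mse(test, train))    # far larger
```

The in-sample error is exactly zero, which by the in-sample standard makes this the best model possible; the out-of-sample error reveals it as noise-chasing.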

What happened at AIG, and at other financial houses, was that events occurred which were not anticipated or that had never happened before. Meaning, in short, that the models in which so many had so much faith did not work in reality.

There is only one true measure of a model’s value: whether or not it works. That it is theoretically sound, or that it uses pleasantly arcane and inaccessible mathematics, or that it matches our desires, or that “only PhDs can understand” it are all very nice things, but they are none of them necessary. Many complex models which are in use are loved and trusted because of these things, but they should not be. They should only be valued to the extent that they accurately quantify the uncertainty of the real-life stuff that happens (climate models anyone?).

What the AIG models failed to account for were the “unknown unknowns”—to use Donald Rumsfeld’s much maligned quotation. They did not quantify the uncertainty of events which they did not know about. They thought that the models quantified the uncertainty of every possible thing that would happen, but of course they did not. Meaning that they were overconfident.

AIG’s failure is yet another in a long series of lessons that the more complex the situation, the less certain we should be.

A validation set is a portion of a data set used in data mining to assess the performance of prediction or classification models that have been fit on a separate portion of the same data set (the training set). Typically both the training and validation sets are selected at random. The validation set then provides a more objective measure of the performance of the various models fit to the training data, since performance on the training set itself is unlikely to be a good guide to performance on data the models were not fit to.
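The mechanics of such a split are simple. A minimal sketch, using stand-in records and a 70/30 ratio (the ratio is a common convention, not anything from the article):

```python
import random

random.seed(42)
records = list(range(1000))  # stand-ins for rows of (predictors, outcome)

random.shuffle(records)
cut = int(0.7 * len(records))        # a common, but arbitrary, 70/30 split
training_set = records[:cut]
validation_set = records[cut:]

# Every record lands in exactly one set, so validation performance is
# measured on data the model was not fit to.
assert len(training_set) == 700 and len(validation_set) == 300
assert set(training_set).isdisjoint(validation_set)
```

The disjointness check is the whole point: no record used to fit the model may appear in the set used to judge it.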

Validation sets are a good idea, and a standard practice in some stat modeling shops, but they may also lead to misjudged confidence. That is because validation sets may also be lacking the “compleat picture.”

I did a year of finance in grad school, and my experience was that most finance modeling didn’t really care much about validation. Worse yet, the more variables you threw into the soup, generally the better your grade. I mean, complex means good, right?

Basically, my grades were determined by little more than the size of my coefficients and the p-values obtained after running the model through Stata. No validation needed, because that was simply “beyond the scope of the course.”

Considering that most MBAs I know had about the same experience, it’s no wonder these companies all have the same thinking on what constitutes good modeling.

Models can equally well be validated by constructing data that have known outcomes; analytical solutions are one example. Using this method, possible but highly unlikely situations can be tested. This might be called testing the degenerate (or Black Swan) cases.
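As an illustration of the idea (my own toy example, not anything from the comment): a numerical integrator can be checked against integrals whose analytic answers are known, including a degenerate case that ordinary use might never exercise.

```python
import math

def trapezoid(f, a, b, n=10_000):
    # Numerical integral of f over [a, b] by the trapezoid rule.
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * (f(a) + f(b)) + interior)

# Validate against constructed cases with known analytic answers,
# including a degenerate zero-width interval ordinary use might never hit.
assert abs(trapezoid(math.sin, 0.0, math.pi) - 2.0) < 1e-6       # integral of sin = 2
assert abs(trapezoid(lambda x: x * x, 0.0, 1.0) - 1 / 3) < 1e-6  # integral of x^2 = 1/3
assert trapezoid(math.sin, 1.0, 1.0) == 0.0                      # degenerate: a = b
```

The last assertion is the Black Swan case: a zero-width interval will almost never arise in routine use, but if the code mishandled it (dividing by n before checking, say) the error would lurk unseen until the one day it mattered.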

Additionally, it is crucial to ensure that the software code itself contains no Black Swan cases: portions that are seldom executed, but that produce significantly incorrect results when they are. Or worse, the code might simply crash.

All this goes in spades if any results from the software are used to investigate Black Swan situations in any possible application area: investigations of physical phenomena and processes whose consequences affect the health and safety of the public, no matter how unlikely their occurrence.

Please bear in mind that what you know about this situation is only what you have been told by a reporter who likely doesn’t understand it himself. Before going into full model-bashing mode, also note that the article itself points out that the models were built to do nothing more than predict default experience, and that they may very well do a very good job of that. In other words, management requested a predictive model and got one that, as far as we know, worked. The news here is not that the models didn’t work. It’s that management didn’t request that models be built to address all of the known risks.

If I were the reporter, I would have wanted to understand exactly what ‘risk’ caused the problem. The idea that facing a collateral call represents a “risk” is odd. There is a risk of my net position losing value, but posting collateral is not itself a risk. If I have two offsetting positions with identical contracts and one counterparty makes a collateral call on me, I have a right to make an equal collateral call on the counterparty with whom I have the offsetting position. If that counterparty is insolvent I may not collect, but:

1. It is the job of the credit department to make sure that I am not surprised by such events;
2. I have seen no evidence that this is what happened at AIG; and
3. This is not a risk of being faced with a collateral call. It is the risk of having a counterparty fail.

Regarding validation sets, the concept has always puzzled me. In what sense is a model calibrated on 50% of the available data and “tested” against the other half better than a model calibrated on 100% of the data? I’d like to see a rigorous proof of that. I agree that many people find the approach intuitively comforting, but I’m not sure it has any real validity. What I find more troubling than the failure to arbitrarily split datasets is the all-too-common lack of any theoretical justification for the conclusions drawn from datamining exercises.

At the same time it is standard procedure to “backtest” models. Value-at-Risk models typically measure risk over a one-day time horizon and are backtested daily. The purpose of these backtests is to identify statistically unlikely circumstances that suggest that market conditions have changed and that action should be taken to minimize exposure to that market. In other words, prudent risk managers backtest so that they are less likely to be surprised by “Black Swan” events.
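The bookkeeping behind such a backtest is not complicated. A sketch under stated assumptions: the P&L series here is simulated, and 2.33 is simply the 99% quantile of a unit-variance normal, standing in for whatever the risk model would actually report.

```python
import random

random.seed(7)

# Hypothetical daily P&L (unit variance) and the matching one-day 99% VaR.
pnl = [random.gauss(0, 1) for _ in range(250)]   # one trading year of P&L
var_99 = 2.33                                     # 99% quantile of a unit normal

exceedances = sum(1 for x in pnl if x < -var_99)
expected = 0.01 * len(pnl)                        # 2.5 exceptions per 250 days

print(f"exceedances: {exceedances}, expected about {expected}")
# Basel-style rule of thumb: more than 4 exceptions in 250 days puts a
# 99% VaR model in the "yellow zone" and calls for a review of the model.
if exceedances > 4:
    print("too many exceptions: the model may be understating risk")
```

An exception count far above its expectation is exactly the “statistically unlikely circumstance” the comment describes: a signal that the market has changed and the model no longer describes it.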

Finally, devotees of the “Black Swan” are, I think, making a great deal of not very much. It has long, long been recognized by practitioners everywhere that most markets are heavy-tailed. In response best practice requires that stress tests be run on portfolios and that estimated tail losses be calculated. Regardless, large market moves will create big winners and big losers. So what?

It’s likely we will never actually know the exact causes of the failure of AIG and the other financial companies. To extrapolate from one report is not wise.

Validation sets are a puzzle. There is no, and can be no, proof that using validation sets/cross-validation gives you a superior model. What really happens is that several models are under consideration (all are fit on the training set), and they are all tested on the validation set. The models are usually tweaked at this point so that they fit the validation set better.

In other words, the validation set becomes a training set once removed (it doesn’t even get far enough away to be a kissing cousin). Thus, while using a validation set can give you some comfort that your model is good, it is no proof of it. This is why I always insist that the true test of a model is how it performs in real life and on data that was not used in any way in fitting it. This is, after all, how it is done in physics, chemistry, engineering, and so on.
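The leakage described above can be made concrete with a deliberately absurd example of my own devising: fifty “models” that are pure coin-flip noise. Selecting whichever one happens to score best on the validation set makes that set part of the fitting, so the winner’s validation score flatters it.

```python
import random

# Fifty "models" that are pure noise: each one just guesses coin flips.
# Selecting the one that scores best on the validation set makes that set
# part of the fitting process, so the winner's score is optimistically biased.
def accuracy(model_seed, data_seed, n=200):
    guess = random.Random(model_seed)   # the model's "predictions"
    truth = random.Random(data_seed)    # the outcomes it tries to predict
    hits = sum((guess.random() < 0.5) == (truth.random() < 0.5) for _ in range(n))
    return hits / n

# "Tweak" by selection: keep whichever noise model looks best on validation.
val_scores = {m: accuracy(m, data_seed=999) for m in range(50)}
best = max(val_scores, key=val_scores.get)

print("winner's validation accuracy:", val_scores[best])                 # flattering
print("same model on untouched data:", accuracy(best, data_seed=1234))   # back to chance
```

No model here has any skill at all, yet the best validation score will sit comfortably above 50%. Only data untouched by both fitting and selection reveals the truth, which is the comment's point about real-life performance being the only true test.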

I don’t think that people are suddenly claiming that, Lo!, financial data is heavy tailed. The idea is that the models in use are necessarily fit to the data in hand. If that data does not include very extreme events, then it is likely that the models in use will not anticipate them. You don’t always know what you don’t know.
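A quick simulation shows how badly a model fit only to the data in hand can misjudge the extremes. The “returns” below are invented (Student-t with 3 degrees of freedom, a standard stand-in for fat-tailed data); a normal model is fit to them by matching mean and standard deviation, then asked about a large down move.

```python
import math
import random

random.seed(11)

def student_t(df):
    # A heavy-tailed draw: standard normal divided by sqrt(chi-square/df).
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

# Hypothetical fat-tailed "returns" (Student-t, 3 degrees of freedom).
returns = [student_t(3) for _ in range(100_000)]

# Fit a normal model by moments, as a naive modeler would:
mu = sum(returns) / len(returns)
sd = math.sqrt(sum((x - mu) ** 2 for x in returns) / len(returns))

# How often does the data fall more than 4 fitted standard deviations low?
threshold = mu - 4 * sd
empirical = sum(1 for x in returns if x < threshold) / len(returns)
normal_pred = 0.5 * math.erfc(4 / math.sqrt(2))  # P(Z < -4) under the normal

print(f"normal model says {normal_pred:.1e}; the data says {empirical:.1e}")
```

The fitted normal assigns the 4-sigma drop a probability of a few in a hundred thousand; the data it was fit to produce such drops far more often. A model calibrated only on calm stretches of such data would be still more overconfident.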

Thanks for the response to my comments, particularly as they pertain to validation sets. I will use it in future discussions I am sure.

As you might have guessed I have been a financial engineer for more years than I like to admit, so my instinct is to come to the defense of my kith and kin. Naturally, I think that this defense has some justification. The danger of an overreliance on purely statistical models has long been understood and acknowledged, which is exactly the reason best practice calls for Risk Committees to spend significant effort in examining the underlying structure of markets and building stress scenarios that can be run against derivative portfolios to give some insight into the tails of the return distributions.

There are both ex post and ex ante problems with this in practice, however. Ex ante, it is difficult to assign probabilities to the scenarios generated. I tend toward Bayesianism, so I believe that one should be able to assign a probability to any event, since that probability is by nature subjective. Still, getting anyone to do so is not easy, especially when you consider the ex post problem that later observers will be quick to judge one’s assessment wrong if an unlikely event comes to pass. The NYT article does that very thing when it notes that home price models included 10-20% declines, but that the probabilities assigned to those assessments were low. The author calls this a “miss”. Maybe, but the fact that an event that I judge as having a low probability comes to pass does not make my assessment wrong. Some other standard must be applied.

In the sub-prime mortgage mess I suspect that the other standard would involve some more fundamental type of model. I believe that the history of this episode, if it is written, will conclude that the crisis was not precipitated simply by an increase in the default rates on mortgages. The thing that transformed an unfortunate turn of events into a crisis was that Fannie and Freddie were levered to astronomical levels, so that an increase in defaults that would have been manageable caused both institutions to fail. Since Fannie and Freddie were THE source of market liquidity their failure resulted in several trillion dollars worth of mortgage-backed instruments finding no bid.

If this picture of the melt-down is correct, then the mistake made by the banks was not to miss the possibility of higher defaults or lower house prices, but to miss the possibility that Fannie and Freddie would collapse and that the government would let them.