Small Samples and the Overuse of Hypothesis Tests

With powerful computers and statistical packages, modelers can now run an enormous number of tests effortlessly. But should they? This article discusses how bank risk modelers should approach statistical testing when faced with tiny data sets.

In stress testing, most notably in pre-provision net revenue (PPNR) modeling, bank risk modelers often try to do a lot with a very small quantity of data. It is not uncommon for stress testing teams to forecast portfolio origination volume, for instance, with as few as 40 quarterly observations. Data resources this thin must have a profound impact on the choice of modeling approach.

The econometrics discipline, whose history extends back only to the 1930s, was developed in concert with embryonic efforts at economic data collection. Protocols for dealing with very small data sets, established by the pioneers of econometrics, can easily be accessed by modern modelers. In the era of big data, in which models using billions of observations are fairly common, one wonders whether some of these econometric founding principles have been forgotten.

The overuse and misuse of statistical tests

The issue at hand is the overuse and misuse of statistical tests in constructing stress testing models. While it is tempting to believe that it is always better to run more and more tests, statistical theory and practice consistently warn of the dangers of such an attitude. In general, given a paucity of data, the key for modelers is to remain “humble” and retain realistic expectations of the number and quality of insights that can be gleaned from the data. This process also involves relying on strong, sound, and well-thought-out prior expectations, as well as intuition, while using the data sparingly and efficiently to help guide the analysis. It also involves taking action behind the scenes to source more data.

An article by Helen Walker, published in 1940, defines degrees of freedom as “the number of observations minus the number of necessary relations among these observations.” Alternatively, we can say that the concept measures the number of observations minus the number of pieces of information on which our understanding of the data has been conditioned. A sample standard deviation, for example, has (n-1) degrees of freedom because the calculation is conditioned on an estimate of the population mean. If the calculation relies on the estimation of k separate quantities, I will have (n-k) degrees of freedom available in constructing my model.
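The degrees-of-freedom bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the function names and the data values are invented for the example and do not come from Walker's article or from any particular modeling library.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation with (n - 1) degrees of freedom."""
    n = len(values)
    mean = sum(values) / n  # one estimated quantity uses up one degree of freedom
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - 1))

def degrees_of_freedom(n, k):
    """Observations left over after estimating k separate quantities."""
    remaining = n - k
    if remaining <= 0:
        raise ValueError(f"n={n} observations cannot support k={k} estimated quantities")
    return remaining

# Hypothetical data, invented purely for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
by_hand = sample_std(values)                  # divides by n - 1 = 7
library = statistics.stdev(values)            # the standard library also divides by n - 1
population_style = statistics.pstdev(values)  # divides by n = 8 instead

print(by_hand, library, population_style)
print(degrees_of_freedom(40, 5))  # 40 quarters, 5 estimated quantities
```

Dividing by n rather than (n-1) always gives the smaller figure, and the gap matters most when n is small.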

Now suppose that I run a string of 1,000 tests and I am interested in the properties of the 1,001st test. Because, technically, the 1,001st test is conditional on these 1,000 previously implemented tests, I have only (n-1,000) degrees of freedom available for the next test. If, in building my stress test model, n=40, that count is negative and I have a distinct logical problem in implementing the test. Technically, I cannot conduct it.

Most applied econometricians, however, take a slightly less puritanical view of their craft. It is common for statisticians to run a few key tests without worrying too much about the consequences of constructing a sequence of tests. That said, good econometricians tip their hats to the theory and refrain from conducting an egregious number of tests.

The power and size of tests are also critical concerns

Even very well-built statistical tests yield errors, a fact to keep in mind when setting out to conduct diagnostic tests. Some of these error rates can usually be well controlled (typically the probability of a false positive result, known as the “size” of the test), so long as the assumptions on which the test is built are maintained. Other error rates (the rate of false negatives) are typically not controlled but depend critically on the amount of data brought to bear on the question at hand. The probability of a correct positive result (one minus the rate of false negatives) is known as the “power” of the test. Statisticians try to control the size while maximizing the power. Power is, unsurprisingly, typically low in very small samples.
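A small Monte Carlo experiment can make the distinction between size and power concrete. The sketch below rests on assumptions that are ours rather than the article's: a one-sample t-test of a zero mean, normally distributed data with unit variance, a true mean of 0.3 under the alternative, and 5% two-sided critical values copied from standard t tables. The function name and settings are hypothetical.

```python
import math
import random

# Two-sided 5% critical values of Student's t with n - 1 degrees of freedom,
# taken from standard tables (df = 39 and df = 399). Assumed, not computed here.
T_CRITICAL = {40: 2.023, 400: 1.966}

def rejection_rate(n, true_mean, reps=2000, seed=0):
    """Share of simulated samples in which the t-test rejects H0: mean = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t_stat = mean / math.sqrt(variance / n)
        if abs(t_stat) > T_CRITICAL[n]:
            rejections += 1
    return rejections / reps

size_40 = rejection_rate(40, true_mean=0.0)     # H0 is true: false positive rate
power_40 = rejection_rate(40, true_mean=0.3)    # H0 is false: power with 40 observations
power_400 = rejection_rate(400, true_mean=0.3)  # H0 is false: power with 400 observations

print(f"size at n=40:   {size_40:.3f}")
print(f"power at n=40:  {power_40:.3f}")
print(f"power at n=400: {power_400:.3f}")
```

In this idealized setting the assumptions of the test hold exactly, so the size stays close to the nominal 5% even at n=40. The power is what suffers: the same departure from the null that is detected almost every time with 400 observations is missed in a large share of the samples of 40.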

If I choose to run a statistical test, am I required to act on what the test finds? Does this remain true if I know that the test has poor size and power properties?

Suppose I estimate a model with 40 observations and then run a diagnostic test for, say, normality. The test was developed using asymptotic principles (basically an infinitely large data set). Because I have such a short series, the test’s actual size is unlikely to be well approximated by its stated nominal significance level (which is usually set to 5%). Suppose the test indicates non-normality. Was this result caused by the size distortion (the probability of erroneously finding non-normality), or does the test truly indicate that the residuals of the model follow some other (unspecified) distribution?

If I had a large amount of data, I would be able to answer this question accurately and the result of the test would be reliable and useful. With 40 observations, the most prudent response would be to doubt the result of the test, regardless of what it actually indicates.
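One way to gauge how far the actual size of a normality test may drift from its nominal level is to simulate it. The sketch below uses the Jarque-Bera statistic purely as an example of an asymptotically justified normality test; the article does not name a specific test, so this choice, along with the function names and simulation settings, is an assumption made for illustration.

```python
import math
import random

def jarque_bera(sample):
    """Jarque-Bera statistic: n/6 * (skewness^2 + (kurtosis - 3)^2 / 4)."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

def empirical_size(n, alpha=0.05, reps=5000, seed=0):
    """Rejection rate on truly normal data when the asymptotic critical value is used."""
    # For a chi-square variable with 2 degrees of freedom, P(X > x) = exp(-x / 2),
    # so the asymptotic critical value is -2 * ln(alpha), roughly 5.99 at the 5% level.
    critical = -2.0 * math.log(alpha)
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if jarque_bera(sample) > critical:
            rejections += 1
    return rejections / reps

size_40 = empirical_size(40)
print(f"nominal size 5%, simulated rejection rate at n=40: {size_40:.3f}")
```

Because the simulated data are normal by construction, every rejection is a false positive. Any gap between the printed rate and 5% is the size distortion discussed above, and rerunning with a larger n shows how quickly the asymptotic approximation improves.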

Finding non-normality

Suppose instead that you are confident that the test has sound properties. You have found non-normality: Now what? In the modeling literature, there are usually no suggestions about which actions you should take to resolve the situation. Most estimators retain sound asymptotic properties under non-normality. In small samples, a finding of non-normality typically acts only as a beacon, warning modelers to guard against problems in calculating other statistics. Even if the test is sound, it is difficult to ascertain exactly how our research is furthered by knowledge of the result. In this case, given the tiny sample, it is unlikely that the test actually is sound.

If a diagnostic test has dubious small sample properties, and if the outcome will have no influence over our subsequent decision-making, in our view, the test simply should not be applied. Only construct a test if the result will actually affect the subsequent analysis.

Dealing with strong prior views

The next question concerns the use and interpretation of tests when strong prior views exist regarding the likely underlying reality. Such views may relate to a particular statistical feature of the data, like stationarity, or to the inclusion of a given set of economic variables in the specification of the regression equation. In these cases, even though we have little data, and even though our tests may have poor size and power properties, we really have no choice but to run some tests in order to convince the model user that our specification is a reasonable one.

Ideally, the tests performed will merely confirm the veracity of our prior views based on our previously established intuitive understanding of the problem.

If the result is confounding, however, given that we have only 40 observations, the tests are unlikely to shake our previously stated prior views. If, for example, our behavioral model states that term deposit volume really must be driven by the observed term spread, and if this variable yields a p-value of 9%, should we drop the variable from our regression? The evidence on which this result is based is very weak. In cases where the prior view is well thought out and appropriate, as in this example, we would typically not need to shift ground until considerably more confounding evidence surfaces.

If, instead, the prior suggested a “toss-up” between a range of hypotheses, the test result would be our guiding light. We would not bet the house on the outcome, but the test result would be better than nothing. Toss-ups, however, are very rare in situations where the behavioral model structure has been carefully thought out before any data has been interrogated.

Running tests with limited data

With the advent of fast computers and powerful statistical packages, modelers now have the ability to run a huge number of tests effortlessly. Early econometricians, like the aforementioned Ms. Walker, would look on in envy at the ease with which quite elaborate testing schemes can now be performed.

Just because tests can be implemented does not mean that they necessarily should be. Modern modelers, faced with tiny data sets, should follow the lead of the ancients (many of whom are still alive) and limit themselves to running only a few carefully chosen tests on very deliberately specified models.

Regulators, likewise, should not expect model development teams to blindly run every diagnostic test that has ever been conceived.
