Say I want to calculate a correlation matrix for 50 stocks using three years of historical daily data, but some of the stocks were listed only a year ago or less.

This is not technically challenging, because the correlation function in R can optionally ignore missing data and compute pair-wise correlations. But on reflection it worried me: all the correlations for the recently listed stock will be biased toward the regime it lived in.
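Concretely, this is the option I mean (a minimal illustration; ret stands for my T x 50 matrix of daily returns, with NAs before each stock's listing date):

    ## Pairwise-complete correlations: each pair of stocks uses only
    ## the dates on which both have data.
    C <- cor(ret, use = "pairwise.complete.obs")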

For example, a stock that exists only in the last few months will inevitably show higher correlations than the stocks with the full 3-year history. I think this will cause bias and annoying problems in whatever application consumes this correlation matrix afterwards.

So, my question is: how do I adjust a correlation matrix whose elements are generated from different market regimes? I feel caught in a dilemma. To ensure the whole matrix lives in the same regime, it seems I must either use the smallest common sample, or throw the young stock away. Either way looks like a waste to me.

I know the correlations for that stock ought to be lower, but how much lower? Is there an approach or formula derived from solid arguments?

I did some research and was surprised to find that all the public research on correlation matrices assumes adequate data. How is that possible in practice?! It is a mystery to me that this seemingly common problem has not been publicly addressed.

After careful review, we found 楊祝昇's idea very useful in general and particularly well suited to our problem. Moreover, his solution is an original idea that is rarely seen in public. For these reasons, I am happy to offer a late bounty for his great answer.
–
Branson Dec 26 '11 at 1:25

4 Answers

Quant Guy's list is really impressive! However, I am not sure it will readily solve your specific problem; I think there is one missing piece.

Please note that imputing missing data is a very broad topic. There are many recipes for imputing missings, but each comes with its own assumptions and purposes. They do not necessarily address your specific problem well: the regime change.

To best address your specific problem, you have to define the market regime quantitatively and make it part of your adjustment formula. Otherwise it makes no logical sense to expect your model to be aware of the regime, let alone react to it properly.

In Stambaugh's '97 paper (which I think is the most relevant reference Quant Guy listed), the adjustment formula actually uses B = V21 * V11^(-1), i.e. beta. I have to say that, not long after, history taught us several times how fragile beta is, especially in a rapidly changing market environment (though I guess the application of beta was still novel and not that fragile back in the '90s?).
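For readers without the paper at hand, the structure is roughly the following (my paraphrase from memory, so please check section 2 of the paper for the exact statement): the short asset's covariances with the long assets are backfilled through the betas,

    Sigma_21 = B * Sigma_11(full),  with  B = V21 * V11^(-1) estimated on the overlapping sample,

so the long assets' extra history reaches the short asset only through B. That is exactly where the fragility I mentioned enters.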

Now let's define the market regime quantitatively. Intuitively, average correlation is a pretty neat regime indicator: simple, intuitive, and easy to employ in a proprietary model (I suspect that's part of why Ledoit-Wolf's model is so popular :)). But yes, as Branson pointed out in Ian's answer, there is a possibility of getting very undesirable results.

One potential solution is to map the intuitive indicator to a more suitable space/dimension, operate there, and then transform it back. This is a very useful technique, commonly employed in machine learning. Correlation lives in the very constrained space [-1, 1], and this greatly restricts what we can do with it. (Please don't think covariance is less constrained: once you put the entries together in a matrix, it is just as constrained as correlation. Correlation is simply the easier one to work with for spotting potential problems.)

Now, how about mapping correlation to an equally intuitive (at least to me) but less constrained space, the Signal-to-Noise Ratio:

SNR = Correlation^2 / (1 - Correlation^2)

Correlation = sqrt(SNR / (1 + SNR))

and refining my regime indicator as the median of the pairwise SNRs? (I rarely use the average in financial applications; the median is more robust to outliers.)

I don't know how others feel about SNR, but with a background in EE I feel very comfortable with it. In communication systems, SNR is exactly the regime (environment) indicator that characterizes a channel. I see a strong analogy here.

The remaining work is straightforward: use the ratio of the two regime indicators as a multiplier to adjust the young asset's pairwise SNRs against the other assets, then map the adjusted SNRs back to correlations.
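Here is a minimal sketch in R of the whole procedure (my illustration only, not production code; ret is assumed to be the T x N matrix of daily returns with NAs for the young asset's missing history, and j is the young asset's column):

    snr_adjust <- function(ret, j) {
      rho_to_snr <- function(rho) rho^2 / (1 - rho^2)
      snr_to_rho <- function(snr, s) s * sqrt(snr / (1 + snr))

      C <- cor(ret, use = "pairwise.complete.obs")

      ## Regime indicators: median pairwise SNR among the full-history
      ## assets, over the full window and over the young asset's window.
      others <- setdiff(seq_len(ncol(ret)), j)
      window <- !is.na(ret[, j])
      C_full   <- cor(ret[,       others], use = "pairwise.complete.obs")
      C_recent <- cor(ret[window, others], use = "pairwise.complete.obs")
      snr_full   <- median(rho_to_snr(C_full[upper.tri(C_full)]))
      snr_recent <- median(rho_to_snr(C_recent[upper.tri(C_recent)]))

      ## Scale the young asset's pairwise SNRs by the regime ratio and
      ## map back, preserving signs. The inverse map keeps every
      ## adjusted correlation inside (-1, 1) by construction.
      k   <- snr_full / snr_recent
      rho <- C[j, others]
      adj <- snr_to_rho(k * rho_to_snr(rho), sign(rho))
      C[j, others] <- adj
      C[others, j] <- adj
      C
    }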

You will at least gain the following benefits using this approach:

Correlations won't blow up as in your first attempt

Original ranking of pairwise correlations (with short-lived assets) is preserved, since the SNR map is monotone

Much easier to implement. No need to impute missing data.

Intuitive (to me), and easy to understand what's going on in your code.

This approach is compatible with many other techniques in Quant Guy's references such as Ledoit-Wolf Shrinkage, RMT, and weighted representative covariance matrices.

Last but not least, this is a collaborative idea with one of my most brilliant colleagues and a close friend, Manish Agarwal.

This is awesome! Your solution makes a lot of sense to me and is very easy to follow! Can you provide any references for your answer? Though you have explained many details, we are very interested to see its performance.
–
Branson Dec 13 '11 at 20:44


@Branson: Nope, sorry, you are looking at a proprietary idea ;) If you are interested in more discussion, please feel free to contact me @ google+/linked
–
楊祝昇 Dec 14 '11 at 1:45

There is some research that directly bears upon the issue of estimating covariance in the presence of unequal return histories and regime change.

Wharton professor Robert Stambaugh (of liquidity-premia fame) wrote a paper in '97 called "Analyzing Investments Whose Histories Differ in Length". Prior to the paper, most academics and practitioners would use "truncation" (i.e. restrict to the equal-history window), or maximum-likelihood estimators (similar to the pairwise-complete approach you mention). Some would also use regression to impute the missing variable.

Stambaugh goes on to propose a Bayesian approach, popularly known as the Expectation-Maximization (E-M) algorithm, to impute the returns of unequal length. He discusses covariance/correlation estimation in section 2.4.

On a practical note, it turns out there are good packages for performing this imputation in R -- Amelia II: A Program for Missing Data (which uses the E-M algorithm). There is also a package, monomvn, which estimates covariance matrices directly when the histories are of unequal length, also via E-M. The E-M algorithm is the "textbook" answer for many academics and practitioners generating a covariance matrix in the presence of missing data of unequal lengths.
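A rough usage sketch (argument names from memory, so check each package's documentation; ret is a T x N return matrix with NAs where a stock was not yet listed):

    library(monomvn)  # direct covariance estimation under monotone missingness
    library(Amelia)   # multiple imputation via the E-M algorithm

    fit <- monomvn(ret)      # may want columns ordered by amount of missingness
    S   <- fit$S             # estimated covariance matrix
    C   <- cov2cor(S)        # implied correlation matrix

    ## Alternatively, impute with Amelia and average across imputations.
    imp   <- amelia(as.data.frame(ret), m = 5, p2s = 0)
    C_imp <- Reduce(`+`, lapply(imp$imputations, cor)) / 5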

However, in my own experience the E-M algorithm has short-comings.

Since this is an out-of-consensus view, I'd encourage you to test the algorithm and see whether it conforms to your expectations. For example, start with a complete data set, artificially truncate histories to create missings, fit a covariance matrix using monomvn or impute via Amelia, and compare the securities with missings against the pristine dataset. You should also explore long vs. shorter periods of history. Side note: since a covariance matrix fully characterizes only a multivariate normal distribution, perhaps it's naive to expect great results from E-M on financial returns. You could use the E-M algorithm to fit a mixture of Gaussians (say high-vol, low-vol, and normal regimes), but then this becomes more of a research exercise than a band-aid.
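A sketch of that experiment (ret_full standing for a complete T x N return matrix):

    library(monomvn)

    set.seed(1)
    young <- sample(ncol(ret_full), 5)        # pretend five stocks are young
    ret_trunc <- ret_full
    ret_trunc[1:floor(nrow(ret_full) * 2/3), young] <- NA  # drop early history

    C_true <- cor(ret_full)
    C_est  <- cov2cor(monomvn(ret_trunc)$S)

    ## Average absolute error on the entries affected by truncation.
    mean(abs(C_true[young, ] - C_est[young, ]))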

Here are some alternatives:

A simple, reliable method is to impute the security's missing returns with the average industry or sector return (optionally adding some Gaussian noise to preserve the variance of the system). The rationale is that the sector return is the dominant factor explaining the bulk of a security's price risk and return. If your missings are random and without structure, and the return distribution of securities with missings is not dissimilar to that of securities without, then this merits consideration.
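A sketch of this sector-mean imputation (the sector vector, mapping each column to its sector label, is assumed):

    impute_sector <- function(ret, sector, add_noise = TRUE) {
      for (j in seq_len(ncol(ret))) {
        miss <- is.na(ret[, j])
        if (!any(miss)) next
        peers <- setdiff(which(sector == sector[j]), j)
        fill  <- rowMeans(ret[miss, peers, drop = FALSE], na.rm = TRUE)
        if (add_noise) {
          ## Add Gaussian noise so the imputed stretch keeps roughly the
          ## security's own variance, not the smoother sector mean's.
          extra <- sqrt(max(var(ret[, j], na.rm = TRUE) - var(fill), 0,
                            na.rm = TRUE))
          fill <- fill + rnorm(sum(miss), 0, extra)
        }
        ret[miss, j] <- fill
      }
      ret
    }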

You could also fit a cross-sectional factor model and impute the missings via the stock's factor exposures.
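Again as a sketch (fac is a hypothetical T x K matrix of factor returns; exposures are estimated from the stock's available history):

    j    <- 7                              # column of the young stock (example)
    miss <- is.na(ret[, j])
    beta <- coef(lm(ret[, j] ~ fac))       # intercept plus K exposures
    ret[miss, j] <- cbind(1, fac[miss, , drop = FALSE]) %*% beta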

A more advanced (likely overkill) technique would be to use a machine-learning method such as random forests, or a clustering algorithm like k-NN, to impute missings. (To assess whether these methods are necessary, you can look into the literature on influence analysis and robust regression.)

If you go down the path of imputing missings, you can then estimate an exponentially weighted covariance matrix (use cov.wt in R). The EW covariance matrix has a special relationship to GARCH models -- a desirable feature since volatility is auto-correlated.
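For example (ret_imp standing for the imputed, complete return matrix):

    lambda <- 0.97                     # decay; somewhere between .94 and .99
    n <- nrow(ret_imp)
    w <- lambda^((n - 1):0)            # most recent observations weigh most
    w <- w / sum(w)
    ew <- cov.wt(ret_imp, wt = w, cor = TRUE)
    ew$cov                             # EW covariance matrix
    ew$cor                             # EW correlation matrix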

These methods (as well as Ledoit-Wolf shrinkage and RMT) out-perform estimation via the sample covariance matrix. The use of a simple exponentially weighted covariance matrix is quite commonplace; popular choices of decay factor range from .94 (if you ask J.P. Morgan RiskMetrics) up to .99. The paper "Portfolio Selection by Using Time Varying Covariance Matrices" will help you explore exponentially weighted covariance matrices so you can better cope with regime change.

Another way to create covariance matrices that cope with regime change is to develop a matrix for each regime and take expectations. Essentially you partition your data based on Euclidean (or Mahalanobis) distance, build a separate covariance matrix for each regime, and take a probability-weighted average of them. Mark Kritzman has an intuitive approach to this in his article Risk, Regimes, and Overconfidence.
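A sketch of the mechanics (the turbulence threshold and regime probability below are placeholders for your own estimates):

    ## Classify dates as turbulent or quiet by Mahalanobis distance,
    ## estimate one covariance per regime, then blend by probability.
    d    <- mahalanobis(ret_imp, colMeans(ret_imp), cov(ret_imp))
    turb <- d > quantile(d, 0.75)      # top quartile = "turbulent" (assumption)
    S_turbulent <- cov(ret_imp[turb, ])
    S_quiet     <- cov(ret_imp[!turb, ])
    p <- 0.25                          # your estimate of P(turbulent regime)
    S_blend <- p * S_turbulent + (1 - p) * S_quiet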

WOW, you are the man! This is one of the most thorough answers I have ever seen, though I have to confess it will take me a while to digest and experiment. Seriously, I think this answer deserves more than a +50 bounty. Let me see what I can offer later.
–
Branson Dec 11 '11 at 23:27

There are two functions for estimating variance matrices with missing values (and aimed at finance, by the way) in the R package BurStFin.

Available via:

install.packages('BurStFin', repos="http://www.burns-stat.com/R")

but not yet for R 2.14.x.

You can of course get a correlation matrix from the variance matrix.

One function estimates a statistical factor model; the other performs Ledoit-Wolf shrinkage towards equal correlation. The treatment of the missing values in these functions is rather ad hoc. There has been some study of this (I don't remember references), but I agree, not as much as one would expect.
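Usage is along these lines (a minimal sketch; see the package help for details, as the defaults matter):

    library(BurStFin)
    V_factor <- factor.model.stat(ret)  # statistical factor model estimate
    V_shrink <- var.shrink.eqcor(ret)   # Ledoit-Wolf shrinkage to equal correlation
    C <- cov2cor(V_factor)              # correlation matrix from the variance matrix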

The thrust of the question is correct, I think: how missing values are treated can have a significant effect on results. For example, in the factor model case you want to think about whether you want to shrink towards zero or shrink towards the factors.

Thank you Patrick! Though the market-regime bias caused by missing data is not addressed, your package looks very useful. I will definitely look into it, and thanks for sharing.
–
Branson Dec 6 '11 at 20:54

How about adapting Ledoit-Wolf shrinkage to average correlation? Calculate the ratio of average correlations in the two regimes to get a sense of the magnitude of the regime shift, then use this ratio to adjust the correlations of the short-lived stock with the others. The method is simple, and the result should make sense to you.
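Something like this (all names hypothetical; rho_young holds the young stock's pairwise correlations, and the averages are taken over the established stocks in each window):

    k <- avg_cor_target / avg_cor_source  # regime-shift multiplier
    rho_adj <- k * rho_young              # adjusted pairwise correlations
    ## Caveat: nothing constrains rho_adj to stay within [-1, 1] when k > 1.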

Actually, we already tried that before I posted my question, but there is a problem: when shifting from a low- to a high-correlation regime, some of the adjusted correlations can explode, i.e. adjusted correlation > 1. I like the simplicity of the average-correlation idea, but rho > 1 is the least desirable result.
–
Branson Dec 11 '11 at 23:36