Tools of the Trade: a joint test of orthogonality when testing for balance

This is a very simple (and for once short) post, but since I have been asked this question quite a few times by people who are new to doing experiments, I figured it would be worth posting. It is also useful for non-experimental comparisons of a treatment and a control group.
Most papers with an experiment have a Table 1 where they compare the characteristics of the treatment and control groups and test for balance. (See my paper with Miriam Bruhn for discussion of why this often isn’t a sensible thing to do.) Ok, but let’s assume you are in a situation where you want to do this. One approach people use is just to do a series of t-tests comparing the means of the treatment and control groups variable by variable. Or they might do this with regressions of the form:
X = a + b*Treat +e
And test whether b=0.
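
As a rough sketch of one row of such a table: with a binary Treat, the OLS estimate of b is just the treatment mean minus the control mean. The function name `balance_row` and the toy numbers are made up for illustration, the standard error is the unequal-variance version, and the p-value uses a normal approximation, so a regression package would report slightly different numbers in small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def balance_row(x_treat, x_control):
    """Difference in means (the b in X = a + b*Treat + e) with a z-test of b = 0."""
    b = mean(x_treat) - mean(x_control)
    # Unequal-variance standard error of the difference in means
    se = sqrt(variance(x_treat) / len(x_treat) + variance(x_control) / len(x_control))
    z = b / se
    # Two-sided p-value from the normal approximation (an assumption of this sketch)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return b, se, z, p

# Toy baseline values for one covariate (illustrative only)
treated = [23.0, 31.0, 27.0, 35.0, 29.0, 30.0]
control = [25.0, 28.0, 33.0, 26.0, 30.0, 27.0]
b, se, z, p = balance_row(treated, control)
print(f"b = {b:.2f}, se = {se:.2f}, z = {z:.2f}, p = {p:.3f}")
```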

They might do this for 20 variables, find 1 or 2 are significant at the 5% level, and then say “this is about what we expect by chance, so it seems randomization has succeeded in generating balance”. But what if we find 3 or 4 differences out of 20 to be significant? Or what if none are individually significant, but the differences are all in the same direction?
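
To put a rough number on "what we expect by chance": if the 20 tests were independent (an assumption of this sketch, and an optimistic one, since baseline covariates are usually correlated), the number of false rejections at the 5% level would be binomial with n = 20 and p = 0.05. The helper name `prob_at_least` is made up for illustration.

```python
from math import comb

def prob_at_least(k, n=20, alpha=0.05):
    """P(at least k of n independent tests reject at level alpha when all nulls are true)."""
    return sum(comb(n, j) * alpha**j * (1 - alpha) ** (n - j) for j in range(k, n + 1))

for k in (1, 2, 3, 4):
    print(f"P(at least {k} of 20 significant) = {prob_at_least(k):.3f}")
```

Under that independence assumption, seeing 3 or more significant differences happens a bit under 8% of the time and 4 or more about 1.6% of the time, which is why the variable-by-variable count alone is hard to interpret.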

An alternative, or complementary, approach is to test for joint orthogonality. To do this, take your set of X variables (X1, X2, …, X20) and run the following:
Treat = a + b1*X1 + b2*X2 + b3*X3 + ….+b20*X20 +u
And then test the joint hypothesis b1=b2=b3=…=b20=0
This can be run as a linear regression, with an F-test; or as a probit, with a chi-squared test.
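
A minimal sketch of the linear-regression version on simulated data, assuming NumPy is available. The function names (`joint_f_stat`, `permutation_pvalue`) and the data are made up for illustration. To keep the example free of an F-distribution routine, the p-value here comes from permuting the treatment labels rather than from the F distribution; a regression package's overall F-test is the usual way to get it.

```python
import numpy as np

def joint_f_stat(treat, X):
    """F statistic for H0: b1 = ... = bk = 0 in Treat = a + b1*X1 + ... + bk*Xk + u."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])           # intercept plus covariates
    coef, *_ = np.linalg.lstsq(design, treat, rcond=None)
    rss_full = np.sum((treat - design @ coef) ** 2)     # unrestricted model
    rss_null = np.sum((treat - treat.mean()) ** 2)      # intercept-only model
    return ((rss_null - rss_full) / k) / (rss_full / (n - k - 1))

def permutation_pvalue(treat, X, n_perm=999, seed=0):
    """Share of label permutations whose F statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = joint_f_stat(treat, X)
    exceed = sum(joint_f_stat(rng.permutation(treat), X) >= observed
                 for _ in range(n_perm))
    return observed, (1 + exceed) / (n_perm + 1)

# Simulated example: 200 units, 5 baseline covariates
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 5))

# Case 1: treatment assigned at random, unrelated to X
treat_random = rng.permutation(np.repeat([0.0, 1.0], n // 2))
# Case 2: "treatment" that depends strongly on the first covariate
treat_selected = (X[:, 0] + 0.3 * rng.normal(size=n) > 0).astype(float)

f_rand, p_rand = permutation_pvalue(treat_random, X)
f_sel, p_sel = permutation_pvalue(treat_selected, X)
print(f"random assignment:   F = {f_rand:.2f}, p = {p_rand:.3f}")
print(f"selected assignment: F = {f_sel:.2f}, p = {p_sel:.3f}")
```

The second case is there only to show what a failure looks like: when assignment depends on a covariate, the joint F statistic is large and essentially no permutation matches it.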

That’s it, very simple. I think people get confused because the treatment variable jumps from being on the right-hand side for the single variable tests to being on the left-hand side for the joint orthogonality test.
Now what if you have multiple treatment groups? You can then run a multinomial logit or your other preferred specification and test for joint orthogonality within this framework, but I’ve not seen this done very often – typically I see people just compare each treatment separately to the control.

Comments

Hansen and Bowers have a nice paper where they compare the performance of this test with a joint permutation test they propose for testing balance in clustered randomized trials. http://www.jstor.org/stable/27645895?seq=1#page_scan_tab_contents

As a bonus, the paper has a really clear explanation of the issues involved in testing for balance in clustered trials.

Thanks Doug! I should mention that even mild clustering ---assignment to some households of size 1 and some of size 2--- led the likelihood ratio based balance test to produce surprisingly misleading results. Worth checking the size-vs-level of the LR tests if the samples are small, covariates are binary and 1s are not close to 50%, or assignment is by cluster.

I have a slightly off-the-wall question about using a joint test of orthogonality.

Say I’m looking at dating website profiles. I note 20 adjectives that are much more likely to be used on females’ profiles than males’ profiles. I note another 20 adjectives that are much more likely to be used on males’ profiles than females’ profiles.

I then roll out a design change across the site that I hypothesise will reduce the use of ‘gendered language’. I want to test whether it has done so. Imagine that we pushed the redesign to only half of our users, so this is a proper randomised A/B test.

Should I use a joint test of orthogonality?

Here’s how I would envision it working:
- Using the same list of words that the exploratory analysis has already found, we would make each word an indicator variable which takes the value of 1 if the word is used in a profile and 0 if it is not. We would then run a joint test of orthogonality:
- We take our set of 40 words (X1, X2, …, X40) and run the following regression:
- Female = a + b1*X1 + b2*X2 + b3*X3 + ….+b40*X40 +u
- We then test the joint hypothesis b1=b2=b3=…=b40=0 as a linear regression, with an F-test.

But how should I use the indicator variable for whether the user has been ‘treated’ or not? Interacted with each word-indicator variable?

Hi Andrew,
I don't think you want a joint orthogonality test here - you aren't trying to test if none of the adjectives are related to gender. Instead you are testing if your treatment reduces the use of gendered adjectives. So there would be two approaches I would take to doing this:
1. Just define a count of the number of male adjectives used on male profiles (call this M20), and a count of the number of female adjectives used on female profiles (call this F20), and then run regressions like:
M20 = a+b*Treat + e
F20 = a+b*Treat + e
This will show whether your treatment succeeds in getting males to use the male adjectives less, and females to use the female adjectives less. (I would run these regressions separately by gender, but you could also pool together males and females and just create a variable that is the number of gendered adjectives of your gender you use).

2. If you are particularly interested in whether the treatment reduces the use of particular adjectives, then you can run the 40 regressions of the form:
X1 = a + b*Treat + e
or X1 = a+b*Treat + c*Female + d*Treat*Female + e
and so on up to X40
and then use a multiple testing correction to account for the fact you are doing 40 different tests.
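
The reply does not say which multiple testing correction to use. As one possible choice, here is a sketch of Holm's step-down adjustment applied to a list of p-values, one per regression. The function name `holm_adjust` and the four p-values are made up for illustration; in the setting above there would be 40 of them.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is multiplied by (m - rank), capped at 1
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Adjusted p-values must not decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Illustrative p-values from four word-level regressions
raw = [0.001, 0.04, 0.03, 0.20]
adj = holm_adjust(raw)
for p_raw, p_adj in zip(raw, adj):
    print(f"raw p = {p_raw:.3f}  Holm-adjusted p = {p_adj:.3f}")
```

A word's treatment effect would then be called significant at the 5% level only if its adjusted p-value is below 0.05.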

This is interesting. Thanks for sharing. It occurs to me that the approach you describe has some serious limitations. For one, a statistically significant overall F test from the type of model you describe would provide evidence of covariate imbalance, but would not directly tell you whether or not that covariate imbalance would generate bias in the association you ultimately measure between T and Y. If some Xs are associated with T but have no association with Y, they would generate no bias. Moreover, it is possible that the bias generated by one set of Xs could be offset by bias in the other direction generated by another set of Xs. This concern, though, has a good remedy: after running a simple model of T predicting Y, add subsets of Xs to that model based upon post-hoc analyses following a statistically significant overall F test.

Perhaps a bigger concern involves ways in which this approach could be abused, all of which I think relate to Type 2 error. If you were a naughty investigator who wanted to avoid finding statistically significant evidence of covariate imbalance, there are various strategies you might employ. One would be to have a small sample, and thus little statistical power to detect covariate imbalance. That strategy, though, would be self-defeating when it comes to what is presumably the main goal of the study: to measure the effect of T on Y. Another strategy that would not be self-defeating, however, would be to pack the regression model with a whole bunch of garbage Xs -- covariates that are poorly measured, or for which there is no good reason to think they have anything whatsoever to do with either T, or Y, or both. Given a set of covariates for which there truly is imbalance, the overall F test will detect that if those are the only covariates in the model; will have a good chance of detecting it if there are some but not a ton of other covariates in the model; and will have little chance of detecting it if the model is packed with a bunch of other covariates.

All of which is to say, I guess, that I don't think the overall F test or any other statistical approach can fully adjudicate these issues in the absence of honesty, integrity, and good judgment on the part of the investigator.