Commonly used Stata commands to deal with potential outliers

In archival accounting research, we often take it for granted that we must do something about potential outliers before running a regression. The commonly used methods are truncation, winsorization, studentized residuals, and Cook’s distance. In this post I discuss which Stata commands to use to implement these four methods.

First of all, why and how we deal with potential outliers is perhaps one of the messiest issues that accounting researchers will encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, the purpose of this post is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure whether these methods are appropriate—we all know that is true in accounting research!). Let’s start.

Truncate and winsorize

In my opinion, the best Stata commands for truncating and winsorizing are truncateJ and winsorizeJ, written by Judson Caskey. I will not spend time explaining why; I simply recommend his work highly. Please see his website here.

After the installation, you can type help truncateJ or help winsorizeJ to learn how to use these two commands.
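If installing user-written commands is not an option, winsorizing can also be done by hand with built-in commands. A minimal sketch for winsorizing a single variable y at the 1st and 99th percentiles (the variable name and cutoffs are just illustrative assumptions):

```stata
* Winsorize y at the 1st/99th percentiles using built-in commands only
_pctile y, percentiles(1 99)
replace y = r(r1) if y < r(r1)
replace y = r(r2) if y > r(r2) & !missing(y)
```

The second replace needs the !missing() guard because Stata treats missing values as larger than any number.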

Studentized residuals

The first step is to run a regression without specifying any vce() option in Stata (i.e., without robust or clustered standard errors). Suppose the dependent variable is y and the independent variables are x1 and x2. The first step looks like this:

regress y x1 x2

Then, use the predict command:

predict rstu if e(sample), rstudent

If the absolute value of rstu exceeds a chosen critical value, the data point is considered an outlier and deleted from the final sample. Stata’s manual indicates that “studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model.” To be honest, I do not fully understand this explanation, but since rstu is a t statistic, the critical values for conventional significance levels apply, for example, 1.96 (or 2) for the 5% significance level. That is why in the literature we often see that data points with absolute studentized residuals greater than 2 are deleted. Some papers use a critical value of 3, which corresponds to a 0.27% significance level and seems to me not very reasonable.
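Before dropping anything, it can be worth checking how many observations each cutoff would flag. A quick illustrative check (the !missing() condition keeps missing residuals out of the count, since missing values compare as larger than any number):

```stata
* How many observations each cutoff would flag
count if abs(rstu) > 2 & !missing(rstu)
count if abs(rstu) > 3 & !missing(rstu)
```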

Now use the following command to drop “outliers” based on the critical value of 2. Note the !missing() condition: in Stata, missing values compare as larger than any number, so without it observations with a missing rstu (those outside the estimation sample) would be dropped as well:

drop if abs(rstu) > 2 & !missing(rstu)

The last step is to re-run the regression, but this time we can add an appropriate vce() option to address additional issues such as heteroskedasticity:

regress y x1 x2, vce(robust)

or, with standard errors clustered by firm:

regress y x1 x2, vce(cluster gvkey)

Cook’s distance

This method is similar to the studentized residuals approach. We compute an influence measure, namely Cook’s distance, and then delete any data points whose Cook’s distance exceeds the conventional cutoff of 4/N, where N is the number of observations (Cook’s distance is always non-negative).

regress y x1 x2

predict cooksd if e(sample), cooksd
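Before applying a mechanical cutoff, it can be informative to eyeball the most influential observations. A small sketch, assuming the cooksd variable created above (gsort places missing values last by default):

```stata
* List the ten most influential observations by Cook's distance
gsort -cooksd
list y x1 x2 cooksd in 1/10
```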

drop if cooksd > 4/e(N) & !missing(cooksd)

Here e(N) is the number of observations in the regression sample, so 4/e(N) implements the 4/N cutoff; the !missing() condition again prevents observations with a missing cooksd from being dropped.

Next, re-run the regression with an appropriate vce() option:

regress y x1 x2, vce(robust)

or, with standard errors clustered by firm:

regress y x1 x2, vce(cluster gvkey)

Lastly, I thank the authors of the following articles, from which I benefited: