In accounting archival research, we often take it for granted that we must do something to deal with potential outliers before we run a regression. The commonly used methods are: truncate, winsorize, studentized residuals, and Cook’s distance. I discuss in this post which Stata command to use to implement these four methods.
First of all, why and how we deal with potential outliers is perhaps one of the messiest issues that accounting researchers will encounter, because no one ever gives a definitive and satisfactory answer. In my opinion, only outliers resulting from apparent data errors should be deleted from the sample. That said, this post is not going to answer that messy question; instead, the purpose of this post is to summarize the Stata commands for commonly used methods of dealing with outliers (even if we are not sure whether these methods are appropriate—we all know that is true in accounting research!). Let’s start.
Truncate and winsorize
In my opinion, the best Stata commands for truncating and winsorizing are truncateJ and winsorizeJ, written by Judson Caskey. I will not take the time to explain why here, but I highly recommend his work. Please see his website here.
To install these two user-written commands, you can type:
net from https://sites.google.com/site/judsoncaskey/data
net install utilities.pkg
After the installation, you can type help truncateJ or help winsorizeJ to learn how to use these two commands.
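If you cannot install the package, the same idea can be sketched with built-in commands. This is only a minimal illustration, not Caskey's implementation. The auto dataset that ships with Stata, the variable price, the new names price_w and price_t, and the 1st/99th percentile cutoffs are all assumptions chosen for the example.

```stata
* Illustrative data only: the auto dataset ships with Stata
sysuse auto, clear

* Store the 1st and 99th percentiles of price (cutoffs are an assumption)
summarize price, detail
local lo = r(p1)
local hi = r(p99)

* Winsorize: pull values beyond the cutoffs in to the cutoffs
generate price_w = price
replace price_w = `lo' if price < `lo'
replace price_w = `hi' if price > `hi' & !missing(price)

* Truncate: keep values inside the cutoffs, set the rest to missing
generate price_t = price if inrange(price, `lo', `hi')

* Compare the raw, winsorized, and truncated versions
summarize price price_w price_t
```

Setting truncated values to missing, rather than dropping the rows, keeps the observations available for other variables. Any regression that uses price_t will still exclude them.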
Studentized residuals
The first step is to run a regression without specifying any vce parameter in Stata (i.e., not using robust or clustered standard errors). Suppose the dependent variable is y, and the independent variables are x1 and x2. The first step should look like this:
regress y x1 x2
Then, use the predict command:
predict rstu if e(sample), rstudent
If the absolute value of rstu exceeds a chosen critical value, the data point is considered an outlier and deleted from the final sample. Stata’s manual indicates that “studentized residuals can be interpreted as the t statistic for testing the significance of a dummy variable equal to 1 in the observation in question and 0 elsewhere. Such a dummy variable would effectively absorb the observation and so remove its influence in determining the other coefficients in the model.” To be honest, I do not fully understand this explanation, but since rstu is a t statistic, the critical value for a conventional significance level should be applied, for example, 1.96 (or 2) for the 5% significance level. That’s why in the literature we often see that data points with absolute studentized residuals greater than 2 are deleted. Some papers use a critical value of 3, which corresponds to a 0.27% significance level and seems to me not very reasonable.
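The manual's dummy-variable interpretation can be checked directly. The sketch below uses the auto dataset that ships with Stata purely for illustration. The choice of observation 1 and the variable name d1 are assumptions. If the interpretation holds, the two displayed numbers should agree.

```stata
* Illustrative data only: the auto dataset ships with Stata
sysuse auto, clear

* Baseline regression and studentized residuals
regress price mpg weight
predict rstu if e(sample), rstudent

* Dummy equal to 1 for observation 1 and 0 elsewhere
generate byte d1 = (_n == 1)

* Re-run the regression with the dummy included
regress price mpg weight d1

* The t statistic on d1 should match the studentized residual of observation 1
display "t statistic on d1:             " _b[d1]/_se[d1]
display "studentized residual of obs 1: " rstu[1]
```

In other words, rstu asks whether an observation sits significantly off the regression line fitted to all the other observations.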
Now use the following command to drop “outliers” based on the critical value of 2:
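* Stata treats missing values as larger than any number, so exclude them explicitly
drop if abs(rstu) > 2 & !missing(rstu)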
The last step is to re-run the regression, but this time we can add appropriate vce parameters to address additional issues such as heteroskedasticity:
regress y x1 x2, vce(robust)
or
regress y x1 x2, vce(cl gvkey)
Cook’s distance
This method is similar to studentized residuals. We predict an influence measure, namely Cook’s distance, and then delete any data points with Cook’s distance greater than 4/N, where N is the number of observations used in the regression (Cook’s distance is never negative, so no absolute value is needed).
regress y x1 x2
predict cooksd if e(sample), cooksd
* e(N) is the number of observations used by the regression above
drop if cooksd > 4/e(N) & !missing(cooksd)
Next, re-run the regression with appropriate vce parameters:
regress y x1 x2, vce(robust)
or
regress y x1 x2, vce(cl gvkey)
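A more cautious variant of the Cook’s distance steps flags influential observations instead of dropping them, so the full sample stays available for other specifications. This is a sketch under stated assumptions: the auto dataset that ships with Stata stands in for real data, and the name flag_cook is hypothetical.

```stata
* Illustrative data only: the auto dataset ships with Stata
sysuse auto, clear

* Baseline regression without any vce option
regress price mpg weight
predict cooksd if e(sample), cooksd

* Flag observations above the 4/N rule of thumb; flag stays missing outside e(sample)
generate byte flag_cook = cooksd > 4/e(N) if !missing(cooksd)

* See how many observations would be removed before committing to it
tabulate flag_cook

* Re-run on the unflagged observations with robust standard errors
regress price mpg weight if flag_cook == 0, vce(robust)
```

Tabulating the flag first shows what share of the sample the 4/N rule removes, which is worth reporting alongside the results.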
Lastly, I thank the authors of the following articles, from which I benefited:
https://www3.nd.edu/~rwilliam/stats2/l24.pdf
https://www.stat-d.si/mz/mz16/coend16.pdf
A more formal and complete econometrics book is Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Hi, the winsorize way you mentioned is not working now
I checked and the command worked just fine. Did you have an installation issue or a command running issue?
Hi, I have encountered an issue regarding winsorizing. Suppose a self-defined variable ABC is constructed from two common variables, say at (total assets) and lt (total liabilities), and we report all 3 variables in the summary statistics. Do we winsorize twice? How do we do the winsorizing? That is, do we winsorize ABC after winsorizing at and lt, or do we first calculate ABC and then winsorize it?
thank you.
I prefer calculating ABC using raw at and lt and then winsorizing ABC. But different researchers may do it differently. Sometimes, when you winsorize or truncate will affect the final results. This is a black box.
Hi,
Thank you for your post. It is really helpful to me (and has been really helpful to many colleagues of mine).
I have one question, though. If I first run a simple model in which I only specify e.g. 3 variables, let’s say Price EquityperShare and NetIncomeperShare and afterwards alter the specification by introducing more variables, when do I use the studentized residuals to treat outliers? Do I run the model with most variables, treat outliers and then rerun the models with fewer variables? Or should I do it separately per specification?
Thank you in advance.