# Stata command to do Heckman two steps

Heckman’s two-step procedure appears often in the accounting literature. But how do we run it in Stata?

The two steps refer to the following two regressions:

Outcome equation: `y = X × b1 + u1`
Selection equation: `Dummy = Z × b2 + u2`

The selection equation must contain at least one variable that is not in the outcome equation.

The selection equation must be estimated using Probit. An intuitive way to do Heckman’s two steps is to estimate the selection equation first, then include the inverse Mills ratio (IMR) derived from the selection equation as an extra regressor in the outcome equation. In other words, run two regressions, one after the other.
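In symbols, the IMR and the augmented outcome equation can be written as below. This is a sketch under the standard Heckman assumptions, which the post does not spell out: `u1` and `u2` are jointly normal and the variance of `u2` is normalized to 1. The symbols ρ (correlation between `u1` and `u2`), σ (standard deviation of `u1`), φ (standard normal density) and Φ (standard normal distribution function) are introduced here for illustration.

$$
\lambda = \frac{\phi(Z b_2)}{\Phi(Z b_2)}, \qquad E[\,y \mid Dummy = 1\,] = X b_1 + \rho \sigma \lambda
$$

The coefficient on `imr` in the second regression therefore estimates ρσ.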

Stata command for the selection equation:

`probit Dummy Z` (using both the observations that are selected into the sample and those that are not, i.e., `Dummy` = 1 or `Dummy` = 0)

Note that the `vce` option (i.e., standard, robust or clustered standard errors, among others) does not change the resulting IMR.

Next, calculate IMR immediately:

`predict probitxb, xb`
`ge pdf = normalden(probitxb)`
`ge cdf = normal(probitxb)`
`ge imr = pdf/cdf`

Finally, include `imr` in the outcome equation:

`reg y X imr, vce(specified_vcetype)` (using observations that are selected into the sample only)
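To make the three steps concrete, here is a minimal sketch on Stata's bundled example dataset `womenwk`. The dataset and its variable names (`wage`, `educ`, `age`, `married`, `children`) are assumptions for illustration and are not part of the original post; the dummy `works` is a hypothetical name created here to play the role of `Dummy`.

```stata
* Step-by-step Heckman on an example dataset (webuse needs internet access)
webuse womenwk, clear

* Selection dummy: 1 if wage is observed, 0 otherwise
generate byte works = !missing(wage)

* Step 1: probit selection equation on ALL observations (works = 1 or 0)
probit works married children educ age

* Step 2: inverse Mills ratio from the probit linear index
predict double probitxb, xb
generate double imr = normalden(probitxb) / normal(probitxb)

* Step 3: outcome equation on the selected observations only
regress wage educ age imr if works == 1
```

Here `married` and `children` appear only in the selection equation, which satisfies the requirement that the selection equation contain at least one variable not in the outcome equation.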

Note the first and the second regression use different numbers of observations.

However, this is not the end of the story. I find that the first Probit regression sometimes leaves the IMR missing. For example, even if 100 observations have the required `Dummy` and regressor data, the step-by-step method may produce an IMR for only 60 of them. I have not figured out why.
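One way to narrow down where the missing values come from is to count, after the Probit, how many observations were actually used in estimation and at which stage the IMR goes missing. The sketch below is only a diagnostic, not an explanation of the issue. It reuses the `womenwk` example dataset and the hypothetical dummy `works`, both of which are assumptions for illustration; on that dataset the missing counts may well be zero.

```stata
* Diagnostic sketch: where does the IMR go missing?
webuse womenwk, clear
generate byte works = !missing(wage)

probit works married children educ age
predict double probitxb, xb
generate double imr = normalden(probitxb) / normal(probitxb)

* Observations the probit actually used in estimation
count if e(sample)

* Observations with no linear prediction at all
count if missing(probitxb)

* Observations where the index exists but pdf/cdf could not be computed
count if !missing(probitxb) & missing(imr)

* Range of the index among the problem observations
summarize probitxb if missing(imr)
```

If the first count is smaller than the number of observations with complete data, the Probit output itself is worth rereading, since Stata prints a note when it drops observations (for example, when a regressor predicts the outcome perfectly).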

I then noticed that Stata provides an all-in-one command, `heckman`, that estimates both the selection equation and the outcome equation:

`heckman y X, select(Dummy = Z) twostep first mills(imr) vce(specified_vcetype)`

I recommend using the `twostep` option of the `heckman` command. This option produces the same coefficient estimates as the step-by-step method, although the reported standard errors can differ because `heckman` accounts for the IMR being an estimated regressor. This option may also reduce the number of available `vce` types. In addition, the specified `vce` option applies only to the outcome equation and has no effect on the selection equation.

In this all-in-one method, we must pool the observations that are selected into the sample with those that are not, so that `Dummy` is 1 or 0 for every observation while `y` and `X` are missing for the observations that are not selected. A benefit of the all-in-one method is that the weird missing-IMR issue does not appear.
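As a rough illustration of the all-in-one method, the sketch below runs `heckman` with `twostep` on the pooled data and then rebuilds the IMR by hand so the two versions can be compared. The `womenwk` example dataset, the dummy `works`, and the variable names `imr_h` and `imr_diff` are assumptions chosen for this example, not something from the original post.

```stata
* All-in-one Heckman two-step, with a hand-built IMR for comparison
webuse womenwk, clear
generate byte works = !missing(wage)

* All-in-one: 'first' also prints the probit, mills() saves the IMR
heckman wage educ age, select(works = married children educ age) ///
    twostep first mills(imr_h)

* Step-by-step IMR from a separate probit
probit works married children educ age
predict double probitxb, xb
generate double imr = normalden(probitxb) / normal(probitxb)

* Difference between the two IMRs; values near zero mean they agree
generate double imr_diff = imr - imr_h
summarize imr_diff

* Any observation with an IMR from heckman but not from the manual steps
count if missing(imr) & !missing(imr_h)
```

The final `count` is the direct check for the missing-IMR issue: it reports how many observations received an IMR from `heckman` but not from the step-by-step calculation.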

I did take a closer look at the observations whose IMR is missing under the step-by-step method. They all have an extremely small IMR under the all-in-one method. I also find that the step-by-step method has greater flexibility. Thus, if we want to use the step-by-step method but encounter the weird missing-IMR issue, it seems safe to set the missing IMR to zero.

Any comment is welcome.


### 7 Responses to Stata command to do Heckman two steps

1. Julio Galárraga says:

Dear Kai Chen

Have you run a Heckman two-step on survey data? The heckman two-step command does not allow iweights or pweights, and svy: heckman is not allowed.

Tks,

2. Tina says:

Dear Kai Chen,

Do you have a suggestion for how to export heckit results in two separate tables?
(Using outreg2, the parameters for the final stage and the selection equation are reported side by side in columns.)
I would also welcome suggestions using export commands other than -outreg2-.

Kind Regards
Tina

3. Getalem Alemu says:

Is it possible to use the same explanatory variable in both stages of the Heckman two-stage model (probit and OLS)?

• Gijs says:

Yes, but keep it to a minimum. The set of explanatory variables in the selection equation (Z) needs to be a superset of the set of explanatory variables in the outcome equation (X). The more the two sets are alike, the more the inverse Mills ratio will be correlated with X, and thus the larger the standard errors will be.

• fmcgowan says:

Hi Gijs,
Why does Z need to be a superset of X? What is violated if there is a variable in X that is not in Z – does this mess up the inverse Mills ratio? It seems that some important predictors of the outcome might only be measurable for people who select into treatment, so it would not be possible to include these variables in the selection Probit. Or in this scenario, would a different model be appropriate? Thanks.

4. amare wodaju says:

Hello dears
5. Wakuma Dufera Tesgera says: