  • Heckman sample selection and Instrumental Variable (IV) or Simultaneous Equations Model

    Dear Statalist,

    I would like to estimate the following equation:

    Code:
    Y = A + B*X1 + C*X2 + E
    I am concerned about two endogeneity problems:
    1. X1 may itself be affected by Y (reverse causality). On its own, I would solve this problem by instrumenting X1 with an instrument Z1 that is exogenous to Y:
      Code:
      ivreg2 Y X2 (X1 = Z1)
    2. Whether Y is observed may also depend on X1, i.e. I have a possible selection problem. On its own, I would solve this problem with a Heckman selection model, first estimating a probit of I(Y!=.) on X1, X2, and Z0, where Z0 should not influence the value of Y:
      Code:
      heckman Y X1 X2, select(Z0 X1 X2)
    However, I am unsure how to implement both corrections at the same time.

    Put differently, I have two outcomes Y1 and Y2, where Y1 may depend on Y2 (among other things), so an alternative to the IV approach above could be a Simultaneous Equations Model (SEM) of the form:

    Code:
    (1) Y1 = A1 + B1*Xb + C1*Xc + D1*Y2 + EPS1
    (2) Y2 = A2 + B2*Xb + C2*Xc + D2*Z2 + EPS2
    Equation (2) is basically the first stage and Equation (1) the second stage of the IV estimation suggested above. The problem is that whether Y1 and Y2 are observed also depends on many of the same regressors (Xb, Xc, Z2), which I think calls for a Heckman selection model. But how do I combine the SEM and the Heckman model? If I simply add the selection equation as a third equation of the SEM system, then Equations (1) and (2) are estimated on a smaller set of observations (those where Y1 and Y2 are observed) than Equation (3).

    Best regards,
    Ruediger

  • #2
    I cover this in Section 19.6.2 in the second edition of "Econometric Analysis of Cross Section and Panel Data," MIT Press, 2010. You are on the right track, but here are the specifics.

    1. Estimate a probit model for the selection indicator, I. Include all exogenous variables: Those in the equation for Y1, the instrument(s) for Y2, and the variable determining selection. To be convincing, you should argue that you have two sources of exogenous variation excluded from the equation for Y1. Call these Z2 and Z0.
    Code:
    probit I Xb Xc Z2 Z0
    2. Obtain the inverse Mills ratios from step 1 -- say IMR.
    (code omitted)
    3. Estimate the structural equation by 2SLS:

    Code:
    ivregress 2sls Y1 Xb Xc IMR (Y2 = Z2)
    Note that the IMR, depending only on exogenous variables, acts as its own IV.

    The standard errors are incorrect if the coefficient on IMR is not zero (in the population). Bootstrapping the entire procedure is not very difficult.

    I hope this helps. JW
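
    A minimal sketch of steps 1 to 3 in Stata, using the variable names from this thread. The IMR formula (standard normal density divided by standard normal CDF, both evaluated at the probit index) is the usual textbook one and is assumed here to be what the omitted code computes. The names xbhat and heckiv and the reps() and seed() values are hypothetical.

    Code:
    * Step 1: probit for the selection indicator on all exogenous variables
    probit I Xb Xc Z2 Z0
    * Step 2: inverse Mills ratio at the estimated probit index
    predict double xbhat, xb
    generate double IMR = normalden(xbhat) / normal(xbhat)
    * Step 3: 2SLS on the selected sample, with IMR as an extra exogenous regressor
    ivregress 2sls Y1 Xb Xc IMR (Y2 = Z2) if I == 1

    Bootstrapping the entire procedure could then look roughly like this, wrapping all three steps in one program so that the probit is re-estimated in every replication:

    Code:
    capture program drop heckiv
    program define heckiv, eclass
        * rebuild the generated variables on each bootstrap sample
        capture drop xbhat IMR
        probit I Xb Xc Z2 Z0
        predict double xbhat, xb
        generate double IMR = normalden(xbhat) / normal(xbhat)
        ivregress 2sls Y1 Xb Xc IMR (Y2 = Z2) if I == 1
    end
    bootstrap _b, reps(500) seed(12345): heckiv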


      • #4
        Dear Mr. Wooldridge, dear Statalist,

        Thanks for the immediate and helpful response. I have a follow-up question related to the problem described above.

        My binary variable determining selection, Z0, perfectly predicts selection when it equals one: if Z0 = 1, the selection indicator I is always zero, and only when Z0 = 0 is there variation in I. As a result, I cannot estimate the probit model.

        Can I use an alternative model in the first step (i.e. a linear probability model or a logit model) to calculate the inverse Mills ratio, of course using the corresponding CDF and PDF to calculate the ratio? Under which assumptions can I do this?

        Best regards,
        Ruediger


          • #6
            Dear JW,

            I would like to estimate the following equation on my panel data set (50 countries over 1998-2012, with country and time fixed effects):

            Y = a + b*M + c*X + E

            I am concerned about two endogeneity problems:
            1. There might be reverse causality between Y and M.
            2. Whether Y is observed may depend on some variables in X (X1, X2, etc.), i.e. a possible selection problem.
            To address these two problems, I need to:
            1. instrument M with Z, which should be exogenous to Y;
            2. use a Heckman sample selection model, because Y is observed for only part of the sample.
            How should I combine the IV and Heckman models, while also handling the country and time fixed effects in the data set?

            Please offer detailed Stata code.

            Thanks so much!

            Would the code look like one of the following?
            heckman Y M X i.year, select(X1 X2 i.year)
            or
            probit Y X1 X2
            xtivreg2 2sls Y X IMR (M=Z) i.year, fe


            • #7
              Dear Statalist and Wooldridge,

              I am running the following model:

              BMI = a1 + a2 X2 + a3 X3 + a4 X4 + ... + e    (outcome equation) (1)

              where X2 is a categorical variable: low, medium, or high physical activity job. I want to see whether men and women employed in jobs that involve low physical activity are more likely to have a higher BMI.

              Since X2 is observed only for individuals who are working, I think it is important to control for sample selection: the decision to work is not random, so individuals who are working may be systematically different from those who are not.

              So I have a selection equation:

              y1 = c1 + c2 X3 + c3 X5 + ... + v, where y1 = 1 if the individual is employed and y1 = 0 otherwise.    (2)

              X3 includes variables that affect both BMI and labour force participation. X4 includes variables that affect BMI but not labour force participation, while X5 includes variables that determine labour force participation but not BMI, i.e. for identification we need at least one variable that affects selection but not the outcome (the exclusion restriction).

              I have two questions:

              1. Can X4 include variables that affect BMI but do not affect the employment decision? I can think of many such variables, e.g. expenditure on eating out, expenditure on processed food, or smoking, which affect BMI but which I would not think of including in the selection equation. If X4 includes variables that affect BMI but not labour force participation, should X4 still be included in equation (2) when running the probit selection model? I am asking because, according to Wooldridge (Econometric Analysis of Cross Section and Panel Data, 2nd ed., Cambridge: MIT Press, 2010, Chapter 19, pp. 803-806), the variables in the outcome equation should be a strict subset of those in the selection equation.

              2. I believe that X2 is endogenous, i.e. X2 may depend on BMI (reverse causality). I need an instrument, say Z1, which is correlated with X2, does not affect BMI directly, and is uncorrelated with e:

              X2 = b1 + b2 Z1 + b3 X3 + b4 X4 + ... + u    (3)

              Since X2 is endogenous, I need at least one instrument for X2 (i.e. Z1) and also one instrument for the selection equation (i.e. at least one variable that determines labour force participation but not BMI). If the variables in the outcome equation should be a strict subset of the variables in the selection equation and X2 is endogenous, then, as JW suggested, I will first estimate the probit model for the selection indicator, which includes all exogenous variables, i.e. those in the equation for BMI (X3, X4), the instrument for X2 (Z1), and the variables determining selection (X5):

              probit y1 X3 X4 Z1 X5    (4)

              We can obtain the IMR from (4) and then run

              ivregress 2sls bmi X3 X4 IMR (X2 = Z1)

              If the coefficient on IMR is significant, it indicates that there is sample selection.

              Is this the correct way to use IV while correcting for sample selection?

              Please help me with this query. Thanks in advance.

                  • #10
                    Instead of 2SLS (the ivreg command), can we use a control function approach for the second stage, i.e. still including the IMR as described above?

                    Thank you for your response.
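
                    For concreteness, a rough sketch of what such a control function version might look like, using the variable names from #1 and #2. It assumes the IMR has already been generated from the selection probit; v2hat is a hypothetical name. In a model that is linear in Y2, including the first-stage residual as a regressor gives the same point estimate for the coefficient on Y2 as 2SLS with the same regressors and instruments, so this is mainly an alternative way of computing it. The second-step standard errors ignore the generated regressors and would still need to be bootstrapped.

                    Code:
                    * first stage on the selected sample, IMR included as exogenous
                    regress Y2 Xb Xc Z2 IMR if I == 1
                    predict double v2hat if e(sample), residuals
                    * second step: OLS with the first-stage residual as control function
                    regress Y1 Y2 Xb Xc IMR v2hat if I == 1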


                    • #11
                      Dear Professor Wooldridge,
                      I face the same situation as Ruediger Vollmeier in my own research and adopted the method you suggested (i.e., the two-stage Heckman selection procedure). However, the IMR in the second equation is not significant in my case. From a theoretical perspective, the exogenous variable I used in the first equation strongly influences the choice in the probit model but is unrelated to the outcome in the second equation. Still, the IMR is not significant. Does this indicate that the two-stage Heckman selection model does not work in my case? Would it be convincing for reviewers if I say that I used a Heckman selection model to address selection bias (I have already used propensity score matching to reduce bias caused by observed factors)?
                      Thank you very much for your help in advance!

                      Best regards,
                      Livia
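
                      The significance check described here can be sketched as a test on the IMR coefficient after the 2SLS step, using the variable names from #2 and assuming the IMR has already been generated from the selection probit. The vce(robust) option is an added assumption, not part of the original suggestion. As noted in #2, the usual standard errors are only a problem when the IMR coefficient is nonzero, so they can be used for this particular test of the null of no selection. An insignificant result is not proof that selection is absent.

                      Code:
                      ivregress 2sls Y1 Xb Xc IMR (Y2 = Z2) if I == 1, vce(robust)
                      * H0: coefficient on IMR is zero (no selection bias)
                      test IMR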
