  • Number of selected controls in Lasso vs Post-Lasso IV estimation using ivlasso

    Dear Statalist,

    I use the ivlasso command (part of the pdslasso package by Achim Ahrens, Christian Hansen, and Mark Schaffer) to estimate a regression with high-dimensional exogenous controls, a single continuous endogenous variable, and a single instrumental variable (a dummy).

    The command returns results using either Lasso or Post-Lasso OLS in the various model-selection and estimation steps (in addition to results that use Lasso for model selection only). In my understanding of the methodology, Lasso and Post-Lasso should select the same set of covariates, since Post-Lasso simply applies OLS to the covariates selected by the Lasso. In practice, however, the two estimators often select different covariates, which at times leads to quite different final estimates. From the regression output, the divergence seems to occur when the optimal instruments are constructed. I looked at the underlying literature but could not work out why this difference can arise.

    Could anyone shed some light on this for me or point me toward a relevant paper that addresses this issue? Moreover, how would you interpret or deal with Lasso and Post-Lasso resulting in quite different estimates?

    Many thanks for any help,

    Kevin

  • #2
    I think I realised the answer to my first question myself just after posting. The Lasso- and Post-Lasso-selected variables can differ in step (iii) of Algorithm 1 in Chernozhukov et al. (2015). That step regresses the predicted endogenous variable (predicted using either Lasso or Post-Lasso) on the set of exogenous control variables, again using Lasso or Post-Lasso. Because the predicted endogenous variable differs depending on which method was used in step (ii), step (iii) can yield different sets of selected covariates.
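
    To make the mechanism concrete, here is a toy sketch in Python/scikit-learn (not the ivlasso implementation; the variable names, data-generating process, and penalty level are all illustrative): shrinkage makes the lasso and post-lasso fits of the endogenous variable differ, so a second selection step run on those fitted values can pick different controls.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))          # high-dimensional exogenous controls
z = rng.standard_normal(n)               # single instrument
d = z + X[:, :3] @ np.array([1.0, 0.5, 0.25]) + rng.standard_normal(n)

# Step (ii): lasso of d on (z, X); post-lasso refits OLS on the selected columns
W = np.column_stack([z, X])
lasso = Lasso(alpha=0.1).fit(W, d)
sel = np.flatnonzero(lasso.coef_)
post = LinearRegression().fit(W[:, sel], d)

d_hat_lasso = lasso.predict(W)           # shrunk fit
d_hat_post = post.predict(W[:, sel])     # unshrunk fit on the same columns
print(np.allclose(d_hat_lasso, d_hat_post))   # False: the two fits differ

# Step (iii): lasso of each fitted value on the controls X; because the
# targets differ, the selected control sets need not coincide
sel_lasso = set(np.flatnonzero(Lasso(alpha=0.1).fit(X, d_hat_lasso).coef_))
sel_post = set(np.flatnonzero(Lasso(alpha=0.1).fit(X, d_hat_post).coef_))
print(sel_lasso == sel_post)
```

    Whether the two step-(iii) sets actually differ depends on the penalty and the data, but nothing forces them to coincide.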

    Nevertheless, I would still like to hear opinions on how to interpret differences in the selected covariates, and the corresponding differences in the final estimates, across the two methods. Asymptotically the choice of method should not matter, should it? Does the existence of such differences then point to some finite-sample issue?

    Thanks!



    References

    Chernozhukov, V., Hansen, C., & Spindler, M. (2015). Post-Selection and Post-Regularization Inference in Linear Models with Many Controls and Instruments. American Economic Review: Papers & Proceedings, 105(5), 486–490. https://doi.org/10.1257/aer.p20151022


    • #3
      Hi Kevin, you were quicker than I could respond. Your answer to your own question is correct.

      In your example, you have only one instrumental variable. The CHS algorithm is intended for settings with many instruments and many controls; you are effectively using the lasso to select optimal instruments out of ... well, one instrument.

      I suggest you use the partial() option, in which case your single IV is treated as low-dimensional. Just add the option partial(<name of IV>).
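
      Mechanically, partialling out a low-dimensional variable amounts to Frisch-Waugh-Lovell residualization: each remaining variable is replaced by its OLS residual from a regression on the partialled-out variable, which is therefore never penalized or dropped. A rough sketch of that one step (illustrative Python, an analogy rather than the ivlasso internals):

```python
import numpy as np

def residualize(v, z):
    """OLS residual of v on z plus a constant."""
    Z = np.column_stack([np.ones(len(z)), z])
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ beta

rng = np.random.default_rng(1)
n = 200
z = (rng.random(n) < 0.5).astype(float)   # dummy instrument
d = 0.8 * z + rng.standard_normal(n)      # stylized endogenous variable

d_tilde = residualize(d, z)
# by construction the residual is orthogonal to the partialled-out variable,
# so z cannot be "deselected" in any subsequent lasso step
print(abs(d_tilde @ z) < 1e-6)            # True
```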

      Let me know if this works for you.
      --
      Tag me or email me for ddml/pdslasso/lassopack/pystacked related questions. I don't check Statalist.


      • #4
        Hi Kevin. Yes, you are right about the reason: the high-dimensional controls are selected for the fitted values of the endogenous regressor(s), and the lasso and post-lasso fitted values will differ. I think you are also right that asymptotically the choice shouldn't matter. There may be Monte Carlo evidence out there on the relative performance of the two methods, but I don't know of a paper that does this. "Quite different" is a bit worrying, though. I wouldn't be surprised if the results were also somewhat fragile with respect to other settings (e.g., telling rlasso to use the x-dependent lambda penalty instead of the default lambda).


        • #5
          Hi Achim, hi Mark,

          thank you for your quick replies.

          Achim: In my case it does not seem to matter much whether I partial out the IV, as the instrument is also selected when I do not use the partial() option. I suppose that is a good sign, and not surprising given that the instrument appears to be quite strong.

          Mark: Playing around with it, I noticed that these seemingly inconsistent results occur mostly when I include certain sets of variables. I have a panel set-up in which I originally included a full set of fixed effects as well as a large set of variables with only cross-sectional variation. In regular OLS or IV estimation these would be collinear with the fixed effects, but I figured I could let the Lasso choose which fixed effects and cross-sectional variables to include.

          For some reason, however, including the cross-sectional variables produces unstable results, in the sense that the Lasso can yield a significantly positive coefficient while the Post-Lasso yields a significantly negative one. I do not really understand why, but it feels like this could be an instance of garbage in, garbage out. For now I will stick with the full set of fixed effects plus the time-varying variables. If you have any ideas on why these inconsistencies could occur, I would of course be glad to hear them.
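
          For what it's worth, a toy sketch of one possible mechanism (an assumed setup, not a diagnosis of the actual data): with unit dummies plus a purely cross-sectional covariate, the design matrix is exactly rank-deficient, so the lasso solution across the collinear columns is not unique, and which columns survive can shift with the penalty:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n_units, T = 20, 10
unit = np.repeat(np.arange(n_units), T)
D = (unit[:, None] == np.arange(n_units)).astype(float)  # unit fixed-effect dummies
c = rng.standard_normal(n_units)[unit]                   # time-invariant covariate
X = np.column_stack([D, c])                              # c is an exact combination of D
y = D @ rng.standard_normal(n_units) + rng.standard_normal(n_units * T)

print(np.linalg.matrix_rank(X))   # 20, not 21: rank-deficient design

# which columns the lasso keeps shifts with the penalty level
for a in (0.02, 0.1, 0.5):
    sel = np.flatnonzero(Lasso(alpha=a).fit(X, y).coef_)
    print(a, len(sel), int(n_units in sel))  # last entry: is the collinear column kept?
```

          In that situation the post-lasso OLS refit inherits whatever arbitrary selection the lasso made among the collinear columns, which could plausibly account for sign flips.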

          Thank you for your comments; they certainly helped me think through the problem and figure out what is going on.
