
  • Difference between varsofinterest and alwaysvars in Stata's dsregress command

    Hi there,

    I need a method for inference with variable selection.
    To do so, I tried the dsregress command to apply lasso variable selection and regression.

    The general syntax of dsregress is
    Code:
    dsregress depvar varsofinterest, controls([(alwaysvars)] othervars)
    Assume I have a dependent variable Y and independent variables X1-X50, which are supposed to always be included in the model. Finally, I have Z1-Z100, which are optional variables to be selected or excluded by lasso.

    I was wondering where the conceptual difference lies between varsofinterest and alwaysvars? From my understanding, both sets of variables are treated identically from a computational perspective. Only the produced output is different.

    However, if I run
    Code:
     dsregress Y X1-X50, controls(Z1-Z100) sel(cv)
    I obtain a different set of selected variables than for
    Code:
     dsregress Y X1, controls((X2-X50) Z1-Z100) sel(cv)
    Both lines of code are of course based on the same seed.

    As I interpret the dsregress command, both approaches should always include X1-X50 and select among Z1-Z100.
    However, there seems to be a difference between both lines.

    Can anybody clarify this? Thank you!
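
    A minimal sketch for making this comparison reproducible and for seeing how many lassos each call fits. It assumes a dataset containing Y, X1-X50 and Z1-Z100 is already in memory and that the code is run from a do-file; the seed value 12345 is arbitrary. As far as the documented double-selection procedure goes, dsregress fits one lasso for the dependent variable and one for each variable of interest, so the two calls should not be expected to fit the same set of lassos.

    ```stata
    * Assumption: Y, X1-X50 and Z1-Z100 exist in the data in memory.
    * rseed() fixes the cross-validation folds so each run is reproducible.
    dsregress Y X1-X50, controls(Z1-Z100) selection(cv) rseed(12345)
    lassoinfo    // lists every lasso fit: one for Y plus one per variable of interest

    dsregress Y X1, controls((X2-X50) Z1-Z100) selection(cv) rseed(12345)
    lassoinfo    // here only Y and X1 get a lasso; X2-X50 enter as alwaysvars
    ```

    Comparing the two lassoinfo tables shows which lassos exist in each specification and how many controls each one selected.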


    Last edited by Rolf Miller; 23 May 2020, 08:05.

  • #2
    varsofinterest are the variables for which you want inference after lasso selection (coefficients, standard errors, CIs, etc.). alwaysvars, by contrast, are always included as controls, but you're not interested in their coefficients or standard errors.

    One issue I see here is a possible violation of the sparsity assumption: the number of varsofinterest must be small and fixed, while the number of controls can be large and grow with the sample size.

    Hope this helps.



    • #3
      Thanks for your reply.

      Originally posted by Justin Blasongame View Post
      varsofinterest are the variables for which you want inference after lasso selection (coefficients, standard errors, CIs, etc.). alwaysvars, by contrast, are always included as controls, but you're not interested in their coefficients or standard errors.
      If I understand you correctly, this is also how I interpreted varsofinterest and alwaysvars. So, overall, the difference lies only in whether Stata reports a coefficient for a given variable or not.

      I experimented a bit with webuse breathe which is also used in the examples for help dsregress.

      I ran

      Code:
       dsregress react no2_home no2_class, controls( i.(meducation overweight msmoke sex) noise sev* age)
      where no2_class and no2_home are varsofinterest and no alwaysvars are specified.

      The output reads as

      Code:
      
      Double-selection linear model         Number of obs               =      1,056
                                            Number of controls          =         14
                                            Number of selected controls =          4
                                            Wald chi2(2)                =      24.22
                                            Prob > chi2                 =     0.0000
      
      ------------------------------------------------------------------------------
                   |               Robust
             react |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          no2_home |  -.4670372   .2457718    -1.90   0.057    -.9487411    .0146666
         no2_class |   2.207908   .4515834     4.89   0.000     1.322821    3.092995
      ------------------------------------------------------------------------------
      In contrast, if I estimate
      Code:
       dsregress react no2_home, controls((no2_class) i.(meducation overweight msmoke sex) noise sev* age)
      where I simply turned no2_class from varsofinterest to alwaysvars, I obtain

      Code:
      Double-selection linear model         Number of obs               =      1,056
                                            Number of controls          =         15
                                            Number of selected controls =          3
                                            Wald chi2(1)                =       4.12
                                            Prob > chi2                 =     0.0423
      
      ------------------------------------------------------------------------------
                   |               Robust
             react |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
          no2_home |  -.4970119   .2447729    -2.03   0.042    -.9767579   -.0172658
      ------------------------------------------------------------------------------
      Thus, the two coefficient estimates for no2_home differ slightly, and lasso selects a different number of controls (4 of 14 in the first model, 3 of 15 in the second).

      How can this be?
      There seems to be a fundamental difference between varsofinterest and alwaysvars.
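
      One way to look into this is to list which controls each lasso selected in the two specifications. The sketch below is only a diagnostic, not an official explanation: it assumes it is run from a do-file, and the stored-result names both_interest and class_always are made up for illustration. As I understand the documented procedure, each variable of interest gets its own lasso on the controls, while alwaysvars are kept unpenalized inside every lasso, so moving no2_class changes both the number of lassos and what each one conditions on.

      ```stata
      webuse breathe, clear

      * Specification 1: both NO2 measures are variables of interest
      dsregress react no2_home no2_class, ///
          controls(i.(meducation overweight msmoke sex) noise sev* age)
      estimates store both_interest    // hypothetical name
      lassoinfo                        // one row per lasso that was fit

      * Specification 2: no2_class moved into alwaysvars
      dsregress react no2_home, ///
          controls((no2_class) i.(meducation overweight msmoke sex) noise sev* age)
      estimates store class_always     // hypothetical name
      lassoinfo

      * Side-by-side view of the controls selected in each lasso
      lassocoef (both_interest, for(react)) (class_always, for(react))
      lassocoef (both_interest, for(no2_home)) (class_always, for(no2_home))
      ```

      If the lassocoef tables differ, the final regression uses a different set of selected controls in each specification, which would account for the slightly different no2_home estimates.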
