Difference between varsofinterest and alwaysvars in Stata's dsregress command

Rolf Miller

Join Date: Feb 2019

Posts: 11
#1

23 May 2020, 07:40

Hi there,

I need a method for inference with variable selection.
To that end, I tried the dsregress command, which combines lasso variable selection with regression.

The general syntax of dsregress is

Code:

dsregress depvar varsofinterest, controls([(alwaysvars)] othervars)

Assume I have a dependent variable Y and independent variables X1-X50, which should always be included in the model. In addition, I have Z1-Z100, which are optional variables to be selected or excluded by lasso.

What is the conceptual difference between varsofinterest and alwaysvars? As I understand it, both sets of variables are treated identically from a computational perspective; only the reported output differs.

However, if I run

Code:

dsregress Y X1-X50, controls(Z1-Z100) sel(cv)

I obtain a different set of selected variables than for

Code:

dsregress Y X1, controls((X2-X50) Z1-Z100) sel(cv)

Both lines of code are of course based on the same seed.
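
A minimal sketch of how the seed could be fixed inside both calls so that the cross-validation folds are reproducible. It assumes the rseed() option of dsregress; the seed value 12345 is arbitrary and only for illustration, and the variable names are the placeholders used above.

Code:

* Assumption: rseed() fixes the random-number seed used for the CV folds
* Assumption: 12345 is an arbitrary illustrative seed, not the one actually used
dsregress Y X1-X50, controls(Z1-Z100) selection(cv) rseed(12345)
dsregress Y X1, controls((X2-X50) Z1-Z100) selection(cv) rseed(12345)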

As I interpret the dsregress command, both approaches should always include X1-X50 and select among Z1-Z100.
However, the two specifications evidently behave differently.

Can anybody clarify this? Thank you!

Last edited by Rolf Miller; 23 May 2020, 08:05.
Tags: dsregress, lasso
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#2

24 May 2020, 10:14

varsofinterest are those variables to be included -- you're interested in using lasso for inference (coefficients, standard errors, CIs, etc.). Whereas alwaysvars are always included as controls but you're not interested in the coefficients and their standard errors.

One issue I see here is a violation of the sparsity assumption. The number of varsofinterest must be small and fixed, while the controls can be large and grow with sample size.

Hope this helps.

Rolf Miller

Join Date: Feb 2019
Posts: 11

27 May 2020, 15:53

Thanks for your reply.

Originally posted by Justin Niakamal

varsofinterest are those variables to be included -- you're interested in using lasso for inference (coefficients, standard errors, CIs, etc.). Whereas alwaysvars are always included as controls but you're not interested in the coefficients and their standard errors.

If I understand you correctly, this is also how I interpreted varsofinterest and alwaysvars. So, overall, the difference lies only in whether Stata reports a coefficient for a given variable or not.

I experimented a bit with webuse breathe which is also used in the examples for help dsregress.

I ran

Code:

dsregress react no2_home no2_class, controls(i.(meducation overweight msmoke sex) noise sev* age)

where no2_class and no2_home are varsofinterest and no alwaysvars are specified.

The output reads as

Code:


Double-selection linear model         Number of obs               =      1,056
                                      Number of controls          =         14
                                      Number of selected controls =          4
                                      Wald chi2(2)                =      24.22
                                      Prob > chi2                 =     0.0000

------------------------------------------------------------------------------
             |               Robust
       react |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    no2_home |  -.4670372   .2457718    -1.90   0.057    -.9487411    .0146666
   no2_class |   2.207908   .4515834     4.89   0.000     1.322821    3.092995
------------------------------------------------------------------------------

In contrast, if I estimate

Code:

dsregress react no2_home, controls((no2_class) i.(meducation overweight msmoke sex) noise sev* age)

where I simply turned no2_class from varsofinterest to alwaysvars, I obtain

Code:

Double-selection linear model         Number of obs               =      1,056
                                      Number of controls          =         15
                                      Number of selected controls =          3
                                      Wald chi2(1)                =       4.12
                                      Prob > chi2                 =     0.0423

------------------------------------------------------------------------------
             |               Robust
       react |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    no2_home |  -.4970119   .2447729    -2.03   0.042    -.9767579   -.0172658
------------------------------------------------------------------------------

Thus, the two coefficient estimates for no2_home differ slightly, and lasso selects a different number of controls (4 of 14 in the first model versus 3 of 15 in the second).

How can this be?
There seems to be a fundamental difference between varsofinterest and alwaysvars.
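
One way to see where the two specifications diverge is to inspect the individual lassos behind each fit. As far as I understand the documentation, double selection fits one lasso for the dependent variable and one for each variable of interest, so moving no2_class from varsofinterest to alwaysvars changes how many lassos are run and which variables each of them penalizes. The sketch below assumes the postestimation commands lassoinfo and lassocoef with the for() syntax used for ds models; the stored names both_interest and one_interest are hypothetical labels chosen for illustration.

Code:

webuse breathe, clear

* Specification 1: no2_home and no2_class are both variables of interest
dsregress react no2_home no2_class, controls(i.(meducation overweight msmoke sex) noise sev* age)
estimates store both_interest   // hypothetical name
lassoinfo                       // lists one lasso per dependent variable / variable of interest

* Specification 2: no2_class is moved into alwaysvars
dsregress react no2_home, controls((no2_class) i.(meducation overweight msmoke sex) noise sev* age)
estimates store one_interest    // hypothetical name
lassoinfo

* Compare the controls selected in the lasso for react and for no2_home
lassocoef (both_interest, for(react)) (one_interest, for(react))
lassocoef (both_interest, for(no2_home)) (one_interest, for(no2_home))

If the lassoinfo tables list a different number of lassos for the two fits, that alone would explain why the union of selected controls, and hence the estimate for no2_home, is not the same.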
