  • [LASSO] Collinear covariates: Suggested addition to the documentation

    I would like to suggest an addition to the documentation for collinear covariates in LASSO models. The Summary section currently reads as follows:

    Summary
    Consider factor variable group that takes on the values 1, 2, and 3. If you type
    . lasso linear y i.group ...
    lasso will know that separate covariates for group 1, 2, and 3 are to be included among the variables
    to be potentially included in the model.
    If you create your own indicator variables, you need to create and specify indicators for all the
    values of the factor variable:
    . generate g1 = (group==1)
    . generate g2 = (group==2)
    . generate g3 = (group==3)
    . lasso linear y g1 g2 g3 ...
    It is important that you do not omit one of them, say, g1, and instead type
    . lasso linear y g2 g3 ...


    While tinkering around, I discovered that one must not use ib#.group in place of i.group. Doing so causes the specified base level to be omitted from the set of candidate covariates, and therefore gives different results. I think a warning about this should be added to the documentation. For example, something like the following could be added to the Summary section.

    Note as well that you must not use the ib# prefix, because that will cause the selected base level to be omitted. For example, using ib1.group is equivalent to including g2 and g3 but not g1.
    I'm sure the folks who write the documentation can improve on the wording, but I hope this gets the idea across.

    For anyone who is interested, the code for my "tinkering" is pasted below.

    Cheers,
    Bruce


    Code:
    // File:  LASSO_collinear_covariates.do
    // Date:  25-Oct-2022
    // Name:  Bruce Weaver, [email protected]
    
    // Suggestion:  Caution users of LASSO that factor variables will not
    // be handled as described in the documentation if one uses ib#.variable.
    // Only the i.variable form of factor variable notation is handled properly.
    
    // The relevant documentation can be seen here:
    // https://www.stata.com/manuals/lassocollinearcovariates.pdf#lassoCollinearcovariates
    
    // Use auto.dta to create an example like the one described.
    clear *
    sysuse auto
    
    // Create 5 indicator variables for rep78
    forvalues i = 1(1)5 {
        generate byte rep`i' = rep78 == `i' if !missing(rep78)
    }
    summarize rep1-rep5
    
    // NOTE that you must reset the seed before estimating each model.
    
    * [1] Use factor variable notation for rep78
    set seed 1234
    quietly lasso linear mpg i.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
    * Show which variables have been retained
    lassocoef, display(coef)
    
    * [2] Use the 5 indicator variables for rep78
    set seed 1234
    quietly lasso linear mpg rep1 rep2 rep3 rep4 rep5 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
    * Show which variables have been retained
    lassocoef, display(coef)
    
    // Q. What happens if one uses ib#.rep78 rather than i.rep78?
    
    forvalues i = 1(1)5 {
        set seed 1234
        display "Base level for rep78 = "`i'
        quietly lasso linear mpg ib`i'.rep78 ///
            foreign headroom weight turn gear_ratio price trunk length displacement
        * Show which variables have been retained
        lassocoef, display(coef)
    }
    
    // A. Stata omits the base level when I do that.
    // Let's check a couple of them to verify.  
    
    * Factor variable notation with ib3.rep78
    set seed 1234
    quietly lasso linear mpg ib3.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
    * Show which variables have been retained
    lassocoef, display(coef)     
    * Indicator variables with rep3 omitted
    set seed 1234
    quietly lasso linear mpg rep1 rep2 rep4 rep5 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
    * Show which variables have been retained
    lassocoef, display(coef)     
    
    * Factor variable notation with ib5.rep78
    set seed 1234
    quietly lasso linear mpg ib5.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
    * Show which variables have been retained
    lassocoef, display(coef)     
    * Indicator variables with rep5 omitted
    set seed 1234
    quietly lasso linear mpg rep1 rep2 rep3 rep4 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
    * Show which variables have been retained
    lassocoef, display(coef)     
    
    // Confirmed.
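    
    // Sketch (not part of the original run): each pair above could also be
    // compared side by side by storing the two fits and passing both names
    // to lassocoef. The names fv_ib3 and ind_no3 are made up for illustration.
    set seed 1234
    quietly lasso linear mpg ib3.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
    estimates store fv_ib3
    set seed 1234
    quietly lasso linear mpg rep1 rep2 rep4 rep5 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
    estimates store ind_no3
    * One table with one column per stored fit
    lassocoef fv_ib3 ind_no3, display(coef)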
    --
    Bruce Weaver
    Email: [email protected]
    Version: Stata/MP 19.5 (Windows)

  • #2
    PS: I think it would also be helpful if using ib#.group in a lasso command prompted a warning.
