  • Keep selected levels of categorical variables after LASSO

    I'm running a linear LASSO on a large dataset with many categorical variables in the donor pool, each of which has many levels. I want to keep only the levels of the categorical variables that the LASSO procedure selects, for use in subsequent analysis. I have a very messy workaround for this, but I think there must be something cleaner and quicker.

    Minimal reproducible example:

    Code:
    * load data
    sysuse auto, clear
    
    * set up variable lists, specifying mpg as categorical for sake of example
    vl set
    vl move (price trunk weight length turn displacement) vlcontinuous
    vl move (mpg) vlcategorical
    vl substitute catvars = i.vlcategorical
    vl move (price) vlother
    
    * run the lasso
    lasso linear price $vlcontinuous $catvars, rseed(42)
    I now get that the selected variables are:
    Code:
    . di "`e(allvars_sel)'"
    headroom weight turn displacement 4bn.rep78 5bn.rep78 0bn.foreign 1bn.foreign 14bn.mpg 15bn.mpg 17bn.mpg 18bn.mpg 19bn.mpg 20bn.mpg 21bn.mpg 22bn.mpg 23bn.mpg 24bn.mpg 25bn.mpg 35bn.mpg
    I understand this to mean that a dummy variable for whether mpg==14 is selected, but not for mpg==12.

    How do I generate dummies for each level of mpg and keep those which are selected in the LASSO? Currently I have something like this, but it's a mess and there must be an easier way:
    Code:
    local selvars `e(allvars_sel)'
    
    foreach word in `selvars' {
        * factor-variable terms look like 14bn.mpg; locate the "bn." separator
        local seppos = strpos("`word'","bn.")
        
        local catvar
        local catval
        if `seppos' != 0 {
            * the variable name follows the separator
            local startpos = `seppos' + 3
            local catvar = substr("`word'",`startpos',.)
            
            * the level value precedes the separator
            local endpos = `seppos' - 1
            local catval = substr("`word'",1,`endpos')
            
            di "`catvar' at `catval'"
            
            * build the 0/1 indicator for this level
            cap drop keepme_`catvar'_`catval'
            g keepme_`catvar'_`catval' = `catvar' == `catval' if !mi(`catvar')
        }
    }
    
    keep keepme_*


  • #2
    Why do you need to create dummies? If using factor variable notation, 2.mpg is legitimately a variable as any other.
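
    As an illustration, here is a minimal sketch, assuming the lasso from #1 has just been run so that e(allvars_sel) is still available. The selected terms are passed straight to another command with no new variables created. Collinear terms (for example both levels of foreign alongside the constant) may be dropped with a note.

    ```stata
    * assumes the lasso from #1 was just run in this session
    local selvars `e(allvars_sel)'

    * terms such as 14bn.mpg are accepted wherever factor variables are allowed
    regress price `selvars'

    * a single level can also be referenced on its own
    summarize 14.mpg
    ```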


    • #3
      Huh I didn't know that -- cool!
      But unfortunately I do need the dummies as I'm exporting the dataset for analysis in a different software program.


      • #4
        You can use the -xi- prefix, and this will create the dummies for you.

        Code:
        help xi

        Code:
        webuse cattaneo2, clear
        set seed 08112022
        lasso linear bweight c.mage c.fage c.fedu c.medu i.(mmarried mhisp fhisp), nolog
        di "`e(post_sel_vars)'"
        
        xi: lasso linear bweight c.mage c.fage c.fedu c.medu i.mmarried i.mhisp i.fhisp, nolog
        di "`e(post_sel_vars)'"
        Results:

        Code:
        . lasso linear bweight c.mage c.fage c.fedu c.medu i.(mmarried mhisp fhisp), nolog
        
        Lasso linear model                          No. of obs        =      4,642
                                                    No. of covariates =         10
        Selection: Cross-validation                 No. of CV folds   =         10
        
        --------------------------------------------------------------------------
                 |                                No. of      Out-of-      CV mean
                 |                               nonzero       sample   prediction
              ID |     Description      lambda     coef.    R-squared        error
        ---------+----------------------------------------------------------------
               1 |    first lambda    104.4141         0      -0.0000     334965.7
              37 |   lambda before     3.66618         5       0.0337     323659.8
            * 38 | selected lambda    3.340487         5       0.0337     323658.7
              39 |    lambda after    3.043727         5       0.0337     323661.4
              67 |     last lambda    .2249534         7       0.0336     323692.1
        --------------------------------------------------------------------------
        * lambda selected by cross-validation.
        
        . 
        . di "`e(post_sel_vars)'"
        bweight mage fedu medu mmarried mhisp
        
        . 
        . 
        . 
        . xi: lasso linear bweight c.mage c.fage c.fedu c.medu i.mmarried i.mhisp i.fhisp, nolog
        i.mmarried        _Immarried_0-1      (naturally coded; _Immarried_0 omitted)
        i.mhisp           _Imhisp_0-1         (naturally coded; _Imhisp_0 omitted)
        i.fhisp           _Ifhisp_0-1         (naturally coded; _Ifhisp_0 omitted)
        
        Lasso linear model                          No. of obs        =      4,642
                                                    No. of covariates =          7
        Selection: Cross-validation                 No. of CV folds   =         10
        
        --------------------------------------------------------------------------
                 |                                No. of      Out-of-      CV mean
                 |                               nonzero       sample   prediction
              ID |     Description      lambda     coef.    R-squared        error
        ---------+----------------------------------------------------------------
               1 |    first lambda    104.4141         0      -0.0001     335006.8
              37 |   lambda before     3.66618         5       0.0334       323787
            * 38 | selected lambda    3.340487         5       0.0334     323783.2
              39 |    lambda after    3.043727         5       0.0334     323787.1
              67 |     last lambda    .2249534         7       0.0331     323883.2
        --------------------------------------------------------------------------
        * lambda selected by cross-validation.
        
        . 
        . di "`e(post_sel_vars)'"
        bweight mage fedu medu _Immarried_1 _Imhisp_1
        
        .
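
        Applied to the auto example from #1, the same idea might look like the sketch below. It is a rough outline rather than tested output: the covariates are written out by hand with one i. term per categorical variable (as in the xi call above) instead of using the vl macros, and the export file name is hypothetical.

        ```stata
        * sketch: the xi approach on the auto data from #1
        sysuse auto, clear

        * mpg treated as categorical for the sake of the example, as in #1
        xi: lasso linear price headroom weight turn displacement i.rep78 i.foreign i.mpg, rseed(42)

        * under xi the selected terms are ordinary variables (e.g. _Impg_14),
        * so the list can be passed to keep directly
        local selvars `e(allvars_sel)'
        keep price `selvars'

        * hypothetical file name for the export to other software
        export delimited using "lasso_selected.csv", replace
        ```

        Note that xi omits one level of each categorical variable by default, so that level never enters the lasso as a candidate. If an indicator for every level is wanted, xi's noomit option (xi, noomit: ...) is worth a look.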


        • #5
          Brilliant, thank you!
