
  • LASSO variable selection: omission of categories

    Dear Stata community,

    One aspect of the -lasso linear- command is leaving me perplexed. I am running the following:

    Code:
     lasso linear (i.cat var1 var2) $controls, nolog rseed(123) selection(plugin)
    where cat is a categorical variable with 10 categories and $controls is a global macro holding the control variables. To view the results, I run

    Code:
    lassocoef
    However, I am surprised to see that the first two categories of cat are omitted from the list of coefficients provided by the command. The macro e(allvars_sel), stored by the lasso command, contains only the following:

    Code:
    3.cat 4.cat 5.cat 6.cat 7.cat 8.cat 9.cat 10.cat var1 var2 ...

    What is the reason for this? I would expect one category to be omitted if it served as the base category, but not two. Could this affect the results of the lasso command?

    Thank you in advance

  • #2
    What is your expectation? lasso can select only some levels of a categorical variable. i.cat does not define a single variable, but a series of indicators.
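
To illustrate the point about indicators, here is a minimal sketch that expands a factor-variable term into its individual level indicators. It assumes the shipped auto dataset and its variable rep78 as a stand-in for the poster's cat variable, which is not available here.

```stata
* Assumption: rep78 from the auto dataset stands in for the poster's cat
sysuse auto, clear

* Expand the factor-variable term into one indicator per level
fvexpand i.rep78

* The expanded list is returned in r(varlist); the base level carries a "b"
display "`r(varlist)'"
```

Each element of the expanded list is a separate candidate term, which is why a lasso can keep some levels and drop others when the term is not forced into the model.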



    • #3
      Since i.cat is fixed in the lasso, I would expect e(allvars_sel) to store the following:

      Code:
      2.cat 3.cat 4.cat 5.cat 6.cat 7.cat 8.cat 9.cat 10.cat var1 var2 ...



      • #4
        Right. The syntax is

        lasso model depvar [(alwaysvars)] othervars [if] [in] [weight] [, options]
        so this cannot be right, as it does not include the outcome variable:

        lasso linear (i.cat var1 var2) $controls, nolog rseed(123) selection(plugin)
        Have you checked whether the category is empty? I think you need to present a reproducible example where this happens, seed and all. In fact, the base level is usually included, as below:

        Code:
        sysuse auto, clear
        set seed 10112023
        qui lasso linear length (i.rep78 price) mpg turn disp
        di "`e(allvars_sel)'"
        which lasso
        Results:

        Code:
        . di "`e(allvars_sel)'"
        1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78 price mpg turn displacement
        
        . which lasso
        ...\Stata 17\ado\base\l\lasso.ado
        *! version 1.0.8  10feb2020
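
The empty-category check suggested above could be sketched as follows. This again assumes the auto dataset and rep78 in place of the poster's data, and it relies on lasso marking its estimation sample in e(sample), which is my understanding of its stored results.

```stata
* Assumption: auto dataset and rep78 stand in for the poster's data and cat
sysuse auto, clear

* Level counts in the full data, including missing values
tabulate rep78, missing

* Fit the lasso, then tabulate the levels within the estimation sample only;
* a level with zero observations there cannot appear among the coefficients
quietly lasso linear length (i.rep78 price) mpg turn displacement, selection(plugin)
tabulate rep78 if e(sample)
```

Comparing the two tables shows whether a level disappears only because observations were dropped for missing values in other variables.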



        • #5
          Hello Jack Jameson. What you are highlighting is that Stata (to this point) has implemented LASSO, but not group LASSO. Here are a couple of slides I cobbled together when reviewing these issues several months ago.

          [Image: Notes on LASSO_21.png]

          [Image: Notes on LASSO_22.png]


          The Meier et al. article can be viewed here: http://people.ee.duke.edu/~lcarin/lukas-sara-peter.pdf

          Finally, you might consider posting your wish for Stata to implement group LASSO to the Stata 19 wishlist thread.


          HTH.
          --
          Bruce Weaver
          Version: Stata/MP 18.5 (Windows)



          • #6
            Andrew Musau - indeed, I forgot to add the dependent variable in my example. My apologies.

            Okay - I fixed the issue. Initially, I was finding the following using Andrew's example:

            Code:
            . di "`e(allvars_sel)'"
            3.rep78 4.rep78 5.rep78 price mpg turn displacement
            . 
            . which lasso
            C:\Program Files\Stata17\ado\base\l\lasso.ado
            *! version 1.0.9  07dec2020
            I then updated Stata (update all), and I now find the expected output:

            Code:
            . di "`e(allvars_sel)'"
            1b.rep78 2.rep78 3.rep78 4.rep78 5.rep78 price mpg turn displacement
            
            . 
            . which lasso
            C:\Program Files\Stata17\ado\base\l\lasso.ado
            *! version 1.0.9  07dec2020
            It's worth noting that this has affected my results. After updating, lasso chooses a different set of variables, beyond the previously missing cat categories. Any idea why that may have happened?



            • #7
              See

              Code:
                help whatsnew17

              -----update 05oct2021-----

              11. lasso, when a factor variable was specified as one of the alwaysvars, omitted the wrong base-level variable. This has been fixed.
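
To check whether an installation already includes this fix, something like the following sketch may help. It reuses the auto-dataset example from earlier in the thread; the update all line is commented out because it modifies the installation.

```stata
* Compare the installed update level with what is available
update query

* Uncomment to install all available updates (changes the installation)
* update all

* Show the location and version line of the installed lasso.ado
which lasso

* Rerun the thread's example and inspect which terms were kept
sysuse auto, clear
quietly lasso linear length (i.rep78 price) mpg turn displacement
display "`e(allvars_sel)'"
```

On an updated installation, the thread reports that the list starts with 1b.rep78 2.rep78; a list starting at 3.rep78 matches the pre-fix behavior described above.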

              "It's worth noting that this has affected my results. After updating, the LASSO chooses a different set of variables, beyond the missing cat categories."
              This should not happen if you set the seed. However, in general, lasso does not always select the same variables across runs. There is some randomness in the algorithm.
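
The seed point can be checked with a small sketch like the one below. It assumes the auto dataset again and uses selection(cv), where the random assignment of observations to folds is the part that depends on the seed; as far as I understand, selection(plugin) does not involve random folds.

```stata
* Assumption: auto dataset as in the earlier example
sysuse auto, clear

* First run with a fixed seed for the cross-validation folds
quietly lasso linear length (i.rep78 price) mpg turn displacement, selection(cv) rseed(123)
local run1 "`e(allvars_sel)'"

* Second run with the same seed
quietly lasso linear length (i.rep78 price) mpg turn displacement, selection(cv) rseed(123)
local run2 "`e(allvars_sel)'"

* assert exits with an error if the two selections differ
assert "`run1'" == "`run2'"
```

If the assertion passes on the same installation but the selection differs before and after an update, the change comes from the updated code rather than from the seed.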

