  • LASSO and subsets of variables

    I am wondering if someone can explain why the LASSO sometimes chooses no variables when many candidate variables are included in a model, yet chooses several when only a subset of those same variables is offered. I don't understand how this can happen.

    Here's an example with data:
    If I run:

    lars en a3 e3 l3 d2 rs sizeavg, a(lasso)

    Then LASSO chooses a3, e3, rs, and sizeavg.

    But if I add age to the candidates and run:

    lars en a3 e3 l3 d2 rs sizeavg age, a(lasso)

    then the LASSO chooses no variables at all.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(a3 e3 l3 d2 age rs sizeavg en)
    35   11 18.6 6   4 1   736  47021.41
    49 11.3   22 6   6 1   670  82373.56
    30   11 13.7 6   5 1   736  58144.61
    31 11.3 14.7 6   5 1   606  47947.14
     8    5   15 1   3 0   504  28261.64
    30    8   13 8   5 1   745  32253.55
    30   12   25 6   6 1   575  70050.37
    11 11.7  4.7 3  15 1   553  67029.66
    30   12   15 6   5 1   755  72386.39
    33 10.4  5.6 6   5 1   692  58593.08
    28 10.6  8.8 6   4 1   755  96065.57
    31 11.3 12.5 6   6 1   660  96112.35
    38   11   13 6   5 1   598  88737.84
    22   11   16 8   4 1   811 75900.375
    31 10.2   12 6   4 1   737  85583.16
    38 11.3   18 6   6 1   670  52297.74
    22 10.6   15 6   3 1   850  104180.3
    34 12.5 10.6 8 4.5 1   696  38965.18
    24  9.6  6.6 8   5 1   593  40038.34
    30 11.6    9 6   6 1   755 123999.98
    30   11   14 6   5 1   648  40243.81
    22   11   11 6   4 1   696  58989.97
    26 12.5  7.6 8   3 1 690.5   81369.8
    38 10.6   19 6   5 1   667  59163.72
    30    8   13 8   5 1   677  76445.72
    30   11   20 6   5 1   710  58538.94
    49 10.2 10.6 6   6 1   688  29591.23
    30    8   19 6   6 1   736 68110.875
    38    8   17 8   5 1   711  82417.18
    30   12   18 8   5 1   774  78911.65
    27 12.8 11.7 6   5 1   625  25993.42
    49  8.5 15.9 6   6 1   788 27619.146
    31 10.4  6.9 6   6 1   692 14386.373
    38   11 12.7 8   5 1   688 13170.033
    30   12   13 6   6 1   760 70203.625
    38    8 22.5 8   5 1   677  57646.82
    38  7.9 12.9 8   5 1   667  45465.28
    30   11   14 8   4 1   738  56925.07
    49 10.2 14.6 6   6 1   688 23870.637
    22   11 12.5 6   4 1   662  92508.52
    30   11   15 8   5 1   763  53747.79
    36   14   17 8   5 1   600 38685.676
    30 11.7   18 6   5 1   667  80511.63
    30   11   16 8   5 1   731 23874.104
    38   11   22 6   5 1   606  44018.17
    30    8   20 6   5 1   732  43041.18
    30   11   15 6   3 1   639 38779.184
    end
    Last edited by Alecia Cassidy; 25 Apr 2017, 12:57.

  • #2
    I'm curious about this, too. At a minimum, one might expect a3 to be kept, as it is statistically significant in a standard regression:

    Code:
    . reg en a3 e3 l3 d2 rs sizeavg age
    
          Source |       SS           df       MS      Number of obs   =        47
    -------------+----------------------------------   F(7, 39)        =      1.73
           Model |  6.9671e+09         7   995306268   Prob > F        =    0.1310
        Residual |  2.2472e+10        39   576195739   R-squared       =    0.2367
    -------------+----------------------------------   Adj R-squared   =    0.0997
           Total |  2.9439e+10        46   639973428   Root MSE        =     24004
    
    ------------------------------------------------------------------------------
              en |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              a3 |  -1134.055   526.9997    -2.15   0.038    -2200.012   -68.09726
              e3 |   975.5635   2677.129     0.36   0.718    -4439.442    6390.569
              l3 |   640.2008   849.8749     0.75   0.456    -1078.834    2359.235
              d2 |  -5508.879   3942.795    -1.40   0.170    -13483.93    2466.177
              rs |   69497.98   46162.81     1.51   0.140    -23875.12    162871.1
         sizeavg |    94.1841   62.70715     1.50   0.141    -32.65308    221.0213
             age |  -1630.529   2688.831    -0.61   0.548    -7069.203    3808.146
           _cons |  -14215.07   50651.83    -0.28   0.780    -116668.1    88237.91
    ------------------------------------------------------------------------------
    When I've tried lars on larger panel data sets with many controls, the LASSO returns sensible results. I suppose that, like all estimation methods, it can produce surprising results for a single draw with a small sample size. But I hope an expert weighs in.
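
    One quick, informal check -- a minimal sketch, assuming the user-written lars (from SSC) is installed and Alecia's data are in memory -- is to rerun the selection on a few bootstrap resamples and watch how much the chosen set moves around:

    Code:
    * Sketch: rerun the lasso on bootstrap resamples and eyeball
    * how the selected set changes from draw to draw
    set seed 12345
    forvalues i = 1/5 {
        preserve
        bsample
        display as text _n "--- bootstrap draw `i' ---"
        lars en a3 e3 l3 d2 rs sizeavg age, a(lasso)
        restore
    }

    If the selected set flips around across draws, that supports the small-sample story.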



    • #3
      Interesting question. I am not the expert on lasso that Jeff wants; my expertise is epsilon above zero in that I have heard of it. But I can draw graphs.

      You can draw a scatter plot matrix for yourself (note the small trick of naming the response last, so that it ends up on the vertical axis in the bottom row, as convention expects). To me it suggests there is little juice in this lemon. I note one indicator variable (rs); I don't know whether that could be awkward, especially with what looks like one outlier.

      A multiple quantile plot can be drawn to show the individual distributions. For more on the multqplot command, see http://www.stata-journal.com/sjpdf.h...iclenum=gr0053

      Code:
      . graph matrix a3 e3 l3 d2 rs sizeavg en
      
      . multqplot a3 e3 l3 d2 rs sizeavg en
      [Attached image: multqplot2.png -- multiple quantile plots of a3 e3 l3 d2 rs sizeavg en]

      Last edited by Nick Cox; 26 Apr 2017, 06:39.



      • #4
        Hi Jeff and Nick,

        Thanks so much for your replies. This is not my full dataset; it's a smaller one that I'm playing around with to learn more about LASSO.

        Nick points out that the dummy variable rs might be the culprit. I probably wouldn't include rs as a candidate variable in real life, since it has too little variation to be useful. Indeed, when I add a little random noise to rs, the same set of variables is chosen whether or not age is in the candidate list. I'm still a bit perplexed as to why this would be a problem for LASSO, though.
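
        For the record, this is roughly what I tried -- the noise scale of 0.01 is arbitrary, and rs_noisy is just my throwaway variable name:

        Code:
        * Roughly the perturbation described above: rs is a 0/1 dummy
        * with very little variation, so jitter it slightly and rerun
        set seed 2017
        generate rs_noisy = rs + rnormal(0, 0.01)
        lars en a3 e3 l3 d2 rs_noisy sizeavg age, a(lasso)
        lars en a3 e3 l3 d2 rs_noisy sizeavg, a(lasso)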

        Alecia



        • #5
          A chink in the LASSO armor? I wonder if it can be reproduced on a larger data set.



          • #6
            Originally posted by Jeff Wooldridge View Post
            A chink in the LASSO armor? I wonder if it can be reproduced on a larger data set.
            I hope it cannot be reproduced on a larger data set. I'll definitely let you know if I reproduce it on my larger data set.



            • #7
               Alecia: I generated some data and was able to reproduce your puzzle in an N = 50 case with 4 regressors. Lasso on 2 or 3 of the regressors picks out a single variable; lasso on all 4 chooses none. But it seems rare, and I could not reproduce the conflict with N = 500, though I did not search over a lot of different data draws.

              Code:
              . lars y x1 x2, a(lasso)
              NOTE: Deleting all matrices
                        ade[3,2]
                         mu[1,1]
                      meanx[1,2]
                         R2[1,3]
                        RSS[1,3]
                         r2[1,1]
                        rss[1,1]
                         cp[1,3]
                      normx[1,2]
                       beta[3,2]
                      sbeta[3,2]
                      error[1,1]
              
              sbeta[3,2]
                         c1         c2
              r1          0          0
              r2  1.2291965          0
              r3  1.5592557  -.3300593
              
              Algorithm is lasso
              
              Cp, R-squared and Actions along the sequence of models
              
               +----------------------------------------+
               | Step |      Cp     | R-square | Action |
               |------+-------------+----------+--------|
               |    1 |     1.1416  |  0.0000  |        |
               |    2 |     1.1375 *|  0.0408  |  +x1   |
               |    3 |     3.0000  |  0.0436  |  +x2   |
               +----------------------------------------+
              * indicates the smallest value for Cp
              
              The coefficient values for the minimum Cp
              
              +-------------------------+
              | Variable |  Coefficient |
              |----------+--------------|
              | x1       |       0.2060 |
              +-------------------------+
              
              . lars y x1 x2 x3 x4, a(lasso)
              NOTE: Deleting all matrices
                        ade[5,4]
                         mu[1,1]
                      meanx[1,4]
                         R2[1,5]
                        RSS[1,5]
                         r2[1,1]
                        rss[1,1]
                         cp[1,5]
                      normx[1,4]
                       beta[5,4]
                      sbeta[5,4]
                      error[1,1]
              
              sbeta[5,4]
                          c1          c2          c3          c4
              r1           0           0           0           0
              r2           0           0           0   .94302486
              r3   .16056008           0           0   1.1035849
              r4   .16164226           0    .0015528   1.1041314
              r5   .43678751  -5.2221903   5.0173428   1.4016385
              
              Algorithm is lasso
              
              Cp, R-squared and Actions along the sequence of models
              
               +----------------------------------------+
               | Step |      Cp     | R-square | Action |
               |------+-------------+----------+--------|
               |    1 |     1.9570 *|  0.0000  |        |
               |    2 |     1.9970  |  0.0392  |  +x4   |
               |    3 |     3.7335  |  0.0445  |  +x1   |
               |    4 |     5.7319  |  0.0445  |  +x3   |
               |    5 |     5.0000  |  0.0992  |  +x2   |
               +----------------------------------------+
              * indicates the smallest value for Cp
              
              The coefficient values for the minimum Cp
              
              +-------------------------+
              | Variable |  Coefficient |
              |----------+--------------|
              +-------------------------+
              
              . reg y x1 x2 x3 x4
              
                    Source |       SS           df       MS      Number of obs   =        50
              -------------+----------------------------------   F(4, 45)        =      1.24
                     Model |  4.93730601         4   1.2343265   Prob > F        =    0.3079
                  Residual |   44.821115        45  .996024778   R-squared       =    0.0992
              -------------+----------------------------------   Adj R-squared   =    0.0192
                     Total |   49.758421        49  1.01547798   Root MSE        =    .99801
              
              ------------------------------------------------------------------------------
                         y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                        x1 |   .0732172   .4536781     0.16   0.873    -.8405374    .9869717
                        x2 |  -.2209018    .136094    -1.62   0.112    -.4950092    .0532057
                        x3 |   .1136642   .0715457     1.59   0.119    -.0304363    .2577646
                        x4 |    .214223   .4177185     0.51   0.611    -.6271053    1.055551
                     _cons |   1.443697   .3003661     4.81   0.000     .8387284    2.048665
              ------------------------------------------------------------------------------



              • #8
                Originally posted by Jeff Wooldridge View Post
                 Alecia: I generated some data and was able to reproduce your puzzle in an N = 50 case with 4 regressors. Lasso on 2 or 3 of the regressors picks out a single variable; lasso on all 4 chooses none. But it seems rare, and I could not reproduce the conflict with N = 500, though I did not search over a lot of different data draws.

                Hi Jeff,
                In your case, were any of the variables binary?



                • #9
                   Actually, I meant to make one binary, but it looks like I made it fractional. One of them is discrete -- x2, I think -- with a binomial distribution from 0 to 20.
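
                   In case anyone wants to poke at this: I don't have the exact do-file handy, but a DGP in this spirit -- weak signal, N = 50, x2 discrete via rbinomial(), x1 fractional where I had meant it to be binary -- should reproduce the flavor. This is a sketch, not my original code:

                   Code:
                   * Sketch of the flavor of the DGP, not the exact code I ran
                   clear
                   set obs 50
                   set seed 54321
                   generate x1 = runiform()            // fractional (meant to be binary)
                   generate x2 = rbinomial(20, 0.3)    // discrete, 0 to 20
                   generate x3 = rnormal(0, 3)
                   generate x4 = runiform()
                   generate y  = 1.5 + 0.2*x4 + rnormal()  // deliberately weak signal
                   lars y x1 x2 x3 x4, a(lasso)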



                  • #10
                    Originally posted by Jeff Wooldridge View Post
                    Actually, I meant to make one binary, but it looks like I make it fractional. One of them is discrete -- x2, I think -- with a binomial distribution from 0 to 20.
                    Wow, so I guess it might be a more general problem with using LASSO in small samples. I'll be more careful in the future!



                    • #11
                      By the way, please say hello to Traviss for me.



                      • #12
                        Originally posted by Jeff Wooldridge View Post
                        By the way, please say hello to Traviss for me.
                        Traviss says hi!



                        • #13
                          Found this in Belloni, Chernozhukov and Hansen, JEP 2014 (p. 40):

                          "Intuitively, reliably distinguishing true predictive power from spurious association
                          becomes more difficult as more variables are considered. This intuition can be
                          seen in the theory of high-dimensional variable selection methods, and the methods
                          work best in simulations when selection is done over a collection of variables that
                          is not overly extensive. It is therefore important that some persuasive economic
                          intuition exists to produce a carefully chosen, well-targeted set of variables to be
                          selected over even when using automatic variable selection methods."

                          This could be why LASSO might select nothing when given more variables.



                          • #14
                            Originally posted by Alecia Cassidy View Post
                            Found this in Belloni, Chernozhukov and Hansen, JEP 2014 (p. 40):

                            "Intuitively, reliably distinguishing true predictive power from spurious association
                            becomes more difficult as more variables are considered. This intuition can be
                            seen in the theory of high-dimensional variable selection methods, and the methods
                            work best in simulations when selection is done over a collection of variables that
                            is not overly extensive. It is therefore important that some persuasive economic
                            intuition exists to produce a carefully chosen, well-targeted set of variables to be
                            selected over even when using automatic variable selection methods."

                            This could be why LASSO might select nothing when given more variables.
                            Hello Alecia,

                             I have the same problem as you. I have over 40 explanatory variables, and the Cp becomes negative!
                             What would you suggest? Eliminating some of the x's? On what criteria would you base the elimination?

                            Best regards,
                            David



                            • #15
                              Hi David,
                              I'm not really qualified to give you advice on this, so I hope someone more qualified will chime in. That said, it looks like eliminating some of the candidate variables based on economic intuition is what Belloni, Chernozhukov and Hansen would recommend.
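
                               If a purely mechanical first pass helps while you work out the economics, a small sketch like this (x1-x40 are placeholders for your candidate names, and the 0.05 cutoff is arbitrary) flags near-constant candidates of the kind that bit me with rs:

                               Code:
                               * Sketch: flag near-constant candidates before running lars;
                               * x1-x40 are placeholder names, 0.05 an arbitrary cutoff
                               foreach v of varlist x1-x40 {
                                   quietly summarize `v'
                                   if r(sd) < 0.05 {
                                       display as text "`v' is nearly constant (sd = " %6.4f r(sd) ")"
                                   }
                               }
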
                              Good luck!
                              Alecia

