
  • LASSO & Postselection coefficients

    I noticed that the postselection coefficients I get after running -lasso- differ from the OLS estimates using the lasso-selected regressors when I specify a cluster variable. When I don't specify a cluster variable, they match as expected. I didn't expect clustering to break the match, so does anyone know why that happens? Is this just a precision issue?

    Here's a toy example. It compares the postselection estimate of x1 with the OLS estimate of x1, with and without a cluster variable.
    Code:
    qui forv s = 1/10 {
    
        * make up some data
        clear all
        set seed `s'
        set obs 1000
        g y =  runiform()                        /* outcome                     */
        g clust = int(runiform(1,20))
        forv z=1/10 {
            g x`z' = int(runiform(1,5))            /* potential predictors x*     */
        }
        replace y = y + 5*x1 + .5*x2
        
        foreach z in "WITH_CLUSTERING" "WITHOUT_CLUSTERING" {
            if "`z'"=="WITH_CLUSTERING"         lasso linear y x*, cluster(clust)   
            if "`z'"=="WITHOUT_CLUSTERING"     lasso linear y x*                 
            loc beta_lasso = e(b_postselection)[1,"x1"]
            reg y `e(allvars_sel)'    
            loc beta_ols = _b[x1]
            loc dif = `beta_lasso' - `beta_ols'
            noi di "seed=`s'; `z'; lasso estimate=`beta_lasso'; OLS estimate=`beta_ols';  difference=`dif' "
        }
    
    }
    Here's the output:

    Code:
    seed=1; WITH_CLUSTERING; lasso estimate=4.998759942514343; OLS estimate=4.999203211772241;  difference=-.0004432692578984
    seed=1; WITHOUT_CLUSTERING; lasso estimate=4.999203211772241; OLS estimate=4.999203211772241;  difference=0
    seed=2; WITH_CLUSTERING; lasso estimate=4.996668209099026; OLS estimate=4.997709674975988;  difference=-.0010414658769626
    seed=2; WITHOUT_CLUSTERING; lasso estimate=4.997709674975988; OLS estimate=4.997709674975988;  difference=0
    seed=3; WITH_CLUSTERING; lasso estimate=5.005329955268507; OLS estimate=5.004821214398009;  difference=.0005087408704973
    seed=3; WITHOUT_CLUSTERING; lasso estimate=5.004821214398009; OLS estimate=5.004821214398009;  difference=0
    seed=4; WITH_CLUSTERING; lasso estimate=4.994085712037184; OLS estimate=4.995145877055269;  difference=-.0010601650180853
    seed=4; WITHOUT_CLUSTERING; lasso estimate=4.995145877055269; OLS estimate=4.995145877055269;  difference=0
    seed=5; WITH_CLUSTERING; lasso estimate=5.005353533242829; OLS estimate=5.002485671712456;  difference=.0028678615303726
    seed=5; WITHOUT_CLUSTERING; lasso estimate=5.002485671712456; OLS estimate=5.002485671712456;  difference=0
    seed=6; WITH_CLUSTERING; lasso estimate=4.987126930711644; OLS estimate=4.986695164677461;  difference=.0004317660341826
    seed=6; WITHOUT_CLUSTERING; lasso estimate=4.986695164677461; OLS estimate=4.986695164677461;  difference=0
    seed=7; WITH_CLUSTERING; lasso estimate=5.005207996913146; OLS estimate=5.00505189238087;  difference=.0001561045322767
    seed=7; WITHOUT_CLUSTERING; lasso estimate=5.00505189238087; OLS estimate=5.00505189238087;  difference=0
    seed=8; WITH_CLUSTERING; lasso estimate=4.999478604561925; OLS estimate=5.00041089181657;  difference=-.0009322872546456
    seed=8; WITHOUT_CLUSTERING; lasso estimate=5.00041089181657; OLS estimate=5.00041089181657;  difference=0
    seed=9; WITH_CLUSTERING; lasso estimate=5.009298907208509; OLS estimate=5.009607632973197;  difference=-.0003087257646888
    seed=9; WITHOUT_CLUSTERING; lasso estimate=5.009607632973197; OLS estimate=5.009607632973197;  difference=0
    seed=10; WITH_CLUSTERING; lasso estimate=5.021691730222245; OLS estimate=5.023500155842829;  difference=-.0018084256205837
    seed=10; WITHOUT_CLUSTERING; lasso estimate=5.023500155842829; OLS estimate=5.023500155842829;  difference=0
    Last edited by Brian Holtemeyer; 27 Jan 2023, 13:20. Reason: clarified example

  • #2
    The help says the clustering affects the MLE procedure, thus affecting the predictors.



    • #3
      Yeah, I expect the clustering to affect which predictors are selected. But whichever predictors are selected, those are exactly the ones I then pass to OLS.

      Or maybe I don't follow your point.



      • #4
        Put more simply, I expect the "difference" to be 0 whether I cluster or not. Below you can see that the difference is not 0 when there is a cluster variable.

        Code:
        seed=1; WITH_CLUSTERING;        lasso estimate=4.998759942514343; OLS estimate=4.999203211772241; difference=-.0004432692578984
        seed=1; WITHOUT_CLUSTERING; lasso estimate=4.999203211772241; OLS estimate=4.999203211772241; difference=0
        ...



        • #5
          bump



          • #6
            There is no universe in which this will run:

            Code:
            foreach z in "WITH_CLUSTERING" "WITHOUT_CLUSTERING" {
            if "`z'"=="WITH_CLUSTERING" lasso linear y x*, cluster(clust)
            if "`z'"=="WITHOUT_CLUSTERING" lasso linear y x*
            So your problem is not reproducible. If the postselection variables differ, you cannot expect the coefficients from the regressions on these variables to be exactly the same, as you are not comparing like for like.



            • #7
              Originally posted by Andrew Musau
              There is no universe that
              this will run, so your problem is not reproducible. If the post selection variables differ, you cannot expect the coefficients from the regressions on these variables to be exactly the same as you are not comparing like for like.
              The universe where it runs seems to be my machine, lol. What sorts of problems are you running into when you try to replicate it? Here's a simpler example. The estimates on rep78 are 242.5288 (lasso postselection) and 245.7616 (regress), but I expected them to be the same. The postselection variables do NOT differ...

              Code:
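              sysuse auto, clear
              * arbitrary cluster id: 10 groups by observation number
              gen clu = mod(_n,10)
              lasso linear price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign, rseed(123) cluster(clu)
              * show the postselection coefficients of the selected variables
              lassocoef, nolegend display(coef, postselection)
              * OLS on the same selected variables
              reg price `e(allvars_sel)'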
              Output:
              Code:
              . lassocoef, nolegend             display(coef, postselection)
              
              ------------------------
                           |    active
              -------------+----------
                     rep78 |  242.5288
                  headroom | -709.8902
                    weight |  2.127192
              displacement |  15.34743
                   foreign |   3477.85
                     _cons | -3102.581
              ------------------------
              
              . reg                     price `e(allvars_sel)'
              
                    Source |       SS           df       MS      Number of obs   =        69
              -------------+----------------------------------   F(5, 63)        =     15.70
                     Model |   320011687         5  64002337.5   Prob > F        =    0.0000
                  Residual |   256785272        63  4075956.69   R-squared       =    0.5548
              -------------+----------------------------------   Adj R-squared   =    0.5195
                     Total |   576796959        68  8482308.22   Root MSE        =    2018.9
              
              ------------------------------------------------------------------------------
                     price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                     rep78 |   245.7616   308.8732     0.80   0.429    -371.4721    862.9954
                  headroom |  -692.9206   329.8097    -2.10   0.040    -1351.993   -33.84847
                    weight |   2.068787   .8696198     2.38   0.020     .3309908    3.806584
              displacement |   15.85848   7.342558     2.16   0.035     1.185552    30.53141
                   foreign |   3473.477   791.8829     4.39   0.000     1891.026    5055.929
                     _cons |  -3081.957    1935.88    -1.59   0.116    -6950.504    786.5901
              ------------------------------------------------------------------------------



              • #8
                Update: Stata technical support solved it. They wrote: "When there is cluster, -lasso- uses a weighted regression (within cluster). Therefore, -regress- with the selected variables will not be the same as e(b_postselection)."
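
                For anyone who wants to check this, here is a minimal sketch built on the auto example from #7. It assumes the "weighted regression (within cluster)" means each observation gets weight 1/(number of estimation-sample observations in its cluster); tech support did not spell out the weights, so that choice is my guess. The names insample, T, w, and b_post are made up for illustration. I have not confirmed that this reproduces e(b_postselection) exactly, so treat it as a comparison to run, not a result.

                ```stata
                * Sketch only: ASSUMES -lasso, cluster()- weights each observation by
                * 1/(number of estimation-sample observations in its cluster).
                sysuse auto, clear
                gen clu = mod(_n, 10)
                lasso linear price mpg rep78 headroom trunk weight length turn ///
                    displacement gear_ratio foreign, rseed(123) cluster(clu)

                * Save what we need before another estimation command replaces e()
                local sel `e(allvars_sel)'
                matrix b_post = e(b_postselection)
                gen byte insample = e(sample)

                * Hypothetical weight: inverse of the within-sample cluster size
                bysort clu: egen double T = total(insample)
                gen double w = 1/T if insample

                * Weighted regression on the selected variables, to compare with b_post
                regress price `sel' [aweight = w] if insample
                matrix list b_post
                ```

                Note that rep78 has missing values, so the cluster sizes are counted within the estimation sample (69 observations in #7) and not over all 74 cars. If the weighted coefficients still differ from b_post, the assumed weights are not the ones -lasso- uses.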



                • #9
                  Originally posted by Brian Holtemeyer

                  The universe that it runs seems to be my machine, lol.

                  My bad, you are absolutely correct. I guess I am just used to the syntax:

                  Code:
                  if xxx{
                      do yyy
                  }
                  Glad that you were able to solve your problem!
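
                  For the record, a small sketch of the two forms side by side. The macro value and the display text are placeholders of mine, not taken from #1. Stata accepts a single command directly after the if expression, or a braced block; both run the command when the expression is true.

                  ```stata
                  local z "WITH_CLUSTERING"

                  * Single-line form: one command follows the expression directly
                  if "`z'" == "WITH_CLUSTERING" display "clustered branch"

                  * Brace form: same test, command placed inside a block
                  if "`z'" == "WITH_CLUSTERING" {
                      display "clustered branch"
                  }
                  ```

                  The single-line form is what #1 uses inside the foreach loop, which is why that code runs as posted.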



                  • #10
                    Originally posted by Andrew Musau
                    My bad, you are absolutely correct.
                    I should have posted a simpler example like in #7.
