
  • standard error double lasso with clustering

    Hi,

    I am using the new cluster feature for LASSO in Stata, and I am a little confused by the way the standard errors (SEs) are estimated.

    When I don't cluster, all is fine. If I do:

    sysuse nlsw88.dta
    dsregress wage grade , controls( age race married never_married collgrad south )
    Stata calculates a grade effect and an SE that are exactly equivalent to those from running the OLS regression with the LASSO-selected covariates:

    reg wage grade `e(controls_sel)' , r
    The LASSO and the OLS give exactly the same result.

    But let's say I want to cluster by industry:

    sysuse nlsw88.dta
    dsregress wage grade , controls( age race married never_married collgrad south )
    dsregress wage grade , controls( age race married never_married collgrad south ) cluster(industry)
    Both give the same SE, which makes me think that Stata uses clustering to compute the LASSO but does not correct for clustering when estimating the effects. This is super misleading in my view. When you code with the option cluster(), you expect your SEs to be clustered. Am I missing something here?

    Thanks


  • #2
    Which makes me think that Stata uses clustering to compute the LASSO
    I don't understand. Clustering has nothing to do with computing the LASSO coefficients. LASSO is just penalized OLS, where overfitting is mitigated by cross-validation. I'm unsure how this relates to the estimation. Anyway, the point estimates are unrelated to the standard errors: you calculate the point estimates first and then the standard errors, right? At least, that's how I recall learning OLS in my master's program.



    • #3
      One addition to my post:

      If I use the option vce(cluster industry), I recover the correct clustered standard error:
      dsregress wage grade , controls( age race married never_married collgrad south ) vce(cluster industry)
      reg wage grade `e(controls_sel)' , cluster(industry)
      It looks like the options cluster() and vce(cluster) do not produce the same result. This is not the case for other regression commands: I believe there is no difference between an OLS regression estimated with the cluster() option and one estimated with vce(cluster). Is it a coding error in Stata?




      • #4
        Nope, I was completely incorrect. The help file says that vce(cluster) DOES affect the log-likelihood and the k-fold cross-validation, so the fact that you get different results is actually the expected outcome according to the help file.



        • #5
          Yes, clustering affects the way LASSO is computed, but this is not my point. My point is that these two commands:

          dsregress wage grade , controls( age race married never_married collgrad south ) vce(cluster industry)
          dsregress wage grade , controls( age race married never_married collgrad south ) cluster(industry)
          do not produce the same result. Basically, the option cluster(industry) does not do what it is supposed to do. It seems to correct for clustering in the LASSO but not in the estimation of the standard errors.
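
          One way to see the difference side by side is to save the SE on grade after each variant. This is only a sketch: the scalar names and the local macro ctrl are made up for illustration, and it assumes that _se[grade] is available after dsregress because the command posts e(b) and e(V). No expected values are shown, since they depend on the Stata version.

          ```stata
          * Sketch: compare the reported SE on grade under each clustering syntax.
          * Scalar names (se_none, se_opt, se_vce) and the local ctrl are hypothetical.
          sysuse nlsw88, clear
          local ctrl age race married never_married collgrad south

          * (1) No clustering
          dsregress wage grade, controls(`ctrl')
          scalar se_none = _se[grade]

          * (2) cluster() option, as written earlier in this thread
          dsregress wage grade, controls(`ctrl') cluster(industry)
          scalar se_opt = _se[grade]

          * (3) vce(cluster ...) option
          dsregress wage grade, controls(`ctrl') vce(cluster industry)
          scalar se_vce = _se[grade]

          * Show the three SEs on one line
          display "none: " se_none "   cluster(): " se_opt "   vce(cluster): " se_vce
          ```

          If (1) and (2) print the same number while (3) differs, that reproduces what I describe above.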



          • #6
            Is industry the panel variable? I'd really need to sit down and read the help file. In the meantime, though, I would email Stata Tech Support and see what they say, as I'm quite interested in the solution.



            • #7
              Actually, I found a post about this issue here:

              https://www.stata.com/new-in-stata/l...lustered-data/

              vce(cluster) is the correct way to estimate the SE. Still, I think this is pretty confusing.
