standard error double lasso with clustering

Adrien Bouguen

Join Date: Jul 2014

Posts: 85
#1

standard error double lasso with clustering

19 Jan 2023, 18:33

Hi,

I am using the new cluster feature for LASSO in Stata and I am a little confused by the way the SEs are estimated.

When I don't cluster, all is fine. If I do:

sysuse nlsw88.dta
dsregress wage grade , controls( age race married never_married collgrad south )

Stata calculates a grade effect and an SE that are exactly equivalent to those from running an OLS regression on the LASSO-selected covariates with robust standard errors:

reg wage grade `e(controls_sel)' , r

The LASSO and the OLS give exactly the same result.

But let's say I want to cluster by industry:

sysuse nlsw88.dta
dsregress wage grade , controls( age race married never_married collgrad south )
dsregress wage grade , controls( age race married never_married collgrad south ) cluster(industry)

Both give the same SE. Which makes me think that Stata uses clustering to compute the LASSO but does not correct for clustering when estimating the effects. This is very misleading in my view: when you specify the cluster option, you expect your SEs to be clustered. Am I missing something here?

Thanks
Tags: None
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

19 Jan 2023, 18:46

Which makes me think that stata uses clustering to compute the LASSO

I don't understand. Clustering has nothing to do with computing the LASSO coefficients. LASSO is just penalized OLS where overfitting is mitigated by cross-validation. I'm unsure how this relates to the estimation. Anyway, the point estimates are unrelated to the errors. You calculate the point estimates first and then the standard errors, right? At least, that's how I recall learning OLS in my master's program.
Comment
Adrien Bouguen

Join Date: Jul 2014

Posts: 85
#3

19 Jan 2023, 18:50

One addition to my post:

If I use the option vce(cluster industry), I recover the correct clustered standard error:

dsregress wage grade , controls( age race married never_married collgrad south ) vce(cluster industry)
reg wage grade `e(controls_sel)' , cluster(industry)

It looks like the options cluster() and vce(cluster ) do not produce the same result. This is not the case for other regression commands: I believe there is no difference between an OLS estimated with the cluster() option and one estimated with vce(cluster ). Is it a coding error in Stata?
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#4

19 Jan 2023, 18:55

Nope, I was completely incorrect. The help file says that vce(cluster) DOES affect the log likelihood and the k-fold cross-validation, so the fact that you get different results is actually the expected outcome.
Comment
Adrien Bouguen

Join Date: Jul 2014

Posts: 85
#5

19 Jan 2023, 18:59

Yes, clustering affects the way the LASSO is computed, but this is not my point. My point is that these two commands:

dsregress wage grade , controls( age race married never_married collgrad south ) vce(cluster industry)
dsregress wage grade , controls( age race married never_married collgrad south ) cluster(industry)

do not produce the same result. Basically, the option cluster(industry) does not do what it is supposed to do. It seems to account for clustering in the LASSO but not in the estimation of the standard errors.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#6

19 Jan 2023, 19:09

Is industry the panel variable? I'd really need to sit down and read the help file. In the meantime, though, I would email Stata Technical Support and see what they say, as I'm quite interested in the solution.
Comment
Adrien Bouguen

Join Date: Jul 2014

Posts: 85
#7

19 Jan 2023, 20:02

Actually, I found a page about this issue here:

https://www.stata.com/new-in-stata/l...lustered-data/

vce(cluster) is the correct way to estimate the SE. Still, I think this is pretty confusing.
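
For anyone following along, the comparison from posts #3 and #7 can be put side by side as below. This is only a minimal sketch, assuming a Stata version whose dsregress accepts vce(cluster ) (lasso for clustered data arrived in Stata 17). The local macro name sel, the scalars b_ds and se_ds, and the if e(sample) restriction are not in the original posts; they are added here only to keep both estimates on the same observations and to hold the dsregress results before regress overwrites them.

```stata
* Load the example data used in the thread
sysuse nlsw88, clear

* Double-selection lasso with cluster-robust SEs (clusters: industry)
dsregress wage grade, controls(age race married never_married collgrad south) vce(cluster industry)

* Save the selected controls and the dsregress results before they are overwritten
* (sel, b_ds and se_ds are illustrative names, not from the original posts)
local sel `e(controls_sel)'
scalar b_ds  = _b[grade]
scalar se_ds = _se[grade]

* OLS on the selected controls, restricted to the dsregress sample, clustered on industry
regress wage grade `sel' if e(sample), vce(cluster industry)

* Display both sets of results for comparison
display "dsregress: b = " b_ds " se = " se_ds
display "regress:   b = " _b[grade] " se = " _se[grade]
```

Run it from a do-file so the local macro survives between lines. If the two displayed lines agree, that matches what was reported in post #3.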
Comment
