
  • Estimates Restore with Lasso - "no observations" error

    Hi everyone, I am trying to do in-sample and out-of-sample prediction with lasso. The following steps worked perfectly fine with linear regression but not with lasso, so I wonder why this is the case and how I can solve it.

    Step 1. Generate data
    matrix def c = J(8,8,0)
    matrix list c
    forvalues i = 1/8 {
        forvalues j = 1/8 {
            local toaddin = 0.5^(abs(`i' - `j'))
            matrix c[`i',`j'] = `toaddin'
        }
    }

    set seed 12345
    set obs 20
    drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
    gen e = rnormal(0,1)
    gen y = 3*x1 + 1.5*x2 + 2*x5 + e



    Step 2. Save the model
    lasso2 y x1 x2 x3 x4 x5 x6 x7 x8
    lasso2, lic(ebic)
    estimates store regression3


    Note: in-sample prediction, as follows, works perfectly fine:
    predict yhat_lasso, residual lic(ebic)
    gen yhat_lasso_2 = yhat_lasso^2
    summ yhat_lasso_2
    local yhat_lasso_2_mean = r(mean)
    di `yhat_lasso_2_mean'


    Step 3. Generate new data for out-of-sample prediction
    drop _all
    set seed 12345
    set obs 1000
    drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
    gen e = rnormal(0,1)
    gen y = 3*x1 + 1.5*x2 + 2*x5 + e


    Step 4: Try to do out-of-sample prediction
    set seed 12345
    estimates restore regression3
    ereturn list
    predict yhat_lasso, residual lic(ebic)


    Here is where the error message "no observations" comes out.
    The ereturn list command works well, which means the regression results have been successfully restored. So I am unsure why I can't predict again here.



    Thanks everyone who tries to help!



  • #2
    It appears to me that lasso2 does not support out-of-sample prediction.

    The output of help lasso2 makes no mention of out-of-sample prediction techniques, and it suggests that predict works by re-estimating the model in the original sample; I confirmed this by using trace to look "behind the scenes". The failure message you received arises because, as a careful look at the output of your second ereturn list shows, e(sample) was apparently not restored when estimates restore realized that the dataset had changed substantially.
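
    You can check this yourself; a minimal sketch, reusing the stored name regression3 from your post:
    Code:
    estimates restore regression3
    count if e(sample)   // returns 0 here: no observations are flagged as the estimation sample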

    You might contact one of the authors at the email address given for support in
    Code:
    net describe lassopack, from(http://fmwww.bc.edu/RePEc/bocode/l)
    and ask whether they can confirm this.

    I will note that Stata 16 includes its own lasso command, documented in the Stata Lasso Reference Manual PDF included in the Stata 16 installation and accessible through Stata's Help menu.
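
    For reference, a minimal sketch of the built-in command (syntax per the Stata 16 Lasso manual; blasso is just an arbitrary store name, and postselection requests predictions from the post-selection OLS coefficients):
    Code:
    lasso linear y x1 x2 x3 x4 x5 x6 x7 x8
    estimates store blasso
    * ... replace the data in memory with the new sample ...
    estimates restore blasso
    predict yhat_blasso, postselection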
    Last edited by William Lisowski; 21 Oct 2019, 19:51.



    • #3
      Hi Elisha, thanks for your report, which helps us to improve lassopack.

      First of all, let me clarify that lassopack supports out-of-sample prediction.

      The reason for the issue you experience is that, as William rightly points out, e(sample) (i.e., the estimation sample that is stored after estimation) gets lost when you use drop _all. This affects any other Stata command in the same way:

      Code:
      *Step 1. Generate data
      clear all
      
      matrix def c = J(8,8,0)
      matrix list c
      forvalues i = 1/8 {
          forvalues j = 1/8 {
              local toaddin = 0.5^(abs(`i' - `j'))
              matrix c[`i',`j'] = `toaddin'
          }
      }
      
      set seed 12345
      set obs 200
      drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
      gen e = rnormal(0,1)
      gen y = 3*x1 + 1.5*x2 + 2*x5 + e
      
      *Step 2. Save the model
      reg y x1 x2 x3 x4 x5 x6 x7 x8
      estimates store regression1
      ereturn list // NOTE that e(sample) is there
      
      *Step 3. Generate new data for out-of-sample prediction
      drop _all
      set seed 12345
      set obs 1000
      drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
      gen e = rnormal(0,1)
      gen y = 3*x1 + 1.5*x2 + 2*x5 + e
      
      estimates restore regression1
      ereturn list // NOTE that e(sample) is missing

      lasso2's predict command requires e(sample) to exist; otherwise it fails. This is in contrast to regress, whose predict also works without e(sample). I simply hadn't anticipated the case where e(sample) doesn't exist when you use predict. My bad... but I also think that you probably don't need to structure your code in this way.
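
      To see the contrast, continue the regress example above; its predict still works even though e(sample) is gone (a minimal sketch using the names from the code above):
      Code:
      predict yhat_ols, xb   // succeeds: predict after regress does not need e(sample)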

      How about this approach:

      Code:
      *Step 1. Generate data
      clear all
      
      matrix def c = J(8,8,0)
      matrix list c
      forvalues i = 1/8 {
          forvalues j = 1/8 {
              local toaddin = 0.5^(abs(`i' - `j'))
              matrix c[`i',`j'] = `toaddin'
          }
      }
      
      set seed 12345
      set obs 1200
      drawnorm x1 x2 x3 x4 x5 x6 x7 x8, cov(c)
      gen e = rnormal(0,1)
      gen y = 3*x1 + 1.5*x2 + 2*x5 + e
      
      
      *Step 2. Estimate using first 1000 obs only
      reg y x1 x2 x3 x4 x5 x6 x7 x8 if _n <= 1000
      
      lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if _n <= 1000
      lasso2, lic(ebic) postres
      
      
      *Step 3: insample prediction
      predict yhat_lasso_in if _n <= 1000, residual
      
      *Step 4: out-of-sample prediction
      predict yhat_lasso_out if _n > 1000, residual

      In my view, this is a bit cleaner, but you might have reasons for structuring the code the way you did.
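
      If you then want the out-of-sample fit, you can compute an RMSE directly from the residual predictions, following the squared-residual approach in your own post (a minimal sketch):
      Code:
      gen resid_sq_out = yhat_lasso_out^2   // predict ..., residual stored residuals, not fitted values
      summarize resid_sq_out
      display "out-of-sample RMSE: " sqrt(r(mean))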

      Finally, just a quick side note: if you haven't done so already, make sure you install the latest version of lassopack. The latest version is currently only available via GitHub (we are waiting for it to be uploaded to SSC).

      I hope this helps.
      http://statalasso.github.io/



      • #4
        Thank you William and Achim! Your help is much appreciated.



        • #5
          A follow-up question:
          I have now modified my code and it works well. However, I've noticed that the location of ols in the predict line matters. In my Version 1, I added ols to a separate predict line; in Version 2, I added ols on the same line as lasso2. These two approaches give different RMSEs. I was wondering, what is the reason for this difference?

          Code:
          //* Version 1 *//
          *in-sample
          set seed 10101
          lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if in_sample==1
          lasso2, lic(aic) postres 
          predict yhat_postlasso, lic(ebic) ols
          gen residual_postlasso2 = (yhat_postlasso - y)^2
          summarize residual_postlasso2 if in_sample==1
          local mean_postlasso_in = r(mean)
          local rmse_postlasso_in = sqrt(`mean_postlasso_in'*20/13)
          display `rmse_postlasso_in' //.84318197
          *out-of-sample 
          summarize residual_postlasso2 if in_sample==0  
          local mean_postlasso_out = r(mean)
          local rmse_postlasso_out = sqrt(`mean_postlasso_out'*1000/993)
          display `rmse_postlasso_out' //1.5339668
          
          
          //* Version 2 *//
          set seed 10101
          lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if in_sample==1, lic(ebic) ols
          local mylambda2 = e(lebic) 
          predict yhat_postlasso, lambda(`mylambda2')
          gen residual_postlasso2 = (yhat_postlasso - y)^2
          summarize residual_postlasso2 if in_sample==1
          local mean_postlasso_in = r(mean)
          local rmse_postlasso_in = sqrt(`mean_postlasso_in')
          display `rmse_postlasso_in' //.93564476
          *out-of-sample 
          summarize residual_postlasso2 if in_sample==0  
          local mean_postlasso_out = r(mean)
          local rmse_postlasso_out = sqrt(`mean_postlasso_out')
          display `rmse_postlasso_out' //1.1081077



          • #6
            Thanks again for the excellent question.

            Version 1 obtains the coefficient path using the lasso, picks the optimal lambda based on AIC/EBIC, and then uses post-lasso.

            Version 2 obtains the coefficient path for the post-lasso and picks the optimal lambda based on AIC/EBIC.

            I would probably go with Version 1, which seems more natural to me, but I have no theoretical justification to prefer one over the other.
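
            In code terms, mirroring the commands from #5 (using EBIC throughout for clarity):
            Code:
            * Version 1: lasso coefficient path; lambda chosen by EBIC; then post-lasso OLS
            lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if in_sample==1
            lasso2, lic(ebic) postres
            predict yhat_v1, lic(ebic) ols

            * Version 2: EBIC applied along the post-lasso (OLS) path directly
            lasso2 y x1 x2 x3 x4 x5 x6 x7 x8 if in_sample==1, lic(ebic) ols
            local mylambda = e(lebic)
            predict yhat_v2, lambda(`mylambda')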
            http://statalasso.github.io/
