Training sample for lasso and then making out-of-sample predictions

Mike Kraft

Join Date: Dec 2014

Posts: 328
#1

Training sample for lasso and then making out-of-sample predictions

12 Sep 2020, 02:37

Dear All

I have created a loop to generate forecasts using lasso regressions that (I hope) can be considered out of sample.

Can anyone please confirm whether these forecasts are out of sample, in the sense that these simple lines of code use only observations before 1996 to train the model and then generate forecasts for 1996 onwards? Do I have to condition "predict" on sample 2?
The model is estimated using lasso regression in Stata 16.

My code is:

Code:

gen sample=.
replace sample=1 if year<1996 // training sample
replace sample=2 if sample==. // evaluation sample
quietly lasso linear BUS s1 s2 s3 s4 s5 if sample == 1
estimates store lasso
estimate restore lasso
predict BUShat, postselection // out of sample forecasts

Last edited by Mike Kraft; 12 Sep 2020, 02:44. Reason: a typo in the title then* but could not edit
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#2

12 Sep 2020, 03:49

Out-of-sample simply means that the predictions correspond to observations outside the estimation sample.

gen sample=.
replace sample=1 if year<1996 // training sample
replace sample=2 if sample==. // evaluation sample
quietly lasso linear BUS s1 s2 s3 s4 s5 if sample == 1
estimates store lasso
estimate restore lasso
predict BUShat, postselection // out of sample forecasts

Here, predict will create predictions for all observations in the dataset. Part of the observations (year<1996) will be in-sample and the remainder (year>=1996) will be out-of-sample.
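The split between in-sample and out-of-sample predictions can be checked directly. The following is a minimal sketch, not the poster's actual setup: it assumes Stata's example dataset grunfeld in place of the original data, an arbitrary seed in rseed(), and a hypothetical indicator named insample.

```stata
* Sketch only: grunfeld stands in for the poster's data; rseed(12345) is an arbitrary seed
webuse grunfeld, clear
gen byte sample = cond(time < 6, 1, 2)      // 1 = training, 2 = evaluation
quietly lasso linear invest mvalue kstock if sample == 1, rseed(12345)
predict double investhat, postselection     // predictions for every observation
gen byte insample = e(sample)               // 1 if the observation was used in estimation
tabulate sample insample                    // evaluation rows should all have insample == 0
lassogof, over(sample) postselection        // fit statistics separately for each sample
```

Comparing the two rows reported by lassogof gives a rough idea of how much the fit deteriorates when moving from the training observations to the evaluation observations.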

Mike Kraft

Join Date: Dec 2014
Posts: 328

12 Sep 2020, 04:27

Thank you so much Andrew.

What about the following to have only out-of-sample predictions?!

Code:

gen sample=.
replace sample=1 if year<1996  // training sample
replace sample=2  if sample==.   // evaluation sample

quietly lasso linear BUS s1 s2 s3 s4 s5 if sample == 1
estimates store lasso

estimate restore lasso
predict BUShat, postselection // producing forecasts (BUShat before 1996 are in sample and BUShat from 1996 onwards are out of sample)
replace BUShat = . if sample == 1 // to keep only the out-of-sample forecasts


Andrew Musau

Join Date: Oct 2014

Posts: 10190
#4

12 Sep 2020, 04:53

Yes, that's fine. But you could also do it directly using predict.

Code:

predict BUShat if sample==2, postselection

Tip: It is better and more efficient to code a binary variable 0/1. Your code above can be reduced to the following 3 lines:

Code:

gen sample = year<1996
quietly lasso linear BUS s1 s2 s3 s4 s5 if sample
predict BUShat if !sample, postselection

Last edited by Andrew Musau; 12 Sep 2020, 05:00.
Mike Kraft

Join Date: Dec 2014

Posts: 328
#5

12 Sep 2020, 05:26

Thanks a lot.
I have the following comments:

1- Using if sample==2 works and so I do not need to replace prediction=. if sample==1.

However, I do not know why when I used your summarized code, the results did not look the same. i.e. the BUShat are different.

2- Most importantly for me, now I understand that sample 1 is used to train the model, and then predictions are made for sample 2. Does this exactly mean that all coefficients are obtained from the estimation of the model using sample 1 and then new forecasts are produced using new data for s1-s5 as they become available during sample 2 ? or that the model coefficients are also updated over time? I am sorry but I am still unsure.

3- As for the sample window, does this also mean that estimation is done recursively in the sense that more data are used each time the forecast is made beyond 2016? and if this is the case, how can I have a fixed window where I add one new observation and drop an older one each time a new forecast is made?

Thanks

Shadrack Muema

Join Date: Aug 2020
Posts: 5

12 Sep 2020, 05:27

Dear fellow Stata users,

Quick check: I have some unclean (AMR) data in this format:

LabNo	Test	results
201514	SENSITIVITY	clindamycin 3+,cephalexim 2+,Cyprofloxaxin3+.
201514	RESISTANCE	Tetracyclin,erythromycin,septrin,cloxacillin,Amoxy clav
203658	SENSITIVITY	NO PATHOGEN ISOLATED AT 37 0c for 48hrs
203819	Salmonella Typhi H	Positive
203918	Salmonella Typhi H	negative
204089	SENSITIVITY	levofloxacin 3+,ciprofloxacin 3+,genta 3+,Amoxicillin 3+,Norbactin 3+
204197	SENSITIVITY	LEVOFLOXACIN 3+ NORBACTIN 3+,GENTAMYCIN 3+,CIPROFLOXACIN 3+
204197	RESISTANCE	AMPICILLIN,AMOXYCILLIN

Now, I want to split the results column into one column per antibiotic named in it, with each new cell coded from the trailing grade (the last integer plus the + sign) in the current cell value:
where the Test column = "SENSITIVITY": 3+ = "S", +++ = "S", 2+ = "I", ++ = "I", + = "I"; and "R" where the Test column = "RESISTANCE".
Example: I expect to get the following output:

expected resultant output table

LabNo	Test	Value	clindamycin	cephalexim	Cyprofloxaxin	levofloxacin	NORBACTIN	GENTAMYCIN	AMPICILLIN	AMOXYCILLIN
201514	SENSITIVITY	clindamycin 3+,cephalexim 2+,Cyprofloxaxin3+.	S	I	S
201514	RESISTANCE	Tetracyclin,erythromycin,septrin,cloxacillin,Amoxy clav	R	R	R
203658	SENSITIVITY	NO PATHOGEN ISOLATED AT 37 0c for 48hrs
203819	Salmonella Typhi H	Positive
203918	Salmonella Typhi H	negative
204089	SENSITIVITY	levofloxacin 3+,ciprofloxacin 3+,genta 3+,Amoxicillin 3+,Norbactin 3+			S	S	S	S		S
204197	SENSITIVITY	LEVOFLOXACIN 3+ NORBACTIN 3+,GENTAMYCIN 3+,CIPROFLOXACIN 3+			S	S	S	S
204197	RESISTANCE	AMPICILLIN,AMOXYCILLIN							R	R

Any idea from somebody on how to go about this? please help a brother here.
Thanks.


Mike Kraft

Join Date: Dec 2014

Posts: 328
#7

12 Sep 2020, 05:30

Shadrack Muema, I think you probably need to create a post for that as it is not related to the question here. I can see that it is your first post to Stata, so welcome to the forum.
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#8

12 Sep 2020, 08:38

1- Using if sample==2 works and so I do not need to replace prediction=. if sample==1.

However, I do not know why when I used your summarized code, the results did not look the same. i.e. the BUShat are different.

I think your saved estimates are messing up the procedure. Add -estimates clear- before comparing the methods

Code:

estimates clear
webuse grunfeld, clear
gen sample=.
replace sample=1 if time<6 // training sample
replace sample=2 if sample==. // evaluation sample
quietly lasso linear invest mvalue kstock if sample == 1
estimates store lasso
estimate restore lasso
predict investhat, postselection // out of sample forecasts
replace investhat=. if sample==1
gen sample2= time<6
quietly lasso linear invest mvalue kstock if sample2
predict investhat2 if !sample2, postselection
assert investhat==investhat2
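A further possible source of differences between two otherwise identical runs, separate from the stored estimates mentioned above, is that lasso by default selects lambda by cross-validation with randomly assigned folds. The following is a minimal sketch, assuming the grunfeld example data and an arbitrary seed; the variable names train, hat1 and hat2 are illustrative only.

```stata
* Sketch only: the seed value 12345 is arbitrary
webuse grunfeld, clear
gen byte train = time < 6
quietly lasso linear invest mvalue kstock if train, rseed(12345)
predict double hat1 if !train, postselection
quietly lasso linear invest mvalue kstock if train, rseed(12345)
predict double hat2 if !train, postselection
assert hat1 == hat2      // same seed, same folds, same selected model
```

Fixing the seed makes the comparison between two ways of coding the sample depend only on the code, not on which folds happened to be drawn.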

2- Most importantly for me, now I understand that sample 1 is used to train the model, and then predictions are made for sample 2. Does this exactly mean that all coefficients are obtained from the estimation of the model using sample 1 and then new forecasts are produced using new data for s1-s5 as they become available during sample 2 ?

Yes.

3- As for the sample window, does this also mean that estimation is done recursively in the sense that more data are used each time the forecast is made beyond 2016? and if this is the case, how can I have a fixed window where I add one new observation and drop an older one each time a new forecast is made?

So far, your code is static. What you want can be programmed given that you define your windows.
Mike Kraft

Join Date: Dec 2014

Posts: 328
#9

12 Sep 2020, 09:10

You are super. Thanks Andrew.
As for 3, any ideas on how to adjust the code so that

a) Out of sample forecasts are produced using an increasing window with data in sample 1 being used as an initial window (i.e. rolling but increasing window)
or
b) Out of sample forecasts are produced using a fixed window in a rolling fashion (i.e. rolling but fixed window)

The sample starts from 1987.

Hope I get help with that

Andrew Musau

Join Date: Oct 2014
Posts: 10190

#10

12 Sep 2020, 11:31

Assume that my windows are each of length 1 year. So my initial sample at time 1 predicts the outcome at time 2 and the sample at time 2 predicts the outcome at time 3. All predictions in this case are therefore out-of-sample.

Code:

webuse grunfeld, clear
gen prediction=.
levelsof year, local(years)
forval i= 1/ `=wordcount("`years'")'{
    qui capture{
        quietly lasso linear invest mvalue kstock if year== `=word("`years'", `i')'  
        predict investhat`i' if year== `=word("`years'", `=`i'+1')', postselection 
        replace prediction= investhat`i' if !missing(investhat`i')
    }
}


Mike Kraft

Join Date: Dec 2014

Posts: 328
#11

13 Sep 2020, 06:01

Thanks a lot. Yes, having the sample at time 1 forecast the outcome at time 2, and the sample at time 2 forecast the outcome at time 3, is what I was asking for. But I have the following issue with the window in your code:

Does this mean that you use only a one-year window to make predictions?

I think I meant in #9 that in (a) the window should be increasing: for example, it starts with 100 years and then grows by one year each time a forecast is made. In (b) it is a fixed window of, say, 100 years that rolls forward, so we add one new year and drop the oldest year each time.

I am not sure which of these is consistent with your code in #10. If neither, I would appreciate it if the code could also be amended to do both (a) and (b) (as two separate pieces of code).
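The two schemes in (a) and (b) could be sketched along the following lines, adapting the one-year loop in #10. This is only an illustration under stated assumptions: the grunfeld example data stands in for the real data, the window length of 10 years is arbitrary, rseed() is set to an arbitrary value so the runs are repeatable, and the names pred_expand, pred_roll, p_a and p_b are hypothetical.

```stata
* Sketch only: grunfeld stands in for the real data; w and all variable names are illustrative
webuse grunfeld, clear
local w = 10                          // initial (a) / fixed (b) window length in years
summarize year, meanonly
local start = r(min) + `w'            // first year to be forecast
local last  = r(max)
gen double pred_expand = .
gen double pred_roll   = .
forvalues t = `start'/`last' {
    * (a) expanding window: estimate on every year before t
    quietly lasso linear invest mvalue kstock if year < `t', rseed(12345)
    quietly predict double p_a if year == `t', postselection
    * (b) fixed rolling window: estimate on the w years immediately before t
    quietly lasso linear invest mvalue kstock if inrange(year, `t' - `w', `t' - 1), rseed(12345)
    quietly predict double p_b if year == `t', postselection
    quietly replace pred_expand = p_a if year == `t'
    quietly replace pred_roll   = p_b if year == `t'
    drop p_a p_b
}
```

In both cases the model is re-estimated at every step, so the forecast for year t uses only data from before t. The two series coincide for the first forecast year and can differ afterwards, because (a) keeps the oldest years while (b) drops them.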

Thanks.
Mike Kraft

Join Date: Dec 2014

Posts: 328
#12

14 Sep 2020, 03:40

Dear All
I would appreciate it very much if someone could also help with my questions in post #11 regarding the windows. I am bringing this up again because it has not had a response in the last 24 hours.
Thanks