Generating current quarter forecasts using only past data when the number of variables is very large

Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#16

06 Oct 2020, 11:23

Thanks for the -dataex-. I think your post crossed with my last update to #14. Try again setting the random number seed as suggested in #14 where it says "ADDED:...". I'm pretty sure that's the solution.
Comment
Lisa Wilson

Join Date: Aug 2016

Posts: 158
#17

13 Oct 2020, 04:21

Dear Professor Clyde Schechter

Thanks a lot for your help. I took extra time to check everything before responding.

1- I have been working through this more carefully, and I think some of the code I produced was incorrect. I explain below and show how I think it should be applied.

[QUOTE]
You would create a new sequential time variable: rather than week starting over at 1 in each new quarter, it would just keep counting up. Then you would do a single loop over that sequential week variable. And you would remove the -if myweek == `w'- parts of your syntax.
[/QUOTE]

I thought again about how I applied this in post #13, and I now think I probably did not do it in the most accurate way. The reason is that each week's observations are linked to the same quarterly rgdp_first. So when the rolling window moves forward by one week, the quarterly variable may not change (it changes only after 12 weeks). I therefore think the most accurate way to use all past information is code similar to yours in #9, with some small changes:
1- sort the data on quarter_date and myWEEK first
2- estimate the rolling regression by quarter, without conditioning on myWEEK. This means the estimation uses all past information. There is no need for seq_wk at all; I kept this variable in the code but did not use it.
3- after getting the slope coefficients, generate the linear predictions, conditioning on quarter_date and myWEEK, so that a prediction is produced for each week of each quarter (with slopes estimated from all past information).

The correct code should be as follows:

Code:

sort quarter_date myWEEK

gen seq_wk = _n // I do not actually need this variable now
sum seq_wk      // this shows me from 1 to 1846

frame put rgdp_first var1-var30 quarter_date myWEEK seq_wk, into(forecasts)
frame forecasts {
    replace quarter_date = quarter_date - 1
    gen prediction = .
}
foreach w of numlist 1/12 {
    forvalues QTR = 188/`=quarter_date[_N]-1' {
        lasso linear rgdp_first var1-var30 if inrange(quarter_date, 88, `QTR')
        matrix b = e(b_postselection)
        frame forecasts {
            matrix score yhat = b
            replace prediction = yhat if quarter_date == `QTR' & myWEEK == `w'
            drop yhat
        }
    }
}

frame forecasts: replace quarter_date = quarter_date + 1
frlink 1:1 quarter_date myWEEK, frame(forecasts)
frget prediction, from(forecasts)

2- As for your random number point here:

Added: Looking further at your code, I see a few more things such as haphazardly going back and forth between prediction and myprediction as variable names. But once all of those things are fixed, I notice that you did not set the random number seed before the loop in either program. -lasso- is not a deterministic program: it draws random subsets. So the results will not be the same even just running the same code twice. Try setting the random number seed before the loop in both versions of the code (and set it to the same number in both versions!) and I think you will get the same results from both.

Do you mean that I have to set the random number seed in all of my code (e.g. the rolling regression code in post 39), or does this apply only to the sample partition case (i.e. sample 1 and sample 2 as in post #11)?
I am a bit unsure about that. If so, could you kindly show me how to adjust the code you refer to in line with your suggestion?

I do look forward to hearing from you
Lisa
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#18

13 Oct 2020, 09:33

You need to set the random number seed once, and that must happen before you do anything non-deterministic. Your sub-sample selection in #11 was deterministic. (A critic might argue that validation and training samples should be selected randomly, in which case you would need to set the random number seed before reaching that point in the code. But that isn't what you did there.)

No, the issue is that -lasso- is non-deterministic. So at some point before any loop within which -lasso- is embedded, you need to set the random number seed. -set seed 12345- (or use any number you like) is all that is required.
Comment

Lisa Wilson

Join Date: Aug 2016
Posts: 158

#19

13 Oct 2020, 21:09

Thanks a lot.

1- I have added -rseed- option to the lasso inside the loop given that the rolling routine estimates several lasso models. I hope this is appropriate ?!

2- I hope this code is correct now. The way I interpret it is that the model will be estimated for the first window from quarter_date 88 to quarter_date 188 using all weekly data and not conditional on the week of the quarter, and the parameters will be estimated. Then these parameters will be multiplied by the var1-var30 data in each week of the subsequent quarter to make a forecast of the next quarter's rgdp_first. The seed option will make the results reproducible. The final code that does exactly this is below:

Code:

sort  quarter_date myWEEK 

gen seq_wk= _n // I do not actually need this variable now
sum seq_wk   // this shows me from 1 to 1846

frame put rgdp_first var1-var30 quarter_date myWEEK seq_wk, into(forecasts) 
frame forecasts {
    replace quarter_date = quarter_date - 1
    gen prediction = .
}
foreach w of numlist 1/12 {
    forvalues QTR = 188/`=quarter_date[_N]-1' {
        lasso linear rgdp_first var1-var30 if inrange(quarter_date, 88, `QTR'), rseed(1234)
        matrix b = e(b_postselection)
        frame forecasts {
            matrix score yhat = b
            replace prediction = yhat if quarter_date == `QTR' & myWEEK == `w'
            drop yhat
        }
    }
}

frame forecasts: replace quarter_date = quarter_date + 1
frlink 1:1 quarter_date myWEEK, frame(forecasts)
frget prediction, from(forecasts)

3- To do exactly the same thing in 2 with the only difference that the estimation of the parameters is conditional on the week of the previous quarters (i.e. estimation based on the first week of all previous quarters, then the second weeks of all previous quarters, etc. which was my original question in this thread), the code is (after adding the seed option):

Code:

sum quarter_date if myWEEK==1  // this shows me 88 to 241

frame put rgdp_first var1-var30 quarter_date myWEEK, into(forecasts) 
frame forecasts {
    replace quarter_date = quarter_date - 1
    gen prediction = .
}

forvalues QTR = 188/`=quarter_date[_N]-1' {
    foreach w of numlist 1/12 {
        lasso linear rgdp_first var1-var30 if inrange(quarter_date, 88, `QTR') & myWEEK == `w', rseed(1234)
        matrix b = e(b_postselection)
        frame forecasts {
            matrix score yhat = b
            replace prediction = yhat if quarter_date == `QTR' & myWEEK == `w'
            drop yhat
        }
    }
}

frame forecasts: replace quarter_date = quarter_date + 1
frlink 1:1 quarter_date myWEEK, frame(forecasts)
frget prediction, from(forecasts)

4- Taking your advice on the SUBsample approach and the seed, I now have the following code:

A- The two-sample partitions based on data in each week-based dataset:

Code:

sum quarter_date if myWEEK==1  // this shows a range of 88 to 241, i.e. 1981q2 up to 2020q2
* Create two subsamples: the model is fit on SUBsample 1 and the forecasts are generated for SUBsample 2
gen SUBsample=.
replace SUBsample=1 if quarter_date<188  // a sample to fit the model
replace SUBsample=2  if SUBsample==.   // a sample to evaluate the model (out of sample)

gen prediction=.

levelsof myWEEK, local(levels)

foreach x of local levels {
    lasso linear rgdp_first var1-var30 if SUBsample == 1 & myWEEK == `x', rseed(1234)
    predict temp if SUBsample == 2 & myWEEK == `x', postselection
    replace prediction = temp if myWEEK == `x'
    drop temp
}

B- And based on all past data regardless of the week; but then the forecasts are made conditional on the week:

Code:

**I try here to estimate the parameters from all weeks in the previous quarters in SUBsample 1
** but then when I make forecasts for SUBsample 2 (out of sample), I make them conditional on the week (i.e. to multiply by each week's variables)

sort  quarter_date myWEEK // this sort is important 

gen SUBsample=.
replace SUBsample=1 if quarter_date<188  // a sample to fit the model
replace SUBsample=2  if SUBsample==.   // a sample to evaluate the model (out of sample)

gen prediction=.

levelsof myWEEK, local(levels)

foreach x of local levels {
    lasso linear rgdp_first var1-var30 if SUBsample == 1, rseed(1234)
    predict temp if SUBsample == 2 & myWEEK == `x', postselection
    replace prediction = temp if myWEEK == `x'
    drop temp
}

If these are appropriate, then I now understand how to make the corresponding adjustments for the fixed rolling window as well. Thank you for the main setup that made all of this possible!

I look forward to your final confirmation now 🤞

Thanks

Lisa

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#20

14 Oct 2020, 11:36

1- I have added -rseed- option to the lasso inside the loop given that the rolling routine estimates several lasso models. I hope this is appropriate ?!

NO, IT IS NOT APPROPRIATE!!!!! I specifically said in my earlier posts that it had to be outside all the loops containing -lasso-. Why did you ignore that? By using the -rseed- option inside the loop, you are resetting the random number generator to the same starting point for every -lasso-, which means you are not getting independent estimates from each lasso. This is statistically wrong. The random number seed needs to be set before any loops that enclose the -lasso- command to assure independent analyses for each run of -lasso-. You have gone from too much indeterminacy (before the random number seed was set at all) to too little.

The rest looks OK offhand.
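
A minimal sketch of the seed-placement point above, using -runiform()- as a stand-in for the random draws that -lasso- makes internally. The stand-in is an assumption made purely for illustration; -lasso- itself is not run here, and the seed value is arbitrary.

```stata
* Seed set once, before the loop: each pass continues the same random stream,
* so the three draws differ from one another but are reproducible across runs.
set seed 12345
forvalues i = 1/3 {
    display "seed outside loop, pass `i': " runiform()
}

* Seed reset inside the loop: every pass restarts at the same point,
* so the three draws are identical to one another.
forvalues i = 1/3 {
    set seed 12345
    display "seed inside loop, pass `i': " runiform()
}
```

Running this twice should give the same output both times. Within a run, only the first loop should show three different numbers, which is the behavior wanted when several -lasso- fits sit inside a loop.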
Comment

Lisa Wilson

Join Date: Aug 2016
Posts: 158

#21

15 Oct 2020, 06:15

I am really sorry for misinterpreting that. I read "before any loop" as "before any estimation inside the loop".

I have now simply used -set seed 12345- outside the loops. Here are two examples of what the code now looks like:

Code:

sort  quarter_date myWEEK // this sort is important so I can get the new sequential week variable correctly. 

gen seq_wk= _n

sum seq_wk   // this shows me from 1 to 1846

set seed 12345 // this is done for the results to be reproducible 

frame put rgdp_first var1-var30 quarter_date myWEEK seq_wk, into(forecasts) 
frame forecasts {
    replace quarter_date = quarter_date - 1
    gen prediction = .
}
foreach w of numlist 1/12 {
    forvalues QTR = 188/`=quarter_date[_N]-1' {
        lasso linear rgdp_first var1-var30 if inrange(quarter_date, 88, `QTR')
        matrix b = e(b_postselection)
        frame forecasts {
            matrix score yhat = b
            replace prediction = yhat if quarter_date == `QTR' & myWEEK == `w'
            drop yhat
        }
    }
}

frame forecasts: replace quarter_date = quarter_date + 1
frlink 1:1 quarter_date myWEEK, frame(forecasts)
frget prediction, from(forecasts)

And here:

Code:

sort  quarter_date myWEEK // this sort is important 


gen SUBsample=.
replace SUBsample=1 if quarter_date<188  // a sample to fit the model
replace SUBsample=2  if SUBsample==.   // a sample to evaluate the model (out of sample)

gen prediction=.

set seed 12345 // this is done for the results to be reproducible 

levelsof myWEEK, local(levels)

foreach x of local levels {
    lasso linear rgdp_first var1-var30 if SUBsample == 1
    estimates store LASSO
    predict temp if SUBsample == 2 & myWEEK == `x', postselection
    replace prediction = temp if myWEEK == `x'
    drop temp
}

Hopefully I have it right now.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#22

15 Oct 2020, 13:03

Looks good.
Comment
