Dear All,
I posted earlier today but did not get a response, so I probably did not pose my question well. I will try again.
I want to run a program that calculates the mean squared prediction error (MSPE) for panels of varying lengths, but it is not running: it does nothing. I would greatly appreciate some help.
I have panel data at the individual (N)-week level, with 14 weeks (waves): 7 before and 7 after an intervention. A 10 percent sample, which is not balanced, looks as follows:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 npi int year float week int userTRA
"J338339LLR" 2014 4 0
"J338339LLR" 2014 6 0
"J338339LLR" 2014 7 0
"J33833J3J3" 2014 2 0
"J33833J99R" 2014 2 1
"J33833JOLJ" 2014 4 0
"J33833NF9L" 2014 5 0
"J33833R8F8" 2014 1 0
"J33833RLFF" 2014 7 0
"J33833RO8R" 2014 2 0
"J338383FRV" 2014 7 0
"J338383R89" 2014 2 0
"J33838FR9R" 2014 6 0
"J33838LJ8R" 2014 3 0
"J33838LVFO" 2014 6 1
"J33838RFNL" 2014 1 1
"J338393FOR" 2014 7 0
"J338398J88" 2014 6 0
"J3383998JF" 2014 4 0
"J338399JRF" 2014 5 0
"J338399N33" 2014 2 1
"J338399V3R" 2014 7 0
"J33839F99O" 2014 6 1
"J33839F99O" 2014 7 0
"J33839FNL3" 2014 5 0
"J33839JFRV" 2014 6 0
"J33839JLRL" 2014 5 0
"J33839NNOL" 2014 6 0
"J33839O383" 2014 4 0
"J33839O8R8" 2014 2 0
"J33839OR8R" 2014 6 2
"J3383F33NJ" 2014 2 0
"J3383F38JN" 2014 4 0
"J3383F988V" 2014 2 0
"J3383FN3VR" 2014 2 1
"J3383FNFNL" 2014 1 0
"J3383FNFNL" 2014 5 0
"J3383FR8L9" 2014 2 0
"J3383FROOF" 2014 2 0
"J3383FVRVO" 2014 5 0
"J3383J3983" 2014 1 0
"J3383J3JV8" 2014 3 1
"J3383J88FO" 2014 3 0
"J3383J8RJV" 2014 3 0
"J3383J8RJV" 2014 4 0
"J3383J8VFV" 2014 5 0
"J3383JONVF" 2014 5 0
"J3383JRLJ8" 2014 2 1
"J3383JRLJ8" 2014 7 1
"J3383L3VJV" 2014 7 0
"J3383L88NV" 2014 5 0
"J3383LF888" 2014 1 0
"J3383LFR3J" 2014 7 0
"J3383LJJFO" 2014 2 0
"J3383LL9RN" 2014 6 1
"J3383LLN8N" 2014 5 0
"J3383LLVFJ" 2014 1 0
"J3383LLVFJ" 2014 5 0
"J3383LRVO8" 2014 2 0
"J3383LVFOR" 2014 7 0
"J3383LVN93" 2014 3 0
"J3383N83R8" 2014 5 0
"J3383N888L" 2014 2 0
"J3383N9LFJ" 2014 5 0
"J3383NL93R" 2014 2 0
"J3383NLV8O" 2014 7 1
"J3383NNJFV" 2014 1 0
"J3383NNJFV" 2014 6 1
"J3383NO3RF" 2014 6 0
"J3383NVJNJ" 2014 4 0
"J3383O3LVV" 2014 7 0
"J3383OFJJL" 2014 5 0
"J3383ON8LO" 2014 3 1
"J3383OOO9F" 2014 1 2
"J3383OORLN" 2014 1 0
"J3383ORLLO" 2014 2 0
"J3383OVN8F" 2014 4 0
"J3383OVRF3" 2014 6 0
"J3383R3F9N" 2014 2 0
"J3383RF9O9" 2014 6 0
"J3383RF9O9" 2014 7 0
"J3383RNV9R" 2014 1 0
"J3383ROLLV" 2014 6 0
"J3383V3OOO" 2014 4 0
"J3383V3OOO" 2014 7 0
"J3383V83FL" 2014 6 0
"J3383V83N9" 2014 7 0
"J3383V9J89" 2014 3 0
"J3383VL398" 2014 1 1
"J3383VL398" 2014 2 1
"J338F8FJFV" 2014 4 0
"J338FLNVF3" 2014 5 0
"J338FR9RLL" 2014 3 0
"J338FRR3LF" 2014 6 0
"J338FRV38F" 2014 5 0
"J338J33FOV" 2014 3 0
"J338J33RNO" 2014 5 1
"J338J38JR3" 2014 5 1
"J338J398ON" 2014 1 1
"J338J39OV3" 2014 5 0
end
For prediction, I want to leave out one individual at a time (dropping all waves of that individual) and then use the estimates from the remaining sample to predict the outcome for the individual who was left out. I repeat this one by one for each individual in the panel.
Then I add up the prediction errors for all individuals and store the result in a matrix. Next, I want to repeat this exercise for different panel lengths. So I redo the iterative exercise, leaving out one observation each time and calculating its prediction error from the estimates for the remaining observations, for pre-intervention panels of 7, 6, 5, 4, 3 and 2 weeks.
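The procedure described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not a tested solution: the program name loo_mspe and its argument order are hypothetical, it swaps in official xtreg, fe in place of reghdfe, it assumes the data are already xtset on a numeric identifier, and it takes "a panel of h weeks" to mean weeks 1 to h. The fixed effect of the left-out individual cannot be estimated from the remaining sample, so the prediction here uses only the constant and the slope.
Code:
capture program drop loo_mspe
program define loo_mspe
    * hypothetical arguments: outcome, regressor, numeric panel id, time variable
    args y x id t
    tempvar e2 yhat
    quietly levelsof `id', local(ids)
    * one row per panel length (2 to 7 weeks): length, MSPE
    matrix M = J(6, 2, .)
    local row = 0
    forvalues h = 2/7 {
        local ++row
        matrix M[`row', 1] = `h'
        quietly gen double `e2' = .
        foreach i of local ids {
            * fit without individual i, using weeks 1 to h (assumed window)
            capture quietly xtreg `y' `x' if `id' != `i' & `t' <= `h', fe
            * skip this individual if the model could not be estimated
            if _rc continue
            * linear prediction (constant + slope) for all observations
            quietly predict double `yhat', xb
            * squared prediction error for the left-out individual only
            quietly replace `e2' = (`y' - `yhat')^2 if `id' == `i' & `t' <= `h'
            drop `yhat'
        }
        * mean squared prediction error for this panel length
        quietly summarize `e2'
        matrix M[`row', 2] = r(mean)
        drop `e2'
    }
    matrix colnames M = weeks mspe
    matrix list M
end
With the variables created in this post, it would be called as loo_mspe outcome x N week after xtset N week. The model is refitted once per individual and panel length, so this may be slow on the full data.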
The idea is to choose the panel length that minimizes the MSPE over the 7-week period prior to the intervention. The program is as follows:
Code:
/*NOTES: cllr_crossval
The goal is to estimate the bandwidth that minimizes the IMSE of a local linear regression.
A grid search is used and estimation is based on the cllr program described above.
Arguments
outcome: a stata variable containing the dependent variable
x: a stata variable containing the independent variable
start: a hardcoded number or local variable defining start of a sequence candidate bandwidths
step: a hardcoded number or local variable defining the stepsize of the sequence of candidate bandwidth
stop: a hardcoded number or local variable defining the end of a sequence of candidate bandwidths.
sub: a stata variable set to 1 if the observation should be included in the analysis
Returns
A stata matrix and set of stata variables that contain the estimated IMSE for each candidate bandwidth.
*/
sort npi
gen N=_n if npi[_n]~=npi[_n-1]
bysort npi: egen maxN=max(N)
replace N=maxN if N==.
bysort N week: gen counter=_n
drop if counter>1
xtset N week
gen outcome = userTRA
gen x = week
capture program drop cllr_crossval
program define cllr_crossval
    set more off
    args outcome x start step stop sub narrowsub
    tempvar cx ew e2 e2n
    local stop = 7
    local start = 1
    local step = 1
    *make a matrix to store the estimated IMSE
    local size = ((`stop' - `start')/`step')+1
    matrix M = J(`size', 3, .)
    *Iterate over candidate bandwidths
    local count = 0
    forvalues h = `start'(`step')`stop'{
        *increment counter
        local count = `count' + 1
        *store location on the bandwidth grid
        matrix M[`count', 1] = `h'
        *initialize the residual variable
        gen `e2' = .
        *Iterate over observations
        forvalues i = 1(1)`N'{
            capture quietly reghdfe /*regress*/ `outcome' `x' if _n~=`i' & week=<`h', absorb(npi)
            replace `e2' = (`outcome' - _b[_cons])^2 in `i'
        }
        *compute IMSE for the candidate bandwidth
        su `e2'
        matrix M[`count',2] = r(mean)
        drop `e2'
    }
    matrix list M
    svmat M
end
Sincerely,
Sumedha.
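A note on why nothing appears to happen, offered as a likely explanation rather than a definitive diagnosis. The block between program define and end only stores the program. Stata executes it only when the program name is typed afterwards, and the code above never calls cllr_crossval. Once it is called, a few other points will probably surface: the local macro `N' is never defined inside the program (the variable N is not a macro), so the inner forvalues line has no upper limit; the comparison week=<`h' should be week<=`h', and the capture prefix silences the resulting error; and _n~=`i' leaves out a single observation rather than all waves of one individual. A minimal way to see where it stops, assuming the definition above has already been run:
Code:
* run the definition above first, then call the program by name
* trace echoes each line as it executes, which helps locate the failing line
set trace on
set tracedepth 1
cllr_crossval outcome x
* if the call stops with an error in a do-file, type this line manually
set trace off
Temporarily removing capture quietly from the regression line should also make the estimation error visible instead of hiding it.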
