Dear all,
I am working on counts data and my study is to compare performance of several models that deal with such data (e.g., Poisson, NB, Hurdle, Zero-inflated models). The study's dependent variable is out-patient visits (visit_op) which is overdispersed and the independent vars are age, age square, sex, area, marital status, employment status, education, social health insurance, and log of household income.
Based on in-sample tests (mpe, rmse, and mape) and information criteria tests (log likelihood, AIC, and BIC), hurdle NB2 is the best model. Next, I utilize K-fold cross-validation to test those models' performance (out of sample test). In this out of sample test, the data is split into 10 parts (almost identical one another), among which 9 parts are used as training part and the last one is reserved as validation one. My coding procedures are as:
Data example
Hurdle NB2
NB2
Poisson
Results show that the Poisson and NB2 models did better than Hurdle NB2. Results of MPE, RMSE, and MAPE of the Poisson and NB2 are almost identical, though the data is overdispersed. Therefore, I am wondering if either my coding procedures or my approach of out-of-sample tests are incorrect? Any advice is highly appreciated.
Thank you.
Best regards,
DL
I am working on counts data and my study is to compare performance of several models that deal with such data (e.g., Poisson, NB, Hurdle, Zero-inflated models). The study's dependent variable is out-patient visits (visit_op) which is overdispersed and the independent vars are age, age square, sex, area, marital status, employment status, education, social health insurance, and log of household income.
Based on in-sample tests (mpe, rmse, and mape) and information criteria tests (log likelihood, AIC, and BIC), hurdle NB2 is the best model. Next, I utilize K-fold cross-validation to test those models' performance (out of sample test). In this out of sample test, the data is split into 10 parts (almost identical one another), among which 9 parts are used as training part and the last one is reserved as validation one. My coding procedures are as:
Data example
Code:
Code:* Example generated by -dataex-. To install: ssc install dataex clear input float(visit_op biny pid06 age agesqr) byte sex float area byte(mstt insur work) float(edu lnhhinc) 1 1 8 69 4761 0 1 0 1 0 9 10.09823 12 1 133 61 3721 1 1 0 1 0 12 10.740648 1 1 150 75 5625 0 0 0 1 0 2 9.877502 4 1 164 62 3844 0 0 0 0 1 0 10.462332 10 1 166 67 4489 0 0 0 0 0 5 8.547916 1 1 174 63 3969 0 1 0 1 0 12 10.165852 3 1 180 77 5929 1 0 1 0 0 0 8.318742 10 1 188 64 4096 0 0 0 1 0 12 11.261897 2 1 27 86 7396 1 1 1 0 0 0 10.515966 3 1 31 74 5476 0 1 0 1 0 12 10.217276 3 1 21 66 4356 0 1 0 1 1 12 10.959888 1 1 51 75 5625 0 1 0 1 0 9 9.517825 1 1 57 68 4624 1 1 1 0 1 5 10.910277 13 1 49 63 3969 1 1 0 1 0 9 10.299171 15 1 61 63 3969 0 1 0 1 0 12 11.48329 0 0 64 64 4096 1 1 0 1 0 5 11.15545 1 1 69 68 4624 0 1 0 1 0 12 11.34687 1 1 59 69 4761 0 1 0 1 0 12 11.715866 2 1 92 76 5776 1 1 0 1 0 10 10.09823 20 1 96 64 4096 1 1 1 1 0 4 11.252858 2 1 85 66 4356 0 1 0 1 0 12 11.46772 1 1 88 80 6400 1 1 1 1 0 3 10.113749 1 1 114 61 3721 1 1 1 0 0 7 11.626254 2 1 117 82 6724 1 1 0 1 0 12 10.27505 1 1 102 84 7056 1 1 1 1 0 7 11.357675 5 1 118 71 5041 0 1 1 1 1 12 12.115185 1 1 122 83 6889 1 1 0 1 0 5 10.599132 1 1 106 62 3844 1 1 0 1 0 12 10.767643 3 1 288 68 4624 0 0 1 1 0 7 9.700943 10 1 309 78 6084 1 0 0 0 0 1 10.234624 end label values sex sex label def sex 0 "Male", modify label def sex 1 "Female", modify label values area area label def area 0 "Rural", modify label def area 1 "urban", modify label values mstt mstt label def mstt 0 "Married", modify label def mstt 1 "Single", modify label values insur insur label def insur 0 "No", modify label def insur 1 "Yes", modify label values work work label def work 0 "No", modify label def work 1 "Yes", modify
Code:
global v=10 xtile idq = pid06, nq($v) *** Hurdle NB2 gen p1 = 0 gen yfv = 0 forvalues i=1/$v { qui { logit biny $X if idq~=`i', vce(robust) predict plogit replace p1 = plogit drop plogit } qui { tnbreg visit_op $X if visit_op>0 & idq~=`i', vce(robust) predict p2 gen hnb2 = p1*p2 replace yfv = hnb2 if idq==`i' gen pe = y - yfv sum pe sca hnb2mpe = r(mean) gen spe = (y-yfv)^2 sum spe sca hnb2rmse = sqrt(r(mean)) gen ape = abs(y-yfv) sum ape sca hnb2mape = r(mean) drop p2 hnb2 pe spe ape } } drop p1 yfv
Code:
gen yfv = 0 forvalues i=1/$v { qui { nbreg visit_op $X if idq~=`i', vce(robust) predict yhatnb2 replace yfv = yhatnb2 if idq==`i' gen pe = y - yfv sum pe sca nb2mpe = r(mean) gen spe = (y-yfv)^2 sum spe sca nb2rmse = sqrt(r(mean)) gen ape = abs(y-yfv) sum ape sca nb2mape = r(mean) drop yhatnb2 pe spe ape } } drop yfv
Code:
*** Poisson gen yfv = 0 forvalues i=1/$v { qui { poisson visit_op $X if idq~=`i', vce(robust) predict pyhat replace yfv = pyhat if idq==`i' gen pe = y - yfv sum pe sca pmpe = r(mean) gen spe = (y-yfv)^2 sum spe sca prmse = sqrt(r(mean)) gen ape = abs(y-yfv) sum ape sca pmape = r(mean) drop pyhat pe spe ape } } drop yfv
Thank you.
Best regards,
DL
Comment