
  • Out-of-sample cross-validation for count data, am I correct?

    Dear all,

    I am working with count data, and my study compares the performance of several models that handle such data (e.g., Poisson, NB, hurdle, and zero-inflated models). The dependent variable is out-patient visits (visit_op), which is overdispersed; the independent variables are age, age squared, sex, area, marital status, employment status, education, social health insurance, and the log of household income.

    Based on in-sample tests (MPE, RMSE, and MAPE) and information criteria (log likelihood, AIC, and BIC), hurdle NB2 is the best model. Next, I use K-fold cross-validation to test the models' out-of-sample performance. The data are split into 10 nearly equal parts; in each round, 9 parts are used for training and the remaining part is held out for validation. My code is as follows:
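    As a minimal illustration (the variable names here are only illustrative), the 10 folds could also be formed by random assignment, rather than from deciles of pid06 as in the code below:

    Code:
    * Illustrative randomized assignment of observations to 10 folds
    set seed 12345
    gen u = runiform()
    xtile idq_rand = u, nq(10)    // idq_rand = 1,...,10 marks each observation's fold
    drop u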

    Data example
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(visit_op biny pid06 age agesqr) byte sex float area byte(mstt insur work) float(edu lnhhinc)
     1 1   8 69 4761 0 1 0 1 0  9  10.09823
    12 1 133 61 3721 1 1 0 1 0 12 10.740648
     1 1 150 75 5625 0 0 0 1 0  2  9.877502
     4 1 164 62 3844 0 0 0 0 1  0 10.462332
    10 1 166 67 4489 0 0 0 0 0  5  8.547916
     1 1 174 63 3969 0 1 0 1 0 12 10.165852
     3 1 180 77 5929 1 0 1 0 0  0  8.318742
    10 1 188 64 4096 0 0 0 1 0 12 11.261897
     2 1  27 86 7396 1 1 1 0 0  0 10.515966
     3 1  31 74 5476 0 1 0 1 0 12 10.217276
     3 1  21 66 4356 0 1 0 1 1 12 10.959888
     1 1  51 75 5625 0 1 0 1 0  9  9.517825
     1 1  57 68 4624 1 1 1 0 1  5 10.910277
    13 1  49 63 3969 1 1 0 1 0  9 10.299171
    15 1  61 63 3969 0 1 0 1 0 12  11.48329
     0 0  64 64 4096 1 1 0 1 0  5  11.15545
     1 1  69 68 4624 0 1 0 1 0 12  11.34687
     1 1  59 69 4761 0 1 0 1 0 12 11.715866
     2 1  92 76 5776 1 1 0 1 0 10  10.09823
    20 1  96 64 4096 1 1 1 1 0  4 11.252858
     2 1  85 66 4356 0 1 0 1 0 12  11.46772
     1 1  88 80 6400 1 1 1 1 0  3 10.113749
     1 1 114 61 3721 1 1 1 0 0  7 11.626254
     2 1 117 82 6724 1 1 0 1 0 12  10.27505
     1 1 102 84 7056 1 1 1 1 0  7 11.357675
     5 1 118 71 5041 0 1 1 1 1 12 12.115185
     1 1 122 83 6889 1 1 0 1 0  5 10.599132
     1 1 106 62 3844 1 1 0 1 0 12 10.767643
     3 1 288 68 4624 0 0 1 1 0  7  9.700943
    10 1 309 78 6084 1 0 0 0 0  1 10.234624
    end
    label values sex sex
    label def sex 0 "Male", modify
    label def sex 1 "Female", modify
    label values area area
    label def area 0 "Rural", modify
    label def area 1 "urban", modify
    label values mstt mstt
    label def mstt 0 "Married", modify
    label def mstt 1 "Single", modify
    label values insur insur
    label def insur 0 "No", modify
    label def insur 1 "Yes", modify
    label values work work
    label def work 0 "No", modify
    label def work 1 "Yes", modify
    Hurdle NB2
    Code:
    global v = 10                           // number of cross-validation folds
    * Assumed regressor list, taken from the variables described above
    global X age agesqr sex area mstt insur work edu lnhhinc
    xtile idq = pid06, nq($v)               // fold indicator: deciles of pid06

    *** Hurdle NB2
    gen p1  = 0
    gen yfv = 0

    forvalues i = 1/$v {
       qui {
          * Part 1: probability of any visit, fitted on the 9 training folds
          logit biny $X if idq != `i', vce(robust)
          predict plogit
          replace p1 = plogit
          drop plogit
       }
       qui {
          * Part 2: positive counts, fitted on the 9 training folds
          tnbreg visit_op $X if visit_op > 0 & idq != `i', vce(robust)
          predict p2                        // default prediction (number of events)
          gen hnb2 = p1*p2
          replace yfv = hnb2 if idq == `i'  // keep the prediction only for the held-out fold

          * Fit statistics; only the final pass, when yfv is filled for every
          * fold, gives the full out-of-sample results
          gen pe = visit_op - yfv
          sum pe
          sca hnb2mpe = r(mean)

          gen spe = (visit_op - yfv)^2
          sum spe
          sca hnb2rmse = sqrt(r(mean))

          gen ape = abs(visit_op - yfv)
          sum ape
          sca hnb2mape = r(mean)

          drop p2 hnb2 pe spe ape
       }
    }

    drop p1 yfv
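    One caveat on the block above: after tnbreg, plain predict returns the untruncated predicted count, so if the hurdle prediction is meant to be Pr(y>0)*E(y|y>0), the truncated conditional mean may be what is needed. A minimal sketch (assuming the cm statistic is available after tnbreg in your Stata version), shown on the full sample for brevity:

    Code:
    * Sketch: hurdle-NB2 mean built from Pr(y>0) and the truncated conditional mean
    qui logit biny $X, vce(robust)
    predict pr_pos, pr                          // Pr(visit_op > 0)
    qui tnbreg visit_op $X if visit_op > 0, vce(robust)
    predict mu_pos, cm                          // E(visit_op | visit_op > 0)
    gen hnb2_mean = pr_pos*mu_pos               // hurdle prediction of E(visit_op)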
    NB2
    Code:
    gen yfv = 0

    forvalues i = 1/$v {
       qui {
          nbreg visit_op $X if idq != `i', vce(robust)
          predict yhatnb2
          replace yfv = yhatnb2 if idq == `i'   // held-out-fold prediction only

          gen pe = visit_op - yfv
          sum pe
          sca nb2mpe = r(mean)

          gen spe = (visit_op - yfv)^2
          sum spe
          sca nb2rmse = sqrt(r(mean))

          gen ape = abs(visit_op - yfv)
          sum ape
          sca nb2mape = r(mean)

          drop yhatnb2 pe spe ape
       }
    }

    drop yfv
    Poisson
    Code:
    *** Poisson
    gen yfv = 0

    forvalues i = 1/$v {
       qui {
          poisson visit_op $X if idq != `i', vce(robust)
          predict pyhat
          replace yfv = pyhat if idq == `i'     // held-out-fold prediction only

          gen pe = visit_op - yfv
          sum pe
          sca pmpe = r(mean)

          gen spe = (visit_op - yfv)^2
          sum spe
          sca prmse = sqrt(r(mean))

          gen ape = abs(visit_op - yfv)
          sum ape
          sca pmape = r(mean)

          drop pyhat pe spe ape
       }
    }

    drop yfv
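    To compare the three models, the scalars saved above can be collected into a single table, for example:

    Code:
    * Collect the saved out-of-sample statistics into one matrix
    matrix cvstats = (hnb2mpe, hnb2rmse, hnb2mape \ ///
                      nb2mpe,  nb2rmse,  nb2mape  \ ///
                      pmpe,    prmse,    pmape)
    matrix rownames cvstats = HurdleNB2 NB2 Poisson
    matrix colnames cvstats = MPE RMSE MAPE
    matrix list cvstats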
    The results show that the Poisson and NB2 models did better than the hurdle NB2. The MPE, RMSE, and MAPE of the Poisson and NB2 models are almost identical, even though the data are overdispersed. I am therefore wondering whether my coding procedure or my approach to the out-of-sample test is incorrect. Any advice is highly appreciated.

    Thank you.

    Best regards,

    DL

  • #2
    You didn't get a quick response. You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. Being able to replicate your results is extremely helpful in diagnosing data-related problems.

    You've provided a lot of code. But you then ask if we can somehow check the code without running it. This is extremely hard to do. If you have a more manageable problem, try posting that.



    • #3
      Dear Dung Le

      From what I can understand from your code (I confess I did not spend a lot of time on it), you are using the cross-validation to compare how well the fitted conditional expectations predict the data. That is, you are focusing on the conditional expectation. If this is correct, the existence of overdispersion is almost irrelevant, and more complex models may overfit the estimation data and therefore be bad at predicting. In short, I am not at all surprised that the simplest model (Poisson) does well in this exercise. Now, if you want to compute probabilities of events, say, the probability of zero visits, the situation would be very different, but your cross-validation does not appear to be designed to study that.
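      For instance, a rough sketch of that kind of check (reusing the fold variable idq and the regressor list $X from your code; the scoring variables here are only illustrative) could compare the out-of-sample probability of zero visits with a Brier score:

      Code:
      * Cross-validated Pr(visit_op = 0) under Poisson and NB2, scored by a Brier score
      gen p0_pois = .
      gen p0_nb   = .
      forvalues i = 1/10 {
          qui poisson visit_op $X if idq != `i', vce(robust)
          qui predict tmp if idq == `i', pr(0)       // Pr(y = 0) under Poisson
          qui replace p0_pois = tmp if idq == `i'
          qui drop tmp
          qui nbreg visit_op $X if idq != `i', vce(robust)
          qui predict tmp if idq == `i', pr(0)       // Pr(y = 0) under NB2
          qui replace p0_nb = tmp if idq == `i'
          qui drop tmp
      }
      gen zero       = (visit_op == 0)
      gen brier_pois = (zero - p0_pois)^2
      gen brier_nb   = (zero - p0_nb)^2
      sum brier_pois brier_nb                        // lower mean = better at predicting zeros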

      Best wishes,

      Joao



      • #4
        Dear Dung Le,

        What is the purpose of the model you are estimating?

        Best wishes,

        Joao



        • #5
          Dear Dung Le,

          Let me rephrase the question: do you just want to predict the number of visits, or do you want to do something more challenging, such as computing the probabilities of particular outcomes or understanding the nature of the demand? If all you want to do is predict demand, what you are doing looks fine. If you want to do something else, then you would need to adjust the cross-validation exercise to focus on what you want to study.

          Best wishes,

          Joao



          • #6
            Dung Le,

            Sure; please do.

            Joao
