
  • Out-of-sample cross-validation for count data, am I correct?

    Dear all,

    I am working with count data, and my study compares the performance of several models that handle such data (e.g., Poisson, NB, hurdle, and zero-inflated models). The dependent variable is out-patient visits (visit_op), which is overdispersed; the independent variables are age, age squared, sex, area, marital status, employment status, education, social health insurance, and the log of household income.

    Based on in-sample tests (MPE, RMSE, and MAPE) and information criteria (log likelihood, AIC, and BIC), hurdle NB2 is the best model. Next, I use K-fold cross-validation to test the models' out-of-sample performance. The data are split into 10 nearly equal parts; in each round, 9 parts are used for training and the remaining part is held out for validation. My code is as follows:
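    As a minimal illustration (the variable names here are only illustrative), the 10 folds could also be formed by random assignment, rather than from deciles of pid06 as in the code below:

    Code:
    * Illustrative randomized assignment of observations to 10 folds
    set seed 12345
    gen u = runiform()
    xtile idq_rand = u, nq(10)    // idq_rand = 1,...,10 marks each observation's fold
    drop u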

    Data example
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(visit_op biny pid06 age agesqr) byte sex float area byte(mstt insur work) float(edu lnhhinc)
     1 1   8 69 4761 0 1 0 1 0  9  10.09823
    12 1 133 61 3721 1 1 0 1 0 12 10.740648
     1 1 150 75 5625 0 0 0 1 0  2  9.877502
     4 1 164 62 3844 0 0 0 0 1  0 10.462332
    10 1 166 67 4489 0 0 0 0 0  5  8.547916
     1 1 174 63 3969 0 1 0 1 0 12 10.165852
     3 1 180 77 5929 1 0 1 0 0  0  8.318742
    10 1 188 64 4096 0 0 0 1 0 12 11.261897
     2 1  27 86 7396 1 1 1 0 0  0 10.515966
     3 1  31 74 5476 0 1 0 1 0 12 10.217276
     3 1  21 66 4356 0 1 0 1 1 12 10.959888
     1 1  51 75 5625 0 1 0 1 0  9  9.517825
     1 1  57 68 4624 1 1 1 0 1  5 10.910277
    13 1  49 63 3969 1 1 0 1 0  9 10.299171
    15 1  61 63 3969 0 1 0 1 0 12  11.48329
     0 0  64 64 4096 1 1 0 1 0  5  11.15545
     1 1  69 68 4624 0 1 0 1 0 12  11.34687
     1 1  59 69 4761 0 1 0 1 0 12 11.715866
     2 1  92 76 5776 1 1 0 1 0 10  10.09823
    20 1  96 64 4096 1 1 1 1 0  4 11.252858
     2 1  85 66 4356 0 1 0 1 0 12  11.46772
     1 1  88 80 6400 1 1 1 1 0  3 10.113749
     1 1 114 61 3721 1 1 1 0 0  7 11.626254
     2 1 117 82 6724 1 1 0 1 0 12  10.27505
     1 1 102 84 7056 1 1 1 1 0  7 11.357675
     5 1 118 71 5041 0 1 1 1 1 12 12.115185
     1 1 122 83 6889 1 1 0 1 0  5 10.599132
     1 1 106 62 3844 1 1 0 1 0 12 10.767643
     3 1 288 68 4624 0 0 1 1 0  7  9.700943
    10 1 309 78 6084 1 0 0 0 0  1 10.234624
    end
    label values sex sex
    label def sex 0 "Male", modify
    label def sex 1 "Female", modify
    label values area area
    label def area 0 "Rural", modify
    label def area 1 "urban", modify
    label values mstt mstt
    label def mstt 0 "Married", modify
    label def mstt 1 "Single", modify
    label values insur insur
    label def insur 0 "No", modify
    label def insur 1 "Yes", modify
    label values work work
    label def work 0 "No", modify
    label def work 1 "Yes", modify
    Hurdle NB2
    Code:
    global v = 10                           // number of cross-validation folds
    * Assumed regressor list, taken from the variables described above
    global X age agesqr sex area mstt insur work edu lnhhinc
    xtile idq = pid06, nq($v)               // fold indicator: deciles of pid06

    *** Hurdle NB2
    gen p1  = 0
    gen yfv = 0

    forvalues i = 1/$v {
       qui {
          * Part 1: probability of any visit, fitted on the 9 training folds
          logit biny $X if idq != `i', vce(robust)
          predict plogit
          replace p1 = plogit
          drop plogit
       }
       qui {
          * Part 2: positive counts, fitted on the 9 training folds
          tnbreg visit_op $X if visit_op > 0 & idq != `i', vce(robust)
          predict p2                        // default prediction (number of events)
          gen hnb2 = p1*p2
          replace yfv = hnb2 if idq == `i'  // keep the prediction only for the held-out fold

          * Fit statistics; only the final pass, when yfv is filled for every
          * fold, gives the full out-of-sample results
          gen pe = visit_op - yfv
          sum pe
          sca hnb2mpe = r(mean)

          gen spe = (visit_op - yfv)^2
          sum spe
          sca hnb2rmse = sqrt(r(mean))

          gen ape = abs(visit_op - yfv)
          sum ape
          sca hnb2mape = r(mean)

          drop p2 hnb2 pe spe ape
       }
    }

    drop p1 yfv
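    One caveat on the block above: after tnbreg, plain predict returns the untruncated predicted count, so if the hurdle prediction is meant to be Pr(y>0)*E(y|y>0), the truncated conditional mean may be what is needed. A minimal sketch (assuming the cm statistic is available after tnbreg in your Stata version), shown on the full sample for brevity:

    Code:
    * Sketch: hurdle-NB2 mean built from Pr(y>0) and the truncated conditional mean
    qui logit biny $X, vce(robust)
    predict pr_pos, pr                          // Pr(visit_op > 0)
    qui tnbreg visit_op $X if visit_op > 0, vce(robust)
    predict mu_pos, cm                          // E(visit_op | visit_op > 0)
    gen hnb2_mean = pr_pos*mu_pos               // hurdle prediction of E(visit_op)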
    NB2
    Code:
    gen yfv = 0

    forvalues i = 1/$v {
       qui {
          nbreg visit_op $X if idq != `i', vce(robust)
          predict yhatnb2
          replace yfv = yhatnb2 if idq == `i'   // held-out-fold prediction only

          gen pe = visit_op - yfv
          sum pe
          sca nb2mpe = r(mean)

          gen spe = (visit_op - yfv)^2
          sum spe
          sca nb2rmse = sqrt(r(mean))

          gen ape = abs(visit_op - yfv)
          sum ape
          sca nb2mape = r(mean)

          drop yhatnb2 pe spe ape
       }
    }

    drop yfv
    Poisson
    Code:
    *** Poisson
    gen yfv = 0

    forvalues i = 1/$v {
       qui {
          poisson visit_op $X if idq != `i', vce(robust)
          predict pyhat
          replace yfv = pyhat if idq == `i'     // held-out-fold prediction only

          gen pe = visit_op - yfv
          sum pe
          sca pmpe = r(mean)

          gen spe = (visit_op - yfv)^2
          sum spe
          sca prmse = sqrt(r(mean))

          gen ape = abs(visit_op - yfv)
          sum ape
          sca pmape = r(mean)

          drop pyhat pe spe ape
       }
    }

    drop yfv
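    To compare the three models, the scalars saved above can be collected into a single table, for example:

    Code:
    * Collect the saved out-of-sample statistics into one matrix
    matrix cvstats = (hnb2mpe, hnb2rmse, hnb2mape \ ///
                      nb2mpe,  nb2rmse,  nb2mape  \ ///
                      pmpe,    prmse,    pmape)
    matrix rownames cvstats = HurdleNB2 NB2 Poisson
    matrix colnames cvstats = MPE RMSE MAPE
    matrix list cvstats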
    The results show that the Poisson and NB2 models did better than the hurdle NB2. The MPE, RMSE, and MAPE of the Poisson and NB2 models are almost identical, even though the data are overdispersed. I am therefore wondering whether my coding procedure or my approach to the out-of-sample test is incorrect. Any advice is highly appreciated.

    Thank you.

    Best regards,

    DL

  • #2
    You didn't get a quick response. You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. Being able to replicate your results is extremely helpful in diagnosing data-related problems.

    You've provided a lot of code. But you then ask if we can somehow check the code without running it. This is extremely hard to do. If you have a more manageable problem, try posting that.



    • #3
      Dear Dung Le

      From what I can understand from your code (I confess I did not spend a lot of time on it), you are using the cross-validation to compare how well the fitted conditional expectations predict the data. That is, you are focusing on the conditional expectation. If this is correct, the existence of overdispersion is almost irrelevant, and more complex models may overfit the estimation data and therefore be bad at predicting. In short, I am not at all surprised that the simplest model (Poisson) does well in this exercise. Now, if you want to compute probabilities of events, say, the probability of zero visits, the situation would be very different, but your cross-validation does not appear to be designed to study that.
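      For instance, a rough sketch of that kind of check (reusing the fold variable idq and the regressor list $X from your code; the scoring variables here are only illustrative) could compare the out-of-sample probability of zero visits with a Brier score:

      Code:
      * Cross-validated Pr(visit_op = 0) under Poisson and NB2, scored by a Brier score
      gen p0_pois = .
      gen p0_nb   = .
      forvalues i = 1/10 {
          qui poisson visit_op $X if idq != `i', vce(robust)
          qui predict tmp if idq == `i', pr(0)       // Pr(y = 0) under Poisson
          qui replace p0_pois = tmp if idq == `i'
          qui drop tmp
          qui nbreg visit_op $X if idq != `i', vce(robust)
          qui predict tmp if idq == `i', pr(0)       // Pr(y = 0) under NB2
          qui replace p0_nb = tmp if idq == `i'
          qui drop tmp
      }
      gen zero       = (visit_op == 0)
      gen brier_pois = (zero - p0_pois)^2
      gen brier_nb   = (zero - p0_nb)^2
      sum brier_pois brier_nb                        // lower mean = better at predicting zeros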

      Best wishes,

      Joao



      • #4
        Dear Dung Le,

        What is the purpose of the model you are estimating?

        Best wishes,

        Joao



        • #5
          Dear Dung Le,

          Let me rephrase the question: do you just want to predict the number of visits, or do you want to do something more challenging, such as computing the probabilities of particular outcomes or understanding the nature of the demand? If all you want to do is predict demand, what you are doing looks fine. If you want to do something else, then you would need to adjust the cross-validation exercise to focus on what you want to study.

          Best wishes,

          Joao



          • #6
            Dung Le,

            Sure; please do.

            Joao
