Poisson regression model and Negative binomial regression model. Is this the poor model or simply poor secondary dataset?

Thong Nguyen

Join Date: Oct 2015
Posts: 236

Poisson regression model and Negative binomial regression model. Is this the poor model or simply poor secondary dataset?

28 Mar 2016, 05:01

Dear all,
I got a dataset like this. And I would like to use Poisson Regression model for the dependent variable dq_m, independent variables include ttb_q, tmin_q, tmax_q, amtb_q, dofhw_q, qhw

After using Poisson regression model, I found out that it's better if I use Negative binominal regression model for the dataset because of overdispersion. However, the final model which I figured out show me unexpected result. As you can see from the graph attached, the model does not fit well to observed values. What should I do in order to improve the current model and find the best one.
Furthermore, as a result of limitations of secondary data, results in the dataset are so different to the medical articles. And I would like to know whether I use the wrong modelling process, wrong command or it's simply poor dataset.
Thank you all in advance!

Code:

. clear

. use Quarterly_Dofhw_Env_Analysis

. sum
l
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       ttb_q |        36    28.27778    .7403131       27.3       30.3
      tmin_q |        36    25.28056    .8454312       24.1       27.5
      tmax_q |        36    33.57778    .8414648       32.2       35.8
      amtb_q |        36    75.14444    4.501202       65.3       82.8
      rtsh_q |        36    9.438889    4.036496          2       19.1
-------------+--------------------------------------------------------
   hmax_pa_q |        36    113.2194    12.41303       92.6      137.7
        dq_m |        36    2847.472     2073.42        765      11156
        dq_c |        36    91.38889    61.60718          4        310
     quarter |        36       201.5    10.53565        184        219
          id |        36        18.5    10.53565          1         36
-------------+--------------------------------------------------------
     dofhw_q |        36    2.222222    5.037636          0         22
         qhw |        36         .25     .439155          0          1

gr box dq_m,by(qhw)
  

 
. poisson dq_m ttb_q tmin_q tmax_q amtb_q dofhw_q qhw, nolog

Poisson regression                                Number of obs   =         36
                                                  LR chi2(6)      =   18257.00
                                                  Prob > chi2     =     0.0000
Log likelihood = -12276.162                       Pseudo R2       =     0.4265

------------------------------------------------------------------------------
        dq_m |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ttb_q |   1.163322   .0260642    44.63   0.000     1.112237    1.214407
      tmin_q |   -.641617   .0199307   -32.19   0.000    -.6806803   -.6025536
      tmax_q |  -.3136857    .020455   -15.34   0.000    -.3537768   -.2735945
      amtb_q |   .1041063   .0017078    60.96   0.000     .1007591    .1074534
     dofhw_q |   .0151342   .0014126    10.71   0.000     .0123656    .0179027
         qhw |  -.5604745   .0162171   -34.56   0.000    -.5922594   -.5286896
       _cons |  -6.012431   .3063855   -19.62   0.000    -6.612936   -5.411927
------------------------------------------------------------------------------

. nbreg dq_m ttb_q tmin_q tmax_q amtb_q dofhw_q qhw, nolog

Negative binomial regression                      Number of obs   =         36
                                                  LR chi2(6)      =      26.50
Dispersion     = mean                             Prob > chi2     =     0.0002
Log likelihood = -301.9212                        Pseudo R2       =     0.0420

------------------------------------------------------------------------------
        dq_m |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ttb_q |   1.023928   .6093453     1.68   0.093    -.1703668    2.218223
      tmin_q |   -.410953   .5268968    -0.78   0.435    -1.443652    .6217458
      tmax_q |  -.4743927   .5131714    -0.92   0.355     -1.48019    .5314047
      amtb_q |   .0933756   .0412971     2.26   0.024     .0124348    .1743164
     dofhw_q |   .0136807   .0287798     0.48   0.635    -.0427266    .0700881
         qhw |  -.3632879   .3167615    -1.15   0.251    -.9841291    .2575533
       _cons |  -1.741708   7.948207    -0.22   0.827    -17.31991    13.83649
-------------+----------------------------------------------------------------
    /lnalpha |  -1.630928   .2288978                      -2.07956   -1.182297
-------------+----------------------------------------------------------------
       alpha |   .1957478   .0448062                      .1249852    .3065737
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 2.4e+04 Prob>=chibar2 = 0.000


. nbreg dq_m qhw dofhw_q amtb_q,nolog

Negative binomial regression                      Number of obs   =         36
                                                  LR chi2(3)      =      23.61
Dispersion     = mean                             Prob > chi2     =     0.0000
Log likelihood = -303.36761                       Pseudo R2       =     0.0375

------------------------------------------------------------------------------
        dq_m |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         qhw |  -.4479989   .3051038    -1.47   0.142    -1.045991    .1499936
     dofhw_q |   .0200182   .0246346     0.81   0.416    -.0282647    .0683012
      amtb_q |   .0914506   .0217634     4.20   0.000     .0487951    .1341062
       _cons |   1.054483   1.667006     0.63   0.527    -2.212788    4.321755
-------------+----------------------------------------------------------------
    /lnalpha |  -1.555352   .2282957                     -2.002803   -1.107901
-------------+----------------------------------------------------------------
       alpha |    .211115   .0481967                      .1349564    .3302515
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 2.6e+04 Prob>=chibar2 = 0.000

. listcoef,percent

nbreg (N=36): Percentage Change in Expected Count

 Observed SD: 2073.4199

----------------------------------------------------------------------
        dq_m |      b         z     P>|z|      %      %StdX      SDofX
-------------+--------------------------------------------------------
         qhw |  -0.44800   -1.468   0.142    -36.1    -17.9     0.4392
     dofhw_q |   0.02002    0.813   0.416      2.0     10.6     5.0376
      amtb_q |   0.09145    4.202   0.000      9.6     50.9     4.5012
-------------+--------------------------------------------------------
    ln alpha |  -1.55535
       alpha |   0.21112   SE(alpha) = 0.04820  
----------------------------------------------------------------------
 LR test of alpha=0: 2.6e+04  Prob>=LRX2 = 0.000
----------------------------------------------------------------------

. nbreg dq_m qhw amtb_q,nolog

Negative binomial regression                      Number of obs   =         36
                                                  LR chi2(2)      =      22.93
Dispersion     = mean                             Prob > chi2     =     0.0000
Log likelihood = -303.70531                       Pseudo R2       =     0.0364

------------------------------------------------------------------------------
        dq_m |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         qhw |  -.2612912   .2143484    -1.22   0.223    -.6814064     .158824
      amtb_q |   .0915887   .0220894     4.15   0.000     .0482943    .1348831
       _cons |   1.043923    1.69191     0.62   0.537    -2.272159    4.360005
-------------+----------------------------------------------------------------
    /lnalpha |  -1.537799   .2281601                     -1.984985   -1.090614
-------------+----------------------------------------------------------------
       alpha |   .2148534    .049021                      .1373827    .3360102
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 2.7e+04 Prob>=chibar2 = 0.000

. listcoef,percent

nbreg (N=36): Percentage Change in Expected Count

 Observed SD: 2073.4199

----------------------------------------------------------------------
        dq_m |      b         z     P>|z|      %      %StdX      SDofX
-------------+--------------------------------------------------------
         qhw |  -0.26129   -1.219   0.223    -23.0    -10.8     0.4392
      amtb_q |   0.09159    4.146   0.000      9.6     51.0     4.5012
-------------+--------------------------------------------------------
    ln alpha |  -1.53780
       alpha |   0.21485   SE(alpha) = 0.04902  
----------------------------------------------------------------------
 LR test of alpha=0: 2.7e+04  Prob>=LRX2 = 0.000
----------------------------------------------------------------------

. nbreg dq_m amtb_q,nolog

Negative binomial regression                      Number of obs   =         36
                                                  LR chi2(1)      =      21.51
Dispersion     = mean                             Prob > chi2     =     0.0000
Log likelihood = -304.4172                        Pseudo R2       =     0.0341

------------------------------------------------------------------------------
        dq_m |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      amtb_q |   .1060044   .0189324     5.60   0.000     .0688975    .1431113
       _cons |  -.1003189    1.42492    -0.07   0.944    -2.893111    2.692473
-------------+----------------------------------------------------------------
    /lnalpha |  -1.500788   .2278508                     -1.947367   -1.054209
-------------+----------------------------------------------------------------
       alpha |   .2229544   .0508003                      .1426491    .3484681
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 2.8e+04 Prob>=chibar2 = 0.000

. listcoef,percent

nbreg (N=36): Percentage Change in Expected Count

 Observed SD: 2073.4199

----------------------------------------------------------------------
        dq_m |      b         z     P>|z|      %      %StdX      SDofX
-------------+--------------------------------------------------------
      amtb_q |   0.10600    5.599   0.000     11.2     61.1     4.5012
-------------+--------------------------------------------------------
    ln alpha |  -1.50079
       alpha |   0.22295   SE(alpha) = 0.05080  
----------------------------------------------------------------------
 LR test of alpha=0: 2.8e+04  Prob>=LRX2 = 0.000
----------------------------------------------------------------------

. predict amtb
(option n assumed; predicted number of events)



. tw (scatter dq_m quarter)|| line amtb quarter, legend(order(1 "Observed" 2 "NBRM_amtb"))
  

 


Furthermore, No. cases of Heart Attack during heatwaves is lower than non-heatwave periods

Another issue that no.of heart attack cases I got here is quarterly data. I did not find any articles that using quarterly data for analysing impact of heatwave events on heart diseases! Do you have any idea about that? I really appreciate all of your advice!

Attached Files

Last edited by Thong Nguyen; 28 Mar 2016, 05:41.

Tags: None

Thong Nguyen

Join Date: Oct 2015
Posts: 236

28 Mar 2016, 06:21

. dataex

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(ttb_q tmin_q tmax_q amtb_q rtsh_q hmax_pa_q) int(dq_m dq_c) float(quarter id) byte dofhw_q float(qhw amtb)
  28 24.9 33.5 70.5  4.4 110.4   865  34 184  1  0 0  1592.392
29.1   26 34.5 76.4 11.8  95.1  1474  57 185  2  0 0  2976.248
27.7 25.3 32.9 81.5   10  92.6  2703 132 186  3  0 0  5110.441
27.9 24.8   33 76.4  7.1 121.2  3657 163 187  4  0 0  2976.248
27.8 24.4 33.4 69.5  2.5 107.3  1309  46 188  5  0 0 1432.2302
29.2 26.5 34.2 76.2  8.3    98  1798  70 189  6  0 0 2913.8115
27.7 25.1 32.6 82.8 15.3    99  2744 111 190  7  0 0  5865.524
27.3 24.2 32.2 76.7   10 125.5  3464 159 191  8  0 0  3072.416
27.6 24.3 32.9 70.2  4.8 117.2   765  44 192  9  0 0  1542.548
28.8 25.9 33.9 77.5 10.8   102 11156   4 193 10  0 0  3344.336
27.9 25.1 33.2 81.5 11.6 102.1  8354 135 194 11  0 0  5110.441
27.5 24.4 32.4 77.9  6.5 128.1  3447 188 195 12  0 0 3489.1914
27.6 24.5 33.3 71.7  4.3 117.7   868  43 196 13  0 0 1808.3994
  29 26.2 34.5 77.9    9 106.2  1745  79 197 14  3 1 3489.1914
  28 25.3 33.1 80.6 12.5 100.2  2565 109 198 15  0 0 4645.4194
27.9 24.7 33.1 75.6 19.1 125.9  3427 149 199 16  0 0 2734.2566
28.4 25.1   34 69.4  2.6   116   860  35 200 17  4 1 1417.1283
30.3 27.5 35.8 71.8  4.5 104.4  1970  50 201 18 22 1 1827.6725
28.3 25.4 33.5 78.4 15.3 101.8  2946  84 202 19  0 0  3679.115
27.4 24.4 32.2   77 11.8 130.6  3930 114 203 20  0 0  3171.694
27.6 24.1 33.1 68.4    2   122  1005  44 204 21  0 0 1274.5945
  29 25.8 34.2 74.2 11.4 100.3  1965  69 205 22  0 0   2357.15
28.2 25.2 33.3 80.2 12.1 104.7  2988  98 206 23  0 0 4452.5625
27.8 24.4   33 75.5  9.6 133.1  4112 133 207 24  0 0  2705.426
28.5   25 33.9 68.5 10.5 119.4  1112  36 208 25  5 1 1288.1774
29.1 26.1 34.5 75.2  9.3 102.6  2170 310 209 26  3 1 2620.7424
28.3 25.6 33.6   78  9.9 106.8  3216  97 210 27  0 0  3526.375
28.7 25.7   34 72.3 11.8 134.4  4523 124 211 28  0 0 1927.1563
28.6 25.3 34.5 65.3  3.9 122.5  1006  26 212 29 13 1  917.6076
29.7   27 35.3 74.6  6.1 113.9  2012  46 213 30 13 1  2459.247
  28 25.5 33.2 80.6 12.3 107.4  3032  64 214 31  0 0 4645.4194
27.5 24.5 32.7 76.3 14.3 137.7  4209  88 215 32  0 0 2944.8655
27.4 24.1 33.2 66.3  8.2 121.4  1127  27 216 33  4 1 1020.2207
29.8 26.8 35.1 74.2 11.1   109  2180  56 217 34 13 1   2357.15
28.2 25.7 33.5   80 13.5 107.4  3270 200 218 35  0 0  4359.159
28.2 25.3 33.5 76.1 11.6   132  4535  66 219 36  0 0  2883.088
end
format %tq quarter
label values qhw qhw
label def qhw 0 "None Heatwaves", modify
label def qhw 1 "Heatwaves", modify

------------------ copy up to and including the previous line ------------------

Comment

Thong Nguyen

Join Date: Oct 2015

Posts: 236
#3

28 Mar 2016, 14:40

Dear all,
Do you have any advice on this issue? Thank you all in advance!
Comment
Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#4

28 Mar 2016, 16:02

Hello there,

This is really not my area but you have a very small data set and a lot of seasonal variation. If at all possible, I would stick to the Poisson regression (with robust standard errors) and include only 3 to 5 regressors but making sure these capture the seasonal variation in the data. Do you know what caused the jump in 2008? Accounting for that would not hurt.

Best of luck,

Joao
Comment
Thong Nguyen

Join Date: Oct 2015

Posts: 236
#5

28 Mar 2016, 16:45

Thank you so much Joao. At the first time, I would like to use Arima model for the dataset which show us very strong seasonality. Unfortunately, Arima model requires at least 50 observations for analysing. Then I had to use Poisson regression instead as I mentioned. Actually, I have no clue what happened in 2008. And the results I got here are totally different from literature (which use daily data). I concern that using quarterly data is a bad idea. And another potential bias factor is this is secondary dataset. However, I have no choice. Do you have any comments, Joao? Thank you again.
Comment
Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#6

28 Mar 2016, 16:48

Well, we have to work with the data that are available to us; just try to make the most of it :-)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#7

28 Mar 2016, 17:19

Disclaimer: the variable names do not suggest their meanings to me, so I'm somewhat in the dark about the research question at hand. What I have gleaned (correctly or mistakenly) is that part of the interest is the relationship between heatwaves and incidence of heart attacks.

My opinion is that the use of quarterly data to study this question is probably inappropriate. While various definitions are used, the duration of a heatwave is somewhere in the range of several days to several weeks. But I'm unaware of anything called a heat wave that would last a substantial part of a quarter (which is 13 weeks). This is relevant because the effect of a heatwave on incidence of heart attack might well consist of provoking heart attacks that were going to happen anyway in the near future, to occur sooner. If this is what is going on, then one would see a rise in incidence during the heatwave, followed by a compensatory deficit in incidence in the days and weeks following a heatwave. When aggregated out over a period as long as a quarter, one might well see no difference between quarters with and without heatwaves. There is also another consideration: during heat waves people change their daily activities. They may abstain from harder physical work, or do less of it. They may spend more time in cooler environments, to the extent those are available to them. These changes may actually reduce the incidence of heart attacks among those who change their behavior (or postpone them a short time). Unless one also has this kind of behavioral data to adjust the analysis, it is hard to draw conclusions. Again, the time scale for these events is much smaller than a quarter and could easily be washed out, and quarterly data would completely fail to even capture this effect.

I would be reluctant to look for a relationship between heatwaves and heart attack incidence in data any less frequent than weekly, and, really, I'd look very hard for daily data before settling for weekly.
1 like
Comment

Thong Nguyen

Join Date: Oct 2015
Posts: 236

28 Mar 2016, 18:08

Thank you so much Clyde.
At the beginning, I'd like to collect daily data, both on climatic variables and heart attack incidence cases. However, daily incidence cases are not available (just because I used secondary data). Therefore, I changed to use quarterly incidence cases and it gave me unexpected, unreasonably results as you can see.
From daily climatic variables, I identified a heatwave as 3 consecutive days or more with daily maximum temperature exceed 95 percentile of temperature during the research period. I calculated duration of each heatwave as your advice. As a result of limitation of disease data, I changed to quarterly data for both climatic and disease data. I think it's a stupid idea. Do you have any advice for how to deal with the dataset? Or should I simply report descriptive statistics?
Thank you again Clyde!
I also show you descriptions of my variables

Code:

. des

Contains data from Quarterly_Dofhw_Env_Analysis.dta
  obs:            32                          
 vars:            11                          29 Mar 2016 07:11
 size:         1,184                          
---------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------
ttb_q           float   %8.0g                 Quarterly Mean of Daily Mean Temperature
tmin_q          float   %9.0g                 Quarterly Mean of Daily MinimumTemperature
tmax_q          float   %9.0g                 Quarterly Mean of Daily Maximum Temperature
amtb_q          float   %9.0g                 Quarterly Mean of Daily Humidity
dq_m            int     %8.0g                 Quarterly Heart Attack Cases
dq_c            int     %8.0g                 Quarterly Heart Attack Death Cases
quarter         float   %tq                  
id              float   %9.0g                
dofhw_q         byte    %8.0g                 Quarterly No. days of heatwave
qhw             float   %14.0g     qhw        Quarter with/without heatwaves
year            float   %ty                  
---------------------------------------------------------------------------------------------------------------------
Sorted by:  quarter

Last edited by Thong Nguyen; 28 Mar 2016, 18:12.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#9

28 Mar 2016, 22:10

Thanks for explaining the variables. I think I would just report descriptive statistics. I really don't think that quarterly data like this can shed light on associations between heatwaves and heart attacks.
Comment
Thong Nguyen

Join Date: Oct 2015

Posts: 236
#10

28 Mar 2016, 22:56

Thanks for all of your advice. That's what I worried about, but you made it clearer for me.
Comment

Announcement

Poisson regression model and Negative binomial regression model. Is this the poor model or simply poor secondary dataset?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment