Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Poisson regression model and Negative binomial regression model. Is this the poor model or simply poor secondary dataset?

    Dear all,
    I got a dataset like this. And I would like to use Poisson Regression model for the dependent variable dq_m, independent variables include ttb_q, tmin_q, tmax_q, amtb_q, dofhw_q, qhw

    After using Poisson regression model, I found out that it's better if I use Negative binominal regression model for the dataset because of overdispersion. However, the final model which I figured out show me unexpected result. As you can see from the graph attached, the model does not fit well to observed values. What should I do in order to improve the current model and find the best one.
    Furthermore, as a result of limitations of secondary data, results in the dataset are so different to the medical articles. And I would like to know whether I use the wrong modelling process, wrong command or it's simply poor dataset.
    Thank you all in advance!


    Code:
    . clear
    
    . use Quarterly_Dofhw_Env_Analysis
    
    . sum
    l
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
           ttb_q |        36    28.27778    .7403131       27.3       30.3
          tmin_q |        36    25.28056    .8454312       24.1       27.5
          tmax_q |        36    33.57778    .8414648       32.2       35.8
          amtb_q |        36    75.14444    4.501202       65.3       82.8
          rtsh_q |        36    9.438889    4.036496          2       19.1
    -------------+--------------------------------------------------------
       hmax_pa_q |        36    113.2194    12.41303       92.6      137.7
            dq_m |        36    2847.472     2073.42        765      11156
            dq_c |        36    91.38889    61.60718          4        310
         quarter |        36       201.5    10.53565        184        219
              id |        36        18.5    10.53565          1         36
    -------------+--------------------------------------------------------
         dofhw_q |        36    2.222222    5.037636          0         22
             qhw |        36         .25     .439155          0          1
    
    gr box dq_m,by(qhw)
      
    Click image for larger version

Name:	box_tsline_dqm.png
Views:	2
Size:	28.2 KB
ID:	1332908
    . poisson dq_m ttb_q tmin_q tmax_q amtb_q dofhw_q qhw, nolog Poisson regression Number of obs = 36 LR chi2(6) = 18257.00 Prob > chi2 = 0.0000 Log likelihood = -12276.162 Pseudo R2 = 0.4265 ------------------------------------------------------------------------------ dq_m | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- ttb_q | 1.163322 .0260642 44.63 0.000 1.112237 1.214407 tmin_q | -.641617 .0199307 -32.19 0.000 -.6806803 -.6025536 tmax_q | -.3136857 .020455 -15.34 0.000 -.3537768 -.2735945 amtb_q | .1041063 .0017078 60.96 0.000 .1007591 .1074534 dofhw_q | .0151342 .0014126 10.71 0.000 .0123656 .0179027 qhw | -.5604745 .0162171 -34.56 0.000 -.5922594 -.5286896 _cons | -6.012431 .3063855 -19.62 0.000 -6.612936 -5.411927 ------------------------------------------------------------------------------ . nbreg dq_m ttb_q tmin_q tmax_q amtb_q dofhw_q qhw, nolog Negative binomial regression Number of obs = 36 LR chi2(6) = 26.50 Dispersion = mean Prob > chi2 = 0.0002 Log likelihood = -301.9212 Pseudo R2 = 0.0420 ------------------------------------------------------------------------------ dq_m | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- ttb_q | 1.023928 .6093453 1.68 0.093 -.1703668 2.218223 tmin_q | -.410953 .5268968 -0.78 0.435 -1.443652 .6217458 tmax_q | -.4743927 .5131714 -0.92 0.355 -1.48019 .5314047 amtb_q | .0933756 .0412971 2.26 0.024 .0124348 .1743164 dofhw_q | .0136807 .0287798 0.48 0.635 -.0427266 .0700881 qhw | -.3632879 .3167615 -1.15 0.251 -.9841291 .2575533 _cons | -1.741708 7.948207 -0.22 0.827 -17.31991 13.83649 -------------+---------------------------------------------------------------- /lnalpha | -1.630928 .2288978 -2.07956 -1.182297 -------------+---------------------------------------------------------------- alpha | .1957478 .0448062 .1249852 .3065737 ------------------------------------------------------------------------------ Likelihood-ratio test of alpha=0: chibar2(01) = 2.4e+04 Prob>=chibar2 = 0.000 . nbreg dq_m qhw dofhw_q amtb_q,nolog Negative binomial regression Number of obs = 36 LR chi2(3) = 23.61 Dispersion = mean Prob > chi2 = 0.0000 Log likelihood = -303.36761 Pseudo R2 = 0.0375 ------------------------------------------------------------------------------ dq_m | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- qhw | -.4479989 .3051038 -1.47 0.142 -1.045991 .1499936 dofhw_q | .0200182 .0246346 0.81 0.416 -.0282647 .0683012 amtb_q | .0914506 .0217634 4.20 0.000 .0487951 .1341062 _cons | 1.054483 1.667006 0.63 0.527 -2.212788 4.321755 -------------+---------------------------------------------------------------- /lnalpha | -1.555352 .2282957 -2.002803 -1.107901 -------------+---------------------------------------------------------------- alpha | .211115 .0481967 .1349564 .3302515 ------------------------------------------------------------------------------ Likelihood-ratio test of alpha=0: chibar2(01) = 2.6e+04 Prob>=chibar2 = 0.000 . listcoef,percent nbreg (N=36): Percentage Change in Expected Count Observed SD: 2073.4199 ---------------------------------------------------------------------- dq_m | b z P>|z| % %StdX SDofX -------------+-------------------------------------------------------- qhw | -0.44800 -1.468 0.142 -36.1 -17.9 0.4392 dofhw_q | 0.02002 0.813 0.416 2.0 10.6 5.0376 amtb_q | 0.09145 4.202 0.000 9.6 50.9 4.5012 -------------+-------------------------------------------------------- ln alpha | -1.55535 alpha | 0.21112 SE(alpha) = 0.04820 ---------------------------------------------------------------------- LR test of alpha=0: 2.6e+04 Prob>=LRX2 = 0.000 ---------------------------------------------------------------------- . nbreg dq_m qhw amtb_q,nolog Negative binomial regression Number of obs = 36 LR chi2(2) = 22.93 Dispersion = mean Prob > chi2 = 0.0000 Log likelihood = -303.70531 Pseudo R2 = 0.0364 ------------------------------------------------------------------------------ dq_m | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- qhw | -.2612912 .2143484 -1.22 0.223 -.6814064 .158824 amtb_q | .0915887 .0220894 4.15 0.000 .0482943 .1348831 _cons | 1.043923 1.69191 0.62 0.537 -2.272159 4.360005 -------------+---------------------------------------------------------------- /lnalpha | -1.537799 .2281601 -1.984985 -1.090614 -------------+---------------------------------------------------------------- alpha | .2148534 .049021 .1373827 .3360102 ------------------------------------------------------------------------------ Likelihood-ratio test of alpha=0: chibar2(01) = 2.7e+04 Prob>=chibar2 = 0.000 . listcoef,percent nbreg (N=36): Percentage Change in Expected Count Observed SD: 2073.4199 ---------------------------------------------------------------------- dq_m | b z P>|z| % %StdX SDofX -------------+-------------------------------------------------------- qhw | -0.26129 -1.219 0.223 -23.0 -10.8 0.4392 amtb_q | 0.09159 4.146 0.000 9.6 51.0 4.5012 -------------+-------------------------------------------------------- ln alpha | -1.53780 alpha | 0.21485 SE(alpha) = 0.04902 ---------------------------------------------------------------------- LR test of alpha=0: 2.7e+04 Prob>=LRX2 = 0.000 ---------------------------------------------------------------------- . nbreg dq_m amtb_q,nolog Negative binomial regression Number of obs = 36 LR chi2(1) = 21.51 Dispersion = mean Prob > chi2 = 0.0000 Log likelihood = -304.4172 Pseudo R2 = 0.0341 ------------------------------------------------------------------------------ dq_m | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- amtb_q | .1060044 .0189324 5.60 0.000 .0688975 .1431113 _cons | -.1003189 1.42492 -0.07 0.944 -2.893111 2.692473 -------------+---------------------------------------------------------------- /lnalpha | -1.500788 .2278508 -1.947367 -1.054209 -------------+---------------------------------------------------------------- alpha | .2229544 .0508003 .1426491 .3484681 ------------------------------------------------------------------------------ Likelihood-ratio test of alpha=0: chibar2(01) = 2.8e+04 Prob>=chibar2 = 0.000 . listcoef,percent nbreg (N=36): Percentage Change in Expected Count Observed SD: 2073.4199 ---------------------------------------------------------------------- dq_m | b z P>|z| % %StdX SDofX -------------+-------------------------------------------------------- amtb_q | 0.10600 5.599 0.000 11.2 61.1 4.5012 -------------+-------------------------------------------------------- ln alpha | -1.50079 alpha | 0.22295 SE(alpha) = 0.05080 ---------------------------------------------------------------------- LR test of alpha=0: 2.8e+04 Prob>=LRX2 = 0.000 ---------------------------------------------------------------------- . predict amtb (option n assumed; predicted number of events) . tw (scatter dq_m quarter)|| line amtb quarter, legend(order(1 "Observed" 2 "NBRM_amtb"))
    Click image for larger version

Name:	nbrm_amtb_36obs.png
Views:	1
Size:	19.7 KB
ID:	1332909
    Furthermore, No. cases of Heart Attack during heatwaves is lower than non-heatwave periods
    Click image for larger version

Name:	tsline_dqm_qhw_dofhwq.png
Views:	1
Size:	27.7 KB
ID:	1332910
    Another issue that no.of heart attack cases I got here is quarterly data. I did not find any articles that using quarterly data for analysing impact of heatwave events on heart diseases! Do you have any idea about that? I really appreciate all of your advice!
    Attached Files
    Last edited by Thong Nguyen; 28 Mar 2016, 05:41.

  • #2
    . dataex

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(ttb_q tmin_q tmax_q amtb_q rtsh_q hmax_pa_q) int(dq_m dq_c) float(quarter id) byte dofhw_q float(qhw amtb)
      28 24.9 33.5 70.5  4.4 110.4   865  34 184  1  0 0  1592.392
    29.1   26 34.5 76.4 11.8  95.1  1474  57 185  2  0 0  2976.248
    27.7 25.3 32.9 81.5   10  92.6  2703 132 186  3  0 0  5110.441
    27.9 24.8   33 76.4  7.1 121.2  3657 163 187  4  0 0  2976.248
    27.8 24.4 33.4 69.5  2.5 107.3  1309  46 188  5  0 0 1432.2302
    29.2 26.5 34.2 76.2  8.3    98  1798  70 189  6  0 0 2913.8115
    27.7 25.1 32.6 82.8 15.3    99  2744 111 190  7  0 0  5865.524
    27.3 24.2 32.2 76.7   10 125.5  3464 159 191  8  0 0  3072.416
    27.6 24.3 32.9 70.2  4.8 117.2   765  44 192  9  0 0  1542.548
    28.8 25.9 33.9 77.5 10.8   102 11156   4 193 10  0 0  3344.336
    27.9 25.1 33.2 81.5 11.6 102.1  8354 135 194 11  0 0  5110.441
    27.5 24.4 32.4 77.9  6.5 128.1  3447 188 195 12  0 0 3489.1914
    27.6 24.5 33.3 71.7  4.3 117.7   868  43 196 13  0 0 1808.3994
      29 26.2 34.5 77.9    9 106.2  1745  79 197 14  3 1 3489.1914
      28 25.3 33.1 80.6 12.5 100.2  2565 109 198 15  0 0 4645.4194
    27.9 24.7 33.1 75.6 19.1 125.9  3427 149 199 16  0 0 2734.2566
    28.4 25.1   34 69.4  2.6   116   860  35 200 17  4 1 1417.1283
    30.3 27.5 35.8 71.8  4.5 104.4  1970  50 201 18 22 1 1827.6725
    28.3 25.4 33.5 78.4 15.3 101.8  2946  84 202 19  0 0  3679.115
    27.4 24.4 32.2   77 11.8 130.6  3930 114 203 20  0 0  3171.694
    27.6 24.1 33.1 68.4    2   122  1005  44 204 21  0 0 1274.5945
      29 25.8 34.2 74.2 11.4 100.3  1965  69 205 22  0 0   2357.15
    28.2 25.2 33.3 80.2 12.1 104.7  2988  98 206 23  0 0 4452.5625
    27.8 24.4   33 75.5  9.6 133.1  4112 133 207 24  0 0  2705.426
    28.5   25 33.9 68.5 10.5 119.4  1112  36 208 25  5 1 1288.1774
    29.1 26.1 34.5 75.2  9.3 102.6  2170 310 209 26  3 1 2620.7424
    28.3 25.6 33.6   78  9.9 106.8  3216  97 210 27  0 0  3526.375
    28.7 25.7   34 72.3 11.8 134.4  4523 124 211 28  0 0 1927.1563
    28.6 25.3 34.5 65.3  3.9 122.5  1006  26 212 29 13 1  917.6076
    29.7   27 35.3 74.6  6.1 113.9  2012  46 213 30 13 1  2459.247
      28 25.5 33.2 80.6 12.3 107.4  3032  64 214 31  0 0 4645.4194
    27.5 24.5 32.7 76.3 14.3 137.7  4209  88 215 32  0 0 2944.8655
    27.4 24.1 33.2 66.3  8.2 121.4  1127  27 216 33  4 1 1020.2207
    29.8 26.8 35.1 74.2 11.1   109  2180  56 217 34 13 1   2357.15
    28.2 25.7 33.5   80 13.5 107.4  3270 200 218 35  0 0  4359.159
    28.2 25.3 33.5 76.1 11.6   132  4535  66 219 36  0 0  2883.088
    end
    format %tq quarter
    label values qhw qhw
    label def qhw 0 "None Heatwaves", modify
    label def qhw 1 "Heatwaves", modify
    ------------------ copy up to and including the previous line ------------------

    Comment


    • #3
      Dear all,
      Do you have any advice on this issue? Thank you all in advance!

      Comment


      • #4
        Hello there,

        This is really not my area but you have a very small data set and a lot of seasonal variation. If at all possible, I would stick to the Poisson regression (with robust standard errors) and include only 3 to 5 regressors but making sure these capture the seasonal variation in the data. Do you know what caused the jump in 2008? Accounting for that would not hurt.

        Best of luck,

        Joao

        Comment


        • #5
          Thank you so much Joao. At the first time, I would like to use Arima model for the dataset which show us very strong seasonality. Unfortunately, Arima model requires at least 50 observations for analysing. Then I had to use Poisson regression instead as I mentioned. Actually, I have no clue what happened in 2008. And the results I got here are totally different from literature (which use daily data). I concern that using quarterly data is a bad idea. And another potential bias factor is this is secondary dataset. However, I have no choice. Do you have any comments, Joao? Thank you again.

          Comment


          • #6
            Well, we have to work with the data that are available to us; just try to make the most of it :-)

            Comment


            • #7
              Disclaimer: the variable names do not suggest their meanings to me, so I'm somewhat in the dark about the research question at hand. What I have gleaned (correctly or mistakenly) is that part of the interest is the relationship between heatwaves and incidence of heart attacks.

              My opinion is that the use of quarterly data to study this question is probably inappropriate. While various definitions are used, the duration of a heatwave is somewhere in the range of several days to several weeks. But I'm unaware of anything called a heat wave that would last a substantial part of a quarter (which is 13 weeks). This is relevant because the effect of a heatwave on incidence of heart attack might well consist of provoking heart attacks that were going to happen anyway in the near future, to occur sooner. If this is what is going on, then one would see a rise in incidence during the heatwave, followed by a compensatory deficit in incidence in the days and weeks following a heatwave. When aggregated out over a period as long as a quarter, one might well see no difference between quarters with and without heatwaves. There is also another consideration: during heat waves people change their daily activities. They may abstain from harder physical work, or do less of it. They may spend more time in cooler environments, to the extent those are available to them. These changes may actually reduce the incidence of heart attacks among those who change their behavior (or postpone them a short time). Unless one also has this kind of behavioral data to adjust the analysis, it is hard to draw conclusions. Again, the time scale for these events is much smaller than a quarter and could easily be washed out, and quarterly data would completely fail to even capture this effect.

              I would be reluctant to look for a relationship between heatwaves and heart attack incidence in data any less frequent than weekly, and, really, I'd look very hard for daily data before settling for weekly.

              Comment


              • #8
                Thank you so much Clyde.
                At the beginning, I'd like to collect daily data, both on climatic variables and heart attack incidence cases. However, daily incidence cases are not available (just because I used secondary data). Therefore, I changed to use quarterly incidence cases and it gave me unexpected, unreasonably results as you can see.
                From daily climatic variables, I identified a heatwave as 3 consecutive days or more with daily maximum temperature exceed 95 percentile of temperature during the research period. I calculated duration of each heatwave as your advice. As a result of limitation of disease data, I changed to quarterly data for both climatic and disease data. I think it's a stupid idea. Do you have any advice for how to deal with the dataset? Or should I simply report descriptive statistics?
                Thank you again Clyde!
                I also show you descriptions of my variables
                Code:
                . des
                
                Contains data from Quarterly_Dofhw_Env_Analysis.dta
                  obs:            32                          
                 vars:            11                          29 Mar 2016 07:11
                 size:         1,184                          
                ---------------------------------------------------------------------------------------------------------------------
                              storage   display    value
                variable name   type    format     label      variable label
                ---------------------------------------------------------------------------------------------------------------------
                ttb_q           float   %8.0g                 Quarterly Mean of Daily Mean Temperature
                tmin_q          float   %9.0g                 Quarterly Mean of Daily MinimumTemperature
                tmax_q          float   %9.0g                 Quarterly Mean of Daily Maximum Temperature
                amtb_q          float   %9.0g                 Quarterly Mean of Daily Humidity
                dq_m            int     %8.0g                 Quarterly Heart Attack Cases
                dq_c            int     %8.0g                 Quarterly Heart Attack Death Cases
                quarter         float   %tq                  
                id              float   %9.0g                
                dofhw_q         byte    %8.0g                 Quarterly No. days of heatwave
                qhw             float   %14.0g     qhw        Quarter with/without heatwaves
                year            float   %ty                  
                ---------------------------------------------------------------------------------------------------------------------
                Sorted by:  quarter
                Last edited by Thong Nguyen; 28 Mar 2016, 18:12.

                Comment


                • #9
                  Thanks for explaining the variables. I think I would just report descriptive statistics. I really don't think that quarterly data like this can shed light on associations between heatwaves and heart attacks.

                  Comment


                  • #10
                    Thanks for all of your advice. That's what I worried about, but you made it clearer for me.

                    Comment

                    Working...
                    X