Margins after Poisson

Colin Fischbacher

Join Date: Mar 2015

Posts: 7
#1

Margins after Poisson

24 Apr 2015, 09:56

I want to produce predicted rates (with CIs) by exposure categories, adjusted for other covariates. I used -poisson- then -margins- (in Stata 13.1) but the results from my data are not what I expected. The reference manual (p1153) says that "the marginal mean . . for males is the predicted mean of the dependent variable where every observation is treated as if it represents a male" etc. I can reproduce my problem using the Doll & Hill dataset:

Code:

webuse dollhill3, clear poisson deaths i.smokes agecat, irr exp(pyears) margins smokes, predict(ir) // produces rates of 6.77/1000 in non-smokers, 10.16 in smokers * calculate predicted deaths and rates by hand from model coefficients plus data gen pred_smoker_deaths = exp(_b[1.smokes] + (_b[agecat]*agecat) + _b[_cons] + ln(1)) * pyears // no of deaths in cohort if all smoked gen pred_nsmoker_deaths = exp((_b[agecat]*agecat) + _b[_cons] + ln(1)) * pyears // no of deaths in the cohort if nobody smoked table smokes, c(sum deaths sum pred_smoker_deaths sum pred_nsmoker_deaths sum pyears) row // show observed and predicted deaths preserve // then calculate rates using the predicted deaths collapse (sum) pred_smoker_deaths (sum) pred_nsmoker_deaths (sum) pyears gen pred_rate_smo = pred_smoker_deaths/pyears gen pred_rate_nsmo = pred_nsmoker_deaths/pyears list // gives rates of 2.87/1000 in non-smokers, 4.31 in smokers, NOT same as margins restore

The crude death rates in smokers and non-smokers in the Doll and Hill data are 2.58 and 4.43 /1000 in non-smokers and smokers. The smokers are older, so I would their expect age-adjusted rates to be lower than their crude rates (and the opposite for non-smokers). This is what I see when calculating adjusted rates by hand, but not what margins produces. The rates from margins are double the crude rates and I'm not sure how to use them. I've obviously misunderstood margins - can someone help?

Last edited by Colin Fischbacher; 24 Apr 2015, 10:13.
Tags: None

1 like
Scott Merryman

Join Date: Mar 2014

Posts: 895
#2

24 Apr 2015, 11:02

When you are calculating the pred_smoker_deaths and pred_nsmoker_deaths the exposure should be exp(1)*pyears
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30111

24 Apr 2015, 11:12

To make it a little clearer, let's predict the number of events, rather than the incidence rates.

Code:

.
. webuse dollhill3, clear
(Doll and Hill (1966))

. poisson deaths i.smokes agecat, irr exp(pyears)

Iteration 0:   log likelihood = -62.200274
Iteration 1:   log likelihood = -62.125036
Iteration 2:   log likelihood =  -62.12501
Iteration 3:   log likelihood =  -62.12501

Poisson regression                              Number of obs     =         10
                                                LR chi2(2)        =     865.89
                                                Prob > chi2       =     0.0000
Log likelihood =  -62.12501                     Pseudo R2         =     0.8745

------------------------------------------------------------------------------
      deaths |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.smokes |   1.501358   .1609397     3.79   0.000     1.216855    1.852379
      agecat |   2.306736   .0669993    28.78   0.000     2.179088    2.441862
       _cons |    .000298   .0000415   -58.28   0.000     .0002268    .0003916
  ln(pyears) |          1  (exposure)
------------------------------------------------------------------------------

. margins smokes, predict(n)

Predictive margins                              Number of obs     =         10
Model VCE    : OIM

Expression   : Predicted number of events, predict(n)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
          0  |   52.06201   5.180823    10.05   0.000     41.90778    62.21623
          1  |   78.16372   3.114217    25.10   0.000     72.05996    84.26747
------------------------------------------------------------------------------

. * calculate predicted deaths and rates by hand from model coefficients plus data
. gen pred_smoker_deaths = exp(_b[1.smokes] + (_b[agecat]*agecat) + _b[_cons]) * pyears // no of deaths in cohort if all smoked

. gen pred_nsmoker_deaths = exp((_b[agecat]*agecat) + _b[_cons]) * pyears // no of deaths in the cohort if nobody smoked

. table smokes, c(sum deaths sum pred_smoker_deaths sum pred_nsmoker_deaths sum pyears) row // show observed and predicted deaths

----------------------------------------------------------------------
whether   |
person    |
smokes    |   sum(deaths)  sum(pred_s~s)  sum(pred_n~s)    sum(pyears)
----------+-----------------------------------------------------------
        0 |           101       151.6372            101          39220
        1 |           630            630       419.6201         142247
          |
    Total |           731       781.6372       520.6201         181467
----------------------------------------------------------------------

Notice that the output of -margins- is now exactly your calculation of the predicted numbers of deaths divided by 10. And, not at all coincidentally, there are 10 observations in this data set. -margins- is calculating the average of the predicted values over the observations, not the sum.

To complicate matters further, when you ran -margins- with -predict(ir)-, according to the online help for poisson postestimation, -predict(ir)- calculates exp(xb)--which does not take the exposure (pyears) into account.

So the margins you were calculating were something very different from what you were trying to estimate in these two ways.

Finally, as an aside, I would not do the age adjustment by including agecat as if it were a continuous variable. I would enter it as i.agecat. That will, of course, change the results somewhat--but this has no bearing on the problem you raised in your post.

Comment

Colin Fischbacher

Join Date: Mar 2015

Posts: 7
#4

24 Apr 2015, 11:16

Thanks for the quick response. Changing the code to:

Code:

gen pred_smoker_deaths = exp(_b[1.smokes] + (_b[agecat]*agecat) + _b[_cons] + ln(1)) * exp(1) * pyears

. . gives 11.71/1000 for smokers, but this still doesn't match the output of -margins- (10.16)

Also, this still leaves me puzzled about the output from -margins- and wondering if it can produce what I'm looking for.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#5

24 Apr 2015, 11:18

@Scott Merryman

I don't agree. What Colin wrote is a bit odd, but accurate. He has ln(1), which is just zero, inside the parentheses of exp(), and then the whole thing is multiplied by pyears. So that's right: take the rate and multiply by the exposure. It's just a little strange to write zero as ln(1).

Alternatively, he could have put 1*ln(pyears) inside the parentheses of exp() and nothing after that.

But exp(1)* pyears would not be correct in any case.
Comment

Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014
Posts: 700

24 Apr 2015, 12:28

If Colin is willing to treat the sum of pyears as a fixed (non-stochastic) quantity, he can
give margins an expression for the rate(s) of interest. Using the advice for the predict
option, here is the setup and call for margins.

Code:

sum pyears if e(sample)
scalar denom = r(sum)
margins smokes, exp(predict(n)/scalar(denom))

Here is the output:

Code:

. sum pyears if e(sample)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      pyears |         10     18146.7     17762.4       1462      52407

. scalar denom = r(sum)

. margins smokes, exp(predict(n)/scalar(denom))

Predictive margins                              Number of obs     =         10
Model VCE    : OIM

Expression   : predict(n)/scalar(denom)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
          0  |   .0002869   .0000285    10.05   0.000     .0002309    .0003429
          1  |   .0004307   .0000172    25.10   0.000     .0003971    .0004644
------------------------------------------------------------------------------

Comment

Colin Fischbacher

Join Date: Mar 2015

Posts: 7
#7

24 Apr 2015, 14:17

Thanks for all these very helpful responses which have just saved me hours of work. I now also get confidence intervals from -margins-, an advantage over doing the calculations by hand. I agree with Clyde that fitting agecat as a continuous variable would not be a good idea - my own data have a more complex model but I was trying to make my example simple. Also agree I don't need the ln(1).

I assumed that predict(ir) would correctly calculate a rate, but it seems it doesn't when each observation represents a different amount of exposure. (Is my understanding correct here?) This is something I didn't pick up from the otherwise excellent manual.

The method you've explained provides predictions of rates which have been adjusted for age - just what I wanted. However it would also be useful to show the incidence rates predicted by the age adjusted model for each smoking category. These would be based on the observed age of each group. These would be useful to compare model predictions with the observed rates. I thought that this might work:

Code:

webuse dollhill3, clear poisson deaths i.smokes agecat, irr exp(pyears) sum pyears if e(sample) scalar denom = r(sum) margins smokes, at((asobs)agecat) exp(predict(n)/scalar(denom))

but in fact the output is identical to the output Jeff shows above, so I'm obviously doing something wrong. Would appreciate any further comments.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#8

24 Apr 2015, 14:33

You have the wrong syntax for making agecat vary over its values. What you want is

Code:

margins smokes, at((asobs)agecat = (1 2 3 4 5)) exp(predict(n)/scalar(denom))
Comment
Colin Fischbacher

Join Date: Mar 2015

Posts: 7
#9

24 Apr 2015, 15:26

Clyde - thanks for this. Your suggestion seems to work nicely to show results separately for each value of agecat, but I want a single predicted rate for smokes==1 at the values of agecat observed in the smoker group (and same for non-smokers). Is there a way of doing that? Thanks again for your help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#10

24 Apr 2015, 15:34

Oh, I misunderstood what you were looking for. I think what you want is this:

Code:

margins, over(smokes) exp(predict(N)/scalar(denom))
Comment
Colin Fischbacher

Join Date: Mar 2015

Posts: 7
#11

24 Apr 2015, 15:45

Thanks again Clyde for the quick response. This syntax produces rates of 0.11 and 0.69 /1000 in non-smokers and smokers respectively, an implausibly poor fit to the observed rates of 2.58 and 4.43, so I think there must still be something wrong here. Just to confirm, the code I ran was as follows:

Code:

webuse dollhill3, clear poisson deaths i.smokes agecat, irr exp(pyears) sum pyears if e(sample) scalar denom = r(sum) margins, over(smokes) exp(predict(n)/scalar(denom))

which I hoped would produce predicted rates at the observed values of agecat.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#12

24 Apr 2015, 16:39

So before I lunge at another misunderstanding, let me ask to you to confirm what you are trying to estimate with margins. Are you looking for estimates of the smoking-specific mortality rates, not adjusted for age? That is, a model estimate of sum(deaths among smokers)/sum(person years among smokers), and the analogous figure for non-smokers?
Comment
Colin Fischbacher

Join Date: Mar 2015

Posts: 7
#13

25 Apr 2015, 03:54

I want the predicted rates from the full model (that includes age as well as smokes) to check against observed rates. I originally thought I should use the -predict- command, and this can certainly generate predicted deaths for smokers and non-smokers from a model that includes age. I can divide the total predicted deaths by pyears totals to calculate predicted rates for smokers and non-smokers, but this doesn't give me standard errors. I know -predict- can produce standard errors (using the stdp option) for each observation, but I want SEs for smokers and non-smokers. I thought -margins- might do this, but I might be looking at the wrong command.Thanks again for your patience with this!

Last edited by Colin Fischbacher; 25 Apr 2015, 03:56.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30111

#14

25 Apr 2015, 10:00

Well, I'm not sure what predicted rates from the full model means, precisely. The model estimates the incidence rate conditional on agecat and smokes. So that incidence rate varies across the observations.

Code:

. margins smokes, predict(ir) at(agecat = (1 2 3 4))

Adjusted predictions                            Number of obs     =         10
Model VCE    : OIM

Expression   : Predicted incidence rate, predict(ir)

1._at        : agecat          =           1

2._at        : agecat          =           2

3._at        : agecat          =           3

4._at        : agecat          =           4

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _at#smokes |
        1 0  |   .0006875    .000083     8.28   0.000     .0005247    .0008502
        1 1  |   .0010321   .0000832    12.41   0.000     .0008691    .0011952
        2 0  |   .0015858   .0001697     9.34   0.000     .0012532    .0019184
        2 1  |   .0023809   .0001362    17.48   0.000      .002114    .0026478
        3 0  |    .003658   .0003659    10.00   0.000     .0029408    .0043753
        3 1  |    .005492   .0002285    24.04   0.000     .0050442    .0059399
        4 0  |   .0084381   .0008543     9.88   0.000     .0067637    .0101126
        4 1  |   .0126687    .000549    23.07   0.000     .0115926    .0137448
------------------------------------------------------------------------------

If you want summary statistics for the incidence rate, aggregating the different age categories up to all smokers and all non-smokers, then you need to specify how you want to aggregate.

One way to do it is to treat the entire data set as if they were all smokers (resp. non-smokers) and calculate the average estimated rate over all observations:

Code:

. margins smokes, predict(ir)

Predictive margins                              Number of obs     =         10
Model VCE    : OIM

Expression   : Predicted incidence rate, predict(ir)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
          0  |   .0067668   .0006991     9.68   0.000     .0053967    .0081369
          1  |   .0101594   .0004841    20.99   0.000     .0092106    .0111081
------------------------------------------------------------------------------

Or, you could do this just among the smokers and just among the non-smokers, in each case omitting data on the other group. In this case, because of the balance of smoking and age category distributions in the data, you get the same result.

Another possibility is to calculate the predicted number of deaths in each smoking category, aggregate that over the age categories and divide by the total person years in that category. This would be a population averaged estimate of the rate:

Code:

. egen sm_spec_pyears = mean(pyears), by(smokes)

.
. margins, over(smokes) exp(predict(n)/sm_spec_pyears)

Predictive margins                              Number of obs     =         10
Model VCE    : OIM

Expression   : predict(n)/sm_spec_pyears
over         : smokes

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
          0  |   .0025752   .0002562    10.05   0.000      .002073    .0030774
          1  |   .0044289   .0001765    25.10   0.000     .0040831    .0047748
------------------------------------------------------------------------------

There are other possibilities as well, basically varying according to how you want to weight the different age categories when you aggregate up--these would require some specific calculation analogous to the -egen- statement above.

But, I think that all of these possibilities have a legitimate claim to the title "model-predicted estimate of the mortality rate among smokers and non-smokers." The Poisson regression gets you the age-smoking specific incidence rate estimates: how you choose to aggregate them into larger groups is exogenous to the Poisson model itself. The one I have shown last comes closest to (and in fact matches) the observed mortality rates in the sample.

Hope this helps.

Last edited by Clyde Schechter; 25 Apr 2015, 10:04.

Comment

Colin Fischbacher

Join Date: Mar 2015
Posts: 7

#15

26 Apr 2015, 08:09

Thanks Clyde, that's extremely helpful. I was looking for your second option - predicted values at the observed values of the data (ie not fixed at specific levels). I was able to confirm your result using the predict command in the following code:

Code:

webuse dollhill3, clear
poisson deaths i.smokes agecat, irr exp(pyears)
predict pred_death  // predict deaths
predict pred_se, stdp // standard error in smoking/age groups
egen sum_pred_deaths = sum(pred_death) if e(sample), by(smokes)  // sum of predicted deaths
egen sum_pyears = sum(pyears) if e(sample), by(smokes) // sum of exposure over smoking groups
gen pred_rate = sum_pred_deaths/sum_pyears // calculated predicted rate
list, noobs

This produces the same figures you showed above using -margins- as the following extract of output shows:

Code:

. margins, over(smokes) exp(predict(n)/sm_spec_pyears)

Predictive margins                                Number of obs   =         10
Model VCE    : OIM

Expression   : predict(n)/sm_spec_pyears
over         : smokes

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |
          0  |   .0025752   .0002562    10.05   0.000      .002073    .0030774
          1  |   .0044289   .0001765    25.10   0.000     .0040831    .0047748
------------------------------------------------------------------------------
.
. * check this using predict
. predict pred_death  // predict deaths
(option n assumed; predicted number of events)

. predict pred_se, stdp // standard error in smoking/age groups

. egen sum_pred_deaths = sum(pred_death) if e(sample), by(smokes)

. egen sum_pyears = sum(pyears) if e(sample), by(smokes)

. gen pred_rate = sum_pred_deaths/sum_pyears

. list, noobs

  +-----------------------------------------------------------------------------------------------------+
  | agecat   smokes   deaths   pyears   sm_spe~s   pred_d~h    pred_se   sum_pr~s   sum_py~s   pred_r~e |
  |-----------------------------------------------------------------------------------------------------|
  |  35-44        1       32   52,407    28449.4   54.09114    .080612        630     142247   .0044289 |
  |  45-54        1      104   43,248    28449.4   102.9677    .057193        630     142247   .0044289 |
  |  55-64        1      206   28,612    28449.4   157.1379   .0416056        630     142247   .0044289 |
  |  65-74        1      186   12,663    28449.4   160.4232   .0433386        630     142247   .0044289 |
  |  75-84        1      102    5,317    28449.4   155.3801   .0609317        630     142247   .0044289 |
  |-----------------------------------------------------------------------------------------------------|
  |  35-44        0        2   18,790       7844   12.91752   .1207646        101      39220   .0025752 |
  |  45-54        0       12   10,673       7844   16.92532    .107016        101      39220   .0025752 |
  |  55-64        0       28    5,710       7844   20.88741   .1000398        101      39220   .0025752 |
  |  65-74        0       28    2,585       7844   21.81257    .101246        101      39220   .0025752 |
  |  75-84        0       31    1,462       7844   28.45717   .1103665        101      39220   .0025752 |
  +-----------------------------------------------------------------------------------------------------+

The figures using predict (the last column of the final list) are exactly the same as those using margins, but as previously mentioned, the advantage of -margins- here is that you get a standard error/ CI for the smoking groups. Thanks again for your help.

Last edited by Colin Fischbacher; 26 Apr 2015, 08:14.

Announcement