estimating incidence rates from GLM

Jessica Harding

Join Date: Sep 2014

Posts: 22
#1

estimating incidence rates from GLM

09 Nov 2018, 10:09

Hi Stata forum,

I want to estimate annual incidence rates of amputation (lea) in a dataset of people with kidney disease.

I have used a log poisson generalized linear model with robust standard errors and included an offset term with log exposure time. The code I have used is below. I am not sure, however, how to now estimate incidence rates (per year) as a post estimation command?

stset dox1, fail(lea1==1) origin(born) entry (entry) scale(365.25) id(id)
stsplit _year, after(time=d(1/1/2000)) at(0(1)15) trim
replace _year=2000 + _year
gen _y=_t-_t0
gen log_y = ln(_y)

glm lea1 _year i.age i.sex i.race, family(poisson) link(log) offset(log_y) vce(robust)

Thanks in advance
Jess
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

09 Nov 2018, 10:47

The -stset- and -stsplit- commands do nothing relevant here. Perhaps you will be doing survival analysis later, and these are preparation for that. But if your analysis is just going to be a Poisson model, then you don't need those.

Also, you can simplify the code by using -poisson- instead of -glm-. The results will be the same, but you will have less work to do in the coding. Also, the interpretation after will be easier. Finally, if you want annual incidence rates, and if you are not imposing the assumption of a strict linear trend over the time period in your data, you should treat year as a discrete variable.

Code:

gen exposure = dox1 - born poisson lea1 i.(_year age sex race), exposure(exposure) vce(robust) irr margins _year, ir

The -irr- option in the -poisson- command will cause Stata to display incidence rate ratios instead of regression coefficients in the output. The -margins- command output will show you the adjusted incident rates for each year.

Note: I have followed you in using born as the start of the exposure period. But I question whether this is what you want. Since you are interested in the incidence of amputation in chronic renal disease, I would think that it would make more sense to start the exposure period at the time of diagnosis of chronic kidney failure. I realize that the onset date of renal failure may not be very precisely known in many cases, but I still think that would be better than adding in the entire rest of the patients' lifetimes before they had CRD.
Comment
Jessica Harding

Join Date: Sep 2014

Posts: 22
#3

09 Nov 2018, 15:41

Hi Clyde

Thanks for your response.

1. I have one large cohort of people with kidney disease - followed over time. I want to know, each year, what the rates of amputations are. Without splitting the data, I am not clear how to estimate "_year" (i.e. the population at risk for amputation for each given year). You'll note that the variable _year was generated from the split command.
2. In my stset command, entry is the date of kidney diagnosis, and the origin is dob. I had thought that this would enter people into my model at date of kidney diagnosis, whilst using age as the time scale. Is this incorrect?

Jess
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

09 Nov 2018, 16:14

I think I made some assumptions about your data that are probably wrong.

Please show an example of your data, as it was before the -stset- command. Please use the -dataex- command to do that. Also provide an explanation of what each variable means. (I didn't guess that entry referred to the date of kidney diagnosis. I took it to be the date at which the person was accrued into your study.)

Also, let's clarify whether you are trying to estimate amputation rates as a function of year of duration of chronic kidney disease, or whether you are trying to estimate amputation rates a function of calendar year.

With that, I think I can help you.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

10 Nov 2018, 07:44

Assuming your stset and stsplit is correct: First, you may collapse summing _d and person years by year before the calculation of rates.

Considering

how to now estimate incidence rates (per year) as a post estimation command?

My answer below is not using a post estimation command. A reparameterization of the poisson regression will estimate incidence rates directly. Below you find an example illustrating this using aggregated data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long period double(personyears cases rate_unadjusted)
4   11583823 6036 52.1
5 12093755.5 6565 54.3
6   12760934 7572 59.3
end
label values period period
label def period 4 "2002–2006", modify
label def period 5 "2007–2011", modify
label def period 6 "2012–2016", modify
label var period "period of diagnosis"
label var personyears "Population risk time (person years)"
label var cases "New cases"
label var rate_unadjusted "Incidence rate (unadjusted), per 100 000 person years"

Code:

. des

Contains data
  obs:             3                          
 vars:             4                          
 size:            84                          
-------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------
period          long    %12.0g     period     period of diagnosis
personyears     double  %10.0g                Population risk time (person years)
cases           double  %10.0g                New cases
rate_unadjusted double  %10.0g                Incidence rate (unadjusted), per 100 000 person years
-------------------------------------------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. list

     +-----------------------------------------+
     |    period   person~s   cases   rate_u~d |
     |-----------------------------------------|
  1. | 2002–2006   11583823    6036       52.1 |
  2. | 2007–2011   12093756    6565       54.3 |
  3. | 2012–2016   12760934    7572       59.3 |
     +-----------------------------------------+

Code:

* To get rates per 100,000 person years, scale the offset:

gen pyr100th = personyears * 10^-5

glm cases ibn.period , f(poi) exposure(pyr100th) nocons eform nolog noheader

* alternative tabulation avoiding the IRR-itating label:

_coef_table , eform nopvalues cformat(%3.1f)

Code:

. glm cases ibn.period , f(poi) exposure(pyr100th) nocons eform nolog noheader

------------------------------------------------------------------------------
             |                 OIM
       cases |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      period |
  2002–2006  |   52.10715   .6706914   307.14   0.000     50.80906     53.4384
  2007–2011  |   54.28421   .6699713   323.63   0.000     52.98685    55.61334
  2012–2016  |   59.33735   .6819034   355.31   0.000     58.01578    60.68902
ln(pyr100th) |          1  (exposure)
------------------------------------------------------------------------------

.
. * alternative tabulation avoiding the IRR-itating label:
.
. _coef_table , eform nopvalues cformat(%3.1f)
--------------------------------------------------------------
             |                 OIM
       cases |     exp(b)   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      period |
  2002–2006  |       52.1        0.7          50.8        53.4
  2007–2011  |       54.3        0.7          53.0        55.6
  2012–2016  |       59.3        0.7          58.0        60.7
ln(pyr100th) |        1.0  (exposure)
--------------------------------------------------------------

If I remember correct this is described in https://www.stata.com/bookstore/esse...al-statistics/

Last edited by Bjarte Aagnes; 10 Nov 2018, 08:33.

Comment

Jessica Harding

Join Date: Sep 2014
Posts: 22

12 Nov 2018, 14:37

Hi Clyde,

I apologize - I'm having some issues installing dataex as the ado file wont write to the drive specified (stata is installed on a server rather than my hard drive). While I sort that out, I hope you will accept my data example below.

usrds_id

born

diag_date

died

lea

lea_date

dox

sex

smoke

age

1610364

24-Dec-29

1-Jan-00

4-Oct-03

yes

3024592

14-Dec-21

15-Apr-10

28-Dec-13

2099350

29-Feb-44

16-May-03

26-May-08

yes

2279495

17-Oct-45

22-Nov-04

5-Mar-09

2-Oct-06

2888513

15-Dec-53

13-Jun-09

19-Apr-10

yes

Variable descriptions:

diag_date is the date of kindey diagnosis - and the date at which people will enter my model
lea = amputation, 0=no, 1=yes
lea_date, date of amputation procedure
dox - date of exit of model - date of amputation, date of death, or end of follow-up period (31 dec2015), whichever occurred first
dm = diabetes status (0=no, 1=yes, .=missing)
ht = hypertension status (0=no, 1=yes, .=missing)
smoke = smoking status
age is categorical
sex (men=1, women=2)

*dm, ht and smoke all stay the same over time (i..e i only have baseline measurements for these)

The data is in wide format.

I want to estimate amputation rates as a function of calendar year - adjusted for multiple categorical variables.

Thanks for your help

Jess

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

12 Nov 2018, 16:37

So there are two steps required. The first is to convert this from a data set of one observation per person, to a data set of one observation per year including the number of amputations and the number of person-years at risk.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long usrds_id float(born diag_date died) byte lea float(lea_date dox) byte sex long(dm ht smoke) byte age
1610364 25560 14610 15982 0     . 15982 2 2 1 2 3
3024592 22628 18367 19720 0     . 19720 2 1 1 2 4
2099350 30740 15841 17678 0     . 17678 1 2 1 2 2
2279495 31336 16397 17961 1 17076 17076 1 1 1 1 2
2888513 34317 18061 18371 0     . 18371 2 2 1 2 2
end
format %td born
format %td diag_date
format %td died
format %td lea_date
format %td dox
label values dm dm
label def dm 1 "no", modify
label def dm 2 "yes", modify
label values ht ht
label def ht 1 "no", modify
label values smoke smoke
label def smoke 1 ".", modify
label def smoke 2 "no", modify


// EXPAND EACH OBSERVATION TO ONE OBSERVATION PER YEAR FROM ENTRY TO EXIT
gen entry_year = year(diag_date)
gen exit_year = year(dox)
expand exit_year - entry_year + 1
by usrds_id, sort: gen current_year = entry_year + _n - 1
isid usrds_id current_year, sort

//    IDENTIFY TIME AT RISK FOR EACH PERSON IN EACH YEAR
gen at_risk = min(mdy(12, 31, current_year), dox) - max(mdy(1, 1, current_year), diag_date) + 1
replace at_risk = at_risk/cond(mod(current_year, 4), 365, 366)  // CONVERT PERSON-DAYS TO PERSON-YEARS

//    NOW COLLAPSE BACK TO ONE OBSERVATION PER CALENDAR YEAR
collapse (sum) at_risk  (sum) amputations = lea ///
    (first) sex dm ht smoke age, by(current_year)

The next step is to estimate the adjusted rates. Here I show code to do it using Poisson regression for adjustment:

Code:

poisson amputations i.current_year i.sex i.dm i.ht i.smoke i.age, exposure(at_risk) iterate(10)
margins current_year, expression(predict(n)/at_risk)

The output of -margins- will show you the adjusted amputation rates in each calendar year, along with their standard errors and confidence intervals.

Note that the Poisson regression will not run with the example data provided because it contains too few observations, and only one of those instantiates an amputation, so the estimation simply won't converge. But it should not be a problem with an ample set of real data.

Last edited by Clyde Schechter; 12 Nov 2018, 16:47. Reason: Correct code for calculating variable at_risk.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

13 Nov 2018, 07:42

I realized that the approach I took in #7 is not correct for calculating adjusted rates. It would work for crude rates (and skipping the regression and -margins-, just calculating lea/atrisk.) But for adjusted rates, we need to preserve the distinction between individuals, because they differ on the covariates. Fortunately, the correction is simple. Eliminate the -collapse- command. And in the -poisson- command replace amputations by lea. (Also that -iterate(10)- didn't belong their either.)

So from start to finish it should look like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long usrds_id float(born diag_date died) byte lea float(lea_date dox) byte sex long(dm ht smoke) byte age
1610364 25560 14610 15982 0     . 15982 2 2 1 2 3
3024592 22628 18367 19720 0     . 19720 2 1 1 2 4
2099350 30740 15841 17678 0     . 17678 1 2 1 2 2
2279495 31336 16397 17961 1 17076 17076 1 1 1 1 2
2888513 34317 18061 18371 0     . 18371 2 2 1 2 2
end
format %td born
format %td diag_date
format %td died
format %td lea_date
format %td dox
label values dm dm
label def dm 1 "no", modify
label def dm 2 "yes", modify
label values ht ht
label def ht 1 "no", modify
label values smoke smoke
label def smoke 1 ".", modify
label def smoke 2 "no", modify


// EXPAND EACH OBSERVATION TO ONE OBSERVATION PER YEAR FROM ENTRY TO EXIT
gen entry_year = year(diag_date)
gen exit_year = year(dox)
expand exit_year - entry_year + 1
by usrds_id, sort: gen current_year = entry_year + _n - 1
isid usrds_id current_year, sort

//    IDENTIFY TIME AT RISK FOR EACH PERSON IN EACH YEAR
gen at_risk = min(mdy(12, 31, current_year), dox) - max(mdy(1, 1, current_year), diag_date) + 1
replace at_risk = at_risk/cond(mod(current_year, 4), 365, 366)  // CONVERT PERSON-DAYS TO PERSON-YEARS

poisson lea i.current_year i.sex i.dm i.ht i.smoke i.age, exposure(at_risk) 
margins current_year, expression(predict(n)/at_risk)

Now, with the demonstration data, the -margins- command will fail because there are too many unattested combinations of the covariates, but this is probably not going to be a problem with the full data.

Comment

Jessica Harding

Join Date: Sep 2014

Posts: 22
#9

13 Nov 2018, 08:56

Thanks so much clyde,

I wonder if you could explain to me how your approach above is different to the stsplit command?

Jess
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

13 Nov 2018, 09:16

Well, there are a couple of things. The most important is that the -stsplit- approach does not calculate person-days at risk. It calculates years in which a person is at risk for some or all of the year. Also the -at()- option in the -stsplit- command in #1 is incorrect because your time variable is in daily units, whereas what you intended was yearly units.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#11

13 Nov 2018, 10:44

In a private message, Bjarte Aagnes points out that my concern

Also the -at()- option in the -stsplit- command in #1 is incorrect because your time variable is in daily units, whereas what you intended was yearly units.

in #10 is wrong, because the -stset- command in #1 included a scale factor of 365.25, and -stsplit- works with the scaled time units specified in -stset-.
Comment
Jessica Harding

Join Date: Sep 2014

Posts: 22
#12

13 Nov 2018, 13:07

Thanks Clyde - so if the stplit does indeed estimate person-years at risk, is there a difference?

Jess
Comment
Jessica Harding

Join Date: Sep 2014

Posts: 22
#13

13 Nov 2018, 13:13

Also thanks to Bjarte Aagnes for your response - but i think your approach wont work for rates that are adjusted, which is what I need
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#14

13 Nov 2018, 14:22

Re #12. I think I didn't express myself clearly in #10. The person-years calculated by -stsplit- are not suitable for your purpose because they reckon a full person-year for a person, even if he/she only lives 1 day in that year. The -stsplit- approach credits a person who develops chronic kidney disease on 31 December 2001 and dies on 1 Jan 2002 as two person-years, when in fact that person contributes only 2 person-days to the at risk pool. That is, of course, an extreme case, but given that life expectancy with chronic kidney disease is often short, even errors of small fractions of a year are important. The -stsplit- approach will give inflated estimates of person-time at risk and therefore result in underestimation of incidence rates. Even for unadjusted rates, it is important to get that denominator correct.
Comment
Jessica Harding

Join Date: Sep 2014

Posts: 22
#15

13 Nov 2018, 14:42

Thanks Clyde - you've been incredibly helpful
1 like
Comment

Announcement