Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • estimating incidence rates from GLM

    Hi Stata forum,

    I want to estimate annual incidence rates of amputation (lea) in a dataset of people with kidney disease.

    I have used a log poisson generalized linear model with robust standard errors and included an offset term with log exposure time. The code I have used is below. I am not sure, however, how to now estimate incidence rates (per year) as a post estimation command?

    stset dox1, fail(lea1==1) origin(born) entry (entry) scale(365.25) id(id)
    stsplit _year, after(time=d(1/1/2000)) at(0(1)15) trim
    replace _year=2000 + _year
    gen _y=_t-_t0
    gen log_y = ln(_y)

    glm lea1 _year i.age i.sex i.race, family(poisson) link(log) offset(log_y) vce(robust)

    Thanks in advance
    Jess


  • #2
    The -stset- and -stsplit- commands do nothing relevant here. Perhaps you will be doing survival analysis later, and these are preparation for that. But if your analysis is just going to be a Poisson model, then you don't need those.

    Also, you can simplify the code by using -poisson- instead of -glm-. The results will be the same, but you will have less work to do in the coding. Also, the interpretation after will be easier. Finally, if you want annual incidence rates, and if you are not imposing the assumption of a strict linear trend over the time period in your data, you should treat year as a discrete variable.

    Code:
    gen exposure = dox1 - born
    poisson lea1 i.(_year age sex race), exposure(exposure) vce(robust) irr
    margins _year, ir
    The -irr- option in the -poisson- command will cause Stata to display incidence rate ratios instead of regression coefficients in the output. The -margins- command output will show you the adjusted incident rates for each year.

    Note: I have followed you in using born as the start of the exposure period. But I question whether this is what you want. Since you are interested in the incidence of amputation in chronic renal disease, I would think that it would make more sense to start the exposure period at the time of diagnosis of chronic kidney failure. I realize that the onset date of renal failure may not be very precisely known in many cases, but I still think that would be better than adding in the entire rest of the patients' lifetimes before they had CRD.

    Comment


    • #3
      Hi Clyde

      Thanks for your response.

      1. I have one large cohort of people with kidney disease - followed over time. I want to know, each year, what the rates of amputations are. Without splitting the data, I am not clear how to estimate "_year" (i.e. the population at risk for amputation for each given year). You'll note that the variable _year was generated from the split command.
      2. In my stset command, entry is the date of kidney diagnosis, and the origin is dob. I had thought that this would enter people into my model at date of kidney diagnosis, whilst using age as the time scale. Is this incorrect?

      Jess

      Comment


      • #4
        I think I made some assumptions about your data that are probably wrong.

        Please show an example of your data, as it was before the -stset- command. Please use the -dataex- command to do that. Also provide an explanation of what each variable means. (I didn't guess that entry referred to the date of kidney diagnosis. I took it to be the date at which the person was accrued into your study.)

        Also, let's clarify whether you are trying to estimate amputation rates as a function of year of duration of chronic kidney disease, or whether you are trying to estimate amputation rates a function of calendar year.

        With that, I think I can help you.

        Comment


        • #5
          Assuming your stset and stsplit is correct: First, you may collapse summing _d and person years by year before the calculation of rates.

          Considering
          how to now estimate incidence rates (per year) as a post estimation command?
          My answer below is not using a post estimation command. A reparameterization of the poisson regression will estimate incidence rates directly. Below you find an example illustrating this using aggregated data:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long period double(personyears cases rate_unadjusted)
          4   11583823 6036 52.1
          5 12093755.5 6565 54.3
          6   12760934 7572 59.3
          end
          label values period period
          label def period 4 "2002–2006", modify
          label def period 5 "2007–2011", modify
          label def period 6 "2012–2016", modify
          label var period "period of diagnosis"
          label var personyears "Population risk time (person years)"
          label var cases "New cases"
          label var rate_unadjusted "Incidence rate (unadjusted), per 100 000 person years"
          Code:
          . des
          
          Contains data
            obs:             3                          
           vars:             4                          
           size:            84                          
          -------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          -------------------------------------------------------------------------------------------------------------------
          period          long    %12.0g     period     period of diagnosis
          personyears     double  %10.0g                Population risk time (person years)
          cases           double  %10.0g                New cases
          rate_unadjusted double  %10.0g                Incidence rate (unadjusted), per 100 000 person years
          -------------------------------------------------------------------------------------------------------------------
          Sorted by:
               Note: Dataset has changed since last saved.
          
          . list
          
               +-----------------------------------------+
               |    period   person~s   cases   rate_u~d |
               |-----------------------------------------|
            1. | 2002–2006   11583823    6036       52.1 |
            2. | 2007–2011   12093756    6565       54.3 |
            3. | 2012–2016   12760934    7572       59.3 |
               +-----------------------------------------+
          Code:
          * To get rates per 100,000 person years, scale the offset:
          
          gen pyr100th = personyears * 10^-5
          
          glm cases ibn.period , f(poi) exposure(pyr100th) nocons eform nolog noheader
          
          * alternative tabulation avoiding the IRR-itating label:
          
          _coef_table , eform nopvalues cformat(%3.1f)
          Code:
          . glm cases ibn.period , f(poi) exposure(pyr100th) nocons eform nolog noheader
          
          ------------------------------------------------------------------------------
                       |                 OIM
                 cases |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                period |
            2002–2006  |   52.10715   .6706914   307.14   0.000     50.80906     53.4384
            2007–2011  |   54.28421   .6699713   323.63   0.000     52.98685    55.61334
            2012–2016  |   59.33735   .6819034   355.31   0.000     58.01578    60.68902
          ln(pyr100th) |          1  (exposure)
          ------------------------------------------------------------------------------
          
          .
          . * alternative tabulation avoiding the IRR-itating label:
          .
          . _coef_table , eform nopvalues cformat(%3.1f)
          --------------------------------------------------------------
                       |                 OIM
                 cases |     exp(b)   Std. Err.     [95% Conf. Interval]
          -------------+------------------------------------------------
                period |
            2002–2006  |       52.1        0.7          50.8        53.4
            2007–2011  |       54.3        0.7          53.0        55.6
            2012–2016  |       59.3        0.7          58.0        60.7
          ln(pyr100th) |        1.0  (exposure)
          --------------------------------------------------------------
          If I remember correct this is described in https://www.stata.com/bookstore/esse...al-statistics/
          Last edited by Bjarte Aagnes; 10 Nov 2018, 09:33.

          Comment


          • #6
            Hi Clyde,

            I apologize - I'm having some issues installing dataex as the ado file wont write to the drive specified (stata is installed on a server rather than my hard drive). While I sort that out, I hope you will accept my data example below.

            usrds_id born diag_date died lea lea_date dox sex dm ht smoke age
            1610364 24-Dec-29 1-Jan-00 4-Oct-03 0 . 4-Oct-03 2 yes no no 3
            3024592 14-Dec-21 15-Apr-10 28-Dec-13 0 . 28-Dec-13 2 no no no 4
            2099350 29-Feb-44 16-May-03 26-May-08 0 . 26-May-08 1 yes no no 2
            2279495 17-Oct-45 22-Nov-04 5-Mar-09 1 2-Oct-06 2-Oct-06 1 no no . 2
            2888513 15-Dec-53 13-Jun-09 19-Apr-10 0 . 19-Apr-10 2 yes no no 2
            Variable descriptions:

            diag_date is the date of kindey diagnosis - and the date at which people will enter my model
            lea = amputation, 0=no, 1=yes
            lea_date, date of amputation procedure
            dox - date of exit of model - date of amputation, date of death, or end of follow-up period (31 dec2015), whichever occurred first
            dm = diabetes status (0=no, 1=yes, .=missing)
            ht = hypertension status (0=no, 1=yes, .=missing)
            smoke = smoking status
            age is categorical
            sex (men=1, women=2)

            *dm, ht and smoke all stay the same over time (i..e i only have baseline measurements for these)


            The data is in wide format.

            I want to estimate amputation rates as a function of calendar year - adjusted for multiple categorical variables.

            Thanks for your help

            Jess





            Comment


            • #7
              So there are two steps required. The first is to convert this from a data set of one observation per person, to a data set of one observation per year including the number of amputations and the number of person-years at risk.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long usrds_id float(born diag_date died) byte lea float(lea_date dox) byte sex long(dm ht smoke) byte age
              1610364 25560 14610 15982 0     . 15982 2 2 1 2 3
              3024592 22628 18367 19720 0     . 19720 2 1 1 2 4
              2099350 30740 15841 17678 0     . 17678 1 2 1 2 2
              2279495 31336 16397 17961 1 17076 17076 1 1 1 1 2
              2888513 34317 18061 18371 0     . 18371 2 2 1 2 2
              end
              format %td born
              format %td diag_date
              format %td died
              format %td lea_date
              format %td dox
              label values dm dm
              label def dm 1 "no", modify
              label def dm 2 "yes", modify
              label values ht ht
              label def ht 1 "no", modify
              label values smoke smoke
              label def smoke 1 ".", modify
              label def smoke 2 "no", modify
              
              
              // EXPAND EACH OBSERVATION TO ONE OBSERVATION PER YEAR FROM ENTRY TO EXIT
              gen entry_year = year(diag_date)
              gen exit_year = year(dox)
              expand exit_year - entry_year + 1
              by usrds_id, sort: gen current_year = entry_year + _n - 1
              isid usrds_id current_year, sort
              
              //    IDENTIFY TIME AT RISK FOR EACH PERSON IN EACH YEAR
              gen at_risk = min(mdy(12, 31, current_year), dox) - max(mdy(1, 1, current_year), diag_date) + 1
              replace at_risk = at_risk/cond(mod(current_year, 4), 365, 366)  // CONVERT PERSON-DAYS TO PERSON-YEARS
              
              //    NOW COLLAPSE BACK TO ONE OBSERVATION PER CALENDAR YEAR
              collapse (sum) at_risk  (sum) amputations = lea ///
                  (first) sex dm ht smoke age, by(current_year)
              The next step is to estimate the adjusted rates. Here I show code to do it using Poisson regression for adjustment:

              Code:
              poisson amputations i.current_year i.sex i.dm i.ht i.smoke i.age, exposure(at_risk) iterate(10)
              margins current_year, expression(predict(n)/at_risk)
              The output of -margins- will show you the adjusted amputation rates in each calendar year, along with their standard errors and confidence intervals.

              Note that the Poisson regression will not run with the example data provided because it contains too few observations, and only one of those instantiates an amputation, so the estimation simply won't converge. But it should not be a problem with an ample set of real data.

              Last edited by Clyde Schechter; 12 Nov 2018, 17:47. Reason: Correct code for calculating variable at_risk.

              Comment


              • #8
                I realized that the approach I took in #7 is not correct for calculating adjusted rates. It would work for crude rates (and skipping the regression and -margins-, just calculating lea/atrisk.) But for adjusted rates, we need to preserve the distinction between individuals, because they differ on the covariates. Fortunately, the correction is simple. Eliminate the -collapse- command. And in the -poisson- command replace amputations by lea. (Also that -iterate(10)- didn't belong their either.)

                So from start to finish it should look like this:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long usrds_id float(born diag_date died) byte lea float(lea_date dox) byte sex long(dm ht smoke) byte age
                1610364 25560 14610 15982 0     . 15982 2 2 1 2 3
                3024592 22628 18367 19720 0     . 19720 2 1 1 2 4
                2099350 30740 15841 17678 0     . 17678 1 2 1 2 2
                2279495 31336 16397 17961 1 17076 17076 1 1 1 1 2
                2888513 34317 18061 18371 0     . 18371 2 2 1 2 2
                end
                format %td born
                format %td diag_date
                format %td died
                format %td lea_date
                format %td dox
                label values dm dm
                label def dm 1 "no", modify
                label def dm 2 "yes", modify
                label values ht ht
                label def ht 1 "no", modify
                label values smoke smoke
                label def smoke 1 ".", modify
                label def smoke 2 "no", modify
                
                
                // EXPAND EACH OBSERVATION TO ONE OBSERVATION PER YEAR FROM ENTRY TO EXIT
                gen entry_year = year(diag_date)
                gen exit_year = year(dox)
                expand exit_year - entry_year + 1
                by usrds_id, sort: gen current_year = entry_year + _n - 1
                isid usrds_id current_year, sort
                
                //    IDENTIFY TIME AT RISK FOR EACH PERSON IN EACH YEAR
                gen at_risk = min(mdy(12, 31, current_year), dox) - max(mdy(1, 1, current_year), diag_date) + 1
                replace at_risk = at_risk/cond(mod(current_year, 4), 365, 366)  // CONVERT PERSON-DAYS TO PERSON-YEARS
                
                poisson lea i.current_year i.sex i.dm i.ht i.smoke i.age, exposure(at_risk) 
                margins current_year, expression(predict(n)/at_risk)
                Now, with the demonstration data, the -margins- command will fail because there are too many unattested combinations of the covariates, but this is probably not going to be a problem with the full data.

                Comment


                • #9
                  Thanks so much clyde,

                  I wonder if you could explain to me how your approach above is different to the stsplit command?

                  Jess

                  Comment


                  • #10
                    Well, there are a couple of things. The most important is that the -stsplit- approach does not calculate person-days at risk. It calculates years in which a person is at risk for some or all of the year. Also the -at()- option in the -stsplit- command in #1 is incorrect because your time variable is in daily units, whereas what you intended was yearly units.

                    Comment


                    • #11
                      In a private message, Bjarte Aagnes points out that my concern
                      Also the -at()- option in the -stsplit- command in #1 is incorrect because your time variable is in daily units, whereas what you intended was yearly units.
                      in #10 is wrong, because the -stset- command in #1 included a scale factor of 365.25, and -stsplit- works with the scaled time units specified in -stset-.

                      Comment


                      • #12
                        Thanks Clyde - so if the stplit does indeed estimate person-years at risk, is there a difference?

                        Jess

                        Comment


                        • #13
                          Also thanks to Bjarte Aagnes for your response - but i think your approach wont work for rates that are adjusted, which is what I need

                          Comment


                          • #14
                            Re #12. I think I didn't express myself clearly in #10. The person-years calculated by -stsplit- are not suitable for your purpose because they reckon a full person-year for a person, even if he/she only lives 1 day in that year. The -stsplit- approach credits a person who develops chronic kidney disease on 31 December 2001 and dies on 1 Jan 2002 as two person-years, when in fact that person contributes only 2 person-days to the at risk pool. That is, of course, an extreme case, but given that life expectancy with chronic kidney disease is often short, even errors of small fractions of a year are important. The -stsplit- approach will give inflated estimates of person-time at risk and therefore result in underestimation of incidence rates. Even for unadjusted rates, it is important to get that denominator correct.

                            Comment


                            • #15
                              Thanks Clyde - you've been incredibly helpful

                              Comment

                              Working...
                              X