Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Estimating population at risk using longitudinal data

    Hi,

    The example given and attached are simulated datasets. The original data is longitudinal (approximately 100k observations) spanning across multiple years, where an individual (age>=35) with COPD diagnosis remains in the study sample until they leave the region or die. I am trying to estimate population at risk, i.e. age>=35 and without COPD diagnosis. I can do it in the excel by estimating prevalence and population separately. I would have done but I need to do multiple crosstabs e.g., age, gender, age-gender, age-gender-health region. Thus, I tried estimate the totals (prevalence, population at risk) and then export the numbers. In the codes below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long id str1 gender byte healthregion float(copd year age2012 age2013 age2014 category_age2012 category_age2013 category_age2014)
      1 "F" 3 . 2012     80.25     81.25     82.25 4 4 4
      2 ""  . .    .      86.5      87.5      88.5 4 4 4
      3 "M" 1 1 2013        20        21        22 1 1 1
      4 ""  . .    .     50.75     51.75     52.75 3 3 3
      5 ""  . .    . 64.333336 65.333336 66.333336 4 4 4
      6 ""  . .    . 106.91666 107.91666 108.91666 4 4 4
      7 ""  . .    .     63.25     64.25     65.25 3 4 4
      8 ""  . .    .        28        29        30 1 1 1
      9 "M" 2 . 2012  48.08333  49.08333  50.08333 2 3 3
     10 ""  . .    .     63.25     64.25     65.25 3 4 4
     11 ""  . .    .      45.5      46.5      47.5 2 2 2
     12 ""  . .    .  59.41667  60.41667  61.41667 3 3 3
     13 "M" 2 1 2012  77.66666  78.66666  79.66666 4 4 4
     14 ""  . .    . 35.083332 36.083332 37.083332 2 2 2
     15 "M" 1 1 2012  7.083333  8.083333  9.083333 1 1 1
     16 ""  . .    .         5         6         7 1 1 1
     17 ""  . .    .  43.91667  44.91667  45.91667 2 2 2
     18 ""  . .    . 15.166667 16.166666 17.166666 1 1 1
     19 ""  . .    .  47.33333  48.33333  49.33333 2 2 3
     20 ""  . .    .     55.25     56.25     57.25 3 3 3
     21 ""  . .    .     62.25     63.25     64.25 3 3 4
     22 ""  . .    .  51.58333  52.58333  53.58333 3 3 3
     23 ""  . .    .      28.5      29.5      30.5 1 1 1
     24 "F" 1 1 2012 107.16666 108.16666 109.16666 4 4 4
     25 "M" 2 1 2012 26.583334 27.583334 28.583334 1 1 1
     26 "F" 1 . 2012  80.58334  81.58334  82.58334 4 4 4
     27 ""  . .    . 13.666667 14.666667 15.666667 1 1 1
     28 ""  . .    .  55.58333  56.58333  57.58333 3 3 3
     29 ""  . .    .  3.083333 4.0833335  5.083333 1 1 1
     30 ""  . .    .  8.583333  9.583333 10.583333 1 1 1
     31 ""  . .    . 29.916666 30.916666 31.916666 1 1 1
     32 "F" 2 1 2012  58.91667  59.91667  60.91667 3 3 3
     33 ""  . .    .  49.83333  50.83333  51.83333 3 3 3
     34 "F" 1 1 2012  41.08333  42.08333  43.08333 2 2 2
     35 "M" 4 1 2012     41.75     42.75     43.75 2 2 2
     36 ""  . .    .     31.25     32.25     33.25 1 1 1
     37 ""  . .    .  39.66667  40.66667  41.66667 2 2 2
     38 ""  . .    . 27.333334 28.333334 29.333334 1 1 1
     39 ""  . .    . 34.416668 35.416668 36.416668 1 2 2
     40 ""  . .    .  49.83333  50.83333  51.83333 3 3 3
     41 ""  . .    . 70.333336 71.333336 72.333336 4 4 4
     42 ""  . .    .  49.83333  50.83333  51.83333 3 3 3
     43 ""  . .    . 29.416666 30.416666 31.416666 1 1 1
     44 ""  . .    .     26.25     27.25     28.25 1 1 1
     45 ""  . .    . 19.916666 20.916666 21.916666 1 1 1
     46 ""  . .    .     101.5     102.5     103.5 4 4 4
     47 ""  . .    .        62        63        64 3 3 3
     48 "M" 1 1 2012 36.916668 37.916668  38.91667 2 2 2
     49 ""  . .    . 18.916666 19.916666 20.916666 1 1 1
     50 ""  . .    .      43.5      44.5      45.5 2 2 2
     51 ""  . .    .  49.16667  50.16667  51.16667 3 3 3
     52 ""  . .    .  48.66667  49.66667  50.66667 2 3 3
     53 ""  . .    .  56.83333  57.83333  58.83333 3 3 3
     54 "M" 5 . 2012  50.08333  51.08333  52.08333 3 3 3
     55 "F" 4 1 2012  56.91667  57.91667  58.91667 3 3 3
     56 "F" 7 . 2012 23.166666 24.166666 25.166666 1 1 1
     57 "F" 2 1 2012      4.75      5.75      6.75 1 1 1
     58 ""  . .    . 109.16666 110.16666 111.16666 4 4 4
     59 "F" 1 . 2012  86.58334  87.58334  88.58334 4 4 4
     60 ""  . .    .  63.41667 64.416664 65.416664 3 4 4
     61 ""  . .    .     85.25     86.25     87.25 4 4 4
     62 ""  . .    . 21.166666 22.166666 23.166666 1 1 1
     63 ""  . .    . 72.416664 73.416664 74.416664 4 4 4
     64 "M" 5 . 2012 36.583332 37.583332  38.58333 2 2 2
     65 ""  . .    .  60.58333  61.58333  62.58333 3 3 3
     66 ""  . .    .  62.66667  63.66667 64.666664 3 3 4
     67 "F" 2 1 2012  58.41667  59.41667  60.41667 3 3 3
     68 ""  . .    .  54.66667  55.66667  56.66667 3 3 3
     69 ""  . .    .     17.25     18.25     19.25 1 1 1
     70 ""  . .    .        65        66        67 4 4 4
     71 ""  . .    . 26.666666 27.666666 28.666666 1 1 1
     72 ""  . .    . 35.333332 36.333332 37.333332 2 2 2
     73 ""  . .    .     18.25     19.25     20.25 1 1 1
     74 ""  . .    .  44.16667  45.16667  46.16667 2 2 2
     75 ""  . .    .      62.5      63.5      64.5 3 3 4
     76 ""  . .    .  57.33333  58.33333  59.33333 3 3 3
     77 "M" 5 1 2012  55.91667  56.91667  57.91667 3 3 3
     78 "M" 2 . 2012  57.08333  58.08333  59.08333 3 3 3
     79 ""  . .    .     35.75     36.75     37.75 2 2 2
     80 "M" 6 . 2012  42.58333  43.58333  44.58333 2 2 2
     81 "M" 7 1 2013 37.333332  38.33333  39.33333 2 2 2
     82 ""  . .    .  79.58334  80.58334  81.58334 4 4 4
     83 ""  . .    . 26.333334 27.333334 28.333334 1 1 1
     84 ""  . .    .  47.08333  48.08333  49.08333 2 2 3
     85 ""  . .    .        72        73        74 4 4 4
     86 ""  . .    .        43        44        45 2 2 2
     87 ""  . .    .  62.91667  63.91667 64.916664 3 3 4
     88 ""  . .    . 33.833332 34.833332 35.833332 1 1 2
     89 ""  . .    . 37.333332  38.33333  39.33333 2 2 2
     90 "F" 5 . 2012  41.33333  42.33333  43.33333 2 2 2
     91 ""  . .    .  54.41667  55.41667  56.41667 3 3 3
     92 "M" 3 1 2012     19.75     20.75     21.75 1 1 1
     93 ""  . .    .  55.58333  56.58333  57.58333 3 3 3
     94 ""  . .    .  77.83334  78.83334  79.83334 4 4 4
     95 "M" 7 1 2012  94.83334  95.83334  96.83334 4 4 4
     96 ""  . .    .      2.25      3.25      4.25 1 1 1
     97 ""  . .    . 69.916664 70.916664 71.916664 4 4 4
     98 ""  . .    .  57.83333  58.83333  59.83333 3 3 3
     99 ""  . .    . 73.833336 74.833336 75.833336 4 4 4
    100 ""  . .    .     26.75     27.75     28.75 1 1 1
    end
    label values category_age2012 few_cat
    label values category_age2013 few_cat
    label values category_age2014 few_cat
    label def few_cat 1 "Under 35", modify
    label def few_cat 2 "35-49", modify
    label def few_cat 3 "50-64", modify
    label def few_cat 4 "65", modify
    I managed to estimate population at risk for each year, but I still couldn't figure it out for age and other cross-tabulations. I tried loops, referring years and age categories similar to overall "population risk". But either I don't refer to the right combination or the result is dot due to missing values.

    I may have to calculate it in excel, but thought I would ask before doing as the task is quite time consuming. The simulate data has single observation per person unlike the original data.

    Thanks in advance.


    Code:
    *Prevalence*
    ************
    clear*
    use copd_poprisk
    
    *Overall*
    forvalues yr = 2012(1)2014 {
    bys year: egen prev_`yr' = total(copd) if age`yr'>35 & year == `yr'
    }
     
    forvalues yr = 2012(1)2014 {
    sort year prev_`yr'
    tabdisp year if year == `yr', cell (prev_`yr')
    }
    
    *Age*
    *Prevalence*
    forvalues yr = 2012(1)2014 {
    bys category_age`yr' year : egen prev_age_`yr' = total(copd) if age`yr'>35 & year == `yr'
    }
     
    forvalues yr = 2012(1)2014 {
    sort year category_age`yr' prev_age_`yr'
    tabdisp category_age`yr' if year ==`yr', cell (prev_age_`yr')
    }
    
    *Population at risk = Pop without COPD of previous year - prevalence of current year
    *************************************************************************************************
    
    *Population*
    ************
    *Overall*
    gen pop = 1
    
    forvalues yr = 2012(1)2014 {
    bys year: egen pop_`yr' = total(pop) if age`yr'>35 & year == `yr'
    }
     
    forvalues yr = 2012(1)2014 {
    sort year pop_`yr'
    tabdisp year if year == `yr', cell (pop_`yr')
    }
    
    *Age*
    forvalues yr = 2012(1)2014 {
    bys category_age`yr' year : egen pop_age_`yr' = total(pop) if age`yr'>35 & year == `yr'
    }
     
    forvalues yr = 2012(1)2014 {
    sort year category_age`yr' pop_age_`yr'
    tabdisp category_age`yr' if year ==`yr', cell (pop_age_`yr')
    }
    
    *Population at Risk*
    ********************
    forvalues yr = 2012(1)2014 {
    sort prev_`yr'
    replace prev_`yr' = prev_`yr'[_n-1] if prev_`yr' == .
    }
    
    forvalues yr = 2012(1)2014 {
    sort pop_`yr'
    replace pop_`yr' = pop_`yr'[_n-1] if pop_`yr' == .
    }
    
    *Overall*
    gen pop_risk_2013 = pop_2012 - prev_2013 if year == 2013
    gen pop_risk_2014 = pop_2013 - prev_2014 if year == 2014
    Attached Files
    Last edited by Pablo Miah; 21 Jul 2024, 20:02.

  • #2
    I don't understand your example data and it doesn't seem to be possible to do what you ask with what you show. You have a variable copd which is always either 1 or missing. Although this is not a good way to code this for Stata, I'll assume that missing value means they don't have COPD. Be that as it may, you have only one observation per id. And your age variable is set up in wide layout, which is also not a good way to organize this for Stata. It is particularly dysfunctional for longitudinal data analysis in Stata.

    While there are simple ways to fix these problems, we are still left with another problem that is intractable: there is no information in the data set that indicates when the person in question had the onset of their COPD. This is critical for identifying the population at risk by year. The only part of your "at risk" definition that you have time-specific information about is age. So, I don't think anyone can help you without resolution of this last problem..

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      I don't understand your example data and it doesn't seem to be possible to do what you ask with what you show. You have a variable copd which is always either 1 or missing. Although this is not a good way to code this for Stata, I'll assume that missing value means they don't have COPD. Be that as it may, you have only one observation per id. And your age variable is set up in wide layout, which is also not a good way to organize this for Stata. It is particularly dysfunctional for longitudinal data analysis in Stata.

      While there are simple ways to fix these problems, we are still left with another problem that is intractable: there is no information in the data set that indicates when the person in question had the onset of their COPD. This is critical for identifying the population at risk by year. The only part of your "at risk" definition that you have time-specific information about is age. So, I don't think anyone can help you without resolution of this last problem..
      Apologies. The original datasets are longitudinal (in a closed system lab) and I tried to give an example using a simulated data which isn't. The prevalence and population are estimated separately, and then merged. I will try to reframe the question with a different example.


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(region year population preval)
      1 2015 12000  500
      1 2016 13000  700
      1 2017 14500  750
      2 2015  6500  220
      2 2016  7800  335
      2 2017  8000  390
      3 2015 16300 1200
      3 2016 17000 1310
      3 2017 20000 1435
      end
      I tried to estimate the variable pop_risk with the code below, but it's not right. Essentially, I am looking for a way to subtract value in two columns but in different rows. I should edit the title of the thread but can't seem to do it.

      Code:
      gen pop_risk = .
      
      gen last_year = year - 1
      
      sort region year
      forvalues yr = 2016(1)2017 {
      local yr1 == `yr' - 1
      replace pop_risk = population - preval if year == `yr' & last_year == `yr1'
      }
      I end up using the variables in wide format, which as you pointed out isn't the practice for this.

      Code:
      gen last_year = year - 1
      sort year
      forvalues yr = 2013(1)2014 {
      local yr1 = `yr' - 1
      bys year: gen pop_risk`yr' = pop_`yr1' - prev_`yr' if year == `yr' & last_year == `yr1'
      }
      Last edited by Pablo Miah; 22 Jul 2024, 19:11.

      Comment


      • #4
        OK, you are looking to, within each region, calculate the previous years population minus the current year prevalence.
        Code:
        xtset region year
        by region (year), sort: gen wanted = L1.population - preval

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          OK, you are looking to, within each region, calculate the previous years population minus the current year prevalence.
          Code:
          xtset region year
          by region (year), sort: gen wanted = L1.population - preval
          Thank you!

          Comment

          Working...
          X