  • Create Indicator showing if values of a variable over a period of time (before the current observation) were in a certain range

    Hi everyone
    I've been trying to find a solution to the following problem, without success so far. I'm doing a study of chronic kidney disease in humans, and I want to create a dummy variable marking the date during follow-up on which the definition was first met. This seems like a very common problem in medical research, but I haven't found any suggestions online on how to achieve it.

    Chronic kidney disease is defined as having a measure of renal function (egfr) in a certain range (<60) for a certain period of time (6 months). The data format is basically one observation per egfr measure with a date variable and a name variable for each individual studied.

    Code:
    clear
    input str4 name long date_sample float egfr
    "fred" 19628 55.36556
    "fred" 19631  70.7245
    "fred" 19661  47.7245
    "fred" 19680 51.90283
    "fred" 19891 35.43642
    "fred" 19951 41.67347
    "joe"  19908 88.86351
    "joe"  19914 40.65486
    "joe"  19939 100.1254
    "joe"  20110 44.29079
    "joe"  20213 54.31298
    "joe"  20335 48.20621
    "joe"  20395 11.85982
    "joe"  20488  45.4603
    end
    format %d date_sample

    In this example, fred would first fulfill the definition on 17jun2014 (=19891) and joe on 04sep2015 (=20335).

    Now, I want to create a dummy variable (by name), indicating when the definition was met, i.e. =1 on 17jun2014 for fred and on 04sep2015 for joe.

    May I ask for your suggestions?

    Regards

    Fabian
    Last edited by Fabian Fortner; 11 Oct 2016, 15:11.

  • #2
    I think this does it:
    Code:
    clear
    input str4 name long date_sample float egfr
    "fred" 19628 55.36556
    "fred" 19631  70.7245
    "fred" 19661  47.7245
    "fred" 19680 51.90283
    "fred" 19891 35.43642
    "fred" 19951 41.67347
    "joe"  19908 88.86351
    "joe"  19914 40.65486
    "joe"  19939 100.1254
    "joe"  20110 44.29079
    "joe"  20213 54.31298
    "joe"  20335 48.20621
    "joe"  20395 11.85982
    "joe"  20488  45.4603
    end
    format %d date_sample
    
    //    IDENTIFY SPELLS OF EGFR < 60 OR >= 60
    gen byte low_gfr = (egfr < 60)
    by name (date_sample), sort: gen spell = low_gfr != low_gfr[_n-1]
    by name (date_sample): replace spell = sum(spell)
    
    //    TRACK DURATION OF CURRENT EGFR STATE
    by name spell (date_sample), sort: gen duration = date_sample - date_sample[1]
    
    //    IDENTIFY WHEN DURATION EXCEEDS 6 MONTH THRESHOLD
    gen byte hit = low_gfr & duration >= 183 // 183 d = 6 mos
    by name (date_sample), sort: gen hit_sum = sum(hit)
    
    //    AND DISREGARD IF IT ISN'T THE FIRST TIME
    replace hit = 0 if hit_sum > 1
    Note: The variable hit is the one you are looking for. You can drop the other variables created along the way when using this in production; I left them in so you can see the logic.
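
    As a quick check, here is a minimal sketch that reruns the same steps on the example data and asserts that exactly one observation per person is flagged, on the dates Fabian gave in #1. The assert lines and the variable n_hits are additions for illustration only, not part of the solution itself.

    ```stata
    clear
    input str4 name long date_sample float egfr
    "fred" 19628 55.36556
    "fred" 19631  70.7245
    "fred" 19661  47.7245
    "fred" 19680 51.90283
    "fred" 19891 35.43642
    "fred" 19951 41.67347
    "joe"  19908 88.86351
    "joe"  19914 40.65486
    "joe"  19939 100.1254
    "joe"  20110 44.29079
    "joe"  20213 54.31298
    "joe"  20335 48.20621
    "joe"  20395 11.85982
    "joe"  20488  45.4603
    end
    format %td date_sample

    * same steps as above: spells, duration within spell, first hit
    gen byte low_gfr = (egfr < 60)
    by name (date_sample), sort: gen spell = sum(low_gfr != low_gfr[_n-1])
    by name spell (date_sample), sort: gen duration = date_sample - date_sample[1]
    gen byte hit = low_gfr & duration >= 183
    by name (date_sample), sort: gen hit_sum = sum(hit)
    replace hit = 0 if hit_sum > 1

    * n_hits is a hypothetical helper: number of flagged observations per person
    by name: egen n_hits = total(hit)
    assert n_hits == 1
    assert date_sample == td(17jun2014) if hit & name == "fred"
    assert date_sample == td(04sep2015) if hit & name == "joe"
    list name date_sample egfr if hit, noobs
    ```

    If an assert fails, Stata stops with an error, so a silent run means the flags agree with the expected dates.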


    • #3
      Thanks for the data example.

      If I understand this correctly, a small problem is that you need to make precise "last 6 months" for daily data. I take this as meaning the last 183 days, including each day of measurement. You clearly can fiddle with that.

      I think you can make progress with rangestat (SSC), a program by Robert Picard and friends. (Search this forum for several other applications.)

      I can reproduce one of your results but not both. On the evidence Joe has been under 60 whatever the units are for 6 months before 5 May 2015.

      Code:
      clear 
      input str4 name long date_sample float egfr
      "fred" 19628 55.36556
      "fred" 19631  70.7245
      "fred" 19661  47.7245
      "fred" 19680 51.90283
      "fred" 19891 35.43642
      "fred" 19951 41.67347
      "joe"  19908 88.86351
      "joe"  19914 40.65486
      "joe"  19939 100.1254
      "joe"  20110 44.29079
      "joe"  20213 54.31298
      "joe"  20335 48.20621
      "joe"  20395 11.85982
      "joe"  20488  45.4603
      end
      format %d date_sample
      
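      * maximum egfr over the 183 days up to and including each sample
      rangestat (max) egfr, interval(date_sample -183 0) by(name)
      * do not flag anyone until 183 days after their first sample
      bysort name (date_sample) : gen zeroplus6m = date_sample[1] + 183
      * division by 0 yields missing, which min() ignores, leaving the first qualifying date
      egen firstdate = min(date_sample / ((date_sample > zeroplus6m) & (egfr_max < 60))), by(name)
      gen isfirst = date_sample == firstdate
      list , sepby(name)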
      
      
           +-------------------------------------------------------------------------+
           | name   date_sa~e       egfr    egfr_max   zerop~6m   firstd~e   isfirst |
           |-------------------------------------------------------------------------|
        1. | fred   27sep2013   55.36556   55.365559      19811      19891         0 |
        2. | fred   30sep2013    70.7245   70.724503      19811      19891         0 |
        3. | fred   30oct2013    47.7245   70.724503      19811      19891         0 |
        4. | fred   18nov2013   51.90283   70.724503      19811      19891         0 |
        5. | fred   17jun2014   35.43642    35.43642      19811      19891         1 |
        6. | fred   16aug2014   41.67347    41.67347      19811      19891         0 |
           |-------------------------------------------------------------------------|
        7. |  joe   04jul2014   88.86351    88.86351      20091      20213         0 |
        8. |  joe   10jul2014   40.65486    88.86351      20091      20213         0 |
        9. |  joe   04aug2014   100.1254    100.1254      20091      20213         0 |
       10. |  joe   22jan2015   44.29079    100.1254      20091      20213         0 |
       11. |  joe   05may2015   54.31298   54.312981      20091      20213         1 |
       12. |  joe   04sep2015   48.20621   54.312981      20091      20213         0 |
       13. |  joe   03nov2015   11.85982   54.312981      20091      20213         0 |
       14. |  joe   04feb2016    45.4603   48.206211      20091      20213         0 |
           +-------------------------------------------------------------------------+
      The trickery in getting the first date is discussed within http://www.stata-journal.com/sjpdf.h...iclenum=dm0055 Section 10.


      • #4
        "On the evidence Joe has been under 60 whatever the units are for 6 months before 5 May 2015."
        I don't think that's right, Nick. On 05may2015, we know that he has had egfr < 60 since 22jan2015, but that is not fully 6 months. The last egfr ascertainment before that goes all the way back to 04aug2014, but there egfr >> 60. If he had had an egfr < 60 sometime between then and 03nov2014, we could say we have 6 months of low egfr. But we don't know when between 04aug2014 and 03nov2014 this occurred. I think that in clinical research, Fabian's way of reckoning this is what would be used.
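
        To make the dates concrete, here is a minimal sketch that lists Joe's samples inside the 183-day window ending on 05may2015 and those before it. The local macro names win_start and win_end are just illustrative.

        ```stata
        clear
        input str4 name long date_sample float egfr
        "joe"  19908 88.86351
        "joe"  19914 40.65486
        "joe"  19939 100.1254
        "joe"  20110 44.29079
        "joe"  20213 54.31298
        "joe"  20335 48.20621
        "joe"  20395 11.85982
        "joe"  20488  45.4603
        end
        format %td date_sample

        * 183-day window ending at the 05may2015 sample
        local win_end = td(05may2015)
        local win_start = `win_end' - 183
        display %td `win_start'        // shows 03nov2014

        * samples inside the window: both are below 60
        list date_sample egfr if inrange(date_sample, `win_start', `win_end'), noobs

        * samples before the window opens: the most recent one, 04aug2014, is above 60
        list date_sample egfr if date_sample < `win_start', noobs
        ```

        The window holds only the 22jan2015 and 05may2015 samples, so nothing is known directly about egfr between 03nov2014 and 22jan2015.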


        • #5
          Hi again
          Thanks a lot for the quick replies and your inspiring solutions! Sorry for not specifying what I meant by 6 months; 183 days is correct.

          Nick: Your solution gets the correct result for "fred" but not for "joe". I constructed "joe" to be a bit more difficult (but common in my data): he bounces back from his first low egfr on 10jul2014 to a normal egfr on 04aug2014 and is only consistently low from 22jan2015 on. Thus his "isfirst"/"hit" should be at the earliest 22jan2015 + 183 days.
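
          For reference, the date arithmetic can be checked directly in Stata (a small sketch, nothing more):

          ```stata
          * earliest date on which 183 days of low egfr can have elapsed for joe
          display %td td(22jan2015) + 183        // shows 24jul2015
          ```

          Joe's first sample on or after that date is the one from 04sep2015.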

          Clyde: Thanks, I think this works! I'll get my head around it and will try to work with that!

          Yours

          Fabian
          Last edited by Fabian Fortner; 11 Oct 2016, 16:38.


          • #6
            Clearly I bow wholeheartedly to whatever the practice is.

            But look at my code: it calculates the known maximum over the previous 183 days and waits until 183 days have elapsed before flagging a maximum below 60. I have to suggest that the problem is not worded for people outside the field to understand!

            This was the precise original wording:

            a measure of renal function (egfr) in a certain range (<60) for a certain period of time (6 months)


            • #7
              Interesting. In light of #3 and #6, I now see that the original wording is indeed ambiguous. Nick's interpretation is a reasonable one, just not the one that would be used in clinical research. As an epidemiologist I'm so familiar with, one might even say immersed in, a certain way of doing things that I did not even perceive the ambiguity at first. Nick is right: one has to be in the field to know for sure what was intended here.


              • #8
                Thanks for the helpful comments. I imagine that rangestat (SSC) could still be the basis of code deemed to give the desired answer.


                • #9
                  So the main difficulty here is that the earliest observation within a 6 month prior window from the current observation is not likely to be exactly 6 months before the current observation date, in which case the value of egfr for the previous observation must also be taken into account. I think this can be done with rangestat:

                  Code:
                  clear
                  input str4 name long date_sample float egfr
                  "fred" 19628 55.36556
                  "fred" 19631  70.7245
                  "fred" 19661  47.7245
                  "fred" 19680 51.90283
                  "fred" 19891 35.43642
                  "fred" 19951 41.67347
                  "joe"  19908 88.86351
                  "joe"  19914 40.65486
                  "joe"  19939 100.1254
                  "joe"  20110 44.29079
                  "joe"  20213 54.31298
                  "joe"  20335 48.20621
                  "joe"  20395 11.85982
                  "joe"  20488  45.4603
                  end
                  format %d date_sample
                  
                  * for each sample, note the previous value of egfr
                  bysort name (date_sample): gen prev_egfr = egfr[_n-1]
                  
                  * assume 183 days is 6 months
                  rangestat (max) egfr (max) prev_egfr (min) dmin = date_sample, ///
                      interval(date_sample -183 0) by(name)
                  
                  * the span of days from the earliest obs in the 6 month window
                  gen span = date_sample - dmin
                  
                  * all observations within the 6 month window must show egfr < 60
                  gen is_met = egfr_max < 60
                  
                  * the previous measure must also be < 60 if span < 183
                  replace is_met = 0 if prev_egfr_max >= 60 & span < 183
                  
                  * indicator of first time definition is met
                  * (flag only the observation where the running count first reaches 1)
                  bysort name (date_sample): gen first = sum(is_met)
                  replace first = (first == 1) & is_met
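
                  One detail worth checking when marking a first occurrence with a running sum: an observation that comes after the first hit but does not itself meet the condition still has a running sum of 1. Here is a minimal sketch with made-up values; the variables id, t, x and running are hypothetical and not part of the data above.

                  ```stata
                  clear
                  input byte id byte t byte x
                  1 1 0
                  1 2 1
                  1 3 0
                  1 4 1
                  end

                  * running count of hits so far: 0, 1, 1, 2
                  bysort id (t): gen running = sum(x)

                  * flag only the observation where the first hit actually occurs
                  gen byte first = x & (running == 1)

                  list, noobs
                  assert first == (t == 2)
                  ```

                  At t = 3 the running count is still 1 although x is 0, which is why the flag also requires x itself to be 1.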
