Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying closest date to another date variable.

    Good Afternoon,

    I have a dataset which is formatted as follows :
    id interviewdate bloodtestdate bloodtestvalue
    1 01feb2009 01mar2009 106
    1 01feb2009 06apr2009 107
    2 24aug2012 17jan2009 500
    2 24aug2012 08feb2010 502
    2 24aug2012 06apr2011 509
    2 24aug2012 02mar2012 400
    3 06jan2007 06may2006 350











    ...and so on (note that these aren't the same dates as in my set, this is an example).

    The interviewdate and bloodtestdate are both in 'long' storage type, with '%d' display format.

    The 'interviewdate' for each participant is the same, as these are their baseline interview dates.

    I am interested in selecting the bloodtestdate which is the closest date to the interview date plus or minus one year.

    I thought about dropping all of the bloodtestdate variables that don't fall within a year of the interview date, tagging the first observation with the egen command and going from here but then if a participant has had multiple dates in one year, it won't select the closest date to the interview date.

    I have many, many blood test variables that I need to do this for as my analysis goes on. Any help would be greatly appreciated.

    Thank you,

    Kerry

  • #2
    Welcome to Statalist, Kerry.

    There are a number of ways of accomplishing this. My example below, based on your data, demonstrates some basic technique and is, I hope, simple to follow. It can be tightened up in several ways, but that's an exercise for the reader. You will note that I added id 4 to demonstrate that if there is a tie for closest blood test date, the one before the interview is chosen in preference to the one following.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id float(intdt bltdt) int bltval
    1 17929 17957 106
    1 17929 17993 107
    2 19229 17914 500
    2 19229 18301 502
    2 19229 18723 509
    2 19229 19054 400
    3 17172 16927 350
    4 17929 17919 106
    4 17929 17939 107
    end
    format %td intdt
    format %td bltdt
    * make sure interview dates don't differ within an id
    bysort id: assert intdt[1]==intdt[_N]
    bysort id: generate gap = bltdt-intdt
    bysort id: generate absgap = abs(gap) if abs(gap)<=365
    bysort id (absgap bltdt): gen nearest = intdt[1]
    format %td nearest
    sort id bltdt
    Code:
    . list, sepby(id)
    
         +------------------------------------------------------------------+
         | id       intdt       bltdt   bltval     gap   absgap     nearest |
         |------------------------------------------------------------------|
      1. |  1   01feb2009   01mar2009      106      28       28   01feb2009 |
      2. |  1   01feb2009   06apr2009      107      64       64   01feb2009 |
         |------------------------------------------------------------------|
      3. |  2   24aug2012   17jan2009      500   -1315        .   24aug2012 |
      4. |  2   24aug2012   08feb2010      502    -928        .   24aug2012 |
      5. |  2   24aug2012   06apr2011      509    -506        .   24aug2012 |
      6. |  2   24aug2012   02mar2012      400    -175      175   24aug2012 |
         |------------------------------------------------------------------|
      7. |  3   06jan2007   06may2006      350    -245      245   06jan2007 |
         |------------------------------------------------------------------|
      8. |  4   01feb2009   22jan2009      106     -10       10   01feb2009 |
      9. |  4   01feb2009   11feb2009      107      10       10   01feb2009 |
         +------------------------------------------------------------------+
    And as I mentioned on your other post, please let the Statalist FAQ inform you on how to improve your questions. It's a nuisance to take data with Human Readable Format dates and get it into Stata converting the strings to Stata Internal Format dates,

    Comment


    • #3
      Another way to do it. Not better, just different.


      Code:
      * William    
      bysort id: generate gap = bltdt-intdt
      bysort id: generate absgap = abs(gap) if abs(gap)<=365
      bysort id (absgap bltdt): gen nearest = intdt[1]
      
      * alternative
      bysort id: egen  minabsgap = min(abs(bltdt-intdt))
      replace minabsgap = . if minabsgrap > 365
      bysort id (minabsgap bltdt): gen nearest = intdt[1]
      Last edited by Nick Cox; 30 May 2017, 03:04.

      Comment


      • #4
        Good Evening,

        Thank you very much for your responses. My apologies for not importing my data in the correct format.

        I have a few questions:

        1. What is the purpose of the 'nearest' variable (as essentially it is just a repeat of the intdt variable)?

        2. If I wanted to keep the closest blood test value to the interview date of interest, would you recommend sorting by id and absgap, using the egen command to tag the first observation as 1, and then dropping the rest of the values (i.e. dropping any tag = 0 observations)?

        Thank you,

        Kerry
        ​​​​

        Comment


        • #5
          1) The command in my example that reads
          Code:
          bysort id (absgap bltdt): generate nearest = intdt[1]
          was meant to read
          Code:
          bysort id (absgap bltdt): generate nearest = bltdt[1]
          and thus gives the date of the blood test nearest to the interview date, with ties broken in favor of the earlier test (before rather than after the interview date).
          Code:
          . list, sepby(id)
          
               +------------------------------------------------------------------+
               | id       intdt       bltdt   bltval     gap   absgap     nearest |
               |------------------------------------------------------------------|
            1. |  1   01feb2009   01mar2009      106      28       28   01mar2009 |
            2. |  1   01feb2009   06apr2009      107      64       64   01mar2009 |
               |------------------------------------------------------------------|
            3. |  2   24aug2012   17jan2009      500   -1315        .   02mar2012 |
            4. |  2   24aug2012   08feb2010      502    -928        .   02mar2012 |
            5. |  2   24aug2012   06apr2011      509    -506        .   02mar2012 |
            6. |  2   24aug2012   02mar2012      400    -175      175   02mar2012 |
               |------------------------------------------------------------------|
            7. |  3   06jan2007   06may2006      350    -245      245   06may2006 |
               |------------------------------------------------------------------|
            8. |  4   01feb2009   22jan2009      106     -10       10   22jan2009 |
            9. |  4   01feb2009   11feb2009      107      10       10   22jan2009 |
               +------------------------------------------------------------------+
          2) What you write essentially describes what my corrected command does, except I stopped at calculating the nearest date and appending it to all observations. To simply tag the first observation within each id as 1 and the remainder as 0, egen is not needed. Replacing my (now corrected) command with
          Code:
          bysort id (absgap bltdt): generate tag = _n==1
          will set tag to 1 when _n==1 and 0 otherwise, because that is how Stata represents true and false.

          But if your objective is indeed to discard all observations but the nearest you do not need the tag.
          Code:
          bysort id (absgap bltdt): drop if _n!=1

          Comment


          • #6
            Good Afternoon,

            Thank you for the updated code. Everything is working perfectly.

            I do have a follow up question. If the absgap is too large (i.e. say the bltdt is greater than five years to intdt), is there a way to tell Stata to use the closest date even if it is after the interview date? I.e. to preferentially select the date that falls before unless it is greater than 'x' amount of time?

            Thank you again for your help.

            Kind regards,

            Kerry

            Comment


            • #7
              Introducing extra rules suggests assigning each result to a "group" depending on the rules it matches, and then picking the smallest absgap from within the group with the highest priority. In the example below, the lowest priority is "more than plus or minus a year from the interview date" (from post #1), the second-lowest priority is "more than 50 days before the interview" (modified version of the rule in post #6) and the highest priority is all others. ID 5 in the example demonstrates that 70 days after is preferred to 60 days before. ID 4 again demonstrates that 10 days before is preferred to 10 days after.
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input byte id float(intdt bltdt) int bltval
              1 17929 17957 106
              1 17929 17993 107
              2 19229 17914 500
              2 19229 18301 502
              2 19229 18723 509
              2 19229 19054 400
              3 17172 16927 350
              4 17929 17919 106
              4 17929 17939 107
              5 17929 17869 106
              5 17929 17999 107
              end
              format %td intdt
              format %td bltdt
              * make sure interview dates don't differ within an id
              bysort id: assert intdt[1]==intdt[_N]
              bysort id: generate gap = bltdt-intdt
              bysort id: generate absgap = abs(gap)
              generate priority = 1
              replace priority = 2 if gap<-50
              replace priority = 3 if abs(gap)>365
              bysort id (priority absgap bltdt): gen nearest = bltdt[1]
              format %td nearest
              sort id bltdt
              Code:
              . list, sepby(id)
              
                   +-----------------------------------------------------------------------------+
                   | id       intdt       bltdt   bltval     gap   absgap   priority     nearest |
                   |-----------------------------------------------------------------------------|
                1. |  1   01feb2009   01mar2009      106      28       28          1   01mar2009 |
                2. |  1   01feb2009   06apr2009      107      64       64          1   01mar2009 |
                   |-----------------------------------------------------------------------------|
                3. |  2   24aug2012   17jan2009      500   -1315     1315          3   02mar2012 |
                4. |  2   24aug2012   08feb2010      502    -928      928          3   02mar2012 |
                5. |  2   24aug2012   06apr2011      509    -506      506          3   02mar2012 |
                6. |  2   24aug2012   02mar2012      400    -175      175          2   02mar2012 |
                   |-----------------------------------------------------------------------------|
                7. |  3   06jan2007   06may2006      350    -245      245          2   06may2006 |
                   |-----------------------------------------------------------------------------|
                8. |  4   01feb2009   22jan2009      106     -10       10          1   22jan2009 |
                9. |  4   01feb2009   11feb2009      107      10       10          1   22jan2009 |
                   |-----------------------------------------------------------------------------|
               10. |  5   01feb2009   03dec2008      106     -60       60          2   12apr2009 |
               11. |  5   01feb2009   12apr2009      107      70       70          1   12apr2009 |
                   +-----------------------------------------------------------------------------+

              Comment


              • #8
                Hi William,

                Great - thank you for your response. I haven't tried it yet as I don't think I will be using the priority rules this round but it will come up for future analyses.

                I do have another question. If I want to limit the observations to only those observations that have occurred prior to the interview date, how would I edit the code in Post #2 appropriately?

                Thank you again for your help,

                Kerry



                Comment


                • #9
                  And what would you want as a result if there is no observation prior to the interview date?

                  In general, the lesson of post #2 (as corrected in post #5) is that observations that you do not want considered have their "absgap" replaced by a missing value, which Stata will effectively consider larger than any nonmissing value of absgap. For what you now describe, change the replace command so that rather than excluding observations with bltdt more than a year before or after intdt, instead exclude observations with bltdt greater than intdt.

                  Comment


                  • #10
                    Hi William,

                    If there is no observation to prior to the interview date I will treat it as missing.

                    Which 'replace' command do you mean?

                    Thank you,

                    Kerry

                    Comment


                    • #11
                      That's what I get for not having coffee at breakfast today. This is what I meant to write.

                      In general, the lesson of post #2 (as corrected in post #5) is that observations that you do not want considered have their "absgap" set to a missing value. Stata will effectively consider the missing value for absgap larger than any nonmissing value of absgap, so that those values are sorted to the end of the list by the command that chooses the nearest.

                      In your case,
                      Code:
                      bysort id: generate absgap = abs(gap) if abs(gap)<=365
                      will set absgap to missing when bltdt is more than a year before or after intdt, so that gap<-365 or gap>365, and thus abs(gap)>365.

                      For what you now describe, change the generate command shown above so that it excludes observations with bltdt greater than intdt.

                      Comment


                      • #12
                        Hi William,

                        So what I have is this (with dxdate instead of bltdt):

                        bysort protocol_id: assert interviewdt[1]==interviewdt[_N]
                        bysort protocol_id: generate gap = dxdate-interviewdt
                        **New code:
                        bysort protocol_id: generate absgap = abs(gap) if dxdate>interviewdt

                        However this doesn't seem to be working for me. There are still observations left that have dxdates > interviewdates.

                        Also, how do I add a qualifier that limits it to going only back "x" number of days?

                        The code worked perfectly for my previous objectives but I'm having trouble modifying it to work for me now.

                        Thanks again for your help,

                        Kerry





                        Comment


                        • #13
                          Let me go back and rewrite post #2 as corrected by #5 to meet your current requirement that the chosen blood test date come before the interview date.
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input byte id float(intdt bltdt) int bltval
                          1 17929 17957 106
                          1 17929 17993 107
                          2 19229 17914 500
                          2 19229 18301 502
                          2 19229 18723 509
                          2 19229 19054 400
                          3 17172 16927 350
                          4 17929 17919 106
                          4 17929 17939 107
                          end
                          format %td intdt
                          format %td bltdt
                          * make sure interview dates don't differ within an id
                          bysort id: assert intdt[1]==intdt[_N]
                          // how long from blood test to interview date
                          // gap will be <0 if before and >0 if after
                          generate gap = bltdt-intdt
                          // absolute value of gap to find closest blood test
                          generate absgap = abs(gap)
                          // exclude tests that follow the interview
                          replace absgap = . if bltdt > intdt
                          // sort by closeness, with ties broken by tests before the interview
                          //   and drop all but the nearest
                          bysort id (absgap bltdt): drop if _n>1
                          Code:
                          . list, clean
                          
                                 id       intdt       bltdt   bltval    gap   absgap  
                            1.    1   01feb2009   01mar2009      106     28        .  
                            2.    2   24aug2012   02mar2012      400   -175      175  
                            3.    3   06jan2007   06may2006      350   -245      245  
                            4.    4   01feb2009   22jan2009      106    -10       10
                          With that said, you can add a second replace command to exclude those that are too long before the interview. Note that you will want to add code to handle cases like id 1 where there is no blood test before the interview date. Perhaps
                          Code:
                          replace bltval = . if absgap==.
                          replace bltdt = . if absgap==.
                          will do what you need.
                          Last edited by William Lisowski; 21 Jun 2017, 05:43.

                          Comment


                          • #14
                            Hi William,

                            Thank you so much, that code worked perfectly for me. Thank you for including the descriptions as to what each line does as well.

                            Kind regards,

                            Kerry

                            Comment


                            • #15
                              Hi all,

                              I’ve used the command above to do find the minimum between two dates in my dataset. I’ve noticed every time I run my code, things look slightly different (minimal) and I think it’s because of my dataset and how I am using my unique identifiers. I have a dataset with three unique identifiers: StudyID_MIP (identifies mother-infant pairs), Mom_ID (mother’s ID), and Inf_ID (infant’s ID). As twins and triplets were enrolled in this study, the mother-infant pair ID is used to distinguish between singleton, twin, and triplet births. So for a mother with triplets, the IDs look like:

                              StudyID_MIP Mom_ID Inf_ID
                              1230000 1230000 1230001
                              1230002 1230000 1230002
                              1230003 1230000 1230003

                              When using the command above, I sorted by Mom_ID and also sorted by Mom_ID before I dropped any duplicates (to keep the first row with the nearest date). I’ve noticed when browsing the data that a different row (i.e. infant) is retained in the dataset after dropping the duplicates – the only explanation I can think of is that Stata is randomly sorting between Mom_IDs for mothers with more than one infant enrolled in the study, and therefore, a different infant for the same mother is being dropped whenever I force duplicates to drop as it can't distinguish since the mom's ID is the same. I wanted to ask if you could provide some guidance on how to overcome this problem? Ultimately, I would like the same infant (“row”) to be retained whenever I drop duplicates, and ideally, this would be the first enrolled infant (so when StudyID_MIP is the same as Mom_ID), as mother’s responses were most accurate for the first infant. I thought maybe sorting by StudyID_MIP rather than Mom_ID would help solve this issue but I’m not sure. If you could provide some insight on how to use the following command given the unique identifiers in my dataset that would be much appreciated.

                              Thanks in advance!
                              Gabriela
                              Last edited by Gabriela Toledo; 23 Jul 2020, 10:32.

                              Comment

                              Working...
                              X