Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to extract longitudinal survey data within specific time period

    Hi,

    I've been trying to analyze a dataset containing survey data collected from several clinic follow ups but having problem to extract specific data from it. Here's some brief description of my dataset:
    ID j Date Eval score Hospitalization in the past 6 months
    1 1 1/1/2019 . N
    1 2 2/5/2019 5 Y
    2 1 5/1/2018 4 N
    3 1 1/1/2019 7 N
    3 2 7/10/2019 4 Y
    3 3 8/10/2019 5 N
    3 4 2/1/2020 6 N
    4 1 1/1/2019 . N
    4 2 9/1/2019 8 N
    4 3 4/5/2020 6 Y

    In the dataset, we have around 2000 surveys, some patients filled out only once, some multiple times. I created a variable 'j' for each ID to represent the visit #. I really need help for the steps I am trying to do:
    (1) Exclude those who only filled out 1 survey.
    (2) For those who had 2 or more surveys, I want to include the 'first' survey with the eval score (could be #2 survey or #3 survey for the patient). eg. in the sample dataset above, exclude #1 survey of patient 4 and use #2.
    (3) I want to find the next survey which is 6-12 months after the first survey included in step (2), and use the hospitalization data in that survey as the outcome.

    I've been searching different forums trying to figure out how to code these steps but I think it is beyond my knowledge and would really appreciate your input and help!

    Thank you in advance!
    Joseph

  • #2
    Perhaps this example using your data will start you in a useful direction.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(id j) float date byte eval str1 h6m
    1 1 21550 . "N"
    1 2 21585 5 "Y"
    2 1 21305 4 "N"
    3 1 21550 7 "N"
    3 2 21740 4 "Y"
    3 3 21771 5 "N"
    3 4 21946 6 "N"
    4 1 21550 . "N"
    4 2 21793 8 "N"
    4 3 22010 6 "Y"
    end
    format %td date
    
    sort id date
    // For those who had 2 or more surveys, I want to include the 'first' survey with the eval score
    by id (date): egen eval1date = min(cond(eval!=.,date,.))
    format %td eval1date
    // Find the next survey which is 6-12 months after the first survey
    generate mon = mofd(date)
    generate eval1mon = mofd(eval1date)
    format %tm mon eval1mon
    by id (date): egen hospdate = min(cond(inrange(mon-eval1mon,6,12),date,.))
    format %td hospdate
    // find the outcome from that survey
    generate outcome = h6m if date==hospdate
    // what do we have now
    list, sepby(id)
    // perhaps this is what you want
    list id eval1date hospdate h6m if outcome!=""
    Code:
    . // what do we have now
    . list, sepby(id)
    
         +---------------------------------------------------------------------------------------+
         | id   j        date   eval   h6m   eval1date      mon   eval1mon    hospdate   outcome |
         |---------------------------------------------------------------------------------------|
      1. |  1   1   01jan2019      .     N   05feb2019   2019m1     2019m2           .           |
      2. |  1   2   05feb2019      5     Y   05feb2019   2019m2     2019m2           .           |
         |---------------------------------------------------------------------------------------|
      3. |  2   1   01may2018      4     N   01may2018   2018m5     2018m5           .           |
         |---------------------------------------------------------------------------------------|
      4. |  3   1   01jan2019      7     N   01jan2019   2019m1     2019m1   10jul2019           |
      5. |  3   2   10jul2019      4     Y   01jan2019   2019m7     2019m1   10jul2019         Y |
      6. |  3   3   10aug2019      5     N   01jan2019   2019m8     2019m1   10jul2019           |
      7. |  3   4   01feb2020      6     N   01jan2019   2020m2     2019m1   10jul2019           |
         |---------------------------------------------------------------------------------------|
      8. |  4   1   01jan2019      .     N   01sep2019   2019m1     2019m9   05apr2020           |
      9. |  4   2   01sep2019      8     N   01sep2019   2019m9     2019m9   05apr2020           |
     10. |  4   3   05apr2020      6     Y   01sep2019   2020m4     2019m9   05apr2020         Y |
         +---------------------------------------------------------------------------------------+
    
    . // perhaps this is what you want
    . list id eval1date hospdate h6m if outcome!=""
    
         +----------------------------------+
         | id   eval1date    hospdate   h6m |
         |----------------------------------|
      5. |  3   01jan2019   10jul2019     Y |
     10. |  4   01sep2019   05apr2020     Y |
         +----------------------------------+

    Comment


    • #3
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte id str9 dateeval byte score str1 hospitalization
      1 "1/1/2019"  . "N"
      1 "2/5/2019"  5 "Y"
      2 "5/1/2018"  4 "N"
      3 "1/1/2019"  7 "N"
      3 "7/10/2019" 4 "Y"
      3 "8/10/2019" 5 "N"
      3 "2/1/2020"  6 "N"
      4 "1/1/2019"  . "N"
      4 "9/1/2019"  8 "N"
      4 "4/5/2020"  6 "Y"
      end
      
      gen date= date(dateeval, "DMY" )
      format date %td
      
      *(1) Exclude those who only filled out 1 survey.
      bys id: egen count= total(!missing(score))
      keep if count>=2
      
      *For those who had 2 or more surveys, I want to include the 'first' survey with the eval score (could be #2 survey or #3 survey for the patient). eg. in the sample dataset above, exclude #1 survey of patient 4 and use #2.
      bys id (date): gen first= sum(sum(!missing(score)))==1
      
      *I want to find the next survey which is 6-12 months after the first survey included in step (2), and use the hospitalization data in that survey as the outcome.
      *THERE MAY BE MULTIPLE WITHIN RANGE. I CHOOSE FIRST
      bys id: egen target= max(cond(first, date, .))
      gen range= inrange(date-target, 180, 365)
      bys id range (date): gen wanted= _n==1 & range
      Res.:

      Code:
      . sort id date
      
      . l id date score hosp count first range wanted, sepby(id)
      
           +--------------------------------------------------------------------+
           | id        date   score   hospit~n   count   first   range   wanted |
           |--------------------------------------------------------------------|
        1. |  3   01jan2019       7          N       4       1       0        0 |
        2. |  3   07oct2019       4          Y       4       0       1        1 |
        3. |  3   08oct2019       5          N       4       0       1        0 |
        4. |  3   02jan2020       6          N       4       0       0        0 |
           |--------------------------------------------------------------------|
        5. |  4   01jan2019       .          N       2       0       0        0 |
        6. |  4   09jan2019       8          N       2       1       0        0 |
        7. |  4   04may2020       6          Y       2       0       0        0 |
           +--------------------------------------------------------------------+
      Note: Crossed with #2.

      Comment


      • #4
        Thank you so much William!! The codes worked amazing! I have one more question regarding how to combine the results for further analysis. I wonder how I should do to eliminate id 1 & 2, as well as those surveys I don't need and combine the outcome result (h6m=Y) into the baseline survey with the eval score and keep the rest of variables in that survey but just replace the outcome (or create a new variable for the outcome in the baseline survey)? Like what showed in (list id eval1date hospdate h6m if outcome!="") but to include more variables from the baseline survey.


        Thanks again for your great help!
        Best,
        Joseph

        Originally posted by William Lisowski View Post
        Perhaps this example using your data will start you in a useful direction.
        Code:
        . // what do we have now
        . list, sepby(id)
        
        +---------------------------------------------------------------------------------------+
        | id j date eval h6m eval1date mon eval1mon hospdate outcome |
        |---------------------------------------------------------------------------------------|
        1. | 1 1 01jan2019 . N 05feb2019 2019m1 2019m2 . |
        2. | 1 2 05feb2019 5 Y 05feb2019 2019m2 2019m2 . |
        |---------------------------------------------------------------------------------------|
        3. | 2 1 01may2018 4 N 01may2018 2018m5 2018m5 . |
        |---------------------------------------------------------------------------------------|
        4. | 3 1 01jan2019 7 N 01jan2019 2019m1 2019m1 10jul2019 |
        5. | 3 2 10jul2019 4 Y 01jan2019 2019m7 2019m1 10jul2019 Y |
        6. | 3 3 10aug2019 5 N 01jan2019 2019m8 2019m1 10jul2019 |
        7. | 3 4 01feb2020 6 N 01jan2019 2020m2 2019m1 10jul2019 |
        |---------------------------------------------------------------------------------------|
        8. | 4 1 01jan2019 . N 01sep2019 2019m1 2019m9 05apr2020 |
        9. | 4 2 01sep2019 8 N 01sep2019 2019m9 2019m9 05apr2020 |
        10. | 4 3 05apr2020 6 Y 01sep2019 2020m4 2019m9 05apr2020 Y |
        +---------------------------------------------------------------------------------------+
        
        . // perhaps this is what you want
        . list id eval1date hospdate h6m if outcome!=""
        
        +----------------------------------+
        | id eval1date hospdate h6m |
        |----------------------------------|
        5. | 3 01jan2019 10jul2019 Y |
        10. | 4 01sep2019 05apr2020 Y |
        +----------------------------------+

        Comment


        • #5
          Hi Andrew,

          Thank you so much for the solutions! The codes worked perfectly!! After running those 3 steps, how should I do to clean out the dataset by keeping id 3 & 4 and the first survey with the eval score (and other variables in that survey and add a new variable of the status of hospitalization in the past 6 months from the eligible following survey (the 'wanted")?

          Thanks a million!
          Best,
          Joseph


          Originally posted by Andrew Musau View Post


          Code:
          . sort id date
          
          . l id date score hosp count first range wanted, sepby(id)
          
          +--------------------------------------------------------------------+
          | id date score hospit~n count first range wanted |
          |--------------------------------------------------------------------|
          1. | 3 01jan2019 7 N 4 1 0 0 |
          2. | 3 07oct2019 4 Y 4 0 1 1 |
          3. | 3 08oct2019 5 N 4 0 1 0 |
          4. | 3 02jan2020 6 N 4 0 0 0 |
          |--------------------------------------------------------------------|
          5. | 4 01jan2019 . N 2 0 0 0 |
          6. | 4 09jan2019 8 N 2 1 0 0 |
          7. | 4 04may2020 6 Y 2 0 0 0 |
          +--------------------------------------------------------------------+
          Note: Crossed with #2.

          Comment


          • #6
            In my code, do you mean you just want

            Code:
            keep if wanted
            ? Note that you have to assume that the dates are in "MDY" format for my output to match William's, instead of "DMY" as I did. Otherwise, id 4 drops out as there are no surveys within 6-12 months of the first nonmissing survey. Your outcome variable is simply the variable "hospitalization", but you can duplicate this if you want. Otherwise, if I am missing something, show this using an example of how you want the final dataset to look like.

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input byte id str9 dateeval byte score str1 hospitalization
            1 "1/1/2019"  . "N"
            1 "2/5/2019"  5 "Y"
            2 "5/1/2018"  4 "N"
            3 "1/1/2019"  7 "N"
            3 "7/10/2019" 4 "Y"
            3 "8/10/2019" 5 "N"
            3 "2/1/2020"  6 "N"
            4 "1/1/2019"  . "N"
            4 "9/1/2019"  8 "N"
            4 "4/5/2020"  6 "Y"
            end
            
            gen date= date(dateeval, "MDY" )
            format date %td
            
            *(1) Exclude those who only filled out 1 survey.
            bys id: egen count= total(!missing(score))
            keep if count>=2
            
            *For those who had 2 or more surveys, I want to include the 'first' survey with the eval score (could be #2 survey or #3 survey for the patient). eg. in the sample dataset above, exclude #1 survey of patient 4 and use #2.
            bys id (date): gen first= sum(sum(!missing(score)))==1
            
            *I want to find the next survey which is 6-12 months after the first survey included in step (2), and use the hospitalization data in that survey as the outcome.
            *THERE MAY BE MULTIPLE WITHIN RANGE. I CHOOSE FIRST
            bys id: egen target= max(cond(first, date, .))
            gen range= inrange(date-target, 180, 365)
            bys id range (date): gen wanted= _n==1 & range
            keep if wanted
            l
            Res.:

            Code:
            .
            . l
            
                 +-----------------------------------------------------------------------------------------+
                 | id    dateeval   score   hospit~n        date   count   first   target   range   wanted |
                 |-----------------------------------------------------------------------------------------|
              1. |  3   7/10/2019       4          Y   10jul2019       4       0    21550       1        1 |
              2. |  4    4/5/2020       6          Y   05apr2020       2       0    21793       1        1 |
                 +-----------------------------------------------------------------------------------------+

            Comment


            • #7
              Thank you so much Andrew!! I am trying to create a new variable as 'outcome' to the 'baseline' survey (eg. here would be row #1 in the output table) and the outcome will be the hospitalization in row #2. (as the screenshot attached below) So that once I combine it together, I can run analysis to see if the eval score was associated with the outcome in 6-12 months. Would it be possible for me to create a new variable to #1 and use the hospitalization in #2 (or other pairs found using the methods above) and get rid of other surveys I don't need?

              Click image for larger version

Name:	outcome.png
Views:	2
Size:	11.4 KB
ID:	1638803


              Thanks again!
              Best,
              Joseph


              Originally posted by Andrew Musau View Post
              In my code, do you mean you just want

              Code:
              keep if wanted
              ? Note that you have to assume that the dates are in "MDY" format for my output to match William's, instead of "DMY" as I did. Otherwise, id 4 drops out as there are no surveys within 6-12 months of the first nonmissing survey. Your outcome variable is simply the variable "hospitalization", but you can duplicate this if you want. Otherwise, if I am missing something, show this using an example of how you want the final dataset to look like.

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input byte id str9 dateeval byte score str1 hospitalization
              1 "1/1/2019" . "N"
              1 "2/5/2019" 5 "Y"
              2 "5/1/2018" 4 "N"
              3 "1/1/2019" 7 "N"
              3 "7/10/2019" 4 "Y"
              3 "8/10/2019" 5 "N"
              3 "2/1/2020" 6 "N"
              4 "1/1/2019" . "N"
              4 "9/1/2019" 8 "N"
              4 "4/5/2020" 6 "Y"
              end
              
              gen date= date(dateeval, "MDY" )
              format date %td
              
              *(1) Exclude those who only filled out 1 survey.
              bys id: egen count= total(!missing(score))
              keep if count>=2
              
              *For those who had 2 or more surveys, I want to include the 'first' survey with the eval score (could be #2 survey or #3 survey for the patient). eg. in the sample dataset above, exclude #1 survey of patient 4 and use #2.
              bys id (date): gen first= sum(sum(!missing(score)))==1
              
              *I want to find the next survey which is 6-12 months after the first survey included in step (2), and use the hospitalization data in that survey as the outcome.
              *THERE MAY BE MULTIPLE WITHIN RANGE. I CHOOSE FIRST
              bys id: egen target= max(cond(first, date, .))
              gen range= inrange(date-target, 180, 365)
              bys id range (date): gen wanted= _n==1 & range
              keep if wanted
              l
              Res.:

              Code:
              .
              . l
              
              +-----------------------------------------------------------------------------------------+
              | id dateeval score hospit~n date count first target range wanted |
              |-----------------------------------------------------------------------------------------|
              1. | 3 7/10/2019 4 Y 10jul2019 4 0 21550 1 1 |
              2. | 4 4/5/2020 6 Y 05apr2020 2 0 21793 1 1 |
              +-----------------------------------------------------------------------------------------+
              Attached Files

              Comment


              • #8
                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input byte id str9 dateeval byte score str1 hospitalization
                1 "1/1/2019"  . "N"
                1 "2/5/2019"  5 "Y"
                2 "5/1/2018"  4 "N"
                3 "1/1/2019"  7 "N"
                3 "7/10/2019" 4 "Y"
                3 "8/10/2019" 5 "N"
                3 "2/1/2020"  6 "N"
                4 "1/1/2019"  . "N"
                4 "9/1/2019"  8 "N"
                4 "4/5/2020"  6 "Y"
                end
                
                gen date= date(dateeval, "MDY" )
                format date %td
                
                *(1) Exclude those who only filled out 1 survey.
                bys id: egen count= total(!missing(score))
                keep if count>=2
                
                *For those who had 2 or more surveys, I want to include the 'first' survey with the eval score (could be #2 survey or #3 survey for the patient). eg. in the sample dataset above, exclude #1 survey of patient 4 and use #2.
                bys id (date): gen first= sum(sum(!missing(score)))==1
                
                *I want to find the next survey which is 6-12 months after the first survey included in step (2), and use the hospitalization data in that survey as the outcome.
                *THERE MAY BE MULTIPLE WITHIN RANGE. I CHOOSE FIRST
                bys id: egen target= max(cond(first, date, .))
                gen range= inrange(date-target, 180, 365)
                bys id range (date): gen wanted= _n==1 & range
                bys id (wanted): gen outcome=cond(wanted[_N], hospitalization[_N], "")
                keep if first & !missing(outcome)
                l
                Res.:

                Code:
                . l
                
                     +--------------------------------------------------------------------------------------------------+
                     | id   dateeval   score   hospit~n        date   count   first   target   range   wanted   outcome |
                     |--------------------------------------------------------------------------------------------------|
                  1. |  3   1/1/2019       7          N   01jan2019       4       1    21550       0        1         Y |
                  2. |  4   9/1/2019       8          N   01sep2019       2       1    21793       0        1         Y |
                     +--------------------------------------------------------------------------------------------------+
                Last edited by Andrew Musau; 30 Nov 2021, 07:42.

                Comment


                • #9
                  This works amazingly! Thank you so much Andrew!!

                  Best,
                  Joseph

                  Originally posted by Andrew Musau View Post
                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input byte id str9 dateeval byte score str1 hospitalization
                  1 "1/1/2019" . "N"
                  1 "2/5/2019" 5 "Y"
                  2 "5/1/2018" 4 "N"
                  3 "1/1/2019" 7 "N"
                  3 "7/10/2019" 4 "Y"
                  3 "8/10/2019" 5 "N"
                  3 "2/1/2020" 6 "N"
                  4 "1/1/2019" . "N"
                  4 "9/1/2019" 8 "N"
                  4 "4/5/2020" 6 "Y"
                  end
                  
                  gen date= date(dateeval, "MDY" )
                  format date %td
                  
                  *(1) Exclude those who only filled out 1 survey.
                  bys id: egen count= total(!missing(score))
                  keep if count>=2
                  
                  *For those who had 2 or more surveys, I want to include the 'first' survey with the eval score (could be #2 survey or #3 survey for the patient). eg. in the sample dataset above, exclude #1 survey of patient 4 and use #2.
                  bys id (date): gen first= sum(sum(!missing(score)))==1
                  
                  *I want to find the next survey which is 6-12 months after the first survey included in step (2), and use the hospitalization data in that survey as the outcome.
                  *THERE MAY BE MULTIPLE WITHIN RANGE. I CHOOSE FIRST
                  bys id: egen target= max(cond(first, date, .))
                  gen range= inrange(date-target, 180, 365)
                  bys id range (date): gen wanted= _n==1 & range
                  bys id (wanted): gen outcome=cond(wanted[_N], hospitalization[_N], "")
                  keep if first & !missing(outcome)
                  l
                  Res.:

                  Code:
                  . l
                  
                  +--------------------------------------------------------------------------------------------------+
                  | id dateeval score hospit~n date count first target range wanted outcome |
                  |--------------------------------------------------------------------------------------------------|
                  1. | 3 1/1/2019 7 N 01jan2019 4 1 21550 0 1 Y |
                  2. | 4 9/1/2019 8 N 01sep2019 2 1 21793 0 1 Y |
                  +--------------------------------------------------------------------------------------------------+

                  Comment

                  Working...
                  X