Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a variable: TRUE or FALSE based on date within two other dates for each participant of of which there are multiple ID values

    Hello, my apologies for not using dataex. I am using healthcare data so I have created an example below.
    I am looking to create a variable that will satisfy this equality:

    visit_date (for visit_type==1) <= visit_date (for visit_type==2) <= visit_date (for visit_type==1 + 365 days)


    here is the structure I have:

    study_id unique_id visit_type visit_date
    1 165423 1 01/01/2000
    1 164651 2 06/07/2000
    2 949628 1 03/05/2001
    2 489461 2 04/05/2002
    3 984665 1 02/20/2002
    3 894861 2 01/06/2003
    4 894156 1 10/10/2002
    4 876464 3 10/02/2003
    4 786386 2 11/05/2003

    // all date values of Type: double and Format: tc%
    // unique_id is unique to every occurrence of "activity" i.e. unique in the entirety of the dataset
    // I have given an example where I have more than one value for what is supposed to be the follow-up measurement but the actual measurement was erroneous in some way (e.g. three study_id values, and visit_type==3 instead of 1 or 2).

    study_id unique_id visit_type visit_date within_window
    1 165423 1 01/01/2000 .
    1 164651 2 06/07/2000 1
    2 949628 1 03/05/2001 .
    2 489461 2 04/05/2002 0
    3 984665 1 02/20/2002 .
    3 894861 2 01/06/2003 1
    4 894156 1 10/10/2002 .
    4 876464 3 10/02/2003 .
    4 786386 2 11/05/2003 0

    // you can see that I desire within_window==. if visit_type!=2
    // I don't think -reshape wide- will help because as it stands there are > 100 variables and > 10,000 observations in the dataset

    I have tried something very simple and non-elegant:

    Code:
         
    * visit_date for visit_type==1
    gen double FirstVisitDate=vis_date if visit_type==1                      
      format FirstVisitDate %tc
         
    * visit_date for visit_type==2
    gen FollowUpVisitDate=visit_date if visit_type==2
      format FollowUpVisitDate %tc
    
    * visit_datefor visit_type==1 + 365 days
    gen FirstVisitDate_plus365=visit_date if visit_type==1
      format FirstVisitDate_plus365 %tc
      replace FirstVisitDate_plus365=FirstVisitDate_plus365+3.1536*10^10
      // 3.1536*10^10 = 1 year in milliseconds (non-leap since %tc)
    
    * var returned within the time-window
    gen within_window=.                                               
      replace within_window=1 if FirstVisitDate < FollowUpVisitDate < FirstVisitDate_plus365          
      replace within_window=0 if missing(FollowUpVisitDate) | FollowUpVisitDate > FirstVisitDate_plus365
    This solution seems somewhat bizarre though.
    I end up with the dataset looking something like this:

    study_id unique_id visit_type visit_date FirstVisitDate FollowUpVisitDate FirstVisitDate_plus365 within_window
    1 165423 1 01/01/2000 01/01/2000 . 01/01/2001 0
    1 164651 2 06/07/2000 . 06/07/2000 . 1

    Clearly the value within_window==1 is correct, however, I am confused about how Stata is reading this given that the observations are on different lines, so would Stata not be evaluating the calculation of within_window as:


    Code:
    FirstVisitDate==01/01/2000 < FollowUpVisitDate==. < FirstVisitDate_plus365==01/01/2001
    
    FirstVisitDate==. < FollowUpVisitDate==06/09/2000 < FirstVisitDate_plus365==.
    I have checked multiple values manually which appear to be correct but there are too many to sort through by simply observing value by value etc.

    Questions:
    1. Can someone lend a more elegant solution, point out my errors, or have a better way to quality check this solution?
    2. Also, I would like to have within_window==. for the observations that have visit_type==1 but I suppose when I aggregate I can take only values of 0 or 1 where visit_type==2 if my data stay in the form I have above.

    Thanks for the help as always.

  • #2
    From your example data and your code, I'm inferring that in your data, for each id there is always one and only one observation with visit_type = 1. If that is true, you can get what you want with:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte study_id long unique_id byte visit_type double visit_date
    1 165423 1  1.262304e+12
    1 164651 2 1.2759552e+12
    2 949628 1 1.2993696e+12
    2 489461 2  1.333584e+12
    3 984665 1 1.3297824e+12
    3 894861 2 1.3574304e+12
    4 894156 1 1.3498272e+12
    4 876464 3  1.380672e+12
    4 786386 2 1.3836096e+12
    end
    format %tc visit_date
    
    local one_year = 365*msofhours(24)
    
    by study_id (visit_type), sort: gen byte within_window = ///
        inrange(visit_date, visit_date[1], visit_date[1] + `one_year') ///
        if visit_type == 2
    I have not carefully reviewed your code. But I can tell you that a construction like x < y < z in Stata does not mean x < y & y < z. I won't go into what it does mean because what it means is something that you are unlikely to ever actually want. Suffice it to say that Stata code is not algebra, though there is a strong family resemblance.

    Hello, my apologies for not using dataex. I am using healthcare data so I have created an example below.
    Non sequitur. You can use -dataex- with any kind of data. If the concern is that your data are confidential and you could not share the real data, you could still have used -dataex-. Instead of drawing a table here in the Forum editor, you could have launched Stata, opened the Data Editor and created your data table there. Then run -dataex-. It would have actually been faster for you and would have saved me time as well. Whenever you ask for help with code, always use -dataex-.

    Comment


    • #3
      This worked and thank you Clyde. I will also use dataex next time.

      Comment

      Working...
      X