Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I add observations to my data

    Hello Statalist,

    I have a longitudinal dataset, where each individual is identified by a unique ID and observed at multiple time points. The data includes a variable for the procedure performed on each individual, the duration of the procedure in days, and the individual's age in months at each time point.

    I want to expand the dataset so that every individual has as many observations as the longest participating individual based on age. For example, if the maximum age in the data is 200 months, each individual should have 200 observations to represent each month from 1 to 200. For each observation, I want to record the individual's age in months, and the procedure performed on them (if any).

    If an individual entered the program later or exited earlier than the longest participating individual, they would have missing values for the procedure variable/duration in the corresponding age months, but their age in months would still be recorded. I am not sure where to begin with my code.

    For example, my data looks like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id str5 proc float(duration age)
    1 "dna"    20 0
    1 "phot"   30 1
    1 "injec"  30 2
    1 "lor"    30 3
    2 "dna"     1 2
    2 "lor"    40 3
    3 "dna"     1 0
    3 "phot"   30 1
    3 "injec"  30 2
    3 "dna"    30 3
    3 "phot"   30 4
    3 "injec"  30 5
    4 "phot"    1 2
    4 "injec"  60 4
    4 "dna"   120 8
    4 "phot"   30 9
    end

    And I'd like to go to this, where each individual in this example data will have nine rows (i.e. the maximum age in months) and the recorded placement, and missing values for those that weren't observed for the entire duration:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id str5 proc float(duration age)
    1 "dna"   20 0
    1 "phot"  30 1
    1 "injec" 30 2
    1 "lor"   30 3
    1 "."      . 4
    1 "."      . 5
    1 "."      . 6
    1 "."      . 7
    1 "."      . 8
    1 "."      . 9
    2 "."      . 0
    2 "."      . 1
    2 "dna"    1 2
    2 "lor"   40 3
    2 "."      . 4
    2 "."      . 5
    2 "."      . 6
    2 "."      . 7
    2 "."      . 8
    2 "."      . 9
    3 "dna"    1 0
    3 "phot"  30 1
    3 "injec" 30 2
    3 "dna"   30 3
    3 "phot"  30 4
    3 "injec" 30 5
    3 "."      . 6
    3 "."      . 7
    3 "."      . 8
    3 "."      . 9
    4 "."      . 0
    4 "."      . 1
    4 "phot"   . 2
    4 "injec"  . 3
    4 "injec"  . 4
    4 "dna"    . 5
    4 "dna"    . 6
    4 "dna"    . 7
    4 "dna"    . 8
    4 "phot"   . 9
    end

    (Each procedure happens consecutively. For e.g. Individual 4 had phot for 1 day, and then inject until the age of 4 months)


  • #2
    Code:
    summ age, meanonly
    local oldest = r(max)
    
    preserve
    keep id
    duplicates drop
    expand `=`oldest'+1'
    by id, sort: gen age = _n - 1
    tempfile expander
    save `expander'
    
    restore
    merge 1:1 id age using `expander', nogenerate
    isid id age, sort

    Comment


    • #3
      And I'd like to go to this, where each individual in this example data will have nine rows (i.e. the maximum age in months)
      Should be 10 rows since age 0 is also present in the data.

      This may work, assuming no repeated age within ID:
      Code:
      tsset id age
      tsfill, full

      Comment


      • #4
        Thank you Clyde. The code works fine until I start getting errors with the merge command
        Code:
          
         merge 1:1 id age using `expander', nogenerate  variables ID age do not uniquely identify observations in the master data
        I think this might be caused by the fact that some individuals in the data will have more than one procedure over a period that does not exceed one month, and therefore we would have rows with the unique ID and same age but a different procedure recorded. Is there a way to get round this?

        I tried to edit the code and I am getting slightly closer. I have created a "procedure number" within each month for each id:

        Code:
        bysort id age: gen time =_n
        
        summ age, meanonly
        local oldest=r(max)
        
        preserve
        
        keep id
        duplicates drop
        
        expand `=`oldest'+1'
        bysort id: gen age=_n-1
        
        gen time=1
        
        tempfile temp
        save `temp'
        
        restore
        
        merge 1:m id age time using `temp', nogenerate

        However, this produces a data that looks like the following, it does not record the procedure in month 3 for e.g. where it was injec:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float id str6 proc float(duration age)
        4 "."       . 0
        4 "."       . 1
        4 "phot"    1 2
        4 ""        . 3
        4 "injec"  60 4
        4 ""        . 5
        4 ""        . 6
        4 ""        . 7
        4 "dna"   120 8
        4 "phot"   30 9
        end
        A bit stumped on how to correct the code.

        Comment


        • #5
          Thank you Ken. I face a similar problem here, which I'm assuming stems from the same issue:

          Code:
          tsset ID age
          Code:
          repeated time values within panels

          Comment


          • #6
            I think the code in #2 will do what you want if you just change -merge 1:1- to -merge m:1-.

            Comment


            • #7
              Thanks Clyde, it does not seem to be working as it does not add those additional rows for each person using merge m:1, or merge m:m. The original data is set up in the same way, only difference is that some individual have multiple procedures during one month. I may have to think of another way
              Last edited by Sherine Maui; 09 May 2023, 12:01.

              Comment


              • #8
                I marked up your example to include an observation that duplicated an id and age. Then I tried it with the code in #2, modified to use -merge m:1- instead of -merge 1:1-. It works appropriately:
                Code:
                clear
                input float id str5 proc float(duration age)
                1 "dna"    20 0
                1 "phot"   30 1
                1 "injec"  30 2
                1 "xxxxx"  10 2
                1 "lor"    30 3
                2 "dna"     1 2
                2 "lor"    40 3
                3 "dna"     1 0
                3 "phot"   30 1
                3 "injec"  30 2
                3 "dna"    30 3
                3 "phot"   30 4
                3 "injec"  30 5
                4 "phot"    1 2
                4 "injec"  60 4
                4 "dna"   120 8
                4 "phot"   30 9
                end
                
                summ age, meanonly
                local oldest = r(max)
                
                preserve
                keep id
                duplicates drop
                expand `=`oldest'+1'
                by id, sort: gen age = _n - 1
                tempfile expander
                save `expander'
                
                restore
                merge m:1 id age using `expander', nogenerate
                sort id age
                If the results from this example are not what you want, then please re-explain what you are looking for. If these results are OK, but the code still fails to work properly in your full data set, please post back with a new data example that exhibits whatever problem(s) you are encountering.

                Added: I don't want to go into a long digression about this, but I can't let one thing pass. NEVER use -merge m:m-. It is the solution to a problem that never arises in real research. If you ever think you need -merge m:m- either your data are in error or you don't understand your data correctly. -merge m:m- just produces data salad.
                Last edited by Clyde Schechter; 09 May 2023, 12:14.

                Comment


                • #9
                  Thank you for explaining Clyde and for cautioning against the use of m:m merges. I made the silly mistake of not sorting by id and age – now that I do, I can see that the additional rows have been added to my sample.

                  There is just one small issue. In the dataset, when I do add these additional rows they don’t take into account that these procedures last over more than one month. For e.g. ID 4’s second procedure was “injec” and it lasted for a duration of 60 days until they were age 4 months. The third procedure was dna and lasted for 120 days (4 months) until age 8 months. When I add the additional rows using the code in post #2, the row for ID 4 at age 3 months has missing values when it should have “injec”, and the rows for ages 5-7 (as well as 8) should also have “dna”.


                  This is what the data looks like after running #2 (editing the merge command to m:1):
                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float id str5 proc float(duration age)
                  1 "dna"    20 0
                  1 "phot"   30 1
                  1 "injec"  30 2
                  1 "lor"    30 3
                  1 ""        . 4
                  1 ""        . 5
                  1 ""        . 6
                  1 ""        . 7
                  1 ""        . 8
                  1 ""        . 9
                  2 ""        . 0
                  2 ""        . 1
                  2 "dna"     1 2
                  2 "lor"    40 3
                  2 ""        . 4
                  2 ""        . 5
                  2 ""        . 6
                  2 ""        . 7
                  2 ""        . 8
                  2 ""        . 9
                  3 "dna"     1 0
                  3 "phot"   30 1
                  3 "injec"  30 2
                  3 "dna"    30 3
                  3 "phot"   30 4
                  3 "injec"  30 5
                  3 ""        . 6
                  3 ""        . 7
                  3 ""        . 8
                  3 ""        . 9
                  4 ""        . 0
                  4 ""        . 1
                  4 "phot"    1 2
                  4 ""        . 3
                  4 "injec"  60 4
                  4 ""        . 5
                  4 ""        . 6
                  4 ""        . 7
                  4 "dna"   120 8
                  4 "phot"   30 9
                  end

                  Compared to what I am trying to achieve here (ID 4’s procedures are added for each age where they are being treated):

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float id str5 proc float(duration age)
                  1 "dna"   20 0
                  1 "phot"  30 1
                  1 "injec" 30 2
                  1 "lor"   30 3
                  1 "."      . 4
                  1 "."      . 5
                  1 "."      . 6
                  1 "."      . 7
                  1 "."      . 8
                  1 "."      . 9
                  2 "."      . 0
                  2 "."      . 1
                  2 "dna"    1 2
                  2 "lor"   40 3
                  2 "lor"   40 4
                  2 "."      . 5
                  2 "."      . 6
                  2 "."      . 7
                  2 "."      . 8
                  2 "."      . 9
                  3 "dna"    1 0
                  3 "phot"  30 1
                  3 "injec" 30 2
                  3 "dna"   30 3
                  3 "phot"  30 4
                  3 "injec" 30 5
                  3 "."      . 6
                  3 "."      . 7
                  3 "."      . 8
                  3 "."      . 9
                  4 "."      . 0
                  4 "."      . 1
                  4 "phot"   . 2
                  4 "injec"  . 3
                  4 "injec"  . 4
                  4 "dna"    . 5
                  4 "dna"    . 6
                  4 "dna"    . 7
                  4 "dna"    . 8
                  4 "phot"   . 9
                  end


                  Last edited by Sherine Maui; 10 May 2023, 10:07.

                  Comment


                  • #10
                    Code:
                    isid id proc age
                    gen n_months = ceil(duration/30)
                    expand n_months
                    by id age proc, sort: replace age = age[1] + _n - 1
                    drop n_months
                    
                    summ age, meanonly
                    local oldest = r(max)
                    
                    preserve
                    keep id
                    duplicates drop
                    expand `=`oldest'+1'
                    by id, sort: gen age = _n - 1
                    tempfile expander
                    save `expander'
                    
                    restore
                    merge m:1 id age using `expander', nogenerate
                    sort id age
                    Note: Only the bold faced code at the beginning is new: everything else is the same as before.

                    Comment


                    • #11
                      Thank you very much Clyde.

                      Comment

                      Working...
                      X