How do I add observations to my data

Sherine Maui

Join Date: Apr 2018

Posts: 90
#1

How do I add observations to my data

09 May 2023, 08:56

Hello Statalist,

I have a longitudinal dataset, where each individual is identified by a unique ID and observed at multiple time points. The data includes a variable for the procedure performed on each individual, the duration of the procedure in days, and the individual's age in months at each time point.

I want to expand the dataset so that every individual has as many observations as the longest participating individual based on age. For example, if the maximum age in the data is 200 months, each individual should have 200 observations to represent each month from 1 to 200. For each observation, I want to record the individual's age in months, and the procedure performed on them (if any).

If an individual entered the program later or exited earlier than the longest participating individual, they would have missing values for the procedure variable/duration in the corresponding age months, but their age in months would still be recorded. I am not sure where to begin with my code.

For example, my data looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float id str5 proc float(duration age) 1 "dna" 20 0 1 "phot" 30 1 1 "injec" 30 2 1 "lor" 30 3 2 "dna" 1 2 2 "lor" 40 3 3 "dna" 1 0 3 "phot" 30 1 3 "injec" 30 2 3 "dna" 30 3 3 "phot" 30 4 3 "injec" 30 5 4 "phot" 1 2 4 "injec" 60 4 4 "dna" 120 8 4 "phot" 30 9 end

And I'd like to go to this, where each individual in this example data will have nine rows (i.e. the maximum age in months) and the recorded placement, and missing values for those that weren't observed for the entire duration:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float id str5 proc float(duration age) 1 "dna" 20 0 1 "phot" 30 1 1 "injec" 30 2 1 "lor" 30 3 1 "." . 4 1 "." . 5 1 "." . 6 1 "." . 7 1 "." . 8 1 "." . 9 2 "." . 0 2 "." . 1 2 "dna" 1 2 2 "lor" 40 3 2 "." . 4 2 "." . 5 2 "." . 6 2 "." . 7 2 "." . 8 2 "." . 9 3 "dna" 1 0 3 "phot" 30 1 3 "injec" 30 2 3 "dna" 30 3 3 "phot" 30 4 3 "injec" 30 5 3 "." . 6 3 "." . 7 3 "." . 8 3 "." . 9 4 "." . 0 4 "." . 1 4 "phot" . 2 4 "injec" . 3 4 "injec" . 4 4 "dna" . 5 4 "dna" . 6 4 "dna" . 7 4 "dna" . 8 4 "phot" . 9 end

(Each procedure happens consecutively. For e.g. Individual 4 had phot for 1 day, and then inject until the age of 4 months)
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

09 May 2023, 10:04

Code:

summ age, meanonly
local oldest = r(max)

preserve
keep id
duplicates drop
expand `=`oldest'+1'
by id, sort: gen age = _n - 1
tempfile expander
save `expander'

restore
merge 1:1 id age using `expander', nogenerate
isid id age, sort

Comment

Ken Chui

Join Date: Aug 2014

Posts: 1058
#3

09 May 2023, 10:09

And I'd like to go to this, where each individual in this example data will have nine rows (i.e. the maximum age in months)

Should be 10 rows since age 0 is also present in the data.

This may work, assuming no repeated age within ID:

Code:

tsset id age tsfill, full
Comment
Sherine Maui

Join Date: Apr 2018

Posts: 90
#4

09 May 2023, 11:24

Thank you Clyde. The code works fine until I start getting errors with the merge command

Code:

merge 1:1 id age using `expander', nogenerate variables ID age do not uniquely identify observations in the master data

I think this might be caused by the fact that some individuals in the data will have more than one procedure over a period that does not exceed one month, and therefore we would have rows with the unique ID and same age but a different procedure recorded. Is there a way to get round this?

I tried to edit the code and I am getting slightly closer. I have created a "procedure number" within each month for each id:

Code:

bysort id age: gen time =_n summ age, meanonly local oldest=r(max) preserve keep id duplicates drop expand `=`oldest'+1' bysort id: gen age=_n-1 gen time=1 tempfile temp save `temp' restore merge 1:m id age time using `temp', nogenerate

However, this produces a data that looks like the following, it does not record the procedure in month 3 for e.g. where it was injec:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float id str6 proc float(duration age) 4 "." . 0 4 "." . 1 4 "phot" 1 2 4 "" . 3 4 "injec" 60 4 4 "" . 5 4 "" . 6 4 "" . 7 4 "dna" 120 8 4 "phot" 30 9 end

A bit stumped on how to correct the code.
Comment
Sherine Maui

Join Date: Apr 2018

Posts: 90
#5

09 May 2023, 11:24

Thank you Ken. I face a similar problem here, which I'm assuming stems from the same issue:

Code:

tsset ID age

Code:

repeated time values within panels
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

09 May 2023, 11:32

I think the code in #2 will do what you want if you just change -merge 1:1- to -merge m:1-.
Comment
Sherine Maui

Join Date: Apr 2018

Posts: 90
#7

09 May 2023, 11:58

Thanks Clyde, it does not seem to be working as it does not add those additional rows for each person using merge m:1, or merge m:m. The original data is set up in the same way, only difference is that some individual have multiple procedures during one month. I may have to think of another way

Last edited by Sherine Maui; 09 May 2023, 12:01.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

09 May 2023, 12:11

I marked up your example to include an observation that duplicated an id and age. Then I tried it with the code in #2, modified to use -merge m:1- instead of -merge 1:1-. It works appropriately:

Code:

clear input float id str5 proc float(duration age) 1 "dna" 20 0 1 "phot" 30 1 1 "injec" 30 2 1 "xxxxx" 10 2 1 "lor" 30 3 2 "dna" 1 2 2 "lor" 40 3 3 "dna" 1 0 3 "phot" 30 1 3 "injec" 30 2 3 "dna" 30 3 3 "phot" 30 4 3 "injec" 30 5 4 "phot" 1 2 4 "injec" 60 4 4 "dna" 120 8 4 "phot" 30 9 end summ age, meanonly local oldest = r(max) preserve keep id duplicates drop expand `=`oldest'+1' by id, sort: gen age = _n - 1 tempfile expander save `expander' restore merge m:1 id age using `expander', nogenerate sort id age

If the results from this example are not what you want, then please re-explain what you are looking for. If these results are OK, but the code still fails to work properly in your full data set, please post back with a new data example that exhibits whatever problem(s) you are encountering.

Added: I don't want to go into a long digression about this, but I can't let one thing pass. NEVER use -merge m:m-. It is the solution to a problem that never arises in real research. If you ever think you need -merge m:m- either your data are in error or you don't understand your data correctly. -merge m:m- just produces data salad.

Last edited by Clyde Schechter; 09 May 2023, 12:14.
1 like
Comment

Sherine Maui

Join Date: Apr 2018
Posts: 90

10 May 2023, 09:25

Thank you for explaining Clyde and for cautioning against the use of m:m merges. I made the silly mistake of not sorting by id and age – now that I do, I can see that the additional rows have been added to my sample.

There is just one small issue. In the dataset, when I do add these additional rows they don’t take into account that these procedures last over more than one month. For e.g. ID 4’s second procedure was “injec” and it lasted for a duration of 60 days until they were age 4 months. The third procedure was dna and lasted for 120 days (4 months) until age 8 months. When I add the additional rows using the code in post #2, the row for ID 4 at age 3 months has missing values when it should have “injec”, and the rows for ages 5-7 (as well as 8) should also have “dna”.

This is what the data looks like after running #2 (editing the merge command to m:1):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str5 proc float(duration age)
1 "dna"    20 0
1 "phot"   30 1
1 "injec"  30 2
1 "lor"    30 3
1 ""        . 4
1 ""        . 5
1 ""        . 6
1 ""        . 7
1 ""        . 8
1 ""        . 9
2 ""        . 0
2 ""        . 1
2 "dna"     1 2
2 "lor"    40 3
2 ""        . 4
2 ""        . 5
2 ""        . 6
2 ""        . 7
2 ""        . 8
2 ""        . 9
3 "dna"     1 0
3 "phot"   30 1
3 "injec"  30 2
3 "dna"    30 3
3 "phot"   30 4
3 "injec"  30 5
3 ""        . 6
3 ""        . 7
3 ""        . 8
3 ""        . 9
4 ""        . 0
4 ""        . 1
4 "phot"    1 2
4 ""        . 3
4 "injec"  60 4
4 ""        . 5
4 ""        . 6
4 ""        . 7
4 "dna"   120 8
4 "phot"   30 9
end

Compared to what I am trying to achieve here (ID 4’s procedures are added for each age where they are being treated):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str5 proc float(duration age)
1 "dna"   20 0
1 "phot"  30 1
1 "injec" 30 2
1 "lor"   30 3
1 "."      . 4
1 "."      . 5
1 "."      . 6
1 "."      . 7
1 "."      . 8
1 "."      . 9
2 "."      . 0
2 "."      . 1
2 "dna"    1 2
2 "lor"   40 3
2 "lor"   40 4
2 "."      . 5
2 "."      . 6
2 "."      . 7
2 "."      . 8
2 "."      . 9
3 "dna"    1 0
3 "phot"  30 1
3 "injec" 30 2
3 "dna"   30 3
3 "phot"  30 4
3 "injec" 30 5
3 "."      . 6
3 "."      . 7
3 "."      . 8
3 "."      . 9
4 "."      . 0
4 "."      . 1
4 "phot"   . 2
4 "injec"  . 3
4 "injec"  . 4
4 "dna"    . 5
4 "dna"    . 6
4 "dna"    . 7
4 "dna"    . 8
4 "phot"   . 9
end

Last edited by Sherine Maui; 10 May 2023, 10:07.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

#10

10 May 2023, 10:23

Code:

isid id proc age
gen n_months = ceil(duration/30)
expand n_months
by id age proc, sort: replace age = age[1] + _n - 1
drop n_months

summ age, meanonly
local oldest = r(max)

preserve
keep id
duplicates drop
expand `=`oldest'+1'
by id, sort: gen age = _n - 1
tempfile expander
save `expander'

restore
merge m:1 id age using `expander', nogenerate
sort id age

Note: Only the bold faced code at the beginning is new: everything else is the same as before.

Comment

Sherine Maui

Join Date: Apr 2018

Posts: 90
#11

10 May 2023, 13:15

Thank you very much Clyde.
Comment

Announcement