Expanding dataset to add additional rows

Skye Weber

Join Date: Jul 2023
Posts: 2

Expanding dataset to add additional rows

19 Jul 2023, 08:15

I have a dataset that contains a series of variables for each respondent, tracked over a certain range of their age (0 to the highest age they were tracked for). Currently, each respondent has a number of rows corresponding to the range of ages they were tracked for. For example, if a respondent was tracked up to the age of 30, then they have rows from age 0 to 30 in the dataset.

I want to standardize this so that each respondent has data rows from age 0 up to 60, regardless of their actual tracking period. If a respondent was not tracked for a certain age (for instance, any age above their last tracked age), I want these rows to still exist in the dataset, but with missing values in all the variable fields, except the ID and age. For example, if an individual was tracked until age 30, I would like to create additional rows for ages 31 to 60 for this individual, with all variables (emp, mar, ed) marked as missing for these additional years. Please is there a way to do this?

Data example with the persons Id, age, and three variables.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 ID long(Age emp ed mar)
"0601004004"  0 6 4 20
"0601004004"  1 6 4 20
"0601004004"  2 6 4 20
"0601004004"  3 6 4 20
"0601004004"  4 6 3 20
"0601004004"  5 6 3 20
"0601004004"  6 6 6 20
"0601004004"  7 6 6 20
"0601004004"  8 6 6 20
"0601004004"  9 6 6 20
"0601004004" 10 6 6 20
"0601004004" 11 6 5 20
"0601004004" 12 6 5 20
"0601004004" 13 6 5 20
"0601004004" 14 6 1 20
"0601004004" 15 6 1 20
"0601004004" 16 6 1 20
"0601004004" 17 6 7 20
"0601004004" 18 6 7  9
"0601004004" 19 6 7 10
"0601004004" 20 6 7 10
"0601004004" 21 6 7 10
"0601004004" 22 6 7 10
"0601004004" 23 6 7 12
"0601004004" 24 6 7 12
"0601004004" 25 6 7 12
"0601004004" 26 6 7 12
"0601004004" 27 6 7 12
"0601004004" 28 6 7 12
"0601004004" 29 6 7 12
"0601004004" 30 6 7 12
"0601004004" 31 6 7 12
"0601004004" 32 6 7 12
"0601004004" 33 6 7 12
"0601004004" 34 6 7 12
"0601004004" 35 6 7 13
"0601004004" 36 6 7 13
"0601004004" 37 6 7 13
"0601004603"  0 6 4 20
"0601004603"  1 6 4 20
"0601004603"  2 6 4 20
"0601004603"  3 6 4 20
"0601004603"  4 6 4 20
"0601004603"  5 6 4 20
"0601004603"  6 6 4 20
"0601004603"  7 6 6 20
"0601004603"  8 6 6 20
"0601004603"  9 6 6 20
"0601004603" 10 6 6 20
"0601004603" 11 6 6 20
"0601004603" 12 6 6 20
"0601004603" 13 6 4 20
"0601004603" 14 6 4 20
"0601004603" 15 6 4 20
"0601004603" 16 6 4 20
"0601004603" 17 6 4 20
"0601004603" 18 6 4 20
"0601004603" 19 6 4 20
"0601004603" 20 6 4 20
"0601004603" 21 6 4 20
"0601004603" 22 6 4 20
"0601004603" 23 6 4  9
"0601004603" 24 6 4 10
"0601004603" 25 6 4 10
"0601004603" 26 6 4 12
"0601004603" 27 6 4 12
"0601004603" 28 6 4 12
"0601004603" 29 6 4 12
"0601004603" 30 6 4 12
"0601004603" 31 6 4 13
"0601004603" 32 6 4 13
"0601004603" 33 6 4 13
"0601004603" 34 6 4 13
"0601004603" 35 6 4 13
"0601004603" 36 6 4 13
"0601004603" 37 6 4 13
"0601004603" 38 6 4 13
"0601004603" 39 6 4 13
"0601005902"  0 6 4 20
"0601005902"  1 6 4 20
"0601005902"  2 6 4 20
"0601005902"  3 6 4 20
"0601005902"  4 6 4 20
"0601005902"  5 6 4 20
"0601005902"  6 6 6 20
"0601005902"  7 6 6 20
"0601005902"  8 6 6 20
"0601005902"  9 6 6 20
"0601005902" 10 6 6 20
"0601005902" 11 6 6 20
"0601005902" 12 6 5 20
"0601005902" 13 6 5 20
"0601005902" 14 6 5 20
"0601005902" 15 6 1 20
"0601005902" 16 6 1 20
"0601005902" 17 6 1 20
"0601005902" 18 6 7 20
"0601005902" 19 6 7 20
"0601005902" 20 6 7 20
"0601005902" 21 6 7 20
end
label values emp emstate
label def emstate 6 "Unemployed", modify
label values ed ed_state
label def ed_state 1 " general secondary", modify
label def ed_state 3 "nursery", modify
label def ed_state 4 "out of school", modify
label def ed_state 5 "prep", modify
label def ed_state 6 "primary", modify
label def ed_state 7 "uni", modify
label values mar markid_state
label def markid_state 9 "married-0 child", modify
label def markid_state 10 "married-1 child", modify
label def markid_state 12 "married-2 child", modify
label def markid_state 13 "married-3 child", modify
label def markid_state 20 "single-0 child", modify

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

19 Jul 2023, 08:41

On the assumption that there is at least one person in your dataset who was tracked all the way to age 60, you can do this with:

Code:

fillin ID Age
2 likes
Comment
Skye Weber

Join Date: Jul 2023

Posts: 2
#3

19 Jul 2023, 10:10

I did not realise it was that simple. Thank you, Clyde. And I did have at least one respondent who was tracked all the way up to age 60.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#4

19 Jul 2023, 10:52

I did not realise it was that simple.

Well, handy as it is when needed, the -fillin- command, at least in my workflow, is very infrequently used, perhaps once every couple of years. I think most users would go a long time before ever needing it. And they could probably hack up some code using basic commands that would do the job -- not fun, nor pretty, but workable. Moreover, even if in your heart you believed there is a command like -fillin-, it's not clear how you would go about finding its name. I think the most likely way would be from posts here on Statalist, where it comes up every now and then. That's how I found it the first time. But, again, if I didn't know the name of the command in the first place, I'm not sure what search terms would lead me to it here.
1 like
Comment

Bruce Weaver

Join Date: May 2014
Posts: 1139

19 Jul 2023, 15:10

Like Clyde Schechter, I don't use -fillin- all that often. But on occasion when I have wanted to use it, I have sometimes struggled to remember its name.

Here is one example where I did find it useful In a chapter on basic probability for an introductory biostatistics class, I include a problem that was inflicted on me when I was a student.

Q. If I roll 3 fair 6-sided dice, what is the probability of getting at least one 2 or one 4?

The main thing we were meant to learn from this problem is that when asked for p(A), sometimes it is much easier to compute p(not A), and to remember that p(A) = 1 - p(not A).

It occurred to me that for students who are struggling with that approach, it might help to confirm the result with a so-called "brute-force" (or exhaustive enumeration) approach, in which I list all possible combinations of the 3 dice and then just count how many meet the condition of interest. That's where -fillin- came in. (It was also an excuse to try to teach them a little more about writing Stata code.)

Code:

. * Q. If you roll 3 fair 6-sided dice, what is the probability
. *     that at least one 2 or one 4 will be showing?
.
. * A. p(At least one 2 or one 4) = 1 - p(no 2s and no 4s).
.
. display "1 - p(no 2s and no 4s) = " 1 - (4/6)^3
1 - p(no 2s and no 4s) = .7037037

.
. * Use "brute-force" approach to check our answer
. // Create a dataset with the 6*6*6 combinations for the 3 dice.
. clear

. set obs 6 // 6 rows in the dataset initially
Number of observations (_N) was 0, now 6.

. forvalues i = 1/3 {
  2.         generate byte die`i' = _n
  3. }

. fillin die1-die3 // generate 6*6*6 rows with all combinations of die1-die3

. // Flag the cases with at least one 2 or one 4.
. // The | symbol on the next line means "OR".
. generate byte flag2or4 = inlist(die1,2,4) | inlist(die2,2,4) | inlist(die3,2,4)

. // Assign variable and value labels
. label variable flag2or4 "Any 2s or any 4s"

. label define noyes 0 "No" 1 "Yes"

. label values flag2or4 noyes

. // Display a frequency table
. tabulate flag2or4

  Any 2s or |
     any 4s |      Freq.     Percent        Cum.
------------+-----------------------------------
         No |         64       29.63       29.63
        Yes |        152       70.37      100.00
------------+-----------------------------------
      Total |        216      100.00

. // Answer to the problem = % YES in this table.

Here's the code without output.

Code:

* Q. If you roll 3 fair 6-sided dice, what is the probability
*     that at least one 2 or one 4 will be showing?

* A. p(At least one 2 or one 4) = 1 - p(no 2s and no 4s).

display "1 - p(no 2s and no 4s) = " 1 - (4/6)^3

* Use "brute-force" approach to check our answer
// Create a dataset with the 6*6*6 combinations for the 3 dice.
clear
set obs 6 // 6 rows in the dataset initially
forvalues i = 1/3 {
    generate byte die`i' = _n
}
fillin die1-die3 // generate 6*6*6 rows with all combinations of die1-die3
// Flag the cases with at least one 2 or one 4.
// The | symbol on the next line means "OR".
generate byte flag2or4 = inlist(die1,2,4) | inlist(die2,2,4) | inlist(die3,2,4)
// Assign variable and value labels
label variable flag2or4 "Any 2s or any 4s"
label define noyes 0 "No" 1 "Yes"
label values flag2or4 noyes
// Display a frequency table
tabulate flag2or4
// Answer to the problem = % YES in this table.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)

Comment

Bruce Weaver

Join Date: May 2014

Posts: 1139
#6

24 Jul 2023, 13:26

As noted above, Skye Weber did have at least one respondent with Age ranging from 0 to 60, so Clyde Schechter's one-line -fillin- solution worked. Nevertheless, I was prompted to think about what I would do if there were no respondents with Age ranging from 0 to 60. While pondering that, I learned about the -insobs- command, which I do not remember seeing previously. There may be more efficient solutions, but here is what I came up with. Append this code to the -dataex- code in #1.

Code:

* Let local macro N = the number of records * in the original dataset local N = _N * Append 61 new observations insobs 61 * Flag the new records as junk generate byte junk = _n > `N' replace ID = "junk" if junk bysort junk: replace Age = _n-1 if junk fillin ID Age * Note that junk is missing on some rows we * want to keep, so we must specify junk==1 * on the next line: drop if junk does not work! drop if junk==1 drop junk

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#7

24 Jul 2023, 14:36

Thank you so much, Bruce Weaver for showing that. I was not aware of -insobs-; it certainly can come in handy and I think I will be using it pretty regularly going forward.
1 like
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1139
#8

24 Jul 2023, 15:50

No worries, Clyde Schechter. It's certainly much less clunky than the first approach I considered (i.e., creating a new frame to hold the junk data, then appending the new frame to the default frame, etc.)

PS- It is a comfort to the rest of us that not even you know everything about Stata. (The jury is still out on Nick.)

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
2 likes
Comment

Announcement

Expanding dataset to add additional rows

Comment

Comment

Comment

Comment

Comment

Comment

Comment