Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Expanding dataset to add additional rows

    I have a dataset that contains a series of variables for each respondent, tracked over a certain range of their age (0 to the highest age they were tracked for). Currently, each respondent has a number of rows corresponding to the range of ages they were tracked for. For example, if a respondent was tracked up to the age of 30, then they have rows from age 0 to 30 in the dataset.

    I want to standardize this so that each respondent has data rows from age 0 up to 60, regardless of their actual tracking period. If a respondent was not tracked for a certain age (for instance, any age above their last tracked age), I want these rows to still exist in the dataset, but with missing values in all the variable fields, except the ID and age. For example, if an individual was tracked until age 30, I would like to create additional rows for ages 31 to 60 for this individual, with all variables (emp, mar, ed) marked as missing for these additional years. Please is there a way to do this?

    Data example with the persons Id, age, and three variables.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 ID long(Age emp ed mar)
    "0601004004"  0 6 4 20
    "0601004004"  1 6 4 20
    "0601004004"  2 6 4 20
    "0601004004"  3 6 4 20
    "0601004004"  4 6 3 20
    "0601004004"  5 6 3 20
    "0601004004"  6 6 6 20
    "0601004004"  7 6 6 20
    "0601004004"  8 6 6 20
    "0601004004"  9 6 6 20
    "0601004004" 10 6 6 20
    "0601004004" 11 6 5 20
    "0601004004" 12 6 5 20
    "0601004004" 13 6 5 20
    "0601004004" 14 6 1 20
    "0601004004" 15 6 1 20
    "0601004004" 16 6 1 20
    "0601004004" 17 6 7 20
    "0601004004" 18 6 7  9
    "0601004004" 19 6 7 10
    "0601004004" 20 6 7 10
    "0601004004" 21 6 7 10
    "0601004004" 22 6 7 10
    "0601004004" 23 6 7 12
    "0601004004" 24 6 7 12
    "0601004004" 25 6 7 12
    "0601004004" 26 6 7 12
    "0601004004" 27 6 7 12
    "0601004004" 28 6 7 12
    "0601004004" 29 6 7 12
    "0601004004" 30 6 7 12
    "0601004004" 31 6 7 12
    "0601004004" 32 6 7 12
    "0601004004" 33 6 7 12
    "0601004004" 34 6 7 12
    "0601004004" 35 6 7 13
    "0601004004" 36 6 7 13
    "0601004004" 37 6 7 13
    "0601004603"  0 6 4 20
    "0601004603"  1 6 4 20
    "0601004603"  2 6 4 20
    "0601004603"  3 6 4 20
    "0601004603"  4 6 4 20
    "0601004603"  5 6 4 20
    "0601004603"  6 6 4 20
    "0601004603"  7 6 6 20
    "0601004603"  8 6 6 20
    "0601004603"  9 6 6 20
    "0601004603" 10 6 6 20
    "0601004603" 11 6 6 20
    "0601004603" 12 6 6 20
    "0601004603" 13 6 4 20
    "0601004603" 14 6 4 20
    "0601004603" 15 6 4 20
    "0601004603" 16 6 4 20
    "0601004603" 17 6 4 20
    "0601004603" 18 6 4 20
    "0601004603" 19 6 4 20
    "0601004603" 20 6 4 20
    "0601004603" 21 6 4 20
    "0601004603" 22 6 4 20
    "0601004603" 23 6 4  9
    "0601004603" 24 6 4 10
    "0601004603" 25 6 4 10
    "0601004603" 26 6 4 12
    "0601004603" 27 6 4 12
    "0601004603" 28 6 4 12
    "0601004603" 29 6 4 12
    "0601004603" 30 6 4 12
    "0601004603" 31 6 4 13
    "0601004603" 32 6 4 13
    "0601004603" 33 6 4 13
    "0601004603" 34 6 4 13
    "0601004603" 35 6 4 13
    "0601004603" 36 6 4 13
    "0601004603" 37 6 4 13
    "0601004603" 38 6 4 13
    "0601004603" 39 6 4 13
    "0601005902"  0 6 4 20
    "0601005902"  1 6 4 20
    "0601005902"  2 6 4 20
    "0601005902"  3 6 4 20
    "0601005902"  4 6 4 20
    "0601005902"  5 6 4 20
    "0601005902"  6 6 6 20
    "0601005902"  7 6 6 20
    "0601005902"  8 6 6 20
    "0601005902"  9 6 6 20
    "0601005902" 10 6 6 20
    "0601005902" 11 6 6 20
    "0601005902" 12 6 5 20
    "0601005902" 13 6 5 20
    "0601005902" 14 6 5 20
    "0601005902" 15 6 1 20
    "0601005902" 16 6 1 20
    "0601005902" 17 6 1 20
    "0601005902" 18 6 7 20
    "0601005902" 19 6 7 20
    "0601005902" 20 6 7 20
    "0601005902" 21 6 7 20
    end
    label values emp emstate
    label def emstate 6 "Unemployed", modify
    label values ed ed_state
    label def ed_state 1 " general secondary", modify
    label def ed_state 3 "nursery", modify
    label def ed_state 4 "out of school", modify
    label def ed_state 5 "prep", modify
    label def ed_state 6 "primary", modify
    label def ed_state 7 "uni", modify
    label values mar markid_state
    label def markid_state 9 "married-0 child", modify
    label def markid_state 10 "married-1 child", modify
    label def markid_state 12 "married-2 child", modify
    label def markid_state 13 "married-3 child", modify
    label def markid_state 20 "single-0 child", modify

  • #2
    On the assumption that there is at least one person in your dataset who was tracked all the way to age 60, you can do this with:
    Code:
    fillin ID Age

    Comment


    • #3
      I did not realise it was that simple. Thank you, Clyde. And I did have at least one respondent who was tracked all the way up to age 60.

      Comment


      • #4
        I did not realise it was that simple.
        Well, handy as it is when needed, the -fillin- command, at least in my workflow, is very infrequently used, perhaps once every couple of years. I think most users would go a long time before ever needing it. And they could probably hack up some code using basic commands that would do the job -- not fun, nor pretty, but workable. Moreover, even if in your heart you believed there is a command like -fillin-, it's not clear how you would go about finding its name. I think the most likely way would be from posts here on Statalist, where it comes up every now and then. That's how I found it the first time. But, again, if I didn't know the name of the command in the first place, I'm not sure what search terms would lead me to it here.

        Comment


        • #5
          Like Clyde Schechter, I don't use -fillin- all that often. But on occasion when I have wanted to use it, I have sometimes struggled to remember its name.

          Here is one example where I did find it useful In a chapter on basic probability for an introductory biostatistics class, I include a problem that was inflicted on me when I was a student.

          Q. If I roll 3 fair 6-sided dice, what is the probability of getting at least one 2 or one 4?

          The main thing we were meant to learn from this problem is that when asked for p(A), sometimes it is much easier to compute p(not A), and to remember that p(A) = 1 - p(not A).

          It occurred to me that for students who are struggling with that approach, it might help to confirm the result with a so-called "brute-force" (or exhaustive enumeration) approach, in which I list all possible combinations of the 3 dice and then just count how many meet the condition of interest. That's where -fillin- came in. (It was also an excuse to try to teach them a little more about writing Stata code.)

          Code:
          . * Q. If you roll 3 fair 6-sided dice, what is the probability
          . *     that at least one 2 or one 4 will be showing?
          .
          . * A. p(At least one 2 or one 4) = 1 - p(no 2s and no 4s).
          .
          . display "1 - p(no 2s and no 4s) = " 1 - (4/6)^3
          1 - p(no 2s and no 4s) = .7037037
          
          .
          . * Use "brute-force" approach to check our answer
          . // Create a dataset with the 6*6*6 combinations for the 3 dice.
          . clear
          
          . set obs 6 // 6 rows in the dataset initially
          Number of observations (_N) was 0, now 6.
          
          . forvalues i = 1/3 {
            2.         generate byte die`i' = _n
            3. }
          
          . fillin die1-die3 // generate 6*6*6 rows with all combinations of die1-die3
          
          . // Flag the cases with at least one 2 or one 4.
          . // The | symbol on the next line means "OR".
          . generate byte flag2or4 = inlist(die1,2,4) | inlist(die2,2,4) | inlist(die3,2,4)
          
          . // Assign variable and value labels
          . label variable flag2or4 "Any 2s or any 4s"
          
          . label define noyes 0 "No" 1 "Yes"
          
          . label values flag2or4 noyes
          
          . // Display a frequency table
          . tabulate flag2or4
          
            Any 2s or |
               any 4s |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                   No |         64       29.63       29.63
                  Yes |        152       70.37      100.00
          ------------+-----------------------------------
                Total |        216      100.00
          
          . // Answer to the problem = % YES in this table.

          Here's the code without output.

          Code:
          * Q. If you roll 3 fair 6-sided dice, what is the probability
          *     that at least one 2 or one 4 will be showing?
          
          * A. p(At least one 2 or one 4) = 1 - p(no 2s and no 4s).
          
          display "1 - p(no 2s and no 4s) = " 1 - (4/6)^3
          
          * Use "brute-force" approach to check our answer
          // Create a dataset with the 6*6*6 combinations for the 3 dice.
          clear
          set obs 6 // 6 rows in the dataset initially
          forvalues i = 1/3 {
              generate byte die`i' = _n
          }
          fillin die1-die3 // generate 6*6*6 rows with all combinations of die1-die3
          // Flag the cases with at least one 2 or one 4.
          // The | symbol on the next line means "OR".
          generate byte flag2or4 = inlist(die1,2,4) | inlist(die2,2,4) | inlist(die3,2,4)
          // Assign variable and value labels
          label variable flag2or4 "Any 2s or any 4s"
          label define noyes 0 "No" 1 "Yes"
          label values flag2or4 noyes
          // Display a frequency table
          tabulate flag2or4
          // Answer to the problem = % YES in this table.


          --
          Bruce Weaver
          Email: [email protected]
          Version: Stata/MP 19.5 (Windows)

          Comment


          • #6
            As noted above, Skye Weber did have at least one respondent with Age ranging from 0 to 60, so Clyde Schechter's one-line -fillin- solution worked. Nevertheless, I was prompted to think about what I would do if there were no respondents with Age ranging from 0 to 60. While pondering that, I learned about the -insobs- command, which I do not remember seeing previously. There may be more efficient solutions, but here is what I came up with. Append this code to the -dataex- code in #1.

            Code:
            * Let local macro N = the number of records
            * in the original dataset
            local N = _N
            * Append 61 new observations
            insobs 61
            * Flag the new records as junk
            generate byte junk = _n > `N'
            replace ID = "junk" if junk
            bysort junk: replace Age = _n-1 if junk
            fillin ID Age
            * Note that junk is missing on some rows we
            * want to keep, so we must specify junk==1
            * on the next line:  drop if junk does not work!
            drop if junk==1
            drop junk
            --
            Bruce Weaver
            Email: [email protected]
            Version: Stata/MP 19.5 (Windows)

            Comment


            • #7
              Thank you so much, Bruce Weaver for showing that. I was not aware of -insobs-; it certainly can come in handy and I think I will be using it pretty regularly going forward.

              Comment


              • #8
                No worries, Clyde Schechter. It's certainly much less clunky than the first approach I considered (i.e., creating a new frame to hold the junk data, then appending the new frame to the default frame, etc.)

                PS- It is a comfort to the rest of us that not even you know everything about Stata. (The jury is still out on Nick.)
                --
                Bruce Weaver
                Email: [email protected]
                Version: Stata/MP 19.5 (Windows)

                Comment

                Working...
                X