  • Summary Variables for Household Characteristics

    Hello Statalisters,
    I'm trying to summarize some statistics by household (NUM_HOG). Specifically, I would like a summary variable of how many kids they have that are 0 (age_0) through 5 years (age_5) and then I would like a simple binary indicating if the family has at least 1 child under 5.

    I created the following code to create the age_* variables and the under5 variable:

    forval i = 0/5 {
    bysort NUM_HOG: egen age_`i' = count(PPA03) if PPA03==`i'
    replace age_`i' =
    }

    bys NUM_HOG: gen under5 = 0
    replace under5=1 if age_0 !=. | age_1 !=. | age_2 !=. | age_3 !=. | age_4 !=. | age_5 !=.

    This resulted in the following dataset:
    Code:
    input double(NUM_HOG PPA02 PPA03) float(ame hh_ame age_0 age_1 age_2 age_3 age_4 age_5 under5)
    13680 1 71 .76 4.02 . . . . . . 0
    13680 2 65 .65 4.02 . . . . . . 0
    13680 1 28   1 4.02 . . . . . . 0
    13680 1 16 .96 4.02 . . . . . . 0
    13680 2 13 .65 4.02 . . . . . . 0
    13681 1 42 .95 3.06 . . . . . . 0
    13681 2 25 .74 3.06 . . . . . . 0
    13681 1  7 .56 3.06 . . . . . . 0
    13681 2  4 .44 3.06 . . . . 1 . 1
    13681 2  2 .37 3.06 . . 1 . . . 1
    13682 1 34 .95 2.98 . . . . . . 0
    13682 2 24 .74 2.98 . . . . . . 0
    13682 1  3 .37 2.98 . . . 1 . . 1
    13682 2  1 .27 2.98 . 1 . . . . 1
    13682 2 69 .65 2.98 . . . . . . 0
    end
    As you can see, for NUM_HOG==13682 the under5 variable is sometimes 0 and sometimes 1, depending on the individual within the family, and the age_* variables are mostly missing even when someone in the family is, for example, 1 year old.

    Question: I would like 1 observation per household with these summary statistics so that I can merge it with another portion of this national survey. As the data currently stands, if I collapse by under5 using mean or count, I'm going to inaccurately capture the number of children each household has. As I see it, I think I need to figure out a way to adjust/add to my code so that the age_* variables and under5 variables are all set to the same number per household (NUM_HOG), something like how the hh_ame variable is currently.

    I reviewed this post, which was helpful, but couldn't quite figure out how to apply it to the question at hand. I hope this is sufficient information to answer my question, but of course please highlight if more clarification is needed.

    Thank you in advance!

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(NUM_HOG PPA02 PPA03) float(ame hh_ame)
    13680 1 71 .76 4.02
    13680 2 65 .65 4.02
    13680 1 28   1 4.02
    13680 1 16 .96 4.02
    13680 2 13 .65 4.02
    13681 1 42 .95 3.06
    13681 2 25 .74 3.06
    13681 1  7 .56 3.06
    13681 2  4 .44 3.06
    13681 2  2 .37 3.06
    13682 1 34 .95 2.98
    13682 2 24 .74 2.98
    13682 1  3 .37 2.98
    13682 2  1 .27 2.98
    13682 2 69 .65 2.98
    end
    
    
    forvalues i = 0/5 {
        gen age_`i' = `i'.PPA03
    }
    collapse (max) age_* (first) hh_ame, by(NUM_HOG)
    egen age_5_or_under = rowmax(age_0-age_5)
    I cannot discern from your post precisely what you want. The code above creates variables age_0 through age_5 to indicate whether or not the household has any child of that age. If what you really want is a variable that gives a count of how many such children there are, replace (max) with (sum) in the -collapse- command. Similarly if what you want in the age_5_or_under variable is a count of the number of children age 5 or under (rather than just an indicator of whether or not there are any), replace -rowmax()- with -rowtotal()- in the -egen- command.
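As a minimal sketch of the count variant described above, with (sum) in -collapse- and -rowtotal()- in -egen-: the data are a trimmed copy of two households from the example, and the name n_age_5_or_under is a hypothetical choice, not something from the original post.

```stata
* Toy subset of the example data: household id, age, household-level ame
clear
input NUM_HOG PPA03 hh_ame
13681 42 3.06
13681 25 3.06
13681  7 3.06
13681  4 3.06
13681  2 3.06
13682 34 2.98
13682 24 2.98
13682  3 2.98
13682  1 2.98
13682 69 2.98
end

* 1/0 indicator for each single year of age, 0 through 5
forvalues i = 0/5 {
    gen byte age_`i' = `i'.PPA03
}

* (sum) turns the indicators into per-household counts
collapse (sum) age_* (first) hh_ame, by(NUM_HOG)

* rowtotal() adds the counts: number of children aged 5 or under
egen n_age_5_or_under = rowtotal(age_0-age_5)
list, noobs
```

Each household ends up as one observation, which is the shape needed for a later merge on NUM_HOG.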

    A couple of points on coding practices. Your original code creates dichotomous variables as 1 vs missing value. That is usually a bad idea in Stata and gets you into trouble. Stata's commands are at their best when working with dichotomous variables as 1 and 0. Only use missing values in Stata when the value is actually not known or undefinable. Do not use missing value as a code for "no" in Stata--it usually ends badly.
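To illustrate the point about 1/0 versus 1/missing coding, here is a small sketch on toy data. The variable names kid01, kid1m, any01 and any1m are hypothetical, and the toy ages contain no missing values.

```stata
* Two toy households: 13680 has no child aged 5 or under, 13681 has one
clear
input NUM_HOG PPA03
13680 71
13680 13
13681 42
13681  4
end

* 1/0 coding: every person gets a definite yes or no
gen byte kid01 = PPA03 <= 5

* 1/missing coding: "no" is stored as missing
gen byte kid1m = 1 if PPA03 <= 5

* Spread the household-level maximum to every member
bysort NUM_HOG: egen any01 = max(kid01)
bysort NUM_HOG: egen any1m = max(kid1m)
list, sepby(NUM_HOG) noobs
```

With 1/0 coding, household 13680 gets any01 = 0. With 1/missing coding, the same household gets a missing any1m, which is indistinguishable from "age not recorded".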

    Note also that your variable name, under5, is a misnomer for how you attempted to define it: you mean 5 or under, not under 5. I know that's a pedantic point, but if you have to come back to this code in the future, you may not remember what you really meant and then find yourself confused by what you see. When possible, choose variable names that convey their actual meaning; at the very least, do not choose names that contradict their meaning.



    • #3
      Hi Clyde,

      Thanks for trying to discern my question, and apologies for the confusion.

      Could you explain the line:
      Code:
      gen age_`i' = `i'.PPA03
      Specifically, what does the period represent? I don't think I've seen that notation before. The code created the age_ variables the way I wanted, so thank you.

      As for what I was going for, I wanted a summary of how many children of each age there are per household, so this seems to be the winning collapse code:

      collapse (sum) age_* (first) hh_ame, by(NUM_HOG)

      As for the 5 or under variable, I wanted a summary Y/N if they had a child 5 or under (pedantic point noted and appreciated)....
      Given that I used (sum) in my collapse, using rowmax() for generating the 5 or under variable sometimes gives me a count of 2 or 3 if there is more than one child in a certain age category for a particular family. As such, is the following code my best bet for this 5 or under variable or is there a more simplified/streamlined way to go about this?

      Code:
      gen age_5_or_under = 0
      replace age_5_or_under = 1 if age_0 !=0 | age_1 !=0 | age_2 !=0 | age_3 !=0 | age_4 !=0 | age_5 !=0
      Many thanks!!



      • #4
        The notation n.varname, where n is a non-negative integer and varname is a variable taking on non-negative integer values, denotes a Boolean variable that takes on the value 1 in observations where varname == n, missing when varname is missing, and 0 when varname is non-missing and not equal to n. So, on the first run through the loop, we generate age_0 = 1 if PPA03 == 0, . if PPA03 is missing, and 0 otherwise. The next run through the loop does the same thing for age_1, indicating when PPA03 == 1, and so on.
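A quick way to see this for yourself is to compare n.varname with an explicit equality test, as in the sketch below. The toy ages have no missing values, and age_3_eq is a hypothetical name used only for the comparison.

```stata
* Toy ages with no missing values
clear
input PPA03
0
1
3
69
end

* Factor variable notation: 1 where PPA03 == 3, 0 otherwise
gen byte age_3 = 3.PPA03

* The same indicator written as an explicit logical expression
gen byte age_3_eq = (PPA03 == 3)

* The two agree on every observation in this toy data
assert age_3 == age_3_eq
list, noobs
```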

        Given that you have made the age_* variables into counts, I think the simplest way to get an any-child-5-or-under variable is:

        Code:
        egen age_5_or_under = rowmax(age_0-age_5)
        replace age_5_or_under = 1 if age_5_or_under > 1 & !missing(age_5_or_under)
        Added: the n.varname construction is called factor variable notation, and this is really its most elementary application. It has other uses as well. Do read about it in -help fvvarlist-.
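For comparison, one possible alternative (a sketch, not a claim that it is better) is to total the counts and test whether the total is positive. The data below are made-up household-level counts, household 13683 is hypothetical, and n_5_or_under and any_5_or_under are hypothetical names.

```stata
* Made-up household-level counts, as they would look after -collapse (sum)-
clear
input NUM_HOG age_0 age_1 age_2 age_3 age_4 age_5
13680 0 0 0 0 0 0
13681 0 0 1 0 1 0
13683 0 2 0 1 0 0
end

* Total number of children aged 5 or under in the household
egen n_5_or_under = rowtotal(age_0-age_5)

* 1/0 indicator: at least one such child
gen byte any_5_or_under = n_5_or_under > 0

* The rowmax() route from the code above, for comparison
egen age_5_or_under = rowmax(age_0-age_5)
replace age_5_or_under = 1 if age_5_or_under > 1 & !missing(age_5_or_under)
assert any_5_or_under == age_5_or_under
list, noobs
```

This keeps both the count and the indicator, which may be convenient if both are wanted after the merge.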
        Last edited by Clyde Schechter; 08 May 2019, 20:29.
