  • Trouble with Creating Group-Level variable

    Hello, I have a variable that has repeated observations over the life course. Research shows that the measurement of this variable (e.g., health, household income) during early childhood is a significant predictor of that variable during adulthood. Accordingly, I'd like to see how average childhood health between the ages of 3 and 5 correlates with average adulthood health at various points in adulthood (e.g., 25-27, 28-30, 31-33, etc.). Therefore, I used the following to generate a measure of average childhood health between the ages of 3 and 5.

    Code:
    bysort personid: egen mhealth35 = mean(health) if inrange(age,3,5)
    I used the same code to generate a variable for mean health status at each later age interval (e.g., 25-27, 28-30, 31-33, etc.). The result is that I get multiple variables indicating the mean health status for ages 25-27 (mhealth2527), ages 28-30 (mhealth2830), ages 31-33 (mhealth3133), and so on. However, when I regress any of these adulthood mean health status variables on the mean health status for early childhood (mhealth35), I get no observations.

    This seems to be because the code I used only fills in the average when the respondent is within the ages specified in inrange(age,3,5). For example, if a person's average health status between the ages of 3 and 5 is 3.67, then the mean health status variable (mhealth35) only takes on the value of 3.67 when the respondent is 3, 4, or 5 years old. When the respondent is not any of those ages, the value is missing. I'd like the value not to be missing. How would I do that? I hope this is clear. Please let me know if I should clarify.

    Below is an example of what my data looks like after using the code displayed above to generate the average health status variables for two separate age intervals (3-5 and 25-27). Rather than display the full set of ages available, I've truncated the observations for convenience.

    What my data looks like:
    Individual Id   Health Status   Average Health Status, ages 3-5   Age   Average Health Status, ages 25-27
    10001           2               3.67                              3     missing
    10001           4               3.67                              4     missing
    10001           5               3.67                              5     missing
    10001           4               missing                           25    4.33
    10001           4               missing                           26    4.33
    10001           5               missing                           27    4.33
    10001           2               missing                           28    missing
    10001           5               missing                           29    missing
    10001           3               missing                           30    missing

    Below is an example of what I want my data to look like:
    Individual Id   Health Status   Average Health Status, ages 3-5   Age   Average Health Status, ages 25-27
    10001           2               3.67                              3     4.33
    10001           4               3.67                              4     4.33
    10001           5               3.67                              5     4.33
    10001           4               3.67                              25    4.33
    10001           4               3.67                              26    4.33
    10001           5               3.67                              27    4.33
    10001           2               3.67                              28    4.33
    10001           5               3.67                              29    4.33
    10001           3               3.67                              30    4.33

  • #2
    Code:
    bysort personid (mhealth35): replace mhealth35 = mhealth35[1]
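To see the problem and the fix side by side, here is a minimal sketch that re-enters the nine example rows from #1 as a toy dataset. The variable names personid, health, and age follow the original code; the data are only those nine rows, so treat this as an illustration rather than a test on real data.

```stata
* Toy data: the nine rows shown in #1
clear
input long personid byte health byte age
10001 2 3
10001 4 4
10001 5 5
10001 4 25
10001 4 26
10001 5 27
10001 2 28
10001 5 29
10001 3 30
end

* Original approach: each mean is filled in only inside its own age window
bysort personid: egen mhealth35   = mean(health) if inrange(age, 3, 5)
bysort personid: egen mhealth2527 = mean(health) if inrange(age, 25, 27)

* No row has both means, which is why regress finds no observations
count if !missing(mhealth35, mhealth2527)

* Spread each mean to all of the person's rows (missing sorts last)
bysort personid (mhealth35):   replace mhealth35   = mhealth35[1]
bysort personid (mhealth2527): replace mhealth2527 = mhealth2527[1]

* Now every row carries both means
count if !missing(mhealth35, mhealth2527)
sort personid age
list personid age health mhealth35 mhealth2527, sepby(personid)
```

The first count should report 0 and the second 9. A person with no observations in a given age window would still end up with missing values for that window's mean, since there is nothing to copy.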



    • #3
      @Andrew Musau shows nicely how to fix the problem your code created.

      Here is one way to ensure it never arises:

      Code:
      bysort personid: egen mhealth35 = mean(cond(inrange(age, 3, 5), health, .))
      
      And here is another way:
      Code:
      bysort personid: egen mhealth35 = mean(health / inrange(age, 3, 5))
      
      Some people find the latter too tricksy, and I sympathise. I too tend to use the former as more transparent. But by accident the latter resembles notation that won't work as code: mean(variable | condition).

      For a longer discussion, see Sections 9 and 10 in https://www.stata-journal.com/articl...article=dm0055

      dm0055 is thus revealed as an otherwise unpredictable search term to find many related threads on Statalist.
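Why both versions work can be checked step by step. The sketch below is only an illustration on toy data (the six rows are taken from the example in #1; the helper variables in35, viacond, and viadiv are names invented here). It relies on inrange() returning 1 or 0, on division by zero yielding missing in Stata, and on egen's mean() ignoring missing values while storing its result in every observation of the group.

```stata
* Toy data: six of the rows shown in #1
clear
input long personid byte health byte age
10001 2 3
10001 4 4
10001 5 5
10001 4 25
10001 4 26
10001 5 27
end

* inrange() returns 1 when age is between 3 and 5 inclusive, 0 otherwise
generate byte in35 = inrange(age, 3, 5)

* cond() version: health where the condition holds, missing otherwise
generate viacond = cond(in35, health, .)

* Division version: health/1 is health, health/0 is missing
generate viadiv = health / in35

* The two helpers agree row by row (missing equals missing in Stata)
assert viacond == viadiv

* mean() skips the missings; there is no if qualifier, so every row is filled
bysort personid: egen mhealth35   = mean(viacond)
bysort personid: egen mhealth2527 = mean(health / inrange(age, 25, 27))

list personid age health viacond viadiv mhealth35 mhealth2527, sepby(personid)
```

The key difference from the code in #1 is that the condition sits inside the argument of mean() rather than in an if qualifier, so the group mean is written to all of a person's rows.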



      • #4
        Hi Andrew and Nick, thank you both! Both of your solutions work well for me.

        I understand the logic of Nick's code from the sections he pointed to in the link above, but could either of you explain Andrew's code to me please?



        • #5
          bysort personid (mhealth35): replace mhealth35 = mhealth35[1]
          The grouping variable is "personid" and values are sorted using the variable "mhealth35" (in parentheses). Missing values are always sorted last, so the code takes the first sorted value of the variable mhealth35 in a group (i.e., mhealth35[1]) and replaces all values in the group with this value. Since the mean value is constant within groups, the first sorted value will be the mean. See

          Code:
          help by
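The sort order can be made visible with a small sketch. The data below are hypothetical (person 10002 and the value 2.5 are invented for illustration; mhealth35 is entered by hand with missings, as it would look after the code in #1), and the variable order is a name introduced here to show each row's position within its group.

```stata
* Hypothetical data: mhealth35 already computed, missing outside ages 3-5
clear
input long personid byte age float mhealth35
10001 25 .
10001  3 3.67
10001 26 .
10002  4 2.5
10002 30 .
end

* Within each person, sort on mhealth35; _n is the position after sorting
bysort personid (mhealth35): generate byte order = _n
list personid age mhealth35 order, sepby(personid)

* Nonmissing values come first, so [1] picks up the mean for each person
bysort personid (mhealth35): replace mhealth35 = mhealth35[1]

* Check: the variable is now constant within person
bysort personid (mhealth35): assert mhealth35 == mhealth35[1]
list personid age mhealth35, sepby(personid)
```

If a person had no nonmissing value at all, mhealth35[1] would itself be missing, and the replace would leave that person's rows missing.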
          Last edited by Andrew Musau; 07 Jul 2020, 09:37.



          • #6
            Awesome! Thanks, Andrew!



            • #7
              The logic of #2 is also explained in the paper cited in #3, just in different sections of the paper.



              • #8
                Got it. Thanks, Nick!
