Trouble with Creating Group-Level variable

Al Adams

Join Date: Apr 2020
Posts: 54


06 Jul 2020, 20:40

Hello, I have a variable that has repeated observations over the life course. Research shows that the measurement of this variable (e.g., health, household income) during early childhood is a significant predictor of that variable during adulthood. Accordingly, I'd like to see how average childhood health between the ages of 3 and 5 correlates with average adulthood health at various points in adulthood (e.g., 25-27, 28-30, 31-33, etc.). Therefore, I used the following to generate a measure of average childhood health between the ages of 3 and 5.

Code:

bysort personid: egen mhealth35 = mean(health) if inrange(age,3,5)

I used the same code to generate a variable for mean health status at each later age interval (e.g., 25-27, 28-30, 31-33, etc.). The result is that I get multiple variables indicating the mean health status for ages 25-27 (mhealth2527), ages 28-30 (mhealth2830), ages 31-33 (mhealth3133), and so on. However, when I regress any of these adult mean health status variables on the mean health status for early childhood (mhealth35), I get no observations.

This seems to be because the code I used only fills in the average for observations where the respondent's age satisfies inrange(age,3,5). For example, if a person's average health status between the ages of 3 and 5 is 3.67, then the mean health status variable (mhealth35) only takes on the value of 3.67 when the respondent is 3, 4, or 5 years old. When the respondent is not any of those ages, the value is missing. I'd like the value not to be missing. How would I do that? I hope this is clear. Please let me know if I should clarify.

Below is an example of what my data looks like after using the code displayed above to generate the average health status variables for two separate age intervals (3-5 and 25-27). Rather than display the full set of ages available, I've truncated the observations for convenience.

What my data looks like:

Individual Id	Health Status	Average Health Status between ages 3 and 5	Age	Average Health Status between ages 25 and 27
10001	2	3.67	3	missing
10001	4	3.67	4	missing
10001	5	3.67	5	missing
10001	4	missing	25	4.33
10001	4	missing	26	4.33
10001	5	missing	27	4.33
10001	2	missing	28	missing
10001	5	missing	29	missing
10001	3	missing	30	missing

Below is an example of what I want my data to look like:

Individual Id	Health Status	Average Health Status between ages 3 and 5	Age	Average Health Status between ages 25 and 27
10001	2	3.67	3	4.33
10001	4	3.67	4	4.33
10001	5	3.67	5	4.33
10001	4	3.67	25	4.33
10001	4	3.67	26	4.33
10001	5	3.67	27	4.33
10001	2	3.67	28	4.33
10001	5	3.67	29	4.33
10001	3	3.67	30	4.33

Tags: group variable, multilevel modeling, panel data, variable creation

Andrew Musau

Join Date: Oct 2014

Posts: 10190
#2

07 Jul 2020, 00:31

Code:

bysort personid (mhealth35): replace mhealth35 =mhealth35[1]
Nick Cox

Join Date: Mar 2014

Posts: 35694
#3

07 Jul 2020, 02:04

@Andrew Musau shows nicely how to fix the problem your code created.

Here is one way to ensure it never arises:

Code:

bysort personid: egen mhealth35 = mean(cond(inrange(age, 3, 5), health, .) )


And here is another way:

Code:

bysort personid: egen mhealth35 = mean( health / inrange(age, 3, 5))


Some people find the latter too tricksy, and I sympathise. I too tend to use the former as more transparent. But by accident the latter resembles notation that won't work as code: mean(variable | condition)

For a longer discussion see Sections 9 and 10 in https://www.stata-journal.com/articl...article=dm0055

dm0055 is thus revealed as an otherwise unpredictable search term to find many related threads on Statalist.
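For anyone who wants to try both forms, here is a minimal sketch using the toy data from #1, typed in by hand. The variable name mhealth35_div is just an illustrative name for the second version, not something from the original posts. The point is that there is no -if- qualifier on the egen call, so the mean is written to every row of the person, while the expression inside mean() is missing outside the age window and is therefore ignored.

```stata
clear
input long personid byte age double health
10001  3 2
10001  4 4
10001  5 5
10001 25 4
10001 26 4
10001 27 5
10001 28 2
10001 29 5
10001 30 3
end

* cond() returns health inside the age window and missing outside it;
* mean() ignores missings, and with no -if- the result fills every row.
bysort personid: egen mhealth35   = mean(cond(inrange(age, 3, 5), health, .))
bysort personid: egen mhealth2527 = mean(cond(inrange(age, 25, 27), health, .))

* Division form: inrange() is 1 or 0, so health/1 is health
* and health/0 evaluates to missing.
bysort personid: egen mhealth35_div = mean(health / inrange(age, 3, 5))

* Both forms should agree, and neither should leave gaps for this person.
assert mhealth35 == mhealth35_div
assert !missing(mhealth35, mhealth2527)

list, sepby(personid)
```

With these data every row should show about 3.67 for mhealth35 and about 4.33 for mhealth2527, as in the desired layout in #1. One thing to keep in mind before regressing: the person-level means are now repeated on every row, so you may want to use just one row per person, for example by creating a flag with egen's tag() function and adding an -if- on that flag to the regress command.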
Al Adams

Join Date: Apr 2020

Posts: 54
#4

07 Jul 2020, 09:28

Hi Andrew and Nick, thank you both! Both of your solutions work well for me.

I understand the logic of Nick's code from the sections he pointed to in the link above, but could either of you explain Andrew's code to me please?
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#5

07 Jul 2020, 09:35

Code:

bysort personid (mhealth35): replace mhealth35 = mhealth35[1]

The grouping variable is "personid" and values are sorted using the variable "mhealth35" (in parentheses). Missing values are always sorted last, so the code takes the first sorted value of the variable mhealth35 in a group (i.e., mhealth35[1]) and replaces all values in the group with this value. Since the mean value is constant within groups, the first sorted value will be the mean. See

Code:

help by
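To see the sorting logic in action, here is a minimal sketch that first reproduces the problem from #1 on the toy data (typed in by hand) and then applies the fix. It assumes the variable names used in this thread and nothing else.

```stata
clear
input long personid byte age double health
10001  3 2
10001  4 4
10001  5 5
10001 25 4
10001 26 4
10001 27 5
10001 28 2
10001 29 5
10001 30 3
end

* Reproduce the problem: the -if- qualifier leaves the new variable
* missing on every row outside ages 3 to 5.
bysort personid: egen mhealth35 = mean(health) if inrange(age, 3, 5)
count if missing(mhealth35)
assert r(N) == 6

* Within each person, sort on mhealth35 so that missings go last,
* then copy the first (nonmissing) value to every row of the group.
bysort personid (mhealth35): replace mhealth35 = mhealth35[1]
assert !missing(mhealth35)

* Restore a readable order; the previous step reordered rows within person.
sort personid age
list personid age health mhealth35, sepby(personid)
```

Note that if a person has no observations at ages 3 to 5, every value in the group is missing, so mhealth35[1] is missing too and the person stays missing on this variable, which is probably what you want.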

Last edited by Andrew Musau; 07 Jul 2020, 09:37.
Al Adams

Join Date: Apr 2020

Posts: 54
#6

07 Jul 2020, 09:40

Awesome! Thanks, Andrew!
Nick Cox

Join Date: Mar 2014

Posts: 35694
#7

07 Jul 2020, 18:11

The logic of #2 is also explained in the paper cited in #3, just in different sections of the paper.
Al Adams

Join Date: Apr 2020

Posts: 54
#8

09 Jul 2020, 14:00

Got it. Thanks, Nick!