Hello, I am having trouble with egen max() producing more missing values than I expect.
I have a longitudinal dataset, and I am trying to create a variable that indicates the proportion of childhood spent with at least one of the person's parents having a college degree.
To begin this process, I first use rowmax to create an indicator (i.e., "pedu") from mother's and father's education that indicates whether at least one parent had a college degree during that year of observation.
Then, I used egen mean with inrange() to obtain the average value of the indicator variable created in the previous step. This essentially creates a variable (i.e., "pedu2") that equals the proportion of childhood spent with at least one of the person's parents having a college degree.
Lastly, I use egen max() on this new variable to create another variable (i.e., "pedu3"). This other variable should be the same as pedu2 but fills missing values with the proportion of time spent during childhood with a college-educated parent for each individual. I do this because I plan to focus my analysis on adulthood.
However, when I do this, I end up with more missing values in my final variable (pedu3) than I had in my original variable (pedu).
I am confused as to why this is. I even tried the same code with a toy data set that I created, but in that case egen max() worked as intended.
What am I doing wrong?
I have attached my data and my code.
Code:
ssc install fre

*Creating a variable to display missingness patterns across mother's and father's education
gen pedu_info = .
replace pedu_info = 1 if (!missing(fedu) & !missing(medu))
replace pedu_info = 2 if (!missing(fedu) & missing(medu))
replace pedu_info = 3 if (missing(fedu) & !missing(medu))
replace pedu_info = 4 if (missing(fedu) & missing(medu))
label define PEDU_INFO 1 "Neither ." 2 "Mother ." 3 "Father ." 4 "Both ."
label val pedu_info PEDU_INFO

*Highest Level of parental education within each person-year observation
egen pedu = rowmax(fedu medu)

*Obtaining the proportion of years in childhood spent with a college-educated parent
bysort personid: egen pedu2 = mean(pedu) if inrange(age,1,18)

*Filling in all rows where age>18 with the proportion of years in childhood spent with a college-educated parent
bysort personid: egen pedu3 = max(pedu2)

*Comparing missingness on original pedu variable with missingness on pedu3 variable
*The frequency for the fourth category of pedu_info should equal to the number of missing cases in pedu3
fre pedu_info
mdesc pedu3
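One check that might help narrow this down is sketched below. It assumes (this is not confirmed by the data) that the extra missing values in pedu3 belong to people who have no row with age between 1 and 18 and a non-missing pedu. The variable n_child is a hypothetical helper introduced only for this check.

```stata
* n_child (hypothetical helper): per person, the number of childhood rows
* (ages 1-18) in which pedu is not missing
bysort personid: egen n_child = total(inrange(age,1,18) & !missing(pedu))

* Rows where pedu is observed but the person has no usable childhood row
count if n_child == 0 & !missing(pedu)

* Rows where pedu3 is missing although pedu itself is observed
count if missing(pedu3) & !missing(pedu)
```

If both counts are non-zero, pedu3 is missing for those people because mean(pedu) has nothing to average within the age window, so there is no value for max() to carry forward.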
In addition, here is the code I used for the toy data set
Code:
clear
input id time bvar
111 1 1
111 2 1
111 3 0
111 4 1
111 5 1
222 1 0
222 2 0
222 3 1
222 4 1
222 5 1
333 1 0
333 2 0
333 3 1
333 4 0
333 5 1
end

bysort id: egen bvar2 = mean(bvar) if inrange(time,1,4)
bysort id: egen bvar3 = max(bvar2)
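The toy data above has every id observed, with non-missing bvar, inside the time window, which may be why it behaves as intended. A minimal variation, assuming the real data also contain people without usable observations in the window, adds two made-up ids (444 and 555 are hypothetical): one observed only outside the window, and one whose bvar is missing inside it.

```stata
clear
input id time bvar
111 1 1
111 2 1
111 3 0
111 4 1
111 5 1
444 5 1
444 6 0
555 1 .
555 2 .
555 5 1
end

* Same two steps as in the toy example
bysort id: egen bvar2 = mean(bvar) if inrange(time,1,4)
bysort id: egen bvar3 = max(bvar2)

* Rows where bvar is observed but bvar3 is missing
list id time bvar bvar2 bvar3 if missing(bvar3) & !missing(bvar)
```

In this variation bvar3 is missing on every row for ids 444 and 555, including rows where bvar itself is observed, so bvar3 has more missing rows than bvar.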