  • egen max() Creating More Missingness Than in Actuality?

    Hello, I am having trouble with egen max producing more missing than there actually seems to be or would be.

    I have a longitudinal dataset, and I am trying to create a variable that indicates the proportion of childhood spent with at least one of the person's parents having a college degree.

    To begin this process, I first use rowmax to create an indicator (i.e., "pedu") from mother's and father's education that indicates whether at least one parent had a college degree during that year of observation.

    Then, I used egen mean with inrange() to obtain the average value of the indicator variable created in the previous step. This essentially creates a variable (i.e., "pedu2") that equals the proportion of childhood spent with at least one of the person's parents having a college degree.

    Lastly, I use egen max() on this new variable to create another variable (i.e., "pedu3"). This other variable should be the same as pedu2 but fills missing values with the proportion of time spent during childhood with a college-educated parent for each individual. I do this because I plan to focus my analysis on adulthood.

    However, when I do this, I end up with more missing data in my final variable (pedu3) than I had in my original variable (pedu).

    I am confused as to why this is. I even tried the same code with a toy data set that I created, but in that case egen max worked as intended.

    What am I doing wrong?

    I have attached my data and my code.


    Code:
    ssc install fre
    ssc install mdesc
    
    *Creating a variable to display missingness patterns across mother's and father's education
    gen pedu_info = .
    replace pedu_info = 1 if (!missing(fedu) & !missing(medu))
    replace pedu_info = 2 if (!missing(fedu) & missing(medu))
    replace pedu_info = 3 if (missing(fedu) & !missing(medu))
    replace pedu_info = 4 if (missing(fedu) & missing(medu))
    
    label define PEDU_INFO 1 "Neither ." 2 "Mother ." 3 "Father ." 4 "Both ."
    label val pedu_info PEDU_INFO
    
    *Highest Level of parental education within each person-year observation
    egen pedu = rowmax(fedu medu)
    
    *Obtaining the proportion of years in childhood spent with a college-educated parent
    bysort personid: egen pedu2 = mean(pedu) if inrange(age,1,18)
    
    *Filling in all rows where age>18 with the proportion of years in childhood spent with a college-educated parent
    bysort personid: egen pedu3 = max(pedu2)
    
    *Comparing missingness on original pedu variable with missingness on pedu3 variable
    *The frequency for the fourth category of pedu_info should equal the number of missing cases in pedu3
    fre pedu_info
    mdesc pedu3

    In addition, here is the code I used for the toy data set

    Code:
    clear 
    input id time bvar
    111 1 1
    111 2 1
    111 3 0
    111 4 1
    111 5 1
    222 1 0
    222 2 0
    222 3 1
    222 4 1
    222 5 1
    333 1 0
    333 2 0
    333 3 1
    333 4 0
    333 5 1
    end
    
    
    bysort id: egen bvar2 = mean(bvar) if inrange(time,1,4)
    
    bysort id: egen bvar3 = max(bvar2)

  • #2
    I'm not entirely sure I understand what you are trying to do here. All I can tell you is that when I run your code with your example data,
    Code:
    . clear
    
    . input id time bvar
    
                id       time       bvar
      1. 111 1 1
      2. 111 2 1
      3. 111 3 0
      4. 111 4 1
      5. 111 5 1
      6. 222 1 0
      7. 222 2 0
      8. 222 3 1
      9. 222 4 1
     10. 222 5 1
     11. 333 1 0
     12. 333 2 0
     13. 333 3 1
     14. 333 4 0
     15. 333 5 1
     16. end
    
    .
    .
    . bysort id: egen bvar2 = mean(bvar) if inrange(time,1,4)
    (3 missing values generated)
    
    .
    . bysort id: egen bvar3 = max(bvar2)
    
    .
    . list, noobs clean
    
         id   time   bvar   bvar2   bvar3  
        111      1      1     .75     .75  
        111      2      1     .75     .75  
        111      3      0     .75     .75  
        111      4      1     .75     .75  
        111      5      1       .     .75  
        222      1      0      .5      .5  
        222      2      0      .5      .5  
        222      3      1      .5      .5  
        222      4      1      .5      .5  
        222      5      1       .      .5  
        333      1      0     .25     .25  
        333      2      0     .25     .25  
        333      3      1     .25     .25  
        333      4      0     .25     .25  
        333      5      1       .     .25
    As you can see, bvar3 has no missing values. So I can't replicate your problem with your example. Also, inspecting the code, there really is no way that -egen, max()- can leave any missing values behind when there is no -if- qualifier in that command, unless all values of bvar2 are missing within an id. And that will happen only if the id has no observation with time between 1 and 4 in which bvar is non-missing.

    I think you need to post back with a data example that does demonstrate the specific problem you are having.

    All of that said, there is no need to go through bvar2 to get to bvar3. You can, with a single command, calculate the mean value of bvar conditional on time being between 1 and 4 inclusive:
    Code:
    by id (time), sort: egen wanted = mean(cond(inrange(time, 1, 4), bvar, .))
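
    As a quick sanity check on the toy data (a sketch only; it assumes both bvar3 and wanted have already been created as shown above), the one-step result can be compared with the two-step one. -assert- stops with an error if any observation differs:
    Code:
    * assumes bvar3 and wanted already exist in memory
    assert wanted == bvar3
    list id time bvar bvar3 wanted, noobs clean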



    • #3
      Clyde Schechter's method in the last command is also written up in Section 9 of https://journals.sagepub.com/doi/pdf...867X1101100210



      • #4
        Hi Clyde Schechter, yes, the last command is exactly what I wanted!

        If you download the dataset that I posted above and run the code below, you should see my issue. I am very confused as to why this is happening; I don't know why I am getting an even larger number of missing values.

        Code:
        use "stata-list-q.dta"
        
        *78252 missing in pedu
        fre pedu
        
        by id (age), sort: egen wanted = mean(cond(inrange(age, 1, 18), pedu, .))
        
        *104153 missing in wanted
        fre wanted
        Here's the output
        [Image: statalist-q.png]



        • #5
          Remember that wanted will be missing in every observation of any person whose age is never between 1 and 18 in the data set, including observations where pedu itself is non-missing. That, I believe, accounts for the extra missing values of -wanted-.
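
          A minimal sketch of this situation, using made-up toy data in which the hypothetical id 444 is observed only outside the window (times 5 and 6), so its window mean is undefined even though bvar is never missing:
          Code:
          clear
          input id time bvar
          111 1 1
          111 2 0
          111 5 1
          444 5 1
          444 6 0
          end

          * mean of bvar inside the window; missing outside it
          bysort id: egen bvar2 = mean(bvar) if inrange(time, 1, 4)

          * spread the window mean to all rows of each id
          bysort id: egen bvar3 = max(bvar2)

          * id 444 has no rows in the window, so bvar3 is missing in both of its rows
          list, noobs clean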

          "If you download the dataset that I posted above and run the code below, you should see my issue."
          I appreciate your trying to provide the data that way. But I never download files from strangers. A -dataex- example that illustrates the problem would be helpful if you find that my explanation does not resolve your problem.



          • #6
            Hi Clyde,

            Wouldn't the people whose age is never observed between 1 and 18 already be in the missing cases row? With the code you provided, we would expect the number of non-missing cases to increase, right? If someone has missingness for half their childhood, we would fill up those rows with the average of the values in the other half of childhood.

            Unfortunately, the dataset is much too big to fit into a dataex example. Are there more alternative and secure approaches?



            • #7
              "Wouldn't the people whose age is never observed between 1 and 18 already be in the missing cases row?"
              No, because you are defining "missing cases" by pedu having a missing value, whereas I'm focusing on whether age is between 1 and 18.
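
              One way to see both effects in the real data is to count them separately. This is only a sketch: it assumes the variables id, age, pedu, and wanted from #4 are in memory, and has_child_pedu is a hypothetical helper name. If the explanation above is right, the final -assert- should pass.
              Code:
              * 1 if the person has at least one non-missing pedu at ages 1-18
              by id, sort: egen byte has_child_pedu = max(inrange(age, 1, 18) & !missing(pedu))

              * rows where pedu was missing but wanted is filled in
              count if missing(pedu) & !missing(wanted)

              * rows where pedu was non-missing but wanted is missing
              count if !missing(pedu) & missing(wanted)

              * wanted is missing exactly for persons with no usable childhood rows
              assert missing(wanted) == (has_child_pedu == 0)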

              "Unfortunately, the dataset is much too big to fit into a dataex example. Are there more alternative and secure approaches?"
              It is unlikely that the entire data set is required to illustrate the problem you are having. I'm asking you to identify a subset of the data that illustrates the same phenomenon but is small enough to show with -dataex-.
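
              For instance, something along these lines might produce a small enough excerpt. It is a sketch under assumptions: the variable names id, age, pedu, and wanted are taken from #4, has_child_obs is a hypothetical helper, and the limit of 20 observations is arbitrary.
              Code:
              * 1 if the person has any observation at ages 1-18
              by id, sort: egen byte has_child_obs = max(inrange(age, 1, 18))

              * excerpt: persons never observed in childhood whose pedu is non-missing
              dataex id age pedu wanted if has_child_obs == 0 & !missing(pedu), count(20)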
