  • PMM Imputation

    Hi,

    I am currently running an imputation with this code:

    *declaring the data to be mi data in marginal long style (mlong)
    mi set mlong

    *registering variables
    mi register imputed learn_s1 educ_s1 learn_s2 educ_s2 learn_s3 educ_s3
    mi register regular gpa female migrant num_sib north east west age_birth learn_m ///
    only_child_m oldest_m num_sib_m age_11 age_12 age_13 age_14

    mi impute chained (pmm, knn(5)) learn_s1 educ_s1 learn_s2 educ_s2 learn_s3 educ_s3 = ///
    gpa female migrant num_sib north east west ///
    age_birth learn_m only_child_m oldest_m num_sib_m age_11 age_12 age_13 age_14 , add(20) rseed(1234)
    *descriptive statistics
    mi xeq 0 1 20: sum learn_s1 educ_s1 learn_s2 educ_s2 learn_s3 educ_s3
    mi xeq 20: save "$MY_OUT\data_gpa_ss", replace


    The imputation worked well for log(earnings) for siblings 1 to 3 (s1-s3), but issues arise with the imputation of education (in years) for each sibling. The bounds hold, so education ranges from 7 to 18, but as the example below shows, the values do not necessarily increase monotonically:
    ID year educ
    100 2000 9
    100 2001 10
    100 2002 10,5
    100 2003 9
    100 2004 9

    How can I fix it? Any ideas? I am looking forward to hearing from you!

    Best
    Vera

  • #2
    Hi, Vera.

    I would be more concerned if the averages do not increase monotonically. You can try imputing the yearly increments, perhaps using truncreg to ensure all imputations are positive, and then reconstruct the variable educ. It is also important to check whether the predictors are good enough; otherwise you might well be imputing noise.
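
    As a rough illustration of the first step, the yearly increments could be computed from a long-format file as sketched here. The variable name d_educ and the toy values are hypothetical and not taken from Vera's data; the imputation of d_educ itself is not shown.

    ```stata
    * Hypothetical long-format toy data; "." marks a year still to be imputed
    clear
    input ID year educ
    100 2000 9
    100 2001 10
    100 2002 10
    100 2003 .
    100 2004 .
    end

    * Yearly increment: change relative to the previous year within person.
    * It is missing in the first year and wherever either year is missing.
    bysort ID (year): gen d_educ = educ - educ[_n-1] if _n > 1

    list, sepby(ID)
    ```

    The increments that come out missing are the ones that would have to be imputed before educ can be rebuilt.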
    Last edited by Tiago Pereira; 25 Jun 2025, 10:20.

    • #3
      In general, your imputation approach is fine. Imputing in wide format is also a good idea for this kind of data. However, Stata does not know that your values must increase monotonically. While I agree with Tiago Pereira that checking the overall relevance of the predictors is important, I am not sure how truncreg would solve this issue, as the results you show are already positive. You can try to sanitize your results as follows, as long as you impute in flong format:

      Code:
      bysort ID _mi_m (year): replace educ = educ[_n-1] if educ < educ[_n-1] & !missing(educ[_n-1])
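
      To see what this line does, it can be tried on the five rows from post #1. This is only a sketch: the data are typed in by hand, _mi_m is created as an ordinary variable holding a single hypothetical imputation number, and the decimal comma in "10,5" is written as 10.5.

      ```stata
      * Toy data mirroring the example in post #1 (one person, one imputation)
      clear
      input ID _mi_m year educ
      100 1 2000 9
      100 1 2001 10
      100 1 2002 10.5
      100 1 2003 9
      100 1 2004 9
      end

      * replace works through the observations in order, so each row is compared
      * with the already corrected previous row and the running maximum carries forward
      bysort ID _mi_m (year): replace educ = educ[_n-1] if educ < educ[_n-1] & !missing(educ[_n-1])

      list, sepby(ID)
      ```

      After the replace, the 2003 and 2004 values are lifted to the 2002 value, so the series no longer decreases.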
      Last edited by Felix Bittmann; 25 Jun 2025, 13:00.
      Best wishes


      • #4
        the problem is not clear to me and that starts with the data - what does the value "10,5" mean in the 3rd row - e.g., did this person drop out of school during their 11th year? without understanding your data it is not possible to give an adequate answer

        • #5
          Originally posted by Rich Goldstein
          the problem is not clear to me and that starts with the data - what does the value "10,5" mean in the 3rd row - e.g., did this person drop out of school during their 11th year? without understanding your data it is not possible to give an adequate answer
          Sorry for not mentioning that!

          To break it down, please keep this example in mind:

          ID year education imputation dummy
          1 9 0
          1 9 0
          1 9 0
          1 9 0
          1 9 0
          2 9 0
          2 9 0
          2 9 0
          2 10 0
          2 10 0
          3 13 0
          3 13 0
          3 13 0
          3 13 0
          3 13 0
          4 10 1
          4 10 1
          4 9 1
          4 9 1
          4 9 1
          => 9 years of education means a lower track degree
          => 10 years of education means a middle track degree
          => 13 years of education means an upper track degree (A-levels)

          In general, years of education can only increase, never decrease, but the imputation method does not take this into account. Please keep in mind that I use PMM with knn(5), so the example is not perfect in this context.
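
          One way to find the affected cases is to flag every year in which education drops relative to the previous year. The sketch below uses the values for ID 4 from the example; the year values and the variable names drop_flag and any_drop are hypothetical, since the example does not show them.

          ```stata
          * Toy data for ID 4 from the example above; years are made up
          clear
          input ID year educ imputed
          4 2000 10 1
          4 2001 10 1
          4 2002 9 1
          4 2003 9 1
          4 2004 9 1
          end

          * Flag years in which education is lower than in the previous year
          bysort ID (year): gen byte drop_flag = educ < educ[_n-1] & !missing(educ[_n-1]) & _n > 1

          * Mark every person who has at least one such drop
          egen byte any_drop = max(drop_flag), by(ID)

          list, sepby(ID)
          ```

          Counting the observations with drop_flag equal to 1 gives a sense of how often the imputed values violate the constraint.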

          Thanks in advance!

          • #6
            Imputing the yearly increment in education, as suggested above (using truncated regression with the minimum set to 0 and the maximum set to the largest value observed in the dataset), and then adding the imputed increments to the latest observed education level seems a reasonable approach, as does last observation carried forward. The approach suggested by Felix Bittmann also seems OK.
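
            The reconstruction step might look like this in long format. It is only a sketch under several assumptions: the variable names educ_obs and d_educ and all values are hypothetical, the first year of each person is observed, and d_educ already holds non-negative imputed increments (the imputation itself is not shown).

            ```stata
            * Hypothetical toy data: observed level in the first year, imputed increments afterwards
            clear
            input ID year educ_obs d_educ
            100 2000 9 .
            100 2001 . 1
            100 2002 . 0.5
            100 2003 . 0
            100 2004 . 0
            end

            * Level = first observed value + running sum of increments within person.
            * sum() treats the missing increment in the first year as zero.
            bysort ID (year): gen educ = educ_obs[1] + sum(d_educ)

            list, sepby(ID)
            ```

            Because every increment is non-negative, the rebuilt educ cannot decrease within a person.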
            Last edited by Tiago Pereira; 28 Jun 2025, 07:12.
