PMM Imputation

Vera Schmidt

Join Date: Aug 2023

Posts: 21
#1


25 Jun 2025, 09:26

Hi,

Currently I am running an imputation via this code:

*declaring the data to be mi data in marginal long style (mlong)
mi set mlong

*registering variables
mi register imputed learn_s1 educ_s1 learn_s2 educ_s2 learn_s3 educ_s3
mi register regular gpa female migrant num_sib north east west age_birth learn_m ///
only_child_m oldest_m num_sib_m age_11 age_12 age_13 age_14

mi impute chained (pmm, knn(5)) learn_s1 educ_s1 learn_s2 educ_s2 learn_s3 educ_s3 = ///
gpa female migrant num_sib north east west ///
age_birth learn_m only_child_m oldest_m num_sib_m age_11 age_12 age_13 age_14 , add(20) rseed(1234)
*descriptive statistics
mi xeq 0 1 20: sum learn_s1 educ_s1 learn_s2 educ_s2 learn_s3 educ_s3
mi xeq 20: save "$MY_OUT\data_gpa_ss", replace

The imputation worked well for log(earnings) for siblings 1 to 3 (s1-s3). But issues arise with the imputation of education (in years) for each sibling. The bounds hold, so education still ranges from 7 to 18, but in my example you can see that the values do not necessarily increase monotonically:

ID	year	educ
100	2000	9
100	2001	10
100	2002	10,5
100	2003	9
100	2004	9

How can I fix it? Any ideas? I am looking forward to hearing from you!

Best
Vera
Tags: education, imputation, missing, missing values, pmm
Tiago Pereira

Join Date: Jan 2016

Posts: 409
#2

25 Jun 2025, 10:17

Hi, Vera.

I would be more concerned if the averages did not increase monotonically. You can try imputing the yearly increments, perhaps using truncreg to ensure all imputed increments are positive, and then reconstruct the variable educ. It would be important to check whether the predictors are sufficiently good. You might well be imputing noise.

Last edited by Tiago Pereira; 25 Jun 2025, 10:20.
Felix Bittmann

Join Date: Aug 2018

Posts: 752
#3

25 Jun 2025, 12:44

In general, your imputation approach is fine. Imputing in wide format is also a good idea for this kind of data. However, Stata does not know that your values must increase monotonically. While I agree with Tiago Pereira that checking the overall relevance of the predictors is important, I am not exactly sure how truncreg could solve this issue, as the results you show are already positive. You can try to sanitize your results as follows, as long as you impute in flong format:

Code:

bysort ID _mi_m (year): replace educ = educ[_n-1] if educ < educ[_n-1] & !missing(educ[_n-1])
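
As a rough illustration of what the line above does, here is a minimal sketch on the five rows from post #1. The toy data are typed in by hand (with 10.5 written with a decimal point, which `input` requires), and `_mi_m` is created manually as a stand-in for the imputation index that flong data would already contain; neither is part of the original poster's dataset.

```stata
* Toy data only: one person, one (hypothetical) imputation m = 1
clear
input ID _mi_m year educ
100 1 2000 9
100 1 2001 10
100 1 2002 10.5
100 1 2003 9
100 1 2004 9
end

* replace works through the rows in order, so a corrected value is
* already in place when the next row is compared with its predecessor.
* The !missing() guard is needed because educ < . is true for any
* nonmissing educ, including the first row of each group.
bysort ID _mi_m (year): replace educ = educ[_n-1] if educ < educ[_n-1] & !missing(educ[_n-1])

list, sepby(ID _mi_m)
```

In this toy case the two trailing 9s are raised to 10.5, so the series becomes non-decreasing. Note that this changes imputed values after the fact rather than imputing under the constraint.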

Last edited by Felix Bittmann; 25 Jun 2025, 13:00.

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Rich Goldstein

Join Date: Mar 2014

Posts: 4493
#4

25 Jun 2025, 13:05

the problem is not clear to me and that starts with the data - what does the value "10,5" mean in the 3rd row - e.g., did this person drop out of school during their 11th year? without understanding your data it is not possible to give an adequate answer

Vera Schmidt

Join Date: Aug 2023
Posts: 21
#5

26 Jun 2025, 05:46

Originally posted by Rich Goldstein

the problem is not clear to me and that starts with the data - what does the value "10,5" mean in the 3rd row - e.g., did this person drop out of school during their 11th year? without understanding your data it is not possible to give an adequate answer

Sorry for not mentioning!

To break it down, please keep this example in mind:

ID	year	education	imputation dummy
1		9	0
1		9	0
1		9	0
1		9	0
1		9	0
2		9	0
2		9	0
2		9	0
2		10	0
2		10	0
3		13	0
3		13	0
3		13	0
3		13	0
3		13	0
4		10	1
4		10	1
4		9	1
4		9	1
4		9	1

=> 9 years of education means a lower track degree
=> 10 years of education means a middle track degree
=> 13 years of education means an upper track degree (A-levels)

In general, years of education can only increase, never decrease, and the imputation method does not take this into account. Please keep in mind that I use PMM with knn(5), so the example is not perfect in this context.
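
To see how often the constraint is violated, a check along the following lines might help. This is only a sketch on two of the persons from the example table (IDs 2 and 4); the time index `t` and the names `decrease` and `violator` are made up for illustration, since the year column is empty in the table.

```stata
* Toy data only: IDs 2 and 4 from the example, with a hypothetical index t
clear
input id t educ imputed
2 1 9 0
2 2 9 0
2 3 9 0
2 4 10 0
2 5 10 0
4 1 10 1
4 2 10 1
4 3 9 1
4 4 9 1
4 5 9 1
end

* decrease = 1 where education is lower than in the previous period
* (left missing for each person's first row)
bysort id (t): generate byte decrease = educ < educ[_n-1] if _n > 1

* violator = 1 on every row of a person with at least one decrease
bysort id: egen byte violator = max(decrease)

list, sepby(id)
```

With imputed data in flong style, the same check would presumably be run within each imputation by adding `_mi_m` to the `bysort` variables.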

Thanks in advance!


Tiago Pereira

Join Date: Jan 2016

Posts: 409
#6

28 Jun 2025, 07:10

Imputing the increment in education per year, as suggested above (using truncated regression, setting the minimum to 0 and the maximum to the observed maximum in the dataset), and then adding the imputed increments to the latest observed education level seems reasonable. So does using the last observation carried forward. The approach suggested by Felix Bittmann also seems OK.
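
The reconstruction step can be sketched as follows. This is a toy illustration, not a full imputation: the values of `d_educ` are typed in by hand as stand-ins for imputed increments, and the variable name `d_educ` is an assumption. In a real run the increments might come from something like `mi impute truncreg d_educ = <predictors>, ll(0) ul(#)`, with the bounds chosen as described above.

```stata
* Toy data only: first year observed, later years to be rebuilt
clear
input id year educ d_educ
100 2000 9 .
100 2001 . 1
100 2002 . 0.5
100 2003 . 0
100 2004 . 0
end

* Add each (hypothetical) non-negative increment to the previous level.
* replace proceeds row by row, so the rebuilt value from the previous
* year is available when the next year is filled in.
bysort id (year): replace educ = educ[_n-1] + d_educ if missing(educ)

list, sepby(id)
```

Because every increment is bounded below by 0, the rebuilt series cannot decrease, which is the property the plain PMM imputation of the level does not guarantee.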

Last edited by Tiago Pereira; 28 Jun 2025, 07:12.