Imputation for two time horizons

Vera Schmidt

Join Date: Aug 2023

Posts: 14
#1


11 Jun 2025, 14:55

Hi,

a question arose: how can imputations with different time horizons be combined? The big five variables were collected from 2005 onwards; the others were collected from 1993 to 2019.

Thank you in advance!

Best,
Vera

Code:
*declaring the data to be mi data in marginal long style (mlong)
mi set mlong

*registering variables
mi register imputed bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m intact_family_m educ_m learn_gp educ_gp
mi register regular gpa female migrant num_sib north east south west poor_health_m poor_mental_m poor_pcs_p25_m poor_pcs_p10_m poor_mcs_p25_m poor_mcs_p10_m poor_health_dv_m poor_health_dv_p75_m poor_health_dv_p90_m age_birth learn_m only_child_m oldest_m num_sib_m

mi impute chained (pmm, knn(5)) educ_m learn_gp educ_gp ///
(logit) intact_family_m = ///
gpa female migrant num_sib north east west ///
age_birth learn_m only_child_m oldest_m num_sib_m, ///
add(20) rseed(1234)
mi xeq 0 1 20: sum educ_m learn_gp educ_gp intact_family_m

*imputation: big five
mi impute chained (pmm, knn(5)) bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m = gpa female migrant num_sib north east west poor_health_m poor_mental_m poor_pcs_p25_m poor_pcs_p10_m poor_mcs_p25_m poor_mcs_p10_m poor_health_dv_m poor_health_dv_p75_m poor_health_dv_p90_m age_birth learn_m only_child_m oldest_m num_sib_m if syear >= 2005, add(20) rseed(1234)
*descriptive statistics
mi xeq 0 1 20: sum bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m
Tags: imputation, missing value, multiple imputation, pmm, separate imputation

Felix Bittmann

Join Date: Aug 2018
Posts: 681

Yesterday, 00:22

I read from your post that your original data is in long format, meaning that there are multiple rows for each unit/person. In this case, you should first convert the dataset to wide, then impute, and then convert it back. Basically like so:

Code:

*** Setup of example data ***
webuse nlswork, clear
xtset, clear
keep if inrange(year, 70, 73)
keep idcode year wks_ue union tenure

*** Imputation ***
reshape wide wks_ue union tenure, i(idcode) j(year)
mi set flong
mi register imputed wks_ue* union* tenure*
mi impute chained (pmm, knn(5)) wks_ue* union* tenure*, add(3) rseed(123)
mi reshape long wks_ue union tenure, i(idcode) j(year)

mi estimate: ...

Best wishes

Stata 18.0 MP | ORCID | Google Scholar


Vera Schmidt

Join Date: Aug 2023

Posts: 14
#3

Yesterday, 04:00

Originally posted by Felix Bittmann

I read from your post that your original data is in long format, meaning that there are multiple rows for each unit/person. In this case, you should first convert the dataset to wide, then impute, and then convert it back. Basically like so:

Code:

*** Setup of example data ***
webuse nlswork, clear
xtset, clear
keep if inrange(year, 70, 73)
keep idcode year wks_ue union tenure

*** Imputation ***
reshape wide wks_ue union tenure, i(idcode) j(year)
mi set flong
mi register imputed wks_ue* union* tenure*
mi impute chained (pmm, knn(5)) wks_ue* union* tenure*, add(3) rseed(123)
mi reshape long wks_ue union tenure, i(idcode) j(year)

mi estimate: ...

Dear Felix Bittmann,

thanks for your quick reply. Yes, you are absolutely right: my data is in long format. But (I think) the problem is still not solved when I reshape my data. So, let's assume I reshape my data to wide format. The missing values for each variable stay the same, e.g.:

long format:

pid   year   income
1     2000   10
1     2001   .
1     2002   3

Reshaping the data to wide format yields:

pid   income_2000   income_2001   income_2002
1     10            .             3

With this in mind: the variables educ_m learn_gp educ_gp intact_family_m were collected from 1993 to 2019, but the variables bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m are observed only from 2005 onwards. The missing values in wide format are still the same, which means that the big five were not collected before 2005. If I impute all variables for the same period (1993-2019), I would also impute big five values for the years before 2005, and imputing variables that were never collected is not recommended, right? So even if I reshape the data, I need to perform "two separate imputations". How do I do that? How do I combine the two imputed data sets? Do I store them and merge them? Or how do I combine the code below, even if I use the wide format?

Collected from 1993-2019:
mi impute chained (pmm, knn(5)) educ_m learn_gp educ_gp ///
(logit) intact_family_m = ///
gpa female migrant num_sib north east west ///
age_birth learn_m only_child_m oldest_m num_sib_m, ///
add(20) rseed(1234)
mi xeq 0 1 20: sum educ_m learn_gp educ_gp intact_family_m

Collected from 2005-2019:
*imputation: big five
mi impute chained (pmm, knn(5)) bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m = gpa female migrant num_sib north east west poor_health_m poor_mental_m poor_pcs_p25_m poor_pcs_p10_m poor_mcs_p25_m poor_mcs_p10_m poor_health_dv_m poor_health_dv_p75_m poor_health_dv_p90_m age_birth learn_m only_child_m oldest_m num_sib_m if syear >= 2005, add(20) rseed(1234)

Hopefully I am assessing my problem correctly... I am open to feedback at any time. Thanks in advance!

Best,
Vera

Last edited by Vera Schmidt; Yesterday, 04:20.
Felix Bittmann

Join Date: Aug 2018

Posts: 681
#4

Yesterday, 04:30

Let's assume for a second that your data are complete, with no missing values present. What is your final analysis model? How do you plan to integrate these different time points into a single model? Answering this question first will help to decide how to impute the missing values.

Best wishes

Vera Schmidt

Join Date: Aug 2023

Posts: 14
#5

Yesterday, 04:56

Originally posted by Felix Bittmann

Let's assume for a second that your data are complete, with no missing values present. What is your final analysis model? How do you plan to integrate these different time points into a single model? Answering this question first will help to decide how to impute the missing values.

My analysis will start with simple OLS estimation. Since the big five are observed only rarely, I would like to introduce sets of characteristics stepwise, whereby the number of observations decreases:

Starting, e.g., with all characteristics that are collected from 1993 to 2019:
mi estimate: regress gpa female migrant num_sib north east west poor_health_m educ_m age_birth learn_m only_child_m oldest_m num_sib_m intact_family_m learn_gp educ_gp

*For simplicity, I only separate out the big five here; the final analysis will separate more characteristics.

Finally introducing the big five:
mi estimate: regress gpa female migrant num_sib north east west poor_health_m educ_m age_birth learn_m only_child_m oldest_m num_sib_m intact_family_m learn_gp educ_gp bf_open_m bf_consc_m bf_extra_m bf_agree_m bf_emostab_m

*I used mi estimate since we are already talking about imputation.

I appreciate your help a lot!
Felix Bittmann

Join Date: Aug 2018

Posts: 681
#6

Yesterday, 05:08

If we assume there are no missing values, your command would become:

Code:

regress gpa female migrant num_sib north east west poor_health_m educ_m age_birth learn_m only_child_m oldest_m num_sib_m intact_family_m learn_gp educ_gp

However, as the data are in long format, this will yield no results, because some variables are from the period starting in 2005 (e.g. poor_health), while others are from the older period, such as educ_gp (or else all data points before 2005 will be excluded). Furthermore, if some variables are measured multiple times for the same individuals, OLS is not adequate and you might want to consider panel regression models (e.g. xtreg). I have the impression that the main concern here is not the imputation but your substantive analysis model, which is not yet adequate. What is your research interest? Measuring the change of variables over time?

Best wishes
