Imputing a passive variable

Felix Bittmann

Join Date: Aug 2018

Posts: 702
#1

12 Nov 2021, 04:21

Dear all,
I have the following data: a group identifier with a few levels for the region a person lives in (region), and a continuous variable that gives the number of free days in that region (days). So days is determined by region, but two different regions can have the same number of free days. Both variables have missing values (whenever region is missing, days is missing as well). There are some more variables that need to be imputed. When I impute this dataset, how can I ensure that both variables are used as predictors in the overall imputation process, but that days is imputed depending on region?

To give a concrete example:
Region A has 10 free days.
Case X has a missing value on region. The algorithm decides that region A is the most probable and assigns this level. Days must then automatically be set to 10 for this case.
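
A minimal sketch of this rule, written in Python only for illustration (it is not Stata code and not an imputation routine): once region has been imputed, days is looked up deterministically instead of being imputed itself. The table DAYS_BY_REGION and the function fill_days are hypothetical names, and the values for regions B and C are made-up placeholders.

```python
# Hypothetical lookup table: free days per region (illustrative values only).
DAYS_BY_REGION = {"A": 10, "B": 20, "C": 20}


def fill_days(rows):
    """Return copies of rows with `days` derived from the (possibly imputed) `region`."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input dataset is left untouched
        region = new_row.get("region")
        if region is not None:
            new_row["days"] = DAYS_BY_REGION[region]
        filled.append(new_row)
    return filled


# One imputed dataset: case X had region missing and was assigned "A".
imputed = [
    {"id": "X", "region": "A", "days": None},
    {"id": "Y", "region": "B", "days": 20},
]
result = fill_days(imputed)
print(result[0]["days"])
```

The same lookup would have to be applied to every imputed dataset separately, so that days always follows whatever region was drawn in that dataset.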

I would like to use PMM as the imputation algorithm for all variables. Sorry if passive is the wrong term here.

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Tags: imputation, MI, passive
daniel klein

Join Date: Mar 2014

Posts: 3859
#2

12 Nov 2021, 05:06

That is an interesting situation. I might or might not find the time to get back to this later today. However, the first question that pops into my head is whether keeping the values of the region and free days consistent is the best you can do here. Also, how do you know that the imputed region takes precedence over the imputed free days? Sticking with your example, why do you not say:

Regions B and C have 20 free days.
Case X has a missing value on free days.
The algorithm decides that 20 free days are the most probable and assigns this value. Automatically, the region must be set to either B or C.
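
The asymmetry in this counter-example can be sketched in a few lines of Python (illustration only, not Stata): going from region to days is a unique lookup, while going from days back to region can return several candidates. The table DAYS_BY_REGION and the function regions_by_days are hypothetical names; the values simply combine the two examples given in this thread.

```python
from collections import defaultdict

# Illustrative values taken from the two examples in this thread.
DAYS_BY_REGION = {"A": 10, "B": 20, "C": 20}


def regions_by_days(days_by_region):
    """Invert the region -> days table into days -> list of candidate regions."""
    inverse = defaultdict(list)
    for region, days in days_by_region.items():
        inverse[days].append(region)
    return {days: sorted(regions) for days, regions in inverse.items()}


candidates = regions_by_days(DAYS_BY_REGION)
print(candidates[20])  # ['B', 'C']
print(candidates[10])  # ['A']
```

With 20 imputed free days, the region is only narrowed down to a set of candidates, so a further rule would be needed to pick one of them.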
Felix Bittmann

Join Date: Aug 2018

Posts: 702
#3

12 Nov 2021, 05:15

Oh boy, if daniel klein finds this question interesting, this seems to be a bigger issue! Many thanks for your reply! This is indeed a good point you raise (just FYI, it is NEPS data and region is the federal state). The days indicator is merged from official statistics. So somehow I felt that this ordering is more natural (days depending on the region) but I guess this is just a question of perspective. What do you think about simply dropping days from the imputation, only keeping the region in and then merging the days indicator afterwards? By doing so, this variable would not be part of the imputation process. But yeah I am also a bit lost here.

Best wishes

daniel klein

Join Date: Mar 2014

Posts: 3859
#4

12 Nov 2021, 05:28

Originally posted by Felix Bittmann

What do you think about simply dropping days from the imputation, only keeping the region in and then merging the days indicator afterwards? By doing so, this variable would not be part of the imputation process. But yeah I am also a bit lost here.

Just a quick reply; need to get back to the day job for now.

If you think that you need both the region and the number of days in your analysis model, then they should probably both be part of the imputation model, too. However, since multiple imputation is all about preserving the associations, there is no guarantee that the imputed values of those two would be consistent in each of the imputed datasets. Perhaps that is OK. Perhaps forcing consistency of the imputed values will actually mess up the associations in the combined model. I am not saying that this is so; I am really not sure whether keeping the values consistent is going to fix or create (more) problems.
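
One way to see how large the problem is before deciding anything would be to count the inconsistent pairs in each imputed dataset. The following Python sketch is only an illustration under assumptions: the table DAYS_BY_REGION, the function inconsistent_share, and the two tiny example datasets are all hypothetical, and nothing here is Stata code or output from real data.

```python
# Hypothetical lookup table: free days per region (illustrative values only).
DAYS_BY_REGION = {"A": 10, "B": 20, "C": 20}


def inconsistent_share(rows, days_by_region=DAYS_BY_REGION):
    """Share of rows whose imputed days differ from the value implied by region."""
    if not rows:
        return 0.0
    mismatches = sum(
        1 for row in rows if days_by_region[row["region"]] != row["days"]
    )
    return mismatches / len(rows)


# Two made-up imputed datasets in which region and days were imputed separately.
imputed_datasets = [
    [{"region": "A", "days": 10}, {"region": "B", "days": 10}],
    [{"region": "A", "days": 10}, {"region": "C", "days": 20}],
]
shares = [inconsistent_share(dataset) for dataset in imputed_datasets]
print(shares)  # [0.5, 0.0]
```

A small share would suggest that the inconsistency matters little in practice; a large share would not by itself tell you whether forcing consistency helps or hurts the associations.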

btw. seems like a poor job there on the part of NEPS; I would think that you really should be able to get the region of your interviewees correct.

Last edited by daniel klein; 12 Nov 2021, 05:31.