Providing descriptive data excluding a handful of respondents based on multiple imputation variables.

Lars Egner

Join Date: Jun 2020

Posts: 14
#1


15 Jun 2023, 03:41

Hi everyone.
Recently I was handed a survey dataset in which one variable is missing for about 30% of the respondents, because the question was added to the survey later. In short, they realised that a survey about working with traffic safety should also ask whether the person works directly or indirectly with traffic safety. Based on the other variables in the dataset I can predict the missing data, and I have done so using multiple imputation. I'm not sure the code is relevant to the question, but better safe than sorry, so I include it here.

Code:

mi set wide
mi register imputed works_directly_safety works_indirectly_safety ///
    relevant_question_1 relevant_question_2 relevant_question_3 ///
    relevant_question_4 relevant_question_5
mi impute monotone (regress) works_directly_safety works_indirectly_safety ///
    relevant_question_1 relevant_question_2 relevant_question_3 ///
    relevant_question_4 relevant_question_5, add(50) rseed(1923)

My problem is that in our report we give the proportion of respondents answering "Agree"/"Disagree"/"Neither/nor" etc. on some relevant items. The project manager wants these descriptive statistics to exclude respondents who said they work neither directly nor indirectly with traffic safety. My initial idea was to use something akin to

Code:

mi estimate: proportion relevant_question_1 if works_directly_safety > 3 | works_indirectly_safety > 3

and/or

Code:

mi estimate: mean relevant_question_1 if works_directly_safety > 3 | works_indirectly_safety > 3

Naturally, because the "if" condition excludes a different number of respondents in each imputation, this throws the error "estimation sample varies between m=1 and m=2; click here for details".

I thought this was going to be fairly simple, but I find myself stuck. Does anyone know a way to display means and proportions while excluding respondents based on imputed variables?

Last edited by Lars Egner; 15 Jun 2023, 04:11.
daniel klein

Join Date: Mar 2014

Posts: 3886
#2

15 Jun 2023, 04:12

Are you sure that a linear model (regress) is the best choice to impute what seems to be categorical variable(s)? A logit/probit or multinomial logit might be the better choice.

Also, if the other variables, e.g., relevant_question_1, do not contain missing values, they should not be registered as imputed, and they should go to the right of the equals sign in the mi impute command. mi probably handles these issues correctly behind the scenes, but explicitly stating which variables contain missing values and which do not makes the code easier to follow.

The technical solution you seek might be

Code:

mi estimate , esampvaryok : ...

which will suppress the error message about varying observations.

Whether the results are statistically valid is a different question. Whether descriptives should be based on multiply imputed data at all is yet another question.
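To see the option in action without the survey data, here is a minimal sketch. It assumes Stata's example dataset mheart5s0 (already mi set, with age and bmi registered as imputed) is available through webuse; the dataset, the cutoff of 25, and the number of imputations are stand-ins for illustration and have nothing to do with the thread's data.

```stata
* Sketch only: example dataset from Stata's MI documentation, not the survey data
webuse mheart5s0, clear

* Impute age and bmi; the complete variables go to the right of the equals sign
mi impute monotone (regress) age bmi = attack smokes hsgrad female, ///
    add(10) rseed(1923)

* bmi is imputed, so the subsample bmi > 25 differs between imputations.
* Without esampvaryok, mi estimate stops with the "estimation sample varies" error.
mi estimate, esampvaryok: mean age if bmi > 25
```

The option only lets the estimation run; mi estimate still prints a warning that the estimation sample varies and that results may be biased.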
Lars Egner

Join Date: Jun 2020

Posts: 14
#3

15 Jun 2023, 06:28

Thank you very much for the reply. After fiddling with some convergence issues, looking at the noisily output, realizing that works_indirectly_safety and works_directly_safety were being used as categorical variables in each other's models, and finding the "ascontinuous" option, the imputations are looking much better. The code now reads:

Code:

mi set wide
mi register imputed works_directly_safety works_indirectly_safety
mi register regular relevant_question_1 relevant_question_2 relevant_question_3 ///
    relevant_question_4 relevant_question_5
mi impute monotone (mlogit, ascontinuous) works_directly_safety works_indirectly_safety ///
    = relevant_question_1 relevant_question_2 relevant_question_3 ///
    relevant_question_4 relevant_question_5, add(50) rseed(1923) augment

// Check if numbers are reasonable
proportion works_directly_safety
mi estimate: proportion works_directly_safety
proportion works_indirectly_safety
mi estimate: proportion works_indirectly_safety

************************* Estimate descriptive data *************************
mean relevant_question_1 if works_directly_safety > 3 | works_indirectly_safety > 3
mi estimate, esampvaryok: mean relevant_question_1 if works_directly_safety > 3 | works_indirectly_safety > 3

Something is off, however, about the descriptive statistics from mi estimate. The number of observations reported for the non-imputed data is higher than for the imputed data: 146 vs. "between 125 and 138". I don't understand how the imputed dataset can have fewer observations than the non-imputed one. Any suggestions? I paste the output below in case it helps.

. mean Indeks_innovasjon_læring if works_directly_safety> 3 | works_indirectly_safety> 3

Mean estimation                          Number of obs = 146

--------------------------------------------------------------------------
                         |       Mean   Std. err.     [95% conf. interval]
-------------------------+------------------------------------------------
     relevant_question_1 |   17.45205   .3520749      16.75619    18.14792
--------------------------------------------------------------------------

. mi estimate, esampvaryok: mean Indeks_innovasjon_læring if works_directly_safety> 3 | works_indirectly_safety> 3

Multiple-imputation estimates     Imputations       =         50
Mean estimation                   Number of obs     =        125
                                  Average RVI       =     0.1189
                                  Largest FMI       =     0.1086
                                  Complete DF       =        124
DF adjustment:   Small sample     DF:     min       =     106.40
                                          avg       =     106.40
Within VCE type:     Analytic             max       =     106.40

--------------------------------------------------------------------------
                         |       Mean   Std. err.     [95% conf. interval]
-------------------------+------------------------------------------------
     relevant_question_1 |   17.62734   .3829343      16.86817    18.38651
--------------------------------------------------------------------------
Warning: estimation sample varies across imputations; results may be
         biased. Sample sizes vary between 125 and 138.
Note: Numbers of observations in e(_N) vary among imputations.

Last edited by Lars Egner; 15 Jun 2023, 06:31.
daniel klein

Join Date: Mar 2014

Posts: 3886
#4

15 Jun 2023, 06:39

Yeah, mlogit is often a pain to fit. You could try the augment option instead of ascontinuous.

On the observations, remember that

Code:

... if varname > #

includes missing values because missing values are the largest possible values in Stata. You probably want

Code:

... if varname > # & !mi(varname)
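A quick way to check this behaviour is the sketch below. It uses the auto dataset that ships with Stata, with rep78 (which has some missing values) standing in for works_directly_safety; both the dataset and the cutoff of 3 are assumptions for illustration only.

```stata
* Sketch only: rep78 stands in for the thread's works_directly_safety
sysuse auto, clear

* Numeric missing (.) is larger than any nonmissing value,
* so this count includes the cars with missing rep78
count if rep78 > 3

* Guarding with !missing() (abbreviation: !mi()) keeps only observed values
count if rep78 > 3 & !missing(rep78)

* The gap between the two counts equals the number of missing values
count if missing(rep78)
```

With two variables joined by "|", as in the thread, each comparison needs its own guard, along the lines of (works_directly_safety > 3 & !mi(works_directly_safety)) | (works_indirectly_safety > 3 & !mi(works_indirectly_safety)). This is presumably why the non-imputed run reported 146 observations: respondents with missing values passed the "> 3" test there, whereas in the imputed data those values are filled in and many fall below the cutoff.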

Last edited by daniel klein; 15 Jun 2023, 06:42.
Lars Egner

Join Date: Jun 2020

Posts: 14
#5

15 Jun 2023, 06:56

Ah yes, of course. This solved the issue.
I'll add a caveat in the final report regarding the uncertainty of reporting descriptive MI data. While it probably has issues, it is probably better than the non-imputed data.
Thank you for the feedback.