
  • Estimate sd from multiply imputed data

    Hi, I'm trying to estimate the SD of multiply imputed data. It shouldn't be hard, but I'm having trouble. Here's my progress so far.

    Code:
    /* Simulate bivariate normal data */
    clear
    set seed 1
    matrix C = (1, .8 \ .8, 1)
    drawnorm x y, cov(C) n(100)
    gen y_incomplete = y
    gen y_missing = (x > invnormal(.15))
    replace y_incomplete = . if y_missing
    save "norm2incomplete.dta", replace
    
    /* Extract the mean and SD of complete data. */
    mean y
    matrix mean = e(b)
    local mean = round(mean[1,1],0.1)
    display `mean'
    
    matrix sd = e(sd)
    local sd = round(sd[1,1],0.1)
    display `sd'
    
    /* Impute the data */
    use "norm2incomplete.dta", clear
    mi set flong
    mi register imputed y_incomplete
    mi impute regress y_incomplete x, add(50)
    
    /* Extract the mean from the imputed data */
    mi estimate: mean y_incomplete
    matrix mean = e(b_mi)
    local mean = round(mean[1,1],0.1)
    display `mean'
    
    /* So far so good. But the following code fails to extract the sd of the imputed data */
    
    matrix sd = e(sd)
    local sd = round(sd[1,1],0.1)
    display `sd'


  • #2
    Not sure why you want this, but here is a different strategy: use -mi estimate- with -regress- and then multiply the SE by sqrt(N). Will that do what you want?

    Forgot to say: no covariates in the regression model.
    Last edited by Rich Goldstein; 30 Jul 2024, 14:23.



    • #3
      No, that won't work, because under MI the SE is not equal to SD/sqrt(N).



      • #4
        Originally posted by paulvonhippel View Post
        Hi, I'm trying to estimate the SD of multiply imputed data. It shouldn't be hard,
        [...]
        under MI the SE is not equal to SD/sqrt(N)
        So what is the SD under MI, then? Do you want the pooled (i.e., mean) standard deviation over all imputed datasets? If so, you can adapt this approach. I am not sure that Rubin's rules apply to the standard deviation. Is the SD approximately normal?



        • #5
          Thanks, daniel klein! I hadn't seen this FAQ. Yes, all I need to do is average the SDs across imputed datasets. And actually I can do that more simply than in the FAQ, like this:
          Code:
          collapse (sd) sd_y = y_incomplete if _mi_m>0, by(_mi_m)
          mean sd_y
          But getting the mean into a local macro now seems more complicated than it should be. Is there an easier way than this?
          Code:
          ereturn list
          matrix sd = e(b)
          local sd = round(sd[1,1],0.1)
          display `sd'



          • #6
            I can address your broader question about how MI applies to estimating the SD. This may be more than you wanted to know, but I'm writing a textbook on missing data, so....
            • In my original question, all I needed was a point estimate, so Rubin's formula, which is used to calculate standard errors, wasn't needed. By the way, Rubin's formula is se = sqrt(W + B + B/M), where M is the number of imputed datasets and W and B are the variances of the point estimate within and between imputed datasets.
            • Rubin's formula doesn't assume that the point estimate is normally distributed within and between imputed datasets.
            • But I would be assuming normality if I used Rubin's SE to calculate a normal confidence interval, i.e., point estimate ± 2*se.
            • In fact, the point estimate of the SD is not normally distributed. It has something more like a chi mixture distribution. But the chi distribution quickly approaches normality as the sample gets larger. See screenshot from Wikipedia below.
            • In small samples, Little and Rubin suggest using the point estimate of ln(SD) to approach normality faster.
            • I suspect SD^(2/3) would work even better, because root transformations generally work better than log transformations for normalizing chi-distributed variables.
            I appreciate your help! I hope the points above repay your kindness and aren't too confusing!
            [Attachment: Chi distribution.JPG (screenshot from Wikipedia)]
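            For concreteness, the pooling formula in the first bullet can be sketched in a few lines. This is a Python illustration of the arithmetic, not part of any Stata session; the function name and the example inputs (per-imputation point estimates with made-up within-imputation variances) are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def rubin_pool(estimates, within_vars):
    """Pool M per-imputation point estimates with Rubin's rules.

    W  = mean within-imputation variance of the estimate
    B  = between-imputation (sample) variance of the point estimates
    se = sqrt(W + B + B/M)
    """
    M = len(estimates)
    W = mean(within_vars)
    B = variance(estimates)  # sample variance across imputations
    return mean(estimates), sqrt(W + B + B / M)

# E.g., three imputations with SD point estimates 1.2, 1.4, 1.3
# and (made-up) within-imputation variances of 0.01 each:
est, se = rubin_pool([1.2, 1.4, 1.3], [0.01, 0.01, 0.01])
```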



            • #7
              Originally posted by paulvonhippel View Post
              And actually I can do that more simply than in the FAQ, like this:
              Code:
              collapse (sd) sd_y = y_incomplete if _mi_m>0, by(_mi_m)
              mean sd_y
              Well, collapse has some overhead, if only because it wipes the data in memory, which might or might not be relevant. Also, collapse provides valid results only for data in mi flong style. The approach in the FAQ is more low-level but also more generic. Anyway, if you are happy using collapse, why not just go
              Code:
              collapse (mean) mean = y_incomplete (sd) sd_y = y_incomplete if _mi_m > 0 , by(_mi_m)
              collapse mean sd_y
              local mean = mean[1]
              local sd_y = sd_y[1]

              It should be obvious that you can get the mean even faster simply by
              Code:
              summarize y_incomplete if (_mi_m > 0) , meanonly
              I mean, the mean of means is simply the mean. Either way, I don't think the mi machinery is necessary here at all.
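              That identity holds because every imputation has the same number of rows, as in flong style. A toy check of the arithmetic (Python, purely as an illustration; with unequal group sizes the two quantities would differ):

```python
from statistics import mean

# Two equal-sized "imputed datasets":
imputations = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]

mean_of_means = mean(mean(m) for m in imputations)
overall_mean = mean(v for m in imputations for v in m)
# Both equal 2.5: the mean of means is simply the mean.
```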



              • #8
                Originally posted by paulvonhippel View Post
                This may be more than you wanted to know, but I'm writing a textbook on missing data, so....
                Not at all. Thanks for this! And: looking forward to reading the book.


                Originally posted by paulvonhippel View Post
                • Rubin's formula doesn't assume that the point estimate is normally distributed within and between imputed datasets.
                Interesting. I was under the impression that the underlying population parameter was assumed to be normal, which is true for regression coefficients. Still, I assume the mean (for which Rubin's rules [for pooling point estimates] are just a fancy name) might generally not be the best summary of heavily skewed distributions. So even if normality is not required, the mean probably works better for somewhat symmetric distributions.

                Originally posted by paulvonhippel View Post
                • In fact, the point estimate of the SD is not normally distributed. It has something more like a chi mixture distribution. But the chi distribution quickly approaches normality as the sample gets larger. See screenshot from Wikipedia below.
                • In small samples, Little and Rubin suggest using the point estimate of ln(SD) to approach normality faster.
                • I suspect SD^(2/3) would work even better, because root transformations generally work better than log transformations for normalizing chi-distributed variables.
                I'll keep this in mind. Thanks again.
                Last edited by daniel klein; 31 Jul 2024, 08:12. Reason: formatting; prior response crossed with #6



                • #9
                  Just to clarify your last point: the mean of the per-dataset means is the overall mean, but the mean of the per-dataset SDs is not the total SD. The total SD is larger, because it includes variation both within and between imputed datasets.

                  For example, in the dataset that I simulated, the mean of the per-dataset SDs is 1.4, but the total SD is 1.6.
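                  The gap between the two numbers is just the law of total variance: total variance = (mean within-dataset variance) + (variance of the per-dataset means). A toy check in Python (population formulas, so the identity is exact; this is an illustration only, separate from the Stata session below):

```python
from statistics import mean, pstdev, pvariance

groups = [[0.0, 2.0], [10.0, 12.0]]   # two tiny "imputed datasets"
pooled = [v for g in groups for v in g]

mean_of_sds = mean(pstdev(g) for g in groups)   # 1.0
total_sd = pstdev(pooled)                       # sqrt(26), about 5.1

# Law of total variance:
within = mean(pvariance(g) for g in groups)     # 1.0
between = pvariance([mean(g) for g in groups])  # 25.0
# total_sd**2 == within + between, so total_sd > mean_of_sds
```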

                  Code:
                  /* Simulate bivariate normal data */
                  clear
                  set seed 5
                  matrix C = (1, .8 \ .8, 1)
                  drawnorm x y, cov(C) n(100)
                  gen y_incomplete = y
                  gen y_missing = (x > invnormal(.15))
                  replace y_incomplete = . if y_missing
                  
                  /* Impute the data */
                  mi set flong
                  mi register imputed y_incomplete
                  mi impute regress y_incomplete x, add(50)
                  
                  /* Now you can estimate the mean like this, which gives the right standard error */
                  mi estimate: mean y_incomplete
                  /* But the same command doesn't estimate the SD. */
                  
                  /* This gives the same estimate of the mean, but with the wrong standard error. */
                  summarize y_incomplete if _mi_m > 0
                  /* And the estimated SD is too large, because it includes variation both within and between imputed datasets. */
                  
                  /* This gets you the right mean and the right SD, but only point estimates, no standard error. */
                  collapse (mean) mean_y = y_incomplete (sd) sd_y = y_incomplete if _mi_m>0, by(_mi_m)
                  mean mean_y sd_y
                  Last edited by paulvonhippel; 31 Jul 2024, 08:23.



                  • #10
                    Originally posted by paulvonhippel View Post
                    Just to clarify your last point: The mean of the individual imputed-datasets means is the overall mean, but the mean of the individual imputed-datasets sds is not the total sd.
                    Yes; summarize will only give you the mean. That's why I suggested integrating both in your (extended) collapse approach.


                    Edit: To clarify your point in turn
                    Originally posted by paulvonhippel View Post
                    Code:
                    /* This gets you the right mean and the right SD, but only point estimates, no standard error. */
                    collapse (mean) mean_y = y_incomplete (sd) sd_y = y_incomplete if _mi_m>0, by(_mi_m)
                    mean mean_y sd_y
                    Did you want the standard error also? If so, I would recommend switching back to the mi machinery and writing a wrapper program that estimates both the mean and the SD. Not sure there is a command that would estimate the (within-imputation) standard error of the SD; you would have to implement that yourself.
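                    For the within-imputation piece there is a classic large-sample result one could start from: under normality, se(SD) is approximately SD/sqrt(2(n-1)). A minimal sketch, assuming normal data; the helper name is hypothetical and this is Python, not an existing Stata routine:

```python
from math import sqrt

def se_of_sd(sd, n):
    """Approximate standard error of a sample SD from n observations,
    under normality (large-sample approximation)."""
    return sd / sqrt(2 * (n - 1))

# E.g., with sd = 1 and n = 51 this gives 1/sqrt(100) = 0.1.
# Its square could serve as the within-imputation variance W
# when pooling SD estimates across imputations.
```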
                    Last edited by daniel klein; 31 Jul 2024, 09:21.



                    • #11
                      Re: #2 above and your correct comment in #3: apologies for the incomplete suggestion. What I should have included is to do this within -mi xeq- and save the results, so that you have a set of SDs and can then compute W and B and use them as desired.

