
  • Estimate sd from multiply imputed data

    Hi, I'm trying to estimate the SD of multiply imputed data. It shouldn't be hard, but I'm having trouble. Here's my progress so far.

    Code:
    /* Simulate bivariate normal data */
    clear
    set seed 1
    matrix C = (1, .8 \ .8, 1)
    drawnorm x y, cov(C) n(100)
    gen y_incomplete = y
    gen y_missing = (x > invnormal(.15))
    replace y_incomplete = . if y_missing
    save "norm2incomplete.dta", replace
    
    /* Extract the mean and SD of complete data. */
    mean y
    matrix mean = e(b)
    local mean = round(mean[1,1],0.1)
    display `mean'
    
    matrix sd = e(sd)
    local sd = round(sd[1,1],0.1)
    display `sd'
    
    /* Impute the data */
    use "norm2incomplete.dta", clear
    mi set flong
    mi register imputed y_incomplete
    mi impute regress y_incomplete x, add(50)
    
    /* Extract the mean from the imputed data */
    mi estimate: mean y_incomplete
    matrix mean = e(b_mi)
    local mean = round(mean[1,1],0.1)
    display `mean'
    
    /* So far so good. But the following code fails to extract the sd of the imputed data */
    
    matrix sd = e(sd)
    local sd = round(sd[1,1],0.1)
    display `sd'


  • #2
    Not sure why you want this, but here is a different strategy: use -mi estimate- with -regress- and then multiply the SE by sqrt(N). Will that do what you want?

    Forgot to say: no covariates in the regression model.
    Last edited by Rich Goldstein; 30 Jul 2024, 14:23.



    • #3
      No, that won't work, because under MI the SE is not equal to SD/sqrt(N).



      • #4
        Originally posted by paulvonhippel View Post
        Hi, I'm trying to estimate the SD of multiply imputed data. It shouldn't be hard,
        [...]
        under MI the SE is not equal to SD/sqrt(N)
        So what is the SD under MI, then? Do you want the pooled (i.e., mean) standard deviation over all imputed datasets? If so, you can adapt this approach. I am not sure that Rubin's rules apply to the standard deviation. Is the SD approximately normal?



        • #5
          Thanks, daniel klein! I hadn't seen this FAQ. Yes, all I need to do is average the SDs across imputed datasets. And actually I can do that more simply than in the FAQ, like this:
          Code:
          collapse (sd) sd_y = y_incomplete if _mi_m>0, by(_mi_m)
          mean sd_y
          But getting the mean into a local macro now seems more complicated than it should be. Is there an easier way than this?
          Code:
          ereturn list
          matrix sd = e(b)
          local sd = round(sd[1,1],0.1)
          display `sd'



          • #6
            I can address your broader question about how MI applies to estimating the SD. This may be more than you wanted to know, but I'm writing a textbook on missing data, so....
            • In my original question, all I needed was a point estimate, so Rubin's formula, which is used to calculate standard errors, wasn't needed. By the way, Rubin's formula is se = sqrt(W + B + B/M), where M is the number of imputed datasets and W and B are the variances of the point estimate within and between imputed datasets.
            • Rubin's formula doesn't assume that the point estimate is normally distributed within and between imputed datasets.
            • But I would be assuming normality if I used Rubin's SE to calculate a normal confidence interval, i.e., point estimate ± 2*se.
            • In fact, the point estimate of the SD is not normally distributed. It has something more like a chi mixture distribution. But the chi distribution quickly approaches normality as the sample gets larger. See screenshot from Wikipedia below.
            • In small samples, Little and Rubin suggest using the point estimate of ln(SD) to approach normality faster.
            • I suspect SD^(2/3) would work even better, because root transformations generally work better than log transformations for normalizing chi-distributed variables.
            I appreciate your help! I hope the points above repay your kindness and aren't too confusing!
            [Attachment: Chi distribution.JPG (screenshot from Wikipedia)]
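            For concreteness, the pooling formula in the first bullet can be sketched in a few lines. This is a Python illustration of the arithmetic, not part of any Stata session; the function name and the example inputs (per-imputation point estimates with made-up within-imputation variances) are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def rubin_pool(estimates, within_vars):
    """Pool M per-imputation point estimates with Rubin's rules.

    W  = mean within-imputation variance of the estimate
    B  = between-imputation (sample) variance of the point estimates
    se = sqrt(W + B + B/M)
    """
    M = len(estimates)
    W = mean(within_vars)
    B = variance(estimates)  # sample variance across imputations
    return mean(estimates), sqrt(W + B + B / M)

# E.g., three imputations with SD point estimates 1.2, 1.4, 1.3
# and (made-up) within-imputation variances of 0.01 each:
est, se = rubin_pool([1.2, 1.4, 1.3], [0.01, 0.01, 0.01])
```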



            • #7
              Originally posted by paulvonhippel View Post
              And actually I can do that more simply than in the FAQ, like this:
              Code:
              collapse (sd) sd_y = y_incomplete if _mi_m>0, by(_mi_m)
              mean sd_y
              Well, collapse has some overhead, if only because it wipes the data in memory, which might or might not be relevant. Also, collapse provides valid results only for data in mi flong style. The approach in the FAQ is more low-level but also more generic. Anyway, if you are happy using collapse, why not just go
              Code:
              collapse (mean) mean = y_incomplete (sd) sd_y = y_incomplete if _mi_m > 0 , by(_mi_m)
              collapse mean sd_y
              local mean = mean[1]
              local sd_y = sd_y[1]

              It should be obvious that you can get the mean even faster simply by
              Code:
              summarize y_incomplete if (_mi_m > 0) , meanonly
              I mean, the mean of means is simply the mean. Either way, I don't think the mi machinery is necessary here at all.
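              That identity holds because every imputation has the same number of rows, as in flong style. A toy check of the arithmetic (Python, purely as an illustration; with unequal group sizes the two quantities would differ):

```python
from statistics import mean

# Two equal-sized "imputed datasets":
imputations = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]

mean_of_means = mean(mean(m) for m in imputations)
overall_mean = mean(v for m in imputations for v in m)
# Both equal 2.5: the mean of means is simply the mean.
```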



              • #8
                Originally posted by paulvonhippel View Post
                This may be more than you wanted to know, but I'm writing a textbook on missing data, so....
                Not at all. Thanks for this! And: looking forward to reading the book.


                Originally posted by paulvonhippel View Post
                • Rubin's formula doesn't assume that the point estimate is normally distributed within and between imputed datasets.
                Interesting. I was under the impression that the underlying population parameter was assumed to be normal, which is true for regression coefficients. Still, I assume the mean (for which Rubin's rules [for pooling point estimates] are just a fancy name) might generally not be the best summary of heavily skewed distributions. So even if normality is not required, the mean probably works better for somewhat symmetric distributions.

                Originally posted by paulvonhippel View Post
                • In fact, the point estimate of the SD is not normally distributed. It has something more like a chi mixture distribution. But the chi distribution quickly approaches normality as the sample gets larger. See screenshot from Wikipedia below.
                • In small samples, Little and Rubin suggest using the point estimate of ln(SD) to approach normality faster.
                • I suspect SD^(2/3) would work even better, because root transformations generally work better than log transformations for normalizing chi-distributed variables.
                I'll keep this in mind. Thanks again.
                Last edited by daniel klein; 31 Jul 2024, 08:12. Reason: formatting; prior response crossed with #6



                • #9
                  Just to clarify your last point: the mean of the per-dataset means is the overall mean, but the mean of the per-dataset SDs is not the total SD. The total SD is larger, because it includes variation both within and between imputed datasets.

                  For example, in the dataset that I simulated, the mean of the per-dataset SDs is 1.4, but the total SD is 1.6.
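                  The gap between the two numbers is just the law of total variance: total variance = (mean within-dataset variance) + (variance of the per-dataset means). A toy check in Python (population formulas, so the identity is exact; this is an illustration only, separate from the Stata session below):

```python
from statistics import mean, pstdev, pvariance

groups = [[0.0, 2.0], [10.0, 12.0]]   # two tiny "imputed datasets"
pooled = [v for g in groups for v in g]

mean_of_sds = mean(pstdev(g) for g in groups)   # 1.0
total_sd = pstdev(pooled)                       # sqrt(26), about 5.1

# Law of total variance:
within = mean(pvariance(g) for g in groups)     # 1.0
between = pvariance([mean(g) for g in groups])  # 25.0
# total_sd**2 == within + between, so total_sd > mean_of_sds
```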

                  Code:
                  /* Simulate bivariate normal data */
                  clear
                  set seed 5
                  matrix C = (1, .8 \ .8, 1)
                  drawnorm x y, cov(C) n(100)
                  gen y_incomplete = y
                  gen y_missing = (x > invnormal(.15))
                  replace y_incomplete = . if y_missing
                  
                  /* Impute the data */
                  mi set flong
                  mi register imputed y_incomplete
                  mi impute regress y_incomplete x, add(50)
                  
                  /* Now you can estimate the mean like this, which gives the right standard error */
                  mi estimate: mean y_incomplete
                  /* But the same command doesn't estimate the SD. */
                  
                  /* This gives the same estimate of the mean, but with the wrong standard error. */
                  summarize y_incomplete if _mi_m > 0
                  /* And the estimated SD is too large, because it includes variation both within and between imputed datasets. */
                  
                  /* This gets you the right mean and the right SD, but only point estimates, no standard error. */
                  collapse (mean) mean_y = y_incomplete (sd) sd_y = y_incomplete if _mi_m>0, by(_mi_m)
                  mean mean_y sd_y
                  Last edited by paulvonhippel; 31 Jul 2024, 08:23.



                  • #10
                    Originally posted by paulvonhippel View Post
                    Just to clarify your last point: The mean of the individual imputed-datasets means is the overall mean, but the mean of the individual imputed-datasets sds is not the total sd.
                    Yes; summarize will only give you the mean. That's why I suggested integrating both in your (extended) collapse approach.


                    Edit: To clarify your point in turn
                    Originally posted by paulvonhippel View Post
                    Code:
                    /* This gets you the right mean and the right SD, but only point estimates, no standard error. */
                    collapse (mean) mean_y = y_incomplete (sd) sd_y = y_incomplete if _mi_m>0, by(_mi_m)
                    mean mean_y sd_y
                    Did you want the standard error also? If so, I would recommend switching back to the mi machinery and writing a wrapper program that estimates both the mean and the SD. Not sure there is a command that would estimate the (within-imputation) standard error of the SD; you would have to implement that yourself.
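                    For the within-imputation piece there is a classic large-sample result one could start from: under normality, se(SD) is approximately SD/sqrt(2(n-1)). A minimal sketch, assuming normal data; the helper name is hypothetical and this is Python, not an existing Stata routine:

```python
from math import sqrt

def se_of_sd(sd, n):
    """Approximate standard error of a sample SD from n observations,
    under normality (large-sample approximation)."""
    return sd / sqrt(2 * (n - 1))

# E.g., with sd = 1 and n = 51 this gives 1/sqrt(100) = 0.1.
# Its square could serve as the within-imputation variance W
# when pooling SD estimates across imputations.
```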
                    Last edited by daniel klein; 31 Jul 2024, 09:21.



                    • #11
                      Re: #2 above and your correct comment in #3: apologies for the incomplete suggestion. What I should have included is to do this within -mi xeq- and save the results, so that you have a set of SDs and can then compute W and B and use them as desired.

