  • Fastest way to do multiple imputation

    Hi, I have a very simple missing data problem. Two variables, X and Y. X is fully observed and Y is missing completely at random.

    It seems to me it should be really quick to impute Y from a linear regression model. But it is surprisingly slow. Here is the code that's keeping me waiting:
    Code:
    mi impute monotone (regress) y = x, by(dataset) add(5) dots noisily
    Note that the data are imputed in 100 subsets defined by the variable dataset. That's why I'm using the option by(dataset).
    Are my settings optimal? How can I do this faster?

    If you want to try it yourself, the code below defines programs to simulate the incomplete data:
    Code:
    cap program drop complete_data
    program define complete_data
        syntax , Corr(real) Nobs(integer) Ndatasets(integer)

        // Construct 2x2 correlation matrix
        matrix C = (1, `corr' \ `corr', 1)

        // Clear current data and draw variables x and y
        clear
        local n = `nobs' * `ndatasets'
        drawnorm x y, cov(C) n(`n')
        gen dataset = floor((_n - 1) / `nobs') + 1
    end

    /* Incomplete data */
    cap program drop make_missing
    program make_missing
        syntax, Pattern(string)

        gen y_complete = y

        if "`pattern'" == "MAR" {
            gen y_missing = (x > -1) /* Keep Y for the bottom 16% (approximately) of X distribution */
        }
        if "`pattern'" == "MCAR" {
            gen y_missing = runiform() > normal(-1) /* Keep Y completely at random for approximately 16% of observations, independently of X */
        }
        replace y = . if y_missing
    end
    And once you've defined those programs, you can simulate the incomplete data and impute it like this:

    Code:
    complete_data, corr(.8) nobs(200) ndatasets(100)
    make_missing, pattern("MCAR")
    mi set wide
    mi register imputed y
    mi impute monotone (regress) y = x, by(dataset) add(5) dots noisily
    The last line above is the line that's taking longer than I think it should.
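As a quick sanity check on the simulated data, the share of observations with missing Y can be inspected before imputing. This is only a sketch: it assumes the complete_data and make_missing programs above have already been defined, and the seed value is arbitrary and not part of the original post. Under the MCAR rule the share should come out near 1 - normal(-1), roughly 0.84.

```stata
* Sketch: verify the missingness rate produced by make_missing
set seed 12345                      // arbitrary seed, for reproducibility only
complete_data, corr(.8) nobs(200) ndatasets(100)
make_missing, pattern("MCAR")

summarize y_missing                 // mean = share of observations with Y set to missing
display 1 - normal(-1)              // theoretical share under the MCAR rule
count if missing(y)                 // number of missing Y values among the 20,000 rows
```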
    Last edited by paulvonhippel; 20 Jun 2025, 14:28.

  • #2
    I am not sure if I understand your problem. It takes a few seconds on my end.


    • #3
      What is your definition of slow? On my not particularly powerful laptop the code ran in about 40 seconds. How fast does the posted code run on your machine?
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam


      • #4
        I think I know what the author means. Without the by option, the example code runs in about a second; with the option it takes about 15. This is a huge delay, especially when larger datasets are used. However, I don't think there is an easy solution. Potentially, one could write custom code to use parallel. I am not aware of another ado or ready-to-use solution.
        Best wishes

        Stata 18.0 MP | ORCID | Google Scholar
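The comparison described in this post could be timed along the following lines. This is only a sketch: it assumes the complete_data and make_missing programs from post #1 are defined, uses Stata's built-in timer command, and wraps each run in preserve/restore so that both start from the same unimputed data. The timer numbers 1 and 2 are arbitrary, and timings will differ from machine to machine.

```stata
* Sketch: time the imputation with and without by(dataset)
complete_data, corr(.8) nobs(200) ndatasets(100)
make_missing, pattern("MCAR")
mi set wide
mi register imputed y

timer clear

* Run 1: a single imputation model for all observations (no by())
preserve
timer on 1
mi impute monotone (regress) y = x, add(5)
timer off 1
restore

* Run 2: a separate imputation model within each of the 100 datasets
preserve
timer on 2
mi impute monotone (regress) y = x, by(dataset) add(5)
timer off 2
restore

timer list      // elapsed seconds recorded for timers 1 and 2
```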


        • #5
          I don't see what's surprising here. The by() option basically says: fit a separate model for each level of dataset. You have 100 levels in dataset. What would be an acceptable slowdown compared with fitting just one model?

          Anyway, why not go with
          Code:
          mi impute regress y = c.x##i.dataset , add(5)
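A sketch of how this single-model alternative might be tried and timed on the simulated data, again assuming the programs from post #1 are defined (the timer number 3 is arbitrary). One difference worth keeping in mind: the interaction model gives each dataset its own intercept and slope but estimates one residual variance across all datasets, whereas by(dataset) fits an entirely separate model per dataset, so the two approaches are not exactly equivalent.

```stata
* Sketch: time the single-model alternative with dataset-specific slopes
complete_data, corr(.8) nobs(200) ndatasets(100)
make_missing, pattern("MCAR")
mi set wide
mi register imputed y

timer clear 3
timer on 3
mi impute regress y = c.x##i.dataset, add(5)
timer off 3
timer list 3    // elapsed seconds for the interaction-model imputation
```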
