Fastest way to do multiple imputation

paulvonhippel

Join Date: Apr 2014
Posts: 502

20 Jun 2025, 14:22

Hi, I have a very simple missing data problem. Two variables, X and Y. X is fully observed and Y is missing completely at random.

It seems to me it should be really quick to impute Y from a linear regression model. But it is surprisingly slow. Here is the code that's keeping me waiting:

Code:

mi impute monotone (regress) y = x, by(dataset) add(5) dots noisily

Note that the data are imputed in 100 subsets defined by the variable dataset. That's why I'm using the option by(dataset).
Are my settings optimal? How can I do this faster?

If you want to try it yourself, the code below defines programs to simulate the incomplete data:

Code:

cap program drop complete_data
program define complete_data
    syntax , Corr(real) Nobs(integer) Ndatasets(integer)

    // Construct 2x2 correlation matrix
    matrix C = (1, `corr' \ `corr', 1)

    // Clear current data and draw variables x and y
    clear
    local n = `nobs' * `ndatasets'
    drawnorm x y, cov(C) n(`n')
    gen dataset = floor((_n - 1) / `nobs') + 1
end

/* Incomplete data */
cap program drop make_missing
program make_missing
    syntax, Pattern(string)

    gen y_complete = y

    if "`pattern'" == "MAR" {
        gen y_missing = (x > -1) /* Keep Y for the bottom 16% (approximately) of X distribution */
    }
    if "`pattern'" == "MCAR" {
        gen y_missing = runiform() > normal(-1) /* Keep Y completely at random for approximately 16% of observations, independently of X */
    }
    replace y = . if y_missing
end

And once you've defined those programs, you can simulate the incomplete data and impute it like this:

Code:

complete_data, corr(.8) nobs(200) ndatasets(100)
make_missing, pattern("MCAR")
mi set wide
mi register imputed y
mi impute monotone (regress) y = x, by(dataset) add(5) dots noisily

The last line above is the line that's taking longer than I think it should.

Last edited by paulvonhippel; 20 Jun 2025, 14:28.


Tiago Pereira

Join Date: Jan 2016

Posts: 386
#2

21 Jun 2025, 08:28

I am not sure if I understand your problem. It takes a few seconds on my end.
Richard Williams

Join Date: Apr 2014

Posts: 4984
#3

22 Jun 2025, 21:25

What is your definition of slow? On my not particularly powerful laptop the code ran in about 40 seconds. How fast does the posted code run on your machine?

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Felix Bittmann

Join Date: Aug 2018

Posts: 690
#4

22 Jun 2025, 23:03

I think I know what the author means. Without the by() option, the example code runs in about a second; with the option, it takes about 15 seconds. This is a huge delay, especially with larger datasets. However, I don't think there is an easy solution. Potentially, one could write custom code to use parallel. I am not aware of another ado or ready-to-use solution.
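
To put numbers on the with/without comparison, here is a minimal timing sketch. It assumes the data from #1 have just been simulated, mi set wide, and y registered as imputed, with no imputations added yet. The timer numbers 1 and 2 are arbitrary choices.

Code:

timer clear

* Timer 1: the command from #1 without by()
preserve
timer on 1
quietly mi impute monotone (regress) y = x, add(5)
timer off 1
restore

* Timer 2: the same command with by(dataset)
preserve
timer on 2
quietly mi impute monotone (regress) y = x, by(dataset) add(5)
timer off 2
restore

* Elapsed seconds for both runs
timer list

preserve and restore keep the second run from starting with the first run's imputations already in memory. Timings will vary by machine and Stata flavor, as the range reported in this thread already suggests.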

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
daniel klein

Join Date: Mar 2014

Posts: 3847
#5

23 Jun 2025, 01:15

I don't see what's surprising here. The by() option basically says: fit a separate model for each level of dataset. You have 100 levels in dataset. What would be an acceptable difference to fitting just one model?

Anyway, why not go with

Code:

mi impute regress y c.x##i.dataset, add(5)
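
One thing worth checking before swapping by() for the interaction model: a fully interacted regression reproduces each dataset's intercept and slope, but it estimates a single residual variance pooled over all levels of dataset, whereas separate fits each have their own. Below is a small sketch on the observed data. Dataset 2 is an arbitrary choice, and the sketch only compares the fitted regressions; it is not a description of what mi impute does internally.

Code:

* Separate fit for one level of dataset
quietly regress y x if dataset == 2
display "separate fit: slope = " %9.4f _b[x] "  RMSE = " %9.4f e(rmse)

* Fully interacted model: same slope for dataset 2, but one pooled RMSE
quietly regress y c.x##i.dataset
local slope2 = _b[x] + _b[2.dataset#c.x]
display "interacted:   slope = " %9.4f `slope2' "  RMSE = " %9.4f e(rmse)

The two slopes should agree, while the RMSEs generally will not. Whether a pooled residual variance is acceptable depends on the analysis, so this is a diagnostic rather than a recommendation.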