Hi, I have a very simple missing data problem. Two variables, X and Y. X is fully observed and Y is missing completely at random.
It seems to me it should be really quick to impute Y from a linear regression model. But it is surprisingly slow. Here is the code that's keeping me waiting:
Code:
mi impute monotone (regress) y = x, by(dataset) add(5) dots noisily
Note that the data are imputed in 100 subsets defined by the variable dataset. That's why I'm using the option by(dataset).
Are my settings optimal? How can I do this faster?
If you want to try it yourself, the code below defines programs to simulate the incomplete data:
Code:
cap program drop complete_data
program define complete_data
    syntax , Corr(real) Nobs(integer) Ndatasets(integer)
    // Construct 2x2 correlation matrix
    matrix C = (1, `corr' \ `corr', 1)
    // Clear current data and draw variables x and y
    clear
    local n = `nobs' * `ndatasets'
    drawnorm x y, cov(C) n(`n')
    gen dataset = floor((_n - 1) / `nobs') + 1
end

/* Incomplete data */
cap program drop make_missing
program make_missing
    syntax, Pattern(string)
    gen y_complete = y
    if "`pattern'" == "MAR" {
        gen y_missing = (x > -1) /* Keep Y for the bottom 16% (approximately) of X distribution */
    }
    if "`pattern'" == "MCAR" {
        gen y_missing = runiform() > normal(-1) /* Keep Y completely at random for approximately 16% of observations, independently of X */
    }
    replace y = . if y_missing
end
And once you've defined those programs, you can simulate the incomplete data and impute it like this:
Code:
complete_data, corr(.8) n(200) ndatasets(100)
make_missing, pattern("MCAR")
mi set wide
mi register imputed y
mi impute monotone (regress) y = x, by(dataset) add(5) dots noisily
The last line above is the line that's taking longer than I think it should.
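To put a number on "slow", the imputation step can be timed with Stata's timer command. The following is only a sketch, and it assumes the simulated data are already in memory, mi set, and y is registered as imputed. It also times a quieter univariate call, mi impute regress, for comparison. With a single incomplete variable the monotone specification should reduce to that model, but whether the quieter call is actually faster here is an assumption to check, not a known result. Timer numbers 1 and 2 are arbitrary choices.

```stata
* Sketch only: assumes the data are mi set and y is registered as imputed.
* preserve/restore keeps each timed run from altering the data in memory.

* Variant 1: the original call, with dots and per-group output
preserve
timer clear 1
timer on 1
mi impute monotone (regress) y = x, by(dataset) add(5) dots noisily
timer off 1
restore

* Variant 2 (assumed alternative): univariate regress, output suppressed
preserve
timer clear 2
timer on 2
quietly mi impute regress y x, by(dataset) add(5)
timer off 2
restore

* Elapsed seconds for both variants
timer list
```

If the two timings come out similar, the display options are probably not the bottleneck and the cost lies in running the imputation separately for each of the 100 groups.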