mi styles and speed

paulvonhippel

#1

10 May 2018, 09:22

I am working with a very large dataset. It has 2 million rows. To Stata's credit, I find that I can generate and replace variables quickly.

Yet after I multiply impute the data (3 times), I find that simple commands slow to a crawl. For example, "mi xeq: gen newvar = oldvar + 1" takes upwards of 10 minutes.

I wonder if this is due to my having the mi data in flong style. In flong style, the dataset has 8 million rows, while in wide style the dataset has just 2 million rows (but more columns). Is there any reason to expect faster processing when mi data are wide than when they are flong? More generally, is there any reason to expect faster processing when there are fewer rows but more columns?
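One way to check the style question empirically is to time the same command in both styles. The following is only a sketch, assuming Stata's bundled auto dataset as a small stand-in for the 2-million-row file; the variable names rep78_plus1 and rep78_plus2 and the seed are illustrative. With 3 imputations, flong holds the original rows plus 3 copies, which is why 2 million rows become 8 million.

```stata
* Illustrative setup on the bundled auto data (not the real dataset)
sysuse auto, clear
mi set flong
mi register imputed rep78
mi impute regress rep78 mpg weight, add(3) rseed(12345)

* Time a passive variable in flong style
timer clear
timer on 1
mi passive: generate rep78_plus1 = rep78 + 1
timer off 1

* Convert to wide style and time the same operation
mi convert wide, clear
timer on 2
mi passive: generate rep78_plus2 = rep78 + 1
timer off 2

* Timer 1 = flong, timer 2 = wide
timer list
```

On 74 observations both timings will be negligible; the comparison only becomes informative on data of realistic size.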
Rich Goldstein

#2

10 May 2018, 09:39

Not sure how much difference this will make, but don't use that method of generating a new variable with mi data; instead, see

Code:

help mi passive
daniel klein

#3

10 May 2018, 09:48

With data in flong style, I would just

Code:

generate newvar = oldvar + 1

and perhaps register newvar afterwards.
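A minimal sketch of that sequence, assuming the bundled auto dataset with rep78 as the imputed variable (the name rep78_plus1 and the seed are illustrative, not from the original data):

```stata
* Illustrative setup on the bundled auto data
sysuse auto, clear
mi set flong
mi register imputed rep78
mi impute regress rep78 mpg weight, add(3) rseed(12345)

* Plain generate acts on every row of flong data (m = 0 and all imputations)
generate rep78_plus1 = rep78 + 1

* Tell mi that the new variable is a function of an imputed variable
mi register passive rep78_plus1
```

Registering the variable as passive lets later mi commands check it for consistency across imputations.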

As far as I can see, some of Stata's mi commands will convert the data into flongsep style, internally, so perhaps storing the data in flongsep in the first place might speed things up a little.

Best
Daniel
paulvonhippel

#4

10 May 2018, 10:02

Thanks, Daniel. You are absolutely right that, in flong style, gen runs much faster than mi passive: gen.
Which raises the question of why mi passive: runs so slowly, and why anyone should use it.

Where do you see that some mi commands convert the data to flongsep? And why would converting to flongsep speed things up?
From the documentation (for "help mi convert"): "The other styles are more convenient than flongsep, and mi commands run faster on them."

Last edited by paulvonhippel; 10 May 2018, 10:13.
daniel klein

#5

10 May 2018, 12:12

"Which raises the question of why mi passive: runs so slowly, and why anyone should use it."

As I understand it, mi passive is the safe way, because it runs some tests; see mi update.

"Where do you see that some mi commands convert the data to flongsep?"

When you run mi estimate, you can sometimes see a couple of temporary files being created, which I guess are the separate complete datasets. This is also what I remember from glancing through the code of some of the mi commands. I am therefore surprised that those commands are supposed to run faster on the other styles; perhaps the tempfiles are created either way; perhaps I am completely wrong here. I would follow the advice in the manuals, with one exception: use the regular generate command with flong data when you know that the result is what you want. If you are unsure about the latter, use a 1 percent random sample, try both approaches, and compare the results.
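The comparison could look something like the sketch below. It assumes the bundled auto dataset in place of a small subsample of the real data; the names v_passive and v_plain and the seed are illustrative.

```stata
* Illustrative setup on the bundled auto data
sysuse auto, clear
mi set flong
mi register imputed rep78
mi impute regress rep78 mpg weight, add(3) rseed(12345)

* Approach 1: the documented route
mi passive: generate v_passive = rep78 + 1

* Approach 2: plain generate on the flong data
generate v_plain = rep78 + 1

* assert stops with an error if the two variables differ in any row
assert v_passive == v_plain
```

If the assert passes on the subsample, that is some reassurance that the plain generate gives the intended result on the full data.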

Best
Daniel
Richard Williams

#6

11 May 2018, 04:42

It is debatable whether any of the above approaches is right for creating the variable, at least if you are doing something more complicated than adding 1. Paul Allison and Patrick Royston both advocate the "just another variable" approach over passive imputation. You generate the variable using the original data, and then let the new variable be imputed as necessary. This is briefly discussed on pages 10-11 of

https://www3.nd.edu/~rwilliam/xsoc73994/MD02.pdf
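As a rough sketch of the "just another variable" idea, assuming the bundled auto dataset and a squared term as an example of something more complicated than adding 1 (the variable rep78_sq, the predictors, and the choice of mi impute mvn are illustrative assumptions, not taken from the handout):

```stata
sysuse auto, clear

* Transform first, on the original (unimputed) data;
* rep78_sq is missing wherever rep78 is missing
generate rep78_sq = rep78^2

mi set flong

* Register both the variable and its transform as imputed
mi register imputed rep78 rep78_sq

* Impute the transform like any other variable instead of deriving it afterwards
mi impute mvn rep78 rep78_sq = mpg weight, add(3) rseed(12345)
```

Note that the imputed rep78_sq will generally not equal the square of the imputed rep78; that is a known feature of this approach, not a coding error.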

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
paulvonhippel

#7

28 May 2018, 10:52

I think I may actually be the one who first pointed out the problem with passive imputation. Passive imputation can introduce incompatibility between the imputation model and the analysis model. In this 2009 paper, I advocated for an approach that I called "transform, then impute," which Royston calls imputing the transformed variable like "just another variable."

However, my objection doesn't apply to adding a constant and other linear transformations, like those discussed in this thread.