Creating missing data at random and repeating this process multiple times (Stata 13.1)

Jeya Palan

Join Date: Jun 2017

Posts: 8
#1

14 Jun 2017, 07:14

I have a small dataset and want to experiment with MICE multiple imputation models. I have a complete set of observations on Body Mass Index (BMI), and I would like to randomly set some of them to missing in order to test MICE. Specifically, I have 30 BMI observations and wish to make 10 of them (about a third) missing at random. I would like to repeat this multiple times (say 100 times) so that each time a different random set of BMI values is missing. Any help on how to do this? Many thanks!

Mike Lacy

Join Date: Apr 2014
Posts: 2416
#2

14 Jun 2017, 07:32

Presuming you want exactly 10 of 30 missing:

Code:

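clear
set obs 30
gen bmi = 100* runiform()  // example data
//
set seed 48754
gen bmi_miss = .
gen randsort = .
forval i = 1/100 {
   replace randsort = runiform()   // new random order on every pass
   sort randsort
   // The first 10 observations in the random order become missing, the rest keep bmi.
   // cond() also resets values carried over from the previous pass, so exactly 10 are missing each time.
   replace bmi_miss = cond(_n > 10, bmi, .)
   // do whatever estimation here using bmi_miss rather than bmi.
}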
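
The "do whatever estimation here" step could be filled in along the following lines. This is only a sketch under stated assumptions: the outcome variable y, the made-up BMI values, and the names handle, results, b_bmi and se_bmi are all hypothetical and not part of the original post. It runs a complete-case regression in each of the 100 replications and collects the coefficient and standard error with postfile.

```stata
clear
set seed 48754
set obs 30
gen bmi = 18 + 25*runiform()          // made-up BMI values (assumption)
gen y   = 1 + 0.5*bmi + rnormal()     // hypothetical outcome, not in the original post

gen bmi_miss = .
gen double randsort = .

tempname handle
tempfile results
postfile `handle' rep b_bmi se_bmi using `results'

forvalues i = 1/100 {
    quietly replace randsort = runiform()
    sort randsort
    // first 10 observations in the random order are set to missing
    quietly replace bmi_miss = cond(_n > 10, bmi, .)
    // complete-case regression on the 20 remaining observations
    quietly regress y bmi_miss
    post `handle' (`i') (_b[bmi_miss]) (_se[bmi_miss])
}
postclose `handle'

// one row per replication
use `results', clear
summarize b_bmi se_bmi
```

The regression line is a placeholder: any estimation command, including an mi impute / mi estimate sequence, could take its place, with the quantities of interest passed to post.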

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#3

14 Jun 2017, 08:53

I just want to point out that we are using terminology a bit carelessly here. The description in #1 and the solution in #2 both produce data that are missing completely at random (MCAR). With MCAR data it is not necessary to use multiple imputation, as a complete-case analysis produces unbiased results. (Though it does no harm to use MI in this setting.)

Strictly speaking, data are missing at random (MAR) when the missingness may be informative about the missing value, but only through other, observed information: conditional on the observed variables, whether a value is missing no longer depends on the value itself. To emulate MAR data in such a data set, one would make the missingness a probabilistic function of another observed variable, such as age.
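
A minimal sketch of what emulating MAR might look like, assuming the data set contains a fully observed age variable. The simulated data, the logistic form and its coefficients (-3 and 0.05) are illustrative assumptions, not something given in this thread.

```stata
clear
set seed 2017
set obs 30
gen age = 20 + floor(50*runiform())     // made-up ages 20-69 (assumption)
gen bmi = 18 + 25*runiform()            // made-up BMI values (assumption)

// Probability that bmi is missing rises with age; coefficients are arbitrary.
gen double p_miss = invlogit(-3 + 0.05*age)

// Missingness depends on the observed age only, not on bmi itself: MAR.
gen bmi_mar = bmi
replace bmi_mar = . if runiform() < p_miss

count if missing(bmi_mar)
```

Unlike the fixed 10-of-30 scheme, the number of missing values here varies from run to run, because each observation is dropped independently with its own probability.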
Jeya Palan

Join Date: Jun 2017

Posts: 8
#4

19 Jun 2017, 15:08

Many thanks to you both for your advice. In my case, MCAR data is fine: it will allow me to compare an analysis of the complete dataset with an analysis of the same dataset with values randomly removed, so I can check how MI performs against regression modelling on the complete data.
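
The comparison described here could be sketched roughly as follows. The outcome y, the covariate age, the simulated values and the choice of 20 imputations are all assumptions made for illustration. Because only one variable has missing values, the sketch uses mi impute regress rather than a chained (MICE) specification.

```stata
clear
set seed 2017
set obs 30
gen age = 20 + floor(50*runiform())          // made-up covariate (assumption)
gen bmi = 18 + 25*runiform()                 // made-up BMI values (assumption)
gen y   = 1 + 0.5*bmi + 0.1*age + rnormal()  // hypothetical outcome

// Benchmark: regression on the complete data
regress y bmi age

// Remove 10 of the 30 BMI values completely at random
gen double shuffle = runiform()
sort shuffle
gen bmi_miss = bmi
replace bmi_miss = . in 1/10
drop shuffle

// Multiple imputation of the incomplete variable, then the same regression
mi set wide
mi register imputed bmi_miss
mi impute regress bmi_miss y age, add(20) rseed(1234)
mi estimate: regress y bmi_miss age
```

Comparing the coefficient on bmi from the first regress with the one on bmi_miss from mi estimate gives one replication of the check; repeating it over many random deletions gives a picture of how MI behaves on average.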
Jeya Palan

Join Date: Jun 2017

Posts: 8
#5

19 Jun 2017, 15:51

Hi Mike

I have tried your code, but what I would like is to keep the original BMI data (30 observations) and randomly remove 10 observations from it. I would then like to repeat this process again and again, so that each time a different set of 10 observations is removed. In this way I will have, for example, 100 datasets of BMI data, each with 10 observations removed at random. Below are 3 datasets as an illustration: dataset 1 is my complete data, and datasets 2 and 3 each have 10 observations randomly removed.

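BMI values (. = removed):

Obs   Dataset 1   Dataset 2   Dataset 3
  1      30          30          30
  2      26           .          26
  3      28          28          28
  4      43          43          43
  5      41          41          41
  6      24          24          24
  7      29           .          29
  8      31           .          31
  9      44           .           .
 10      25          25          25
 11      30          30          30
 12      30           .          30
 13      37          37           .
 14      35           .          35
 15      28          28          28
 16      28          28          28
 17      31           .           .
 18      22          22           .
 19      28           .           .
 20      24          24           .
 21      30           .          30
 22      31          31          31
 23      35          35           .
 24      25          25          25
 25      31          31          31
 26      18          18          18
 27      23           .           .
 28      22          22           .
 29      23          23          23
 30      24          24           .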
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

19 Jun 2017, 16:26

Well, the logic is exactly the same. Just skip the first three lines in the code shown in #2 and replace them with a command to read in your existing BMI data set. The code below then produces 100 new BMI variables (bmi_1 through bmi_100), each with 10 missing values randomly scattered in:

Code:

set seed 1234 // OR YOUR FAVORITE NUMBER
forvalues i = 1/100 {
    gen bmi_`i' = bmi
    gen double shuffle = runiform()
    sort shuffle
    replace bmi_`i' = . in 1/10
    drop shuffle
}

Warning: not tested, beware of typos.
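
One thing to keep in mind is that each sort leaves the data in a new random order. A possible way to get the original order back and to check the result is sketched below. The made-up BMI values and the variable names obs_no and n_missing are assumptions introduced for illustration.

```stata
clear
set seed 1234
set obs 30
gen bmi = 18 + 25*runiform()     // made-up values standing in for the real BMI data
gen long obs_no = _n             // hypothetical id that records the original order

forvalues i = 1/100 {
    gen bmi_`i' = bmi
    gen double shuffle = runiform()
    sort shuffle
    quietly replace bmi_`i' = . in 1/10
    drop shuffle
}

// Restore the original order of the observations
sort obs_no

// Check: every copy has exactly 10 missing values
foreach v of varlist bmi_1-bmi_100 {
    quietly count if missing(`v')
    assert r(N) == 10
}

// For each observation, the number of copies in which it was set to missing
egen n_missing = rowmiss(bmi_1-bmi_100)
summarize n_missing
```

With 10 of 30 values removed per copy, n_missing should average about 33 across observations.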
Jeya Palan

Join Date: Jun 2017

Posts: 8
#7

20 Jun 2017, 11:26

Thanks, Clyde, for your help. The code works and your posts were very helpful.
Travis Mitchell

Join Date: Apr 2023

Posts: 4
#8

05 May 2023, 20:22

But how would I create randomly missing values within a range? Say I wanted my variable y to have randomly missing values for the years 1980 to 2001. How should I adapt the above code?

Clyde Schechter

Join Date: Apr 2014
Posts: 30117
#9

08 May 2023, 07:48

Code:

set seed 1234 // OR WHATEVER NATURAL NUMBER YOU LIKE
gen year = runiformint(1980, 2001) // runiformint() REQUIRES STATA 14 OR LATER
replace year = . if runiform() < 0.15 // EACH VALUE IS SET TO MISSING WITH PROBABILITY 0.15, SO ABOUT 15% MISSING
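
If the aim is instead to blank out an existing variable y only for observations whose year falls between 1980 and 2001, something along these lines might work. It is a sketch with simulated data: the variables year and y, the 1970-2019 range and the 15% rate are assumptions for illustration, and the year variable is built with floor() and runiform() so that it also runs in versions before Stata 14.

```stata
clear
set seed 1234
set obs 200
gen year = 1970 + floor(50*runiform())   // made-up years 1970-2019 (assumption)
gen y    = rnormal()                     // made-up variable of interest (assumption)

// Keep the original and create a copy with missing values
gen y_miss = y

// Inside 1980-2001 each value is dropped with probability 0.15;
// outside that range nothing is touched.
replace y_miss = . if inrange(year, 1980, 2001) & runiform() < 0.15

// Check: no missing values were created outside the range
count if missing(y_miss) & !inrange(year, 1980, 2001)
assert r(N) == 0
```

For an exact number of missing values within the range rather than an approximate share, the shuffle-and-sort approach from #6 could be restricted to the observations in that range.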
