
  • Creating missing data at random and repeating this process multiple times (Stata 13.1)

    I have a small dataset and want to experiment using MICE multiple imputation models. I have data on Body Mass Index (BMI) and I have a complete set of observations. I would like to randomly create some missing data (say 20%) from the complete dataset on BMI in order to test MICE. I have 30 BMI observations and I wish to randomly make 10 observations missing. I would like to randomly create missing data and repeat this multiple times (say 100 times) so that each time, a unique set of missing data is created for the BMI. Any help on how to do this? Many thanks!

  • #2
    Presuming you want exactly 10 of 30 missing:
    Code:
    clear
    set obs 30
    gen bmi = 100* runiform()  // example data
    //
    set seed 48754
    gen bmi_miss = .
    gen randsort = .
    forval i = 1/100 {
       replace randsort = runiform()
       sort randsort
       replace bmi_miss = cond(_n > 10, bmi, .)  // reset all 30 rows each pass so earlier draws do not carry over
       // do whatever estimation here using bmi_miss rather than bmi.
    }



    • #3
      I just want to point out that we are using terminology a bit carelessly here. The description in #1 and the solution in #2 both produce data that is missing completely at random (MCAR). With MCAR data it is not necessary to use multiple imputation, as a complete cases analysis produces unbiased results. (Though it does no harm to use MI in this setting.)

      Strictly speaking, data are missing at random (MAR) when the missingness is informative about the missing value, but that information can be recovered from other, non-missing variables. To emulate MAR in a data set like this one, you would make the probability of missingness a function of another observed variable, for example age.
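
      A minimal sketch of that idea, assuming the data set also contains a fully observed variable age (hypothetical here; #1 only mentions BMI). The logistic coefficients are arbitrary illustration values, not recommendations:

      ```stata
      set seed 2718                                // any seed; fixed for reproducibility
      // ASSUMPTION: age exists and has no missing values
      gen double p_miss = invlogit(-3 + 0.05*age)  // probability of missingness rises with age
      gen bmi_mar = bmi
      replace bmi_mar = . if runiform() < p_miss   // missingness depends on age only, not on bmi itself
      drop p_miss
      ```

      Unlike the code in #2, this does not fix the number of missing values at 10; the count varies from draw to draw.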



      • #4
        Many thanks to you both for your advice. In my case, MCAR data is fine: it lets me compare an analysis of the complete dataset with an analysis of the same dataset with values randomly missing, so I can check how MI performs against regression modelling on the complete data.



        • #5
          Hi Mike

          I have tried your code, but what I would like is to keep the original BMI dataset (30 observations) and then randomly remove 10 observations from it. I would then like to repeat this process again and again so that each time a different set of 10 observations is removed. In this way I will have, for example, 100 datasets of BMI data, each with 10 observations removed at random. For example, below there are 3 datasets: dataset 1 is my complete data, and dataset 2 and dataset 3 each have 10 observations randomly removed.

          Dataset 1   Dataset 2   Dataset 3    (. = value removed)
          BMI         BMI         BMI
          30          30          30
          26          .           26
          28          28          28
          43          43          43
          41          41          41
          24          24          24
          29          .           29
          31          .           31
          44          .           .
          25          25          25
          30          30          30
          30          .           30
          37          37          .
          35          .           35
          28          28          28
          28          28          28
          31          .           .
          22          22          .
          28          .           .
          24          24          .
          30          .           30
          31          31          31
          35          35          .
          25          25          25
          31          31          31
          18          18          18
          23          .           .
          22          22          .
          23          23          23
          24          24          .



          • #6
            Well, the logic is exactly the same. Just skip the first three lines of the code shown in #2 and replace them with a command that reads in your existing BMI data set. The code below then produces 100 new BMI variables (bmi_1 to bmi_100), each with 10 missing values randomly scattered in:
            Code:
            set seed 1234 // OR YOUR FAVORITE NUMBER
            forvalues i = 1/100 {
                gen bmi_`i' = bmi
                gen double shuffle = runiform()
                sort shuffle
                replace bmi_`i' = . in 1/10
                drop shuffle
            }
            Warning: not tested, beware of typos.
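
            A quick way to check the result, sketched under the assumption that bmi has no missing values to begin with (as stated in #1) and that the loop above has already run:

            ```stata
            // each generated variable should have exactly 10 missing values
            forvalues i = 1/100 {
                quietly count if missing(bmi_`i')
                assert r(N) == 10
            }
            ```

            Note that the loop leaves the data sorted in the last random order. If the original order matters, one option is to create an identifier before the loop (for example gen long obs_id = _n, a hypothetical name) and sort on it afterwards.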



            • #7
              Thanks Clyde for your help. Coding works and your posts were very helpful.



              • #8
                But how would I create randomly missing values within a range? Say I wanted my variable y to have randomly missing values between the years 1980 and 2001. How should I adapt the above code?



                • #9
                  Code:
                  set seed 1234 // OR WHATEVER NATURAL NUMBER YOU LIKE
                  gen year = runiformint(1980, 2001) // runiformint() REQUIRES STATA 14 OR NEWER
                  replace year = . if runiform() < 0.15 // PRODUCES ABOUT 15% MISSING VALUES ON AVERAGE
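
                  If the goal is instead to blank out an existing variable y only within 1980 to 2001, one possible adaptation is sketched below. It assumes the data already contain numeric variables year and y (names taken from #8); y_miss is a hypothetical name, and the 15% rate is simply carried over from the code above:

                  ```stata
                  set seed 1234
                  gen y_miss = y   // keep the original y intact
                  // only observations in 1980-2001 are eligible; each is set to missing with probability 0.15
                  replace y_miss = . if inrange(year, 1980, 2001) & runiform() < 0.15
                  ```

                  This gives roughly 15% missing within the range rather than an exact count. For an exact count, the sort-on-a-random-number approach from #6 could be applied to the in-range observations instead.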
