
  • When should missing data, in numerical variables, be replaced by zeros?

    Hi all,

    I am trying to estimate the effect of Official Aid on growth in Low Income countries between 1990 and 2014 using panel data. I'm using a linear regression model that includes control variables such as inflation and population growth. The Official Aid variable has many missing values. Would this be considered MAR?
    Should I just leave them, or would it be better to replace them with zeros?

    How do the xtreg and xtabond commands treat missing data? I'm using both.

    Many thanks,

  • #2
    In general, observations with missing data are dropped from the estimation.
    Emad A. Shehata
    Professor (PhD Economics)
    Agricultural Research Center - Agricultural Economics Research Institute - Egypt


    • #3
      You should only replace missing values by zero if you have good reason to believe that the actual values, were they known, would be zero. In any other circumstance it's inappropriate.

      As for MAR, you must know the process that led to the missingness of those observations. In the context you are describing, it would surprise me if the missing values were missing at random. It seems to me, as a naive layperson, that your missing values are far more likely to represent situations where the data were withheld by either the donors or the recipients for political reasons, perhaps to hide corruption, or for situations where the accounting systems are so inaccurate that the curators of the data chose not to use what was reported to them. In such situations, the missingness is likely to be related to the actual values (were they known), even after adjusting for everything observed. (Of course, I could be quite wrong about that.) That would be the very antithesis of missing at random.
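
      A minimal sketch of how one might start looking at the missingness before deciding anything. It uses the nlswork example dataset (downloaded by webuse, so it assumes an internet connection) purely as a stand-in; union, tenure, age and race are that dataset's variables, not the poster's. A check like this can show that missingness is related to observed variables, which rules out missing completely at random, but it cannot distinguish missing at random from missing not at random.

      ```stata
      * Illustrative only: nlswork stands in for the aid/growth panel
      webuse nlswork, clear

      * How many values are missing in each variable, and in which combinations
      misstable summarize ln_wage union tenure
      misstable patterns ln_wage union tenure

      * Indicator for a missing value of union
      generate byte union_missing = missing(union)

      * Is missingness concentrated in particular years?
      tabulate year union_missing

      * Is missingness associated with observed characteristics?
      logit union_missing age i.race, vce(cluster idcode)
      ```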

      Read for a very useful discussion of missing data.


      • #4
        welcome to the list.
        The following can give you some hints about diagnosing missing values mechanism and dealing with them.
        I share Clyde's comments; I'm not aware of any situation where missing values should be automatically replaced by zero.
        As an aside, since you have panel data, why are you running a one-wave regression instead of a panel data regression?
        Last edited by Carlo Lazzaro; 19 May 2016, 22:52.
        Kind regards,
        (StataNow 18.5)


        • #5
          Many thanks for your quick answers.

          As to your advice, Carlo, about using panel data regression: I have been running it too. To be more precise, I have been testing the xtreg and xtabond commands. When running those commands, does Stata remove/omit the missing values?

          Kind Regards,


          • #6
            yes, it does.
            Kind regards,
            (StataNow 18.5)


            • #7
              Many thanks again for your help.


              • #8
                Originally posted by Maria Ruiz
                Many thanks for your quick answers.

                When running those commands, does Stata remove/omit the missing values?

                Kind Regards,
                That depends on what you mean by a "case". xtreg, re (a GLS estimator by default, as the output header below shows) does not fill in anything: it drops every observation, i.e. every subject-by-time row, that has a missing value in any variable in the model. A subject stays in the estimation as long as it has at least one complete observation, so a subject who is missing at some time points but observed at others still contributes the observations that are available. Only a subject with no complete observation at all is dropped entirely. An example:

                347 children were measured on the variable 'y' at ages 7y to 14y.
                 tab time sex  //time=Visit, and 'sex'=Gender
                    Age by |        Gender
                     visit |      Boys      Girls |     Total
                         7 |       174        173 |       347
                         8 |       174        173 |       347
                         9 |       174        173 |       347
                        10 |       174        173 |       347
                        11 |       174        173 |       347
                        12 |       174        173 |       347
                        13 |       174        173 |       347
                        14 |       174        173 |       347
                     Total |     1,392      1,384 |     2,776
                Let's check the distribution of the outcome variable 'y'
                 tab time sex , su(y) nosta
                      Means and Frequencies of y
                    Age by |       Gender
                     visit |      Boys      Girls |     Total
                         7 | 1152.4878  1103.9134 | 1131.9371
                           |       135         99 |       234
                         8 | 1217.1679  1173.7002 | 1197.9379
                           |       121         96 |       217
                         9 | 1376.9147  1294.9279 | 1336.2841
                           |       114        112 |       226
                        10 | 1488.1839  1431.8643 | 1461.1405
                           |       118        109 |       227
                        11 | 1423.1196  1368.1086 | 1396.1754
                           |       125        120 |       245
                        12 | 1593.9479  1548.2982 | 1573.6591
                           |       115         92 |       207
                        13 | 1278.7726  1220.0544 |   1250.09
                           |       111        106 |       217
                        14 | 1351.6009  1245.4695 | 1299.7145
                           |       115        110 |       225
                     Total | 1356.4035   1298.592 | 1329.2662
                           |       954        844 |      1798
                // None of the visits had full information. The largest number of observations for boys is at age 7y (n=135), and for girls at age 11y (n=120).
                // The highest total was at 11y (n=245). But a subject who is missing at one time point may be observed at
                // other time points, and is then still included in the estimation.
                //Running xtreg
                 xtreg y , i(id) re
                Random-effects GLS regression                   Number of obs     =      1,798
                Group variable: id                              Number of groups  =        324
                R-sq:                                           Obs per group:
                     within  = 0.0000                                         min =          1
                     between = 0.0000                                         avg =        5.5
                     overall = 0.0000                                         max =          8
                                                                Wald chi2(0)      =          .
                corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =          .
                           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                       _cons |   1323.056   10.03758   131.81   0.000     1303.383     1342.73
                     sigma_u |  151.77777
                     sigma_e |  219.38792
                         rho |  .32369375   (fraction of variance due to u_i)
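                //xtreg uses all 1,798 non-missing observations of y. 324 of the 347 subjects are retained, which is more than were observed at any single time point (at most 245); the remaining 23 subjects have no non-missing value of y at any visit.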
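
                A rough sketch of how to check, after any xtreg run, which rows and which panels were actually used. It again assumes the nlswork example dataset (fetched by webuse) rather than the children's data above; insample, used_panel and first_row are names made up for the illustration.

                ```stata
                * Illustrative only: nlswork stands in for the data above
                webuse nlswork, clear
                xtset idcode year
                xtreg ln_wage union tenure, re

                * Rows used in the estimation: these two counts should agree,
                * because xtreg drops any row with a missing model variable
                count if e(sample)
                count if !missing(ln_wage, union, tenure)

                * Panels used: those with at least one complete row
                generate byte insample = e(sample)
                egen byte used_panel = max(insample), by(idcode)
                egen byte first_row = tag(idcode)
                count if first_row & used_panel

                * Compare with the number of groups reported by xtreg
                display e(N_g)
                ```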



                • #9
                  Many thanks for your comment, Roman. I have two questions regarding your information.
                  - How would this retention apply to my data? What would Stata do to keep a case? Would it fill in the missing values? Do you know what criterion it uses to decide that a case has too much missing data and should be dropped?
                  - I am using "Country" as the panel variable and 1994-2016 as the time variable. As there are many missing values, the panel is unbalanced, as shown by the number of groups being smaller than in the original set. But after setting the panel variable, Stata reports "strongly balanced". So, what does "strongly balanced" refer to?



                  • #10
                    - as a first step (and following Clyde's wise advice), I would investigate whether the mechanism underlying the missingness is informative (i.e., requiring some action such as -ipolate- or -mi-) or not.
                    - your panel is strongly balanced because every id has a row for every year, even if some variables in those rows are missing.
                    See for instance the following toy-example, where that feature does not hold true:
                    . use "", clear
                    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
                    . xtset idcode year
                           panel variable:  idcode (unbalanced)
                            time variable:  year, 68 to 88, but with gaps
                                    delta:  1 unit
                    . count if idcode==1
                    . count if idcode==22
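
                    The same point can be sketched with a small made-up panel (three ids, four years; all names and values here are invented for the illustration). A missing value inside a row should leave the panel strongly balanced, whereas dropping that row should make -xtset- report an unbalanced panel with gaps.

                    ```stata
                    * Build a toy panel: 3 ids observed in 2001-2004
                    clear
                    set obs 3
                    generate id = _n
                    expand 4
                    bysort id: generate year = 2000 + _n
                    generate x = runiform()

                    * One missing value, but the row itself is still present
                    replace x = . if id == 2 & year == 2002
                    xtset id year    // every id has a row for every year

                    * Remove the row with the missing value
                    drop if missing(x)
                    xtset id year    // id 2 now has fewer rows than the others
                    ```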
                    Kind regards,
                    (StataNow 18.5)


                    • #11
                      Many thanks for your answer Carlo,

                      Excuse me for coming back to this topic again. I am writing up the methodology used in my work and need to be sure about these things.
                      Regarding your advice on whether the mechanism underlying the missingness is informative or not, I just wanted to tell you that the missing data appear as empty cells in my database, as found in the original data sources. So no action is needed from my side.
                      As to the other issue regarding the panel being balanced, if I understood you correctly, only if I deleted some of the years would Stata tell me that it is unbalanced. So currently, as I have all the years in the panel (no matter that some data are missing), it is recognized as balanced. Correct?

                      Many thanks,


                      • #12
                        not quite.
                        As far as the mechanism underlying the missingness of your data is concerned (missing completely at random; missing at random; missing not at random), you may want to take a look at If it's informative (or not ignorable, basically missing not at random), you may want to consider -ipolate- or -mi- (which could also be considered when the data are missing at random, just to have a more efficient dataset).
                        The fact that the missingness appears in the form of empty cells (for string variables) or -.- (for numerical variables) is simply a consequence of the way Stata treats missing values by default, but has nothing to do with the abovementioned missingness mechanism.
                        You do not need to delete years with one or more missing values in your panel dataset, because Stata can handle both unbalanced and balanced panel datasets.
                        Moreover (and more substantively), by deleting some years you would end up with a panel dataset that might well be very different from the original one, and biased if the missingness is informative.
                        Kind regards,
                        (StataNow 18.5)