When should missing data, in numerical variables, be replaced by zeros?

Maria Ruiz

Join Date: May 2016

Posts: 5
#1

19 May 2016, 16:08

Hi all,

I am trying to estimate the effect of Official Aid on growth in Low Income countries between 1990 and 2014 using panel data. I'm using a linear regression model in which I have included some control variables, such as inflation and population growth. I find that there are many missing values in the Official Aid variable. Would this case be considered MAR?
Should I just leave them, or would it be better to replace them with zeros?

How are missing data treated by the xtreg and xtabond commands? I'm using both.

Many thanks,
Maria
Emad Shehata

Join Date: Oct 2014

Posts: 203
#2

19 May 2016, 16:13

In general, observations with missing data are dropped from the estimation.

Emad A. Shehata
Professor (PhD Economics)
Agricultural Research Center - Agricultural Economics Research Institute - Egypt
IDEAS: http://ideas.repec.org/f/psh494.html
EconPapers: http://econpapers.repec.org/RAS/psh494.htm
Google Scholar: http://scholar.google.com/citations?...r=cOXvc94AAAAJ
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#3

19 May 2016, 16:20

You should only replace missing values by zero if you have good reason to believe that the actual values, were they known, would be zero. In any other circumstance it's inappropriate.
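To illustrate why zero-filling is risky, here is a minimal sketch using Stata's shipped auto dataset. It is not aid data: rep78 is simply a numeric variable with a few missing values, and rep78_zero is a name made up for this illustration. The sketch compares the summary statistics before and after the missing values are recoded to zero.

```stata
* Illustration only: auto.dta stands in for the real data
sysuse auto, clear

* rep78 has some missing values; summarize uses only the nonmissing ones
summarize rep78

* Recode missing to zero in a copy (rep78_zero is a hypothetical name)
generate rep78_zero = cond(missing(rep78), 0, rep78)

* The zeros are now treated as real measurements and pull the mean down
summarize rep78_zero
```

Unless the unobserved values really were zero, any model fitted to the recoded variable inherits that distortion.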

As for MAR, you must know the process that led to the missingness of those observations. In the context you are describing, it would surprise me if the missing values were missing at random. It seems to me, as a naive layperson, that your missing values are far more likely to represent situations where the data were withheld by either the donors or the recipients for political reasons, perhaps to hide corruption, or for situations where the accounting systems are so inaccurate that the curators of the data chose not to use what was reported to them. In such situations, the missingness is likely to be related to the actual values (were they known), even after adjusting for everything observed. (Of course, I could be quite wrong about that.) That would be the very antithesis of missing at random.

Read http://www.statisticalhorizons.com/w...ap-Allison.pdf for a very useful discussion of missing data.
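As a rough sketch of how one might start looking at the missingness before deciding anything, the commands below use the nlswork example dataset as a stand-in for the aid panel (the choice of variables and the name miss_union are assumptions for illustration). They describe how much is missing and whether missingness is associated with observed values. Note that checks like these can only reveal a relationship with what was observed; they cannot rule out that missingness depends on the unobserved values themselves.

```stata
* Sketch only: nlswork is a Stata example dataset, standing in for the aid panel
webuse nlswork, clear

* How many values are missing in each variable?
misstable summarize ln_wage union tenure age

* Which variables tend to be missing together?
misstable patterns ln_wage union tenure age

* Flag missingness in one variable (miss_union is a hypothetical name)
generate byte miss_union = missing(union)

* Is missingness concentrated in particular years?
tabulate year miss_union, row

* Do observed wages differ between rows with and without union information?
ttest ln_wage, by(miss_union)
```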
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17856
#4

19 May 2016, 22:50

Maria:
welcome to the list.
The following reference (http://www.sagepub.in/textbooks/Book9419) can give you some hints about diagnosing the missing-data mechanism and dealing with missing values.
I share Clyde's comments; I'm not aware of any situation where missing values should be automatically replaced by zero.
As an aside, since you have panel data, why run a one-wave regression instead of a panel data regression?

Last edited by Carlo Lazzaro; 19 May 2016, 22:52.

Kind regards,
Carlo
(Stata 19.0)
Maria Ruiz

Join Date: May 2016

Posts: 5
#5

20 May 2016, 16:19

Many thanks for your quick answers.

As to your advice, Carlo, about using a panel data regression: I have been running that too. To be more precise, I have been testing the xtreg and xtabond commands. When running those commands, does Stata remove/omit the missing values?

Kind Regards,
Maria
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17856
#6

21 May 2016, 04:50

Maria.
yes, it does.
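A minimal way to check this for oneself, sketched here with the nlswork example dataset rather than the aid data (the model specification is an assumption chosen only because these variables contain missing values): after estimation, e(sample) marks the observations that were actually used, so the excluded rows can be counted and inspected. With xtabond, further observations are typically lost because the model uses lags and differences.

```stata
* Sketch only: nlswork stands in for the real panel
webuse nlswork, clear
xtset idcode year

* Rows with a missing value in any model variable are excluded from the fit
xtreg ln_wage union tenure age, re

* e(sample) equals 1 for the observations used in the estimation
count if e(sample)
count if !e(sample)

* Among the excluded rows, which variables were missing?
misstable summarize ln_wage union tenure age if !e(sample)
```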

Kind regards,
Carlo
(Stata 19.0)
Maria Ruiz

Join Date: May 2016

Posts: 5
#7

23 May 2016, 08:05

Many thanks again for your help.

Roman Mostazir

Join Date: Apr 2014

Posts: 877
#8

23 May 2016, 09:38

Originally posted by Maria Ruiz

Many thanks for your quick answers.

When running those commands, does Stata remove/omit the missing values?

Kind Regards,
Maria

That depends on what is missing. xtreg drops individual observations (rows), not whole cases: any observation with a missing value in the outcome or in a covariate is excluded from the estimation, but the case itself stays in the model as long as it has at least one complete observation. So a case that is missing at some time points still contributes the time points at which its data are available, and only a case that is missing at every time point is dropped entirely. Note that xtreg with the re option fits the random-effects model by GLS; maximum likelihood (ML) estimates require the mle option. An example:

Code:

347 children were measured on the 'y' variable from age 7y to 14y.

 tab time sex  //time=Visit, and 'sex'=Gender

    Age by |        Gender
     visit |      Boys      Girls |     Total
-----------+----------------------+----------
         7 |       174        173 |       347
         8 |       174        173 |       347
         9 |       174        173 |       347
        10 |       174        173 |       347
        11 |       174        173 |       347
        12 |       174        173 |       347
        13 |       174        173 |       347
        14 |       174        173 |       347
-----------+----------------------+----------
     Total |     1,392      1,384 |     2,776

Let's check the distribution of the outcome variable 'y'

 tab time sex , su(y) nosta

      Means and Frequencies of y
    Age by |       Gender
     visit |      Boys      Girls |     Total
-----------+----------------------+----------
         7 | 1152.4878  1103.9134 | 1131.9371
           |       135         99 |       234
-----------+----------------------+----------
         8 | 1217.1679  1173.7002 | 1197.9379
           |       121         96 |       217
-----------+----------------------+----------
         9 | 1376.9147  1294.9279 | 1336.2841
           |       114        112 |       226
-----------+----------------------+----------
        10 | 1488.1839  1431.8643 | 1461.1405
           |       118        109 |       227
-----------+----------------------+----------
        11 | 1423.1196  1368.1086 | 1396.1754
           |       125        120 |       245
-----------+----------------------+----------
        12 | 1593.9479  1548.2982 | 1573.6591
           |       115         92 |       207
-----------+----------------------+----------
        13 | 1278.7726  1220.0544 |   1250.09
           |       111        106 |       217
-----------+----------------------+----------
        14 | 1351.6009  1245.4695 | 1299.7145
           |       115        110 |       225
-----------+----------------------+----------
     Total | 1356.4035   1298.592 | 1329.2662
           |       954        844 |      1798

***********************************************************
// None of the visits had full information. Highest information available for boys is at age 7y (n=135), and for girls at age 11y (n=120).
// The highest total number was available at 11y (245). But a subject who is missing at one point may be available at
// other points, and those observations keep that subject in the estimation.
**********************************************************
// Running xtreg

 xtreg y , i(id) re

Random-effects GLS regression                   Number of obs     =      1,798
Group variable: id                              Number of groups  =        324

R-sq:                                           Obs per group:
     within  = 0.0000                                         min =          1
     between = 0.0000                                         avg =        5.5
     overall = 0.0000                                         max =          8

                                                Wald chi2(0)      =          .
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =          .

------------------------------------------------------------------------------
          ee |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   1323.056   10.03758   131.81   0.000     1303.383     1342.73
-------------+----------------------------------------------------------------
     sigma_u |  151.77777
     sigma_e |  219.38792
         rho |  .32369375   (fraction of variance due to u_i)
------------------------------------------------------------------------------

// The model uses 1,798 observations (every nonmissing 'y') from 324 of the 347 children, which is more children than were observed at any single time point.

Roman
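The same bookkeeping can be reproduced on a public dataset. The sketch below uses nlswork with hours as the outcome, an arbitrary choice made only because it has some missing values; n_hours and first are hypothetical helper names. It counts the rows with a nonmissing outcome and the panels with at least one such row, then compares them with what xtreg reports.

```stata
* Sketch only: nlswork and the outcome 'hours' are stand-ins for illustration
webuse nlswork, clear
xtset idcode year

* Rows with a nonmissing outcome
count if !missing(hours)

* Panels with at least one nonmissing outcome
egen n_hours = count(hours), by(idcode)
egen byte first = tag(idcode)
count if first & n_hours > 0

* Intercept-only random-effects model, as in the example above
xtreg hours, re

* Observations and groups used by the model, for comparison with the counts
display e(N)
display e(N_g)
```

If the counts match the stored results, that confirms rows are dropped one at a time and a panel disappears only when none of its rows is usable.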

Maria Ruiz

Join Date: May 2016

Posts: 5
#9

02 Jun 2016, 16:42

Many thanks for your comment, Roman. I have two questions regarding your info.
- How would this retention apply to my data? What does Stata do to keep a case? Does it fill in the missing values? Do you know what criterion it uses to decide that a case has too few data points and should be dropped?
- I am using "Country" as the panel variable and 1994-2016 as the time variable. As there are many missing values, the panel is unbalanced, and this shows in the number of groups reported, which is below that of the original set. But after the panel variable I am getting "strongly balanced". So, what does "strongly balanced" refer to?

Maria
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17856
#10

03 Jun 2016, 00:08

Maria:
- as a first step (and following Clyde's wise advice), I would investigate whether the mechanism underlying the missingness is informative (i.e., requiring some action, like -ipolate- or -mi-) or not.
- your panel is strongly balanced because all ids are reported for all the years.
See, for instance, the following toy example, where that feature does not hold true:

Code:

. use "http://www.stata-press.com/data/r14/nlswork.dta", clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)

. xtset idcode year
       panel variable:  idcode (unbalanced)
        time variable:  year, 68 to 88, but with gaps
                delta:  1 unit

. count if idcode==1
  12

. count if idcode==22
  13
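The distinction between missing rows and missing values can be sketched with the grunfeld example dataset (an assumption chosen for illustration; it is a complete panel of 10 companies over 20 years, and the pattern of blanked-out values below is made up). Setting values to missing leaves every company-year row in place, so xtset still describes the panel as strongly balanced even though the estimation uses fewer observations. Only deleting the rows changes what xtset reports.

```stata
* Sketch only: grunfeld is a complete example panel, used as a stand-in
webuse grunfeld, clear
xtset company year

* Blank out some values but keep the rows (hypothetical missingness pattern)
replace invest = . if company <= 3 & year >= 1950

* Every company still has a row for every year
xtset company year

* The estimation nevertheless uses fewer observations than there are rows
xtreg invest mvalue kstock, fe
count if !e(sample)

* Deleting the incomplete rows is what makes the panel unbalanced
drop if missing(invest)
xtset company year
```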

Kind regards,
Carlo
(Stata 19.0)
Maria Ruiz

Join Date: May 2016

Posts: 5
#11

11 Jul 2016, 01:53

Many thanks for your answer, Carlo.

Excuse me for coming up again with this topic. I am writing up the methodology used in my work and need to be sure about these things.
Regarding your advice on whether the mechanism underlying the missingness is informative or not, I can just tell you that the missing data appear as empty cells in my database, as found in the original data sources. So no action is needed from my side.
As to the other issue, regarding the panel being balanced: if I understood you correctly, only if I deleted some of the years would Stata report that it is unbalanced. So currently, as I have all the years in the panel (no matter that some data are missing), it is recognized as balanced. Correct?

Many thanks,
Maria
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17856
#12

11 Jul 2016, 02:38

Maria:
not quite.
As far as the mechanism underlying the missingness of your data is concerned (missing completely at random; missing at random; missing not at random), you may want to take a look at http://www.sagepub.in/textbooks/Book9419. If it's informative (or not ignorable, basically missing not at random), you may want to consider -ipolate- or -mi- (which can also be considered when the data are missing at random, just to have a more efficient dataset).
The fact that the missingness appears in the form of empty cells (for string variables) or -.- (for numerical variables) is simply a consequence of the way Stata displays missing values by default, and has nothing to do with the abovementioned missingness mechanism.
You do not need to delete years with one or more missing values in your panel dataset, because Stata can handle both unbalanced and balanced panel datasets.
Moreover (and more substantively), by deleting some years you would end up with a panel dataset which might well be very different from the original one, and biased if the missingness is informative.
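For reference, a minimal sketch of what -ipolate- does within a panel, again using the nlswork example dataset with tenure as an arbitrary variable that has some missing values (tenure_ip is a hypothetical name). It fills gaps by linear interpolation between the observed values of the same unit. This is a mechanical fill-in that says nothing about why the values are missing, so it does not replace the diagnosis of the missingness mechanism discussed above.

```stata
* Sketch only: nlswork and 'tenure' are stand-ins for illustration
webuse nlswork, clear

* Linearly interpolate tenure over year, separately within each panel
bysort idcode (year): ipolate tenure year, generate(tenure_ip)

* Compare how many values are missing before and after
count if missing(tenure)
count if missing(tenure_ip)
```

Values before the first or after the last observed point of a panel stay missing unless the epolate option is added.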

Kind regards,
Carlo
(Stata 19.0)