Missing data in variable

Vincent Haan

Join Date: Aug 2018

Posts: 7
#1

03 Aug 2018, 15:16

Hello,

I have a panel dataset with circa 5,000 observations. However, 3 of the 8 independent variables have some missing values. Each of these variables is missing around 100 to 200 observations, and the gaps are sparsely distributed across the entire dataset (many non-missing values before and after the missing data points). I'm reading up on the preferable way of handling these. Linear interpolation seems an option, but I'm having difficulty seeing what other variables the independent variables are a function of; multiple imputation seems somewhat overboard for this small amount of missing data; and mean replacement seems to be commonly advised against. Do you have any suggestions for tackling this problem?

To be clear, this is regarding an economic dataset, where volatilities of stocks are studied using independent variables in the conditional variance of a GARCH model.

Thank you for your time and any help you can offer.
Philip Gigliotti

Join Date: Nov 2016

Posts: 118
#2

03 Aug 2018, 15:29

Stata automatically drops observations with missing values from the estimation sample. It seems that even after excluding these observations you will still have a large enough sample. There are some considerations about the missingness mechanism (missing completely at random, missing at random, missing not at random, etc.), but this doesn't seem like it will be too much of a problem.

If you want to keep the observations, I usually just use Stata's impute command if they are simply control variables. I have never gotten pushback from referees about this. It's more of a problem if you're imputing your main variable of interest or performing further analysis or variable construction with the imputed variable.
Vincent Haan

Join Date: Aug 2018

Posts: 7
#3

04 Aug 2018, 00:57

With Stata's impute command, do you mean univariate imputation?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17854
#4

04 Aug 2018, 02:32

Vincent:
Stata can handle both balanced and unbalanced panel datasets with no problem (hence, this is not an issue to worry about).
As Phil said, missing data should be investigated in terms of both their mechanism (missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR)) and their pattern (univariate, monotone, general). Pretty often MNAR data give some headaches, whereas MAR data can be dealt with via -mi-, which can perform both univariate imputation (rarely the relevant case) and multivariate multiple imputation (see https://www.crcpress.com/Flexible-Im.../9781138588318 for further details).
As you wrote, -ipolate- can be another option, whereas imputing the mean of the observed data shrinks the variance (and so, in my opinion, should never be mentioned as an approach for dealing with missing data).
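A minimal sketch of how the pattern check and the interpolation option could look in practice. The variable names x1, x2, x3, id, and date are hypothetical placeholders, and the data are assumed to have one row per panel unit and date:

```stata
* Summarize how many values are missing in each variable
misstable summarize x1 x2 x3
* Show which combinations of variables are missing together
misstable patterns x1 x2 x3
* Linear interpolation of x1 over time, separately within each panel unit
by id (date), sort: ipolate x1 date, generate(x1_ip)
```

Note that -ipolate- fills gaps only between observed values of x1; it does not extrapolate beyond the first or last observed point unless the epolate option is added.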

Kind regards,
Carlo
(Stata 19.0)
Vincent Haan

Join Date: Aug 2018

Posts: 7
#5

04 Aug 2018, 17:16

OK, so now I've used impute for my variables. However, when I want to use my 'complete' variables, I have to use, for example,

Code:

mi estimate, cmdok: arch log_return, arch(1/1) garch(1/1) het(a_variable)

where a_variable is a variable with some missing data points. Is it correct that this model now uses the 'complete' variables, whereas if I used

Code:

arch log_return, arch(1/1) garch(1/1) het(a_variable)

the model would drop the missing observations in its estimation?
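For context, a minimal sketch of what a multiple-imputation workflow around the first command might look like. The time variable date, the use of log_return as the only predictor in the imputation model, the number of imputations, and the seed are all assumptions made for illustration, and whether -arch- behaves sensibly under -mi estimate, cmdok- should be checked for the model at hand:

```stata
* date is a hypothetical time variable; mi data are kept in wide style
mi set wide
mi tsset date
* Declare the variable whose missing values will be imputed
mi register imputed a_variable
* Create 5 imputed datasets with a linear regression imputation model
* (assumes log_return itself has no missing values)
mi impute regress a_variable log_return, add(5) rseed(12345)
* Fit the model on each imputed dataset and pool the results
mi estimate, cmdok: arch log_return, arch(1/1) garch(1/1) het(a_variable)
```

Running -arch- without the -mi estimate- prefix would instead use only the original data, so observations where a_variable is missing would not enter the estimation.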
Vincent Haan

Join Date: Aug 2018

Posts: 7
#6

06 Aug 2018, 07:40

Sorry, one more question. If I try to first-difference my data, the gaps in my data logically increase. Is impute also the way to go here?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17854
#7

06 Aug 2018, 11:54

Vincent:
I'm not sure I got you in #5.
Anyway, I'll try to give some tentative replies:
- complete case analysis refers to observations with no missing values in any variable (https://www.wiley.com/en-al/Statisti...-9780471183860). In your example, I guess you mean completed data (that is, the completed datasets on which you run your time series model): if that is the case, your code seems OK;
- if you do not perform any imputation procedure, Stata will apply listwise deletion to all observations with at least one missing value in any variable used in the model;
- as far as your # is concerned, I would still consider -mi-.

Kind regards,
Carlo
(Stata 19.0)
Vincent Haan

Join Date: Aug 2018

Posts: 7
#8

06 Aug 2018, 14:15

Hi Carlo,

My fault, what I wrote was quite unclear.
With 'complete' I did indeed mean my data where the missing values are imputed. You still answered my question, so thank you for that!
As for #6, I meant that when changing my variable to first differences, any missing value immediately becomes 2 missing values. My question was whether -mi impute- was still the way to go. Again, you answered my question, so thank you for all your help.
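The doubling of gaps under first-differencing can be checked directly. A small sketch, assuming the data are already tsset (or xtset) and using d_variable as a hypothetical name for the differenced series:

```stata
* First difference: missing at t whenever a_variable is missing at t or t-1
generate d_variable = D.a_variable
* Compare the number of missing values before and after differencing
count if missing(a_variable)
count if missing(d_variable)
```

An isolated missing value at time t makes both the difference at t and the difference at t+1 missing, and the first observation of each series is always missing after differencing.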