Hi everyone,
I am using Stata/SE 13.0 for Windows.
I have panel data formatted like so:
ID year outcome
11 2011 1
11 2012 1
11 2013 0
12 2011 .
12 2012 0
12 2013 0
13 2011 1
etc.
That is, I have a binary outcome measure that contains missing values. The measurements are taken each year for each subject. I want to understand the patterns of missing data (e.g. how many people are missing values in all years, in no years, in exactly one year, in any two years, etc.). However, I am having some trouble with the Stata commands for doing so.
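To make the target summary concrete, here is a minimal sketch of a per-person count of missing years. The names n_miss and one_per_id are hypothetical, and id is assumed to be the panel identifier. It only counts years that actually appear as rows in the data, so a person with no row at all for a given year would not be counted as missing that year.

```stata
* Sketch only: n_miss and one_per_id are hypothetical names; id is assumed
* to be the panel identifier.
* Count, within each person, the rows where outcome is missing.
bysort id: egen n_miss = total(missing(outcome))
* Flag one row per person so each individual is counted once.
egen one_per_id = tag(id)
* Distribution of the number of missing years across individuals.
tabulate n_miss if one_per_id
```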
So, if I use xtset and xtdescribe:
xtset id yr
xtdescribe if missing(outcome)
It shows me 7 different patterns of missing data, with a total 'N' of 2777.
However, this doesn't match up with what I find using other commands:
count if missing(outcome) ----> this gives me the number 3603
count if missing(outcome) & year==2011 ----> this gives me the number 1707. However, using the output from the xtdescribe statement, I only find 1070 individuals that are missing outcome in 2011.
I also tried:
xtdescribe if outcome< .
This option was recommended by the documentation here: http://www.stata-press.com/books/mlmus3_ch10.pdf . First of all, I don't even understand how this command works, because if I use:
count if outcome <.
The output shows me the number of non-missing values for outcome, not missing ones. However, plugging outcome < . into the xtdescribe command gives me output similar to what I found with xtdescribe if missing(outcome).
It shows me 7 different patterns of missing data, but now with a total 'N' of 3939.
I also created an indicator variable that equals 1 when outcome is missing, and looked at it using the tabulate command:
tab year miss_outcome
Going this route, I have a total of 3583 missing values all of a sudden.
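For reference, a sketch of how such an indicator could be built and tabulated. The generate line is an assumption about how miss_outcome was created. By default, a two-way tabulate leaves out observations where either variable is missing; the missing option is added here so that rows with a missing year are shown as well, which makes the total easier to compare with count.

```stata
* miss_outcome is the indicator named above; how it was built is assumed.
generate byte miss_outcome = missing(outcome)
* The missing option keeps rows where year itself is missing.
tabulate year miss_outcome, missing
```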
Why do all of these options give me such dramatically different ideas of what the missingness patterns are in my data? Which of these methods should I trust? What is going on behind the scenes that could make all of these options so different?
As an extension to that question, I am also interested in analyzing the RESPONSE patterns in the outcome variable, not just the missingness patterns. Is it appropriate to use a command along the lines of:
xtdescribe if outcome==1
to do so?
EDIT:
As an addendum, yes, there are missing values in both the ID and year variables in the dataset. However, the number of missing values I find in those variables does not match any of the differences in the numbers reported above. While there are 1,123 missing values of "year", only 20 of those coincide with the individual also missing an outcome value. So while I expect that missingness to really affect my ability to find the response patterns, I don't see how it would really interfere with the missingness patterns.