  • Understanding missing data patterns

    Hi everyone,

    I am using Stata/SE 13.0 for Windows.

    I have panel data formatted like so:
    ID year outcome
    11 2011 1
    11 2012 1
    11 2013 0
    12 2011 .
    12 2012 0
    12 2013 0
    13 2011 1
    etc.

    That is, I have a binary outcome measure that contains missing values. The measurements are taken each year for each subject. I am looking to understand the patterns of missing data (e.g. how many people are missing values at all years, at no years, at just one year, at any two years, etc.). However, I am having some trouble with the Stata commands for doing so.

    So, if I use xtset and xtdescribe:
    xtset id year
    xtdescribe if missing(outcome)

    It shows me 7 different patterns of missing data, with a total 'N' of 2777.

    However, this doesn't match up with what I find using other commands:
    count if missing(outcome) ----> this gives me the number 3603
    count if missing(outcome) & year==2011 ----> this gives me the number 1707. However, using the output from the xtdescribe statement, I only find 1070 individuals that are missing outcome in 2011.

    I also tried:
    xtdescribe if outcome< .
    This option was recommended by the documentation here: http://www.stata-press.com/books/mlmus3_ch10.pdf . First of all, I don't even understand how this command works, because if I use:
    count if outcome <.
    The output shows me the number of non-missing values for outcome, not missing ones. However, plugging outcome < . into the xtdescribe command gives me output similar to what I found with xtdescribe if missing(outcome).

    It shows me 7 different patterns of missing data, but now with a total 'N' of 3939.

    I also created an indicator variable, that equals 1 when outcome is missing, and looked at it using the tabulate command:
    tab year miss_outcome

    Going this route, I have a total of 3583 missing values all of a sudden.

    Why do all of these options give me such dramatically different ideas of what the missingness patterns are in my data? Which of these methods should I trust? What is going on behind the scenes that could make all of these options so different?

    As an extension to that question, I am also interested in analyzing the RESPONSE patterns in the outcome variable, not just the missingness patterns. Is it appropriate to use a command along the lines of:

    xtdescribe if outcome==1

    to do so?
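
    One possible way to tabulate full response patterns, sketched on made-up data: xtdescribe if outcome==1 would mark only the years with a 1 and would not separate a 0 from a missing value, so the sketch below builds one string per subject instead. The variable names code, pattern, and first are illustrative, and the approach assumes every subject has exactly one row per year with nonmissing id and year.

    ```stata
    * Toy long-format panel (made-up values, not the real data)
    clear
    input id year outcome
    11 2011 1
    11 2012 1
    11 2013 0
    12 2011 .
    12 2012 0
    12 2013 0
    13 2011 1
    13 2012 .
    13 2013 .
    end

    * One character per row: "1", "0", or "." for a missing outcome
    generate str1 code = cond(missing(outcome), ".", string(outcome))

    * Concatenate the characters in year order within each subject
    bysort id (year): generate pattern = code if _n == 1
    by id: replace pattern = pattern[_n-1] + code if _n > 1
    by id: replace pattern = pattern[_N]

    * Tabulate one row per subject, so the frequencies count subjects
    egen byte first = tag(id)
    tabulate pattern if first
    ```

    On the toy data this gives the patterns 110, .00, and 1.., each for one subject. Subjects with skipped years would produce shorter strings, so positions would no longer line up with calendar years.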


    EDIT:

    As an addendum, yes, there are missing values in both the ID and year variables in the dataset. However, the number of missing values I find looking at those variables does not match up to any of the differences in the numbers reported above. While there are 1,123 missing values of "year", only 20 of those match up with an instance of the individual also missing an outcome value. So while I expect that missingness to really affect my ability to find the response patterns, I don't see how it would really mess with the missingness patterns.
    Last edited by Ryan Simmons; 17 Feb 2015, 11:13.

  • #2
    first, note that the upgrade to 13.1 is free so you should do it (-h upgrade-); second, take a look at the -misstable- command which is meant for this (if I understand you correctly)


    • #3
      Ryan:
      First of all, I don't even understand how this command works, because if I use:
      Stata considers missing as the highest value; hence:
      Code:
      count if outcome&lt;.
      gives you back all non-missing values.
      Kind regards,
      Carlo
      (Stata 19.0)


      • #4
        Carlo's code got tangled up with some HTML somehow. He meant

        Code:
         
        count if outcome < .
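
        A minimal sketch of why this comparison works, using made-up values: Stata orders the numeric missing values (., .a, ..., .z) above every nonmissing number, so outcome < . is true exactly when outcome is not missing.

        ```stata
        * Toy data (made-up values); .a is an extended missing value
        clear
        input outcome
        1
        0
        .
        .a
        end

        * Both commands count the 2 nonmissing rows
        count if outcome < .
        count if !missing(outcome)

        * The two conditions agree on every row; otherwise assert stops with an error
        assert (outcome < .) == !missing(outcome)

        * By contrast, outcome > 0 is also true for the missing rows,
        * so this counts 3 rows: the single 1 plus both missing values
        count if outcome > 0
        ```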


        • #5
          I think the problem here is that xtdescribe treats the subject as the unit of analysis. Thus N in this context is the number of unique subjects. The count and the tabulate commands ignore the fact that the data have been xtset and simply pay attention to the individual records, so that N is just the number of records in the data set. The different results you get are because you are asking different questions of the data. The misstable command also ignores the fact that the data are xtset and gives you results where N is the number of records in the data file.
          Richard T. Campbell
          Emeritus Professor of Biostatistics and Sociology
          University of Illinois at Chicago
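
          The record-versus-subject distinction can be sketched on a toy panel. The values are made up, and anymiss and first are illustrative variable names rather than anything from the original data.

          ```stata
          * Toy long-format panel: 3 subjects, 3 years each
          clear
          input id year outcome
          11 2011 1
          11 2012 1
          11 2013 0
          12 2011 .
          12 2012 0
          12 2013 0
          13 2011 1
          13 2012 .
          13 2013 .
          end

          * Record level: 3 rows have a missing outcome
          count if missing(outcome)

          * Subject level: 2 distinct ids have at least one missing outcome
          egen byte anymiss = max(missing(outcome)), by(id)
          egen byte first = tag(id)
          count if anymiss & first

          * xtdescribe works at the subject level too: with this if condition
          * its n counts only ids that have at least one missing row
          xtset id year
          xtdescribe if missing(outcome)
          ```

          Subjects with no missing rows drop out of xtdescribe if missing(outcome) entirely, and subjects with no observed rows drop out of xtdescribe if outcome < ., which is one reason the two commands can report different totals.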


          • #6
            Originally posted by Rich Goldstein
            first, note that the upgrade to 13.1 is free so you should do it (-h upgrade-); second, take a look at the -misstable- command which is meant for this (if I understand you correctly)
            As for the upgrade, it's kind of a long story. On my personal copy of Stata, I have upgraded. Unfortunately, for this analysis I have to use the Stata that's installed on a remote desktop (due to privacy and confidentiality concerns, I am not permitted to take anything off of the server, including code), and I don't have the authorization to run the upgrade.

            As for misstable, I had been looking at that command. While it certainly has a lot of useful information, I couldn't quite figure out how to give it exactly the information that is most helpful to me in this case, which is the output from xtdescribe. When looking at the misstable documentation, I couldn't see an easy way of getting that same information out of it without creating a bunch of dummy variables, while xtdescribe by default outputs the data in the exact form I need.
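
            For reference, one way misstable could give subject-level patterns without hand-made dummy variables is to reshape to wide form first. This is only a sketch on made-up data: it assumes id and year are nonmissing and jointly identify each row, which reshape requires, so rows with a missing id or year would need to be handled beforehand.

            ```stata
            * Toy long-format panel (made-up values, not the real data)
            clear
            input id year outcome
            11 2011 1
            11 2012 1
            11 2013 0
            12 2011 .
            12 2012 0
            12 2013 0
            13 2011 1
            13 2012 .
            13 2013 .
            end

            * One row per subject: outcome2011, outcome2012, outcome2013
            reshape wide outcome, i(id) j(year)

            * Missing-value patterns across the yearly variables;
            * freq reports counts of subjects instead of percentages
            misstable patterns outcome2011 outcome2012 outcome2013, freq
            ```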

            Originally posted by Carlo Lazzaro
            Stata considers missing as the highest value; hence: <snipped code>
            gives you back all non-missing values.
            Thanks for the clarification. I overlooked that.

            Originally posted by Dick Campbell
            I think the problem here is that xtdescribe treats the subject as the unit of analysis. Thus N in this context is the number of unique subjects. The count and the tabulate commands ignore the fact that the data have been xtset and simply pay attention to the individual records, so that N is just the number of records in the data set. The different results you get are because you are asking different questions of the data. The misstable command also ignores the fact that the data are xtset and gives you results where N is the number of records in the data file.
            Ah, thank you! That clears things up for me immensely. For some reason I simply wasn't considering the fact that I was looking at different units of analysis there. In my head I was thinking of the data as if it were in wide form, even though it is obviously in long form (as required by xtset anyway). In that case, xtdescribe is exactly what I am looking for, as the subject is the proper unit of analysis for this project.

            Thanks all! You've been very helpful.
