
  • Finding patterns from listwise deletion

    Hello everyone,

    I recently ran a few regressions on General Social Survey data, and Stata automatically applied listwise deletion in my analyses. This was fine, but some analyses had far fewer respondents after the listwise deletion. The number of values carrying the iap ("inapplicable") missing data label was rather high for a few of these.

    So I am hoping to double-check that the missing data do not follow any significant patterns. For example, I don't want the missing data to have any patterns corresponding to age, race, or gender. To do this, I created a binary variable that grouped all the data into missing or non-missing and checked whether there were any significant results with age, race, and gender. However, I have two questions about this.

    1) Is this an acceptable way to find patterns in the missing data?

    2) When I tried to lump the "iap" data, I got an error message from Stata. It cannot recode the iap data, so I am now unsure how to analyze it. (For other missing data labeled ".b", this was not a problem.)

    Thank you!


    . recode nataid (1=1) (2=1) (3=1) (iap=2), generate (nataidmiss)
    ERROR: unknown el iap in rule

  • #2
    Is this an acceptable way to find patterns in the missing data?
    It depends on the details of how you did it. If by "significant" results you mean p < 0.05, then that is the wrong approach. The bias that missing data can inflict on analyses has nothing to do with whether the associations with observed variables are statistically significant. It has to do with whether the association is large, as gauged by a measure of association such as an odds ratio, a mean difference, or a probability difference. In addition, you may find that missingness is balanced separately on categories of race, gender, and age group, but imbalanced on some combinations thereof. (Of course, if you combine several of these variables, the number of observations you are looking at may be too small for meaningful comparison, so you need some discretion as to where to stop.) Finally, the most important problem with missing data is the one that can never be answered within the data itself: is missingness actually associated with the true values that would have been observed were the data not missing? Sometimes factors external to the data can partially answer that question.
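
    As a rough sketch of looking at the size of the association rather than its p-value, something along these lines could be used. The names sex, race, and age are assumed to be the variable names in your GSS extract, and miss_nataid is a hypothetical name for the missingness indicator; adjust both to your data.

    ```stata
    * Sketch only: sex, race, and age are assumed variable names; miss_nataid is hypothetical.
    * 1 = nataid is missing (any missing code), 0 = observed
    generate byte miss_nataid = missing(nataid)

    * Proportion missing within each group: compare the row percentages
    tabulate sex miss_nataid, row
    tabulate race miss_nataid, row

    * Proportion missing (the cell mean of a 0/1 variable) for sex-by-race combinations
    tabulate sex race, summarize(miss_nataid) nostandard

    * Odds ratios for missingness: judge the size of the estimates, not the p-values
    logistic miss_nataid i.sex i.race age
    ```

    The two-way table also reports cell frequencies, which helps in deciding when a combination is too sparse to compare.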

    All of that said, you indicate that much of this missing data is coded as "inapplicable." If "inapplicable" refers to things like responses to pregnancy questions in males, or "how long have you had your current job" when the person has said he/she is unemployed, then you do not need to concern yourself at all with the distribution of these "iap" responses in your data. If they truly use this code only to indicate that the question has no answer due to other circumstances, then there is no possibility that the missing data will bias your analyses. Those data would, in fact, qualify for the glorious category of missing completely at random, and a complete-case analysis would be entirely appropriate. Of course, the missingness of those responses decreases your sample size, which may create power concerns, but that is a separate issue and one that does not depend on how the "iap" responses are distributed across different subcategories.

    Code:
    . recode nataid (1=1) (2=1) (3=1) (iap=2), generate (nataidmiss)
    ERROR: unknown el iap in rule
    This is not a possible Stata command. In fact, you cannot possibly have a Stata variable that takes on the values 1, 2, 3, and iap. Only a string variable could have "iap" as a value, and only a numeric variable could have 1, 2, and 3 as values. If you think you have such a variable, then what you probably actually have is a numeric variable with a value label attached to it, and the value label encodes one of the extended missing values as "iap". If that is the case, you need to look at that value label, identify what the actual missing value is (one of .a through .z), and replace iap with that value in your -recode- command.
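
    A minimal sketch of that lookup and the corrected command follows. The .i in the last line is only an assumption for illustration; use whichever extended missing value the label listing actually shows for "iap" in your data.

    ```stata
    * Which value label is attached to nataid?
    describe nataid

    * List the label's mapping from values to text (label name read from the variable)
    local lbl : value label nataid
    label list `lbl'

    * Tabulate the underlying values, including missing codes, without labels
    tabulate nataid, missing nolabel

    * Assumption: the listing shows that "iap" labels .i
    recode nataid (1 2 3 = 1) (.i = 2), generate(nataidmiss)
    ```

    Values not mentioned in any rule, such as other missing codes like .b, are left unchanged in the new variable.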

    Last edited by Clyde Schechter; 07 Dec 2016, 14:05. Reason: Correct typos.



    • #3
      If values of 1, 2, 3 map to 1 and everything else to 2, then


      Code:
      gen nataidmiss = cond(inlist(nataid, 1, 2, 3), 1, 2)
      I always prefer (0, 1) indicators:

      Code:
      gen nataidmiss = !inlist(nataid, 1, 2, 3)
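
      Either indicator flags every value other than 1, 2, and 3, so all missing codes are lumped together, not just "iap". A quick check, assuming nataidmiss was created by one of the commands above:

      ```stata
      * Cross-tabulate the original values (missing codes included) against the indicator
      tabulate nataid nataidmiss, missing
      ```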
