  • Deleting variables with too many missing observations

    I have wide student test data. Each row is an individual student, and each variable/column holds the responses to one question. I want to eliminate any variable that has no more than 5 valid observations (i.e., too many missing values). Suggestions? I'm new to "missing", but might that do the trick?

    EDIT: This is in preparation for IRT. My real problem is that some questions have only 1 or 2 kids answering, which creates insufficient variation and results in error message "v12345 does not vary in the estimation sample." If there's a better way than dropping v12345, I'm all ears.

    Thanks,

    Bryan
    Last edited by Bryan Shelly; 12 Jul 2016, 11:47.

  • #2
    Code:
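    * -varlist _all- is expanded once, before any variable is dropped
    foreach v of varlist _all {
         * count the non-missing values of this variable; result is left in r(N)
         quietly count if !missing(`v')
         * drop the variable if it has 5 or fewer valid observations
         if `r(N)' <= 5 {
              drop `v'
         }
    }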
    Added: that is how to do it. Whether doing this is a good idea seems questionable--think about it. If these are test questions, an item that draws so few non-missing responses is telling you something: it is either too hard for the population being tested, or is unclear, or in some other way problematic. While you would probably be quite correct in not using them for scoring purposes or certain other analyses, I wouldn't want to ignore that aspect.
    Last edited by Clyde Schechter; 12 Jul 2016, 11:51.
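
    Since the error quoted in #1 is about an item that "does not vary", a variant of the loop above can flag constant items as well as sparse ones. This is only a sketch under assumptions: the toy variables q1-q3 and the local names minobs and todrop are made up for illustration, and it lists the flagged variables before dropping so they can be inspected first.

    ```stata
    * Toy data (hypothetical): q1 varies, q2 has one response, q3 is constant
    clear
    input q1 q2 q3
    1 . 1
    0 . 1
    1 1 1
    0 . 1
    1 . 1
    0 . 1
    1 . 1
    end

    local minobs 5      // assumed threshold, as in the original question
    local todrop        // will collect the names of flagged variables

    foreach v of varlist _all {
        * skip string variables; -summarize- is only meaningful for numeric ones
        capture confirm numeric variable `v'
        if _rc != 0 {
            continue
        }
        quietly summarize `v'
        * flag items with too few valid responses or with no variation at all
        if r(N) <= `minobs' | r(min) == r(max) {
            local todrop `todrop' `v'
        }
    }

    display "Flagged: `todrop'"
    if "`todrop'" != "" {
        drop `todrop'
    }
    ```

    Collecting the names in a local first means the list can be reviewed, or saved, before anything is dropped.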


    • #3
      Thanks Clyde,

      With regard to your concern, I'm actually doing the IRT with two different data sets. The first, much larger set contains data from a question bank. Teachers pick an area in which they want to measure a given student's growth, and the software spits out an item at random that tests that area. In that case, I think it's probably OK (and maybe even preferable) to drop any items with too few responses, given that the missing values are almost certainly a function of the item simply not being given to enough kids. Agree/disagree?

      The second data set is as you suspected. In theory, it's a test where all students should be exposed to all questions, so the prevalence of missing values means something different there. I think I need to circle back with the test's authors and find out what they think may explain the missing values, but if you or anyone else has recommendations on how to handle data in which question difficulty might be prompting nonresponse, I'd be very grateful.

      Bryan


      • #4
        See also missings (SSC, SJ)

        http://www.statalist.org/forums/foru...aging-missings

        http://www.stata-journal.com/article...article=dm0085

        missings is considered to supersede dropmiss (SJ) and deliberately makes dropping entire variables a little difficult for the kinds of reasons discussed by Clyde, and others. But dropmiss does still exist to allow this.
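
        For anyone who has not used it, a minimal sketch of missings on toy data follows. The variables q1-q3 are hypothetical, and the options shown are as I understand the help file, so check -help missings- after installing. Note that missings dropvars only removes variables that are missing on every observation, so it does not by itself apply the "5 or fewer" rule from #1.

        ```stata
        * Install from SSC (needs an internet connection)
        ssc install missings

        * Toy data (hypothetical): q2 is missing everywhere, q3 is missing twice
        clear
        input q1 q2 q3
        1 . 1
        0 . .
        1 . 0
        0 . .
        end

        * Tabulate how many missing values each variable has
        missings report

        * Drop variables that are missing on ALL observations (here q2);
        * -force- is needed because the data in memory have not been saved
        missings dropvars, force
        ```

        Here q3 survives even though half its values are missing, which is the deliberate friction mentioned above.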


        • #5
          Bryan, I agree with your reasoning about both data sets.


          • #6
            Originally posted by Clyde Schechter
            Code:
            foreach v of varlist _all {
                 count if !missing(`v')
                 if `r(N)' <= 5 {
                      drop `v'
                 }
            }
            Added: that is how to do it. Whether doing this is a good idea seems questionable--think about it. If these are test questions, an item that draws so few non-missing responses is telling you something: it is either too hard for the population being tested, or is unclear, or in some other way problematic. While you would probably be quite correct in not using them for scoring purposes or certain other analyses, I wouldn't want to ignore that aspect.

            Thanks for the code! Is this command similar to:
            Code:
            drop if mi(var1, var2)


            • #7
              No. -drop if mi(var1, var2)- will drop those observations in the data set for which either var1 or var2 has a missing value. The block of code copied from #2 drops those variables that have 5 or fewer non-missing values. So they are really unrelated to each other, except for both mentioning -drop-.
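
              A small illustration of the observation-level behavior, using a made-up three-row data set (id, var1, and var2 are hypothetical names):

              ```stata
              * Toy data: rows 1 and 2 each have one missing value
              clear
              input id var1 var2
              1 1 .
              2 . 0
              3 1 1
              end

              * Removes every OBSERVATION where var1 or var2 is missing;
              * no variable is removed
              drop if mi(var1, var2)
              list
              ```

              Only the observation with id 3 remains, and all three variables are still in the data set. The loop from #2, by contrast, would leave the rows alone and remove whole columns.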
