Deleting variables with too many missing observations

Bryan Shelly

Join Date: Jul 2016

Posts: 16
#1

Deleting variables with too many missing observations

12 Jul 2016, 11:36

I have wide student test data. Each row is an individual student, and each variable/column holds the responses to an individual question. I want to eliminate any variable that has no more than 5 valid observations (i.e., too many missing values). Suggestions? I'm new to "missing" but might that do the trick?

EDIT: This is in preparation for IRT. My real problem is that some questions have only 1 or 2 kids answering, which creates insufficient variation and results in the error message "v12345 does not vary in the estimation sample." If there's a better way than dropping v12345, I'm all ears.

Thanks,

Bryan

Last edited by Bryan Shelly; 12 Jul 2016, 11:47.

Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

12 Jul 2016, 11:48

Code:

foreach v of varlist _all {
    count if !missing(`v')
    if `r(N)' <= 5 {
        drop `v'
    }
}

Added: that is how to do it. Whether doing this is a good idea seems questionable--think about it. If these are test questions, an item that draws so few non-missing responses is telling you something: it is either too hard for the population being tested, or is unclear, or in some other way problematic. While you would probably be quite correct in not using them for scoring purposes or certain other analyses, I wouldn't want to ignore that aspect.
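One way to act on that caution is to list the sparse items before deciding whether to drop them. The following is only a minimal sketch: the toy data, the variable names id and q1-q3, and the local macro name sparse are all made up for illustration, and the cutoff of 5 is the one from the question.

```stata
* Hypothetical toy data: 8 students, three items (all names are illustrative)
clear
input id q1 q2 q3
1 1 . 1
2 0 . 0
3 1 1 .
4 1 . 1
5 0 . 0
6 1 . 1
7 0 . 1
8 1 . 0
end

* Collect the names of sparse items instead of dropping them right away
local sparse
foreach v of varlist q1-q3 {
    quietly count if !missing(`v')
    if r(N) <= 5 {
        local sparse `sparse' `v'
    }
}

* Show the candidates so they can be reviewed first
display "Items with 5 or fewer responses: `sparse'"
```

Once the list has been reviewed, the flagged items can be removed in one step by passing the contents of the macro to -drop-, provided the list is not empty.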

Last edited by Clyde Schechter; 12 Jul 2016, 11:51.
Bryan Shelly

Join Date: Jul 2016

Posts: 16
#3

12 Jul 2016, 12:03

Thanks Clyde,

With regard to your concern, I'm actually doing the IRT with two different data sets. The first, much larger set contains data from a question bank. Teachers pick an area in which they want to measure a given student's growth, and the software spits out an item at random that tests that area. In that case, I think it's probably OK (and maybe even preferable) to drop any items with too few responses, given that the missing values are almost certainly a function of the item just not being given to enough kids. Agree/disagree?

The second data set is as you suspected. In theory, it's a test where all students should be exposed to all questions, so the prevalence of missing values means something different from the case above. I think I need to circle back with the test's authors and find out what they think may explain the missing values, but if you or anyone else has recommendations on how to handle data in which question difficulty might be prompting nonresponse, I'd be very grateful.

Bryan
Nick Cox

Join Date: Mar 2014

Posts: 36058
#4

12 Jul 2016, 12:33

See also missings (SSC, SJ)

http://www.statalist.org/forums/foru...aging-missings

http://www.stata-journal.com/article...article=dm0085

missings is considered to supersede dropmiss (SJ) and deliberately makes dropping entire variables a little difficult for the kinds of reasons discussed by Clyde, and others. But dropmiss does still exist to allow this.
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#5

12 Jul 2016, 15:37

Bryan, I agree with your reasoning about both data sets.
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#6

13 Aug 2020, 15:42

Originally posted by Clyde Schechter:

Code:

foreach v of varlist _all {
    count if !missing(`v')
    if `r(N)' <= 5 {
        drop `v'
    }
}

Added: that is how to do it. Whether doing this is a good idea seems questionable--think about it. If these are test questions, an item that draws so few non-missing responses is telling you something: it is either too hard for the population being tested, or is unclear, or in some other way problematic. While you would probably be quite correct in not using them for scoring purposes or certain other analyses, I wouldn't want to ignore that aspect.

Thanks for the code! Is this command similar to:

Code:

drop if mi(var1, var2)
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#7

13 Aug 2020, 15:55

No. -drop if mi(var1, var2)- will drop those observations in the data set for which either var1 or var2 has a missing value. The block of code copied from #2 drops those variables that have 5 or fewer non-missing values. So they are really unrelated to each other, except for both mentioning -drop-.
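To make the contrast concrete, here is a small sketch on made-up data (the variables id and q1-q3 and their values are assumptions for illustration only). The loop removes a column and keeps every row, while -drop if missing()- removes rows and keeps every column.

```stata
* Hypothetical toy data: 8 students, three items (all names are illustrative)
clear
input id q1 q2 q3
1 1 . 1
2 0 . 0
3 1 1 .
4 1 . 1
5 0 . 0
6 1 . 1
7 0 . 1
8 1 . 0
end

* Variable-wise: drop any variable with 5 or fewer non-missing values
preserve
foreach v of varlist _all {
    quietly count if !missing(`v')
    if r(N) <= 5 {
        drop `v'
    }
}
describe, simple   // q2 is gone; all 8 observations remain
restore

* Observation-wise: drop students with a missing value on q1 or q3
drop if missing(q1, q3)
count              // 7 observations remain; all 4 variables remain
```

-preserve- and -restore- are used here only so that both operations start from the same toy data.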