Multiple imputation (test scores)

Marry Lee

Join Date: Nov 2020

Posts: 189
#1

02 Mar 2021, 14:18

Dear all,
I have data on test scores with many missing values.
For some individuals, the test score is missing because they are still too young to take the exam (the national exam).
In this case, is it possible to impute the values for these people (mi command) in order to increase the sample size? (It should be missing at random, since missingness is correlated with age but not with the test score.)
Does this make sense?
Thanks.
Clyde Schechter

Join Date: Apr 2014

Posts: 30189
#2

02 Mar 2021, 16:36

The usual reason people want to increase sample sizes for their analyses is to enhance statistical power, but multiple imputation does not do that. The increased precision in each imputed data set that comes from having a larger sample is then cancelled out by the increase in standard errors that results from combining the results of all the imputed data sets. The reason to use multiple imputation is to reduce the bias that comes from using only complete cases.
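
For readers who want to see the combining step concretely, here is a minimal Python sketch of Rubin's rules, the standard way of pooling results across imputed data sets. The coefficients and variances are made-up numbers for illustration only (they do not come from any data in this thread), and `pool_rubin` is a hypothetical helper name. It shows that the pooled standard error is the average within-imputation variance plus a between-imputation component, so it is larger than the standard error from any single "completed" data set would suggest.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Hypothetical coefficients and squared standard errors from m = 5 imputed data sets.
est = [0.50, 0.56, 0.47, 0.53, 0.44]
var = [0.010, 0.011, 0.009, 0.010, 0.010]

pooled, pooled_se = pool_rubin(est, var)
naive_se = mean(var) ** 0.5  # what you would get by ignoring between-imputation variance
print(pooled, pooled_se, naive_se)
```

The gap between `pooled_se` and `naive_se` grows with the disagreement among the imputed data sets, which is how the uncertainty about the missing values enters the final result.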

If, in fact, test scores are independent of age, then your data are better than MAR: they are MCAR (missing completely at random). With MCAR, a complete-cases analysis gives unbiased estimates, so there is no need to bother with multiple imputation.
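
A small simulation can illustrate the MCAR point. This is only a sketch under assumed conditions: the score distribution, the sample size and the drop probabilities are all invented for illustration. It compares the complete-case mean under MCAR with the complete-case mean when missingness depends on the score itself.

```python
import random
from statistics import mean

random.seed(12345)  # fixed seed so the sketch is reproducible

# Hypothetical test scores; the distribution is an assumption for illustration.
scores = [random.gauss(500, 100) for _ in range(20000)]
full_mean = mean(scores)

# MCAR: every score has the same 40% chance of being missing.
mcar_kept = [s for s in scores if random.random() > 0.4]

# For contrast: missingness that depends on the score itself (neither MCAR nor MAR).
def drop_probability(score):
    return 0.8 if score > 500 else 0.1

mnar_kept = [s for s in scores if random.random() > drop_probability(s)]

mcar_mean = mean(mcar_kept)
mnar_mean = mean(mnar_kept)
print(round(full_mean, 1), round(mcar_mean, 1), round(mnar_mean, 1))
```

Under MCAR the complete-case mean stays close to the full-sample mean, while in the contrast case it is pulled clearly downward because high scores are dropped more often.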

So it doesn't sound like multiple imputation will do anything useful for you.
Marry Lee

Join Date: Nov 2020

Posts: 189
#3

02 Mar 2021, 19:15

Thank you Clyde Schechter. I am new to dealing with missing data, and your answer is very clear.
Aside from this variable, my explanatory variables include the GDP and population count of each child's county, and these variables also have some missing values.
I have read a lot about this, and I think there are three major ways to deal with missing data: imputation, interpolation and extrapolation.
I understand that extrapolation means predicting values outside the data range, but I don't quite understand the difference between imputation and interpolation.
When people talk about interpolation, they say it is a form of imputation, but when they talk about imputation, interpolation is not mentioned.

1/ So, do we use extrapolation when the missing values are at the very beginning or the very end of the time series, and interpolation and imputation in other cases?
2/ Is interpolation used only for time series, while imputation can also be used for missing values in cross-sectional data?

I am not sure how to differentiate these methods. Could you please tell me what you think? Any remarks are much appreciated.
Best,
Maarten Buis

Join Date: Mar 2014

Posts: 3467
#4

03 Mar 2021, 01:10

Interpolation and extrapolation are ways of making predictions, and these predictions can be used to impute. But often they cannot be used directly; we need to do other things to the predictions before we can use them as imputations. The tricky part is that, as Clyde already mentioned, the purpose of imputation is not to recover the values that were lost. Instead, the purpose is to use the information you do have to correct for the bias caused by losing those values. Doing this right is very tricky. The distinction between interpolation and extrapolation is less useful here, other than that extrapolation is always more dangerous.
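
To make the interpolation/extrapolation distinction concrete, here is a minimal Python sketch for a yearly series. The function name `linear_fill` and the numbers are hypothetical, and the years are assumed to be sorted in ascending order. It fills gaps between observed points by linear interpolation and gaps at the ends by linear extrapolation from the two nearest observed points. Note that it produces a single prediction per gap, which is exactly the kind of raw prediction that usually cannot be used directly as an imputation.

```python
def linear_fill(years, values):
    """Fill None entries: interpolate between observed neighbours, extrapolate at the ends.

    Assumes `years` is sorted in ascending order.
    """
    known = [(y, v) for y, v in zip(years, values) if v is not None]
    if len(known) < 2:
        raise ValueError("need at least two observed points")
    filled = []
    for y, v in zip(years, values):
        if v is not None:
            filled.append((v, "observed"))
            continue
        before = [p for p in known if p[0] < y]
        after = [p for p in known if p[0] > y]
        if before and after:
            (x0, y0), (x1, y1) = before[-1], after[0]   # nearest neighbours on each side
            kind = "interpolated"
        elif before:
            (x0, y0), (x1, y1) = known[-2], known[-1]   # gap after the last observation
            kind = "extrapolated"
        else:
            (x0, y0), (x1, y1) = known[0], known[1]     # gap before the first observation
            kind = "extrapolated"
        slope = (y1 - y0) / (x1 - x0)
        filled.append((y0 + slope * (y - x0), kind))
    return filled

# Hypothetical county series with gaps at the start, in the middle and at the end.
years = [2010, 2011, 2012, 2013, 2014]
gdp = [None, 100.0, None, 120.0, None]
result = linear_fill(years, gdp)
for year, (value, kind) in zip(years, result):
    print(year, value, kind)
```

The extrapolated values rest entirely on the assumption that the trend continues outside the observed range, which is why they deserve more suspicion than the interpolated one.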

A good place to start learning about missing data is Paul Allison's little green Sage book: Paul D. Allison (2001) Missing Data. Thousand Oaks: Sage. https://uk.sagepub.com/en-gb/eur/missing-data/book9419

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
daniel klein

Join Date: Mar 2014

Posts: 3890
#5

03 Mar 2021, 02:28

Originally posted by Clyde Schechter:

The usual reason people want to increase sample sizes for their analyses is to enhance statistical power, but multiple imputation does not do that. The increased precision in each imputed data set that comes from having a larger sample is then canceled out by the increase in standard errors that results from combining the results of all the imputed data sets.

I get the idea and I see that this could certainly happen. But, is this generally true?

The situation that Marry Lee describes is quite interesting. I would argue that, technically, the missing values due to age are neither missing at random nor missing completely at random. The test scores probably represent some sort of competencies or skills. Many competencies and skills develop with age, either biologically/physiologically/cognitive-psychologically or by experience. Moreover, because missingness is actually a deterministic function of age, in the sense that Pr(missing | age) = 1 below a certain age threshold, I do not know whether this situation qualifies as missing at "random" at all.

As for the suggested alternative, complete case analyses, note that sticking with casewise deletion will not allow the results to be generalized to the younger population that did not participate in the test. Whether this is acceptable depends on the research question.

Anyway, how to treat these missing values depends on a couple of more substantial considerations. For example, whether to impute the missing values depends on what exactly those missing values represent. If they represent the test-scores themselves, in the sense of a certificate that might be used in an admission process or something similar, then the values are "truly" missing for those who did not take the test. In that scenario, it would be perfectly reasonable to assign all individuals who have not taken the test to a separate missing category. Separate analyses for this group would also be reasonable (and preferable if other predictors are assumed to vary over the groups as well). If the missing test-scores represent the underlying competencies/skills, then imputing these missing values might be reasonable if we assume that younger individuals possess those competencies/skills and the functional form of the imputation model reasonably approximates the test-scores that those individuals would have attained. Given that the individuals were not tested, the test might not have been constructed in a way to assess competencies/skills at the level that younger individuals possess them.
Marry Lee

Join Date: Nov 2020

Posts: 189
#6

03 Mar 2021, 11:41

Thank you Maarten Buis. I will look at Paul Allison's book.
Marry Lee

Join Date: Nov 2020

Posts: 189
#7

03 Mar 2021, 11:53

Thank you daniel klein for your answer.

Originally posted by daniel klein:

If they represent the test-scores themselves, in the sense of a certificate that might be used in an admission process or something similar, then the values are "truly" missing for those who did not take the test. In that scenario, it would be perfectly reasonable to assign all individuals who have not taken the test to a separate missing category. Separate analyses for this group would also be reasonable (and preferable if other predictors are assumed to vary over the groups as well). If the missing test-scores represent the underlying competencies/skills, then imputing these missing values might be reasonable if we assume that younger individuals possess those competencies/skills and the functional form of the imputation model reasonably approximates the test-scores that those individuals would have attained. Given that the individuals were not tested, the test might not have been constructed in a way to assess competencies/skills at the level that younger individuals possess them.

Indeed, it is a test score that allows individuals to go to college.
It is used as an indicator of cognitive ability for children who were affected by a certain policy when they were younger. I am using DID (difference-in-differences) and wanted to include more individuals from after the policy was implemented, but these individuals were still too young (by one or two years) to have taken the exam at the time of the survey, so I do not observe their test scores. In conclusion, I think it would not be right to impute their test scores.
Best,
Siwar
Marry Lee

Join Date: Nov 2020

Posts: 189
#8

04 Mar 2021, 18:51

Dear all,
Thanks to you, I now understand missing data analysis better.
There is still something I am confused about.
As I mentioned above, I have missing values for GDP per capita for some counties. Where GDP is missing for all years for a given county, I cannot use traditional imputation or interpolation methods to fill in the gaps. I also cannot use multiple imputation, because the other explanatory variables in my analysis model are child and household characteristics and are not related to GDP per capita.

In this case, can I bring in other variables related to GDP, run a regression of GDP on these variables, and then predict the values of GDP?
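
To be concrete about what I mean, here is a rough Python sketch of that regression-prediction step, not a claim that it is the right approach. Everything in it is hypothetical: the county labels, the single auxiliary predictor, the numbers, and the helper name `fit_line`. It fits a simple least-squares line on the counties where GDP is observed and predicts GDP for the county where it is missing.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical counties: (auxiliary predictor related to GDP, GDP per capita or None).
counties = {
    "A": (1.0, 12.1),
    "B": (2.0, 13.9),
    "C": (3.0, 16.2),
    "D": (4.0, 17.8),
    "E": (3.5, None),   # GDP missing for this county
}

observed = [(x, g) for x, g in counties.values() if g is not None]
intercept, slope = fit_line([x for x, _ in observed], [g for _, g in observed])

# Predict GDP only where it is missing; observed values are left untouched.
predicted = {
    name: intercept + slope * x
    for name, (x, g) in counties.items()
    if g is None
}
print(predicted)
```

One caveat I am aware of: a predicted value like this carries no residual noise, so treating it as if it were observed would understate the uncertainty in any later analysis.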

Maybe as this paper did?
Deng, X., Fang, Y., Lin, Y., & Yuan, Y. (2012). Non-parametric method for filling in the missing value for cross-sectional dataset: A validation on the per capita GDP data at county level in China. Journal of Food, Agriculture & Environment, 10(3-4), 1350-1354.
Maarten Buis

Join Date: Mar 2014

Posts: 3467
#9

05 Mar 2021, 01:16

What countries are missing?

Marry Lee

Join Date: Nov 2020

Posts: 189
#10

05 Mar 2021, 13:35

Dear Maarten Buis, it is not about countries; it is about counties in China.