The basic issue is that I have data at two levels, and both levels have missing data. At the macro level I have more cases than I will actually analyze, and I want to use the information from all of them when imputing the missing macro-level data.
The data are survey data combined with macro-level data on the county where R resides. At the individual level I have about 1,000 cases nested in about 400 counties. The full macro data set consists of about 3,600 counties. There is some missing data at the county level that I want to impute. There is also missing data at the individual level that I don't think I want to impute - I think reviewers will be more comfortable with the aggregate imputation than with the imputation of survey responses (this is just a hunch and it's secondary to the basic issue).
I could throw out all counties where there is no respondent, which leaves me with about 400 counties, and then impute the county-level data using only those 400 counties. But I want to make use of all 3,600 counties to impute the county-level data and then run the actual models using just the 1,000 respondents and 400 counties in the original data.
I registered the survey variables as regular (they do have some missing values but again, I don't want to impute them at this time). I then registered the county-level variables as imputed and ran the imputation. The data are in flong format and the variables to be imputed are continuous. The code is as follows:
mi impute mvn countyvar1 ... countyvarN = surveyvar1 ... surveyvarN
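Spelled out, the steps I describe above look roughly like the sketch below. The variable names are placeholders standing in for my real lists, and the add(20) and rseed() values are arbitrary choices for illustration only, not what I necessarily ran.

```stata
* Sketch only: placeholder variable names; add(20) and the seed are arbitrary.
* Declare the data as mi data in flong style.
mi set flong

* Survey variables: registered as regular, so they are not imputed.
mi register regular surveyvar1 surveyvar2

* County variables: registered as imputed.
mi register imputed countyvar1 countyvar2

* Multivariate normal imputation of the county variables,
* using the survey variables as predictors.
mi impute mvn countyvar1 countyvar2 = surveyvar1 surveyvar2, add(20) rseed(12345)
```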
I get a warning that "the imputed data contain missing values" and the process halts.
I assume that this is because the survey data contain missing responses (which I don't want imputed) and because many of the counties are missing all the survey-level data (because no respondents to the survey lived in those counties). I can force it with the force option, but I'm reluctant.
I'm not sure I'm going about this the right way at all. Even if I impute all the survey data as well as the county level data, the problem won't go away because most of the counties used for the imputation don't have any survey data attached. To be clear, I don't want to impute survey data for counties where there were no individuals selected to take the survey - that would be silly - but I want to make use of all the county level data I have to get a precise imputation of the missing county level data.
Surely there must be a way to do the imputation and incorporate the entire county level data in the prediction even though I will ultimately use about 400 of the counties in the analysis. I appreciate any advice on the matter.
Will
edited to add: I'm not actually looking to run a multi-level model. There aren't enough individuals in each county for that, but the data are technically nested and that fact is at the heart of the problem.