Mipolate - impute value based on multiple known values even when an exact match on xvar occurs

Liz Jones

Join Date: Nov 2022

Posts: 2
#1


15 Nov 2022, 08:02

Hi there,

I have a dataset of higher ed institutions that contains an avg graduation rate and an avg transfer student graduation rate for each institution. In some cases, there is no information about avg transfer graduation rates. As such, I am imputing transfer grad rates (yvar) from the avg grad rates (xvar), as there is a strong correlation between them. I'm using the mipolate command to do this and am testing out a couple of different options (linear, idw). I'd like to generate imputed transfer grad rate values based on multiple known avg grad rates, not just the one nearest or exactly matching neighbor. These mipolate options seem to do that, except when an institution with a missing yvar has an exact match on xvar with another institution. When there's an exact match on xvar, I just get the same exact value from the institution whose yvar is not missing. Does anyone know how I can do this imputation and still generate yvars based on multiple known values, even when two institutions have the exact same xvar value?

Code used for imputation:
mipolate gradrate_8yr_transferstudents avg_grad_rate, gen(imp_tgr) idw

Example data below, after doing imputation. Two institutions here are missing on the yvar (gradrate_8yr_transferstudents). The first one (inst 7) received an imputed value (imp_tgr) based on multiple values of avg_grad_rate in the dataset, using the idw option. Inst 10, though, has the exact same xvar value (avg_grad_rate) as inst 11, so it receives an imputed value based only on the transfer grad rate of inst 11. Ideally, I'd like inst 10 to receive an imputed value based on multiple other values, just like inst 7. Is there any way to do this?

* Example generated by -dataex-. For more info, type help dataex
clear
ihe_name (gradrate_8yr_transferstudents avg_grad_rate imp_tgr)
Inst 1 .1597 7 .1597
Inst 2 .1146 8.125 .1146
Inst 3 .1924 9.222222328186035 .1924
Inst 4 .1152 9.55555534362793 .1152
Inst 5 .2743 10.222222328186035 .2743
Inst 6 .1944 10.88888931274414 .1944
Inst 7 . 11.333333015441895 .25058369178133116
Inst 8 .2806 14.44444465637207 .2806
inst 9 .3662 15.11111068725586 .3662
inst 10 . 15.222222328186035 .2475
Inst 11 .2475 15.222222328186035 .2475

Thanks for your help!

Andrew Musau

Join Date: Oct 2014
Posts: 10187

15 Nov 2022, 15:51

mipolate is from SSC (FAQ Advice #12).

"When there's an exact match on xvar, I just get the same exact value from the institution that does not have the missing yvar."

I do not know what you expect. You can think of interpolation as estimating the value of some function for an intermediate value of the independent variable. So having two values of the function for the same value of the independent variable violates continuity. If you believe there is some association between the x variable and the y variable and possibly some other variables, then you can specify a regression model and generate out-of-sample predictions from this model. In your case, the simplest model that you can think of is:

$$transfer\_grad\_rate= \beta_0+ \beta_1 avg\_grad\_rate + u\;\;\;\;\;(\ast)$$

Note that you are not supposed to modify the output from dataex. It is supposed to be copied and pasted "as is".

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str11 ihe_name float(gradrate_8yr_transferstudents avg_grad_rate imp_tgr)
"Inst 1"  .1597         7     .1597
"Inst 2"  .1146     8.125     .1146
"Inst 3"  .1924  9.222222     .1924
"Inst 4"  .1152  9.555555     .1152
"Inst 5"  .2743 10.222222     .2743
"Inst 6"  .1944  10.88889     .1944
"Inst 7"      . 11.333333 .25058368
"Inst 8"  .2806 14.444445     .2806
"inst 9"  .3662  15.11111     .3662
"inst 10"     . 15.222222     .2475
"Inst 11" .2475 15.222222     .2475
end

*ESTIMATE (*)
regress gradrate_8yr_transferstudents avg_grad_rate
*GET PREDICTIONS (INCLUDING OUT-OF-SAMPLE PREDICTIONS)
predict yhat, xb

Res.:

Code:

. l, sep(0)

     +------------------------------------------------------+
     | ihe_name   gradra~s   avg_gr~e    imp_tgr       yhat |
     |------------------------------------------------------|
  1. |   Inst 1      .1597          7      .1597   .1289436 |
  2. |   Inst 2      .1146      8.125      .1146   .1529289 |
  3. |   Inst 3      .1924   9.222222      .1924   .1763219 |
  4. |   Inst 4      .1152   9.555555      .1152   .1834287 |
  5. |   Inst 5      .2743   10.22222      .2743   .1976422 |
  6. |   Inst 6      .1944   10.88889      .1944   .2118557 |
  7. |   Inst 7          .   11.33333   .2505837   .2213314 |
  8. |   Inst 8      .2806   14.44444      .2806    .287661 |
  9. |   inst 9      .3662   15.11111      .3662   .3018745 |
 10. |  inst 10          .   15.22222      .2475   .3042435 |
 11. |  Inst 11      .2475   15.22222      .2475   .3042435 |
     +------------------------------------------------------+
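
To turn these predictions into an imputed variable, one option is to keep the observed rate wherever it exists and use the fitted value only where the rate is missing. This is a minimal sketch rather than part of the code above; the variable name imp_reg is made up for illustration, and the data are the same example values with the imp_tgr column left out.

```stata
clear
input str11 ihe_name float(gradrate_8yr_transferstudents avg_grad_rate)
"Inst 1"  .1597         7
"Inst 2"  .1146     8.125
"Inst 3"  .1924  9.222222
"Inst 4"  .1152  9.555555
"Inst 5"  .2743 10.222222
"Inst 6"  .1944  10.88889
"Inst 7"      . 11.333333
"Inst 8"  .2806 14.444445
"inst 9"  .3662  15.11111
"inst 10"     . 15.222222
"Inst 11" .2475 15.222222
end

* fit the simple linear model on institutions with an observed rate
regress gradrate_8yr_transferstudents avg_grad_rate

* linear predictions for every institution, in or out of the estimation sample
predict double yhat, xb

* imp_reg (hypothetical name): observed rate if present, fitted value otherwise
generate double imp_reg = cond(missing(gradrate_8yr_transferstudents), yhat, gradrate_8yr_transferstudents)

list ihe_name gradrate_8yr_transferstudents avg_grad_rate imp_reg, sep(0)
```

Because the fit uses every institution with an observed rate, inst 10 gets a value informed by all of them rather than a copy of Inst 11. Note that fitted values from a linear model are not constrained to lie between 0 and 1.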

Last edited by Andrew Musau; 15 Nov 2022, 16:04.


Nick Cox

Join Date: Mar 2014

Posts: 35642
#3

15 Nov 2022, 16:34

mipolate is from SSC, as you are asked to explain (FAQ Advice #12). I am its author.

Thanks in principle for the data example, but you have broken the dataex output by omitting input and mangling the identifiers. Presumably you decided you should suppress the names, which is fair enough, except that you need to fix what was broken, say as below.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input id gradrate_8yr_transferstudents avg_grad_rate imp_tgr
1 .1597 7 .1597
2 .1146 8.125 .1146
3 .1924 9.222222328186035 .1924
4 .1152 9.55555534362793 .1152
5 .2743 10.222222328186035 .2743
6 .1944 10.88888931274414 .1944
7 . 11.333333015441895 .25058369178133116
8 .2806 14.44444465637207 .2806
9 .3662 15.11111068725586 .3662
10 . 15.222222328186035 .2475
11 .2475 15.222222328186035 .2475
end

The usual territory for imputation is a regularly spaced grid of time or other values. This is the most unusual application of mipolate I have seen and I can't be comfortable with it. If you really need to interpolate I think you would be better off using some kind of regression.

What is going on is easiest (for me) to think about graphically. The raw data are shown by identifiers and you want to interpolate at the two positions shown by vertical lines. The interpolation at 11.333333 and bits is what it is, but the interpolation at 15.222222 and bits is more puzzling. As you already have a value at that position it seems fair that is the entirety of the information used, but on the other hand, the inverse of distance zero squared is going to be returned by Mata as missing. I think the algorithm is just deciding that your outcome is known at that location and copying the existing value.
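
The zero-distance point can be illustrated by computing inverse squared distance weights by hand. The sketch below is not mipolate's own code: it assumes weights of 1/distance^2 over the known values only, uses the example values rounded as in #2, and shortens the variable names to y and x for readability. Its result is not expected to match the posted imp_tgr if that was computed from more institutions than the excerpt shows.

```stata
clear
input id double(y x)
1  .1597 7
2  .1146 8.125
3  .1924 9.222222
4  .1152 9.555555
5  .2743 10.222222
6  .1944 10.88889
7  .     11.333333
8  .2806 14.444445
9  .3662 15.11111
10 .     15.222222
11 .2475 15.222222
end

* target location: institution 7, where y is missing
scalar x0 = x[7]

* inverse squared distance weights, known values only
generate double w = 1 / (x - scalar(x0))^2 if !missing(y)
generate double wy = w * y

* weighted mean = sum(w*y) / sum(w)
quietly summarize wy
scalar num = r(sum)
quietly summarize w
scalar den = r(sum)
display "hand-computed estimate at institution 7: " scalar(num) / scalar(den)

* at institution 10's location, institution 11 is at distance zero,
* so its weight 1/0^2 evaluates to missing
generate double w10 = 1 / (x - x[10])^2 if !missing(y)
list id x y w10 if id == 11
```

With a missing weight for the tied observation, a weighted mean is not defined in the usual way at that location, which is consistent with the existing value simply being copied there.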
Liz Jones

Join Date: Nov 2022

Posts: 2
#4

16 Nov 2022, 08:10

Thank you both for your help! I will explore other options for imputation here.