  • Mipolate - impute value based on multiple known values even when an exact match on xvar occurs

    Hi there,

    I have a dataset of higher ed institutions that contains an average graduation rate and an average transfer student graduation rate for each institution. For some institutions, the transfer graduation rate is missing, so I am imputing transfer grad rates (yvar) from the average grad rates (xvar), as the two are strongly correlated. I'm using the mipolate command to do this and am testing a couple of its options (linear, idw). I'd like each imputed transfer grad rate to be based on multiple known values, not just the nearest or exactly matching neighbor. These mipolate options seem to do that, except when an institution with a missing yvar has an exact match on xvar with another institution. When there's an exact match on xvar, I just get the same exact value from the institution that does not have the missing yvar. Does anyone know how I can do this imputation and still generate yvars based on multiple observations even when two institutions have exactly the same xvar value?

    Code used for imputation:
    mipolate gradrate_8yr_transferstudents avg_grad_rate, gen(imp_tgr) idw



    Example data below, after doing the imputation. Two institutions here are missing on the yvar (gradrate_8yr_transferstudents). The first one (inst 7) received an imputed value (imp_tgr) based on multiple values of avg_grad_rate in the dataset, using the idw option. Inst 10, though, has exactly the same xvar value (avg_grad_rate) as inst 11, so it receives an imputed value based only on the transfer grad rate of inst 11. Ideally, I'd like inst 10 to receive an imputed value based on multiple other values, just like inst 7. Is there any way to do this?

    * Example generated by -dataex-. For more info, type help dataex
    clear
    ihe_name (gradrate_8yr_transferstudents avg_grad_rate imp_tgr)
    Inst 1 .1597 7 .1597
    Inst 2 .1146 8.125 .1146
    Inst 3 .1924 9.222222328186035 .1924
    Inst 4 .1152 9.55555534362793 .1152
    Inst 5 .2743 10.222222328186035 .2743
    Inst 6 .1944 10.88888931274414 .1944
    Inst 7 . 11.333333015441895 .25058369178133116
    Inst 8 .2806 14.44444465637207 .2806
    inst 9 .3662 15.11111068725586 .3662
    inst 10 . 15.222222328186035 .2475
    Inst 11 .2475 15.222222328186035 .2475


    Thanks for your help!




  • #2
    mipolate is from SSC (FAQ Advice #12).

    When there's an exact match on xvar, I just get the same exact value from the institution that does not have the missing yvar.
    I do not know what you expect. You can think of interpolation as estimating the value of some function for an intermediate value of the independent variable. So having two values of the function for the same value of the independent variable violates continuity. If you believe there is some association between the x variable and the y variable and possibly some other variables, then you can specify a regression model and generate out-of-sample predictions from this model. In your case, the simplest model that you can think of is:

    $$transfer\_grad\_rate= \beta_0+ \beta_1 avg\_grad\_rate + u\;\;\;\;\;(\ast)$$

    Note that you are not supposed to modify the output from dataex. It is supposed to be copied and pasted "as is".

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 ihe_name float(gradrate_8yr_transferstudents avg_grad_rate imp_tgr)
    "Inst 1"  .1597         7     .1597
    "Inst 2"  .1146     8.125     .1146
    "Inst 3"  .1924  9.222222     .1924
    "Inst 4"  .1152  9.555555     .1152
    "Inst 5"  .2743 10.222222     .2743
    "Inst 6"  .1944  10.88889     .1944
    "Inst 7"      . 11.333333 .25058368
    "Inst 8"  .2806 14.444445     .2806
    "inst 9"  .3662  15.11111     .3662
    "inst 10"     . 15.222222     .2475
    "Inst 11" .2475 15.222222     .2475
    end
    
    *ESTIMATE (*)
    regress gradrate_8yr_transferstudents avg_grad_rate
    *GET PREDICTIONS (INCLUDING OUT-OF-SAMPLE PREDICTIONS)
    predict yhat, xb
    Res.:

    Code:
    . l, sep(0)
    
         +------------------------------------------------------+
         | ihe_name   gradra~s   avg_gr~e    imp_tgr       yhat |
         |------------------------------------------------------|
      1. |   Inst 1      .1597          7      .1597   .1289436 |
      2. |   Inst 2      .1146      8.125      .1146   .1529289 |
      3. |   Inst 3      .1924   9.222222      .1924   .1763219 |
      4. |   Inst 4      .1152   9.555555      .1152   .1834287 |
      5. |   Inst 5      .2743   10.22222      .2743   .1976422 |
      6. |   Inst 6      .1944   10.88889      .1944   .2118557 |
      7. |   Inst 7          .   11.33333   .2505837   .2213314 |
      8. |   Inst 8      .2806   14.44444      .2806    .287661 |
      9. |   inst 9      .3662   15.11111      .3662   .3018745 |
     10. |  inst 10          .   15.22222      .2475   .3042435 |
     11. |  Inst 11      .2475   15.22222      .2475   .3042435 |
         +------------------------------------------------------+
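    If the goal is to fill only the gaps, the predictions could be combined with the observed values along these lines. This is a minimal sketch, not something run in this thread; imp_reg is a hypothetical variable name, and it assumes yhat from the regression above already exists.

    Code:
    * keep observed transfer grad rates; use the fitted value only where the rate is missing
    generate double imp_reg = cond(missing(gradrate_8yr_transferstudents), yhat, gradrate_8yr_transferstudents)

    Under this approach inst 10 gets the fitted value at its avg_grad_rate (.3042435 in the listing above) rather than a copy of the observed rate for Inst 11.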
    Last edited by Andrew Musau; 15 Nov 2022, 16:04.


    • #3
      mipolate is from SSC, as you are asked to explain (FAQ Advice #12). I am its author.

      Thanks in principle for the data example, but you have broken the dataex output by omitting input and mangling the identifiers. Presumably you decided you should suppress the names, which is fair enough, except that you need to fix what was broken, say as below.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input id gradrate_8yr_transferstudents avg_grad_rate imp_tgr
      1 .1597 7 .1597
      2 .1146 8.125 .1146
      3 .1924 9.222222328186035 .1924
      4 .1152 9.55555534362793 .1152
      5 .2743 10.222222328186035 .2743
      6 .1944 10.88888931274414 .1944
      7 . 11.333333015441895 .25058369178133116
      8 .2806 14.44444465637207 .2806
      9 .3662 15.11111068725586 .3662
      10 . 15.222222328186035 .2475
      11 .2475 15.222222328186035 .2475
      end
      The usual territory for imputation is a regularly spaced grid of time or other values. This is the most unusual application of mipolate I have seen and I can't be comfortable with it. If you really need to interpolate I think you would be better off using some kind of regression.

      What is going on is easiest (for me) to think about graphically. The raw data are shown by identifiers and you want to interpolate at the two positions shown by vertical lines. The interpolation at 11.333333 and bits is what it is, but the interpolation at 15.222222 and bits is more puzzling. As you already have a value at that position, it seems fair that this is the entirety of the information used; on the other hand, the inverse of distance zero squared is going to be returned by Mata as missing. I think the algorithm is just deciding that your outcome is known at that location and copying the existing value.
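
      To spell out the arithmetic, assuming the usual inverse-distance-squared form (an assumption here; the notation is illustrative and not taken from the mipolate code), the interpolated value at $x_0$ is a weighted mean of the known $y_i$:

      $$\hat{y}(x_0) = \frac{\sum_i w_i\, y_i}{\sum_i w_i}, \qquad w_i = \frac{1}{(x_0 - x_i)^2}$$

      As $x_0$ approaches a known $x_j$, the weight $w_j$ grows without bound and $\hat{y}(x_0)$ tends to $y_j$; at $x_0 = x_j$ exactly, $w_j$ is undefined. Copying $y_j$ is therefore the limiting value of the formula, which is consistent with inst 10 receiving .2475 from Inst 11.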

      [Figure: mipolate.png. Graph of the raw data labelled by identifier, with vertical lines at the two positions to be interpolated.]




      • #4
        Thank you both for your help! I will explore other options for imputation here.
