Correcting biases in linear regression on nearest neighbor matching sample.

Seungmin Lee

Join Date: Feb 2020
Posts: 40

14 Jul 2025, 17:09

Dear Statalist Users,

I would like to ask for your advice on the nearest neighbor matching (NNM) estimator: how can I correct the bias in an NNM-based linear regression (possibly with control variables or interaction terms)?

My goal is to estimate the interaction effects with controls on matched samples via NNM.

Stata's official command, teffects nnmatch, does not support adding extra variables (controls or interaction terms) to the NNM estimator. I assume this is intended (as written in this post), because the matching estimator is non-parametric: it does not rely on a functional form, but estimates the effect as the difference in weighted averages between treated and control units.

One workaround I found is to regress the outcome variable on the treatment and the control variables, using the weights generated by NNM.
Since teffects nnmatch does not generate weights, I generated them with the user-written "kmatch md" command.

Here's an example using Stata's automobile data.

Suppose I estimate the effect of "foreign" on "price", via NNM with three variables (mpg, headroom, trunk) with k=1.

I generated three NNM estimates using three different methods: (1) teffects nnmatch, (2) kmatch md, and (3) linear regression using the weights generated by NNM (the workaround).

Code:

sysuse auto, clear
ssc install kmatch, replace

loc    outcomevar        price
loc    treatvar        foreign
loc    matchingvars    mpg headroom trunk
loc    controls        length displacement

*    Comparing "teffects nnmatch" and "kmatch md"
teffects nnmatch (`outcomevar'    `matchingvars') (`treatvar'), nneighbor(1)  metric(mahalanobis) // (1)
kmatch md `treatvar'  `matchingvars' (`outcomevar'), metric(mahalanobis) nn(1)  wgenerate(wgt)  // (2)
reg    `outcomevar'    `treatvar' [iweight=wgt]    //    (3)

As you can see below, all three methods produce identical point estimates (1599.622) but different standard errors.
(I am curious why the standard errors are so different, but let's put that on the back burner for now.)
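
One thing I did notice in the output of (3) below is that regress reports 148 observations although the data contain only 74 cars, which suggests the iweights enter the calculation as if they were counts. A quick check I would try is sketched here. It assumes kmatch is installed, and the pweight specification is only my own guess at a variance that does not depend on the sum of the weights; it is not taken from the kmatch documentation.

Code:

```stata
sysuse auto, clear

* Matching weights as in method (2); kmatch is user-written (ssc install kmatch)
kmatch md foreign mpg headroom trunk (price), metric(mahalanobis) nn(1) wgenerate(wgt)

* If regress treats the iweights like counts, their sum should equal the reported N
summarize wgt
display "Sum of matching weights: " r(sum)

* Same point estimate, but with sampling-type weights and robust standard errors
regress price foreign [pweight=wgt], vce(robust)
```

Even if this narrows the gap, none of these regression-based standard errors account for the matching step itself, so I would not treat them as a substitute for the ones reported by (1) or (2).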

Since the point estimates agree, it might make sense to use the linear regression with NNM-based weights (3) to estimate heterogeneous effects with controls, by adding controls and interaction terms to the regression.

Code:


. *       Comparing "teffects nnmatch" and "kmatch md"
. teffects nnmatch (`outcomevar'  `matchingvars') (`treatvar'), nneighbor(1)  metric(mahalanobis) // (1)

Treatment-effects estimation                   Number of obs      =         74
Estimator      : nearest-neighbor matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Distance metric: Mahalanobis                                  max =          2
----------------------------------------------------------------------------------------
                       |              AI robust
                 price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-----------------------+----------------------------------------------------------------
ATE                    |
               foreign |
(Foreign vs Domestic)  |   1599.622   790.9518     2.02   0.043     49.38464    3149.859
----------------------------------------------------------------------------------------

. kmatch md `treatvar'  `matchingvars' (`outcomevar'), metric(mahalanobis) nn(1)  wgenerate(wgt)  // (2)

Multivariate-distance nearest-neighbor matching

                                                            Number of obs = 74
                                                Neighbors:    min =          1
Treatment   : foreign = 1                                     max =          2
Metric      : mahalanobis
Covariates  : mpg headroom trunk

Matching statistics
------------------------------------------------------------------------------
           |             Matched             |            Controls           
           |       Yes         No      Total |      Used     Unused      Total
-----------+---------------------------------+--------------------------------
   Treated |        22          0         22 |        14         38         52
 Untreated |        52          0         52 |        15          7         22
  Combined |        74          0         74 |        29         45         74
------------------------------------------------------------------------------

Treatment-effects estimation
------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         ATE |   1599.622    1026.38     1.56   0.123     -445.951    3645.194
------------------------------------------------------------------------------

Stored variables
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
wgt             double  %10.0g                Matching weights for ATE

. reg     `outcomevar'    `treatvar' [iweight=wgt]        //      (3)

      Source |       SS           df       MS      Number of obs   =       148
-------------+----------------------------------   F(1, 146)       =     10.56
       Model |  94675205.3         1  94675205.3   Prob > F        =    0.0014
    Residual |  1.3091e+09       146  8966657.43   R-squared       =    0.0674
-------------+----------------------------------   Adj R-squared   =    0.0611
       Total |  1.4038e+09       147  9549708.78   Root MSE        =    2994.4

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     foreign |   1599.622   492.2825     3.25   0.001     626.7012    2572.542
       _cons |   5699.203   348.0963    16.37   0.000     5011.244    6387.161
------------------------------------------------------------------------------
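
To make idea (3) concrete, here is a minimal sketch of the kind of regression I have in mind. The moderator (weight) is only a placeholder I picked from the auto data for illustration, the controls are length and displacement as in my locals above, and this is not meant as a validated estimator.

Code:

```stata
sysuse auto, clear

* Matching weights from kmatch md, as in method (2); kmatch is user-written
kmatch md foreign mpg headroom trunk (price), metric(mahalanobis) nn(1) wgenerate(wgt)

* Weighted regression (3), extended with controls and an interaction.
* weight is a hypothetical moderator chosen only for illustration.
regress price i.foreign##c.weight length displacement [iweight=wgt]

* Effect of foreign at an illustrative car weight of 3,000 lbs
lincom 1.foreign + 3000*1.foreign#c.weight
```

The lincom line combines the main effect of foreign with the interaction term, so the effect can be read off at any value of the moderator.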

But here's the problem: the NNM estimator is biased when matching on continuous variables, so it needs to be corrected (Abadie and Imbens 2006, 2010).

Unfortunately, while (1) and (2) can correct the bias, (3) cannot.

Code:

*    With bias correction (Abadie and Imbens 2006, 2010)
cap    drop    wgt
teffects nnmatch (`outcomevar'    `matchingvars') (`treatvar'), nneighbor(1) vce(robust)     metric(mahalanobis) gen(matched) biasadj(`matchingvars')     // (1)
kmatch md `treatvar'  `matchingvars' (`outcomevar' = `matchingvars'), metric(mahalanobis) nn(1) wgenerate(wgt)  // (2)
reg    `outcomevar'    `treatvar' [iweight=wgt]    //    (3)

And here are the results: (1) and (2) give identical bias-corrected estimates (2190.491), while (3) remains at the uncorrected estimate (1599.622).

The bias is too large to ignore.

Code:


. *       With bias correction (Abadie and Imbens 2006, 2010)
. cap     drop    wgt

. teffects nnmatch (`outcomevar'  `matchingvars') (`treatvar'), nneighbor(1) vce(robust)   metric(mahalanobis) gen(matched) biasadj(`matchingvars')       // (1)

Treatment-effects estimation                   Number of obs      =         74
Estimator      : nearest-neighbor matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Distance metric: Mahalanobis                                  max =          2
----------------------------------------------------------------------------------------
                       |              AI robust
                 price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-----------------------+----------------------------------------------------------------
ATE                    |
               foreign |
(Foreign vs Domestic)  |   2190.491   762.2136     2.87   0.004     696.5796    3684.402
----------------------------------------------------------------------------------------

. kmatch md `treatvar'  `matchingvars' (`outcomevar' = `matchingvars'), metric(mahalanobis) nn(1) wgenerate(wgt)  // (2)

Multivariate-distance nearest-neighbor matching

                                                            Number of obs = 74
                                                Neighbors:    min =          1
Treatment   : foreign = 1                                     max =          2
Metric      : mahalanobis
Covariates  : mpg headroom trunk
RA equations: price = mpg headroom trunk _cons

Matching statistics
------------------------------------------------------------------------------
           |             Matched             |            Controls           
           |       Yes         No      Total |      Used     Unused      Total
-----------+---------------------------------+--------------------------------
   Treated |        22          0         22 |        14         38         52
 Untreated |        52          0         52 |        15          7         22
  Combined |        74          0         74 |        29         45         74
------------------------------------------------------------------------------

Treatment-effects estimation
------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         ATE |   2190.491   537.9627     4.07   0.000     1118.333    3262.649
------------------------------------------------------------------------------

Stored variables
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
wgt             double  %10.0g                Matching weights for ATE

. reg     `outcomevar'    `treatvar' [iweight=wgt]        //      (3)

      Source |       SS           df       MS      Number of obs   =       148
-------------+----------------------------------   F(1, 146)       =     10.56
       Model |  94675205.3         1  94675205.3   Prob > F        =    0.0014
    Residual |  1.3091e+09       146  8966657.43   R-squared       =    0.0674
-------------+----------------------------------   Adj R-squared   =    0.0611
       Total |  1.4038e+09       147  9549708.78   Root MSE        =    2994.4

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     foreign |   1599.622   492.2825     3.25   0.001     626.7012    2572.542
       _cons |   5699.203   348.0963    16.37   0.000     5011.244    6387.161
------------------------------------------------------------------------------

So here's my question: is there a way to correct the bias in (3), the linear regression with NNM-based weights?

Or alternatively, is there a way to add control variables or interaction terms in either (1) or (2)?

* Note that adding control variables to the linear regression (3) is different from adding the same controls to the list of matching variables in NNM; the two approaches generate very different estimates.
You can see that by running the code below (I don't paste the results here).

Code:

*    Adding controls to (3) is different from adding control variables as matching variables in (1) and (2)
cap    drop    wgt
teffects nnmatch (`outcomevar'    `matchingvars'    `controls') (`treatvar'), nneighbor(1)  metric(mahalanobis) // (1)
kmatch md `treatvar'  `matchingvars'    `controls' (`outcomevar'), metric(mahalanobis) nn(1)  wgenerate(wgt)  // (2)
reg    `outcomevar'    `treatvar' `controls' [iweight=wgt]    //    (3)

Tags: bias, nearest neighbor matching
