Dear Statalist Users,
I would like to ask for your advice on the nearest-neighbor matching (NNM) estimator: how can I correct the bias in an NNM-based linear regression (possibly with control variables or interaction terms)?
My goal is to estimate interaction effects, with controls, on samples matched via NNM.
Stata's official command, teffects nnmatch, does not support adding extra variables (controls or interaction terms) to the NNM estimator. I assume this is intended (as written in this post) because the matching estimator is nonparametric: it does not rely on a functional form but estimates the effect as the difference in weighted averages between treated and control units.
One workaround I found is to regress the outcome variable on the treatment and the control variables, using the weights generated by NNM.
Since teffects nnmatch does not generate weights, I obtained them with the user-written "kmatch md" command.
Here's an example using Stata's automobile data.
Suppose I estimate the effect of "foreign" on "price" via NNM on three variables (mpg, headroom, trunk) with k=1.
I computed the NNM estimate using three different methods: (1) teffects nnmatch, (2) kmatch md, and (3) linear regression using the weights generated by NNM (the workaround).
As you can see below, all three methods generate identical point estimates (1599.622) but different standard errors.
(I am curious why the standard errors are so different, but let's put that on the back burner for now.)
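The identical point estimates are what one would expect if the matching estimator is read as a weighted difference in means. A sketch in my own notation, assuming the weights stored by wgenerate() equal $\omega_i = 1 + K(i)$, where $J(l)$ is the set of matches for unit $l$ and $K(i)$ is the total weight unit $i$ receives as a match for units in the other group (tied matches share a total weight of one):

```latex
\begin{align*}
\hat\tau_{\text{NNM}}
  &= \frac{1}{N}\sum_{i=1}^{N} (2W_i - 1)\,\omega_i\,Y_i,
  \qquad
  \omega_i = 1 + K(i),
  \qquad
  K(i) = \sum_{l \,:\, i \in J(l)} \frac{1}{|J(l)|}, \\
\sum_{i:\,W_i=1}\omega_i &= \sum_{i:\,W_i=0}\omega_i = N
  \quad\Longrightarrow\quad
\hat\tau_{\text{NNM}}
  = \frac{\sum_{i:\,W_i=1}\omega_i Y_i}{\sum_{i:\,W_i=1}\omega_i}
  - \frac{\sum_{i:\,W_i=0}\omega_i Y_i}{\sum_{i:\,W_i=0}\omega_i}.
\end{align*}
```

The right-hand side is the coefficient on the treatment in a weighted regression of $Y_i$ on $W_i$ with weights $\omega_i$. Under this assumption the weights sum to $2N = 148$, which matches the "Number of obs = 148" that reg reports with iweights, and that inflated count may be part of why its standard error is the smallest of the three.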
So it might make sense to use linear regression with NNM-based weights (3) to estimate heterogeneous effects with controls, by adding controls and interaction terms to the linear regression.
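To make the idea concrete, here is a minimal sketch of what (3) with a control and an interaction term could look like. The choice of length as the moderator and the evaluation point of 180 inches are purely illustrative assumptions, not something I have settled on; the kmatch call is the same one used in (2), and the sketch assumes kmatch is already installed.

```stata
* Minimal sketch of workaround (3) with a control and an interaction term.
* Assumption: length as the moderator is an arbitrary illustrative choice.
sysuse auto, clear

* Step 1: NNM on mpg, headroom, trunk; store the ATE matching weights in wgt
kmatch md foreign mpg headroom trunk (price), metric(mahalanobis) nn(1) wgenerate(wgt)

* Step 2: weighted regression with a foreign-by-length interaction,
*         controlling for displacement
regress price i.foreign##c.length displacement [iweight=wgt]

* Step 3: implied effect of foreign at length = 180 (illustrative value)
lincom 1.foreign + 180*1.foreign#c.length
```

This regression is not bias-corrected, and its standard errors come from reg with iweights, so they should not be expected to agree with those of (1) or (2).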
But here's the problem: the NNM estimator is biased when matching on continuous variables, so it needs to be corrected (Abadie and Imbens 2006, 2010).
Unfortunately, while (1) and (2) can correct the bias, (3) cannot.
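For reference, the bias correction adjusts each imputed counterfactual by a regression prediction rather than by a weight. A sketch in my own notation, assuming a regression function $\hat\mu_w(x)$ fitted within each treatment group $w$ on the bias-adjustment covariates (here mpg, headroom, trunk), with $J(i)$ the set of matches for unit $i$ and $\hat\tau_{\text{NNM}}$ the uncorrected matching estimate:

```latex
\begin{align*}
\hat Y_i(0) &=
  \begin{cases}
    Y_i, & W_i = 0,\\[4pt]
    \dfrac{1}{|J(i)|}\displaystyle\sum_{j \in J(i)}
      \bigl[\,Y_j + \hat\mu_0(X_i) - \hat\mu_0(X_j)\bigr], & W_i = 1,
  \end{cases}
\\[6pt]
\hat Y_i(1) &=
  \begin{cases}
    \dfrac{1}{|J(i)|}\displaystyle\sum_{j \in J(i)}
      \bigl[\,Y_j + \hat\mu_1(X_i) - \hat\mu_1(X_j)\bigr], & W_i = 0,\\[4pt]
    Y_i, & W_i = 1,
  \end{cases}
\\[6pt]
\hat\tau_{\text{bc}} &= \frac{1}{N}\sum_{i=1}^{N}\bigl[\hat Y_i(1) - \hat Y_i(0)\bigr],
\\[6pt]
\hat\tau_{\text{bc}} - \hat\tau_{\text{NNM}}
  &= \frac{1}{N}\sum_{i=1}^{N} (1 - 2W_i)\,
     \frac{1}{|J(i)|}\sum_{j \in J(i)}
     \bigl[\hat\mu_{1-W_i}(X_i) - \hat\mu_{1-W_i}(X_j)\bigr].
\end{align*}
```

The last line depends on the covariate gaps between matched units and on the fitted $\hat\mu_w$, not on the matching weights. That would be consistent with the weights, and therefore (3), staying the same whether or not bias adjustment is requested.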
And here are the results: (1) and (2) give identical bias-corrected estimates (2190.491), while (3) is unchanged (1599.622).
The bias is too large to ignore.
So here's my question: is there a way to correct the bias in (3), the linear regression with NNM-based weights?
Or alternatively, is there a way to add control variables or interaction terms in either (1) or (2)?
* Adding control variables to the linear regression (3) is different from adding the same controls to the list of matching variables in NNM; the two approaches generate very different estimates.
You can see that by running the code below (I don't paste the results here).
Code:
sysuse auto, clear
ssc install kmatch, replace

loc outcomevar price
loc treatvar foreign
loc matchingvars mpg headroom trunk
loc controls length displacement

* Comparing "teffect nnmatch" and "kmatch md"
teffects nnmatch (`outcomevar' `matchingvars') (`treatvar'), nneighbor(1) metric(mahalanobis) // (1)
kmatch md `treatvar' `matchingvars' (`outcomevar'), metric(mahalanobis) nn(1) wgenerate(wgt) // (2)
reg `outcomevar' `treatvar' [iweight=wgt] // (3)
Code:
. * Comparing "teffect nnmatch" and "kmatch md"
. teffects nnmatch (`outcomevar' `matchingvars') (`treatvar'), nneighbor(1) metric(mahalanobis) // (1)

Treatment-effects estimation                   Number of obs      =         74
Estimator      : nearest-neighbor matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Distance metric: Mahalanobis                                  max =          2
----------------------------------------------------------------------------------------
                       |              AI robust
                 price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-----------------------+----------------------------------------------------------------
ATE                    |
               foreign |
(Foreign vs Domestic)  |   1599.622   790.9518     2.02   0.043     49.38464    3149.859
----------------------------------------------------------------------------------------

. kmatch md `treatvar' `matchingvars' (`outcomevar'), metric(mahalanobis) nn(1) wgenerate(wgt) // (2)

Multivariate-distance nearest-neighbor matching

                                                 Number of obs   =          74
                                                 Neighbors:  min =           1
Treatment   : foreign = 1                                    max =           2
Metric      : mahalanobis
Covariates  : mpg headroom trunk

Matching statistics
------------------------------------------------------------------------------
           |             Matched             |           Controls
           |       Yes         No      Total |      Used     Unused      Total
-----------+---------------------------------+--------------------------------
   Treated |        22          0         22 |        14         38         52
 Untreated |        52          0         52 |        15          7         22
  Combined |        74          0         74 |        29         45         74
------------------------------------------------------------------------------

Treatment-effects estimation
------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         ATE |   1599.622    1026.38     1.56   0.123     -445.951    3645.194
------------------------------------------------------------------------------

Stored variables

    Variable      Storage   Display    Value
        name         type    format    label      Variable label
------------------------------------------------------------------------------
wgt               double    %10.0g                Matching weights for ATE

. reg `outcomevar' `treatvar' [iweight=wgt] // (3)

      Source |       SS           df       MS      Number of obs   =       148
-------------+----------------------------------   F(1, 146)       =     10.56
       Model |  94675205.3         1  94675205.3   Prob > F        =    0.0014
    Residual |  1.3091e+09       146  8966657.43   R-squared       =    0.0674
-------------+----------------------------------   Adj R-squared   =    0.0611
       Total |  1.4038e+09       147  9549708.78   Root MSE        =    2994.4

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     foreign |   1599.622   492.2825     3.25   0.001     626.7012    2572.542
       _cons |   5699.203   348.0963    16.37   0.000     5011.244    6387.161
------------------------------------------------------------------------------
Code:
* With bias correction (Abadie and Imbens 2006, 2010)
cap drop wgt
teffects nnmatch (`outcomevar' `matchingvars') (`treatvar'), nneighbor(1) vce(robust) metric(mahalanobis) gen(matched) biasadj(`matchingvars') // (1)
kmatch md `treatvar' `matchingvars' (`outcomevar' = `matchingvars'), metric(mahalanobis) nn(1) wgenerate(wgt) // (2)
reg `outcomevar' `treatvar' [iweight=wgt] // (3)
Code:
. * With bias correction (Abadie and Imbens 2006, 2010)
. cap drop wgt
. teffects nnmatch (`outcomevar' `matchingvars') (`treatvar'), nneighbor(1) vce(robust) metric(mahalanobis) gen(matched) biasadj(`matchingvars') // (1)

Treatment-effects estimation                   Number of obs      =         74
Estimator      : nearest-neighbor matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Distance metric: Mahalanobis                                  max =          2
----------------------------------------------------------------------------------------
                       |              AI robust
                 price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-----------------------+----------------------------------------------------------------
ATE                    |
               foreign |
(Foreign vs Domestic)  |   2190.491   762.2136     2.87   0.004     696.5796    3684.402
----------------------------------------------------------------------------------------

. kmatch md `treatvar' `matchingvars' (`outcomevar' = `matchingvars'), metric(mahalanobis) nn(1) wgenerate(wgt) // (2)

Multivariate-distance nearest-neighbor matching

                                                 Number of obs   =          74
                                                 Neighbors:  min =           1
Treatment   : foreign = 1                                    max =           2
Metric      : mahalanobis
Covariates  : mpg headroom trunk
RA equations: price = mpg headroom trunk _cons

Matching statistics
------------------------------------------------------------------------------
           |             Matched             |           Controls
           |       Yes         No      Total |      Used     Unused      Total
-----------+---------------------------------+--------------------------------
   Treated |        22          0         22 |        14         38         52
 Untreated |        52          0         52 |        15          7         22
  Combined |        74          0         74 |        29         45         74
------------------------------------------------------------------------------

Treatment-effects estimation
------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         ATE |   2190.491   537.9627     4.07   0.000     1118.333    3262.649
------------------------------------------------------------------------------

Stored variables

    Variable      Storage   Display    Value
        name         type    format    label      Variable label
------------------------------------------------------------------------------
wgt               double    %10.0g                Matching weights for ATE

. reg `outcomevar' `treatvar' [iweight=wgt] // (3)

      Source |       SS           df       MS      Number of obs   =       148
-------------+----------------------------------   F(1, 146)       =     10.56
       Model |  94675205.3         1  94675205.3   Prob > F        =    0.0014
    Residual |  1.3091e+09       146  8966657.43   R-squared       =    0.0674
-------------+----------------------------------   Adj R-squared   =    0.0611
       Total |  1.4038e+09       147  9549708.78   Root MSE        =    2994.4

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     foreign |   1599.622   492.2825     3.25   0.001     626.7012    2572.542
       _cons |   5699.203   348.0963    16.37   0.000     5011.244    6387.161
------------------------------------------------------------------------------
Code:
* Adding controls to (3) is different from adding the controls as matching variables in (1) and (2)
cap drop wgt
teffects nnmatch (`outcomevar' `matchingvars' `controls') (`treatvar'), nneighbor(1) metric(mahalanobis) // (1)
kmatch md `treatvar' `matchingvars' `controls' (`outcomevar'), metric(mahalanobis) nn(1) wgenerate(wgt) // (2)
reg `outcomevar' `treatvar' `controls' [iweight=wgt] // (3)