psmatch: identifying which entities were matched

Markus Tiefenbacher

Join Date: Jun 2015

Posts: 3
#1


18 Jun 2015, 04:04

Dear Statalist,

I have a question regarding the treatment-effects features in Stata 14. I am currently exploring them using the cattaneo2 example dataset (Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154), running the model as described by Chuck Huber on the StataCorp LP YouTube channel.

code:
teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby)

With Stata 12 I used the user-written psmatch2 (version 4.0.11, 22oct2014, E. Leuven and B. Sianesi), which creates a number of variables for the user's convenience. First and foremost, _outcome_variable stores, for every treated observation, the value of the matched outcome.

Is there any option that allows me to identify which entities were matched, e.g. that mother 1 was matched with mother 47? Moreover, can Stata 14 store the value of each matched outcome, e.g. store the bweight of nonsmoking mother 1 in _bweight of smoking mother 47?

Thank you
Markus
Adam Badenoch

Join Date: Dec 2015

Posts: 20
#2

18 Dec 2015, 13:06

Hi Markus,

Unfortunately I can't answer your question but would instead like to build on it because I would like to know how to extract information about the matched sample.

In reality I am interested in a different data set which has a binary outcome but I think answering your question will help me too.

Like you, after running

Input:
use http://www.stata-press.com/data/r13/cattaneo2
teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby)

I would like to identify which treated patient (mbsmoke=smoker) is matched with which untreated patient (mbsmoke=nonsmoker).

However this is only a means to an end

I can determine the mean birthweight in the treatment group by using

Input:
keep if mbsmoke
summarize

Output:
3137.66g

However I can't work out how to determine the mean birthweight in the matched sample group (nonsmoker). Once we can identify the matched sample is there a way to calculate summary statistics for it such as the mean birthweight?

Thank you,

Adam
Enrique Pinzon (StataCorp)

StataCorp Employee

Join Date: Jan 2015

Posts: 216
#3

18 Dec 2015, 14:59

Hello Markus and Adam,

You will get what you want with the generate(stub) option, where stub specifies that the observation numbers of the nearest neighbors be stored in the new variables stub1, stub2, .... Note that the number of variables generated may be more than nneighbors(#) because of tied distances.
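
As a minimal sketch of that option on the example dataset used in this thread (the stub name "stub" is arbitrary, and the covariates are simply the ones from the first post):

```stata
webuse cattaneo2, clear
* generate(stub) stores the observation numbers of each observation's matches
teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), generate(stub)
* Show which match variables were created
describe stub*
* stub1 holds the observation number of the first match
list mbsmoke bweight stub1 in 1/5
```

Because the stored values are observation numbers, they are only meaningful while the data stay in the same order with no observations dropped.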
Adam Badenoch

Join Date: Dec 2015

Posts: 20
#4

18 Dec 2015, 18:13

Thanks Enrique,

Unfortunately I'm not sure how to interpret the new stub variables (sorry, I am a novice Stata user).

Input:
use http://www.stata-press.com/data/r13/cattaneo2
teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), gen(stub)

gives me 74 new stub variables.

How do I use these to identify my matched controls?

and

Is there a way to extract summary statistics such as the mean birthweight from the matched controls identified from these stub variables?

I was able to do this with the treated (smokers) group by

Input:
keep if mbsmoke
summarize

Output:
3137.66g (mean birthweight in the treated (smokers) group)

Perhaps there is an analogous "keep if" option that utilises the new stub variables to isolate the matched controls but I can't see how to make it work.

Thanks,

Adam

Enrique Pinzon (StataCorp)

StataCorp Employee

Join Date: Jan 2015
Posts: 216
#5

21 Dec 2015, 10:46

Hello Adam,

The stub variable you created tells you, for a smoker, which nonsmoker the individual is matched with and, for a nonsmoker, which smoker they are matched with. The number in the stub is the observation number of the match. If there is more than one match, there is more than one stub variable. Below is code that will allow you to assign potential birthweights for smokers and nonsmokers using the first match, stub1. Notice that you observe only one potential outcome for each individual and the other needs to be assigned. My code could be shorter, but I first want to generate the two potential outcomes to emphasize that they inherently produce missing values.

Code:

. clear

. webuse cattaneo2
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)

. quietly teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), gen(stub)

. // Observed bweight and missing counterfactual
. generate bweight_1 = bweight if mbsmoke==1
(3,778 missing values generated)

. generate bweight_0 = bweight if mbsmoke==0
(864 missing values generated)

. list mbsmoke bweight* in 1/10, noobs

  +-------------------------------------------+
  |   mbsmoke   bweight   bweigh~1   bweigh~0 |
  |-------------------------------------------|
  | nonsmoker      3459          .       3459 |
  | nonsmoker      3260          .       3260 |
  | nonsmoker      3572          .       3572 |
  | nonsmoker      2948          .       2948 |
  | nonsmoker      2410          .       2410 |
  |-------------------------------------------|
  | nonsmoker      3147          .       3147 |
  | nonsmoker      3799          .       3799 |
  | nonsmoker      3629          .       3629 |
  | nonsmoker      2835          .       2835 |
  | nonsmoker      3880          .       3880 |
  +-------------------------------------------+

. // Assigning matched value to missing counterfactual
. replace bweight_1 = bweight[stub1[_n]] if mbsmoke==0
(3,778 real changes made)

. replace bweight_0 = bweight[stub1[_n]] if mbsmoke==1
(864 real changes made)

. list mbsmoke bweight* in 1/10, noobs

  +-------------------------------------------+
  |   mbsmoke   bweight   bweigh~1   bweigh~0 |
  |-------------------------------------------|
  | nonsmoker      3459       3330       3459 |
  | nonsmoker      3260       2580       3260 |
  | nonsmoker      3572       3487       3572 |
  | nonsmoker      2948       2438       2948 |
  | nonsmoker      2410       3204       2410 |
  |-------------------------------------------|
  | nonsmoker      3147       3317       3147 |
  | nonsmoker      3799       2495       3799 |
  | nonsmoker      3629       3062       3629 |
  | nonsmoker      2835       2890       2835 |
  | nonsmoker      3880       3119       3880 |
  +-------------------------------------------+

Also, notice that it is not correct to just keep the smokers or non-smokers and then get an estimate from the subset you keep. In theory, every individual in the sample has a potential outcome when they smoke and when they do not smoke. We only observe one potential outcome for each individual but we construct treatment effects models to be able to assign values in both states and use all our sample.
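
One way to look at group-level summaries without discarding anyone is an if qualifier, which restricts a single command and leaves the data in memory untouched. The following is only a sketch under that idea: the variable name bw0_first is hypothetical, and it uses just the first match (stub1), so it will not reproduce the teffects estimates exactly.

```stata
webuse cattaneo2, clear
quietly teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), generate(stub)
* Observed mean birthweight among smokers, with all observations kept in memory
summarize bweight if mbsmoke==1
* Nonsmoking outcome: observed for nonsmokers, first match's outcome for smokers
generate bw0_first = cond(mbsmoke==1, bweight[stub1], bweight)
* Mean of the first-match counterfactual among smokers
summarize bw0_first if mbsmoke==1
```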

Last edited by Enrique Pinzon (StataCorp); 21 Dec 2015, 10:55.


Adam Badenoch

Join Date: Dec 2015

Posts: 20
#6

21 Dec 2015, 13:12

Thank you Enrique,

That is an excellent answer and the extra steps in your code and added explanations definitely helped my understanding.

Can I please just clarify one more thing?

If I calculate the mean birthweight in both groups using the code below, the difference between the means (3398.86g-3169.798g=229.06g) is similar to but not the same as the ATE displayed in the teffects output (203.97g; note that I've removed the "quietly" in the code below so the ATE is displayed). I would have thought these two numbers should be the same.

I.e., the mean of the differences between the observed and potential outcomes should equal the difference between the means of the two groups.

Any thoughts on why this difference exists?

(This wasn't the main purpose for me calculating these statistics but I did think it would act as an extra check and balance along the way).

Code:
. clear
. webuse cattaneo2
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)
. teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), gen(stub)
. // Observed bweight and missing counterfactual
. generate bweight_1 = bweight if mbsmoke==1
(3,778 missing values generated)
. generate bweight_0 = bweight if mbsmoke==0
(864 missing values generated)
. // Assigning matched value to missing counterfactual
. replace bweight_1 = bweight[stub1[_n]] if mbsmoke==0
(3,778 real changes made)
. replace bweight_0 = bweight[stub1[_n]] if mbsmoke==1
(864 real changes made)
. summarize bweight_0
. summarize bweight_1

Many thanks,

Adam
Enrique Pinzon (StataCorp)

StataCorp Employee

Join Date: Jan 2015

Posts: 216
#7

22 Dec 2015, 09:29

Hello Adam,

You do not get exactly the same answer because the ATE is computed using all the matches, whereas the example I gave you used only stub1. I was more focused on showing you how to use the generate() option to obtain the matches. The computational details are in the Methods and formulas section of the [TE] manual and will require that you do a bit more heavy lifting.
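
To make "using all the matches" concrete, here is one way to write it down. The notation is an assumption for illustration and is not taken from the manual: $y_i$ is the observed outcome, $t_i$ the treatment indicator, $J_i$ the set of observations listed in the non-missing stub variables for observation $i$, and tied matches are assumed to be averaged with equal weight.

```latex
\hat{y}_i(0) =
  \begin{cases}
    y_i & \text{if } t_i = 0 \\[4pt]
    \dfrac{1}{|J_i|} \sum_{j \in J_i} y_j & \text{if } t_i = 1
  \end{cases}
\qquad
\hat{y}_i(1) =
  \begin{cases}
    \dfrac{1}{|J_i|} \sum_{j \in J_i} y_j & \text{if } t_i = 0 \\[4pt]
    y_i & \text{if } t_i = 1
  \end{cases}

\widehat{\mathrm{ATE}} = \frac{1}{N} \sum_{i=1}^{N} \bigl[ \hat{y}_i(1) - \hat{y}_i(0) \bigr]
```

Using only stub1 amounts to replacing each $J_i$ with its first element, which is why the two numbers differ whenever some observations have more than one match.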
Adam Badenoch

Join Date: Dec 2015

Posts: 20
#8

22 Dec 2015, 18:44

Thanks Enrique,

I had a read through the Stata manual for teffects, ATE and psmatch again, including the methods and formulas, but couldn't work out how the stub variable is used in this calculation.

To help me progress are you able to explain the calculation of ATE for 2 individuals in the data set for me?

Sorry, I always seem to have more questions on this thread. If I understand how to generate the counterfactual outcomes for the two patients below, I will try to do the rest of the heavy lifting on the remaining stub variables myself. This will allow me to present the results as a relative effect (ATE/matched non-smoker birthweight) and an absolute effect (ATE), which is an important issue for anyone presenting treatment effect outcomes after ps-matching.

Patient 1, non-smoker, bweight=3459, stub1=2729 (bweight=3330), bweight_0=3459, bweight_1=3330, TE1 = 3330-3459 = -129g
and
Patient 247, smoker, bweight=3289, stub1=3372 (bweight=3742), stub2=3710 (bweight=3459), bweight_0=?, bweight_1=3289, TE2 = 3289-? = ?

The crux of it is: how do I calculate bweight_0 for Patient 247 from the 2 stub variables? Are they averaged? Are they weighted differently?

Hypothetically, using just these two patients, I think the ATE is calculated as: ATE=(TE1+TE2)/2 = (-129+?)/2

A separate but related question: assuming nneighbor(1), which is the default in this example, I expected a 1:1 matching ratio of matched patients. Is my understanding here flawed, given the multiple stub variables being incorporated into the calculation of counterfactual outcomes for most of the patients in this dataset?

Thanks,

Adam

Enrique Pinzon (StataCorp)

StataCorp Employee

Join Date: Jan 2015
Posts: 216
#9

23 Dec 2015, 08:13

Hello Adam,

Below is the code that will help you estimate what you want. I will first show the code without any output, explain what is going on, and then will show the code with output. I am using Mata because I think it is easier to do the computations there.

Code:

clear

webuse cattaneo2

teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), gen(stub)

gen byte touse = e(sample)

mata:
i = st_data(.,"stub*","touse")
n = cols(i):-rowmissing(i)
t = st_data(.,"mbsmoke","touse")
w = w1 = w0 = st_data(.,"bweight","touse")

for (j=1; j<=rows(i); j++) {
        if (t[j]) {
                w0[j] = mean(w[i[|j,1\j,n[j]|]])
        }
        else {
                w1[j] = mean(w[i[|j,1\j,n[j]|]])
        }
}
mean(w1)-mean(w0)
mean(w1)
mean(w0)
end

Let's go directly to the Mata code. First, I create a matrix i that holds all the stub* variables; I pull it from Stata using Mata's st_data() function. Then I create the vector n, which counts the non-missing elements in each row of i, in other words, the number of matches. The variables t, w1, w0, and w are the treatment, the potential outcomes, and the observed outcome; again, I pull them from Stata using st_data(). In the loop, we go through all the observations and check whether the individual is treated. If they are treated, we are missing w0, so we assign to that observation the mean of the elements of the observed w at the positions given by the stub variables in that row. If they are not treated, we fill in w1 the same way. We use the mean() function and a cool feature of Mata that lets you grab blocks from a matrix. In this case, we grab from matrix i elements (j, 1) through (j, n[j]). So, for the third row we would grab elements (3, 1) through (3, n[3]), where n[3] is the position of the last non-missing value, i.e., the last match. The way to grab this block of the matrix i is to type i[|j,1 \ j,n[j]|]. (We only needed a row, but we can obtain entire blocks... awesome, really.) The result is:

Code:

. clear

. webuse cattaneo2
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)

. teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), gen(stub)

Treatment-effects estimation                   Number of obs      =      4,642
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =         74
------------------------------------------------------------------------------
             |              AI Robust
     bweight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ATE          |
     mbsmoke |
    (smoker  |
         vs  |
 nonsmoker)  |  -203.9734   35.31088    -5.78   0.000    -273.1814   -134.7653
------------------------------------------------------------------------------

. gen byte touse = e(sample)

. mata:
------------------------------------------------- mata (type end to exit) -----
: i = st_data(.,"stub*","touse")

: n = cols(i):-rowmissing(i)

: t = st_data(.,"mbsmoke","touse")

: w = w1 = w0 = st_data(.,"bweight","touse")

: for (j=1; j<=rows(i); j++) {
>         if (t[j]) {
>                 w0[j] = mean(w[i[|j,1\j,n[j]|]])
>         }
>         else {
>                 w1[j] = mean(w[i[|j,1\j,n[j]|]])
>         }
> }

: mean(w1)-mean(w0)
  -203.9733742

: mean(w1)
  3203.439873

: mean(w0)
  3407.413247

: end
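
To see the range subscript and the averaging on a tiny case, here is a toy sketch using the birthweights Adam quoted for Patient 247 in post #8. The three-row vector and the index row are hypothetical stand-ins for the real data; only the birthweight values come from the thread.

```stata
mata:
// Hypothetical 3-observation outcome vector: row 1 is the smoker,
// rows 2 and 3 are her two matched nonsmokers
w = (3289 \ 3742 \ 3459)
// Row of match indices for observation 1, padded with a missing value
i = (2, 3, .)
// Number of matches = columns minus missing entries
n = cols(i) :- rowmissing(i)
// Range subscript: columns 1 through n of row 1
idx = i[|1,1 \ 1,n|]
// Imputed nonsmoking outcome: equal-weight mean of the matched outcomes
w0 = mean(w[idx])
w0
// Treatment effect for this observation (smoking minus nonsmoking)
w[1] - w0
end
```

With these numbers the imputed nonsmoking birthweight is (3742+3459)/2 = 3600.5, so the effect for this observation is 3289-3600.5 = -311.5.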

With regard to your other question, even if you ask for only one match, if there are ties we include them all.
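
If you want to see how common ties are, one possible check is to count the non-missing stub variables in each row. This is a sketch, and the variable name nmatch is made up for the example.

```stata
webuse cattaneo2, clear
quietly teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), generate(stub)
* Count the non-missing stub variables in each row, i.e. the number of matches
egen nmatch = rownonmiss(stub*)
* Distribution of matches per observation; values above 1 reflect ties
tabulate nmatch
```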


Adam Badenoch

Join Date: Dec 2015

Posts: 20
#10

23 Dec 2015, 18:59

Thanks Enrique,

That is excellent. The code does exactly what I want for ATE.

I only have one question left.

How do I adapt this code for the ATET output (such that w1-w0=ATET [-245.711g])?

Again, I plan to use this to allow me to present the treatment effect as a relative ATET in addition to presenting the absolute ATET value provided by the teffects psmatch output.

I tried entering

Code:
keep if (mbsmoke)

between the teffects psmatch code and your Mata code, but this resulted in the error <istmt>: 3301 subscript invalid.

I have never used Mata code before and am not sure how to adjust it (sorry, I know there is a dedicated Mata section on this forum, but this problem seems intrinsically linked to this thread so I thought it best to post here).

Many thanks for your persistence on this thread.

Adam

Enrique Pinzon (StataCorp)

StataCorp Employee

Join Date: Jan 2015
Posts: 216

#11

28 Dec 2015, 08:00

Hello Adam,

Please see the code below. I have used the select() function in Mata to grab only the treated observations. As a point of emphasis, you always use all of your data to calculate the potential outcomes. In the case of the ATET, after you calculate the potential outcomes, you average the difference of the potential outcomes for the treated subpopulation. No need to drop or keep.

Code:

. clear

. webuse cattaneo2
(Excerpt from Cattaneo (2010) Journal of Econometrics 155: 138-154)

. teffects psmatch (bweight) (mbsmoke mmarried mage medu fbaby), gen(stub) atet

Treatment-effects estimation                   Number of obs      =      4,642
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =         74
------------------------------------------------------------------------------
             |              AI Robust
     bweight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ATET         |
     mbsmoke |
    (smoker  |
         vs  |
 nonsmoker)  |   -245.711   26.38675    -9.31   0.000    -297.4281   -193.9939
------------------------------------------------------------------------------

. gen byte touse = e(sample)

.
. mata:
------------------------------------------------- mata (type end to exit) -----
: i   = st_data(.,"stub*","touse")

: n   = cols(i):-rowmissing(i)

: t   = st_data(.,"mbsmoke","touse")

: w   = w1 = w0 = st_data(.,"bweight","touse")

: tet = J(rows(w), 1, 0)

: for (j=1; j<=rows(i); j++) {
>         if (t[j]) {
>                 w0[j] = mean(w[i[|j,1\j,n[j]|]])
>         }
>         else {
>                 w1[j] = mean(w[i[|j,1\j,n[j]|]])
>         }
>         tet[j] = w1[j] - w0[j]
> }

: mean(select(tet, t))
  -245.710988

: mean(select(w1, t))
  3137.659722

: mean(select(w0, t))
  3383.37071

:
: end
-------------------------------------------------------------------------------
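
The quantity that mean(select(tet, t)) computes can be written out as follows. The notation is an assumption for illustration and is not taken from the manual: $y_i$ is the observed outcome, $t_i$ the treatment indicator, $N_1$ the number of treated observations, and $J_i$ the set of observations listed in the non-missing stub variables for observation $i$.

```latex
\widehat{\mathrm{ATET}}
  = \frac{1}{N_1} \sum_{i \,:\, t_i = 1} \bigl[ y_i - \hat{y}_i(0) \bigr],
\qquad
\hat{y}_i(0) = \frac{1}{|J_i|} \sum_{j \in J_i} y_j
```

The sum runs over treated observations only, but each $\hat{y}_i(0)$ is built from untreated observations, which is why the full dataset has to stay in memory.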


Adam Badenoch

Join Date: Dec 2015

Posts: 20
#12

28 Dec 2015, 15:32

That is great,

Thank you Enrique.

Sorry, I do understand the importance of using all counterfactual outcomes in these calculations but was trying to demonstrate my aim to you without knowing how to write the Stata code for it. Your explanations and clarifications have been excellent, and my drop/keep options were simply an awkward attempt to convey to you what I was trying to do.

To summarise this thread I would interpret the output of the code as follows:

When expressing results as ATE:
ATE=-203.97, which indicates the average birthweight in the smoking group is 203.97g lower than the non-smoking group.
The average birthweight in the non-smoking group is 3407.41g.
Therefore the ATE represents a relative reduction in birthweight of (203.97/3407.41)*100=5.99%

If expressing the results as ATET:
ATET=-245.71, which indicates the average birthweight in the smoking group is 245.71g lower than the non-smoking group when analysis of the data is restricted to mothers who smoked (both the actual smokers observed outcome and the actual non-smokers counterfactual outcome).
The average birthweight in the non-smoking group is 3383.37g.
Therefore the ATET represents a relative reduction in birthweight of (245.71/3383.37)*100=7.26%
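
The two percentages can be reproduced with a couple of lines of arithmetic. This is only a sketch; the scalar names are made up, and the figures are copied from the Mata output earlier in the thread.

```stata
* Absolute effects and nonsmoking means copied from the Mata output above
scalar rel_ate  = 100 * 203.9733742 / 3407.413247
scalar rel_atet = 100 * 245.710988  / 3383.37071
* Relative reductions in percent, to two decimals
display %5.2f rel_ate
display %5.2f rel_atet
```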

If any of these interpretations are incorrect please let me know.

Thanks,

Adam
Enrique Pinzon (StataCorp)

StataCorp Employee

Join Date: Jan 2015

Posts: 216
#13

29 Dec 2015, 11:22

Hello Adam,

The interpretation of treatment effects is counterfactual rather than smokers vs non-smokers. I would state that the ATE is the average birth weight difference if all mothers smoke relative to the case where no mothers smoke. Notice that this is reflected in our computation. We have the potential birth weight for all mothers if they smoke and for all mothers if they do not smoke, and we take the average of the differences. The logic is that we think about everyone in our sample when they smoke and when they do not smoke. For the ATET the interpretation is similar, but for the subset of those who received the treatment.
Adam Badenoch

Join Date: Dec 2015

Posts: 20
#14

29 Dec 2015, 12:09

Ok,

Thanks Enrique I think I understand but I'll just paraphrase the ATET specifically for this scenario to double check I have it straight.

The ATET in this scenario is the average difference in birthweight between the actual birthweight and the counterfactual birthweight in all mothers who smoked.

Adam
Adam Badenoch

Join Date: Dec 2015

Posts: 20
#15

07 Jun 2016, 20:25

Dear Statalist,

When using the Mata code above for a different dataset, I receive the following error:

<istmt>: 3301 subscript invalid
(4 lines skipped)
--------------------------------------------------------------------------------------------------------------------
r(3301);

end of do-file

r(3301);

The error seems to stem from the fact that there are a number of subjects not matched to any nearest neighbours (revealed by osample(newvar) and not a problem when using the sample dataset earlier in this thread). I don't want to increase the caliper width in order to ensure all subjects are matched as this will result in poorer matches.

I have tried
drop if stub1=. after running the teffects psmatch code and before the Mata code, but this doesn't fix the problem.

I also tried manually eliminating the subjects with stub1=. from my dataset, then relaunching Stata and running the same teffects psmatch and Mata code on the altered dataset. This allowed the Mata code to run, but teffects psmatch gives a different output compared with using my original dataset.

Does anyone know how to alter the Mata code from this thread to deal with my problem?

Thanks,

Adam