  • Coarsened exact matching ("cem" package in Stata)

    I am trying to understand whether an intervention ("tha") is associated with increased 12-month mortality ("rip") using StataSE 13.0. My first step was to match the groups across a number of variables ("age", "sex", "preopasa", "premob", and "origin") using the "cem" package for coarsened exact matching. My understanding from the literature around CEM is that researchers should then continue with their analyses (e.g. multivariable regression) as normal but using the matched groups.

    When I run the matching code:

    ssc install cem
    cem age sex preopasa premob origin, treatment(tha) autocuts(fd)


    It appears to work: almost all patients (29,181/29,267) are allocated to 56 matched strata, and Stata creates a number of new variables (cem_strata, cem_matched, and cem_weights), which seems to be what is expected.
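    For reference, the new variables can be inspected with something like:

    * how many observations ended up matched
    tab cem_matched
    * distribution of the CEM weights (unmatched observations get weight 0)
    summarize cem_weights
    * number and size of the matched strata
    codebook cem_strata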

    To my non-statistician mind, I imagined that those covariates would then become less significant in any subsequent regression models. However, this doesn’t appear to be the case.

    When I run code that I believe should run a logistic regression model using the matched weights:

    logistic rip age sex preopasa premob origin tha [iweight=cem_weights]

    [Screenshot: logistic regression output with CEM weights]


    The output is barely any different from – and in some cases shows larger odds ratios and wider confidence intervals than – the output when I run the same model without the CEM weights:

    logistic rip age sex preopasa premob origin tha

    [Screenshot: logistic regression output without CEM weights]


    I realise that Statalist might be the wrong place to ask about this, but I wanted to check that I am using cem correctly before concluding that I have simply misunderstood how CEM works as a technique. I would also be interested to know whether there are any Stata tricks that can be used to check whether the matching was successful. If anyone has insights into or experience with cem, please let me know.

  • #2
    Canonically, you should not have to include the covariates in the weighted regression model: logistic rip tha [iweight=cem_weights]

    Matching used in this way is intended to control for confounding. In a perfect world, the covariate differences fall out and you get a measure of association similar to the point estimate from the full multivariable model (logistic rip age sex preopasa premob origin tha), ideally with tighter precision (a narrower 95% CI). The reality with large datasets is that this never completely works, even if the global balance between groups (which is what CEM is designed to improve) does get better.
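    To make that comparison concrete, a sketch using estimates store / estimates table to line up the two sets of odds ratios (the model specifications are the ones from your post):

    * matched, weighted model with the treatment only
    logistic rip tha [iweight=cem_weights]
    estimates store cem_only

    * full multivariable model on the unmatched data
    logistic rip age sex preopasa premob origin tha
    estimates store full_model

    * side-by-side odds ratios for comparison
    estimates table cem_only full_model, eform b(%9.3f)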

    Aside 1: To evaluate the effectiveness of CEM, compute the L1 statistic for both the unmatched and matched cohorts (it is a measure of the global imbalance between the groups). Your goal is to make L1 go down. Note that the actual numbers do not really matter in themselves; L1 just needs to get smaller (akin to using AIC to compare overall model fit against the number of parameters used).
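    As a sketch (imb is the imbalance-checking command installed with the cem package; restricting it to the matched observations is only a rough post-matching check because it ignores the CEM weights, and cem itself reports the post-matching multivariate L1 in its output):

    * multivariate L1 imbalance before matching
    imb age sex preopasa premob origin, treatment(tha)

    * rough check of imbalance among the matched observations
    imb age sex preopasa premob origin if cem_matched==1, treatment(tha)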


    Aside 2: Assuming that global imbalance did improve, CEM still may not completely account for differences in all of the matched covariates. Replicate Table 1 within the matched cohort and check the extent to which things changed. How different are the distributions of the matched covariates (forget about what the p-values say)? Are they clinically/meaningfully different?
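    A sketch of how that might look in Stata (weighted summaries for continuous covariates, weighted tabulations for categorical ones; whether aweights are the right choice for these descriptives is a judgment call):

    * continuous covariate: weighted means by treatment in the matched sample
    tabstat age [aweight=cem_weights] if cem_matched==1, by(tha) stat(mean sd)

    * categorical covariates: weighted column percentages by treatment
    tab sex tha [aweight=cem_weights] if cem_matched==1, col nofreq
    tab preopasa tha [aweight=cem_weights] if cem_matched==1, col nofreq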

    If they are not, matching did its job. Move on. You are fine. In a large dataset like a trauma/hip fracture registry, small differences are going to be statistically significant. That is an artifact of the power that a sample size in the thousands yields and of the inherent limitations of relying on frequentist statistics (i.e. p=0.05 is an arbitrary threshold at the end of the day).

    If they are still concerningly different, confounding remains. This can happen; matching, especially CEM, is not perfect. This is why researchers sometimes end up including the matched covariates in their final model. When you do, especially if CEM did not really change the distributions in your dataset, you are effectively running the same multivariable model that you did before but on fewer observations. Your sample size has gone down and you have removed some subset of your population, and the combined influence of these two things is likely what produces the different point estimates and wider confidence intervals. From an epidemiologic perspective, take a look at who got kicked out. Are they in some way systematically different from the rest of your population? Does that explain the changes in point estimates that you see? If all else fails, double-check that your CEM code ran correctly and that the output looks right.
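    A rough sketch of how you could look at the excluded patients (simple unweighted comparisons of matched versus unmatched observations, using the cem_matched indicator):

    * how many observations were dropped by the matching
    tab cem_matched

    * compare covariates and the outcome across matched vs. unmatched patients
    tabstat age, by(cem_matched) stat(mean sd)
    tab origin cem_matched, col nofreq
    tab rip cem_matched, col nofreq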

    A great resource for more information can be found here: http://gking.harvard.edu/files/gking.../cem-stata.pdf
