  • Logic for using matched sample versus unmatched sample

    Hi all,

    I am stuck on the problem of when to use a matched versus an unmatched sample. I have a list of companies, and I am trying to figure out the effect of a treatment (a policy change in this case) on company financial performance (the dependent variable). I will use difference-in-differences for the analysis.
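    For concreteness, the kind of model I have in mind is roughly the following (just a sketch; roa, treated, post, and firm_id are placeholder names, not my actual variables):

        * Minimal difference-in-differences sketch (all names are placeholders)
        use firm_panel, clear                    // hypothetical firm-year panel
        xtset firm_id year
        * The DiD estimate is the coefficient on the treated#post interaction;
        * the treated main effect is absorbed by the firm fixed effects.
        xtreg roa i.treated##i.post, fe vce(cluster firm_id)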

    I do not know when, generally speaking, a matched sample design is better or worse than an unmatched sample design. To my understanding, matching "controls" for confounding variables before the regression models are run. With an unmatched sample, by contrast, one would add the control variables to the regression model to "control" for these factors.

    But broadly speaking, can someone please help me decide whether I should use a matched sample in this case (pros/cons), and why or why not?

    Roger C.

  • #2
    Adjusting for variables by adding them as covariates to a model is an imperfect way of dealing with them. It only works to the extent that the effect of those covariates on the outcome is actually the one specified by the variable included in the model (which might be the variable directly, or some transform thereof). If you put X in as a covariate in a linear regression when the actual effect of X is multiplicative, your adjustment will be inaccurate.
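    A quick simulated illustration of that point (a sketch only; all variable names are made up):

        * Sketch: Z confounds X and Y, but Z's effect on Y is exponential, so
        * entering Z linearly leaves residual confounding (names are made up)
        clear
        set seed 12345
        set obs 1000
        generate z = 1 + 9*runiform()
        generate x = rbinomial(1, invlogit(z - 5))   // X depends on Z
        generate y = exp(0.5*z) + rnormal()          // X truly has no effect on Y
        regress y x z   // linear adjustment: coefficient on x is biased away from 0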

    Matching always produces correct compensation for the effect of a missing variable, at least if you are doing exact matching. So why doesn't everyone use matched samples all the time? It's a matter of feasibility. When you start putting a matched data set together, you often find that there are cases for which no matching control is available. If you are only matching on one variable, and that variable has a friendly distribution, like sex in most contexts, then this problem tends not to arise. But if you try to match on several variables at once, or on a variable where cases and controls are likely to be very different and the distributions don't overlap much, you can end up with a large proportion of cases being unmatchable.
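    To see the feasibility problem concretely, here is one way to count the cases with no exact match (a sketch; treated, industry, and size are hypothetical variable names):

        * Sketch: how many treated firms lack an exact match on industry and
        * size decile? (all variable names hypothetical)
        xtile size_dec = size, nq(10)
        bysort industry size_dec: egen n_controls = total(1 - treated)
        count if treated == 1 & n_controls == 0   // unmatchable cases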

    Another drawback to matching is that it requires an analysis that accounts for the matching. In simple situations this is easy: a paired t-test, for example, accounts for pair matching. But in other situations, such as longitudinal data, matching adds an extra level of hierarchy to the data and pushes you into a multi-level model when you might otherwise prefer a simple fixed-effects panel regression.
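    For example (a sketch; pair_id, y_treated, and y_control are hypothetical variables identifying the matched pairs):

        * Sketch: two analyses that account for pair matching (names hypothetical)
        ttest y_treated == y_control   // paired t-test on one-row-per-pair data
        * or, with the data in long form, matched-pair fixed effects:
        xtset pair_id
        xtreg y i.treated, fe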

    So to summarize: matching is theoretically superior to adding a covariate to a model to adjust for its effect. But it has substantial feasibility limitations and can complicate the required analysis.

    Finally, unless you are doing an experiment, you cannot "control" for anything. I realize that the term "control variable" and the phrase "control for X" are widely used in the setting of observational data. But if you insist on using them, at least always remind yourself that it is an abuse of language. In observational data you do not control anything. The best you can do is adjust or otherwise analytically try to eliminate the effects of nuisance variables.

    • #3
      Clyde, thank you for your comprehensive response and summary. It is helpful. One clarifying question I have is about this statement: "It only works to the extent that the effect of those covariates on the outcome is actually the one specified by the variable included in the model." As I understand it, you are saying that "controlling" (I realize I am abusing the language here) for a covariate only works for the exact specification of the covariate entered into the model. But why is this not the case for matching? For example, if I enter the squared term of "company size" as a covariate in an unmatched regression model, why can I get away with matching just on "company size" (rather than on the squared term, assuming that is what really matters) in the matched design?

      Maybe I am misunderstanding, but I hear you saying that a proper matched design is easier in the sense that one does not have to specify the functional form of the matching variables precisely, whereas this is required when using covariates as controls in an unmatched sample. If that is incorrect, then I do not see the benefit of matching.

      Thanks again.

      • #4
        So, let's imagine we are trying to estimate the association between X and Y and we are concerned about a potentially confounding variable, Z.

        If there is a linear relationship between Z and Y, then adding Z as a covariate to a regression model of Y on X will correctly adjust for Z's confounding effect. But if the true relationship between Z and Y is, let us say, U-shaped over the range of values of Z and Y in the study, then just specifying Z as a covariate will incorrectly adjust for Z's effects and you will be left with residual confounding. Another similar problem might be if the Z:Y relationship actually depends on X. Then, unless you know ahead of time to include a Z#X interaction term, as well as Z itself, you will fail to correctly adjust for Z's confounding.
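        In Stata terms, correct adjustment in those situations would require something like the following (a sketch; y, x, and z are placeholders):

            * Sketch: specifications matching the true Z:Y relationship
            regress y i.x c.z##c.z   // quadratic in Z, for a U-shaped Z:Y relationship
            regress y i.x##c.z       // the effect of Z is allowed to depend on X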

        By contrast, if you can successfully match on Z, then there can be no confounding effect of Z in the analysis of this data. That's because, by virtue of the matching, the distribution of Z among cases and controls is exactly the same, so regardless of what effect Z has on Y (even if it is conditional on X, or non-linear, or badly behaved mathematically, or whatever) the effect is completely annihilated when you contrast the cases and controls. Note that this argument applies, strictly speaking, only to exact matching on Z. If you "caliper match" on Z (i.e. match on Z to within some range) then you may end up failing to remove Z's effect, although if the caliper is narrow enough and the Y:Z (or X:Z) relationship is continuous, then you can come arbitrarily close to perfectly annihilating Z's effect.
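        For illustration (a sketch; the variable names are hypothetical, and the caliper() option assumes Stata 14 or later):

            * Sketch: exact matching versus caliper matching (names hypothetical)
            teffects nnmatch (y z) (x), ematch(industry)   // exact match on industry
            teffects nnmatch (y z) (x), caliper(0.25) osample(nomatch)   // caliper on z; flags cases with no match within the caliper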

        So, I still reach the same conclusion: matching is theoretically preferable, but in practice it is often very difficult or impossible to implement. If you have a variable that is easy to match on, I would almost always use the matched study design (unless finding an analysis that properly accounts for the matching is too problematic). But many variables are hard to match on in the real world, and if you try to extend matching to several variables, even ones that are individually easy to match on, matching on combinations of them rapidly becomes difficult or impossible.

        • #5
          For an argument to use matching before regression, see
          Ho, D.E., Imai, K., King, G., and Stuart, E.A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3): 199-236.
          David Radwin
          Senior Researcher, California Competes
          californiacompetes.org
          Pronouns: He/Him

          • #6
            Thank you, Clyde, for taking the time to provide a detailed explanation. This is very helpful!

            Thanks, David, for the reference. I will take a look.
