
  • Estimation results from a non-random sub-sample of a sample

    Dear all,
    I have an econometrics question/problem for you:

    Assume that you examine the impact of X1 & X2 on Y based on a sample and obtain the respective coefficients. Let's assume further that you create sub-samples from that sample based on the distribution of X1 (cut at the 25th, 50th, 75th, and 90th percentiles), so that each sub-sample covers one range of X1. For example, one of the sub-samples would contain the observations with X1 <= 25th percentile, a second would contain those with X1 > 25th percentile & X1 <= 50th percentile, and so on.

    I believe that such a methodology, and the coefficients obtained from it, are problematic, primarily because the sample selection (i.e. the sub-samples) is non-random, as it is based on the distribution of X1. However, I have seen results based on such methods published in “decent” journals. That said, I cannot provide the theoretical justification or statistical intuition for why the coefficients obtained with this method would be problematic or inferior compared to the estimates obtained from the entire sample.

    Can anyone educate me on the disadvantages or consequences of using such a non-random sub-sample? Also, are there any journal articles or books discussing this issue that I could cite?

    Thanks in advance,
    Rishav

  • #2
    Why would any answer to that question only apply to economics? 1+1=2 regardless of whether you are an economist, or a psychologist, or a geographer, or... So let's call this a statistical problem.

    Selection is a problem if you select on the explained/dependent/left-hand-side/y-variable, but not if you select on the explanatory/independent/right-hand-side/x-variable.

    With regression you are looking at the distribution of \(y\) conditional on \(x\): \(f(y|x)\). We will simplify the problem by looking at one \(x\), but once the proof is done you can easily see that it generalizes to more than one \(x\). Also by describing regression as looking at the conditional distribution this result applies to all kinds of regression models: e.g. linear regression, logit/probit regression, multilevel regression, etc. etc.

    So what happens when we select the sample on \(x\) such that \(x\) is less than some constant \(c\)? We get the distribution \( f(y|x,x<c) \). Using Bayes' theorem we can write that as

    \[ \begin{array}{rl}
    f(y|x, x < c) & = \frac{f(y,x,x<c)}{f(x, x < c)} \\
    & = \frac{ f(x<c| x, y) f(y|x) f(x) }{f(x < c|x) f(x)}
    \end{array}
    \]

    Since the chance that \(x < c\) is solely dependent on \(x\) and not on \(y\), \( f(x < c | x, y) = f(x < c |x) \). So we can write:

    \[ \begin{array}{rl}
    f(y|x, x < c) & = \frac{ f(x<c| x, y) f(y|x) f(x) }{f(x < c|x) f(x)} \\
    & = \frac{ f(x<c| x) f(y|x) f(x) }{f(x < c|x) f(x)} \\
    & = f(y | x)
    \end{array}
    \]

    So selection on \(x\) is not a problem; we can still get estimates of the regression we are interested in, \(f(y|x)\). We can also see that if we select on \(y\), the selection probability depends on \(y\), the two selection terms no longer cancel, and this nice result no longer holds.
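
A small simulation can illustrate the result above. This is only a sketch under assumed conditions, not part of the original argument: the data-generating process (y = 1 + 2x + e with standard normal x and e), the sample size, the seed, and the helper names `ols_slope` and `simulate` are all made up for illustration. It compares the slope from the full sample with the slope from the bottom quartile of x (selection on x) and from the bottom quartile of y (selection on y).

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of y on x."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, beta=2.0, seed=12345):
    """Compare slopes under no selection, selection on x, and selection on y.

    Assumed (hypothetical) model: y = 1 + beta * x + e, with x and e
    independent standard normal draws.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1.0 + beta * x + rng.gauss(0, 1) for x in xs]
    data = list(zip(xs, ys))

    # 25th percentiles of x and of y, used as the selection cutoffs
    x_cut = statistics.quantiles(xs, n=4)[0]
    y_cut = statistics.quantiles(ys, n=4)[0]

    sel_x = [(x, y) for x, y in data if x <= x_cut]  # select on x
    sel_y = [(x, y) for x, y in data if y <= y_cut]  # select on y

    return {
        "full": ols_slope(xs, ys),
        "select_on_x": ols_slope(*zip(*sel_x)),
        "select_on_y": ols_slope(*zip(*sel_y)),
    }


if __name__ == "__main__":
    for name, slope in simulate().items():
        print(f"{name:12s} slope = {slope:.3f}")
```

Under these assumptions the slope in the sub-sample selected on x should stay close to the true value of 2, only with a larger standard error because fewer observations and less variation in x remain. The slope in the sub-sample selected on y should be clearly attenuated.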
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Maarten gave an insightful answer. I would just add that the issue in #1 essentially boils down to categorizing a continuous X variable according to its quantiles. That is not necessarily problematic (despite potential pitfalls, such as losing information) and can be found in decent textbooks.
      Best regards,

      Marcos



      • #4
        Hi Maarten,
        First of all, I appreciate your prompt and detailed response to my query. I also agree with you that this is a question pertaining to statistics and is not limited to economics or econometrics. Thank you as well, Marcos, for your response.
