  • Problem comparing sub-sample to full sample

    Dear all,

    Following a paper that does this, I compare the subsample I created (after dropping observations with missing values and applying some other restrictions I chose) to the full sample. I do so by running a probit regression where the outcome variable is "Included", equal to 1 if the observation is in the subsample and 0 otherwise.

    I then compute the average marginal effects, and I find statistically significant differences in the control variables.
    For example, I find that individuals included in the subsample are 5 percentage points more likely to have lower birth weight.
    • Is this problematic?
    • Can I just say that the differences are small in magnitude, so there should be no issue (since 5 pp is not much)?
    • If this is problematic and I just use the subsample to run my equations, can I simply discuss how the subsample will bias the results? (E.g., if I have a negative coefficient for birth weight, should I say that the estimate is biased downwards?)
    • If this is problematic, how can I correct for it? I saw a post talking about weights; is this the way to proceed?
    • Suppose I also find a significant difference for my main outcome variable (income). How should I discuss this?
    I really appreciate any remarks on this topic!
    Thank you.
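
The comparison described in this post can be sketched in Stata roughly as follows. This is only an illustration: the shipped nlsw88 dataset stands in for the poster's data, and the variable lists and the names `nmiss` and `included` are assumptions, not taken from the thread.

```stata
* Sketch only: nlsw88.dta is a stand-in for the poster's data
sysuse nlsw88, clear

* Count missing values across the (assumed) analysis variables
egen byte nmiss = rowmiss(wage union tenure hours)

* included = 1 if the observation would survive listwise deletion, 0 otherwise
generate byte included = (nmiss == 0)

* Probit of inclusion on characteristics that are observed for everyone
probit included age i.race i.married i.collgrad

* Average marginal effects on the probability of being included
margins, dydx(*)
```

In this setup each average marginal effect is a change in the probability of being included associated with the covariate, so a 0.05 effect on a low-birth-weight indicator would read as "low-birth-weight individuals are 5 percentage points more likely to be included".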





  • #2
    Marry:
    Although it is not difficult to find, even in indexed journals, case reports stating that monkeys (or donkeys, in the Italian version of the same saying) can fly, this does not imply that other researchers should keep walking the very same road. Your approach would be judged highly questionable by any decent reviewer with a bit of knowledge about sample selection bias.
    Actually, you state that:
    1) you decided to get rid of observations with missing values, but you do not seem to have performed any diagnosis of the mechanism behind their missingness (missing completely at random; missing at random; missing not at random). Hence, your subsample may have a tenuous relationship with its original counterpart;
    2) you got rid of other observations based on some restrictions you chose. At face value, this means making up data.

    In sum, I would not sponsor your approach.
    Kind regards,
    Carlo
    (Stata 19.0)
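
A first look at the missingness mechanism mentioned in point 1) might be sketched as below. The nlsw88 dataset and the indicator name `miss_union` are illustrative assumptions. Checks of this kind can only provide evidence against "missing completely at random"; observed data alone cannot separate "missing at random" from "missing not at random".

```stata
* Sketch only: nlsw88.dta is a stand-in dataset
sysuse nlsw88, clear

* Which variables have missing values, and how many?
misstable summarize

* Which combinations of variables tend to be missing together?
misstable patterns

* Indicator for missingness in one variable (union has the most missing values here)
generate byte miss_union = missing(union)

* Do fully observed characteristics differ by missingness status?
ttest wage, by(miss_union)
tabulate race miss_union, chi2
```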



    • #3
      Dear Carlo Lazzaro, thank you for your answer.
      "you decided to get rid of observations with missing values, but you do not seem to have performed any diagnosis of the mechanism behind their missingness (missing completely at random; missing at random; missing not at random). Hence, your subsample may have a tenuous relationship with its original counterpart"
      This is exactly what I am trying to do and understand. If you have data with missing values, those observations will not be used in the analysis whether you drop them or not. So, taking the missingness into account, I am checking how comparable the remaining sample is to the whole sample on some characteristics.
      That was my question.

      "you got rid of other observations based on some restrictions you chose. At face value, this means making up data."
      Does this mean that, if you have data on children aged between 0 and 20, you cannot just be interested in the subsample of children aged between 5 and 15?




      • #4
        Marry:
        1) the fact that Stata applies listwise deletion is not a waiver for skipping the diagnosis of the missing-data mechanism;
        2) I would go -i.group-.
        Kind regards,
        Carlo
        (Stata 19.0)
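
One possible reading of the -i.group- suggestion is to keep all observations and let a categorical variable mark the range of interest, rather than dropping the rest. The sketch below uses simulated data; the names `age`, `group`, `x` and `y`, the age cut-offs and the data-generating process are all hypothetical.

```stata
* Sketch only: simulated data, not the poster's
clear
set seed 12345
set obs 500

* Ages 0 to 20, split into three groups with the 5-15 range in the middle
generate age = floor(21*runiform())
generate byte group = cond(age < 5, 1, cond(age <= 15, 2, 3))
label define grp 1 "0-4" 2 "5-15" 3 "16-20"
label values group grp

* Hypothetical covariate and outcome
generate x = rnormal()
generate y = 1 + 0.5*x + 0.3*(group == 2) + rnormal()

* Full-sample regression with the group indicator and its interaction with x
regress y c.x##i.group

* Does the slope on x differ across groups?
testparm i.group#c.x

* Difference in intercepts between the 5-15 and 16-20 groups
lincom 2.group - 3.group
```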



        • #5
          Dear Carlo Lazzaro,
          "the fact that Stata applies listwise deletion is not a waiver for skipping the diagnosis of the missing-data mechanism"
          I really understand your concern and I totally agree with it.
          I am not trying to skip the analysis you are talking about; I am trying to learn a way to do it.
          But in your answers, you did not comment on whether my way of comparing the subsample (after dropping the observations with missing values) to the whole sample, using a dependent variable for whether the observation is included in the subsample, is right or not.
          If it is not, can you please give me a hint on other ways to diagnose the missing data?

          Thank you.



          • #6
            Marry:
            1) the -mi- suite of commands (and the related references) offers plenty of authoritative hints on how to deal with missing values;
            2) you can add to your dataset/regression a categorical predictor (say -i.group-) that you can classify as you like. This is perfectly legitimate, as is comparing the resulting coefficients via -test-, -lincom- and the like. Setting aside the missing values issue, in my #2 I stated that I would not sponsor an approach aimed at getting rid of observations "based on some other restrictions [you] chose" because, the goal of your research being [then] unclear, that way you run the risk of making up your data.
            Last edited by Carlo Lazzaro; 26 Jan 2023, 08:14.
            Kind regards,
            Carlo
            (Stata 19.0)
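
A minimal sketch of the -mi- workflow mentioned in point 1) could look like the following. It again uses nlsw88 as a stand-in; the imputation model, the number of imputations and the seed are arbitrary illustrative choices, and the approach rests on the assumption that the data are missing at random given the predictors.

```stata
* Sketch only: nlsw88.dta is a stand-in dataset
sysuse nlsw88, clear

* Declare the multiple-imputation structure and the variable to impute
mi set wide
mi register imputed union

* Impute the binary variable union from fully observed predictors
mi impute logit union age i.race i.married i.collgrad wage, add(20) rseed(12345)

* Fit the analysis model on each imputed dataset and combine the results
mi estimate: regress wage i.union age i.collgrad
```

Comparing these estimates with the listwise-deletion ones from a plain -regress- gives a rough sense of how sensitive the results are to the dropped observations.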

