  • Mean difference vs regression

    Dear All,

    This is more of a conceptual question in econometrics, but I thought I would ask it here. I am interested in the variation of a variable y across a binary variable x, where y is measured for units residing in different regions. Suppose I want to motivate the paper by reporting ttest y, by(x). I start with that result, but when I regress y on x along with region fixed effects, the sign of the relationship reverses. The reversed sign persists even after I introduce other controls. Is there a way to reconcile the results? I understand that the regression coefficient is a conditional mean difference while the t test compares unconditional means. However, it is standard practice in journals to report the descriptive tests first and then validate the sign with additional controls.
    In my case, should I report both, or start with the regression results after describing the data (i.e. without reporting the mean comparison)?
    I would be grateful for some advice on this.
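
    For concreteness, here is a minimal simulated sketch (not my data; all variable names and numbers are made up) of the kind of reversal I mean: the group with x = 1 has the higher unconditional mean of y, yet the coefficient on x turns negative once region fixed effects are included.
    Code:
    clear
    set seed 12345
    set obs 2000
    generate region = ceil(4*runiform())             // four hypothetical regions
    generate x = runiform() < 0.10 + 0.15*region     // x is more common in high-y regions
    generate y = 10*region - 2*x + rnormal()         // within a region, x lowers y
    ttest y, by(x)                                   // unconditional: the x = 1 group has the higher mean
    regress y x i.region                             // with region fixed effects: coefficient on x is negative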

  • #2
    A t test is a simple measure of mean difference or association. It's perfectly possible that once you adjust for regional differences, things are a little more complicated than a simple t test suggests.

    As an econometrician, you shouldn't care about the sign. You should care about the research design and whether it is well suited to answering, and informing people about, a real question that matters. But as stated, all of this is very imprecise.


    What's your question? What're you studying?

    • #3
      Hello Jared, I am estimating the impact of female employment on nutritional intake. The idea is that those who are employed should have higher nutrition via better economic status. The mean comparison shows that those in paid employment eat more than the unemployed. However, the regression with regional fixed effects shows otherwise. Suspecting selection (less well-fed women going out for work), I estimated an IV model, and the baseline regression result holds. I was wondering how to reconcile the mean comparison with the regression result.
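
      Roughly, the sequence of specifications looks like this (all variable names are placeholders, and I have not shown the actual instrument here):
      Code:
      * Placeholder names only: employed (0/1), calories (nutritional intake),
      * z (the excluded instrument, not named here)
      ttest calories, by(employed)                          // unconditional mean comparison
      regress calories i.employed i.region                  // with region fixed effects
      ivregress 2sls calories (employed = z) i.region       // IV version of the same specification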

      • #4
        If only the sign is your problem: ttest calculates the mean difference by subtracting the mean of group 2 from the mean of group 1. To let ttest subtract the mean of group 1 from the mean of group 2, use the option reverse. Example:
        Code:
        sysuse auto, clear
        ttest price, by(foreign)            // diff = mean(Domestic) - mean(Foreign)
        reg price foreign                   // coefficient on foreign = mean(Foreign) - mean(Domestic)
        ttest price, by(foreign) reverse    // diff = mean(Foreign) - mean(Domestic)
        The output is:
        Code:
        . ttest price, by(foreign)
        
        Two-sample t test with equal variances
        ------------------------------------------------------------------------------
           Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
        ---------+--------------------------------------------------------------------
        Domestic |      52    6072.423    429.4911    3097.104    5210.184    6934.662
         Foreign |      22    6384.682    558.9942    2621.915     5222.19    7547.174
        ---------+--------------------------------------------------------------------
        Combined |      74    6165.257    342.8719    2949.496    5481.914      6848.6
        ---------+--------------------------------------------------------------------
            diff |           -312.2587    754.4488               -1816.225    1191.708
        ------------------------------------------------------------------------------
            diff = mean(Domestic) - mean(Foreign)                         t =  -0.4139
        H0: diff = 0                                     Degrees of freedom =       72
        
            Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
         Pr(T < t) = 0.3401         Pr(|T| > |t|) = 0.6802          Pr(T > t) = 0.6599
        
        . reg price foreign
        
              Source |       SS           df       MS      Number of obs   =        74
        -------------+----------------------------------   F(1, 72)        =      0.17
               Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
            Residual |   633558013        72  8799416.85   R-squared       =    0.0024
        -------------+----------------------------------   Adj R-squared   =   -0.0115
               Total |   635065396        73  8699525.97   Root MSE        =    2966.4
        
        ------------------------------------------------------------------------------
               price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
             foreign |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
               _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
        ------------------------------------------------------------------------------
        
        . ttest price, by(foreign) reverse
        
        Two-sample t test with equal variances
        ------------------------------------------------------------------------------
           Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
        ---------+--------------------------------------------------------------------
         Foreign |      22    6384.682    558.9942    2621.915     5222.19    7547.174
        Domestic |      52    6072.423    429.4911    3097.104    5210.184    6934.662
        ---------+--------------------------------------------------------------------
        Combined |      74    6165.257    342.8719    2949.496    5481.914      6848.6
        ---------+--------------------------------------------------------------------
            diff |            312.2587    754.4488               -1191.708    1816.225
        ------------------------------------------------------------------------------
            diff = mean(Foreign) - mean(Domestic)                         t =   0.4139
        H0: diff = 0                                     Degrees of freedom =       72
        
            Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
         Pr(T < t) = 0.6599         Pr(|T| > |t|) = 0.6802          Pr(T > t) = 0.3401

        • #5
          Hello Dirk, no, the sign itself is not the problem! What I am trying to say is that the direction of the effect changes from the mean comparison to the regression. Thanks so much for engaging!

          • #6
            How does "the direction of the effect changes from mean comparison to regression" without changing sign?

            • #7
              I think this is a complicated question. The predictor here is employment and the outcome is food intake. The third variable involved is region. An unconditional analysis points to an association between predictor and outcome in one direction, and an analysis conditioning on region points to an association in the opposite direction. The question is, which analysis is correct? In most situations, we try to condition on relevant variables to reduce omitted variable bias, or to reduce residual variance. But there are circumstances where the unconditional analysis is preferable, namely, the collider situation. For this, you have to think in terms of the direction of causality among the relationships.

              The standard confounding relationship is where the third variable is a cause of both the predictor and outcome. But there can also be situations where the third variable is caused by both the predictor and the outcome--this is the collider situation. It is certainly plausible to me that employment causally (directly and also indirectly through income) affects where people live. What is less clear to me is whether food intake may also causally affect where people live. It may be that people who are undernourished will choose to live in regions where food is more readily available, or where medical care is, or something like that. So I think the possibility of region being a collider of employment and food intake is real here. Actually, it strikes me that this is even more complicated because I can also tell stories that argue for causal relationships of predictor and outcome with region being in the opposite direction--hence omitted variable rather than collider.

              So unless people with substantive knowledge in this area have already answered this question, we can't really know which analysis will give a correct treatment effect for employment on food intake. If that is the case, I would think that one would need a richer data set. For example, longitudinal data in which people went in and out of employment but (mostly) stayed in the same place would be a better study design.
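
              To make the collider mechanism concrete, here is a purely illustrative simulation (hypothetical variables, with no true effect of employment on intake) in which conditioning on a region variable that is caused by both employment and intake manufactures an association:
              Code:
              * Illustrative only: region is a collider, caused by both employment and intake
              clear
              set seed 2024
              set obs 5000
              generate employed = runiform() < 0.5
              generate intake   = rnormal()                      // true effect of employment on intake is zero
              generate region   = (employed + intake + rnormal()) > 0.5
              regress intake employed                            // unconditional: coefficient near zero (correct)
              regress intake employed i.region                   // conditioning on the collider: spurious negative effect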

              • #8
                Indeed an interesting question, and I think you need causality and "reality" to resolve it. What you need is a "story", an explanation of how reality works. Clyde describes a causal DAG framework, which is formalized, but in the end it needs to be filled with life. Of course you can start with your basic descriptive results, but then continue to explain why the sign changes and how this relates to other variables. Probably you will have some theoretical arguments for why certain variables are mediators, colliders, or confounders, and then you can use the DAGs and the data to flesh out these theoretical arguments with evidence. In the end, numbers in a computer can never tell us what reality really is; only our human perception and understanding can. So to "If that is the case, I would think that one would need a richer data set." I would add that we also need elaborate theoretical arguments, an understanding of what might be going on. Sure, numbers can disprove or strengthen such arguments, but without any theoretical guidance the data alone can be pretty meaningless.
                Best wishes

                (Stata 18.0 MP)
