  • Mean difference vs regression

    Dear All,

    This is more of a conceptual question in econometrics, but I thought I would ask it here. I am interested in the variation of a variable y across a binary variable x, where y is measured for units residing in different regions. Suppose I want to motivate the paper by reporting ttest y, by(x). I start with that result, but when I regress y on x along with region fixed effects, the sign of the relationship reverses. The reversed sign persists even after I introduce other controls. Is there a way to reconcile the results? I understand that the regression coefficient is a conditional mean difference while the t test compares unconditional means. However, it is standard practice in journals to report the descriptive tests first and then validate the sign with additional controls.
    In my case, should I report both, or start with the regression results after describing the data (i.e. without reporting the mean comparison)?
    I would be grateful for some advice on this.
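
    For concreteness, here is a minimal simulated sketch (not my data; all variable names and numbers are made up) of the kind of reversal I mean: the group with x = 1 has the higher unconditional mean of y, yet the coefficient on x turns negative once region fixed effects are included.
    Code:
    clear
    set seed 12345
    set obs 2000
    generate region = ceil(4*runiform())             // four hypothetical regions
    generate x = runiform() < 0.10 + 0.15*region     // x is more common in high-y regions
    generate y = 10*region - 2*x + rnormal()         // within a region, x lowers y
    ttest y, by(x)                                   // unconditional: the x = 1 group has the higher mean
    regress y x i.region                             // with region fixed effects: coefficient on x is negative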

  • #2
    A t test is a simple measure of mean difference or association. It's perfectly possible that once you adjust for regional differences, things are a little more complicated than a simple t test suggests.

    As an econometrician, you shouldn't care about the sign. You should care about the research design and whether it is well suited to answering, and informing people about, a real question that matters. But as stated, all of this is very imprecise.


    What's your question? What're you studying?

    • #3
      Hello Jared, I am estimating the impact of female employment on nutritional intake. The idea is that those who are employed should have higher nutrition via better economic status. The mean comparison shows that those in paid employment eat more than the unemployed. However, the regression with regional fixed effects shows otherwise. Suspecting selection (less well-fed women going out for work), I estimated an IV model, and the baseline regression result holds. I was wondering how to reconcile the mean comparison with the regression result.
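
      Roughly, the sequence of specifications looks like this (all variable names are placeholders, and I have not shown the actual instrument here):
      Code:
      * Placeholder names only: employed (0/1), calories (nutritional intake),
      * z (the excluded instrument, not named here)
      ttest calories, by(employed)                          // unconditional mean comparison
      regress calories i.employed i.region                  // with region fixed effects
      ivregress 2sls calories (employed = z) i.region       // IV version of the same specification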

      • #4
        If only the sign is your problem: ttest calculates the mean difference by subtracting the mean of group 2 from the mean of group 1. To let ttest subtract the mean of group 1 from the mean of group 2, use the option reverse. Example:
        Code:
        sysuse auto, clear
        ttest price, by(foreign)            // diff = mean(Domestic) - mean(Foreign)
        reg price foreign                   // coefficient on foreign = mean(Foreign) - mean(Domestic)
        ttest price, by(foreign) reverse    // diff = mean(Foreign) - mean(Domestic)
        The output is:
        Code:
        . ttest price, by(foreign)
        
        Two-sample t test with equal variances
        ------------------------------------------------------------------------------
           Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
        ---------+--------------------------------------------------------------------
        Domestic |      52    6072.423    429.4911    3097.104    5210.184    6934.662
         Foreign |      22    6384.682    558.9942    2621.915     5222.19    7547.174
        ---------+--------------------------------------------------------------------
        Combined |      74    6165.257    342.8719    2949.496    5481.914      6848.6
        ---------+--------------------------------------------------------------------
            diff |           -312.2587    754.4488               -1816.225    1191.708
        ------------------------------------------------------------------------------
            diff = mean(Domestic) - mean(Foreign)                         t =  -0.4139
        H0: diff = 0                                     Degrees of freedom =       72
        
            Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
         Pr(T < t) = 0.3401         Pr(|T| > |t|) = 0.6802          Pr(T > t) = 0.6599
        
        . reg price foreign
        
              Source |       SS           df       MS      Number of obs   =        74
        -------------+----------------------------------   F(1, 72)        =      0.17
               Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
            Residual |   633558013        72  8799416.85   R-squared       =    0.0024
        -------------+----------------------------------   Adj R-squared   =   -0.0115
               Total |   635065396        73  8699525.97   Root MSE        =    2966.4
        
        ------------------------------------------------------------------------------
               price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
             foreign |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
               _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
        ------------------------------------------------------------------------------
        
        . ttest price, by(foreign) reverse
        
        Two-sample t test with equal variances
        ------------------------------------------------------------------------------
           Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
        ---------+--------------------------------------------------------------------
         Foreign |      22    6384.682    558.9942    2621.915     5222.19    7547.174
        Domestic |      52    6072.423    429.4911    3097.104    5210.184    6934.662
        ---------+--------------------------------------------------------------------
        Combined |      74    6165.257    342.8719    2949.496    5481.914      6848.6
        ---------+--------------------------------------------------------------------
            diff |            312.2587    754.4488               -1191.708    1816.225
        ------------------------------------------------------------------------------
            diff = mean(Foreign) - mean(Domestic)                         t =   0.4139
        H0: diff = 0                                     Degrees of freedom =       72
        
            Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
         Pr(T < t) = 0.6599         Pr(|T| > |t|) = 0.6802          Pr(T > t) = 0.3401

        • #5
          Hello Dirk, no, the sign itself is not the problem! What I am trying to say is that the direction of the effect changes from the mean comparison to the regression. Thanks so much for engaging!

          • #6
            How does "the direction of the effect changes from mean comparison to regression" without changing sign?

            • #7
              I think this is a complicated question. The predictor here is employment and the outcome is food intake. The third variable involved is region. An unconditional analysis points to an association between predictor and outcome in one direction, and an analysis conditioning on region points to an association in the opposite direction. The question is, which analysis is correct? In most situations, we try to condition on relevant variables to reduce omitted variable bias, or to reduce residual variance. But there are circumstances where the unconditional analysis is preferable, namely, the collider situation. For this, you have to think in terms of the direction of causality among the relationships.

              The standard confounding relationship is where the third variable is a cause of both the predictor and outcome. But there can also be situations where the third variable is caused by both the predictor and the outcome--this is the collider situation. It is certainly plausible to me that employment causally (directly and also indirectly through income) affects where people live. What is less clear to me is whether food intake may also causally affect where people live. It may be that people who are undernourished will choose to live in regions where food is more readily available, or where medical care is, or something like that. So I think the possibility of region being a collider of employment and food intake is real here. Actually, it strikes me that this is even more complicated because I can also tell stories that argue for causal relationships of predictor and outcome with region being in the opposite direction--hence omitted variable rather than collider.

              So unless people with substantive knowledge in this area have already answered this question, we can't really know which analysis will give a correct treatment effect for employment on food intake. If that is the case, I would think that one would need a richer data set. For example, longitudinal data in which people went in and out of employment but (mostly) stayed in the same place would be a better study design.
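
              To make the collider mechanism concrete, here is a purely illustrative simulation (hypothetical variables, with no true effect of employment on intake) in which conditioning on a region variable that is caused by both employment and intake manufactures an association:
              Code:
              * Illustrative only: region is a collider, caused by both employment and intake
              clear
              set seed 2024
              set obs 5000
              generate employed = runiform() < 0.5
              generate intake   = rnormal()                      // true effect of employment on intake is zero
              generate region   = (employed + intake + rnormal()) > 0.5
              regress intake employed                            // unconditional: coefficient near zero (correct)
              regress intake employed i.region                   // conditioning on the collider: spurious negative effect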

              • #8
                Indeed an interesting question, and I think you need causality and "reality" to resolve it. What you need is a "story", an explanation of how reality works. Clyde describes a causal DAG framework, which is formalized, but in the end it needs to be filled with life. Of course you can start with your basic descriptive results, but then continue to explain why the sign changes and how this relates to other variables. Probably you will have some theoretical arguments for why certain variables are mediators, colliders, or confounders, and then you can use the DAGs and the data to flesh out these theoretical arguments with evidence. In the end, numbers in a computer can never tell us what reality really is; only our human perception and understanding can. So to "If that is the case, I would think that one would need a richer data set." I would add that we also need elaborate theoretical arguments, an understanding of what might be going on. Sure, numbers can disprove or strengthen such arguments, but without any theoretical guidance the data alone can be pretty meaningless.
                Best wishes

                (Stata 18.0 MP)
