
  • How to explain positive significant coefficient turning negative significant on adding another variable in OLS regression

    Dear Statalisters,

    I am running an OLS regression with the log of per capita calories as the dependent variable and the age and years of education of the household head, plus log per capita expenditure, as independent variables (other controls to be added eventually). When I include only age and education, both coefficients are positive and significant. However, as soon as I add log per capita expenditure, the education coefficient becomes negative and significant. I am puzzled by this result. I understand that education of the household head might reflect a "wealth" effect, but the correlation coefficient is not that large. I have posted my regression results below, as well as summary statistics. I was wondering if someone could help me understand what is going on here. I realize that this sort of problem might (or might not) be overcome using techniques other than OLS, but I have just started learning OLS and would like to understand how to deal with this in OLS, or at least know why it cannot deal with this.

    Thanks,

    Monzur



    . regress log_pccal age_hhhead eduy_hhhead [pw=hhweight], r

    Linear regression                                      Number of obs =    3355
                                                           F(  2,  3352) =  105.40
                                                           Prob > F      =  0.0000
                                                           R-squared     =  0.0692
                                                           Root MSE      =  .25583

    ------------------------------------------------------------------------------
                 |               Robust
       log_pccal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
      age_hhhead |   .0049182   .0003602    13.65   0.000      .004212    .0056244
     eduy_hhhead |   .0075136   .0011997     6.26   0.000     .0051613    .0098659
           _cons |   7.537586   .0171067   440.62   0.000     7.504045    7.571126
    ------------------------------------------------------------------------------

    . regress log_pccal age_hhhead eduy_hhhead log_pcexp [pw=hhweight], r

    Linear regression                                      Number of obs =    3355
                                                           F(  3,  3351) =  601.38
                                                           Prob > F      =  0.0000
                                                           R-squared     =  0.4123
                                                           Root MSE      =  .20332

    ------------------------------------------------------------------------------
                 |               Robust
       log_pccal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
      age_hhhead |    .001919   .0002945     6.52   0.000     .0013415    .0024964
     eduy_hhhead |  -.0082508    .001044    -7.90   0.000    -.0102977   -.0062039
       log_pcexp |   .3777407   .0100402    37.62   0.000     .3580552    .3974262
           _cons |   4.795607   .0730719    65.63   0.000     4.652337    4.938877
    ------------------------------------------------------------------------------

    . estat vif

        Variable |       VIF       1/VIF
    -------------+----------------------
       log_pcexp |      1.20    0.832228
     eduy_hhhead |      1.16    0.863121
      age_hhhead |      1.07    0.930743
    -------------+----------------------
        Mean VIF |      1.14



    . su log_pccal eduy_hhhead log_pcexp, d

                              log_pccal
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     7.123889       6.311302
     5%     7.337663        6.67333
    10%     7.436243       6.834251       Obs                3698
    25%     7.607244       6.855416       Sum of Wgt.        3698

    50%     7.779021                      Mean           7.783589
                            Largest       Std. Dev.       .276406
    75%      7.96576       8.723692
    90%     8.135495       8.726619       Variance       .0764003
    95%     8.232234       8.736762       Skewness       .0350145
    99%     8.477096        8.86989       Kurtosis       3.511389

                years of education of household head
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%            0              0       Obs                3698
    25%            0              0       Sum of Wgt.        3698

    50%            0                      Mean           2.984857
                            Largest       Std. Dev.      3.776812
    75%            5             16
    90%            9             16       Variance       14.26431
    95%           10             16       Skewness       .9461994
    99%           12             16       Kurtosis       2.751041

                  log of hh per capita expenditure
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     6.799201       6.302472
     5%     7.063458       6.434649
    10%     7.202945       6.450388       Obs                3698
    25%     7.432215       6.458682       Sum of Wgt.        3698

    50%       7.7299                      Mean           7.762185
                            Largest       Std. Dev.      .4636838
    75%     8.045497       9.502833
    90%     8.368738       9.544683       Variance       .2150027
    95%     8.571793        9.76697       Skewness       .4395734
    99%     9.038363       9.858101       Kurtosis       3.433132

    . pwcorr log_pccal age_hhhead eduy_hhhead log_pcexp, sig

                 | log~ccal age_hh~d eduy_h~d log_pc~p
    -------------+------------------------------------
       log_pccal |   1.0000
                 |
                 |
      age_hhhead |   0.2282   1.0000
                 |   0.0000
                 |
     eduy_hhhead |   0.0855  -0.1133   1.0000
                 |   0.0000   0.0000
                 |
       log_pcexp |   0.6401   0.1796   0.3254   1.0000
                 |   0.0000   0.0000   0.0000
                 |


  • #2
    This sort of thing happens all the time. It even has a name: Simpson's paradox. Google that.



    • #3
      There are lots of reasons this could happen; here are two references that may help:

      Kennedy, PE (2005), "Oh No! I got the wrong sign! What should I do?", _The Journal of Economic Education_, 36(1): 77-92

      Schuit E, et al. (2013), "Unexpected predictor-outcome associations in clinical prediction research: causes and solutions," _Canadian Medical Association Journal_, 185(10): E499-E505



      • #4
        I'm not convinced that controlling for per capita expenditure is what you want in this application. It seems like overcontrolling to me, unless you are explicitly interested in whether more highly educated people have better diets, in which case the negative sign on educ is easily explained. If these were data on the United States, here is how I would explain your findings. In the regression without PC expenditure, the relationship between PC calories and education basically picks up an income effect. But when you hold PC expenditure fixed, your coefficient on educ has the following meaning: take two families with the same PC expenditure and the same age of the household head. Family B's HH head has one more year of education than family A's. Then you get that family B consumes fewer calories. I find this easy to explain in a country like the U.S.: except at low levels of food consumption, more calories is a bad thing. One is holding fixed how much is being spent on food, right? Typically a more highly educated person might have healthier food, that is, fewer calories. For example, the less educated family may eat at an all-you-can-eat buffet and the more highly educated family at a sushi restaurant. The former would consume more calories, and that wouldn't be a good thing.

        I don't know if such a story fits your data, but something similar is worth thinking about. I don't think Simpson's paradox is necessarily the best way of thinking about the problem. To me, one is answering very different questions whether or not PC expenditure is held fixed.

        Suppose I want to estimate the effect of spending more money per student on student outcomes. I should not control for class size, teacher salaries, spending on books, and so on, because there'd be nothing left for spending to explain. One must be careful when simply throwing things on the right hand side of an equation.
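The income-effect reading can be checked against the numbers in post #1 with the standard omitted-variable decomposition, which holds exactly when both regressions use the same sample and weights (both report 3355 observations). Here \hat{\delta}_{educ} denotes the coefficient on eduy_hhhead in an auxiliary regression of log_pcexp on age_hhhead and eduy_hhhead. That auxiliary regression was not posted, so the value below is implied by the identity rather than reported in the thread.

```latex
\begin{align*}
\tilde{\beta}_{\text{educ}} &= \hat{\beta}_{\text{educ}} + \hat{\beta}_{\text{exp}}\,\hat{\delta}_{\text{educ}} \\
0.0075136 &= -0.0082508 + 0.3777407\,\hat{\delta}_{\text{educ}} \\
\hat{\delta}_{\text{educ}} &= \frac{0.0075136 + 0.0082508}{0.3777407} \approx 0.0417
\end{align*}
```

On this reading, one more year of schooling goes with roughly 4 percent higher per capita expenditure at a given age, and the indirect channel (0.3777 x 0.0417, about 0.0158) more than offsets the direct coefficient of -0.0083, which yields the positive 0.0075 in the regression without expenditure. A low VIF does not rule this out: the sign change needs only a nonzero auxiliary coefficient and a large expenditure coefficient.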



        • #5
          Originally posted by Jeff Wooldridge
          I'm not convinced in this application that controlling for per capita expenditure is what you want. [...]

          Thank you very much for your comments, Jeff. To elaborate further, I am looking at household calorie consumption in India, and I am interested in examining whether households where women have a role in agricultural decision-making consume more calories (not necessarily a better quality diet, just more calories). So my main variable of interest is a dummy for whether or not women make decisions, and I control for other factors: age, education, and occupation of the household head, land ownership, food prices, per capita expenditure, and regional effects (variables used in the standard literature on calorie consumption in developing countries).

          What I find puzzling is that households where the head is better educated are consuming fewer calories (the literature tells us that households with more highly educated heads consume more calories in South Asia), and this negative and significant association arises only after I control for income (previously the association was positive and significant), as shown in the regression results below.


          Code:
          . regress log_pccal hasinput_domains age_hhhead eduy_hhhead hh_head_farmer hhsize lnowncland_dec rice_price d1-d6 [pw=hhweight], r
           
           
          Linear regression                                      Number of obs =    3355
                                                                 F( 13,  3341) =   42.73
                                                                 Prob > F      =  0.0000
                                                                 R-squared     =  0.1722
                                                                 Root MSE      =  .24166
           
          ----------------------------------------------------------------------------------
                           |               Robust
                 log_pccal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -----------------+----------------------------------------------------------------
          hasinput_domains |    .072166   .0116565     6.19   0.000     .0493113    .0950206
                age_hhhead |   .0044243   .0003656    12.10   0.000     .0037074    .0051412
               eduy_hhhead |   .0042685   .0012419     3.44   0.001     .0018336    .0067034
            hh_head_farmer |    .060036   .0097011     6.19   0.000     .0410154    .0790566
                    hhsize |  -.0428142   .0031396   -13.64   0.000      -.04897   -.0366584
            lnowncland_dec |   .0142912   .0030376     4.70   0.000     .0083355    .0202469
                rice_price |  -.0024911   .0015667    -1.59   0.112    -.0055629    .0005808
                        d1 |  -.0985701   .0212364    -4.64   0.000    -.1402077   -.0569325
                        d2 |   -.079819   .0194172    -4.11   0.000    -.1178899   -.0417482
                        d3 |  -.0430801   .0153703    -2.80   0.005    -.0732163   -.0129439
                        d4 |  -.0502559     .01794    -2.80   0.005    -.0854305   -.0150814
                        d5 |  -.0657768   .0168336    -3.91   0.000     -.098782   -.0327717
                        d6 |  -.1133768   .0178137    -6.36   0.000    -.1483038   -.0784499
                     _cons |   7.806218   .0534005   146.18   0.000     7.701517    7.910919
          ----------------------------------------------------------------------------------
           
          . regress log_pccal hasinput_domains log_pcexp age_hhhead eduy_hhhead hh_head_farmer hhsize lnowncland_dec rice_price d1-d6 [pw=hhweight], r
           
           
          Linear regression                                      Number of obs =    3355
                                                                 F( 14,  3340) =  170.76
                                                                 Prob > F      =  0.0000
                                                                 R-squared     =  0.4626
                                                                 Root MSE      =  .19474
           
          ----------------------------------------------------------------------------------
                           |               Robust
                 log_pccal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -----------------+----------------------------------------------------------------
          hasinput_domains |    .039691   .0100259     3.96   0.000     .0200335    .0593486
                 log_pcexp |   .3861757   .0107993    35.76   0.000     .3650017    .4073497
                age_hhhead |   .0023293   .0002957     7.88   0.000     .0017495     .002909
               eduy_hhhead |  -.0081129     .00107    -7.58   0.000    -.0102108    -.006015
            hh_head_farmer |   .0261318   .0079778     3.28   0.001     .0104899    .0417737
                    hhsize |  -.0164878   .0023971    -6.88   0.000    -.0211877   -.0117879
            lnowncland_dec |  -.0058118   .0025335    -2.29   0.022    -.0107792   -.0008443
                rice_price |  -.0097165   .0012406    -7.83   0.000     -.012149   -.0072841
                        d1 |  -.0280479   .0162947    -1.72   0.085    -.0599965    .0039007
                        d2 |  -.0890741   .0157044    -5.67   0.000    -.1198653    -.058283
                        d3 |   .0028278   .0122596     0.23   0.818    -.0212092    .0268648
                        d4 |   .0146839   .0144465     1.02   0.309     -.013641    .0430088
                        d5 |  -.0112927   .0140332    -0.80   0.421    -.0388072    .0162219
                        d6 |   .0042789    .014838     0.29   0.773    -.0248135    .0333713
                     _cons |    5.04545   .0892698    56.52   0.000     4.870421    5.220479
          ----------------------------------------------------------------------------------



          • #6
            Actually, after controlling for income, the coefficients for education of household head and amount of land owned (lnowncland_dec) both become negative. I wonder whether this is because education and land are highly correlated with income (although the direction is positive).

            Code:
            . pwcorr log_pccal age_hhhead eduy_hhhead log_pcexp lnowncland_dec , sig
            
                         | log~ccal age_hh~d eduy_h~d log_pc~p lnownc~c
            -------------+---------------------------------------------
               log_pccal |   1.0000
                         |
                         |
              age_hhhead |   0.2282   1.0000
                         |   0.0000
                         |
             eduy_hhhead |   0.0855  -0.1133   1.0000
                         |   0.0000   0.0000
                         |
               log_pcexp |   0.6401   0.1796   0.3254   1.0000
                         |   0.0000   0.0000   0.0000
                         |
            lnowncland~c |   0.1708   0.1335   0.2745   0.3163   1.0000
                         |   0.0000   0.0000   0.0000   0.0000
                         |
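The question raised in this post, whether a positive correlation with expenditure can flip the sign of another coefficient, can be explored on simulated data. The sketch below is only an illustration: the variable names (educ, lexp, lcal) and all parameter values are hypothetical, chosen to loosely resemble the magnitudes in the thread, and nothing here uses the poster's data.

```stata
* Simulated data only; names and parameters are hypothetical.
clear
set seed 12345
set obs 3000

* Years of schooling (0-16), log expenditure rising with schooling,
* and log calories with a negative direct schooling effect.
generate educ = floor(runiform()*17)
generate lexp = 7 + 0.04*educ + rnormal(0, 0.4)
generate lcal = 4.8 + 0.38*lexp - 0.008*educ + rnormal(0, 0.2)

* Short regression: educ also carries the expenditure channel.
regress lcal educ
scalar b_short = _b[educ]

* Long regression: educ coefficient with expenditure held fixed.
regress lcal educ lexp
scalar b_long = _b[educ]
scalar b_exp  = _b[lexp]

* Auxiliary regression: how expenditure moves with educ.
regress lexp educ
scalar delta = _b[educ]

* In-sample identity: b_short = b_long + b_exp*delta.
scalar b_implied = b_long + b_exp*delta
display "short: " b_short
display "long + b_exp*delta: " b_implied
```

With these assumed parameters the short-regression coefficient on educ comes out positive and the long-regression coefficient negative, even though educ and lexp are only moderately correlated. On real data the same check would use the auxiliary regression of log_pcexp on the remaining regressors, with the same weights and estimation sample.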



            • #7
              Monzur:
              Have you investigated possible turning points in your regression model? As far as age_hhhead is concerned, it might be worth checking:
              Code:
              regress log_pccal hasinput_domains log_pcexp c.age_hhhead##c.age_hhhead eduy_hhhead hh_head_farmer hhsize lnowncland_dec rice_price d1-d6 [pw=hhweight], r
              Kind regards,
              Carlo
              (StataNow 18.5)
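If the squared term turns out to matter, the implied turning point is minus the linear coefficient divided by twice the quadratic coefficient, and nlcom can report it with a delta-method standard error. The sketch below runs on simulated data with hypothetical names (age, y) and a turning point set to 50 by construction. It is an illustration of the calculation, not a result for the poster's data.

```stata
* Simulated data only; names and parameters are hypothetical.
clear
set seed 2024
set obs 2000

* Age between 20 and 80; outcome is concave in age with a peak at 50.
generate age = 20 + floor(runiform()*61)
generate y = 7 + 0.02*age - 0.0002*age^2 + rnormal(0, 0.1)

* Quadratic in age via factor-variable notation, as in the post above.
regress y c.age##c.age

* Turning point = -b1 / (2*b2), with a delta-method standard error.
nlcom (turning_point: -_b[age] / (2*_b[c.age#c.age]))
```

A turning point that falls outside the observed age range would suggest the relationship is effectively monotone in the sample.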



              • #8
                Monzur: What exactly is the expenditure variable? Is it just food, or is it all consumption? If it's just food, or food accounts for a large share, then I'm not sure why you would hold it fixed while looking at how other factors affect caloric intake. Certainly expenditure is not income.



                • #9
                  Jeff: The expenditure variable is the sum of all food and non-food expenditures in the previous month. Food does account for a large share of expenditures, though. I was using expenditure as a proxy for income.



                  • #10
                    I don't think expenditure is doing what you want in this case. It is practically another outcome variable. Ask yourself this: If you didn't have caloric intake, what would you use as your dependent variable? My guess is food expenditure per capita, which is almost what you are using on the right hand side.



                    • #11
                      Thank you all, very much! I will probably try to use some sort of wealth index.
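One common way to build such an index is to take the first principal component of a set of asset-ownership indicators. The thread does not say which method or which variables would be used, so the sketch below is only an assumption-laden illustration on simulated data: asset1-asset5, latent, wealth_index, and wealth_q are all hypothetical names.

```stata
* Simulated data only; asset indicators and names are hypothetical.
clear
set seed 777
set obs 1000

* Five binary ownership indicators driven by one latent wealth factor.
generate latent = rnormal()
forvalues j = 1/5 {
    generate asset`j' = (latent + rnormal() > 0)
}

* Keep the first principal component and score each household on it.
pca asset1-asset5, components(1)
predict wealth_index, score

* Optional: wealth quintiles for use as categorical controls.
xtile wealth_q = wealth_index, nquantiles(5)
```

The sign of a principal component is arbitrary, so it is worth confirming from the loadings that higher scores correspond to owning more assets before using the index as a control.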
