
  • Heterogeneous effect on different subsamples

    Hi everyone,

    Let's say I'm running a simple regression, Y=a+bX+e, on a sample that includes both boys and girls. I want to know whether the effect of X differs between boys and girls.

    I run the following regression Y=a+bX+c(X*Sex)+e where Sex is a dummy variable which takes value 0 for girls and 1 for boys. Is it correct to say that b is the marginal effect for girls and (b+c) is the effect for boys? Can I say that two effects are indeed different if c is statistically different from 0?

    Would it be the same thing to just run Y=a+bX+e separately on the two subsamples of male and female? If yes, how could I test that the effects are different?
    Thanks a lot

  • #2
    I run the following regression Y=a+bX+c(X*Sex)+e where Sex is a dummy variable which takes value 0 for girls and 1 for boys. Is it correct to say that b is the marginal effect for girls and (b+c) is the effect for boys? Can I say that two effects are indeed different if c is statistically different from 0?
    Yes to both.
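
    A minimal sketch of the interaction approach, using Stata's bundled auto dataset as a stand-in: price plays the role of Y, weight the role of X, and foreign the role of Sex. These names are placeholders, not the original poster's variables. One assumption to flag: the ## operator also adds a main effect for the dummy, which the equation as written above leaves out.

    ```stata
    * Stand-in data (assumption): price = Y, weight = X, foreign = Sex dummy
    sysuse auto, clear

    * Interaction model; ## includes both main effects and the interaction
    regress price c.weight##i.foreign

    * Slope of X within each group: b when foreign == 0, b + c when foreign == 1
    margins foreign, dydx(weight)

    * The same b + c computed directly from the coefficients
    lincom weight + 1.foreign#c.weight

    * Test whether the two slopes differ, i.e. whether c = 0
    test 1.foreign#c.weight
    ```

    The -margins- output reports each group's slope with its own standard error and confidence interval, which is usually easier to present than b and c separately.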

    Would it be the same thing to just run Y=a+bX+e separately on the two subsamples of male and female? If yes, how could I test that the effects are different?
    Not exactly. Running the regression separately on the two subsamples allows the error variance to differ between males and females, whereas in the interaction model the error variance is constrained to be the same. Potentially more important, if you are presenting here a simplified version of what you are considering and there are actually other variables involved in the model, the separate estimations for boys and girls allow the coefficients of these other variables to differ between boys and girls, whereas the interaction approach will constrain them to be the same for both sexes unless you also interact those variables with sex.

    If you use the two-separate equations approach, you should save your estimates with the -estimates store- command, and then combine them using the -suest- command. After you've done that, the -test- command will enable you to contrast the X coefficients between the two models. Note that there are some regressions that -suest- does not work with, so this approach is only viable for those it supports. Also, if you want to use robust covariance estimation, you should use ordinary covariance estimation in the separate regressions and then specify robust in the -suest- command. -suest- will not accept results from models estimated with robust vce.
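
    The steps just described might look roughly like this. Again the bundled auto dataset is only a stand-in (price for Y, weight for X, foreign for Sex), and the stored-estimate names m_dom and m_for are made up for illustration.

    ```stata
    * Stand-in data (assumption): price = Y, weight = X, foreign = Sex dummy
    sysuse auto, clear

    * Separate regressions with ordinary (non-robust) covariance estimation
    regress price weight if foreign == 0
    estimates store m_dom

    regress price weight if foreign == 1
    estimates store m_for

    * Combine the stored results; request robust standard errors here, not above
    suest m_dom m_for, vce(robust)

    * After -regress-, -suest- names each mean equation <name>_mean
    test [m_dom_mean]weight = [m_for_mean]weight
    ```

    Dropping the vce(robust) option from the -suest- line gives the same comparison under -suest-'s default covariance estimation.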


    • #3
      Thanks a lot for your kind and very exhaustive answer!

      I've run the regression Y=a+bX+e first on the entire sample, and then separately on the two subsamples. For the entire sample, I found that the effect of X is not significantly different from zero. For boys I likewise found no significant effect, while for girls I found a positive effect that is significant at the 99% confidence level. Then, with the -test- command, I contrasted the X coefficients between the two models (boys vs. girls) and found that they are indeed different.
      Does it make sense to say that X has an effect on girls but not on boys and therefore when I run the regression on the entire sample this effect is somehow hidden because it concerns only a part of the sample?


      • #4
        Well, no: these are classic, widespread misinterpretations of statistical significance. When an estimate of an effect is not statistically significant, it does not mean that there is no effect. It means that your estimate of that effect is, for reasons that may include sample size or measurement issues as well as the actual size of the effect, too imprecise to determine even its direction (sign). It could even be a large effect: check your confidence intervals and see if large positive or negative values are included. It's just not precisely estimated, that's all.

        So the correct interpretation is that the effect in girls is estimated with sufficient precision that you can be confident about its direction, but the effect in boys is not. I would present the effect estimate in each sex along with its confidence interval to convey clearly how precisely or imprecisely the effect has been estimated in each sex. Be sure also to address the sample sizes in each sex and any limitations of the way Y was measured, as well as any problems with either the conceptual design of your study or its implementation, particularly any aspects that may differ between the sexes.
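
        One way to get the per-group estimates, confidence intervals, and sample sizes side by side is sketched below. It uses the bundled auto dataset as a stand-in (price for Y, weight for X, foreign for sex); those variable names are assumptions, not the poster's data.

        ```stata
        * Stand-in data (assumption): price = Y, weight = X, foreign = sex dummy
        sysuse auto, clear

        * One regression per group; each output table shows the coefficient on X,
        * its 95% confidence interval, and the number of observations used
        bysort foreign: regress price weight, level(95)
        ```

        A wide interval that spans both negative and positive values in one group is the "imprecisely estimated" situation described above, not evidence of a zero effect.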

        To improve your understanding of statistical significance, I recommend you read Wasserstein RL & Lazar NA. The ASA's statement on p-values: context, process, and purpose. The American Statistician (2016), available at http://dx.doi.org/10.1080/00031305.2016.1154108 and the accompanying commentary.


        • #5
          Dear Statalist,

          I am running an OLS regression and would like to estimate heterogeneous effects by ethnicity and birth order. I am having a collinearity problem with the Chinese variable: it is being omitted, even though I have already left one of the race categories out of my specification.

          reg stunted BO2 BO3 BO4 male urban i.year i.month indian african mixed chinese BO2#c.indian BO2#c.african BO2#c.mixed BO2#c.chinese BO3#c.indian BO3#c.african BO3#c.mixed BO3#c.chinese BO4#c.indian BO4#c.african BO4#c.mixed BO4#c.chinese
          I am grateful for your help.

          Regards


          • #6
            I take it that the observations here are some kind of populations, not individuals, and the variables indian, african, mixed, and chinese represent proportions of people from each race/ethnicity in the population, because you have used the c. prefix with each variable. Without seeing any example data, nor the actual output (including messages) from Stata it is difficult to be specific.

            Here's my best guess (really more of a speculation). Remember that in any regression, any observation that has a missing value for any variable mentioned in the command is omitted from the regression sample. Perhaps in your data, when we exclude observations that have some missing values we find that only observations with chinese = 0 are left. Re-running the regression and then -summ chinese if e(sample)- would tell you one way or another. If that's the case, then you simply can't include chinese in the model as there is no information about it in your data.
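
            A rough sketch of that check, using the bundled auto dataset as a stand-in for the real data. Here rep78, which has some missing values, plays the role of a covariate that shrinks the estimation sample, and foreign stands in for chinese; both are placeholders.

            ```stata
            * Stand-in data (assumption): foreign plays the role of chinese
            sysuse auto, clear

            * Observations with a missing value in any listed variable are dropped
            regress price weight rep78 i.foreign

            * How many observations were excluded from the estimation sample?
            count if !e(sample)

            * Does the variable of interest still vary within the estimation sample?
            summarize foreign if e(sample)
            tabulate foreign if e(sample)
            ```

            If -summarize- shows a minimum and maximum that are both 0, the variable is constant in the estimation sample and Stata has no choice but to omit it.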

            If that's not it, please post back, using the -dataex- command to show example data, and also show the complete output of the -reg- command. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data. Make sure your example data, when run with the regression command, reproduces the problem you are encountering.

            As an aside, I'll point out that your command can be made considerably simpler to both read and type:

            Code:
            reg stunted male urban i.year i.month i.(BO2 BO3 BO4)##c.(indian african mixed chinese)
            And I think we can go even farther. If BO2, BO3 and BO4 are indicator ("dummy") variables you have created to correspond to birth orders 2, 3, and 4, respectively, then you can get rid of those and use a single variable, let's call it birth_order, that takes on values 1, 2, 3, and 4, and then simplify the code to:

            Code:
            reg stunted male urban i.year i.month i.birth_order##c.(indian african mixed chinese)
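
            If only the indicators exist, birth_order could be rebuilt from them along these lines. This is a sketch on toy data typed in for illustration; it assumes the three indicators are mutually exclusive and that birth order 1 is the category with all three equal to 0.

            ```stata
            * Toy data standing in for the real dataset (hypothetical values)
            clear
            input byte(BO2 BO3 BO4)
            0 0 0
            1 0 0
            0 1 0
            0 0 1
            end

            * Assumes mutually exclusive indicators, with birth order 1 as the base;
            * any missing indicator yields a missing birth_order
            generate byte birth_order = 1 + BO2 + 2*BO3 + 3*BO4

            * Verify that the rebuilt variable takes only the expected values
            tabulate birth_order, missing
            ```

            Any value outside 1 to 4 in the tabulation would mean the indicators overlap and the mutual-exclusivity assumption does not hold.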


            • #7
              Dear Clyde,

              Thanks for your reply.

              I am looking at it at the individual level. I have tried the first code using i.(indian ...):
              reg stunted male urban i.year i.month i.(BO2 BO3 BO4)##i.(indian african mixed chinese)
              However, the last race is still being omitted. When I omit Chinese, it runs fine.

              Regards


              • #8
                So, now I see that indian, african, mixed, and chinese are actually indicator ("dummy") variables. OK. Then have you tried -tab chinese if e(sample)-? Maybe there just aren't any Chinese who have complete data on all the other variables. If that's not it, please post back with example data and the full output of the regression command for further, more specific, advice.
