
  • Comparing coefficients between 2 groups in a linear probability model with categorical variable. Using an interaction doesn't seem viable

    I am using a linear probability model to look at the effect of competitiveness on gambling, using competitive sport as a proxy to measure competitiveness. My dependent variable is a dummy variable, which takes the value 1 if someone gambled in the past 12 months and 0 otherwise. My independent variable is a categorical variable, taking the value 1 if someone took part in competitive sport, 2 if they took part in only fitness-based physical activity and 0 if they did not undertake any physical activity at all. I also have a range of control variables.

    I want to look at whether the effect of participating in competitive sport on gambling varies by age. Looking at past posts on this forum, I've seen that the normally suggested method would be to interact age with competitive sport. However, in my dataset, the age variable ranges from 16-75, so I think that an interaction would probably not be a good idea in this case because the change in the competitive sport effect for a 1 year increase in age is likely to be very small. Therefore, I thought of running two separate regressions: one for people aged over 45 and one for people aged under 45. I have almost 4,500 observations for each subgroup so sample size should not be a problem.

    My main interest is to determine whether the competitive sport coefficient is significantly different between the two age groups. I have seen that the use of the suest command has been suggested on this forum previously. I'm not sure whether this would work in my case because my independent variable is a categorical variable? I'm also not sure whether I should be testing for equality of all the coefficients or the equality of only the competitive sport coefficients between the two regressions?

    I would be really grateful for any help.

    Thank you in advance.

  • #2
    Therefore, I thought of running two separate regressions: one for people aged over 45 and one for people aged under 45.
    This is a terrible idea unless there is some basis for believing that something abruptly happens at age 45. This model says that a 44 year old and a 46 year old are radically different, but a 44 year old and a 16 year old are the same, as are a 46 year old and a 75 year old. Unless you can defend that implication, you shouldn't go down this road. Dichotomizing a continuous variable is almost always a mistake.

    If your concern is simply that a one year difference will look puny, then you can rescale age: -gen age_decades = age/10-. Whatever the magnitude of the effect of age would have been, that of age_decades will appear to be 10 times larger. This is just an artifact anyway: working as you are in a linear probability model, you can scale your units in any way you like, and the coefficients will just respond in inverse proportion. Significance tests will not change at all, as both the coefficient and the standard error will change by the same factor. The same will be true of the interaction term(s) between age and competitive sport.
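    For example, the rescaling might look like this (here gambling, competitive_sport, and other_covariates are placeholders for the actual variable names):

    Code:
    * rescale age from years to decades; the coefficient on age_decades will be
    * 10 times the coefficient on age, with identical t-statistics and p-values
    gen age_decades = age/10
    regress gambling i.competitive_sport##c.age_decades other_covariates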

    The real issue with a wide-ranging age variable is that the effect may very well be non-linear. So what you need to be thinking about is how to represent that. First, have you explored graphically the relationship between gambling and age? (Try -lowess outcome age, logit- to get a sense of it.) You might want to represent age with a linear and quadratic term, or perhaps a linear spline with several knots. There are other possibilities as well.
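    As a sketch, assuming the variables are named gambling and age (the knot placements here are purely illustrative, not a recommendation):

    Code:
    * explore the gambling-age relationship graphically on the logit scale
    lowess gambling age, logit
    * one option: represent age as a linear spline with knots at 30, 45, and 60
    mkspline age1 30 age2 45 age3 60 age4 = age
    regress gambling i.competitive_sport age1 age2 age3 age4 other_covariates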

    Finally, let me urge you to put statistical significance aside: build a theoretically defensible model and estimate it. Then read the outputs until you fully understand everything Stata tells you from your -regress- command and from -margins- output except the p-values. Once you have done that, you will probably realize that the p-values have nothing to add to the story, except perhaps confusion. At that point, if you have nothing better to do with your time, you can read the p-values, too. Saving the p-values for last (if ever) will help you avoid the trap of being seduced into thinking you have some really important finding because of p = 0.05 in a data set so large (N = 9000) that it is hard to imagine a finding so small it wouldn't be statistically significant.

    • #3
      Thank you very much for your quick reply Clyde. I was thinking of splitting the age variable into over 45 and under 45 because my preliminary analysis showed that the majority of people who took part in competitive sport and gambled were aged 45 or below in this sample. However, I take your point that this would imply a 46 year old and 75 year old as being the same, so I realise that this is a bad idea.

      Indeed, my main concern was that a one year difference will look very small and hence be difficult to interpret meaningfully. I think that your suggestion about scaling the age variable to be 10 times larger will solve that problem.

      I had taken a look at the age distribution in the overall sample using a histogram and it was indeed non-linear. There appeared to be a quadratic relationship, so I included an age squared variable in my regression on the basis of this. When I looked at the age distribution of people who took part in competitive sport and gambled, this appeared to be skewed to the left.

      If I want to proceed by interacting the age_decades variable you suggested with my competitive sport variable, how would I go about doing this given the age distributions mentioned above? Would I simply be able to include an interaction term between competitive sport and age_decades along with having age and age squared as separate variables in my regression?

      And thank you for your advice about avoiding becoming too honed in on statistical significance. I will try my best to focus on building a theoretically defensible model.

      • #4
        So, if you're using a quadratic and linear term in age, you need to interact the competitive sport variable with both of them. So your regression will look something like this:

        Code:
        regress gambling i.competitive_sport##c.age##c.age other_covariates
        margins competitive_sport, at(age = (16(4)76))
        marginsplot, name(predicted_gambling, replace)
        margins, dydx(competitive_sport) at(age = (16(4)76))
        marginsplot, name(marginal_effects, replace)
        This will give you nice tables and graphs of the predicted probability of gambling in each competitive sports group at a representative range of ages, and of the marginal effect of competitive_sport over this same range of ages.

        If the i., c. and ## stuff doesn't look familiar, read -help fvvarlist- to learn about factor variable notation. Similarly, if you don't know the -margins- command, I suggest you begin with Richard Williams' http://www.stata-journal.com/sjpdf.h...iclenum=st0260. That is the clearest introduction to that command that I know of. After you've mastered that, you can learn more in the well-written and well-exampled -margins- section of the PDF manuals. But Richard Williams' article is a faster, easier way to learn the basics of it (and it covers everything you need for your particular situation.)

        • #5
          Thanks again Clyde. Would this method work for a logit/probit model as well?

          • #6
            Yes. In fact, the graphs created by -margins- will be even more useful in that setting because you have not just the quadratic specification of age, but also the non-linearity of the logit or probit link to contend with: it is really difficult to understand those models without those graphs.
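            For instance, the logit analogue of the commands in #4 would look something like this (same placeholder variable names as before):

            Code:
            * after -logit-, -margins- reports predicted probabilities by default
            logit gambling i.competitive_sport##c.age##c.age other_covariates
            margins competitive_sport, at(age = (16(4)76))
            marginsplot, name(predicted_gambling_logit, replace)
            margins, dydx(competitive_sport) at(age = (16(4)76))
            marginsplot, name(marginal_effects_logit, replace)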

            • #7
              I tried what you suggested both with the linear probability model and logit model and it works perfectly. Thank you so much!

              Would it be possible to bother you with one further query? Or should I be starting a new thread? I apologise in advance if I should be starting a new thread instead.

              I also believe that the effect of competitive sport on gambling is likely to vary by gender, with males who take part in competitive sport being more likely to gamble than females who take part in competitive sport. I am interested in looking at the marginal effect of the sport * gender interaction term as well. I noticed from one of Richard Williams' posts (http://www.stata.com/statalist/archi.../msg00263.html) that Stata does not present marginal effects for interaction terms.

              Would running a logit regression with the sport * gender interaction and then looking at how the marginal effect of sport varies by gender (using margins i.sex, dydx(i.sport) atmeans) be an appropriate way to look at this interaction?

              • #8
                Would running a logit regression with the sport * gender interaction and then looking at how the marginal effect of sport varies by gender (using margins i.sex, dydx(i.sport) atmeans) be an appropriate way to look at this interaction?
                Yes. And you can do it all in one fell swoop:

                Code:
                logit gambling i.competitive_sport##(c.age##c.age i.gender) other_covariates
                margins gender, dydx(competitive_sport) at(age = (16(4)76))
                marginsplot, name(marginal_sport_effects, replace)
                That will show you the marginal effect of competitive sport on the probability of gambling in each gender at a representative range of ages, and graph them out in a very readable, understandable way.

                You can also get the predicted probabilities in all of these combinations, too:

                Code:
                margins gender#competitive_sport, at(age = (16(4)76))
                marginsplot, name(predicted_probabilities, replace)
                I noticed from one of Richard Williams' posts (http://www.stata.com/statalist/archi.../msg00263.html) that Stata does not present marginal effects for interaction terms.
                Correct. That's not some idiosyncrasy of Stata. Interaction terms do not have marginal effects in the usual sense of the term; they simply don't exist. There is something of a semantic debate about whether some other statistics that can be calculated (and can be calculated in -margins-, for that matter) might be called the marginal effect of the interaction term: but that term doesn't mean the same thing or act like a marginal effect. In fact, if I told you what it is and explained it well enough, you would instantly realize that it is of no use and you wouldn't want to calculate it. What you really want is precisely what you have settled on in #7: the marginal effect of the competitive_sport variable in each gender (also evaluated over the range of ages).

                Added: This is a really good example of the beauty and power of the -margins- command and factor variable notation. Without -margins-, it would be a struggle even for experienced analysts to calculate these specific effects. With -margins-, it becomes almost a "no-brainer."
                Last edited by Clyde Schechter; 17 Apr 2017, 21:33.

                • #9
                  Once again, your suggestions worked amazingly well and help to show the effects very clearly.

                  Given that marginal effects do not exist for interaction terms, if I include the age and gender interactions in the model as above and then calculate the overall marginal effect of competitive sport on gambling (using margins, dydx(*) atmeans), would the overall marginal effect of competitive sport on gambling still take the interactions into account?

                  • #10
                    Yes.

                    • #11
                      And just to confirm would the interpretation still be the percentage point increase in probability of gambling for someone who took part in competitive sport compared to someone who did not undertake any physical activity at all? Because I thought that normally if you include interaction terms, say A*B, the interpretation of the marginal effect of A on its own changes?

                      • #12
                        The marginal effect of A in a model that contains A#B depends on the value of B. In addition, when the model is not linear, it will also depend on the values of all the variables in the model. But when you specify -atmeans-, you are fixing B and all the other variables at their mean values. So you have eliminated that source of variation in the marginal effect. The fully correct interpretation, then, is that this is the marginal effect of competitive sport on gambling conditional on all other variables taking on their mean values.

                        Now, let me go beyond your question. When your model contains discrete variables, like gender, using -atmeans- sets that variable to its mean value. So if gender is coded 0 = male and 1 = female, it sets gender = some value between 0 and 1 depending on the proportion of females in your data. This can be problematic because nobody is, say, .6 female. So it is better to do something like -margins gender, dydx(competitive_sport) atmeans-, which gives you two different marginal effects of competitive_sport, one for males and one for females. Even with that done, it is very likely that nobody in the data set actually has the mean value for all of the remaining variables. So you are calculating marginal effects for people that, in all probability, don't exist.
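                        To make the contrast concrete (variable names illustrative, as before):

                        Code:
                        * marginal effect of competitive sport in each gender, with all
                        * other covariates fixed at their mean values
                        margins gender, dydx(competitive_sport) atmeans
                        * average marginal effect in each gender, averaging over the
                        * observed covariate values instead of fixing them at their means
                        margins gender, dydx(competitive_sport)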

                        When all is said and done, it is the essence of interaction models that you are denying the existence of "the" marginal effect; you are affirming that there are multiple marginal effects, conditional on various things. That is why I recommend the approach in #4 and #8: it respects this conditionality of the marginal effect and exhibits it in an easily comprehensible way. I know that some people seem to crave a way to reduce that all to a single number--hence the popularity of things like average marginal effects, or marginal effects at means. But reducing all of this variation to a single number actually defeats the whole point of having interactions in the model. It isn't simple, it's simplistic, in my opinion. It's a bit like buying a bunch of shirts in several colors so you can have variety in what you wear, and then dying them all one color.

                        • #13
                          I also have a similar kind of problem and am looking for some advice! My data set contains a large number of dummy variables plus two continuous variables. I am looking to compare the LPM, logit and probit models, so I will need my results to reflect the marginal effects so that I can compare directly between the three models. I am interested in finding out the effect of gender on the probability of achieving a first class degree. My coding at the moment looks like this:

                          Code:
                          *Hypothesis two: first class degree
                          *marginal effects (at the mean marginal effect)
                          quietly regress first full fem ox cam notts liv exe birm war lough sur swan bed sta sec1 sec2 sec3 sec4 sec5 sec6 sec7 sec8 A B C D E F G H I J K L M N O P Q R AGE UCASPOINTS, robust
                          margins, dydx(*) atmeans
                          outreg2 using hypothesis2table.doc

                          quietly logit first full fem ox cam notts liv exe birm war lough sur swan bed sta sec1 sec2 sec3 sec4 sec5 sec6 sec7 sec8 A B C D E F G H I J K L M N O P Q R AGE UCASPOINTS, robust
                          margins, dydx(*) atmeans
                          outreg2 using hypothesis2table.doc

                          quietly probit first full fem ox cam notts liv exe birm war lough sur swan bed sta sec1 sec2 sec3 sec4 sec5 sec6 sec7 sec8 A B C D E F G H I J K L M N O P Q R AGE UCASPOINTS, robust
                          margins, dydx(*) atmeans
                          outreg2 using hypothesis2table.doc

                          Would this be the right way to code this problem? I would like to find the effect of 'fem' on 'first' whilst controlling for all the other variables. From running this I have found that the marginal effects are quite different between the LPM, logit and probit models. Is this expected, or should they give similar values?

                          After this I am also interested in looking at the marginal effect of gender on achieving a first class degree across all subjects (A,B,...,R), whilst controlling for my other variables but I am unsure of how to code this due to the difficulty of using margins with interaction terms.

                          I would really appreciate any help on this!

                          Thanks
