  • Specifying and interpreting models with interactions between a continuous variable and a factor variable

    Hi Statalist community,

    I am trying to specify and interpret models with interactions between a continuous variable and a factor variable. Below is a toy dataset. The outcome of interest is a binary variable for being employed, with 1 being employed and 0 otherwise. I have a 7-category factor variable describing a person's prior education. I also have a continuous variable for age. I am trying to create an interaction between the education and age variables. I have never seen a continuous variable interacted with a factor variable that has more than two levels. Is this the correct way to do it?
    Also, would it be correct to interpret the interaction terms as "for a particular education level, a one year increase in age is associated with a ______ unit increase in being employed." Thank you so much.

    Code:
    clear all
    set obs 1000
    
    *Dummy for whether someone is employed
    generate employ = runiformint(0, 1)
    
    *Education level
    generate education = runiformint(1, 7)
    label define education 1 "less than hs" 2 "hs only" 3 "associates degree" 4 "ba/bs" 5 "masters" 6 "doctorate" 7 "postdoc training"
    label val education education
    
    *Age
    generate age = runiformint(18, 64)
    
    *Run logit model estimating employment with education and age interaction
    logit employ i.education##c.age

  • #2
    Is this the correct way to do it?
    Yes.

    Also, would it be correct to interpret the interaction terms as "for a particular education level, a one year increase in age is associated with a ______ unit increase in being employed."
    No. First, because this is a logistic model, not a linear model, the coefficient has no interpretation as a marginal effect on the outcome probability. At best you could say that a coefficient is the expected change in the log odds of being employed associated with a one year increase in age.
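    To see concretely why a fixed change in the log odds does not correspond to any fixed change in probability, here is a quick illustration (written in Python, since it is just arithmetic; the baseline log-odds values are arbitrary, chosen only for demonstration):

```python
import math

def invlogit(z):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# The same +0.1 change in log odds...
for base in (-2.0, 0.0, 2.0):
    print(round(invlogit(base + 0.1) - invlogit(base), 4))
# ...implies a different probability change at each baseline:
# roughly 0.0109, 0.0250, and 0.0101 respectively
```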

    But even that isn't correct because the interaction coefficients do not work that way. Let's use your example. I ran it, and it produces this result:
    Code:
    . logit employ i.education##c.age
    
    Iteration 0:  Log likelihood = -693.09718  
    Iteration 1:  Log likelihood = -686.68863  
    Iteration 2:  Log likelihood = -686.68781  
    Iteration 3:  Log likelihood = -686.68781  
    
    Logistic regression                                     Number of obs =  1,000
                                                            LR chi2(13)   =  12.82
                                                            Prob > chi2   = 0.4619
    Log likelihood = -686.68781                             Pseudo R2     = 0.0092
    
    ------------------------------------------------------------------------------------
                employ | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
             education |
              hs only  |    .975355   .8192212     1.19   0.234    -.6302891    2.580999
    associates degree  |   .3083571   .7926751     0.39   0.697    -1.245257    1.861972
                ba/bs  |    .018488   .7381733     0.03   0.980    -1.428305    1.465281
              masters  |   .7828401   .7654795     1.02   0.306    -.7174722    2.283152
            doctorate  |   .2678314   .8101819     0.33   0.741    -1.320096    1.855759
     postdoc training  |  -.1113451   .7922442    -0.14   0.888    -1.664115    1.441425
                       |
                   age |  -.0018536   .0128959    -0.14   0.886    -.0271291     .023422
                       |
       education#c.age |
              hs only  |  -.0197494   .0188845    -1.05   0.296    -.0567623    .0172634
    associates degree  |  -.0060461   .0181128    -0.33   0.739    -.0415464    .0294543
                ba/bs  |  -.0021015   .0171611    -0.12   0.903    -.0357367    .0315337
              masters  |  -.0192974   .0178177    -1.08   0.279    -.0542195    .0156246
            doctorate  |   .0016788   .0187431     0.09   0.929     -.035057    .0384147
     postdoc training  |  -.0024043   .0179687    -0.13   0.894    -.0376223    .0328138
                       |
                 _cons |   .0188641   .5671397     0.03   0.973    -1.092709    1.130438
    ------------------------------------------------------------------------------------
    For less than hs education (the base value for education), there is no interaction term, and a 1 year increase in age will be associated with a decrease of .0018536 (see the coefficient of age above) in the log odds of employ. Since most people have at least some difficulty wrapping their minds around log odds, we can simplify a bit by going to the somewhat more familiar odds metric by exponentiating. A one year age increase is associated with the odds of employ changing by a factor of exp(-.0018536) = 0.998 (to 3 decimal places). One might make this even easier to understand by calling it a 0.2% decrease.
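    A quick check of that exponentiation (shown in Python, since it is plain arithmetic rather than anything Stata-specific):

```python
import math

b_age = -0.0018536            # age coefficient from the logit output above
odds_ratio = math.exp(b_age)  # odds ratio for a 1 year age increase
print(round(odds_ratio, 3))   # 0.998, i.e. about a 0.2% decrease in the odds
```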

    For the other education categories, it gets a little more complicated. Let's use ba/bs as the example. Now we have to look not just at the coefficient of age, but also at the ba/bs#c.age coefficient. The "effective coefficient" of age for education = ba/bs is the sum of the age coefficient and the ba/bs#c.age coefficient. That is, -.0018536 + (-.0021015), or -0.0039551. Again, this number represents the decrease in the log odds of employ associated with a 1 year increase in age when education == ba/bs. Exponentiating that gives us, to three decimal places, 0.996. So among those with education == ba/bs, a 1 year age increase is associated with a decrease in the odds of employ by a factor of 0.996, or, equivalently, a 0.4% decrease in the odds of employ.
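    The same arithmetic check for the ba/bs "effective coefficient" (again just arithmetic, written in Python):

```python
import math

b_age = -0.0018536    # coefficient of age (applies to the base category)
b_int = -0.0021015    # ba/bs#c.age interaction coefficient
eff = b_age + b_int   # effective age coefficient when education == ba/bs
print(round(eff, 7))            # -0.0039551
print(round(math.exp(eff), 3))  # 0.996: about a 0.4% decrease in the odds
```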

    The other education categories work just like ba/bs--less than hs is different only because it is the base category and has no associated interaction coefficient.

    Now, your initial reaction might be that this is pretty unsatisfactory. First of all, this is a lot of work--and tedious, error-prone work at that. But you can automate that:
    Code:
    levelsof education if e(sample), local(edlevels)
    foreach e of local edlevels {
        display "Education level: `:label (education) `e''"
        lincom age + `e'.education#c.age, or
    }
    But even this is a bit unsatisfactory because you are still getting your answers in the odds of employ. While that is comprehensible, it is not as natural as asking how much difference there is in the probability of employ. And that is much more complicated to do by hand, because a given odds ratio does not correspond to any specific change in probability: the corresponding change in probability depends on what the probability is before the 1 year increase in age. Fortunately, the -margins- command can handle this complexity. But it is not reduced to a no-brainer: you have to think about what starting values of the model's predictor variables to use. There are many choices for this, of different usefulness for different purposes. One of the commonest is the average marginal effect. That is, we calculate the effect of a 1 year age increase for each person in the data set, using that person's particular values of all the variables, compute the resulting change in the probability of employ, and average those changes. (In your example, age and education are the only variables, but I'm thinking of the more general setting here.) The code for this would be:
    Code:
    margins education, dydx(age)
    This will give you an output table with 7 rows, one for each level of education. In each row you will find the associated change in the average probability of employ associated with a year age increase. In your example, it looks like this:

    Code:
    ------------------------------------------------------------------------------------
                       |            Delta-method
                       |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
    age                |
             education |
         less than hs  |  -.0043342   .0029927    -1.45   0.148    -.0101998    .0015314
              hs only  |  -.0007664   .0035431    -0.22   0.829    -.0077107    .0061779
    associates degree  |   .0023469    .003219     0.73   0.466    -.0039624    .0086561
                ba/bs  |   -.001671   .0028021    -0.60   0.551     -.007163     .003821
              masters  |   .0018659   .0032108     0.58   0.561    -.0044271    .0081589
            doctorate  |  -.0023542   .0027454    -0.86   0.391    -.0077351    .0030266
     postdoc training  |  -.0054248    .002958    -1.83   0.067    -.0112224    .0003727
    ------------------------------------------------------------------------------------
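    To make concrete what -margins, dydx(age)- is averaging, here is a minimal sketch in Python of the average marginal effect for a one-predictor logistic model. The coefficients and ages here are hypothetical, chosen purely for illustration (not the estimates above); the key fact used is that for a logistic model the marginal effect of age at each observation is p*(1-p)*b_age:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only
b_cons, b_age = 0.02, -0.002

ages = range(18, 65)
# Marginal effect of age at each observation: dP/d(age) = p*(1-p)*b_age
effects = [sigmoid(b_cons + b_age * a) * (1 - sigmoid(b_cons + b_age * a)) * b_age
           for a in ages]
ame = sum(effects) / len(effects)  # the average marginal effect
print(ame)                         # small and negative, since b_age < 0
```

In the real model, -margins education, dydx(age)- performs this averaging separately within each education level, using each observation's actual covariate values.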
    Note: You may get different results for the regression and -margins- when you run the same code. That's because you did not set the random number seed in your code before generating the random data. My random number generator may not be in sync with yours, so I may have random data that differs from your random data. But the principles are all the same.

    Added: For further information about interactions and the -margins- command, I highly recommend Richard Williams's excellent https://www3.nd.edu/~rwilliam/stats/Margins01.pdf and https://www3.nd.edu/~rwilliam/stats2/l53.pdf.
    Last edited by Clyde Schechter; 22 Sep 2023, 16:09.



    • #3
      @Clyde Schechter
      Thank you so much for being so thorough and walking me through your process. Thank you for confirming how to specify the model correctly, and thank you for providing a detailed explanation of estimating the average marginal effects.
