  • Question about inclusion of the reference group

    Hey all,

    My question is not really Stata-specific, but I did not know where else to ask this kind of empirical question (if you know of a more suitable forum, feel free to suggest it!).

    I'm working on an assignment and I have used a binary logistic regression model. Now I'm writing down the model for estimation, but I was wondering if I should include reference categories in the description of my regression equation. For instance, I have innovation (m), a categorical variable (1 = less than 25% (reference group); 2 = 25-50%; 3 = 75-90%; 4 = more than 90%), as one of my independent variables, but I'm not sure whether I should include the reference group in describing my model. Likewise, I have an independent variable age (k) with 5 age categories (1 = age 18-25 (reference group); 2 = 25-45; 3 = 45-65; 4 = older than 65).

    Right now, I've written down the following in my chapter about the empirical model: "m is the index of four categories of innovation (m = 1, 2, 3, 4) and k is the index of five categories of age (k = 1, 2, 3, 4, 5)". This is option A, so to speak. I would like to know whether this is correct or if, because the first innovation and age categories are my reference groups, I should write something like this: "m is the index of three categories of innovation (m = 2, 3, 4) and k is the index of four categories of age (k = 2, 3, 4, 5)". This is option B.

    Now my question is: should I include my reference group in the description of my model (like in option A) or should I remove it (like in option B) for both innovation and age?

    Hopefully my small problem is clear and someone can help me, thanks in advance!

    Tim

  • #2
    If you have a categorical variable as an explanatory variable, then you should split it up into several indicator (dummy) variables. To take the case of innovation, if you include only one variable containing the four different categories, you are assuming that moving from one category to another has the same quantitative effect. For instance, going from 25-50% to 75-90% has the same quantitative impact as moving from 75-90% to more than 90%. This may be the case but is highly unlikely. If you do include only one variable, the question of reference category doesn't arise. If your variable is innovation, then entering i.innovation in your command instead of innovation will create the necessary indicator variables, one for each category.
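
    To make the contrast concrete, here is a sketch in symbols (the notation $\eta_i$, $\alpha$, $\beta$, $\gamma_m$ and $D_{mi}$ is assumed for illustration and does not come from the thread). Entering innovation as a single numeric variable gives

    $$\eta_i = \alpha_0 + \beta\,\text{innovation}_i,$$

    so every move between adjacent categories shifts the log-odds $\eta_i$ by the same amount $\beta$. Entering i.innovation instead gives

    $$\eta_i = \alpha + \gamma_2 D_{2i} + \gamma_3 D_{3i} + \gamma_4 D_{4i},$$

    where $D_{mi} = 1$ if observation $i$ falls in category $m$ and 0 otherwise. Each category then has its own shift relative to category 1, and the single-variable model is the restricted case $\gamma_m = (m-1)\beta$, with the intercept adjusted accordingly.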

    • #3
      Thanks for your reply. I have indeed done so in Stata, i.e. I have run my logit model with i.innovation and i.age and I do get categories 2-4 (for innovation) and 2-5 (for age) in the output. But does that also mean, when writing down and explaining my regression equation, that I can leave those reference categories out of that explanation? (I have explained what my reference groups are in a separate chapter)

      In other words, that I can write something like this: "m is the index of three categories of innovation (m = 2, 3, 4) and k is the index of four categories of age (2, 3, 4, 5)"

      Thanks,

      Tim

      • #4
        In the description, you should include all categories and simply state what the reference categories are in the regression. However, if you write down the equation, you omit the reference category.

        • #5
          Originally posted by Andrew Musau
          In the description, you should include all categories and simply state what the reference categories are in the regression. However, if you write down the equation, you omit the reference category.
          Thanks for your response, Andrew.

          Right now I have a separate section earlier in my assignment in which I explain all of my variables and show all categories (so also the reference group for each variable). So if I understood you correctly, in the section where I write down my regression equation I can just use something akin to "m is the index of three categories of innovation (m = 2, 3, 4) and k is the index of four categories of age (2, 3, 4, 5)", (Option B so to speak in my first post) because the reference groups (k = 1 ; m = 1) are allowed to be omitted?

          Tim

          • #6
            For me, if I were writing down the theoretical equation, that is, in symbols without the estimated values I would write down all categories. It is clear that when you are writing down the estimation results the reference category is automatically omitted. You can then state in a note that xxxxx is the reference category and is, therefore, omitted.
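
            A sketch of what "all categories" could look like in symbols, using the indices m and k from the question (the coefficient names $\alpha$, $\gamma_m$, $\delta_k$ and the indicators $D_{mi}$, $A_{ki}$ are assumed here for illustration, and any other regressors are left out):

            $$\ln\frac{p_i}{1-p_i} = \alpha + \sum_{m=1}^{4}\gamma_m D_{mi} + \sum_{k=1}^{5}\delta_k A_{ki}, \qquad \gamma_1 = \delta_1 = 0.$$

            The restriction matters: with a constant in the model, the indicators for each variable sum to one, so the coefficients are not separately identified unless one per variable is fixed (here at zero, for the reference groups).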

            • #7
              Originally posted by Eric de Souza
              For me, if I were writing down the theoretical equation, that is, in symbols without the estimated values I would write down all categories. It is clear that when you are writing down the estimation results the reference category is automatically omitted. You can then state in a note that xxxxx is the reference category and is, therefore, omitted.
              Yeah that's also what I'm currently thinking about. Obviously, in my output all reference groups are omitted (and actually just like you said, mentioned in a note) but I am not 100% sure whether to include them when I am writing the theoretical equation with just the symbols. I also don't know if it really is a big deal if you do it one way or the other.

              • #8
                Thanks for your response, Andrew.

                Right now I have a separate section earlier in my assignment in which I explain all of my variables and show all categories (so also the reference group for each variable). So if I understood you correctly, in the section where I write down my regression equation I can just use something akin to "m is the index of three categories of innovation (m = 2, 3, 4) and k is the index of four categories of age (2, 3, 4, 5)", (Option B so to speak in my first post) because the reference groups (k = 1 ; m = 1) are allowed to be omitted?
                I would think that the indicator variables have some sort of meaning. I would specify this in the equation. For example, I estimate the following logit model

                Code:
                . webuse lbw
                
                . logit low age lwt i.race
                
                Iteration 0:   log likelihood =   -117.336  
                Iteration 1:   log likelihood = -111.44695  
                Iteration 2:   log likelihood = -111.33851  
                Iteration 3:   log likelihood = -111.33847  
                Iteration 4:   log likelihood = -111.33847  
                
                Logistic regression                             Number of obs     =        189
                                                                LR chi2(4)        =      12.00
                                                                Prob > chi2       =     0.0174
                Log likelihood = -111.33847                     Pseudo R2         =     0.0511
                
                ------------------------------------------------------------------------------
                         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                         age |  -.0255505   .0332506    -0.77   0.442    -.0907204    .0396194
                         lwt |  -.0143305   .0065215    -2.20   0.028    -.0271125   -.0015486
                             |
                        race |
                      black  |     1.0034   .4979756     2.01   0.044     .0273856    1.979414
                      other  |   .4438475   .3602312     1.23   0.218    -.2621927    1.149888
                             |
                       _cons |   1.304519   1.069718     1.22   0.223    -.7920902    3.401128
                ------------------------------------------------------------------------------
                
                .
                Here, the dependent variable is low birth weight, so I am estimating the probability that a mother gives birth to an underweight baby (low=1). My race variable has 3 categories, i.e., white, black and other. Given that I have already described this earlier, I simply write down the model as

                $$p_i = \text{Prob}(\text{low}_i = 1) = f(\beta^{\prime}x_{i})$$

                where

                $$\beta^{\prime}x_{i} = \beta_{1} + \beta_{2}\text{age}_{i} +\beta_{3}\text{lwt}_{i} +\beta_{4}\text{black}_{i} +\beta_{5}\text{other}_{i}$$

                Anyone looking at the model which omits white and knows that race has 3 categories can infer my reference group.
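
                Applied to the original question, the same approach might look as follows. This is only a sketch: the symbols $\alpha$, $\gamma_m$, $\delta_k$ and the indicator names are assumed, and any other regressors are left out.

                $$\ln\frac{p_i}{1-p_i} = \alpha + \sum_{m=2}^{4}\gamma_m\,\text{innov}_{mi} + \sum_{k=2}^{5}\delta_k\,\text{age}_{ki}$$

                Here $\text{innov}_{mi}$ and $\text{age}_{ki}$ equal 1 when observation $i$ is in category $m$ or $k$ respectively, and $\alpha$ is the log-odds for an observation in both reference groups ($m = 1$, $k = 1$). Each $\gamma_m$ and $\delta_k$ is then a difference in log-odds relative to the reference group.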

                • #9
                  Anyone looking at the model which omits white and knows that race has 3 categories can infer my reference group.
                  Since this is an assignment it would be safer to mention it so that the instructor sees that the student has understood it. My opinion.

                  • #10
                    I agree with Eric de Souza.

                    • #11
                      Thanks to you both for your input, I really appreciate it. In that case I'll probably keep them in the equation. I hope my instructor won't make too much of a big deal out of it regardless.
