Question about inclusion of the reference group

Tim de Vries

Join Date: Aug 2018

Posts: 5
#1

05 Aug 2018, 06:28

Hey all,

My question is not really Stata-specific, but I did not know where else to ask this kind of empirical question (if you know of a more suitable forum, feel free to suggest one!).

I'm working on an assignment and I have used a binary logistic regression model. Now I'm writing down the model for estimation, but I was wondering if I should include reference categories in the description of my regression equation. For instance, I have innovation (m), a categorical variable (1 = less than 25% (reference group); 2 = 25-50%; 3 = 75-90%; 4 = more than 90%), as one of my independent variables, but I'm not sure whether I should include the reference group in describing my model. Likewise, I have an independent variable age (k) with 5 age categories (1 = age 18-25 (reference group); 2 = 25-45; 3 = 45-65; 4 = older than 65).

Right now, I've written down the following in my chapter about the empirical model: "m is the index of four categories of innovation (m = 1, 2, 3, 4) and k is the index of five categories of age (k = 1, 2, 3, 4, 5)". These two represent option A, so to speak. I would like to know whether this is correct or if, because the first innovation and age categories are my reference groups, I should write something like this: "m is the index of three categories of innovation (m = 2, 3, 4) and k is the index of four categories of age (2, 3, 4, 5)". These represent option B.

Now my question is: should I include my reference group in the description of my model (like in option A) or should I remove it (like in option B) for both innovation and age?

Hopefully my small problem is clear and someone can help me, thanks in advance!

Tim
Tags: None
Eric de Souza

Join Date: Mar 2014

Posts: 587
#2

05 Aug 2018, 06:43

If you have a categorical variable as an explanatory variable, then you should split it up into several indicator (dummy) variables. To take the case of innovation, if you include only one variable containing the four different categories, you are assuming that moving from one category to the next always has the same quantitative effect. For instance, going from 25-50% to 75-90% is assumed to have the same quantitative impact as moving from 75-90% to more than 90%. This may be the case but is highly unlikely. If you do include only one variable, the question of a reference category doesn't arise. If your variable is innovation, then entering i.innovation in your command instead of innovation will create the necessary indicator variables, one for each category.
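
In symbols, the contrast can be sketched as follows (the names $\alpha$, $\beta$, $\gamma_m$ and the indicator notation $1[\cdot]$ are illustrative assumptions, not taken from any output in this thread). Entering innovation as a single variable gives the linear index

$$\alpha + \beta\,\text{innovation}_{i}$$

so every one-category step shifts the index by the same $\beta$. Entering i.innovation instead, with category 1 as the base, gives

$$\alpha + \gamma_{2}\,1[\text{innovation}_{i}=2] + \gamma_{3}\,1[\text{innovation}_{i}=3] + \gamma_{4}\,1[\text{innovation}_{i}=4]$$

where each $\gamma_m$ is the shift relative to category 1, so the steps between categories are free to differ.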
Comment
Tim de Vries

Join Date: Aug 2018

Posts: 5
#3

05 Aug 2018, 07:05

Thanks for your reply. I have indeed done so in Stata, i.e. I have run my logit model with i.innovation and i.age and I do get categories 2-4 (for innovation) and 2-5 (for age) in the output. But does that also mean, when writing down and explaining my regression equation, that I can leave those reference categories out of that explanation? (I have explained what my reference groups are in a separate chapter)

In other words, that I can write something like this: "m is the index of three categories of innovation (m = 2, 3, 4) and k is the index of four categories of age (2, 3, 4, 5)"

Thanks,

Tim
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10219
#4

05 Aug 2018, 08:11

In the description, you should include all categories and simply state what the reference categories are in the regression. However, if you write down the equation, you omit the reference category.
Comment
Tim de Vries

Join Date: Aug 2018

Posts: 5
#5

05 Aug 2018, 08:43

Originally posted by Andrew Musau

In the description, you should include all categories and simply state what the reference categories are in the regression. However, if you write down the equation, you omit the reference category.

Thanks for your response, Andrew.

Right now I have a separate section earlier in my assignment in which I explain all of my variables and show all categories (so also the reference group for each variable). So if I understood you correctly, in the section where I write down my regression equation I can just use something akin to "m is the index of three categories of innovation (m = 2, 3, 4) and k is the index of four categories of age (2, 3, 4, 5)", (Option B so to speak in my first post) because the reference groups (k = 1 ; m = 1) are allowed to be omitted?

Tim
Comment
Eric de Souza

Join Date: Mar 2014

Posts: 587
#6

05 Aug 2018, 09:14

For me, if I were writing down the theoretical equation, that is, in symbols without the estimated values I would write down all categories. It is clear that when you are writing down the estimation results the reference category is automatically omitted. You can then state in a note that xxxxx is the reference category and is, therefore, omitted.
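
One way to reconcile the two views, sketched here with illustrative symbols ($\alpha$, $\gamma_m$ and $D_{mi}$ are assumptions made for this example): the theoretical equation can list every category, provided the coefficient of the reference category is normalised to zero,

$$\alpha + \sum_{m=1}^{4}\gamma_{m}D_{mi}, \qquad \gamma_{1}=0,$$

where $D_{mi}=1$ if observation $i$ falls in innovation category $m$ and $0$ otherwise. Because $D_{1i}+D_{2i}+D_{3i}+D_{4i}=1$ for every $i$, the four indicators are collinear with the constant, so some restriction of this kind is needed. With $\gamma_{1}=0$ the index is identical to the version that omits the reference category:

$$\alpha + \sum_{m=2}^{4}\gamma_{m}D_{mi}$$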
Comment
Tim de Vries

Join Date: Aug 2018

Posts: 5
#7

05 Aug 2018, 09:37

Originally posted by Eric de Souza

For me, if I were writing down the theoretical equation, that is, in symbols without the estimated values I would write down all categories. It is clear that when you are writing down the estimation results the reference category is automatically omitted. You can then state in a note that xxxxx is the reference category and is, therefore, omitted.

Yeah that's also what I'm currently thinking about. Obviously, in my output all reference groups are omitted (and actually just like you said, mentioned in a note) but I am not 100% sure whether to include them when I am writing the theoretical equation with just the symbols. I also don't know if it really is a big deal if you do it one way or the other.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10219
#8

05 Aug 2018, 09:38

Thanks for your response, Andrew.

Right now I have a separate section earlier in my assignment in which I explain all of my variables and show all categories (so also the reference group for each variable). So if I understood you correctly, in the section where I write down my regression equation I can just use something akin to "m is the index of three categories of innovation (m = 2, 3, 4) and k is the index of four categories of age (2, 3, 4, 5)", (Option B so to speak in my first post) because the reference groups (k = 1 ; m = 1) are allowed to be omitted?

I would think that the indicator variables have some sort of meaning. I would specify this in the equation. For example, I estimate the following logit model

Code:

. webuse lbw

. logit low age lwt i.race

Iteration 0:   log likelihood =   -117.336
Iteration 1:   log likelihood = -111.44695
Iteration 2:   log likelihood = -111.33851
Iteration 3:   log likelihood = -111.33847
Iteration 4:   log likelihood = -111.33847

Logistic regression                             Number of obs     =        189
                                                LR chi2(4)        =      12.00
                                                Prob > chi2       =     0.0174
Log likelihood = -111.33847                     Pseudo R2         =     0.0511

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0255505   .0332506    -0.77   0.442    -.0907204    .0396194
         lwt |  -.0143305   .0065215    -2.20   0.028    -.0271125   -.0015486
             |
        race |
      black  |     1.0034   .4979756     2.01   0.044     .0273856    1.979414
      other  |   .4438475   .3602312     1.23   0.218    -.2621927    1.149888
             |
       _cons |   1.304519   1.069718     1.22   0.223    -.7920902    3.401128
------------------------------------------------------------------------------

Here, the dependent variable is low birth weight, so I am estimating the probability that a mother gives birth to an underweight baby (low=1). My race variable has 3 categories, i.e., white, black and other. Given that I have already described this earlier, I simply write down the model as

$$p_i = \text{Prob}(\text{low}_i = 1) = f(\beta^{\prime}x_{i})$$

where

$$\beta^{\prime}x_{i} = \beta_{1} + \beta_{2}\text{age}_{i} +\beta_{3}\text{lwt}_{i} +\beta_{4}\text{black}_{i} +\beta_{5}\text{other}_{i}$$

Anyone looking at the model which omits white and knows that race has 3 categories can infer my reference group.
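
As a rough worked illustration of how the omitted category enters, take the coefficients reported above and a hypothetical mother aged 25 with lwt = 130 (these covariate values are made up for the example), and assume $f$ is the logistic function $f(z) = 1/(1+e^{-z})$, as in a logit model. For a white mother both indicators are zero:

$$\beta^{\prime}x_{i} = 1.304519 - 0.0255505(25) - 0.0143305(130) \approx -1.197, \qquad p_i \approx 0.23$$

For a black mother with the same age and weight, only $\beta_{4}$ is added:

$$\beta^{\prime}x_{i} \approx -1.197 + 1.0034 \approx -0.194, \qquad p_i \approx 0.45$$

So the reference group needs no term of its own: it is simply the case where black and other are both 0.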
Comment
Eric de Souza

Join Date: Mar 2014

Posts: 587
#9

05 Aug 2018, 09:43

Anyone looking at the model which omits white and knows that race has 3 categories can infer my reference group.

Since this is an assignment it would be safer to mention it so that the instructor sees that the student has understood it. My opinion.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10219
#10

05 Aug 2018, 09:53

I agree Eric de Souza
Comment
Tim de Vries

Join Date: Aug 2018

Posts: 5
#11

05 Aug 2018, 10:00

Thanks to you both for your input, I really appreciate it. In that case I'll probably keep them in the equation. I hope my instructor won't make too much of a big deal out of it regardless.
Comment
