Categorical or continuous variable in logistic model, Stata SE 14.0

Hang Vu

Join Date: Aug 2017

Posts: 12
#1

12 Oct 2017, 02:43

Dear Statalist,

I am a bit confused about categorical and continuous variables and how they affect the results of a logistic model.
I have a number of independent variables, such as household income, age of the household head, and household living area, which were collected and coded by a research agency as below:
Household income:
1 = till 999 Eur
2 = 1000 - 1999 Eur
3 = 2000 - 2999 Eur
4 = 3000 - 3999 Eur
5 = more than 4000 Euro
Household living area:
1 = North
2 = West
3 = East
4 = South
5 = Central
- I would like to put these independent variables in a logistic model. However, the results are very different depending on whether I treat household income as a factor variable or as a continuous variable. I think it is more understandable to treat household income as a continuous variable, because income can take any value within each category. But the way it was coded implies that it could be seen as a categorical variable. Please advise me on how I should enter this variable in the logistic model. However, to calculate the marginal effect, it is necessary to use factor variables instead of continuous variables.
- I think household living area should not be coded as numeric but as string, and treated as a factor variable in the logistic model, including when calculating the marginal effect. Is that correct?

Thank you,
Hang Vu
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#2

12 Oct 2017, 03:24

Stata only sees the values 1 to 5. It does not, and cannot, take into account the fact that there are labels attached to these numbers that tell humans that these values mean something else. So if you include household income as continuous, then it only uses those values 1 to 5, and not their meaning in euros. So I would treat it as categorical.

Region is categorical. That has nothing to do with string or non-string. In fact, if you turn it into a string variable, Stata will not be able to use it in your model.

Say your variables are called hhinc and area and your dependent variable is called y, then you would type:

Code:

logit y i.hhinc i.area

The i. tells Stata to treat that variable as categorical. After that you can just use margins to compare the categories of your categorical variable. For region you may want to look at contrast with the gw. operator, which compares each category with the observation-weighted grand mean.
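As a rough illustration of those two follow-up commands, here is a minimal sketch that assumes the variable names y, hhinc and area from the example above. It is one possible way to use margins and contrast here, not the only one.

```stata
* Variable names y, hhinc and area follow the example above
logit y i.hhinc i.area

* Average predicted probability of y == 1 within each income category
margins hhinc

* Each region compared with the observation-weighted grand mean,
* on the log-odds scale of the fitted model
contrast gw.area
```

The margins output is on the probability scale, while contrast after logit works on the linear prediction (log odds) by default.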

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Hang Vu

Join Date: Aug 2017

Posts: 12
#3

12 Oct 2017, 04:03

Dear Mr Buis,

Thank you for your reply. It is completely understandable that region is categorical, but it is not so clear for income. Therefore, I am still concerned about how to interpret the result. If I treat hhinc as categorical, I cannot conclude anything about how an increase in HH income influences the likelihood that the dependent variable takes the value "1", because the category only changes from 1 to 2, 2 to 3 and so on (as a label), without implying any increasing trend in HH income.
One more thing: could the coding of HH income as 1 to 5 be understood as an interval scale, where each step refers to a change of 1000 euro between income categories?
Would you mind explaining the case of income again?

Thank you so much,
Hang

Last edited by Hang Vu; 12 Oct 2017, 04:18.
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#4

12 Oct 2017, 05:24

Originally posted by Hang Vu

If I treat hhinc as categorical, I cannot conclude anything about how an increase in HH income influences the likelihood that the dependent variable takes the value "1", because the category only changes from 1 to 2, 2 to 3 and so on (as a label), without implying any increasing trend in HH income.

The income categories are ordinal, so you can say something about an increase in income, but not about a 1 Euro increase in income. That is unfortunate, but if you don't have the necessary data, then that is all you can do.

Originally posted by Hang Vu

One more thing: could the coding of HH income as 1 to 5 be understood as an interval scale, where each step refers to a change of 1000 euro between income categories?

The problem is with the lowest and highest category: You are implicitly assuming that the lowest income is 0 and the highest 4999.
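To make that implicit assumption visible, here is a small sketch that recodes the categories to midpoints in euros. The names y, hhinc and area are the ones assumed earlier in the thread, hhinc_mid is a hypothetical new variable, and the values 500 and 4500 for the bottom and top categories are exactly the kind of guess described above, not information contained in the data.

```stata
* Hypothetical midpoint coding; 500 assumes the lowest income is 0,
* and 4500 for the open-ended top category assumes a maximum of 4999
recode hhinc (1 = 500) (2 = 1500) (3 = 2500) (4 = 3500) (5 = 4500), generate(hhinc_mid)

* Enter the midpoints as a continuous predictor
logit y c.hhinc_mid i.area
```

Because hhinc_mid equals 1000*hhinc - 500, this model fits the data exactly as well as entering hhinc directly as continuous; only the scale of the coefficient changes. Using 1 to 5 as a continuous variable therefore already carries the midpoint assumption.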

Hang Vu

Join Date: Aug 2017

Posts: 12
#5

12 Oct 2017, 08:17

Dear Mr Buis,
Thank you again for your prompt answer.
Do I understand correctly that it is acceptable to run the logit as below:

Code:

logit depvar hhinc i.area

The result could be interpreted as, for example: a negative coefficient indicates that moving a household up to a higher income group lowers the probability of the dependent variable taking the value "1".

However, to calculate marginal effect:

Code:

logit depvar i.hhinc i.area

then:

Code:

margins hhinc

The result could be interpreted as, for example: a coefficient of 0.2 indicates that if a household moves from income group 1 to group 2, the probability of the dependent variable taking the value "1" will increase by 20%.
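A note on the margins step: margins hhinc reports the predicted probability within each income group, not the difference between groups. Differences can be requested explicitly. The following is a minimal sketch using the same assumed names depvar, hhinc and area:

```stata
logit depvar i.hhinc i.area

* Change in predicted probability relative to the base group (group 1)
margins, dydx(hhinc)

* Change relative to the adjacent lower group: 2 vs 1, 3 vs 2, and so on
margins ar.hhinc
```

On this scale a difference of 0.2 would be read as 20 percentage points rather than 20 percent.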

Thank you,
Hang Vu
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#6

12 Oct 2017, 08:58

I know you want to add hhinc as continuous, but you will have to live with the fact that no amount of statistical trickery can create information where none exists in the data. Your data contain information on hhinc only in categorical (ordinal) form, so you will just have to include it as categorical.

Hang Vu

Join Date: Aug 2017

Posts: 12
#7

12 Oct 2017, 11:41

Maarten Buis, thank you again for your patience. I am not trying to force it to be continuous. I think the interpretation of the coefficient would be the same in both cases. However, when I enter hhinc as a continuous variable, the p-value is significant, while when I enter it as a factor variable, the p-value is not significant at all. This concerns me, because Stata cannot recognize that the HH income measurement scale (ordinal) means something different from the HH region measurement scale (nominal). That is why I think it is more appropriate to enter HH income as continuous: a continuous variable expresses a move along the scale from 1 to 5 in a way that is closer to a move from 1000 - 1999 to 2000 - 2999 and so on.
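On the p-value comparison: with i.hhinc the output shows four separate coefficients, each comparing one income category with the base category, so no single p-value corresponds to the one reported for the continuous specification. A joint test of all income indicators is a closer counterpart. A minimal sketch, assuming the names y, hhinc and area used earlier in the thread:

```stata
logit y i.hhinc i.area

* Joint Wald test that all income-category coefficients are zero
testparm i.hhinc
```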

Thank you,
Hang Vu

Last edited by Hang Vu; 12 Oct 2017, 11:57.
Hang Vu

Join Date: Aug 2017

Posts: 12
#8

12 Oct 2017, 12:54

Maarten Buis, I found this handout from Mr Williams discussing ordinal independent variables: https://www3.nd.edu/~rwilliam/stats3...ndependent.pdf. I think it is also quite relevant. Would you be open to discussing it?

Thank you,
Hang Vu
Richard Williams

Join Date: Apr 2014

Posts: 5008
#9

12 Oct 2017, 13:05

Originally posted by Hang Vu

Maarten Buis, I found this handout from Mr Williams discussing ordinal independent variables: https://www3.nd.edu/~rwilliam/stats3...ndependent.pdf. I think it is also quite relevant. Would you be open to discussing it?

Thank you,
Hang Vu

I was just scrolling down to suggest that handout! It outlines tests you can use to see if an ordinal variable can be treated as continuous, so try those. The open-ended final category may be the biggest problem.
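One common way to run such a check is a likelihood-ratio test of the linear specification against the categorical one, since the linear model is nested in the categorical model. The sketch below is an assumption about what such a test could look like, not a reproduction of the handout, and it reuses the names y, hhinc and area from earlier posts; linear and categorical are arbitrary labels for the stored estimates.

```stata
* Income entered as continuous (one coefficient)
logit y c.hhinc i.area
estimates store linear

* Income entered as categorical (four coefficients)
logit y i.hhinc i.area
estimates store categorical

* Likelihood-ratio test of the linear restriction
lrtest linear categorical
```

A small p-value suggests the equal-spacing (linear) restriction does not fit, in which case the categorical specification is preferable. Both models must be fitted on the same observations and without robust standard errors for lrtest to be valid.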

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Hang Vu

Join Date: Aug 2017

Posts: 12
#10

12 Oct 2017, 13:16

Dear Richard,
Thank you for your prompt reply; I have also just read about this problem. In my data, 278 of 8400 HH have an income of more than 5000 euro/year, so the share of HH in this group is not that high. Would it be persuasive to argue that the open-ended category is less problematic in this case?
Richard Williams

Join Date: Apr 2014

Posts: 5008
#11

12 Oct 2017, 13:27

Well, if those 278 people are multi-millionaires it may matter a lot. So try the tests and see.

Also, other than loss of parsimony, probably nothing too horrible will happen if you treat it as categorical.
