  • Categorical or continuous variable in logistic model, Stata SE 14.0

    Dear Statalist,

    I am a bit confused about categorical versus continuous variables and how the choice affects the results of a logistic model.
    I have a number of independent variables, such as household income, age of the household head, and household living area, which were collected and coded by a research agency as below:
    Household income:
    1 = up to 999 EUR
    2 = 1000 - 1999 EUR
    3 = 2000 - 2999 EUR
    4 = 3000 - 3999 EUR
    5 = more than 4000 EUR
    Household living area:
    1 = North
    2 = West
    3 = East
    4 = South
    5 = Central
    - I would like to include these independent variables in a logistic model. However, the results differ substantially depending on whether I treat household income as a factor variable or as a continuous variable. I think treating household income as continuous is easier to interpret, because actual income can take any value within each category. But the way it was coded implies that it could be seen as a categorical variable. Please advise me on how I should enter this variable in the logistic model. As I understand it, calculating the marginal effect requires entering it as a factor variable rather than as a continuous variable.
    - I think household living area should not be coded as numeric but as string, and treated as a factor variable in the logistic model and when calculating the marginal effect. Is that correct?

    Thank you,
    Hang Vu


  • #2
    Stata only sees the values 1 to 5. It does not, and cannot, take into account the fact that there are labels attached to these numbers that tell humans that these values mean something else. So if you include household income as continuous, then it only uses those values 1 to 5, and not their meaning in euros. So I would treat it as categorical.

    Region is categorical. That has nothing to do with string or non-string. In fact, if you turn it into a string variable it will be dropped from your model.

    Say your variables are called hhinc and area, and your dependent variable is called y; then you would type:

    Code:
    logit y i.hhinc i.area
    The i. tells Stata to treat that variable as categorical. After that you can just use margins to compare the categories of your categorical variable. For region you may want to look at contrast with the gw. prefix.
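
    As a minimal sketch of that workflow (using the names y, hhinc and area assumed above; the particular follow-up commands are an illustration, not part of the original answer), the model, the predicted probabilities per income category, and the region contrasts could be obtained like this:

    ```stata
    * Fit the logit with both predictors treated as categorical
    logit y i.hhinc i.area

    * Average predicted probability of y == 1 in each income category
    margins hhinc

    * Average marginal effects: change in Pr(y == 1) relative to the base category
    margins, dydx(hhinc)

    * Compare each region with the observation-weighted grand mean
    contrast gw.area
    ```

    Note that contrast, used this way after logit, works on the linear-prediction (log-odds) scale, whereas the margins results are on the probability scale.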
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Dear Mr Buis,

      Thank you for your reply. It makes sense to treat region as categorical, but income is less clear to me, so I am still concerned about how to interpret the results. If I treat hhinc as categorical, I cannot conclude anything about how an increase in HH income influences the likelihood that the dependent variable takes the value "1", because the category only changes from 1 to 2, 2 to 3 and so on (as a label), without implying any increasing trend in HH income.
      One more thing: could the coding of HH income as 1 to 5 be understood as an interval scale, where each step corresponds to a change of 1000 euro in income?
      Would you mind explaining the case of income to me again?

      Thank you so much,
      Hang
      Last edited by Hang Vu; 12 Oct 2017, 04:18.



      • #4
        Originally posted by Hang Vu
        If I treat hhinc as categorical, I cannot conclude anything about how an increase in HH income influences the likelihood that the dependent variable takes the value "1", because the category only changes from 1 to 2, 2 to 3 and so on (as a label), without implying any increasing trend in HH income.
        The income categories are ordinal, so you can say something about an increase in income, but not about a 1 Euro increase in income. That is unfortunate, but if you don't have the necessary data, then that is all you can do.

        Originally posted by Hang Vu
        One more thing: could the coding of HH income as 1 to 5 be understood as an interval scale, where each step corresponds to a change of 1000 euro in income?
        Hang
        The problem is with the lowest and highest categories: you are implicitly assuming that the lowest income is 0 and the highest is 4999.
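
        To make that implicit assumption concrete, here is a sketch of what treating the codes as an interval scale amounts to: replacing each category with an assumed midpoint. The variable name hhinc_mid and the midpoints are illustrative assumptions, and the values for the first and last categories are exactly the guesses described above (incomes bounded by 0 and 4999).

        ```stata
        * Replace each income code with an assumed category midpoint in euros.
        * 500 and 4500 are guesses: they presume incomes run from 0 to 4999.
        recode hhinc (1 = 500) (2 = 1500) (3 = 2500) (4 = 3500) (5 = 4500), generate(hhinc_mid)

        * Entering hhinc linearly is equivalent to entering these midpoints:
        * the fit is identical, and the slope is simply rescaled by 1000.
        logit y c.hhinc i.area
        logit y c.hhinc_mid i.area
        ```

        If the true averages in the bottom or top category are far from 500 or 4500, the equal-spacing assumption built into the linear term is off in precisely those categories.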

        Maarten L. Buis


        • #5
          Dear Mr Buis,
          Thank you again for your prompt answer.
          Do I understand correctly that it is acceptable to run the logit as below:
          Code:
          logit depvar hhinc i.area
          The result could be interpreted as, for example: a negative coefficient indicates that moving up an income group decreases the probability of the dependent variable taking the value "1".

          However, to calculate the marginal effect:
          Code:
          logit depvar i.hhinc i.area
          then:
          Code:
          margins hhinc
          The result could be interpreted as, for example: the coefficient (for example 0.2) indicates that if a household moves from income group 1 to group 2, the probability of the dependent variable taking the value "1" will increase by 20%.

          Thank you,
          Hang Vu



          • #6
            I know you want to add hhinc as continuous, but you will have to live with the fact that no amount of statistical trickery can create information where none exists in the data. Your data contains information on hhinc only in categorical (ordinal) form, so you will just have to include it as categorical.
            Maarten L. Buis


            • #7
              Thank you again for your patience. Well, I do not want to force it to be continuous. I think the interpretation of the coefficient would be the same in both cases. However, when I enter hhinc as a continuous variable, the p-value is significant, while when I enter it as a factor variable, the p-value is not significant at all. This concerns me, because Stata cannot recognize that the HH income measurement scale (ordinal) means something different from the HH region measurement scale (nominal). That is why I think it would be more appropriate to enter HH income as continuous: a continuous variable treats the steps on the 1 to 5 scale more like the steps from 1000 - 1999 to 2000 - 2999 and so on.

              Thank you,
              Hang Vu
              Last edited by Hang Vu; 12 Oct 2017, 11:57.



              • #8
                Maarten Buis, I found this handout from Mr Williams discussing ordinal independent variables: https://www3.nd.edu/~rwilliam/stats3...ndependent.pdf. I think it is also quite relevant. Would you be open to discussing it?

                Thank you,
                Hang Vu



                • #9
                  Originally posted by Hang Vu
                  Maarten Buis, I found this handout from Mr Williams discussing ordinal independent variables: https://www3.nd.edu/~rwilliam/stats3...ndependent.pdf. I think it is also quite relevant. Would you be open to discussing it?

                  Thank you,
                  Hang Vu
                  I was just scrolling down to suggest that handout! It outlines tests you can use to see if an ordinal variable can be treated as continuous, so try those. The open-ended final category may be the biggest problem.
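
                  One test of this kind, sketched here under the assumption that the variables are named y, hhinc and area as earlier in the thread, is a likelihood-ratio comparison of the linear and categorical specifications. Whether this matches the handout's exact procedure is not confirmed here, and the stored-estimate names are arbitrary. The linear model is nested in the categorical one, so a non-significant test suggests the simpler linear coding loses little.

                  ```stata
                  * Unconstrained model: income as 5 categories (4 indicator coefficients)
                  logit y i.hhinc i.area
                  estimates store categorical

                  * Constrained model: income as a single linear term (1 coefficient)
                  logit y c.hhinc i.area
                  estimates store linear

                  * Likelihood-ratio test of the linear restriction (3 degrees of freedom)
                  lrtest categorical linear

                  * Information criteria for the two models, as a complementary check
                  estimates stats categorical linear
                  ```

                  A significant result would point toward keeping i.hhinc; a non-significant one would support the more parsimonious linear term.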
                  -------------------------------------------
                  Richard Williams, Notre Dame Dept of Sociology
                  StataNow Version: 19.5 MP (2 processor)

                  EMAIL: [email protected]
                  WWW: https://www3.nd.edu/~rwilliam



                  • #10
                    Dear Richard,
                    Thank you for your prompt reply; I have also just read about this problem. Looking at my data, 278 of the 8400 households (about 3.3%) have an income of more than 5000 euro/year, so the share of households in this open-ended category is not that high. Would it be persuasive to argue that the problem is less serious in this case?



                    • #11
                      Well, if those 278 people are multi-millionaires it may matter a lot. So try the tests and see.

                      Also, other than loss of parsimony, probably nothing too horrible will happen if you treat it as categorical.
                      Richard Williams, Notre Dame Dept of Sociology
