
  • How to use categorical variables in linear regression?

    I have a question regarding the use of categorical variables in a linear regression.

    I have a continuous dependent variable, a categorical independent variable (Likert scale), and various control variables that are mostly categorical (i.e. they consist of groups, such as sex). When I use the following code:
    Code:
    regress dependent independent i.sex
    I get different coefficient sizes and a different R2 than when I use the following code:
    Code:
    regress dependent independent sex
    In my case, the independent variable becomes insignificant in the latter specification.

    My question is: what causes this difference? Which code should I use, and what determines the choice between the two?

  • #2
    Maarten:
    the right code is the first one.
    If you omit -i.- before a categorical variable, Stata will treat it as a continuous one:
    Code:
    . sysuse auto.dta
    . regress price rep78
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(1, 67)        =      0.00
           Model |  24770.7652         1  24770.7652   Prob > F        =    0.9574
        Residual |   576772188        67  8608540.12   R-squared       =    0.0000
    -------------+----------------------------------   Adj R-squared   =   -0.0149
           Total |   576796959        68  8482308.22   Root MSE        =      2934
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           rep78 |   19.28012   359.4221     0.05   0.957    -698.1295    736.6897
           _cons |   6080.379    1274.06     4.77   0.000     3537.345    8623.413
    ------------------------------------------------------------------------------
    
    . regress price i.rep78
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(4, 64)        =      0.24
           Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
        Residual |   568436416        64     8881819   R-squared       =    0.0145
    -------------+----------------------------------   Adj R-squared   =   -0.0471
           Total |   576796959        68  8482308.22   Root MSE        =    2980.2
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           rep78 |
              2  |   1403.125   2356.085     0.60   0.554    -3303.696    6109.946
              3  |   1864.733   2176.458     0.86   0.395    -2483.242    6212.708
              4  |       1507   2221.338     0.68   0.500    -2930.633    5944.633
              5  |     1348.5   2290.927     0.59   0.558    -3228.153    5925.153
                 |
           _cons |     4564.5   2107.347     2.17   0.034     354.5913    8774.409
    ------------------------------------------------------------------------------
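    To see where the difference comes from: with -i.rep78-, each coefficient is the difference between that category's mean price and the mean of the base category (rep78 == 1), whereas without -i.- a single slope forces every one-step increase in rep78 to have the same effect on price. What follows is only a minimal sketch on the same auto data (not a reproduction of your model) that checks the first claim:

    ```stata
    sysuse auto, clear
    regress price i.rep78

    * mean price in the base category (rep78 == 1); compare with _cons
    summarize price if rep78 == 1
    display r(mean)
    display _b[_cons]

    * mean price for rep78 == 3; compare with _cons + coefficient on 3.rep78
    summarize price if rep78 == 3
    display r(mean)
    display _b[_cons] + _b[3.rep78]
    ```

    In each pair the two displayed numbers should agree, which is why the -i.- model can fit any pattern of category means, while the model without -i.- can only fit a straight line through them.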
    However, if you have a two-level categorical variable coded 0/1, Stata should not report different results regardless of the -i.- notation:
    Code:
    . sysuse auto.dta
    (1978 Automobile Data)
    
    . regress price foreign
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =      0.17
           Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
        Residual |   633558013        72  8799416.85   R-squared       =    0.0024
    -------------+----------------------------------   Adj R-squared   =   -0.0115
           Total |   635065396        73  8699525.97   Root MSE        =    2966.4
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
           _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
    ------------------------------------------------------------------------------
    
    . regress price i.foreign
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =      0.17
           Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
        Residual |   633558013        72  8799416.85   R-squared       =    0.0024
    -------------+----------------------------------   Adj R-squared   =   -0.0115
           Total |   635065396        73  8699525.97   Root MSE        =    2966.4
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |
        Foreign  |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
           _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
    ------------------------------------------------------------------------------
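    If a two-level variable is coded 1/2 instead of 0/1, the slope should still be the same with and without -i.-, but the constant changes: without -i.- it is the prediction at a value of 0 (which does not exist in the data), while with -i.- it is the mean of the base category. A small sketch, assuming a hypothetical variable foreign12 that I create here only for illustration:

    ```stata
    sysuse auto, clear

    * foreign12 is a hypothetical 1/2 recoding of foreign (not part of auto.dta)
    generate foreign12 = foreign + 1

    * treated as continuous: store slope and constant for later comparison
    regress price foreign12
    scalar b_cont    = _b[foreign12]
    scalar cons_cont = _b[_cons]

    * treated as categorical: base category is foreign12 == 1
    regress price i.foreign12

    * slopes should match; constants should differ by exactly the slope
    display b_cont
    display _b[2.foreign12]
    display cons_cont
    display _b[_cons]
    ```

    So for a control such as sex coded 1/2, a change in the constant alone is expected; a change in the other coefficients is not.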
    Hence, I would check how your categorical variable is coded and whether it has any missing values.
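    A quick way to run that check could look like the following. I use rep78 and foreign from the auto data as stand-ins, since I do not have your dataset; replace them with your own variable names:

    ```stata
    sysuse auto, clear

    * frequency table including missing values as their own row
    tabulate rep78, missing

    * show the underlying numeric codes instead of the value labels
    tabulate foreign, nolabel

    * summary of storage type, range, number of distinct and missing values
    codebook rep78 foreign
    ```

    If the variable turns out to have more than two distinct values, that alone would explain why the results with and without -i.- differ.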
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Thanks Carlo for your answer!



      • #4
        Maarten:
        a partial correction to my previous reply: if your variable has missing values, the related observations are excluded by default via listwise deletion, with or without the -i.- notation.
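        As a sketch of how to see this on the auto data (again only a stand-in for your own data), -e(sample)- marks the observations that the last estimation command actually used:

        ```stata
        sysuse auto, clear
        regress price i.rep78

        * observations used in the regression
        count if e(sample)

        * observations dropped by listwise deletion
        count if !e(sample)

        * observations with a missing value on rep78
        count if missing(rep78)
        ```

        Here the first count should be 69, matching the Number of obs in the output above, and the other two should both be 5, since the auto data has 74 observations and price has no missing values.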
        Kind regards,
        Carlo
        (Stata 19.0)
