
  • Recoding dummy variables: 0/1 or 1/2

    I just need a bit of clarification:

    Is it imperative to code dummy variables as 0/1, or does coding them as 1/2 not matter when carrying out regressions in Stata? I have noticed that most survey data code the dummies as 1/2.

    If 0/1 is required, what is the logic behind recoding the dummy variables as 0/1?



    And what about categorical variables? Should they also start from 0 (0, 1, 2, ...), or can they remain as 1, 2, 3, ...?



    Thank you.

  • #2
    As long as you use factor-variable notation when entering the variables in your regression model, you can use any non-negative integers for binary or categorical variables. Internally, Stata will turn those variables into (a set of) 0/1 indicator (dummy) variables. The benefit of 0/1 coding is that the constant refers to the conditional mean of the reference category. If you coded your variable 1/2, the constant would refer to the conditional mean of a group that does not exist.
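    A minimal sketch on the auto data (the 1/2 recode here is hypothetical, just for illustration):

    Code:
    sysuse auto, clear
    * hypothetical 1/2 recode of the 0/1 indicator foreign
    gen foreign12 = foreign + 1
    * factor-variable notation: Stata builds the 0/1 indicator internally
    regress price i.foreign12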
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Another benefit of (0, 1) coding is that the mean of such an indicator variable has an immediate concrete interpretation as a probability. In the auto data

      Code:
      sysuse auto, clear 
      summarize foreign
      the mean is the probability (proportion, fraction) of cars that are foreign. Note incidentally the excellent convention of naming indicators for the category coded 1. Thus use names such as female (not gender).



      • #4
        Another advantage of coding a dichotomous variable as 0/1 is that in the event you also want to use the variable as the dependent variable in a logistic or probit model, Stata always interprets those variables as 0 = false, non-zero = true. So if you code it as 1/2 and try to use it that way, Stata will complain that your outcome doesn't vary.
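        A quick sketch of what that looks like in practice (foreign2 here is a hypothetical 1/2 recode):

        Code:
        sysuse auto, clear
        logit foreign mpg          // works: foreign is coded 0/1
        gen foreign2 = foreign + 1 // hypothetical 1/2 recode
        logit foreign2 mpg         // fails: every value is nonzero, so the outcome does not vary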



        • #5
          I have a question related to the question posed here.

          I have a categorical variable "cellar" with 3 categories: 1 large basement, 2 small basement, or 3 crawl space, from which I generated dummy variables using:

          Code:
          tab cellar, gen(c)
          To use regression with the dependent variable logprice (=houseprice), should I use:

          Code:
          reg logprice i.cellar
          or

          Code:
          reg logprice c2 c3
          with c1 left out as the base group.

          Which one is correct?

          Thanks in advance.



          • #6
            Mat Sko your two choices are mathematically equivalent if you want level 1 as the reference category. However, in Stata it is much more advantageous and simpler to use factor-variable notation, as in your first regression statement. The factor notation also allows one to change the reference/base category. For example, to use level 3, you would type -ib3.cellar-.
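            For instance, a sketch with the variables described above:

            Code:
            reg logprice i.cellar    // level 1 is the base by default
            reg logprice ib3.cellar  // level 3 as the base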



            • #7
              Dear all,

              I have one question related to this issue.

              When I change the reference category, how do the results change? For example, I have a categorical independent variable with 3 levels: 1, 2, 3. When I treat 1 as the base level versus 3 as the base level, how will the results (sign and magnitude of the coefficients) change?

              I guessed that the coefficients would simply flip sign. However, my results change in sign, size, and statistical significance. What am I getting wrong here?

              Thanks a lot.



              • #8
                Chi chi

                Let's have a look on a simulated dataset.

                Code:
                clear
                set seed 17760704
                set obs 1000
                gen x=runiformint(1,4)
                gen y=rnormal(x)
                replace y=y-0.9 if x==3
                Here I create an independent variable that takes values 1-4, and a dependent variable that is normal with mean x, except when x=3: I want y to have almost the same mean as when x=2, for a reason I will explain below.

                Regression with default reference (the first category of x)

                Code:
                reg y i.x
                
                      Source |       SS           df       MS      Number of obs   =     1,000
                -------------+----------------------------------   F(3, 996)       =    425.18
                       Model |  1286.91173         3  428.970575   Prob > F        =    0.0000
                    Residual |  1004.87019       996  1.00890581   R-squared       =    0.5615
                -------------+----------------------------------   Adj R-squared   =    0.5602
                       Total |  2291.78192       999  2.29407599   Root MSE        =    1.0044
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                           x |
                          2  |   1.001929   .0886142    11.31   0.000     .8280369    1.175821
                          3  |   1.026459   .0906117    11.33   0.000     .8486477    1.204271
                          4  |   3.087251   .0887027    34.80   0.000     2.913185    3.261317
                             |
                       _cons |   .9597223   .0622929    15.41   0.000     .8374819    1.081963
                ------------------------------------------------------------------------------
                And regression taking x=2 as the reference:

                Code:
                reg y ib2.x
                
                      Source |       SS           df       MS      Number of obs   =     1,000
                -------------+----------------------------------   F(3, 996)       =    425.18
                       Model |  1286.91173         3  428.970575   Prob > F        =    0.0000
                    Residual |  1004.87019       996  1.00890581   R-squared       =    0.5615
                -------------+----------------------------------   Adj R-squared   =    0.5602
                       Total |  2291.78192       999  2.29407599   Root MSE        =    1.0044
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                           x |
                          1  |  -1.001929   .0886142   -11.31   0.000    -1.175821   -.8280369
                          3  |   .0245305   .0911161     0.27   0.788     -.154271     .203332
                          4  |   2.085322   .0892179    23.37   0.000     1.910245    2.260399
                             |
                       _cons |   1.961651   .0630244    31.13   0.000     1.837975    2.085327
                ------------------------------------------------------------------------------
                Let's first focus on the coefficients. Actually, you have to think in terms of "contrasts": there is a reference category because we can't estimate all parameters. And that's because the sum of all four dummies is 1, the constant regressor, whose coefficient is already estimated by the constant coefficient. You can't have both the constant and all 4 regressors. So one of them is chosen as reference, which means its coefficient is artificially chosen to be 0. But that really means the other coefficients are estimated as differences from the reference.

                So, in the first regression, the coefficient 1.001929 on _b[2.x] (I use Stata syntax here) is really the difference of coefficients _b[2.x]-_b[1.x] (where _b[1.x] is taken to be 0). In the second regression, the coefficient _b[1.x] is really the difference _b[1.x]-_b[2.x] (where _b[2.x] is taken to be 0). Since the two regressions are really the same apart from the change of reference, the difference _b[2.x]-_b[1.x] is the same, and you end up with coefficients of opposite sign.
                Not all coefficients are opposites: for _b[3.x], you have _b[3.x]-_b[1.x] in the first regression and _b[3.x]-_b[2.x] in the second. But from the first regression you can compute _b[3.x]-_b[2.x]=1.026459-1.001929=.0245305, the same coefficient as in the second regression.

                So, basically, you can estimate differences of coefficients, not the coefficients themselves. You can estimate any linear combination a1 _b[1.x] + a2 _b[2.x] +a3 _b[3.x] +a4 _b[4.x] where a1+a2+a3+a4=0 (the linear combinations are called contrasts).

                For instance, 2-3+4-3=0 and 2 _b[1.x] - 3 _b[2.x] + 4 _b[3.x] - 3 _b[4.x] = -3(_b[2.x]-_b[1.x]) + 4(_b[3.x]-_b[1.x]) - 3(_b[4.x]-_b[1.x]), and each term can be estimated using the coefficients of the first regression. Or you can write it 2(_b[1.x]-_b[2.x])+4(_b[3.x]-_b[2.x])-3(_b[4.x]-_b[2.x]) and estimate using the second regression. Stata accepts the initial linear combination and the following result is the same with both regressions:
                Code:
                lincom 2*1.x-3*2.x+4*3.x-3*4.x
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                         (1) |  -8.161702   .3955266   -20.64   0.000    -8.937863   -7.385541
                ------------------------------------------------------------------------------
                So, what do I mean when I write that, in the first regression, _b[2.x] is really _b[2.x]-_b[1.x]? You can check with "lincom 2.x-1.x", which returns the same value as the regression output for 2.x. That's because we can only estimate contrasts.
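                In code, after the first regression:

                Code:
                lincom 2.x - 1.x   // returns 1.001929, the coefficient reported for 2.x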

                Now, the p-values. Each test applies to a difference of coefficients, so we are really testing whether two coefficients are significantly far apart. In the first regression, _b[2.x], _b[3.x], and _b[4.x] are all significantly different from _b[1.x] (i.e. the differences are significant), so all p-values are small.

                In the second regression, I wanted to show what happens when one difference is small (I arranged for _b[3.x] to be close to _b[2.x]), and indeed the p-value for _b[3.x] (which is really the p-value for _b[3.x]-_b[2.x]) is large.

                A point worth noting: if you compute contrasts with lincom, the result will not depend on the reference you have chosen (same coefficient, same p-value). If you compute a linear combination which is not a contrast, the result will depend on the reference. There is a twist here, as Stata can only ever estimate contrasts: for instance, if you want "lincom 3.x+4.x", you will really get (4.x-1.x)+(3.x-1.x) or (4.x-2.x)+(3.x-2.x) which depend on the reference (x=1 or x=2).
                Last edited by Jean-Claude Arbaut; 18 Jul 2019, 02:56.



                • #9
                  I would just add to Jean-Claude Arbaut's excellent discussion the observation that when the reference category changes, the model's predictions, that is, the values produced by -predict-, do not change. Nor is there any change in R2, nor root mean squared error. In models estimated by maximum likelihood, the log-likelihood also remains unchanged. And coefficients of the variables other than x itself do not change (unless there are interaction terms).
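                  A quick way to check that on the simulated data above:

                  Code:
                  quietly reg y i.x
                  predict yhat1
                  quietly reg y ib2.x
                  predict yhat2
                  assert abs(yhat1 - yhat2) < 1e-6   // fitted values are identical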



                  • #10
                    Jean-Claude Arbaut, Clyde Schechter: Thank you very much. Now I got it.
