Categorical Variables Treatment at Regression Analysis

Obaid Ur Rehman

Join Date: May 2019

Posts: 59
#1

04 May 2019, 04:02

Hello everyone, I hope you are all well. I need help with the following specific case in Stata.

First, some background on my question:
Dependent Variable = Y
Independent Variables = X1, X2, X3, X4

X1 = contains categorical values (0, 1)
X2 = contains categorical values (1, 2, 3, 4, 5)
X3 = contains categorical values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
X4 = contains categorical values from 30 to 80

I labelled all my categorical variables as below:

X1:
0 = F
1 = M
X2:
1 = SS
2 = JC
3 = UG
4 = PG
5 = PD
X3:
Below 1 = B1
between 1 and 5 = U15
between 5 and 10 = U510
Above 10 = A10
X4:
Below 30 = B30
Between 30 and 40 = B3040
Between 40 and 50 = B4050
Between 50 and 60 = B5060
Above 60 = A60

Running the regression, I get the results below:

Y                     Coef.     Std. Err.   z
X1:
  0                   0.3534    0.332       0.234
  1                   0.4543    0.245       0.132
X2:
  1                   0.4534    0.245       0.543
  2                   0.6457    0.652       0.767
  3                   0.2465    0.124       0.232
  4                   0.6634    0.543       0.657
  5                   0.7532    0.123       0.232
X3:
  Under 1             0.9506    0.545       0.564
  Between 1 and 5     0.8474    0.231       0.234
  Between 5 and 10    0.6767    0.235       0.576
  Above 10            0.4575    0.232       0.898
X4:
  Below 30            0.8678    0.646       0.876
  Between 30 and 40   0.2346    0.576       0.657
  Between 40 and 50   0.6544    0.786       0.567
  Between 50 and 60   0.4322    0.574       0.256
  60 Plus             0.3235    0.345       0.245

The label command replaces the original values as below (taking X2 as an example):
label define X1o 1 SS 2 JC 3 UG 4 PG 5 PD

MY PROBLEM:
With the label command, each variable's original values are replaced by 1, 2, 3, 4 or 5. That is, once the independent variables are labelled, the dependent variable (Y) is regressed not on the original values of X1, X2, X3 and X4 but on the replaced values, which gives results that contradict theory.

MY QUESTION:

Is it possible to regress Y on categories of the original values of X1, X2, X3 and X4?

Thanks for stopping by.
Obaid Ur Rehman

Join Date: May 2019

Posts: 59
#2

04 May 2019, 04:26

All categorical values are repeated across 20,000 observations
Clyde Schechter

Join Date: Apr 2014

Posts: 30173
#3

04 May 2019, 10:00

I am unable to understand your question.

by label command, each variable replace it's original value with 1, 2, 3, 4 or 5.

This is not true. The -label- command changes the way Stata displays the values of a variable, but it does not change the actual values stored internally. In particular, it does not change the results of a regression using that variable.
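The point can be checked directly. The following is a minimal sketch, assuming Stata's bundled auto dataset (not the poster's data) and a hypothetical label name rep78lbl: the same regression is run before and after attaching value labels, and the coefficient vectors are compared.

```stata
* Illustration on the shipped auto data; rep78lbl is a made-up label name
sysuse auto, clear

* Regression on the unlabelled variable
regress price i.rep78
matrix b_before = e(b)

* Attach value labels: this changes the display only, not the stored values
label define rep78lbl 1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent"
label values rep78 rep78lbl

* Same regression on the now-labelled variable
regress price i.rep78
matrix b_after = e(b)

* The coefficient vectors are identical; only the row names in the table differ
assert mreldif(b_before, b_after) == 0
```

Running -summarize rep78- before and after the labelling should likewise report the same values 1 to 5.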

Rather than trying to describe your problem, please post back showing:

1. An example of your data set, using the -dataex- command.
2. The exact code that you ran that is giving you problems. To make it easily readable, use code delimiters.
3. The exact output that you got from Stata. Again, for readability, use code delimiters.

Then state what parts of the output are different from what you were expecting or hoping for and state what you wanted to see there in its place.

If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

For help with code delimiters, read Forum FAQ #12.

If you prefer a more visual approach, you can learn how to use both -dataex- and code delimiters from David Benson's video at https://youtu.be/bXfaRCAOPbI.

Obaid Ur Rehman

Join Date: May 2019
Posts: 59

04 May 2019, 12:37

Thanks Professor,

I have a continuous dependent variable (q) and an independent variable age, whose values range from 26 to 77.

I use the following code to generate categories from the age variable:

gen age1 = age
recode age1 (26/29=1) (30/39=2) (40/49=3) (50/59=4) (60/77=5)
label define age1o 1 "UT" 2 "TTF" 3 "FTF" 4 "FTS" 5 "SP"
label values age1 age1o

dataex age age1, count(50)

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte age float age1
43 3
42 3
42 3
43 3
50 4
44 3
46 3
49 3
50 4
51 4
52 4
32 2
34 2
44 3
45 3
43 3
47 3
42 3
48 3
55 4
44 3
47 3
52 4
49 3
46 3
45 3
56 4
54 4
55 4
54 4
53 4
55 4
45 3
48 3
49 3
45 3
46 3
47 3
48 3
48 3
35 2
36 2
50 4
49 3
43 3
51 4
45 3
46 3
47 3
48 3
end
label values age1 age1o
label def age1o 2 "TTF", modify
label def age1o 3 "FTF", modify
label def age1o 4 "FTS", modify

------------------ copy up to and including the previous line ------------------

Listed 50 out of 6606 observations

When I run the regression, the results are as below:

Code:

. regress q age age1

      Source |       SS           df       MS      Number of obs   =     6,606
-------------+----------------------------------   F(2, 6603)      =     12.51
       Model |  109.133621         2  54.5668105   Prob > F        =    0.0000
    Residual |  28804.9507     6,603  4.36240356   R-squared       =    0.0038
-------------+----------------------------------   Adj R-squared   =    0.0035
       Total |  28914.0843     6,605   4.3776055   Root MSE        =    2.0886

------------------------------------------------------------------------------
           q |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .026983   .0092601     2.91   0.004     .0088303    .0451357
        age1 |  -.1030977   .0882462    -1.17   0.243    -.2760887    .0698933
       _cons |   1.977676   .2102442     9.41   0.000     1.565529    2.389822
------------------------------------------------------------------------------

Note that the coefficient for age is 0.0270 and the coefficient for age1 is -0.1031; that is, the coefficient for the original values (age) differs from the one for the labelled values (age1).

If possible, I want regression output from age variable like below:

Code:

 regress q b1.age1

      Source |       SS           df       MS      Number of obs   =     6,606
-------------+----------------------------------   F(4, 6601)      =      6.06
       Model |  105.862299         4  26.4655747   Prob > F        =    0.0001
    Residual |   28808.222     6,601  4.36422088   R-squared       =    0.0037
-------------+----------------------------------   Adj R-squared   =    0.0031
       Total |  28914.0843     6,605   4.3776055   Root MSE        =    2.0891

------------------------------------------------------------------------------
           q |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        age1 |
        TTF  |  -1.326579   .5873291    -2.26   0.024    -2.477934   -.1752241
        FTF  |  -1.147608   .5807293    -1.98   0.048    -2.286025   -.0091907
        FTS  |  -1.052541   .5808275    -1.81   0.070     -2.19115    .0860689
         SP  |  -.8077115   .5852684    -1.38   0.168    -1.955027    .3396038
             |
       _cons |   4.043846   .5794043     6.98   0.000     2.908026    5.179666
------------------------------------------------------------------------------

That is, I want the values of the age variable to be labelled by specific ranges (to get regression output like the one immediately above), without being assigned other values as in label define age1o 1 "UT" 2 "TTF" 3 "FTF" 4 "FTS" 5 "SP"

I really appreciate your response, Professor. This is my first post here and I am very satisfied with your response. Thanks a lot.


Obaid Ur Rehman

Join Date: May 2019
Posts: 59

04 May 2019, 12:46

I also tried the option below:

Code:

 range age11 26 29
range age12 30 39
range age13 40 49
range age14 50 59
range age15 60 77

As a result, it gave me the regression output below:

Code:

. regress q age11 age12 age13 age14 age15
note: age11 omitted because of collinearity
note: age12 omitted because of collinearity
note: age13 omitted because of collinearity
note: age14 omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =     6,606
-------------+----------------------------------   F(1, 6604)      =      1.05
       Model |  4.60014495         1  4.60014495   Prob > F        =    0.3054
    Residual |  28909.4842     6,604   4.3775718   R-squared       =    0.0002
-------------+----------------------------------   Adj R-squared   =    0.0000
       Total |  28914.0843     6,605   4.3776055   Root MSE        =    2.0923

------------------------------------------------------------------------------
           q |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       age11 |          0  (omitted)
       age12 |          0  (omitted)
       age13 |          0  (omitted)
       age14 |          0  (omitted)
       age15 |   .0053764   .0052447     1.03   0.305     -.004905    .0156578
       _cons |   2.588339   .3601852     7.19   0.000     1.882259    3.294418
------------------------------------------------------------------------------


Clyde Schechter

Join Date: Apr 2014

Posts: 30173
#6

04 May 2019, 13:07

Well, Stata just uses whatever label you have given it. You told Stata you wanted this variable labeled with UT, TTF, etc., so that's what Stata did. If you want Stata to do it differently, you have to give the variable a different label.

Code:

label define age1o2 1 "26-29" 2 "30-39" 3 "40-49" 4 "50-59" 5 "60-77", modify
label values age1 age1o2
regress q i.age1
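As a related sketch, -recode- can attach the range labels at the moment the groups are created, which keeps the cut points and their labels in one place. The data below are simulated stand-ins (the names q and age follow the thread, but the values are invented), and agegrp is a hypothetical variable name.

```stata
* Simulated stand-in data; only the variable names follow the thread
clear
set seed 12345
set obs 500
generate age = runiformint(26, 77)
generate q = rnormal(2, 2)

* Create the grouped variable and its value labels in one step
recode age (26/29 = 1 "26-29") (30/39 = 2 "30-39") (40/49 = 3 "40-49") ///
    (50/59 = 4 "50-59") (60/77 = 5 "60-77"), generate(agegrp)

* The output rows now show the age ranges instead of UT, TTF, ...
regress q i.agegrp
```

The original age variable is left untouched, so both the continuous and the grouped versions remain available.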
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#7

04 May 2019, 13:27

I think there is a misunderstanding here. The variable age is continuous in regress q age, while the recoded variable age1 is supposed to be a factor variable, hence regress q i.age1. But remember that a factor variable really means there are as many 0/1 variables (dummies) as there are levels, minus one(*). Stata creates these automatically when you write "i.". If you don't tell Stata it's a factor variable (that is, if you remove "i."), it is treated as a continuous variable, hence there is a single coefficient.

But, if you have a continuous variable in the first place, the only way to deal with it as factors is to recode, as you did above. If you call regress q i.age, there will be one coefficient for each value of age (not one for each value of a label). You can't apply a label to get age ranges, as would be done with SAS. What counts is the variable values, not the label. If you want such a behavior, you have to recode first (and assign a label if you wish, but it's only for readability, it won't change the result).

All in all, whatever the labels may be:

Code:

regress q age       one coefficient for age
regress q i.age     one coefficient for each value of age minus one (around 50 coefficients here)
regress q age1      one coefficient for age1
regress q i.age1    one coefficient for each value of age1 minus one (4 coefficients with the recoding above)

Note that if there is a label, it will be automatically shown in the output for coefficients of factor variables (with "i."). You can remove the label and show the value instead with the option nofvlabel: reg y i.k, nofvlabel. But it only changes the printing, not the computed results.
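The four cases can be counted on simulated data. This is only a sketch: q and age are invented values that mimic the ranges in the thread, and the scalars k_age, k_iage, k_age1, k_iage1 and nlev_age are hypothetical names used to hold the model degrees of freedom.

```stata
* Simulated stand-in data mimicking the thread (age from 26 to 77)
clear
set seed 2019
set obs 2000
generate age = runiformint(26, 77)
generate q = 2 + 0.03*age + rnormal(0, 2)

* Recode into five groups and label them as in the thread
recode age (26/29 = 1) (30/39 = 2) (40/49 = 3) (50/59 = 4) (60/77 = 5), generate(age1)
label define age1o 1 "UT" 2 "TTF" 3 "FTF" 4 "FTS" 5 "SP"
label values age1 age1o

* Count the distinct ages actually present in the sample
quietly levelsof age
scalar nlev_age = r(r)

regress q age            // continuous: one coefficient
scalar k_age = e(df_m)
regress q i.age          // factor: one coefficient per distinct age, minus one
scalar k_iage = e(df_m)
regress q age1           // group codes treated as continuous: one coefficient
scalar k_age1 = e(df_m)
regress q i.age1         // factor: four coefficients, rows shown with labels
scalar k_iage1 = e(df_m)

* Same model, but rows are shown as 2, 3, 4, 5 instead of the labels
regress q i.age1, nofvlabel
```

The last two regressions should report identical coefficients; only the row names differ.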

(*) minus one because there is also a constant regressor, and if there are as many 0/1 variables as there are levels, the regressors are collinear, since the sum of the dummies always equals 1.
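The collinearity in the footnote can be seen by building the dummies by hand. A minimal sketch, again assuming the bundled auto dataset rather than the poster's data; d1 to d5 and dsum are names created here for illustration.

```stata
sysuse auto, clear
drop if missing(rep78)

* One 0/1 dummy per level of rep78: d1, d2, ..., d5
tabulate rep78, generate(d)

* Every observation falls in exactly one level, so the dummies sum to 1,
* which is exactly the constant regressor
egen dsum = rowtotal(d1-d5)
assert dsum == 1

* With a constant in the model, Stata has to omit one dummy for collinearity
regress price d1-d5
```

Four coefficients remain after the omission, the same count that i.rep78 produces directly.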

Hope this helps

Jean-Claude Arbaut

Last edited by Jean-Claude Arbaut; 04 May 2019, 13:42.
