Displaying categorical variable labels in a regression

Meshal Alkhowaiter

Join Date: Nov 2020

Posts: 16
#1

07 Apr 2021, 11:37

I have two variables structured as shown below: one is the geographic region, and the other is the average unemployment rate for a given demographic group and time period in California.

I am interested in using the average unemployment rate per region as an explanatory variable in a regression model with SAT scores as my outcome variable.

```
dataex average_unemp region

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 average_unemp str19 region
"11.6" "Southern California"
"11.6" "Southern California"
"11.6" "Southern California"
"11.4" "Sacramento"
"11.6" "Southern California"
"11.6" "Southern California"
"11.6" "Southern California"
"11.4" "Sacramento"
"11.6" "Southern California"
"11.6" "Southern California"
"22.6" "San Diego"
"11.6" "Southern California"
"11.6" "Southern California"
end
```

I have created a dummy variable for both, as shown below:

```
encode average_unemp, gen(average_unemp_dummy)
```

However, when I ran my regression model, the average unemployment rate was displayed in the results, but I am actually interested in displaying the region's name with the results to know which region I am looking at.

```
regress SAT_score i.gender_dummy i.average_unemp_dummy

      Source |       SS           df       MS      Number of obs   =     5,480
-------------+----------------------------------   F(13, 5466)     =    195.10
       Model |   106720.18        13  8209.24464   Prob > F        =    0.0000
    Residual |   229988.42     5,466  42.0761837   R-squared       =    0.3170
-------------+----------------------------------   Adj R-squared   =    0.3153
       Total |    336708.6     5,479  61.4543895   Root MSE        =    6.4866

-------------------------------------------------------------------------------------
          SAT_score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+----------------------------------------------------------------
       gender_dummy |
              male  |  -6.637803   .1904906   -34.85   0.000    -7.011241   -6.264366
                    |
average_unemp_dummy |
              10.5  |   7.675088   .5530074    13.88   0.000     6.590973    8.759202
              11.4  |    6.21276   .5102991    12.17   0.000      5.21237    7.213149
              22.6  |  -5.595991   .6094689    -9.18   0.000    -6.790792   -4.401189
```
Tags: categorical, dummy-variable, panel data, regression

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17711
#2

07 Apr 2021, 11:49

Meshal:
please note that your data excerpt is not fully in line with the code you ran.
That said, what follows might be what you're after:

Code:

. label define average_unemp_dummy 2 "Southern California" 1 "Sacramento" 3 "San Diego", modify

. g income= runiform()*1000000

. regress income i.average_unemp_dummy

      Source |       SS           df       MS      Number of obs   =        13
-------------+----------------------------------   F(2, 10)        =      1.23
       Model |  2.2177e+11         2  1.1089e+11   Prob > F        =    0.3326
    Residual |  9.0054e+11        10  9.0054e+10   R-squared       =    0.1976
-------------+----------------------------------   Adj R-squared   =    0.0371
       Total |  1.1223e+12        12  9.3526e+10   Root MSE        =    3.0e+05

--------------------------------------------------------------------------------------
              income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
 average_unemp_dummy |
Southern California  |   319199.2   232448.3     1.37   0.200    -198727.9    837126.3
          San Diego  |   28747.07     367533     0.08   0.939    -790167.6    847661.7
                     |
               _cons |   175962.4   212195.3     0.83   0.426    -296838.2      648763
--------------------------------------------------------------------------------------

.

Finally, as per the FAQ, please use CODE delimiters to share what you typed and what Stata gave you back. Thanks.

Kind regards,
Carlo
(Stata 19.0)

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#3

07 Apr 2021, 11:59

Well, before you worry about getting Stata to show you the region name in the regression results, let's get the data management and regression itself straight.

The use of -encode- here is completely wrong and produces a garbage variable. Whenever you see a string variable that reads like numbers to human eyes (or reads like a date to human eyes) you must not use -encode- to make it numerical. When it reads like a number to human eyes, you need to use the -destring- command for that.
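
To see why, here is a minimal sketch on four made-up rows that reuse the unemployment values appearing in this thread. The names unemp_encoded and unemp_numeric are hypothetical and chosen only for illustration. -encode- numbers the distinct strings 1, 2, 3, ... in alphabetical order and attaches the original text as a value label, whereas -destring- recovers the numbers the text represents.

```stata
* Illustrative rows only; not the poster's dataset
clear
input str4 average_unemp
"11.6"
"11.4"
"22.6"
"10.5"
end

* encode: integer codes in alphabetical order of the strings, text kept as a label
encode average_unemp, generate(unemp_encoded)

* destring: the text is converted to the number it spells out
destring average_unemp, generate(unemp_numeric)

* nolabel shows the stored codes rather than the value labels
list average_unemp unemp_encoded unemp_numeric, nolabel
```

With -nolabel-, the listing shows the integer codes stored in unemp_encoded next to the real rates in unemp_numeric, so any arithmetic on the encoded variable would operate on the codes rather than on the rates.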

Next, it doesn't really make sense to use a set of indicator ("dummy") variables for values of unemployment as a predictor of SAT_score in the model. That way of doing it implies that the relationship between unemployment and SAT score is completely arbitrary: regional unemployment is associated neither with an increasing, nor decreasing, nor U-shaped (nor upside-down U-shaped), nor any other mathematical relationship with SAT score. Rather, any value of regional unemployment could be associated with any value of SAT score. If that is really the case, then there is no point in using the regional unemployment rate in this model: use the region itself instead, as the numerical value of the unemployment rate has nothing to do with it.

So you need to make a decision: do you believe there is some actual mathematically characterizable relationship between regional unemployment rate and SAT scores? If so, then the code should look like this:

Code:

destring average_unemp, replace
regress SAT_score i.gender_dummy c.average_unemp

(You might choose to refine that model by including quadratic or higher order terms in average_unemp, or log transformation, or something like that--graphical exploration would help you decide that.) The output will not list each individual level of regional unemployment because it is, instead, estimating a (linear) formula relating unemployment to SAT_score. There is no labeling issue to deal with.
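
As one illustration of such a refinement, the sketch below adds a quadratic term with factor-variable notation and overlays a lowess smoother on a scatterplot. The data are simulated stand-ins: the coefficients, sample size, and seed are arbitrary assumptions, not the poster's data, and only the variable names follow the thread.

```stata
* Simulated stand-in data (assumption: none of these numbers come from the thread)
clear
set seed 12345
set obs 200
generate byte gender_dummy = runiform() < 0.5
generate average_unemp = 5 + 20*runiform()
generate SAT_score = 60 - 0.5*average_unemp - 5*gender_dummy + rnormal(0, 6)

* c.x##c.x enters both the linear and the squared term of average_unemp
regress SAT_score i.gender_dummy c.average_unemp##c.average_unemp

* Graphical exploration: raw points plus a lowess smoother to suggest the shape
twoway (scatter SAT_score average_unemp) (lowess SAT_score average_unemp)
```

Keep in mind that with only a handful of regions there are only a handful of distinct unemployment values, which limits how much curvature the data can actually support.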

If, on the other hand, you think that the unemployment rate is just a red herring and you simply want to adjust for regional differences, then the approach is different:

Code:

encode region, gen(n_region)
regress SAT_score i.gender_dummy i.n_region

This will give you output with a coefficient for each region (except one omitted region serving as the reference category), and it will be labeled accordingly. Bear in mind, again, that this model has nothing at all to do with unemployment rates, except to the extent that the regions themselves have different unemployment rates. But it is the region's effect that is modeled, not that of its unemployment rate.
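
A small sketch of what -encode- does with the region names, using three made-up rows: the codes are assigned in alphabetical order, and -label list- shows the mapping. The variable name n_region follows the code above; the rows are purely illustrative.

```stata
* Illustrative rows only; not the poster's dataset
clear
input str19 region
"Southern California"
"Sacramento"
"San Diego"
end

* encode creates n_region plus a value label of the same name
encode region, generate(n_region)

* Display which integer code goes with which region name
label list n_region
```

By default the lowest code is the omitted reference category. Factor-variable notation such as ib3.n_region in the -regress- command would make the region coded 3 the reference instead.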

Added: Crossed with #2.
Meshal Alkhowaiter

Join Date: Nov 2020

Posts: 16
#4

07 Apr 2021, 12:32

Thanks for your thorough response.