  • Displaying categorical variable labels in a regression

    I have two variables structured as shown below: one gives the geographic region, and the other gives the average unemployment rate for a given demographic group and time period in California.

    I am interested in using the average unemployment rate per region as an explanatory variable in a regression model with SAT scores as my outcome (dependent) variable.

    ```
    dataex average_unemp region

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str4 average_unemp str19 region
    "11.6" "Southern California"
    "11.6" "Southern California"
    "11.6" "Southern California"
    "11.4" "Sacramento"
    "11.6" "Southern California"
    "11.6" "Southern California"
    "11.6" "Southern California"
    "11.4" "Sacramento"
    "11.6" "Southern California"
    "11.6" "Southern California"
    "22.6" "San Dego"
    "11.6" "Southern California"
    "11.6" "Southern California"
    end
    ```

    I have created dummy variables for both, as shown below:

    ```
    encode average_unemp, gen(average_unemp_dummy)
    ```

    However, when I ran my regression model, the results displayed the average unemployment rate values. What I actually want is to display each region's name in the results, so that I know which region each coefficient refers to.

    ```
    regress SAT_score i.gender_dummy i.average_unemp_dummy

          Source |       SS           df       MS      Number of obs   =     5,480
    -------------+----------------------------------   F(13, 5466)     =    195.10
           Model |   106720.18        13  8209.24464   Prob > F        =    0.0000
        Residual |   229988.42     5,466  42.0761837   R-squared       =    0.3170
    -------------+----------------------------------   Adj R-squared   =    0.3153
           Total |    336708.6     5,479  61.4543895   Root MSE        =    6.4866

    -------------------------------------------------------------------------------------
              SAT_score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------------+----------------------------------------------------------------
           gender_dummy |
                  male  |  -6.637803   .1904906   -34.85   0.000    -7.011241   -6.264366
                        |
    average_unemp_dummy |
                  10.5  |   7.675088   .5530074    13.88   0.000     6.590973    8.759202
                  11.4  |    6.21276   .5102991    12.17   0.000      5.21237    7.213149
                  22.6  |  -5.595991   .6094689    -9.18   0.000    -6.790792   -4.401189
    ```

  • #2
    Meshal:
    please note that your data excerpt is not fully in line with the code you ran.
    That said, what follows might be what you're after:
    Code:
    . label define average_unemp_dummy 2 "Southern California" 1 "Sacramento" 3 "San Diego", modify
    
    . g income= runiform()*1000000
    
    . regress income i.average_unemp_dummy
    
          Source |       SS           df       MS      Number of obs   =        13
    -------------+----------------------------------   F(2, 10)        =      1.23
           Model |  2.2177e+11         2  1.1089e+11   Prob > F        =    0.3326
        Residual |  9.0054e+11        10  9.0054e+10   R-squared       =    0.1976
    -------------+----------------------------------   Adj R-squared   =    0.0371
           Total |  1.1223e+12        12  9.3526e+10   Root MSE        =    3.0e+05
    
    --------------------------------------------------------------------------------------
                  income |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------------+----------------------------------------------------------------
     average_unemp_dummy |
    Southern California  |   319199.2   232448.3     1.37   0.200    -198727.9    837126.3
              San Diego  |   28747.07     367533     0.08   0.939    -790167.6    847661.7
                         |
                   _cons |   175962.4   212195.3     0.83   0.426    -296838.2      648763
    --------------------------------------------------------------------------------------
    
    .
    Finally, as per the FAQ, please use CODE delimiters to share what you typed and what Stata gave you back. Thanks.
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Well, before you worry about getting Stata to show you the region name in the regression results, let's get the data management and regression itself straight.

      The use of -encode- here is completely wrong and produces a garbage variable. -encode- assigns arbitrary integer codes (1, 2, 3, ...) to the distinct strings in alphabetical order, so the resulting values have nothing to do with the numbers the strings spell out. Whenever you see a string variable that reads like a number to human eyes, you must not use -encode- to make it numeric; use the -destring- command instead. (The same warning applies to strings that read like dates.)
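
      To see the difference concretely, here is a minimal sketch on toy data. The value "9.8" and the variable names unemp_encoded and unemp_numeric are made up for illustration and do not come from the original dataset.

      ```stata
      * Toy data: unemployment rates stored as strings (values are illustrative)
      clear
      input str4 average_unemp
      "11.6"
      "11.4"
      "22.6"
      "9.8"
      end

      * encode numbers the distinct strings 1, 2, 3, ... in alphabetical order
      encode average_unemp, generate(unemp_encoded)

      * destring converts each string to the number it spells
      destring average_unemp, generate(unemp_numeric)

      * nolabel shows the underlying integer codes rather than the value labels
      list average_unemp unemp_encoded unemp_numeric, nolabel
      ```

      In this listing the lowest rate, "9.8", should end up with the highest encoded value, because it sorts last as a string. That is why the encoded variable cannot stand in for the unemployment rate itself.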

      Next, it doesn't really make sense to use a set of indicator ("dummy") variables for values of unemployment as a predictor of SAT_score in the model. That way of doing it implies that the relationship between unemployment and SAT score is completely arbitrary: regional unemployment is associated neither with an increasing, nor decreasing, nor U (nor upside-down U) shaped, nor any other mathematical relationship with SAT score. Rather any value of regional unemployment could be associated with any value of SAT score. If that is really the case, then there is no point in using the regional unemployment rate in this model: use the region itself instead as the numerical value of the unemployment rate has nothing to do with it.

      So you need to make a decision: do you believe there is some actual mathematically characterizable relationship between regional unemployment rate and SAT scores? If so, then the code should look like this:

      Code:
      destring average_unemp, replace
      regress SAT_score i.gender_dummy c.average_unemp
      (You might choose to refine that model by including quadratic or higher order terms in average_unemp, or log transformation, or something like that--graphical exploration would help you decide that.) The output will not list each individual level of regional unemployment because it is, instead, estimating a (linear) formula relating unemployment to SAT_score. There is no labeling issue to deal with.
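
      As a rough sketch of that exploration and refinement, something along the following lines might be a starting point. It assumes average_unemp has already been converted with -destring- and that SAT_score and gender_dummy exist as in #1. The grid of values passed to -margins- is purely illustrative, and with only a handful of distinct regional rates a quadratic term may not be well identified.

      ```stata
      * Assumes: average_unemp is numeric (after destring),
      * and SAT_score and gender_dummy exist as in the original post

      * Graphical look at the shape of the relationship
      lowess SAT_score average_unemp

      * Quadratic specification using factor-variable notation
      regress SAT_score i.gender_dummy c.average_unemp##c.average_unemp

      * Predicted SAT_score over a grid of unemployment rates (grid is illustrative)
      margins, at(average_unemp = (10(2)22))
      marginsplot
      ```

      Writing the squared term as c.average_unemp##c.average_unemp, rather than generating a separate squared variable, lets -margins- account for both terms when it varies average_unemp.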

      If, on the other hand, you think that the unemployment rate is just a red herring and you simply want to adjust for regional differences, then the approach is different:
      Code:
      encode region, gen(n_region)
      regress SAT_score i.gender_dummy i.n_region
      This will give you output with a coefficient for each region (except one omitted region serving as the reference category), and it will be labeled accordingly. Bear in mind, again, that this model has nothing at all to do with unemployment rates, except to the extent that the regions themselves have different unemployment rates. But it is the region's effect that is modeled, not that of its unemployment rate.
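
      If you want control over which region serves as the reference category, a sketch like the following may help. It assumes n_region was created by the -encode- command above; the base level 2 is only an example, so check the actual codes with -label list- first.

      ```stata
      * Assumes: n_region was created with  encode region, generate(n_region)

      * Show which integer code was assigned to which region name
      label list n_region

      * Choose the reference region explicitly (code 2 is only an example)
      regress SAT_score i.gender_dummy ib2.n_region

      * Or use the most frequent region as the base and show it in the table
      regress SAT_score i.gender_dummy ib(freq).n_region, baselevels

      * Joint test that all region coefficients are zero
      testparm i.n_region
      ```

      The coefficients are always differences from the reference region, so changing the base changes the individual estimates but not the overall fit of the model.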

      Added: Crossed with #2.


      • #4
        Thanks for your thorough response.
