  • Categorical Variables Treatment at Regression Analysis

    Hello everyone, I hope you are all feeling well. I need help with the following specific case in Stata:

    First, some background for my question:
    Dependent Variable = Y
    Independent Variables = X1, X2, X3, X4

    X1 = contains categorical values (0, 1)
    X2 = contains categorical values (1, 2, 3, 4, 5)
    X3 = contains categorical values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
    X4 = contains categorical values from 30 to 80

    I labelled all my categorical variables as follows:

    X1:
    0 = F
    1 = M
    X2:
    1 = SS
    2 = JC
    3 = UG
    4 = PG
    5 = PD
    X3:
    Below 1 = B1
    between 1 and 5 = U15
    between 5 and 10 = U510
    Above 10 = A10
    X4:
    Below 30 = B30
    Between 30 and 40 = B3040
    Between 40 and 50 = B4050
    Between 50 and 60 = B5060
    Above 60 = A60

    Running the regression, I get the results below:

    Y                      Coef.     Std. Err.   z
    X1:
      0                    0.3534    0.332       0.234
      1                    0.4543    0.245       0.132
    X2:
      1                    0.4534    0.245       0.543
      2                    0.6457    0.652       0.767
      3                    0.2465    0.124       0.232
      4                    0.6634    0.543       0.657
      5                    0.7532    0.123       0.232
    X3:
      Under 1              0.9506    0.545       0.564
      Between 1 and 5      0.8474    0.231       0.234
      Between 5 and 10     0.6767    0.235       0.576
      Above 10             0.4575    0.232       0.898
    X4:
      Below 30             0.8678    0.646       0.876
      Between 30 and 40    0.2346    0.576       0.657
      Between 40 and 50    0.6544    0.786       0.567
      Between 50 and 60    0.4322    0.574       0.256
      60 Plus              0.3235    0.345       0.245


    The label command replaces the original values as below (taking X2 as an example):
    label define X1o 1 SS 2 JC 3 UG 4 PG 5 PD

    MY PROBLEM:
    With the label command, each variable's original values are replaced by 1, 2, 3, 4 or 5. That is, after labelling the independent variables, the dependent variable (Y) is regressed not on the original values of X1, X2, X3 and X4 but on the replaced values, which gives results that contradict theory.

    "NOW MY QUESTION IS"

    Is it possible to regress Y on the categories of X1, X2, X3 and X4 using their original values?

    Thanks for stopping by.





  • #2
    All categorical values are repeated across 20,000 observations



    • #3
      I am unable to understand your question.

      by label command, each variable replace it's original value with 1, 2, 3, 4 or 5.
      This is not true. The -label- command changes the way Stata displays the values of the variable, but it does not change the actual values it stores internally. In particular, it does not change what the results of a regression using that variable would be.
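
      A minimal sketch of this point, assuming Stata's shipped auto dataset rather than the poster's data, and a hypothetical label name rep78lbl. The same regression is run before and after attaching a value label, and the two coefficient vectors are compared:

      ```stata
      * Illustration only: auto.dta and the label name rep78lbl are assumptions,
      * not part of the original question.
      sysuse auto, clear

      * Regression before any value label is attached to rep78
      regress price i.rep78
      matrix b_before = e(b)

      * Attach a value label; this changes only how the values are displayed
      label define rep78lbl 1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent"
      label values rep78 rep78lbl

      * Same regression after labelling
      regress price i.rep78
      matrix b_after = e(b)

      * The two coefficient vectors are identical
      assert mreldif(b_before, b_after) == 0

      * The stored values are still the numbers 1 to 5
      tabulate rep78, nolabel
      ```

      If the -assert- passes silently, the label made no difference to the estimates; only the row names in the coefficient table change.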

      Rather than trying to describe your problem, please post back showing:

      1. An example of your data set, using the -dataex- command.
      2. The exact code that you ran that is giving you problems. To make it easily readable, use code delimiters.
      3. The exact output that you got from Stata. Again, for readability, use code delimiters.

      Then state what parts of the output are different from what you were expecting or hoping for and state what you wanted to see there in its place.

      If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      For help with code delimiters, read Forum FAQ #12.

      If you prefer a more visual approach, you can learn how to use both -dataex- and code delimiters from David Benson's video at https://youtu.be/bXfaRCAOPbI.



      • #4
        Thanks Professor,

        I have a continuous dependent variable (q) and an independent variable age, whose values range from 26 to 77.

        I use the following code to generate categories from the age variable:

        gen age1 = age
        recode age1 (26/29=1) (30/39=2) (40/49=3) (50/59=4) (60/77=5)
        label define age1o 1 "UT" 2 "TTF" 3 "FTF" 4 "FTS" 5 "SP"
        label values age1 age1o

        dataex age age1, count(50)

        ----------------------- copy starting from the next line -----------------------
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte age float age1
        43 3
        42 3
        42 3
        43 3
        50 4
        44 3
        46 3
        49 3
        50 4
        51 4
        52 4
        32 2
        34 2
        44 3
        45 3
        43 3
        47 3
        42 3
        48 3
        55 4
        44 3
        47 3
        52 4
        49 3
        46 3
        45 3
        56 4
        54 4
        55 4
        54 4
        53 4
        55 4
        45 3
        48 3
        49 3
        45 3
        46 3
        47 3
        48 3
        48 3
        35 2
        36 2
        50 4
        49 3
        43 3
        51 4
        45 3
        46 3
        47 3
        48 3
        end
        label values age1 age1o
        label def age1o 2 "TTF", modify
        label def age1o 3 "FTF", modify
        label def age1o 4 "FTS", modify
        ------------------ copy up to and including the previous line ------------------

        Listed 50 out of 6606 observations

        When I run the regression, the results are as below:

        Code:
        . regress q age age1
        
              Source |       SS           df       MS      Number of obs   =     6,606
        -------------+----------------------------------   F(2, 6603)      =     12.51
               Model |  109.133621         2  54.5668105   Prob > F        =    0.0000
            Residual |  28804.9507     6,603  4.36240356   R-squared       =    0.0038
        -------------+----------------------------------   Adj R-squared   =    0.0035
               Total |  28914.0843     6,605   4.3776055   Root MSE        =    2.0886
        
        ------------------------------------------------------------------------------
                   q |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 age |    .026983   .0092601     2.91   0.004     .0088303    .0451357
                age1 |  -.1030977   .0882462    -1.17   0.243    -.2760887    .0698933
               _cons |   1.977676   .2102442     9.41   0.000     1.565529    2.389822
        ------------------------------------------------------------------------------

        Note the coefficients: age = 0.0270 and age1 = -0.1031. That is, the coefficient for the original values (age) differs from the coefficient for the labelled values (age1).

        If possible, I would like regression output for the age variable like the following:

        Code:
        . regress q b1.age1
        
              Source |       SS           df       MS      Number of obs   =     6,606
        -------------+----------------------------------   F(4, 6601)      =      6.06
               Model |  105.862299         4  26.4655747   Prob > F        =    0.0001
            Residual |   28808.222     6,601  4.36422088   R-squared       =    0.0037
        -------------+----------------------------------   Adj R-squared   =    0.0031
               Total |  28914.0843     6,605   4.3776055   Root MSE        =    2.0891
        
        ------------------------------------------------------------------------------
                   q |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                age1 |
                TTF  |  -1.326579   .5873291    -2.26   0.024    -2.477934   -.1752241
                FTF  |  -1.147608   .5807293    -1.98   0.048    -2.286025   -.0091907
                FTS  |  -1.052541   .5808275    -1.81   0.070     -2.19115    .0860689
                 SP  |  -.8077115   .5852684    -1.38   0.168    -1.955027    .3396038
                     |
               _cons |   4.043846   .5794043     6.98   0.000     2.908026    5.179666
        ------------------------------------------------------------------------------
        That is, I want the values of the age variable to be labelled by range (to get regression output like the one immediately above), without assigning other values as in label define age1o 1 "UT" 2 "TTF" 3 "FTF" 4 "FTS" 5 "SP".

        I really appreciate your response, Professor. This is my first post here and I am very satisfied with your response. Thanks a lot.



        • #5
          I also tried the option below:

          Code:
          range age11 26 29
          range age12 30 39
          range age13 40 49
          range age14 50 59
          range age15 60 77
          As a result, it gave me the regression output below:

          Code:
          . regress q age11 age12 age13 age14 age15
          note: age11 omitted because of collinearity
          note: age12 omitted because of collinearity
          note: age13 omitted because of collinearity
          note: age14 omitted because of collinearity
          
                Source |       SS           df       MS      Number of obs   =     6,606
          -------------+----------------------------------   F(1, 6604)      =      1.05
                 Model |  4.60014495         1  4.60014495   Prob > F        =    0.3054
              Residual |  28909.4842     6,604   4.3775718   R-squared       =    0.0002
          -------------+----------------------------------   Adj R-squared   =    0.0000
                 Total |  28914.0843     6,605   4.3776055   Root MSE        =    2.0923
          
          ------------------------------------------------------------------------------
                     q |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                 age11 |          0  (omitted)
                 age12 |          0  (omitted)
                 age13 |          0  (omitted)
                 age14 |          0  (omitted)
                 age15 |   .0053764   .0052447     1.03   0.305     -.004905    .0156578
                 _cons |   2.588339   .3601852     7.19   0.000     1.882259    3.294418
          ------------------------------------------------------------------------------



          • #6
            Well, Stata just uses whatever label you have given it. You told Stata you wanted this variable labeled with UT, TTF, etc., so that's what Stata did. If you want Stata to do it differently, you have to give the variable a different label.

            Code:
            label define age1o2 1 "26-29" 2 "30-39" 3 "40-49" 4 "50-59" 5 "60-77", modify
            label values age1 age1o2
            regress q i.age1



            • #7
              I think there is a misunderstanding here. The variable age is continuous in regress q age, while the recoded variable age1 is supposed to be a factor variable, hence regress q i.age1. But remember that a factor variable really means there are as many 0/1 variables (dummies) as there are levels, minus one (*) (Stata does this automatically when there is this "i."). If you don't tell Stata it's a factor variable (that is, if you remove "i."), it's considered to be a continuous variable, hence there is a single coefficient.

              But, if you have a continuous variable in the first place, the only way to deal with it as factors is to recode, as you did above. If you call regress q i.age, there will be one coefficient for each value of age (not one for each value of a label). You can't apply a label to get age ranges, as would be done with SAS. What counts is the variable values, not the label. If you want such a behavior, you have to recode first (and assign a label if you wish, but it's only for readability, it won't change the result).

              All in all, whatever the labels may be:
              Code:
              regress q age     // one coefficient for age
              regress q i.age   // one coefficient for each value of age minus one (around 50 coefficients here)
              regress q age1    // one coefficient for age1
              regress q i.age1  // one coefficient for each value of age1 minus one (4 coefficients with the recoding above)
              Note that if there is a label, it will be automatically shown in the output for coefficients of factor variables (with "i."). You can remove the label and show the value instead with the option nofvlabel: reg y i.k, nofvlabel. But it only changes the printing, not the computed results.
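
              As a rough sketch of the recode-then-label route, the lines below use simulated data, so the variable agegrp, the seed, and the made-up outcome are all assumptions rather than the poster's data. The age ranges are those from #4. It is meant to be run from a do-file, since it uses // comments and /// line continuation:

              ```stata
              * Illustration only: simulated data; agegrp and the seed are assumptions.
              clear
              set seed 12345
              set obs 1000
              generate age = floor(26 + 52*runiform())   // integer ages from 26 to 77
              generate q   = 2 + 0.03*age + rnormal()    // made-up outcome

              * Recode into five ranges and label each range in one step
              recode age (26/29 = 1 "26-29") (30/39 = 2 "30-39") (40/49 = 3 "40-49") ///
                         (50/59 = 4 "50-59") (60/77 = 5 "60-77"), generate(agegrp)

              * Four coefficients, displayed with the range labels
              regress q i.agegrp
              matrix b_labelled = e(b)

              * Same estimates, displayed with the underlying values 2 to 5
              regress q i.agegrp, nofvlabel
              matrix b_values = e(b)

              * Only the display differs, not the computed results
              assert mreldif(b_labelled, b_values) == 0
              ```

              The labels given inside -recode- are only for readability; the grouping itself comes from the recoded values 1 to 5.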

              (*) minus one because there is also a constant regressor, and if there are as many 0/1 variables as there are levels, the regressors are collinear, since the sum of the dummies always equals 1.
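
              The footnote can be checked with a small sketch, again on simulated data with hypothetical variable names (agegrp, d1 to d5, dsum). It builds one indicator per level, confirms that they sum to 1, and then lets Stata drop the redundant one:

              ```stata
              * Illustration only: simulated data; variable names are assumptions.
              clear
              set seed 12345
              set obs 1000
              generate agegrp = 1 + floor(5*runiform())   // five levels, 1 to 5
              generate q      = rnormal()                 // made-up outcome

              * One 0/1 indicator per level: d1 ... d5
              tabulate agegrp, generate(d)

              * The indicators sum to 1 in every observation, duplicating the constant
              egen dsum = rowtotal(d1-d5)
              assert dsum == 1

              * With all five indicators plus the constant, Stata omits one of them
              regress q d1 d2 d3 d4 d5

              * Dropping one indicator by hand gives the same fit as i.agegrp
              regress q d2 d3 d4 d5
              regress q i.agegrp
              ```

              The last two regressions should report the same coefficients, since i.agegrp uses level 1 as the base by default.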

              Hope this helps

              Jean-Claude Arbaut
              Last edited by Jean-Claude Arbaut; 04 May 2019, 13:42.

