  • Linear regression with categorical variables

    Dear Stata users,

    I am new to Stata and am currently running a linear regression with a continuous dependent variable, 3 continuous covariates and 2 categorical covariates.

    I first tried the following code, where country and industry are already-encoded categorical variables. In this form, however, the categories of country and industry are treated as values of a continuous variable.
    Code:
    regress return date country industry revenues employees
    Then I tried factor-variable notation by adding the "i." prefix to country and industry. I also ran the -margins- command to get the respective effects for country and industry. The code then looked like this:
    Code:
    regress return date i.country i.industry revenues employees
    margins i.country i.industry
    The -margins- output now gave me the coefficients and significance levels for the different countries and industries accurately. However, the estimates for the other three covariates still seem to depend on the base level I choose for the two categorical variables. Is there any way to get unbiased coefficients? Or what would be best practice in such cases to get largely unbiased estimates for the other three covariates?

    Sorry if this sounds like a dumb question. I have already searched the forum, but I couldn't find a solution that fits my problem precisely.

    Best regards,
    Alex
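
    A minimal sketch of the situation described in this post, using Stata's bundled auto data rather than the poster's dataset (price, weight, mpg and rep78 stand in for return, the continuous covariates and country). In an ordinary least-squares model of this kind, switching the base level of a factor variable only re-expresses the intercept and the dummy coefficients, so the coefficients of the continuous covariates would be expected to stay the same:

    ```stata
    * Illustration on the bundled auto data, not the poster's data.
    sysuse auto, clear

    * Default base level (lowest value of rep78)
    regress price weight mpg i.rep78
    scalar b_weight_default = _b[weight]
    scalar b_mpg_default    = _b[mpg]

    * Same model, but with level 3 as the base
    regress price weight mpg ib3.rep78

    * The continuous covariates' coefficients should match across the two fits
    display "weight: " b_weight_default " vs " _b[weight]
    display "mpg:    " b_mpg_default " vs " _b[mpg]
    ```

    If the continuous coefficients do differ between two runs, something other than the base level (for example the estimation sample or the covariate list) has probably changed.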

  • #2
    Alexander:
    welcome to this forum.
    I suspect that you're mixing up fitted values with -margins- results.
    I would recommend thinking once more about what you're actually interested in.
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Hi Carlo,

      thank you for the pointer. I have re-read the -margins- documentation, and since the command adjusts the data to compute marginal effects (e.g. as if all countries in the dataset were the US), it does not give me what I actually want from the model.

      My problem right now is that I'm receiving different results for each country when taking different baselines (e.g. ib2.country instead of i.country). I would like to single out the actual effects of each country independent of the respective baseline.

      Does this make my issue clearer? I would be really grateful for any kind of help or advice.

      Best regards,
      Alex


      • #4
        Alexander:
        there's no fix for that.
        Coefficients are conditional on the reference category.
        The best way to handle this issue is to report in the regression outcome table (and/or in the accompanying research report) which reference category you chose and why.
        Kind regards,
        Carlo
        (Stata 19.0)
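
        To illustrate what "conditional on the reference category" means in practice, here is a small sketch on the bundled auto data (an assumption for illustration; rep78 stands in for country). Each reported coefficient is a difference from the base level, so it changes with the base, but the difference between any two levels, and the adjusted prediction for each level, should not:

        ```stata
        * Illustration on the bundled auto data, not the poster's data.
        sysuse auto, clear

        * Base = level 1 (default): difference between levels 3 and 2
        regress price weight i.rep78
        lincom 3.rep78 - 2.rep78
        scalar diff_base1 = r(estimate)

        * Base = level 2: the same difference is now a single coefficient
        regress price weight ib2.rep78
        display "base 1: " diff_base1 "   base 2: " _b[3.rep78]

        * Adjusted predictions per level do not depend on the base level
        margins rep78
        ```

        Reporting pairwise differences or adjusted predictions alongside the chosen reference category is one way to present results that read the same whichever base was used.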


        • #5
          Okay, thank you, Carlo.

          Are there any best practices for choosing the reference? For example, choosing the category with the highest or lowest t-value, or one in the middle? Or should the reference rather be chosen based on the number of observations in each category?

          Best regards,
          Alex


          • #6
            Alexander:
            you can choose as reference category the level of your categorical variable with the lowest or highest number of observations.
            If you have an ordinal categorical variable you can choose the poorest level as reference category.
            That said, with -country- and -industry- you can let Stata choose the reference category on your behalf.
            Kind regards,
            Carlo
            (Stata 19.0)
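
            For reference, a short sketch of the ways a base level can be requested with factor-variable notation, again on the bundled auto data (an assumption for illustration). As far as I know there is no built-in operator for "least frequent level", so that choice has to be spelled out with ib#. after inspecting -tabulate-:

            ```stata
            * Illustration on the bundled auto data, not the poster's data.
            sysuse auto, clear

            * Level 3 as base, chosen explicitly
            regress price ib3.rep78

            * Highest level as base
            regress price ib(last).rep78

            * Most frequent level as base
            regress price ib(freq).rep78

            * Or set the base once for the variable, so plain i.rep78 uses it
            fvset base frequent rep78
            regress price i.rep78

            * Remove the setting again to return to the default behaviour
            fvset clear rep78
            ```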


            • #7
              Originally posted by Carlo Lazzaro
              That said, with -country- and -industry- you can let Stata choose the reference category on your behalf.
              Unfortunately, I don't understand what you mean by this. Could you please explain how Stata would do that, and with which command?


              • #8
                Alexander:
                see the following toy example, where Stata automatically chooses the level with the lowest number of observations as reference category:
                Code:
                . use "C:\Program Files\Stata16\ado\base\a\auto.dta"
                (1978 Automobile Data)
                
                . regress price i.rep78
                
                      Source |       SS           df       MS      Number of obs   =        69
                -------------+----------------------------------   F(4, 64)        =      0.24
                       Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
                    Residual |   568436416        64     8881819   R-squared       =    0.0145
                -------------+----------------------------------   Adj R-squared   =   -0.0471
                       Total |   576796959        68  8482308.22   Root MSE        =    2980.2
                
                ------------------------------------------------------------------------------
                       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                       rep78 |
                          2  |   1403.125   2356.085     0.60   0.554    -3303.696    6109.946
                          3  |   1864.733   2176.458     0.86   0.395    -2483.242    6212.708
                          4  |       1507   2221.338     0.68   0.500    -2930.633    5944.633
                          5  |     1348.5   2290.927     0.59   0.558    -3228.153    5925.153
                             |
                       _cons |     4564.5   2107.347     2.17   0.034     354.5913    8774.409
                ------------------------------------------------------------------------------
                
                . tab rep78
                
                     Repair |
                Record 1978 |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          1 |          2        2.90        2.90
                          2 |          8       11.59       14.49
                          3 |         30       43.48       57.97
                          4 |         18       26.09       84.06
                          5 |         11       15.94      100.00
                ------------+-----------------------------------
                      Total |         69      100.00
                
                .
                Kind regards,
                Carlo
                (Stata 19.0)
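
                A caveat on reading the toy example: level 1 of rep78 is both the smallest value and the least frequent level, so the output cannot tell the two rules apart. To my understanding of the factor-variable documentation, the default base is the smallest value of the variable, whatever its frequency. A quick check on the same dataset, using foreign, whose smallest value is also its most frequent level:

                ```stata
                * Illustration on the bundled auto data.
                sysuse auto, clear

                * 0 = Domestic (52 cars), 1 = Foreign (22 cars)
                tabulate foreign

                * The baselevels option makes the output show which level is the base;
                * here level 0 (Domestic) should be flagged, although it is the most frequent
                regress price i.foreign, baselevels
                ```

                This would also be consistent with a later regression in which the first category in sort order, not the rarest one, is left out of the coefficient table.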


                • #9
                  For me it's automatically the first category, not the one with the fewest observations (which should be Singapore)?

                  Code:
                  . regress TotalRevenue i.country
                  
                        Source |       SS           df       MS      Number of obs   =   262,227
                  -------------+----------------------------------   F(19, 262207)   =   1116.01
                         Model |  5.5071e+25        19  2.8985e+24   Prob > F        =    0.0000
                      Residual |  6.8100e+26   262,207  2.5972e+21   R-squared       =    0.0748
                  -------------+----------------------------------   Adj R-squared   =    0.0748
                         Total |  7.3608e+26   262,226  2.8070e+21   Root MSE        =    5.1e+10
                  
                  -------------------------------------------------------------------------------------------
                               TotalRevenue |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  --------------------------+----------------------------------------------------------------
                                    country |
                                   Belgium  |   1.59e+10   9.38e+08    16.98   0.000     1.41e+10    1.78e+10
                                    Canada  |  -2.07e+09   1.33e+09    -1.55   0.121    -4.68e+09    5.43e+08
                                   Denmark  |   6.00e+09   9.44e+08     6.35   0.000     4.15e+09    7.85e+09
                                   Finland  |  -4.99e+09   1.33e+09    -3.74   0.000    -7.61e+09   -2.38e+09
                                    France  |   3.52e+10   8.40e+08    41.97   0.000     3.36e+10    3.69e+10
                                   Germany  |   4.49e+10   7.95e+08    56.46   0.000     4.33e+10    4.64e+10
                                     India  |   7.57e+09   8.84e+08     8.57   0.000     5.84e+09    9.30e+09
                      Ireland; Republic of  |   2.41e+10   1.05e+09    22.87   0.000     2.20e+10    2.61e+10
                                     Italy  |  -4.73e+09   1.32e+09    -3.57   0.000    -7.33e+09   -2.13e+09
                                     Japan  |   1.55e+10   7.05e+08    21.96   0.000     1.41e+10    1.69e+10
                                Luxembourg  |  -4.09e+09   1.34e+09    -3.05   0.002    -6.73e+09   -1.46e+09
                                    Mexico  |   8.26e+09   1.33e+09     6.19   0.000     5.65e+09    1.09e+10
                               Netherlands  |   1.26e+10   8.13e+08    15.54   0.000     1.10e+10    1.42e+10
                                     Spain  |   4.93e+10   1.32e+09    37.25   0.000     4.68e+10    5.19e+10
                                    Sweden  |   1.42e+10   1.33e+09    10.67   0.000     1.16e+10    1.69e+10
                               Switzerland  |   2.41e+10   7.96e+08    30.24   0.000     2.25e+10    2.56e+10
                                    Taiwan  |  -5.14e+09   1.06e+09    -4.84   0.000    -7.22e+09   -3.06e+09
                            United Kingdom  |   1.11e+10   7.18e+08    15.49   0.000     9.72e+09    1.25e+10
                  United States of America  |   3.87e+10   6.88e+08    56.24   0.000     3.73e+10    4.00e+10
                                            |
                                      _cons |   5.36e+09   6.65e+08     8.05   0.000     4.05e+09    6.66e+09
                  -------------------------------------------------------------------------------------------
                  
                  . tab country
                  
                   Country of Headquarters |      Freq.     Percent        Cum.
                  -------------------------+-----------------------------------
                                 Australia |     15,654        5.00        5.00
                                   Belgium |      5,933        1.90        6.90
                                    Canada |      3,892        1.24        8.14
                                   Denmark |      7,716        2.47       10.61
                                   Finland |      1,941        0.62       11.23
                                    France |     11,868        3.79       15.02
                                   Germany |     15,648        5.00       20.02
                                     India |      7,673        2.45       22.47
                      Ireland; Republic of |      3,898        1.25       23.72
                                     Italy |      1,980        0.63       24.35
                                     Japan |     47,445       15.16       39.51
                                Luxembourg |      1,906        0.61       40.12
                                    Mexico |      1,945        0.62       40.74
                               Netherlands |     13,846        4.42       45.17
                                    Norway |      1,939        0.62       45.78
                                 Singapore |      1,819        0.58       46.37
                                     Spain |      7,912        2.53       48.89
                                    Sweden |      1,941        0.62       49.51
                               Switzerland |     15,520        4.96       54.47
                                    Taiwan |      3,793        1.21       55.69
                            United Kingdom |     43,096       13.77       69.46
                  United States of America |     95,587       30.54      100.00
                  -------------------------+-----------------------------------
                                     Total |    312,952      100.00


                  • #10
                    Alexander:
                    in your example, Singapore (1819 obs; the lowest frequency) is actually the reference category.
                    The coefficient of Singapore= _cons
                    Kind regards,
                    Carlo
                    (Stata 19.0)


                    • #11
                      Hm, and where can I see the regression data of Australia then?


                      • #12
                        Alexander:
                        as Australia is the reference category, its coefficient=_cons = 5.36e+09.
                        Kind regards,
                        Carlo
                        (Stata 19.0)
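
                        A small sketch of why the reference category shows up as _cons, using the auto data from the earlier toy example (an assumption for illustration). In a model with a single factor variable and no other covariates, _cons is the mean of the outcome in the base level, and every other level's mean is _cons plus that level's coefficient:

                        ```stata
                        * Illustration on the bundled auto data, not the poster's data.
                        sysuse auto, clear
                        regress price i.rep78

                        * Base level (rep78 == 1): its mean equals _cons
                        summarize price if rep78 == 1
                        display "_cons = " _b[_cons] "   mean(price | rep78==1) = " r(mean)

                        * Non-base level: its mean equals _cons plus its coefficient
                        summarize price if rep78 == 3
                        display "_cons + 3.rep78 = " _b[_cons] + _b[3.rep78] "   mean = " r(mean)
                        ```

                        Once further covariates are added, _cons is no longer a simple group mean; it is then the prediction for the base level with all other covariates at zero.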


                        • #13
                          You have 22 countries in -tab- but only 19 countries listed in the regression output. You should have 21 countries listed there.
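
                          One possible explanation, offered as an assumption since the data are not available here: the regression used 262,227 observations while -tab- counted 312,952, so observations with a missing outcome were dropped, and a country with no usable observations would disappear from the output altogether. The same mechanism can be seen on the bundled auto data, where rep78 is missing for some cars:

                          ```stata
                          * Illustration on the bundled auto data, not the poster's data.
                          sysuse auto, clear
                          regress price i.rep78

                          * Number of cars left out of the estimation sample
                          count if !e(sample)

                          * Levels that actually entered the model
                          tabulate rep78 if e(sample)

                          * Which cars were dropped because rep78 is missing
                          tabulate foreign if missing(rep78)
                          ```

                          On the poster's data, the analogous checks would be -tabulate country if e(sample)- after the regression and -tabulate country if missing(TotalRevenue)-.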
