Differences of using large dummy and cluster in cross-sectional data set?

Chen Huang

Join Date: Jan 2016

Posts: 33
#1

23 Jun 2016, 06:27

Hello everyone,

I am working on the relationship between X and Y. Both Y (the dependent variable) and X are measured for 51 US states. I tried to control for state effects in two ways:

1. reg y x i.states, robust
2. reg y x, cluster(states)

The problem is that the first method gives me a positive relationship and the second gives me a negative one. How should I deal with this? I don't understand the difference between using a large set of dummies and clustering.

Thanks a lot!

Chen
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17706
#2

23 Jun 2016, 06:31

Chen:
Robust and cluster-robust standard errors differ in -regress-, but neither affects the point estimates of your regression coefficients.
As an aside, there's not much more I can comment on without seeing what you typed and what Stata gave you back (as per FAQ #12).

Kind regards,
Carlo
(Stata 19.0)
Mike Lacy

Join Date: Apr 2014

Posts: 2413
#3

23 Jun 2016, 07:33

First, -reg y x, cluster(states)- is not the documented syntax, although Stata still accepts -cluster()- as an older form of the option. I'm presuming you mean -... vce(cluster states)-.

The vce(cluster) option corrects the standard error estimate for b_yx for the non-independence of observations within states, but it does not adjust the slope estimate for the effect of the states variable. Using -reg y x i.states-, regardless of what you do regarding the standard error estimates, adjusts the estimate of the slope for the effect of the states variable. Your two commands would therefore estimate models with different predictors, changing the slope estimate.
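To make the distinction concrete, here is a minimal sketch. It does not use the original poster's data: it assumes the shipped auto.dta, with rep78 standing in for -states- purely for illustration. The first two regressions fit the same model and differ only in the variance estimator, so the slope is identical. The third adds the group indicators, which is a different model.

```stata
* Illustration only: auto.dta stands in for the poster's data,
* with rep78 playing the role of -states-.
sysuse auto, clear
drop if missing(rep78)

* Same model, two variance estimators: the slope does not change
regress price weight, vce(robust)
scalar b_robust = _b[weight]
regress price weight, vce(cluster rep78)
scalar b_cluster = _b[weight]
assert reldif(b_robust, b_cluster) < 1e-12

* Adding the group indicators changes the model, hence the slope
regress price weight i.rep78, vce(robust)
display "slope without indicators: " b_robust
display "slope with indicators:    " _b[weight]
```

Only the standard errors, t statistics and confidence intervals differ between the first two fits. Whether the slope changes sign once the indicators are added depends on the data.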
Chen Huang

Join Date: Jan 2016

Posts: 33
#4

23 Jun 2016, 08:04

Thanks for your reply. So if I want to control for state effects, just like year effects, I need to use i.states instead of vce(cluster states). Is this correct?
Chen Huang

Join Date: Jan 2016

Posts: 33
#5

23 Jun 2016, 08:07

By the way, does the position where I put i.states affect the results? I ran "reg y i.states x, robust" and "reg y x i.states, robust" and got two different results. In the first one, variable X is omitted...

Clyde Schechter

Join Date: Apr 2014
Posts: 30084

23 Jun 2016, 11:03

By the way, does the position where I put i.states affect the results? I ran "reg y i.states x, robust" and "reg y x i.states, robust" and got two different results. In the first one, variable X is omitted...

If you look at the full output from Stata (and it would have been better had you actually shown it) I'm confident you will find that near the top of that regression output, Stata explains why it has omitted variable X. It is almost certainly due to collinearity with the state indicator variables. This sort of thing crops up frequently if the variable X is defined as being an indicator for some subset of the states, or if it is a "continuous" variable whose values are constant within states.
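One way to check the second possibility directly is to test whether the variable ever varies within a state. The sketch below uses a small made-up data set so that it runs on its own. The variable names x and states are taken from the original post, and the values are invented for illustration only.

```stata
* Toy data, invented for illustration: x is constant within each state
clear
input byte states double x
1 2.5
1 2.5
2 4.0
2 4.0
3 1.0
3 1.0
end

* Sort x within state, then compare its smallest and largest value.
* (Missing values of x sort last, so handle them first in real data.)
bysort states (x): gen byte x_varies = x[1] != x[_N]

* Zero observations flagged means x never varies within a state,
* so x is an exact linear combination of the state indicators
count if x_varies
assert r(N) == 0
```

In real data, drop the final -assert- and simply look at the count: if it is zero, x cannot be estimated alongside a full set of state indicators.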

In the normal course of events, when you run -reg y varlist i.states-, and in the absence of special collinearity, the regression will show you output for indicator variables of all of the states except one, that one being the reference value for the state indicators. In the situation you show here, if you look at the output for the regression that did not drop X, you will find that two state indicators are missing, not just one. At the end of the day, your variable X and the state indicators (other than the reference state) are collinear, so (at least) one of these variables must be dropped. Stata generally tends to keep the variables mentioned first and drop ones mentioned later, though it is not entirely predictable and may vary across commands.

Try running this to see:

Code:

. webuse grunfeld, clear

. 
. regress invest mvalue i.year

      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(20, 179)      =     29.86
       Model |  7201607.36        20  360080.368   Prob > F        =    0.0000
    Residual |  2158336.56       179  12057.7461   R-squared       =    0.7694
-------------+----------------------------------   Adj R-squared   =    0.7436
       Total |  9359943.92       199  47034.8941   Root MSE        =    109.81

------------------------------------------------------------------------------
      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      mvalue |   .1398271   .0059889    23.35   0.000     .1280092     .151645
             |
        year |
       1936  |  -23.16908   49.15807    -0.47   0.638     -120.173     73.8348
       1937  |  -40.42815   49.25913    -0.82   0.413    -137.6315    56.77517
       1938  |  -14.72079   49.11465    -0.30   0.765     -111.639    82.19742
       1939  |  -44.34516   49.15825    -0.90   0.368    -141.3494    52.65908
       1940  |  -19.12367   49.17392    -0.39   0.698    -116.1588     77.9115
       1941  |   13.77201   49.16036     0.28   0.780     -83.2364    110.7804
       1942  |   25.67648    49.1185     0.52   0.602    -71.24933    122.6023
       1943  |   4.816734   49.13774     0.10   0.922    -92.14704    101.7805
       1944  |   3.392256   49.14498     0.07   0.945     -93.5858    100.3703
       1945  |  -9.080938   49.17583    -0.18   0.854    -106.1199      87.958
       1946  |   19.18984   49.19746     0.39   0.697    -77.89179    116.2715
       1947  |   43.65138   49.12517     0.89   0.375    -53.28759    140.5903
       1948  |   54.70351   49.12064     1.11   0.267    -42.22651    151.6335
       1949  |   37.12271   49.12364     0.76   0.451    -59.81324    134.0587
       1950  |   40.61986   49.13406     0.83   0.409    -56.33665    137.5764
       1951  |   56.83776   49.19896     1.16   0.250    -40.24682    153.9223
       1952  |   74.81405   49.21664     1.52   0.130    -22.30541    171.9335
       1953  |   95.12681   49.32374     1.93   0.055    -2.204001    192.4576
       1954  |   98.89538     49.302     2.01   0.046     1.607474    196.1833
             |
       _cons |  -26.17759    34.9818    -0.75   0.455    -95.20737    42.85219
------------------------------------------------------------------------------

. 
. gen byte X = (year > 1940)

. 
. regress invest i.X i.year
note: 1954.year omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(19, 180)      =      0.68
       Model |  628703.404        19  33089.6529   Prob > F        =    0.8335
    Residual |  8731240.51       180  48506.8917   R-squared       =    0.0672
-------------+----------------------------------   Adj R-squared   =   -0.0313
       Total |  9359943.92       199  47034.8941   Root MSE        =    220.24

------------------------------------------------------------------------------
      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         1.X |    201.035   98.49558     2.04   0.043      6.68049    395.3895
             |
        year |
       1936  |     28.861   98.49558     0.29   0.770    -165.4935    223.2155
       1937  |     49.735   98.49558     0.50   0.614    -144.6195    244.0895
       1938  |      4.809   98.49558     0.05   0.961    -189.5455    199.1635
       1939  |   7.779998   98.49558     0.08   0.937    -186.5745    202.1345
       1940  |     40.519   98.49558     0.41   0.681    -153.8355    234.8735
       1941  |   -134.062   98.49558    -1.36   0.175    -328.4165    60.29251
       1942  |   -151.116   98.49558    -1.53   0.127    -345.4705    43.23851
       1943  |   -155.991   98.49558    -1.58   0.115    -350.3455    38.36351
       1944  |   -152.856   98.49558    -1.55   0.122    -347.2105    41.49851
       1945  |   -149.622   98.49558    -1.52   0.130    -343.9765    44.73251
       1946  |   -112.422   98.49558    -1.14   0.255    -306.7765    81.93251
       1947  |   -126.646   98.49558    -1.29   0.200    -321.0005    67.70851
       1948  |   -119.833   98.49558    -1.22   0.225    -314.1875    74.52151
       1949  |   -134.537   98.49558    -1.37   0.174    -328.8915    59.81751
       1950  |    -122.72   98.49558    -1.25   0.214    -317.0745    71.63451
       1951  |  -74.19799   98.49558    -0.75   0.452    -268.5525    120.1565
       1952  |  -49.74799   98.49558    -0.51   0.614    -244.1025    144.6065
       1953  |   1.802007   98.49558     0.02   0.985    -192.5525    196.1565
       1954  |          0  (omitted)
             |
       _cons |     72.746   69.64689     1.04   0.298    -64.68339    210.1754
------------------------------------------------------------------------------

. 
. regress invest i.year i.X
note: 1.X omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(19, 180)      =      0.68
       Model |  628703.404        19  33089.6529   Prob > F        =    0.8335
    Residual |  8731240.51       180  48506.8917   R-squared       =    0.0672
-------------+----------------------------------   Adj R-squared   =   -0.0313
       Total |  9359943.92       199  47034.8941   Root MSE        =    220.24

------------------------------------------------------------------------------
      invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |
       1936  |     28.861   98.49558     0.29   0.770    -165.4935    223.2155
       1937  |     49.735   98.49558     0.50   0.614    -144.6195    244.0895
       1938  |      4.809   98.49558     0.05   0.961    -189.5455    199.1635
       1939  |   7.779998   98.49558     0.08   0.937    -186.5745    202.1345
       1940  |     40.519   98.49558     0.41   0.681    -153.8355    234.8735
       1941  |     66.973   98.49558     0.68   0.497    -127.3815    261.3275
       1942  |     49.919   98.49558     0.51   0.613    -144.4355    244.2735
       1943  |     45.044   98.49558     0.46   0.648    -149.3105    239.3985
       1944  |     48.179   98.49558     0.49   0.625    -146.1755    242.5335
       1945  |     51.413   98.49558     0.52   0.602    -142.9415    245.7675
       1946  |     88.613   98.49558     0.90   0.370    -105.7415    282.9675
       1947  |     74.389   98.49558     0.76   0.451    -119.9655    268.7435
       1948  |     81.202   98.49558     0.82   0.411    -113.1525    275.5565
       1949  |     66.498   98.49558     0.68   0.500    -127.8565    260.8525
       1950  |     78.315   98.49558     0.80   0.428    -116.0395    272.6695
       1951  |    126.837   98.49558     1.29   0.199     -67.5175    321.1915
       1952  |    151.287   98.49558     1.54   0.126     -43.0675    345.6415
       1953  |    202.837   98.49558     2.06   0.041     8.482497    397.1915
       1954  |    201.035   98.49558     2.04   0.043      6.68049    395.3895
             |
         1.X |          0  (omitted)
       _cons |     72.746   69.64689     1.04   0.298    -64.68339    210.1754
------------------------------------------------------------------------------

Note that the first regression (without X) has indicators for every year from 1936 through 1954 (1935, the base year, is omitted). The variable X has been defined to indicate all years after 1940. When it is included, you get either an additional year dropped (in the second regression, 1954 is omitted) or X itself dropped (in the third regression).
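The second and third regressions are the same fit, just parameterized differently: the ANOVA table, R-squared and Root MSE are identical, and the year coefficients differ only by the X coefficient (for 1941, 201.035 - 134.062 = 66.973). A short sketch of how one might confirm this, assuming the same grunfeld data and the same X as above (the names of the prediction variables are arbitrary):

```stata
webuse grunfeld, clear
gen byte X = (year > 1940)

* X listed first: 1954.year is dropped
regress invest i.X i.year
predict double yhat_x_first, xb

* year listed first: 1.X is dropped
regress invest i.year i.X
predict double yhat_year_first, xb

* Same fitted values, so the choice of which term to drop is arbitrary
assert reldif(yhat_x_first, yhat_year_first) < 1e-8
```

Because the fitted values coincide, neither ordering tells you anything the other does not. The data simply cannot separate the effect of X from the year effects.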


Chen Huang

Join Date: Jan 2016
Posts: 33

23 Jun 2016, 13:56

Thank you very much for your answer and reply
