  • Differences of using large dummy and cluster in cross-sectional data set?

    Hello everyone,

    I am working on the relationship between X and Y. The dependent variable Y comes from 51 states in the US, and so does X. I tried to control for the state effects in two ways:

    1. reg y x i.states, robust
    2. reg y x, cluster(states)

    The problem is that the first method gives me a positive relationship and the second gives me a negative one. How should I deal with this? I really have no idea about the difference between using a large set of dummies and clustering.


    Thanks a lot!

    Chen





  • #2
    Chen:
    Robust and clustered standard errors differ in -reg-, even though they do not affect the point estimates of your regression coefficients.
    As an aside, there's nothing more that I can comment on without seeing what you obtained from Stata (as per FAQ #12).
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      First, -reg y x, cluster(states)- is not the currently documented syntax (Stata still accepts it as a legacy option). I'm presuming you mean -... vce(cluster states)-.

      The vce(cluster) option corrects the standard error estimate for b_yx for the non-independence of observations within states, but it does not adjust the slope estimate for the effect of the states variable. Using -reg y x i.states-, regardless of what you do regarding the standard error estimates, adjusts the estimate of the slope for the effect of the states variable. Your two commands would therefore estimate models with different predictors, changing the slope estimate.
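
      The distinction can be checked directly. What follows is only a sketch, not the original poster's data: it uses Stata's grunfeld example dataset, with -company- standing in for -states- and -mvalue- for X. The first two regressions fit the same model and should report the same slope with different standard errors. The third adds the group indicators as predictors, so the slope itself can change.

      ```stata
      * Sketch on the grunfeld example data; company stands in for "states"
      webuse grunfeld, clear

      * Same model, two variance estimators: the mvalue slope is the same,
      * only its standard error differs
      regress invest mvalue, vce(robust)
      scalar b_robust = _b[mvalue]
      regress invest mvalue, vce(cluster company)
      scalar b_cluster = _b[mvalue]
      display "robust: " b_robust "   clustered: " b_cluster

      * Different model: company indicators are added as predictors,
      * so the mvalue slope is now a within-company estimate
      regress invest mvalue i.company, vce(robust)
      display "with company indicators: " _b[mvalue]
      ```

      A sign flip between the two specifications, as in the original question, would mean that the association between X and Y across states differs from the association within states.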



      • #4
        Thanks for your reply. So if I want to control for the state effects, just like year effects, I need to use i.states instead of vce(cluster states). Is this correct?



        • #5
          By the way, does the position where I put i.states affect the results? I ran "reg y i.states x, robust" and "reg y x i.states, robust" and got two different results. In the first one, variable X is omitted.



          • #6
            If you look at the full output from Stata (and it would have been better had you actually shown it) I'm confident you will find that near the top of that regression output, Stata explains why it has omitted variable X. It is almost certainly due to collinearity with the state indicator variables. This sort of thing crops up frequently if the variable X is defined as being an indicator for some subset of the states, or if it is a "continuous" variable whose values are constant within states.

            In the normal course of events, when you run -reg y varlist i.states-, and in the absence of special collinearity, the regression will show you output for indicator variables of all of the states except one, that one being the reference value for the state indicators. In the situation you show here, if you look at the output for the regression that did not drop X, you will find that two state indicators are missing, not just one. At the end of the day, your variable X and the state indicators (other than the reference state) are collinear, so (at least) one of these variables must be dropped. Stata generally tends to keep the variables mentioned first and drop ones mentioned later, though it is not entirely predictable and may vary across commands.

            Try running this to see:
            Code:
            . webuse grunfeld, clear
            
            . 
            . regress invest mvalue i.year
            
                  Source |       SS           df       MS      Number of obs   =       200
            -------------+----------------------------------   F(20, 179)      =     29.86
                   Model |  7201607.36        20  360080.368   Prob > F        =    0.0000
                Residual |  2158336.56       179  12057.7461   R-squared       =    0.7694
            -------------+----------------------------------   Adj R-squared   =    0.7436
                   Total |  9359943.92       199  47034.8941   Root MSE        =    109.81
            
            ------------------------------------------------------------------------------
                  invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  mvalue |   .1398271   .0059889    23.35   0.000     .1280092     .151645
                         |
                    year |
                   1936  |  -23.16908   49.15807    -0.47   0.638     -120.173     73.8348
                   1937  |  -40.42815   49.25913    -0.82   0.413    -137.6315    56.77517
                   1938  |  -14.72079   49.11465    -0.30   0.765     -111.639    82.19742
                   1939  |  -44.34516   49.15825    -0.90   0.368    -141.3494    52.65908
                   1940  |  -19.12367   49.17392    -0.39   0.698    -116.1588     77.9115
                   1941  |   13.77201   49.16036     0.28   0.780     -83.2364    110.7804
                   1942  |   25.67648    49.1185     0.52   0.602    -71.24933    122.6023
                   1943  |   4.816734   49.13774     0.10   0.922    -92.14704    101.7805
                   1944  |   3.392256   49.14498     0.07   0.945     -93.5858    100.3703
                   1945  |  -9.080938   49.17583    -0.18   0.854    -106.1199      87.958
                   1946  |   19.18984   49.19746     0.39   0.697    -77.89179    116.2715
                   1947  |   43.65138   49.12517     0.89   0.375    -53.28759    140.5903
                   1948  |   54.70351   49.12064     1.11   0.267    -42.22651    151.6335
                   1949  |   37.12271   49.12364     0.76   0.451    -59.81324    134.0587
                   1950  |   40.61986   49.13406     0.83   0.409    -56.33665    137.5764
                   1951  |   56.83776   49.19896     1.16   0.250    -40.24682    153.9223
                   1952  |   74.81405   49.21664     1.52   0.130    -22.30541    171.9335
                   1953  |   95.12681   49.32374     1.93   0.055    -2.204001    192.4576
                   1954  |   98.89538     49.302     2.01   0.046     1.607474    196.1833
                         |
                   _cons |  -26.17759    34.9818    -0.75   0.455    -95.20737    42.85219
            ------------------------------------------------------------------------------
            
            . 
            . gen byte X = (year > 1940)
            
            . 
            . regress invest i.X i.year
            note: 1954.year omitted because of collinearity
            
                  Source |       SS           df       MS      Number of obs   =       200
            -------------+----------------------------------   F(19, 180)      =      0.68
                   Model |  628703.404        19  33089.6529   Prob > F        =    0.8335
                Residual |  8731240.51       180  48506.8917   R-squared       =    0.0672
            -------------+----------------------------------   Adj R-squared   =   -0.0313
                   Total |  9359943.92       199  47034.8941   Root MSE        =    220.24
            
            ------------------------------------------------------------------------------
                  invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     1.X |    201.035   98.49558     2.04   0.043      6.68049    395.3895
                         |
                    year |
                   1936  |     28.861   98.49558     0.29   0.770    -165.4935    223.2155
                   1937  |     49.735   98.49558     0.50   0.614    -144.6195    244.0895
                   1938  |      4.809   98.49558     0.05   0.961    -189.5455    199.1635
                   1939  |   7.779998   98.49558     0.08   0.937    -186.5745    202.1345
                   1940  |     40.519   98.49558     0.41   0.681    -153.8355    234.8735
                   1941  |   -134.062   98.49558    -1.36   0.175    -328.4165    60.29251
                   1942  |   -151.116   98.49558    -1.53   0.127    -345.4705    43.23851
                   1943  |   -155.991   98.49558    -1.58   0.115    -350.3455    38.36351
                   1944  |   -152.856   98.49558    -1.55   0.122    -347.2105    41.49851
                   1945  |   -149.622   98.49558    -1.52   0.130    -343.9765    44.73251
                   1946  |   -112.422   98.49558    -1.14   0.255    -306.7765    81.93251
                   1947  |   -126.646   98.49558    -1.29   0.200    -321.0005    67.70851
                   1948  |   -119.833   98.49558    -1.22   0.225    -314.1875    74.52151
                   1949  |   -134.537   98.49558    -1.37   0.174    -328.8915    59.81751
                   1950  |    -122.72   98.49558    -1.25   0.214    -317.0745    71.63451
                   1951  |  -74.19799   98.49558    -0.75   0.452    -268.5525    120.1565
                   1952  |  -49.74799   98.49558    -0.51   0.614    -244.1025    144.6065
                   1953  |   1.802007   98.49558     0.02   0.985    -192.5525    196.1565
                   1954  |          0  (omitted)
                         |
                   _cons |     72.746   69.64689     1.04   0.298    -64.68339    210.1754
            ------------------------------------------------------------------------------
            
            . 
            . regress invest i.year i.X
            note: 1.X omitted because of collinearity
            
                  Source |       SS           df       MS      Number of obs   =       200
            -------------+----------------------------------   F(19, 180)      =      0.68
                   Model |  628703.404        19  33089.6529   Prob > F        =    0.8335
                Residual |  8731240.51       180  48506.8917   R-squared       =    0.0672
            -------------+----------------------------------   Adj R-squared   =   -0.0313
                   Total |  9359943.92       199  47034.8941   Root MSE        =    220.24
            
            ------------------------------------------------------------------------------
                  invest |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                    year |
                   1936  |     28.861   98.49558     0.29   0.770    -165.4935    223.2155
                   1937  |     49.735   98.49558     0.50   0.614    -144.6195    244.0895
                   1938  |      4.809   98.49558     0.05   0.961    -189.5455    199.1635
                   1939  |   7.779998   98.49558     0.08   0.937    -186.5745    202.1345
                   1940  |     40.519   98.49558     0.41   0.681    -153.8355    234.8735
                   1941  |     66.973   98.49558     0.68   0.497    -127.3815    261.3275
                   1942  |     49.919   98.49558     0.51   0.613    -144.4355    244.2735
                   1943  |     45.044   98.49558     0.46   0.648    -149.3105    239.3985
                   1944  |     48.179   98.49558     0.49   0.625    -146.1755    242.5335
                   1945  |     51.413   98.49558     0.52   0.602    -142.9415    245.7675
                   1946  |     88.613   98.49558     0.90   0.370    -105.7415    282.9675
                   1947  |     74.389   98.49558     0.76   0.451    -119.9655    268.7435
                   1948  |     81.202   98.49558     0.82   0.411    -113.1525    275.5565
                   1949  |     66.498   98.49558     0.68   0.500    -127.8565    260.8525
                   1950  |     78.315   98.49558     0.80   0.428    -116.0395    272.6695
                   1951  |    126.837   98.49558     1.29   0.199     -67.5175    321.1915
                   1952  |    151.287   98.49558     1.54   0.126     -43.0675    345.6415
                   1953  |    202.837   98.49558     2.06   0.041     8.482497    397.1915
                   1954  |    201.035   98.49558     2.04   0.043      6.68049    395.3895
                         |
                     1.X |          0  (omitted)
                   _cons |     72.746   69.64689     1.04   0.298    -64.68339    210.1754
            ------------------------------------------------------------------------------
            Note that the first regression (without X) has indicators for every year from 1936 through 1954 (1935, the base year, is always omitted). The variable X has been defined to indicate all years after 1940. When it is included, you either get an additional year dropped (in the second regression, 1954 is omitted) or X itself dropped (in the third regression).
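
            The other case mentioned above, a "continuous" variable whose values are constant within groups, can be sketched the same way. This is an illustration only: -mean_mv- is a hypothetical variable constructed here, and -company- again stands in for the state identifier.

            ```stata
            * Sketch on the grunfeld example data; mean_mv is a made-up variable
            webuse grunfeld, clear

            * Company-level mean of mvalue: continuous, but constant within company
            bysort company: egen double mean_mv = mean(mvalue)

            * Check that it does not vary within company (assert stops with an
            * error if any company has more than one distinct value)
            bysort company (mean_mv): assert mean_mv[1] == mean_mv[_N]

            * mean_mv is a linear combination of the company indicators, so Stata
            * should report a collinearity omission in both orderings; which term
            * it drops may depend on the order of the variables
            regress invest mean_mv i.company
            regress invest i.company mean_mv
            ```

            On the original data, the analogous check would be -bysort states (x): assert x[1] == x[_N]-, assuming the variables are named as in the first post. If that assertion passes, X cannot be separated from the state indicators.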



            • #7
              Thank you very much for your answer.
