
  • xtreg fe - Drop or retain groups with no variation on focal predictor?

    Hello all--

    This may be less a Stata question and more a methods question, but I'm hoping someone can help. I have a sample of individuals over time and am regressing starting salary on a particular organizational feature (and a host of other variables) for individuals who change jobs. Some individuals do not vary on this feature (i.e., it is either always absent or always present within these individuals). I can't find guidance on whether I should drop or retain these individuals. They do seem to be influencing the results in a non-trivial way.

    Do I need to make a qualitative judgement here, excluding those with no variation if I conclude that they are in some way different from those with variation?

    Interestingly, if I retain them, the Hausman test favors random effects (re), but if I drop them, it favors fixed effects (fe).
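
    To make the two samples concrete, here is a minimal sketch of how individuals with no variation on the focal predictor could be flagged before comparing estimates. The data are simulated, and the names id, year, salary, feature, and no_var are hypothetical stand-ins rather than the actual variables.

    ```stata
    * Simulated panel: 30 individuals, 4 periods (all names are hypothetical)
    clear
    set seed 42
    set obs 30
    generate id = _n
    expand 4
    bysort id: generate year = _n
    generate byte feature = runiform() < 0.5
    replace feature = 1 if id <= 5             // feature always present
    replace feature = 0 if inrange(id, 6, 10)  // feature always absent
    generate salary = 50 + 5*feature + rnormal(0, 3)

    * Flag individuals whose feature never changes over time
    bysort id: egen byte feat_min = min(feature)
    bysort id: egen byte feat_max = max(feature)
    generate byte no_var = (feat_min == feat_max)

    * Count individuals (not observations) with and without variation
    egen byte first = tag(id)
    tabulate no_var if first

    * Fixed effects on the full sample and on the restricted sample
    xtset id year
    xtreg salary feature, fe
    xtreg salary feature if !no_var, fe
    ```

    The flag only describes the sample; whether to restrict it is a separate methodological question.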
    Last edited by Bob Hernandez; 31 Mar 2015, 12:50.

  • #2
    Bob: It is never ideal to drop observations since you are throwing away useful information and potentially introducing bias in your estimates. If indeed these observations did not vary at all, then Stata drops them automatically and your estimates would be identical to the case where you manually drop the observations. I suspect that there may be some variation in later years that you don't notice immediately by glancing at the data.

    More fundamentally, if a certain variable varies for some individuals at all periods and for some at some periods (or not at all), this is how it is in the population and you cannot model it as varying for everyone at all periods. Therefore, my suggestion is to work with the data you have "as is" unless you have a very good reason for excluding some of the observations.

    Alternatively, you can specify that your sample includes individuals whose "x variable" varies by say more than 10 percent a year. If you make this explicit and your readers know, I see no problem with dropping observations with less than 10 percent variation per year. However, note that your conclusions will apply to the sub-sample, and you cannot make inferences about the general population if you restrict your sample in this way.



    • #3
      Andrew: Your suggestion that Stata drops these individuals from the analysis seems to be incorrect. When I run the regression with and without them (dropped manually), Stata returns different estimates and Ns. My guess is that this is because these individuals, although they do not have variation on this focal X variable, do have variation on the other Xs and the Y?

      I see your point about bias. But is it possible that including these individuals would introduce bias? For example, isn't it possible that an unobserved variable not only explains the differences in variance (i.e., some or none) in the focal X, but also is a strong predictor of Y? If so, excluding those who do not vary on the focal X may constrain the sample such that variation in the endogenous variable is low, reducing its biasing effect. Even if my line of thinking is correct, perhaps the threat inherent in excluding them remains greater.



      • #4
        Sorry for not being clear: I did not mean that Stata drops the individuals, but that it drops the time-invariant variable when running fixed effects for these individuals.

        You are right that the ideal case is having variables that vary across individuals and across time in the fixed effects specification. My general point is that it may be far worse (inference-wise) to delete observations.



        • #5
          Stata does not drop observations if a variable does not have time variation for some individuals. (It would omit this variable in a fixed-effects regression if it is time-invariant for all individuals which is not the case here.)

          In general, you should not drop those individuals without time variation in one of your regressors. In the best case, under the premise that your model is otherwise correctly specified, this only reduces the precision of your estimates due to the smaller sample size. In the worst case, it induces a sample selection bias.

          If your model is misspecified, e.g. because of an omitted variable, you might be able to construct situations where the omitted variable is correlated with the variable that is time invariant for some individuals in a way such that removing those individuals would reduce the bias. However, you would need a good theory to justify this case. It is also very likely that such an omitted variable is itself time invariant and thus already captured by the fixed effects.
          https://twitter.com/Kripfganz



          • #6
            Thanks to you both. I agree with your last point, Sebastian, that the fixed effects are likely to take care of the hypothetical omitted variable.



            • #7
              Sebastian (or others): By the way, can you recommend a reading/citation that explains how sample selection bias can emerge when omitting groups (as I proposed doing here)?



              • #8
                Sebastian is correct in objecting to the use of the term "dropped" in the usual sense, because observations on a variable that does not vary over time for a sub-sample of individuals are not literally dropped. What is meant is that they do not influence the coefficient estimates, because the fixed effects estimator only considers the within variation in the data (hence they are dropped in the estimation). This can be easily illustrated: We have variables y, x1, x2, x3, and x4 for 10 individuals over 5 time periods. The variable x1 does not vary for individuals 1, 2, and 7 and equals 19, 55, and 28, respectively.

                Code:
                clear 
                
                input y x1 x2 x3 x4 id time
                78 19  45 44  15   1 1
                23 19  17 47  72   1 2
                10 19  32 62  65   1 3
                34 19  11 21  20   1 4
                77 19  42 23  100  1 5
                
                
                91 55  12 13  14   2 1
                62 55  27 37  47   2 2
                33 55  13 14  15   2 3
                16 55  58 68  78   2 4
                99 55  80 90  70   2 5
                
                20 51  18 62  82   3 1
                38 39  39 11  63   3 2
                40 87  46 93  90   3 3
                56 03  64 80  28   3 4
                73 200 88 103  36  3 5
                
                115 70  85 18  85   4 1
                49 51  67 22 76   4 2
                57 28  49 26  96   4 3
                74 32  31 41  77   4 4
                110 16  12 60  80  4 5
                
                24 112  26 20  26   5 1
                111 123  81 82  37   5 2
                64 45  59 39  49   5 3
                39 72  31 29  92   5 4
                79 80  16 77  107  5 5
                
                37 47  19 89  12   6 1
                23 61  38 45  22   6 2
                32 83  82 83  66   6 3
                120 115 91 116  108   6 4
                7 150  54 93  72  6 5
                
                92 28  30 41  90   7 1
                100 28  40 96  102   7 2
                108 28  50 29  59   7 3
                116 28  60 42  76   7 4
                128 28  70 80  94  7 5
                
                39 7  55 103  106   8 1
                51 50  27 98  62  8 2
                73 61  19 81  74   8 3
                94 86  112 99  53   8 4
                103 99  67 102  10  8 5
                
                89 80  105 54  69   9 1
                62 90  97 108  62   9 2
                13 100  102 92  39   9 3
                100 110 81 66  85   9 4
                115 120 92 50  67  9 5
                
                40 37  75 19  14   10 1
                92 65  87 5  34   10 2
                56 72  92 15  40   10 3
                119 128  23 21  63   10 4
                84 80  82 67  29  10 5
                end
                If y is our outcome variable and x1-x4 are our regressors, fixed effects regression yields the following coefficient estimates:


                Code:
                xtset id time
                xtreg y x*, fe
                Code:
                Fixed-effects (within) regression               Number of obs      =        50
                Group variable: id                              Number of groups   =        10
                
                R-sq:  within  = 0.1448                         Obs per group: min =         5
                       between = 0.0000                                        avg =       5.0
                       overall = 0.0538                                        max =         5
                
                                                                F(4,36)            =      1.52
                corr(u_i, Xb)  = -0.3074                        Prob > F           =    0.2161
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                          x1 |   .2398038   .1525894     1.57   0.125    -.0696618    .5492694
                          x2 |   .2376799   .2046493     1.16   0.253    -.1773682     .652728
                          x3 |   .0511757   .2110712     0.24   0.810    -.3768965     .479248
                          x4 |   .0676377   .1813813     0.37   0.711    -.3002206     .435496
                       _cons |   32.27467   17.37543     1.86   0.071    -2.964345    67.51368
                -------------+----------------------------------------------------------------
                     sigma_u |  22.933562
                     sigma_e |   31.13536
                         rho |  .35172042   (fraction of variance due to u_i)
                ------------------------------------------------------------------------------
                F test that all u_i=0:     F(9, 36) =     1.94               Prob > F = 0.0774
                Now, we may want to change the value of x1 for individuals 1, 2, and 7 to zero (or any other constant value) and repeat the estimation.

                Code:
                replace x1 = 0 if inlist(id, 1, 2, 7)
                xtreg y x*, fe

                The coefficient estimates for x1-x4 (which are what we want to estimate consistently) still remain the same.

                Code:
                Fixed-effects (within) regression               Number of obs      =        50
                Group variable: id                              Number of groups   =        10
                
                R-sq:  within  = 0.1448                         Obs per group: min =         5
                       between = 0.0012                                        avg =       5.0
                       overall = 0.0359                                        max =         5
                
                                                                F(4,36)            =      1.52
                corr(u_i, Xb)  = -0.4334                        Prob > F           =    0.2161
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                          x1 |   .2398038   .1525894     1.57   0.125    -.0696618    .5492694
                          x2 |   .2376799   .2046493     1.16   0.253    -.1773682     .652728
                          x3 |   .0511757   .2110712     0.24   0.810    -.3768965     .479248
                          x4 |   .0676377   .1813813     0.37   0.711    -.3002206     .435496
                       _cons |   34.72067   16.90976     2.05   0.047     .4260824    69.01525
                -------------+----------------------------------------------------------------
                     sigma_u |  24.808905
                     sigma_e |   31.13536
                         rho |  .38834292   (fraction of variance due to u_i)
                ------------------------------------------------------------------------------
                F test that all u_i=0:     F(9, 36) =     1.94               Prob > F = 0.0777
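
                The reason the constant does not matter can be sketched by doing the within transformation by hand. The example below uses simulated data (not the data set above; the seed and the coefficients 0.3 and 0.2 are arbitrary assumptions) and subtracts each individual's mean from every variable. For the individuals whose x1 is constant, the demeaned x1 is exactly zero whatever the constant is, and OLS on the demeaned data reproduces the xtreg, fe coefficients.

                ```stata
                * Simulated panel: 10 individuals, 5 periods (hypothetical data)
                clear
                set seed 2015
                set obs 10
                generate id = _n
                expand 5
                bysort id: generate time = _n
                generate x1 = rnormal(50, 20)
                generate x2 = rnormal(50, 20)
                replace x1 = 19 if inlist(id, 1, 2, 7)   // x1 constant within ids 1, 2, 7
                generate y = 0.3*x1 + 0.2*x2 + rnormal(0, 10)

                * Fixed-effects estimates from xtreg
                xtset id time
                xtreg y x1 x2, fe
                scalar b1_fe = _b[x1]
                scalar b2_fe = _b[x2]

                * Within transformation by hand: subtract each individual's mean
                foreach v of varlist y x1 x2 {
                    bysort id: egen double m_`v' = mean(`v')
                    generate double d_`v' = `v' - m_`v'
                }

                * Demeaned x1 is zero for the individuals with constant x1
                summarize d_x1 if inlist(id, 1, 2, 7)

                * OLS on demeaned data gives the same coefficients as xtreg, fe
                * (the standard errors differ because of the degrees of freedom)
                regress d_y d_x1 d_x2, noconstant
                display _b[d_x1] - b1_fe
                display _b[d_x2] - b2_fe
                ```

                The demeaned y and x2 for those individuals are not zero, which is why their observations still enter the fit through the other variables.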
                Bob: You may want to look at some literature on missing data, since you introduce missingness by deleting observations. Here is a good place to start:

                http://medrescon.tripod.com/docs/little_paper.pdf



                • #9
                  Andrew gave a nice example. To make it "complete", observe that all the coefficient estimates (including x1) change when we exclude the individuals 1, 2, and 7.
                  Code:
                  . xtreg y x* if id != 1 & id != 2 & id != 7, fe
                  
                  Fixed-effects (within) regression               Number of obs      =        35
                  Group variable: id                              Number of groups   =         7
                  
                  R-sq:  within  = 0.1942                         Obs per group: min =         5
                         between = 0.0867                                        avg =       5.0
                         overall = 0.0708                                        max =         5
                  
                                                                  F(4,24)            =      1.45
                  corr(u_i, Xb)  = -0.3647                        Prob > F           =    0.2495
                  
                  ------------------------------------------------------------------------------
                             y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                            x1 |   .2277504   .1618489     1.41   0.172    -.1062894    .5617901
                            x2 |   .1877974   .2357355     0.80   0.433    -.2987367    .6743315
                            x3 |   .1823178   .2578391     0.71   0.486     -.349836    .7144716
                            x4 |   .1761042   .2231101     0.79   0.438    -.2843725    .6365809
                         _cons |   15.16086   24.16224     0.63   0.536    -34.70755    65.02927
                  -------------+----------------------------------------------------------------
                       sigma_u |  20.050298
                       sigma_e |  32.226396
                           rho |  .27906913   (fraction of variance due to u_i)
                  ------------------------------------------------------------------------------
                  F test that all u_i=0:     F(6, 24) =     1.29               Prob > F = 0.3012
                  The coefficient of x1 would only remain unchanged if x1 was uncorrelated with all other regressors (or if x1 was the only regressor):
                  Code:
                  . xtreg y x1, fe
                  
                  Fixed-effects (within) regression               Number of obs      =        50
                  Group variable: id                              Number of groups   =        10
                  
                  R-sq:  within  = 0.0994                         Obs per group: min =         5
                         between = 0.0660                                        avg =       5.0
                         overall = 0.0107                                        max =         5
                  
                                                                  F(1,39)            =      4.30
                  corr(u_i, Xb)  = -0.3677                        Prob > F           =    0.0447
                  
                  ------------------------------------------------------------------------------
                             y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                            x1 |   .2968735   .1430946     2.07   0.045     .0074373    .5863097
                         _cons |   48.53759   10.03157     4.84   0.000     28.24684    68.82835
                  -------------+----------------------------------------------------------------
                       sigma_u |  23.853345
                       sigma_e |  30.696911
                           rho |  .37648957   (fraction of variance due to u_i)
                  ------------------------------------------------------------------------------
                  F test that all u_i=0:     F(9, 39) =     2.61               Prob > F = 0.0183
                  Code:
                  . xtreg y x1 if id != 1 & id != 2 & id != 7, fe
                  
                  Fixed-effects (within) regression               Number of obs      =        35
                  Group variable: id                              Number of groups   =         7
                  
                  R-sq:  within  = 0.1311                         Obs per group: min =         5
                         between = 0.1650                                        avg =       5.0
                         overall = 0.0434                                        max =         5
                  
                                                                  F(1,27)            =      4.07
                  corr(u_i, Xb)  = -0.3036                        Prob > F           =    0.0536
                  
                  ------------------------------------------------------------------------------
                             y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                            x1 |   .2968735   .1470742     2.02   0.054    -.0048978    .5986448
                         _cons |   43.17958   12.34679     3.50   0.002     17.84607    68.51309
                  -------------+----------------------------------------------------------------
                       sigma_u |   18.72477
                       sigma_e |  31.550614
                           rho |  .26047684   (fraction of variance due to u_i)
                  ------------------------------------------------------------------------------
                  F test that all u_i=0:     F(6, 27) =     1.60               Prob > F = 0.1859
                  https://twitter.com/Kripfganz



                  • #10
                    Very helpful.



                    • #11
                      Hello, Andrew and Sebastian and Bob,

                      I came across your discussion while looking for answers on how fixed effects models handle subjects with the same values on the dependent variable.

                      I used the data above and tweaked the dependent variable Y so that some subjects have the same Y values (e.g., Y=50 if id==1 | id==2 | id==7). The model results show that the IDs with the same Y values are still KEPT in the analysis. Furthermore, even if I change the Y values to another constant, such as Y=0 if id==1 | id==2 | id==7, the fixed effects model still generates the same coefficients, standard errors, and t values. But when I drop the three IDs from the analysis, the coefficients change.
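
                      A minimal sketch of the kind of check described here, using simulated data rather than the data set above (the seed, the coefficient 0.3, and the variable names are assumptions for illustration):

                      ```stata
                      * Simulated panel: 10 individuals, 5 periods (hypothetical data)
                      clear
                      set seed 811
                      set obs 10
                      generate id = _n
                      expand 5
                      bysort id: generate time = _n
                      generate x1 = rnormal(50, 20)
                      generate y = 0.3*x1 + rnormal(0, 10)
                      xtset id time

                      * Y constant at 50 for ids 1, 2, 7: all 50 observations are used
                      replace y = 50 if inlist(id, 1, 2, 7)
                      xtreg y x1, fe
                      scalar b_50 = _b[x1]
                      scalar n_50 = e(N)

                      * A different constant leaves the slope unchanged
                      replace y = 0 if inlist(id, 1, 2, 7)
                      xtreg y x1, fe
                      scalar b_0 = _b[x1]

                      * Excluding the three ids changes the slope
                      xtreg y x1 if !inlist(id, 1, 2, 7), fe
                      display b_50, b_0, _b[x1]
                      ```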

                      I also tried fixed effects logistic models using xtlogit ..., fe on Y=0 or 1. The model output shows that the IDs with the same Y values (either all 0 or all 1) are dropped from the analysis, and if I exclude these IDs manually, the model generates the same results.

                      I learned that for fixed effects logistic models, the subjects with no variation in the dependent variable are excluded from the analysis because there is no within-subject variability. However, the subjects with the same Y values in fixed effects linear models are kept. If any of you can shed some light on the rationale behind the different handling of the subjects in the two types of fixed effects models, that would be very helpful.
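
                      One way to see the difference, as a sketch on simulated data (the variable names, seed, and coefficient 0.5 are made up): in the conditional (fixed effects) logit, an individual whose outcomes are all 0 or all 1 has a conditional likelihood contribution equal to 1 whatever the coefficients are, so it carries no information and xtlogit, fe drops it. In the linear within estimator, a constant Y only makes the demeaned Y zero; the demeaned regressors are generally nonzero, so those observations still enter the least-squares fit.

                      ```stata
                      * Simulated panel: 40 individuals, 6 periods (hypothetical data)
                      clear
                      set seed 2019
                      set obs 40
                      generate id = _n
                      expand 6
                      bysort id: generate time = _n
                      generate x = rnormal()
                      generate byte y = (0.5*x + rnormal() > 0)
                      replace y = 1 if inlist(id, 1, 2, 7)   // no within-id variation in y
                      xtset id time

                      * xtlogit, fe notes the groups dropped for all-0 or all-1 outcomes
                      xtlogit y x, fe
                      scalar b_all = _b[x]

                      * Excluding those ids by hand gives the same coefficient
                      xtlogit y x if !inlist(id, 1, 2, 7), fe
                      display b_all - _b[x]
                      ```

                      Other ids may also end up with all-0 or all-1 outcomes by chance in the simulation; they are dropped in both runs, so the comparison is unaffected.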



                      • #12
                        I used the data above and tweaked the dependent variable Y so that some subjects have the same Y values (e.g., Y=50 if id==1 | id==2 | id==7). The model results show that the IDs with the same Y values are still KEPT in the analysis.
                        Edited: Can you show the exact commands and output here?
                        Last edited by Andrew Musau; 15 Aug 2019, 02:23.

