
  • #16
    Thanks Nils. Sorry to keep asking questions - just one more - will the update be to xtreg or to a factor-variables utility that xtreg calls? I'm guessing that it's the former but just want to make sure.



    • #17
      I don't know. I was also guessing the former. If you are interested, you could modify all the -sort- commands in xtreg-fe.ado to be -sort, stable-, and see if you can replicate the issue with the code I gave above.



      • #18
        We have looked into the problem Nils presented and we believe it to be of great
        interest because it helps us think about identification.

        The original problem was that coefficient values varied when -xtreg, fe- was
        estimated repeatedly for the same model on the same dataset. The crux of the
        issue is that, because the model has an identification problem, the computations were
        sensitive to the within-group sort that is performed by -xtreg, fe-. This sort, like any
        sort that is not stable, may order tied observations differently each time it is run.
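
        As an illustration of what an unstable sort does, here is a minimal sketch on a
        made-up four-observation dataset. The variables z and orig are hypothetical and
        are not taken from the data discussed in this thread. Without the stable option,
        -sort- may return observations with tied values of z in a different relative order
        each time; with -sort, stable- the tied observations keep their original relative order.

        Code:
        clear
        input z
        0
        0
        0
        1
        end
        generate long orig = _n      // remember the original row order
        sort z                       // tied values of z may come back in any order
        list z orig, noobs
        sort orig                    // restore the original row order
        sort z, stable               // ties keep their original relative order
        list z orig, noobs

        Running the first -sort- and -list- pair several times may show the three z = 0
        rows in different orders, whereas the stable version should always list orig as 1, 2, 3, 4.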

        The first reaction might be to use a stable sort and force the results to match
        each time. This is dangerous because it hides the problem
        revealed by the unstable sort.

        A simple way to see the identification problem is to run -regress, vce(cluster
        id)-, which is a way to explore the group structure of your data and the stability of
        your model before running any panel data model.

        Code:
         
        . regress y i.post##i.treatment##i.(a b) if (c == 0), vce(cluster z) vsquish
        
        Linear regression                                  Number of obs   =     1,047
                                                           F(7, 73)        =         .
                                                           Prob > F        =         .
                                                           R-squared       =    0.8149
                                                           Root MSE        =    .20541
        
                                             (Std. Err. adjusted for 74 clusters in z)
        ------------------------------------------------------------------------------
                     |               Robust
                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              1.post |     -.0625   .0512466    -1.22   0.227    -.1646343    .0396343
         1.treatment |   1.98e-15          .        .       .            .           .
                post#|
           treatment |
                1 1  |   .0159884   .0575836     0.28   0.782    -.0987756    .1307524
                 1.a |   3.42e-15          .        .       .            .           .
                 1.b |         -1          .        .       .            .           .
              post#a |
                1 1  |   .0208333   .0588408     0.35   0.724    -.0964363     .138103
              post#b |
                1 1  |   .1354167   .0429379     3.15   0.002     .0498416    .2209917
         treatment#a |
                1 1  |  -3.01e-15   1.47e-08    -0.00   1.000    -2.92e-08    2.92e-08
         treatment#b |
                1 1  |   6.16e-15   4.88e-09     0.00   1.000    -9.73e-09    9.73e-09
                post#|
         treatment#a |
              1 1 1  |  -.0497185   .0637016    -0.78   0.438    -.1766757    .0772386
                post#|
         treatment#b |
              1 1 1  |   .0114087   .0559661     0.20   0.839    -.1001316    .1229491
               _cons |          1   1.05e-08  9.5e+07   0.000            1           1
        ------------------------------------------------------------------------------
        These results point to the previously mentioned identification problem. There are missing values for
        some standard errors, the standard error for the constant is almost zero, and a
        subset of the coefficients is, for all practical purposes, zero.

        In conclusion, we prefer to expose the instability of the model because it will
        cause researchers to think about their model specification and identification.
        A model whose coefficients change slightly every time it is run almost
        certainly has identification problems. The fact that changing the sort order
        within group modified the results, even slightly, is a red flag.




        • #19
          This may be confusing for other parties, so let me explain some background.

          The dataset and model posted by Enrique are not the dataset and model I posted in my comment to Statalist. I provided them to tech support because I noticed another unstable coefficient that "has similar symptoms in some ways, and different symptoms in other ways." Specifically, the coefficient on 1.post (-0.0625 exactly) waffles between -0.063 and -0.062 when it is formatted to three decimal places. I said that "I’m not sure if the cause is similar", but I wanted Stata tech support to be aware of it in case it was related to the issue I reported in this thread. It was the only example I had observed of an unstable but effectively non-zero coefficient.

          Tech support's initial response was that these were two distinct issues. Then they changed their mind, and reported that they were actually the same issue, and would both be fixed with the modified -xtreg, fe-. Then Enrique posted here with this latest wonderful analysis of the secondary MWE I provided to Stata tech support, finding that the unstable coefficient is not actually undesirable behavior in the first place. Thank you, Enrique.

          But I am confused by the analysis, and I wonder if we could discuss it using the MWE that I posted to this thread. In an attempt to find the identification problem in this MWE, I use Enrique's suggested technique:

          Code:
          . regress y i.post##i.treatment, vce(cluster z) vsquish
          
          Linear regression                                      Number of obs =    1047
                                                                 F(  3,    73) =    0.68
                                                                 Prob > F      =  0.5670
                                                                 R-squared     =  0.0005
                                                                 Root MSE      =  .47547
          
                                                 (Std. Err. adjusted for 74 clusters in z)
          --------------------------------------------------------------------------------
                         |               Robust
                       y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          ---------------+----------------------------------------------------------------
                  1.post |   1.18e-15   .0166065     0.00   1.000    -.0330966    .0330966
             1.treatment |   .0013175   .0740744     0.02   0.986    -.1463126    .1489477
          post#treatment |
                    1 1  |  -.0217391   .0225636    -0.96   0.338    -.0667083    .0232301
                   _cons |   .6666667   .0618391    10.78   0.000     .5434216    .7899117
          --------------------------------------------------------------------------------
          All standard errors are calculable and non-zero. One coefficient is practically zero. (It's the "unstable" coefficient.) Is this evidence of an identification error? It seems, to my admittedly inexperienced eye, to be a coincidence. I would appreciate help understanding what might have gone wrong here.

          In case it is helpful to quickly get a sense of the data, here are the results of a tabulation:

          Code:
          . tab post treat, su(y)
          
                        Means, Standard Deviations and Frequencies of y
          
                     |      treatment
                post |         0          1 |     Total
          -----------+----------------------+----------
                   0 | .66666667  .66798419 | .66762178
                     |  .4738791  .47187011 | .47174208
                     |        96        253 |       349
          -----------+----------------------+----------
                   1 | .66666667  .64624506 | .65186246
                     | .47263695  .47860744 |  .4767215
                     |       192        506 |       698
          -----------+----------------------+----------
               Total | .66666667  .65349144 | .65711557
                     | .47222507  .47617131 |  .4749001
                     |       288        759 |      1047
          And here is that command repeated on a (nonrandom) sample of about half the clusters, to demonstrate that y does change in the post period in the control group for some observations:

          Code:
          . tab post treat if z < 40, su(y)
          
                        Means, Standard Deviations and Frequencies of y
          
                     |      treatment
                post |         0          1 |     Total
          -----------+----------------------+----------
                   0 | .65384615  .63013699 | .63636364
                     | .48038446  .48442926 | .48226508
                     |        52        146 |       198
          -----------+----------------------+----------
                   1 | .66346154   .6130137 | .62626263
                     | .47481375  .48789663 | .48440716
                     |       104        292 |       396
          -----------+----------------------+----------
               Total | .66025641  .61872146 | .62962963
                     | .47514745  .48625615 | .48331088
                     |       156        438 |       594
          Last edited by Nils Enevoldsen; 23 Jan 2015, 13:18.



          • #20
            Hi Nils,

            The part of the code that affects the result is a sort within groups. If you change the sort to be stable in the code, your problem "disappears".
            Technical support asked the development group to look at this. We discussed the problem, looked at the data and your model, and decided
            not to change the sort to be a stable sort. This decision is independent of
            which version of the data you provided. The point is that if the coefficient changes when you run the model repeatedly, it suggests that you have an identification
            problem. We prefer to expose rather than hide the instability of the model. In the particular sample you provided it was crystal clear. In this second
            sample, the fact that you do not have missing values in your standard errors for the diagnostic I suggested does not make things OK.




            • #21
              Point of clarification: by "this second sample", I think Enrique is referring to the first MWE that was posted to this thread, "reprodata.txt".

              I understand that things may not be OK in this sample. I seek help diagnosing what exactly the problem might be. Figuring out an identification problem is not the responsibility of StataCorp, which is why I am asking for assistance from Statalist.

              BTW, I should mention that treatment is assigned at the z level, so the coefficient on 1.treatment is meaningless. The model could be estimated with:
              Code:
              xtreg y i.post i.post#i.treatment, fe i(z)
              Last edited by Nils Enevoldsen; 23 Jan 2015, 14:25.



              • #22
                Very late to come with an opinion on the point raised by Enrique, but here it is anyway:

                If there's a tradeoff between "replicability" and "diagnostic evidence for a user", I would go for replicability, every time. Hence I would argue strongly in favour of making the sort stable in official Stata code.



                • #23
                  This is tough to judge. I like to think about replicability in two different ways here.

                  First, we can replicate the substantive result of the analysis. I would argue that the substantive result of such an analysis is that there is an identification problem and hence no stable answer. From this point of view, obtaining different values each time the code is run is a perfect replication of that result. We simply cannot give one answer.

                  Second, we can replicate the technical results (note the plural). The fact that we obtain different values without a stable sort, but the same value with one, allows us to replicate the technical results and trace the source of the problem.

                  So from my point of view there is no real tradeoff here, and I tend to agree with Enrique (and StataCorp) that masking the problem is probably not a good idea, as this might (mis)lead us to give and replicate one substantive answer when there really is none, or at least none that is properly identified.

                  Best
                  Daniel
                  Last edited by daniel klein; 20 Feb 2015, 07:09.



                  • #24
                    FWIW, I have yet to identify an identification problem in the data and code I posted. It seems very straightforward to me. I run lots of regressions of this type. If they are flawed, I probably have bigger problems to tackle. I would appreciate advice on the matter.

                    I have simplified the testcase further:

                    Code:
                    version 13.1
                    
                    set seed 155575 // Makes no difference.
                    
                    capture program drop display_xtreg_coef
                    program define display_xtreg_coef
                    forval i = 1/10 {
                        qui xtreg y x, fe i(z)
                        display _b[x]
                    }
                    end
                    
                    insheet using ~/Desktop/reprodata2.txt, clear
                    display_xtreg_coef
                    sort z
                    display_xtreg_coef
                    Attached Files



                    • #25
                      Dear All,

                      I only had a quick look at this but, like Nils, I do not see the identification problem here. If, for example, we replace y with y+x there is no instability anymore; all we see are very good estimates of the number 1. So, I tend to think that this is a stability issue revealed by the fact that the estimate we are after is zero.

                      It would be great if someone could give more information about the identification interpretation of the problem.

                      All the best,

                      Joao



                      • #26
                        I've whittled the dataset even more. These seven observations, in this order, yield unstable estimates.

                        Code:
                        version 13.1
                        
                        set seed 155575 // Makes no difference.
                        
                        clear
                        
                        input x y z
                        0 0 0
                        0 0 0
                        0 1 0
                        0 0 1
                        1 0 0
                        1 0 0
                        1 1 0
                        end
                        
                        forval i = 1/20 {
                            qui xtreg y x, fe i(z)
                            display _b[x]
                        }
                        If the single observation in cluster z=1 looks suspicious, feel free to flesh it out with additional observations, e.g. {{1 0 1}, {0 1 1}, {1 1 1}}.



                        • #27
                          Thanks for this, Nils.

                          With these data, -xtreg y x, fe i(z)- is equivalent to -reg y x if z==0-, and in this case there is no instability problem. This example also makes clear that the slope is identified and equal to zero, so there is no identification problem.

                          All the best,

                          Joao



                          • #28
                            Follow up; here is a different way to generate the same problem in a simpler context.
                            Code:
                            clear
                            set seed 123
                            input x y
                            0 1
                            0 0
                            0 0
                            1 1
                            1 0
                            1 0
                            end
                            forval i = 1/10 {
                                generate u=runiform()
                                sort u
                                drop u
                                qui reg y x
                                display _b[x]
                            }
                            Clearly, -xtreg- is doing something similar to what is done with -sort- in the example above.

                            Joao



                            • #29
                              Interestingly, if you insert -replace y = x + y- after the -end- and before the -forvalues- command, you create a regression where _b[x] should be 1. But if you change the -display- to -display %21x _b[x]-, you will see that this isn't exactly what you get:

                              Code:
                              . replace y = x + y
                              (3 real changes made)
                              
                              . forval i = 1/10 {
                                2.     generate u=runiform()
                                3.     sort u
                                4.     drop u
                                5.     qui reg y x
                                6.     display %21x _b[x]
                                7. }
                              +1.0000000000000X+000
                              +1.0000000000001X+000
                              +1.0000000000001X+000
                              +1.0000000000001X+000
                              +1.0000000000001X+000
                              +1.0000000000001X+000
                              +1.0000000000001X+000
                              +1.0000000000001X+000
                              +1.0000000000001X+000
                              +1.0000000000001X+000
                              So it is not exactly 1 (though clearly the difference is negligibly small), and it does vary. And this phenomenon is not, as previously hypothesized by Joao, confined to estimating coefficients that are zero.
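
                              One possible mechanism, offered as a minimal sketch rather than a description of what -regress- does internally: floating-point addition in double precision is not associative, so adding the same values in a different order can change the last bits of the sum. The values 0.1, 0.2, and 0.3 and the scalar names sum_lr and sum_rl are illustrative assumptions, not taken from the data above.

                              Code:
                              scalar sum_lr = (0.1 + 0.2) + 0.3    // add left to right
                              scalar sum_rl = 0.1 + (0.2 + 0.3)    // same values, different grouping
                              display %21x sum_lr
                              display %21x sum_rl
                              display sum_lr == sum_rl             // 0 means the two sums differ

                              If the two hexadecimal values differ in their final digit, that would be consistent with coefficients that change in the last bits when the observations are reordered before sums and cross products are accumulated.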




                              • #30
                                Thanks for this, Clyde. I may not have been very clear, but my point is that (unless we change the display as you did) we only see the problem when the coefficient is close to zero, but the problem is always there.

                                All the best,

                                Joao
