
  • #31
    I reported the reduced MWE (the one in post #26) to Stata tech support. Tech support replied:
    I reported the issue to the developers and, after reviewing your new example, the conclusion was still the same provided by Enrique in his response to your Statalist posting.
    To recap, Enrique's response was that this instability is desirable behavior because it indicates an identification problem. João says there is no identification problem. I don't see an identification problem, but maybe I don't know what I'm looking for.

    I… I'm stumped. What should I do? Should I just assume that Stata's developers are incorrect, and live with the very minor inconvenience that this issue causes? I'm uncomfortable with that assumption.



    • #32
      I don't see the identification problem either, unless there's something very subtle having to do with the panel effects. But send them Joao's post #28: that's just plain old -regress-, and there is indisputably no identification problem in that model.

      That said, we are talking about variations here that are many orders of magnitude smaller than anything that could possibly matter in the real world. The only worry is if this is somehow the tip of an iceberg that we have yet to crash into.



      • #33
        Clyde:

        I thought João's example was useful because it suggested that -xtreg- might internally use an unstable sort. (Which, indeed, we know to be the case from post #13.) I thought Stata's policy was that sort order could affect exact estimates in a variety of circumstances. In other words, do you think StataCorp will consider João's example to be undesirable behavior?

        Regarding the triviality of the matter: Indeed, this whole episode started based on the very minor inconvenience that my regression results kept getting pointlessly modified in version control because the estimated coefficient waffled between positive and negative zero. That small matter was dwarfed in importance by the implication of misidentification. :-)



        • #34
          I think you have correctly restated Stata's policy: if lack of identification can be unmasked by irreproducibility of result due to unstable sorts, then that is a feature, not a bug. And though some might feel otherwise, I agree with that.

          But I think that the example in #28 of this thread in no way represents a lack of identification. In fact, though I haven't tried it, my guess is that if one wrote out the log-likelihood in closed form, one could solve for the maximum analytically and prove that it is unique and that the likelihood is not flat in its neighborhood. Moreover, -regress- doesn't even have to go through the complications of maximizing a likelihood: the normal equations are easily written down and solved algebraically for this model. The fact that these results are sort-sensitive suggests some other problem is in play.

          Now there are certain types of calculations that, due to finite precision arithmetic, are sort-order dependent. Taking sums of numbers on wildly divergent scales is one of those--which could be a problem in estimating regressions. But, again, the example in #28 does not appear to present any such challenges. So I'm not persuaded that this matter is yet closed, and I would like to hear more from Stata Corp. In particular, if they still see this as an unidentified model problem, I'd like to know why.
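[Editor's note: the sort-order dependence of floating-point sums that Clyde describes is easy to demonstrate outside Stata. The sketch below uses Python with made-up values on wildly divergent scales; it is only an illustration of the rounding effect, not Stata's actual accumulation code.]

```python
# Floating-point addition is not associative, so the same numbers summed
# in a different order can round differently. These values are made up
# to sit on wildly divergent scales.
vals = [1e16, 1.0, -1e16, 1.0]

s1 = 0.0
for v in vals:            # original order
    s1 += v

s2 = 0.0
for v in sorted(vals):    # ascending order
    s2 += v

print(s1, s2)             # the two sums differ
```

In the original order the 1.0 is absorbed by 1e16 (whose spacing between representable doubles is 2), so the sums come out different depending solely on the order of the observations.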



          • #35
            I'd also like to hear more. It shows up with regress back at least to Stata 8 (I checked Stata 8, 10, 11 and 12, all IC running on Windows 7). FWIW, it doesn't show up when I do the estimation by hand in Mata. Here's Joao's example in #28 but using %21x format (ideal for this) and compared with the Mata calculation.

            Code:
            clear
            set seed 123
            input byte x y
            0 1
            0 0
            0 0
            1 1
            1 0
            1 0
            end
            forval i = 1/10 {
                generate u=runiform()
                sort u
                drop u
                qui reg y x
                qui putmata y x=(x 1), replace
                mata: b=invsym(x'x)*x'y
                mata: st_matrix("b",b)
                display %21x _b[x] _col(30) %21x b[1,1]
            }
            Output:

            Code:
            +1.5555555555557X-036        +0.0000000000000X-3ff
            +0.0000000000000X-3ff        +0.0000000000000X-3ff
            +0.0000000000000X-3ff        +0.0000000000000X-3ff
            +0.0000000000000X-3ff        +0.0000000000000X-3ff
            +0.0000000000000X-3ff        +0.0000000000000X-3ff
            -1.5555555555557X-038        +0.0000000000000X-3ff
            +1.5555555555557X-038        +0.0000000000000X-3ff
            +0.0000000000000X-3ff        +0.0000000000000X-3ff
            -1.5555555555557X-037        +0.0000000000000X-3ff
            -1.5555555555557X-038        +0.0000000000000X-3ff



            • #36
              Dear All,

              Sorry for the long post, but I think we are dealing with a few different issues here.
              a) The identification problem: In example #28 there is no identification issue for sure, and I think the same is true of most of Nils's examples. So, I guess we can put that to rest. Also, even if the instability were a way of highlighting identification issues, it would be of very limited use: we generally do not run each regression 20 times, and even then we would only be able to see something if at least one of the coefficients is very close to zero.
              b) The importance of the problem: Indeed, it appears that the problem is very minor, but that is not necessarily always the case. Changes of this order of magnitude may affect, for example, the decision of whether or not a regressor should be dropped due to collinearity. Anyway, as a matter of principle, I think we should be able to reproduce our results, and in that sense the magnitude of the problem is not relevant.
              c) Precision: As Clyde mentioned, we should expect some very minor changes in the results if we change the order of the observations, or even the order in which the regressors are included in the model (but we should get the same results if we do exactly the same thing twice!). Mark's example, however, shows that Stata is much more sensitive to this issue than Mata, and that is somewhat surprising. I would like to have more information about why this happens.
              In summary, I think the set of examples presented here show that there is a problem with -xtreg- possibly due to an unstable sort, and that Stata and Mata have different levels of sensitivity to rounding errors. I would love to hear from Stata on these issues.
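[Editor's note: the stable-vs-unstable distinction invoked here is worth pinning down. A stable sort keeps tied observations in their incoming order; an unstable sort may return ties in any order, which is how sort order can leak into later calculations. A quick illustration outside Stata follows; Python's sorted() is stable, so the unstable behavior is simulated by randomizing tie-breaks.]

```python
import random

# Three records tied on the sort key.
records = [('a', 1), ('b', 1), ('c', 1)]

# Stable sort: ties keep their input order, every time.
stable = sorted(records, key=lambda r: r[1])

# "Unstable" sort (simulated): ties broken at random, so repeated runs
# can hand downstream code the same data in different orders.
unstable = sorted(records, key=lambda r: (r[1], random.random()))

print(stable)   # always [('a', 1), ('b', 1), ('c', 1)]
```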

              All the best,

              Joao



              • #37
                Stata - or better: regress - does not calculate a linear regression the way Mark did in Mata. Bill Gould (2010, p.128ff.) talks about how Stata calculates linear regression and why it does so. There was also a related discussion on old Statalist, but I cannot find it.

                I agree that it would be nice to read more from StataCorp on this.

                Best
                Daniel



                • #38
                  On second thought ... my original Mata example perhaps wasn't ideal because it recreates the data in Mata each time. Better is to use a Mata view that is created once before all the sorts take place. Then the only thing that is changing for both the Stata and Mata estimations is just the order of the data. Turns out not to make a difference: the Stata results change very slightly with the sort but the Mata results do not. Below is output using Stata 13.1 IC running on Windows 7.

                  Code:
                  clear
                  set seed 123
                  input byte x y one
                  0 1 1
                  0 0 1
                  0 0 1
                  1 1 1
                  1 0 1
                  1 0 1
                  end
                  mata: st_view(x, ., "x one")
                  mata: st_view(y, ., "y")
                  forval i = 1/10 {
                      generate u=runiform()
                      sort u
                      drop u
                      /* list y x one */  //  <-remove to see Stata data used
                      /* mata: y, x   */  //  <-remove to see Mata data used
                      qui reg y x
                      mata: b=invsym(x'x)*x'y
                      mata: st_matrix("b",b)
                      display %21x _b[x] _col(30) %21x b[1,1]
                  }
                  Output:
                  Code:
                  +1.5555555555557X-036        +0.0000000000000X-3ff
                  +0.0000000000000X-3ff        +0.0000000000000X-3ff
                  +0.0000000000000X-3ff        +0.0000000000000X-3ff
                  +0.0000000000000X-3ff        +0.0000000000000X-3ff
                  +0.0000000000000X-3ff        +0.0000000000000X-3ff
                  -1.5555555555557X-038        +0.0000000000000X-3ff
                  +1.5555555555557X-038        +0.0000000000000X-3ff
                  +0.0000000000000X-3ff        +0.0000000000000X-3ff
                  -1.5555555555557X-037        +0.0000000000000X-3ff
                  -1.5555555555557X-038        +0.0000000000000X-3ff



                  • #39
                    I think regress does not calculate linear regression models the way Mark does it in Mata. Bill Gould (2010, pp. 128ff.) talks about this in the Stata Journal. There was also a discussion on how regress actually fits models on old Statalist, but I cannot find it. Anyway, the differences between Stata and Mata might have to do with the way regress handles the calculation.

                    Best
                    Daniel


                    Gould, William (2010). Mata Matters: Mata in Stata. The Stata Journal, 10(1), pp.125-142.
                    Last edited by daniel klein; 25 Feb 2015, 08:49.



                    • #40
                      That's right, I'm sure regress does not use the naive and brute-force method in my Mata example. It's a bit interesting that even though what regress does will normally make for a more precise answer, in this case it doesn't. But the more interesting bit is the relationship between the numerical answer from regress and the sort.



                      • #41
                        I've had to reread through this interesting thread a few times to make
                        sure I understand all the issues discussed. Here is what I think is
                        going on:

                        1. Nils posted an example (which he labeled WME) where xtreg, fe was
                        producing different coefficient estimates for a predictor, some were
                        positive while others were negative. The different values are
                        essentially zero (within double precision) in absolute value.

                        2. Nils sent a different dataset to Tech Support, in which we determined that
                        one of the model coefficients was not identified. Because of this
                        identification problem, repeated runs of xtreg, fe was giving slightly
                        different results and the "fitted" values were not essentially zero
                        in this case.

                        Enrique showed a way for Nils to check for this kind of lack of
                        identification.

                        3. Back to the WME dataset, it is clear that, upon further whittling of the
                        data, there is no problem with identification. The coefficient is
                        identified, and we can agree that the coefficient estimate is
                        essentially zero.

                        The real problem here is that we are using regress to fit a model
                        with 0's and 1's in the response and predictor variables. Given the
                        pattern in this dataset, we can see that the answer is exactly zero,
                        but we are essentially using infinite precision math in our heads as
                        we arrive at this conclusion.
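[Editor's note: the "infinite precision math in our heads" can be checked mechanically. The sketch below redoes the simple-regression slope for the #28 data in Python with exact rational arithmetic; the variable names are the editor's, not from any code in the thread.]

```python
from fractions import Fraction

# The six observations from #28: x is the 0/1 regressor, y the response.
x = [0, 0, 0, 1, 1, 1]
y = [1, 0, 0, 1, 0, 0]

xbar = Fraction(sum(x), len(x))    # 1/2
ybar = Fraction(sum(y), len(y))    # 1/3

# Slope of the simple regression: sum((x-xbar)(y-ybar)) / sum((x-xbar)^2),
# computed with exact rationals, so no rounding occurs anywhere.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

slope = sxy / sxx
print(slope)   # 0: exactly zero in exact arithmetic
```

In exact arithmetic the cross-product term cancels to zero, so the coefficient is exactly zero; anything on the order of 1e-17 in the double-precision answer is pure rounding noise.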

                        4. The apparent worry here is that regress is sensitive to sort order. In
                        this very special case it appears that regress is giving us weird
                        results. I propose that we are all just looking too hard at the sign and
                        mantissa of the estimated coefficient and ignoring the fact that
                        there are over 15 zeros after the decimal point before we get to
                        the non-zero digits of the mantissa.

                        Here is a version of Mark's example where Mata matches the regress
                        results:

                        Code:
                         clear
                         set seed 123
                        input byte x y one
                        0 1 1
                        0 0 1
                        0 0 1
                        1 1 1
                        1 0 1
                        1 0 1
                        end
                        mata: st_view(x=., ., "x one")
                        mata: st_view(y=., ., "y")
                        forval i = 1/10 {
                            generate u=runiform()
                            sort u
                            drop u
                            qui reg y x
                            mata: xmean = mean(x)
                            mata: ymean = mean(y)
                            mata: b=invsym(crossdev(x,xmean,x,xmean))*crossdev(x,xmean,y,ymean)
                            mata: st_matrix("b",b)
                            display %21x _b[x] _col(25) %21x b[1,1] _col(50) %21.12g b[1,1]
                        }
                        Here are the results

                        Code:
                        +1.5555555555557X-036   +1.5555555555555X-036        7.40148683083e-17
                        +0.0000000000000X-3ff   +0.0000000000000X-3ff                        0
                        +0.0000000000000X-3ff   +0.0000000000000X-3ff                        0
                        +0.0000000000000X-3ff   +0.0000000000000X-3ff                        0
                        +0.0000000000000X-3ff   +0.0000000000000X-3ff                        0
                        -1.5555555555557X-038   -1.5555555555555X-038       -1.85037170771e-17
                        +1.5555555555557X-038   +1.5555555555555X-038        1.85037170771e-17
                        +0.0000000000000X-3ff   +0.0000000000000X-3ff                        0
                        -1.5555555555557X-037   -1.5555555555555X-037       -3.70074341542e-17
                        -1.5555555555557X-038   -1.5555555555555X-038       -1.85037170771e-17
                        Be reassured by this little experiment. regress is producing results
                        that are well within double precision tolerances.
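[Editor's note: to see just how small these disagreements are, recall that in Stata's %21x format both the mantissa and the power-of-two exponent are written in hexadecimal, so +1.5555555555557X-036 denotes 0x1.5555555555557 × 2^-54 (0x36 = 54). The conversion below uses Python's hex-float parser and is the editor's, not part of Jeff's code.]

```python
# Stata's %21x shows a hex mantissa and a *hex* power-of-two exponent:
# +1.5555555555557X-036 corresponds to the C hex-float 0x1.5555555555557p-54.
v = float.fromhex('0x1.5555555555557p-54')
print(v)   # about 7.4e-17, matching the %21.12g column in the output above
```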



                        • #42
                          1. Nils posted an example (which he labeled WME) where xtreg, fe was producing different coefficient estimates for a predictor, some were positive while others were negative. The different values are essentially zero (within double precision) in absolute value.
                          Correct. MWE means minimal working example. (As it turned out, it could be reduced even more.)
                          2. Nils sent a different dataset to Tech Support, in which we determined that one of the model coefficients was not identified. Because of this identification problem, repeated runs of xtreg, fe was giving slightly different results and the "fitted" values were not essentially zero in this case.
                          Sort of. My initial email to Tech Support referenced the original example. At a later date, I sent a second example, explicitly cautioning “I have another issue that has similar symptoms in some ways, and different symptoms in other ways. I’m not sure if the cause is similar.” Tech Support decided that they were distinct issues, then changed their mind and decided they were the same issue, then responded regarding the second example alone. I am happy to leave this second example in the past. I am convinced that it is misidentified.

                          As for the rest, I am relieved to hear that we are in agreement that there is no misidentification, and that the appearance of this phenomenon is not confined to misidentified models. I will learn to live with the random appearance and disappearance of negative signs on my coefficient, and with the consequent minor impairment of automated reproducibility. João's comments in part b of post #36 explain why we might care about this level of inaccuracy, but I will not be the one to press the case.

                          Thank you Jeff. And thanks to Ben, Daniel, Clyde, Enrique, Mark, and João. I appreciate your help, guys.



                          • #43
                            Dear Jeff,

                            Thank you very much for looking into this; it is reassuring to know that Stata cares. However, I think you did not address the key issue: xtreg does not produce consistent results. The view we have formed here is that this is the result of it not using a stable sort. It would be great if this issue could be sorted out.

                            Many thanks again,

                            Joao



                            • #44
                              So regress can generate slightly different results depending on the sort order of the data in memory (due to numerical accuracy issues) BUT it will always generate exactly the same results if the sort order is the same. This however is not the case for xtreg.

                              I submit that this is not acceptable because the non-deterministic outcome of xtreg undermines the whole point of using insufficiently specified sorts to identify logical errors in data management. I have had to personally argue with researchers who were absolutely convinced that it is normal to see small changes in results at every run because of numerical accuracy. Invariably, there were logical flaws in their code and once fixed, results replicated exactly.

                              May I point out the first line of the FAQ "Why does my do-file or ado-file produce different results every time I run it?":

                              There is really no general answer to this question other than your program has an error in it
                              Since it is argued that the small changes in xtreg are not due to an error, I believe this is a situation where it is acceptable to make xtreg deterministic using set sortseed.

                              Code:
                              version 13.1
                              
                              set seed 155575 // Makes no difference.
                              
                              clear
                              
                              input x y z
                              0 0 0
                              0 0 0
                              0 1 0
                              0 0 1
                              1 0 0
                              1 0 0
                              1 1 0
                              end
                              
                              gen obs = _n
                              
                              forval i = 1/20 {
                                  set sortseed 12345 // Pick any seed
                                  qui xtreg y x, fe i(z)
                                  display _b[x]
                              }
                              Full disclosure, I'm the author of -project- (available from SSC), a program to manage one's workflow in Stata with a particular emphasis on replication. I wrote -project- as a way to avoid a repeat of an experience where I had to spend weeks fixing someone else's code because the author noticed (too late) that his code could not replicate exactly the tables in an article that had already gone to press. With -project-, you can do replication builds where every single thing that is generated by the project (datasets, log files, tables, figures, etc.) is checked for differences. This makes it very easy to spot logical errors in the code. Of course the functionality is lost if some Stata commands are non-deterministic.
                              Last edited by Robert Picard; 26 Feb 2015, 15:46.

