  • P-Values and F-statistics changing slightly between iterations of "test"

    Hello,

    I am using Stata 14 w/ 64M memory. Essentially, I am running a joint test of significance and, after running it a few times, have noticed that my p-values and F-statistics change slightly from one iteration to the next.

    For more detail: I'm running a joint regression with a panel of stacked outcomes and then testing coefficients against each other, similar to what's described here. Essentially, I have stacked all my data across all my different outcomes into a single regression, and then created explanatory variables so that every single regression coefficient from all the separate regressions appears in this single regression. I then interact each of the explanatory variables with outcome type dummies, and also include the uninteracted outcome type dummies. This results in a dataset with 300k+ observations.

    The code is something like:

    Code:
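    * Stacked regression (variable lists abbreviated)
    regress y x1 x2 x3 dummies controls interaction terms
    
    * Store F and p immediately after each -test-, before r() is overwritten
    test x1 = x2
    loc F_x1 = `r(F)'
    loc p_x1 = `r(p)'
    
    test x2 = x3
    loc F_x2 = `r(F)'
    loc p_x2 = `r(p)'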


    Then I store these local macros into a table using putexcel (v14). However, I noticed that when I ran this 3-4 times, I got slightly different results for my p-values and the F-stats. For example, on one iteration my F was 1.7588 and on another it was 1.7590. On another my p-value was 0.0694, and the second time I ran it it was 0.0692.

    I have reviewed my code several times and I'm unsure why this would be changing. Is there something changing in the way Stata is storing these values (perhaps it's using up so much memory that Stata changes the way it's rounding)? There are some dummies that are being dropped due to collinearity, and I thought maybe the random variables that got dropped were affecting the p's and F's, but after reading about this it doesn't seem like it should change the outcomes. Any other ideas?

    Thank you!
    Lucia

  • #2
    Well, the usual suspect when Stata appears to be behaving in an irreproducible way is that some calculation depends on sort order, and the sorting leading up to it does not uniquely determine the sort order of the data. For example, suppose that in building the data set you are doing these regressions on, there is some point where you select, within groups defined by a variable g, say, the observation with the largest value of some variable z. Your code might look like:

    Code:
    by g (z), sort: keep if _n == _N
    But suppose that in one or more groups there is a tie for the largest value of z. Then sorting on g and z does not uniquely determine the sort order: any of the tied largest z values could be put in last place. And those observations could have different values for the variables that appear in your regressions.
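
    A minimal sketch of how one might check for, and remove, this kind of tie. It assumes the same g and z as above, plus a variable id that uniquely identifies observations; id is hypothetical and used only for illustration:

```stata
* Report how many observations share the same (g, z) pair;
* any surplus observations indicate ties in z within g
duplicates report g z

* Abort with an error unless id uniquely identifies observations
* (id is a hypothetical identifier, not from the original post)
isid id

* Break ties deterministically by adding id to the sort key
by g (z id), sort: keep if _n == _N
```

    With id in the sort key, the same observation ends up last in each group on every run, so the -keep- always retains the same data.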

    So scrutinize all of the code leading up to the regressions to see if there is something like this. Basically, anything that involves -sort-ing, either explicitly or implicitly, is suspect. If the -sort- key does not determine a unique ordering of all observations, then it may be done differently each time, and everything from that point on will be irreproducible.

    Added: any calculations using random numbers generated by Stata (again either explicitly or implicitly) will be irreproducible if you did not first set the seed.
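
    As a short illustration of setting the seed (the seed value and the variable name u are arbitrary choices made for this sketch):

```stata
* Set the seed once, near the top of the do-file, before any random draws
set seed 20160101

* Later random-number calls now produce the same sequence on every run
generate double u = runiform()
```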

    • #3
      Thanks Clyde Schechter. I tried re-running everything with a seed set at the top of my do file, and it didn't change the outcomes.

      When I tried scaling the dataset down from 424,645 observations to 176,000 and re-ran the same thing, the results still changed from iteration to iteration, but less so: the changes appear further out in the decimal places. This is what makes me think it is a memory thing. Does Stata store values differently when a higher proportion of the memory is being used?

      • #4
        No, Stata doesn't store values differently when memory is getting full. That is not a possible explanation.

        Setting the seed eliminates the possibility of indeterminacy of the random number sequence. But have you looked at every command to see if there is, explicitly or implicitly, an indeterminate sort involved? Remember, there are commands that, internal to their own code, may sort the data according to some key that you pass to them--so it isn't just commands that contain "sort" in their syntax. Most of those "implicit" sort commands will have an option called -by()-. I think this is where you are most likely to find the problem.
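
        One way to narrow down where the indeterminacy enters, offered as a sketch rather than a definitive procedure: save the data at some checkpoint in one run, then compare against that copy at the same checkpoint in the next run. The file name run1_snapshot is hypothetical.

```stata
* Run 1 only: immediately before -regress-, keep a copy of the data
save run1_snapshot, replace

* Run 2 only: at the same point in the do-file, compare with the saved copy.
* cf compares observation by observation, so a different observation
* order or different retained observations show up as mismatches
cf _all using run1_snapshot, verbose
```

        If -cf- reports differences, moving the checkpoint earlier in the do-file until the two runs agree should isolate the command responsible.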
