Different Results every time I run the regression

Roger More

Join Date: Jul 2017

Posts: 59
#16

09 Dec 2018, 10:40

Thank you very much. Indeed, the bug was with my Stata 14.2. I also tried it with Stata 15 at my university and I get the same results.

Thanks again for all this altruism!

Roger More

Join Date: Jul 2017
Posts: 59

#17

24 Jan 2019, 04:59

Dear all,

The problem I originally had still persists. I have now purchased Stata 15.1, as it was pointed out that the problem is user-specific and perhaps Stata-specific. It seems that when I run the OLS and IV regressions together, I keep getting different coefficient estimates and p-values for the IV regressions.
The specific code I run is as follows:

Code:

*****************************SETTING GLOBALS************************************
cd "F:\Judge Selection Reform New"   //Enter your own path where the Files folder is saved
use ".\Input\CaseYearDataWithJudgeWithShrines.dta", clear


*Globals for case characteristics, judge characteristics and district characteristics
*Note: $controls_judge is used below but not defined in this excerpt; an undefined global expands to nothing
global controls_case constitutional criminal land_case pagesjudgenum benchchiefjustice lawyer_number judge_number
global controls_district Area_sqkm Population Density_persqkm


regress StateWins NewJudges_TotalJudges i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_judge $controls_district, vce(cluster district_bench)

****Table 1: Summary Statistics****
sum StateWins Merit caselag $controls_case $controls_judge $controls_district if e(sample) ==1



****Table 1: Summary Statistics****
set more off 
regress StateWins NewJudges_TotalJudges i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_judge $controls_district, vce(cluster district_bench)
regress caselag NewJudges_TotalJudges i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_judge $controls_district if e(sample) == 1, vce(cluster district_bench)
sum StateWins  caselag yeardecision  yearfiled   $controls_case  $controls_district  if e(sample) == 1
sum StateWins  caselag yeardecision  yearfiled   $controls_case  $controls_district 


****Table 2: OLS and IV case lag ****
*New Judges and StateWins but first to specify sample

cd "F:\Judge Selection Reform New\Output\Tables"
regress StateWins NewJudges_TotalJudges if e(sample) ==1, vce(cluster district_bench)  

regress StateWins NewJudges_TotalJudges i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_district, vce(cluster district_bench)  

ivregress 2sls StateWins (NewJudges_TotalJudges= ProbRetirement)  if e(sample) ==1, vce(cluster district_bench)  

ivregress 2sls StateWins (NewJudges_TotalJudges= ProbRetirement) i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_district, vce(cluster district_bench)

The example of raw data is as follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte StateWins double(NewJudges_TotalJudges ProbRetirement) str5 bench int yeardecision byte(constitutional criminal land_case pagesjudgenum benchchiefjustice lawyer_number judge_number) int Area_sqkm double(Population Density_persqkm)
0                  0                  0 "karhc" 2008 1 0 0 10 0 2 1 1259 9564287.211 7596.733289
1                  0                  0 "karhc" 1991 1 0 0 34 0 2 2 1259 5746250.053 4564.138247
0                  0                  0 "larhc" 2000 0 1 0  1 0 3 2 1259 7767563.842  6169.62974
0  .5333333333333333                 .6 "larhc" 2016 1 0 1  1 0 2 1 1259 11361010.58 9023.836838
0 .06666666666666667 .06666666666666667 "kashc" 2010 1 0 1  2 0 4 2 1259 10013468.05 7953.509176
1                  0                  0 "kashc" 1986 0 1 0 13 0 3 3 1259 4623297.947 3672.198528
1 .06666666666666667 .06666666666666667 "hydhc" 2010 1 0 1  2 0 3 1 1259 10013468.05 7953.509176
1                  0                  0 "karhc" 2004 1 0 1  3 0 2 2 1259 8665925.526 6883.181514
1 .06666666666666667 .06666666666666667 "hydhc" 2010 1 0 1  1 0 3 1 1259 10013468.05 7953.509176
0  .4666666666666667  .5333333333333333 "phc" 2015 1 0 1  2 0 2 2 1259 11136420.16 8845.448894
0 .06666666666666667 .06666666666666667 "phc" 2010 1 0 1  2 0 4 2 1259 10013468.05 7953.509176
0 .26666666666666666  .3333333333333333 "karhc" 2012 1 0 0  1 0 3 2 1259 10462648.89 8310.285063
1                  0                  0 "karhc" 2007 1 0 1  4 0 2 2 1259 9339696.789 7418.345345
end

Thank you in any case. If anyone can help with this problem, it would be great since, as you can see, I have been having this problem for a while!

Cheers!


Roger More

Join Date: Jul 2017

Posts: 59
#18

26 Jan 2019, 02:49

To close out this thread, I want to thank StataCorp, Andrew, and all here for the help. I wrote to Stata technical services and they explained the cause as follows: "When the -vce(cluster option)- is specified, -ivregress- must perform data sorting operations. When data is sorted, ties are randomly broken; and, if the particular data and model specification generates numerical unstable computations due to high correlation, you can observe different results every time you run the regression. If you set a sortseed initial number at the top of your do-file, you will get the same results every time you run your do-file."
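
The advice quoted above could look roughly like the following at the top of a do-file. This is only a sketch: the seed values are arbitrary placeholders and the toy data are hypothetical, not the dataset from this thread. It also shows an alternative that does not rely on a seed, namely sorting on a key that uniquely identifies observations so that no ties are left to break.

```stata
* Sketch only: seed values are arbitrary placeholders, data are made up
version 15.1
clear all
set sortseed 20190126   // fixes how -sort- breaks ties among equal keys
set seed 20190126       // fixes random-number functions such as runiform()

* Toy data with tied sort keys: 3 groups, 4 observations each
set obs 12
generate long id = _n
generate byte grp = mod(_n, 3)

* With sortseed fixed, the within-group order should repeat across runs
sort grp
list grp id, sepby(grp)

* Alternative that needs no seed: sort on a key with no ties
sort grp id
isid grp id             // errors out if grp id is not a unique key
```

Sorting on a unique key is the more defensive choice when such a key exists, because the result then does not depend on any seed setting.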

Cheers! And thanks all!
James Valentine

Join Date: Dec 2019

Posts: 6
#19

14 Oct 2020, 07:48

I'm experiencing the same problem as Roger in that my output differs slightly each time I run my script. I added a "set sortseed" line at the beginning of the script over a month ago when I first noticed the differences. My script is over 4,000 lines and the databases are over 50GB, so I can't easily post them here. At the stage where I get my first set of descriptive statistics, there can be up to a 0.1% difference in the means and medians, but that's 2,000 lines and 10 hours into the running of the do-file. Could it be due to the number of calculations taking place, in that there may be some randomness in the rounding of millions of numbers?
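
On the rounding question, a small illustration may help; it is not a diagnosis of this particular script. Floating-point rounding is deterministic rather than random, but addition is not associative, so the same numbers accumulated in a different order (for example, after a sort that broke ties differently) can give slightly different totals. The values below are arbitrary.

```stata
* Illustration only: arbitrary values, unrelated to the data in this thread
display %21.18f (0.1 + 0.2) + 0.3
display %21.18f 0.1 + (0.2 + 0.3)

* The two groupings are not equal in double precision, so this shows 0
display (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
```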

Is there a good method to identify the cause of the differences? I was thinking about creating a summary table at each stage of the script and comparing the two runs, but that will likely take the better part of a day (and 12-15 hours to run the code each time). Are there specific commands that are prone to cause differences in the output? I assume I resolved issues with sorting by adding the "set sortseed" to the script.
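
One possible way to do the stage-by-stage comparison described above, sketched under assumptions: the program name checkpoint, the stage label, and the toy variables are all hypothetical. The idea is to print a data signature at chosen points so that the logs of two runs can be compared line by line; the first checkpoint whose signature differs marks where the data in memory have already diverged. Checking with -isid- whether a sort key is unique can also flag sorts whose result depends on tie-breaking.

```stata
* Sketch only: program name, labels, and variables are hypothetical
capture program drop checkpoint
program define checkpoint
    args label
    quietly datasignature
    display as text "CHECKPOINT `label': " as result "`r(datasignature)'"
end

* Toy demonstration
clear
set obs 5
generate long id = _n
generate double x = id / 3
checkpoint "after_build"

* Check whether a sort key is unique before relying on the sorted order
capture isid id
display "sort key unique? " cond(_rc == 0, "yes", "no")
```

Running the do-file twice with a log open and comparing only the CHECKPOINT lines is cheaper than building full summary tables at every stage.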

Is this issue even solvable when using large databases where there are millions of computations? The significance of my final output doesn't change much (the R2 is 0.385 in one run of the data and 0.387 in another run).

Any thoughts are greatly appreciated!