  • #16
    Thank you very much. Indeed, the bug was with my Stata 14.2. I also tried it with Stata 15 at my university and got the same results.

    Thanks again for all this altruism!



    • #17
      Dear all,

      The problem I originally had still persists. I have now purchased Stata 15.1, as it was pointed out that the problem is user-specific and perhaps specific to the Stata version. It seems that when both the OLS and IV regressions are run together, I keep getting different coefficient estimates and p-values for the IV regressions.
      The specific code I run is as follows:

      Code:
      *****************************SETTING GLOBALS************************************
      cd "F:\Judge Selection Reform New"   //Enter your own path where the Files folder is saved
      use ".\Input\CaseYearDataWithJudgeWithShrines.dta", clear
      
      
      *Globals for case characteristics, judge characteristics and district characteristics
      global controls_case constitutional criminal land_case  pagesjudgenum benchchiefjustice lawyer_number judge_number 
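      global controls_district Area_sqkm Population Density_persqkm 
      *Note: the global controls_judge is used below but is not defined in this excerpt; if unset, it expands to nothing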
      
      
      regress StateWins NewJudges_TotalJudges i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_judge $controls_district, vce(cluster district_bench)
      
      ****Table 1: Summary Statistics****
      sum StateWins Merit caselag $controls_case $controls_judge $controls_district if e(sample) ==1
      
      
      
      ****Table 1: Summary Statistics****
      set more off 
      regress StateWins NewJudges_TotalJudges i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_judge $controls_district, vce(cluster district_bench)
      regress caselag NewJudges_TotalJudges i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_judge $controls_district if e(sample) == 1, vce(cluster district_bench)
      sum StateWins  caselag yeardecision  yearfiled   $controls_case  $controls_district  if e(sample) == 1
      sum StateWins  caselag yeardecision  yearfiled   $controls_case  $controls_district 
      
      
      ****Table 2: OLS and IV case lag ****
      *New Judges and StateWins but first to specify sample
      
      cd "F:\Judge Selection Reform New\Output\Tables"
      regress StateWins NewJudges_TotalJudges if e(sample) ==1, vce(cluster district_bench)  
      
      regress StateWins NewJudges_TotalJudges i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_district, vce(cluster district_bench)  
      
      ivregress 2sls StateWins (NewJudges_TotalJudges= ProbRetirement)  if e(sample) ==1, vce(cluster district_bench)  
      
      ivregress 2sls StateWins (NewJudges_TotalJudges= ProbRetirement) i.yeardecision i.district_bench##c.yeardecision $controls_case $controls_district, vce(cluster district_bench)
      An example of the raw data is as follows:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte StateWins double(NewJudges_TotalJudges ProbRetirement) str5 bench int yeardecision byte(constitutional criminal land_case pagesjudgenum benchchiefjustice lawyer_number judge_number) int Area_sqkm double(Population Density_persqkm)
      0                  0                  0 "karhc" 2008 1 0 0 10 0 2 1 1259 9564287.211 7596.733289
      1                  0                  0 "karhc" 1991 1 0 0 34 0 2 2 1259 5746250.053 4564.138247
      0                  0                  0 "larhc" 2000 0 1 0  1 0 3 2 1259 7767563.842  6169.62974
      0  .5333333333333333                 .6 "larhc" 2016 1 0 1  1 0 2 1 1259 11361010.58 9023.836838
      0 .06666666666666667 .06666666666666667 "kashc" 2010 1 0 1  2 0 4 2 1259 10013468.05 7953.509176
      1                  0                  0 "kashc" 1986 0 1 0 13 0 3 3 1259 4623297.947 3672.198528
      1 .06666666666666667 .06666666666666667 "hydhc" 2010 1 0 1  2 0 3 1 1259 10013468.05 7953.509176
      1                  0                  0 "karhc" 2004 1 0 1  3 0 2 2 1259 8665925.526 6883.181514
      1 .06666666666666667 .06666666666666667 "hydhc" 2010 1 0 1  1 0 3 1 1259 10013468.05 7953.509176
      0  .4666666666666667  .5333333333333333 "phc" 2015 1 0 1  2 0 2 2 1259 11136420.16 8845.448894
      0 .06666666666666667 .06666666666666667 "phc" 2010 1 0 1  2 0 4 2 1259 10013468.05 7953.509176
      0 .26666666666666666  .3333333333333333 "karhc" 2012 1 0 0  1 0 3 2 1259 10462648.89 8310.285063
      1                  0                  0 "karhc" 2007 1 0 1  4 0 2 2 1259 9339696.789 7418.345345
      end
      Thank you in any case. If anyone can help with this problem, it would be great since, as you can see, I have been having it for a while!

      Cheers!



      • #18
        To close out this thread, I want to thank StataCorp, Andrew, and everyone here for the help. I wrote to Stata technical service and they gave me the following explanation: "When the -vce(cluster option)- is specified, -ivregress- must perform data
        sorting operations. When data is sorted, ties are randomly broken; and, if the
        particular data and model specification generates numerical unstable
        computations due to high correlation, you can observe different results every time you run the
        regression. If you set a sortseed initial number at the top of your do-file,
        you will get the same results every time you run your do-file."
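
A minimal sketch of what that advice might look like at the top of a do-file. The seed value is an arbitrary placeholder, and the `ivregress` line simply repeats the simplest specification posted earlier in the thread; this is an illustration under those assumptions, not the exact code technical service supplied.

```stata
* Sketch only: the seed value below is an arbitrary placeholder
version 15.1
set sortseed 20180101    // fixes how -sort- breaks ties, so tied rows land in the same order on every run

* Any estimation that sorts internally should now be reproducible across runs
ivregress 2sls StateWins (NewJudges_TotalJudges = ProbRetirement), vce(cluster district_bench)
```

Setting the sort seed makes runs repeatable, but it does not remove the underlying numerical instability from highly correlated regressors that the quoted explanation mentions.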

        Cheers! And thanks all!



        • #19
          I'm experiencing the same problem as Roger in that my output differs slightly each time I run my script. I added a "set sortseed" line at the beginning of the script over a month ago when I first noticed the differences. My script is over 4,000 lines and the databases are over 50GB, so I can't easily post them here. At the stage where I get my first set of descriptive statistics, there can be up to a 0.1% difference in the means and medians, but that's 2,000 lines and 10 hours into the running of the do-file. Could it be due to the sheer number of calculations taking place, in that there may be some randomness in the rounding of millions of numbers?

          Is there a good method for identifying the cause of the differences? I was thinking about creating a summary table at each stage of the script and comparing the two runs, but that will likely take the better part of a day (and 12-15 hours to run the code each time). Are there specific commands that are prone to causing differences in the output? I assume I resolved any issues with sorting by adding "set sortseed" to the script.
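
One possible way to narrow this down, sketched under assumptions: save a checkpoint at each stage of one run, then compare the data in memory against that checkpoint at the same stage of the next run. The file name and the key variables `firm_id` and `year` are hypothetical. Steps that depend on row order after a sort on non-unique keys are a common source of run-to-run differences, so the sketch also checks key uniqueness with `isid`.

```stata
* Sketch only: file names and key variables are hypothetical

* --- Run 1, at a given checkpoint ---
datasignature                                   // prints a signature of the data in memory
save "checkpoint_stage1_run1.dta", replace

* --- Run 2, at the same checkpoint ---
datasignature
cf _all using "checkpoint_stage1_run1.dta", verbose   // reports which variables differ, observation by observation

* --- Before any step that depends on row order (e.g. by-group use of _n) ---
isid firm_id year                               // exits with an error if these keys do not uniquely identify rows
sort firm_id year
```

The first checkpoint at which `cf` reports a mismatch brackets the block of code where the two runs diverge, which could then be searched for order-dependent steps.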

          Is this issue even solvable when using large databases where there are millions of computations? The significance of my final output doesn't change much (the R2 is 0.385 in one run of the data and 0.387 in another run).

          Any thoughts are greatly appreciated!
