  • different results every time I run do file

    Dear all,

    I have pooled cross-sectional data for 4 years. Some individuals appear in only one year, but some appear in 2, 3 or 4 years. I control for time effects, i.e. I use dummy variables for the years. Every time I run the probit model I get slightly different results, e.g. an estimated coefficient is 0.143 the first time, 0.150 the second time, 0.148 the third time, etc. I have read that this is a problem of sorting on variables that do not uniquely identify observations. My question is how I should sort the data in order to get the same results each time. I tried the following two sorts, but they do not work. I am not interested in panel data analysis; I want to do a basic probit analysis.

    sort idperson year
    sort year idperson

    Also, my dependent variable is employed/not employed and explanatory variables are education, age, marital status, children, etc.

    Thank you.


  • #2
    Well, I suspect you have only partially diagnosed your problem. It is true that irreproducibility is often the result of indeterminate sorting. But -probit- should not be subject to that problem: the results really should be the same regardless of the sort order of the data. It is more likely that something you are calculating before you get to the -probit- command depends on the sort order. The solution, then, is to fix that calculation so that it is independent of the sort order, not to stabilize the sort. Stabilizing the sort can be done easily, but it just sweeps a huge problem under the rug.

    So scour your code for any calculations that might be sort order dependent. Any code that references subscripted variables, _n, or _N will be sort order dependent. Beware of any looping over observations: this is often sort order dependent as well. -collapse- with (first), (last), (firstnm), or (lastnm) is, evidently, sort-order dependent. Harder to spot might be programs you call that do something sort-order dependent internally.
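
    As a minimal sketch of the kind of thing to look for (the variable wage and both generated variable names are hypothetical; idperson and year are taken from #1, and I am assuming each person appears at most once per year):

    ```stata
    * Sort-order dependent: within idperson the order of observations is
    * not determined, so wage[1] may pick a different year on each run.
    bysort idperson: gen first_wage_bad = wage[1]

    * Sort-order independent (under the assumption above): year is added
    * to the sort key in parentheses, so [1] is always the earliest year.
    bysort idperson (year): gen first_wage_ok = wage[1]
    ```

    The fix is in the calculation itself: the sort key is extended until the observation picked by the subscript is fully determined.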


    • #3
      You may be getting different results in different runs because of the randomness in sort. sort randomly breaks any ties in the key values. To reproduce exactly the same results, try set sortseed #. The latter specifies the seed of the random-number generator that breaks ties in sort. If the commands you are using involve randomness other than with sort, set seed # is also needed to reproduce the same results.
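
      A minimal sketch of what this could look like at the top of the do-file (the seed values are arbitrary placeholders, not recommendations):

      ```stata
      * Fix how -sort- breaks ties among observations with equal key values
      set sortseed 20150101

      * Only needed if other commands in the do-file draw random numbers
      set seed 20150101
      ```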

      I hope this solves the problem.

      -- Kreshna


      • #4
        Dear Clyde,

        Thank you for your help. I have many sort generated variables. E.g. number of dependents and number of working age in the household:

        gen dep = dag<=17 | dag>=65 //dependent household member
        gen wkm = dag>17 & dag<65 // working age member
        bys year idhh: egen deps=total(dep) // n. of dependants
        bys year idhh: egen wkms=total(wkm) // n. of working age members

        Or total pensions in the household
        bys year idhh: egen penT=total(pen)

        I do not understand what I should do to fix the problem, if it is a problem with sorting.
        I also realized that when I save the data set with all the prepared variables and use that saved data set as the input for the regression, I get the same results every time. But if I run the whole do-file, in which the variables are generated, the results differ.

        I do not have any random number generator as Kreshna mentioned.

        Thank you.



        • #5
          Dear Kreshna,

          Thank you for your answer. I do not have any random number generator.



          • #6
            Aleksandra, if you are using the sort command (as you indicated in post #1), or any other command that uses sort internally, then you are in fact using random numbers, and Kreshna's advice on setting the sortseed is relevant. Sorting with the option stable is another alternative. This is also consistent with the behavior you describe when loading the data from a file, which indicates that the problem is in the data preparation (sorting) and not in your estimations. Best, Sergiy Radyakin
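
            A minimal sketch of the stable alternative, using the sort keys from post #1:

            ```stata
            * Observations with equal (idperson, year) keep the relative
            * order they had before the sort, instead of a random one.
            sort idperson year, stable

            * As far as I know the by prefix has no stable option, so sort
            * explicitly first and then use by without its sort option.
            * rows_in_group is a hypothetical variable name.
            by idperson year: gen long rows_in_group = _N
            ```

            This only gives reproducible results if the order of the data before the sort is itself reproducible.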


            • #7
              "I also realized that when I save the data set with all the prepared variables and use that saved data set as the input for the regression, I get the same results every time. But if I run the whole do-file, in which the variables are generated, the results differ."
              So, if I understand you correctly, if you generate your variables and save the data base, and then re-use that saved database to do the probit analysis several times, the results are the same each time. But if you re-generate the variables each time, then you get varying results each time. That indeed proves my point in #2 that there is something sort-order-dependent in the way you are calculating these variables. So something is wrong.

              Looking at the examples of the calculations you show, none of those should be the source of the problem. dep and wkm involve no sorting at all, and all of the calculations for those commands are done independently within each observation.

              The commands for deps, wkms, and penT require the data to be sorted, but they use only the -total()- function which should, in theory, be independent of the sort order of the data. Now, it is true that finite-precision computer addition is not, strictly speaking, associative. The order in which things get totaled up can matter, but the problem arises only with fairly pathological data in which the running total gets so large that subsequent additions of small values to it have no effect: the order of magnitude of the running total is so much greater than what is being added that the added value gets rounded away to zero. But that requires a variable whose values range over many orders of magnitude, or an extremely large number of observations being added. Since your dep and wkm variables are 0/1 dichotomies, and I'm assuming you do not have a data set with quadrillions of observations in it, the totaling up of dep and wkm into deps and wkms should not be sort-order dependent. It is conceivable that the variable pen is problematic, but it would surprise me. What are the largest and smallest absolute values of pen? What is the largest number of observations for a single year idhh group? I'd be astonished if this is really what's going wrong, but it's simple enough to check these out.
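
              A rough sketch of those checks, assuming the variable names from #4 (abs_pen and grp_size are hypothetical names introduced here):

              ```stata
              * Smallest and largest nonzero absolute values of pen
              gen double abs_pen = abs(pen)
              summarize abs_pen if abs_pen > 0

              * Largest number of observations in a single year-idhh group
              bysort year idhh: gen long grp_size = _N
              summarize grp_size

              * Illustration only: the same three numbers added in two orders.
              * In IEEE double precision the two sides are not equal, because
              * adding 1 to 1e16 is lost to rounding but adding 2 is not.
              display (1e16 + 1) + 1 == 1e16 + (1 + 1)
              ```

              If the range of abs_pen spans only a few orders of magnitude and the groups are household-sized, the totals are very unlikely to be the culprit.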

              So the problem is probably somewhere else. If you cannot identify the source of the problem by reviewing the code, I would do the following:

              1. Modify your starting data set by including a new variable: -gen long obs_no = _n-. obs_no will now be a unique identifier in your data set, and it will remain so unless you use -expand- or -merge- or -append- along the way to bring in new observations.

              2. In all your -bys...- commands, add (obs_no) at the end of the sort key. Include obs_no at the end of the sort key in any explicit -sort- commands. This will provide a reproducible and unique sort order for the data.

              3. Run the do-file to create all the variables. (You can skip doing the actual -probit-.) Save the results in a data set, sorted on obs_no. This is your reference data set.

              4. Remove the (obs_no) references from all your -bys...- commands and remove obs_no from the sort keys of your explicit -sort- commands. Do not, however, eliminate the obs_no variable.

              5. Now re-run the do-file. When it finishes, sort the data on obs_no.

              6. Now use the -cf- command to compare these results to the reference data set you saved the first time.

              The point of this is that the -cf- results will tell you which variables are changing. Then you can focus your scrutiny of the code on just those commands that are involved in creating those variables.
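
              The six steps can be sketched roughly as follows. The file names rawdata and reference are hypothetical, only one of your generated variables is shown, and the comment lines with dots stand for the rest of your do-file:

              ```stata
              * Steps 1-3: build the reference data set with a unique sort order
              use rawdata, clear
              gen long obs_no = _n
              gen dep = dag<=17 | dag>=65
              bysort year idhh (obs_no): egen deps = total(dep)
              * ... all other variable-generating commands, with (obs_no) added ...
              sort obs_no
              save reference, replace

              * Steps 4-5: rebuild the variables without (obs_no) in the sort keys
              use rawdata, clear
              gen long obs_no = _n
              gen dep = dag<=17 | dag>=65
              bysort year idhh: egen deps = total(dep)
              * ... the same commands, without (obs_no) ...
              sort obs_no

              * Step 6: report which variables differ from the reference
              cf _all using reference, verbose
              ```

              Note that -cf- exits with an error code when it finds any mismatch, so put it at the end of the do-file or prefix it with -capture noisily-.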

              Once you know which variable(s) are actually indeterminate, then you can go back into the code and insert some -summarize- commands after each command that changes them and run the code a few times. By seeing where the -summarize- results first differ from one run to the next you will be able to identify the (first) place where the calculation is indeterminate. Presumably then you'll be able to fix that. Then try it again--perhaps there are more points of indeterminacy, or perhaps there is only that one place where it's happening.


              • #8
                Thank you for such an extensive explanation. I first tried set sortseed # at the beginning of the do-file, and it appears to be OK. I checked several regressions 3 times each, and the results are the same. Anyway, I will do what Clyde suggested in order to be completely sure what is going wrong.

                Thank you.



                • #9
                  Be sure to eliminate the -set sortseed- command when you do the tests I suggested in #7. Using -set sortseed- is just covering up the problem, not solving it.

                  The appropriate use of -set sortseed- is when you are doing calculations that are supposed to be non-deterministic and sort-dependent, but you want to be able to replicate the specific results that you got. But when you are getting non-deterministic results from calculations that are supposed to be deterministic, -set sortseed- just sweeps the problem under the rug.