  • The same data, the same code, different decomposition results

    Hello,

    I'm using the Blinder-Oaxaca decomposition method to study the gender wage gap with four waves of the PSID.
    I used the -oaxaca- command to implement it in Stata and realized that it produces different results every time I run the same do-file.
    This is strange, because the OLS regression results are always the same.

    One of the commands I used is as follows:
    oaxaca lnwage (demo:race agev) (education:educ college advdeg) (experience:fulltime parttime fulltimesq parttimesq) if female==0 & year==1993, by(child) weight(1)

    The do-file imports the PSID datasets, cleans the variables, and then runs several regressions and decompositions.

    The summary statistics of the datasets are always the same, and the OLS regression results are the same; only the decomposition results differ.
    Could this be related to Stata's multithreading, or should I try sorting the datasets before running -oaxaca-?

    Any comments will be appreciated.
    Thanks in advance.

  • #2
    Welcome to Statalist!

    I suggest that you test the performance of oaxaca more thoroughly by saving the data that results from importing and cleaning the PSID datasets to a new Stata dataset, and then run the same oaxaca command repeatedly on that dataset, which is certain to be the same. Otherwise, there's the lingering suspicion that the differing results are a consequence not of the program but of subtle changes in the cleaned data. For good measure, I would exit Stata between each run, in case there is indeed a subtle problem running oaxaca repeatedly in the same Stata session.
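
    For example, something along these lines (cleaned_psid is just a placeholder file name):

        * after the import/cleaning portion of the do-file has run,
        * freeze the cleaned data so every test starts from identical input
        save cleaned_psid, replace

        * then, in a fresh Stata session each time:
        use cleaned_psid, clear
        oaxaca lnwage (demo:race agev) (education:educ college advdeg) ///
            (experience:fulltime parttime fulltimesq parttimesq) ///
            if female==0 & year==1993, by(child) weight(1)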

    • #3
      Thank you so much William!
      I just tried exiting Stata between each run and I found that it consistently produces the same coefficients.
      Now I can sleep tonight! Thank you again.

      • #4
        Well, enjoy your good night's sleep, but in the morning I would start worrying if I were you. These results suggest that there is something indeterminate about the part of your do-file that imports and cleans the data. So while you can now get consistent results by fresh-starting each time, the question is whether those consistent results are consistently right or consistently wrong!

        The usual source of indeterminacy in data management code arises from -sort-ing. If the -sort- key does not uniquely identify observations, the order of the tied observations within any key value is indeterminate and irreproducible, by design. If the code later uses some process that is sensitive to the sort order of the data and does not independently and deterministically sort the data first, the results of that process will be irreproducible. Another possibility is use of the random number generator without setting the seed before first use.
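
        Schematically (the variable names here are invented for illustration):

            * DANGER: if id does not uniquely identify observations, the
            * order of tied observations after this differs from run to run
            sort id

            * SAFE: sort on a key that uniquely identifies observations;
            * -isid- stops with an error if the key is not unique
            isid id year
            sort id year

            * and if random numbers are used anywhere, set the seed first
            set seed 20160224
            generate u = runiform()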

        The first thing I would do is verify that there is an indeterminacy problem, as opposed to a bug in -oaxaca-. So I would take the code that creates your cleaned data, run it in a loop a few times, saving the results in a different file each time, and then compare the files. If they are not identical, then your data importation/cleaning process is indeterminate and irreproducible, and I would regard that as precluding any further analysis until that problem is sorted out. Scrutinize every -sort-, and if you use random numbers in the process, verify that you initialize the seed beforehand.
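
        That test might look roughly like this (clean_psid.do and the file names stand in for your own code):

            * run the import/cleaning code several times, saving each result
            forvalues i = 1/3 {
                do clean_psid.do
                save cleaned_run`i', replace
            }

            * -cf- stops with an error at the first variable that differs
            use cleaned_run1, clear
            cf _all using cleaned_run2, verbose
            cf _all using cleaned_run3, verbose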

        If the data sets are identical, this would suggest that -oaxaca- is somehow producing indeterminate results. First I would scrutinize the actual -oaxaca- command you used to see if in its own right it is drawing on something that introduces indeterminacy. Then, if that is not the issue, I would write a do-file that reads in the cleaned data file and runs the same -oaxaca- command several times in a loop, to confirm that the problem persists. And if so, I would contact the author of -oaxaca- (Ben Jann) and, if he says he is still maintaining the program, send him your data and the code in which you invoked -oaxaca- so he can check it out.
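
        The loop can compare the coefficient vectors directly rather than by eye; a sketch, reusing the command from the original post:

            use cleaned_psid, clear
            tempname b0
            forvalues i = 1/5 {
                oaxaca lnwage (demo:race agev) (education:educ college advdeg) ///
                    (experience:fulltime parttime fulltimesq parttimesq) ///
                    if female==0 & year==1993, by(child) weight(1)
                if `i' == 1 matrix `b0' = e(b)
                else assert mreldif(`b0', e(b)) == 0
            }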
        Last edited by Clyde Schechter; 24 Feb 2016, 17:45. Reason: Correct typos.

        • #5
          Clyde nails it. He expresses what was running through my mind when I wrote my post above, but it's a difficult explanation to make, so I thought I'd wait until we had some idea that it was worth making. I'm glad I waited: Clyde's post explains the issues much better than I could have, and it proposes an appropriately rigorous analysis to find the problem.

          I agree that you must check your data cleaning and recoding process along the lines Clyde recommended: if you think -oaxaca- doing different things is bad, having your cleaned data depend on the vagaries of the cleaning process is worse, because it could affect all of your analyses. I have not forgotten that, initially, everything other than the decomposition looked the same from run to run, which seems to exonerate your data cleaning and recoding. But that is not a sufficient test to rule out problems there.

          I will be very interested in hearing where this takes you.

          • #6
            Thank you, Clyde and William, for the comments.
            I totally agree with you both. In fact, I had already started worrying about my cleaning process even before I went to sleep.
            It'll take some time to find out where I made the mistake, but I'm sure I'll find it, and I'll keep you updated!
            Thanks a lot.

            • #7
              Dear all,

              Finally I fixed the problem. I followed Clyde and William's instructions to identify the indeterminacy in my dataset. I was puzzled at first because I didn't use -sort- or random numbers. However, I got closer to the problem by saving the data at each step of the cleaning. My problem was in the merging process. I'm glad there was only one merge in my do-files, an -m:1- merge. I realized that every time I merged the two datasets it generated slightly different outcomes, which I had never noticed by looking at the summary statistics. The change in magnitude was very small, but it was ultimately the source of all the confusion. I must have missed it because I had been exporting the statistics with only two to four decimal places, which turned out to be a huge mistake.
              I restructured all my do-files, testing whether each change produces consistent results, and now they produce the same output no matter what.
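
              In case it helps someone else, the pattern I ended up with looks roughly like this (the variable and file names are placeholders, not my actual ones):

                  * hypothetical names: persons.dta (master), families.dta (using)
                  use persons, clear
                  merge m:1 famid using families, assert(match master)

                  * force a deterministic observation order right after the merge:
                  * -isid, sort- verifies the key is unique, then sorts on it
                  isid persid year, sort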

              Again, thank you both for the valuable comments.
              Your instructions and explanations cleared up my confusion and finally led me to fix the problem.
