Same commands different numbers

Ghia Osseiran

Join Date: Mar 2015

Posts: 66
#1


26 Mar 2015, 10:34

When I run the same do file on the same dataset, I get slightly different answers each time.
For instance, the log of wage was 9.94 before I cleared; when I cleared and restarted the same process with the same do file, it became 10.01.

Also, some variables seem to lose some of their values somewhere along the way. For example, the education variable in my dataset ranges from 0 to 5, but midway through the same do file its values become only 3-5. It's not because of any commands I'm typing, because if I clear and restart the process using the same do file and the same commands, it doesn't happen again.

Any idea what may be happening here?
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#2

26 Mar 2015, 10:54

The overwhelmingly most frequent cause of this type of behavior is the following:

1. At some point(s) in the do file you have a -sort- command where the sorting variables do not uniquely identify the observations.
2. In that case, the exact order of the -sort- is indeterminate and will differ from one run to the next.
3. You then perform some other calculation whose result depends on the exact sort order, and therefore get different results from that.
4. And possibly still other results derived from #3 are calculated, again with different results.

So I would go through the do file and replace every -sort varlist- command with -isid varlist, sort- and then run it. The program will break when it reaches a sort where the sorting variables fail to uniquely identify the observations. At that point you need to figure out how to fix that problem: typically there is some additional variable (or variables) that needs to be added to the varlist to determine the sort order in a meaningful, sensible way. And, of course, having fixed one such problem, you should then re-run the file, as there may be more than one such error in it.
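As a rough illustration of that check, here is a minimal sketch that uses the auto dataset shipped with Stata rather than the poster's data; the choice of make and foreign as sort keys is purely an assumption for the demonstration.

```stata
sysuse auto, clear

* make is unique in this dataset, so the check passes
* and the data end up sorted by make
isid make, sort

* foreign alone does not identify observations, so isid reports an error;
* -capture noisily- shows the message but lets this demo continue
capture noisily isid foreign, sort

* for a failing key, see how many observations share each value
duplicates report foreign
```

In a real do file you would leave out -capture noisily- so that execution stops at the first sort whose keys are not unique.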
daniel klein

Join Date: Mar 2014

Posts: 3845
#3

26 Mar 2015, 10:56

What do you mean by

running the same do file

Do you type in Stata

Code:

do myfile.do

or

Code:

run myfile.do

Or do you execute the files in some other way?

What do you mean by

clear and restart

Do you type in Stata

Code:

clear
do myfile.do

To sum up, show us exactly what you have typed and what Stata returned.

Best
Daniel
Ghia Osseiran

Join Date: Mar 2015

Posts: 66
#4

26 Mar 2015, 11:04

Thanks. Clyde, there are a number of sorts in my do file, so this is probably the problem.
All of my sorts, though, take the form "by country year x y etc., sort:" followed by summarize or egen.

In this case where do I place the isid?

Daniel - I start with use "location/filename.dta"
If I want to restart for any reason, I type "clear" in Stata
and then restart with: use "location/filename.dta"

Last edited by Ghia Osseiran; 26 Mar 2015, 11:24.
Ghia Osseiran

Join Date: Mar 2015

Posts: 66
#5

26 Mar 2015, 13:38

I've read the help file for isid and have used it previously to check that my panel variables, personal id and year, uniquely identify the observations. I read elsewhere that no news is good news with isid, and with these two variables that's fine. However, I'm dealing with a bunch of other variables, like year, education and occupation, which I would also need to sort by. When I run isid on this varlist, it confirms that these variables do not uniquely identify the observations. Any tips on how I can resolve this?
daniel klein

Join Date: Mar 2014

Posts: 3845
#6

26 Mar 2015, 14:09

Show us the actual code pieces you are talking about and explain what you want to achieve.

Best
Daniel
Ghia Osseiran

Join Date: Mar 2015

Posts: 66
#7

26 Mar 2015, 14:40

OK, so by way of example: two variables in my dataset, occupation and years of schooling, do not uniquely identify the observations when I apply isid. My panel dataset is xtset id year.
I am using sort in instances where I generate new variables based on the above. For instance, to assign each person the modal years of schooling for their occupation, I need the following command: by occupation, sort: egen occmode = mode(yrschool)

Last edited by Ghia Osseiran; 26 Mar 2015, 14:52.
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#8

26 Mar 2015, 15:00

I think you need to show us all of the places where you are doing this. The particular example you gave should not be problematic because -egen, mode()- invoked, as here, without any options should give the same results regardless of the sort order of the data within occupation. You are more likely to have gotten into trouble in situations where you refer to _n or _N within a -by- prefixed command, or use -egen, tag()-, or a function that produces running totals or something like that. Other problems could arise with functions or commands that need to break ties.

I suggest you scrutinize all of these commands and see if you can identify the source of the problem. If you can't find it on your own, then post all of them in a code block to get more help.
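To make the distinction concrete, here is a small sketch on Stata's bundled auto dataset (not the poster's data; the variable names created here are hypothetical) that contrasts a calculation that ignores the order within by-groups with one that depends on it.

```stata
sysuse auto, clear

* order-insensitive: every observation in a group receives the group mean,
* however the observations are arranged within foreign
bysort foreign: egen mean_price = mean(price)

* order-sensitive: which car comes "first" within foreign depends on how
* ties in the sort are broken, so this can change from run to run
bysort foreign: generate first_price = price[1]

* well-defined version: the key in parentheses fixes the order within
* foreign (make is unique in this dataset) before taking the first value
bysort foreign (make): generate first_price_by_make = price[1]
```

Searching a do file for subscripts such as [1], [_n-1] or [_N] and for running sums inside -by- groups is one way to locate the commands worth scrutinizing.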
Ghia Osseiran

Join Date: Mar 2015

Posts: 66
#9

26 Mar 2015, 17:18

I think I may have spotted the error: for several egen commands prefixed by "by occupation", I sorted on occupation again in each one of them. Could that have caused the problem? Is it correct to sort only once and then, the next time I have a by occupation command, just type "by occupation: egen ..." etc.?
I also notice that even though I set my dataset in panel mode somewhere along the way, I need to reenter the command xtset id year for the data to look like a panel in the data browser. Is there a rule as to the sequence of commands in panel datasets? The sequence I used was first to drop entries with missing data for the variables I could not do without in my dataset, I then set the dataset as panel, created my dummy variables and new variables, did a bunch of order commands so that I see my variables of interest in the first few columns in the data browser and then ran the regressions. Is some other order preferred or is this fine?

Last edited by Ghia Osseiran; 26 Mar 2015, 17:25.
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#10

26 Mar 2015, 17:52

If you run a bunch of commands all prefixed with -by occupation:-, you only need to specify sorting on the first of those, provided the commands don't explicitly do anything to change the sort order. I honestly don't know what Stata does if you issue -by occupation, sort:- when the data are already sorted by occupation. I've run a few experiments and in these, Stata does not appear to re-do the sort. But that may not apply to all -by-able commands.

But here's the point. You shouldn't be focusing on keeping the sort order from shifting. Your changing results suggest that you are calculating things using methods that depend on the exact sort order of the data. That means that, unless there is a single "natural" sort order to the data, you are calculating things that are not actually well-defined. Stabilizing the sort order will cover up that problem and make it appear that you have good results, but all it really does is lock you into selecting one arbitrary result out of a range of equally (in)valid results that could have been gotten with some other sort order.

So, without knowing the details of what you are doing, I can't advise you between these alternatives, but one or the other should apply:

1. There is a natural sort order that is appropriate to your data and the calculations you are trying to do, and it can be defined by combinations of variables in your data set. Those variables uniquely identify the observations and correctly sort the data for the calculations you are trying to perform. You need to change your -sort-s so that those sorting variables are used.

2. Or, there is no natural sort order to your data, and you are applying ill-defined calculations that give different results depending on the sort order. In this case you are trying to calculate something that does not exist, or that is just arbitrary and has no definition. Either way, using such calculations is likely to lead to chaos down the line. You need to change your calculations so that they yield the same result regardless of the sort order. Do not simply choose and stabilize an arbitrary sort order--that hides the problem but does not solve it.
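One possible way to test whether a given calculation falls under the second alternative is to run it twice on deliberately shuffled data and compare the results. The sketch below does this on the bundled auto dataset; the variables u, first_a and first_b are hypothetical names introduced only for the demonstration.

```stata
sysuse auto, clear
set seed 12345

* first pass: shuffle the data, then take the "first" price within foreign
generate double u = runiform()
sort u
bysort foreign: generate first_a = price[1]

* second pass: reshuffle and repeat the identical calculation
replace u = runiform()
sort u
bysort foreign: generate first_b = price[1]

* a nonzero count means the calculation depends on the sort order
count if first_a != first_b
```

A count of zero in one trial does not prove the calculation is order-independent, so this is a screening device rather than a guarantee.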

Last edited by Clyde Schechter; 26 Mar 2015, 18:03.
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#11

26 Mar 2015, 18:01

I also notice that even though I set my dataset in panel mode somewhere along the way, I need to reenter the command xtset id year for the data to look like a panel in the data browser.

-xtset- does sort the data by panelvar (and timevar if applicable) when first invoked. But the fact that the data are -xtset- does not compel Stata to maintain that sort order. So when you look at the data in a browser, if other commands have shuffled the observations around, they will no longer "look" like a panel. But they are still recognized by Stata as panel data when you use the -xt- regression commands or the L., F., and D. operators. So unless you have had Stata -xtset, clear- the data somewhere along the way, once they are -xtset- they remain -xtset- for the remainder of the session. (And if you save the data file after -xtset-, they remain -xtset- in future uses of the same data set.)
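A toy example may help here. The four-observation panel below is made up for illustration (id, year and wage are hypothetical names); it shows that the declaration is still in place after the rows are shuffled and that an explicit sort restores the panel layout in the browser.

```stata
clear
input id year wage
1 2001 10
1 2002 11
2 2001 20
2 2002 21
end

xtset id year

* shuffle the rows; the data no longer look like a panel in the browser
gsort -wage

* with no arguments, xtset reports the current panel settings
xtset

* an explicit sort puts the rows back in panel order for viewing
sort id year
generate lag_wage = L.wage
list, sepby(id)
```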

Is there a rule as to the sequence of commands in panel datasets? The sequence I used was first to drop entries with missing data for the variables I could not do without in my dataset, I then set the dataset as panel, created my dummy variables and new variables, did a bunch of order commands so that I see my variables of interest in the first few columns in the data browser and then ran the regressions. Is some other order preferred or is this fine?

This general approach seems fine. The ordering of the variables so that you can see them easily in the browser is, of course, optional: it is a convenience for you, but Stata could not care less (except when using the var1-var2 notation in a varlist). Deleting observations with missing values on key variables might, in some circumstances, be better deferred to later in the process, or even skipped. If the variables are needed for the regression analyses, the regression programs will exclude those observations anyway. And sometimes creating new variables is easier when there are no gaps in the series of -xtset- variables. So that really depends on the specific situation.
Ghia Osseiran

Join Date: Mar 2015

Posts: 66
#12

26 Mar 2015, 18:04

Thanks, much appreciated! Re: sorting, I think this is where the confusion originated. One of my commands is: "by country year occupation, sort: egen occmean= mean(yrschool)"
I need this so that the mean years of schooling is calculated within each country in each year. Deleting country and year and just using by occupation yields different results.
I would need this specification for a series of commands that calculate the mean, mode and standard deviation. Before I set the dataset as a panel, I did type the command "gsort country pid year".

Last edited by Ghia Osseiran; 26 Mar 2015, 19:03.
Ghia Osseiran

Join Date: Mar 2015

Posts: 66
#13

27 Mar 2015, 16:53

Sorry to come back to this, but there's a problem with precisely the aforementioned command that I can't seem to fix. When I type in
"by country year occupation, sort: egen occmean= mean(yrschool) if occupation!=."
and go to my data browser, I get the correct figures I am looking for: Stata calculates the mean years of schooling per occupation in each country for that year. But if I actually scroll down on the same country, I then get a whole list of . one after the other where occupation is missing, and after that for the same country, Stata treats it as if it's a new country and calculates a new mean for each occupation in that country again, which is different from the first set.
I realize there may be something wrong with the command itself, but what I'm trying to get it to do is calculate the mean years of schooling per occupation for each occupation in each country. Would appreciate any tips on how to resolve this since, as I mentioned earlier, somewhere along the way even my yrschool variable changes when I run xtsum yrschool.

Last edited by Ghia Osseiran; 27 Mar 2015, 17:01.
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#14

27 Mar 2015, 17:29

I realize there may be something wrong with the command itself, but what I'm trying to get it to do is calculate the mean years of schooling per occupation for each occupation in each country.

If that is what you want, then I don't see where the year variable comes into it. It would just be:

Code:

by country occupation, sort: egen occmean = mean(yrschool) if occupation != .

(You do not have to worry about this leading to irreproducible results even though country and occupation do not identify year: the -egen, mean()- function is not sensitive to the sort order within by-groups.)

If you want a separate mean years of schooling for each occupation every year, then your code would be correct, and the output would look exactly as you described it:

But if I actually scroll down on the same country, I then get a whole list of . one after the other where occupation is missing, and after that for the same country, Stata treats it as if it's a new country and calculates a new mean for each occupation in that country again, which is different from the first set.

It's not that Stata's treating it as a new country, it's that Stata's starting over with a new year, which is just what your code asks it to do.
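The difference between the two by-lists can be seen on a made-up five-observation dataset (the values are invented for illustration, and occmean_year is a hypothetical name):

```stata
clear
input str2 country year occupation yrschool
"A" 2001 1 10
"A" 2001 1 12
"A" 2002 1 14
"A" 2002 1 16
"A" 2001 . 9
end

* one mean per occupation within country, pooled over years:
* (10 + 12 + 14 + 16)/4 = 13 for occupation 1
by country occupation, sort: egen occmean = mean(yrschool) if occupation != .

* one mean per occupation within country and year:
* 11 for 2001 and 15 for 2002
by country year occupation, sort: egen occmean_year = mean(yrschool) if occupation != .

* both new variables are missing where occupation is missing
sort country occupation year
list, sepby(country occupation)
```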
Linh mt

Join Date: May 2017

Posts: 33
#15

31 Dec 2018, 21:41

Hi all,

I also have problems with getting different results when I run the same do file several times. I tried to use -isid varlist, sort- instead of the sort command, but I get the error "variables tinh huyen xa diaban hoso do not uniquely identify the observations". How can I fix this problem? Any help would be deeply appreciated. Thank you all.