A set of commands yields different output every time I run it

alessio lombini

Join Date: Dec 2020
Posts: 98

16 Aug 2021, 01:35

Hello everyone,

I am trying to replace values of the variable "applied_tariff" based on conditions. However, every time I run the code I get (slightly) different results for applied_tariff. I cannot work out the source of this problem or how to fix it. I hope some of you will be willing to help.

I attach the dataset that I am using (on Google Drive) and the code, for those who want to try running it.

Code:

// When missing, I replace bilateral tariffs with that of the EU when an EU member is the exporter and the importer is a non-EU member
gen exp2= cond(eu_exp, "EUN", exp)
bys exp2 imp ISIC3 year (applied_tariff): replace applied_tariff= applied_tariff[_n-1] if missing(applied_tariff) & exp!="EUN" & exp2=="EUN"

// When missing, I replace bilateral tariffs when an EU member is the importer and the exporter is a non-EU member
gen imp2= cond(eu_imp, "EUN", imp)
bys imp2 exp year (applied_tariff): replace applied_tariff= applied_tariff[_n-1] if missing(applied_tariff) & imp!="EUN" & imp2=="EUN"

drop if exp == "EUN" | imp == "EUN"

drop _fillin eu_exp eu_imp exp2 imp2


decode ISIC3, gen(sect_string)
mdesc
replace sect_string = "32" if ISIC3 ==32
replace sect_string = "33" if ISIC3 ==33
replace sect_string = "34" if ISIC3 ==34
replace sect_string = "35" if ISIC3 ==35
replace sect_string = "36" if ISIC3 ==36

// I compare summary statistics and missing values with those from the last time I ran the code
sum applied_tariff
mdesc applied_tariff

This problem has been puzzling me a lot, so I sincerely thank you in advance for your time.


Felix Bittmann

Join Date: Aug 2018

Posts: 693
#2

16 Aug 2021, 01:53

The problem is probably the sorting done by bysort. If there are ties in the sort keys, sort is not stable and may place the tied observations in a different order on each run. See

https://www.statalist.org/forums/for...-i-run-do-file

I suggest using sort, stable and checking whether there are further underlying errors in the dataset.

Best wishes

alessio lombini

Join Date: Dec 2020

Posts: 98
#3

16 Aug 2021, 02:54

Dear Felix,
Thank you a million for your enlightening comment. I did not know about this possibility with the command bysort. Following your advice, I replaced

Code:

gen exp2= cond(eu_exp, "EUN", exp)
bys exp2 imp ISIC3 year (applied_tariff): replace applied_tariff= applied_tariff[_n-1] if missing(applied_tariff) & exp!="EUN" & exp2=="EUN"

with

Code:

gen exp2= cond(eu_exp, "EUN", exp)
sort exp2 imp ISIC3 year applied_tariff, stable
by exp2 imp ISIC3 year (applied_tariff): replace applied_tariff= applied_tariff[_n-1] if missing(applied_tariff) & exp!="EUN" & exp2=="EUN"

(and I did the same for the corresponding "imp" part). I now indeed get stable results. One last thing, just to be sure of what I am doing: do the two sets of commands above yield the same output (except for the random sorting)?

thanks again
Felix Bittmann

Join Date: Aug 2018

Posts: 693
#4

16 Aug 2021, 03:22

Yes, I think this is fine. But you should think about what this means. Stable only means that you have the same sorting but the ties are still there. You simply "lock in" one version. Therefore, ties apparently do influence your findings. You should either try to find out if these ties are actually necessary or there is another layering of sorting that could resolve this. Otherwise run the command many times with unstable sorting and see how much variation of results is produced.
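
The last suggestion could be sketched roughly as follows. This is only an illustration: it assumes the original dataset (before any of the replacements) is in memory, and the choice of 20 runs is arbitrary. Each pass repeats the first replacement with the default unstable sort and prints the mean and the number of nonmissing values of applied_tariff.

```stata
* Sketch: repeat the fill with the default (unstable) sort and compare runs.
* Assumes the untouched dataset is in memory; 20 repetitions is arbitrary.
forvalues r = 1/20 {
    preserve
    gen exp2 = cond(eu_exp, "EUN", exp)
    bysort exp2 imp ISIC3 year (applied_tariff): replace applied_tariff = applied_tariff[_n-1] ///
        if missing(applied_tariff) & exp != "EUN" & exp2 == "EUN"
    quietly summarize applied_tariff
    display "run `r': mean = " r(mean) "   nonmissing = " r(N)
    restore
}
```

If the printed numbers differ between runs, the order of tied observations is affecting which values get filled in.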

Best wishes

alessio lombini

Join Date: Dec 2020

Posts: 98
#5

16 Aug 2021, 07:21

Dear Felix,

Thanks again for your reply. Could you explain what you mean by "ties" in the data?
Best wishes
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

16 Aug 2021, 10:01

By "ties in the data" Felix means that your data contains some combinations of exp2, imp, ISIC3, year, and applied_tariff for which there is more than one observation.

Said another way, your data contains some duplicated combinations of exp2, imp, ISIC3, year, and applied_tariff.

Code:

duplicates describe exp2 imp ISIC3 year applied_tariff

Since your results apparently depend on which order these duplicated observations are sorted into, you have a problem. Using sort, stable does not resolve the problem, it hides the problem.
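
To look at the tied observations themselves, something along these lines may help. It is a sketch, assuming exp2 has already been generated and not yet dropped; dup is a hypothetical variable name introduced here for illustration. Counting the ties that have a missing applied_tariff is included because missing values all sort together at the end of each group, so they are a natural place to look.

```stata
* Tag observations that share the same values of all the sort keys.
* dup is a hypothetical helper variable: 0 means the observation is unique.
duplicates tag exp2 imp ISIC3 year applied_tariff, generate(dup)

* How many tied observations are there, and how many have a missing tariff?
count if dup > 0
count if dup > 0 & missing(applied_tariff)

* Inspect the tied observations group by group.
sort exp2 imp ISIC3 year applied_tariff exp
list exp exp2 imp ISIC3 year applied_tariff if dup > 0, sepby(exp2 imp ISIC3 year)
```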

Your job is to study your data, understand what is happening, and change your code so that there is no ambiguity about what value of applied_tariff should be used to replace any missing value.
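
As one possible illustration of removing the ambiguity, the sketch below does not rely on sort order at all. It rests on an assumption about the intent that only the original poster can confirm: that an EU-member exporter with a missing tariff should receive the tariff recorded on the "EUN" row for the same importer, sector, and year. eun_tariff is a hypothetical helper variable.

```stata
* Sketch under an assumed intent: EU-member exporters with a missing tariff
* inherit the tariff recorded on the "EUN" row of the same cell.
gen exp2 = cond(eu_exp, "EUN", exp)

* eun_tariff (hypothetical helper): the EUN row's tariff within each
* exp2-imp-ISIC3-year cell; missing if the cell has no EUN row with a value.
bysort exp2 imp ISIC3 year: egen eun_tariff = max(cond(exp == "EUN", applied_tariff, .))

replace applied_tariff = eun_tariff ///
    if missing(applied_tariff) & exp != "EUN" & exp2 == "EUN"
drop eun_tariff
```

Because egen's max() does not depend on the order of observations, repeated runs give the same result. If a cell contains several EUN rows with different tariffs, this picks the largest, which is itself a choice that should be checked against the data.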