  • bysort produces different results every time

    Hello,

    I have a strange problem. I am using the following code to count the unique visits of an insurant/patient (via DispensingDate). InsurantNumber is unique to a patient and to a doctor, meaning one patient attending two doctors will have two InsurantNumbers.

    Code:
    sort InsurantNumber DispensingDate
    
    by InsurantNumber: gen id_ = _n == 1
    
    by InsurantNumber: replace id_ = id_[_n]+1 if _n > 1 & DispensingDate != DispensingDate[_n-1]
    After that I want to generate a new variable which contains the number of unique visits by doctor. I do that via

    Code:
    sort doctor
    
    by doctor: egen visits=total(id_)
    Now I have run this a couple of times and I get slightly different results almost every time. I tried various versions of doing this (bysort and egen in one command; a foreach loop for a separate calculation per doctor) and always get different results per doctor (+- 4), while the total number of visits stays the same.
    Perhaps an important note: if I run the same command multiple times, the calculations are equal, but if, for example, I run my whole file another time, the results vary. I even traced this down to the point where I ran a random command between every execution of the same counting procedure, just out of curiosity, and as I feared, the results varied again.

    I simply don't get it.

    I hope someone can help me.

    Regards,

    Karl

  • #2
    Could there be duplicates on

    InsurantNumber DispensingDate

    ?

    That could be problematic in assuring identical sort orders. The number of distinct (you say "unique") dates for each person is obtainable more directly by

    Code:
    egen tag = tag(InsurantNumber DispensingDate)  
    egen nvisits = total(tag), by(InsurantNumber)
    or

    Code:
    bysort InsurantNumber DispensingDate : gen nvisits = _n == 1  
    by InsurantNumber: replace nvisits = sum(nvisits)
    by InsurantNumber: replace nvisits = nvisits[_N]
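
    A small illustration of why duplicate sort keys matter (a sketch using the auto dataset shipped with Stata, not the poster's data): Stata's sort breaks ties in an unspecified order unless the stable option is given, so repeated sorts on a key with ties need not reproduce the same observation order.

    Code:
    sysuse auto, clear
    sort rep78                  // rep78 has many tied values
    gen long order1 = _n        // record position after first sort
    sort mpg                    // reshuffle the observations
    sort rep78                  // sort on the tied key again
    gen long order2 = _n        // record position after second sort
    count if order1 != order2   // may be nonzero: tie order is not reproducible

    Sorting on a key that uniquely identifies observations, or adding the stable option, makes the order reproducible.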



    • #3
      Thanks for the shorter commands.

      It is possible that there are duplicates, since the rows of the data contain medical prescriptions and a patient could receive the same medicine twice on one date. But since the only variables I refer to for sorting are InsurantNumber and DispensingDate, and the sort order within the same value of DispensingDate does not matter, I still don't understand where this problem comes from.

      Anyway, I hope it won't occur anymore with the new syntax.

      Regards,

      Karl



      • #4
        I haven't tried running your code. If you want an explanation of the problem, a reproducible example with data we can copy and paste would help. Alternatively, someone else may be able to suggest the problem.



        • #5
          OK, I think I've got it. It is a flaw in the supplied data, which means that
          InsurantNumber is unique to a patient and to a doctor, meaning one patient attending two doctors will have two InsurantNumbers.
          is not true for about 20 of 1 million cases. This screws up the sorting.

          Do you by chance know a command structure which checks whether a certain (or better, any) string value appears in more than one group of observations? That would have solved the riddle in the first place.

          Thanks

          Karl

          Last edited by Karl Emmert-Fees; 15 Oct 2015, 07:36. Reason: added (or better any)



          • #6
            Your open question sounds like a question on detecting duplicates. For various reasons I recommend the duplicates command.

            Otherwise here is a stupid example to illustrate technique. We will check to see whether the substring "co" occurs within a variable in different groups of a grouping variable.

            Code:
            . sysuse auto, clear
            (1978 Automobile Data)
            
            . tabulate foreign if strpos(make, "co")
            
               Car type |      Freq.     Percent        Cum.
            ------------+-----------------------------------
               Domestic |          1       33.33       33.33
                Foreign |          2       66.67      100.00
            ------------+-----------------------------------
                  Total |          3      100.00
            Yes, we find that in both groups.
            Last edited by Nick Cox; 15 Oct 2015, 08:33.



            • #7
              Yes, I think you are right about the duplicates problem. Since I have a very large dataset containing strings in the form of the example I attached, I don't know the string I am searching for, and therefore will have to check for distinct InsurantNumbers within each doctor via the mentioned egen ... = tag(). Then exclude the 0s and again check for distinct InsurantNumbers over all doctors. My 0s will then be the ones occurring under more than one doctor.

              This of course will destroy my dataset. As you mentioned in one of your publications, this is not a very sophisticated way. I thought there may be another way to display which strings occur in more than one of a number of groups.
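
              The check described above can be sketched without dropping anything (variable names taken from the thread; a sketch, not run on the actual data):

              Code:
              egen tag = tag(InsurantNumber doctor)           // one flag per InsurantNumber-doctor pair
              egen ndoctors = total(tag), by(InsurantNumber)  // number of doctors per InsurantNumber
              list InsurantNumber doctor if ndoctors > 1      // listing, not dropping, keeps the data intact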
              Attached Files
              Last edited by Karl Emmert-Fees; 15 Oct 2015, 09:11.



              • #8
                I am not recommending anything that will destroy your dataset. The duplicates command remains my first suggestion.
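
                A sketch of that workflow, with variable names taken from the thread (not run on the actual data):

                Code:
                duplicates report InsurantNumber DispensingDate        // tabulate how often combinations repeat
                duplicates tag InsurantNumber DispensingDate, gen(dup) // flag duplicates in place, deleting nothing
                duplicates examples InsurantNumber DispensingDate      // show one example observation per group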



                • #9
                  I will check it out.

                  Thank you very much.



                  • #10
                    Originally posted by Nick Cox View Post
                    I am not recommending anything that will destroy your dataset. The duplicates command remains my first suggestion.
                    Hello Nick, I was wondering what you would presume the problem to be if there are no duplicates. I have the same problem as shown at https://www.statalist.org/forums/for...-i-run-do-file #10.

                    Thank you very much.



                    • #11
                      I've answered, or rather commented, in that thread. Sorry, but I cannot follow what you're trying to do or what your problem is. That could easily be my fault.

