  • bysort produces different results every time

    Hello,

    I have a strange problem. I am using the following code to count the unique visits of an insurant/patient (via DispensingDate). InsurantNumber is unique to a patient and to a doctor, meaning one patient attending two doctors will have two InsurantNumbers.

    Code:
    sort InsurantNumber DispensingDate
    
    by InsurantNumber: gen id_ = _n == 1
    
    by InsurantNumber: replace id_ = id_[_n]+1 if _n > 1 & DispensingDate != DispensingDate[_n-1]
    After that I want to generate a new variable which contains the number of unique visits by doctor. I do that via

    Code:
    sort doctor
    
    by doctor: egen visits=total(id_)
    Now I have run this a couple of times and I get slightly different results almost every time. I tried various versions of doing this (bysort and egen in one command; a foreach loop for a separate calculation per doctor) and always get different results per doctor (+- 4), while the total number of visits stays the same.
    Perhaps an important note: if I run the same command multiple times, the calculations are equal, but if, for example, I run my whole file another time, the results vary. I even traced this down to the point where I ran a random command between every execution of the same counting procedure, just out of curiosity, and as I feared, the results varied again.

    I simply don't get it.

    I hope someone can help me.

    Regards,

    Karl

  • #2
    Could there be duplicates on

    InsurantNumber DispensingDate

    ?

    That could be problematic in assuring identical sort orders. The number of distinct (you say "unique") dates for each person is obtainable more directly by

    Code:
    egen tag = tag(InsurantNumber DispensingDate)  
    egen nvisits = total(tag), by(InsurantNumber)
    or

    Code:
    bysort InsurantNumber DispensingDate : gen nvisits = _n == 1  
    by InsurantNumber: replace nvisits = sum(nvisits)
    by InsurantNumber: replace nvisits = nvisits[_N]
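
    A small illustration of why duplicate sort keys matter (a sketch using the auto dataset shipped with Stata, not the poster's data): Stata's sort breaks ties in an unspecified order unless the stable option is given, so repeated sorts on a key with ties need not reproduce the same observation order.

    Code:
    sysuse auto, clear
    sort rep78                  // rep78 has many tied values
    gen long order1 = _n        // record position after first sort
    sort mpg                    // reshuffle the observations
    sort rep78                  // sort on the tied key again
    gen long order2 = _n        // record position after second sort
    count if order1 != order2   // may be nonzero: tie order is not reproducible

    Sorting on a key that uniquely identifies observations, or adding the stable option, makes the order reproducible.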



    • #3
      Thanks for the shorter commands.

      It is possible that there are duplicates, since the rows of the data contain medical prescriptions and a patient could receive the same medicine twice on one date. But since the only variables I refer to for sorting are InsurantNumber and DispensingDate, and the sort order within the same value of DispensingDate does not matter, I still don't understand where this problem comes from.

      Anyway, I hope it won't occur anymore with the new syntax.

      Regards,

      Karl



      • #4
        I haven't tried running your code. If you want an explanation of the problem, a reproducible example with data we can copy and paste would help. Alternatively, someone else may be able to suggest the problem.



        • #5
          OK, I think I've got it. It is a flaw in the supplied data, which means that
          InsurantNumber is unique to a patient and to a doctor, meaning one patient attending two doctors will have two InsurantNumbers.
          is not true for about 20 of 1 million cases. This screws up the sorting.

          Do you by chance know a command structure which checks whether a certain (or better, any) string value appears in more than one group of observations? That would have solved the riddle in the first place.

          Thanks

          Karl

          Last edited by Karl Emmert-Fees; 15 Oct 2015, 07:36. Reason: added (or better any)



          • #6
            Your open question sounds like a question on detecting duplicates. For various reasons I recommend the duplicates command.

            Otherwise here is a stupid example to illustrate technique. We will check to see whether the substring "co" occurs within a variable in different groups of a grouping variable.

            Code:
            . sysuse auto, clear
            (1978 Automobile Data)
            
            . tabulate foreign if strpos(make, "co")
            
               Car type |      Freq.     Percent        Cum.
            ------------+-----------------------------------
               Domestic |          1       33.33       33.33
                Foreign |          2       66.67      100.00
            ------------+-----------------------------------
                  Total |          3      100.00
            Yes, we find that in both groups.
            Last edited by Nick Cox; 15 Oct 2015, 08:33.



            • #7
              Yes, I think you are right about the duplicates problem. Since I have a very large dataset containing strings in the form of the example I attached, I don't know the string I am searching for, and therefore will have to check for distinct InsurantNumbers within each doctor via the mentioned egen ... = tag(). Then exclude the 0s and again check for distinct InsurantNumbers over all doctors. My 0s will then be the ones occurring under more than one doctor.

              This of course will destroy my dataset. As you mentioned in one of your publications, this is not a very sophisticated way. I thought there may be another way to display which strings occur in more than one of a number of groups.
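
              The check described above can be sketched without dropping anything (variable names taken from the thread; a sketch, not run on the actual data):

              Code:
              egen tag = tag(InsurantNumber doctor)           // one flag per InsurantNumber-doctor pair
              egen ndoctors = total(tag), by(InsurantNumber)  // number of doctors per InsurantNumber
              list InsurantNumber doctor if ndoctors > 1      // listing, not dropping, keeps the data intact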
              Attached Files
              Last edited by Karl Emmert-Fees; 15 Oct 2015, 09:11.



              • #8
                I am not recommending anything that will destroy your dataset. The duplicates command remains my first suggestion.
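
                A sketch of that workflow, with variable names taken from the thread (not run on the actual data):

                Code:
                duplicates report InsurantNumber DispensingDate        // tabulate how often combinations repeat
                duplicates tag InsurantNumber DispensingDate, gen(dup) // flag duplicates in place, deleting nothing
                duplicates examples InsurantNumber DispensingDate      // show one example observation per group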



                • #9
                  I will check it out.

                  Thank you very much.



                  • #10
                    Originally posted by Nick Cox View Post
                    I am not recommending anything that will destroy your dataset. The duplicates command remains my first suggestion.
                    Hello Nick, I was wondering what you would presume the problem to be if there are no duplicates. I have the same problem as shown at https://www.statalist.org/forums/for...-i-run-do-file #10.

                    Thank you very much.



                    • #11
                      I've answered, or rather commented, in that thread. Sorry, but I cannot follow what you're trying to do or what your problem is. That could easily be my fault.

