Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to run a foreach loop within subsets of records?

    dear Statalisters,
    I am in a situation where ideally I would need to combine by and foreach, which is not allowed in Stata.

    I have a dataset of repeated laboratory measures in long format, in which I must find out which exams were erroneously reported twice in subsequent records. I have >22000 patients which may have up to 40 measures, and dozens of variables related to the different exams (cholesterol, glycemia, etc.) are included. When an exam is repeated, the same date and value appear in consecutive records.

    So I need to compare the dates of subsequent exams into each patient and when they're the same set value=1 to a dichotomous indicator.

    I figured out something like this:
    Code:
    gen same_exam=0
    by id: foreach var of varlist date_* {
        replace same_exam = (`var' == `var'[_n-1]) & `var'<.
        }
    which obviously fails.

    I'd prefer not to use the wide form, first because I'd have to delete some exams (I'd get more than 2048 variables), second because it will be more complicated to keep track of the ids of visits where the exam is actually repeated.

    Any help is appreciated!

  • #2
    There is no real problem here in terms of syntax.

    Code:
    gen same_exam = 0
    
    foreach var of varlist date_* {
         by id: replace same_exam = (`var' == `var'[_n-1]) & `var'<.
    }
    The real problem is quite different. Doing this in a loop is exactly equivalent to doing it just once for the last variable of date_*.

    If I understand what you want, you need a new variable each time round the loop.

    But why not just look for
    duplicates directly?

    Comment


    • #3
      I think I wasn't explaining well, sorry.

      I am looking for duplicates of a single exam within a set. So, in a record one patient may have a set of exams which are new except for just one or two exams that were erroneously taken from his/her previous set. Therefore, I cannot look for a whole duplicated record. I have to parse every variable separately, and I need to do it separately for each patient, because it is allowed that two patients had exams on the same date.
      For instance, in this snapshot I need to tag the third record, because data_ca is the same as in the second record.

      id idvisit data_albumin data_bicarb data_ca
      P13768 V46901 20.11.2012 20.11.2012 20.11.2012
      P13768 V49672 22.02.2013 22.02.2013 22.02.2013
      P13768 V49673 02.09.2013 02.09.2013 22.02.2013



      I would like to obtain just one binary variable that shows if at least one of the corresponding exams is a duplicate of the previous for the same patient.

      duplicates too doesn't allow the by prefix, so I am afraid it is not a solution

      Comment


      • #4
        If I understand you correctly, I think can accomplish your end by reshaping to a longer format, such that each record contains only an id, a visitid, the name of the test, and a test date. Presuming each of your test variables starts with the common stub "data_," as your example shows, you could do this:
        Code:
        input str10(id idvisit data_albumin data_bicarb data_ca_)
        P13768 V46901 20.11.2012 20.11.2012 20.11.2012
        P13768 V49672 22.02.2013 22.02.2013 22.02.2013
        P13768 V49673 02.09.2013 02.09.2013 22.02.2013
        end
        //
        reshape long data_, i(id idvisit) j(testname) string
        //
        //  Any group of records that shares an id, a testname, and a data_ value
        // has at least one duplicate.
        sort id testname data_  // to help visualize this example
        duplicates tag id testname data_, gen(dupe)
        list
        //
        // Back to original wider format, if desired, with a duplicat
        reshape wide data_ dupe, i(id idvisit) j(testname) string
        list
        egen any_duplicate = sum(dupe*)
        Hope this is on the right track.

        Comment


        • #5
          I wonder if Nick's suggestion is still the most efficient. duplicates allows for a list of variables so you should be able to get what you want looking for duplicates on id and data_ca. If you want to drop the duplicate cases, then do something like this...

          Code:
          duplicates drop id data_ca, force

          Comment


          • #6
            Mike, that's really brilliant!
            It worked fine and gave me what I needed.
            I only had to add an if here:

            Code:
             
             duplicates tag id testname data_ if data_<., gen(dupe)
            because there are exams missing, and then with egen dupl=rowmax(dupe*) I was able to create the indicator I needed.

            thanks a lot!

            Comment


            • #7
              Originally posted by Lance Erickson View Post
              I wonder if Nick's suggestion is still the most efficient. duplicates allows for a list of variables so you should be able to get what you want looking for duplicates on id and data_ca. If you want to drop the duplicate cases, then do something like this...

              Code:
              duplicates drop id data_ca, force
              Yes Lance, you're right!
              I inserted your line of code (modified as duplicates tag because those are the records I want to keep) into a foreach loop and so I got the duplicates indicators:


              Code:
              foreach var of varlist exams_dates {
                      duplicates tag id `var' if `var' <.,  gen(dupl`var')
                      }
              this certainly is more efficient than the double reshape.


              thanks so much

              Comment


              • #8
                I just want to underline that the fact that duplicates doesn't allow by: should not bite. Usually when people want

                Code:
                by whatever : duplicates something
                they usually want what

                Code:
                duplicates whatever something
                could give them directly.

                duplicates
                is an official Stata command but as its original author that's my summary of the situation (and the rationale for it). Examples to the contrary would be of interest.

                Comment

                Working...
                X