how to run a foreach loop within subsets of records?

Dino Gibertoni

Join Date: Mar 2014

Posts: 29
#1

how to run a foreach loop within subsets of records?

05 Dec 2016, 04:42

dear Statalisters,
I am in a situation where ideally I would need to combine by and foreach, which is not allowed in Stata.

I have a dataset of repeated laboratory measures in long format, in which I must find out which exams were erroneously reported twice in subsequent records. I have >22000 patients which may have up to 40 measures, and dozens of variables related to the different exams (cholesterol, glycemia, etc.) are included. When an exam is repeated, the same date and value appear in consecutive records.

So I need to compare the dates of subsequent exams into each patient and when they're the same set value=1 to a dichotomous indicator.

I figured out something like this:

Code:

gen same_exam=0 by id: foreach var of varlist date_* { replace same_exam = (`var' == `var'[_n-1]) & `var'<. }

which obviously fails.

I'd prefer not to use the wide form, first because I'd have to delete some exams (I'd get more than 2048 variables), second because it will be more complicated to keep track of the ids of visits where the exam is actually repeated.

Any help is appreciated!
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35605
#2

05 Dec 2016, 05:04

There is no real problem here in terms of syntax.

Code:

gen same_exam = 0 foreach var of varlist date_* { by id: replace same_exam = (`var' == `var'[_n-1]) & `var'<. }

The real problem is quite different. Doing this in a loop is exactly equivalent to doing it just once for the last variable of date_*.

If I understand what you want, you need a new variable each time round the loop.

But why not just look for duplicates directly?
Comment
Dino Gibertoni

Join Date: Mar 2014

Posts: 29
#3

05 Dec 2016, 06:01

I think I wasn't explaining well, sorry.

I am looking for duplicates of a single exam within a set. So, in a record one patient may have a set of exams which are new except for just one or two exams that were erroneously taken from his/her previous set. Therefore, I cannot look for a whole duplicated record. I have to parse every variable separately, and I need to do it separately for each patient, because it is allowed that two patients had exams on the same date.
For instance, in this snapshot I need to tag the third record, because data_ca is the same as in the second record.

id idvisit data_albumin data_bicarb data_ca
P13768 V46901 20.11.2012 20.11.2012 20.11.2012
P13768 V49672 22.02.2013 22.02.2013 22.02.2013
P13768 V49673 02.09.2013 02.09.2013 22.02.2013

I would like to obtain just one binary variable that shows if at least one of the corresponding exams is a duplicate of the previous for the same patient.

duplicates too doesn't allow the by prefix, so I am afraid it is not a solution
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2411

05 Dec 2016, 07:31

If I understand you correctly, I think can accomplish your end by reshaping to a longer format, such that each record contains only an id, a visitid, the name of the test, and a test date. Presuming each of your test variables starts with the common stub "data_," as your example shows, you could do this:

Code:

input str10(id idvisit data_albumin data_bicarb data_ca_)
P13768 V46901 20.11.2012 20.11.2012 20.11.2012
P13768 V49672 22.02.2013 22.02.2013 22.02.2013
P13768 V49673 02.09.2013 02.09.2013 22.02.2013
end
//
reshape long data_, i(id idvisit) j(testname) string
//
//  Any group of records that shares an id, a testname, and a data_ value
// has at least one duplicate.
sort id testname data_  // to help visualize this example
duplicates tag id testname data_, gen(dupe)
list
//
// Back to original wider format, if desired, with a duplicat
reshape wide data_ dupe, i(id idvisit) j(testname) string
list
egen any_duplicate = sum(dupe*)

Hope this is on the right track.

Comment

Lance Erickson

Join Date: Apr 2014

Posts: 144
#5

05 Dec 2016, 07:40

I wonder if Nick's suggestion is still the most efficient. duplicates allows for a list of variables so you should be able to get what you want looking for duplicates on id and data_ca. If you want to drop the duplicate cases, then do something like this...

Code:

duplicates drop id data_ca, force
Comment
Dino Gibertoni

Join Date: Mar 2014

Posts: 29
#6

05 Dec 2016, 08:22

Mike, that's really brilliant!
It worked fine and gave me what I needed.
I only had to add an if here:

Code:

duplicates tag id testname data_ if data_<., gen(dupe)

because there are exams missing, and then with egen dupl=rowmax(dupe*) I was able to create the indicator I needed.

thanks a lot!
Comment
Dino Gibertoni

Join Date: Mar 2014

Posts: 29
#7

05 Dec 2016, 08:44

Originally posted by Lance Erickson View Post

I wonder if Nick's suggestion is still the most efficient. duplicates allows for a list of variables so you should be able to get what you want looking for duplicates on id and data_ca. If you want to drop the duplicate cases, then do something like this...

Code:

duplicates drop id data_ca, force

Yes Lance, you're right!
I inserted your line of code (modified as duplicates tag because those are the records I want to keep) into a foreach loop and so I got the duplicates indicators:

Code:

foreach var of varlist exams_dates { duplicates tag id `var' if `var' <., gen(dupl`var') }

this certainly is more efficient than the double reshape.

thanks so much
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35605
#8

05 Dec 2016, 12:33

I just want to underline that the fact that duplicates doesn't allow by: should not bite. Usually when people want

Code:

by whatever : duplicates something

they usually want what

Code:

duplicates whatever something

could give them directly.

duplicates is an official Stata command but as its original author that's my summary of the situation (and the rationale for it). Examples to the contrary would be of interest.
Comment

Announcement

how to run a foreach loop within subsets of records?

Comment

Comment

Comment

Comment

Comment

Comment

Comment