Recoding variables for duplicate observations

Lisa Kammerman

Join Date: Jun 2023

Posts: 6
#1

Recoding variables for duplicate observations

10 Jun 2023, 14:36

I am unable to recode variables for duplicate observations. I am using Stata Version 17

I am sampling without replacement, where an observation is the sampling unit. Patients may contribute many observations, and there is a maximum number of observations that can be selected for each patient, e.g. 3. Because of this restriction I cannot use the sample command.

When an observation is selected, I want to save the value of the patID and then increment an observation counter for all duplicate observations which match this patID. When the maximum number of observations are selected, the sampling algorithmn will ignore observations from that patID. A flag, select, is set to 1 when an observation is selected.

The sampling must also be done by groups defined by a combination of these 2 variables: groupID and regionnew. To develop the code, my first step is to work on one of the combinations: groupID==4, and regionnew==3. Once I get this piece working, I will loop through all values of groupID and regionnew.

Here are the dataset variables:
obsnum equals _n

patid is the patient number

groupID is the group number, ranging from 1 to 4

regionnew is region number, ranging from 1 to 6

n1 represents the random ordering within each of the 24 groups defined by groupID and regionnew.

select: 0 if observation is not selected, 1 if selected

patIDtotcount: a counter that increments by 1 for each observation that contains this patID. For example, if an observation for patient 301 is selected, then the counter would increase by 1 for every occurrence of 301

The entire data set is sorted by patID groupID regionnew n1

Data:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float obsnum int patid float groupID byte regionnew float(n1 select patIDtotcount) 998 303 4 1 1 0 0 999 302 4 2 1 0 0 1000 305 4 2 2 0 0 1001 301 4 2 3 0 0 1002 302 4 2 4 0 0 1003 307 4 2 5 0 0 1004 301 4 2 6 0 0 1005 309 4 3 1 1 0 1006 309 4 3 2 0 0 1007 302 4 3 3 0 0 1008 302 4 4 1 0 0 1009 301 4 4 2 0 0 1010 306 4 4 3 0 0 1011 305 4 4 4 0 0 1012 302 4 4 5 0 0 1013 303 4 4 6 0 0 1014 302 4 4 7 0 0 1015 306 4 4 8 0 0 1016 301 4 5 1 0 0 1017 301 4 5 2 0 0 1018 305 4 5 3 0 0 1019 302 4 5 4 0 0 1020 302 4 5 5 0 0 1021 305 4 6 1 0 0 1022 309 4 6 2 0 0 1023 306 4 6 3 0 0 1024 304 4 6 4 0 0 1025 302 4 6 5 0 0 1026 302 4 6 6 0 0 1027 301 4 6 7 0 0 end

And the code:

Code:

* Select observation for groupID = 4 and regionnew = 3 forvalues a = 4/4 { forvalues b =3/3 { replace select=1 if n1==1 & groupID==`a' & regionnew==`b' * Here is where I want to save the patid value corresponding to the first observation of the group defined by groupID (4 levels), and regionnew (6 levels). I know I need to use some other approach to save the patid value, but I don't know what. I recognize nuid is being indexed to the 1st observation of the dataset, obsum=_n global nuid=patid[obsnum] } } list obsnum patid select patIDtotcount groupID regionnew n1 if (groupID==4) * increment patIDtotcount for patid that matches patid for observation selected by forvalues loop replace patIDtotcount=patIDtotcount+1 if patid== $nuid list obsnum patid select patIDtotcount groupID regionnew n1 if (groupID==4)

The code correctly resets select to 1 for the selected observation. When Stata executes the replace command, it says (20 real changes made), which corresponds to the 20 occurrences of patID corresponding to the 1st observation of the database.

I have not been successful recoding patIDtotcount+1 for the selected observation and others that share the same patid. I recognize the global variable is being set to the first observation of the dataset: _n=obsnum=1, and patIDtotcount is replaced for all occurrences of the patid corresponding to the first variable. I was hoping that by using matrix notation, I could get around the problem of using if statements. But that obviously isn't the case. After numerous attempts with frames, local/global variables and macros I am seeking your help.

Can anyone suggest a different approach? Is there a way to use dulpicates?

Thank you so much for your time and guidance,
Lisa
Tags: panel data
Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#2

10 Jun 2023, 16:08

There are a few things I don't understand about your problem and your data.

What identifies a "patient" in your data set? Is it defined solely by the variable patid? Or does the same patid refer to two different patients when combined with different values of regionnew (which happens frequently in the example data)? What about if the same patid occurs with different values of groupID--same patient or different patients? (In your example data groupID is constant, so I can't see whether this ever happens or not.)

Are you looking for a pre-specified total sample size? A pre-specified total sample size in each groupID, or regionnew, or combination thereof?

By the way, thank you for using -dataex- in your very first post here.

Last edited by Clyde Schechter; 10 Jun 2023, 16:11.
Comment
Lisa Kammerman

Join Date: Jun 2023

Posts: 6
#3

10 Jun 2023, 20:04

Dear Clyde,

Thank you for making the time to read my post! To answer your questions, here's a more detailed explanation of the data set.

The data set contains a total of 1027 observations and around 80 patients. So each patient, on average, contributes 13 observations. Some patients contribute as few as 1 and others in excess of 20.

Each patient is identified solely by the variable patid, Each patient can appear in only 2 groups identified by groupID: either 1 and 2, or in 3 and 4. There are no restrictions on the number of observations per region. For example (made-up), patient 1000 might have a total of 20 observations, of which 9 belong to groupID 1 and 12 belong to groupID 2. Of the 9 in group 1, 1 could be regionnew 1, 3 could be regionnew 2, 1 could be regionnew 3, 2 could be regionnew 4, none in regionnew5 and 2 could be regionnew 6. And something similar for groupID 2.

Unfortunately, the sample selection isn't a true probability sample but does have some random elements. The sample size to be selected is defined by groupID and region:
groupid 1 calls for 15 observations per region (total of 90)

groupid 2 calls for 20 observations per region (total of 120)

groupid 3 calls for 3 observations per region (total of 18)

groupid 4 calls for 3 observations per region (total of 18)

The restrictions are:
The number of observations per patid is <= 5 (i.e., regardless of regionnew)

The number of observations per patid in groupid 2 <= 2

The number of observations per patid in group 4 <= 2

I hope this is clear now.

Thank you,
Lisa
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30168

11 Jun 2023, 10:16

I think a different approach will work better. Iterating over observations in Stata is, at best, inefficient. Yes, there are some problems that cannot be done in any other way, but I don't think this is one of them. I think the following algorithm will do what you ask. I illustrate it with a demonstration data set that has a structure similar to your description.

Code:

//    DEVELOP A MODEL DATA SET
clear*
set seed 1234

set obs 1000
gen int obs_no = _n

gen group = runiformint(1, 4)
gen regionnew = runiformint(1,6)
gen patid = runiformint(1, 80)

//    SAMPLING PROCEDURE
local ns_per_region 15 20 3 3
gen int times_used = 0
gen byte selected = 0
gen double shuffle = runiform()
sort group regionnew shuffle
foreach g in 2 4 1 3 {    // DO MOST RESTRICTED GROUPS FIRST
    frame put _all if group == `g', into(sampling)
    frame sampling {
        // BEGIN BY THINNING DATA TO LEAVE FOR EACH PATID ONLY
        // AS MANY OBS AS CAN BE USED WITHOUT EXCEEDING SELECTION LIMITS
        by patid (shuffle), sort: drop if _n > cond(inlist(group, 2, 4), 2, 5) - times_used
        //    RANDOMLY PICK AS MANY AS NEEDED FROM EACH REGION
        local n: word `g' of `ns_per_region'
        by regionnew (shuffle), sort: keep if _n <= `n'
    }
    //    UPDATE THE INFORMATION IN THE DEFAULT FRAME
    frlink 1:1 obs_no, frame(sampling)
    replace selected = 1 if !missing(sampling)
    drop times_used
    by patid, sort: egen times_used = total(selected)
    frame drop sampling
    drop sampling
}

Note: If you run this with my demonstration data set you will note that group 2 does not quite achieve the number of observations requested. It may be that with a different random number seed it will. Or it may be the case that conditional on the restrictions imposed, it may not be possible to achieve the large sample for group 2 requested without exceeding the limit of 2 occurrences for any patid. It really depends on the distribution of the patients among the groups and regions.

Last edited by Clyde Schechter; 11 Jun 2023, 10:20.

Comment

Lisa Kammerman

Join Date: Jun 2023
Posts: 6

12 Jun 2023, 15:12

Dear Clyde,
Thank you for describing an approach that operates on all the data instead of observation by observation. It works for me on the simulated dataset. I'm not getting the results I expected when I apply to my actual dataset.

I've been playing with the code to first select from group 4 and then group 3. And also first group 2 and then group 1. But haven't been successful. I changed 'foreach g in 2 4 1 3' to 'foreach g in 4 3 2 1' but that didn't help -- the results were the same. Can you advise please?

My reasoning for changing the order of selection:
I changed the required sample sizes to 9 per region for group 1, 17 per region for group 2, 3 per region for group 3 and 3 per region for group 4. I expected a sample size of 193 but instead am getting 158. A culprit appears to be group 4 and I'm thinking the sample size selection needs to start with group 4, which has the fewest number of observations to draw from, and then group 3.

I was expecting:

1 3 3 3 3 3

Not:

1 2 1 2 3 3

I'm guessing something similar is going on with groups 1 and 2 and that I need to start with group 2 and then group 1.

Reminder: patients can belong to one and only one of two sets: groups 1 and 2 or groups 3 and 4.

Code:

 
Region
Group      |         1          2          3          4          5          6 |     Total
-----------+------------------------------------------------------------------+----------
         1 |         9          9          9          9          9          9 |        54 
         2 |        12         10         15          8         13         16 |        74 
         3 |         3          3          3          3          3          3 |        18 
         4 |         1          2          1          2          3          3 |        12 
-----------+------------------------------------------------------------------+----------
     Total |        25         24         28         22         28         31 |       158

Code:


gen int obs_no = _n

*gen group = runiformint(1, 4)
*gen regionnew = runiformint(1,6)
*gen patid = runiformint(1, 80)

sort patid imagename
gen int obs_no = _n

*gen group = runiformint(1, 4)
*gen regionnew = runiformint(1,6)
*gen patid = runiformint(1, 80)

//    SAMPLING PROCEDURE
local ns_per_region 9 17 3 3
gen int times_used = 0
gen byte selected = 0
gen double shuffle = runiform()
sort groupID regionnew shuffle
foreach g in 2 4 1 3 {    // DO MOST RESTRICTED GROUPS FIRST
    frame put _all if groupID == `g', into(sampling)
    frame sampling {
        // BEGIN BY THINNING DATA TO LEAVE FOR EACH PATID ONLY
        // AS MANY OBS AS CAN BE USED WITHOUT EXCEEDING SELECTION LIMITS
        by patid (shuffle), sort: drop if _n > cond(inlist(groupID, 2, 4), 2, 5) - times_used
        //    RANDOMLY PICK AS MANY AS NEEDED FROM EACH REGION
        local n: word `g' of `ns_per_region'
        by regionnew (shuffle), sort: keep if _n <= `n'
    }
    //    UPDATE THE INFORMATION IN THE DEFAULT FRAME
    frlink 1:1 obs_no, frame(sampling)
    replace selected = 1 if !missing(sampling)
    drop times_used
    by patid, sort: egen times_used = total(selected)
    frame drop sampling
    drop sampling
}

// I placed selected observations into a new frame, sample.  Then I did a 2-way tabulations of groupID by regionnew that yielded the table above.

frame put if selected == 1, into(sample)
frame change sample
tab (groupID) (regionnew)

And again, thank you for your help and time.
Lisa

Comment

Lisa Kammerman

Join Date: Jun 2023

Posts: 6
#6

12 Jun 2023, 16:42

Please hold off. I may have found the problem.
Comment

Lisa Kammerman

Join Date: Jun 2023
Posts: 6

12 Jun 2023, 17:13

Hi Clyde,
I recognized the specs for number per region were incorrect. Should be a max of 2 for groups 1 and 3. Not a max of 2 for groups 2 and 4. When I rerun I still don't get the numbers I'm expecting for the total sample size or for the numbers per region. Any thoughts, particularly on how I can select Groups 2 and 4 first, followed by Groups 1 and 3?

Code:


              |                              Region
  group   |         1          2          3          4          5          6 |     Total
-----------+------------------------------------------------------------------+----------
         1 |         3          9          3          8          8          7 |        38 
         2 |        17         15         17         13         17         17 |        96 
         3 |         1          0          0          2          3          1 |         7 
         4 |         1          3          2          3          3          3 |        15 
-----------+------------------------------------------------------------------+----------
     Total |        22         27         22         26         31         28 |       156

Code:

gen int obs_no = _n

*gen group = runiformint(1, 4)
*gen regionnew = runiformint(1,6)
*gen patid = runiformint(1, 80)

//    SAMPLING PROCEDURE
local ns_per_region 9 17 3 3
gen int times_used = 0
gen byte selected = 0
gen double shuffle = runiform()
sort groupID regionnew shuffle
foreach g in 2 4 1 3 {    // DO MOST RESTRICTED GROUPS FIRST
    frame put _all if groupID == `g', into(sampling)
    frame sampling {
        // BEGIN BY THINNING DATA TO LEAVE FOR EACH PATID ONLY
        // AS MANY OBS AS CAN BE USED WITHOUT EXCEEDING SELECTION LIMITS
        by patid (shuffle), sort: drop if _n > cond(inlist(groupID, 1, 3), 2, 5) - times_used
        //    RANDOMLY PICK AS MANY AS NEEDED FROM EACH REGION
        local n: word `g' of `ns_per_region'
        by regionnew (shuffle), sort: keep if _n <= `n'
    }
    //    UPDATE THE INFORMATION IN THE DEFAULT FRAME
    frlink 1:1 obs_no, frame(sampling)
    replace selected = 1 if !missing(sampling)
    drop times_used
    by patid, sort: egen times_used = total(selected)
    frame drop sampling
    drop sampling
}

frame put if selected == 1, into(sample)
frame change sample
tab (groupID) (regionnew)

Thank you,
Lisa

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#8

12 Jun 2023, 17:52

It may be that the requirements you have imposed are simply not possible to meet. For example, in group 2, you wish to select 17 observations per region and there are 6 regions. That's 102 observations. But if half of your 80 people, i.e. 40, appear in group 2, and each can only be used twice, that's a total of 80 usable observations. So it just doesn't work. Now, you haven't said how many different patid's appear in each group and region, but you probably should review that and see if, as my example calculation here, the restrictions are unworkable.
Comment
Lisa Kammerman

Join Date: Jun 2023

Posts: 6
#9

13 Jun 2023, 15:19

Thank you, Clyde. Yes, some of the groups have very sparse data. I appreciate the reminder to check. Been staring at the dataset for way too long.
Lisa
Comment

Announcement