Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Recoding variables for duplicate observations

    I am unable to recode variables for duplicate observations. I am using Stata Version 17

    I am sampling without replacement, where an observation is the sampling unit. Patients may contribute many observations, and there is a maximum number of observations that can be selected for each patient, e.g. 3. Because of this restriction I cannot use the sample command.

    When an observation is selected, I want to save the value of the patID and then increment an observation counter for all duplicate observations which match this patID. When the maximum number of observations are selected, the sampling algorithmn will ignore observations from that patID. A flag, select, is set to 1 when an observation is selected.

    The sampling must also be done by groups defined by a combination of these 2 variables: groupID and regionnew. To develop the code, my first step is to work on one of the combinations: groupID==4, and regionnew==3. Once I get this piece working, I will loop through all values of groupID and regionnew.

    Here are the dataset variables:
    • obsnum equals _n
    • patid is the patient number
    • groupID is the group number, ranging from 1 to 4
    • regionnew is region number, ranging from 1 to 6
    • n1 represents the random ordering within each of the 24 groups defined by groupID and regionnew.
    • select: 0 if observation is not selected, 1 if selected
    • patIDtotcount: a counter that increments by 1 for each observation that contains this patID. For example, if an observation for patient 301 is selected, then the counter would increase by 1 for every occurrence of 301
    The entire data set is sorted by patID groupID regionnew n1

    Data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float obsnum int patid float groupID byte regionnew float(n1 select patIDtotcount)
     998 303 4 1 1 0 0
     999 302 4 2 1 0 0
    1000 305 4 2 2 0 0
    1001 301 4 2 3 0 0
    1002 302 4 2 4 0 0
    1003 307 4 2 5 0 0
    1004 301 4 2 6 0 0
    1005 309 4 3 1 1 0
    1006 309 4 3 2 0 0
    1007 302 4 3 3 0 0
    1008 302 4 4 1 0 0
    1009 301 4 4 2 0 0
    1010 306 4 4 3 0 0
    1011 305 4 4 4 0 0
    1012 302 4 4 5 0 0
    1013 303 4 4 6 0 0
    1014 302 4 4 7 0 0
    1015 306 4 4 8 0 0
    1016 301 4 5 1 0 0
    1017 301 4 5 2 0 0
    1018 305 4 5 3 0 0
    1019 302 4 5 4 0 0
    1020 302 4 5 5 0 0
    1021 305 4 6 1 0 0
    1022 309 4 6 2 0 0
    1023 306 4 6 3 0 0
    1024 304 4 6 4 0 0
    1025 302 4 6 5 0 0
    1026 302 4 6 6 0 0
    1027 301 4 6 7 0 0
    end
    And the code:
    Code:
    
     * Select observation for groupID = 4 and regionnew = 3 
     forvalues a = 4/4 {
            forvalues b =3/3 {     
                    replace select=1 if n1==1 & groupID==`a'  & regionnew==`b'              
        
             * Here is where I want to save the patid value corresponding to the first observation of the group defined by groupID (4 levels), and regionnew (6 levels).  I know I need to use some other approach to save the patid value, but I don't know what.  I recognize nuid is being indexed to the 1st observation of the dataset, obsum=_n
                    global nuid=patid[obsnum] 
                                                                    
                    }          
       } 
    
     
     list obsnum patid select patIDtotcount groupID regionnew n1 if (groupID==4)
     
     * increment patIDtotcount for patid that matches patid for observation selected by forvalues loop
     replace patIDtotcount=patIDtotcount+1 if patid== $nuid
     
     
     list obsnum patid select patIDtotcount groupID regionnew n1 if (groupID==4)
    The code correctly resets select to 1 for the selected observation. When Stata executes the replace command, it says (20 real changes made), which corresponds to the 20 occurrences of patID corresponding to the 1st observation of the database.


    I have not been successful recoding patIDtotcount+1 for the selected observation and others that share the same patid. I recognize the global variable is being set to the first observation of the dataset: _n=obsnum=1, and patIDtotcount is replaced for all occurrences of the patid corresponding to the first variable. I was hoping that by using matrix notation, I could get around the problem of using if statements. But that obviously isn't the case. After numerous attempts with frames, local/global variables and macros I am seeking your help.

    Can anyone suggest a different approach? Is there a way to use dulpicates?

    Thank you so much for your time and guidance,
    Lisa

  • #2
    There are a few things I don't understand about your problem and your data.

    What identifies a "patient" in your data set? Is it defined solely by the variable patid? Or does the same patid refer to two different patients when combined with different values of regionnew (which happens frequently in the example data)? What about if the same patid occurs with different values of groupID--same patient or different patients? (In your example data groupID is constant, so I can't see whether this ever happens or not.)

    Are you looking for a pre-specified total sample size? A pre-specified total sample size in each groupID, or regionnew, or combination thereof?

    By the way, thank you for using -dataex- in your very first post here.
    Last edited by Clyde Schechter; 10 Jun 2023, 16:11.

    Comment


    • #3
      Dear Clyde,

      Thank you for making the time to read my post! To answer your questions, here's a more detailed explanation of the data set.

      The data set contains a total of 1027 observations and around 80 patients. So each patient, on average, contributes 13 observations. Some patients contribute as few as 1 and others in excess of 20.

      Each patient is identified solely by the variable patid, Each patient can appear in only 2 groups identified by groupID: either 1 and 2, or in 3 and 4. There are no restrictions on the number of observations per region. For example (made-up), patient 1000 might have a total of 20 observations, of which 9 belong to groupID 1 and 12 belong to groupID 2. Of the 9 in group 1, 1 could be regionnew 1, 3 could be regionnew 2, 1 could be regionnew 3, 2 could be regionnew 4, none in regionnew5 and 2 could be regionnew 6. And something similar for groupID 2.

      Unfortunately, the sample selection isn't a true probability sample but does have some random elements. The sample size to be selected is defined by groupID and region:
      • groupid 1 calls for 15 observations per region (total of 90)
      • groupid 2 calls for 20 observations per region (total of 120)
      • groupid 3 calls for 3 observations per region (total of 18)
      • groupid 4 calls for 3 observations per region (total of 18)
      The restrictions are:
      • The number of observations per patid is <= 5 (i.e., regardless of regionnew)
      • The number of observations per patid in groupid 2 <= 2
      • The number of observations per patid in group 4 <= 2
      I hope this is clear now.

      Thank you,
      Lisa

      Comment


      • #4
        I think a different approach will work better. Iterating over observations in Stata is, at best, inefficient. Yes, there are some problems that cannot be done in any other way, but I don't think this is one of them. I think the following algorithm will do what you ask. I illustrate it with a demonstration data set that has a structure similar to your description.
        Code:
        //    DEVELOP A MODEL DATA SET
        clear*
        set seed 1234
        
        set obs 1000
        gen int obs_no = _n
        
        gen group = runiformint(1, 4)
        gen regionnew = runiformint(1,6)
        gen patid = runiformint(1, 80)
        
        //    SAMPLING PROCEDURE
        local ns_per_region 15 20 3 3
        gen int times_used = 0
        gen byte selected = 0
        gen double shuffle = runiform()
        sort group regionnew shuffle
        foreach g in 2 4 1 3 {    // DO MOST RESTRICTED GROUPS FIRST
            frame put _all if group == `g', into(sampling)
            frame sampling {
                // BEGIN BY THINNING DATA TO LEAVE FOR EACH PATID ONLY
                // AS MANY OBS AS CAN BE USED WITHOUT EXCEEDING SELECTION LIMITS
                by patid (shuffle), sort: drop if _n > cond(inlist(group, 2, 4), 2, 5) - times_used
                //    RANDOMLY PICK AS MANY AS NEEDED FROM EACH REGION
                local n: word `g' of `ns_per_region'
                by regionnew (shuffle), sort: keep if _n <= `n'
            }
            //    UPDATE THE INFORMATION IN THE DEFAULT FRAME
            frlink 1:1 obs_no, frame(sampling)
            replace selected = 1 if !missing(sampling)
            drop times_used
            by patid, sort: egen times_used = total(selected)
            frame drop sampling
            drop sampling
        }
        Note: If you run this with my demonstration data set you will note that group 2 does not quite achieve the number of observations requested. It may be that with a different random number seed it will. Or it may be the case that conditional on the restrictions imposed, it may not be possible to achieve the large sample for group 2 requested without exceeding the limit of 2 occurrences for any patid. It really depends on the distribution of the patients among the groups and regions.
        Last edited by Clyde Schechter; 11 Jun 2023, 10:20.

        Comment


        • #5
          Dear Clyde,
          Thank you for describing an approach that operates on all the data instead of observation by observation. It works for me on the simulated dataset. I'm not getting the results I expected when I apply to my actual dataset.

          I've been playing with the code to first select from group 4 and then group 3. And also first group 2 and then group 1. But haven't been successful. I changed 'foreach g in 2 4 1 3' to 'foreach g in 4 3 2 1' but that didn't help -- the results were the same. Can you advise please?

          My reasoning for changing the order of selection:
          I changed the required sample sizes to 9 per region for group 1, 17 per region for group 2, 3 per region for group 3 and 3 per region for group 4. I expected a sample size of 193 but instead am getting 158. A culprit appears to be group 4 and I'm thinking the sample size selection needs to start with group 4, which has the fewest number of observations to draw from, and then group 3.

          I was expecting:
          1 3 3 3 3 3
          Not:
          1 2 1 2 3 3
          I'm guessing something similar is going on with groups 1 and 2 and that I need to start with group 2 and then group 1.

          Reminder: patients can belong to one and only one of two sets: groups 1 and 2 or groups 3 and 4.

          Code:
           
          Region
          Group      |         1          2          3          4          5          6 |     Total
          -----------+------------------------------------------------------------------+----------
                   1 |         9          9          9          9          9          9 |        54 
                   2 |        12         10         15          8         13         16 |        74 
                   3 |         3          3          3          3          3          3 |        18 
                   4 |         1          2          1          2          3          3 |        12 
          -----------+------------------------------------------------------------------+----------
               Total |        25         24         28         22         28         31 |       158 
          Code:
          
          gen int obs_no = _n
          
          *gen group = runiformint(1, 4)
          *gen regionnew = runiformint(1,6)
          *gen patid = runiformint(1, 80)
          
          sort patid imagename
          gen int obs_no = _n
          
          *gen group = runiformint(1, 4)
          *gen regionnew = runiformint(1,6)
          *gen patid = runiformint(1, 80)
          
          //    SAMPLING PROCEDURE
          local ns_per_region 9 17 3 3
          gen int times_used = 0
          gen byte selected = 0
          gen double shuffle = runiform()
          sort groupID regionnew shuffle
          foreach g in 2 4 1 3 {    // DO MOST RESTRICTED GROUPS FIRST
              frame put _all if groupID == `g', into(sampling)
              frame sampling {
                  // BEGIN BY THINNING DATA TO LEAVE FOR EACH PATID ONLY
                  // AS MANY OBS AS CAN BE USED WITHOUT EXCEEDING SELECTION LIMITS
                  by patid (shuffle), sort: drop if _n > cond(inlist(groupID, 2, 4), 2, 5) - times_used
                  //    RANDOMLY PICK AS MANY AS NEEDED FROM EACH REGION
                  local n: word `g' of `ns_per_region'
                  by regionnew (shuffle), sort: keep if _n <= `n'
              }
              //    UPDATE THE INFORMATION IN THE DEFAULT FRAME
              frlink 1:1 obs_no, frame(sampling)
              replace selected = 1 if !missing(sampling)
              drop times_used
              by patid, sort: egen times_used = total(selected)
              frame drop sampling
              drop sampling
          }
          
          // I placed selected observations into a new frame, sample.  Then I did a 2-way tabulations of groupID by regionnew that yielded the table above.
          
          frame put if selected == 1, into(sample)
          frame change sample
          tab (groupID) (regionnew)
          And again, thank you for your help and time.
          Lisa

          Comment


          • #6
            Please hold off. I may have found the problem.

            Comment


            • #7
              Hi Clyde,
              I recognized the specs for number per region were incorrect. Should be a max of 2 for groups 1 and 3. Not a max of 2 for groups 2 and 4. When I rerun I still don't get the numbers I'm expecting for the total sample size or for the numbers per region. Any thoughts, particularly on how I can select Groups 2 and 4 first, followed by Groups 1 and 3?

              Code:
              
                            |                              Region
                group   |         1          2          3          4          5          6 |     Total
              -----------+------------------------------------------------------------------+----------
                       1 |         3          9          3          8          8          7 |        38 
                       2 |        17         15         17         13         17         17 |        96 
                       3 |         1          0          0          2          3          1 |         7 
                       4 |         1          3          2          3          3          3 |        15 
              -----------+------------------------------------------------------------------+----------
                   Total |        22         27         22         26         31         28 |       156
              Code:
              gen int obs_no = _n
              
              *gen group = runiformint(1, 4)
              *gen regionnew = runiformint(1,6)
              *gen patid = runiformint(1, 80)
              
              //    SAMPLING PROCEDURE
              local ns_per_region 9 17 3 3
              gen int times_used = 0
              gen byte selected = 0
              gen double shuffle = runiform()
              sort groupID regionnew shuffle
              foreach g in 2 4 1 3 {    // DO MOST RESTRICTED GROUPS FIRST
                  frame put _all if groupID == `g', into(sampling)
                  frame sampling {
                      // BEGIN BY THINNING DATA TO LEAVE FOR EACH PATID ONLY
                      // AS MANY OBS AS CAN BE USED WITHOUT EXCEEDING SELECTION LIMITS
                      by patid (shuffle), sort: drop if _n > cond(inlist(groupID, 1, 3), 2, 5) - times_used
                      //    RANDOMLY PICK AS MANY AS NEEDED FROM EACH REGION
                      local n: word `g' of `ns_per_region'
                      by regionnew (shuffle), sort: keep if _n <= `n'
                  }
                  //    UPDATE THE INFORMATION IN THE DEFAULT FRAME
                  frlink 1:1 obs_no, frame(sampling)
                  replace selected = 1 if !missing(sampling)
                  drop times_used
                  by patid, sort: egen times_used = total(selected)
                  frame drop sampling
                  drop sampling
              }
              
              frame put if selected == 1, into(sample)
              frame change sample
              tab (groupID) (regionnew)
              Thank you,
              Lisa

              Comment


              • #8
                It may be that the requirements you have imposed are simply not possible to meet. For example, in group 2, you wish to select 17 observations per region and there are 6 regions. That's 102 observations. But if half of your 80 people, i.e. 40, appear in group 2, and each can only be used twice, that's a total of 80 usable observations. So it just doesn't work. Now, you haven't said how many different patid's appear in each group and region, but you probably should review that and see if, as my example calculation here, the restrictions are unworkable.

                Comment


                • #9
                  Thank you, Clyde. Yes, some of the groups have very sparse data. I appreciate the reminder to check. Been staring at the dataset for way too long.
                  Lisa

                  Comment

                  Working...
                  X