I am unable to recode variables for duplicate observations. I am using Stata 17.
The code below correctly sets select to 1 for the selected observation. When Stata executes the replace command for patIDtotcount, it reports (20 real changes made), which corresponds to the 20 occurrences of the patid belonging to the first observation of the dataset.
I am sampling without replacement, where an observation is the sampling unit. Patients may contribute many observations, and there is a maximum number of observations that can be selected for each patient, e.g. 3. Because of this restriction I cannot use the sample command.
When an observation is selected, I want to save its patid value and then increment an observation counter on all duplicate observations that share this patid. Once the maximum number of observations has been selected for a patient, the sampling algorithm will ignore the remaining observations from that patid. A flag, select, is set to 1 when an observation is selected.
The sampling must also be done within groups defined by the combination of two variables: groupID and regionnew. To develop the code, my first step is to work on one of the combinations: groupID==4 and regionnew==3. Once I get this piece working, I will loop through all values of groupID and regionnew.
Here are the dataset variables:
- obsnum equals _n
- patid is the patient number
- groupID is the group number, ranging from 1 to 4
- regionnew is region number, ranging from 1 to 6
- n1 represents the random ordering within each of the 24 groups defined by groupID and regionnew.
- select: 0 if observation is not selected, 1 if selected
- patIDtotcount: a counter of the observations selected so far for this patid, stored on every observation of that patient. For example, if an observation for patient 301 is selected, then the counter increases by 1 on every occurrence of 301
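To make the intended bookkeeping concrete, here is a minimal sketch of the capped selection for a single groupID/regionnew combination. It is only one possible reading of the description above, not a tested solution. It assumes n1 is unique within each group and that select and patIDtotcount start at 0 apart from what earlier groups have already added. The local macros maxper and target and the helper variables rank_in_pat, eligible and newsel are hypothetical names introduced for illustration; in particular, the number of observations to draw per group (target) is not stated in the post.

```stata
* Sketch only: capped selection within one group, in n1 order
local a = 4
local b = 3
local maxper = 3
local target = 2
* maxper is the cap per patient (the post's example value);
* target is a hypothetical per-group sample size

* rank each patient's observations within its group by the random order n1
bysort groupID regionnew patid (n1): gen long rank_in_pat = _n

* an observation stays eligible while the patient's running total respects the cap
gen byte eligible = (groupID == `a') & (regionnew == `b') & (patIDtotcount + rank_in_pat <= `maxper')

* select the first `target' eligible observations in n1 order
bysort eligible (n1): replace select = 1 if eligible & _n <= `target'

* add this group's selections to the counter on every row of each patient
bysort patid: egen long newsel = total(select == 1 & groupID == `a' & regionnew == `b')
replace patIDtotcount = patIDtotcount + newsel

* tidy up and restore the original order
drop rank_in_pat eligible newsel
sort obsnum
```

With the example data below, this would mark observations 1005 and 1006 (both patid 309) and set patIDtotcount to 2 on every row of patient 309. Because the counter is stored on all rows of a patient, running the same block for the next combination would see the updated totals.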
Data:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float obsnum int patid float groupID byte regionnew float(n1 select patIDtotcount)
 998 303 4 1 1 0 0
 999 302 4 2 1 0 0
1000 305 4 2 2 0 0
1001 301 4 2 3 0 0
1002 302 4 2 4 0 0
1003 307 4 2 5 0 0
1004 301 4 2 6 0 0
1005 309 4 3 1 1 0
1006 309 4 3 2 0 0
1007 302 4 3 3 0 0
1008 302 4 4 1 0 0
1009 301 4 4 2 0 0
1010 306 4 4 3 0 0
1011 305 4 4 4 0 0
1012 302 4 4 5 0 0
1013 303 4 4 6 0 0
1014 302 4 4 7 0 0
1015 306 4 4 8 0 0
1016 301 4 5 1 0 0
1017 301 4 5 2 0 0
1018 305 4 5 3 0 0
1019 302 4 5 4 0 0
1020 302 4 5 5 0 0
1021 305 4 6 1 0 0
1022 309 4 6 2 0 0
1023 306 4 6 3 0 0
1024 304 4 6 4 0 0
1025 302 4 6 5 0 0
1026 302 4 6 6 0 0
1027 301 4 6 7 0 0
end
Code:
* Select observation for groupID = 4 and regionnew = 3
forvalues a = 4/4 {
    forvalues b = 3/3 {
        replace select=1 if n1==1 & groupID==`a' & regionnew==`b'
        * Here is where I want to save the patid value corresponding to the first
        * observation of the group defined by groupID (4 levels) and regionnew (6 levels).
        * I know I need to use some other approach to save the patid value, but I don't
        * know what. I recognize nuid is being indexed to the 1st observation of the
        * dataset, obsnum=_n
        global nuid=patid[obsnum]
    }
}
list obsnum patid select patIDtotcount groupID regionnew n1 if (groupID==4)
* increment patIDtotcount for patid that matches patid for observation selected by forvalues loop
replace patIDtotcount=patIDtotcount+1 if patid== $nuid
list obsnum patid select patIDtotcount groupID regionnew n1 if (groupID==4)
I have not been successful in incrementing patIDtotcount by 1 for the selected observation and for the others that share the same patid. I recognize that the global macro is being set from the first observation of the dataset (_n = obsnum = 1), so patIDtotcount is replaced for all occurrences of the patid belonging to the first observation. I was hoping that by using matrix notation I could get around the problem of using if statements, but that obviously isn't the case. After numerous attempts with frames, local/global variables and macros, I am seeking your help.
Can anyone suggest a different approach? Is there a way to use duplicates?
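The behaviour described above follows from how Stata evaluates an expression when defining a macro: outside an observation-by-observation command, a variable name refers to its value in the first observation, so patid[obsnum] resolves to patid[obsnum[1]]. One possible way to pick up the patid of the flagged observation instead is sketched here. It assumes that exactly one observation in the group has select == 1 at that moment; the local macro name pid is hypothetical.

```stata
* Sketch only: read the patid of the single flagged observation in one group
* Assumes exactly one observation has select == 1 in groupID 4, regionnew 3
summarize patid if select == 1 & groupID == 4 & regionnew == 3, meanonly

* with a single matching observation, r(min) equals that observation's patid
local pid = r(min)

* increment the counter on every observation of that patient
replace patIDtotcount = patIDtotcount + 1 if patid == `pid'
```

If more than one observation in the group were flagged, r(min) would return only the smallest patid, so this pattern suits a one-at-a-time selection step.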
Thank you so much for your time and guidance,
Lisa