Hi there
I have a large clinical dataset that includes data from multiple sites. Each entry represents an individual admission to hospital. Some hospitals have been contributing to the dataset for longer than others (i.e. later start date) but none have dropped out.
I wish to include only sites that have been contributing data for the whole time period. I'm planning to do this by extracting all data for those sites that contributed entries in the first year, which is a proxy for the whole time period since there is no drop-out.
The general principle would be to identify all the values of var2 within the subset of observations where var1 =x, then filter the whole dataset on those identified values of var2
In the sample (invented) data below I want to extract all entries from sites that have contributed data in the year 2000 (i.e. sites 1 and 2 but not sites 3 and 4). I want all entries for sites 1 and 2, not just the ones from the year 2000.
I can do this manually as shown below. Is there a more elegant/non manual way to do this?
clear
input pt_id year site age sex died
1 2000 1 50 1 1
2 2000 2 65 0 1
3 2001 1 66 0 0
4 2001 1 30 0 1
5 2001 2 80 1 1
5 2002 1 94 1 1
6 2002 2 62 0 0
7 2002 3 72 1 0
8 2003 1 38 0 0
9 2003 1 50 1 0
10 2003 1 18 1 0
11 2003 2 49 0 0
12 2003 3 56 1 0
13 2003 4 66 0 1
14 2004 1 22 0 0
15 2004 2 64 0 0
16 2004 3 51 1 0
17 2004 4 32 1 1
18 2004 5 53 0 1
end
tab site if year ==2000
site | Freq. Percent Cum.
------------+-----------------------------------
1 | 1 50.00 50.00
2 | 1 50.00 100.00
------------+-----------------------------------
Total | 2 100.00
keep if inlist (site,1,2)
Thanks!
I have a large clinical dataset that includes data from multiple sites. Each entry represents an individual admission to hospital. Some hospitals have been contributing to the dataset for longer than others (i.e. later start date) but none have dropped out.
I wish to include only sites that have been contributing data for the whole time period. I'm planning to do this by extracting all data for those sites that contributed entries in the first year, which is a proxy for the whole time period since there is no drop-out.
The general principle would be to identify all the values of var2 within the subset of observations where var1 =x, then filter the whole dataset on those identified values of var2
In the sample (invented) data below I want to extract all entries from sites that have contributed data in the year 2000 (i.e. sites 1 and 2 but not sites 3 and 4). I want all entries for sites 1 and 2, not just the ones from the year 2000.
I can do this manually as shown below. Is there a more elegant/non manual way to do this?
clear
input pt_id year site age sex died
1 2000 1 50 1 1
2 2000 2 65 0 1
3 2001 1 66 0 0
4 2001 1 30 0 1
5 2001 2 80 1 1
5 2002 1 94 1 1
6 2002 2 62 0 0
7 2002 3 72 1 0
8 2003 1 38 0 0
9 2003 1 50 1 0
10 2003 1 18 1 0
11 2003 2 49 0 0
12 2003 3 56 1 0
13 2003 4 66 0 1
14 2004 1 22 0 0
15 2004 2 64 0 0
16 2004 3 51 1 0
17 2004 4 32 1 1
18 2004 5 53 0 1
end
tab site if year ==2000
site | Freq. Percent Cum.
------------+-----------------------------------
1 | 1 50.00 50.00
2 | 1 50.00 100.00
------------+-----------------------------------
Total | 2 100.00
keep if inlist (site,1,2)
Thanks!
Comment