Hello,
I have a longitudinal dataset that has data about patients (unique patient identifier = PID, patient data collection for time-varying variables begins in 1990) and data about doctors (unique doctor identifier = DID, data collection for time-fixed variables about doctors begins in 2001 and is therefore missing prior to 2000). The data includes the year of data collection and a series of patient and doctor variables. Doctors entered the data in different years, some as late as 2003, and some only participated for one year. Doctors are associated with multiple patients in each year and individual patients see multiple doctors over time, so there is no unique DID + year combination nor are there unique PID + DID combinations. A unique identifier based on the DID would need to specify DID first, then PID as well as year. Here is an example of what the data looks like, sorted by DID and then by Doc_Var1:
I need to replace missing data for variables about doctors (e.g. Doc_Var1 in column 4) that is missing for observations prior to 2001 using non-missing data for the same doctor that was collected during any doctor-patient interaction during or after 2001. In addition to doctor data that is missing because this data was not part of the survey prior to 2001, data is also missing for unknown reasons, during or after 2001, as is the case in the 3rd and 4th rows of this sample data for patients #1 and #5 in 2002 and 2004 for DID 110). Because of this second kind of missing data, I can't just sort the data by the DID and the variable for which there is missing data (as the data above is currently sorted using DID and Doc_Var1) and then replace missing data with data from the cell directly above it, as doing so would reference data about a different doctor: in the case of DID 110 in lines 3 and 4 of this sample data, such an approach would replace the missing data for DID 110 with data about DID 100 for Doc_Var1 in line 1 of this sample data. In these cases, the data cannot be replaced by this method and must be left missing.)
The problem that I'm facing in my rudimentary code is that there is no way to identify a missing observation by a doctor in combination with a year -- it's only possible to identify (and therefore replace) data that can be identified by a patient ID and a year. But replacing data using the PID and the year would simply replace missing data that is supposed to be about a specific doctor with data collected about, potentially, a completely different doctor who saw that same patient in a different year. So this is my problem:
How do I tell Stata to:
1. Find observations with missing data for Doc_Var1 (doctor variable one, fixed over time).
2. For each Doctor ID (DID) with missing values for Doc_Var1, replace missing data for Doc_Var1 with non-missing data for Doc_Var1 for that same DID in years that data exists.
Ideally, I only want to replace missing data. Some of the data has "errors" in it that result in supposedly "time-fixed" data changing over time. I want to keep these "errors" in the data for now. I therefore expect that, for instance, a loop used to replace missing data for Doc_Var1 might return different values for the Doc_Var1 for the same doctor over time if a particular doctor had differing values in 2003 and 2004 that were used to replace the missing data.
Conceptually, I'm trying to do something like this:
foreach DID year, {
replace Doc_Var1_YearWithMissingData = 1 if (Doc_Var1_YearWithMissingData == . & Doc_Var1_AnyOtherYear == 1)
replace Doc_Var1_YearWithMissingData = 2 if (Doc_Var1_YearWithMissingData == . & Doc_Var1_AnyOtherYear == 2)
}
I've also tried a much more tedious approach using multiple conditions included as "if" qualifiers with specified years in wide format (one line for each respondent with "_year" following each variable name), but I am quite certain this approach does not work as it references the PID by default rather than the DID:
gen Doc_Var1_Impute_2000 = Doc_Var1_2000
#delimit ;
replace Doc_Var1_Impute_2000 = 1 if Doc_Var1_Impute_2000 == . &
(
Doc_Var1_Impute_2001 == 1 |
Doc_Var1_Impute_2002 == 1 |
Doc_Var1_Impute_2003 == 1 |
Doc_Var1_Impute_2004 == 1 ) ;
Both of these approaches fail. The second approach fails because it just replaces missing values with other values for the same patient in different years--NOT necessarily replacing missing data about a particular doctor in 2000. This problem is caused by the fact that the data can only be organized in Stata by PATIENT rather than by DOCTOR because there is no way to identify specific observations in this data by two variables--e.g. i(DID) j(year)--that uses Doctor ID (DID) because doctors saw multiple patients each year. It is possible, by contrast, to identify specific observations with a unique patient identifier (PID) and the year -- i(PID) j(year). The first approach produces results identical to the second approach, so it seems likely that this code also does the same thing.
Any guidance would be greatly appreciated.
As an aside: I am just beginning to learn how to use loops and foreach statements. Code that has more steps (or separates discrete actions on different lines) is easier for me to translate theoretically and is far more helpful for the moment, even if it feels stilted, repetitive, or just plain ugly to an advanced user. That said, I eagerly welcome and appreciate any and all guidance.
Thanks in advance,
Andrea
I have a longitudinal dataset that has data about patients (unique patient identifier = PID, patient data collection for time-varying variables begins in 1990) and data about doctors (unique doctor identifier = DID, data collection for time-fixed variables about doctors begins in 2001 and is therefore missing prior to 2000). The data includes the year of data collection and a series of patient and doctor variables. Doctors entered the data in different years, some as late as 2003, and some only participated for one year. Doctors are associated with multiple patients in each year and individual patients see multiple doctors over time, so there is no unique DID + year combination nor are there unique PID + DID combinations. A unique identifier based on the DID would need to specify DID first, then PID as well as year. Here is an example of what the data looks like, sorted by DID and then by Doc_Var1:
PID | Year | DID | Doc_Var1 |
1 | 2001 | 100 | 2 |
5 | 2003 | 100 | 2 |
1 | 2002 | 110 | . |
5 | 2004 | 110 | . |
2 | 2002 | 120 | 1 |
3 | 2003 | 120 | 1 |
4 | 2002 | 120 | 1 |
4 | 2003 | 120 | 1 |
3 | 2002 | 120 | 2 |
4 | 2004 | 120 | . |
5 | 2000 | 120 | . |
5 | 2002 | 130 | . |
1 | 2003 | 150 | 1 |
2 | 2003 | 150 | . |
3 | 2000 | 150 | . |
2 | 2001 | 200 | 2 |
1 | 2000 | 200 | . |
2 | 2000 | 200 | . |
4 | 2000 | 200 | . |
3 | 2001 | 310 | 1 |
3 | 2004 | 310 | . |
2 | 2004 | 400 | 2 |
4 | 2001 | 400 | . |
1 | 2004 | 500 | 2 |
5 | 2001 | 500 | 2 |
The problem that I'm facing in my rudimentary code is that there is no way to identify a missing observation by a doctor in combination with a year -- it's only possible to identify (and therefore replace) data that can be identified by a patient ID and a year. But replacing data using the PID and the year would simply replace missing data that is supposed to be about a specific doctor with data collected about, potentially, a completely different doctor who saw that same patient in a different year. So this is my problem:
How do I tell Stata to:
1. Find observations with missing data for Doc_Var1 (doctor variable one, fixed over time).
2. For each Doctor ID (DID) with missing values for Doc_Var1, replace missing data for Doc_Var1 with non-missing data for Doc_Var1 for that same DID in years that data exists.
Ideally, I only want to replace missing data. Some of the data has "errors" in it that result in supposedly "time-fixed" data changing over time. I want to keep these "errors" in the data for now. I therefore expect that, for instance, a loop used to replace missing data for Doc_Var1 might return different values for the Doc_Var1 for the same doctor over time if a particular doctor had differing values in 2003 and 2004 that were used to replace the missing data.
Conceptually, I'm trying to do something like this:
foreach DID year, {
replace Doc_Var1_YearWithMissingData = 1 if (Doc_Var1_YearWithMissingData == . & Doc_Var1_AnyOtherYear == 1)
replace Doc_Var1_YearWithMissingData = 2 if (Doc_Var1_YearWithMissingData == . & Doc_Var1_AnyOtherYear == 2)
}
I've also tried a much more tedious approach using multiple conditions included as "if" qualifiers with specified years in wide format (one line for each respondent with "_year" following each variable name), but I am quite certain this approach does not work as it references the PID by default rather than the DID:
gen Doc_Var1_Impute_2000 = Doc_Var1_2000
#delimit ;
replace Doc_Var1_Impute_2000 = 1 if Doc_Var1_Impute_2000 == . &
(
Doc_Var1_Impute_2001 == 1 |
Doc_Var1_Impute_2002 == 1 |
Doc_Var1_Impute_2003 == 1 |
Doc_Var1_Impute_2004 == 1 ) ;
Both of these approaches fail. The second approach fails because it just replaces missing values with other values for the same patient in different years--NOT necessarily replacing missing data about a particular doctor in 2000. This problem is caused by the fact that the data can only be organized in Stata by PATIENT rather than by DOCTOR because there is no way to identify specific observations in this data by two variables--e.g. i(DID) j(year)--that uses Doctor ID (DID) because doctors saw multiple patients each year. It is possible, by contrast, to identify specific observations with a unique patient identifier (PID) and the year -- i(PID) j(year). The first approach produces results identical to the second approach, so it seems likely that this code also does the same thing.
Any guidance would be greatly appreciated.
As an aside: I am just beginning to learn how to use loops and foreach statements. Code that has more steps (or separates discrete actions on different lines) is easier for me to translate theoretically and is far more helpful for the moment, even if it feels stilted, repetitive, or just plain ugly to an advanced user. That said, I eagerly welcome and appreciate any and all guidance.
Thanks in advance,
Andrea
Comment