Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replace missing annual doctor data w/ data from different years/patients for same doctor ID in patient dataset

    Hello,

    I have a longitudinal dataset that has data about patients (unique patient identifier = PID, patient data collection for time-varying variables begins in 1990) and data about doctors (unique doctor identifier = DID, data collection for time-fixed variables about doctors begins in 2001 and is therefore missing prior to 2000). The data includes the year of data collection and a series of patient and doctor variables. Doctors entered the data in different years, some as late as 2003, and some only participated for one year. Doctors are associated with multiple patients in each year and individual patients see multiple doctors over time, so there is no unique DID + year combination nor are there unique PID + DID combinations. A unique identifier based on the DID would need to specify DID first, then PID as well as year. Here is an example of what the data looks like, sorted by DID and then by Doc_Var1:
    PID Year DID Doc_Var1
    1 2001 100 2
    5 2003 100 2
    1 2002 110 .
    5 2004 110 .
    2 2002 120 1
    3 2003 120 1
    4 2002 120 1
    4 2003 120 1
    3 2002 120 2
    4 2004 120 .
    5 2000 120 .
    5 2002 130 .
    1 2003 150 1
    2 2003 150 .
    3 2000 150 .
    2 2001 200 2
    1 2000 200 .
    2 2000 200 .
    4 2000 200 .
    3 2001 310 1
    3 2004 310 .
    2 2004 400 2
    4 2001 400 .
    1 2004 500 2
    5 2001 500 2
    I need to replace missing data for variables about doctors (e.g. Doc_Var1 in column 4) that is missing for observations prior to 2001 using non-missing data for the same doctor that was collected during any doctor-patient interaction during or after 2001. In addition to doctor data that is missing because this data was not part of the survey prior to 2001, data is also missing for unknown reasons, during or after 2001, as is the case in the 3rd and 4th rows of this sample data for patients #1 and #5 in 2002 and 2004 for DID 110). Because of this second kind of missing data, I can't just sort the data by the DID and the variable for which there is missing data (as the data above is currently sorted using DID and Doc_Var1) and then replace missing data with data from the cell directly above it, as doing so would reference data about a different doctor: in the case of DID 110 in lines 3 and 4 of this sample data, such an approach would replace the missing data for DID 110 with data about DID 100 for Doc_Var1 in line 1 of this sample data. In these cases, the data cannot be replaced by this method and must be left missing.)

    The problem that I'm facing in my rudimentary code is that there is no way to identify a missing observation by a doctor in combination with a year -- it's only possible to identify (and therefore replace) data that can be identified by a patient ID and a year. But replacing data using the PID and the year would simply replace missing data that is supposed to be about a specific doctor with data collected about, potentially, a completely different doctor who saw that same patient in a different year. So this is my problem:

    How do I tell Stata to:
    1. Find observations with missing data for Doc_Var1 (doctor variable one, fixed over time).
    2. For each Doctor ID (DID) with missing values for Doc_Var1, replace missing data for Doc_Var1 with non-missing data for Doc_Var1 for that same DID in years that data exists.

    Ideally, I only want to replace missing data. Some of the data has "errors" in it that result in supposedly "time-fixed" data changing over time. I want to keep these "errors" in the data for now. I therefore expect that, for instance, a loop used to replace missing data for Doc_Var1 might return different values for the Doc_Var1 for the same doctor over time if a particular doctor had differing values in 2003 and 2004 that were used to replace the missing data.

    Conceptually, I'm trying to do something like this:

    foreach DID year, {
    replace Doc_Var1_YearWithMissingData = 1 if (Doc_Var1_YearWithMissingData == . & Doc_Var1_AnyOtherYear == 1)

    replace Doc_Var1_YearWithMissingData = 2 if (Doc_Var1_YearWithMissingData == . & Doc_Var1_AnyOtherYear == 2)
    }


    I've also tried a much more tedious approach using multiple conditions included as "if" qualifiers with specified years in wide format (one line for each respondent with "_year" following each variable name), but I am quite certain this approach does not work as it references the PID by default rather than the DID:

    gen Doc_Var1_Impute_2000 = Doc_Var1_2000

    #delimit ;

    replace Doc_Var1_Impute_2000 = 1 if Doc_Var1_Impute_2000 == . &
    (
    Doc_Var1_Impute_2001 == 1 |
    Doc_Var1_Impute_2002 == 1 |
    Doc_Var1_Impute_2003 == 1 |
    Doc_Var1_Impute_2004 == 1 ) ;


    Both of these approaches fail. The second approach fails because it just replaces missing values with other values for the same patient in different years--NOT necessarily replacing missing data about a particular doctor in 2000. This problem is caused by the fact that the data can only be organized in Stata by PATIENT rather than by DOCTOR because there is no way to identify specific observations in this data by two variables--e.g. i(DID) j(year)--that uses Doctor ID (DID) because doctors saw multiple patients each year. It is possible, by contrast, to identify specific observations with a unique patient identifier (PID) and the year -- i(PID) j(year). The first approach produces results identical to the second approach, so it seems likely that this code also does the same thing.

    Any guidance would be greatly appreciated.

    As an aside: I am just beginning to learn how to use loops and foreach statements. Code that has more steps (or separates discrete actions on different lines) is easier for me to translate theoretically and is far more helpful for the moment, even if it feels stilted, repetitive, or just plain ugly to an advanced user. That said, I eagerly welcome and appreciate any and all guidance.

    Thanks in advance,
    Andrea
    Last edited by Andrea Kauffman-Berry; 17 Jan 2018, 12:52.

  • #2
    I'm not sure I understand your data description after working with your data example.

    From your description, I had the expectation that the value of doc_var1 would be the same for all observations of a given doctor ID, regardless of year or patient in the observation, except that it might, perhaps be missing for one of several reasons.

    But your example data does not conform to that. For example, the following observations occur in your data:

    Code:
        pid   year   did   doc_var1  
          2   2002   120          1  
          3   2003   120          1  
          4   2002   120          1  
          4   2003   120          1  
          3   2002   120          2
    So doc_var1 is apparently not really a doctor-level variable, because it takes on different values for the same did in year 2002, depending on which pid we're talking about. Is that an error in your example? If not, how do you want to choose which value of doc_var1 to copy to any other observation on did 120 year 2002 that is missing doc_var1?

    I'm also confused about whether doc_var1 is allowed to change from year to year for the same doctor (or perhaps for the same doctor-patient pair)? Again I would expect this to be the case, because if it isn't, then the whole idea of imputing 2001 values to the doctor (or doctor patient pair) in 2000 doesn't really make sense. But your data seem to contradict this:

    Code:
        pid   year   did   doc_var1  
          3   2003   120          1  
          3   2002   120          2
    Here we have the same doctor and patient, but the value of doc_var1 changes between 2002 and 2003.

    Can you clarify what's going on and what you hope to do?

    Comment


    • #3
      Thank you, Clyde.

      You are correct -- I tried to clarify this problem in the data in my initial post:

      "Ideally, I only want to replace missing data. Some of the data has "errors" in it that result in supposedly "time-fixed" data changing over time. I want to keep these "errors" in the data for now. I therefore expect that, for instance, a loop used to replace missing data for Doc_Var1 might return different values for the Doc_Var1 for the same doctor over time if a particular doctor had differing values in 2003 and 2004 that were used to replace the missing data."

      All of the data that I want to replace in this manner IS time-fixed, but there are errors in the data itself such as the one that I included in this sample data (I got the year wrong in my description of the data in my initial post). Using the data for Doc_Var1 for the doctor with DID 120 as an example, this doctor has a presumed "error" in the data for patient 3 in 2002 -- all other data points for Doc_Var1 for DID 120 are a "1" while for patient 3 in 2002 the response for Doc_Var1 is a "2". Most of these errors are unimportant and can be corrected manually (there are not that many). My plan is to identify any DID that has variation for time-fixed variables before and after replacement of data in this manner and address these errors for the approximately 900 DIDs.

      Comment


      • #4
        Well, personally, I would fix the data errors before spreading them through the data in this way.

        But if you really want to proceed, and, as I understand it, you are indifferent to which of the conflicting values is used for the purpose, then it's actually fairly simple:

        Code:
        by did (doc_var1), sort: replace doc_var1 = doc_var1[1] if missing(doc_var1)
        The logic is that for numeric variables, missing values sort last and non-missing values sort first. So assuming there are any non-missing values of doc_var1 for a given did, the first such value after sorting in this way will be among them. That value is then used to replace any missing values of doc_var1 in other observations.

        Just remember that after doing this, errors in the original data may be multiplied, so you will need to re-identify them

        Comment


        • #5
          Thank you! This is excellent! I greatly appreciate your help.

          Alright, so it sounds like the thing to do is to (1) identify and determine what to do with the small number of erroneous values now and then (2) run this code to replace the missing values with non-missing values for the same DID from observations that occurred in 2001 or later.

          Just to confirm that I understand the logic of the first part of this code: by stating "by did (doc_var1)" I am ensuring that no missing values for doc_var1 will be replaced for a DID that has only missing values.

          Thank you again.

          Comment


          • #6
            The code will replace missing values by non-missing values, if there are some non-missing values. If all values are missing, then all of the missing values will be replaced by missing value--which means that they don't change (or only change from one missing value to another, which, for your purposes doesn't matter.)

            And yes, I would first identify and fix the erroneous values first.

            Comment

            Working...
            X