
  • Imputing age variable using difference in survey wave years

    Dear all,

    I have five survey waves:

    1: 1990
    2: 1995
    3: 2000
    4: 2007
    5: 2010

    I have data on age for individuals in these five waves. However, sometimes the age for the same individual in wave 2 is miscoded, so that it is not five years (or four years) greater than it was in wave 1. I also have instances where age is recorded in wave 2 but missing in wave 1, and I would like to replace the missing value with the wave 2 age minus 5, giving me the age of the individual in wave 1. In other instances, age is missing in wave 2 but recorded in wave 5, and I would like to replace the missing wave 2 value with the wave 5 age minus the number of years between waves 2 and 5. An example of my data is presented below:


    clear
    input long pid float wave double age
    1060003 2 .
    1080001 5 59
    1080003 1 15
    1080003 2 .
    1080004 5 34
    1080005 5 32
    1080006 5 30
    1080007 5 28
    1080008 5 26
    1080009 5 24
    1220006 3 15
    1220006 4 22
    1220006 5 28
    1220008 5 24
    1240002 1 40
    end


    Thank you in advance.

  • #2
    Are you sure all individuals in your sample are observed in each wave? The fact that for some individuals you have their age only from Wave 2 onwards makes me think they may have been added to the sample in Wave 2...



    • #3
      Not all individuals are observed in each wave; some entered in wave 2, others in wave 3, and so on. But ideally, if I have their age in wave 3, I can calculate their age in wave 1 by subtracting the number of years between the waves from that age, no? I am just not quite sure how to do this (I am quite new to Stata).



      • #4
        Well, if it were just a question of filling in the years where no age is recorded, this would be pretty simple:

        Code:
        clear*
        input int wave int year
        1 1990
        2 1995
        3 2000
        4 2007
        5 2010
        end
        tempfile calendar
        save `calendar'
        
        clear
        input long pid float wave double age
        1060003 2 .
        1080001 5 59
        1080003 1 15
        1080003 2 .
        1080004 5 34
        1080005 5 32
        1080006 5 30
        1080007 5 28
        1080008 5 26
        1080009 5 24
        1220006 3 15
        1220006 4 22
        1220006 5 28
        1220008 5 24
        1240002 1 40
        end
        
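        * Attach the calendar year of each wave
        merge m:1 wave using `calendar', keep(match master)
        
        clonevar imputed_age = age
        * Fill backward: within pid, work from the latest wave to the earliest
        gsort pid -wave
        by pid: replace imputed_age = imputed_age[_n-1] + year - year[_n-1] ///
            if missing(imputed_age)
        * Fill forward: within pid, work from the earliest wave to the latest
        by pid (wave), sort: replace imputed_age = imputed_age[_n-1] + year - year[_n-1] ///
            if missing(imputed_age)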

        But I don't think it is a good idea to try to "correct" inconsistent ages in this way, as it is not clear which of the inconsistent ages are correct and which are wrong.
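
One way to see how widespread the inconsistencies are, without deciding which value is right, is to flag them. The sketch below is an illustration under assumptions, not part of the code above: it treats year minus age as an implied birth year and marks any pid whose implied birth years differ by more than one year across waves (a one-year spread can arise from interview timing alone). The variable names implied_birth, ib_min, ib_max and age_conflict are made up for this example, and the one-year tolerance is an assumption.

```stata
* Sketch only: flag, rather than "correct", inconsistent ages.
* A small extract of the example data, with the wave year already attached.
clear
input long pid int(wave year) double age
1080003 1 1990 15
1080003 2 1995 .
1220006 3 2000 15
1220006 4 2007 22
1220006 5 2010 28
end

* Birth year implied by each observation (missing where age is missing)
gen implied_birth = year - age

* Smallest and largest implied birth year per person; egen ignores missings
egen ib_min = min(implied_birth), by(pid)
egen ib_max = max(implied_birth), by(pid)

* Flag a spread of more than one year (tolerance is an assumption);
* the !missing() guard matters because missing > 1 evaluates to true in Stata
gen byte age_conflict = (ib_max - ib_min > 1) & !missing(ib_min)

list pid wave year age implied_birth if age_conflict, sepby(pid)
```

In this extract, pid 1220006 is flagged: age 15 in 2000 and age 22 in 2007 both imply a birth year of 1985, but age 28 in 2010 implies 1982.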



        • #5
          Thank you very much Clyde, this is really helpful. One final question, if I may: If I wanted to find the age at marriage based on the difference between the year the marriage started and the birth year and replace the missing observations with this value, could I employ the same sort of command?

          Here is my tentative command:

          by pid: replace age_marriage = year_marriage-birthyear ///
          if missing(age_marriage)
          by pid (wave), sort: replace age_marriage = year_marriage-birthyear ///
          if missing(age_marriage)

          However, it does not seem to work.

          Thank you once again,

          Enrique



          • #6
            Your example data did not include the birth_year or year_marriage variables. Depending on how those are placed within the data, different approaches might be needed. Please post back with example data that includes these variables.



            • #7
              My apologies, Clyde. Alternatively, could I not calculate the age at marriage for the missing observations as the age in the survey year minus the number of years between the survey year and the year of marriage? In this case my tentative code would be:

              by pid: replace age_marriage = age-(year-year_marriage) ///
              if missing(age_marriage)
              by pid (wave), sort: replace age_marriage = age - (year-year_marriage) ///
              if missing(age_marriage)

              Would this be appropriate?



              Here is the data:



              input long pid float wave double(year_marriage birth_year)


              1060003 2 . 1978
              1080001 5 . 1955
              1080003 1 . 1978
              1080003 2 . 1978
              1080004 5 . 1980
              1080005 5 . 1982
              1080006 5 . 1984
              1080007 5 . 1986
              1080008 5 . 1988
              1080009 5 . 1990
              1220006 3 . 1985
              1220006 4 . 1985
              1220006 5 . 1985
              1220008 5 . 1990
              1240002 1 . 1953
              1240002 2 . .
              1240002 3 . .
              1240002 4 . .
              1240002 5 1967 1952
              1240006 2 . 1980
              1240007 2 . 1982
              1240011 1 . 1971
              1240012 5 . 1990
              1250005 4 . 1986
              1250005 5 . 1986
              1250006 4 . 1987
              1250007 4 . 1990
              1250007 5 . 1990
              1290004 4 . 1986
              1290005 4 . 1990
              1290006 4 . 1991
              1290009 3 . 1961
              1290009 4 . 1961
              2010005 4 . 1988
              2010005 5 . 1988
              2010006 4 . 1990
              2020004 3 . 1976
              2020005 3 . 1979
              2020005 4 . 1979
              2020005 5 . 1979
              2020006 3 . 1981
              2020007 4 . 1983
              2030003 4 . 1986
              2040004 5 . 1989
              2060004 4 . 1981
              2060005 4 . 1985
              2060005 5 . 1985
              2090002 4 . 1970
              2090003 4 . 1982
              2090004 4 . 1988
              2090005 4 . 1940
              2090006 4 . 1983
              2090007 4 . 1982
              2090010 4 . 1984
              2093102 3 2000 1979
              2093301 3 2000 1980
              2100001 3 1986 1968
              end


              Thank you in advance
              Last edited by Enrique Alameda; 26 Feb 2023, 15:08.



              • #8
                Well, the difficulty with your suggested code is that the variable year_marriage is mostly missing. Now, for those who never married, that makes sense. But for those who did, these accidental gaps in the data defeat your code. The gist of an approach that overcomes this is:
                Code:
                by pid (year_marriage), sort: replace year_marriage = year_marriage[1]
                gen age_at_marriage = year_marriage - birth_year
                sort pid wave

                However, there is a problem. For this to all make sense, the variables year_marriage and birth_year should both be consistent: each should be the same (or missing) in every observation of the same person. But that is not true of your data. pid 1240002 has two different birth years reported in different observations. So before you can really do this, you need to fix inconsistent observations like that.

                I'm willing to guess that there are also people with inconsistently reported marriage years. That raises an even more difficult problem: while it might be that one value is an error and the other correct, it could also be that the person married twice. I don't know how you can distinguish those possibilities reliably. And I also don't know what you want to call age at marriage for a person who has married more than once.
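
As a rough way to find the observations that need fixing first, the sketch below flags every pid that reports more than one distinct non-missing value of birth_year or year_marriage. It is only an illustration on a small extract of the posted data; the suffixes _min, _max and _conflict are names invented for this example, and it does nothing to decide which of the conflicting values is correct.

```stata
* Sketch only: find people with conflicting birth or marriage years.
clear
input long pid float wave double(year_marriage birth_year)
1240002 1 . 1953
1240002 5 1967 1952
2093102 3 2000 1979
2100001 3 1986 1968
end

foreach v of varlist birth_year year_marriage {
    // Per-person minimum and maximum; egen ignores missing values
    egen `v'_min = min(`v'), by(pid)
    egen `v'_max = max(`v'), by(pid)
    // Conflict if the two differ; people with no data at all are not flagged
    gen byte `v'_conflict = (`v'_min != `v'_max) & !missing(`v'_min)
    drop `v'_min `v'_max
}

list pid wave birth_year year_marriage ///
    if birth_year_conflict | year_marriage_conflict, sepby(pid)
```

Here pid 1240002 is flagged for birth_year (1953 in wave 1, 1952 in wave 5), which would give an age at marriage of 14 or 15 depending on which record is used.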



                • #9
                  Thank you for your help and for highlighting these issues, Clyde. I will see what can be done to address them.

                  All the best

