Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Correcting inconsistencies in panel data (gender and age variables)

    Hi all,

    I am looking to correct data inconsistencies in a panel data set between 2015 - 2017 in Stata.

    The table below shows a list of survey respondents with only two years of observations for age. It is not possible to deduce their age given that different birth years are reported. Also, a most frequent response for birth year clearly does not exist. I do not want to remove these observations as this may lead to biased data but then again, there are a total of 6,200 observations. Is there an appropriate strategy to correct these inconsistencies?
    Data Inconsistencies
    ID Year Age Birth Year
    #1 2016 44 1971
    2017 42 1974
    #2 2016 58 1958
    2017 61 1956
    #3 2016 30 1986
    2017 40 1977









    Similarly, the table below shows a list of survey respondents with only two years of observations for gender. A most frequent response for gender clearly does not exist. Again, I do not want to simply remove these observations in case it leads to biased data and I do want to find a way of correcting them. Is there an appropriate strategy to correct these inconsistencies?
    Data Inconsistencies
    ID Year Gender
    #4 2016 Male
    2017 Female
    #5 2016 Female
    2017 Male
    #6 2016 Female
    2017 Male









    The only possible strategies I have come across are:

    1. Use the first-reported value.
    2. Select one of the reported values at random (Not sure how to do this in a non-biased way!)

    Many thanks in advance for any advice.
    Last edited by sladmin; 11 May 2020, 07:55. Reason: anonymize original poster

  • #2
    There are no good solutions to this problem. It's a matter of choosing a least-bad solution.

    The first thing I would try is to contact the source of this data and see if they can provide corrected data. Probably they can't, but occasionally they can. If they can't, they may be able to explain how these anomalies arose, and that might shed some light on the best way to resolve them.

    Unless you know something about the process that generated this data that says otherwise, there is no particular reason to think the first observation is more accurate than the later one. So I think a random selection is better:

    The data tableau you show is, I hope, not based on your actual Stata data set, as the blank values in the ID variable are a problem that needs to be fixed before you do anything else. (In nearly all situations here, it is far better to show actual example data taken from your Stata data set and posted here using the -dataex- command than to type up illustrative tableaus.

    Assuming in your real data set the ID variable is never missing, you can do a random selection as follows:

    Code:
    // IDENTIFY INCONSISTENT OBSERVATIONS
    by ID (Gender), sort: gen byte to_fix = (Gender[1] != Gender[_N])
    set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
    gen double shuffle = runiform()
    by ID (shuffle), sort: replace Gender = Gender[1]
    Do the same for any other inconsistent variable. (Note: I might not do this for race. In the current era there are many people who are multiracial, and for them, inconsistency in their race variable represents reporting only one of two or more categories that apply, but not doing it the same way at the same time. If you are in a situation where multiraciality is common, consider creating multi-racial categories to cover people with inconsistent race variables.)

    If your data set contains several million observations, then you will need to generate two double-precision random variables shuffle1 and shuffle2, and sort on both of those before selecting.

    Comment


    • #3
      Hi Clyde,

      Thanks so much for your detailed response. I can confirm that the data tableau was an example and that my ID variable has no blank values. (This is my first post so I tried to follow the FAQs to the best of my ability and I do apologise for any mistakes!)

      I have also identified some inconsistencies in my education variable which is categorised as follows: 1= no diploma, 2= high school diploma, 3= undergraduate degree, 4= postgraduate degree. The results below shows that some respondents report a fall in their educational attainment which is likely to be an error.

      "140107082" 2016 2
      "140107082" 2017 1
      "151103297" 2016 2
      "151103297" 2017 1
      "160400886" 2016 4
      "160400886" 2017 2

      The code you have kindly provided for random selection makes changes to observations where respondents have an increase or a fall in educational attainment. How can the code be rewritten so that it only makes changes to observations displaying a fall in educational attainment?

      Many thanks again for help - it is greatly appreciated.





      Last edited by sladmin; 11 May 2020, 07:56. Reason: anonymize original poster

      Comment


      • #4
        In this situation it would probably be best to replace education non-randomly, carrying forward the previous value when the new value represents a decrease:

        Code:
        by id (year), sort: replace education = education[_n-1] if education < education[_n-1] & _n > 1

        Comment


        • #5
          Thanks Clyde for this code. I have only applied it to respondents with only two observations. As for respondents with three observations, I have assumed that the most frequent response for educational attainment is correct.

          Is this okay or should the strategy should be consistent for all respondents, no matter how many observations there are?

          This is certainly my final question and thanks again for your help!

          Comment


          • #6
            Also, what is the rationale behind carrying forward previous values rather than random selection here? Thanks.
            Last edited by sladmin; 11 May 2020, 07:56. Reason: anonymize original poster

            Comment


            • #7
              Responding to #5 and #6, I distinguish the education variable from the others because you have a longer series of observations and it is necessarily the case that the education level must be a non-decreasing function of time. This makes it different from sex or year of birth which should be unchanging. With an unchanging value, when you have only two observations and no other information to point out which (if either) is correct, then either value is just as likely as the other and a random selection makes sense. By contrast, with a variable that you know must be non-decreasing, when you encounter a situation such as what you show in #3, a random selection may still leave you with data where education appears to decrease over time! The carry forward of the previous observation assures that the resulting sequence will be non-decreasing.

              When you have variables that should be constant, like sex and year of birth, if you have a larger number of observations, then on the assumption that errors are less common than correct responses, it does make sense to use the modal response. (But note, you also need some kind of rule to break ties: what will you do if you have six responses, 3 saying male and 3 saying female? You might want to base that on the relative frequency of males and females in the population the sample was drawn from.)

              Let me reiterate what I said at the top of #2: there are no good solutions to these problems. There is no statistical approach that will create correct data out of inconsistent data. All you can hope to do is create internal coherence to the data set you have and attempt to minimize misclassifications.

              Comment


              • #8
                Thanks very much Clyde - this has really helped me!

                Comment


                • #9
                  Originally posted by Clyde Schechter View Post
                  There are no good solutions to this problem. It's a matter of choosing a least-bad solution.

                  The first thing I would try is to contact the source of this data and see if they can provide corrected data. Probably they can't, but occasionally they can. If they can't, they may be able to explain how these anomalies arose, and that might shed some light on the best way to resolve them.

                  Unless you know something about the process that generated this data that says otherwise, there is no particular reason to think the first observation is more accurate than the later one. So I think a random selection is better:

                  The data tableau you show is, I hope, not based on your actual Stata data set, as the blank values in the ID variable are a problem that needs to be fixed before you do anything else. (In nearly all situations here, it is far better to show actual example data taken from your Stata data set and posted here using the -dataex- command than to type up illustrative tableaus.

                  Assuming in your real data set the ID variable is never missing, you can do a random selection as follows:

                  Code:
                  // IDENTIFY INCONSISTENT OBSERVATIONS
                  by ID (Gender), sort: gen byte to_fix = (Gender[1] != Gender[_N])
                  set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
                  gen double shuffle = runiform()
                  by ID (shuffle), sort: replace Gender = Gender[1]
                  Do the same for any other inconsistent variable. (Note: I might not do this for race. In the current era there are many people who are multiracial, and for them, inconsistency in their race variable represents reporting only one of two or more categories that apply, but not doing it the same way at the same time. If you are in a situation where multiraciality is common, consider creating multi-racial categories to cover people with inconsistent race variables.)

                  If your data set contains several million observations, then you will need to generate two double-precision random variables shuffle1 and shuffle2, and sort on both of those before selecting.
                  Hello Clyde, thank you for your answer. What code could I run to correct age then? For an unbalanced panel in which there are some individuals who come in after 2 years or even 3. There are four panels and seven years. How about marital status where some people report married in one year (1) and never married in subsequent years(2)? And education years as well, where as OP mentioned, there are people with reduced education. Thank you.
                  Last edited by Julia Smith; 23 May 2020, 03:53.

                  Comment

                  Working...
                  X