Correcting inconsistencies in panel data (gender and age variables)

Guest
#1

Correcting inconsistencies in panel data (gender and age variables)

30 Mar 2020, 11:00

Hi all,

I am looking to correct data inconsistencies in a panel data set between 2015 - 2017 in Stata.

The table below shows a list of survey respondents with only two years of observations for age. It is not possible to deduce their age given that different birth years are reported. Also, a most frequent response for birth year clearly does not exist. I do not want to remove these observations as this may lead to biased data but then again, there are a total of 6,200 observations. Is there an appropriate strategy to correct these inconsistencies?
Data Inconsistencies

ID Year Age Birth Year

#1 2016 44 1971

2017 42 1974

#2 2016 58 1958

2017 61 1956

#3 2016 30 1986

2017 40 1977

Similarly, the table below shows a list of survey respondents with only two years of observations for gender. A most frequent response for gender clearly does not exist. Again, I do not want to simply remove these observations in case it leads to biased data and I do want to find a way of correcting them. Is there an appropriate strategy to correct these inconsistencies?
Data Inconsistencies

ID Year Gender

#4 2016 Male

2017 Female

#5 2016 Female

2017 Male

#6 2016 Female

2017 Male

The only possible strategies I have come across are:

1. Use the first-reported value.
2. Select one of the reported values at random (Not sure how to do this in a non-biased way!)

Many thanks in advance for any advice.

Last edited by sladmin; 11 May 2020, 07:55. Reason: anonymize original poster
Tags: inconsistency, panel data
Clyde Schechter

Join Date: Apr 2014

Posts: 30173
#2

30 Mar 2020, 13:14

There are no good solutions to this problem. It's a matter of choosing a least-bad solution.

The first thing I would try is to contact the source of this data and see if they can provide corrected data. Probably they can't, but occasionally they can. If they can't, they may be able to explain how these anomalies arose, and that might shed some light on the best way to resolve them.

Unless you know something about the process that generated this data that says otherwise, there is no particular reason to think the first observation is more accurate than the later one. So I think a random selection is better:

The data tableau you show is, I hope, not based on your actual Stata data set, as the blank values in the ID variable are a problem that needs to be fixed before you do anything else. (In nearly all situations here, it is far better to show actual example data taken from your Stata data set and posted here using the -dataex- command than to type up illustrative tableaus.

Assuming in your real data set the ID variable is never missing, you can do a random selection as follows:

Code:

// IDENTIFY INCONSISTENT OBSERVATIONS by ID (Gender), sort: gen byte to_fix = (Gender[1] != Gender[_N]) set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED gen double shuffle = runiform() by ID (shuffle), sort: replace Gender = Gender[1]

Do the same for any other inconsistent variable. (Note: I might not do this for race. In the current era there are many people who are multiracial, and for them, inconsistency in their race variable represents reporting only one of two or more categories that apply, but not doing it the same way at the same time. If you are in a situation where multiraciality is common, consider creating multi-racial categories to cover people with inconsistent race variables.)

If your data set contains several million observations, then you will need to generate two double-precision random variables shuffle1 and shuffle2, and sort on both of those before selecting.
1 like
Comment
Guest
#3

30 Mar 2020, 15:51

Hi Clyde,

Thanks so much for your detailed response. I can confirm that the data tableau was an example and that my ID variable has no blank values. (This is my first post so I tried to follow the FAQs to the best of my ability and I do apologise for any mistakes!)

I have also identified some inconsistencies in my education variable which is categorised as follows: 1= no diploma, 2= high school diploma, 3= undergraduate degree, 4= postgraduate degree. The results below shows that some respondents report a fall in their educational attainment which is likely to be an error.

"140107082" 2016 2
"140107082" 2017 1
"151103297" 2016 2
"151103297" 2017 1
"160400886" 2016 4
"160400886" 2017 2

The code you have kindly provided for random selection makes changes to observations where respondents have an increase or a fall in educational attainment. How can the code be rewritten so that it only makes changes to observations displaying a fall in educational attainment?

Many thanks again for help - it is greatly appreciated.

Last edited by sladmin; 11 May 2020, 07:56. Reason: anonymize original poster
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30173
#4

30 Mar 2020, 18:32

In this situation it would probably be best to replace education non-randomly, carrying forward the previous value when the new value represents a decrease:

Code:

by id (year), sort: replace education = education[_n-1] if education < education[_n-1] & _n > 1
1 like
Comment
Guest
#5

31 Mar 2020, 06:24

Thanks Clyde for this code. I have only applied it to respondents with only two observations. As for respondents with three observations, I have assumed that the most frequent response for educational attainment is correct.

Is this okay or should the strategy should be consistent for all respondents, no matter how many observations there are?

This is certainly my final question and thanks again for your help!
Comment
Guest
#6

31 Mar 2020, 08:42

Also, what is the rationale behind carrying forward previous values rather than random selection here? Thanks.

Last edited by sladmin; 11 May 2020, 07:56. Reason: anonymize original poster
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30173
#7

31 Mar 2020, 09:20

Responding to #5 and #6, I distinguish the education variable from the others because you have a longer series of observations and it is necessarily the case that the education level must be a non-decreasing function of time. This makes it different from sex or year of birth which should be unchanging. With an unchanging value, when you have only two observations and no other information to point out which (if either) is correct, then either value is just as likely as the other and a random selection makes sense. By contrast, with a variable that you know must be non-decreasing, when you encounter a situation such as what you show in #3, a random selection may still leave you with data where education appears to decrease over time! The carry forward of the previous observation assures that the resulting sequence will be non-decreasing.

When you have variables that should be constant, like sex and year of birth, if you have a larger number of observations, then on the assumption that errors are less common than correct responses, it does make sense to use the modal response. (But note, you also need some kind of rule to break ties: what will you do if you have six responses, 3 saying male and 3 saying female? You might want to base that on the relative frequency of males and females in the population the sample was drawn from.)

Let me reiterate what I said at the top of #2: there are no good solutions to these problems. There is no statistical approach that will create correct data out of inconsistent data. All you can hope to do is create internal coherence to the data set you have and attempt to minimize misclassifications.
1 like
Comment
Guest
#8

31 Mar 2020, 10:44

Thanks very much Clyde - this has really helped me!
Comment
Julia Smith

Join Date: Apr 2020

Posts: 23
#9

23 May 2020, 03:46

Originally posted by Clyde Schechter View Post

There are no good solutions to this problem. It's a matter of choosing a least-bad solution.

The first thing I would try is to contact the source of this data and see if they can provide corrected data. Probably they can't, but occasionally they can. If they can't, they may be able to explain how these anomalies arose, and that might shed some light on the best way to resolve them.

Unless you know something about the process that generated this data that says otherwise, there is no particular reason to think the first observation is more accurate than the later one. So I think a random selection is better:

The data tableau you show is, I hope, not based on your actual Stata data set, as the blank values in the ID variable are a problem that needs to be fixed before you do anything else. (In nearly all situations here, it is far better to show actual example data taken from your Stata data set and posted here using the -dataex- command than to type up illustrative tableaus.

Assuming in your real data set the ID variable is never missing, you can do a random selection as follows:

Code:

// IDENTIFY INCONSISTENT OBSERVATIONS by ID (Gender), sort: gen byte to_fix = (Gender[1] != Gender[_N]) set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED gen double shuffle = runiform() by ID (shuffle), sort: replace Gender = Gender[1]

Do the same for any other inconsistent variable. (Note: I might not do this for race. In the current era there are many people who are multiracial, and for them, inconsistency in their race variable represents reporting only one of two or more categories that apply, but not doing it the same way at the same time. If you are in a situation where multiraciality is common, consider creating multi-racial categories to cover people with inconsistent race variables.)

If your data set contains several million observations, then you will need to generate two double-precision random variables shuffle1 and shuffle2, and sort on both of those before selecting.

Hello Clyde, thank you for your answer. What code could I run to correct age then? For an unbalanced panel in which there are some individuals who come in after 2 years or even 3. There are four panels and seven years. How about marital status where some people report married in one year (1) and never married in subsequent years(2)? And education years as well, where as OP mentioned, there are people with reduced education. Thank you.

Last edited by Julia Smith; 23 May 2020, 03:53.
Comment

Data Inconsistencies
ID	Year	Age	Birth Year
#1	2016	44	1971
	2017	42	1974
#2	2016	58	1958
	2017	61	1956
#3	2016	30	1986
	2017	40	1977

Data Inconsistencies
ID	Year	Gender
#4	2016	Male
	2017	Female
#5	2016	Female
	2017	Male
#6	2016	Female
	2017	Male

Announcement

Correcting inconsistencies in panel data (gender and age variables)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment