Check variable consistency

Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#1

26 Mar 2018, 08:28

Hi all,

How can I check whether race is consistent within each id in my dataset, and then replace the inconsistent values with the right one? For example,

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 race float id
"w" 1
"w" 1
"w" 1
"w" 1
"b" 1
"w" 1
"w" 1
"w" 1
"w" 2
"w" 2
"w" 2
end

------------------ copy up to and including the previous line ------------------

Best,

Jack Liang
Nick Cox

Join Date: Mar 2014

Posts: 35467
#2

26 Mar 2018, 08:49

See documented tricks at

https://www.stata.com/support/faqs/d...ions-in-group/

http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

including

Code:

egen tag = tag(id race)
egen ndistinct = total(tag), by(id)
list if ndistinct > 1

The bigger deal is how you establish what the right value is. Even race might not be constant if people change their views or some classification changes.

egen, mode() may help, however.
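As a minimal sketch of how the check and egen, mode() could be combined on the example data from #1: the variable names race_mode and race_clean are made up for illustration, and treating the most frequent value as the "right" one is an assumption, not a recommendation.

```stata
clear
input str1 race float id
"w" 1
"w" 1
"w" 1
"w" 1
"b" 1
"w" 1
"w" 1
"w" 1
"w" 2
"w" 2
"w" 2
end

* Count distinct values of race within each id
egen tag = tag(id race)
egen ndistinct = total(tag), by(id)

* Most frequent value of race within each id.
* With tied modes the result is missing ("") unless
* minmode, maxmode or nummode() is specified.
bysort id: egen race_mode = mode(race)

* Assumption: the modal value is taken as the reference value.
* The original variable is left untouched.
gen race_clean = race
replace race_clean = race_mode if ndistinct > 1 & race_mode != ""

list id race race_clean if ndistinct > 1
```

Writing the result to a new variable keeps the recorded values available for checking later.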

Last edited by Nick Cox; 26 Mar 2018, 09:06.
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#3

26 Mar 2018, 09:03

Originally posted by Nick Cox

Presumably the inconsistency is that race is not always constant for a given id.

See documented tricks at

https://www.stata.com/support/faqs/d...ions-in-group/

http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

including

Code:

egen tag = tag(id race)
egen ndistinct = total(tag), by(id)
list if ndistinct > 1

Thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 29967
#4

26 Mar 2018, 09:03

I will add that in my experience, when the same people are asked to identify their race on multiple occasions, they frequently give different responses. The concept of race is some vague mix of social and psychological and biological effects. The race of any person may well be assessed differently by different people at the same time, or by the same person (even the person whose race is being assessed) at different times. So the type of variation you are seeing is the norm.

So an irreducible problem you face is this: once you identify which people have more than one assessment of their race recorded in your data, how would you possibly know which is the "correct" one? While some of the variability may well be data recording or data management errors, some of it is due to the vagueness of the construct itself. Within the data set you cannot usually distinguish these sources of variation. Various approaches are used by different investigators, and I can't really say that any one is better than others. Among the approaches I have seen:

1. Use the first-reported value.
2. Use the modal value, with some decision rule for breaking any ties.
3. Set up a function that maps all combinations of reported races to a single "reference" race.
4. Contact the person whose race is in question and discuss it with him/her. (Only feasible in limited circumstances.)
5. Reclassify all people with multiple values of race in a separate category.
6. Select one of the reported values at random.
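Some of the approaches listed above (1, 2, 5 and 6) can be sketched in Stata on the example data from #1. This is only an illustration under stated assumptions: the example data have no date or wave variable, so observation order (obsno) stands in for reporting order, and all new variable names, the "multiple" label and the seed are hypothetical.

```stata
clear
input str1 race float id
"w" 1
"w" 1
"w" 1
"w" 1
"b" 1
"w" 1
"w" 1
"w" 1
"w" 2
"w" 2
"w" 2
end

* Assumption: current observation order is the reporting order
gen long obsno = _n

* 1. First-reported value within each id
bysort id (obsno): gen race_first = race[1]

* 2. Modal value; minmode is one possible tie-breaking rule
*    (it picks the lowest of the tied modes)
bysort id: egen race_modal = mode(race), minmode

* 5. Separate category for ids with more than one recorded value.
*    After sorting on race within id, first and last differ
*    exactly when the id has more than one distinct value.
bysort id (race): gen race_multi = cond(race[1] == race[_N], race, "multiple")

* 6. One reported value chosen at random within each id
set seed 12345
gen double u = runiform()
bysort id (u): gen race_random = race[1]

sort obsno
list id race race_first race_modal race_multi race_random, sepby(id)
```

Approaches 3 and 4 depend on a mapping or on information from outside the dataset, so they are not sketched here.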

One corollary: although it is common and popular to report race effects in certain lines of research, given the difficulty of measuring it, these results usually should be viewed skeptically.
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#5

26 Mar 2018, 18:36

Thank you very much for these race classification methods. I will use one of them in my future project. In addition, can we use the value with the highest proportion?

Best,

Jack Liang
Clyde Schechter

Join Date: Apr 2014

Posts: 29967
#6

26 Mar 2018, 18:37

In addition, can we use the value with the highest proportion?

That is what I meant by the "modal value."
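To make the equivalence concrete, here is a small sketch on the example data from #1 that computes each value's share within id and picks the value with the largest share. The variable names n_value, share and race_top are hypothetical, and breaking ties by taking the alphabetically last value is an arbitrary choice made only for this illustration.

```stata
clear
input str1 race float id
"w" 1
"w" 1
"w" 1
"w" 1
"b" 1
"w" 1
"w" 1
"w" 1
"w" 2
"w" 2
"w" 2
end

* Number of observations with each value of race within id
bysort id race: gen long n_value = _N

* Share of that value among all observations for the id
bysort id: gen double share = n_value / _N

* Sort so the most frequent value is last within id and copy it
* to every observation (ties go to the alphabetically last value)
bysort id (n_value race): gen race_top = race[_N]

list id race share race_top, sepby(id)
```

For id 1 the share of "w" is 7/8, so race_top is "w", which is also the mode.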
Nick Cox

Join Date: Mar 2014

Posts: 35467
#7

26 Mar 2018, 19:06

Note also late addition to #2:

egen, mode() may help, however.
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#8

27 Mar 2018, 07:09

Originally posted by Nick Cox

Note also late addition to #2:

egen, mode() may help, however.

Awesome, thanks.