  • Check variable consistency

    Hi all,

    How can I check whether race is consistent in my dataset and then replace the inconsistent values with the right one? For example,

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 race float id
    "w" 1
    "w" 1
    "w" 1
    "w" 1
    "b" 1
    "w" 1
    "w" 1
    "w" 1
    "w" 2
    "w" 2
    "w" 2
    end
    ------------------ copy up to and including the previous line ------------------

    Best,

    Jack Liang

  • #2
    Presumably the inconsistency is that race is not always constant for a given id.

    See documented tricks at

    https://www.stata.com/support/faqs/d...ions-in-group/

    http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

    including

    Code:
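    * Flag one observation for each distinct id-race pair
    egen tag = tag(id race)
    * Count the distinct race values within each id
    egen ndistinct = total(tag), by(id)
    * Show the ids with more than one race value
    list if ndistinct > 1

    The bigger deal is how you establish what the right value is. Even race might not be constant if people change their views or some classification changes.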

    egen, mode() may help, however.
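
A minimal sketch of the egen, mode() suggestion, assuming the example data from #1. The variable name race_mode is made up for illustration, and whether the modal value is the "right" one remains a judgment call.

```stata
* Example data from #1
clear
input str1 race float id
"w" 1
"w" 1
"w" 1
"w" 1
"b" 1
"w" 1
"w" 1
"w" 1
"w" 2
"w" 2
"w" 2
end

* Most frequent race value within each id
* (set to missing if two or more values tie; see the tie options of mode())
egen race_mode = mode(race), by(id)

* Observations whose recorded race differs from the modal value for their id
list id race race_mode if race != race_mode
```

Keeping the result in a new variable, rather than overwriting race, leaves the original reports available for checking.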
    Last edited by Nick Cox; 26 Mar 2018, 09:06.


    • #3
      Originally posted by Nick Cox
      Presumably the inconsistency is that race is not always constant for a given id.

      Thank you!


      • #4
        I will add that in my experience, when the same people are asked to identify their race on multiple occasions, they frequently give different responses. The concept of race is some vague mix of social and psychological and biological effects. The race of any person may well be assessed differently by different people at the same time, or by the same person (even the person whose race is being assessed) at different times. So the type of variation you are seeing is the norm.

        So an irreducible problem you face is this: once you identify which people have more than one assessment of their race recorded in your data, how would you possibly know which is the "correct" one? While some of the variability may well be data recording or data management errors, some of it is due to the vagueness of the construct itself. Within the data set you cannot usually distinguish these sources of variation. Various approaches are used by different investigators, and I can't really say that any one is better than others. Among the approaches I have seen:

        1. Use the first-reported value.
        2. Use the modal value, with some decision rule for breaking any ties.
        3. Set up a function that maps all combinations of reported races to a single "reference" race.
        4. Contact the person whose race is in question and discuss it with him/her. (Only feasible in limited circumstances.)
        5. Reclassify all people with multiple values of race in a separate category.
        6. Select one of the reported values at random.

        One corollary: although it is common and popular to report race effects in certain lines of research, given the difficulty of measuring it, these results usually should be viewed skeptically.
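
Some of the approaches listed above can be sketched in Stata as follows. This is only an illustration under stated assumptions: the example data from #1 has no date variable, so the current row order is assumed to be the reporting order, and the names obsno, race_first, race_multi, u and race_random, the label "multiple" and the seed are all hypothetical.

```stata
* Example data from #1
clear
input str1 race float id
"w" 1
"w" 1
"w" 1
"w" 1
"b" 1
"w" 1
"w" 1
"w" 1
"w" 2
"w" 2
"w" 2
end

* Assumption: the current row order is the order in which values were reported
generate long obsno = _n

* Approach 1: first-reported value within each id
bysort id (obsno): generate race_first = race[1]

* Approach 5: separate category for ids with more than one reported value
egen tag = tag(id race)
egen ndistinct = total(tag), by(id)
generate race_multi = cond(ndistinct > 1, "multiple", race)

* Approach 6: pick the value of one randomly chosen observation per id
* (values reported more often are more likely to be chosen)
set seed 12345
generate double u = runiform()
bysort id (u): generate race_random = race[1]

* Restore the original order and inspect the results
sort obsno
list id race race_first race_multi race_random, sepby(id)
```

Each rule writes to its own variable, so the candidates can be compared side by side before one is adopted.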


        • #5
          Originally posted by Clyde Schechter
          Thank you very much for these race classification methods. I will use one of them in my future project. In addition, can we use the value with the highest proportion?

          Best,

          Jack Liang


          • #6
            In addition, can we use the value with the highest proportion?
            That is what I meant by the "modal value."
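
To make the equivalence concrete, here is a small sketch, assuming the example data from #1, that computes each value's share within id next to the mode. The names n_race, share and race_mode are hypothetical, and minmode is just one possible tie-breaking rule (it takes the lowest of the tied values).

```stata
* Example data from #1
clear
input str1 race float id
"w" 1
"w" 1
"w" 1
"w" 1
"b" 1
"w" 1
"w" 1
"w" 1
"w" 2
"w" 2
"w" 2
end

* Number of observations for each id-race pair
bysort id race: generate n_race = _N

* Share of each race value among all observations for that id
bysort id: generate share = n_race / _N

* Modal value within id; minmode breaks ties by taking the lowest value
egen race_mode = mode(race), by(id) minmode

list id race share race_mode, sepby(id)
```

Within each id, the value with the largest share is the one reported in race_mode.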


            • #7
              Note also late addition to #2:

              egen, mode() may help, however.


              • #8
                Originally posted by Nick Cox
                Note also late addition to #2:

                egen, mode() may help, however.
                Awesome, thanks.
