
  • How to detect data entry error: locating an observation that does not match any of the observations of a given set?

    Hi Statalist,
    Apologies for the vagueness of the question; I didn't know how to frame it better. Hopefully the details below convey it more clearly. Please consider the following example:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str4 var1 str8 var2 str6 var3
    "hhid" "memberid" "headid"
    "111"  "1"        "3"     
    "111"  "2"        "3"     
    "111"  "3"        "3"     
    "112"  "1"        "1"     
    "112"  "2"        "1"     
    "112"  "3"        "1"     
    "112"  "4"        "1"     
    "113"  "1"        "4"     
    "113"  "2"        "4"     
    "113"  "3"        "4"     
    end
    Now, as can be seen, for hhid 113 the headid has been wrongly entered as 4, while none of the members of that household has an id of 4. I suspect something like this has happened in my data, and I want to identify the hhid for which this anomaly occurs. My understanding is that the code would check, within each hhid, whether headid belongs to the set of memberids, and flag the hhid if it does not.
    However, I have been unable to figure out how to write this code and would appreciate any help from the community.

    Thanks,
    Titir

  • #2
    This example should point you in a useful direction. It counts the number of individuals in each hhid for whom the memberid and headid are the same.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int hhid byte(memberid headid)
    111 1 3
    111 2 3
    111 3 3
    112 1 1
    112 2 1
    112 3 1
    112 4 1
    113 1 4
    113 2 4
    113 3 4
    end
    sort hhid memberid
    by hhid: generate nhead = sum(memberid==headid)
    by hhid: replace nhead = nhead[_N]
    list if nhead!=1, sepby(hhid)
    Code:
    . list if nhead!=1, sepby(hhid)
    
         +----------------------------------+
         | hhid   memberid   headid   nhead |
         |----------------------------------|
      8. |  113          1        4       0 |
      9. |  113          2        4       0 |
     10. |  113          3        4       0 |
         +----------------------------------+
    If this were my data, I would expand this code to look for households where headid is not the same for every member, and for households where the same memberid appears more than once, both of which could lead you to having a household with more than one head.
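    The additional checks described above could be sketched roughly as follows. This is only one possible way to write them, using the same example data; the flag names (has_head, bad_head, multi_headid, dup_member, any_dup) are made up for illustration and are not part of the original code.

```stata
* Sketch only: all flag variable names below are hypothetical.
clear
input int hhid byte(memberid headid)
111 1 3
111 2 3
111 3 3
112 1 1
112 2 1
112 3 1
112 4 1
113 1 4
113 2 4
113 3 4
end

* Check 1: headid matches no memberid within the household
bysort hhid: egen byte has_head = max(memberid == headid)
generate byte bad_head = !has_head

* Check 2: headid is not the same for every member of the household
bysort hhid (headid): generate byte multi_headid = headid[1] != headid[_N]

* Check 3: the same memberid appears more than once in the household
bysort hhid memberid: generate byte dup_member = _N > 1
bysort hhid: egen byte any_dup = max(dup_member)

* List every household that fails at least one check
list if bad_head | multi_headid | any_dup, sepby(hhid)
```

    With the example data only household 113 should be listed, because it fails the first check and no household fails the other two. On real data, missing values in memberid or headid would need separate handling before relying on these flags.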



    • #3
      Originally posted by William Lisowski
      Thank you so much for your response, William. I'll try what you have suggested.
