Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to validate if variables' value match the given data dictionary?

    Hi, I am validating a dataset that contains some demographic info of our clients.

    I need to check whether the values of each variable match the pre-defined data dictionary, and to display the incorrect values.

    The dataset looks like:

    Click image for larger version

Name:	Screenshot 2023-09-12 at 5.24.23 pm.png
Views:	1
Size:	20.6 KB
ID:	1726838


    The STATA code is like:

    Code:
    /* data dictionary */
    quietly{
        local dobstatus_v             "1 2"
        local sexatbirth_v             "1 2 95 99"
        local sexualorientation_v    "1 2 3 5 96 97 99"
        local gender_v                "1 2 99"
    }
    
    /// Identify if variable is NOT string type. check if their values matches data dictionary
    quietly{
        noisily display "================================================="
        noisily display "                Value Checking                   "
        noisily display "  The following variables may contain data that "
        noisily display "  does not match NADA dictionary. Please check!"
        noisily display "================================================="
        
        foreach var in dobstatus sexatbirth sexualorientation gender{
            local checktype: type `var'                                /* use marco to store the variable type */
            if substr("`checktype'", 1, 3) == "str" {                /* check if the var is string type */
                noisily display "*** `var' is still string type"    /* Warning if the var is string type */
                continue /* skip to next */
            }
    
            /* Check variable against its data dictionary */
            local valueList = "`var'_v" 
            gen str_`var' = string(`var')
            if strpos(" ``valueList'' ", str_`var') == 0 {
                noisily list clientid `var'
            } 
            else {
                noisily display "`var' is good."
            }
            drop str_`var'
        }
    
    }
    As can be seen, 87 in gender variable is incorrect. But I am getting the result as below:

    > quietly{
    =================================================
    Value Checking
    The following variables may contain data that
    does not match NADA dictionary. Please check!
    =================================================
    dobstatus is good.
    sexatbirth is good.
    sexualorientation is good.
    gender is good.


    Please help. Thanks!

  • #2
    Forgot to mention, above are just a subset of the variables, and some variables are string format that could have value like: 1, 2, B, M...
    That's why I use string-type data dictionary, and also convert all variables to string type (a temporary var), and compare them with the dictionary.

    Thanks.

    Comment


    • #3
      Resolved!

      Code:
      /* data dictionary */
      quietly{
          local dobstatus_v             "1 2"
          local sexatbirth_v             "1 2 95 99"
          local sexualorientation_v    "1 2 3 5 96 97 99"
          local gender_v                "1 2 99"
      }
      
      /// Identify if variable is NOT string type. check if their values matches data dictionary
      quietly{
          noisily display "================================================="
          noisily display "                Value Checking                   "
          noisily display "  The following variables may contain data that "
          noisily display "  does not match NADA dictionary. Please check!"
          noisily display "================================================="
          
          foreach var in dobstatus sexatbirth sexualorientation gender{
              local checktype: type `var'                                /* use marco to store the variable type */
              if substr("`checktype'", 1, 3) == "str" {                /* check if the var is string type */
                  noisily display "*** `var' is still string type"    /* Warning if the var is string type */
                  continue /* skip to next */
              }
      
              /* Check variable against its respective list */
              local valueList " ``var'_v"
              gen flag_`var' = 1 if strpos("``valueList''", string(`var')) == 0
              egen uniqueValues = total(flag_`var')
      
              if uniqueValues > 0 {
                  noisily list clientid `var' if flag_`var' == 1
                  drop flag_`var' uniqueValues
              }
              else {
                  noisily display "`var' is good."
                  drop flag_`var' uniqueValues
              }
          }
      
      }
      Output:
      Attached Files

      Comment


      • #4
        Glad you solved your problem. FWIW, here's a different approach that's a bit shorter and, to my eyes, clearer.
        Code:
        local dobstatus_v             "1 2"
        local sexatbirth_v             "1 2 95 99"
        local sexualorientation_v    "1 2 3 5 96 97 99"
        local gender_v                "1 2 99"
        
        foreach v of varlist dobstatus sexatbirth sexualorientation gender {
            capture confirm numeric var `v', exact
            if c(rc) == 0 {
                levelsof `v', local(observed) separate(",")
                local bad_values: list observed - `v'_v
                if `"`bad_values'"' != "" {
                    display "`v' contains invalid values: `bad_values'"
                    gen byte flag_`v' = inlist(`v', `bad_values')
                }
                else {
                    display "`v' is good"
                }
            }
        }
        As an aside, although it does no harm, there is no point in applying -quietly- to -local- commands as they produce no output anyway.

        If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        Because your example data was posted as a screenshot, there is no way to import it to Stata, so I have not tested the above code. I think it is correct, but it may contain typos or other errors.
        Last edited by Clyde Schechter; 12 Sep 2023, 14:05.

        Comment

        Working...
        X