Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Is there an effective way to find out the variables that contain "yet-to-be-labeled" values?

    My dataset has around 1300 columns (variables) with about 900 come with value labels.

    For example, in the gender variable called "pt1a". 1 is labeled as "Male", 2 is labeled as "Female", 5 is labeled as "Transgender"...

    However, there's a chance of having invalid data (e.g., 99 or 98) in the gender variable, same happened to the other variables.
    Click image for larger version

Name:	Screenshot 2023-10-06 at 2.14.51 pm.png
Views:	1
Size:	26.1 KB
ID:	1729309




    I want to find out (list) all the variables that contain any "yet-to-be-labeled" values.
    I used the following code, but it doesn't seem to work. The "serviceid" variable has all value labelled, but the code result also include it (see the sample data).

    Code:
    quietly ds, has(vallabel) // List variables with value labels
    local vallabelled_vars `r(varlist)'
    
    foreach var of local vallabelled_vars {
        quietly tabulate `var', missing nolabel // Tabulate values without labels
        if r(miss) > 0 { // If there are uncoded values
            di "`var' has uncoded values"
        }
    }
    Sample data attached.

    Thanks.
    Attached Files
    Last edited by Shen YANG; 05 Oct 2023, 21:34.

  • #2
    Try this:
    Code:
    ds, has(vallabel)
    local labelled_vars `r(varlist)'
    
    foreach var of local labelled_vars {
        quietly levelsof `var', local(values)
        foreach val of local values {
            capture assert `"`:label (`var') `val''"' != `"`val'"'
            if c(rc) != 0 {
                display "Variable `var' has uncoded values"
                continue
            }
        }
    }
    The code is not exactly correct, but it is probably close enough for practical purposes. If there is a variable with a value label which encodes a numeric value as that value (i.e. -label define mylabel 98 "98"-), this will be falsely identified as having an uncoded value. I have not been able to figure out a way around this problem. But I'm not sure it's much of a problem. It would be unusual to label a value with the value itself, and, at least in my workflow, it would probably be an error in its own right, so worth having my attention called to.

    Added: In your attempted code, the line -if r(miss) > 0 {- is wrong. The preceding command, -tabulate- does not leave behind any r(miss) value. So r(miss) will always be missing, and missing value is always > 0 (or any other number). So this -if- condition will always be met, and everything will be claimed to have a missing value by that code.
    Last edited by Clyde Schechter; 05 Oct 2023, 23:14.

    Comment


    • #3
      Wait, I figured out how to get around the problem and only report a variable if it has a truly unlabeled value. This will give you exactly what you asked for.

      Code:
      ds, has(vallabel)
      local labelled_vars `r(varlist)'
      
      foreach var of local labelled_vars {
          quietly levelsof `var', local(values)
          foreach val of local values {
              local lbl: label (`var') `val', strict
              if missing(`"`lbl'"') {
                  display "Variable `var' has uncoded values"
                  continue
              }
          }
      }
      I had forgotten about the -strict- option for `:label...'.

      Comment


      • #4
        Ahh! Thanks so much, Clyde.

        I was about to post another code that seems to work, then I just saw your feedback with a comprehensive explanation. Please check my code below:

        Code:
        foreach var of varlist _all {
            local vallab : value label `var'
            
            if "`vallab'" == "" {
            }  // these variables do not have value label assigned
            
            else {
        
                decode `var', generate(temp)  // generate temp var with values as the original var's labels
                
                // Check if the variable is a string type
                if strpos("`: type temp'", "str") > 0 {
                    qui count if temp == ""
                    // if the temp var contains null value,
                    // then the corresponding variable must have value(s) yet to be labelled
                    
                    if r(N) > 0 {
                        di "`var' contains values yet to be labelled."
                    }
                }
                drop temp
            }
        }
        My idea is to generate another temp variable that has the values of the original variable's "labelled value", while the values that have not been labelled will become NULL in the temp variable.
        Click image for larger version

Name:	Screenshot 2023-10-06 at 4.27.15 pm.png
Views:	1
Size:	205.8 KB
ID:	1729327

        Last edited by Shen YANG; 05 Oct 2023, 23:50.

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          Wait, I figured out how to get around the problem and only report a variable if it has a truly unlabeled value. This will give you exactly what you asked for.

          Code:
          ds, has(vallabel)
          local labelled_vars `r(varlist)'
          
          foreach var of local labelled_vars {
          quietly levelsof `var', local(values)
          foreach val of local values {
          local lbl: label (`var') `val', strict
          if missing(`"`lbl'"') {
          display "Variable `var' has uncoded values"
          continue
          }
          }
          }
          I had forgotten about the -strict- option for `:label...'.
          Good work! Appreciated.

          Comment


          • #6
            The labellacking command from SSC targets this specific problem.

            Comment


            • #7
              Originally posted by daniel klein View Post
              The labellacking command from SSC targets this specific problem.

              Thank you, Daniel.

              The "labellacking" seems really handy when being applied to the sample dataset above. The output is:

              Code:
              . labellacking, all
              
              serviceid   (none)
              pt1a        4 98
              prim_drug   5008

              However, when applying it to the real dataset, there's a surprising message popped up:

              Code:
              .  labellacking, all
              value label labels7 not found
              r(111);
              I can assure you that the "labels7" is definitely NOT in the dataset, and I've also looked at it in Variable Manager.

              Is there such a thing as "hidden labels" in STATA? Is it possible that I introduced something unknown when I merged the dataset?

              Comment


              • #8
                Consider this situation:


                Code:
                sysuse auto, clear 
                label val rep78 rep78 
                label dir
                The output is just
                Code:
                origin
                Stata's been told that rep78 has value labels rep78 but they haven't been defined (yet). This is perfectly legal, but it can lead to commands looking for labels that don't exist.

                I think you'll find that

                Code:
                describe 
                ds, has(vallabel)
                will show you label labels7 attached to one or more of your variables.

                Comment


                • #9
                  Originally posted by Nick Cox View Post
                  ... ...
                  Stata's been told that rep78 has value labels rep78 but they haven't been defined (yet).
                  ... ...
                  Thanks, Nick! This totally makes sense.

                  For this dataset, I have been working with it for up to several months. It's hard to say what was overlooked because of the countless operations performed during the process.

                  Also, I recompile the code I mentioned above. It seems to recognise "invalid labels" quite well and gives a hint. I'm sharing the code and some of the real data with you here.

                  Code:
                  Code:
                  foreach var of varlist _all {
                      local vallab : value label `var'
                      
                      if "`vallab'" != "" { // Only for variables with assigned value labels
                          // Check if the value label definitions exist
                          capture label list `vallab'
                          if _rc == 0 { // Value label definitions exist
                          
                              // Decode the variable to get labels
                              qui decode `var', generate(temp)
                              
                              // ignore those empty values in the original variable
                              // (i.e. only focuse on those unlabeled values)
                              qui replace temp="---" if `var'==.
                  
                          
                              // Check for unlabeled values
                              qui count if temp == "" & `var' != .
                              // The condition `var' != . ensures we're only considering non-missing values in the original variable
                          
                              if r(N) > 0 {
                                  di "`var' contains values yet to be labelled."
                              }
                          
                              // Drop the temporary variable
                              qui drop temp
                          }
                          else {
                              di "Value label definitions for `vallab' (associated with variable `var') are missing."
                          }
                      }
                  }
                  Output:
                  Code:
                  pt1a contains values yet to be labelled.
                  pt14a contains values yet to be labelled.
                  pt17 contains values yet to be labelled.
                  Value label definitions for labels7 (associated with variable pt22pre5) are missing.
                  pt26a contains values yet to be labelled.
                  pt27a contains values yet to be labelled.
                  pt27b contains values yet to be labelled.
                  Value label definitions for labels104 (associated with variable fu25d) are missing.
                  Value label definitions for labels104 (associated with variable fu25f) are missing.
                  fu34h contains values yet to be labelled.
                  Value label definitions for labels94 (associated with variable fu43o) are missing.
                  Value label definitions for labels112 (associated with variable fu51) are missing.
                  fu61c contains values yet to be labelled.
                  treatsett contains values yet to be labelled.
                  sds01i contains values yet to be labelled.
                  sds01j contains values yet to be labelled.
                  Partial real data:
                  Attached Files
                  Last edited by Shen YANG; 06 Oct 2023, 01:24.

                  Comment


                  • #10
                    See also https://www.statalist.org/forums/for...d-value-labels,

                    In terms of various commands

                    ds is an official command and some functionality to handle this inconsistency might be in order.

                    findname (Stata Journal) has been updated and I imagine that the update will be publicly available through Stata Journal 23(4) in several weeks' time.

                    labellacking (SSC) might benefit from some option to deal with this, either by warning or through a more informative error message. I'll think about that.

                    Comment


                    • #11
                      Thanks to Kit Baum, labellacking on SSC is now fixed to find and report value labels that are not defined.

                      Comment

                      Working...
                      X