Is there an effective way to find out the variables that contain "yet-to-be-labeled" values?

Shen YANG

Join Date: Apr 2023

Posts: 41
#1

Is there an effective way to find out the variables that contain "yet-to-be-labeled" values?

05 Oct 2023, 21:18

My dataset has around 1300 columns (variables) with about 900 come with value labels.

For example, in the gender variable called "pt1a". 1 is labeled as "Male", 2 is labeled as "Female", 5 is labeled as "Transgender"...

However, there's a chance of having invalid data (e.g., 99 or 98) in the gender variable, same happened to the other variables.

I want to find out (list) all the variables that contain any "yet-to-be-labeled" values.
I used the following code, but it doesn't seem to work. The "serviceid" variable has all value labelled, but the code result also include it (see the sample data).

Code:

quietly ds, has(vallabel) // List variables with value labels local vallabelled_vars `r(varlist)' foreach var of local vallabelled_vars { quietly tabulate `var', missing nolabel // Tabulate values without labels if r(miss) > 0 { // If there are uncoded values di "`var' has uncoded values" } }

Sample data attached.

Thanks.
Attached Files

Untitled.dta (5.6 KB, 1 view)

Last edited by Shen YANG; 05 Oct 2023, 21:34.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#2

05 Oct 2023, 23:10

Try this:

Code:

ds, has(vallabel) local labelled_vars `r(varlist)' foreach var of local labelled_vars { quietly levelsof `var', local(values) foreach val of local values { capture assert `"`:label (`var') `val''"' != `"`val'"' if c(rc) != 0 { display "Variable `var' has uncoded values" continue } } }

The code is not exactly correct, but it is probably close enough for practical purposes. If there is a variable with a value label which encodes a numeric value as that value (i.e. -label define mylabel 98 "98"-), this will be falsely identified as having an uncoded value. I have not been able to figure out a way around this problem. But I'm not sure it's much of a problem. It would be unusual to label a value with the value itself, and, at least in my workflow, it would probably be an error in its own right, so worth having my attention called to.

Added: In your attempted code, the line -if r(miss) > 0 {- is wrong. The preceding command, -tabulate- does not leave behind any r(miss) value. So r(miss) will always be missing, and missing value is always > 0 (or any other number). So this -if- condition will always be met, and everything will be claimed to have a missing value by that code.

Last edited by Clyde Schechter; 05 Oct 2023, 23:14.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30084

05 Oct 2023, 23:22

Wait, I figured out how to get around the problem and only report a variable if it has a truly unlabeled value. This will give you exactly what you asked for.

Code:

ds, has(vallabel)
local labelled_vars `r(varlist)'

foreach var of local labelled_vars {
    quietly levelsof `var', local(values)
    foreach val of local values {
        local lbl: label (`var') `val', strict
        if missing(`"`lbl'"') {
            display "Variable `var' has uncoded values"
            continue
        }
    }
}

I had forgotten about the -strict- option for `:label...'.

Comment

Shen YANG

Join Date: Apr 2023
Posts: 41

05 Oct 2023, 23:28

Ahh! Thanks so much, Clyde.

I was about to post another code that seems to work, then I just saw your feedback with a comprehensive explanation. Please check my code below:

Code:

foreach var of varlist _all {
    local vallab : value label `var'
    
    if "`vallab'" == "" {
    }  // these variables do not have value label assigned
    
    else {

        decode `var', generate(temp)  // generate temp var with values as the original var's labels
        
        // Check if the variable is a string type
        if strpos("`: type temp'", "str") > 0 {
            qui count if temp == ""
            // if the temp var contains null value,
            // then the corresponding variable must have value(s) yet to be labelled
            
            if r(N) > 0 {
                di "`var' contains values yet to be labelled."
            }
        }
        drop temp
    }
}

My idea is to generate another temp variable that has the values of the original variable's "labelled value", while the values that have not been labelled will become NULL in the temp variable.

Click image for larger version

Name: Screenshot 2023-10-06 at 4.27.15 pm.png
Views: 1
Size: 205.8 KB
ID: 1729327

Last edited by Shen YANG; 05 Oct 2023, 23:50.

Comment

Shen YANG

Join Date: Apr 2023

Posts: 41
#5

05 Oct 2023, 23:29

Originally posted by Clyde Schechter View Post

Wait, I figured out how to get around the problem and only report a variable if it has a truly unlabeled value. This will give you exactly what you asked for.

Code:

ds, has(vallabel) local labelled_vars `r(varlist)' foreach var of local labelled_vars { quietly levelsof `var', local(values) foreach val of local values { local lbl: label (`var') `val', strict if missing(`"`lbl'"') { display "Variable `var' has uncoded values" continue } } }

I had forgotten about the -strict- option for `:label...'.

Good work! Appreciated.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3845
#6

06 Oct 2023, 00:11

The labellacking command from SSC targets this specific problem.
2 likes
Comment
Shen YANG

Join Date: Apr 2023

Posts: 41
#7

06 Oct 2023, 01:04

Originally posted by daniel klein View Post

The labellacking command from SSC targets this specific problem.

Thank you, Daniel.

The "labellacking" seems really handy when being applied to the sample dataset above. The output is:

Code:

. labellacking, all serviceid (none) pt1a 4 98 prim_drug 5008

However, when applying it to the real dataset, there's a surprising message popped up:

Code:

. labellacking, all value label labels7 not found r(111);

I can assure you that the "labels7" is definitely NOT in the dataset, and I've also looked at it in Variable Manager.

Is there such a thing as "hidden labels" in STATA? Is it possible that I introduced something unknown when I merged the dataset?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35671
#8

06 Oct 2023, 01:11

Consider this situation:

Code:

sysuse auto, clear label val rep78 rep78 label dir

The output is just

Code:

origin

Stata's been told that rep78 has value labels rep78 but they haven't been defined (yet). This is perfectly legal, but it can lead to commands looking for labels that don't exist.

I think you'll find that

Code:

describe ds, has(vallabel)

will show you label labels7 attached to one or more of your variables.
Comment

Shen YANG

Join Date: Apr 2023
Posts: 41

06 Oct 2023, 01:21

Originally posted by Nick Cox View Post

... ...
Stata's been told that rep78 has value labels rep78 but they haven't been defined (yet).
... ...

Thanks, Nick! This totally makes sense.

For this dataset, I have been working with it for up to several months. It's hard to say what was overlooked because of the countless operations performed during the process.

Also, I recompile the code I mentioned above. It seems to recognise "invalid labels" quite well and gives a hint. I'm sharing the code and some of the real data with you here.

Code:

Code:

foreach var of varlist _all {
    local vallab : value label `var'
    
    if "`vallab'" != "" { // Only for variables with assigned value labels
        // Check if the value label definitions exist
        capture label list `vallab'
        if _rc == 0 { // Value label definitions exist
        
            // Decode the variable to get labels
            qui decode `var', generate(temp)
            
            // ignore those empty values in the original variable
            // (i.e. only focuse on those unlabeled values)
            qui replace temp="---" if `var'==.

        
            // Check for unlabeled values
            qui count if temp == "" & `var' != .
            // The condition `var' != . ensures we're only considering non-missing values in the original variable
        
            if r(N) > 0 {
                di "`var' contains values yet to be labelled."
            }
        
            // Drop the temporary variable
            qui drop temp
        }
        else {
            di "Value label definitions for `vallab' (associated with variable `var') are missing."
        }
    }
}

Output:

Code:

pt1a contains values yet to be labelled.
pt14a contains values yet to be labelled.
pt17 contains values yet to be labelled.
Value label definitions for labels7 (associated with variable pt22pre5) are missing.
pt26a contains values yet to be labelled.
pt27a contains values yet to be labelled.
pt27b contains values yet to be labelled.
Value label definitions for labels104 (associated with variable fu25d) are missing.
Value label definitions for labels104 (associated with variable fu25f) are missing.
fu34h contains values yet to be labelled.
Value label definitions for labels94 (associated with variable fu43o) are missing.
Value label definitions for labels112 (associated with variable fu51) are missing.
fu61c contains values yet to be labelled.
treatsett contains values yet to be labelled.
sds01i contains values yet to be labelled.
sds01j contains values yet to be labelled.

Partial real data:

Attached Files

real_data_partial.dta (514.5 KB, 1 view)

Last edited by Shen YANG; 06 Oct 2023, 01:24.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35671
#10

06 Oct 2023, 06:09

See also https://www.statalist.org/forums/for...d-value-labels,

In terms of various commands

ds is an official command and some functionality to handle this inconsistency might be in order.

findname (Stata Journal) has been updated and I imagine that the update will be publicly available through Stata Journal 23(4) in several weeks' time.

labellacking (SSC) might benefit from some option to deal with this, either by warning or through a more informative error message. I'll think about that.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35671
#11

09 Oct 2023, 04:41

Thanks to Kit Baum, labellacking on SSC is now fixed to find and report value labels that are not defined.
2 likes
Comment

Announcement