Create list of variables w/ different redaction patterns & dropping specific redacted variables based on other non-redacted vars

Li Huang

Join Date: Jul 2019
Posts: 9


05 Jul 2019, 18:04

Hello,

Disclaimer: my dataset contains PHI, so I have recreated a replica with substituted variable names and values that are functionally equivalent to mine.

I have a dataset with many variables (>1,000), of which more than 100 contain redacted values.
The variables with redacted values follow different patterns. Some have the entire column set to "redacted"; others have scattered values that are blank, missing, or "redacted".
The value that signifies redaction also differs: sometimes it is "redacted" and other times it is "[REDACTED]".
The ID variable is unique in the dataset (i.e. it is the primary key for the patient).
The variables NAME, MRN, ADDRESS, and CLINIC_ID are fictitious but serve the purpose of this question.

Example (have):

ID	NAME	MRN	ADDRESS	CLINIC_ID
1	"[REDACTED]"	"redacted"	"redacted"	6
2	"[REDACTED]"		"redacted"	10
3	"[REDACTED]"		"redacted"	5
4	"[REDACTED]"	"redacted"	"redacted"	33
5	"[REDACTED]"	"redacted"	"redacted"	2
6	"[REDACTED]"		"redacted"	3
7	"[REDACTED]"		"redacted"	1
8	"[REDACTED]"	"redacted"	"redacted"	6
9	"[REDACTED]"		"redacted"	4

GOAL 1: I would like to create 3 lists, each containing the names of certain variables:

LIST 1: Includes all the variables that have their entire column (e.g. NAME and ADDRESS) =="redacted" or "[REDACTED]"
LIST 2: Includes all the variables that have any values (e.g. NAME, MRN, and ADDRESS ) =="redacted" or "[REDACTED]"
LIST 3: Includes all the variables that have some values (but not all, e.g. only columns like MRN above) =="redacted" or "[REDACTED]"

GOAL 2: Lastly, I am looking for two ways to drop variables using these lists:

Drop all of the variables within a given variable list.
Drop only specific variables within a given variable list. For example, drop only the variables in LIST 3 if CLINIC_ID==6.

I am lost on how to accomplish this in Stata (I am still quite a novice). The beginning of my attempt to even display these variables in just one list is below:

Code:

    foreach var of varlist * {
      capture assert `var' if `var'=="redacted"
      if _rc {
        display in smcl as text "variable {result}`var' "_continue
        display in smcl as text "contains redacted"
      }
    }

I believe creating the list of all variables is sufficient just using

Code:

var of varlist *

and I don't need to use something like -unab- .

However, the foreach loop just displays all of the variables in the dataset. Additionally, I can obtain the list without any extra text by reducing the display line to

Code:

display in smcl "{result}`var' "

But again, the list of variables I am creating/displaying is not limited to the specific variables that I desire and I am unaware of how to accomplish both GOAL 1 and GOAL 2.

Thanks in advance,

LH

Tags: data, foreach, loop, Suggestion, syntax

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

06 Jul 2019, 04:02

Code:

clear
input ID str20 ( NAME    MRN    ADDRESS )    CLINIC_ID
1    "[REDACTED]"    "redacted"    "redacted"    6
2    "[REDACTED]"    ""    "redacted"    10
3    "[REDACTED]"    ""    "redacted"    5
4    "[REDACTED]"    "redacted"    "redacted"    33
5    "[REDACTED]"    "redacted"    "redacted"    2
6    "[REDACTED]"    ""    "redacted"    3
7    "[REDACTED]"    ""    "redacted"    1
8    "[REDACTED]"    "redacted"    "redacted"    6
9    "[REDACTED]"    ""    "redacted"    4
end

foreach v of varlist * {

    capture confirm str var `v'
    
    if ( _rc == 0 ) {
    
        count if strpos(upper(`v'),"REDACTED")
        
        if ( r(N) == 0 ) {
        
            char `v'[redacted_none]  "True"
        }
        
        else {
    
            if ( r(N) == _N ) {
            
                char `v'[redacted_all]  "True"
            }
            
            if ( r(N) <= _N ) {
            
                char `v'[redacted_any]  "True"
            }
            
            if ( r(N) < _N ) {
            
                char `v'[redacted_some]  "True"
            }
        }
    }
}

ds * , has(char redacted_all)
ds * , has(char redacted_any)
ds * , has(char redacted_some)

ds * , has(char redacted_some)
local redacted_some `r(varlist)'
describe `redacted_some'
drop `redacted_some'
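The three lists from GOAL 1 can also be kept in local macros for later use. The following is only a sketch under stated assumptions, not the method above: it accumulates names directly in locals instead of using characteristics and -ds-, it uses a reduced 3-row version of the example data, and the macro names list_all, list_any and list_some are hypothetical. Because -drop- removes a variable for every observation, "dropping LIST 3 variables if CLINIC_ID==6" is read here as blanking those values, which is an assumption about the intent.

```stata
* Sketch only: reduced 3-row version of the example data
clear
input ID str20 NAME str20 MRN str20 ADDRESS CLINIC_ID
1 "[REDACTED]" "redacted" "redacted" 6
2 "[REDACTED]" ""         "redacted" 10
3 "[REDACTED]" "redacted" "redacted" 6
end

* Hypothetical macro names; start with empty lists
local list_all
local list_any
local list_some

foreach v of varlist * {
    capture confirm string variable `v'
    if ( _rc == 0 ) {
        * number of observations whose value contains "redacted" (any case)
        quietly count if strpos(upper(`v'), "REDACTED")
        local n = r(N)
        if ( `n' > 0 )             local list_any  `list_any' `v'
        if ( `n' > 0 & `n' == _N ) local list_all  `list_all' `v'
        if ( `n' > 0 & `n' < _N )  local list_some `list_some' `v'
    }
}

display "all:  `list_all'"
display "any:  `list_any'"
display "some: `list_some'"

* Assumed reading of the conditional "drop": blank the values of the
* partly redacted variables for one clinic; the variables themselves stay
foreach v of local list_some {
    replace `v' = "" if CLINIC_ID == 6
}
```

To remove every variable in a list outright, `drop `list_all'` (or any of the other macros) works, provided it runs in the same do-file or selection as the loop that defines the macros.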

Last edited by Bjarte Aagnes; 06 Jul 2019, 04:10.


Li Huang

Join Date: Jul 2019

Posts: 9
#3

07 Jul 2019, 16:41

Thank you; this is a great solution. One follow up question. Why do the last 4 lines:

Code:

ds * , has(char redacted_some)
local redacted_some `r(varlist)'
describe `redacted_some'
drop `redacted_some'

have to be run at the same time, i.e. by highlighting all of the lines and running them once? I am using Stata 15.1 and I get different output in the Results window if I run them separately. To be exact, when I run

Code:

local redacted_some `r(varlist)'

then sequentially

Code:

describe `redacted_some'

my results are just the entire list of variables. This is all after I have run the code above this in your solution.

Thanks again for the help.

Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#4

08 Jul 2019, 02:53

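The command -ds- returns the variable list in r(varlist), and the second line copies it into the local macro redacted_some. The scope of a local macro is restricted to the program or do-file where it is defined; when you highlight and run lines from a do-file, the scope is only the lines run together.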

Running the single line

Code:

describe `redacted_some'

the local is not defined, so it expands to nothing and the line is interpreted as

Code:

describe
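The scoping behaviour can be seen in a small sketch. The two-row dataset and the names strvars, no_such_local and STRVARS are made up for illustration. An undefined local simply expands to an empty string, whereas a global macro persists for the rest of the session and so survives running lines one at a time.

```stata
* Sketch only: tiny made-up dataset with one string variable
clear
input ID str10 MRN
1 "redacted"
2 ""
end

ds, has(type string)
local  strvars `r(varlist)'     // visible only within this run
global STRVARS `r(varlist)'     // visible for the rest of the session

display "[`strvars']"           // prints [MRN]
display "[`no_such_local']"     // prints []: an undefined local is empty

describe $STRVARS               // still works if run later on its own
```

Globals can clash with other code, so running the relevant lines together and keeping the list in a local is usually the tidier choice.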
Li Huang

Join Date: Jul 2019

Posts: 9
#5

12 Jul 2019, 16:01

Oh yes, I see! Thank you for the guidance.