
  • Create list of variables w/ different redaction patterns & dropping specific redacted variables based on other non-redacted vars

    Hello,

    Disclaimer: the dataset contains PHI, so I have created a replica with substituted variable names and values that is functionally equivalent to mine.
    1. I have a dataset with many variables (>1,000), of which more than 100 have redacted values.
    2. The variables with redacted values follow different patterns. In some, the entire column is "redacted"; in others, the values are a scattered mix of blank, missing, and "redacted".
    3. The value that signifies redaction also differs: sometimes it is "redacted" and other times it is "[REDACTED]".
    4. The ID variable is unique in the dataset (i.e. the primary key for the patient).
    5. The variables NAME, MRN, ADDRESS, and CLINIC_ID are fictitious but serve the purpose of this question.
    Example (have):
    ID  NAME          MRN         ADDRESS     CLINIC_ID
    1   "[REDACTED]"  "redacted"  "redacted"  6
    2   "[REDACTED]"  ""          "redacted"  10
    3   "[REDACTED]"  ""          "redacted"  5
    4   "[REDACTED]"  "redacted"  "redacted"  33
    5   "[REDACTED]"  "redacted"  "redacted"  2
    6   "[REDACTED]"  ""          "redacted"  3
    7   "[REDACTED]"  ""          "redacted"  1
    8   "[REDACTED]"  "redacted"  "redacted"  6
    9   "[REDACTED]"  ""          "redacted"  4

    GOAL 1: I would like to create 3 variable lists:
    1. LIST 1: Includes all the variables whose entire column (e.g. NAME and ADDRESS) =="redacted" or "[REDACTED]"
    2. LIST 2: Includes all the variables that have any values (e.g. NAME, MRN, and ADDRESS) =="redacted" or "[REDACTED]"
    3. LIST 3: Includes all the variables that have some, but not all, values (e.g. only columns like MRN above) =="redacted" or "[REDACTED]"
    GOAL 2: Lastly, I am looking for two ways to drop variables from these lists:
    1. Drop all of the variables within the variable list created
    2. Drop only specific variables within the variable list created.
      1. For example, if I want to drop only the variables in LIST 3 if CLINIC_ID==6

    I am lost on how to accomplish this in Stata (I am still quite a novice). The beginning of my attempt to even display these variables in just one list is below:

    Code:
        foreach var of varlist * {
          capture assert `var' if `var'=="redacted"
          if _rc {
            display in smcl as text "variable {result}`var' "_continue
            display in smcl as text "contains redacted"
          }
        }
    I believe creating the list of all variables is sufficient just using

    Code:
    var of varlist *
    and I don't need to use something like -unab- .

    However, the foreach loop just displays all of the variables in the dataset. I can also obtain the list without any extra text by reducing the display line to simply

    Code:
    display in smcl "{result}`var' "
    But again, the list of variables I am creating/displaying is not limited to the specific variables that I desire and I am unaware of how to accomplish both GOAL 1 and GOAL 2.

    Thanks in advance,

    LH
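A likely reason the loop in the question lists every variable: -assert `var'- needs a numeric expression, and `var'=="redacted" compares against a string, so the command fails with a type mismatch for string and numeric variables alike. The captured _rc is therefore nonzero each time and the display lines always run. The sketch below reproduces this on a two-row toy dataset (made up for illustration) and shows one possible replacement test; it is not the only way to write it.

```stata
clear
input ID str20 NAME
1 "[REDACTED]"
2 "redacted"
end

// Original test: it errors for every variable, so _rc is always nonzero.
foreach var of varlist * {
    capture assert `var' if `var' == "redacted"
    display "`var': _rc = " _rc
}

// One possible replacement: skip non-string variables, then count matches.
foreach var of varlist * {
    capture confirm string variable `var'
    if _rc continue
    quietly count if inlist(`var', "redacted", "[REDACTED]")
    if r(N) > 0 {
        display as text "variable " as result "`var'" as text " contains redacted"
    }
}
```

With this toy data only NAME should be reported, because ID is numeric and is skipped.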





  • #2
    Code:
    clear
    input ID str20 ( NAME    MRN    ADDRESS )    CLINIC_ID
    1    "[REDACTED]"    "redacted"    "redacted"    6
    2    "[REDACTED]"    ""    "redacted"    10
    3    "[REDACTED]"    ""    "redacted"    5
    4    "[REDACTED]"    "redacted"    "redacted"    33
    5    "[REDACTED]"    "redacted"    "redacted"    2
    6    "[REDACTED]"    ""    "redacted"    3
    7    "[REDACTED]"    ""    "redacted"    1
    8    "[REDACTED]"    "redacted"    "redacted"    6
    9    "[REDACTED]"    ""    "redacted"    4
    end
    
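    // Tag each string variable with a characteristic describing its redaction pattern
    foreach v of varlist * {

        // only string variables can hold "redacted" or "[REDACTED]"
        capture confirm str var `v'

        if ( _rc == 0 ) {

            // count observations containing "redacted" in any letter case
            count if strpos(upper(`v'),"REDACTED")

            if ( r(N) == 0 ) {
                char `v'[redacted_none]  "True"
            }
            else {

                // LIST 1: every observation is redacted
                if ( r(N) == _N ) {
                    char `v'[redacted_all]  "True"
                }

                // LIST 2: at least one observation is redacted (always true in this branch)
                if ( r(N) <= _N ) {
                    char `v'[redacted_any]  "True"
                }

                // LIST 3: some, but not all, observations are redacted
                if ( r(N) < _N ) {
                    char `v'[redacted_some]  "True"
                }
            }
        }
    }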
    
    ds * , has(char redacted_all)
    ds * , has(char redacted_any)
    ds * , has(char redacted_some)
    
    ds * , has(char redacted_some)
    local redacted_some `r(varlist)'
    describe `redacted_some'
    drop `redacted_some'
    Last edited by Bjarte Aagnes; 06 Jul 2019, 04:10.
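As an alternative to characteristics, the three lists can also be built directly as local macros, which makes the two kinds of dropping in GOAL 2 a matter of macro list arithmetic. This is only a sketch under assumptions: the local names list_all, list_any, list_some, and protect are hypothetical, the data are a four-row subset of the example, and "drop LIST 3 if CLINIC_ID==6" is read here as blanking those values, since a variable cannot be dropped for only some observations.

```stata
clear
input ID str20 (NAME MRN ADDRESS) CLINIC_ID
1 "[REDACTED]" "redacted" "redacted" 6
2 "[REDACTED]" ""         "redacted" 10
3 "[REDACTED]" ""         "redacted" 5
4 "[REDACTED]" "redacted" "redacted" 33
end

local list_all
local list_any
local list_some

// Classify each string variable by how many observations are redacted.
ds, has(type string)
foreach v of varlist `r(varlist)' {
    quietly count if strpos(upper(`v'), "REDACTED")
    if r(N) > 0 {
        local list_any `list_any' `v'
        if r(N) == _N {
            local list_all `list_all' `v'
        }
        else {
            local list_some `list_some' `v'
        }
    }
}
display "all:  `list_all'"
display "any:  `list_any'"
display "some: `list_some'"

// One reading of the conditional drop: blank LIST 3 values where CLINIC_ID == 6.
foreach v of local list_some {
    quietly replace `v' = "" if CLINIC_ID == 6
}

// Drop only specific variables: remove a hypothetical protected set from the list first.
local protect ADDRESS
local to_drop : list list_all - protect
drop `to_drop'
```

To drop every variable in a list instead, -drop `list_all'- (or any of the other locals) is enough, provided it runs in the same do-file execution as the loop that defines the local.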



    • #3
      Thank you; this is a great solution. One follow up question. Why do the last 4 lines:

      Code:
      ds * , has(char redacted_some)
      local redacted_some `r(varlist)'
      describe `redacted_some'
      drop `redacted_some'
      have to be run at the exact same time? i.e. highlight the lines and run once. I am using Stata 15.1 and I receive different results in the results window if I run separately. To be exact, when I run

      Code:
      local redacted_some `r(varlist)'
      then sequentially

      Code:
      describe `redacted_some'
      my results are just the entire list of variables. This is all after I have run the code above this in your solution.

      Thanks again for the help.



      • #4

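        The command -ds- returns the variable list in r(varlist), and the second line copies it into the local macro redacted_some. The scope of a local macro is restricted to the program or do-file in which it is defined; when you run selected lines from the Do-file Editor, that scope is just the lines run together.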

        Running the single line
        Code:
        describe `redacted_some'
        the local is not defined in that run, so it expands to nothing and the code is interpreted as
        Code:
        describe
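The behaviour described above can be checked with a small sketch: a local that was never defined in the current run expands to an empty string, which is why -describe- ends up listing everything. The names strvars and nosuchlocal and the guard at the end are illustrative assumptions, not part of the original solution.

```stata
clear
input ID str20 (NAME MRN) CLINIC_ID
1 "[REDACTED]" "redacted" 6
2 "[REDACTED]" ""         10
end

ds, has(type string)
local strvars `r(varlist)'

// Defined in this run: expands to the stored variable names.
display "strvars = [`strvars']"

// Never defined: expands to nothing, so -describe `nosuchlocal'-
// would be read as plain -describe-.
display "nosuchlocal = [`nosuchlocal']"

// Hypothetical guard: stop before using a list that is unexpectedly empty.
if "`strvars'" == "" {
    display as error "strvars is empty; run the -ds- and -local- lines together"
    exit 198
}
describe `strvars'
```

Running the whole block at once keeps the -local- definition and its use in the same scope, which is the situation in which the last four lines of the solution in #2 work as intended.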



        • #5
          Oh yes, I see! Thank you for the guidance.
