Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Delete string portion in one field based on contents of other fields

    So I'm using Stata 13.1 and have a question. How can I delete string portions in a field based on the contents of another field?

    I am using a dataset that includes a variable called comments that includes brief write-ups of student disciplinary issues (this is individual-level data). Some comments may include a student's name and I need to de-identify those comments before the data are sharable. Can I use the variables for first, middle, and last names to search the comments field to get rid of identifying information?

    I realize there's probably at least two ways to accomplish what I'm asking.

    1. The first is to take the data one row at a time.
    2. The other is to search each row's comment field and remove name strings based on the complete list of names available in the data.

    The second way is preferable, if doable.

  • #2
    The simplest way to do this is to use subinstr() and replace any name found in the comment variable with a null string, effectively erasing it. This is effectively your first approach.

    Code:
    replace Comment=subinstr(Comment, first_name, "", .)
    replace Comment=subinstr(Comment, middle_name, "", .)
    replace Comment=subinstr(Comment, last_name, "",.)
    You will have a problem if names are misspelled or if abbreviated names are used in the comments (Bob, Will, Sam, Liz, etc).
    Stata/MP 14.1 (64-bit x86-64)
    Revision 19 May 2016
    Win 8.1

    Comment


    • #3
      Thanks, Carole. Would your code take into account differences in case (capitalized, not capitalized)? There are so many difficulties that can present themselves; this is just one.

      Comment


      • #4
        No, the previous code require the cases to match. If you want to search for names in Comment that are lowercase versions of the name variables, you would use this:

        Code:
        replace Comment=subinstr(Comment, lower(first_name) ,  "" , .)
        You can search for both upper & lower simultaneously with regular expressions. See help regex.
        Stata/MP 14.1 (64-bit x86-64)
        Revision 19 May 2016
        Win 8.1

        Comment


        • #5
          My problem is that the names fields are input in all caps, but the comments are 1) all caps, 2) all lower case, or 3) only first letter capitalized. Most are in the latter category. The first two I have taken care of using your code.

          Another issue I've noticed is that some names from other rows show up in comments. Since the data I'm working with are real world disciplinary data, you have cases where students get in trouble as a group. In these cases, the same write-up might be used and all students' names included. Is there no way to handle that?

          Comment


          • #6
            How large are the datasets (how many names)?
            Stata/MP 14.1 (64-bit x86-64)
            Revision 19 May 2016
            Win 8.1

            Comment


            • #7
              There are about 60,000 rows of data representing 20,000 students. As you can imagine, some students get in trouble more than once in an academic year.

              By the way, I figured out how to convert my all caps strings into a string where only the first letter is capitalized, so I've got all three possible formats I mentioned above. I just want to make sure that I exclude student names from comments that might come from another row in the data.

              Thanks so much for your help thus far. You've made what I thought would be a challenging project quite easy.

              Comment


              • #8
                You can also use

                Code:
                replace Comment=subinstr(Comment, proper(first_name) , "" , .)
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1

                Comment


                • #9
                  This may take a long time, but will remove any specified name from your Comment variable:
                  Code:
                  forvalues i=1/`=_N' {
                      forvalues j=1/`=_N' {
                          replace var3=subinstr(Comment[`i'], lower(first_name[`j']) , "" , .) in `i'
                          replace var3=subinstr(Comment[`i'], lower(last_name[`j']) , "" , .) in `i'
                       **and all other replace commands; add-- in `i' --at the end of the replace command
                          local ++j
                          }
                      local ++i
                      }
                  This loops through each Comment (i) and then loops through each name (j) to check whether it is in Comment. There is probably a much faster way to do this, but I can't think of it at the moment.
                  Stata/MP 14.1 (64-bit x86-64)
                  Revision 19 May 2016
                  Win 8.1

                  Comment


                  • #10
                    Here's a Mata-based approach, which avoids looping over observations, but which might not be quicker (and which could have errors, as I have not tested it).

                    Code:
                    preserve
                    // get list of nonduplicating lower-case names as a vector in Mata
                    keep first last middle
                    stack first last middle, into(name) clear
                    replace name = lower(name)
                    duplicates drop name, force
                    local N_names = _N
                    putmata name = name
                    restore
                    // Substitute "" for any name in comment.  Using subinword() will prevent names that are part of a
                    // word in the comment field from being removed.  Think "jan" and "janitor."
                    replace comment = lower(comment) 
                    forval i = 1/`N_names' {
                       mata: st_local("eachname", name[`i'])
                       replace comment = subinword (comment, "`eachname'", "", .)
                    }

                    Comment


                    • #11
                      Mike is correct, using subinword is a much preferred option to subinstr.
                      If I am reading Mike's code correctly, this will replace your Comment variable with all lowercase characters. If this is a problem, I think you can avoid this by removing the replace comment=lower(comment) line and using lower(), upper(), proper() within multiple replace comment for the last lines. Note that there should be no space between subinword and the open parentheses "(".
                      Stata/MP 14.1 (64-bit x86-64)
                      Revision 19 May 2016
                      Win 8.1

                      Comment


                      • #12
                        Sorry, I was away from my desk. I will attempt what you've given here as soon as I can and let you know what I was able to accomplish. Thanks.

                        Comment


                        • #13
                          I'm just now getting to the code Mike provided (thanks, sir), but I'm getting an error message that the syntax is invalid.

                          Comment


                          • #14
                            That really doesn't allow us to explain what you are doing wrong. Perhaps you typed something incorrectly. Perhaps your Stata is not up-to-date. There are other possibilities too. Where does the code fail? Which statement preceded the error?

                            Comment


                            • #15
                              Here's yet another approach that shows some techniques I like to use:

                              Code:
                              clear
                              input str6(first middle lastname)
                              "Robert" ""       "Picard"
                              "David"  ""       "Torres"
                              "Carole" "J."     "Wilson"
                              "John"   "Quincy" "Adams" 
                              end
                              gen initials = substr(first,1,1) + substr(middle,1,1) + substr(lastname,1,1)
                              save "students.dta", replace
                              
                              clear
                              input str44 comments
                              "John is a good student"                      
                              "The dog ate her homework according to Carole"
                              "Robert has a tendency to daydream"           
                              "Adams stole Torres' lunch money"             
                              "JQA said something snotty to the lunchlady"  
                              "The lunchlady is not mad at JQA."  
                              end
                              
                              * put the names and the comments side-by-side in memory
                              merge 1:1 _n using "students.dta", nogen
                              
                              * work on a copy of the original comments, add spaces to match on words
                              gen comments_fixed = " " + lower(comments) + " "
                              
                              * pad punctuation characters to match by word
                              replace comments_fixed = subinstr(comments_fixed,"'"," ' ",.)
                              replace comments_fixed = subinstr(comments_fixed,"."," . ",.)
                              replace comments_fixed = subinstr(comments_fixed,","," , ",.)
                              
                              * text is always easier to read when left justified (leftalign if from SSC)
                              leftalign
                              
                              * loop over each name 
                              gen id = _n if !mi(lastname)
                              sum id, meanonly
                              local n = r(N)
                              forvalues i = 1/`n' {
                                  foreach v of varlist first middle lastname initials {
                                      local s = `v'[`i']
                                      local s = " " + lower("`s'") + " "
                                      replace comments_fixed = subinstr(comments_fixed, "`s'", " ", .)
                                  }
                              }

                              Comment

                              Working...
                              X