  • Keep if variable contains string, then keep only that string

    Hello,

    I am using Stata 15.1, working with data that looks like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str22 ALFORD113 str17 BARBER110 str41(BAXLEY121 BEATTY103) str20 BREEDEN106         
    "16th Cir. CPNJ/PCR" "Richland GS"     "Sitting with Milling Sumter GS 26, 27, 28" "Florence CP"    "Horry CP"            
    "16th Cir. CPNJ/PCR" "Richland GS"     "Sitting with Milling Sumter GS 26, 27, 28" "Florence CP"    "Horry CP"            
    "16th Cir. CPNJ/PCR" "Richland GS"     "Sitting with Milling Sumter GS 26, 27, 28" "Florence CP"    "Horry CP"            
    "York GS"            "Richland CP"     "Charleston CP"                             "7th Cir. CPNJ"  "Marlboro GS"         
    "York GS"            "Richland CP"     "Charleston CP"                             "7th Cir. CPNJ"  "Marlboro GS"         
    "-x-"                "Richland CP"     "Charleston CP"                             "Spartanburg GS" "Horry CP"            
    "-x-"                "Richland CP"     "Charleston CP"                             "Spartanburg GS" "Horry CP"            
    "-x-"                "Richland CP"     "Charleston CP"                             "Spartanburg GS" "Horry CP"            
    "-x-"                "Richland CP"     "Charleston CP"                             "Spartanburg GS" "Horry CP"            
    "-x-"                "Richland CP"     "Charleston CP"                             "Spartanburg GS" "Horry CP"            
    "York GS"            ""                "Newberry GS"                               ""               ""                    
    "York GS"            ""                "Newberry GS"                               ""               ""                    
    "York GS"            ""                "Newberry GS"                               ""               ""                    
    end
    I would like to keep only the values that contain county names - Richland, Sumter, Florence, Horry, Charleston, Marlboro, Spartanburg, and Newberry. I believe this could be accomplished using the strpos() function, but I am struggling to write the correct loop around it.

    Once that is accomplished, I would like to keep only the county name within each value. For example, "Spartanburg GS" would become "Spartanburg" and "Sitting with Milling Sumter GS 26, 27, 28" would become "Sumter".

    Thank you.

  • #2
    Perhaps this will start you in a useful direction?
    Code:
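    foreach v of varlist ALFORD113-BREEDEN106 {
        * if the value contains a county name, replace the whole value with just that name
        replace `v' = ustrregexrf(`v',".*(Richland|Sumter|Florence|Horry|Charleston|Marlboro|Spartanburg|Newberry).*","$1")
        * blank out any value that still does not contain a county name
        replace `v' = "" if !ustrregexm(`v',"(Richland|Sumter|Florence|Horry|Charleston|Marlboro|Spartanburg|Newberry)")
    }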
    list, clean abbreviate(16)
    Code:
    . list, clean abbreviate(16)
    
           ALFORD113   BARBER110    BAXLEY121     BEATTY103   BREEDEN106  
      1.                Richland       Sumter      Florence        Horry  
      2.                Richland       Sumter      Florence        Horry  
      3.                Richland       Sumter      Florence        Horry  
      4.                Richland   Charleston                   Marlboro  
      5.                Richland   Charleston                   Marlboro  
      6.                Richland   Charleston   Spartanburg        Horry  
      7.                Richland   Charleston   Spartanburg        Horry  
      8.                Richland   Charleston   Spartanburg        Horry  
      9.                Richland   Charleston   Spartanburg        Horry  
     10.                Richland   Charleston   Spartanburg        Horry  
     11.                             Newberry                             
     12.                             Newberry                             
     13.                             Newberry
    With that said, I readily admit that this is opaque if you don't have previous experience with regular expressions. If you do, the real benefit of Stata's Unicode regular expression functions is that they support a much more powerful regular expression syntax. To the best of my knowledge, the only place where it is documented that Stata's new regular expression parser is the ICU regular expression engine is the Statalist post linked here; the engine itself is documented at http://userguide.icu-project.org/strings/regexp.
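
    As a small illustration of what the ICU syntax allows, here is a sketch that uses the word-boundary anchor \b together with ustrregexs(1) to copy the matched county into a new variable rather than overwriting the original. The variable name county_BAXLEY121 and the three-row toy dataset are made up for this example, and the pattern assumes county names always appear as whole words.

    ```stata
    * Toy data modeled on the example in post #1 (illustrative only)
    clear
    input str41 BAXLEY121
    "Sitting with Milling Sumter GS 26, 27, 28"
    "Charleston CP"
    "16th Cir. CPNJ/PCR"
    end

    * alternation of the county names, kept in a local so it is written only once
    local counties "Richland|Sumter|Florence|Horry|Charleston|Marlboro|Spartanburg|Newberry"

    * \b is a word boundary, so a name must match as a whole word;
    * ustrregexs(1) returns capture group 1 from the ustrregexm() match
    * in the same command; rows with no match are left as ""
    generate county_BAXLEY121 = ustrregexs(1) if ustrregexm(BAXLEY121, "\b(`counties')\b")

    list, clean
    ```

    Writing to a new variable keeps the original text available for checking the result before anything is overwritten.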

    • #3
      Wow, that worked perfectly. Thank you so much.

      • #4
        I've felt guilty about supplying code that depends on understanding of regular expressions, so here is the alternative that I couldn't think of when I wrote the code in post #2. It's easily understood with basic Stata knowledge, which may help those who find this topic later and want to adapt the solution to meet their particular needs.
        Code:
        foreach v of varlist ALFORD113-BREEDEN106 {
            quietly generate want = ""
            foreach c in Richland Sumter Florence Horry Charleston Marlboro Spartanburg Newberry {
                quietly replace want = "`c'" if strpos(`v',"`c'")!=0
            }
            quietly replace `v' = want
            drop want
        }
        list, clean abbreviate(16)
        Code:
        . list, clean abbreviate(16)
        
               ALFORD113   BARBER110    BAXLEY121     BEATTY103   BREEDEN106  
          1.                Richland       Sumter      Florence        Horry  
          2.                Richland       Sumter      Florence        Horry  
          3.                Richland       Sumter      Florence        Horry  
          4.                Richland   Charleston                   Marlboro  
          5.                Richland   Charleston                   Marlboro  
          6.                Richland   Charleston   Spartanburg        Horry  
          7.                Richland   Charleston   Spartanburg        Horry  
          8.                Richland   Charleston   Spartanburg        Horry  
          9.                Richland   Charleston   Spartanburg        Horry  
         10.                Richland   Charleston   Spartanburg        Horry  
         11.                             Newberry                            
         12.                             Newberry                            
         13.                             Newberry
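
        For anyone adapting the loop in this post, it may help to see what strpos() returns on its own: the position of the first occurrence of the second string within the first, or 0 if it does not occur, which is why the loop tests for !=0. The sketch below checks this on values from post #1. The last part uses a hypothetical value containing two county names (not present in the example data) to show that, with this loop design, the name listed later would win.

        ```stata
        * strpos(s1, s2): position of the first occurrence of s2 in s1, or 0 if absent
        display strpos("Spartanburg GS", "Spartanburg")
        display strpos("Sitting with Milling Sumter GS 26, 27, 28", "Sumter")
        display strpos("16th Cir. CPNJ/PCR", "Sumter")

        * Hypothetical value with two county names: each match overwrites the
        * previous one, so the name that comes last in the list is kept
        local s "Richland and Sumter GS"
        local want ""
        foreach c in Richland Sumter {
            if strpos("`s'", "`c'") != 0 local want "`c'"
        }
        display "`want'"
        ```

        If values with more than one county name can occur in your data, decide which name should take priority and order the list accordingly.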
