Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • "Subtracting" one string variable from another string variable

    I'm cleaning up a government dataset that I need some help on. For the sake of simplicity, let's say the dataset looks like this:

    namejob1 is a string variable that has individual names and their job title

    namejob1
    Tina Lee Firefighter
    Border John Doe Patrol
    Trevor Secretary Jones
    Legislative Aide Samuel Adams-Perez
    Jane Smith Highway Police
    job2 is a string variable has individual job titles extracted from namejob1

    job2
    Firefighter
    Border Patrol
    Secretary
    Legislative Aide
    Highway Police
    I have over 1000 unique values (i.e., job titles) in job2 and the number of unique observations in namejob1 is over a million. namejob1 is messy overall with the job titles sitting in different places (i.e., beginning, middle, end, etc.) and some are even broken up (see, Border John Doe Patrol).

    I've been trying to figure out how to subtract job2 from namejob1 into a new variable so I get the desired_name3 (listed below). desired_name3 would be a new string variable that just has individual names (namejob1 - job2)

    desired_name3
    Tina Lee
    John Doe
    Trevor Jones
    Samuel Adams-Perez
    Jane Smith
    Any thoughts? I'm trawled the webs and Stack Overflow and I'm not finding anything helpful in subtracting string variables using another string variable as the substring (most are solutions for fixed substrings, like "vs." or parsing using symbols or spaces).

    Thanks so much.

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str35 namejob1 str16 job2
    "Tina Lee Firefighter"                "Firefighter"     
    "Border John Doe Patrol"              "Border Patrol"   
    "Trevor Secretary Jones"              "Secretary"       
    "Legislative Aide Samuel Adams-Perez" "Legislative Aide"
    "Jane Smith Highway Police"           "Highway Police"  
    end
    
    
    
    gen length= wordcount(job2)
    gen wanted= namejob1
    qui sum length
    forval i= 1/`r(max)'{
          replace wanted= subinstr(wanted,word(job2, `i' ),"",.)
    }
    Res.:

    Code:
    . l
    
         +----------------------------------------------------------------------------------------+
         |                            namejob1               job2   length                 wanted |
         |----------------------------------------------------------------------------------------|
      1. |                Tina Lee Firefighter        Firefighter        1              Tina Lee  |
      2. |              Border John Doe Patrol      Border Patrol        2              John Doe  |
      3. |              Trevor Secretary Jones          Secretary        1          Trevor  Jones |
      4. | Legislative Aide Samuel Adams-Perez   Legislative Aide        2     Samuel Adams-Perez |
      5. |           Jane Smith Highway Police     Highway Police        2           Jane Smith   |
         +----------------------------------------------------------------------------------------+

    Comment


    • #3
      Cross-posted at https://www.reddit.com/r/stata/comme...nother_string/

      Please note our policy on cross-posting, which is that you are asked to tell us about it. It's in the FAQ Advice.

      Comment


      • #4
        Originally posted by Andrew Musau View Post
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str35 namejob1 str16 job2
        "Tina Lee Firefighter" "Firefighter"
        "Border John Doe Patrol" "Border Patrol"
        "Trevor Secretary Jones" "Secretary"
        "Legislative Aide Samuel Adams-Perez" "Legislative Aide"
        "Jane Smith Highway Police" "Highway Police"
        end
        
        
        
        gen length= wordcount(job2)
        gen wanted= namejob1
        qui sum length
        forval i= 1/`r(max)'{
        replace wanted= subinstr(wanted,word(job2, `i' ),"",.)
        }
        Res.:

        Code:
        . l
        
        +----------------------------------------------------------------------------------------+
        | namejob1 job2 length wanted |
        |----------------------------------------------------------------------------------------|
        1. | Tina Lee Firefighter Firefighter 1 Tina Lee |
        2. | Border John Doe Patrol Border Patrol 2 John Doe |
        3. | Trevor Secretary Jones Secretary 1 Trevor Jones |
        4. | Legislative Aide Samuel Adams-Perez Legislative Aide 2 Samuel Adams-Perez |
        5. | Jane Smith Highway Police Highway Police 2 Jane Smith |
        +----------------------------------------------------------------------------------------+
        Hi Andrew, thanks so much for responding and helping out! This solution works really great and I appreciate the time you put into answering.

        Nick, thanks for going ahead and linking my question on the other thread. I wasn't aware of the cross posting policy when I initially posted my question, but I read through the FAQs with more detail and I'll make sure to follow it as I interact with the forum in the future.

        Comment

        Working...
        X