Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying word combinations in a string variable

    Hi. My dataset comprises of all string variables. The variable Event has mentions of countries along with other information for each observation. I want to create a new variable called Country that parses out from Event all countries mentioned in the local "country". A few things about the variable Event: poland; romania; spain"

    1. A single observation may carry more than one country. for e.g poland; romania; spain
    2. The word "outside" is common to many country entries. For example: worlwide outside japan, worldwide outside australasia and s e asia. If there a way for me to parse out an entire phrase that carries "outside" and is separated with a ; from the next phrase. For e.g in "outside the us and japan; allergy", I want the variable Country to only carry "outside the us and japan"

    Any help on this will be very much appreciated. Thank you



    Code:
    use "$datadir\Country_reshaped.dta", clear
    gen country=lower(CountryName)
    levelsof country, local(country)
    `"argentina"' `"australia"' `"austria"' `"belgium"' `"brazil"' `"canada"' `"chile"' `"china"' `"colombia"' `"denmark"' `"finland"
    > ' `"france"' `"germany"' `"greece"' `"hong kong"' `"india"' `"ireland"' `"israel"' `"italy"' `"japan"' `"luxembourg"' `"malaysi
    > a"' `"mexico"' `"netherlands"' `"new zealand"' `"norway"' `"peru"' `"philippines"' `"portugal"' `"russian federation"' `"south 
    > africa"' `"south korea"' `"spain"' `"sweden"' `"switzerland"' `"thailand"' `"turkey"' `"uk"' `"usa"' `"venezuela"'
    
    
    
    clear
    use "$datadir\KeyEvents_reshaped.dta", clear
    gen event=lower(EventDetails)
    
    clear
    input str60 event
    "preclinical"                                                                      
    "the us; analgesic, other"                                           
    "pain, neuropathic"                                          
    "new"                                                        
    "worldwide"  
    "poland; romania; spain"    
    "canada and russia; conjunctivitis, allergic"
    "nas; germany"
    "amersham health"                                            
    "the us, 20030731"                                           
    "the us, 20030630"                                           
    "the eu, follicular lymphoma"

  • #2
    If I understood right, yoi may use - split - with ; as the parse option. The, check the first variable for the ‘outside’ condition.
    Best regards,

    Marcos

    Comment


    • #3
      Hi Marcos. Thank you for your response. I'm not sure what you mean by checking the first variable for the 'outside' connection?

      Perhaps I wasn't clear in my initial message. My understanding of using - split- is that this would generate a new variable for each entry in Event that is separated by ;. This means that in the case of an entry like "poland; romania; spain" I will now have three new Event variables with each of this country names. Instead what I do want is just one new variable (let’s call it NewEvent) with the entire entry “poland; romania; spain". What I basically want to say is that if there is any mention on “c” in the local country (code above), then parse it into NewEvent. Is there anyway for me to do this?

      Comment


      • #4
        There are perhaps two questions mixed up together here. The one that looks more difficult at first sight is I think really easier, to strip out phrases starting with "outside" and ending with ";". The one that looks easier is I think much more difficult. Of your data example with 12 observations, about half are problematic for your list of countries as

        Code:
        the eu 
        the us 
        poland
        romania
        don't appear in your local macro. I am leaving that problem aside and show a solution to the "outside" question, which depends on installing moss from SSC. moss stands for "multiple occurrences of sub-strings" but the name was also intended as a allusion to the (British?) English proverb "a rolling stone gathers no moss".

        I've also cleaned up your helpful data example for wider use, as

        1. We have no access to your data file.

        2. So, we can't run
        levelsof to produce your local macro.

        3. The data example doesn't include an example with "outside" in the variable.


        Code:
        input str60 event
        "preclinical"                                                                      
        "the us; analgesic, other"                                           
        "pain, neuropathic"                                          
        "new"                                                        
        "worldwide"  
        "poland; romania; spain"    
        "canada and russia; conjunctivitis, allergic"
        "nas; germany"
        "amersham health"                                            
        "the us, 20030731"                                           
        "the us, 20030630"                                           
        "the eu, follicular lymphoma"
        "outside the us and japan; allergy"
        end 
        
        local country `"argentina"' `"australia"' `"austria"' `"belgium"' `"brazil"' ///
        `'"canada"' `"chile"' `"china"' `"colombia"' `"denmark"' `"finland"'         ///
        `"france"' `"germany"' `"greece"' `"hong kong"' `"india"' `"ireland"'        ///
        `"israel"' `"italy"' `"japan"' `"luxembourg"' `"malaysia"' `"mexico"'        ///
        `"netherlands"' `"new zealand"' `"norway"' `"peru"' `"philippines"'          ///
        `"portugal"' `"russian federation"' `"south africa"' `"south korea"'         ///
        `"spain"' `"sweden"' `"switzerland"' `"thailand"' `"turkey"' `"uk"' `"usa"' `"venezuela"' 
        
        ssc install moss 
        moss event, match("(outside.*;)") regex 
        
        list 
        
             +------------------------------------------------------------------------------------------+
             |                                       event   _count                     _match1   _pos1 |
             |------------------------------------------------------------------------------------------|
          1. |                                 preclinical        0                                   . |
          2. |                    the us; analgesic, other        0                                   . |
          3. |                           pain, neuropathic        0                                   . |
          4. |                                         new        0                                   . |
          5. |                                   worldwide        0                                   . |
             |------------------------------------------------------------------------------------------|
          6. |                      poland; romania; spain        0                                   . |
          7. | canada and russia; conjunctivitis, allergic        0                                   . |
          8. |                                nas; germany        0                                   . |
          9. |                             amersham health        0                                   . |
         10. |                            the us, 20030731        0                                   . |
             |------------------------------------------------------------------------------------------|
         11. |                            the us, 20030630        0                                   . |
         12. |                 the eu, follicular lymphoma        0                                   . |
         13. |           outside the us and japan; allergy        1   outside the us and japan;       1 |
             +------------------------------------------------------------------------------------------+
        Code:
        
        


        You can zap the semi-colon from the match variable if you don't want it.


        Comment


        • #5
          Hi Nick. Thanks for that. I had never heard of the command - moss -. And as someone who's originally from a British colony, that proverb definitely rings familiar.

          On the first problem, where the local does not comprise all countries, you are very right on that. That I can fix by building a local of all countries in the world and some special cases such as eu, us, asia, etc. The local below reflects some of these additions.

          Coming to moss: I've listed below a data example with examples with "outside" in the variable. On running the moss code below, I got a _count variable with only 0 and missing values. None of the outside observations were identified and I'm not sure why. Could you please help me here? I initially thought it might be because some strings have words preceeding "outside", but even strings like "outside japan" weren't extracted.

          Also, just looking at how the structure of the data is not standardized across entries, I was wondering if it were possible to do the following instead:
          1. extract all strings with the word outside. With ; being the identifier for separate categories of information entered in the same cell, such as "outside the us and japan; allergy"
          2. from the extracted strings, retain only those strings (in their entirety) if there is the mention of least one country from the local

          Many thanks for you help. This forum is a lifesaver for novices like me!

          Code:
          local country `"argentina"' `"australia"' `"austria"' `"belgium"' `"brazil"'       ///
          > `'"canada"' `"chile"' `"china"' `"colombia"' `"denmark"' `"finland"'             ///
          > `"france"' `"germany"' `"greece"' `"hong kong"' `"india"' `"ireland"'          ///
          > `"israel"' `"italy"' `"japan"' `"luxembourg"' `"malaysia"' `"mexico"'            ///
          > `"netherlands"' `"new zealand"' `"norway"' `"peru"' `"philippines"'           ///
          > `"portugal"' `"russian federation"' `"south africa"' `"south korea"'            ///
          > `"spain"' `"sweden"' `"switzerland"' `"thailand"' `"turkey"' `"uk"' `"usa"'    ///
          > `"venezuela"' `"uk"' `"us"' `"worldwide"' `"asia"' `"australasia"' `"ec"'       ///
          > `"europe"'
            
           moss event, match("(outside.*;)") regex  
          - Data example
          
          clear
          input str60 event
          "the us, 20030731"                                           
          "the us, 20030630"                                           
          "the eu, follicular lymphoma"     
          "territories outside canada, the us and mexico"
          "outside japan"
          "countries outside of europe"
          "worldwide outside japan"
          "worldwide outside the us"
          "worldwide outside australasia and s e asia"
          "worldwide outside asia"
          "worldwide outside s korea"
          "outside the us"
          "worldwide outside france"
          "additionally outside the us"
          "additionally outside the ec"
          "outside europe"
          "various countries outside japan"
          "australia and japan, outside the us and western europe"
          "outside austria, ireland, asia, switzerland and germany"

          Comment


          • #6
            In #1 you explained your need as


            parse out an entire phrase that carries "outside" and is separated with a
            and none of the examples you cite qualify. as no semi-colon is present. These rule catches all your new examples:


            Code:
            clear
            input str60 event
            "the us, 20030731"                                           
            "the us, 20030630"                                           
            "the eu, follicular lymphoma"     
            "territories outside canada, the us and mexico"
            "outside japan"
            "countries outside of europe"
            "worldwide outside japan"
            "worldwide outside the us"
            "worldwide outside australasia and s e asia"
            "worldwide outside asia"
            "worldwide outside s korea"
            "outside the us"
            "worldwide outside france"
            "additionally outside the us"
            "additionally outside the ec"
            "outside europe"
            "various countries outside japan"
            "australia and japan, outside the us and western europe"
            "outside austria, ireland, asia, switzerland and germany"
            end 
            
            moss event, match("(outside.*$)") regex 
            
            . l _* if _count
            
                 +--------------------------------------------------------------------------+
                 | _count                                                   _match1   _pos1 |
                 |--------------------------------------------------------------------------|
              4. |      1                         outside canada, the us and mexico      13 |
              5. |      1                                             outside japan       1 |
              6. |      1                                         outside of europe      11 |
              7. |      1                                             outside japan      11 |
              8. |      1                                            outside the us      11 |
                 |--------------------------------------------------------------------------|
              9. |      1                          outside australasia and s e asia      11 |
             10. |      1                                              outside asia      11 |
             11. |      1                                           outside s korea      11 |
             12. |      1                                            outside the us       1 |
             13. |      1                                            outside france      11 |
                 |--------------------------------------------------------------------------|
             14. |      1                                            outside the us      14 |
             15. |      1                                            outside the ec      14 |
             16. |      1                                            outside europe       1 |
             17. |      1                                             outside japan      19 |
             18. |      1                         outside the us and western europe      22 |
                 |--------------------------------------------------------------------------|
             19. |      1   outside austria, ireland, asia, switzerland and germany       1 |
                 +--------------------------------------------------------------------------+
            
            . l event if _count == 0
            
                 +-----------------------------+
                 |                       event |
                 |-----------------------------|
              1. |            the us, 20030731 |
              2. |            the us, 20030630 |
              3. | the eu, follicular lymphoma |
                 +-----------------------------+
            Unfortunately it's no surprise that a combination of rules is needed for anarchic text data.

            Comment


            • #7
              Thanks, Nick. That worked well. If I wanted to capture the entire phrase, so "worldwide outside japan" and not just "outside japan" what change would I have to make here? And if I had "outside the us and japan; allergy" or "allergy; outside the us and japan" and just wanted "outside the us and japan" from this.

              I apologise for being such a bother. Working with strings is very new to me.
              Last edited by Karishma DSouza; 18 Jul 2020, 09:59.

              Comment


              • #8
                Compound rules can help. Suppose that e phrase may end either with a semi-colon or with the end of a string.

                Although rules can become arbitrarily complicated I can't offer to devise all such. I am not an expert on regular expressions. More importantly, there are people here much more fluent than i am.

                Code:
                clear
                input str42 whatever 
                "frog; toad"
                "newt; frog"
                "frogs are interesting; toads are not" 
                "dragons; griffins; wyverns"
                end 
                
                moss whatever, match("(frog.*;|frog.*$)") regex 
                
                list if _count
                
                
                    +--------------------------------------------------------------------------------+
                     |                             whatever   _count                  _match1   _pos1 |
                     |--------------------------------------------------------------------------------|
                  1. |                           frog; toad        1                    frog;       1 |
                  2. |                           newt; frog        1                     frog       7 |
                  3. | frogs are interesting; toads are not        1   frogs are interesting;       1 |
                     +--------------------------------------------------------------------------------+


                Comment

                Working...
                X