  • Testing to see if the value of categorical variable x for string m contains the value of variable y for string m.

    Hello Statalist,

    Please forgive me if this is a stupid question; I am new to Stata. I am working with a dataset that has one variable listing the names of businesses and another listing the names of the towns in which the businesses are established. I want to test the business names for whether or not they contain the name of their respective towns. For example, if the name of the business is "Corinth Lumber Co." and the business is in Corinth, I want that to come back positive. Furthermore, I would like to create a binary variable on the same dataset to house the results, with 1 for positive and 0 for negative. Any advice for how I would go about doing this? I really don't know where to start.

  • #2
    Daniel:
    welcome to this forum.
    Do you mean something along the following lines?:
    Code:
    . set obs 1
    number of observations (_N) was 0, now 1
    
    . g firm="Corinth Lumber Co"
    
    . g town="Corinth"
    
    . split firm , p()
    variables created as string:
    firm1  firm2  firm3
    
    . g counter=1 if firm1==town
    
    . drop firm2 firm3
    
    . list
    
         +-------------------------------------------------+
         |              firm      town     firm1   counter |
         |-------------------------------------------------|
      1. | Corinth Lumber Co   Corinth   Corinth         1 |
         +-------------------------------------------------+
    
    .
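
    The original question asked for a 0/1 indicator, whereas the listing above leaves counter missing when there is no match. A minimal sketch of a 0/1 variant of the same first-word idea, assuming the town is always a single word at the start of the firm name (the second observation is made up for illustration):

    ```stata
    clear
    input str19 firm str8 town
    "Corinth Lumber Co" "Corinth"
    "Acme Hardware"     "Corinth"
    end

    * word(s, 1) returns the first blank-delimited word of s;
    * the logical expression evaluates to 1 if true and 0 if false
    generate byte counter = word(firm, 1) == town

    list
    ```

    Generating from the logical expression directly avoids the missing values that "= 1 if ..." produces, and word() avoids creating and dropping the extra variables from split.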
    Last edited by Carlo Lazzaro; 09 Sep 2020, 02:05.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str19 firm str8 place
      "Corinth Coffee Cafe" "Corinth" 
      "Lloyds of London"    "London"  
      "Not informative"     "Wherever"
      end
      
      gen wanted = strpos(firm, place) > 0
      
      list
      
           +-----------------------------------------+
           |                firm      place   wanted |
           |-----------------------------------------|
        1. | Corinth Coffee Cafe    Corinth        1 |
        2. |    Lloyds of London     London        1 |
        3. |     Not informative   Wherever        0 |
           +-----------------------------------------+
      
      .
      Warnings: The match has to be exact. "LONDON" is not matched by "London". Use e.g. lower() to standardise if need be.
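
      A minimal sketch of that standardisation, assuming plain ASCII names (the upper- and lower-case place values are changed from the example above purely for illustration):

      ```stata
      clear
      input str19 firm str8 place
      "Corinth Coffee Cafe" "CORINTH"
      "Lloyds of London"    "london"
      "Not informative"     "Wherever"
      end

      * lower() is applied to both arguments so that case alone cannot block a match
      gen wanted = strpos(lower(firm), lower(place)) > 0

      list
      ```

      If the names may contain accented or other non-ASCII letters, ustrlower() is probably the safer choice, since lower() only changes ASCII letters.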

      Watch out for "London" being a match for "Londonderry". @Carlo Lazzaro's approach based on words is a good idea for that, but will fall over for "Los Angeles" and other multi-word placenames.
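
      One possible middle ground, sketched here under the assumption that names are separated from their neighbours by blanks only (no commas or full stops touching the place name), is to pad both strings with a blank on each side before calling strpos():

      ```stata
      clear
      input str21 firm str11 place
      "Lloyds of London"      "London"
      "Londonderry Fashion"   "London"
      "The Los Angeles Times" "Los Angeles"
      end

      * with the padding, a match must begin and end at a blank
      * (or at the start or end of the firm name)
      gen wanted = strpos(" " + firm + " ", " " + place + " ") > 0

      list
      ```

      This rejects "Londonderry" for "London" and still accepts "Los Angeles", but it would miss a name followed directly by punctuation, such as "Azusa," in a list.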

      Watch for this

      Code:
      . di strpos("frog", "")
      1
      strpos() will always find an empty string in a larger string.
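
      A minimal sketch of a guard against that, assuming an empty place should count as "no match" (the second observation is made up for illustration):

      ```stata
      clear
      input str19 firm str8 place
      "Corinth Coffee Cafe" "Corinth"
      "Gnxl"                ""
      end

      * require a non-empty place as well as a positive position
      gen wanted = strpos(firm, place) > 0 & place != ""

      list
      ```

      If an empty place should instead leave the indicator missing, the same expression can be written with a qualifier, as in: gen wanted = strpos(firm, place) > 0 if place != ""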



      • #4
        Stata's Unicode regular expression matching provides another approach that takes into account case differences and substring matches but, as with strpos(), it will always match an empty string within a larger string. The key to success is that "\b" matches a "word boundary", which means what you intuitively believe it to mean.
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str29 firm str11 place
        "Corinth Coffee Cafe"           "Corinth" 
        "Lloyds of London"              "London"  
        "Londonderry Fashion"           "London"  
        "Anaheim, Azusa, and Cucamonga" "AZUSA"   
        "The Los Angeles Times"         "Los Angeles"
        "Not informative"               "Wherever"
        "Gnxl"                          ""
        end
        generate found = ustrregexm(firm,"\b"+place+"\b",1)
        list, clean noobs
        Code:
        . list, clean noobs
        
                                     firm         place   found  
                      Corinth Coffee Cafe       Corinth       1  
                         Lloyds of London        London       1  
                      Londonderry Fashion        London       0  
            Anaheim, Azusa, and Cucamonga         AZUSA       1  
                    The Los Angeles Times   Los Angeles       1  
                          Not informative      Wherever       0  
                                     Gnxl                     1
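
        The last line of the listing shows the empty-string match. A hedged sketch of one way to exclude it, assuming an empty place should be recorded as 0:

        ```stata
        clear
        input str29 firm str11 place
        "Lloyds of London"    "London"
        "Londonderry Fashion" "London"
        "Gnxl"                ""
        end

        * same pattern as above; the extra condition rules out an empty place
        generate found = ustrregexm(firm, "\b" + place + "\b", 1) & place != ""

        list, clean noobs
        ```

        Note also that place is used as a regular expression here, so a place name containing characters such as "." or "(" would be interpreted as a pattern rather than matched literally.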
        To the best of my knowledge, the Statalist post linked here is the only place where it is documented that Stata's Unicode regular expression parser is the ICU regular expression engine, which is described at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.
